I haven’t used ColdFusion to search websites since something like 2001, back when Verity could output WDDX and could index current PDFs. Searching with CF lost favor, and I mostly deployed sites that were supported by a search engine outside of my scope, or which did not require one. So, I’m rebuilding the skillset as well.

According to the CF9 release notes, Solr is a step up from Verity (how can it not be? The version of Verity in CF is ancient) in the following ways:

  • XML/HTTP Interfaces
  • Loose schema to define types and fields
  • Web Administration Interface
  • Extensive Caching
  • Index Replication
  • Extensible Open Architecture
  • Written in Java5, deployable as a WAR
  • Support for stemming
  • Support for MS Office 2007 file formats

Solr is a full-text search engine based on Lucene. The ColdFusion installer automatically creates the ColdFusion 9 Solr service, which contains the Solr web application. On UNIX and Linux, you need to start and stop Solr yourself using its shell script.

Now, I personally need better PDF support. Since it wasn’t listed above, I decided to try it out. Just to see.

Creating and Searching Solr collections in ColdFusion

This section is review. It focuses on how to create search indexes (called collections in CF) using both the CF Administrator and tags in your own code (cfcollection, cfindex, cfsearch).

Creating and indexing Solr Collections in ColdFusion Administrator

  1. Log in to CF administrator on your CF9 server.
  2. Under Data and Services, select “ColdFusion Collections”
  3. At the top of the page, begin creating your collection.
    1. type a name for your collection, and
    2. a path where the data files should be stored (I use c:\coldfusion9\collections; this folder does not need to be inside of your web space, though it can be).
    3. At the bottom of the form, before the list of collections, choose “Solr Collection” (for now we will skip the language and category; we’ll come back to these in a later post)
    4. Click “Create Collection”.
  4. Moving down this CF Admin page, note the four icons beside your new collection in the collections list.  They are in order: index, optimize, purge and delete.

Your collection is created. Of course it’s empty and useless right now, but we have a framework into which we put our indexed collection.

Indexing a Collection using CF Administrator

This assumes you are still logged in to the CF Administrator and have just created the collection. If you aren’t, you can catch up using steps 1 and 2 above.

  1. In the collections list, find your collection.  Click on the first icon on the left, “index”. This is where the magic gets done.
  2. The page you reach is called the “manage collection” page, and it offers three forms: one for indexing collections, one for aliasing them (a more advanced topic if I ever saw one) and one for renaming collections. Our focus is the topmost form.
    1. Enter the filetypes to be indexed. CF indicates that the possibilities are endless and provides some catch-all filetypes like *. (files with no extension) and .* (all files). Since my site has a lot of XML files and such that I don’t want to index, I chose the following: “.htm, .html, .cfm, .cfml, .pdf, .doc, .odt, .docx, .xls, .xlsx”. You should review your own file types and try what you think will work.
    2. Enter or choose a path. This is the path that contains the files you want indexed, not the storage path.
    3. If your files are NOT all in the same directory and you chose a parent directory, check “Recursively index subdirectories”. Otherwise, ignore the rest of the form.
    4. Click “Submit”
  3. Wait. A note to those who develop on a local machine: this can be a massive request, and a timeout could occur. It’s not likely to happen on production, but I’ve seen it on large collections on test and staging servers also. If it does, well, you’re playing around. Make the indexing job smaller, either by limiting the directories or by limiting the filetypes.

Boom. Indexed collection you can search to your heart’s desire. Just remember, it only indexed the filetypes and directories you submitted.

Searching your indexed collection

Once you have created and indexed your collection, you are ready to search. This is handled with a ColdFusion tag called cfsearch in a standard cfm file. For simplicity’s sake, and because we all know how to make a form and pass variables, I’m just going to stick my search criteria in my sample code:

<cfsearch name="search_results" collection="miki_test" criteria="grim" maxrows="300">

The cfsearch is the easy part. I use name just like in cfquery, I tell it which collection to search, and I put my criteria in. As I’m searching digital fiction, grim seemed a good search term. It is recommended that you set a maxrows on these searches lest you return zillions of results.

This creates a recordset much like any query with the following columns:
Key, Rank, RecordsSearched, Score, Size, Summary, Title, Type, URL

In addition, if we did everything described above, the recordset will have the following columns, but they will be empty:

Author, Category, CategoryTree, Context, Custom1, Custom2, Custom3, Custom4 (these columns become relevant with more advanced searches, but for our basic search, they are empty).

A sample record follows (I searched a collection containing digital books in HTML, TXT and PDF formats):
Key: C:\ColdFusion9\wwwroot\folklore_stuffies\folklore\Northern_mythologyv2.txt
Rank: 1
RecordsSearched: 64
Score: 3.9678988
Size: 590469
Summary: Google This is a digital copy of a book that was preserved for generations on library shelves before it was carefully scanned by Google as part of a project to make the world’s books discoverable online. It has survived long enough for the copyright to expire and the book to enter th
Title: Northern_mythologyv2.txt
Type: text/plain
URL: /Northern_mythologyv2.txt

As far as search results go, I’m not particularly keen on it. For one thing, the summary does not focus on the places where my search terms were found; it’s just the opening text of the document. The type is fine for text documents, but everything from an .otc to a .pdf is given a “type” of application/octet-stream. Since I want to let folks know what kind of document they’re looking at, this is useless information. Lastly, the URL is an absolute URL from the base of my search (it starts with a slash), so in order to actually direct folks to my page, I have to strip that leading slash.

Here is how I handled it.

<cfsearch name="search_results" collection="miki_test" criteria="grim" maxrows="300">
<h2>Results</h2>
<table>
  <tr>
    <th scope="col">Rank</th><th scope="col">FileSize</th><th scope="col">Link</th><th scope="col">Text</th>
  </tr>
  <cfoutput query="search_results">
  <tr>
    <td>#rank#</td>
    <td>
      <img src="images/#replacenocase(right(title,4),'.','','ALL')#_tiny.gif" alt="#replacenocase(right(title,4),'.','','ALL')# file">
      <!--- size is in bytes: show K below 1000 KB, otherwise M --->
      <cfif size/1024 lt 1000>#int(size/1024)#K<cfelse>#int(size/1024/1024)#M</cfif>
    </td>
    <td>
      <!--- drop the leading slash from the URL column and the file extension from the title --->
      <cfset my_url = replacenocase(search_results.url,'/','','ONE')>
      <cfset my_title = rereplacenocase(title,"\.(pdf|docx?|txt)$","","ALL")>
      <a href="#my_url#">#my_title#</a>
    </td>
    <td>#summary#</td>
  </tr>
  </cfoutput>
</table>

What I’ve done here is take what seem to be the most relevant columns and try to build a search results page. I’ve used the file extension to guess the file type and dropped that straight into an image tag. This will work if you are sure that you have an image for every filetype that comes out; alternatively, you can use a cfswitch with a default case that points to unknown_tiny.gif (an icon of a page with a question mark on it). Next to the filetype is the file size. The results come back in bytes, so in order to get KB and MB you have to calculate them. The cfif in the size cell displays kilobytes if the calculated KB is less than 1000; otherwise it displays MB. The int function drops the fractional part, so the value is rounded down to a whole number. I did this because I had ridiculously long sizes being calculated.
I calculate the URL to put in the anchor by removing the leading /, and this works for my needs. Your mileage may vary. I take the title field and ditch the file extension using a regular expression, link it using the calculated URL, and finally display the summary.
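
The cfswitch alternative mentioned above could look roughly like this. It is a sketch, not the code used on this site: the list of extensions with icons is an assumption, and the variable names ext and icon are made up for illustration.

```cfml
<cfsearch name="search_results" collection="miki_test" criteria="grim" maxrows="300">

<cfoutput query="search_results">
  <!--- take everything after the last dot in the title as the extension --->
  <cfset ext = lcase(listLast(title, "."))>

  <cfswitch expression="#ext#">
    <!--- assumed: an icon named <ext>_tiny.gif exists for each of these --->
    <cfcase value="pdf,doc,docx,txt,htm,html">
      <cfset icon = ext & "_tiny.gif">
    </cfcase>
    <!--- anything else gets the question-mark page icon --->
    <cfdefaultcase>
      <cfset icon = "unknown_tiny.gif">
    </cfdefaultcase>
  </cfswitch>

  <img src="images/#icon#" alt="#ext# file"> #title#<br>
</cfoutput>
```

Using listLast with a dot delimiter instead of right(title,4) also copes with extensions that are not exactly three or four characters long.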

This gives me something useful until I can actually gather and search using more metadata than I have available to me straight from the file.
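
On the complaint that the summary ignores where the search terms were found: cfsearch has context attributes that may help. The following is an untested sketch. It assumes that contextpassages, contextbytes, contexthighlightbegin and contexthighlightend are honored for Solr collections the way the cfsearch documentation describes, and that they fill the otherwise empty Context column.

```cfml
<!--- Sketch only: assumes the context attributes work with a Solr collection --->
<cfsearch name="search_results" collection="miki_test" criteria="grim" maxrows="300"
          contextpassages="2" contextbytes="300"
          contexthighlightbegin="<strong>" contexthighlightend="</strong>">

<cfoutput query="search_results">
  <p>
    #title#<br>
    <!--- fall back to the plain summary when no context came back --->
    <cfif len(trim(context))>#context#<cfelse>#summary#</cfif>
  </p>
</cfoutput>
```

If the Context column still comes back empty, the fallback keeps the page showing the same summary as before.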

Creating and indexing your collection using ColdFusion Tags

Since I promised a full write-up, here is how to create and index your collection using CF tags.

Creating

<cfcollection action="create" collection="miki_test" engine="solr" path="c:\coldfusion9\collections">

These are equivalent to the fields I had you enter above.
action is pretty self-explanatory, except that create can only be used once for a given collection. If the collection already exists, you must delete it before creating it again.
collection is the same as the name you entered in the CF Admin.
engine is how you specify whether to use Solr or Verity.
path is, again, the path where you want the collection stored. It does not have to be inside the web root.
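
Because create cannot be repeated for an existing collection, one way to make a setup page safe to reload is to list the collections first and only create the one that is missing. This is a sketch under the assumption that action="list" without an engine attribute includes Solr collections in its result; the query name collections is arbitrary.

```cfml
<!--- list the registered collections into a query named "collections" --->
<cfcollection action="list" name="collections">

<!--- only create miki_test when no collection of that name exists yet --->
<cfif not listFindNoCase(valueList(collections.name), "miki_test")>
  <cfcollection action="create" collection="miki_test" engine="solr" path="c:\coldfusion9\collections">
</cfif>
```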

Indexing
In theory, the following should work. I have not been able to make it work yet: it currently loads a blank page and does not appear to index anything, though it should.

<cfindex action="update" collection="miki_test" type="path" key="c:\coldfusion9\wwwroot\folklore_stuffies\folklore">
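
The admin form asked for file extensions and whether to recurse into subdirectories; cfindex has extensions and recurse attributes for the same purpose. The sketch below adds them to the call above. The extension list is an assumption about what the folder holds, and I have not confirmed that these attributes resolve the empty index described here.

```cfml
<!--- Sketch: same call as above plus the two options set on the admin form --->
<cfindex action="update" collection="miki_test" type="path"
         key="c:\coldfusion9\wwwroot\folklore_stuffies\folklore"
         extensions=".htm, .html, .txt, .pdf"
         recurse="yes">

<!--- cfindex renders nothing itself, so print something to show the request finished --->
<cfoutput>Index request for miki_test finished at #timeFormat(now(), "HH:mm:ss")#</cfoutput>
```

A blank page on its own is not a sign of failure, since cfindex produces no output. Checking the document count for the collection in the CF Administrator is a more reliable test.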

Oh, and by the way, this little exercise did answer my question: Solr handles modern PDFs particularly well, even if the search results do leave something to be desired.