By paddlo

Solr. An exercise in relearning Search in ColdFusion

I haven’t used ColdFusion to search websites since something like 2001, back when Verity could output WDDX and could index current PDFs. Searching with CF lost favor, and I mostly deployed sites that were supported by a search engine outside of my scope, or that did not require one. So I’m rebuilding the skillset as well.

According to the CF9 release notes, Solr is a step up from Verity (how can it not be? The version of Verity in CF is ancient) in the following ways:

  • XML/HTTP Interfaces
  • Loose schema to define types and fields
  • Web Administration Interface
  • Extensive Caching
  • Index Replication
  • Extensible Open Architecture
  • Written in Java5, deployable as a WAR
  • Support for stemming
  • Support for MS Office 2007 file formats

Solr is a full-text search engine based on Lucene. The ColdFusion installer automatically creates the ColdFusion 9 Solr service, which contains the Solr web application. On UNIX and Linux, you need to start and stop Solr using its shell script.

Now, I personally need better PDF support. Since it wasn’t listed above, I decided to try it out. Just to see.

Creating and Searching Solr collections in ColdFusion

This section is review. It focuses on how to create search indexes (called collections in CF) using both the CF Administrator and tags in code (cfcollection, cfindex, cfsearch).

Creating and indexing Solr Collections in ColdFusion Administrator

  1. Log in to CF administrator on your CF9 server.
  2. Under Data and Services, select “ColdFusion Collections”
  3. At the top of the page, begin creating your collection.
    1. type a name for your collection, and
    2. a path where the data files should be stored (I use c:\coldfusion9\collections; this directory does not need to be inside of your web space, though it can be).
    3. At the bottom of the form, before the list of collections, choose “Solr Collection” (for now we will skip the language and category, we’ll come back to this in a later post)
    4. Click “Create Collection”.
  4. Moving down this CF Admin page, note the four icons beside your new collection in the collections list.  They are in order: index, optimize, purge and delete.

Your collection is created. Of course it’s empty and useless right now, but we have a framework into which we put our indexed collection.

Indexing a Collection using CF Administrator

This assumes you are still logged in to the CF Administrator and have just created the collection. If you aren’t, you can catch up using steps 1 and 2 above.

  1. In the collections list, find your collection.  Click on the first icon on the left, “index”. This is where the magic gets done.
  2. The page you reach is called the “manage collection” page and it offers three forms: one for indexing collections, one for aliasing them (a more advanced topic if I ever saw one) and one for renaming collections. Our focus is the topmost form.
    1. Enter the filetypes to be indexed. CF indicates that the possibilities are endless and provides some catch-all filetypes like *. (files with no extension) and .* (all files). Since my site has a lot of xml files and such that I don’t want to index, I chose the following: “.htm, .html, .cfm, .cfml, .pdf, .doc, .odt, .docx, .xls, .xlsx”. You should review your own file types and try what you think will work.
    2. Enter or choose a path. This is the path that contains the files you want indexed, not the storage path.
    3. If all your files are NOT in the same directory and you chose a parent directory, check “Recursively index subdirectories”. Otherwise, ignore the rest of the form.
    4. Click “Submit”
  3. Wait. A note to those who develop on a local machine: this can be a massive request and a timeout could occur. It’s not likely to happen on production, but I’ve seen it on large collections on test and staging servers also. If it does, well, you’re playing around. Make the indexing job smaller, either by limiting the directories or by limiting the filetypes.

Boom. Indexed collection you can search to your heart’s desire. Just remember, it only indexed the filetypes and directories you submitted.

Searching your indexed collection

Once you have created and indexed your collection, you are ready to search. This is handled with a ColdFusion tag called cfsearch in a standard cfm file. For simplicity’s sake, and because we all know how to make a form and pass variables, I’m just going to stick my search criteria in my sample code:

<cfsearch name="search_results" collection="miki_test" criteria="grim" maxrows="300">

The cfsearch is the easy part. I use name just like in cfquery, I tell it what collection to search, and I put my criteria in. As I’m searching digital fiction, grim seemed a good search term. It is recommended that you set a maxrows on these searches lest you return zillions of results.

This creates a recordset much like any query with the following columns:
Key, Rank, RecordsSearched, Score, Size, Summary, Title, URL

In addition, if we did all that was described above, we will have the following columns in the recordset, but they will be empty:

Author, Category, CategoryTree, Context, Custom1, Custom2, Custom3, Custom4 (these columns become relevant with more advanced searches, but for our basic search, they are empty).

A sample record follows (I searched a collection containing digital books in html, txt and pdf formats):
Key: C:\ColdFusion9\wwwroot\folklore_stuffies\folklore\Northern_mythologyv2.txt
Rank: 1
RecordsSearched: 64
Score: 3.9678988
Size: 590469
Summary: Google This is a digital copy of a book that was preserved for generations on library shelves before it was carefully scanned by Google as part of a project to make the world’s books discoverable online. It has survived long enough for the copyright to expire and the book to enter th
Title: Northern_mythologyv2.txt
Type: text/plain
URL: /Northern_mythologyv2.txt

As far as search results go, I’m not particularly keen on it. For one thing, the summary does not focus on the places where my search terms were found; it’s just the first x lines of the text document. The type is fine for text documents, but everything from a .otc to a .pdf is given a “type” of application/octet-stream. Since I want to let folks know what kind of document they’re looking at, this is useless information. Lastly, the url is an absolute url from the base of my search, so in order to actually direct folks to my page, I have to strip the leading slash.

Here is how I handled it.

<cfsearch name="search_results" collection="miki_test" criteria="grim" maxrows="300">
<h2>Results</h2>
<table>
    <tr>
        <th scope="col">Rank</th><th scope="col">FileSize</th><th scope="col">Link</th><th scope="col">Text</th>
    </tr>
    <cfoutput query="search_results">
    <tr>
        <td>#rank#</td>
        <td>
            <!--- the last 4 characters of the title, minus any dot, name the icon --->
            <img src="images/#replacenocase(right(title,4),'.','','ALL')#_tiny.gif" alt="#replacenocase(right(title,4),'.','','ALL')# file">
            <cfif size/1024 lt 1000>#int(size/1024)#K<cfelse>#int(size/1024/1024)#M</cfif>
        </td>
        <td>
            <!--- drop the leading slash so the link is relative to this page --->
            <cfset my_url = replacenocase(url,'/','','ONE')>
            <!--- strip a trailing .pdf, .doc, .docx or .txt from the title --->
            <cfset my_title = rereplacenocase(title,"\.(pdf|docx?|txt)$","","ALL")>
            <a href="#my_url#">#my_title#</a>
        </td>
        <td>#summary#</td>
    </tr>
    </cfoutput>
</table>

What I’ve done here is take what seem the most relevant of the columns and try to build a search results page. I’ve used the file extension to guess the file type and dropped that straight into an image tag. This will work if you are sure that you have an image for every filetype that comes out; alternatively, you can use a cfswitch with a default case that points to unknown_tiny.gif (an icon of a page with a question mark on it). Next to the filetype is the filesize. The results come back in bytes, so in order to get KB and MB you have to calculate them. The cfif condition displays kilobytes if the calculated KB is less than 1000, and otherwise displays MB. The int function just drops the fractional part of the value. I did this because I had ridiculously long sizes being calculated.
I calculate the url to put in the anchor by replacing the leading / and this works for my needs. Your mileage may vary. I take the title field and ditch the file extension using a regular expression and link using the calculated url. Finally, I display the summary.
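
The cfswitch alternative mentioned above might look something like this. It is only a sketch: the list of extensions and the matching icon file names (apart from unknown_tiny.gif) are assumptions about what lives in your images directory, and it uses listlast() on the title instead of right(title,4).

```cfml
<cfoutput query="search_results">
    <!--- everything after the last dot in the title is treated as the extension --->
    <cfset file_ext = lcase(listlast(title, "."))>
    <cfswitch expression="#file_ext#">
        <!--- extensions assumed to have a matching *_tiny.gif icon --->
        <cfcase value="pdf,doc,docx,txt,htm,html">
            <cfset icon = file_ext & "_tiny.gif">
        </cfcase>
        <!--- anything else gets the question-mark icon --->
        <cfdefaultcase>
            <cfset icon = "unknown_tiny.gif">
        </cfdefaultcase>
    </cfswitch>
    <img src="images/#icon#" alt="#file_ext# file">
</cfoutput>
```

A title with no dot in it falls through to the default case, so at worst you get the unknown icon rather than a broken image.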

This gives me something useful, until i can actually gather and search using more metadata than i have available to me straight from the file.

Creating and indexing your collection using ColdFusion Tags

Since i promised a full write up, here is how to create and index your collection using cf tags.

Creating

<cfcollection action="create" collection="miki_test" engine="solr" path="c:\coldfusion9\collections">

These are equivalent to the fields I had you enter above.
action is pretty self-explanatory, except that create can only be used once per collection. You must delete the collection before creating it again if you intend to run the cfcollection tag more than once.
collection is the same as the name you entered in the CF Admin.
engine is how you specify whether to use Solr or Verity.
path is, again, the path where you want the collection stored. It does not have to be inside the web root.
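
One way around the create-only-once problem is to ask ColdFusion which collections it already knows about and create yours only if it is missing. A minimal sketch, assuming the list action reports Solr collections on your install; existing_collections is a name made up for this example.

```cfml
<!--- list the collections registered with ColdFusion --->
<cfcollection action="list" name="existing_collections">

<!--- create miki_test only when no collection of that name is registered yet --->
<cfif not listfindnocase(valuelist(existing_collections.name), "miki_test")>
    <cfcollection action="create" collection="miki_test" engine="solr" path="c:\coldfusion9\collections">
</cfif>
```

With a guard like this the page can be loaded repeatedly without deleting the collection in between.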

Indexing
In theory, the following should work. I have not been able to make it work: it currently loads a blank page but does not appear to index anything.

<cfindex action="update" collection="miki_test" type="path" key="c:\coldfusion9\wwwroot\folklore_stuffies\folklore">
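
A guess at what is going on, not a confirmed fix: cfindex writes no output of its own, so a blank page is expected either way. Also, as far as the documentation goes, when no extensions attribute is given it only picks up a short default list of web file types (htm, html, cfm, cfml and the like) and it does not recurse into subdirectories unless told to. A collection of .txt and .pdf books would then index as empty. A sketch under those assumptions, with the extension list chosen to match my test files:

```cfml
<cfindex action="update"
         collection="miki_test"
         type="path"
         key="c:\coldfusion9\wwwroot\folklore_stuffies\folklore"
         extensions=".htm, .html, .txt, .pdf"
         recurse="yes">

<!--- cfindex prints nothing, so run a search to see whether any documents landed --->
<cfsearch name="check_index" collection="miki_test" criteria="grim" maxrows="300">
<cfoutput>#check_index.recordcount# result(s) for "grim"</cfoutput>
```

If the count is still zero after this, the problem lies somewhere other than the file types.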

Oh and by the way, this little exercise did answer my question: Solr handles modern pdfs particularly well, even if the search results do leave something to be desired.

What’s New in CF9

ORM Support

  • Object Relational Mapping is key to RAD (you find it in Ruby, CFWheels, etc)
  • Hibernate, the number one ORM tool, is included with CF9 (requires an additional tool on the development side: CFBuilder or others)
  • Requires use of application.cfc (will not work in application.cfm)
  • At the very least, use of hibernate with a tool like cfbuilder will write 2/3 of your database queries for you. Frequently more than that.

FLEX and AIR Support

  • More Flash Remote commands available without purchasing Flex. (Great for Flash users)
  • Now integrated with Adobe Air allowing you to use your ColdFusion to power your Air apps.

Language Enhancements

  • More commands available within CFScript (You can now write most of your app within cfscript):
    • Basic language constructs: throw, writedump, writelog, location and trace
    • Script functions: query, mail, http, storedproc, pdf and ftp
    • Keywords: abort, exit, include, param, property, rethrow, throw
    • Operations: import and new
  • New onServerStart() method to run only when server starts. Useful for configuring logging, instantiating applications, and setting up scheduler.
  • Nested CFTRANSACTIONS
  • Udf/cfc name conflict resolution
  • Local scope/Var scope (helps with name conflicts between application and function) – also can use var anywhere within a cfc and not just at the top.
  • Local is now a reserved word!
  • Implicit getters and setters for cfproperty!
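
To give a feel for a few of these together (a script-only component, implicit accessors, the local scope and var after other statements), here is a small sketch. Book.cfc, its title property and describe() are all invented for the example.

```cfml
// Book.cfc (hypothetical component, written entirely in cfscript)
component accessors="true" {

    // accessors="true" generates getTitle() and setTitle() for this property
    property name="title" type="string";

    public string function describe() {
        // local is a real scope in CF9
        local.prefix = "Book: ";
        // var no longer has to be the first statement in the function
        var result = local.prefix & getTitle();
        return result;
    }
}
```

From a page in the same directory, something like book = new Book(); book.setTitle("Northern Mythology"); writeoutput(book.describe()); inside a cfscript block would exercise it.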

Product Integration

  • Ajax Controls – extJS 3 now embedded (was extJS 2.x), google maps now embedded.
  • SharePoint integration: can use sharepoint signon in cf, can load sharepoint actions in CF application, can create web parts for sharepoint.
  • MS Office interoperability – now can use cfspreadsheet and cfpresentation rather than cfcontent. Can convert word and powerpoint to pdf with cfdocument.
  • OpenOffice support: if installed and the proper attribute is used, cfdocument will use openoffice engine to create pdf rather than cf engine.
  • Search using Apache Solr rather than Verity!!

Performance Enhancements

  • Improved clustering: supports serialization of query, array and datetime types in a cfc.
  • Granular control over caching (can cache fragments of page, can cache in memory, can cache specific objects, etc)
  • In-memory files: simplify execution of dynamic code.
  • Other: Faster cfcs, faster java method invocation

Database Enhancements

  • Now uses DataDirect 4.0 sp 1
    • Support MySQL(enterprise and commercial), Oracle 11g, Informix 11, SQL Server 2008
    • Support IPv6 (Internet Protocol).
    • Set default query timeout value.
  • Datasource attribute is now optional for cfquery, cfinsert, cfupdate, and cfdbinfo.

Code Analyzer

  • Helps migrate from CF7 and 8 to CF9

Service Features

  • More CF can be exposed as web service: cfpdf, cfimage, cfdocument, cfmail, cfpop, cfchart

Other Enhancements

  • Server Manager – desktop version of CF Admin (runs on Adobe Air) with limited availability
  • Built-in support for portlet standards: supports JSR-168, JSR-286, WSRP to help build cf powered portal content.
  • PDF functionality: can support FDF, PDF Package, size optimization, headers and footers (in cfpdf), RGB/ARGB and cfimage support in cfpdf, improved thumbnails, image extraction from pdfs.
  • IMAP support – can query imap servers for email, can run imap items like delete mail/folder, mark multiple as read, and manage folders.
  • JRE specifications – JRE 6 update 14 for all platforms except solaris which uses JRE 6 update 12.

New and Updated Functions

  • Visit livedocs for the full list
  • Most notable: spreadsheet functions and isIPV6().

New Tags

  • Visit livedocs for the full list
  • Most notable: CFFINALLY: brings try/catch up to date with Java allowing a portion of code to run regardless of an error being caught.
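
A minimal sketch of cffinally in use. The file name notes.txt and the log file name finally_demo are placeholders; any operation that might throw would do in place of the cffile read.

```cfml
<cftry>
    <!--- hypothetical file; reading it throws if the file is missing --->
    <cffile action="read" file="#expandpath('notes.txt')#" variable="notes">
    <cfoutput>#htmleditformat(notes)#</cfoutput>
    <cfcatch type="any">
        <cfoutput>Could not read the file: #htmleditformat(cfcatch.message)#</cfoutput>
    </cfcatch>
    <cffinally>
        <!--- runs whether or not the read succeeded --->
        <cflog file="finally_demo" text="Finished the read attempt">
    </cffinally>
</cftry>
```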

Notes

  • With the exception of the move from extJS 2 to extJS 3, existing apps are unlikely to require many changes to continue to function.

Useful Regular Expressions in CF

Have you ever said “Sure, i can take out all that stuff that MSWord puts in when you paste from word” Or “I’m sure I can just find the bits that i don’t need and get rid of them before saving.” If you did, did you regret it immediately?

Or did you wait till you hit the search engines and found the perfect regular expression, only to discover when you pasted it into your “ReReplace()” that it was formatted incorrectly for CF?

First things first, there are regular expressions, and there are CF regular expressions. Most systems, from unix shell to Ruby on Rails have a means of using regular expressions, but not all of them look the same.

For instance, in PowerGREP you say:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

in order to find an email address, but in ColdFusion a number of elements have different uses, so you then have to write the following.

REFindNoCase("^['_a-z0-9-]+(\.['_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,4}$", str);
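
For what it’s worth, here is a cleaned-up form of the pattern above in use, next to ColdFusion’s own IsValid("email", ...) check for comparison. The sample address and variable names are placeholders, and the two checks are not guaranteed to agree on every edge case.

```cfml
<cfscript>
// hypothetical sample value
email_address = "someone@example.com";

// anchored at both ends; REFindNoCase returns 0 when nothing matches
looks_like_email = REFindNoCase("^['_a-z0-9-]+(\.['_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,4}$", email_address) gt 0;

// ColdFusion's built-in check, for comparison
built_in_says = IsValid("email", email_address);

writeoutput("regex: #looks_like_email#, IsValid: #built_in_says#");
</cfscript>
```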

In addition, there is a great deal covered on taking HTML out, including these two, which i’ve used frequently:

The First:

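function stripHTML(str) {
    // remove Word's conditional-comment blocks (<!--[if ...]> ... <![endif]-->) and everything inside them
    var new_str = REReplaceNoCase(str, "<!--\[if.*?<!\[endif\]-->", "", "ALL");
    // remove every remaining tag
    return REReplaceNoCase(new_str, "<[^>]*>", "", "ALL");
}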

Now this one focuses directly on Microsoft, which likes to put some conditional code in that runs depending on your browser, and inside of which are styles as well as html. In addition it fills up standard html tags with a number of attributes that are really just for Microsoft.

The first REReplace looks for those <!--[if blocks, with any text between them and the matching endif, and replaces all of them with nothing.

The second REReplace then finds EVERY run of text and whatnot between a < and a > and takes it right out. Note that this will remove EVERY tag, including ones that it might be relatively wise to keep.

In testing, this took the HTML version of a Word document and removed as many as 84% of the characters, thus shrinking it down to something more usable in a ‘text’ field.

Of course, most of us actually want rich text; we provide rich text editors for that purpose. So we probably don’t want to completely remove everything.

The Second:

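function stripHTML2(str) {
    // remove Word's conditional-comment blocks first
    var new_str = REReplaceNoCase(str, "<!--\[if.*?<!\[endif\]-->", "", "ALL");
    // remove every tag except the ones named in the negative lookahead
    return REReplaceNoCase(new_str, "<(?!\/?(br|b|table|tr|td|i|em|strong|p)\b)[^>]*>", "", "ALL");
}

The (?!...) group is a negative lookahead and each | inside it serves as “or”, so you’re effectively removing only those tags whose names are NOT br, b, table, tr, td, i, em, strong, or p. The \b after the group keeps b from also matching body, and i from also matching img.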

As written, this function will do exactly that, but it does leave in attributes added by Microsoft to support much of what you removed already.

So you could take the above regular expression, and modify it to search out these styles and whatnot that would be added between the < and >.
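
One possible take on that modification, as a sketch only: rather than hunting for specific Microsoft attributes, keep each tag name and throw away everything else inside the brackets. stripAttributes is a name invented for this example.

```cfml
<cfscript>
// Keep each tag but drop everything after the tag name,
// so <p class="MsoNormal" style="..."> becomes <p>.
function stripAttributes(str) {
    // \1 is the captured "/" (if any) plus the tag name
    return REReplaceNoCase(str, "<(/?[a-z][a-z0-9]*)\b[^>]*>", "<\1>", "ALL");
}
</cfscript>
```

Used as stripAttributes(stripHTML2(str)), this leaves bare formatting tags behind. Be aware that it drops every attribute, href and src included, so it only makes sense once links and images are already gone, and an attribute value containing a > will confuse it.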

Of course, most of this, you could probably get elsewhere, and in fact, there’s a function i was able to get at CFLib.org that will do even more than that, and make your data safe.

CSS vs CFDOCUMENT

In this corner!  weighing 133 lbs, the challenger, Cascading Style Sheets.  In this corner! weighing 896 lbs, the Vanilla Gorilla, CFDOCUMENT.

That’s right ladies and gentlemen, the fight to end all fights, cfdocument, with different implementations of html handling between cf7 and cf8, will take on css, which displays certain formatting properties differently between html 4 and xhtml 1.

It was quite a showdown. Especially when we extracted the html from the code in order to create html that actually floats AND overlaps a column of text.  Then cfdocument it.

First we got it to work in both Mozilla and MSIE.  Easy-peasy. Then we gave all that VALIDATED html to cfdocument and it failed to render as either browser presented it.

So we scratch our heads for a little while, search the web for info and page through some old CF books. In a CFCert book for MX7 we find a note that cfdocument doesn’t support xhtml1 and check our doctype. WHAM, it’s declared as xhtml 1 strict.

Odd, we think, since we are running cf8, and all the stuff we are finding online says “valid xhtml1”. But then it occurs to us that one of the servers, test, is running cf7, so we might as well create code that renders on test too.

So I change the doctype and feed it to cf8.  Wham, it works in CF8. Well ok, we think, and are very proud of ourselves, now it will work on test too.  We zip it up and push it out and does it work on test… NOOOO.

We give up and tell the testers to test pdf’s on staging.

We is mostly me, here, but also includes an HTML expert and my boss.

So, that’s the story, but what is behind it? Well, some of it I can’t explain and some of it I’m just guessing at, but I think it probably has to do with one or more of the following:

  • Support for HTML in CFDOCUMENT
  • Support for CSS in CFDOCUMENT
  • The dreaded “Quirks Mode” wherein browsers attempt to handle invalid or untyped xhtml.
  • CFDOCUMENT in CF7 and CF8
  • Using HTML tables for formatting.
  • Attempting to generate valid HTML from a series of regular expressions.
  • The cocky attempt to display 2 columns of text in a dynamically generated pdf
  • The cocky attempt to then float an image over one column.

Regardless, we won the war with a couple of improper uses of HTML and copious use of HTML Doctypes.

Here is the final code (sans database and other code stuff)

Commentary: This mixes inline styles and embedded styles.  The original stuff is inline.  The floating image and whatnot is embedded.  Long story about applying styles in cfdocumentsections skipped for now.  However suffice it to say that cfdocumentsection resets all styles, so i put as many as i could inline.

Commentary: tables for formatting were used because we couldn’t make css do the job, possibly because we were designating xhtml1.

Commentary: if you grab this code, change the doctype to xhtml1 strict and take a look at the difference in how the floating image is handled between the two.  It will make you smile.

<cfsilent>
<!---
This file displays the standard PDF sheet
REVISION TRACKING:
MT - 1/30/08 - Create file 
--->
</cfsilent>
<cfdocument format="PDF" marginleft=".6" marginright=".6" marginbottom=".8" margintop=".6">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
    <title>title goes here</title>
    <meta http-equiv="Content-Style-Type" content="text/css" />
<style type="text/css">
img.floatbak
{
top:-15;
float:right;
margin-top: 0;
margin-left:5;
margin-bottom:0
}


img.float
{
position:absolute;
right:0px;
top:40px;
z-index:1
}
</style>

</head>
<body>
    <div id="container" style="width:100%;">
        <div id="nonfloat" style="width:100%;">
	    <table>
		<tr>
	 	    <td><img src="images/logo_image.gif" alt="Company Logo" /></td>
		    <td><p style="font-family:arial black,arial,helvetica,sans-serif;font-size:16pt;">
 	                  Company Information Sheet <br/>
	                  <em style="font-family: Arial, Helvetica, sans-serif;font-size:14pt;font-style: italic;font-weight:bold;">
		             Company Name
       	                  </em>
	            </p></td>
	</tr>
	</table>
</div>
<div id="nonfloat2" style="width:100%;">
<table cellpadding="5" cellspacing="1" width="100%">
	<tr> 
		<td valign="top" width="50%"><p style="font-family:Arial Black, Arial, Helvetica, sans-serif; font-size:12pt;font-weight:bold;"> big data left side </p>
			<p style="font-family: Times New Roman, serif;font-size:11pt">
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt 
ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco 
laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in 
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat 
non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
			</p>
			<p style="font-family: Times New Roman, serif;font-size:11pt">
				At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis 
praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi 
sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia 
animi, id est laborum et dolorum fuga. Et harum quidem rerum facilis est et expedita distinctio. 
Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id 
quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. 
Temporibus autem quibusdam et aut officiis debitis aut rerum necessitatibus saepe eveniet ut 
et voluptates repudiandae sint et molestiae non recusandae. Itaque earum rerum hic tenetur 
a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis 
doloribus asperiores repellat.
			</p>
			</td>
			 <td width="50%">
				<table border="1" cellpadding="0" cellspacing="0">
					<tr>
					<td> 
					<img class="float" src="images/folating_image.jpg" alt="secondary logo" >
					<div style="position:relative;z-index:0;margin-left:10px;margin-top:10px;margin-bottom:10px">
					<p style="font-family:Arial Black, Arial, Helvetica, sans-serif; font-size:12pt;font-weight:bold;">Big data right side</p>
					<p style="font-family: Arial, Helvetica, sans-serif; font-size:11pt;font-weight:bold; font-style:italic;">italicized numerical values<br>second value</p>
			<p style="font-family: Times New Roman, serif;font-size:11pt"> Sed ut perspiciatis 
unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, 
totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae 
dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, 
sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro 
quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non 
numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. 
Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut 
aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate 
velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla 
pariatur?
		</p>
	</div>
</td>
</tr>
</table>  
</td>
</tr>
</table> 
</div>
</div>
</body>
</html>
</cfdocument>