Full Text Search refuses to be a black box

Once upon a time, before Google pwn3d internet search, there were several competing definitions for full text search. Altavista more or less gave you results matching the exact strings you gave it, but in a crazy order that made it painful to use. Excite (my favorite back then) used a dictionary to achieve stemming and synonym matches: searching for ‘dogs’ would also match documents that contained ‘canines’ or ‘dog’. Then Google blew them all away, and established a dominant set of expectations for how text search behaves.

I forgot about this, which is why I’m frustrated by the almost ridiculous complexity of the major server-side text search engines available right now. But it makes sense, once you learn what the options are.

The first thing you have to decide is if you want to stick with an in-database search feature, meaning the one that comes with your SQL RDBMS of choice. If so, you get ACID and hopefully MVCC features, which is probably why you’re using a SQL RDBMS in the first place: you’d like to get back the data that you put into it, even with multiuser access.

This option is just becoming decent in the last few years; typically you pay a penalty for the ACID and MVCC assurance, as well as for the smaller audience that uses that particular database. The factory installed radio is never as good as the one you order from Crutchfield, and likewise, the bundled thingamajig is never as good as the best available special-purpose thingamajig. But you get a few things in return: examples specific to your chosen database (duh), a simplified storage architecture, and probably some lower resource requirements.

By “simplified storage architecture” I mean that you don’t have to add your data to the database, and then also to the search engine, nor do you have to query the search engine first, and then possibly use the results to query the database if the search engine doesn’t contain everything you need. By “probably lower resource requirements” I mean that you’re not running two redundant but slightly differently optimized database servers: one SQL RDBMS and one specialized full text search server. Two redundant servers means double the storage, double the ram, plus or minus some variation in overhead (row size in bytes vs. page size, etc.). The full text index built into the database product will need a substantial amount of extra space for the indexes in memory and on disk, but almost certainly this will be less than a separate search server would need for the index and data, plus its own code, in memory and on disk.

An argument could be made for dumping the SQL RDBMS in favor of just using the dedicated full text search engine, but given that SQL RDMBSs tend to be the only game in town that offers unlimited ad-hoc querying, ACID, MVCC, and easy integration with all sorts of application languages and frameworks, I suspect that the folks making that argument either have really unusual requirements in mind, or are bozos who fall in love with architectures that work well for a few of their requirements and suck utterly at the remaining requirements.

There are quite a lot of those people out there. The same sort of suggestion has been made for object databases and XML databases in the past, and it turns out that the ones that blow away SQL RDBMSs in performance also lack one or more of the qualities that make SQL RDBMSs so useful. Putting those features back in tends to bring performance back down to a level close to that of the SQL RDBMS, and the argument for total replacement becomes weak. But the arguments, and arguers, remain.

But, perhaps your requirements for search performance are extreme: you need super duper fast search, which typically means you need a reasonable turnaround time on searches of a huge number of documents, with lots of users searching. The leading dedicated search engines lately seem to be Lucene (or one of its derivatives, such as Ferret) or Xapian. In either case, you get super high performance, at the cost of some additional resources due to overlap with the SQL RDBMS, application complexity (two reads, two writes), and sysadmin complexity (a whole ‘nuther product to manage in order to keep the overall application up and running).

So, I’ll just accept that there are reasons to do either one, and maybe some corner cases in which just using a dedicated search engine makes sense instead of a SQL RDBMS. But let’s move on to the not-a-black-box aspect of full text search.

Here are the key issues:

  • Full text search indices get huge quickly.
  • Some words are almost meaningless to searchers but are extremely commonly used.
  • Some words mean the same thing, or are variations of the same word.
  • Various kinds of coded character data (scientific notation, URLs, mailing addresses, etc.) are commonly embedded in searchable text.

As a result of all of these, simply accelerating "select * from cute_pet_stories WHERE UPPER(story) LIKE '%DOGS%'" isn’t a viable approach. Instead, the search indexer requires additional information, such as the character encoding and language of the text being indexed, and uses that to simplify the text being indexed into root words (dog instead of dogs) that don’t include low-value words such as “a”, “the”, “of”, etc. (these are called stop words). Also, it may differentiate encoded data from regular language text, and handle it specially.

PostgreSQL includes a search engine called Tsearch2, which is apparently quite fast if you’re willing to sacrifice size (big indexes) and write performance.

The implementation of a Tsearch2-indexed table is interesting: first you add a column to your table that’s just there to be indexed, and you fill it using text-mangling functions that do stemming (dogs->dog), stop-word removal, and word counting. That leaves you with a column of type tsvector. Then you create an index on that column, and do your text queries against that index. You have to clean up your search text first, though, and similarly mangle it into an appropriately stemmed, stop-word-free tsquery object which itself can contain boolean expressions that will be used in the search process.

(There’s an acts_as_tsearch Rails plugin that attempts to simplify this into an idiom that makes more sense from a declarative standpoint. It looks pretty immature but I’m gonna give it a whirl anyway.)

Lucene does something similar, using Java classes it calls Analyzers to encapsulate the same kind of text-mangling behavior that Tsearch2 performs using its to_tsvector SQL function. Xapian also has this same feature, apparently from the same original source. So the model of first preparing your text for indexability, then indexing it, then searching with similarly prepared query text, appears to be common if not universal.

Hopefully now you understand, as I now do, that full text search is inherently complicated, but not necessarily slow. All you need to do is understand the generic way that full text indexing and searching works, and then make a decision about integrated vs. standalone based on your setup.

I’m going to try Tsearch2 + acts_as_tsearch. I’ll let you know how it goes.

1 thought on “Full Text Search refuses to be a black box”

Leave a Reply

Your email address will not be published. Required fields are marked *