XML – Pervasive Code

Full Text Search refuses to be a black box

Jamie Flournoy — Mon, 16 Jul 2007 02:48:56 +0000

Once upon a time, before Google pwn3d internet search, there were several competing definitions for full text search. Altavista more or less gave you results matching the exact strings you gave it, but in a crazy order that made it painful to use. Excite (my favorite back then) used a dictionary to achieve stemming and synonym matches: searching for ‘dogs’ would also match documents that contained ‘canines’ or ‘dog’. Then Google blew them all away, and established a dominant set of expectations for how text search behaves.

I forgot about this, which is why I’m frustrated by the almost ridiculous complexity of the major server-side text search engines available right now. But it makes sense, once you learn what the options are.

The first thing you have to decide is whether you want to stick with an in-database search feature, meaning the one that comes with your SQL RDBMS of choice. If so, you get ACID and hopefully MVCC features, which is probably why you’re using a SQL RDBMS in the first place: you’d like to get back the data that you put into it, even with multiuser access.

This option has only become decent in the last few years; typically you pay a penalty for the ACID and MVCC assurance, as well as for the smaller audience that uses that particular database. The factory-installed radio is never as good as the one you order from Crutchfield, and likewise, the bundled thingamajig is never as good as the best available special-purpose thingamajig. But you get a few things in return: examples specific to your chosen database (duh), a simplified storage architecture, and probably some lower resource requirements.

By “simplified storage architecture” I mean that you don’t have to add your data to the database, and then also to the search engine, nor do you have to query the search engine first, and then possibly use the results to query the database if the search engine doesn’t contain everything you need. By “probably lower resource requirements” I mean that you’re not running two redundant but slightly differently optimized database servers: one SQL RDBMS and one specialized full text search server. Two redundant servers means double the storage and double the RAM, plus or minus some variation in overhead (row size in bytes vs. page size, etc.). The full text index built into the database product will need a substantial amount of extra space for the indexes in memory and on disk, but almost certainly this will be less than a separate search server would need for the index and data, plus its own code, in memory and on disk.

An argument could be made for dumping the SQL RDBMS in favor of just using the dedicated full text search engine, but given that SQL RDBMSs tend to be the only game in town that offers unlimited ad-hoc querying, ACID, MVCC, and easy integration with all sorts of application languages and frameworks, I suspect that the folks making that argument either have really unusual requirements in mind, or are bozos who fall in love with architectures that work well for a few of their requirements and suck utterly at the remaining requirements.

There are quite a lot of those people out there. The same sort of suggestion has been made for object databases and XML databases in the past, and it turns out that the ones that blow away SQL RDBMSs in performance also lack one or more of the qualities that make SQL RDBMSs so useful. Putting those features back in tends to bring performance back down to a level close to that of the SQL RDBMS, and the argument for total replacement becomes weak. But the arguments, and arguers, remain.

But, perhaps your requirements for search performance are extreme: you need super duper fast search, which typically means you need a reasonable turnaround time on searches of a huge number of documents, with lots of users searching. The leading dedicated search engines lately seem to be Lucene (or one of its derivatives, such as Ferret) or Xapian. In either case, you get super high performance, at the cost of some additional resources due to overlap with the SQL RDBMS, application complexity (two reads, two writes), and sysadmin complexity (a whole ‘nuther product to manage in order to keep the overall application up and running).

So, I’ll just accept that there are reasons to do either one, and maybe some corner cases in which just using a dedicated search engine makes sense instead of a SQL RDBMS. But let’s move on to the not-a-black-box aspect of full text search.

Here are the key issues:

Full text search indices get huge quickly.
Some words are almost meaningless to searchers but are extremely commonly used.
Some words mean the same thing, or are variations of the same word.
Various kinds of coded character data (scientific notation, URLs, mailing addresses, etc.) are commonly embedded in searchable text.

As a result of all of these, simply accelerating "select * from cute_pet_stories WHERE UPPER(story) LIKE '%DOGS%'" isn’t a viable approach. Instead, the search indexer requires additional information, such as the character encoding and language of the text being indexed, and uses that to simplify the text being indexed into root words (dog instead of dogs) that don’t include low-value words such as “a”, “the”, “of”, etc. (these are called stop words). Also, it may differentiate encoded data from regular language text, and handle it specially.
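To make that concrete, here is a minimal Python sketch of the kind of text simplification an indexer does before it stores anything. Everything in it is an assumption for illustration: the stop-word list is tiny, `naive_stem` only strips a trailing plural "s", and neither matches what any real engine ships.

```python
import re

# Tiny illustrative stop-word list; real engines use per-language lists.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def naive_stem(word):
    """Strip a trailing plural 's' (a crude stand-in for real stemming)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def analyze(text):
    """Lowercase, split into words, drop stop words, and stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("The dogs of the neighborhood chased a cat"))
# ['dog', 'neighborhood', 'chased', 'cat']
```

The point is that "dogs" and "dog" end up as the same indexed term, which a LIKE pattern can't give you without also matching junk like "hotdogs".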

PostgreSQL includes a search engine called Tsearch2, which is apparently quite fast if you’re willing to sacrifice size (big indexes) and write performance.

The implementation of a Tsearch2-indexed table is interesting: first you add a column to your table that’s just there to be indexed, and you fill it using text-mangling functions that do stemming (dogs->dog), stop-word removal, and word counting. That leaves you with a column of type tsvector. Then you create an index on that column, and do your text queries against that index. You have to clean up your search text first, though, and similarly mangle it into an appropriately stemmed, stop-word-free tsquery object which itself can contain boolean expressions that will be used in the search process.
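Here is a rough Python imitation of that workflow, just to show the shape of it. This is not Tsearch2’s code or API: `to_vector`, `to_query`, and `matches` are made-up names, the stemming is a crude plural stripper, and the query only supports ANDing terms together rather than full boolean expressions.

```python
import re

STOP_WORDS = {"a", "and", "of", "the"}  # illustrative only

def lexemes(text):
    """Yield (position, lexeme) pairs; positions count every word, 1-based."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for pos, word in enumerate(words, start=1):
        if word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]  # crude plural stripping, not a real stemmer
        yield pos, word

def to_vector(text):
    """Map each lexeme to the list of positions where it occurs."""
    vector = {}
    for pos, lexeme in lexemes(text):
        vector.setdefault(lexeme, []).append(pos)
    return vector

def to_query(text):
    """Prepare query text the same way; the terms are ANDed together."""
    return {lexeme for _, lexeme in lexemes(text)}

def matches(vector, query):
    """True if every query term appears in the prepared vector."""
    return all(term in vector for term in query)

row = {"story": "The dogs chased the cat and the dog won"}
row["story_vector"] = to_vector(row["story"])  # the extra column to be indexed
print(row["story_vector"])
# {'dog': [2, 8], 'chased': [3], 'cat': [5], 'won': [9]}
print(matches(row["story_vector"], to_query("cats dogs")))
# True
```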

(There’s an acts_as_tsearch Rails plugin that attempts to simplify this into an idiom that makes more sense from a declarative standpoint. It looks pretty immature but I’m gonna give it a whirl anyway.)

Lucene does something similar, using Java classes it calls Analyzers to encapsulate the same kind of text-mangling behavior that Tsearch2 performs using its to_tsvector SQL function. Xapian also has this same feature, apparently from the same original source. So the model of first preparing your text for indexability, then indexing it, then searching with similarly prepared query text, appears to be common if not universal.
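The prepare, index, search model fits in a few lines of Python if you don’t care about speed or ranking. This is a toy sketch under my own assumptions, not how Lucene, Xapian, or Tsearch2 are implemented: `TinyIndex` and `analyze` are invented names, and the index is just a dict from term to a set of document ids.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "and", "of", "the"}  # illustrative only

def analyze(text):
    """Shared text preparation, used for both documents and queries."""
    terms = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]  # crude plural stripping
        terms.append(word)
    return terms

class TinyIndex:
    """A minimal inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every prepared query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

index = TinyIndex()
index.add(1, "Cute dogs and their owners")
index.add(2, "The cat ignored the dog")
index.add(3, "A story of two cats")
print(sorted(index.search("dogs")))      # [1, 2]
print(sorted(index.search("the cats")))  # [2, 3]
```

The important bit is that `add` and `search` both go through the same `analyze` function. If the query were prepared differently from the documents, "dogs" would never find a document indexed under "dog".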

Hopefully now you understand, as I now do, that full text search is inherently complicated, but not necessarily slow. All you need to do is understand the generic way that full text indexing and searching works, and then make a decision about integrated vs. standalone based on your setup.

I’m going to try Tsearch2 + acts_as_tsearch. I’ll let you know how it goes.

The Mozilla Platform’s Catch-22 Problem

Jamie Flournoy — Mon, 30 Apr 2007 23:00:58 +0000

Starting with Netscape 4.5, I’ve used Netscape, then Mozilla, then Thunderbird for email. I have a similar relationship with Firefox. I’ve watched with great hope and been disappointed over the years as Thunderbird bugs that really annoy me just… stay. I think I know why. It’s because Firefox and Thunderbird are built in such a way as to create a catch-22 situation — one that actually discourages new contributors.

Here’s a good example of a bug that would seem to be trivial to fix: “Entire message” quick search criteria is only the body. You can’t search the entire message in Thunderbird. Can’t do it. Never been possible, as far as I know. Lots of people noticed it; it’s been marked as an “enhancement” in “New” status for years, even though it’s quite obviously a bug in my opinion. So, go to the source code, find the part where it decides what parts of the message to search, and make it search the whole thing. Should be easy, right?

Hell no. Firefox and Thunderbird are built using an incredibly complicated set of tools and technologies created by the Netscape and Mozilla folks, specifically for the purpose of building Firefox and Thunderbird.

Low-level stuff is written in C++, but if you know C++ already you still have to learn their coding standards and class libraries. Yeah, it’s great that they have coding standards and libraries; I’m just sayin’, that’s more stuff you have to learn in order to write a single line of low-level code.

High level stuff is in a combo of JavaScript and XML. Of course you have to learn the XUL platform stuff, so just knowing JavaScript and XML isn’t sufficient. You’ll also probably need to learn to use the low-level classes from the XUL side, so that you can get anything done.

Now if you want to actually build it, you’ll also need to understand their build system, and maybe their installer. You might need to learn how their help system, documentation system, internationalization system, etc. etc. work.

Now, I’m in awe of the monumental amount of new and working code that the Mozilla community has created. But the problem is just that: they built this huge application stack, and in order to get working on Firefox or Thunderbird, you have to learn a big chunk of that stack.

In theory, this wouldn’t be any more of a barrier than if they had coded to a single platform’s GUI widget set, class library, etc. But if I were to learn .NET and Visual Studio, or Cocoa, or GTK+, I would then have skills that I could use to work on thousands of other applications, either open source or commercial. That’s because those technologies are designed for general purpose application development.

The Mozilla application platform, however, has no life of its own outside of Mozilla. There’s an effort to extract it and make XULRunner a viable standalone platform that you could build your own XUL applications on, but that’s currently just a “stable developer preview”. What that means is that it’s still not suitable for use if you wanted to, say, build your own iTunes killer with it. It’s almost there, and maybe if you throw a lot of effort at it you could hack it so it’s usable, but as-is, it isn’t readily useful.

So, when a developer (such as myself) looks at this technology set, they ask themselves the question, “can I justify spending this much time learning all this stuff just to fix a little annoying bug?” If you spend a week or two downloading and learning and hacking and that gets you a certain amount of skill with the technology set, was it worth it? Maybe you could go get a job working with this technology set, except almost no one else is using it, not even open source projects.

When a corporation looks at this technology set, they have to ask a similar question: “is this the platform that will give me the best bang for my developer buck?” If they invest the time and money in hiring developers to work on this mostly-complete platform, so that they can then build an application with it, will it pay off? What about alternatives? Will the pool of available developers be so small that the project will fail before it starts?

These are not insurmountable barriers; some folks know some of these technologies already or only want to work on a corner of the application, so the overhead is smaller. Or, they may really want to work on Firefox or Thunderbird very badly. For an employer or open source project, their goals may match the goals of the original platform designers very well, in which case the payoff of using the Mozilla platform would be much higher than if you were (for example) making a Windows-only IM client.

But the barriers do exist, and they discourage the platform and the applications built with it from being improved. Developers don’t want to spend all that time learning things they can’t use elsewhere, and employers don’t want to pay developers to fix the platform so that other projects could find it useful. A few projects do exist and obviously somebody somewhere is working on Thunderbird bugs, but for a platform this sophisticated with this much mindshare (via Firefox), it’s remarkably unsuccessful in terms of adoption by developers.

All it would take to fix this would be some cash. Somebody could donate (or spend, in pursuit of a business goal) money toward the completion of XULRunner for general use, and toward clear and useful beginning developer documentation for the rest of the Mozilla platform.

Any takers? I could throw in a hundred bucks or so, but I have a feeling this is more like a $50,000 undertaking. Otherwise we’re stuck waiting for Joost and a handful of others to slowly move it along.