sql – Pervasive Code

Sphinx Search init script for Centos 5.1

Jamie Flournoy — Mon, 14 Apr 2008 06:18:11 +0000

Sphinx search is pretty new, and as a result I was unable to find a nice convenient package for it for CentOS 5.1. This is problematic since there is no init script included with the source tarball, and the issue of updating the index is the sysadmin and developer’s problem, and cannot be configured to simply update the index when the data changes.

The second problem (updates) is one I punted on; for now I have a cron job rebuilding the entire index every 5 minutes, which will probably be replaced with something smarter and lower-latency at a later time.

The first problem (no init script) is easy to solve, but apparently nobody has done so for CentOS 5.1 and published it. So, here is my CentOS 5.1 init script for the Sphinx Search server. It is known to work with version 0.9.8-rc2.

BTW, the alternative solution to the problem of a daemon not having a System V init script is to just put some extra junk in /etc/rc.local. That is the quick and dirty solution, and is undesirable for several reasons:

You can’t easily stop or restart the service, because it’s not a service as far as the OS knows; it’s just some junk in a script that got run a while ago.
You can’t use chkconfig or its GUI cousin with the creative name, The Services Configuration Tool, to control it and tie it to specific runlevels.

(System V runlevels and init scripts are useful, even if you don’t need all of the runlevel functionality. The stop/start/restart PID stuff is useful by itself.)

Document Databases – New Kids on an Old Block

Jamie Flournoy — Sat, 16 Feb 2008 06:16:16 +0000

There’s a new crop of databases that has appeared lately, under the rubric of “document databases”, and there’s quite a lot of enthusiasm for them given that they tend to be slow and very feature-poor compared to the SQL RDBMSs that are the typical persistence mechanism for web applications. What’s mainly appealing about them is that they are easy to use, and theoretically quite scalable, compared to the traditional “one big SQL database server” approach.

But the simplicity of these new document databases is tied to some significant trade-offs in the current implementations. And so I’m going to try and put them into context with some of the other data persistence options that have been around for a while, but which aren’t currently getting as much hype as document databases. Hopefully that will help all of us to understand how these new and evolving document databases can be useful to us, and what the alternatives are in areas where they may not fit well.

I’d first like to try and deconstruct a false dichotomy that I’ve noticed being used in arguments in favor of some of these new databases. That dichotomy casts SQL RDBMSs (such as MySQL, Oracle, PostgreSQL, MS SQL Server, etc.) as big, complicated, and hard to scale, compared to document databases which are small, simple, and easy to scale. The main problem with this dichotomy is that there are far more choices than just two. Each database product embodies a set of design choices, and although there is some clustering of decisions into general types of product, the boundaries are a lot fuzzier than a product evangelist might have you believe.

Furthermore, trade-offs made early in a product’s lifetime may have been altered over time. A good example is the no-longer-true characterization of MySQL being fast but not reliable, vs. PostgreSQL being reliable but not fast. In reality, recent releases of both products are moving toward being very fast and very reliable.

Because there are so many database products out there, I’m going to have to fall back on a small subset of example products, as illustrations of issues that may or may not exist in a particular product you’re looking at. The key for an application architect evaluating a persistence mechanism is to understand the abstract concepts, and to figure out which ones matter to your current application. That will let you select a product (or a combination of products, including some custom code perhaps) that suits you. As with all aspects of architecture, there is no cookbook you can use, and in six months all the options will change. You have to analyze your needs first, and then get your hands dirty with evaluation second.

So, let’s start deconstructing some of the examples into design decisions.

SQL RDMBS

This is the traditional choice for web applications. You get a remote query language, so you can specify in great detail what you want to retrieve. You get very precise control over data representation, including some that may be burdensome if you’re not concerned with internationalization, multiple currencies, and time zones. You get a lot of control over performance and a lot of information about how things work at a low level inside the database, from indexes to data page size to transaction logs and checkpoint frequency. Most of them include ACID transaction support, which is nice for reliability, but which usually obligates you to implement a backup scheme or else they will eventually stop accepting new transactions and/or run out of disk space.

Some of the design drawbacks include:
– the use of local storage, for performance and for a guarantee that data has been committed to disk
– the use of a high level query language (SQL) and a query optimizer, so the specific process the database uses to satisfy your query is not in your face (and thus may be surprisingly inefficient, if you aren’t familiar with how it works)
– the use of a proprietary network protocol, which means that you need a special client library for just that one product, which may or may not implement all of the features that the server offers (such as encryption)

However, there are some variations that make the edges of this category fuzzier. There are ACID-compliant SQL RDBMSs that have no network layer, and are very lightweight; in some cases they may not even support concurrent access. Examples include HSQLDB and SQLite.

Networked Filesystem

This is typically used for accessing shared file servers, or allowing “thin client” behavior so that users can get to their own environment and data from any given endpoint. Examples include NFS, SMB, AFP, GFS, and quite a few others. The main advantage of these systems is that the remote filesystem is represented as being directly connected to the local system, while also being available to other users or other client systems who are connected to the same remote system.

Trade-offs of this design include:
– performance on a LAN may be good, but over a slow, high latency link may be very poor
– there is usually no ACID transaction support, just file locking
– file ownership and permissions can be very hard to manage
– if the file server goes offline, the entire local system may hang or crash

In particular, content indexing, complex querying, and data integrity features are generally not offered. You can layer that on top, though, but that layer will not necessarily work if it was originally designed to work on local filesystems. In particular I’m thinking about DBM files; they’re fast and easy to use but not all of them will work properly with files located on a network filesystem.

Also, directory scanning performance can be very poor if thousands of files are located in a single directory; listing all files starting with the letter T may actually require the entire directory to be retrieved and filtered on the client side.

Variations include FTP and WebDAV, which are not intended to simulate a local filesystem, but instead have filesystem-like semantics. Some operating systems will mount them as remote filesystems anyway, for ease of use for viewing and copying files, but it’s not possible to lock a remote file, so safe multiuser access is not possible.

Object Database

Object databases offer a direct representation of an application’s data in almost exactly the same form that exists in memory. Whereas a relational database stores data in tabular form regardless of the particulars of a client application, an object database stores data in the same form that the application uses. The exception to this is in the representation of references to other objects; at some level these encapsulate pointers to the memory address of the data in the application’s address space, and this must be substituted with a pointer to the location in the database’s storage system before storing it.

An object database will not handle time zones or internationalization, but nor will it complicate those matters if the application handles those already. The data is simply stored as-is. Also, object databases typically do offer ACID transaction support.

Aside from the conceptual simplicity of the similar data model, one major bonus of an object database is that the use of pointers makes data retrieval extremely fast; rather than parsing a query and searching indexes for the on-disk location of a desired object, the application can simply ask the database for it by its reference.

The big trade-offs here are twofold:

First, the lack of indirection through a query language and an indexing system mean that the application developer must anticipate all of the queries that will be needed, and incorporate collections into the the object graph that will be used to get to the stored objects. Otherwise the application’s object model will need to be updated frequently to include these later.

Second, altering the application’s object model and then retrieving data stored using older code can be very complex. Because the stored objects and application’s code are out of sync in this situation, additional application code must be written to convert existing stored objects into the new representation and persist them back to the database.

Combine those two trade-offs, and it’s clear that the performance benefit comes with the price of considerable additional application development effort.

Also, an object database is by nature bound to a single application, rather than being a point of integration between multiple applications. Any attempt to create a shared code library that manages access to the object database introduces potential “impedance mismatches” between each application and the shared object model, which reduces the simplicity that an object database offers in comparison to a relational database.

Document Database

Arguably a rejection of relational technology, document databases offer several advantages compared to the three classes of database previously mentioned. Documents need not be internally represented as a flat set of key-value pairs as seen in a SQL RDBMS; for example, the document may be an XML document. Queries are possible and may even use a standardized query language to express the conditions for matching desirable documents. Documents may have internal structure that is understood by the database server (so that it can query against the document’s contents), as in the case of an XML database, or they may have an external metadata structure consisting of key-value pairs, or both.

The drawbacks of this type of system derive from the fact that it is similar in many ways to each of the other database types.

Querying ability means that the server must incorporate some kind of indexing system for performance reasons, which means that the document must either internally or externally conform to some sort of standard data model. Some document database systems simply omit querying except by the document’s main ID (similar to a SQL primary key, a network filesystem’s filename, or an object database’s storage reference). This has the same drawback as with an object database: the application must take on the responsibility of managing querying, searching, and sorting itself, across the network.

The similarity to a networked filesystem may also have scalability benefits; if referential data integrity is not provided, then documents can be located on any remote system, and partitioning is simple. However, distributed queries will still need to be managed at the application level, and potentially any transaction becomes a distributed transaction, since a changed document on one server may be referenced from any number of documents in any number of other servers. (This is also true of the other systems, though.)

Still, because a document database is closest to a networked filesystem, it may be suitable for simple requirements where a relational database or object database seems to complex and slow, but where the bare-bones functionality of a networked filesystem is too simple. The compromise of a binary file with simple additional metadata or properties attached “out of band” with the data itself, or of a structured document format that is flexible but not ideal, may be acceptable if sophisticated querying is possible as a result.

It’s hard to provide good examples of document databases, because the category is very broad, and includes a lot of simple projects that provide just a little functionality above and beyond what WebDAV already provides. But a few that I’ve heard of recently include CouchDB, SimpleDB, and RDDB.

CouchDB imposes a simple key-value data structure on document content, but no internal document schema or grouping of documents by type. It does offer indexing and querying. Notably, it also offers transparent replication, at a field level (changes to two different fields in two copies of the same document are synchronized to both copies).

Amazon SimpleDB similarly imposes a key-value structure, though one key can have multiple values. It too offers a query language, and indexing. Because it’s built on Amazon’s S3 service, transparent replication is also included.

Future Prospects

The main complaints that I’ve seen directed at relational databases involve two things: one, the difficulty of scaling them up, and two, the restrictive data model. Sharding (a.k.a. data partitioning) is the usual remedy for scaling problems, but that requires the elimination of referential integrity in the SQL RDBMSs I’m aware of, and requires distributed transactions in order to preserve ACID transaction properties across denormalized copies of the modified data.

Interestingly, these issues are the same across the board, regardless of database type. Either you abandon transactions, or you move them up to a level that’s aware of the data partitioning. I see no reason why these and other high-end RDBMS features couldn’t be offered in a proxy layer that possibly even contains the query processing as well.

One way to approach this is to build a closed system with a given set of features and a limited API that permits a single query language. This seems to be the way that CouchDB and SimpleDB are approaching the problem.

Another way to approach this problem is to simply say that the storage back-ends of relational databases could be enhanced to incorporate built-in transparent partitioning. I don’t think that SQL RDBMSs will abandon the concept of a table schema any time soon, but there’s no reason why products that already include XML query and indexing capabilities and free-form natural language indexing (a.k.a. Full Text Search) couldn’t also include indexing capabilities for simple key-value structured data inside a single column of semi-structured data, giving most of the same functionality as a document database.

Given that, the remaining limitation of a SQL RDBMSs is the requirement that the back-end storage system be located on a disk drive physically connected to the same server, and that the storage be touched only by processes running on the same server together so that they can coordinate access to the data.

For now, though, document databases look like they can be very useful for certain types of persistence requirements; I don’t see them as a viable substitute for everything that a SQL RDBMS does, but that perception is limited mainly by the choices at hand. CouchDB looks like the most generally useful option so far, though I’d like to see the addition of optional schemas (opt-in on a per-object level, as seen in LDAP), and/or a pluggable language option. (It seems that everyone using a document database is also enamored of their application language and dislikes the idea of putting logic in the data tier unless it’s written in the same language.)

I welcome your comments – this is mostly a brain dump of things I’ve seen before to help myself and others contextualize the new document databases, and document databases are evolving too rapidly for me to keep up with all of them on my own.

Acts_as_tsearch adjustments needed for PostgreSQL 8.3rc2

Jamie Flournoy — Thu, 24 Jan 2008 20:00:23 +0000

Just a quick note: acts_as_tsearch needs some guidance to work with PostgreSQL 8.3 due to changes in tsearch2 integration.

I’m pretty close to tossing out acts_as_tsearch and rolling my own (trigger-based) tsearch2 plugin, but for now I’m just sticking with it and checking out the PostgreSQL 8.3 release candidate.

I was able to build 8.3rc2 on Mac OS X 10.5.1 from the tarball sources with the instructions in the INSTALL document, no hitches whatsoever. Because I have 8.2 installed via MacPorts, there were no file conflicts (different install directories, data directories, etc.), so all I had to due was shut down the 8.2 server and start the 8.3rc2 server and it was ready to go.

Unfortunately, acts_as_tsearch didn’t work properly the way I had used with with 8.2. The issue appears to be that the tsearch2 locale called ‘default’ is gone, which is what acts_as_tsearch uses if you don’t specify something else. The default locale value is now located in postgresql.conf. Using that value as an explicit locale in the acts_as_tsearch declaration in my model class solved the problem. The code change looks like this:

OLD:
acts_as_tsearch :fields => ["subject","body"]

NEW:
acts_as_tsearch :vector => {:fields => ["subject","body"], :locale => 'pg_catalog.english'}

Like I said, due to the fact that acts_as_tsearch is designed to hide the complexity of tsearch2, it is not well suited to my somewhat complex requirements. So, I’m ditching it in favor of custom code, which I hope to plugin-ize and release some time later. So, this change is necessary but might not be sufficient for your own project. But I hope it helps you get started on upgrading successfully.

ActiveRecord: the Visual Basic of Object Relational Mappers

Jamie Flournoy — Fri, 05 Oct 2007 02:07:53 +0000

I’ve been working with Ruby on Rails intensively for several months, and I’ve finally found a place where Rails can’t readily be extended to do what I want. It’s ActiveRecord, which is probably the most controversial part of Rails.

I’m reminded of a James Gosling quote disparaging Microsoft tools, particularly Visual Basic: “The easy stuff is easy, but the hard stuff is impossible.” There’s a parallel between VB and Rails in this instance, in that if you only let yourself use the high level tools, the hard stuff is impossible, but the designers specifically tell you to do the hard stuff using a lower level toolset. The controversy that surrounds “X can’t do everything, therefore it sucks” should really be focusing on the feasibility of going through that trapdoor to do things “the hard way”. This is what Delphi did, which is why so many folks chose it over VB; it made the hard stuff easier.

Here’s the task I need to accomplish, for which ActiveRecord is not well suited: complex queries involving SQL functions and multiple-table joins. I want to join a few tables together, order by a SQL function, include with each result row the result of a SQL function that operates on each row, and have all that come back as a graph of high-level objects.

Despite my attempts to use plugins, extend and/or fix bugs in those plugins, and to dig through the ActiveRecord source to figure out what the documentation won’t tell me, I was unable to get it to work. Most of the parts of what I wanted was possible: acts_as_tsearch cleverly weaves SQL functions into a high-level ActiveRecord::Base.find calls; paginating_find provides a very convenient pagination API on top of ActiveRecord::Base.find, and ActiveRecord includes some clever association tricks such as automatic many-to-many relationships (has_and_belongs_to_many), eager loading of associated records using a join (via the :include option to ActiveRecord::Base.find), and a fairly low-level :joins option that lets you add tables to a ‘find’ query which can be used in your :conditions. Problem is, they don’t all work together in a fancy way.

Really, the issue in this case is related to the design choices that went into ActiveRecord.

Some ORMs (object-relational mappers) are designed in a modular fashion: there is a part that helps you describe the relationships between your model objects, a part that helps you construct queries, and a part that does the storage and retrieval. Sometimes there’s another part that uses your description of object relationships to create an empty database with the appropriate data model, or that looks at an existing database and creates an object model that matches it. Sometimes there’s an import/export tool for bulk data loading or dumping as well.

ActiveRecord has the first three functions integrated (which has benefits and drawbacks compared to a more modular approach), has a very isolated schema manipulation module, and has a somewhat isolated data loader tool.

The relationships are explicitly declared in source code using associations: has_one, has_many, belongs_to, and has_and_belongs_to_many. These are pretty fancy and provide some convenience features that make the associations appear as object collections, such that changing the collection and saving it turns into insert/delete/update activity in the database.

Query construction is basically tied to the objects themselves, in a way that greatly simplifies star-join queries, but which handles only the simplest joins across multiple tables, and is barely able to handle self-referential joins at all. So, you can easily load an object (or group of similar objects) and associated objects, but OLAP-style queries (“what are the top 5 states where customers are located who have bought classical CDs within 2 weeks of their release using American Express and had them shipped as gifts via UPS 3-day Select?”) are impossible. Oddly, views, functions, and stored procedures could bridge the gap between real-world data models and ActiveRecord’s limited set of association types, but they are not supported either.

The storage and retrieval code is inseparable from the query code, and so it is not possible to examine and modify the final SQL before it is executed, nor is it possible to provide an arbitrary query and have the results be parsed into an object graph based on the associations you have defined. The code that would allow these features appears to exist and be sufficiently well designed to allow this with a fairly small amount of changes to ActiveRecord. However, it is currently (as of Rails 1.2.3, which is the current release) not part of the documented API and is declared private.

There is a limited facility for constructing simple objects from arbitrary SQL, in find_by_sql. This loses essentially all of the high level functionality of the find method; most notably, it isn’t possible to use find_by_sql results to instantiate an object graph, rather than a flat array of objects (similar to the eager loading feature in the regular find method).

ActiveRecord has fairly good high-level schema creation functionality (“migrations”). Though it lacks concepts for all but the basic database objects, support can be added for foreign key constraints (I kid you not, they aren’t supported by Rails itself!) and views. There’s also a simple way to execute arbitrary SQL. Migrations aren’t technically that amazing, but rather they’re a helpful organizational approach to what can be a really hairy problem: defining a schema and then applying changes to live databases while keeping track of what changes you’ve already applied.

Finally, there is a test data loading facility called Fixtures. The common opinion of Fixtures seems to be that they are broken by design and should be avoided. The main issue I’ve found with them is that the implementation ignores the kind of database design elements that any book on SQL would recommend, such as foreign keys and check constraints. I managed to circumvent this with a combination of a plugin and some customization, described in detail in my previous post, Rails, Fixtures, the Test DB, and Test::Unit. With those changes, all test fixture data is preloaded in the right order (so constraints aren’t violated) before any tests run, and any data alterations within tests are rolled back automatically by Rails.

A secondary issue with Fixtures is that they go directly from YAML text files to SQL INSERT statements, bypassing the ActiveRecord Model classes. ActiveRecord does pretty much rule out any fancy mapping between database tables and objects, so that’s not a problem, but this model-skipping fixture loading implementation means that any code in your model object (validations, before_save filters, etc.) will not be executed when loading fixtures. So fixtures do not work well with the otherwise pervasive Rails design rule of “put all the intelligence in the application”.

Still, despite the commonly-held disdain for using fixtures at all, I find that they can be tamed. In fact I’ve even created a base data facility for loading the fundamental data set that needs to be in the live database (e.g. initial admin user info). My approach is basically to alter fixture behavior to treat it as essentially a bulk data loading tool, and to do the extra housekeeping after loading to make up for the fact that the ActiveRecord model code was bypassed.

As far as I know, there is no bulk data dumping functionality in Rails.

So, to summarize, of the five main ORM features, here’s how ActiveRecord stacks up:

Describing Relationships: Easy to understand and use, with lots of slick functionality
Querying: Easy to understand and use, but limited to simple join structures, and not possible to customize query building or rewrite SQL before execution
Storage and Retrieval: Very easy to use, but only within the limits of the query builder’s features
Schema manipulation: Easy to understand and use; limited in functionality but readily extensible; solid third party plugins are available for missing schema objects
Bulk Loading and Dumping: Loading is badly designed and implemented, but fixable with some effort; dumping is not offered

Okay, so it definitely makes the easy stuff easy. But what about the rest?

As I observed before, ActiveRecord is not designed as a set of modules that you use to assemble a solution that fits your needs. That’s more of the Java approach to design, and it trades flexibility for convenience. It can be a major pain to assemble a working system out of all of those abstract Java APIs, which are sometimes so comically over-patternized as to draw mockery such as the hilarious “Are Javalanders Happy?” code snippet from Execution in the Kingdom of Nouns. Rails makes the opposite trade-off: sacrifice flexibility and gain a very approachable API.

Unfortunately, the Java approach (too abstract to readily use, but extremely flexible) is easily wrapped with a simpler, more convenient, less customizable API. The Rails approach isn’t internally componentized (have a look at ActiveRecord’s activerecord/base.rb source file in its 2,165-line glory, almost all of which is one class), so if you want to fiddle with its internal behavior, you can’t. So with Rails, it’s all or nothing: high level slickness for simple requirements, or hand-written SQL and hand-coded results mapping for your complex requirements.

As I said at the beginning, though, the key question is not how comprehensive the high level feature set is. More important is the question of how painful things are when you drop down to a lower level for a greater degree of control.

It would be nice if there were a middle level of complexity, between the high-level ‘find’ method and ‘has_xxx’ associations, and raw SQL. There isn’t. I think that the reason there isn’t one is that there is still a persistent belief among many Rails core team members and community members that databases should be stupid: just a persistent hash. Once upon a time I worked that way myself: I didn’t have access to or skill with a SQL RDBMS, and so I solved all of my persistence problems with DBM files, which (using Perl’s Tie::Hash class) are conceptually just persistent hashtables. miniSQL was little more than a SQL query parser on top of that sort of storage engine, and MySQL originally was pretty similar. But big databases have all sorts of useful features that address complicated persistence requirements in a fairly elegant way.

Given that Ruby fans like the idea of domain specific languages, which let you work in a super high level language customized to the problem domain, it’s surprising that Rails groupthink is that SQL is bad. It’s actually a very high level language, and allows a well written database to do some pretty amazing optimization on the fly because it provides a strong layer of abstraction between what you requested and how the storage engine provides it.

No, it’s not dynamic, nor is it pure relational perfection, but it’s pretty darn good. Pre- and post-event validations and arbitrary callbacks to user-specified code, functions providing behavior on top of data… these are all things that Ruby and Rails fans hold in high regard when provided by Ruby and Rails, but which are considered a bad idea at the database layer. As I discussed at length in Rails and the notion of Stupid Databases Being a Good Idea, this is a philosophy rooted in DRY, but it has some major flaws.

Mainly, there is the issue that some things must be done in the data tier, and trying to put them in the application tier doesn’t work. The best example that comes to mind is full text search. Satisfying queries is the database’s job, period. It’s just hideously slow to try and do an inner join in the application across a network link to a database. If you find yourself doing this, that’s a pretty good sign that your architecture is broken. But some queries are too complicated for ActiveRecord, so sometimes you must choose between a series of high level queries whose results are intersected in application code (easy to understand, but extremely inefficient), or hand coded SQL.

Well, SQL is fast and is a high level domain-specific language, so it isn’t actually a bad tool for the job. The problem is that this approach (the trapdoor to the lower level API) is regarded differently by different people. Some see it as a common and reasonable approach to complex requirements; others see it as a bad evil scary thing that should be avoided at all costs, a kludge and a design mistake.

As a result, the low level option in Rails is anemic. It’s there, but you’re not supposed to use it. Ruby’s ActiveRecord Makes Dropping to Raw SQL a Royal Pain (Probably on Purpose) notes that there are no bind variables allowed in ActiveRecord. You may be saying, “No, wait a minute, I’ve used them, that can’t be right.” That’s what I thought. Look at the source; the bind variable functionality is actually a high level feature built on top of drivers that don’t have that feature. Whatever you did at the high level, it’s going to the driver as a single string. Okay, it’s nice that they added that feature, especially since it provides a single point of testing and verification for safe escaping. But that functionality (in sanitize_sql) is not part of the public API. Fortunately that same article provides a workaround that makes sanitize_sql accessible, so you can use bind variables in your hand coded SQL code, and pretend that the driver supports them. But that’s not likely to work forever.

The key problem with ActiveRecord is its least common denominator feature set, based around the least featureful of all popular SQL databases: MySQL. Years ago, MySQL AB (the vendor of the MySQL database) took a strong philosophical stand against pretty much any advanced database features (which their product lacked, and which competing products had), but lately they’ve softened and added those features that they claimed nobody really needed. In the meantime, Rails has been designed with minimal expectations for database sophistication; therefore, the limited functionality of ActiveRecord is fairly complete, assuming you’re using a database with similarly limited functionality.

Triggers, stored procedures, functions, data integrity constraints, nested transactions, and views are all examples of unsupported database functionality. Try and use them via ActiveRecord’s high level API, and you will quickly see how fragile and inflexible ActiveRecord really is. If you shouldn’t need those features in your database, then you shouldn’t need anything that ActiveRecord doesn’t already provide, so it shouldn’t matter that you can’t extend ActiveRecord.

Truly, these are features that you need only in a few small cases in your application, so looking at individual queries they’re needed rarely (which is not the same thing as “never”). But looking at whether you need one or more of them in a given application, they’re needed more often than not. The pain of using hand coded SQL makes this worse: some tricky things could be done either using a view or stored procedure, or using a really slick dynamic SQL statement. Making all of those options painful means that even a clever developer can’t use anything in their bag of tricks to craft an elegant solution.

Unfortunately, non-trivial web applications need things like full text search, complex associations between persistent objects, non-trival summary information about associated objects, and complex reports, and ActiveRecord fails at all of these. These are not just things that big dumb ancient companies that like using Object COBOL think they need; Amazon and eBay need them too.

The acts_as_tsearch plugin is a good case study of ActiveRecord’s design flaws. TSearch2 is the standard PostgreSQL full text search engine, and it’s pretty good in my opinion. It’s also pretty straightforward to use. Unfortunately for developers using Rails, TSearch2 uses SQL functions (mainly to_tsquery and rank_cd). The acts_as_tsearch plugin tries to inject SQL into ActiveRecord’s queries via the high-level find interface, but ultimately fails as soon as you use the :joins or :include options. The problem is that ActiveRecord has a very simplistic idea of how queries and joins work, and so if you need to inject SQL functions to get the job done (as is necessary in TSearch2 queries), too bad. (See also issues 7 and 8 in acts_as_tsearch, in which I describe and attempt to clean up the mess that results when you use find_by_tsearch in non-trivial ways.)

A fellow Rails developer asked me in all seriousness why I wasn’t abandoning the full text search functionality of TSearch2 and just using a completely separate, redundant database product designed exclusively for full text search. Seriously, that is considered the “easy” approach: one database for full text search, and another for ACID/OLTP/CRUD. Honestly if I were going to go down that road I would try hard to just abandon the SQL RDMBS and put everything in the other database, since Lucene and its imitators are capable of far more than just find-text-in-document queries. The pain of duplicating everything, using two query languages, two document representations (in addition to the object representation in Ruby) and writing application-tier query correlation makes the double-DB approach seem very unwise.

It makes far more sense to me to use the SQL RDMBS’s full text search facility, even if there’s a 2x or 3x read performance penalty, because the conceptual simplicity of having one powerful storage tier (instead of two halves cobbled together) eliminates a ton of ugliness in the application, and the SQL RDBMS is going to get clustered for reads anyway. Nevertheless, even if I’m wrong about this case (putting search in the SQL RDBMS instead of in a separate server), there are other cases for needing a smart database that gives you exactly the results you need and lets you push data logic into the data tier.

So, what do I suggest? Abandon Rails? Nope. I still like Ruby a lot, and find Rails very useful. I just think that ActiveRecord needs to support the low-level and middle-level abstractions better.

Specifically, supporting bind variables (either by exposing that sanitize_sql function, or better yet by making drivers and connection adapters support bind variables for real) would make the find_by_sql, select_all, and exec approaches to low-level SQL query execution less painful.

More difficult, and substantially more valuable, would be refactoring ActiveRecord::Base to split it up in the way I described above: association descriptions and unmarshalling code separate from query building code separate from SQL execution and result retrieval code. All of this could remain hidden for most users under the same old slick high-level API, but for advanced requirements, the ability to fiddle with the SQL and still use the built in high-level unmarshalling code to create object graphs from flat result sets would be very powerful, and useful.

I looked at one alternative to ActiveRecord, called Sequel, which overlaps with ActiveRecord only partially. It is a query builder and lazy result proxy, which is actually what I thought ActiveRecord would do when I first started working with Rails. The proxy design means that you can either keep adding constraints or start fetching results, from the same Dataset class. This seems like a pretty good approach, though I haven’t really looked closely to make sure it would fit what ActiveRecord needs.

What Sequel lacks, though, is the unmarshalling side: turning a 2-dimensional (rows of columns) result set into a complex object graph (customers with orders with order lines with products from suppliers stored in warehouses), with user-controlled eager or lazy loading behavior. Ruby is well-suited to a design that would allow user-specified code (i.e., a block) to decompose each row into the object graph associated with that row, leaving the remaining associations on those objects to be lazily provided via future queries.

So, I think there is hope for ActiveRecord, definitely. I considered the idea of rolling a minimal Hibernate clone, or some other sort of challenger to ActiveRecord, but I don’t that ActiveRecord is broken beyond repair. I think the shortest path to a badass Ruby ORM is through improvements (refactoring and abstraction) to ActiveRecord.

So, if you’ve read this far, you probably care about these issues. Here’s my call to action: Please help me make ActiveRecord less like VB and more like Delphi. Who else is interested in helping me with this effort? Are there alternatives that I’ve missed, or components that could be integrated into ActiveRecord to make it better?

Immature developer attitudes revealed in flames regarding CDBaby

Jamie Flournoy — Sun, 23 Sep 2007 20:36:31 +0000

Derek Sivers of CDBaby kicks ass. He got a sophisticated and very very user-friendly, efficient, straightforward e-commerce system (including the back-end systems) written in PHP. Based on what I’ve read, he’s up there with Phil Greenspun in my opinion; that is, he’s among those who understand strategy and customer service and low-level technology and are able to build systems that don’t suck, resisting the temptation to be distracted by technological panaceas and fads. I may disagree with their individual technology decisions, but their higher-level thinking is excellent, so they’re definitely in the class of people who I’ll give the benefit of the doubt.

So when I read 7 reasons I switched back to PHP after 2 years on Rails I was a bit surprised, but not much. He’s experienced with PHP (he says he’s written 90,000 lines of code for CDBaby!), and has a huge installed base of code he wrote and understands intimately. He tried Rails, it didn’t work the way he wanted, and he went back to PHP. It was immediately obvious to him that this was what he should continue using.

The most shrill and arrogant among the Rails community have been rather unkind, partly due to this rather poorly written Slashdot headline that misrepresents what Derek says in his article.

I’m using Rails and I’m generally happy with it, though I have had to do some customization of the framework (mostly with pre-existing Rails plugins from like-minded developers) to suit my style. At this time I plan to continue using Rails for my project, and to keep using it in future projects.

But, I thought it would be instructive to summarize the arguments made by the folks slamming Derek in the comments to his article.

Rails is the correct solution for CDBaby, regardless of what the founder//designer/lead programmer of CDBaby thinks.
If you don’t agree with Rails’ design, you’re wrong, and you need to change your design if not your whole business to fit Rails.
PHP is bad and can never result in good code. Conversely, Ruby is good and is always better than PHP.
Objects are better than SQL and tables. Domain specific languages are great, as long as they’re written in Ruby; using SQL as a domain specific language for data manipulation and querying is not Ruby and therefore is a bad idea. Also, just because one person can code something in PHP in 2 months that runs on one server, instead of two people taking over two years to do it in Rails (which is notoriously hardware-hungry), doesn’t make PHP + SQL better. Because objects make you more productive.
Just because you wrote a successful online store yourself from scratch, have an O’Reilly column, and then hired a Rails core team member who now works for the same company as most of the rest of the Rails core team including DHH (37Signals), doesn’t mean that you actually had people smart enough to rewrite your app in Rails. You need to prove to the world that what you wanted to do was impossible in Rails before we’ll give you permission to make technical decisions for yourself.
If you don’t like Rails it must be because Rails is too good for you. Maybe instead of learning from years of real world experience running a business and writing the entire software suite for that business yourself, and hiring one of the best Rails developers in the world for your transition to a new architecture, you should have gone back and gotten a CS degree.

Since this is such an active flamewar with such sloppy readers (responding to the Slashdot headline’s misapprehension of what Derek wrote, instead of what he actually wrote), let me say this clearly:

The above listed points are not my opinions. They are summaries of opinions I find immature and silly.

(I’m assuming someone somewhere will read this post in anger and make themselves look foolish by arguing against these points as if I believed them anyway. I tried to warn ya…)

We can learn something from this. This same flamewar keeps appearing over and over and over, all over the internet, and before that, on BBSs and in print.

The simple fact of human mortality means that most of us are going to be learning and making mistakes that we imagine a wise, seasoned developer wouldn’t make. But that guy retired and is fishing, so it’s up to us to screw up and learn and hopefully do better next time.

For the last few thousand years, we’ve had the advantage of being able to read books, and get wisdom that way. But there are fads, and hyped books that seem to know it all. So you have to read a lot of books, and try a lot of contradictory ways of doing things, to get a mature enough perspective to make wise choices on future projects.

The point that is generally missed by everyone trying to do anything new to them, is that there’s a lot more out there that you don’t understand, and a lot of what you think is your brilliant new invention has been done before. The more you learn, the more humble you become as you learn that there are people WAY smarter that you are, and that there are people who have come before you who you will never catch up to.

In the case of web development, that means that there are people out there who disagree with you, and you might not actually know everything about everything like you think you do. Local experience and immersion in the problem they’re trying to solve makes them much better suited to solving their problem than some armchair quarterback. It’s tempting to fold your arms and feel smug about your quick dismissal of someone else who clearly isn’t as big a genius as you are, but the more certain you are of your own brilliance, the more likely it is that you’re just ignorant. (See also: Unskilled And Unaware Of It.)

In Derek’s case, he clearly jumped feet first into Rails and hired a rockstar developer, and then made the decision given a uniquely advantageous perspective that it just wasn’t working, after giving it a hell of a lot more time to pay off than I would have. (I’m not exactly sitting here scratching my head wondering “how could Derek have been so wrong?”)

What matters a lot more than choice of programming language is the ability to get the project done, meaning tested and correct and launched. Apparently for Derek, PHP is the way to get that done, and Rails ain’t.

Finally, consider the parallels between Derek in this case, and Phil Greenspun in his book (late 90’s). They both use SQL more than the “Objectistas” would like, which is to say, they use it directly and like it. They both use languages that are not the most fancy high level languages available, but they do completely understand how their application works, from UI all the way down to hardware performance, and they make integrated decisions that take business strategy and technological factors into consideration.

This is really critical to understand about why both of these guys are so smart: both of these people put aside dogma and made decisions that were all over the map, sometimes pragmatic and hackish, sometimes very rigorous and disciplined, depending on their assessment of the particular micro-issue they were addressing at the moment. They weren’t subscribing to any particular software development philosophy (Big Ball of Mud, XP, RUP, UML, Agile whatever, waterfall, etc.), nor did they just choose the most hyped architecture of the day and blindly stick with it, but freely mixed and matched whatever they thought was appropriate given their perspective on the situation.

In other words, they didn’t fall into the trap of thinking that there is One True Way to do everything. They improvised. They integrated their own experience and perspective with the wisdom of others, and made the decision that worked for them.

I hate to sound like a moral relativist, because I’m not. But as a practitioner, I’m definitely becoming more and more of a process relativist every day. One of the best ways to make a project fail is to try and force-fit an architecture and/or a methodology onto that project without customizing them to the particular details of the project. The best methodology is called “Roll Your Own.”

And so I leave you with this question, in the hope that it will help you with your current and future projects: Have you made any dogmatic decisions about your process or technology that are hurting your project?

Maybe it’s smarter at this point to keep doing what you’re already doing, but maybe there are some things that you could change. Try to keep your Mind Like Water, strong enough to cut through mountains but flexible enough to flow around obstacles, and making the decisions of which one to do based on the information that you and only you have.

Making the Rails acts_as_tsearch plugin work with fixtures

Jamie Flournoy — Sun, 23 Sep 2007 05:47:50 +0000

acts-as-tsearch is pretty cool, except for the fact that it uses Ruby (app layer) instead of PL/pgSQL (DB layer) to update the tsvectors that are indexed for full text search. That means that fixture data gets inserted without being full text indexed. D’oh!

Here’s some code that changes that.

(I put this in my environment.rb because I’m not quite at the point of shoving all this into a plugin like I probably should.)

# make it so that fixture loading (test data and base data) includes the tsearch2 update_vector
class Fixtures
    class << self # we want to mess with self.instantiate_fixtures
        def create_fixtures_with_update_vector(fixtures_directory, table_names, class_names = {})
            create_fixtures_without_update_vector(fixtures_directory, table_names, class_names)
            # create a Class instance for each table name fixtures were loaded for, then call update_vector on it
            table_names.each do |tn|
                klass_name = (ActiveRecord::Base.pluralize_table_names ? tn.singularize.camelize : tn.camelize)
                begin
                    klass = Object.const_get(klass_name) # will fail if tn is a habtm table (no corresponding model class)
                    klass.update_vector if klass.respond_to?(:update_vector)
                rescue
                    nil # if it is a habtm table, there's no need to update a tsearch2 vector for sure
                end
            end
        end
        alias_method_chain :create_fixtures, :update_vector
    end
end

(Sorry for the formatting but I like wide lines in my source code and my WP theme doesn’t. Just copy and paste and it should be fine.)

Bad, Bad Code

Jamie Flournoy — Sat, 04 Aug 2007 22:21:02 +0000

I’ve written before about tips for offshoring. One specific thing I said to watch for is the bait-and-switch of talent: during the sales process you’re shown rockstars, but the real code you get is written by clueless newbies. When you set up a project such that you’ve minimized the cost per hour of development, but you don’t have anyone checking the work product (i.e. code reviews) coming from the subcontractor, very bad things happen.

Here’s a doozy: In 2007, people are still writing JSP like this…

Check out the 4th message in the thread, with the big code sample.

Table based HTML layout, and no CSS at all? Check. Heck, the table width % values don’t even add up to 100%.
SQL in the JSP? Check.
Making a new JDBC connection for each page view, instead of using a connection pool? Check.
Unescaped strings in the SQL? Check. (Not strings coming from the browser in this particular JSP page, but you don’t know where those strings originate. Why wouldn’t you escape it just in case?)
Failing to use a prepared statement? Check. (That would also solve the escaping problem.)
Using a string literal in a SQL or command line context, so you can’t log it beforehand? Check. Even better, the code makes a query string first, prints it (commented out), and then uses a different string in the actual query. Nice!
Using the JdbcOdbc driver? Check. (From Sun’s JDBC Basics: The JDBC-ODBC Bridge driver provided with JDBC is recommended only for development and testing, or when no other alternative is available.) I’m guessing that the use of ODBC here is the only reason why the database username and password aren’t embedded in the code sample and posted for all to see.
Empty exception catch block? Check.

I’d have a hard time coming up with a fake example of bad code that was worse.

But wait, what else has this person asked about?

No way. Yes! Error in Socket and File Writing!
The post includes the router’s username and password, and its configuration including:

Its IP address, and all of the routes it contains, and all of its interfaces and where they go
A couple of other passwords stored in the router
A crypto key that appears to be to a VPN (looks like a pre-shared key, meaning not a public key but one that must be kept secret)!

I’m not gonna say “you get what you pay for” since open source software has served me very well, but I will say that you get what you bargain for. If your bargain includes not looking at the work product of the people you hire, which is to say, hiring the cheapest people available and not supervising them, you’re not going to be happy with what you get.

Of course, this could have been written by a U.S. citizen who works in a cube on-site and makes $200/hour. Point is, hire carefully, and supervise your workers. It seems simple when put that way, but it’s amazing how often companies are willing to hire software subcontractors carelessly (solely on price?) and then pay little or no attention to the resulting work, when the arrangement involves offshore outsourcing.

By the way, the IP addresses in the original post (with the awful code sample) are listed next to the name “Areva”, implying that this code is part of a project for Areva. Who is Areva? They make nuclear power plants. Sweet dreams!

Rails, Fixtures, the Test DB, and Test::Unit

Jamie Flournoy — Fri, 03 Aug 2007 02:37:22 +0000

From what I’ve seen, Rails’ weakest features lie in the way it prepares the test database and test data, and Ruby’s Test::Unit isn’t much better than the awful but ubuiquitous JUnit that Java developers are accustomed to. I set out this week to impose my preferences on Rails in this area, and that took some effort. Here’s what I did.

When I’ve implemented (in Java) what Rails does for database preparation, I did it like this:

Create the test database exactly the same way that the developers’ databases are created: by running the exact same code, pointed at a different database.
Load the appropriate sets of data for the test database. “Sets” is plural on purpose; most non-trivial databases include code tables, which constitute base data which are essentially part of the database design itself. Then, test code will want a fixed set of known test data to act upon, so that tests can measure whether the code did the right thing given the test data (the right inputs yield the right outputs).
Run the individual tests, providing some way of assuring that changes to the test data are undone before the next test.

At first (11 years ago) I used a hand-maintained SQL DDL file to create the databases. Later I split that up into one file per table, and made a list of the proper ordering of tables during creation (reversible for deletion). Later still, with Hibernate, I ditched the DDL and let a higher-level ORM description of the table do the schema generation (which was painful in Hibernate since it wasn’t made to do that except from the command line, but it was possible to hack it into a state of relative beauty). The test data was always loaded from a bunch of text files that were easy to hand-edit (as opposed to a bunch of SQL INSERT statements).

Running the test with assurance of pristine test data was more or less horrific in a J2EE+Hibernate 2.x environment. The design of Hibernate and JUnit made it difficult to wrap tests in transactions, and the version of MySQL that we were using had no transactional storage engines available at all (MyISAM? Thanks, Red Hat!), so I ended up falling back on an intrusive but relatively high-performance design that required tests to declare if they were going to alter the test data, so that the test teardown method knew it had to reload the test data. Since we were waiting for Hibernate 3.0, MySQL 5.x, and a few other things to become part of our architecture, I left that solution in place and ended up moving on to a new job before fixing it.

Rails initially seemed to nail this problem: the test database is automatically made based on the development database; the data is loaded from YAML files called Fixtures, which feature a very simple and straightforward API, and tests run inside individual transactions. Nice!

Except not. Fixtures are loaded by specifying the tables for which you need test data loaded, and this is done in each Test::Unit::TestCase class, of which I have several hundred. They are stupidly reloaded each time you say a given TestCase is going to use them. Worse, the tables you’re using for this TestCase are emptied out using SQL DELETE statements, but if there is test data in other tables that has foreign key dependencies on the data being deleted, fixture loading will fail. (Rails was not designed for FKs to be enabled in the database, so encountering this this bug is a side effect of enabling them via the plugin.) This deletion behavior is pointless in light of transactions wrapping each test, but if you’re using MySQL MyISAM you can’t use transactions, so it needs to be there for people using MyISAM, which is to say, crazy people who care not for their data.

Since Test::Unit, like Java’s JUnit, lacks a hook for the beginning or end of a given TestCase class’s set of tests, there’s no way to accumulate a list of fixtures created and then delete them and/or reload them at the end. That would at least allow you to undo the creation of the fixtures so that the tables were all empty before the next set of fixtures were loaded. Sadly, Test::Unit is not that clever.

I initially fixed this problem a couple of months ago, using a hack that simply refuses to delete and re-create (test data) fixtures if they’re already loaded. That works since the fixture data progressively accumulates and is always clean since changes within tests are rolled back at the end of those tests.

Upon adding a trigger to a Rails migration and then writing a test case that checked to see if it was working, I found the true ugliness. Rails has Migrations, which in my opinion are an excellent feature that works well, and is a more useful generalization of my ordered-list-o-tables and set of table-definition text files. But… when creating the test database, Rails uses the SchemaDumper‘s schema.rb output to create it, instead of using migrations. Talk about principle of least astonishment… I was pretty astonished. We have migrations, which is how we create databases! Great! So let’s use this other thing instead.

Also, SchemaDumper does not in fact dump the schema; it dumps tables and indices only. The RedHillOnRails foreign keys core plugin adds foreign key dumping to this output, but forget about check constraints, triggers, and stored procedures. Those schema objects are ignored, so your test database is not the same as your development (or production) database. Whoops.

I thought of about a dozen ways to deal with this:

Abandon triggers and do it all in Rails, make a TODO to fix this later, and get on with feature implementation
Add code to the tests to check for the missing schema objects and add them if missing (eww)
Replace the db:test:prepare Rake task with one that tells PostgreSQL to copy the database as-is
Replace the db:test:prepare Rake task with one that tells PostgreSQL to use pg_dump instead of ActiveRecord::SchemaDumper
Hack the PostgreSQL-specific code that SchemaDumper uses to look at the pg_proc and pg_trigger system catalogs and use code similar to the RedHillOnRails Core plugin to dump stored procs and triggers into schema.rb also
Just dump using pg_dump into a temp file and parse the output and add that to schema.rb (ewwwwwww)

etc. etc.

I finally found the Migrate Test DB Rake Plugin which simply uses your Rails Migrations to create the test database. Lovely. Except I now had some new problems.

rake db:schema:purge for PostgreSQL does dropdb/createdb on the test database to empty it out. That creates a database with no built in procedural langauges, so stored procs won’t work. Adding the language to that database is a DB superuser task, so it couldn’t be done inside of Rake. Fortunately I found that I could solve this via “createlang plpgsql template1” which puts plpgsql in the template database used for creating new databases. Easy.
My never-delete-fixtures code got into a fight with my base-data-loader code. They both used Fixtures to load data, and so the base data fixtures made the never-delete-fixtures code think that the test data was already in. So the tests failed due to lacking test data.

I fixed this initially by modifying my BaseDataLoader class to not load base data if RAILS_ENV is ‘test’, and added code to the Migrate Test DB Plugin to set RAILS_ENV to ‘test’ right before running the migrations on the test database. This is a workaround, really, because it still leaves the base data either missing entirely, or duplicated.

Then I switched to the Preload Fixtures plugin which is nice but still leads to FK related errors. It grab the fixture names from your test/fixtures directory and loads all the files it finds, in the order it found them. That fails since alphabetical order and the required table creation order are different in my case.

Fortunately since I’m using the Migrate Test DB Plugin I can just observe the order in which tables were created and tell the Preload Fixtures plugin to do its work in the same order. This is in my environment.rb because that’s where all my project-wide monkeypatching currently lives. (Cleaning that up and maybe plugin-izing it is a TODO for the future.)

# Due to FKs, gotta specify ordering of fixture preloading here. Why not let migration create_table statements do it?
# (depends on Migrate Test DB Plugin being present; is here for the benefit of the preload_fixtures plugin)
module ActiveRecord::ConnectionAdapters::SchemaStatements
    alias create_table_orig create_table
    def create_table(table_name, options = {}, &block)
        fixture_filename = "#{table_name}.yml"
        if File.file?(File.join([RAILS_ROOT, 'test', 'fixtures' ,fixture_filename]))
            ENV['FIXTURES'] = [ENV['FIXTURES'], fixture_filename].compact.join(',')
            # puts ENV['FIXTURES']
        end
        create_table_orig(table_name, options, &block)
    end
end

Sadly if you run “rake test” it runs ruby as a subprocess in order to do “rake test:units”, “rake test:functionals”, and “rake test:integration”. That means that the migrations are run once (before the tests), but that the preloading is done three times. The second and third times through, though, the preloading fails since it’s trying to delete-then-create each table’s fixtures in table-creation order. So, a patch to preload_fixtures.rb is needed, to ensure that deletes are done first, in the reverse order of table creation. Here’s what the new preload! method looks like:

def self.preload!
    puts "PRELOADING FIXTURES..."

    require 'active_record/fixtures'
    ActiveRecord::Base.establish_connection(:test)
    fixture_filenames = (ENV['FIXTURES'] ? ENV['FIXTURES'].split(/,/) : Dir.glob(File.join(RAILS_ROOT, 'test', 'fixtures', '*.{yml,csv}')))
    
    # delete first, in reverse order
    fixture_filenames.reverse.each do |fixture_file|
        table_name = File.basename(fixture_file, '.*') # hack; might not be correct if class name != camelized table name
        ActiveRecord::Base.connection.delete "DELETE FROM #{table_name}", 'Fixture Delete'
    end
    
    fixture_filenames.each do |fixture_file|
      Fixtures.create_fixtures(File.join(RAILS_ROOT, 'test', 'fixtures'), File.basename(fixture_file, '.*'))
    end      
    puts "DONE. Loaded #{Fixtures.all_loaded_fixtures.keys.length} fixtures."
  end

I’m not sure, but I think there’s an assumption in there that the table name is the same as the fixture name. My patch also makes that assumption, which is true in the case of my project. But in your project you might not have done that, so further hackery might be needed.

So, it all seems to work correctly now, and I’m back to working on my trigger code. If this seems like it took a lot of effort, it did, but I think it’ll be worth it once I start using stored procs and triggers more. That phase begins now.

Rails and the notion of Stupid Databases Being a Good Idea

Jamie Flournoy — Fri, 03 Aug 2007 02:16:40 +0000

For the last few days I’ve been struggling to bend Rails to my will regarding the proper way to assure data consistency. Today I made some progress. This builds upon some research I did a few months ago, and hopefully this is a more or less complete solution to the problem of making Rails work the way I want it to regarding test databases.

DHH has clearly stated that he does not like a smart database. This is common among application developers, particularly in the agile methods camp, in that they generally appear not to understand relational set theory, or if they do, they believe that it is inherently inferior to object oriented methods (which lack a theoretical basis, as Fabian Pascal will happily shout at anyone who will listen). I gather from DHH’s statements that he merely is trying to practice Don’t Repeat Yourself (a.k.a. DRY, one of the most important values of Rails). I gather from Rails itself that he either respects the need of some folks to disagree with him enough to provide hooks to bypass ActiveRecord, or that he at least agreed with someone else’s patch. By this I mean that there are ways around the ORM features of ActiveRecord, to do raw SQL and to execute raw DDL at database creation time, which implies that he isn’t trying to force his opinions on others, but rather to make it easier to do things his way than to do them a different way.

Fair enough. Rails is opinionated software, as DHH often says, and I have found several cases where letting go of my particular way of doing things has been fine, given that Rails has a different but equally valid way of doing things that is made super easy by the framework.

However, I disagree with his decision to keep the DB stupid, for two reasons.

First, I prefer to put logic where it belongs, rather than gathering it all in one place. AJAX, and in particular Google Maps, is a good example of presentation logic going where it belongs, making the whole application work better. SQL RDBMSs have features that can be abused, and in some cases these features are there because a wrong-thinking but wealthy client demanded them, but most of the advanced features that a “Real Database” has are there so that you can protect yourself against data loss or data corruption. The database is in a unique position to let you declare rules for things that must always be true, and then to trust that the database will never violate those rules. Older versions of MySQL were notably lacking in these features and their absence was justified by MySQL staff who basically said “you don’t need that, and if you want it, you’re confused.” Rails has inherited some of these damaged assumptions from MySQL, leaving basic relational features like foreign keys out of the framework(!). Fortunately Rails allows plugins, and there is a set of foreign key plugins that overturn this decision. But in general, if the database belongs to your application, that’s not an excuse to move database functionality into application code. By calling it your application’s database (as opposed to an Integration Database) you imply that it is part of your application, and therefore any rules or procedural code in it is necessarily also part of your application. You can’t monopolize the database and say that no one else has any business using it, while at the same time holding it at arm’s length and saying it’s not a valid part of the application. It is. No, business rules probably don’t belong in the database, but basic data consistency maintenance (in rule or procedural form) does.

I’m being charitable here, but my experience with individual practitioners of the Stupid Database Method invariably ends with me finding out that they don’t really understand databases at all (hence the desire to abstract the database away entirely with a driver plugin architecture topped by an ORM layer, lest they have to understand how a specific database product works), and would rather remain ignorant and reinvent the same functionality in the application layer or in the ORM layer. (It’s a case of “when all you have is a hammer, everything looks like a nail”, where the hammer is a general-purpose programming language, and you’re looking at a problem of high performance concurrent transactional programming.)

Not surprisingly, the database of choice for these folks is the least featureful, lowest cost, easiest to install one available. Because naturally it’s much more agile to write and debug new multithreaded transactional code in a high level dynamic language. than it is to get the same functionality for free in a thoroughly tested product that’s written in C. Right? Perhaps DHH is not one of these people. I assume he is not, again based on what he has said and coded. But nevertheless, the folks I’ve talked to personally who agree with his point of view are all coming from a point of view of willful ignorance.

Secondly, I prefer to employ defense in depth against data errors. Transient errors can have workarounds, but data errors are permanent, and that means that if your data is valuable, the damage done can be irreversible. Just because it’s possible for correct application code to avoid race conditions, improper escaping, etc. doesn’t mean that you should put all your eggs in that basket. When the price of data corruption is high (i.e. if you value the data in your database) then it’s worth the duplication of effort: test the application code, but also put a constraint in the database that will catch things the application code missed.

This is the same sort of thinking that leads to using automated unit tests, then functional tests, then integration tests, and then some manual QA, all overlapping. Duplication of effort? Yes. Worth it? Yes. Database bugs are arguably the worst kind of bugs to find in production, so they merit extra code that maybe isn’t absolutely necessary for the application to work, but is nice to have since you’d like to sleep at night.

So, I feel justified in wanting to put CHECK constraints and triggers in my database.

The implementation details are discussed in part 2.

Full Text Search refuses to be a black box

Jamie Flournoy — Mon, 16 Jul 2007 02:48:56 +0000

Once upon a time, before Google pwn3d internet search, there were several competing definitions for full text search. Altavista more or less gave you results matching the exact strings you gave it, but in a crazy order that made it painful to use. Excite (my favorite back then) used a dictionary to achieve stemming and synonym matches: searching for ‘dogs’ would also match documents that contained ‘canines’ or ‘dog’. Then Google blew them all away, and established a dominant set of expectations for how text search behaves.

I forgot about this, which is why I’m frustrated by the almost ridiculous complexity of the major server-side text search engines available right now. But it makes sense, once you learn what the options are.

The first thing you have to decide is if you want to stick with an in-database search feature, meaning the one that comes with your SQL RDBMS of choice. If so, you get ACID and hopefully MVCC features, which is probably why you’re using a SQL RDBMS in the first place: you’d like to get back the data that you put into it, even with multiuser access.

This option is just becoming decent in the last few years; typically you pay a penalty for the ACID and MVCC assurance, as well as for the smaller audience that uses that particular database. The factory installed radio is never as good as the one you order from Crutchfield, and likewise, the bundled thingamajig is never as good as the best available special-purpose thingamajig. But you get a few things in return: examples specific to your chosen database (duh), a simplified storage architecture, and probably some lower resource requirements.

By “simplified storage architecture” I mean that you don’t have to add your data to the database, and then also to the search engine, nor do you have to query the search engine first, and then possibly use the results to query the database if the search engine doesn’t contain everything you need. By “probably lower resource requirements” I mean that you’re not running two redundant but slightly differently optimized database servers: one SQL RDBMS and one specialized full text search server. Two redundant servers means double the storage, double the ram, plus or minus some variation in overhead (row size in bytes vs. page size, etc.). The full text index built into the database product will need a substantial amount of extra space for the indexes in memory and on disk, but almost certainly this will be less than a separate search server would need for the index and data, plus its own code, in memory and on disk.

An argument could be made for dumping the SQL RDBMS in favor of just using the dedicated full text search engine, but given that SQL RDMBSs tend to be the only game in town that offers unlimited ad-hoc querying, ACID, MVCC, and easy integration with all sorts of application languages and frameworks, I suspect that the folks making that argument either have really unusual requirements in mind, or are bozos who fall in love with architectures that work well for a few of their requirements and suck utterly at the remaining requirements.

There are quite a lot of those people out there. The same sort of suggestion has been made for object databases and XML databases in the past, and it turns out that the ones that blow away SQL RDBMSs in performance also lack one or more of the qualities that make SQL RDBMSs so useful. Putting those features back in tends to bring performance back down to a level close to that of the SQL RDBMS, and the argument for total replacement becomes weak. But the arguments, and arguers, remain.

But, perhaps your requirements for search performance are extreme: you need super duper fast search, which typically means you need a reasonable turnaround time on searches of a huge number of documents, with lots of users searching. The leading dedicated search engines lately seem to be Lucene (or one of its derivatives, such as Ferret) or Xapian. In either case, you get super high performance, at the cost of some additional resources due to overlap with the SQL RDBMS, application complexity (two reads, two writes), and sysadmin complexity (a whole ‘nuther product to manage in order to keep the overall application up and running).

So, I’ll just accept that there are reasons to do either one, and maybe some corner cases in which just using a dedicated search engine makes sense instead of a SQL RDBMS. But let’s move on to the not-a-black-box aspect of full text search.

Here are the key issues:

Full text search indices get huge quickly.
Some words are almost meaningless to searchers but are extremely commonly used.
Some words mean the same thing, or are variations of the same word.
Various kinds of coded character data (scientific notation, URLs, mailing addresses, etc.) are commonly embedded in searchable text.

As a result of all of these, simply accelerating "select * from cute_pet_stories WHERE UPPER(story) LIKE '%DOGS%'" isn’t a viable approach. Instead, the search indexer requires additional information, such as the character encoding and language of the text being indexed, and uses that to simplify the text being indexed into root words (dog instead of dogs) that don’t include low-value words such as “a”, “the”, “of”, etc. (these are called stop words). Also, it may differentiate encoded data from regular language text, and handle it specially.

PostgreSQL includes a search engine called Tsearch2, which is apparently quite fast if you’re willing to sacrifice size (big indexes) and write performance.

The implementation of a Tsearch2-indexed table is interesting: first you add a column to your table that’s just there to be indexed, and you fill it using text-mangling functions that do stemming (dogs->dog), stop-word removal, and word counting. That leaves you with a column of type tsvector. Then you create an index on that column, and do your text queries against that index. You have to clean up your search text first, though, and similarly mangle it into an appropriately stemmed, stop-word-free tsquery object which itself can contain boolean expressions that will be used in the search process.

(There’s an acts_as_tsearch Rails plugin that attempts to simplify this into an idiom that makes more sense from a declarative standpoint. It looks pretty immature but I’m gonna give it a whirl anyway.)

Lucene does something similar, using Java classes it calls Analyzers to encapsulate the same kind of text-mangling behavior that Tsearch2 performs using its to_tsvector SQL function. Xapian also has this same feature, apparently from the same original source. So the model of first preparing your text for indexability, then indexing it, then searching with similarly prepared query text, appears to be common if not universal.

Hopefully now you understand, as I now do, that full text search is inherently complicated, but not necessarily slow. All you need to do is understand the generic way that full text indexing and searching works, and then make a decision about integrated vs. standalone based on your setup.

I’m going to try Tsearch2 + acts_as_tsearch. I’ll let you know how it goes.