databases – Pervasive Code

Rails Migration Antipatterns and How To Fix Them

Jamie Flournoy — Thu, 18 Mar 2010 11:29:40 +0000

Migrations are one of the best features of Rails. Although some folks prefer pure SQL to the Rails migration DSL, I don’t know of anyone who dislikes the idea of a versioned schema that can evolve in a controlled and repeatable fashion.

But because the concept of database migrations is such a powerful one, it’s tempting to jam any old change that affects the database into a new migration and run rake db:migrate to make it happen. I’ve been guilty of a bit of this in the past, and I’ve joined some projects that did other ugly things in migrations. In the process I’ve learned the hard way that there are some things you must never do in a migration or they will come back to haunt you later. Here they are.

Antipattern: Require the Database to Exist Already

In other words, the antipattern is for the first migration to depend on some tables and maybe even some data already being in the database.

I know that the original Rails blog video shows DHH using a MySQL admin tool to create the blog database interactively, but really you should be using migrations to create the schema programmatically from scratch.

If you’re already working on a project that didn’t do that, you can run rake db:schema:dump and look at db/schema.rb; it contains code that you can insert into a new migration to create the same schema in your development environment. If you’re using DB features that the design philosophy of ActiveRecord doesn’t agree with, such as triggers, and the schema.rb dump doesn’t include them (or if you just think the migration DSL is ugly and you like SQL DDL better), you can do a mysqldump / pg_dump / whateverdump and wrap a migration around the loading of that SQL file.

If you have a hybrid (you have to start with an old db dump and then migrate it so it becomes current), that’s gross, and you have a couple of options which are both pretty ugly. But they will work, and when you’re done the ugliness will be gone.

You could fight your way back to the oldest schema version by debugging the self.down methods and running rake db:rollback repeatedly until you can create a 00001_starting_db_schema.rb migration, or you could just blow away all the migrations and use the highest schema version for a new migration that contains the output of a current rake db:schema:dump. It depends on how many copies of the database are out there with old schemas that would need to be brought up to date. Clearing out db/migrate and replacing it all with a single migration is cleaner, but if your production database is 5 migrations out of date you obviously can’t do that. But you could collapse it down to the one big-bang migration (as the oldest), plus the 5 pending schema changes. If you do it right, you can just deploy the new code and run rake db:migrate and everything will be fine. If not, well, you were testing it on a backup of the production database, right? :)

Antipattern: Only Work Correctly With the Production Data

What’s wrong with developers just making dumps of the production database and loading them locally?

First of all, it means that all schema changes have to start at the production database and work backwards to developers’ sandboxed development environments. Hopefully this strikes you as a very stupid workflow.

Secondly, maybe your users don’t all want to get a message that says “test message foo bar sdfasdfasd bloopity bloop” when you’re testing your new alert system. Should you really be putting their data (passwords, contact info, etc.) at the mercy of your crummy new code?

You should be able to immediately generate an empty, clean database for development. rake db:drop; rake db:create; rake db:migrate should do this; rake db:reset should have the same result but should be faster since it doesn’t bother with each migration in sequence.

You should also be able to immediately generate any essential base data such as the initial admin user. The SeedFu plugin does a good job here.

If you need some additional fake data to fiddle around with in your development environment, the Populator gem is handy for mass-inserting a bunch of faux data, especially in conjunction with Faker.

Note that the migrations should neither depend on nor contain actual data. They should just change the data model.

Antipattern: Clean Up That Only Works on Production Data

This is really a subset of the previous item but it’s worth considering as a special case.

If you want to fix some data that got slightly corrupted by some bad code that has been replaced, migrations aren’t a terrible way to accomplish that.

It’s not really what migrations are for, and a one-off rake task can do it just as well, but if you really want to, you can get away with it under one condition: you have to make your cleanup migration code succeed even if the database is empty (such as when a developer has just run rake db:reset; rake db:migrate).

Antipattern: Load Data

As noted above, the SeedFu plugin is good for initial, mandatory data, and the populator gem is good for bulk fake data. The machinist gem is good for synthetic test data. Delete test/fixtures and everything in it. Fixtures are evil.

Wrap a rake task around the “get my development database ready” concept. This task should start with the “get my empty production database ready” task (or some subset of that which is appropriate for developer use).

If you need to load arbitrary data now and then, write an importer. Do this as a rake task, or a web UI to a bulk data importer feature. Better yet, make a web UI in your admin area which is just a wrapper around the rake task that bulk imports data. Then delegate the bulk importing to your customers so your admins can do real admin work. But don’t load data in a migration.

Antipattern: Use Rails Models in the Migration

Models evolve, but old migrations don’t change (nor should they). So when you wrote a migration that used a model, it used the old version of the model code. Then a year later the model has evolved, and the new validations on first_name and last_name fail because it used to be full_name, and that old migration that hasn’t changed has stopped working. It depended on something that did change, incompatibly.

For rockstar points, in your continuous integration environment you should run rake db:drop; rake db:create; rake db:migrate to make sure that this can never happen.
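A minimal sketch of that CI check in plain Ruby. The names REBUILD_TASKS and first_failing_task are hypothetical, and the default runner assumes rake is on the PATH and that the script runs from the application root. The runner is injectable so the logic can be exercised without touching a database.

```ruby
# The tasks from the paragraph above, in the order they must succeed.
REBUILD_TASKS = %w[db:drop db:create db:migrate].freeze

# Runs each task in order and returns the name of the first one that fails,
# or nil if the whole rebuild succeeded. Later tasks are not attempted after
# a failure. `runner` must return a truthy value on success.
def first_failing_task(runner = ->(task) { system("rake", task) })
  REBUILD_TASKS.find { |task| !runner.call(task) }
end
```

In a CI script, something like `failed = first_failing_task` followed by `abort "rake #{failed} failed" if failed` would turn a broken migration into a failed build.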

But if it has already happened, rip out the model code and replace it with Rails DSL code, with execute statements containing raw SQL code, or (if you feel like a Ruby rockstar) declare new, stripped down model classes inside your migration class that will act as stand-ins for the limited needs of the migration. See Migrating with Models for more on how to do this last trick.
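Here is a sketch of the stand-in model trick, using the full_name to first_name/last_name example from above. It assumes the class-method migration style (self.up / self.down) of the Rails versions this post was written against; the migration name, the users table, and the column types are hypothetical.

```ruby
# db/migrate/<version>_split_full_name.rb (hypothetical file name)
class SplitFullName < ActiveRecord::Migration
  # Stand-in model: it maps to the users table and nothing else, so later
  # changes to app/models/user.rb (validations, callbacks) cannot break it.
  class User < ActiveRecord::Base
    self.table_name = "users"
  end

  def self.up
    add_column :users, :first_name, :string
    add_column :users, :last_name,  :string
    User.reset_column_information # pick up the columns just added

    # Loops zero times on an empty database, so a fresh rebuild still works.
    User.find_each do |user|
      first, last = user.full_name.to_s.split(" ", 2)
      user.update_attributes!(:first_name => first, :last_name => last)
    end

    remove_column :users, :full_name
  end

  def self.down
    add_column :users, :full_name, :string
    User.reset_column_information

    User.find_each do |user|
      full_name = [user.first_name, user.last_name].compact.join(" ")
      user.update_attributes!(:full_name => full_name)
    end

    remove_column :users, :first_name
    remove_column :users, :last_name
  end
end
```

Inside the migration class, the constant User resolves to the nested stand-in rather than the application model, which is the whole point: the migration depends only on code that ships in the same file.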

Conclusion

You should always be able to do this in every Rails environment that your application has: rake db:drop; rake db:create; rake db:migrate; rake db:reset

At this point you should then be able to run rake db:test:prepare and then rake spec or rake test or whatever and have it work.

If any part of that process fails, you are missing out on the benefits of using Rails migrations.

Recommended mount options for ext3

Jamie Flournoy — Fri, 16 May 2008 03:54:30 +0000

The details of the various mount options for the ext3 filesystem are fairly well documented, but as with many things in the Unix world, knowledge is far easier to come by than wisdom. That’s a pithy way of saying that I had to do some digging to find recommendations, as opposed to explanations. So here are my recommendations for ext3 users (which encompasses the majority of the Linux-using world, as far as I can tell).

noatime

First of all, do yourself a favor and disable atime updates, using the noatime mount option. This yields a huge performance boost.

This is done by adding noatime to the appropriate lines in /etc/fstab (do it once for each ext3 filesystem that’s listed), in the fourth column, which probably says defaults now.
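As a rough illustration of that edit, here is a small Ruby helper that applies it to fstab text. The name add_noatime is hypothetical, and the sketch only transforms a string; it does not read or write /etc/fstab itself.

```ruby
# Captures: (device + mount point)(filesystem type)(whitespace)(options)
FSTAB_FIELDS = /\A(\s*\S+\s+\S+\s+)(\S+)(\s+)(\S+)/

# Returns a copy of the fstab text with noatime appended to the options
# (fourth) column of every ext3 line. Comments, blank lines, other
# filesystem types, and lines that already have noatime are left untouched.
def add_noatime(fstab_text)
  fstab_text.each_line.map do |line|
    next line if line.lstrip.start_with?("#")

    match = FSTAB_FIELDS.match(line)
    next line if match.nil? || match[2] != "ext3"

    options = match[4].split(",")
    next line if options.include?("noatime")

    new_options = (options + ["noatime"]).join(",")
    "#{match[1]}#{match[2]}#{match[3]}#{new_options}#{match.post_match}"
  end.join
end
```

A line such as `/dev/xvda1 / ext3 defaults 1 1` comes back as `/dev/xvda1 / ext3 defaults,noatime 1 1`, with the original spacing preserved.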

To make this change to a live, running filesystem, remount it (adjust this so that the right disk device is specified at the end of the line):

sudo mount -o noatime,nodiratime,remount,rw /dev/xvda1

(My understanding is that noatime implies nodiratime, but I added it explicitly in case that was not true.)

atime is a relative of the well known file modification and change timestamps (mtime and ctime), but it tracks access to file data. That means that if you read one byte from a file, even if it’s cached in RAM, you’re actually also triggering a write to the inode for that file, so that its atime can be updated. (If you want to slap your forehead now in disbelief, be my guest.) And if you read a ton of little files (which happens rather often in the unix world), that means a ton of writes to update all of their inodes. You don’t want that, right?

But do you need it? Almost certainly not. It’s required by the POSIX standard, and the need for it to be present and turned on is well debated by people more knowledgeable about this in this thread from the Linux kernel mailing list. The summary of their argument is that it’s the kernel’s job to remain standards compliant, and only the distributor or user has enough information to know that they don’t care about that part of the standard and can safely disable it. I can understand that point of view.

Well, I did the reading, and you can safely disable it, unless you’re using mutt. If you’re using mutt, or if you’re just nervous about disabling something that somebody somewhere says you might maybe need someday, then disable atime for every filesystem that doesn’t have your mail spool on it, and use the relatime mode on that drive. (relatime is a clever hack that simulates atime behavior while skipping the disk write in certain cases.)

Journaling mode

Ext3 is a journaling filesystem, which is generally a good thing. There are three modes of operation for ext3’s journaling functionality, but which to use?

“It depends” is not very satisfying, so an easy rule of thumb would be to use data=journal if you really, really want to ensure the durability of your data, and data=ordered if you can tolerate a teeny tiny chance of data corruption.

I measured all three journaling modes by running time sudo rsnapshot hourly on a VPS that backed up VPSs on the same physical server to a dedicated backup disk. In other words, the source was on the same physical server as the destination but they were on different disks.

rsnapshot uses hard links to share file data across backup sets, so backing up an unchanged directory twice takes hardly any additional space compared to backing it up once. But it does need to do a bunch of disk reads and writes to make all the linked directory entries when it does this, so there is a fair amount of I/O involved: more than what rsync would need to just update a local directory to match the remote directory, but far less than what would be needed to make a separate copy of every file for each backup.
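The hard link behavior is easy to see in a few lines of Ruby. This is only an illustration of the filesystem mechanism, not of how rsnapshot is implemented; the function name and the file names are made up, and it assumes the temporary directory lives on a filesystem that supports hard links.

```ruby
require "tmpdir"

# Creates one file, hard-links it under a second name, and reports what the
# filesystem says about the pair. The temporary directory is removed when
# the block exits.
def hard_link_demo
  Dir.mktmpdir do |dir|
    older = File.join(dir, "hourly.1-notes.txt")
    newer = File.join(dir, "hourly.0-notes.txt")

    File.write(older, "unchanged contents\n")
    File.link(older, newer) # a second directory entry, no second copy of the data

    older_stat = File.stat(older)
    newer_stat = File.stat(newer)
    {
      same_inode: older_stat.ino == newer_stat.ino,
      link_count: older_stat.nlink
    }
  end
end

p hard_link_demo
```

Both names point at the same inode, so the second "copy" costs a directory entry rather than another set of data blocks. That is why the work is dominated by many small metadata writes.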

In abstract terms, the I/O for this backup process involves a lot of small reads and writes, and a very small number of medium or large writes for changed files. All of these occur as fast as the disk can service them, and the disk is quiet aside from this activity.

Here’s what I measured (in three test runs per journal type):

Journal Type	Real Time
data=journal	2m05s, 2m57s, 2m51s
data=writeback	2m03s, 1m18s, 1m22s
data=ordered	2m12s, 1m30s, 1m20s

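For this application, data=journal takes about twice as long as the others in the second and third runs (the first runs of all three modes are within ten seconds of each other), while data=ordered runs about as fast as data=writeback while providing some additional protection.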
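The comparison can be checked with a little arithmetic on the table above. In this sketch the helper names are hypothetical, and averaging only the second and third runs is my own choice for the illustration: the first runs of all three modes land close together, and the post does not say why.

```ruby
# Timings copied from the table above.
TIMINGS = {
  "data=journal"   => %w[2m05s 2m57s 2m51s],
  "data=writeback" => %w[2m03s 1m18s 1m22s],
  "data=ordered"   => %w[2m12s 1m30s 1m20s]
}.freeze

# Converts a "2m05s" style time into whole seconds.
def seconds(real_time)
  match = /\A(\d+)m(\d+)s\z/.match(real_time)
  raise ArgumentError, "unrecognized time: #{real_time}" unless match

  match[1].to_i * 60 + match[2].to_i
end

# Mean of every run except the first, in seconds.
def mean_after_first_run(times)
  later = times.drop(1).map { |time| seconds(time) }
  later.sum.to_f / later.size
end

TIMINGS.each do |mode, times|
  puts format("%-15s first run %ds, later runs average %.1fs",
              mode, seconds(times.first), mean_after_first_run(times))
end
# Later-run averages: 174.0s (journal), 80.0s (writeback), 85.0s (ordered)
```

On those later runs data=journal is a little over twice as slow as either alternative, while data=ordered and data=writeback differ by only a few seconds.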

So data=writeback is useless in my case, and the fact that data=ordered is the default makes sense. You get almost the same level of data protection as with data=journal, but with the performance of data=writeback. Different I/O patterns will give different results, but I suspect that the pattern I tested with is the most common in real server usage. (Note that in ext3’s v1 journal format, data=journal was the only journal behavior.)

My inclination is to stick with the default setting, even using data=ordered on database servers, since the database is doing its own higher-level journaling in the form of a transaction log. I’m basing this recommendation on this detail from the Gentoo article:

When appending data to files, data=ordered mode provides all of the integrity guarantees offered by ext3’s full data journaling mode. However, if part of a file is being overwritten and the system crashes, it’s possible that the region being written will contain a combination of original blocks interspersed with updated blocks.

Since a database transaction log is generally appended to rather than overwritten, my understanding is that it will protect against the above scenario in which data=ordered can cause a mix of old and new data. The database’s data files may have a mix of old and new data, but the transaction log would not show that the transaction has been completed yet, so it would be re-run during recovery and the remaining old data would be removed. I think.

The usage pattern where data that you really care about is overwritten regularly (as opposed to logs, which simply append) is rare in my experience, except in the case of database servers which are covered by their own logs as I just mentioned. So I don’t know of a particular application type that demands the full data journaling mode.

Anyway, I recommend against data=writeback altogether, unless you don’t mind some data corruption if there’s a power failure. The speed gain I measured isn’t worth the risk, in my opinion.

Sphinx Search init script for CentOS 5.1

Jamie Flournoy — Mon, 14 Apr 2008 06:18:11 +0000

Sphinx search is pretty new, and as a result I was unable to find a nice convenient package for it for CentOS 5.1. This is problematic for two reasons: there is no init script included with the source tarball, and keeping the index up to date is the sysadmin and developer’s problem, since Sphinx cannot be configured to simply update the index when the data changes.

The second problem (updates) is one I punted on; for now I have a cron job rebuilding the entire index every 5 minutes, which will probably be replaced with something smarter and lower-latency at a later time.

The first problem (no init script) is easy to solve, but apparently nobody has done so for CentOS 5.1 and published it. So, here is my CentOS 5.1 init script for the Sphinx Search server. It is known to work with version 0.9.8-rc2.

BTW, the alternative solution to the problem of a daemon not having a System V init script is to just put some extra junk in /etc/rc.local. That is the quick and dirty solution, and is undesirable for several reasons:

– You can’t easily stop or restart the service, because it’s not a service as far as the OS knows; it’s just some junk in a script that got run a while ago.
– You can’t use chkconfig or its GUI cousin with the creative name, The Services Configuration Tool, to control it and tie it to specific runlevels.

(System V runlevels and init scripts are useful, even if you don’t need all of the runlevel functionality. The stop/start/restart PID stuff is useful by itself.)

Making SELinux allow a nonstandard MySQL port number on CentOS 5.1

Jamie Flournoy — Sat, 29 Mar 2008 18:36:30 +0000

SELinux is a recently added security system that’s installed by default with CentOS 5.1 (and Red Hat Enterprise Linux 5, and others). Since it’s newer than the classic “Discretionary Access Control” Unix security model, it’s not nearly as well documented, and unfamiliar to many. I had never even heard of it until this week.

After a lot of reading about it, and debating disabling it entirely, I figured out how to do some minor SELinux customization to fit my needs for a MySQL database server. Hopefully this will help folks who are in a similar situation.

Fortunately, although SELinux is sophisticated, it’s not too obtrusive as implemented in CentOS 5.1. In configuring it, Red Hat has picked an admirable position somewhere between ironclad security with a huge administrative burden, and toothless security that is easy to use because it isn’t doing anything to protect you. This is important, because if the configuration process were too odious from the point of view of a typical junior sysadmin, it’s very likely that people would get in the habit of just turning it off entirely. As it is, SELinux on RHEL 5 / CentOS 5.1 is now becoming part of the landscape of what a modern Linux looks like; based on what I’ve read on relevant forums lately, admins are taking the time to try and customize its default configuration to their needs (with some success) rather than just turning it off.

The nicely balanced default configuration that Red Hat has chosen is called the Targeted Policy, which means that if the SELinux configuration files know about a specific daemon, then it will be subject to specific rules; otherwise, the classic Unix security model applies. So if you stay with the standard configuration of those targeted daemons, SELinux is providing an additional level of security containment around them, and as long as it does what it’s supposed to, you’ll never notice it.

In my case, I’m running MySQL and OpenSSH, and have configured them to listen on nonstandard ports. SSH is not targeted, so this was trivial to do. MySQL is targeted, so it didn’t work right away.

Specifically, MySQL wouldn’t start, and in /var/log/messages I saw something like this:

kernel: audit(1206710000.178:12): avc:  denied  { name_bind } \
for  pid=8591 comm="mysqld" src=1234 scontext=user_u:system_r:mysqld_t:s0 \
tcontext=system_u:object_r:port_t:s0 tclass=tcp_socket

In plain English, “I denied process 8591’s request to bind to port 1234”. So SELinux needs to be told that MySQL should be allowed to bind to port 1234.
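To show where that plain English reading comes from, here is a small Ruby sketch that pulls the relevant fields out of the log line above. The function names are hypothetical, and the parsing is a loose key=value scan that fits this one message, not a general audit log parser.

```ruby
# The denial from /var/log/messages above, joined onto one line.
AVC_LINE = 'kernel: audit(1206710000.178:12): avc:  denied  { name_bind } ' \
           'for  pid=8591 comm="mysqld" src=1234 ' \
           'scontext=user_u:system_r:mysqld_t:s0 ' \
           'tcontext=system_u:object_r:port_t:s0 tclass=tcp_socket'

# Returns a hash of the key=value fields plus the denied permission
# (the word inside the braces).
def parse_avc(line)
  fields = line.scan(/(\w+)=("[^"]*"|\S+)/)
               .map { |key, value| [key, value.delete('"')] }
               .to_h
  fields.merge("permission" => line[/\{\s*(\w+)\s*\}/, 1])
end

# Restates the denial as a sentence.
def describe_avc(line)
  avc = parse_avc(line)
  "#{avc['comm']} (pid #{avc['pid']}) was denied #{avc['permission']} on port #{avc['src']}"
end

puts describe_avc(AVC_LINE)
```

The scontext field shows the process running as mysqld_t, while the tcontext shows the port labelled with the generic port_t type. That mismatch is what the fix has to address.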

Here’s what I had to do: (assuming a mysqld port number of 1234, and that the iptables firewall is already adjusted for this)

sudo /usr/sbin/semanage port -a -t mysqld_port_t -p tcp 1234

This means “Change the SELinux policy for ports by adding one, of mysqld_port_t type, protocol TCP, port number 1234.”

Now you should be able to see the standard port (3306) and the new one (1234) with this:

sudo /usr/sbin/semanage port -l | grep mysql

That should output something like “mysqld_port_t tcp 1234,3306”.

(These changes are persisted in /etc/selinux/targeted/modules/active/ports.local, so they will still be active after a reboot.)

Now, MySQL starts happily and I can connect and use it as I had expected. But I didn’t have to disable SELinux, which means that this and other daemons are still running inside a security container that will help to limit the damage if their security is compromised.

Document Databases – New Kids on an Old Block

Jamie Flournoy — Sat, 16 Feb 2008 06:16:16 +0000

There’s a new crop of databases that has appeared lately, under the rubric of “document databases”, and there’s quite a lot of enthusiasm for them given that they tend to be slow and very feature-poor compared to the SQL RDBMSs that are the typical persistence mechanism for web applications. What’s mainly appealing about them is that they are easy to use, and theoretically quite scalable, compared to the traditional “one big SQL database server” approach.

But the simplicity of these new document databases is tied to some significant trade-offs in the current implementations. And so I’m going to try and put them into context with some of the other data persistence options that have been around for a while, but which aren’t currently getting as much hype as document databases. Hopefully that will help all of us to understand how these new and evolving document databases can be useful to us, and what the alternatives are in areas where they may not fit well.

I’d first like to try and deconstruct a false dichotomy that I’ve noticed being used in arguments in favor of some of these new databases. That dichotomy casts SQL RDBMSs (such as MySQL, Oracle, PostgreSQL, MS SQL Server, etc.) as big, complicated, and hard to scale, compared to document databases which are small, simple, and easy to scale. The main problem with this dichotomy is that there are far more choices than just two. Each database product embodies a set of design choices, and although there is some clustering of decisions into general types of product, the boundaries are a lot fuzzier than a product evangelist might have you believe.

Furthermore, trade-offs made early in a product’s lifetime may have been altered over time. A good example is the no-longer-true characterization of MySQL being fast but not reliable, vs. PostgreSQL being reliable but not fast. In reality, recent releases of both products are moving toward being very fast and very reliable.

Because there are so many database products out there, I’m going to have to fall back on a small subset of example products, as illustrations of issues that may or may not exist in a particular product you’re looking at. The key for an application architect evaluating a persistence mechanism is to understand the abstract concepts, and to figure out which ones matter to your current application. That will let you select a product (or a combination of products, including some custom code perhaps) that suits you. As with all aspects of architecture, there is no cookbook you can use, and in six months all the options will change. You have to analyze your needs first, and then get your hands dirty with evaluation second.

So, let’s start deconstructing some of the examples into design decisions.

SQL RDBMS

This is the traditional choice for web applications. You get a remote query language, so you can specify in great detail what you want to retrieve. You get very precise control over data representation, including some that may be burdensome if you’re not concerned with internationalization, multiple currencies, and time zones. You get a lot of control over performance and a lot of information about how things work at a low level inside the database, from indexes to data page size to transaction logs and checkpoint frequency. Most of them include ACID transaction support, which is nice for reliability, but which usually obligates you to implement a backup scheme or else they will eventually stop accepting new transactions and/or run out of disk space.

Some of the design drawbacks include:
– the use of local storage, for performance and for a guarantee that data has been committed to disk
– the use of a high level query language (SQL) and a query optimizer, so the specific process the database uses to satisfy your query is not in your face (and thus may be surprisingly inefficient, if you aren’t familiar with how it works)
– the use of a proprietary network protocol, which means that you need a special client library for just that one product, which may or may not implement all of the features that the server offers (such as encryption)

However, there are some variations that make the edges of this category fuzzier. There are ACID-compliant SQL RDBMSs that have no network layer, and are very lightweight; in some cases they may not even support concurrent access. Examples include HSQLDB and SQLite.

Networked Filesystem

This is typically used for accessing shared file servers, or allowing “thin client” behavior so that users can get to their own environment and data from any given endpoint. Examples include NFS, SMB, AFP, GFS, and quite a few others. The main advantage of these systems is that the remote filesystem is represented as being directly connected to the local system, while also being available to other users or other client systems who are connected to the same remote system.

Trade-offs of this design include:
– performance on a LAN may be good, but over a slow, high latency link may be very poor
– there is usually no ACID transaction support, just file locking
– file ownership and permissions can be very hard to manage
– if the file server goes offline, the entire local system may hang or crash

In particular, content indexing, complex querying, and data integrity features are generally not offered. You can layer that on top, though, but that layer will not necessarily work if it was originally designed to work on local filesystems. In particular I’m thinking about DBM files; they’re fast and easy to use but not all of them will work properly with files located on a network filesystem.

Also, directory scanning performance can be very poor if thousands of files are located in a single directory; listing all files starting with the letter T may actually require the entire directory to be retrieved and filtered on the client side.

Variations include FTP and WebDAV, which are not intended to simulate a local filesystem, but instead have filesystem-like semantics. Some operating systems will mount them as remote filesystems anyway, for ease of use for viewing and copying files, but it’s not possible to lock a remote file, so safe multiuser access is not possible.

Object Database

Object databases offer a direct representation of an application’s data in almost exactly the same form that exists in memory. Whereas a relational database stores data in tabular form regardless of the particulars of a client application, an object database stores data in the same form that the application uses. The exception to this is in the representation of references to other objects; at some level these encapsulate pointers to the memory address of the data in the application’s address space, and this must be substituted with a pointer to the location in the database’s storage system before storing it.

An object database will not handle time zones or internationalization, but nor will it complicate those matters if the application handles those already. The data is simply stored as-is. Also, object databases typically do offer ACID transaction support.

Aside from the conceptual simplicity of the similar data model, one major bonus of an object database is that the use of pointers makes data retrieval extremely fast; rather than parsing a query and searching indexes for the on-disk location of a desired object, the application can simply ask the database for it by its reference.

The big trade-offs here are twofold:

First, the lack of indirection through a query language and an indexing system means that the application developer must anticipate all of the queries that will be needed, and incorporate collections into the object graph that will be used to get to the stored objects. Otherwise the application’s object model will need to be updated frequently to include these later.

Second, altering the application’s object model and then retrieving data stored using older code can be very complex. Because the stored objects and application’s code are out of sync in this situation, additional application code must be written to convert existing stored objects into the new representation and persist them back to the database.

Combine those two trade-offs, and it’s clear that the performance benefit comes with the price of considerable additional application development effort.
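To make the first trade-off concrete, here is a plain Ruby sketch of an object graph that carries its own lookup collection. Nothing here is a real object database API: Customer and CustomerRoot are hypothetical classes standing in for whatever root object the application would persist.

```ruby
Customer = Struct.new(:name, :city)

# Hypothetical root of a persisted object graph. With no query language,
# "find customers by city" is only cheap because the graph carries a
# collection keyed by city that the application keeps up to date itself.
class CustomerRoot
  def initialize
    @all = []
    @by_city = Hash.new { |index, city| index[city] = [] }
  end

  def add(customer)
    @all << customer
    @by_city[customer.city] << customer
    customer
  end

  # Anticipated query: direct lookup through the maintained collection.
  def in_city(city)
    @by_city.fetch(city, [])
  end

  # Unanticipated query: nothing to do but walk every stored object.
  def named(name)
    @all.select { |customer| customer.name == name }
  end
end
```

Supporting a new kind of lookup efficiently means adding another collection to CustomerRoot and backfilling it for objects that are already stored, which is where the two trade-offs meet.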

Also, an object database is by nature bound to a single application, rather than being a point of integration between multiple applications. Any attempt to create a shared code library that manages access to the object database introduces potential “impedance mismatches” between each application and the shared object model, which reduces the simplicity that an object database offers in comparison to a relational database.

Document Database

Arguably a rejection of relational technology, document databases offer several advantages compared to the three classes of database previously mentioned. Documents need not be internally represented as a flat set of key-value pairs as seen in a SQL RDBMS; for example, the document may be an XML document. Queries are possible and may even use a standardized query language to express the conditions for matching desirable documents. Documents may have internal structure that is understood by the database server (so that it can query against the document’s contents), as in the case of an XML database, or they may have an external metadata structure consisting of key-value pairs, or both.

The drawbacks of this type of system derive from the fact that it is similar in many ways to each of the other database types.

Querying ability means that the server must incorporate some kind of indexing system for performance reasons, which means that the document must either internally or externally conform to some sort of standard data model. Some document database systems simply omit querying except by the document’s main ID (similar to a SQL primary key, a network filesystem’s filename, or an object database’s storage reference). This has the same drawback as with an object database: the application must take on the responsibility of managing querying, searching, and sorting itself, across the network.

The similarity to a networked filesystem may also have scalability benefits; if referential data integrity is not provided, then documents can be located on any remote system, and partitioning is simple. However, distributed queries will still need to be managed at the application level, and potentially any transaction becomes a distributed transaction, since a changed document on one server may be referenced from any number of documents in any number of other servers. (This is also true of the other systems, though.)
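A minimal sketch of why partitioning is simple when documents are addressed only by ID. This illustrates the general idea and is not how any of the products named here work; the server names are made up and server_for is a hypothetical helper.

```ruby
require "zlib"

# Hypothetical document servers.
SERVERS = %w[docs-a.example docs-b.example docs-c.example].freeze

# Picks the server for a document purely from its ID, so any client can find
# a document without consulting a central directory.
def server_for(doc_id, servers = SERVERS)
  servers[Zlib.crc32(doc_id.to_s) % servers.size]
end

%w[invoice-1001 invoice-1002 profile-17].each do |doc_id|
  puts "#{doc_id} -> #{server_for(doc_id)}"
end
```

The catch is the one described above: a query that is not keyed by ID has to visit every server, and a change that touches documents on two servers is a distributed transaction. Simple modulo placement like this also moves most documents whenever a server is added.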

Still, because a document database is closest to a networked filesystem, it may be suitable for simple requirements where a relational database or object database seems too complex and slow, but where the bare-bones functionality of a networked filesystem is too simple. The compromise of a binary file with simple additional metadata or properties attached “out of band” with the data itself, or of a structured document format that is flexible but not ideal, may be acceptable if sophisticated querying is possible as a result.

It’s hard to provide good examples of document databases, because the category is very broad, and includes a lot of simple projects that provide just a little functionality above and beyond what WebDAV already provides. But a few that I’ve heard of recently include CouchDB, SimpleDB, and RDDB.

CouchDB imposes a simple key-value data structure on document content, but no internal document schema or grouping of documents by type. It does offer indexing and querying. Notably, it also offers transparent replication, at a field level (changes to two different fields in two copies of the same document are synchronized to both copies).

Amazon SimpleDB similarly imposes a key-value structure, though one key can have multiple values. It too offers a query language, and indexing. Because it’s built on Amazon’s S3 service, transparent replication is also included.

Future Prospects

The main complaints that I’ve seen directed at relational databases involve two things: one, the difficulty of scaling them up, and two, the restrictive data model. Sharding (a.k.a. data partitioning) is the usual remedy for scaling problems, but that requires the elimination of referential integrity in the SQL RDBMSs I’m aware of, and requires distributed transactions in order to preserve ACID transaction properties across denormalized copies of the modified data.

Interestingly, these issues are the same across the board, regardless of database type. Either you abandon transactions, or you move them up to a level that’s aware of the data partitioning. I see no reason why these and other high-end RDBMS features couldn’t be offered in a proxy layer that possibly even contains the query processing as well.

One way to approach this is to build a closed system with a given set of features and a limited API that permits a single query language. This seems to be the way that CouchDB and SimpleDB are approaching the problem.

Another way to approach this problem is to simply say that the storage back-ends of relational databases could be enhanced to incorporate built-in transparent partitioning. I don’t think that SQL RDBMSs will abandon the concept of a table schema any time soon, but there’s no reason why products that already include XML query and indexing capabilities and free-form natural language indexing (a.k.a. Full Text Search) couldn’t also include indexing capabilities for simple key-value structured data inside a single column of semi-structured data, giving most of the same functionality as a document database.

Given that, the remaining limitation of SQL RDBMSs is the requirement that the back-end storage system be located on a disk drive physically connected to the same server, and that the storage be touched only by processes running on the same server together so that they can coordinate access to the data.

For now, though, document databases look like they can be very useful for certain types of persistence requirements; I don’t see them as a viable substitute for everything that a SQL RDBMS does, but that perception is limited mainly by the choices at hand. CouchDB looks like the most generally useful option so far, though I’d like to see the addition of optional schemas (opt-in on a per-object level, as seen in LDAP), and/or a pluggable language option. (It seems that everyone using a document database is also enamored of their application language and dislikes the idea of putting logic in the data tier unless it’s written in the same language.)

I welcome your comments – this is mostly a brain dump of things I’ve seen before to help myself and others contextualize the new document databases, and document databases are evolving too rapidly for me to keep up with all of them on my own.

Acts_as_tsearch adjustments needed for PostgreSQL 8.3rc2

Jamie Flournoy — Thu, 24 Jan 2008 20:00:23 +0000

Just a quick note: acts_as_tsearch needs some guidance to work with PostgreSQL 8.3 due to changes in tsearch2 integration.

I’m pretty close to tossing out acts_as_tsearch and rolling my own (trigger-based) tsearch2 plugin, but for now I’m just sticking with it and checking out the PostgreSQL 8.3 release candidate.

I was able to build 8.3rc2 on Mac OS X 10.5.1 from the tarball sources with the instructions in the INSTALL document, no hitches whatsoever. Because I have 8.2 installed via MacPorts, there were no file conflicts (different install directories, data directories, etc.), so all I had to do was shut down the 8.2 server and start the 8.3rc2 server and it was ready to go.

Unfortunately, acts_as_tsearch didn’t work properly the way I had used it with 8.2. The issue appears to be that the tsearch2 locale called ‘default’ is gone, which is what acts_as_tsearch uses if you don’t specify something else. The default locale value is now located in postgresql.conf. Using that value as an explicit locale in the acts_as_tsearch declaration in my model class solved the problem. The code change looks like this:

OLD:
acts_as_tsearch :fields => ["subject","body"]

NEW:
acts_as_tsearch :vector => {:fields => ["subject","body"], :locale => 'pg_catalog.english'}

Like I said, due to the fact that acts_as_tsearch is designed to hide the complexity of tsearch2, it is not well suited to my somewhat complex requirements. So, I’m ditching it in favor of custom code, which I hope to plugin-ize and release some time later. So, this change is necessary but might not be sufficient for your own project. But I hope it helps you get started on upgrading successfully.

Capacity vs. Scalability

Jamie Flournoy — Wed, 14 Nov 2007 00:23:58 +0000

In I still don’t get the fascination with Ruby on Rails, Andy Davidson writes:
Scaling does not mean “Allows you to throw money at the problem”, it means “Can deal with workload”. He goes on to recommend mod_perl instead of Rails.

I’m not interested whether he likes Rails or not. Lots of people hate Rails, and I don’t care. I’m not going to make a big deal about the fact that he’s comparing a runtime architecture (Apache + mod_perl) with a framework (Ruby on Rails).

Those are insignificant compared to his claim that scalability means “Can deal with workload”. Actually, that’s a description of capacity.

Scalability is a very distinct concept from capacity. Scalability is not a true/false property of a system; there are degrees of scalability, which can be represented in a 2D graph of # of simultaneous requests that you can service with an acceptable response time (X axis), plotted against the resources required to service those requests (on the Y axis). The function f in the y=f(x) equation that is behind that graph is how scalable your application is.

If it’s a straight line, that’s quite good: “linear scalability”. More requests cost the same amount per request as the ones you’re getting now. Double your customers, double your net profits.

If it curves down away from a straight line, that’s even better than linear scalability: you’ve attained an economy of scale, so twice as many requests costs less than twice as much as the amount you’re paying now.

If it curves up away from a straight line, that’s bad, because more load means a greater cost per request. Each new customer makes you less money than the last one. Eventually you will grow to a point where you lose money and your business fails. This is what people are referring to when they say something won’t scale. Linear (or better) scalability curves are what people mean when they say something will scale.

In the worst case, the upward curve is asymptotic to a vertical line. In other words, at some number N of simultaneous requests coming in, you “hit a wall”, and no amount of extra resources will help you. “Allows you to throw money at the problem”, as Mr. Davidson puts it, actually describes all three curves, except for this worst case of curving upward asymptotically. But as long as you don’t hit a wall, “Can deal with workload” is satisfied. The more interesting questions, though, are how much it costs you to add capacity, and whether there’s a certain number of requests above which you start to make or lose money.
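To make the curve shapes concrete, here is a small Ruby sketch that compares cost per request at three load levels. The three functions, their exponents, and the coefficient 2.0 are invented purely for illustration; real curves have to be measured. The asymptotic “wall” case is left out, since no finite cost describes it.

```ruby
# Three hypothetical cost functions y = f(x): resources needed (in
# arbitrary cost units) to serve x simultaneous requests.
CURVES = {
  "economy of scale" => ->(x) { 2.0 * x**0.8 },  # curves down
  "linear"           => ->(x) { 2.0 * x },       # straight line
  "won't scale"      => ->(x) { 2.0 * x**1.5 }   # curves up
}

def cost_per_request(f, x)
  f.call(x) / x
end

CURVES.each do |name, f|
  costs = [100, 1_000, 10_000].map { |x| cost_per_request(f, x).round(2) }
  puts "#{name.ljust(16)} cost per request at 100/1k/10k: #{costs.inspect}"
end
```

With these made-up functions, the linear case costs the same per request at every load, the downward curve gets cheaper per request as load grows, and the upward curve gets more expensive.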

Of course, the ideal curves are not what you see in practice. In reality you buy resources in chunks, such as a server or a specific plan of bandwidth, power, and rack space from your colocation provider. The graph looks more like a staircase, and going from N customers to N+1 customers means you have to spend $thousands on new hardware. Each of those chunks represents a certain amount of capacity. Capacity is just a measure of how large each chunk is, or of the largest value of X that your server cluster can support without more resources.
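The staircase can be sketched in a few lines of Ruby. The capacity of 100 requests per second per server and the price of 3,000 per server are hypothetical numbers, and the sketch assumes the best case in which every added server contributes its full capacity.

```ruby
# Hypothetical numbers: one server handles 100 requests/second and
# costs 3,000 (in whatever currency you like).
SERVER_CAPACITY = 100
SERVER_PRICE    = 3_000

# Resources come in whole-server chunks, so cost is a step function.
def servers_needed(requests_per_second)
  (requests_per_second.to_f / SERVER_CAPACITY).ceil
end

def hardware_cost(requests_per_second)
  servers_needed(requests_per_second) * SERVER_PRICE
end

[99, 100, 101, 200, 201].each do |load|
  puts "#{load} req/s -> #{servers_needed(load)} server(s), cost #{hardware_cost(load)}"
end
```

Going from 100 to 101 requests per second is the expensive step here: the cost jumps by a whole server even though the load barely changed.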

But you can’t just extrapolate from the fact that a single server S will support, say, 100 requests per second, that 100 of them will support 10,000 requests per second. If only it worked that way, capacity planning would be really easy. Sadly, architecting a web application for linear scalability is hard. (It’s doable and the approach is fairly well documented, but it’s not easy.)

I think it’s worth pointing out something now which should be obvious: the slope of a straight line doesn’t change its curvature. If you’re paying a silly amount for each request because you’re using an inefficient architecture that scales linearly, but you’re making an even larger silly amount from your customers, you’re still going to be in business if you have 10x as many customers. You may be leaving money on the table due to inefficient use of resources, but you’re not ruined.

If you’re lucky enough to be in that situation, you can probably hire one or two sysadmin/developer ninjas to optimize your app and change the slope of your line downward. Alternatively, you might decide to increase your profits by just buying more servers and advertising. You could even do both.

Likewise, if you’re losing an average of $5 per customer visit regardless of how many customers you have (with that hard-to-attain linear scalability again), then adding a bunch of servers isn’t going to help you. Sun, Compaq, and Dell sold a ton of server hardware in the late 1990s to companies that didn’t understand this.

In a more realistic scenario, you might be paying a lot for servers but not have much revenue. Improving your application’s efficiency would reduce the cost of your resources somewhat, and you might change from losing money to making money. That’s great, but if your curve bent upward before, it still bends upward, and if you were gonna hit a wall before, you still will. The right decision might be to worry about that later, once you’re profitable. At that time you could afford to change your architecture to scale better. Or, you might choose to invest in the future and focus on scalability improvements and growing your customer base now, losing money now but making piles of cash later when you eventually improve your efficiency.

In conclusion, to tie the two terms together: scalability is a measure of how cost-effectively you can grow (or shrink) your capacity.

And to tie this topic to my ongoing claim that trading runtime language performance for developer productivity is generally a good idea for web apps:

Language performance does not affect whether an application scales or not. It is a coefficient to the cost of capacity.

The cost of capacity affects the slope of your curve, but not the curvature. That’s important. Your architecture and application design are what affect the curvature of your scalability. You need to pay attention to both: the curvature of your scalability function and the cost of capacity will tell you where to invest your developer and sysadmin resources for the best return on investment.
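A quick Ruby sketch of the coefficient argument: if language speed is modeled as a constant multiplier on whatever cost curve the architecture produces, the absolute cost changes but the penalty for doubling the load does not. The exponent 1.5 and the multipliers 1.0 and 5.0 are assumptions chosen only to make the point visible.

```ruby
# Growth ratio: how many times more it costs to serve 2x the load.
def doubling_ratio(f, x)
  f.call(2 * x) / f.call(x)
end

# Hypothetical architecture whose cost bends upward (exponent 1.5).
architecture = ->(x) { x**1.5 }

# Language speed modeled as a constant multiplier on that cost.
fast_language = ->(x) { 1.0 * architecture.call(x) }
slow_language = ->(x) { 5.0 * architecture.call(x) }

[fast_language, slow_language].each do |f|
  puts "cost at 1000: #{f.call(1000).round}, doubling ratio: #{doubling_ratio(f, 1000).round(3)}"
end
```

Both variants pay the same ratio when load doubles; only a change to the architecture function itself would alter that ratio.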

ActiveRecord: the Visual Basic of Object Relational Mappers

Jamie Flournoy — Fri, 05 Oct 2007 02:07:53 +0000

I’ve been working with Ruby on Rails intensively for several months, and I’ve finally found a place where Rails can’t readily be extended to do what I want. It’s ActiveRecord, which is probably the most controversial part of Rails.

I’m reminded of a James Gosling quote disparaging Microsoft tools, particularly Visual Basic: “The easy stuff is easy, but the hard stuff is impossible.” There’s a parallel between VB and Rails in this instance, in that if you only let yourself use the high level tools, the hard stuff is impossible, but the designers specifically tell you to do the hard stuff using a lower level toolset. The controversy that surrounds “X can’t do everything, therefore it sucks” should really be focusing on the feasibility of going through that trapdoor to do things “the hard way”. This is what Delphi did, which is why so many folks chose it over VB; it made the hard stuff easier.

Here’s the task I need to accomplish, for which ActiveRecord is not well suited: complex queries involving SQL functions and multiple-table joins. I want to join a few tables together, order by a SQL function, include with each result row the result of a SQL function that operates on each row, and have all that come back as a graph of high-level objects.

Despite my attempts to use plugins, extend and/or fix bugs in those plugins, and to dig through the ActiveRecord source to figure out what the documentation won’t tell me, I was unable to get it to work. Most of the parts of what I wanted were possible: acts_as_tsearch cleverly weaves SQL functions into high-level ActiveRecord::Base.find calls; paginating_find provides a very convenient pagination API on top of ActiveRecord::Base.find, and ActiveRecord includes some clever association tricks such as automatic many-to-many relationships (has_and_belongs_to_many), eager loading of associated records using a join (via the :include option to ActiveRecord::Base.find), and a fairly low-level :joins option that lets you add tables to a ‘find’ query which can be used in your :conditions. Problem is, they don’t all work together in a fancy way.

Really, the issue in this case is related to the design choices that went into ActiveRecord.

Some ORMs (object-relational mappers) are designed in a modular fashion: there is a part that helps you describe the relationships between your model objects, a part that helps you construct queries, and a part that does the storage and retrieval. Sometimes there’s another part that uses your description of object relationships to create an empty database with the appropriate data model, or that looks at an existing database and creates an object model that matches it. Sometimes there’s an import/export tool for bulk data loading or dumping as well.

ActiveRecord has the first three functions integrated (which has benefits and drawbacks compared to a more modular approach), has a very isolated schema manipulation module, and has a somewhat isolated data loader tool.

The relationships are explicitly declared in source code using associations: has_one, has_many, belongs_to, and has_and_belongs_to_many. These are pretty fancy and provide some convenience features that make the associations appear as object collections, such that changing the collection and saving it turns into insert/delete/update activity in the database.

Query construction is basically tied to the objects themselves, in a way that greatly simplifies star-join queries, but which handles only the simplest joins across multiple tables, and is barely able to handle self-referential joins at all. So, you can easily load an object (or group of similar objects) and associated objects, but OLAP-style queries (“what are the top 5 states where customers are located who have bought classical CDs within 2 weeks of their release using American Express and had them shipped as gifts via UPS 3-day Select?”) are impossible. Oddly, views, functions, and stored procedures could bridge the gap between real-world data models and ActiveRecord’s limited set of association types, but they are not supported either.

The storage and retrieval code is inseparable from the query code, and so it is not possible to examine and modify the final SQL before it is executed, nor is it possible to provide an arbitrary query and have the results be parsed into an object graph based on the associations you have defined. The code that would allow these features appears to exist and be sufficiently well designed to allow this with a fairly small number of changes to ActiveRecord. However, it is currently (as of Rails 1.2.3, which is the current release) not part of the documented API and is declared private.

There is a limited facility for constructing simple objects from arbitrary SQL, in find_by_sql. This loses essentially all of the high level functionality of the find method; most notably, it isn’t possible to use find_by_sql results to instantiate an object graph, rather than a flat array of objects (similar to the eager loading feature in the regular find method).

ActiveRecord has fairly good high-level schema creation functionality (“migrations”). Though it lacks concepts for all but the basic database objects, support can be added for foreign key constraints (I kid you not, they aren’t supported by Rails itself!) and views. There’s also a simple way to execute arbitrary SQL. Migrations aren’t technically that amazing, but rather they’re a helpful organizational approach to what can be a really hairy problem: defining a schema and then applying changes to live databases while keeping track of what changes you’ve already applied.

Finally, there is a test data loading facility called Fixtures. The common opinion of Fixtures seems to be that they are broken by design and should be avoided. The main issue I’ve found with them is that the implementation ignores the kind of database design elements that any book on SQL would recommend, such as foreign keys and check constraints. I managed to circumvent this with a combination of a plugin and some customization, described in detail in my previous post, Rails, Fixtures, the Test DB, and Test::Unit. With those changes, all test fixture data is preloaded in the right order (so constraints aren’t violated) before any tests run, and any data alterations within tests are rolled back automatically by Rails.

A secondary issue with Fixtures is that they go directly from YAML text files to SQL INSERT statements, bypassing the ActiveRecord Model classes. ActiveRecord does pretty much rule out any fancy mapping between database tables and objects, so that’s not a problem, but this model-skipping fixture loading implementation means that any code in your model object (validations, before_save filters, etc.) will not be executed when loading fixtures. So fixtures do not work well with the otherwise pervasive Rails design rule of “put all the intelligence in the application”.

Still, despite the commonly-held disdain for using fixtures at all, I find that they can be tamed. In fact I’ve even created a base data facility for loading the fundamental data set that needs to be in the live database (e.g. initial admin user info). My approach is basically to alter fixture behavior to treat it as essentially a bulk data loading tool, and to do the extra housekeeping after loading to make up for the fact that the ActiveRecord model code was bypassed.

As far as I know, there is no bulk data dumping functionality in Rails.

So, to summarize, of the five main ORM features, here’s how ActiveRecord stacks up:

Describing Relationships: Easy to understand and use, with lots of slick functionality
Querying: Easy to understand and use, but limited to simple join structures, and not possible to customize query building or rewrite SQL before execution
Storage and Retrieval: Very easy to use, but only within the limits of the query builder’s features
Schema manipulation: Easy to understand and use; limited in functionality but readily extensible; solid third party plugins are available for missing schema objects
Bulk Loading and Dumping: Loading is badly designed and implemented, but fixable with some effort; dumping is not offered

Okay, so it definitely makes the easy stuff easy. But what about the rest?

As I observed before, ActiveRecord is not designed as a set of modules that you use to assemble a solution that fits your needs. That’s more of the Java approach to design, and it trades convenience for flexibility. It can be a major pain to assemble a working system out of all of those abstract Java APIs, which are sometimes so comically over-patternized as to draw mockery such as the hilarious “Are Javalanders Happy?” code snippet from Execution in the Kingdom of Nouns. Rails makes the opposite trade-off: sacrifice flexibility and gain a very approachable API.

Unfortunately, the Java approach (too abstract to readily use, but extremely flexible) is easily wrapped with a simpler, more convenient, less customizable API. The Rails approach isn’t internally componentized (have a look at ActiveRecord’s activerecord/base.rb source file in its 2,165-line glory, almost all of which is one class), so if you want to fiddle with its internal behavior, you can’t. So with Rails, it’s all or nothing: high level slickness for simple requirements, or hand-written SQL and hand-coded results mapping for your complex requirements.

As I said at the beginning, though, the key question is not how comprehensive the high level feature set is. More important is the question of how painful things are when you drop down to a lower level for a greater degree of control.

It would be nice if there were a middle level of complexity, between the high-level ‘find’ method and ‘has_xxx’ associations, and raw SQL. There isn’t. I think that the reason there isn’t one is that there is still a persistent belief among many Rails core team members and community members that databases should be stupid: just a persistent hash. Once upon a time I worked that way myself: I didn’t have access to or skill with a SQL RDBMS, and so I solved all of my persistence problems with DBM files, which (using Perl’s Tie::Hash class) are conceptually just persistent hashtables. miniSQL was little more than a SQL query parser on top of that sort of storage engine, and MySQL originally was pretty similar. But big databases have all sorts of useful features that address complicated persistence requirements in a fairly elegant way.

Given that Ruby fans like the idea of domain specific languages, which let you work in a super high level language customized to the problem domain, it’s surprising that Rails groupthink is that SQL is bad. It’s actually a very high level language, and allows a well written database to do some pretty amazing optimization on the fly because it provides a strong layer of abstraction between what you requested and how the storage engine provides it.

No, it’s not dynamic, nor is it pure relational perfection, but it’s pretty darn good. Pre- and post-event validations and arbitrary callbacks to user-specified code, functions providing behavior on top of data… these are all things that Ruby and Rails fans hold in high regard when provided by Ruby and Rails, but which are considered a bad idea at the database layer. As I discussed at length in Rails and the notion of Stupid Databases Being a Good Idea, this is a philosophy rooted in DRY, but it has some major flaws.

Mainly, there is the issue that some things must be done in the data tier, and trying to put them in the application tier doesn’t work. The best example that comes to mind is full text search. Satisfying queries is the database’s job, period. It’s just hideously slow to try and do an inner join in the application across a network link to a database. If you find yourself doing this, that’s a pretty good sign that your architecture is broken. But some queries are too complicated for ActiveRecord, so sometimes you must choose between a series of high level queries whose results are intersected in application code (easy to understand, but extremely inefficient), or hand coded SQL.

Well, SQL is fast and is a high level domain-specific language, so it isn’t actually a bad tool for the job. The problem is that this approach (the trapdoor to the lower level API) is regarded differently by different people. Some see it as a common and reasonable approach to complex requirements; others see it as a bad evil scary thing that should be avoided at all costs, a kludge and a design mistake.

As a result, the low level option in Rails is anemic. It’s there, but you’re not supposed to use it. Ruby’s ActiveRecord Makes Dropping to Raw SQL a Royal Pain (Probably on Purpose) notes that there are no bind variables allowed in ActiveRecord. You may be saying, “No, wait a minute, I’ve used them, that can’t be right.” That’s what I thought. Look at the source; the bind variable functionality is actually a high level feature built on top of drivers that don’t have that feature. Whatever you did at the high level, it’s going to the driver as a single string. Okay, it’s nice that they added that feature, especially since it provides a single point of testing and verification for safe escaping. But that functionality (in sanitize_sql) is not part of the public API. Fortunately that same article provides a workaround that makes sanitize_sql accessible, so you can use bind variables in your hand coded SQL code, and pretend that the driver supports them. But that’s not likely to work forever.
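To show what “going to the driver as a single string” means, here is a toy Ruby sketch of bind variables emulated by string substitution. The quote_value and bind helpers are hypothetical names written for this illustration; this is not ActiveRecord’s sanitize_sql, and the escaping is far too naive for production use.

```ruby
# Toy illustration only: NOT ActiveRecord's sanitize_sql, and not a
# safe escaping routine for production use.
def quote_value(value)
  case value
  when nil     then "NULL"
  when Numeric then value.to_s
  else "'" + value.to_s.gsub("'", "''") + "'"
  end
end

# Replace each ? placeholder with the next quoted value, producing the
# single SQL string that would be handed to the driver.
def bind(sql, *values)
  remaining = values.dup
  sql.gsub("?") { quote_value(remaining.shift) }
end

puts bind("SELECT * FROM posts WHERE author = ? AND rating > ?", "O'Brien", 3)
```

The database never sees placeholders in this scheme; it only ever receives the finished string, so all of the safety rests on the quoting step.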

The key problem with ActiveRecord is its least common denominator feature set, based around the least featureful of all popular SQL databases: MySQL. Years ago, MySQL AB (the vendor of the MySQL database) took a strong philosophical stand against pretty much any advanced database features (which their product lacked, and which competing products had), but lately they’ve softened and added those features that they claimed nobody really needed. In the meantime, Rails has been designed with minimal expectations for database sophistication; therefore, the limited functionality of ActiveRecord is fairly complete, assuming you’re using a database with similarly limited functionality.

Triggers, stored procedures, functions, data integrity constraints, nested transactions, and views are all examples of unsupported database functionality. Try and use them via ActiveRecord’s high level API, and you will quickly see how fragile and inflexible ActiveRecord really is. If you shouldn’t need those features in your database, then you shouldn’t need anything that ActiveRecord doesn’t already provide, so it shouldn’t matter that you can’t extend ActiveRecord.

Truly, these are features that you need only in a few small cases in your application, so looking at individual queries they’re needed rarely (which is not the same thing as “never”). But looking at whether you need one or more of them in a given application, they’re needed more often than not. The pain of using hand coded SQL makes this worse: some tricky things could be done either using a view or stored procedure, or using a really slick dynamic SQL statement. Making all of those options painful means that even a clever developer can’t use anything in their bag of tricks to craft an elegant solution.

Unfortunately, non-trivial web applications need things like full text search, complex associations between persistent objects, non-trivial summary information about associated objects, and complex reports, and ActiveRecord fails at all of these. These are not just things that big dumb ancient companies that like using Object COBOL think they need; Amazon and eBay need them too.

The acts_as_tsearch plugin is a good case study of ActiveRecord’s design flaws. TSearch2 is the standard PostgreSQL full text search engine, and it’s pretty good in my opinion. It’s also pretty straightforward to use. Unfortunately for developers using Rails, TSearch2 uses SQL functions (mainly to_tsquery and rank_cd). The acts_as_tsearch plugin tries to inject SQL into ActiveRecord’s queries via the high-level find interface, but ultimately fails as soon as you use the :joins or :include options. The problem is that ActiveRecord has a very simplistic idea of how queries and joins work, and so if you need to inject SQL functions to get the job done (as is necessary in TSearch2 queries), too bad. (See also issues 7 and 8 in acts_as_tsearch, in which I describe and attempt to clean up the mess that results when you use find_by_tsearch in non-trivial ways.)

A fellow Rails developer asked me in all seriousness why I wasn’t abandoning the full text search functionality of TSearch2 and just using a completely separate, redundant database product designed exclusively for full text search. Seriously, that is considered the “easy” approach: one database for full text search, and another for ACID/OLTP/CRUD. Honestly if I were going to go down that road I would try hard to just abandon the SQL RDBMS and put everything in the other database, since Lucene and its imitators are capable of far more than just find-text-in-document queries. The pain of duplicating everything, using two query languages, two document representations (in addition to the object representation in Ruby) and writing application-tier query correlation makes the double-DB approach seem very unwise.

It makes far more sense to me to use the SQL RDBMS’s full text search facility, even if there’s a 2x or 3x read performance penalty, because the conceptual simplicity of having one powerful storage tier (instead of two halves cobbled together) eliminates a ton of ugliness in the application, and the SQL RDBMS is going to get clustered for reads anyway. Nevertheless, even if I’m wrong about this case (putting search in the SQL RDBMS instead of in a separate server), there are other cases for needing a smart database that gives you exactly the results you need and lets you push data logic into the data tier.

So, what do I suggest? Abandon Rails? Nope. I still like Ruby a lot, and find Rails very useful. I just think that ActiveRecord needs to support the low-level and middle-level abstractions better.

Specifically, supporting bind variables (either by exposing that sanitize_sql function, or better yet by making drivers and connection adapters support bind variables for real) would make the find_by_sql, select_all, and exec approaches to low-level SQL query execution less painful.

More difficult, and substantially more valuable, would be refactoring ActiveRecord::Base to split it up in the way I described above: association descriptions and unmarshalling code separate from query building code separate from SQL execution and result retrieval code. All of this could remain hidden for most users under the same old slick high-level API, but for advanced requirements, the ability to fiddle with the SQL and still use the built in high-level unmarshalling code to create object graphs from flat result sets would be very powerful, and useful.

I looked at one alternative to ActiveRecord, called Sequel, which overlaps with ActiveRecord only partially. It is a query builder and lazy result proxy, which is actually what I thought ActiveRecord would do when I first started working with Rails. The proxy design means that you can either keep adding constraints or start fetching results, from the same Dataset class. This seems like a pretty good approach, though I haven’t really looked closely to make sure it would fit what ActiveRecord needs.
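The lazy proxy idea can be sketched with a toy class. LazyDataset below is a hypothetical stand-in that filters an in-memory array; it is not Sequel’s API, and a real query builder would accumulate SQL rather than Ruby blocks.

```ruby
# Toy stand-in for a lazy query proxy over an in-memory array; this is
# NOT Sequel's API, just the general shape of the idea.
class LazyDataset
  include Enumerable

  def initialize(source, filters = [])
    @source  = source
    @filters = filters
  end

  # Adding a constraint returns a new dataset; nothing is fetched yet.
  def where(&condition)
    LazyDataset.new(@source, @filters + [condition])
  end

  # Results are only computed when something iterates.
  def each(&block)
    @source.select { |row| @filters.all? { |f| f.call(row) } }.each(&block)
  end
end

posts = LazyDataset.new([
  { :author => "jamie", :rating => 5 },
  { :author => "jamie", :rating => 2 },
  { :author => "andy",  :rating => 4 }
])
by_jamie = posts.where { |p| p[:author] == "jamie" }
good     = by_jamie.where { |p| p[:rating] >= 4 }

puts good.map { |p| p[:rating] }.inspect
```

The same object type is used both for narrowing the query and for reading results, which is the property the paragraph above describes.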

What Sequel lacks, though, is the unmarshalling side: turning a 2-dimensional (rows of columns) result set into a complex object graph (customers with orders with order lines with products from suppliers stored in warehouses), with user-controlled eager or lazy loading behavior. Ruby is well-suited to a design that would allow user-specified code (i.e., a block) to decompose each row into the object graph associated with that row, leaving the remaining associations on those objects to be lazily provided via future queries.
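Here is a minimal Ruby sketch of that block-based design. The build_graph helper, the Customer and Order structs, and the sample rows are all hypothetical; lazy loading of the remaining associations is not shown.

```ruby
Customer = Struct.new(:id, :name, :orders)
Order    = Struct.new(:id, :total)

# Hypothetical helper: walks a flat result set and lets the caller's
# block decide how each row folds into the object graph.
def build_graph(rows)
  graph = {}
  rows.each { |row| yield graph, row }
  graph.values
end

# Flat rows, as a join of customers and orders might return them.
rows = [
  { "customer_id" => 1, "name" => "Alice", "order_id" => 10, "total" => 25 },
  { "customer_id" => 1, "name" => "Alice", "order_id" => 11, "total" => 40 },
  { "customer_id" => 2, "name" => "Bob",   "order_id" => 12, "total" => 15 }
]

customers = build_graph(rows) do |graph, row|
  # Reuse the customer if an earlier row already created it.
  customer = graph[row["customer_id"]] ||= Customer.new(row["customer_id"], row["name"], [])
  customer.orders << Order.new(row["order_id"], row["total"])
end

customers.each do |c|
  puts "#{c.name}: #{c.orders.map(&:total).inspect}"
end
```

Because the block owns the row-to-object mapping, the SQL that produced the rows can be arbitrarily hand-tuned without the unmarshalling code needing to know about it.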

So, I think there is hope for ActiveRecord, definitely. I considered the idea of rolling a minimal Hibernate clone, or some other sort of challenger to ActiveRecord, but I don’t think that ActiveRecord is broken beyond repair. I think the shortest path to a badass Ruby ORM is through improvements (refactoring and abstraction) to ActiveRecord.

So, if you’ve read this far, you probably care about these issues. Here’s my call to action: Please help me make ActiveRecord less like VB and more like Delphi. Who else is interested in helping me with this effort? Are there alternatives that I’ve missed, or components that could be integrated into ActiveRecord to make it better?

Making Rails’ rake:test not drop your PGSQL database

Jamie Flournoy — Sat, 22 Sep 2007 23:18:19 +0000

Let’s say you’re using Rails with PostgreSQL and the TSearch2 built-in full text search engine.

Did you notice that every time you run rake test, that depends on db:test:prepare, which depends on db:test:clone, which depends on db:test:purge, which drops the database and creates it again?

Along with your dropped database go the TSearch2 functions that wrap the C libraries that do the actual work. So, in effect, you no longer have TSearch2 installed. (“Uh… I kinda needed those…”) Presumably if you have tests that exercise search functionality, they will always fail because the TSearch2 functions are gone by the time the tests run.

Since these functions are just wrappers for C libraries, which are not subject to the PostgreSQL plugin security model, PostgreSQL wisely prevents any old user from getting at them. Only a superuser can create them, which means you can’t just add the tsearch2.sql script to a Rails DB migration and get them back each time that way.

Options include:

1. Making a setuid script (or a script with the postgres user’s password embedded) that the migration can run, which will log in as the postgres user, run the tsearch2.sql script, and grant permissions to your Rails DB user to use them
2. Changing the rules of the PostgreSQL instance you’re using to allow any old user to mess with C libraries (a pretty big security hole, but maybe you don’t care about that on your development DB on your laptop), and putting tsearch2.sql in a migration. (I dunno if this is even technically possible, but it seems like such a bad idea that I’m not even bothering to look.)
3. Using Rake to tell Rails not to drop and re-create your database for each test run, but instead to migrate back to 0 and then re-migrate to the latest version.

I chose #3. Here’s the code, which is in my Rakefile:

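# Don't drop the test database; migrate it back to version 0 instead.

# Add a helper that removes a task from the task manager's internal @tasks hash.
Rake::TaskManager.class_eval do
  def delete_task(task_name)
    @tasks.delete(task_name.to_s)
  end
end

# Remove the stock purge task so the definition below replaces it
# instead of adding a second action to it.
Rake.application.delete_task("db:test:purge")

namespace :db do
  namespace :test do
    task :purge do
      ActiveRecord::Migrator.migrate("db/migrate/", 0)
    end
  end
end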
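
The stock task has to be removed before it is redefined because Rake treats a second definition of an existing task as an addition: the new action is appended to the old one, so the original drop-and-create would still run. Here is a minimal sketch of that behavior on a throwaway task called demo:purge (a made-up name for illustration). It uses Rake::Task#clear, which more recent Rake versions provide, as a stand-in for the delete_task helper; that substitution is an assumption of the sketch, not what the Rakefile code above does.

```ruby
require 'rake'

calls = [] # records which actions ran, so no database is needed

# Defining the same task twice appends the second action to the first.
Rake::Task.define_task("demo:purge") { calls << :drop_and_create }
Rake::Task.define_task("demo:purge") { calls << :migrate_to_zero }
Rake::Task["demo:purge"].invoke
appended = calls.dup # both actions ran, in definition order

# Clearing the task first means only the replacement action remains.
calls.clear
Rake::Task["demo:purge"].clear    # drops existing actions and prerequisites
Rake::Task["demo:purge"].reenable # allow the task to be invoked again
Rake::Task.define_task("demo:purge") { calls << :migrate_to_zero }
Rake::Task["demo:purge"].invoke
replaced = calls.dup
```

After the first invoke, appended holds both symbols; after the clear and redefinition, replaced holds only :migrate_to_zero.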

In the rare case where you really wanted to drop and re-create your test database, just use the command line PostgreSQL commands dropdb and createdb, and then (still as the postgres user) run the tsearch2.sql script.

Then resume normal Rails rake:test use, until such time as you irrevocably hose your database (really?) whereupon you’ll need to use the dropdb/createdb method again.

Rails, Fixtures, the Test DB, and Test::Unit

Jamie Flournoy — Fri, 03 Aug 2007 02:37:22 +0000

From what I’ve seen, Rails’ weakest features lie in the way it prepares the test database and test data, and Ruby’s Test::Unit isn’t much better than the awful but ubiquitous JUnit that Java developers are accustomed to. I set out this week to impose my preferences on Rails in this area, and that took some effort. Here’s what I did.

When I’ve implemented (in Java) what Rails does for database preparation, I did it like this:

Create the test database exactly the same way that the developers’ databases are created: by running the exact same code, pointed at a different database.
Load the appropriate sets of data for the test database. “Sets” is plural on purpose; most non-trivial databases include code tables, which constitute base data which are essentially part of the database design itself. Then, test code will want a fixed set of known test data to act upon, so that tests can measure whether the code did the right thing given the test data (the right inputs yield the right outputs).
Run the individual tests, providing some way of assuring that changes to the test data are undone before the next test.

At first (11 years ago) I used a hand-maintained SQL DDL file to create the databases. Later I split that up into one file per table, and made a list of the proper ordering of tables during creation (reversible for deletion). Later still, with Hibernate, I ditched the DDL and let a higher-level ORM description of the table do the schema generation (which was painful in Hibernate since it wasn’t made to do that except from the command line, but it was possible to hack it into a state of relative beauty). The test data was always loaded from a bunch of text files that were easy to hand-edit (as opposed to a bunch of SQL INSERT statements).

Running the test with assurance of pristine test data was more or less horrific in a J2EE+Hibernate 2.x environment. The design of Hibernate and JUnit made it difficult to wrap tests in transactions, and the version of MySQL that we were using had no transactional storage engines available at all (MyISAM? Thanks, Red Hat!), so I ended up falling back on an intrusive but relatively high-performance design that required tests to declare if they were going to alter the test data, so that the test teardown method knew it had to reload the test data. Since we were waiting for Hibernate 3.0, MySQL 5.x, and a few other things to become part of our architecture, I left that solution in place and ended up moving on to a new job before fixing it.

Rails initially seemed to nail this problem: the test database is automatically made based on the development database; the data is loaded from YAML files called Fixtures, which feature a very simple and straightforward API, and tests run inside individual transactions. Nice!

Except not. Fixtures are loaded by specifying the tables for which you need test data loaded, and this is done in each Test::Unit::TestCase class, of which I have several hundred. They are stupidly reloaded each time you say a given TestCase is going to use them. Worse, the tables you’re using for this TestCase are emptied out using SQL DELETE statements, but if there is test data in other tables that has foreign key dependencies on the data being deleted, fixture loading will fail. (Rails was not designed for FKs to be enabled in the database, so encountering this bug is a side effect of enabling them via the plugin.) This deletion behavior is pointless in light of transactions wrapping each test, but if you’re using MySQL MyISAM you can’t use transactions, so it needs to be there for people using MyISAM, which is to say, crazy people who care not for their data.

Since Test::Unit, like Java’s JUnit, lacks a hook for the beginning or end of a given TestCase class’s set of tests, there’s no way to accumulate a list of fixtures created and then delete them and/or reload them at the end. That would at least allow you to undo the creation of the fixtures so that the tables were all empty before the next set of fixtures were loaded. Sadly, Test::Unit is not that clever.

I initially fixed this problem a couple of months ago, using a hack that simply refuses to delete and re-create (test data) fixtures if they’re already loaded. That works since the fixture data progressively accumulates and is always clean since changes within tests are rolled back at the end of those tests.
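
The idea behind that hack can be sketched in a few lines. This is not the actual patch, just a standalone illustration; the OnceOnlyFixtureLoader class and its load_log are hypothetical names, and the place where a real loader would insert the YAML rows is marked with a comment.

```ruby
require 'set'

# Hypothetical stand-in for a fixture loader that never deletes or reloads.
class OnceOnlyFixtureLoader
  attr_reader :load_log

  def initialize
    @loaded = Set.new # fixture tables already loaded in this process
    @load_log = []    # order in which tables were actually loaded
  end

  # Loads each named fixture table at most once; later requests are no-ops.
  def load(*table_names)
    table_names.map(&:to_s).each do |name|
      next if @loaded.include?(name)
      @load_log << name # a real loader would insert the fixture rows here
      @loaded << name
    end
  end
end

loader = OnceOnlyFixtureLoader.new
loader.load(:users, :orders)   # first TestCase asks for two tables
loader.load(:users, :invoices) # second TestCase: users is skipped
```

This only stays correct as long as every test runs inside a transaction that is rolled back, since nothing ever restores a fixture table once it has been loaded.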

Upon adding a trigger to a Rails migration and then writing a test case that checked to see if it was working, I found the true ugliness. Rails has Migrations, which in my opinion are an excellent feature that works well and a more useful generalization of my ordered-list-o-tables and set of table-definition text files. But… when creating the test database, Rails uses the SchemaDumper’s schema.rb output to create it, instead of using migrations. Talk about principle of least astonishment… I was pretty astonished. We have migrations, which is how we create databases! Great! So let’s use this other thing instead.

Also, SchemaDumper does not in fact dump the schema; it dumps tables and indices only. The RedHillOnRails foreign keys core plugin adds foreign key dumping to this output, but forget about check constraints, triggers, and stored procedures. Those schema objects are ignored, so your test database is not the same as your development (or production) database. Whoops.

I thought of about a dozen ways to deal with this:

Abandon triggers and do it all in Rails, make a TODO to fix this later, and get on with feature implementation
Add code to the tests to check for the missing schema objects and add them if missing (eww)
Replace the db:test:prepare Rake task with one that tells PostgreSQL to copy the database as-is
Replace the db:test:prepare Rake task with one that tells PostgreSQL to use pg_dump instead of ActiveRecord::SchemaDumper
Hack the PostgreSQL-specific code that SchemaDumper uses to look at the pg_proc and pg_trigger system catalogs and use code similar to the RedHillOnRails Core plugin to dump stored procs and triggers into schema.rb also
Just dump using pg_dump into a temp file and parse the output and add that to schema.rb (ewwwwwww)

etc. etc.

I finally found the Migrate Test DB Rake Plugin which simply uses your Rails Migrations to create the test database. Lovely. Except I now had some new problems.

rake db:test:purge for PostgreSQL does dropdb/createdb on the test database to empty it out. That creates a database with no built-in procedural languages, so stored procs won’t work. Adding the language to that database is a DB superuser task, so it couldn’t be done inside of Rake. Fortunately I found that I could solve this via “createlang plpgsql template1” which puts plpgsql in the template database used for creating new databases. Easy.
My never-delete-fixtures code got into a fight with my base-data-loader code. They both used Fixtures to load data, and so the base data fixtures made the never-delete-fixtures code think that the test data was already in. So the tests failed due to lacking test data.

I fixed this initially by modifying my BaseDataLoader class to not load base data if RAILS_ENV is ‘test’, and added code to the Migrate Test DB Plugin to set RAILS_ENV to ‘test’ right before running the migrations on the test database. This is a workaround, really, because it still leaves the base data either missing entirely, or duplicated.

Then I switched to the Preload Fixtures plugin which is nice but still leads to FK-related errors. It grabs the fixture names from your test/fixtures directory and loads all the files it finds, in the order it found them. That fails since alphabetical order and the required table creation order are different in my case.

Fortunately since I’m using the Migrate Test DB Plugin I can just observe the order in which tables were created and tell the Preload Fixtures plugin to do its work in the same order. This is in my environment.rb because that’s where all my project-wide monkeypatching currently lives. (Cleaning that up and maybe plugin-izing it is a TODO for the future.)

# Due to FKs, gotta specify ordering of fixture preloading here. Why not let migration create_table statements do it?
# (depends on Migrate Test DB Plugin being present; is here for the benefit of the preload_fixtures plugin)
module ActiveRecord::ConnectionAdapters::SchemaStatements
    alias create_table_orig create_table
    def create_table(table_name, options = {}, &block)
        fixture_filename = "#{table_name}.yml"
        if File.file?(File.join(RAILS_ROOT, 'test', 'fixtures', fixture_filename))
            ENV['FIXTURES'] = [ENV['FIXTURES'], fixture_filename].compact.join(',')
            # puts ENV['FIXTURES']
        end
        create_table_orig(table_name, options, &block)
    end
end

Sadly if you run “rake test” it runs ruby as a subprocess in order to do “rake test:units”, “rake test:functionals”, and “rake test:integration”. That means that the migrations are run once (before the tests), but that the preloading is done three times. The second and third times through, though, the preloading fails since it’s trying to delete-then-create each table’s fixtures in table-creation order. So, a patch to preload_fixtures.rb is needed, to ensure that deletes are done first, in the reverse order of table creation. Here’s what the new preload! method looks like:

def self.preload!
    puts "PRELOADING FIXTURES..."

    require 'active_record/fixtures'
    ActiveRecord::Base.establish_connection(:test)
    fixture_filenames = (ENV['FIXTURES'] ? ENV['FIXTURES'].split(/,/) : Dir.glob(File.join(RAILS_ROOT, 'test', 'fixtures', '*.{yml,csv}')))

    # delete first, in reverse order
    fixture_filenames.reverse.each do |fixture_file|
        table_name = File.basename(fixture_file, '.*') # hack; might not be correct if class name != camelized table name
        ActiveRecord::Base.connection.delete "DELETE FROM #{table_name}", 'Fixture Delete'
    end

    # then load, in table-creation order
    fixture_filenames.each do |fixture_file|
        Fixtures.create_fixtures(File.join(RAILS_ROOT, 'test', 'fixtures'), File.basename(fixture_file, '.*'))
    end
    puts "DONE. Loaded #{Fixtures.all_loaded_fixtures.keys.length} fixtures."
end

I’m not sure, but I think there’s an assumption in there that the table name is the same as the fixture name. My patch also makes that assumption, which is true in the case of my project. But in your project you might not have done that, so further hackery might be needed.
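
To see why the deletes have to happen first and in reverse creation order, here is a toy model with no real database involved. ToyDb, its single users/orders foreign key, and the reload helper are all hypothetical names made up for this sketch; it only mimics the way a foreign key rejects deleting a parent row that a child still references.

```ruby
# Toy in-memory stand-in for a database with one foreign key:
# rows in "orders" reference rows in "users".
class ToyDb
  class ForeignKeyViolation < StandardError; end

  PARENT_OF = { 'orders' => 'users' }.freeze # child table => parent table

  def initialize
    @rows = Hash.new(0) # table name => row count
  end

  def insert(table)
    parent = PARENT_OF[table]
    raise ForeignKeyViolation, "#{table} needs rows in #{parent}" if parent && @rows[parent].zero?
    @rows[table] += 1
  end

  def delete_all(table)
    child = PARENT_OF.key(table)
    raise ForeignKeyViolation, "#{child} still references #{table}" if child && @rows[child] > 0
    @rows[table] = 0
  end

  def count(table)
    @rows[table]
  end
end

# Two passes: delete children before parents, then insert parents before children.
def reload(db, tables)
  tables.reverse.each { |t| db.delete_all(t) }
  tables.each { |t| db.insert(t) }
end

creation_order = %w[users orders] # order in which the migrations created the tables

db = ToyDb.new
reload(db, creation_order) # first run: tables start empty
reload(db, creation_order) # second run: still fine with rows already present

# The one-pass version (delete then create, table by table) trips over the FK.
naive_failed = begin
  creation_order.each { |t| db.delete_all(t); db.insert(t) }
  false
rescue ToyDb::ForeignKeyViolation
  true
end
```

The one-pass loop fails on its first delete because orders still points at users, which is the same shape of failure as the second and third preloads described above.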

So, it all seems to work correctly now, and I’m back to working on my trigger code. If this seems like it took a lot of effort, it did, but I think it’ll be worth it once I start using stored procs and triggers more. That phase begins now.