postgresql – Pervasive Code

Hiding a CLI-only user account in Mac OS X 10.5 Leopard

Jamie Flournoy — Thu, 06 Nov 2008 00:06:52 +0000

A year and a half ago I installed the excellent PostgreSQL via MacPorts, and had to create a user account manually. Annoyingly, this postgres user shows up in the GUI login screen and Fast User Switching menu under Leopard. I found a fix today.

I dug around a few months ago and found a couple of ways to solve this annoyance, but I didn’t like either of them very much. Changing the shell to /usr/bin/false works, but then you can’t su to postgres. Changing the UID to <500 and enabling the plist option to hide <500 UIDs seems like a kludge. I was looking for a minimally invasive tweak that would just make the account not show up in that menu. Breaking the account so I can’t use it the same way, or altering systemwide behavior, seemed drastic given that there are several other system accounts that already have the desired behavior. Today I decided to fix this and looked harder at what dscl would tell me about other hidden accounts. The solution that worked and didn’t seem icky to me was this:
sudo dscl . append Users/postgres Password '*'

That sets the password string to *. This allows me to continue to sudo su -l postgres whenever I feel like it, but it isn’t shown as an account in the GUI. Hooray!

Sphinx Search init script for CentOS 5.1

Jamie Flournoy — Mon, 14 Apr 2008 06:18:11 +0000

Sphinx search is pretty new, and as a result I was unable to find a nice convenient package for it for CentOS 5.1. This is problematic for two reasons: there is no init script included with the source tarball, and keeping the index up to date is the sysadmin’s and developer’s problem, since Sphinx cannot be configured to simply update the index when the data changes.

The second problem (updates) is one I punted on; for now I have a cron job rebuilding the entire index every 5 minutes, which will probably be replaced with something smarter and lower-latency at a later time.

The first problem (no init script) is easy to solve, but apparently nobody has done so for CentOS 5.1 and published it. So, here is my CentOS 5.1 init script for the Sphinx Search server. It is known to work with version 0.9.8-rc2.

BTW, the alternative solution to the problem of a daemon not having a System V init script is to just put some extra junk in /etc/rc.local. That is the quick and dirty solution, and is undesirable for several reasons:

– You can’t easily stop or restart the service, because it’s not a service as far as the OS knows; it’s just some junk in a script that got run a while ago.
– You can’t use chkconfig or its GUI cousin with the creative name, The Services Configuration Tool, to control it and tie it to specific runlevels.

(System V runlevels and init scripts are useful, even if you don’t need all of the runlevel functionality. The stop/start/restart PID stuff is useful by itself.)

Document Databases – New Kids on an Old Block

Jamie Flournoy — Sat, 16 Feb 2008 06:16:16 +0000

There’s a new crop of databases that has appeared lately, under the rubric of “document databases”, and there’s quite a lot of enthusiasm for them given that they tend to be slow and very feature-poor compared to the SQL RDBMSs that are the typical persistence mechanism for web applications. What’s mainly appealing about them is that they are easy to use, and theoretically quite scalable, compared to the traditional “one big SQL database server” approach.

But the simplicity of these new document databases is tied to some significant trade-offs in the current implementations. And so I’m going to try and put them into context with some of the other data persistence options that have been around for a while, but which aren’t currently getting as much hype as document databases. Hopefully that will help all of us to understand how these new and evolving document databases can be useful to us, and what the alternatives are in areas where they may not fit well.

I’d first like to try and deconstruct a false dichotomy that I’ve noticed being used in arguments in favor of some of these new databases. That dichotomy casts SQL RDBMSs (such as MySQL, Oracle, PostgreSQL, MS SQL Server, etc.) as big, complicated, and hard to scale, compared to document databases which are small, simple, and easy to scale. The main problem with this dichotomy is that there are far more choices than just two. Each database product embodies a set of design choices, and although there is some clustering of decisions into general types of product, the boundaries are a lot fuzzier than a product evangelist might have you believe.

Furthermore, trade-offs made early in a product’s lifetime may have been altered over time. A good example is the no-longer-true characterization of MySQL being fast but not reliable, vs. PostgreSQL being reliable but not fast. In reality, recent releases of both products are moving toward being very fast and very reliable.

Because there are so many database products out there, I’m going to have to fall back on a small subset of example products, as illustrations of issues that may or may not exist in a particular product you’re looking at. The key for an application architect evaluating a persistence mechanism is to understand the abstract concepts, and to figure out which ones matter to your current application. That will let you select a product (or a combination of products, including some custom code perhaps) that suits you. As with all aspects of architecture, there is no cookbook you can use, and in six months all the options will change. You have to analyze your needs first, and then get your hands dirty with evaluation second.

So, let’s start deconstructing some of the examples into design decisions.

SQL RDBMS

This is the traditional choice for web applications. You get a remote query language, so you can specify in great detail what you want to retrieve. You get very precise control over data representation, including some that may be burdensome if you’re not concerned with internationalization, multiple currencies, and time zones. You get a lot of control over performance and a lot of information about how things work at a low level inside the database, from indexes to data page size to transaction logs and checkpoint frequency. Most of them include ACID transaction support, which is nice for reliability, but which usually obligates you to implement a backup scheme or else they will eventually stop accepting new transactions and/or run out of disk space.

Some of the design drawbacks include:
– the use of local storage, for performance and for a guarantee that data has been committed to disk
– the use of a high level query language (SQL) and a query optimizer, so the specific process the database uses to satisfy your query is not in your face (and thus may be surprisingly inefficient, if you aren’t familiar with how it works)
– the use of a proprietary network protocol, which means that you need a special client library for just that one product, which may or may not implement all of the features that the server offers (such as encryption)

However, there are some variations that make the edges of this category fuzzier. There are ACID-compliant SQL RDBMSs that have no network layer, and are very lightweight; in some cases they may not even support concurrent access. Examples include HSQLDB and SQLite.

Networked Filesystem

This is typically used for accessing shared file servers, or allowing “thin client” behavior so that users can get to their own environment and data from any given endpoint. Examples include NFS, SMB, AFP, GFS, and quite a few others. The main advantage of these systems is that the remote filesystem is represented as being directly connected to the local system, while also being available to other users or other client systems who are connected to the same remote system.

Trade-offs of this design include:
– performance on a LAN may be good, but over a slow, high latency link may be very poor
– there is usually no ACID transaction support, just file locking
– file ownership and permissions can be very hard to manage
– if the file server goes offline, the entire local system may hang or crash

In particular, content indexing, complex querying, and data integrity features are generally not offered. You can layer those on top, but such a layer will not necessarily work if it was originally designed for local filesystems. I’m thinking mainly of DBM files; they’re fast and easy to use, but not all of them will work properly with files located on a network filesystem.

Also, directory scanning performance can be very poor if thousands of files are located in a single directory; listing all files starting with the letter T may actually require the entire directory to be retrieved and filtered on the client side.
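A minimal Ruby sketch of that last point, assuming a filesystem API that offers no server-side filtering: the client reads every entry in the directory and only then filters locally. The helper name entries_starting_with and the file names are hypothetical.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: list entries whose names start with a prefix.
# Dir.children returns every entry in the directory; the filtering
# happens afterwards, in this process. On a network filesystem that
# means the full listing crosses the wire even if few names match.
def entries_starting_with(dir, prefix)
  Dir.children(dir).select { |name| name.start_with?(prefix) }.sort
end

Dir.mktmpdir do |dir|
  %w[Tango.txt Tiger.txt apple.txt zebra.txt].each do |name|
    FileUtils.touch(File.join(dir, name))
  end
  puts entries_starting_with(dir, "T").inspect
end
```

With four files the cost is invisible; with thousands of files on a slow link, the select step is cheap and the listing step is where the time goes.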

Variations include FTP and WebDAV, which are not intended to simulate a local filesystem, but instead have filesystem-like semantics. Some operating systems will mount them as remote filesystems anyway, for ease of use for viewing and copying files, but it’s not possible to lock a remote file, so safe multiuser access is not possible.

Object Database

Object databases offer a direct representation of an application’s data in almost exactly the same form that exists in memory. Whereas a relational database stores data in tabular form regardless of the particulars of a client application, an object database stores data in the same form that the application uses. The exception to this is in the representation of references to other objects; at some level these encapsulate pointers to the memory address of the data in the application’s address space, and this must be substituted with a pointer to the location in the database’s storage system before storing it.

An object database will not handle time zones or internationalization, but nor will it complicate those matters if the application handles those already. The data is simply stored as-is. Also, object databases typically do offer ACID transaction support.

Aside from the conceptual simplicity of the similar data model, one major bonus of an object database is that the use of pointers makes data retrieval extremely fast; rather than parsing a query and searching indexes for the on-disk location of a desired object, the application can simply ask the database for it by its reference.

The big trade-offs here are twofold:

First, the lack of indirection through a query language and an indexing system means that the application developer must anticipate all of the queries that will be needed, and incorporate collections into the object graph that will be used to get to the stored objects. Otherwise the application’s object model will need to be updated frequently to include these later.

Second, altering the application’s object model and then retrieving data stored using older code can be very complex. Because the stored objects and application’s code are out of sync in this situation, additional application code must be written to convert existing stored objects into the new representation and persist them back to the database.

Combine those two trade-offs, and it’s clear that the performance benefit comes with the price of considerable additional application development effort.
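The first trade-off can be sketched with Ruby’s PStore, used here only as a small stand-in for an object database (it is a transactional, file-backed object store, not a full product). The Author and Post classes, the :authors_by_name root, and both helper names are hypothetical.

```ruby
require "pstore"
require "tmpdir"

# Hypothetical application classes, stored in the same shape they have in memory.
Author = Struct.new(:name, :posts)
Post   = Struct.new(:title, :author)

# Write an object graph to the store inside a transaction.
def store_blog(path)
  store = PStore.new(path)
  store.transaction do
    jamie = Author.new("Jamie", [])
    jamie.posts << Post.new("Document Databases", jamie)
    # No query language: the application maintains its own "index",
    # a hash from author name to Author, reachable from a named root.
    store[:authors_by_name] = { jamie.name => jamie }
  end
  store
end

# Read by following references from the root instead of running a query.
def title_of_first_post(store, author_name)
  store.transaction(true) do   # true = read-only transaction
    store[:authors_by_name].fetch(author_name).posts.first.title
  end
end

Dir.mktmpdir do |dir|
  store = store_blog(File.join(dir, "blog.pstore"))
  puts title_of_first_post(store, "Jamie")
end
```

Looking up a post by title would need another collection that was never built, or a walk over every author’s posts, which is the kind of query that has to be anticipated up front.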

Also, an object database is by nature bound to a single application, rather than being a point of integration between multiple applications. Any attempt to create a shared code library that manages access to the object database introduces potential “impedance mismatches” between each application and the shared object model, which reduces the simplicity that an object database offers in comparison to a relational database.

Document Database

Arguably a rejection of relational technology, document databases offer several advantages compared to the three classes of database previously mentioned. Documents need not be internally represented as a flat set of key-value pairs as seen in a SQL RDBMS; for example, the document may be an XML document. Queries are possible and may even use a standardized query language to express the conditions for matching desirable documents. Documents may have internal structure that is understood by the database server (so that it can query against the document’s contents), as in the case of an XML database, or they may have an external metadata structure consisting of key-value pairs, or both.

The drawbacks of this type of system derive from the fact that it is similar in many ways to each of the other database types.

Querying ability means that the server must incorporate some kind of indexing system for performance reasons, which means that the document must either internally or externally conform to some sort of standard data model. Some document database systems simply omit querying except by the document’s main ID (similar to a SQL primary key, a network filesystem’s filename, or an object database’s storage reference). This has the same drawback as with an object database: the application must take on the responsibility of managing querying, searching, and sorting itself, across the network.

The similarity to a networked filesystem may also have scalability benefits; if referential data integrity is not provided, then documents can be located on any remote system, and partitioning is simple. However, distributed queries will still need to be managed at the application level, and potentially any transaction becomes a distributed transaction, since a changed document on one server may be referenced from any number of documents in any number of other servers. (This is also true of the other systems, though.)
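A rough Ruby sketch of that trade-off, using in-memory hashes as stand-ins for separate servers. The PartitionedStore class and the sample documents are hypothetical; no real document database’s API is being shown.

```ruby
require "zlib"

# Hypothetical in-memory stand-in for a set of document servers.
class PartitionedStore
  def initialize(shard_count)
    @shards = Array.new(shard_count) { {} }   # one Hash per "server"
  end

  def put(id, doc)
    shard_for(id)[id] = doc
  end

  # Lookup by ID touches exactly one shard.
  def get(id)
    shard_for(id)[id]
  end

  # Anything other than lookup by ID has to visit every shard and
  # merge the results in the application.
  def scan(&condition)
    @shards.flat_map { |shard| shard.values.select(&condition) }
  end

  private

  # A stable checksum of the ID picks the shard, so any client can
  # locate a document without consulting a central directory.
  def shard_for(id)
    @shards[Zlib.crc32(id) % @shards.size]
  end
end

store = PartitionedStore.new(4)
store.put("post-1", { "title" => "Leopard", "tags" => ["mac"] })
store.put("post-2", { "title" => "Sphinx",  "tags" => ["search"] })
store.put("post-3", { "title" => "tsearch", "tags" => ["search", "postgresql"] })

puts store.get("post-2")["title"]
puts store.scan { |doc| doc["tags"].include?("search") }.map { |doc| doc["title"] }.sort.inspect
```

Nothing here keeps a reference from one document to another consistent across shards; that, and any multi-document transaction, would have to be handled above this layer.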

Still, because a document database is closest to a networked filesystem, it may be suitable for simple requirements where a relational database or object database seems too complex and slow, but where the bare-bones functionality of a networked filesystem is too simple. The compromise of a binary file with simple additional metadata or properties attached “out of band” with the data itself, or of a structured document format that is flexible but not ideal, may be acceptable if sophisticated querying is possible as a result.

It’s hard to provide good examples of document databases, because the category is very broad, and includes a lot of simple projects that provide just a little functionality above and beyond what WebDAV already provides. But a few that I’ve heard of recently include CouchDB, SimpleDB, and RDDB.

CouchDB imposes a simple key-value data structure on document content, but no internal document schema or grouping of documents by type. It does offer indexing and querying. Notably, it also offers transparent replication, at a field level (changes to two different fields in two copies of the same document are synchronized to both copies).

Amazon SimpleDB similarly imposes a key-value structure, though one key can have multiple values. It too offers a query language, and indexing. Because it’s built on Amazon’s S3 service, transparent replication is also included.

Future Prospects

The main complaints that I’ve seen directed at relational databases involve two things: one, the difficulty of scaling them up, and two, the restrictive data model. Sharding (a.k.a. data partitioning) is the usual remedy for scaling problems, but that requires the elimination of referential integrity in the SQL RDBMSs I’m aware of, and requires distributed transactions in order to preserve ACID transaction properties across denormalized copies of the modified data.

Interestingly, these issues are the same across the board, regardless of database type. Either you abandon transactions, or you move them up to a level that’s aware of the data partitioning. I see no reason why these and other high-end RDBMS features couldn’t be offered in a proxy layer that possibly even contains the query processing as well.

One way to approach this is to build a closed system with a given set of features and a limited API that permits a single query language. This seems to be the way that CouchDB and SimpleDB are approaching the problem.

Another way to approach this problem is to simply say that the storage back-ends of relational databases could be enhanced to incorporate built-in transparent partitioning. I don’t think that SQL RDBMSs will abandon the concept of a table schema any time soon, but there’s no reason why products that already include XML query and indexing capabilities and free-form natural language indexing (a.k.a. Full Text Search) couldn’t also include indexing capabilities for simple key-value structured data inside a single column of semi-structured data, giving most of the same functionality as a document database.
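As an illustration of what indexing key-value data inside a single column could mean, here is a small Ruby sketch that builds an inverted index over a free-form properties column. The rows, the column names, and the build_property_index helper are all hypothetical; no existing RDBMS feature is being described.

```ruby
# Hypothetical table: each row has fixed columns (:id, :title) plus a
# free-form :properties column holding arbitrary key-value pairs.
ROWS = [
  { id: 1, title: "Invoice",  properties: { "status" => "paid", "region" => "EU" } },
  { id: 2, title: "Invoice",  properties: { "status" => "unpaid" } },
  { id: 3, title: "Contract", properties: { "status" => "paid", "signed_by" => "Ana" } }
].freeze

# Build an inverted index from [key, value] pairs to row IDs, so a
# lookup on a property does not have to scan every row.
def build_property_index(rows)
  index = Hash.new { |hash, pair| hash[pair] = [] }
  rows.each do |row|
    row[:properties].each { |key, value| index[[key, value]] << row[:id] }
  end
  index
end

index = build_property_index(ROWS)
puts index[["status", "paid"]].inspect
```

The rows keep their ordinary schema for the fixed columns, while the properties column can vary from row to row and still be searched through the index.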

Given that, the remaining limitation of SQL RDBMSs is the requirement that the back-end storage system be located on a disk drive physically connected to the same server, and that the storage be touched only by processes running on the same server together so that they can coordinate access to the data.

For now, though, document databases look like they can be very useful for certain types of persistence requirements; I don’t see them as a viable substitute for everything that a SQL RDBMS does, but that perception is limited mainly by the choices at hand. CouchDB looks like the most generally useful option so far, though I’d like to see the addition of optional schemas (opt-in on a per-object level, as seen in LDAP), and/or a pluggable language option. (It seems that everyone using a document database is also enamored of their application language and dislikes the idea of putting logic in the data tier unless it’s written in the same language.)

I welcome your comments – this is mostly a brain dump of things I’ve seen before to help myself and others contextualize the new document databases, and document databases are evolving too rapidly for me to keep up with all of them on my own.

Acts_as_tsearch adjustments needed for PostgreSQL 8.3rc2

Jamie Flournoy — Thu, 24 Jan 2008 20:00:23 +0000

Just a quick note: acts_as_tsearch needs some guidance to work with PostgreSQL 8.3 due to changes in tsearch2 integration.

I’m pretty close to tossing out acts_as_tsearch and rolling my own (trigger-based) tsearch2 plugin, but for now I’m just sticking with it and checking out the PostgreSQL 8.3 release candidate.

I was able to build 8.3rc2 on Mac OS X 10.5.1 from the tarball sources with the instructions in the INSTALL document, no hitches whatsoever. Because I have 8.2 installed via MacPorts, there were no file conflicts (different install directories, data directories, etc.), so all I had to do was shut down the 8.2 server and start the 8.3rc2 server and it was ready to go.

Unfortunately, acts_as_tsearch didn’t work properly the way I had used it with 8.2. The issue appears to be that the tsearch2 locale called ‘default’ is gone, which is what acts_as_tsearch uses if you don’t specify something else. The default locale value is now located in postgresql.conf. Using that value as an explicit locale in the acts_as_tsearch declaration in my model class solved the problem. The code change looks like this:

OLD:
acts_as_tsearch :fields => ["subject","body"]

NEW:
acts_as_tsearch :vector => {:fields => ["subject","body"], :locale => 'pg_catalog.english'}
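To find the value to use for :locale, look for the default_text_search_config setting in postgresql.conf (running SHOW default_text_search_config in psql should report the same thing). A small Ruby sketch that pulls the setting out of the file’s text; the helper name and the sample file contents are made up for illustration, and the comment handling is deliberately simplistic:

```ruby
# Hypothetical helper: extract default_text_search_config from the text
# of a postgresql.conf file. Returns nil if the setting is absent or
# commented out.
def default_text_search_config(conf_text)
  conf_text.each_line do |line|
    setting = line.sub(/#.*/, "").strip   # drop trailing comments
    match = setting.match(/\Adefault_text_search_config\s*=?\s*'?([^']+?)'?\z/)
    return match[1] if match
  end
  nil
end

# Made-up sample contents, not copied from a real installation.
sample = <<~CONF
  #default_text_search_config = 'pg_catalog.simple'
  default_text_search_config = 'pg_catalog.english'   # set by initdb
CONF

puts default_text_search_config(sample)
```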

As I said, acts_as_tsearch is designed to hide the complexity of tsearch2, so it is not well suited to my somewhat complex requirements. I’m ditching it in favor of custom code, which I hope to plugin-ize and release some time later. This change is necessary but might not be sufficient for your own project; still, I hope it helps you get started on upgrading successfully.

Leopard Upgrade Report: Mo’ Features, Mo’ Problems

Jamie Flournoy — Thu, 27 Dec 2007 20:01:40 +0000

(Apologies to The Notorious B.I.G. for the title.)

I upgraded to Mac OS X 10.5 “Leopard” recently. In short, it’s not ready for mainstream use. There are a few nice improvements, but these are balanced by numerous problems that make me wish I had waited until, say, June 2008 or so. If you haven’t upgraded and aren’t sure that you need to, I suggest that you wait a few months, until some of the bugs have been worked out.

The Good Stuff:

Terminal.app now allows tabbed terminals in one window. Terminal also allows window background transparency, which wouldn’t normally be a feature I would care about, except that it makes it quite easy to set transparency at about 85-90% and put a DVD window behind it. This setup is more readable than the arrangement that Trans Lucy provides (it overlays a transparent movie window instead). If you’re like me, sometimes you work better with something busy going on in the background, and this does a good job of leaving text legible while making the movie kinda visible in the background.

Terminal and Safari now allow tabs to be dragged downward off of the tab bar to create a new window for that tab. It looks pretty cool and can be useful at times, mostly in Safari when you’re browsing a bunch of sites and want to pick one and background-load a bunch of links off of that page.

The new Finder fixes some of the old Finder bugs, though it brings new ones. (More about that below.) Basic problems like window resizing not shrinking to fit only the icons that were there, scroll bars displaying even when they are not needed, etc. are now gone. Desktop icons for volumes remember where you left them instead of being automatically placed every time they are mounted. They also position themselves so as to avoid being under the Dock if hiding is off (they don’t reposition themselves automatically as soon as the Dock position changes, but they are aware of where the Dock overhangs when they are choosing where to align).

Similarly, the window manager is smarter about being aware of where the Dock is, and will try to avoid resizing or zooming windows in ways that put part of them under the Dock.

Preview.app is a bit nicer: you can open a bunch of windows at once, even in one-window-per-file mode, and it stacks the windows in sorted order based on filename. The previous version (bundled with Mac OS X 10.4) opened them in essentially random order and would only open a maximum of 20 windows at once, forcing photos 20 and up into a single window with an image list in the sidebar. I don’t know about you but I take a lot more than 20 photos at any given event and it’s annoying to have to keep track of how many Preview windows I have open in order to keep it from merging the excess ones together into a single window. The new Preview fixes that.

My complex development environment worked fine, with one exception (details below). Ruby on Rails, Mongrel, PostgreSQL via MacPorts, and autotest all worked fine. That was a big relief.

The Bad Stuff:

There are some incompatibilities that caused me problems early on. Most notably, Application Enhancer (used by the handy Instant Hijack feature of the excellent Audio Hijack Pro) causes a blue screen on boot! Here are the removal instructions. Fortunately I did a backup before installing (really it was just my nightly full backup) or I would have been a lot more freaked out about this.

I experienced seriously poor wireless networking performance. By that I mean very erratic ping times and latency even when sitting right next to the base station. Here are the details and fix; basically it’s a preferences file problem, which can be fixed by deleting the old 10.4-era preferences file. I’m not sure how Apple QA missed this one; I’d think that 100% of Tiger-to-Leopard upgrade cases would experience this problem.

GNU Screen, which I use like crazy, changed somehow, and I can’t figure out what they changed from the documentation. My super slick Screen running shell script, which sets up my whole development environment like a character mode IDE, broke. PATH somehow gets stomped and reset to the standard login-shell PATH, but no other environment variables get touched. I was not able to figure out why Screen was doing this in Leopard but not in Tiger, and eventually punted and put a hack in my .screenrc to re-set the PATH the way I wanted it in each screen. (Previously I had set it before creating any screens and it was inherited by each of them.)

There are UI issues that make existing applications not work right. OmniGraffle, which is a truly fabulous drawing program, had some strange input problems with mouse clicks that made it display error messages. A small point release fixed the problem, and the new OmniGraffle 5.0 beta releases have no issues. Unfortunately, OmniGraffle 5 is Leopard-only, and has a new file format that causes version 4 to warn you about possible unknown incompatibility upon opening the file. Given my current attitude of “don’t upgrade until later” about Leopard for other people, I don’t want to trap my drawings in a new Leopard-only format. So I’m still using 4.2.2 and it’s fine.

Other applications are not so fine. Microsoft Excel has some input issues that cause the keyboard input to suddenly go into limbo for no apparent reason. Clicking onto another application’s window and then back onto Excel fixes it; there is no current fix from Microsoft for this. You just have to accept that sometimes your keyboard input goes away and you have to fiddle for a minute to get it back. I think I’ve figured out that it has something to do with AutoFill and the popup menu of possible completions, so it may be possible to train oneself to not ever do whatever it is that causes the keyboard input to go nuts. But I’d rather just upgrade to Office 2008 whenever it comes out.

Adobe Photoshop CS3 has input problems as well. Most notably, the little numeric text input boxes for various tools work once and then go dead. For example, try transforming a selection and specifying, say, 66% width. Then do the same thing again – can’t do it; the typing goes into nowhere and the resizing never happens. The text is even drawn incorrectly in the little text box. This is a well known problem and according to this guy quoting an Adobe spokesperson, it will be fixed soon. As far as I can tell there is no workaround for this.

Carbon Copy Cloner 3.0.1, which I use for backups, doesn’t work properly on Leopard. Its only function in my world is to sync my whole hard disk every night to an external hard disk. It doesn’t do this correctly, unless you run the program and uncheck/recheck the checkbox next to the scheduled task every time you reboot. Supposedly the 3.0.2 update will fix this, so that the scheduled task properly installs itself without user intervention. For now I have a text file that reminds me to do this in my home directory, and I dragged that icon into my account’s Login Items list in System Preferences so I don’t forget to do it.

There are some race conditions apparent with opening items from Samba volumes. I have a home server with lots of files on it, and some aliases to folders on that server are on my laptop. So, when I log in, I can just open the folder and it mounts the server volume and goes right there. In Leopard, if I open more than one such folder alias, all of them fail, the volume fails to mount, and a stack of error dialog boxes piles up in the middle of my screen. Upon dismissing the last such error dialog box, the Finder crashes. The workaround is to mount the server volume first, or to open only one server-based folder alias first, and then open the rest.

Worse, if you have aliases to .dmg (disk images) on the server, and you try to mount more than one .dmg from a server at the same time (select more than one and open them simultaneously), all of them will fail and the diskimages-helper processes will all get stuck. At that point you must reboot in order to un-wedge the situation; kill -9 kills the processes but any future disk images will get stuck trying to mount in the same fashion. The workaround is to never try and open more than one alias to a disk image from a server at once. You can keep a bunch mounted, but you have to open and mount each one separately.

There are also a lot of cosmetic problems, mostly with the new Finder. The translucent menu bar makes it hard to read things on the menu bar, and just plain looks sloppy with any background other than a white one. I didn’t like the stripes in list view in the Finder, and using this hint I turned that display option off by running defaults write com.apple.finder FXListViewStripes -bool FALSE at a command line.

The Finder seems to have completely lost its ability to remember what view type (list, icons, or columns) you wanted for each window; I have frequently found that I’ll set a window to list view and navigate around and when I re-open that folder it’s in icon view, or vice versa. I’m pretty good at detecting patterns of UI behavior so I think I would have grasped any new system Apple has devised for deciding what view a window should have. Whatever the new one is, it sucks.

The Finder still has problems displaying image previews in icon mode. That’s one of my favorite OS X features: the ability to just open a folder full of photos and see previews of all of them at 128×128 pixel size, without using iPhoto or some other photo gallery app. Sadly, the Finder sometimes just stops building previews of the currently displayed set of icons, and you have to scroll around or “select all” and then “select none” to kick it so it notices that it still has some preview-building work to do. I do wish it would cache previews on disk, too; it’s silly that almost every time I visit a folder full of photos, it has to rebuild the previews from scratch. It does cache the previews in memory, but I have digital camera photos from ten years ago, and I don’t understand why the Finder needs to keep rebuilding the previews over and over.

The Finder still resizes the window so that it is partially offscreen when displaying the Toolbar. If you want to search within the current folder, you need to show the Toolbar and enter text in the search box. Okay, but that means that the Sidebar (with all the volumes, on the left side of the window) appears, and for some reason the whole window resizes to preserve the size of the content area, which means that the window grows, substantially. It also preserves the position of the content area, so the window grows on all three sides except the top. If your Finder window was close to being as big as the screen, the new size will be off screen in every direction. You can’t zoom the window without moving it first so that the zoom button is on screen, because the zoom button is now off the left side of the screen. It’s really not very slick, and becomes annoying if you use the Finder very much.

Also, when searching in the Finder using the in-window search function, the scope is always reset to “This Mac” instead of the current folder. Given that there’s a separate Find function, that doesn’t make a lot of sense. Since the search scope controls don’t appear until after you’ve started typing a search string, your only option is to start searching the whole computer and then click the scope button you wanted, which responds slowly since the Finder is busily carrying out the global search you didn’t want. Eventually it throws away the first partially-completed result set and starts over from the current window. Again, it’s frustrating if you search the current folder and its contents very often.

To be honest, though, a lot of these Finder gripes are new UI annoyances exchanged for old Finder UI annoyances. The new Finder is marginally better, except for all that mounting-multiple-disk-image or opening-multiple-folder stuff.

Conclusion:

Wait before upgrading, to give Adobe and Microsoft and Apple time to release some updates. I would say maybe March or June would be a good time to check on these things and see if it’s safe to upgrade. Until then, keep using Tiger; it’s really quite good and doesn’t have these issues.

ActiveRecord: the Visual Basic of Object Relational Mappers

Jamie Flournoy — Fri, 05 Oct 2007 02:07:53 +0000

I’ve been working with Ruby on Rails intensively for several months, and I’ve finally found a place where Rails can’t readily be extended to do what I want. It’s ActiveRecord, which is probably the most controversial part of Rails.

I’m reminded of a James Gosling quote disparaging Microsoft tools, particularly Visual Basic: “The easy stuff is easy, but the hard stuff is impossible.” There’s a parallel between VB and Rails in this instance, in that if you only let yourself use the high level tools, the hard stuff is impossible, but the designers specifically tell you to do the hard stuff using a lower level toolset. The controversy that surrounds “X can’t do everything, therefore it sucks” should really be focusing on the feasibility of going through that trapdoor to do things “the hard way”. This is what Delphi did, which is why so many folks chose it over VB; it made the hard stuff easier.

Here’s the task I need to accomplish, for which ActiveRecord is not well suited: complex queries involving SQL functions and multiple-table joins. I want to join a few tables together, order by a SQL function, include with each result row the result of a SQL function that operates on each row, and have all that come back as a graph of high-level objects.

Despite my attempts to use plugins, extend and/or fix bugs in those plugins, and to dig through the ActiveRecord source to figure out what the documentation won’t tell me, I was unable to get it to work. Most of the parts of what I wanted were possible: acts_as_tsearch cleverly weaves SQL functions into high-level ActiveRecord::Base.find calls; paginating_find provides a very convenient pagination API on top of ActiveRecord::Base.find, and ActiveRecord includes some clever association tricks such as automatic many-to-many relationships (has_and_belongs_to_many), eager loading of associated records using a join (via the :include option to ActiveRecord::Base.find), and a fairly low-level :joins option that lets you add tables to a ‘find’ query which can be used in your :conditions. Problem is, they don’t all work together in a fancy way.

Really, the issue in this case is related to the design choices that went into ActiveRecord.

Some ORMs (object-relational mappers) are designed in a modular fashion: there is a part that helps you describe the relationships between your model objects, a part that helps you construct queries, and a part that does the storage and retrieval. Sometimes there’s another part that uses your description of object relationships to create an empty database with the appropriate data model, or that looks at an existing database and creates an object model that matches it. Sometimes there’s an import/export tool for bulk data loading or dumping as well.

ActiveRecord has the first three functions integrated (which has benefits and drawbacks compared to a more modular approach), has a very isolated schema manipulation module, and has a somewhat isolated data loader tool.

The relationships are explicitly declared in source code using associations: has_one, has_many, belongs_to, and has_and_belongs_to_many. These are pretty fancy and provide some convenience features that make the associations appear as object collections, such that changing the collection and saving it turns into insert/delete/update activity in the database.

Query construction is basically tied to the objects themselves, in a way that greatly simplifies star-join queries, but which handles only the simplest joins across multiple tables, and is barely able to handle self-referential joins at all. So, you can easily load an object (or group of similar objects) and associated objects, but OLAP-style queries (“what are the top 5 states where customers are located who have bought classical CDs within 2 weeks of their release using American Express and had them shipped as gifts via UPS 3-day Select?”) are impossible. Oddly, views, functions, and stored procedures could bridge the gap between real-world data models and ActiveRecord’s limited set of association types, but they are not supported either.

The storage and retrieval code is inseparable from the query code, and so it is not possible to examine and modify the final SQL before it is executed, nor is it possible to provide an arbitrary query and have the results be parsed into an object graph based on the associations you have defined. The code that would allow these features appears to exist and be sufficiently well designed to allow this with a fairly small number of changes to ActiveRecord. However, it is currently (as of Rails 1.2.3, which is the current release) not part of the documented API and is declared private.

There is a limited facility for constructing simple objects from arbitrary SQL, in find_by_sql. This loses essentially all of the high level functionality of the find method; most notably, it isn’t possible to use find_by_sql results to instantiate an object graph, rather than a flat array of objects (similar to the eager loading feature in the regular find method).

ActiveRecord has fairly good high-level schema creation functionality (“migrations”). Though it lacks concepts for all but the basic database objects, support can be added for foreign key constraints (I kid you not, they aren’t supported by Rails itself!) and views. There’s also a simple way to execute arbitrary SQL. Migrations aren’t technically that amazing, but rather they’re a helpful organizational approach to what can be a really hairy problem: defining a schema and then applying changes to live databases while keeping track of what changes you’ve already applied.
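
To make the bookkeeping idea concrete, here is a minimal sketch in plain Ruby. It is not how Rails implements migrations; the Migration struct, the TinyMigrator class, and the array standing in for a database are all hypothetical names invented for this illustration. The only point is that each change carries a version number and the runner remembers the last version it applied, so it knows what is still pending and what has to be undone.

```ruby
# Hypothetical stand-in for a migration: a version number plus callables
# that apply (up) and undo (down) one schema change.
Migration = Struct.new(:version, :up, :down)

# Hypothetical runner that remembers which version the schema is at.
class TinyMigrator
  attr_reader :current_version

  def initialize(migrations)
    @migrations = migrations.sort_by(&:version)
    @current_version = 0
  end

  # Move the schema forward or backward until it reaches target.
  def migrate(target)
    if target >= @current_version
      # Apply only the migrations that have not been applied yet.
      @migrations.each do |m|
        next unless m.version > @current_version && m.version <= target
        m.up.call
        @current_version = m.version
      end
    else
      # Undo applied migrations in reverse order, newest first.
      @migrations.reverse_each do |m|
        next unless m.version <= @current_version && m.version > target
        m.down.call
      end
      @current_version = target
    end
    @current_version
  end
end

schema = [] # stands in for the database's list of tables
migrations = [
  Migration.new(1, -> { schema << "users" },  -> { schema.delete("users") }),
  Migration.new(2, -> { schema << "orders" }, -> { schema.delete("orders") })
]

migrator = TinyMigrator.new(migrations)
migrator.migrate(2) # applies version 1, then version 2
migrator.migrate(0) # undoes version 2, then version 1
```

Running migrate(2) a second time would do nothing, because the runner already knows both versions are applied. That remembered version number is the whole trick.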

Finally, there is a test data loading facility called Fixtures. The common opinion of Fixtures seems to be that they are broken by design and should be avoided. The main issue I’ve found with them is that the implementation ignores the kind of database design elements that any book on SQL would recommend, such as foreign keys and check constraints. I managed to circumvent this with a combination of a plugin and some customization, described in detail in my previous post, Rails, Fixtures, the Test DB, and Test::Unit. With those changes, all test fixture data is preloaded in the right order (so constraints aren’t violated) before any tests run, and any data alterations within tests are rolled back automatically by Rails.

A secondary issue with Fixtures is that they go directly from YAML text files to SQL INSERT statements, bypassing the ActiveRecord Model classes. ActiveRecord does pretty much rule out any fancy mapping between database tables and objects, so that’s not a problem, but this model-skipping fixture loading implementation means that any code in your model object (validations, before_save filters, etc.) will not be executed when loading fixtures. So fixtures do not work well with the otherwise pervasive Rails design rule of “put all the intelligence in the application”.

Still, despite the commonly-held disdain for using fixtures at all, I find that they can be tamed. In fact I’ve even created a base data facility for loading the fundamental data set that needs to be in the live database (e.g. initial admin user info). My approach is basically to alter fixture behavior to treat it as essentially a bulk data loading tool, and to do the extra housekeeping after loading to make up for the fact that the ActiveRecord model code was bypassed.

As far as I know, there is no bulk data dumping functionality in Rails.

So, to summarize, of the five main ORM features, here’s how ActiveRecord stacks up:

Describing Relationships: Easy to understand and use, with lots of slick functionality
Querying: Easy to understand and use, but limited to simple join structures, and not possible to customize query building or rewrite SQL before execution
Storage and Retrieval: Very easy to use, but only within the limits of the query builder’s features
Schema manipulation: Easy to understand and use; limited in functionality but readily extensible; solid third party plugins are available for missing schema objects
Bulk Loading and Dumping: Loading is badly designed and implemented, but fixable with some effort; dumping is not offered

Okay, so it definitely makes the easy stuff easy. But what about the rest?

As I observed before, ActiveRecord is not designed as a set of modules that you use to assemble a solution that fits your needs. That’s more of the Java approach to design, and it trades convenience for flexibility. It can be a major pain to assemble a working system out of all of those abstract Java APIs, which are sometimes so comically over-patternized as to draw mockery such as the hilarious “Are Javalanders Happy?” code snippet from Execution in the Kingdom of Nouns. Rails makes the opposite trade-off: sacrifice flexibility and gain a very approachable API.

Unfortunately, the Java approach (too abstract to readily use, but extremely flexible) is easily wrapped with a simpler, more convenient, less customizable API. The Rails approach isn’t internally componentized (have a look at ActiveRecord’s activerecord/base.rb source file in its 2,165-line glory, almost all of which is one class), so if you want to fiddle with its internal behavior, you can’t. So with Rails, it’s all or nothing: high level slickness for simple requirements, or hand-written SQL and hand-coded results mapping for your complex requirements.

As I said at the beginning, though, the key question is not how comprehensive the high level feature set is. More important is the question of how painful things are when you drop down to a lower level for a greater degree of control.

It would be nice if there were a middle level of complexity, between the high-level ‘find’ method and ‘has_xxx’ associations, and raw SQL. There isn’t. I think that the reason there isn’t one is that there is still a persistent belief among many Rails core team members and community members that databases should be stupid: just a persistent hash. Once upon a time I worked that way myself: I didn’t have access to or skill with a SQL RDBMS, and so I solved all of my persistence problems with DBM files, which (using Perl’s Tie::Hash class) are conceptually just persistent hashtables. miniSQL was little more than a SQL query parser on top of that sort of storage engine, and MySQL originally was pretty similar. But big databases have all sorts of useful features that address complicated persistence requirements in a fairly elegant way.

Given that Ruby fans like the idea of domain specific languages, which let you work in a super high level language customized to the problem domain, it’s surprising that Rails groupthink is that SQL is bad. It’s actually a very high level language, and allows a well written database to do some pretty amazing optimization on the fly because it provides a strong layer of abstraction between what you requested and how the storage engine provides it.

No, it’s not dynamic, nor is it pure relational perfection, but it’s pretty darn good. Pre- and post-event validations and arbitrary callbacks to user-specified code, functions providing behavior on top of data… these are all things that Ruby and Rails fans hold in high regard when provided by Ruby and Rails, but which are considered a bad idea at the database layer. As I discussed at length in Rails and the notion of Stupid Databases Being a Good Idea, this is a philosophy rooted in DRY, but it has some major flaws.

Mainly, there is the issue that some things must be done in the data tier, and trying to put them in the application tier doesn’t work. The best example that comes to mind is full text search. Satisfying queries is the database’s job, period. It’s just hideously slow to try and do an inner join in the application across a network link to a database. If you find yourself doing this, that’s a pretty good sign that your architecture is broken. But some queries are too complicated for ActiveRecord, so sometimes you must choose between a series of high level queries whose results are intersected in application code (easy to understand, but extremely inefficient), or hand coded SQL.

Well, SQL is fast and is a high level domain-specific language, so it isn’t actually a bad tool for the job. The problem is that this approach (the trapdoor to the lower level API) is regarded differently by different people. Some see it as a common and reasonable approach to complex requirements; others see it as a bad evil scary thing that should be avoided at all costs, a kludge and a design mistake.

As a result, the low level option in Rails is anemic. It’s there, but you’re not supposed to use it. Ruby’s ActiveRecord Makes Dropping to Raw SQL a Royal Pain (Probably on Purpose) notes that there are no bind variables allowed in ActiveRecord. You may be saying, “No, wait a minute, I’ve used them, that can’t be right.” That’s what I thought. Look at the source; the bind variable functionality is actually a high level feature built on top of drivers that don’t have that feature. Whatever you did at the high level, it’s going to the driver as a single string. Okay, it’s nice that they added that feature, especially since it provides a single point of testing and verification for safe escaping. But that functionality (in sanitize_sql) is not part of the public API. Fortunately that same article provides a workaround that makes sanitize_sql accessible, so you can use bind variables in your hand coded SQL code, and pretend that the driver supports them. But that’s not likely to work forever.
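
The idea of bind variables being emulated on top of a string-only driver can be sketched in a few lines of plain Ruby. This is not ActiveRecord's sanitize_sql; quote_value and interpolate_binds are hypothetical helpers written for this illustration, and the quoting covers only the simplest cases. Real quoting rules differ between databases, and this naive version would also replace a ? that appears inside a quoted SQL literal.

```ruby
# Hypothetical helper: turn one Ruby value into a SQL literal.
# Only the simplest cases are handled; real rules vary by database.
def quote_value(value)
  case value
  when nil         then "NULL"
  when Numeric     then value.to_s
  when true, false then value ? "TRUE" : "FALSE"
  else "'" + value.to_s.gsub("'", "''") + "'" # double any embedded single quotes
  end
end

# Hypothetical helper: replace each ? placeholder with a quoted value,
# producing the single SQL string that would be handed to the driver.
def interpolate_binds(sql, values)
  remaining = values.dup
  placeholders = sql.count("?")
  unless placeholders == remaining.size
    raise ArgumentError, "expected #{placeholders} values, got #{remaining.size}"
  end
  sql.gsub("?") { quote_value(remaining.shift) }
end

puts interpolate_binds("SELECT * FROM users WHERE name = ? AND age > ?", ["O'Brien", 30])
```

The driver never sees the placeholders, only the finished string, which is why having one well-tested place to do the escaping matters so much.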

The key problem with ActiveRecord is its least common denominator feature set, based around the least featureful of all popular SQL databases: MySQL. Years ago, MySQL AB (the vendor of the MySQL database) took a strong philosophical stand against pretty much any advanced database features (which their product lacked, and which competing products had), but lately they’ve softened and added those features that they claimed nobody really needed. In the meantime, Rails has been designed with minimal expectations for database sophistication; therefore, the limited functionality of ActiveRecord is fairly complete, assuming you’re using a database with similarly limited functionality.

Triggers, stored procedures, functions, data integrity constraints, nested transactions, and views are all examples of unsupported database functionality. Try and use them via ActiveRecord’s high level API, and you will quickly see how fragile and inflexible ActiveRecord really is. If you shouldn’t need those features in your database, then you shouldn’t need anything that ActiveRecord doesn’t already provide, so it shouldn’t matter that you can’t extend ActiveRecord.

Truly, these are features that you need only in a few small cases in your application, so looking at individual queries they’re needed rarely (which is not the same thing as “never”). But looking at whether you need one or more of them in a given application, they’re needed more often than not. The pain of using hand coded SQL makes this worse: some tricky things could be done either using a view or stored procedure, or using a really slick dynamic SQL statement. Making all of those options painful means that even a clever developer can’t use anything in their bag of tricks to craft an elegant solution.

Unfortunately, non-trivial web applications need things like full text search, complex associations between persistent objects, non-trivial summary information about associated objects, and complex reports, and ActiveRecord fails at all of these. These are not just things that big dumb ancient companies that like using Object COBOL think they need; Amazon and eBay need them too.

The acts_as_tsearch plugin is a good case study of ActiveRecord’s design flaws. TSearch2 is the standard PostgreSQL full text search engine, and it’s pretty good in my opinion. It’s also pretty straightforward to use. Unfortunately for developers using Rails, TSearch2 uses SQL functions (mainly to_tsquery and rank_cd). The acts_as_tsearch plugin tries to inject SQL into ActiveRecord’s queries via the high-level find interface, but ultimately fails as soon as you use the :joins or :include options. The problem is that ActiveRecord has a very simplistic idea of how queries and joins work, and so if you need to inject SQL functions to get the job done (as is necessary in TSearch2 queries), too bad. (See also issues 7 and 8 in acts_as_tsearch, in which I describe and attempt to clean up the mess that results when you use find_by_tsearch in non-trivial ways.)

A fellow Rails developer asked me in all seriousness why I wasn’t abandoning the full text search functionality of TSearch2 and just using a completely separate, redundant database product designed exclusively for full text search. Seriously, that is considered the “easy” approach: one database for full text search, and another for ACID/OLTP/CRUD. Honestly if I were going to go down that road I would try hard to just abandon the SQL RDBMS and put everything in the other database, since Lucene and its imitators are capable of far more than just find-text-in-document queries. The pain of duplicating everything, using two query languages, two document representations (in addition to the object representation in Ruby) and writing application-tier query correlation makes the double-DB approach seem very unwise.

It makes far more sense to me to use the SQL RDBMS’s full text search facility, even if there’s a 2x or 3x read performance penalty, because the conceptual simplicity of having one powerful storage tier (instead of two halves cobbled together) eliminates a ton of ugliness in the application, and the SQL RDBMS is going to get clustered for reads anyway. Nevertheless, even if I’m wrong about this case (putting search in the SQL RDBMS instead of in a separate server), there are other cases for needing a smart database that gives you exactly the results you need and lets you push data logic into the data tier.

So, what do I suggest? Abandon Rails? Nope. I still like Ruby a lot, and find Rails very useful. I just think that ActiveRecord needs to support the low-level and middle-level abstractions better.

Specifically, supporting bind variables (either by exposing that sanitize_sql function, or better yet by making drivers and connection adapters support bind variables for real) would make the find_by_sql, select_all, and exec approaches to low-level SQL query execution less painful.

More difficult, and substantially more valuable, would be refactoring ActiveRecord::Base to split it up in the way I described above: association descriptions and unmarshalling code separate from query building code separate from SQL execution and result retrieval code. All of this could remain hidden for most users under the same old slick high-level API, but for advanced requirements, the ability to fiddle with the SQL and still use the built in high-level unmarshalling code to create object graphs from flat result sets would be very powerful, and useful.

I looked at one alternative to ActiveRecord, called Sequel, which overlaps with ActiveRecord only partially. It is a query builder and lazy result proxy, which is actually what I thought ActiveRecord would do when I first started working with Rails. The proxy design means that you can either keep adding constraints or start fetching results, from the same Dataset class. This seems like a pretty good approach, though I haven’t really looked closely to make sure it would fit what ActiveRecord needs.
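
The "query builder and lazy result proxy" idea can be illustrated without any library at all. The sketch below is not Sequel's code or API; LazyDataset and the fake_db lambda are hypothetical names invented here, and the conditions are plain strings with no escaping. It only shows the shape of the design: adding a constraint returns a new object, and nothing runs until you ask for rows.

```ruby
# Hypothetical lazy query proxy: refining returns a new object, and
# nothing is executed until results are requested through each.
class LazyDataset
  include Enumerable

  # executor is any callable that takes a SQL string and returns rows.
  def initialize(table, executor, conditions = [])
    @table = table
    @executor = executor
    @conditions = conditions
  end

  # Add a constraint without running anything.
  def where(condition)
    LazyDataset.new(@table, @executor, @conditions + [condition])
  end

  # Build the SQL text from the constraints collected so far.
  def sql
    text = "SELECT * FROM #{@table}"
    text += " WHERE " + @conditions.join(" AND ") unless @conditions.empty?
    text
  end

  # Fetching results is the point where the query finally runs.
  def each(&block)
    @executor.call(sql).each(&block)
  end
end

executed = [] # records every query the fake database receives
fake_db = lambda { |query| executed << query; [{ "id" => 1 }] }

recent = LazyDataset.new("orders", fake_db).where("total > 100").where("shipped = 't'")
# No query has run yet; executed is still empty at this point.
rows = recent.to_a # Enumerable#to_a calls each, which runs the query once
```

Because the SQL is available from the sql method before anything executes, this kind of design also leaves a natural place to inspect or rewrite the query.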

What Sequel lacks, though, is the unmarshalling side: turning a 2-dimensional (rows of columns) result set into a complex object graph (customers with orders with order lines with products from suppliers stored in warehouses), with user-controlled eager or lazy loading behavior. Ruby is well-suited to a design that would allow user-specified code (i.e., a block) to decompose each row into the object graph associated with that row, leaving the remaining associations on those objects to be lazily provided via future queries.
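
A rough sketch of that block-based decomposition, in plain Ruby with no ORM involved, might look like the following. Everything here is hypothetical and invented for illustration: the Customer and Order structs, the build_graph method, and the hash-of-hashes registry used to avoid creating the same customer twice. Lazy loading of the remaining associations is left out entirely.

```ruby
# Hypothetical value objects for the illustration.
Customer = Struct.new(:id, :name, :orders)
Order    = Struct.new(:id, :total)

# Hypothetical unmarshaller: the caller's block receives one flat row plus a
# registry, and attaches that row's objects to the growing graph.
def build_graph(rows)
  registry = Hash.new { |hash, key| hash[key] = {} }
  rows.each { |row| yield row, registry }
  registry
end

# A flat result set, as a join of customers and orders would return it.
rows = [
  { "customer_id" => 1, "name" => "Ann", "order_id" => 10, "total" => 25 },
  { "customer_id" => 1, "name" => "Ann", "order_id" => 11, "total" => 40 },
  { "customer_id" => 2, "name" => "Bob", "order_id" => 12, "total" => 15 }
]

graph = build_graph(rows) do |row, registry|
  # Reuse the customer if an earlier row already created it.
  customer = registry[:customers][row["customer_id"]] ||=
    Customer.new(row["customer_id"], row["name"], [])
  customer.orders << Order.new(row["order_id"], row["total"])
end

customers = graph[:customers].values
```

Three flat rows come back as two customers, the first holding two orders. The SQL that produced the rows could be anything at all, which is exactly the flexibility that find_by_sql lacks.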

So, I think there is hope for ActiveRecord, definitely. I considered the idea of rolling a minimal Hibernate clone, or some other sort of challenger to ActiveRecord, but I don’t think that ActiveRecord is broken beyond repair. I think the shortest path to a badass Ruby ORM is through improvements (refactoring and abstraction) to ActiveRecord.

So, if you’ve read this far, you probably care about these issues. Here’s my call to action: Please help me make ActiveRecord less like VB and more like Delphi. Who else is interested in helping me with this effort? Are there alternatives that I’ve missed, or components that could be integrated into ActiveRecord to make it better?

Making the Rails acts_as_tsearch plugin work with fixtures

Jamie Flournoy — Sun, 23 Sep 2007 05:47:50 +0000

acts-as-tsearch is pretty cool, except for the fact that it uses Ruby (app layer) instead of PL/pgSQL (DB layer) to update the tsvectors that are indexed for full text search. That means that fixture data gets inserted without being full text indexed. D’oh!

Here’s some code that changes that.

(I put this in my environment.rb because I’m not quite at the point of shoving all this into a plugin like I probably should.)

# make it so that fixture loading (test data and base data) includes the tsearch2 update_vector
class Fixtures
    class << self # we want to mess with self.create_fixtures
        def create_fixtures_with_update_vector(fixtures_directory, table_names, class_names = {})
            create_fixtures_without_update_vector(fixtures_directory, table_names, class_names)
            # create a Class instance for each table name fixtures were loaded for, then call update_vector on it
            table_names.each do |tn|
                klass_name = (ActiveRecord::Base.pluralize_table_names ? tn.singularize.camelize : tn.camelize)
                begin
                    klass = Object.const_get(klass_name) # will fail if tn is a habtm table (no corresponding model class)
                    klass.update_vector if klass.respond_to?(:update_vector)
                rescue
                    nil # if it is a habtm table, there's no need to update a tsearch2 vector for sure
                end
            end
        end
        alias_method_chain :create_fixtures, :update_vector
    end
end

(Sorry for the formatting but I like wide lines in my source code and my WP theme doesn’t. Just copy and paste and it should be fine.)
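
For anyone unfamiliar with alias_method_chain, the pattern it automates can be written out by hand with plain alias_method. The sketch below uses a hypothetical Loader class and a made-up "report" feature rather than the real Fixtures class, so it runs without Rails.

```ruby
# Hypothetical class standing in for Fixtures.
class Loader
  class << self
    # The original method that is about to be wrapped.
    def create_fixtures(names)
      names.map { |name| "loaded #{name}" }
    end

    # New behavior: call the original, then do extra work afterwards.
    def create_fixtures_with_report(names)
      result = create_fixtures_without_report(names)
      result + ["reported #{names.size} tables"]
    end

    # These two aliases are what alias_method_chain :create_fixtures, :report
    # would set up:
    alias_method :create_fixtures_without_report, :create_fixtures # keep the original reachable
    alias_method :create_fixtures, :create_fixtures_with_report    # callers now get the wrapper
  end
end

Loader.create_fixtures(["users", "orders"])
```

Existing callers keep calling create_fixtures and get the extra step for free, while the untouched original stays available under the _without_ name.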

Making Rails’ rake:test not drop your PGSQL database

Jamie Flournoy — Sat, 22 Sep 2007 23:18:19 +0000

Let’s say you’re using Rails with PostgreSQL and the TSearch2 built-in full text search engine.

Did you notice that every time you run rake test, it depends on db:test:prepare, which depends on db:test:clone, which depends on db:test:purge, which drops the database and creates it again?

Along with your dropped database go the TSearch2 functions that wrap the C libraries that do the actual work. So, in effect, you no longer have TSearch2 installed. (“Uh… I kinda needed those…”) Presumably if you have tests that exercise search functionality, they will always fail because the TSearch2 functions are gone by the time the tests run.

Since these functions are just wrappers for C libraries, which are not subject to the PostgreSQL plugin security model, PostgreSQL wisely prevents any old user from getting at them. Only a superuser can create them, which means you can’t just add the tsearch2.sql script to a Rails DB migration and get them back each time that way.

Options include:

Making a setuid script (or a script with the postgres user’s password embedded) that the migration can run, which will log in as the postgres user, run the tsearch2.sql script, and grant permissions to your Rails DB user to use them
Changing the rules of the PostgreSQL instance you’re using to allow any old user to mess with C libraries (a pretty big security hole, but maybe you don’t care about that on your development DB on your laptop), and putting tsearch2.sql in a migration. (I dunno if this is even technically possible, but it seems like such a bad idea that I’m not even bothering to look.)
Using Rake to tell Rails not to drop and re-create your database for each test run, but instead to migrate back to 0 and then re-migrate to the latest version.

I chose #3. Here’s the code, which is in my Rakefile:

# don't drop the test database; migrate it back to 0
Rake::TaskManager.class_eval do
  def delete_task(task_name)
    @tasks.delete(task_name.to_s)
  end
  Rake.application.delete_task("db:test:purge")
end
namespace :db do
    namespace :test do
        task :purge do
            ActiveRecord::Migrator.migrate("db/migrate/", 0)
        end
    end
end

In the rare case where you really wanted to drop and re-create your test database, just use the command line PostgreSQL commands dropdb and createdb, and then (still as the postgres user) run the tsearch2.sql script.

Then resume normal Rails rake:test use, until such time as you irrevocably hose your database (really?) whereupon you’ll need to use the dropdb/createdb method again.

Rails, Fixtures, the Test DB, and Test::Unit

Jamie Flournoy — Fri, 03 Aug 2007 02:37:22 +0000

From what I’ve seen, Rails’ weakest features lie in the way it prepares the test database and test data, and Ruby’s Test::Unit isn’t much better than the awful but ubiquitous JUnit that Java developers are accustomed to. I set out this week to impose my preferences on Rails in this area, and that took some effort. Here’s what I did.

When I’ve implemented (in Java) what Rails does for database preparation, I did it like this:

Create the test database exactly the same way that the developers’ databases are created: by running the exact same code, pointed at a different database.
Load the appropriate sets of data for the test database. “Sets” is plural on purpose; most non-trivial databases include code tables, which constitute base data which are essentially part of the database design itself. Then, test code will want a fixed set of known test data to act upon, so that tests can measure whether the code did the right thing given the test data (the right inputs yield the right outputs).
Run the individual tests, providing some way of assuring that changes to the test data are undone before the next test.

At first (11 years ago) I used a hand-maintained SQL DDL file to create the databases. Later I split that up into one file per table, and made a list of the proper ordering of tables during creation (reversible for deletion). Later still, with Hibernate, I ditched the DDL and let a higher-level ORM description of the table do the schema generation (which was painful in Hibernate since it wasn’t made to do that except from the command line, but it was possible to hack it into a state of relative beauty). The test data was always loaded from a bunch of text files that were easy to hand-edit (as opposed to a bunch of SQL INSERT statements).

Running the test with assurance of pristine test data was more or less horrific in a J2EE+Hibernate 2.x environment. The design of Hibernate and JUnit made it difficult to wrap tests in transactions, and the version of MySQL that we were using had no transactional storage engines available at all (MyISAM? Thanks, Red Hat!), so I ended up falling back on an intrusive but relatively high-performance design that required tests to declare if they were going to alter the test data, so that the test teardown method knew it had to reload the test data. Since we were waiting for Hibernate 3.0, MySQL 5.x, and a few other things to become part of our architecture, I left that solution in place and ended up moving on to a new job before fixing it.

Rails initially seemed to nail this problem: the test database is automatically made based on the development database; the data is loaded from YAML files called Fixtures, which feature a very simple and straightforward API, and tests run inside individual transactions. Nice!

Except not. Fixtures are loaded by specifying the tables for which you need test data loaded, and this is done in each Test::Unit::TestCase class, of which I have several hundred. They are stupidly reloaded each time you say a given TestCase is going to use them. Worse, the tables you’re using for this TestCase are emptied out using SQL DELETE statements, but if there is test data in other tables that has foreign key dependencies on the data being deleted, fixture loading will fail. (Rails was not designed for FKs to be enabled in the database, so encountering this bug is a side effect of enabling them via the plugin.) This deletion behavior is pointless in light of transactions wrapping each test, but if you’re using MySQL MyISAM you can’t use transactions, so it needs to be there for people using MyISAM, which is to say, crazy people who care not for their data.

Since Test::Unit, like Java’s JUnit, lacks a hook for the beginning or end of a given TestCase class’s set of tests, there’s no way to accumulate a list of fixtures created and then delete them and/or reload them at the end. That would at least allow you to undo the creation of the fixtures so that the tables were all empty before the next set of fixtures were loaded. Sadly, Test::Unit is not that clever.

I initially fixed this problem a couple of months ago, using a hack that simply refuses to delete and re-create (test data) fixtures if they’re already loaded. That works since the fixture data progressively accumulates and is always clean since changes within tests are rolled back at the end of those tests.
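
The shape of that hack, stripped of all Rails specifics, is roughly the following. This is a sketch and not the actual patch: OnceOnlyFixtureLoader is a hypothetical class, and the block passed to it stands in for whatever really inserts the rows for one table. It only makes sense under the assumption stated above, that every test runs inside a transaction that is rolled back.

```ruby
require "set"

# Hypothetical loader that remembers which tables already have fixture data
# and quietly skips them on later requests.
class OnceOnlyFixtureLoader
  attr_reader :load_count

  def initialize(&loader)
    @loader = loader # block that actually inserts the rows for one table
    @loaded = Set.new
    @load_count = 0
  end

  def load(*table_names)
    table_names.each do |table|
      next unless @loaded.add?(table) # add? returns nil if already present
      @loader.call(table)
      @load_count += 1
    end
  end
end

inserted = [] # records which tables were really loaded
fixtures = OnceOnlyFixtureLoader.new { |table| inserted << table }

fixtures.load("users", "orders")   # first test case asks for two tables
fixtures.load("users", "products") # second test case: users is skipped
```

No DELETE statements are ever issued, so the foreign key failures described earlier cannot be triggered by a reload.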

Upon adding a trigger to a Rails migration and then writing a test case that checked to see if it was working, I found the true ugliness. Rails has Migrations, which in my opinion are an excellent feature that works well, and is a more useful generalization of my ordered-list-o-tables and set of table-definition text files. But… when creating the test database, Rails uses the SchemaDumper‘s schema.rb output to create it, instead of using migrations. Talk about principle of least astonishment… I was pretty astonished. We have migrations, which is how we create databases! Great! So let’s use this other thing instead.

Also, SchemaDumper does not in fact dump the schema; it dumps tables and indices only. The RedHillOnRails foreign keys core plugin adds foreign key dumping to this output, but forget about check constraints, triggers, and stored procedures. Those schema objects are ignored, so your test database is not the same as your development (or production) database. Whoops.

I thought of about a dozen ways to deal with this:

Abandon triggers and do it all in Rails, make a TODO to fix this later, and get on with feature implementation
Add code to the tests to check for the missing schema objects and add them if missing (eww)
Replace the db:test:prepare Rake task with one that tells PostgreSQL to copy the database as-is
Replace the db:test:prepare Rake task with one that tells PostgreSQL to use pg_dump instead of ActiveRecord::SchemaDumper
Hack the PostgreSQL-specific code that SchemaDumper uses to look at the pg_proc and pg_trigger system catalogs and use code similar to the RedHillOnRails Core plugin to dump stored procs and triggers into schema.rb also
Just dump using pg_dump into a temp file and parse the output and add that to schema.rb (ewwwwwww)

etc. etc.

I finally found the Migrate Test DB Rake Plugin which simply uses your Rails Migrations to create the test database. Lovely. Except I now had some new problems.

rake db:test:purge for PostgreSQL does dropdb/createdb on the test database to empty it out. That creates a database with no built in procedural languages, so stored procs won’t work. Adding the language to that database is a DB superuser task, so it couldn’t be done inside of Rake. Fortunately I found that I could solve this via “createlang plpgsql template1” which puts plpgsql in the template database used for creating new databases. Easy.
My never-delete-fixtures code got into a fight with my base-data-loader code. They both used Fixtures to load data, and so the base data fixtures made the never-delete-fixtures code think that the test data was already in. So the tests failed due to lacking test data.

I fixed this initially by modifying my BaseDataLoader class to not load base data if RAILS_ENV is ‘test’, and added code to the Migrate Test DB Plugin to set RAILS_ENV to ‘test’ right before running the migrations on the test database. This is a workaround, really, because it still leaves the base data either missing entirely, or duplicated.

Then I switched to the Preload Fixtures plugin which is nice but still leads to FK related errors. It grabs the fixture names from your test/fixtures directory and loads all the files it finds, in the order it found them. That fails since alphabetical order and the required table creation order are different in my case.

Fortunately since I’m using the Migrate Test DB Plugin I can just observe the order in which tables were created and tell the Preload Fixtures plugin to do its work in the same order. This is in my environment.rb because that’s where all my project-wide monkeypatching currently lives. (Cleaning that up and maybe plugin-izing it is a TODO for the future.)

# Due to FKs, gotta specify ordering of fixture preloading here. Why not let migration create_table statements do it?
# (depends on Migrate Test DB Plugin being present; is here for the benefit of the preload_fixtures plugin)
module ActiveRecord::ConnectionAdapters::SchemaStatements
    alias create_table_orig create_table
    def create_table(table_name, options = {}, &block)
        fixture_filename = "#{table_name}.yml"
        if File.file?(File.join(RAILS_ROOT, 'test', 'fixtures', fixture_filename))
            ENV['FIXTURES'] = [ENV['FIXTURES'], fixture_filename].compact.join(',')
            # puts ENV['FIXTURES']
        end
        create_table_orig(table_name, options, &block)
    end
end

Sadly if you run “rake test” it runs ruby as a subprocess in order to do “rake test:units”, “rake test:functionals”, and “rake test:integration”. That means that the migrations are run once (before the tests), but that the preloading is done three times. The second and third times through, though, the preloading fails since it’s trying to delete-then-create each table’s fixtures in table-creation order. So, a patch to preload_fixtures.rb is needed, to ensure that deletes are done first, in the reverse order of table creation. Here’s what the new preload! method looks like:

def self.preload!
    puts "PRELOADING FIXTURES..."

    require 'active_record/fixtures'
    ActiveRecord::Base.establish_connection(:test)
    fixture_filenames = (ENV['FIXTURES'] ? ENV['FIXTURES'].split(/,/) : Dir.glob(File.join(RAILS_ROOT, 'test', 'fixtures', '*.{yml,csv}')))

    # delete first, in reverse order
    fixture_filenames.reverse.each do |fixture_file|
        table_name = File.basename(fixture_file, '.*') # hack; might not be correct if class name != camelized table name
        ActiveRecord::Base.connection.delete "DELETE FROM #{table_name}", 'Fixture Delete'
    end

    fixture_filenames.each do |fixture_file|
        Fixtures.create_fixtures(File.join(RAILS_ROOT, 'test', 'fixtures'), File.basename(fixture_file, '.*'))
    end
    puts "DONE. Loaded #{Fixtures.all_loaded_fixtures.keys.length} fixtures."
end

I’m not sure, but I think there’s an assumption in there that the table name is the same as the fixture name. My patch also makes that assumption, which is true in the case of my project. But in your project you might not have done that, so further hackery might be needed.
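
The reason for deleting everything first, in reverse creation order, can be checked with a toy model that needs no database at all. This is only a sketch: ToyDatabase, its crude foreign key rule, and the customers / orders / line_items tables are hypothetical stand-ins, not anything from the plugin.

```ruby
# Toy stand-in for a database with foreign keys (hypothetical, for illustration).
# Rule: a table can only be filled once its parents are filled, and can only
# be emptied once no filled table references it.
class ToyDatabase
  class ForeignKeyViolation < StandardError; end

  attr_reader :populated

  # foreign_keys maps a child table name to the parent tables it references.
  def initialize(foreign_keys)
    @foreign_keys = foreign_keys
    @populated = []
  end

  def insert_rows(table)
    missing = (@foreign_keys[table] || []) - @populated
    raise ForeignKeyViolation, "#{table} needs #{missing.join(', ')}" unless missing.empty?
    @populated << table unless @populated.include?(table)
  end

  def delete_rows(table)
    dependents = @populated.select { |t| (@foreign_keys[t] || []).include?(table) }
    raise ForeignKeyViolation, "#{dependents.join(', ')} references #{table}" unless dependents.empty?
    @populated.delete(table)
  end
end

# Patched behaviour: delete all in reverse creation order, then load in order.
def preload(db, creation_order)
  creation_order.reverse.each { |table| db.delete_rows(table) }
  creation_order.each { |table| db.insert_rows(table) }
end

# Unpatched behaviour: delete-then-load one table at a time, in creation order.
def naive_preload(db, creation_order)
  creation_order.each do |table|
    db.delete_rows(table)
    db.insert_rows(table)
  end
end

FOREIGN_KEYS = { 'orders' => ['customers'], 'line_items' => ['orders'] }
CREATION_ORDER = %w[customers orders line_items]
```

In this model naive_preload succeeds on an empty database but raises on the second run, because customers cannot be emptied while orders still points at it. preload can be run any number of times.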

So, it all seems to work correctly now, and I’m back to working on my trigger code. If this seems like it took a lot of effort, it did, but I think it’ll be worth it once I start using stored procs and triggers more. That phase begins now.

Rails and the notion of Stupid Databases Being a Good Idea

Jamie Flournoy — Fri, 03 Aug 2007 02:16:40 +0000

For the last few days I’ve been struggling to bend Rails to my will regarding the proper way to assure data consistency. Today I made some progress. This builds upon some research I did a few months ago, and hopefully this is a more or less complete solution to the problem of making Rails work the way I want it to regarding test databases.

DHH has clearly stated that he does not like a smart database. This is common among application developers, particularly in the agile methods camp, in that they generally appear not to understand relational set theory, or if they do, they believe that it is inherently inferior to object oriented methods (which lack a theoretical basis, as Fabian Pascal will happily shout at anyone who will listen). I gather from DHH’s statements that he merely is trying to practice Don’t Repeat Yourself (a.k.a. DRY, one of the most important values of Rails). I gather from Rails itself that he either respects the need of some folks to disagree with him enough to provide hooks to bypass ActiveRecord, or that he at least agreed with someone else’s patch. By this I mean that there are ways around the ORM features of ActiveRecord, to do raw SQL and to execute raw DDL at database creation time, which implies that he isn’t trying to force his opinions on others, but rather to make it easier to do things his way than to do them a different way.

Fair enough. Rails is opinionated software, as DHH often says, and I have found several cases where letting go of my particular way of doing things has been fine, given that Rails has a different but equally valid way of doing things that is made super easy by the framework.

However, I disagree with his decision to keep the DB stupid, for two reasons.

First, I prefer to put logic where it belongs, rather than gathering it all in one place. AJAX, and in particular Google Maps, is a good example of presentation logic going where it belongs, making the whole application work better. SQL RDBMSs have features that can be abused, and in some cases these features are there because a wrong-thinking but wealthy client demanded them, but most of the advanced features that a “Real Database” has are there so that you can protect yourself against data loss or data corruption. The database is in a unique position to let you declare rules for things that must always be true, and then to trust that the database will never violate those rules. Older versions of MySQL were notably lacking in these features and their absence was justified by MySQL staff who basically said “you don’t need that, and if you want it, you’re confused.” Rails has inherited some of these damaged assumptions from MySQL, leaving basic relational features like foreign keys out of the framework(!). Fortunately Rails allows plugins, and there is a set of foreign key plugins that overturn this decision. But in general, if the database belongs to your application, that’s not an excuse to move database functionality into application code. By calling it your application’s database (as opposed to an Integration Database) you imply that it is part of your application, and therefore any rules or procedural code in it is necessarily also part of your application. You can’t monopolize the database and say that no one else has any business using it, while at the same time holding it at arm’s length and saying it’s not a valid part of the application. It is. No, business rules probably don’t belong in the database, but basic data consistency maintenance (in rule or procedural form) does.

I’m being charitable here, but my experience with individual practitioners of the Stupid Database Method invariably ends with me finding out that they don’t really understand databases at all (hence the desire to abstract the database away entirely with a driver plugin architecture topped by an ORM layer, lest they have to understand how a specific database product works), and would rather remain ignorant and reinvent the same functionality in the application layer or in the ORM layer. (It’s a case of “when all you have is a hammer, everything looks like a nail”, where the hammer is a general-purpose programming language, and you’re looking at a problem of high performance concurrent transactional programming.)

Not surprisingly, the database of choice for these folks is the least featureful, lowest cost, easiest to install one available. Because naturally it’s much more agile to write and debug new multithreaded transactional code in a high level dynamic language than it is to get the same functionality for free in a thoroughly tested product that’s written in C. Right? Perhaps DHH is not one of these people. I assume he is not, again based on what he has said and coded. But nevertheless, the folks I’ve talked to personally who agree with his point of view are all coming from a point of view of willful ignorance.

Secondly, I prefer to employ defense in depth against data errors. Transient errors can have workarounds, but data errors are permanent, and that means that if your data is valuable, the damage done can be irreversible. Just because it’s possible for correct application code to avoid race conditions, improper escaping, etc. doesn’t mean that you should put all your eggs in that basket. When the price of data corruption is high (i.e. if you value the data in your database) then it’s worth the duplication of effort: test the application code, but also put a constraint in the database that will catch things the application code missed.

This is the same sort of thinking that leads to using automated unit tests, then functional tests, then integration tests, and then some manual QA, all overlapping. Duplication of effort? Yes. Worth it? Yes. Database bugs are arguably the worst kind of bugs to find in production, so they merit extra code that maybe isn’t absolutely necessary for the application to work, but is nice to have since you’d like to sleep at night.

So, I feel justified in wanting to put CHECK constraints and triggers in my database.
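
As a rough sketch of what that overlap might look like, assume a hypothetical orders table whose quantity must be positive. The rule is stated once in application code and once as a CHECK constraint; the method names, the table, and the constraint name are all invented for illustration.

```ruby
# Layer 1 (application): reject bad values before they reach the database.
# Hypothetical rule: quantity must be a positive integer.
def valid_quantity?(quantity)
  quantity.is_a?(Integer) && quantity > 0
end

# Layer 2 (database): DDL for a CHECK constraint stating the same rule, so a
# bug or a raw SQL statement that bypasses layer 1 is still caught.
def check_constraint_sql(table, constraint_name, expression)
  "ALTER TABLE #{table} ADD CONSTRAINT #{constraint_name} CHECK (#{expression});"
end

# Hypothetical use inside a migration's self.up:
#   execute check_constraint_sql('orders', 'orders_quantity_positive', 'quantity > 0')
```

Yes, the rule is written twice. That is the point: either layer can have a hole without the bad row getting in.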

The implementation details are discussed in part 2.