Document Databases – New Kids on an Old Block

There’s a new crop of databases that has appeared lately, under the rubric of “document databases”, and there’s quite a lot of enthusiasm for them given that they tend to be slow and very feature-poor compared to the SQL RDBMSs that are the typical persistence mechanism for web applications. What’s mainly appealing about them is that they are easy to use, and theoretically quite scalable, compared to the traditional “one big SQL database server” approach.

But the simplicity of these new document databases is tied to some significant trade-offs in the current implementations. And so I’m going to try and put them into context with some of the other data persistence options that have been around for a while, but which aren’t currently getting as much hype as document databases. Hopefully that will help all of us to understand how these new and evolving document databases can be useful to us, and what the alternatives are in areas where they may not fit well.

I’d first like to try and deconstruct a false dichotomy that I’ve noticed being used in arguments in favor of some of these new databases. That dichotomy casts SQL RDBMSs (such as MySQL, Oracle, PostgreSQL, MS SQL Server, etc.) as big, complicated, and hard to scale, compared to document databases which are small, simple, and easy to scale. The main problem with this dichotomy is that there are far more choices than just two. Each database product embodies a set of design choices, and although there is some clustering of decisions into general types of product, the boundaries are a lot fuzzier than a product evangelist might have you believe.

Furthermore, trade-offs made early in a product’s lifetime may have been altered over time. A good example is the no-longer-true characterization of MySQL being fast but not reliable, vs. PostgreSQL being reliable but not fast. In reality, recent releases of both products are moving toward being very fast and very reliable.

Because there are so many database products out there, I’m going to have to fall back on a small subset of example products, as illustrations of issues that may or may not exist in a particular product you’re looking at. The key for an application architect evaluating a persistence mechanism is to understand the abstract concepts, and to figure out which ones matter to your current application. That will let you select a product (or a combination of products, including some custom code perhaps) that suits you. As with all aspects of architecture, there is no cookbook you can use, and in six months all the options will change. You have to analyze your needs first, and then get your hands dirty with evaluation second.

So, let’s start deconstructing some of the examples into design decisions.

SQL RDMBS

This is the traditional choice for web applications. You get a remote query language, so you can specify in great detail what you want to retrieve. You get very precise control over data representation, including some that may be burdensome if you’re not concerned with internationalization, multiple currencies, and time zones. You get a lot of control over performance and a lot of information about how things work at a low level inside the database, from indexes to data page size to transaction logs and checkpoint frequency. Most of them include ACID transaction support, which is nice for reliability, but which usually obligates you to implement a backup scheme or else they will eventually stop accepting new transactions and/or run out of disk space.

Some of the design drawbacks include:
– the use of local storage, for performance and for a guarantee that data has been committed to disk
– the use of a high level query language (SQL) and a query optimizer, so the specific process the database uses to satisfy your query is not in your face (and thus may be surprisingly inefficient, if you aren’t familiar with how it works)
– the use of a proprietary network protocol, which means that you need a special client library for just that one product, which may or may not implement all of the features that the server offers (such as encryption)

However, there are some variations that make the edges of this category fuzzier. There are ACID-compliant SQL RDBMSs that have no network layer, and are very lightweight; in some cases they may not even support concurrent access. Examples include HSQLDB and SQLite.

Networked Filesystem

This is typically used for accessing shared file servers, or allowing “thin client” behavior so that users can get to their own environment and data from any given endpoint. Examples include NFS, SMB, AFP, GFS, and quite a few others. The main advantage of these systems is that the remote filesystem is represented as being directly connected to the local system, while also being available to other users or other client systems who are connected to the same remote system.

Trade-offs of this design include:
– performance on a LAN may be good, but over a slow, high latency link may be very poor
– there is usually no ACID transaction support, just file locking
– file ownership and permissions can be very hard to manage
– if the file server goes offline, the entire local system may hang or crash

In particular, content indexing, complex querying, and data integrity features are generally not offered. You can layer that on top, though, but that layer will not necessarily work if it was originally designed to work on local filesystems. In particular I’m thinking about DBM files; they’re fast and easy to use but not all of them will work properly with files located on a network filesystem.

Also, directory scanning performance can be very poor if thousands of files are located in a single directory; listing all files starting with the letter T may actually require the entire directory to be retrieved and filtered on the client side.

Variations include FTP and WebDAV, which are not intended to simulate a local filesystem, but instead have filesystem-like semantics. Some operating systems will mount them as remote filesystems anyway, for ease of use for viewing and copying files, but it’s not possible to lock a remote file, so safe multiuser access is not possible.

Object Database

Object databases offer a direct representation of an application’s data in almost exactly the same form that exists in memory. Whereas a relational database stores data in tabular form regardless of the particulars of a client application, an object database stores data in the same form that the application uses. The exception to this is in the representation of references to other objects; at some level these encapsulate pointers to the memory address of the data in the application’s address space, and this must be substituted with a pointer to the location in the database’s storage system before storing it.

An object database will not handle time zones or internationalization, but nor will it complicate those matters if the application handles those already. The data is simply stored as-is. Also, object databases typically do offer ACID transaction support.

Aside from the conceptual simplicity of the similar data model, one major bonus of an object database is that the use of pointers makes data retrieval extremely fast; rather than parsing a query and searching indexes for the on-disk location of a desired object, the application can simply ask the database for it by its reference.

The big trade-offs here are twofold:

First, the lack of indirection through a query language and an indexing system mean that the application developer must anticipate all of the queries that will be needed, and incorporate collections into the the object graph that will be used to get to the stored objects. Otherwise the application’s object model will need to be updated frequently to include these later.

Second, altering the application’s object model and then retrieving data stored using older code can be very complex. Because the stored objects and application’s code are out of sync in this situation, additional application code must be written to convert existing stored objects into the new representation and persist them back to the database.

Combine those two trade-offs, and it’s clear that the performance benefit comes with the price of considerable additional application development effort.

Also, an object database is by nature bound to a single application, rather than being a point of integration between multiple applications. Any attempt to create a shared code library that manages access to the object database introduces potential “impedance mismatches” between each application and the shared object model, which reduces the simplicity that an object database offers in comparison to a relational database.

Document Database

Arguably a rejection of relational technology, document databases offer several advantages compared to the three classes of database previously mentioned. Documents need not be internally represented as a flat set of key-value pairs as seen in a SQL RDBMS; for example, the document may be an XML document. Queries are possible and may even use a standardized query language to express the conditions for matching desirable documents. Documents may have internal structure that is understood by the database server (so that it can query against the document’s contents), as in the case of an XML database, or they may have an external metadata structure consisting of key-value pairs, or both.

The drawbacks of this type of system derive from the fact that it is similar in many ways to each of the other database types.

Querying ability means that the server must incorporate some kind of indexing system for performance reasons, which means that the document must either internally or externally conform to some sort of standard data model. Some document database systems simply omit querying except by the document’s main ID (similar to a SQL primary key, a network filesystem’s filename, or an object database’s storage reference). This has the same drawback as with an object database: the application must take on the responsibility of managing querying, searching, and sorting itself, across the network.

The similarity to a networked filesystem may also have scalability benefits; if referential data integrity is not provided, then documents can be located on any remote system, and partitioning is simple. However, distributed queries will still need to be managed at the application level, and potentially any transaction becomes a distributed transaction, since a changed document on one server may be referenced from any number of documents in any number of other servers. (This is also true of the other systems, though.)

Still, because a document database is closest to a networked filesystem, it may be suitable for simple requirements where a relational database or object database seems to complex and slow, but where the bare-bones functionality of a networked filesystem is too simple. The compromise of a binary file with simple additional metadata or properties attached “out of band” with the data itself, or of a structured document format that is flexible but not ideal, may be acceptable if sophisticated querying is possible as a result.

It’s hard to provide good examples of document databases, because the category is very broad, and includes a lot of simple projects that provide just a little functionality above and beyond what WebDAV already provides. But a few that I’ve heard of recently include CouchDB, SimpleDB, and RDDB.

CouchDB imposes a simple key-value data structure on document content, but no internal document schema or grouping of documents by type. It does offer indexing and querying. Notably, it also offers transparent replication, at a field level (changes to two different fields in two copies of the same document are synchronized to both copies).

Amazon SimpleDB similarly imposes a key-value structure, though one key can have multiple values. It too offers a query language, and indexing. Because it’s built on Amazon’s S3 service, transparent replication is also included.

Future Prospects

The main complaints that I’ve seen directed at relational databases involve two things: one, the difficulty of scaling them up, and two, the restrictive data model. Sharding (a.k.a. data partitioning) is the usual remedy for scaling problems, but that requires the elimination of referential integrity in the SQL RDBMSs I’m aware of, and requires distributed transactions in order to preserve ACID transaction properties across denormalized copies of the modified data.

Interestingly, these issues are the same across the board, regardless of database type. Either you abandon transactions, or you move them up to a level that’s aware of the data partitioning. I see no reason why these and other high-end RDBMS features couldn’t be offered in a proxy layer that possibly even contains the query processing as well.

One way to approach this is to build a closed system with a given set of features and a limited API that permits a single query language. This seems to be the way that CouchDB and SimpleDB are approaching the problem.

Another way to approach this problem is to simply say that the storage back-ends of relational databases could be enhanced to incorporate built-in transparent partitioning. I don’t think that SQL RDBMSs will abandon the concept of a table schema any time soon, but there’s no reason why products that already include XML query and indexing capabilities and free-form natural language indexing (a.k.a. Full Text Search) couldn’t also include indexing capabilities for simple key-value structured data inside a single column of semi-structured data, giving most of the same functionality as a document database.

Given that, the remaining limitation of a SQL RDBMSs is the requirement that the back-end storage system be located on a disk drive physically connected to the same server, and that the storage be touched only by processes running on the same server together so that they can coordinate access to the data.

For now, though, document databases look like they can be very useful for certain types of persistence requirements; I don’t see them as a viable substitute for everything that a SQL RDBMS does, but that perception is limited mainly by the choices at hand. CouchDB looks like the most generally useful option so far, though I’d like to see the addition of optional schemas (opt-in on a per-object level, as seen in LDAP), and/or a pluggable language option. (It seems that everyone using a document database is also enamored of their application language and dislikes the idea of putting logic in the data tier unless it’s written in the same language.)

I welcome your comments – this is mostly a brain dump of things I’ve seen before to help myself and others contextualize the new document databases, and document databases are evolving too rapidly for me to keep up with all of them on my own.

Leave a Reply