Databases and version control systems

I’ve been talking with a friend about some of the “classical” variants of hacking: Unix, Lisp, Emacs, etc…

One of the topics that we’ve stumbled upon more than once lately is Git. I’ve been holding fast to the centralized VCS world, unwilling to let go of my trusty SVN installations. The most important reason for this was simply that I did not understand how a non-centralized VCS works. A couple of days back I chanced upon an article on Hacker News titled Understanding Git Conceptually.

Motivated by this article, I brought up Git again at our next encounter, and we soon agreed that Git is more than just a version control system. With its distributed, decentralized approach and its branching and merging, it looks a lot like a component that could live beneath the hood of a database system. Discussing the topic further, we turned to databases that are document-oriented rather than typical RDBMS offerings. In a database that treats documents as the atomic unit of its operations, the relationship between a document-centric database and a filesystem with a VCS becomes fairly obvious:

Schemas/databases are equivalent to directories,
Tables (if they exist at all) are also equivalent to directories containing files,
Records/rows/documents (whatever you choose to call them) are equivalent to files,
Views (if they are supported) are equivalent to directories containing symlinks,
Transactions are VCS branch/merge operations.

Compared with a typical database back-end, a VCS gives you not only all of the above but also access to all previous versions of each document, pretty transparently.
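The mapping above can be sketched in a few lines. This is only an illustration in Python, not GitCouch's actual layout: the function names (doc_path, put_doc, get_doc, list_docs), the one-JSON-file-per-document convention and the .json suffix are all assumptions made for the example. The VCS step (committing after each write) is deliberately left out.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def doc_path(root: Path, database: str, doc_id: str) -> Path:
    """Map a (database, document id) pair to a file inside a directory."""
    # Hypothetical layout: <root>/<database>/<doc_id>.json
    return root / database / f"{doc_id}.json"


def put_doc(root: Path, database: str, doc_id: str, doc: dict) -> Path:
    """Store a document as a JSON file, creating the database directory on demand."""
    path = doc_path(root, database, doc_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the file stable between writes, which keeps VCS diffs small.
    path.write_text(json.dumps(doc, sort_keys=True, indent=2) + "\n", encoding="utf-8")
    return path


def get_doc(root: Path, database: str, doc_id: str) -> dict:
    """Read a document back from its file."""
    return json.loads(doc_path(root, database, doc_id).read_text(encoding="utf-8"))


def list_docs(root: Path, database: str) -> List[str]:
    """Listing a database is just listing a directory."""
    return sorted(p.stem for p in (root / database).glob("*.json"))
```

Pointing root at a working tree under version control would then give every document a history for free, which is the whole point of the comparison.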

Git excels in its features, non-centralized approach and above all else, incredible speed. It seemed like an ideal candidate to use for the VCS back-end if you wanted to build a database system. After concluding this, my friend challenged me: in theory, it’s almost obvious you could build a database system on top of Git. But can it actually be pulled off? I said “hell, yes!”

The database: CouchDB

The most pressing concern for me now became: which database system could I attempt to duplicate that would leave me with the least amount of cruft to write on my own?

After some discussion with my friend, I decided that CouchDB would be an ideal candidate. It is document-centric, has a very thin interface layer exposed as HTTP, and the only difficult part could be its cozy love affair with JSON and JavaScript in general: the core element of the database is a JSON fragment and views are an implementation of Map-Reduce with the logic written as JavaScript.
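To make the Map-Reduce idea concrete, here is a minimal sketch of how such a view works. It is written in Python as a stand-in for the JavaScript that CouchDB actually uses, and it is not CouchDB's implementation: run_view and posts_by_author are hypothetical names, and the document fields ("type", "author") are made up for the example.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple


def run_view(
    docs: Iterable[dict],
    map_fn: Callable[[dict], Iterator[Tuple[Any, Any]]],
    reduce_fn: Callable[[List[Any]], Any],
) -> Dict[Any, Any]:
    """Apply map_fn to every document, group emitted values by key, then reduce each group."""
    grouped: Dict[Any, List[Any]] = defaultdict(list)
    for doc in docs:
        # The map step: one document may emit zero or more (key, value) pairs.
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # The reduce step: collapse each key's values into a single result.
    return {key: reduce_fn(values) for key, values in grouped.items()}


def posts_by_author(doc: dict) -> Iterator[Tuple[str, int]]:
    """Example map function: emit (author, 1) for every document that is a post."""
    if doc.get("type") == "post":
        yield doc["author"], 1
```

Calling run_view(docs, posts_by_author, sum) on a list of documents would count posts per author. In the real thing the map and reduce logic is JavaScript, which is why a JavaScript engine is needed at all.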

It turned out that JSON was not a problem at all. Mozilla’s SpiderMonkey is a nifty implementation of JavaScript. I could just feed select bits of code into this system and have it process the data for me.

I had all my ingredients ready, so I could start!

The anatomy of a hack: Bash, inetd, SpiderMonkey, Sed and Git

I decided that I wanted to make this as Unix-like in philosophy as possible. The key concept should be “loose coupling.”

The system is composed of a couple of Bash scripts which don’t feature any special tricks (well, other than Bash support for regular expressions). There’s some light usage of sed for simple data transformations. I’m also cheating by using CouchDB’s compiled version of SpiderMonkey (in a CouchDB installation it is a binary called couchjs). In order to avoid any tomfoolery with sockets, I just wrapped my Bash script in an inetd service. This lets me push all actual socket handling out of the script, so that data simply comes in and is sent out via stdin/stdout.
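The stdin/stdout arrangement is easier to see in code. The sketch below is in Python rather than Bash, and it is not taken from GitCouch: read_request, write_response and handle are hypothetical names, and the handler just echoes the request method and path back as JSON. The point is only that, once inetd owns the socket, a request handler reduces to reading bytes from one stream and writing bytes to another.

```python
import io
import json
from typing import BinaryIO, Dict, Tuple


def read_request(stream: BinaryIO) -> Tuple[str, str, Dict[str, str], bytes]:
    """Parse a simple HTTP/1.0 request: request line, headers, optional body."""
    request_line = stream.readline().decode("latin-1").rstrip("\r\n")
    method, target, _version = request_line.split(" ", 2)
    headers: Dict[str, str] = {}
    while True:
        line = stream.readline().decode("latin-1").rstrip("\r\n")
        if not line:  # An empty line ends the header section.
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers.get("content-length", "0"))
    body = stream.read(length) if length else b""
    return method, target, headers, body


def write_response(stream: BinaryIO, status: str, payload: dict) -> None:
    """Write a minimal HTTP/1.0 response with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    head = (
        f"HTTP/1.0 {status}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    stream.write(head.encode("latin-1") + body)
    stream.flush()


def handle(stdin: BinaryIO, stdout: BinaryIO) -> None:
    """Serve exactly one request; inetd starts a fresh process per connection."""
    method, target, _headers, _body = read_request(stdin)
    # Placeholder logic: a real handler would dispatch on method and path here.
    write_response(stdout, "200 OK", {"method": method, "path": target})


# Under inetd this would be invoked as: handle(sys.stdin.buffer, sys.stdout.buffer)
```

Because the handler only sees two byte streams, it can be exercised with io.BytesIO objects and no network at all, which is a nice side effect of the loose coupling.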

What I think is really valuable here is that I developed several utilities on the side: a simple Bash logging system and a semi-broken implementation of an HTTP/1.0 server (with header parsing and all).

To test the system, I took CouchDB’s client called Futon, hacked up a simple redirect with Apache’s mod_rewrite and tested the whole thing. The unmodified Futon client is included in the download archive. GitCouch passes the basic (first) test case of the Futon client. For seven hundred lines of Bash scripts, I find that to be an amazing success.

Try it out

You can download GitCouch, of course. There is no documentation in there, and to get it running, you’ll have to do some digging around.

Duality of the VCS - document database models

When I was discussing this project with a friendly hacker, he suggested that he actually wanted to do the hack the other way around: he wanted to create a version control system by using CouchDB. To me this suggests that these two topics (document databases and VCS) are actually two faces of the same coin.

I hope to return to study this area again some time in the future.