How did you hear about CouchDB, and why did you choose to use it?
In 2012 I was working for Health On the Net, which is a Geneva-based NGO focusing on healthcare-related tech. Mostly what we did was certify health websites as abiding by a specific ethical code, but we also built a lot of websites and apps for clients like the European Commission and Swiss organizations like Santé Romande.
One of these was a greenfield project called Khresmoi, where we had an opportunity to build a health-based search engine using our database of certified health websites. The main architect of the project had already chosen the core technologies, but he had also accepted a job in the US, so I was his replacement. The project was built on Solr/Lucene, Perl, jQuery, and a weird database I had never heard of before called CouchDB.
I’m not really sure why he had chosen CouchDB, but it was extremely ill-suited for the project at hand. Essentially we were crawling websites and storing the entire content, along with some metadata, in CouchDB. We did this several times a day, and every time a page was updated, we simply overwrote the existing documents. We weren’t using CouchDB sync at all, and we weren’t checking to see if the content had changed before writing a new revision.
Since of course CouchDB is all about revisions, this meant that the size of the database kept blowing up. Our machines would get overloaded with tens of gigabytes of data. The original architect hadn’t foreseen any of these problems, so I had to learn from scratch what CouchDB was, and how to do things like “compaction” on a regular basis to keep the database from ballooning.
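Compaction is triggered per-database over CouchDB's HTTP API. A minimal sketch, assuming a CouchDB instance at localhost:5984 and a database name ("crawl") of my own invention:

```javascript
// Pure helper: build the compaction endpoint for a database.
// The server URL and database name are illustrative placeholders.
function compactUrl(server, dbName) {
  return `${server}/${encodeURIComponent(dbName)}/_compact`;
}

// Usage (requires a running CouchDB; the request must be a POST
// with a JSON content type):
// await fetch(compactUrl('http://localhost:5984', 'crawl'), {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
// });
// CouchDB answers {"ok": true} and compacts in the background,
// rewriting the file without the old revisions' bodies.
```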
We also had a lot of partners in the Khresmoi project who were very interested in aggregated views on our metadata, so I also had to learn how to execute map/reduce queries performantly, and keep those from growing out of control as well. It was pretty sink-or-swim, and to be honest I really disliked CouchDB at first; I was always looking for opportunities to replace it with something else.
By learning all the rough edges of CouchDB, though, I eventually gained an appreciation for what CouchDB was actually good at: sync. It also impressed upon me the importance of understanding the tradeoffs of a database before using it in a project.
Did you have a specific problem that CouchDB solved?
In my mind, CouchDB has two killer features: sync and HTTP. We weren’t using either one in this project. The Perl crawler stored webpage data in CouchDB, and CouchDB was never exposed to the frontend via HTTP; the data was just ferried into a Solr search database. This was also in the days before attachments, so we were storing all content as base64 strings.
What CouchDB did handle fairly well was map/reduce: we could run queries on the data and then send a simple, queryable URL to our partners so that they could work with it. It was also easy to set up authentication so that, for instance, only those with a username and a password could read the data but not write it. The downside was that the views took a long time to build; usually a partner would request a view on the data, and I’d say, “Okay, it’ll be done after the weekend.”
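For readers who haven't written one, a view of the sort described might look like this (the document fields and grouping key are my guesses, not from the actual project). In CouchDB the map function lives in a design document, paired here with the built-in `_sum` reduce:

```javascript
// Map function: emit one row per crawled page, keyed by site.
// Field names (doc.type, doc.site) are hypothetical.
function map(doc) {
  if (doc.type === 'page' && doc.site) {
    emit(doc.site, 1); // emit() is provided by CouchDB's view engine
  }
}
// Reduce: the built-in "_sum" adds the emitted 1s, giving a count of
// pages per site. Partners could then query a URL along the lines of:
//   GET /crawl/_design/stats/_view/pages_by_site?group=true
```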
For the folks who are unsure of how they could use CouchDB (because there are a lot of databases out there), could you explain the use case?
CouchDB’s superpower is sync. Sometimes I even try to explain it to people by saying, “CouchDB isn’t a database; it’s a sync engine.” It’s a way of efficiently transferring data from one place to another, while intelligently managing conflicts and revisions. It’s very similar to Git. When I make that analogy, the light bulb often goes off.
Where this often fails is that folks may have an existing datastore, and they just want some sync mechanism on top of it. For instance, they have a MySQL or a MongoDB database, and they just want PouchDB to sync to that instead of syncing to CouchDB. The reason this doesn’t work, which is often hard to grasp, is that those other databases don’t have a concept of revisions built in. For instance, when you delete a row or an object, it’s just gone. CouchDB keeps a tombstone around so that it can remember what was deleted.
The analogy I would give, for people who struggle to understand why they can’t just slap CouchDB replication on top of Mongo or MySQL, is that it’s like saying, “Hey, I love Git, and the Git client is really cool, but can I use it with my FTP server?” Obviously that doesn’t work: an FTP server is just a flat filesystem, with no concept of branches or revisions. It’s exactly the same with CouchDB.
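To make the tombstone idea concrete, here is a toy sketch (not CouchDB’s actual implementation; the store and the revision format are simplified inventions of mine): a delete writes a new revision marked `_deleted` instead of erasing the document, which is what lets a replicator propagate deletions to other replicas.

```javascript
// Toy document store with CouchDB-style tombstones.
// Revisions are "N-hash"; the hash part is faked for brevity.
const store = new Map();

function put(doc) {
  const prev = store.get(doc._id);
  const gen = prev ? Number(prev._rev.split('-')[0]) + 1 : 1;
  store.set(doc._id, { ...doc, _rev: `${gen}-fake` });
}

function remove(id) {
  const prev = store.get(id);
  if (!prev) throw new Error('not found');
  const gen = Number(prev._rev.split('-')[0]) + 1;
  // Keep a tombstone instead of deleting the entry outright, so a
  // sync peer can learn "this document was deleted at revision N".
  store.set(id, { _id: id, _rev: `${gen}-fake`, _deleted: true });
}
```

A plain `DELETE` in MySQL or a `remove()` in MongoDB leaves no such record behind, which is why CouchDB-style replication can’t simply be bolted on after the fact.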
What would you say are the top three benefits of using CouchDB?
Sync, reliability, and simplicity. As J. Chris Anderson has said, CouchDB doesn’t aim to be the Ferrari of databases; it wants to be the Honda Accord of databases. (See my old blog post on the subject.)
The append-only file format means that you can just `kill -9` a running CouchDB process and your data is still recoverable; it never gets corrupted. Also, the HTTP/REST interface is very easy to use; you can use something like `curl` or Postman to learn how it works. When I was learning CouchDB, I would often just put some sample data into a database using Futon, and then I’d play around with URL parameters until I understood how it was working.
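That kind of URL-parameter experimentation can be done with any HTTP client. A small sketch (the database name and parameter choices are illustrative):

```javascript
// Pure helper: build an _all_docs query URL from a params object.
// URLSearchParams handles the encoding and the &-joining.
function allDocsUrl(server, dbName, params) {
  const qs = new URLSearchParams(params).toString();
  return `${server}/${dbName}/_all_docs${qs ? '?' + qs : ''}`;
}

// e.g. the ten newest doc IDs along with their full bodies:
// GET http://localhost:5984/crawl/_all_docs?include_docs=true&descending=true&limit=10
```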
What tools are you using in addition for your infrastructure? Have you discovered anything that pairs well with CouchDB?
Well, as a co-maintainer of PouchDB, I obviously have to plug PouchDB here. PouchDB makes it trivially easy to sync between CouchDB on the server and IndexedDB, WebSQL, or LevelDB on the client. A lot of this can be credited to how well-thought-out CouchDB is as a whole.
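For anyone who hasn’t seen it, two-way PouchDB-to-CouchDB sync is only a few lines; this sketch assumes the pouchdb package is installed and a CouchDB is reachable at the given URL (the database names and URL are placeholders):

```javascript
// Assumes: `npm install pouchdb`, and a CouchDB at localhost:5984.
const PouchDB = require('pouchdb');

const local = new PouchDB('notes');  // LevelDB in Node; IndexedDB/WebSQL in browsers
const remote = new PouchDB('http://localhost:5984/notes');

// Continuous bidirectional replication with automatic retry.
local.sync(remote, { live: true, retry: true })
  .on('change', info => console.log('replicated', info.direction))
  .on('error', err => console.error('sync error', err));
```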
There are other tools I find useful, though, like Postman, which is a neat tool for debugging HTTP APIs. I’ve also written a tool called pouchdb-dump-cli, which can be used to “dump” an entire CouchDB or PouchDB database to a text file that can then be loaded back using pouchdb-load. Of course the classic backup tool for CouchDB is `cp` (i.e., just copy the `.couch` file), but pouchdb-dump/pouchdb-load can be nice for portability and for making it easy to inspect the full contents of a database.
What are your future plans with your project? Any cool plans or developments you want to promote?
Absolutely, we’ve got a lot of work going into PouchDB at the moment. Future improvements we plan to make are:
- A more performant secondary index system
- The purge API, which is the major piece of CouchDB functionality that is still unsupported by PouchDB
- Faster replication: there is still some low-hanging fruit in the replication algorithm where we can optimize the back-and-forth and speed up replication
Have a suggestion on what you’d like to hear about next on the CouchDB blog? Email us!