A CouchDB User Story: chatting with Assaf

In our interview, Assaf talked to us about his use of CouchDB in an internal project for his organization’s intranet. Assaf’s challenge was unique in that his project could not use clustering effectively, as it had to run entirely on one machine.

Assaf’s machine held nearly 6TB of data, with around 2 billion documents across ~20 DBs, serving around 100k reads/day and 20-50GB of writes/day. This led him to “debug the hell out of it”, resulting in this document: Linux tuning for better CouchDB performance.

Assaf went on to tell us more about why he chose CouchDB and how it has best supported his project’s needs.

How did you hear about CouchDB, and why did you choose to use it?

I initially encountered CouchDB on Google.

I had inherited a project that was using Apache SOLR as its main database, but back then (April 2016) it had about 100GB of data, so all was well. The only person with write access to the database was me, so all we needed from SOLR was to be very quick while reading, and it was.

But then, I got 1.2TB of zipped, highly nested, schemaless JSONs to index. SOLR has this neat feature: “Schemaless Mode” which basically just creates an index (=schema entry) for each new field it discovers.

I had to use this mode because all fields with a value of sha1 string had to be fast to query, and the field names were randomly generated (weird, I know).

Because the field names were random, SOLR would create new schema entries all the time, which led it to be extremely slow and unstable.

SOLR would also flatten the input JSONs (e.g. {"a":{"b":1}} => {"a.b":1}), which was very annoying for us. After a couple of weeks and not a lot of GBs indexed, we experienced a big power outage. SOLR took 5 days to recover from this incident (checksum on init? data recovery?), so our systems weren’t operational for that time span. This was UNACCEPTABLE!

I started googling for a schemaless DB that could support deeply nested JSONs. I ruled out MongoDB because of a bad past experience: very slow queries on a 10GB collection with indexes. I also ruled out Elasticsearch because of Lucene. I figured Lucene’s many files and file edits were what caused the long recovery time after the power outage.

I specifically googled “schemaless db” and “mongodb vs”, and that is how I came across CouchDB.

I started reading the documentation and it got me hooked on the “just relax”, “there is no turn off switch, just kill the process” and the ability to build indexes programmatically, so I could recurse into the objects and emit values that match the sha1 regex.
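To illustrate the kind of programmatic index described here, the following is a minimal sketch of a recursive walk that emits every string value that looks like a SHA-1 digest. It is not Assaf's actual view code. The names SHA1_RE, collectSha1 and mapDoc are hypothetical, the regex assumes a digest is exactly 40 hexadecimal characters, and the local emit is only a stand-in for the emit function CouchDB provides inside a map function.

```javascript
// Assumption: a SHA-1 digest is written as exactly 40 hexadecimal characters.
var SHA1_RE = /^[0-9a-f]{40}$/i;

// Recursively walk any JSON value and call `visit` for each string
// that matches the SHA-1 pattern, whatever the field name is.
function collectSha1(value, visit) {
  if (typeof value === 'string') {
    if (SHA1_RE.test(value)) {
      visit(value);
    }
  } else if (Array.isArray(value)) {
    value.forEach(function (item) {
      collectSha1(item, visit);
    });
  } else if (value !== null && typeof value === 'object') {
    Object.keys(value).forEach(function (key) {
      collectSha1(value[key], visit);
    });
  }
}

// Local stand-in for running a map function over one document.
// Inside CouchDB, the view's map function would instead look roughly like:
//   function (doc) { collectSha1(doc, function (h) { emit(h, null); }); }
// with the helper and the regex defined inside that function body.
function mapDoc(doc) {
  var emitted = [];
  var emit = function (key, value) {
    emitted.push([key, value]);
  };
  collectSha1(doc, function (hash) {
    emit(hash, null);
  });
  return emitted;
}

// Example document with a randomly named, deeply nested field.
var rows = mapDoc({
  _id: 'example',
  xk21: { q9: 'da39a3ee5e6b4b0d3255bfef95601890afd80709', note: 'not a hash' }
});
console.log(rows);
```

Emitting the digest as the key and null as the value keeps the index small. The full document can still be fetched at query time with include_docs=true.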

What would you say is the top benefit of using CouchDB?

Durability. Since the SOLR saga, we’ve experienced a few more power outages, hard disk failures and filesystem corruptions (at least 2 of each; yeah, our infrastructure can be better).

Amongst all the panic and horror, I was smiling.

After a power outage, CouchDB has zero recovery time. If a hard disk died or the filesystem got corrupted, CouchDB would just reacquire the lost data by synchronizing from a replica or replicating a backup.
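Restoring from a replica goes through CouchDB's replication API. As a rough sketch, the snippet below only builds the request for the POST /_replicate endpoint and does not send it. The helper name buildReplicationRequest, the host names and the database name are all hypothetical placeholders, not details from Assaf's setup.

```javascript
// Build (but do not send) a one-off replication request for CouchDB's
// POST /_replicate endpoint. The URLs passed in are placeholders.
function buildReplicationRequest(sourceUrl, targetUrl) {
  return {
    method: 'POST',
    path: '/_replicate',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      source: sourceUrl,
      target: targetUrl,
      create_target: true // create the target database if it is missing
    })
  };
}

// Hypothetical example: pull "mydb" from a replica back onto the local node.
var replication = buildReplicationRequest(
  'http://replica.example:5984/mydb',
  'http://localhost:5984/mydb'
);
console.log(replication.method, replication.path, replication.body);
```

The same body could be sent with curl or any HTTP client. The request needs administrator credentials on the node that runs the replication.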

What additional tools are you using in your infrastructure? Have you discovered anything that pairs well with CouchDB?

  • couchimport
  • jq
  • curl

What are your future plans with your project? Any cool plans or developments you want to promote?

Yes, I have found a neat trick to import an archive full of JSON files.

I also plan to add a section about client HTTP keep-alives to my document on tuning Linux for better CouchDB performance. I’ve found that using HTTP keep-alives to access CouchDB can drastically improve its performance, as it doesn’t need to build and destroy a TCP connection for every interaction with a client. For example, while using Node.js’ request or request-promise package, we’d set "require('http').globalAgent.keepAlive = true" and pass "forever: true" with each request.
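The same idea can be sketched without the request package, using the keep-alive support in Node.js' built-in http module. This swaps in a dedicated http.Agent created with keepAlive: true in place of the globalAgent and "forever: true" settings mentioned above. The helper name getDocument, the maxSockets value and the host, port and database in the usage comment are assumptions made for illustration.

```javascript
var http = require('http');

// A dedicated agent that keeps idle sockets open so later requests
// can reuse an existing TCP connection instead of opening a new one.
// maxSockets: 16 is an arbitrary illustrative limit.
var keepAliveAgent = new http.Agent({ keepAlive: true, maxSockets: 16 });

// Hypothetical helper: fetch one document from CouchDB over the shared agent.
function getDocument(host, port, db, id, callback) {
  var options = {
    host: host,
    port: port,
    method: 'GET',
    path: '/' + encodeURIComponent(db) + '/' + encodeURIComponent(id),
    agent: keepAliveAgent,
    headers: { Accept: 'application/json' }
  };
  var req = http.request(options, function (res) {
    var chunks = [];
    res.on('data', function (chunk) {
      chunks.push(chunk);
    });
    res.on('end', function () {
      var doc;
      try {
        doc = JSON.parse(Buffer.concat(chunks).toString('utf8'));
      } catch (err) {
        return callback(err);
      }
      callback(null, doc);
    });
  });
  req.on('error', callback);
  req.end();
}

// Usage (assumes a CouchDB node on localhost:5984 with a database "mydb"):
// getDocument('localhost', 5984, 'mydb', 'some-id', function (err, doc) {
//   if (err) { return console.error(err); }
//   console.log(doc);
// });
```

Every call to getDocument shares one agent, so consecutive requests to the same host and port can reuse a pooled socket.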

 

Use cases are a great avenue for sharing useful technical information. Let us know how you use CouchDB! Additionally, if there’s something you’d like to see covered on the CouchDB blog, we would love to accommodate. Email us!

For more about CouchDB visit couchdb.org or follow us on Twitter at @couchdb
