A CouchDB User Story: chatting with Assaf Morami

In our interview with Assaf Morami, he talked to us about his use of CouchDB in an internal project for his organization’s intranet. Assaf’s challenge was unusual in that his project could not use clustering effectively, as it had to run entirely on one machine.

Assaf’s machine held nearly 6TB with around 2 billion documents across ~20 DBs, serving around 100k reads/day and 20-50GB of writes/day. This led him to “debug the hell out of it”, resulting in this document: Linux tuning for better CouchDB performance.

Assaf went on to tell us more about why he chose CouchDB and how it has helped support his project’s needs.

How did you hear about CouchDB, and why did you choose to use it?

I initially encountered CouchDB on Google.

I had inherited a project that was using Apache SOLR as its main database. Back then (April 2016) it had about 100GB of data, so all was well. The only person with write access to the database was me, so all we needed from SOLR was for it to be very quick while reading, and it was.

But then, I got 1.2TB of zipped, highly nested, schemaless JSONs to index. SOLR has this neat feature: “Schemaless Mode” which basically just creates an index (=schema entry) for each new field it discovers.

I had to use this mode because all fields with a value of sha1 string had to be fast to query, and the field names were randomly generated (weird, I know).

Because the field names were random, SOLR would create new schema entries all the time, which led it to be extremely slow and unstable.

SOLR would also flatten the input JSONs (e.g. {"a":{"b":1}} => {"a.b":1}), which was very annoying for us. After a couple of weeks and not a lot of GBs indexed, we experienced a big power outage. SOLR took 5 days to recover from this incident (checksum on init? data recovery?), so our systems weren’t operational for that time span. This was UNACCEPTABLE!

I started googling for a schemaless DB that could support deeply nested JSONs. I ruled out MongoDB because of a bad past experience: very slow queries on a 10GB collection with indexes. I also ruled out Elasticsearch because of Lucene. I figured Lucene’s many files and file edits were what caused the long recovery time after the power outage.

I specifically googled “schemaless db” and “mongodb vs”, and that is where I came across CouchDB.

I started reading the documentation and got hooked on “just relax”, on “there is no turn off switch, just kill the process”, and on the ability to build indexes programmatically, so I could recurse into the objects and emit values that match the sha1 regex.
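To make the idea of a programmatic index concrete, here is a minimal sketch of a CouchDB map function that recurses into a document and emits every string value shaped like a SHA-1 digest. It is not Assaf's actual view: the function name, the sample document, and the assumption that a "sha1 string" means exactly 40 hex characters are all illustrative. The emit() stub exists only so the sketch can be run in Node.js; inside CouchDB, emit() is provided by the view server.

```javascript
// Sketch of a CouchDB map function (names are illustrative, not from the interview).
// Written in ES5 style, since older CouchDB JavaScript engines predate ES2015.
function mapSha1Values(doc) {
  var sha1 = /^[0-9a-fA-F]{40}$/; // assumption: a SHA-1 value is 40 hex characters

  function walk(value) {
    if (typeof value === 'string') {
      if (sha1.test(value)) {
        // Key is the normalized hash; CouchDB attaches the document id to each row.
        emit(value.toLowerCase(), null);
      }
    } else if (value !== null && typeof value === 'object') {
      // Handles both arrays and nested objects, whatever the field names are.
      for (var key in value) {
        if (Object.prototype.hasOwnProperty.call(value, key)) {
          walk(value[key]);
        }
      }
    }
  }

  walk(doc);
}

// Local stand-in for CouchDB's emit(), only so the sketch can be tried in Node.js.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Hypothetical document with random field names, as described in the interview.
mapSha1Values({
  _id: 'example',
  x9f: { q1: 'da39a3ee5e6b4b0d3255bfef95601890afd80709', note: 'not a hash' },
  list: [{ zz: 'DA39A3EE5E6B4B0D3255BFEF95601890AFD80709' }]
});

console.log(rows.length + ' rows emitted');
```

In a real design document the function would be stored as an anonymous function (doc) { ... } string under a view's "map" key, and querying the view by key would return the documents that contain a given hash anywhere in their structure.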

What would you say is the top benefit of using CouchDB?

Durability. Since the SOLR saga, we’ve experienced a few more power outages, hard disk failures and filesystem corruptions (at least 2 of each; yeah, our infrastructure can be better).

Amongst all the panic and horror, I was smiling.

After a power outage, CouchDB has zero recovery time. If a hard disk died or the filesystem got corrupted, CouchDB would just reacquire the lost data by synchronizing from a replica or replicating a backup.
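As a rough illustration of what "synchronizing from a replica" can look like at the API level, the sketch below builds (but does not send) a request for CouchDB's /_replicate endpoint. The host names and the database name are hypothetical, and this is not a description of how Assaf's recovery was actually scripted.

```javascript
// Builds (but does not send) a request for CouchDB's /_replicate endpoint.
// The host names and database name used below are hypothetical.
function buildReplicationRequest(sourceUrl, targetUrl) {
  return {
    method: 'POST',
    path: '/_replicate',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      source: sourceUrl,
      target: targetUrl,
      create_target: true // create the target database if it no longer exists
    })
  };
}

const request = buildReplicationRequest(
  'http://replica.example:5984/samples',
  'http://localhost:5984/samples'
);

console.log(request.method, request.path, request.body);
```

Sending this body to the recovering node would ask it to pull the database back from the surviving copy; in practice, authentication would also need to be supplied.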

What additional tools are you using in your infrastructure? Have you discovered anything that pairs well with CouchDB?

  • couchimport
  • obviously.
  • jq.
  • curl.

What are your future plans with your project? Any cool plans or developments you want to promote?

Yes, I have found a neat trick to import an archive full of JSON files.

I also plan to add a section about client HTTP keep-alives to my document detailing my results for seeking better CouchDB performance on Linux systems. I’ve found that using HTTP keep-alives to access CouchDB can drastically improve its performance, as it doesn’t need to build and tear down TCP connections between interactions with clients. For example, while using Node.js’ request or request-promise package we’d turn on "require('http').globalAgent.keepAlive = true" and pass "forever: true" with each request.
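The keep-alive idea can also be sketched with only Node.js' core http module and a dedicated agent. This is a substitution for illustration, not the request/request-promise setup described above; the host, the maxSockets value, and the getDoc helper are assumptions.

```javascript
const http = require('http');

// One shared agent that keeps TCP sockets open and reuses them across requests.
const keepAliveAgent = new http.Agent({ keepAlive: true, maxSockets: 16 });

// Hypothetical helper: GET a document by id and resolve with the parsed JSON body.
function getDoc(dbName, docId) {
  const options = {
    host: 'localhost', // assumed CouchDB host
    port: 5984,        // CouchDB's default port
    path: '/' + encodeURIComponent(dbName) + '/' + encodeURIComponent(docId),
    agent: keepAliveAgent // every call goes through the same pooled connections
  };

  return new Promise((resolve, reject) => {
    http.get(options, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        try {
          resolve(JSON.parse(body));
        } catch (err) {
          reject(err);
        }
      });
    }).on('error', reject);
  });
}
```

Calling getDoc repeatedly, for example getDoc('samples', 'some-id'), would then reuse already-open sockets instead of paying for a new TCP handshake on every read.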

 

Use cases are a great avenue for sharing useful technical information, so let us know how you use CouchDB! Additionally, if there’s something you’d like to see covered on the CouchDB blog, we would love to accommodate. Email us!

For more about CouchDB visit couchdb.org or follow us on Twitter at @couchdb

CouchDB Developer Profile: Glynn Bird

The CouchDB community is made up of a unique network of individuals with different backgrounds and skill sets. Glynn Bird hails from Middlesbrough, UK and found his way to CouchDB via research and development for the steel industry, writing CRM systems, and eventually NoSQL. Now Glynn works for IBM Cloud Data Services as a Developer Advocate and Author. He recently shared his experience working with CouchDB with us.

Do you want to talk about your background, or how you got involved in CouchDB?

I started my career in the research and development arm of the steel industry, making sensors and control & instrumentation systems. I then moved into web development for a business directory service. During that time I was looking for a database that could store fairly complex JSON objects and ended up choosing CouchDB after evaluating it against other document stores. CouchDB was the only one that had an HTTP API, a free-text search feature, and the ability to scale up (in terms of data size and traffic) by adding more hardware to the cluster. I ended up choosing Cloudant as a hosted solution.

What areas of the project do you work on?

I’m not a CouchDB “core” developer – I don’t know a word of Erlang! I have worked on the Nano project, which is the official Node.js client library for CouchDB. It started life as an open-source project written by Nuno Job, who kindly donated it to the Apache Software Foundation.

Nano is a general-purpose library, but I’ve also written or worked on other libraries such as Silverlining for new users, nodejs-cloudant (Nano plus some Cloudant-specific functions), and cloudantlite for folks who want to learn the API.

I also enjoy building command-line utilities that interact with CouchDB via the API to provide backup, diff, csv import/export, design document migration, shell access and more.

What’s a recent development/event/aspect of the project that you’re excited about?

CouchDB’s replication engine is one of its major strengths and is going to see a significant iteration shortly. Combining a server-side CouchDB cluster with PouchDB running in a web browser means your web app can operate completely off the grid. This Offline First approach is central to the Progressive Web App movement which aims to allow web applications to compete with native, installable phone/tablet apps.

Building an app that replicates between a local and remote copy of the data often follows a “database per user” pattern. Check out the Hood.ie framework or the Envoy library, which can help you get started.
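One small piece of the “database per user” pattern is deciding what each user's remote database is called. The sketch below follows the naming convention of CouchDB's couch_peruser feature (a "userdb-" prefix plus the hex-encoded username); the helper name is hypothetical, and Hood.ie or Envoy may organize per-user data differently.

```javascript
// Derives a per-user database name in the style of CouchDB's couch_peruser:
// "userdb-" followed by the hex encoding of the username.
function userDbName(username) {
  return 'userdb-' + Buffer.from(username, 'utf8').toString('hex');
}

console.log(userDbName('bob')); // userdb-626f62
```

A client would then replicate between its local copy (for example a PouchDB database in the browser) and the remote database with this name, so each user only ever syncs their own data.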

What do you think are the top three benefits of using CouchDB as a database solution?

Schema flexibility – if your data model is evolving, just modify the form of JSON you save.
Scale – just add servers! It’s not quite that simple, but it’s getting there.
High availability – CouchDB is lots of servers behind a load-balancer. When a node goes offline, there are others with the same data that continue to provide service.
Replication – oops that’s four!

What advice do you have for someone who just discovered CouchDB?

CouchDB doesn’t know about or ask for your database schema, but that doesn’t mean you shouldn’t think about your schema in advance. Consider the questions your application is going to ask of your data and how that can be achieved in a performant way using the querying, indexing and aggregation functions available.
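As one way to picture "thinking about your questions in advance", the sketch below writes down a single question as a Mango index definition plus a matching query body, of the kind posted to a database's _index and _find endpoints. The field names (customerId, status, total), the index name, and the question itself are invented for illustration.

```javascript
// Hypothetical question: "which open orders belong to a given customer?"

// Index definition, as it could be POSTed to /{db}/_index.
const indexDefinition = {
  index: { fields: ['customerId', 'status'] },
  name: 'orders-by-customer-status',
  type: 'json'
};

// Query body, as it could be POSTed to /{db}/_find.
function openOrdersQuery(customerId) {
  return {
    selector: { customerId: customerId, status: 'open' },
    fields: ['_id', 'total'], // return only what the application needs
    limit: 25
  };
}

console.log(JSON.stringify(indexDefinition));
console.log(JSON.stringify(openOrdersQuery('c-42')));
```

Writing the index and the query side by side like this makes it easier to check, before any data is loaded, that every field the selector filters on is covered by an index.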

Joan Touzet’s 2013 “Misconceptions about CouchDB” talk is essential viewing for a developer new to CouchDB, especially if they are coming from a relational background. Joan’s presentation pre-dates the “Mango” query language and the other “2.0+” features, but has otherwise aged well.

Other than that, ask the community, who are very friendly and willing to answer your questions. Chat on Slack or IRC, or raise a question tagged couchdb on Stack Overflow.

 

Have a suggestion on what you’d like to hear about next on the CouchDB blog? Email us!