The Road to CouchDB 3.0: Automatic View Index Warming

This is a post in a series about the Apache CouchDB 3.0 release. Check out the other posts in this series.

Querying in CouchDB has always been a little different from other databases. One such difference is how indexes are created and updated. In most other databases, an index is created upon definition and updated when new data arrives. In CouchDB, an index is created and updated on demand, at query time.

The underlying reason is a performance trade-off: in other databases, you are encouraged to have as many indexes as you need, but no more, because each additional index makes inserting and updating data more resource intensive. In CouchDB, on the other hand, you can have as many indexes as you like; only the ones that are actually queried are built, and only at query time.

The trade-off in particular is the following: applying many database updates to an index at once is a lot more efficient than applying them one by one. However, if an index has not been queried in a while, it can take quite some time to process all the updates that have happened in the meantime. So the trade-off is data insertion speed for query latency.

Over the lifetime of CouchDB 1.x and 2.x, users have built their own little cron jobs that periodically query all indexes, so that each real query has at most a small database-update gap to bridge, making query responses more predictable.
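Such a warming job can be sketched in a few lines. The following is an illustrative sketch, not any particular tool users actually ran; the server address, database, and design document names are made up. It relies on the fact that querying a view with `limit=0` makes CouchDB update the index without transferring any rows.

```python
from urllib.parse import quote
from urllib.request import urlopen

def warm_urls(base, db, ddoc_views):
    """Build one warm-up URL per view; limit=0 updates the
    index without returning any rows."""
    urls = []
    for ddoc, views in ddoc_views.items():
        for view in views:
            urls.append(f"{base}/{quote(db)}/_design/{quote(ddoc)}"
                        f"/_view/{quote(view)}?limit=0")
    return urls

def warm(base, db, ddoc_views):
    # Hitting each view forces CouchDB to fold any pending database
    # updates into the index, so real queries find it fresh.
    for url in warm_urls(base, db, ddoc_views):
        urlopen(url).read()
```

A cron entry would then call `warm()` every few minutes for each database that has views worth keeping fresh.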

CouchDB 3.0 introduces Ken, an automatic background indexing service that does this for you. No need to keep maintaining those view update cron jobs. Ken has been on duty at Cloudant for a long time, and is finally available in open source CouchDB as of 3.0.
See the documentation for more details.

The Road to CouchDB 3.0: Shard Splitting

This is a post in a series about the Apache CouchDB 3.0 release. Check out the other posts in this series.

One of the features introduced in CouchDB 2.x was sharding: splitting a single logical database into multiple physical files. In CouchDB 1.x, there was a strict 1:1 ratio between databases and database files.

Why sharding, then? For large databases, storing all data in a single file makes that file unwieldy. Just thinking about backing up a single multi-gigabyte, or even multi-terabyte, file should make you uncomfortable.

Sharding also allows for placing each shard of a database on a separate server. This allows you to store more data in a single database than there is capacity on any one node.

There is another reason, however: CouchDB’s file format maintains a single serial queue for write requests to each file, mainly for data-consistency reasons. With multiple files, write-heavy databases can write truly in parallel, greatly improving write throughput on sharded databases.

And what’s good for the writer is good for the reader: with sharding, more concurrent read requests can be handled with ease.

The downside of sharding is that requests that need to gather data from each shard must do more internal cluster work to collect it all. So it is advisable to use the sharding level you need for the reasons above, but no more. See also the post on partitioned queries.

Sharding in 2.x has one further drawback: you must set the sharding level (q) at database creation time. That is, you must anticipate how big a database will eventually get, and what shard level will make sense for you at that future time.

The shard level could not be changed after the database was created. Even though you could change the default q level for newly created databases, any existing database kept the q level it was created with.
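Concretely, the shard count is chosen via the `q` query parameter on the database-creation request. A minimal sketch, with an illustrative server address and database name:

```python
from urllib.request import Request, urlopen

def creation_url(base, name, q):
    """URL that creates database `name` with q shards."""
    return f"{base}/{name}?q={q}"

def create_db(base, name, q):
    # PUT the URL; the shard count is then fixed for this database
    # (in 2.x: forever; in 3.0: until you split its shards).
    return urlopen(Request(creation_url(base, name, q), method="PUT"))

# e.g. create_db("http://admin:pass@127.0.0.1:5984", "orders", 8)
```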

Some external tools exist to help with resharding databases, but they require at least a little bit of downtime while the newly-sharded database is built up again, or require a perfectly timed, or cleverly-scripted, load balancer switchover.

CouchDB 3.0 introduces live shard splitting. You can split a database’s shards while the database server is running, and while the database is fully available to your application. This allows you to choose a low q level when creating the database and increase it as your application grows, making the best use of your computing resources at any given time.
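Splitting is driven through the `_reshard` HTTP endpoint: you post a job describing which database (and optionally which shard range) to split, then poll the job until it completes. A minimal sketch, assuming an illustrative server address and database name:

```python
import json
from urllib.request import Request, urlopen

def split_job(db, range_=None):
    """Payload for POST /_reshard/jobs: split the given shard
    range of `db`, or all of its shard ranges if range_ is None."""
    job = {"type": "split", "db": db}
    if range_ is not None:
        job["range"] = range_
    return job

def post_job(base, job):
    # Submit the split job; CouchDB returns the created job ids,
    # which can be polled via GET /_reshard/jobs/{id}.
    body = json.dumps(job).encode()
    req = Request(f"{base}/_reshard/jobs", data=body,
                  headers={"Content-Type": "application/json"},
                  method="POST")
    return json.load(urlopen(req))

# e.g. post_job("http://admin:pass@127.0.0.1:5984",
#               split_job("orders", "00000000-7fffffff"))
```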

Note that shard-merging is not supported at this point, and you still have to rely on those external tools if you need this functionality.
Check out the documentation for more details.