Feature: Mango Query

This is the fourth in a series of blog posts introducing the Apache CouchDB 2.0 release. Read parts one, two, and three in the series. 

Two years ago, Cloudant developed a declarative style syntax for creating and querying Cloudant indexes. The idea was to attract users who were not familiar with Map-Reduce and Javascript but still wanted to experience the power of NOSQL databases. Moreover, the syntax was MongoDB-inspired, meaning that users already familiar with MongoDB’s find() operator could easily transition over to Cloudant’s new declarative API. Cloudant introduced this feature as Cloudant Query. Within a few months, Cloudant donated Cloudant Query to CouchDB. We decided to adopt the development codename for introduction to the CouchDB community. Lo and behold: Mango.

For a quick introduction on how to get started with creating and querying indexes using Mango, check out this informative post: Introducing Cloudant Query.

When Mango was first donated to CouchDB, the codebases were identical. However, the repositories diverged as Cloudant added a new text-search feature to Cloudant Query that leveraged Cloudant’s existing full-text-search API. The new text-search feature also made the existing query API more flexible and truly ad-hoc. At the time, Cloudant’s full-text-search was not open sourced, and thus CouchDB’s version could not reap the benefits.

In late July of 2015, Cloudant open sourced full-text-search. This allowed Cloudant Query and Mango Query to become synchronized.

Check out Enable Full Text Search in Apache CouchDB to start using text search with Mango Query.

To fully understand the differences between original Mango JSON indexes and text indexes checkout Mango JSON vs Text Indexes. One of the restrictions of Mango in the past two years was that users had to create an index first before running a query. This was a nuisance to developers who just wanted to execute a query against the database, especially when they encountered the infamous no_index_found error. We’re happy to announce that in CouchDB 2.0, this restriction has been lifted. Users can now execute queries without the need to create an index first.

 

Tony Sun is a software developer at IBM Cloudant focusing on indexing and core API functionality. He is also a CouchDB committer.


You can download the latest release candidate from http://couchdb.apache.org/release-candidate/2.0/. Files with -RC in their name a special release candidate tags, and the files with the git hash in their name are builds off of every commit to CouchDB master.

We are inviting the community to thoroughly test their applications with CouchDB 2.0 release candidates. See the testing and setup instructions for more details.

CouchDB 2.0 Architecture

This is the third in a series of blog posts introducing the Apache CouchDB 2.0 release. Read part one: The Road to CouchDB 2.0 and part two: Fauxton, the new CouchDB Dashboard.

CouchDB has always anticipated clustering as a core feature and, with 2.0, it has finally landed.

We’ve followed the Dynamo model made famous by Amazon where a database is divided into a number of equal, but separate, pieces, which we refer to as shards. Any given document belongs to one shard, and this is determined directly from its ID (and only its ID). This arrangement means that any node in the CouchDB cluster knows exactly where any document is hosted, allowing for scalable reading and writing. In addition, CouchDB 2.0 keeps multiple copies of each shard, so that the loss of any individual node is not fatal.

When creating a database, you can specify the number of shards (with ?q=) and the number of copies of those shards (with ?n=) or use the defaults. The default N is 3 which is almost always the right value, fewer is too risky, greater is too expensive. The default Q is 8 and this is suitable for most uses. You are well advised to raise this number if your database will be large, or if you plan to increase the size of your cluster significantly over time. As a rule of thumb, aim for no more than 10 million documents per shard.

When a document is created, updated, deleted, or read, the node that processes the HTTP request (which can be any node in the cluster) spawns N processes that run in parallel, to attempt the desired operation at every copy of the document. The coordinating node will wait for N/2+1 responses before merging those responses as the HTTP response. This overlap helps to present a consistent view of the database, though that consistency is not guaranteed (CouchDB 2.0 is an Available/Partition-Tolerant system by design, we sacrifice Consistency for Availability).

All the usual CouchDB features work as normal with only minor changes
in some cases. The most noteworthy is the changes feed. The update
sequences that CouchDB 2.0 returns are now strings, not numbers, as it
encodes the numeric sequences of each shard of the database. You treat
them as you should always have treated the numeric sequence;
opaquely. Pass the update sequence value back to the since= parameter of _changes and all is well.

A further note, though: It is possible for CouchDB 2.0’s changes feed to return updates from before your since parameter so it is important to be idempotent when you process the rows you receive (the replication protocol is a good example of this). The reason for these so-called ‘rewinds’ is the case when CouchDB cannot contact the specific copy of a shard included in the update sequence and must replace it with another copy. Since the updates to the multiple copies of a shard are not coordinated, and hence are not in the same order, CouchDB finds the most recent checkpoint between the two copies and rewinds you there. This is typically a small amount but it depends strongly on update rate.

This has been a necessarily brief overview of the 2.0 architecture but we hope it covers enough ground to pique your interest. Oh, and relax.

 

Robert Newson is an engineer at Cloudant and an Apache CouchDB PMC member.


You can download the latest release candidate from http://couchdb.apache.org/release-candidate/2.0/. Files with -RC in their name a special release candidate tags, and the files with the git hash in their name are builds off of every commit to CouchDB master.

We are inviting the community to thoroughly test their applications with CouchDB 2.0 release candidates. See the testing and setup instructions for more details.