Testing clustered applications – like CouchDB

There are days when you want to hug your tests – especially when your project takes longer than expected or is being refactored. Perhaps it’s because they let you keep moving forward without fear. This blog post picks one particular topic in the test space: integration tests for clustered applications.

There are particular challenges to them because of CAP [1], so this post will cover how to handle those challenges. My recent interest in this stems from adjusting the integration tests of CouchDB [2] to cover the new cluster feature that will be part of version 2.0, offering an API equal (or close) to what the non-clustered version 1.6 could provide. The post is certainly far from featuring something outstanding (and it should be mentioned that the tests existed before, written by some very bright people [5]). Nonetheless, I’ve learned some lessons along the way that maybe you’ll find useful as well.

While unit tests do a stellar job (and should be the first thing to write anyway), there are some aspects they won’t cover; hence you need integration / acceptance tests that look at larger parts of the application (or the application as a whole) and make sure all the parts fit together. These tests are far from easy, however, and they’re even harder to create and maintain than unit tests.

They’re hard to get stable and really meaningful in the sense that when they’re red, they’re red for a reason and point out an important spot to check. When they’re never red (despite your application being hopelessly broken), they’re meaningless. When they’re red without real cause (maybe re-running makes them green again), you can bet that no one looks at them after a week.

As CouchDB tries to maintain API compatibility (there is an existing test suite for version 1.6), the logical step for 2.0 is running the same suite against 2.0 (i.e., a cluster) to see what happens. Clustering means there are several nodes working together to serve requests with data being split into shards. For every answer, several nodes / shards have to be queried to produce a combined result.

Long story short—running the tests did not produce much of a result except loads of red—as clusters add some powerful (and quite nasty) specifics to the game.

A good way to explain these specifics is the aforementioned CAP theorem [1]: basically, it states that a distributed application cannot at the same time be

  • Consistent (i.e., whichever endpoint you ask, you get the latest and only valid response)
  • Available (i.e., you get an answer immediately, even with ongoing work on other endpoints)
  • Partition tolerant (i.e., there are several parts / endpoints operating in parallel)

Instead, there is always a tradeoff to be balanced out individually. CouchDB (like many others) decided to make the tradeoff at the C-part: specifically, there is no guarantee that all endpoints will (immediately) produce the same and only-true result (e.g., because the cluster has not fully synced just yet). Furthermore, combining the results has potential for further surprises (because individually ordered results from several machines are merged). Nothing here is specific to CouchDB; it’s a general tradeoff in the CAP space.

What this consistency issue means is that many of the assertions which held true in the non-clustered version are no longer always valid. The nastiest thing here: a test might be green one run, red the next, green again, and so on. You could really say that the C stands for Crazyheisenbug [3] / [4], i.e., absolute truth is replaced with probability.

Shame you can’t test probability. And you shouldn’t try either – rather, a better way is to dig a little deeper into how the cluster works and come up with rewritten tests whose assertions will always be true and meaningful, i.e., running the test gives a valuable statement about whether or not the current codebase is sound (as opposed to expect(true).toBeTruthy() 😉 ).

It’s up to you to judge whether that attempt succeeded, but here are the lessons learned:

Lesson learned #1: Think long and hard about ordering

Apart from naive balancing (like modulo-ing some ID to decide the one and only system to always handle some data without ever combining several systems’ results), the result in a cluster is computed dynamically from several shards that might reside in several processes or even several nodes.

Combining several (partial) lists coming from several shards into one list breaks some assumptions: first, for time-based lists, there is no absolute guarantee of time; second, for sequence-based lists, there are several systems, each with its own sequence (CouchDB solves this by appending a hash to the numbered sequence); and lastly, for several identical keys there is not necessarily the expected natural order any more (because results are copied together from shards and the order can get scrambled).

Take the _changes feed (a nice CouchDB feature for getting immediate feedback on changes in a database): say you insert four documents (1, 2, 3, 4) in order into the cluster and then query the _changes feed. _changes will contain all four – but in ANY order (not necessarily 1, 2, 3, 4; it can also be 3, 2, 1, 4 or 4, 2, 1, 3 or even 4, 3, 2, 1 – you simply don’t know).

Implication for re-writing: move away from array[0].id==1 and so on, towards array.length==4 && array.contains(1) and so on (see the sketch below). Likewise, when there are several identical keys, get away from accidentally testing some implicit order and check array lengths instead (plus contains-checks, when the set is complete).
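
Here is a minimal sketch of such an order-independent assertion against _changes. It assumes Node 18+ (for the global fetch), a local dev cluster on port 5984, admin:secret credentials, and a test database called testdb – all of these are placeholders, not part of the original test suite.

```js
const assert = require('assert');

// Hypothetical local setup: database "testdb" on a dev cluster at port 5984,
// admin:secret credentials. Node 18+ provides the global fetch.
const AUTH = 'Basic ' + Buffer.from('admin:secret').toString('base64');

async function checkChangesContainAll() {
  const res = await fetch('http://127.0.0.1:5984/testdb/_changes', {
    headers: { Authorization: AUTH },
  });
  const { results } = await res.json();
  const ids = results.map((row) => row.id);

  // Not: assert.strictEqual(results[0].id, '1') -- the merge order across shards is not guaranteed.
  assert.strictEqual(ids.length, 4);
  for (const id of ['1', '2', '3', '4']) {
    assert.ok(ids.includes(id), `missing change for doc ${id}`);
  }
}

checkChangesContainAll().catch(console.error);
```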

Lesson learned #2: Make quora work for you

When redundancy is involved, clustering solutions may return "ok" before all copies are written. CouchDB does this as well.

When querying immediately after getting the "ok" (e.g., for an assertion), you may run into an unexpected result. Much of this phenomenon is mitigated by a read quorum, i.e., the system tries to read from several sources. When both read and write quorums are a qualified majority, you’re normally safe.

Still, situations can occur when not all quora are adhered to (for example, special queries that don’t care about quora). In that case, setting the write quorum equal to the number of nodes will help, as everything is then written everywhere. It’s a little counter-intuitive (and a little against the idea of a cluster), but for testing it does help (see the sketch below).
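
As a sketch of that trick on a hypothetical three-node dev cluster: the per-request w / r quorum parameters used here stem from CouchDB’s BigCouch heritage, so check that your version actually honours them; URL, credentials and database name are placeholders.

```js
const assert = require('assert');

// Hypothetical local 3-node dev cluster, database "testdb", admin:secret credentials (Node 18+ fetch).
const AUTH = 'Basic ' + Buffer.from('admin:secret').toString('base64');
const BASE = 'http://127.0.0.1:5984/testdb';

async function writeThenReadWithFullQuorum() {
  // w=3: only report "ok" once all three copies are written
  let res = await fetch(`${BASE}/mydoc?w=3`, {
    method: 'PUT',
    headers: { Authorization: AUTH, 'Content-Type': 'application/json' },
    body: JSON.stringify({ value: 42 }),
  });
  assert.ok(res.ok, `write failed with ${res.status}`);

  // r=3: consult all three copies before answering
  res = await fetch(`${BASE}/mydoc?r=3`, { headers: { Authorization: AUTH } });
  const doc = await res.json();
  assert.strictEqual(doc.value, 42);
}

writeThenReadWithFullQuorum().catch(console.error);
```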

It’s probably worthwhile to remember that a lack of consistency is worse for testing than for real life. Imagine liking something on Facebook and then checking the activity log. Maybe (hypothetically) the like is not there right away, but a couple of seconds later it is—who cares? For tests, you can hardly wait for a minute after each action before asserting, so helping one’s luck a little bit seems quite legitimate.

Lesson learned #3: Cut down on the number of tests

Okay, that’s not a very cluster-specific one. But since getting a stable result becomes even harder in a cluster, it’s very valid.

It’s per se much better to have a few good tests than many bad ones – not only because false alarms render the whole suite useless, but also because, at the end of the day, you want a reliable and meaningful statement about the application.

Integration tests will never cover everything (and trying to lands you in a really nasty place), but there are a few main paths (often those you try manually first) that tell a LOT about what is up (or down). So you can comment out / ignore tests you don’t have time to fix right away (and that don’t cover one of the few critical paths, at least the way you see things); looking back, this seems like a good and valid approach.

Lesson learned #4: The API is not the same

It’s not just that clusters are different in terms of how they aggregate (and behave in general). It’s also the case that some corner cases just have not been cluster-ized (or in fact cannot be cluster-ized ever).

So the API will be different. Sometimes the test becomes obsolete; sometimes, to keep the tests meaningful, it’s important to think up an alternative.

Lesson learned #5: Debug deeply

Some error messages are not what they seem to be. Like an "invalid character in HTTP".

Re-checking the test’s call with curl --raw, however, revealed a grossly wrong answer: an HTTP 500 injected after the initial header because of timeouts that were set too short.
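
If curl is not at hand, a few lines of Node can serve as a similar raw probe – this is just a sketch assuming a local node on port 5984 and the _all_dbs endpoint (add authentication if your setup requires it); the point is only to see the unparsed bytes, including anything odd injected mid-stream.

```js
// Raw probe in the spirit of curl --raw: print exactly the bytes the server sends back.
const net = require('net');

const socket = net.connect(5984, '127.0.0.1', () => {
  // A minimal hand-written request; Connection: close lets the script exit when the server is done.
  socket.write('GET /_all_dbs HTTP/1.1\r\nHost: 127.0.0.1:5984\r\nConnection: close\r\n\r\n');
});
socket.on('data', (chunk) => process.stdout.write(chunk));
socket.on('error', (err) => console.error(err));
```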

Long story short: not only will tests be more stable and meaningful, you’ll also discover important shortcomings along the way. So don’t settle until you really have a good answer. It’s tedious and you might not follow through every time, but it’s a helpful reminder that stability is vital and these root cause analyses pay off.

Sometimes a race condition just means that a tiny moment later the response is correct, and brute-force retrying three times would do the trick. It shouldn’t have to.

Lesson learned #6: Start with one node and work up (to three)

One node still excludes lots of the issues faced in clustering – so starting with one node feels like cheating. And it is.

But: things that don’t work on one node will never run on three.

They’re even harder to fix with three. After getting things to work on one node, there’s lots of work left to make them work on three, but at least you can build on a halfway stable basis.

Lesson learned #7: Continue the test flow

Maybe you’re lucky and your test is split up into tiny little routines, and the next one (successfully) runs even when its predecessor fails.

If you’re not that lucky (or the scenario requires steps to build upon one another) and you discover a bug in the application that prevents huge parts of the remaining test from running: put in a // TODO: after fix of Tick-123 comment and make sure the rest of the test runs (see the sketch below).
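
A small sketch of that idea; "Tick-123" is a made-up ticket number standing in for whatever bug currently blocks one step, and the URL, credentials and database name are placeholders for a local dev setup (Node 18+ for the global fetch):

```js
const assert = require('assert');

// Hypothetical local setup for illustration only.
const AUTH = 'Basic ' + Buffer.from('admin:secret').toString('base64');
const BASE = 'http://127.0.0.1:5984/testdb';

async function scenario() {
  let res = await fetch(`${BASE}/doc1`, {
    method: 'PUT',
    headers: { Authorization: AUTH, 'Content-Type': 'application/json' },
    body: JSON.stringify({ step: 1 }),
  });
  assert.ok(res.ok);

  // TODO: after fix of Tick-123 -- this step currently breaks the run, so it is parked here
  // instead of blocking everything that follows:
  // res = await fetch(`${BASE}/doc1?some_broken_option=true`, { headers: { Authorization: AUTH } });
  // assert.strictEqual(res.status, 200);

  // The remaining steps keep running and keep giving feedback.
  res = await fetch(`${BASE}/doc1`, { headers: { Authorization: AUTH } });
  assert.strictEqual((await res.json()).step, 1);
}

scenario().catch(console.error);
```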

You’ll certainly find many other issues along the way—and get the good feeling of knowing how far you actually are (and how far the software is, too).

Lesson learned #8: Start w/ new and empty DBs

It’s hard to undo all actions performed in a test DB – at least fully and reliably. It’s far better to always start off with a new and empty one.

One thing to try is deleting the test DB immediately before executing a test and re-creating it. It gets tricky, though, when cleaning up DB resources takes some time.

CouchDB is fortunate in that it can create new databases with a single PUT, so Jan Lehnardt (@janl on Twitter) came up with the idea of dicing the database names every time. That way, every test starts off with a reliably blank new DB with a unique name (and another source of Heisenbugs is gone).
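
A minimal sketch of that idea, with URL and credentials as placeholders for a local dev node (Node 18+ for the global fetch):

```js
// Run a test body against a freshly diced, uniquely named database.
const AUTH = 'Basic ' + Buffer.from('admin:secret').toString('base64');

async function withFreshDb(run) {
  // Unique, valid CouchDB database name per test run -> no leftovers from earlier (possibly crashed) runs.
  const dbName = `test_db_${Date.now()}_${Math.floor(Math.random() * 1e6)}`;
  const dbUrl = `http://127.0.0.1:5984/${dbName}`;

  await fetch(dbUrl, { method: 'PUT', headers: { Authorization: AUTH } }); // create the blank DB
  try {
    await run(dbUrl); // the actual test works against a guaranteed-empty DB
  } finally {
    // be a good citizen -- a crashed test only leaves an unused, uniquely named DB behind
    await fetch(dbUrl, { method: 'DELETE', headers: { Authorization: AUTH } });
  }
}
```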

As good citizens, tests still clean up their DBs after running. But a crashing test or a forgotten delete doesn’t hurt.

Lesson learned #9: Use throttling (and use re-try as very last resort only)

Your laptop has computing power in abundance; the CI infrastructure might not (or at least not predictably). So an idea I found useful is throttling the CPU (e.g., reducing the number of cores – a VM helps here) and re-running the tests.

In fact, running them several times over. Lots of additional issues will pop up for sure. Some are bound to a not-yet-sufficiently-synced state.

Try to apply one of the points above first as a remedy, but sometimes there’s just no way around re-trying the call and assertion, as the 2nd / 3rd attempt gives the system enough time to complete background work.

Over-using re-tries still produces arbitrarily red tests, but not using them at all produces them even more often. Still – when opting for three re-tries, the fourth might have done the trick. It’s not good and possibly never will be, but it’s a last resort when there’s no other way of reliably asserting.
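
If it comes to that, a bounded helper at least keeps the retrying explicit and contained – a sketch (the name eventually and the defaults are my own, not part of the existing suite):

```js
// Last-resort helper: re-run an assertion a bounded number of times with a short pause,
// only for the few spots where no quorum or ordering trick makes the read deterministic.
async function eventually(assertion, attempts = 3, delayMs = 500) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await assertion();
    } catch (err) {
      if (attempt === attempts) throw err; // still failing on the last attempt -> a real red
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```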

Lesson learned #10: Always (always!) keep a backed-up VM

I should have known better and was sloppy. It’s SO hard to re-set up everything (all Erlang RPMs, etc.) after one stupid mistake. The second time round, you, too, will make really sure to have the VM disk file backed up so you can go back without the hassle.

Summary

Clusters are very different from single instances. And that cannot be hidden from anyone – clustering is not (fully) transparent. The better (and less painful) way is to understand a little bit about how clusters work and how you can work with them. Even with less deterministic outcomes, you can assert!

Sometimes the process is tedious and nerve-wracking, but eventually you’ll arrive at a better, more meaningful, and ultimately more valuable test suite. None of the above is very elaborate or CouchDB-specific. If you like, take it as a recipe for testing your own clustered applications. Hopefully you find it useful – feedback is appreciated if you do. [Contact info here]

About the author

Sebastian Rothbucher (@sebredhh on Twitter) is a Web developer and CouchDB contributor. Coming from VB/Java, he now enjoys the JavaScript side of life (and spends some rainy Hamburg Saturdays hacking productivity tools and Web apps using CouchDB).

References

CouchDB Weekly News, May 26, 2016

Major Discussions

The state of filtered replication (see thread)

Stefan du Fresne is looking for a way to increase replication performance, or a way to safeguard against many replicators unacceptably degrading CouchDB performance.

Post on Trigger (see thread)

This email discussion comes from new contributor Reddy, proposing the use of triggers to post arbitrary JSON to arbitrary endpoints.

Releases in the CouchDB Universe

  • changemachine 1.0.0 (Node) – Handle CouchDB change notifications using Neuron
  • changemate 1.0.0 (Node) – Change Notification for CouchDB
  • couchdb-wedge 0.0.1 (Node) – CLI and node modules for doing bulk operations like deleting all databases on a server or replicating all databases from one server to another.
  • iobroker.sql 1.0.0 (Node) – Log state sql in a two-stages process (first to Redis, then to CouchDB)
  • nano-records 1.1.0 (Node) – Utility returns objects for interacting with CouchDB.
  • poms 2.0.0 (Ruby Gem) – Interface to POMS CouchDB API
  • replimate 1.0.0 (Node) – CouchDB (using _replicator db) replication helpers
  • simple-couchdb-view-processor 1.0.1 (Node) – Process documents in a CouchDB view with the goal to remove them from the view


Get involved!

If you want to get into working on CouchDB:

  • We have an infinite number of open contributor positions on CouchDB. Submit a pull request and join the project!
  • Do you want to help us with the work on the new CouchDB website? Get in touch on our new website mailing list and join the website team! – www@couchdb.apache.org
  • The CouchDB advocate marketing programme is just getting started. Join us in CouchDB’s Advocate Hub!
  • CouchDB has a new wiki. Help us move content from the old to the new one!
  • Can you help with Web Design, Development or UX for our Admin Console? No Erlang skills required! – Get in touch with us.
  • Do you want to help moving the CouchDB docs translation forward? We’d love to have you in our L10n team! See our current status and languages we’d like to provide CouchDB docs in on this page. If you’d like to help, don’t hesitate to contact the L10n mailing list on l10n@couchdb.apache.org or ping Andy Wenk (awenkhh on IRC).

We’d be happy to welcome you on board!


Time to relax!

  • “While a competition may seem like a strange way to de-stress, participants seemed enthused. ‘This event is highly recommended for those who have migraines or complicated thoughts,’ said the winner, Shin Hyo-Seob, aka Crush, a popular rapper who entered the contest to unwind after recording an album.” – This Relaxation Competition Finds Out Who Can Chill the Best
  • “The visible impacts of depression and stress that can be seen in a person’s face — and contribute to shorter lives — can also be found in alterations in genetic activity, according to newly published research.” – Genes linked to effects of mood and stress on longevity identified
  • “In an American accent, she describes her routine, switching out a few details from the original. Instead of an ice pack over her eyes, she uses two cold spoons, a big jar of Vegemite is featured in the fridge (seemingly a shout-out to all Australians), and the Les Misérables poster is replaced, naturally, by Hamilton. ‘There is an idea of Margot Robbie, some kind of abstraction. But there is no real me. Only an entity, something illusory,’ she says as she peels the iconic clear, rubbery face mask off. ‘Maybe you can even sense our lifestyles are probably comparable. I simply am not there.’” – Watch Margot Robbie in a Strange, Oddly Soothing American Psycho Spoof
  • “‘Books are places we can lose ourselves and at the same time find the answers to whatever we might be searching for, even if we don’t quite know what that is,’ said the festival’s artistic director Jemma Birrell.” – Ask a bibliotherapist: how books can help soothe troubled minds
  • “Whether you enjoy making your own miniature fake foods or not, we think it’s safe to say watching someone craft them is spellbinding. The tiny details, the silent patience, and the absurd proportions all combine to create a relaxing stew of motion — almost like meditation but with more craft resin!” – This video of someone making ramen miniatures is the most relaxing thing on YouTube
  • “The researchers showed that programmers’ stress levels were a better predictor of the quality (and thus the maintainability) of their code than the more costly process of code-review, where another programmer checks a colleague’s work before it is put into production, in order to ensure that it will be clear to future maintainers of the system.” – Programmers’ stress levels can accurately predict the quality of their code
