There are days when you want to hug your tests—especially when your project takes long to complete or is being refactored. Perhaps it’s because they let you keep moving forward without fear. This blog post picks one particular topic in the test space: integration tests for clustered applications.
There are particular challenges here because of CAP, so this post will cover how to handle them. My recent interest stems from adjusting the integration tests of CouchDB to cover the new cluster feature that will be part of version 2.0, offering an API equal (or close) to what the non-clustered version 1.6 provided. The post certainly doesn’t feature anything outstanding (and it should be mentioned that the tests existed before, written by some very bright people). Nonetheless, I’ve learned some lessons along the way that you may find useful as well.
While unit tests do a stellar job (and should be the first thing to write anyway), there are some aspects they won’t cover; hence you need integration / acceptance tests that look at larger parts, or the application as a whole, and make sure all the parts fit. These tests are far from easy, however, and they’re even harder to create and maintain than unit tests.
They’re hard to get stable and really meaningful in the sense that when they’re red, they’re red for a reason and point out an important spot to check. When they’re never red (despite your application being hopelessly broken), they’re meaningless. When they’re red without real cause (maybe re-running makes them green again), you can bet that no one looks at them after a week.
As CouchDB tries to maintain API compatibility (there is an existing test suite for version 1.6), the logical step for 2.0 is running the same suite against 2.0 (i.e., a cluster) to see what happens. Clustering means there are several nodes working together to serve requests with data being split into shards. For every answer, several nodes / shards have to be queried to produce a combined result.
Long story short—running the tests did not produce much of a result except loads of red—as clusters add some powerful (and quite nasty) specifics to the game.
A good way to explain these specifics is the well-known CAP theorem: basically, it states that a distributed application cannot at the same time be
- C = consistent (i.e., whichever endpoint you ask, you get the latest and only-valid response)
- A = available (i.e., you can get an answer immediately, even with ongoing work on other endpoints)
- P = partition tolerant (i.e., the several parts / endpoints operating in parallel keep working even when they cannot talk to each other)
Instead, there is always a tradeoff to be balanced individually. CouchDB (like many others) made its tradeoff on the C-part: specifically, there is no guarantee that all endpoints will (immediately) produce the same and only-true result (e.g., because the cluster has not fully synced yet). Furthermore, combining results has the potential for further surprises (because naturally-ordered results from several machines are merged). Nothing here is specific to CouchDB; it’s a general tradeoff in the CAP space.
What this consistency issue means is that many of the assertions which held true in the non-clustered version are no longer always valid. The nastiest part: a test might be green one run, red the next, green again, and so on. You could really say that the C stands for Crazy (see: Heisenbug), i.e., absolute truth is replaced with probability.
Shame you can’t test probability. And you shouldn’t try either—rather, dig a little deeper into how the cluster works and rewrite the tests so that their assertions are always true and meaningful, i.e., running the test gives a valuable statement about whether or not the current codebase is sound (as opposed to expect(true).toBeTruthy() 😉).
It’s up to you to judge whether that attempt succeeded, but here are the lessons learned:
Lesson learned #1: Think long and hard about ordering
Apart from naive balancing (like modulo-ing some ID to decide the one and only system to always handle some data without ever combining several systems’ results), the result in a cluster is computed dynamically from several shards that might reside in several processes or even several nodes.
Combining several (partial) lists coming from several shards into one list breaks some assumptions: first, for time-based lists, there is no absolute guarantee of time; second, for sequence-based lists, there are several systems, each with its own sequence (CouchDB solved this by appending a hash to the numbered sequence); and lastly, for several identical keys, the expected natural order no longer necessarily holds (because results are copied together from shards and the order can get scrambled).
Looking at the _changes feed (a nice CouchDB feature for getting immediate feedback on changes in a database): say you insert four documents (1, 2, 3, 4) in order into the cluster and then query the feed. _changes will contain all four—but in ANY order (not necessarily 1, 2, 3, 4; it can also be 3, 2, 1, 4 or 4, 2, 1, 3 or even 4, 3, 2, 1, you simply don’t know).
Implication for re-writing: move away from array.id==1 and so on, toward array.length==4 && array.contains(1) and so on. Likewise, when there are several identical keys, get away from accidentally testing some implicit order; check array lengths instead (plus contains-checks when the set is complete).
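The rewrite can be sketched like this—a minimal Python stand-in for the JavaScript-style assertions above, with made-up rows and IDs for illustration:

```python
# Two simulated _changes responses for the same four inserts: a cluster
# may legitimately return the rows in any order.
run_a = [{"id": "3"}, {"id": "2"}, {"id": "1"}, {"id": "4"}]
run_b = [{"id": "4"}, {"id": "3"}, {"id": "2"}, {"id": "1"}]

def assert_same_members(rows, expected_ids):
    # Order-insensitive check: length plus membership, never positions.
    assert len(rows) == len(expected_ids)
    assert {row["id"] for row in rows} == set(expected_ids)

# Passes for both orderings, where an index-based check would flicker.
for rows in (run_a, run_b):
    assert_same_members(rows, ["1", "2", "3", "4"])
```

An assertion written this way stays green no matter how the shards interleave their partial results.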
Lesson learned #2: Make quora work for you
When redundancy is involved, clustering solutions may return "ok" before all copies are written. CouchDB does this as well. When querying immediately after getting the "ok" (e.g., for an assertion), you may run into an unexpected result. Much of this phenomenon is mitigated by a read quorum, i.e., the system tries to read from several sources. When both read and write quorums are a qualified majority, you’re normally safe.
Still, situations can occur where not all quora are adhered to (for example, special queries that don’t care about quora). In that case, setting the write quorum equal to the number of nodes helps, as everything is written. It’s a little counter-intuitive (and a little against the idea of a cluster), but for testing it does help.
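The arithmetic behind this can be sketched as follows (a hypothetical three-replica setup; the helper and variable names are mine, not CouchDB API):

```python
def majority(n):
    # Qualified majority: more than half of the n replicas.
    return n // 2 + 1

n_replicas = 3                     # copies kept per shard (illustrative)
default_w = majority(n_replicas)   # write quorum: 2 of 3
default_r = majority(n_replicas)   # read quorum: 2 of 3

# A majority write and a majority read always overlap in at least one
# replica -- that overlap is why majority quora are "normally safe".
assert default_w + default_r > n_replicas

# For tests that hit quorum-ignoring code paths: demand that every
# replica is written before "ok" comes back.
test_w = n_replicas
```

With test_w set to the replica count, even a quorum-ignoring read afterwards can only hit replicas that already have the write.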
It’s probably worthwhile to remember that a lack of consistency is worse for testing than for real life. Imagine liking something on Facebook and then checking the activity log. Maybe (hypothetically) the like is not there right away, but a couple of seconds later it is—who cares? For tests, you can hardly wait for a minute after each action before asserting, so helping one’s luck a little bit seems quite legitimate.
Lesson learned #3: Cut down on the number of tests
Okay, this one is not very cluster-specific. But since getting a stable result becomes even harder in a cluster, it’s very valid here.
It is per se much better to have a few good tests than many bad ones. Not only because of false alarms that render the whole suite useless, but also because, at the end of the day, you want a reliable and meaningful statement about the application.
Integration tests will never cover everything (and trying to lands you in a really nasty place), but there are a few main paths (often those you’d try manually first) that tell a LOT about what is up (or down). So you can comment out / ignore tests you don’t have time to fix right away (provided they’re not one of the few critical paths, at least the way you see things); looking back, this seems like a good and valid idea.
Lesson learned #4: The API is not the same
It’s not just that clusters are different in terms of how they aggregate (and behave in general). Some corner cases simply have not been cluster-ized yet (or in fact can never be).
So the API will be different. Sometimes the test becomes obsolete, sometimes it’s important for meaningful tests to think up an alternative.
Lesson learned #5: Debug deeply
Some error messages are not what they seem to be. Like an "invalid character in HTTP". When re-checking the test’s call with curl --raw, however, it turned out to be a grossly wrong answer: an HTTP 500 injected after the initial header, caused by timeouts that were too short.
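What curl --raw made visible can be imitated in a tiny sketch (the raw response below is fabricated for illustration and is not CouchDB’s actual wire output):

```python
# Hypothetical raw response: the status line says 200 OK, but an error
# object is injected mid-stream after the first rows -- invisible to a
# client that only checks the status code of the initial header.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Transfer-Encoding: chunked\r\n"
    "\r\n"
    '{"results":[{"id":"1"},\r\n'
    '{"error":"timeout","reason":"500 after header"}\r\n'
)

def stream_looks_broken(raw_response):
    # Crude debugging check: an "error" key in the body after the
    # headers means the server gave up mid-answer.
    body = raw_response.split("\r\n\r\n", 1)[1]
    return '"error"' in body

assert stream_looks_broken(raw)
```

A check like this is throwaway debugging code, but it turns a cryptic client-side parse error into a concrete statement about what the server actually sent.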
Long story short: not only will tests be more stable and meaningful, you’ll also discover important shortcomings along the way. So don’t settle until you really have a good answer. It’s tedious and you might not follow through every time, but it’s a helpful reminder that stability is vital and these root cause analyses pay off.
Sometimes, race conditions just mean that a tiny moment later the response is correct, and brute-force retrying three times would do the trick. It shouldn’t.
Lesson learned #6: Start with one node and work up (to three)
One node still excludes lots of the issues faced in clustering—so starting with one node feels like cheating. And it is.
But: things that don’t work on one node will never run on three.
They’re even harder to fix with three. After fixing things on one node, there’s still lots of work left to make them work on three, but at least you’re working from a halfway stable basis.
Lesson learned #7: Continue the test flow
Maybe you’re lucky and your test is split up in tiny little routines, and the next one (successfully) runs even when its predecessor fails.
If you’re not that lucky (or the scenario requires steps that build upon one another) and you discover a bug in the application that prevents huge parts of the remaining test from running: put in a // TODO: after fix of Tick-123 and make sure the rest of the test runs.
You’ll certainly find many other issues along the way—and get the good feeling of knowing how far you actually are (and how far the software is, too).
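In a Python suite, the same idea could look like this sketch (ClusterSuite and the test names are hypothetical; Tick-123 is the ticket mentioned above):

```python
import unittest

class ClusterSuite(unittest.TestCase):
    def test_create_documents(self):
        # Placeholder for a step that works today.
        self.assertTrue(True)

    # A known application bug blocks this step; skipping it explicitly
    # keeps the rest of the suite running and keeps the reason visible
    # in every test report.
    @unittest.skip("TODO: after fix of Tick-123")
    def test_query_view(self):
        self.fail("blocked by Tick-123")
```

The skip marker is greppable, so once Tick-123 is fixed, re-enabling the step is a one-line change.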
Lesson learned #8: Start w/ new and empty DBs
It’s hard to undo all the actions performed in a test DB—at least fully and reliably. It’s far better to always start off with a new and empty one.
One thing to try is deleting and re-creating the test DB immediately before executing a test. It gets tricky, though, when cleaning up DB resources takes some time.
CouchDB is fortunate in that a new database can be created with a single PUT, so Jan Lehnardt (@janl on Twitter) came up with the idea of dicing the database names every time. That way, every test starts off with a reliable, blank new DB with a unique name (and another source of Heisenbugs is gone).
As good citizens, tests still clean up DBs after running. But a crashing test or a forgotten delete doesn’t hurt.
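A sketch of such diced names (the prefix and helper name are made up; CouchDB itself only sees the resulting unique database name):

```python
import uuid

def fresh_db_name(prefix="test_suite_db"):
    # Random suffix per call: every test gets a blank, uniquely named
    # database, so leftovers from a crashed run can't bleed into the
    # next one. hex keeps the name lowercase, as DB naming rules demand.
    return "%s-%s" % (prefix, uuid.uuid4().hex)

# e.g. one PUT per test against the freshly diced name
name_a, name_b = fresh_db_name(), fresh_db_name()
assert name_a != name_b
```

The trade-off is that a crashed run leaves an orphaned DB behind, but an orphan with a unique name is harmless, whereas shared state between tests is not.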
Lesson learned #9: Use throttling (and use re-try as very last resort only)
Your laptop has computing power in abundance—the CI infrastructure might not (or at least not predictably). So an idea I found useful is throttling the CPU (e.g., reducing the number of cores; a VM helps here) and re-running the tests.
In fact, run them several times over. Lots of additional issues will pop up for sure; some are bound to a not-yet-sufficiently-synced state.
Try to remedy them by applying one of the points above first, but sometimes there’s just no way around re-trying the call and assertion, as the 2nd / 3rd attempt gives the system enough time to complete background work.
Over-using retries still produces arbitrarily red tests, but not using them at all produces red ones even more often. Still: when opting for three re-tries, the fourth might have done the trick. It’s no good and possibly never will be, but it’s a last resort when there’s no other way of reliably asserting.
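As a very last resort, such a bounded re-try can be sketched like this (names, attempt counts, and timings are illustrative):

```python
import time

def eventually(assertion, attempts=3, delay=0.5):
    # Re-run an assertion a bounded number of times, sleeping in
    # between to give the cluster time to finish background sync.
    for attempt in range(attempts):
        try:
            assertion()
            return
        except AssertionError:
            if attempt == attempts - 1:
                raise          # still red after the last attempt
            time.sleep(delay)

# Example: a check that only passes once the "cluster" has caught up.
state = {"calls": 0}
def flaky_assertion():
    state["calls"] += 1
    assert state["calls"] >= 2  # fails on the first try only

eventually(flaky_assertion, attempts=3, delay=0)
```

Crucially, the attempt count stays small and the final failure is re-raised, so a genuinely broken system still turns the test red instead of looping forever.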
Lesson learned #10: Always (always!) keep a backed-up VM
I should have known better but was sloppy. It’s SO hard to re-set up everything (all the Erlang RPMs, etc.) after one stupid mistake. The second time around, you, too, will make really sure to have the VM disk file backed up so that you can go back without the hassle.
Clusters are very different from single instances. And that cannot be hidden from anyone—clustering is not (fully) transparent. A better (and less painful) way is to understand a little bit about how they work and how you can work with them. Even with less deterministic outcomes, you can assert!
Sometimes the process is tedious and nerve-wracking, but eventually you’ll arrive at a better, more meaningful, and ultimately more valuable test suite. None of the above is very elaborate or CouchDB-specific. If you like, take it as a recipe for testing your own clustered applications. I hope you find it useful, and feedback is appreciated if you do. [Contact info here]
About the author
- CAP theorem: for extensive background see e.g. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
- CouchDB: http://couchdb.apache.org
- https://twitter.com/sebredhh/status/666717657278914562
- Heisenbug: https://en.wikipedia.org/wiki/Heisenbug (referring to the race-condition / multi-thread explanation)