Testing clustered applications – like CouchDB

There are days when you want to hug your tests—especially when a project takes longer than planned or is being refactored. Perhaps it’s because they let you keep moving forward without fear. This blog post picks one particular topic in the test space: integration tests for clustered applications.

There are particular challenges to them because of CAP [1], so this post will cover how to handle those challenges. My recent interest in this stems from adjusting the integration tests of CouchDB [2] to cover the new cluster feature that will be part of version 2.0, offering an API equal (or close) to what the non-clustered version 1.6 provides. The post is certainly far from featuring anything outstanding (and it should be mentioned that the tests existed before, written by some very bright people [5]). Nonetheless, I’ve learned some lessons along the way that you might find useful as well.

While unit tests do a stellar job (and should be the first to begin with anyway), there are some aspects they won’t cover; hence you need integration / acceptance tests that look at larger parts of the application (or the application as a whole) and make sure all parts fit. These tests are far from easy, however, and they’re even harder to create and maintain than unit tests.

They’re hard to get stable and really meaningful in the sense that when they’re red, they’re red for a reason and point out an important spot to check. When they’re never red (despite your application being hopelessly broken), they’re meaningless. When they’re red without real cause (maybe re-running makes them green again), you can bet that no one looks at them after a week.

As CouchDB tries to maintain API compatibility (there is an existing test suite for version 1.6), the logical step for 2.0 is running the same suite against 2.0 (i.e., a cluster) to see what happens. Clustering means there are several nodes working together to serve requests with data being split into shards. For every answer, several nodes / shards have to be queried to produce a combined result.

Long story short—running the tests did not produce much of a result except loads of red—as clusters add some powerful (and quite nasty) specifics to the game.

A good way to explain these specifics is the aforementioned CAP theorem [1]: basically, it states that an application cannot at the same time be

  • Consistent (i.e., whichever endpoint you ask, you get the latest and only-valid response)
  • Available (i.e., you can get an answer immediately, even with ongoing work on other endpoints)
  • Partition tolerant (i.e., there are several parts / endpoints operating in parallel)

Instead, there is always a tradeoff to be balanced out individually. CouchDB (like many others) decided to make the tradeoff at the C-part; specifically, there is no guarantee that all endpoints will (immediately) produce the same and only-true result (e.g., because the cluster has not fully synced just yet). Furthermore, combining the results has the potential for further surprises (because naturally-ordered results from several machines might be merged). Nothing here is specific to CouchDB; it’s a general tradeoff in the CAP space.

What this consistency issue means is that many of the assertions which held true in the non-clustered version are no longer always valid. The nastiest thing here is: a test might be green one run, red the next, green again, and so on. You could really say that the C stands for Crazyheisenbug [3], [4], i.e., absolute truth is replaced with probability.

Shame you can’t test probability. And you shouldn’t try either—rather, the way forward is to dig a little deeper into how the cluster works and come up with rewritten tests whose assertions will always be true and meaningful, i.e., running the test gives a valuable statement about whether or not the current codebase is sound (as opposed to expect(true).toBeTruthy() 😉 ).

It’s up to you to judge whether that attempt succeeded, but here are the lessons learned:

Lesson learned #1: Think long and hard about ordering

Apart from naive balancing (like modulo-ing some ID to decide the one and only system to always handle some data without ever combining several systems’ results), the result in a cluster is computed dynamically from several shards that might reside in several processes or even several nodes.

Combining several (partial) lists coming from several shards into one list breaks some assumptions: first, for time-based lists, there is no absolute guarantee of time; second, for sequence-based lists, there are several systems with their own sequences (CouchDB solved this by appending a hash to the numbered sequence); and lastly, for several identical keys, there is not necessarily the expected natural order any more (because results are copied together from shards and the order can get scrambled).

When looking at the _changes feed (a nice CouchDB feature for getting immediate feedback on changes in a database): say you insert four documents (1, 2, 3, 4) in order into the cluster and then query the _changes feed. _changes will contain all four—but in ANY order (not necessarily 1, 2, 3, 4—it can also be 3, 2, 1, 4 or 4, 2, 1, 3 or even 4, 3, 2, 1; you simply don’t know).

Implication for re-writing: move away from array[0].id==1 and the like towards array.length==4 && array.contains(1) and the like. Likewise, when there are several identical keys, get away from accidentally testing some implicit order and check for array lengths instead (plus contains-checks, when the set is complete).
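Here is a minimal sketch of that rewrite, using the Jasmine-style assertions the post already alludes to; the changes variable and the document IDs are made up for illustration:

```js
// Hypothetical _changes result fetched from the cluster – the four inserted
// documents are all present, but their order is not guaranteed.
const changeIds = changes.results.map(function (row) { return row.id; });

// Fragile on a cluster: assumes the insertion order survives the shard merge.
// expect(changeIds[0]).toBe('1');

// Robust: assert completeness instead of order.
expect(changeIds.length).toBe(4);
['1', '2', '3', '4'].forEach(function (id) {
  expect(changeIds).toContain(id);
});
```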

Lesson learned #2: Make quora work for you

When redundancy is involved, clustering solutions may return "ok" before all copies are written. CouchDB does this as well.

When querying immediately after getting the "ok" (e.g., for an assertion), you can run into an unexpected result. Much of this phenomenon is mitigated by a read quorum, i.e., the system tries to read from several sources. When both read and write quorums are a qualified majority, you’re normally safe.

Still, situations can occur where not all quora are adhered to (for example, special queries that don’t care about quora). In that case, setting the write quorum equal to the number of nodes will help—as everything is written everywhere. It’s a little counter-intuitive (and a little against the idea of a cluster), but for testing it does help.
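As a rough sketch (assuming a three-node cluster and a CouchDB build that still accepts the w query parameter on document writes; the URL and helper name are made up), forcing the full write quorum for test writes could look like this:

```js
const NODES = 3; // hypothetical cluster size
const dbUrl = 'http://127.0.0.1:5984/test_suite_db'; // example URL

// Only report "ok" once every node has accepted the write, so an
// assertion that reads immediately afterwards cannot miss the document.
async function saveForTest(doc) {
  const resp = await fetch(`${dbUrl}/${doc._id}?w=${NODES}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc)
  });
  return resp.json();
}
```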

It’s probably worthwhile to remember that a lack of consistency is worse for testing than for real life. Imagine liking something on Facebook and then checking the activity log. Maybe (hypothetically) the like is not there right away, but a couple of seconds later it is—who cares? For tests, you can hardly wait for a minute after each action before asserting, so helping one’s luck a little bit seems quite legitimate.

Lesson learned #3: Cut down on the number of tests

Okay, that’s not a very cluster-specific one. But since getting a stable result becomes even harder in a cluster, it’s especially valid here.

It’s per se much better to have a few good tests than many bad ones. Not only because of false alarms that render the whole suite useless, but also because, at the end of the day, you want a reliable and meaningful statement about the application.

Integration tests will never cover everything (and trying to will land you in a really nasty place), but there are a few main paths (often those you try manually first) that tell a LOT about what is up (or down). So you can comment out / ignore tests you don’t have time to fix right away (and that don’t cover one of the few critical paths, at least the way you see things). Looking back, this seems like a good and valid idea.

Lesson learned #4: The API is not the same

It’s not just that clusters are different in terms of how they aggregate (and behave in general). It’s also the case that some corner cases just have not been cluster-ized (or in fact cannot be cluster-ized ever).

So the API will be different. Sometimes the test becomes obsolete, sometimes it’s important for meaningful tests to think up an alternative.

Lesson learned #5: Debug deeply

Some error messages are not what they seem to be. Like an "invalid character in HTTP".

When re-checking the test’s call with curl --raw, however, it turned out to be a grossly wrong answer: an HTTP 500 spliced in after the initial header because of timeouts that were set too short.

Long story short: not only will tests be more stable and meaningful, you’ll also discover important shortcomings along the way. So don’t settle until you really have a good answer. It’s tedious and you might not follow through every time, but it’s a helpful reminder that stability is vital and these root cause analyses pay off.

Sometimes, race conditions just mean that a tiny time later the response is correct and trying three times with brute force would do the trick. It shouldn’t.

Lesson learned #6: Start with one node and work up (to three)

One node still excludes lots of the issues faced in clustering—so starting with one node feels like cheating. And it is.

But: things that don’t work on one node will never run on three.

They’re even harder to fix with three. After fixing one, there’s lots of work left to have it work on three, but at least you can work on a halfway stable basis.

Lesson learned #7: Continue the test flow

Maybe you’re lucky and your test is split up in tiny little routines, and the next one (successfully) runs even when its predecessor fails.

If you’re not that lucky (or the scenario requires steps to build upon one another) and you discover a bug in the application that prevents huge parts of the remaining test from running: put in // TODO: after fix of Tick-123 and make sure the rest of the test runs.
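A tiny, made-up sketch of what that looks like in practice; the db helper, the view name, and the ticket ID are placeholders:

```js
it('continues past the known Tick-123 bug', async () => {
  const doc = await db.open('sample-doc'); // placeholder helper and ID

  // TODO: re-enable after fix of Tick-123 (placeholder ticket)
  // expect(doc._attachments).toBeDefined();

  // The remaining steps still run and still give feedback:
  const result = await db.view('sample/by_date'); // placeholder view
  expect(result.rows.length).toBe(1);
});
```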

You’ll certainly find many other issues along the way—and get the good feeling of knowing how far you actually are (and how far the software is, too).

Lesson learned #8: Start w/ new and empty DBs

It’s hard to undo all actions performed in a test DB—at least fully and reliably. It’s far better to always start off with a new and empty one.

One thing to try is deleting the test DB immediately before executing a test and re-creating it. It gets tricky, though, when cleaning up DB resources takes some time.

CouchDB is fortunate in being able to create new databases with a single PUT, so Jan Lehnardt (@janl on Twitter) came up with the idea of dicing the database names every time. That way, every test starts off with a reliably blank new DB with a unique name (and another source of Heisenbugs is gone).
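A minimal sketch of that idea, assuming a Jasmine/Mocha-style setup and plain HTTP calls (the URL and prefix are examples):

```js
const couchUrl = 'http://127.0.0.1:5984'; // example cluster endpoint
let dbName;

function uniqueDbName(prefix) {
  // e.g. test_suite_db_k3j9x2 – unique per test run
  return prefix + '_' + Math.random().toString(36).slice(2, 8);
}

beforeEach(async () => {
  dbName = uniqueDbName('test_suite_db');
  await fetch(`${couchUrl}/${dbName}`, { method: 'PUT' }); // fresh, empty DB
});

afterEach(async () => {
  await fetch(`${couchUrl}/${dbName}`, { method: 'DELETE' }); // be a good citizen
});
```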

As good citizens, tests still clean up DBs after running. But a crashing test or a forgotten delete doesn’t hurt.

Lesson learned #9: Use throttling (and use re-try as very last resort only)

Your laptop has computing power in abundance—the CI infrastructure might not (or at least not predictably). So one idea I found useful is throttling the CPU (e.g., reducing the number of cores—a VM helps here) and re-running the tests.

In fact, running them several times over. Lots of additional issues will pop up for sure. Some are bound to a not-yet-sufficiently-synced state.

Try to apply one of the points above first to remedy them, but sometimes there’s just no way around re-trying the call and assertion, as a 2nd / 3rd attempt gives the system enough time to complete background work.

Over-using re-tries still produces arbitrarily red tests, but not using them at all produces them even more often. Still—when opting for three re-tries, the fourth might have done the trick. It’s not good and possibly never will be, but it’s a last resort when there’s no other way of reliably asserting.
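If you do end up needing it, keep the re-try logic in one small, clearly labelled helper. Here is a hedged sketch (names, URL, and timings are made up):

```js
// Last-resort helper: re-run an assertion a few times before staying red.
async function eventually(assertion, attempts = 3, delayMs = 500) {
  for (let i = 1; i <= attempts; i += 1) {
    try {
      await assertion();
      return; // assertion held – done
    } catch (err) {
      if (i === attempts) throw err; // give up and let the test fail
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Usage, inside an async test body: give background syncing a moment
// before failing for good.
await eventually(async () => {
  const doc = await fetch(`${dbUrl}/some-doc`).then(r => r.json()); // placeholders
  expect(doc.value).toBe(42);
});
```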

Lesson learned #10: Always (always!) keep a backed-up VM

I should have known better but was sloppy. It’s SO hard to set everything up again (all Erlang RPMs, etc.) after one stupid mistake. The second time around, you, too, will make really sure to have the VM disk file backed up so you can go back without the hassle.

Summary

Clusters are very different from single instances. And that cannot be hidden from anyone—clustering is not (fully) transparent. A better (and less painful) way is to understand a little bit about how they work and how you can work with them. Even with less deterministic outcomes, you can assert!

Sometimes the process is tedious and nerve-wracking, but eventually you’ll arrive at a better, more meaningful, and ultimately more valuable test suite. None of the above is very elaborate or CouchDB-specific. If you like, take it as a recipe for testing your own clustered applications. Hopefully you find it useful, and feedback is appreciated if you do. [Contact info here]

About the author

Sebastian Rothbucher (@sebredhh on Twitter) is a Web developer and CouchDB contributor. Coming from VB/Java, he now enjoys the JavaScript side of life (and spends some rainy Hamburg Saturdays hacking productivity tools and Web apps using CouchDB).

References

Databases aren’t boring

Let’s talk about database development, data management, and database administration. It sounds super boring, sometimes even scary, right? I promise: it isn’t. I mainly develop for Fauxton, the UI for CouchDB that we’ll release with 2.0. We are developing, designing, and conceiving a UI for administration and data management.

When I tell fellow developers and designers that I develop a database, many of them look scared. Sometimes they also look bored, because people rarely imagine data management is exciting. Some of them might think of boring database courses in college. In this article we’ll discover what makes database development so interesting and exciting.

One of our main objectives is to make data management as frictionless as possible for the user. How can we lower the entry barrier for new users, but still support our power users? How can we display data in an accurate, detailed way, but still have a high density of information? How do we measure our success without traditional systems to measure engagement, like tracking? It is important to remember–we can be successful only if our users are successful.

A recent example of where we want to make our users more successful concerns eventual consistency and MVCC. Large numbers of conflicts can be problematic: they slow down the database and take up a lot of space. Some of the Fauxton developers recently had a hackweek. As part of one project we asked ourselves: “How can we make conflict detection and resolution a first-class citizen in CouchDB and make it as frictionless as possible?”
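For context, here is roughly what conflict detection looks like at the HTTP level today, using CouchDB’s conflicts=true option (the URL and document name are examples):

```js
const url = 'http://127.0.0.1:5984/mydb/mydoc?conflicts=true'; // example
const doc = await fetch(url).then(resp => resp.json());

if (doc._conflicts && doc._conflicts.length > 0) {
  // doc._rev is the revision CouchDB elected as the winner;
  // doc._conflicts lists the losing revisions that still await a decision.
  console.log('winner:', doc._rev, 'conflicts:', doc._conflicts.join(', '));
}
```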

Our goals:

  • Conflict detection should be as easy as possible
  • Make conflict resolution as easy as possible and provide necessary tooling
  • Help to avoid situations where a large number of conflicts become problematic
  • Provide better education and tutorials for conflict resolution, directly in the dashboard

We focused mainly on conflict resolution as our time was limited to one week. A document with conflicts has different revisions, and Couch elects one as the “winning revision.” How do you choose the right revision and get rid of the other ones? Our project, codenamed “The Revision Browser,” was born. We wanted to provide a way to easily diff revisions and inspect the revision tree. We also wanted an easy way to delete conflicting revisions and select another revision as the winner. The first, ugly prototype had two dropdowns:

The first prototype

We are a distributed team, so we use video calls for evaluating the iterations. We demo the current, incomplete work. Whenever possible, we test ad-hoc changes directly in the browser during the session. One addition that came up during our demos was to provide another view mode next to the “diff mode”. It shows both conflicting documents next to each other:

Both conflicting documents next to each other

After the hackweek we had some work left to bring the project over the finish line. I am happy to announce that we have a minimum viable product now:

The diffing for both conflicting documents

The feature was created in close collaboration with other developers and UX researchers. Here is a video showing the new features in action:

Data management is also interesting from a technical point of view. How can we display a lot of documents but keep the application snappy? The revision browser is written in React. The code itself is pretty concise, as we recently added ES2015 / ES6 support to Fauxton. Thanks to our test coverage, we can refactor large parts of Fauxton; recently, we changed the whole infrastructure underneath without much trouble. Interested in the code? It is available at: https://github.com/apache/couchdb-fauxton/pull/670

Conclusion

Despite its image, data management and database administration IS interesting. We face hard problems from a product point of view. They are challenging and it is fun to solve them in a team including developers, UX researchers, and designers. We also face interesting technical problems and solve them with the best technology available.

About the author

Robert Kowalski is a passionate software engineer and CouchDB contributor. He enjoys traveling and recently released a book about command line tools in Node.js, The CLI Book.