The paper describes some of the complexity of this task:
Complex dependencies among system components can cause failures to propagate to other components, triggering multiple alarms and complicating root-cause determination.This reminds me of a post by Werner Vogels a few months ago. Talking about Amazon's rearchitecture efforts, Werner had said, "A service-oriented architecture would give us the level of isolation that would allow us to build many software components rapidly and independently."
Maya is a new visualization tool ... [that] displays the dependencies among components as a directed graph. Each component -- web page, service, or database -- is represented as a black (healthy) or red (unhealthy) dot.
The operators thus clearly see which components that are reporting alarms may simply be suffering from a cascaded failure.
In the comments, I asked, "Why do web services necessarily eliminate dependencies? Can you not have the same complicated nest of dependencies in a cloud of web services, all of them talking to each other in an incomprehensible chatter?"
I do worry that Amazon may have taken a complex web of dependencies in libraries and moved that to a complex web of dependencies in web services. Since web service calls look like heavyweight library calls to developers, it is not clear to me that web services necessarily reduce dependencies, not without a special effort to make sure they do.
Moreover, as this paper describes, there is additional complexity in a distributed system of web services, in particular the difficulty of debugging and profiling across remote calls.
I noticed the authors of this paper include the widely respected Professor Michael Jordan at UC Berkeley in addition to several people at Amazon.com. Very interesting.
It is unusual to see Amazon publish academic papers on their work. Historically, they have not been as open as Google or Microsoft Research. I am excited to see more papers coming out of Amazon describing the challenging problems they are solving internally.
[Robin Harris post found via Dan Creswell]