James Hamilton wrote a LISA 2007 paper, "On Designing and Deploying Internet-Scale Services" (PDF), a remarkable brain dump of good advice on building large-scale services.
The paper does read like a laundry list, so let me point at what I consider the most important recommendations.
Expect failures and test for them, to the point of never shutting down the system normally -- "crash only software" -- and designing and testing your system so that the "operations team [is] willing and able to bring down any server in the service at any time without draining the workload first."
Automate everything and keep it simple. Failover should be automatic, nothing should require human intervention, and this should be constantly tested. Complexity and dependencies are your enemy when trying to manage all possible failure modes automatically, and James rightly insists on keeping it simple, to the point that he argues that anything that would add complexity without producing order of magnitude improvements should not be implemented.
Test live with rollback, versioning, and incremental rollouts. As long as problems can be quickly reverted, rolling out code live early and often is not expensive or dangerous. More dangerous are "big-bang changes" where many things change at once, making it hard to determine where problems lie and hard to revert to the older version of the code base.
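To make the incremental rollout idea concrete, here is a minimal sketch in Python. It is my own illustration, not something from the paper: the deploy, rollback, and healthy callbacks and the stage fractions are all hypothetical stand-ins for whatever your deployment tooling provides.

```python
def incremental_rollout(servers, deploy, rollback, healthy,
                        stages=(0.05, 0.25, 1.0)):
    """Deploy to a growing fraction of servers, reverting everything on failure.

    deploy, rollback, and healthy are hypothetical callbacks that each take
    a server name. stages must be increasing fractions of the fleet.
    Returns True if the rollout reached every server, False if it was reverted.
    """
    done = []
    for fraction in stages:
        # Always touch at least one server in the first stage.
        target = max(1, int(len(servers) * fraction))
        for server in servers[len(done):target]:
            deploy(server)
            done.append(server)
        # Check every server updated so far before widening the rollout.
        if not all(healthy(s) for s in done):
            for s in reversed(done):
                rollback(s)
            return False
    return True


# Toy usage: twenty servers, where the new version misbehaves on "s3".
versions = {f"s{i}": "v1" for i in range(20)}
touched = set()

def deploy(server):
    versions[server] = "v2"
    touched.add(server)

def rollback(server):
    versions[server] = "v1"

def healthy(server):
    return not (server == "s3" and versions[server] == "v2")

result = incremental_rollout(list(versions), deploy, rollback, healthy)
```

In this toy run the problem shows up in the second stage, after only five servers have been touched, and all five are reverted. That is the whole appeal: the blast radius of a bad change stays small and the way back is automatic.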
While I agree with almost everything James wrote, I found a few areas where I disagree based on my own experience at Amazon.
First, James wrote that "we like shipping on a 3-month cycle." That hardly seems like early and often to me. I think code should be shipped constantly, multiple times a week or even multiple times a day across different services. On the Web, there is no reason not to ship frequently. Launching continuously forces developers to eliminate any dependencies on other code in their launch, and it rightly creates pressure to build high-quality tools for monitoring and debugging the health of the system, versioning, rapid rollbacks, incremental rollouts, and live experiments.
Second, James says we should not "affinitize requests or clients to specific servers". Depending on what he means here, we may disagree, because I think this potentially could conflict with simplifying database design. In particular, it is much easier to do database replication (and James acknowledges that "database scaling remains one of the hardest problems in designing internet scale services") if we stick a user to a specific server for their writes. Then, we can just guarantee the much simpler eventual consistency with rather loose requirements on what eventual means. The counter-argument here is that sticking users to specific servers can create hot spots, but I think that largely can be avoided if we pick the server to stick to for writes in part based on load.
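Here is a rough sketch of what I mean by picking the sticky write server in part based on load. Everything in it is an assumption for illustration: the WriteRouter class, the server names, and the use of "number of users assigned" as the load measure. A real system would use a measured load signal and would need to persist the assignments.

```python
class WriteRouter:
    """Pins each user to one write server, chosen by load at first contact."""

    def __init__(self, servers):
        # Hypothetical load measure: how many users are pinned to each server.
        self.load = {name: 0 for name in servers}
        self.assignment = {}

    def server_for(self, user_id):
        """Return the write server for this user, assigning one if needed."""
        if user_id not in self.assignment:
            # Pick the least loaded server; ties go to the first name in
            # sorted order so the choice is deterministic.
            target = min(sorted(self.load), key=lambda s: self.load[s])
            self.assignment[user_id] = target
            self.load[target] += 1
        return self.assignment[user_id]


router = WriteRouter(["db1", "db2"])
first = router.server_for("alice")   # assigned to the emptiest server
second = router.server_for("bob")    # goes to the other server
again = router.server_for("alice")   # same server as before
```

Because the assignment is made once and then remembered, all of a user's writes land on one server, while new users drift toward whichever server has the most room.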
Finally, James advocates for a manual "big red switch" that allows us to throttle our system manually. I think this is in conflict with the idea of automating everything. What I would prefer to see is that the system automatically responds to more load by using more resources and to less load by using fewer, with monitoring, warnings, and the ability to override. As James says, "People make mistakes. People need sleep. People forget things." I think it is optimistic to believe that, in the middle of a crisis, people could quickly decide how to optimally throttle back a system so it can maintain the highest level of quality of service at unusual load. But an automated system that has even limited understanding of the cost of its parts could do that.
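As a sketch of how little "understanding of the cost of its parts" such a system might need, consider the following. The feature names, costs, and values are invented for illustration, and the greedy value-per-cost rule is just one simple heuristic I am assuming, not an optimal policy and not anything from the paper.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    cost: float   # hypothetical resource cost per request
    value: float  # hypothetical benefit to the user


def plan_degradation(features, budget):
    """Return the names of features to keep serving within a per-request budget.

    Greedy heuristic: consider features in order of value per unit cost and
    keep each one that still fits in the remaining budget.
    """
    kept = []
    remaining = budget
    for f in sorted(features, key=lambda f: f.value / f.cost, reverse=True):
        if f.cost <= remaining:
            kept.append(f.name)
            remaining -= f.cost
    return kept


FEATURES = [
    Feature("core_page", cost=1.0, value=10.0),
    Feature("recommendations", cost=3.0, value=6.0),
    Feature("reviews", cost=2.0, value=5.0),
]

normal = plan_degradation(FEATURES, budget=6.0)    # everything fits
strained = plan_degradation(FEATURES, budget=3.0)  # drops recommendations
crisis = plan_degradation(FEATURES, budget=1.0)    # core page only
```

A monitoring loop could recompute the budget from current load and call something like this every few seconds, with a human override still available. The point is that the decision happens immediately and consistently rather than waiting for someone to be paged.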
In all, a great paper, thought-provoking and well worth reading.