Now that commodity Linux servers are so commonplace, it is easy to forget that, during the mid-to-late 1990s, there was a lively debate over whether sites should scale up using massive, mainframe-like servers (Sun's Starfire being a good example) or scale out using cheap hardware (e.g. Intel desktop PCs running Linux).
When I joined Amazon in early 1997, there was one massive database and one massive webserver. Big iron.
As Amazon grew, this situation became less and less desirable. Not only was it costly to scale up big iron, but we didn't like having a beefy box as a single point of failure. Amazon needed to move to a web cluster.
Just a few months after I started at Amazon, I was part of a team of two responsible for the website split. I was working with an extremely talented developer named Bob.
Bob was a madman. There was nothing that seemed to stop him. Problem on the website no one else could debug? He'd attach to a live obidos process and figure it out in seconds. Database inexplicably hanging? He'd start running process traces on the live Oracle database and uncover bugs in Oracle's code. Crazy. He seemed to know no limits, nothing that he wouldn't attack and figure out. Working with him was inspirational.
Bob and I were going to take Amazon from one webserver to multiple webservers. This was more difficult than it sounds. There were dependencies in the tools and in obidos that assumed one webserver. Some systems even directly accessed the data stored on the webserver. In fact, there were so many dependencies in the code base that, just to get this done in any reasonable amount of time, it was necessary to maintain backward compatibility as much as possible.
We designed a rough architecture for the system. There would be two staging servers, development and master, and then a fleet of online webservers. The staging servers were largely designed for backward compatibility. Developers would share data with development when creating new website features. Customer service, QA, and tools would share data with master. This had the added advantage of making master a last wall of defense where new code and data would be tested before it hit online.
Read-only data would be pushed out through this pipeline. Logs would be pulled off the online servers. For backward compatibility with log processing tools, logs would be merged so they looked like they came from one webserver and then put on a fileserver.
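The merge itself is simple in principle. Here is a minimal sketch in Python of the idea, interleaving per-server logs by timestamp so downstream tools see what looks like a single webserver's log. The log format, field layout, and paths are made up for illustration; the real tools and formats were different.

```python
import glob
import gzip
import heapq

def read_entries(path):
    """Yield (timestamp, line) pairs from one webserver's access log.
    Assumes a hypothetical format where the first whitespace-separated
    field is a sortable timestamp."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            if not line.strip():
                continue
            ts = line.split(None, 1)[0]
            yield ts, line

def merge_logs(pattern, out_path):
    """Merge per-server logs into one stream ordered by timestamp."""
    streams = [read_entries(p) for p in sorted(glob.glob(pattern))]
    with open(out_path, "w") as out:
        # heapq.merge does an n-way merge of already-sorted streams,
        # so each server's log only needs to be sorted internally.
        for _, line in heapq.merge(*streams):
            out.write(line)

if __name__ == "__main__":
    merge_logs("/logs/www*/access.log", "/logs/merged/access.log")
```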
Stepping back for a second, this is a point where we really would have liked a robust, clustered, replicated, distributed file system. That would have been perfect for read-only data used by the webservers.
NFS isn't even close to this. It isn't clustered or replicated. It freezes all clients when the NFS server goes down. Ugly. Options that are closer to what we wanted, like CODA, were (and still are) in the research stage.
Without a reliable distributed file system, we were down to manually giving each webserver a local copy of the read-only data. Again, existing tools failed us. We wanted a system that was extremely fast and would do versioning and rollback. Tools like rdist were not sufficient.
So we wrote it ourselves. Under enormous time pressure. The current big iron was melting under "get big fast" load, a situation that was about to get much worse as Christmas approached. We needed this done and done yesterday.
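In hindsight, the core trick for versioning and rollback is easy to sketch: push each batch of read-only data into a fresh versioned directory, then atomically flip a symlink, so rollback is just flipping it back. The sketch below is an illustration only, with made-up paths and layout; it is not the tool we wrote.

```python
import os
import subprocess
import time

# Hypothetical layout: each push lands in its own versioned directory,
# and a "current" symlink points at whichever version is live.
DATA_ROOT = "/data/readonly"
CURRENT = os.path.join(DATA_ROOT, "current")

def push(src):
    """Copy read-only data into a new versioned directory, then
    atomically repoint the "current" symlink at it."""
    version = time.strftime("v%Y%m%d-%H%M%S")
    dest = os.path.join(DATA_ROOT, version)
    subprocess.run(["rsync", "-a", src + "/", dest + "/"], check=True)
    _activate(dest)
    return version

def rollback(version):
    """Rollback is just pointing "current" back at an older version."""
    _activate(os.path.join(DATA_ROOT, version))

def _activate(dest):
    # Build the new symlink under a temporary name, then rename it over
    # "current"; the rename is atomic, so readers never see a
    # half-updated link.
    tmp = CURRENT + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(dest, tmp)
    os.replace(tmp, CURRENT)
```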
We got it done in time. We got Amazon running on a fleet of four webservers. It all worked. Tools continued to function. Developers even got personal websites, courtesy of Bob, that made testing and debugging much easier than before.
Well, it just sort of worked. I am embarrassed to admit that some parts of the system started to break down more quickly than I had hoped. My poor sysadmin knowledge bit me badly as the push and pull tools I wrote failed in ways a more seasoned geek would have caught. Worse, the system was not nearly robust enough against unexpected network and machine outages, partially because we were not able to integrate it with the load balancer (an early version of Cisco LocalDirector) we were using to take machines in and out of service automatically.
But it did work. Amazon never would have been able to handle 1997 holiday traffic without what we did. I am proud to have been a part of it.
Of course, all of this is obsolete. Amazon switched to Linux (as did others), substantially increasing the number of webservers, and eventually Amazon switched to a deep services-based architecture. Needs changed, and the current Amazon web cluster bears little resemblance to what I just described.
But, when building Findory's webserver cluster, I again found myself wanting a reliable, clustered, distributed file system, and again found the options lacking. I again wanted tools for fast replication of files with versioning and rollback, and again found those missing. As I looked to solve those problems, the feeling of deja vu was unshakeable.