Their scaling strategy relies heavily on a large caching layer that focuses on caching entire HTML pages. The idea here is that the vast majority of accesses to Wikipedia are anonymous reads, so the same pages can be served up to those people.
This does work pretty well -- apparently, 78% of accesses are served from the Squid caches and another 7% on top of that are served from Memcached -- but it appears they basically have to toss the cached page if anything on it differs, including for logged-in users.
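The caching scheme described above can be sketched in a few lines. This is purely my illustration of the idea, not Wikipedia's actual code: the dictionary stands in for Squid/Memcached, and any logged-in request bypasses the cache entirely because the rendered page differs per user.

```python
page_cache = {}  # url -> rendered HTML (stands in for Squid/Memcached)

def render_page(url):
    # Stand-in for the expensive render path (PHP plus database hits).
    return f"<html><body>Article at {url}</body></html>"

def handle_request(url, logged_in):
    if logged_in:
        # Personalized chrome makes the shared copy unusable,
        # so the whole page is re-rendered from scratch.
        return render_page(url)
    if url not in page_cache:
        page_cache[url] = render_page(url)
    return page_cache[url]
```

The coarse granularity is the point: a single differing element on the page forces a full re-render, which is why logged-in traffic gets no benefit from this layer.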
During the talk, I was a little surprised they were not more focused on caching at the data layer, that is, on making the underlying databases serve data rapidly instead of trying to avoid the databases altogether.
If I understand it right, the Wikipedia architecture has all the Wikipedia data thrown into one giant database. Everything is in there, all the tables, even the large indexes for full text search. Then, they tossed on a few slave databases that appear to take replicated copies of the entire database.
All this data on one machine appears to mean they are hitting disk a lot on the database servers.
That doesn't seem necessary. The entire Wikipedia database appears to be a few hundred gigabytes. It should be possible to get most of that in-memory in a horizontally partitioned database cluster of a couple dozen machines.
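The back-of-the-envelope arithmetic behind that claim is simple. The specific numbers here are my illustrative assumptions (the post only says "a few hundred gigabytes" and "a couple dozen machines"):

```python
# Illustrative sizing, not measured figures.
db_size_gb = 300          # "a few hundred gigabytes"
machines = 24             # "a couple dozen machines"

per_machine_gb = db_size_gb / machines
print(per_machine_gb)     # prints 12.5
```

Roughly 12 to 13 GB per shard is a plausible amount to hold in RAM on commodity database servers, which is what would let the cluster serve nearly all reads without touching disk.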
In an ideal world, we'd be talking about something like the Google Cluster, where shards of the data are distributed across many machines and accessed in parallel, or a big virtual database like I craved in an earlier post.
But, let's stick with low hanging fruit here. So, to start, I would pull text search out of MySQL. Yes, I know, it's so lazily convenient to use MySQL full text search, but the performance doesn't seem to be where it needs to be. Moreover, it almost certainly is desirable for something like search to have its own dedicated hardware, not to be competing for resources with an RDBMS on the same box.
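At its core, a dedicated search service is just an inverted index served from its own boxes. Here is a minimal sketch of the idea, my illustration only, nothing like MediaWiki's actual search code:

```python
from collections import defaultdict

# term -> set of page titles containing that term
index = defaultdict(set)

def index_page(title, text):
    # Naive tokenization; a real indexer would stem, strip markup, etc.
    for term in text.lower().split():
        index[term].add(title)

def search(query):
    # Intersect the posting sets for each query term (AND semantics).
    terms = query.lower().split()
    if not terms:
        return set()
    results = index[terms[0]].copy()
    for term in terms[1:]:
        results &= index[term]
    return results
```

The win is less the data structure than the isolation: index updates and query bursts stop competing with transactional database traffic for the same disk and memory.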
Then, I'd partition the data. Get it so each box only has a shard of the data, one small enough that the disk stays quiet. I really am tempted to start rambling about something other than a simple partition of the existing MySQL tables and using MySQL replication here, but we're talking about low hanging fruit, so I'll stick with MySQL and what is simple to get done.
If almost all the Wikipedia database accesses never touched a disk, I suspect performance of the database layer might be good enough that parts of that HTML caching layer become unnecessary or even counterproductive. If so, the freed up boxes could be shifted from the HTML caching layer to the database layer, improving performance and scalability even further.
Near the end of the talk, Brion seemed to suggest that Wikipedia was going to focus their scaling efforts on boosting cache hit rates, doing more work on that last layer right before the client. It appeared to me that it might involve some fairly complicated work to figure out exactly when parts of the coarse-grained HTML cache are valid.
I wonder if the focus might be better spent on the data layer, getting the performance there to the point that caching further out becomes unnecessary.
I have to say, it is great fun looking at this kind of large scale problem. I wish I could dig in deeper and pore over profile data for the site.
The efforts of the tiny Wikipedia development team are really impressive. Not only have they managed to scale to crushing loads, but they have also done it on an unbelievably meager budget. They have much deserved loyalty from a community that wants to see them continue to thrive and grow.
Update: Interesting March 2006 messages on the Wikitech mailing list from Brion Vibber show that Wikipedia is entertaining the idea of switching to MySQL Cluster. An excerpt:
We don't support MySQL Cluster at this time, as it's currently limited in many ways. (Everything must fit in memory, can't have certain field types, etc.)

MySQL Cluster looks tempting but, as Brian said, it wouldn't work right now since everything wouldn't fit in memory within the maximum cluster size and other restrictions. Probably better to look at manually partitioning the data to get it in memory.
Currently waiting for the upcoming versions which allow disk-backed data, etc.
No point in rushing in when our own contributors who are MySQL employees are telling us to wait for it to mature a bit.
See also my post, "MySQL Cluster and my big virtual database".