/
Foursquare meltdown in October 2010

Foursquare meltdown in October 2010

The full story, with blog is available here. Here is the main story:

As many of you are aware, Foursquare had a significant outage this week. The outage was caused by capacity problems on one of the machines hosting the MongoDB database used for check-ins. This is an account of what happened, why it happened, how it can be prevented, and how 10gen is working to improve MongoDB in light of this outage.

It's important to note that throughout this week, 10gen and Foursquare engineers have been working together very closely to resolve the issue.

Some history

Foursquare has been hosting check-ins on a MongoDB database for some time now. The database was originally running on a single EC2 instance with 66GB of RAM. About 2 months ago, in response to increased capacity requirements, Foursquare migrated that single instance to a two-shard cluster. Now, each shard was running on its own 66GB instance, and both shards were also replicating to a slave for redundancy. This was an important migration because it allowed Foursquare to keep all of their check-in data in RAM, which is essential for maintaining acceptable performance.

The data had been split into 200 evenly distributed chunks based on user id. The first half went to one server, and the other half to the other. Each shard had about 33GB of data in RAM at this point, and the whole system ran smoothly for two months.

What we missed in the interim

Over these two months, check-ins were being written continually to each shard. Unfortunately, these check-ins did not grow evenly across chunks. It's easy to imagine how this might happen: assuming certain subsets of users are more active than others, it's conceivable that their updates might all go to the same shard. That's what occurred in this case, resulting in one shard growing to 66GB and the other only to 50GB. (1)

What went wrong

On Monday morning, the data on one shard (we'll call it shard0) finally grew to about 67GB, surpassing the 66GB of RAM on the hosting machine. Whenever data size grows beyond physical RAM, it becomes necessary to read and write to disk, which is orders of magnitude slower than reading and writing RAM. Thus, certain queries started to become very slow, and this caused a backlog that brought the site down.

We first attempted to fix the problem by adding a third shard. We brought the third shard up and started migrating chunks. Queries were now being distributed to all three shards, but shard0 continued to hit disk very heavily. When this failed to correct itself, we ultimately discovered that the problem was due to data fragmentation on shard0. In essence, although we had moved 5% of the data from shard0 to the new third shard, the data files, in their fragmented state, still needed the same amount of RAM. This can be explained by the fact that Foursquare check-in documents are small (around 300 bytes each), so many of them can fit on a 4KB page. Removing 5% of these just made each page a little more sparse, rather than removing pages altogether. (2)

After the first day's outage it had become clear that chunk migration, sans compaction, was not going to solve the immediate problem. By the time the second day's outage occurred, we had already move 5% of the data off of shard0, so we decided to start an offline process to compact the database using MongoDB's repairDatabase() feature. This process took about 4 hours (partly due to the data size, and partly because of the slowness of EBS volumes at the time). At the end of the 4 hours, the RAM requirements for shard0 had in fact been reduced by 5%, allowing us to bring the system back online.

Afterwards

Since repairing shard0 and adding a third shard, we've set up even more shards, and now the check-in data is evenly distributed and there is a good deal of extra capacity. Still, we had to address the fragmentation problem. We ran a repairDatabase() on the slaves, and promoted the slaves to masters, further reducing the RAM needed on each shard to about 20GB.

How is this issue triggered?

Several conditions need to be met to trigger the issue that brought down Foursquare:

  1. Systems are at or over capacity. How capacity is defined varies; in the case of Foursquare, all data needed to fit into RAM for acceptable performance. Other deployments may not have such strict RAM
    requirements.
  2. Document size is less than 4k. Such documents, when moved, may be too small to free up pages and, thus, memory.
  3. Shard key order and insertion order are different. This prevents data from being moved in contiguous chunks.

Most sharded deployments will not meet these criteria. Anyone whose documents are larger than 4KB will not suffer significant fragmentation because the pages that aren't being used won't be cached.

Prevention

The main thing to remember here is that once you're at max capacity, it's difficult to add more capacity without some downtime when objects are small. However, if caught in advance, adding more shards on a live system can be done with no downtime.

For example, if we had notifications in place to alert us 12 hours earlier that we needed more capacity, we could have added a third shard, migrated data, and then compacted the slaves.

Another salient point: when you're operating at or near capacity, realize that if things get slow at your hosting provider, you may find yourself all of a sudden effectively over capacity.

Final Thoughts

The 10gen tech team is working hard to correct the issues exposed by this outage. We will continue to work as hard as possible to ensure that everyone using MongoDB has the best possible experience. We are thankful for the support that we have received from Foursquare and our community during this unfortunate episode. As always, please let us know if you have any questions or concerns.

(1) Chunks get split when they are 200MB into 2 100MB halves. This means that even if the number of chunks on each shard was the same, data size is not always so. This is something we are going to be
addressing in MongoDB. We'll be making splitting balancing look for this imbalance so it can act upon it.

(2) The 10gen team is working on doing online incremental compaction of both data files and indexes. We know this has been a concern in non-sharded systems as well. More details about this will be coming
in the next few weeks.