Generating a Facebook Page

Bob English sent the following email on April 2, 2010:


I thought it would be useful to add some current page-generation data to yesterday's discussion.

One of our important pages currently takes a little over two seconds to generate on our web tier. 3/4 of that time is web-tier CPU. External data waits account for about 1/7 of this page's generation time.

Of those waits, memcached accesses account for a little more than half of the time and 90% of the accesses. The rest of the data-access time is spent waiting for mysql. Because memcached accesses retrieve several objects while mysql accesses retrieve only singletons, this somewhat understates the relative efficiency of memcached.

A memcached call averages less than a millisecond. A mysql call averages about nine.

If we held everything else constant, replaced all mysql calls with memcached calls, and reduced the average latency to 100us, we'd reduce the page generation time by about 12%. Reducing it further to 10us would reduce it by another 1%.
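The arithmetic behind those percentages can be sketched as a back-of-envelope model. All the inputs below are assumptions read off the figures in this email (2s page time, 1/7 data waits, "a little more than half" taken as 55%, call latencies of roughly 0.9ms and 9ms), so the outputs land near, not exactly on, the quoted 12% and 1%:

```python
# Back-of-envelope model of the page-generation numbers above.
# Every constant is an assumption derived from the email, not a measurement.
PAGE_MS = 2000.0                  # "a little over two seconds"
WAIT_MS = PAGE_MS / 7             # external data waits: ~1/7 of page time

MEMCACHED_WAIT_MS = WAIT_MS * 0.55  # "a little more than half" of wait time
MYSQL_WAIT_MS     = WAIT_MS * 0.45  # the rest
MEMCACHED_CALL_MS = 0.9             # "averages less than a millisecond"
MYSQL_CALL_MS     = 9.0             # "averages about nine"

# Implied number of data-access calls per page.
calls = (MEMCACHED_WAIT_MS / MEMCACHED_CALL_MS
         + MYSQL_WAIT_MS / MYSQL_CALL_MS)

def reduction(new_latency_ms):
    """Fraction of page time saved if every data call took new_latency_ms."""
    new_wait_ms = calls * new_latency_ms
    return (WAIT_MS - new_wait_ms) / PAGE_MS

print(f"at 100us: {reduction(0.100):.0%} of page time saved")
print(f"at  10us: {reduction(0.010):.0%} of page time saved")
```

With these assumed inputs the model saves roughly 13% of page time at 100us and only about 1% more at 10us, in line with the claim that the first order of magnitude matters far more than the second.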

On the CPU side, there are a couple of interesting data points. We cache the results of many memcached/mysql queries in a memory cache, so that data needed by more than one code path is only requested once. We do this in a couple of different ways. In one of those mechanisms, part of the marshalling code takes about 1/3 of our data-access wait time to execute. In the other, the average call to the caching routine (strictly CPU, usually a cache hit) takes over 40us to execute.
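The caching pattern described above can be sketched in a few lines. This is purely illustrative; the class and key names are hypothetical and the real mechanisms are Facebook's own, but it shows why the pattern trades per-call CPU (the lookup and marshalling measured at ~40us) for at-most-one backing-store fetch per request:

```python
# Illustrative request-scoped memo cache; names are hypothetical,
# not the actual code discussed in the email.
class RequestCache:
    def __init__(self):
        self._cache = {}

    def get(self, key, fetch):
        """Return the cached value for key, calling fetch() on a miss.

        Every code path that needs the same object goes through here, so
        the backing memcached/mysql query runs at most once per request.
        The cost is a lookup on every call, even cache hits -- the pure-CPU
        overhead the email measures at over 40us per call.
        """
        if key not in self._cache:
            self._cache[key] = fetch()
        return self._cache[key]

cache = RequestCache()
fetches = []
# Two code paths ask for the same object; the fetch runs only once.
user = cache.get("user:42", lambda: fetches.append("hit db") or {"id": 42})
user = cache.get("user:42", lambda: fetches.append("hit db") or {"id": 42})
assert fetches == ["hit db"]
```

The point of the email is that this layer only pays for itself while remote data is expensive; if a remote fetch cost 5us, a 40us local cache hit would be a loss.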

There are a couple different ways to look at this.

One is that a simple replacement of our existing data infrastructure with RAMCloud would have some value, but that value would be bounded, and we'd see much more benefit from the first order of magnitude of latency improvement than from the second. Heroic efforts for 1% improvements aren't usually justified, and the CPU costs of the application portion of the data-access stack would prevent us from accessing, for example, ten times as much data as we do today.

But that may not be the correct view. The marshalling and caching routines are there, in large part, to protect us from the high cost of network data access. If network accesses in the 5us range cost less than any of the mechanisms we use to optimize data access, we might be able to eliminate most of those routines.

Another data point here both illustrates the possibilities and highlights some issues that weren't discussed. All of our web-tier boxes run APC, which we use to distribute site variables. The average access time to APC (a purely local in-memory cache) is 30us. If we could get that data in 5us from RAMCloud, we'd have no reason to run APC. But the fact that it takes 30us to get that data from local memory suggests that, without some very clever marshalling strategies, it could still take 30us to load it into PHP even after the data is in memory.