Server Memory Architecture

Modern TLB Characteristics

Both Opteron (K10) and Nehalem have two-level (L1 and L2) TLBs. In both cases, the L1 TLB is split into separate instruction (ITLB) and data (DTLB) TLBs. Unlike the Opteron, Nehalem has a unified L2 TLB.

Opteron:
- 48-entry fully-associative L1 data TLB
  - supports 4K, 2M, and 1G pages (or 24 4M pages in legacy non-PAE mode, where each 4M page takes two entries)
- 512-entry 4-way set-associative L2 data TLB for 4K pages
- 128-entry 2-way set-associative L2 data TLB for 2M pages
  - can be used as 64-entry 2-way for 4M (legacy non-PAE) pages
- 16-entry 8-way set-associative L2 data TLB for 1G pages

Opterons do not support 1G pages in their ITLBs, so code should not be placed in 1G mappings. This shouldn't be an issue.

Xeon (Nehalem):
- 64-entry fully-associative L1 data TLB for 4K pages
- 32-entry fully-associative L1 data TLB for 2M and 4M pages
  - the L1 DTLBs together support only 4K, 2M, and 4M pages. NO 1G PAGES
- 512-entry unified L2 TLB, which only supports 4K pages

Nehalem is far more limited in both the number of large-page entries and the maximum large page size. This probably rules it out for RAMCloud, as we would have at most 32 large page mappings in the L1 DTLB and none in the unified L2.
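
To put numbers on that limit, TLB reach is simply entries times page size. The following is a minimal sketch using the Nehalem entry counts listed above; the helper name `tlb_reach` is only illustrative.

```python
KIB = 1 << 10
MIB = 1 << 20

def tlb_reach(entries, page_size):
    """Bytes of memory a TLB can keep mapped at once: entries * page size."""
    return entries * page_size

# Nehalem figures taken from the list above.
l1_large = tlb_reach(32, 2 * MIB)    # L1 DTLB, 2M pages
l1_small = tlb_reach(64, 4 * KIB)    # L1 DTLB, 4K pages
l2_small = tlb_reach(512, 4 * KIB)   # unified L2 TLB, 4K pages only

print(l1_large // MIB, "MiB reachable via 2M pages in the L1 DTLB")
print(l1_small // KIB, "KiB reachable via 4K pages in the L1 DTLB")
print(l2_small // MIB, "MiB reachable via 4K pages in the L2 TLB")
```

Even with every large-page entry in use, the reach is tens of megabytes, not tens of gigabytes.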

64GB in Pages

16,777,216 4K pages
32,768 2M pages
64 1G pages

=> We need very large (1G) pages to keep memory mapped without TLB misses hurting our performance.
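
The page counts above can be reproduced with a few lines of arithmetic. This is a small sketch; the names `PAGE_SIZES` and `pages_needed` are illustrative, not part of any existing code.

```python
GIB = 1 << 30
PAGE_SIZES = {"4K": 4 << 10, "2M": 2 << 20, "1G": 1 << 30}

def pages_needed(memory_bytes, page_size):
    """Number of pages of the given size needed to cover memory_bytes."""
    # Ceiling division, so a partial page still counts as one page.
    return -(-memory_bytes // page_size)

counts = {name: pages_needed(64 * GIB, size)
          for name, size in PAGE_SIZES.items()}

for name, n in counts.items():
    print(f"{n:,} {name} pages")
```

Only the 1G count is of the same order as the TLB entry counts listed earlier.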

Miscellany

Current Opterons can map 48GB with 1G pages using the L1 DTLB alone (48 entries), and 64GB once the 16 1G entries of the L2 are added. However, note that:
- The cost-effectiveness ceiling today is around 4GB/DIMM.
- Each socket serves a subset of the total DIMMs (4 to 8 DIMMs/socket is typical => 16-32GB per socket).
- Each socket has 2-6 cores today, 8-12 very soon.
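=> A 4-socket, quad-core server has 16 cores, which together can map 768GB (16 x 48GB) of physical memory in their L1 DTLBs, versus the 64-128GB (4 x 16-32GB) such a server would typically hold.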
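
The reach figures above follow from the Opteron entry counts. The sketch below makes two assumptions that the notes imply but do not state: the L1 and L2 entries hold distinct 1G pages (which is how 48 + 16 gives 64GB), and every core maps a disjoint set of pages. The function names are illustrative.

```python
GIB = 1 << 30

L1_DTLB_ENTRIES = 48   # fully associative, can hold 1G pages
L2_1G_ENTRIES = 16     # L2 data TLB entries reserved for 1G pages

def reach_gib(entries, page_size=GIB):
    """Memory (in GiB) covered by the given number of TLB entries."""
    return entries * page_size // GIB

def server_l1_reach_gib(sockets, cores_per_socket):
    """Aggregate L1 DTLB reach, assuming no two cores map the same page."""
    return sockets * cores_per_socket * reach_gib(L1_DTLB_ENTRIES)

per_core_l1 = reach_gib(L1_DTLB_ENTRIES)
per_core_total = reach_gib(L1_DTLB_ENTRIES + L2_1G_ENTRIES)
server_l1 = server_l1_reach_gib(sockets=4, cores_per_socket=4)

print(per_core_l1, "GiB per core (L1 only)")
print(per_core_total, "GiB per core (L1 + L2)")
print(server_l1, "GiB across a 4-socket, quad-core server (L1 only)")
```
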
Cores may soon have TLB locality:
- We may want to route requests to the core(s) with the appropriate mappings.
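
One possible way to picture such routing is sketched below. It is purely hypothetical: it assumes memory is carved into contiguous runs of 48 1G pages (one L1 DTLB's worth) and that core `i` keeps run `i` mapped. The notes do not specify any policy, and `owning_core` is an invented name.

```python
GIB = 1 << 30
PAGES_PER_CORE = 48   # 1G pages that one Opteron L1 DTLB can hold

def owning_core(offset, num_cores):
    """Return the core whose (hypothetical) 1G mappings cover a byte offset."""
    page = offset // GIB             # index of the 1G page holding the offset
    core = page // PAGES_PER_CORE    # contiguous runs of pages per core
    if core >= num_cores:
        raise ValueError("offset lies beyond the memory covered by these cores")
    return core

# Route a few example offsets on a 16-core server.
for offset in (0, 48 * GIB, 100 * GIB, 767 * GIB):
    print(offset // GIB, "GiB -> core", owning_core(offset, 16))
```

A real dispatcher would also need to account for NUMA placement, since each socket is attached to only a subset of the DIMMs.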