
Intro

AMD64's long mode (64-bit mode) only works with paging enabled. In fact, the AMD64 Architecture Programmer's Manual says:

Deactivate long mode by clearing CR0.PG to 0.

So, in order to address a reasonable amount of memory per machine, we will need to use the paging hardware. The problem is that TLB misses can be costly (to be quantified).

Modern TLB Characteristics

Opteron (K10) and Nehalem have split L1 and L2 TLBs. In both cases, the L1 TLB consists of separate instruction (ITLB) and data TLBs. The Nehalem has a unified L2 TLB, unlike the Opteron.

  • Opteron:
    • 48-entry fully-associative L1 data TLB
      • supports 4K, 2M, and 1G pages (or 24 4M pages with PAE?)
    • 512-entry 4-way set-associative L2 data TLB for 4K pages
    • 128-entry 2-way set-associative L2 data TLB for 2M pages
      • can be used as 64-entry 2-way for 4M PAE pages
    • 16-entry 8-way set-associative L2 data TLB for 1G pages

The Opterons do not support 1G pages in their ITLBs, so placing code in 1G mappings should be avoided. This shouldn't be an issue.

  • Xeon (Nehalem):
    • 64-entry fully-associative L1 data TLB for 4K pages
    • 32-entry fully-associative L1 data TLB for 2M and 4M pages
    • 512-entry unified L2 TLB, which only supports 4K pages
    • supports 4K, 2M, and 4M pages only; NO 1G PAGES anywhere in the hierarchy

Nehalem is far more limited in the number of large-page entries it supports, as well as in its maximum large page size. This probably rules the architecture out for RAMCloud, as we would have at most 32 large-page mappings in the L1 DTLB and none in the unified L2.
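The gap in reach between the two architectures is easy to make concrete. A quick sketch, using the entry counts from the lists above and each architecture's largest supported data page size:

```python
KB, MB, GB = 2**10, 2**20, 2**30

def reach(entries, page_size):
    """Total memory covered by a TLB level: entries * page size."""
    return entries * page_size

# Opteron L1 DTLB: 48 fully-associative entries, 1G pages supported
opteron_l1_1g = reach(48, 1 * GB)    # 48 GB of reach
# Nehalem L1 DTLB: 32 entries for 2M/4M pages, no 1G support
nehalem_l1_2m = reach(32, 2 * MB)    # 64 MB of reach
# Nehalem unified L2 TLB: 512 entries, 4K pages only
nehalem_l2_4k = reach(512, 4 * KB)   # 2 MB of reach

print(opteron_l1_1g // GB, "GB vs", nehalem_l1_2m // MB, "MB")
```

So an Opteron L1 DTLB full of 1G mappings covers 48GB, while Nehalem tops out at 64MB in the L1 and a mere 2MB in the L2.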

64GB in pages

  • 16,777,216 4K pages
  • 32,768 2M pages
  • 16,384 4M pages
  • 64 1G pages

=> We need seriously large pages to keep all of memory mapped without TLB misses hurting our performance.
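The page counts in the list above follow directly from dividing 64GB by each page size:

```python
GB = 2**30
total = 64 * GB

# Pages needed to map 64GB at each supported page size
for name, size in [("4K", 4 * 2**10), ("2M", 2 * 2**20),
                   ("4M", 4 * 2**20), ("1G", 1 * GB)]:
    print(f"{total // size:>10,} {name} pages")
```

Even 2M pages leave us needing tens of thousands of mappings, far beyond any TLB; only 1G pages bring the count (64) within range of a TLB's capacity.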

Miscellany

  • Current Opterons can map 48GB using the L1 DTLB alone, or 64GB once the 16 1G entries in the L2 are included. However, note that:
    • The cost-effectiveness ceiling is hit at around 4GB/DIMM density today.
    • Each socket corresponds to a subset of the total DIMMs (4 to 8 DIMMs/socket is typical, => 16-32GB per socket)
    • Each socket has 2-6 cores today, 8-12 very soon.
  • => A 4-socket, quad-core server has 16 cores, which can map 768GB of physical memory in their L1 DTLBs
  • Cores may soon have TLB locality
    • May want to route requests to core(s) with appropriate mappings
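The 768GB figure above can be checked by multiplying out the assumptions (each core has its own 48-entry L1 DTLB filled entirely with 1G pages):

```python
GB = 2**30
sockets, cores_per_socket = 4, 4   # 4-socket, quad-core server
l1_dtlb_entries, page = 48, 1 * GB # per-core Opteron L1 DTLB, 1G pages

cores = sockets * cores_per_socket             # 16 cores
total_reach = cores * l1_dtlb_entries * page   # aggregate L1 DTLB reach
print(cores, "cores can map", total_reach // GB, "GB")
```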