Server Memory Architecture

Introduction

AMD64's long mode (64-bit mode) only works with paging enabled. In fact, the AMD64 Architecture Programmer's Manual says:

Deactivate long mode by clearing CR0.PG to 0.

So, in order to address a reasonable amount of memory per machine, we will need to use the paging hardware. The problem is that TLB misses can be costly (to be quantified).

AMD64 Paging Hardware

AMD64 has essentially four different paging setups, depending on whether long mode (64-bit) or legacy mode (32-bit) is used, and in the latter case, whether PAE (physical address extension) or PSE (page size extension) is used. To make things more complicated, PSE has three variants: the original 32-bit PSE, which just enables 4MB pages; PSE-36, which additionally allows 4MB pages to reference 36-bit physical addresses; and AMD64's PSE, which is essentially PSE-40.

Mode   | PSE | PAE | Virt Addr Bits | Phys Addr Bits                           | Effect
-------|-----|-----|----------------|------------------------------------------|-------
Long   | X   | 1   | 64             | 52                                       | 4K, 2M, and 1G pages in 4-, 3-, and 2-level page tables, respectively (1G pages are currently Opteron-only.)
Legacy | 0   | 0   | 32             | 32                                       | 4K pages in a 2-level page table
Legacy | 1   | 0   | 32             | 32 (4K pages); 32, 36, or 40 (4M pages)  | 4K and 4M pages in 2- and 1-level page tables, respectively
Legacy | X   | 1   | 32             | 52                                       | 4K and 2M pages in 3- and 2-level page tables, respectively (PAE in legacy mode includes an NX bit, if hardware supports it.)

Note that long mode implies PAE mode and that in PAE mode, PSE is ignored.

(History: PSE appeared in the Pentium, but wasn't documented until the Pentium Pro. PSE-36 appeared in the Pentium-III. PAE appeared in the Pentium Pro.)

Modern TLB Characteristics

NB: Purportedly the new Westmere chips (the 32nm Nehalem shrink) have 1GB pages

Opteron (K10) and Nehalem have split L1 and L2 TLBs. In both cases, the L1 TLB consists of separate instruction (ITLB) and data TLBs. The Nehalem has a unified L2 TLB, unlike the Opteron.

  • Opteron:
    • 48-entry fully-associative L1 data TLB
      • supports 4K, 2M, and 1G pages (or 24 4M pages with PAE?)
    • 512-entry 4-way set-associative L2 data TLB for 4K pages
    • 128-entry 2-way set-associative L2 data TLB for 2M pages
      • can be used as 64-entry 2-way for 4M PAE pages
    • 16-entry 8-way set-associative L2 data TLB for 1G pages

The Opterons do not support 1G pages in their ITLBs, so code in 1G mappings is to be avoided. This shouldn't be an issue.

  • Xeon (Nehalem):
    • 64-entry 4-way set-associative L1 data TLB for 4K pages
    • 32-entry 4-way set-associative L1 data TLB for 2M and 4M pages
      • Nehalem supports 4K, 2M, and 4M pages overall; NO 1G PAGES
    • 512-entry unified 4-way set-associative L2 TLB, which only supports 4K pages

Nehalem is far more limited in the number of large pages it supports, as well as in the maximum large page size. This probably rules the architecture out for RAMCloud, as we would have at most 32 large-page mappings in the L1 DTLB and none in the unified L2.

64GB in pages

  • 16,777,216 4K pages
  • 32,768 2M pages
  • 16,384 4M pages
  • 64 1G pages

=> We need seriously large pages to keep our memory mapped without TLB misses hurting performance.

Miscellany

  • Current Opterons can map 48GB using the L1 DTLB and 64GB with the L2 included. However, note that:
    • The cost-effectiveness ceiling is hit around 4GB/DIMM density today.
    • Each socket corresponds to a subset of the total DIMMs (4 to 8 DIMMs/socket is typical, => 16-32GB per socket)
    • Each socket has 2-6 cores today, 8-12 very soon.
  • => 4-socket, quad-core server has 16 cores, which can map 768GB of physical memory in L1 DTLBs
  • Cores may soon have TLB locality
    • May want to route requests to core(s) with appropriate mappings