Introduction

AMD64's long mode (64-bit mode) only works with paging enabled. In fact, the AMD64 Architecture Programmer's Manual says:

Deactivate long mode by clearing CR0.PG to 0.

So, in order to address a reasonable amount of memory per machine, we will need to use the paging hardware. The problem is that TLB misses can be costly (to be quantified).

AMD64 Paging Hardware

AMD64 has essentially four different paging setups, depending on whether long mode (64-bit) or legacy mode (32-bit) is used and, in the latter case, whether PAE (physical address extension) or PSE (page size extension) is enabled. To complicate matters, PSE has three variants: the original 32-bit PSE, which just enables 4MB pages; PSE-36, which enables 4MB pages and lets them reference 36-bit physical addresses; and AMD64's PSE, which is essentially PSE-40.

Mode   | PSE | PAE | Virt Addr Bits | Phys Addr Bits                          | Effect
-------+-----+-----+----------------+-----------------------------------------+-------------------------------------------------------------
Long   | X   | 1   | 64             | 52                                      | 4K, 2M, and 1G pages in 4-, 3-, and 2-level page tables, respectively
Legacy | 0   | 0   | 32             | 32                                      | 4K pages in a 2-level page table
Legacy | 1   | 0   | 32             | 32 (4K pages); 32, 36, or 40 (4M pages) | 4K and 4M pages in 2- and 1-level page tables, respectively
Legacy | X   | 1   | 32             | 52                                      | 4K and 2M pages in 3- and 2-level page tables, respectively

(1G pages in long mode are currently Opteron-only.)
(PAE in legacy mode includes an NX bit, if hardware supports it.)

Note that long mode implies PAE mode and that in PAE mode, PSE is ignored.
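To make the long-mode row concrete, here is a minimal sketch of how a virtual address splits into table indices and a page offset. It assumes the usual 4-level layout with 9-bit indices and a 12-bit offset for 4K pages (48 translated address bits); the helper names table_index and page_offset are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed long-mode layout: four 9-bit table indices followed by a
 * 12-bit byte offset within a 4K page. */
enum { INDEX_BITS = 9, OFFSET_BITS_4K = 12 };

/* Hypothetical helper: table index used at one level of the walk.
 * Level 3 is the top-level table, level 0 is the last-level table. */
static unsigned table_index(uint64_t vaddr, unsigned level)
{
    assert(level < 4);
    unsigned shift = OFFSET_BITS_4K + INDEX_BITS * level;
    return (unsigned)((vaddr >> shift) & ((1u << INDEX_BITS) - 1));
}

/* Hypothetical helper: byte offset within a page when the walk stops
 * `skipped` levels early (0 = 4K page, 1 = 2M page, 2 = 1G page). */
static uint64_t page_offset(uint64_t vaddr, unsigned skipped)
{
    assert(skipped < 3);
    unsigned bits = OFFSET_BITS_4K + INDEX_BITS * skipped;
    return vaddr & ((UINT64_C(1) << bits) - 1);
}
```

Each level that a large page skips folds another 9 index bits into the offset, which is why 2M and 1G pages need 3 and 2 levels instead of 4.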

(History: PSE appeared in the Pentium, but wasn't documented until the Pentium Pro. PSE-36 appeared in the Pentium-III. PAE appeared in the Pentium Pro.)
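Since these features arrived in different processor generations, software normally checks for them through CPUID rather than by processor model. The following is a rough sketch of decoding the relevant bits. The struct and function names are hypothetical; the bit positions are the standard ones in EDX of CPUID leaf 1 and leaf 80000001h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Paging-related feature flags (hypothetical struct for illustration). */
struct paging_features {
    bool pse;        /* leaf 1, EDX bit 3 */
    bool pae;        /* leaf 1, EDX bit 6 */
    bool pse36;      /* leaf 1, EDX bit 17 */
    bool nx;         /* leaf 80000001h, EDX bit 20 */
    bool page1gb;    /* leaf 80000001h, EDX bit 26 */
    bool long_mode;  /* leaf 80000001h, EDX bit 29 */
};

static bool bit_set(uint32_t reg, unsigned bit)
{
    return ((reg >> bit) & 1u) != 0;
}

/* Decode the flags from the EDX values of the two CPUID leaves. */
static struct paging_features decode_paging_features(uint32_t std_edx,
                                                     uint32_t ext_edx)
{
    struct paging_features f;
    f.pse       = bit_set(std_edx, 3);
    f.pae       = bit_set(std_edx, 6);
    f.pse36     = bit_set(std_edx, 17);
    f.nx        = bit_set(ext_edx, 20);
    f.page1gb   = bit_set(ext_edx, 26);
    f.long_mode = bit_set(ext_edx, 29);
    return f;
}
```

With GCC or Clang on x86, the two EDX values can be obtained with __get_cpuid from <cpuid.h>, called once with leaf 1 and once with leaf 0x80000001. The decoding is kept separate so it can be exercised on any machine.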

Modern TLB Characteristics

NB: Purportedly the new Westmere (Nehalem-family) chips support 1GB pages.

Opteron (K10) and Nehalem both have two-level TLBs (L1 and L2). In both cases, the L1 TLB is split into separate instruction (ITLB) and data (DTLB) TLBs. Nehalem's L2 TLB is unified, unlike the Opteron's.

The Opterons do not support 1G pages in their ITLBs, so code should not be placed in 1G mappings. This shouldn't be an issue.

Nehalem is far more limited in the number of large pages supported, as well as in the maximum large page size. This probably rules the architecture out for RAMCloud, as we would have at most 32 large page mappings in the L1 DTLB and none in the unified L2.
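The limitation can be expressed as TLB reach: the number of entries times the page size. The sketch below uses a hypothetical helper, tlb_reach_bytes, and assumes a 2MB large page for the 32-entry case, since the page size is not stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes of memory covered by a TLB that holds
 * `entries` mappings of `page_bytes` each. */
static uint64_t tlb_reach_bytes(unsigned entries, uint64_t page_bytes)
{
    /* Page sizes are powers of two. */
    assert(page_bytes != 0 && (page_bytes & (page_bytes - 1)) == 0);
    return (uint64_t)entries * page_bytes;
}
```

Under that 2MB assumption, tlb_reach_bytes(32, 2 << 20) comes to 64MB, a small fraction of a machine holding tens of gigabytes.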

64GB in pages

=> We need seriously large pages to have memory mapped in without TLB misses affecting our performance.
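As a back-of-the-envelope check, the number of mappings needed to cover 64GB at each page size can be computed directly. The helper name pages_needed is hypothetical, and GB, MB, and KB are taken as powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of pages of `page_bytes` needed to map
 * `region_bytes`, rounding up to a whole page. */
static uint64_t pages_needed(uint64_t region_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return (region_bytes + page_bytes - 1) / page_bytes;
}
```

For a 64GB region this gives 16,777,216 pages at 4K, 32,768 pages at 2M, and 64 pages at 1G. Only the 1G figure is in the range of a few dozen TLB entries.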

Miscellany