AMD64's long mode (64-bit mode) only works with paging enabled. In fact, the AMD64 Architecture Programmer's Manual says:
> Deactivate long mode by clearing CR0.PG to 0.
So, in order to address a reasonable amount of memory per machine, we will need to use the paging hardware. The problem is that TLB misses can be costly (to be quantified).
AMD64 has essentially four different paging setups, depending on whether long mode (64-bit) or legacy mode (32-bit) is used and, in the latter case, whether PAE (physical address extension) or PSE (page size extension) is used. To make things more complicated, PSE has three variants: the original 32-bit PSE, which just enables 4MB pages; PSE-36, which enables 4MB pages and allows them to reference 36-bit physical addresses; and AMD64's PSE, which is basically PSE-40.
Mode | PSE | PAE | Virt Addr Bits | Phys Addr Bits | Effect |
---|---|---|---|---|---|
Long | X | 1 | 64 | 52 | 4K, 2M and 1G pages in 4, 3, and 2-level page tables, respectively |
Legacy | 0 | 0 | 32 | 32 | 4K pages in 2-level page table |
Legacy | 1 | 0 | 32 | 32 (4K pages) | 4K and 4M pages in 2 and 1-level page tables, respectively |
Legacy | X | 1 | 32 | 52 | 4K and 2M pages in 3 and 2-level page tables, respectively |
Note that long mode implies PAE mode and that in PAE mode, PSE is ignored (hence the X entries in the PSE column).
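To make the long-mode row concrete, here is a minimal sketch of how a virtual address splits into page-table indices and a page offset for each page size. It assumes the usual 4-level layout in which translation starts at bit 47 and each level consumes 9 bits; the names `split_va`, `va_parts`, and `page_size` are ours, for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Page sizes available in long mode, per the table above. */
enum page_size { PAGE_4K, PAGE_2M, PAGE_1G };

struct va_parts {
    unsigned idx[4];   /* table indices, top level first; unused slots stay 0 */
    unsigned levels;   /* number of table levels walked: 4, 3, or 2 */
    uint64_t offset;   /* byte offset within the final page */
};

/* Hypothetical helper: split a long-mode virtual address for a page size. */
struct va_parts split_va(uint64_t va, enum page_size ps)
{
    struct va_parts p = { {0, 0, 0, 0}, 0, 0 };
    unsigned offset_bits = 12;

    assert(ps == PAGE_4K || ps == PAGE_2M || ps == PAGE_1G);
    switch (ps) {
    case PAGE_4K: p.levels = 4; offset_bits = 12; break;
    case PAGE_2M: p.levels = 3; offset_bits = 21; break;
    case PAGE_1G: p.levels = 2; offset_bits = 30; break;
    }

    /* Each level indexes a 512-entry table with 9 address bits:
     * bits 47:39, then 38:30, then 29:21, then 20:12. */
    for (unsigned i = 0; i < p.levels; i++) {
        unsigned shift = 39 - 9 * i;
        p.idx[i] = (unsigned)((va >> shift) & 0x1FF);
    }

    /* Whatever is left below the last index is the offset into the page. */
    p.offset = va & ((UINT64_C(1) << offset_bits) - 1);
    return p;
}
```

Larger pages simply stop the walk one or two levels early, so the bits that would have indexed the lower tables become part of the page offset instead.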
(History: PSE appeared in the Pentium, but wasn't documented until the Pentium Pro. PSE-36 appeared in the Pentium-III. PAE appeared in the Pentium Pro.)
NB: Purportedly the new Westmere Nehalem chips support 1G pages.
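Whether a given chip supports 1G pages can be checked at run time rather than taken on report. A minimal sketch, assuming a GCC- or Clang-style `<cpuid.h>`: the feature is advertised in bit 26 of EDX for CPUID function 0x80000001. The function and macro names below are ours, for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Bit 26 of EDX from CPUID function 0x80000001 advertises 1G pages. */
#define CPUID_EXT_EDX_PAGE1GB (UINT32_C(1) << 26)

/* Hypothetical helper: decode the feature bit from a raw EDX value. */
int edx_has_1g_pages(uint32_t edx)
{
    return (edx & CPUID_EXT_EDX_PAGE1GB) != 0;
}

/* Returns 1 if the running CPU reports 1G pages, 0 if it does not,
 * and -1 if the answer cannot be determined (leaf missing or not x86). */
int cpu_has_1g_pages(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return -1;
    return edx_has_1g_pages(edx);
#else
    return -1;
#endif
}
```

On Linux the same bit shows up as the `pdpe1gb` flag in `/proc/cpuinfo`.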
Opteron (K10) and Nehalem have split L1 and L2 TLBs. In both cases, the L1 TLB consists of separate instruction (ITLB) and data TLBs. The Nehalem has a unified L2 TLB, unlike the Opteron.
The Opterons do not support 1G pages in their ITLBs, so code in 1G mappings is to be avoided. This shouldn't be an issue.
Nehalem is far more limited, both in the number of large-page entries its TLBs hold and in the maximum large page size. This probably rules the architecture out for RAMCloud, as we would have at most 32 large page mappings in the L1 DTLB and none in the unified L2.
=> We need seriously large pages to have memory mapped in without TLB misses affecting our performance.
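The arithmetic behind this conclusion is TLB reach: entries times page size. A rough sketch follows; the 32-entry figure is the Nehalem L1 DTLB number quoted above (assumed here to hold 2M pages), while the 64 GiB of memory per machine is purely an illustrative assumption, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

#define KIB (UINT64_C(1) << 10)
#define MIB (UINT64_C(1) << 20)
#define GIB (UINT64_C(1) << 30)

/* Bytes of memory that can be mapped at once without a TLB miss. */
uint64_t tlb_reach(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}

/* TLB entries needed to map mem_bytes with pages of page_bytes, rounded up. */
uint64_t entries_needed(uint64_t mem_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return (mem_bytes + page_bytes - 1) / page_bytes;
}
```

For example, `tlb_reach(32, 2 * MIB)` is only 64 MiB. Covering a hypothetical 64 GiB machine would take `entries_needed(64 * GIB, 4 * KIB)` = 16,777,216 entries with 4K pages, 32,768 entries with 2M pages, and 64 entries with 1G pages.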