Main Memory Latency
For perspective, here are the main memory access latencies we measured:
Kernel-level, fixed-TLB benchmark:
| Processor | MHz | Architecture | Ticks/Access | ns/Access |
|---|---|---|---|---|
| Xeon Dual-Core 3060 | 2400 | Core 2 | 185 | 77 |
| Core i7-920 | 2667 | Nehalem | 174 | 65 |
| Phenom 9850 Quad-Core | 2500 | K10 | 329 | 132 |
- Measured with a modified HiStar kernel that remaps a 2MB data page as uncached, performs 100e6 aligned 4-byte accesses to it, and reports the average cost per access. There is one initial TLB miss, and no other activity in the system.
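The nanosecond column follows from the tick count and the clock rate: one tick lasts 1000 / MHz nanoseconds. A minimal sketch of that conversion, assuming the timestamp counter ticks at the nominal frequency listed in the MHz column; the name `ticks_to_ns` is made up here for illustration and is not part of the benchmark:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a tick count to nanoseconds for a core
 * clocked at `mhz` MHz, rounding to the nearest nanosecond.
 * Assumes the tick counter runs at the nominal clock rate. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t mhz)
{
    assert(mhz > 0);
    /* ns = ticks * 1000 / mhz; adding mhz / 2 first rounds to nearest. */
    return (ticks * 1000 + mhz / 2) / mhz;
}
```

For example, `ticks_to_ns(185, 2400)` yields 77, matching the Xeon row above.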
User-level benchmark (includes TLB misses, except for 1GB Phenom case):
| Processor | MHz | Architecture | Page Size | Ticks/Access | ns/Access |
|---|---|---|---|---|---|
| Phenom 9850 Quad-Core | 2500 | K10 | 4 KB | 529 | 212 |
| Phenom 9850 Quad-Core | 2500 | K10 | 1 GB | 244 | 98 |
| Xeon Dual-Core 3060 | 2400 | Core 2 | 4 KB | 262 | 109 |
| Xeon Dual-Core 3060 | 2400 | Core 2 | 4 MB | 193 | 80 |
| Core i7-920 | 2667 | Nehalem | 4 KB | 99 | 37 |
| Core i7-920 | 2667 | Nehalem | 2 MB | 63 | 24 |
- Determined in userspace by the following:

      #include <inttypes.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main() {
          uint32_t *buf = getbuf();
          const int loops = 100 * 1000 * 1000;
          uint64_t b;
          uint64_t blah = 0;  // don't compile away
          int i;

          // Baseline: cost of generating a random index, without touching buf.
          b = rdtsc();
          for (i = 0; i < loops; i++)
              blah += random() % (maxmem / sizeof(buf[0]));
          uint64_t random_ticks = rdtsc() - b;
          printf("%" PRIu64 " ticks for random-mod (%" PRIu64 " each)\n",
                 random_ticks, random_ticks / loops);

          // Same loop, plus one load from a random element of buf.
          b = rdtsc();
          for (i = 0; i < loops; i++)
              blah += buf[random() % (maxmem / sizeof(buf[0]))];
          uint64_t access_ticks = rdtsc() - b;
          printf("%" PRIu64 " total ticks (%" PRIu64 " each)\n",
                 access_ticks, access_ticks / loops);

          // The difference approximates the cost of the memory access alone.
          printf("%" PRIu64 " ticks not including random-mod (%" PRIu64 " each)\n",
                 access_ticks - random_ticks,
                 (access_ticks - random_ticks) / loops);
          return blah;
      }

- Here getbuf() returns a 1GB region of virtual address space (maxmem = 1 * 1024 * 1024 * 1024).
- Note that Phenom and Nehalem have about 2-3MB of L1 and L2 data TLB coverage with 4 KB pages. The Xeon's coverage is likely similar, if not smaller.
- All chips have < 10MB cache, so > 99% of the data set is uncached.
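The listing relies on `rdtsc()` and `getbuf()`, which are not shown. The following is only a sketch of what such helpers might look like, not the versions used for the measurements. Assumptions: GCC or Clang inline assembly on x86, a fallback to a monotonic nanosecond clock elsewhere (which counts nanoseconds, not cycles), and a hypothetical `getbuf_bytes(size)` that takes the size as a parameter instead of reading `maxmem`. It uses the default page size only; the large-page runs in the table would need a different allocation path that is not sketched here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for rdtsc(): read the CPU timestamp counter. */
uint64_t rdtsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    /* RDTSC returns the 64-bit counter split across EDX:EAX. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    /* Assumption: off x86, fall back to nanoseconds, not cycles. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical stand-in for getbuf(): allocate `bytes` of memory and
 * write to all of it so that every page has its own physical backing. */
uint32_t *getbuf_bytes(size_t bytes)
{
    assert(bytes >= sizeof(uint32_t));
    uint32_t *buf = malloc(bytes);
    if (buf == NULL)
        return NULL;
    /* Without this write, untouched pages that are only read may all be
     * backed by one shared zero page, and loads would hit in the cache. */
    memset(buf, 0xA5, bytes);
    return buf;
}
```

Under these assumptions, `getbuf()` in the listing would correspond to `getbuf_bytes(maxmem)`. Writing to the whole buffer up front matters for the result: the > 99% uncached estimate only holds if the 1GB of virtual address space is backed by distinct physical pages.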