Hash Table & Multi-Level Lookup Performance

Main Memory Latency

For perspective, here are the main memory access latencies we measured:

Kernel-level, fixed-TLB benchmark:

Processor	MHz	Architecture	Ticks/Access	NSec/Access
Xeon Dual-Core 3060	2400	Core 2	185	77

Measured with a modified HiStar kernel that remaps a 2MB data page as uncached and performs 100e6 4-byte aligned accesses to it; the figures above are the per-access average. There is one initial TLB miss and no other activity in the system.

User-level benchmark (includes TLB misses, except in the Phenom case with 1GB pages):

Processor	MHz	Architecture	Page Size	Ticks/Access	NSec/Access
Phenom 9850 Quad-Core	2500	K10	4k	529	212
Phenom 9850 Quad-Core	2500	K10	1g	244	98
Xeon Dual-Core 3060	2400	Core 2	4k	262	109
Xeon Dual-Core 3060	2400	Core 2	4m	193	80
Core i7-920	2667	Nehalem	4k	99	37
Core i7-920	2667	Nehalem	2m	63	24
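
The NSec column in both tables follows from the tick count and the clock rate: a CPU at N MHz ticks N times per microsecond, so nsec = ticks * 1000 / MHz. A minimal sketch of that conversion, assuming the timestamp counter advances at the nominal clock rate listed in the table (ticks_to_ns is a name made up for this example, not part of the benchmark):

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick count to nanoseconds, rounded to the nearest integer.
 * Assumption: the counter runs at the nominal clock rate, i.e. `mhz`
 * ticks per microsecond. */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t mhz)
{
	assert(mhz > 0);
	return (ticks * 1000 + mhz / 2) / mhz;
}
```

For example, ticks_to_ns(185, 2400) gives 77 and ticks_to_ns(63, 2667) gives 24, matching the Xeon kernel-level row and the Nehalem 2m row.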

Measured in userspace with the following program:

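#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Helpers defined elsewhere; these declarations are assumed, not shown in the original. */
uint64_t rdtsc(void);		/* reads the CPU timestamp counter */
uint32_t *getbuf(void);		/* returns the maxmem-byte test region */
extern const uint64_t maxmem;	/* size of the test region in bytes */

int main(void) {
	uint32_t *buf = getbuf();
	const int loops = 100 * 1000 * 1000;
	uint64_t b;
	uint64_t blah = 0;	// accumulated and returned so the loops aren't compiled away
	int i;

	// Baseline: cost of random() and the modulo alone, with no memory access.
	b = rdtsc();
	for (i = 0; i < loops; i++)
		blah += random() % (maxmem / sizeof(buf[0]));
	uint64_t random_ticks = rdtsc() - b;

	printf("%" PRIu64 " ticks for random-mod (%" PRIu64 " each)\n",
	    random_ticks, random_ticks / loops);

	// Same loop, but each random index is also read from the buffer.
	b = rdtsc();
	for (i = 0; i < loops; i++)
		blah += buf[random() % (maxmem / sizeof(buf[0]))];
	uint64_t access_ticks = rdtsc() - b;

	printf("%" PRIu64 " total ticks (%" PRIu64 " each)\n", access_ticks,
	    access_ticks / loops);
	// The difference between the two loops is the memory access cost.
	printf("%" PRIu64 " ticks not including random-mod (%" PRIu64 " each)\n",
	    access_ticks - random_ticks, (access_ticks - random_ticks) / loops);

	return blah;
}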

Here getbuf() returns a 1GB region of virtual address space (maxmem = 1 * 1024 * 1024 * 1024).
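
The listing calls rdtsc() and getbuf() without showing them. Below is one way they could look, as a sketch under stated assumptions rather than the code actually used for the numbers above: rdtsc_sketch and getbuf_sketch are hypothetical names, the timer uses the GCC/Clang __rdtsc() intrinsic on x86 (with a clock_gettime() fallback elsewhere that returns nanoseconds, not ticks), and the buffer comes from an anonymous mmap() with ordinary small pages.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* x86: read the timestamp counter through the compiler intrinsic. */
static inline uint64_t rdtsc_sketch(void)
{
	return __rdtsc();
}
#else
#include <time.h>
/* Assumption: non-x86 fallback. Returns nanoseconds, not CPU ticks. */
static inline uint64_t rdtsc_sketch(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
	assert(rc == 0);
	(void)rc;
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical stand-in for getbuf(): map `bytes` of anonymous memory and
 * write to all of it, so that every page gets its own physical frame.
 * (On Linux, anonymous pages that are only ever read may all alias one
 * shared zero page, which would stay in cache.) Returns NULL on failure. */
static uint32_t *getbuf_sketch(size_t bytes)
{
	void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	memset(p, 1, bytes);
	return (uint32_t *)p;
}
```

The large-page rows in the table would need a huge-page mapping instead (on Linux, for example, mmap() accepts a MAP_HUGETLB flag); this sketch does not attempt that.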
Note that Phenom and Nehalem have about 2.3MB of combined L1 and L2 data TLB coverage with 4k pages. The Xeon is likely similar, if not less.
All chips have < 10MB cache, so > 99% of the data set is uncached.
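
The arithmetic behind these two notes can be sketched as follows. TLB coverage is the number of entries times the page size, and with uniformly random accesses the cached share of the data set is at most cache size over data set size. The function names are invented for this example, and the 512-entry TLB used in the usage note is an illustrative figure, not a claim about any of the chips above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of address space a TLB with `entries` entries can map at once. */
static uint64_t tlb_reach(uint64_t entries, uint64_t page_bytes)
{
	return entries * page_bytes;
}

/* Upper bound on the fraction of a data set that fits in the cache. */
static double max_cached_fraction(uint64_t cache_bytes, uint64_t data_bytes)
{
	assert(data_bytes > 0);
	if (cache_bytes >= data_bytes)
		return 1.0;
	return (double)cache_bytes / (double)data_bytes;
}
```

With a hypothetical 512 entries, tlb_reach(512, 4096) is 2MB, while the same 512 entries with 2MB pages reach the full 1GB, which is why the larger page sizes remove most of the TLB miss cost. max_cached_fraction(10MB, 1GB) is just under 0.01, hence the "> 99% uncached" figure.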