RPC Performance Numbers

User-level library performance (libe1000):

1-byte payload (sent in a minimum-size Ethernet frame)

Best result: 11.25 us

Life of an RTT:

(Client side) Time to copy packet into ring buffer: 29 ns
(Client side) Time taken to poke the NIC to inform it that a packet is ready to transmit: 147 ns
(Client side) Time for NIC to transfer packet from memory to on chip buffer: 205 ns

(Server side) Time to transfer packet from NIC buffer to memory: 264 ns
(Server side) Time to process packet and construct new packet: 176 ns
(Server side) Time taken to poke the NIC to inform it that a packet is ready to transmit: 150 ns
(Server side) Time for NIC to transfer packet from memory to on chip buffer: 200 ns

(Client side) Time to transfer packet from NIC buffer to memory: 260 ns

Total time spent in hardware: 1.431 us
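
As a quick check on the arithmetic, the sketch below sums the eight per-step figures listed above. The names step_ns, total_ns and step_count are made up for this illustration and are not part of libe1000. The sum comes to 1431 ns, which matches the 1.431 us total and shows that the total covers all eight listed steps, including the ring-buffer copy and the server-side packet processing.

```c
#include <assert.h>
#include <stddef.h>

/* Per-step latencies in nanoseconds, copied from the list above. */
static const unsigned step_ns[] = {
    29,  /* client: copy packet into ring buffer */
    147, /* client: poke the NIC */
    205, /* client: NIC moves packet from memory to on-chip buffer */
    264, /* server: NIC buffer to memory */
    176, /* server: process packet and construct reply */
    150, /* server: poke the NIC */
    200, /* server: NIC moves packet from memory to on-chip buffer */
    260, /* client: NIC buffer to memory */
};

/* Number of entries in step_ns. */
static size_t step_count(void)
{
    return sizeof step_ns / sizeof step_ns[0];
}

/* Sum an array of step latencies (all values in ns). */
static unsigned total_ns(const unsigned *steps, size_t n)
{
    assert(steps != NULL);
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += steps[i];
    return sum;
}
```

Calling total_ns(step_ns, step_count()) returns 1431. Against the 11.25 us best-case RTT, that leaves roughly 9.8 us that these notes do not itemize.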

Interesting issue: When the NIC receives a packet, it DMAs the packet into the machine's main memory and sets a bit in the ring descriptor. The driver spins on this bit, waiting to be told that a packet has arrived. This creates contention for the descriptor's cache line: the processor keeps reading the bit (pulling the line into its cache) while the NIC is trying to set it (which invalidates the line in the cache). Inserting a short spin loop before each check of the bit reduced this contention and lowered the latency.

Round Trip Times for various configs:

100 Mbps network, with router:

Simple memcached UDP C client:
10 byte payload: 83.6 us

memcached TCP client:
10 byte payload: 126.3 us

1 Gbps network, no router:
(10 byte payload)

with interrupt coalescing:

simple memcached client: 69.6 us
simple TCP client: 103.6 us

without interrupt coalescing:

simple memcached client: 52.6 us
simple TCP client: 84.7 us
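
To make the cost of interrupt coalescing explicit, the sketch below subtracts the two sets of RTTs measured on the 1 Gbps network. The struct and function names are invented for this illustration. The only data used are the four figures above.

```c
#include <assert.h>

/* RTT in microseconds with and without interrupt coalescing. */
struct rtt_pair {
    double with_us;
    double without_us;
};

/* Figures copied from the 1 Gbps, no-router measurements above. */
static const struct rtt_pair memcached_client = { 69.6, 52.6 };
static const struct rtt_pair tcp_client = { 103.6, 84.7 };

/* Extra round-trip latency attributable to coalescing, in microseconds. */
static double coalescing_cost_us(struct rtt_pair p)
{
    /* Holds for the measurements in these notes; not a general rule. */
    assert(p.with_us >= p.without_us);
    return p.with_us - p.without_us;
}
```

For these measurements, coalescing adds about 17 us to the memcached client's round trip and about 19 us to the TCP client's.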

memcached software (single-threaded):

Read: 13 us
Write: 22 us

These increase to around 20 us (read) and 30 us (write) when loaded.