...
- RTT of 26 us for a simple ping client/server with a 10-byte payload; 38 us for 100 bytes.
- How is this achieved?
- OS bypass - NIC functions exposed directly to the user-level program
- Proprietary protocol
- Polling instead of interrupts: continually poll the NIC rather than letting it generate interrupts (see the receive-loop sketch after this list)
- Eliminate all copies on the server side
- Process the packet while it's still in the ring buffer.
- This may require a large ring buffer, which could in turn increase latency.
- Solution: Multiple server threads processing in parallel.
- Needs a locking mechanism -> might increase overhead?
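- A minimal sketch of the polled, zero-copy receive path described above (the descriptor layout, the handle_request() and nic_write_rx_tail() helpers, and the C11 atomic-flag spinlock are illustrative assumptions, not the real e1000 interface):

      #include <stdint.h>
      #include <stdatomic.h>

      /* Hypothetical RX descriptor -- simplified, not the real e1000 layout.
       * The NIC sets the DD ("descriptor done") bit when the buffer holds a
       * complete frame. */
      struct rx_desc {
          volatile uint32_t status;   /* bit 0 = DD */
          uint16_t          length;
          void             *buf;      /* frame data, DMAed in by the NIC */
      };

      #define RX_RING_SIZE 256
      static struct rx_desc rx_ring[RX_RING_SIZE];
      static unsigned rx_head;                        /* next descriptor to check */
      static atomic_flag rx_lock = ATOMIC_FLAG_INIT;  /* serialises server threads */

      extern void handle_request(void *buf, uint16_t len);  /* placeholder */
      extern void nic_write_rx_tail(unsigned idx);          /* placeholder MMIO write */

      /* Each server thread runs this loop: poll the ring instead of taking an
       * interrupt, and process the frame in place rather than copying it out. */
      static void rx_poll_loop(void)
      {
          for (;;) {
              while (atomic_flag_test_and_set(&rx_lock))
                  ;                                   /* spin: claim a descriptor */
              struct rx_desc *d = &rx_ring[rx_head];
              if (!(d->status & 1)) {                 /* DD clear: nothing new yet */
                  atomic_flag_clear(&rx_lock);
                  continue;                           /* keep polling */
              }
              unsigned idx = rx_head;
              rx_head = (rx_head + 1) % RX_RING_SIZE;
              atomic_flag_clear(&rx_lock);

              handle_request(rx_ring[idx].buf, rx_ring[idx].length);

              /* Only after processing is the descriptor handed back, since the
               * frame was consumed directly from the ring buffer (no copy). */
              rx_ring[idx].status = 0;
              nic_write_rx_tail(idx);
          }
      }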
- Using the GAMMA code as the base
- RTT may be improved with some more NIC tuning
- Claimed latency of 12-13 us with this mechanism.
- Maybe use a doorbell register of some sort to reduce transmit latency further?
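- A sketch of the doorbell idea: post a TX descriptor, then "ring" the NIC with a single MMIO write to its Transmit Descriptor Tail register. The names and the 0x3818 TDT offset follow the Intel 8254x-family docs and are assumptions for the i82573E; nic_regs is assumed to be BAR0, mapped elsewhere:

      #include <stdint.h>

      #define E1000_TDT 0x3818             /* Transmit Descriptor Tail (assumed offset) */

      static volatile uint32_t *nic_regs;  /* NIC BAR0, mapped during driver setup */

      struct tx_desc {                     /* simplified legacy TX descriptor */
          uint64_t addr;                   /* DMA address of the frame */
          uint16_t length;
          uint8_t  cso, cmd, status, css;
          uint16_t special;
      };

      #define TX_RING_SIZE 256
      static struct tx_desc tx_ring[TX_RING_SIZE];
      static unsigned tx_tail;

      static void xmit(uint64_t dma_addr, uint16_t len)
      {
          struct tx_desc *d = &tx_ring[tx_tail];
          d->addr   = dma_addr;
          d->length = len;
          d->cmd    = 0x0b;                /* EOP | IFCS | RS: report status on completion */
          d->status = 0;

          tx_tail = (tx_tail + 1) % TX_RING_SIZE;

          /* The doorbell: one uncached write telling the NIC new work exists,
           * so the DMA engine can start immediately instead of discovering
           * the descriptor later. */
          nic_regs[E1000_TDT / 4] = tx_tail;
      }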
- HiStar results (all using 100-byte frames [including Ethernet header, excluding CRC], kernel mode, interrupts disabled, no network stack, timed with the TSC; see the timing sketch below):
- All numbers had very low variance - shared, lightly loaded 10/100/1000 copper switch
- Intel e1000 gigabit NIC (i82573E)
- unclear if running in 100 Mbit or 1000 Mbit mode - our switch lies, but the PHY claimed gigabit.
- 36usec RTT kernel-to-kernel ping, polled, no interrupts
- => user-mode drivers may have little overhead
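- A sketch of the sort of TSC-based timing used for these numbers: read the TSC around the operation and convert ticks to usec with the CPU frequency. send_ping()/wait_for_reply() and cpu_hz are placeholders (assumptions); interrupts are already off per the setup above:

      #include <stdint.h>

      static inline uint64_t rdtsc(void)
      {
          uint32_t lo, hi;
          __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
          return ((uint64_t)hi << 32) | lo;
      }

      extern void send_ping(void);        /* placeholder: put a frame on the xmit ring */
      extern void wait_for_reply(void);   /* placeholder: poll the RX ring */

      static double time_ping_usec(uint64_t cpu_hz)
      {
          uint64_t t0 = rdtsc();
          send_ping();
          wait_for_reply();
          uint64_t t1 = rdtsc();
          return (double)(t1 - t0) * 1e6 / (double)cpu_hz;
      }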
- transmit delay for 1 packet (time from putting it on the xmit ring to the NIC claiming xmit done; see the measurement sketch after these numbers):
- 'xmit done' is ill-defined; the docs seem to imply the time to move the buffer into the xmit FIFO (as we configured the NIC)
- IRQ assertion: 25-26usec
- ring descriptor update: 23.5usec
- => ring buffer update to IRQ assertion delay is ~2-3usec
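- A sketch of how the "ring descriptor update" measurement could be taken: post one frame with RS set, spin on the descriptor's DD status bit, and diff the TSC. It reuses rdtsc() and xmit()/tx_ring/tx_tail from the sketches above; the IRQ-assertion number would be measured the same way, but timestamping in the interrupt handler instead:

      static uint64_t measure_xmit_ticks(uint64_t dma_addr, uint16_t len)
      {
          unsigned idx = tx_tail;             /* descriptor about to be posted */
          uint64_t t0 = rdtsc();
          xmit(dma_addr, len);                /* post + doorbell, as above */
          /* DD (bit 0 of the status byte) is set by the NIC when it reports
           * transmit done; read it with a volatile access while spinning. */
          while (!(*(volatile uint8_t *)&tx_ring[idx].status & 1))
              ;                               /* busy-wait, interrupts disabled */
          return rdtsc() - t0;
      }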
- transmit delay for n sequential packets:
- 1: 23.5 usec 5k ticks (9.8 usec/pkt)
- 2: 34.5 usec 5k ticks (7.2 usec/pkt)
- 10: 136.5 usec 5k ticks (5.6usec/pkt)
- => DMA engine startup latency? could account for ~30% of the RTT overhead
- NICs don't seem optimised for low latency when otherwise idle
- Lots of room for improvement if hardware designers cared about low latency?
...