- Ideal RTT of 1 us.
- Measured from when the client sends the request until it receives the reply from the server.
- High Throughput - 1 million requests/sec/node
- Low Cost? (Are we willing to pay more for lower latency/higher bandwidth)
Baseline performance (using commodity Gigabit NICs):
- Simple ping: 300 us RTT
- Had a lot of variance: min: 250 us; max: 500 us.
- Because of how the card was generating interrupts
- Memcached disables Nagle's Algorithm.
- RTT with TCP: around 400 us for a 10-byte payload
- RTT with UDP: 200 us for a 10-byte payload (a minimal user-level RTT measurement sketch follows this list)
- RTT of 200 us within a rack and 200,000 req/sec/server. They use UDP (payload size unclear)
- RTT of 400-500 us across the data center
- These numbers had high variance.
- It takes around 15 us for a packet to bubble up through the kernel stack to user level.
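A minimal sketch of how such an RTT number can be measured from user level, assuming a UDP echo service is already running on the server (the address and port below are placeholders). For the TCP variant, memcached-style, one would additionally set TCP_NODELAY with setsockopt() to disable Nagle's algorithm.

```c
/* udp_rtt.c - rough RTT microbenchmark against a UDP echo server.
 * Server address and port are hypothetical; adjust to your setup. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define ITERS   1000
#define PAYLOAD 10          /* 10-byte payload, as in the numbers above */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port   = htons(7) };    /* echo port */
    inet_pton(AF_INET, "192.168.1.2", &srv.sin_addr);       /* hypothetical server */

    char buf[PAYLOAD] = "ping12345";
    struct timespec t0, t1;
    double total_us = 0;

    for (int i = 0; i < ITERS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        sendto(fd, buf, PAYLOAD, 0, (struct sockaddr *)&srv, sizeof(srv));
        recvfrom(fd, buf, PAYLOAD, 0, NULL, NULL);           /* blocks for the echo */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }
    printf("mean RTT: %.1f us over %d iterations\n", total_us / ITERS, ITERS);
    close(fd);
    return 0;
}
```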
What causes such high latency?
- Batched processing by the NIC:
- Solution: Tune the NIC to respond immediately (see the coalescing sketch after this list).
- Disable interrupt coalescing.
- Reduce transmit buffer delays (have the NIC process a packet as soon as it is placed in the buffer).
- This results in higher CPU usage, but that may be acceptable if we can dedicate one CPU core to the NIC.
- Protocol Overhead
- TCP/IP has a lot of processing overhead.
- Solution: UDP. Even better: use a proprietary protocol with low overhead.
- Kernel network stack
- Bypass the kernel completely.
- Expose the driver interface to user level
- Vastly reduces overhead and the number of copies.
- Instead of bypassing the kernel, we could also implement our code in the kernel itself.
- Intermediate Copies:
- Solution: Implement a zero copy mechanism for processing packets.
- CPU Scheduling/Preemption
- Speed of light
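For the "tune the NIC to respond immediately" point, here is a sketch of turning off interrupt coalescing from a program on Linux via the ethtool ioctl (equivalent to running `ethtool -C eth0 rx-usecs 0 tx-usecs 0` from a shell). The interface name is an assumption, root/CAP_NET_ADMIN is required, and not every driver honors every field.

```c
/* coalesce_off.c - disable RX/TX interrupt coalescing on "eth0" so the
 * NIC signals each packet immediately (at the cost of more interrupts). */
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);        /* any socket works for SIOCETHTOOL */
    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);    /* assumed interface name */
    ifr.ifr_data = (char *)&ec;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {         /* read current settings */
        perror("ETHTOOL_GCOALESCE");
        return 1;
    }

    ec.cmd = ETHTOOL_SCOALESCE;
    ec.rx_coalesce_usecs = 0;                       /* interrupt per received packet */
    ec.rx_max_coalesced_frames = 1;
    ec.tx_coalesce_usecs = 0;                       /* don't hold completed transmits */
    ec.tx_max_coalesced_frames = 1;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_SCOALESCE");
        return 1;
    }
    close(fd);
    return 0;
}
```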
- RTT of 26 us for a simple ping client/server with 10 byte payload. 38 us for 100 bytes.
- How is this achieved?
- OS Bypass - exposed NIC functions directly to user level program
- Proprietary protocol
- Polling instead of interrupts: continually poll the NIC rather than having it generate interrupts (see the polling sketch after this list).
- Eliminate all copies on the server side
- Process the packet while it's still in the ring buffer.
- This might need a large ring buffer, which could itself add latency.
- Solution: multiple server threads processing in parallel.
- Needs a locking mechanism -> might increase overhead?
- Using the GAMMA code as the base
- RTT may be improved with some more NIC tuning
- Claimed latency of 12-13 us with this mechanism.
- Maybe use a doorbell register of some sort to reduce transmit latency further?
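A rough user-level illustration of the "poll instead of interrupt / dedicate a core" idea above. GAMMA polls the NIC ring directly; this sketch only pins a thread to one core and spins on a non-blocking socket, so it captures the trade-off (one burned core, no interrupt or scheduler wait) while still paying the kernel-stack cost. Port number and core id are arbitrary choices.

```c
/* poll_rx.c - spin on a non-blocking UDP socket from a dedicated core. */
#define _GNU_SOURCE
#include <errno.h>
#include <netinet/in.h>
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    /* Pin this thread to core 1 so nothing else preempts the polling loop. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    sched_setaffinity(0, sizeof(set), &set);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(11111),
                                .sin_addr   = { htonl(INADDR_ANY) } };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[1500];
    for (;;) {
        struct sockaddr_in src;
        socklen_t len = sizeof(src);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), MSG_DONTWAIT,
                             (struct sockaddr *)&src, &len);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                continue;                   /* nothing yet: keep spinning */
            perror("recvfrom");
            break;
        }
        /* Handle the request in place and send the reply immediately. */
        sendto(fd, buf, n, 0, (struct sockaddr *)&src, len);
    }
    return 0;
}
```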
- HiStar results (all using 100-byte frames [incl. Ethernet header, minus CRC], kernel mode, interrupts disabled, no network stack, timed with the TSC; see the timing sketch after these results):
- All numbers had very low variance - shared, lightly loaded 10/100/1000 copper switch
- Intel e1000 gigabit nic (i82573E)
- unclear if running in 100 Mbit or 1000 Mbit mode - our switch lies, but the PHY claimed gigabit.
- 36usec RTT kernel-to-kernel ping, polled, no interrupts
- => user-mode drivers may have little overhead
- transmit delay for 1 packet (time from putting on xmit ring to nic claiming xmit done):
- 'xmit done' is ill-defined; the docs seem to imply the time to move the buffer into the transmit FIFO (as we configured the NIC)
- IRQ assertion: 25-26usec
- ring descriptor update: 23.5usec
- => ring buffer update to IRQ assertion delay is ~2-3usec
- transmit delay for n sequential packets:
- 1: 23.5k ticks (9.8 usec/pkt)
- 2: 34.5k ticks (7.2 usec/pkt)
- 10: 136.5k ticks (5.6usec/pkt)
- => DMA engine startup latency? Could account for ~30% of the RTT overhead
- NICs don't seem optimized for low latency when idle
- Lots of room for improvement if hardware designers had low latency concerns?
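The "timed with TSC" measurements above amount to reading the cycle counter around each operation; the quoted tick counts are consistent with roughly a 2.4 GHz TSC (23.5k ticks / 9.8 us ≈ 2.4 GHz). A minimal sketch (x86 only; assumes a constant-rate TSC and a known CPU frequency, which is an assumption here):

```c
/* tsc.c - time an operation with the x86 timestamp counter.
 * CPU_HZ is an assumption; for the numbers above it would be ~2.4 GHz. */
#include <stdint.h>
#include <stdio.h>

#define CPU_HZ 2.4e9

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = rdtsc();
    /* ... operation being measured, e.g. post a packet and poll for completion ... */
    uint64_t end = rdtsc();

    uint64_t ticks = end - start;
    printf("%llu ticks = %.2f usec\n",
           (unsigned long long)ticks, ticks / CPU_HZ * 1e6);
    return 0;
}
```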
- Fulcrum Low latency switches:
- 200 ns for L2 switch
- 300 ns for switches with ACLs (L2+)
- 400 ns for IP Switches
- Fujitsu XG Series: < 400 ns
- Arista: 600 ns
How to improve latency further:
Use HPC Based Communication Methods:
- Based on the MPI paradigm (a ping-pong latency sketch follows this list)
- Very low latency (1 us quoted, ~3 us in practice)
- Uses highly efficient RDMA
- Low-loss cables
- Costly - $550 per single-port NIC
- But being replaced by 10GigE - comparable cost and performance, more versatility.
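The 1-3 us figures for HPC interconnects are typically quoted from a ping-pong microbenchmark like the sketch below (standard MPI in C; the payload size mirrors the small-request case, and the underlying NIC/fabric is whatever MPI was built against):

```c
/* pingpong.c - classic MPI ping-pong latency test between ranks 0 and 1.
 * One-way latency is reported as half the measured round trip.
 * Run with: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

#define ITERS   1000
#define PAYLOAD 10

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[PAYLOAD] = {0};
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, PAYLOAD, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, PAYLOAD, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, PAYLOAD, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, PAYLOAD, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0)
        printf("one-way latency: %.2f us\n", elapsed / ITERS / 2 * 1e6);

    MPI_Finalize();
    return 0;
}
```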
Use 10 GigE NICs:
- Claimed latency of 8.9 us for 200-byte packets (RTT ≈ 2 × 8.9 ≈ 18 us)
- Have TCP Offload Engines
- Another advantage: reliable, low-loss cabling, so faster encoding techniques can be used, resulting in lower latency.
- However, this is still too high for us.
- Solution: use a more efficient protocol with an MPI-like interface.
Myrinet Express over Ethernet (for 10 GigE NICs):
- Myrinet's protocol implemented to work over Ethernet
- Uses kernel bypass & RDMA
- Latency of 2.63 us (RTT of 5 us)
- Leverages the fact that CX4 cables are low loss and low overhead
- No really fast implementation exists yet, but since the protocol is open, building one should be possible.
- Uses normal ethernet switches
- Lower CPU utilization than TCP/IP
Ethernet RDMA/iWarp on 10GigE NICs:
- Bypasses the kernel completely and places data directly into the other host's memory.
- Its design is not yet as efficient as Infiniband's, so it's a little slower.
- Gaining traction, and is being refined by Intel and others
- Currently: RTT << 20 us
- We may be able to use this primitive to implement the client library (a one-sided read sketch follows this list):
- The client can do a get/put directly into server memory.
- Security/access control?
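A hedged sketch of what a client-side "get" as a one-sided operation could look like with the verbs API used by iWARP/Infiniband NICs. It assumes connection setup, memory registration, and the exchange of the server's remote address and rkey have already happened; those pieces are omitted. Note that the rkey is effectively the only access-control token here, which is why the security question above matters.

```c
/* rdma_get.c (fragment) - post a one-sided RDMA read that pulls 'len'
 * bytes of the server's memory into a locally registered buffer,
 * bypassing the server's CPU. qp/local_mr/remote_addr/rkey come from a
 * connection-setup phase not shown here. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_get(struct ibv_qp *qp, struct ibv_mr *local_mr,
             uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_mr->addr;   /* where the data lands locally */
    sge.length = len;
    sge.lkey   = local_mr->lkey;

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_READ;         /* a "get" from server memory */
    wr.send_flags = IBV_SEND_SIGNALED;        /* completion shows up on the CQ */
    wr.wr.rdma.remote_addr = remote_addr;     /* server-side address of the object */
    wr.wr.rdma.rkey        = rkey;            /* capability the server handed out */

    /* The caller then polls the completion queue (ibv_poll_cq) for the result. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```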
Other related work:
- Open-MX: an implementation of MXoE - similar to GAMMA. 20 us latency (40 us RTT) (current)
- U-Net - an OS-bypass mechanism which achieves < 60 us RTT (1996)
- Virtual Interface Architecture - Latency of < 40 us, when implemented in silicon (2002)
- Active Messaging
- The client's message names code (a handler) to be executed on the server when it arrives (a small dispatch sketch follows this list).
- No modern implementation, RTT of 50 us in 1995
- Sort of what we are doing now...
- UVM - Modifications to the VM system to support sharing data between kernel and user
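For the active-messaging item, the usual structure is that each message carries a handler identifier (an index into a table both sides agree on) rather than arbitrary code; the server runs the named handler directly on the incoming payload. A tiny hypothetical dispatch sketch (all names here are illustrative):

```c
/* am_dispatch.c (fragment) - hypothetical active-message dispatch:
 * the header carries a handler id, and the receiver runs the matching
 * function on the payload, ideally while it is still in the ring buffer. */
#include <stddef.h>
#include <stdint.h>

struct am_header {
    uint16_t handler_id;     /* which handler to run on arrival */
    uint16_t payload_len;
};

typedef void (*am_handler)(const void *payload, size_t len);

static void handle_get(const void *payload, size_t len) { (void)payload; (void)len; /* ... */ }
static void handle_put(const void *payload, size_t len) { (void)payload; (void)len; /* ... */ }

static am_handler handlers[] = { handle_get, handle_put };

void am_dispatch(const struct am_header *hdr, const void *payload)
{
    if (hdr->handler_id < sizeof(handlers) / sizeof(handlers[0]))
        handlers[hdr->handler_id](payload, hdr->payload_len);
}
```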
What's the best we can do?
- MXoE over 10GigE - 5 us RTT
- Best combination of commodity and performance
- On a many core machine, this gives us our required throughput
- Highest performance, but at what cost?
- Dying anyway
- Implement our software as part of the hypervisor
- Low overhead
- Can be run on all available machines easily, takes advantage of all available DRAM
- TCP/IP over 10GigE - 18 us RTT
- In case we use flash anyway, would this be OK?
- Is a goal of 1 us practical?
- We are fundamentally limited by the speed of light: light travels only about 300 m in 1 us (less in fiber or copper), and the cable run between two servers in a mega data center may well exceed 300 m.
- Note: For writes, we have to commit the data to other servers as well before returning to the client. Hence, we may have to do several RPCs to service one write.
- CX4: $600 for a 10G Intel dual-port adapter - but can implement MXoE - for very low latency
- Compared to just $30 for a gigabit adapter
- CX4 enabled Fujitsu switch: 20 ports - $11000 at $550 / port
- Arista: $500 / port switches, but no CX4
- Total cost for NICs/switches in a 1000-machine cloud (using 10GigE technology): roughly $1.5M, depending on topology (on the order of $0.6M for 1000 NICs at ~$600 each, plus $0.5-0.6M for 1000 edge switch ports at $500-550/port, plus whatever extra ports the inter-switch topology needs)
Protocol Design Questions:
- Must use a very simple protocol so the server can process requests quickly.
- Just get/set? (a hypothetical wire-format sketch follows this list)
- Or should we support more complex operations?
- Depends on node architecture - what sort of processing power we have on the servers
- Linked to client/server work split
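If we do go with just get/set, the request format can be as simple as the layout below, so the server can parse a request with a couple of loads and no stream reassembly. This is only an illustrative sketch; all field names and sizes are hypothetical, not something the notes commit to.

```c
/* wire.h (sketch) - a minimal fixed header for a hypothetical get/set protocol. */
#include <stdint.h>

enum op { OP_GET = 0, OP_SET = 1 };

struct request_header {
    uint8_t  op;          /* OP_GET or OP_SET */
    uint8_t  flags;       /* reserved */
    uint16_t key_len;     /* bytes of key following this header */
    uint32_t value_len;   /* bytes of value following the key (0 for GET) */
    uint64_t request_id;  /* echoed in the reply so the client can match it */
} __attribute__((packed));

struct reply_header {
    uint8_t  status;      /* 0 = ok, nonzero = error code */
    uint8_t  flags;
    uint16_t reserved;
    uint32_t value_len;   /* bytes of value following this header */
    uint64_t request_id;
} __attribute__((packed));
```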
Overall Design Questions:
- Do we need 10 GigE? Can we make do with Gigabit ethernet?
- Even if we don't need the latency, we might need its bandwidth, given our design for durability and backup
- Can we get away with using TCP/IP given that these cards have TCP Offload Engines?
- What latency is acceptable, given that a hard drive access has latency on the order of milliseconds?
- How much are we willing to pay for such low latency?
- Must avoid operating system overhead:
- Run RAMCloud as part of the kernel?
- "Use the cores, Luke": dedicate one core to managing the network, don't take interrupts?
- What is the right network protocol?
- TCP flow control and retry don't seem appropriate for operation within a datacenter.
- Some data on switch latency from Brandon Heller:
The datasheet quotes 200 ns for the L2-only FM2000 and 300 ns with ACLs enabled for the FM3000. Arista quotes 600 ns delay regardless of packet size for their 24-port switches and 1200 ns for their 48-port version, which uses an internal fat tree of 6 24-port FocalPoint chips (so 3 300 ns hops are req'd).
The PHY can also add quite a bit of delay; supposedly 10GBase-T transceivers, due to the block encode/decode delay, add 1 us per link (Wikipedia). Fiber, CX4, and twinax should be much lower-latency, since they eschew the fancy coding techniques for lower-error cabling. This is something I'd like to measure with the Triumph box coming soon.