Low latency RPCs
Goals:
Ideal RTT of 1 us.
Measured from when the client sends the request until it receives the reply from the server.
High Throughput - 1 million requests/sec/node
Low Cost? (Are we willing to pay more for lower latency/higher bandwidth)
Baseline performance (using commodity Gigabit NICs):
Simple ping: 300 us RTT
Had a lot of variance: min: 250 us; max: 500 us.
Because of how the card was generating interrupts
memcached:
Memcached disables Nagle's Algorithm.
RTT with TCP: Around 400 us for 10 byte payload
RTT with UDP: 200 us for 10 byte payload
Facebook:
RTT of 200 us in a rack and 200000 req/sec/server. They use UDP (Not sure about payload size)
RTT of 400-500 us across the data center
These numbers had high variance.
Takes around 15 us for the packet to bubble up through the kernel stack and into the user level.
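A minimal sketch of how RTT numbers like the ones above could be measured from user level (hypothetical UDP echo probe; the server address, port, and 10-byte payload are assumptions, not the actual benchmark setup):

    /* udp_rtt.c: hypothetical UDP echo RTT probe.
     * Sends a small payload to an echo server and times the round trip with gettimeofday. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = htons(7);                          /* assumed echo port */
        inet_pton(AF_INET, "192.168.1.2", &srv.sin_addr); /* assumed server address */

        char buf[10] = "ping";                            /* 10-byte payload, as in the numbers above */
        struct timeval start, end;

        gettimeofday(&start, NULL);
        sendto(s, buf, sizeof(buf), 0, (struct sockaddr *)&srv, sizeof(srv));
        recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);     /* blocks until the echo comes back */
        gettimeofday(&end, NULL);

        long rtt_us = (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec);
        printf("RTT: %ld us\n", rtt_us);
        return 0;
    }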
What causes such high latency?
Batched processing by the NIC:
Solution: Tune the NIC to respond immediately.
Disable Interrupt Coalescing (a sketch of doing this via the ethtool ioctl appears after this list).
Reduce tx buffer delays (make the NIC process a packet as soon as it is put into the buffer)
This results in higher CPU usage, but that may be alright if we can dedicate one CPU core to the NIC
Protocol Overhead
TCP/IP has a lot of processing overhead.
Solution: UDP. Even better: use a proprietary protocol with low overhead
Kernel network stack
Bypass the kernel completely.
Expose the driver interface to user level
Results in vastly reduced overhead and reduces the number of copies.
Instead of bypassing the kernel, we can also implement our code in the kernel itself
Intermediate Copies:
Solution: Implement a zero copy mechanism for processing packets.
CPU Scheduling/Preempting
Speed of light
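A sketch of the interrupt-coalescing tuning mentioned above, using the Linux ethtool ioctl (equivalent to `ethtool -C eth0 rx-usecs 0 tx-usecs 0`; the interface name "eth0" is an assumption):

    /* coalesce_off.c: sketch of disabling interrupt coalescing on Linux. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct ifreq ifr;
        struct ethtool_coalesce ec;

        memset(&ifr, 0, sizeof(ifr));
        memset(&ec, 0, sizeof(ec));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed interface name */
        ifr.ifr_data = (char *)&ec;

        ec.cmd = ETHTOOL_GCOALESCE;                    /* read current coalescing settings */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GCOALESCE"); return 1; }

        ec.rx_coalesce_usecs = 0;                      /* interrupt as soon as a packet arrives */
        ec.tx_coalesce_usecs = 0;                      /* ...and as soon as a packet is sent */
        ec.cmd = ETHTOOL_SCOALESCE;
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SCOALESCE"); return 1; }

        printf("coalescing disabled on eth0\n");
        return 0;
    }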
Current performance:
RTT of 26 us for a simple ping client/server with 10 byte payload. 38 us for 100 bytes.
How is this achieved?
OS Bypass - exposed NIC functions directly to user level program
Proprietary protocol
Polling, instead of interrupts: Continually poll the NIC instead of having it generate interrupts (see the polling sketch after this list)
Eliminate all copies on the server side
Process the packet while it's still in the ring buffer.
This might need a large ring buffer, which might result in increased latency.
Solution: Multiple server threads processing in parallel.
Need locking mechanism -> Might increase overhead?
Using the GAMMA code as the base
RTT may be improved with some more NIC tuning
Claimed latency of 12-13 us with this mechanism.
Maybe use a doorbell register of some sort to reduce transmit latency further?
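A sketch of the OS-bypass polling loop described above; the ring layout and descriptor fields are hypothetical placeholders, not a real driver interface (a real setup would mmap the NIC's descriptor ring and packet buffers into the server process):

    /* poll_rx.c: sketch of a polling, zero-copy receive path. */
    #include <stdint.h>

    #define RING_SIZE 256

    struct rx_desc {                 /* hypothetical descriptor format */
        volatile uint32_t done;      /* set by the NIC when the buffer holds a packet */
        uint32_t length;
        uint8_t  *buf;               /* packet data, DMA'd here by the NIC */
    };

    extern struct rx_desc rx_ring[RING_SIZE];   /* assumed to be mapped into user space */
    extern void handle_request(uint8_t *pkt, uint32_t len);  /* RPC dispatch, processes in place */

    void poll_loop(void)
    {
        uint32_t head = 0;
        for (;;) {                               /* spin on one dedicated core, no interrupts */
            struct rx_desc *d = &rx_ring[head];
            if (!d->done)
                continue;                        /* nothing new yet; keep polling */
            handle_request(d->buf, d->length);   /* zero copy: request handled in the ring buffer */
            d->done = 0;                         /* hand the descriptor back to the NIC */
            head = (head + 1) % RING_SIZE;
        }
    }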
HiStar results (all using 100 byte frames [inc. ethernet header, minus CRC], kernel mode, interrupts disabled, no network stack, timed with TSC - see the timing sketch after these numbers):
All numbers had very low variance - shared, lightly loaded 10/100/1000 copper switch
Intel e1000 gigabit nic (i82573E)
unclear if running 100mbit or 1000mbit mode - our switch lies, but phy claimed gigabit.
36usec RTT kernel-to-kernel ping, polled, no interrupts
=> user-mode drivers may have little overhead
transmit delay for 1 packet (time from putting on xmit ring to nic claiming xmit done):
'xmit done' ill defined; docs seem to imply time to move buffer into xmit FIFO (as we configured the NIC)
IRQ assertion: 25-26usec
ring descriptor update: 23.5usec
=> ring buffer update to IRQ assertion delay is ~2-3usec
transmit delay for n sequential packets:
1: 23.5k ticks (9.8 usec/pkt)
2: 34.5k ticks (7.2 usec/pkt)
10: 136.5k ticks (5.6 usec/pkt)
=> DMA engine latency in startup? could account for 30% of RTT overhead
NICs don't seem optimised for low latency when idle
Lots of room for improvement if hardware designers had low latency concerns?
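A sketch of TSC-based timing like that used for the numbers above (the 23.5k ticks = 9.8 usec figure implies a clock of roughly 2.4 GHz, which is the frequency assumed here for the tick-to-microsecond conversion):

    /* tsc.c: sketch of cycle-accurate timing with the x86 time stamp counter. */
    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t start = rdtsc();
        /* ... put a packet on the transmit ring, spin until the NIC reports completion ... */
        uint64_t ticks = rdtsc() - start;
        printf("%llu ticks (%.1f us at an assumed 2.4 GHz)\n",
               (unsigned long long)ticks, ticks / 2400.0);
        return 0;
    }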
Switch Latency
Fulcrum Low latency switches:
200 ns for L2 switch
300 ns for switches with ACLs (L2+)
400 ns for IP Switches
Fujitsu XG Series: < 400 ns
Arista: 600 ns
How to improve latency further:
Use HPC Based Communication Methods:
Based on MPI paradigm
Infiniband/Myrinet
Infiniband:
Very low latency (1 us quoted, 3 us in reality)
Uses highly efficient RDMA
Low loss cables
Costly - $550 per single port NIC
But being replaced by 10GigE - comparable cost and performance, more versatility.
Use 10 GigE NICs:
Claimed latency of 8.9 us for 200 byte packets (RTT = 2 * 8.9 ≈ 18 us)
Have TCP Offload Engines
Another advantage: Uses reliable cabling. Hence, faster encoding techniques can be used, resulting in lower latency.
However, this is still too high for us
Solution: Use a more efficient protocol, MPI like interface
Myrinet Express over Ethernet (for 10 GigE NICs):
Myrinet's protocol implemented to work over Ethernet
Uses kernel bypass & RDMA
Latency of 2.63 us (RTT of 5 us)
Leverages the fact that CX4 cables are low loss and low overhead
Cons:
No really fast implementation exists. But since the protocol is open, it should be possible.
Pros:
Uses normal ethernet switches
Lower CPU utilization than TCP/IP
Ethernet RDMA/iWarp on 10GigE NICs:
Bypass kernel completely, and place data into the memory of the other host directly.
Its design is not yet as efficient as Infiniband's, so it's a little slower
Gaining traction, and is being refined by Intel and others
Currently: RTT << 20 us
We may be able to use this primitive to implement the client library:
Client can do a get/put into server memory (see the RDMA write sketch below)
Security/Access Control?
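A sketch of the one-sided "put into server memory" idea using the libibverbs API (works over iWarp or Infiniband). Queue-pair setup, memory registration, and the exchange of the server's remote address and rkey are omitted and assumed to have happened already:

    /* rdma_put.c: one-sided RDMA write; the server CPU is not involved. */
    #include <stddef.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Place `len` bytes from local_buf directly into the server's registered
     * memory at raddr.  qp, lkey, raddr, and rkey come from connection setup. */
    int rdma_put(struct ibv_qp *qp, void *local_buf, uint32_t len, uint32_t lkey,
                 uint64_t raddr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = len,
            .lkey   = lkey,
        };
        struct ibv_send_wr wr = {0}, *bad = NULL;

        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion we can poll for */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = raddr;               /* learned from the server at setup time */
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad);          /* 0 on success */
    }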
Other related work:
Open-MX: an implementation of MXoE - similar to GAMMA. 20 us latency (40 us RTT) (current)
UNet - an OS Bypass mechanism which achieves < 60 us RTT (1996)
Virtual Interface Architecture - Latency of < 40 us, when implemented in silicon (2002)
Active Messaging
Client sends code to be executed on the server.
No modern implementation, RTT of 50 us in 1995
Sort of what we are doing now...
UVM - Modifications to the VM system to support sharing data between kernel and user
What's the best we can do?
MXoE over 10GigE - 5 us RTT
Best combination of commodity and performance
On a many core machine, this gives us our required throughput
Infiniband:
Highest performance, but at what cost?
Dying anyway
Implement our software as part of the hypervisor
Low overhead
Can be run on all available machines easily, takes advantage of all available DRAM
TCP/IP over 10GigE - 18 us RTT
In case we use flash anyway, would this be OK?
Is a goal of 1 us practical?
We are fundamentally limited by the speed of light - light travels only 300m in 1 us (signals in fiber or copper propagate at roughly 2/3 of that), and the length of a wire between 2 servers in a mega data center may be more than 300m.
Note: For writes, we have to commit the data to other servers as well before returning to the client. Hence, we may have to do several RPCs to service one write.
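Rough illustration with assumed numbers: at a 5 us RTT, a write replicated to two backups sequentially before replying costs about 3 round trips, i.e. on the order of 15 us; issuing the two backup RPCs in parallel brings that back toward ~10 us.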
System Cost:
NICs:
CX4: $600 for a 10G Intel dual-port adapter - but can implement MXoE - for very low latency
Compared to just $30 for a gigabit adapter
Switches:
CX4 enabled Fujitsu switch: 20 ports - $11000 at $550 / port
Arista: $500 / port switches, but no CX4
Total Cost for NICs/switches in a 1000 machine cloud (using 10GigE technology): ~ $1.5 m, depending on topology
Protocol Design Questions:
Must use a very simple protocol so the server can process requests quickly.
Just get/set? (a sketch of a minimal get/set header appears after this list)
Or should we support more complex operations?
Depends on node architecture - what sort of processing power we have on the servers
Linked to client/server work split
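A sketch of what "just get/set" could look like on the wire; the field names and sizes are illustrative assumptions, not a settled design:

    /* rpc_wire.h: hypothetical wire format for a minimal get/set protocol.
     * A fixed-size header keeps server-side parsing to a few loads and a branch. */
    #include <stdint.h>

    enum rpc_op { RPC_GET = 0, RPC_SET = 1 };

    struct rpc_request {             /* 16-byte header; sizes are illustrative */
        uint8_t  op;                 /* RPC_GET or RPC_SET */
        uint8_t  reserved;
        uint16_t key_len;            /* bytes of key following this header */
        uint32_t value_len;          /* bytes of value after the key (SET only) */
        uint64_t request_id;         /* lets the client match replies to requests over UDP */
        /* key bytes, then value bytes, follow */
    };

    struct rpc_reply {
        uint8_t  status;             /* 0 = OK */
        uint8_t  reserved;
        uint16_t pad;
        uint32_t value_len;          /* bytes of value following (GET only) */
        uint64_t request_id;         /* echoed from the request */
        /* value bytes follow */
    };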
Overall Design Questions:
Do we need 10 GigE? Can we make do with Gigabit ethernet?
Even if we don't need the latency, we might need its bandwidth, given our design for durability and backup
Can we get away with using TCP/IP given that these cards have TCP Offload Engines?
What latency is acceptable, given that a hard drive access has latency in the order of milliseconds?
How much are we willing to pay for such low latency?
----------
Must avoid operating system overhead:
Run RAMCloud as part of the kernel?
"Use the cores, Luke": dedicate one core to managing the network, don't take interrupts?
What is the right network protocol?
TCP flow control and retry don't seem appropriate for operation within a datacenter.
Some data on switch latency from Brandon Heller: