- Ideal RTT of 1 us.
- Measured from when the client sends the request until it receives the reply from the server.
- High Throughput - 1 million requests/sec/node
- Low Cost? (Are we willing to pay more for lower latency/higher bandwidth)
Baseline performance (using commodity Gigabit NICs):
- Simple ping: 300 us RTT
- Had a lot of variance: min: 250 us; max: 500 us.
- Because of how the card was generating interrupts
- Memcached disables Nagle's Algorithm.
- RTT with TCP: around 400 us for a 10-byte payload
- RTT with UDP: 200 us for a 10-byte payload (a minimal user-level RTT measurement sketch follows this list)
- RTT of 200 us within a rack and 200,000 req/sec/server. They use UDP (payload size unclear)
- RTT of 400-500 us across the data center
- These numbers had high variance.
- It takes around 15 us for a packet to bubble up through the kernel stack to user level.
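A minimal sketch of how such an RTT number can be measured from user level, assuming a UDP echo service is already running on the server (the address and port below are placeholders). For the TCP variant, memcached-style, one would additionally set TCP_NODELAY with setsockopt() to disable Nagle's algorithm.

```c
/* udp_rtt.c - rough RTT microbenchmark against a UDP echo server.
 * Server address and port are hypothetical; adjust to your setup. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define ITERS   1000
#define PAYLOAD 10          /* 10-byte payload, as in the numbers above */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port   = htons(7) };    /* echo port */
    inet_pton(AF_INET, "192.168.1.2", &srv.sin_addr);       /* hypothetical server */

    char buf[PAYLOAD] = "ping12345";
    struct timespec t0, t1;
    double total_us = 0;

    for (int i = 0; i < ITERS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        sendto(fd, buf, PAYLOAD, 0, (struct sockaddr *)&srv, sizeof(srv));
        recvfrom(fd, buf, PAYLOAD, 0, NULL, NULL);           /* blocks for the echo */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }
    printf("mean RTT: %.1f us over %d iterations\n", total_us / ITERS, ITERS);
    close(fd);
    return 0;
}
```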
What causes such high latency?
- Batched processing by the NIC:
- Solution: Tune the NIC to respond immediately (see the coalescing sketch after this list).
- Disable interrupt coalescing.
- Reduce transmit buffer delays (have the NIC process a packet as soon as it is placed in the buffer).
- This results in higher CPU usage, but that may be acceptable if we can dedicate one CPU core to the NIC.
- Protocol Overhead
- TCP/IP has a lot of processing overhead.
- Solution: UDP. Even better: use a proprietary protocol with low overhead.
- Kernel network stack
- Bypass the kernel completely.
- Expose the driver interface to user level
- Vastly reduces overhead and the number of copies.
- Instead of bypassing the kernel, we could also implement our code in the kernel itself.
- Intermediate Copies:
- Solution: Implement a zero copy mechanism for processing packets.
- CPU Scheduling/Preemption
- Speed of light
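For the "tune the NIC to respond immediately" point, here is a sketch of turning off interrupt coalescing from a program on Linux via the ethtool ioctl (equivalent to running `ethtool -C eth0 rx-usecs 0 tx-usecs 0` from a shell). The interface name is an assumption, root/CAP_NET_ADMIN is required, and not every driver honors every field.

```c
/* coalesce_off.c - disable RX/TX interrupt coalescing on "eth0" so the
 * NIC signals each packet immediately (at the cost of more interrupts). */
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);        /* any socket works for SIOCETHTOOL */
    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);    /* assumed interface name */
    ifr.ifr_data = (char *)&ec;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {         /* read current settings */
        perror("ETHTOOL_GCOALESCE");
        return 1;
    }

    ec.cmd = ETHTOOL_SCOALESCE;
    ec.rx_coalesce_usecs = 0;                       /* interrupt per received packet */
    ec.rx_max_coalesced_frames = 1;
    ec.tx_coalesce_usecs = 0;                       /* don't hold completed transmits */
    ec.tx_max_coalesced_frames = 1;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_SCOALESCE");
        return 1;
    }
    close(fd);
    return 0;
}
```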
- RTT of 26 us for a simple ping client/server with 10 byte payload. 38 us for 100 bytes.
- How is this achieved?
- OS Bypass - exposed NIC functions directly to user level program
- Proprietary protocol
- Polling instead of interrupts: continually poll the NIC rather than having it generate interrupts (see the polling sketch after this list).
- Eliminate all copies on the server side
- Process the packet while it's still in the ring buffer.
- This might need a large ring buffer, which could itself add latency.
- Solution: multiple server threads processing in parallel.
- Needs a locking mechanism -> might increase overhead?
- Using the GAMMA code as the base
- RTT may be improved with some more NIC tuning
- Claimed latency of 12-13 us with this mechanism.
- Maybe use a doorbell register of some sort to reduce transmit latency further?
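A rough user-level illustration of the "poll instead of interrupt / dedicate a core" idea above. GAMMA polls the NIC ring directly; this sketch only pins a thread to one core and spins on a non-blocking socket, so it captures the trade-off (one burned core, no interrupt or scheduler wait) while still paying the kernel-stack cost. Port number and core id are arbitrary choices.

```c
/* poll_rx.c - spin on a non-blocking UDP socket from a dedicated core. */
#define _GNU_SOURCE
#include <errno.h>
#include <netinet/in.h>
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    /* Pin this thread to core 1 so nothing else preempts the polling loop. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    sched_setaffinity(0, sizeof(set), &set);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(11111),
                                .sin_addr   = { htonl(INADDR_ANY) } };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[1500];
    for (;;) {
        struct sockaddr_in src;
        socklen_t len = sizeof(src);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), MSG_DONTWAIT,
                             (struct sockaddr *)&src, &len);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                continue;                   /* nothing yet: keep spinning */
            perror("recvfrom");
            break;
        }
        /* Handle the request in place and send the reply immediately. */
        sendto(fd, buf, n, 0, (struct sockaddr *)&src, len);
    }
    return 0;
}
```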
- HiStar results (all using 100-byte frames [incl. Ethernet header, minus CRC], kernel mode, interrupts disabled, no network stack, timed with the TSC; see the timing sketch after these results):
- All numbers had very low variance - shared, lightly loaded 10/100/1000 copper switch
- Intel e1000 gigabit nic (i82573E)
- unclear if running in 100 Mbit or 1000 Mbit mode - our switch lies, but the PHY claimed gigabit.
- 36usec RTT kernel-to-kernel ping, polled, no interrupts
- => user-mode drivers may have little overhead
- transmit delay for 1 packet (time from putting on xmit ring to nic claiming xmit done):
- 'xmit done' is ill-defined; the docs seem to imply the time to move the buffer into the transmit FIFO (as we configured the NIC)
- IRQ assertion: 25-26usec
- ring descriptor update: 23.5usec
- => ring buffer update to IRQ assertion delay is ~2-3usec
- transmit delay for n sequential packets:
- 1: 23.5k ticks (9.8 usec/pkt)
- 2: 34.5k ticks (7.2 usec/pkt)
- 10: 136.5k ticks (5.6usec/pkt)
- => DMA engine startup latency? Could account for ~30% of the RTT overhead
- NICs don't seem optimized for low latency when idle
- Lots of room for improvement if hardware designers had low latency concerns?
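The "timed with TSC" measurements above amount to reading the cycle counter around each operation; the quoted tick counts are consistent with roughly a 2.4 GHz TSC (23.5k ticks / 9.8 us ≈ 2.4 GHz). A minimal sketch (x86 only; assumes a constant-rate TSC and a known CPU frequency, which is an assumption here):

```c
/* tsc.c - time an operation with the x86 timestamp counter.
 * CPU_HZ is an assumption; for the numbers above it would be ~2.4 GHz. */
#include <stdint.h>
#include <stdio.h>

#define CPU_HZ 2.4e9

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = rdtsc();
    /* ... operation being measured, e.g. post a packet and poll for completion ... */
    uint64_t end = rdtsc();

    uint64_t ticks = end - start;
    printf("%llu ticks = %.2f usec\n",
           (unsigned long long)ticks, ticks / CPU_HZ * 1e6);
    return 0;
}
```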
- Fulcrum Low latency switches:
- 200 ns for L2 switch
- 300 ns for switches with ACLs (L2+)
- 400 ns for IP Switches
- Fujitsu XG Series: < 400 ns
- Arista: 600 ns
How to improve latency further:
Use HPC Based Communication Methods:
- Based on the MPI paradigm (a ping-pong latency sketch follows this list)
- Very low latency (1 us quoted, ~3 us in practice)
- Uses highly efficient RDMA
- Low-loss cables
- Costly - $550 per single-port NIC
- But being replaced by 10GigE - comparable cost and performance, more versatility.
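The 1-3 us figures for HPC interconnects are typically quoted from a ping-pong microbenchmark like the sketch below (standard MPI in C; the payload size mirrors the small-request case, and the underlying NIC/fabric is whatever MPI was built against):

```c
/* pingpong.c - classic MPI ping-pong latency test between ranks 0 and 1.
 * One-way latency is reported as half the measured round trip.
 * Run with: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

#define ITERS   1000
#define PAYLOAD 10

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[PAYLOAD] = {0};
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, PAYLOAD, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, PAYLOAD, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, PAYLOAD, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, PAYLOAD, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0)
        printf("one-way latency: %.2f us\n", elapsed / ITERS / 2 * 1e6);

    MPI_Finalize();
    return 0;
}
```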
Use 10 GigE NICs:
- Claimed latency of 8.9 us for 200-byte packets (RTT ≈ 2 × 8.9 ≈ 18 us)
- Have TCP Offload Engines
- Another advantage: reliable, low-loss cabling, so faster encoding techniques can be used, resulting in lower latency.
- However, this is still too high for us.
- Solution: use a more efficient protocol with an MPI-like interface.
Myrinet Express over Ethernet (for 10 GigE NICs):
- Myrinet's protocol implemented to work over Ethernet
- Uses kernel bypass & RDMA
- Latency of 2.63 us (RTT of 5 us)
- Leverages the fact that CX4 cables are low loss and low overhead
- No really fast implementation exists yet, but since the protocol is open, building one should be possible.
- Uses normal ethernet switches
- Lower CPU utilization than TCP/IP
Ethernet RDMA/iWarp on 10GigE NICs:
- Bypasses the kernel completely and places data directly into the other host's memory.
- Its design is not yet as efficient as Infiniband's, so it's a little slower.
- Gaining traction, and is being refined by Intel and others
- Currently: RTT << 20 us
- We may be able to use this primitive to implement the client library (a one-sided read sketch follows this list):
- The client can do a get/put directly into server memory.
- Security/access control?
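A hedged sketch of what a client-side "get" as a one-sided operation could look like with the verbs API used by iWARP/Infiniband NICs. It assumes connection setup, memory registration, and the exchange of the server's remote address and rkey have already happened; those pieces are omitted. Note that the rkey is effectively the only access-control token here, which is why the security question above matters.

```c
/* rdma_get.c (fragment) - post a one-sided RDMA read that pulls 'len'
 * bytes of the server's memory into a locally registered buffer,
 * bypassing the server's CPU. qp/local_mr/remote_addr/rkey come from a
 * connection-setup phase not shown here. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_get(struct ibv_qp *qp, struct ibv_mr *local_mr,
             uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_mr->addr;   /* where the data lands locally */
    sge.length = len;
    sge.lkey   = local_mr->lkey;

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_READ;         /* a "get" from server memory */
    wr.send_flags = IBV_SEND_SIGNALED;        /* completion shows up on the CQ */
    wr.wr.rdma.remote_addr = remote_addr;     /* server-side address of the object */
    wr.wr.rdma.rkey        = rkey;            /* capability the server handed out */

    /* The caller then polls the completion queue (ibv_poll_cq) for the result. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```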
Other related work:
- Open-MX: an implementation of MXoE - similar to GAMMA. 20 us latency (40 us RTT) (current)
- U-Net - an OS-bypass mechanism which achieves < 60 us RTT (1996)
- Virtual Interface Architecture - Latency of < 40 us, when implemented in silicon (2002)
- Active Messaging
- The client's message names code (a handler) to be executed on the server when it arrives (a small dispatch sketch follows this list).
- No modern implementation, RTT of 50 us in 1995
- Sort of what we are doing now...
- UVM - Modifications to the VM system to support sharing data between kernel and user
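For the active-messaging item, the usual structure is that each message carries a handler identifier (an index into a table both sides agree on) rather than arbitrary code; the server runs the named handler directly on the incoming payload. A tiny hypothetical dispatch sketch (all names here are illustrative):

```c
/* am_dispatch.c (fragment) - hypothetical active-message dispatch:
 * the header carries a handler id, and the receiver runs the matching
 * function on the payload, ideally while it is still in the ring buffer. */
#include <stddef.h>
#include <stdint.h>

struct am_header {
    uint16_t handler_id;     /* which handler to run on arrival */
    uint16_t payload_len;
};

typedef void (*am_handler)(const void *payload, size_t len);

static void handle_get(const void *payload, size_t len) { (void)payload; (void)len; /* ... */ }
static void handle_put(const void *payload, size_t len) { (void)payload; (void)len; /* ... */ }

static am_handler handlers[] = { handle_get, handle_put };

void am_dispatch(const struct am_header *hdr, const void *payload)
{
    if (hdr->handler_id < sizeof(handlers) / sizeof(handlers[0]))
        handlers[hdr->handler_id](payload, hdr->payload_len);
}
```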
What's the best we can do?
- MXoE over 10GigE - 5 us RTT
- Best combination of commodity and performance
- On a many core machine, this gives us our required throughput
- Highest performance, but at what cost?
- Dying anyway
- Implement our software as part of the hypervisor
- Low overhead
- Can be run on all available machines easily, takes advantage of all available DRAM
- TCP/IP over 10GigE - 18 us RTT
- In case we use flash anyway, would this be OK?
- Is a goal of 1 us practical?
- We are fundamentally limited by the speed of light: light travels only about 300 m in 1 us (less in fiber or copper), and the cable run between two servers in a mega data center may well exceed 300 m.
- Note: For writes, we have to commit the data to other servers as well before returning to the client. Hence, we may have to do several RPCs to service one write.
- CX4: $600 for a 10G Intel dual-port adapter - but can implement MXoE - for very low latency
- Compared to just $30 for a gigabit adapter
- CX4 enabled Fujitsu switch: 20 ports - $11000 at $550 / port
- Arista: $500 / port switches, but no CX4
- Total cost for NICs/switches in a 1000-machine cloud (using 10GigE technology): roughly $1.5M, depending on topology (on the order of $0.6M for 1000 NICs at ~$600 each, plus $0.5-0.6M for 1000 edge switch ports at $500-550/port, plus whatever extra ports the inter-switch topology needs)
Protocol Design Questions:
- Must use a very simple protocol so the server can process requests quickly.
- Just get/set? (a hypothetical wire-format sketch follows this list)
- Or should we support more complex operations?
- Depends on node architecture - what sort of processing power we have on the servers
- Linked to client/server work split
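If we do go with just get/set, the request format can be as simple as the layout below, so the server can parse a request with a couple of loads and no stream reassembly. This is only an illustrative sketch; all field names and sizes are hypothetical, not something the notes commit to.

```c
/* wire.h (sketch) - a minimal fixed header for a hypothetical get/set protocol. */
#include <stdint.h>

enum op { OP_GET = 0, OP_SET = 1 };

struct request_header {
    uint8_t  op;          /* OP_GET or OP_SET */
    uint8_t  flags;       /* reserved */
    uint16_t key_len;     /* bytes of key following this header */
    uint32_t value_len;   /* bytes of value following the key (0 for GET) */
    uint64_t request_id;  /* echoed in the reply so the client can match it */
} __attribute__((packed));

struct reply_header {
    uint8_t  status;      /* 0 = ok, nonzero = error code */
    uint8_t  flags;
    uint16_t reserved;
    uint32_t value_len;   /* bytes of value following this header */
    uint64_t request_id;
} __attribute__((packed));
```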
Overall Design Questions:
- Do we need 10 GigE? Can we make do with Gigabit ethernet?
- Even if we don't need the latency, we might need its bandwidth, given our design for durability and backup
- Can we get away with using TCP/IP given that these cards have TCP Offload Engines?
- What latency is acceptable, given that a hard drive access has latency on the order of milliseconds?
- How much are we willing to pay for such low latency?
- Must avoid operating system overhead:
- Run RAMCloud as part of the kernel?
- "Use the cores, Luke": dedicate one core to managing the network, don't take interrupts?
- What is the right network protocol?
- TCP flow control and retry don't seem appropriate for operation within a datacenter.
- Some data on switch latency from Brandon Heller:
The datasheet quotes 200 ns for the L2-only FM2000 and 300 ns with ACLs enabled for the FM3000. Arista quotes 600 ns delay regardless of packet size for their 24-port switches and 1200 ns for their 48-port version, which uses an internal fat tree of 6 24-port FocalPoint chips (so 3 300 ns hops are req'd).
The PHY can also add quite a bit of delay; supposedly 10GBase-T transceivers, due to the block encode/decode delay, add 1 us per link (Wikipedia). Fiber, CX4, and twinax should be much lower-latency, since they eschew the fancy coding techniques for lower-error cabling. This is something I'd like to measure with the Triumph box coming soon.