Goals:

  • Ideal RTT of 1 us.
    • Measured from the moment the client sends the request until it receives the reply from the server (see the measurement sketch below this list).
  • High Throughput - 1 million requests/sec/node
  • Low Cost? (Are we willing to pay more for lower latency/higher bandwidth)
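
As a baseline, the RTT above can be measured with a plain UDP ping: timestamp just before the request is sent, timestamp just after the reply arrives. A minimal sketch in C, assuming a UDP echo server is already listening at the placeholder address and port below; error handling is omitted.

    /* rtt_ping.c - measure request/reply RTT over UDP (sketch).
     * Assumes an echo server at SERVER_IP:SERVER_PORT (placeholders).
     * Error handling omitted for brevity. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    #define SERVER_IP   "192.168.1.2"   /* placeholder */
    #define SERVER_PORT 9000            /* placeholder */

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in srv = { .sin_family = AF_INET,
                                   .sin_port   = htons(SERVER_PORT) };
        inet_pton(AF_INET, SERVER_IP, &srv.sin_addr);

        char req[10] = "ping", rsp[100];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);           /* request leaves here */
        sendto(s, req, sizeof(req), 0, (struct sockaddr *)&srv, sizeof(srv));
        recvfrom(s, rsp, sizeof(rsp), 0, NULL, NULL);  /* block until reply */
        clock_gettime(CLOCK_MONOTONIC, &t1);           /* reply received */

        double rtt_us = (t1.tv_sec - t0.tv_sec) * 1e6
                      + (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("RTT: %.1f us\n", rtt_us);

        close(s);
        return 0;
    }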

...

  • Batched processing by the NIC:
    • Solution: Tune the NIC to respond immediately (a sketch using the Linux ethtool ioctl follows this list).
    • Disable Interrupt Coalescing.
    • Reduce tx buffer delays (make the buffer process a packet as soon as it is placed into the buffer).
    • This results in higher CPU usage, but that may be acceptable if we can dedicate one CPU core to the NIC.
  • Protocol Overhead
    • TCP/IP has a lot of processing overhead.
    • Solution: UDP. Even better: use a proprietary protocol with low overhead.
  • Kernel network stack
    • Bypass the kernel completely.
    • Expose the driver interface to user level
    • Results in vastly reduced overhead and reduces the number of copies.
    • Instead of bypassing the kernel, we could also implement our code in the kernel itself.
  • Intermediate Copies:
    • Solution: Implement a zero copy mechanism for processing packets.
  • CPU Scheduling/Preempting
  • Speed of light (smile)
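
The NIC tuning mentioned above (disabling interrupt coalescing, flushing tx completions immediately) can be done on Linux either with "ethtool -C eth0 rx-usecs 0 tx-usecs 0" or programmatically through the ETHTOOL_SCOALESCE ioctl. A minimal sketch, assuming the interface name "eth0"; whether a zero delay is actually honored depends on the driver/NIC.

    /* coalesce_off.c - turn off NIC interrupt coalescing on Linux (sketch).
     * Roughly equivalent to: ethtool -C eth0 rx-usecs 0 tx-usecs 0
     * "eth0" is an assumed interface name; requires root. */
    #include <linux/ethtool.h>
    #include <linux/sockios.h>
    #include <net/if.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct ethtool_coalesce ec;
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&ec;

        /* Read the current settings, then zero out the delay knobs. */
        memset(&ec, 0, sizeof(ec));
        ec.cmd = ETHTOOL_GCOALESCE;
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GCOALESCE"); return 1; }

        ec.cmd = ETHTOOL_SCOALESCE;
        ec.rx_coalesce_usecs = 0;   /* interrupt as soon as a packet arrives */
        ec.tx_coalesce_usecs = 0;   /* report tx completions immediately */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SCOALESCE"); return 1; }

        close(fd);
        return 0;
    }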

...

  • RTT of 26 us for a simple ping client/server with 10 byte payload. 38 us for 100 bytes.
  • How is this achieved?
    • OS Bypass - exposed NIC functions directly to user level program
    • Proprietary protocol
    • Polling, instead of interrupts: Continually poll the NIC instead of having it generate interrupts
    • Eliminate all copies on the server side
      • Process the packet while it is still in the ring buffer (see the sketch after this list).
      • This may require a large ring buffer, which could itself increase latency.
      • Solution: Multiple server threads processing in parallel.
      • Needs a locking mechanism -> might increase overhead?
    • Using the GAMMA code as the base
  • RTT may be improved with some more NIC tuning
    • Claimed latency of 12-13 us with this mechanism.
    • Maybe use a doorbell register of some sort to reduce transmit latency further?
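
A hypothetical sketch of the polling plus in-ring processing described above. The descriptor layout, ring size, and handle_request() are invented for illustration; they do not correspond to any particular NIC or to the GAMMA code, and a real implementation would also need memory barriers and NIC-specific descriptor handling.

    /* Hypothetical busy-poll receive loop: the server spins on the NIC's RX
     * descriptor ring and processes each packet in place (no interrupt, no
     * copy). All names and the descriptor format are illustrative only. */
    #include <stddef.h>
    #include <stdint.h>

    #define RING_SIZE 1024

    struct rx_desc {                  /* assumed descriptor format */
        volatile uint32_t ready;      /* set by the NIC when the slot is full */
        uint32_t          len;        /* payload length in bytes */
        uint8_t           data[2048]; /* payload, written by DMA */
    };

    static struct rx_desc rx_ring[RING_SIZE];   /* stand-in for the mapped ring */
    static uint64_t requests_seen;

    static void handle_request(const uint8_t *payload, uint32_t len)
    {
        (void)payload; (void)len;
        requests_seen++;              /* real request processing goes here */
    }

    void poll_loop(void)
    {
        size_t next = 0;
        for (;;) {                              /* spins on one dedicated core */
            struct rx_desc *d = &rx_ring[next];
            if (!d->ready)
                continue;                       /* poll: no interrupts, no sleep */
            handle_request(d->data, d->len);    /* packet stays in the ring */
            d->ready = 0;                       /* return the slot to the NIC */
            next = (next + 1) % RING_SIZE;
        }
    }
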
Switch Latency

...

Use HPC-Based Communication Methods:
  • Based on the MPI paradigm
  • InfiniBand/Myrinet
    • InfiniBand:
      • Very low latency (1 us quoted, 3 us in reality)
      • Uses highly efficient RDMA (see the polling sketch after this list)
      • Low loss cables
      • Costly - $550 per single-port NIC
    • But it is being replaced by 10GigE - comparable cost and performance, with more versatility.
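
For comparison, the low InfiniBand latencies quoted above generally assume the application busy-polls the completion queue in user space instead of sleeping on events. A minimal libibverbs sketch; the lengthy queue pair and completion queue setup is assumed to exist already, so only the polling step is shown.

    /* Busy-poll an InfiniBand completion queue with libibverbs (sketch).
     * Assumes a completion queue `cq` from an already-connected queue pair;
     * all setup code is omitted. Link with -libverbs. */
    #include <infiniband/verbs.h>

    int wait_for_completion(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        int n;

        do {
            n = ibv_poll_cq(cq, 1, &wc);   /* non-blocking, no kernel call */
        } while (n == 0);                  /* spin until a work request completes */

        if (n < 0 || wc.status != IBV_WC_SUCCESS)
            return -1;                     /* poll error or failed completion */
        return 0;
    }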

...

  • Claimed latency of 8.9 us for 200 byte packets (RTT = 2 * 8.9 ≈ 18 us)
  • Have TCP Offload Engines
  • Another advantage: Uses reliable cabling. Hence, faster encoding techniques can be used, resulting in lower latency.
  • However, this is still too high for us.
    • Solution: Use a more efficient protocol with an MPI-like interface.
Myrinet Express over Ethernet (for 10 GigE NICs):
  • Myrinet's protocol implemented to work over Ethernet
  • Uses kernel bypass & RDMA
  • Latency of 2.63 us (RTT of ~5 us)
  • Leverages the fact that CX4 cables are low loss and low overhead
  • Cons:
    • No really fast implementation exists yet, but since the protocol is open, it should be possible to build one.
  • Pros:
    • Uses normal Ethernet switches
    • Lower CPU utilization than TCP/IP

...

  • MXoE over 10GigE - 5 us RTT
    • Best combination of commodity and performance
    • On a many-core machine, this gives us our required throughput
  • Infiniband:
    • Highest performance, but at what cost?
    • Dying anyway
  • Implement our software as part of the hypervisor?
    • Low overhead
    • Can be run on all available machines easily, takes advantage of all available DRAM
  • TCP/IP over 10GigE - 18 us RTT
    • In case we use flash anyway, would this be OK?
  • Is a goal of 1 us practical?
    • We are fundamentally limited by the speed of light - light travels only about 300 m in 1 us (and signals in fiber or copper propagate at roughly two-thirds of that speed), while the wire between 2 servers in a mega data center may be longer than 300 m.
    • Note: For writes, we have to commit the data to other servers as well before returning to the client. Hence, we may have to do several RPCs to service one write.

...