Goals:

  • Ideal RTT of 1 us.
    • Measured from the moment the client sends the request until it receives the reply from the server (see the measurement sketch below this list).
  • High Throughput - 1 million requests/sec/node
  • Low Cost? (Are we willing to pay more for lower latency/higher bandwidth)
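
As a baseline, the RTT above can be measured with a plain UDP ping: timestamp just before the request is sent, timestamp just after the reply arrives. A minimal sketch in C, assuming a UDP echo server is already listening at the placeholder address and port below; error handling is omitted.

    /* rtt_ping.c - measure request/reply RTT over UDP (sketch).
     * Assumes an echo server at SERVER_IP:SERVER_PORT (placeholders).
     * Error handling omitted for brevity. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    #define SERVER_IP   "192.168.1.2"   /* placeholder */
    #define SERVER_PORT 9000            /* placeholder */

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in srv = { .sin_family = AF_INET,
                                   .sin_port   = htons(SERVER_PORT) };
        inet_pton(AF_INET, SERVER_IP, &srv.sin_addr);

        char req[10] = "ping", rsp[100];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);           /* request leaves here */
        sendto(s, req, sizeof(req), 0, (struct sockaddr *)&srv, sizeof(srv));
        recvfrom(s, rsp, sizeof(rsp), 0, NULL, NULL);  /* block until reply */
        clock_gettime(CLOCK_MONOTONIC, &t1);           /* reply received */

        double rtt_us = (t1.tv_sec - t0.tv_sec) * 1e6
                      + (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("RTT: %.1f us\n", rtt_us);

        close(s);
        return 0;
    }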

...

  • Batched processing by the NIC:
    • Solution: Tune the NIC to respond immediately (a sketch using the Linux ethtool ioctl follows this list).
    • Disable Interrupt Coalescing.
    • Reduce tx buffer delays (make the buffer process a packet as soon as it is placed into the buffer).
    • This results in higher CPU usage, but that may be acceptable if we can dedicate one CPU core to the NIC.
  • Protocol Overhead
    • TCP/IP has a lot of processing overhead.
    • Solution: UDP. Even better: use a proprietary protocol with low overhead.
  • Kernel network stack
    • Bypass the kernel completely.
    • Expose the driver interface to user level
    • Results in vastly reduced overhead and reduces the number of copies.
    • Instead of bypassing the kernel, we could also implement our code in the kernel itself.
  • Intermediate Copies:
    • Solution: Implement a zero copy mechanism for processing packets.
  • CPU Scheduling/Preempting
  • Speed of light (smile)
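
The NIC tuning mentioned above (disabling interrupt coalescing, flushing tx completions immediately) can be done on Linux either with "ethtool -C eth0 rx-usecs 0 tx-usecs 0" or programmatically through the ETHTOOL_SCOALESCE ioctl. A minimal sketch, assuming the interface name "eth0"; whether a zero delay is actually honored depends on the driver/NIC.

    /* coalesce_off.c - turn off NIC interrupt coalescing on Linux (sketch).
     * Roughly equivalent to: ethtool -C eth0 rx-usecs 0 tx-usecs 0
     * "eth0" is an assumed interface name; requires root. */
    #include <linux/ethtool.h>
    #include <linux/sockios.h>
    #include <net/if.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct ethtool_coalesce ec;
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&ec;

        /* Read the current settings, then zero out the delay knobs. */
        memset(&ec, 0, sizeof(ec));
        ec.cmd = ETHTOOL_GCOALESCE;
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GCOALESCE"); return 1; }

        ec.cmd = ETHTOOL_SCOALESCE;
        ec.rx_coalesce_usecs = 0;   /* interrupt as soon as a packet arrives */
        ec.tx_coalesce_usecs = 0;   /* report tx completions immediately */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SCOALESCE"); return 1; }

        close(fd);
        return 0;
    }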

...

  • RTT of 26 us for a simple ping client/server with 10 byte payload. 38 us for 100 bytes.
  • How is this achieved?
    • OS Bypass - exposed NIC functions directly to user level program
    • Proprietary protocol
    • Polling, instead of interrupts: Continually poll the NIC instead of having it generate interrupts
    • Eliminate all copies on the server side
      • Process the packet while it is still in the ring buffer (see the sketch after this list).
      • This may require a large ring buffer, which could itself increase latency.
      • Solution: Multiple server threads processing in parallel.
      • Needs a locking mechanism -> might increase overhead?
    • Using the GAMMA code as the base
  • RTT may be improved with some more NIC tuning
    • Claimed latency of 12-13 us with this mechanism.
    • Maybe use a doorbell register of some sort to reduce transmit latency further?
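
A hypothetical sketch of the polling plus in-ring processing described above. The descriptor layout, ring size, and handle_request() are invented for illustration; they do not correspond to any particular NIC or to the GAMMA code, and a real implementation would also need memory barriers and NIC-specific descriptor handling.

    /* Hypothetical busy-poll receive loop: the server spins on the NIC's RX
     * descriptor ring and processes each packet in place (no interrupt, no
     * copy). All names and the descriptor format are illustrative only. */
    #include <stddef.h>
    #include <stdint.h>

    #define RING_SIZE 1024

    struct rx_desc {                  /* assumed descriptor format */
        volatile uint32_t ready;      /* set by the NIC when the slot is full */
        uint32_t          len;        /* payload length in bytes */
        uint8_t           data[2048]; /* payload, written by DMA */
    };

    static struct rx_desc rx_ring[RING_SIZE];   /* stand-in for the mapped ring */
    static uint64_t requests_seen;

    static void handle_request(const uint8_t *payload, uint32_t len)
    {
        (void)payload; (void)len;
        requests_seen++;              /* real request processing goes here */
    }

    void poll_loop(void)
    {
        size_t next = 0;
        for (;;) {                              /* spins on one dedicated core */
            struct rx_desc *d = &rx_ring[next];
            if (!d->ready)
                continue;                       /* poll: no interrupts, no sleep */
            handle_request(d->data, d->len);    /* packet stays in the ring */
            d->ready = 0;                       /* return the slot to the NIC */
            next = (next + 1) % RING_SIZE;
        }
    }
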
Switch Latency

...

Use HPC-Based Communication Methods:
  • Based on the MPI paradigm
  • InfiniBand/Myrinet
    • InfiniBand:
      • Very low latency (1 us quoted, 3 us in reality)
      • Uses highly efficient RDMA (see the polling sketch after this list)
      • Low loss cables
      • Costly - $550 per single-port NIC
    • But it is being replaced by 10GigE - comparable cost and performance, with more versatility.
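
For comparison, the low InfiniBand latencies quoted above generally assume the application busy-polls the completion queue in user space instead of sleeping on events. A minimal libibverbs sketch; the lengthy queue pair and completion queue setup is assumed to exist already, so only the polling step is shown.

    /* Busy-poll an InfiniBand completion queue with libibverbs (sketch).
     * Assumes a completion queue `cq` from an already-connected queue pair;
     * all setup code is omitted. Link with -libverbs. */
    #include <infiniband/verbs.h>

    int wait_for_completion(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        int n;

        do {
            n = ibv_poll_cq(cq, 1, &wc);   /* non-blocking, no kernel call */
        } while (n == 0);                  /* spin until a work request completes */

        if (n < 0 || wc.status != IBV_WC_SUCCESS)
            return -1;                     /* poll error or failed completion */
        return 0;
    }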

...

  • Claimed latency of 8.9 us for 200 byte packets (RTT = 2 * 8.9 ≈ 18 us)
  • Have TCP Offload Engines
  • Another advantage: Uses reliable cabling. Hence, faster encoding techniques can be used, resulting in lower latency.
  • However, this is still too high for us.
    • Solution: Use a more efficient protocol with an MPI-like interface.
Myrinet Express over Ethernet (for 10 GigE NICs):
  • Myrinet's protocol implemented to work over Ethernet
  • Uses kernel bypass & RDMA
  • Latency of 2.63 us (RTT of ~5 us)
  • Leverages the fact that CX4 cables are low loss and low overhead
  • Cons:
    • No really fast implementation exists yet, but since the protocol is open, it should be possible to build one.
  • Pros:
    • Uses normal Ethernet switches
    • Lower CPU utilization than TCP/IP

...

  • MXoE over 10GigE - 5 us RTT
    • Best combination of commodity and performance
    • On a many-core machine, this gives us our required throughput
  • Infiniband:
    • Highest performance, but at what cost?
    • Dying anyway
  • Implement our software as part of the hypervisor?
    • Low overhead
    • Can be run on all available machines easily, takes advantage of all available DRAM
  • TCP/IP over 10GigE - 18 us RTT
    • In case we use flash anyway, would this be OK?
  • Is a goal of 1 us practical?
    • We are fundamentally limited by the speed of light - light travels only about 300 m in 1 us (and signals in fiber or copper propagate at roughly two-thirds of that speed), while the wire between 2 servers in a mega data center may be longer than 300 m.
    • Note: For writes, we have to commit the data to other servers as well before returning to the client. Hence, we may have to do several RPCs to service one write.

...