...

  • memcached:
    • Memcached disables Nagle's Algorithm (see the TCP_NODELAY sketch after this list).
    • RTT with TCP: around 400 us for a 10-byte payload
    • RTT with UDP: 200 us for a 10-byte payload
    • Facebook:
      • RTT of 200 us within a rack at 200,000 req/sec/server; they use UDP (payload size unknown)
      • RTT of 400-500 us across the data center
    • These numbers had high variance.
  • It takes around 15 us for a packet to bubble up through the kernel stack into user space.
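
For reference, memcached's handling of Nagle's Algorithm is a single socket option. A minimal sketch, assuming fd is an already-connected TCP socket:

  /* Disable Nagle's Algorithm so small requests are sent immediately
   * instead of being held back waiting for an ACK, which can add tens to
   * hundreds of microseconds to RTTs like those measured above. */
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  int disable_nagle(int fd)
  {
      int one = 1;
      return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
  }
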
What causes such high latency?

...

Myrinet Express over Ethernet (for 10 GigE NICs):
  • Myrinet's protocol implemented to work over Ethernet
  • Uses kernel bypass & RDMA
  • Latency of 2.63 us (RTT of 5 us)
  • Leverages the fact that CX4 cables are low loss and low overhead
  • Cons:
    • No really fast implementation exists yet, but since the protocol is open, writing one should be possible.
  • Pros:
    • Uses normal ethernet switches
    • Lower CPU utilization than TCP/IP
Ethernet RDMA/iWarp on 10GigE NICs:
  • Bypass kernel completely, and place data into the memory of the other host directly.
  • Its design is not yet as efficient as InfiniBand's, so it is a little slower
  • Gaining traction, and is being refined by Intel and others
  • Currently: RTT << 20 us
  • We may be able to use this primitive to implement the client library (see the verbs sketch after this list):
    • Client can do a get/put into server memory
    • Security/Access Control?
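
A sketch of what the client-side "get from server memory" primitive could look like with the verbs API used by iWARP/InfiniBand RDMA. It assumes connection setup, local memory registration, and out-of-band exchange of the server's address and rkey have already happened; rdma_get is a name invented here, not an existing API.

  #include <infiniband/verbs.h>
  #include <stdint.h>
  #include <string.h>

  /* One-sided RDMA READ: pulls len bytes from the server's memory into the
   * client's registered buffer without involving the server's CPU. */
  int rdma_get(struct ibv_qp *qp, struct ibv_mr *mr,
               uint64_t remote_addr, uint32_t rkey, uint32_t len)
  {
      struct ibv_sge sge = {
          .addr   = (uintptr_t)mr->addr,   /* local destination buffer */
          .length = len,
          .lkey   = mr->lkey,
      };
      struct ibv_send_wr wr, *bad_wr = NULL;
      memset(&wr, 0, sizeof(wr));
      wr.opcode              = IBV_WR_RDMA_READ;
      wr.sg_list             = &sge;
      wr.num_sge             = 1;
      wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion entry */
      wr.wr.rdma.remote_addr = remote_addr;        /* where the object lives on the server */
      wr.wr.rdma.rkey        = rkey;               /* remote key granting access */
      return ibv_post_send(qp, &wr, &bad_wr);      /* completion is polled on the CQ elsewhere */
  }

A put would be the same structure with IBV_WR_RDMA_WRITE; the access-control question above is essentially a question about what the rkey mechanism does and does not protect.
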
Other related work:
  • Open-MX: an implementation of MXoE - similar to GAMMA. 20 us latency (40 us RTT) (current)
  • U-Net - an OS-bypass mechanism that achieves < 60 us RTT (1996)
  • Virtual Interface Architecture - Latency of < 40 us, when implemented in silicon (2002)
  • Active Messaging
    • Client sends code to be executed on the server.
    • No modern implementation, RTT of 50 us in 1995
    • Sort of what we are doing now...
  • UVM - Modifications to the VM system to support sharing data between kernel and user
What's the best we can do?
  • MXoE over 10GigE - 5 us RTT
    • Best combination of commodity and performance
    • On a many-core machine, this gives us our required throughput
  • Infiniband:
    • Highest performance, but at what cost?
    • Dying anyway
  • Implement our software as part of the hypervisor
    • Low overhead
    • Can be run on all available machines easily, takes advantage of all available DRAM
  • TCP/IP over 10GigE - 18 us RTT
    • If we end up using flash anyway, would this be OK?
  • Is a goal of 1 us practical?
    • We are fundamentally limited by the speed of light: light covers only about 300 m per microsecond, and the cable run between two servers in a mega data center may exceed 300 m (a back-of-envelope sketch follows this list).
    • Note: for writes, we also have to commit the data to other servers before returning to the client, so we may need several RPCs to service one write.
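
To make the speed-of-light argument concrete, a back-of-envelope calculation; the cable length and replica count below are illustrative assumptions, not measurements:

  #include <stdio.h>

  int main(void)
  {
      double cable_m       = 300.0;  /* assumed one-way cable run between two servers */
      double meters_per_us = 300.0;  /* speed of light: ~300 m per microsecond */
      int    replica_rpcs  = 2;      /* assumed extra RPCs to commit a write to backups */

      double rtt_us = 2.0 * cable_m / meters_per_us;
      printf("propagation-only RTT: %.1f us\n", rtt_us);              /* 2.0 us, already > 1 us */
      printf("write lower bound, replicas serialized: %.1f us\n",
             rtt_us * (1 + replica_rpcs));                            /* 6.0 us */
      return 0;
  }
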
System Cost:
  • NICs:
    • CX4: $600 for a 10G Intel dual-port adapter, which can run MXoE for very low latency
    • Compared to just $30 for a gigabit adapter
  • Switches:
    • CX4-enabled Fujitsu switch: 20 ports for $11,000, i.e., $550/port
    • Arista: $500 / port switches, but no CX4
  • Total cost for NICs and switches in a 1000-machine cloud (using 10GigE technology): ~$1.5M, depending on topology (a rough tally follows this list)
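
A rough tally using the prices quoted above; the topology factor (extra ports for uplinks and aggregation) is an assumption, which is why the total varies with topology:

  #include <stdio.h>

  int main(void)
  {
      int    machines        = 1000;
      double nic_cost        = 600.0;   /* quoted: CX4 dual-port 10G Intel adapter */
      double port_cost       = 550.0;   /* quoted: per-port price of the Fujitsu CX4 switch */
      double topology_factor = 1.6;     /* assumed extra ports for uplinks/aggregation */

      double nics     = machines * nic_cost;                       /* $600k */
      double switches = machines * port_cost * topology_factor;    /* ~$880k */
      printf("total: ~$%.2fM\n", (nics + switches) / 1e6);          /* ~$1.5M */
      return 0;
  }
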
Protocol Design Questions:
  • Must use a very simple protocol so the server can process requests quickly (a hypothetical header layout follows this list).
    • Just get/set?
    • Or should we support more complex operations?
  • Depends on node architecture - what sort of processing power we have on the servers
  • Linked to client/server work split
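
To make "just get/set" concrete, one hypothetical fixed-size header layout; none of these field names or sizes come from the notes, they only illustrate how little parsing the server would need to do:

  #include <stdint.h>

  enum rc_op { RC_GET = 1, RC_SET = 2 };

  /* Hypothetical 16-byte request header, followed by key bytes and
   * (for RC_SET) value bytes. A fixed layout keeps server-side parsing
   * to a couple of loads and a branch. */
  struct rc_request_hdr {
      uint8_t  op;          /* RC_GET or RC_SET */
      uint8_t  flags;       /* reserved */
      uint16_t key_len;     /* bytes of key following this header */
      uint32_t value_len;   /* bytes of value following the key (RC_SET only) */
      uint64_t request_id;  /* lets the client match replies to requests */
  };

  struct rc_reply_hdr {
      uint8_t  status;      /* 0 = OK, nonzero = error code */
      uint8_t  flags;       /* reserved */
      uint16_t reserved;
      uint32_t value_len;   /* bytes of value following (RC_GET only) */
      uint64_t request_id;  /* echoed from the request */
  };
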
Overall Design Questions:
  • Do we need 10 GigE? Can we make do with Gigabit ethernet?
    • Even if we don't need the latency, we might need its bandwidth, given our design for durability and backup
  • Can we get away with using TCP/IP given that these cards have TCP Offload Engines?
  • What latency is acceptable, given that a hard drive access has latency on the order of milliseconds?
  • How much are we willing to pay for such low latency?

----------

  • Must avoid operating system overhead:
    • Run RAMCloud as part of the kernel?
    • "Use the cores, Luke": dedicate one core to managing the network, don't take interrupts?
  • What is the right network protocol?
    • TCP flow control and retry don't seem appropriate for operation within a datacenter.
  • Some data on switch latency from Brandon Heller:

    The datasheet quotes 200ns for the L2-only FM2000, 300ns with ACLs enabled for the FM3000. Arista quotes 600ns delay regardless of packet size for their 24-port switches and 1200ns for their 48p version, which uses an internal fat tree of 6 24p FocalPoint chips (so 3 300ns hops are req'd).

    The PHY can also add quite a bit of delay; supposedly 10GBase-T transceivers, due to the block encode/decode delay, add 1us per link (Wikipedia). Fiber, CX4, and twinax should be much lower-latency, since they eschew the fancy coding techniques in favor of lower-error cabling. This is something I'd like to measure with the Triumph box coming soon.
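
A minimal sketch of the dedicated-network-core idea above, assuming Linux/glibc; poll_nic_once is a placeholder for whatever receive-path check the NIC driver or a kernel-bypass library actually exposes:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdbool.h>

  /* Placeholder: check the NIC's receive path once; return true if a
   * packet was handled. Stands in for a real driver/bypass-library call. */
  static bool poll_nic_once(void) { return false; }

  static void *net_core_main(void *arg)
  {
      (void)arg;
      for (;;) {
          /* Busy-poll instead of sleeping: the next packet is picked up
           * within nanoseconds rather than after an interrupt plus a
           * context switch, at the cost of burning one core. */
          poll_nic_once();
      }
      return NULL;
  }

  /* Start the polling thread and pin it to core_id so it never migrates. */
  int start_net_core(int core_id)
  {
      pthread_t tid;
      cpu_set_t cpus;
      CPU_ZERO(&cpus);
      CPU_SET(core_id, &cpus);

      if (pthread_create(&tid, NULL, net_core_main, NULL) != 0)
          return -1;
      return pthread_setaffinity_np(tid, sizeof(cpus), &cpus);
  }
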