...

  • memcached:
    • Memcached disables Nagle's Algorithm (see the TCP_NODELAY sketch after this list).
    • RTT with TCP: around 400 us for a 10-byte payload
    • RTT with UDP: 200 us for a 10-byte payload
    • Facebook:
      • RTT of 200 us within a rack at 200,000 req/sec/server; they use UDP (payload size unknown)
      • RTT of 400-500 us across the data center
    • These numbers had high variance.
  • It takes around 15 us for a packet to bubble up through the kernel stack into user space.
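
For reference, memcached's handling of Nagle's Algorithm is a single socket option. A minimal sketch, assuming fd is an already-connected TCP socket:

  /* Disable Nagle's Algorithm so small requests are sent immediately
   * instead of being held back waiting for an ACK, which can add tens to
   * hundreds of microseconds to RTTs like those measured above. */
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  int disable_nagle(int fd)
  {
      int one = 1;
      return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
  }
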
What causes such high latency?

...

Myrinet Express over Ethernet (for 10 GigE NICs):
  • Myrinet's protocol implemented to work over Ethernet
  • Uses kernel bypass & RDMA
  • Latency of 2.63 us (RTT of 5 us)
  • Leverages the fact that CX4 cables are low loss and low overhead
  • Cons:
    • No really fast implementation exists yet, but since the protocol is open, writing one should be possible.
  • Pros:
    • Uses normal ethernet switches
    • Lower CPU utilization than TCP/IP
Ethernet RDMA/iWarp on 10GigE NICs:
  • Bypass kernel completely, and place data into the memory of the other host directly.
  • Its design is not yet as efficient as InfiniBand's, so it is a little slower
  • Gaining traction, and is being refined by Intel and others
  • Currently: RTT << 20 us
  • We may be able to use this primitive to implement the client library (see the verbs sketch after this list):
    • Client can do a get/put into server memory
    • Security/Access Control?
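
A sketch of what the client-side "get from server memory" primitive could look like with the verbs API used by iWARP/InfiniBand RDMA. It assumes connection setup, local memory registration, and out-of-band exchange of the server's address and rkey have already happened; rdma_get is a name invented here, not an existing API.

  #include <infiniband/verbs.h>
  #include <stdint.h>
  #include <string.h>

  /* One-sided RDMA READ: pulls len bytes from the server's memory into the
   * client's registered buffer without involving the server's CPU. */
  int rdma_get(struct ibv_qp *qp, struct ibv_mr *mr,
               uint64_t remote_addr, uint32_t rkey, uint32_t len)
  {
      struct ibv_sge sge = {
          .addr   = (uintptr_t)mr->addr,   /* local destination buffer */
          .length = len,
          .lkey   = mr->lkey,
      };
      struct ibv_send_wr wr, *bad_wr = NULL;
      memset(&wr, 0, sizeof(wr));
      wr.opcode              = IBV_WR_RDMA_READ;
      wr.sg_list             = &sge;
      wr.num_sge             = 1;
      wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion entry */
      wr.wr.rdma.remote_addr = remote_addr;        /* where the object lives on the server */
      wr.wr.rdma.rkey        = rkey;               /* remote key granting access */
      return ibv_post_send(qp, &wr, &bad_wr);      /* completion is polled on the CQ elsewhere */
  }

A put would be the same structure with IBV_WR_RDMA_WRITE; the access-control question above is essentially a question about what the rkey mechanism does and does not protect.
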
Other related work:
  • Open-MX: an implementation of MXoE - similar to GAMMA. 20 us latency (40 us RTT) (current)
  • U-Net - an OS-bypass mechanism that achieves < 60 us RTT (1996)
  • Virtual Interface Architecture - Latency of < 40 us, when implemented in silicon (2002)
  • Active Messaging
    • Client sends code to be executed on the server.
    • No modern implementation, RTT of 50 us in 1995
    • Sort of what we are doing now...
  • UVM - Modifications to the VM system to support sharing data between kernel and user
What's the best we can do?
  • MXoE over 10GigE - 5 us RTT
    • Best combination of commodity and performance
    • On a many-core machine, this gives us our required throughput
  • Infiniband:
    • Highest performance, but at what cost?
    • Dying anyway
  • Implement our software as part of the hypervisor
    • Low overhead
    • Can be run on all available machines easily, takes advantage of all available DRAM
  • TCP/IP over 10GigE - 18 us RTT
    • If we end up using flash anyway, would this be OK?
  • Is a goal of 1 us practical?
    • We are fundamentally limited by the speed of light: light covers only about 300 m per microsecond, and the cable run between two servers in a mega data center may exceed 300 m (a back-of-envelope sketch follows this list).
    • Note: for writes, we also have to commit the data to other servers before returning to the client, so we may need several RPCs to service one write.
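
To make the speed-of-light argument concrete, a back-of-envelope calculation; the cable length and replica count below are illustrative assumptions, not measurements:

  #include <stdio.h>

  int main(void)
  {
      double cable_m       = 300.0;  /* assumed one-way cable run between two servers */
      double meters_per_us = 300.0;  /* speed of light: ~300 m per microsecond */
      int    replica_rpcs  = 2;      /* assumed extra RPCs to commit a write to backups */

      double rtt_us = 2.0 * cable_m / meters_per_us;
      printf("propagation-only RTT: %.1f us\n", rtt_us);              /* 2.0 us, already > 1 us */
      printf("write lower bound, replicas serialized: %.1f us\n",
             rtt_us * (1 + replica_rpcs));                            /* 6.0 us */
      return 0;
  }
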
System Cost:
  • NICs:
    • CX4: $600 for a 10G Intel dual-port adapter, which can run MXoE for very low latency
    • Compared to just $30 for a gigabit adapter
  • Switches:
    • CX4-enabled Fujitsu switch: 20 ports for $11,000, i.e., $550/port
    • Arista: $500 / port switches, but no CX4
  • Total cost for NICs and switches in a 1000-machine cloud (using 10GigE technology): ~$1.5M, depending on topology (a rough tally follows this list)
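
A rough tally using the prices quoted above; the topology factor (extra ports for uplinks and aggregation) is an assumption, which is why the total varies with topology:

  #include <stdio.h>

  int main(void)
  {
      int    machines        = 1000;
      double nic_cost        = 600.0;   /* quoted: CX4 dual-port 10G Intel adapter */
      double port_cost       = 550.0;   /* quoted: per-port price of the Fujitsu CX4 switch */
      double topology_factor = 1.6;     /* assumed extra ports for uplinks/aggregation */

      double nics     = machines * nic_cost;                       /* $600k */
      double switches = machines * port_cost * topology_factor;    /* ~$880k */
      printf("total: ~$%.2fM\n", (nics + switches) / 1e6);          /* ~$1.5M */
      return 0;
  }
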
Protocol Design Questions:
  • Must use a very simple protocol so the server can process requests quickly (a hypothetical header layout follows this list).
    • Just get/set?
    • Or should we support more complex operations?
  • Depends on node architecture - what sort of processing power we have on the servers
  • Linked to client/server work split
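
To make "just get/set" concrete, one hypothetical fixed-size header layout; none of these field names or sizes come from the notes, they only illustrate how little parsing the server would need to do:

  #include <stdint.h>

  enum rc_op { RC_GET = 1, RC_SET = 2 };

  /* Hypothetical 16-byte request header, followed by key bytes and
   * (for RC_SET) value bytes. A fixed layout keeps server-side parsing
   * to a couple of loads and a branch. */
  struct rc_request_hdr {
      uint8_t  op;          /* RC_GET or RC_SET */
      uint8_t  flags;       /* reserved */
      uint16_t key_len;     /* bytes of key following this header */
      uint32_t value_len;   /* bytes of value following the key (RC_SET only) */
      uint64_t request_id;  /* lets the client match replies to requests */
  };

  struct rc_reply_hdr {
      uint8_t  status;      /* 0 = OK, nonzero = error code */
      uint8_t  flags;       /* reserved */
      uint16_t reserved;
      uint32_t value_len;   /* bytes of value following (RC_GET only) */
      uint64_t request_id;  /* echoed from the request */
  };
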
Overall Design Questions:
  • Do we need 10 GigE? Can we make do with Gigabit ethernet?
    • Even if we don't need the latency, we might need its bandwidth, given our design for durability and backup
  • Can we get away with using TCP/IP given that these cards have TCP Offload Engines?
  • What latency is acceptable, given that a hard drive access has latency on the order of milliseconds?
  • How much are we willing to pay for such low latency?

----------

  • Must avoid operating system overhead:
    • Run RAMCloud as part of the kernel?
    • "Use the cores, Luke": dedicate one core to managing the network, don't take interrupts?
  • What is the right network protocol?
    • TCP flow control and retry don't seem appropriate for operation within a datacenter.
  • Some data on switch latency from Brandon Heller:

    The datasheet quotes 200ns for the L2-only FM2000, 300ns with ACLs enabled for the FM3000. Arista quotes 600ns delay regardless of packet size for their 24-port switches and 1200ns for their 48p version, which uses an internal fat tree of 6 24p FocalPoint chips (so 3 300ns hops are req'd).

    The PHY can also add quite a bit of delay; supposedly 10GBase-T transceivers, due to the block encode/decode delay, add 1us per link (Wikipedia). Fiber, CX4, and twinax should be much lower-latency, since they eschew the fancy coding techniques in favor of lower-error cabling. This is something I'd like to measure with the Triumph box coming soon.
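
A minimal sketch of the dedicated-network-core idea above, assuming Linux/glibc; poll_nic_once is a placeholder for whatever receive-path check the NIC driver or a kernel-bypass library actually exposes:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdbool.h>

  /* Placeholder: check the NIC's receive path once; return true if a
   * packet was handled. Stands in for a real driver/bypass-library call. */
  static bool poll_nic_once(void) { return false; }

  static void *net_core_main(void *arg)
  {
      (void)arg;
      for (;;) {
          /* Busy-poll instead of sleeping: the next packet is picked up
           * within nanoseconds rather than after an interrupt plus a
           * context switch, at the cost of burning one core. */
          poll_nic_once();
      }
      return NULL;
  }

  /* Start the polling thread and pin it to core_id so it never migrates. */
  int start_net_core(int core_id)
  {
      pthread_t tid;
      cpu_set_t cpus;
      CPU_ZERO(&cpus);
      CPU_SET(core_id, &cpus);

      if (pthread_create(&tid, NULL, net_core_main, NULL) != 0)
          return -1;
      return pthread_setaffinity_np(tid, sizeof(cpus), &cpus);
  }
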