...
- L2 switches are cheaper and simpler, but L2 doesn't scale by default
- Broadcast domains cannot get too large, else performance suffers
- switches have MAC forwarding tables, but are often <16k entries in size and we want to scale beyond that
- L3 switches are more expensive, but L3 scales more clearly
- Set up localized subnets to restrict broadcast domains, or use VLANs
- Route between subnets intelligently
- To be workable in fat-trees, L3 needs work:
    - Need custom/non-standard routing software/algorithms to keep routes updated for maximum bandwidth and balanced link utilization
- So long as we're hacking, why not see if L2 can be made to work?
- Ditch ARP, restrict broadcasts (bcasts can be specially handled out of the fast-path)
    - Use L2 source-routing and L2-in-L2 encapsulation - hosts transmit frames wrapped in an outer header whose destination MAC is that of the destination host's TOR switch
- => switches need only know MACs of other switches, not all hosts - overcomes 16k MAC entry table limit
- requires a directory service and host stack modifications
    - use VLB (Valiant Load Balancing) to maximise bandwidth
- advantage of a flat address space and simpler, lower-level protocol
- disadvantage of pushing changes into host protocol stacks (though IP is oblivious)
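A minimal sketch of the L2-in-L2 idea above: the host wraps its frame in an outer Ethernet header addressed to the destination's TOR switch, so core switches forward on switch MACs only. The ethertype value and helper names here are illustrative assumptions, not from any spec.

```python
import struct

# Placeholder: IEEE 802 local experimental ethertype, chosen for illustration
ETHERTYPE_L2_IN_L2 = 0x88B5

def mac(s: str) -> bytes:
    """Parse 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(b, 16) for b in s.split(":"))

def encapsulate(inner_frame: bytes, src_tor: str, dst_tor: str) -> bytes:
    """Wrap a host's Ethernet frame in an outer header addressed to the
    destination host's top-of-rack (TOR) switch. Core switches forward
    on the outer header, so their MAC tables need only switch entries,
    not one entry per host."""
    outer = mac(dst_tor) + mac(src_tor) + struct.pack("!H", ETHERTYPE_L2_IN_L2)
    return outer + inner_frame

def decapsulate(frame: bytes) -> bytes:
    """At the destination TOR: strip the 14-byte outer Ethernet header."""
    return frame[14:]
```

A directory service (as noted above) would supply the inner-MAC-to-TOR mapping that `encapsulate` takes as an argument here.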
...
- latency
- Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
- across 6 hops, that's > 3.6 usec
  - Woven Systems' 144-port 10GigE switches advertise 1.6usec port-to-port latency (nearly 3x Arista's minimum)
- => > 3.2usec in first two levels of hierarchy
- take away: sub-5usec is probably not currently possible
- how important is RDMA (iWARP) for our goals?
- if it necessitates 10GigE, it'd be costly
- unclear if custom (i.e. non-commodity) hardware required
    - Force-10 foresees iWARP done on NIC, like TOE (TCP Offload Engine)
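The per-hop latency arithmetic above can be sanity-checked with a one-liner; this counts only the advertised switching delay, ignoring propagation, NIC, and host-stack time, so real end-to-end numbers would be higher still.

```python
# Per-hop switching latencies quoted above, in nanoseconds
ARISTA_MIN_NS = 600    # Arista 48-port 10GigE, advertised minimum
WOVEN_NS = 1600        # Woven Systems 144-port 10GigE, port-to-port

def path_latency_us(per_hop_ns: float, hops: int) -> float:
    """Cumulative switch latency across a path, in usec. Switching
    delay only: propagation, NIC, and host-stack time are excluded."""
    return per_hop_ns * hops / 1000.0
```

Six Arista hops already give 3.6 usec, and two Woven hops give 3.2 usec, which is where the "sub-5usec probably not currently possible" takeaway comes from.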
- bandwidth
- 128 bytes / object * 1.0e6 objects/second = 122MBytes/sec (not including any packet overhead)
- this is gigabit range... 10GigE vs. GigE may be a significant question:
- Arista 48-port 10GigE's not commodity (~$20k/switch, vs. $2-3k/switch of commodity 1GigE)
- But what if we have much bigger, hot objects on a machine?
- Do we want to assume a single machine can always handle requests?
- e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
- Going beyond gigabit is still very costly
- ~25k cluster with full 10GigE bi-section bandwidth would be ~$57M for switches (~7x the cost of gigabit)
- if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports
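The bandwidth arithmetic above, as two small helpers (a sketch; MBytes here means 2^20 bytes, which is how 128e6 bytes/sec becomes the 122 figure, and packet overhead is excluded throughout):

```python
def payload_mbytes_per_sec(object_bytes: int, objects_per_sec: float) -> float:
    """Aggregate payload bandwidth in MBytes/sec (2^20 bytes per MByte),
    excluding all packet/protocol overhead."""
    return object_bytes * objects_per_sec / 2**20

def max_objects_per_sec(object_bytes: int, link_gbits: float = 1.0) -> float:
    """Upper bound on objects one link can serve per second,
    again counting payload only."""
    return link_gbits * 1e9 / 8 / object_bytes
```

E.g. `payload_mbytes_per_sec(128, 1.0e6)` gives ~122, and `max_objects_per_sec(10_000)` gives 12,500 on gigabit, matching the figures above.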
- load balancing
- RAMCloud is expected to deal with many small, well-distributed objects
- this may aid obtaining maximum bandwidth utilisation, since we can bounce packets randomly across potential paths
- no idea if random is even close to optimal
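The "bounce packets randomly" idea is essentially VLB's core move; a toy sketch (the per-flow hashing branch is an assumed alternative, shown only to contrast with per-packet randomness):

```python
import random
from collections import Counter

def pick_core(core_switches, flow_id=None):
    """VLB-style path choice: send each packet via a uniformly random
    intermediate (core) switch, which spreads any traffic pattern evenly
    across core links. Passing a flow_id instead pins a whole flow to
    one path via hashing - avoiding reordering, but risking hot links."""
    if flow_id is not None:
        return core_switches[hash(flow_id) % len(core_switches)]
    return random.choice(core_switches)
```

With per-packet randomness, 80,000 packets over 8 cores land within a few percent of 10,000 each; whether random is close to optimal for RAMCloud's workload is, as noted, an open question.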
...