...
- L2 switches are cheaper and simpler, but L2 doesn't scale by default
- Broadcast domains cannot get too large, else performance suffers
- switches have MAC forwarding tables, but are often <16k entries in size and we want to scale beyond that
- L3 switches are more expensive, but L3 scales more clearly
- Set up localized subnets to restrict broadcast domains, or use VLANs
- Route between subnets intelligently
- To be workable in fat-trees, L3 needs work:
    - Need custom/non-standard routing software/algorithms to keep routes updated for maximum bandwidth and balanced link utilization
- So long as we're hacking, why not see if L2 can be made to work?
- Ditch ARP, restrict broadcasts (bcasts can be specially handled out of the fast-path)
    - Use L2 source-routing and L2-in-L2 encapsulation - hosts transmit frames wrapped in an outer header whose destination MAC is that of the destination host's TOR switch
- => switches need only know MACs of other switches, not all hosts - overcomes 16k MAC entry table limit
- requires a directory service and host stack modifications
    - use VLB (Valiant Load Balancing) to maximise bandwidth
- advantage of a flat address space and simpler, lower-level protocol
- disadvantage of pushing changes into host protocol stacks (though IP is oblivious)
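A minimal sketch of the L2-in-L2 idea above: the host wraps its frame in an outer Ethernet header addressed to the destination's TOR switch, so core switches forward on switch MACs only. The ethertype value and helper names here are illustrative assumptions, not from any spec.

```python
import struct

# Placeholder: IEEE 802 local experimental ethertype, chosen for illustration
ETHERTYPE_L2_IN_L2 = 0x88B5

def mac(s: str) -> bytes:
    """Parse 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(b, 16) for b in s.split(":"))

def encapsulate(inner_frame: bytes, src_tor: str, dst_tor: str) -> bytes:
    """Wrap a host's Ethernet frame in an outer header addressed to the
    destination host's top-of-rack (TOR) switch. Core switches forward
    on the outer header, so their MAC tables need only switch entries,
    not one entry per host."""
    outer = mac(dst_tor) + mac(src_tor) + struct.pack("!H", ETHERTYPE_L2_IN_L2)
    return outer + inner_frame

def decapsulate(frame: bytes) -> bytes:
    """At the destination TOR: strip the 14-byte outer Ethernet header."""
    return frame[14:]
```

A directory service (as noted above) would supply the inner-MAC-to-TOR mapping that `encapsulate` takes as an argument here.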
...
- latency
- Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
- across 6 hops, that's > 3.6 usec
  - Woven Systems' 144-port 10GigE switches advertise 1.6usec port-to-port latency (nearly 3x Arista's minimum)
- => > 3.2usec in first two levels of hierarchy
- take away: sub-5usec is probably not currently possible
- how important is RDMA (iWARP) for our goals?
- if it necessitates 10GigE, it'd be costly
- unclear if custom (i.e. non-commodity) hardware required
    - Force-10 foresees iWARP done on NIC, like TOE (TCP Offload Engine)
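The per-hop latency arithmetic above can be sanity-checked with a one-liner; this counts only the advertised switching delay, ignoring propagation, NIC, and host-stack time, so real end-to-end numbers would be higher still.

```python
# Per-hop switching latencies quoted above, in nanoseconds
ARISTA_MIN_NS = 600    # Arista 48-port 10GigE, advertised minimum
WOVEN_NS = 1600        # Woven Systems 144-port 10GigE, port-to-port

def path_latency_us(per_hop_ns: float, hops: int) -> float:
    """Cumulative switch latency across a path, in usec. Switching
    delay only: propagation, NIC, and host-stack time are excluded."""
    return per_hop_ns * hops / 1000.0
```

Six Arista hops already give 3.6 usec, and two Woven hops give 3.2 usec, which is where the "sub-5usec probably not currently possible" takeaway comes from.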
- bandwidth
- 128 bytes / object * 1.0e6 objects/second = 122MBytes/sec (not including any packet overhead)
- this is gigabit range... 10GigE vs. GigE may be a significant question:
- Arista 48-port 10GigE's not commodity (~$20k/switch, vs. $2-3k/switch of commodity 1GigE)
- But what if we have much bigger, hot objects on a machine?
- Do we want to assume a single machine can always handle requests?
- e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
- Going beyond gigabit is still very costly
- ~25k cluster with full 10GigE bi-section bandwidth would be ~$57M for switches (~7x the cost of gigabit)
- if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports
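The bandwidth arithmetic above, as two small helpers (a sketch; MBytes here means 2^20 bytes, which is how 128e6 bytes/sec becomes the 122 figure, and packet overhead is excluded throughout):

```python
def payload_mbytes_per_sec(object_bytes: int, objects_per_sec: float) -> float:
    """Aggregate payload bandwidth in MBytes/sec (2^20 bytes per MByte),
    excluding all packet/protocol overhead."""
    return object_bytes * objects_per_sec / 2**20

def max_objects_per_sec(object_bytes: int, link_gbits: float = 1.0) -> float:
    """Upper bound on objects one link can serve per second,
    again counting payload only."""
    return link_gbits * 1e9 / 8 / object_bytes
```

E.g. `payload_mbytes_per_sec(128, 1.0e6)` gives ~122, and `max_objects_per_sec(10_000)` gives 12,500 on gigabit, matching the figures above.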
- load balancing
- RAMCloud is expected to deal with many small, well-distributed objects
- this may aid obtaining maximum bandwidth utilisation, since we can bounce packets randomly across potential paths
- no idea if random is even close to optimal
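The "bounce packets randomly" idea is essentially VLB's core move; a toy sketch (the per-flow hashing branch is an assumed alternative, shown only to contrast with per-packet randomness):

```python
import random
from collections import Counter

def pick_core(core_switches, flow_id=None):
    """VLB-style path choice: send each packet via a uniformly random
    intermediate (core) switch, which spreads any traffic pattern evenly
    across core links. Passing a flow_id instead pins a whole flow to
    one path via hashing - avoiding reordering, but risking hot links."""
    if flow_id is not None:
        return core_switches[hash(flow_id) % len(core_switches)]
    return random.choice(core_switches)
```

With per-packet randomness, 80,000 packets over 8 cores land within a few percent of 10,000 each; whether random is close to optimal for RAMCloud's workload is, as noted, an open question.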
...