...
- Hierarchical designs are limited by the speed of the fastest available core switch; a Fat-tree is not so limited
- => Cannot get 10GigE bisection bandwidth with a hierarchical design today; would need a Fat-tree built from 10GigE switches
- Want network connecting many 1GigE nodes with no oversubscription
- MSR uses a hierarchical configuration (10GigE aggregation and core switches, 1GigE TOR switches)
- UCSD's fat-tree uses identical 48-port commodity 1GigE switches (i.e. k = 48; sizing sketch below the table)
- Both theoretically capable of 1:1 oversubscription (i.e. no oversubscription)
| | Hierarchical | Fat-tree |
|---|---|---|
| # hosts | 25,920 | 27,648 |
| # switches | 108 x 144-port 10GigE | 2,880 x 48-port 1GigE |
| # wires | 57,024 (~91% GigE, ~9% 10GigE) | 82,944 |
| # unique paths | 144 (36 via core with 2x dual uplinks in each subtree) | 572 |
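The fat-tree column can be re-derived from the standard k-ary fat-tree relations; a minimal sketch in Python (k = 48 as above, everything else computed; the function name and layout are just illustrative):

```python
# Sketch: size a k-ary fat-tree built from identical k-port switches.
def fat_tree(k):
    hosts    = k ** 3 // 4      # k pods x (k/2) edge switches x (k/2) hosts each
    edge     = k * (k // 2)     # edge / TOR switches
    agg      = k * (k // 2)     # aggregation switches
    core     = (k // 2) ** 2    # core switches
    switches = edge + agg + core            # = 5k^2/4
    links    = 3 * hosts                    # host-edge + edge-agg + agg-core, k^3/4 each
    return hosts, switches, core, links

print(fat_tree(48))   # (27648, 2880, 576, 82944) -- matches the hosts/switches/wires above
```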
- Notes:
- 48-port 1GigE switches cost ~$2.5-3k
- 2,880 * $2,500 = $7.2M (call it ~$7M)
- 20-port 1GigE switches w/ 10GigE uplinks probably cost about the same (~$2.5-3k) [uplinks not commodity]
- 1,296 * $2500 = $3.24M
- 144-port 10GigE switches advertised as $1500/port ($216k/switch) in mid-2007
- to be competitive with fat-tree on per-port cost, price per port must drop 6.25x to $241.76 ($34.8k/switch) (arithmetic sketched after these notes)
- 6.25x drop seems pretty close to the common 2-year drop in price
- If we were to make a 10GigE Fat-tree similar to the above, today it would cost (MSRP) about $20k/switch x 2,880 switches = $57.6M
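A quick reproduction of the break-even arithmetic above (plain Python; per-switch prices are the rough figures quoted in the notes, not vendor data, and the ~$7M fat-tree total is the note's rounded number):

```python
# Sketch: what a 144-port 10GigE switch would have to cost for the hierarchical
# design's switch budget to match the fat-tree's.
fat_tree_switches = 2880 * 2500        # 48-port 1GigE @ ~$2.5k  -> $7.2M (~$7M)
hier_tor_switches = 1296 * 2500        # 20-port 1GigE TORs @ ~$2.5k -> $3.24M
budget_for_10gige = 7_000_000 - hier_tor_switches   # what's left for the 108 big switches

per_switch = budget_for_10gige / 108
per_port   = per_switch / 144
print(fat_tree_switches, hier_tor_switches)     # 7,200,000 and 3,240,000
print(per_switch, per_port, 1500 / per_port)    # ~$34.8k/switch, ~$242/port, ~6.2x drop
```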
...
- Ideas from supercomputing:
- Hypercubes
- nCUBE hypercube machines, probably many others
- Tori (torus topologies)
- IBM Blue Gene connects tens of thousands of CPUs with high bandwidth (e.g. 380MB/sec with 4.5usec avg. ping-pong latency [link]; IBM quotes 6.4usec latency)
- Hypercubes
- Hosts connect to n neighbours each (2^n hosts in an n-dimensional hypercube) and route amongst themselves
- Requires hosts to route frames (routing sketch after this list)
- => higher latencies, unless we can do it on the NIC (NetFPGA?)
- High wiring complexity
- no idea how this compares to already high complexity of hierarchical and, especially, fat-tree topologies
- May impose greater constraints on cluster geometry to appropriately establish links??
- No dedicated switching elements, simpler (electrically) point-to-point links
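A minimal sketch of how hosts could route amongst themselves in a hypercube, using standard dimension-order routing over host IDs (whether this runs on the CPU or a NIC/NetFPGA is the open question above):

```python
# Sketch: dimension-order routing in an n-dimensional hypercube.
# Hosts are numbered 0..2^n - 1; two hosts are neighbours iff their IDs differ
# in exactly one bit, so each host has n links.
def next_hop(me, dst):
    """Forward towards dst by correcting the lowest differing address bit."""
    diff = me ^ dst
    if diff == 0:
        return None                 # frame has arrived
    return me ^ (diff & -diff)      # flip the lowest set bit of the difference

def route(src, dst):
    path, cur = [src], src
    while cur != dst:
        cur = next_hop(cur, dst)
        path.append(cur)
    return path

# 4-dimensional cube (16 hosts): worst-case path length is n = 4 hops.
print(route(0b0000, 0b1011))        # [0, 1, 3, 11]
```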
L2 vs. L3 switching
- L2 switches are cheaper and simpler, but L2 doesn't scale by default
- Broadcast domains cannot get too large, else performance suffers
- switches have MAC forwarding tables, but these are often limited to <16k entries and we want to scale beyond that
- L3 switches are more expensive, but L3 scales more clearly
- Set up localized subnets to restrict broadcast domains, or use VLANs
- Route between subnets intelligently
- To be workable in fat-trees, L3 needs work:
- Need custom/non-standard software/algorithms to update routes to achieve maximum bandwidth and balance of links
- So long as we're hacking, why not see if L2 can be made to work?
- Ditch ARP, restrict broadcasts (bcasts can be specially handled out of the fast-path)
- Use L2 source-routing and L2-in-L2 encapsulation - hosts transmit frames wrapped in an outer header whose destination MAC is the TOR switch of the destination host
- => switches need only know MACs of other switches, not all hosts - overcomes 16k MAC entry table limit
- requires a directory service and host stack modifications
- use VLB (Valiant Load Balancing) to spread traffic over the redundant paths and balance link utilisation (sketch after this list)
- advantage of a flat address space and simpler, lower-level protocol
- disadvantage of pushing changes into host protocol stacks (though IP is oblivious)
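A rough sketch of what the L2-in-L2 + VLB idea could look like on the sending host: the original frame is wrapped in outer Ethernet headers that name only switches, so core/agg switches never need to learn host MACs. The directory contents, MAC values, EtherType, and who sources the outer header are made-up placeholders, not a worked-out protocol:

```python
import random
import struct

ETH_HDR = struct.Struct('!6s6sH')     # dst MAC, src MAC, EtherType
ETYPE_ENCAP = 0x88B5                  # "local experimental" EtherType (placeholder)

# Hypothetical directory service: destination host MAC -> its TOR switch's MAC.
DIRECTORY = {bytes.fromhex('020000000042'): bytes.fromhex('021000000001')}
CORE_SWITCHES = [bytes.fromhex('022000000001'), bytes.fromhex('022000000002')]

def encapsulate(inner_frame, dst_host_mac, outer_src_mac, vlb=True):
    """Wrap inner_frame so it is addressed switch-to-switch rather than host-to-host."""
    dst_tor = DIRECTORY[dst_host_mac]                 # directory lookup replaces ARP
    frame = ETH_HDR.pack(dst_tor, outer_src_mac, ETYPE_ENCAP) + inner_frame
    if vlb:
        # Valiant Load Balancing: one more wrapper addressed to a randomly chosen
        # core switch, which would strip it and forward on the inner (TOR) header.
        frame = ETH_HDR.pack(random.choice(CORE_SWITCHES), outer_src_mac, ETYPE_ENCAP) + frame
    return frame

# Example: wrap a dummy 64-byte payload destined for host 02:00:00:00:00:42.
wrapped = encapsulate(b'\x00' * 64, bytes.fromhex('020000000042'),
                      bytes.fromhex('021000000009'))
```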
Importance of Programmability
- Clear that to use commodity parts, at either L2 or L3, some modifications must be made
- e.g. L3 routing to achieve maximum bandwidth, L2 stack changes to avoid broadcast issues, L2 VLB to maximize link utilisation
- OpenFlow may be an important part of this
- unknown how programmability of switches affects their 'commodity' status
RAMCloud Requirements and Network Effects
- latency
- Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
- across 6 hops, that's > 3.6 usec
- Woven Systems' 144-port 10GigE switches advertise 1.6usec port-to-port latency (~2.7x Arista's 600nsec minimum)
- => > 3.2usec in first two levels of hierarchy
- take away: sub-5usec is probably not currently possible
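The hop-count arithmetic above, spelled out (per-hop figures are the advertised minimums; queueing, NIC, and propagation delay are excluded, so these are lower bounds):

```python
# Sketch: best-case switching latency across a path, per-hop minimums only.
def path_latency_usec(hops, per_hop_nsec):
    return hops * per_hop_nsec / 1000.0

print(path_latency_usec(6, 600))    # Arista 48-port: 6 hops  -> 3.6 usec
print(path_latency_usec(2, 1600))   # Woven 144-port: 2 hops -> 3.2 usec
```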
- bandwidth
- 128 bytes/object * 1.0e6 objects/sec ≈ 122 MiB/sec (not including any packet overhead; arithmetic sketched after this list)
- this is gigabit range... 10GigE vs. GigE may be a significant question:
- Arista 48-port 10GigE's not commodity (~$20k/switch, vs. $2-3k/switch of commodity 1GigE)
- But what if we have much bigger, hot objects on a machine?
- Do we want to assume a single machine can always handle requests?
- e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
- Going beyond gigabit is still very costly
- ~25k-node cluster with full 10GigE bisection bandwidth would be ~$57M for switches (~7x the cost of gigabit)
- if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports
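The per-server bandwidth arithmetic above as a quick script (object sizes and request rates are the example figures from the bullets):

```python
# Sketch: per-server throughput needed for a given object size and request rate,
# and conversely the max request rate a 1 Gbit/s NIC can sustain.
GBIT = 1e9 / 8                                  # bytes/sec on a 1 Gbit/s link

def bytes_per_sec(obj_bytes, reqs_per_sec):
    return obj_bytes * reqs_per_sec             # ignores packet/protocol overhead

def max_reqs_per_sec(obj_bytes, link_bytes_per_sec=GBIT):
    return link_bytes_per_sec / obj_bytes

print(bytes_per_sec(128, 1e6) / 2**20)          # ~122 MiB/s for 128B objects @ 1M req/s
print(max_reqs_per_sec(10_000))                 # 12,500 req/s for 10 KB objects on GigE
```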
- load balancing
- RAMCloud is expected to deal with many small, well-distributed objects
- this may aid obtaining maximum bandwidth utilisation, since we can bounce packets randomly across potential paths
- no idea if random is even close to optimal (tiny simulation sketch below)
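To put a rough number on whether random path selection is "close to optimal", here is a tiny balls-into-bins style simulation (the flow and path counts are arbitrary illustrative values, not measurements):

```python
# Sketch: how evenly random path selection spreads flows across equal-cost paths.
# With many small flows the max/mean ratio stays modest; a few large flows would skew this.
import random
from collections import Counter

def imbalance(num_flows, num_paths, trials=100):
    """Worst observed (max path load) / (ideal per-path load) over several trials."""
    worst = 0.0
    for _ in range(trials):
        load = Counter(random.randrange(num_paths) for _ in range(num_flows))
        worst = max(worst, max(load.values()) / (num_flows / num_paths))
    return worst

print(imbalance(100_000, 576))   # e.g. ~1.2-1.3x the ideal per-path load
```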
Misc. Thoughts
- If networking is only a small part of total DC cost, why is there oversubscription currently?
- it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
- but people argue that oversubscription leads to significant bottlenecks in real DCs
- but, then, why aren't they reducing oversubscription from the get-go?