...
- MSR's Monsoon vs. UCSD's Fat-Tree commodity system
- Want network connecting many 1GigE nodes with no oversubscription
- MSR uses a hierarchical configuration (10GigE aggregation and core switches, 1GigE TOR switches)
- UCSD's design uses identical, commodity 48-port 1GigE switches throughout (i.e. k = 48)
- Both theoretically capable of 1:1 oversubscription (i.e. no oversubscription)
| | Hierarchical | Fat-tree |
|---|---|---|
| # hosts | 25,920 | 27,648 |
| # switches | 108 x 144-port 10GigE + 1,296 x 20-port 1GigE w/ 2x10GigE uplinks | 2,880 x 48-port 1GigE |
| # wires | 57,024 (~91% GigE, ~9% 10GigE) | 82,944 |
| # unique paths | 144 (36 via core with 2x dual uplinks in each subtree) | 572 |
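The fat-tree column can be sanity-checked against the standard k-ary fat-tree formulas; a minimal sketch, assuming the usual 3-tier construction built entirely from identical k-port switches (as in the UCSD paper):

```python
# Sanity check of the fat-tree column for a k-ary, 3-tier fat-tree
# built entirely from identical k-port switches.
def fat_tree_params(k):
    hosts = k**3 // 4          # (k/2)^2 hosts per pod * k pods
    edge  = k * (k // 2)       # k pods * k/2 edge (ToR) switches
    agg   = k * (k // 2)       # k pods * k/2 aggregation switches
    core  = (k // 2) ** 2      # (k/2)^2 core switches
    links = 3 * k**3 // 4      # host-edge + edge-agg + agg-core links
    return hosts, edge + agg + core, links

print(fat_tree_params(48))     # (27648, 2880, 82944) -- matches the table
```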
- Notes:
- 48-port 1GigE switches cost ~$2.5-3k
- 2,880 * $2500 = $7.2M
- 20-port 1GigE switches w/ 10GigE uplinks probably cost about the same (~$2.5-3k) [uplinks are not commodity]
- 1,296 * $2500 = $3.24M
- 144-port 10GigE switches advertised as $1500/port ($216k/switch) in mid-2007
- to be competitive with the fat-tree on per-port cost, the price per port must drop 6.25x to $241.76 ($34.8k/switch)
- a 6.25x drop is roughly in line with the price drop commonly seen over ~2 years
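A back-of-envelope check of the break-even figure, assuming $2,500 per commodity switch; the exact answer moves with the assumed switch price, which is presumably why this lands near, but not exactly on, the $241.76 / 6.25x quoted above:

```python
# Rough network cost comparison using the price estimates from the notes above.
COMMODITY_SWITCH = 2_500            # ~$2.5-3k per 48-port or 20-port 1GigE switch
CORE_PORT_2007   = 1_500            # advertised $/port, 144-port 10GigE, mid-2007

fat_tree_total  = 2_880 * COMMODITY_SWITCH    # $7.2M
per_host_budget = fat_tree_total / 27_648     # ~$260 of network per host

hier_tor_cost = 1_296 * COMMODITY_SWITCH      # $3.24M
core_ports    = 108 * 144                     # 15,552 10GigE core/agg ports

# Core $/port at which the hierarchical design matches the fat-tree's cost per host
breakeven = (25_920 * per_host_budget - hier_tor_cost) / core_ports
print(round(breakeven), round(CORE_PORT_2007 / breakeven, 1))  # ~$226/port, ~6.6x drop
```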
Alternative Network Topologies
- Ideas from supercomputing:
- Hypercubes
- Tori (torus networks)
- IBM Blue Gene connects tens of thousands of CPUs with high bandwidth (e.g. 380MB/sec with 4.5usec avg. ping-pong latency - link)
- Hosts connect to n neighbours and route amongst themselves
- Requires hosts to route frames
- => higher latencies, unless we can do it on the NIC (NetFPGA?)
- High wiring complexity
- unclear how this compares to the already high wiring complexity of the hierarchical and, especially, fat-tree topologies
- May impose greater constraints on cluster geometry (physical node placement) in order to establish the required links (?)
- No dedicated switching elements, simpler (electrically) point-to-point links
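To make "hosts route amongst themselves" concrete, a minimal, purely illustrative sketch of dimension-order routing on a k-ary 3D torus (the coordinates and k = 8 below are made up, not a proposal):

```python
# Dimension-order routing on a k-ary 3D torus: every host forwards frames
# itself, one dimension at a time, taking the shorter way around each ring.
def next_hop(src, dst, k=8):
    hop = list(src)
    for dim in range(3):
        if src[dim] != dst[dim]:
            fwd = (dst[dim] - src[dim]) % k        # distance going "up" the ring
            step = 1 if fwd <= k - fwd else -1     # pick the shorter direction
            hop[dim] = (src[dim] + step) % k
            return tuple(hop)
    return tuple(hop)                              # already at the destination

# Route from (0,0,0) to (3,6,1) in an 8x8x8 torus, printing each hop:
node, dst = (0, 0, 0), (3, 6, 1)
while node != dst:
    node = next_hop(node, dst)
    print(node)
```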
RAMCloud Requirements
- latency
- Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
- across 6 hops, that's > 3.6 usec
- Woven Systems' 144-port 10GigE switches advertise 1.6usec port-to-port latency (~2.7x Arista's minimum)
- => > 3.2usec across just the first two levels of the hierarchy
- takeaway: sub-5usec latency is probably not currently possible
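The hop arithmetic behind the two bullets above, using only the advertised best-case per-hop figures (queueing, NIC, and host software time are all extra):

```python
# Best-case switching latency = advertised per-hop latency * number of switch hops.
def path_latency_usec(per_hop_usec, hops):
    return per_hop_usec * hops

print(path_latency_usec(0.6, 6))   # Arista 48-port 10GigE, 6 hops   -> 3.6 usec
print(path_latency_usec(1.6, 2))   # Woven 144-port 10GigE, 2 levels -> 3.2 usec
```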
- bandwidth
- 128 bytes / object * 1.0e6 objects/second = 122MBytes/sec (not including any packet overhead)
- this is gigabit range... 10GigE vs. GigE may be a significant question:
- Arista 48-port 10GigE switches are not commodity (~$20k/switch, vs. $2-3k/switch for commodity 1GigE)
- But what if we have much bigger, hot objects on a machine?
- Do we want to assume a single machine can always handle requests?
- e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
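A quick sketch of where the gigabit ceiling bites as object size grows (payload bytes only, ignoring packet and protocol overhead):

```python
# Line-rate ceiling on requests/sec for a given object size and link speed.
def max_requests_per_sec(object_bytes, link_gbps):
    return (link_gbps * 1e9 / 8) / object_bytes

print(max_requests_per_sec(128, 1))      # ~976,562/sec: ~1M 128-byte objects/sec roughly saturates 1GigE
print(max_requests_per_sec(10_000, 1))   # 12,500/sec: the 10KByte case above
print(max_requests_per_sec(10_000, 10))  # 125,000/sec on 10GigE
```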
Misc. Thoughts
- If networking accounts for only a small part of total DC cost, why is oversubscription so common today?
- it's possible to pay more and reduce oversubscription - cost doesn't seem to be the major factor
- but people argue that oversubscription leads to significant bottlenecks in real DCs
- but, then, why aren't they reducing oversubscription from the get-go?