...
- Size is defined by a factor k, the number of ports per identical switch in the network
- 3-level hierarchy:
- core level ((k/2)^2 = k^2/4 switches)
- each core switch uses all k ports, one per pod, to connect to the upper layer of each of the k pods
- pod level (k pods)
- each pod has 2 internal layers with k/2 switches/layer => k switches/pod
- upper level switches (k/2 of them) connect k/2 of their ports to core level switches
- other k/2 ports connect to each of the k/2 lower pod level switches
- lower level switches (k/2 of them) connect to k/2 hosts each
- end host level (k^3/4 total hosts)
| k   | # hosts | # switches | # ports   | host:switch ratio | host:port ratio |
|-----|---------|------------|-----------|-------------------|-----------------|
| 4   | 16      | 20         | 80        | 0.8               | 0.2             |
| 8   | 128     | 80         | 640       | 1.6               | 0.2             |
| 16  | 1,024   | 320        | 5,120     | 3.2               | 0.2             |
| 32  | 8,192   | 1,280      | 40,960    | 6.4               | 0.2             |
| 48  | 27,648  | 2,880      | 138,240   | 9.6               | 0.2             |
| 64  | 65,536  | 5,120      | 327,680   | 12.8              | 0.2             |
| 96  | 221,184 | 11,520     | 1,105,920 | 19.2              | 0.2             |
| 128 | 524,288 | 20,480     | 2,621,440 | 25.6              | 0.2             |
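Every column in the table above follows directly from k; a quick sketch that reproduces them from the structure described earlier:

```python
# Derive the fat-tree sizing table from k alone:
#   core level:  (k/2)^2 switches
#   pod level:   k pods * k switches/pod
#   host level:  k^3/4 hosts
def fat_tree_sizes(k: int) -> dict:
    """Host/switch/port counts for a fat-tree built from k-port switches."""
    core_switches = (k // 2) ** 2            # (k/2)^2
    pod_switches = k * k                     # k pods, k switches each
    switches = core_switches + pod_switches  # 5k^2/4 total
    hosts = k ** 3 // 4                      # k^3/4
    ports = k * switches                     # every switch has k ports
    return {
        "hosts": hosts,
        "switches": switches,
        "ports": ports,
        "host:switch": hosts / switches,     # = k/5, grows linearly with k
        "host:port": hosts / ports,          # always 0.2: 5 ports per host
    }

for k in (4, 8, 16, 48):
    print(k, fat_tree_sizes(k))
```

The constant 0.2 host:port ratio is the table's main takeaway: a fat-tree always spends 5 switch ports per attached host, regardless of scale.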
- Fat-tree will have no oversubscription if network resources can be properly exploited ("rearrangeably non-blocking")
- i.e. for a network of 1GigE switches, there will always be 1Gbit available between two arbitrary hosts if the interconnects between them can be properly scheduled
- ways of handling this include recomputation of routes based on load, randomizing core switch hops, etc
- take away: can max out all ports, but only if we're smart
- Al-Fares's SIGCOMM '08 paper shows > 80% utilisation under worst-case conditions
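As a concrete illustration of the "randomizing core switch hops" idea, here is a minimal ECMP-style sketch: hash each flow's 5-tuple to pick a core switch, so distinct flows spread across the core while packets within one flow stay on a single path (avoiding reordering). The hash choice and tuple layout are illustrative assumptions, not the actual scheme from the paper.

```python
import zlib

def pick_core_switch(flow, num_core_switches):
    """Hash a flow's 5-tuple to a core switch index.

    Per-flow (not per-packet) hashing keeps each flow on one path
    while spreading distinct flows across all (k/2)^2 core switches.
    """
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_core_switches

# A k = 48 fat-tree has (48/2)^2 = 576 core switches
flow = ("10.0.1.2", 12345, "10.4.1.2", 80, "tcp")
print(pick_core_switch(flow, 576))
```

Static hashing like this can still collide hot flows onto one core switch, which is why the notes also mention recomputing routes based on load.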
Fat-Tree vs. Hierarchical
...
- Hierarchical limited by fastest core switch speed, Fat-tree is not so limited
- => Cannot get 10GigE bi-section bandwidth with hierarchical today; need a Fat-tree with 10GigE switches
- Example: MSR's Monsoon vs. UCSD's Fat-Tree commodity system
- Want network connecting many 1GigE nodes with no oversubscription
- MSR uses a hierarchical configuration (10GigE aggregation and core switches, 1GigE TOR switches)
- UCSD's uses identical, 24-port, commodity 1GigE switches (i.e. k = 48)
- Both theoretically capable of 1:1 oversubscription (i.e. no oversubscription)
|                | Hierarchical                                                      | Fat-tree              |
|----------------|-------------------------------------------------------------------|-----------------------|
| # hosts        | 25,920                                                            | 27,648                |
| # switches     | 108 x 144-port 10GigE + 1,296 x 20-port 1GigE w/ 2x10GigE uplinks | 2,880 x 48-port 1GigE |
| # wires        | 57,024 (~91% GigE, ~9% 10GigE)                                    | 82,944                |
| # unique paths | 144 (36 via core with 2x dual uplinks in each subtree)            | 572                   |
- Notes:
- 48-port 1GigE switches cost ~$2.5-3k
- 2,880 * $2500 = $7M
- 20-port 1GigE switches w/ 10GigE uplinks probably cost about the same (~$2.5-3k) [uplinks not commodity]
- 1,296 * $2500 = $3.24M
- 144-port 10GigE switches advertised as $1500/port ($216k/switch) in mid-2007
- to be competitive with fat-tree on per-port cost, price per port must drop 6.25x to $241.76 ($34.8k/switch)
- 6.25x drop seems pretty close to the common 2-year drop in price
- If we were to make a 10GigE Fat-tree similar to the above, today it would cost (MSRP) about $20k/switch x 2,880 switches = $57.6M
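The cost arithmetic above, spelled out (all prices are the rough figures quoted in these notes, not current market data):

```python
# Back-of-envelope switch costs for the two designs above.

# Fat-tree: 2,880 commodity 48-port 1GigE switches
fat_tree = 2_880 * 2_500                 # ~$7M
# Hierarchical edge: 1,296 x 20-port 1GigE switches w/ 10GigE uplinks
hier_edge = 1_296 * 2_500                # $3.24M
# Hierarchical core: 108 x 144-port 10GigE switches at $1,500/port
hier_core = 108 * (144 * 1_500)          # ~$23.3M
# Hypothetical 10GigE fat-tree at ~$20k/switch
fat_tree_10g = 2_880 * 20_000            # $57.6M

for name, cost in [("1GigE fat-tree", fat_tree),
                   ("hierarchical edge", hier_edge),
                   ("hierarchical core", hier_core),
                   ("10GigE fat-tree", fat_tree_10g)]:
    print(f"{name:18s} ${cost / 1e6:5.2f}M")
```

The 10GigE core switches dominate the hierarchical total, which is why the break-even analysis above focuses on how far their per-port price must fall.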
Alternative Network Topologies
...
- latency
- Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
- across 6 hops, that's > 3.6 usec
- Woven Systems' 144-port 10GigE switches advertise 1.6usec port-to-port latency (~2.7x Arista's minimum)
- => > 3.2usec in first two levels of hierarchy
- take away: sub-5usec is probably not currently possible
- bandwidth
- 128 bytes / object * 1.0e6 objects/second = 122MBytes/sec (not including any packet overhead)
- this is gigabit range... 10GigE vs. GigE may be a significant question:
- Arista 48-port 10GigE's not commodity (~$20k/switch, vs. $2-3k/switch of commodity 1GigE)
- But what if we have much bigger, hot objects on a machine?
- Do we want to assume a single machine can always handle requests?
- e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
- Going beyond gigabit is still very costly
- a ~25k-host cluster with full 10GigE bi-section bandwidth would be ~$57M for switches (~7x the cost of gigabit)
- if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports
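The bandwidth arithmetic above can be sketched as follows (ignoring packet/protocol overhead, as the notes do):

```python
def max_requests_per_sec(object_bytes: int, link_bits_per_sec: float) -> float:
    """Upper bound on requests/sec one NIC can serve at a given object size."""
    return link_bits_per_sec / 8 / object_bytes

GIGE = 1e9
# 10KByte objects on gigabit: ~12,500 requests/sec
print(max_requests_per_sec(10_000, GIGE))
# 128-byte objects at 1M requests/sec consume ~122 MBytes/sec of the link
print(128 * 1e6 / 2**20)
```

Small objects leave gigabit headroom, but a single hot 10KB object caps out a gigabit-attached machine quickly, motivating the dual-homing option above.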
Misc. Thoughts
- If networking costs only small part of total DC cost, why is there oversubscription currently?
- it's possible to pay more and reduce oversubscription, so cost doesn't seem to be the major factor
- but people argue that oversubscription leads to significant bottlenecks in real DCs
- but, then, why aren't they reducing oversubscription from the get-go?