Network Substrate
Random Concepts and Terminology
- DC = "Data Center"
- TOR = "Top of Rack"
- TOR switches connect all machines within a rack and provide uplinks to higher network levels in the topology
- TOR switches may use 1 or more ports to uplink. The ports may be of the same bandwidth as the hosts (e.g. gigabit) or higher (e.g. 10GigE)
- DCs organize machines (~20-40) into racks, which all connect to a TOR switch, which in turn connects to some aggregation switch
- Oversubscription - networks can be oversubscribed at various points.
- Switches generally have enough backplane bandwidth to saturate all internal ports
- But uplinks to other switches may be a fraction of this, reducing total bandwidth between hosts on different switches
- using multiple ports and/or higher speed uplinks can mitigate this
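- A quick arithmetic sketch (Python; the 40-host, 2-uplink rack below is a made-up example, not from any particular DC):

    def oversubscription_ratio(n_hosts, host_gbps, n_uplinks, uplink_gbps):
        # Worst-case offered load from the hosts vs. what the rack's
        # uplinks can actually carry.
        return (n_hosts * host_gbps) / (n_uplinks * uplink_gbps)

    # 40 gigabit hosts behind 2x 10GigE uplinks => 2.0, i.e. 2:1
    print(oversubscription_ratio(40, 1, 2, 10))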
Data Center Networks
- Current data centers are purported to be highly specialised
- hierarchical network topologies with higher bandwidth aggregation and core switches/routers
- that is, data rates increase up the tree to handle accumulation of bandwidth used by many, slower leaves further down
- requires big, specialised switches to maintain reasonable bandwidth
- e.g. 100+ 10GigE switches with >100 ports each, at the core
- pricey... Woven Systems' 144-port 10GigE switch debuted at $1500/port in mid-2007
- oversubscription is purportedly common
- mainly affects bisection bandwidth (the data center isn't uniform: locality matters, and traffic crossing the tree should expect lower bandwidth)
- implies congestion is possible, adding overhead for reliable protocols and packet latency
- 2.5:1 to 8:1 ratios quoted by Al-Fares et al. (SIGCOMM '08)
- 2.5:1 means that for every 2.5 Gb/s of capacity at the end hosts, only 1 Gb/s is provisioned at the core
- a saturated network, therefore, cannot run all hosts at full rates
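- To make the cost concrete, a small sketch (illustrative numbers only): with ratio r, a saturated network caps each host at line rate / r:

    def per_host_gbps(host_gbps, ratio):
        # Under all-to-all load the core caps each host's share at
        # its line rate divided by the oversubscription ratio.
        return host_gbps / ratio

    # 1GigE hosts at the quoted ratios: 0.4 Gb/s and 0.125 Gb/s each
    for ratio in (2.5, 8.0):
        print(ratio, per_host_gbps(1.0, ratio))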
- Current hot trend is commoditisation
- Google does this internally, Microsoft/Yahoo/Amazon probably similarly smart about it
- they've solved it, but either find it too important to share, or don't yet need SIGCOMM papers
- Nothing is standard. Requires modifications to routing and/or address resolution protocols
- hacks to L2 or L3 routing
- L4 protocols generally oblivious
- need to be careful about not excessively reordering packets (see the hashing sketch below)
- non-standard is reasonable for DCs, since the internal network is open to innovation
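- A minimal sketch of the usual reordering-safe trick: hash each flow's 5-tuple to pick an uplink, so all packets of one flow follow one path (Python; the function and field names are illustrative, not from any specific system):

    import zlib

    def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto, n_uplinks):
        # Same 5-tuple => same hash => same uplink: load spreads
        # across flows, but no single flow is split across paths
        # and reordered.
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        return zlib.crc32(key) % n_uplinks

    print(pick_uplink("10.0.1.2", "10.2.0.3", 5000, 80, "tcp", 4))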
- Main idea is to follow in footsteps of commodity servers
- From fewer, big, less hackable Sun/IBM/etc boxen to many smaller, hackable i386/amd64 machines running Linux/FreeBSD/something Microsofty
- Clear win for servers (~45% of DC budget), less so for networks (~15%) [percentages from Greenberg et al., CCR Jan '09]
- Is 15% large enough to care that much about optimisation (Amdahl strikes again)?
- Alternatively, is 15% small enough that we can increase it to get features we want (iWARP, full, non-blocking 10GigE bisection bandwidth, lower latencies, etc)?
- Similarly, Network Commoditisation => lots of similar, cheaper, simpler building blocks
- i.e. many cheaper, (near-)identical switches with a single, common data rate
- Favours Clos (Charles Clos) topologies such as the fashionable "fat-tree", i.e.:
- Multi-rooted, wide trees with lots of redundancy to spread bandwidth across a large number of links
- large number of equal throughput paths between distant nodes
- switches with equivalent #'s of ports used throughout
- 6 maximum hops from anywhere to anywhere in the system (host -> edge -> aggregation -> core -> aggregation -> edge -> host)
- scales massively
- does not necessitate faster data rates further up the tree to avoid oversubscription
Fat-Trees
- Size is defined by a factor k, the number of ports per identical switch in the network
- 3-level hierarchy:
- core level ((k/2)^2 = k^2/4 switches)
- each core switch uses all k ports, connecting to one upper-layer switch in each of the k pods
- pod level (k pods)
- each pod has 2 internal layers of k/2 switches each (k switches/pod)
- upper level switches (k/2 of them) connect k/2 of their ports to core level switches
- other k/2 ports connect to each of the k/2 lower pod level switches
- lower level switches (k/2 of them) connect to k/2 hosts each
- end host level (k^3/4 total hosts)
    k   | # hosts | # switches | # ports   | host:switch ratio | host:port ratio
    4   |      16 |         20 |        80 |               0.8 |             0.2
    8   |     128 |         80 |       640 |               1.6 |             0.2
    16  |   1,024 |        320 |     5,120 |               3.2 |             0.2
    32  |   8,192 |      1,280 |    40,960 |               6.4 |             0.2
    48  |  27,648 |      2,880 |   138,240 |               9.6 |             0.2
    64  |  65,536 |      5,120 |   327,680 |              12.8 |             0.2
    96  | 221,184 |     11,520 | 1,105,920 |              19.2 |             0.2
    128 | 524,288 |     20,480 | 2,621,440 |              25.6 |             0.2
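- A sketch of where the table's numbers come from, straight from the formulas above (Python):

    def fat_tree_sizes(k):
        # (k/2)^2 core switches + k pods of k switches = 5k^2/4 switches;
        # each pod hosts (k/2)^2 end hosts => k^3/4 hosts total.
        switches = (k // 2) ** 2 + k * k
        hosts = k ** 3 // 4
        ports = k * switches
        return hosts, switches, ports

    for k in (4, 8, 16, 32, 48, 64, 96, 128):
        hosts, switches, ports = fat_tree_sizes(k)
        print(k, hosts, switches, ports, hosts / switches, hosts / ports)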
- Fat-tree will have no oversubscription if network resources can be properly exploited ("rearrangeably non-blocking")
- i.e. for a network of 1GigE switches, there will always be 1 Gb/s available between two arbitrary hosts if the interconnects between them can be properly scheduled
- ways of handling this include recomputing routes based on load, randomising core switch hops, etc. (a sketch of load-aware core selection follows below)
- takeaway: can max out all ports, but only if we're smart
- Al-Fares et al.'s SIGCOMM '08 paper shows >80% utilisation under worst-case conditions
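- One hedged sketch of the "recompute routes based on load" idea: greedily place each new flow on the least-loaded core switch (an illustration of the general technique, not the scheduler from the Al-Fares paper):

    def assign_flow(core_load, flow_gbps):
        # Send the new flow through whichever core switch currently
        # carries the least traffic; a real scheduler would also
        # re-balance as flows start and stop.
        best = min(range(len(core_load)), key=lambda i: core_load[i])
        core_load[best] += flow_gbps
        return best

    load = [0.0] * 4  # 4 core switches (illustrative)
    for f in (1.0, 1.0, 0.5, 1.0, 0.5):
        print(assign_flow(load, f), load)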
Fat-Tree vs. Hierarchical Example
- MSR's Monsoon vs. UCSD's Fat-Tree commodity system
- Want network connecting many 1GigE nodes with no oversubscription
- MSR uses a hierarchical configuration (10GigE aggregation and core switches, 1GigE TOR switches)
Misc. Thoughts
- If networking is only a small part of total DC cost, why do current designs oversubscribe?
- it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
- but people argue that oversubscription leads to significant bottlenecks in real DCs
- but, then, why aren't they reducing oversubscription from the get-go?