Network Substrate
Data Center Networks
- Current data centers are purportedly highly specialized
- hierarchical network topologies with higher bandwidth aggregation and core switches/routers
- that is, data rates increase up the tree to handle the accumulated bandwidth of the many, slower leaves further down
- requires big, specialised switches to maintain reasonable bandwidth
- e.g. 100+ 10GigE switches with >100 ports each, at the core
- pricey... Woven Systems 144 ports 10GigE switch debuted at $1500/port in mid-2007
- oversubscription is purportedly common
- mainly affects bisection bandwidth (the data center isn't uniform - locality matters, else expect lower bandwidth)
- implies congestion is possible, adding overhead for reliable protocols and packet latency
- 2.5:1 to 8:1 ratios quoted by Al-Fares, et al ('08 SIGCOMM)
- 2.5:1 means for every 2.5 Gb/s of capacity at the end hosts, only 1 Gb/s is allocated at the core
- a saturated network, therefore, cannot run all hosts at full rates
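As a quick sanity check on those ratios, a one-line sketch (my own illustration; the function name and units are not from the notes) of the worst-case per-host bandwidth under oversubscription:

```python
# Illustrative sketch: worst-case per-host bandwidth across the core
# when every host transmits through an oversubscribed fabric at once.
def effective_bandwidth_gbps(host_rate_gbps, oversub_ratio):
    # e.g. 2.5:1 oversubscription -> each host gets 1/2.5 of its link rate
    return host_rate_gbps / oversub_ratio

print(effective_bandwidth_gbps(1.0, 2.5))  # 1 GigE hosts at 2.5:1 -> 0.4 Gb/s
print(effective_bandwidth_gbps(1.0, 8.0))  # 1 GigE hosts at 8:1 -> 0.125 Gb/s
```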
- Current hot trend is commoditisation
- Google does this internally, Microsoft/Yahoo/Amazon probably similarly smart about it
- they've solved it, but either find it too important to share, or don't yet need SIGCOMM papers
- Nothing is standard. Requires modifications to routing and/or address resolution protocols
- hacks to L2 or L3 routing
- L4 protocols generally oblivious
- need to be careful about not excessively reordering packets
- non-standard is reasonable for DCs, since the internal network is open to innovation
- Main idea is to follow in footsteps of commodity servers
- From fewer, big, less hackable Sun/IBM/etc boxen to many smaller, hackable i386/amd64 machines running Linux/FreeBSD/something Microsofty
- Clear win for servers (~45% of DC budget), less so for networks (~15%) [%s from: greenberg, jan '09 ccr]
- Is 15% large enough to care that much about optimisation (Amdahl strikes again)?
- Alternatively, is 15% small enough that we can increase it to get features we want (iWARP, full non-blocking 10GigE bisection bandwidth, lower latencies, etc)?
- Similarly, Network Commoditisation => lots of similar, cheaper, simpler building blocks
- i.e. many cheaper, (near-)identical switches with a single, common data rate
- Favours Clos (Charles Clos) topologies such as the fashionable "fat-tree", i.e.:
- Multi-rooted, wide trees with lots of redundancy to spread bandwidth across a large # of links
- large number of equal throughput paths between distant nodes
- switches with equivalent #'s of ports used throughout
- 6 maximum hops from anywhere to anywhere in the system
- scales massively
- does not necessitate faster data rates further up the tree to avoid oversubscription
Fat-Trees
- Size is defined by a factor k, the number of ports per identical switch in the network
- 3-level hierarchy:
- core level ((k/2)^2 switches)
- pod level (k pods)
- each pod has 2 internal layers of k/2 switches each (=> k switches/pod)
- end host level (k^3/4 total hosts)
  k |   # hosts | # switches |   # ports | host:switch ratio | host:port ratio
  4 |        16 |         20 |        80 |               0.8 |             0.2
  8 |       128 |         80 |       640 |               1.6 |             0.2
 16 |     1,024 |        320 |     5,120 |               3.2 |             0.2
 32 |     8,192 |      1,280 |    40,960 |               6.4 |             0.2
 48 |    27,648 |      2,880 |   138,240 |               9.6 |             0.2
 64 |    65,536 |      5,120 |   327,680 |              12.8 |             0.2
 96 |   221,184 |     11,520 | 1,105,920 |              19.2 |             0.2
128 |   524,288 |     20,480 | 2,621,440 |              25.6 |             0.2
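These rows follow directly from the sizing formulas; a small sketch (the function name is my own) that reproduces them:

```python
# Sketch of the 3-level fat-tree sizing formulas from the notes:
# (k/2)^2 core switches, k pods of k switches each, k^3/4 hosts.
def fat_tree_sizes(k):
    core_switches = (k // 2) ** 2
    pod_switches = k * k                 # k pods * (k/2 edge + k/2 agg)
    switches = core_switches + pod_switches
    hosts = k ** 3 // 4
    ports = switches * k                 # every switch has k ports
    return hosts, switches, ports

for k in (4, 8, 16, 32, 48, 64, 96, 128):
    hosts, switches, ports = fat_tree_sizes(k)
    print(k, hosts, switches, ports, hosts / switches, hosts / ports)
```

Note the host:port ratio is always 0.2, since hosts/ports = (k^3/4) / (5k^3/4) regardless of k, while host:switch grows as k/5.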