...
- DC = "Data Center"
- TOR = "Top of Rack"
- TOR switches connect all machines within a rack and provide uplinks to higher network levels in the topology
- TOR switches may use 1 or more ports to uplink. The ports may be of the same bandwidth as the hosts (e.g. gigabit) or higher (e.g. 10GigE)
- DCs organize machines (~20-40) into racks, which all connect to a TOR switch, which in turn connects to some aggregation switch
- EOR = "End of Row"
- switch connecting a row of racks
- appears to describe either one big switch handling all hosts in several racks, or an aggregation switch connecting the TORs of several racks
- Oversubscription - networks can be oversubscribed at various points.
- Switches generally have enough backplane bandwidth to saturate all internal ports
- But uplinks to other switches may be a fraction of this, reducing total bandwidth between hosts on different switches
- using multiple ports and/or higher speed uplinks can mitigate this
- Oversubscription often described as a ratio (n:m), e.g. for every n megabits of bandwidth at a host port, only m megabits of bandwidth exist between it and the most distant machine in the network
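A worked example of the ratio (the rack numbers here are hypothetical, chosen to match the ~20-40 hosts/rack figure above):

```python
from fractions import Fraction

def oversubscription(hosts, host_gbps, uplinks, uplink_gbps):
    """Edge capacity vs. uplink capacity, reduced to an n:1 ratio.

    A ratio of 2 means 2:1 oversubscribed: hosts can collectively
    offer twice the bandwidth the uplinks can carry.
    """
    return Fraction(hosts * host_gbps, uplinks * uplink_gbps)

# hypothetical rack: 40 gigabit hosts behind two 10GigE uplinks
assert oversubscription(40, 1, 2, 10) == 2   # 2:1 oversubscribed
```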
- cut-through vs. store-and-forward
- switches can either buffer entire frames and retransmit out a port (store-and-forward) or stream them without fully buffering in-between (cut-through)
- cut-through can provide better latency
...
- Size is defined by a factor k, the number of ports per identical switch in the network
- 3-level hierarchy:
- core level ((k/2)^2 = k^2/4 switches)
- each core switch uses all k ports to connect to k switches in first layer of the pod level
- pod level (k pods)
- each pod has 2 internal layers (k/2 switches/layer => k switches/pod)
- upper level switches (k/2 of them) connect k/2 of their ports to core level switches
- other k/2 ports connect to each of the k/2 lower pod level switches
- lower level switches (k/2 of them) connect to k/2 hosts each
- end host level (k^3/4 total hosts)
      k    # hosts   # switches     # ports   host:switch   host:port
      4         16           20          80           0.8         0.2
      8        128           80         640           1.6         0.2
     16      1,024          320       5,120           3.2         0.2
     32      8,192        1,280      40,960           6.4         0.2
     48     27,648        2,880     138,240           9.6         0.2
     64     65,536        5,120     327,680          12.8         0.2
     96    221,184       11,520   1,105,920          19.2         0.2
    128    524,288       20,480   2,621,440          25.6         0.2
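The table rows follow directly from the k-ary fat-tree counts described above; a quick sanity check:

```python
def fat_tree(k):
    """Element counts for a k-ary fat-tree built from k-port switches."""
    hosts = k ** 3 // 4              # k pods * (k/2 edge switches) * (k/2 hosts)
    switches = k ** 2 // 4 + k * k   # (k/2)^2 core + k pods * k switches/pod
    ports = k * switches             # every switch contributes k ports
    return hosts, switches, ports

# matches the table rows for k = 4 and k = 48
assert fat_tree(4) == (16, 20, 80)
assert fat_tree(48) == (27_648, 2_880, 138_240)
```

The host:switch ratio is hosts/switches = k/5, and host:port is always 1/5: a fifth of all ports face hosts, the rest interconnect switches.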
- Fat-tree will have no oversubscription if network resources can be properly exploited ("rearrangeably non-blocking")
- i.e. for a network of 1GigE switches, there will always be 1Gbit available between two arbitrary hosts if the interconnects between them can be properly scheduled
- ways of handling this include recomputation of routes based on load, randomizing core switch hops, etc
- take away: can max out all ports, but only if we're smart
- Al-Fares's SIGCOMM '08 paper shows > 80% utilisation under worst-case conditions
...
- latency
- Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
- across 6 hops, that's > 3.6 usec
- Woven System's 144-port 10GigE switches advertise 1.6usec port-to-port latency (almost 3x Arista's minimum)
- => > 3.2usec in first two levels of hierarchy
- Force-10 networks cut-through 10GigE latencies:
    hops   64-byte pkt   1500-byte pkt
       1      351 nsec       1500 nsec
       3      951 nsec       2100 nsec
       5     1550 nsec       2700 nsec
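These figures fit a simple cut-through model: one serialization delay for the whole frame, plus roughly 300 nsec of switch logic per hop (that per-hop constant is my inference from the table, not a vendor figure). Under store-and-forward the serialization delay would instead be paid at every hop.

```python
def cut_through_ns(hops, frame_bytes, link_gbps=10, per_hop_ns=300):
    # cut-through: the frame is serialized onto the wire once; each
    # switch starts forwarding after the header, adding only a fixed
    # per-hop delay (~300 nsec inferred from the Force-10 numbers)
    serialization = frame_bytes * 8 / link_gbps   # nsec at link_gbps Gbit/s
    return serialization + hops * per_hop_ns

assert cut_through_ns(1, 1500) == 1500   # matches the table
assert cut_through_ns(3, 1500) == 2100
assert round(cut_through_ns(1, 64)) == 351
```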
- take away: sub-5usec is probably not currently possible across a data center
- how important is RDMA (iWARP) for our goals?
- if it necessitates 10GigE, it'd be costly (today)
- unclear if custom (i.e. non-commodity) hardware required
- Force-10 foresees iWARP done on NIC, like TOE (TCP Offload Engine) -link
- bandwidth
- 128 bytes / object * 1.0e6 objects/second = 122MBytes/sec (not including any packet overhead)
- this is gigabit range... 10GigE vs. GigE may be a significant question:
- Arista 48-port 10GigE's not commodity (~$20k/switch, vs. $2-3k/switch of commodity 1GigE)
- But what if we have much bigger, hot objects on a machine?
- Do we want to assume a single machine can always handle requests?
- e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
- Going beyond gigabit is still very costly
- ~25k-host cluster with full 10GigE bisection bandwidth would be ~$57M for switches (~7x the cost of gigabit)
- if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports
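The arithmetic behind the figures above, as a sketch (ignores all packet and protocol overhead, so these are line-rate ceilings):

```python
def mbytes_per_sec(object_bytes, objects_per_sec):
    # aggregate payload bandwidth in MBytes/sec (1 MByte = 2**20 bytes)
    return object_bytes * objects_per_sec / 2 ** 20

def max_requests_per_sec(link_gbps, object_bytes):
    # bits/sec available on the link divided by bits per object
    return link_gbps * 1e9 / (object_bytes * 8)

assert round(mbytes_per_sec(128, 1.0e6)) == 122          # gigabit range
assert round(max_requests_per_sec(1, 10_000)) == 12_500  # 10KB objects, GigE
```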
- load balancing
- RAMCloud is expected to deal with many small, well-distributed objects
- this may aid obtaining maximum bandwidth utilisation, since we can bounce packets randomly across potential paths
- no idea if random is even close to optimal
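A minimal sketch of the random-bounce idea: pick a core switch uniformly at random for each packet. Real schemes might instead hash per-flow to avoid reordering; whether uniform random is anywhere near optimal is exactly the open question above.

```python
import random

def pick_core_switch(k, rng=random):
    # uniformly random choice among the (k/2)^2 core switches of a
    # k-ary fat-tree; each core switch identifies one inter-pod path
    return rng.randrange((k // 2) ** 2)

# k = 48 fat-tree has 576 core switches
assert 0 <= pick_core_switch(48) < 576
```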
...
- congestion results in:
- packet loss if buffers overflow
- else, increased latency from waiting in line
- which is worse?
- if RAMCloud fast enough, occasional packet loss may not be horrible
- buffering may cause undesired latencies/variability in latency
- even if no oversubscription, congestion is still an issue
- e.g. any time multiple flows funnel from several ports to one port (within the network) or host (at the leaves)
- conceivable in RAMCloud, as we expect to communicate with many different systems
- e.g.: could be a problem if client issues enough sufficiently large requests to a large set of servers
- UDP has no congestion control mechanisms
- connectionless, unreliable protocol probably essential for latency and throughput goals
- need to avoid congestion how?
- rely on user to stagger queries/reduce parallelism? [cf. Facebook]
- if we're sufficiently fast, will we run into these problems anyhow?
"Data Center Ethernet"
- Cisco: "collection of standards-based extensions to classical Ethernet that allows data center architects to create a data center transport layer that is:"
- stable
- lossless
- efficient
- Purpose is apparently to buck the trend of building multiple application-specific networks (IP, SAN, Infiniband, etc)
- how? better multi-tenancy (traffic class isolation/prioritisation), guaranteed delivery (lossless transmission), layer-2 multipath (higher bisectional bandwidth)
- A series of additional standards:
...
- XXX TODO XXX
- "Class-based flow control" (CBFC)
- for multi-tenancy
- Enhanced transmission selection (ETS)
- for multi-tenancy
- Data center bridging exchange protocol (DCBX)
- Lossless Ethernet
- for guaranteed delivery
- Congestion notification
- end-to-end congestion management to avoid dropped frames (i.e. work around TCP congestion collapse, retrofit non-congestion-aware protocols to not cause trouble)
Misc. Thoughts
- If networking costs only small part of total DC cost, why is there oversubscription currently?
- it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
- but people argue that oversubscription leads to significant bottlenecks in real DCs
- but, then, why aren't they reducing oversubscription from the get-go?