Network substrate
Network Substate
Random Concepts and Terminology
DC = "Data Center"
TOR = "Top of Rack"
TOR switches connect all machines within a rack and provide uplinks to higher network levels in the topology
TOR switches may use 1 or more ports to uplink. The ports may be of the same bandwidth as the hosts (e.g. gigabit) or higher (e.g. 10GigE)
DCs organize machines (~20-40) into racks, which all connect to a TOR switch, which in turn connects to some aggregation switch
EOR = "End of Row"
switch connecting a row of racks
appears to describe both one big switch handling all hosts for several racks, or an aggregation switch connecting to TORs of several racks.
hierarchical vs. fat-tree topologies:
hierarchical => different switches and bandwidths at the core than toward leaves
fat-tree => same switches everywhere, all at same bandwidth
Oversubscription - networks can be oversubscribed at various points.
Switches generally have enough backplane bandwidth to saturate all internal ports
But uplinks to other switches may be a fraction of this, reducing total bandwidth between hosts on different switches
using multiple ports and/or higher speed uplinks can mitigate this
Oversubscription often described as a ration (n:m), e.g. for every n megabits of bandwidth at a host port, only m megabits of bandwidth exist between it and the most distant machine in the network
cut-through vs. store-and-forward
switches can either buffer entire frames and retransmit out a port (store-and-forward) or stream them without fully buffering in-between (cut-through)
cut-through can provide better latency
Data Center Networks
Current data centers purported to be highly specialized
hierarchical network topologies with higher bandwidth aggregation and core switches/routers
that is, data rates increase up the tree to handle accumulation of bandwidth used by many, slower leaves further down
requires big, specialised switches to maintain reasonable bandwidth
e.g. 100+ 10GigE switches with >100 ports each, at the core
pricey... Woven Systems 144 ports 10GigE switch debuted at $1500/port in mid-2007
current cost unknown
oversubscription is purportedly common
mainly affects bi-section bandwidth (the data center isn't uniform - locality important, else lower bandwidth expectations)
implies congestion is possible, adding overhead for reliable protocols and packet latency
2.5:1 to 8:1 ratios quoted by Al-Fares, et al ('08 SIGCOMM)
2.5:1 means for every 2.5Gb at the end hosts, only 1Gb is allocated at the core
a saturated network, therefore, cannot run all hosts at full rates
Current hot trend is commoditisation
Google does this internally, Microsoft/Yahoo/Amazon probably similarly smart about it
they've solved it, but either find it too important to share, or don't yet need SIGCOMM papers
Nothing is standard. Requires modifications to routing and/or address resolution protocols
hacks to L2 or L3 routing
L4 protocols generally oblivious
need to be careful about not excessively reordering packets
non-standard is reasonable for DC's, since internal network open to innovation
Main idea is to follow in footsteps of commodity servers
From fewer, big, less hackable Sun/IBM/etc boxen to many smaller, hackable i386/amd64 machines running Linux/FreeBSD/something Microsofty
Clear win for servers (~45% of DC budget), less so for networks (~15%) [%s from: greenberg, jan '09 ccr]
Is 15% large enough to care that much about optimisation (Amdahl strikes again)?
Alternatively, is 15% small enough that we can increase it to get features we want (iWARP, full, non-blocking 10GigE bi-section bandwidth, lower latencies, etc)?
Similarly, Network Commoditisation => lots of similar, cheaper, simpler building blocks
i.e. many cheaper, (near-)identical switches with a single, common data rate
Favours Clos (Charles Clos) topologies such as the fashionable "fat-tree", i.e.:
Multi-rooted, wide trees with lots of redundancies to spread bandwidth of # of links
large number of equal throughput paths between distant nodes
switches with equivalent #'s of ports used throughout
6 maximum hops from anywhere to anywhere in the system
scales massively
does not necessitate faster data rates further up the tree to avoid oversubscription
Fat-Trees
Size is defined by a factor k, the number of ports per identical switch in the network
3-level heirarchy:
core level ((k/2)^2 = k^2/4 switches)
each core switch uses all k ports to connect to k switches in first layer of the pod level
pod level (k pods)
each pod has 2 internal layers with (k/2 switches/layer => k switches/pod)
upper level switches (k/2 of them) connect k/2 of their ports to core level switches
other k/2 ports connect to each of the k/2 lower pod level switches
lower level switches (k/2 of them) connect to k/2 hosts each
end host level (k^3/4 total hosts)
Fat-tree will have no oversubscription if network resources can be properly exploited ("rearrangeably non-blocking")
i.e. for a network of 1GigE switches, there will always be 1Gbit available between two arbitrary hosts if the interconnects between them can be properly scheduled
ways of handling this include recomputation of routes based on load, randomizing core switch hops, etc
take away: can max out all ports, but only if we're smart
Al-Fare's SIGCOMM '08 paper shows > 80% utilisation under worst-case conditions
Fat-Tree vs. Hierarchical
Hierarchical limited by fastest core switch speed, Fat-tree is not so limited
=> Cannot get 10GigE bi-section bandwidth with hierarchical today, need a Fat-tree with 10-GigE switches
Example: MSR's Monsoon vs. UCSD's Fat-Tree commodity system
Want network connecting many 1GigE nodes with no oversubscription
MSR uses a hierarchical configuration (10GigE aggregation and core switches, 1GigE TOR switches)
UCSD's uses identical, 24-port, commodity 1GigE switches (i.e. k = 48)
Both theoretically capable of 1:1 oversubscription (i.e. no oversubscription)
Notes:
48-port 1GigE switches cost ~$2.5-3k
2,880 * $2500 = $7M
20-port 1GigE switches w/ 10GigE uplinks probably cost about the same (~2.5-3k) [uplinks not commodity]
1,296 * $2500 = $3.24M
144-port 10GigE switches advertised as $1500/port ($216k/switch) in mid-2007
to be competitive with fat-tree on per-port cost, price per port must drop 6.25x to $241.76 ($34.8k/switch)
6.25x drop seems pretty close to the common 2-year drop in price
If we were to make a 10GigE Fat-tree similar to the above, today it would cost (MSRP) about $20k/switch x 2,880 switches = $57.6M
Alternative Network Topologies
Ideas from supercomputing:
Hypercubes
nCUBE hypercube machines, probably many others
Torus's (?Tori?)
IBM Blue Gene connects tens of thousands of CPUs with high bandwidth (e.g. 380MB/sec with 4.5usec avg. ping-pongs - link- IBM says 6.4usec latency)
Hosts connect to n neighbours and route amongst themselves
Requires hosts to route frames
=> higher latencies, unless we can do it on the NIC (NetFPGA?)
High wiring complexity
no idea how this compares to already high complexity of hierarchical and, especially, fat-tree topologies
May impose greater constraints on cluster geometry to appropriately establish links??
No dedicated switching elements, simpler (electrically) point-to-point links
L2 vs. L3 switching
L2 switches are cheaper and simpler, but L2 doesn't scale by default
Broadcast domains cannot get too large, else performance suffers
switches have MAC forwarding tables, but are often <16k entries in size and we want to scale beyond that
L3 switches are more expensive, but L3 scales more clearly
Set up localized subnets to restrict broadcast domains, or use VLANs
Route between subnets intelligently
To be workable in fat-trees, L3 needs work:
Need custom/non-standard software/algorithms to update routes to achieve maximum bandwidth and balance of links
So long as we're hacking, why not see if L2 can be made to work?
Ditch ARP, restrict broadcasts (bcasts can be specially handled out of the fast-path)
Use L2 source-routing and L2-in-L2 encapsulation - hosts transmit frames wrapped with destination MAC header of the TOR switch for the destination host
=> switches need only know MACs of other switches, not all hosts - overcomes 16k MAC entry table limit
requires a directory service and host stack modifications
use VLB (Valiant Load Balancing) to maximise bandwidth
advantage of a flat address space and simpler, lower-level protocol
disadvantage of pushing changes into host protocol stacks (though IP is oblivious)
Importance of Programmability
Clear that to use commodity parts, at either L2 or L3, some modifications must be made
e.g. L3 routing to achieve maximum bandwidth, L2 stack changes to avoid broadcast issues, L2 VLB to maximize link utilisation
OpenFlow may be an important part of this
unknown how programmability of switches affects their 'commodity' status
RAMCloud Requirements and Network Effects
latency
Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
across 6 hops, that's > 3.6 usec
Woven System's 144-port 10GigE switches advertise 1.6usec port-to-port latency (almost 3x Arista's minimum)
=> > 3.2usec in first two levels of hierarchy
Force-10 networks cut-through 10GigE latencies:
take away: sub-5usec is probably not currently possible across a data center
how important is RDMA (iWARP) for our goals?
if it necessitates 10GigE, it'd be costly (today)
unclear if custom (i.e. non-commodity) hardware required
Force-10 forsees iWARP done on NIC, like TOE (TCP Offload Engine) -link
bandwidth
128 bytes / object * 1.0e6 objects/second = 122MBytes/sec (not including any packet overhead)
this is gigabit range... 10GigE vs. GigE may be a significant question:
Arista 48-port 10GigE's not commodity (~$20k/switch, vs. $2-3k/switch of commodity 1GigE)
But what if we have much bigger, hot objects on a machine?
Do we want to assume a single machine can always handle requests?
e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
Going beyond gigabit is still very costly
~25k cluster with full 10GigE bi-section bandwidth would be ~$57M for switches (~7x the cost of gigabit)
if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports
load balancing
RAMCloud is expected to deal with many small, well-distributed objects
this may aid obtaining maximum bandwidth utilisation, since we can bounce packets randomly across potential paths
no idea if random is even close to optimal
Congestion
congestion results in:
packet loss if buffers overflow
else, increased latency from waiting in line
which is worse?
if RAMCloud fast enough, occasional packet loss may not be horrible
buffering may cause undesired latencies/variability in latency
even if no oversubscription, congestion is still an issue
e.g. any time multiple flows funnel from several ports to one port (within the network) or host (at the leaves)
conceivable in RAMCloud, as we expect to communicate with many different systems
e.g.: could be a problem if client issues enough sufficiently large requests to a large set of servers
UDP has no congestion control mechanisms
connectionless, unreliable protocol probably essential for latency and throughput goals
need to avoid congestion how?
rely on user to stagger queries/reduce parallelism? [c.f. Facebook]
if we're sufficiently fast, will we run into these problems anyhow?
balaji's points: buffers don't scale with bandwidth increases
simply can't get 2x buffers with similar increase in bandwidth at high end
further, adding more bandwidth and keeping a reservation for temporary congestion is better than adding buffers
especially for RAMCloud - reduces latency
is this an argument against commodity (at least, against a pure commodity fat-tree)?
ECN - Explicit Congestion Notification
already done by switches - set bit in IP TOS header if nearing congestion, with greater probability as we approach saturation
mostly for sustained flow traffic
RAMCloud expects lots of small datagrams, rather than flows
"Data Center Ethernet"
Cisco: "collection of standards-based extensions to classical Ethernet that allows data center architects to create a data center transport layer that is:"
stable
lossless
efficient
Purpose is apparently to buck the trend of building multiple application-specific networks (IP, SAN, Infiniband, etc)
how? better multi-tenacy (traffic class isolation/prioritisation), guaranteed delivery (lossless transmission), layer-2 multipath (higher bisectional bandwidth)
A series of additional standards:
"Class-based flow control" (CBFC)
for multi-tenancy
Enhanced transmission selection (ETS)
for multi-tenancy
Data center bridging exchange protocol (DCBCXP)
Lossless Ethernet
for guaranteed delivery
Congestion notification
end-to-end congestion management to avoid dropped frames (i.e. work around TCP congestion collapse, retrofit non-congestion-aware protocols to not cause trouble )
In the field
Google
Use long-lived TCP connections
pre-established and left open to avoid handshake overhead
unclear how TCP has been tweaked for low-latency environment (retransmit timeouts, etc)
Misc. Thoughts
If networking costs only small part of total DC cost, why is there oversubscription currently?
it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
but people argue that oversubscription leads to significant bottlenecks in real DCs
but, then, why aren't they reducing oversubscription from the get go?