...
- DC = "Data Center"
- TOR = "Top of Rack"
- TOR switches connect all machines within a rack and provide uplinks to higher network levels in the topology
- TOR switches may use one or more ports for uplinks; these may run at the same rate as the host ports (e.g. gigabit) or higher (e.g. 10GigE)
- DCs organize machines (~20-40) into racks, which all connect to a TOR switch, which in turn connects to some aggregation switch
- Oversubscription - networks can be oversubscribed at various points.
- Switches generally have enough backplane bandwidth to saturate all internal ports
- But uplinks to other switches may be a fraction of this, reducing total bandwidth between hosts on different switches
- using multiple ports and/or higher speed uplinks can mitigate this
- Oversubscription often described as a ratio (n:m), e.g. for every n megabits of bandwidth at a host port, only m megabits of bandwidth exist between it and the most distant machine in the network
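- a quick sketch of how the ratio falls out of port counts (Python; the rack below is hypothetical, not any vendor's spec):

      # Hypothetical rack: 48 hosts at 1Gb/s each, TOR with 2 x 10GigE uplinks.
      host_ports = 48
      host_gbps = 1.0
      uplink_gbps = 2 * 10.0
      ratio = (host_ports * host_gbps) / uplink_gbps
      print("oversubscription %.1f:1" % ratio)   # -> 2.4:1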
- cut-through vs. store-and-forward
- switches can either buffer entire frames before transmitting them out a port (store-and-forward) or begin forwarding as soon as the header has arrived (cut-through)
- cut-through can provide better latency
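- a rough sketch of why (Python; frame size and link rate are illustrative assumptions):

      # Store-and-forward buffers the whole frame before forwarding, so every
      # hop pays at least one full serialization delay; cut-through starts
      # forwarding once the header has arrived, avoiding most of it.
      frame_bytes = 1500
      link_bps = 10e9                               # 10GigE
      per_hop_s = frame_bytes * 8 / link_bps
      print("%.2f us per store-and-forward hop" % (per_hop_s * 1e6))   # 1.20 us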
Data Center Networks
- Current data centers are purported to be highly specialized
- hierarchical network topologies with higher bandwidth aggregation and core switches/routers
- that is, data rates increase up the tree to handle accumulation of bandwidth used by many, slower leaves further down
- requires big, specialised switches to maintain reasonable bandwidth
- e.g. 100+ 10GigE switches with >100 ports each, at the core
- pricey... Woven Systems' 144-port 10GigE switch debuted at $1500/port in mid-2007
- current cost unknown
- oversubscription is purportedly common
- mainly affects bisection bandwidth (the data center isn't uniform: locality matters, and bandwidth expectations are lower between distant hosts)
- implies congestion is possible, adding overhead for reliable protocols and packet latency
- 2.5:1 to 8:1 ratios quoted by Al-Fares, et al ('08 SIGCOMM)
- 2.5:1 means for every 2.5Gb at the end hosts, only 1Gb is allocated at the core
- a saturated network, therefore, cannot run all hosts at full rates
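- the worst-case arithmetic those ratios imply (Python sketch, assuming gigabit host ports):

      # Worst-case bandwidth to a distant host: port rate / oversubscription.
      host_gbps = 1.0
      for ratio in (2.5, 8.0):
          print("%.1f:1 -> %.0f Mb/s" % (ratio, host_gbps / ratio * 1000))
      # 2.5:1 -> 400 Mb/s; 8:1 -> 125 Mb/s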
- Current hot trend is commoditisation
- Google does this internally, Microsoft/Yahoo/Amazon probably similarly smart about it
- they've solved it, but either find it too important to share, or don't yet need SIGCOMM papers
- Nothing is standard. Requires modifications to routing and/or address resolution protocols
- hacks to L2 or L3 routing
- L4 protocols generally oblivious
- need to be careful about not excessively reordering packets
- non-standard is reasonable for DC's, since internal network open to innovation
- Main idea is to follow in footsteps of commodity servers
- From a few big, less hackable Sun/IBM/etc. boxen to many smaller, hackable i386/amd64 machines running Linux/FreeBSD/something Microsofty
- Clear win for servers (~45% of DC budget), less so for networks (~15%) [%s from: greenberg, jan '09 ccr]
- Is 15% large enough to care that much about optimisation (Amdahl strikes again)?
- Alternatively, is 15% small enough that we can increase it to get features we want (iWARP, full, non-blocking 10GigE bi-section bandwidth, lower latencies, etc)?
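- crude sanity check on the second question (Python sketch; assumes, wrongly, that the whole 15% is switch cost, and borrows the ~7x 10GigE premium from the bandwidth notes below):

      # 85% of the budget unchanged, the 15% network slice inflated 7x:
      print("new budget: %.2fx the old one" % (0.85 + 0.15 * 7))   # 1.90x
      # i.e. going full 10GigE could roughly double the DC budget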
- Similarly, Network Commoditisation => lots of similar, cheaper, simpler building blocks
- i.e. many cheaper, (near-)identical switches with a single, common data rate
- Favours Clos topologies (after Charles Clos), such as the fashionable "fat-tree", i.e.:
- Multi-rooted, wide trees with lots of redundancy to spread bandwidth across a large # of links
- large number of equal throughput paths between distant nodes
- switches with equivalent #'s of ports used throughout
- at most 6 hops from anywhere to anywhere in the system
- scales massively
- does not necessitate faster data rates further up the tree to avoid oversubscription
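- the scaling arithmetic, per Al-Fares et al. '08 (Python sketch):

      # k-port switches throughout give k^3/4 hosts at full bisection
      # bandwidth, using 5k^2/4 switches in total.
      for k in (24, 48):
          print("k=%2d: %5d hosts, %4d switches" % (k, k**3 // 4, 5 * k**2 // 4))
      # k=24:  3456 hosts,  720 switches
      # k=48: 27648 hosts, 2880 switches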
...
- latency
- Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
- across 6 hops, that's at least 3.6 usec
- Woven System's 144-port 10GigE switches advertise 1.6usec port-to-port latency (almost 3x Arista's minimum)
- => over 3.2usec spent in just the first two levels of the hierarchy
- take away: sub-5usec is probably not currently possible (arithmetic sketched after this list)
- how important is RDMA (iWARP) for our goals?
- if it necessitates 10GigE, it'd be costly
- unclear if custom (i.e. non-commodity) hardware required
- Force-10 foresees iWARP done on the NIC, like TOE (TCP Offload Engine)
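- adding up the advertised figures above (Python sketch; these are vendor minimums, distributions unknown):

      arista_ns, woven_ns, hops = 600, 1600, 6
      print("all-Arista path: >= %.1f us" % (arista_ns * hops / 1e3))   # 3.6 us
      print("two Woven hops:  %.1f us" % (woven_ns * 2 / 1e3))          # 3.2 us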
- bandwidth
- 128 bytes / object * 1.0e6 objects/second = 122MBytes/sec (not including any packet overhead)
- this is gigabit range... 10GigE vs. GigE may be a significant question:
- Arista 48-port 10GigE's are not commodity (~$20k/switch vs. $2-3k/switch for commodity 1GigE)
- But what if we have much bigger, hot objects on a machine?
- Do we want to assume a single machine can always handle requests?
- e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
- Going beyond gigabit is still very costly
- a ~25k-machine cluster with full 10GigE bisection bandwidth would be ~$57M for switches (~7x the cost of gigabit)
- if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports
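- the request-rate arithmetic above, as a reusable sketch (Python; payload only, no packet overhead):

      def max_req_per_sec(object_bytes, link_gbps):
          return link_gbps * 1e9 / 8 / object_bytes

      print("%.0f MB/s for 1e6 x 128B objects/sec" % (128 * 1e6 / 2**20))       # ~122
      print("%.0f req/s for 10KB objects on GigE" % max_req_per_sec(10e3, 1.0))  # 12500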
- load balancing
- RAMCloud is expected to deal with many small, well-distributed objects
- this may help achieve maximum bandwidth utilisation, since we can bounce packets randomly across the available paths
- no idea if random is even close to optimal
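- the two obvious spreading policies, as a toy sketch (Python; not any real switch's implementation):

      import random, zlib

      UPLINKS = 4

      def ecmp_uplink(flow_id):
          # hash a flow to a fixed uplink: stable, so no reordering
          return zlib.crc32(flow_id.encode()) % UPLINKS

      def random_uplink():
          # pick per packet: spreads load better, risks reordering
          return random.randrange(UPLINKS)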
Congestion
- congestion results in:
- packet loss if buffers overflow
- else, increased latency from waiting in line
- which is worse?
- if RAMCloud fast enough, occasional packet loss may not be horrible
- buffering may cause undesired latencies/variability in latency
- even if no oversubscription, congestion is still an issue
- e.g. any time multiple flows funnel from several ports to one port (within the network) or host (at the leaves)
- conceivable in RAMCloud, as we expect to communicate with many different systems
- e.g. could be a problem if a client fans out enough sufficiently large requests to a large set of servers (the classic incast pattern; see the back-of-envelope check at the end of this section)
- UDP has no congestion control mechanisms
- connectionless, unreliable protocol probably essential for latency and throughput goals
- need to avoid congestion how?
- rely on the user to stagger queries/reduce parallelism? [cf. Facebook]
- if we're sufficiently fast, will we run into these problems anyhow?
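- back-of-envelope incast check (Python sketch; all numbers hypothetical):

      # If a client fans out to n servers and all replies arrive at once,
      # the funnel port's buffer must absorb most of the burst.
      n_servers = 64
      reply_bytes = 10 * 1024
      port_buffer = 128 * 1024          # shallow commodity switch buffer
      burst = n_servers * reply_bytes
      print("burst %d KB vs. buffer %d KB -> %s" %
            (burst // 1024, port_buffer // 1024,
             "drops likely" if burst > port_buffer else "fits"))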
"Data Center Ethernet"
- XXX TODO XXX
Misc. Thoughts
- If networking costs only small part of total DC cost, why is there oversubscription currently?
- it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
- but people argue that oversubscription leads to significant bottlenecks in real DCs
- but, then, why aren't they reducing oversubscription from the get-go?