Network Substate

Random Concepts and Terminology

DC = "Data Center"
TOR = "Top of Rack"
- TOR switches connect all machines within a rack and provide uplinks to higher network levels in the topology
  - TOR switches may use 1 or more ports to uplink. The ports may be of the same bandwidth as the hosts (e.g. gigabit) or higher (e.g. 10GigE)
- DCs organize machines (~20-40) into racks, which all connect to a TOR switch, which in turn connects to some aggregation switch
Oversubscription - networks can be oversubscribed at various points.
- Switches generally have enough backplane bandwidth to saturate all internal ports
- But uplinks to other switches may be a fraction of this, reducing total bandwidth between hosts on different switches
  - using multiple ports and/or higher speed uplinks can mitigate this
- Oversubscription often described as a ration (n:m), e.g. for every n megabits of bandwidth at a host port, only m megabits of bandwidth exist between it and the most distant machine in the network

Data Center Networks

Current data centers purported to be highly specialized
- hierarchical network topologies with higher bandwidth aggregation and core switches/routers
  - that is, data rates increase up the tree to handle accumulation of bandwidth used by many, slower leaves further down
- requires big, specialised switches to maintain reasonable bandwidth
  - e.g. 100+ 10GigE switches with >100 ports each, at the core
    - pricey... Woven Systems 144 ports 10GigE switch debuted at $1500/port in mid-2007
      - current cost unknown
- oversubscription is purportedly common
  - mainly affects bi-section bandwidth (the data center isn't uniform - locality important, else lower bandwidth expectations)
  - implies congestion is possible, adding overhead for reliable protocols and packet latency
  - 2.5:1 to 8:1 ratios quoted by Al-Fares, et al ('08 SIGCOMM)
    - 2.5:1 means for every 2.5Gb at the end hosts, only 1Gb is allocated at the core
    - a saturated network, therefore, cannot run all hosts at full rates
Current hot trend is commoditisation
- Google does this internally, Microsoft/Yahoo/Amazon probably similarly smart about it
  - they've solved it, but either find it too important to share, or don't yet need SIGCOMM papers
- Nothing is standard. Requires modifications to routing and/or address resolution protocols
  - hacks to L2 or L3 routing
    - L4 protocols generally oblivious
    - need to be careful about not excessively reordering packets
  - non-standard is reasonable for DC's, since internal network open to innovation
- Main idea is to follow in footsteps of commodity servers
  - From fewer, big, less hackable Sun/IBM/etc boxen to many smaller, hackable i386/amd64 machines running Linux/FreeBSD/something Microsofty
  - Clear win for servers (~45% of DC budget), less so for networks (~15%) [%s from: greenberg, jan '09 ccr]
    - Is 15% large enough to care that much about optimisation (Amdahl strikes again)?
    - Alternatively, is 15% small enough that we can increase it to get features we want (iWARP, full, non-blocking 10GigE bi-section bandwidth, lower latencies, etc)?
- Similarly, Network Commoditisation => lots of similar, cheaper, simpler building blocks
  - i.e. many cheaper, (near-)identical switches with a single, common data rate
    - Favours Clos (Charles Clos) topologies such as the fashionable "fat-tree", i.e.:
      - Multi-rooted, wide trees with lots of redundancies to spread bandwidth of # of links
      - large number of equal throughput paths between distant nodes
      - switches with equivalent #'s of ports used throughout
      - 6 maximum hops from anywhere to anywhere in the system
      - scales massively
      - does not necessitate faster data rates further up the tree to avoid oversubscription

Fat-Trees

Size is defined by a factor k, the number of ports per identical switch in the network
3-level heirarchy:
- core level ((k/2)^2 = k^2/4 switches)
  - each core switch uses all k ports to connect to k switches in first layer of the pod level
- pod level (k pods)
  - each pod has 2 internal layers with (k/2 switches/layer => k switches/pod)
  - upper level switches (k/2 of them) connect k/2 of their ports to core level switches
    - other k/2 ports connect to each of the k/2 lower pod level switches
  - lower level switches (k/2 of them) connect to k/2 hosts each
- end host level (k^3/4 total hosts)

k	# hosts	# switches	# ports	host:switch ratio	host:port ratio
4	16	20	80	0.8	0.2
8	128	80	640	1.6	0.2
16	1,024	320	5,120	3.2	0.2
32	8,192	1,280	40,960	6.4	0.2
48	27,648	2,880	138,240	9.6	0.2
64	65,536	5,120	327,680	12.8	0.2
96	221,184	11,520	1,105,920	19.2	0.2
128	524,288	20,480	2,621,440	25.6	0.2

Fat-tree will have no oversubscription if network resources can be properly exploited ("rearrangeably non-blocking")
- i.e. for a network of 1GigE switches, there will always be 1Gbit available between two arbitrary hosts if the interconnects between them can be properly scheduled
  - ways of handling this include recomputation of routes based on load, randomizing core switch hops, etc
- take away: can max out all ports, but only if we're smart
- Al-Fare's SIGCOMM '08 paper shows > 80% utilisation under worst-case conditions

Fat-Tree vs. Hierarchical

Hierarchical limited by fastest core switch speed, Fat-tree is not so limited
- => Cannot get 10GigE bi-section bandwidth with hierarchical today, need a Fat-tree with 10-GigE switches
Example: MSR's Monsoon vs. UCSD's Fat-Tree commodity system
- Want network connecting many 1GigE nodes with no oversubscription
- MSR uses a hierarchical configuration (10GigE aggregation and core switches, 1GigE TOR switches)
- UCSD's uses identical, 24-port, commodity 1GigE switches (i.e. k = 48)
- Both theoretically capable of 1:1 oversubscription (i.e. no oversubscription)

	Hierarchical	Fat-tree
# hosts	25,920	27,648
# switches	108 x 144-port 10GigE 1,296 x 20-port 1GigE w/ 2x10GigE uplinks	2,880 x 48-port 1GigE
# wires	57,024 (~91% GigE, ~9% 10GigE)	82,944
# unique paths	144 (36 via core with 2x dual uplinks in each subtree)	572

Notes:
- 48-port 1GigE switches cost ~$2.5-3k
  - 2,880 * $2500 = $7M
- 20-port 1GigE switches w/ 10GigE uplinks probably cost about the same (~2.5-3k) [uplinks not commodity]
  - 1,296 * $2500 = $3.24M
- 144-port 10GigE switches advertised as $1500/port ($216k/switch) in mid-2007
  - to be competitive with fat-tree on per-port cost, price per port must drop 6.25x to $241.76 ($34.8k/switch)
  - 6.25x drop seems pretty close to the common 2-year drop in price
- If we were to make a 10GigE Fat-tree similar to the above, today it would cost (MSRP) about $20k/switch x 2,880 switches = $57.6M

Alternative Network Topologies

Ideas from supercomputing:
- Hypercubes
  - nCUBE hypercube machines, probably many others
- Torus's (?Tori?)
  - IBM Blue Gene connects tens of thousands of CPUs with high bandwidth (e.g. 380MB/sec with 4.5usec avg. ping-pongs - link)
Hosts connect to n neighbours and route amongst themselves
- Requires hosts to route frames
  - => higher latencies, unless we can do it on the NIC (NetFPGA?)
- High wiring complexity
  - no idea how this compares to already high complexity of hierarchical and, especially, fat-tree topologies
- May impose greater constraints on cluster geometry to appropriately establish links??
- No dedicated switching elements, simpler (electrically) point-to-point links

RAMCloud Requirements

latency
- Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
  - across 6 hops, that's > 3.6 usec
- Woven System's 144-port 10GigE switches advertise 1.6usec port-to-port latency (twice Arista's minimum)
  - => > 3.2usec in first two levels of hierarchy
- take away: sub-5usec is probably not currently possible
bandwidth
- 128 bytes / object * 1.0e6 objects/second = 122MBytes/sec (not including any packet overhead)
  - this is gigabit range... 10GigE vs. GigE may be a significant question:
    - Arista 48-port 10GigE's not commodity (~$20k/switch, vs. $2-3k/switch of commodity 1GigE)
    - But what if we have much bigger, hot objects on a machine?
      - Do we want to assume a single machine can always handle requests?
        
        e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
- Going beyond gigabit is still very costly
  - ~25k cluster with full 10GigE bi-section bandwidth would be ~$57M for switches (~7x the cost of gigabit)
  - if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports

Misc. Thoughts

If networking costs only small part of total DC cost, why is there oversubscription currently?
- it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
- but people argue that oversubscription leads to significant bottlenecks in real DCs
  - but, then, why aren't they reducing oversubscription from the get go?

Network substrate

Network Substate

Random Concepts and Terminology

Data Center Networks

Fat-Trees

Fat-Tree vs. Hierarchical

Alternative Network Topologies

RAMCloud Requirements

Misc. Thoughts