Network Substrate
Random Concepts and Terminology
- DC = "Data Center"
- TOR = "Top of Rack"
- TOR switches connect all machines within a rack and provide uplinks to higher network levels in the topology
- TOR switches may use 1 or more ports to uplink. The ports may be of the same bandwidth as the hosts (e.g. gigabit) or higher (e.g. 10GigE)
- DCs organize machines (~20-40) into racks, which all connect to a TOR switch, which in turn connects to some aggregation switch
- EOR = "End of Row"
- switch connecting a row of racks
- the term appears to describe either one big switch handling all hosts for several racks, or an aggregation switch connecting to the TORs of several racks
- hierarchical vs. fat-tree topologies:
- hierarchical => different switches and bandwidths at the core than toward leaves
- fat-tree => same switches everywhere, all at same bandwidth
- Oversubscription - networks can be oversubscribed at various points.
- Switches generally have enough backplane bandwidth to saturate all internal ports
- But uplinks to other switches may be a fraction of this, reducing total bandwidth between hosts on different switches
- using multiple ports and/or higher speed uplinks can mitigate this
- Oversubscription is often described as a ratio (n:m), e.g. for every n megabits of bandwidth at a host port, only m megabits of bandwidth exist between it and the most distant machine in the network (sketch below)
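A back-of-the-envelope sketch of that ratio in Python; every number below is made up:

    # 40 gigabit hosts behind a TOR with two 10GigE uplinks
    host_ports = 40
    host_gbps = 1.0
    uplink_gbps = 2 * 10.0

    demand = host_ports * host_gbps      # 40 Gb/s the hosts can offer
    ratio = demand / uplink_gbps         # 2.0 -> 2:1 oversubscription
    print("oversubscription %g:1" % ratio)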
- cut-through vs. store-and-forward
- switches can either buffer entire frames and retransmit out a port (store-and-forward) or stream them without fully buffering in-between (cut-through)
- cut-through can provide better latency (sketch below)
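Rough per-hop math in Python; frame/header sizes and hop count are illustrative, not from any particular switch:

    # Store-and-forward pays full-frame serialization at every hop;
    # cut-through pays only enough to read the header.
    def serialize_us(nbytes, gbps):
        return nbytes * 8 / (gbps * 1e3)   # bytes at Gb/s -> microseconds

    hops = 3                                # e.g. TOR -> EOR -> TOR
    sf = hops * serialize_us(1500, 10.0)    # full 1500B frame: ~3.6 us
    ct = hops * serialize_us(64, 10.0)      # ~64B header peek:  ~0.15 us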
...
- congestion results in:
- packet loss if buffers overflow
- else, increased latency from waiting in line
- which is worse?
- if RAMCloud is fast enough, occasional packet loss may not be horrible
- buffering may cause undesired latencies/variability in latency
- even if no oversubscription, congestion is still an issue
- e.g. any time multiple flows funnel from several ports to one port (within the network) or host (at the leaves)
- conceivable in RAMCloud, as we expect to communicate with many different systems
- e.g. could be a problem if a client issues enough sufficiently large requests to a large set of servers (arithmetic below)
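Back-of-the-envelope arithmetic for the fan-in case; all numbers are invented:

    # 50 servers answer one client at once; the burst converges on the
    # single switch port in front of the client.
    servers = 50
    response_bytes = 8 * 1024              # 8 KB per response
    port_buffer = 128 * 1024               # buffering behind that port

    burst = servers * response_bytes       # 400 KB arriving together
    dropped = max(0, burst - port_buffer)  # ~272 KB overflows -> loss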
- UDP has no congestion control mechanisms
- connectionless, unreliable protocol probably essential for latency and throughput goals
- need to avoid congestion how?
- rely on the user to stagger queries / reduce parallelism? [cf. Facebook] (sketch after this list)
- if we're sufficiently fast, will we run into these problems anyhow?
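One way to read "stagger queries": cap the number of outstanding requests on the client side. A sketch; send_request and recv_response are hypothetical stand-ins for the transport layer:

    def fetch_all(keys, window=8):
        pending, outstanding, results = list(keys), set(), {}
        while pending or outstanding:
            while pending and len(outstanding) < window:
                key = pending.pop()
                send_request(key)            # fire a UDP request (hypothetical)
                outstanding.add(key)
            key, value = recv_response()     # block for any response (hypothetical)
            outstanding.discard(key)
            results[key] = value
        return results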
- Balaji's points: buffers don't scale with bandwidth increases
- at the high end, you simply can't get 2x the buffers alongside a similar increase in bandwidth (numbers below)
- further, adding more bandwidth and keeping a reservation for temporary congestion is better than adding buffers
- especially for RAMCloud - reduces latency
- is this an argument against commodity (at least, against a pure commodity fat-tree)?
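Balaji's scaling point in numbers (hypothetical 16 MB shared buffer):

    # A fixed buffer absorbs a shrinking burst window as links speed up.
    buffer_mb = 16
    for gbps in (1, 10, 40):
        ms = buffer_mb * 8.0 / gbps       # MB * 8 / (Gb/s) -> milliseconds
        print("%2d Gb/s: buffer covers %.1f ms of burst" % (gbps, ms))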
- ECN - Explicit Congestion Notification
- already done by switches - they set a bit in the IP TOS field when nearing congestion, with greater probability as the queue approaches saturation (RED-style sketch below)
- mostly for sustained flow traffic
- RAMCloud expects lots of small datagrams, rather than flows
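The marking behavior described above is roughly RED; a sketch with invented thresholds:

    import random

    # Mark with probability rising linearly between two queue-depth
    # thresholds; min_th, max_th, and max_p are all invented here.
    def maybe_mark(queue_len, min_th=20, max_th=80, max_p=0.5):
        if queue_len < min_th:
            return False                   # no congestion hint
        if queue_len >= max_th:
            return True                    # always mark near saturation
        return random.random() < max_p * (queue_len - min_th) / (max_th - min_th)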
"Data Center Ethernet"
- Cisco: "collection of standards-based extensions to classical Ethernet that allows data center architects to create a data center transport layer that is:"
- stable
- lossless
- efficient
- Purpose is apparently to buck the trend of building multiple application-specific networks (IP, SAN, InfiniBand, etc)
- how? better multi-tenancy (traffic class isolation/prioritization), guaranteed delivery (lossless transmission), layer-2 multipath (higher bisection bandwidth)
- A series of additional standards:
- "Class-based flow control" (CBFC)
- for multi-tenancy
- Enhanced transmission selection (ETS)
- for multi-tenancy
- Data center bridging exchange protocol (DCBCXP)
- Lossless Ethernet
- for guaranteed delivery
- Congestion notification
- end-to-end congestion management to avoid dropped frames (i.e. work around TCP congestion collapse, retrofit non-congestion-aware protocols so they don't cause trouble)
- "Class-based flow control" (CBFC)
In the field
- Google
- Use long-lived TCP connections
- pre-established and left open to avoid handshake overhead (sketch below)
- unclear how TCP has been tweaked for the low-latency environment (retransmit timeouts, etc.)
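A minimal sketch of the pre-established connection idea; the server address list is hypothetical, and TCP_NODELAY is thrown in since latency is the concern:

    import socket

    class ConnPool(object):
        def __init__(self, servers):
            self.conns = {}
            for addr in servers:                 # addr = (host, port); pay the
                s = socket.create_connection(addr)  # handshake once, up front
                s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
                self.conns[addr] = s

        def send(self, addr, payload):
            self.conns[addr].sendall(payload)    # no per-request setup cost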
Misc. Thoughts
- If networking costs only small part of total DC cost, why is there oversubscription currently?
- it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
- but people argue that oversubscription leads to significant bottlenecks in real DCs
- but, then, why aren't they reducing oversubscription from the get-go?