Network Substrate
Random Concepts and Terminology
- DC = "Data Center"
- TOR = "Top of Rack"
- TOR switches connect all machines within a rack and provide uplinks to higher network levels in the topology
- TOR switches may use 1 or more ports to uplink. The ports may be of the same bandwidth as the hosts (e.g. gigabit) or higher (e.g. 10GigE)
- DCs organize machines (~20-40) into racks, which all connect to a TOR switch, which in turn connects to some aggregation switch
- EOR = "End of Row"
- switch connecting a row of racks
- the term appears to describe either one big switch handling all hosts for several racks, or an aggregation switch connecting to the TORs of several racks
- hierarchical vs. fat-tree topologies:
- hierarchical => different switches and bandwidths at the core than toward leaves
- fat-tree => same switches everywhere, all at same bandwidth
- Oversubscription - networks can be oversubscribed at various points.
- Switches generally have enough backplane bandwidth to saturate all internal ports
- But uplinks to other switches may be a fraction of this, reducing total bandwidth between hosts on different switches
- using multiple ports and/or higher speed uplinks can mitigate this
- Oversubscription is often described as a ratio (n:m), e.g. for every n megabits of bandwidth at a host port, only m megabits of bandwidth exist between it and the most distant machine in the network (sketch below)
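A back-of-the-envelope sketch of that ratio in Python; every number below is made up:

    # 40 gigabit hosts behind a TOR with two 10GigE uplinks
    host_ports = 40
    host_gbps = 1.0
    uplink_gbps = 2 * 10.0

    demand = host_ports * host_gbps      # 40 Gb/s the hosts can offer
    ratio = demand / uplink_gbps         # 2.0 -> 2:1 oversubscription
    print("oversubscription %g:1" % ratio)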
- cut-through vs. store-and-forward
- switches can either buffer entire frames and retransmit out a port (store-and-forward) or stream them without fully buffering in-between (cut-through)
- cut-through can provide better latency (sketch below)
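Rough per-hop math in Python; frame/header sizes and hop count are illustrative, not from any particular switch:

    # Store-and-forward pays full-frame serialization at every hop;
    # cut-through pays only enough to read the header.
    def serialize_us(nbytes, gbps):
        return nbytes * 8 / (gbps * 1e3)   # bytes at Gb/s -> microseconds

    hops = 3                                # e.g. TOR -> EOR -> TOR
    sf = hops * serialize_us(1500, 10.0)    # full 1500B frame: ~3.6 us
    ct = hops * serialize_us(64, 10.0)      # ~64B header peek:  ~0.15 us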
...
- congestion results in:
- packet loss if buffers overflow
- else, increased latency from waiting in line
- which is worse?
- if RAMCloud is fast enough, occasional packet loss may not be horrible
- buffering may cause undesired latencies/variability in latency
- even if no oversubscription, congestion is still an issue
- e.g. any time multiple flows funnel from several ports to one port (within the network) or host (at the leaves)
- conceivable in RAMCloud, as we expect to communicate with many different systems
- e.g. could be a problem if a client issues enough sufficiently large requests to a large set of servers (arithmetic below)
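Back-of-the-envelope arithmetic for the fan-in case; all numbers are invented:

    # 50 servers answer one client at once; the burst converges on the
    # single switch port in front of the client.
    servers = 50
    response_bytes = 8 * 1024              # 8 KB per response
    port_buffer = 128 * 1024               # buffering behind that port

    burst = servers * response_bytes       # 400 KB arriving together
    dropped = max(0, burst - port_buffer)  # ~272 KB overflows -> loss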
- UDP has no congestion control mechanisms
- connectionless, unreliable protocol probably essential for latency and throughput goals
- need to avoid congestion how?
- rely on the user to stagger queries / reduce parallelism? [cf. Facebook] (sketch after this list)
- if we're sufficiently fast, will we run into these problems anyhow?
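One way to read "stagger queries": cap the number of outstanding requests on the client side. A sketch; send_request and recv_response are hypothetical stand-ins for the transport layer:

    def fetch_all(keys, window=8):
        pending, outstanding, results = list(keys), set(), {}
        while pending or outstanding:
            while pending and len(outstanding) < window:
                key = pending.pop()
                send_request(key)            # fire a UDP request (hypothetical)
                outstanding.add(key)
            key, value = recv_response()     # block for any response (hypothetical)
            outstanding.discard(key)
            results[key] = value
        return results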
- Balaji's points: buffers don't scale with bandwidth increases
- at the high end, you simply can't get 2x the buffers alongside a similar increase in bandwidth (numbers below)
- further, adding more bandwidth and keeping a reservation for temporary congestion is better than adding buffers
- especially for RAMCloud - reduces latency
- is this an argument against commodity (at least, against a pure commodity fat-tree)?
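Balaji's scaling point in numbers (hypothetical 16 MB shared buffer):

    # A fixed buffer absorbs a shrinking burst window as links speed up.
    buffer_mb = 16
    for gbps in (1, 10, 40):
        ms = buffer_mb * 8.0 / gbps       # MB * 8 / (Gb/s) -> milliseconds
        print("%2d Gb/s: buffer covers %.1f ms of burst" % (gbps, ms))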
- ECN - Explicit Congestion Notification
- already done by switches - they set a bit in the IP TOS field when nearing congestion, with greater probability as the queue approaches saturation (RED-style sketch below)
- mostly for sustained flow traffic
- RAMCloud expects lots of small datagrams, rather than flows
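The marking behavior described above is roughly RED; a sketch with invented thresholds:

    import random

    # Mark with probability rising linearly between two queue-depth
    # thresholds; min_th, max_th, and max_p are all invented here.
    def maybe_mark(queue_len, min_th=20, max_th=80, max_p=0.5):
        if queue_len < min_th:
            return False                   # no congestion hint
        if queue_len >= max_th:
            return True                    # always mark near saturation
        return random.random() < max_p * (queue_len - min_th) / (max_th - min_th)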
"Data Center Ethernet"
- Cisco: "collection of standards-based extensions to classical Ethernet that allows data center architects to create a data center transport layer that is:"
- stable
- lossless
- efficient
- Purpose is apparently to buck the trend of building multiple application-specific networks (IP, SAN, InfiniBand, etc)
- how? better multi-tenancy (traffic class isolation/prioritization), guaranteed delivery (lossless transmission), layer-2 multipath (higher bisection bandwidth)
- A series of additional standards:
- "Class-based flow control" (CBFC)
- for multi-tenancy
- Enhanced transmission selection (ETS)
- for multi-tenancy
- Data center bridging exchange protocol (DCBCXP)
- Lossless Ethernet
- for guaranteed delivery
- Congestion notification
- end-to-end congestion management to avoid dropped frames (i.e. work around TCP congestion collapse, retrofit non-congestion-aware protocols so they don't cause trouble)
- "Class-based flow control" (CBFC)
In the field
- Google
- Use long-lived TCP connections
- pre-established and left open to avoid handshake overhead (sketch below)
- unclear how TCP has been tweaked for the low-latency environment (retransmit timeouts, etc.)
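A minimal sketch of the pre-established connection idea; the server address list is hypothetical, and TCP_NODELAY is thrown in since latency is the concern:

    import socket

    class ConnPool(object):
        def __init__(self, servers):
            self.conns = {}
            for addr in servers:                 # addr = (host, port); pay the
                s = socket.create_connection(addr)  # handshake once, up front
                s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
                self.conns[addr] = s

        def send(self, addr, payload):
            self.conns[addr].sendall(payload)    # no per-request setup cost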
Misc. Thoughts
- If networking costs only small part of total DC cost, why is there oversubscription currently?
- it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
- but people argue that oversubscription leads to significant bottlenecks in real DCs
- but, then, why aren't they reducing oversubscription from the get-go?