Child pages
  • Network substrate
Skip to end of metadata
Go to start of metadata

Network Substate

Random Concepts and Terminology

  • DC = "Data Center"
  • TOR = "Top of Rack"
    • TOR switches connect all machines within a rack and provide uplinks to higher network levels in the topology
      • TOR switches may use 1 or more ports to uplink. The ports may be of the same bandwidth as the hosts (e.g. gigabit) or higher (e.g. 10GigE)
    • DCs organize machines (~20-40) into racks, which all connect to a TOR switch, which in turn connects to some aggregation switch
  • EOR = "End of Row"
    • switch connecting a row of racks
    • appears to describe both one big switch handling all hosts for several racks, or an aggregation switch connecting to TORs of several racks.
  • hierarchical vs. fat-tree topologies:
    • hierarchical => different switches and bandwidths at the core than toward leaves
    • fat-tree => same switches everywhere, all at same bandwidth
  • Oversubscription - networks can be oversubscribed at various points.
    • Switches generally have enough backplane bandwidth to saturate all internal ports
    • But uplinks to other switches may be a fraction of this, reducing total bandwidth between hosts on different switches
      • using multiple ports and/or higher speed uplinks can mitigate this
    • Oversubscription often described as a ration (n:m), e.g. for every n megabits of bandwidth at a host port, only m megabits of bandwidth exist between it and the most distant machine in the network
  • cut-through vs. store-and-forward
    • switches can either buffer entire frames and retransmit out a port (store-and-forward) or stream them without fully buffering in-between (cut-through)
    • cut-through can provide better latency

Data Center Networks

  • Current data centers purported to be highly specialized
    • hierarchical network topologies with higher bandwidth aggregation and core switches/routers
      • that is, data rates increase up the tree to handle accumulation of bandwidth used by many, slower leaves further down
    • requires big, specialised switches to maintain reasonable bandwidth
      • e.g. 100+ 10GigE switches with >100 ports each, at the core
        • pricey... Woven Systems 144 ports 10GigE switch debuted at $1500/port in mid-2007
          • current cost unknown
    • oversubscription is purportedly common
      • mainly affects bi-section bandwidth (the data center isn't uniform - locality important, else lower bandwidth expectations)
      • implies congestion is possible, adding overhead for reliable protocols and packet latency
      • 2.5:1 to 8:1 ratios quoted by Al-Fares, et al ('08 SIGCOMM)
        • 2.5:1 means for every 2.5Gb at the end hosts, only 1Gb is allocated at the core (question)
        • a saturated network, therefore, cannot run all hosts at full rates
  • Current hot trend is commoditisation
    • Google does this internally, Microsoft/Yahoo/Amazon probably similarly smart about it
      • they've solved it, but either find it too important to share, or don't yet need SIGCOMM papers
    • Nothing is standard. Requires modifications to routing and/or address resolution protocols
      • hacks to L2 or L3 routing
        • L4 protocols generally oblivious
        • need to be careful about not excessively reordering packets
      • non-standard is reasonable for DC's, since internal network open to innovation
    • Main idea is to follow in footsteps of commodity servers
      • From fewer, big, less hackable Sun/IBM/etc boxen to many smaller, hackable i386/amd64 machines running Linux/FreeBSD/something Microsofty
      • Clear win for servers (~45% of DC budget), less so for networks (~15%) [%s from: greenberg, jan '09 ccr]
        • Is 15% large enough to care that much about optimisation (Amdahl strikes again)?
        • Alternatively, is 15% small enough that we can increase it to get features we want (iWARP, full, non-blocking 10GigE bi-section bandwidth, lower latencies, etc)?
    • Similarly, Network Commoditisation => lots of similar, cheaper, simpler building blocks
      • i.e. many cheaper, (near-)identical switches with a single, common data rate
        • Favours Clos (Charles Clos) topologies such as the fashionable "fat-tree", i.e.:
          • Multi-rooted, wide trees with lots of redundancies to spread bandwidth of # of links
          • large number of equal throughput paths between distant nodes
          • switches with equivalent #'s of ports used throughout
          • 6 maximum hops from anywhere to anywhere in the system
          • scales massively
          • does not necessitate faster data rates further up the tree to avoid oversubscription


  • Size is defined by a factor k, the number of ports per identical switch in the network
  • 3-level heirarchy:
    • core level ((k/2)^2 = k^2/4 switches)
      • each core switch uses all k ports to connect to k switches in first layer of the pod level
    • pod level (k pods)
      • each pod has 2 internal layers with (k/2 switches/layer => k switches/pod)
      • upper level switches (k/2 of them) connect k/2 of their ports to core level switches
        • other k/2 ports connect to each of the k/2 lower pod level switches
      • lower level switches (k/2 of them) connect to k/2 hosts each
    • end host level (k^3/4 total hosts)
  • k

    # hosts

    # switches

    # ports

    host:switch ratio

    host:port ratio

















































  • Fat-tree will have no oversubscription if network resources can be properly exploited ("rearrangeably non-blocking")
    • i.e. for a network of 1GigE switches, there will always be 1Gbit available between two arbitrary hosts if the interconnects between them can be properly scheduled
      • ways of handling this include recomputation of routes based on load, randomizing core switch hops, etc
    • take away: can max out all ports, but only if we're smart
    • Al-Fare's SIGCOMM '08 paper shows > 80% utilisation under worst-case conditions

Fat-Tree vs. Hierarchical

  • Hierarchical limited by fastest core switch speed, Fat-tree is not so limited
    • => Cannot get 10GigE bi-section bandwidth with hierarchical today, need a Fat-tree with 10-GigE switches
  • Example: MSR's Monsoon vs. UCSD's Fat-Tree commodity system
    • Want network connecting many 1GigE nodes with no oversubscription
    • MSR uses a hierarchical configuration (10GigE aggregation and core switches, 1GigE TOR switches)
    • UCSD's uses identical, 24-port, commodity 1GigE switches (i.e. k = 48)
    • Both theoretically capable of 1:1 oversubscription (i.e. no oversubscription)



    # hosts



    # switches

    108 x 144-port 10GigE
    1,296 x 20-port 1GigE w/ 2x10GigE uplinks

    2,880 x 48-port 1GigE

    # wires

    57,024 (~91% GigE, ~9% 10GigE)


    # unique paths

    144 (36 via core with 2x dual uplinks in each subtree)


    • Notes:
      • 48-port 1GigE switches cost ~$2.5-3k
        • 2,880 * $2500 = $7M
      • 20-port 1GigE switches w/ 10GigE uplinks probably cost about the same (~2.5-3k) [uplinks not commodity]
        • 1,296 * $2500 = $3.24M
      • 144-port 10GigE switches advertised as $1500/port ($216k/switch) in mid-2007
        • to be competitive with fat-tree on per-port cost, price per port must drop 6.25x to $241.76 ($34.8k/switch)
        • 6.25x drop seems pretty close to the common 2-year drop in price
      • If we were to make a 10GigE Fat-tree similar to the above, today it would cost (MSRP) about $20k/switch x 2,880 switches = $57.6M

Alternative Network Topologies

  • Ideas from supercomputing:
    • Hypercubes
      • nCUBE hypercube machines, probably many others
    • Torus's (?Tori?)
      • IBM Blue Gene connects tens of thousands of CPUs with high bandwidth (e.g. 380MB/sec with 4.5usec avg. ping-pongs - link- IBM says 6.4usec latency)
  • Hosts connect to n neighbours and route amongst themselves
    • Requires hosts to route frames
      • => higher latencies, unless we can do it on the NIC (NetFPGA?)
    • High wiring complexity
      • no idea how this compares to already high complexity of hierarchical and, especially, fat-tree topologies
    • May impose greater constraints on cluster geometry to appropriately establish links??
    • No dedicated switching elements, simpler (electrically) point-to-point links

L2 vs. L3 switching

  • L2 switches are cheaper and simpler, but L2 doesn't scale by default
    • Broadcast domains cannot get too large, else performance suffers
    • switches have MAC forwarding tables, but are often <16k entries in size and we want to scale beyond that
  • L3 switches are more expensive, but L3 scales more clearly
    • Set up localized subnets to restrict broadcast domains, or use VLANs
    • Route between subnets intelligently
  • To be workable in fat-trees, L3 needs work:
    • Need custom/non-standard software/algorithms to update routes to achieve maximum bandwidth and balance of links
  • So long as we're hacking, why not see if L2 can be made to work?
    • Ditch ARP, restrict broadcasts (bcasts can be specially handled out of the fast-path)
    • Use L2 source-routing and L2-in-L2 encapsulation - hosts transmit frames wrapped with destination MAC header of the TOR switch for the destination host
      • => switches need only know MACs of other switches, not all hosts - overcomes 16k MAC entry table limit
      • requires a directory service and host stack modifications
      • use VLB (Valiant Load Balancing) to maximise bandwidth
    • advantage of a flat address space and simpler, lower-level protocol
    • disadvantage of pushing changes into host protocol stacks (though IP is oblivious)

Importance of Programmability

  • Clear that to use commodity parts, at either L2 or L3, some modifications must be made
    • e.g. L3 routing to achieve maximum bandwidth, L2 stack changes to avoid broadcast issues, L2 VLB to maximize link utilisation
    • OpenFlow may be an important part of this
      • unknown how programmability of switches affects their 'commodity' status

RAMCloud Requirements and Network Effects

  • latency
    • Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
      • across 6 hops, that's > 3.6 usec
    • Woven System's 144-port 10GigE switches advertise 1.6usec port-to-port latency (almost 3x Arista's minimum)
      • => > 3.2usec in first two levels of hierarchy
    • Force-10 networks cut-through 10GigE latencies:
      • hops

        64byte pkt

        1500byte pkt


        351 nsec

        1500 nsec


        951 nsec

        2100 nsec


        1550 nsec

        2700 nsec

    • take away: sub-5usec is probably not currently possible across a data center
    • how important is RDMA (iWARP) for our goals?
      • if it necessitates 10GigE, it'd be costly (today)
      • unclear if custom (i.e. non-commodity) hardware required
        • Force-10 forsees iWARP done on NIC, like TOE (TCP Offload Engine) -link
  • bandwidth
    • 128 bytes / object * 1.0e6 objects/second = 122MBytes/sec (not including any packet overhead)
      • this is gigabit range... 10GigE vs. GigE may be a significant question:
        • Arista 48-port 10GigE's not commodity (~$20k/switch, vs. $2-3k/switch of commodity 1GigE)
        • But what if we have much bigger, hot objects on a machine?
          • Do we want to assume a single machine can always handle requests?
            • e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
    • Going beyond gigabit is still very costly
      • ~25k cluster with full 10GigE bi-section bandwidth would be ~$57M for switches (~7x the cost of gigabit)
      • if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports
  • load balancing
    • RAMCloud is expected to deal with many small, well-distributed objects
      • this may aid obtaining maximum bandwidth utilisation, since we can bounce packets randomly across potential paths
      • no idea if random is even close to optimal


  • congestion results in:
    • packet loss if buffers overflow
    • else, increased latency from waiting in line
    • which is worse?
      • if RAMCloud fast enough, occasional packet loss may not be horrible
      • buffering may cause undesired latencies/variability in latency
  • even if no oversubscription, congestion is still an issue
    • e.g. any time multiple flows funnel from several ports to one port (within the network) or host (at the leaves)
      • conceivable in RAMCloud, as we expect to communicate with many different systems
        • e.g.: could be a problem if client issues enough sufficiently large requests to a large set of servers
  • UDP has no congestion control mechanisms
    • connectionless, unreliable protocol probably essential for latency and throughput goals
    • need to avoid congestion how?
      • rely on user to stagger queries/reduce parallelism? [c.f. Facebook]
      • if we're sufficiently fast, will we run into these problems anyhow?
  • balaji's points: buffers don't scale with bandwidth increases
    • simply can't get 2x buffers with similar increase in bandwidth at high end
    • further, adding more bandwidth and keeping a reservation for temporary congestion is better than adding buffers
      • especially for RAMCloud - reduces latency
      • is this an argument against commodity (at least, against a pure commodity fat-tree)?
  • ECN - Explicit Congestion Notification
    • already done by switches - set bit in IP TOS header if nearing congestion, with greater probability as we approach saturation
    • mostly for sustained flow traffic
      • RAMCloud expects lots of small datagrams, rather than flows

"Data Center Ethernet"

  • Cisco: "collection of standards-based extensions to classical Ethernet that allows data center architects to create a data center transport layer that is:"
    • stable
    • lossless
    • efficient
  • Purpose is apparently to buck the trend of building multiple application-specific networks (IP, SAN, Infiniband, etc)
    • how? better multi-tenacy (traffic class isolation/prioritisation), guaranteed delivery (lossless transmission), layer-2 multipath (higher bisectional bandwidth) 
  • A series of additional standards:
    • "Class-based flow control" (CBFC)
      • for multi-tenancy
    • Enhanced transmission selection (ETS)
      • for multi-tenancy
    • Data center bridging exchange protocol (DCBCXP)
    • Lossless Ethernet
      • for guaranteed delivery
    • Congestion notification
      • end-to-end congestion management to avoid dropped frames (i.e. work around TCP congestion collapse, retrofit non-congestion-aware protocols to not cause trouble(question) )

In the field

  • Google
    • Use long-lived TCP connections
      • pre-established and left open to avoid handshake overhead
      • unclear how TCP has been tweaked for low-latency environment (retransmit timeouts, etc)

Misc. Thoughts

  • If networking costs only small part of total DC cost, why is there oversubscription currently?
    • it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
    • but people argue that oversubscription leads to significant bottlenecks in real DCs
      • but, then, why aren't they reducing oversubscription from the get go?
  • No labels