...

  • DC = "Data Center"
  • TOR = "Top of Rack"
    • TOR switches connect all machines within a rack and provide uplinks to higher network levels in the topology
      • TOR switches may use 1 or more ports to uplink. These ports may run at the same bandwidth as the host ports (e.g. gigabit) or higher (e.g. 10GigE)
    • DCs organize machines (~20-40) into racks, which all connect to a TOR switch, which in turn connects to some aggregation switch
  • Oversubscription - networks can be oversubscribed at various points.
    • Switches generally have enough backplane bandwidth to saturate all internal ports
    • But uplinks to other switches may be a fraction of this, reducing total bandwidth between hosts on different switches
      • using multiple ports and/or higher speed uplinks can mitigate this
    • Oversubscription is often described as a ratio (n:m), e.g. for every n megabits of bandwidth at a host port, only m megabits of bandwidth exist between it and the most distant machine in the network (see the worked example after this list)
  • cut-through vs. store-and-forward
    • switches can either buffer entire frames and retransmit out a port (store-and-forward) or stream them without fully buffering in-between (cut-through)
    • cut-through can provide better latency
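
A worked example of the oversubscription ratio described above, as a minimal Python sketch. The rack size and uplink counts are illustrative assumptions, not measurements from any particular data center:

    # Back-of-envelope oversubscription calculation (hypothetical numbers).
    # A ratio of n:1 means that for every n bits/s of host-facing bandwidth
    # below a switch, only 1 bit/s of uplink bandwidth exists toward the
    # rest of the network.

    def oversubscription(hosts_per_rack, host_gbps, uplink_ports, uplink_gbps):
        host_bw = hosts_per_rack * host_gbps    # aggregate host-facing bandwidth
        uplink_bw = uplink_ports * uplink_gbps  # aggregate uplink bandwidth
        return host_bw / uplink_bw              # n in the n:1 ratio

    # Example: 40 gigabit hosts behind a TOR with two 10GigE uplinks.
    ratio = oversubscription(hosts_per_rack=40, host_gbps=1,
                             uplink_ports=2, uplink_gbps=10)
    print("%.1f:1 oversubscribed" % ratio)      # -> 2.0:1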

Data Center Networks

  • Current data centers purported to be highly specialized
    • hierarchical network topologies with higher bandwidth aggregation and core switches/routers
      • that is, data rates increase up the tree to handle accumulation of bandwidth used by many, slower leaves further down
    • requires big, specialised switches to maintain reasonable bandwidth
      • e.g. 100+ 10GigE switches with >100 ports each, at the core
        • pricey... Woven Systems' 144-port 10GigE switch debuted at $1500/port in mid-2007
          • current cost unknown
    • oversubscription is purportedly common
      • mainly affects bisection bandwidth (the data center isn't uniform - locality matters, else expect lower bandwidth)
      • implies congestion is possible, adding overhead for reliable protocols and packet latency
      • 2.5:1 to 8:1 ratios quoted by Al-Fares, et al ('08 SIGCOMM)
        • 2.5:1 means for every 2.5Gb of bandwidth at the end hosts, only 1Gb is allocated at the core (?)
        • a saturated network, therefore, cannot run all hosts at full rates
  • Current hot trend is commoditisation
    • Google does this internally, Microsoft/Yahoo/Amazon probably similarly smart about it
      • they've solved it, but either find it too important to share, or don't yet need SIGCOMM papers
    • Nothing is standard. Requires modifications to routing and/or address resolution protocols
      • hacks to L2 or L3 routing
        • L4 protocols generally oblivious
        • need to be careful about not excessively reordering packets
      • non-standard is reasonable for DCs, since the internal network is open to innovation
    • Main idea is to follow in footsteps of commodity servers
      • From fewer, big, less hackable Sun/IBM/etc boxen to many smaller, hackable i386/amd64 machines running Linux/FreeBSD/something Microsofty
      • Clear win for servers (~45% of DC budget), less so for networks (~15%) [percentages from Greenberg, Jan '09 CCR]
        • Is 15% large enough to care that much about optimisation (Amdahl strikes again)?
        • Alternatively, is 15% small enough that we can increase it to get features we want (iWARP, full non-blocking 10GigE bisection bandwidth, lower latencies, etc)?
    • Similarly, Network Commoditisation => lots of similar, cheaper, simpler building blocks
      • i.e. many cheaper, (near-)identical switches with a single, common data rate
        • Favours Clos (Charles Clos) topologies such as the fashionable "fat-tree" (sized in the sketch after this list), i.e.:
          • Multi-rooted, wide trees with lots of redundant paths to spread bandwidth across a large number of links
          • large number of equal throughput paths between distant nodes
          • switches with equivalent #'s of ports used throughout
          • 6 maximum hops from anywhere to anywhere in the system
          • scales massively
          • does not necessitate faster data rates further up the tree to avoid oversubscription
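
The standard fat-tree construction (Al-Fares et al., SIGCOMM '08) makes the scaling claim above concrete. A small sketch of the sizing formulas, assuming identical k-port switches at every level:

    # Sizing a 3-level fat-tree built entirely from identical k-port switches
    # (formulas from the Al-Fares et al. SIGCOMM '08 construction).

    def fat_tree(k):
        assert k % 2 == 0, "k must be even"
        pods = k                               # one pod per port of a core switch
        edge = agg = k // 2                    # edge and aggregation switches per pod
        core = (k // 2) ** 2                   # core switches
        hosts = k ** 3 // 4                    # k/2 hosts per edge switch
        switches = pods * (edge + agg) + core
        return hosts, switches, core

    # Example: commodity 48-port switches, all links at a single data rate.
    hosts, switches, core = fat_tree(48)
    print(hosts, switches, core)   # -> 27648 hosts, 2880 switches, 576 core
    # Worst-case path: host-edge-agg-core-agg-edge-host = 6 hops.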

...

  • latency
    • Arista 48-port 10GigE switches advertise a minimum of 600nsec latency (no idea what the distribution looks like)
      • across 6 hops, that's > 3.6 usec
    • Woven Systems' 144-port 10GigE switches advertise 1.6usec port-to-port latency (almost 3x Arista's minimum)
      • => > 3.2usec just in the first two levels of the hierarchy
    • take away: sub-5usec is probably not currently possible
    • how important is RDMA (iWARP) for our goals?
      • if it necessitates 10GigE, it'd be costly
      • unclear if custom (i.e. non-commodity) hardware required
        • Force-10 forsees iWARP done on NIC, like TOE (TCP Offload Engine)
  • bandwidth
    • 128 bytes / object * 1.0e6 objects/second = 122MBytes/sec, not including any packet overhead (arithmetic in the sketch after this list)
      • this is gigabit range... 10GigE vs. GigE may be a significant question:
        • Arista 48-port 10GigE's not commodity (~$20k/switch, vs. $2-3k/switch of commodity 1GigE)
        • But what if we have much bigger, hot objects on a machine?
          • Do we want to assume a single machine can always handle requests?
            • e.g. 10KByte object => max. ~12,500 requests/sec on gigabit
    • Going beyond gigabit is still very costly
      • a ~25k-host cluster with full 10GigE bisection bandwidth would be ~$57M for switches (~7x the cost of gigabit)
      • if 10GigE not needed, but gigabit not enough, may be cheaper to dual-home machines and increase total # of ports
  • load balancing
    • RAMCloud is expected to deal with many small, well-distributed objects
      • this may help achieve maximum bandwidth utilisation, since we can bounce packets randomly across the potential paths
      • no idea if random is even close to optimal
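
The latency and bandwidth figures above reduce to simple arithmetic; the sketch below just multiplies out the advertised per-switch latencies and the per-object bandwidth numbers. All inputs are vendor claims or assumptions taken from the bullets above, not measurements:

    # Back-of-envelope arithmetic for the latency and bandwidth bullets above.

    # Latency: advertised per-switch figures times the hop counts discussed.
    arista_min_ns = 600    # Arista 48-port 10GigE, advertised minimum
    woven_ns = 1600        # Woven Systems 144-port 10GigE, port-to-port
    print(6 * arista_min_ns / 1e3, "usec over 6 hops (Arista minimum)")  # 3.6
    print(2 * woven_ns / 1e3, "usec over 2 levels (Woven)")              # 3.2

    # Bandwidth: 1M 128-byte objects/sec per server, ignoring packet overhead.
    print(round(128 * 1.0e6 / 2**20), "MBytes/sec")     # ~122, gigabit range

    # Larger objects cap the request rate of a single gigabit-attached host.
    gigabit_bytes_per_sec = 1e9 / 8
    print(gigabit_bytes_per_sec / 10e3, "requests/sec for 10KByte objects")  # 12500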

Congestion

  • congestion results in:
    • packet loss if buffers overflow
    • else, increased latency from waiting in line
    • which is worse?
      • if RAMCloud fast enough, occasional packet loss may not be horrible
      • buffering may cause undesired latencies/variability in latency
  • even if no oversubscription, congestion is still an issue
    • e.g. any time multiple flows funnel from several ports to one port (within the network) or host (at the leaves)
      • conceivable in RAMCloud, as we expect to communicate with many different systems
        • e.g.: could be a problem if client issues enough sufficiently large requests to a large set of servers
  • UDP has no congestion control mechanisms
    • connectionless, unreliable protocol probably essential for latency and throughput goals
      • how do we avoid congestion, then? (one client-side option is sketched after this list)
        • rely on the user to stagger queries/reduce parallelism? [c.f. Facebook]
      • if we're sufficiently fast, will we run into these problems anyhow?
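
One client-side take on the staggering idea above, sketched minimally: cap the number of outstanding UDP requests per client so a burst to many servers cannot funnel into a single congested port. send_request and wait_for_any_reply are hypothetical placeholders, not RAMCloud's actual transport interface:

    # Sketch of client-side staggering: limit outstanding requests so one
    # client can't dump an unbounded burst onto a large set of servers.
    # `send_request` and `wait_for_any_reply` are hypothetical placeholders.

    def issue_requests(requests, send_request, wait_for_any_reply, window=8):
        outstanding = set()
        replies = []
        for req in requests:
            if len(outstanding) >= window:
                # Wait for some server to reply before adding more load.
                done = wait_for_any_reply(outstanding)
                outstanding.discard(done)
                replies.append(done)
            outstanding.add(send_request(req))
        while outstanding:
            done = wait_for_any_reply(outstanding)
            outstanding.discard(done)
            replies.append(done)
        return replies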

"Data Center Ethernet"

  • XXX TODO XXX

Misc. Thoughts

  • If networking costs only small part of total DC cost, why is there oversubscription currently?
    • it's possible to pay more and reduce oversubscription - cost doesn't seem the major factor
    • but people argue that oversubscription leads to significant bottlenecks in real DCs
      • but, then, why aren't they reducing oversubscription from the get go?