
Homa Transport Design Notes

  

Motivations Of This Work:

We want to design this transport because:

  1. Infiniband reliable connections, currently the main RAMCloud transport, have limitations such as:

    • Limited scalability

    • Not commodity hardware

  2. We’d like to design a transport mechanism that:

    • Fits well with the datacenter network fabric

    • Is tailored for RPC systems such as the RAMCloud RPC mechanism

Objectives Of This Work:

  1. Low latency:

    • No latency overhead for short messages, even in the presence of large messages and high network utilization

    • As close as possible to hardware limits

    • Minimal network buffer usage

  2. Scalability:

    • The transport should ideally allow and facilitate millions of client connections per server

    • Minimal per-client state

  3. Congestion control:

    • It should be possible to achieve high network utilization without sacrificing the latency of short messages or causing congestion collapse.

Network Assumptions:

The scope of this transport is limited, and the transport is expected to work properly only if the assumptions below hold:

  1. The network provides full bisection bandwidth

  2. Packets choose random paths to reach the destination (i.e. packet-level load balancing)

    • Assumptions 1 and 2 practically allow us not to worry about persistent congestion in the core of the network.

  3. The network has a low-latency fabric: high throughput (i.e. 10/40/100 Gb/s) and extremely low switching times

  4. Network switches provide a few priority levels (e.g. 8 priorities for Ethernet and 16 priorities for Infiniband)

Homa Congestion Control

Idea behind Homa scheme:

With the previous assumptions in place, we can claim that the primary point of congestion will be the top-of-rack (TOR) switch queue near the receiver. When multiple senders transmit packets to a single receiver, a queue can start to build up in the TOR buffer close to the receiver. Homa needs to manage that queue and avoid congestion at that point. To that end, the receiver appears to be the right place to implement the congestion control logic because:

  1. It knows the sizes of all messages being transmitted to it

  2. It observes all the traffic coming from the TOR buffer through its inbound link

Therefore, for Homa we choose to use Receiver Side Congestion Control as follows:

  1. The sender sends a request packet specifying the number of bytes in the message it wants to transmit.

  2. The receiver gets the request packet and sends one grant every packet time to the sender, allowing the sender to transmit a single packet for each grant packet. Each grant packet specifies the number of bytes the sender may send in one packet, and the receiver keeps sending grant packets until all bytes of the message have been granted (a sketch of this exchange follows the list).
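
The following is a minimal sketch of this request/grant exchange. The packet layout, field names, and packet size used here are illustrative assumptions, not the actual Homa wire format:

    # Illustrative sketch of the request/grant flow (field names are assumptions).
    MAX_PACKET_BYTES = 1500   # assumed data bytes carried per packet/grant

    def sender_request(msg_len):
        """Sender announces a new message and its total length."""
        return {"type": "REQUEST", "msg_bytes": msg_len}

    def receiver_grant_loop(request):
        """Receiver emits one grant per packet time until the whole message is granted."""
        remaining = request["msg_bytes"]
        while remaining > 0:
            grant_bytes = min(MAX_PACKET_BYTES, remaining)
            yield {"type": "GRANT", "bytes": grant_bytes}   # one grant per packet time
            remaining -= grant_bytes

    # Example: a 4000-byte message is granted as 1500 + 1500 + 1000 bytes.
    grants = list(receiver_grant_loop(sender_request(4000)))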

Achieving low latency in Homa:

Being able to preempt longer messages in favor of shorter ones is the key to achieving low latency. In order to achieve this preemption we utilize two features of the Homa transport:

Using grants to preempt:

When multiple senders are trying to send messages to a receiver, the receiver favors the shortest message (i.e. the message with the fewest bytes left to send) over longer ones by granting it first. At each grant time, among all messages outstanding for transmission, the receiver chooses the message with the fewest remaining bytes to grant and sends one grant for that message.
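
A small sketch of this selection rule (a shortest-remaining-bytes-first policy); the in-memory representation of outstanding messages below is an assumption for illustration:

    # Pick the message with the fewest bytes left to grant.
    def pick_message_to_grant(outstanding):
        """outstanding: list of dicts like {"id": ..., "bytes_left_to_grant": ...}."""
        candidates = [m for m in outstanding if m["bytes_left_to_grant"] > 0]
        if not candidates:
            return None
        # Shortest remaining bytes first: shorter messages preempt longer ones.
        return min(candidates, key=lambda m: m["bytes_left_to_grant"])

    msgs = [{"id": "A", "bytes_left_to_grant": 50000},
            {"id": "B", "bytes_left_to_grant": 3000}]
    assert pick_message_to_grant(msgs)["id"] == "B"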

Unscheduled bytes to compensate for the one-RTT overhead:

In this form, completion of each message takes at least one RTT longer than the minimum required time, because for each message the sender first has to send a request packet and wait for the first grant before it can start transmitting the actual bytes of the message. This adds one RTT of extra latency to every message. To avoid it, we allow each sender to send one RTT’s worth of unscheduled bytes. As an example, for a 10 Gb/s network link in a network whose RTT is 15 us, the unscheduled byte limit is 15e-6 * 10e9 / 8 = 18750 bytes ≈ 18.3 KB. This means each sender is allowed to send the first 18.3 KB of each message right after the request packet, without waiting for a grant.
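
The limit is simply the bandwidth-delay product of the receiver’s link; a quick check of the numbers above:

    # Bandwidth-delay product: bytes that fit "in flight" during one RTT.
    def unscheduled_byte_limit(link_gbps, rtt_seconds):
        return link_gbps * 1e9 * rtt_seconds / 8

    print(unscheduled_byte_limit(10, 15e-6))   # 18750.0 bytes, about 18.3 KB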

Sending unscheduled bytes is especially important for short messages, because for them the 1 RTT overhead amounts to roughly a 100% latency increase.

Unscheduled packets cause queue build up:

When multiple senders each send one RTT of unscheduled packets to a receiver, a queue builds up at the receiver’s TOR buffer. This queue builds up directly as a result of the unscheduled bytes, and it persists as long as there are outstanding messages at the receiver and the receiver continues to grant. This queue build-up is undesirable because:

  1. Buffering adversely affects latency

  2. Buffering limits preemption

Resolving the buffer build up issue:

The idea behind the solution to the buffer build-up caused by unscheduled packets is simple: defer sending grants and let the TOR queue drain. The receiver realizes that a queue has built up at the TOR and defers the next grant by T seconds, such that when the next scheduled packet corresponding to this new grant arrives at the TOR buffer, the queue size has just reached zero. T must be chosen by the receiver such that:

  1. The queue depletes to zero.

  2. The receiver’s link does not pass an unnecessary bubble (idle time).

This idea is implemented using a Traffic Pacer at the receiver:

 

Traffic Pacer:

The Traffic Pacer is the module that keeps track of the bytes outstanding toward the receiver, recognizes queue build-up at the TOR, and reacts by timing the grants so that the queue can drain. The traffic pacer works as follows:

  1. Each sender specifies, in the request packet of a message, the number of unscheduled bytes that follow the request.

  2. When the request arrives at the receiver, the traffic pacer knows how many bytes are outstanding: i.e. the sum of outstanding unscheduled and scheduled bytes.

  3. The traffic pacer enforces a cap of one RTT’s worth of bytes on the total number of outstanding bytes. When the outstanding bytes would overflow the cap, the traffic pacer delays the next grant until the total outstanding bytes (including the next grant’s size) stay below the cap.

Rationale of the traffic pacer: if we ensure that at any point in time there is at most 1 RTT’s worth of bytes outstanding in the network, that guarantees (assuming no variation in the RTT) that no queue builds up in the network and no bubble is passed on the links.
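
Below is a minimal sketch of this pacing logic, assuming an event-driven receiver that tracks outstanding bytes; the class, its fields, and the constants are illustrative, not the RAMCloud implementation. The grant_delay() method corresponds to the deferral time T described earlier:

    # Illustrative traffic pacer: cap outstanding bytes at one RTT worth of data.
    LINK_BPS = 10e9
    RTT = 15e-6
    CAP_BYTES = LINK_BPS * RTT / 8        # bandwidth-delay product (18750 bytes here)
    MAX_PACKET_BYTES = 1500

    class TrafficPacer:
        def __init__(self):
            self.outstanding = 0          # unscheduled + granted-but-not-yet-received bytes

        def on_request(self, unsched_bytes):
            # The request announces the unscheduled bytes already on the wire.
            self.outstanding += unsched_bytes

        def on_data_arrival(self, nbytes):
            self.outstanding -= nbytes

        def grant_delay(self, grant_bytes=MAX_PACKET_BYTES):
            """Seconds to defer the next grant so outstanding bytes stay under the cap."""
            excess = self.outstanding + grant_bytes - CAP_BYTES
            if excess <= 0:
                return 0.0                # under the cap: grant immediately
            # Wait until the excess bytes have drained through the receiver's link.
            return excess * 8 / LINK_BPS

        def send_grant(self, grant_bytes=MAX_PACKET_BYTES):
            self.outstanding += grant_bytes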

Using priorities for preemption:

Homa relies on a few network priority levels to allow preemption of large messages in favor of short ones. However, priorities are limited and scarce, and we should be very conservative in using them.

Possible uses for priorities:

  1. Higher priorities for short messages

  2. Higher priorities for unscheduled traffic

  3. Utilizing multiple priorities within unscheduled traffic

  4. Utilizing multiple priorities within scheduled traffic

Priorities within the unscheduled packets:

The unscheduled packets need to be sent at a higher priority than scheduled packets for multiple reasons:

  1. Messages shorter than one RTT are sent entirely as unscheduled bytes

  2. The request packet must be delivered to the receiver at the highest priority

  3. Transmitting the unscheduled bytes at low priority might delay them; since the receiver has no control over them, this can waste bandwidth and complicate retransmissions at the receiver.

We have run experiments using multiple priority assignment schemes based on the message size distribution and the corresponding distribution of bytes over message sizes. Assuming we have N priorities to use within the unscheduled packets, the best scheme appears to depend on the message size distribution (i.e. the workload type), but placing an equal number of bytes on each priority seems to work quite well in most cases. In this method, since we know the size distribution of messages, we can compute the cumulative byte distribution and assign priorities to message sizes such that an equal number of bytes is transmitted at each priority.
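
A sketch of this equal-bytes-per-priority assignment, assuming the message size distribution is given as a list of (size, probability) pairs; the representation and function name are illustrative:

    # Compute size cutoffs so that each priority carries roughly equal bytes.
    def priority_cutoffs(size_dist, num_prios):
        """size_dist: list of (msg_size, probability) pairs, sorted by msg_size.
        Returns cutoffs; messages up to cutoffs[i] use priority i (0 = highest)."""
        total_bytes = sum(size * p for size, p in size_dist)
        per_prio = total_bytes / num_prios        # equal bytes on each priority
        cutoffs, acc = [], 0.0
        for size, p in size_dist:
            acc += size * p
            while len(cutoffs) < num_prios - 1 and acc >= per_prio * (len(cutoffs) + 1):
                cutoffs.append(size)
        cutoffs.append(float("inf"))              # last priority takes the remainder
        return cutoffs

    # Toy workload: shorter messages land on higher (lower-numbered) priorities.
    dist = [(100, 0.5), (1000, 0.3), (10000, 0.15), (100000, 0.05)]
    print(priority_cutoffs(dist, 4))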

Priorities within scheduled packets:

The priorities of scheduled packets are specified in the grant packets sent by the receiver. The receiver uses an adaptive priority scheme for scheduled packets. The first grant is sent at the lowest priority level, and we continue sending grants at that priority until we need to preempt for the next grant: that is, the new grant belongs to the message with the fewest remaining bytes to grant among all scheduled messages, and the bytes outstanding at the lower priority would overshoot the 1 RTT outstanding-bytes cap.

There is another subtlety in the scheduled priority assignment: we continue sending grants based on the adaptive scheme only until the last 1 RTT of bytes remains to be granted for a message. For each message’s last 1 RTT of remaining bytes to grant, we switch to the same priority scheme used for unscheduled packets.
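
A rough sketch of the intent of this adaptive rule, with the preemption test reduced to the two conditions stated above; the function, its arguments, and the priority numbering are assumptions for illustration only:

    # Choose the priority for the next scheduled grant (0 = highest, 7 = lowest).
    LOWEST_PRIO = 7

    def scheduled_grant_priority(current_prio, is_shortest_remaining,
                                 outstanding_below_bytes, cap_bytes):
        """Grants normally stay at the current (initially lowest) priority; the priority
        is raised only when this grant is for the shortest-remaining scheduled message
        and the bytes outstanding at lower priority would overshoot the 1 RTT cap."""
        if is_shortest_remaining and outstanding_below_bytes > cap_bytes:
            return max(0, current_prio - 1)   # preempt: move one level up
        return current_prio

    # Example: a short message preempts while 25 KB is outstanding under an 18.75 KB cap.
    print(scheduled_grant_priority(LOWEST_PRIO, True, 25000, 18750))   # -> 6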
