
Homa Transport Design Notes

 

Motivations Of This Work:

We want to design this transport because:

  1. Infiniband reliable connections, the main RAMCloud transport, have limitations such as:

    • Scalability

    • Not commodity

  2. We’d like to design a transport mechanism that:

    • Fits the datacenter network fabric well

    • Is tailored for RPC systems such as the RAMCloud RPC mechanism

 

Objectives Of This Work:

  1. Low latency:

    • No latency overhead for short messages in the presence of large messages and high network utilization

    • As close as possible to hardware limits

    • Minimal network buffer usage

  2. Scalability:

    • The transport should ideally allow and facilitate millions of client connections per server

    • Minimal per-client state

  3. Congestion control:

    • It should be possible to achieve high network utilization without sacrificing the latency of short messages or causing congestion collapse.

 

Network Assumptions:

The scope of this transport is limited, and the transport is expected to work properly only if the assumptions below hold:

  1. The network provides full bisection bandwidth.

  2. Packets take random paths to their destination (i.e., packet-level load balancing).

    • Together, assumptions 1 and 2 allow us not to worry about persistent congestion in the core of the network.

  3. The network has a low-latency fabric: high throughput (i.e., 10/40/100 Gb/s) and extremely low switching times. Network switches provide a few priority levels (e.g., 8 priorities for Ethernet and 16 for Infiniband).

 

Homa Congestion Control:

 

Idea behind Homa scheme:

With the previous assumptions in place, we can claim that the primary point of congestion will be the top-of-rack switch queue near the receiver. When multiple senders transmit packets to a single receiver, a queue can start to build up in the Top Of Rack (TOR) buffer close to the receiver. Homa needs to manage that queue and avoid congestion at that point. To that end, the receiver appears to be the right place to implement the congestion control logic because:

  1. It knows the sizes of all messages being transmitted to it

  2. It observes all the traffic coming from the TOR buffer over its inbound link

Therefore, for Homa we choose to use receiver-side congestion control, as follows:

  1. The sender sends a request packet specifying the number of bytes in the message it wants to transmit.

  2. The receiver gets the request packet and sends the sender one grant per packet time, allowing the sender to send a single packet for each grant packet. Grant packets specify the number of bytes the sender may send in one packet, and the receiver continues sending grant packets until all bytes of that sender’s message have been granted.
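A minimal sketch of this request/grant exchange may help make the roles concrete. The packet formats and field names (msg_id, msg_bytes, grant_bytes, MAX_DATA_BYTES) are illustrative assumptions for this sketch, not Homa’s actual wire format:

```python
from dataclasses import dataclass

# Illustrative packet formats (not Homa's actual wire format).
@dataclass
class RequestPacket:
    msg_id: int      # identifies the message at the receiver
    msg_bytes: int   # total bytes the sender wants to transmit

@dataclass
class GrantPacket:
    msg_id: int
    grant_bytes: int  # bytes the sender may send in one data packet

MAX_DATA_BYTES = 1470  # assumed per-packet payload size

def grants_for(request: RequestPacket):
    """Yield one grant per packet time until every byte of the message is granted."""
    remaining = request.msg_bytes
    while remaining > 0:
        chunk = min(MAX_DATA_BYTES, remaining)
        remaining -= chunk
        yield GrantPacket(request.msg_id, chunk)

# Example: granting a 4000-byte message produces three grants (1470, 1470, 1060 bytes).
for g in grants_for(RequestPacket(msg_id=1, msg_bytes=4000)):
    print(g)
```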

 

Achieving low latency in Homa:

Being able to preempt longer messages in favor of shorter ones is the key to achieving low latency. To achieve this preemption, we use two features of the Homa transport:

 

Using grants to preempt:

When multiple senders are trying to send messages to a receiver, the receiver favors the shortest message (i.e., the one with the fewest bytes left to send) over longer ones by granting it first. At each grant time, among all messages awaiting transmission, the receiver chooses the message with the fewest remaining bytes to grant and sends one grant for that message.
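A sketch of this selection rule, assuming the receiver keeps a map from each inbound message to its remaining ungranted bytes (the data structure and names are illustrative):

```python
def pick_message_to_grant(remaining_to_grant):
    """Grant the message with the fewest remaining bytes to grant (SRPT-style).
    remaining_to_grant maps msg_id -> bytes still waiting for a grant."""
    candidates = {m: rem for m, rem in remaining_to_grant.items() if rem > 0}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)

# Example: the 500-byte message is granted ahead of the 40 KB message.
print(pick_message_to_grant({"msgA": 40_000, "msgB": 500}))  # -> msgB
```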

 

Unscheduled bytes compensating one RTT overhead:

In the form described so far, each message takes at least one RTT longer to complete than the minimum required time, because the sender first has to send a request packet and wait for the first grant before it can start transmitting the actual bytes of the message. To avoid this extra RTT of latency, we allow each sender to send one RTT’s worth of unscheduled bytes. As an example, for a 10 Gb/s network link and a 15 µs RTT, the unscheduled byte limit is 15e-6 * 10e9 / 8 = 18750 bytes ≈ 18.3 KB. This means each sender is allowed to send the first 18.3 KB of each message right after the request packet, without waiting for a grant.

Sending unscheduled bytes is especially important for short messages, because one RTT of overhead is approximately a 100% increase in their latency.
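The limit is simply the bandwidth-delay product; a small sketch of the arithmetic above (the link speed and RTT are the example values from the text, not fixed parameters of the design):

```python
def unscheduled_byte_limit(link_gbps: float, rtt_seconds: float) -> int:
    """One RTT's worth of bytes at the given link speed (bandwidth-delay product)."""
    return int(rtt_seconds * link_gbps * 1e9 / 8)

# 10 Gb/s link, 15 us RTT -> 18750 bytes (~18.3 KB) of unscheduled data per message.
print(unscheduled_byte_limit(10, 15e-6))  # -> 18750
```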

 

Unscheduled packets cause queue build up:

When multiple senders each send one RTT of unscheduled packets to a receiver, a queue builds up at the receiver’s TOR buffer. This queue is a direct result of the unscheduled bytes, and it persists as long as there are outstanding messages at the receiver and the receiver continues to grant. This queue build-up is undesirable because:

  1. Buffering adversely affects latency

  2. Buffering limits preemption

 

Resolving the buffer build up issue:

The idea behind the solution to the buffer build-up caused by unscheduled packets is simple: defer sending grants and let the TOR queue deplete. The receiver realizes that a queue has built up at the TOR and defers the next grant for T seconds, such that when the scheduled packet corresponding to this grant arrives at the TOR buffer, the queue has just drained to zero. T must be chosen by the receiver such that:

  1. The queue depletes to zero.

  2. The receiver’s link does not pass unnecessary bubbles (idle time).

This idea is implemented using a Traffic Pacer at the receiver:

 

Traffic Pacer:

The Traffic Pacer is the module that keeps track of the outstanding bytes traveling toward the receiver, recognizes queue build-up at the TOR, and reacts by timing grants so that the queue can deplete. How the traffic pacer works:

  1. Each sender specifies, in the request packet of a message, the number of unscheduled bytes that follow the request.

  2. When the request arrives at the receiver, the traffic pacer knows how many bytes are outstanding: i.e., the sum of outstanding unscheduled and scheduled bytes.

  3. The traffic pacer enforces a cap of one RTT’s worth of bytes on the total number of outstanding bytes. When the outstanding bytes would overflow the cap, the traffic pacer delays the next grant until the total outstanding bytes (including the next grant’s size) stay below the cap.

Rationale of the traffic pacer: if we ensure that at any point in time there is at most one RTT’s worth of bytes outstanding in the network toward the receiver, that guarantees (assuming no variation in the RTT) that no queue builds up in the network and no bubbles are passed on the links.
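A simplified sketch of the pacer’s bookkeeping, assuming the receiver learns the unscheduled byte count from each request and is notified of every data-packet arrival (the class and method names are illustrative, not the actual implementation):

```python
class TrafficPacer:
    """Tracks bytes outstanding toward this receiver and caps them at one RTT."""

    def __init__(self, rtt_bytes: int):
        self.rtt_bytes = rtt_bytes  # cap: one RTT's worth of bytes
        self.outstanding = 0        # unscheduled + granted bytes not yet received

    def on_request(self, unsched_bytes: int):
        # The request announces how many unscheduled bytes will follow it.
        self.outstanding += unsched_bytes

    def on_data_arrival(self, nbytes: int):
        # Bytes that have arrived are no longer outstanding.
        self.outstanding -= nbytes

    def may_grant(self, grant_bytes: int) -> bool:
        # A grant is delayed while it would push outstanding bytes over the cap.
        return self.outstanding + grant_bytes <= self.rtt_bytes

    def record_grant(self, grant_bytes: int):
        # Once a grant is sent, its bytes count as outstanding.
        self.outstanding += grant_bytes
```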

 

Using priorities for preemption:

Homa relies on a few network priority levels to allow preemption of large messages and to favor short messages. However, priorities are limited and scarce, and we should be very conservative in using them.

Possible uses for priorities:

  1. Higher priorities for short messages

  2. Higher priorities for unscheduled traffic

  3. Utilizing multiple priorities within unscheduled traffic

  4. Utilizing multiple priorities within scheduled traffic

 

Priorities within the unscheduled packets:

The unscheduled packets need to be at higher priority than scheduled packets for multiple reasons:

  1. Messages shorter than one RTT are sent entirely as unscheduled bytes

  2. The request packet must be delivered to the receiver at the highest priority

  3. Transmitting the unscheduled bytes at low priority might delay them, and since the receiver has no control over them, this can waste bandwidth and complicate retransmissions at the receiver.

We have run experiments using multiple priority-assignment schemes based on the message size and byte distributions. Assuming we have N priorities to use within the unscheduled packets, the best scheme seems to depend on the message size distribution (i.e., the workload type), but assigning an equal number of bytes to each priority works well in most cases. In this method, we have the size distribution of messages, so we can compute the cumulative byte distribution and assign priorities to message-size ranges such that an equal number of bytes is transmitted at each priority.
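A sketch of the equal-bytes assignment, assuming the workload is given as (message size, probability) pairs; the function name and input format are illustrative:

```python
def unsched_priority_cutoffs(size_dist, num_prios):
    """Split message sizes into num_prios ranges so that each range carries
    roughly an equal share of the total bytes. size_dist is a list of
    (message_size, probability) pairs; returns one upper size bound per priority."""
    sizes = sorted(size_dist)
    total_bytes = sum(s * p for s, p in sizes)
    cutoffs, cum_bytes = [], 0.0
    share = total_bytes / num_prios
    for size, prob in sizes:
        cum_bytes += size * prob
        # Close out another priority level each time a further 1/num_prios of bytes accumulates.
        while len(cutoffs) < num_prios - 1 and cum_bytes >= share * (len(cutoffs) + 1):
            cutoffs.append(size)
    cutoffs.append(sizes[-1][0])  # the last priority covers the largest sizes
    return cutoffs

# Example with 4 priorities; a heavy-tailed workload pushes most cutoffs toward the large sizes.
print(unsched_priority_cutoffs([(100, 0.5), (1000, 0.3), (10000, 0.15), (100000, 0.05)], 4))
```

Messages up to the first cutoff would then be sent at the highest unscheduled priority, and so on down to the lowest unscheduled priority for the largest sizes.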

 

Priorities within scheduled packets:

The priorities of scheduled packets are specified in the grant packets sent by the receiver. The receiver uses an adaptive priority scheme for scheduled packets. The first grant is sent with the lowest priority level, and we continue sending grants at this priority until we need to preempt for a new grant: the new grant belongs to the message with the fewest remaining bytes to grant among all scheduled messages, and some of the outstanding bytes at lower priority would overshoot the 1 RTT outstanding-bytes cap.

There is another subtlety in the scheduled priority assignment: we continue sending grants based on the adaptive scheme only until the last 1 RTT of remaining bytes to grant. For the last 1 RTT of remaining bytes of each message, we switch to the same priority scheme used for unscheduled packets.
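A very rough sketch of this decision, purely to show the shape of the logic; the preemption test, the priority numbering (smaller number = higher priority), and the helper for the unscheduled scheme are assumptions for this sketch rather than the actual algorithm:

```python
def scheduled_grant_priority(remaining_to_grant, rtt_bytes, lowest_sched_prio,
                             current_prio, need_preempt, unsched_prio_for):
    """Choose the priority to place in the next grant (illustrative only).
    - For the last RTT of a message's bytes, reuse the unscheduled scheme.
    - Otherwise stay at the lowest scheduled priority, and move one level up
      only when a shorter message must preempt outstanding lower-priority bytes."""
    if remaining_to_grant <= rtt_bytes:
        return unsched_prio_for(remaining_to_grant)  # same scheme as unscheduled packets
    if need_preempt:
        return max(current_prio - 1, 0)  # smaller number = higher priority (assumed)
    return lowest_sched_prio
```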

 

Many Receiver Problem

Problem Description:

In a scenario with many receivers, a sender might want to send to multiple receivers at the same time; if multiple receivers commit their bandwidth to that sender by sending it grants, receiver bandwidth can be wasted, since the sender can only send one packet at a time. On the other hand, if the sender chooses to send a request to only one of the receivers and that receiver is busy granting to other senders, then the sender’s bandwidth will be wasted. This is a distributed matching problem that must be solved so as to minimize bandwidth wastage at both the sender and receiver links.

So far in designing Homa, we have focused only on the receiver side, putting the bulk of the transport logic there without much attention to the sender side. Solving this problem, though, may require elaborate schemes at the sender side, which we start to discuss here.

 

Straw Man solutions:

Two straw-man approaches for the sender are:

  • The sender observes the priorities assigned by the receivers and sends packets at whatever priority they are assigned. In this scheme, the sender simply has a packet output queue sorted by the priority given to each packet by the receiver logic (i.e., based on whether the packet is unscheduled or scheduled, how big the message is, etc.). The sender sends the packet at the head of this queue at each packet time. This scheme amounts to packet-level round robin among packets of the same priority belonging to different messages. The biggest flaws in this case:

    1. Round robin increases the completion time of all messages

    2. Round robin wastes receiver bandwidth when the sender sends requests to multiple receivers. In the common, lightly loaded case, the corresponding receivers commit to that sender by sending grants back, but the sender can only send to one receiver at any point in time.

  • SMF (Shortest Message First): This is a greedy approach where the sender chooses the shortest message among all of its messages and sends unscheduled bytes for that message (a sketch of this policy appears after this list). In the common case, it will have received grants for this message by the time the last unscheduled byte is sent, so it can complete that message. This is in fact a run-to-completion approach: we choose one message and complete its transmission before sending any packet of any other message. This scheme resolves both problems of the previous scheme, but it has other flaws:

    • Imagine a sender Sx sends unscheduled bytes to a receiver Rx and expects Rx to send grants back, but the grants are late, either because Rx is busy granting to an even shorter flow or because a queue at the TOR near Rx has delayed every packet. In such a case the sender is going to waste bandwidth and blow bubbles in its link, so it needs to give up at some point and send requests to other receivers.

    Ideally we need to compute a matching between the senders and receivers, but with very little information. Is there any way to probe the receivers to learn about their available bandwidth?
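A sketch of the SMF policy from the second bullet above; the per-message state and the send_packet callback are illustrative stand-ins, not part of the design:

```python
def smf_transmit(bytes_left, send_packet, packet_bytes=1470):
    """Greedy Shortest Message First: repeatedly pick the outgoing message with
    the fewest bytes left and run it to completion before starting another.
    bytes_left maps msg_id -> bytes remaining; send_packet(msg_id, n) stands in
    for putting one packet of that message on the wire."""
    while any(rem > 0 for rem in bytes_left.values()):
        msg = min((m for m, rem in bytes_left.items() if rem > 0),
                  key=lambda m: bytes_left[m])
        while bytes_left[msg] > 0:  # run to completion
            chunk = min(packet_bytes, bytes_left[msg])
            send_packet(msg, chunk)
            bytes_left[msg] -= chunk

# Example: the 2 KB message completes before the 1 MB message is started.
smf_transmit({"big": 1_000_000, "small": 2_000}, lambda m, n: None)
```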

 

STF scheme: blow bubbles at the sender to avoid bubbles at the receiver

An interesting question is what to do when there is a bubble at the sender. That is, the sender starts sending one message and would like to run it to completion; it waits for a grant to come back, but the grant is late. The sender might stall waiting for the grant for that message, even though it has other messages it could be sending. The question is: does the sender hang tight and allow a bubble in its output link, hoping the grant is coming soon, or does it go ahead and start sending a second message? In the latter case the sender might end up receiving grants from two receivers, and since it cannot transmit to both of them simultaneously, it ends up blowing bubbles on the receiver side.

Presumably you are willing to stall for a small amount of time for the missing grant to come in, but how long should you wait? And if we start sending the next message and don’t get a grant for that one either, do we start a third? Whichever way the sender goes, it risks wasting network bandwidth on either the sender or the receiver side.

 

Possibility of negative grants:

  • So far, we have only talked about positive grants, where the receiver tells the sender when it is ready to receive packets. We might want to introduce a notion of negative grants, or NACKs, from the receiver: when the receiver gets a request and is busy, it can send a NACK to the sender that basically says it will be busy for the following amount of time, so the sender knows how large a gap to expect before it can potentially get a grant from that receiver, and it can do other useful work in the meantime.

  • Similarly, say the sender is in the middle of transmitting a very large message and a new small message appears. According to the STF policy at the sender, the sender would like to preempt the large message and complete the small one. At this point maybe the sender should send a NACK to the receiver of the large message, informing it that the sender is going to send another message for some amount of time, which blows a bubble in that receiver’s link. Some discussion of this scenario:

    1. If we don’t want to blow a bubble at the receiver, the sender has to continue sending packets for the large message for one RTT after it sends the NACK, before preempting for the shorter message. On the other hand, if we don’t preempt immediately for the short message, the effect on its stretch can be quite large; delaying preemption can add up to 100% overhead to the relative stretch of the short message. The choice here is between the latency of the shorter request and bandwidth utilization at the receiver.

    2. Or perhaps the receiver should slightly over-schedule itself, so that there is always at least a little buffering on the receiver side (e.g., in the receiver’s TOR queue); then the sender can immediately preempt for the shorter message without introducing bubbles in the link of the preempted message’s receiver. We don’t necessarily want buffering, since buffering adds latency, but one possibility is for the receiver to adjust the buffering based on the load. When the receiver is underloaded, it doesn’t really matter if a sender introduces a bubble, because it isn’t taking away bandwidth from anybody else; but if the receiver is overloaded, you might actually want to nudge the buffering up a little to make sure its link absolutely never goes idle when a sender preempts.

 

Receiver sharing bandwidth with multiple senders:

Another idea (which might not be a good one) is for the receiver to share its bandwidth among multiple senders by sending grants to several of them. The disadvantage is that the receiver is round-robinning, so everybody’s completion time increases; but if you are bandwidth limited, your response time is limited by how efficiently you use bandwidth anyhow (though who said this is an efficient use of bandwidth?). The advantage is that the receiver has less to lose when a sender suddenly stops sending: the receiver still discovers that only after one RTT, but it gave only half of its bandwidth to that sender.

A major problem with this idea is that it slows down the common case. We would be solving a problem that happens less frequently than the problem we cause; it occasionally helps but often hurts. A blanket rule that says to give only 50% of your bandwidth to the highest-priority flow hurts that flow, because a flow that is small from the receiver’s perspective is, in the common case, likely to be small from the sender’s perspective as well. This idea wouldn’t make sense at all if you are very bandwidth constrained.

 

Two types of solutions:

Reputation scheduling

Suppose the receiver had some way of comparing its senders and a measure of confidence in each of them. From the Rx’s perspective, every time it sends a grant, it would like to send it to a sender (Sx) that will definitely send the data; if an Sx occasionally doesn’t send data, even if it is the highest-priority Sx, then the Rx should mix grants to that Sx with grants to the next preferred Sx. Imagine that every time an Sx misses sending a packet for a grant, the Rx degrades its reputation (the Rx keeps a reputation score for each Sx). The Rx then uses that score to decide whether to send grants exclusively to the highest-priority Sx or to share grants with other senders. This approach uses past information to compute a score for each Sx and applies it to future grants, so the question is: is there any reason to believe senders will behave consistently? A long-term reputation score might not be helpful in this case.

 

Message passing and probing:

Senders and receivers pass messages to give each other information that is as accurate as possible, and they make decisions based on it. For example, NACKs and grants as described above could be one way to achieve this information passing. We make this information as precise as we can, and we come up with policies for scheduling at the senders and receivers based on it.

  • On the receiver side, the Rx is receiving all of its scheduled packets, but suddenly a NACK comes in saying the next k slots are going to be empty because the sender decided to preempt. What does the receiver do with this information? If k is shorter than one RTT, the receiver can’t do anything about it, and those k slots are burned opportunities for the receiver. But the Rx can make sure it will not see bubbles during those k slots if it maintains a small amount of buffering (at least k packets) at its TOR. In this case, the receiver loses the k packets of buffering and should send k grants to replenish the buffer. Note that the NACK packet should be sent at the highest priority to make sure it arrives early and doesn’t get stuck at the back of a queue.

    The question is how much buffering we want in the Rx TOR queue. It depends on load: if the Rx is using only 20% of its input bandwidth, then we probably don’t want any buffering at all, but if it is using 90% of its bandwidth, then it probably wants some buffering at the TOR.

  • On the sender side, the Sx has been receiving grants, but all of a sudden it gets a NACK from the Rx saying there will be no grants for some time. First of all, having the Rx send back a NACK specifying how long it cannot send grants could be necessary, so the Sx can send a request to its next preferred Rx and tell it how much it can send, since this Rx is a backup (secondary Rx). You might want to let the secondary Rx know: don’t send any grants at all, since you are my secondary Rx; I just want to trickle packets to you whenever I get the chance. The concern here is that going down this path can result in a very complicated scheme.

    What if we had a notion of primary senders and receivers? The model here is that there is a matching between senders and receivers: every sender and receiver has a primary counterpart. For a primary sender/receiver pair, the protocol is exactly the same as before, meaning the sender gets a grant from its primary receiver and sends packets to it. Any packet not sent on a primary pair is treated like an unscheduled packet, so it won’t cause new grants to be issued. Now, if the sender gets an empty slot (a NACK from the receiver), it picks its second-best receiver (i.e., its secondary receiver) and sends it a packet; from the perspective of the secondary receiver, it should just treat that packet like an unscheduled packet. But how many packets does the sender send? If at some point the sender gets too many empty slots from its primary receiver, it should try to change its primary receiver. The part that remains to be solved is whether we can find these matchings.

    If we’re talking about cases where all messages are smaller than one RTT, then this problem might not have any solution, and mistakes are short-lived anyway. If every sender and receiver has stable preferences, these matchings can be found in a few round trips (i.e., for long transmissions). In fact, the problem this primary matching is trying to solve arises when you have sets of messages that are multiple RTTs long: the matching should first of all converge to the right matching, and all the packets we send to secondary receivers are our best effort to keep the network utilized while we’re trying to find the primary receiver. And again, because senders are allowed to send to secondary receivers and also to preempt for shorter messages, the receivers should keep some buffering at the TOR to keep bandwidth utilized if bubbles are being passed.

    Another concern is that such an algorithm might not even be stable. Suppose the receiver is getting a long message and we intentionally introduce buffer build-up equal to one RTT, so we let some other long sender (#2) send a few packets that just get buffered in the switch; we don’t want those packets from sender #2 to interrupt our primary sender #1, so we tell our secondary sender #2 to send at a lower priority. Now the problem is that when primary sender #1 finishes, secondary sender #2 is supposed to become primary, and we want to tell sender #3 to become secondary, with packets from sender #3 at an even lower priority. So every time we move to a new sender, we keep dropping to a new, lower priority. What we would really like to do is re-prioritize those packets, and the TOR switch is the right place to implement that.
