Homa Transport Design Notes

Motivations Of This Work:

Well known transport mechanisms have many limitation for low latency and datacenters communications:
- TCP and its variants are primarily designed to achieve high throughput by building up buffers and sacrificing latency.
- Infiniband achieves low latency but is not commodity and IB Reliable connections have scalability issues in number of allowed connection per server.
We like to design a transport mechanism that addresses these issues:
- Fits well with commodity datacenter network fabric
- Achieves high throughput and low latency of short requests
- Tailored for RPC systems like RAMCloud RPC mechanism

Objectives Of This Work:

Low latency:
- No latency overhead for short messages in presence of large messages and high network utilization
- Performance as close as possible to hardware limits (ie. minimal algorithmic and software overhead)
- Minimal network buffer usage as buffer build up complicates preemption and increases latency
Scalablity:
- Transport should ideally allow and facilitate presence of millions of client connections per server
- Minimal per client state
Congestion Control
- possible to achieve high network utilization without sacrificing latency of short messages or causing congestion collapse.

Network Assumptions:

The scope of this transport is limitted and transport is expected to work properly only if assumptions below hold:

Network is assumed to be full bisection bandwidth
Packets choose random paths to reach destination (ie. packet level load balancing)
- 1 and 2 practically allow us not to worry about persistent congestion in the core of the network.
Network has low latency fabric: High throughput (ie. 10/40/100 Gb/s) and extremely low switching times . Network switch provides few priority levels (eg. 8 prios for ethernet and 16 prios for Infiniband)

Key Ideas and Contributions

Receivers schedule inbound traffic on a packet-by-packet basis
- Can preempt large messages to favor short ones
Unsched traffic to hide the road trip latency
Traffic pacer to prevent buffer build up
Priorities for scheduled traffic
Priorities for unscheduled traffic

Homa Congestion Control:

Idea behind Homa scheme:

With the previous assumptions held, we can claim the primary point of congestion is going to be at the top of the rack switch queue near the receiver. When multiple senders are transmitting packets to a single receiver, a queue can start to build up the Top Of Rack (TOR) buffer close to the receiver. Homa needs to manage that queue and avoid congestion at the point. To that end, receiver appears to be right place to implement the congestion control logic because:

It knows all the message size being transmitted to it
It observes all the traffic coming from the TOR buffer through its inbound link

Therefore, for Homa we choose to use Receiver Side Congestion Control as follows:

Sender sends a request packet specifying number of bytes in the mesg it want to transmit.
Receiver gets the request packet, and sends one grant every packet time to the sender allowing the sender to send one single packet for each grant pkt. Grant packets specify number of bytes sender can send in one packet and receiver continues sending grant packets until all bytes are granted for that sender.

Achieving low latency in Homa:

Being able to preempt longer requests for shorter ones is the key for achieving low latency. In order to achieve this preemption we utilize two features of the Homa Transport:

Using grants to preempt:

When multiple senders trying to send multiple messages to a receiver, the receivers favors shortest request (ie message that needs to send less bytes) over longer ones by granting it first. At any grant time, among all outstanding requests for transmission, receiver chooses the message with shortest bytes remained to grant and send one grant for that message.

Unscheduled bytes compensating one RTT overhead:

In this current form, completion of each message takes at least one RTT more time than minimum required time because for each message, the sender first has to send one request packet and wait for the first grant until it can start transmitting the actual bytes for the message. This causes one RTT extra latency for each message. To avoid this one RTT extra latency we allow each sender to send one RTT worth of unscheduled bytes. As an example, for a 10Gb/s network link, in a network that RTT is 15us, the unscheduled bytes limit will be 15*1e-6*10*1e9/8 = 18750bytes = 18.3KB. This means each sender is allowed to send the first 18.3KB of each message right after the request packet without waiting for a grant.

Sending unscheduled bytes are particularly important for short messages because 1RTT overhead is approximately 100% latency increase.

Unscheduled packets cause queue build up:

When multiple senders each send one RTT unscheduled packets to a receiver, this causes causes a queue to build up at the receiver’s TOR buffer. This queue directly builds up as a result of unscheduled bytes and it persistently stays as long as there are outstanding messages at the receiver and the receiver continues to grant. This queue build up is undesirable because:

Buffering causes head of line blocking which increases latency
Buffering limits preemption such that even if the algorithm preempt large messages for shorter ones at end host, packets of short message still will be blocked behind buffered packets.

Resolving the buffer build up issue:

The idea behind the solution to buffer build up because of the unscheduled packets is simple: defer sending grants and let the TOR queue deplete. Receive keeps track of number of all outstanding scheduled and unscheduled bytes. When outstanding bytes rises above 1RTT bytes, the receiver realizes a pkt queue is being build up at the TOR. At the times of buffer build up, the it then defers the next grant for T seconds to allow the TOR buffer to deplete. Receiver chooses T such that when the next scheduled packet associated to the new grant arrives at the TOR buffer, the queue size at the buffer has just gone to zero. Therefore:

The TOR queue depletes to zero.
The receiver link doesn’t pass bubble (ie. receiver inbound link doesn’t go idle)

This idea implemented using a Traffic Pacer at the receiver:

Traffic Pacer:

The Traffic Pacer is the module that keeps track of outstanding bytes traveling to the receiver and recognizes the queue build up at the TOR and then it reacts accordingly by timing the grants properly to let the queue deplete. How the Traffic Pacer works:

Each sender specifies in the request packet of a message the number of unsched. bytes that follows the request.
When the request arrives at receiver, traffic pacer knows how many bytes are outstanding: ie sum of outstanding unsched bytes and scheduled bytes
The traffic pacer forces one RTT bytes cap on the total number of outstanding bytes. When the outstanding bytes overflows from the cap, the traffic pacer delays the next grant until the total outstanding bytes plus (including the next grant size) stays below the cap.

Rational of traffic pacer: If we ensure at any point of time, there is 1RTT worth of bytes outstanding in the network coming to the receiver, that guarantees (if there is no variation in the RTT) that we have no queue built up in network and no bubble will be passed on the links.

Using priorities for preemption:

Homa relies on few network priority levels to allow preemption of large messages and favor short messages. However priorities are limited and scarce and we should be very conservative in using them.

Possible uses for priorities:

Higher priorities for short messages
Higher priorities for unscheduled traffic
Utilizing multiple priorities within unscheduled traffic
Utilizing multiple priorities within scheduled traffic

Priorities within the unscheduled packets:

The unscheduled packets of a message need to be at higher priority than its scheduled packets for multiple reasons:

Request packet of a message must be the highest priority packet of that message to inform the receiver of this message at the minimum possible time and allow the receiver to take that message into account for traffic pacing and future scheduling cycles.
For workloads that unsched bytes are small fraction of traffic (eg. heavy tail workloads) we should send unsched at a higher priority to inform receiver of new messages as soon as possible.
For workloads that unscheduled traffic is a considerable fraction of total traffic, vast majority of messages are short (ie less than couple of RTT in size). Large fraction of bytes in these messages are unsched bytes and need to be sent at high priority to preempt the network queues for them and achieve low latency.
Sending unsched bytes at low priority may defeat the whole purpose of them (ie. hiding 1RTT scheduling latency overhead) at high load factors since they might get queued up. This could potentially increase completion times of short messages. Generally in common case for a message transmission, we’d like to have all unsched packets received by the time first sched packet of the message has arrived at receiver.

We have run experiments using multiple priority assignment schemes based on various messages size distributions and message byte distributions. Assuming that we have N prios to use within the unsched packets, the best practice seems to depend on the size distributions of messages (ie workload type), but equal number of bytes on priorities seems to work pretty well in most cases. In this method, we have size dist. of messages so we can find cumulative byte dist. and we assign priorities to different message sizes such that equal number of bytes will be transmitted over each priority. Based on the simulation results for various size distributions, four priority levels seems to be sufficient for unsched bytes to get quite close results to an ideal priority transport.

Priorities within scheduled packets:

Preemption of large messages for short messages is a key idea in our scheme and in order to achieve that properly, we may need to use multiple priorities within scheduled packets as well. Imagine scenario when receiver has been granting a large message for some time and has 1RTT outstanding grants for this large message. While granting, a request for a much shorted msg arrives which rises the outstanding bytes to 2RTT including 1RTT low prio sched pkts for large message, 1RTT-1pkt high prio unsched for new arrived short msg. At this point the pacer delays sending grant for 1RTT time then sends grant for short message. This 1RTT delay translates to 1RTT increase completion time increase in short message because of low prio scheduled packets we have previously sent for large message. To resolve this issue, Receiver adaptively uses multiple prios for scheduled packets.

Priorities of scheduled packets are defined in the grant packets that are sent by the receiver. First grant will be sent containing the lowest priority level. Receiver proceeds sending grant containing this prio until a time comes that we need to preempt for the next grant since the new grant belongs to a message with fewest remaining bytes. At this point, if the outstanding granted bytes of larger messages subtracted from total outstanding bytes lays below 1RTT outstanding bytes cap, then the receiver should immediately elevate sched prio to higher priority and continue sending grants with no delay to avoid having the sched packets of high priority shorter message being blocked by lower prio sched packets of the larger message.

There is another subtlety in the scheduled priority assignment that we continue sending grants based on the adaptive scheme until the last 1RTT remaining bytes to grant. For the last 1RTT remaining bytes to grant for each message, we switch the priority scheme to the same scheme used for unscheduled packets. This is because from the perspective of the receiver, a message that is has B less than less one RTT bytes, should be treated like a new message that with total size of B. This reduce the scheduling penalty for messages that are just above unsched limit but not too larger than that limit.

Many Receiver Problem

Problem Description

In a scenario that we have many receivers, a sender might want to send to multiple receivers at the same time and if multiple receivers commit their bandwidth to the sender by sending grants to that sender, this can cause receiver bandwidth to be wasted as the sender can only send one packet at time. On the other hand, if the sender chooses to only send a request to one of the receivers and that receiver is busy sending grant to other sender’s then the sender bandwidth will be wasted. So this is obviously a distributed matching problem that must be properly solved such that minimizes the bandwidth wastage at both sender and receiver links.

So far in the process of designing Homa, we have focused only on the receiver side and putting the bulk of the transport logic in the receiver side without much attention to the sender side. Solving this problem though requires possibly elaborate schemes to be used at the sender side which we are going to start to talk about here.

Straw Man schemes

The two straw man approaches for the senders are:

Sender observes the priorities assigned by the receivers and send packets based on whatever priority they are assigned. In this scheme, sender simply has an pkt output queue which is sorted by the priorities given to each packet based on the receiver logic (ie. based on if packet is unsched, sched, how big the was is, etc). The sender simply send packet from this packet at each packet time. This scheme translates to Round Robin in packet level for packets with same priority belong to different msgs. Biggest flaws in this case:
1. Round Robin increases the completion time of all messages
2. Round Robin causes the receiver bw to be wasted as sender sends requests to multiple receivers. In common lightly loaded case,corresponding receivers commit to that sender by sending grants back but the receiver can only send to one receiver at any point of time.
SMF (Shortest Message First): This is a greedy approach where sender chooses shortest message among all messages and send unsched bytes for that message. In common case, it will receive grants for this msg by the time the last unsched byte is sent and it can complete that message. This is in fact work to completion approach where we choose one msg and complete transmission of that message before sending any packet for any other msg. This scheme resolves both problems with previous scheme but it has other flaws:
- Imagine Sx sends unsched bytes to an Rx and expects the Rx to send grants back but the grant will be late because either Rx is busy sending grant to even a shorter flow or there’s been a queue at TOR near Rx that caused every packet to be delayed. In such case sender is going to waste bw and blow bubble in its link so it needs to give up at some point and send request to other receivers.
Ideally we need to compute a matching between the senders and receivers but with a very little information. Is there any way to probe the receivers to know about their available bandwidth.

STF scheme: Blow bubble at sender to avoid bubbles at receiver

An interesting question is what do you do when there is bubble at sender. That is sender start sending one msg and sender likes to run that msg to completion. Sender waits for grant to be back but grant is late and sender might stall waiting for grant for that msg but it also have other msgs that it could be sending. The question is, do sender hang tight and allow bubble in its output link hoping that grant is coming soon or do I go ahead and start on second msg and start sending it in which case the problem is the sender might end up getting grant from two receivers and of course it can’t transmit to both of them simultaneously and it ends up blowing up bubble in the receiver side.

Presumably you are willing to stall for small amount of time for missing grant to come in but how long are we supposed to wait? Or if we start sending next msg and don’t get grant from that one as well, are we going to start the third one? Which ever way sender goes, it’s risking network bandwidth either at sender or receiver side.

Possibility of negative grants

So far, we only talked about sending positive grant when receiver tells the sender when it’s ready to send packets out. We might want to introduce notion of negative grants or NACK from the receiver so when receiver gets the request and its busy, it can send NACK to sender which basically tells the sender that it’s busy for the following amount of time so sender knows how big of gap until it can potentially get a grant from that receiver and so it can do some other useful work in the meantime.
Similarly, say the sender is in the middle of transmission of a very large msg and a new small msg appears. According to STF policy at sender, sender likes to preempt the large msg and complete small msg. At this point maybe the sender should send a NACK to receiver of large msg, informing it is going to send another msg for some amount time and blows a bubble in receiver link. Some discussions about the scenario:
1. If we don’t want to blow bubble in the receiver, sender has to continue sending packets for large message for one RTT after it sends the NACK and before preempting for shorter msg. On the other hand, if you don’t preempt immediately for the short msg, the effect on the stretch of the short msg can be quite large and delaying preemption can cause up to 100% overhead on the relative stretch of short msg. The choice here is between latency of shorter request vs bandwidth utilization at the receiver
2. Or perhaps, the receiver should slightly over schedule itself so if there’s always at least a little bit of buffering at receiver side (eg. receiver TOR queue), then sender can immediately preempt for shorter msg without actually introduce a few bubbles in the link of preempted msg receiver. We don’t necessarily want buffering since buffering adds latency but one possibility for receiver is to adjust the buffering based on the load. When receiver is underloaded, it doesn’t really matter if sender introduces buffer because it’s not taking away bandwidth from any body else but if the receiver is over loaded you might actually want to nudge your buffering up a little bit to make sure its link absolutely never goes idle if sender tries to preempt.

Receiver sharing bw with multiple senders

Another idea (which might not be a good idea) is receiver sharing bw with multiple senders and send grants to multiple senders. The disadvantage of this is that receiver is round robinning and every body’s completion time increases but if your bw limited your response time is limited by how efficient your using bw any how (how said this is efficient use of bw?). The advantage is receiver has less to lose when a sender suddenly decides not to send any more. Receiver still discovers that after one RTT but it only gave half of gw to that sender.

A major problem with this idea is that you slow down your common case. We are trying to solve a problem that happens with a less frequency comparing to the problem that we cause. This occasionally helps but often hurts. A blanket rule like this that says only give 50% of your bw to highest priority flow, hurts that flow because the flow that is small from perspective of receiver is likely to be small from perspective the sender as well in the common case. This idea wouldn’t make sense at all if your are very bw constrained.

Two types of solutions

Reputation scheduling

If the receiver had some notion of comparing its destinations and measure of confidence in senders. From the Rx perspective, every time it send grant, it likes to send grant to Rx that will definitely send the data and if the Sx occasionally doesn’t send data, even if it’s the highest priority Sx, then Rx should mix the grants to that Sx with grants to next preferred Sx. Imagine, every time the Sx miss sending pkt for a grant, Rx degrade its reputation (Rx is keeping some reputation score for each Sx). Then Rx uses that score when sending grants exclusively to highest priority Sx or share grants with other Sx. This will try to use past information to find each Sx score and use it for next grants so question is, "Is there any reason to believe senders will remain consistent?" A long term reputation score might not be helpful in this case.

Message passing and probing

Senders and receivers are going to pass msg to give each other as accurate of information as possible and make decisions based on that. For example, NACKs and Grants as described above could be a way to achieve this information passing. We make this information available as precise as we can and we come up with policy to have schedules at senders and receivers based on that.

On the receiver side, Rx is getting all scheduled pkts but suddenly a NACK comes in and says next k slot is going to be empty since sender decided to preempt. What does the receiver do with this information? If k is shorter than RTT, receiver can’t do any thing about that and those k slot are burned opportunities for receiver. But Rx can make sure it will not receive bubbles during those k slots if it maintains a small amount of buffering (at least k pkts) at Rx TOR. In this case, receiver looses the k slot buffering and it should send k grants to replenish the buffer. Note that the NACK pkt should be highest prio to make sure it arrives early and wont stuck in the back of the queue.
Question is how much buffer we want at Rx TOR queue? Depends on load. If the Rx is only 20% of input BW, then probably we don’t want any buffering at all but if using 90% of BW, then it probably want buffer at TOR.
On the sender side, Sx has been receiving grants but no all of a sudden it gets a NACK from the Rx saying here will be no grants for some time. First of all having Rx to send back NACK, specifying how long it can’t send grants, could be necessary so Sx can send request to next preferred Rx and tell it how much it can send because this Rx is back up (secondary Rx).You might want to let the secondary Rx know don’t send any grants at all as you are my secondary Rx, I just want to trickle pkt to you any chance I get. Concern here is that going down this path can result in a very complicated scheme.
What if we have notion of primary sender and receiver. The model here is that there is a matching between senders/receivers. Any sender and receiver have primary counterparts. For the primary sender/receiver, protocol is exactly the same as before meaning that sender gets a grant from primary receiver and sends packet to it.Any pkt not send on this primary pairs is treated like unscheduled pkts so it wont cause new grants to come. Now if sender get an empty slot (NACK from the receiver), it will pick its second best receiver (ie. secondary receiver) and sends it a pkt. From the perspective the secondary receiver, it should just treat that pkt like an unsched. pkt. But how many pkt does sender send? If at some point sender gets so many empty slots from primary receiver, then it should try to change its primary receiver. The part the remains to be solved is if we can find these matchings?
If we’re talking about the cases when all of your messages are smaller than RTT then this problem might not have any solution and mistakes are short lived any ways. If every sender and receiver have stable preference, these matching can be found in few round trips (ie. for long transmissions). In fact the problem this primary matching is trying to solve is if you have sets of messages that are multiple RTT long then your matching first of all should converge to the right matching and all the pkts we send to secondary receiver are kind of our best effort to keep network utilized while we’re trying to find primary receiver. And again, because senders are allowed to send to secondary receivers and also preempt for shorter messages, the receivers should also keep some buffering at TOR to keep bw utilized if bubbles are being passed.
Another concern here is such algorithm might not be even stable? Suppose at the receiver we are getting a long message and we intentionally introduce buffer build up equal to one RTT so we let some other long sender#2 send few packets and they just get buffered in the switch but we don’t want those pkts from sender#2 to interrupt our primary sender#1 so we tell our secondary sender#2 to send at lower priority. Now problem is, primary sender#1 finishes and secondary sender#2 is supposed to become primary and we want to tell sender#3 to become secondary and packets from sender#3 are supposed be even lower priority. So every time we go to a new sender, we keep going to a new lower priority. What we really like to do is to prioritize those packets and TOR switch is the right place to implement this algorithm.

A Straw Man Scheme Based On pFabric

Distributed matching problem: The problem is that, when a new high priority msg enters the network that changes matched pairs of sender/receiver, the senders in the old match pairs must some how be able to figure out that they should stop sending packets of the old msg and start on the next high priority messages to find new matched pairs.

The way pFabric resolves the matching problem:

The senders behave greedily: When a new high-prio/short msg to a different receiver arrives it causes the sender to preempt all other messages at Sender’s NIC. Moreover, the packets belong to this new msg preempt other low priority msgs at receiver TOR buffer. Since the packets of the old low priority msg are preempted at the receiver TOR, the acknowledgments of the packets of the low prio msg wont get back to the corresponding sender and the sender of that message will stop sending new packets (pFabric uses a window based congestion control scheme which limits how many outstanding packets a msg can have). When the high prio msg transmission is finished, the buffered packets of the low prio msg will be transmitted to the receiver and acknowledgments of those packets will be sent back to the sender which causes the transmission of old low prio msg to be resumed. The key points here are the buffered packets of low prio msg at the receiver TOR that cover one RTT and hides the delay associated with rescheduling low prio message in the network. Therefore the receiver link will not run idle and their acknowledgment of buffered packets reawakens the transmission of the msg.

Simulating this behaviour in Homa:

In Homa, the senders can be greedy as well such that whenever they can transmit a packet, they choose a pkt from the shortest mesg that has ready sched/unsched pkts for transmission. This behavior can cause a problem for receiver: a receiver may incur 1-RTT or more bubble when sender preempts the message for which receiver has already sent 1RTT grant. This problem can be solved if receiver is allowed to send grants to multiple senders so that if one stops sending, the packets of the other sender keep the receiver link busy. In this case, the receiver grants can almost achieve the same behavior as pFabric and the pFabric’s outstanding window for every message at receiver can be emulated by allowing one RTT outstanding pkt for each msg. In this scheme, the receiver grants highest prio message until it reaches an RTT cap for inflight pkts. Once it reaches the cap for that message, it switches to the next highest prio msg from a different sender and grants that one. So receiver effectively keeps outstanding bytes per sender. If a sender decides to switch from receiver 1 to receiver 2, that would be ok for receiver 1 since receiver 1 immediately starts granting it’s second high priority msg as soon as it has one RTT cap outstanding bytes for that sender but that sender has stopped sending. This essentially is simulating what TOR scheduler was doing in pFabric. The concern here though is that this can lead to large buffer build up at receiver TOR which limits the preemption for shortest message and increases latency. pFabric is OK with that large buffer because it had unlimited number of priorities but Homa can’t deal with too much buffering since priorities are scarce. Assuming N is the number of senders that receiver needs to roll over until it finds one that is willing to send to it, then the max buffer is N*RTT. This buffering happens when non of the N senders are willing to send and then all of a sudden they all decided to send 1RTT all at the same time. The buffering in general can hurt the latency of the shorter messages but perhaps we can smartly use our few priorities to preempt the buffer for the shorter messages.

This Scheme compares to the NACK approach and NACK approach might as well outperform this but this scheme could be used as straw man and very simple solution for comparing any other scheme against it. This scheme is also very robust in occasions that sender process gets descheduled by the OS or sender process just dies and can’t get a chance to send NACK.

Using priorities to preempt and find the optimal match

Bottom set of pictures show this scheme.

We have prio 5 level and P5 is the worst. A series of mesg coming in time in this order m1, m2, … each having msg length l1, l2, … For simplicity we use length to assign prio in contrast to SRPT where remaining is used. The relative size of the messages are shown on the horizontal lines in the pictures where l1 < l3 < l2. There exist a per sender outstandingBytes cap at 1RTT so each message can at most have 1RTT grants outstanding.

Simple Version

Receiver Knows about both m1 and m2. We start receiving m1 at lowest prio P5 and outstandingBytes for m1 is at 1RTT cap at P5. In this simple version, if m1 stops, we are willing to blow bubbles at receiver because we didn’t buffer ahead, but the moment we receive bubble, we switch to m2 and start granting at same prio P5. Now what if some new mesg m3 arrives. If l3 was the shortest then it would have been clear as we would have placed it on new prio P4 an dif l3 was longest the we just ignore it and keep granting m2.

But if l1 < l3 < l2, then of course m3 needs to be granted but question is what prio should we assign to m3? Let’s assume we have to associate each mesg length with a prio level and any incoming mesg that is shorter than that need to bumped up in prio and any mesg larger than that is gonna go in lower prio. So now the question is in this scenario, which mesg length must be associated with P5? should it be l2 that is sending or l1 that is not sending? Well it appears that P5 should be associated with l2 here because if we use l1 for P5 then we are using stale info since the mesg we are granting to at P5 now is m2 and also m1 had its chance of sending pkts at P5 but stopped so shouldn’t we use m1 len in our decisions. So if P5 is associated with l2, and l3<l2, then m3 will be granted at P4 and P4 is associated with l3 now. The disadvantage of this is that now if m1 resumes sending pkts at P5, m3 is higher prio now while l3 > l1 that means since l1 stopped transmitting, you may get demoted to share your priority with the messages that are larger than you: In this simple scheme if m1 comes back, m1 stays at P5 (priorities are assigned at mesg arrival time and doesn’t change over the length of its life) so we get buffering at P5 from m1 and m2. We first get m3 pkts and when m3 finishes m1 and m2 end up alternating at P5 which is not the best behavior but system will have progress. In this really simple scheme both m1 and m2 share P5 and we will round robin between the m1 and m2 (each time a pkt arrives from one, you send a grant for that mesg at P5). So the sequence pkts of m1 and m2 would become something like 2,2,1,2,1,2,1,2,1,2,… and we’ll have 1RTT buffering from both mesgs but when m1 and m2 finish, buffering goes away.

So this scheme has at least three disadvantages and shortfall: 1. We initially had 1RTT bubble when m1 stopped because we didn’t buffer ahead. 2. We had weird priority inversion between m3 and m1 where m1 got better prio though it was larger 3. Also between m1 and m2 when m1 is back, we end up roundrobinning between m1 and m2.

Improving The Scheme

We’d like to resolve the shortfalls of simple version. First thing is that when m1 resumes, I’d like to give it its priority over m2 back: Let’s get rid of m3. We originally had m1 receiving grants at prio P5 and then m1 stopped so we started to grant m2 at prio p5. Later now m1 resumes and when first pkt of m1 is received, outstandingBytes of m1 goes below 1RTT and we want to give it its priority over m2. So we immediately bump up m1 prio level to P4 and send grant for it and at this point P5 is associated with m2 and P5 associated with P4. So assuming m1 and m2 will transmit fully at this point, the train of pkts would look like this: 2_p5, 2_p5, 1_p5, 2_p5, 1_p5, 2_p5, 1_p5, 2_p5, 1_p4, 1_p4, 1_p4, …, 1_p4, 2_p5, 1_p5, 2_p5, 1_p5, 2_p5, 1_p5 we are getting, m2 pkts at p5, and then 1RTT of m1 and m2 pkts will be interleaving. Then we get m1 pkts at p4 and when m1 is fully granted and all grants of m1 at p4 are received, we get 1RTT of interleaving m1 and m2 pkts.

The key idea and net effect is that we were able to slit m2 on lower prio level right under m1 and we effectively were able to insert lower prio mesg in the system and buffer up some pkts from those mesgs to prevent bw wastage when higher prio mesgs stop sending.

Note

When m1 resumes, we will have pkts of m1 buffered up behind 1RTT of m2 pkts. So the question is, Is the any way to let that queue subside before we send grants to m1 again? The answer seems to be No and that’s because we need to estimate buffering and skip grants to let buffer subside but we can’t reconstruct buffering at TOR using information we got at receiver and if we overestimate buffering, we may cause bubble and bw wastage by skipping too many grants.

Getting rid of gap between m1 and m2

Given that we are using multiple priority levels, there’s no reason to wait for m1 to pause and then send grants for m2 as this pause may lead to 1RTT of bubble. Imagine we are sending grant to m1 at P5 and m2 request arrives, we’re going to immediately bump m1 to P4 and grant m1 at that priority whenever m1 falls below its cap. On the other hand, we immediately start granting m2 at P5 for every pkt of m2 that we get. Assuming m1 is behaving reasonably and doesn’t pause, we will get all m1 Pkts before m2 granted pkts are arriving.

We can decide how many mesgs we want to buffer

Imagine while m1 and m2 are receiving grants at P4 and P5, a new mesg m3 arrives that is larger than both m1 and m2. We can bump m1 and m2 grants to P3 and P4 respectively and slide m3 at the bottom of stack and grant m3 at P5. If m3 is such m1 < m3 < m2, we can pull up m1 to P3 and push m2 to P5 and grant m3 at P4 in the middle for every pkt we receive for m3. An interesting parameter of simulation could be how many of message we want to keep granting: If n is total number of mesgs we’d like to grant, we can always only choose top n mesg and only grant those mesgs at long as they’re sending pkts to us.

We are using priorities for three things here: 1. Preemption: We promote shorter messages to higher priority when larger mesgs arrive 2. Relatively static priorities for unscheduled bytes. 3. We use lower priorities to buffer ahead from lower prio mesgs to avoid bubbles when high prio sender don’t send to us. This buffer helps us with delay variations as well.

Questions and Thoughts

The second scheme proposed here is using mesg length to prioritize (SMF) instead of SRPT which uses remaining bytes. Using length leads to scheme that favors shorter mesgs even more than SRPT and that can lead to worse starvation problems for long messages. Would be good to do simulation based on both SRPT and SMF and see how things look different. Argument for SRPT: SRPT minimizes total completion time and good approximation of minimizing total stretch and also for traffic distributions where mesg sizes are close to each other and they vary withing a factor 2, then SRPT is much better than SMF. Argument for SMF: Our goal is good utilization of the system while getting low latency for short message and SMF achieves that better and SRPT.
Issues with second scheme: We are able to preempt for smaller mesgs using our priority scheme to promote them but have not discussed how to demote and clear the system from outstanding traffic grants. If total number of messages exceed the total number of priorities we got, and we fill up all the prios with top preferred mesgs and then one of them stops sending to us, how should we handle that? For example, it matters if the one that stops is the highest prio and in that case at some point you just want to bump every body up by one prio and insert anther mesg at the bottom of stack.
Imagine we only have 3 prio and we have our top three senders granted at top prio, but all of a sudden the top 3 mesgs will change to a new set of mesgs and we have to change use of top prios for that 3 but the packets of these new set arrive after the already queued up mesgs and we have to pay this buffered pkts on the latency of the new set of mesg.
How many priority levels do we need use to cover almost all the gaps for the senders not responding to us. We hop that only 2 or 3 priorities should be enough to send grants to few senders at a time and cover all the gaps that senders might cause at receiver’s link by preempting for their own advantage. It’s really about how far down into the senders preference list, a receiver need to go in order to find a match (ie. a sender that’s willing to send). This number highly depends on the workload.
TrafficPacer is not relevant any more as we don’t send grants at pkt times but we only send grants when we receiver pkts. TrafficPacer was to make sure the input rate at sender and receiver are matched but now that we have multiple senders sending pkts to us at different priorities, the rate is not matched any more. So we send grant for every pkt we receive except when we have less than RTT unsched bytes where the receiver needs to continue sending grants upto 1RTT grants every pkts time while no unsched pkt is being received.

Greedy SRPT Oracle

Why Oracle scheduler is appealing to us:
1. We know there is no way to do better than the Oracle scheduler
2. We show Homa performance is close to the Oracle’s

Therefor, our Homa protocol performance is great.

SRPT gready Oracle:

An Oracle that achieves maximal matching would be like this: At arrival any new message or completion of sending a message, the oracle looks across all of the messages in the system and calculates a matching set (or schedule set) as follows: find the message with shortest remaining bytes and add that to the set of scheduled messages. For remaining unmatched receivers and senders, repeat calculating a message schedule based on this policy until we no longer are able to find a pair of sender/receiver are free and have message for transmissions. We transmit packets based on the matched set.

Every time a message finishes transmission at a sender or a new message enters the network, we re-run the algorithm for all unfinished messages and all sender/receivers in the network to find a new match set.

Every time a new desired match is found and advertised to the senders, each sender will send a packet belong to new calculated/desired scheduled message only after the previous packet it’s been busy sending has been completely serialized to outbound link (Each sender has a timer for transmitting packets such that every time a pkt1 is sent to the NIC for transmission, the timer is scheduled for the next packet transmission right after last bit of pkt1 is serialized to the network. This way we can keep NIC TX queue always empty and we can achieve our preemption of messages up to one packet size boundaries).

The theoretical limit for this algorithm says this algorithm is 2-approximation of optimal solution for minimizing total completion times of messages.

The algorithm here has problems and is definitely not the best possible Oracle. One obvious problem with this algorithm is inability to do sub-packet preemption that is when a message is scheduled for transmission of a packet from sender 1 to receiver 1, we won’t be able to preempt that packet if a new msg is scheduled to the same receiver from a different sender until that packet transmission is finished. This can lead to queuing at receiver and incorrect schedules when we have hotspot receivers at high load factors. Even if we could solve this problem at the end hosts, this issue of being blocked for currently transmitting packet can still happen at switches and links in the network and we can’t solve it there unless we carefully schedule packets at every link of the network. Moreover, solving this issue requires breaking messages into shorter packets which has overhead of transmitting more pkts and header bytes for the same message and can become a large throughput overhead for heavy head workloads.

This Oracle we know is better than pFabric and is definitely an aggressive base line for comparison which has been used in pFabric and many other works that came afterwards. So if we compare against this Oracle, we have indirectly compared Homa against other algorithms.

Greedy SRPT Oracle On a Fluid Network Model

We can completely remove packetization from Oracle and assume the network has a fluid model. We can run Greedy SRPT Oracle as above but every time we find a new desired schedule, we immediately stop previous incomplete transmissions and apply new schedule like if previous messages can be preemted at bit boundaries. This model doesn’t need a network simulator but merely can be implemented by a python scripts with message sizes and inter arrivals times as the input.

The Challenge though is to account for the real network latency and switching times to correct the completion times calculated from this new model. This could get messy and we don’t want to do it.

Open Desing Questions

Open Questions: How hot spots will affect the result. So far we always simulated evenly distributed traffic patterns and at some point we should try to see how things would be different when we have receiver hot spots. For example in Homa, it is quite possible that receiver has to throttle the unsched bytes limit to less than RTT if it is overloaded.
Delay Variability: So far we have focused 1RTT cap however we know delay variations happen in the network and we need to take that into account. Furthermore, RTT is assumed to be the minimum RTT possible that is when a full 1538B packet is sent in one direction and a small 84Bytes (min Ethernet Frame Size) is send in the other direction. However, in high load, average RTT can be much higher than this value and forcing the min RTT as the cap can lead to wasted bw. So the question is at high load and high delay variability, how the RTT value should be calculated?
Higher link speed: In what way going to next generation networks fabric with 40gb/s NIC links and 100gb/s fabric links will affect our results.
Adapting to link/switch failures: We have assumed a full bisection bandwidth network but how should Homa detect and react to links and switches that might temporarily fail and reduce the bisection bw in some parts of the network.

Voodoo Constants

The round-trip time
The total number of available priorities
How should we divide the prios among the folowing packet types
- Control packets (ie. grants)
- Unscheduled packets (priority cut off among different message sizes)
- Scheduled packets
- Low priority redundant preemptive scheduled packets to avoid bubbles

Homa Paper ToDos

Simulation comparison points, ordered by importance
1. self comparison when Homa features are removed
2. pHost
3. pFabric
4. PIAS
5. DCTCP
6. investigate what causes simulation instability in phost and pfabric cods at high loads. possible explanations are uneven link utilization, more than hundred percent input rate to network, and wasted bandwidth.
Complete the analysis of simulations with multiple senders and receivers:
1. wasted bandwidth at receiver
2. cumulative time average of priority usages
3. time series of outstanding messages
Algorithmic shortcomings
1. if sender stops sending, should receiver "deactivate" message so it doesn't consume a priority level for redundancy?
2. possibly receive multiple messages on the same priority level
3. choosing parameters:
  1. dividing priorities between scheduled and unscheduled
  2. picking cutoffs for unscheduled priorities
  3. choosing redundancy factor for scheduled messages
4. measuring actual workload, using it to pick parameters and change dynamically

RAMCloud Project

Homa Design Notes