Many Receiver Problem

Problem Description

In a scenario where we have many receivers, a sender might want to send to multiple receivers at the same time. If multiple receivers commit their bandwidth to that sender by sending it grants, receiver bandwidth is wasted, because the sender can only send one packet at a time. On the other hand, if the sender chooses to send a request to only one receiver and that receiver is busy sending grants to other senders, then the sender's bandwidth is wasted. This is clearly a distributed matching problem that must be solved in a way that minimizes bandwidth wastage at both the sender and receiver links.

So far in the process of designing Homa, we have focused only on the receiver side, putting the bulk of the transport logic there without much attention to the sender side. Solving this problem, however, may require fairly elaborate schemes at the sender side, which is what we start to discuss here.

 

Straw Man schemes

The two straw man approaches for the senders are:

  • Sender observes the priorities assigned by the receivers and sends packets based on whatever priority they are assigned. In this scheme, the sender simply has a packet output queue sorted by the priorities given to each packet by the receiver logic (i.e. whether the packet is unscheduled or scheduled, how big the message is, etc.). At each packet time, the sender sends the packet at the head of this queue (a minimal sketch of this queue appears after this list). For packets with the same priority belonging to different messages, this translates to packet-level round robin. The biggest flaws in this case:

    1. Round Robin increases the completion time of all messages

    2. Round Robin causes receiver bandwidth to be wasted, since the sender sends requests to multiple receivers. In the common, lightly loaded case, the corresponding receivers all commit to that sender by sending grants back, but the sender can only send to one receiver at any point in time.

  • SMF (Shortest Message First): This is a greedy approach where the sender chooses the shortest message among all its messages and sends the unscheduled bytes for that message. In the common case, it will have received grants for this message by the time the last unscheduled byte is sent, and it can run the message to completion. This is in fact a work-to-completion approach: we choose one message and complete its transmission before sending any packet for any other message. This scheme resolves both problems with the previous scheme, but it has other flaws:

    • Imagine Sx sends unscheduled bytes to an Rx and expects the Rx to send grants back, but the grant is late, either because Rx is busy sending grants to an even shorter flow or because a queue at the TOR near Rx has delayed every packet. In such a case the sender wastes bandwidth and blows a bubble in its link, so it needs to give up at some point and send requests to other receivers.

    Ideally we need to compute a matching between the senders and receivers, but with very little information. Is there any way to probe the receivers to learn about their available bandwidth?
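
To make the first straw man concrete, here is a minimal sketch of such a priority-sorted sender output queue. The class and field names are illustrative assumptions, not Homa code; the point is only that equal-priority packets from different messages drain in arrival order, i.e. per-packet round robin.

```python
import heapq

class PrioritySortedSender:
    """Straw-man scheme 1 (sketch): one output queue ordered by the
    receiver-assigned priorities; one packet is sent per packet time."""

    def __init__(self):
        self._seq = 0      # FIFO tie-breaker among equal priorities
        self._queue = []   # min-heap of (priority, seq, packet)

    def enqueue(self, priority, packet):
        # Priority comes from the receiver logic (unscheduled vs. scheduled,
        # message size, etc.); a lower number means higher priority.
        heapq.heappush(self._queue, (priority, self._seq, packet))
        self._seq += 1

    def next_packet(self):
        # Called once per packet time. Equal-priority packets from different
        # messages interleave in arrival order, i.e. packet-level round robin.
        return heapq.heappop(self._queue)[2] if self._queue else None
```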

 

STF scheme: Blow bubble at sender to avoid bubbles at receiver

An interesting question is what to do when there is a bubble at the sender. The sender starts sending one message and would like to run that message to completion. It waits for a grant to come back, but the grant is late, so the sender might stall waiting for the grant for that message even though it has other messages it could be sending. The question is: does the sender hang tight and allow a bubble in its output link, hoping that the grant is coming soon, or does it go ahead and start sending a second message? In the latter case the sender might end up getting grants from two receivers; since it can't transmit to both of them simultaneously, it ends up blowing a bubble on the receiver side.

Presumably you are willing to stall for a small amount of time for the missing grant to come in, but how long are we supposed to wait? And if we start sending the next message and don't get a grant for that one either, do we start on a third? Whichever way the sender goes, it risks wasting network bandwidth on either the sender or the receiver side.
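
One way to frame this is as a stall timeout: wait a bounded amount of time for the late grant, then give up and start on the next message. The sketch below is only an illustration of that trade-off; the timeout value, the message fields, and the helper functions are all assumptions.

```python
import time

def stf_sender_loop(messages, grant_available, send_packet, stall_timeout=2e-6):
    """Hypothetical STF sender loop (sketch): run the shortest message to
    completion, but if a grant is more than stall_timeout seconds late,
    switch to the next message instead of keeping a bubble on the sender
    link (and risk creating one at a receiver instead)."""
    for msg in sorted(messages, key=lambda m: m.remaining_bytes):
        while msg.remaining_bytes > 0:
            deadline = time.time() + stall_timeout
            # Poll for the (possibly late) grant, but only for a bounded time.
            while not grant_available(msg) and time.time() < deadline:
                pass
            if not grant_available(msg):
                break                # grant is late: give up on this message
            send_packet(msg)         # grant in hand: keep the sender link busy
```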

 

Possibility of negative grants

  • So far, we have only talked about positive grants, where the receiver tells the sender when it is ready to receive packets. We might want to introduce the notion of negative grants, or NACKs, from the receiver: when the receiver gets a request and is busy, it can send a NACK that tells the sender it will be busy for the following amount of time. The sender then knows how big a gap there will be until it can potentially get a grant from that receiver, and it can do some other useful work in the meantime.

  • Similarly, say the sender is in the middle of transmitting a very large message and a new small message appears. According to the STF policy at the sender, the sender would like to preempt the large message and complete the small one. At this point maybe the sender should send a NACK to the receiver of the large message, informing it that the sender is going to work on another message for some amount of time and will blow a bubble in that receiver's link. Some discussion of this scenario (a small sketch of the sender-side decision follows this list):

    1. If we don’t want to blow a bubble at the receiver, the sender has to continue sending packets for the large message for one RTT after it sends the NACK and before preempting for the shorter message. On the other hand, if we don’t preempt immediately for the short message, the effect on the stretch of the short message can be quite large; delaying preemption can cause up to 100% overhead on the relative stretch of the short message. The choice here is between the latency of the shorter request and bandwidth utilization at the receiver.

    2. Or perhaps the receiver should slightly over-schedule itself so that there is always at least a little bit of buffering on the receiver side (e.g. in the receiver's TOR queue); then the sender can immediately preempt for the shorter message without actually introducing bubbles in the link of the preempted message's receiver. We don’t necessarily want buffering, since buffering adds latency, but one possibility is for the receiver to adjust the buffering based on load. When the receiver is underloaded, it doesn’t really matter if the sender introduces a bubble, because it isn’t taking bandwidth away from anybody else; but if the receiver is overloaded, it might actually want to nudge its buffering up a little bit to make sure its link absolutely never goes idle when a sender preempts.
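
A rough sketch of the sender-side decision described in points 1 and 2, under hypothetical names: whether to cover one RTT for the large message before switching is controlled by a flag, and rtt_bytes is an assumed constant.

```python
def preempt_for_short_message(sender, short_msg, long_msg,
                              cover_receiver_rtt=True, rtt_bytes=10000):
    """Hypothetical preemption logic (sketch). Option 1 keeps feeding the
    large message for one RTT after the NACK so its receiver sees no
    bubble; option 2 preempts immediately and relies on the receiver
    keeping a little buffering at its TOR."""
    # Tell the large message's receiver how long the sender will be away,
    # so it can grant someone else (or build a little buffer) in the gap.
    sender.send_nack(long_msg.receiver, pause_bytes=short_msg.remaining_bytes)
    if cover_receiver_rtt:
        # Option 1: protect receiver bandwidth at the cost of ~1 RTT of
        # extra stretch on the short message.
        sender.send_bytes(long_msg, rtt_bytes)
    # Option 2 (or after the cover traffic): switch to the short message.
    sender.set_active_message(short_msg)
```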

 

Receiver sharing bw with multiple senders

Another idea (which might not be a good one) is for the receiver to share its bandwidth among multiple senders by sending grants to several of them. The disadvantage is that the receiver is round-robinning, so everybody's completion time increases; but if you are bandwidth limited, your response time is limited by how efficiently you use bandwidth anyhow (who said this is an efficient use of bandwidth?). The advantage is that the receiver has less to lose when a sender suddenly decides not to send any more: the receiver still only discovers that after one RTT, but it has only given half of its bandwidth to that sender.

A major problem with this idea is that it slows down the common case. We would be trying to solve a problem that happens less frequently than the problem we would cause; it occasionally helps but often hurts. A blanket rule that says to give only 50% of your bandwidth to the highest priority flow hurts that flow, because a flow that is small from the perspective of the receiver is, in the common case, likely to be small from the perspective of the sender as well. This idea wouldn't make sense at all if you are very bandwidth constrained.

 

Two types of solutions

Reputation scheduling

Suppose the receiver had some way to compare its senders and some measure of confidence in them. From the Rx perspective, every time it sends a grant, it would like to send it to a sender that will definitely send the data; if a sender occasionally doesn't send data, even if it is the highest priority sender, then Rx should mix the grants to that sender with grants to its next preferred sender. Imagine that every time a sender misses sending a packet for a grant, Rx degrades that sender's reputation (Rx keeps a reputation score for each Sx). Rx then uses that score to decide whether to send grants exclusively to the highest priority Sx or to share grants with other senders. This tries to use past information to compute a score for each Sx and apply it to future grants, so the question is: "Is there any reason to believe senders will remain consistent?" A long-term reputation score might not be helpful in this case.
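
A minimal sketch of what such a reputation score might look like, assuming an exponentially weighted score per sender; the decay constant and threshold are illustrative, and, as noted above, it is unclear whether past behavior predicts future behavior well enough for this to help.

```python
class SenderReputation:
    """Hypothetical receiver-side reputation tracker (sketch)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.score = {}                 # sender id -> score in [0, 1]

    def record_grant_outcome(self, sender_id, data_arrived):
        # Exponentially weighted average: a missed grant pulls the score
        # down, a used grant pulls it back up.
        old = self.score.get(sender_id, 1.0)
        new = self.decay * old + (1 - self.decay) * (1.0 if data_arrived else 0.0)
        self.score[sender_id] = new

    def should_share_grants(self, sender_id, threshold=0.5):
        # Below the threshold, mix grants between this sender and the next
        # preferred sender instead of granting exclusively.
        return self.score.get(sender_id, 1.0) < threshold
```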

 

Message passing and probing

Senders and receivers pass messages to give each other as accurate information as possible and make decisions based on it. For example, the NACKs and grants described above could be one way to achieve this information passing. We make this information as precise as we can and then come up with policies for scheduling at the senders and receivers based on it.

  • On the receiver side, Rx is getting all of its scheduled packets when suddenly a NACK comes in saying that the next k slots are going to be empty because the sender decided to preempt. What does the receiver do with this information? If k is shorter than an RTT, the receiver can't do anything about it and those k slots are burned opportunities. But Rx can make sure it does not receive bubbles during those k slots if it maintains a small amount of buffering (at least k packets) at the Rx TOR. In this case the receiver loses its k slots of buffering, and it should send k grants to replenish the buffer. Note that the NACK packet should have the highest priority to make sure it arrives early and doesn't get stuck at the back of a queue.

    The question is how much buffering we want at the Rx TOR queue. It depends on load: if Rx is only using 20% of its input bandwidth, we probably don't want any buffering at all, but if it is using 90%, it probably wants some buffering at the TOR.

  • On the sender side, Sx has been receiving grants when all of a sudden it gets a NACK from the Rx saying there will be no grants for some time. Having Rx send back a NACK that specifies how long it can't send grants may be necessary so that Sx can send a request to its next preferred Rx (a secondary, backup Rx) and tell it how much it can send. You might want to let the secondary Rx know not to send any grants at all: "you are my secondary Rx, I just want to trickle packets to you whenever I get the chance." The concern here is that going down this path can result in a very complicated scheme.

    What if we have a notion of primary senders and receivers? The model here is that there is a matching between senders and receivers: every sender and receiver has a primary counterpart. For a primary sender/receiver pair, the protocol is exactly the same as before, meaning the sender gets a grant from its primary receiver and sends packets to it. Any packet not sent within a primary pair is treated like an unscheduled packet, so it won't cause new grants to be issued. Now if the sender gets an empty slot (a NACK from the receiver), it picks its second-best receiver (i.e. its secondary receiver) and sends it a packet; from the perspective of the secondary receiver, that packet should just be treated like an unscheduled packet (a per-slot sketch of this logic follows this list). But how many packets does the sender send this way? If at some point the sender gets too many empty slots from its primary receiver, it should try to change its primary receiver. The part that remains to be solved is whether we can actually find these matchings.

    If all of your messages are smaller than an RTT, then this problem might not have a solution at all, and mistakes are short-lived anyway. If every sender and receiver has stable preferences, these matchings can be found in a few round trips (i.e. for long transmissions). In fact, the problem this primary matching is trying to solve is this: if you have sets of messages that are multiple RTTs long, then the matching should converge to the right matching, and all the packets we send to secondary receivers are just our best effort to keep the network utilized while we find the primary matching. And again, because senders are allowed to send to secondary receivers and also to preempt for shorter messages, the receivers should keep some buffering at the TOR to keep bandwidth utilized when bubbles are being passed along.

    Another concern is that such an algorithm might not even be stable. Suppose at the receiver we are getting a long message and we intentionally introduce buffer build-up equal to one RTT: we let some other long sender (#2) send a few packets, which just get buffered in the switch, but we don't want those packets from sender #2 to interrupt our primary sender #1, so we tell our secondary sender #2 to send at a lower priority. The problem is that when primary sender #1 finishes, secondary sender #2 is supposed to become primary, and we want to tell sender #3 to become secondary with its packets at an even lower priority. So every time we move to a new sender, we keep dropping to a new, lower priority. What we would really like is to re-prioritize those packets, and the TOR switch is the right place to implement that.
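
Here is a per-slot sketch of the primary/secondary idea from the sender's point of view. All names, the empty-slot threshold, and the promotion rule are assumptions made for illustration; the open questions above (finding and stabilizing the matching, priorities for secondary traffic) are not addressed by it.

```python
def on_packet_slot(sender, empty_slot_limit=8):
    """Hypothetical per-slot sender logic (sketch): scheduled traffic goes to
    the primary receiver; filler traffic to the secondary receiver is marked
    unscheduled so it never triggers grants."""
    if sender.has_grant_from(sender.primary):
        sender.consecutive_empty_slots = 0
        sender.send_scheduled_packet(sender.primary)
        return
    # Empty slot from the primary (NACK or missing grant).
    sender.consecutive_empty_slots += 1
    if sender.secondary is not None:
        # Best-effort filler; the secondary receiver treats it as unscheduled.
        sender.send_unscheduled_packet(sender.secondary)
    if sender.consecutive_empty_slots >= empty_slot_limit:
        # Too many empty slots: try to converge to a better matching by
        # promoting the secondary and picking a new secondary.
        sender.primary = sender.secondary
        sender.secondary = sender.next_best_receiver()
        sender.consecutive_empty_slots = 0
```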

 

A Straw Man Scheme Based On pFabric

Distributed matching problem: when a new high priority message enters the network and changes the matched sender/receiver pairs, the senders in the old matched pairs must somehow figure out that they should stop sending packets of the old messages and start on the next high priority messages to form the new pairs.

The way pFabric resolves the matching problem:

When a new message to a different receiver arrives and causes a sender to preempt, the packets belonging to this new message preempt other low priority messages in the receiver's TOR buffer. Since the packets of the old low priority message are preempted at the receiver TOR, their acknowledgments don't get back to the corresponding sender, and the sender of that message stops sending new packets (pFabric uses a window-based congestion control scheme that limits how many outstanding packets a message can have). When the high priority message's transmission finishes, the buffered packets of the low priority message are delivered to the receiver and their acknowledgments are sent back to the sender, which resumes the transmission of the old low priority message. The key point here is the buffered packets of the low priority message at the receiver TOR: they cover the RTT, so the receiver link never runs idle, and their acknowledgments reawaken the transmission of the message.

Simulating this behaviour in Homa:

The receiver's grants can achieve almost the same behavior, and the per-message outstanding window at the receiver can be simulated by allowing one RTT of outstanding packets for each message. The receiver grants its highest priority message until it reaches an RTT cap of in-flight packets; once it reaches the cap for that message, it switches to the next highest priority message and grants that one (a sketch of this grant loop is below). So the receiver effectively keeps outstanding bytes per sender. If a sender decides to switch from receiver 1 to receiver 2, that is fine for receiver 1: it immediately starts granting its second highest priority message as soon as it has one RTT cap of outstanding bytes for the sender that has stopped sending. This essentially simulates what the TOR scheduler was doing in pFabric. The concern, though, is that this can lead to large buffer build-up at the receiver TOR; pFabric was OK with that buffering because it had an unlimited number of priorities, but Homa can't deal with too much buffering since priorities are scarce. If N is the number of senders the receiver needs to roll over before it finds one that is willing to send to it, then the maximum buffering is N*RTT. The question then is how we should use our limited priorities among those messages.
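
A sketch of that grant loop, assuming each inbound message tracks its granted and received bytes; rtt_bytes and the field/method names are illustrative. Note how, if the top sender has gone quiet, its one-RTT cap is reached quickly and the loop falls through to the next message, which is where the N*RTT worst-case buffering comes from.

```python
def issue_grants(receiver, rtt_bytes=10000, grant_size=1500):
    """Sketch of the pFabric-like grant policy: grant the highest priority
    (shortest remaining) message up to one RTT of outstanding bytes, then
    move on to the next highest priority message."""
    for msg in sorted(receiver.inbound_msgs, key=lambda m: m.remaining_bytes):
        # Outstanding = granted but not yet received.
        while (msg.granted_bytes - msg.received_bytes < rtt_bytes
               and msg.granted_bytes < msg.total_bytes):
            receiver.send_grant(msg, grant_size)
            msg.granted_bytes += grant_size
        # Once this message is capped, fall through to the next one; if its
        # sender has switched receivers, the later grants keep this link busy.
```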

This scheme can be compared to the NACK approach, and the NACK approach may well outperform it, but this scheme can serve as a straw man: a very simple solution to compare any other scheme against. It is also very robust in cases where the sender process gets descheduled by the OS, or simply dies and never gets a chance to send a NACK.

 

Greedy SRPT Oracle

  • Why an Oracle scheduler is appealing to us:

    1. We know there is no way to do better than the Oracle scheduler

    2. We show Homa performance is close to the Oracle’s

Therefore, our Homa protocol's performance is great.

 

SRPT greedy Oracle:

An Oracle that achieves a maximal matching would work like this: upon arrival of any new message or completion of any message, the Oracle looks across all of the messages in the system and calculates a matching set (or schedule set) as follows: find the message with the fewest remaining bytes whose sender and receiver are both free, and add it to the set of scheduled messages. Repeat this over the remaining unmatched senders and receivers until we can no longer find a sender/receiver pair that is free and has a message to transmit. Packets are then transmitted according to the matched set.

Every time a message finishes transmission at a sender or a new message enters the network, we re-run the algorithm over all unfinished messages and all senders/receivers in the network to find a new matched set.
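
The matching step itself is simple enough to state as code; a minimal sketch, assuming each message exposes its sender, receiver, and remaining byte count:

```python
def greedy_srpt_match(messages):
    """Greedy SRPT Oracle matching step (sketch): repeatedly pick the message
    with the fewest remaining bytes whose sender and receiver are both still
    unmatched, until no such message remains. Re-run on every message
    arrival or completion."""
    matched_senders, matched_receivers, schedule = set(), set(), []
    for msg in sorted(messages, key=lambda m: m.remaining_bytes):
        if msg.sender not in matched_senders and msg.receiver not in matched_receivers:
            schedule.append(msg)
            matched_senders.add(msg.sender)
            matched_receivers.add(msg.receiver)
    return schedule    # the set of messages allowed to transmit right now
```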

Every time a new desired match is found and advertised to the senders, each sender starts sending packets belonging to the newly scheduled message only after the packet it is currently sending has been completely serialized onto the outbound link. (Each sender has a timer for transmitting packets: every time a packet is handed to the NIC for transmission, the timer is scheduled for the next packet transmission right after the last bit of the current packet is serialized onto the network. This way the NIC TX queue stays empty and we can preempt messages at packet-size boundaries.)

The theoretical result for this algorithm says it is a 2-approximation of the optimal solution for minimizing the total completion time of messages.

This algorithm has problems and is definitely not the best possible Oracle. One obvious problem is its inability to do sub-packet preemption: when a message is scheduled for transmission of a packet from sender 1 to receiver 1, we cannot preempt that packet for a new message scheduled to the same receiver from a different sender until that packet's transmission has finished. This can lead to queuing at the receiver and to incorrect schedules when we have hotspot receivers at high load factors. Even if we could solve this problem at the end hosts, the same blocking on the currently transmitting packet can still happen at switches and links in the network, and we cannot solve it there unless we carefully schedule packets on every link of the network. Moreover, solving this issue requires breaking messages into shorter packets, which has the overhead of transmitting more packets and more header bytes for the same message and can become a large throughput overhead for heavy-headed workloads.

We know this Oracle is better than pFabric, and it is definitely an aggressive baseline for comparison; it has been used in pFabric and many other works that came afterwards. So if we compare against this Oracle, we have indirectly compared Homa against those other algorithms.

 

Greedy SRPT Oracle On a Fluid Network Model

We can completely remove packetization from the Oracle and assume the network has a fluid model. We run the greedy SRPT Oracle as above, but every time we find a new desired schedule, we immediately stop the previous incomplete transmissions and apply the new schedule, as if previous messages could be preempted at bit boundaries. This model doesn't need a network simulator; it can be implemented by a Python script that takes the message sizes and inter-arrival times as input (a sketch of such a script is below).
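
A sketch of what such a script could look like, reusing the greedy_srpt_match() sketch from the previous section and assuming a link rate of one byte per time unit; it is an illustration of the fluid model, not the actual analysis script.

```python
class Msg:
    """Minimal fluid-model message; field names are illustrative."""
    def __init__(self, sender, receiver, size):
        self.sender, self.receiver, self.remaining_bytes = sender, receiver, size

def fluid_srpt_oracle(arrivals):
    """Event-driven fluid SRPT Oracle (sketch). `arrivals` is a list of
    (time, sender, receiver, size) tuples; returns (message, completion
    time) pairs. Preemption happens at bit boundaries, i.e. whenever the
    greedy matching changes."""
    arrivals = sorted(arrivals)
    active, done, now, i = [], [], 0.0, 0
    while i < len(arrivals) or active:
        next_arrival = arrivals[i][0] if i < len(arrivals) else float('inf')
        schedule = greedy_srpt_match(active)
        if not schedule:
            now = next_arrival                       # idle until the next arrival
        else:
            # Advance until the next arrival or the first completion.
            step = min(min(m.remaining_bytes for m in schedule),
                       next_arrival - now)
            for m in schedule:
                m.remaining_bytes -= step
            now += step
            for m in [m for m in active if m.remaining_bytes <= 0]:
                done.append((m, now))
                active.remove(m)
        while i < len(arrivals) and arrivals[i][0] <= now:
            _, s, r, size = arrivals[i]
            active.append(Msg(s, r, size))
            i += 1
    return done
```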

The challenge, though, is to account for real network latency and switching times in order to correct the completion times calculated from this model. That could get messy, and we don't want to do it.

...