Lazy transport init causes instability

Description

When using mixed transports, where servers use basic+dpdk and the coordinator uses basic+udp, lazy transport init causes long dispatch lockouts that exceed basic timeouts, introduce spurious (enlist) RPC retries, and destabilize the whole cluster.

To reproduce, add usleep(1000) just below rte_eal_init() in DpdkDriver.cc. Re-run with multiple servers.

What goes wrong:

  • The coordinator isn't using DPDK in its own locator, so it skips basic+dpdk init.

  • When the first master enlists everything works properly and coordinator server list updates are enqueued.

  • In the meantime, a second master starts an RPC to the coordinator to enlist.

  • Before the second master can enlist, the coordinator server list update (caused by the first enlistment) is processed. It is destined for the first master at the locator it enlisted with, which is only accessible via basic+dpdk.

  • TransportManager grabs the dispatch lock then starts basic+dpdk init.

  • Our DPDK init consistently takes between 1 and 2 seconds (the number of huge pages we use seems to make some difference, but even with 64 MB of pages we see several hundred milliseconds of init time).

  • In the meantime, the second server's enlist RPC has been retried once every 80 ms.

  • These retries have all built up in buffers, waiting to be delivered to the coordinator.

  • Enlistment is not idempotent. All of the enlistments complete "successfully", adding tens of infant mortalities to the coordinator server list.
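
As a rough check on the numbers above, the count of duplicate enlistments can be estimated from the lockout duration and the 80 ms retry interval. This is only a back-of-the-envelope sketch: retriesDuringLockout is a hypothetical helper, not RAMCloud code, and it assumes exactly one retry per interval with none dropped.

```cpp
#include <cstdint>

// Hypothetical helper (not part of RAMCloud): estimates how many enlist
// retries queue up while the coordinator's dispatch thread is blocked.
// Assumes one retry per fixed interval and that no retry is dropped.
constexpr std::uint64_t retriesDuringLockout(std::uint64_t lockoutMs,
                                             std::uint64_t retryIntervalMs) {
    // Guard against a zero interval; integer division counts whole intervals.
    return retryIntervalMs == 0 ? 0 : lockoutMs / retryIntervalMs;
}
```

With the 1 to 2 second DPDK init times reported here, retriesDuringLockout(1000, 80) gives 12 and retriesDuringLockout(2000, 80) gives 25, which is consistent with tens of spurious entries per late-starting server.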

We can reasonably reliably run things if we delay starting more servers until the first has had a chance to register and receive its first server list update.

We're looking for a longer-term solution. basic+dpdk doesn't seem to require the dispatch lock for init, but other transports seem like they may (e.g. basic+udp is lazily initialized on the server side and seems to explode without the dispatch lock). One fix may be eliminating or perforating the dispatch lock for transport init, since enlistments can be serviced just fine in the meantime. Adding some kind of long retry delay for enlistment could help, but it has some downsides too.
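
One way to picture "perforating" the dispatch lock is to release it for the duration of the slow init and reacquire it afterwards. The sketch below is illustrative only: LazyTransportSlot and ensureInitialized are hypothetical names, a plain std::mutex stands in for the dispatch lock, and none of this reflects how TransportManager is actually structured. It would only be safe for transports whose init really does not need the dispatch lock, which seems to hold for basic+dpdk but apparently not for basic+udp.

```cpp
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for one lazily initialized transport; this is NOT
// RAMCloud's TransportManager, and std::mutex merely models the dispatch lock.
class LazyTransportSlot {
  public:
    explicit LazyTransportSlot(std::function<void()> slowInit)
        : slowInit(std::move(slowInit)) {}

    // Precondition: the caller holds dispatchLock. It is held again on return.
    // Assumes slowInit does not throw; otherwise the lock would stay released.
    void ensureInitialized(std::unique_lock<std::mutex>& dispatchLock) {
        if (initialized)
            return;                          // Fast path, read under the lock.
        dispatchLock.unlock();               // Let other RPCs be dispatched.
        std::call_once(initOnce, slowInit);  // Runs at most once, even if
                                             // several threads race here.
        dispatchLock.lock();
        initialized = true;                  // Written under the lock.
    }

  private:
    std::function<void()> slowInit;  // e.g. the expensive DPDK setup
    std::once_flag initOnce;
    bool initialized = false;        // Protected by the dispatch lock.
};
```

A caller would take the lock, call ensureInitialized(lock), and then open its session as before. The trade-off is that any state the caller examined before the call may have changed while the lock was released, so it would need to be rechecked afterwards.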

Environment

None

Status

Assignee

Unassigned

Reporter

Ryan Stutsman

Labels

None

Priority

Medium