Transactions still deadlocking over WorkerTimer


We are still getting deadlocks during crash recovery in Jonathan's test case, related to transactions:

  • One thread is executing txDecision and is trying to delete a PreparedOp, which requires stopping, then deleting, its WorkerTimer. The thread is stuck because the WorkerTimer is active; furthermore, this thread is working on behalf of a crash recovery, so the recovery cannot complete.

  • Another thread is executing the WorkerTimer, trying to invoke txHintFailed. However, this operation is looping in ObjectFinder, trying to find the server for the object of the notification. That server has crashed, so the coordinator can't return information until the server comes back up again.

This is another example of being bitten by synchronous behavior within an asynchronous RPC. The overall txHintFailed RPC is being executed asynchronously, but it invokes ObjectFinder::lookup, which runs synchronously, thereby defeating the WorkerTimer's attempt not to wait.

I am wondering if it's time to bite the bullet and find a solution to this problem. If we just made an asynchronous interface to ObjectFinder (which shouldn't be that hard?), I wonder if this would solve the problem for now. Are there a lot of other cases besides ObjectFinder?
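
To make the idea concrete, here is a rough sketch of what a non-blocking lookup might look like. It is not RAMCloud code: SketchObjectFinder, tryLookup, setOwner, TimerResult, and handleTimerEvent are all hypothetical names invented for illustration, and the real ObjectFinder and WorkerTimer interfaces may differ. The point is only that the timer handler asks for the server once, and if the answer isn't available it returns and asks to be rescheduled instead of looping.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for ObjectFinder (not the real RAMCloud class):
// maps a key hash to the name of the server that currently owns it.
class SketchObjectFinder {
  public:
    // Non-blocking lookup: returns the owner if it is currently known,
    // otherwise returns std::nullopt immediately instead of retrying.
    std::optional<std::string> tryLookup(uint64_t keyHash) const {
        auto it = owners.find(keyHash);
        if (it == owners.end())
            return std::nullopt;
        return it->second;
    }

    // Records an owner, e.g. once recovery has reassigned the tablet.
    void setOwner(uint64_t keyHash, std::string server) {
        owners[keyHash] = std::move(server);
    }

  private:
    std::unordered_map<uint64_t, std::string> owners;
};

// What the handler tells the (hypothetical) timer machinery to do next.
enum class TimerResult { DONE, RESCHEDULE };

// Hypothetical timer handler: it never waits for the lookup. If the
// owner is unknown (for example, because the server has crashed), it
// returns right away so the timer is not left active indefinitely.
TimerResult handleTimerEvent(
        const SketchObjectFinder& finder, uint64_t keyHash,
        const std::function<void(const std::string&)>& sendHint) {
    std::optional<std::string> server = finder.tryLookup(keyHash);
    if (!server)
        return TimerResult::RESCHEDULE;   // try again on a later tick
    sendHint(*server);                    // e.g. start the txHintFailed RPC
    return TimerResult::DONE;
}
```

With a handler shaped like this, a thread that wants to stop and delete the timer would only have to wait for one short handler invocation, not for the crashed server to come back. Whether this fits the actual WorkerTimer and ObjectFinder code is an open question.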




Yilong Li


John Ousterhout