While tracking down problems with crash recovery in Jonathan's test case, I found a deadlock involving the TxRecoveryManager lock. One thread is trying to perform transaction recovery and owns the lock. It has called down through TxRecoveryManager::RecoveryTask::performTask to sendDecisionRpc and on into ObjectFinder::lookup to find the object. The object in question happens to be in a table that was on the crashed server, so ObjectFinder keeps calling the coordinator with a getTableConfig RPC to find out where the object ends up after recovery. However, this call can't complete until recovery completes, and meanwhile this thread is holding the TxRecoveryManager lock.
At the same time, the same server is also one of the recovery masters. While replaying the log, it found a transaction-related entry and invoked TxRecoveryManager::recoverRecovery. This method is now looping, trying to acquire the TxRecoveryManager lock held by the first thread. Thus, recovery can't complete, and neither thread can make progress.
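To make the cycle concrete, here is a minimal standalone sketch of the same wait pattern. It is only a model: DeadlockModel, recoveryTask, replayLog, and demoDeadlocks are hypothetical names, not RAMCloud code, and the 50 ms timeout plus the replayGaveUp flag exist only so the demo terminates instead of hanging the way the real server does.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical model of the deadlock; none of these names are RAMCloud APIs.
struct DeadlockModel {
    std::timed_mutex managerLock;             // stands in for the TxRecoveryManager lock
    std::atomic<bool> lockHeld{false};        // thread 1 has acquired the lock
    std::atomic<bool> recoveryDone{false};    // log replay has finished
    std::atomic<bool> replayGaveUp{false};    // demo-only escape hatch (assumption)
    std::atomic<bool> lookupSucceeded{false}; // thread 1 located the object

    // Thread 1: models performTask -> sendDecisionRpc -> ObjectFinder::lookup.
    // It polls for the object's location while still holding the lock.
    void recoveryTask() {
        std::lock_guard<std::timed_mutex> guard(managerLock);
        lockHeld = true;
        while (!recoveryDone && !replayGaveUp) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        lookupSucceeded = recoveryDone.load();
    }

    // Thread 2: models log replay calling recoverRecovery, which needs the
    // same lock before replay (and therefore recovery) can finish.
    void replayLog() {
        while (!lockHeld) {
            std::this_thread::yield();  // ensure thread 1 owns the lock first
        }
        if (managerLock.try_lock_for(std::chrono::milliseconds(50))) {
            managerLock.unlock();
            recoveryDone = true;        // never reached while thread 1 holds the lock
        } else {
            replayGaveUp = true;        // the real code keeps looping instead
        }
    }
};

// Returns true when neither thread made progress, i.e. the cycle was observed.
bool demoDeadlocks() {
    DeadlockModel m;
    std::thread t1(&DeadlockModel::recoveryTask, &m);
    std::thread t2(&DeadlockModel::replayLog, &m);
    t1.join();
    t2.join();
    assert(m.recoveryDone || m.replayGaveUp);  // one of the two exits was taken
    return !m.lookupSucceeded && !m.recoveryDone;
}
```

Calling demoDeadlocks() from a small main (built with thread support, e.g. -pthread) should report that the lookup never succeeded and replay never got the lock. Thread 1 only stops waiting once replay finishes, and replay can only finish once thread 1 releases the lock.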
Here are the stack traces for the two threads: