Large transactions using basic transport cause master servers to commit suicide

Description

Large transactions (on the order of a thousand objects), when performed over basic transport, cause the coordinator to see slow pings from master servers and to conclude that they have crashed when they have not. The master servers then commit suicide, and the client reports that the transaction has timed out. This behavior does not occur with InfiniBand transport.

The client prints the following to the console (server logs are attached):

1452643483.711986256 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3099: timeout
1452643483.712008058 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3100: timeout
1452643483.712013480 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3101: timeout
1452643483.712018828 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3102: timeout
1452643483.712020844 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3103: timeout
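
For anyone triaging the attached logs, a small script can make the burst of aborts easier to see. The following is only a sketch: the regular expression is inferred from the five console lines above and may not match other RAMCloud log lines, and the names `ABORT_RE`, `summarize_aborts`, and `SAMPLE` are hypothetical helpers, not part of RAMCloud.

```python
import re
from collections import Counter
from decimal import Decimal

# Pattern inferred from the console output quoted in this issue (an assumption,
# not an official log format specification).
ABORT_RE = re.compile(
    r"^(?P<ts>\d+\.\d+) (?P<src>\S+):(?P<line>\d+) in (?P<func>\S+) "
    r"WARNING\[\d+\]: aborting (?P<opcode>\S+) RPC to server "
    r"(?P<server>\S+), sequence (?P<seq>\d+): timeout$"
)

# The five lines reported by the client, copied from the issue description.
SAMPLE = """\
1452643483.711986256 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3099: timeout
1452643483.712008058 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3100: timeout
1452643483.712013480 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3101: timeout
1452643483.712018828 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3102: timeout
1452643483.712020844 BasicTransport.cc:1098 in handleTimerEvent WARNING[1]: aborting TX_DECISION RPC to server 192.168.0.102:12247, sequence 3103: timeout
"""


def summarize_aborts(lines):
    """Count aborted RPCs per (server, opcode) and measure the time span."""
    counts = Counter()
    timestamps = []
    sequences = []
    for raw in lines:
        match = ABORT_RE.match(raw.strip())
        if match is None:
            continue  # Not an abort-on-timeout line; ignore it.
        counts[(match["server"], match["opcode"])] += 1
        # Decimal keeps the nanosecond digits that a float would round away.
        timestamps.append(Decimal(match["ts"]))
        sequences.append(int(match["seq"]))
    span = max(timestamps) - min(timestamps) if timestamps else Decimal(0)
    return {
        "counts": counts,
        "span_seconds": span,
        "first_seq": min(sequences) if sequences else None,
        "last_seq": max(sequences) if sequences else None,
    }


if __name__ == "__main__":
    summary = summarize_aborts(SAMPLE.splitlines())
    for (server, opcode), n in summary["counts"].items():
        print(f"{server} {opcode}: {n} aborted RPCs")
    print(f"span: {summary['span_seconds']} s, "
          f"sequences {summary['first_seq']}..{summary['last_seq']}")
```

Run against the excerpt above, this shows all five TX_DECISION RPCs to the same server being aborted within a few tens of microseconds, which suggests they were timed out together in a single timer pass rather than failing one by one.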

Environment

None

Status

Assignee

Unassigned

Reporter

Jonathan Ellithorpe

Labels

None

Priority

Medium