Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Corrected links that should have been relative instead of absolute.

...

1347302391.840341991 src/ServerList.cc:256 in ServerList::applyUpdate default NOTICE[23993:3]: Got server list update (version number 4)
1347302391.840372293 src/ServerList.cc:283 in ServerList::applyUpdate default NOTICE[23993:3]: Marking server id 2.0 as crashed

rc05 rc10 (coord):

1347302383.367961834 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 82 bytes
1347302391.589683109 src/CoordinatorServerManager.cc:329 in CoordinatorServerManager::hintServerDown default NOTICE[11923:2]: Checking server id 2.0 (infrc:host=192.168.1.102,port=11113)
1347302391.589726734 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:2]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1347302391.839722646 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:1]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1347302391.839754458 src/CoordinatorServerManager.cc:593 in CoordinatorServerManager::verifyServerFailure default NOTICE[11923:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.102,port=11113")
1347302391.839778594 src/CoordinatorServerManager.cc:334 in CoordinatorServerManager::hintServerDown default NOTICE[11923:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1347302391.840062004 src/CoordinatorServerManager.cc:388 in CoordinatorServerManager::ServerDown::execute default DEBUG[11923:2]: LogCabin: StateServerDown entryId: 6
1347302391.840191597 src/MasterRecoveryManager.cc:339 in MasterRecoveryManager::startMasterRecovery default NOTICE[11923:2]: Server 2.0 crashed, but it had no tablets
1347302391.840287675 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.103,port=11113 with 82 bytes
1347302391.840421882 src/CoordinatorServerList.cc:1071 in CoordinatorServerList::updateEntryVersion default DEBUG[11923:3]: server 1.0 updated (3->4)
1347302391.840460219 src/CoordinatorServerList.cc:1071 in CoordinatorServerList::updateEntryVersion default DEBUG[11923:3]: server 3.0 updated (3->4)
1347302392.344700020 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1347302392.344724344 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[11923:1]: Infiniband aborting PING request to infrc:host=192.168.1.102,port=11113
1347302395.891263126 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[11923:1]: Transmit failed for buffer 16514264: RETRY_EXC_ERR
1347302396.589683109 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 82 bytes
1347302397.349493880 (1000 duplicates of the following message were suppressed)
1347302397.349493880 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1347302402.354289096 (1000 duplicates of the following message were suppressed)
1347302402.354289096 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>

 

Pulled in the most recent code, added some more log messages to debug, and ran it again. Here's what the coord had to say:

rc10 (coord):

1348793663.876517585 (64 duplicates of the following message were suppressed)
1348793663.876517585 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:3]: Sending GET_SERVER_ID request to infrc:host=192.168.1.103,port=11113 with 4 bytes
1348793663.876636141 (63 duplicates of the following message were suppressed)
1348793663.876636141 src/RpcWrapper.cc:190 in RpcWrapper::isReady default DEBUG[18101:3]: Server infrc:host=192.168.1.103,port=11113 returned STATUS_RETRY from GET_SERVER_ID request
1348793677.421530999 src/CoordinatorServerManager.cc:311 in CoordinatorServerManager::hintServerDown default NOTICE[18101:2]: Checking server id 2.0 (infrc:host=192.168.1.102,port=11113)
1348793677.421590621 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:2]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1348793677.671584498 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:1]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1348793677.671618699 src/CoordinatorServerManager.cc:581 in CoordinatorServerManager::verifyServerFailure default NOTICE[18101:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.102,port=11113")
1348793677.671645215 src/CoordinatorServerManager.cc:316 in CoordinatorServerManager::hintServerDown default NOTICE[18101:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1348793677.680007062 src/CoordinatorServerManager.cc:370 in CoordinatorServerManager::ServerDown::execute default DEBUG[18101:2]: LogCabin: StateServerDown entryId: 6
1348793677.680051875 src/MasterRecoveryManager.cc:344 in MasterRecoveryManager::startMasterRecovery default NOTICE[18101:2]: Server 2.0 crashed, but it had no tablets
1348793677.680146177 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 91 bytes
1348793677.680232213 src/CoordinatorServerList.cc:1000 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Started ServerListRpc to 1.0
1348793677.680261538 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.103,port=11113 with 91 bytes
1348793677.680283384 src/CoordinatorServerList.cc:1000 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Started ServerListRpc to 3.0
1348793677.680392255 src/CoordinatorServerList.cc:976 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Completed ServerListRpc to 1.0
1348793677.680421534 src/CoordinatorServerList.cc:1135 in CoordinatorServerList::updateEntryVersion default DEBUG[18101:3]: server 1.0 updated (3->4)
1348793677.681636301 src/CoordinatorServerList.cc:976 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Completed ServerListRpc to 3.0
1348793677.681659441 src/CoordinatorServerList.cc:1135 in CoordinatorServerList::updateEntryVersion default DEBUG[18101:3]: server 3.0 updated (3->4)
1348793677.686200509 src/CoordinatorServerManager.cc:306 in CoordinatorServerManager::hintServerDown default NOTICE[18101:2]: Spurious crash report on unknown server id 2.0
1348793678.176558243 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793678.176582957 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[18101:1]: Infiniband aborting PING request to infrc:host=192.168.1.102,port=11113
1348793681.918351539 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[18101:1]: Transmit failed for buffer 26580032: RETRY_EXC_ERR
1348793683.181321280 (1000 duplicates of the following message were suppressed)
1348793683.181321280 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793688.186085328 (1000 duplicates of the following message were suppressed)
1348793688.186085328 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793693.190960484 (1000 duplicates of the following message were suppressed)
1348793693.190960484 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793698.195949984 (1000 duplicates of the following message were suppressed)
1348793698.195949984 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session

<<last two lines continue forever>>

...

3.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02
start master: rc04

...

Working hypothesis: crashing rc02 appends a serverdown entry to logcabin. Till ongaro implements a cleaner in logcabin, i have a workaround method readValidEntries() in LogCabinHelper that cleans up all the entries read to remove the invalidated entries. This method assumes that the order number of any particular entry corresponds to the entry id. This is probably not true. Update: Fixed.

...