Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

1347302391.840341991 src/ServerList.cc:256 in ServerList::applyUpdate default NOTICE[23993:3]: Got server list update (version number 4)
1347302391.840372293 src/ServerList.cc:283 in ServerList::applyUpdate default NOTICE[23993:3]: Marking server id 2.0 as crashed

rc05 rc10 (coord):

1347302383.367961834 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 82 bytes
1347302391.589683109 src/CoordinatorServerManager.cc:329 in CoordinatorServerManager::hintServerDown default NOTICE[11923:2]: Checking server id 2.0 (infrc:host=192.168.1.102,port=11113)
1347302391.589726734 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:2]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1347302391.839722646 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:1]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1347302391.839754458 src/CoordinatorServerManager.cc:593 in CoordinatorServerManager::verifyServerFailure default NOTICE[11923:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.102,port=11113")
1347302391.839778594 src/CoordinatorServerManager.cc:334 in CoordinatorServerManager::hintServerDown default NOTICE[11923:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1347302391.840062004 src/CoordinatorServerManager.cc:388 in CoordinatorServerManager::ServerDown::execute default DEBUG[11923:2]: LogCabin: StateServerDown entryId: 6
1347302391.840191597 src/MasterRecoveryManager.cc:339 in MasterRecoveryManager::startMasterRecovery default NOTICE[11923:2]: Server 2.0 crashed, but it had no tablets
1347302391.840287675 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.103,port=11113 with 82 bytes
1347302391.840421882 src/CoordinatorServerList.cc:1071 in CoordinatorServerList::updateEntryVersion default DEBUG[11923:3]: server 1.0 updated (3->4)
1347302391.840460219 src/CoordinatorServerList.cc:1071 in CoordinatorServerList::updateEntryVersion default DEBUG[11923:3]: server 3.0 updated (3->4)
1347302392.344700020 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1347302392.344724344 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[11923:1]: Infiniband aborting PING request to infrc:host=192.168.1.102,port=11113
1347302395.891263126 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[11923:1]: Transmit failed for buffer 16514264: RETRY_EXC_ERR
1347302396.589683109 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 82 bytes
1347302397.349493880 (1000 duplicates of the following message were suppressed)
1347302397.349493880 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1347302402.354289096 (1000 duplicates of the following message were suppressed)
1347302402.354289096 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>

 

Pulled in the most recent code, added some more log messages to debug, and ran it again. Here's what the coord had to say:

rc10 (coord):

1348793663.876517585 (64 duplicates of the following message were suppressed)
1348793663.876517585 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:3]: Sending GET_SERVER_ID request to infrc:host=192.168.1.103,port=11113 with 4 bytes
1348793663.876636141 (63 duplicates of the following message were suppressed)
1348793663.876636141 src/RpcWrapper.cc:190 in RpcWrapper::isReady default DEBUG[18101:3]: Server infrc:host=192.168.1.103,port=11113 returned STATUS_RETRY from GET_SERVER_ID request
1348793677.421530999 src/CoordinatorServerManager.cc:311 in CoordinatorServerManager::hintServerDown default NOTICE[18101:2]: Checking server id 2.0 (infrc:host=192.168.1.102,port=11113)
1348793677.421590621 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:2]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1348793677.671584498 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:1]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1348793677.671618699 src/CoordinatorServerManager.cc:581 in CoordinatorServerManager::verifyServerFailure default NOTICE[18101:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.102,port=11113")
1348793677.671645215 src/CoordinatorServerManager.cc:316 in CoordinatorServerManager::hintServerDown default NOTICE[18101:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1348793677.680007062 src/CoordinatorServerManager.cc:370 in CoordinatorServerManager::ServerDown::execute default DEBUG[18101:2]: LogCabin: StateServerDown entryId: 6
1348793677.680051875 src/MasterRecoveryManager.cc:344 in MasterRecoveryManager::startMasterRecovery default NOTICE[18101:2]: Server 2.0 crashed, but it had no tablets
1348793677.680146177 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 91 bytes
1348793677.680232213 src/CoordinatorServerList.cc:1000 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Started ServerListRpc to 1.0
1348793677.680261538 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.103,port=11113 with 91 bytes
1348793677.680283384 src/CoordinatorServerList.cc:1000 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Started ServerListRpc to 3.0
1348793677.680392255 src/CoordinatorServerList.cc:976 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Completed ServerListRpc to 1.0
1348793677.680421534 src/CoordinatorServerList.cc:1135 in CoordinatorServerList::updateEntryVersion default DEBUG[18101:3]: server 1.0 updated (3->4)
1348793677.681636301 src/CoordinatorServerList.cc:976 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Completed ServerListRpc to 3.0
1348793677.681659441 src/CoordinatorServerList.cc:1135 in CoordinatorServerList::updateEntryVersion default DEBUG[18101:3]: server 3.0 updated (3->4)
1348793677.686200509 src/CoordinatorServerManager.cc:306 in CoordinatorServerManager::hintServerDown default NOTICE[18101:2]: Spurious crash report on unknown server id 2.0
1348793678.176558243 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793678.176582957 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[18101:1]: Infiniband aborting PING request to infrc:host=192.168.1.102,port=11113
1348793681.918351539 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[18101:1]: Transmit failed for buffer 26580032: RETRY_EXC_ERR
1348793683.181321280 (1000 duplicates of the following message were suppressed)
1348793683.181321280 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793688.186085328 (1000 duplicates of the following message were suppressed)
1348793688.186085328 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793693.190960484 (1000 duplicates of the following message were suppressed)
1348793693.190960484 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793698.195949984 (1000 duplicates of the following message were suppressed)
1348793698.195949984 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session

<<last two lines continue forever>>

...

3.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02
start master: rc04

...