Weird things discovered while running Coordinator Recovery

Note: This works:

start coord
start master m1
start master m2
crash coord
start master m3

Everything goes as expected. When started, m3 gets the entire server list, including entries for m1 and m2.

Also repeated this with more crash/restart cycles. As in:

start coord
start master m1
start master m2
crash coord
start master m3
crash coord
start master m4
crash coord
start master m5

Works fine.


1.

start coord
start master m1
start master m2
crash master m2

Reaction:
The coordinator tries to recover the backup on m2, even though I started the master explicitly without backups (options: -M r=0).

Ryan's suggestion for a quick hack fix:
In BackupFailureMonitor.cc, in the constructor for BackupFailureMonitor, change the initializer tracker(context, this) to tracker(context) (sketched below).

That worked for now.
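
For reference, here's roughly what that one-line hack looks like, sketched with stub types rather than the real RAMCloud classes (the constructor shape and the claim about what the second argument does are my assumptions):

// Sketch of the quick hack in BackupFailureMonitor.cc, using stub types.
// The second ServerTracker argument presumably registers the monitor as a
// change callback; dropping it means the monitor is never notified of
// backup failures, which was enough to avoid the bogus recovery above.
struct Context {};
struct TrackerCallback { virtual ~TrackerCallback() {} };

struct ServerTracker {
    explicit ServerTracker(Context* c, TrackerCallback* cb = nullptr)
        : context(c), callback(cb) {}
    Context* context;
    TrackerCallback* callback;      // nullptr => no notifications delivered
};

struct BackupFailureMonitor : TrackerCallback {
    explicit BackupFailureMonitor(Context* context)
        // was: tracker(context, this) -- the hack drops the 'this'
        : tracker(context)
    {}
    ServerTracker tracker;
};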


2.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02

Reaction:

rc01, rc03:

1347302391.840341991 src/ServerList.cc:256 in ServerList::applyUpdate default NOTICE[23993:3]: Got server list update (version number 4)
1347302391.840372293 src/ServerList.cc:283 in ServerList::applyUpdate default NOTICE[23993:3]: Marking server id 2.0 as crashed

rc10 (coord):

1347302383.367961834 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 82 bytes
1347302391.589683109 src/CoordinatorServerManager.cc:329 in CoordinatorServerManager::hintServerDown default NOTICE[11923:2]: Checking server id 2.0 (infrc:host=192.168.1.102,port=11113)
1347302391.589726734 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:2]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1347302391.839722646 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:1]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1347302391.839754458 src/CoordinatorServerManager.cc:593 in CoordinatorServerManager::verifyServerFailure default NOTICE[11923:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.102,port=11113")
1347302391.839778594 src/CoordinatorServerManager.cc:334 in CoordinatorServerManager::hintServerDown default NOTICE[11923:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1347302391.840062004 src/CoordinatorServerManager.cc:388 in CoordinatorServerManager::ServerDown::execute default DEBUG[11923:2]: LogCabin: StateServerDown entryId: 6
1347302391.840191597 src/MasterRecoveryManager.cc:339 in MasterRecoveryManager::startMasterRecovery default NOTICE[11923:2]: Server 2.0 crashed, but it had no tablets
1347302391.840287675 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.103,port=11113 with 82 bytes
1347302391.840421882 src/CoordinatorServerList.cc:1071 in CoordinatorServerList::updateEntryVersion default DEBUG[11923:3]: server 1.0 updated (3->4)
1347302391.840460219 src/CoordinatorServerList.cc:1071 in CoordinatorServerList::updateEntryVersion default DEBUG[11923:3]: server 3.0 updated (3->4)
1347302392.344700020 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1347302392.344724344 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[11923:1]: Infiniband aborting PING request to infrc:host=192.168.1.102,port=11113
1347302395.891263126 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[11923:1]: Transmit failed for buffer 16514264: RETRY_EXC_ERR
1347302396.589683109 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 82 bytes
1347302397.349493880 (1000 duplicates of the following message were suppressed)
1347302397.349493880 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1347302402.354289096 (1000 duplicates of the following message were suppressed)
1347302402.354289096 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>

 

Pulled in the most recent code, added some more log messages for debugging, and ran it again. Here's what the coord had to say:

rc10 (coord):

1348793663.876517585 (64 duplicates of the following message were suppressed)
1348793663.876517585 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:3]: Sending GET_SERVER_ID request to infrc:host=192.168.1.103,port=11113 with 4 bytes
1348793663.876636141 (63 duplicates of the following message were suppressed)
1348793663.876636141 src/RpcWrapper.cc:190 in RpcWrapper::isReady default DEBUG[18101:3]: Server infrc:host=192.168.1.103,port=11113 returned STATUS_RETRY from GET_SERVER_ID request
1348793677.421530999 src/CoordinatorServerManager.cc:311 in CoordinatorServerManager::hintServerDown default NOTICE[18101:2]: Checking server id 2.0 (infrc:host=192.168.1.102,port=11113)
1348793677.421590621 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:2]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1348793677.671584498 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:1]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1348793677.671618699 src/CoordinatorServerManager.cc:581 in CoordinatorServerManager::verifyServerFailure default NOTICE[18101:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.102,port=11113")
1348793677.671645215 src/CoordinatorServerManager.cc:316 in CoordinatorServerManager::hintServerDown default NOTICE[18101:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1348793677.680007062 src/CoordinatorServerManager.cc:370 in CoordinatorServerManager::ServerDown::execute default DEBUG[18101:2]: LogCabin: StateServerDown entryId: 6
1348793677.680051875 src/MasterRecoveryManager.cc:344 in MasterRecoveryManager::startMasterRecovery default NOTICE[18101:2]: Server 2.0 crashed, but it had no tablets
1348793677.680146177 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 91 bytes
1348793677.680232213 src/CoordinatorServerList.cc:1000 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Started ServerListRpc to 1.0
1348793677.680261538 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[18101:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.103,port=11113 with 91 bytes
1348793677.680283384 src/CoordinatorServerList.cc:1000 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Started ServerListRpc to 3.0
1348793677.680392255 src/CoordinatorServerList.cc:976 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Completed ServerListRpc to 1.0
1348793677.680421534 src/CoordinatorServerList.cc:1135 in CoordinatorServerList::updateEntryVersion default DEBUG[18101:3]: server 1.0 updated (3->4)
1348793677.681636301 src/CoordinatorServerList.cc:976 in CoordinatorServerList::dispatchRpc default DEBUG[18101:3]: **** Completed ServerListRpc to 3.0
1348793677.681659441 src/CoordinatorServerList.cc:1135 in CoordinatorServerList::updateEntryVersion default DEBUG[18101:3]: server 3.0 updated (3->4)
1348793677.686200509 src/CoordinatorServerManager.cc:306 in CoordinatorServerManager::hintServerDown default NOTICE[18101:2]: Spurious crash report on unknown server id 2.0
1348793678.176558243 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793678.176582957 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[18101:1]: Infiniband aborting PING request to infrc:host=192.168.1.102,port=11113
1348793681.918351539 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[18101:1]: Transmit failed for buffer 26580032: RETRY_EXC_ERR
1348793683.181321280 (1000 duplicates of the following message were suppressed)
1348793683.181321280 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793688.186085328 (1000 duplicates of the following message were suppressed)
1348793688.186085328 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793693.190960484 (1000 duplicates of the following message were suppressed)
1348793693.190960484 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1348793698.195949984 (1000 duplicates of the following message were suppressed)
1348793698.195949984 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[18101:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>
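
To make the coordinator-side sequence above easier to follow, here's a toy sketch of the hint-server-down flow as I read it from these logs (my own names and structure, not the actual CoordinatorServerManager code): verify the failure with a ping, mark the entry crashed, push updates, and treat any later hint for a server that is no longer UP as the "Spurious crash report" NOTICE seen above.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Toy sketch, not RAMCloud code; status values as observed in these logs.
enum class Status { UP, CRASHED };

struct Entry { std::string locator; Status status; };

struct Coordinator {
    std::map<uint64_t, Entry> serverList;

    // Stand-in for the PING verification; here the server really is dead.
    bool pingSucceeds(uint64_t) const { return false; }

    void hintServerDown(uint64_t id) {
        auto it = serverList.find(id);
        if (it == serverList.end() || it->second.status != Status::UP) {
            std::cout << "Spurious crash report on unknown server id "
                      << id << "\n";
            return;                     // duplicate hint: nothing to do
        }
        std::cout << "Checking server id " << id
                  << " (" << it->second.locator << ")\n";
        if (pingSucceeds(id))
            return;                     // false alarm, server still alive
        std::cout << "Server id " << id << " has crashed, notifying the"
                     " cluster and starting recovery\n";
        it->second.status = Status::CRASHED;
        // ...append ServerDown to LogCabin, start recovery if it had
        // tablets, and push SET_SERVER_LIST updates to the others...
    }
};

The part this doesn't cover is the tail of the log: the SessionAlarmTimer keeps aborting a session to rc02 forever, which suggests something on the coordinator keeps retrying against the dead host.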


3.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02
start master: rc04

Weird thing:

rc04 gets the entire server list (rc01, rc02, rc03, rc04), in which rc02 has status crashed (status number: 1). Why isn't this server gone entirely?
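
For reference, here's how I'm reading the status numbers in these dumps. This is an assumption on my part, not copied from ServerList.h:

#include <cstdint>

// Assumed mapping for the status values seen in these server list dumps:
// crashed servers keep an entry, marked CRASHED, instead of vanishing
// from the list right away.
enum class ServerStatus : uint32_t {
    UP      = 0,    // "status: 0" in the rc05 dump in point 5
    CRASHED = 1,    // what rc02 shows up as here and in point 4
    REMOVE  = 2,    // entry is on its way out of the list
};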


4.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02
start master: rc04 (as in point 3: gets the entire server list, including rc02, which has status crashed)
crash coord: rc10
restart coord: rc10

Reaction:

rc03, rc10 (coord): nothing

rc01:

1347303047.493191417 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[24017:3]: A repeated/old update version 1 was sent to a ServerList with version 5.
1347303047.494626917 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[24017:3]: A repeated/old update version 2 was sent to a ServerList with version 5.

rc04:

1347303047.495092655 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[26203:3]: A repeated/old update version 2 was sent to a ServerList with version 5.

(see implications after point 5)


5.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02
start master: rc04 (gets the entire server list, including rc02, which has status crashed)
crash coord: rc10
restart coord: rc10
start master: rc05

Reaction:

rc05: gets a server list with the following info:

server {
services: 25
server_id: 1
service_locator: "infrc:host=192.168.1.101,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
server {
services: 25
server_id: 2
service_locator: "infrc:host=192.168.1.105,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
server {
services: 25
server_id: 4
service_locator: "infrc:host=192.168.1.104,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
version_number: 3

rc01:

1347303277.037728247 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[24017:3]: A repeated/old update version 3 was sent to a ServerList with version 5.

rc04:

1347303277.038167949 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[26203:3]: A repeated/old update version 3 was sent to a ServerList with version 5.

(see implications below)


4 & 5 =>

Problem 1:

The server list received by rc05 has version number 3, which is older than the server list versions the masters have already seen. Also, in points 4 and 5 the other servers receive updates with older version numbers, which indicates the same problem. This is to be expected, since on coordinator recovery we rebuild the server list from scratch and don't keep the old version number around. After more work on the server list (RAM-453, RAM-455, and possibly some more) this will hopefully not be an issue anymore.
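
To illustrate those "repeated/old update" NOTICEs, here's a toy version of the check I believe the masters apply (assumed names and structure, not the actual ServerList.cc): any update whose version is not newer than the list's current version is dropped, so a restarted coordinator that starts numbering from scratch can't get its early updates accepted by masters that are already at version 5.

#include <cstdint>
#include <iostream>

// Toy sketch of the master-side version check (assumed names).
struct MasterServerList {
    uint64_t version = 0;

    bool applyUpdate(uint64_t updateVersion) {
        if (updateVersion <= version) {
            std::cout << "A repeated/old update version " << updateVersion
                      << " was sent to a ServerList with version "
                      << version << ".\n";
            return false;           // stale update: dropped
        }
        version = updateVersion;    // newer: apply it and advance
        return true;
    }
};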

Problem 2:

Server rc03 disappeared during the coordinator recovery in point 4. This is clear from the server list received by rc05, and was also indicated in point 4, when servers rc01 and rc04 received the (old / redundant) server list updates but rc03 didn't. To confirm, I started another master, rc06. It was assigned server id 3 (reused from rc03 going out of the cluster).

Working hypothesis: crashing rc02 appends a ServerDown entry to LogCabin. Until Ongaro implements a cleaner in LogCabin, I have a workaround method, readValidEntries() in LogCabinHelper, that cleans up the entries read by removing the invalidated ones. This method assumed that the order number of any particular entry corresponds to the entry id. This is not true. Update: fixed.
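
To spell out that bug with a toy model (hypothetical types and method shape; the real LogCabinHelper interface may differ): the broken version keyed the "is this entry invalidated?" check on an entry's position in the vector returned by the read, rather than on the entry id stored in the entry itself.

#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Toy model of the readValidEntries() bug (hypothetical types, not the
// real LogCabinHelper code). An entry's LogCabin id is NOT the same as
// its position in the vector that a read returns.
struct Entry {
    uint64_t entryId;                   // id assigned by LogCabin
    std::vector<uint64_t> invalidates;  // entry ids this entry invalidates
};

std::vector<Entry>
readValidEntries(const std::vector<Entry>& all)
{
    std::set<uint64_t> invalidated;
    for (const Entry& e : all)
        for (uint64_t id : e.invalidates)
            invalidated.insert(id);

    std::vector<Entry> valid;
    for (size_t i = 0; i < all.size(); i++) {
        // Buggy version: tested invalidated.count(i), i.e. used the
        // position i as if it were the entry id. The fix keys on the
        // real id carried by the entry:
        if (invalidated.count(all[i].entryId) == 0)
            valid.push_back(all[i]);
    }
    return valid;
}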