
Note: This works:

start coord
start master m1
start master m2
crash coord
start master m3

Everything goes as expected. When started, m3 gets the entire server list, including entries for m1 and m2.

Also repeated this with more coordinator crash/restart cycles. As in:

start coord
start master m1
start master m2
crash coord
start master m3
crash coord
start master m4
crash coord
start master m5

Works fine.


1.

start coord
start master m1
start master m2
crash master m2

Reaction:
Tries to recover the backup on m2, even though I started the master explicitly with no backups (options: -M r=0).

Ryan's suggestion for a quick hack fix:
In BackupFailureMonitor.cc, in the constructor for BackupFailureMonitor, change the initializer tracker(context, this) to tracker(context).

That worked for now.
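For the record, a minimal sketch of why the one-argument initializer silences the behavior. This is not RAMCloud's actual code; Context, ServerTracker, and TrackerCallback below are simplified placeholders. The idea, as I understand it, is that passing this as the tracker's second argument registers the BackupFailureMonitor for server-change callbacks, so a crash kicks off backup recovery; constructing the tracker with only the context leaves no callback registered, so the recovery path is never entered.

#include <cstdio>

struct Context {};

// Placeholder for the tracker-change callback interface.
struct TrackerCallback {
    virtual void trackerChangesEnqueued() = 0;
    virtual ~TrackerCallback() {}
};

// Placeholder for ServerTracker: the optional second constructor
// argument registers a callback that fires on server-list changes.
struct ServerTracker {
    explicit ServerTracker(Context* context, TrackerCallback* callback = nullptr)
        : context(context), callback(callback) {}
    void noteServerCrashed() {
        if (callback)                       // registered: recovery gets kicked off
            callback->trackerChangesEnqueued();
        // not registered: the crash is recorded but nothing reacts
    }
    Context* context;
    TrackerCallback* callback;
};

struct BackupFailureMonitor : TrackerCallback {
    explicit BackupFailureMonitor(Context* context)
        : tracker(context)                  // hack fix; was tracker(context, this)
    {}
    void trackerChangesEnqueued() override {
        std::printf("would start backup recovery here\n");
    }
    ServerTracker tracker;
};

int main() {
    Context context;
    BackupFailureMonitor monitor(&context);
    monitor.tracker.noteServerCrashed();    // prints nothing: no callback registered
    return 0;
}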


2.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02

Reaction:

rc01, rc03:

1347302391.840341991 src/ServerList.cc:256 in ServerList::applyUpdate default NOTICE[23993:3]: Got server list update (version number 4)
1347302391.840372293 src/ServerList.cc:283 in ServerList::applyUpdate default NOTICE[23993:3]: Marking server id 2.0 as crashed

rc05 (coord):

1347302383.367961834 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 82 bytes
1347302391.589683109 src/CoordinatorServerManager.cc:329 in CoordinatorServerManager::hintServerDown default NOTICE[11923:2]: Checking server id 2.0 (infrc:host=192.168.1.102,port=11113)
1347302391.589726734 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:2]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1347302391.839722646 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:1]: Sending PING request to infrc:host=192.168.1.102,port=11113 with 12 bytes
1347302391.839754458 src/CoordinatorServerManager.cc:593 in CoordinatorServerManager::verifyServerFailure default NOTICE[11923:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.102,port=11113")
1347302391.839778594 src/CoordinatorServerManager.cc:334 in CoordinatorServerManager::hintServerDown default NOTICE[11923:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1347302391.840062004 src/CoordinatorServerManager.cc:388 in CoordinatorServerManager::ServerDown::execute default DEBUG[11923:2]: LogCabin: StateServerDown entryId: 6
1347302391.840191597 src/MasterRecoveryManager.cc:339 in MasterRecoveryManager::startMasterRecovery default NOTICE[11923:2]: Server 2.0 crashed, but it had no tablets
1347302391.840287675 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.103,port=11113 with 82 bytes
1347302391.840421882 src/CoordinatorServerList.cc:1071 in CoordinatorServerList::updateEntryVersion default DEBUG[11923:3]: server 1.0 updated (3->4)
1347302391.840460219 src/CoordinatorServerList.cc:1071 in CoordinatorServerList::updateEntryVersion default DEBUG[11923:3]: server 3.0 updated (3->4)
1347302392.344700020 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1347302392.344724344 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[11923:1]: Infiniband aborting PING request to infrc:host=192.168.1.102,port=11113
1347302395.891263126 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[11923:1]: Transmit failed for buffer 16514264: RETRY_EXC_ERR
1347302396.589683109 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[11923:3]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.101,port=11113 with 82 bytes
1347302397.349493880 (1000 duplicates of the following message were suppressed)
1347302397.349493880 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session
1347302402.354289096 (1000 duplicates of the following message were suppressed)
1347302402.354289096 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[11923:1]: Server at infrc:host=192.168.1.102,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>
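To make that log easier to follow, here is a rough reconstruction of the sequence the coordinator goes through, pieced together from the messages above. This is not RAMCloud code; the function names are just labels for the steps the log shows. The interesting part is the tail: even after the crash is verified, the cluster notified, and recovery skipped (no tablets), something on the coordinator apparently still holds a session to the dead rc02 and keeps pinging it, which presumably produces the endless SessionAlarm warnings.

#include <cstdio>

// Labels for the steps visible in the coordinator log; purely illustrative.
static bool pingServer(const char* locator) {
    std::printf("Sending PING request to %s\n", locator);
    return false;                           // rc02 is dead; the ping never answers
}

static void hintServerDown(const char* locator) {
    std::printf("Checking server id 2.0 (%s)\n", locator);
    if (!pingServer(locator)) {
        std::printf("Verified host failure: id 2.0 (\"%s\")\n", locator);
        std::printf("Server id 2.0 has crashed, notifying the cluster "
                    "and starting recovery\n");
        // Notification = SET_SERVER_LIST updates (list version 3 -> 4 on rc01/rc03);
        // recovery is a no-op because rc02 held no tablets.
        std::printf("Server 2.0 crashed, but it had no tablets\n");
    }
}

int main() {
    hintServerDown("infrc:host=192.168.1.102,port=11113");
    // After this point the log shows SessionAlarmTimer warning forever that
    // rc02 "is not responding, aborting session" -- the open question here.
    return 0;
}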


3.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02
start master: rc04

Weird thing:

rc04 gets the entire server list (rc01, rc02, rc03, rc04), in which rc02 has status crashed (status number: 1). Why isn't this server gone entirely?
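For reference, a tiny decoder for the status numbers as they show up in these dumps. Only the values actually observed above are mapped (0 on live servers, 1 on rc02 after the crash); whether there is a further "removed" state that would eventually drop rc02 from the list is exactly the open question.

#include <cstdint>
#include <cstdio>

// Decodes the status field as observed in the server lists in these notes.
static const char* describeStatus(uint32_t status) {
    switch (status) {
        case 0:  return "up";
        case 1:  return "crashed (still listed, apparently not removed yet)";
        default: return "unknown";
    }
}

int main() {
    // Entries as seen by rc04 in point 3.
    std::printf("rc01: %s\n", describeStatus(0));
    std::printf("rc02: %s\n", describeStatus(1));
    return 0;
}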


4.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02
start master: rc04 (as in point 3: gets entire server list, including rc02, which has status crashed)
crash coord: rc10
restart coord: rc10

Reaction:

rc03, rc10 (coord): nothing

rc01:

1347303047.493191417 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[24017:3]: A repeated/old update version 1 was sent to a ServerList with version 5.
1347303047.494626917 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[24017:3]: A repeated/old update version 2 was sent to a ServerList with version 5.

rc04:

1347303047.495092655 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[26203:3]: A repeated/old update version 2 was sent to a ServerList with version 5.

(see implications after point 5)


5.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02
start master: rc04 (gets entire server list, including rc02, which has status crashed)
crash coord: rc10
restart coord: rc10
start master: rc05

Reaction:

rc05: gets a server list with the following info:

server {
  services: 25
  server_id: 1
  service_locator: "infrc:host=192.168.1.101,port=11113"
  expected_read_mbytes_per_sec: 0
  status: 0
}
server {
  services: 25
  server_id: 2
  service_locator: "infrc:host=192.168.1.105,port=11113"
  expected_read_mbytes_per_sec: 0
  status: 0
}
server {
  services: 25
  server_id: 4
  service_locator: "infrc:host=192.168.1.104,port=11113"
  expected_read_mbytes_per_sec: 0
  status: 0
}
version_number: 3

rc01:

1347303277.037728247 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[24017:3]: A repeated/old update version 3 was sent to a ServerList with version 5.

rc04:

1347303277.038167949 src/ServerList.cc:114 in ServerList::applyServerList default NOTICE[26203:3]: A repeated/old update version 3 was sent to a ServerList with version 5.

(see implications below)


4 & 5 =>

Problem 1:

The server list received by rc05 has version number 3, which is an older number than the server list versions the masters have already seen. Also, in points 4 and 5, the other servers receive updates with older version numbers, which indicates the same problem. This is to be expected, since we rebuild the server list from scratch on coordinator recovery and don't keep the old version number around. After more work on the server list (RAM-453, RAM-455, and possibly more) this will hopefully no longer be an issue.
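A sketch of the version check that produces those "repeated/old update" notices. The real logic lives in ServerList::applyServerList / applyUpdate; this is just the shape of it, reconstructed from the log messages: once a master's local list is at version 5, everything the rebuilt coordinator sends with a smaller version number gets discarded as stale.

#include <cstdint>
#include <cstdio>

// Simplified stand-in for a master's local server list; not RAMCloud code.
struct LocalServerList {
    uint64_t version;

    explicit LocalServerList(uint64_t version) : version(version) {}

    // Returns true if the incoming update was applied.
    bool apply(uint64_t incomingVersion) {
        if (incomingVersion <= version) {
            std::printf("A repeated/old update version %llu was sent to a "
                        "ServerList with version %llu.\n",
                        (unsigned long long)incomingVersion,
                        (unsigned long long)version);
            return false;   // stale: the recovered coordinator restarted its numbering
        }
        version = incomingVersion;
        return true;
    }
};

int main() {
    LocalServerList rc01(5);    // rc01 after the updates in point 4
    rc01.apply(3);              // rebuilt coordinator list (version 3) is ignored
    rc01.apply(6);              // a genuinely newer update would be accepted
    return 0;
}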

Problem 2:

Server rc03 disappeared during the coordinator recovery in point 4. This is clear in the server list received by rc05, and was also indicated in point 4, when servers rc01 and rc04 received (old/redundant) server list updates but rc03 didn't. To confirm, I started another master, rc06. This server was assigned server id 3 (reused, since rc03 had gone out of the cluster).

Working hypothesis: crashing rc02 appends a ServerDown entry to LogCabin. Until Ongaro implements a cleaner in LogCabin, I have a workaround method readValidEntries() in LogCabinHelper that cleans up the entries read, removing the invalidated ones. This method assumes that the ordinal position of any particular entry corresponds to its entry id, which is probably not true.
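To spell out why that assumption breaks (the types and the LogCabin interface below are simplified placeholders, not the real LogCabinHelper API): as soon as readValidEntries() filters an invalidated entry out of the vector it read back, every later entry sits at an index that no longer matches its entry id, so anything that uses the position as the id can end up pointing at the wrong server's entry, which could plausibly be how rc03's state got lost during the coordinator recovery.

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Simplified stand-in for an entry read back from LogCabin.
struct Entry {
    uint64_t entryId;       // id assigned by the log when the entry was appended
    std::string payload;
    bool invalidated;       // e.g. invalidated by a later ServerDown entry
};

int main() {
    // Entries roughly as appended in point 4: enlist rc01..rc03, then
    // rc02's crash appends a ServerDown that invalidates rc02's entry.
    std::vector<Entry> readBack;
    readBack.push_back(Entry{0, "enlist rc01", false});
    readBack.push_back(Entry{1, "enlist rc02", true});
    readBack.push_back(Entry{2, "enlist rc03", false});
    readBack.push_back(Entry{3, "ServerDown rc02", false});

    // Workaround-style filtering: keep only the valid entries.
    std::vector<Entry> valid;
    for (size_t i = 0; i < readBack.size(); i++)
        if (!readBack[i].invalidated)
            valid.push_back(readBack[i]);

    // "Index == entry id" now fails for everything after the filtered entry:
    // index 1 holds entry id 2 (rc03), not entry id 1.
    for (size_t i = 0; i < valid.size(); i++)
        std::printf("index %zu holds entry id %llu (%s)\n",
                    i, (unsigned long long)valid[i].entryId,
                    valid[i].payload.c_str());
    return 0;
}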

