Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 12 Next »

Note: This works:

start coord
start master m1
start master m2
crash coord
start master m3

Everything goes as expected. When started, m3 gets the entire server list, including entries for m1 and m2.

Also repeated this more times. As in,

start coord
start master m1
start master m2
crash coord
start master m3
crash coord
start master m4
crash coord
start master m5

Works fine.

 


1.

start coord
start master m1
start master m2
crash master m2

Reaction:
Tries to recover the backup on m2 even though I started the master with explicitly no backups (options: -M r=0)

Ryan's suggestion for a quick hack fix:
In BackupFailureMonitor.cc - in constructor for BackupFailureMonitor, change the initializer tracker(context, this) to tracker(context).

That worked for now.

 


2.

start coord: rc10
start master: rc01
start master: rc02
start master: rc03
crash master: rc02

Reaction:

rc01, rc03:

1347302391.840341991 src/ServerList.cc:256 in ServerList::applyUpdate default NOTICE[23993:3]: Got server list update (version number 4)
1347302391.840372293 src/ServerList.cc:283 in ServerList::applyUpdate default NOTICE[23993:3]: Marking server id 2.0 as crashed

rc05 (coord):

1346965811.147348807 src/CoordinatorServerManager.cc:335 in CoordinatorServerManager::hintServerDown default NOTICE[8989:2]: Checking server id 2.0 (infrc:host=192.168.1.107,port=11113)
1346965811.147401900 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:2]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346965811.397395327 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:1]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346965811.397429587 src/CoordinatorServerManager.cc:611 in CoordinatorServerManager::verifyServerFailure default NOTICE[8989:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.107,port=11113")
1346965811.397454893 src/CoordinatorServerManager.cc:340 in CoordinatorServerManager::hintServerDown default NOTICE[8989:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1346965811.397619774 src/CoordinatorServerManager.cc:408 in CoordinatorServerManager::ServerDown::execute default DEBUG[8989:2]: LogCabin: StateServerDown entryId: 6
1346965811.397665957 src/MasterRecoveryManager.cc:339 in MasterRecoveryManager::startMasterRecovery default NOTICE[8989:2]: Server 2.0 crashed, but it had no tablets
1346965811.397899237 src/CoordinatorServerManager.cc:330 in CoordinatorServerManager::hintServerDown default NOTICE[8989:2]: Spurious crash report on unknown server id 2.0
1346965811.397955087 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.108,port=11113 with 80 bytes
1346965811.398121531 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[8989:3]: Server list update sent to server 3.0
1346965811.902357437 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[8989:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965811.902389656 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[8989:1]: Infiniband aborting PING request to infrc:host=192.168.1.107,port=11113
1346965815.524228181 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[8989:1]: Transmit failed for buffer 39173256: RETRY_EXC_ERR
1346965816.147348807 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 80 bytes
1346965816.147348807 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[8989:3]: Server list update sent to server 1.0
1346965816.907036043 (1000 duplicates of the following message were suppressed)
1346965816.907036043 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[8989:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965821.911712711 (1000 duplicates of the following message were suppressed)
1346965821.911712711 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[8989:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>

 


3.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07

Reaction:

rc08: nothing bad

1346984637.572391059 src/FailureDetector.cc:160 in FailureDetector::pingRandomServer default WARNING[31253:2]: Ping timeout to server id 2.0 (locator "infrc:host=192.168.1.107,port=11113")
1346984637.606243205 src/MembershipService.cc:125 in MembershipService::updateServerList default NOTICE[31253:4]: Got server list update (version number 4)
1346984637.606272072 src/ServerList.cc:202 in ServerList::applyUpdate default NOTICE[31253:4]: Got server list update (version number 4)
1346984637.606292226 src/ServerList.cc:229 in ServerList::applyUpdate default NOTICE[31253:4]: Marking server id 2.0 as crashed

rc06: nothing bad

1346984637.572391059 src/FailureDetector.cc:160 in FailureDetector::pingRandomServer default WARNING[31253:2]: Ping timeout to server id 2.0 (locator "infrc:host=192.168.1.107,port=11113")
1346984637.606243205 src/MembershipService.cc:125 in MembershipService::updateServerList default NOTICE[31253:4]: Got server list update (version number 4)
1346984637.606272072 src/ServerList.cc:202 in ServerList::applyUpdate default NOTICE[31253:4]: Got server list update (version number 4)
1346984637.606292226 src/ServerList.cc:229 in ServerList::applyUpdate default NOTICE[31253:4]: Marking server id 2.0 as crashed

coord: freaks out

1346984637.355197986 src/CoordinatorServerManager.cc:335 in CoordinatorServerManager::hintServerDown default NOTICE[16448:2]: Checking server id 2.0 (infrc:host=192.168.1.107,port=11113)
1346984637.355266589 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[16448:2]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346984637.605262131 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[16448:1]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346984637.605295631 src/CoordinatorServerManager.cc:611 in CoordinatorServerManager::verifyServerFailure default NOTICE[16448:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.107,port=11113")
1346984637.605320810 src/CoordinatorServerManager.cc:340 in CoordinatorServerManager::hintServerDown default NOTICE[16448:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1346984637.605706128 src/CoordinatorServerManager.cc:408 in CoordinatorServerManager::ServerDown::execute default DEBUG[16448:2]: LogCabin: StateServerDown entryId: 6
1346984637.605752629 src/MasterRecoveryManager.cc:339 in MasterRecoveryManager::startMasterRecovery default NOTICE[16448:2]: Server 2.0 crashed, but it had no tablets
1346984637.605827124 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[16448:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 80 bytes
1346984637.606006619 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[16448:3]: Server list update sent to server 1.0
1346984637.606036795 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[16448:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.108,port=11113 with 80 bytes
1346984637.606072872 src/CoordinatorServerManager.cc:330 in CoordinatorServerManager::hintServerDown default NOTICE[16448:2]: Spurious crash report on unknown server id 2.0
1346984637.606230215 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[16448:3]: Server list update sent to server 3.0
1346984638.110239099 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[16448:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346984638.110263925 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[16448:1]: Infiniband aborting PING request to infrc:host=192.168.1.107,port=11113
1346984642.606072872 src/CoordinatorServerManager.cc:330 in CoordinatorServerManager::hintServerDown default NOTICE[16448:2]: Spurious crash report on unknown server id 2.0
1346984643.115031869 (1000 duplicates of the following message were suppressed)
1346984643.115031869 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[16448:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346984648.119826776 (1000 duplicates of the following message were suppressed)
1346984648.119826776 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[16448:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session

<<< last two lines repeat forever >>>

 


4.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07
start master: rc09

weird thing:

rc09 gets entire server list, including rc07, which has status crashed. why isn't this server entirely gone?

 


5.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07
start master: rc09 (gets entire server list, including rc07, which has status crashed)
crash coord: rc05
restart coord: rc05

Reaction:

rc08, rc09: nothing

rc06: freaks out

1346966448.440068376 src/ServerList.cc:316 in ServerList::add default WARNING[2916:4]: Duplicate add of ServerId 1.0!
1346966448.440084603 src/ServerList.cc:316 in ServerList::add default WARNING[2916:4]: Duplicate add of ServerId 4.0!
1346966449.033342651 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500809 us.
1346966449.538371007 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[2916:1]: Server at infrc:host=192.168.1.105,port=11114 is not responding, aborting session
1346966449.538404924 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[2916:1]: Infiniband aborting SEND_SERVER_LIST request to infrc:host=192.168.1.105,port=11114
1346966449.538426687 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[2916:1]: Infiniband aborting PING request to infrc:host=192.168.1.105,port=11114
1346966450.443647526 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500623 us.
1346966450.944640867 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500822 us.
1346966451.545478201 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500537 us.
1346966452.046287039 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500646 us.
1346966452.647144517 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500579 us.
1346966453.148023608 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500783 us.
1346966453.296479628 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[2916:1]: Transmit failed for buffer 16017008: RETRY_EXC_ERR
1346966453.296504019 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[2916:1]: Transmit failed for buffer 16017032: WR_FLUSH_ERR
1346966453.439968920 (6 duplicates of the following message were suppressed)
1346966453.439968920 src/ServerList.cc:112 in ServerList::applyFullList default NOTICE[2916:4]: Got complete list of servers containing 2 entries (version number 2)
1346966453.440012247 (6 duplicates of the following message were suppressed)
1346966453.440012247 src/ServerList.cc:115 in ServerList::applyFullList default NOTICE[2916:4]: Server List from coordinator is server {
services: 25
server_id: 1
service_locator: "infrc:host=192.168.1.106,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
server {
services: 25
server_id: 4
service_locator: "infrc:host=192.168.1.109,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
version_number: 2

<<< entire thing repeats forever>>>


coord: freaks out

1346966448.320090231 src/CoordinatorMain.cc:54 in main default NOTICE[9026:1]: Command line: ./obj.master/coordinator -C infrc:host=192.168.1.105,port=11114 -z 127.0.0.1:61023 -l DEBUG
1346966448.321387909 src/InfRcTransport.cc:216 in InfRcTransport default NOTICE[9026:1]: InfRc listening on UDP: 192.168.1.105:11114
1346966448.433162412 src/CoordinatorMain.cc:64 in main default NOTICE[9026:1]: coordinator: Listening on infrc:host=192.168.1.105,port=11114
1346966448.433280776 src/CoordinatorService.cc:49 in CoordinatorService default NOTICE[9026:1]: Connecting to LogCabin cluster at 127.0.0.1:61023
1346966448.434636096 src/CoordinatorService.cc:54 in CoordinatorService default NOTICE[9026:1]: Connected to LogCabin cluster.
1346966448.435483597 src/CoordinatorServiceRecovery.cc:53 in CoordinatorServiceRecovery::replay default DEBUG[9026:1]: 1 - ServerEnlisted

libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "RAMCloud.ProtoBuf.EntryType" because it is missing required fields: entry_type
1346966448.435661460 src/CoordinatorServiceRecovery.cc:53 in CoordinatorServiceRecovery::replay default DEBUG[9026:1]: 7 -

1346966448.435703584 src/CoordinatorServiceRecovery.cc:53 in CoordinatorServiceRecovery::replay default DEBUG[9026:1]: 9 - ServerEnlisted

1346966448.436312693 src/Dispatch.cc:132 in Dispatch::poll default NOTICE[9026:1]: Long gap in dispatcher: 116.7 ms
1346966448.439794775 src/Infiniband.cc:711 in RealInfiniband::QueuePair::plumb default DEBUG[9026:2]: infiniband qp plumbed: lid 0xb, qpn 0x48004a, psn 0x5e999b, ibPhysicalPort 1 to remote lid 0x28, remote qpn 0x48004c, remote psn 0x857ce
1346966448.439897229 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending GET_SERVER_ID request to infrc:host=192.168.1.106,port=11113 with 4 bytes
1346966448.440025683 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 80 bytes
1346966448.440171963 src/ServerListUpdater.cc:330 in ServerListUpdater::sendMembershipUpdate default NOTICE[9026:2]: Server 1.0 had lost an update. Sending whole list.
1346966448.440217983 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 143 bytes
1346966449.539538376 src/Infiniband.cc:711 in RealInfiniband::QueuePair::plumb default DEBUG[9026:1]: infiniband qp plumbed: lid 0xb, qpn 0x48004d, psn 0x203853, ibPhysicalPort 1 to remote lid 0x28, remote qpn 0x4c005c, remote psn 0xd1699f
1346966449.540067091 src/ServerListUpdater.cc:262 in ServerListUpdater::handleRequest default DEBUG[9026:2]: Sending server list to server id 1.0 as requested
1346966453.658974955 (7 duplicates of the following message were suppressed)
1346966453.658974955 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 143 bytes
1346966454.860312295 (8 duplicates of the following message were suppressed)
1346966454.860312295 src/ServerListUpdater.cc:262 in ServerListUpdater::handleRequest default DEBUG[9026:2]: Sending server list to server id 1.0 as requested
1346966458.664567186 (7 duplicates of the following message were suppressed)
1346966458.664567186 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 143 bytes
1346966460.566663965 (8 duplicates of the following message were suppressed)
1346966460.566663965 src/ServerListUpdater.cc:262 in ServerListUpdater::handleRequest default DEBUG[9026:2]: Sending server list to server id 1.0 as requested


<<< last two lines repeat forever >>>


6.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07
start master: rc09 (gets entire server list, including rc07, which has status crashed)
crash coord: rc05
restart coord: rc05
start master: rc10

Reaction:

rc10 gets serverlist with following info:
server {
services: 25
server_id: 1
service_locator: "infrc:host=192.168.1.106,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
server {
services: 25
server_id: 2
service_locator: "infrc:host=192.168.1.110,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
server {
services: 25
server_id: 4
service_locator: "infrc:host=192.168.1.109,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
version_number: 3

rc06 and rc09 also freak out.

Note:
Server id 2 has been reused for the new server. This is fine. I should check the generation number (which should be incremented).
Server list version number is 3 (which is an older number if you consider the server list version numbers that the masters have already seen). This is to be expected since we're rebuilding server list from scratch on coordinator recovery, and not keeping around the old version number. After more work on serverlist (RAM-453, RAM-455, and possibly some more) this will hopefully not be an issue anymore.

So, real problem: server rc08 disappeared for no reason.


  • No labels