Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 6 Next »

1.

start coord
start master m1
start master m2
crash master m2

Reaction:
Tries to recover the backup on m2 even though I started the master with explicitly no backups (options: -M r=0)

Ryan's suggestion for a quick hack fix:
In BackupFailureMonitor.cc - in constructor for BackupFailureMonitor, change the initializer tracker(context, this) to tracker(context).

That worked for now.

 


 

2.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07

Reaction:

rc08:
1346965811.146929195 src/FailureDetector.cc:160 in FailureDetector::pingRandomServer default WARNING[26984:2]: Ping timeout to server id 2.0 (locator "infrc:host=192.168.1.107,port=11113")
1346965811.397751433 src/MembershipService.cc:125 in MembershipService::updateServerList default NOTICE[26984:3]: Got server list update (version number 4)
1346965811.397778293 src/ServerList.cc:202 in ServerList::applyUpdate default NOTICE[26984:3]: Got server list update (version number 4)
1346965811.397797167 src/ServerList.cc:229 in ServerList::applyUpdate default NOTICE[26984:3]: Marking server id 2.0 as crashed


rc06:
1346965811.150152462 src/FailureDetector.cc:160 in FailureDetector::pingRandomServer default WARNING[2916:2]: Ping timeout to server id 2.0 (locator "infrc:host=192.168.1.107,port=11113")
1346965811.397608803 src/MembershipService.cc:125 in MembershipService::updateServerList default NOTICE[2916:3]: Got server list update (version number 4)
1346965811.397636765 src/ServerList.cc:202 in ServerList::applyUpdate default NOTICE[2916:3]: Got server list update (version number 4)
1346965811.397656711 src/ServerList.cc:229 in ServerList::applyUpdate default NOTICE[2916:3]: Marking server id 2.0 as crashed
1346965811.900134258 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[2916:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965811.900160928 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[2916:1]: Infiniband aborting PING request to infrc:host=192.168.1.107,port=11113
1346965816.905120233 (1000 duplicates of the following message were suppressed)
1346965816.905120233 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[2916:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965821.910120108 (1000 duplicates of the following message were suppressed)
1346965821.910120108 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[2916:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>


rc05 (coord):
1346965811.147348807 src/CoordinatorServerManager.cc:335 in CoordinatorServerManager::hintServerDown default NOTICE[8989:2]: Checking server id 2.0 (infrc:host=192.168.1.107,port=11113)
1346965811.147401900 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:2]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346965811.397395327 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:1]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346965811.397429587 src/CoordinatorServerManager.cc:611 in CoordinatorServerManager::verifyServerFailure default NOTICE[8989:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.107,port=11113")
1346965811.397454893 src/CoordinatorServerManager.cc:340 in CoordinatorServerManager::hintServerDown default NOTICE[8989:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1346965811.397619774 src/CoordinatorServerManager.cc:408 in CoordinatorServerManager::ServerDown::execute default DEBUG[8989:2]: LogCabin: StateServerDown entryId: 6
1346965811.397665957 src/MasterRecoveryManager.cc:339 in MasterRecoveryManager::startMasterRecovery default NOTICE[8989:2]: Server 2.0 crashed, but it had no tablets
1346965811.397899237 src/CoordinatorServerManager.cc:330 in CoordinatorServerManager::hintServerDown default NOTICE[8989:2]: Spurious crash report on unknown server id 2.0
1346965811.397955087 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.108,port=11113 with 80 bytes
1346965811.398121531 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[8989:3]: Server list update sent to server 3.0
1346965811.902357437 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[8989:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965811.902389656 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[8989:1]: Infiniband aborting PING request to infrc:host=192.168.1.107,port=11113
1346965815.524228181 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[8989:1]: Transmit failed for buffer 39173256: RETRY_EXC_ERR
1346965816.147348807 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 80 bytes
1346965816.147348807 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[8989:3]: Server list update sent to server 1.0
1346965816.907036043 (1000 duplicates of the following message were suppressed)
1346965816.907036043 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[8989:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965821.911712711 (1000 duplicates of the following message were suppressed)
1346965821.911712711 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[8989:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>

 


3.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07

Reaction:

rc08: nothing bad

1346984637.572391059 src/FailureDetector.cc:160 in FailureDetector::pingRandomServer default WARNING[31253:2]: Ping timeout to server id 2.0 (locator "infrc:host=192.168.1.107,port=11113")
1346984637.606243205 src/MembershipService.cc:125 in MembershipService::updateServerList default NOTICE[31253:4]: Got server list update (version number 4)
1346984637.606272072 src/ServerList.cc:202 in ServerList::applyUpdate default NOTICE[31253:4]: Got server list update (version number 4)
1346984637.606292226 src/ServerList.cc:229 in ServerList::applyUpdate default NOTICE[31253:4]: Marking server id 2.0 as crashed

rc06: nothing bad

1346984637.572391059 src/FailureDetector.cc:160 in FailureDetector::pingRandomServer default WARNING[31253:2]: Ping timeout to server id 2.0 (locator "infrc:host=192.168.1.107,port=11113")
1346984637.606243205 src/MembershipService.cc:125 in MembershipService::updateServerList default NOTICE[31253:4]: Got server list update (version number 4)
1346984637.606272072 src/ServerList.cc:202 in ServerList::applyUpdate default NOTICE[31253:4]: Got server list update (version number 4)
1346984637.606292226 src/ServerList.cc:229 in ServerList::applyUpdate default NOTICE[31253:4]: Marking server id 2.0 as crashed

coord: freaks out

1346984637.355197986 src/CoordinatorServerManager.cc:335 in CoordinatorServerManager::hintServerDown default NOTICE[16448:2]: Checking server id 2.0 (infrc:host=192.168.1.107,port=11113)
1346984637.355266589 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[16448:2]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346984637.605262131 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[16448:1]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346984637.605295631 src/CoordinatorServerManager.cc:611 in CoordinatorServerManager::verifyServerFailure default NOTICE[16448:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.107,port=11113")
1346984637.605320810 src/CoordinatorServerManager.cc:340 in CoordinatorServerManager::hintServerDown default NOTICE[16448:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1346984637.605706128 src/CoordinatorServerManager.cc:408 in CoordinatorServerManager::ServerDown::execute default DEBUG[16448:2]: LogCabin: StateServerDown entryId: 6
1346984637.605752629 src/MasterRecoveryManager.cc:339 in MasterRecoveryManager::startMasterRecovery default NOTICE[16448:2]: Server 2.0 crashed, but it had no tablets
1346984637.605827124 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[16448:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 80 bytes
1346984637.606006619 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[16448:3]: Server list update sent to server 1.0
1346984637.606036795 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[16448:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.108,port=11113 with 80 bytes
1346984637.606072872 src/CoordinatorServerManager.cc:330 in CoordinatorServerManager::hintServerDown default NOTICE[16448:2]: Spurious crash report on unknown server id 2.0
1346984637.606230215 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[16448:3]: Server list update sent to server 3.0
1346984638.110239099 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[16448:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346984638.110263925 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[16448:1]: Infiniband aborting PING request to infrc:host=192.168.1.107,port=11113
1346984642.606072872 src/CoordinatorServerManager.cc:330 in CoordinatorServerManager::hintServerDown default NOTICE[16448:2]: Spurious crash report on unknown server id 2.0
1346984643.115031869 (1000 duplicates of the following message were suppressed)
1346984643.115031869 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[16448:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346984648.119826776 (1000 duplicates of the following message were suppressed)
1346984648.119826776 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[16448:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session

<<< last two lines repeat forever >>>

 


4.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07
start master: rc09

weird thing:

rc09 gets entire server list, including rc07, which has status crashed. why isn't this server entirely gone?

 


5.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07
start master: rc09 (gets entire server list, including rc07, which has status crashed)
crash coord: rc05
restart coord: rc05

Reaction:

rc08, rc09: nothing

rc06: freaks out

1346966448.440068376 src/ServerList.cc:316 in ServerList::add default WARNING[2916:4]: Duplicate add of ServerId 1.0!
1346966448.440084603 src/ServerList.cc:316 in ServerList::add default WARNING[2916:4]: Duplicate add of ServerId 4.0!
1346966449.033342651 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500809 us.
1346966449.538371007 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[2916:1]: Server at infrc:host=192.168.1.105,port=11114 is not responding, aborting session
1346966449.538404924 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[2916:1]: Infiniband aborting SEND_SERVER_LIST request to infrc:host=192.168.1.105,port=11114
1346966449.538426687 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[2916:1]: Infiniband aborting PING request to infrc:host=192.168.1.105,port=11114
1346966450.443647526 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500623 us.
1346966450.944640867 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500822 us.
1346966451.545478201 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500537 us.
1346966452.046287039 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500646 us.
1346966452.647144517 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500579 us.
1346966453.148023608 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500783 us.
1346966453.296479628 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[2916:1]: Transmit failed for buffer 16017008: RETRY_EXC_ERR
1346966453.296504019 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[2916:1]: Transmit failed for buffer 16017032: WR_FLUSH_ERR
1346966453.439968920 (6 duplicates of the following message were suppressed)
1346966453.439968920 src/ServerList.cc:112 in ServerList::applyFullList default NOTICE[2916:4]: Got complete list of servers containing 2 entries (version number 2)
1346966453.440012247 (6 duplicates of the following message were suppressed)
1346966453.440012247 src/ServerList.cc:115 in ServerList::applyFullList default NOTICE[2916:4]: Server List from coordinator is server {
services: 25
server_id: 1
service_locator: "infrc:host=192.168.1.106,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
server {
services: 25
server_id: 4
service_locator: "infrc:host=192.168.1.109,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
version_number: 2

<<< entire thing repeats forever>>>


coord: freaks out

1346966448.320090231 src/CoordinatorMain.cc:54 in main default NOTICE[9026:1]: Command line: ./obj.master/coordinator -C infrc:host=192.168.1.105,port=11114 -z 127.0.0.1:61023 -l DEBUG
1346966448.321387909 src/InfRcTransport.cc:216 in InfRcTransport default NOTICE[9026:1]: InfRc listening on UDP: 192.168.1.105:11114
1346966448.433162412 src/CoordinatorMain.cc:64 in main default NOTICE[9026:1]: coordinator: Listening on infrc:host=192.168.1.105,port=11114
1346966448.433280776 src/CoordinatorService.cc:49 in CoordinatorService default NOTICE[9026:1]: Connecting to LogCabin cluster at 127.0.0.1:61023
1346966448.434636096 src/CoordinatorService.cc:54 in CoordinatorService default NOTICE[9026:1]: Connected to LogCabin cluster.
1346966448.435483597 src/CoordinatorServiceRecovery.cc:53 in CoordinatorServiceRecovery::replay default DEBUG[9026:1]: 1 - ServerEnlisted

libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "RAMCloud.ProtoBuf.EntryType" because it is missing required fields: entry_type
1346966448.435661460 src/CoordinatorServiceRecovery.cc:53 in CoordinatorServiceRecovery::replay default DEBUG[9026:1]: 7 -

1346966448.435703584 src/CoordinatorServiceRecovery.cc:53 in CoordinatorServiceRecovery::replay default DEBUG[9026:1]: 9 - ServerEnlisted

1346966448.436312693 src/Dispatch.cc:132 in Dispatch::poll default NOTICE[9026:1]: Long gap in dispatcher: 116.7 ms
1346966448.439794775 src/Infiniband.cc:711 in RealInfiniband::QueuePair::plumb default DEBUG[9026:2]: infiniband qp plumbed: lid 0xb, qpn 0x48004a, psn 0x5e999b, ibPhysicalPort 1 to remote lid 0x28, remote qpn 0x48004c, remote psn 0x857ce
1346966448.439897229 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending GET_SERVER_ID request to infrc:host=192.168.1.106,port=11113 with 4 bytes
1346966448.440025683 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 80 bytes
1346966448.440171963 src/ServerListUpdater.cc:330 in ServerListUpdater::sendMembershipUpdate default NOTICE[9026:2]: Server 1.0 had lost an update. Sending whole list.
1346966448.440217983 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 143 bytes
1346966449.539538376 src/Infiniband.cc:711 in RealInfiniband::QueuePair::plumb default DEBUG[9026:1]: infiniband qp plumbed: lid 0xb, qpn 0x48004d, psn 0x203853, ibPhysicalPort 1 to remote lid 0x28, remote qpn 0x4c005c, remote psn 0xd1699f
1346966449.540067091 src/ServerListUpdater.cc:262 in ServerListUpdater::handleRequest default DEBUG[9026:2]: Sending server list to server id 1.0 as requested
1346966453.658974955 (7 duplicates of the following message were suppressed)
1346966453.658974955 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 143 bytes
1346966454.860312295 (8 duplicates of the following message were suppressed)
1346966454.860312295 src/ServerListUpdater.cc:262 in ServerListUpdater::handleRequest default DEBUG[9026:2]: Sending server list to server id 1.0 as requested
1346966458.664567186 (7 duplicates of the following message were suppressed)
1346966458.664567186 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 143 bytes
1346966460.566663965 (8 duplicates of the following message were suppressed)
1346966460.566663965 src/ServerListUpdater.cc:262 in ServerListUpdater::handleRequest default DEBUG[9026:2]: Sending server list to server id 1.0 as requested


<<< last two lines repeat forever >>>


6.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07
start master: rc09 (gets entire server list, including rc07, which has status crashed)
crash coord: rc05
restart coord: rc05
start master: rc10

Reaction: Weird stuff happens.

rc10 gets serverlist with following (messed up) info:
server {
services: 25
server_id: 1
service_locator: "infrc:host=192.168.1.106,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
server {
services: 25
server_id: 2
service_locator: "infrc:host=192.168.1.110,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
server {
services: 25
server_id: 4
service_locator: "infrc:host=192.168.1.109,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
version_number: 3

i.e., the server id assignments are all wrong, rc08 is gone for no reason, version number is old

rc06 and rc09 also get weird info.

 


  • No labels