Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 3 Next »

 

1.

start coord
start master m1
start master m2
crash master m2

Reaction:
Tries to recover the backup on m2 even though I started the master with explicitly no backups (options: -M r=0)

Ryan's suggestion for a quick hack fix:
In BackupFailureMonitor.cc - in constructor for BackupFailureMonitor, change the initializer tracker(context, this) to tracker(context).

That worked for now.

 


 

2.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07

Reaction:

rc08:
1346965811.146929195 src/FailureDetector.cc:160 in FailureDetector::pingRandomServer default WARNING[26984:2]: Ping timeout to server id 2.0 (locator "infrc:host=192.168.1.107,port=11113")
1346965811.397751433 src/MembershipService.cc:125 in MembershipService::updateServerList default NOTICE[26984:3]: Got server list update (version number 4)
1346965811.397778293 src/ServerList.cc:202 in ServerList::applyUpdate default NOTICE[26984:3]: Got server list update (version number 4)
1346965811.397797167 src/ServerList.cc:229 in ServerList::applyUpdate default NOTICE[26984:3]: Marking server id 2.0 as crashed


rc06:
1346965811.150152462 src/FailureDetector.cc:160 in FailureDetector::pingRandomServer default WARNING[2916:2]: Ping timeout to server id 2.0 (locator "infrc:host=192.168.1.107,port=11113")
1346965811.397608803 src/MembershipService.cc:125 in MembershipService::updateServerList default NOTICE[2916:3]: Got server list update (version number 4)
1346965811.397636765 src/ServerList.cc:202 in ServerList::applyUpdate default NOTICE[2916:3]: Got server list update (version number 4)
1346965811.397656711 src/ServerList.cc:229 in ServerList::applyUpdate default NOTICE[2916:3]: Marking server id 2.0 as crashed
1346965811.900134258 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[2916:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965811.900160928 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[2916:1]: Infiniband aborting PING request to infrc:host=192.168.1.107,port=11113
1346965816.905120233 (1000 duplicates of the following message were suppressed)
1346965816.905120233 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[2916:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965821.910120108 (1000 duplicates of the following message were suppressed)
1346965821.910120108 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[2916:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>


rc05 (coord):
1346965811.147348807 src/CoordinatorServerManager.cc:335 in CoordinatorServerManager::hintServerDown default NOTICE[8989:2]: Checking server id 2.0 (infrc:host=192.168.1.107,port=11113)
1346965811.147401900 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:2]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346965811.397395327 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:1]: Sending PING request to infrc:host=192.168.1.107,port=11113 with 12 bytes
1346965811.397429587 src/CoordinatorServerManager.cc:611 in CoordinatorServerManager::verifyServerFailure default NOTICE[8989:2]: Verified host failure: id 2.0 ("infrc:host=192.168.1.107,port=11113")
1346965811.397454893 src/CoordinatorServerManager.cc:340 in CoordinatorServerManager::hintServerDown default NOTICE[8989:2]: Server id 2.0 has crashed, notifying the cluster and starting recovery
1346965811.397619774 src/CoordinatorServerManager.cc:408 in CoordinatorServerManager::ServerDown::execute default DEBUG[8989:2]: LogCabin: StateServerDown entryId: 6
1346965811.397665957 src/MasterRecoveryManager.cc:339 in MasterRecoveryManager::startMasterRecovery default NOTICE[8989:2]: Server 2.0 crashed, but it had no tablets
1346965811.397899237 src/CoordinatorServerManager.cc:330 in CoordinatorServerManager::hintServerDown default NOTICE[8989:2]: Spurious crash report on unknown server id 2.0
1346965811.397955087 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.108,port=11113 with 80 bytes
1346965811.398121531 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[8989:3]: Server list update sent to server 3.0
1346965811.902357437 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[8989:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965811.902389656 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[8989:1]: Infiniband aborting PING request to infrc:host=192.168.1.107,port=11113
1346965815.524228181 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[8989:1]: Transmit failed for buffer 39173256: RETRY_EXC_ERR
1346965816.147348807 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[8989:3]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 80 bytes
1346965816.147348807 src/ServerListUpdater.cc:323 in ServerListUpdater::sendMembershipUpdate default DEBUG[8989:3]: Server list update sent to server 1.0
1346965816.907036043 (1000 duplicates of the following message were suppressed)
1346965816.907036043 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[8989:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session
1346965821.911712711 (1000 duplicates of the following message were suppressed)
1346965821.911712711 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[8989:1]: Server at infrc:host=192.168.1.107,port=11113 is not responding, aborting session

<<< last two lines continue forever >>>

 


 

3.

start coord: rc05
start master: rc06
start master: rc07
start master: rc08
crash master: rc07
start master: rc08 (gets entire server list, including rc07, which has status crashed)
crash coord: rc05
restart coord: rc05

Reaction:

rc08: nothing

rc06: freaks out

1346966448.440068376 src/ServerList.cc:316 in ServerList::add default WARNING[2916:4]: Duplicate add of ServerId 1.0!
1346966448.440084603 src/ServerList.cc:316 in ServerList::add default WARNING[2916:4]: Duplicate add of ServerId 4.0!
1346966449.033342651 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500809 us.
1346966449.538371007 src/SessionAlarm.cc:191 in SessionAlarmTimer::handleTimerEvent default WARNING[2916:1]: Server at infrc:host=192.168.1.105,port=11114 is not responding, aborting session
1346966449.538404924 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[2916:1]: Infiniband aborting SEND_SERVER_LIST request to infrc:host=192.168.1.105,port=11114
1346966449.538426687 src/InfRcTransport.cc:374 in InfRcTransport<Infiniband>::InfRcSession::abort default NOTICE[2916:1]: Infiniband aborting PING request to infrc:host=192.168.1.105,port=11114
1346966450.443647526 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500623 us.
1346966450.944640867 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500822 us.
1346966451.545478201 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500537 us.
1346966452.046287039 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500646 us.
1346966452.647144517 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500579 us.
1346966453.148023608 src/FailureDetector.cc:238 in FailureDetector::checkForStaleServerList default WARNING[2916:2]: Stale server list detected (have 2, saw 2). Requesting new list push! Timeout after 500783 us.
1346966453.296479628 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[2916:1]: Transmit failed for buffer 16017008: RETRY_EXC_ERR
1346966453.296504019 src/InfRcTransport.cc:771 in InfRcTransport<Infiniband>::reapTxBuffers default ERROR[2916:1]: Transmit failed for buffer 16017032: WR_FLUSH_ERR
1346966453.439968920 (6 duplicates of the following message were suppressed)
1346966453.439968920 src/ServerList.cc:112 in ServerList::applyFullList default NOTICE[2916:4]: Got complete list of servers containing 2 entries (version number 2)
1346966453.440012247 (6 duplicates of the following message were suppressed)
1346966453.440012247 src/ServerList.cc:115 in ServerList::applyFullList default NOTICE[2916:4]: Server List from coordinator is server {
services: 25
server_id: 1
service_locator: "infrc:host=192.168.1.106,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
server {
services: 25
server_id: 4
service_locator: "infrc:host=192.168.1.109,port=11113"
expected_read_mbytes_per_sec: 0
status: 0
}
version_number: 2

<<< entire thing repeats forever>>>


coord: freaks out

1346966448.320090231 src/CoordinatorMain.cc:54 in main default NOTICE[9026:1]: Command line: ./obj.master/coordinator -C infrc:host=192.168.1.105,port=11114 -z 127.0.0.1:61023 -l DEBUG
1346966448.321387909 src/InfRcTransport.cc:216 in InfRcTransport default NOTICE[9026:1]: InfRc listening on UDP: 192.168.1.105:11114
1346966448.433162412 src/CoordinatorMain.cc:64 in main default NOTICE[9026:1]: coordinator: Listening on infrc:host=192.168.1.105,port=11114
1346966448.433280776 src/CoordinatorService.cc:49 in CoordinatorService default NOTICE[9026:1]: Connecting to LogCabin cluster at 127.0.0.1:61023
1346966448.434636096 src/CoordinatorService.cc:54 in CoordinatorService default NOTICE[9026:1]: Connected to LogCabin cluster.
1346966448.435223246 src/LogCabinHelper.cc:122 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Want to erase entry with id 0
1346966448.435254123 src/LogCabinHelper.cc:122 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Want to erase entry with id 2
1346966448.435271857 src/LogCabinHelper.cc:122 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Want to erase entry with id 4
1346966448.435288796 src/LogCabinHelper.cc:122 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Want to erase entry with id 3
1346966448.435304107 src/LogCabinHelper.cc:122 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Want to erase entry with id 6
1346966448.435339503 src/LogCabinHelper.cc:122 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Want to erase entry with id 8
1346966448.435355896 src/LogCabinHelper.cc:136 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Erasing entry with id 8
1346966448.435375822 src/LogCabinHelper.cc:136 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Erasing entry with id 6
1346966448.435391645 src/LogCabinHelper.cc:136 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Erasing entry with id 4
1346966448.435407257 src/LogCabinHelper.cc:136 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Erasing entry with id 3
1346966448.435435164 src/LogCabinHelper.cc:136 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Erasing entry with id 2
1346966448.435450474 src/LogCabinHelper.cc:136 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Erasing entry with id 0
1346966448.435483597 src/CoordinatorServiceRecovery.cc:53 in CoordinatorServiceRecovery::replay default DEBUG[9026:1]: 1 - ServerEnlisted

libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "RAMCloud.ProtoBuf.EntryType" because it is missing required fields: entry_type
1346966448.435661460 src/CoordinatorServiceRecovery.cc:53 in CoordinatorServiceRecovery::replay default DEBUG[9026:1]: 7 -

1346966448.435703584 src/CoordinatorServiceRecovery.cc:53 in CoordinatorServiceRecovery::replay default DEBUG[9026:1]: 9 - ServerEnlisted

1346966448.436312693 src/Dispatch.cc:132 in Dispatch::poll default NOTICE[9026:1]: Long gap in dispatcher: 116.7 ms
1346966448.439794775 src/Infiniband.cc:711 in RealInfiniband::QueuePair::plumb default DEBUG[9026:2]: infiniband qp plumbed: lid 0xb, qpn 0x48004a, psn 0x5e999b, ibPhysicalPort 1 to remote lid 0x28, remote qpn 0x48004c, remote psn 0x857ce
1346966448.439897229 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending GET_SERVER_ID request to infrc:host=192.168.1.106,port=11113 with 4 bytes
1346966448.440025683 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending UPDATE_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 80 bytes
1346966448.440171963 src/ServerListUpdater.cc:330 in ServerListUpdater::sendMembershipUpdate default NOTICE[9026:2]: Server 1.0 had lost an update. Sending whole list.
1346966448.440217983 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 143 bytes
1346966449.539538376 src/Infiniband.cc:711 in RealInfiniband::QueuePair::plumb default DEBUG[9026:1]: infiniband qp plumbed: lid 0xb, qpn 0x48004d, psn 0x203853, ibPhysicalPort 1 to remote lid 0x28, remote qpn 0x4c005c, remote psn 0xd1699f
1346966449.540067091 src/ServerListUpdater.cc:262 in ServerListUpdater::handleRequest default DEBUG[9026:2]: Sending server list to server id 1.0 as requested
1346966453.435407257 src/LogCabinHelper.cc:136 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Erasing entry with id 3
1346966453.435288796 src/LogCabinHelper.cc:122 in LogCabinHelper::readValidEntries default DEBUG[9026:1]: Want to erase entry with id 3
1346966453.658974955 (7 duplicates of the following message were suppressed)
1346966453.658974955 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 143 bytes
1346966454.860312295 (8 duplicates of the following message were suppressed)
1346966454.860312295 src/ServerListUpdater.cc:262 in ServerListUpdater::handleRequest default DEBUG[9026:2]: Sending server list to server id 1.0 as requested
1346966458.664567186 (7 duplicates of the following message were suppressed)
1346966458.664567186 src/InfRcTransport.cc:430 in InfRcTransport<Infiniband>::InfRcSession::sendRequest default DEBUG[9026:2]: Sending SET_SERVER_LIST request to infrc:host=192.168.1.106,port=11113 with 143 bytes
1346966460.566663965 (8 duplicates of the following message were suppressed)
1346966460.566663965 src/ServerListUpdater.cc:262 in ServerListUpdater::handleRequest default DEBUG[9026:2]: Sending server list to server id 1.0 as requested


<<< last two lines repeat forever >>>

 


  • No labels