Coordinator
== Important Problems from Design Meeting ==
- Exactly how does a master select backups?
- How do we make sure a master is dead (to prevent divergent views of tables)?
- How does a new master know when it has received all segments (e.g. knows it has the head of its log)?
- How does host-to-host authentication work?
- In particular, how does the application authenticate the Coordinator?
== Further Things to Walk Through in Design Meetings ==
- How does one install a new machine?
- How does one update RAMCloud software?
== Bootstrapping C Discovery ==

1) DNS
   - Slow
   + Preconfigured on hosts
   + Delegation
   + Provides a way to deal with Coordinator failure/unavailability

== Summary of Actions ==

(Numbers refer to the state items in the Summary of State below.)

-- Auth -------------------------------
App authentication          6, 7
App authorization           7
Machine authentication      8, 9
Machine authorization       9
-- Addressing -------------------------
Find M for Object Id        1
Lookup LMA to NMA           4
-- Backup/Recovery --------------------
List backup candidates      3
Confirm crashes             4
Start M recovery            1, 4, 5
Notify masters of B crash   4
Create a fresh B instance   5
-- Performance ------------------------
Load Balancing              1, 10
Stats, Metrics, Accounting  10
-- Debugging/Auditing -----------------
Logging                     11

== Summary of State ==

   Name                    Format                                Size                  Churn                        Refs
   ------------------------------------------------------------------------------------------------------------------------------------------------------
1  tabletMap               (workspace, table, start, end, LMA)   ~40b/tablet/master    Only on load balance         Per workspace, table, id miss
3  hostPlacementMap        (NHA, rack)                           ~16b/backup           2 entries/15m/10K machines   After ~n/k segs written/master
4  logicalHostMap          (LMA, NHA)                            ~16b/host             2 entries/15m/10K machines   Per host addressing miss (on Ms, As, or Bs)
5  hostFreeList            (NHA)                                 ~8b/spare host        ~0                           Every 15m
   ------------------------------------------------------------------------------------------------------------------------------------------------------
6  appAuthentication       (appId, secret, workspace)            ~24b/principal        ~0                           Once per A session
7* appAuthorization        (token, workspace)                    ~16b/active session                                Once per new A access to each M
8  machineAuthentication   Perhaps a "certificate"?                                                                 Every 15m (except during bootstrap)
9* machineAuthorization    (token, role)                         ~16b/active host      2 entries/15m/10K machines   ~Twice per first-time pairwise host interaction
   ------------------------------------------------------------------------------------------------------------------------------------------------------
10 metrics ?               (LHA, [int]), e.g. # dropped reqs/s   ?                     ?
   ------------------------------------------------------------------------------------------------------------------------------------------------------
11 logs ?                                                        Large                 High                         ~0

Data structures:

1 (tabletMap)
   - Find M for Object Id: (workspace, table) -> [(start, LMA)]; hashtable with ordered lists as elements (see the sketch below)
   - Start M recovery: LMA -> (workspace, table, start, end); hashtable
3 (hostPlacementMap)
   - List backup candidates: return n candidates and rotate them to the back of the list; list
4 (logicalHostMap)
   - Lookup LMA to NMA, Confirm crashes: LMA -> NHA; hashtable
   - Start M recovery: remove one entry from the hashtable, insert another
   - Notify masters of B crash: needs the list of all masters' NMAs; just walk the buckets
5 (hostFreeList)
   - Start M recovery: remove M from hostFreeList**; hashset (insert/remove)
   - Create a fresh B: remove B from hostFreeList**
   ** Accidentally conflated these free lists
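For concreteness, here is a rough C++ sketch of the item 1 structure: a hashtable keyed by (workspace, table) whose elements are ordered by tablet start key, so "Find M for Object Id" is a hash lookup followed by a range search. The names (TabletMap, TableKey, findMaster) are invented for this sketch and are not RAMCloud source; the tablet end key is left implicit (it is the next tablet's start).

    // Sketch only: the "hashtable with ordered lists as elements" used for
    // "Find M for Object Id". All names here are hypothetical.
    #include <cstdint>
    #include <functional>
    #include <iterator>
    #include <map>
    #include <optional>
    #include <unordered_map>

    typedef uint64_t LMA;              // logical master address

    struct TableKey {                  // (workspace, table)
        uint64_t workspace;
        uint64_t table;
        bool operator==(const TableKey& o) const {
            return workspace == o.workspace && table == o.table;
        }
    };

    struct TableKeyHash {
        size_t operator()(const TableKey& k) const {
            return std::hash<uint64_t>()(k.workspace) ^ (std::hash<uint64_t>()(k.table) << 1);
        }
    };

    class TabletMap {
      public:
        // Record that the tablet of (workspace, table) starting at `start` is
        // served by master `lma`; the end key is implicit in this sketch.
        void addTablet(TableKey key, uint64_t start, LMA lma) {
            tablets[key][start] = lma;
        }

        // "Find M for Object Id": hash on (workspace, table), then find the
        // tablet with the largest start <= objectId.
        std::optional<LMA> findMaster(TableKey key, uint64_t objectId) const {
            auto it = tablets.find(key);
            if (it == tablets.end())
                return std::nullopt;                     // unknown table
            auto t = it->second.upper_bound(objectId);
            if (t == it->second.begin())
                return std::nullopt;                     // no tablet covers objectId
            return std::prev(t)->second;
        }

      private:
        // Hashtable whose elements are ordered by tablet start key.
        std::unordered_map<TableKey, std::map<uint64_t, LMA>, TableKeyHash> tablets;
    };

A miss (no entry for the table, or no tablet covering the id) returns nothing, which corresponds to the "per workspace, table, id miss" reference pattern in the state table above.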
== Persistence/Recovery Ideas for C ==

1) Replication
2) Disks, WAL/LFS
3) RAMCloud Tables
   - Serious circular dependencies and bootstrapping issues
   ? Recovery latency
   + It works, it's durable
   + Code reuse
   + Super-cool
4) Decentralized
5) Punt/high availability/special hardware/VM replication

=== Replication ===
- Performance & complexity (2PC?)

=== RAMCloud-based C recovery ===

Choices:

1) C uses Bs

2) C uses (local) M

NOTE: this approach presupposes that a key-value store is a good bet for this data; that might not always be true (e.g. indexes).
PROBLEM: how do we authenticate hosts during recovery without state? We can use small-key RSA.

As below, except that here the PROBLEM is: we know that our C's M crashed when the previous C did. Now we just need to bootstrap the backup system, which already relies on broadcast.

    start C
        create empty tempLogicalHostMap
        create tempHostFreeList and tempHostPlacementList from the NHA range passed on the command line
        create empty tempTabletMap
        broadcast to tempHostFreeList
        // BOOTSTRAP and/or RECOVER
        backupsOnline = 0
        while backupsOnline < k {
            pop tempHostFreeList
            check tempHostPlacementList
            if new rack {
                start B
                add B to tempLogicalHostMap
                backupsOnline++
            } else {
                pushBack B tempHostFreeList
            }
        }
        start M on localhost
        add (0, self network address) to tempLogicalHostMap
        add (0, logicalHostMap, 0, inf, M) to tempTabletMap
        notify M of the change and tell it to recover
        insert(0, logicalHostMap, (0, logicalHostMap, 0, inf, M))    map
        insert(0, logicalHostMap)                                    tempLogicalHostMap map
        insert(0, freeList)                                          tempHostFreeList

3) C is an A

PROBLEMS:
- Can't count on C for starting recovery; otherwise, if the master holding its data fails, we're done.
- Needs to deal with NHAs sometimes, which is different from a normal app.

    start C
        create empty tempLogicalHostMap
        create tempHostFreeList and tempHostPlacementList from the NHA range passed on the command line
        create empty tempTabletMap
        broadcast to tempHostFreeList
        receive a response from one host {
            // RECOVER
            // PROBLEM - what if the master went down in the meantime? Then neither can recover.
        } else {
            // BOOTSTRAP
            backupsOnline = 0
            while backupsOnline < k {
                pop tempHostFreeList
                check tempHostPlacementList
                if new rack {
                    start B
                    add B to tempLogicalHostMap
                    backupsOnline++
                } else {
                    pushBack B tempHostFreeList
                }
            }
            pop tempHostFreeList; start M; add M to tempLogicalHostMap
            add (0, logicalHostMap, 0, inf, M) to tempTabletMap; notify M of the change
            insert(0, logicalHostMap, (0, logicalHostMap, 0, inf, M))    map
            insert(0, logicalHostMap)                                    tempLogicalHostMap map
            insert(0, freeList)                                          tempHostFreeList
        }
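Both bootstrap procedures above select the first k backups with a rack-aware scan of the free list. A rough C++ rendering of that loop, with hypothetical types (the free list as a deque of NHAs, the placement map as NHA -> rack id) and an added iteration bound so it terminates even when fewer than k distinct racks are available:

    // Sketch only: the "while backupsOnline < k" loop from the pseudocode
    // above. Types and the function name are hypothetical.
    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    typedef uint64_t NHA;              // network host address

    // Pick up to k backup hosts, one per rack; hosts whose rack is already
    // covered are rotated to the back of the free list and kept as spares.
    std::vector<NHA> pickBackups(std::deque<NHA>& hostFreeList,
                                 const std::unordered_map<NHA, int>& hostPlacementMap,
                                 size_t k)
    {
        std::vector<NHA> backups;
        std::unordered_set<int> usedRacks;
        size_t attempts = hostFreeList.size();      // look at each host at most once

        while (backups.size() < k && !hostFreeList.empty() && attempts-- > 0) {
            NHA host = hostFreeList.front();        // pop tempHostFreeList
            hostFreeList.pop_front();

            // check tempHostPlacementList (assumes every free host has a rack entry)
            int rack = hostPlacementMap.at(host);
            if (usedRacks.insert(rack).second) {
                backups.push_back(host);            // new rack: caller starts a B here
            } else {
                hostFreeList.push_back(host);       // pushBack B tempHostFreeList
            }
        }
        return backups;                             // may be < k if racks run out
    }

Returning fewer than k backups rather than spinning forever is an added safeguard, not something the pseudocode above specifies.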