Coordinator

Important Problems from Design Meeting

  • Exactly how does a master select backups?
  • How do we make sure a master is dead (to prevent divergent views of tables)?
  • How does a new master know when it has received all segments (e.g. knows it has the head of its log)?
  • How does host-to-host authentication work
    • In particular, how does the application authenticate the Coordinator?

Further Things to Walkthrough in Design Meetings

  • How does one install a new machine?
  • How does one update RAMCloud software?
== Bootstrapping C Discovery ==

1) DNS
   - Slow
   + Preconfigured on hosts
   + Delegation
   + Provides a way to deal with Coordinator failure/unavailability

== Summary of Actions ==

-- Auth -------------------------------
App authentication              6, 7
App authorization               7
Machine authentication          8, 9
Machine authorization           9

-- Addressing -------------------------
Find M for Object Id            1
Lookup LMA to NMA               4

-- Backup/Recovery --------------------
List backup candidates          3
Confirm crashes                 4
Start M recovery                1, 4, 5
Notify masters of B crash       4
Create a fresh B instance       5

-- Performance ------------------------
Load Balancing                  1, 10
Stats, Metrics, Accting         10

-- Debugging/Auditing  ----------------
Logging                         11

== Summary of State ==

        Name                    Format                                  Size                    Churn                         Refs
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1       tabletMap               (workspace, table, start, end, LMA)     ~40b/tablet/master      Only on load balance          Per workspace, table, id miss
3       hostPlacementMap        (NHA, rack)                             ~16b/backup             2 entries/15m/10K machines    After ~n/k segs written/master
4       logicalHostMap          (LMA, NHA)                              ~16b/host               2 entries/15m/10K machines    Per host addressing miss (on Ms, As, or Bs)
5       hostFreeList            (NHA)                                   ~8b/spare host          ~0                            Every 15m
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
6       appAuthentication       (appId, secret, workspace)              ~24b/principal          ~0                            Once per A session
7*      appAuthorization        (token, workspace)                      ~16b/active session                                   Once per new A access to each M
8       machineAuthentication   Perhaps "certifcate"?                                                                         Every 15m (except during bootstrap)
9*      machineAuthorization    (token, role)                           ~16b/active host        2 entries/15m/10K machines    ~Twice per first time pairwise host interaction
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10      metrics                 ? (LHA, [int]) # dropped reqs/s etc.    ?                       ?
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
11      logs                    ?                                       Large                   High                          ~0

Data structures:
1    Find M for Object Id
          (workspace, table) -> [(start, LMA)]                  Hashtable with ordered lists as elements
     Start M recovery
          LMA -> (workspace, table, start, end)                 Hashtable
3    List Backup Candidates
          Return n and rotate them to the back of the list      List
4    Lookup LMA to NMA
     Confirm crashes
          LMA -> NHA                                            Hashtable
     Start M Recovery
          Remove entry in Hashtable, insert one
     Notify masters of B crash
          Needs list of all masters NMA, just walk buckets
5    Start M recovery                                           Hashset (insert/remove)
           Remove M from hostFreeList**
     Create a fresh B                           Accidentally conflated these free lists
            Remove B from hostFreeList**

== Persistence/Recovery Ideas for C ==

1) Replication
2) Disks, WAL/LFS
3) RAMCloud Tables
   - Serious circular dependencies and bootstrapping issues
   ? Recovery latency
   + It works, it's durable
   + Code reuse
   + Super-cool
4) Decentralized
5) Punt/High-availability/special hardware/VM replication

=== Replication ===

- Performance & Complexity (2PC?)

=== RAMCloud-based C recovery ===

Choices:
1) C uses Bs


2) C uses (local) M

NOTE: this approach presupposes that key-value store is good bet for
      this data; might not always be true (e.g. indexes)

PROBLEM: Authenticate hosts during recovery without state?
         Can use small key RSA.

As below, except where PROBLEM is We know that our CM crashed when the
previous C did.  Now - we need to just bootstrap the backup system
which already relies on broadcast.


start C
create empty tempLogicalHostMap
create tempHostFreeList and tempHostPlacementList from NHA range passed on the command-line
create empty tempTabletMap
broadcast to tempHostFreeList
// BOOTSTRAP and/or RECOVER
backupsOnline = 0
while backupsOnline < k {
  pop tempHostFreeList
  check tempHostPlacementList
  if new rack {
    start B
    add B to tempLogicalHostMap
    backupsOnline++
  } else {
    pushBack B tempHostFreeList
  }
}
start M on localhost
add (0, self network address) to tempLogicalHostMap
add (0, logicalHostMap, 0, inf, M) to tempTabletMap
notify M of change and tell it to recover
insert(0, logicalHostMap, (0, logicalHostMap, 0, inf M))
map insert(0, logicalHostMap) tempLogicalHostMap
map insert(0, freeList) tempHostFreeList

3) C is an A

PROBLEMS: Can't count on C for starting recovery, else if master with data fails we're done

Needs to deal with NHAs sometimes, which is different than a normal app

start C
create empty tempLogicalHostMap
create tempHostFreeList and tempHostPlacementList from NHA range passed on the command-line
create empty tempTabletMap
broadcast to tempHostFreeList
receive a response from one host {
  // RECOVER
  // PROBLEM - what if master went down in the meantime, neither can recover
} else {
  // BOOTSTRAP
  backupsOnline = 0
  while backupsOnline < k {
    pop tempHostFreeList
    check tempHostPlacementList
    if new rack {
      start B
      add B to tempLogicalHostMap
      backupsOnline++
    } else {
      pushBack B tempHostFreeList
    }
  }
  pop tempHostFreeList; start M; add M to tempLogicalHostMap
  add (0, logicalHostMap, 0, inf, M) to tempTabletMap; notify M of change
  insert(0, logicalHostMap, (0, logicalHostMap, 0, inf M))
  map insert(0, logicalHostMap) tempLogicalHostMap
  map insert(0, freeList) tempHostFreeList
}