Cluster Management Scribe

Why not Zookeeper?
What's unique about RC that we have to solve this from scratch?

Mogul: P2P approach for load-balancing?

Shel: Something less extreme than one coord or making every node a coord: a small number of coords
Hybrid approach or decentralized

Shel: Suppose 100 key ranges, 100 hash buckets; why is that worse for indexes?
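A minimal sketch of one possible reading of Shel's question, assuming the concern is range scans over an index. The notes do not record an answer, so this is an illustration rather than the conclusion of the discussion. The integer key space, the partition boundaries, and the names `range_partition`, `hash_partition`, and `partitions_for_scan` are all hypothetical.

```python
import bisect
import hashlib

NUM_PARTITIONS = 100
KEY_SPACE = 10_000  # hypothetical integer key space [0, KEY_SPACE)

# Range partitioning: partition i owns keys [i * 100, (i + 1) * 100).
RANGE_STARTS = [i * (KEY_SPACE // NUM_PARTITIONS) for i in range(NUM_PARTITIONS)]


def range_partition(key):
    """Index of the key range that contains `key`."""
    return bisect.bisect_right(RANGE_STARTS, key) - 1


def hash_partition(key):
    """Index of the hash bucket for `key` (SHA-256 used only for determinism)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def partitions_for_scan(lo, hi, partition_of):
    """Set of partitions holding at least one key in [lo, hi)."""
    return {partition_of(k) for k in range(lo, hi)}


# A scan over 250 adjacent keys.
range_hits = partitions_for_scan(1000, 1250, range_partition)
hash_hits = partitions_for_scan(1000, 1250, hash_partition)
print(len(range_hits), "key ranges touched vs", len(hash_hits), "hash buckets touched")
```

With key ranges the scan stays on a few adjacent partitions; with hash buckets adjacent keys scatter, and a real scan that does not know which keys exist would have to ask every bucket. For point lookups the two schemes cost the same, which may be the point of the question.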

Why do we need next id/auto incr id if using a hash function anyway?

John: What is the 10th-percentile table size, i.e. how small are the smallest tables?
Franklin: a lot of small tables
Shel: tiny metadata tables; supplier codes table, for example

NetApp: Worst case for reqs/sec to coord?
During crash recovery, 10K app nodes all need the tablet map and pull 10M rows

Armbrust + Keith Adams: Use Zookeeper
6,000 writes/sec, hundreds of thousands of reads/sec

Aguilera: Use RC table 0 as coordinator.

Keith Adams: DNS is reasonable punt for coord discovery
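A minimal sketch of what DNS-based coordinator discovery could look like from a client, assuming the coordinators are published as address records under one well-known name. The function name `discover_coordinators`, the hostname, and the port in the usage note are hypothetical; only `socket.getaddrinfo` is a real library call.

```python
import socket


def discover_coordinators(name, port, resolve=socket.getaddrinfo):
    """Return the sorted, de-duplicated (ip, port) pairs that `name` resolves to.

    `resolve` defaults to the system resolver and can be swapped out in tests.
    """
    infos = resolve(name, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr
    # starts with (address, port) for both IPv4 and IPv6.
    return sorted({(info[4][0], info[4][1]) for info in infos})
```

A client would call something like `discover_coordinators("coord.example.internal", 12345)` at startup and again after losing contact with its coordinator. DNS caching and TTLs limit how quickly a coordinator change becomes visible, which fits the description of DNS as a punt rather than a full solution.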

Mogul: Put coords on special power supplies?

Armbrust: Zookeeper failure detection "flickers" with 5 second timeout set on EC2.
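A minimal sketch of the "flicker" effect, assuming a simple heartbeat-gap detector. This is not Zookeeper's session mechanism; the gap values are made up for illustration, not EC2 measurements. The `misses_needed` damping parameter is a generic technique swapped in here for comparison, not something proposed in the discussion.

```python
def count_down_transitions(gaps, timeout, misses_needed=1):
    """Count how many times a host flips from up to down.

    gaps: seconds between consecutive heartbeats from one host.
    A gap longer than `timeout` counts as a miss. The host is declared down
    after `misses_needed` consecutive misses and is considered up again as
    soon as a heartbeat arrives on time.
    """
    down = False
    consecutive_misses = 0
    transitions = 0
    for gap in gaps:
        if gap > timeout:
            consecutive_misses += 1
            if not down and consecutive_misses >= misses_needed:
                down = True
                transitions += 1
        else:
            consecutive_misses = 0
            down = False
    return transitions


# Hypothetical heartbeat gaps: mostly 1 s, with a few stalls just over 5 s.
gaps = [1, 1, 6, 1, 1, 7, 1, 6, 6, 6, 1]
flappy = count_down_transitions(gaps, timeout=5)
damped = count_down_transitions(gaps, timeout=5, misses_needed=3)
print(flappy, damped)
```

With a single 5 second timeout every brief stall is reported as a failure, while requiring several consecutive misses reports only the sustained outage, at the cost of slower detection.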

NetApp: Concerned about apps falsely reporting hosts down to DoS the coord.

Mogul: Can cut off the network port. On Arista? Pay them enough. Can do this with OpenFlow.

Mogul: May want to know what went wrong before power cycling a machine.
dm: especially in the case of software failures

Aguilera: MS quarantines crashed boxes for a period before returning them to the DC

Mogul: Lights-out management: a different control path; on higher-end boxes you can even get at memory etc., all serviced through a separate processor

Shel: A lot here is hard to get right: intermittent failures, asymmetric failures; very easy to make little errors

Keith Adams + Armbrust: Danger in believing one should do Paxos from scratch or optimize it