Cluster Management Scribe
Why not Zookeeper?
What's unique about RC that we have to solve this from scratch?
—
Mogul: P2P approach for load-balancing?
—
Shel: Something less extreme than one coord or all coords: a small number of them
Hybrid approach or decentralized
—
Shel: Suppose 100 key ranges, 100 hash buckets; why is that worse for indexes?
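The notes do not record an answer to Shel's question. One commonly cited difference, sketched here under assumptions (integer keys 0..9999, 100 partitions, a made-up multiplicative hash, hypothetical function names), is that an index range scan touches only the key ranges that overlap the query, while hashing scatters adjacent keys across buckets:

```python
NUM_PARTITIONS = 100
KEY_SPACE = 10_000  # assumption: integer keys 0..9999

def range_partition(key):
    """Contiguous key ranges: keys 0-99 -> 0, 100-199 -> 1, and so on."""
    return key // (KEY_SPACE // NUM_PARTITIONS)

def hash_partition(key):
    """Hash buckets; the multiplicative hash is for illustration only."""
    return (key * 2654435761) % 2**32 % NUM_PARTITIONS

def partitions_touched(partition_fn, lo, hi):
    """Set of partitions holding at least one key in [lo, hi]."""
    return {partition_fn(k) for k in range(lo, hi + 1)}

# A scan over keys 250..449 overlaps only three key ranges.
range_fanout = len(partitions_touched(range_partition, 250, 449))
# The same keys land in many different hash buckets.
hash_fanout = len(partitions_touched(hash_partition, 250, 449))
```

In practice a scan does not know in advance which keys exist, so with hash buckets it would have to ask every bucket, not just the ones counted here.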
—
Why do we need a next id/auto-increment id if we are using a hash function anyway?
—
John: What is the 10th-percentile table size, i.e. for the smallest tables?
Franklin: a lot of small tables
Shel: tiny metadata tables; supplier codes table, for example
—
NetApp: Worst case for reqs/sec to coord?
During crash recovery, 10K app nodes all need the tablet map and pull 10M rows
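A back-of-the-envelope version of that worst case, as a rough sketch only. It assumes the 10M rows are the total across all 10K nodes (the notes do not say), and the one-second window in which every node hits the coord is a made-up parameter:

```python
def coordinator_load(app_nodes, total_rows, window_seconds):
    """Rough load on the coord if every app node refetches the tablet map
    within window_seconds. All inputs are assumptions, not measurements."""
    return {
        "rows_per_node": total_rows / app_nodes,
        "requests_per_sec": app_nodes / window_seconds,
        "rows_per_sec": total_rows / window_seconds,
    }

# Figures from the notes, plus an assumed 1-second window.
load = coordinator_load(app_nodes=10_000, total_rows=10_000_000, window_seconds=1.0)
```

Under those assumptions each node pulls about 1,000 rows, and the coord sees 10,000 requests and 10M rows in that second; a longer window scales both rates down proportionally.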
—
Armbrust + Keith Adams: Use Zookeeper
6,000 writes/sec, hundreds of thousands of reads/sec
—
Aguilera: Use RC table 0 as coordinator.
—
Keith Adams: DNS is a reasonable punt for coord discovery
—
Mogul: Put coords on special power supplies?
—
Armbrust: Zookeeper failure detection "flickers" with 5 second timeout set on EC2.
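One common way to damp this kind of flicker is to require several consecutive stale checks before declaring a host down. This is a generic technique, not something the notes attribute to Zookeeper or to any participant; the class name and the `misses_to_fail` parameter are hypothetical, and the caller is assumed to poll `check` about once per timeout period:

```python
class DampedFailureDetector:
    """Declares a host down only after several consecutive stale checks."""

    def __init__(self, timeout=5.0, misses_to_fail=3):
        self.timeout = timeout                # seconds without a heartbeat = one miss
        self.misses_to_fail = misses_to_fail  # consecutive misses before "down"
        self.last_heartbeat = {}
        self.misses = {}

    def heartbeat(self, host, now):
        """Record a heartbeat and clear any accumulated misses."""
        self.last_heartbeat[host] = now
        self.misses[host] = 0

    def check(self, host, now):
        """Poll once per timeout period; returns True if host is considered down."""
        if now - self.last_heartbeat.get(host, now) > self.timeout:
            self.misses[host] = self.misses.get(host, 0) + 1
        return self.misses.get(host, 0) >= self.misses_to_fail
```

The trade-off is slower detection: with these defaults a truly dead host is reported only after roughly three timeout periods.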
—
NetApp: Concerned about apps falsely reporting hosts down to DoS the coord.
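One possible mitigation, not proposed in the notes, is for the coord to act on a "host down" report only once several distinct apps have reported the same host. A minimal sketch; the class, method, and `min_reporters` threshold are all hypothetical:

```python
from collections import defaultdict

class DownReportFilter:
    """Ignores 'host down' reports until enough distinct apps agree."""

    def __init__(self, min_reporters=3):
        self.min_reporters = min_reporters
        self.reporters = defaultdict(set)  # suspect host -> apps that reported it

    def report_down(self, reporter, suspect):
        """Record a report; return True once the coord should probe the suspect."""
        self.reporters[suspect].add(reporter)  # repeats from one app count once
        return len(self.reporters[suspect]) >= self.min_reporters
```

This only limits what a single misbehaving app can trigger; it does nothing against several colluding reporters.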
—
Mogul: Can cut off the network port. On Arista? Pay them enough. Can do this with OpenFlow.
—
Mogul: May want to know what went wrong before power cycling a machine.
dm: especially in the case of software failures
—
Aguilera: MS quarantines crashed boxes for a period before returning them to the DC
—
Mogul: Lights-out management: different control path; on higher-end boxes can even get at memory etc., all serviced through a separate processor
—
Shel: A lot here is hard to get right: intermittent failures, asymmetric failures; very easy to make little errors
—
Keith Adams + Armbrust: Danger in believing one should do Paxos from scratch or optimize it