Cluster Management Scribe

Why not Zookeeper?
What's unique about RC that we have to solve this from scratch?

—

Mogul: P2P approach for load-balancing?

—

Shel: Something less extreme than one coord or all coords: a small number of them
Hybrid approach, or decentralized

—

Shel: Suppose 100 key ranges, 100 hash buckets; why is that worse for indexes?
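
(Editor's sketch, not from the discussion.) The notes do not record an answer. One commonly cited difference is that a range scan over ordered keys touches only a few contiguous key ranges but nearly every hash bucket. The integer key space, the partition count of 100, and the names `range_partition`, `hash_partition`, and `partitions_touched` are all assumptions made for illustration; nothing here describes how RC actually partitions tables.

```python
import hashlib

NUM_PARTITIONS = 100   # matches the "100 key ranges, 100 hash buckets" in the question
KEY_SPACE = 10_000     # assumption: keys are integers in [0, KEY_SPACE)


def range_partition(key: int) -> int:
    """Map a key to one of NUM_PARTITIONS contiguous, equal-width key ranges."""
    return key * NUM_PARTITIONS // KEY_SPACE


def hash_partition(key: int) -> int:
    """Map a key to one of NUM_PARTITIONS buckets by hashing it."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def partitions_touched(partition_fn, lo: int, hi: int) -> set:
    """Return the set of partitions a scan over keys in [lo, hi) must contact."""
    return {partition_fn(k) for k in range(lo, hi)}


# A scan over 300 consecutive keys.
range_hits = partitions_touched(range_partition, 1000, 1300)
hash_hits = partitions_touched(hash_partition, 1000, 1300)

print("range partitions contacted:", len(range_hits))
print("hash buckets contacted:   ", len(hash_hits))
```

Under these assumptions the range-partitioned scan contacts 3 partitions, while the hashed scan contacts most of the 100 buckets.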

—

Why do we need next id/auto incr id if using a hash function anyway?

—

John: What is the 10th-percentile table size, i.e. for the smallest tables?
Franklin: a lot of small tables
Shel: tiny metadata tables; supplier codes table, for example

—

NetApp: Worst case for reqs/sec to coord?
During crash recovery, 10K app nodes all need the tablet map, pull 10M rows

—

Armbrust + Keith Adams: Use Zookeeper
6000 writes/sec, 100s of 1000s of reads/sec

—

Aguilera: Use RC table 0 as coordinator.

—

Keith Adams: DNS is reasonable punt for coord discovery

—

Mogul: Put coords on special power supplies?

—

Armbrust: Zookeeper failure detection "flickers" with 5 second timeout set on EC2.
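
(Editor's sketch, not from the discussion.) A toy model of the "flicker": a detector that declares a host down whenever the gap between heartbeats exceeds a fixed timeout will flap if occasional gaps land just above that timeout. The arrival times below are made up, and `count_flaps` is a hypothetical helper; this is not ZooKeeper's actual session-expiry logic.

```python
def count_flaps(heartbeat_times, timeout):
    """Count how many times a host would be declared down.

    The host is considered down whenever the gap between two consecutive
    heartbeats exceeds `timeout`, and up again when the next one arrives.
    """
    flaps = 0
    for prev, cur in zip(heartbeat_times, heartbeat_times[1:]):
        if cur - prev > timeout:
            flaps += 1
    return flaps


# Hypothetical arrival times in seconds: mostly 1 s apart, with two long stalls
# (6.5 s and 5.7 s), as might happen on a noisy virtualized host.
arrivals = [0.0, 1.0, 2.0, 8.5, 9.5, 10.5, 16.2, 17.0, 18.0]

print("flaps with  5 s timeout:", count_flaps(arrivals, 5.0))
print("flaps with 10 s timeout:", count_flaps(arrivals, 10.0))
```

With this made-up trace, the 5-second timeout reports the host down twice and the 10-second timeout never does, at the cost of slower detection of real failures.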

—

NetApp: Concerned about apps falsely reporting hosts down to DoS the coord.

—

Mogul: Can cut off the network port. On Arista? Pay them enough. Can do this with OpenFlow.

—

Mogul: May want to know what went wrong before power cycling a machine.
dm: especially in the case of software failures

—

Aguilera: MS quarantines crashed boxes for a period before returning them to the DC

—

Mogul: Lights-out management: different control path; on higher-end boxes, can even get memory etc., all serviced through a separate processor

—

Shel: A lot is hard to get right; intermittent failures, asymmetric failures, very easy to make little errors

—

Keith Adams + Armbrust: Danger in believing one should do Paxos from scratch or optimize it