Comparing Replicas

Warning: these are design notes from the initial stages of LogCabin and are probably no longer relevant.

One lesson from Paxos Made Live is that it's useful to have a way to compare replicas' state to verify that they are identical. Doing this periodically helped the authors discover bugs before those bugs were exposed to applications. Ideally, each LogCabin replica could produce a single checksum value covering the entire contents of its state. This state includes log entries, tombstones, and client responses (kept for linearizability). The same checksums could also guard against disk corruption by checking the consistency of servers before they are re-admitted to the cluster.
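
As a rough illustration of the idea, here is a minimal Python sketch of one checksum covering all three kinds of state. It is not LogCabin code: the function names, the in-memory representation (dicts and a set keyed by integer IDs), and the choice of SHA-256 are all assumptions made for the example. Records are hashed in sorted order with length prefixes so that two replicas holding the same state produce the same value regardless of how they store it.

```python
import hashlib
import struct


def _feed(h, tag, fields):
    """Hash one record: a one-byte tag, then each field with a length prefix."""
    h.update(tag)
    for field in fields:
        h.update(struct.pack(">Q", len(field)))
        h.update(field)


def state_checksum(log_entries, tombstones, client_responses):
    """Return a hex digest covering all replica state (hypothetical layout).

    log_entries:      dict mapping entry ID (int) -> entry contents (bytes)
    tombstones:       set of entry IDs (int) that have been deleted
    client_responses: dict mapping client ID (int) -> last response (bytes)
    """
    h = hashlib.sha256()
    # Iterate in sorted order so the digest does not depend on storage order.
    for entry_id in sorted(log_entries):
        _feed(h, b"E", [struct.pack(">Q", entry_id), log_entries[entry_id]])
    for entry_id in sorted(tombstones):
        _feed(h, b"T", [struct.pack(">Q", entry_id)])
    for client_id in sorted(client_responses):
        _feed(h, b"R", [struct.pack(">Q", client_id),
                        client_responses[client_id]])
    return h.hexdigest()
```

Two replicas would then be considered identical when their digests are equal. The same digest could be recomputed from disk before a server rejoins the cluster.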

Problem 1: Independent Cleaning

If the LogCabin servers clean these independently, the set of tombstones and client responses on each replica may be different. In that case, I think only the log entries could be covered in a checksum.

John and Diego discussed this on 2012-03-26. We concluded that checksumming just the live entries (and no tombstones or client responses) would provide most of the benefit, and we shouldn't introduce much more complexity to increase checksum coverage for now.
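
To make the conclusion concrete, here is a small Python sketch of a checksum restricted to live entries. The replica representation (a dict with "live", "tombstones", and "responses" keys) and the function name are hypothetical, chosen only for this example. It shows two replicas that have cleaned independently, so their tombstones and client responses differ, yet their live-entry checksums still agree.

```python
import hashlib
import struct


def live_entries_checksum(replica):
    """Hex digest over live log entries only (hypothetical replica layout).

    Tombstones and client responses are deliberately ignored, since
    independent cleaning may leave them different on each replica.
    """
    h = hashlib.sha256()
    live = replica["live"]
    for entry_id in sorted(live):
        data = live[entry_id]
        # Length-prefix each entry so entry boundaries are unambiguous.
        h.update(struct.pack(">QQ", entry_id, len(data)))
        h.update(data)
    return h.hexdigest()


# Both replicas hold the same live entries, but replica B has already
# cleaned the tombstone for entry 2 and an old client response.
replica_a = {
    "live": {1: b"set x=1", 3: b"set y=2"},
    "tombstones": {2},
    "responses": {10: b"ok", 11: b"ok"},
}
replica_b = {
    "live": {3: b"set y=2", 1: b"set x=1"},
    "tombstones": set(),
    "responses": {11: b"ok"},
}

checksums_match = (live_entries_checksum(replica_a)
                   == live_entries_checksum(replica_b))
```

The trade-off is the one noted above: a divergence confined to tombstones or client responses would go undetected by this checksum.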

Problem 2: When checksums don't match

Supposing checksums are verified periodically, what should happen when checksums don't match?