Cleaning Tombstones

This page is of historical interest only. Move along.

Problem

Clients should be able to read, back up, cache, or proxy a log while it is being written to with some consistency. So when an entry is invalidated, reading clients need to find out. More precisely, if an entry is invalidated after a reader has read it, that reader must find out.

Invalidating a log entry must itself be logged internally. This entry is called a tombstone. Once a tombstone has been written, the entry it invalidates may be cleaned. The tombstone itself may be cleaned once (a) the entry it invalidates has been durably erased (or marked in some way as deleted), and (b) no clients depend on seeing the tombstone for correctness. Given our current assumptions about clients, testing for (b) is difficult. In particular, a problem occurs under the following scenario:

  • A client starts reading through the log, and it reads the entry that creates an object.
  • The object has actually been deleted, and there is a tombstone later in the log.
  • Before the client reads the tombstone, the server decides to clean the tombstone (e.g. because the original entry has also been cleaned).
  • The client never sees the tombstone, so it never deletes its copy of the entry.

Our goal is to allow well-behaved clients to get their tombstones. For poorly-behaved or very unlucky clients, we would like to detect when they may be in trouble (might miss seeing necessary tombstones) and forcefully crash them. That is, we are willing to give up some liveness on clients in rare cases (especially when those clients are already likely to be faulty); however, we insist on safety at all times.

Solution 1: Expiration times

Servers agree not to clean tombstones for some delay D after they are created. Clients must read from their current position (or the start) in the log to the head of the log within D.

A responsible client could execute the following algorithm to help detect when it might be operating unsafely:

  • When the client starts reading, it notes its current local time as Start.
  • At the end of each read:
    • If the local time exceeds Start + D / 2, crash.
    • If the read reached the head of the log, set Start to the current local time.
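The client-side check above can be sketched as follows. This is a minimal illustration, not LogCabin code: the class and method names are invented, and D is assumed to be known to the client (e.g. from server configuration). The clock is injectable so the check is testable.

```python
import time

class ReaderCrashed(Exception):
    """Raised when the client can no longer trust that it has seen all tombstones."""

class Solution1Reader:
    def __init__(self, delay_d, clock=time.monotonic):
        self.d = delay_d
        self.clock = clock            # injectable for testing
        self.start = clock()          # noted when the client starts reading

    def after_read(self, reached_head):
        """Run at the end of each read."""
        now = self.clock()
        if now > self.start + self.d / 2:
            raise ReaderCrashed("may have missed tombstones")
        if reached_head:
            # Caught up to the head of the log: restart the window.
            self.start = now
```

Note that the safety of this check still depends on the client's clock advancing at roughly the server's rate, as discussed below.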

Although this solution is simple, it relies on assumptions about the rates of progress of the clients' and servers' clocks for safety. For example, a client with a slow clock relative to the server's may not crash itself and may not get the tombstones it needs. Also, D must basically be chosen statically (unlike in Solution 2).

Solution 2: Detecting trouble more safely

Servers maintain a cleaning threshold S, where they agree not to clean entries past S in the log. This implies that on a single server, S increases monotonically. Furthermore, servers never return objects to clients that were dead when the client started reading for the first time (i.e., the server must keep auxiliary information that allows it to tell whether a particular entry in its log is live). When a client begins to read for the first time, it gets a pointer to the current head of the log, H0. On subsequent reads, if H0 <= C < S, where C is the client's cursor, the server instructs the client to crash.
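The server-side crash check reduces to a single comparison. A minimal sketch, with illustrative names (positions are abstract log offsets):

```python
def should_crash_client(first_head, cursor, cleaning_threshold):
    """Return True if the server must instruct the client to crash.

    first_head: H0, the head of the log when the client first began reading.
    cursor: C, the client's current read position.
    cleaning_threshold: S, before which the server may have cleaned entries.
    """
    # Tombstones before H0 are irrelevant to this client (it never saw the
    # entries they invalidate), so only a cursor in [H0, S) is unsafe.
    return first_head <= cursor < cleaning_threshold
```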

Safety argument: The client does not need to see tombstones positioned before H0, since it will not have seen the entries those tombstones invalidate. After its cursor reaches H0, the client will see every tombstone by staying ahead of S, or it will crash. Clients can move to different replicas safely: if S' > S, this is identical to the old replica advancing its S value; if S' < S, there are more tombstones around on the new replica, so this is also safe.

A key idea is that the server can choose when to increase its S. Increasing S too aggressively may result in client liveness issues, but the algorithm preserves safety. Leaving sufficient headroom between S and the head of the log should result in good liveness. We have three ideas on how servers might increase S:

  • S trails the head of the log by some delay D, as in Solution 1.
    • This delay can be dynamic based on how many clients are having to crash or the observed rate of progress of normal clients.
    • The advantage of using time rather than distance is that logs with nearly no live data will eventually occupy very little space.
  • S trails the head of the log by some distance in bytes and/or number of entries.
    • This has the advantage that a client is guaranteed liveness if it can read at some constant factor times the rate of appends. For example, if S is defined such that 90% of the log data is older than S, then as long as the client can read data 10x as fast as new data is being written, it will never be forced to crash.
    • The disadvantage of using distance is that a log with nearly no live data will still occupy some minimum amount of space.
  • A combination of the above:
    • For example, don't clean for D time or some amount of distance, but do clean after D * 2 time regardless of space.
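The combined policy could be sketched as below. This is a hypothetical illustration, not LogCabin's cleaner: entry positions, `min_distance`, and the oldest-first entry ordering are assumptions of the sketch.

```python
def advance_threshold(entries, head_position, now, delay_d, min_distance):
    """Return the new S: the position of the first entry that may NOT be cleaned.

    entries: (position, append_time) pairs, ordered oldest-first.
    An entry is cleanable if it is both older than D and at least
    min_distance behind the head, or older than 2*D regardless of space.
    """
    s = entries[0][0] if entries else head_position
    for position, appended_at in entries:
        age = now - appended_at
        far_enough = head_position - position >= min_distance
        if age >= 2 * delay_d or (age >= delay_d and far_enough):
            s = position + 1      # this entry may be cleaned
        else:
            break                 # later entries are newer; stop at the first keeper
    return s
```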

This solution is more complex than Solution 1, both conceptually and in implementation. However, it preserves safety even with no assumptions about time. Also, it allows servers to increase S based on dynamic choices, which may be useful to improve liveness. It allows clients to move between replicas safely, without those replicas agreeing on the value of S.

Solution 3: Using client storage

This approach requires clients to store tombstones that the server wants to clean. The server maintains a generation number G, which is incremented every time it wants to clean. When a client reads, it compares the server's current generation number G' with the value G it saw on its previous read. If G' > G + 1, the client crashes; otherwise, it may proceed.

The server first decides which tombstones it wants to clean and increments G, but it does not clean these tombstones until D time elapses. During this time period, the clients are given an opportunity to talk to the server and fetch the tombstones that are scheduled for cleaning. The clients must save these tombstones and collate them with the results of later reads. Once D time elapses, the server may clean these tombstones.
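The client side of this protocol can be sketched as follows. The names are illustrative; the sketch assumes the server hands back, along with each read, its current generation number and the tombstones scheduled for cleaning in that generation.

```python
class Solution3Reader:
    def __init__(self, initial_generation):
        self.generation = initial_generation
        self.saved_tombstones = []    # tombstones the server plans to clean

    def on_read(self, server_generation, pending_tombstones):
        """Compare generations on each read and stash pending tombstones."""
        if server_generation > self.generation + 1:
            # We slept through at least one full cleaning window and may
            # have missed tombstones that are already gone.
            raise RuntimeError("missed a cleaning generation: crash")
        if server_generation == self.generation + 1:
            # Save the tombstones scheduled for cleaning; they must be
            # collated with the results of later reads.
            self.saved_tombstones.extend(pending_tombstones)
        self.generation = server_generation
```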

One could imagine the server using an S threshold as in Solution 2 to select which tombstones to clean. It is also possible to select any set of tombstones that may be convenient, for example because they occupy a region of disk with poor utilization.

For clients to be able to move between replicas, the servers must agree on generation numbers and when to clean what.

This solution may be friendlier than Solution 2 for liveness, as it only requires clients to issue some read within each cleaning window, regardless of how far behind their cursors are. However, it requires replicated state for cleaning.

Solution 4: Tracking client sessions

In this solution, servers hold on to tombstones until clients no longer need them. They do so by tracking each client's progress in a session object on the server. New clients won't see old tombstones, but tombstones needed by existing clients are blocked from being cleaned until those clients have proceeded past them. This relies on time for liveness (the session timeout) but not for safety.
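The server-side bookkeeping amounts to tracking a cursor per session and cleaning only below the minimum. A minimal sketch with invented names:

```python
class SessionTable:
    def __init__(self):
        self.cursors = {}               # client id -> furthest log position read

    def update(self, client_id, cursor):
        # Taking the max means a stale (understated) cursor is harmless:
        # it only delays cleaning, which is why lazy replication is safe.
        self.cursors[client_id] = max(cursor, self.cursors.get(client_id, 0))

    def expire(self, client_id):
        # Session timeout: the client is considered crashed.
        self.cursors.pop(client_id, None)

    def cleanable_below(self, head):
        """Tombstones before this position are not needed by any live session."""
        return min(self.cursors.values(), default=head)
```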

For clients to be able to move between replicas, the session state must be replicated. This can happen asynchronously, however, as it is safe for replicas to think that client cursors are earlier than they actually are.

This solution is the most friendly to slow clients, as it will only compromise their liveness after long periods of inactivity. The flip side is that space will be reclaimed slowly when clients fail or are slow. It also requires keeping replicated state for even read-only clients.

Decision (2012-03-20)

John and Diego decided on Solution 2 on 2012-03-20. At a tolerable cost in complexity, it is efficient, safe, and provides what we expect will be good enough liveness. If this turns out to result in poor liveness, it should be possible for a later version of LogCabin to switch to solutions 3 or 4 without affecting the client API.

Update (2012-03-22): New idea

We could use the consensus protocol to keep a monotonically increasing cluster time. The leader would be in charge of incrementing the cluster time somewhat regularly. The replicas could then use this time as input to a deterministic cleaner.

When applied to Solution 1, the new safety algorithm would go like this:

  • When the client starts reading, it is given a copy of the cluster time as Start.
  • During each subsequent read, the client gives Start back to the server, which computes the following:
    • If the current cluster time exceeds Start + D, crash the client.
    • If the read reached the head of the log, return the current cluster time as the new value for Start.
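Since the server now does the comparison against cluster time, the check no longer depends on the client's clock. A minimal sketch, with illustrative names:

```python
def check_read(start, cluster_time, reached_head, delay_d):
    """Server-side check on each read. Returns the client's new Start value.

    start: the cluster time handed to the client when it began reading.
    cluster_time: the current cluster time, advanced via consensus.
    """
    if cluster_time > start + delay_d:
        raise RuntimeError("client fell more than D behind cluster time: crash")
    if reached_head:
        return cluster_time   # caught up: restart the window
    return start
```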

This would make Solution 1 just as safe as Solution 2. Also, during leader changes, the replicas would clean less aggressively. This is nice in that it gives clients a better chance to find the new leader and get caught up in time.

Alternatively, this approach could be applied to Solution 2 or 3.

Another benefit is that this helps the problem of Comparing Replicas. Replicas would have identical state, which can be easily checksummed.

Decision (2012-03-26)

John and Diego decided that the new idea did not merit changing the 2012-03-20 decision. We think the complexity of cluster time is not worth the benefit it would bring to Comparing Replicas, at least for an initial usable system.