Linearizable RPC

Problem:

Regular RPCs in RAMCloud are not linearizable and may result in inconsistent behavior. For example, suppose a client requests deletion of a key, and the master fails after processing the request (having recorded the tombstone on backups) but before responding to the client. The client will assume the RPC was lost and retry the same deletion. The retried RPC will then cause the master either to reply with an error or to delete a new value that was written after the original delete RPC.

The same problem exists for conditional writes. If the master fails after completing the conditional write but before returning the result to the client, the current RAMCloud RPC protocol assumes the request was never delivered and retries the conditional write. By then the recovered master already holds the new version of the value (produced by the conditional write), so the retried request is rejected because the version number does not match. The correct response is success, since the conditional write had already completed before the crash.

We resolve this problem by not re-executing an RPC if the earlier attempt was already committed to the log on backups. To do this, masters keep the status of RPCs.

Overview of the solution.

In a client,

A client requests a unique client_id from the coordinator.
Each RPC is assigned a unique rpc_id (unique within a client).
When the client receives a response from a master, it records the receipt (similar to TCP).
Each RPC from a client contains <client_id, rpc_id, ack_id>, where ack_id is the highest rpc_id such that the client has received the results of all RPCs with rpc_id <= ack_id.
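
The client-side steps above can be sketched as follows. This is a minimal illustration, not RAMCloud code: the class name RpcTracker, its methods, and the choice to start rpc_id at 1 (so that ack_id 0 means "nothing acknowledged yet") are assumptions made for the example.

```python
class RpcTracker:
    """Hypothetical client-side bookkeeping for rpc_id and ack_id."""

    def __init__(self, client_id):
        self.client_id = client_id   # assumed to come from the coordinator
        self.next_rpc_id = 1         # rpc_ids are unique within this client
        self.ack_id = 0              # results of all rpc_id <= ack_id were received
        self.received = set()        # received rpc_ids that are still > ack_id

    def new_rpc(self):
        """Return the <client_id, rpc_id, ack_id> header for a new RPC."""
        rpc_id = self.next_rpc_id
        self.next_rpc_id += 1
        return (self.client_id, rpc_id, self.ack_id)

    def result_received(self, rpc_id):
        """Record a received result and advance ack_id over any contiguous run."""
        if rpc_id <= self.ack_id:
            return                   # already covered by ack_id
        self.received.add(rpc_id)
        while self.ack_id + 1 in self.received:
            self.ack_id += 1
            self.received.remove(self.ack_id)
```

Note that ack_id only advances when there is no gap: if the result of RPC 2 arrives before the result of RPC 1, ack_id stays at 0 until RPC 1 completes, and then jumps to 2.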

In a master server,

A master keeps the status of all unacknowledged RPCs in an UnackedRpcResults object, <client_id, List<rpc_id, result>>. (See figure 1.)
When a new RPC arrives, the master checks UnackedRpcResults to see whether a duplicate RPC is in progress or already completed. If the RPC has already completed, the master replies to the client with the saved result. If it is in progress, the master replies with status code "RETRY". If it is neither in progress nor completed, the master processes the RPC as normal.
When an RPC that modifies data is processed by writing a new object value to the log, the master also writes the <client_id, rpc_id, ack_id> associated with the modification. The master atomically writes both the original log entry (which contains the object value) and a new type of log entry (which contains client_id, rpc_id, ack_id, and other metadata; see figure 2), so that consistency is guaranteed after crash and recovery.

When a crash happens

When a recovery server replays its log, it reconstructs the UnackedRpcResults data structure.
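
A rough sketch of how the atomic write on the master and the replay during recovery fit together is given below. Everything in it is an assumption made for illustration: the record layouts ObjectEntry and RpcRecord, the functions append_atomically and replay, and the use of a Python list as the log. A single list append stands in for the atomic write of both entries.

```python
from dataclasses import dataclass


@dataclass
class ObjectEntry:          # hypothetical layout of the original log entry
    table_id: int
    key: str
    value: str


@dataclass
class RpcRecord:            # hypothetical layout of the new log entry type
    client_id: int
    rpc_id: int
    ack_id: int
    result: str


def append_atomically(log, obj, rec):
    """Append both entries as one unit so a crash never leaves only one."""
    log.append((obj, rec))  # one append models the atomic write


def replay(log):
    """Rebuild {client_id: (max_ack_id, {rpc_id: result})} from the log."""
    table = {}
    for _obj, rec in log:
        max_ack, results = table.get(rec.client_id, (0, {}))
        max_ack = max(max_ack, rec.ack_id)
        results[rec.rpc_id] = rec.result
        # Results the client has already acknowledged need not be kept.
        results = {r: v for r, v in results.items() if r > max_ack}
        table[rec.client_id] = (max_ack, results)
    return table
```

Because the object value and the RPC record always appear together in the log, the replay never sees a modification without the RPC that caused it, so a retried RPC can be answered from the rebuilt table.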

Log cleaner in master

When a tombstone is found, for each log entry for that object:
1. If the latest ack_id of its client is greater than or equal to the rpc_id of the log entry, delete the whole log entry.
2. If it is lower than the rpc_id, delete only the object from the log entry and compact. Keep the metadata <table_id, keyHash> and <client_id, rpc_id, ack_id>.
For every iteration on an entry,
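
The two cleaner rules for entries of an object with a tombstone can be sketched as follows. The function name clean_entry and the dict-based entry layout are assumptions for the example. The sketch also assumes that an rpc_id equal to the latest ack_id counts as acknowledged, following the definition of ack_id (rpc_id <= ack_id).

```python
def clean_entry(entry, latest_ack_id):
    """Return the cleaned form of one log entry, or None to drop it entirely.

    `entry` is a dict with keys table_id, key_hash, value, client_id,
    rpc_id and ack_id (a hypothetical layout). `latest_ack_id` is the most
    recent ack_id known for entry["client_id"].
    """
    if entry["rpc_id"] <= latest_ack_id:
        # Rule 1: the client acknowledged this RPC, so nothing needs to stay.
        return None
    # Rule 2: not yet acknowledged. Drop the object value but keep the
    # <table_id, key_hash> and <client_id, rpc_id, ack_id> metadata.
    compacted = dict(entry)
    compacted["value"] = None
    return compacted
```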

1. Idea. How to avoid duplicate processing.
Duplicate processing of an RPC (usually caused by retried RPCs) is avoided by
assigning a unique id to each RPC from a client. A master service keeps the
RPC's id number and its accompanying result, and simply replies to duplicate
RPCs with the previously saved result.
To reduce the space required to keep such data, a client "acknowledges" its
receipt of RPC results and guarantees that it will not retry those RPCs.
This is done by attaching an "acknowledgement number" (aka ack id) to each
RPC request. The number indicates that all RPCs whose ids are smaller than or
equal to the ack id have been acknowledged by this client.

Missing: Logging on disk?

2. Mechanisms on Master

Actions

<Check duplicate> Not a duplicate: <Process>, <Record Completion>. Duplicate: <Reply with saved result>. Exceptions etc. still to be covered.
<Write on log> needs atomicity to guarantee consistency.

<Cleaner> does log cleaner need atomicity?

2.1 Memory data structure.
Each master keeps an #UnackedRpcResults object to store the results of
linearizable RPCs. When a master receives an RPC request, it checks whether
the same RPC is in progress or already completed by calling checkDuplicate().
When processing of the RPC is finished, the master records its completion in
memory with recordCompletion(). On backup storage, it atomically writes both
the result of the RPC and the log record of the RPC's completion.
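
A minimal in-memory sketch of this structure is shown below. It is not the actual RAMCloud class: the Python method names check_duplicate and record_completion mirror checkDuplicate() and recordCompletion(), but their signatures, the return values, and the "STALE" outcome for an RPC the client has already acknowledged are assumptions made for the example.

```python
IN_PROGRESS = object()   # sentinel: RPC has started but not yet completed


class UnackedRpcResults:
    """Hypothetical per-master table of unacknowledged RPC results."""

    def __init__(self):
        # client_id -> {"max_ack": int, "results": {rpc_id: result}}
        self.clients = {}

    def check_duplicate(self, client_id, rpc_id, ack_id):
        """Return ("NEW" | "RETRY" | "COMPLETED" | "STALE", saved_result)."""
        c = self.clients.setdefault(client_id, {"max_ack": 0, "results": {}})
        if ack_id > c["max_ack"]:
            # The client acknowledged more results; forget them to save space.
            c["max_ack"] = ack_id
            for r in [r for r in c["results"] if r <= ack_id]:
                del c["results"][r]
        if rpc_id <= c["max_ack"]:
            # Assumption: the client promised not to retry acknowledged RPCs.
            return ("STALE", None)
        if rpc_id in c["results"]:
            saved = c["results"][rpc_id]
            if saved is IN_PROGRESS:
                return ("RETRY", None)
            return ("COMPLETED", saved)
        c["results"][rpc_id] = IN_PROGRESS
        return ("NEW", None)

    def record_completion(self, client_id, rpc_id, result):
        """Save the result so that a duplicate RPC can be answered from memory."""
        self.clients[client_id]["results"][rpc_id] = result
```

In this sketch a master would call check_duplicate() first, process the RPC only on "NEW", and call record_completion() once the result has been written durably.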