Page Comparison

Problem:

Conditional write RPC on Ramcloud returns the result by success or failure of the conditional write.
The flow of control of the client may depend on the result of conditional write.

Regular RPC in ramcloud is not linearizable and may result in inconsistent behavior. For example, a client request deletion of a key, and master fails after processing it recorded tombstone on backup) and before responding back to the client. The client will think the RPC was lost and retry the same deletion, which will cause master to reply with an error or deleting newly written value which was written after the original delete RPC. The same problem exists for conditional write. If the master fails between succeeding the conditional write and responding back to client the result, current Ramcloud RPC protocol thinks the request was not delivered to client and retries the conditional write. In that case, recovered master already contains the new version of value (after the conditional write) and the retried request will be rejected since the version number doesn't match. Here, the correct response value should be success since the conditional write RPC was already done before the crash.

To handle this problem, I propose the following design.

We resolve this problem by avoiding re-doing same RPC if the previous one was committed to log in backup. This is done by keeping the status of rpcs in masters.

Overview of the solution.

In a client,

A client request a unique client_id from coordinator.
Each RPC gets assigned for a unique rpc_id. (Unique within a client)
As client receives the response from master, it logs it. (Similar to TCP)
Each RPC from a client contains <client_id, rpc_id, ack_id> where ack_id is the highest ACK number for the rpc_id, which means that the client received the result of all RPCs with rpc_id <= ack_id.

In a master server,

Log in master now contains <Client id, RPC id, ACK id>
A master keeps an array of processed_lists status of all unacknowledged RPCs in an object of UnackedRpcResults <Client id, <list of rpc id pairs>> List<rpc_id, result>> (See figure 1)
As a new RPC comes in, the master checks the rpc_id is inside the processed_listwhether duplicate RPC is in progress or completed by referring UnackedRpcResult. If the RPC is already completed before, just reply client with saved result. If it is , just return successin progress, reply with status code "RETRY". If it isn't, is neither in progress or completed, the master process the rpc as normalRPC as normal.
As an alternation RPC is processed by writing a new object value on log, master also write <client_id, rpc_id, ack_id> associated with the modification. Master atomically writes both the original log entry (contains object value) and new type of log entry (client_id, rpc_id and ack_id), so that we can guarantee consistency after crash & recovery.

When a crash happens

When a recovery server replay's replays its log, it reconstructs the processed table by following rpc_id and ack_id in log elementsUnackedRpcResults data structure.

Log cleaner in master

When a tombstone is found, for each log regarding the object,
1. if recent ack_id of its client is higher than rpc_id of the log element, delete the whole log element.
2. if it is lower than rpc_id, just delete object in log and compact. Leave metadata <table_id, keyHash> and <client_id, rpc_id, ack_id>.

1. Idea. How to avoid duplicate processing.
Duplicate processing of an RPC (usually due to re-tried RPCs) is avoided by
assigning a unique id for each RPC from a client. A master service keeps the
RPC's id number and its accompanying result, and just reply to duplicate RPCs
with the previously saved results.
To reduce space required to keep such data, a client "acknowledges" its
receipt of RPC results and guarantees it will not re-try the same RPCs.
This is done by attaching an "acknowledgement number" (aka. ack id) to each
RPC request. The number tells RPCs whose ids are smaller than or equal to
the ack id are acknowledged by this client.

Missing: Logging on disk?

2. Mechanisms on Master
2.1 Memory data structure.
Each master keeps a copy of #UnackedRpcResults object to store the results of
linearizable RPCs. As a master receives an RPC request, it will check whether
the same RPC is in progress or completed by checkDuplicate(). As the processing
of the RPC is finished, master records its completion on memory by
recordCompletion(). On backup storage, it atomically writes both the result of
RPC and the log of the rpc's completion.

client_id

processed_list

1

...

2

<11,15>

<17,17>

<20,26>

3

<1,1>

...

Versions Compared

Old Version 3

New Version 4

Key