These are mostly notes from the 2009-12-03 meeting.
An index maps from keys to object identifiers (a.k.a. primary keys, but that's confusing in this context).
Note that this is distinct from the hash table that maps from OIDs to objects.
All 4 types of functions might appear in a RAMCloud application:
Multi indexes handle the non-injective cases (1 and 2), while unique indexes handle the injective cases (3 and 4).
Consider everyone's favorite example of an employees table in which no two employees have the same employee ID, SSN, or username. I want to efficiently maintain this invariant in my application while inserting a new employee (and let's suppose other employees might be updated concurrently).
One approach is to first take a write-lock on the table. Look up the new username in the multi index, and if it exists, abort. If it does not exist, proceed to insert the employee into the table and the username into the multi index. Then release the write-lock. However, serializing write requests to the table limits my application's scalability.
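A minimal sketch of the lock-based approach above, assuming in-memory dictionaries stand in for the table and the multi index. The names `table`, `username_index`, `table_lock`, and `insert_employee_locked` are hypothetical, not part of any RAMCloud API.

```python
import itertools
import threading

# Hypothetical in-memory stand-ins; a real deployment would issue RPCs.
table = {}                       # OID -> employee record
username_index = {}              # username -> list of OIDs (multi index)
table_lock = threading.Lock()    # the table-wide write lock
oid_counter = itertools.count()  # assumed source of fresh OIDs


def insert_employee_locked(employee):
    """Insert an employee; return its OID, or None if the username is taken."""
    with table_lock:                                  # take the write lock
        if username_index.get(employee["username"]):
            return None                               # duplicate: abort
        oid = next(oid_counter)
        table[oid] = employee                         # insert into the table
        username_index.setdefault(employee["username"], []).append(oid)
        return oid                                    # lock released on exit
```

Every writer queues on `table_lock`, which is the scalability cost the paragraph describes.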
Alternative approaches are prone to race conditions.
A server-side unique index allows my application to atomically insert the index entry only if the key does not already exist, overcoming race conditions without locks.
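To illustrate the insert-if-absent behavior, here is a toy sketch in which two clients race to claim the same username. `ToyUniqueIndex` is a hypothetical stand-in for the server. Its internal mutex only models the server performing each insert atomically; the application itself holds no table-wide lock.

```python
import threading


class ToyUniqueIndex:
    """Hypothetical stand-in for a server-side unique index."""

    def __init__(self):
        self._entries = {}               # key -> OID
        self._mutex = threading.Lock()   # models per-operation atomicity

    def insert(self, key, oid):
        """Atomically add key -> oid; return False if the key exists."""
        with self._mutex:
            if key in self._entries:
                return False
            self._entries[key] = oid
            return True


usernames = ToyUniqueIndex()
results = []


def claim(oid):
    # Each client tries to register the same username for its own OID.
    results.append(usernames.insert("alice", oid))


threads = [threading.Thread(target=claim, args=(oid,)) for oid in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one of the two inserts succeeds, whichever thread ran first.
```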
If every object must correspond to a key, the application should never write an object that doesn't correspond to a key. This is easy to enforce with assertions on the client.
A multi index is probably best implemented as a hash table or tree of key/list-of-OID pairs, while a unique index is probably best implemented as simply a hash table or tree of key/OID pairs.
We seem to think hash tables are faster, since we use them for the primary indexes. If this turns out not to be the case for secondary indexes, we would implement all indexes as trees, so all indexes will be range-queryable.
Insert(key, oid) -> err on duplicate key
Remove(key, oid) -> err on nonexistent key/OID pair
Lookup(key) -> oid or err on nonexistent key
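The three operations above could look roughly like this, as a sketch backed by a Python dict. The exception names are assumptions; the notes only say that these cases are errors.

```python
class DuplicateKeyError(Exception):
    """Raised when inserting a key that is already present."""


class NoSuchEntryError(Exception):
    """Raised when a key or key/OID pair is not in the index."""


class UniqueIndex:
    """Sketch of a unique index: at most one OID per key."""

    def __init__(self):
        self._entries = {}   # key -> OID

    def insert(self, key, oid):
        if key in self._entries:
            raise DuplicateKeyError(key)
        self._entries[key] = oid

    def remove(self, key, oid):
        # The pair must match exactly, not just the key.
        if key not in self._entries or self._entries[key] != oid:
            raise NoSuchEntryError((key, oid))
        del self._entries[key]

    def lookup(self, key):
        if key not in self._entries:
            raise NoSuchEntryError(key)
        return self._entries[key]
```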
Insert(key, oid)
Remove(key, oid) -> err on nonexistent key/OID pair
Lookup(key) -> list of OIDs
RangeQuery(key start (optional), bool start_inclusive, key end (optional), bool end_inclusive, int limit, oid start_after_oid (optional)) -> list of key/OID pairs or list of OIDs only (sorted by keys, then OIDs), plus a flag indicating whether more data was available but not returned
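A sketch of the multi index operations, including RangeQuery, over a sorted list of (key, OID) pairs. Several details are assumptions: inserting an identical pair twice is treated as a no-op, `start_after_oid` resumes strictly after the pair `(start, start_after_oid)` and overrides the start-inclusive flag, and the result is always key/OID pairs.

```python
import bisect


class MultiIndex:
    """Sketch of a multi index: a sorted list of (key, oid) pairs."""

    def __init__(self):
        self._pairs = []

    def insert(self, key, oid):
        pair = (key, oid)
        i = bisect.bisect_left(self._pairs, pair)
        if i == len(self._pairs) or self._pairs[i] != pair:
            self._pairs.insert(i, pair)

    def remove(self, key, oid):
        pair = (key, oid)
        i = bisect.bisect_left(self._pairs, pair)
        if i == len(self._pairs) or self._pairs[i] != pair:
            raise KeyError(pair)          # nonexistent key/OID pair
        del self._pairs[i]

    def lookup(self, key):
        pairs, _more = self.range_query(key, True, key, True,
                                        limit=len(self._pairs))
        return [oid for _key, oid in pairs]

    def range_query(self, start=None, start_inclusive=True,
                    end=None, end_inclusive=True,
                    limit=100, start_after_oid=None):
        """Return (pairs, more) for keys between start and end."""
        if start is None:
            i = 0
        elif start_after_oid is not None:
            # Resume strictly after (start, start_after_oid).
            i = bisect.bisect_right(self._pairs, (start, start_after_oid))
        else:
            # (start,) sorts before every (start, oid) pair.
            i = bisect.bisect_left(self._pairs, (start,))
            if not start_inclusive:
                while i < len(self._pairs) and self._pairs[i][0] == start:
                    i += 1
        results = []
        while i < len(self._pairs):
            key, oid = self._pairs[i]
            if end is not None and (
                    key > end or (key == end and not end_inclusive)):
                break
            if len(results) == limit:
                return results, True      # more data was available
            results.append((key, oid))
            i += 1
        return results, False
```

A client can page through a large range by passing the last returned key and OID as `start` and `start_after_oid` on the next call.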
If we're OK with indexes possibly containing "extra" key/OID pairs for short periods of time, we can get away with cheaper algorithms to deal with updating multiple unique indexes.
For example, suppose we want to insert an employee with a unique SSN and a unique username where the table and each index are on separate machines.
If all goes well:
If the second index entry cannot be inserted:
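One possible ordering for both cases, offered as an assumption rather than the procedure agreed in the meeting: insert the SSN entry, then the username entry, then write the object. If the second insert fails, remove the first entry. The client-chosen `oid`, the helper `insert_employee`, and the dict-based table are all hypothetical.

```python
class DuplicateKeyError(Exception):
    """Raised when a unique index already holds the key."""


class UniqueIndex:
    """Minimal in-memory unique index used only for this sketch."""

    def __init__(self):
        self.entries = {}    # key -> OID

    def insert(self, key, oid):
        if key in self.entries:
            raise DuplicateKeyError(key)
        self.entries[key] = oid

    def remove(self, key, oid):
        if key not in self.entries or self.entries[key] != oid:
            raise KeyError((key, oid))
        del self.entries[key]


def insert_employee(table, ssn_index, username_index, oid, employee):
    """Insert into both unique indexes, then the table; undo on conflict."""
    ssn_index.insert(employee["ssn"], oid)       # may raise DuplicateKeyError
    try:
        username_index.insert(employee["username"], oid)
    except DuplicateKeyError:
        # Until this line runs, the SSN index holds an "extra" pair.
        ssn_index.remove(employee["ssn"], oid)
        raise
    table[oid] = employee                        # all goes well
```

Between the first insert and the cleanup, the SSN index briefly holds a pair with no matching object, which is the kind of "extra" entry this scheme tolerates.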
Insert in the unique case throws an error if there's a concurrent transaction with the same key
Lookup and RangeQuery return candidate OIDs
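If index results are only candidates, a client would presumably re-check each one against the stored object. The sketch below shows one way to do that. The helper `lookup_verified` and the dict-based table and index are hypothetical.

```python
def lookup_verified(table, index_lookup, key, field):
    """Return the candidate OIDs whose object really has object[field] == key."""
    verified = []
    for oid in index_lookup(key):        # candidate OIDs from the index
        obj = table.get(oid)             # None if the object does not exist
        if obj is not None and obj.get(field) == key:
            verified.append(oid)
    return verified


# OID 2 is an "extra" index entry with no matching object in the table.
employees = {1: {"username": "alice", "ssn": "111"}}
username_index = {"alice": [1, 2]}
found = lookup_verified(employees,
                        lambda k: username_index.get(k, []),
                        "alice", "username")
```

Here `found` contains only OID 1, because the stale candidate fails the check.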