Data model related aspects
Current data model (as of April 2013):
Multiple tables
Each table can hold an unbounded number of objects
Each object is a (key, value, version) tuple
Key is a unique primary key
A hash table maps hash(key) to the object; the full key is compared against each candidate object found, since different keys can hash to the same bucket
Operations:
Basic operations: read, write, delete
Conditional operations allow reading / writing / deleting only if version matches provided version
Multi-ops allow multi-object operations (read / write / delete). The operation on each object is allowed to succeed / fail independently of the others (i.e., it is not a transaction)
Table split into tablets across servers
We lose some locality
Tablets get further split during recovery
Helps in fast recovery
But causes further fragmentation / loss of locality
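The data model and operations above can be sketched as a toy in-memory table. This is only an illustration, not the RAMCloud client API: the names ToyTable, expected_version and multi_write are made up here, and two details are assumptions (version 0 stands for "object does not exist", and versions increase monotonically per table).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Obj:
    value: bytes
    version: int


class ToyTable:
    """Hypothetical in-memory stand-in for one table (not the real API)."""

    def __init__(self) -> None:
        self._objects: Dict[bytes, Obj] = {}  # dict plays the role of the hash table
        self._next_version = 1

    def read(self, key: bytes) -> Optional[Tuple[bytes, int]]:
        obj = self._objects.get(key)
        return None if obj is None else (obj.value, obj.version)

    def write(self, key: bytes, value: bytes,
              expected_version: Optional[int] = None) -> Optional[int]:
        """Write the object; return the new version, or None if rejected.

        With expected_version set, the write is conditional. Assumption:
        version 0 means the object does not exist yet.
        """
        current = self._objects.get(key)
        current_version = current.version if current else 0
        if expected_version is not None and expected_version != current_version:
            return None
        version = self._next_version
        self._next_version += 1
        self._objects[key] = Obj(value, version)
        return version

    def delete(self, key: bytes, expected_version: Optional[int] = None) -> bool:
        current = self._objects.get(key)
        if current is None:
            return False
        if expected_version is not None and expected_version != current.version:
            return False
        del self._objects[key]
        return True

    def multi_write(
        self, items: List[Tuple[bytes, bytes, Optional[int]]]
    ) -> List[Optional[int]]:
        """Each write succeeds or fails on its own; this is not a transaction."""
        return [self.write(k, v, ev) for k, v, ev in items]


table = ToyTable()
v1 = table.write(b"a", b"1")
rejected = table.write(b"a", b"2", expected_version=v1 + 5)  # stale version
results = table.multi_write([(b"a", b"3", v1), (b"b", b"x", 99)])
```

In the multi_write call the first write goes through and the second is rejected, which is the "independent success / failure" behaviour described for multi-ops.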
(Highly inclusive list of) topics for further exploration (or not) and some initial thoughts about them:
Multi-object transactions
Client side transactions
Can be constructed (relatively easily) using conditional ops by a user.
Provide library method (for better support / easy first-step / comparison point)?
Optimized using server modifications (see older discussions)
Diego has an implementation in python
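One possible way to build a client-side multi-object transaction from conditional ops is a try-lock protocol: create a lock object per key with a conditional write that only succeeds if the lock does not exist, update the objects, then delete the locks. The sketch below is an assumption-laden illustration and not Diego's implementation: ToyStore, conditional_write, run_transaction and the "lock:" key prefix are all hypothetical, it only protects against other clients that follow the same protocol, and it does not handle a client crashing while holding locks.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ToyStore:
    """Hypothetical in-memory stand-in for a table of versioned objects."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[bytes, int]] = {}
        self._clock = 0

    def read(self, key: str) -> Optional[Tuple[bytes, int]]:
        return self._data.get(key)  # (value, version) or None

    def conditional_write(self, key: str, value: bytes,
                          expected_version: int) -> bool:
        """Write only if the current version matches (0 = object absent)."""
        current = self._data.get(key)
        if (current[1] if current else 0) != expected_version:
            return False
        self._clock += 1
        self._data[key] = (value, self._clock)
        return True

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def run_transaction(
    store: ToyStore,
    keys: List[str],
    update: Callable[[Dict[str, Optional[bytes]]], Dict[str, bytes]],
) -> bool:
    """Try-lock all keys, apply update, unlock. False if any lock is taken.

    update receives the old values and returns new values for a subset
    of keys.
    """
    acquired: List[str] = []
    try:
        for key in sorted(keys):
            # Creating the lock object succeeds for exactly one client.
            if not store.conditional_write("lock:" + key, b"held", 0):
                return False
            acquired.append(key)
        snapshot = {k: store.read(k) for k in keys}
        old_values = {k: (s[0] if s else None) for k, s in snapshot.items()}
        for k, v in update(old_values).items():
            version = snapshot[k][1] if snapshot[k] else 0
            if not store.conditional_write(k, v, version):
                raise RuntimeError("object changed outside the lock protocol")
        return True
    finally:
        for key in acquired:
            store.delete("lock:" + key)


store = ToyStore()
store.conditional_write("alice", b"10", 0)
store.conditional_write("bob", b"0", 0)


def transfer(old: Dict[str, Optional[bytes]]) -> Dict[str, bytes]:
    a, b = int(old["alice"].decode()), int(old["bob"].decode())
    return {"alice": str(a - 3).encode(), "bob": str(b + 3).encode()}


ok = run_transaction(store, ["alice", "bob"], transfer)
```

A library method along these lines could serve as the easy first step / comparison point mentioned above; server-side support would be needed to make it robust against client failures.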
Server side transactions
2PC, i.e. two-phase commit (see Sinfonia)
Allow transactions on objects across various tables? Why not!
Object format / Schema
Since key and val can both be blobs of data, the user can store data in more "interesting" formats, like JSON. However, it is all opaque to RAMCloud. So the client has to do the serialization / deserialization, and it also cannot query based on individual attributes.
User can define their own library functions to do some of this.
We can provide library functions for some simple formats.
Allows for non-strict datamodel (i.e., each row can have different num / types of columns)
+ flexibility, + ease of implementation, - can't have secondary indexes
Relational DB - like model
Multiple columns as first-class citizens - queries can be done on columns other than "key"
Specify columns and their formats during table creation.
Same format applies to all rows.
Allowing schema changes after table creation:
Allowed in most DB systems, so we should too
Hard to implement
Will need to stop execution to re-check constraints on data already stored
What to do if data doesn't fit constraints?
Client side / server side?
Can treat values in columns as opaque blobs (i.e., no constraint checking) but still provide secondary indexes. Or not (next point).
Constraint checking during writes for each column. Can do at client / server level. See common DB (incl. distributed DB) implementations. Some examples:
Null / not-null
Data type
length
Unique vals
harder, probably can't do at client level
data race when two writers insert the same val in the secondary index: can make sure only one index entry gets added, and accordingly only that writer wins the write
Foreign key
harder, will need conditional op where condition can refer to other objects
what if the foreign key has to be deleted from its table?
Simple lookup / range query (see "secondary indexes" below) / nothing
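The simpler per-column checks listed above (null / not-null, data type, length) can be done locally on a single row, at either the client or the server. A minimal sketch, assuming a schema declared at table creation; the Column and check_row names are hypothetical, and unique / foreign-key checks are left out because they need access to other objects.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical column declaration fixed at table creation."""
    name: str
    type: type
    nullable: bool = True
    max_length: Optional[int] = None  # only applied to str / bytes values


def check_row(schema: List[Column], row: Dict[str, Any]) -> List[str]:
    """Return a list of constraint violations; an empty list means OK."""
    errors: List[str] = []
    known = {col.name for col in schema}
    for name in row:
        if name not in known:
            errors.append(f"{name}: unknown column")
    for col in schema:
        value = row.get(col.name)
        if value is None:
            if not col.nullable:
                errors.append(f"{col.name}: must not be null")
            continue
        if not isinstance(value, col.type):
            errors.append(f"{col.name}: expected {col.type.__name__}")
            continue
        if (col.max_length is not None
                and isinstance(value, (str, bytes))
                and len(value) > col.max_length):
            errors.append(f"{col.name}: longer than {col.max_length}")
    return errors


schema = [
    Column("id", int, nullable=False),
    Column("name", str, nullable=False, max_length=8),
    Column("email", str),
]
good = check_row(schema, {"id": 1, "name": "ann"})
bad = check_row(schema, {"id": "1", "name": "a-very-long-name"})
```

Here good is an empty list, while bad reports the wrong type for id and the over-long name. A write path would reject the write whenever the list is non-empty.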
Graph oriented data model
Jonathan says graphs fit really well in existing relational model.
Read up more / talk to more people. (What is the model? What ops do they do and how efficient are they? What would they like to do?)
Primary index
"key" is a unique index (or primary key) implemented as a hash table based on hash(key). Can do lookups on this, but not range queries.
Support range queries (tree)? - Probably not. We're mostly interested in range queries on secondary indexes.
Secondary indexes
Multi indexes vs unique indexes
Simple lookups (Hash table) vs Range queries (Tree) - we'd probably have to support both, and it can be chosen by the user.
For multi indexes it would make more sense to use tree rather than hash.
Cross-server transactions
Recovery!
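The hash vs tree and multi vs unique choices above can be illustrated on a single server. This is a sketch under simplifying assumptions: HashIndex and SortedIndex are hypothetical names, a sorted list with bisect stands in for a real tree, and the cross-server and recovery questions are ignored entirely.

```python
import bisect
from collections import defaultdict
from typing import Any, DefaultDict, List, Set, Tuple


class HashIndex:
    """Secondary index for simple lookups: column value -> primary keys."""

    def __init__(self, unique: bool = False) -> None:
        self._unique = unique
        self._map: DefaultDict[Any, Set[str]] = defaultdict(set)

    def insert(self, value: Any, primary_key: str) -> bool:
        # For a unique index, only the first writer of a value gets in.
        if self._unique and self._map.get(value):
            return False
        self._map[value].add(primary_key)
        return True

    def lookup(self, value: Any) -> List[str]:
        return sorted(self._map.get(value, ()))


class SortedIndex:
    """Secondary multi index for range queries (sorted list as a tree stand-in)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Any, str]] = []  # kept sorted by (value, pk)

    def insert(self, value: Any, primary_key: str) -> None:
        bisect.insort(self._entries, (value, primary_key))

    def range(self, low: Any, high: Any) -> List[str]:
        """Primary keys of entries with low <= value <= high, in index order."""
        # (low,) sorts before every (low, pk), so this finds the first match.
        start = bisect.bisect_left(self._entries, (low,))
        result: List[str] = []
        for value, primary_key in self._entries[start:]:
            if value > high:
                break
            result.append(primary_key)
        return result


by_city = HashIndex()
for city, pk in [("paris", "k1"), ("paris", "k2"), ("rome", "k3")]:
    by_city.insert(city, pk)

by_email = HashIndex(unique=True)
first = by_email.insert("a@x", "k1")
second = by_email.insert("a@x", "k2")  # loses the race for the unique value

by_age = SortedIndex()
for age, pk in [(30, "k1"), (25, "k2"), (41, "k3"), (30, "k4")]:
    by_age.insert(age, pk)
in_range = by_age.range(25, 30)
```

The sorted structure handles duplicate values naturally, since entries with the same value sit next to each other, which is one reason a tree looks like the better fit for multi indexes.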
Querying
This brings together some of the earlier issues and some additional decisions
Type of queries that can be done will depend on the object format / schema and the indexes.
It could be simple lookups / range queries on a column, or graph-oriented querying, SQL-like querying and so on.
Joins (for querying) or store information where needed (de-normalization?)
Higher level operations
For example, various types of aggregation (Cristian Tinnefeld was working on something like this?)
Client side library functions vs server side implementations
Favors column store more than row oriented
More in the direction of analytics
Code shipping
I think it's beyond the scope of my work, unless used as a comparison point for the implementation of higher level functions.
This basically comes down to the following main issues:
Secondary Indexes
Constraint checking (on secondary indexes / values)
Multi-object transactions
Graph operations (and don't need anything else from relational model) - figure out what's needed, maybe it will just come down to secondary indexes
From the above, multi-object transactions is probably a good starting point since: it will be needed as a building block for implementing secondary indexes that span multiple servers; it is the most systems-y topic (and I can climb on to topics on DB-Systems interface after this); it will probably be easier to wrap my head around this first.