Data Model Related Work

MySQL related:

https://archive.fosdem.org/2014/schedule/event/spider_storage_engine/attachments/slides/408/export/events/attachments/spider_storage_engine/slides/408/spider_storage_engine.pdf

http://dev.mysql.com/doc/internals/en/support-for-indexing.html


An interesting blog post on the importance, and the technical difficulty, of implementing features like indexing (in the context of FoundationDB and VoltDB):

http://voltdb.com/blog/foundationdbs-lesson-fast-key-value-store-not-enough

Also note this good quote from Google's F1 paper: "Features like indexes and ad hoc query are not just nice to have, but absolute requirements for our business."


Scaling memcache at Facebook - https://www.usenix.org/conference/nsdi13/scaling-memcache-facebook


Sinfonia


Calvin


The Spanner paper argues that NoSQL is not enough: we need transactions and a SQL-like query language.

"Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general purpose transactions. The move towards supporting these features was driven by many factors. The need to support schematized semi-relational tables and synchronous replication is supported by the popularity of Megastore.

At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters. (Bigtable only supports eventually-consistent replication across datacenters.) Examples of well-known Google applications that use Megastore are Gmail, Picasa, Calendar, Android Market, and AppEngine.

The need to support a SQL-like query language in Spanner was also clear, given the popularity of Dremel as an interactive data analysis tool. Finally, the lack of cross-row transactions in Bigtable led to frequent complaints; Percolator was in part built to address this failing.

Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems."


OctopusDB stores all data in a log structure and has a concept of storage views, through which you define your data models. It can behave as a row store, as a column store, or as whatever layout you want. Quite an interesting idea (patent pending): http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper25.pdf
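To make the log-plus-storage-views idea concrete, here is a minimal sketch in Python. It is not OctopusDB's actual design or API: the names Log, LogEntry, row_view and column_view are hypothetical, and the sketch assumes that every write is a (key, attribute, value) entry appended to one log, with each "view" being just a different materialization of that same log.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class LogEntry:
    """One write appended to the central log (hypothetical entry format)."""
    key: int
    attribute: str
    value: Any


class Log:
    """Append-only log that acts as the primary copy of all data."""

    def __init__(self) -> None:
        self.entries: List[LogEntry] = []

    def append(self, key: int, attribute: str, value: Any) -> None:
        self.entries.append(LogEntry(key, attribute, value))


def row_view(log: Log) -> Dict[int, Dict[str, Any]]:
    """Row-store-like view: key -> {attribute: latest value}."""
    rows: Dict[int, Dict[str, Any]] = defaultdict(dict)
    for entry in log.entries:  # later entries overwrite earlier ones
        rows[entry.key][entry.attribute] = entry.value
    return dict(rows)


def column_view(log: Log) -> Dict[str, Dict[int, Any]]:
    """Column-store-like view: attribute -> {key: latest value}."""
    columns: Dict[str, Dict[int, Any]] = defaultdict(dict)
    for entry in log.entries:
        columns[entry.attribute][entry.key] = entry.value
    return dict(columns)


log = Log()
log.append(1, "name", "alice")
log.append(1, "age", 30)
log.append(2, "name", "bob")
log.append(1, "age", 31)  # an update is just another log entry

rows = row_view(log)        # same data, laid out per record
columns = column_view(log)  # same data, laid out per attribute
```

Both views are derived from the same log, so the choice between a row layout and a column layout becomes a question of which views to materialize rather than which engine to run.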


Wasp - Multi-key transactions


Marcus’s thesis

http://e-collection.library.ethz.ch/eserv/eth:5403/eth-5403-01.pdf

He implemented a B-tree on top of RAMCloud to interface it with MySQL.

MySQL internals main site: http://forge.mysql.com/wiki/MySQL_Internals

MySQL custom storage engine plugin info: http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine

They say that they could not saturate RAMCloud with load, and that it performed almost on par with InnoDB (a disk-based storage engine). I suspect that the custom engine was not implemented the right way.

An interesting data point: 700K queries/sec were achieved on top of MySQL using HandlerSocket (which is actually the way to go if you want to saturate RAMCloud with load through a MySQL wrapper): http://yoshinorimatsunobu.blogspot.in/2010/10/using-mysql-as-nosql-story-for.html

HandlerSocket is implemented using the handler API, which is documented here (for MariaDB, an open-source fork of MySQL by Monty, the original author of MySQL): http://kb.askmonty.org/en/handler-commands

 


Christian Tinnefeld's work


Column DB, Abadi - http://cs-www.cs.yale.edu/homes/dna/papers/abadiphd.pdf

Vertica, Stonebraker


Daniel Abadi advocates using PACELC instead of CAP to judge a database system design: http://dbmsmusings.blogspot.in/2010/04/problems-with-cap-and-yahoos-little.html

He says "To me, CAP should really be PACELC --- if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?"

I think RAMCloud fares very well on the PACELC yardstick too.
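The PACELC question from the quote above can be written down as a small data structure. This is only an illustrative sketch: the class names and the two example systems ("store_x", "store_y") are hypothetical and do not classify any real system.

```python
from dataclasses import dataclass
from enum import Enum


class PartitionChoice(Enum):
    """What the system favors when there is a partition (P)."""
    AVAILABILITY = "A"
    CONSISTENCY = "C"


class ElseChoice(Enum):
    """What the system favors else (E), in normal operation."""
    LATENCY = "L"
    CONSISTENCY = "C"


@dataclass(frozen=True)
class PacelcProfile:
    system: str
    on_partition: PartitionChoice
    otherwise: ElseChoice

    def label(self) -> str:
        # Conventional shorthand, e.g. "PA/EL" or "PC/EC".
        return f"P{self.on_partition.value}/E{self.otherwise.value}"


# Hypothetical systems, used only to show the notation.
store_x = PacelcProfile("store_x", PartitionChoice.AVAILABILITY, ElseChoice.LATENCY)
store_y = PacelcProfile("store_y", PartitionChoice.CONSISTENCY, ElseChoice.CONSISTENCY)

labels = [store_x.label(), store_y.label()]
```

The point of the two separate fields is that CAP only captures the first one; PACELC also records how the system trades latency against consistency when no partition is present.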


CAP theorem by Eric Brewer


Survey papers on data models for NoSQL databases:

Google datastore

Ultimate guide to the NoSQL universe: http://nosql-database.org/

Survey of distributed databases: http://dbpedias.com/wiki/NoSQL:Survey_of_Distributed_Databases

Some more papers:

http://www.mendeley.com/research/survey-nosql-database/

http://publications.lib.chalmers.se/records/fulltext/155048.pdf

In-memory DB (SIGMOD contest): a post describing one implementation, which gives basic insight into implementing in-memory databases:

http://eferm.com/implementing-durability-for-in-memory-databases-on-ssds/

Kyoto Cabinet, a very interesting DB library:

http://fallabs.com/kyotocabinet/