User view
...
for
...
target
...
applications:
Property | Web Datastore | Analytics System | Transactional System |
---|---|---|---|
Data | Records (possibly unstructured) Semi-structured records, upto 2MB | Files, sets of records (often unstructured), maybe large records | Structured records, typically small |
Queries | Simple, lookups, sometimes secondary key access | Rich analytic, procedural | Complex, declarative |
Consistency | Weak, per-record | Static files | ACID |
Latency | ~10s of milliseconds | minutes/hours | query-dependent: 3 milliseconds to minutes |
Scalability | High, Elastic | High | Low |
Examples | Yahoo PNUTS, Google AppEngine Datastore, Facebook Cassandra, Amazon Dynamo | MapReduce, Hadoop, Aster, Terradata | MySQL, Oracle, IBM DB2, MS SQL Server |
Web Datastore
More detailed user view
- Queries
- Primary key lookups, multigets
- Secondary key lookups
- Simple joins
- Simple aggregations
- Range queries
- Often predeclared , can't be adhoc
- Consistency, transcations
- Per-record, ACID within record
- Sometimes consistency dropped within record for certain failures for availability
- Stale data served
- Often expose trade offs for availability and performance
- Flexible schemas
- Semi-structured records
- Self managing
- Very high availability
- Ordered tables, hash tables
- Geographically low latency
- Option of in-memory tables/records
Common techniques
- Partitioning/Sharding
- Often an extra layer of query router used
- Load balancing is done by a manager (not in query path)
- Coerced to same data-model
- No locks across storage nodes
- Replication
- Replicas are used for reads and recovery
- Asynchronously maintained
- Maybe geographic, maybe selective
- Record mastership
- Only master writes
- Record versioning
- For consistency
- Secondary indexing
- Coerced to same data-model
- Maybe as a replica
- Sometimes co-location for joins
- Logging/Messaging/PubSub
- Used for logging
- Used for asynchronous replication
- Big caches on storage nodes
PNUTS deep-dive to show some of the above in use.
Current directions
- Richer queries
- Richer consistency, transactions
- Better self-management, even under failures
- Lowering latency