This page contains a collection of ideas for research projects that could turn into nice PhD dissertations.
Large-scale low-latency RPC
The overall question is: what is the right software architecture for network communication in large-scale low-latency datacenters of the future?
The current RAMCloud RPC system is just the first step in answering this question, but it is primitive and lacking in several areas. Here are some issues that are worth exploring:
- How should the software stack for networking be organized for lowest possible latency? For example, it must be possible to send and receive packets directly from user space, without involving the kernel ("kernel bypass"), but at the same time the kernel must have some control.
- How does kernel bypass impact the underlying protocols? For example, can sockets still be shared between processes if the kernel is not involved in sending and receiving packets?
- What is the absolute least software overhead possible in RPC? This would involve measuring and analyzing RAMCloud, proposing a new architecture, and implementing and measuring it.
- How should RPCs interact with threads on the server side? In the current RAMCloud implementation, the use of threading in servers adds 300-500ns to each RPC, but without threading it's hard to handle things like heartbeats and requests for ACKs.
- What is the role of threading on the client side? RAMCloud does not support multi-threaded clients very cleanly right now.
- The current RAMCloud implementation dedicates one thread to polling on the server side. This would not scale very well if servers contain multiple services, each with its own polling thread. Is there a better architecture?
- In large-scale applications a server might interact with 1 million or more clients. How should RPC state be managed to do this efficiently? Note: there may be a million client threads, but the number of distinct client applications is probably much less.
- What is the right network protocol for large-scale low-latency RPC? Although many people are trying to optimize TCP for datacenters, it seems likely that a new protocol, optimized for a low-latency environment and for the large-scale communication that occurs in a datacenter, might work much better:
- For example, it's important to eliminate buffering to optimize latency, but TCP tries to fill as many buffers as possible.
- TCP doesn't handle out-of-order packet arrival very well, but in datacenters with multiple paths between every source and destination, it makes sense to scatter packets randomly across different paths.
- Another issue is congestion control: in the datacenters of the future, it seems feasible to implement full bisection bandwidth, which would essentially eliminate congestion inside the network. However, congestion can still occur at endpoints (e.g., if several sources simultaneously transmit to a single destination). In a low-latency environment it might be possible to handle this in a much more efficient manner by having the destination issue reservations to limit incast.
- TCP is stream-oriented, but low-latency traffic is likely to be more RPC-oriented. A protocol designed for RPC can reduce acknowledgment overhead.
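One way to make the reservation idea concrete: the destination limits incast by granting transmission rights to only a few senders at a time and queueing the rest. The sketch below is a toy, single-machine model of that idea; the `Receiver` class, its `max_active` limit, and the method names are all hypothetical, not part of any RAMCloud transport.

```python
from collections import deque

class Receiver:
    """Toy receiver that limits incast by issuing transmission grants.

    Senders announce a pending message; the receiver grants at most
    max_active concurrent transmissions and queues the rest until an
    active transfer completes.
    """
    def __init__(self, max_active=2):
        self.max_active = max_active
        self.active = set()          # senders currently granted
        self.waiting = deque()       # senders queued for a grant
        self.granted_order = []      # history of grants, for inspection

    def request_to_send(self, sender):
        if len(self.active) < self.max_active:
            self._grant(sender)
        else:
            self.waiting.append(sender)

    def _grant(self, sender):
        self.active.add(sender)
        self.granted_order.append(sender)

    def transfer_complete(self, sender):
        # Finishing a transfer frees a slot for the next queued sender.
        self.active.remove(sender)
        if self.waiting:
            self._grant(self.waiting.popleft())

rx = Receiver(max_active=2)
for s in ["A", "B", "C", "D"]:   # four senders arrive at once (incast)
    rx.request_to_send(s)
# At this point only A and B may transmit; C and D are queued.
rx.transfer_complete("A")        # finishing A releases a grant to C
```

The interesting research questions are precisely what this toy omits: how grants interact with round-trip times, how to size `max_active` against buffer space, and what happens when grants or completions are lost.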
Higher-level data model
The current RAMCloud data model is quite low-level: a key-value store. We would like to explore higher-level mechanisms, to understand whether they can be implemented without sacrificing the scale and latency required for RAMCloud. Here are some possible features:
- Atomic updates that modify multiple objects on different servers.
- Secondary indexes.
- A graph-oriented data model.
- An SQL-like interface?
There are many questions to be answered, such as:
- How difficult is it to implement these features inside RAMCloud?
- What is the performance of these features?
- Does implementing these features affect the performance of other parts of RAMCloud?
- How expensive is it to implement these features outside RAMCloud, in client code?
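For atomic updates that span servers, one natural starting point is a two-phase commit over the participants: stage (prepare) every write, and apply them only if all prepares succeed. Below is a minimal sketch under that assumption; the `Server` and `atomic_update` names are hypothetical, and a real implementation would also need durable logging and failure handling.

```python
class Server:
    """Toy storage server: stages writes under a lock, then commits or aborts."""
    def __init__(self):
        self.store = {}
        self.locked = {}             # key -> staged value awaiting commit

    def prepare(self, key, value):
        if key in self.locked:
            return False             # conflicting transaction in progress
        self.locked[key] = value
        return True

    def commit(self, key):
        self.store[key] = self.locked.pop(key)

    def abort(self, key):
        self.locked.pop(key, None)

def atomic_update(writes):
    """writes: list of (server, key, value). Applies all of them or none."""
    prepared = []
    for server, key, value in writes:
        if server.prepare(key, value):
            prepared.append((server, key))
        else:
            for s, k in prepared:    # roll back everything staged so far
                s.abort(k)
            return False
    for server, key in prepared:
        server.commit(key)
    return True

s1, s2 = Server(), Server()
ok = atomic_update([(s1, "a", 1), (s2, "b", 2)])
```

Even this toy shows where the costs come from: an extra round trip per participant, and locks held across that round trip, both of which cut directly against RAMCloud's latency goals.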
Cluster management
RAMCloud currently has almost no facilities whatsoever for cluster management. Here are some of the important issues that need to be understood:
- How should tablets be assigned to servers when they are created?
- What usage information needs to be collected in order to understand whether a particular server is overloaded?
- What is the right way to reconfigure the system (e.g., splitting and/or moving tablets)?
- Does locality matter? E.g., are there some tablets that should be on the same server?
- Recovery tends to scatter tablets that used to be co-located; do they need to be collected together again?
- How do the configuration decisions affect performance?
Tail latency
Measure RAMCloud in action to understand how much variation there is in the latency of basic operations such as reads and writes (preliminary evidence suggests that tail latency is a significant issue). Then figure out what is causing the tail latency (this is likely to involve the Linux kernel as well as RAMCloud) and develop mechanisms to improve it.
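A first step in studying tail latency is to summarize measured samples by percentile rather than by mean, since stragglers vanish in averages. A small nearest-rank percentile sketch follows; the latency numbers are invented purely for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of the
    samples at or below it (p in (0, 100])."""
    s = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(s)))
    return s[rank - 1]

# Hypothetical trace: 990 fast reads at 5 us plus 10 stragglers at 100 us.
latencies = [5.0] * 990 + [100.0] * 10
median = percentile(latencies, 50)    # 5.0 -- the common case looks fine
p999 = percentile(latencies, 99.9)    # 100.0 -- the tail tells the real story
```

Note that even the 99th percentile of this trace is still 5.0; with only 1% stragglers, the problem is invisible until you look at p99.9, which is why tail measurements at high percentiles matter.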
Notifications
Sooner or later, every storage system needs a notification mechanism: some clients want to know when certain elements of the storage system are modified. For example, databases have triggers. Design and implement a notification mechanism for RAMCloud:
- The notification system could be either inside or outside RAMCloud. One possibility is to implement a general-purpose notification system, of which RAMCloud is simply a client.
- The notification system needs to be consistent with RAMCloud's goals of large-scale and low latency.
- Before doing this project, it's probably worth looking at the notification system we developed for Fiz.
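To make the design space concrete, here is a toy synchronous notification layer over a key-value store. The `NotifyingStore` class and its `watch`/`write` API are hypothetical; a real system would also have to choose delivery semantics (synchronous vs. asynchronous, at-most-once vs. at-least-once) and cope with slow or dead subscribers at scale.

```python
from collections import defaultdict

class NotifyingStore:
    """Toy key-value store that invokes registered callbacks on writes.

    Delivery here is synchronous and in-process; the hard problems
    (remote delivery, fan-out, failed subscribers) are deliberately omitted.
    """
    def __init__(self):
        self.store = {}
        self.watchers = defaultdict(list)   # key -> list of callbacks

    def watch(self, key, callback):
        """Register callback(key, value) to fire on writes to key."""
        self.watchers[key].append(callback)

    def write(self, key, value):
        self.store[key] = value
        for cb in self.watchers[key]:       # notify after the write is applied
            cb(key, value)

events = []
db = NotifyingStore()
db.watch("x", lambda k, v: events.append((k, v)))
db.write("x", 1)    # fires the watcher
db.write("y", 2)    # no watcher registered: no notification
```

The inside-vs.-outside question above shows up immediately here: if the watcher list lives inside the storage server, every write pays for the lookup; if it lives outside, the notifier needs its own consistent view of writes.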
Storing data on disk and flash
Storing all data in memory is expensive, but storing some of it on flash or disk raises a ton of questions:
- Is it possible to serve data from disk without impacting the performance of data served from memory? I.e., how much hardware (disks? CPUs?) must be dedicated to the disk-based objects?
- What's the latency you can expect out of disk/flash when the system is under load? How quickly do applications lose all benefit of having any objects in memory?
- Should metadata for all objects be kept in memory, even the disk-based ones? In that case, how big does an object have to be before there's any substantial savings from keeping it on disk?
- How does this compare to a volatile cache in front of a disk-based system?
- To what degree do you want both of these capabilities in one system vs. two independent systems (perhaps with a client library that can interact with both)?
- How would this change if the latency or IOPS of flash were different? (I assume disk will remain about the same over time.)
- Should the system decide which objects go where, or the application, or some mixture of the two?
- What are the characteristics of my application that would lead me to choose a disk-based system vs. a memory-based system vs. a hybrid?
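The metadata question above can be quantified with a little arithmetic: if per-object metadata (hash-table entry, key, disk location) stays in DRAM, moving an object's value to disk saves only the value bytes. The sketch below assumes, hypothetically, 64 bytes of in-memory metadata per object.

```python
def memory_savings_fraction(value_bytes, metadata_bytes=64):
    """Fraction of an object's DRAM footprint reclaimed by moving its value
    to disk while its metadata stays in memory.

    The 64-byte metadata figure is an assumption for illustration, not a
    measured RAMCloud number.
    """
    in_memory = metadata_bytes + value_bytes   # footprint if fully in DRAM
    on_disk = metadata_bytes                   # only metadata stays in DRAM
    return (in_memory - on_disk) / in_memory

# Small objects barely help; large ones do:
small = memory_savings_fraction(64)     # 0.5: half the footprint was metadata
large = memory_savings_fraction(4096)   # ~0.98: almost all of it reclaimed
```

Under this assumption, a 64-byte object reclaims only half its footprint when pushed to disk, while a 4 KB object reclaims about 98%, which suggests the break-even size is at least several times the metadata size.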