Component reliability questions:

1. DRAM

1a. DRAM soft errors

1b. DRAM hard errors

2. Storage servers

3. Application servers

4. Networking substrate

System reliability questions:

I haven't considered questions such as power outages, disasters, and
geographical diversity in detail. The reason: for such causes, detection
is less of an issue; what matters is data replication and recovery. We
can cover that as part of this discussion. However, the recovery part is
closely related to what we need to do for DRAM (hard and soft) errors.

I looked up a few recent papers on the causes and rates of system failures.
Here is some data: