Component reliability questions:
1. DRAM

1a. DRAM soft errors

1b. DRAM hard errors

2. Storage servers

3. Application servers

4. Networking substrate

System reliability questions:
I haven't considered questions such as power outages,
disasters, geographical diversity, etc. in detail -- reason: for such
causes, what is important is data replication -- detection is less of an
issue; the issue is recovery -- we can discuss that as part of this discussion.
However, this part is closely related to what we need to do for DRAM
(hard and soft) errors.
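As a back-of-the-envelope illustration of why replication, rather than detection, is what matters for these causes, here is a minimal Python sketch; the per-site loss probability is purely illustrative, and independence of site failures is an assumption:

    # With r geographically independent replicas of a data item and a per-site
    # probability p of losing the copy at that site (outage, disaster, etc.),
    # the chance that every copy is lost at once is roughly p**r.
    def prob_all_replicas_lost(p_site_loss: float, num_replicas: int) -> float:
        return p_site_loss ** num_replicas

    if __name__ == "__main__":
        p = 0.01  # illustrative per-site loss probability, not measured data
        for r in (1, 2, 3):
            print(f"{r} replica(s): P(all copies lost) = {prob_all_replicas_lost(p, r):.0e}")

Detection here is nearly free (the site is simply gone); all of the work is in keeping enough independent copies to recover from.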
I looked up a few recent papers on system failure causes and rates.
Here is some data:
1. "Understanding Failures in Petascale Computers" by Bianca Schroeder and Garth A. Gibson.
Data set collected during 1995-2005 at LANL:
22 HPC systems (4,750 machines, 24,101 processors) -- 18 are SMP-based
clusters with 2 to 4 processors per node (total 4,672 nodes, 15,101
processors); the remaining 4 are NUMA boxes with 128 to 256 processors
each (total 78 nodes, 9,000 processors).
An entry was recorded for any failure that occurred during the time period
and resulted in an application interruption or a node outage.
(Note: it seems they are not doing concurrent checking, so I'd assume
silent data corruption isn't included here.)
System failure causes covered: software failures, hardware failures,
failures due to operator error, network failures, and failures due to
environmental problems (e.g., power outages).
Here is some interesting data:
Cluster node outages: > 50% -- hardware failures; ~20% -- software causes;
15% -- unknown; rest: network, human, and environment causes (double-checked
by looking at the fraction of repair time attributed to these causes).
The number of failures per year per system varies between 20 and 1,100 -->
on average, 0.2 to 0.7 failures per year per processor (see the
back-of-the-envelope sketch below for what this implies at system scale).
(Note: as the paper suggests, these systems mainly rely on checkpointing and
don't do much about error detection.)
(There are some reports that Blue Gene systems have significantly lower
failure rates --> need to know details.)
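To see what these per-processor rates imply at scale, here is a minimal sketch; the 10,000-processor count is illustrative (not one of the LANL systems), and failures are assumed independent:

    # Rough conversion: per-processor failure rate -> expected system-level
    # mean time to interrupt (MTTI), assuming the system-wide rate is the sum
    # of independent per-processor rates.
    HOURS_PER_YEAR = 365 * 24

    def system_mtti_hours(failures_per_proc_per_year: float, num_procs: int) -> float:
        system_failures_per_year = failures_per_proc_per_year * num_procs
        return HOURS_PER_YEAR / system_failures_per_year

    if __name__ == "__main__":
        num_procs = 10_000  # illustrative count, not a specific LANL system
        for rate in (0.2, 0.7):  # per-processor failures/year range quoted above
            print(f"{rate}/proc/yr x {num_procs} procs: "
                  f"MTTI ~ {system_mtti_hours(rate, num_procs):.1f} hours")

At 10,000 processors this works out to roughly one interrupt every 1 to 4 hours, which is consistent with these systems leaning so heavily on checkpointing.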

Failures per month per TF:
  Cray XT3/XT4        10,880 CPUs    0.1 to 1
  IBM Power 5/6       10,240 CPUs    1.3
  Clusters AMD x86     8,000 CPUs    2.6 to 8
  BlueGene L/P       131,720 CPUs    0.01 to 0.03

(This is taken from: H. Simon, "Petascale computing in the U.S.", slides from
a presentation at the ACTS workshop, June 2006.
http://acts.nersc.gov/events/Workshop2006/slides/Simon.pdf)
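Since these rates are normalized per TF, turning them back into absolute numbers means multiplying by the machine's TF rating. A minimal sketch of that conversion; the 100 TF figure is purely illustrative and not the rating of any system listed above:

    # Convert a normalized rate (failures per month per TF) into an absolute
    # system-level rate and a rough mean time between failures.
    HOURS_PER_MONTH = 30 * 24

    def failures_per_month(rate_per_tf: float, system_tf: float) -> float:
        return rate_per_tf * system_tf

    if __name__ == "__main__":
        system_tf = 100.0  # purely illustrative peak rating, not from Simon's slides
        for label, rate in [("BlueGene-class (0.01/month/TF)", 0.01),
                            ("x86-cluster-class (8/month/TF)", 8.0)]:
            fpm = failures_per_month(rate, system_tf)
            print(f"{label}: {fpm:g} failures/month "
                  f"(~{HOURS_PER_MONTH / fpm:.1f} hours between failures)")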

Next paper:
Tahoori, Kaeli, and others: they looked at storage servers
(as far as I know, it's EMC).
Normalized failure rates:
                      System A   System B1   System B2   Total
  Hardware-related      1.91       2.27        7.25      2.19
  Power-related         0.18       0.19        0.50      0.19
  Software-related      2.41       4.44       18.12      3.48
  SEU-related           1.00       1.00        1.00      1.00
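Since the SEU-related row is 1.0 across the board, these numbers read as each cause's rate divided by the SEU (soft-error) rate for that system. A minimal sketch of that reading, with made-up raw counts chosen only to reproduce System A's ratios:

    # Normalize per-cause failure counts by the SEU-related count, so that
    # SEU-related == 1.0 and other causes are multiples of the soft-error rate.
    def normalize_to_seu(raw_counts: dict[str, float]) -> dict[str, float]:
        seu = raw_counts["SEU-related"]
        return {cause: round(count / seu, 2) for cause, count in raw_counts.items()}

    if __name__ == "__main__":
        # Hypothetical raw counts, chosen only to reproduce System A's ratios.
        raw = {"Hardware-related": 191, "Power-related": 18,
               "Software-related": 241, "SEU-related": 100}
        print(normalize_to_seu(raw))
        # -> {'Hardware-related': 1.91, 'Power-related': 0.18,
        #     'Software-related': 2.41, 'SEU-related': 1.0}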

Of course, there are several other papers (including Jim Gray's 1985 paper,
which discusses operator errors being significant; the Microsoft XP paper
discussing the importance of third-party drivers; etc.).