Component reliability questions:
1. DRAM

1a. DRAM soft errors --> between 1-10 FITs / Gbit [Charles Slayman, SUN, IRPS 2008]
(1 FIT = 1 error per billion device hours)
64 GBytes / DRAM server (= 512 Gbit) --> for 10K servers we obtain 512 * 10K * (1 to 10) FITs
total --> Mean time to errors roughly 20 to 200 hours (assumes all
flips are important).
ECC necessary.
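A minimal sketch, in Python, of the back-of-the-envelope math above (all inputs are the assumed numbers from the notes: 1-10 FIT/Gbit, 64 GB of DRAM per server, 10K servers; 1 FIT = 1 error per 10^9 device hours):

    GBIT_PER_SERVER = 64 * 8        # 64 GBytes = 512 Gbit of DRAM per server
    SERVERS = 10_000
    for fit_per_gbit in (1, 10):    # Slayman's 1-10 FIT/Gbit soft-error range
        fleet_fit = GBIT_PER_SERVER * SERVERS * fit_per_gbit   # errors per 1e9 device hours
        mtte_hours = 1e9 / fleet_fit                           # fleet mean time to error
        print(f"{fit_per_gbit} FIT/Gbit -> {fleet_fit:.3g} FIT total, MTTE ~ {mtte_hours:.0f} h")

This reproduces the roughly 20 to 200 hour range quoted above.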
DRAM vendors are seeing peripheral logic upsets inside DRAM chips,
and that error rate stays roughly constant on a per-bit basis --> they will be
significant contributors --> even if I take their contribution to be only 10%,
it is still significant --> traditional ECC doesn't do anything about these errors --
need to mix the address with the data (sketched below) --> need a quick study to see
if error detection will be enough --> needs support for recovery / retry.
More questions: S/w techniques on commodity DRAM or special DRAM?
(interesting techniques available)
Scrubbing (relevant for the next discussion on hard errors).
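A rough illustration of the "mix address with data" idea (only a sketch, not any vendor's scheme; the CRC and the 8-byte address packing are arbitrary choices): the check word covers the address as well as the data, so an address-path upset that returns correct data from the wrong location is still detected.

    import zlib

    def make_check(addr: int, data: bytes) -> int:
        # Check word stored alongside `data`; it covers the address too.
        return zlib.crc32(addr.to_bytes(8, "little") + data)

    def verify(addr: int, data: bytes, stored: int) -> bool:
        # Recomputed on read; a mismatch flags data OR address-path corruption.
        return make_check(addr, data) == stored

    c = make_check(0x1000, b"payload")
    assert verify(0x1000, b"payload", c)          # normal read passes
    assert not verify(0x2000, b"payload", c)      # misdirected read is caught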

...

Solutions: scrubbing enough? -- it depends
Redundant mem. modules? -- expensive?
If we know some DRAM chip is BAD, we can use erasure
codes (but it may not be that straightforward -- we need to think).
There may be several other opportunities -- e.g., if we replicate the
data for a variety of reasons -- redundancy to prevent large outages / data
loss, performance -- then we should be able to play other tricks -->
But error DETECTION should be crucial --> (simple parity won't suffice
for alignment reasons unless we take preventive actions by redistributing
data instead of waiting for repair) --> the numbers in the above table
become less important at that point.
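A minimal sketch of the erasure-code idea (XOR parity is assumed here only because it is the simplest possible code; a real system would likely use something stronger): once the BAD chip/module has been identified, its contents can be rebuilt from the survivors -- which is exactly why detection, not just correction, is the crucial piece.

    from functools import reduce

    def xor_parity(blocks):
        # Parity block over equal-length data blocks.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    def reconstruct(blocks, parity, bad_index):
        # Rebuild an ERASED block: we must already know which block is bad.
        survivors = [b for i, b in enumerate(blocks) if i != bad_index]
        return xor_parity(survivors + [parity])

    data = [b"AAAA", b"BBBB", b"CCCC"]
    p = xor_parity(data)
    assert reconstruct(data, p, 1) == b"BBBB"     # lost block recovered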
There is a recent paper at SIGMETRICS 09
that discusses these hard errors as well --
it seems they are also seeing increasing incidence of DRAM hard errors.
(DRAM Errors in the Wild: A Large-Scale Field Study, Bianca
Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber, SIGMETRICS 2009, to appear.)
Here are some excerpts from the abstract:
They analyzed measurements of memory errors in a large fleet
of commodity servers over a period of 2.5 years. The collected data
covers multiple vendors, DRAM capacities and technologies, and
comprises many millions of DIMM days. They observed
DRAM error rates that are orders of magnitude higher than
previously reported, with 25,000 to 70,000 errors per billion device
hours per Mbit and more than 8% of DIMMs affected by errors per year.
They have strong evidence that memory errors are dominated by hard
errors, rather than soft errors.
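Putting the field-measured numbers next to the earlier estimate (a quick, assumption-laden comparison: the field counts include hard errors and repeat offenders, so they are not directly comparable to a pure soft-error FIT):

    for field_fit_per_mbit in (25_000, 70_000):      # Schroeder et al., SIGMETRICS 2009
        fit_per_gbit = field_fit_per_mbit * 1024     # 1 Gbit = 1024 Mbit
        print(f"{field_fit_per_mbit} FIT/Mbit ~= {fit_per_gbit/1e6:.0f}M FIT/Gbit "
              f"vs. the 1-10 FIT/Gbit soft-error estimate above")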

2. Storage servers

Assuming 10K storage servers --> this will be interesting both for system
building / modeling / measurement purposes and to see which
protection techniques will be useful (or whether any will be necessary).
Even if we assume that each server contributes 100 FITs for SOFT ERRORS (transients)
(the number is not off, and most likely it already takes into account derating at the node level,
i.e., the probability that a flip-flop error doesn't cause a system error -- this is generally factored in when vendors quote error rates):
Overall: 10^6 FITs --> MTTF of 1,000 hrs. This assumes full usage and
all applications being highly critical --> which probably won't be the case in real life.
Needs characterization of workload types, their criticality, and a full system
derating (note: data replication doesn't solve this problem --> server errors
different)
Plus, one has to worry about hard errors, which increase with aging, etc.
(even at 20-50 FITs per chip for hard errors, they can add up).
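A sketch of the aggregate numbers (assumed inputs from above: 10K servers at 100 FIT each for transients; the 10% "critical fraction" is purely a placeholder for the workload/criticality characterization called for above, not a measured value):

    SERVERS = 10_000
    FIT_PER_SERVER = 100                       # soft-error FIT per server, already derated
    fleet_fit = SERVERS * FIT_PER_SERVER       # 1e6 FIT
    print("fleet soft-error MTTF ~", 1e9 / fleet_fit, "hours")          # ~1000 h

    critical_fraction = 0.10                   # hypothetical share of work that is critical
    print("MTTF for errors that matter ~", 1e9 / (fleet_fit * critical_fraction), "hours")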
Opportunities:
Criticality of applications will be key -- e.g., social networking sites
may not care much for silent errors --> other data center apps (e.g., banking
or commercial) will.
Protection techniques -- we will discuss a few (software-only?
hardware-assisted techniques -- probably not an option for our COTS implementation;
however, we should investigate those).
Software techniques --> scrubbing alone
won't work --> error detection (any application-level properties for end-to-end
or time-redundancy-based checks?) --> performance impact, energy impact?
Also, brings up the question of:
checkpointing, recovery support?
Are we going to have transaction semantics? -- Can that help?
Can we classify transactions as "critical" vs. "non-critical" and
rely on selective protection?
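An illustrative sketch only (no real transaction system is assumed; re-execute-and-compare is just one form of time redundancy): "critical" transactions pay for a second execution and a comparison before they are accepted, while "non-critical" ones skip the check to save time and energy.

    def run_selective(txn, critical: bool):
        # txn must be deterministic for the comparison to be meaningful.
        result = txn()
        if critical:
            if txn() != result:                 # time redundancy: re-execute and compare
                raise RuntimeError("mismatch -- retry or fall back to a checkpoint")
        return result

    run_selective(lambda: 100 - 42, critical=True)        # banking-style update: protected
    run_selective(lambda: "page render", critical=False)  # best-effort work: unprotected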
For hard errors --> very thorough on-line self-test / on-line self-diagnostics
can PREDICT failures (i.e., EARLY detection even before errors appear);
this generally requires hardware support, which may not
be available on our COTS parts --> may offer interesting experiment opportunities.
Brings up questions of self-repair / self-healing --> what if hard errors are DETECTED
(instead of predicted) by periodic self-diagnostics --> implications for recovery?
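A software-only sketch of the detection path (purely illustrative: the regions, the tiny pattern test, and the "retire" callback are all stand-ins; real early prediction would need hardware hooks we may not have on COTS parts):

    def pattern_test(buf: bytearray) -> bool:
        # Write/read-back a few patterns over a spare or idle region (data is destroyed,
        # so a real scrubber would test a region only after migrating its contents).
        for pattern in (0x00, 0xFF, 0xAA, 0x55):
            for i in range(len(buf)):
                buf[i] = pattern
            if any(b != pattern for b in buf):
                return False
        return True

    def diagnostic_pass(regions: dict, on_bad) -> None:
        # Periodic pass; a failing region triggers data redistribution / retirement.
        for name, buf in regions.items():
            if not pattern_test(buf):
                on_bad(name)

    regions = {"bank0": bytearray(4096), "bank1": bytearray(4096)}
    diagnostic_pass(regions, on_bad=lambda name: print("retire and migrate", name))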
Also, relevant: interactions with power management, power overhead
of error checking.

3. Application servers

--> perhaps much more relaxed reliability requirements? (social
networking sites often use commodity app servers and focus their reliability effort on the storage
servers?).
Need to address the criticality of the app: selective protection -->
it can be carried through the entire system (see our previous discussion
on storage servers).
Another issue: who is to blame for incorrect (criticality-wise) results
due to app server errors?

4. Networking substrate:

I missed the last part of the discussion
on whether or not we need a custom networking substrate to meet our
latency objectives -- if we do need custom substrates,
that can provide opportunities to do things from a reliability standpoint.
If not, one must analyze the error rates of the switches as well.
Some vendors thought that their error rates would be really
low because CRC checks provided end-to-end protection, only to
find later that the CRC wasn't done right to protect against errors inside the switches
themselves. We can project numbers similar to our earlier discussions
for these as well.
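A sketch of a genuinely end-to-end check (assumed framing; the point is that link-level CRCs recomputed hop by hop do not cover corruption introduced inside a switch, so the check must be generated by the sender and verified only by the final receiver):

    import struct, zlib

    def frame(payload: bytes) -> bytes:
        # Sender appends a CRC that no intermediate hop recomputes.
        return payload + struct.pack("<I", zlib.crc32(payload))

    def deliver(msg: bytes) -> bytes:
        payload, crc = msg[:-4], struct.unpack("<I", msg[-4:])[0]
        if zlib.crc32(payload) != crc:
            raise IOError("end-to-end check failed: corrupted in transit")
        return payload

    msg = bytearray(frame(b"block 42"))
    deliver(bytes(msg))                       # clean path: passes
    msg[0] ^= 0x01                            # bit flip "inside a switch"
    try:
        deliver(bytes(msg))
    except IOError as e:
        print(e)                              # corruption is caught at the receiver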

...

System              CPUs      Failures per month per TF
Cray XT3/XT4        10,880    0.1 to 1
IBM Power 5/6       10,240    1.3
Clusters (AMD x86)   8,000    2.6 to 8
BlueGene L/P       131,720    0.01 to 0.03

(this is taken from: H. Simon, "Petascale computing in the U.S.", slides from a presentation at the ACTS workshop,
http://acts.nersc.gov/events/Workshop2006/slides/Simon.pdf, June 2006).
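To turn "failures per month per TF" into a system MTBF you also need the machine's TF rating, which the table above does not give; the 100 TF below is a purely hypothetical size used for illustration:

    HOURS_PER_MONTH = 730

    def system_mtbf_hours(failures_per_month_per_tf: float, system_tf: float) -> float:
        return HOURS_PER_MONTH / (failures_per_month_per_tf * system_tf)

    for rate in (0.1, 1.0, 2.6, 8.0):          # values from the table above
        print(f"{rate}/month/TF at a hypothetical 100 TF -> "
              f"MTBF ~ {system_mtbf_hours(rate, 100):.1f} h")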

Next paper:
Tahoori, Kaeli & others: they looked at storage servers
(so far as I know it's EMC).
Normalized failure rates:
Failure class      System A   System B1   System B2   Total
Hardware-related   1.91       2.27        7.25        2.19
Power-related      0.18       0.19        0.5         0.19
Software-related   2.41       4.44        18.12       3.48
SEU-related        1.0        1.0         1.0         1.0

...