Component reliability questions:
1. DRAM

1a. DRAM soft errors

  • Data on DRAM error rates: between 1 and 10 FITs / Gbit [Charles Slayman, SUN, IRPS 2008]
    (1 FIT = 1 error per billion device hours).
  • With 64 GBytes of DRAM per server (512 Gbit) and 10K servers we obtain
    512 * 10K * (1 to 10) FITs total, i.e. about 5.1M to 51M FITs
    --> mean time to error roughly 20 to 200 hours (assumes all
    flips are important).
  • ECC will be necessary.
  • DRAM vendors are seeing peripheral logic upsets inside DRAM chips.

...

  • They also found that this peripheral logic error rate stays roughly constant
    on a per bit basis. This implies they will be significant contributors.
    Even if we take their contribution to be only 10%, that will still be significant.
    Traditional ECC doesn't do anything about these errors --
    we need to mix address with data to create parity checks for DETECTION.
    We need a quick study to see if error detection will be enough for such errors.
    Of course, that will require support for recovery / retry.
  • More questions about DRAM error protection: S/w techniques on commodity DRAM
    or special DRAM? (interesting techniques can be used)
  • Need for scrubbing (will be more relevant in the context of the following discussion on
    DRAM hard errors).
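
To make the "mix address with data" idea above concrete, here is a minimal toy sketch, not a proposal for the actual scheme. It assumes a single check bit per word, computed as the parity of the address XORed with the parity of the data. The names `parity`, `check_bit`, `ToyMemory` and the `decoded_address` parameter are all hypothetical, introduced only to model a peripheral-logic upset that makes the array return the word stored at a different location.

```python
def parity(x: int) -> int:
    """Return the XOR of all bits of a non-negative integer (0 or 1)."""
    return bin(x).count("1") & 1


def check_bit(address: int, data: int) -> int:
    """Check bit covering address AND data (assumed scheme, for illustration)."""
    return parity(address) ^ parity(data)


class ToyMemory:
    """Toy word store keeping one address-mixed check bit next to each word."""

    def __init__(self):
        self.cells = {}

    def write(self, address: int, data: int) -> None:
        # The check bit is computed with the address the writer intended.
        self.cells[address] = (data, check_bit(address, data))

    def read(self, address: int, decoded_address=None) -> int:
        # decoded_address models an upset in the address path: the array
        # silently returns the word stored at some other location.
        actual = address if decoded_address is None else decoded_address
        data, stored = self.cells[actual]
        # Recompute the check bit with the address the reader ASKED for.
        if check_bit(address, data) != stored:
            raise ValueError("address/data parity mismatch: retry needed")
        return data


if __name__ == "__main__":
    mem = ToyMemory()
    mem.write(0b1000, 0xAB)
    mem.write(0b1001, 0xCD)
    print(hex(mem.read(0b1000)))  # normal read passes the check
    try:
        mem.read(0b1000, decoded_address=0b1001)  # wrong row returned
    except ValueError as err:
        print("detected:", err)
```

A data-only parity bit would accept the second read, since the word at the wrong location is internally consistent. Note that one check bit only catches address errors that differ in an odd number of bits; a real design would need more check bits, which is part of what the quick study would have to settle.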

1b. DRAM hard errors: There is increasing concern about hard errors in DRAMs AND
the alignment of hard and soft errors. This concern was raised by IBM, including in
a 2008 IBM R&D paper by TJ Dell.

Why is Chipkill necessary? -- IBM data:

  • 7 fails per 100 servers with 32 MB mem. with parity over 3 yrs.
  • 9 fails per 100 servers with 1 GB mem. with SEC over 3 yrs.

Chipkill alone may not be enough in that case
and has to be backed up by either clever ECC or scrubbing (not sure scrubbing
will be enough).

According to the Dell paper, there is a significant
probability of DRAM chips producing hard errors (not a single
hard error, but the entire chip or most of it failing). If not repaired right away,
these can contribute significantly to the mem. failure rate (I THINK
the numbers below are for a 32 GB mem. subsystem but we need to
double-check the math):

Time to repair (months)    Memory failure rate adder (FITs)
--------------------------------------------------------------------------------
 1                           102
 6                           608
12                         1,207
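
Since the math here still needs double-checking, the FIT arithmetic in these notes can be replayed with a short script. This is a sketch under stated assumptions: it reproduces the 1a soft-error estimate (64 GBytes per server, 10K servers, 1 to 10 FITs / Gbit) and then applies the table's adders to the same 10K-server fleet, assuming one such memory subsystem per server. That last step is an assumption of this example, not something the notes or the Dell paper state. All function and constant names are illustrative.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per billion device hours


def mean_time_to_event_hours(total_fits: float) -> float:
    """Mean time between events, in hours, for an aggregate FIT rate."""
    return HOURS_PER_FIT_UNIT / total_fits


# Section 1a inputs, taken from the notes.
GBIT_PER_SERVER = 64 * 8  # 64 GBytes = 512 Gbit
SERVERS = 10_000


def fleet_soft_error_fits(fits_per_gbit: float) -> float:
    """Aggregate soft-error FIT rate over all DRAM in the fleet."""
    return GBIT_PER_SERVER * SERVERS * fits_per_gbit


# Table from section 1b: time to repair (months) -> failure rate adder (FITs).
REPAIR_ADDER_FITS = {1: 102, 6: 608, 12: 1207}


def fleet_hard_error_fits(adder_fits: float, subsystems: int = SERVERS) -> float:
    """ASSUMPTION: one memory subsystem per server, adders simply add up."""
    return adder_fits * subsystems


if __name__ == "__main__":
    for rate in (1, 10):
        hours = mean_time_to_event_hours(fleet_soft_error_fits(rate))
        print(f"soft errors at {rate} FIT/Gbit: one every {hours:.1f} h")
    for months, adder in REPAIR_ADDER_FITS.items():
        hours = mean_time_to_event_hours(fleet_hard_error_fits(adder))
        per_month = adder / months
        print(f"repair in {months} months: {per_month:.1f} FITs per month "
              f"of delay, fleet-wide one failure every {hours:.0f} h")
```

The soft-error lines come out at roughly 195 and 20 hours, matching the "20 to 200 hours" estimate in 1a. The table's adder grows almost linearly, at about 100 FITs per month of repair delay.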

...