Component reliability questions:

1. DRAM

1a. DRAM soft errors

...

  • There is increasing concern about hard errors in DRAMs AND about the
    alignment of hard and soft errors. This concern was raised in an
    article by T. J. Dell of IBM in the 2008 IBM Journal of R&D.
  • Before getting into that, we should discuss Chipkill ECC for hard errors.
    Basic point: people are seeing DRAM hard errors that affect multiple bits (or a large
    portion) of a single DRAM chip. Chipkill ECC interleaves bits so that each chip
    contributes only a single bit to any given word, which lets ordinary ECC correct
    a whole-chip hard error (see the sketch after this list).
    Why is Chipkill necessary? -- IBM data: 7 fails per 100 servers over 3 years
    with 32 MB memory and simple parity; 9 fails per 100 servers over 3 years
    with 1 GB memory and single-bit ECC.
  • According to the IBM paper, Chipkill alone may not be enough due to the alignment of
    soft and hard errors; it has to be backed up using either cleverer ECC or scrubbing
    (not sure scrubbing will be enough). According to this paper, there is a significant
    probability of a DRAM chip producing hard errors (not a single hard error but the entire chip
    or most of it). If not repaired right away, this can contribute significantly to the memory
    failure rate (I THINK the numbers below are for a 32 GB memory subsystem, but we need to
    double-check the math; see the worked numbers after this list):

    Time to repair (months)     Memory failure rate adder (FITs)
    -----------------------     --------------------------------
    1                           102
    6                           608
    12                          1,207
    • Possible solutions:
      • Will scrubbing be enough? -- it depends.
      • Redundant memory modules? -- expensive?
      • If we know some DRAM chip is absolutely BAD, we can use erasure
        codes in conjunction with online testing / scrubbing (but it may not be
        that straightforward -- we need to think it through).
      • There may be RAMCloud-specific opportunities -- e.g., since we already replicate
        data for a variety of reasons (redundancy to prevent large outages / data
        loss, performance), we should be able to play other tricks.
        But error DETECTION is crucial: simple parity won't suffice,
        for alignment reasons, unless we take preventive action by redistributing
        data instead of waiting for repair (see the detection/scrubbing sketch after
        this list). The numbers in the above table on time to repair become less
        important at that point.
  • There is a recent paper (to appear in SIGMETRICS 09) that discusses
    DRAM hard errors as well -- they are also seeing an increasing incidence of DRAM hard errors:
    DRAM Errors in the Wild: A Large-Scale Field Study, Bianca
    Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber, SIGMETRICS 2009 (to appear).
    Here is a summary of the abstract:
    • They analyzed measurements of memory errors in a large fleet
      of commodity servers over a period of 2.5 years. The collected data
      covers multiple vendors, DRAM capacities and technologies, and
      comprises many millions of DIMM days. They observed
      DRAM error rates that are orders of magnitude higher than
      previously reported, with 25,000 to 70,000 errors per billion device
      hours per Mbit and more than 8% of DIMMs affected by errors per year.
      They have strong evidence that memory errors are dominated by hard
      errors rather than soft errors.
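
To make the Chipkill point above concrete, here is a minimal sketch (Python, toy
parameters I chose for illustration -- 64 chips, one bit per chip per ECC word; not
RAMCloud code) showing why a whole-chip hard failure becomes at most one bad bit per
word, which ordinary single-error-correcting ECC can fix:

    import random

    # Toy Chipkill-style interleaving: ECC word w takes exactly one bit from
    # each of NUM_CHIPS chips, so a dead chip corrupts at most one bit per word.
    NUM_CHIPS = 64          # assumed: one chip per bit position of the word
    WORDS_PER_CHIP = 1024   # assumed: number of ECC words per chip column

    # memory[chip][w] is the single bit that `chip` contributes to ECC word w.
    memory = [[random.randint(0, 1) for _ in range(WORDS_PER_CHIP)]
              for _ in range(NUM_CHIPS)]

    def read_word(w):
        """Assemble ECC word w by taking one bit from each chip."""
        return [memory[chip][w] for chip in range(NUM_CHIPS)]

    golden = [read_word(w) for w in range(WORDS_PER_CHIP)]

    # Simulate a whole-chip hard failure: one chip now returns garbage.
    dead_chip = random.randrange(NUM_CHIPS)
    memory[dead_chip] = [random.randint(0, 1) for _ in range(WORDS_PER_CHIP)]

    # Every word sees at most one flipped bit -- within reach of SEC-DED ECC.
    errors_per_word = [sum(a != b for a, b in zip(golden[w], read_word(w)))
                       for w in range(WORDS_PER_CHIP)]
    assert max(errors_per_word) <= 1
    print("worst-case bit errors in any single word:", max(errors_per_word))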
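
As a sanity check on the repair-time table and the SIGMETRICS error rates above,
here is a small back-of-the-envelope calculation (Python). The conversion uses
1 FIT = 1 failure per 10^9 device-hours; the 1 GB (8,192 Mbit) DIMM size is my
assumption for illustration, not a number from either paper:

    HOURS_PER_YEAR = 8766

    # 1) What a memory failure rate "adder" in FITs means for one subsystem:
    #    expected extra failures per year = FITs * hours_per_year / 1e9.
    for months, fit_adder in [(1, 102), (6, 608), (12, 1207)]:
        extra = fit_adder * HOURS_PER_YEAR / 1e9
        print(f"repair within {months:2d} months: +{fit_adder:5d} FITs "
              f"~= {extra:.4f} extra failures per subsystem-year")

    # 2) Schroeder et al. report 25,000-70,000 errors per 10^9 device-hours
    #    per Mbit.  For an assumed 1 GB DIMM (8,192 Mbit):
    DIMM_MBITS = 8 * 1024
    for rate_per_mbit in (25_000, 70_000):
        per_dimm_year = rate_per_mbit * DIMM_MBITS * HOURS_PER_YEAR / 1e9
        print(f"{rate_per_mbit:,} errors/1e9 hrs/Mbit -> "
              f"~{per_dimm_year:,.0f} errors per DIMM per year")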
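
The "RAMCloud-specific opportunities" bullet boils down to: use strong checksums
for detection (parity alone won't do) and, instead of waiting for a DIMM repair,
rebuild affected data from a good replica. Here is a minimal sketch of that idea;
all names (Segment, scrub_segment, etc.) are hypothetical illustrations, not
RAMCloud interfaces:

    import zlib
    from dataclasses import dataclass

    @dataclass
    class Segment:
        data: bytes
        checksum: int        # checksum recorded when the segment was written

    def make_segment(data: bytes) -> Segment:
        return Segment(data, zlib.crc32(data))

    def scrub_segment(primary: Segment, replicas: list) -> Segment:
        """Re-read a segment periodically; on checksum mismatch, rebuild the
        primary from the first replica whose contents still verify."""
        if zlib.crc32(primary.data) == primary.checksum:
            return primary                    # data still good
        for r in replicas:
            if zlib.crc32(r.data) == r.checksum:
                return make_segment(r.data)   # proactive re-replication
        raise RuntimeError("all copies corrupted -- data loss")

    # Example: a DRAM hard error corrupts the primary; the scrubber repairs it.
    primary = make_segment(b"hello ramcloud")
    replicas = [make_segment(b"hello ramcloud")]
    primary.data = b"hellX ramcloud"          # simulated corruption
    primary = scrub_segment(primary, replicas)
    assert primary.data == b"hello ramcloud"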

...

System reliability questions:

I haven't considered questions such as power outages,
disasters, geographical diversity, etc. in detail. Reason: for such
causes, what matters is data replication -- detection is less of an
issue; the issue is recovery -- and we can cover that as part of this discussion.
However, this part is closely related to what we need to do for DRAM
(hard and soft) errors.

I looked up a few recent papers on causes of system failure rates.
Here is some data:

  • Understanding Failures in Petascale Computers,
    by Bianca Schroeder and Garth A. Gibson.
    Data set collected during 1995-2005 at LANL:
    • 22 HPC systems (4,750 machines, 24,101 processors):
      18 SMP-based clusters with 2 to 4 processors per
      node (4,672 nodes, 15,101 processors in total); the remaining 4 are NUMA
      boxes with 128 to 256 processors each (78 nodes, 9,000 processors in total).
    • There is an entry for any failure that occurred during the time period
      and resulted in an application interruption or a node outage.
      (Note: it seems they are not doing concurrent checking, so I'd assume
      silent data corruption isn't included here.)
    • System failure causes covered: software failures, hardware
      failures, failures due to operator error, network failures,
      and failures due to environmental problems (e.g., power outages).

      ...

          • Cluster node outages: > 50% from hardware failures; ~20% from software;
            ~15% unknown; the rest from network, human, and environmental causes (double-checked
            by looking at the fraction of repair time attributed to these causes).
          • The number of failures per year varies between 20 and 1,100 per
            system --> on average roughly 0.2 to 0.7 failures per processor per year.
          • (Note: as the paper suggests, their systems mainly rely on checkpointing;
            they don't do much about error detection.)
          • (There has been some reported data that Blue Gene systems have
            significantly lower failure rates --> need to know the details.)
      • Some data from the HPCS community (H. Simon, Petascale Computing
        in the U.S., slides from a presentation at the ACTS workshop,
        http://acts.nersc.gov/events/Workshop2006/slides/Simon.pdf, June 2006);
        see the conversion after this list:

        System               CPUs       Failures per month per TF
        ------               ----       -------------------------
        Cray XT3/XT4         10,880     0.1 to 1
        IBM Power 5/6        10,240     1.3
        Clusters AMD x86      8,000     2.6 to 8
        BlueGene L/P        131,720     0.01 to 0.03

      • Next paper: Tahoori, Kaeli, and others.
        • They looked at storage servers (probably EMC).
        • Normalized failure rates (SEU-related = 1.0):

          Cause               System A    System B1    System B2    Total
          -----               --------    ---------    ---------    -----
          Hardware-related    1.91        2.27         7.25         2.19
          Power-related       0.18        0.19         0.5          0.19
          Software-related    2.41        4.44         18.12        3.48
          SEU-related         1.0         1.0          1.0          1.0
      • Of course there are several other papers (including Jim Gray's 1985
        paper, which discusses operator errors being significant, the Microsoft
        XP paper on the importance of third-party drivers, etc.).
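
For a rough feel of what "failures per month per TF" means at cluster scale, here
is a small conversion (Python). The 50 TF total (e.g., 1,000 nodes at an assumed
0.05 TF/node) is an illustrative assumption, not a number from Simon's slides:

    HOURS_PER_MONTH = 730

    def cluster_mtbf(failures_per_month_per_tf, total_tf):
        """Return (failures per month, MTBF in hours) for a cluster of total_tf."""
        failures_per_month = failures_per_month_per_tf * total_tf
        return failures_per_month, HOURS_PER_MONTH / failures_per_month

    # Rates taken from the table above (low/high ends where a range was given).
    for label, rate in [("Cray XT3/XT4 (low end)", 0.1),
                        ("IBM Power 5/6", 1.3),
                        ("AMD x86 clusters (high end)", 8.0),
                        ("BlueGene L/P (low end)", 0.01)]:
        f, mtbf = cluster_mtbf(rate, total_tf=50)
        print(f"{label:28s}: {f:6.1f} failures/month, MTBF ~ {mtbf:7.1f} hours")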