...

  • There is increasing concern about hard errors in DRAMs AND about the
    alignment of hard and soft errors. This concern is raised in an article by
    T. J. Dell of IBM in the 2008 IBM Journal of Research & Development.
  • Before getting into that, we should discuss Chipkill ECC for hard errors.
    Basic point: people are seeing DRAM hard errors that affect multiple bits (or a large
    portion) of a single DRAM chip. Chipkill ECC interleaves bits so that each chip
    contributes only a single bit to any given ECC word, so that ECC can correct a
    whole-chip hard error (see the sketch after this list).
    Why is Chipkill necessary? -- IBM data: 7 fails per 100 servers over 3 years
    with 32 MB memory and simple parity; 9 fails per 100 servers over 3 years
    with 1 GB memory and single-bit ECC.
  • According to the IBM paper, Chipkill alone may not be enough due to alignment of
    soft and hard errors; it has to be backed up with either clever ECC or scrubbing
    (not sure scrubbing will be enough). According to this paper, there is a significant
    probability of DRAM chips producing hard errors (not a single hard error but the entire
    chip or most of it). If not repaired right away, this can contribute significantly to the
    memory failure rate (I THINK the numbers below are for a 32 GB mem. subsystem, but we
    need to doublecheck the math -- see the quick FIT sanity check after this list).

    Time to repair (months)    Memory failure rate adder (FITs)
    -----------------------    --------------------------------
    1                          102
    6                          608
    12                         1,207
    • Possible solutions:
      • will scrubbing be enough? -- it depends
      • Redundant mem. modules? -- expensive?
      • If we know some DRAM chip is absolutely BAD, we can use erasure
        codes in conjunction with online testing / scrubbing (but it may not be
        that straightforward -- we need to think it through).
      • There may be RAMCloud-specific opportunities -- e.g., since we replicate
        data for a variety of reasons (redundancy to prevent large outages / data
        loss, performance), we should be able to play other tricks.
        But error DETECTION is crucial --> simple parity won't suffice
        for alignment reasons unless we take preventive action by redistributing
        data instead of waiting for repair. The numbers in the above table on time
        to repair become less important at that point.
  • There is a recent paper (to appear in SIGMETRICS 09) that also reports an
    increasing incidence of DRAM hard errors:
    DRAM Errors in the Wild: A Large-Scale Field Study, Bianca Schroeder,
    Eduardo Pinheiro, Wolf-Dietrich Weber, SIGMETRICS, 2009 (to appear).
    Here are some excerpts from the abstract:
    • They analyzed measurements of memory errors in a large fleet
      of commodity servers over a period of 2.5 years. The collected data
      covers multiple vendors, DRAM capacities and technologies, and
      comprises many millions of DIMM days. They observed
      DRAM error rates that are orders of magnitude higher than
      previously reported, with 25,000 to 70,000 errors per billion device
      hours per Mbit and more than 8% of DIMMs affected by errors per year.
      They have strong evidence that memory errors are dominated by hard
      errors, rather than soft errors.
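  • As a rough illustration of the Chipkill point above (a minimal sketch with a
    made-up chip/word geometry, not IBM's actual layout): if each ECC word takes
    exactly one bit from each chip, a whole-chip failure shows up as at most one
    bad bit per word, which SEC-DED over the word can correct.

    # Sketch of Chipkill-style bit interleaving (illustrative geometry only).
    NUM_CHIPS = 72          # e.g., 64 data bits + 8 check bits, one bit per chip

    def bit_location(word_index, bit_index):
        """Map bit `bit_index` of ECC word `word_index` to (chip, offset)."""
        return bit_index, word_index   # bit i of every word lives on chip i

    # If chip 17 dies, count how many bits of any single word are affected.
    dead_chip = 17
    for word in range(4):
        bad_bits = [b for b in range(NUM_CHIPS)
                    if bit_location(word, b)[0] == dead_chip]
        assert len(bad_bits) == 1      # one bad bit per word -> SEC-DED correctable
        print(f"word {word}: bad bit at position {bad_bits[0]} (correctable)")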

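  • Quick sanity check on the repair-time table above, using only the definition
    1 FIT = 1 failure per 10^9 device-hours (and assuming one memory subsystem per
    server -- our assumption, not verified against the paper):

    # Convert the FIT adders above into expected extra failures at fleet scale.
    HOURS_PER_3_YEARS = 3 * 365 * 24   # ~26,280 hours

    for repair_months, fit_adder in [(1, 102), (6, 608), (12, 1207)]:
        per_server = fit_adder * 1e-9 * HOURS_PER_3_YEARS
        print(f"repair time {repair_months:2d} mo: {fit_adder:5d} FITs -> "
              f"~{100 * per_server:.1f} extra failures per 100 servers over 3 yrs")

    This puts the 12-month number at roughly 3 extra failures per 100 servers over
    3 years, i.e., comparable in magnitude to the 7-9 fails per 100 servers quoted
    above.
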
...

  • Assuming 10K storage servers, it will be interesting, from system
    design / modeling / measurement perspectives, to see which
    protection techniques will be useful (or whether any will be necessary).
  • Even if we assume that each server contributes only 100 FITs for SOFT ERRORS
    (transients) -- a number that is not off, since it most likely already includes
    derating at the node level (the probability that a flip-flop error doesn't cause
    a system error -- generally taken into account when vendors quote error rates) --
    the overall rate is 10^6 FITs, i.e., an MTTF of 1,000 hrs. (see the worked
    conversion after this list). This assumes full usage and that all applications
    are highly critical --> which probably won't be the case in real life.
    • Needs characterization of workload types, their criticality, and full-system
      derating (note: data replication doesn't solve this problem --> errors in the
      servers themselves are a different problem from losing stored data).
  • Plus, one has to worry about hard errors, which are increasing (aging, etc.)
    -- even 20-50 FITs per chip for hard errors can add up across many chips.
  • Opportunities:
    • Criticality of applications will be key -- e.g., social networking sites
      may not care much about silent errors --> other data center apps (e.g.,
      banking or commercial) will.
    • Protection techniques -- we will discuss a few (software only?
      hardware-assisted techniques -- probably not an option for our COTS
      implementation; however, we should investigate those).
    • Software techniques --> scrubbing alone won't work --> need error detection.
    • Error detection: are there any application-level properties we can exploit
      for end-to-end or time-redundancy-based checks? --> performance impact,
      energy impact? (A minimal scrubbing-with-checksums sketch follows this list.)
    • Also brings up the question of checkpointing / recovery support.
      Are we going to have transaction semantics? -- Can that help?
    • Can we classify transactions as "critical" vs. "non-critical" and
      rely on selective protection?
    • For hard errors --> very thorough on-line self-test / on-line self-diagnostics
      can PREDICT failures (i.e., EARLY detection even before errors appear).
      This generally requires hardware support, which may or may not be available
      on our COTS parts. This will provide some interesting experiment opportunities.
    • Brings up questions of self-repair / self-healing --> what if hard errors are
      DETECTED (instead of predicted) by periodic self-diagnostics --> implications
      for recovery?
    • Also relevant: interactions with power management, and the power overhead
      of error checking.
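  • A worked version of the FIT arithmetic above (a minimal sketch; the
    100-FITs-per-server and 10K-server figures are the assumptions stated in these
    notes, and the chips-per-server count below is made up for illustration):

    # FIT -> MTTF arithmetic; 1 FIT = 1 failure per 10^9 device-hours.
    SERVERS = 10_000
    FIT_PER_SERVER_SOFT = 100      # assumed soft-error (transient) rate per server

    fleet_fit = SERVERS * FIT_PER_SERVER_SOFT        # 10^6 FITs for the fleet
    print(f"soft errors: {fleet_fit:.0f} FITs -> MTTF ~{1e9 / fleet_fit:.0f} hours")

    # Hard errors add up too: 20-50 FITs per DRAM chip, times chips per server.
    CHIPS_PER_SERVER = 144         # hypothetical DIMM population
    for fit_per_chip in (20, 50):
        hard_fit = SERVERS * CHIPS_PER_SERVER * fit_per_chip
        print(f"{fit_per_chip} FITs/chip -> fleet hard-error MTTF "
              f"~{1e9 / hard_fit:.1f} hours")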

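  • On the software-techniques / error-detection bullets above: a minimal sketch of
    application-level scrubbing backed by end-to-end checksums. The object layout
    and the fetch-from-replica hook are hypothetical placeholders, not RAMCloud's
    actual design; the point is that scrubbing only catches corruption at the next
    pass, so detection on the read path still matters.

    import zlib

    class StoredObject:
        def __init__(self, key, data):
            self.key = key
            self.data = bytearray(data)
            self.checksum = zlib.crc32(self.data)    # computed at write time

    def scrub(objects, fetch_from_replica):
        """Re-verify every object; repair from a replica on a checksum mismatch."""
        repaired = []
        for obj in objects:
            if zlib.crc32(obj.data) != obj.checksum:               # detection
                obj.data = bytearray(fetch_from_replica(obj.key))  # repair
                obj.checksum = zlib.crc32(obj.data)
                repaired.append(obj.key)
        return repaired

    # Toy usage: flip some bits in one object, then run a scrub pass.
    store = [StoredObject(k, f"value-{k}".encode()) for k in range(3)]
    store[1].data[0] ^= 0xFF
    replicas = {k: f"value-{k}".encode() for k in range(3)}
    print("repaired:", scrub(store, lambda k: replicas[k]))        # -> repaired: [1]
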
3. Application servers

--> much more relaxed reliability requirements? (e.g., social
networking sites often use commodity app. servers and focus their
reliability effort on the storage servers?).
Need to address -- criticality of the app: selective protection -->
can be carried through the entire system (see our previous discussion
on storage servers); a minimal sketch of criticality tagging follows below.
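
A minimal sketch of carrying a per-request criticality tag end-to-end for
selective protection (the tag values and the verify-on-read policy are
hypothetical, purely to make the "critical" vs. "non-critical" idea concrete):

    import zlib
    from dataclasses import dataclass

    @dataclass
    class Request:
        key: str
        criticality: str    # "critical" or "non-critical", set by the app server

    def handle_read(req, stored_data, stored_crc):
        """Storage-server read path: verify checksums only for critical requests."""
        if req.criticality == "critical" and zlib.crc32(stored_data) != stored_crc:
            raise IOError(f"detected corruption on critical read of {req.key}")
        return stored_data   # non-critical reads skip the (costly) verification

    data = b"account-balance"
    print(handle_read(Request("acct42", "critical"), data, zlib.crc32(data)))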
Another issue: who is to blame for incorrect (criticality-wise) results
due to app. server errors?

...