...

*2. Storage servers:* Assuming 10K storage servers --> will be interesting both for system building / modeling / measurement purposes and for seeing which protection techniques will be useful (or whether any will be necessary at all).
Even if we assume that each server contributes 100 FITs for SOFT ERRORS (transients) -- that number is not off, and it most likely already includes derating at the node level (the probability that a flip-flop error does not cause a system error; vendors generally account for this when they quote error rates).
Overall: 10K servers x 100 FITs = 10^6 FITs --> MTTF of 1,000 hours (1 FIT = 1 failure per 10^9 device-hours; quick arithmetic sketch below) --> this assumes full utilization and that all applications are highly critical, which probably won't be the case in real life.
Needs characterization of workload types, their criticality, and full-system derating (note: data replication doesn't solve this problem --> server errors are a different failure mode than the losses replication is meant to mask).
Plus, one has to worry about hard errors --> increasing with aging, etc. (even at 20-50 FITs per chip, hard errors add up across 10K servers).
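A minimal sketch of the back-of-the-envelope FIT-to-MTTF arithmetic above; the 100 FITs/server and 20-50 FITs/chip figures are the assumptions from these notes, not measured values:

# Back-of-the-envelope FIT -> MTTF arithmetic (figures are the assumptions above).
FIT_HOURS = 1e9                      # 1 FIT = 1 failure per 10^9 device-hours
SERVERS = 10_000
SOFT_FITS_PER_SERVER = 100           # assumed soft-error (transient) rate per server
HARD_FITS_PER_CHIP = 50              # upper end of the 20-50 FITs/chip guess

def mttf_hours(total_fits: float) -> float:
    """MTTF in hours for an aggregate FIT rate (assumes a constant failure rate)."""
    return FIT_HOURS / total_fits

print(mttf_hours(SERVERS * SOFT_FITS_PER_SERVER))   # 1,000 hours fleet-wide, soft errors
print(mttf_hours(SERVERS * HARD_FITS_PER_CHIP))     # 2,000 hours for hard errors alone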
Opportunities:
Criticality of applications will be key -- e.g., social networking sites may not care much about silent errors --> other data center apps (e.g., banking or commercial) will.
Protection techniques -- we will discuss a few (software only? hardware-assisted techniques are probably not an option for our COTS implementation; however, we should still investigate them).
Software techniques --> scrubbing alone won't work --> need error detection (are there application-level properties to exploit for end-to-end or time-redundancy-based checks? --> performance impact, energy impact?).
Also brings up the question of checkpointing and recovery support.
Are we going to have transaction semantics? -- can that help?
Can we classify transactions as "critical" vs. "non-critical" and rely on selective protection? (see the sketch below)
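One way to picture selective protection is time redundancy applied only to critical transactions: execute twice, compare, and flag a mismatch as a suspected silent error. The criticality flag, the function names, and the double-execution policy below are illustrative assumptions, not a design decision:

# Hypothetical sketch: selective protection via time redundancy.
# Assumes txn is deterministic and side-effect free (or replayed on a shadow copy).
from typing import Any, Callable

class SilentErrorSuspected(Exception):
    """Redundant executions disagreed -- possible soft error."""

def run_transaction(txn: Callable[[], Any], critical: bool) -> Any:
    if not critical:
        return txn()                  # non-critical: run once, accept the risk
    first, second = txn(), txn()      # critical: time redundancy (run twice)
    if first != second:
        # Detection only; recovery (retry, checkpoint rollback, a third run
        # to vote, ...) is a separate open question in these notes.
        raise SilentErrorSuspected(f"{first!r} != {second!r}")
    return first

# e.g., a banking-style update marked critical, a feed refresh not:
new_balance = run_transaction(lambda: 100 + 250, critical=True)

The open cost questions (performance and energy overhead of the second run, and who decides criticality) are the same whether the check is time redundancy or an end-to-end application-level property.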
For hard errors --> very thorough on-line self-test / on-line self-diagnostics can PREDICT failures (i.e., EARLY detection, even before errors appear) --> generally requires hardware support --> may not be available on our COTS parts --> may be an interesting experiment opportunity.
Brings up questions of self-repair / self-healing --> what if hard errors are DETECTED (instead of predicted) by periodic self-diagnostics --> implications for recovery? (see the sketch after this item)
Also relevant: interactions with power management, and the power overhead of error checking.
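A minimal sketch of how a periodic self-diagnostics loop could separate the two cases above -- proactive action on a predicted failure vs. recovery on a detected hard error. The test hook and the responses are hypothetical placeholders, since the actual mechanisms depend on what the COTS parts expose:

# Hypothetical periodic self-diagnostics loop; run_self_test(), migrate_work(),
# and trigger_recovery() are placeholders passed in by the caller, not real APIs.
import time

CHECK_INTERVAL_S = 3600          # e.g., hourly; coverage vs. power-overhead trade-off

def diagnostics_loop(run_self_test, migrate_work, trigger_recovery):
    while True:
        result = run_self_test()             # on-line self-test of one server/chip
        if result == "degrading":
            migrate_work()                   # PREDICTED failure: act before errors appear
        elif result == "failed":
            trigger_recovery()               # DETECTED hard error: recovery path needed
        time.sleep(CHECK_INTERVAL_S)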

*3. Application servers:* --> much lower reliability requirements? (social networking sites often use commodity app. servers and focus their protection on the storage servers?).
Need to address (discussion expected) --> criticality of the application --> per-app selective protection can be carried through the entire system (see our earlier discussion on storage servers).
Another issue: who is to blame for incorrect (criticality-wise) results caused by app. server errors?

...