Component reliability questions:
1. DRAM
1a. DRAM soft errors
- Data on DRAM error rates: between 1 and 10 FITs / Gbit [Charles Slayman, SUN, IRPS 2008]
(1 FIT = 1 error per billion device hours)
- With 64 GBytes (512 Gbit) of DRAM per server and 10K servers, we obtain 512 * 10K * (1 to 10) FITs
total --> mean time to error of roughly 20 to 200 hours (assumes all
flips are important).
- ECC will be necessary.
- DRAM vendors are seeing peripheral logic upsets inside DRAM chips.
They also found that this peripheral logic error rate stays roughly constant
on a per-bit basis. This implies they will be significant contributors.
Even if we take their contribution to be only 10%, that will still be significant.
Traditional ECC doesn't do anything about these errors --
we need to mix address with data to create parity checks for DETECTION.
We need a quick study to see if error detection will be enough for such errors.
Of course, that will require support for recovery / retry.
- More questions about DRAM error protection: s/w techniques on commodity DRAM
or special DRAM? (interesting techniques can be used)
- Need for scrubbing (will be more relevant in the context of the following discussion on
DRAM hard errors).
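
The idea above of mixing the address into the check bits can be sketched as follows. This is only a toy illustration, not a proposal for the actual code: the use of CRC-32 as the check, the 4-byte address width, and the names `check_code`, `ToyMemory`, and `decode_fault_to` are all assumptions made for the example.

```python
import zlib


class AddressCheckError(Exception):
    """Raised when the stored check does not match (address, data)."""


def check_code(address: int, data: bytes) -> int:
    # Hypothetical scheme: the check covers the address AND the data,
    # so data that is internally consistent but fetched from the wrong
    # location no longer matches.
    return zlib.crc32(address.to_bytes(4, "little") + data)


class ToyMemory:
    """Toy model: one (data, check) pair per address; not a real DRAM model."""

    def __init__(self):
        self.cells = {}

    def write(self, address: int, data: bytes) -> None:
        self.cells[address] = (data, check_code(address, data))

    def read(self, address: int, decode_fault_to=None) -> bytes:
        # decode_fault_to simulates a peripheral-logic upset: the chip
        # silently returns the contents of a different address.
        actual = address if decode_fault_to is None else decode_fault_to
        data, stored_check = self.cells[actual]
        if check_code(address, data) != stored_check:
            raise AddressCheckError(f"check failed for address {address:#x}")
        return data


mem = ToyMemory()
mem.write(0x1000, b"alpha")
mem.write(0x2000, b"bravo")
```

A normal `mem.read(0x1000)` returns the data, while `mem.read(0x1000, decode_fault_to=0x2000)` raises `AddressCheckError`. A data-only code would accept the second read, because the word stored at 0x2000 is itself consistent. This gives detection only, which is why recovery / retry support is still needed.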
1b. DRAM hard errors
- There is increasing concern about hard errors in DRAMs AND
alignment of hard and soft errors. This concern has been raised by an
article by TJ Dell of IBM in the 2008 IBM Journal of R&D.
- Before getting into that, we should discuss Chipkill ECC for hard errors.
Basic point: people are seeing DRAM hard errors that affect multiple bits (or a large
portion) of a single DRAM chip. Chipkill ECC interleaves the data so that each chip
contributes only a single bit to any given ECC word, so that ECC can correct hard errors.
Why is Chipkill necessary? -- IBM data:
7 fails per 100 servers with 32 MB mem. with simple parity over 3 yrs.,
9 fails per 100 servers with 1 GB mem. with single-bit ECC over 3 yrs.
- According to the IBM paper, Chipkill alone may not be enough due to alignment of
soft and hard errors; it has to be backed up using either clever ECC or scrubbing
(not sure scrubbing will be enough). According to this paper, there is a significant
probability of DRAM chips producing hard errors (not a single hard error but an entire chip
or most of it). If not repaired right away, there can be a significant contribution to the mem.
failure rate (I THINK the numbers below are for a 32 GB mem. subsystem but we need to
double-check the math):
Time to repair (months)    Memory failure rate adder (FITs)
-----------------------    --------------------------------
 1                           102
 6                           608
12                         1,207
- Possible solutions:
- will scrubbing be enough? -- it depends
- Redundant mem. modules? -- expensive?
- If we know some DRAM chip is absolutely BAD, we can use erasure
codes in conjunction with online testing / scrubbing (but it may not be
that straightforward -- we need to think of it).
- There may be RAMCloud-specific opportunities -- e.g., if we replicate the
data for a variety of reasons -- redundancy to prevent large outages / data
loss, performance -- then we should be able to play other tricks.
But error DETECTION should be crucial --> (simple parity won't suffice
for alignment reasons unless we take preventive actions by redistributing
data instead of waiting for repair). The numbers in the above table on time to repair
become less important at that point.
- There is a recent paper (to appear in SIGMETRICS 09) which discusses
DRAM hard errors as well -- they are also seeing increasing incidence of DRAM hard errors.
DRAM Errors in the Wild: A Large-Scale Field Study, Bianca
Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber, SIGMETRICS, 2009 (to appear):
Here are some excerpts from the abstract:
- They analyzed measurements of memory errors in a large fleet
of commodity servers over a period of 2.5 years. The collected data
covers multiple vendors, DRAM capacities and technologies, and
comprises many millions of DIMM days. They observed
DRAM error rates that are orders of magnitude higher than
previously reported, with 25,000 to 70,000 errors per billion device
hours per Mbit and more than 8% of DIMMs affected by errors per year.
They have strong evidence that memory errors are dominated by hard
errors, rather than soft errors.
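
To make the Chipkill interleaving point from this section concrete, here is a small sketch that counts how many bits of each ECC word are lost when one whole chip fails. The geometry (72 chips, 4 bits per chip, 72-bit ECC words) and the function name `bad_bits_per_word` are illustrative assumptions, not numbers from the IBM paper.

```python
def bad_bits_per_word(num_chips, bits_per_chip, failed_chip, interleaved):
    """Count the bits each ECC word loses when `failed_chip` dies entirely.

    One access yields num_chips * bits_per_chip bits, split into
    bits_per_chip ECC words of num_chips bits each (assumed geometry).
    """
    hits = [0] * bits_per_chip
    for bit in range(bits_per_chip):
        if interleaved:
            # Chipkill-style: bit i of every chip goes to ECC word i.
            word = bit
        else:
            # Naive packing: a chip's bits sit next to each other,
            # so they all land in the same ECC word.
            word = (failed_chip * bits_per_chip + bit) // num_chips
        hits[word] += 1
    return hits


interleaved_hits = bad_bits_per_word(72, 4, failed_chip=5, interleaved=True)
packed_hits = bad_bits_per_word(72, 4, failed_chip=5, interleaved=False)
```

With interleaving, every word sees exactly one bad bit, which a single-error-correcting code can repair. With naive packing, one word sees four bad bits, which it cannot. The alignment concern discussed above is the case where a soft error then lands in a word that is already spending its one correction on the dead chip.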
2. Storage servers:
- Assuming 10K storage servers, it will be interesting, for system
design / modeling / measurement purposes, to see which
protection techniques will be useful (or if any will be necessary).
- Even if we assume that each server contributes only 100 FITs (say, for SOFT ERRORS / transients)
(most probably this number takes into account derating at the node level,
i.e., the probability that a flip-flop error doesn't cause a system error --
this is generally taken into account when vendors quote error rates),
overall we get 10^6 FITs: MTTF of 1,000 hrs. This assumes full usage and that
all applications are highly critical --> which probably won't be the case in real life.
- Needs characterization of workload types, their criticality, and a full system
derating (note: data replication doesn't solve this problem)
- Plus, one has to worry about hard errors, which are increasing (aging, etc.)
(even if you have 20-50 FITs per chip for hard errors --> can add up).
- Opportunities:
- Criticality of applications will be key -- e.g., social networking sites
may not care much for silent errors --> other data center apps (e.g., banking
or commercial) will.
- Protection techniques -- we will discuss a few (software only?
hardware-assisted techniques -- probably not an option for our COTS implementation;
however, we should investigate those).
- Software techniques --> scrubbing alone won't work
- Error detection (any application-level properties for end-to-end checks,
or time-redundancy based?) --> performance impact, energy impact?
- Also, brings up the question of: checkpointing, recovery support?
Are we going to have transaction semantics? -- Can that help?
- Can we classify transactions as "critical" vs. "non-critical" and
rely on selective protection?
- For hard errors --> very thorough on-line self-test / on-line self-diagnostics
can PREDICT (i.e., EARLY detection even before errors appear) failures.
Generally requires hardware support: may not be available on our COTS parts.
This will provide some interesting experiment opportunities.
- Brings up questions of self-repair / self-healing --> what if hard errors are DETECTED
(instead of predicted) by periodic self-diagnostics --> implications on recovery?
- Also, relevant: interactions with power management, power overhead
of error checking.
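
The FIT arithmetic used in these notes (10^6 FITs --> 1,000 hrs here, and the 20 to 200 hour DRAM estimate in 1a) can be reproduced with a few lines. This is just a sanity-check sketch; the helper name `mean_time_to_failure_hours` and the variable names are made up for the example, and the inputs are the assumed figures from the notes.

```python
HOURS_PER_BILLION = 1e9  # 1 FIT = 1 failure per 10^9 device hours


def mean_time_to_failure_hours(total_fits: float) -> float:
    """Convert an aggregate FIT rate into a mean time between failures."""
    return HOURS_PER_BILLION / total_fits


servers = 10_000

# Storage servers: 100 FITs per server (assumed figure from the notes).
fleet_fits = servers * 100
fleet_mttf = mean_time_to_failure_hours(fleet_fits)  # 1,000 hours

# DRAM soft errors: 64 GBytes = 512 Gbit per server, 1 to 10 FITs / Gbit.
gbits_per_server = 64 * 8
dram_fits_low = gbits_per_server * servers * 1
dram_fits_high = gbits_per_server * servers * 10
dram_mtte_best = mean_time_to_failure_hours(dram_fits_low)    # about 195 hours
dram_mtte_worst = mean_time_to_failure_hours(dram_fits_high)  # about 19.5 hours
```

Any derating (fraction of flips that matter, fraction of critical workloads) would simply scale `total_fits` down before the conversion.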
3. Application servers
- Much relaxed reliability requirements? (social
networking -- they often use commodity app. servers and focus on storage
servers?).
- Need to address: criticality of app: selective protection -->
can be carried over through the entire system (our previous discussion
on storage servers).
- Another issue: who is to be blamed for incorrect (criticality-wise) results
due to app. server errors.
4. Networking substrate:
- I missed the last part of the discussion
on whether we need custom networking substrate to meet our
latency objectives or not.
- If we do need custom substrates, that can provide opportunities
to build reliability mechanisms into the substrate itself.
- If not, one must analyze error rates of the switches as well.
Some vendors thought that their error rates would be really
low because of CRC providing end-to-end checks only to
find later that CRC wasn't done right to protect against errors in the switches
themselves.
- We can project numbers similar to our earlier discussions
for these as well.
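
One way a CRC can fail to cover errors inside a switch is sketched below. This is a hypothetical failure mode chosen for illustration (the notes do not say what exactly went wrong for those vendors): a switch that recomputes the CRC on egress will "bless" data it corrupted internally, whereas a switch that forwards the sender's CRC unchanged keeps the check end-to-end. All function names here are made up for the example.

```python
import zlib


def flip_first_bit(payload: bytes) -> bytes:
    """Simulate a single-bit upset inside the switch buffer."""
    return bytes([payload[0] ^ 0x01]) + payload[1:]


def send(payload: bytes):
    """Sender attaches a CRC-32 computed over the payload."""
    return payload, zlib.crc32(payload)


def receiver_accepts(payload: bytes, crc: int) -> bool:
    return zlib.crc32(payload) == crc


def switch_forwarding_crc(payload: bytes, crc: int):
    # Payload is corrupted in the switch, sender's CRC passes through.
    return flip_first_bit(payload), crc


def switch_regenerating_crc(payload: bytes, crc: int):
    # Payload is corrupted in the switch, then the CRC is recomputed
    # on egress, so the check now covers the corrupted data.
    corrupted = flip_first_bit(payload)
    return corrupted, zlib.crc32(corrupted)


original = b"balance=100"
forwarded = switch_forwarding_crc(*send(original))
regenerated = switch_regenerating_crc(*send(original))
```

Here `receiver_accepts(*forwarded)` is False (the corruption is caught), while `receiver_accepts(*regenerated)` is True even though the payload no longer matches `original`.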
System reliability questions:
I haven't considered questions such as power outages,
disasters, geographical diversity, etc. in detail (reason: for such
causes, what is important is data replication; detection is less of an
issue, the issue is recovery) -- we can discuss that as part of this discussion.
However, this part is closely related to what we need to do for DRAM
(hard and soft) errors.
I looked up a few recent papers on causes of system failure rates.
Here is some data:
- Understanding Failures in Petascale Computers
by Bianca Schroeder and Garth A. Gibson
Data set during 1995-2005 at LANL:
- 22 HPC systems (4,750 machines, 24,101 processors).
22 clusters (18 SMP-based clusters, 2 to 4 processors per
node: Total 4,672 nodes, 15,101 processors; remaining 4: NUMA
boxes with 128 to 256 processors each: total 78 nodes, 9,000 processors).
- An entry for any failure that occurred during the time period
that resulted in an application interruption or a node outage.
(Note: it seems they are not doing concurrent checking so I'd assume
silent data corruption isn't included here).
- System failure causes covered: software failures, hardware
failures, failures due to operator error, network failures,
and failures due to environmental problems (e.g. power outages).
- Cluster node outages: > 50% -- hardware failures; ~ 20% -- software causes;
15% -- unknown; rest: network, human, environment causes (double-checked
by looking at the fraction of repair time attributed to these causes).
- Number of failures per year per system varies between 20 and 1,100
--> on average: 0.2 to 0.7 per system per year per processor.
- (Note: as the paper suggests, their systems mainly implement checkpointing;
they don't do much about error detection).
- (There has been some reported data that Blue Gene systems have
significantly lower failure rates --> need to know details).
- Some data from HPCS community: (H. Simon. Petascale computing
in the U.S. Slides from presentation at the ACTS workshop.
http://acts.nersc.gov/events/Workshop2006/slides/Simon.pdf, June 2006).
- Cray XT3/XT4: 10,880 CPUs, Failures per month per TF: 0.1 to 1
- IBM Power 5/6: 10,240 CPUs, Failures per month per TF: 1.3
- Clusters AMD x86: 8,000 CPUs, Failures per month per TF: 2.6 to 8
- BlueGene L/P: 131,720 CPUs, Failures per month per TF: 0.01 to 0.03
- Next paper: Tahoori, Kaeli & others
- They looked at storage servers (probably EMC).
- Normalized failure rates:
                    System A   System B1   System B2   Total
Hardware-related      1.91       2.27        7.25      2.19
Power-related         0.18       0.19        0.5       0.19
Software-related      2.41       4.44       18.12      3.48
SEU-related           1.0        1.0         1.0       1.0
- Of course there are several other papers (including Jim Gray's 1985
paper, which discusses operator errors being significant, a Microsoft
XP paper discussing the importance of third-party drivers, etc.).