3 EDAC - Error Detection And Correction
5 Written by Doug Thompson <norsk5@xmission.com>
11 modified by Dave Peterson, Doug Thompson, et al,
12 from the bluesmoke.sourceforge.net project.
15 ============================================================================
18 The goal of the 'edac' kernel module is to detect and report errors that occur
19 within the computer system. In the initial release, memory Correctable Errors
20 (CE) and Uncorrectable Errors (UE) are the primary errors being harvested.
22 Detecting CE events, then harvesting those events and reporting them,
23 CAN be a predictor of future UE events. With CE events, the system can
24 continue to operate, but with less safety. Preventive maintenance and
25 proactive part replacement of memory DIMMs exhibiting CEs can reduce
26 the likelihood of the dreaded UE events and system 'panics'.
29 In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
30 in order to determine if errors are occurring on data transfers.
31 The presence of PCI Parity errors must be examined with a grain of salt.
32 There are several add-in adapters that do NOT follow the PCI specification
33 with regard to Parity generation and reporting. The specification says
34 the vendor should tie the parity status bits to 0 if they do not intend
35 to generate parity. Some vendors do not do this, and thus the parity bit
36 can "float" giving false positives.
38 [There are patches in the kernel queue which will allow for storage of
39 quirks of PCI devices reporting false parity positives. The 2.6.18
40 kernel should have those patches included. When that becomes available,
41 then EDAC will be patched to utilize that information to "skip" such
44 Future error detectors will be integrated with
45 EDAC or added to it, including the following:
47 MCE Machine Check Exception
48 MCA Machine Check Architecture
49 NMI NMI notification of ECC errors
50 MSRs Machine Specific Register error cases
53 These errors are usually bus errors, ECC errors, thermal throttling
57 ============================================================================
60 EDAC is composed of a "core" module (edac_mc.ko) and several Memory
61 Controller (MC) driver modules. On a given system, the CORE
62 is loaded and one MC driver will be loaded. Both the CORE and
63 the MC driver have individual versions that reflect current release
64 level of their respective modules. Thus, to "report" on what version
65 a system is running, one must report both the CORE's and the
71 If 'edac' was statically linked with the kernel then no loading is
72 necessary. If 'edac' was built as modules then simply modprobe the
73 'edac' pieces that you need. You should be able to modprobe
74 hardware-specific modules and have the dependencies load the necessary core
79 $> modprobe amd76x_edac
81 loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
85 ============================================================================
88 EDAC presents a 'sysfs' interface for control, reporting and attribute
91 EDAC lives in the /sys/devices/system/edac directory. Within this directory
92 there currently reside 2 'edac' components:
94 mc memory controller(s) system
95 pci PCI control and status system
98 ============================================================================
99 Memory Controller (mc) Model
101 First a background on the memory controller's model abstracted in EDAC.
102 Each 'mc' device controls a set of DIMM memory modules. These modules are
103 laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
104 be multiple csrows and multiple channels.
106 Memory controllers allow for several csrows, with 8 csrows being a typical value.
107 Yet, the actual number of csrows depends on the electrical "loading"
108 of a given motherboard, memory controller and DIMM characteristics.
110 Dual channels allow for 128 bit data transfers to the CPU from memory.
111 Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs
112 (FB-DIMMs). The following example will assume 2 channels:
116 ===================================
117 csrow0 | DIMM_A0 | DIMM_B0 |
118 csrow1 | DIMM_A0 | DIMM_B0 |
119 ===================================
121 ===================================
122 csrow2 | DIMM_A1 | DIMM_B1 |
123 csrow3 | DIMM_A1 | DIMM_B1 |
124 ===================================
126 In the above example table there are 4 physical slots on the motherboard
134 Labels for these slots are usually silk screened on the motherboard. Slots
135 labeled 'A' are channel 0 in this example. Slots labeled 'B'
136 are channel 1. Notice that there are two csrows possible on a
137 physical DIMM. These csrows are allocated their csrow assignment
138 based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
139 is placed in each Channel, the csrows cross both DIMMs.
141 Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
142 Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
143 will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
144 when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
145 csrow1 will be populated. The pattern repeats itself for csrow2 and
148 The representation of the above is reflected in the directory tree
149 in EDAC's sysfs interface. Starting in directory
150 /sys/devices/system/edac/mc each memory controller will be represented
151 by its own 'mcX' directory, where 'X' is the index of the MC.
161 Under each 'mcX' directory each csrow is again represented by a
162 'csrowX' directory, where 'X' is the csrow index:
172 Notice that there is no csrow1, which indicates that csrow0 is
173 composed of single ranked DIMMs. This should also apply in both
174 Channels, in order to have dual-channel mode be operational. Since
175 both csrow2 and csrow3 are populated, this indicates a dual ranked
176 set of DIMMs for channels 0 and 1.
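As a rough sketch of walking this tree, the shell function below prints every
mcX/csrowX pair it finds under a base directory. The name list_csrows and its
optional base-directory argument are illustrative helpers, not part of EDAC;
the default base is the directory named in the text above.

```shell
# list_csrows [BASE]
# Print "mcX csrowX" for every csrow directory found under BASE.
# list_csrows is a hypothetical helper; BASE defaults to the EDAC
# 'mc' directory described in this document.
list_csrows() (
    base=${1:-/sys/devices/system/edac/mc}
    for mc in "$base"/mc[0-9]*; do
        # An unmatched glob stays literal, so skip anything that is
        # not an existing directory.
        [ -d "$mc" ] || continue
        for row in "$mc"/csrow[0-9]*; do
            [ -d "$row" ] || continue
            echo "${mc##*/} ${row##*/}"
        done
    done
)
```

A csrow index missing from the output (such as csrow1 in the example above)
would then point to single ranked DIMMs in that slot pair.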
179 Within each of the 'mc', 'mcX' and 'csrowX' directories are several
180 EDAC control and attribute files.
183 ============================================================================
186 In directory 'mc' are EDAC system overall control and attribute files:
189 Panic on UE control file:
193 An uncorrectable error will cause a machine panic. This is usually
194 desirable. It is a bad idea to continue when an uncorrectable error
195 occurs - it is indeterminate what was uncorrected and the operating
196 system context might be so mangled that continuing will lead to further
197 corruption. If the kernel has MCE configured, then EDAC will never
200 LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]
202 RUN TIME: echo "1" >/sys/devices/system/edac/mc/panic_on_ue
209 Generate kernel messages describing uncorrectable errors. These errors
210 are reported through the system message log system. UE statistics
211 will be accumulated even when UE logging is disabled.
213 LOAD TIME: module/kernel parameter: log_ue=[0|1]
215 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue
222 Generate kernel messages describing correctable errors. These
223 errors are reported through the system message log system.
224 CE statistics will be accumulated even when CE logging is disabled.
226 LOAD TIME: module/kernel parameter: log_ce=[0|1]
228 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce
231 Polling period control file:
235 The time period, in milliseconds, for polling for error information.
236 Too small a value wastes resources. Too large a value might delay
237 necessary handling of errors and might lose valuable information for
238 locating the error. 1000 milliseconds (once each second) is the current
239 default. Systems which require all the bandwidth they can get, may
242 LOAD TIME: module/kernel parameter: poll_msec=<milliseconds>
244 RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec
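The run-time settings above all follow the same pattern of writing a value
to a file in the 'mc' directory. A minimal sketch of a wrapper is shown
below; the function name edac_mc_set and its optional base-directory
argument are assumptions made for illustration, and only the control file
names listed above (panic_on_ue, log_ue, log_ce, poll_msec) come from this
document.

```shell
# edac_mc_set NAME VALUE [BASE]
# Write VALUE to the control file NAME under BASE and fail if the
# file does not exist. edac_mc_set is a hypothetical helper; BASE
# defaults to the EDAC 'mc' directory.
edac_mc_set() (
    name=$1
    value=$2
    base=${3:-/sys/devices/system/edac/mc}
    if [ ! -e "$base/$name" ]; then
        echo "edac_mc_set: $base/$name not found" >&2
        return 1
    fi
    echo "$value" > "$base/$name"
)
```

For example, "edac_mc_set log_ce 1" would enable CE logging and
"edac_mc_set poll_msec 2000" would halve the default polling rate. Checking
for the file first avoids silently creating nothing when EDAC is not loaded.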
247 ============================================================================
251 In 'mcX' directories are EDAC control and attribute files for
252 this 'X' instance of the memory controllers:
255 Counter reset control file:
259 This write-only control file will zero all the statistical counters
260 for UE and CE errors. Zeroing the counters will also reset the timer
261 indicating how long since the last counter zero. This is useful
262 for computing errors/time. Since the counters are always reset at
263 driver initialization time, no module/kernel parameter is available.
265 RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/counter_reset
267 This resets the counters on memory controller 0
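To illustrate the errors/time computation mentioned above, the sketch below
derives an integer errors-per-hour figure from two files: one holding an
error count and one holding the seconds elapsed since the last reset (the
'seconds_since_reset' file in this directory). The function name
errors_per_hour is hypothetical, and both paths are passed in by the caller
rather than assumed.

```shell
# errors_per_hour COUNT_FILE SECONDS_FILE
# Print the integer number of errors per hour, given a file holding
# an error count and a file holding seconds since the last reset.
# errors_per_hour is a hypothetical helper, not part of EDAC.
errors_per_hour() (
    count=$(cat "$1") || return 1
    secs=$(cat "$2") || return 1
    # Avoid dividing by zero immediately after a counter reset.
    [ "$secs" -gt 0 ] || { echo 0; return 0; }
    echo $(( count * 3600 / secs ))
)
```

The result is truncated toward zero, so very low rates over short intervals
will read as 0; a longer interval between resets gives a more useful figure.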
270 Seconds since last counter reset attribute file:
272 'seconds_since_reset'
274 This attribute file displays how many seconds have elapsed since the
275 last counter reset. This can be used with the error counters to
280 Memory Controller name attribute file:
284 This attribute file displays the type of memory controller
285 that is being utilized.
288 Total memory managed by this memory controller attribute file:
292 This attribute file displays, in megabytes, the amount of memory
293 that this instance of the memory controller manages.
296 Total Uncorrectable Errors count attribute file:
300 This attribute file displays the total count of uncorrectable
301 errors that have occurred on this memory controller. If panic_on_ue
302 is set this counter will not have a chance to increment,
303 since EDAC will panic the system.
306 Total UE count that had no information attribute file:
310 This attribute file displays the number of UEs that
311 have occurred with no information as to which DIMM
312 slot is having errors.
315 Total Correctable Errors count attribute file:
319 This attribute file displays the total count of correctable
320 errors that have occurred on this memory controller. This
321 count is very important to examine. CEs provide early
322 indications that a DIMM is beginning to fail. This count
323 field should be monitored for non-zero values, which should be
324 reported to the system administrator.
327 Total CE count that had no information attribute file:
331 This attribute file displays the number of CEs that
332 have occurred with no information as to which DIMM slot
333 is having errors. Memory is handicapped, but operational,
334 yet no information is available to indicate which slot
335 the failing memory is in. This count field should also
336 be monitored for non-zero values.
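One possible way to monitor the count files for non-zero values is sketched
below. The function name report_nonzero is hypothetical, and the counter
files to check are passed as arguments rather than named here, since the
exact set to watch is a site decision.

```shell
# report_nonzero FILE...
# Print "FILE: VALUE" for each readable counter file holding a
# non-zero value. Exit status is 1 if any counter was non-zero,
# 0 otherwise. report_nonzero is a hypothetical helper.
report_nonzero() (
    status=0
    for f in "$@"; do
        [ -r "$f" ] || continue
        n=$(cat "$f")
        # Non-numeric content fails the comparison and is skipped.
        if [ "$n" -ne 0 ] 2>/dev/null; then
            echo "$f: $n"
            status=1
        fi
    done
    return $status
)
```

The non-zero exit status makes it straightforward to call this from a
periodic job and notify the system administrator only when something is
reported.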
342 Symlink to the memory controller device
346 ============================================================================
349 In the 'csrowX' directories are EDAC control and attribute files for
350 this 'X' instance of csrow:
353 Total Uncorrectable Errors count attribute file:
357 This attribute file displays the total count of uncorrectable
358 errors that have occurred on this csrow. If panic_on_ue is set
359 this counter will not have a chance to increment, since EDAC
360 will panic the system.
363 Total Correctable Errors count attribute file:
367 This attribute file displays the total count of correctable
368 errors that have occurred on this csrow. This
369 count is very important to examine. CEs provide early
370 indications that a DIMM is beginning to fail. This count
371 field should be monitored for non-zero values, which should be
372 reported to the system administrator.
375 Total memory managed by this csrow attribute file:
379 This attribute file displays, in megabytes, the amount of memory
380 that this csrow contains.
383 Memory Type attribute file:
387 This attribute file will display what type of memory is currently
388 on this csrow. Normally, either buffered or unbuffered memory.
394 EDAC Mode of operation attribute file:
398 This attribute file will display what type of Error detection
399 and correction is being utilized.
402 Device type attribute file:
406 This attribute file will display what type of DRAM device is
407 being utilized on this DIMM.
415 Channel 0 CE Count attribute file:
419 This attribute file will display the count of CEs on this
420 DIMM located in channel 0.
423 Channel 0 UE Count attribute file:
427 This attribute file will display the count of UEs on this
428 DIMM located in channel 0.
431 Channel 0 DIMM Label control file:
435 This control file allows this DIMM to have a label assigned
436 to it. With this label in the module, when errors occur
437 the output can provide the DIMM label in the system log.
438 This becomes vital for panic events to isolate the
439 cause of the UE event.
441 DIMM Labels must be assigned after booting, with information
442 that correctly identifies the physical slot with its
443 silk screen label. This information is currently very
444 motherboard specific and determination of this information
445 must occur in userland at this time.
448 Channel 1 CE Count attribute file:
452 This attribute file will display the count of CEs on this
453 DIMM located in channel 1.
456 Channel 1 UE Count attribute file:
460 This attribute file will display the count of UEs on this
461 DIMM located in channel 1.
464 Channel 1 DIMM Label control file:
468 This control file allows this DIMM to have a label assigned
469 to it. With this label in the module, when errors occur
470 the output can provide the DIMM label in the system log.
471 This becomes vital for panic events to isolate the
472 cause of the UE event.
474 DIMM Labels must be assigned after booting, with information
475 that correctly identifies the physical slot with its
476 silk screen label. This information is currently very
477 motherboard specific and determination of this information
478 must occur in userland at this time.
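Because label assignment is left to userland, one approach is to keep a
small per-motherboard table and apply it after boot. The sketch below
assumes a table format of "<label file path relative to BASE> <label>" per
line; that format, the function name apply_dimm_labels and the BASE
argument are all illustrative assumptions, not something EDAC defines.

```shell
# apply_dimm_labels BASE < table
# Read lines of the form "<relative path> <label>" from stdin and
# write each label to BASE/<relative path>. Blank lines and lines
# starting with '#' are ignored. apply_dimm_labels and the table
# format are hypothetical.
apply_dimm_labels() (
    base=$1
    while read -r relpath label; do
        case $relpath in
            ''|'#'*) continue ;;
        esac
        if [ -w "$base/$relpath" ]; then
            echo "$label" > "$base/$relpath"
        else
            echo "apply_dimm_labels: cannot write $base/$relpath" >&2
        fi
    done
)
```

Keeping the mapping in a data file rather than in the script lets the same
script serve different motherboards with different silk screen labels.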
481 ============================================================================
484 If logging for UEs and CEs is enabled then system logs will have
485 error notices indicating errors that have been detected:
487 EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
488 channel 1 "DIMM_B1": amd76x_edac
490 EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
491 channel 1 "DIMM_B1": amd76x_edac
494 The structure of the message is:
495 the memory controller (MC0)
498 offset in the page (0xce0)
499 the byte granularity (grain 8)
500 or resolution of the error
501 the error syndrome (0x6ec3)
503 memory channel (channel 1)
504 DIMM label, if set prior (DIMM_B1)
505 and then an optional, driver-specific message that may
506 have additional information.
508 Both UEs and CEs with no info will lack all but memory controller,
509 error type, a notice of "no info" and then an optional,
510 driver-specific error message.
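The message structure described above can be picked apart with a sed
expression, as in the sketch below. It assumes each message arrives on a
single line with single spaces after the commas (the examples above are
wrapped for display), and it handles only messages that carry full
information. The function name parse_edac_line and the key=value output
format are illustrative choices.

```shell
# parse_edac_line
# Read log lines on stdin and, for each full-information EDAC
# CE/UE message, print its fields as key=value pairs. Lines that
# do not match are ignored. parse_edac_line is a hypothetical helper.
parse_edac_line() {
    sed -n 's/.*EDAC \(MC[0-9]*\): \([CU]E\) page \(0x[0-9a-f]*\), offset \(0x[0-9a-f]*\), grain \([0-9]*\), syndrome \(0x[0-9a-f]*\), row \([0-9]*\), channel \([0-9]*\) "\([^"]*\)".*/mc=\1 type=\2 page=\3 offset=\4 grain=\5 syndrome=\6 row=\7 channel=\8 label=\9/p'
}
```

Fed the first example message above as one line, this prints
"mc=MC0 type=CE page=0x283 offset=0xce0 grain=8 syndrome=0x6ec3 row=0 channel=1 label=DIMM_B1".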
514 ============================================================================
515 PCI Bus Parity Detection
518 On Header Type 00 devices the primary status is looked at
519 for any parity error regardless of whether Parity is enabled on the
520 device. (The spec indicates parity is generated in some cases).
521 On Header Type 01 bridges, the secondary status register is also
522 looked at to see if a parity error occurred on the bus on the other side of
528 Under /sys/devices/system/edac/pci are control and attribute files as follows:
531 Enable/Disable PCI Parity checking control file:
536 This control file enables or disables the PCI Bus Parity scanning
537 operation. Writing a 1 to this file enables the scanning. Writing
538 a 0 to this file disables the scanning.
541 echo "1" >/sys/devices/system/edac/pci/check_pci_parity
544 echo "0" >/sys/devices/system/edac/pci/check_pci_parity
548 Panic on PCI PARITY Error:
550 'panic_on_pci_parity'
553 This control file enables or disables panicking when a parity
554 error has been detected.
557 module/kernel parameter: panic_on_pci_parity=[0|1]
560 echo "1" >/sys/devices/system/edac/pci/panic_on_pci_parity
563 echo "0" >/sys/devices/system/edac/pci/panic_on_pci_parity
570 This attribute file will display the number of parity errors that
575 =======================================================================