From ee66e6a24bd2f8a294f018ab90f660f23d8a7ff5 Mon Sep 17 00:00:00 2001 From: Loc Ho Date: Fri, 22 May 2015 17:32:58 -0600 Subject: Documentation: Add documentation for the APM X-Gene SoC EDAC DTS binding Add documentation for the APM X-Gene SoC EDAC DTS binding. Signed-off-by: Loc Ho Acked-by: Arnd Bergmann Cc: devicetree@vger.kernel.org Cc: dougthompson@xmission.com Cc: ijc+devicetree@hellion.org.uk Cc: jcm@redhat.com Cc: linux-arm-kernel@lists.infradead.org Cc: linux-edac Cc: mark.rutland@arm.com Cc: mchehab@osg.samsung.com Cc: patches@apm.com Cc: robh+dt@kernel.org Link: http://lkml.kernel.org/r/1432337580-3750-4-git-send-email-lho@apm.com Signed-off-by: Borislav Petkov --- .../devicetree/bindings/edac/apm-xgene-edac.txt | 78 ++++++++++++++++++++++ 1 file changed, 78 insertions(+) create mode 100644 Documentation/devicetree/bindings/edac/apm-xgene-edac.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/edac/apm-xgene-edac.txt b/Documentation/devicetree/bindings/edac/apm-xgene-edac.txt new file mode 100644 index 000000000000..480911c38ff9 --- /dev/null +++ b/Documentation/devicetree/bindings/edac/apm-xgene-edac.txt @@ -0,0 +1,78 @@ +* APM X-Gene SoC EDAC node + +EDAC node is defined to describe on-chip error detection and correction. +The follow error types are supported: + + memory controller - Memory controller + PMD (L1/L2) - Processor module unit (PMD) L1/L2 cache + +The following section describes the EDAC DT node binding. + +Required properties: +- compatible : Shall be "apm,xgene-edac". +- regmap-csw : Regmap of the CPU switch fabric (CSW) resource. +- regmap-mcba : Regmap of the MCB-A (memory bridge) resource. +- regmap-mcbb : Regmap of the MCB-B (memory bridge) resource. +- regmap-efuse : Regmap of the PMD efuse resource. +- reg : First resource shall be the CPU bus (PCP) resource. +- interrupts : Interrupt-specifier for MCU, PMD, L3, or SoC error + IRQ(s). + +Required properties for memory controller subnode: +- compatible : Shall be "apm,xgene-edac-mc". +- reg : First resource shall be the memory controller unit + (MCU) resource. +- memory-controller : Instance number of the memory controller. + +Required properties for PMD subnode: +- compatible : Shall be "apm,xgene-edac-pmd". +- reg : First resource shall be the PMD resource. +- pmd-controller : Instance number of the PMD controller. + +Example: + csw: csw@7e200000 { + compatible = "apm,xgene-csw", "syscon"; + reg = <0x0 0x7e200000 0x0 0x1000>; + }; + + mcba: mcba@7e700000 { + compatible = "apm,xgene-mcb", "syscon"; + reg = <0x0 0x7e700000 0x0 0x1000>; + }; + + mcbb: mcbb@7e720000 { + compatible = "apm,xgene-mcb", "syscon"; + reg = <0x0 0x7e720000 0x0 0x1000>; + }; + + efuse: efuse@1054a000 { + compatible = "apm,xgene-efuse", "syscon"; + reg = <0x0 0x1054a000 0x0 0x20>; + }; + + edac@78800000 { + compatible = "apm,xgene-edac"; + #address-cells = <2>; + #size-cells = <2>; + ranges; + regmap-csw = <&csw>; + regmap-mcba = <&mcba>; + regmap-mcbb = <&mcbb>; + regmap-efuse = <&efuse>; + reg = <0x0 0x78800000 0x0 0x100>; + interrupts = <0x0 0x20 0x4>, + <0x0 0x21 0x4>, + <0x0 0x27 0x4>; + + edacmc@7e800000 { + compatible = "apm,xgene-edac-mc"; + reg = <0x0 0x7e800000 0x0 0x1000>; + memory-controller = <0>; + }; + + edacpmd@7c000000 { + compatible = "apm,xgene-edac-pmd"; + reg = <0x0 0x7c000000 0x0 0x200000>; + pmd-controller = <0>; + }; + }; -- cgit v1.2.3 From 451bb7fbccdc0f5d942227913afa5dea3afd12f7 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 1 Jun 2015 16:09:35 -0600 Subject: EDAC, xgene: Fix cpuid abuse The new x-gene EDAC driver incorrectly tried to figure out the version of one of its IP blocks by looking at the version of the CPU core, which is only vagely related. This removes the incorrect code and instead uses the version of the IP block in the compatible string where it belongs. Found using build testing on x86, which does not provide the arm64 cpuid interface. Signed-off-by: Arnd Bergmann [ Changed subnode to "apm,xgene-edac-pmd-v2", adjusted check. ] Signed-off-by: Loc Ho Cc: devicetree@vger.kernel.org Cc: dougthompson@xmission.com Cc: ijc+devicetree@hellion.org.uk Cc: jcm@redhat.com Cc: linux-arm-kernel@lists.infradead.org Cc: linux-edac Cc: mark.rutland@arm.com Cc: mchehab@osg.samsung.com Cc: patches@apm.com Cc: robh+dt@kernel.org Link: http://lkml.kernel.org/r/3195065.IK73o60xya@wuerfel Signed-off-by: Borislav Petkov --- .../devicetree/bindings/edac/apm-xgene-edac.txt | 3 +- drivers/edac/xgene_edac.c | 55 ++++------------------ 2 files changed, 10 insertions(+), 48 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/edac/apm-xgene-edac.txt b/Documentation/devicetree/bindings/edac/apm-xgene-edac.txt index 480911c38ff9..78edb80002c8 100644 --- a/Documentation/devicetree/bindings/edac/apm-xgene-edac.txt +++ b/Documentation/devicetree/bindings/edac/apm-xgene-edac.txt @@ -25,7 +25,8 @@ Required properties for memory controller subnode: - memory-controller : Instance number of the memory controller. Required properties for PMD subnode: -- compatible : Shall be "apm,xgene-edac-pmd". +- compatible : Shall be "apm,xgene-edac-pmd" or + "apm,xgene-edac-pmd-v2". - reg : First resource shall be the PMD resource. - pmd-controller : Instance number of the PMD controller. diff --git a/drivers/edac/xgene_edac.c b/drivers/edac/xgene_edac.c index b5158572f6ab..14636e4b6a08 100644 --- a/drivers/edac/xgene_edac.c +++ b/drivers/edac/xgene_edac.c @@ -523,6 +523,7 @@ struct xgene_edac_pmd_ctx { struct edac_device_ctl_info *edac_dev; void __iomem *pmd_csr; u32 pmd; + int version; }; static void xgene_edac_pmd_l1_check(struct edac_device_ctl_info *edac_dev, @@ -784,50 +785,6 @@ static void xgene_edac_pmd_cpu_hw_cfg(struct edac_device_ctl_info *edac_dev, writel(0x00000101, pg_f + MEMERR_CPU_MMUECR_PAGE_OFFSET); } -static bool xgene_edac_pmd_l2c_version1(void) -{ - /* Check all chips with PMD L2C version 1 HW */ - #define REVIDR_MINOR_REV(revidr) ((revidr) & 0x00000007) - - switch (MIDR_VARIANT(read_cpuid_id())) { - case 0: - switch (MIDR_REVISION(read_cpuid_id())) { - case 0: - - switch (REVIDR_MINOR_REV(read_cpuid(REVIDR_EL1))) { - case 1: - case 2: - return true; - }; - break; - case 1: - if (REVIDR_MINOR_REV(read_cpuid(REVIDR_EL1)) == 1) - return true; - break; - } - break; - case 1: - switch (MIDR_REVISION(read_cpuid_id())) { - case 0: - switch (REVIDR_MINOR_REV(read_cpuid(REVIDR_EL1))) { - case 1: - return true; - }; - break; - case 1: - switch (REVIDR_MINOR_REV(read_cpuid(REVIDR_EL1))) { - case 1: - case 0: - return true; - }; - break; - } - break; - } - - return false; -} - static void xgene_edac_pmd_hw_cfg(struct edac_device_ctl_info *edac_dev) { struct xgene_edac_pmd_ctx *ctx = edac_dev->pvt_info; @@ -837,7 +794,7 @@ static void xgene_edac_pmd_hw_cfg(struct edac_device_ctl_info *edac_dev) /* Enable PMD memory error - MEMERR_L2C_L2ECR and L2C_L2RTOCR */ writel(0x00000703, pg_e + MEMERR_L2C_L2ECR_PAGE_OFFSET); /* Configure L2C HW request time out feature if supported */ - if (!xgene_edac_pmd_l2c_version1()) + if (ctx->version > 1) writel(0x00000119, pg_d + CPUX_L2C_L2RTOCR_PAGE_OFFSET); } @@ -956,7 +913,8 @@ static int xgene_edac_pmd_available(u32 efuse, int pmd) return (efuse & (1 << pmd)) ? 0 : 1; } -static int xgene_edac_pmd_add(struct xgene_edac *edac, struct device_node *np) +static int xgene_edac_pmd_add(struct xgene_edac *edac, struct device_node *np, + int version) { struct edac_device_ctl_info *edac_dev; struct xgene_edac_pmd_ctx *ctx; @@ -998,6 +956,7 @@ static int xgene_edac_pmd_add(struct xgene_edac *edac, struct device_node *np) ctx->edac = edac; ctx->edac_dev = edac_dev; ctx->ddev = *edac->dev; + ctx->version = version; edac_dev->dev = &ctx->ddev; edac_dev->ctl_name = ctx->name; edac_dev->dev_name = ctx->name; @@ -1169,7 +1128,9 @@ static int xgene_edac_probe(struct platform_device *pdev) if (of_device_is_compatible(child, "apm,xgene-edac-mc")) xgene_edac_mc_add(edac, child); if (of_device_is_compatible(child, "apm,xgene-edac-pmd")) - xgene_edac_pmd_add(edac, child); + xgene_edac_pmd_add(edac, child, 1); + if (of_device_is_compatible(child, "apm,xgene-edac-pmd-v2")) + xgene_edac_pmd_add(edac, child, 2); } return 0; -- cgit v1.2.3 From 54b4a8f57848bb08dcbdfba94b9b1ddef1c23358 Mon Sep 17 00:00:00 2001 From: Thor Thayer Date: Thu, 4 Jun 2015 09:28:48 -0500 Subject: arm: socfpga: dts: Add Arria10 SDRAM EDAC DTS support Add support for the Arria10 SDRAM EDAC. Update the bindings document for the new match string. Signed-off-by: Thor Thayer Cc: Arnd Bergmann Cc: devicetree@vger.kernel.org Cc: dinguyen@opensource.altera.com Cc: galak@codeaurora.org Cc: grant.likely@linaro.org Cc: ijc+devicetree@hellion.org.uk Cc: linux-arm-kernel@lists.infradead.org Cc: linux-edac Cc: m.chehab@samsung.com Cc: mark.rutland@arm.com Cc: pawel.moll@arm.com Cc: robh+dt@kernel.org Cc: tthayer.linux@gmail.com Link: http://lkml.kernel.org/r/1433428128-7292-5-git-send-email-tthayer@opensource.altera.com Signed-off-by: Borislav Petkov --- .../devicetree/bindings/arm/altera/socfpga-sdram-edac.txt | 2 +- arch/arm/boot/dts/socfpga_arria10.dtsi | 11 +++++++++++ 2 files changed, 12 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/arm/altera/socfpga-sdram-edac.txt b/Documentation/devicetree/bindings/arm/altera/socfpga-sdram-edac.txt index d0ce01da5c59..f5ad0ff69fae 100644 --- a/Documentation/devicetree/bindings/arm/altera/socfpga-sdram-edac.txt +++ b/Documentation/devicetree/bindings/arm/altera/socfpga-sdram-edac.txt @@ -2,7 +2,7 @@ Altera SOCFPGA SDRAM Error Detection & Correction [EDAC] The EDAC accesses a range of registers in the SDRAM controller. Required properties: -- compatible : should contain "altr,sdram-edac"; +- compatible : should contain "altr,sdram-edac" or "altr,sdram-edac-a10" - altr,sdr-syscon : phandle of the sdr module - interrupts : Should contain the SDRAM ECC IRQ in the appropriate format for the IRQ controller. diff --git a/arch/arm/boot/dts/socfpga_arria10.dtsi b/arch/arm/boot/dts/socfpga_arria10.dtsi index 8a05c47fd57f..4be75960a603 100644 --- a/arch/arm/boot/dts/socfpga_arria10.dtsi +++ b/arch/arm/boot/dts/socfpga_arria10.dtsi @@ -253,6 +253,17 @@ status = "disabled"; }; + sdr: sdr@ffc25000 { + compatible = "syscon"; + reg = <0xffcfb100 0x80>; + }; + + sdramedac { + compatible = "altr,sdram-edac-a10"; + altr,sdr-syscon = <&sdr>; + interrupts = <0 2 4>, <0 0 4>; + }; + L2: l2-cache@fffff000 { compatible = "arm,pl310-cache"; reg = <0xfffff000 0x1000>; -- cgit v1.2.3 From 3aae9edd5a63e226baf3375bb8f7e8d05f5d9098 Mon Sep 17 00:00:00 2001 From: Rami Rosen Date: Fri, 19 Jun 2015 09:18:34 +0300 Subject: EDAC: Fix typos in Documentation/edac.txt Fix various typos in Documentation/edac.txt. Signed-off-by: Rami Rosen Link: http://lkml.kernel.org/r/1434694714-2924-1-git-send-email-ramirose@gmail.com Signed-off-by: Borislav Petkov --- Documentation/edac.txt | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) (limited to 'Documentation') diff --git a/Documentation/edac.txt b/Documentation/edac.txt index 73fff13e848f..4df786e73e87 100644 --- a/Documentation/edac.txt +++ b/Documentation/edac.txt @@ -25,7 +25,7 @@ first time, it was renamed to 'EDAC'. The bluesmoke project at sourceforge.net is now utilized as a 'staging area' for EDAC development, before it is sent upstream to kernel.org -At the bluesmoke/EDAC project site is a series of quilt patches against +At the bluesmoke/EDAC project site, there is a series of quilt patches against recent kernels, stored in a SVN repository. For easier downloading, there is also a tarball snapshot available. @@ -235,7 +235,7 @@ In 'mcX' directories are EDAC control and attribute files for this 'X' instance of the memory controllers. For a description of the sysfs API, please see: - Documentation/ABI/testing/sysfs/devices-edac + Documentation/ABI/testing/sysfs-devices-edac ============================================================================ @@ -276,7 +276,7 @@ Total memory managed by this csrow attribute file: 'size_mb' - This attribute file displays, in count of megabytes, of memory + This attribute file displays, in count of megabytes, the memory that this csrow contains. @@ -516,7 +516,7 @@ Panic on PCI PARITY Error: 'panic_on_pci_parity' - This control files enables or disables panicking when a parity + This control file enables or disables panicking when a parity error has been detected. @@ -617,7 +617,7 @@ The 'test_device_edac' device adds 4 attributes and 1 control: reset all the above counters. -Use of the 'test_device_edac' driver should any others to create their own +Use of the 'test_device_edac' driver should enable any others to create their own unique drivers for their hardware systems. The 'test_device_edac' sample driver is located at the @@ -633,7 +633,7 @@ of the driver. Due to the way Nehalem exports Memory Controller data, some adjustments were done at i7core_edac driver. This chapter will cover those differences -1) On Nehalem, there are one Memory Controller per Quick Patch Interconnect +1) On Nehalem, there is one Memory Controller per Quick Patch Interconnect (QPI). At the driver, the term "socket" means one QPI. This is associated with a physical CPU socket. @@ -642,7 +642,7 @@ were done at i7core_edac driver. This chapter will cover those differences Each channel can have up to 3 DIMMs. The minimum known unity is DIMMs. There are no information about csrows. - As EDAC API maps the minimum unity is csrows, the driver sequencially + As EDAC API maps the minimum unity is csrows, the driver sequentially maps channel/dimm into different csrows. For example, supposing the following layout: @@ -664,7 +664,7 @@ exports one Each QPI is exported as a different memory controller. -2) Nehalem MC has the hability to generate errors. The driver implements this +2) Nehalem MC has the ability to generate errors. The driver implements this functionality via some error injection nodes: For injecting a memory error, there are some sysfs nodes, under @@ -771,5 +771,5 @@ exports one The standard error counters are generated when an mcelog error is received by the driver. Since, with udimm, this is counted by software, it is - possible that some errors could be lost. With rdimm's, they displays the + possible that some errors could be lost. With rdimm's, they display the contents of the registers -- cgit v1.2.3 From 043b43180efee8dcc41dde5ca710827b26d17510 Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Fri, 19 Jun 2015 11:47:17 +0200 Subject: EDAC: Update Documentation/edac.txt Do some initial cleanup, more probably will come. - Move credits section to the end - Update maintainers - Drop sourceforge reference - project is long upstream now - Reformat sections - Reformat paragraphs - Clarify text - Bring it up-to-date - Drop useless "future hardware scanning" section Signed-off-by: Borislav Petkov --- Documentation/edac.txt | 273 +++++++++++++++++++++++-------------------------- 1 file changed, 130 insertions(+), 143 deletions(-) (limited to 'Documentation') diff --git a/Documentation/edac.txt b/Documentation/edac.txt index 4df786e73e87..0cf27a3544a5 100644 --- a/Documentation/edac.txt +++ b/Documentation/edac.txt @@ -1,53 +1,34 @@ - - EDAC - Error Detection And Correction - -Written by Doug Thompson -7 Dec 2005 -17 Jul 2007 Updated - -(c) Mauro Carvalho Chehab -05 Aug 2009 Nehalem interface - -EDAC is maintained and written by: - - Doug Thompson, Dave Jiang, Dave Peterson et al, - original author: Thayne Harbaugh, - -Contact: - website: bluesmoke.sourceforge.net - mailing list: bluesmoke-devel@lists.sourceforge.net +===================================== "bluesmoke" was the name for this device driver when it was "out-of-tree" and maintained at sourceforge.net. When it was pushed into 2.6.16 for the first time, it was renamed to 'EDAC'. -The bluesmoke project at sourceforge.net is now utilized as a 'staging area' -for EDAC development, before it is sent upstream to kernel.org - -At the bluesmoke/EDAC project site, there is a series of quilt patches against -recent kernels, stored in a SVN repository. For easier downloading, there -is also a tarball snapshot available. +PURPOSE +------- -============================================================================ -EDAC PURPOSE - -The 'edac' kernel module goal is to detect and report errors that occur -within the computer system running under linux. +The 'edac' kernel module's goal is to detect and report hardware errors +that occur within the computer system running under linux. MEMORY +------ -In the initial release, memory Correctable Errors (CE) and Uncorrectable -Errors (UE) are the primary errors being harvested. These types of errors -are harvested by the 'edac_mc' class of device. +Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the +primary errors being harvested. These types of errors are harvested by +the 'edac_mc' device. Detecting CE events, then harvesting those events and reporting them, -CAN be a predictor of future UE events. With CE events, the system can -continue to operate, but with less safety. Preventive maintenance and -proactive part replacement of memory DIMMs exhibiting CEs can reduce -the likelihood of the dreaded UE events and system 'panics'. +*can* but must not necessarily be a predictor of future UE events. With +CE events only, the system can and will continue to operate as no data +has been damaged yet. + +However, preventive maintenance and proactive part replacement of memory +DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events +and system panics. -NON-MEMORY +OTHER HARDWARE ELEMENTS +----------------------- A new feature for EDAC, the edac_device class of device, was added in the 2.6.23 version of the kernel. @@ -56,70 +37,57 @@ This new device type allows for non-memory type of ECC hardware detectors to have their states harvested and presented to userspace via the sysfs interface. -Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA -engines, fabric switches, main data path switches, interconnections, -and various other hardware data paths. If the hardware reports it, then -a edac_device device probably can be constructed to harvest and present -that to userspace. +Some architectures have ECC detectors for L1, L2 and L3 caches, +along with DMA engines, fabric switches, main data path switches, +interconnections, and various other hardware data paths. If the hardware +reports it, then a edac_device device probably can be constructed to +harvest and present that to userspace. PCI BUS SCANNING +---------------- -In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices -in order to determine if errors are occurring on data transfers. +In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors +in order to determine if errors are occurring during data transfers. The presence of PCI Parity errors must be examined with a grain of salt. -There are several add-in adapters that do NOT follow the PCI specification +There are several add-in adapters that do *not* follow the PCI specification with regards to Parity generation and reporting. The specification says the vendor should tie the parity status bits to 0 if they do not intend to generate parity. Some vendors do not do this, and thus the parity bit can "float" giving false positives. -In the kernel there is a PCI device attribute located in sysfs that is -checked by the EDAC PCI scanning code. If that attribute is set, -PCI parity/error scanning is skipped for that device. The attribute -is: +There is a PCI device attribute located in sysfs that is checked by +the EDAC PCI scanning code. If that attribute is set, PCI parity/error +scanning is skipped for that device. The attribute is: broken_parity_status -as is located in /sys/devices/pci/0000:XX:YY.Z directories for +and is located in /sys/devices/pci/0000:XX:YY.Z directories for PCI devices. -FUTURE HARDWARE SCANNING -EDAC will have future error detectors that will be integrated with -EDAC or added to it, in the following list: - - MCE Machine Check Exception - MCA Machine Check Architecture - NMI NMI notification of ECC errors - MSRs Machine Specific Register error cases - and other mechanisms. - -These errors are usually bus errors, ECC errors, thermal throttling -and the like. - - -============================================================================ -EDAC VERSIONING +VERSIONING +---------- EDAC is composed of a "core" module (edac_core.ko) and several Memory -Controller (MC) driver modules. On a given system, the CORE -is loaded and one MC driver will be loaded. Both the CORE and -the MC driver (or edac_device driver) have individual versions that reflect -current release level of their respective modules. +Controller (MC) driver modules. On a given system, the CORE is loaded +and one MC driver will be loaded. Both the CORE and the MC driver (or +edac_device driver) have individual versions that reflect current +release level of their respective modules. -Thus, to "report" on what version a system is running, one must report both -the CORE's and the MC driver's versions. +Thus, to "report" on what version a system is running, one must report +both the CORE's and the MC driver's versions. LOADING +------- -If 'edac' was statically linked with the kernel then no loading is -necessary. If 'edac' was built as modules then simply modprobe the -'edac' pieces that you need. You should be able to modprobe -hardware-specific modules and have the dependencies load the necessary core -modules. +If 'edac' was statically linked with the kernel then no loading +is necessary. If 'edac' was built as modules then simply modprobe +the 'edac' pieces that you need. You should be able to modprobe +hardware-specific modules and have the dependencies load the necessary +core modules. Example: @@ -129,35 +97,33 @@ loads both the amd76x_edac.ko memory controller module and the edac_mc.ko core module. -============================================================================ -EDAC sysfs INTERFACE - -EDAC presents a 'sysfs' interface for control, reporting and attribute -reporting purposes. +SYSFS INTERFACE +--------------- -EDAC lives in the /sys/devices/system/edac directory. +EDAC presents a 'sysfs' interface for control and reporting purposes. It +lives in the /sys/devices/system/edac directory. -Within this directory there currently reside 2 'edac' components: +Within this directory there currently reside 2 components: mc memory controller(s) system pci PCI control and status system -============================================================================ + Memory Controller (mc) Model +---------------------------- -First a background on the memory controller's model abstracted in EDAC. -Each 'mc' device controls a set of DIMM memory modules. These modules are -laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can -be multiple csrows and multiple channels. +Each 'mc' device controls a set of DIMM memory modules. These modules +are laid out in a Chip-Select Row (csrowX) and Channel table (chX). +There can be multiple csrows and multiple channels. -Memory controllers allow for several csrows, with 8 csrows being a typical value. -Yet, the actual number of csrows depends on the electrical "loading" -of a given motherboard, memory controller and DIMM characteristics. +Memory controllers allow for several csrows, with 8 csrows being a +typical value. Yet, the actual number of csrows depends on the layout of +a given motherboard, memory controller and DIMM characteristics. -Dual channels allows for 128 bit data transfers to the CPU from memory. -Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs -(FB-DIMMs). The following example will assume 2 channels: +Dual channels allows for 128 bit data transfers to/from the CPU from/to +memory. Some newer chipsets allow for more than 2 channels, like Fully +Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels: Channel 0 Channel 1 @@ -179,12 +145,12 @@ for memory DIMMs: DIMM_A1 DIMM_B1 -Labels for these slots are usually silk screened on the motherboard. Slots -labeled 'A' are channel 0 in this example. Slots labeled 'B' -are channel 1. Notice that there are two csrows possible on a -physical DIMM. These csrows are allocated their csrow assignment -based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM -is placed in each Channel, the csrows cross both DIMMs. +Labels for these slots are usually silk-screened on the motherboard. +Slots labeled 'A' are channel 0 in this example. Slots labeled 'B' are +channel 1. Notice that there are two csrows possible on a physical DIMM. +These csrows are allocated their csrow assignment based on the slot into +which the memory DIMM is placed. Thus, when 1 DIMM is placed in each +Channel, the csrows cross both DIMMs. Memory DIMMs come single or dual "ranked". A rank is a populated csrow. Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above @@ -193,8 +159,8 @@ when 2 dual ranked DIMMs are similarly placed, then both csrow0 and csrow1 will be populated. The pattern repeats itself for csrow2 and csrow3. -The representation of the above is reflected in the directory tree -in EDAC's sysfs interface. Starting in directory +The representation of the above is reflected in the directory +tree in EDAC's sysfs interface. Starting in directory /sys/devices/system/edac/mc each memory controller will be represented by its own 'mcX' directory, where 'X' is the index of the MC. @@ -217,19 +183,19 @@ Under each 'mcX' directory each 'csrowX' is again represented by a |->csrow3 .... -Notice that there is no csrow1, which indicates that csrow0 is -composed of a single ranked DIMMs. This should also apply in both -Channels, in order to have dual-channel mode be operational. Since -both csrow2 and csrow3 are populated, this indicates a dual ranked -set of DIMMs for channels 0 and 1. +Notice that there is no csrow1, which indicates that csrow0 is composed +of a single ranked DIMMs. This should also apply in both Channels, in +order to have dual-channel mode be operational. Since both csrow2 and +csrow3 are populated, this indicates a dual ranked set of DIMMs for +channels 0 and 1. -Within each of the 'mcX' and 'csrowX' directories are several -EDAC control and attribute files. +Within each of the 'mcX' and 'csrowX' directories are several EDAC +control and attribute files. -============================================================================ -'mcX' DIRECTORIES +'mcX' directories +----------------- In 'mcX' directories are EDAC control and attribute files for this 'X' instance of the memory controllers. @@ -238,13 +204,14 @@ For a description of the sysfs API, please see: Documentation/ABI/testing/sysfs-devices-edac -============================================================================ -'csrowX' DIRECTORIES -When CONFIG_EDAC_LEGACY_SYSFS is enabled, the sysfs will contain the -csrowX directories. As this API doesn't work properly for Rambus, FB-DIMMs -and modern Intel Memory Controllers, this is being deprecated in favor -of dimmX directories. +'csrowX' directories +-------------------- + +When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the csrowX +directories. As this API doesn't work properly for Rambus, FB-DIMMs and +modern Intel Memory Controllers, this is being deprecated in favor of +dimmX directories. In the 'csrowX' directories are EDAC control and attribute files for this 'X' instance of csrow: @@ -265,11 +232,11 @@ Total Correctable Errors count attribute file: 'ce_count' This attribute file displays the total count of correctable - errors that have occurred on this csrow. This - count is very important to examine. CEs provide early - indications that a DIMM is beginning to fail. This count - field should be monitored for non-zero values and report - such information to the system administrator. + errors that have occurred on this csrow. This count is very + important to examine. CEs provide early indications that a + DIMM is beginning to fail. This count field should be + monitored for non-zero values and report such information + to the system administrator. Total memory managed by this csrow attribute file: @@ -377,11 +344,13 @@ Channel 1 DIMM Label control file: motherboard specific and determination of this information must occur in userland at this time. -============================================================================ + + SYSTEM LOGGING +-------------- -If logging for UEs and CEs are enabled then system logs will have -error notices indicating errors that have been detected: +If logging for UEs and CEs is enabled, then system logs will contain +information indicating that errors have been detected: EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac @@ -404,24 +373,23 @@ The structure of the message is: and then an optional, driver-specific message that may have additional information. -Both UEs and CEs with no info will lack all but memory controller, -error type, a notice of "no info" and then an optional, -driver-specific error message. +Both UEs and CEs with no info will lack all but memory controller, error +type, a notice of "no info" and then an optional, driver-specific error +message. -============================================================================ PCI Bus Parity Detection +------------------------ - -On Header Type 00 devices the primary status is looked at -for any parity error regardless of whether Parity is enabled on the -device. (The spec indicates parity is generated in some cases). -On Header Type 01 bridges, the secondary status register is also -looked at to see if parity occurred on the bus on the other side of -the bridge. +On Header Type 00 devices, the primary status is looked at for any +parity error regardless of whether parity is enabled on the device or +not. (The spec indicates parity is generated in some cases). On Header +Type 01 bridges, the secondary status register is also looked at to see +if parity occurred on the bus on the other side of the bridge. SYSFS CONFIGURATION +------------------- Under /sys/devices/system/edac/pci are control and attribute files as follows: @@ -450,8 +418,9 @@ Parity Count: have been detected. -============================================================================ + MODULE PARAMETERS +----------------- Panic on UE control file: @@ -530,10 +499,8 @@ Panic on PCI PARITY Error: -======================================================================= - - -EDAC_DEVICE type of device +EDAC device type +---------------- In the header file, edac_core.h, there is a series of edac_device structures and APIs for the EDAC_DEVICE. @@ -573,6 +540,7 @@ The test_device_edac device adds at least one of its own custom control: The symlink points to the 'struct dev' that is registered for this edac_device. INSTANCES +--------- One or more instance directories are present. For the 'test_device_edac' case: @@ -586,6 +554,7 @@ counter in deeper subdirectories. ue_count total of UE events of subdirectories BLOCKS +------ At the lowest directory level is the 'block' directory. There can be 0, 1 or more blocks specified in each instance. @@ -623,8 +592,9 @@ unique drivers for their hardware systems. The 'test_device_edac' sample driver is located at the bluesmoke.sourceforge.net project site for EDAC. -======================================================================= + NEHALEM USAGE OF EDAC APIs +-------------------------- This chapter documents some EXPERIMENTAL mappings for EDAC API to handle Nehalem EDAC driver. They will likely be changed on future versions @@ -773,3 +743,20 @@ exports one by the driver. Since, with udimm, this is counted by software, it is possible that some errors could be lost. With rdimm's, they display the contents of the registers + +CREDITS: +======== + +Written by Doug Thompson +7 Dec 2005 +17 Jul 2007 Updated + +(c) Mauro Carvalho Chehab +05 Aug 2009 Nehalem interface + +EDAC authors/maintainers: + + Doug Thompson, Dave Jiang, Dave Peterson et al, + Mauro Carvalho Chehab + Borislav Petkov + original author: Thayne Harbaugh -- cgit v1.2.3