summaryrefslogtreecommitdiffstats
path: root/drivers/edac/amd64_edac.c
AgeCommit message (Collapse)AuthorFilesLines
2022-04-05x86/amd_nb: Unexport amd_cache_northbridges()Muralidhara M K1-1/+1
amd_cache_northbridges() is exported by amd_nb.c and is called by amd64-agp.c and amd64_edac.c modules at module_init() time so that NB descriptors are properly cached before those drivers can use them. However, the init_amd_nbs() initcall already does call amd_cache_northbridges() unconditionally and thus makes sure the NB descriptors are enumerated. That initcall is a fs_initcall type which is on the 5th group (starting from 0) of initcalls that gets run in increasing numerical order by the init code. The module_init() call is turned into an __initcall() in the MODULE=n case and those are device-level initcalls, i.e., group 6. Therefore, the northbridges caching is already finished by the time module initialization starts and thus the correct initialization order is retained. Unexport amd_cache_northbridges(), update dependent modules to call amd_nb_num() instead. While at it, simplify the checks in amd_cache_northbridges(). [ bp: Heavily massage and *actually* explain why the change is ok. ] Signed-off-by: Muralidhara M K <muralimk@amd.com> Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220324122729.221765-1-nchatrad@amd.com
2022-02-23EDAC/amd64: Add new register offset support and related changesYazen Ghannam1-16/+64
Introduce a "family flags" bitmask that can be used to indicate any special behavior needed on a per-family basis. Add a flag to indicate a system uses the new register offsets introduced with Family 19h Model 10h. Use this flag to account for register offset changes, a new bitfield indicating DDR5 use on a memory controller, and to set the proper number of chip select masks. Rework f17_addr_mask_to_cs_size() to properly handle the change in chip select masks. And update code comments to reflect the updated Chip Select, DIMM, and Mask relationships. [uninitialized variable warning] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: William Roche <william.roche@oracle.com> Link: https://lore.kernel.org/r/20220202144307.2678405-3-yazen.ghannam@amd.com
2022-02-23EDAC/amd64: Set memory type per DIMMYazen Ghannam1-12/+31
Current AMD systems allow mixing of DIMM types within a system. However, DIMMs within a channel, i.e. managed by a single Unified Memory Controller (UMC), must be of the same type. Handle this possible configuration by checking and setting the memory type for each individual "UMC" structure. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: William Roche <william.roche@oracle.com> Link: https://lore.kernel.org/r/20220202144307.2678405-2-yazen.ghannam@amd.com
2022-01-10Merge tag 'edac_updates_for_v5.17_rc1' of ↵Linus Torvalds1-1/+35
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add support for version 3 of the Synopsys DDR controller to synopsys_edac - Add support for DRR5 and new models 0x10-0x1f and 0x50-0x5f of AMD family 0x19 CPUs to amd64_edac - The usual set of fixes and cleanups * tag 'edac_updates_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/amd64: Add support for family 19h, models 50h-5fh EDAC/sb_edac: Remove redundant initialization of variable rc RAS/CEC: Remove a repeated 'an' in a comment EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh EDAC: Add RDDR5 and LRDDR5 memory types EDAC/sifive: Fix non-kernel-doc comment dt-bindings: memory: Add entry for version 3.80a EDAC/synopsys: Enable the driver on Intel's N5X platform EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR EDAC/synopsys: Use the quirk for version instead of ddr version
2021-12-24EDAC/amd64: Add support for family 19h, models 50h-5fhMarc Bevand1-0/+15
Add the new family 19h models 50h-5fh PCI IDs (device 18h functions 0 and 6) to support Ryzen 5000 APUs ("Cezanne"). Signed-off-by: Marc Bevand <m@zorinaq.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20211221233112.556927-1-m@zorinaq.com
2021-12-10EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFhYazen Ghannam1-1/+20
Add a new family type for AMD Family 19h Models 10h to 1Fh. Use this new family type for Models A0h to AFh also. Increase the maximum number of controllers from 8 to 12. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208174356.1997855-3-yazen.ghannam@amd.com
2021-11-15EDAC/amd64: Add context structYazen Ghannam1-42/+55
Define an address translation context struct. This will hold values that will be passed between multiple functions. Save return address, Node ID, and the Instance ID number to start. Currently, the UMC number is used as the Instance ID, but future DF versions may use another value. Also include a "tmp" field to use when reading registers. This is to avoid having to define a temporary variable in multiple functions. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211028175728.121452-5-yazen.ghannam@amd.com
2021-11-15EDAC/amd64: Allow for DF Indirect Broadcast readsYazen Ghannam1-8/+21
The DF Indirect Access method allows for "Broadcast" accesses in which case no specific instance is targeted. Add support using a reserved instance ID of 0xFF to indicate a broadcast access. Set the FICAA register appropriately. Define helpers functions for instance and broadcast reads and use them where appropriate. Drop the "amd_" prefix since these functions are all static. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211028175728.121452-4-yazen.ghannam@amd.com
2021-11-15x86/amd_nb, EDAC/amd64: Move DF Indirect Read to AMD64 EDACYazen Ghannam1-0/+50
df_indirect_read() is used only for address translation. Move it to EDAC along with the translation code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211028175728.121452-3-yazen.ghannam@amd.com
2021-11-15x86/MCE/AMD, EDAC/amd64: Move address translation to AMD64 EDACYazen Ghannam1-0/+199
The address translation code used for current AMD systems is non-architectural. So move it to EDAC. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211028175728.121452-2-yazen.ghannam@amd.com
2021-10-07EDAC/amd64: Handle three rank interleaving modeYazen Ghannam1-1/+21
AMD Rome systems and later support interleaving between three identical ranks within a channel. Check for this mode by counting the number of enabled chip selects and comparing their masks. If there are exactly three enabled chip selects and their masks are identical, then three rank interleaving is enabled. The size of a rank is determined from its mask value. However, three rank interleaving doesn't follow the method of swapping an interleave bit with the most significant bit. Rather, the interleave bit is flipped and the most significant bit remains the same. There is only a single interleave bit in this case. Account for this when determining the chip select size by keeping the most significant bit at its original value and ignoring any zero bits. This will return a full bitmask in [MSB:1]. Fixes: e53a3b267fb0 ("EDAC/amd64: Find Chip Select memory size using Address Mask") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211005154419.2060504-1-yazen.ghannam@amd.com
2021-07-13EDAC/amd64: Use DEVICE_ATTR helper macrosDwaipayan Ray1-13/+8
Instead of "open coding" DEVICE_ATTR, use the corresponding helper macros DEVICE_ATTR_{RW,RO,WO} in amd64_edac.c Some function names needed to be changed to match the device conventions <foo>_show and <foo>_store, but the functionality itself is unchanged. The devices using EDAC_DCT_ATTR_SHOW() are left unchanged. Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210713065130.2151-1-dwaipayanray1@gmail.com
2021-05-10x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFGBrijesh Singh1-1/+1
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8 name from it. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com
2021-01-22EDAC/amd64: Issue probing messages only on properly detected hardwareBorislav Petkov1-7/+7
amd64_edac was converted to CPU family autoprobing (from PCI device IDs) to not have to add a new PCI device ID each time a new platform is shipped but to support the whole family out-of-the-box. However, this caused a lot of noise in dmesg even when the machine doesn't have ECC DIMMs or ECC has been disabled in the BIOS: EDAC MC: Ver: 3.0.0 EDAC amd64: F17h detected (node 0). EDAC amd64: Node 0: DRAM ECC disabled. EDAC amd64: F17h detected (node 1). EDAC amd64: Node 1: DRAM ECC disabled. EDAC amd64: F17h detected (node 2). EDAC amd64: Node 2: DRAM ECC disabled. EDAC amd64: F17h detected (node 3). EDAC amd64: Node 3: DRAM ECC disabled. EDAC amd64: F17h detected (node 4). EDAC amd64: Node 4: DRAM ECC disabled. EDAC amd64: F17h detected (node 5). EDAC amd64: Node 5: DRAM ECC disabled. EDAC amd64: F17h detected (node 6). EDAC amd64: Node 6: DRAM ECC disabled. EDAC amd64: F17h detected (node 7). EDAC amd64: Node 7: DRAM ECC disabled. or even $ grep EDAC dmesg.log | sed 's/\[.*\] //' | sort | uniq -c 128 EDAC amd64: F17h detected (node 0). 128 EDAC amd64: Node 0: DRAM ECC disabled. 1 EDAC MC: Ver: 3.0.0 on a big machine. Yap, that's once per CPU for 128 of them. So move the init messages after all probing has succeeded to avoid unnecessary spew in dmesg. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210119164141.17417-1-bp@alien8.de
2020-12-28EDAC/amd64: Limit error injection functionality to supported hwBorislav Petkov1-3/+5
Families up to and including 0x16 allow access to the injection hardware. Starting with family 0x17, access to those registers is blocked by security policy. Limit that only on the families which support it. Suggested-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201222180013.GD13463@zn.tnic
2020-12-28EDAC/amd64: Merge error injection sysfs facilitiesBorislav Petkov1-4/+231
Merge them into the main driver and put them inside an EDAC_DEBUG ifdeffery to simplify the driver and have all debugging/injection stuff behind a debug build-time switch. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20201215110517.5215-2-bp@alien8.de
2020-12-28EDAC/amd64: Merge sysfs debugging attributes setup codeBorislav Petkov1-6/+59
There's no need for them to be in a separate file so merge them into the main driver compilation unit like the other EDAC drivers do. Drop now-unneeded function export, make the function static and shorten static function names. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20201215110517.5215-1-bp@alien8.de
2020-12-28EDAC/amd64: Tone down messages about missing PCI IDsYazen Ghannam1-4/+4
Give these messages a debug severity as they are really only useful to the module developers. Also, drop the "(broken BIOS?)" phrase, since this can cause churn for BIOS folks. The PCI IDs needed by the module, at least on modern systems, are fixed in hardware. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201215170131.8496-1-Yazen.Ghannam@amd.com
2020-12-28EDAC/amd64: Do not load on family 0x15, model 0x13Borislav Petkov1-3/+7
Those were only laptops and are very very unlikely to have ECC memory. Currently, when the driver attempts to load, it issues: EDAC amd64: Error: F1 not found: device 0x1601 (broken BIOS?) because the PCI device is the wrong one (it uses the F15h default one). So do not load the driver on them as that is pointless. Reported-by: Don Curtis <bugrprt21882@online.de> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Don Curtis <bugrprt21882@online.de> Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1179763 Link: https://lkml.kernel.org/r/20201218160622.20146-1-bp@alien8.de
2020-12-14Merge tag 'x86_cpu_for_v5.11' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: "Only AMD-specific changes this time: - Save the AMD physical die ID into cpuinfo_x86.cpu_die_id and convert all code to use it (Yazen Ghannam) - Remove a dead and unused TSEG region remapping workaround on AMD (Arvind Sankar)" * tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Remove dead code for TSEG region remapping x86/topology: Set cpu_die_id only if DIE_TYPE found EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId x86/CPU/AMD: Remove amd_get_nb_id() x86/CPU/AMD: Save AMD NodeId as cpu_die_id
2020-11-27EDAC/amd64: Fix PCI component registrationBorislav Petkov1-12/+14
In order to setup its PCI component, the driver needs any node private instance in order to get a reference to the PCI device and hand that into edac_pci_create_generic_ctl(). For convenience, it uses the 0th memory controller descriptor under the assumption that if any, the 0th will be always present. However, this assumption goes wrong when the 0th node doesn't have memory and the driver doesn't initialize an instance for it: EDAC amd64: F17h detected (node 0). ... EDAC amd64: Node 0: No DIMMs detected. But looking up node instances is not really needed - all one needs is the pointer to the proper device which gets discovered during instance init. So stash that pointer into a variable and use it when setting up the EDAC PCI component. Clear that variable when the driver needs to unwind due to some instances failing init to avoid any registration imbalance. Cc: <stable@vger.kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201122150815.13808-1-bp@alien8.de
2020-11-19x86/CPU/AMD: Remove amd_get_nb_id()Yazen Ghannam1-2/+2
The Last Level Cache ID is returned by amd_get_nb_id(). In practice, this value is the same as the AMD NodeId for callers of this function. The NodeId is saved in struct cpuinfo_x86.cpu_die_id. Replace calls to amd_get_nb_id() with the logical CPU's cpu_die_id and remove the function. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201109210659.754018-3-Yazen.Ghannam@amd.com
2020-10-26EDAC/amd64: Remove unneeded breaksTom Rix1-8/+0
A break is not needed if it is preceded by a return. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Robert Richter <rric@kernel.org> Link: https://lkml.kernel.org/r/20201019193524.13391-1-trix@redhat.com
2020-10-12Merge tag 'edac_updates_for_v5.10' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add Amazon's Annapurna Labs memory controller EDAC driver (Talel Shenhar) - New AMD CPUs support (Yazen Ghannam) - The usual misc fixes and cleanups all over the subsystem * tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location EDAC/aspeed: Use module_platform_driver() to simplify EDAC, sb_edac: Simplify switch statement EDAC/ti: Fix handling of platform_get_irq() error EDAC/aspeed: Fix handling of platform_get_irq() error EDAC/i5100: Fix error handling order in i5100_init_one() EDAC/highbank: Handover Calxeda Highbank maintenance to Andre Przywara EDAC/socfpga: Transfer SoCFPGA EDAC maintainership EDAC/thunderx: Make symbol lmc_dfs_ents static EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver dt-bindings: EDAC: Add Amazon's Annapurna Labs Memory Controller binding EDAC/mce_amd: Add new error descriptions for existing types EDAC: Replace HTTP links with HTTPS ones
2020-10-09EDAC/amd64: Set proper family type for Family 19h Models 20h-2FhYazen Ghannam1-0/+6
AMD Family 19h Models 20h-2Fh use the same PCI IDs as Family 17h Models 70h-7Fh. The same family ops and number of channels also apply. Use the Family17h Model 70h family_type and ops for Family 19h Models 20h-2Fh. Update the controller name to match the system. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201009171803.3214354-1-Yazen.Ghannam@amd.com
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-1/+1
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-18EDAC/amd64: Read back the scrub rate PCI register on F15hBorislav Petkov1-0/+2
Commit: da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") added support for F15h, model 0x60 CPUs but in doing so, missed to read back SCRCTRL PCI config register on F15h CPUs which are *not* model 0x60. Add that read so that doing $ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate can show the previously set DRAM scrub rate. Fixes: da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") Reported-by: Anders Andersson <pipatron@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> #v4.4.. Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com
2020-06-11Merge branch 'x86/entry' into ras/coreThomas Gleixner1-1/+1
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow up patches can be applied without creating a horrible merge conflict afterwards.
2020-05-29EDAC/amd64: Remove redundant assignment to variable ret in hw_info_get()Colin Ian King1-1/+1
The variable ret is being assigned with a value that is never read and it is being updated later with a new value. The initialization is redundant so remove it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200429154847.287001-1-colin.king@canonical.com
2020-05-22EDAC/amd64: Add AMD family 17h model 60h PCI IDsAlexander Monakov1-0/+14
Add support for AMD Renoir (4000-series Ryzen CPUs). Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20200510204842.2603-4-amonakov@ispras.ru
2020-04-14x86/mce/amd, edac: Remove report_gart_errorsBorislav Petkov1-8/+0
... because no one should be interested in spurious MCEs anyway. Make the filtering unconditional and move it to amd_filter_mce(). Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200407163414.18058-2-bp@alien8.de
2020-03-24EDAC: Convert to new X86 CPU match macrosThomas Gleixner1-7/+7
The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200320131509.673579000@linutronix.de
2020-01-27Merge branch 'ras-core-for-linus' of ↵Linus Torvalds1-26/+36
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Misc fixes to the MCE code all over the place, by Jan H. Schönherr. - Initial support for AMD F19h and other cleanups to amd64_edac, by Yazen Ghannam. - Other small cleanups. * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: EDAC/mce_amd: Make fam_ops static global EDAC/amd64: Drop some family checks for newer systems EDAC/amd64: Add family ops for Family 19h Models 00h-0Fh x86/amd_nb: Add Family 19h PCI IDs EDAC/mce_amd: Always load on SMCA systems x86/MCE/AMD, EDAC/mce_amd: Add new Load Store unit McaType x86/mce: Fix use of uninitialized MCE message string x86/mce: Fix mce=nobootlog x86/mce: Take action on UCNA/Deferred errors again x86/mce: Remove mce_inject_log() in favor of mce_log() x86/mce: Pass MCE message to mce_panic() on failed kernel recovery x86/mce/therm_throt: Mark throttle_active_work() as __maybe_unused
2020-01-17EDAC/amd64: Do not warn when removing instancesBorislav Petkov1-3/+0
On machines which do not populate all nodes with DIMMs, the driver doesn't initialize an instance there. However, the instance removal remove_one_instance() path will warn unconditionally, which is wrong. Remove the WARN_ON() even if the warning is innocent because it causes a splat in dmesg. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200117115939.5524-1-bp@alien8.de
2020-01-16EDAC/amd64: Drop some family checks for newer systemsYazen Ghannam1-26/+19
In general, "pvt->umc != NULL" is used to check if the system is Family 17h+. However, there are a few places that are using direct family checks. Replace the remaining family checks with a check for "pvt->umc != NULL". Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200110015651.14887-6-Yazen.Ghannam@amd.com
2020-01-16EDAC/amd64: Add family ops for Family 19h Models 00h-0FhYazen Ghannam1-0/+17
Add family ops to support AMD Family 19h systems. Existing Family 17h functions can be used. Also, add Family 19h to the list of families to automatically load the module. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200110015651.14887-5-Yazen.Ghannam@amd.com
2019-11-09EDAC/amd64: Get rid of the ECC disabled long messageBorislav Petkov1-16/+3
This message keeps flooding dmesg on boxes where ECC is disabled or the DIMMs do not support ECC but the module gets auto-probed. What's even worse is that autoprobing happens on every CPU due to the CPU-family matching the driver does and uevent being generated for each CPU device. What is more, this message is becoming even more useless on newer systems where forcing ECC is not recommended and it should be done in the BIOS so the BIOS can do all the necessary work, i.e., just setting a bit in an MSR is not enough anymore. So get rid of it. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20191106160607.GC28380@zn.tnic
2019-11-06EDAC/amd64: Check for memory before fully initializing an instanceYazen Ghannam1-3/+22
Return early before checking for ECC if the node does not have any populated memory. Free any cached hardware data before returning. Also, return 0 in this case since this is not a failure. Other nodes may have memory and the module should attempt to load an instance for them. Move printing of hardware information to after the instance is initialized, so that the information is only printed for nodes with memory. Return an error code when ECC is disabled. This check happens after checking for memory. The module should explicitly fail to load if memory is populated on a node and ECC is disabled. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-6-Yazen.Ghannam@amd.com
2019-11-06EDAC/amd64: Use cached data when checking for ECCYazen Ghannam1-12/+8
...now that the data is available earlier. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-5-Yazen.Ghannam@amd.com
2019-11-06EDAC/amd64: Save max number of controllers to family typeYazen Ghannam1-30/+14
The maximum number of memory controllers is fixed within a family/model group. In most cases, this has been fixed at 2, but some systems may have up to 8. The struct amd64_family_type already contains family/model-specific information, and this can be used rather than adding model checks to various functions. Create a new field in struct amd64_family_type for max_mcs. Set this when setting other family type information, and use this when needing the maximum number of memory controllers possible for a system. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-4-Yazen.Ghannam@amd.com
2019-11-06EDAC/amd64: Gather hardware information earlyYazen Ghannam1-50/+51
Split out gathering hardware information from init_one_instance() into a separate function hw_info_get(). This is necessary so that the information can be cached earlier and used to check if memory is populated and if ECC is enabled on a node. Also, define a function hw_info_put() to back out changes made in hw_info_get(). Check for an allocated PCI device (Function 0 for Family 17h or Function 1 for pre-Family 17h) before freeing, since hw_info_put() may be called before PCI siblings are reserved. Drop the family check when freeing pvt->umc. This will be NULL on pre-Family 17h systems. However, kfree() is safe and will check for a NULL pointer before freeing. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-3-Yazen.Ghannam@amd.com
2019-11-06EDAC/amd64: Make struct amd64_family_type globalYazen Ghannam1-7/+5
The struct amd64_family_type doesn't change between multiple nodes and instances of the module, so make it global. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106012448.243970-2-Yazen.Ghannam@amd.com
2019-10-25EDAC/amd64: Set grain per DIMMYazen Ghannam1-0/+2
The following commit introduced a warning on error reports without a non-zero grain value. 3724ace582d9 ("EDAC/mc: Fix grain_bits calculation") The amd64_edac_mod module does not provide a value, so the warning will be given on the first reported memory error. Set the grain per DIMM to cacheline size (64 bytes). This is the current recommendation. Fixes: 3724ace582d9 ("EDAC/mc: Fix grain_bits calculation") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191022203448.13962-7-Yazen.Ghannam@amd.com
2019-09-07EDAC/amd64: Add PCI device IDs for family 17h, model 70hIsaac Vaughn1-0/+13
Add the new Family 17h Model 70h PCI IDs (device 18h functions 0 and 6) to the AMD64 EDAC module. [ bp: s/f17_base_addr_to_cs_size/f17_addr_mask_to_cs_size/g ] Signed-off-by: Isaac Vaughn <isaac.vaughn@knights.ucf.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: James Morse <james.morse@arm.com> Cc: linux-edac@vger.kernel.org Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190906192131.8ced0ca112146f32d82b6cae@knights.ucf.edu
2019-08-23EDAC/amd64: Support asymmetric dual-rank DIMMsYazen Ghannam1-3/+13
Future AMD systems will support asymmetric dual-rank DIMMs. These are DIMMs where the ranks are of different sizes. The even rank will use the Primary Even Chip Select registers and the odd rank will use the Secondary Odd Chip Select registers. Recognize if a Secondary Odd Chip Select is being used. Use the Secondary Odd Address Mask when calculating the chip select size. [ bp: move csrow_sec_enabled() to the header, fix CS_ODD define and tone-down the capitalized words spelling. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-8-Yazen.Ghannam@amd.com
2019-08-23EDAC/amd64: Cache secondary Chip Select registersYazen Ghannam1-3/+20
AMD Family 17h systems have a set of secondary Chip Select Base Addresses and Address Masks. These do not represent unique Chip Selects, rather they are used in conjunction with the primary Chip Select registers in certain cases. Cache these secondary Chip Select registers for future use. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-7-Yazen.Ghannam@amd.com
2019-08-23EDAC/amd64: Decode syndrome before translating addressYazen Ghannam1-7/+7
AMD Family 17h systems currently require address translation in order to report the system address of a DRAM ECC error. This is currently done before decoding the syndrome information. The syndrome information does not depend on the address translation, so the proper EDAC csrow/channel reporting can function without the address. However, the syndrome information will not be decoded if the address translation fails. Decode the syndrome information before doing the address translation. The syndrome information is architecturally defined in MCA_SYND and can be considered robust. The address translation is system-specific and may fail on newer systems without proper updates to the translation algorithm. Fixes: 713ad54675fd ("EDAC, amd64: Define and register UMC error decode function") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-6-Yazen.Ghannam@amd.com
2019-08-23EDAC/amd64: Find Chip Select memory size using Address MaskYazen Ghannam1-44/+70
Chip Select memory size reporting on AMD Family 17h was recently fixed in order to account for interleaving. However, the current method is not robust. The Chip Select Address Mask can be used to find the memory size. There are a couple of cases. 1) For single-rank and dual-rank non-interleaved, use the address mask plus 1 as the size. 2) For dual-rank interleaved, do #1 but "de-interleave" the address mask first. Always "de-interleave" the address mask in order to simplify the code flow. Bit mask manipulation is necessary to check for interleaving, so just go ahead and do the de-interleaving. In the non-interleaved case, the original and de-interleaved address masks will be the same. To de-interleave the mask, count the number of zero bits in the middle of the mask and swap them with the most significant bits. For example, Original=0xFFFF9FE, De-interleaved=0x3FFFFFE Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-5-Yazen.Ghannam@amd.com
2019-08-23EDAC/amd64: Initialize DIMM info for systems with more than two channelsYazen Ghannam1-14/+52
Currently, the DIMM info for AMD Family 17h systems is initialized in init_csrows(). This function is shared with legacy systems, and it has a limit of two channel support. This prevents initialization of the DIMM info for a number of ranks, so there will be missing ranks in the EDAC sysfs. Create a new init_csrows_df() for Family17h+ and revert init_csrows() back to pre-Family17h support. Loop over all channels in the new function in order to support systems with more than two channels. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-4-Yazen.Ghannam@amd.com
2019-08-23EDAC/amd64: Recognize DRAM device type ECC capabilityYazen Ghannam1-2/+12
AMD Family 17h systems support x4 and x16 DRAM devices. However, the device type is not checked when setting mci.edac_ctl_cap. Set the appropriate capability flag based on the device type. Default to x8 DRAM device when neither the x4 or x16 bits are set. [ bp: reverse cpk_en check to save an indentation level. ] Fixes: 2d09d8f301f5 ("EDAC, amd64: Determine EDAC MC capabilities on Fam17h") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-3-Yazen.Ghannam@amd.com