summaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel/cpu
AgeCommit message (Collapse)AuthorFilesLines
2020-12-20Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds3-6/+8
Pull KVM updates from Paolo Bonzini: "Much x86 work was pushed out to 5.12, but ARM more than made up for it. ARM: - PSCI relay at EL2 when "protected KVM" is enabled - New exception injection code - Simplification of AArch32 system register handling - Fix PMU accesses when no PMU is enabled - Expose CSV3 on non-Meltdown hosts - Cache hierarchy discovery fixes - PV steal-time cleanups - Allow function pointers at EL2 - Various host EL2 entry cleanups - Simplification of the EL2 vector allocation s390: - memcg accouting for s390 specific parts of kvm and gmap - selftest for diag318 - new kvm_stat for when async_pf falls back to sync x86: - Tracepoints for the new pagetable code from 5.10 - Catch VFIO and KVM irqfd events before userspace - Reporting dirty pages to userspace with a ring buffer - SEV-ES host support - Nested VMX support for wait-for-SIPI activity state - New feature flag (AVX512 FP16) - New system ioctl to report Hyper-V-compatible paravirtualization features Generic: - Selftest improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits) KVM: SVM: fix 32-bit compilation KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting KVM: SVM: Provide support to launch and run an SEV-ES guest KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests KVM: SVM: Provide support for SEV-ES vCPU loading KVM: SVM: Provide support for SEV-ES vCPU creation/loading KVM: SVM: Update ASID allocation to support SEV-ES guests KVM: SVM: Set the encryption mask for the SVM host save area KVM: SVM: Add NMI support for an SEV-ES guest KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest KVM: SVM: Do not report support for SMM for an SEV-ES guest KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES KVM: SVM: Add support for CR8 write traps for an SEV-ES guest KVM: SVM: Add support for CR4 write traps for an SEV-ES guest KVM: SVM: Add support for CR0 write traps for an SEV-ES guest KVM: SVM: Add support for EFER write traps for an SEV-ES guest KVM: SVM: Support string IO operations for an SEV-ES guest KVM: SVM: Support MMIO for an SEV-ES guest KVM: SVM: Create trace events for VMGEXIT MSR protocol processing KVM: SVM: Create trace events for VMGEXIT processing ...
2020-12-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+1
Merge misc updates from Andrew Morton: - a few random little subsystems - almost all of the MM patches which are staged ahead of linux-next material. I'll trickle to post-linux-next work in as the dependents get merged up. Subsystems affected by this patch series: kthread, kbuild, ide, ntfs, ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation, kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction, oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc, uaccess, zram, and cleanups). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits) mm: cleanup kstrto*() usage mm: fix fall-through warnings for Clang mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at mm: shmem: convert shmem_enabled_show to use sysfs_emit_at mm:backing-dev: use sysfs_emit in macro defining functions mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening mm: use sysfs_emit for struct kobject * uses mm: fix kernel-doc markups zram: break the strict dependency from lzo zram: add stat to gather incompressible pages since zram set up zram: support page writeback mm/process_vm_access: remove redundant initialization of iov_r mm/zsmalloc.c: rework the list_add code in insert_zspage() mm/zswap: move to use crypto_acomp API for hardware acceleration mm/zswap: fix passing zero to 'PTR_ERR' warning mm/zswap: make struct kernel_param_ops definitions const userfaultfd/selftests: hint the test runner on required privilege userfaultfd/selftests: fix retval check for userfaultfd_open() userfaultfd/selftests: always dump something in modes userfaultfd: selftests: make __{s,u}64 format specifiers portable ...
2020-12-15mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aioDmitry Safonov1-1/+1
As kernel expect to see only one of such mappings, any further operations on the VMA-copy may be unexpected by the kernel. Maybe it's being on the safe side, but there doesn't seem to be any expected use-case for this, so restrict it now. Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()") Signed-off-by: Dmitry Safonov <dima@arista.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-14Merge tag 'x86-apic-2020-12-14' of ↵Linus Torvalds1-0/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Thomas Gleixner: "Yet another large set of x86 interrupt management updates: - Simplification and distangling of the MSI related functionality - Let IO/APIC construct the RTE entries from an MSI message instead of having IO/APIC specific code in the interrupt remapping drivers - Make the retrieval of the parent interrupt domain (vector or remap unit) less hardcoded and use the relevant irqdomain callbacks for selection. - Allow the handling of more than 255 CPUs without a virtualized IOMMU when the hypervisor supports it. This has made been possible by the above modifications and also simplifies the existing workaround in the HyperV specific virtual IOMMU. - Cleanup of the historical timer_works() irq flags related inconsistencies" * tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits) x86/ioapic: Cleanup the timer_works() irqflags mess iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select() iommu/amd: Fix IOMMU interrupt generation in X2APIC mode iommu/amd: Don't register interrupt remapping irqdomain when IR is disabled iommu/amd: Fix union of bitfields in intcapxt support x86/ioapic: Correct the PCI/ISA trigger type selection x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected iommu/hyper-v: Disable IRQ pseudo-remapping if 15 bit APIC IDs are available x86/apic: Support 15 bits of APIC ID in MSI where available x86/ioapic: Handle Extended Destination ID field in RTE iommu/vt-d: Simplify intel_irq_remapping_select() x86: Kill all traces of irq_remapping_get_irq_domain() x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain iommu/hyper-v: Implement select() method on remapping irqdomain iommu/vt-d: Implement select() method on remapping irqdomain iommu/amd: Implement select() method on remapping irqdomain x86/apic: Add select() method on vector irqdomain ...
2020-12-14Merge tag 'core-rcu-2020-12-14' of ↵Linus Torvalds2-3/+15
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Thomas Gleixner: "RCU, LKMM and KCSAN updates collected by Paul McKenney. RCU: - Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs - Lockdep-RCU updates reducing the need for __maybe_unused - Tasks-RCU updates - Miscellaneous fixes - Documentation updates - Torture-test updates KCSAN: - updates for selftests, avoiding setting watchpoints on NULL pointers - fix to watchpoint encoding LKMM: - updates for documentation along with some updates to example-code litmus tests" * tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) srcu: Take early exit on memory-allocation failure rcu/tree: Defer kvfree_rcu() allocation to a clean context rcu: Do not report strict GPs for outgoing CPUs rcu: Fix a typo in rcu_blocking_is_gp() header comment rcu: Prevent lockdep-RCU splats on lock acquisition/release rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs rcu,ftrace: Fix ftrace recursion rcu/tree: Make struct kernel_param_ops definitions const rcu/tree: Add a warning if CPU being onlined did not report QS already rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config rcu: Fix single-CPU check in rcu_blocking_is_gp() rcu: Implement rcu_segcblist_is_offloaded() config dependent list.h: Update comment to explicitly note circular lists rcu: Panic after fixed number of stalls x86/smpboot: Move rcu_cpu_starting() earlier rcu: Allow rcu_irq_enter_check_tick() from NMI tools/memory-model: Label MP tests' producers and consumers tools/memory-model: Use "buf" and "flag" for message-passing tests tools/memory-model: Add types to litmus tests tools/memory-model: Add a glossary of LKMM terms ...
2020-12-14Merge tag 'core-entry-2020-12-14' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core entry/exit updates from Thomas Gleixner: "A set of updates for entry/exit handling: - More generalization of entry/exit functionality - The consolidation work to reclaim TIF flags on x86 and also for non-x86 specific TIF flags which are solely relevant for syscall related work and have been moved into their own storage space. The x86 specific part had to be merged in to avoid a major conflict. - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal delivery mode of task work and results in an impressive performance improvement for io_uring. The non-x86 consolidation of this is going to come seperate via Jens. - The selective syscall redirection facility which provides a clean and efficient way to support the non-Linux syscalls of WINE by catching them at syscall entry and redirecting them to the user space emulation. This can be utilized for other purposes as well and has been designed carefully to avoid overhead for the regular fastpath. This includes the core changes and the x86 support code. - Simplification of the context tracking entry/exit handling for the users of the generic entry code which guarantee the proper ordering and protection. - Preparatory changes to make the generic entry code accomodate S390 specific requirements which are mostly related to their syscall restart mechanism" * tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) entry: Add syscall_exit_to_user_mode_work() entry: Add exit_to_user_mode() wrapper entry_Add_enter_from_user_mode_wrapper entry: Rename exit_to_user_mode() entry: Rename enter_from_user_mode() docs: Document Syscall User Dispatch selftests: Add benchmark for syscall user dispatch selftests: Add kselftest for syscall user dispatch entry: Support Syscall User Dispatch on common syscall entry kernel: Implement selective syscall userspace redirection signal: Expose SYS_USER_DISPATCH si_code type x86: vdso: Expose sigreturn address on vdso to the kernel MAINTAINERS: Add entry for common entry code entry: Fix boot for !CONFIG_GENERIC_ENTRY x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs sched: Detect call to schedule from critical entry code context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK x86: Reclaim unused x86 TI flags ...
2020-12-14Merge tag 'x86_cache_for_v5.11' of ↵Linus Torvalds4-15/+95
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cache resource control updates from Borislav Petkov: - add logic to correct MBM total and local values fixing errata SKX99 and BDF102 (Fenghua Yu) - cleanups * tag 'x86_cache_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Clean up unused function parameter in rmdir path x86/resctrl: Constify kernfs_ops x86/resctrl: Correct MBM total and local values Documentation/x86: Rename resctrl_ui.rst and add two errata to the file
2020-12-14Merge tag 'x86_cleanups_for_v5.11' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: "Another branch with a nicely negative diffstat, just the way I like 'em: - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in the end (Gabriel Krisman Bertazi) - All kinds of minor cleanups all over the tree" * tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86/ia32_signal: Propagate __user annotation properly x86/alternative: Update text_poke_bp() kernel-doc comment x86/PCI: Make a kernel-doc comment a normal one x86/asm: Drop unused RDPID macro x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg x86/head64: Remove duplicate include x86/mm: Declare 'start' variable where it is used x86/head/64: Remove unused GET_CR2_INTO() macro x86/boot: Remove unused finalize_identity_maps() x86/uaccess: Document copy_from_user_nmi() x86/dumpstack: Make show_trace_log_lvl() static x86/mtrr: Fix a kernel-doc markup x86/setup: Remove unused MCA variables x86, libnvdimm/test: Remove COPY_MC_TEST x86: Reclaim TIF_IA32 and TIF_X32 x86/mm: Convert mmu context ia32_compat into a proper flags field x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages() elf: Expose ELF header on arch_setup_additional_pages() x86/elf: Use e_machine to select start_thread for x32 elf: Expose ELF header in compat_start_thread() ...
2020-12-14Merge tag 'x86_cpu_for_v5.11' of ↵Linus Torvalds6-69/+26
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: "Only AMD-specific changes this time: - Save the AMD physical die ID into cpuinfo_x86.cpu_die_id and convert all code to use it (Yazen Ghannam) - Remove a dead and unused TSEG region remapping workaround on AMD (Arvind Sankar)" * tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Remove dead code for TSEG region remapping x86/topology: Set cpu_die_id only if DIE_TYPE found EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId x86/CPU/AMD: Remove amd_get_nb_id() x86/CPU/AMD: Save AMD NodeId as cpu_die_id
2020-12-14Merge tag 'x86_sgx_for_v5.11' of ↵Linus Torvalds12-1/+3229
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SGC support from Borislav Petkov: "Intel Software Guard eXtensions enablement. This has been long in the making, we were one revision number short of 42. :) Intel SGX is new hardware functionality that can be used by applications to populate protected regions of user code and data called enclaves. Once activated, the new hardware protects enclave code and data from outside access and modification. Enclaves provide a place to store secrets and process data with those secrets. SGX has been used, for example, to decrypt video without exposing the decryption keys to nosy debuggers that might be used to subvert DRM. Software has generally been rewritten specifically to run in enclaves, but there are also projects that try to run limited unmodified software in enclaves. Most of the functionality is concentrated into arch/x86/kernel/cpu/sgx/ except the addition of a new mprotect() hook to control enclave page permissions and support for vDSO exceptions fixup which will is used by SGX enclaves. All this work by Sean Christopherson, Jarkko Sakkinen and many others" * tag 'x86_sgx_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) x86/sgx: Return -EINVAL on a zero length buffer in sgx_ioc_enclave_add_pages() x86/sgx: Fix a typo in kernel-doc markup x86/sgx: Fix sgx_ioc_enclave_provision() kernel-doc comment x86/sgx: Return -ERESTARTSYS in sgx_ioc_enclave_add_pages() selftests/sgx: Use a statically generated 3072-bit RSA key x86/sgx: Clarify 'laundry_list' locking x86/sgx: Update MAINTAINERS Documentation/x86: Document SGX kernel architecture x86/sgx: Add ptrace() support for the SGX driver x86/sgx: Add a page reclaimer selftests/x86: Add a selftest for SGX x86/vdso: Implement a vDSO for Intel SGX enclave call x86/traps: Attempt to fixup exceptions in vDSO before signaling x86/fault: Add a helper function to sanitize error code x86/vdso: Add support for exception fixup in vDSO functions x86/sgx: Add SGX_IOC_ENCLAVE_PROVISION x86/sgx: Add SGX_IOC_ENCLAVE_INIT x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES x86/sgx: Add SGX_IOC_ENCLAVE_CREATE x86/sgx: Add an SGX misc driver interface ...
2020-12-14Merge tag 'x86_microcode_update_for_v5.11' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode loader update from Borislav Petkov: "This one wins the award for most boring pull request ever. But that's a good thing - this is how I like 'em and the microcode loader *should* be boring. :-) A single cleanup removing "break" after a return statement (Tom Rix)" * tag 'x86_microcode_update_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode/amd: Remove unneeded break
2020-12-14Merge tag 'ras_updates_for_v5.11' of ↵Linus Torvalds3-27/+98
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Borislav Petkov: - Enable additional logging mode on older Xeons (Tony Luck) - Pass error records logged by firmware through the MCE decoding chain to provide human-readable error descriptions instead of raw values (Smita Koralahalli) - Some #MC handler fixes (Gabriele Paoloni) - The usual small fixes and cleanups all over. * tag 'ras_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Rename kill_it to kill_current_task x86/mce: Remove redundant call to irq_work_queue() x86/mce: Panic for LMCE only if mca_cfg.tolerant < 3 x86/mce: Move the mce_panic() call and 'kill_it' assignments to the right places x86/mce, cper: Pass x86 CPER through the MCA handling chain x86/mce: Use "safe" MSR functions when enabling additional error logging x86/mce: Correct the detection of invalid notifier priorities x86/mce: Assign boolean values to a bool variable x86/mce: Enable additional error logging on certain Intel CPUs x86/mce: Remove unneeded break
2020-12-14KVM: SVM: Add GHCB accessor functions for retrieving fieldsTom Lendacky1-6/+6
Update the GHCB accessor functions to add functions for retrieve GHCB fields by name. Update existing code to use the new accessor functions. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <664172c53a5fb4959914e1a45d88e805649af0ad.1607620209.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-14x86/cpu: Add VM page flush MSR availablility as a CPUID featureTom Lendacky1-0/+1
On systems that do not have hardware enforced cache coherency between encrypted and unencrypted mappings of the same physical page, the hypervisor can use the VM page flush MSR (0xc001011e) to flush the cache contents of an SEV guest page. When a small number of pages are being flushed, this can be used in place of issuing a WBINVD across all CPUs. CPUID 0x8000001f_eax[2] is used to determine if the VM page flush MSR is available. Add a CPUID feature to indicate it is supported and define the MSR. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <f1966379e31f9b208db5257509c4a089a87d33d0.1607620209.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-11x86: Enumerate AVX512 FP16 CPUID feature flagKyung Min Park1-0/+1
Enumerate AVX512 Half-precision floating point (FP16) CPUID feature flag. Compared with using FP32, using FP16 cut the number of bits required for storage in half, reducing the exponent from 8 bits to 5, and the mantissa from 23 bits to 10. Using FP16 also enables developers to train and run inference on deep learning models fast when all precision or magnitude (FP32) is not needed. A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23] is present. The AVX512 FP16 requires AVX512BW feature be implemented since the instructions for manipulating 32bit masks are associated with AVX512BW. The only in-kernel usage of this is kvm passthrough. The CPU feature flag is shown as "avx512_fp16" in /proc/cpuinfo. Signed-off-by: Kyung Min Park <kyung.min.park@intel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Message-Id: <20201208033441.28207-2-kyung.min.park@intel.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-12-10x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabledXiaochen Shen1-4/+2
The MBA software controller (mba_sc) is a feedback loop which periodically reads MBM counters and tries to restrict the bandwidth below a user-specified value. It tags along the MBM counter overflow handler to do the updates with 1s interval in mbm_update() and update_mba_bw(). The purpose of mbm_update() is to periodically read the MBM counters to make sure that the hardware counter doesn't wrap around more than once between user samplings. mbm_update() calls __mon_event_count() for local bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count() instead when mba_sc is enabled. __mon_event_count() will not be called for local bandwidth updating in MBM counter overflow handler, but it is still called when reading MBM local bandwidth counter file 'mbm_local_bytes', the call path is as below: rdtgroup_mondata_show() mon_event_read() mon_event_count() __mon_event_count() In __mon_event_count(), m->chunks is updated by delta chunks which is calculated from previous MSR value (m->prev_msr) and current MSR value. When mba_sc is enabled, m->chunks is also updated in mbm_update() by mistake by the delta chunks which is calculated from m->prev_bw_msr instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in the mba_sc feedback loop. When reading MBM local bandwidth counter file, m->chunks was changed unexpectedly by mbm_bw_count(). As a result, the incorrect local bandwidth counter which calculated from incorrect m->chunks is shown to the user. Fix this by removing incorrect m->chunks updating in mbm_bw_count() in MBM counter overflow handler, and always calling __mon_event_count() in mbm_update() to make sure that the hardware local bandwidth counter doesn't wrap around. Test steps: # Run workload with aggressive memory bandwidth (e.g., 10 GB/s) git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat && make ./tools/membw/membw -c 0 -b 10000 --read # Enable MBA software controller mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl # Create control group c1 mkdir /sys/fs/resctrl/c1 # Set MB throttle to 6 GB/s echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata # Write PID of the workload to tasks file echo `pidof membw` > /sys/fs/resctrl/c1/tasks # Read local bytes counters twice with 1s interval, the calculated # local bandwidth is not as expected (approaching to 6 GB/s): local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes` sleep 1 local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes` echo "local b/w (bytes/s):" `expr $local_2 - $local_1` Before fix: local b/w (bytes/s): 11076796416 After fix: local b/w (bytes/s): 5465014272 Fixes: ba0f26d8529c (x86/intel_rdt/mba_sc: Prepare for feedback loop) Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@intel.com
2020-12-08x86/cpu/amd: Remove dead code for TSEG region remappingArvind Sankar2-41/+0
Commit 26bfa5f89486 ("x86, amd: Cleanup init_amd") moved the code that remaps the TSEG region using 4k pages from init_amd() to bsp_init_amd(). However, bsp_init_amd() is executed well before the direct mapping is actually created: setup_arch() -> early_cpu_init() -> early_identify_cpu() -> this_cpu->c_bsp_init() -> bsp_init_amd() ... -> init_mem_mapping() So the change effectively disabled the 4k remapping, because pfn_range_is_mapped() is always false at this point. It has been over six years since the commit, and no-one seems to have noticed this, so just remove the code. The original code was also incomplete, since it doesn't check how large the TSEG address range actually is, so it might remap only part of it in any case. Hygon has copied the incorrect version, so the code has never run on it since the cpu support was added two years ago. Remove it from there as well. Committer notes: This workaround is incomplete anyway: 1. The code must check MSRC001_0113.TValid (SMM TSeg Mask MSR) first, to check whether the TSeg address range is enabled. 2. The code must check whether the range is not 2M aligned - if it is, there's nothing to work around. 3. In all the BIOSes tested, the TSeg range is in a e820 reserved area and those are not mapped anymore, after 66520ebc2df3 ("x86, mm: Only direct map addresses that are marked as E820_RAM") which means, there's nothing to be worked around either. So let's rip it out. Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201127171324.1846019-1-nivedita@alum.mit.edu
2020-12-03x86/sgx: Return -EINVAL on a zero length buffer in sgx_ioc_enclave_add_pages()Jarkko Sakkinen1-1/+1
The sgx_enclave_add_pages.length field is documented as * @length: length of the data (multiple of the page size) Fail with -EINVAL, when the caller gives a zero length buffer of data to be added as pages to an enclave. Right now 'ret' is returned as uninitialized in that case. [ bp: Flesh out commit message. ] Fixes: c6d26d370767 ("x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/linux-sgx/X8ehQssnslm194ld@mwanda/ Link: https://lkml.kernel.org/r/20201203183527.139317-1-jarkko@kernel.org
2020-12-01x86/mce: Rename kill_it to kill_current_taskGabriele Paoloni1-8/+8
Currently, if an MCE happens in user-mode or while the kernel is copying data from user space, 'kill_it' is used to check if execution of the interrupted task can be recovered or not; the flag name however is not very meaningful, hence rename it to match its goal. [ bp: Massage commit message, rename the queue_task_work() arg too. ] Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201127161819.3106432-6-gabriele.paoloni@intel.com
2020-12-01x86/mce: Remove redundant call to irq_work_queue()Gabriele Paoloni1-3/+0
Currently, __mc_scan_banks() in do_machine_check() does the following callchain: __mc_scan_banks()->mce_log()->irq_work_queue(&mce_irq_work). Hence, the call to irq_work_queue() below after __mc_scan_banks() seems redundant. Just remove it. Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20201127161819.3106432-5-gabriele.paoloni@intel.com
2020-12-01x86/mce: Panic for LMCE only if mca_cfg.tolerant < 3Gabriele Paoloni1-1/+1
Right now for LMCE, if no_way_out is set, mce_panic() is called regardless of mca_cfg.tolerant. This is not correct as, if mca_cfg.tolerant = 3, the code should never panic. Add that check. [ bp: use local ptr 'cfg'. ] Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20201127161819.3106432-4-gabriele.paoloni@intel.com
2020-12-01x86/mce: Move the mce_panic() call and 'kill_it' assignments to the right placesGabriele Paoloni1-11/+4
Right now, for local MCEs the machine calls panic(), if needed, right after lmce is set. For MCE broadcasting, mce_reign() takes care of calling mce_panic(). Hence: - improve readability by moving the conditional evaluation of tolerant up to when kill_it is set first; - move the mce_panic() call up into the statement where mce_end() fails. [ bp: Massage, remove comment in the mce_end() failure case because it is superfluous; use local ptr 'cfg' in both tests. ] Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20201127161819.3106432-3-gabriele.paoloni@intel.com
2020-12-01Merge tag 'v5.10-rc6' into ras/coreBorislav Petkov4-114/+75
Merge the -rc6 tag to pick up dependent changes. Signed-off-by: Borislav Petkov <bp@suse.de>
2020-12-01x86/resctrl: Clean up unused function parameter in rmdir pathXiaochen Shen1-10/+7
Commit fd8d9db3559a ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak") removed superfluous kernfs_get() calls in rdtgroup_ctrl_remove() and rdtgroup_rmdir_ctrl(). That change resulted in an unused function parameter to these two functions. Clean up the unused function parameter in rdtgroup_ctrl_remove(), rdtgroup_rmdir_mon() and their callers rdtgroup_rmdir_ctrl() and rdtgroup_rmdir(). Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/1606759618-13181-1-git-send-email-xiaochen.shen@intel.com
2020-12-01Merge tag 'v5.10-rc6' into x86/cacheBorislav Petkov4-114/+75
Merge -rc6 tag to pick up dependent changes. Signed-off-by: Borislav Petkov <bp@suse.de>
2020-12-01x86/resctrl: Fix AMD L3 QOS CDP enable/disableBabu Moger3-2/+14
When the AMD QoS feature CDP (code and data prioritization) is enabled or disabled, the CDP bit in MSR 0000_0C81 is written on one of the CPUs in an L3 domain (core complex). That is not correct - the CDP bit needs to be updated on all the logical CPUs in the domain. This was not spelled out clearly in the spec earlier. The specification has been updated and the updated document, "AMD64 Technology Platform Quality of Service Extensions Publication # 56375 Revision: 1.02 Issue Date: October 2020" is available now. Refer the section: Code and Data Prioritization. Fix the issue by adding a new flag arch_has_per_cpu_cfg in rdt_cache data structure. The documentation can be obtained at: https://developer.amd.com/wp-content/resources/56375.pdf Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 [ bp: Massage commit message. ] Fixes: 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature") Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/160675180380.15628.3309402017215002347.stgit@bmoger-ubuntu
2020-11-27x86/mce: Do not overwrite no_way_out if mce_end() failsGabriele Paoloni1-2/+4
Currently, if mce_end() fails, no_way_out - the variable denoting whether the machine can recover from this MCE - is determined by whether the worst severity that was found across the MCA banks associated with the current CPU, is of panic severity. However, at this point no_way_out could have been already set by mca_start() after looking at all severities of all CPUs that entered the MCE handler. If mce_end() fails, check first if no_way_out is already set and, if so, stick to it, otherwise use the local worst value. [ bp: Massage. ] Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201127161819.3106432-2-gabriele.paoloni@intel.com
2020-11-25x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpbAnand K Mistry1-2/+2
When spectre_v2_user={seccomp,prctl},ibpb is specified on the command line, IBPB is force-enabled and STIPB is conditionally-enabled (or not available). However, since 21998a351512 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.") the spectre_v2_user_ibpb variable is set to SPECTRE_V2_USER_{PRCTL,SECCOMP} instead of SPECTRE_V2_USER_STRICT, which is the actual behaviour. Because the issuing of IBPB relies on the switch_mm_*_ibpb static branches, the mitigations behave as expected. Since 1978b3a53a74 ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP") this discrepency caused the misreporting of IB speculation via prctl(). On CPUs with STIBP always-on and spectre_v2_user=seccomp,ibpb, prctl(PR_GET_SPECULATION_CTRL) would return PR_SPEC_PRCTL | PR_SPEC_ENABLE instead of PR_SPEC_DISABLE since both IBPB and STIPB are always on. It also allowed prctl(PR_SET_SPECULATION_CTRL) to set the IB speculation mode, even though the flag is ignored. Similarly, for CPUs without SMT, prctl(PR_GET_SPECULATION_CTRL) should also return PR_SPEC_DISABLE since IBPB is always on and STIBP is not available. [ bp: Massage commit message. ] Fixes: 21998a351512 ("x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.") Fixes: 1978b3a53a74 ("x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP") Signed-off-by: Anand K Mistry <amistry@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201110123349.1.Id0cbf996d2151f4c143c90f9028651a5b49a5908@changeid
2020-11-24x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leakXiaochen Shen1-7/+25
On resource group creation via a mkdir an extra kernfs_node reference is obtained by kernfs_get() to ensure that the rdtgroup structure remains accessible for the rdtgroup_kn_unlock() calls where it is removed on deletion. Currently the extra kernfs_node reference count is only dropped by kernfs_put() in rdtgroup_kn_unlock() while the rdtgroup structure is removed in a few other locations that lack the matching reference drop. In call paths of rmdir and umount, when a control group is removed, kernfs_remove() is called to remove the whole kernfs nodes tree of the control group (including the kernfs nodes trees of all child monitoring groups), and then rdtgroup structure is freed by kfree(). The rdtgroup structures of all child monitoring groups under the control group are freed by kfree() in free_all_child_rdtgrp(). Before calling kfree() to free the rdtgroup structures, the kernfs node of the control group itself as well as the kernfs nodes of all child monitoring groups still take the extra references which will never be dropped to 0 and the kernfs nodes will never be freed. It leads to reference count leak and kernfs_node_cache memory leak. For example, reference count leak is observed in these two cases: (1) mount -t resctrl resctrl /sys/fs/resctrl mkdir /sys/fs/resctrl/c1 mkdir /sys/fs/resctrl/c1/mon_groups/m1 umount /sys/fs/resctrl (2) mkdir /sys/fs/resctrl/c1 mkdir /sys/fs/resctrl/c1/mon_groups/m1 rmdir /sys/fs/resctrl/c1 The same reference count leak issue also exists in the error exit paths of mkdir in mkdir_rdt_prepare() and rdtgroup_mkdir_ctrl_mon(). Fix this issue by following changes to make sure the extra kernfs_node reference on rdtgroup is dropped before freeing the rdtgroup structure. (1) Introduce rdtgroup removal helper rdtgroup_remove() to wrap up kernfs_put() and kfree(). (2) Call rdtgroup_remove() in rdtgroup removal path where the rdtgroup structure is about to be freed by kfree(). (3) Call rdtgroup_remove() or kernfs_put() as appropriate in the error exit paths of mkdir where an extra reference is taken by kernfs_get(). Fixes: f3cbeacaa06e ("x86/intel_rdt/cqm: Add rmdir support") Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files") Fixes: 60cf5e101fd4 ("x86/intel_rdt: Add mkdir to resctrl file system") Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1604085088-31707-1-git-send-email-xiaochen.shen@intel.com
2020-11-24x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leakXiaochen Shen1-33/+2
Willem reported growing of kernfs_node_cache entries in slabtop when repeatedly creating and removing resctrl subdirectories as well as when repeatedly mounting and unmounting the resctrl filesystem. On resource group (control as well as monitoring) creation via a mkdir an extra kernfs_node reference is obtained to ensure that the rdtgroup structure remains accessible for the rdtgroup_kn_unlock() calls where it is removed on deletion. The kernfs_node reference count is dropped by kernfs_put() in rdtgroup_kn_unlock(). With the above explaining the need for one kernfs_get()/kernfs_put() pair in resctrl there are more places where a kernfs_node reference is obtained without a corresponding release. The excessive amount of reference count on kernfs nodes will never be dropped to 0 and the kernfs nodes will never be freed in the call paths of rmdir and umount. It leads to reference count leak and kernfs_node_cache memory leak. Remove the superfluous kernfs_get() calls and expand the existing comments surrounding the remaining kernfs_get()/kernfs_put() pair that remains in use. Superfluous kernfs_get() calls are removed from two areas: (1) In call paths of mount and mkdir, when kernfs nodes for "info", "mon_groups" and "mon_data" directories and sub-directories are created, the reference count of newly created kernfs node is set to 1. But after kernfs_create_dir() returns, superfluous kernfs_get() are called to take an additional reference. (2) kernfs_get() calls in rmdir call paths. Fixes: 17eafd076291 ("x86/intel_rdt: Split resource group removal in two") Fixes: 4af4a88e0c92 ("x86/intel_rdt/cqm: Add mount,umount support") Fixes: f3cbeacaa06e ("x86/intel_rdt/cqm: Add rmdir support") Fixes: d89b7379015f ("x86/intel_rdt/cqm: Add mon_data") Fixes: c7d9aac61311 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring") Fixes: 5dc1d5c6bac2 ("x86/intel_rdt: Simplify info and base file lists") Fixes: 60cf5e101fd4 ("x86/intel_rdt: Add mkdir to resctrl file system") Fixes: 4e978d06dedb ("x86/intel_rdt: Add "info" files to resctrl file system") Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Willem de Bruijn <willemb@google.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1604085053-31639-1-git-send-email-xiaochen.shen@intel.com
2020-11-24x86/sgx: Fix sgx_ioc_enclave_provision() kernel-doc commentBorislav Petkov1-1/+1
Fix ./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Function parameter or member \ 'encl' not described in 'sgx_ioc_enclave_provision' ./arch/x86/kernel/cpu/sgx/ioctl.c:666: warning: Excess function parameter \ 'enclave' description in 'sgx_ioc_enclave_provision' Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201123181922.0c009406@canb.auug.org.au
2020-11-21x86/mce, cper: Pass x86 CPER through the MCA handling chainSmita Koralahalli1-0/+61
The kernel uses ACPI Boot Error Record Table (BERT) to report fatal errors that occurred in a previous boot. The MCA errors in the BERT are reported using the x86 Processor Error Common Platform Error Record (CPER) format. Currently, the record prints out the raw MSR values and AMD relies on the raw record to provide MCA information. Extract the raw MSR values of MCA registers from the BERT and feed them into mce_log() to decode them properly. The implementation is SMCA-specific as the raw MCA register values are given in the register offset order of the SMCA address space. [ bp: Massage. ] [ Fix a build breakage in patch v1. ] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Punit Agrawal <punit1.agrawal@toshiba.co.jp> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lkml.kernel.org/r/20201119182938.151155-1-Smita.KoralahalliChannabasappa@amd.com
2020-11-19Merge branches 'cpuinfo.2020.11.06a', 'doc.2020.11.06a', ↵Paul E. McKenney1-2/+0
'fixes.2020.11.19b', 'lockdep.2020.11.02a', 'tasks.2020.11.06a' and 'torture.2020.11.06a' into HEAD cpuinfo.2020.11.06a: Speedups for /proc/cpuinfo. doc.2020.11.06a: Documentation updates. fixes.2020.11.19b: Miscellaneous fixes. lockdep.2020.11.02a: Lockdep-RCU updates to avoid "unused variable". tasks.2020.11.06a: Tasks-RCU updates. torture.2020.11.06a': Torture-test updates.
2020-11-19x86/smpboot: Move rcu_cpu_starting() earlierPaul E. McKenney1-2/+0
The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough in the CPU-hotplug onlining process, which results in lockdep splats as follows: ============================= WARNING: suspicious RCU usage 5.9.0+ #268 Not tainted ----------------------------- kernel/kprobes.c:300 RCU-list traversed in non-reader section!! other info that might help us debug this: RCU used illegally from offline CPU! rcu_scheduler_active = 1, debug_locks = 1 no locks held by swapper/1/0. stack backtrace: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x77/0x97 __is_insn_slot_addr+0x15d/0x170 kernel_text_address+0xba/0xe0 ? get_stack_info+0x22/0xa0 __kernel_text_address+0x9/0x30 show_trace_log_lvl+0x17d/0x380 ? dump_stack+0x77/0x97 dump_stack+0x77/0x97 __lock_acquire+0xdf7/0x1bf0 lock_acquire+0x258/0x3d0 ? vprintk_emit+0x6d/0x2c0 _raw_spin_lock+0x27/0x40 ? vprintk_emit+0x6d/0x2c0 vprintk_emit+0x6d/0x2c0 printk+0x4d/0x69 start_secondary+0x1c/0x100 secondary_startup_64_no_verify+0xb8/0xbb This is avoided by moving the call to rcu_cpu_starting up near the beginning of the start_secondary() function. Note that the raw_smp_processor_id() is required in order to avoid calling into lockdep before RCU has declared the CPU to be watched for readers. Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/ Reported-by: Qian Cai <cai@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19x86/resctrl: Constify kernfs_opsRikard Falkeborn2-3/+3
The only usage of the kf_ops field in the rftype struct is to pass it as argument to __kernfs_create_file(), which accepts a pointer to const. Make it a pointer to const. This makes it possible to make rdtgroup_kf_single_ops and kf_mondata_ops const, which allows the compiler to put them in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/20201110230228.801785-1-rikard.falkeborn@gmail.com
2020-11-19x86/topology: Set cpu_die_id only if DIE_TYPE foundYazen Ghannam1-2/+8
CPUID Leaf 0x1F defines a DIE_TYPE level (nb: ECX[8:15] level type == 0x5), but CPUID Leaf 0xB does not. However, detect_extended_topology() will set struct cpuinfo_x86.cpu_die_id regardless of whether a valid Die ID was found. Only set cpu_die_id if a DIE_TYPE level is found. CPU topology code may use another value for cpu_die_id, e.g. the AMD NodeId on AMD-based systems. Code ordering should be maintained so that the CPUID Leaf 0x1F Die ID value will take precedence on systems that may use another value. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201109210659.754018-5-Yazen.Ghannam@amd.com
2020-11-19x86/CPU/AMD: Remove amd_get_nb_id()Yazen Ghannam4-11/+5
The Last Level Cache ID is returned by amd_get_nb_id(). In practice, this value is the same as the AMD NodeId for callers of this function. The NodeId is saved in struct cpuinfo_x86.cpu_die_id. Replace calls to amd_get_nb_id() with the logical CPU's cpu_die_id and remove the function. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201109210659.754018-3-Yazen.Ghannam@amd.com
2020-11-19x86/CPU/AMD: Save AMD NodeId as cpu_die_idYazen Ghannam3-15/+13
AMD systems provide a "NodeId" value that represents a global ID indicating to which "Node" a logical CPU belongs. The "Node" is a physical structure equivalent to a Die, and it should not be confused with logical structures like NUMA nodes. Logical nodes can be adjusted based on firmware or other settings whereas the physical nodes/dies are fixed based on hardware topology. The NodeId value can be used when a physical ID is needed by software. Save the AMD NodeId to struct cpuinfo_x86.cpu_die_id. Use the value from CPUID or MSR as appropriate. Default to phys_proc_id otherwise. Do so for both AMD and Hygon systems. Drop the node_id parameter from cacheinfo_*_init_llc_id() as it is no longer needed. Update the x86 topology documentation. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201109210659.754018-2-Yazen.Ghannam@amd.com
2020-11-19x86/sgx: Return -ERESTARTSYS in sgx_ioc_enclave_add_pages()Jarkko Sakkinen1-1/+1
Return -ERESTARTSYS instead of -EINTR in sgx_ioc_enclave_add_pages() when interrupted before any pages have been processed. At this point ioctl can be obviously safely restarted. Reported-by: Haitao Huang <haitao.huang@intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201118213932.63341-1-jarkko@kernel.org
2020-11-18x86/sgx: Clarify 'laundry_list' lockingDave Hansen2-9/+20
Short Version: The SGX section->laundry_list structure is effectively thread-local, but declared next to some shared structures. Its semantics are clear as mud. Fix that. No functional changes. Compile tested only. Long Version: The SGX hardware keeps per-page metadata. This can provide things like permissions, integrity and replay protection. It also prevents things like having an enclave page mapped multiple times or shared between enclaves. But, that presents a problem for kexec()'d kernels (or any other kernel that does not run immediately after a hardware reset). This is because the last kernel may have been rude and forgotten to reset pages, which would trigger the "shared page" sanity check. To fix this, the SGX code "launders" the pages by running the EREMOVE instruction on all pages at boot. This is slow and can take a long time, so it is performed off in the SGX-specific ksgxd instead of being synchronous at boot. The init code hands the list of pages to launder in a per-SGX-section list: ->laundry_list. The only code to touch this list is the init code and ksgxd. This means that no locking is necessary for ->laundry_list. However, a lock is required for section->page_list, which is accessed while creating enclaves and by ksgxd. This lock (section->lock) is acquired by ksgxd while also processing ->laundry_list. It is easy to confuse the purpose of the locking as being for ->laundry_list and ->page_list. Rename ->laundry_list to ->init_laundry_list to make it clear that this is not normally used at runtime. Also add some comments clarifying the locking, and reorganize 'sgx_epc_section' to put 'lock' near the things it protects. Note: init_laundry_list is 128 bytes of wasted space at runtime. It could theoretically be dynamically allocated and then freed after the laundering process. But it would take nearly 128 bytes of extra instructions to do that. Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201116222531.4834-1-dave.hansen@intel.com
2020-11-18x86/sgx: Add ptrace() support for the SGX driverJarkko Sakkinen1-0/+111
Enclave memory is normally inaccessible from outside the enclave. This makes enclaves hard to debug. However, enclaves can be put in a debug mode when they are being built. In that mode, enclave data *can* be read and/or written by using the ENCLS[EDBGRD] and ENCLS[EDBGWR] functions. This is obviously only for debugging and destroys all the protections present with normal enclaves. But, enclaves know their own debug status and can adjust their behavior appropriately. Add a vm_ops->access() implementation which can be used to read and write memory inside debug enclaves. This is typically used via ptrace() APIs. [ bp: Massage. ] Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Jethro Beekman <jethro@fortanix.com> Link: https://lkml.kernel.org/r/20201112220135.165028-23-jarkko@kernel.org
2020-11-18x86/sgx: Add a page reclaimerJarkko Sakkinen6-27/+1134
Just like normal RAM, there is a limited amount of enclave memory available and overcommitting it is a very valuable tool to reduce resource use. Introduce a simple reclaim mechanism for enclave pages. In contrast to normal page reclaim, the kernel cannot directly access enclave memory. To get around this, the SGX architecture provides a set of functions to help. Among other things, these functions copy enclave memory to and from normal memory, encrypting it and protecting its integrity in the process. Implement a page reclaimer by using these functions. Picks victim pages in LRU fashion from all the enclaves running in the system. A new kernel thread (ksgxswapd) reclaims pages in the background based on watermarks, similar to normal kswapd. All enclave pages can be reclaimed, architecturally. But, there are some limits to this, such as the special SECS metadata page which must be reclaimed last. The page version array (used to mitigate replaying old reclaimed pages) is also architecturally reclaimable, but not yet implemented. The end result is that the vast majority of enclave pages are currently reclaimable. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Jethro Beekman <jethro@fortanix.com> Link: https://lkml.kernel.org/r/20201112220135.165028-22-jarkko@kernel.org
2020-11-18x86/sgx: Add SGX_IOC_ENCLAVE_PROVISIONJarkko Sakkinen3-1/+62
The whole point of SGX is to create a hardware protected place to do “stuff”. But, before someone is willing to hand over the keys to the castle , an enclave must often prove that it is running on an SGX-protected processor. Provisioning enclaves play a key role in providing proof. There are actually three different enclaves in play in order to make this happen: 1. The application enclave. The familiar one we know and love that runs the actual code that’s doing real work. There can be many of these on a single system, or even in a single application. 2. The quoting enclave (QE). The QE is mentioned in lots of silly whitepapers, but, for the purposes of kernel enabling, just pretend they do not exist. 3. The provisioning enclave. There is typically only one of these enclaves per system. Provisioning enclaves have access to a special hardware key. They can use this key to help to generate certificates which serve as proof that enclaves are running on trusted SGX hardware. These certificates can be passed around without revealing the special key. Any user who can create a provisioning enclave can access the processor-unique Provisioning Certificate Key which has privacy and fingerprinting implications. Even if a user is permitted to create normal application enclaves (via /dev/sgx_enclave), they should not be able to create provisioning enclaves. That means a separate permissions scheme is needed to control provisioning enclave privileges. Implement a separate device file (/dev/sgx_provision) which allows creating provisioning enclaves. This device will typically have more strict permissions than the plain enclave device. The actual device “driver” is an empty stub. Open file descriptors for this device will represent a token which allows provisioning enclave duty. This file descriptor can be passed around and ultimately given as an argument to the /dev/sgx_enclave driver ioctl(). [ bp: Touchups. ] Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-security-module@vger.kernel.org Link: https://lkml.kernel.org/r/20201112220135.165028-16-jarkko@kernel.org
2020-11-18x86/sgx: Add SGX_IOC_ENCLAVE_INITJarkko Sakkinen4-1/+230
Enclaves have two basic states. They are either being built and are malleable and can be modified by doing things like adding pages. Or, they are locked down and not accepting changes. They can only be run after they have been locked down. The ENCLS[EINIT] function induces the transition from being malleable to locked-down. Add an ioctl() that performs ENCLS[EINIT]. After this, new pages can no longer be added with ENCLS[EADD]. This is also the time where the enclave can be measured to verify its integrity. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Jethro Beekman <jethro@fortanix.com> Link: https://lkml.kernel.org/r/20201112220135.165028-15-jarkko@kernel.org
2020-11-18x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGESJarkko Sakkinen2-0/+285
SGX enclave pages are inaccessible to normal software. They must be populated with data by copying from normal memory with the help of the EADD and EEXTEND functions of the ENCLS instruction. Add an ioctl() which performs EADD that adds new data to an enclave, and optionally EEXTEND functions that hash the page contents and use the hash as part of enclave “measurement” to ensure enclave integrity. The enclave author gets to decide which pages will be included in the enclave measurement with EEXTEND. Measurement is very slow and has sometimes has very little value. For instance, an enclave _could_ measure every page of data and code, but would be slow to initialize. Or, it might just measure its code and then trust that code to initialize the bulk of its data after it starts running. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Jethro Beekman <jethro@fortanix.com> Link: https://lkml.kernel.org/r/20201112220135.165028-14-jarkko@kernel.org
2020-11-18x86/sgx: Add SGX_IOC_ENCLAVE_CREATEJarkko Sakkinen6-0/+154
Add an ioctl() that performs the ECREATE function of the ENCLS instruction, which creates an SGX Enclave Control Structure (SECS). Although the SECS is an in-memory data structure, it is present in enclave memory and is not directly accessible by software. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Jethro Beekman <jethro@fortanix.com> Link: https://lkml.kernel.org/r/20201112220135.165028-13-jarkko@kernel.org
2020-11-18x86/sgx: Add an SGX misc driver interfaceJarkko Sakkinen6-1/+345
Intel(R) SGX is a new hardware functionality that can be used by applications to set aside private regions of code and data called enclaves. New hardware protects enclave code and data from outside access and modification. Add a driver that presents a device file and ioctl API to build and manage enclaves. [ bp: Small touchups, remove unused encl variable in sgx_encl_find() as Reported-by: kernel test robot <lkp@intel.com> ] Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Jethro Beekman <jethro@fortanix.com> Link: https://lkml.kernel.org/r/20201112220135.165028-12-jarkko@kernel.org
2020-11-17x86/sgx: Add SGX page allocator functionsJarkko Sakkinen2-0/+68
Add functions for runtime allocation and free. This allocator and its algorithms are as simple as it gets. They do a linear search across all EPC sections and find the first free page. They are not NUMA-aware and only hand out individual pages. The SGX hardware does not support large pages, so something more complicated like a buddy allocator is unwarranted. The free function (sgx_free_epc_page()) implicitly calls ENCLS[EREMOVE], which returns the page to the uninitialized state. This ensures that the page is ready for use at the next allocation. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Jethro Beekman <jethro@fortanix.com> Link: https://lkml.kernel.org/r/20201112220135.165028-10-jarkko@kernel.org
2020-11-17x86/cpu/intel: Add a nosgx kernel parameterJarkko Sakkinen1-0/+9
Add a kernel parameter to disable SGX kernel support and document it. [ bp: Massage. ] Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Jethro Beekman <jethro@fortanix.com> Tested-by: Sean Christopherson <sean.j.christopherson@intel.com> Link: https://lkml.kernel.org/r/20201112220135.165028-9-jarkko@kernel.org
2020-11-17x86/cpu/intel: Detect SGX supportSean Christopherson1-1/+28
Kernel support for SGX is ultimately decided by the state of the launch control bits in the feature control MSR (MSR_IA32_FEAT_CTL). If the hardware supports SGX, but neglects to support flexible launch control, the kernel will not enable SGX. Enable SGX at feature control MSR initialization and update the associated X86_FEATURE flags accordingly. Disable X86_FEATURE_SGX (and all derivatives) if the kernel is not able to establish itself as the authority over SGX Launch Control. All checks are performed for each logical CPU (not just boot CPU) in order to verify that MSR_IA32_FEATURE_CONTROL is correctly configured on all CPUs. All SGX code in this series expects the same configuration from all CPUs. This differs from VMX where X86_FEATURE_VMX is intentionally cleared only for the current CPU so that KVM can provide additional information if KVM fails to load like which CPU doesn't support VMX. There’s not much the kernel or an administrator can do to fix the situation, so SGX neglects to convey additional details about these kinds of failures if they occur. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Jethro Beekman <jethro@fortanix.com> Link: https://lkml.kernel.org/r/20201112220135.165028-8-jarkko@kernel.org