path: root/arch/x86/kernel
Age / Commit message / Author / Files / Lines
2022-11-10x86/mtrr: Simplify mtrr_bp_init()Juergen Gross3-20/+1
In case of the generic cache interface being used (Intel CPUs or a 64-bit system), the initialization sequence of the boot CPU is more complicated than necessary:

- check if MTRR is enabled and, if yes, call mtrr_bp_pat_init() which will disable caching, set the PAT MSR, and reenable caching

- call mtrr_cleanup() and, in case that changed anything, call cache_cpu_init(), doing the same caching disable/enable dance as above, but this time setting the (modified) MTRR state (even if MTRR was disabled) AND setting the PAT MSR (again, even with disabled MTRR)

The sequence can be simplified a lot while removing potential inconsistencies:

- check if MTRR is enabled and, if yes, call mtrr_cleanup() and then cache_cpu_init()

This ensures that:

- caching is no longer disabled/enabled more than once

- MTRRs and/or the PAT MSR are not set on the boot processor during MTRR cleanups if MTRRs are meant to be disabled

With that, mtrr_bp_pat_init() can be removed.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-10-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
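A rough sketch of the simplified boot-CPU flow described above (control flow paraphrased from the text, not the exact diff):

	if (mtrr_enabled()) {
		/* Sanitize/modify the MTRR layout if needed ... */
		mtrr_cleanup(phys_addr);
		/*
		 * ... then do a single disable-caching / set-MTRRs /
		 * set-PAT / re-enable-caching cycle.
		 */
		cache_cpu_init();
	}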
2022-11-10x86/mtrr: Remove set_all callback from struct mtrr_opsJuergen Gross3-8/+5
Instead of using an indirect call to mtrr_if->set_all, just call the only possible target, cache_cpu_init(), directly. Remove the set_all function pointer from struct mtrr_ops.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-9-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
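The essence of the change, sketched (the call site shown is illustrative):

	/* Before: indirect call through the ops table */
	mtrr_if->set_all();

	/* After: the only possible target, called directly */
	cache_cpu_init();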
2022-11-10x86/mtrr: Disentangle MTRR init from PAT initJuergen Gross2-13/+19
Add a main cache_cpu_init() init routine which initializes MTRR and/or PAT support depending on what has been detected on the system.

Leave the MTRR-specific initialization in a MTRR-specific init function, where the smp_changes_mask setting happens now with caches disabled.

This global mask update was done with caches enabled before, probably because atomic operations while running uncached might have been quite expensive.

But since only systems with a broken BIOS should ever require setting any bit in smp_changes_mask, hurting those devices with a penalty of a few microseconds during boot shouldn't be a real issue.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-8-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10x86/mtrr: Move cache control code to cacheinfo.cJuergen Gross2-74/+77
Prepare making PAT and MTRR support independent from each other by moving some code needed by both out of the MTRR-specific sources.

[ bp: Massage commit message. ]

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-7-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10x86/mtrr: Split MTRR-specific handling from cache dis/enablingJuergen Gross1-7/+19
Split the MTRR-specific actions from cache_disable() and cache_enable() into new functions mtrr_disable() and mtrr_enable().

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-6-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10x86/mtrr: Rename prepare_set() and post_set()Juergen Gross1-22/+21
Rename the currently MTRR-specific functions prepare_set() and post_set() in preparation for moving them. Make them non-static and put their prototypes into cacheinfo.h, where they will end up after being moved to their final position anyway.

Expand the comment before the functions with an introductory line and rename two related static variables, too.

[ bp: Massage commit message. ]

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-5-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10x86/mtrr: Replace use_intel() with a local flagJuergen Gross4-18/+16
In MTRR code, use_intel() is used in only one source file, and the relevant use_intel_if member of struct mtrr_ops is set only in generic_mtrr_ops.

Replace use_intel() with a single flag in cacheinfo.c which can be set when assigning generic_mtrr_ops to mtrr_if. This allows dropping use_intel_if from mtrr_ops while preparing to decouple PAT from MTRR.

As another preparation for the PAT/MTRR decoupling, use one bit for MTRR control and one for PAT control. For now, set both bits together; this can be changed later.

As the new flag will be set only if mtrr_enabled is set, the test for mtrr_enabled can be dropped in some places.

[ bp: Massage commit message. ]

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-4-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-09x86/kvm: Remove unused virt to phys translation in kvm_guest_cpu_init()Rafael Mendonca1-1/+1
Presumably, this was introduced due to a conflict resolution with commit ef68017eb570 ("x86/kvm: Handle async page faults directly through do_page_fault()"), given that the last posted version [1] of the blamed commit was not based on the aforementioned commit.

[1] https://lore.kernel.org/kvm/20200525144125.143875-9-vkuznets@redhat.com/

Fixes: b1d405751cd5 ("KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Message-Id: <20221021020113.922027-1-rafaelmendsr@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09x86, KVM: remove unnecessary argument to x86_virt_spec_ctrl and callersPaolo Bonzini1-1/+1
x86_virt_spec_ctrl only deals with the paravirtualized MSR_IA32_VIRT_SPEC_CTRL now and does not handle MSR_IA32_SPEC_CTRL anymore; remove the corresponding, unused argument.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: SVM: move MSR_IA32_SPEC_CTRL save/restore to assemblyPaolo Bonzini1-10/+3
Restoration of the host IA32_SPEC_CTRL value is probably too late with respect to the return thunk training sequence.

With respect to the user/kernel boundary, AMD says, "If software chooses to toggle STIBP (e.g., set STIBP on kernel entry, and clear it on kernel exit), software should set STIBP to 1 before executing the return thunk training sequence." I assume the same requirements apply to the guest/host boundary. The return thunk training sequence is in vmenter.S, quite close to the VM-exit. On hosts without V_SPEC_CTRL, however, the host's IA32_SPEC_CTRL value is not restored until much later.

To avoid this, move the restoration of the host SPEC_CTRL to assembly and, for consistency, move the restoration of the guest SPEC_CTRL as well. This is not particularly difficult, apart from some care to cover both 32- and 64-bit, and to share code between SEV-ES and normal vmentry.

Cc: stable@vger.kernel.org
Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk")
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86: use a separate asm-offsets.c filePaolo Bonzini1-6/+0
This already removes an ugly #include "" from asm-offsets.c, but especially it avoids a future error when trying to define asm-offsets for KVM's svm/svm.h header.

This would not work for kernel/asm-offsets.c, because svm/svm.h includes kvm_cache_regs.h, which is not in the include path when compiling asm-offsets.c. The problem is not there if the .c file is in arch/x86/kvm.

Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09x86/fpu/xstate: Fix XSTATE_WARN_ON() to emit relevant diagnosticsAndrew Cooper1-6/+6
"XSAVE consistency problem" has been reported under Xen, but that's the extent of my divination skills. Modify XSTATE_WARN_ON() to force the caller to provide relevant diagnostic information, and modify each caller suitably. For check_xstate_against_struct(), this removes a double WARN() where one will do perfectly fine. CC stable as this has been wonky debugging for 7 years and it is good to have there too. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220810221909.12768-1-andrew.cooper3@citrix.com
2022-11-08x86/sgx: use VM_ACCESS_FLAGSKefeng Wang1-2/+2
Simplify VM_READ|VM_WRITE|VM_EXEC with VM_ACCESS_FLAGS.

Link: https://lkml.kernel.org/r/20221019034945.93081-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@gmail.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
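VM_ACCESS_FLAGS is defined in <linux/mm.h> as (VM_READ | VM_WRITE | VM_EXEC), so the substitution is mechanical; the surrounding check here is a made-up example:

	/* Before */
	if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
		return -EACCES;

	/* After */
	if (!(vma->vm_flags & VM_ACCESS_FLAGS))
		return -EACCES;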
2022-11-08x86/traps: avoid KMSAN bugs originating from handle_bug()Alexander Potapenko1-0/+7
There is a case in the exc_invalid_op handler that is executed outside the irqentry_enter()/irqentry_exit() region when a UD2 instruction is used to encode a call to __warn().

In that case, the `struct pt_regs` passed to the interrupt handler is never unpoisoned by KMSAN (this is normally done in irqentry_enter()), which leads to false positives inside handle_bug().

Use kmsan_unpoison_entry_regs() to explicitly unpoison those registers before using them.

Link: https://lkml.kernel.org/r/20221102110611.1085175-5-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
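Plausible shape of the fix (the function body is abbreviated; placement inside handle_bug() is assumed):

	static bool handle_bug(struct pt_regs *regs)
	{
		/*
		 * KMSAN unpoisons regs in irqentry_enter(), but this path
		 * can run before that, so unpoison explicitly before the
		 * first use of *regs.
		 */
		kmsan_unpoison_entry_regs(regs);

		/* ... decode the UD2 and report the warning as before ... */
	}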
2022-11-08x86: Fix misc small issuesJiapeng Chong2-3/+3
Fix:

  ./arch/x86/kernel/traps.c: asm/proto.h is included more than once.

  ./arch/x86/kernel/alternative.c:1610:2-3: Unneeded semicolon.

[ bp: Merge into a single patch. ]

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/1620902768-53822-1-git-send-email-jiapeng.chong@linux.alibaba.com
Link: https://lore.kernel.org/r/20220926054628.116957-1-jiapeng.chong@linux.alibaba.com
2022-11-08x86/sgx: Add overflow check in sgx_validate_offset_length()Borys Popławski1-0/+3
The sgx_validate_offset_length() function verifies the "offset" and "length" arguments provided by userspace, but was missing an overflow check on their addition. Add it.

Fixes: c6d26d370767 ("x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES")
Signed-off-by: Borys Popławski <borysp@invisiblethingslab.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Cc: stable@vger.kernel.org # v5.11+
Link: https://lore.kernel.org/r/0d91ac79-6d84-abed-5821-4dbe59fa1a38@invisiblethingslab.com
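The added guard presumably reduces to a classic unsigned wrap-around check, roughly (the return convention is assumed from the surrounding function):

	/* offset and length come from userspace; reject wrap-around */
	if (offset + length < offset)
		return -EINVAL;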
2022-11-04KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guestKai Huang2-0/+2
The new Asynchronous Exit (AEX) notification mechanism (AEX-notify) allows one enclave to receive a notification in the ERESUME after the enclave exit due to an AEX. EDECCSSA is a new SGX user leaf function (ENCLU[EDECCSSA]) to facilitate the AEX notification handling. The new EDECCSSA is enumerated via CPUID(EAX=0x12,ECX=0x0):EAX[11].

Besides allowing reporting of the new AEX-notify attribute to KVM guests, also allow reporting the new EDECCSSA user leaf function to KVM guests so the guest can fully utilize the AEX-notify mechanism.

Similar to the existing X86_FEATURE_SGX1 and X86_FEATURE_SGX2, introduce a new scattered X86_FEATURE_SGX_EDECCSSA bit for the new EDECCSSA, and report it in KVM's supported CPUIDs.

Note, no additional KVM enabling is required to allow the guest to use EDECCSSA. It's impossible to trap ENCLU (without completely preventing the guest from using SGX). Advertise EDECCSSA as supported purely so that userspace doesn't need to special-case EDECCSSA, i.e. doesn't need to manually check host CPUID.

The inability to trap ENCLU also means that KVM can't prevent the guest from using EDECCSSA, but that virtualization hole is benign as far as KVM is concerned. EDECCSSA is simply a fancy way to modify internal enclave state.

More background about how AEX-notify and EDECCSSA work:

SGX maintains a Current State Save Area Frame (CSSA) for each enclave thread. When an AEX happens, the enclave thread context is saved to the CSSA and the CSSA is increased by 1. For a normal ERESUME which doesn't deliver an AEX notification, it restores the saved thread context from the previously saved SSA and decreases the CSSA.

If AEX-notify is enabled for one enclave, the ERESUME acts differently. Instead of restoring the saved thread context and decreasing the CSSA, it acts like EENTER, which doesn't decrease the CSSA but establishes a clean-slate thread context using the CSSA for the enclave to handle the notification.

After some handling, the enclave must discard the newly-established SSA and switch back to the previously saved SSA (upon AEX). Otherwise, the enclave will run out of SSA space upon further AEXs and eventually fail to run.

To solve this problem, the new EDECCSSA essentially decreases the CSSA. It can be used by the enclave notification handler to switch back to the previously saved SSA when needed, i.e. after it handles the notification.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/all/20221101022422.858944-1-kai.huang%40intel.com
2022-11-04x86/sgx: Allow enclaves to use Asynchronous Exit NotificationDave Hansen1-1/+1
Short Version: Allow enclaves to use the new Asynchronous Exit (AEX) notification mechanism. This mechanism lets enclaves run a handler after an AEX event. These handlers can run mitigations for things like SGX-Step[1]. AEX Notify will be made available both on upcoming processors and on some older processors through microcode updates.

Long Version:

== SGX Attribute Background ==

The SGX architecture includes a list of SGX "attributes". These attributes ensure consistency and transparency around specific enclave features.

As a simple example, the "DEBUG" attribute allows an enclave to be debugged, but also destroys virtually all of SGX security. Using attributes, enclaves can know that they are being debugged. Attributes also affect enclave attestation, so an enclave can, for instance, be denied access to secrets while it is being debugged.

The kernel keeps a list of known attributes and will only initialize enclaves that use a known set of attributes. This kernel policy eliminates the chance that a new SGX attribute could cause undesired effects. For example, imagine a new attribute was added called "PROVISIONKEY2" that provided similar functionality to "PROVISIONKEY". A kernel policy that allowed indiscriminate use of unknown attributes, and thus PROVISIONKEY2, would undermine the existing kernel policy which limits use of PROVISIONKEY enclaves.

== AEX Notify Background ==

"Intel Architecture Instruction Set Extensions and Future Features - Version 45" is out[2]. There is a new chapter: Asynchronous Enclave Exit Notify and the EDECCSSA User Leaf Function.

Enclave exits can be either synchronous and consensual (EEXIT, for instance) or asynchronous (on an interrupt or fault). The asynchronous ones can evidently be exploited to single-step enclaves[1], on top of which other naughty things can be built.

AEX Notify will be made available both on upcoming processors and on some older processors through microcode updates.

== The Problem ==

These attacks are currently entirely opaque to the enclave since the hardware does the save/restore under the covers. The Asynchronous Enclave Exit Notify (AEX Notify) mechanism provides enclaves an ability to detect and mitigate potential exposure to these kinds of attacks.

== The Solution ==

Define the new attribute value for AEX Notification. Ensure the attribute is cleared from the list of reserved attributes. Instead of adding to the open-coded lists of individual attributes, add named lists of privileged (disallowed by default) and unprivileged (allowed by default) attributes. Add the AEX-notify attribute as an unprivileged attribute, which will keep the kernel from rejecting enclaves with it set.

1. https://github.com/jovanbulck/sgx-step
2. https://cdrdv2.intel.com/v1/dl/getContent/671368?explicitVersion=true

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/20220720191347.1343986-1-dave.hansen%40linux.intel.com
2022-11-03x86/intel_epb: Set Alder Lake N and Raptor Lake P normal EPBSrinivas Pandruvada1-1/+6
Intel processors support an additional software hint called EPB ("Energy Performance Bias") to guide the hardware heuristic of power management features to favor increasing dynamic performance or conserving energy consumption.

Since this EPB hint is processor specific, the same hint value can result in different behavior across generations of processors.

commit 4ecc933b7d1f ("x86: intel_epb: Allow model specific normal EPB value") introduced the capability to update the default power-up EPB based on the CPU model and updated the default EPB to 7 for Alder Lake mobile CPUs.

The same change is required for other Alder Lake-N and Raptor Lake-P mobile CPUs, as the current default of 6 results in higher uncore power consumption. This increase in power is related to the memory clock frequency setting based on the EPB value. Depending on the EPB, the minimum memory frequency is set by the firmware. At EPB = 7, the minimum memory frequency is 1/4th compared to EPB = 6. This results in significant power savings for idle and semi-idle workloads on a Chrome platform.

For example, the change in power and performance from an EPB change from 6 to 7 on Alder Lake-N:

  Workload       Performance diff (%)   Power diff
  --------------------------------------------------
  VP9 FHD30      0 (FPS)                -218 mW
  Google meet    0 (FPS)                -385 mW

This 200+ mW power saving is very significant for a mobile platform for battery life and thermal reasons.

But as the workload demands more memory bandwidth, the memory frequency will be increased very fast. There are no power savings for such busy workloads. For example:

  Workload                   Performance diff (%) from EPB 6 to 7
  ----------------------------------------------------------------
  Speedometer 2.0            -0.8
  WebGL Aquarium 10K Fish    -0.5
  Unity 3D 2018               0.2
  WebXPRT3                   -0.5

There are run-to-run variations in performance scores for such busy workloads, so the difference is not significant.

Add a new define ENERGY_PERF_BIAS_NORMAL_POWERSAVE for EPB 7 and use it for Alder Lake-N and Raptor Lake-P mobile CPUs.

This modification was done originally by Jeremy Compostella <jeremy.compostella@intel.com>.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/20221027220056.1534264-1-srinivas.pandruvada%40linux.intel.com
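A sketch of what the new define and its use in a model match table plausibly look like (the match-table macro follows the kernel's x86 conventions; treat the exact entries as illustrative):

	#define ENERGY_PERF_BIAS_NORMAL_POWERSAVE	7

	/* Models whose power-up EPB default should be 7 instead of 6 */
	static const struct x86_cpu_id intel_epb_normal[] = {
		X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE_N,
					   ENERGY_PERF_BIAS_NORMAL_POWERSAVE),
		X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE_P,
					   ENERGY_PERF_BIAS_NORMAL_POWERSAVE),
		{}
	};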
2022-11-02x86/microcode: Drop struct ucode_cpu_info.validBorislav Petkov2-3/+2
It is not needed anymore.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20221028142638.28498-6-bp@alien8.de
2022-11-02x86/microcode: Do some minor fixupsBorislav Petkov1-6/+5
Improve debugging printks and fix up formatting.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20221028142638.28498-5-bp@alien8.de
2022-11-02x86/microcode: Kill refresh_fwBorislav Petkov3-6/+4
request_microcode_fw() can always request firmware now, so drop this superfluous argument.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20221028142638.28498-4-bp@alien8.de
2022-11-02x86/microcode: Simplify init path even moreBorislav Petkov1-104/+16
Get rid of all the IPI-sending functions and their wrappers and use those which are supposed to be called on each CPU. Thus:

- microcode_init_cpu() gets called on each CPU on init, applying any new microcode that the driver might've found on the filesystem.

- mc_cpu_starting() simply tries to apply cached microcode, as this is the cpuhp starting callback which gets called on CPU resume too.

Even if the driver init function is a late initcall, there is no filesystem by then (not even a hard disk driver has been loaded yet), so a new firmware load attempt cannot simply be done. It is pointless anyway - for that, there's late loading if one really needs it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20221028142638.28498-3-bp@alien8.de
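Sketched registration of the starting callback via cpuhp (the state constant and the exact call are assumptions based on the loader's conventions, not the verbatim diff):

	cpuhp_setup_state_nocalls(CPUHP_AP_MICROCODE_LOADER,
				  "x86/microcode:starting",
				  mc_cpu_starting, NULL);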
2022-11-02x86/microcode: Rip out the subsys interface gunkBorislav Petkov1-58/+20
This is a left-over from the old days when CPU hotplug wasn't as robust as it is now.

Currently, microcode gets loaded early on the CPU init path and there's no need to attempt to load it again, which that subsys interface callback is doing.

The only other thing that the subsys interface init path was doing is adding the /sys/devices/system/cpu/cpu*/microcode/ hierarchy.

So add a function which gets called on each CPU after all the necessary driver setup has happened. Use schedule_on_each_cpu(), which can block, because the sysfs-creating code does kmem_cache_zalloc() which can block too. The initial version of this, which did that setup in an IPI handler of on_each_cpu(), can cause a deadlock of the sort:

  lock(fs_reclaim);
  <Interrupt>
    lock(fs_reclaim);

as the IPI handler runs in IRQ context.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20221028142638.28498-2-bp@alien8.de
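A minimal sketch of the described pattern (the helper name is hypothetical):

	static void setup_online_cpu(struct work_struct *work)
	{
		/*
		 * Runs in process context on each CPU, so blocking
		 * allocations such as the sysfs kmem_cache_zalloc() are
		 * safe here, unlike in an on_each_cpu() IPI handler.
		 */
		/* ... add /sys/devices/system/cpu/cpuN/microcode/ ... */
	}

	/* in driver init, after all necessary setup: */
	schedule_on_each_cpu(setup_online_cpu);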
2022-11-01x86: Improve formatting of user_regset arraysRick Edgecombe1-42/+65
Back in 2018, Ingo Molnar suggested[0] improving the formatting of the struct user_regset arrays. They have multiple member initializations per line and some lines exceed 100 chars. Reformat them like he suggested.

[0] https://lore.kernel.org/lkml/20180711102035.GB8574@gmail.com/

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221021221803.10910-3-rick.p.edgecombe%40intel.com
2022-11-01x86: Separate out x86_regset for 32 and 64 bitRick Edgecombe1-24/+43
In fill_thread_core_info(), the ptrace-accessible registers are collected for a core file to be written out as notes. The note array is allocated from a size calculated by iterating over the user regset view and counting the regsets that have a non-zero core_note_type. However, this only allows for non-zero core_note_types at the end of the regset view. If there are any in the middle, fill_thread_core_info() will overflow the note allocation, as it iterates over the size of the view and the allocation would be smaller than that.

To apparently avoid this problem, x86_32_regsets and x86_64_regsets need to be constructed in a special way. They both draw their indices from a shared enum x86_regset, but 32-bit and 64-bit don't all support the same regsets and can be compiled in at the same time in the case of IA32_EMULATION. So this enum has to be laid out in a special way such that there are no gaps for both x86_32_regsets and x86_64_regsets. This involves ordering them just right by creating aliases for enums that are only in one view or the other, or creating multiple versions like REGSET32_IOPERM/REGSET64_IOPERM.

So the collection of the registers tries to minimize the size of the allocation, but it doesn't quite work. Then the x86 ptrace side works around it by constructing the enum just right to avoid a problem. In the end there is no functional problem, but it is somewhat strange and fragile.

It could also be improved like this [1], by better utilizing the smaller array, but this still wastes space in the regset arrays if they are not carefully crafted to avoid gaps. Instead, just fully separate out the enums and give them separate 32 and 64 enum names. Add some bitsize-free defines for REGSET_GENERAL and REGSET_FP, since they are the only two referred to in bitsize-generic code.

While introducing a bunch of new 32/64 enums, change the pattern of the name from REGSET_FOO32 to REGSET32_FOO to better indicate that the 32 is in reference to the CPU mode and not the register size, as suggested by Eric Biederman.

This should have no functional change and is only changing how constants are generated and referred to.

[1] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221021221803.10910-2-rick.p.edgecombe%40intel.com
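A sketch of the separated enums (member lists abbreviated and illustrative, not the exact set):

	enum x86_regset_32 {
		REGSET32_GENERAL,
		REGSET32_FP,
		REGSET32_XSTATE,
		REGSET32_TLS,
		REGSET32_IOPERM,
		/* ... */
	};

	enum x86_regset_64 {
		REGSET64_GENERAL,
		REGSET64_FP,
		REGSET64_IOPERM,
		REGSET64_XSTATE,
		/* ... */
	};

	/* The only two regsets referred to in bitsize-generic code: */
	#ifdef CONFIG_X86_64
	# define REGSET_GENERAL	REGSET64_GENERAL
	# define REGSET_FP	REGSET64_FP
	#else
	# define REGSET_GENERAL	REGSET32_GENERAL
	# define REGSET_FP	REGSET32_FP
	#endif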
2022-11-01x86/cfi: Add boot time hash randomizationPeter Zijlstra1-12/+108
In order to avoid known hashes (from knowing the boot image), randomize the CFI hashes with a per-boot random seed.

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221027092842.765195516@infradead.org
2022-11-01x86/cfi: Boot time selection of CFI schemePeter Zijlstra1-18/+81
Add the "cfi=" boot parameter to allow people to select a CFI scheme at boot time. Mostly useful for development / debugging. Requested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221027092842.699804264@infradead.org
2022-11-01x86/ibt: Implement FineIBTPeter Zijlstra4-14/+269
Implement an alternative CFI scheme that merges both the fine-grained nature of kCFI but also takes full advantage of the coarse-grained hardware CFI as provided by IBT.

To contrast:

kCFI is a pure software CFI scheme and relies on being able to read text -- specifically the instruction *before* the target symbol, and does the hash validation *before* doing the call (otherwise control flow is compromised already).

FineIBT is a software and hardware hybrid scheme; by ensuring every branch target starts with a hash validation, it is possible to place the hash validation after the branch. This has several advantages:

 o the (hash) load is avoided; no memop; no RX requirement.

 o IBT WAIT-FOR-ENDBR state is a speculation stop; by placing the hash validation in the immediate instruction after the branch target, there is a minimal speculation window and the whole is a viable defence against SpectreBHB.

 o Kees feels obliged to mention it is slightly more vulnerable when the attacker can write code.

Obviously this patch relies on kCFI, but additionally it also relies on the padding from the call-depth-tracking patches. It uses this padding to place the hash validation, while the call sites are rewritten to modify the indirect target to be 16 bytes in front of the original target, thus hitting this new preamble.

Notably, there is no hardware that needs call-depth tracking (Skylake) and supports IBT (Tigerlake and onwards).

Suggested-by: Joao Moreira (Intel) <joao@overdrivepizza.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221027092842.634714496@infradead.org
2022-10-31x86/sgx: Reduce delay and interference of enclave releaseReinette Chatre1-4/+19
commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") introduced a cond_resched() during enclave release where the EREMOVE instruction is applied to every 4k enclave page. Giving other tasks an opportunity to run while tearing down a large enclave placates the soft lockup detector, but Iqbal found that the fix causes a 25% performance degradation of a workload run using Gramine.

Gramine maintains a 1:1 mapping between processes and SGX enclaves. That means if a workload in an enclave creates a subprocess, then Gramine creates a duplicate enclave for that subprocess to run in. The consequence is that the release of the enclave used to run the subprocess can impact the performance of the workload that is run in the original enclave, especially in large enclaves when SGX2 is not in use.

The workload run by Iqbal behaves as follows:

  Create enclave (enclave "A")
  /* Initialize workload in enclave "A" */
  Create enclave (enclave "B")
  /* Run subprocess in enclave "B" and send result to enclave "A" */
  Release enclave (enclave "B")
  /* Run workload in enclave "A" */
  Release enclave (enclave "A")

The performance impact of releasing enclave "B" in the above scenario is amplified when there is a lot of SGX memory and the enclave size matches the SGX memory. When there is 128GB SGX memory and an enclave size of 128GB, from the time enclave "B" starts, the 128GB SGX memory is oversubscribed with a combined demand for 256GB from the two enclaves.

Before commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves"), enclave release was done in a tight loop without giving other tasks a chance to run. Even though the system experienced soft lockups, the workload (run in enclave "A") obtained good performance numbers because when the workload started running there was no interference.

Commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") gave other tasks the opportunity to run while an enclave is released. The impact of this in this scenario is that while enclave "B" is released and needs to access each page that belongs to it in order to run the SGX EREMOVE instruction on it, enclave "A" is attempting to run the workload needing to access the enclave pages that belong to it. This causes a lot of swapping due to the demand for the oversubscribed SGX memory. Longer latencies are experienced by the workload in enclave "A" while enclave "B" is released.

Improve the performance of enclave release while still avoiding the soft lockup detector with two enhancements:

- Only call cond_resched() after XA_CHECK_SCHED iterations.

- Use the xarray advanced API to keep the xarray locked for XA_CHECK_SCHED iterations instead of locking and unlocking at every iteration.

This batching solution is copied from sgx_encl_may_map(), which also iterates through all enclave pages using this technique; see the sketch below.

With this enhancement, the workload experiences a 5% performance degradation when compared to a kernel without commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves"), an improvement over the reported 25% degradation, while still placating the soft lockup detector.

Scenarios with poor performance are still possible even with these enhancements, for example, short workloads creating subprocesses while running in large enclaves.

Further performance improvements are pursued in user space through avoiding the creation of duplicate enclaves for certain subprocesses, and by using SGX2, which will do lazy allocation of pages as needed so enclaves created for subprocesses start quickly and release quickly.

Fixes: 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves")
Reported-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com>
Link: https://lore.kernel.org/all/00efa80dd9e35dc85753e1c5edb0344ac07bb1f0.1667236485.git.reinette.chatre%40intel.com
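The batching pattern copied from sgx_encl_may_map(), sketched here (the identifiers around the loop are illustrative, not the exact diff):

	XA_STATE(xas, &encl->page_array, PFN_DOWN(encl->base));
	struct sgx_encl_page *entry;
	long count = 0;

	xas_lock(&xas);
	xas_for_each(&xas, entry, PFN_DOWN(encl->base + encl->size - 1)) {
		/* ... EREMOVE and free the EPC page ... */
		if (!(++count % XA_CHECK_SCHED)) {
			/* Once per batch: drop the lock and reschedule */
			xas_pause(&xas);
			xas_unlock(&xas);
			cond_resched();
			xas_lock(&xas);
		}
	}
	xas_unlock(&xas);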
2022-10-31x86/espfix: Use get_random_long() rather than archrandomJason A. Donenfeld1-11/+1
A call is made to arch_get_random_longs() and rdtsc(), rather than just using get_random_long(), because this was written during a time when very early boot would give abysmal entropy.

These days, a call to get_random_long() at early boot will incorporate RDRAND, RDTSC, and more, without having to do anything bespoke. In fact, the situation is now such that, on the majority of x86 systems, the pool actually is initialized at this point, even though it doesn't need to be for get_random_long() to still return something better than what this function currently does.

So simplify this to just call get_random_long() instead.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221029002613.143153-1-Jason@zx2c4.com
2022-10-31x86/mce: Use severity table to handle uncorrected errors in kernelTony Luck1-3/+5
mce_severity_intel() has a special case to promote UC and AR errors in kernel context to PANIC severity.

The "AR" case is already handled with separate entries in the severity table for all instruction fetch errors, and for those data fetch errors that are not in a recoverable area of the kernel (i.e. have an extable fixup entry).

Add an entry to the severity table for UC errors in kernel context that reports severity = PANIC. Delete the special-case code from mce_severity_intel().

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220922195136.54575-2-tony.luck@intel.com
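A plausible shape for such an entry, following the MCESEV() table pattern used in mce/severity.c (the exact matchers are an assumption, not the verbatim diff):

	MCESEV(
		PANIC, "Uncorrected error in kernel context",
		SER, MASK(MCI_STATUS_UC, MCI_STATUS_UC), KERNEL
	),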
2022-10-31x86/i8259: Make default_legacy_pic staticChen Lifu1-1/+1
The symbol is not used outside of the file, so mark it static.

Signed-off-by: Chen Lifu <chenlifu@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220823021958.3052493-1-chenlifu@huawei.com
2022-10-27x86/MCE/AMD: Clear DFR errors found in THR handlerYazen Ghannam1-13/+20
AMD's MCA Thresholding feature counts errors of all severity levels, not just correctable errors. If a deferred error causes the threshold limit to be reached (it was the error that caused the overflow), then both a deferred error interrupt and a thresholding interrupt will be triggered.

The order of the interrupts is not guaranteed. If the threshold interrupt handler is executed first, then it will clear MCA_STATUS for the error. It will not check or clear MCA_DESTAT, which also holds a copy of the deferred error. When the deferred error interrupt handler runs, it will not find an error in MCA_STATUS, but it will find the error in MCA_DESTAT. This will cause two errors to be logged.

Check for deferred errors when handling a threshold interrupt. If a bank contains a deferred error, then clear the bank's MCA_DESTAT register.

Define a new helper function to do the deferred error check and the clearing of MCA_DESTAT.

[ bp: Simplify, convert comment to passive voice. ]

Fixes: 37d43acfd79f ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220621155943.33623-1-yazen.ghannam@amd.com
2022-10-25x86/retpoline: Fix crash printing warningDan Carpenter1-1/+1
The first argument of WARN() is a condition, so this call will use "addr" as the format string and possibly crash.

Fixes: 3b6c1747da48 ("x86/retpoline: Add SKL retthunk retpolines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/Y1gBoUZrRK5N%2FlCB@kili/
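The bug class, illustrated (the message text here is hypothetical):

	/* Buggy: the format string lands in WARN()'s condition slot,
	 * so "addr" is then interpreted as the format string. */
	WARN("missing return thunk: %pS", addr);

	/* Fixed: pass an explicit condition first. */
	WARN(1, "missing return thunk: %pS", addr);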
2022-10-24x86/resctrl: Remove arch_has_empty_bitmapsBabu Moger2-4/+1
The field arch_has_empty_bitmaps is not required anymore. The field min_cbm_bits is enough to validate the CBM (capacity bit mask), including whether the architecture supports a zero CBM or not.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/166430979654.372014.615622285687642644.stgit@bmoger-ubuntu
2022-10-23Merge tag 'objtool_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull objtool fix from Borislav Petkov:

 - Fix ORC stack unwinding when GCOV is enabled

* tag 'objtool_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Fix unreliable stack dump with gcov
2022-10-22Merge branch 'x86/urgent' into x86/core, to resolve conflictIngo Molnar6-49/+60
There's a conflict between the call-depth tracking commits in x86/core:

  ee3e2469b346 ("x86/ftrace: Make it call depth tracking aware")
  36b64f101219 ("x86/ftrace: Rebalance RSB")
  eac828eaef29 ("x86/ftrace: Remove ftrace_epilogue()")

and these fixes in x86/urgent:

  883bbbffa5a4 ("ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()")
  b5f1fc318440 ("x86/ftrace: Remove ftrace_epilogue()")

These are non-trivial overlapping modifications - resolve them.

Conflicts:
  arch/x86/kernel/ftrace_64.S

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-10-21x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctlyChang S. Bae1-0/+9
When an extended state component is not present in fpstate but is in its init state, the function copies it from init_fpstate via copy_feature().

But dynamic states are not present in init_fpstate because of their all-zeros init states. Retrieving them from init_fpstate will then explode like this:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  ...
  RIP: 0010:memcpy_erms+0x6/0x10
   ? __copy_xstate_to_uabi_buf+0x381/0x870
   fpu_copy_guest_fpstate_to_uabi+0x28/0x80
   kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
   ? __this_cpu_preempt_check+0x13/0x20
   ? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
   kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
   ? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
   ? __fget_light+0xd4/0x130
   __x64_sys_ioctl+0xe3/0x910
   ? debug_smp_processor_id+0x17/0x20
   ? fpregs_assert_state_consistent+0x27/0x50
   do_syscall_64+0x3f/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

Adjust the 'mask' to zero out the userspace buffer for the features that are not available both from fpstate and from init_fpstate.

The dynamic features depend on the compacted XSAVE format. Ensure it is enabled before reading XCOMP_BV in init_fpstate.

Fixes: 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
Reported-by: Yuan Yao <yuan.yao@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/lkml/BYAPR11MB3717EDEF2351C958F2C86EED95259@BYAPR11MB3717.namprd11.prod.outlook.com/
Link: https://lkml.kernel.org/r/20221021185844.13472-1-chang.seok.bae@intel.com
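The described mask adjustment, sketched (names assumed from the __copy_xstate_to_uabi_buf() context; treat as an assumption, not the verbatim diff):

	/*
	 * Dynamic features have all-zeros init states and are thus absent
	 * from init_fpstate. Drop them from 'mask' so the corresponding
	 * part of the user buffer is zeroed instead of being copied from
	 * a non-existent init_fpstate slice.
	 */
	if (fpu_state_size_dynamic())
		mask &= (header.xfeatures | xinit->header.xcomp_bv);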
2022-10-21x86/unwind/orc: Fix unreliable stack dump with gcovChen Zhongjin1-1/+1
When a console stack dump is initiated with CONFIG_GCOV_PROFILE_ALL enabled, show_trace_log_lvl() gets out of sync with the ORC unwinder, causing the stack trace to show all text addresses as unreliable:

  # echo l > /proc/sysrq-trigger
  [  477.521031] sysrq: Show backtrace of all active CPUs
  [  477.523813] NMI backtrace for cpu 0
  [  477.524492] CPU: 0 PID: 1021 Comm: bash Not tainted 6.0.0 #65
  [  477.525295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014
  [  477.526439] Call Trace:
  [  477.526854]  <TASK>
  [  477.527216]  ? dump_stack_lvl+0xc7/0x114
  [  477.527801]  ? dump_stack+0x13/0x1f
  [  477.528331]  ? nmi_cpu_backtrace.cold+0xb5/0x10d
  [  477.528998]  ? lapic_can_unplug_cpu+0xa0/0xa0
  [  477.529641]  ? nmi_trigger_cpumask_backtrace+0x16a/0x1f0
  [  477.530393]  ? arch_trigger_cpumask_backtrace+0x1d/0x30
  [  477.531136]  ? sysrq_handle_showallcpus+0x1b/0x30
  [  477.531818]  ? __handle_sysrq.cold+0x4e/0x1ae
  [  477.532451]  ? write_sysrq_trigger+0x63/0x80
  [  477.533080]  ? proc_reg_write+0x92/0x110
  [  477.533663]  ? vfs_write+0x174/0x530
  [  477.534265]  ? handle_mm_fault+0x16f/0x500
  [  477.534940]  ? ksys_write+0x7b/0x170
  [  477.535543]  ? __x64_sys_write+0x1d/0x30
  [  477.536191]  ? do_syscall_64+0x6b/0x100
  [  477.536809]  ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
  [  477.537609]  </TASK>

This happens when the compiled code for show_stack() has a single word on the stack, and doesn't use a tail call to show_stack_log_lvl(). (CONFIG_GCOV_PROFILE_ALL=y is the only known case of this.) Then the __unwind_start() skip logic hits an off-by-one bug and fails to unwind all the way to the intended starting frame.

Fix it by reverting the following commit:

  f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks")

The original justification for that commit no longer exists. That original issue was later fixed in a different way, with the following commit:

  f2ac57a4c49d ("x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels")

Fixes: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
[jpoimboe: rewrite commit log]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-10-20ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()Peter Zijlstra1-8/+9
Different function signatures mean they need to be different functions; otherwise CFI gets upset.

As triggered by the ftrace boot tests:

  [] CFI failure at ftrace_return_to_handler+0xac/0x16c (target: ftrace_stub+0x0/0x14; expected type: 0x0a5d5347)

Fixes: 3c516f89e17e ("x86: Add support for CONFIG_CFI_CLANG")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/Y06dg4e1xF6JTdQq@hirez.programming.kicks-ass.net
2022-10-20x86/ftrace: Remove ftrace_epilogue()Peter Zijlstra1-15/+6
Remove the weird jumps to RET and simply use RET. This then promotes ftrace_stub() to a real function, which becomes important for kcfi.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.719080593@infradead.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-10-20x86/mtrr: Remove unused cyrix_set_all() functionJuergen Gross1-34/+0
The Cyrix CPU-specific MTRR function cyrix_set_all() will never be called, as the mtrr_ops->set_all() callback will only be called in the use_intel() case, which would require the use_intel_if member of struct mtrr_ops to be set, which isn't the case for Cyrix.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221004081023.32402-3-jgross@suse.com
2022-10-19x86/mtrr: Add comment for set_mtrr_state() serializationJuergen Gross1-1/+4
Add a comment about set_mtrr_state() needing serialization.

[ bp: Touchups. ]

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220820092533.29420-2-jgross@suse.com
2022-10-19x86/signal/64: Move 64-bit signal code to its own fileBrian Gerst3-378/+385
[ bp: Fixup merge conflict caused by changes coming from the kbuild tree. ]

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-9-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-10-19x86/signal/32: Merge native and compat 32-bit signal codeBrian Gerst3-215/+387
There are significant differences between signal handling on 32-bit vs. 64-bit, like different structure layouts and legacy syscalls. Instead of duplicating that code for native and compat, merge both versions into one file.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-8-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-10-19x86/signal: Add ABI prefixes to frame setup functionsBrian Gerst1-11/+7
Add ABI prefixes to the frame setup functions that didn't already have them. To avoid compiler warnings and prepare for moving these functions to separate files, make them non-static.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-7-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-10-19x86/signal: Merge get_sigframe()Brian Gerst1-42/+38
Adapt the native get_sigframe() function so that the compat signal code can use it.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-6-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-10-19x86/signal: Remove sigset_t parameter from frame setup functionsBrian Gerst1-16/+12
Push down the call to sigmask_to_save() into the frame setup functions. Thus, remove the use of compat_sigset_t outside of the compat code.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-3-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-10-19x86/signal: Remove sig parameter from frame setup functionsBrian Gerst1-12/+11
Passing the signal number as a separate parameter is unnecessary, since it is always ksig->sig.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-2-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
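The shape of the signature change, sketched with one of the x86 frame setup functions (body abbreviated; treat the exact prototype as illustrative):

	/* Before: the signal number is passed redundantly */
	static int __setup_rt_frame(int sig, struct ksignal *ksig,
				    sigset_t *set, struct pt_regs *regs);

	/* After: derive it from ksig */
	static int __setup_rt_frame(struct ksignal *ksig,
				    sigset_t *set, struct pt_regs *regs)
	{
		int sig = ksig->sig;
		/* ... */
	}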