summaryrefslogtreecommitdiffstats
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2022-04-03Merge tag 'x86-urgent-2022-04-03' of ↵Linus Torvalds5-185/+186
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of x86 fixes and updates: - Make the prctl() for enabling dynamic XSTATE components correct so it adds the newly requested feature to the permission bitmap instead of overwriting it. Add a selftest which validates that. - Unroll string MMIO for encrypted SEV guests as the hypervisor cannot emulate it. - Handle supervisor states correctly in the FPU/XSTATE code so it takes the feature set of the fpstate buffer into account. The feature sets can differ between host and guest buffers. Guest buffers do not contain supervisor states. So far this was not an issue, but with enabling PASID it needs to be handled in the buffer offset calculation and in the permission bitmaps. - Avoid a gazillion of repeated CPUID invocations in by caching the values early in the FPU/XSTATE code. - Enable CONFIG_WERROR in x86 defconfig. - Make the X86 defconfigs more useful by adapting them to Y2022 reality" * tag 'x86-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu/xstate: Consolidate size calculations x86/fpu/xstate: Handle supervisor states in XSTATE permissions x86/fpu/xsave: Handle compacted offsets correctly with supervisor states x86/fpu: Cache xfeature flags from CPUID x86/fpu/xsave: Initialize offset/size cache early x86/fpu: Remove unused supervisor only offsets x86/fpu: Remove redundant XCOMP_BV initialization x86/sev: Unroll string mmio with CC_ATTR_GUEST_UNROLL_STRING_IO x86/config: Make the x86 defconfigs a bit more usable x86/defconfig: Enable WERROR selftests/x86/amx: Update the ARCH_REQ_XCOMP_PERM test x86/fpu/xstate: Fix the ARCH_REQ_XCOMP_PERM implementation
2022-04-03Merge tag 'core-urgent-2022-04-03' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RT signal fix from Thomas Gleixner: "Revert the RT related signal changes. They need to be reworked and generalized" * tag 'core-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"
2022-04-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds25-305/+349
Pull kvm fixes from Paolo Bonzini: - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr - Documentation improvements - Prevent module exit until all VMs are freed - PMU Virtualization fixes - Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences - Other miscellaneous bugfixes * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits) KVM: x86: fix sending PV IPI KVM: x86/mmu: do compare-and-exchange of gPTE via the user address KVM: x86: Remove redundant vm_entry_controls_clearbit() call KVM: x86: cleanup enter_rmode() KVM: x86: SVM: fix tsc scaling when the host doesn't support it kvm: x86: SVM: remove unused defines KVM: x86: SVM: move tsc ratio definitions to svm.h KVM: x86: SVM: fix avic spec based definitions again KVM: MIPS: remove reference to trap&emulate virtualization KVM: x86: document limitations of MSR filtering KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr KVM: x86/emulator: Emulate RDPID only if it is enabled in guest KVM: x86/pmu: Fix and isolate TSX-specific performance event logic KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs KVM: x86: Trace all APICv inhibit changes and capture overall status KVM: x86: Add wrappers for setting/clearing APICv inhibits KVM: x86: Make APICv inhibit reasons an enum and cleanup naming KVM: X86: Handle implicit supervisor access with SMAP KVM: X86: Rename variable smap to not_smap in permission_fault() ...
2022-04-02KVM: x86: fix sending PV IPILi RongQing1-1/+1
If apic_id is less than min, and (max - apic_id) is greater than KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but the new apic_id does not fit the bitmask. In this case __send_ipi_mask should send the IPI. This is mostly theoretical, but it can happen if the apic_ids on three iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0. Fixes: aaffcfd1e82 ("KVM: X86: Implement PV IPIs in linux guest") Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1646814944-51801-1-git-send-email-lirongqing@baidu.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/mmu: do compare-and-exchange of gPTE via the user addressPaolo Bonzini1-40/+34
FNAME(cmpxchg_gpte) is an inefficient mess. It is at least decent if it can go through get_user_pages_fast(), but if it cannot then it tries to use memremap(); that is not just terribly slow, it is also wrong because it assumes that the VM_PFNMAP VMA is contiguous. The right way to do it would be to do the same thing as hva_to_pfn_remapped() does since commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05), using follow_pte() and fixup_user_fault() to determine the correct address to use for memremap(). To do this, one could for example extract hva_to_pfn() for use outside virt/kvm/kvm_main.c. But really there is no reason to do that either, because there is already a perfectly valid address to do the cmpxchg() on, only it is a userspace address. That means doing user_access_begin()/user_access_end() and writing the code in assembly to handle exceptions correctly. Worse, the guest PTE can be 8-byte even on i686 so there is the extra complication of using cmpxchg8b to account for. But at least it is an efficient mess. (Thanks to Linus for suggesting improvement on the inline assembly). Reported-by: Qiuhao Li <qiuhao@sysec.org> Reported-by: Gaoning Pan <pgn@zju.edu.cn> Reported-by: Yongkang Jia <kangel@zju.edu.cn> Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Remove redundant vm_entry_controls_clearbit() callZhenzhong Duan1-1/+0
When emulating exit from long mode, EFER_LMA is cleared with vmx_set_efer(). This will already unset the VM_ENTRY_IA32E_MODE control bit as requested by SDM, so there is no need to unset VM_ENTRY_IA32E_MODE again in exit_lmode() explicitly. In case EFER isn't supported by hardware, long mode isn't supported, so exit_lmode() cannot be reached. Note that, thanks to the shadow controls mechanism, this change doesn't eliminate vmread or vmwrite. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Message-Id: <20220311102643.807507-3-zhenzhong.duan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: cleanup enter_rmode()Zhenzhong Duan1-9/+5
vmx_set_efer() sets uret->data but, in fact if the value of uret->data will be used vmx_setup_uret_msrs() will have rewritten it with the value returned by update_transition_efer(). uret->data is consumed if and only if uret->load_into_hardware is true, and vmx_setup_uret_msrs() takes care of (a) updating uret->data before setting uret->load_into_hardware to true (b) setting uret->load_into_hardware to false if uret->data isn't updated. Opportunistically use "vmx" directly instead of redoing to_vmx(). Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Message-Id: <20220311102643.807507-2-zhenzhong.duan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: SVM: fix tsc scaling when the host doesn't support itMaxim Levitsky3-9/+6
It was decided that when TSC scaling is not supported, the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0' value. However in this case kvm_max_tsc_scaling_ratio is not set, which breaks various assumptions. Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of host support. For consistency, do the same for VMX. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02kvm: x86: SVM: remove unused definesMaxim Levitsky1-8/+0
Remove some unused #defines from svm.c Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-7-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: SVM: move tsc ratio definitions to svm.hMaxim Levitsky2-10/+11
Another piece of SVM spec which should be in the header file Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: SVM: fix avic spec based definitions againMaxim Levitsky2-14/+5
Due to wrong rebase, commit 4a204f7895878 ("KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255") moved avic spec #defines back to avic.c. Move them back, and while at it extend AVIC_DOORBELL_PHYSICAL_ID_MASK to 12 bits as well (it will be used in nested avic) Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsrHou Wenlong3-16/+40
If MSR access is rejected by MSR filtering, kvm_set_msr()/kvm_get_msr() would return KVM_MSR_RET_FILTERED, and the return value is only handled well for rdmsr/wrmsr. However, some instruction emulation and state transition also use kvm_set_msr()/kvm_get_msr() to do msr access but may trigger some unexpected results if MSR access is rejected, E.g. RDPID emulation would inject a #UD but RDPID wouldn't cause a exit when RDPID is supported in hardware and ENABLE_RDTSCP is set. And it would also cause failure when load MSR at nested entry/exit. Since msr filtering is based on MSR bitmap, it is better to only do MSR filtering for rdmsr/wrmsr. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/emulator: Emulate RDPID only if it is enabled in guestHou Wenlong3-1/+10
When RDTSCP is supported but RDPID is not supported in host, RDPID emulation is available. However, __kvm_get_msr() would only fail when RDTSCP/RDPID both are disabled in guest, so the emulator wouldn't inject a #UD when RDPID is disabled but RDTSCP is enabled in guest. Fixes: fb6d4d340e05 ("KVM: x86: emulate RDPID") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/pmu: Fix and isolate TSX-specific performance event logicLike Xu2-13/+15
HSW_IN_TX* bits are used in generic code which are not supported on AMD. Worse, these bits overlap with AMD EventSelect[11:8] and hence using HSW_IN_TX* bits unconditionally in generic code is resulting in unintentional pmu behavior on AMD. For example, if EventSelect[11:8] is 0x2, pmc_reprogram_counter() wrongly assumes that HSW_IN_TX_CHECKPOINTED is set and thus forces sampling period to be 0. Also per the SDM, both bits 32 and 33 "may only be set if the processor supports HLE or RTM" and for "IN_TXCP (bit 33): this bit may only be set for IA32_PERFEVTSEL2." Opportunistically eliminate code redundancy, because if the HSW_IN_TX* bit is set in pmc->eventsel, it is already set in attr.config. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Reported-by: Jim Mattson <jmattson@google.com> Fixes: 103af0a98788 ("perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5") Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20220309084257.88931-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was setMaxim Levitsky1-1/+1
It makes more sense to print new SPTE value than the old value. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRsJim Mattson1-5/+3
AMD EPYC CPUs never raise a #GP for a WRMSR to a PerfEvtSeln MSR. Some reserved bits are cleared, and some are not. Specifically, on Zen3/Milan, bits 19 and 42 are not cleared. When emulating such a WRMSR, KVM should not synthesize a #GP, regardless of which bits are set. However, undocumented bits should not be passed through to the hardware MSR. So, rather than checking for reserved bits and synthesizing a #GP, just clear the reserved bits. This may seem pedantic, but since KVM currently does not support the "Host/Guest Only" bits (41:40), it is necessary to clear these bits rather than synthesizing #GP, because some popular guests (e.g Linux) will set the "Host Only" bit even on CPUs that don't support EFER.SVME, and they don't expect a #GP. For example, root@Ubuntu1804:~# perf stat -e r26 -a sleep 1 Performance counter stats for 'system wide': 0 r26 1.001070977 seconds time elapsed Feb 23 03:59:58 Ubuntu1804 kernel: [ 405.379957] unchecked MSR access error: WRMSR to 0xc0010200 (tried to write 0x0000020000130026) at rIP: 0xffffffff9b276a28 (native_write_msr+0x8/0x30) Feb 23 03:59:58 Ubuntu1804 kernel: [ 405.379958] Call Trace: Feb 23 03:59:58 Ubuntu1804 kernel: [ 405.379963] amd_pmu_disable_event+0x27/0x90 Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Reported-by: Lotus Fenn <lotusf@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Like Xu <likexu@tencent.com> Reviewed-by: David Dunn <daviddunn@google.com> Message-Id: <20220226234131.2167175-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Trace all APICv inhibit changes and capture overall statusSean Christopherson2-18/+29
Trace all APICv inhibit changes instead of just those that result in APICv being (un)inhibited, and log the current state. Debugging why APICv isn't working is frustrating as it's hard to see why APICv is still inhibited, and logging only the first inhibition means unnecessary onion peeling. Opportunistically drop the export of the tracepoint, it is not and should not be used by vendor code due to the need to serialize toggling via apicv_update_lock. Note, using the common flow means kvm_apicv_init() switched from atomic to non-atomic bitwise operations. The VM is unreachable at init, so non-atomic is perfectly ok. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Add wrappers for setting/clearing APICv inhibitsSean Christopherson6-36/+49
Add set/clear wrappers for toggling APICv inhibits to make the call sites more readable, and opportunistically rename the inner helpers to align with the new wrappers and to make them more readable as well. Invert the flag from "activate" to "set"; activate is painfully ambiguous as it's not obvious if the inhibit is being activated, or if APICv is being activated, in which case the inhibit is being deactivated. For the functions that take @set, swap the order of the inhibit reason and @set so that the call sites are visually similar to those that bounce through the wrapper. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Make APICv inhibit reasons an enum and cleanup namingSean Christopherson6-31/+35
Use an enum for the APICv inhibit reasons, there is no meaning behind their values and they most definitely are not "unsigned longs". Rename the various params to "reason" for consistency and clarity (inhibit may be confused as a command, i.e. inhibit APICv, instead of the reason that is getting toggled/checked). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: X86: Handle implicit supervisor access with SMAPLai Jiangshan4-17/+21
There are two kinds of implicit supervisor access implicit supervisor access when CPL = 3 implicit supervisor access when CPL < 3 Current permission_fault() handles only the first kind for SMAP. But if the access is implicit when SMAP is on, data may not be read nor write from any user-mode address regardless the current CPL. So the second kind should be also supported. The first kind can be detect via CPL and access mode: if it is supervisor access and CPL = 3, it must be implicit supervisor access. But it is not possible to detect the second kind without extra information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS into @access. This extra information also works for the first kind, so the logic is changed to use this information for both cases. The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48 which is in the most significant 16 bits of u64 and less likely to be forced to change due to future hardware uses it. This patch removes the call to ->get_cpl() for access mode is determined by @access. Not only does it reduce a function call, but also remove confusions when the permission is checked for nested TDP. The nested TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing on it. The original code works just because it is always user walk for NPT and SMAP fault is not set for EPT in update_permission_bitmask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: X86: Rename variable smap to not_smap in permission_fault()Lai Jiangshan1-2/+2
Comments above the variable says the bit is set when SMAP is overridden or the same meaning in update_permission_bitmask(): it is not subjected to SMAP restriction. Renaming it to reflect the negative implication and make the code better readability. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: X86: Fix comments in update_permission_bitmaskLai Jiangshan1-2/+2
The commit 09f037aa48f3 ("KVM: MMU: speedup update_permission_bitmask") refactored the code of update_permission_bitmask() and change the comments. It added a condition into a list to match the new code, so the number/order for conditions in the comments should be updated too. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: X86: Change the type of access u32 to u64Lai Jiangshan5-21/+23
Change the type of access u32 to u64 for FNAME(walk_addr) and ->gva_to_gpa(). The kinds of accesses are usually combinations of UWX, and VMX/SVM's nested paging adds a new factor of access: is it an access for a guest page table or for a final guest physical address. And SMAP relies a factor for supervisor access: explicit or implicit. So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include all these information to do the walk. Although @access(u32) has enough bits to encode all the kinds, this patch extends it to u64: o Extra bits will be in the higher 32 bits, so that we can easily obtain the traditional access mode (UWX) by converting it to u32. o Reuse the value for the access kind defined by SVM's nested paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as @error_code in kvm_handle_page_fault(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: Remove dirty handling from gfn_to_pfn_cache completelyDavid Woodhouse1-3/+2
It isn't OK to cache the dirty status of a page in internal structures for an indefinite period of time. Any time a vCPU exits the run loop to userspace might be its last; the VMM might do its final check of the dirty log, flush the last remaining dirty pages to the destination and complete a live migration. If we have internal 'dirty' state which doesn't get flushed until the vCPU is finally destroyed on the source after migration is complete, then we have lost data because that will escape the final copy. This problem already exists with the use of kvm_vcpu_unmap() to mark pages dirty in e.g. VMX nesting. Note that the actual Linux MM already considers the page to be dirty since we have a writeable mapping of it. This is just about the KVM dirty logging. For the nesting-style use cases (KVM_GUEST_USES_PFN) we will need to track which gfn_to_pfn_caches have been used and explicitly mark the corresponding pages dirty before returning to userspace. But we would have needed external tracking of that anyway, rather than walking the full list of GPCs to find those belonging to this vCPU which are dirty. So let's rely *solely* on that external tracking, and keep it simple rather than laying a tempting trap for callers to fall into. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: Use enum to track if cached PFN will be used in guest and/or hostSean Christopherson1-1/+1
Replace the guest_uses_pa and kernel_map booleans in the PFN cache code with a unified enum/bitmask. Using explicit names makes it easier to review and audit call sites. Opportunistically add a WARN to prevent passing garbage; instantating a cache without declaring its usage is either buggy or pointless. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: SVM: Fix kvm_cache_regs.h inclusions for is_guest_mode()Peter Gonda2-1/+2
Include kvm_cache_regs.h to pick up the definition of is_guest_mode(), which is referenced by nested_svm_virtualize_tpr() in svm.h. Remove include from svm_onhpyerv.c which was done only because of lack of include in svm.h. Fixes: 883b0a91f41ab ("KVM: SVM: Move Nested SVM Implementation to nested.c") Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Gonda <pgonda@google.com> Message-Id: <20220304161032.2270688-1-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/pmu: Use different raw event masks for AMD and IntelJim Mattson4-1/+5
The third nybble of AMD's event select overlaps with Intel's IN_TX and IN_TXCP bits. Therefore, we can't use AMD64_RAW_EVENT_MASK on Intel platforms that support TSX. Declare a raw_event_mask in the kvm_pmu structure, initialize it in the vendor-specific pmu_refresh() functions, and use that mask for PERF_TYPE_RAW configurations in reprogram_gp_counter(). Fixes: 710c47651431 ("KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220308012452.3468611-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmapSean Christopherson3-48/+19
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB flush when processing multiple roots (including nested TDP shadow roots). Dropping the TLB flush resulted in random crashes when running Hyper-V Server 2019 in a guest with KSM enabled in the host (or any source of mmu_notifier invalidations, KSM is just the easiest to force). This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0 and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top: bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, struct kvm_mmu_page *root; for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) - flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false); + flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush); return flush; } Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220325230348.2587437-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: SVM: fix panic on out-of-bounds guest IRQYi Wang1-2/+8
As guest_irq is coming from KVM_IRQFD API call, it may trigger crash in svm_update_pi_irte() due to out-of-bounds: crash> bt PID: 22218 TASK: ffff951a6ad74980 CPU: 73 COMMAND: "vcpu8" #0 [ffffb1ba6707fa40] machine_kexec at ffffffff8565b397 #1 [ffffb1ba6707fa90] __crash_kexec at ffffffff85788a6d #2 [ffffb1ba6707fb58] crash_kexec at ffffffff8578995d #3 [ffffb1ba6707fb70] oops_end at ffffffff85623c0d #4 [ffffb1ba6707fb90] no_context at ffffffff856692c9 #5 [ffffb1ba6707fbf8] exc_page_fault at ffffffff85f95b51 #6 [ffffb1ba6707fc50] asm_exc_page_fault at ffffffff86000ace [exception RIP: svm_update_pi_irte+227] RIP: ffffffffc0761b53 RSP: ffffb1ba6707fd08 RFLAGS: 00010086 RAX: ffffb1ba6707fd78 RBX: ffffb1ba66d91000 RCX: 0000000000000001 RDX: 00003c803f63f1c0 RSI: 000000000000019a RDI: ffffb1ba66db2ab8 RBP: 000000000000019a R8: 0000000000000040 R9: ffff94ca41b82200 R10: ffffffffffffffcf R11: 0000000000000001 R12: 0000000000000001 R13: 0000000000000001 R14: ffffffffffffffcf R15: 000000000000005f ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffb1ba6707fdb8] kvm_irq_routing_update at ffffffffc09f19a1 [kvm] #8 [ffffb1ba6707fde0] kvm_set_irq_routing at ffffffffc09f2133 [kvm] #9 [ffffb1ba6707fe18] kvm_vm_ioctl at ffffffffc09ef544 [kvm] RIP: 00007f143c36488b RSP: 00007f143a4e04b8 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 00007f05780041d0 RCX: 00007f143c36488b RDX: 00007f05780041d0 RSI: 000000004008ae6a RDI: 0000000000000020 RBP: 00000000000004e8 R8: 0000000000000008 R9: 00007f05780041e0 R10: 00007f0578004560 R11: 0000000000000246 R12: 00000000000004e0 R13: 000000000000001a R14: 00007f1424001c60 R15: 00007f0578003bc0 ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b Vmx have been fix this in commit 3a8b0677fc61 (KVM: VMX: Do not BUG() on out-of-bounds guest IRQ), so we can just copy source from that to fix this. Co-developed-by: Yi Liu <liu.yi24@zte.com.cn> Signed-off-by: Yi Liu <liu.yi24@zte.com.cn> Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Message-Id: <20220309113025.44469-1-wang.yi59@zte.com.cn> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: MMU: propagate alloc_workqueue failurePaolo Bonzini5-17/+32
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm. kvm_arch_init_vm also has to undo all the initialization, so group all the MMU initialization code at the beginning and handle cleaning up of kvm_page_track_init. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-01Merge branch 'work.misc' of ↵Linus Torvalds4-34/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted bits and pieces" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: aio: drop needless assignment in aio_read() clean overflow checks in count_mounts() a bit seq_file: fix NULL pointer arithmetic warning uml/x86: use x86 load_unaligned_zeropad() asm/user.h: killed unused macros constify struct path argument of finish_automount()/do_add_mount() fs: Remove FIXME comment in generic_write_checks()
2022-03-31Merge tag 'for-linus-5.18-rc1' of ↵Linus Torvalds2-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Devicetree support (for testing) - Various cleanups and fixes: UBD, port_user, uml_mconsole - Maintainer update * tag 'for-linus-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: run_helper: Write error message to kernel log on exec failure on host um: port_user: Improve error handling when port-helper is not found um: port_user: Allow setting path to port-helper using UML_PORT_HELPER envvar um: port_user: Search for in.telnetd in PATH um: clang: Strip out -mno-global-merge from USER_CFLAGS docs: UML: Mention telnetd for port channel um: Remove unused timeval_to_ns() function um: Fix uml_mconsole stop/go um: Cleanup syscall_handler_t definition/cast, fix warning uml: net: vector: fix const issue um: Fix WRITE_ZEROES in the UBD Driver um: Migrate vector drivers to NAPI um: Fix order of dtb unflatten/early init um: fix and optimize xor select template for CONFIG64 and timetravel mode um: Document dtb command line option lib/logic_iomem: correct fallback config references um: Remove duplicated include in syscalls_64.c MAINTAINERS: Update UserModeLinux entry
2022-03-31Merge tag 'kbuild-v5.18-v2' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add new environment variables, USERCFLAGS and USERLDFLAGS to allow additional flags to be passed to user-space programs. - Fix missing fflush() bugs in Kconfig and fixdep - Fix a minor bug in the comment format of the .config file - Make kallsyms ignore llvm's local labels, .L* - Fix UAPI compile-test for cross-compiling with Clang - Extend the LLVM= syntax to support LLVM=<suffix> form for using a particular version of LLVm, and LLVM=<prefix> form for using custom LLVM in a particular directory path. - Clean up Makefiles * tag 'kbuild-v5.18-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: Make $(LLVM) more flexible kbuild: add --target to correctly cross-compile UAPI headers with Clang fixdep: use fflush() and ferror() to ensure successful write to files arch: syscalls: simplify uapi/kapi directory creation usr/include: replace extra-y with always-y certs: simplify empty certs creation in certs/Makefile certs: include certs/signing_key.x509 unconditionally kallsyms: ignore all local labels prefixed by '.L' kconfig: fix missing '# end of' for empty menu kconfig: add fflush() before ferror() check kbuild: replace $(if A,A,B) with $(or A,B) kbuild: Add environment variables for userprogs flags kbuild: unify cmd_copy and cmd_shipped
2022-03-31Merge tag 'net-5.18-rc1' of ↵Linus Torvalds8-133/+162
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull more networking updates from Jakub Kicinski: "Networking fixes and rethook patches. Features: - kprobes: rethook: x86: replace kretprobe trampoline with rethook Current release - regressions: - sfc: avoid null-deref on systems without NUMA awareness in the new queue sizing code Current release - new code bugs: - vxlan: do not feed vxlan_vnifilter_dump_dev with non-vxlan devices - eth: lan966x: fix null-deref on PHY pointer in timestamp ioctl when interface is down Previous releases - always broken: - openvswitch: correct neighbor discovery target mask field in the flow dump - wireguard: ignore v6 endpoints when ipv6 is disabled and fix a leak - rxrpc: fix call timer start racing with call destruction - rxrpc: fix null-deref when security type is rxrpc_no_security - can: fix UAF bugs around echo skbs in multiple drivers Misc: - docs: move netdev-FAQ to the 'process' section of the documentation" * tag 'net-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits) vxlan: do not feed vxlan_vnifilter_dump_dev with non vxlan devices openvswitch: Add recirc_id to recirc warning rxrpc: fix some null-ptr-deref bugs in server_key.c rxrpc: Fix call timer start racing with call destruction net: hns3: fix software vlan talbe of vlan 0 inconsistent with hardware net: hns3: fix the concurrency between functions reading debugfs docs: netdev: move the netdev-FAQ to the process pages docs: netdev: broaden the new vs old code formatting guidelines docs: netdev: call out the merge window in tag checking docs: netdev: add missing back ticks docs: netdev: make the testing requirement more stringent docs: netdev: add a question about re-posting frequency docs: netdev: rephrase the 'should I update patchwork' question docs: netdev: rephrase the 'Under review' question docs: netdev: shorten the name and mention msgid for patch status docs: netdev: note that RFC postings are allowed any time docs: netdev: turn the net-next closed into a Warning docs: netdev: move the patch marking section up docs: netdev: minor reword docs: netdev: replace references to old archives ...
2022-03-31Merge tag 'v5.18-p1' of ↵Linus Torvalds3-22/+22
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: - Missing Kconfig dependency on arm that leads to boot failure - x86 SLS fixes - Reference leak in the stm32 driver * tag 'v5.18-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: x86/sm3 - Fixup SLS crypto: x86/poly1305 - Fixup SLS crypto: x86/chacha20 - Avoid spurious jumps to other functions crypto: stm32 - fix reference leak in stm32_crc_remove crypto: arm/aes-neonbs-cbc - Select generic cbc and aes
2022-03-31Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"Thomas Gleixner1-1/+0
Revert commit bf9ad37dc8a. It needs to be better encapsulated and generalized. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2022-03-31arch: syscalls: simplify uapi/kapi directory creationMasahiro Yamada1-2/+1
$(shell ...) expands to empty. There is no need to assign it to _dummy. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
2022-03-30x86/fpu/xstate: Consolidate size calculationsThomas Gleixner1-41/+8
Use the offset calculation to do the size calculation which avoids yet another series of CPUID instructions for each invocation. [ Fix the FP/SSE only case which missed to take the xstate header into account, as Reported-by: kernel test robot <oliver.sang@intel.com> ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/87o81pgbp2.ffs@tglx
2022-03-30x86/fpu/xstate: Handle supervisor states in XSTATE permissionsThomas Gleixner1-0/+3
The size calculation in __xstate_request_perm() fails to take supervisor states into account because the permission bitmap is only relevant for user states. Up to 5.17 this does not matter because there are no supervisor states supported, but the (re-)enabling of ENQCMD makes them available. Fixes: 7c1ef59145f1 ("x86/cpufeatures: Re-enable ENQCMD") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220324134623.681768598@linutronix.de
2022-03-30x86/fpu/xsave: Handle compacted offsets correctly with supervisor statesThomas Gleixner1-45/+41
So far the cached fixed compacted offsets worked, but with (re-)enabling of ENQCMD this does no longer work with KVM fpstate. KVM does not have supervisor features enabled for the guest FPU, which means that KVM has then a different XSAVE area layout than the host FPU state. This in turn breaks the copy from/to UABI functions when invoked for a guest state. Remove the pre-calculated compacted offsets and calculate the offset of each component at runtime based on the XCOMP_BV field in the XSAVE header. The runtime overhead is not interesting because these copy from/to UABI functions are not used in critical fast paths. KVM uses them to save and restore FPU state during migration. The host uses them for ptrace and for the slow path of 32bit signal handling. Fixes: 7c1ef59145f1 ("x86/cpufeatures: Re-enable ENQCMD") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220324134623.627636809@linutronix.de
2022-03-30x86/fpu: Cache xfeature flags from CPUIDThomas Gleixner1-36/+13
In preparation for runtime calculation of XSAVE offsets cache the feature flags for each XSTATE component during feature enumeration via CPUID(0xD). EDX has two relevant bits: 0 Supervisor component 1 Feature storage must be 64 byte aligned These bits are currently only evaluated during init, but the alignment bit must be cached to make runtime calculation of XSAVE offsets efficient. Cache the full EDX content and use it for the existing alignment and supervisor checks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220324134623.573656209@linutronix.de
2022-03-30x86/fpu/xsave: Initialize offset/size cache earlyThomas Gleixner1-2/+5
Reading XSTATE feature information from CPUID over and over does not make sense. The information has to be cached anyway, so it can be done early. Prepare for runtime calculation of XSTATE offsets and allow consolidation of the size calculation functions in a later step. Rename the function while at it as it does not setup any features. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220324134623.519411939@linutronix.de
2022-03-30x86/fpu: Remove unused supervisor only offsetsThomas Gleixner1-30/+0
No users. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220324134623.465066249@linutronix.de
2022-03-30crypto: x86/sm3 - Fixup SLSPeter Zijlstra1-1/+1
This missed the big asm update due to being merged through the crypto tree. Fixes: f94909ceb1ed ("x86: Prepare asm files for straight-line-speculation") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-03-29x86/fpu: Remove redundant XCOMP_BV initializationThomas Gleixner1-3/+0
fpu_copy_uabi_to_guest_fpstate() initializes the XCOMP_BV field in the XSAVE header. That's a leftover from the old KVM FPU buffer handling code. Since d69c1382e1b7 ("x86/kvm: Convert FPU handling to a single swap buffer") KVM uses the FPU core allocation code, which initializes the XCOMP_BV field already. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220324134623.408932232@linutronix.de
2022-03-29KVM: x86: Forbid VMM to set SYNIC/STIMER MSRs when SynIC wasn't activatedVitaly Kuznetsov1-3/+6
Setting non-zero values to SYNIC/STIMER MSRs activates certain features, this should not happen when KVM_CAP_HYPERV_SYNIC{,2} was not activated. Note, it would've been better to forbid writing anything to SYNIC/STIMER MSRs, including zeroes, however, at least QEMU tries clearing HV_X64_MSR_STIMER0_CONFIG without SynIC. HV_X64_MSR_EOM MSR is somewhat 'special' as writing zero there triggers an action, this also should not happen when SynIC wasn't activated. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220325132140.25650-4-vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-29KVM: x86: Avoid theoretical NULL pointer dereference in ↵Vitaly Kuznetsov1-0/+4
kvm_irq_delivery_to_apic_fast() When kvm_irq_delivery_to_apic_fast() is called with APIC_DEST_SELF shorthand, 'src' must not be NULL. Crash the VM with KVM_BUG_ON() instead of crashing the host. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220325132140.25650-3-vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-29KVM: x86: Check lapic_in_kernel() before attempting to set a SynIC irqVitaly Kuznetsov1-0/+3
When KVM_CAP_HYPERV_SYNIC{,2} is activated, KVM already checks for irqchip_in_kernel() so normally SynIC irqs should never be set. It is, however, possible for a misbehaving VMM to write to SYNIC/STIMER MSRs causing erroneous behavior. The immediate issue being fixed is that kvm_irq_delivery_to_apic() (kvm_irq_delivery_to_apic_fast()) crashes when called with 'irq.shorthand = APIC_DEST_SELF' and 'src == NULL'. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220325132140.25650-2-vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-29KVM: x86: Fix clang -Wimplicit-fallthrough in do_host_cpuid()Nathan Chancellor1-0/+1
Clang warns: arch/x86/kvm/cpuid.c:739:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] default: ^ arch/x86/kvm/cpuid.c:739:2: note: insert 'break;' to avoid fall-through default: ^ break; 1 error generated. Clang is a little more pedantic than GCC, which does not warn when falling through to a case that is just break or return. Clang's version is more in line with the kernel's own stance in deprecated.rst, which states that all switch/case blocks must end in either break, fallthrough, continue, goto, or return. Add the missing break to silence the warning. Fixes: f144c49e8c39 ("KVM: x86: synthesize CPUID leaf 0x80000021h if useful") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Message-Id: <20220322152906.112164-1-nathan@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-29x86/sev: Unroll string mmio with CC_ATTR_GUEST_UNROLL_STRING_IOJoerg Roedel1-8/+57
The io-specific memcpy/memset functions use string mmio accesses to do their work. Under SEV, the hypervisor can't emulate these instructions because they read/write directly from/to encrypted memory. KVM will inject a page fault exception into the guest when it is asked to emulate string mmio instructions for an SEV guest: BUG: unable to handle page fault for address: ffffc90000065068 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 8000100000067 P4D 8000100000067 PUD 80001000fb067 PMD 80001000fc067 PTE 80000000fed40173 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc7 #3 As string mmio for an SEV guest can not be supported by the hypervisor, unroll the instructions for CC_ATTR_GUEST_UNROLL_STRING_IO enabled kernels. This issue appears when kernels are launched in recent libvirt-managed SEV virtual machines, because virt-install started to add a tpm-crb device to the guest by default and proactively because, raisins: https://github.com/virt-manager/virt-manager/commit/eb58c09f488b0633ed1eea012cd311e48864401e and as that commit says, the default adding of a TPM can be disabled with "virt-install ... --tpm none". The kernel driver for tpm-crb uses memcpy_to/from_io() functions to access MMIO memory, resulting in a page-fault injected by KVM and crashing the kernel at boot. [ bp: Massage and extend commit message. ] Fixes: d8aa7eea78a1 ('x86/mm: Add Secure Encrypted Virtualization (SEV) support') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220321093351.23976-1-joro@8bytes.org