path: root/arch/x86/kvm
Age  Commit message  Author  Files  Lines
2019-04-11  x86/pkeys: Provide *pkru() helpers  (Sebastian Andrzej Siewior; 1 file, -1/+1)
Dave Hansen asked for __read_pkru() and __write_pkru() to be symmetrical. As part of the series __write_pkru() will read back the value and only write it if it is different. In order to make both functions symmetrical, move the function containing only the opcode asm into a function called like the instruction itself. __write_pkru() will just invoke wrpkru() but in a follow-up patch will also read back the value. [ bp: Convert asm opcode wrapper names to rd/wrpkru(). ] Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Juergen Gross <jgross@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190403164156.19645-13-bigeasy@linutronix.de
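For reference, a minimal sketch of what instruction-named opcode wrappers look like (simplified, not the exact patch; the raw byte encodings are used so old assemblers do not need to know RDPKRU/WRPKRU):

  static inline u32 rdpkru(void)
  {
      u32 ecx = 0, edx, pkru;

      /* RDPKRU: returns PKRU in EAX, expects ECX = 0 */
      asm volatile(".byte 0x0f,0x01,0xee\n\t"
                   : "=a" (pkru), "=d" (edx)
                   : "c" (ecx));
      return pkru;
  }

  static inline void wrpkru(u32 pkru)
  {
      u32 ecx = 0, edx = 0;

      /* WRPKRU: takes the new PKRU value in EAX, ECX = EDX = 0 */
      asm volatile(".byte 0x0f,0x01,0xef\n\t"
                   : : "a" (pkru), "c" (ecx), "d" (edx));
  }

__write_pkru() then simply calls wrpkru(); the follow-up patch adds the "read back and skip identical writes" logic on top.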
2019-04-10  x86/fpu: Use a feature number instead of mask in two more helpers  (Sebastian Andrzej Siewior; 1 file, -14/+14)
After changing the argument of __raw_xsave_addr() from a mask to number Dave suggested to check if it makes sense to do the same for get_xsave_addr(). As it turns out it does. Only get_xsave_addr() needs the mask to check if the requested feature is part of what is supported/saved and then uses the number again. The shift operation is cheaper compared to fls64() (find last bit set). Also, the feature number uses less opcode space compared to the mask. :) Make the get_xsave_addr() argument a xfeature number instead of a mask and fix up its callers. Furthermore, use xfeature_nr and xfeature_mask consistently. This results in the following changes to the kvm code: feature -> xfeature_mask index -> xfeature_nr Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Siarhei Liakh <Siarhei.Liakh@concurrent-rt.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190403164156.19645-12-bigeasy@linutronix.de
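A simplified sketch of the resulting shape (illustrative only, not the literal diff): the mask is derived from the number only where a supported/saved check is needed, and the number itself indexes the component.

  #define XFEATURE_MASK(nr)   BIT_ULL(nr)

  void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
  {
      /* the mask is only needed to test what is supported/saved */
      if (!(xfeatures_mask & XFEATURE_MASK(xfeature_nr)))
          return NULL;

      return __raw_xsave_addr(xsave, xfeature_nr);
  }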
2019-04-05  KVM: x86: nVMX: fix x2APIC VTPR read intercept  (Marc Orr; 1 file, -1/+1)
Referring to the "VIRTUALIZING MSR-BASED APIC ACCESSES" chapter of the SDM, when "virtualize x2APIC mode" is 1 and "APIC-register virtualization" is 0, a RDMSR of 808H should return the VTPR from the virtual APIC page. However, for nested, KVM currently fails to disable the read intercept for this MSR. This means that a RDMSR exit takes precedence over "virtualize x2APIC mode", and KVM passes through L1's TPR to L2, instead of sourcing the value from L2's virtual APIC page. This patch fixes the issue by disabling the read intercept, in VMCS02, for the VTPR when "APIC-register virtualization" is 0. The issue described above and the fix prescribed here were verified with a related patch in kvm-unit-tests titled "Test VMX's virtualize x2APIC mode w/ nested". Signed-off-by: Marc Orr <marcorr@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Fixes: c992384bde84f ("KVM: vmx: speed up MSR bitmap merge") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-05  KVM: x86: nVMX: close leak of L0's x2APIC MSRs (CVE-2019-3887)  (Marc Orr; 1 file, -28/+44)
The nested_vmx_prepare_msr_bitmap() function doesn't directly guard the x2APIC MSR intercepts with the "virtualize x2APIC mode" control. As a result, we discovered the potential for a buggy or malicious L1 to get access to L0's x2APIC MSRs, via an L2, as follows.

  1. L1 executes WRMSR(IA32_SPEC_CTRL, 1). This causes the spec_ctrl variable in nested_vmx_prepare_msr_bitmap() to become true.
  2. L1 disables "virtualize x2APIC mode" in VMCS12.
  3. L1 enables "APIC-register virtualization" in VMCS12.

Now, KVM will set VMCS02's x2APIC MSR intercepts from VMCS12, and then set "virtualize x2APIC mode" to 0 in VMCS02. Oops.

This patch closes the leak by explicitly guarding VMCS02's x2APIC MSR intercepts with VMCS12's "virtualize x2APIC mode" control. The scenario outlined above and the fix prescribed here were verified with a related patch in kvm-unit-tests titled "Add leak scenario to virt_x2apic_mode_test".

Note, it looks like this issue may have been introduced inadvertently during a merge---see 15303ba5d1cd.

Signed-off-by: Marc Orr <marcorr@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
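For illustration only (the merge helper name below is hypothetical, not the literal patch), the shape of the fix is to make VMCS12's "virtualize x2APIC mode" control the sole gate for taking any x2APIC MSR intercepts from L1:

  /* Nothing about the x2APIC MSR range (0x800-0x8ff) is merged from vmcs12
   * unless L1 actually enabled "virtualize x2APIC mode". */
  if (nested_cpu_has_virt_x2apic_mode(vmcs12))
      nested_vmx_merge_x2apic_intercepts(vmcs12, msr_bitmap_l1, msr_bitmap_l0);
  /* else: L0's x2APIC MSRs stay intercepted for L2 */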
2019-04-05  KVM: SVM: prevent DBG_DECRYPT and DBG_ENCRYPT overflow  (David Rientjes; 1 file, -3/+9)
This ensures that the address and length provided to DBG_DECRYPT and DBG_ENCRYPT do not cause an overflow. At the same time, pass the actual number of pages pinned in memory to sev_unpin_memory() as a cleanup. Reported-by: Cfir Cohen <cfir@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
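A minimal sketch of the kind of guard this adds (field names illustrative, not the exact diff): reject a debug request whose source or destination range wraps around the address space.

  if (dbg->src_uaddr + dbg->len < dbg->src_uaddr ||
      dbg->dst_uaddr + dbg->len < dbg->dst_uaddr)
      return -EINVAL;    /* address + length overflowed */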
2019-04-05  kvm: svm: fix potential get_num_contig_pages overflow  (David Rientjes; 1 file, -5/+5)
get_num_contig_pages() could potentially overflow int so make its type consistent with its usage. Reported-by: Cfir Cohen <cfir@google.com> Cc: stable@vger.kernel.org Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28  KVM: x86: update %rip after emulating IO  (Sean Christopherson; 1 file, -10/+26)
Most (all?) x86 platforms provide a port IO based reset mechanism, e.g. OUT 92h or CF9h. Userspace may emulate said mechanism, i.e. reset a vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM that it is doing a reset, e.g. Qemu jams vCPU state and resumes running. To avoid corrupting %rip after such a reset, commit 0967b7bf1c22 ("KVM: Skip pio instruction when it is emulated, not executed") changed the behavior of PIO handlers, i.e. today's "fast" PIO handling, to skip the instruction prior to exiting to userspace. Full emulation doesn't need such tricks because re-emulating the instruction will naturally handle %rip being changed to point at the reset vector.

Updating %rip prior to exiting to userspace has several drawbacks:

  - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation fails it will likely yell about the wrong address.
  - Single step exits to userspace are effectively dropped as KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
  - Behavior of PIO emulation is different depending on whether it goes down the fast path or the slow path.

Rather than skip the PIO instruction before exiting to userspace, snapshot the linear %rip and cancel PIO completion if the current value does not match the snapshot. For a 64-bit vCPU, i.e. the most common scenario, the snapshot and comparison has negligible overhead as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra VMREAD in this case.

All other alternatives to snapshotting the linear %rip that don't rely on an explicit reset announcement suffer from one corner case or another. For example, canceling PIO completion on any write to %rip fails if userspace does a save/restore of %rip, and attempting to avoid that issue by canceling PIO only if %rip changed then fails if PIO collides with the reset %rip. Attempting to zero in on the exact reset vector won't work for APs, which means adding more hooks such as the vCPU's MP_STATE, and so on and so forth. Checking for a linear %rip match technically suffers from corner cases, e.g. userspace could theoretically rewrite the underlying code page and expect a different instruction to execute, or the guest hardcodes a PIO reset at 0xfffffff0, but those are far, far outside of what can be considered normal operation.

Fixes: 432baf60eee3 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O") Cc: <stable@vger.kernel.org> Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
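A simplified sketch of the approach (field and handler names are approximate, not the verbatim patch): remember the linear RIP of the in-flight PIO instruction and only skip the instruction in the completion callback if %rip still matches.

  static int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size,
                              unsigned short port)
  {
      /* ... emit KVM_EXIT_IO, then defer the skip to the completion hook */
      vcpu->arch.pio_linear_rip = kvm_get_linear_rip(vcpu);  /* name illustrative */
      vcpu->arch.complete_userspace_io = complete_fast_pio_out;
      return 0;
  }

  static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
  {
      /* vCPU was reset (or %rip otherwise moved): cancel the skip */
      if (!kvm_is_linear_rip(vcpu, vcpu->arch.pio_linear_rip))
          return 1;

      return kvm_skip_emulated_instruction(vcpu);
  }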
2019-03-28  x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init  (Vitaly Kuznetsov; 1 file, -2/+7)
When userspace initializes guest vCPUs it may want to zero all supported MSRs including Hyper-V related ones including HV_X64_MSR_STIMERn_CONFIG/ HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5d89a ("kvm/x86: Update SynIC timers on guest entry only") we began doing stimer_mark_pending() unconditionally on every config change.

The issue I'm observing manifests itself as follows:

  - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
  - kvm_hv_has_stimer_pending() starts returning true;
  - kvm_vcpu_has_events() starts returning true;
  - kvm_arch_vcpu_runnable() starts returning true;
  - when kvm_arch_vcpu_ioctl_run() gets into (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
    - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns immediately, avoiding normal wait path;
    - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing userspace to retry.

So instead of normal wait path we get a busy loop on all secondary vCPUs before they get INIT signal. This seems to be undesirable, especially given that this happens even when Hyper-V extensions are not used.

Generally, it seems to be pointless to mark an stimer as pending in stimer_pending_bitmap and arm KVM_REQ_HV_STIMER as the only thing kvm_hv_process_stimers() will do is clear the corresponding bit. We may just not mark disabled timers as pending instead.

Fixes: f3b138c5d89a ("kvm/x86: Update SynIC timers on guest entry only") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
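Illustrative sketch of the idea (simplified, not the literal diff): on a config write, only mark the stimer pending when the timer is actually being enabled, so zeroing the MSRs during vCPU init no longer arms KVM_REQ_HV_STIMER.

  static int stimer_set_config(struct kvm_vcpu_hv_stimer *stimer, u64 config,
                               bool host)
  {
      stimer_cleanup(stimer);
      stimer->config = config;

      /* previously unconditional; disabled timers have nothing to process */
      if (stimer->config & HV_STIMER_ENABLE)
          stimer_mark_pending(stimer, false);

      return 0;
  }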
2019-03-28  kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs  (Xiaoyao Li; 1 file, -1/+2)
MSR_IA32_ARCH_CAPABILITIES is emulated unconditionally even if the host doesn't support it, so move it from the msrs_to_save array to the emulated_msrs array in order to report to userspace that guests support this MSR. Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28  KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts  (Sean Christopherson; 3 files, -14/+12)
The CPUID flag ARCH_CAPABILITIES is unconditionally exposed to host userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES regardless of hardware support under the pretense that KVM fully emulates MSR_IA32_ARCH_CAPABILITIES. Unfortunately, only VMX hosts handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts). Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so that it's emulated on AMD hosts. Fixes: 1eaafe91a0df4 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported") Cc: stable@vger.kernel.org Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
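Roughly, the read side ends up in the vendor-neutral kvm_get_msr_common() switch (simplified sketch, not the verbatim diff), so both VMX and SVM hosts return the emulated value:

  case MSR_IA32_ARCH_CAPABILITIES:
      if (!msr_info->host_initiated &&
          !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
          return 1;
      msr_info->data = vcpu->arch.arch_capabilities;
      break;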
2019-03-28  kvm: mmu: Used range based flushing in slot_handle_level_range  (Ben Gardon; 1 file, -2/+5)
Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address in slot_handle_level_range. When range based flushes are not enabled kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs. This changes the behavior of many functions that indirectly use slot_handle_level_range, iff the range based flushes are enabled. The only potential problem I see with this is that kvm->tlbs_dirty will be cleared less often; however, the only caller of slot_handle_level_range that checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which checks it and does a kvm_flush_remote_tlbs after calling kvm_unmap_hva_range anyway. Tested: Ran all kvm-unit-tests on an Intel Haswell machine with and without this patch. The patch introduced no new failures. Signed-off-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28  KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region()  (Wei Yang; 2 files, -7/+3)
* nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is non-zero.

* nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages() never returns zero.

Based on these two reasons, we can merge the two *if* clauses and use the return value from kvm_mmu_calculate_mmu_pages() directly. This simplifies the code and also eliminates the possibility of a reader believing nr_mmu_pages could be zero.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
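The resulting call site is essentially (a sketch, not the verbatim diff):

  if (!kvm->arch.n_requested_mmu_pages)
      kvm_mmu_change_mmu_pages(kvm, kvm_mmu_calculate_mmu_pages(kvm));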
2019-03-28  kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields  (Krish Sadhukhan; 1 file, -0/+5)
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the following check is performed on vmentry of L2 guests:

    On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP field and the IA32_SYSENTER_EIP field must each contain a canonical address.

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
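Illustrative sketch of the added consistency check (KVM nVMX host-state checking context assumed; not the literal patch):

  #ifdef CONFIG_X86_64
      if (is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu) ||
          is_noncanonical_address(vmcs12->host_ia32_sysenter_eip, vcpu))
          return -EINVAL;    /* fail the nested vmentry host-state check */
  #endif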
2019-03-28  KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)  (Singh, Brijesh; 3 files, -3/+43)
Errata#1096: On a nested data page fault when CR.SMAP=1 and the guest data read generates a SMAP violation, the GuestInstrBytes field of the VMCB on a VMEXIT will incorrectly return 0h instead of the correct guest instruction bytes.

Recommended workaround: To determine what instruction the guest was executing, the hypervisor will have to decode the instruction at the instruction pointer.

The recommended workaround cannot be implemented for an SEV guest because guest memory is encrypted with the guest-specific key, and the instruction decoder will not be able to decode the instruction bytes. If we hit this errata in an SEV guest then log the message and request a guest shutdown.

Reported-by: Venkatesh Srinivas <venkateshs@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
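A simplified sketch of the workaround's control flow (the helper name and message text are hypothetical, not the actual code):

  static bool svm_can_emulate_on_npf(struct kvm_vcpu *vcpu)
  {
      struct vcpu_svm *svm = to_svm(vcpu);

      /* Normal case: the CPU provided the faulting instruction bytes. */
      if (svm->vmcb->control.insn_len != 0)
          return true;

      /* Errata 1096: no bytes. For SEV guests the memory is encrypted, so
       * the instruction cannot be decoded either; request a shutdown. */
      if (sev_guest(vcpu->kvm)) {
          pr_err_ratelimited("KVM: SEV guest hit errata 1096\n");
          kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
          return false;
      }

      /* Non-SEV: fall back to decoding the instruction at the guest RIP. */
      return true;
  }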
2019-03-28  KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'  (Sean Christopherson; 2 files, -16/+26)
The cr4_pae flag is a bit of a misnomer, its purpose is really to track whether the guest PTE that is being shadowed is a 4-byte entry or an 8-byte entry. Prior to supporting nested EPT, the size of the gpte was reflected purely by CR4.PAE. KVM fudged things a bit for direct sptes, but it was mostly harmless since the size of the gpte never mattered. Now that a spte may be tracking an indirect EPT entry, relying on CR4.PAE is wrong and ill-named. For direct shadow pages, force the gpte_size to '1' as they are always 8-byte entries; EPT entries can only be 8-bytes and KVM always uses 8-byte entries for NPT and its identity map (when running with EPT but not unrestricted guest). Likewise, nested EPT entries are always 8-bytes. Nested EPT presents a unique scenario as the size of the entries are not dictated by CR4.PAE, but neither is the shadow page a direct map. To handle this scenario, set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination, to denote a nested EPT shadow page. Use the information to avoid incorrectly zapping an unsync'd indirect page in __kvm_sync_page(). Providing a consistent and accurate gpte_size fixes a bug reported by Vitaly where fast_cr3_switch() always fails when switching from L2 to L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages, whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE. Fixes: 7dcd575520082 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28  KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT  (Sean Christopherson; 1 file, -4/+9)
Explicitly zero out quadrant and invalid instead of inheriting them from the root_mmu. Functionally, this patch is a nop as we (should) never set quadrant for a direct mapped (EPT) root_mmu and nested EPT is only allowed if EPT is used for L1, and the root_mmu will never be invalid at this point. Explicitly setting flags sets the stage for repurposing the legacy paging bits in role, e.g. nxe, cr0_wp, and sm{a,e}p_andnot_wp, at which point 'smm' would be the only flag to be inherited from root_mmu. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-15  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds; 18 files, -574/+618)
Pull KVM updates from Paolo Bonzini:

 "ARM:
   - some cleanups
   - direct physical timer assignment
   - cache sanitization for 32-bit guests

  s390:
   - interrupt cleanup
   - introduction of the Guest Information Block
   - preparation for processor subfunctions in cpu models

  PPC:
   - bug fixes and improvements, especially related to machine checks and protection keys

  x86:
   - many, many cleanups, including removing a bunch of MMU code for unnecessary optimizations
   - AVIC fixes

  Generic:
   - memcg accounting"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
  kvm: vmx: fix formatting of a comment
  KVM: doc: Document the life cycle of a VM and its resources
  MAINTAINERS: Add KVM selftests to existing KVM entry
  Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
  KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
  KVM: PPC: Fix compilation when KVM is not enabled
  KVM: Minor cleanups for kvm_main.c
  KVM: s390: add debug logging for cpu model subfunctions
  KVM: s390: implement subfunction processor calls
  arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
  KVM: arm/arm64: Remove unused timer variable
  KVM: PPC: Book3S: Improve KVM reference counting
  KVM: PPC: Book3S HV: Fix build failure without IOMMU support
  Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
  x86: kvmguest: use TSC clocksource if invariant TSC is exposed
  KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
  KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
  KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
  KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
  KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
  ...
2019-03-15  kvm: vmx: fix formatting of a comment  (Paolo Bonzini; 1 file, -5/+5)
Eliminate a gratuitous conflict with 5.0. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-15  Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"  (Ben Gardon; 1 file, -13/+3)
This reverts commit 71883a62fcd6c70639fa12cda733378b4d997409.

The above commit contains an optimization to kvm_zap_gfn_range which uses gfn-limited TLB flushes, if enabled. If using these limited flushes, kvm_zap_gfn_range passes lock_flush_tlb=false to slot_handle_level_range which creates a race when the function unlocks to call cond_resched. See an example of this race below:

           CPU 0                        CPU 1                          CPU 3
  // zap_direct_gfn_range
  mmu_lock()
  // *ptep == pte_1
  *ptep = 0
  if (lock_flush_tlb)
          flush_tlbs()
  mmu_unlock()
                               // In invalidate range
                               // MMU notifier
                               mmu_lock()
                               if (pte != 0)
                                       *ptep = 0
                                       flush = true
                               if (flush)
                                       flush_remote_tlbs()
                               mmu_unlock()
                               return
                                                              // Host MM reallocates
                                                              // page previously
                                                              // backing guest memory.
                                                              // Guest accesses
                                                              // invalid page
                                                              // through pte_1
                                                              // in its TLB!!

Tested: Ran all kvm-unit-tests on an Intel Haswell machine with and without this patch. The patch introduced no new failures.

Signed-off-by: Ben Gardon <bgardon@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-06  Documentation: Move L1TF to separate directory  (Thomas Gleixner; 1 file, -2/+2)
Move L1TF to a separate directory so the MDS stuff can be added at the side. Otherwise all the hardware vulnerabilities would have their own top level entry. Should have done that right away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jon Masters <jcm@redhat.com>
2019-03-06  x86/kvm/vmx: Add MDS protection when L1D Flush is not active  (Thomas Gleixner; 1 file, -0/+3)
CPUs which are affected by L1TF and MDS mitigate MDS with the L1D Flush on VMENTER when updated microcode is installed. If a CPU is not affected by L1TF or if the L1D Flush is not in use, then MDS mitigation needs to be invoked explicitly. For these cases, follow the host mitigation state and invoke the MDS mitigation before VMENTER. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>
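Conceptually the vmentry path gains one branch (sketch of the ordering described above; mds_clear_cpu_buffers() and the static keys are the existing mitigation primitives):

  if (static_branch_unlikely(&vmx_l1d_should_flush))
      vmx_l1d_flush(vcpu);          /* L1D flush also clears the CPU buffers */
  else if (static_branch_unlikely(&mds_user_clear))
      mds_clear_cpu_buffers();      /* VERW-based MDS mitigation */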
2019-03-06  x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests  (Andi Kleen; 1 file, -1/+2)
X86_FEATURE_MD_CLEAR is a new CPUID bit which is set when microcode provides the mechanism to invoke a flush of various exploitable CPU buffers by invoking the VERW instruction. Hand it through to guests so they can adjust their mitigations. This also requires corresponding qemu changes, which are available separately. [ tglx: Massaged changelog ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>
2019-02-22  KVM: MMU: record maximum physical address width in kvm_mmu_extended_role  (Yu Zhang; 1 file, -0/+1)
Previously, commit 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed") offered some optimization to avoid the unnecessary reconfiguration. Yet one scenario is broken - when cpuid changes VM's maximum physical address width, reconfiguration is needed to reset the reserved bits. Also, the TDP may need to reset its shadow_root_level when this value is changed. To fix this, a new field, maxphyaddr, is introduced in the extended role structure to keep track of the configured guest physical address width. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-22  kvm: x86: Return LA57 feature based on hardware capability  (Yu Zhang; 1 file, -0/+4)
Previously, commit 372fddf70904 ("x86/mm: Introduce the 'no5lvl' kernel parameter") cleared X86_FEATURE_LA57 in boot_cpu_data, if Linux chooses to not run in 5-level paging mode. Yet boot_cpu_data is queried by do_cpuid_ent() as the host capability later when creating vcpus, and Qemu will not be able to detect this feature and create VMs with LA57 feature. As discussed earlier, VMs can still benefit from extended linear address width, e.g. to enhance features like ASLR. So we would like to fix this, by returning the true hardware capability when Qemu queries. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-22  x86/kvm/mmu: fix switch between root and guest MMUs  (Vitaly Kuznetsov; 1 file, -4/+13)
Commit 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu") brought one subtle change: previously, when switching back from L2 to L1, we were resetting MMU hooks (like mmu->get_cr3()) in kvm_init_mmu() called from nested_vmx_load_cr3() and now we do that in nested_ept_uninit_mmu_context() when we re-target vcpu->arch.mmu pointer. The change itself looks logical: if nested_ept_init_mmu_context() changes something then nested_ept_uninit_mmu_context() restores it back.

There is, however, one thing: the following call chain:

  nested_vmx_load_cr3()
    kvm_mmu_new_cr3()
      __kvm_mmu_new_cr3()
        fast_cr3_switch()
          cached_root_available()

now happens with MMU hooks pointing to the new MMU (root MMU in our case) while previously it was happening with the old one. cached_root_available() tries to stash the current root but it is incorrect to read current CR3 with mmu->get_cr3(); we need to use old_mmu->get_cr3() which, in case we're switching from L2 to L1, is guest_mmu. (BTW, in the shadow page tables case this is a non-issue because we don't switch MMU).

While we could've tried to guess that we're switching between MMUs and call the right ->get_cr3() from cached_root_available(), this seems to be overly complicated. Instead, just stash the corresponding CR3 when setting root_hpa and make cached_root_available() use the stashed value.

Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()  (Sean Christopherson; 1 file, -23/+10)
...via a new helper, __kvm_mmu_zap_all(). An alternative to passing a 'bool mmio_only' would be to pass a callback function to filter the shadow page, i.e. to make __kvm_mmu_zap_all() generic and reusable, but zapping all shadow pages is a last resort, i.e. making the helper less extensible is a feature of sorts. And the explicit MMIO parameter makes it easy to preserve the WARN_ON_ONCE() if a restart is triggered when zapping MMIO sptes. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
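A rough sketch of the consolidated helper (simplified, not the verbatim code):

  static void __kvm_mmu_zap_all(struct kvm *kvm, bool mmio_only)
  {
      struct kvm_mmu_page *sp, *node;
      LIST_HEAD(invalid_list);
      int ign;

      spin_lock(&kvm->mmu_lock);
  restart:
      list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
          if (mmio_only && !sp->mmio_cached)
              continue;
          /* Restart only if zapping made the list unstable; that should
           * never happen when zapping only MMIO sptes, hence the WARN. */
          if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign)) {
              WARN_ON_ONCE(mmio_only);
              goto restart;
          }
      }
      kvm_mmu_commit_zap_page(kvm, &invalid_list);
      spin_unlock(&kvm->mmu_lock);
  }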
2019-02-20  KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children  (Sean Christopherson; 1 file, -2/+5)
Paolo expressed a concern that kvm_mmu_zap_mmio_sptes() could have a quadratic runtime[1], i.e. restarting the spte walk while zapping only MMIO sptes could result in re-walking large portions of the list over and over due to the non-MMIO sptes encountered before the restart not being removed. At the time, the concern was legitimate as the walk was restarted when any spte was zapped. But that is no longer the case as the walk is now restarted iff one or more children have been zapped, which is necessary because zapping children makes the active_mmu_pages list unstable. Furthermore, it should be impossible for an MMIO spte to have children, i.e. zapping an MMIO spte should never result in zapping children. In other words, kvm_mmu_zap_mmio_sptes() should never restart its walk, and so should always execute in linear time. WARN if this assertion fails. Although it should never be needed, leave the restart logic in place. In normal operation, the cost is at worst an extra CMP+Jcc, and if for some reason the list does become unstable, not restarting would likely crash KVM, or worse, the kernel. [1] https://patchwork.kernel.org/patch/10756589/#22452085 Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86/mmu: Differentiate between nr zapped and list unstable  (Sean Christopherson; 1 file, -10/+26)
The return value of kvm_mmu_prepare_zap_page() has evolved to become overloaded to convey two separate pieces of information: 1) was at least one page zapped and 2) has the list of MMU pages become unstable.

In its original incarnation (as kvm_mmu_zap_page()), there was no return value at all. Commit 0738541396be ("KVM: MMU: awareness of new kvm_mmu_zap_page behaviour") added a return value in preparation for commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"). Although the return value was of type 'int', it was actually used as a boolean to indicate whether or not active_mmu_pages may have become unstable due to zapping children. Walking a list with list_for_each_entry_safe() only protects against deleting/moving the current entry, i.e. zapping a child page would break iteration due to modifying any number of entries.

Later, commit 60c8aec6e2c9 ("KVM: MMU: use page array in unsync walk") modified mmu_zap_unsync_children() to return an approximation of the number of children zapped. This was not intentional, it was simply a side effect of how the code was written. The unintended side effect was then morphed into an actual feature by commit 77662e0028c7 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling path"), which modified kvm_mmu_change_mmu_pages() to use the number of zapped pages when determining the number of MMU pages in use by the VM.

Finally, commit 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed") added the initial page to the return value to make its behavior more consistent with what most users would expect. Incorporating the initial parent page in the return value of kvm_mmu_zap_page() breaks the original usage of restarting a list walk on a non-zero return value to handle a potentially unstable list, i.e. walks will unnecessarily restart when any page is zapped.

Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e. return a boolean to indicate that the list may be unstable and move the number of zapped children to a dedicated parameter. Since the majority of callers to kvm_mmu_prepare_zap_page() don't care about either return value, preserve the current definition of kvm_mmu_prepare_zap_page() by making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page(). This avoids having to update every call site and also provides cleaner code for functions that only care about the number of pages zapped.

Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
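Sketch of the resulting split (simplified): the double-underscore helper reports "list unstable" as its return value and the zap count via an out parameter, while the existing name keeps its old meaning for callers.

  static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
                                         struct list_head *invalid_list,
                                         int *nr_zapped);

  static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
                                      struct list_head *invalid_list)
  {
      int nr_zapped;

      __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped);
      return nr_zapped;
  }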
2019-02-20  Revert "KVM: MMU: fast invalidate all pages"  (Sean Christopherson; 2 files, -98/+1)
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches from the original series[1], now that all users of the fast invalidate mechanism are gone. This reverts commit 5304b8d37c2a5ebca48330f5e7868d240eafbed1. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes  (Sean Christopherson; 1 file, -1/+2)
Call cond_resched_lock() when zapping all sptes to reschedule if needed or to release and reacquire mmu_lock in case of contention. There is no need to flush or zap when temporarily dropping mmu_lock as zapping all sptes is done only when the owning userspace VMM has exited or when the VM is being destroyed, i.e. there is no interplay with memslots or MMIO generations to worry about. Be paranoid and restart the walk if mmu_lock is dropped to avoid any potential issues with consuming a stale iterator. The overhead in doing so is negligible as at worst there will be a few root shadow pages at the head of the list, i.e. the iterator is essentially the head of the list already. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86/mmu: skip over invalid root pages when zapping all sptes  (Sean Christopherson; 1 file, -1/+4)
...to guarantee forward progress. When zapped, root pages are marked invalid and moved to the head of the active pages list until they are explicitly freed. Theoretically, having unzappable root pages at the head of the list could prevent kvm_mmu_zap_all() from making forward progress were a future patch to add a loop restart after processing a page, e.g. to drop mmu_lock on contention. Although kvm_mmu_prepare_zap_page() can theoretically take action on invalid pages, e.g. to zap unsync children, functionally it's not necessary (root pages will be re-zapped when freed) and practically speaking the odds of e.g. @unsync or @unsync_children becoming %true while zapping all pages is basically nil. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  Revert "KVM: x86: use the fast way to invalidate all pages"  (Sean Christopherson; 2 files, -1/+16)
Revert to a slow kvm_mmu_zap_all() for kvm_arch_flush_shadow_all(). Flushing all shadow entries is only done during VM teardown, i.e. kvm_arch_flush_shadow_all() is only called when the associated MM struct is being released or when the VM instance is being freed. Although the performance of teardown itself isn't critical, KVM should still voluntarily schedule to play nice with the rest of the kernel; but that can be done without the fast invalidate mechanism in a future patch. This reverts commit 6ca18b6950f8dee29361722f28f69847724b276f. Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints"  (Sean Christopherson; 1 file, -12/+9)
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this is one part of reverting all patches from the series that introduced the mechanism[1]. This reverts commit 2248b023219251908aedda0621251cffc548f258. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages"  (Sean Christopherson; 2 files, -22/+0)
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this is one part of reverting all patches from the series that introduced the mechanism[1]. This reverts commit 35006126f024f68727c67001b9cb703c38f69268. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  Revert "KVM: MMU: zap pages in batch"  (Sean Christopherson; 1 file, -11/+24)
Unwinding optimizations related to obsolete pages is a step towards removing x86 KVM's fast invalidate mechanism, i.e. this is one part of reverting all patches from the series that introduced the mechanism[1]. This reverts commit e7d11c7a894986a13817c1c001e1e7668c5c4eb4. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  Revert "KVM: MMU: collapse TLB flushes when zap all pages"  (Sean Christopherson; 1 file, -28/+3)
Unwinding optimizations related to obsolete pages is a step towards removing x86 KVM's fast invalidate mechanism, i.e. this is one part of reverting all patches from the series that introduced the mechanism[1]. This reverts commit f34d251d66ba263c077ed9d2bbd1874339a4c887. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  Revert "KVM: MMU: reclaim the zapped-obsolete page first"  (Sean Christopherson; 2 files, -18/+4)
Unwinding optimizations related to obsolete pages is a step towards removing x86 KVM's fast invalidate mechanism, i.e. this is one part of reverting all patches from the series that introduced the mechanism[1]. This reverts commit 365c886860c4ba670d245e762b23987c912c129a. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86/mmu: Remove is_obsolete() call  (Sean Christopherson; 1 file, -5/+1)
Unwinding usage of is_obsolete() is a step towards removing x86's fast invalidate mechanism, i.e. this is one part of reverting all patches from the series that introduced the mechanism[1]. This is a partial revert of commit 05988d728dcd ("KVM: MMU: reduce KVM_REQ_MMU_RELOAD when root page is zapped"). [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes  (Sean Christopherson; 1 file, -1/+2)
Call cond_resched_lock() when zapping MMIO to reschedule if needed or to release and reacquire mmu_lock in case of contention. There is no need to flush or zap when temporarily dropping mmu_lock as zapping MMIO sptes is done when holding the memslots lock and with the "update in-progress" bit set in the memslots generation, which disables MMIO spte caching. The walk does need to be restarted if mmu_lock is dropped as the active pages list may be modified. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes"  (Sean Christopherson; 1 file, -1/+21)
Revert back to a dedicated (and slower) mechanism for handling the scenario where all MMIO shadow PTEs need to be zapped due to overflowing the MMIO generation number. The MMIO generation scenario is almost literally a one-in-a-million occurrence, i.e. is not a performance sensitive scenario. Restoring kvm_mmu_zap_mmio_sptes() leaves VM teardown as the only user of kvm_mmu_invalidate_zap_all_pages() and paves the way for removing the fast invalidate mechanism altogether. This reverts commit a8eca9dcc656a405a28ffba43f3d86a1ff0eb331. Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86/mmu: Zap only the relevant pages when removing a memslot  (Sean Christopherson; 1 file, -1/+32)
Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs that actually belong to the memslot being removed. This improves performance, especially when the deleted memslot has only a few shadow entries, or even no entries. E.g. a microbenchmark to access regular memory while concurrently reading PCI ROM to trigger memslot deletion showed a 5% improvement in throughput. Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
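Hypothetical sketch of the idea (the zap helper below is made up for illustration; the actual walk goes through KVM's shadow-page bookkeeping): only the gfns backed by the memslot being removed are visited.

  LIST_HEAD(invalid_list);
  unsigned long i;

  spin_lock(&kvm->mmu_lock);
  for (i = 0; i < slot->npages; i++) {
      gfn_t gfn = slot->base_gfn + i;

      zap_shadow_pages_for_gfn(kvm, gfn, &invalid_list);  /* hypothetical */
  }
  kvm_mmu_commit_zap_page(kvm, &invalid_list);
  spin_unlock(&kvm->mmu_lock);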
2019-02-20  KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap()  (Sean Christopherson; 1 file, -6/+16)
...and into a separate helper, kvm_mmu_remote_flush_or_zap(), that does not require a vcpu so that the code can be (re)used by kvm_mmu_invalidate_zap_pages_in_memslot(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86/mmu: Move slot_level_*() helper functions up a few lines  (Sean Christopherson; 1 file, -69/+70)
...so that kvm_mmu_invalidate_zap_pages_in_memslot() can utilize the helpers in future patches. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: Move the memslot update in-progress flag to bit 63  (Sean Christopherson; 1 file, -19/+12)
...now that KVM won't explode by moving it out of bit 0. Using bit 63 eliminates the need to jump over bit 0, e.g. when calculating a new memslots generation or when propagating the memslots generation to an MMIO spte. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86: Refactor the MMIO SPTE generation handling  (Sean Christopherson; 1 file, -33/+43)
The code to propagate the memslots generation number into MMIO sptes is a bit convoluted. The "what" is relatively straightforward, e.g. the comment explaining which bits go where is quite readable, but the "how" requires a lot of staring to understand what is happening. For example, 'MMIO_GEN_LOW_SHIFT' is actually used to calculate the high bits of the spte, while 'MMIO_SPTE_GEN_LOW_SHIFT' is used to calculate the low bits.

Refactor the code to:

  - use #defines whose values align with the bits defined in the comment
  - use consistent code for both the high and low mask
  - explicitly highlight the handling of bit 0 (update in-progress flag)
  - explicitly call out that the defines are for MMIO sptes (to avoid confusion with the per-vCPU MMIO cache, which uses the full memslots generation)

In addition to making the code a little less magical, this paves the way for moving the update in-progress flag to bit 63 without having to simultaneously rewrite all of the MMIO spte code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86: Use a u64 when passing the MMIO gen around  (Sean Christopherson; 1 file, -7/+7)
KVM currently uses an 'unsigned int' for the MMIO generation number despite it being derived from the 64-bit memslots generation and being propagated to (potentially) 64-bit sptes. There is no hidden agenda behind using an 'unsigned int', it's done simply because the MMIO generation will never set bits above bit 19. Passing a u64 will allow the "update in-progress" flag to be relocated from bit 0 to bit 63 and removes the need to cast the generation back to a u64 when propagating it to a spte. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: Explicitly define the "memslot update in-progress" bit  (Sean Christopherson; 1 file, -1/+1)
KVM uses bit 0 of the memslots generation as an "update in-progress" flag, which is used by x86 to prevent caching MMIO access while the memslots are changing. Although the intended behavior is flag-like, e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid caching data from in-flux memslots, the implementation oftentimes treats the bit as part of the generation number itself, e.g. an update of the memslots increments the generation twice, once to set the flag and once to clear it. Prior to commit 4bd518f1598d ("KVM: use separate generations for each address space"), incorporating the "update in-progress" bit into the generation number largely made sense, e.g. "real" generations are even, "bogus" generations are odd, most code doesn't need to be aware of the bit, etc... Now that unique memslots generation numbers are assigned to each address space, stealthing the in-progress status into the generation number results in a wide variety of subtle code, e.g. kvm_create_vm() jumps over bit 0 when initializing the memslots generation without any hint as to why. Explicitly define the flag and convert as much code as possible (which isn't much) to actually treat it like a flag. This paves the way for eventually using a different bit for "update in-progress" so that it can be a flag in truth instead of an awkward extension to the generation number. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux  (Sean Christopherson; 1 file, -1/+6)
When installing new memslots, KVM sets bit 0 of the generation number to indicate that an update is in-progress. Until the update is complete, there are no guarantees as to whether a vCPU will see the old or the new memslots. Explicitly prevent caching MMIO accesses so as to avoid using an access cached from the old memslots after the new memslots have been installed. Note that it is unclear whether or not disabling caching during the update window is strictly necessary as there is no definitive documentation as to what ordering guarantees KVM provides with respect to updating memslots. That being said, the MMIO spte code does not allow reusing sptes created while an update is in-progress, and the associated documentation explicitly states: We do not want to use an MMIO sptes created with an odd generation number, ... If KVM is unlucky and creates an MMIO spte while the low bit is 1, the next access to the spte will always be a cache miss. At the very least, disabling the per-vCPU MMIO cache during updates will make its behavior consistent with the MMIO spte behavior and documentation. Fixes: 56f17dd3fbc4 ("kvm: x86: fix stale mmio cache bug") Cc: <stable@vger.kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
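Illustrative sketch of the guard (not the literal diff; placed at the top of the per-vCPU MMIO caching helper):

  u64 gen = kvm_memslots(vcpu->kvm)->generation;

  /* Bit 0 is the "update in-progress" flag (later moved to bit 63). */
  if (gen & 1)
      return;    /* memslots in flux: don't populate the MMIO cache */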
2019-02-20  KVM: x86/mmu: Detect MMIO generation wrap in any address space  (Sean Christopherson; 1 file, -2/+19)
The check to detect a wrap of the MMIO generation explicitly looks for a generation number of zero. Now that unique memslots generation numbers are assigned to each address space, only address space 0 will get a generation number of exactly zero when wrapping. E.g. when address space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will wrap to 0x2. Adjust the MMIO generation to strip the address space modifier prior to checking for a wrap. Fixes: 4bd518f1598d ("KVM: use separate generations for each address space") Cc: <stable@vger.kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20  KVM: Call kvm_arch_memslots_updated() before updating memslots  (Sean Christopherson; 2 files, -4/+4)
kvm_arch_memslots_updated() is at this point in time an x86-specific hook for handling MMIO generation wraparound. x86 stashes 19 bits of the memslots generation number in its MMIO sptes in order to avoid full page fault walks for repeat faults on emulated MMIO addresses. Because only 19 bits are used, wrapping the MMIO generation number is possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that the generation has changed so that it can invalidate all MMIO sptes in case the effective MMIO generation has wrapped so as to avoid using a stale spte, e.g. a (very) old spte that was created with generation==0. Given that the purpose of kvm_arch_memslots_updated() is to prevent consuming stale entries, it needs to be called before the new generation is propagated to memslots. Invalidating the MMIO sptes after updating memslots means that there is a window where a vCPU could dereference the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO spte that was created with (pre-wrap) generation==0. Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()") Cc: <stable@vger.kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>