summaryrefslogtreecommitdiffstats
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2022-06-19Merge tag 'x86-urgent-2022-06-19' of ↵Linus Torvalds4-75/+159
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: - Make RESERVE_BRK() work again with older binutils. The recent 'simplification' broke that. - Make early #VE handling increment RIP when successful. - Make the #VE code consistent vs. the RIP adjustments and add comments. - Handle load_unaligned_zeropad() across page boundaries correctly in #VE when the second page is shared. * tag 'x86-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page x86/tdx: Clarify RIP adjustments in #VE handler x86/tdx: Fix early #VE handling x86/mm: Fix RESERVE_BRK() for older binutils
2022-06-19Merge tag 'objtool-urgent-2022-06-19' of ↵Linus Torvalds2-7/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull build tooling updates from Thomas Gleixner: - Remove obsolete CONFIG_X86_SMAP reference from objtool - Fix overlapping text section failures in faddr2line for real - Remove OBJECT_FILES_NON_STANDARD usage from x86 ftrace and replace it with finegrained annotations so objtool can validate that code correctly. * tag 'objtool-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ftrace: Remove OBJECT_FILES_NON_STANDARD usage faddr2line: Fix overlapping text section failures, the sequel objtool: Fix obsolete reference to CONFIG_X86_SMAP
2022-06-17x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared pageKirill A. Shutemov1-1/+14
load_unaligned_zeropad() can lead to unwanted loads across page boundaries. The unwanted loads are typically harmless. But, they might be made to totally unrelated or even unmapped memory. load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now #VE) to recover from these unwanted loads. In TDX guests, the second page can be shared page and a VMM may configure it to trigger #VE. The kernel assumes that #VE on a shared page is an MMIO access and tries to decode instruction to handle it. In case of load_unaligned_zeropad() it may result in confusion as it is not MMIO access. Fix it by detecting split page MMIO accesses and failing them. load_unaligned_zeropad() will recover using exception fixups. The issue was discovered by analysis and reproduced artificially. It was not triggered during testing. [ dhansen: fix up changelogs and comments for grammar and clarity, plus incorporate Kirill's off-by-one fix] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220614120135.14812-4-kirill.shutemov@linux.intel.com
2022-06-17Merge tag 'pci-v5.19-fixes-2' of ↵Linus Torvalds4-17/+18
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull pci fix from Bjorn Helgaas: "Revert clipping of PCI host bridge windows to avoid E820 regions, which broke several machines by forcing unnecessary BAR reassignments (Hans de Goede)" * tag 'pci-v5.19-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: x86/PCI: Revert "x86/PCI: Clip only host bridge windows for E820 regions"
2022-06-17x86/PCI: Revert "x86/PCI: Clip only host bridge windows for E820 regions"Hans de Goede4-17/+18
This reverts commit 4c5e242d3e93. Prior to 4c5e242d3e93 ("x86/PCI: Clip only host bridge windows for E820 regions"), E820 regions did not affect PCI host bridge windows. We only looked at E820 regions and avoided them when allocating new MMIO space. If firmware PCI bridge window and BAR assignments used E820 regions, we left them alone. After 4c5e242d3e93, we removed E820 regions from the PCI host bridge windows before looking at BARs, so firmware assignments in E820 regions looked like errors, and we moved things around to fit in the space left (if any) after removing the E820 regions. This unnecessary BAR reassignment broke several machines. Guilherme reported that Steam Deck fails to boot after 4c5e242d3e93. We clipped the window that contained most 32-bit BARs: BIOS-e820: [mem 0x00000000a0000000-0x00000000a00fffff] reserved acpi PNP0A08:00: clipped [mem 0x80000000-0xf7ffffff window] to [mem 0xa0100000-0xf7ffffff window] for e820 entry [mem 0xa0000000-0xa00fffff] which forced us to reassign all those BARs, for example, this NVMe BAR: pci 0000:00:01.2: PCI bridge to [bus 01] pci 0000:00:01.2: bridge window [mem 0x80600000-0x806fffff] pci 0000:01:00.0: BAR 0: [mem 0x80600000-0x80603fff 64bit] pci 0000:00:01.2: can't claim window [mem 0x80600000-0x806fffff]: no compatible bridge window pci 0000:01:00.0: can't claim BAR 0 [mem 0x80600000-0x80603fff 64bit]: no compatible bridge window pci 0000:00:01.2: bridge window: assigned [mem 0xa0100000-0xa01fffff] pci 0000:01:00.0: BAR 0: assigned [mem 0xa0100000-0xa0103fff 64bit] All the reassignments were successful, so the devices should have been functional at the new addresses, but some were not. Andy reported a similar failure on an Intel MID platform. Benjamin reported a similar failure on a VMWare Fusion VM. Note: this is not a clean revert; this revert keeps the later change to make the clipping dependent on a new pci_use_e820 bool, moving the checking of this bool to arch_remove_reservations(). [bhelgaas: commit log, add more reporters and testers] BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216109 Reported-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reported-by: Benjamin Coddington <bcodding@redhat.com> Reported-by: Jongman Heo <jongman.heo@gmail.com> Fixes: 4c5e242d3e93 ("x86/PCI: Clip only host bridge windows for E820 regions") Link: https://lore.kernel.org/r/20220612144325.85366-1-hdegoede@redhat.com Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2022-06-17Merge tag 'hyperv-fixes-signed-20220617' of ↵Linus Torvalds3-6/+88
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Fix hv_init_clocksource annotation (Masahiro Yamada) - Two bug fixes for vmbus driver (Saurabh Sengar) - Fix SEV negotiation (Tianyu Lan) - Fix comments in code (Xiang Wang) - One minor fix to HID driver (Michael Kelley) * tag 'hyperv-fixes-signed-20220617' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/Hyper-V: Add SEV negotiate protocol support in Isolation VM Drivers: hv: vmbus: Release cpu lock in error case HID: hyperv: Correctly access fields declared as __le16 clocksource: hyper-v: unexport __init-annotated hv_init_clocksource() Drivers: hv: Fix syntax errors in comments Drivers: hv: vmbus: Don't assign VMbus channel interrupts to isolated CPUs
2022-06-15x86/Hyper-V: Add SEV negotiate protocol support in Isolation VMTianyu Lan3-6/+88
Hyper-V Isolation VM current code uses sev_es_ghcb_hv_call() to read/write MSR via GHCB page and depends on the sev code. This may cause regression when sev code changes interface design. The latest SEV-ES code requires to negotiate GHCB version before reading/writing MSR via GHCB page and sev_es_ghcb_hv_call() doesn't work for Hyper-V Isolation VM. Add Hyper-V ghcb related implementation to decouple SEV and Hyper-V code. Negotiate GHCB version in the hyperv_init() and use the version to communicate with Hyper-V in the ghcb hv call function. Fixes: 2ea29c5abbc2 ("x86/sev: Save the negotiated GHCB version") Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20220614014553.1915929-1-ltykernel@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-06-15x86/tdx: Clarify RIP adjustments in #VE handlerKirill A. Shutemov1-55/+123
After successful #VE handling, tdx_handle_virt_exception() has to move RIP to the next instruction. The handler needs to know the length of the instruction. If the #VE happened due to instruction execution, the GET_VEINFO TDX module call provides info on the instruction in R10, including its length. For #VE due to EPT violation, the info in R10 is not populand and the kernel must decode the instruction manually to find out its length. Restructure the code to make it explicit that the instruction length depends on the type of #VE. Make individual #VE handlers return the instruction length on success or -errno on failure. [ dhansen: fix up changelog and comments ] Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220614120135.14812-3-kirill.shutemov@linux.intel.com
2022-06-15x86/tdx: Fix early #VE handlingKirill A. Shutemov1-1/+5
tdx_early_handle_ve() does not increment RIP after successfully handling the exception. That leads to infinite loop of exceptions. Move RIP when exceptions are successfully handled. [ dhansen: make problem statement more clear ] Fixes: 32e72854fa5f ("x86/tdx: Port I/O: Add early boot support") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lkml.kernel.org/r/20220614120135.14812-2-kirill.shutemov@linux.intel.com
2022-06-14Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds9-127/+197
Pull kvm fixes from Paolo Bonzini: "While last week's pull request contained miscellaneous fixes for x86, this one covers other architectures, selftests changes, and a bigger series for APIC virtualization bugs that were discovered during 5.20 development. The idea is to base 5.20 development for KVM on top of this tag. ARM64: - Properly reset the SVE/SME flags on vcpu load - Fix a vgic-v2 regression regarding accessing the pending state of a HW interrupt from userspace (and make the code common with vgic-v3) - Fix access to the idreg range for protected guests - Ignore 'kvm-arm.mode=protected' when using VHE - Return an error from kvm_arch_init_vm() on allocation failure - A bunch of small cleanups (comments, annotations, indentation) RISC-V: - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry x86-64: - Fix error in page tables with MKTME enabled - Dirty page tracking performance test extended to running a nested guest - Disable APICv/AVIC in cases that it cannot implement correctly" [ This merge also fixes a misplaced end parenthesis bug introduced in commit 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") pointed out by Sean Christopherson ] Link: https://lore.kernel.org/all/20220610191813.371682-1-seanjc@google.com/ * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (34 commits) KVM: selftests: Restrict test region to 48-bit physical addresses when using nested KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2 KVM: selftests: Clean up LIBKVM files in Makefile KVM: selftests: Link selftests directly with lib object files KVM: selftests: Drop unnecessary rule for STATIC_LIBS KVM: selftests: Add a helper to check EPT/VPID capabilities KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h KVM: selftests: Refactor nested_map() to specify target level KVM: selftests: Drop stale function parameter comment for nested_map() KVM: selftests: Add option to create 2M and 1G EPT mappings KVM: selftests: Replace x86_page_size with PG_LEVEL_XX KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking KVM: x86: disable preemption while updating apicv inhibition KVM: x86: SVM: fix avic_kick_target_vcpus_fast KVM: x86: SVM: remove avic's broken code that updated APIC ID KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base KVM: x86: document AVIC/APICv inhibit reasons KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs ...
2022-06-14Merge tag 'x86-bugs-2022-06-01' of ↵Linus Torvalds8-39/+353
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MMIO stale data fixes from Thomas Gleixner: "Yet another hw vulnerability with a software mitigation: Processor MMIO Stale Data. They are a class of MMIO-related weaknesses which can expose stale data by propagating it into core fill buffers. Data which can then be leaked using the usual speculative execution methods. Mitigations include this set along with microcode updates and are similar to MDS and TAA vulnerabilities: VERW now clears those buffers too" * tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/mmio: Print SMT warning KVM: x86/speculation: Disable Fill buffer clear within guests x86/speculation/mmio: Reuse SRBDS mitigation for SBDS x86/speculation/srbds: Update SRBDS mitigation selection x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data x86/speculation/mmio: Enable CPU Fill buffer clearing on idle x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data x86/speculation: Add a common function for MD_CLEAR mitigation update x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug Documentation: Add documentation for Processor MMIO Stale Data
2022-06-13x86/mm: Fix RESERVE_BRK() for older binutilsJosh Poimboeuf3-24/+23
With binutils 2.26, RESERVE_BRK() causes a build failure: /tmp/ccnGOKZ5.s: Assembler messages: /tmp/ccnGOKZ5.s:98: Error: missing ')' /tmp/ccnGOKZ5.s:98: Error: missing ')' /tmp/ccnGOKZ5.s:98: Error: missing ')' /tmp/ccnGOKZ5.s:98: Error: junk at end of line, first unrecognized character is `U' The problem is this line: RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE) Specifically, the INIT_PGT_BUF_SIZE macro which (via PAGE_SIZE's use _AC()) has a "1UL", which makes older versions of the assembler unhappy. Unfortunately the _AC() macro doesn't work for inline asm. Inline asm was only needed here to convince the toolchain to add the STT_NOBITS flag. However, if a C variable is placed in a section whose name is prefixed with ".bss", GCC and Clang automatically set STT_NOBITS. In fact, ".bss..page_aligned" already relies on this trick. So fix the build failure (and simplify the macro) by allocating the variable in C. Also, add NOLOAD to the ".brk" output section clause in the linker script. This is a failsafe in case the ".bss" prefix magic trick ever stops working somehow. If there's a section type mismatch, the GNU linker will force the ".brk" output section to be STT_NOBITS. The LLVM linker will fail with a "section type mismatch" error. Note this also changes the name of the variable from .brk.##name to __brk_##name. The variable names aren't actually used anywhere, so it's harmless. Fixes: a1e2c031ec39 ("x86/mm: Simplify RESERVE_BRK()") Reported-by: Joe Damato <jdamato@fastly.com> Reported-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Joe Damato <jdamato@fastly.com> Link: https://lore.kernel.org/r/22d07a44c80d8e8e1e82b9a806ddc8c6bbb2606e.1654759036.git.jpoimboe@kernel.org
2022-06-10Merge tag 'for-linus-5.19a-rc2-tag' of ↵Linus Torvalds5-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - a small cleanup removing "export" of an __init function - a small series adding a new infrastructure for platform flags - a series adding generic virtio support for Xen guests (frontend side) * tag 'for-linus-5.19a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: unexport __init-annotated xen_xlate_map_ballooned_pages() arm/xen: Assign xen-grant DMA ops for xen-grant DMA devices xen/grant-dma-ops: Retrieve the ID of backend's domain for DT devices xen/grant-dma-iommu: Introduce stub IOMMU driver dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops xen/virtio: Enable restricted memory access using Xen grant mappings xen/grant-dma-ops: Add option to restrict memory access under Xen xen/grants: support allocating consecutive grants arm/xen: Introduce xen_setup_dma_ops() virtio: replace arch_has_restricted_virtio_memory_access() kernel: add platform_has() infrastructure
2022-06-09KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSEPaolo Bonzini2-20/+23
Commit 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") introduced passthrough support for nested pause filtering, (when the host doesn't intercept PAUSE) (either disabled with kvm module param, or disabled with '-overcommit cpu-pm=on') Before this commit, L1 KVM didn't intercept PAUSE at all; afterwards, the feature was exposed as supported by KVM cpuid unconditionally, thus if L1 could try to use it even when the L0 KVM can't really support it. In this case the fallback caused KVM to intercept each PAUSE instruction; in some cases, such intercept can slow down the nested guest so much that it can fail to boot. Instead, before the problematic commit KVM was already setting both thresholds to 0 in vmcb02, but after the first userspace VM exit shrink_ple_window was called and would reset the pause_filter_count to the default value. To fix this, change the fallback strategy - ignore the guest threshold values, but use/update the host threshold values unless the guest specifically requests disabling PAUSE filtering (either simple or advanced). Also fix a minor bug: on nested VM exit, when PAUSE filter counter were copied back to vmcb01, a dirty bit was not set. Thanks a lot to Suravee Suthikulpanit for debugging this! Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Reported-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220518072709.730031-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/putMaxim Levitsky3-27/+8
Now that these functions are always called with preemption disabled, remove the preempt_disable()/preempt_enable() pair inside them. No functional change intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606180829.102503-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09KVM: x86: disable preemption while updating apicv inhibitionMaxim Levitsky1-0/+2
Currently nothing prevents preemption in kvm_vcpu_update_apicv. On SVM, If the preemption happens after we update the vcpu->arch.apicv_active, the preemption itself will 'update' the inhibition since the AVIC will be first disabled on vCPU unload and then enabled, when the current task is loaded again. Then we will try to update it again, which will lead to a warning in __avic_vcpu_load, that the AVIC is already enabled. Fix this by disabling preemption in this code. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606180829.102503-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09KVM: x86: SVM: fix avic_kick_target_vcpus_fastMaxim Levitsky1-36/+69
There are two issues in avic_kick_target_vcpus_fast 1. It is legal to issue an IPI request with APIC_DEST_NOSHORT and a physical destination of 0xFF (or 0xFFFFFFFF in case of x2apic), which must be treated as a broadcast destination. Fix this by explicitly checking for it. Also don’t use ‘index’ in this case as it gives no new information. 2. It is legal to issue a logical IPI request to more than one target. Index field only provides index in physical id table of first such target and therefore can't be used before we are sure that only a single target was addressed. Instead, parse the ICRL/ICRH, double check that a unicast interrupt was requested, and use that info to figure out the physical id of the target vCPU. At that point there is no need to use the index field as well. In addition to fixing the above issues, also skip the call to kvm_apic_match_dest. It is possible to do this now, because now as long as AVIC is not inhibited, it is guaranteed that none of the vCPUs changed their apic id from its default value. This fixes boot of windows guest with AVIC enabled because it uses IPI with 0xFF destination and no destination shorthand. Fixes: 7223fd2d5338 ("KVM: SVM: Use target APIC ID to complete AVIC IRQs when possible") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606180829.102503-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09KVM: x86: SVM: remove avic's broken code that updated APIC IDMaxim Levitsky1-35/+0
AVIC is now inhibited if the guest changes the apic id, and therefore this code is no longer needed. There are several ways this code was broken, including: 1. a vCPU was only allowed to change its apic id to an apic id of an existing vCPU. 2. After such change, the vCPU whose apic id entry was overwritten, could not correctly change its own apic id, because its own entry is already overwritten. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606180829.102503-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC baseMaxim Levitsky4-6/+37
Neither of these settings should be changed by the guest and it is a burden to support it in the acceleration code, so just inhibit this code instead. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606180829.102503-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09KVM: x86: document AVIC/APICv inhibit reasonsMaxim Levitsky1-2/+57
These days there are too many AVIC/APICv inhibit reasons, and it doesn't hurt to have some documentation for them. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606180829.102503-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRsYuan Yao1-1/+1
Assign shadow_me_value, not shadow_me_mask, to PAE root entries, a.k.a. shadow PDPTRs, when host memory encryption is supported. The "mask" is the set of all possible memory encryption bits, e.g. MKTME KeyIDs, whereas "value" holds the actual value that needs to be stuffed into host page tables. Using shadow_me_mask results in a failed VM-Entry due to setting reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to physical addresses with non-zero MKTME bits sending to_shadow_page() into the weeds: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state. BUG: unable to handle page fault for address: ffd43f00063049e8 PGD 86dfd8067 P4D 0 Oops: 0000 [#1] PREEMPT SMP RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm] kvm_mmu_free_roots+0xd1/0x200 [kvm] __kvm_mmu_unload+0x29/0x70 [kvm] kvm_mmu_unload+0x13/0x20 [kvm] kvm_arch_destroy_vm+0x8a/0x190 [kvm] kvm_put_kvm+0x197/0x2d0 [kvm] kvm_vm_release+0x21/0x30 [kvm] __fput+0x8e/0x260 ____fput+0xe/0x10 task_work_run+0x6f/0xb0 do_exit+0x327/0xa90 do_group_exit+0x35/0xa0 get_signal+0x911/0x930 arch_do_signal_or_restart+0x37/0x720 exit_to_user_mode_prepare+0xb2/0x140 syscall_exit_to_user_mode+0x16/0x30 do_syscall_64+0x4e/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: e54f1ff244ac ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask") Signed-off-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20220608012015.19566-1-yuan.yao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09Merge tag 'kvm-riscv-fixes-5.19-1' of https://github.com/kvm-riscv/linux ↵Paolo Bonzini263-3668/+8570
into HEAD KVM/riscv fixes for 5.19, take #1 - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry
2022-06-08Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds12-37/+113
Pull KVM fixes from Paolo Bonzini: - syzkaller NULL pointer dereference - TDP MMU performance issue with disabling dirty logging - 5.14 regression with SVM TSC scaling - indefinite stall on applying live patches - unstable selftest - memory leak from wrong copy-and-paste - missed PV TLB flush when racing with emulation * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: do not report a vCPU as preempted outside instruction boundaries KVM: x86: do not set st->preempted when going back to user space KVM: SVM: fix tsc scaling cache logic KVM: selftests: Make hyperv_clock selftest more stable KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm() KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots() entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set KVM: Don't null dereference ops->destroy
2022-06-08KVM: x86: do not report a vCPU as preempted outside instruction boundariesPaolo Bonzini4-0/+28
If a vCPU is outside guest mode and is scheduled out, it might be in the process of making a memory access. A problem occurs if another vCPU uses the PV TLB flush feature during the period when the vCPU is scheduled out, and a virtual address has already been translated but has not yet been accessed, because this is equivalent to using a stale TLB entry. To avoid this, only report a vCPU as preempted if sure that the guest is at an instruction boundary. A rescheduling request will be delivered to the host physical CPU as an external interrupt, so for simplicity consider any vmexit *not* instruction boundary except for external interrupts. It would in principle be okay to report the vCPU as preempted also if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the vmentry/vmexit overhead unnecessarily, and optimistic spinning is also unlikely to succeed. However, leave it for later because right now kvm_vcpu_check_block() is doing memory accesses. Even though the TLB flush issue only applies to virtual memory address, it's very much preferrable to be conservative. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08KVM: x86: do not set st->preempted when going back to user spacePaolo Bonzini2-14/+18
Similar to the Xen path, only change the vCPU's reported state if the vCPU was actually preempted. The reason for KVM's behavior is that for example optimistic spinning might not be a good idea if the guest is doing repeated exits to userspace; however, it is confusing and unlikely to make a difference, because well-tuned guests will hardly ever exit KVM_RUN in the first place. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07KVM: SVM: fix tsc scaling cache logicMaxim Levitsky3-15/+23
SVM uses a per-cpu variable to cache the current value of the tsc scaling multiplier msr on each cpu. Commit 1ab9287add5e2 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier") broke this caching logic. Refactor the code so that all TSC scaling multiplier writes go through a single function which checks and updates the cache. This fixes the following scenario: 1. A CPU runs a guest with some tsc scaling ratio. 2. New guest with different tsc scaling ratio starts on this CPU and terminates almost immediately. This ensures that the short running guest had set the tsc scaling ratio just once when it was set via KVM_SET_TSC_KHZ. Due to the bug, the per-cpu cache is not updated. 3. The original guest continues to run, it doesn't restore the msr value back to its own value, because the cache matches, and thus continues to run with a wrong tsc scaling ratio. Fixes: 1ab9287add5e2 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty loggingBen Gardon3-6/+42
Currently disabling dirty logging with the TDP MMU is extremely slow. On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with the shadow MMU. When disabling dirty logging, zap non-leaf parent entries to allow replacement with huge pages instead of recursing and zapping all of the child, leaf entries. This reduces the number of TLB flushes required. and reduces the disable dirty log time with the TDP MMU to ~3 seconds. Opportunistically add a WARN() to catch GFNs that are mapped at a higher level than their max level. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20220525230904.1584480-1-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()Jan Beulich1-1/+1
As noted (and fixed) a couple of times in the past, "=@cc<cond>" outputs and clobbering of "cc" don't work well together. The compiler appears to mean to reject such, but doesn't - in its upstream form - quite manage to yet for "cc". Furthermore two similar macros don't clobber "cc", and clobbering "cc" is pointless in asm()-s for x86 anyway - the compiler always assumes status flags to be clobbered there. Fixes: 989b5db215a2 ("x86/uaccess: Implement macros for CMPXCHG on user addresses") Signed-off-by: Jan Beulich <jbeulich@suse.com> Message-Id: <485c0c0b-a3a7-0b7c-5264-7d00c01de032@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()Shaoqin Huang1-1/+1
When freeing obsolete previous roots, check prev_roots as intended, not the current root. Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com> Fixes: 527d5cd7eece ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped") Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-06x86/ftrace: Remove OBJECT_FILES_NON_STANDARD usageJosh Poimboeuf2-7/+8
The file-wide OBJECT_FILES_NON_STANDARD annotation is used with CONFIG_FRAME_POINTER to tell objtool to skip the entire file when frame pointers are enabled. However that annotation is now deprecated because it doesn't work with IBT, where objtool runs on vmlinux.o instead of individual translation units. Instead, use more fine-grained function-specific annotations: - The 'save_mcount_regs' macro does funny things with the frame pointer. Use STACK_FRAME_NON_STANDARD_FP to tell objtool to ignore the functions using it. - The return_to_handler() "function" isn't actually a callable function. Instead of being called, it's returned to. The real return address isn't on the stack, so unwinding is already doomed no matter which unwinder is used. So just remove the STT_FUNC annotation, telling objtool to ignore it. That also removes the implicit ANNOTATE_NOENDBR, which now needs to be made explicit. Fixes the following warning: vmlinux.o: warning: objtool: __fentry__+0x16: return with modified stack frame Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/b7a7a42fe306aca37826043dac89e113a1acdbac.1654268610.git.jpoimboe@kernel.org
2022-06-06xen/virtio: Enable restricted memory access using Xen grant mappingsJuergen Gross2-0/+4
In order to support virtio in Xen guests add a config option XEN_VIRTIO enabling the user to specify whether in all Xen guests virtio should be able to access memory via Xen grant mappings only on the host side. Also set PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS feature from the guest initialization code on Arm and x86 if CONFIG_XEN_VIRTIO is enabled. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/1654197833-25362-5-git-send-email-olekstysh@gmail.com Signed-off-by: Juergen Gross <jgross@suse.com>
2022-06-06virtio: replace arch_has_restricted_virtio_memory_access()Juergen Gross3-8/+4
Instead of using arch_has_restricted_virtio_memory_access() together with CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS, replace those with platform_has() and a new platform feature PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 only Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Borislav Petkov <bp@suse.de>
2022-06-05Merge tag 'mm-hotfixes-stable-2022-06-05' of ↵Linus Torvalds1-3/+9
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm hotfixes from Andrew Morton: "Fixups for various recently-added and longer-term issues and a few minor tweaks: - fixes for material merged during this merge window - cc:stable fixes for more longstanding issues - minor mailmap and MAINTAINERS updates" * tag 'mm-hotfixes-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery x86/kexec: fix memory leak of elf header buffer mm/memremap: fix missing call to untrack_pfn() in pagemap_range() mm: page_isolation: use compound_nr() correctly in isolate_single_pageblock() mm: hugetlb_vmemmap: fix CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON MAINTAINERS: add maintainer information for z3fold mailmap: update Josh Poimboeuf's email
2022-06-05Merge tag 'x86-urgent-2022-06-05' of ↵Linus Torvalds3-6/+115
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SGX fix from Thomas Gleixner: "A single fix for x86/SGX to prevent that memory which is allocated for an SGX enclave is accounted to the wrong memory control group" * tag 'x86-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sgx: Set active memcg prior to shmem allocation
2022-06-05Merge tag 'x86-mm-2022-06-05' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm cleanup from Thomas Gleixner: "Use PAGE_ALIGNED() instead of open coding it in the x86/mm code" * tag 'x86-mm-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Use PAGE_ALIGNED(x) instead of IS_ALIGNED(x, PAGE_SIZE)
2022-06-05Merge tag 'x86-microcode-2022-06-05' of ↵Linus Torvalds3-112/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode updates from Thomas Gleixner: - Disable late microcode loading by default. Unless the HW people get their act together and provide a required minimum version in the microcode header for making a halfways informed decision its just lottery and broken. - Warn and taint the kernel when microcode is loaded late - Remove the old unused microcode loader interface - Remove a redundant perf callback from the microcode loader * tag 'x86-microcode-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode: Remove unnecessary perf callback x86/microcode: Taint and warn on late loading x86/microcode: Default-disable late loading x86/microcode: Rip out the OLD_INTERFACE
2022-06-05Merge tag 'x86-cleanups-2022-06-05' of ↵Linus Torvalds5-71/+66
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Thomas Gleixner: "A set of small x86 cleanups: - Remove unused headers in the IDT code - Kconfig indendation and comment fixes - Fix all 'the the' typos in one go instead of waiting for bots to fix one at a time" * tag 'x86-cleanups-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Fix all occurences of the "the the" typo x86/idt: Remove unused headers x86/Kconfig: Fix indentation of arch/x86/Kconfig.debug x86/Kconfig: Fix indentation and add endif comments to arch/x86/Kconfig
2022-06-05Merge tag 'x86-boot-2022-06-05' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot update from Thomas Gleixner: "Use strlcpy() instead of strscpy() in arch_setup()" * tag 'x86-boot-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/setup: Use strscpy() to replace deprecated strlcpy()
2022-06-05Merge tag 'perf-urgent-2022-06-05' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: - Make the ICL event constraints match reality - Remove a unused local variable * tag 'perf-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Remove unused local variable perf/x86/intel: Fix event constraints for ICL
2022-06-05Merge tag 'perf-core-2022-06-05' of ↵Linus Torvalds1-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixlet from Thomas Gleixner: "Trivial indentation fix in Kconfig" * tag 'perf-core-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/Kconfig: Fix indentation in the Kconfig file
2022-06-05Merge tag 'objtool-urgent-2022-06-05' of ↵Linus Torvalds4-5/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Thomas Gleixner: - Handle __ubsan_handle_builtin_unreachable() correctly and treat it as noreturn - Allow architectures to select uaccess validation - Use the non-instrumented bit test for test_cpu_has() to prevent escape from non-instrumentable regions - Use arch_ prefixed atomics for JUMP_LABEL=n builds to prevent escape from non-instrumentable regions - Mark a few tiny inline as __always_inline to prevent GCC from bringing them out of line and instrumenting them - Mark the empty stub context_tracking_enabled() as always inline as GCC brings them out of line and instruments the empty shell - Annotate ex_handler_msr_mce() as dead end * tag 'objtool-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/extable: Annotate ex_handler_msr_mce() as a dead end context_tracking: Always inline empty stubs x86: Always inline on_thread_stack() and current_top_of_stack() jump_label,noinstr: Avoid instrumentation for JUMP_LABEL=n builds x86/cpu: Elide KCSAN for cpu_has() and friends objtool: Mark __ubsan_handle_builtin_unreachable() as noreturn objtool: Add CONFIG_HAVE_UACCESS_VALIDATION
2022-06-04Merge tag 'bitmap-for-5.19-rc1' of https://github.com/norov/linuxLinus Torvalds1-6/+6
Pull bitmap updates from Yury Norov: - bitmap: optimize bitmap_weight() usage, from me - lib/bitmap.c make bitmap_print_bitmask_to_buf parseable, from Mauro Carvalho Chehab - include/linux/find: Fix documentation, from Anna-Maria Behnsen - bitmap: fix conversion from/to fix-sized arrays, from me - bitmap: Fix return values to be unsigned, from Kees Cook It has been in linux-next for at least a week with no problems. * tag 'bitmap-for-5.19-rc1' of https://github.com/norov/linux: (31 commits) nodemask: Fix return values to be unsigned bitmap: Fix return values to be unsigned KVM: x86: hyper-v: replace bitmap_weight() with hweight64() KVM: x86: hyper-v: fix type of valid_bank_mask ia64: cleanup remove_siblinginfo() drm/amd/pm: use bitmap_{from,to}_arr32 where appropriate KVM: s390: replace bitmap_copy with bitmap_{from,to}_arr64 where appropriate lib/bitmap: add test for bitmap_{from,to}_arr64 lib: add bitmap_{from,to}_arr64 lib/bitmap: extend comment for bitmap_(from,to)_arr32() include/linux/find: Fix documentation lib/bitmap.c make bitmap_print_bitmask_to_buf parseable MAINTAINERS: add cpumask and nodemask files to BITMAP_API arch/x86: replace nodes_weight with nodes_empty where appropriate mm/vmstat: replace cpumask_weight with cpumask_empty where appropriate clocksource: replace cpumask_weight with cpumask_empty in clocksource.c genirq/affinity: replace cpumask_weight with cpumask_empty where appropriate irq: mips: replace cpumask_weight with cpumask_empty where appropriate drm/i915/pmu: replace cpumask_weight with cpumask_empty where appropriate arch/x86: replace cpumask_weight with cpumask_empty where appropriate ...
2022-06-04Merge tag 'for-linus-5.19-rc1b-tag' of ↵Linus Torvalds1-3/+0
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull more xen updates from Juergen Gross: "Two cleanup patches for Xen related code and (more important) an update of MAINTAINERS for Xen, as Boris Ostrovsky decided to step down" * tag 'for-linus-5.19-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: replace xen_remap() with memremap() MAINTAINERS: Update Xen maintainership xen: switch gnttab_end_foreign_access() to take a struct page pointer
2022-06-03Merge tag 'ptrace_stop-cleanup-for-v5.19' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ptrace_stop cleanups from Eric Biederman: "While looking at the ptrace problems with PREEMPT_RT and the problems Peter Zijlstra was encountering with ptrace in his freezer rewrite I identified some cleanups to ptrace_stop that make sense on their own and move make resolving the other problems much simpler. The biggest issue is the habit of the ptrace code to change task->__state from the tracer to suppress TASK_WAKEKILL from waking up the tracee. No other code in the kernel does that and it is straight forward to update signal_wake_up and friends to make that unnecessary. Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and then it stores them by calling "wake_up_state(t, TASK_FROZEN)" relying on the fact that all stopped states except the special stop states can tolerate spurious wake up and recover their state. The state of stopped and traced tasked is changed to be stored in task->jobctl as well as in task->__state. This makes it possible for the freezer to recover tasks in these special states, as well as serving as a general cleanup. With a little more work in that direction I believe TASK_STOPPED can learn to tolerate spurious wake ups and become an ordinary stop state. The TASK_TRACED state has to remain a special state as the registers for a process are only reliably available when the process is stopped in the scheduler. Fundamentally ptrace needs acess to the saved register values of a task. There are bunch of semi-random ptrace related cleanups that were found while looking at these issues. One cleanup that deserves to be called out is from commit 57b6de08b5f6 ("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs"). This makes a change that is technically user space visible, in the handling of what happens to a tracee when a tracer dies unexpectedly. According to our testing and our understanding of userspace nothing cares that spurious SIGTRAPs can be generated in that case" * tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state ptrace: Always take siglock in ptrace_resume ptrace: Don't change __state ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs ptrace: Document that wait_task_inactive can't fail ptrace: Reimplement PTRACE_KILL by always sending SIGKILL signal: Use lockdep_assert_held instead of assert_spin_locked ptrace: Remove arch_ptrace_attach ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP signal: Replace __group_send_sig_info with send_signal_locked signal: Rename send_signal send_signal_locked
2022-06-03Merge tag 'kthread-cleanups-for-v5.19' of ↵Linus Torvalds4-15/+17
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull kthread updates from Eric Biederman: "This updates init and user mode helper tasks to be ordinary user mode tasks. Commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") caused init and the user mode helper threads that call kernel_execve to have struct kthread allocated for them. This struct kthread going away during execve in turned made a use after free of struct kthread possible. Here, commit 343f4c49f243 ("kthread: Don't allocate kthread_struct for init and umh") is enough to fix the use after free and is simple enough to be backportable. The rest of the changes pass struct kernel_clone_args to clean things up and cause the code to make sense. In making init and the user mode helpers tasks purely user mode tasks I ran into two complications. The function task_tick_numa was detecting tasks without an mm by testing for the presence of PF_KTHREAD. The initramfs code in populate_initrd_image was using flush_delayed_fput to ensuere the closing of all it's file descriptors was complete, and flush_delayed_fput does not work in a userspace thread. I have looked and looked and more complications and in my code review I have not found any, and neither has anyone else with the code sitting in linux-next" * tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: sched: Update task_tick_numa to ignore tasks without an mm fork: Stop allowing kthreads to call execve fork: Explicitly set PF_KTHREAD init: Deal with the init process being a user mode process fork: Generalize PF_IO_WORKER handling fork: Explicity test for idle tasks in copy_thread fork: Pass struct kernel_clone_args into copy_thread kthread: Don't allocate kthread_struct for init and umh
2022-06-03Merge tag 'for-linus-5.19-rc1' of ↵Linus Torvalds1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Various cleanups and fixes: xterm, serial line, time travel - Set ARCH_HAS_GCOV_PROFILE_ALL * tag 'for-linus-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: Fix out-of-bounds read in LDT setup um: chan_user: Fix winch_tramp() return value um: virtio_uml: Fix broken device handling in time-travel um: line: Use separate IRQs per line um: Enable ARCH_HAS_GCOV_PROFILE_ALL um: Use asm-generic/dma-mapping.h um: daemon: Make default socket configurable um: xterm: Make default terminal emulator configurable
2022-06-03Merge tag 'efi-next-for-v5.19-2' of ↵Linus Torvalds2-1/+10
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull more EFI updates from Ard Biesheuvel: "Follow-up tweaks for EFI changes - they mostly address issues introduced this merge window, except for Heinrich's patch: - fix new DXE service invocations for mixed mode - use correct Kconfig symbol when setting PE header flag - clean up the drivers/firmware/efi Kconfig dependencies so that features that depend on CONFIG_EFI are hidden from the UI when the symbol is not enabled. Also included is a RISC-V bugfix from Heinrich to avoid read-write mappings of read-only firmware regions in the EFI page tables" * tag 'efi-next-for-v5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi: clean up Kconfig dependencies on CONFIG_EFI efi/x86: libstub: Make DXE calls mixed mode safe efi: x86: Fix config name for setting the NX-compatibility flag in the PE header riscv: read-only pages should not be writable
2022-06-03KVM: x86: hyper-v: replace bitmap_weight() with hweight64()Yury Norov1-2/+2
kvm_hv_flush_tlb() applies bitmap API to a u64 variable valid_bank_mask. Since valid_bank_mask has a fixed size, we can use hweight64() and avoid excessive bloating. CC: Borislav Petkov <bp@alien8.de> CC: Dave Hansen <dave.hansen@linux.intel.com> CC: H. Peter Anvin <hpa@zytor.com> CC: Ingo Molnar <mingo@redhat.com> CC: Jim Mattson <jmattson@google.com> CC: Joerg Roedel <joro@8bytes.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Sean Christopherson <seanjc@google.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: Wanpeng Li <wanpengli@tencent.com> CC: kvm@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: x86@kernel.org Acked-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-06-03KVM: x86: hyper-v: fix type of valid_bank_maskYury Norov1-2/+2
In kvm_hv_flush_tlb(), valid_bank_mask is declared as unsigned long, but is used as u64, which is wrong for i386, and has been spotted by LKP after applying "KVM: x86: hyper-v: replace bitmap_weight() with hweight64()" https://lore.kernel.org/lkml/20220510154750.212913-12-yury.norov@gmail.com/ But it's wrong even without that patch because now bitmap_weight() dereferences a word after valid_bank_mask on i386. >> include/asm-generic/bitops/const_hweight.h:21:76: warning: right shift count >= width of type +[-Wshift-count-overflow] 21 | #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32)) | ^~ include/asm-generic/bitops/const_hweight.h:10:16: note: in definition of macro '__const_hweight8' 10 | ((!!((w) & (1ULL << 0))) + \ | ^ include/asm-generic/bitops/const_hweight.h:20:31: note: in expansion of macro '__const_hweight16' 20 | #define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16)) | ^~~~~~~~~~~~~~~~~ include/asm-generic/bitops/const_hweight.h:21:54: note: in expansion of macro '__const_hweight32' 21 | #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32)) | ^~~~~~~~~~~~~~~~~ include/asm-generic/bitops/const_hweight.h:29:49: note: in expansion of macro '__const_hweight64' 29 | #define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w)) | ^~~~~~~~~~~~~~~~~ arch/x86/kvm/hyperv.c:1983:36: note: in expansion of macro 'hweight64' 1983 | if (hc->var_cnt != hweight64(valid_bank_mask)) | ^~~~~~~~~ CC: Borislav Petkov <bp@alien8.de> CC: Dave Hansen <dave.hansen@linux.intel.com> CC: H. Peter Anvin <hpa@zytor.com> CC: Ingo Molnar <mingo@redhat.com> CC: Jim Mattson <jmattson@google.com> CC: Joerg Roedel <joro@8bytes.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Sean Christopherson <seanjc@google.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: Wanpeng Li <wanpengli@tencent.com> CC: kvm@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: x86@kernel.org Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Message-Id: <20220519171504.1238724-1-yury.norov@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-03arch/x86: replace nodes_weight with nodes_empty where appropriateYury Norov2-3/+3
mm code calls nodes_weight() to check if any bit of a given nodemask is set. We can do it more efficiently with nodes_empty() because nodes_empty() stops traversing the nodemask as soon as it finds first set bit, while nodes_weight() counts all bits unconditionally. Signed-off-by: Yury Norov <yury.norov@gmail.com>