summaryrefslogtreecommitdiffstats
path: root/arch/x86/mm
AgeCommit message (Collapse)AuthorFilesLines
2021-05-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+8
Merge more updates from Andrew Morton: "The remainder of the main mm/ queue. 143 patches. Subsystems affected by this patch series (all mm): pagecache, hugetlb, userfaultfd, vmscan, compaction, migration, cma, ksm, vmstat, mmap, kconfig, util, memory-hotplug, zswap, zsmalloc, highmem, cleanups, and kfence" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (143 commits) kfence: use power-efficient work queue to run delayed work kfence: maximize allocation wait timeout duration kfence: await for allocation using wait_event kfence: zero guard page after out-of-bounds access mm/process_vm_access.c: remove duplicate include mm/mempool: minor coding style tweaks mm/highmem.c: fix coding style issue btrfs: use memzero_page() instead of open coded kmap pattern iov_iter: lift memzero_page() to highmem.h mm/zsmalloc: use BUG_ON instead of if condition followed by BUG. mm/zswap.c: switch from strlcpy to strscpy arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE mm,memory_hotplug: add kernel boot option to enable memmap_on_memory acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported mm,memory_hotplug: allocate memmap from the added memory range mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count() mm,memory_hotplug: relax fully spanned sections check drivers/base/memory: introduce memory_block_{online,offline} mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove ...
2021-05-05x86/mm: track linear mapping split eventsSaravanan D1-0/+8
To help with debugging the sluggishness caused by TLB miss/reload, we introduce monotonic hugepage [direct mapped] split event counts since system state: SYSTEM_RUNNING to be displayed as part of /proc/vmstat in x86 servers The lifetime split event information will be displayed at the bottom of /proc/vmstat .... swap_ra 0 swap_ra_hit 0 direct_map_level2_splits 94 direct_map_level3_splits 4 nr_unstable 0 .... One of the many lasting sources of direct hugepage splits is kernel tracing (kprobes, tracepoints). Note that the kernel's code segment [512 MB] points to the same physical addresses that have been already mapped in the kernel's direct mapping range. Source : Documentation/x86/x86_64/mm.rst When we enable kernel tracing, the kernel has to modify attributes/permissions of the text segment hugepages that are direct mapped causing them to split. Kernel's direct mapped hugepages do not coalesce back after split and remain in place for the remainder of the lifetime. An instance of direct page splits when we turn on dynamic kernel tracing .... cat /proc/vmstat | grep -i direct_map_level direct_map_level2_splits 784 direct_map_level3_splits 12 bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }' cat /proc/vmstat | grep -i direct_map_level direct_map_level2_splits 789 direct_map_level3_splits 12 .... Link: https://lkml.kernel.org/r/20210218235744.1040634-1-saravanand@fb.com Signed-off-by: Saravanan D <saravanand@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-7/+4
Pull kvm updates from Paolo Bonzini: "This is a large update by KVM standards, including AMD PSP (Platform Security Processor, aka "AMD Secure Technology") and ARM CoreSight (debug and trace) changes. ARM: - CoreSight: Add support for ETE and TRBE - Stage-2 isolation for the host kernel when running in protected mode - Guest SVE support when running in nVHE mode - Force W^X hypervisor mappings in nVHE mode - ITS save/restore for guests using direct injection with GICv4.1 - nVHE panics now produce readable backtraces - Guest support for PTP using the ptp_kvm driver - Performance improvements in the S2 fault handler x86: - AMD PSP driver changes - Optimizations and cleanup of nested SVM code - AMD: Support for virtual SPEC_CTRL - Optimizations of the new MMU code: fast invalidation, zap under read lock, enable/disably dirty page logging under read lock - /dev/kvm API for AMD SEV live migration (guest API coming soon) - support SEV virtual machines sharing the same encryption context - support SGX in virtual machines - add a few more statistics - improved directed yield heuristics - Lots and lots of cleanups Generic: - Rework of MMU notifier interface, simplifying and optimizing the architecture-specific code - a handful of "Get rid of oprofile leftovers" patches - Some selftests improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits) KVM: selftests: Speed up set_memory_region_test selftests: kvm: Fix the check of return value KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt() KVM: SVM: Skip SEV cache flush if no ASIDs have been used KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids() KVM: SVM: Drop redundant svm_sev_enabled() helper KVM: SVM: Move SEV VMCB tracking allocation to sev.c KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup() KVM: SVM: Unconditionally invoke sev_hardware_teardown() KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported) KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features KVM: SVM: Move SEV module params/variables to sev.c KVM: SVM: Disable SEV/SEV-ES if NPT is disabled KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails KVM: SVM: Zero out the VMCB array used to track SEV ASID association x86/sev: Drop redundant and potentially misleading 'sev_enabled' KVM: x86: Move reverse CPUID helpers to separate header file KVM: x86: Rename GPR accessors to make mode-aware variants the defaults ...
2021-04-30mm: move mem_init_print_info() into mm_init()Kefeng Wang2-4/+0
mem_init_print_info() is called in mem_init() on each architecture, and pass NULL argument, so using void argument and move it into mm_init(). Link: https://lkml.kernel.org/r/20210317015210.33641-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> [powerpc] Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Anatoly Pugachev <matorola@gmail.com> [sparc64] Acked-by: Russell King <rmk+kernel@armlinux.org.uk> [arm] Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Guo Ren <guoren@kernel.org> Cc: Yoshinori Sato <ysato@users.osdn.me> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Peter Zijlstra" <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30x86: inline huge vmap supported functionsNicholas Piggin2-34/+0
This allows unsupported levels to be constant folded away, and so p4d_free_pud_page can be removed because it's no longer linked to. Link: https://lkml.kernel.org/r/20210317062402.533919-10-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30mm: HUGE_VMAP arch support cleanupNicholas Piggin1-5/+7
This changes the awkward approach where architectures provide init functions to determine which levels they can provide large mappings for, to one where the arch is queried for each call. This removes code and indirection, and allows constant-folding of dead code for unsupported levels. This also adds a prot argument to the arch query. This is unused currently but could help with some architectures (e.g., some powerpc processors can't map uncacheable memory with large pages). Link: https://lkml.kernel.org/r/20210317062402.533919-7-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Ding Tianhong <dingtianhong@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30x86/vmemmap: optimize for consecutive sections in partial populated PMDsOscar Salvador1-6/+61
We can optimize in the case we are adding consecutive sections, so no memset(PAGE_UNUSED) is needed. In that case, let us keep track where the unused range of the previous memory range begins, so we can compare it with start of the range to be added. If they are equal, we know sections are added consecutively. For that purpose, let us introduce 'unused_pmd_start', which always holds the beginning of the unused memory range. In the case a section does not contiguously follow the previous one, we know we can memset [unused_pmd_start, PMD_BOUNDARY) with PAGE_UNUSE. This patch is based on a similar patch by David Hildenbrand: https://lore.kernel.org/linux-mm/20200722094558.9828-10-david@redhat.com/ Link: https://lkml.kernel.org/r/20210309214050.4674-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30x86/vmemmap: handle unpopulated sub-pmd rangesOscar Salvador1-13/+55
When sizeof(struct page) is not a power of 2, sections do not span a PMD anymore and so when populating them some parts of the PMD will remain unused. Because of this, PMDs will be left behind when depopulating sections since remove_pmd_table() thinks that those unused parts are still in use. Fix this by marking the unused parts with PAGE_UNUSED, so memchr_inv() will do the right thing and will let us free the PMD when the last user of it is gone. This patch is based on a similar patch by David Hildenbrand: https://lore.kernel.org/linux-mm/20200722094558.9828-9-david@redhat.com/ [osalvador@suse.de: go back to the ifdef version] Link: https://lkml.kernel.org/r/YGy++mSft7K4u+88@localhost.localdomain Link: https://lkml.kernel.org/r/20210309214050.4674-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30x86/vmemmap: drop handling of 1GB vmemmap rangesOscar Salvador1-28/+7
There is no code to allocate 1GB pages when mapping the vmemmap range as this might waste some memory and requires more complexity which is not really worth. Drop the dead code both for the aligned and unaligned cases and leave only the direct map handling. Link: https://lkml.kernel.org/r/20210309214050.4674-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30x86/vmemmap: drop handling of 4K unaligned vmemmap rangeOscar Salvador1-35/+13
Patch series "Cleanup and fixups for vmemmap handling", v6. This series contains cleanups to remove dead code that handles unaligned cases for 4K and 1GB pages (patch#1 and patch#2) when removing the vemmmap range, and a fix (patch#3) to handle the case when two vmemmap ranges intersect the same PMD. This patch (of 4): remove_pte_table() is prepared to handle the case where either the start or the end of the range is not PAGE aligned. This cannot actually happen: __populate_section_memmap enforces the range to be PMD aligned, so as long as the size of the struct page remains multiple of 8, the vmemmap range will be aligned to PAGE_SIZE. Drop the dead code and place a VM_BUG_ON in vmemmap_{populate,free} to catch nasty cases. Note that the VM_BUG_ON is placed in there because vmemmap_{populate,free= } is the gate of all removing and freeing page tables logic. Link: https://lkml.kernel.org/r/20210309214050.4674-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20210309214050.4674-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-29Merge tag 'x86-mm-2021-04-29' of ↵Linus Torvalds2-72/+106
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 tlb updates from Ingo Molnar: "The x86 MM changes in this cycle were: - Implement concurrent TLB flushes, which overlaps the local TLB flush with the remote TLB flush. In testing this improved sysbench performance measurably by a couple of percentage points, especially if TLB-heavy security mitigations are active. - Further micro-optimizations to improve the performance of TLB flushes" * tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Micro-optimize smp_call_function_many_cond() smp: Inline on_each_cpu_cond() and on_each_cpu() x86/mm/tlb: Remove unnecessary uses of the inline keyword cpumask: Mark functions as pure x86/mm/tlb: Do not make is_lazy dirty for no reason x86/mm/tlb: Privatize cpu_tlbstate x86/mm/tlb: Flush remote and local TLBs concurrently x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote() smp: Run functions concurrently in smp_call_function_many_cond()
2021-04-26Merge tag 'x86_cleanups_for_v5.13' of ↵Linus Torvalds11-23/+24
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 cleanups from Borislav Petkov: "Trivial cleanups and fixes all over the place" * tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Remove me from IDE/ATAPI section x86/pat: Do not compile stubbed functions when X86_PAT is off x86/asm: Ensure asm/proto.h can be included stand-alone x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files x86/msr: Make locally used functions static x86/cacheinfo: Remove unneeded dead-store initialization x86/process/64: Move cpu_current_top_of_stack out of TSS tools/turbostat: Unmark non-kernel-doc comment x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL() x86/fpu/math-emu: Fix function cast warning x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes x86: Fix various typos in comments, take #2 x86: Remove unusual Unicode characters from comments x86/kaslr: Return boolean values from a function returning bool x86: Fix various typos in comments x86/setup: Remove unused RESERVE_BRK_ARRAY() stacktrace: Move documentation for arch_stack_walk_reliable() to header x86: Remove duplicate TSC DEADLINE MSR definitions
2021-04-26Merge tag 'x86_seves_for_v5.13' of ↵Linus Torvalds2-16/+25
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 AMD secure virtualization (SEV-ES) updates from Borislav Petkov: "Add support for SEV-ES guests booting through the 32-bit boot path, along with cleanups, fixes and improvements" * tag 'x86_seves_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev-es: Optimize __sev_es_ist_enter() for better readability x86/sev-es: Replace open-coded hlt-loops with sev_es_terminate() x86/boot/compressed/64: Check SEV encryption in the 32-bit boot-path x86/boot/compressed/64: Add CPUID sanity check to 32-bit boot-path x86/boot/compressed/64: Add 32-bit boot #VC handler x86/boot/compressed/64: Setup IDT in startup_32 boot path x86/boot/compressed/64: Reload CS in startup_32 x86/sev: Do not require Hypervisor CPUID bit for SEV guests x86/boot/compressed/64: Cleanup exception handling before booting kernel x86/virtio: Have SEV guests enforce restricted virtio memory access x86/sev-es: Remove subtraction of res variable
2021-04-26x86/sev: Drop redundant and potentially misleading 'sev_enabled'Sean Christopherson2-7/+4
Drop the sev_enabled flag and switch its one user over to sev_active(). sev_enabled was made redundant with the introduction of sev_status in commit b57de6cd1639 ("x86/sev-es: Add SEV-ES Feature Detection"). sev_enabled and sev_active() are guaranteed to be equivalent, as each is true iff 'sev_status & MSR_AMD64_SEV_ENABLED' is true, and are only ever written in tandem (ignoring compressed boot's version of sev_status). Removing sev_enabled avoids confusion over whether it refers to the guest or the host, and will also allow KVM to usurp "sev_enabled" for its own purposes. No functional change intended. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210422021125.3417167-7-seanjc@google.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-14x86/pat: Do not compile stubbed functions when X86_PAT is offJan Kiszka1-0/+2
Those are already provided by linux/io.h as stubs. The conflict remains invisible until someone would pull linux/io.h into memtype.c. This fixes a build error when this file is used outside of the kernel tree. [ bp: Massage commit message. ] Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/a9351615-7a0d-9d47-af65-d9e2fffe8192@siemens.com
2021-03-28x86/process/64: Move cpu_current_top_of_stack out of TSSLai Jiangshan1-4/+3
cpu_current_top_of_stack is currently stored in TSS.sp1. TSS is exposed through the cpu_entry_area which is visible with user CR3 when PTI is enabled and active. This makes it a coveted fruit for attackers. An attacker can fetch the kernel stack top from it and continue next steps of actions based on the kernel stack. But it is actualy not necessary to be stored in the TSS. It is only accessed after the entry code switched to kernel CR3 and kernel GS_BASE which means it can be in any regular percpu variable. The reason why it is in TSS is historical (pre PTI) because TSS is also used as scratch space in SYSCALL_64 and therefore cache hot. A syscall also needs the per CPU variable current_task and eventually __preempt_count, so placing cpu_current_top_of_stack next to them makes it likely that they end up in the same cache line which should avoid performance regressions. This is not enforced as the compiler is free to place these variables, so these entry relevant variables should move into a data structure to make this enforceable. The seccomp_benchmark doesn't show any performance loss in the "getpid native" test result. Actually, the result changes from 93ns before to 92ns with this change when KPTI is disabled. The test is very stable and although the test doesn't show a higher degree of precision it gives enough confidence that moving cpu_current_top_of_stack does not cause a regression. [ tglx: Removed unneeded export. Massaged changelog ] Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210125173444.22696-2-jiangshanlai@gmail.com
2021-03-23x86/mem_encrypt: Correct physical address calculation in __set_clr_pte_enc()Isaku Yamahata1-1/+1
The pfn variable contains the page frame number as returned by the pXX_pfn() functions, shifted to the right by PAGE_SHIFT to remove the page bits. After page protection computations are done to it, it gets shifted back to the physical address using page_level_shift(). That is wrong, of course, because that function determines the shift length based on the level of the page in the page table but in all the cases, it was shifted by PAGE_SHIFT before. Therefore, shift it back using PAGE_SHIFT to get the correct physical address. [ bp: Rewrite commit message. ] Fixes: dfaaec9033b8 ("x86: Add support for changing memory encryption attribute in early boot") Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/81abbae1657053eccc535c16151f63cd049dcb97.1616098294.git.isaku.yamahata@intel.com
2021-03-21x86: Fix various typos in comments, take #2Ingo Molnar3-3/+3
Fix another ~42 single-word typos in arch/x86/ code comments, missed a few in the first pass, in particular in .S files. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-03-18x86/sev: Do not require Hypervisor CPUID bit for SEV guestsJoerg Roedel1-16/+19
A malicious hypervisor could disable the CPUID intercept for an SEV or SEV-ES guest and trick it into the no-SEV boot path, where it could potentially reveal secrets. This is not an issue for SEV-SNP guests, as the CPUID intercept can't be disabled for those. Remove the Hypervisor CPUID bit check from the SEV detection code to protect against this kind of attack and add a Hypervisor bit equals zero check to the SME detection path to prevent non-encrypted guests from trying to enable SME. This handles the following cases: 1) SEV(-ES) guest where CPUID intercept is disabled. The guest will still see leaf 0x8000001f and the SEV bit. It can retrieve the C-bit and boot normally. 2) Non-encrypted guests with intercepted CPUID will check the SEV_STATUS MSR and find it 0 and will try to enable SME. This will fail when the guest finds MSR_K8_SYSCFG to be zero, as it is emulated by KVM. But we can't rely on that, as there might be other hypervisors which return this MSR with bit 23 set. The Hypervisor bit check will prevent that the guest tries to enable SME in this case. 3) Non-encrypted guests on SEV capable hosts with CPUID intercept disabled (by a malicious hypervisor) will try to boot into the SME path. This will fail, but it is also not considered a problem because non-encrypted guests have no protection against the hypervisor anyway. [ bp: s/non-SEV/non-encrypted/g ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20210312123824.306-3-joro@8bytes.org
2021-03-18x86: Fix various typos in commentsIngo Molnar10-16/+16
Fix ~144 single-word typos in arch/x86/ code comments. Doing this in a single commit should reduce the churn. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-03-08x86/virtio: Have SEV guests enforce restricted virtio memory accessTom Lendacky1-0/+6
An SEV guest requires that virtio devices use the DMA API to allow the hypervisor to successfully access guest memory as needed. The VIRTIO_F_VERSION_1 and VIRTIO_F_ACCESS_PLATFORM features tell virtio to use the DMA API. Add arch_has_restricted_virtio_memory_access() for x86, to fail the device probe if these features have not been set for the device when running as an SEV guest. [ bp: Fix -Wmissing-prototypes warning Reported-by: kernel test robot <lkp@intel.com> ] Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/b46e0211f77ca1831f11132f969d470a6ffc9267.1614897610.git.thomas.lendacky@amd.com
2021-03-06x86/mm/tlb: Remove unnecessary uses of the inline keywordNadav Amit1-3/+3
The compiler is smart enough without these hints. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-9-namit@vmware.com
2021-03-06x86/mm/tlb: Do not make is_lazy dirty for no reasonNadav Amit1-1/+2
Blindly writing to is_lazy for no reason, when the written value is identical to the old value, makes the cacheline dirty for no reason. Avoid making such writes to prevent cache coherency traffic for no reason. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-7-namit@vmware.com
2021-03-06x86/mm/tlb: Privatize cpu_tlbstateNadav Amit2-8/+11
cpu_tlbstate is mostly private and only the variable is_lazy is shared. This causes some false-sharing when TLB flushes are performed. Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark each one accordingly. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-6-namit@vmware.com
2021-03-06x86/mm/tlb: Flush remote and local TLBs concurrentlyNadav Amit1-17/+29
To improve TLB shootdown performance, flush the remote and local TLBs concurrently. Introduce flush_tlb_multi() that does so. Introduce paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen and hyper-v are only compile-tested). While the updated smp infrastructure is capable of running a function on a single local core, it is not optimized for this case. The multiple function calls and the indirect branch introduce some overhead, and might make local TLB flushes slower than they were before the recent changes. Before calling the SMP infrastructure, check if only a local TLB flush is needed to restore the lost performance in this common case. This requires to check mm_cpumask() one more time, but unless this mask is updated very frequently, this should impact performance negatively. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> # Hyper-v parts Reviewed-by: Juergen Gross <jgross@suse.com> # Xen and paravirt parts Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-5-namit@vmware.com
2021-03-06x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()Nadav Amit1-5/+32
Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to optimize the code. Open-coding eliminates the need for the indirect branch that is used to call is_lazy(), and in CPUs that are vulnerable to Spectre v2, it eliminates the retpoline. In addition, it allows to use a preallocated cpumask to compute the CPUs that should be. This would later allow us not to adapt on_each_cpu_cond_mask() to support local and remote functions. Note that calling tlb_is_not_lazy() for every CPU that needs to be flushed, as done in native_flush_tlb_multi() might look ugly, but it is equivalent to what is currently done in on_each_cpu_cond_mask(). Actually, native_flush_tlb_multi() does it more efficiently since it avoids using an indirect branch for the matter. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-4-namit@vmware.com
2021-03-06x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()Nadav Amit1-45/+36
The unification of these two functions allows to use them in the updated SMP infrastrucutre. To do so, remove the reason argument from flush_tlb_func_local(), add a member to struct tlb_flush_info that says which CPU initiated the flush and act accordingly. Optimize the size of flush_tlb_info while we are at it. Unfortunately, this prevents us from using a constant tlb_flush_info for arch_tlbbatch_flush(), but in a later stage we may be able to inline tlb_flush_info into the IPI data, so it should not have an impact eventually. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-3-namit@vmware.com
2021-02-26x86: fix seq_file iteration for pat/memtype.cNeilBrown1-2/+2
The memtype seq_file iterator allocates a buffer in the ->start and ->next functions and frees it in the ->show function. The preferred handling for such resources is to free them in the subsequent ->next or ->stop function call. Since Commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") there is no guarantee that ->show will be called after ->next, so this function can now leak memory. So move the freeing of the buffer to ->next and ->stop. Link: https://lkml.kernel.org/r/161248539022.21478.13874455485854739066.stgit@noble1 Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Signed-off-by: NeilBrown <neilb@suse.de> Cc: Xin Long <lucien.xin@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26kfence: add test suiteMarco Elver1-1/+2
Add KFENCE test suite, testing various error detection scenarios. Makes use of KUnit for test organization. Since KFENCE's interface to obtain error reports is via the console, the test verifies that KFENCE outputs expected reports to the console. [elver@google.com: fix typo in test] Link: https://lkml.kernel.org/r/X9lHQExmHGvETxY4@elver.google.com [elver@google.com: show access type in report] Link: https://lkml.kernel.org/r/20210111091544.3287013-2-elver@google.com Link: https://lkml.kernel.org/r/20201103175841.3495947-9-elver@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Co-developed-by: Alexander Potapenko <glider@google.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hillf Danton <hdanton@sina.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joern Engel <joern@purestorage.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: SeongJae Park <sjpark@amazon.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26kfence: use pt_regs to generate stack trace on faultsMarco Elver1-1/+1
Instead of removing the fault handling portion of the stack trace based on the fault handler's name, just use struct pt_regs directly. Change kfence_handle_page_fault() to take a struct pt_regs, and plumb it through to kfence_report_error() for out-of-bounds, use-after-free, or invalid access errors, where pt_regs is used to generate the stack trace. If the kernel is a DEBUG_KERNEL, also show registers for more information. Link: https://lkml.kernel.org/r/20201105092133.2075331-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26x86, kfence: enable KFENCE for x86Alexander Potapenko1-0/+5
Add architecture specific implementation details for KFENCE and enable KFENCE for the x86 architecture. In particular, this implements the required interface in <asm/kfence.h> for setting up the pool and providing helper functions for protecting and unprotecting pages. For x86, we need to ensure that the pool uses 4K pages, which is done using the set_memory_4k() helper function. [elver@google.com: add missing copyright and description header] Link: https://lkml.kernel.org/r/20210118092159.145934-2-elver@google.com Link: https://lkml.kernel.org/r/20201103175841.3495947-3-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Co-developed-by: Marco Elver <elver@google.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hillf Danton <hdanton@sina.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joern Engel <joern@purestorage.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: SeongJae Park <sjpark@amazon.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-20Merge tag 'x86_mm_for_v5.12' of ↵Linus Torvalds3-199/+225
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm cleanups from Borislav Petkov: - PTRACE_GETREGS/PTRACE_PUTREGS regset selection cleanup - Another initial cleanup - more to follow - to the fault handling code. - Other minor cleanups and corrections. * tag 'x86_mm_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/{fault,efi}: Fix and rename efi_recover_from_page_fault() x86/fault: Don't run fixups for SMAP violations x86/fault: Don't look for extable entries for SMEP violations x86/fault: Rename no_context() to kernelmode_fixup_or_oops() x86/fault: Bypass no_context() for implicit kernel faults from usermode x86/fault: Split the OOPS code out from no_context() x86/fault: Improve kernel-executing-user-memory handling x86/fault: Correct a few user vs kernel checks wrt WRUSS x86/fault: Document the locking in the fault_signal_pending() path x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault() x86/fault: Fold mm_fault_error() into do_user_addr_fault() x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUs x86/fault: Fix AMD erratum #91 errata fixup for user code x86/Kconfig: Remove HPET_EMULATE_RTC depends on RTC x86/asm: Fixup TASK_SIZE_MAX comment x86/ptrace: Clean up PTRACE_GETREGS/PTRACE_PUTREGS regset selection x86/vm86/32: Remove VM86_SCREEN_BITMAP support x86: Remove definition of DEBUG x86/entry: Remove now unused do_IRQ() declaration x86/mm: Remove duplicate definition of _PAGE_PAT_LARGE ...
2021-02-20Merge tag 'x86_seves_for_v5.12' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV-ES fix from Borislav Petkov: "Do not unroll string I/O for SEV-ES guests because they support it" * tag 'x86_seves_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev-es: Do not unroll string I/O for SEV-ES guests
2021-02-12Merge branch 'x86/cleanups' into x86/mmIngo Molnar2-32/+0
Merge recent cleanups to the x86 MM code to resolve a conflict. Conflicts: arch/x86/mm/fault.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-10x86/{fault,efi}: Fix and rename efi_recover_from_page_fault()Andy Lutomirski1-5/+6
efi_recover_from_page_fault() doesn't recover -- it does a special EFI mini-oops. Rename it to make it clear that it crashes. While renaming it, I noticed a blatant bug: a page fault oops in a different thread happening concurrently with an EFI runtime service call would be misinterpreted as an EFI page fault. Fix that. This isn't quite exact. The situation could be improved by using a special CS for calls into EFI. [ bp: Massage commit message and simplify in interrupt check. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/f43b1e80830dc78ed60ed8b0826f4f189254570c.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Don't run fixups for SMAP violationsAndy Lutomirski1-3/+6
A SMAP-violating kernel access is not a recoverable condition. Imagine kernel code that, outside of a uaccess region, dereferences a pointer to the user range by accident. If SMAP is on, this will reliably generate as an intentional user access. This makes it easy for bugs to be overlooked if code is inadequately tested both with and without SMAP. This was discovered because BPF can generate invalid accesses to user memory, but those warnings only got printed if SMAP was off. Make it so that this type of error will be discovered with SMAP on as well. [ bp: Massage commit message. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/66a02343624b1ff46f02a838c497fc05c1a871b3.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Don't look for extable entries for SMEP violationsAndy Lutomirski1-2/+2
If the kernel gets a SMEP violation or a fault that would have been a SMEP violation if it had SMEP support, it shouldn't run fixups. Just OOPS. [ bp: Massage commit message. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/46160d8babce2abf1d6daa052146002efa24ac56.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Rename no_context() to kernelmode_fixup_or_oops()Andy Lutomirski1-18/+10
The name no_context() has never been very clear. It's only called for faults from kernel mode, so rename it and change the no-longer-useful user_mode(regs) check to a WARN_ON_ONCE. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/c21940efe676024bb4bc721f7d70c29c420e127e.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Bypass no_context() for implicit kernel faults from usermodeAndy Lutomirski1-27/+32
Drop an indentation level and remove the last user_mode(regs) == true caller of no_context() by directly OOPSing for implicit kernel faults from usermode. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/6e3d1129494a8de1e59d28012286e3a292a2296e.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Split the OOPS code out from no_context()Andy Lutomirski1-54/+62
Not all callers of no_context() want to run exception fixups. Separate the OOPS code out from the fixup code in no_context(). Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/450f8d8eabafb83a5df349108c8e5ea83a2f939d.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Improve kernel-executing-user-memory handlingAndy Lutomirski1-3/+18
Right now, the case of the kernel trying to execute from user memory is treated more or less just like the kernel getting a page fault on a user access. In the failure path, it checks for erratum #93, tries to otherwise fix up the error, and then oopses. If it manages to jump to the user address space, with or without SMEP, it should not try to resolve the page fault. This is an error, pure and simple. Rearrange the code so that this case is caught early, check for erratum #93, and bail out. [ bp: Massage commit message. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/ab8719c7afb8bd501c4eee0e36493150fbbe5f6a.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Correct a few user vs kernel checks wrt WRUSSAndy Lutomirski1-4/+11
In general, page fault errors for WRUSS should be just like get_user(), etc. Fix three bugs in this area: There is a comment that says that, if the kernel can't handle a page fault on a user address due to OOM, the OOM-kill-and-retry logic would be skipped. The code checked kernel *privilege*, not kernel mode, so it missed WRUSS. This means that the kernel would malfunction if it got OOM on a WRUSS fault -- this would be a kernel-mode, user-privilege fault, and the OOM killer would be invoked and the handler would retry the faulting instruction. A failed user access from kernel while a fatal signal is pending should fail even if the instruction in question was WRUSS. do_sigbus() should not send SIGBUS for WRUSS -- it should handle it like any other kernel mode failure. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/a7b7bcea730bd4069e6b7e629236bb2cf526c2fb.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Document the locking in the fault_signal_pending() pathAndy Lutomirski1-1/+4
If fault_signal_pending() returns true, then the core mm has unlocked the mm for us. Add a comment to help future readers of this code. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/c56de3d103f40e6304437b150aa7b215530d23f7.1612924255.git.luto@kernel.org
2021-02-10x86/fault/32: Move is_f00f_bug() to do_kern_addr_fault()Andy Lutomirski1-5/+7
bad_area() and its relatives are called from many places in fault.c, and exactly one of them wants the F00F workaround. __bad_area_nosemaphore() no longer contains any kernel fault code, which prepares for further cleanups. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/e9668729a48ce6754022b0a4415631e8ebdd00e7.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Fold mm_fault_error() into do_user_addr_fault()Andy Lutomirski1-52/+45
mm_fault_error() is logically just the end of do_user_addr_fault(). Combine the functions. This makes the code easier to read. Most of the churn here is from renaming hw_error_code to error_code in do_user_addr_fault(). This makes no difference at all to the generated code (objdump -dr) as compared to changing noinline to __always_inline in the definition of mm_fault_error(). Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/dedc4d9c9b047e51ce38b991bd23971a28af4e7b.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Skip the AMD erratum #91 workaround on unaffected CPUsAndy Lutomirski1-0/+13
According to the Revision Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, only early revisions of family 0xF are affected. This will avoid unnecessarily fetching instruction bytes before sending SIGSEGV to user programs. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/477173b7784bc28afb3e53d76ae5ef143917e8dd.1612924255.git.luto@kernel.org
2021-02-10x86/fault: Fix AMD erratum #91 errata fixup for user codeAndy Lutomirski1-10/+17
The recent rework of probe_kernel_address() and its conversion to get_kernel_nofault() inadvertently broke is_prefetch(). Before this change, probe_kernel_address() was used as a sloppy "read user or kernel memory" helper, but it doesn't do that any more. The new get_kernel_nofault() reads *kernel* memory only, which completely broke is_prefetch() for user access. Adjust the code to the correct accessor based on access mode. The manual address bounds check is no longer necessary, since the accessor helpers (get_user() / get_kernel_nofault()) do the right thing all by themselves. As a bonus, by using the correct accessor, the open-coded address bounds check is not needed anymore. [ bp: Massage commit message. ] Fixes: eab0c6089b68 ("maccess: unify the probe kernel arch hooks") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/b91f7f92f3367d2d3a88eec3b09c6aab1b2dc8ef.1612924255.git.luto@kernel.org
2021-02-03KVM: SVM: Treat SVM as unsupported when running as an SEV guestSean Christopherson1-0/+1
Don't let KVM load when running as an SEV guest, regardless of what CPUID says. Memory is encrypted with a key that is not accessible to the host (L0), thus it's impossible for L0 to emulate SVM, e.g. it'll see garbage when reading the VMCB. Technically, KVM could decrypt all memory that needs to be accessible to the L0 and use shadow paging so that L0 does not need to shadow NPT, but exposing such information to L0 largely defeats the purpose of running as an SEV guest. This can always be revisited if someone comes up with a use case for running VMs inside SEV guests. Note, VMLOAD, VMRUN, etc... will also #GP on GPAs with C-bit set, i.e. KVM is doomed even if the SEV guest is debuggable and the hypervisor is willing to decrypt the VMCB. This may or may not be fixed on CPUs that have the SVME_ADDR_CHK fix. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210202212017.2486595-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-02x86/sev-es: Do not unroll string I/O for SEV-ES guestsTom Lendacky1-2/+3
Under the GHCB specification, SEV-ES guests can support string I/O. The current #VC handler contains this support, so remove the need to unroll kernel string I/O operations. This will reduce the number of #VC exceptions generated as well as the number VM exits for the guest. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/3de04b5b638546ac75d42ba52307fe1a922173d3.1612203987.git.thomas.lendacky@amd.com
2021-01-21x86/vm86/32: Remove VM86_SCREEN_BITMAP supportAndy Lutomirski1-30/+0
The implementation was rather buggy. It unconditionally marked PTEs read-only, even for VM_SHARED mappings. I'm not sure whether this is actually a problem, but it certainly seems unwise. More importantly, it released the mmap lock before flushing the TLB, which could allow a racing CoW operation to falsely believe that the underlying memory was not writable. I can't find any users at all of this mechanism, so just remove it. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Stas Sergeev <stsp2@yandex.ru> Link: https://lkml.kernel.org/r/f3086de0babcab36f69949b5780bde851f719bc8.1611078018.git.luto@kernel.org