summaryrefslogtreecommitdiffstats
path: root/arch/powerpc/kvm/book3s_hv.c
AgeCommit message (Collapse)AuthorFilesLines
2019-07-15KVM: PPC: Book3S HV: Save and restore guest visible PSSCR bits on pseriesSuraj Jitindar Singh1-0/+11
The Performance Stop Status and Control Register (PSSCR) is used to control the power saving facilities of the processor. This register has various fields, some of which can be modified only in hypervisor state, and others which can be modified in both hypervisor and privileged non-hypervisor state. The bits which can be modified in privileged non-hypervisor state are referred to as guest visible. Currently the L0 hypervisor saves and restores both it's own host value as well as the guest value of the PSSCR when context switching between the hypervisor and guest. However a nested hypervisor running it's own nested guests (as indicated by kvmhv_on_pseries()) doesn't context switch the PSSCR register. That means if a nested (L2) guest modifies the PSSCR then the L1 guest hypervisor will run with that modified value, and if the L1 guest hypervisor modifies the PSSCR and then goes to run the nested (L2) guest again then the L2 PSSCR value will be lost. Fix this by having the (L1) nested hypervisor save and restore both its host and the guest PSSCR value when entering and exiting a nested (L2) guest. Note that only the guest visible parts of the PSSCR are context switched since this is all the L1 nested hypervisor can access, this is fine however as these are the only fields the L0 hypervisor provides guest control of anyway and so all other fields are ignored. This could also have been implemented by adding the PSSCR register to the hv_regs passed to the L0 hypervisor as input to the H_ENTER_NESTED hcall, however this would have meant updating the structure layout and thus required modifications to both the L0 and L1 kernels. Whereas the approach used doesn't require L0 kernel modifications while achieving the same result. Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190703012022.15644-3-sjitindarsingh@gmail.com
2019-07-15KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nestingSuraj Jitindar Singh1-0/+2
The performance monitoring unit (PMU) registers are saved on guest exit when the guest has set the pmcregs_in_use flag in its lppaca, if it exists, or unconditionally if it doesn't. If a nested guest is being run then the hypervisor doesn't, and in most cases can't, know if the PMU registers are in use since it doesn't know the location of the lppaca for the nested guest, although it may have one for its immediate guest. This results in the values of these registers being lost across nested guest entry and exit in the case where the nested guest was making use of the performance monitoring facility while it's nested guest hypervisor wasn't. Further more the hypervisor could interrupt a guest hypervisor between when it has loaded up the PMU registers and it calling H_ENTER_NESTED or between returning from the nested guest to the guest hypervisor and the guest hypervisor reading the PMU registers, in kvmhv_p9_guest_entry(). This means that it isn't sufficient to just save the PMU registers when entering or exiting a nested guest, but that it is necessary to always save the PMU registers whenever a guest is capable of running nested guests to ensure the register values aren't lost in the context switch. Ensure the PMU register values are preserved by always saving their value into the vcpu struct when a guest is capable of running nested guests. This should have minimal performance impact however any impact can be avoided by booting a guest with "-machine pseries,cap-nested-hv=false" on the qemu commandline. Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190703012022.15644-1-sjitindarsingh@gmail.com
2019-07-13Merge tag 'powerpc-5.3-1' of ↵Linus Torvalds1-2/+11
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - Removal of the NPU DMA code, used by the out-of-tree Nvidia driver, as well as some other functions only used by drivers that haven't (yet?) made it upstream. - A fix for a bug in our handling of hardware watchpoints (eg. perf record -e mem: ...) which could lead to register corruption and kernel crashes. - Enable HAVE_ARCH_HUGE_VMAP, which allows us to use large pages for vmalloc when using the Radix MMU. - A large but incremental rewrite of our exception handling code to use gas macros rather than multiple levels of nested CPP macros. And the usual small fixes, cleanups and improvements. Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Andreas Schwab, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Cédric Le Goater, Christian Lamparter, Christophe Leroy, Christophe Lombard, Christoph Hellwig, Daniel Axtens, Denis Efremov, Enrico Weigelt, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geliang Tang, Gen Zhang, Greg Kroah-Hartman, Greg Kurz, Gustavo Romero, Krzysztof Kozlowski, Madhavan Srinivasan, Masahiro Yamada, Mathieu Malaterre, Michael Neuling, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nishad Kamdar, Oliver O'Halloran, Qian Cai, Ravi Bangoria, Sachin Sant, Sam Bobroff, Satheesh Rajendran, Segher Boessenkool, Shaokun Zhang, Shawn Anastasio, Stewart Smith, Suraj Jitindar Singh, Thiago Jung Bauermann, YueHaibing" * tag 'powerpc-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (163 commits) powerpc/powernv/idle: Fix restore of SPRN_LDBAR for POWER9 stop state. powerpc/eeh: Handle hugepages in ioremap space ocxl: Update for AFU descriptor template version 1.1 powerpc/boot: pass CONFIG options in a simpler and more robust way powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h powerpc/irq: Don't WARN continuously in arch_local_irq_restore() powerpc/module64: Use symbolic instructions names. powerpc/module32: Use symbolic instructions names. powerpc: Move PPC_HA() PPC_HI() and PPC_LO() to ppc-opcode.h powerpc/module64: Fix comment in R_PPC64_ENTRY handling powerpc/boot: Add lzo support for uImage powerpc/boot: Add lzma support for uImage powerpc/boot: don't force gzipped uImage powerpc/8xx: Add microcode patch to move SMC parameter RAM. powerpc/8xx: Use IO accessors in microcode programming. powerpc/8xx: replace #ifdefs by IS_ENABLED() in microcode.c powerpc/8xx: refactor programming of microcode CPM params. powerpc/8xx: refactor printing of microcode patch name. powerpc/8xx: Refactor microcode write powerpc/8xx: refactor writing of CPM microcode arrays ...
2019-06-20KVM: PPC: Book3S HV: Clear pending decrementer exceptions on nested guest entrySuraj Jitindar Singh1-2/+9
If we enter an L1 guest with a pending decrementer exception then this is cleared on guest exit if the guest has writtien a positive value into the decrementer (indicating that it handled the decrementer exception) since there is no other way to detect that the guest has handled the pending exception and that it should be dequeued. In the event that the L1 guest tries to run a nested (L2) guest immediately after this and the L2 guest decrementer is negative (which is loaded by L1 before making the H_ENTER_NESTED hcall), then the pending decrementer exception isn't cleared and the L2 entry is blocked since L1 has a pending exception, even though L1 may have already handled the exception and written a positive value for it's decrementer. This results in a loop of L1 trying to enter the L2 guest and L0 blocking the entry since L1 has an interrupt pending with the outcome being that L2 never gets to run and hangs. Fix this by clearing any pending decrementer exceptions when L1 makes the H_ENTER_NESTED hcall since it won't do this if it's decrementer has gone negative, and anyway it's decrementer has been communicated to L0 in the hdec_expires field and L0 will return control to L1 when this goes negative by delivering an H_DECREMENTER exception. Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-20KVM: PPC: Book3S HV: Signed extend decrementer value if not using large ↵Suraj Jitindar Singh1-0/+2
decrementer On POWER9 the decrementer can operate in large decrementer mode where the decrementer is 56 bits and signed extended to 64 bits. When not operating in this mode the decrementer behaves as a 32 bit decrementer which is NOT signed extended (as on POWER8). Currently when reading a guest decrementer value we don't take into account whether the large decrementer is enabled or not, and this means the value will be incorrect when the guest is not using the large decrementer. Fix this by sign extending the value read when the guest isn't using the large decrementer. Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500Thomas Gleixner1-4/+1
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry()Suraj Jitindar Singh1-0/+1
The sprgs are a set of 4 general purpose sprs provided for software use. SPRG3 is special in that it can also be read from userspace. Thus it is used on linux to store the cpu and numa id of the process to speed up syscall access to this information. This register is overwritten with the guest value on kvm guest entry, and so needs to be restored on exit again. Thus restore the value on the guest exit path in kvmhv_p9_guest_entry(). Cc: stable@vger.kernel.org # v4.20+ Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-30KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9Paul Mackerras1-2/+5
Commit 3309bec85e60 ("KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest") moved calls to trace_hardirqs_{on,off} in the entry path used for HPT guests. Similar code exists in the new streamlined entry path used for radix guests on POWER9. This makes the same change there, so as to avoid lockdep warnings such as this: [ 228.686461] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) [ 228.686480] WARNING: CPU: 116 PID: 3803 at ../kernel/locking/lockdep.c:4219 check_flags.part.23+0x21c/0x270 [ 228.686544] Modules linked in: vhost_net vhost xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat +xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter +ebtables ip6table_filter ip6_tables iptable_filter fuse kvm_hv kvm at24 ipmi_powernv regmap_i2c ipmi_devintf +uio_pdrv_genirq ofpart ipmi_msghandler uio powernv_flash mtd ibmpowernv opal_prd ip_tables ext4 mbcache jbd2 btrfs +zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor +raid6_pq raid1 raid0 ses sd_mod enclosure scsi_transport_sas ast i2c_opal i2c_algo_bit drm_kms_helper syscopyarea +sysfillrect sysimgblt fb_sys_fops ttm drm i40e e1000e cxl aacraid tg3 drm_panel_orientation_quirks i2c_core [ 228.686859] CPU: 116 PID: 3803 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc1-xive+ #42 [ 228.686911] NIP: c0000000001b394c LR: c0000000001b3948 CTR: c000000000bfad20 [ 228.686963] REGS: c000200cdb50f570 TRAP: 0700 Not tainted (5.2.0-rc1-xive+) [ 228.687001] MSR: 9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 48222222 XER: 20040000 [ 228.687060] CFAR: c000000000116db0 IRQMASK: 1 [ 228.687060] GPR00: c0000000001b3948 c000200cdb50f800 c0000000015e7600 000000000000002e [ 228.687060] GPR04: 0000000000000001 c0000000001c71a0 000000006e655f73 72727563284e4f5f [ 228.687060] GPR08: 0000200e60680000 0000000000000000 c000200cdb486180 0000000000000000 [ 228.687060] GPR12: 0000000000002000 c000200fff61a680 0000000000000000 00007fffb75c0000 [ 228.687060] GPR16: 0000000000000000 0000000000000000 c0000000017d6900 c000000001124900 [ 228.687060] GPR20: 0000000000000074 c008000006916f68 0000000000000074 0000000000000074 [ 228.687060] GPR24: ffffffffffffffff ffffffffffffffff 0000000000000003 c000200d4b600000 [ 228.687060] GPR28: c000000001627e58 c000000001489908 c000000001627e58 c000000002304de0 [ 228.687377] NIP [c0000000001b394c] check_flags.part.23+0x21c/0x270 [ 228.687415] LR [c0000000001b3948] check_flags.part.23+0x218/0x270 [ 228.687466] Call Trace: [ 228.687488] [c000200cdb50f800] [c0000000001b3948] check_flags.part.23+0x218/0x270 (unreliable) [ 228.687542] [c000200cdb50f870] [c0000000001b6548] lock_is_held_type+0x188/0x1c0 [ 228.687595] [c000200cdb50f8d0] [c0000000001d939c] rcu_read_lock_sched_held+0xdc/0x100 [ 228.687646] [c000200cdb50f900] [c0000000001dd704] rcu_note_context_switch+0x304/0x340 [ 228.687701] [c000200cdb50f940] [c0080000068fcc58] kvmhv_run_single_vcpu+0xdb0/0x1120 [kvm_hv] [ 228.687756] [c000200cdb50fa20] [c0080000068fd5b0] kvmppc_vcpu_run_hv+0x5e8/0xe40 [kvm_hv] [ 228.687816] [c000200cdb50faf0] [c0080000071797dc] kvmppc_vcpu_run+0x34/0x48 [kvm] [ 228.687863] [c000200cdb50fb10] [c0080000071755dc] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] [ 228.687916] [c000200cdb50fba0] [c008000007165ccc] kvm_vcpu_ioctl+0x424/0x838 [kvm] [ 228.687957] [c000200cdb50fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0 [ 228.687995] [c000200cdb50fdb0] [c000000000434724] ksys_ioctl+0x104/0x120 [ 228.688033] [c000200cdb50fe00] [c000000000434768] sys_ioctl+0x28/0x80 [ 228.688072] [c000200cdb50fe20] [c00000000000b888] system_call+0x5c/0x70 [ 228.688109] Instruction dump: [ 228.688142] 4bf6342d 60000000 0fe00000 e8010080 7c0803a6 4bfffe60 3c82ff87 3c62ff87 [ 228.688196] 388472d0 3863d738 4bf63405 60000000 <0fe00000> 4bffff4c 3c82ff87 3c62ff87 [ 228.688251] irq event stamp: 205 [ 228.688287] hardirqs last enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv] [ 228.688344] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv] [ 228.688412] softirqs last enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4 [ 228.688464] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210 [ 228.688513] ---[ end trace eb16f6260022a812 ]--- [ 228.688548] possible reason: unannotated irqs-off. [ 228.688571] irq event stamp: 205 [ 228.688607] hardirqs last enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv] [ 228.688664] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv] [ 228.688719] softirqs last enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4 [ 228.688758] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210 Cc: stable@vger.kernel.org # v4.20+ Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Tested-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-29KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpuPaul Mackerras1-8/+1
Currently the HV KVM code takes the kvm->lock around calls to kvm_for_each_vcpu() and kvm_get_vcpu_by_id() (which can call kvm_for_each_vcpu() internally). However, that leads to a lock order inversion problem, because these are called in contexts where the vcpu mutex is held, but the vcpu mutexes nest within kvm->lock according to Documentation/virtual/kvm/locking.txt. Hence there is a possibility of deadlock. To fix this, we simply don't take the kvm->lock mutex around these calls. This is safe because the implementations of kvm_for_each_vcpu() and kvm_get_vcpu_by_id() have been designed to be able to be called locklessly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-29KVM: PPC: Book3S HV: Use new mutex to synchronize MMU setupPaul Mackerras1-8/+23
Currently the HV KVM code uses kvm->lock in conjunction with a flag, kvm->arch.mmu_ready, to synchronize MMU setup and hold off vcpu execution until the MMU-related data structures are ready. However, this means that kvm->lock is being taken inside vcpu->mutex, which is contrary to Documentation/virtual/kvm/locking.txt and results in lockdep warnings. To fix this, we add a new mutex, kvm->arch.mmu_setup_lock, which nests inside the vcpu mutexes, and is taken in the places where kvm->lock was taken that are related to MMU setup. Additionally we take the new mutex in the vcpu creation code at the point where we are creating a new vcore, in order to provide mutual exclusion with kvmppc_update_lpcr() and ensure that an update to kvm->arch.lpcr doesn't get missed, which could otherwise lead to a stale vcore->lpcr value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras1-1/+2
This merges in the ppc-kvm topic branch from the powerpc tree to get patches which touch both general powerpc code and KVM code, one of which is a prerequisite for following patches. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry()Suraj Jitindar Singh1-0/+2
On POWER9 and later processors where the host can schedule vcpus on a per thread basis, there is a streamlined entry path used when the guest is radix. This entry path saves/restores the fp and vr state in kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and load_[fp/vr]_state(). This is the same as the old entry path however the old entry path also saved/restored the VRSAVE register, which isn't done in the new entry path. This means that the vrsave register is now volatile across guest exit, which is an incorrect change in behaviour. Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry(). This restores the old, correct, behaviour. Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: Flush TLB on secondary radix threadsPaul Mackerras1-48/+7
When running on POWER9 with kvm_hv.indep_threads_mode = N and the host in SMT1 mode, KVM will run guest VCPUs on offline secondary threads. If those guests are in radix mode, we fail to load the LPID and flush the TLB if necessary, leading to the guest crashing with an unsupported MMU fault. This arises from commit 9a4506e11b97 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on", 2018-05-17), which didn't consider the case where indep_threads_mode = N. For simplicity, this makes the real-mode guest entry path flush the TLB in the same place for both radix and hash guests, as we did before 9a4506e11b97, though the code is now C code rather than assembly code. We also have the radix TLB flush open-coded rather than calling radix__local_flush_tlb_lpid_guest(), because the TLB flush can be called in real mode, and in real mode we don't want to invoke the tracepoint code. Fixes: 9a4506e11b97 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: smb->smp comment fixupPalmer Dabbelt1-1/+1
I made the same typo when trying to grep for uses of smp_wmb and figured I might as well fix it. Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: Fix lockdep warning when entering the guestAlexey Kardashevskiy1-7/+8
The trace_hardirqs_on() sets current->hardirqs_enabled and from here the lockdep assumes interrupts are enabled although they are remain disabled until the context switches to the guest. Consequent srcu_read_lock() checks the flags in rcu_lock_acquire(), observes disabled interrupts and prints a warning (see below). This moves trace_hardirqs_on/off closer to __kvmppc_vcore_entry to prevent lockdep from being confused. DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) WARNING: CPU: 16 PID: 8038 at kernel/locking/lockdep.c:4128 check_flags.part.25+0x224/0x280 [...] NIP [c000000000185b84] check_flags.part.25+0x224/0x280 LR [c000000000185b80] check_flags.part.25+0x220/0x280 Call Trace: [c000003fec253710] [c000000000185b80] check_flags.part.25+0x220/0x280 (unreliable) [c000003fec253780] [c000000000187ea4] lock_acquire+0x94/0x260 [c000003fec253840] [c00800001a1e9768] kvmppc_run_core+0xa60/0x1ab0 [kvm_hv] [c000003fec253a10] [c00800001a1ed944] kvmppc_vcpu_run_hv+0x73c/0xec0 [kvm_hv] [c000003fec253ae0] [c00800001a1095dc] kvmppc_vcpu_run+0x34/0x48 [kvm] [c000003fec253b00] [c00800001a1056bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] [c000003fec253b90] [c00800001a0f3618] kvm_vcpu_ioctl+0x460/0x850 [kvm] [c000003fec253d00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930 [c000003fec253db0] [c00000000041ce04] ksys_ioctl+0xc4/0x110 [c000003fec253e00] [c00000000041ce78] sys_ioctl+0x28/0x80 [c000003fec253e20] [c00000000000b5a4] system_call+0x5c/0x70 Instruction dump: 419e0034 3d220004 39291730 81290000 2f890000 409e0020 3c82ffc6 3c62ffc5 3884be70 386329c0 4bf6ea71 60000000 <0fe00000> 3c62ffc6 3863be90 4801273d irq event stamp: 1025 hardirqs last enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv] hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv] softirqs last enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00 softirqs last disabled at (0): [<0000000000000000>] (null) ---[ end trace 31180adcc848993e ]--- possible reason: unannotated irqs-off. irq event stamp: 1025 hardirqs last enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv] hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv] softirqs last enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00 softirqs last disabled at (0): [<0000000000000000>] (null) Fixes: 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry", 2017-06-26) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: Implement virtual mode H_PAGE_INIT handlerSuraj Jitindar Singh1-0/+80
Implement a virtual mode handler for the H_CALL H_PAGE_INIT which can be used to zero or copy a guest page. The page is defined to be 4k and must be 4k aligned. The in-kernel handler halves the time to handle this H_CALL compared to handling it in userspace for a radix guest. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-20powerpc: Add force enable of DAWR on P9 optionMichael Neuling1-1/+2
This adds a flag so that the DAWR can be enabled on P9 via: echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous The DAWR was previously force disabled on POWER9 in: 9654153158 powerpc: Disable DAWR in the base POWER9 CPU features Also see Documentation/powerpc/DAWR-POWER9.txt This is a dangerous setting, USE AT YOUR OWN RISK. Some users may not care about a bad user crashing their box (ie. single user/desktop systems) and really want the DAWR. This allows them to force enable DAWR. This flag can also be used to disable DAWR access. Once this is cleared, all DAWR access should be cleared immediately and your machine once again safe from crashing. Userspace may get confused by toggling this. If DAWR is force enabled/disabled between getting the number of breakpoints (via PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an inconsistent view of what's available. Similarly for guests. For the DAWR to be enabled in a KVM guest, the DAWR needs to be force enabled in the host AND the guest. For this reason, this won't work on POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the dawr_enable_dangerous file will fail if the hypervisor doesn't support writing the DAWR. To double check the DAWR is working, run this kernel selftest: tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c Any errors/failures/skips mean something is wrong. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-05KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exitSuraj Jitindar Singh1-1/+3
There is a hardware bug in some POWER9 processors where a treclaim in fake suspend mode can cause an inconsistency in the XER[SO] bit across the threads of a core, the workaround being to force the core into SMT4 when doing the treclaim. The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a thread is in fake suspend or real suspend. The important difference here being that thread reconfiguration is blocked in real suspend but not fake suspend mode. When we exit a guest which was in fake suspend mode, we force the core into SMT4 while we do the treclaim in kvmppc_save_tm_hv(). However on the new exit path introduced with the function kvmhv_run_single_vcpu() we restore the host PSSCR before calling kvmppc_save_tm_hv() which means that if we were in fake suspend mode we put the thread into real suspend mode when we clear the PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration and the thread which is trying to get the core into SMT4 before it can do the treclaim spins forever since it itself is blocking thread reconfiguration. The result is that that core is essentially lost. This results in a trace such as: [ 93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4 [ 93.512905] NIP: c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000 [ 93.512908] REGS: c000003fffd2bd70 TRAP: 0100 Not tainted (5.0.0) [ 93.512908] MSR: 9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]> CR: 22222444 XER: 00000000 [ 93.512914] CFAR: c000000000098a5c IRQMASK: 3 [ 93.512915] PACATMSCRATCH: 0000000000000001 [ 93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004 [ 93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007 [ 93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004 [ 93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000 [ 93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000 [ 93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0 [ 93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000 [ 93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000 [ 93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0 [ 93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88 [ 93.512960] Call Trace: [ 93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable) [ 93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv] [ 93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv] [ 93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv] [ 93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm] [ 93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] [ 93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm] [ 93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0 [ 93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0 [ 93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80 [ 93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70 [ 93.512983] Instruction dump: [ 93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068 [ 93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214 To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call kvmppc_save_tm_hv() which will mean the core can get into SMT4 and perform the treclaim. Note kvmppc_save_tm_hv() clears the PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that. Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22Merge tag 'kvm-ppc-next-5.1-1' of ↵Paolo Bonzini1-15/+32
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next PPC KVM update for 5.1 There are no major new features this time, just a collection of bug fixes and improvements in various areas, including machine check handling and context switching of protection-key-related registers.
2019-02-22Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras1-4/+21
This merges in the "ppc-kvm" topic branch of the powerpc tree to get a series of commits that touch both general arch/powerpc code and KVM code. These commits will be merged both via the KVM tree and the powerpc tree. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22KVM: PPC: Book3S HV: Fix build failure without IOMMU supportJordan Niethe1-0/+2
Currently trying to build without IOMMU support will fail: (.text+0x1380): undefined reference to `kvmppc_h_get_tce' (.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce' (.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce' (.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect' This happens because turning off IOMMU support will prevent book3s_64_vio_hv.c from being built because it is only built when SPAPR_TCE_IOMMU is set, which depends on IOMMU support. Fix it using ifdefs for the undefined references. Fixes: 76d837a4c0f9 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms") Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-21powerpc/64s: Better printing of machine check info for guest MCEsPaul Mackerras1-2/+2
This adds an "in_guest" parameter to machine_check_print_event_info() so that we can avoid trying to translate guest NIP values into symbolic form using the host kernel's symbol table. Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21KVM: PPC: Book3S HV: Simplify machine check handlingPaul Mackerras1-2/+16
This makes the handling of machine check interrupts that occur inside a guest simpler and more robust, with less done in assembler code and in real mode. Now, when a machine check occurs inside a guest, we always get the machine check event struct and put a copy in the vcpu struct for the vcpu where the machine check occurred. We no longer call machine_check_queue_event() from kvmppc_realmode_mc_power7(), because on POWER8, when a vcpu is running on an offline secondary thread and we call machine_check_queue_event(), that calls irq_work_queue(), which doesn't work because the CPU is offline, but instead triggers the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which fires again and again because nothing clears the condition). All that machine_check_queue_event() actually does is to cause the event to be printed to the console. For a machine check occurring in the guest, we now print the event in kvmppc_handle_exit_hv() instead. The assembly code at label machine_check_realmode now just calls C code and then continues exiting the guest. We no longer either synthesize a machine check for the guest in assembly code or return to the guest without a machine check. The code in kvmppc_handle_exit_hv() is extended to handle the case where the guest is not FWNMI-capable. In that case we now always synthesize a machine check interrupt for the guest. Previously, if the host thinks it has recovered the machine check fully, it would return to the guest without any notification that the machine check had occurred. If the machine check was caused by some action of the guest (such as creating duplicate SLB entries), it is much better to tell the guest that it has caused a problem. Therefore we now always generate a machine check interrupt for guests that are not FWNMI-capable. Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21KVM: PPC: Book3S HV: Context switch AMR on Power9Michael Ellerman1-1/+4
kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9 when guest and host are both running with the Radix MMU. Currently in that path we don't save the host AMR (Authority Mask Register) value, and we always restore 0 on return to the host. That is OK at the moment because the AMR is not used for storage keys with the Radix MMU. However we plan to start using the AMR on Radix to prevent the kernel from reading/writing to userspace outside of copy_to/from_user(). In order to make that work we need to save/restore the AMR value. We only restore the value if it is different from the guest value, which is already in the register when we exit to the host. This should mean we rarely need to actually restore the value when running a modern Linux as a guest, because it will be using the same value as us. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Russell Currey <ruscur@russell.cc>
2019-02-20KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_startNir Weiner1-3/+2
grow_halt_poll_ns() have a strange behaviour in case (vcpu->halt_poll_ns != 0) && (vcpu->halt_poll_ns < halt_poll_ns_grow_start). In this case, vcpu->halt_poll_ns will be multiplied by grow factor (halt_poll_ns_grow) which will require several grow iteration in order to reach a value bigger than halt_poll_ns_grow_start. This means that growing vcpu->halt_poll_ns from value of 0 is slower than growing it from a positive value less than halt_poll_ns_grow_start. Which is misleading and inaccurate. Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns to halt_poll_ns_grow_start in any case that (vcpu->halt_poll_ns < halt_poll_ns_grow_start). Regardless if vcpu->halt_poll_ns is 0. use READ_ONCE to get a consistent number for all cases. Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Nir Weiner <nir.weiner@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameterNir Weiner1-2/+1
The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial start value when raising up vcpu->halt_poll_ns. It actually sets the first timeout to the first polling session. This value has significant effect on how tolerant we are to outliers. On the standard case, higher value is better - we will spend more time in the polling busyloop, handle events/interrupts faster and result in better performance. But on outliers it puts us in a busy loop that does nothing. Even if the shrink factor is zero, we will still waste time on the first iteration. The optimal value changes between different workloads. It depends on outliers rate and polling sessions length. As this value has significant effect on the dynamic halt-polling algorithm, it should be configurable and exposed. Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Nir Weiner <nir.weiner@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_nsNir Weiner1-1/+4
grow_halt_poll_ns() have a strange behavior in case (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0). In this case, vcpu->halt_pol_ns will be set to zero. That results in shrinking instead of growing. Fix issue by changing grow_halt_poll_ns() to not modify vcpu->halt_poll_ns in case halt_poll_ns_grow is zero Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Nir Weiner <nir.weiner@oracle.com> Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-19KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVEPaul Mackerras1-8/+8
Currently, the KVM code assumes that if the host kernel is using the XIVE interrupt controller (the new interrupt controller that first appeared in POWER9 systems), then the in-kernel XICS emulation will use the XIVE hardware to deliver interrupts to the guest. However, this only works when the host is running in hypervisor mode and has full access to all of the XIVE functionality. It doesn't work in any nested virtualization scenario, either with PR KVM or nested-HV KVM, because the XICS-on-XIVE code calls directly into the native-XIVE routines, which are not initialized and cannot function correctly because they use OPAL calls, and OPAL is not available in a guest. This means that using the in-kernel XICS emulation in a nested hypervisor that is using XIVE as its interrupt controller will cause a (nested) host kernel crash. To fix this, we change most of the places where the current code calls xive_enabled() to select between the XICS-on-XIVE emulation and the plain XICS emulation to call a new function, xics_on_xive(), which returns false in a guest. However, there is a further twist. The plain XICS emulation has some functions which are used in real mode and access the underlying XICS controller (the interrupt controller of the host) directly. In the case of a nested hypervisor, this means doing XICS hypercalls directly. When the nested host is using XIVE as its interrupt controller, these hypercalls will fail. Therefore this also adds checks in the places where the XICS emulation wants to access the underlying interrupt controller directly, and if that is XIVE, makes the code use the virtual mode fallback paths, which call generic kernel infrastructure rather than doing direct XICS access. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-19KVM: PPC: Book3S HV: Replace kmalloc_node+memset with kzalloc_nodewangbo1-3/+1
Replace kmalloc_node and memset with kzalloc_node Signed-off-by: wangbo <wang.bo116@zte.com.cn> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17KVM: PPC: Book3S: Introduce new hcall H_COPY_TOFROM_GUEST to access ↵Suraj Jitindar Singh1-1/+5
quadrants 1 & 2 A guest cannot access quadrants 1 or 2 as this would result in an exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a guest when it wants to perform an access to quadrants 1 or 2, for example when it wants to access memory for one of its nested guests. Also provide an implementation for the kvm-hv module. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guestSuraj Jitindar Singh1-4/+8
Allow for a device which is being emulated at L0 (the host) for an L1 guest to be passed through to a nested (L2) guest. The existing kvmppc_hv_emulate_mmio function can be used here. The main challenge is that for a load the result must be stored into the L2 gpr, not an L1 gpr as would normally be the case after going out to qemu to complete the operation. This presents a challenge as at this point the L2 gpr state has been written back into L1 memory. To work around this we store the address in L1 memory of the L2 gpr where the result of the load is to be stored and use the new io_gpr value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for which completion must be done when returning back into the kernel. Then in kvmppc_complete_mmio_load() the resultant value is written into L1 memory at the location of the indicated L2 gpr. Note that we don't currently let an L1 guest emulate a device for an L2 guest which is then passed through to an L3 guest. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops structSuraj Jitindar Singh1-0/+40
The kvmppc_ops struct is used to store function pointers to kvm implementation specific functions. Introduce two new functions load_from_eaddr and store_to_eaddr to be used to load from and store to a guest effective address respectively. Also implement these for the kvm-hv module. If we are using the radix mmu then we can call the functions to access quadrant 1 and 2. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17KVM: PPC: Book3S HV: Flush guest mappings when turning dirty tracking on/offPaul Mackerras1-0/+17
This adds code to flush the partition-scoped page tables for a radix guest when dirty tracking is turned on or off for a memslot. Only the guest real addresses covered by the memslot are flushed. The reason for this is to get rid of any 2M PTEs in the partition-scoped page tables that correspond to host transparent huge pages, so that page dirtiness is tracked at a system page (4k or 64k) granularity rather than a 2M granularity. The page tables are also flushed when turning dirty tracking off so that the memslot's address space can be repopulated with THPs if possible. To do this, we add a new function kvmppc_radix_flush_memslot(). Since this does what's needed for kvmppc_core_flush_memslot_hv() on a radix guest, we now make kvmppc_core_flush_memslot_hv() call the new kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix() for each page in the memslot. This has the effect of fixing a bug in that kvmppc_core_flush_memslot_hv() was previously calling kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which is required to be held. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17KVM: PPC: Pass change type down to memslot commit functionBharata B Rao1-1/+2
Currently, kvm_arch_commit_memory_region() gets called with a parameter indicating what type of change is being made to the memslot, but it doesn't pass it down to the platform-specific memslot commit functions. This adds the `change' parameter to the lower-level functions so that they can use it in future. [paulus@ozlabs.org - fix book E also.] Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-14KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switchPaul Mackerras1-6/+11
Testing has revealed an occasional crash which appears to be caused by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv. The symptom is a NULL pointer dereference in __find_linux_pte() called from kvm_unmap_radix() with kvm->arch.pgtable == NULL. Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear kvm->arch.pgtable (via kvmppc_free_radix()) before setting kvm->arch.radix to NULL, and there is nothing to prevent kvm_unmap_hva_range_hv() or the other MMU callback functions from being called concurrently with kvmppc_switch_mmu_to_hpt() or kvmppc_switch_mmu_to_radix(). This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock around the assignments to kvm->arch.radix, and makes sure that the partition-scoped radix tree or HPT is only freed after changing kvm->arch.radix. This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure that the clearing of each rmap array (one per memslot) doesn't happen concurrently with use of the array in the kvm_unmap_hva_range_hv() or the other MMU callbacks. Fixes: 18c3640cefc7 ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host") Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-11-15KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTEDMichael Roth1-0/+1
While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a pending signal in the L0 QEMU process can generate the following sequence: ret0 = kvmppc_pseries_do_hcall() ret1 = kvmhv_enter_nested_guest() ret2 = kvmhv_run_single_vcpu() if (ret2 == -EINTR) return H_INTERRUPT if (ret1 == H_INTERRUPT) kvmppc_set_gpr(vcpu, 3, 0) return -EINTR /* skipped: */ kvmppc_set_gpr(vcpu, 3, ret) vcpu->arch.hcall_needed = 0 return RESUME_GUEST which causes an exit to L0 userspace with ret0 == -EINTR. The intention seems to be to set the hcall return value to 0 (via VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED once we resume executing the VCPU. However, because we don't set vcpu->arch.hcall_needed = 0, we do the following once userspace resumes execution via kvm_arch_vcpu_ioctl_run(): ... } else if (vcpu->arch.hcall_needed) { int i kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret); for (i = 0; i < 9; ++i) kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]); vcpu->arch.hcall_needed = 0; since vcpu->arch.hcall_needed == 1 indicates that userspace should have handled the hcall and stored the return value in run->papr_hcall.ret. Since that's not the case here, we can get an unexpected value in VCPU r3, which can result in kvmhv_p9_guest_entry() reporting an unexpected trap value when it returns from H_ENTER_NESTED, causing the following register dump to console via subsequent call to kvmppc_handle_exit_hv() in L1: [ 350.612854] vcpu 00000000f9564cf8 (0): [ 350.612915] pc = c00000000013eb98 msr = 8000000000009033 trap = 1 [ 350.613020] r 0 = c0000000004b9044 r16 = 0000000000000000 [ 350.613075] r 1 = c00000007cffba30 r17 = 0000000000000000 [ 350.613120] r 2 = c00000000178c100 r18 = 00007fffc24f3b50 [ 350.613166] r 3 = c00000007ef52480 r19 = 00007fffc24fff58 [ 350.613212] r 4 = 0000000000000000 r20 = 00000a1e96ece9d0 [ 350.613253] r 5 = 70616d00746f6f72 r21 = 00000a1ea117c9b0 [ 350.613295] r 6 = 0000000000000020 r22 = 00000a1ea1184360 [ 350.613338] r 7 = c0000000783be440 r23 = 0000000000000003 [ 350.613380] r 8 = fffffffffffffffc r24 = 00000a1e96e9e124 [ 350.613423] r 9 = c00000007ef52490 r25 = 00000000000007ff [ 350.613469] r10 = 0000000000000004 r26 = c00000007eb2f7a0 [ 350.613513] r11 = b0616d0009eccdb2 r27 = c00000007cffbb10 [ 350.613556] r12 = c0000000004b9000 r28 = c00000007d83a2c0 [ 350.613597] r13 = c000000001b00000 r29 = c0000000783cdf68 [ 350.613639] r14 = 0000000000000000 r30 = 0000000000000000 [ 350.613681] r15 = 0000000000000000 r31 = c00000007cffbbf0 [ 350.613723] ctr = c0000000004b9000 lr = c0000000004b9044 [ 350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033 [ 350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000 [ 350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000 [ 350.613911] cr = 88002848 xer = 0000000020040000 dsisr = 42000000 [ 350.613962] dar = 0000772f95390000 [ 350.614031] fault dar = c000000244b278c0 dsisr = 00000000 [ 350.614073] SLB (0 entries): [ 350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff [ 350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033 followed by L1's QEMU reporting the following before stopping execution of the nested guest: KVM: unknown exit, hardware reason 1 NIP c00000000013eb98 LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0 MSR 8000000000009033 HID0 0000000000000000 HF 8000000000000000 iidx 3 didx 3 TB 00000000 00000000 DECR 00000000 GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480 GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440 GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2 GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000 GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58 GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003 GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10 GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0 CR 88002848 [ L L - - E L G L ] RES ffffffffffffffff SRR0 0000772f954dd48c SRR1 800000000280f033 PVR 00000000004e1202 VRSAVE 0000000000000000 SPRG0 0000000000000000 SPRG1 c000000001b00000 SPRG2 0000772f9565a280 SPRG3 0000000000000000 SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000 HSRR0 0000000000000000 HSRR1 0000000000000000 CFAR 0000000000000000 LPCR 0000000003d40413 PTCR 0000000000000000 DAR 0000772f95390000 DSISR 0000000042000000 Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion of H_ENTER_NESTED before we exit to L0 userspace. Fixes: 360cae313702 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall") Cc: linuxppc-dev@ozlabs.org Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-11-02Merge tag 'powerpc-4.20-2' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "Some things that I missed due to travel, or that came in late. Two fixes also going to stable: - A revert of a buggy change to the 8xx TLB miss handlers. - Our flushing of SPE (Signal Processing Engine) registers on fork was broken. Other changes: - A change to the KVM decrementer emulation to use proper APIs. - Some cleanups to the way we do code patching in the 8xx code. - Expose the maximum possible memory for the system in /proc/powerpc/lparcfg. - Merge some updates from Scott: "a couple device tree updates, and a fix for a missing prototype warning" A few other minor fixes and a handful of fixes for our selftests. Thanks to: Aravinda Prasad, Breno Leitao, Camelia Groza, Christophe Leroy, Felipe Rechia, Joel Stanley, Naveen N. Rao, Paul Mackerras, Scott Wood, Tyrel Datwyler" * tag 'powerpc-4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (21 commits) selftests/powerpc: Fix compilation issue due to asm label selftests/powerpc/cache_shape: Fix out-of-tree build selftests/powerpc/switch_endian: Fix out-of-tree build selftests/powerpc/pmu: Link ebb tests with -no-pie selftests/powerpc/signal: Fix out-of-tree build selftests/powerpc/ptrace: Fix out-of-tree build powerpc/xmon: Relax frame size for clang selftests: powerpc: Fix warning for security subdir selftests/powerpc: Relax L1d miss targets for rfi_flush test powerpc/process: Fix flush_all_to_thread for SPE powerpc/pseries: add missing cpumask.h include file selftests/powerpc: Fix ptrace tm failure KVM: PPC: Use exported tb_to_ns() function in decrementer emulation powerpc/pseries: Export maximum memory value powerpc/8xx: Use patch_site for perf counters setup powerpc/8xx: Use patch_site for memory setup patching powerpc/code-patching: Add a helper to get the address of a patch_site Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP" powerpc/8xx: add missing header in 8xx_mmu.c powerpc/8xx: Add DT node for using the SEC engine of the MPC885 ...
2018-10-26KVM: PPC: Use exported tb_to_ns() function in decrementer emulationPaul Mackerras1-2/+1
This changes the KVM code that emulates the decrementer function to do the conversion of decrementer values to time intervals in nanoseconds by calling the tb_to_ns() function exported by the powerpc timer code, in preference to open-coded arithmetic using values from the decrementer_clockevent struct. Similarly, the HV-KVM code that did the same conversion using arithmetic on tb_ticks_per_sec also now uses tb_to_ns(). Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-19KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chipsPaul Mackerras1-2/+11
This disables the use of the streamlined entry path for radix guests on early POWER9 chips that need the workaround added in commit a25bd72badfa ("powerpc/mm/radix: Workaround prefetch issue with KVM", 2017-07-24), because the streamlined entry path does not include that workaround. This also means that we can't do nested HV-KVM on those chips. Since the chips that need that workaround are the same ones that can't run both radix and HPT guests at the same time on different threads of a core, we use the existing 'no_mixing_hpt_and_radix' variable that identifies those chips to identify when we can't use the new guest entry path, and when we can't do nested virtualization. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-09KVM: PPC: Book3S HV: Add NO_HASH flag to GET_SMMU_INFO ioctl resultPaul Mackerras1-0/+4
This adds a KVM_PPC_NO_HASH flag to the flags field of the kvm_ppc_smmu_info struct, and arranges for it to be set when running as a nested hypervisor, as an unambiguous indication to userspace that HPT guests are not supported. Reporting the KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as indicating only that the new HPT features in ISA V3.0 are not supported, leaving it ambiguous whether pre-V3.0 HPT features are supported. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-09KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualizationPaul Mackerras1-9/+30
With this, userspace can enable a KVM-HV guest to run nested guests under it. The administrator can control whether any nested guests can be run; setting the "nested" module parameter to false prevents any guests becoming nested hypervisors (that is, any attempt to enable the nested capability on a guest will fail). Guests which are already nested hypervisors will continue to be so. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-09Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras1-55/+776
This merges in the "ppc-kvm" topic branch of the powerpc tree to get a series of commits that touch both general arch/powerpc code and KVM code. These commits will be merged both via the KVM tree and the powerpc tree. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-09KVM: PPC: Book3S HV: Allow HV module to load without hypervisor modePaul Mackerras1-4/+12
With this, the KVM-HV module can be loaded in a guest running under KVM-HV, and if the hypervisor supports nested virtualization, this guest can now act as a nested hypervisor and run nested guests. This also adds some checks to inform userspace that HPT guests are not supported by nested hypervisors (by returning false for the KVM_CAP_PPC_MMU_HASH_V3 capability), and to prevent userspace from configuring a guest to use HPT mode. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR registerPaul Mackerras1-0/+6
This adds a one-reg register identifier which can be used to read and set the virtual PTCR for the guest. This register identifies the address and size of the virtual partition table for the guest, which contains information about the nested guests under this guest. Migrating this value is the only extra requirement for migrating a guest which has nested guests (assuming of course that the destination host supports nested virtualization in the kvm-hv module). Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nestedPaul Mackerras1-12/+21
When running as a nested hypervisor, this avoids reading hypervisor privileged registers (specifically HFSCR, LPIDR and LPCR) at startup; instead reasonable default values are used. This also avoids writing LPIDR in the single-vcpu entry/exit path. Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init() since its only caller already checks this. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpuSuraj Jitindar Singh1-38/+63
This is only done at level 0, since only level 0 knows which physical CPU a vcpu is running on. This does for nested guests what L0 already did for its own guests, which is to flush the TLB on a pCPU when it goes to run a vCPU there, and there is another vCPU in the same VM which previously ran on this pCPU and has now started to run on another pCPU. This is to handle the situation where the other vCPU touched a mapping, moved to another pCPU and did a tlbiel (local-only tlbie) on that new pCPU and thus left behind a stale TLB entry on this pCPU. This introduces a limit on the the vcpu_token values used in the H_ENTER_NESTED hcall -- they must now be less than NR_CPUS. [paulus@ozlabs.org - made prev_cpu array be short[] to reduce memory consumption.] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcallSuraj Jitindar Singh1-0/+3
When running a nested (L2) guest the guest (L1) hypervisor will use the H_TLB_INVALIDATE hcall when it needs to change the partition scoped page tables or the partition table which it manages. It will use this hcall in the situations where it would use a partition-scoped tlbie instruction if it were running in hypervisor mode. The H_TLB_INVALIDATE hcall can invalidate different scopes: Invalidate TLB for a given target address: - This invalidates a single L2 -> L1 pte - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2 address space which is being invalidated. This is because a single L2 -> L1 pte may have been mapped with more than one pte in the L2 -> L0 page tables. Invalidate the entire TLB for a given LPID or for all LPIDs: - Invalidate the entire shadow_pgtable for a given nested guest, or for all nested guests. Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs: - We don't cache the PWC, so nothing to do. Invalidate the entire TLB, PWC and partition table for a given/all LPIDs: - Here we re-read the partition table entry and remove the nested state for any nested guest for which the first doubleword of the partition table entry is now zero. The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction word (of which only the RIC, PRS and R fields are used), the rS value (giving the lpid, where required) and the rB value (giving the IS, AP and EPN values). [paulus@ozlabs.org - adapted to having the partition table in guest memory, added the H_TLB_INVALIDATE implementation, removed tlbie instruction emulation, reworded the commit message.] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappingsSuraj Jitindar Singh1-0/+1
When a host (L0) page which is mapped into a (L1) guest is in turn mapped through to a nested (L2) guest we keep a reverse mapping (rmap) so that these mappings can be retrieved later. Whenever we create an entry in a shadow_pgtable for a nested guest we create a corresponding rmap entry and add it to the list for the L1 guest memslot at the index of the L1 guest page it maps. This means at the L1 guest memslot we end up with lists of rmaps. When we are notified of a host page being invalidated which has been mapped through to a (L1) guest, we can then walk the rmap list for that guest page, and find and invalidate all of the corresponding shadow_pgtable entries. In order to reduce memory consumption, we compress the information for each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits for the guest real page frame number -- which will fit in a single unsigned long. To avoid a scenario where a guest can trigger unbounded memory allocations, we scan the list when adding an entry to see if there is already an entry with the contents we need. This can occur, because we don't ever remove entries from the middle of a list. A struct nested guest rmap is a list pointer and an rmap entry; ---------------- | next pointer | ---------------- | rmap entry | ---------------- Thus the rmap pointer for each guest frame number in the memslot can be either NULL, a single entry, or a pointer to a list of nested rmap entries. gfn memslot rmap array ------------------------- 0 | NULL | (no rmap entry) ------------------------- 1 | single rmap entry | (rmap entry with low bit set) ------------------------- 2 | list head pointer | (list of rmap entries) ------------------------- The final entry always has the lowest bit set and is stored in the next pointer of the last list entry, or as a single rmap entry. With a list of rmap entries looking like; ----------------- ----------------- ------------------------- | list head ptr | ----> | next pointer | ----> | single rmap entry | ----------------- ----------------- ------------------------- | rmap entry | | rmap entry | ----------------- ------------------------- Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Handle hypercalls correctly when nestedPaul Mackerras1-0/+43
When we are running as a nested hypervisor, we use a hypercall to enter the guest rather than code in book3s_hv_rmhandlers.S. This means that the hypercall handlers listed in hcall_real_table never get called. There are some hypercalls that are handled there and not in kvmppc_pseries_do_hcall(), which therefore won't get processed for a nested guest. To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those hypercalls, with the following exceptions: - The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because we only support radix mode for nested guests. - H_CEDE has to be handled specially because the cede logic in kvmhv_run_single_vcpu assumes that it has been processed by the time that kvmhv_p9_guest_entry() returns. Therefore we put a special case for H_CEDE in kvmhv_p9_guest_entry(). For the XICS hypercalls, if real-mode processing is enabled, then the virtual-mode handlers assume that they are being called only to finish up the operation. Therefore we turn off the real-mode flag in the XICS code when running as a nested hypervisor. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisorPaul Mackerras1-1/+6
This adds code to call the H_IPI and H_EOI hypercalls when we are running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu feature) and we would otherwise access the XICS interrupt controller directly or via an OPAL call. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>