summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2018-10-09KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nestedPaul Mackerras2-16/+24
When running as a nested hypervisor, this avoids reading hypervisor privileged registers (specifically HFSCR, LPIDR and LPCR) at startup; instead reasonable default values are used. This also avoids writing LPIDR in the single-vcpu entry/exit path. Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init() since its only caller already checks this. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpuSuraj Jitindar Singh3-38/+71
This is only done at level 0, since only level 0 knows which physical CPU a vcpu is running on. This does for nested guests what L0 already did for its own guests, which is to flush the TLB on a pCPU when it goes to run a vCPU there, and there is another vCPU in the same VM which previously ran on this pCPU and has now started to run on another pCPU. This is to handle the situation where the other vCPU touched a mapping, moved to another pCPU and did a tlbiel (local-only tlbie) on that new pCPU and thus left behind a stale TLB entry on this pCPU. This introduces a limit on the the vcpu_token values used in the H_ENTER_NESTED hcall -- they must now be less than NR_CPUS. [paulus@ozlabs.org - made prev_cpu array be short[] to reduce memory consumption.] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nestedPaul Mackerras3-8/+57
This adds code to call the H_TLB_INVALIDATE hypercall when running as a guest, in the cases where we need to invalidate TLBs (or other MMU caches) as part of managing the mappings for a nested guest. Calling H_TLB_INVALIDATE lets the nested hypervisor inform the parent hypervisor about changes to partition-scoped page tables or the partition table without needing to do hypervisor-privileged tlbie instructions. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcallSuraj Jitindar Singh6-2/+212
When running a nested (L2) guest the guest (L1) hypervisor will use the H_TLB_INVALIDATE hcall when it needs to change the partition scoped page tables or the partition table which it manages. It will use this hcall in the situations where it would use a partition-scoped tlbie instruction if it were running in hypervisor mode. The H_TLB_INVALIDATE hcall can invalidate different scopes: Invalidate TLB for a given target address: - This invalidates a single L2 -> L1 pte - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2 address space which is being invalidated. This is because a single L2 -> L1 pte may have been mapped with more than one pte in the L2 -> L0 page tables. Invalidate the entire TLB for a given LPID or for all LPIDs: - Invalidate the entire shadow_pgtable for a given nested guest, or for all nested guests. Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs: - We don't cache the PWC, so nothing to do. Invalidate the entire TLB, PWC and partition table for a given/all LPIDs: - Here we re-read the partition table entry and remove the nested state for any nested guest for which the first doubleword of the partition table entry is now zero. The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction word (of which only the RIC, PRS and R fields are used), the rS value (giving the lpid, where required) and the rB value (giving the IS, AP and EPN values). [paulus@ozlabs.org - adapted to having the partition table in guest memory, added the H_TLB_INVALIDATE implementation, removed tlbie instruction emulation, reworded the commit message.] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappingsSuraj Jitindar Singh5-15/+240
When a host (L0) page which is mapped into a (L1) guest is in turn mapped through to a nested (L2) guest we keep a reverse mapping (rmap) so that these mappings can be retrieved later. Whenever we create an entry in a shadow_pgtable for a nested guest we create a corresponding rmap entry and add it to the list for the L1 guest memslot at the index of the L1 guest page it maps. This means at the L1 guest memslot we end up with lists of rmaps. When we are notified of a host page being invalidated which has been mapped through to a (L1) guest, we can then walk the rmap list for that guest page, and find and invalidate all of the corresponding shadow_pgtable entries. In order to reduce memory consumption, we compress the information for each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits for the guest real page frame number -- which will fit in a single unsigned long. To avoid a scenario where a guest can trigger unbounded memory allocations, we scan the list when adding an entry to see if there is already an entry with the contents we need. This can occur, because we don't ever remove entries from the middle of a list. A struct nested guest rmap is a list pointer and an rmap entry; ---------------- | next pointer | ---------------- | rmap entry | ---------------- Thus the rmap pointer for each guest frame number in the memslot can be either NULL, a single entry, or a pointer to a list of nested rmap entries. gfn memslot rmap array ------------------------- 0 | NULL | (no rmap entry) ------------------------- 1 | single rmap entry | (rmap entry with low bit set) ------------------------- 2 | list head pointer | (list of rmap entries) ------------------------- The final entry always has the lowest bit set and is stored in the next pointer of the last list entry, or as a single rmap entry. With a list of rmap entries looking like; ----------------- ----------------- ------------------------- | list head ptr | ----> | next pointer | ----> | single rmap entry | ----------------- ----------------- ------------------------- | rmap entry | | rmap entry | ----------------- ------------------------- Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Handle page fault for a nested guestSuraj Jitindar Singh7-86/+473
Consider a normal (L1) guest running under the main hypervisor (L0), and then a nested guest (L2) running under the L1 guest which is acting as a nested hypervisor. L0 has page tables to map the address space for L1 providing the translation from L1 real address -> L0 real address; L1 | | (L1 -> L0) | ----> L0 There are also page tables in L1 used to map the address space for L2 providing the translation from L2 real address -> L1 read address. Since the hardware can only walk a single level of page table, we need to maintain in L0 a "shadow_pgtable" for L2 which provides the translation from L2 real address -> L0 real address. Which looks like; L2 L2 | | | (L2 -> L1) | | | ----> L1 | (L2 -> L0) | | | (L1 -> L0) | | | ----> L0 --------> L0 When a page fault occurs while running a nested (L2) guest we need to insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To do this we need to: 1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and provide a page fault to L1 if this mapping doesn't exist. 2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address, or try to insert a pte for that mapping if it doesn't exist. 3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable Once this mapping exists we can take rc faults when hardware is unable to automatically set the reference and change bits in the pte. On these we need to: 1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect the fault down to L1. 2. Set the rc bits in the L1 -> L0 pte which corresponds to the same host page. 3. Set the rc bits in the L2 -> L0 pte. As we reuse a large number of functions in book3s_64_mmu_radix.c for this we also needed to refactor a number of these functions to take an lpid parameter so that the correct lpid is used for tlb invalidations. The functionality however has remained the same. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Handle hypercalls correctly when nestedPaul Mackerras4-1/+51
When we are running as a nested hypervisor, we use a hypercall to enter the guest rather than code in book3s_hv_rmhandlers.S. This means that the hypercall handlers listed in hcall_real_table never get called. There are some hypercalls that are handled there and not in kvmppc_pseries_do_hcall(), which therefore won't get processed for a nested guest. To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those hypercalls, with the following exceptions: - The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because we only support radix mode for nested guests. - H_CEDE has to be handled specially because the cede logic in kvmhv_run_single_vcpu assumes that it has been processed by the time that kvmhv_p9_guest_entry() returns. Therefore we put a special case for H_CEDE in kvmhv_p9_guest_entry(). For the XICS hypercalls, if real-mode processing is enabled, then the virtual-mode handlers assume that they are being called only to finish up the operation. Therefore we turn off the real-mode flag in the XICS code when running as a nested hypervisor. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisorPaul Mackerras3-9/+50
This adds code to call the H_IPI and H_EOI hypercalls when we are running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu feature) and we would otherwise access the XICS interrupt controller directly or via an OPAL call. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Nested guest entry via hypercallPaul Mackerras7-30/+471
This adds a new hypercall, H_ENTER_NESTED, which is used by a nested hypervisor to enter one of its nested guests. The hypercall supplies register values in two structs. Those values are copied by the level 0 (L0) hypervisor (the one which is running in hypervisor mode) into the vcpu struct of the L1 guest, and then the guest is run until an interrupt or error occurs which needs to be reported to L1 via the hypercall return value. Currently this assumes that the L0 and L1 hypervisors are the same endianness, and the structs passed as arguments are in native endianness. If they are of different endianness, the version number check will fail and the hcall will be rejected. Nested hypervisors do not support indep_threads_mode=N, so this adds code to print a warning message if the administrator has set indep_threads_mode=N, and treat it as Y. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualizationPaul Mackerras8-7/+384
This starts the process of adding the code to support nested HV-style virtualization. It defines a new H_SET_PARTITION_TABLE hypercall which a nested hypervisor can use to set the base address and size of a partition table in its memory (analogous to the PTCR register). On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE hypercall from the guest is handled by code that saves the virtual PTCR value for the guest. This also adds code for creating and destroying nested guests and for reading the partition table entry for a nested guest from L1 memory. Each nested guest has its own shadow LPID value, different in general from the LPID value used by the nested hypervisor to refer to it. The shadow LPID value is allocated at nested guest creation time. Nested hypervisor functionality is only available for a radix guest, which therefore means a radix host on a POWER9 (or later) processor. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Use kvmppc_unmap_pte() in kvm_unmap_radix()Paul Mackerras1-20/+13
kvmppc_unmap_pte() does a sequence of operations that are open-coded in kvm_unmap_radix(). This extends kvmppc_unmap_pte() a little so that it can be used by kvm_unmap_radix(), and makes kvm_unmap_radix() call it. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Refactor radix page fault handlerSuraj Jitindar Singh1-87/+123
The radix page fault handler accounts for all cases, including just needing to insert a pte. This breaks it up into separate functions for the two main cases; setting rc and inserting a pte. This allows us to make the setting of rc and inserting of a pte generic for any pgtable, not specific to the one for this guest. [paulus@ozlabs.org - reduced diffs from previous code] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Make kvmppc_mmu_radix_xlate process/partition table ↵Suraj Jitindar Singh2-34/+78
agnostic kvmppc_mmu_radix_xlate() is used to translate an effective address through the process tables. The process table and partition tables have identical layout. Exploit this fact to make the kvmppc_mmu_radix_xlate() function able to translate either an effective address through the process tables or a guest real address through the partition tables. [paulus@ozlabs.org - reduced diffs from previous code] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Clear partition table entry on vm teardownSuraj Jitindar Singh1-1/+7
When destroying a VM we return the LPID to the pool, however we never zero the partition table entry. This is instead done when we reallocate the LPID. Zero the partition table entry on VM teardown before returning the LPID to the pool. This means if we were running as a nested hypervisor the real hypervisor could use this to determine when it can free resources. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu structPaul Mackerras13-32/+30
When the 'regs' field was added to struct kvm_vcpu_arch, the code was changed to use several of the fields inside regs (e.g., gpr, lr, etc.) but not the ccr field, because the ccr field in struct pt_regs is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is only 32 bits. This changes the code to use the regs.ccr field instead of cr, and changes the assembly code on 64-bit platforms to use 64-bit loads and stores instead of 32-bit ones. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappingsPaul Mackerras4-0/+183
This adds a file called 'radix' in the debugfs directory for the guest, which when read gives all of the valid leaf PTEs in the partition-scoped radix tree for a radix guest, in human-readable format. It is analogous to the existing 'htab' file which dumps the HPT entries for a HPT guest. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Handle hypervisor instruction faults betterPaul Mackerras2-1/+5
Currently the code for handling hypervisor instruction page faults passes 0 for the flags indicating the type of fault, which is OK in the usual case that the page is not mapped in the partition-scoped page tables. However, there are other causes for hypervisor instruction page faults, such as not being to update a reference (R) or change (C) bit. The cause is indicated in bits in HSRR1, including a bit which indicates that the fault is due to not being able to write to a page (for example to update an R or C bit). Not handling these other kinds of faults correctly can lead to a loop of continual faults without forward progress in the guest. In order to handle these faults better, this patch constructs a "DSISR-like" value from the bits which DSISR and SRR1 (for a HISI) have in common, and passes it to kvmppc_book3s_hv_page_fault() so that it knows what caused the fault. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guestsPaul Mackerras6-4/+593
This creates an alternative guest entry/exit path which is used for radix guests on POWER9 systems when we have indep_threads_mode=Y. In these circumstances there is exactly one vcpu per vcore and there is no coordination required between vcpus or vcores; the vcpu can enter the guest without needing to synchronize with anything else. The new fast path is implemented almost entirely in C in book3s_hv.c and runs with the MMU on until the guest is entered. On guest exit we use the existing path until the point where we are committed to exiting the guest (as distinct from handling an interrupt in the low-level code and returning to the guest) and we have pulled the guest context from the XIVE. At that point we check a flag in the stack frame to see whether we came in via the old path and the new path; if we came in via the new path then we go back to C code to do the rest of the process of saving the guest context and restoring the host context. The C code is split into separate functions for handling the OS-accessible state and the hypervisor state, with the idea that the latter can be replaced by a hypercall when we implement nested virtualization. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [mpe: Fix CONFIG_ALTIVEC=n build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlockedPaul Mackerras1-9/+10
Currently kvmppc_handle_exit_hv() is called with the vcore lock held because it is called within a for_each_runnable_thread loop. However, we already unlock the vcore within kvmppc_handle_exit_hv() under certain circumstances, and this is safe because (a) any vcpus that become runnable and are added to the runnable set by kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually run in the guest (because the vcore state is VCORE_EXITING), and (b) for_each_runnable_thread is safe against addition or removal of vcpus from the runnable set. Therefore, in order to simplify things for following patches, let's drop the vcore lock in the for_each_runnable_thread loop, so kvmppc_handle_exit_hv() gets called without the vcore lock held. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S: Rework TM save/restore code and make it C-callablePaul Mackerras3-140/+169
This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm which allows the caller to indicate whether it wants the nonvolatile register state to be preserved across the call, as required by the C calling conventions. This parameter being non-zero also causes the MSR bits that enable TM, FP, VMX and VSX to be preserved. The condition register and DSCR are now always preserved. With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called from C code provided the 3rd parameter is non-zero. So that these functions can be called from modules, they now include code to set the TOC pointer (r2) on entry, as they can call other built-in C functions which will assume the TOC to have been set. Also, the fake suspend code in kvmppc_save_tm_hv is modified here to assume that treclaim in fake-suspend state does not modify any registers, which is the case on POWER9. This enables the code to be simplified quite a bit. _kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with this change, since they now only need to save and restore TAR and pass 1 for the 3rd argument to __kvmppc_{save,restore}_tm. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Simplify real-mode interrupt handlingPaul Mackerras2-109/+119
This streamlines the first part of the code that handles a hypervisor interrupt that occurred in the guest. With this, all of the real-mode handling that occurs is done before the "guest_exit_cont" label; once we get to that label we are committed to exiting to host virtual mode. Thus the machine check and HMI real-mode handling is moved before that label. Also, the code to handle external interrupts is moved out of line, as is the code that calls kvmppc_realmode_hmi_handler(). Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Extract PMU save/restore operations as C-callable functionsPaul Mackerras3-210/+253
This pulls out the assembler code that is responsible for saving and restoring the PMU state for the host and guest into separate functions so they can be used from an alternate entry path. The calling convention is made compatible with C. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C codePaul Mackerras4-55/+67
This is based on a patch by Suraj Jitindar Singh. This moves the code in book3s_hv_rmhandlers.S that generates an external, decrementer or privileged doorbell interrupt just before entering the guest to C code in book3s_hv_builtin.c. This is to make future maintenance and modification easier. The algorithm expressed in the C code is almost identical to the previous algorithm. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S HV: Remove left-over code in XICS-on-XIVE emulationPaul Mackerras1-8/+0
This removes code that clears the external interrupt pending bit in the pending_exceptions bitmap. This is left over from an earlier iteration of the code where this bit was set when an escalation interrupt arrived in order to wake the vcpu from cede. Currently we set the vcpu->arch.irq_pending flag instead for this purpose. Therefore there is no need to do anything with the pending_exceptions bitmap. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Book3S: Simplify external interrupt handlingPaul Mackerras10-29/+44
Currently we use two bits in the vcpu pending_exceptions bitmap to indicate that an external interrupt is pending for the guest, one for "one-shot" interrupts that are cleared when delivered, and one for interrupts that persist until cleared by an explicit action of the OS (e.g. an acknowledge to an interrupt controller). The BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts. In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our Book3S platforms generally, and pseries in particular, expect external interrupt requests to persist until they are acknowledged at the interrupt controller. That combined with the confusion introduced by having two bits for what is essentially the same thing makes it attractive to simplify things by only using one bit. This patch does that. With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default it has the semantics of a persisting interrupt. In order to avoid breaking the ABI, we introduce a new "external_oneshot" flag which preserves the behaviour of the KVM_INTERRUPT ioctl with the KVM_INTERRUPT_SET argument. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09powerpc: Turn off CPU_FTR_P9_TM_HV_ASSIST in non-hypervisor modePaul Mackerras1-2/+2
When doing nested virtualization, it is only necessary to do the transactional memory hypervisor assist at level 0, that is, when we are in hypervisor mode. Nested hypervisors can just use the TM facilities as architected. Therefore we should clear the CPU_FTR_P9_TM_HV_ASSIST bit when we are not in hypervisor mode, along with the CPU_FTR_HVMODE bit. Doing this will not change anything at this stage because the only code that tests CPU_FTR_P9_TM_HV_ASSIST is in HV KVM, which currently can only be used when when CPU_FTR_HVMODE is set. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Remove redundand permission bits removalAlexey Kardashevskiy3-22/+14
The kvmppc_gpa_to_ua() helper itself takes care of the permission bits in the TCE and yet every single caller removes them. This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs (which are GPAs + TCE permission bits) to make the callers simpler. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Propagate errors to the guest when failed instead of ignoringAlexey Kardashevskiy2-28/+13
At the moment if the PUT_TCE{_INDIRECT} handlers fail to update the hardware tables, we print a warning once, clear the entry and continue. This is so as at the time the assumption was that if a VFIO device is hotplugged into the guest, and the userspace replays virtual DMA mappings (i.e. TCEs) to the hardware tables and if this fails, then there is nothing useful we can do about it. However the assumption is not valid as these handlers are not called for TCE replay (VFIO ioctl interface is used for that) and these handlers are for new TCEs. This returns an error to the guest if there is a request which cannot be processed. By now the only possible failure must be H_TOO_HARD. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Validate TCEs against preregistered memory page sizesAlexey Kardashevskiy3-9/+60
The userspace can request an arbitrary supported page size for a DMA window and this works fine as long as the mapped memory is backed with the pages of the same or bigger size; if this is not the case, mm_iommu_ua_to_hpa{_rm}() fail and tables do not populated with dangerously incorrect TCEs. However since it is quite easy to misconfigure the KVM and we do not do reverts to all changes made to TCE tables if an error happens in a middle, we better do the acceptable page size validation before we even touch the tables. This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes against the preregistered memory page sizes. Since the new check uses real/virtual mode helpers, this renames kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode case and mirrors it for the virtual mode under the old name. The real mode handler is not used for the virtual mode as: 1. it uses _lockless() list traversing primitives instead of RCU; 2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which virtual mode does not have to use and since on POWER9+radix only virtual mode handlers actually work, we do not want to slow down that path even a bit. This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators are static now. From now on the attempts on mapping IOMMU pages bigger than allowed will result in KVM exit. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [mpe: Fix KVM_HV=n build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-02KVM: PPC: Inform the userspace about TCE update failuresAlexey Kardashevskiy2-7/+7
We return H_TOO_HARD from TCE update handlers when we think that the next handler (realmode -> virtual mode -> user mode) has a chance to handle the request; H_HARDWARE/H_CLOSED otherwise. This changes the handlers to return H_TOO_HARD on every error giving the userspace an opportunity to handle any request or at least log them all. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-02KVM: PPC: Validate all tces before updating tablesAlexey Kardashevskiy2-0/+22
The KVM TCE handlers are written in a way so they fail when either something went horribly wrong or the userspace did some obvious mistake such as passing a misaligned address. We are going to enhance the TCE checker to fail on attempts to map bigger IOMMU page than the underlying pinned memory so let's valitate TCE beforehand. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-02Merge branch 'kvm-ppc-fixes' of paulus/powerpc into topic/ppc-kvmMichael Ellerman8-143/+99
Some commits we'd like to share between the powerpc and kvm-ppc tree for next have dependencies on commits that went into 4.19 via the kvm-ppc-fixes branch and weren't merged before 4.19-rc3, which is our base commit. So merge the kvm-ppc-fixes branch into topic/ppc-kvm.
2018-09-12KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping sizeNicholas Piggin1-54/+37
THP paths can defer splitting compound pages until after the actual remap and TLB flushes to split a huge PMD/PUD. This causes radix partition scope page table mappings to get out of synch with the host qemu page table mappings. This results in random memory corruption in the guest when running with THP. The easiest way to reproduce is use KVM balloon to free up a lot of memory in the guest and then shrink the balloon to give the memory back, while some work is being done in the guest. Cc: David Gibson <david@gibson.dropbear.id.au> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: kvm-ppc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-09-12KVM: PPC: Avoid marking DMA-mapped pages dirty in real modeAlexey Kardashevskiy7-89/+62
At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm() which in turn reads the old TCE and if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct and setting dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in the real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include: - use the lowest bit of the cached host phys address to carry the dirty bit; - mark pages dirty when they are unpinned which happens when the preregistered memory is released which always happens in virtual mode; - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit; - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper; - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems. This removes realmode_pfn_to_page() as it is not used anymore. While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for the real mode only and modules cannot call it anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-09-09Linux 4.19-rc3v4.19-rc3Linus Torvalds1-1/+1
2018-09-09Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds14-59/+87
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of fixes for x86: - Prevent multiplication result truncation on 32bit. Introduced with the early timestamp reworrk. - Ensure microcode revision storage to be consistent under all circumstances - Prevent write tearing of PTEs - Prevent confusion of user and kernel reegisters when dumping fatal signals verbosely - Make an error return value in a failure path of the vector allocation negative. Returning EINVAL might the caller assume success and causes further wreckage. - A trivial kernel doc warning fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Use WRITE_ONCE() when setting PTEs x86/apic/vector: Make error return value negative x86/process: Don't mix user/kernel regs in 64bit __show_regs() x86/tsc: Prevent result truncation on 32bit x86: Fix kernel-doc atomic.h warnings x86/microcode: Update the new microcode revision unconditionally x86/microcode: Make sure boot_cpu_data.microcode is up-to-date
2018-09-09Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2-12/+32
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping fixes from Thomas Gleixner: "Two fixes for timekeeping: - Revert to the previous kthread based update, which is unfortunately required due to lock ordering issues. The removal caused boot failures on old Core2 machines. Add a proper comment why the thread needs to stay to prevent accidental removal in the future. - Fix a silly typo in a function declaration" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Revert "Remove kthread" timekeeping: Fix declaration of read_persistent_wall_and_boot_offset()
2018-09-09Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irqchip fix from Thomas Gleixner: "A single fix to prevent allocating excessive memory in the GIC/ITS driver. While the subject of the patch might suggest otherwise this is a real fix as some SoCs exceed the memory allocation limits and fail to boot" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Cap lpi_id_bits to reduce memory footprint
2018-09-09Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds1-5/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cpu hotplug fixes from Thomas Gleixner: "Two fixes for the hotplug state machine code: - Move the misplaces smb() in the hotplug thread function to the proper place, otherwise a half update control struct could be observed - Prevent state corruption on error rollback, which causes the state to advance by one and as a consequence skip it in the bringup sequence" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Prevent state corruption on error rollback cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()
2018-09-09Merge tag 'for_linus' of ↵Linus Torvalds3-5/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random Pull random driver fix from Ted Ts'o: "Fix things so the choice of whether or not to trust RDRAND to initialize the CRNG is configurable via the boot option random.trust_cpu={on,off}" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: make CPU trust a boot parameter
2018-09-09Merge tag 'kbuild-fixes-v4.19' of ↵Linus Torvalds10-36/+47
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - make setlocalversion more robust about -dirty check - loosen the pkg-config requirement for Kconfig - change missing depmod to a warning from an error - warn modules_install when System.map is missing * tag 'kbuild-fixes-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: modules_install: warn when missing System.map file kbuild: make missing $DEPMOD a Warning instead of an Error kconfig: do not require pkg-config on make {menu,n}config kconfig: remove a spurious self-assignment scripts/setlocalversion: git: Make -dirty check more robust
2018-09-09kbuild: modules_install: warn when missing System.map fileRandy Dunlap1-0/+1
If there is no System.map file for "make modules_install", scripts/depmod.sh will silently exit with success, having done nothing. Since this is an unexpected situation, change it to report a Warning for the missing file. The behavior is not changed except for the Warning message. The (previous) silent success and new Warning can be reproduced by: $ make mrproper; make defconfig $ make modules; make modules_install and since System.map is produced by "make vmlinux", the steps above omit producing the System.map file. Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-09-08Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds21-134/+204
Pull KVM fixes from Radim Krčmář: "ARM: - Fix a VFP corruption in 32-bit guest - Add missing cache invalidation for CoW pages - Two small cleanups s390: - Fallout from the hugetlbfs support: pfmf interpretion and locking - VSIE: fix keywrapping for nested guests PPC: - Fix a bug where pages might not get marked dirty, causing guest memory corruption on migration - Fix a bug causing reads from guest memory to use the wrong guest real address for very large HPT guests (>256G of memory), leading to failures in instruction emulation. x86: - Fix out of bound access from malicious pv ipi hypercalls (introduced in rc1) - Fix delivery of pending interrupts when entering a nested guest, preventing arbitrarily late injection - Sanitize kvm_stat output after destroying a guest - Fix infinite loop when emulating a nested guest page fault and improve the surrounding emulation code - Two minor cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits) KVM: LAPIC: Fix pv ipis out-of-bounds access KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2 arm64: KVM: Remove pgd_lock KVM: Remove obsolete kvm_unmap_hva notifier backend arm64: KVM: Only force FPEXC32_EL2.EN if trapping FPSIMD KVM: arm/arm64: Clean dcache to PoC when changing PTE due to CoW KVM: s390: Properly lock mm context allow_gmap_hpage_1m setting KVM: s390: vsie: copy wrapping keys to right place KVM: s390: Fix pfmf and conditional skey emulation tools/kvm_stat: re-animate display of dead guests tools/kvm_stat: indicate dead guests as such tools/kvm_stat: handle guest removals more gracefully tools/kvm_stat: don't reset stats when setting PID filter for debugfs tools/kvm_stat: fix updates for dead guests tools/kvm_stat: fix handling of invalid paths in debugfs provider tools/kvm_stat: fix python3 issues KVM: x86: Unexport x86_emulate_instruction() KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction() KVM: x86: Do not re-{try,execute} after failed emulation in L2 KVM: x86: Default to not allowing emulation retry in kvm_mmu_page_fault ...
2018-09-08Merge tag 'armsoc-fixes' of ↵Linus Torvalds4-2/+11
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Olof Johansson: "A few more fixes who have trickled in: - MMC bus width fixup for some Allwinner platforms - Fix for NULL deref in ti-aemif when no platform data is passed in - Fix div by 0 in SCMI code - Add a missing module alias in a new RPi driver" * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: memory: ti-aemif: fix a potential NULL-pointer dereference firmware: arm_scmi: fix divide by zero when sustained_perf_level is zero hwmon: rpi: add module alias to raspberrypi-hwmon arm64: allwinner: dts: h6: fix Pine H64 MMC bus width
2018-09-08Merge tag 'sunxi-fixes-for-4.19' of ↵Olof Johansson1-0/+2
https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into fixes Allwinner fixes for 4.19 Just one fix for H6 mmc on the Pine H64: the mmc bus width was missing from the device tree. This was added in 4.19-rc1. * tag 'sunxi-fixes-for-4.19' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux: arm64: allwinner: dts: h6: fix Pine H64 MMC bus width Signed-off-by: Olof Johansson <olof@lixom.net>
2018-09-08x86/mm: Use WRITE_ONCE() when setting PTEsNadav Amit3-15/+15
When page-table entries are set, the compiler might optimize their assignment by using multiple instructions to set the PTE. This might turn into a security hazard if the user somehow manages to use the interim PTE. L1TF does not make our lives easier, making even an interim non-present PTE a security hazard. Using WRITE_ONCE() to set PTEs and friends should prevent this potential security hazard. I skimmed the differences in the binary with and without this patch. The differences are (obviously) greater when CONFIG_PARAVIRT=n as more code optimizations are possible. For better and worse, the impact on the binary with this patch is pretty small. Skimming the code did not cause anything to jump out as a security hazard, but it seems that at least move_soft_dirty_pte() caused set_pte_at() to use multiple writes. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180902181451.80520-1-namit@vmware.com
2018-09-08x86/apic/vector: Make error return value negativeThomas Gleixner1-1/+1
activate_managed() returns EINVAL instead of -EINVAL in case of error. While this is unlikely to happen, the positive return value would cause further malfunction at the call site. Fixes: 2db1f959d9dc ("x86/vector: Handle managed interrupts proper") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
2018-09-07Merge branch 'i2c/for-current' of ↵Linus Torvalds7-14/+16
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: - bugfixes for uniphier, i801, and xiic drivers - ID removal (never produced) for imx - one MAINTAINER addition * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: xiic: Record xilinx i2c with Zynq fragment i2c: xiic: Make the start and the byte count write atomic i2c: i801: fix DNV's SMBCTRL register offset i2c: imx-lpi2c: Remove mx8dv compatible entry dt-bindings: imx-lpi2c: Remove mx8dv compatible entry i2c: uniphier-f: issue STOP only for last message or I2C_M_STOP i2c: uniphier: issue STOP only for last message or I2C_M_STOP
2018-09-07Merge tag 'arc-4.19-rc3' of ↵Linus Torvalds27-113/+154
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC updates from Vineet Gupta: - Fix for atomic_fetch_#op [Will Deacon] - Enable per device IOC [Eugeniy Paltsev] - Remove redundant gcc version checks [Masahiro Yamada] - Miscll platform config/DT updates [Alexey Brodkin] * tag 'arc-4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: don't check for HIGHMEM pages in arch_dma_alloc ARC: IOC: panic if both IOC and ZONE_HIGHMEM enabled ARC: dma [IOC] Enable per device io coherency ARC: dma [IOC]: mark DMA devices connected as dma-coherent ARC: atomics: unbork atomic_fetch_##op() arc: remove redundant GCC version checks ARC: sort Kconfig ARC: cleanup show_faulting_vma() ARC: [plat-axs*]: Enable SWAP ARC: [plat-axs*/plat-hsdk]: Allow U-Boot to pass MAC-address to the kernel ARC: configs: cleanup
2018-09-07afs: Fix cell specification to permit an empty address listDavid Howells1-8/+7
Fix the cell specification mechanism to allow cells to be pre-created without having to specify at least one address (the addresses will be upcalled for). This allows the cell information preload service to avoid the need to issue loads of DNS lookups during boot to get the addresses for each cell (500+ lookups for the 'standard' cell list[*]). The lookups can be done later as each cell is accessed through the filesystem. Also remove the print statement that prints a line every time a new cell is added. [*] There are 144 cells in the list. Each cell is first looked up for an SRV record, and if that fails, for an AFSDB record. These get a list of server names, each of which then has to be looked up to get the addresses for that server. E.g.: dig srv _afs3-vlserver._udp.grand.central.org Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>