From 1a9167a214f560a23c5050ce6dfebae489528f0d Mon Sep 17 00:00:00 2001 From: Fabiano Rosas Date: Wed, 19 Jun 2019 13:01:27 -0300 Subject: KVM: PPC: Report single stepping capability When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request the next instruction to be single stepped via the KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure. This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order to inform userspace about the state of single stepping support. We currently don't have support for guest single stepping implemented in Book3S HV so the capability is only present for Book3S PR and BookE. Signed-off-by: Fabiano Rosas Signed-off-by: Paul Mackerras --- Documentation/virt/kvm/api.txt | 3 +++ 1 file changed, 3 insertions(+) (limited to 'Documentation') diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt index 4833904d32a5..f94d06a12d20 100644 --- a/Documentation/virt/kvm/api.txt +++ b/Documentation/virt/kvm/api.txt @@ -2982,6 +2982,9 @@ can be determined by querying the KVM_CAP_GUEST_DEBUG_HW_BPS and KVM_CAP_GUEST_DEBUG_HW_WPS capabilities which return a positive number indicating the number of supported registers. +For ppc, the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability indicates whether +the single-step debug event (KVM_GUESTDBG_SINGLESTEP) is supported. + When debug events exit the main run loop with the reason KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run structure containing architecture specific debug information. -- cgit v1.2.3 From c726200dd106d4c58a281eea7159b8ba28a4ab34 Mon Sep 17 00:00:00 2001 From: Christoffer Dall Date: Fri, 11 Oct 2019 13:07:05 +0200 Subject: KVM: arm/arm64: Allow reporting non-ISV data aborts to userspace For a long time, if a guest accessed memory outside of a memslot using any of the load/store instructions in the architecture which doesn't supply decoding information in the ESR_EL2 (the ISV bit is not set), the kernel would print the following message and terminate the VM as a result of returning -ENOSYS to userspace: load/store instruction decoding not implemented The reason behind this message is that KVM assumes that all accesses outside a memslot is an MMIO access which should be handled by userspace, and we originally expected to eventually implement some sort of decoding of load/store instructions where the ISV bit was not set. However, it turns out that many of the instructions which don't provide decoding information on abort are not safe to use for MMIO accesses, and the remaining few that would potentially make sense to use on MMIO accesses, such as those with register writeback, are not used in practice. It also turns out that fetching an instruction from guest memory can be a pretty horrible affair, involving stopping all CPUs on SMP systems, handling multiple corner cases of address translation in software, and more. It doesn't appear likely that we'll ever implement this in the kernel. What is much more common is that a user has misconfigured his/her guest and is actually not accessing an MMIO region, but just hitting some random hole in the IPA space. In this scenario, the error message above is almost misleading and has led to a great deal of confusion over the years. It is, nevertheless, ABI to userspace, and we therefore need to introduce a new capability that userspace explicitly enables to change behavior. This patch introduces KVM_CAP_ARM_NISV_TO_USER (NISV meaning Non-ISV) which does exactly that, and introduces a new exit reason to report the event to userspace. User space can then emulate an exception to the guest, restart the guest, suspend the guest, or take any other appropriate action as per the policy of the running system. Reported-by: Heinrich Schuchardt Signed-off-by: Christoffer Dall Reviewed-by: Alexander Graf Signed-off-by: Marc Zyngier --- Documentation/virt/kvm/api.txt | 33 +++++++++++++++++++++++++++++++++ arch/arm/include/asm/kvm_arm.h | 1 + arch/arm/include/asm/kvm_emulate.h | 5 +++++ arch/arm/include/asm/kvm_host.h | 8 ++++++++ arch/arm64/include/asm/kvm_emulate.h | 5 +++++ arch/arm64/include/asm/kvm_host.h | 8 ++++++++ include/uapi/linux/kvm.h | 7 +++++++ virt/kvm/arm/arm.c | 21 +++++++++++++++++++++ virt/kvm/arm/mmio.c | 9 ++++++++- 9 files changed, 96 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt index 4833904d32a5..7403f15657c2 100644 --- a/Documentation/virt/kvm/api.txt +++ b/Documentation/virt/kvm/api.txt @@ -4468,6 +4468,39 @@ Hyper-V SynIC state change. Notification is used to remap SynIC event/message pages and to enable/disable SynIC messages/events processing in userspace. + /* KVM_EXIT_ARM_NISV */ + struct { + __u64 esr_iss; + __u64 fault_ipa; + } arm_nisv; + +Used on arm and arm64 systems. If a guest accesses memory not in a memslot, +KVM will typically return to userspace and ask it to do MMIO emulation on its +behalf. However, for certain classes of instructions, no instruction decode +(direction, length of memory access) is provided, and fetching and decoding +the instruction from the VM is overly complicated to live in the kernel. + +Historically, when this situation occurred, KVM would print a warning and kill +the VM. KVM assumed that if the guest accessed non-memslot memory, it was +trying to do I/O, which just couldn't be emulated, and the warning message was +phrased accordingly. However, what happened more often was that a guest bug +caused access outside the guest memory areas which should lead to a more +meaningful warning message and an external abort in the guest, if the access +did not fall within an I/O window. + +Userspace implementations can query for KVM_CAP_ARM_NISV_TO_USER, and enable +this capability at VM creation. Once this is done, these types of errors will +instead return to userspace with KVM_EXIT_ARM_NISV, with the valid bits from +the HSR (arm) and ESR_EL2 (arm64) in the esr_iss field, and the faulting IPA +in the fault_ipa field. Userspace can either fix up the access if it's +actually an I/O access by decoding the instruction from guest memory (if it's +very brave) and continue executing the guest, or it can decide to suspend, +dump, or restart the guest. + +Note that KVM does not skip the faulting instruction as it does for +KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state +if it decides to decode and emulate the instruction. + /* Fix the size of the union. */ char padding[256]; }; diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h index 0125aa059d5b..9c04bd810d07 100644 --- a/arch/arm/include/asm/kvm_arm.h +++ b/arch/arm/include/asm/kvm_arm.h @@ -162,6 +162,7 @@ #define HSR_ISV (_AC(1, UL) << HSR_ISV_SHIFT) #define HSR_SRT_SHIFT (16) #define HSR_SRT_MASK (0xf << HSR_SRT_SHIFT) +#define HSR_CM (1 << 8) #define HSR_FSC (0x3f) #define HSR_FSC_TYPE (0x3c) #define HSR_SSE (1 << 21) diff --git a/arch/arm/include/asm/kvm_emulate.h b/arch/arm/include/asm/kvm_emulate.h index 40002416efec..e8ef349c04b4 100644 --- a/arch/arm/include/asm/kvm_emulate.h +++ b/arch/arm/include/asm/kvm_emulate.h @@ -167,6 +167,11 @@ static inline bool kvm_vcpu_dabt_isvalid(struct kvm_vcpu *vcpu) return kvm_vcpu_get_hsr(vcpu) & HSR_ISV; } +static inline unsigned long kvm_vcpu_dabt_iss_nisv_sanitized(const struct kvm_vcpu *vcpu) +{ + return kvm_vcpu_get_hsr(vcpu) & (HSR_CM | HSR_WNR | HSR_FSC); +} + static inline bool kvm_vcpu_dabt_iswrite(struct kvm_vcpu *vcpu) { return kvm_vcpu_get_hsr(vcpu) & HSR_WNR; diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 8a37c8e89777..19a92c49039c 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -76,6 +76,14 @@ struct kvm_arch { /* Mandated version of PSCI */ u32 psci_version; + + /* + * If we encounter a data abort without valid instruction syndrome + * information, report this to user space. User space can (and + * should) opt in to this feature if KVM_CAP_ARM_NISV_TO_USER is + * supported. + */ + bool return_nisv_io_abort_to_user; }; #define KVM_NR_MEM_OBJS 40 diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h index d69c1efc63e7..a3c967988e1d 100644 --- a/arch/arm64/include/asm/kvm_emulate.h +++ b/arch/arm64/include/asm/kvm_emulate.h @@ -258,6 +258,11 @@ static inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu) return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV); } +static inline unsigned long kvm_vcpu_dabt_iss_nisv_sanitized(const struct kvm_vcpu *vcpu) +{ + return kvm_vcpu_get_hsr(vcpu) & (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC); +} + static inline bool kvm_vcpu_dabt_issext(const struct kvm_vcpu *vcpu) { return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SSE); diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index f656169db8c3..019bc560edc1 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -83,6 +83,14 @@ struct kvm_arch { /* Mandated version of PSCI */ u32 psci_version; + + /* + * If we encounter a data abort without valid instruction syndrome + * information, report this to user space. User space can (and + * should) opt in to this feature if KVM_CAP_ARM_NISV_TO_USER is + * supported. + */ + bool return_nisv_io_abort_to_user; }; #define KVM_NR_MEM_OBJS 40 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 52641d8ca9e8..7336ee8d98d7 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -235,6 +235,7 @@ struct kvm_hyperv_exit { #define KVM_EXIT_S390_STSI 25 #define KVM_EXIT_IOAPIC_EOI 26 #define KVM_EXIT_HYPERV 27 +#define KVM_EXIT_ARM_NISV 28 /* For KVM_EXIT_INTERNAL_ERROR */ /* Emulate instruction failed. */ @@ -394,6 +395,11 @@ struct kvm_run { } eoi; /* KVM_EXIT_HYPERV */ struct kvm_hyperv_exit hyperv; + /* KVM_EXIT_ARM_NISV */ + struct { + __u64 esr_iss; + __u64 fault_ipa; + } arm_nisv; /* Fix the size of the union. */ char padding[256]; }; @@ -1000,6 +1006,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_PMU_EVENT_FILTER 173 #define KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 174 #define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 175 +#define KVM_CAP_ARM_NISV_TO_USER 176 #ifdef KVM_CAP_IRQ_ROUTING diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c index 86c6aa1cb58e..e6d56f60e4b6 100644 --- a/virt/kvm/arm/arm.c +++ b/virt/kvm/arm/arm.c @@ -98,6 +98,26 @@ int kvm_arch_check_processor_compat(void) return 0; } +int kvm_vm_ioctl_enable_cap(struct kvm *kvm, + struct kvm_enable_cap *cap) +{ + int r; + + if (cap->flags) + return -EINVAL; + + switch (cap->cap) { + case KVM_CAP_ARM_NISV_TO_USER: + r = 0; + kvm->arch.return_nisv_io_abort_to_user = true; + break; + default: + r = -EINVAL; + break; + } + + return r; +} /** * kvm_arch_init_vm - initializes a VM data structure @@ -197,6 +217,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_IMMEDIATE_EXIT: case KVM_CAP_VCPU_EVENTS: case KVM_CAP_ARM_IRQ_LINE_LAYOUT_2: + case KVM_CAP_ARM_NISV_TO_USER: r = 1; break; case KVM_CAP_ARM_SET_DEVICE_ADDR: diff --git a/virt/kvm/arm/mmio.c b/virt/kvm/arm/mmio.c index 6af5c91337f2..70d3b449692c 100644 --- a/virt/kvm/arm/mmio.c +++ b/virt/kvm/arm/mmio.c @@ -167,7 +167,14 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run, if (ret) return ret; } else { - kvm_err("load/store instruction decoding not implemented\n"); + if (vcpu->kvm->arch.return_nisv_io_abort_to_user) { + run->exit_reason = KVM_EXIT_ARM_NISV; + run->arm_nisv.esr_iss = kvm_vcpu_dabt_iss_nisv_sanitized(vcpu); + run->arm_nisv.fault_ipa = fault_ipa; + return 0; + } + + kvm_pr_unimpl("Data abort outside memslots with no valid syndrome info\n"); return -ENOSYS; } -- cgit v1.2.3 From da345174ceca052469e4775e4ae263b5f27a9355 Mon Sep 17 00:00:00 2001 From: Christoffer Dall Date: Fri, 11 Oct 2019 13:07:06 +0200 Subject: KVM: arm/arm64: Allow user injection of external data aborts In some scenarios, such as buggy guest or incorrect configuration of the VMM and firmware description data, userspace will detect a memory access to a portion of the IPA, which is not mapped to any MMIO region. For this purpose, the appropriate action is to inject an external abort to the guest. The kernel already has functionality to inject an external abort, but we need to wire up a signal from user space that lets user space tell the kernel to do this. It turns out, we already have the set event functionality which we can perfectly reuse for this. Signed-off-by: Christoffer Dall Signed-off-by: Marc Zyngier --- Documentation/virt/kvm/api.txt | 22 +++++++++++++++++++++- arch/arm/include/uapi/asm/kvm.h | 3 ++- arch/arm/kvm/guest.c | 10 ++++++++++ arch/arm64/include/uapi/asm/kvm.h | 3 ++- arch/arm64/kvm/guest.c | 10 ++++++++++ arch/arm64/kvm/inject_fault.c | 4 ++-- include/uapi/linux/kvm.h | 1 + virt/kvm/arm/arm.c | 1 + 8 files changed, 49 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt index 7403f15657c2..bd29d44af32b 100644 --- a/Documentation/virt/kvm/api.txt +++ b/Documentation/virt/kvm/api.txt @@ -1002,12 +1002,18 @@ Specifying exception.has_esr on a system that does not support it will return -EINVAL. Setting anything other than the lower 24bits of exception.serror_esr will return -EINVAL. +It is not possible to read back a pending external abort (injected via +KVM_SET_VCPU_EVENTS or otherwise) because such an exception is always delivered +directly to the virtual CPU). + + struct kvm_vcpu_events { struct { __u8 serror_pending; __u8 serror_has_esr; + __u8 ext_dabt_pending; /* Align it to 8 bytes */ - __u8 pad[6]; + __u8 pad[5]; __u64 serror_esr; } exception; __u32 reserved[12]; @@ -1051,9 +1057,23 @@ contain a valid state and shall be written into the VCPU. ARM/ARM64: +User space may need to inject several types of events to the guest. + Set the pending SError exception state for this VCPU. It is not possible to 'cancel' an Serror that has been made pending. +If the guest performed an access to I/O memory which could not be handled by +userspace, for example because of missing instruction syndrome decode +information or because there is no device mapped at the accessed IPA, then +userspace can ask the kernel to inject an external abort using the address +from the exiting fault on the VCPU. It is a programming error to set +ext_dabt_pending after an exit which was not either KVM_EXIT_MMIO or +KVM_EXIT_ARM_NISV. This feature is only available if the system supports +KVM_CAP_ARM_INJECT_EXT_DABT. This is a helper which provides commonality in +how userspace reports accesses for the above cases to guests, across different +userspace implementations. Nevertheless, userspace can still emulate all Arm +exceptions by manipulating individual registers using the KVM_SET_ONE_REG API. + See KVM_GET_VCPU_EVENTS for the data structure. diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h index 2769360f195c..03cd7c19a683 100644 --- a/arch/arm/include/uapi/asm/kvm.h +++ b/arch/arm/include/uapi/asm/kvm.h @@ -131,8 +131,9 @@ struct kvm_vcpu_events { struct { __u8 serror_pending; __u8 serror_has_esr; + __u8 ext_dabt_pending; /* Align it to 8 bytes */ - __u8 pad[6]; + __u8 pad[5]; __u64 serror_esr; } exception; __u32 reserved[12]; diff --git a/arch/arm/kvm/guest.c b/arch/arm/kvm/guest.c index 684cf64b4033..735f9b007e58 100644 --- a/arch/arm/kvm/guest.c +++ b/arch/arm/kvm/guest.c @@ -255,6 +255,12 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu, { events->exception.serror_pending = !!(*vcpu_hcr(vcpu) & HCR_VA); + /* + * We never return a pending ext_dabt here because we deliver it to + * the virtual CPU directly when setting the event and it's no longer + * 'pending' at this point. + */ + return 0; } @@ -263,12 +269,16 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, { bool serror_pending = events->exception.serror_pending; bool has_esr = events->exception.serror_has_esr; + bool ext_dabt_pending = events->exception.ext_dabt_pending; if (serror_pending && has_esr) return -EINVAL; else if (serror_pending) kvm_inject_vabt(vcpu); + if (ext_dabt_pending) + kvm_inject_dabt(vcpu, kvm_vcpu_get_hfar(vcpu)); + return 0; } diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 67c21f9bdbad..d49c17a80491 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -164,8 +164,9 @@ struct kvm_vcpu_events { struct { __u8 serror_pending; __u8 serror_has_esr; + __u8 ext_dabt_pending; /* Align it to 8 bytes */ - __u8 pad[6]; + __u8 pad[5]; __u64 serror_esr; } exception; __u32 reserved[12]; diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c index dfd626447482..ca613a44c6ec 100644 --- a/arch/arm64/kvm/guest.c +++ b/arch/arm64/kvm/guest.c @@ -712,6 +712,12 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu, if (events->exception.serror_pending && events->exception.serror_has_esr) events->exception.serror_esr = vcpu_get_vsesr(vcpu); + /* + * We never return a pending ext_dabt here because we deliver it to + * the virtual CPU directly when setting the event and it's no longer + * 'pending' at this point. + */ + return 0; } @@ -720,6 +726,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, { bool serror_pending = events->exception.serror_pending; bool has_esr = events->exception.serror_has_esr; + bool ext_dabt_pending = events->exception.ext_dabt_pending; if (serror_pending && has_esr) { if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) @@ -733,6 +740,9 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, kvm_inject_vabt(vcpu); } + if (ext_dabt_pending) + kvm_inject_dabt(vcpu, kvm_vcpu_get_hfar(vcpu)); + return 0; } diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c index a9d25a305af5..ccdb6a051ab2 100644 --- a/arch/arm64/kvm/inject_fault.c +++ b/arch/arm64/kvm/inject_fault.c @@ -109,7 +109,7 @@ static void inject_undef64(struct kvm_vcpu *vcpu) /** * kvm_inject_dabt - inject a data abort into the guest - * @vcpu: The VCPU to receive the undefined exception + * @vcpu: The VCPU to receive the data abort * @addr: The address to report in the DFAR * * It is assumed that this code is called from the VCPU thread and that the @@ -125,7 +125,7 @@ void kvm_inject_dabt(struct kvm_vcpu *vcpu, unsigned long addr) /** * kvm_inject_pabt - inject a prefetch abort into the guest - * @vcpu: The VCPU to receive the undefined exception + * @vcpu: The VCPU to receive the prefetch abort * @addr: The address to report in the DFAR * * It is assumed that this code is called from the VCPU thread and that the diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 7336ee8d98d7..65db5a4257ec 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1007,6 +1007,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 174 #define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 175 #define KVM_CAP_ARM_NISV_TO_USER 176 +#define KVM_CAP_ARM_INJECT_EXT_DABT 177 #ifdef KVM_CAP_IRQ_ROUTING diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c index e6d56f60e4b6..12064780f1d8 100644 --- a/virt/kvm/arm/arm.c +++ b/virt/kvm/arm/arm.c @@ -218,6 +218,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_VCPU_EVENTS: case KVM_CAP_ARM_IRQ_LINE_LAYOUT_2: case KVM_CAP_ARM_NISV_TO_USER: + case KVM_CAP_ARM_INJECT_EXT_DABT: r = 1; break; case KVM_CAP_ARM_SET_DEVICE_ADDR: -- cgit v1.2.3 From 6a7458485b390f48e481fcd4a0b20e6c5c843d2e Mon Sep 17 00:00:00 2001 From: Steven Price Date: Mon, 21 Oct 2019 16:28:14 +0100 Subject: KVM: arm64: Document PV-time interface Introduce a paravirtualization interface for KVM/arm64 based on the "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A. This only adds the details about "Stolen Time" as the details of "Live Physical Time" have not been fully agreed. User space can specify a reserved area of memory for the guest and inform KVM to populate the memory with information on time that the host kernel has stolen from the guest. A hypercall interface is provided for the guest to interrogate the hypervisor's support for this interface and the location of the shared memory structures. Signed-off-by: Steven Price Signed-off-by: Marc Zyngier --- Documentation/virt/kvm/arm/pvtime.rst | 80 +++++++++++++++++++++++++++++++++ Documentation/virt/kvm/devices/vcpu.txt | 14 ++++++ 2 files changed, 94 insertions(+) create mode 100644 Documentation/virt/kvm/arm/pvtime.rst (limited to 'Documentation') diff --git a/Documentation/virt/kvm/arm/pvtime.rst b/Documentation/virt/kvm/arm/pvtime.rst new file mode 100644 index 000000000000..2357dd2d8655 --- /dev/null +++ b/Documentation/virt/kvm/arm/pvtime.rst @@ -0,0 +1,80 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Paravirtualized time support for arm64 +====================================== + +Arm specification DEN0057/A defines a standard for paravirtualised time +support for AArch64 guests: + +https://developer.arm.com/docs/den0057/a + +KVM/arm64 implements the stolen time part of this specification by providing +some hypervisor service calls to support a paravirtualized guest obtaining a +view of the amount of time stolen from its execution. + +Two new SMCCC compatible hypercalls are defined: + +* PV_TIME_FEATURES: 0xC5000020 +* PV_TIME_ST: 0xC5000021 + +These are only available in the SMC64/HVC64 calling convention as +paravirtualized time is not available to 32 bit Arm guests. The existence of +the PV_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES +mechanism before calling it. + +PV_TIME_FEATURES + ============= ======== ========== + Function ID: (uint32) 0xC5000020 + PV_call_id: (uint32) The function to query for support. + Currently only PV_TIME_ST is supported. + Return value: (int64) NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant + PV-time feature is supported by the hypervisor. + ============= ======== ========== + +PV_TIME_ST + ============= ======== ========== + Function ID: (uint32) 0xC5000021 + Return value: (int64) IPA of the stolen time data structure for this + VCPU. On failure: + NOT_SUPPORTED (-1) + ============= ======== ========== + +The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory +with inner and outer write back caching attributes, in the inner shareable +domain. A total of 16 bytes from the IPA returned are guaranteed to be +meaningfully filled by the hypervisor (see structure below). + +PV_TIME_ST returns the structure for the calling VCPU. + +Stolen Time +----------- + +The structure pointed to by the PV_TIME_ST hypercall is as follows: + ++-------------+-------------+-------------+----------------------------+ +| Field | Byte Length | Byte Offset | Description | ++=============+=============+=============+============================+ +| Revision | 4 | 0 | Must be 0 for version 1.0 | ++-------------+-------------+-------------+----------------------------+ +| Attributes | 4 | 4 | Must be 0 | ++-------------+-------------+-------------+----------------------------+ +| Stolen time | 8 | 8 | Stolen time in unsigned | +| | | | nanoseconds indicating how | +| | | | much time this VCPU thread | +| | | | was involuntarily not | +| | | | running on a physical CPU. | ++-------------+-------------+-------------+----------------------------+ + +All values in the structure are stored little-endian. + +The structure will be updated by the hypervisor prior to scheduling a VCPU. It +will be present within a reserved region of the normal memory given to the +guest. The guest should not attempt to write into this memory. There is a +structure per VCPU of the guest. + +It is advisable that one or more 64k pages are set aside for the purpose of +these structures and not used for other purposes, this enables the guest to map +the region using 64k pages and avoids conflicting attributes with other memory. + +For the user space interface see Documentation/virt/kvm/devices/vcpu.txt +section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL". diff --git a/Documentation/virt/kvm/devices/vcpu.txt b/Documentation/virt/kvm/devices/vcpu.txt index 2b5dab16c4f2..6f3bd64a05b0 100644 --- a/Documentation/virt/kvm/devices/vcpu.txt +++ b/Documentation/virt/kvm/devices/vcpu.txt @@ -60,3 +60,17 @@ time to use the number provided for a given timer, overwriting any previously configured values on other VCPUs. Userspace should configure the interrupt numbers on at least one VCPU after creating all VCPUs and before running any VCPUs. + +3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL +Architectures: ARM64 + +3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_IPA +Parameters: 64-bit base address +Returns: -ENXIO: Stolen time not implemented + -EEXIST: Base address already set for this VCPU + -EINVAL: Base address not 64 byte aligned + +Specifies the base address of the stolen time structure for this VCPU. The +base address must be 64 byte aligned and exist within a valid guest memory +region. See Documentation/virt/kvm/arm/pvtime.txt for more information +including the layout of the stolen time structure. -- cgit v1.2.3 From e0685fa228fdaf386f82ac0d64b2d6f3e0ddd588 Mon Sep 17 00:00:00 2001 From: Steven Price Date: Mon, 21 Oct 2019 16:28:23 +0100 Subject: arm64: Retrieve stolen time as paravirtualized guest Enable paravirtualization features when running under a hypervisor supporting the PV_TIME_ST hypercall. For each (v)CPU, we ask the hypervisor for the location of a shared page which the hypervisor will use to report stolen time to us. We set pv_time_ops to the stolen time function which simply reads the stolen value from the shared page for a VCPU. We guarantee single-copy atomicity using READ_ONCE which means we can also read the stolen time for another VCPU than the currently running one while it is potentially being updated by the hypervisor. Signed-off-by: Steven Price Signed-off-by: Marc Zyngier --- Documentation/admin-guide/kernel-parameters.txt | 6 +- arch/arm64/include/asm/paravirt.h | 9 +- arch/arm64/kernel/paravirt.c | 140 ++++++++++++++++++++++++ arch/arm64/kernel/time.c | 3 + include/linux/cpuhotplug.h | 1 + 5 files changed, 155 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a84a83f8881e..19f465530e86 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3083,9 +3083,9 @@ [X86,PV_OPS] Disable paravirtualized VMware scheduler clock and use the default one. - no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. - steal time is computed, but won't influence scheduler - behaviour + no-steal-acc [X86,KVM,ARM64] Disable paravirtualized steal time + accounting. steal time is computed, but won't + influence scheduler behaviour nolapic [X86-32,APIC] Do not enable or use the local APIC. diff --git a/arch/arm64/include/asm/paravirt.h b/arch/arm64/include/asm/paravirt.h index 799d9dd6f7cc..cf3a0fd7c1a7 100644 --- a/arch/arm64/include/asm/paravirt.h +++ b/arch/arm64/include/asm/paravirt.h @@ -21,6 +21,13 @@ static inline u64 paravirt_steal_clock(int cpu) { return pv_ops.time.steal_clock(cpu); } -#endif + +int __init pv_time_init(void); + +#else + +#define pv_time_init() do {} while (0) + +#endif // CONFIG_PARAVIRT #endif diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c index 4cfed91fe256..1ef702b0be2d 100644 --- a/arch/arm64/kernel/paravirt.c +++ b/arch/arm64/kernel/paravirt.c @@ -6,13 +6,153 @@ * Author: Stefano Stabellini */ +#define pr_fmt(fmt) "arm-pv: " fmt + +#include +#include #include +#include #include +#include +#include +#include +#include #include + #include +#include +#include struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; struct paravirt_patch_template pv_ops; EXPORT_SYMBOL_GPL(pv_ops); + +struct pv_time_stolen_time_region { + struct pvclock_vcpu_stolen_time *kaddr; +}; + +static DEFINE_PER_CPU(struct pv_time_stolen_time_region, stolen_time_region); + +static bool steal_acc = true; +static int __init parse_no_stealacc(char *arg) +{ + steal_acc = false; + return 0; +} + +early_param("no-steal-acc", parse_no_stealacc); + +/* return stolen time in ns by asking the hypervisor */ +static u64 pv_steal_clock(int cpu) +{ + struct pv_time_stolen_time_region *reg; + + reg = per_cpu_ptr(&stolen_time_region, cpu); + if (!reg->kaddr) { + pr_warn_once("stolen time enabled but not configured for cpu %d\n", + cpu); + return 0; + } + + return le64_to_cpu(READ_ONCE(reg->kaddr->stolen_time)); +} + +static int stolen_time_dying_cpu(unsigned int cpu) +{ + struct pv_time_stolen_time_region *reg; + + reg = this_cpu_ptr(&stolen_time_region); + if (!reg->kaddr) + return 0; + + memunmap(reg->kaddr); + memset(reg, 0, sizeof(*reg)); + + return 0; +} + +static int init_stolen_time_cpu(unsigned int cpu) +{ + struct pv_time_stolen_time_region *reg; + struct arm_smccc_res res; + + reg = this_cpu_ptr(&stolen_time_region); + + arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_ST, &res); + + if (res.a0 == SMCCC_RET_NOT_SUPPORTED) + return -EINVAL; + + reg->kaddr = memremap(res.a0, + sizeof(struct pvclock_vcpu_stolen_time), + MEMREMAP_WB); + + if (!reg->kaddr) { + pr_warn("Failed to map stolen time data structure\n"); + return -ENOMEM; + } + + if (le32_to_cpu(reg->kaddr->revision) != 0 || + le32_to_cpu(reg->kaddr->attributes) != 0) { + pr_warn_once("Unexpected revision or attributes in stolen time data\n"); + return -ENXIO; + } + + return 0; +} + +static int pv_time_init_stolen_time(void) +{ + int ret; + + ret = cpuhp_setup_state(CPUHP_AP_ARM_KVMPV_STARTING, + "hypervisor/arm/pvtime:starting", + init_stolen_time_cpu, stolen_time_dying_cpu); + if (ret < 0) + return ret; + return 0; +} + +static bool has_pv_steal_clock(void) +{ + struct arm_smccc_res res; + + /* To detect the presence of PV time support we require SMCCC 1.1+ */ + if (psci_ops.smccc_version < SMCCC_VERSION_1_1) + return false; + + arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID, + ARM_SMCCC_HV_PV_TIME_FEATURES, &res); + + if (res.a0 != SMCCC_RET_SUCCESS) + return false; + + arm_smccc_1_1_invoke(ARM_SMCCC_HV_PV_TIME_FEATURES, + ARM_SMCCC_HV_PV_TIME_ST, &res); + + return (res.a0 == SMCCC_RET_SUCCESS); +} + +int __init pv_time_init(void) +{ + int ret; + + if (!has_pv_steal_clock()) + return 0; + + ret = pv_time_init_stolen_time(); + if (ret) + return ret; + + pv_ops.time.steal_clock = pv_steal_clock; + + static_key_slow_inc(¶virt_steal_enabled); + if (steal_acc) + static_key_slow_inc(¶virt_steal_rq_enabled); + + pr_info("using stolen time PV\n"); + + return 0; +} diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c index 0b2946414dc9..73f06d4b3aae 100644 --- a/arch/arm64/kernel/time.c +++ b/arch/arm64/kernel/time.c @@ -30,6 +30,7 @@ #include #include +#include unsigned long profile_pc(struct pt_regs *regs) { @@ -65,4 +66,6 @@ void __init time_init(void) /* Calibrate the delay loop directly */ lpj_fine = arch_timer_rate / HZ; + + pv_time_init(); } diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index 068793a619ca..89d75edb5750 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -136,6 +136,7 @@ enum cpuhp_state { /* Must be the last timer callback */ CPUHP_AP_DUMMY_TIMER_STARTING, CPUHP_AP_ARM_XEN_STARTING, + CPUHP_AP_ARM_KVMPV_STARTING, CPUHP_AP_ARM_CORESIGHT_STARTING, CPUHP_AP_ARM64_ISNDEP_STARTING, CPUHP_AP_SMPCFD_DYING, -- cgit v1.2.3 From efe5ddcae496b7c7307805d31815df23ba69bf7c Mon Sep 17 00:00:00 2001 From: Greg Kurz Date: Fri, 27 Sep 2019 13:54:07 +0200 Subject: KVM: PPC: Book3S HV: XIVE: Allow userspace to set the # of VPs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a new attribute to both XIVE and XICS-on-XIVE KVM devices so that userspace can tell how many interrupt servers it needs. If a VM needs less than the current default of KVM_MAX_VCPUS (2048), we can allocate less VPs in OPAL. Combined with a core stride (VSMT) that matches the number of guest threads per core, this may substantially increases the number of VMs that can run concurrently with an in-kernel XIVE device. Since the legacy XIVE KVM device is exposed to userspace through the XICS KVM API, a new attribute group is added to it for this purpose. While here, fix the syntax of the existing KVM_DEV_XICS_GRP_SOURCES in the XICS documentation. Signed-off-by: Greg Kurz Reviewed-by: Cédric Le Goater Signed-off-by: Paul Mackerras --- Documentation/virt/kvm/devices/xics.txt | 14 ++++++++++++-- Documentation/virt/kvm/devices/xive.txt | 8 ++++++++ arch/powerpc/include/uapi/asm/kvm.h | 3 +++ arch/powerpc/kvm/book3s_xive.c | 10 ++++++++++ arch/powerpc/kvm/book3s_xive_native.c | 3 +++ 5 files changed, 36 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/virt/kvm/devices/xics.txt b/Documentation/virt/kvm/devices/xics.txt index 42864935ac5d..423332dda7bc 100644 --- a/Documentation/virt/kvm/devices/xics.txt +++ b/Documentation/virt/kvm/devices/xics.txt @@ -3,9 +3,19 @@ XICS interrupt controller Device type supported: KVM_DEV_TYPE_XICS Groups: - KVM_DEV_XICS_SOURCES + 1. KVM_DEV_XICS_GRP_SOURCES Attributes: One per interrupt source, indexed by the source number. + 2. KVM_DEV_XICS_GRP_CTRL + Attributes: + 2.1 KVM_DEV_XICS_NR_SERVERS (write only) + The kvm_device_attr.addr points to a __u32 value which is the number of + interrupt server numbers (ie, highest possible vcpu id plus one). + Errors: + -EINVAL: Value greater than KVM_MAX_VCPU_ID. + -EFAULT: Invalid user pointer for attr->addr. + -EBUSY: A vcpu is already connected to the device. + This device emulates the XICS (eXternal Interrupt Controller Specification) defined in PAPR. The XICS has a set of interrupt sources, each identified by a 20-bit source number, and a set of @@ -38,7 +48,7 @@ least-significant end of the word: Each source has 64 bits of state that can be read and written using the KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls, specifying the -KVM_DEV_XICS_SOURCES attribute group, with the attribute number being +KVM_DEV_XICS_GRP_SOURCES attribute group, with the attribute number being the interrupt source number. The 64 bit state word has the following bitfields, starting from the least-significant end of the word: diff --git a/Documentation/virt/kvm/devices/xive.txt b/Documentation/virt/kvm/devices/xive.txt index 9a24a4525253..f5d1d6b5af61 100644 --- a/Documentation/virt/kvm/devices/xive.txt +++ b/Documentation/virt/kvm/devices/xive.txt @@ -78,6 +78,14 @@ the legacy interrupt mode, referred as XICS (POWER7/8). migrating the VM. Errors: none + 1.3 KVM_DEV_XIVE_NR_SERVERS (write only) + The kvm_device_attr.addr points to a __u32 value which is the number of + interrupt server numbers (ie, highest possible vcpu id plus one). + Errors: + -EINVAL: Value greater than KVM_MAX_VCPU_ID. + -EFAULT: Invalid user pointer for attr->addr. + -EBUSY: A vCPU is already connected to the device. + 2. KVM_DEV_XIVE_GRP_SOURCE (write only) Initializes a new source in the XIVE device and mask it. Attributes: diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index b0f72dea8b11..264e266a85bf 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -667,6 +667,8 @@ struct kvm_ppc_cpu_char { /* PPC64 eXternal Interrupt Controller Specification */ #define KVM_DEV_XICS_GRP_SOURCES 1 /* 64-bit source attributes */ +#define KVM_DEV_XICS_GRP_CTRL 2 +#define KVM_DEV_XICS_NR_SERVERS 1 /* Layout of 64-bit source attribute values */ #define KVM_XICS_DESTINATION_SHIFT 0 @@ -683,6 +685,7 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GRP_CTRL 1 #define KVM_DEV_XIVE_RESET 1 #define KVM_DEV_XIVE_EQ_SYNC 2 +#define KVM_DEV_XIVE_NR_SERVERS 3 #define KVM_DEV_XIVE_GRP_SOURCE 2 /* 64-bit source identifier */ #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3 /* 64-bit source identifier */ #define KVM_DEV_XIVE_GRP_EQ_CONFIG 4 /* 64-bit EQ identifier */ diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index 6c35b3d95986..66858b7d3c6b 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -1911,6 +1911,11 @@ static int xive_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) switch (attr->group) { case KVM_DEV_XICS_GRP_SOURCES: return xive_set_source(xive, attr->attr, attr->addr); + case KVM_DEV_XICS_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_XICS_NR_SERVERS: + return kvmppc_xive_set_nr_servers(xive, attr->addr); + } } return -ENXIO; } @@ -1936,6 +1941,11 @@ static int xive_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr) attr->attr < KVMPPC_XICS_NR_IRQS) return 0; break; + case KVM_DEV_XICS_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_XICS_NR_SERVERS: + return 0; + } } return -ENXIO; } diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 8ab333eabeef..34bd123fa024 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -921,6 +921,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev, return kvmppc_xive_reset(xive); case KVM_DEV_XIVE_EQ_SYNC: return kvmppc_xive_native_eq_sync(xive); + case KVM_DEV_XIVE_NR_SERVERS: + return kvmppc_xive_set_nr_servers(xive, attr->addr); } break; case KVM_DEV_XIVE_GRP_SOURCE: @@ -960,6 +962,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_XIVE_RESET: case KVM_DEV_XIVE_EQ_SYNC: + case KVM_DEV_XIVE_NR_SERVERS: return 0; } break; -- cgit v1.2.3