From d8968f1f334c18bf65178b757a82864ef3f37613 Mon Sep 17 00:00:00 2001 From: Alexey Kardashevskiy Date: Wed, 19 Jun 2013 11:42:07 +1000 Subject: kvm api doc: fix section numbers Signed-off-by: Alexey Kardashevskiy Signed-off-by: Paolo Bonzini --- Documentation/virtual/kvm/api.txt | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation/virtual') diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 5f91eda91647..6365fef8d616 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2261,7 +2261,7 @@ return indicates the attribute is implemented. It does not necessarily indicate that the attribute can be read or written in the device's current state. "addr" is ignored. -4.77 KVM_ARM_VCPU_INIT +4.82 KVM_ARM_VCPU_INIT Capability: basic Architectures: arm @@ -2285,7 +2285,7 @@ Possible features: Depends on KVM_CAP_ARM_PSCI. -4.78 KVM_GET_REG_LIST +4.83 KVM_GET_REG_LIST Capability: basic Architectures: arm @@ -2305,7 +2305,7 @@ This ioctl returns the guest registers that are supported for the KVM_GET_ONE_REG/KVM_SET_ONE_REG calls. -4.80 KVM_ARM_SET_DEVICE_ADDR +4.84 KVM_ARM_SET_DEVICE_ADDR Capability: KVM_CAP_ARM_SET_DEVICE_ADDR Architectures: arm @@ -2342,7 +2342,7 @@ and distributor interface, the ioctl must be called after calling KVM_CREATE_IRQCHIP, but before calling KVM_RUN on any of the VCPUs. Calling this ioctl twice for any of the base addresses will return -EEXIST. -4.82 KVM_PPC_RTAS_DEFINE_TOKEN +4.85 KVM_PPC_RTAS_DEFINE_TOKEN Capability: KVM_CAP_PPC_RTAS Architectures: ppc -- cgit v1.2.3 From 6c806a7332e6d36c1f95966b4a8a22f342b68948 Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 19 Jun 2013 17:09:19 +0800 Subject: KVM: MMU: update the documentation for reverse mapping of parent_pte Update the document to match the current reverse mapping of parent_pte Signed-off-by: Xiao Guangrong Signed-off-by: Paolo Bonzini --- Documentation/virtual/kvm/mmu.txt | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'Documentation/virtual') diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index 43fcb761ed16..869abcc48315 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -191,12 +191,12 @@ Shadow pages contain the following information: A counter keeping track of how many hardware registers (guest cr3 or pdptrs) are now pointing at the page. While this counter is nonzero, the page cannot be destroyed. See role.invalid. - multimapped: - Whether there exist multiple sptes pointing at this page. - parent_pte/parent_ptes: - If multimapped is zero, parent_pte points at the single spte that points at - this page's spt. Otherwise, parent_ptes points at a data structure - with a list of parent_ptes. + parent_ptes: + The reverse mapping for the pte/ptes pointing at this page's spt. If + parent_ptes bit 0 is zero, only one spte points at this pages and + parent_ptes points at this single spte, otherwise, there exists multiple + sptes pointing at this page and (parent_ptes & ~0x1) points at a data + structure with a list of parent_ptes. unsync: If true, then the translations in this page may not match the guest's translation. This is equivalent to the state of the tlb when a pte is -- cgit v1.2.3 From accaefe07ddbeb12c0de4cec1d62dba6a0ea1605 Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 19 Jun 2013 17:09:20 +0800 Subject: KVM: MMU: document clear_spte_count Document it to Documentation/virtual/kvm/mmu.txt Signed-off-by: Xiao Guangrong Signed-off-by: Paolo Bonzini --- Documentation/virtual/kvm/mmu.txt | 5 +++++ arch/x86/include/asm/kvm_host.h | 4 ++++ arch/x86/kvm/mmu.c | 17 ++++++++++++++--- 3 files changed, 23 insertions(+), 3 deletions(-) (limited to 'Documentation/virtual') diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index 869abcc48315..f514a3fad9b9 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -210,6 +210,11 @@ Shadow pages contain the following information: A bitmap indicating which sptes in spt point (directly or indirectly) at pages that may be unsynchronized. Used to quickly locate all unsychronized pages reachable from a given page. + clear_spte_count: + Only present on 32-bit hosts, where a 64-bit spte cannot be written + atomically. The reader uses this while running out of the MMU lock + to detect in-progress updates and retry them until the writer has + finished the write. Reverse map =========== diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 966f2650b6ab..5d28c11d5e21 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -226,6 +226,10 @@ struct kvm_mmu_page { DECLARE_BITMAP(unsync_child_bitmap, 512); #ifdef CONFIG_X86_32 + /* + * Used out of the mmu-lock to avoid reading spte values while an + * update is in progress; see the comments in __get_spte_lockless(). + */ int clear_spte_count; #endif diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 7113a0fb544c..f385a4cf4bfd 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -466,9 +466,20 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte) /* * The idea using the light way get the spte on x86_32 guest is from * gup_get_pte(arch/x86/mm/gup.c). - * The difference is we can not catch the spte tlb flush if we leave - * guest mode, so we emulate it by increase clear_spte_count when spte - * is cleared. + * + * An spte tlb flush may be pending, because kvm_set_pte_rmapp + * coalesces them and we are running out of the MMU lock. Therefore + * we need to protect against in-progress updates of the spte. + * + * Reading the spte while an update is in progress may get the old value + * for the high part of the spte. The race is fine for a present->non-present + * change (because the high part of the spte is ignored for non-present spte), + * but for a present->present change we must reread the spte. + * + * All such changes are done in two steps (present->non-present and + * non-present->present), hence it is enough to count the number of + * present->non-present updates: if it changed while reading the spte, + * we might have hit the race. This is done using clear_spte_count. */ static u64 __get_spte_lockless(u64 *sptep) { -- cgit v1.2.3 From 0cbf8e437b60b8b12d97589509d3e5a581731d36 Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 19 Jun 2013 17:09:21 +0800 Subject: KVM: MMU: document write_flooding_count Document write_flooding_count to Documentation/virtual/kvm/mmu.txt Signed-off-by: Xiao Guangrong Signed-off-by: Paolo Bonzini --- Documentation/virtual/kvm/mmu.txt | 9 +++++++++ arch/x86/include/asm/kvm_host.h | 1 + 2 files changed, 10 insertions(+) (limited to 'Documentation/virtual') diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index f514a3fad9b9..0db7743c55e1 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -215,6 +215,15 @@ Shadow pages contain the following information: atomically. The reader uses this while running out of the MMU lock to detect in-progress updates and retry them until the writer has finished the write. + write_flooding_count: + A guest may write to a page table many times, causing a lot of + emulations if the page needs to be write-protected (see "Synchronized + and unsynchronized pages" below). Leaf pages can be unsynchronized + so that they do not trigger frequent emulation, but this is not + possible for non-leafs. This field counts the number of emulations + since the last time the page table was actually used; if emulation + is triggered too frequently on this page, KVM will unmap the page + to avoid emulation in the future. Reverse map =========== diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5d28c11d5e21..6b636fd8582f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -233,6 +233,7 @@ struct kvm_mmu_page { int clear_spte_count; #endif + /* Number of writes since the last time traversal visited this page. */ int write_flooding_count; }; -- cgit v1.2.3 From 67652ed34390b19ede6f847d23d8c68e2c819b50 Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 19 Jun 2013 17:09:22 +0800 Subject: KVM: MMU: document mmio page fault Document it to Documentation/virtual/kvm/mmu.txt Signed-off-by: Xiao Guangrong Signed-off-by: Paolo Bonzini --- Documentation/virtual/kvm/mmu.txt | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'Documentation/virtual') diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index 0db7743c55e1..0aa8e0e34119 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -272,14 +272,21 @@ This is the most complicated event. The cause of a page fault can be: Handling a page fault is performed as follows: + - if the RSV bit of the error code is set, the page fault is caused by guest + accessing MMIO and cached MMIO information is available. + - walk shadow page table + - cache the information to vcpu->arch.mmio_gva, vcpu->arch.access and + vcpu->arch.mmio_gfn, and call the emulator - if needed, walk the guest page tables to determine the guest translation (gva->gpa or ngpa->gpa) - if permissions are insufficient, reflect the fault back to the guest - determine the host page - - if this is an mmio request, there is no host page; call the emulator - to emulate the instruction instead + - if this is an mmio request, there is no host page; cache the info to + vcpu->arch.mmio_gva, vcpu->arch.access and vcpu->arch.mmio_gfn - walk the shadow page table to find the spte for the translation, instantiating missing intermediate page tables as necessary + - If this is an mmio request, cache the mmio info to the spte and set some + reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask) - try to unsynchronize the page - if successful, we can let the guest continue and modify the gpte - emulate the instruction -- cgit v1.2.3 From 2d49c47f350b939b395cd2d1abaa8c3bb6c54326 Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 19 Jun 2013 17:09:23 +0800 Subject: KVM: MMU: document fast page fault Document fast page fault to Documentation/virtual/kvm/mmu.txt Signed-off-by: Xiao Guangrong Signed-off-by: Paolo Bonzini --- Documentation/virtual/kvm/mmu.txt | 3 +++ 1 file changed, 3 insertions(+) (limited to 'Documentation/virtual') diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index 0aa8e0e34119..42193f206602 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -277,6 +277,9 @@ Handling a page fault is performed as follows: - walk shadow page table - cache the information to vcpu->arch.mmio_gva, vcpu->arch.access and vcpu->arch.mmio_gfn, and call the emulator + - If both P bit and R/W bit of error code are set, this could possibly + be handled as a "fast page fault" (fixed without taking the MMU lock). See + the description in Documentation/virtual/kvm/locking.txt. - if needed, walk the guest page tables to determine the guest translation (gva->gpa or ngpa->gpa) - if permissions are insufficient, reflect the fault back to the guest -- cgit v1.2.3 From f6f8adeef542a18b1cb26a0b772c9781a10bb477 Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 19 Jun 2013 17:09:24 +0800 Subject: KVM: MMU: document fast invalidate all pages Document it to Documentation/virtual/kvm/mmu.txt Signed-off-by: Xiao Guangrong Signed-off-by: Paolo Bonzini --- Documentation/virtual/kvm/mmu.txt | 25 +++++++++++++++++++++++++ arch/x86/include/asm/kvm_host.h | 3 +++ 2 files changed, 28 insertions(+) (limited to 'Documentation/virtual') diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index 42193f206602..89c8a4caf51e 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -210,6 +210,10 @@ Shadow pages contain the following information: A bitmap indicating which sptes in spt point (directly or indirectly) at pages that may be unsynchronized. Used to quickly locate all unsychronized pages reachable from a given page. + mmu_valid_gen: + Generation number of the page. It is compared with kvm->arch.mmu_valid_gen + during hash table lookup, and used to skip invalidated shadow pages (see + "Zapping all pages" below.) clear_spte_count: Only present on 32-bit hosts, where a 64-bit spte cannot be written atomically. The reader uses this while running out of the MMU lock @@ -375,6 +379,27 @@ causes its write_count to be incremented, thus preventing instantiation of a large spte. The frames at the end of an unaligned memory slot have artificially inflated ->write_counts so they can never be instantiated. +Zapping all pages (page generation count) +========================================= + +For the large memory guests, walking and zapping all pages is really slow +(because there are a lot of pages), and also blocks memory accesses of +all VCPUs because it needs to hold the MMU lock. + +To make it be more scalable, kvm maintains a global generation number +which is stored in kvm->arch.mmu_valid_gen. Every shadow page stores +the current global generation-number into sp->mmu_valid_gen when it +is created. Pages with a mismatching generation number are "obsolete". + +When KVM need zap all shadow pages sptes, it just simply increases the global +generation-number then reload root shadow pages on all vcpus. As the VCPUs +create new shadow page tables, the old pages are not used because of the +mismatching generation number. + +KVM then walks through all pages and zaps obsolete pages. While the zap +operation needs to take the MMU lock, the lock can be released periodically +so that the VCPUs can make progress. + Further reading =============== diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 6b636fd8582f..280e3271b027 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -222,7 +222,10 @@ struct kvm_mmu_page { int root_count; /* Currently serving as active root */ unsigned int unsync_children; unsigned long parent_ptes; /* Reverse mapping for parent_pte */ + + /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */ unsigned long mmu_valid_gen; + DECLARE_BITMAP(unsync_child_bitmap, 512); #ifdef CONFIG_X86_32 -- cgit v1.2.3 From 5a9b3830d462971bf05329148873f8996d1c88fc Mon Sep 17 00:00:00 2001 From: Xiao Guangrong Date: Wed, 19 Jun 2013 17:09:25 +0800 Subject: KVM: MMU: document fast invalidate all mmio sptes Document it to Documentation/virtual/kvm/mmu.txt Signed-off-by: Xiao Guangrong Signed-off-by: Paolo Bonzini --- Documentation/virtual/kvm/mmu.txt | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) (limited to 'Documentation/virtual') diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index 89c8a4caf51e..290894176142 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -279,6 +279,8 @@ Handling a page fault is performed as follows: - if the RSV bit of the error code is set, the page fault is caused by guest accessing MMIO and cached MMIO information is available. - walk shadow page table + - check for valid generation number in the spte (see "Fast invalidation of + MMIO sptes" below) - cache the information to vcpu->arch.mmio_gva, vcpu->arch.access and vcpu->arch.mmio_gfn, and call the emulator - If both P bit and R/W bit of error code are set, this could possibly @@ -400,6 +402,30 @@ KVM then walks through all pages and zaps obsolete pages. While the zap operation needs to take the MMU lock, the lock can be released periodically so that the VCPUs can make progress. +Fast invalidation of MMIO sptes +=============================== + +As mentioned in "Reaction to events" above, kvm will cache MMIO +information in leaf sptes. When a new memslot is added or an existing +memslot is changed, this information may become stale and needs to be +invalidated. This also needs to hold the MMU lock while walking all +shadow pages, and is made more scalable with a similar technique. + +MMIO sptes have a few spare bits, which are used to store a +generation number. The global generation number is stored in +kvm_memslots(kvm)->generation, and increased whenever guest memory info +changes. This generation number is distinct from the one described in +the previous section. + +When KVM finds an MMIO spte, it checks the generation number of the spte. +If the generation number of the spte does not equal the global generation +number, it will ignore the cached MMIO information and handle the page +fault through the slow path. + +Since only 19 bits are used to store generation-number on mmio spte, all +pages are zapped when there is an overflow. + + Further reading =============== -- cgit v1.2.3