diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2022-10-29 11:45:07 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2022-11-02 16:10:30 -0700 |
commit | 28154ddc676efa64e8e792389787eb85199d2772 (patch) | |
tree | 6bf1a0ba9189c8e904a91b71e01ec5e44651c8a3 /mm/mmu_gather.c | |
parent | 655d4bdee63563392b0e5fb40f973c6d41658070 (diff) | |
download | linux-28154ddc676efa64e8e792389787eb85199d2772.tar.bz2 |
mm: delay rmap removal until after TLB flushmmu_gather-race-fix
When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.
However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with
the page table lock because the page had already been made invisible to
rmap.
And that is a problem, because while the TLB entry exists, we could end
up with the following situation:
(a) one CPU could come in and clean it, never seeing our mapping of
the page
(b) another CPU could continue to use the stale and dirty TLB entry
and continue to write to said page
resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.
End result: possibly lost dirty data.
This commit uses the same old TLB gather array that we use to delay the
freeing of the page to also say 'remove from rmap after flush', so that
we can keep the rmap entries alive until all TLB entries have been
flushed.
It might be worth noting that this means that the page_zap_pte_rmap() is
now called outside the page table lock. That was never mutual exclusion
(since the same page could be mapped under multiple different page
tables), but it does mean that it needs to use the more careful version
of dec_lruvec_page_state() that doesn't depend on being called in a
non-preemptable context.
NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.
So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly
even when user space goes off the rails.
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Cc: Will Deacon <will@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'mm/mmu_gather.c')
-rw-r--r-- | mm/mmu_gather.c | 42 |
1 files changed, 38 insertions, 4 deletions
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index add4244e5790..34f6a07fd711 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -9,6 +9,7 @@ #include <linux/rcupdate.h> #include <linux/smp.h> #include <linux/swap.h> +#include <linux/rmap.h> #include <asm/pgalloc.h> #include <asm/tlb.h> @@ -43,12 +44,44 @@ static bool tlb_next_batch(struct mmu_gather *tlb) return true; } +/* + * We get an 'encoded page' array, which has page pointers with + * encoded flags in the low bits of the array. + * + * The TLB has been flushed, now we need to react to the flag bits + * the 'struct page', clean the array in-place, and then free the + * pages and their swap cache. + */ +static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr) +{ + for (unsigned int i = 0; i < nr; i++) { + struct encoded_page *encoded = pages[i]; + unsigned int flags = encoded_page_flags(encoded); + if (flags) { + /* Clean the flagged pointer in-place */ + struct page *page = encoded_page_ptr(encoded); + pages[i] = encode_page(page, 0); + + /* The flag bit being set means that we should zap the rmap */ + page_zap_pte_rmap(page); + VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page); + } + } + + /* + * Now all entries have been un-encoded, and changed to plain + * page pointers, so we can cast the 'encoded_page' array to + * a plain page array and free them + */ + free_pages_and_swap_cache((struct page **)pages, nr); +} + static void tlb_batch_pages_flush(struct mmu_gather *tlb) { struct mmu_gather_batch *batch; for (batch = &tlb->local; batch && batch->nr; batch = batch->next) { - struct page **pages = batch->pages; + struct encoded_page **pages = batch->encoded_pages; do { /* @@ -56,7 +89,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb) */ unsigned int nr = min(512U, batch->nr); - free_pages_and_swap_cache(pages, nr); + clean_and_free_pages_and_swap_cache(pages, nr); pages += nr; batch->nr -= nr; @@ -77,11 +110,12 @@ static void tlb_batch_list_free(struct mmu_gather *tlb) tlb->local.next = NULL; } -bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size) +bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size, unsigned int flags) { struct mmu_gather_batch *batch; VM_BUG_ON(!tlb->end); + VM_BUG_ON(flags & ~ENCODE_PAGE_BITS); #ifdef CONFIG_MMU_GATHER_PAGE_SIZE VM_WARN_ON(tlb->page_size != page_size); @@ -92,7 +126,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_ * Add the page and check if we are full. If so * force a flush. */ - batch->pages[batch->nr++] = page; + batch->encoded_pages[batch->nr++] = encode_page(page, flags); if (batch->nr == batch->max) { if (!tlb_next_batch(tlb)) return true; |