summaryrefslogtreecommitdiffstats
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2022-03-21mm: Support arbitrary THP sizesMatthew Wilcox (Oracle)2-47/+31
For code which has not yet been converted from THP to folios, use the compound size of the page instead of assuming PTE or PMD size. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm: Make large folios depend on THPMatthew Wilcox (Oracle)1-1/+6
Some parts of the VM still depend on THP to handle large folios correctly. Until those are fixed, prevent creating large folios if THP are disabled. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm: Fix READ_ONLY_THP warningMatthew Wilcox (Oracle)1-2/+2
These counters only exist if CONFIG_READ_ONLY_THP_FOR_FS is defined, but we do not need to warn if the filesystem natively supports large folios. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm: Turn can_split_huge_page() into can_split_folio()Matthew Wilcox (Oracle)1-2/+2
This function already required a head page to be passed, so this just adds type-safety and removes a few implicit calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/vmscan: Convert pageout() to take a folioMatthew Wilcox (Oracle)1-5/+5
We always write out an entire folio at once. This conversion removes a few calls to compound_head() and gets the NR_VMSCAN_WRITE statistic right when writing out a large folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/rmap: Constify the rmap_walk_control argumentMatthew Wilcox (Oracle)2-4/+4
The rmap walking functions do not modify the rmap_walk_control, and page_idle_clear_pte_refs() takes advantage of that to move construction of the rmap_walk_control to compile time. This lets us remove an unclean cast. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/rmap: Convert rmap_walk() to take a folioMatthew Wilcox (Oracle)2-8/+7
This ripples all the way through to every calling and called function from rmap. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm: Turn page_anon_vma() into folio_anon_vma()Matthew Wilcox (Oracle)1-1/+0
Move the prototype from mm.h to mm/internal.h and convert all callers to pass a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()Matthew Wilcox (Oracle)1-0/+1
Add back page_lock_anon_vma_read() as a wrapper. This saves a few calls to compound_head(). If any callers were passing a tail page before, this would have failed to lock the anon VMA as page->mapping is not valid for tail pages. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/migrate: Convert remove_migration_ptes() to foliosMatthew Wilcox (Oracle)1-1/+1
Convert the implementation and all callers. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/rmap: Convert try_to_migrate() to foliosMatthew Wilcox (Oracle)1-1/+1
Convert the callers to pass a folio and the try_to_migrate_one() worker to use a folio throughout. Fixes an assumption that a folio must be <= PMD size. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/rmap: Convert try_to_unmap() to take a folioMatthew Wilcox (Oracle)1-2/+2
Change all three callers and the worker function try_to_unmap_one(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/huge_memory: Convert __split_huge_pmd() to take a folioMatthew Wilcox (Oracle)1-4/+4
Convert split_huge_pmd_address() at the same time since it only passes the folio through, and its two callers already have a folio on hand. Removes numerous calls to compound_head() and removes an assumption that a page cannot be larger than a PMD. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/rmap: Turn page_referenced() into folio_referenced()Matthew Wilcox (Oracle)1-2/+2
Both its callers pass a page which was previously on an LRU list, so were passing a folio by definition. Use the type system to enforce that and remove a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm: Convert page_vma_mapped_walk to work on PFNsMatthew Wilcox (Oracle)2-5/+17
page_mapped_in_vma() really just wants to walk one page, but as the code stands, if passed the head page of a compound page, it will walk every page in the compound page. Extract pfn/nr_pages/pgoff from the struct page early, so they can be overridden by page_mapped_in_vma(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm: Add DEFINE_PAGE_VMA_WALK and DEFINE_FOLIO_VMA_WALKMatthew Wilcox (Oracle)1-0/+16
Instead of declaring a struct page_vma_mapped_walk directly, use these helpers to allow us to transition to a PFN approach in the following patches. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm: Add folio_pgoff()Matthew Wilcox (Oracle)1-0/+11
This is the folio equivalent of page_to_pgoff(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm: Add split_folio_to_list()Matthew Wilcox (Oracle)1-0/+6
This is a convenience function; split_huge_page_to_list() can take any page in a folio (and does so on purpose because that page will be the one which keeps the refcount). But it's convenient for the callers to pass the folio instead of the first page in the folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm: Add folio_mapcount()Matthew Wilcox (Oracle)1-1/+7
This implements the same algorithm as total_mapcount(), which is transformed into a wrapper function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm: Turn head_compound_mapcount() into folio_entire_mapcount()Matthew Wilcox (Oracle)1-6/+11
Adjust documentation to be more clear. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21fs: Move many prototypes to pagemap.hMatthew Wilcox (Oracle)2-116/+114
These functions are page cache functionality and don't need to be declared in fs.h. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
2022-03-21mm/truncate: Combine invalidate_mapping_pagevec() and ↵Matthew Wilcox (Oracle)1-4/+0
__invalidate_mapping_pages() We can save a function call by combining these two functions, which are identical except for the return value. Also move the prototype to mm/internal.h. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
2022-03-21mm: Turn deactivate_file_page() into deactivate_file_folio()Matthew Wilcox (Oracle)1-1/+0
This function has one caller which already has a reference to the page, so we don't need to use get_page_unless_zero(). Also move the prototype to mm/internal.h. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
2022-03-21mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()Matthew Wilcox (Oracle)1-1/+0
Some of the callers already have the address_space and can avoid calling folio_mapping() and checking if the folio was already truncated. Also add kernel-doc and fix the return type (in case we ever support folios larger than 4TB). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
2022-03-21mm: Convert remove_mapping() to take a folioMatthew Wilcox (Oracle)1-1/+1
Add kernel-doc and return the number of pages removed in order to get the statistics right in __invalidate_mapping_pages(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
2022-03-21mm: Add lru_to_folio()Matthew Wilcox (Oracle)1-0/+4
Since page->lru occupies the same bytes as compound_head, any page on the LRU list must be a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/memcg: Convert mem_cgroup_swapout() to take a folioMatthew Wilcox (Oracle)1-2/+2
This removes an assumption that THPs are the only kind of compound pages and removes a couple of hidden calls to compound_head. It also documents that you can't pass a tail page to mem_cgroup_swapout(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm/workingset: Convert workingset_eviction() to take a folioMatthew Wilcox (Oracle)1-1/+1
This removes an assumption that THPs are the only kind of compound pages and removes a few hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-21mm: Add three folio wrappersMatthew Wilcox (Oracle)2-0/+15
folio_is_zone_device() is equivalent to is_zone_device_page(), folio_is_device_private() is equivalent to is_device_private_page(), and folio_is_pinnable() is equivalent to is_pinnable_page(). All of these tests return the same result for every page in the folio, so we can just pass the head page of the folio to the page variant of the function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21mm/gup: Turn compound_range_next() into gup_folio_range_next()Matthew Wilcox (Oracle)1-2/+2
Convert the only caller to work on folios instead of pages. This removes the last caller of put_compound_head(), so delete it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/hugetlb: Use try_grab_folio() instead of try_grab_compound_head()Matthew Wilcox (Oracle)1-3/+0
follow_hugetlb_page() only cares about success or failure, so it doesn't need to know the type of the returned pointer, only whether it's NULL or not. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm: Remove page_cache_add_speculative() and page_cache_get_speculative()Matthew Wilcox (Oracle)2-14/+3
These wrappers have no more callers, so delete them. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm: Turn page_maybe_dma_pinned() into folio_maybe_dma_pinned()Matthew Wilcox (Oracle)1-66/+63
Replace three calls to compound_head() with one. This removes the last user of compound_pincount(), so remove that helper too. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm: Add folio_pincount_ptr()Matthew Wilcox (Oracle)1-0/+5
This is the folio equivalent of compound_pincount_ptr(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm: Make compound_pincount always availableMatthew Wilcox (Oracle)2-15/+13
Move compound_pincount from the third page to the second page, which means it's available for all compound pages. That lets us delete hpage_pincount_available(). On 32-bit systems, there isn't enough space for both compound_pincount and compound_nr in the second page (it would collide with page->private, which is in use for pages in the swap cache), so revert the optimisation of storing both compound_order and compound_nr on 32-bit systems. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-21mm/gup: Remove an assumption of a contiguous memmapMatthew Wilcox (Oracle)1-0/+2
This assumption needs the inverse of nth_page(), which is temporarily named page_nth() until it's renamed later in this series. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-03-03mm: remove the extra ZONE_DEVICE struct page refcountChristoph Hellwig2-10/+8
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>. Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm: don't include <linux/memremap.h> in <linux/mm.h>Christoph Hellwig2-20/+18
Move the check for the actual pgmap types that need the free at refcount one behavior into the out of line helper, and thus avoid the need to pull memremap.h into mm.h. Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm: simplify freeing of devmap managed pagesChristoph Hellwig1-24/+10
Make put_devmap_managed_page return if it took charge of the page or not and remove the separate page_is_devmap_managed helper. Link: https://lkml.kernel.org/r/20220210072828.2930359-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm: move free_devmap_managed_page to memremap.cChristoph Hellwig1-1/+0
free_devmap_managed_page has nothing to do with the code in swap.c, move it to live with the rest of the code for devmap handling. Link: https://lkml.kernel.org/r/20220210072828.2930359-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm: remove pointless includes from <linux/hmm.h>Christoph Hellwig1-7/+2
hmm.h pulls in the world for no good reason at all. Remove the includes and push a few ones into the users instead. Link: https://lkml.kernel.org/r/20220210072828.2930359-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm: remove the __KERNEL__ guard from <linux/mm.h>Christoph Hellwig1-4/+0
__KERNEL__ ifdefs don't make sense outside of include/uapi/. Link: https://lkml.kernel.org/r/20220210072828.2930359-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: maintain page->mlock_count while unevictableHugh Dickins2-5/+25
Previous patches have been preparatory: now implement page->mlock_count. The ordering of the "Unevictable LRU" is of no significance, and there is no point holding unevictable pages on a list: place page->mlock_count to overlay page->lru.prev (since page->lru.next is overlaid by compound_head, which needs to be even so as not to satisfy PageTail - though 2 could be added instead of 1 for each mlock, if that's ever an improvement). But it's only safe to rely on or modify page->mlock_count while lruvec lock is held and page is on unevictable "LRU" - we can save lots of edits by continuing to pretend that there's an imaginary LRU here (there is an unevictable count which still needs to be maintained, but not a list). The mlock_count technique suffers from an unreliability much like with page_mlock(): while someone else has the page off LRU, not much can be done. As before, err on the safe side (behave as if mlock_count 0), and let try_to_unlock_one() move the page to unevictable if reclaim finds out later on - a few misplaced pages don't matter, what we want to avoid is imbalancing reclaim by flooding evictable lists with unevictable pages. I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);": if we have taken lruvec lock to get the page off its present list, then we save everyone trouble (and however many extra atomic ops) by putting it on its destination list immediately. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: rmap call mlock_vma_page() munlock_vma_page()Hugh Dickins1-8/+9
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them inline functions which check (vma->vm_flags & VM_LOCKED) before calling mlock_page() and munlock_page() in mm/mlock.c. Add bool compound to mlock_vma_page() and munlock_vma_page(): this is because we have understandable difficulty in accounting pte maps of THPs, and if passed a PageHead page, mlock_page() and munlock_page() cannot tell whether it's a pmd map to be counted or a pte map to be ignored. Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the others, and use that to call mlock_vma_page() at the end of the page adds, and munlock_vma_page() at the end of page_remove_rmap() (end or beginning? unimportant, but end was easier for assertions in testing). No page lock is required (although almost all adds happen to hold it): delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s. Certainly page lock did serialize with page migration, but I'm having difficulty explaining why that was ever important. Mlock accounting on THPs has been hard to define, differed between anon and file, involved PageDoubleMap in some places and not others, required clear_page_mlock() at some points. Keep it simple now: just count the pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks. page_add_new_anon_rmap() callers unchanged: they have long been calling lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED handling (it also checks for not VM_SPECIAL: I think that's overcautious, and inconsistent with other checks, that mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: delete FOLL_MLOCK and FOLL_POPULATEHugh Dickins1-2/+0
If counting page mlocks, we must not double-count: follow_page_pte() can tell if a page has already been Mlocked or not, but cannot tell if a pte has already been counted or not: that will have to be done when the pte is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks for new anon pages, but there's no such tracking yet for others). Delete all the FOLL_MLOCK code - faulting in the missing pages will do all that is necessary, without special mlock_vma_page() calls from here. But then FOLL_POPULATE turns out to serve no purpose - it was there so that its absence would tell faultin_page() not to faultin page when setting up VM_LOCKONFAULT areas; but if there's no special work needed here for mlock, then there's no work at all here for VM_LOCKONFAULT. Have I got that right? I've not looked into the history, but see that FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different purpose before? Ah, yes, it was used to skip the old stack guard page. And is it intentional that COW is not broken on existing pages when setting up a VM_LOCKONFAULT area? I can see that being argued either way, and have no reason to disagree with current behaviour. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17mm/munlock: delete page_mlock() and all its worksHugh Dickins1-6/+0
We have recommended some applications to mlock their userspace, but that turns out to be counter-productive: when many processes mlock the same file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it is needed for write, to remove any vma mapping that file from rmap's tree; but hogged for read by those with mlocks calling page_mlock() (formerly known as try_to_munlock()) on *each* page mapped from the file (the purpose being to find out whether another process has the page mlocked, so therefore it should not be unmlocked yet). Several optimizations have been made in the past: one is to skip page_mlock() when mapcount tells that nothing else has this page mapped; but that doesn't help at all when others do have it mapped. This time around, I initially intended to add a preliminary search of the rmap tree for overlapping VM_LOCKED ranges; but that gets messy with locking order, when in doubt whether a page is actually present; and risks adding even more contention on the i_mmap_rwsem. A solution would be much easier, if only there were space in struct page for an mlock_count... but actually, most of the time, there is space for it - an mlocked page spends most of its life on an unevictable LRU, but since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has been redundant. Let's try to reuse its page->lru. But leave that until a later patch: in this patch, clear the ground by removing page_mlock(), and all the infrastructure that has gathered around it - which mostly hinders understanding, and will make reviewing new additions harder. Don't mind those old comments about THPs, they date from before 4.5's refcounting rework: splitting is not a risk here. Just keep a minimal version of munlock_vma_page(), as reminder of what it should attend to (in particular, the odd way PGSTRANDED is counted out of PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move unchanged __mlock_posix_error_return() out of the way, down to above its caller: this series then makes no further change after mlock_fixup(). After this and each following commit, the kernel builds, boots and runs; but with deficiencies which may show up in testing of mlock and munlock. The system calls succeed or fail as before, and mlock remains effective in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts may be shown too low after mlock, grow, then stay too high after munlock: with previously mlocked pages remaining unevictable for too long, until finally unmapped and freed and counts corrected. Normal service will be resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking". Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-15Merge tag 'hyperv-fixes-signed-20220215' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Rework use of DMA_BIT_MASK in vmbus to work around a clang bug (Michael Kelley) - Fix NUMA topology (Long Li) - Fix a memory leak in vmbus (Miaoqian Lin) - One minor clean-up patch (Cai Huoqing) * tag 'hyperv-fixes-signed-20220215' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Drivers: hv: utils: Make use of the helper macro LIST_HEAD() Drivers: hv: vmbus: Rework use of DMA_BIT_MASK(64) Drivers: hv: vmbus: Fix memory leak in vmbus_add_channel_kobj PCI: hv: Fix NUMA node assignment when kernel boots with custom NUMA topology
2022-02-13Merge tag 'objtool_urgent_for_v5.17_rc4' of ↵Linus Torvalds1-16/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fix from Borislav Petkov: "Fix a case where objtool would mistakenly warn about instructions being unreachable" * tag 'objtool_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm
2022-02-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-2/+5
Merge misc fixes from Andrew Morton: "5 patches. Subsystems affected by this patch series: binfmt, procfs, and mm (vmscan, memcg, and kfence)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: kfence: make test case compatible with run time set sample interval mm: memcg: synchronize objcg lists with a dedicated spinlock mm: vmscan: remove deadlock due to throttling failing to make progress fs/proc: task_mmu.c: don't read mapcount for migration entry fs/binfmt_elf: fix PT_LOAD p_align values for loaders
2022-02-11kfence: make test case compatible with run time set sample intervalPeng Liu1-0/+2
The parameter kfence_sample_interval can be set via boot parameter and late shell command, which is convenient for automated tests and KFENCE parameter optimization. However, KFENCE test case just uses compile-time CONFIG_KFENCE_SAMPLE_INTERVAL, which will make KFENCE test case not run as users desired. Export kfence_sample_interval, so that KFENCE test case can use run-time-set sample interval. Link: https://lkml.kernel.org/r/20220207034432.185532-1-liupeng256@huawei.com Signed-off-by: Peng Liu <liupeng256@huawei.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Christian Knig <christian.koenig@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>