summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2022-12-11mm: add folio dtor and order setter functionsSidhartha Kumar1-3/+1
Patch series "convert core hugetlb functions to folios", v5. ============== OVERVIEW =========================== Now that many hugetlb helper functions that deal with hugetlb specific flags[1] and hugetlb cgroups[2] are converted to folios, higher level allocation, prep, and freeing functions within hugetlb can also be converted to operate in folios. Patch 1 of this series implements the wrapper functions around setting the compound destructor and compound order for a folio. Besides the user added in patch 1, patch 2 and patch 9 also use these helper functions. Patches 2-10 convert the higher level hugetlb functions to folios. ============== TESTING =========================== LTP: Ran 10 back to back rounds of the LTP hugetlb test suite. Gigantic Huge Pages: Test allocation and freeing via hugeadm commands: hugeadm --pool-pages-min 1GB:10 hugeadm --pool-pages-min 1GB:0 Demote: Demote 1 1GB hugepages to 512 2MB hugepages echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # 512 cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages # 0 [1] https://lore.kernel.org/lkml/20220922154207.1575343-1-sidhartha.kumar@oracle.com/ [2] https://lore.kernel.org/linux-mm/20221101223059.460937-1-sidhartha.kumar@oracle.com/ This patch (of 10): Add folio equivalents for set_compound_order() and set_compound_page_dtor(). Also remove extra new-lines introduced by mm/hugetlb: convert move_hugetlb_state() to folios and mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() to folios. [sidhartha.kumar@oracle.com: clarify folio_set_compound_order() zero support] Link: https://lkml.kernel.org/r/20221207223731.32784-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20221129225039.82257-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20221129225039.82257-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Tarun Sahu <tsahu@linux.ibm.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11folio-compat: remove lru_cache_add()Vishal Moola (Oracle)3-8/+5
There are no longer any callers of lru_cache_add(), so remove it. This saves 79 bytes of kernel text. Also cleanup some comments such that they reference the new folio_add_lru() instead. Link: https://lkml.kernel.org/r/20221101175326.13265-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11khugepage: replace lru_cache_add() with folio_add_lru()Vishal Moola (Oracle)1-4/+7
Replaces some calls with their folio equivalents. This is in preparation for the removal of lru_cache_add(). This replaces 3 calls to compound_head() with 1. Link: https://lkml.kernel.org/r/20221101175326.13265-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11userfaultfd: replace lru_cache functions with folio_add functionsVishal Moola (Oracle)1-2/+4
Replaces lru_cache_add() and lru_cache_add_inactive_or_unevictable() with folio_add_lru() and folio_add_lru_vma(). This is in preparation for the removal of lru_cache_add(). Link: https://lkml.kernel.org/r/20221101175326.13265-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11filemap: convert replace_page_cache_page() to replace_page_cache_folio()Vishal Moola (Oracle)1-27/+25
Patch series "Removing the lru_cache_add() wrapper". This patchset replaces all calls of lru_cache_add() with the folio equivalent: folio_add_lru(). This is allows us to get rid of the wrapper The series passes xfstests and the userfaultfd selftests. This patch (of 5): Eliminates 7 calls to compound_head(). Link: https://lkml.kernel.org/r/20221101175326.13265-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20221101175326.13265-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()Feiyang Chen1-0/+63
Generalise vmemmap_populate_hugepages() so ARM64 & X86 & LoongArch can share its implementation. Link: https://lkml.kernel.org/r/20221027125253.3458989-4-chenhuacai@loongson.cn Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Acked-by: Will Deacon <will@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Min Zhou <zhoumin@loongson.cn> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11LoongArch: add sparse memory vmemmap supportFeiyang Chen1-0/+10
Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise pfn_to_page and page_to_pfn operations. This is the most efficient option when sufficient kernel resources are available. Link: https://lkml.kernel.org/r/20221027125253.3458989-3-chenhuacai@loongson.cn Signed-off-by: Min Zhou <zhoumin@loongson.cn> Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Will Deacon <will@kernel.org> Cc: Xuefeng Li <lixuefeng@loongson.cn> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11kmsan: allow using __msan_instrument_asm_store() inside runtimeAlexander Potapenko1-3/+5
In certain cases (e.g. when handling a softirq) __msan_instrument_asm_store(&var, sizeof(var)) may be called with from within KMSAN runtime, but later the value of @var is used with !kmsan_in_runtime(), leading to false positives. Because kmsan_internal_unpoison_memory() doesn't take locks, it should be fine to call it without kmsan_in_runtime() checks, which fixes the mentioned false positives. Link: https://lkml.kernel.org/r/20221128094541.2645890-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Elver <elver@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/fadvise: use LLONG_MAX instead of -1 for eofBrian Foster1-1/+1
generic_fadvise() sets endbyte = -1 to specify end of file (i.e. if length == 0 is passed from userspace). Most other callers to filemap_fdatawrite_range() use LLONG_MAX for this purpose, particularly if they also call fdatawait_range() (which requires end >= start). For example, sync_file_range(), vfs_fsync() (where the range is passed down through per-fs ->fsync() callbacks), filemap_flush(), etc. generic_fadvise() does not currently wait on writeback, but fix the call up to be consistent with other callers. Link: https://lkml.kernel.org/r/20221128155632.3950447-3-bfoster@redhat.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11filemap: skip write and wait if end offset precedes startBrian Foster1-3/+6
Patch series "filemap: skip write and wait if end offset precedes start", v2. A fix for the odd write and wait behavior described in the patch 1 commit log. Technically patch 1 could simply remove the check rather than lift it into the callers, but this seemed a bit more user friendly to me. Patch 2 is appended after observation that fadvise() interacted poorly with the v1 patch. This is no longer a problem with v2, making patch 2 purely a cleanup. This series survived both fstests and ltp regression runs without observable problems. I had (end < start) warning checks in each relevant function, with fadvise() being the only caller that triggered them. That said, I dropped the warnings after testing because there seemed to much potential for noise from the various other callers. This patch (of 2): A call to file[map]_write_and_wait_range() with an end offset that precedes the start offset but happens to land in the same page can trigger writeback submission but fails to wait on the submitted page. Writeback submission occurs because __filemap_fdatawrite_range() passes both offsets down into write_cache_pages(), which rounds down to page indexes before it starts processing writeback. However, __filemap_fdatawait_range() immediately returns if the byte-granular end offset precedes the start offset. This behavior was observed in the form of unpredictable latency from a frequent write and wait call with incorrect parameters. The behavior gave the impression that the fdatawait path might occasionally fail to wait on writeback, but further investigation showed the latency was from write_cache_pages() waiting on writeback state to clear for a page already under writeback. Therefore, this indicated that fdatawait actually never waits on writeback in this particular situation. The byte granular check in __filemap_fdatawait_range() goes all the way back to the old wait_on_page_writeback() helper. It originally used page offsets and so would have waited in this problematic case. That changed to byte granularity file offsets in commit 94004ed726f3 ("kill wait_on_page_writeback_range"), which subtly changed this behavior. The check itself has become somewhat redundant since the error checking code that used to follow the wait loop (at the time of the aforementioned commit) has now been removed and lifted into the higher level callers. Therefore, we can restore historical fdatawait behavior by simply removing the check. Since the current fdatawait behavior has been in place for quite some time and is consistent with other interfaces that use file offsets, instead lift the check into the file[map]_write_and_wait_range() helpers to provide consistent behavior between the write and wait. Link: https://lkml.kernel.org/r/20221128155632.3950447-1-bfoster@redhat.com Link: https://lkml.kernel.org/r/20221128155632.3950447-2-bfoster@redhat.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11zsmalloc: implement writeback mechanism for zsmallocNhat Pham1-11/+183
This commit adds the writeback mechanism for zsmalloc, analogous to the zbud allocator. Zsmalloc will attempt to determine the coldest zspage (i.e least recently used) in the pool, and attempt to write back all the stored compressed objects via the pool's evict handler. Link: https://lkml.kernel.org/r/20221128191616.1261026-7-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11zsmalloc: add zpool_ops field to zs_pool to store evict handlersNhat Pham1-1/+10
This adds a new field to zs_pool to store evict handlers for writeback, analogous to the zbud allocator. Link: https://lkml.kernel.org/r/20221128191616.1261026-6-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU orderNhat Pham1-0/+50
This helps determines the coldest zspages as candidates for writeback. Link: https://lkml.kernel.org/r/20221128191616.1261026-5-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11zsmalloc: consolidate zs_pool's migrate_lock and size_class's locksNhat Pham1-50/+37
Currently, zsmalloc has a hierarchy of locks, which includes a pool-level migrate_lock, and a lock for each size class. We have to obtain both locks in the hotpath in most cases anyway, except for zs_malloc. This exception will no longer exist when we introduce a LRU into the zs_pool for the new writeback functionality - we will need to obtain a pool-level lock to synchronize LRU handling even in zs_malloc. In preparation for zsmalloc writeback, consolidate these locks into a single pool-level lock, which drastically reduces the complexity of synchronization in zsmalloc. We have also benchmarked the lock consolidation to see the performance effect of this change on zram. First, we ran a synthetic FS workload on a server machine with 36 cores (same machine for all runs), using fs_mark -d ../zram1mnt -s 100000 -n 2500 -t 32 -k before and after for btrfs and ext4 on zram (FS usage is 80%). Here is the result (unit is file/second): With lock consolidation (btrfs): Average: 13520.2, Median: 13531.0, Stddev: 137.5961482019028 Without lock consolidation (btrfs): Average: 13487.2, Median: 13575.0, Stddev: 309.08283679298665 With lock consolidation (ext4): Average: 16824.4, Median: 16839.0, Stddev: 89.97388510006668 Without lock consolidation (ext4) Average: 16958.0, Median: 16986.0, Stddev: 194.7370021336469 As you can see, we observe a 0.3% regression for btrfs, and a 0.9% regression for ext4. This is a small, barely measurable difference in my opinion. For a more realistic scenario, we also tries building the kernel on zram. Here is the time it takes (in seconds): With lock consolidation (btrfs): real Average: 319.6, Median: 320.0, Stddev: 0.8944271909999159 user Average: 6894.2, Median: 6895.0, Stddev: 25.528415540334656 sys Average: 521.4, Median: 522.0, Stddev: 1.51657508881031 Without lock consolidation (btrfs): real Average: 319.8, Median: 320.0, Stddev: 0.8366600265340756 user Average: 6896.6, Median: 6899.0, Stddev: 16.04057355583023 sys Average: 520.6, Median: 521.0, Stddev: 1.140175425099138 With lock consolidation (ext4): real Average: 320.0, Median: 319.0, Stddev: 1.4142135623730951 user Average: 6896.8, Median: 6878.0, Stddev: 28.621670111997307 sys Average: 521.2, Median: 521.0, Stddev: 1.7888543819998317 Without lock consolidation (ext4) real Average: 319.6, Median: 319.0, Stddev: 0.8944271909999159 user Average: 6886.2, Median: 6887.0, Stddev: 16.93221781102523 sys Average: 520.4, Median: 520.0, Stddev: 1.140175425099138 The difference is entirely within the noise of a typical run on zram. This hardly justifies the complexity of maintaining both the pool lock and the class lock. In fact, for writeback, we would need to introduce yet another lock to prevent data races on the pool's LRU, further complicating the lock handling logic. IMHO, it is just better to collapse all of these into a single pool-level lock. Link: https://lkml.kernel.org/r/20221128191616.1261026-4-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11zpool: clean out dead codeJohannes Weiner3-66/+12
There is a lot of provision for flexibility that isn't actually needed or used. Zswap (the only zpool user) always passes zpool_ops with an .evict method set. The backends who reclaim only do so for zswap, so they can also directly call zpool_ops without indirection or checks. Finally, there is no need to check the retries parameters and bail with -EINVAL in the reclaim function, when that's called just a few lines below with a hard-coded 8. There is no need to duplicate the evictable and sleep_mapped attrs from the driver in zpool_ops. Link: https://lkml.kernel.org/r/20221128191616.1261026-3-nphamcs@gmail.com Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11zswap: fix writeback lock ordering for zsmallocJohannes Weiner1-16/+19
Patch series "Implement writeback for zsmalloc", v7. Unlike other zswap allocators such as zbud or z3fold, zsmalloc currently lacks the writeback mechanism. This means that when the zswap pool is full, it will simply reject further allocations, and the pages will be written directly to swap. This series of patches implements writeback for zsmalloc. When the zswap pool becomes full, zsmalloc will attempt to evict all the compressed objects in the least-recently used zspages. This patch (of 6): zswap's customary lock order is tree->lock before pool->lock, because the tree->lock protects the entries' refcount, and the free callbacks in the backends acquire their respective pool locks to dispatch the backing object. zsmalloc's map callback takes the pool lock, so zswap must not grab the tree->lock while a handle is mapped. This currently only happens during writeback, which isn't implemented for zsmalloc. In preparation for it, move the tree->lock section out of the mapped entry section Link: https://lkml.kernel.org/r/20221128191616.1261026-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20221128191616.1261026-2-nphamcs@gmail.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/madvise: fix madvise_pageout for private file mappingsPavankumar Kondeti1-18/+35
When MADV_PAGEOUT is called on a private file mapping VMA region, we bail out early if the process is neither owner nor write capable of the file. However, this VMA may have both private/shared clean pages and private dirty pages. The opportunity of paging out the private dirty pages (Anon pages) is missed. Fix this behavior by allowing private file mappings pageout further and perform the file access check along with PageAnon() during page walk. We observe ~10% improvement in zram usage, thus leaving more available memory on a 4GB RAM system running Android. [quic_pkondeti@quicinc.com: v2] Link: https://lkml.kernel.org/r/1669962597-27724-1-git-send-email-quic_pkondeti@quicinc.com Link: https://lkml.kernel.org/r/1667971116-12900-1-git-send-email-quic_pkondeti@quicinc.com Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/khugepaged: add tracepoint to collapse_file()Gautam Menghani1-3/+4
"mm_khugepaged_collapse_file" for capturing is_shmem. Currently, is_shmem is not being captured. Capturing is_shmem is useful as it can indicate if tmpfs is being used as a backing store instead of persistent storage. Add the tracepoint in collapse_file() named "mm_khugepaged_collapse_file" for capturing is_shmem. [gautammenghani201@gmail.com: swap is_shmem and addr to save space, per Steven Rostedt] Link: https://lkml.kernel.org/r/20221202201807.182829-1-gautammenghani201@gmail.com Link: https://lkml.kernel.org/r/20221026052218.148234-1-gautammenghani201@gmail.com Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> [tracing] Cc: David Hildenbrand <david@redhat.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/gup: remove FOLL_MIGRATIONDavid Hildenbrand1-50/+5
Fortunately, the last user (KSM) is gone, so let's just remove this rather special code from generic GUP handling -- especially because KSM never required the PMD handling as KSM only deals with individual base pages. [akpm@linux-foundation.org: fix merge snafu]Link: https://lkml.kernel.org/r/20221021101141.84170-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/ksm: convert break_ksm() to use walk_page_range_vma()David Hildenbrand1-10/+39
FOLL_MIGRATION exists only for the purpose of break_ksm(), and actually, there is not even the need to wait for the migration to finish, we only want to know if we're dealing with a KSM page. Using follow_page() just to identify a KSM page overcomplicates GUP code. Let's use walk_page_range_vma() instead, because we don't actually care about the page itself, we only need to know a single property -- no need to even grab a reference. So, get rid of follow_page() usage such that we can get rid of FOLL_MIGRATION now and eventually be able to get rid of follow_page() in the future. In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in a performance degradation of ~2% (old: ~5010 MiB/s, new: ~4900 MiB/s). I don't think we particularly care for now. Interestingly, the benchmark reduction is due to the single callback. Adding a second callback (e.g., pud_entry()) reduces the benchmark by another 100-200 MiB/s. Link: https://lkml.kernel.org/r/20221021101141.84170-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/pagewalk: add walk_page_range_vma()David Hildenbrand1-0/+20
Let's add walk_page_range_vma(), which is similar to walk_page_vma(), however, is only interested in a subset of the VMA range. To be used in KSM code to stop using follow_page() next. Link: https://lkml.kernel.org/r/20221021101141.84170-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHAREDavid Hildenbrand1-7/+5
Let's stop breaking COW via a fake write fault and let's use FAULT_FLAG_UNSHARE instead. This avoids any wrong side effects of the fake write fault, such as mapping the PTE writable and marking the pte dirty/softdirty. Consequently, we will no longer trigger a fake write fault and break COW without any such side-effects. Also, this fixes KSM interaction with userfaultfd-wp: when we have a KSM page that's write-protected by userfaultfd, break_ksm()->handle_mm_fault() will fail with VM_FAULT_SIGBUS and will simply return in break_ksm() with 0 instead of actually breaking COW. For now, the KSM unmerge tests can trigger that: $ sudo ./ksm_functional_tests TAP version 13 1..3 # [RUN] test_unmerge ok 1 Pages were unmerged # [RUN] test_unmerge_discarded ok 2 Pages were unmerged # [RUN] test_unmerge_uffd_wp not ok 3 Pages were unmerged Bail out! 1 out of 3 tests failed # Planned tests != run tests (2 != 3) # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0 The warning in dmesg also indicates this wrong handling: [ 230.096368] FAULT_FLAG_ALLOW_RETRY missing 881 [ 230.100822] CPU: 1 PID: 1643 Comm: ksm-uffd-wp [...] [ 230.110124] Hardware name: [...] [ 230.117775] Call Trace: [ 230.120227] <TASK> [ 230.122334] dump_stack_lvl+0x44/0x5c [ 230.126010] handle_userfault.cold+0x14/0x19 [ 230.130281] ? tlb_finish_mmu+0x65/0x170 [ 230.134207] ? uffd_wp_range+0x65/0xa0 [ 230.137959] ? _raw_spin_unlock+0x15/0x30 [ 230.141972] ? do_wp_page+0x50/0x590 [ 230.145551] __handle_mm_fault+0x9f5/0xf50 [ 230.149652] ? mmput+0x1f/0x40 [ 230.152712] handle_mm_fault+0xb9/0x2a0 [ 230.156550] break_ksm+0x141/0x180 [ 230.159964] unmerge_ksm_pages+0x60/0x90 [ 230.163890] ksm_madvise+0x3c/0xb0 [ 230.167295] do_madvise.part.0+0x10c/0xeb0 [ 230.171396] ? do_syscall_64+0x67/0x80 [ 230.175157] __x64_sys_madvise+0x5a/0x70 [ 230.179082] do_syscall_64+0x58/0x80 [ 230.182661] ? do_syscall_64+0x67/0x80 [ 230.186413] entry_SYSCALL_64_after_hwframe+0x63/0xcd This is primarily a fix for KSM+userfaultfd-wp, however, the fake write fault was always questionable. As this fix is not easy to backport and it's not very critical, let's not cc stable. Link: https://lkml.kernel.org/r/20221021101141.84170-6-david@redhat.com Fixes: 529b930b87d9 ("userfaultfd: wp: hook userfault handler to write protection fault") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm: remove VM_FAULT_WRITEDavid Hildenbrand2-6/+5
All users -- GUP and KSM -- are gone, let's just remove it. Link: https://lkml.kernel.org/r/20221021101141.84170-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITEDavid Hildenbrand1-12/+13
Now that GUP no longer requires VM_FAULT_WRITE, break_ksm() is the sole remaining user of VM_FAULT_WRITE. As we also want to stop triggering a fake write fault and instead use FAULT_FLAG_UNSHARE -- similar to GUP-triggered unsharing when taking a R/O pin on a shared anonymous page (including KSM pages), let's stop relying on VM_FAULT_WRITE. Let's rework break_ksm() to not rely on the return value of handle_mm_fault() anymore to figure out whether COW-breaking was successful. Simply perform another follow_page() lookup to verify the result. While this makes break_ksm() slightly less efficient, we can simplify handle_mm_fault() a little and easily switch to FAULT_FLAG_UNSHARE without introducing similar KSM-specific behavior for FAULT_FLAG_UNSHARE. In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in a performance degradation of ~4% -- 5% (old: ~5250 MiB/s, new: ~5010 MiB/s). I don't think that we particularly care about that performance drop when unmerging. If it ever turns out to be an actual performance issue, we can think about a better alternative for FAULT_FLAG_UNSHARE -- let's just keep it simple for now. Link: https://lkml.kernel.org/r/20221021101141.84170-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm/pagewalk: don't trigger test_walk() in walk_page_vma()David Hildenbrand1-7/+0
As Peter points out, the caller passes a single VMA and can just do that check itself. And in fact, no existing users rely on test_walk() getting called. So let's just remove it and make the implementation slightly more efficient. Link: https://lkml.kernel.org/r/20221021101141.84170-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-10Merge tag 'mm-hotfixes-stable-2022-12-10-1' of ↵Linus Torvalds3-11/+16
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Nine hotfixes. Six for MM, three for other areas. Four of these patches address post-6.0 issues" * tag 'mm-hotfixes-stable-2022-12-10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: memcg: fix possible use-after-free in memcg_write_event_control() MAINTAINERS: update Muchun Song's email mm/gup: fix gup_pud_range() for dax mmap: fix do_brk_flags() modifying obviously incorrect VMAs mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit tmpfs: fix data loss from failed fallocate kselftests: cgroup: update kmem test precision tolerance mm: do not BUG_ON missing brk mapping, because userspace can unmap it mailmap: update Matti Vaittinen's email address
2022-12-09Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton4-13/+29
2022-12-09memcg: fix possible use-after-free in memcg_write_event_control()Tejun Heo1-2/+13
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09mm/gup: fix gup_pud_range() for daxJohn Starks1-1/+1
For dax pud, pud_huge() returns true on x86. So the function works as long as hugetlb is configured. However, dax doesn't depend on hugetlb. Commit 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") fixed devmap-backed huge PMDs, but missed devmap-backed huge PUDs. Fix this as well. This fixes the below kernel panic: general protection fault, probably for non-canonical address 0x69e7c000cc478: 0000 [#1] SMP < snip > Call Trace: <TASK> get_user_pages_fast+0x1f/0x40 iov_iter_get_pages+0xc6/0x3b0 ? mempool_alloc+0x5d/0x170 bio_iov_iter_get_pages+0x82/0x4e0 ? bvec_alloc+0x91/0xc0 ? bio_alloc_bioset+0x19a/0x2a0 blkdev_direct_IO+0x282/0x480 ? __io_complete_rw_common+0xc0/0xc0 ? filemap_range_has_page+0x82/0xc0 generic_file_direct_write+0x9d/0x1a0 ? inode_update_time+0x24/0x30 __generic_file_write_iter+0xbd/0x1e0 blkdev_write_iter+0xb4/0x150 ? io_import_iovec+0x8d/0x340 io_write+0xf9/0x300 io_issue_sqe+0x3c3/0x1d30 ? sysvec_reschedule_ipi+0x6c/0x80 __io_queue_sqe+0x33/0x240 ? fget+0x76/0xa0 io_submit_sqes+0xe6a/0x18d0 ? __fget_light+0xd1/0x100 __x64_sys_io_uring_enter+0x199/0x880 ? __context_tracking_enter+0x1f/0x70 ? irqentry_exit_to_user_mode+0x24/0x30 ? irqentry_exit+0x1d/0x30 ? __context_tracking_exit+0xe/0x70 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7fc97c11a7be < snip > </TASK> ---[ end trace 48b2e0e67debcaeb ]--- RIP: 0010:internal_get_user_pages_fast+0x340/0x990 < snip > Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: https://lkml.kernel.org/r/1670392853-28252-1-git-send-email-ssengar@linux.microsoft.com Fixes: 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") Signed-off-by: John Starks <jostarks@microsoft.com> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Cc: Jan Kara <jack@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09mmap: fix do_brk_flags() modifying obviously incorrect VMAsLiam Howlett1-8/+3
Add more sanity checks to the VMA that do_brk_flags() will expand. Ensure the VMA matches basic merge requirements within the function before calling can_vma_merge_after(). Drop the duplicate checks from vm_brk_flags() since they will be enforced later. The old code would expand file VMAs on brk(), which is functionally wrong and also dangerous in terms of locking because the brk() path isn't designed for file VMAs and therefore doesn't lock the file mapping. Checking can_vma_merge_after() ensures that new anonymous VMAs can't be merged into file VMAs. See https://lore.kernel.org/linux-mm/CAG48ez1tJZTOjS_FjRZhvtDA-STFmdw8PEizPDwMGFd_ui0Nrw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20221205192304.1957418-1-Liam.Howlett@oracle.com Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09tmpfs: fix data loss from failed fallocateHugh Dickins1-0/+11
Fix tmpfs data loss when the fallocate system call is interrupted by a signal, or fails for some other reason. The partial folio handling in shmem_undo_range() forgot to consider this unfalloc case, and was liable to erase or truncate out data which had already been committed earlier. It turns out that none of the partial folio handling there is appropriate for the unfalloc case, which just wants to proceed to removal of whole folios: which find_get_entries() provides, even when partially covered. Original patch by Rui Wang. Link: https://lore.kernel.org/linux-mm/33b85d82.7764.1842e9ab207.Coremail.chenguoqic@163.com/ Link: https://lkml.kernel.org/r/a5dac112-cf4b-7af-a33-f386e347fd38@google.com Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Guoqi Chen <chenguoqic@163.com> Link: https://lore.kernel.org/all/20221101032248.819360-1-kernel@hev.cc/ Cc: Rui Wang <kernel@hev.cc> Cc: Huacai Chen <chenhuacai@loongson.cn> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09mm: do not BUG_ON missing brk mapping, because userspace can unmap itJason A. Donenfeld1-2/+1
The following program will trigger the BUG_ON that this patch removes, because the user can munmap() mm->brk: #include <sys/syscall.h> #include <sys/mman.h> #include <assert.h> #include <unistd.h> static void *brk_now(void) { return (void *)syscall(SYS_brk, 0); } static void brk_set(void *b) { assert(syscall(SYS_brk, b) != -1); } int main(int argc, char *argv[]) { void *b = brk_now(); brk_set(b + 4096); assert(munmap(b - 4096, 4096 * 2) == 0); brk_set(b); return 0; } Compile that with musl, since glibc actually uses brk(), and then execute it, and it'll hit this splat: kernel BUG at mm/mmap.c:229! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 12 PID: 1379 Comm: a.out Tainted: G S U 6.1.0-rc7+ #419 RIP: 0010:__do_sys_brk+0x2fc/0x340 Code: 00 00 4c 89 ef e8 04 d3 fe ff eb 9a be 01 00 00 00 4c 89 ff e8 35 e0 fe ff e9 6e ff ff ff 4d 89 a7 20> RSP: 0018:ffff888140bc7eb0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000007e7000 RCX: ffff8881020fe000 RDX: ffff8881020fe001 RSI: ffff8881955c9b00 RDI: ffff8881955c9b08 RBP: 0000000000000000 R08: ffff8881955c9b00 R09: 00007ffc77844000 R10: 0000000000000000 R11: 0000000000000001 R12: 00000000007e8000 R13: 00000000007e8000 R14: 00000000007e7000 R15: ffff8881020fe000 FS: 0000000000604298(0000) GS:ffff88901f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000603fe0 CR3: 000000015ba9a005 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x400678 Code: 10 4c 8d 41 08 4c 89 44 24 10 4c 8b 01 8b 4c 24 08 83 f9 2f 77 0a 4c 8d 4c 24 20 4c 01 c9 eb 05 48 8b> RSP: 002b:00007ffc77863890 EFLAGS: 00000212 ORIG_RAX: 000000000000000c RAX: ffffffffffffffda RBX: 000000000040031b RCX: 0000000000400678 RDX: 00000000004006a1 RSI: 00000000007e6000 RDI: 00000000007e7000 RBP: 00007ffc77863900 R08: 0000000000000000 R09: 00000000007e6000 R10: 00007ffc77863930 R11: 0000000000000212 R12: 00007ffc77863978 R13: 00007ffc77863988 R14: 0000000000000000 R15: 0000000000000000 </TASK> Instead, just return the old brk value if the original mapping has been removed. [akpm@linux-foundation.org: fix changelog, per Liam] Link: https://lkml.kernel.org/r/20221202162724.2009-1-Jason@zx2c4.com Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09Merge tag 'kvmarm-6.2' of ↵Paolo Bonzini2-1/+10
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.2 - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on. - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Add/Enable/Fix a bunch of selftests covering memslots, breakpoints, stage-2 faults and access tracking. You name it, we got it, we probably broke it. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. As a side effect, this tag also drags: - The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring series - A shared branch with the arm64 tree that repaints all the system registers to match the ARM ARM's naming, and resulting in interesting conflicts
2022-12-08mm: export buffer_migrate_folio_norefs()Jan Kara1-0/+1
Ext4 needs this function to allow safe migration for journalled data pages. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221207112722.22220-11-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-12-08memcg: Fix possible use-after-free in memcg_write_event_control()Tejun Heo1-2/+13
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Cc: stable@kernel.org # v3.14+ Reported-by: Jann Horn <jannh@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-04Revert "mm: align larger anonymous mappings on THP boundaries"Linus Torvalds1-3/+0
This reverts commit f35b5d7d676e59e401690b678cd3cfec5e785c23. It has been reported to cause huge performance regressions on some loads (will-it-scale.per_process_ops, but also building the kernel with clang). The commit did speed up gcc builds by a small amount, so it's not an unambiguous regression, but until the big regressions are understood, let's revert it. Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/r/202210181535.7144dd15-yujie.liu@intel.com Reported-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/lkml/Y1DNQaoPWxE%2BrGce@dev-arch.thelio-3990X/ Cc: Huang, Ying <ying.huang@intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-02Merge tag 'mm-hotfixes-stable-2022-12-02' of ↵Linus Torvalds8-52/+150
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "15 hotfixes, 11 marked cc:stable. Only three or four of the latter address post-6.0 issues, which is hopefully a sign that things are converging" * tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible" Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths mm/khugepaged: fix GUP-fast interaction by sending IPI mm/khugepaged: take the right locks for page table retraction mm: migrate: fix THP's mapcount on isolation mm: introduce arch_has_hw_nonleaf_pmd_young() mm: add dummy pmd_young() for architectures not having it mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes() tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep" nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry() hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing madvise: use zap_page_range_single for madvise dontneed mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
2022-12-02panic: Consolidate open-coded panic_on_warn checksKees Cook2-4/+3
Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their own warnings, and each check "panic_on_warn". Consolidate this into a single function so that future instrumentation can be added in a single location. Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Jann Horn <jannh@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org
2022-12-01mm: Make ksize() a reporting-only functionKees Cook2-22/+23
With all "silently resizing" callers of ksize() refactored, remove the logic in ksize() that would allow it to be used to effectively change the size of an allocation (bypassing __alloc_size hints, etc). Users wanting this feature need to either use kmalloc_size_roundup() before an allocation, or use krealloc() directly. For kfree_sensitive(), move the unpoisoning logic inline. Replace the some of the partially open-coded ksize() in __do_krealloc with ksize() now that it doesn't perform unpoisoning. Adjust the KUnit tests to match the new ksize() behavior. Execution tested with: $ ./tools/testing/kunit/kunit.py run \ --kconfig_add CONFIG_KASAN=y \ --kconfig_add CONFIG_KASAN_GENERIC=y \ --arch x86_64 kasan Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: linux-mm@kvack.org Cc: kasan-dev@googlegroups.com Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Enhanced-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2022-11-30mm/memory-failure.c: cleanup in unpoison_memoryMa Wupeng1-4/+2
If freeit is true, the value of ret must be zero, there is no need to check the value of freeit after label unlock_mutex. We can drop variable freeit to do this cleanup. Link: https://lkml.kernel.org/r/20221125065444.3462681-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: zhenwei pi <pizhenwei@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/thp: re-apply mkdirty for small pages after splitPeter Xu1-6/+8
We used to have 624a2c94f5b7 (Partly revert "mm/thp: carry over dirty bit when thp splits on pmd") fixing the regression reported here by Anatoly Pugachev on sparc64: https://lore.kernel.org/r/20221021160603.GA23307@u164.east.ru Where we temporarily ignored the dirty bit for small pages. Then, Hev also reported similar issue on loongarch: (the original mail was private, but Anatoly copied the list here) https://lore.kernel.org/r/CADxRZqxqb7f_WhMh=jweZP+ynf_JwGd-0VwbYgp4P+T0-AXosw@mail.gmail.com Hev pointed out that the issue is having HW write bit set within the pte_mkdirty() so the split pte can be written after split even if e.g. they were shared by more than one processes, causing data corrupt. Hev also tried to explain why loongarch set HW write bit in mkdirty: https://lore.kernel.org/r/CAHirt9itKO_K_HPboXh5AyJtt16Zf0cD73PtHvM=na39u_ztxA@mail.gmail.com One way to fix it is as what Huacai proposed here for loongarch (then we can re-apply the dirty bit in thp split): https://lore.kernel.org/r/20221117042532.4064448-1-chenhuacai@loongson.cnn We may need similar thing for sparc64, though. For now since we've found the root cause of the dirty bit issue the simpler solution (which won't lose the dirty bit for small) that will work for both is we wr-protect after pte_mkdirty(), so the HW write bit can be persistent after thp split. Add a comment for wrprotect, so we will not mess up the ordering later. With 624a2c94f5b7 (Partly revert "mm/thp: carry over dirty bit when thp splits on pmd") this is not a fix anymore, but just brings back the dirty bit for thp split safely, so we re-apply the optimization but in safe way. Provide a Tested-by credit to Hev too (not the exact same patch but the same outcome) for loongarch. Link: https://lkml.kernel.org/r/20221125185857.3110155-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Tested-by: Hev <r@hev.cc> # loongarch Cc: Anatoly Pugachev <matorola@gmail.com> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: vmscan: use sysfs_emit() to instead of scnprintf()Xu Panda1-1/+1
Replace open-coded snprintf() with sysfs_emit() to simplify the code. Link: https://lkml.kernel.org/r/202211241929015476424@zte.com.cn Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Signed-off-by: Yang Yang <yang.yang29@zte.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30zswap: do not allocate from atomic poolSergey Senozhatsky2-2/+9
zswap_frontswap_load() should be called from preemptible context (we even call mutex_lock() there) and it does not look like we need to do GFP_ATOMIC allocaion for temp buffer. The same applies to zswap_writeback_entry(). Use GFP_KERNEL for temporary buffer allocation in both cases. Link: https://lkml.kernel.org/r/Y3xCTr6ikbtcUr/y@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm, compaction: fix fast_isolate_around() to stay within boundariesNARIBAYASHI Akira1-13/+5
Depending on the memory configuration, isolate_freepages_block() may scan pages out of the target range and causes panic. Panic can occur on systems with multiple zones in a single pageblock. The reason it is rare is that it only happens in special configurations. Depending on how many similar systems there are, it may be a good idea to fix this problem for older kernels as well. The problem is that pfn as argument of fast_isolate_around() could be out of the target range. Therefore we should consider the case where pfn < start_pfn, and also the case where end_pfn < pfn. This problem should have been addressd by the commit 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone") but there was an oversight. Case1: pfn < start_pfn <at memory compaction for node Y> | node X's zone | node Y's zone +-----------------+------------------------------... pageblock ^ ^ ^ +-----------+-----------+-----------+-----------+... ^ ^ ^ ^ ^ end_pfn ^ start_pfn = cc->zone->zone_start_pfn pfn <---------> scanned range by "Scan After" Case2: end_pfn < pfn <at memory compaction for node X> | node X's zone | node Y's zone +-----------------+------------------------------... pageblock ^ ^ ^ +-----------+-----------+-----------+-----------+... ^ ^ ^ ^ ^ pfn ^ end_pfn start_pfn <---------> scanned range by "Scan Before" It seems that there is no good reason to skip nr_isolated pages just after given pfn. So let perform simple scan from start to end instead of dividing the scan into "Before" and "After". Link: https://lkml.kernel.org/r/20221026112438.236336-1-a.naribayashi@fujitsu.com Fixes: 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone"). Signed-off-by: NARIBAYASHI Akira <a.naribayashi@fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: add /sys/class/bdi/<bdi>/min_ratio_fine knobStefan Roesch1-0/+20
This adds the min_ratio_fine knob. The knob specifies the values not based on 1 of 100, but instead 1 per million. Link: https://lkml.kernel.org/r/20221119005215.3052436-20-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Cc: Chris Mason <clm@meta.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: add bdi_set_min_ratio_no_scale() functionStefan Roesch1-0/+7
This introduces bdi_set_min_ratio_no_scale(). It uses the max granularity for the ratio. This function by the new sysfs knob min_ratio_fine. Link: https://lkml.kernel.org/r/20221119005215.3052436-19-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Cc: Chris Mason <clm@meta.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: add /sys/class/bdi/<bdi>/max_ratio_fine knobStefan Roesch1-0/+20
This adds the max_ratio_fine knob. The knob specifies the values not based on 1 of 100, but instead 1 per million. Link: https://lkml.kernel.org/r/20221119005215.3052436-17-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Cc: Chris Mason <clm@meta.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: add bdi_set_max_ratio_no_scale() functionStefan Roesch1-3/+8
This introduces bdi_set_max_ratio_no_scale(). It uses the max granularity for the ratio. This function by the new sysfs knob max_ratio_fine. Link: https://lkml.kernel.org/r/20221119005215.3052436-16-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Cc: Chris Mason <clm@meta.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: add /sys/class/bdi/<bdi>/min_bytes knobStefan Roesch1-0/+29
bdi has two existing knobs to limit the amount of dirty memory: min_ratio and max_ratio. However the granularity of the knobs is limited and often it is more convenient to specify limits in terms of bytes. This change adds the min_bytes knob. It does not store the min_bytes value, instead it converts the max_bytes value to a ratio. The value is therefore more an approximation than an absolute value. It also maintains the sum over all the bdi min_ratio values stored in the variable bdi_min_ratio. Link: https://lkml.kernel.org/r/20221119005215.3052436-14-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Cc: Chris Mason <clm@meta.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: add bdi_set_min_bytes() functionStefan Roesch1-0/+14
This introduces the bdi_set_min_bytes() function. The min_bytes function does not store the min_bytes value. Instead it converts the min_bytes value into the corresponding ratio value. Link: https://lkml.kernel.org/r/20221119005215.3052436-13-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Cc: Chris Mason <clm@meta.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>