<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm, branch v5.1-rc6</title>
<subtitle>Linux Kernel (branches are rebased on master from time to time)</subtitle>
<id>https://sre.ring0.de/linux/atom?h=v5.1-rc6</id>
<link rel='self' href='https://sre.ring0.de/linux/atom?h=v5.1-rc6'/>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/'/>
<updated>2019-04-19T22:37:22Z</updated>
<entry>
<title>Merge branch 'for-5.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu</title>
<updated>2019-04-19T22:37:22Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-04-19T22:37:22Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=4c3f49ae1306c05e91211c06feddfd0a4a57fabd'/>
<id>urn:sha1:4c3f49ae1306c05e91211c06feddfd0a4a57fabd</id>
<content type='text'>
Pull percpu fixlet from Dennis Zhou:
 "This stops printing the base address of percpu memory on
  initialization"

* 'for-5.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu: stop printing kernel addresses
</content>
</entry>
<entry>
<title>coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping</title>
<updated>2019-04-19T16:46:05Z</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2019-04-19T00:50:52Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=04f5866e41fb70690e28397487d8bd8eea7d712a'/>
<id>urn:sha1:04f5866e41fb70690e28397487d8bd8eea7d712a</id>
<content type='text'>
The core dumping code has always run without holding the mmap_sem for
writing, even though that is the only way to ensure that the entire vma
layout will not change from under it.  Relying only on some signal
serialization of the processes belonging to the mm is not nearly enough.
This was pointed out earlier.  For example in Hugh's post from Jul 2017:

  https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

  "Not strictly relevant here, but a related note: I was very surprised
   to discover, only quite recently, how handle_mm_fault() may be called
   without down_read(mmap_sem) - when core dumping. That seems a
   misguided optimization to me, which would also be nice to correct"

In particular, because growsdown and growsup faults can move
vm_start/vm_end, the various loops the core dump runs over the vmas
will not see a consistent layout if page faults can happen concurrently.

Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.

Taking the mmap_sem for writing around the -&gt;core_dump invocation is a
viable long term fix, but it requires removing all copy-user and
page-fault paths and replacing them with get_dump_page() for all binary
formats, which makes it unsuitable as a short term fix.

For the time being, this solution manually covers the places that can
confuse the core dump by altering either the vma layout or the vma
flags while it runs.  Once -&gt;core_dump runs under mmap_sem for writing,
the mmget_still_valid() function can be dropped.

Allowing mmap_sem-protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems safe enough, since it never mangles any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner cases.

In order to facilitate backporting, I added "Fixes: 86039bd3b4e6";
however, the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2, so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.

Because find_extend_vma() is the only place outside of process context
that can modify the "mm" structures under mmap_sem for reading, adding
the mmget_still_valid() check to it covers all other cases: anything
else that takes the mmap_sem for reading needs no new check after
mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.
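
The check itself is trivial - mm-&gt;core_state is only set while a core
dump is in flight - and the caller pattern looks roughly like this (a
sketch, not a literal excerpt; the helper lives in
include/linux/sched/mm.h):

  static inline bool mmget_still_valid(struct mm_struct *mm)
  {
          return likely(!mm-&gt;core_state);
  }

  /* after mmget_not_zero()/get_task_mm(): */
  down_write(&amp;mm-&gt;mmap_sem);
  if (mmget_still_valid(mm)) {
          /* safe to change the vma layout or the vma flags */
  }
  up_write(&amp;mm-&gt;mmap_sem);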

Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Suggested-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: Jann Horn &lt;jannh@google.com&gt;
Acked-by: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/kmemleak.c: fix unused-function warning</title>
<updated>2019-04-19T16:46:05Z</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2019-04-19T00:50:48Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=dce5b0bdeec61bdbee56121ceb1d014151d5cab1'/>
<id>urn:sha1:dce5b0bdeec61bdbee56121ceb1d014151d5cab1</id>
<content type='text'>
The only references outside of the #ifdef have been removed, so now we
get a warning in non-SMP configurations:

  mm/kmemleak.c:1404:13: error: unused function 'scan_large_block' [-Werror,-Wunused-function]

Add a new #ifdef around it.
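
The fix simply guards the function the same way as its one remaining
caller, roughly (a sketch; the function body itself is unchanged in
mm/kmemleak.c):

  #ifdef CONFIG_SMP
  static void scan_large_block(void *start, void *end)
  {
          ...
  }
  #endif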

Link: http://lkml.kernel.org/r/20190416123148.3502045-1-arnd@arndb.de
Fixes: 298a32b13208 ("kmemleak: powerpc: skip scanning holes in the .bss section")
Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Vincent Whitchurch &lt;vincent.whitchurch@axis.com&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix inactive list balancing between NUMA nodes and cgroups</title>
<updated>2019-04-19T16:46:05Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2019-04-19T00:50:34Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=3b991208b897f52507168374033771a984b947b1'/>
<id>urn:sha1:3b991208b897f52507168374033771a984b947b1</id>
<content type='text'>
During !CONFIG_CGROUP reclaim, we expand the inactive list size if it's
thrashing on the node that is about to be reclaimed.  But when cgroups
are enabled, we suddenly ignore the node scope and use the cgroup scope
only.  The result is that pressure bleeds between NUMA nodes depending
on whether cgroups are merely compiled into Linux.  This behavioral
difference is unexpected and undesirable.

When the refault adaptivity of the inactive list was first introduced,
there were no statistics at the lruvec level - the intersection of node
and memcg - so it was better than nothing.

But now that we have that infrastructure, use lruvec_page_state() to
make the list balancing decision always NUMA aware.
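
Sketched, the balancing decision now always reads the counters at the
lruvec level, whether or not cgroups are enabled (the call sites in
mm/vmscan.c are approximated here):

  struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);

  refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
  inactive = lruvec_page_state(lruvec, NR_INACTIVE_FILE);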

[hannes@cmpxchg.org: fix bisection hole]
  Link: http://lkml.kernel.org/r/20190417155241.GB23013@cmpxchg.org
Link: http://lkml.kernel.org/r/20190412144438.2645-1-hannes@cmpxchg.org
Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Roman Gushchin &lt;guro@fb.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/hotplug: treat CMA pages as unmovable</title>
<updated>2019-04-19T16:46:05Z</updated>
<author>
<name>Qian Cai</name>
<email>cai@lca.pw</email>
</author>
<published>2019-04-19T00:50:30Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=1a9f219157b22d0ffb340a9c5f431afd02cd2cf3'/>
<id>urn:sha1:1a9f219157b22d0ffb340a9c5f431afd02cd2cf3</id>
<content type='text'>
has_unmovable_pages() is used when allocating CMA and gigantic pages,
as well as by memory hotplug.  The latter doesn't know how to offline
the CMA pool properly now, but if an unused (free) CMA page is
encountered, then has_unmovable_pages() happily considers it free
memory and propagates this up the call chain.  Memory offlining code
then frees the page without a proper CMA tear down, which leads to
accounting issues.  Moreover, if the same memory range is onlined
again, the memory never gets back to the CMA pool.

State after memory offline:

 # grep cma /proc/vmstat
 nr_free_cma 205824

 # cat /sys/kernel/debug/cma/cma-kvm_cma/count
 209920

Also, kmemleak still thinks those memory addresses are reserved (see
the trace below) even though they have already been handed to the buddy
allocator after onlining.  This patch fixes the situation by treating
CMA pageblocks as unmovable, except when has_unmovable_pages() is
called as part of a CMA allocation.

  Offlined Pages 4096
  kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing)
  Call Trace:
    dump_stack+0xb0/0xf4 (unreliable)
    create_object+0x344/0x380
    __kmalloc_node+0x3ec/0x860
    kvmalloc_node+0x58/0x110
    seq_read+0x41c/0x620
    __vfs_read+0x3c/0x70
    vfs_read+0xbc/0x1a0
    ksys_read+0x7c/0x140
    system_call+0x5c/0x70
  kmemleak: Kernel memory leak detector disabled
  kmemleak: Object 0xc000201cc8000000 (size 13757317120):
  kmemleak:   comm "swapper/0", pid 0, jiffies 4294937297
  kmemleak:   min_count = -1
  kmemleak:   count = 0
  kmemleak:   flags = 0x5
  kmemleak:   checksum = 0
  kmemleak:   backtrace:
       cma_declare_contiguous+0x2a4/0x3b0
       kvm_cma_reserve+0x11c/0x134
       setup_arch+0x300/0x3f8
       start_kernel+0x9c/0x6e8
       start_here_common+0x1c/0x4b0
  kmemleak: Automatic memory scanning thread ended
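
The gist of the fix in has_unmovable_pages(), sketched (mm/page_alloc.c;
the migratetype argument tells whether the caller is the CMA allocator
itself):

  if (is_migrate_cma_page(page)) {
          /*
           * CMA allocations (alloc_contig_range) can deal with their
           * own CMA pageblocks; everybody else must treat them as
           * unmovable.
           */
          if (is_migrate_cma(migratetype))
                  return false;

          goto unmovable;
  }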

[cai@lca.pw: use is_migrate_cma_page() and update commit log]
  Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw
Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw
Signed-off-by: Qian Cai &lt;cai@lca.pw&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n</title>
<updated>2019-04-19T16:46:04Z</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>khlebnikov@yandex-team.ru</email>
</author>
<published>2019-04-19T00:50:20Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=e8277b3b52240ec1caad8e6df278863e4bf42eac'/>
<id>urn:sha1:e8277b3b52240ec1caad8e6df278863e4bf42eac</id>
<content type='text'>
Commit 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
depends on skipping vmstat entries with empty name introduced in
7aaf77272358 ("mm: don't show nr_indirectly_reclaimable in
/proc/vmstat") but reverted in b29940c1abd7 ("mm: rename and change
semantics of nr_indirectly_reclaimable_bytes").

So the skipping no longer works, and /proc/vmstat contains misformatted
lines " 0".

This patch simply shows the debug counters "nr_tlb_remote_*" on UP as
well.
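
In effect, the "nr_tlb_remote_*" names in vmstat_text[] are no longer
conditional on CONFIG_SMP, roughly (a sketch of the resulting state in
mm/vmstat.c):

  #ifdef CONFIG_DEBUG_TLBFLUSH
          "nr_tlb_remote_flush",
          "nr_tlb_remote_flush_received",
          "nr_tlb_local_flush_all",
          "nr_tlb_local_flush_one",
  #endif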

Link: http://lkml.kernel.org/r/155481488468.467.4295519102880913454.stgit@buzz
Fixes: 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Roman Gushchin &lt;guro@fb.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swapoff: shmem_unuse() stop eviction without igrab()</title>
<updated>2019-04-19T16:46:04Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2019-04-19T00:50:13Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=af53d3e9e04024885de5b4fda51e5fa362ae2bd8'/>
<id>urn:sha1:af53d3e9e04024885de5b4fda51e5fa362ae2bd8</id>
<content type='text'>
The igrab() in shmem_unuse() looks good, but we forgot that it gives no
protection against concurrent unmounting: a point made by Konstantin
Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
("tmpfs: fix race between umount and swapoff").  The current 5.1-rc
swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
Self-destruct in 5 seconds.  Have a nice day..." followed by GPF.

Once again, give up on using igrab(); but don't go back to making such
heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
the new design, and I expect could deadlock inside shmem_swapin_page().

Instead, have shmem_unuse() just raise a "stop_eviction" count in the
shmem-specific inode, and have shmem_evict_inode() wait for that count
to go down to 0.  Call it "stop_eviction" rather than "swapoff_busy"
because it can be put to other uses later (the huge tmpfs patches
expect to use it).

That simplifies shmem_unuse(), protecting it from both unlink and
unmount; and in practice lets it locate all the swap in its first try.
But do not rely on that: there's still a theoretical case, when
shmem_writepage() might have been preempted after its get_swap_page(),
before making the swap entry visible to swapoff.
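
The two sides of the handshake, sketched (mm/shmem.c; arguments
abbreviated):

  /* shmem_unuse(), for each inode on shmem_swaplist */
  atomic_inc(&amp;info-&gt;stop_eviction);
  mutex_unlock(&amp;shmem_swaplist_mutex);
  error = shmem_unuse_inode(&amp;info-&gt;vfs_inode, type, ...);
  mutex_lock(&amp;shmem_swaplist_mutex);
  if (atomic_dec_and_test(&amp;info-&gt;stop_eviction))
          wake_up_var(&amp;info-&gt;stop_eviction);

  /* shmem_evict_inode() */
  wait_var_event(&amp;info-&gt;stop_eviction,
                 !atomic_read(&amp;info-&gt;stop_eviction));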

[hughd@google.com: remove incorrect list_del()]
  Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904091133570.1898@eggly.anvils
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081259400.1523@eggly.anvils
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: "Alex Xu (Hello71)" &lt;alex_y_xu@yahoo.ca&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Kelley Nielsen &lt;kelleynnn@gmail.com&gt;
Cc: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Vineeth Pillai &lt;vpillai@digitalocean.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swapoff: take notice of completion sooner</title>
<updated>2019-04-19T16:46:04Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2019-04-19T00:50:09Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=64165b1affc5bc16231ac971e66aae7d68d57f2c'/>
<id>urn:sha1:64165b1affc5bc16231ac971e66aae7d68d57f2c</id>
<content type='text'>
The old try_to_unuse() implementation was driven by find_next_to_unuse(),
which terminated as soon as all the swap had been freed.

Add inuse_pages checks now (alongside signal_pending()) to stop scanning
mms and swap_map once finished.

The same ought to be done in shmem_unuse() too, but never was before,
and needs a different interface: so leave it as is for now.
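
The shape of the change on the mmlist walk, sketched (mm/swapfile.c;
approximate):

  p = &amp;init_mm.mmlist;
  while (si-&gt;inuse_pages &amp;&amp;
         !signal_pending(current) &amp;&amp;
         (p = p-&gt;next) != &amp;init_mm.mmlist) {
          ...
  }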

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081258200.1523@eggly.anvils
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: "Alex Xu (Hello71)" &lt;alex_y_xu@yahoo.ca&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Kelley Nielsen &lt;kelleynnn@gmail.com&gt;
Cc: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Vineeth Pillai &lt;vpillai@digitalocean.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES</title>
<updated>2019-04-19T16:46:04Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2019-04-19T00:50:02Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=dd862deb151aad2548e72b077a82ad3aa91b715f'/>
<id>urn:sha1:dd862deb151aad2548e72b077a82ad3aa91b715f</id>
<content type='text'>
SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but
further testing has proved it to be a source of unnecessary swapoff
EBUSY failures (which can then be followed by unmount EBUSY failures).

When mmget_not_zero() or shmem's igrab() fails, there is an mm exiting
or an inode being evicted, freeing up swap independently of
try_to_unuse().  Those typically completed much sooner than the old
quadratic swapoff, but now it's more common that swapoff may need to
wait for them.

It's possible to move those cases from init_mm.mmlist and shmem_swaplist
to separate "exiting" swaplists, and have try_to_unuse() wait for those
lists to be emptied; but we've not bothered with that in the past, and
don't want to risk missing some other forgotten case.  So just revert to
cycling around until the swap is gone, without any retries limit.
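
The tail of try_to_unuse() then reduces to roughly this (a sketch of
the resulting behaviour, not the literal code):

  retry:
          ...unuse the mms and the shmem swaplist...

          if (si-&gt;inuse_pages) {
                  if (signal_pending(current))
                          return -EINTR;
                  /* no retries limit: cycle until the swap is gone */
                  goto retry;
          }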

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081256170.1523@eggly.anvils
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: "Alex Xu (Hello71)" &lt;alex_y_xu@yahoo.ca&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Kelley Nielsen &lt;kelleynnn@gmail.com&gt;
Cc: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Vineeth Pillai &lt;vpillai@digitalocean.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swapoff: shmem_find_swap_entries() filter out other types</title>
<updated>2019-04-19T16:46:04Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2019-04-19T00:49:58Z</published>
<link rel='alternate' type='text/html' href='https://sre.ring0.de/linux/commit/?id=87039546544479d4bedb19d0ea525270c43c1c9b'/>
<id>urn:sha1:87039546544479d4bedb19d0ea525270c43c1c9b</id>
<content type='text'>
Swapfile "type" was passed all the way down to shmem_unuse_inode(), but
then forgotten from shmem_find_swap_entries(): with the result that
removing one swapfile would try to free up all the swap from shmem - no
problem when only one swapfile anyway, but counter-productive when more,
causing swapoff to be unnecessarily OOM-killed when it should succeed.
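
The missing filter, sketched (in shmem_find_swap_entries();
swp_type() recovers the swapfile type from a swap entry):

  entry = radix_to_swp_entry(page);
  if (swp_type(entry) != type)
          continue;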

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081254470.1523@eggly.anvils
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Cc: "Alex Xu (Hello71)" &lt;alex_y_xu@yahoo.ca&gt;
Cc: Vineeth Pillai &lt;vpillai@digitalocean.com&gt;
Cc: Kelley Nielsen &lt;kelleynnn@gmail.com&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
