summaryrefslogtreecommitdiffstats
path: root/mm/mmap.c
AgeCommit message (Collapse)AuthorFilesLines
2016-05-27mm: remove more IS_ERR_VALUE abusesLinus Torvalds1-8/+8
The do_brk() and vm_brk() return value was "unsigned long" and returned the starting address on success, and an error value on failure. The reasons are entirely historical, and go back to it basically behaving like the mmap() interface does. However, nobody actually wanted that interface, and it causes totally pointless IS_ERR_VALUE() confusion. What every single caller actually wants is just the simpler integer return of zero for success and negative error number on failure. So just convert to that much clearer and more common calling convention, and get rid of all the IS_ERR_VALUE() uses wrt vm_brk(). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23mm: make vm_brk killableMichal Hocko1-6/+3
Now that all the callers handle vm_brk failure we can change it wait for mmap_sem killable to help oom_reaper to not get blocked just because vm_brk gets blocked behind mmap_sem readers. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23mm: make vm_munmap killableMichal Hocko1-5/+3
Almost all current users of vm_munmap are ignoring the return value and so they do not handle potential error. This means that some VMAs might stay behind. This patch doesn't try to solve those potential problems. Quite contrary it adds a new failure mode by using down_write_killable in vm_munmap. This should be safer than other failure modes, though, because the process is guaranteed to die as soon as it leaves the kernel and exit_mmap will clean the whole address space. This will help in the OOM conditions when the oom victim might be stuck waiting for the mmap_sem for write which in turn can block oom_reaper which relies on the mmap_sem for read to make a forward progress and reclaim the address space of the victim. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23mm: make vm_mmap killableMichal Hocko1-1/+1
All the callers of vm_mmap seem to check for the failure already and bail out in one way or another on the error which means that we can change it to use killable version of vm_mmap_pgoff and return -EINTR if the current task gets killed while waiting for mmap_sem. This also means that vm_mmap_pgoff can be killable by default and drop the additional parameter. This will help in the OOM conditions when the oom victim might be stuck waiting for the mmap_sem for write which in turn can block oom_reaper which relies on the mmap_sem for read to make a forward progress and reclaim the address space of the victim. Please note that load_elf_binary is ignoring vm_mmap error for current->personality & MMAP_PAGE_ZERO case but that shouldn't be a problem because the address is not used anywhere and we never return to the userspace if we got killed. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23mm: make mmap_sem for write waits killable for mm syscallsMichal Hocko1-4/+23
This is a follow up work for oom_reaper [1]. As the async OOM killing depends on oom_sem for read we would really appreciate if a holder for write didn't stood in the way. This patchset is changing many of down_write calls to be killable to help those cases when the writer is blocked and waiting for readers to release the lock and so help __oom_reap_task to process the oom victim. Most of the patches are really trivial because the lock is help from a shallow syscall paths where we can return EINTR trivially and allow the current task to die (note that EINTR will never get to the userspace as the task has fatal signal pending). Others seem to be easy as well as the callers are already handling fatal errors and bail and return to userspace which should be sufficient to handle the failure gracefully. I am not familiar with all those code paths so a deeper review is really appreciated. As this work is touching more areas which are not directly connected I have tried to keep the CC list as small as possible and people who I believed would be familiar are CCed only to the specific patches (all should have received the cover though). This patchset is based on linux-next and it depends on down_write_killable for rw_semaphores which got merged into tip locking/rwsem branch and it is merged into this next tree. I guess it would be easiest to route these patches via mmotm because of the dependency on the tip tree but if respective maintainers prefer other way I have no objections. I haven't covered all the mmap_write(mm->mmap_sem) instances here $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l 98 $ git grep "down_write(.*\<mmap_sem\>)" | wc -l 62 I have tried to cover those which should be relatively easy to review in this series because this alone should be a nice improvement. Other places can be changed on top. [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org This patch (of 18): This is the first step in making mmap_sem write waiters killable. It focuses on the trivial ones which are taking the lock early after entering the syscall and they are not changing state before. Therefore it is very easy to change them to use down_write_killable and immediately return with -EINTR. This will allow the waiter to pass away without blocking the mmap_sem which might be required to make a forward progress. E.g. the oom reaper will need the lock for reading to dismantle the OOM victim address space. The only tricky function in this patch is vm_mmap_pgoff which has many call sites via vm_mmap. To reduce the risk keep vm_mmap with the original non-killable semantic for now. vm_munmap callers do not bother checking the return value so open code it into the munmap syscall path for now for simplicity. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm: enable RLIMIT_DATA by default with workaround for valgrindKonstantin Khlebnikov1-4/+8
Since commit 84638335900f ("mm: rework virtual memory accounting") RLIMIT_DATA limits both brk() and private mmap() but this's disabled by default because of incompatibility with older versions of valgrind. Valgrind always set limit to zero and fails if RLIMIT_DATA is enabled. Fortunately it changes only rlim_cur and keeps rlim_max for reverting limit back when needed. This patch checks current usage also against rlim_max if rlim_cur is zero. This is safe because task anyway can increase rlim_cur up to rlim_max. Size of brk is still checked against rlim_cur, so this part is completely compatible - zero rlim_cur forbids brk() but allows private mmap(). Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19mm/mmap: kill hook arch_rebalance_pgtables()Konstantin Khlebnikov1-5/+0
Nobody uses it. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-20Merge branch 'mm-pkeys-for-linus' of ↵Linus Torvalds1-1/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 protection key support from Ingo Molnar: "This tree adds support for a new memory protection hardware feature that is available in upcoming Intel CPUs: 'protection keys' (pkeys). There's a background article at LWN.net: https://lwn.net/Articles/643797/ The gist is that protection keys allow the encoding of user-controllable permission masks in the pte. So instead of having a fixed protection mask in the pte (which needs a system call to change and works on a per page basis), the user can map a (handful of) protection mask variants and can change the masks runtime relatively cheaply, without having to change every single page in the affected virtual memory range. This allows the dynamic switching of the protection bits of large amounts of virtual memory, via user-space instructions. It also allows more precise control of MMU permission bits: for example the executable bit is separate from the read bit (see more about that below). This tree adds the MM infrastructure and low level x86 glue needed for that, plus it adds a high level API to make use of protection keys - if a user-space application calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice this special case, and will set a special protection key on this memory range. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. So using protection keys the kernel is able to implement 'true' PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies PROT_READ as well. Unreadable executable mappings have security advantages: they cannot be read via information leaks to figure out ASLR details, nor can they be scanned for ROP gadgets - and they cannot be used by exploits for data purposes either. We know about no user-space code that relies on pure PROT_EXEC mappings today, but binary loaders could start making use of this new feature to map binaries and libraries in a more secure fashion. There is other pending pkeys work that offers more high level system call APIs to manage protection keys - but those are not part of this pull request. Right now there's a Kconfig that controls this feature (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled (like most x86 CPU feature enablement code that has no runtime overhead), but it's not user-configurable at the moment. If there's any serious problem with this then we can make it configurable and/or flip the default" * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/mm/pkeys: Fix mismerge of protection keys CPUID bits mm/pkeys: Fix siginfo ABI breakage caused by new u64 field x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA mm/core, x86/mm/pkeys: Add execute-only protection keys support x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags x86/mm/pkeys: Allow kernel to modify user pkey rights register x86/fpu: Allow setting of XSAVE state x86/mm: Factor out LDT init from context init mm/core, x86/mm/pkeys: Add arch_validate_pkey() mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU x86/mm/pkeys: Add Kconfig prompt to existing config option x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps x86/mm/pkeys: Dump PKRU with other kernel registers mm/core, x86/mm/pkeys: Differentiate instruction fetches x86/mm/pkeys: Optimize fault handling in access_error() mm/core: Do not enforce PKEY permissions on remote mm access um, pkeys: Add UML arch_*_access_permitted() methods mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys x86/mm/gup: Simplify get_user_pages() PTE bit handling ...
2016-03-17mm: coalesce split stringsJoe Perches1-5/+3
Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm: deduplicate memory overcommitment codeAndrey Ryabinin1-124/+0
Currently we have two copies of the same code which implements memory overcommitment logic. Let's move it into mm/util.c and hence avoid duplication. No functional changes here. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm: move max_map_count bits into mm.hAndrey Ryabinin1-1/+0
max_map_count sysctl unrelated to scheduler. Move its bits from include/linux/sched/sysctl.h to include/linux/mm.h. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-07Merge tag 'v4.5-rc7' into x86/asm, to pick up SMAP fixIngo Molnar1-5/+29
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18mm: fix regression in remap_file_pages() emulationKirill A. Shutemov1-5/+29
Grazvydas Ignotas has reported a regression in remap_file_pages() emulation. Testcase: #define _GNU_SOURCE #include <assert.h> #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #define SIZE (4096 * 3) int main(int argc, char **argv) { unsigned long *p; long i; p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); if (p == MAP_FAILED) { perror("mmap"); return -1; } for (i = 0; i < SIZE / 4096; i++) p[i * 4096 / sizeof(*p)] = i; if (remap_file_pages(p, 4096, 0, 1, 0)) { perror("remap_file_pages"); return -1; } if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) { perror("remap_file_pages"); return -1; } assert(p[0] == 1); munmap(p, SIZE); return 0; } The second remap_file_pages() fails with -EINVAL. The reason is that remap_file_pages() emulation assumes that the target vma covers whole area we want to over map. That assumption is broken by first remap_file_pages() call: it split the area into two vma. The solution is to check next adjacent vmas, if they map the same file with the same flags. Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Grazvydas Ignotas <notasas@gmail.com> Tested-by: Grazvydas Ignotas <notasas@gmail.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18mm/core, x86/mm/pkeys: Add execute-only protection keys supportDave Hansen1-1/+9
Protection keys provide new page-based protection in hardware. But, they have an interesting attribute: they only affect data accesses and never affect instruction fetches. That means that if we set up some memory which is set as "access-disabled" via protection keys, we can still execute from it. This patch uses protection keys to set up mappings to do just that. If a user calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will notice this, and set a special protection key on the memory. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. I haven't found any userspace that does this today. With this facility in place, we expect userspace to move to use it eventually. Userspace _could_ start doing this today. Any PROT_EXEC calls get converted to PROT_READ inside the kernel, and would transparently be upgraded to "true" PROT_EXEC with this code. IOW, userspace never has to do any PROT_EXEC runtime detection. This feature provides enhanced protection against leaking executable memory contents. This helps thwart attacks which are attempting to find ROP gadgets on the fly. But, the security provided by this approach is not comprehensive. The PKRU register which controls access permissions is a normal user register writable from unprivileged userspace. An attacker who can execute the 'wrpkru' instruction can easily disable the protection provided by this feature. The protection key that is used for execute-only support is permanently dedicated at compile time. This is fine for now because there is currently no API to set a protection key other than this one. Despite there being a constant PKRU value across the entire system, we do not set it unless this feature is in use in a process. That is to preserve the PKRU XSAVE 'init state', which can lead to faster context switches. PKRU *is* a user register and the kernel is modifying it. That means that code doing: pkru = rdpkru() pkru |= 0x100; mmap(..., PROT_EXEC); wrpkru(pkru); could lose the bits in PKRU that enforce execute-only permissions. To avoid this, we suggest avoiding ever calling mmap() or mprotect() when the PKRU value is expected to be unstable. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave@sr71.net> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: keescook@google.com Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()Dave Hansen1-1/+1
This plumbs a protection key through calc_vm_flag_bits(). We could have done this in calc_vm_prot_bits(), but I did not feel super strongly which way to go. It was pretty arbitrary which one to use. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arve Hjønnevåg <arve@android.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave@sr71.net> Cc: David Airlie <airlied@linux.ie> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Geliang Tang <geliangtang@163.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Leon Romanovsky <leon@leon.nu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Riley Andrews <riandrews@android.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: devel@driverdev.osuosl.org Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18Merge branch 'x86/urgent' into x86/asm, to pick up fixesIngo Molnar1-38/+47
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-16Merge branches 'x86/fpu', 'x86/mm' and 'x86/asm' into x86/pkeysIngo Molnar1-4/+9
Provide a stable basis for the pkeys patches, which touches various x86 details. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-05mm: replace vma_lock_anon_vma with anon_vma_lock_read/writeKonstantin Khlebnikov1-30/+25
Sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if anon_vma appeared between lock and unlock. We have to check anon_vma first or call anon_vma_prepare() to be sure that it's here. There are only few users of these legacy helpers. Let's get rid of them. This patch fixes anon_vma lock imbalance in validate_mm(). Write lock isn't required here, read lock is enough. And reorders expand_downwards/expand_upwards: security_mmap_addr() and wrapping-around check don't have to be under anon vma lock. Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm: validate_mm browse_rb SMP race conditionAndrea Arcangeli1-2/+5
The mmap_sem for reading in validate_mm called from expand_stack is not enough to prevent the argumented rbtree rb_subtree_gap information to change from under us because expand_stack may be running from other threads concurrently which will hold the mmap_sem for reading too. The argumented rbtree is updated with vma_gap_update under the page_table_lock so use it in browse_rb() too to avoid false positives. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03mm: warn about VmData over RLIMIT_DATAKonstantin Khlebnikov1-6/+17
This patch provides a way of working around a slight regression introduced by commit 84638335900f ("mm: rework virtual memory accounting"). Before that commit RLIMIT_DATA have control only over size of the brk region. But that change have caused problems with all existing versions of valgrind, because it set RLIMIT_DATA to zero. This patch fixes rlimit check (limit actually in bytes, not pages) and by default turns it into warning which prints at first VmData misuse: "mmap: top (795): VmData 516096 exceed data ulimit 512000. Will be forbidden soon." Behavior is controlled by boot param ignore_rlimit_data=y/n and by sysfs /sys/module/kernel/parameters/ignore_rlimit_data. For now it set to "y". [akpm@linux-foundation.org: tweak kernel-parameters.txt text[ Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Kees Cook <keescook@google.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-29Merge tag 'v4.5-rc1' into x86/asm, to refresh the branch before merging new ↵Ingo Molnar1-44/+62
changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-15mm: fix locking order in mm_take_all_locks()Kirill A. Shutemov1-5/+20
Dmitry Vyukov has reported[1] possible deadlock (triggered by his syzkaller fuzzer): Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&hugetlbfs_i_mmap_rwsem_key); lock(&mapping->i_mmap_rwsem); lock(&hugetlbfs_i_mmap_rwsem_key); lock(&mapping->i_mmap_rwsem); Both traces points to mm_take_all_locks() as a source of the problem. It doesn't take care about ordering or hugetlbfs_i_mmap_rwsem_key (aka mapping->i_mmap_rwsem for hugetlb mapping) vs. i_mmap_rwsem. huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key and allocator can take i_mmap_rwsem if it hit reclaim. So we need to take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from rest of VMAs. The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key. [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14mm: rework virtual memory accountingKonstantin Khlebnikov1-31/+32
When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which testing the RLIMIT_DATA value to figure out if we're allowed to assign new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been commited that RLIMIT_DATA in a form it's implemented now doesn't do anything useful because most of user-space libraries use mmap() syscall for dynamic memory allocations. Linus suggested to convert RLIMIT_DATA rlimit into something suitable for anonymous memory accounting. But in this patch we go further, and the changes are bundled together as: * keep vma counting if CONFIG_PROC_FS=n, will be used for limits * replace mm->shared_vm with better defined mm->data_vm * account anonymous executable areas as executable * account file-backed growsdown/up areas as stack * drop struct file* argument from vm_stat_account * enforce RLIMIT_DATA for size of data areas This way code looks cleaner: now code/stack/data classification depends only on vm_flags state: VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc) VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk) VM_WRITE & ~VM_SHARED & !stack -> data (VmData) The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared", but that might be strange beast like readonly-private or VM_IO area. - RLIMIT_AS limits whole address space "VmSize" - RLIMIT_STACK limits stack "VmStk" (but each vma individually) - RLIMIT_DATA now limits "VmData" Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Kees Cook <keescook@google.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14mm: mmap: add new /proc tunable for mmap_base ASLRDaniel Cashman1-0/+12
Address Space Layout Randomization (ASLR) provides a barrier to exploitation of user-space processes in the presence of security vulnerabilities by making it more difficult to find desired code/data which could help an attack. This is done by adding a random offset to the location of regions in the process address space, with a greater range of potential offset values corresponding to better protection/a larger search-space for brute force, but also to greater potential for fragmentation. The offset added to the mmap_base address, which provides the basis for the majority of the mappings for a process, is set once on process exec in arch_pick_mmap_layout() and is done via hard-coded per-arch values, which reflect, hopefully, the best compromise for all systems. The trade-off between increased entropy in the offset value generation and the corresponding increased variability in address space fragmentation is not absolute, however, and some platforms may tolerate higher amounts of entropy. This patch introduces both new Kconfig values and a sysctl interface which may be used to change the amount of entropy used for offset generation on a system. The direct motivation for this change was in response to the libstagefright vulnerabilities that affected Android, specifically to information provided by Google's project zero at: http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html The attack presented therein, by Google's project zero, specifically targeted the limited randomness used to generate the offset added to the mmap_base address in order to craft a brute-force-based attack. Concretely, the attack was against the mediaserver process, which was limited to respawning every 5 seconds, on an arm device. The hard-coded 8 bits used resulted in an average expected success rate of defeating the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a piece). With this patch, and an accompanying increase in the entropy value to 16 bits, the same attack would take an average expected time of over 45 hours (32768 tries), which makes it both less feasible and more likely to be noticed. The introduced Kconfig and sysctl options are limited by per-arch minimum and maximum values, the minimum of which was chosen to match the current hard-coded value and the maximum of which was chosen so as to give the greatest flexibility without generating an invalid mmap_base address, generally a 3-4 bits less than the number of bits in the user-space accessible virtual address space. When decided whether or not to change the default value, a system developer should consider that mmap_base address could be placed anywhere up to 2^(value) bits away from the non-randomized location, which would introduce variable-sized areas above and below the mmap_base address such that the maximum vm_area_struct size may be reduced, preventing very large allocations. This patch (of 4): ASLR only uses as few as 8 bits to generate the random offset for the mmap base address on 32 bit architectures. This value was chosen to prevent a poorly chosen value from dividing the address space in such a way as to prevent large allocations. This may not be an issue on all platforms. Allow the specification of a minimum number of bits so that platforms desiring greater ASLR protection may determine where to place the trade-off. Signed-off-by: Daniel Cashman <dcashman@google.com> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Nick Kralevich <nnk@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Borislav Petkov <bp@suse.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14mm/mmap.c: remove incorrect MAP_FIXED flag comparison from mmap_regionPiotr Kwapulinski1-3/+0
The following flag comparison in mmap_region makes no sense: if (!(vm_flags & MAP_FIXED)) return -ENOMEM; The condition is always false and thus the above "return -ENOMEM" is never executed. The vm_flags must not be compared with MAP_FIXED flag. The vm_flags may only be compared with VM_* flags. MAP_FIXED has the same value as VM_MAYREAD. Hitting the rlimit is a slow path and find_vma_intersection should realize that there is no overlapping VMA for !MAP_FIXED case pretty quickly. Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14mm/mmap.c: remove redundant local variables for may_expand_vm()Chen Gang1-8/+1
Simplify may_expand_vm(). [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi] Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-12mm: Add a vm_special_mapping.fault() methodAndy Lutomirski1-4/+9
Requiring special mappings to give a list of struct pages is inflexible: it prevents sane use of IO memory in a special mapping, it's inefficient (it requires arch code to initialize a list of struct pages, and it requires the mm core to walk the entire list just to figure out how long it is), and it prevents arch code from doing anything fancy when a special mapping fault occurs. Add a .fault method as an alternative to filling in a .pages array. Looks-OK-to: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a26d1677c0bc7e774c33f469451a78ca31e9e6af.1451446564.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-05mm: introduce VM_LOCKONFAULTEric B Munson1-1/+1
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking. For the example of a large file, this is the usage pattern for a large statical language model (probably applies to other statical or graphical models as well). For the security example, any application transacting in data that cannot be swapped out (credit card data, medical records, etc). This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to be added to the unevictable LRU when they are faulted or if they are already present, but will not cause any missing pages to be faulted in. Exposing this new lock state means that we cannot overload the meaning of the FOLL_POPULATE flag any longer. Prior to this patch it was used to mean that the VMA for a fault was locked. This means we need the new FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE will now only control if the VMA should be populated and in the case of VM_LOCKONFAULT, it will not be set. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm/mmap.c: change __install_special_mapping() args orderChen Gang1-6/+6
Make __install_special_mapping() args order match the caller, so the caller can pass their register args directly to callee with no touch. For most of architectures, args (at least the first 5th args) are in registers, so this change will have effect on most of architectures. For -O2, __install_special_mapping() may be inlined under most of architectures, but for -Os, it should not. So this change can get a little better performance for -Os, at least. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm/mmap.c: do not initialize retval in mmap_pgoff()Chen Gang1-3/+2
When fget() fails we can return -EBADF directly. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm/mmap.c: remove redundant statement "error = -ENOMEM"Chen Gang1-1/+0
It is still a little better to remove it, although it should be skipped by "-O2". Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>=0A= Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm: add the "struct mm_struct *mm" local intoOleg Nesterov1-11/+13
Cosmetic, but expand_upwards() and expand_downwards() overuse vma->vm_mm, a local variable makes sense imho. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm: fix the racy mm->locked_vm change inOleg Nesterov1-4/+8
"mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are not safe; multiple threads using the same ->mm can do this at the same time trying to expans different vma's under down_read(mmap_sem). This means that one of the "locked_vm += grow" changes can be lost and we can miss munlock_vma_pages_all() later. Move this code into the caller(s) under mm->page_table_lock. All other updates to ->locked_vm hold mmap_sem for writing. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm/mmap: use offset_in_page macroAlexander Kuleshov1-6/+6
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm/mmap.c: remove useless statement "vma = NULL" in find_vma()Chen Gang1-1/+0
Before the main loop, vma is already is NULL. There is no need to set it to NULL again. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22mm, dax: VMA with vm_ops->pfn_mkwrite wants to be write-notifiedKirill A. Shutemov1-1/+2
For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of vm_ops->page_mkwrite to notify abort write access. This means we want vma->vm_page_prot to be write-protected if the VMA provides this vm_ops. A theoretical scenario that will cause these missed events is: On writable mapping with vm_ops->pfn_mkwrite, but without vm_ops->page_mkwrite: read fault followed by write access to the pfn. Writable pte will be set up on read fault and write fault will not be generated. I found it examining Dave's complaint on generic/080: http://lkml.kernel.org/g/20150831233803.GO3902@dastard Although I don't think it's the reason. It shouldn't be a problem for ext2/ext4 as they provide both pfn_mkwrite and page_mkwrite. [akpm@linux-foundation.org: add local vm_ops to avoid 80-cols mess] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Yigal Korman <yigal@plexistor.com> Acked-by: Boaz Harrosh <boaz@plexistor.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-17revert "mm: make sure all file VMAs have ->vm_ops set"Andrew Morton1-8/+0
Revert commit 6dc296e7df4c "mm: make sure all file VMAs have ->vm_ops set". Will Deacon reports that it "causes some mmap regressions in LTP, which appears to use a MAP_PRIVATE mmap of /dev/zero as a way to get anonymous pages in some of its tests (specifically mmap10 [1])". William Shuman reports Oracle crashes. So revert the patch while we work out what to do. Reported-by: William Shuman <wshuman3@gmail.com> Reported-by: Will Deacon <will.deacon@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10mm: make sure all file VMAs have ->vm_ops setKirill A. Shutemov1-0/+8
We rely on vma->vm_ops == NULL to detect anonymous VMA: see vma_is_anonymous(), but some drivers doesn't set ->vm_ops. As a result we can end up with anonymous page in private file mapping. That should not lead to serious misbehaviour, but nevertheless is wrong. Let's fix by setting up dummy ->vm_ops for file mmapping if f_op->mmap() didn't set its own. The patch also adds sanity check into __vma_link_rb(). It will help catch broken VMAs which inserted directly into mm_struct via insert_vm_struct(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()Oleg Nesterov1-6/+4
Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(), rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple wrapper on top of do_mmap(). Perhaps we should update the callers of do_mmap_pgoff() and kill it later. This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and do not play with vm internals. After this change mmap_region() has a single user outside of mmap.c, arch/tile/mm/elf.c:arch_setup_additional_pages(). It would be nice to change arch/tile/ and unexport mmap_region(). [kirill@shutemov.name: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm/mmap.c:insert_vm_struct(): check for failure before setting valuesChen Gang1-6/+7
There's no point in initializing vma->vm_pgoff if the insertion attempt will be failing anyway. Run the checks before performing the initialization. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mm/mmap.c: simplify the failure return working flowChen Gang1-22/+22
__split_vma() doesn't need out_err label, neither need initializing err. copy_vma() can return NULL directly when kmem_cache_alloc() fails. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mremap: fix the wrong !vma->vm_file check in copy_vma()Oleg Nesterov1-1/+1
Test-case: #define _GNU_SOURCE #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> #include <assert.h> void *find_vdso_vaddr(void) { FILE *perl; char buf[32] = {}; perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;" "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r"); fread(buf, sizeof(buf), 1, perl); fclose(perl); return (void *)atol(buf); } #define PAGE_SIZE 4096 void *get_unmapped_area(void) { void *p = mmap(0, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1,0); assert(p != MAP_FAILED); munmap(p, PAGE_SIZE); return p; } char save[2][PAGE_SIZE]; int main(void) { void *vdso = find_vdso_vaddr(); void *page[2]; assert(vdso); memcpy(save, vdso, sizeof (save)); // force another fault on the next check assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0); page[0] = mremap(vdso, PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE, get_unmapped_area()); page[1] = mremap(vdso + PAGE_SIZE, PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE, get_unmapped_area()); assert(page[0] != MAP_FAILED && page[1] != MAP_FAILED); printf("match: %d %d\n", !memcmp(save[0], page[0], PAGE_SIZE), !memcmp(save[1], page[1], PAGE_SIZE)); return 0; } fails without this patch. Before the previous commit it gets the wrong page, now it segfaults (which is imho better). This is because copy_vma() wrongly assumes that if vma->vm_file == NULL is irrelevant until the first fault which will use do_anonymous_page(). This is obviously wrong for the special mapping. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08mmap: fix the usage of ->vm_pgoff in special_mapping pathsOleg Nesterov1-10/+2
Test-case: #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> #include <assert.h> void *find_vdso_vaddr(void) { FILE *perl; char buf[32] = {}; perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;" "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r"); fread(buf, sizeof(buf), 1, perl); fclose(perl); return (void *)atol(buf); } #define PAGE_SIZE 4096 int main(void) { void *vdso = find_vdso_vaddr(); assert(vdso); // of course they should differ, and they do so far printf("vdso pages differ: %d\n", !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); // split into 2 vma's assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0); // force another fault on the next check assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0); // now they no longer differ, the 2nd vm_pgoff is wrong printf("vdso pages differ: %d\n", !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); return 0; } Output: vdso pages differ: 1 vdso pages differ: 0 This is because split_vma() correctly updates ->vm_pgoff, but the logic in insert_vm_struct() and special_mapping_fault() is absolutely broken, so the fault at vdso + PAGE_SIZE return the 1st page. The same happens if you simply unmap the 1st page. special_mapping_fault() does: pgoff = vmf->pgoff - vma->vm_pgoff; and this is _only_ correct if vma->vm_start mmaps the first page from ->vm_private_data array. vdso or any other user of install_special_mapping() is not anonymous, it has the "backing storage" even if it is just the array of pages. So we actually need to make vm_pgoff work as an offset in this array. Note: this also allows to fix another problem: currently gdb can't access "[vvar]" memory because in this case special_mapping_fault() doesn't work. Now that we can use ->vm_pgoff we can implement ->access() and fix this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctxAndrea Arcangeli1-13/+27
vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge must be aware about so that we can merge vmas back like they were originally before arming the userfaultfd on some memory range. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-10vfs: Commit to never having exectuables on proc and sysfs.Eric W. Biederman1-2/+2
Today proc and sysfs do not contain any executable files. Several applications today mount proc or sysfs without noexec and nosuid and then depend on there being no exectuables files on proc or sysfs. Having any executable files show on proc or sysfs would cause a user space visible regression, and most likely security problems. Therefore commit to never allowing executables on proc and sysfs by adding a new flag to mark them as filesystems without executables and enforce that flag. Test the flag where MNT_NOEXEC is tested today, so that the only user visible effect will be that exectuables will be treated as if the execute bit is cleared. The filesystems proc and sysfs do not currently incoporate any executable files so this does not result in any user visible effects. This makes it unnecessary to vet changes to proc and sysfs tightly for adding exectuable files or changes to chattr that would modify existing files, as no matter what the individual file say they will not be treated as exectuable files by the vfs. Not having to vet changes to closely is important as without this we are only one proc_create call (or another goof up in the implementation of notify_change) from having problematic executables on proc. Those mistakes are all too easy to make and would create a situation where there are security issues or the assumptions of some program having to be broken (and cause userspace regressions). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-06-24mm/mmap.c: optimization of do_mmap_pgoff functionPiotr Kwapulinski1-3/+3
The simple check for zero length memory mapping may be performed earlier. So that in case of zero length memory mapping some unnecessary code is not executed at all. It does not make the code less readable and saves some CPU cycles. Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15mm/mmap.c: use while instead of if+gotoRasmus Villemoes1-7/+6
The creators of the C language gave us the while keyword. Let's use that instead of synthesizing it from if+goto. Made possible by 6597d783397a ("mm/mmap.c: replace find_vma_prepare() with clearer find_vma_links()"). [akpm@linux-foundation.org: fix 80-col overflows] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15mm: remove rest of ACCESS_ONCE() usagesJason Low1-4/+4
We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/ tree since it doesn't work reliably on non-scalar types. This patch removes the rest of the usages of ACCESS_ONCE, and use the new READ_ONCE API for the read accesses. This makes things cleaner, instead of using separate/multiple sets of APIs. Signed-off-by: Jason Low <jason.low2@hp.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: rename __mlock_vma_pages_range() to populate_vma_page_range()Kirill A. Shutemov1-2/+2
__mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on vma flags. The same codepath is used for MAP_POPULATE. Let's rename __mlock_vma_pages_range() to populate_vma_page_range(). This patch also drops mlock_vma_pages_range() references from documentation. It has gone in cea10a19b797 ("mm: directly use __mlock_vma_pages_range() in find_extend_vma()"). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25mm: fix anon_vma->degree underflow in anon_vma endless growing preventionLeon Yu1-3/+1
I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after upgrading to 3.19 and had no luck with 4.0-rc1 neither. So, after looking into new logic introduced by commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy"), I found chances are that unlink_anon_vmas() is called without incrementing dst->anon_vma->degree in anon_vma_clone() due to allocation failure. If dst->anon_vma is not NULL in error path, its degree will be incorrectly decremented in unlink_anon_vmas() and eventually underflow when exiting as a result of another call to unlink_anon_vmas(). That's how "kernel BUG at mm/rmap.c:399!" is triggered for me. This patch fixes the underflow by dropping dst->anon_vma when allocation fails. It's safe to do so regardless of original value of dst->anon_vma because dst->anon_vma doesn't have valid meaning if anon_vma_clone() fails. Besides, callers don't care dst->anon_vma in such case neither. Also suggested by Michal Hocko, we can clean up vma_adjust() a bit as anon_vma_clone() now does the work. [akpm@linux-foundation.org: tweak comment] Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Signed-off-by: Leon Yu <chianglungyu@gmail.com> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>