summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2019-09-24io_uring: compare cached_cq_tail with cq.head in_io_uring_pollyangerkun1-1/+1
After 75b28af("io_uring: allocate the two rings together"), we compare sq.head with cached_cq_tail to determine does there any cq invalid. Actually, we should use cq.head. Fixes: 75b28affdd6a ("io_uring: allocate the two rings together") Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-23io_uring: correctly handle non ->{read,write}_iter() file_operationsJens Axboe1-6/+54
Currently we just -EINVAL a read or write to an fd that isn't backed by ->read_iter() or ->write_iter(). But we can handle them just fine, as long as we punt fo async context first. Implement a simple loop function for doing ->read() or ->write() instead, and ensure we call it appropriately. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18io_uring: IORING_OP_TIMEOUT supportJens Axboe2-5/+146
There's been a few requests for functionality similar to io_getevents() and epoll_wait(), where the user can specify a timeout for waiting on events. I deliberately did not add support for this through the system call initially to avoid overloading the args, but I can see that the use cases for this are valid. This adds support for IORING_OP_TIMEOUT. If a user wants to get woken when waiting for events, simply submit one of these timeout commands with your wait call (or before). This ensures that the application sleeping on the CQ ring waiting for events will get woken. The timeout command is passed in as a pointer to a struct timespec. Timeouts are relative. The timeout command also includes a way to auto-cancel after N events has passed. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-19io_uring: use cond_resched() in sqthreadJens Axboe1-1/+1
If preempt isn't enabled in the kernel, we can run into hang issues with sqthread submissions. Use cond_resched() to play nice instead of cpu_relax(), if we end up starting the loop and not having any events pending for submissions. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18io_uring: fix potential crash issue due to io_get_req failureJackie Liu1-0/+6
Sometimes io_get_req will return a NUL, then we need to do the correct error handling, otherwise it will cause the kernel null pointer exception. Fixes: 4fe2c963154c ("io_uring: add support for link with drain") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18io_uring: ensure poll commands clear ->sqeJens Axboe1-9/+9
If we end up getting woken in poll (due to a signal), then we may need to punt the poll request to an async worker. When we do that, we look up the list to queue at, deferefencing req->submit.sqe, however that is only set for requests we initially decided to queue async. This fixes a crash with poll command usage and wakeups that need to punt to async context. Fixes: 54a91f3bb9b9 ("io_uring: limit parallelism of buffered writes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18io_uring: fix use-after-free of shadow_reqJackie Liu1-0/+2
There is a potential dangling pointer problem. we never clean shadow_req, if there are multiple link lists in this series of sqes, then the shadow_req will not reallocate, and continue to use the last one. but in the previous, his memory has been released, thus forming a dangling pointer. let's clean up him and make sure that every new link list can reapply for a new shadow_req. Fixes: 4fe2c963154c ("io_uring: add support for link with drain") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-18io_uring: use kmemdup instead of kmalloc and memcpyJackie Liu1-3/+1
Just clean up the code, no function changes. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-14io_uring: increase IORING_MAX_ENTRIES to 32KDaniel Xu1-1/+1
Some workloads can require far more than 4K oustanding entries. For example memcached can have ~300K sockets over ~40 cores. Bumping the max to 32K seems to work pretty well. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-12io_uring: make sqpoll wakeup possible with geteventsJens Axboe1-6/+2
The way the logic is setup in io_uring_enter() means that you can't wake up the SQ poller thread while at the same time waiting (or polling) for completions afterwards. There's no reason for that to be the case. Reported-by: Lewis Baker <lbaker@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-12io_uring: extend async work mergingJens Axboe1-8/+28
We currently merge async work items if we see a strict sequential hit. This helps avoid unnecessary workqueue switches when we don't need them. We can extend this merging to cover cases where it's not a strict sequential hit, but the IO still fits within the same page. If an application is doing multiple requests within the same page, we don't want separate workers waiting on the same page to complete IO. It's much faster to let the first worker bring in the page, then operate on that page from the same worker to complete the next request(s). Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-10io_uring: limit parallelism of buffered writesJens Axboe1-8/+39
All the popular filesystems need to grab the inode lock for buffered writes. With io_uring punting buffered writes to async context, we observe a lot of contention with all workers hamming this mutex. For buffered writes, we generally don't need a lot of parallelism on the submission side, as the flushing will take care of that for us. Hence we don't need a deep queue on the write side, as long as we can safely punt from the original submission context. Add a workqueue with a limit of 2 that we can use for buffered writes. This greatly improves the performance and efficiency of higher queue depth buffered async writes with io_uring. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-10io_uring: add io_queue_async_work() helperJens Axboe1-5/+11
Add a helper for queueing a request for async execution, in preparation for optimizing it. No functional change in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-10io_uring: optimize submit_and_wait APIJens Axboe1-17/+46
For some applications that end up using a submit-and-wait type of approach for certain batches of IO, we can make that a bit more efficient by allowing the application to block for the last IO submission. This prevents an async when we don't need it, as the application will be blocking for the completion event(s) anyway. Typical use cases are using the liburing io_uring_submit_and_wait() API, or just using io_uring_enter() doing both submissions and completions. As a specific example, RocksDB doing MultiGet() is sped up quite a bit with this change. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-09io_uring: add support for link with drainJackie Liu1-17/+97
To support the link with drain, we need to do two parts. There is an sqes: 0 1 2 3 4 5 6 +-----+-----+-----+-----+-----+-----+-----+ | N | L | L | L+D | N | N | N | +-----+-----+-----+-----+-----+-----+-----+ First, we need to ensure that the io before the link is completed, there is a easy way is set drain flag to the link list's head, so all subsequent io will be inserted into the defer_list. +-----+ (0) | N | +-----+ | (2) (3) (4) +-----+ +-----+ +-----+ +-----+ (1) | L+D | --> | L | --> | L+D | --> | N | +-----+ +-----+ +-----+ +-----+ | +-----+ (5) | N | +-----+ | +-----+ (6) | N | +-----+ Second, ensure that the following IO will not be completed first, an easy way is to create a mirror of drain io and insert it into defer_list, in this way, as long as drain io is not processed, the following io in the defer_list will not be actively process. +-----+ (0) | N | +-----+ | (2) (3) (4) +-----+ +-----+ +-----+ +-----+ (1) | L+D | --> | L | --> | L+D | --> | N | +-----+ +-----+ +-----+ +-----+ | +-----+ ('3) | D | <== This is a shadow of (3) +-----+ | +-----+ (5) | N | +-----+ | +-----+ (6) | N | +-----+ Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-09io_uring: fix wrong sequence setting logicJackie Liu1-1/+3
Sqo_thread will get sqring in batches, which will cause ctx->cached_sq_head to be added in batches. if one of these sqes is set with the DRAIN flag, then he will never get a chance to process, and finally sqo_thread will not exit. Fixes: de0617e4671 ("io_uring: add support for marking commands as draining") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-06io_uring: expose single mmap capabilityJens Axboe2-1/+9
After commit 75b28affdd6a we can get by with just a single mmap to map both the sq and cq ring. However, userspace doesn't know that. Add a features variable to io_uring_params, and notify userspace that the kernel has this ability. This can then be used in liburing (or in applications directly) to avoid the second mmap. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27io_uring: allocate the two rings togetherHristo Venev1-127/+128
Both the sq and the cq rings have sizes just over a power of two, and the sq ring is significantly smaller. By bundling them in a single alllocation, we get the sq ring for free. This also means that IORING_OFF_SQ_RING and IORING_OFF_CQ_RING now mean the same thing. If we indicate this to userspace, we can save a mmap call. Signed-off-by: Hristo Venev <hristo@venev.name> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27fs/io_uring.c: convert put_page() to put_user_page*()John Hubbard1-5/+3
For pages that were retained via get_user_pages*(), release those pages via the new put_user_page*() routines, instead of via put_page() or release_pages(). This is part a tree-wide conversion, as described in commit fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions"). Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-block@vger.kernel.org Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-25Linux 5.3-rc6v5.3-rc6Linus Torvalds1-1/+1
2019-08-25Merge tag 'auxdisplay-for-linus-v5.3-rc7' of git://github.com/ojeda/linuxLinus Torvalds1-2/+2
Pull auxdisplay cleanup from Miguel Ojeda: "Make ht16k33_fb_fix and ht16k33_fb_var constant (Nishka Dasgupta)" * tag 'auxdisplay-for-linus-v5.3-rc7' of git://github.com/ojeda/linux: auxdisplay: ht16k33: Make ht16k33_fb_fix and ht16k33_fb_var constant
2019-08-25Merge tag 'for-linus-5.3-rc6' of ↵Linus Torvalds3-12/+20
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML fix from Richard Weinberger: "Fix time travel mode" * tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: fix time travel mode
2019-08-25Merge tag 'for-linus-5.3-rc6' of ↵Linus Torvalds4-8/+5
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull UBIFS and JFFS2 fixes from Richard Weinberger: "UBIFS: - Don't block too long in writeback_inodes_sb() - Fix for a possible overrun of the log head - Fix double unlock in orphan_delete() JFFS2: - Remove C++ style from UAPI header and unbreak picky toolchains" * tag 'for-linus-5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubifs: Limit the number of pages in shrink_liability ubifs: Correctly initialize c->min_log_bytes ubifs: Fix double unlock around orphan_delete() jffs2: Remove C++ style comments from uapi header
2019-08-25Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds9-33/+227
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A few fixes for x86: - Fix a boot regression caused by the recent bootparam sanitizing change, which escaped the attention of all people who reviewed that code. - Address a boot problem on machines with broken E820 tables caused by an underflow which ended up placing the trampoline start at physical address 0. - Handle machines which do not advertise a legacy timer of any form, but need calibration of the local APIC timer gracefully by making the calibration routine independent from the tick interrupt. Marked for stable as well as there seems to be quite some new laptops rolled out which expose this. - Clear the RDRAND CPUID bit on AMD family 15h and 16h CPUs which are affected by broken firmware which does not initialize RDRAND correctly after resume. Add a command line parameter to override this for machine which either do not use suspend/resume or have a fixed BIOS. Unfortunately there is no way to detect this on boot, so the only safe decision is to turn it off by default. - Prevent RFLAGS from being clobbers in CALL_NOSPEC on 32bit which caused fast KVM instruction emulation to break. - Explain the Intel CPU model naming convention so that the repeating discussions come to an end" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386 x86/boot: Fix boot regression caused by bootparam sanitizing x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h x86/boot/compressed/64: Fix boot on machines with broken E820 table x86/apic: Handle missing global clockevent gracefully x86/cpu: Explain Intel model naming convention
2019-08-25Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds3-9/+23
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping fix from Thomas Gleixner: "A single fix for a regression caused by the generic VDSO implementation where a math overflow causes CLOCK_BOOTTIME to become a random number generator" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
2019-08-25Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "Handle the worker management in situations where a task is scheduled out on a PI lock contention correctly and schedule a new worker if possible" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Schedule new worker even if PI-blocked
2019-08-25Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two small fixes for kprobes and perf: - Prevent a deadlock in kprobe_optimizer() causes by reverse lock ordering - Fix a comment typo" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kprobes: Fix potential deadlock in kprobe_optimizer() perf/x86: Fix typo in comment
2019-08-25Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds1-1/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "A single fix for a imbalanced kobject operation in the irq decriptor code which was unearthed by the new warnings in the kobject code" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Properly pair kobject_del() with kobject_add()
2019-08-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds9-36/+260
Mergr misc fixes from Andrew Morton: "11 fixes" Mostly VM fixes, one psi polling fix, and one parisc build fix. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y mm/zsmalloc.c: fix race condition in zs_destroy_pool mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely mm, page_owner: handle THP splits correctly userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx psi: get poll_work to run when calling poll syscall next time mm: memcontrol: flush percpu vmevents before releasing memcg mm: memcontrol: flush percpu vmstats before releasing memcg parisc: fix compilation errrors mm, page_alloc: move_freepages should not examine struct page of reserved memory mm/z3fold.c: fix race between migration and destruction
2019-08-24Merge tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds6-15/+19
Pull dma-mapping fixes from Christoph Hellwig: "Two fixes for regressions in this merge window: - select the Kconfig symbols for the noncoherent dma arch helpers on arm if swiotlb is selected, not just for LPAE to not break then Xen build, that uses swiotlb indirectly through swiotlb-xen - fix the page allocator fallback in dma_alloc_contiguous if the CMA allocation fails" * tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping: dma-direct: fix zone selection after an unaddressable CMA allocation arm: select the dma-noncoherent symbols for all swiotlb builds
2019-08-24mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=yAndrey Ryabinin1-2/+8
The code like this: ptr = kmalloc(size, GFP_KERNEL); page = virt_to_page(ptr); offset = offset_in_page(ptr); kfree(page_address(page) + offset); may produce false-positive invalid-free reports on the kernel with CONFIG_KASAN_SW_TAGS=y. In the example above we lose the original tag assigned to 'ptr', so kfree() gets the pointer with 0xFF tag. In kfree() we check that 0xFF tag is different from the tag in shadow hence print false report. Instead of just comparing tags, do the following: 1) Check that shadow doesn't contain KASAN_TAG_INVALID. Otherwise it's double-free and it doesn't matter what tag the pointer have. 2) If pointer tag is different from 0xFF, make sure that tag in the shadow is the same as in the pointer. Link: http://lkml.kernel.org/r/20190819172540.19581-1-aryabinin@virtuozzo.com Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Walter Wu <walter-zh.wu@mediatek.com> Reported-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm/zsmalloc.c: fix race condition in zs_destroy_poolHenry Burns1-2/+59
In zs_destroy_pool() we call flush_work(&pool->free_work). However, we have no guarantee that migration isn't happening in the background at that time. Since migration can't directly free pages, it relies on free_work being scheduled to free the pages. But there's nothing preventing an in-progress migrate from queuing the work *after* zs_unregister_migration() has called flush_work(). Which would mean pages still pointing at the inode when we free it. Since we know at destroy time all objects should be free, no new migrations can come in (since zs_page_isolate() fails for fully-free zspages). This means it is sufficient to track a "# isolated zspages" count by class, and have the destroy logic ensure all such pages have drained before proceeding. Keeping that state under the class spinlock keeps the logic straightforward. In this case a memory leak could lead to an eventual crash if compaction hits the leaked page. This crash would only occur if people are changing their zswap backend at runtime (which eventually starts destruction). Link: http://lkml.kernel.org/r/20190809181751.219326-2-henryburns@google.com Fixes: 48b4800a1c6a ("zsmalloc: page migration support") Signed-off-by: Henry Burns <henryburns@google.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitelyHenry Burns1-4/+15
In zs_page_migrate() we call putback_zspage() after we have finished migrating all pages in this zspage. However, the return value is ignored. If a zs_free() races in between zs_page_isolate() and zs_page_migrate(), freeing the last object in the zspage, putback_zspage() will leave the page in ZS_EMPTY for potentially an unbounded amount of time. To fix this, we need to do the same thing as zs_page_putback() does: schedule free_work to occur. To avoid duplicated code, move the sequence to a new putback_zspage_deferred() function which both zs_page_migrate() and zs_page_putback() call. Link: http://lkml.kernel.org/r/20190809181751.219326-1-henryburns@google.com Fixes: 48b4800a1c6a ("zsmalloc: page migration support") Signed-off-by: Henry Burns <henryburns@google.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm, page_owner: handle THP splits correctlyVlastimil Babka1-0/+4
THP splitting path is missing the split_page_owner() call that split_page() has. As a result, split THP pages are wrongly reported in the page_owner file as order-9 pages. Furthermore when the former head page is freed, the remaining former tail pages are not listed in the page_owner file at all. This patch fixes that by adding the split_page_owner() call into __split_huge_page(). Link: http://lkml.kernel.org/r/20190820131828.22684-2-vbabka@suse.cz Fixes: a9627bc5e34e ("mm/page_owner: introduce split_page_owner and replace manual handling") Reported-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctxOleg Nesterov1-12/+13
userfaultfd_release() should clear vm_flags/vm_userfaultfd_ctx even if mm->core_state != NULL. Otherwise a page fault can see userfaultfd_missing() == T and use an already freed userfaultfd_ctx. Link: http://lkml.kernel.org/r/20190820160237.GB4983@redhat.com Fixes: 04f5866e41fb ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24psi: get poll_work to run when calling poll syscall next timeJason Xing1-0/+8
Only when calling the poll syscall the first time can user receive POLLPRI correctly. After that, user always fails to acquire the event signal. Reproduce case: 1. Get the monitor code in Documentation/accounting/psi.txt 2. Run it, and wait for the event triggered. 3. Kill and restart the process. The question is why we can end up with poll_scheduled = 1 but the work not running (which would reset it to 0). And the answer is because the scheduling side sees group->poll_kworker under RCU protection and then schedules it, but here we cancel the work and destroy the worker. The cancel needs to pair with resetting the poll_scheduled flag. Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm: memcontrol: flush percpu vmevents before releasing memcgRoman Gushchin1-1/+21
Similar to vmstats, percpu caching of local vmevents leads to an accumulation of errors on non-leaf levels. This happens because some leftovers may remain in percpu caches, so that they are never propagated up by the cgroup tree and just disappear into nonexistence with on releasing of the memory cgroup. To fix this issue let's accumulate and propagate percpu vmevents values before releasing the memory cgroup similar to what we're doing with vmstats. Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate only over online cpus. Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty") Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm: memcontrol: flush percpu vmstats before releasing memcgRoman Gushchin1-0/+40
Percpu caching of local vmstats with the conditional propagation by the cgroup tree leads to an accumulation of errors on non-leaf levels. Let's imagine two nested memory cgroups A and A/B. Say, a process belonging to A/B allocates 100 pagecache pages on the CPU 0. The percpu cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B and A atomic vmstat counters, 4 pages will remain in the percpu cache. Imagine A/B is nearby memory.max, so that every following allocation triggers a direct reclaim on the local CPU. Say, each such attempt will free 16 pages on a new cpu. That means every percpu cache will have -16 pages, except the first one, which will have 4 - 16 = -12. A/B and A atomic counters will not be touched at all. Now a user removes A/B. All percpu caches are freed and corresponding vmstat numbers are forgotten. A has 96 pages more than expected. As memory cgroups are created and destroyed, errors do accumulate. Even 1-2 pages differences can accumulate into large numbers. To fix this issue let's accumulate and propagate percpu vmstat values before releasing the memory cgroup. At this point these numbers are stable and cannot be changed. Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate only over online cpus. Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty") Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24parisc: fix compilation errrorsQian Cai1-2/+1
Commit 0cfaee2af3a0 ("include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used") converted a few functions from macros to static inline, which causes parisc to complain, In file included from include/asm-generic/4level-fixup.h:38:0, from arch/parisc/include/asm/pgtable.h:5, from arch/parisc/include/asm/io.h:6, from include/linux/io.h:13, from sound/core/memory.c:9: include/asm-generic/5level-fixup.h:14:18: error: unknown type name 'pgd_t'; did you mean 'pid_t'? #define p4d_t pgd_t ^ include/asm-generic/5level-fixup.h:24:28: note: in expansion of macro 'p4d_t' static inline int p4d_none(p4d_t p4d) ^~~~~ It is because "4level-fixup.h" is included before "asm/page.h" where "pgd_t" is defined. Link: http://lkml.kernel.org/r/20190815205305.1382-1-cai@lca.pw Fixes: 0cfaee2af3a0 ("include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used") Signed-off-by: Qian Cai <cai@lca.pw> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm, page_alloc: move_freepages should not examine struct page of reserved memoryDavid Rientjes1-15/+4
After commit 907ec5fca3dc ("mm: zero remaining unavailable struct pages"), struct page of reserved memory is zeroed. This causes page->flags to be 0 and fixes issues related to reading /proc/kpageflags, for example, of reserved memory. The VM_BUG_ON() in move_freepages_block(), however, assumes that page_zone() is meaningful even for reserved memory. That assumption is no longer true after the aforementioned commit. There's no reason why move_freepages_block() should be testing the legitimacy of page_zone() for reserved memory; its scope is limited only to pages on the zone's freelist. Note that pfn_valid() can be true for reserved memory: there is a backing struct page. The check for page_to_nid(page) is also buggy but reserved memory normally only appears on node 0 so the zeroing doesn't affect this. Move the debug checks to after verifying PageBuddy is true. This isolates the scope of the checks to only be for buddy pages which are on the zone's freelist which move_freepages_block() is operating on. In this case, an incorrect node or zone is a bug worthy of being warned about (and the examination of struct page is acceptable bcause this memory is not reserved). Why does move_freepages_block() gets called on reserved memory? It's simply math after finding a valid free page from the per-zone free area to use as fallback. We find the beginning and end of the pageblock of the valid page and that can bring us into memory that was reserved per the e820. pfn_valid() is still true (it's backed by a struct page), but since it's zero'd we shouldn't make any inferences here about comparing its node or zone. The current node check just happens to succeed most of the time by luck because reserved memory typically appears on node 0. The fix here is to validate that we actually have buddy pages before testing if there's any type of zone or node strangeness going on. We noticed it almost immediately after bringing 907ec5fca3dc in on CONFIG_DEBUG_VM builds. It depends on finding specific free pages in the per-zone free area where the math in move_freepages() will bring the start or end pfn into reserved memory and wanting to claim that entire pageblock as a new migratetype. So the path will be rare, require CONFIG_DEBUG_VM, and require fallback to a different migratetype. Some struct pages were already zeroed from reserve pages before 907ec5fca3c so it theoretically could trigger before this commit. I think it's rare enough under a config option that most people don't run that others may not have noticed. I wouldn't argue against a stable tag and the backport should be easy enough, but probably wouldn't single out a commit that this is fixing. Mel said: : The overhead of the debugging check is higher with this patch although : it'll only affect debug builds and the path is not particularly hot. : If this was a concern, I think it would be reasonable to simply remove : the debugging check as the zone boundaries are checked in : move_freepages_block and we never expect a zone/node to be smaller than : a pageblock and stuck in the middle of another zone. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24mm/z3fold.c: fix race between migration and destructionHenry Burns1-0/+89
In z3fold_destroy_pool() we call destroy_workqueue(&pool->compact_wq). However, we have no guarantee that migration isn't happening in the background at that time. Migration directly calls queue_work_on(pool->compact_wq), if destruction wins that race we are using a destroyed workqueue. Link: http://lkml.kernel.org/r/20190809213828.202833-1-henryburns@google.com Signed-off-by: Henry Burns <henryburns@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Adams <jwadams@google.com> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24Merge tag 'gpio-v5.3-4' of ↵Linus Torvalds3-42/+20
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO fixes from Linus Walleij: "Here is a (hopefully last) set of GPIO fixes for the v5.3 kernel cycle. Two are pretty core: - Fix not reporting open drain/source lines to userspace as "input" - Fix a minor build error found in randconfigs - Fix a chip select quirk on the Freescale SPI - Fix the irqchip initialization semantic order to reflect what it was using the old API" * tag 'gpio-v5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpio: Fix irqchip initialization order gpio: of: fix Freescale SPI CS quirk handling gpio: Fix build error of function redefinition gpiolib: never report open-drain/source lines as 'input' to user-space
2019-08-24Merge tag 'hyperv-fixes-signed' of ↵Linus Torvalds4-33/+8
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull Hyper-V fixes from Sasha Levin: - Fix for panics and network failures on PAE guests by Dexuan Cui. - Fix of a memory leak (and related cleanups) in the hyper-v keyboard driver by Dexuan Cui. - Code cleanups for hyper-v clocksource driver during the merge window by Dexuan Cui. - Fix for a false positive warning in the userspace hyper-v KVP store by Vitaly Kuznetsov. * tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Drivers: hv: vmbus: Fix virt_to_hvpfn() for X86_PAE Tools: hv: kvp: eliminate 'may be used uninitialized' warning Input: hyperv-keyboard: Use in-place iterator API in the channel callback Drivers: hv: vmbus: Remove the unused "tsc_page" from struct hv_context
2019-08-24Merge tag 'arm64-fixes' of ↵Linus Torvalds2-10/+27
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "Two KVM/arm fixes for MMIO emulation and UBSAN. Unusually, we're routing them via the arm64 tree as per Paolo's request on the list: https://lore.kernel.org/kvm/21ae69a2-2546-29d0-bff6-2ea825e3d968@redhat.com/ We don't actually have any other arm64 fixes pending at the moment (touch wood), so I've pulled from Marc, written a merge commit, tagged the result and run it through my build/boot/bisect scripts" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: KVM: arm/arm64: VGIC: Properly initialise private IRQ affinity KVM: arm/arm64: Only skip MMIO insn once
2019-08-24Merge tag 'scsi-fixes' of ↵Linus Torvalds8-7/+49
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Four fixes, three for edge conditions which don't occur very often. The lpfc fix mitigates memory exhaustion for some high CPU systems" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: lpfc: Mitigate high memory pre-allocation by SCSI-MQ scsi: ufs: Fix NULL pointer dereference in ufshcd_config_vreg_hpm() scsi: target: tcmu: avoid use-after-free after command timeout scsi: qla2xxx: Fix gnl.l memory leak on adapter init failure
2019-08-24Merge tag 'xfs-5.3-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-0/+1
Pull xfs fix from Darrick Wong: "A single patch that fixes a xfs lockup problem when a chown/chgrp operation fails due to running out of quota. It has survived the usual xfstests runs and merges cleanly with this morning's master: - Fix a forgotten inode unlock when chown/chgrp fail due to quota" * tag 'xfs-5.3-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT
2019-08-24Merge tag 'drm-fixes-2019-08-24' of git://anongit.freedesktop.org/drm/drmLinus Torvalds2-7/+18
Pull more drm fixes from Dave Airlie: "Although the tree built for me fine on arm here, it appears either header cleanups in next or some kconfig combo it breaks, so this contains a fix to mediatek to include dma-mapping.h explicitly. There was also one nouveau fix that came in late that I was going to leave until next week, but since I was sending this I thought it may as well be in here: mediatek: - fix build in some cases nouveau: - fix hang with i2c and mst docks" * tag 'drm-fixes-2019-08-24' of git://anongit.freedesktop.org/drm/drm: drm/mediatek: include dma-mapping header drm/nouveau: Don't retry infinitely when receiving no data on i2c over AUX
2019-08-24Merge tag 'kvmarm-fixes-for-5.3-3' of ↵Will Deacon2-10/+27
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm/fixes Pull KVM/arm fixes from Marc Zyngier as per Paulo's request at: https://lkml.kernel.org/r/21ae69a2-2546-29d0-bff6-2ea825e3d968@redhat.com "One (hopefully last) set of fixes for KVM/arm for 5.3: an embarassing MMIO emulation regression, and a UBSAN splat. Oh well... - Don't overskip instructions on MMIO emulation - Fix UBSAN splat when initializing PPI priorities" * tag 'kvmarm-fixes-for-5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm: KVM: arm/arm64: VGIC: Properly initialise private IRQ affinity KVM: arm/arm64: Only skip MMIO insn once
2019-08-24drm/mediatek: include dma-mapping headerDave Airlie1-0/+1
Although it builds fine here in my arm cross compile, it seems either via some other patches in -next or some Kconfig combination, this fails to build for everyone. Include linux/dma-mapping.h should fix it. Signed-off-by: Dave Airlie <airlied@redhat.com>
2019-08-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds26-239/+248
Pull rdma fixes from Doug Ledford: "No beating around the bush: this is a monster pull request for an -rc5 kernel. Intel hit me with a series of fixes for TID processing. Mellanox hit me with a series for their UMR memory support. And we had one fix for siw that fixes the 32bit build warnings and because of the number of casts that had to be changed to properly silence the warnings, that one patch alone is a full 40% of the LOC of this entire pull request. Given that this is the initial release kernel for siw, I'm trying to fix anything in it that we can, so that adds to the impetus to take fixes for it like this one. I had to do a rebase early in the week. Jason had thought he put a patch on the rc queue that he needed to be there so he could base some work off of it, and it had actually not been placed there. So he asked me (on Tuesday) to fix that up before pushing my wip branch to the official rc branch. I did, and that's why the early patches look like they were all committed at the same time on Tuesday. That bunch had been in my queue prior. The various patches all pass my test for being legitimate fixes and not attempts to slide new features or development into a late rc. Well, they were all fixes with the exception of a couple clean up patches people wrote for making the fixes they also wrote better (like a cleanup patch to move UMR checking into a function so that the remaining UMR fix patches can reference that function), so I left those in place too. My apologies for the LOC count and the number of patches here, it's just how the cards fell this cycle. Summary: - Fix siw buffer mapping issue - Fix siw 32/64 casting issues - Fix a KASAN access issue in bnxt_re - Fix several memory leaks (hfi1, mlx4) - Fix a NULL deref in cma_cleanup - Fixes for UMR memory support in mlx5 (4 patch series) - Fix namespace check for restrack - Fixes for counter support - Fixes for hfi1 TID processing (5 patch series) - Fix potential NULL deref in siw - Fix memory page calculations in mlx5" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (21 commits) RDMA/siw: Fix 64/32bit pointer inconsistency RDMA/siw: Fix SGL mapping issues RDMA/bnxt_re: Fix stack-out-of-bounds in bnxt_qplib_rcfw_send_message infiniband: hfi1: fix memory leaks infiniband: hfi1: fix a memory leak bug IB/mlx4: Fix memory leaks RDMA/cma: fix null-ptr-deref Read in cma_cleanup IB/mlx5: Block MR WR if UMR is not possible IB/mlx5: Fix MR re-registration flow to use UMR properly IB/mlx5: Report and handle ODP support properly IB/mlx5: Consolidate use_umr checks into single function RDMA/restrack: Rewrite PID namespace check to be reliable RDMA/counters: Properly implement PID checks IB/core: Fix NULL pointer dereference when bind QP to counter IB/hfi1: Drop stale TID RDMA packets that cause TIDErr IB/hfi1: Add additional checks when handling TID RDMA WRITE DATA packet IB/hfi1: Add additional checks when handling TID RDMA READ RESP packet IB/hfi1: Unsafe PSN checking for TID RDMA READ Resp packet IB/hfi1: Drop stale TID RDMA packets RDMA/siw: Fix potential NULL de-ref ...