summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2020-08-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds6-43/+27
Merge misc updates from Andrew Morton: - a few MM hotfixes - kthread, tools, scripts, ntfs and ocfs2 - some of MM Subsystems affected by this patch series: kthread, tools, scripts, ntfs, ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan, debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore, sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits) mm: vmscan: consistent update to pgrefill mm/vmscan.c: fix typo khugepaged: khugepaged_test_exit() check mmget_still_valid() khugepaged: retract_page_tables() remember to test exit khugepaged: collapse_pte_mapped_thp() protect the pmd lock khugepaged: collapse_pte_mapped_thp() flush the right range mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible mm: thp: replace HTTP links with HTTPS ones mm/page_alloc: fix memalloc_nocma_{save/restore} APIs mm/page_alloc.c: skip setting nodemask when we are in interrupt mm/page_alloc: fallbacks at most has 3 elements mm/page_alloc: silence a KASAN false positive mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask() mm/page_alloc.c: simplify pageblock bitmap access mm/page_alloc.c: extract the common part in pfn_to_bitidx() mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits mm/shuffle: remove dynamic reconfiguration mm/memory_hotplug: document why shuffle_zone() is relevant mm/page_alloc: remove nr_free_pagecache_pages() mm: remove vm_total_pages ...
2020-08-07kasan: don't tag stacks allocated with pageallocAndrey Konovalov1-1/+2
Patch series "kasan: support stack instrumentation for tag-based mode", v2. This patch (of 5): Prepare Software Tag-Based KASAN for stack tagging support. With Tag-Based KASAN when kernel stacks are allocated via pagealloc (which happens when CONFIG_VMAP_STACK is not enabled), they get tagged. KASAN instrumentation doesn't expect the sp register to be tagged, and this leads to false-positive reports. Fix by resetting the tag of kernel stack pointers after allocation. Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Walter Wu <walter-zh.wu@mediatek.com> Cc: Elena Petrova <lenaptr@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Link: http://lkml.kernel.org/r/cover.1596199677.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/cover.1596544734.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/12d8c678869268dd0884b01271ab592f30792abf.1596544734.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/01c678b877755bcf29009176592402cdf6f2cb15.1596199677.git.andreyknvl@google.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=203497 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07rcu: kasan: record and print call_rcu() call stackWalter Wu1-0/+2
Patch series "kasan: memorize and print call_rcu stack", v8. This patchset improves KASAN reports by making them to have call_rcu() call stack information. It is useful for programmers to solve use-after-free or double-free memory issue. The KASAN report was as follows(cleaned up slightly): BUG: KASAN: use-after-free in kasan_rcu_reclaim+0x58/0x60 Freed by task 0: kasan_save_stack+0x24/0x50 kasan_set_track+0x24/0x38 kasan_set_free_info+0x18/0x20 __kasan_slab_free+0x10c/0x170 kasan_slab_free+0x10/0x18 kfree+0x98/0x270 kasan_rcu_reclaim+0x1c/0x60 Last call_rcu(): kasan_save_stack+0x24/0x50 kasan_record_aux_stack+0xbc/0xd0 call_rcu+0x8c/0x580 kasan_rcu_uaf+0xf4/0xf8 Generic KASAN will record the last two call_rcu() call stacks and print up to 2 call_rcu() call stacks in KASAN report. it is only suitable for generic KASAN. This feature considers the size of struct kasan_alloc_meta and kasan_free_meta, we try to optimize the structure layout and size, lets it get better memory consumption. [1]https://bugzilla.kernel.org/show_bug.cgi?id=198437 [2]https://groups.google.com/forum/#!searchin/kasan-dev/better$20stack$20traces$20for$20rcu%7Csort:date/kasan-dev/KQsjT_88hDE/7rNUZprRBgAJ This patch (of 4): This feature will record the last two call_rcu() call stacks and prints up to 2 call_rcu() call stacks in KASAN report. When call_rcu() is called, we store the call_rcu() call stack into slub alloc meta-data, so that the KASAN report can print rcu stack. [1]https://bugzilla.kernel.org/show_bug.cgi?id=198437 [2]https://groups.google.com/forum/#!searchin/kasan-dev/better$20stack$20traces$20for$20rcu%7Csort:date/kasan-dev/KQsjT_88hDE/7rNUZprRBgAJ [walter-zh.wu@mediatek.com: build fix] Link: http://lkml.kernel.org/r/20200710162401.23816-1-walter-zh.wu@mediatek.com Suggested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthias Brugger <matthias.bgg@gmail.com> Link: http://lkml.kernel.org/r/20200710162123.23713-1-walter-zh.wu@mediatek.com Link: http://lkml.kernel.org/r/20200601050847.1096-1-walter-zh.wu@mediatek.com Link: http://lkml.kernel.org/r/20200601050927.1153-1-walter-zh.wu@mediatek.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: adjust vm_committed_as_batch according to vm overcommit policyFeng Tang1-1/+1
When checking a performance change for will-it-scale scalability mmap test [1], we found very high lock contention for spinlock of percpu counter 'vm_committed_as': 94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave 48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap; 45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap; Actually this heavy lock contention is not always necessary. The 'vm_committed_as' needs to be very precise when the strict OVERCOMMIT_NEVER policy is set, which requires a rather small batch number for the percpu counter. So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies. Also add a sysctl handler to adjust it when the policy is reconfigured. Benchmark with the same testcase in [1] shows 53% improvement on a 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. We tested with test platforms in 0day (server, desktop and laptop), and 80%+ platforms shows improvements with that test. And whether it shows improvements depends on if the test mmap size is bigger than the batch number computed. And if the lift is 16X, 1/3 of the platforms will show improvements, though it should help the mmap/unmap usage generally, as Michal Hocko mentioned: : I believe that there are non-synthetic worklaods which would benefit from : a larger batch. E.g. large in memory databases which do large mmaps : during startups from multiple threads. [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/ Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Qian Cai <cai@lca.pw> Cc: Kees Cook <keescook@chromium.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcontrol: account kernel stack per nodeShakeel Butt2-38/+15
Currently the kernel stack is being accounted per-zone. There is no need to do that. In addition due to being per-zone, memcg has to keep a separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of node_stat_item. In addition localize the kernel stack stats updates to account_kernel_stack(). Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg: convert vmstat slab counters to bytesRoman Gushchin1-1/+1
In order to prepare for per-object slab memory accounting, convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes. To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB). Internally global and per-node counters are stored in pages, however memcg and lruvec counters are stored in bytes. This scheme may look weird, but only for now. As soon as slab pages will be shared between multiple cgroups, global and node counters will reflect the total number of slab pages. However memcg and lruvec counters will be used for per-memcg slab memory tracking, which will take separate kernel objects in the account. Keeping global and node counters in pages helps to avoid additional overhead. The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it will fit into atomic_long_t we use for vmstats. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07kthread: remove incorrect comment in kthread_create_on_cpu()Ilias Stamatis1-1/+0
Originally kthread_create_on_cpu() parked and woke up the new thread. However, since commit a65d40961dc7 ("kthread/smpboot: do not park in kthread_create_on_cpu()") this is no longer the case. This patch removes the comment that has been left behind and is now incorrect / stale. Fixes: a65d40961dc7 ("kthread/smpboot: do not park in kthread_create_on_cpu()") Signed-off-by: Ilias Stamatis <stamatis.iliass@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: http://lkml.kernel.org/r/20200611135920.240551-1-stamatis.iliass@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: fix kthread_use_mm() vs TLB invalidatePeter Zijlstra1-1/+6
For SMP systems using IPI based TLB invalidation, looking at current->active_mm is entirely reasonable. This then presents the following race condition: CPU0 CPU1 flush_tlb_mm(mm) use_mm(mm) <send-IPI> tsk->active_mm = mm; <IPI> if (tsk->active_mm == mm) // flush TLBs </IPI> switch_mm(old_mm,mm,tsk); Where it is possible the IPI flushed the TLBs for @old_mm, not @mm, because the IPI lands before we actually switched. Avoid this by disabling IRQs across changing ->active_mm and switch_mm(). Of the (SMP) architectures that have IPI based TLB invalidate: Alpha - checks active_mm ARC - ASID specific IA64 - checks active_mm MIPS - ASID specific flush OpenRISC - shoots down world PARISC - shoots down world SH - ASID specific SPARC - ASID specific x86 - N/A xtensa - checks active_mm So at the very least Alpha, IA64 and Xtensa are suspect. On top of this, for scheduler consistency we need at least preemption disabled across changing tsk->mm and doing switch_mm(), which is currently provided by task_lock(), but that's not sufficient for PREEMPT_RT. [akpm@linux-foundation.org: add comment] Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kees Cook <keescook@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Will Deacon <will@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200721154106.GE10769@hirez.programming.kicks-ass.net Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07Merge tag 'powerpc-5.9-1' of ↵Linus Torvalds1-2/+14
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add support for (optionally) using queued spinlocks & rwlocks. - Support for a new faster system call ABI using the scv instruction on Power9 or later. - Drop support for the PROT_SAO mmap/mprotect flag as it will be unsupported on Power10 and future processors, leaving us with no way to implement the functionality it requests. This risks breaking userspace, though we believe it is unused in practice. - A bug fix for, and then the removal of, our custom stack expansion checking. We now allow stack expansion up to the rlimit, like other architectures. - Remove the remnants of our (previously disabled) topology update code, which tried to react to NUMA layout changes on virtualised systems, but was prone to crashes and other problems. - Add PMU support for Power10 CPUs. - A change to our signal trampoline so that we don't unbalance the link stack (branch return predictor) in the signal delivery path. - Lots of other cleanups, refactorings, smaller features and so on as usual. Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand, Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran, Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud, Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov, Wei Yongjun, Wen Xiong, YueHaibing. * tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (337 commits) selftests/powerpc: Fix pkey syscall redefinitions powerpc: Fix circular dependency between percpu.h and mmu.h powerpc/powernv/sriov: Fix use of uninitialised variable selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs powerpc/40x: Fix assembler warning about r0 powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric powerpc/papr_scm: Fetch nvdimm performance stats from PHYP cpuidle: pseries: Fixup exit latency for CEDE(0) cpuidle: pseries: Add function to parse extended CEDE records cpuidle: pseries: Set the latency-hint before entering CEDE selftests/powerpc: Fix online CPU selection powerpc/perf: Consolidate perf_callchain_user_[64|32]() powerpc/pseries/hotplug-cpu: Remove double free in error path powerpc/pseries/mobility: Add pr_debug() for device tree changes powerpc/pseries/mobility: Set pr_fmt() powerpc/cacheinfo: Warn if cache object chain becomes unordered powerpc/cacheinfo: Improve diagnostics about malformed cache lists powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages powerpc/cacheinfo: Set pr_fmt() powerpc: fix function annotations to avoid section mismatch warnings with gcc-10 ...
2020-08-07Merge branch 'work.regset' of ↵Linus Torvalds2-1/+77
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ptrace regset updates from Al Viro: "Internal regset API changes: - regularize copy_regset_{to,from}_user() callers - switch to saner calling conventions for ->get() - kill user_regset_copyout() The ->put() side of things will have to wait for the next cycle, unfortunately. The balance is about -1KLoC and replacements for ->get() instances are a lot saner" * 'work.regset' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits) regset: kill user_regset_copyout{,_zero}() regset(): kill ->get_size() regset: kill ->get() csky: switch to ->regset_get() xtensa: switch to ->regset_get() parisc: switch to ->regset_get() nds32: switch to ->regset_get() nios2: switch to ->regset_get() hexagon: switch to ->regset_get() h8300: switch to ->regset_get() openrisc: switch to ->regset_get() riscv: switch to ->regset_get() c6x: switch to ->regset_get() ia64: switch to ->regset_get() arc: switch to ->regset_get() arm: switch to ->regset_get() sh: convert to ->regset_get() arm64: switch to ->regset_get() mips: switch to ->regset_get() sparc: switch to ->regset_get() ...
2020-08-06Merge tag 'fsnotify_for_v5.9-rc1' of ↵Linus Torvalds4-32/+22
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: - fanotify fix for softlockups when there are many queued events - performance improvement to reduce fsnotify overhead when not used - Amir's implementation of fanotify events with names. With these you can now efficiently monitor whole filesystem, eg to mirror changes to another machine. * tag 'fsnotify_for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (37 commits) fanotify: compare fsid when merging name event fsnotify: create method handle_inode_event() in fsnotify_operations fanotify: report parent fid + child fid fanotify: report parent fid + name + child fid fanotify: add support for FAN_REPORT_NAME fanotify: report events with parent dir fid to sb/mount/non-dir marks fanotify: add basic support for FAN_REPORT_DIR_FID fsnotify: remove check that source dentry is positive fsnotify: send event with parent/name info to sb/mount/non-dir marks audit: do not set FS_EVENT_ON_CHILD in audit marks mask inotify: do not set FS_EVENT_ON_CHILD in non-dir mark mask fsnotify: pass dir and inode arguments to fsnotify() fsnotify: create helper fsnotify_inode() fsnotify: send event to parent and child with single callback inotify: report both events on parent and child with single callback dnotify: report both events on parent and child with single callback fanotify: no external fh buffer in fanotify_name_event fanotify: use struct fanotify_info to parcel the variable size buffer fsnotify: add object type "child" to object type iterator fanotify: use FAN_EVENT_ON_CHILD as implicit flag on sb/mount/non-dir marks ...
2020-08-06Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-0/+7
Pull KVM updates from Paolo Bonzini: "s390: - implement diag318 x86: - Report last CPU for debugging - Emulate smaller MAXPHYADDR in the guest than in the host - .noinstr and tracing fixes from Thomas - nested SVM page table switching optimization and fixes Generic: - Unify shadow MMU cache data structures across architectures" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) KVM: SVM: Fix sev_pin_memory() error handling KVM: LAPIC: Set the TDCR settable bits KVM: x86: Specify max TDP level via kvm_configure_mmu() KVM: x86/mmu: Rename max_page_level to max_huge_page_level KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch KVM: x86: Pull the PGD's level from the MMU instead of recalculating it KVM: VMX: Make vmx_load_mmu_pgd() static KVM: x86/mmu: Add separate helper for shadow NPT root page role calc KVM: VMX: Drop a duplicate declaration of construct_eptp() KVM: nSVM: Correctly set the shadow NPT root level in its MMU role KVM: Using macros instead of magic values MIPS: KVM: Fix build error caused by 'kvm_run' cleanup KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support KVM: VMX: optimize #PF injection when MAXPHYADDR does not match KVM: VMX: Add guest physical address check in EPT violation and misconfig KVM: VMX: introduce vmx_need_pf_intercept KVM: x86: update exception bitmap on CPUID changes KVM: x86: rename update_bp_intercept to update_exception_bitmap ...
2020-08-06Merge tag 'sched-fifo-2020-08-04' of ↵Linus Torvalds7-57/+77
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched/fifo updates from Ingo Molnar: "This adds the sched_set_fifo*() encapsulation APIs to remove static priority level knowledge from non-scheduler code. The three APIs for non-scheduler code to set SCHED_FIFO are: - sched_set_fifo() - sched_set_fifo_low() - sched_set_normal() These are two FIFO priority levels: default (high), and a 'low' priority level, plus sched_set_normal() to set the policy back to non-SCHED_FIFO. Since the changes affect a lot of non-scheduler code, we kept this in a separate tree" * tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) sched,tracing: Convert to sched_set_fifo() sched: Remove sched_set_*() return value sched: Remove sched_setscheduler*() EXPORTs sched,psi: Convert to sched_set_fifo_low() sched,rcutorture: Convert to sched_set_fifo_low() sched,rcuperf: Convert to sched_set_fifo_low() sched,locktorture: Convert to sched_set_fifo() sched,irq: Convert to sched_set_fifo() sched,watchdog: Convert to sched_set_fifo() sched,serial: Convert to sched_set_fifo() sched,powerclamp: Convert to sched_set_fifo() sched,ion: Convert to sched_set_normal() sched,powercap: Convert to sched_set_fifo*() sched,spi: Convert to sched_set_fifo*() sched,mmc: Convert to sched_set_fifo*() sched,ivtv: Convert to sched_set_fifo*() sched,drm/scheduler: Convert to sched_set_fifo*() sched,msm: Convert to sched_set_fifo*() sched,psci: Convert to sched_set_fifo*() sched,drbd: Convert to sched_set_fifo*() ...
2020-08-06Merge tag 'integrity-v5.9' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull integrity updates from Mimi Zohar: "The nicest change is the IMA policy rule checking. The other changes include allowing the kexec boot cmdline line measure policy rules to be defined in terms of the inode associated with the kexec kernel image, making the IMA_APPRAISE_BOOTPARAM, which governs the IMA appraise mode (log, fix, enforce), a runtime decision based on the secure boot mode of the system, and including errno in the audit log" * tag 'integrity-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: integrity: remove redundant initialization of variable ret ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime ima: AppArmor satisfies the audit rule requirements ima: Rename internal filter rule functions ima: Support additional conditionals in the KEXEC_CMDLINE hook function ima: Use the common function to detect LSM conditionals in a rule ima: Move comprehensive rule validation checks out of the token parser ima: Use correct type for the args_p member of ima_rule_entry.lsm elements ima: Shallow copy the args_p member of ima_rule_entry.lsm elements ima: Fail rule parsing when appraise_flag=blacklist is unsupportable ima: Fail rule parsing when the KEY_CHECK hook is combined with an invalid cond ima: Fail rule parsing when the KEXEC_CMDLINE hook is combined with an invalid cond ima: Fail rule parsing when buffer hook functions have an invalid action ima: Free the entire rule if it fails to parse ima: Free the entire rule when deleting a list of rules ima: Have the LSM free its audit rule IMA: Add audit log for failure conditions integrity: Add errno field in audit message
2020-08-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds27-429/+1884
Pull networking updates from David Miller: 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan. 2) Support UDP segmentation in code TSO code, from Eric Dumazet. 3) Allow flashing different flash images in cxgb4 driver, from Vishal Kulkarni. 4) Add drop frames counter and flow status to tc flower offloading, from Po Liu. 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni. 6) Various new indirect call avoidance, from Eric Dumazet and Brian Vazquez. 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from Yonghong Song. 8) Support querying and setting hardware address of a port function via devlink, use this in mlx5, from Parav Pandit. 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson. 10) Switch qca8k driver over to phylink, from Jonathan McDowell. 11) In bpftool, show list of processes holding BPF FD references to maps, programs, links, and btf objects. From Andrii Nakryiko. 12) Several conversions over to generic power management, from Vaibhav Gupta. 13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry Yakunin. 14) Various https url conversions, from Alexander A. Klimov. 15) Timestamping and PHC support for mscc PHY driver, from Antoine Tenart. 16) Support bpf iterating over tcp and udp sockets, from Yonghong Song. 17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov. 18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan. 19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several drivers. From Luc Van Oostenryck. 20) XDP support for xen-netfront, from Denis Kirjanov. 21) Support receive buffer autotuning in MPTCP, from Florian Westphal. 22) Support EF100 chip in sfc driver, from Edward Cree. 23) Add XDP support to mvpp2 driver, from Matteo Croce. 24) Support MPTCP in sock_diag, from Paolo Abeni. 25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic infrastructure, from Jakub Kicinski. 26) Several pci_ --> dma_ API conversions, from Christophe JAILLET. 27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel. 28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki. 29) Refactor a lot of networking socket option handling code in order to avoid set_fs() calls, from Christoph Hellwig. 30) Add rfc4884 support to icmp code, from Willem de Bruijn. 31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei. 32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin. 33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin. 34) Support TCP syncookies in MPTCP, from Flowian Westphal. 35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano Brivio. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits) net: thunderx: initialize VF's mailbox mutex before first usage usb: hso: remove bogus check for EINPROGRESS usb: hso: no complaint about kmalloc failure hso: fix bailout in error case of probe ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM selftests/net: relax cpu affinity requirement in msg_zerocopy test mptcp: be careful on subflow creation selftests: rtnetlink: make kci_test_encap() return sub-test result selftests: rtnetlink: correct the final return value for the test net: dsa: sja1105: use detected device id instead of DT one on mismatch tipc: set ub->ifindex for local ipv6 address ipv6: add ipv6_dev_find() net: openvswitch: silence suspicious RCU usage warning Revert "vxlan: fix tos value before xmit" ptp: only allow phase values lower than 1 period farsync: switch from 'pci_' to 'dma_' API wan: wanxl: switch from 'pci_' to 'dma_' API hv_netvsc: do not use VF device if link is down dpaa2-eth: Fix passing zero to 'PTR_ERR' warning net: macb: Properly handle phylink on at91sam9x ...
2020-08-05Merge tag 'driver-core-5.9-rc1' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "big" set of changes to the driver core, and some drivers using the changes, for 5.9-rc1. "Biggest" thing in here is the device link exposure in sysfs, to help to tame the madness that is SoC device tree representations and driver interactions with it. Other stuff in here that is interesting is: - device probe log helper so that drivers can report problems in a unified way easier. - devres functions added - DEVICE_ATTR_ADMIN_* macro added to make it harder to write incorrect sysfs file permissions - documentation cleanups - ability for debugfs to be present in the kernel, yet not exposed to userspace. Needed for systems that want it enabled, but do not trust users, so they can still use some kernel functions that were otherwise disabled. - other minor fixes and cleanups The patches outside of drivers/base/ all have acks from the respective subsystem maintainers to go through this tree instead of theirs. All of these have been in linux-next with no reported issues" * tag 'driver-core-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (39 commits) drm/bridge: lvds-codec: simplify error handling drm/bridge/sii8620: fix resource acquisition error handling driver core: add deferring probe reason to devices_deferred property driver core: add device probe log helper driver core: Avoid binding drivers to dead devices Revert "test_firmware: Test platform fw loading on non-EFI systems" firmware_loader: EFI firmware loader must handle pre-allocated buffer selftest/firmware: Add selftest timeout in settings test_firmware: Test platform fw loading on non-EFI systems driver core: Change delimiter in devlink device's name to "--" debugfs: Add access restriction option tracefs: Remove unnecessary debug_fs checks. driver core: Fix probe_count imbalance in really_probe() kobject: remove unused KOBJ_MAX action driver core: Fix sleeping in invalid context during device link deletion driver core: Add waiting_for_supplier sysfs file for devices driver core: Add state_synced sysfs file for devices that support it driver core: Expose device link details in sysfs driver core: Drop mention of obsolete bus rwsem from kernel-doc debugfs: file: Remove unnecessary cast in kfree() ...
2020-08-05Merge tag 'char-misc-5.9-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the large set of char and misc and other driver subsystem patches for 5.9-rc1. Lots of new driver submissions in here, and cleanups and features for existing drivers. Highlights are: - habanalabs driver updates - coresight driver updates - nvmem driver updates - huge number of "W=1" build warning cleanups from Lee Jones - dyndbg updates - virtbox driver fixes and updates - soundwire driver updates - mei driver updates - phy driver updates - fpga driver updates - lots of smaller individual misc/char driver cleanups and fixes Full details are in the shortlog. All of these have been in linux-next with no reported issues" * tag 'char-misc-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (322 commits) habanalabs: remove unused but set variable 'ctx_asid' nvmem: qcom-spmi-sdam: Enable multiple devices dt-bindings: nvmem: SID: add binding for A100's SID controller nvmem: update Kconfig description nvmem: qfprom: Add fuse blowing support dt-bindings: nvmem: Add properties needed for blowing fuses dt-bindings: nvmem: qfprom: Convert to yaml nvmem: qfprom: use NVMEM_DEVID_AUTO for multiple instances nvmem: core: add support to auto devid nvmem: core: Add nvmem_cell_read_u8() nvmem: core: Grammar fixes for help text nvmem: sc27xx: add sc2730 efuse support nvmem: Enforce nvmem stride in the sysfs interface MAINTAINERS: Add git tree for NVMEM FRAMEWORK nvmem: sprd: Fix return value of sprd_efuse_probe() drivers: android: Fix the SPDX comment style drivers: android: Fix a variable declaration coding style issue drivers: android: Remove braces for a single statement if-else block drivers: android: Remove the use of else after return drivers: android: Fix a variable declaration coding style issue ...
2020-08-04Merge tag 'docs-5.9' of git://git.lwn.net/linuxLinus Torvalds1-1/+1
Pull documentation updates from Jonathan Corbet: "It's been a busy cycle for documentation - hopefully the busiest for a while to come. Changes include: - Some new Chinese translations - Progress on the battle against double words words and non-HTTPS URLs - Some block-mq documentation - More RST conversions from Mauro. At this point, that task is essentially complete, so we shouldn't see this kind of churn again for a while. Unless we decide to switch to asciidoc or something...:) - Lots of typo fixes, warning fixes, and more" * tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits) scripts/kernel-doc: optionally treat warnings as errors docs: ia64: correct typo mailmap: add entry for <alobakin@marvell.com> doc/zh_CN: add cpu-load Chinese version Documentation/admin-guide: tainted-kernels: fix spelling mistake MAINTAINERS: adjust kprobes.rst entry to new location devices.txt: document rfkill allocation PCI: correct flag name docs: filesystems: vfs: correct flag name docs: filesystems: vfs: correct sync_mode flag names docs: path-lookup: markup fixes for emphasis docs: path-lookup: more markup fixes docs: path-lookup: fix HTML entity mojibake CREDITS: Replace HTTP links with HTTPS ones docs: process: Add an example for creating a fixes tag doc/zh_CN: add Chinese translation prefer section doc/zh_CN: add clearing-warn-once Chinese version doc/zh_CN: add admin-guide index doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label futex: MAINTAINERS: Re-add selftests directory ...
2020-08-04Merge tag 'printk-for-5.9' of ↵Linus Torvalds1-4/+12
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Herbert Xu made printk header file self-contained. - Andy Shevchenko and Sergey Senozhatsky cleaned up console->setup() error handling. - Andy Shevchenko did some cleanups (e.g. sparse warning) in vsprintf code. - Minor documentation updates. * tag 'printk-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: lib/vsprintf: Force type of flags value for gfp_t lib/vsprintf: Replace custom spec to print decimals with generic one lib/vsprintf: Replace hidden BUILD_BUG_ON() with static_assert() printk: Make linux/printk.h self-contained doc:kmsg: explicitly state the return value in case of SEEK_CUR Replace HTTP links with HTTPS ones: vsprintf hvc: unify console setup naming console: Fix trivia typo 'change' -> 'chance' console: Propagate error code from console ->setup() tty: hvc: Return proper error code from console ->setup() hook serial: sunzilog: Return proper error code from console ->setup() hook serial: sunsab: Return proper error code from console ->setup() hook mips: Return proper error code from console ->setup() hook
2020-08-04Merge tag 'core-entry-2020-08-04' of ↵Linus Torvalds4-0/+439
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull generic kernel entry/exit code from Thomas Gleixner: "Generic implementation of common syscall, interrupt and exception entry/exit functionality based on the recent X86 effort to ensure correctness of entry/exit vs RCU and instrumentation. As this functionality and the required entry/exit sequences are not architecture specific, sharing them allows other architectures to benefit instead of copying the same code over and over again. This branch was kept standalone to allow others to work on it. The conversion of x86 comes in a seperate pull request which obviously is based on this branch" * tag 'core-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: entry: Correct __secure_computing() stub entry: Correct 'noinstr' attributes entry: Provide infrastructure for work before transitioning to guest mode entry: Provide generic interrupt entry/exit code entry: Provide generic syscall exit function entry: Provide generic syscall entry functionality seccomp: Provide stub for __secure_computing()
2020-08-04Merge tag 'timers-core-2020-08-04' of ↵Linus Torvalds1-145/+108
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Time, timers and related driver updates: - Prevent unnecessary timer softirq invocations by extending the tracking of the next expiring timer in the timer wheel beyond the existing NOHZ functionality. The tracking overhead at enqueue time is within the noise, but on sensitive workloads the avoidance of the soft interrupt invocation is a measurable improvement. - The obligatory new clocksource driver for Ingenic X100 OST - The usual fixes, improvements, cleanups and extensions for newer chip variants all over the driver space" * tag 'timers-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits) timers: Recalculate next timer interrupt only when necessary clocksource/drivers/ingenic: Add support for the Ingenic X1000 OST. dt-bindings: timer: Add Ingenic X1000 OST bindings. clocksource/drivers: Replace HTTP links with HTTPS ones clocksource/drivers/nomadik-mtu: Handle 32kHz clock clocksource/drivers/sh_cmt: Use "kHz" for kilohertz clocksource/drivers/imx: Add support for i.MX TPM driver with ARM64 clocksource/drivers/ingenic: Add high resolution timer support for SMP/SMT. timers: Lower base clock forwarding threshold timers: Remove must_forward_clk timers: Spare timer softirq until next expiry timers: Expand clk forward logic beyond nohz timers: Reuse next expiry cache after nohz exit timers: Always keep track of next expiry timers: Optimize _next_timer_interrupt() level iteration timers: Add comments about calc_index() ceiling work timers: Move trigger_dyntick_cpu() to enqueue_timer() timers: Use only bucket expiry for base->next_expiry value timers: Preserve higher bits of expiration on index calculation clocksource/drivers/timer-atmel-tcb: Add sama5d2 support ...
2020-08-04Merge tag 'irq-core-2020-08-04' of ↵Linus Torvalds3-19/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The usual boring updates from the interrupt subsystem: - Infrastructure to allow building irqchip drivers as modules - Consolidation of irqchip ACPI probing - Removal of the EOI-preflow interrupt handler which was required for SPARC support and became obsolete after SPARC was converted to use sparse interrupts. - Cleanups, fixes and improvements all over the place" * tag 'irq-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) irqchip/loongson-pch-pic: Fix the misused irq flow handler irqchip/loongson-htvec: Support 8 groups of HT vectors irqchip/loongson-liointc: Fix misuse of gc->mask_cache dt-bindings: interrupt-controller: Update Loongson HTVEC description irqchip/imx-intmux: Fix irqdata regs save in imx_intmux_runtime_suspend() irqchip/imx-intmux: Implement intmux runtime power management irqchip/gic-v4.1: Use GFP_ATOMIC flag in allocate_vpe_l1_table() irqchip: Fix IRQCHIP_PLATFORM_DRIVER_* compilation by including module.h irqchip/stm32-exti: Map direct event to irq parent irqchip/mtk-cirq: Convert to a platform driver irqchip/mtk-sysirq: Convert to a platform driver irqchip/qcom-pdc: Switch to using IRQCHIP_PLATFORM_DRIVER helper macros irqchip: Add IRQCHIP_PLATFORM_DRIVER_BEGIN/END and IRQCHIP_MATCH helper macros irqchip: irq-bcm2836.h: drop a duplicated word irqchip/gic-v4.1: Ensure accessing the correct RD when writing INVALLR irqchip/irq-bcm7038-l1: Guard uses of cpu_logical_map irqchip/gic-v3: Remove unused register definition irqchip/qcom-pdc: Allow QCOM_PDC to be loadable as a permanent module genirq: Export irq_chip_retrigger_hierarchy and irq_chip_set_vcpu_affinity_parent irqdomain: Export irq_domain_update_bus_token ...
2020-08-04Merge tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds6-104/+238
Pull dma-mapping updates from Christoph Hellwig: - make support for dma_ops optional - move more code out of line - add generic support for a dma_ops bypass mode - misc cleanups * tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping: dma-contiguous: cleanup dma_alloc_contiguous dma-debug: use named initializers for dir2name powerpc: use the generic dma_ops_bypass mode dma-mapping: add a dma_ops_bypass flag to struct device dma-mapping: make support for dma ops optional dma-mapping: inline the fast path dma-direct calls dma-mapping: move the remaining DMA API calls out of line
2020-08-04Merge tag 'close-range-v5.9' of ↵Linus Torvalds1-5/+6
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull close_range() implementation from Christian Brauner: "This adds the close_range() syscall. It allows to efficiently close a range of file descriptors up to all file descriptors of a calling task. This is coordinated with the FreeBSD folks which have copied our version of this syscall and in the meantime have already merged it in April 2019: https://reviews.freebsd.org/D21627 https://svnweb.freebsd.org/base?view=revision&revision=359836 The syscall originally came up in a discussion around the new mount API and making new file descriptor types cloexec by default. During this discussion, Al suggested the close_range() syscall. First, it helps to close all file descriptors of an exec()ing task. This can be done safely via (quoting Al's example from [1] verbatim): /* that exec is sensitive */ unshare(CLONE_FILES); /* we don't want anything past stderr here */ close_range(3, ~0U); execve(....); The code snippet above is one way of working around the problem that file descriptors are not cloexec by default. This is aggravated by the fact that we can't just switch them over without massively regressing userspace. For a whole class of programs having an in-kernel method of closing all file descriptors is very helpful (e.g. demons, service managers, programming language standard libraries, container managers etc.). Second, it allows userspace to avoid implementing closing all file descriptors by parsing through /proc/<pid>/fd/* and calling close() on each file descriptor and other hacks. From looking at various large(ish) userspace code bases this or similar patterns are very common in service managers, container runtimes, and programming language runtimes/standard libraries such as Python or Rust. In addition, the syscall will also work for tasks that do not have procfs mounted and on kernels that do not have procfs support compiled in. In such situations the only way to make sure that all file descriptors are closed is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery. Based on Linus' suggestion close_range() also comes with a new flag CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping right before exec. This would usually be expressed in the sequence: unshare(CLONE_FILES); close_range(3, ~0U); as pointed out by Linus it might be desirable to have this be a part of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which gets especially handy when we're closing all file descriptors above a certain threshold. Test-suite as always included" * tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: add CLOSE_RANGE_UNSHARE tests close_range: add CLOSE_RANGE_UNSHARE tests: add close_range() tests arch: wire-up close_range() open: add close_range()
2020-08-04Merge tag 'cap-checkpoint-restore-v5.9' of ↵Linus Torvalds3-7/+10
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull checkpoint-restore updates from Christian Brauner: "This enables unprivileged checkpoint/restore of processes. Given that this work has been going on for quite some time the first sentence in this summary is hopefully more exciting than the actual final code changes required. Unprivileged checkpoint/restore has seen a frequent increase in interest over the last two years and has thus been one of the main topics for the combined containers & checkpoint/restore microconference since at least 2018 (cf. [1]). Here are just the three most frequent use-cases that were brought forward: - The JVM developers are integrating checkpoint/restore into a Java VM to significantly decrease the startup time. - In high-performance computing environment a resource manager will typically be distributing jobs where users are always running as non-root. Long-running and "large" processes with significant startup times are supposed to be checkpointed and restored with CRIU. - Container migration as a non-root user. In all of these scenarios it is either desirable or required to run without CAP_SYS_ADMIN. The userspace implementation of checkpoint/restore CRIU already has the pull request for supporting unprivileged checkpoint/restore up (cf. [2]). To enable unprivileged checkpoint/restore a new dedicated capability CAP_CHECKPOINT_RESTORE is introduced. This solution has last been discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1] "Update on Task Migration at Google Using CRIU") with Adrian and Nicolas providing the implementation now over the last months. In essence, this allows the CRIU binary to be installed with the CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling unprivileged users to restore processes. To make this possible the following permissions are altered: - Selecting a specific PID via clone3() set_tid relaxed from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE. - Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE. - Accessing /proc/pid/map_files relaxed from init userns CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE. - Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE. Of these four changes the /proc/self/exe change deserves a few words because the reasoning behind even restricting /proc/self/exe changes in the first place is just full of historical quirks and tracking this down was a questionable version of fun that I'd like to spare others. In short, it is trivial to change /proc/self/exe as an unprivileged user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace() or by simply intercepting the elf loader in userspace during exec. Nicolas was nice enough to even provide a POC for the latter (cf. [3]) to illustrate this fact. The original patchset which introduced PR_SET_MM_MAP had no permissions around changing the exe link. They too argued that it is trivial to spoof the exe link already which is true. The argument brought up against this was that the Tomoyo LSM uses the exe link in tomoyo_manager() to detect whether the calling process is a policy manager. This caused changing the exe links to be guarded by userns CAP_SYS_ADMIN. All in all this rather seems like a "better guard it with something rather than nothing" argument which imho doesn't qualify as a great security policy. Again, because spoofing the exe link is possible for the calling process so even if this were security relevant it was broken back then and would be broken today. So technically, dropping all permissions around changing the exe link would probably be possible and would send a clearer message to any userspace that relies on /proc/self/exe for security reasons that they should stop doing this but for now we're only relaxing the exe link permissions from userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE. There's a final uapi change in here. Changing the exe link used to accidently return EINVAL when the caller lacked the necessary permissions instead of the more correct EPERM. This pr contains a commit fixing this. I assume that userspace won't notice or care and if they do I will revert this commit. But since we are changing the permissions anyway it seems like a good opportunity to try this fix. With these changes merged unprivileged checkpoint/restore will be possible and has already been tested by various users" [1] LPC 2018 1. "Task Migration at Google Using CRIU" https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095 2. "Securely Migrating Untrusted Workloads with CRIU" https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400 LPC 2019 1. "CRIU and the PID dance" https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s 2. "Update on Task Migration at Google Using CRIU" https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s [2] https://github.com/checkpoint-restore/criu/pull/1155 [3] https://github.com/nviennot/run_as_exe * tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: selftests: add clone3() CAP_CHECKPOINT_RESTORE test prctl: exe link permission error changed from -EINVAL to -EPERM prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid pid: use checkpoint_restore_ns_capable() for set_tid capabilities: Introduce CAP_CHECKPOINT_RESTORE
2020-08-04Merge tag 'fork-v5.9' of ↵Linus Torvalds1-51/+16
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull fork cleanups from Christian Brauner: "This is cleanup series from when we reworked a chunk of the process creation paths in the kernel and switched to struct {kernel_}clone_args. High-level this does two main things: - Remove the double export of both do_fork() and _do_fork() where do_fork() used the incosistent legacy clone calling convention. Now we only export _do_fork() which is based on struct kernel_clone_args. - Remove the copy_thread_tls()/copy_thread() split making the architecture specific HAVE_COYP_THREAD_TLS config option obsolete. This switches all remaining architectures to select HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling convention. The current split makes the process creation codepaths more convoluted than they need to be. Each architecture has their own copy_thread() function unless it selects HAVE_COPY_THREAD_TLS then it has a copy_thread_tls() function. The split is not needed anymore nowadays, all architectures support CLONE_SETTLS but quite a few of them never bothered to select HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread() and use the old calling convention. Removing this split cleans up the process creation codepaths and paves the way for implementing clone3() on such architectures since it requires the copy_thread_tls() calling convention. After having made each architectures support copy_thread_tls() this series simply renames that function back to copy_thread(). It also switches all architectures that call do_fork() directly over to _do_fork() and the struct kernel_clone_args calling convention. This is a corollary of switching the architectures that did not yet support it over to copy_thread_tls() since do_fork() is conditional on not supporting copy_thread_tls() (Mostly because it lacks a separate argument for tls which is trivial to fix but there's no need for this function to exist.). The do_fork() removal is in itself already useful as it allows to to remove the export of both do_fork() and _do_fork() we currently have in favor of only _do_fork(). This has already been discussed back when we added clone3(). The legacy clone() calling convention is - as is probably well-known - somewhat odd: # # ABI hall of shame # config CLONE_BACKWARDS config CLONE_BACKWARDS2 config CLONE_BACKWARDS3 that is aggravated by the fact that some architectures such as sparc follow the CLONE_BACKWARDSx calling convention but don't really select the corresponding config option since they call do_fork() directly. So do_fork() enforces a somewhat arbitrary calling convention in the first place that doesn't really help the individual architectures that deviate from it. They can thus simply be switched to _do_fork() enforcing a single calling convention. (I really hope that any new architectures will __not__ try to implement their own calling conventions...) Most architectures already have made a similar switch (m68k comes to mind). Overall this removes more code than it adds even with a good portion of added comments. It simplifies a chunk of arch specific assembly either by moving the code into C or by simply rewriting the assembly. Architectures that have been touched in non-trivial ways have all been actually boot and stress tested: sparc and ia64 have been tested with Debian 9 images. They are the two architectures which have been touched the most. All non-trivial changes to architectures have seen acks from the relevant maintainers. nios2 with a custom built buildroot image. h8300 I couldn't get something bootable to test on but the changes have been fairly automatic and I'm sure we'll hear people yell if I broke something there. All other architectures that have been touched in trivial ways have been compile tested for each single patch of the series via git rebase -x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested even though they have just been trivially touched (removal of the HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are basically "core architectures" and since it is trivial to get your hands on a useable image" * tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: arch: rename copy_thread_tls() back to copy_thread() arch: remove HAVE_COPY_THREAD_TLS unicore: switch to copy_thread_tls() sh: switch to copy_thread_tls() nds32: switch to copy_thread_tls() microblaze: switch to copy_thread_tls() hexagon: switch to copy_thread_tls() c6x: switch to copy_thread_tls() alpha: switch to copy_thread_tls() fork: remove do_fork() h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args sparc: unconditionally enable HAVE_COPY_THREAD_TLS sparc: share process creation helpers between sparc and sparc64 sparc64: enable HAVE_COPY_THREAD_TLS fork: fold legacy_clone_args_valid() into _do_fork()
2020-08-04Merge tag 'threads-v5.9' of ↵Linus Torvalds2-16/+27
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread updates from Christian Brauner: "This contains the changes to add the missing support for attaching to time namespaces via pidfds. Last cycle setns() was changed to support attaching to multiple namespaces atomically. This requires all namespaces to have a point of no return where they can't fail anymore. Specifically, <namespace-type>_install() is allowed to perform permission checks and install the namespace into the new struct nsset that it has been given but it is not allowed to make visible changes to the affected task. Once <namespace-type>_install() returns, anything that the given namespace type additionally requires to be setup needs to ideally be done in a function that can't fail or if it fails the failure must be non-fatal. For time namespaces the relevant functions that fell into this category were timens_set_vvar_page() and vdso_join_timens(). The latter could still fail although it didn't need to. This function is only implemented for vdso_join_timens() in current mainline. As discussed on-list (cf. [1]), in order to make setns() support time namespaces when attaching to multiple namespaces at once properly we changed vdso_join_timens() to always succeed. So vdso_join_timens() replaces the mmap_write_lock_killable() with mmap_read_lock(). Please note that arm is about to grow vdso support for time namespaces (possibly this merge window). We've synced on this change and arm64 also uses mmap_read_lock(), i.e. makes vdso_join_timens() a function that can't fail. Once the changes here and the arm64 changes have landed, vdso_join_timens() should be turned into a void function so it's obvious to callers and implementers on other architectures that the expectation is that it can't fail. We didn't do this right away because it would've introduced unnecessary merge conflicts between the two trees for no major gain. As always, tests included" [1]: https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein * tag 'threads-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: add CLONE_NEWTIME setns tests nsproxy: support CLONE_NEWTIME with setns() timens: add timens_commit() helper timens: make vdso_join_timens() always succeed
2020-08-04Merge branch 'exec-linus' of ↵Linus Torvalds5-174/+211
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull execve updates from Eric Biederman: "During the development of v5.7 I ran into bugs and quality of implementation issues related to exec that could not be easily fixed because of the way exec is implemented. So I have been diggin into exec and cleaning up what I can. This cycle I have been looking at different ideas and different implementations to see what is possible to improve exec, and cleaning the way exec interfaces with in kernel users. Only cleaning up the interfaces of exec with rest of the kernel has managed to stabalize and make it through review in time for v5.9-rc1 resulting in 2 sets of changes this cycle. - Implement kernel_execve - Make the user mode driver code a better citizen With kernel_execve the code size got a little larger as the copying of parameters from userspace and copying of parameters from userspace is now separate. The good news is kernel threads no longer need to play games with set_fs to use exec. Which when combined with the rest of Christophs set_fs changes should security bugs with set_fs much more difficult" * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits) exec: Implement kernel_execve exec: Factor bprm_stack_limits out of prepare_arg_pages exec: Factor bprm_execve out of do_execve_common exec: Move bprm_mm_init into alloc_bprm exec: Move initialization of bprm->filename into alloc_bprm exec: Factor out alloc_bprm exec: Remove unnecessary spaces from binfmts.h umd: Stop using split_argv umd: Remove exit_umh bpfilter: Take advantage of the facilities of struct pid exit: Factor thread_group_exited out of pidfd_poll umd: Track user space drivers with struct pid bpfilter: Move bpfilter_umh back into init data exec: Remove do_execve_file umh: Stop calling do_execve_file umd: Transform fork_usermode_blob into fork_usermode_driver umd: Rename umd_info.cmdline umd_info.driver_name umd: For clarity rename umh_info umd_info umh: Separate the user mode driver and the user mode helper support umh: Remove call_usermodehelper_setup_file. ...
2020-08-04Merge tag 'audit-pr-20200803' of ↵Linus Torvalds3-22/+66
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Aside from some smaller bug fixes, here are the highlights: - add a new backlog wait metric to the audit status message, this is intended to help admins determine how long processes have been waiting for the audit backlog queue to clear - generate audit records for nftables configuration changes - generate CWD audit records for for the relevant LSM audit records" * tag 'audit-pr-20200803' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: report audit wait metric in audit status reply audit: purge audit_log_string from the intra-kernel audit API audit: issue CWD record to accompany LSM_AUDIT_DATA_* records audit: use the proper gfp flags in the audit_log_nfcfg() calls audit: remove unused !CONFIG_AUDITSYSCALL __audit_inode* stubs audit: add gfp parameter to audit_log_nfcfg audit: log nftables configuration change events audit: Use struct_size() helper in alloc_chunk
2020-08-04Merge tag 'seccomp-v5.9-rc1' of ↵Linus Torvalds4-84/+308
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: "There are a bunch of clean ups and selftest improvements along with two major updates to the SECCOMP_RET_USER_NOTIF filter return: EPOLLHUP support to more easily detect the death of a monitored process, and being able to inject fds when intercepting syscalls that expect an fd-opening side-effect (needed by both container folks and Chrome). The latter continued the refactoring of __scm_install_fd() started by Christoph, and in the process found and fixed a handful of bugs in various callers. - Improved selftest coverage, timeouts, and reporting - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner) - Refactor __scm_install_fd() into __receive_fd() and fix buggy callers - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun Dhillon)" * tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits) selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD seccomp: Introduce addfd ioctl to seccomp user notifier fs: Expand __receive_fd() to accept existing fd pidfd: Replace open-coded receive_fd() fs: Add receive_fd() wrapper for __receive_fd() fs: Move __scm_install_fd() to __receive_fd() net/scm: Regularize compat handling of scm_detach_fds() pidfd: Add missing sock updates for pidfd_getfd() net/compat: Add missing sock updates for SCM_RIGHTS selftests/seccomp: Check ENOSYS under tracing selftests/seccomp: Refactor to use fixture variants selftests/harness: Clean up kern-doc for fixtures seccomp: Use -1 marker for end of mode 1 syscall list seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall() selftests/seccomp: Make kcmp() less required seccomp: Use pr_fmt selftests/seccomp: Improve calibration loop selftests/seccomp: use 90s as timeout selftests/seccomp: Expand benchmark to per-filter measurements ...
2020-08-04Merge tag 'uninit-macro-v5.9-rc1' of ↵Linus Torvalds10-24/+24
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()
2020-08-04Merge tag 'tasklets-v5.9-rc1' of ↵Linus Torvalds4-4/+20
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull tasklets API update from Kees Cook: "These are the infrastructure updates needed to support converting the tasklet API to something more modern (and hopefully for removal further down the road). There is a 300-patch series waiting in the wings to get set out to subsystem maintainers, but these changes need to be present in the kernel first. Since this has some treewide changes, I carried this series for -next instead of paining Thomas with it in -tip, but it's got his Ack. This is similar to the timer_struct modernization from a while back, but not nearly as messy (I hope). :) - Prepare for tasklet API modernization (Romain Perier, Allen Pais, Kees Cook)" * tag 'tasklets-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: tasklet: Introduce new initialization API treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD() usb: gadget: udc: Avoid tasklet passing a global
2020-08-04Merge tag 'gcc-plugins-v5.9-rc1' of ↵Linus Torvalds2-12/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull gcc plugin updates from Kees Cook: "Primarily improvements to STACKLEAK from Alexander Popov, along with some additional cleanups. - Update URLs for HTTPS scheme where available (Alexander A. Klimov) - Improve STACKLEAK code generation on x86 (Alexander Popov)" * tag 'gcc-plugins-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: gcc-plugins: Replace HTTP links with HTTPS ones gcc-plugins/stackleak: Add 'verbose' plugin parameter gcc-plugins/stackleak: Use asm instrumentation to avoid useless register saving ARM: vdso: Don't use gcc plugins for building vgettimeofday.c gcc-plugins/stackleak: Don't instrument itself
2020-08-04Merge branch 'for-5.9-console-return-codes' into for-linusPetr Mladek1-4/+4
2020-08-03Merge tag 'pm-5.9-rc1' of ↵Linus Torvalds7-120/+210
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "The most significant change here is the extension of the Energy Model to cover non-CPU devices (as well as CPUs) from Lukasz Luba. There is also some new hardware support (Ice Lake server idle states table for intel_idle, Sapphire Rapids and Power Limit 4 support in the RAPL driver), some new functionality in the existing drivers (eg. a new switch to disable/enable CPU energy-efficiency optimizations in intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci, devfreq), including the elimination of W=1 build warnings from cpufreq done by Lee Jones. Specifics: - Make the Energy Model cover non-CPU devices (Lukasz Luba). - Add Ice Lake server idle states table to the intel_idle driver and eliminate a redundant static variable from it (Chen Yu, Rafael Wysocki). - Eliminate all W=1 build warnings from cpufreq (Lee Jones). - Add support for Sapphire Rapids and for Power Limit 4 to the Intel RAPL power capping driver (Sumeet Pawnikar, Zhang Rui). - Fix function name in kerneldoc comments in the idle_inject power capping driver (Yangtao Li). - Fix locking issues with cpufreq governors and drop a redundant "weak" function definition from cpufreq (Viresh Kumar). - Rearrange cpufreq to register non-modular governors at the core_initcall level and allow the default cpufreq governor to be specified in the kernel command line (Quentin Perret). - Extend, fix and clean up the intel_pstate driver (Srinivas Pandruvada, Rafael Wysocki): * Add a new sysfs attribute for disabling/enabling CPU energy-efficiency optimizations in the processor. * Make the driver avoid enabling HWP if EPP is not supported. * Allow the driver to handle numeric EPP values in the sysfs interface and fix the setting of EPP via sysfs in the active mode. * Eliminate a static checker warning and clean up a kerneldoc comment. - Clean up some variable declarations in the powernv cpufreq driver (Wei Yongjun). - Fix up the ->enter_s2idle callback definition to cover the case when it points to the same function as ->idle correctly (Neal Liu). - Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson). - Make the PM core emit "changed" uevent when adding/removing the "wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi). - Add a helper macro for declaring PM callbacks and use it in the MMC jz4740 driver (Paul Cercueil). - Fix white space in some places in the hibernate code and make the system-wide PM code use "const char *" where appropriate (Xiang Chen, Alexey Dobriyan). - Add one more "unsafe" helper macro to the freezer to cover the NFS use case (He Zhe). - Change the language in the generic PM domains framework to use parent/child terminology and clean up a typo and some comment fromatting in that code (Kees Cook, Geert Uytterhoeven). - Update the operating performance points OPP framework (Lukasz Luba, Andrew-sh.Cheng, Valdis Kletnieks): * Refactor dev_pm_opp_of_register_em() and update related drivers. * Add a missing function export. * Allow disabled OPPs in dev_pm_opp_get_freq(). - Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier): * Add support for delayed timers to the devfreq core and make the Samsung exynos5422-dmc driver use it. * Unify sysfs interface to use "df-" as a prefix in instance names consistently. * Fix devfreq_summary debugfs node indentation. * Add the rockchip,pmu phandle to the rk3399_dmc driver DT bindings. * List Dmitry Osipenko as the Tegra devfreq driver maintainer. * Fix typos in the core devfreq code. - Update the pm-graph utility to version 5.7 including a number of fixes related to suspend-to-idle (Todd Brandt). - Fix coccicheck errors and warnings in the cpupower utility (Shuah Khan). - Replace HTTP links with HTTPs ones in multiple places (Alexander A. Klimov)" * tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits) cpuidle: ACPI: fix 'return' with no value build warning cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode cpufreq: intel_pstate: Rearrange the storing of new EPP values intel_idle: Customize IceLake server support PM / devfreq: Fix the wrong end with semicolon PM / devfreq: Fix indentaion of devfreq_summary debugfs node PM / devfreq: Clean up the devfreq instance name in sysfs attr memory: samsung: exynos5422-dmc: Add module param to control IRQ mode memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold memory: samsung: exynos5422-dmc: Use delayed timer as default PM / devfreq: Add support delayed timer for polling mode dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle PM / devfreq: tegra: Add Dmitry as a maintainer PM / devfreq: event: Fix trivial spelling PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent cpuidle: change enter_s2idle() prototype cpuidle: psci: Prevent domain idlestates until consumers are ready cpuidle: psci: Convert PM domain to platform driver cpuidle: psci: Fix error path via converting to a platform driver cpuidle: psci: Fail cpuidle registration if set OSI mode failed ...
2020-08-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller17-204/+1108
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-08-04 The following pull-request contains BPF updates for your *net-next* tree. We've added 73 non-merge commits during the last 9 day(s) which contain a total of 135 files changed, 4603 insertions(+), 1013 deletions(-). The main changes are: 1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko. 2) Add BPF iterator for map elements and to iterate all BPF programs for efficient in-kernel inspection, from Yonghong Song and Alexei Starovoitov. 3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid unwinder errors, from Song Liu. 4) Allow cgroup local storage map to be shared between programs on the same cgroup. Also extend BPF selftests with coverage, from YiFei Zhu. 5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM load instructions, from Jean-Philippe Brucker. 6) Follow-up fixes on BPF socket lookup in combination with reuseport group handling. Also add related BPF selftests, from Jakub Sitnicki. 7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for socket create/release as well as bind functions, from Stanislav Fomichev. 8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct xdp_statistics, from Peilin Ye. 9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime. 10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6} fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin. 11) Fix a bpftool segfault due to missing program type name and make it more robust to prevent them in future gaps, from Quentin Monnet. 12) Consolidate cgroup helper functions across selftests and fix a v6 localhost resolver issue, from John Fastabend. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03Merge tag 'sched-core-2020-08-03' of ↵Linus Torvalds21-277/+789
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Improve uclamp performance by using a static key for the fast path - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for better power efficiency of RT tasks on battery powered devices. (The default is to maximize performance & reduce RT latencies.) - Improve utime and stime tracking accuracy, which had a fixed boundary of error, which created larger and larger relative errors as the values become larger. This is now replaced with more precise arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h. - Improve the deadline scheduler, such as making it capacity aware - Improve frequency-invariant scheduling - Misc cleanups in energy/power aware scheduling - Add sched_update_nr_running tracepoint to track changes to nr_running - Documentation additions and updates - Misc cleanups and smaller fixes * tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst sched/doc: Document capacity aware scheduling sched: Document arch_scale_*_capacity() arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE Documentation/sysctl: Document uclamp sysctl knobs sched/uclamp: Add a new sysctl to control RT default boost value sched/uclamp: Fix a deadlock when enabling uclamp static key sched: Remove duplicated tick_nohz_full_enabled() check sched: Fix a typo in a comment sched/uclamp: Remove unnecessary mutex_init() arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry arch_topology, sched/core: Cleanup thermal pressure definition trace/events/sched.h: fix duplicated word linux/sched/mm.h: drop duplicated words in comments smp: Fix a potential usage of stale nr_cpus sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal sched: nohz: stop passing around unused "ticks" parameter. sched: Better document ttwu() sched: Add a tracepoint to track rq->nr_running ...
2020-08-03Merge tag 'perf-core-2020-08-03' of ↵Linus Torvalds4-16/+302
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event updates from Ingo Molnar: "HW support updates: - Add uncore support for Intel Comet Lake - Add RAPL support for Hygon Fam18h - Add Intel "IIO stack to PMON mapping" support on Skylake-SP CPUs, which enumerates per device performance counters via sysfs and enables the perf stat --iiostat functionality - Add support for Intel "Architectural LBRs", which generalized the model specific LBR hardware tracing feature into a model-independent, architected performance monitoring feature. Usage is mostly seamless to tooling, as the pre-existing LBR features are kept, but there's a couple of advantages under the hood, such as faster context-switching, faster LBR reads, cleaner exposure of LBR features to guest kernels, etc. ( Since architectural LBRs are supported via XSAVE, there's related changes to the x86 FPU code as well. ) ftrace/perf updates: - Add support to add a text poke event to record changes to kernel text (i.e. self-modifying code) in order to support tracers like Intel PT decoding through jump labels, kprobes and ftrace trampolines. Misc cleanups, smaller fixes..." * tag 'perf-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) perf/x86/rapl: Add Hygon Fam18h RAPL support kprobes: Remove unnecessary module_mutex locking from kprobe_optimizer() x86/perf: Fix a typo perf: <linux/perf_event.h>: drop a duplicated word perf/x86/intel/lbr: Support XSAVES for arch LBR read perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch x86/fpu/xstate: Add helpers for LBR dynamic supervisor feature x86/fpu/xstate: Support dynamic supervisor feature for LBR x86/fpu: Use proper mask to replace full instruction mask perf/x86: Remove task_ctx_size perf/x86/intel/lbr: Create kmem_cache for the LBR context data perf/core: Use kmem_cache to allocate the PMU specific data perf/core: Factor out functions to allocate/free the task_ctx_data perf/x86/intel/lbr: Support Architectural LBR perf/x86/intel/lbr: Factor out intel_pmu_store_lbr perf/x86/intel/lbr: Factor out rdlbr_all() and wrlbr_all() perf/x86/intel/lbr: Mark the {rd,wr}lbr_{to,from} wrappers __always_inline perf/x86/intel/lbr: Unify the stored format of LBR information perf/x86/intel/lbr: Support LBR_CTL perf/x86: Expose CPUID enumeration bits for arch LBR ...
2020-08-03Merge tag 'locking-core-2020-08-03' of ↵Linus Torvalds12-167/+1309
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - LKMM updates: mostly documentation changes, but also some new litmus tests for atomic ops. - KCSAN updates: the most important change is that GCC 11 now has all fixes in place to support KCSAN, so GCC support can be enabled again. Also more annotations. - futex updates: minor cleanups and simplifications - seqlock updates: merge preparatory changes/cleanups for the 'associated locks' facilities. - lockdep updates: - simplify IRQ trace event handling - add various new debug checks - simplify header dependencies, split out <linux/lockdep_types.h>, decouple lockdep from other low level headers some more - fix NMI handling - misc cleanups and smaller fixes * tag 'locking-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) kcsan: Improve IRQ state trace reporting lockdep: Refactor IRQ trace events fields into struct seqlock: lockdep assert non-preemptibility on seqcount_t write lockdep: Add preemption enabled/disabled assertion APIs seqlock: Implement raw_seqcount_begin() in terms of raw_read_seqcount() seqlock: Add kernel-doc for seqcount_t and seqlock_t APIs seqlock: Reorder seqcount_t and seqlock_t API definitions seqlock: seqcount_t latch: End read sections with read_seqcount_retry() seqlock: Properly format kernel-doc code samples Documentation: locking: Describe seqlock design and usage locking/qspinlock: Do not include atomic.h from qspinlock_types.h locking/atomic: Move ATOMIC_INIT into linux/types.h lockdep: Move list.h inclusion into lockdep.h locking/lockdep: Fix TRACE_IRQFLAGS vs. NMIs futex: Remove unused or redundant includes futex: Consistently use fshared as boolean futex: Remove needless goto's futex: Remove put_futex_key() rwsem: fix commas in initialisation docs: locking: Replace HTTP links with HTTPS ones ...
2020-08-03Merge tag 'core-rcu-2020-08-03' of ↵Linus Torvalds18-206/+1234
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes * tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (109 commits) torture: Remove obsolete "cd $KVM" torture: Avoid duplicate specification of qemu command torture: Dump ftrace at shutdown only if requested torture: Add kvm-tranform.sh script for qemu-cmd files torture: Add more tracing crib notes to kvm.sh torture: Improve diagnostic for KCSAN-incapable compilers torture: Correctly summarize build-only runs torture: Pass --kmake-arg to all make invocations rcutorture: Check for unwatched readers torture: Abstract out console-log error detection torture: Add a stop-run capability torture: Create qemu-cmd in --buildonly runs rcu/rcutorture: Replace 0 with false torture: Add --allcpus argument to the kvm.sh script torture: Remove whitespace from identify_qemu_vcpus output rcutorture: NULL rcu_torture_current earlier in cleanup code rcutorture: Handle non-statistic bang-string error messages torture: Set configfile variable to current scenario rcutorture: Add races with task-exit processing locktorture: Use true and false to assign to bool variables ...
2020-08-03Merge tag 'irq-urgent-2020-08-02' of ↵Linus Torvalds2-1/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "Fix a recent IRQ affinities regression, add in a missing debugfs printout that helps the debugging of IRQ affinity logic bugs, and fix a memory leak" * tag 'irq-urgent-2020-08-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/debugfs: Add missing irqchip flags genirq/affinity: Make affinity setting if activated opt-in irqdomain/treewide: Free firmware node after domain removal
2020-08-03Merge tag 'arm64-upstream' of ↵Linus Torvalds2-28/+14
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 and cross-arch updates from Catalin Marinas: "Here's a slightly wider-spread set of updates for 5.9. Going outside the usual arch/arm64/ area is the removal of read_barrier_depends() series from Will and the MSI/IOMMU ID translation series from Lorenzo. The notable arm64 updates include ARMv8.4 TLBI range operations and translation level hint, time namespace support, and perf. Summary: - Removal of the tremendously unpopular read_barrier_depends() barrier, which is a NOP on all architectures apart from Alpha, in favour of allowing architectures to override READ_ONCE() and do whatever dance they need to do to ensure address dependencies provide LOAD -> LOAD/STORE ordering. This work also offers a potential solution if compilers are shown to convert LOAD -> LOAD address dependencies into control dependencies (e.g. under LTO), as weakly ordered architectures will effectively be able to upgrade READ_ONCE() to smp_load_acquire(). The latter case is not used yet, but will be discussed further at LPC. - Make the MSI/IOMMU input/output ID translation PCI agnostic, augment the MSI/IOMMU ACPI/OF ID mapping APIs to accept an input ID bus-specific parameter and apply the resulting changes to the device ID space provided by the Freescale FSL bus. - arm64 support for TLBI range operations and translation table level hints (part of the ARMv8.4 architecture version). - Time namespace support for arm64. - Export the virtual and physical address sizes in vmcoreinfo for makedumpfile and crash utilities. - CPU feature handling cleanups and checks for programmer errors (overlapping bit-fields). - ACPI updates for arm64: disallow AML accesses to EFI code regions and kernel memory. - perf updates for arm64. - Miscellaneous fixes and cleanups, most notably PLT counting optimisation for module loading, recordmcount fix to ignore relocations other than R_AARCH64_CALL26, CMA areas reserved for gigantic pages on 16K and 64K configurations. - Trivial typos, duplicate words" Link: http://lkml.kernel.org/r/20200710165203.31284-1-will@kernel.org Link: http://lkml.kernel.org/r/20200619082013.13661-1-lorenzo.pieralisi@arm.com * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (82 commits) arm64: use IRQ_STACK_SIZE instead of THREAD_SIZE for irq stack arm64/mm: save memory access in check_and_switch_context() fast switch path arm64: sigcontext.h: delete duplicated word arm64: ptrace.h: delete duplicated word arm64: pgtable-hwdef.h: delete duplicated words bus: fsl-mc: Add ACPI support for fsl-mc bus/fsl-mc: Refactor the MSI domain creation in the DPRC driver of/irq: Make of_msi_map_rid() PCI bus agnostic of/irq: make of_msi_map_get_device_domain() bus agnostic dt-bindings: arm: fsl: Add msi-map device-tree binding for fsl-mc bus of/device: Add input id to of_dma_configure() of/iommu: Make of_map_rid() PCI agnostic ACPI/IORT: Add an input ID to acpi_dma_configure() ACPI/IORT: Remove useless PCI bus walk ACPI/IORT: Make iort_msi_map_rid() PCI agnostic ACPI/IORT: Make iort_get_device_domain IRQ domain agnostic ACPI/IORT: Make iort_match_node_callback walk the ACPI namespace for NC arm64: enable time namespace support arm64/vdso: Restrict splitting VVAR VMA arm64/vdso: Handle faults on timens page ...
2020-08-03Merge tag 'rm-unicore32' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux Pull unicore32 removal from Mike Rapoport: "Remove unicore32 support. The unicore32 port do not seem maintained for a long time now, there is no upstream toolchain that can create unicore32 binaries and all the links to prebuilt toolchains for unicore32 are dead. Even compilers that were available are not supported by the kernel anymore. Guenter Roeck says: "I have stopped building unicore32 images since v4.19 since there is no available compiler that is still supported by the kernel. I am surprised that support for it has not been removed from the kernel" However, it's worth pointing out two things: - Guan Xuetao is still listed as maintainer and asked for the port to be kept around the last time Arnd suggested removing it two years ago. He promised that there would be compiler sources (presumably llvm), but has not made those available since. - https://github.com/gxt has patches to linux-4.9 and qemu-2.7, both released in 2016, with patches dated early 2019. These patches mainly restore a syscall ABI that was never part of mainline Linux but apparently used in production. qemu-2.8 removed support for that ABI and newer kernels (4.19+) can no longer be built with the old toolchain, so apparently there will not be any future updates to that git tree" * tag 'rm-unicore32' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux: MAINTAINERS: remove "PKUNITY SOC DRIVERS" entry rtc: remove fb-puv3 driver video: fbdev: remove fb-puv3 driver pwm: remove pwm-puv3 driver input: i8042: remove support for 8042-unicore32io i2c/buses: remove i2c-puv3 driver cpufreq: remove unicore32 driver arch: remove unicore32 port
2020-08-03Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-blockLinus Torvalds2-32/+55
Pull core block updates from Jens Axboe: "Good amount of cleanups and tech debt removals in here, and as a result, the diffstat shows a nice net reduction in code. - Softirq completion cleanups (Christoph) - Stop using ->queuedata (Christoph) - Cleanup bd claiming (Christoph) - Use check_events, moving away from the legacy media change (Christoph) - Use inode i_blkbits consistently (Christoph) - Remove old unused writeback congestion bits (Christoph) - Cleanup/unify submission path (Christoph) - Use bio_uninit consistently, instead of bio_disassociate_blkg (Christoph) - sbitmap cleared bits handling (John) - Request merging blktrace event addition (Jan) - sysfs add/remove race fixes (Luis) - blk-mq tag fixes/optimizations (Ming) - Duplicate words in comments (Randy) - Flush deferral cleanup (Yufen) - IO context locking/retry fixes (John) - struct_size() usage (Gustavo) - blk-iocost fixes (Chengming) - blk-cgroup IO stats fixes (Boris) - Various little fixes" * tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits) block: blk-timeout: delete duplicated word block: blk-mq-sched: delete duplicated word block: blk-mq: delete duplicated word block: genhd: delete duplicated words block: elevator: delete duplicated word and fix typos block: bio: delete duplicated words block: bfq-iosched: fix duplicated word iocost_monitor: start from the oldest usage index iocost: Fix check condition of iocg abs_vdebt block: Remove callback typedefs for blk_mq_ops block: Use non _rcu version of list functions for tag_set_list blk-cgroup: show global disk stats in root cgroup io.stat blk-cgroup: make iostat functions visible to stat printing block: improve discard bio alignment in __blkdev_issue_discard() block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers block: defer flush request no matter whether we have elevator block: make blk_timeout_init() static block: remove retry loop in ioc_release_fn() block: remove unnecessary ioc nested locking block: integrate bd_start_claiming into __blkdev_get ...
2020-08-03Merge branch 'linus' of ↵Linus Torvalds1-141/+36
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Add support for allocating transforms on a specific NUMA Node - Introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY for storage users Algorithms: - Drop PMULL based ghash on arm64 - Fixes for building with clang on x86 - Add sha256 helper that does the digest in one go - Add SP800-56A rev 3 validation checks to dh Drivers: - Permit users to specify NUMA node in hisilicon/zip - Add support for i.MX6 in imx-rngc - Add sa2ul crypto driver - Add BA431 hwrng driver - Add Ingenic JZ4780 and X1000 hwrng driver - Spread IRQ affinity in inside-secure and marvell/cesa" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (157 commits) crypto: sa2ul - Fix inconsistent IS_ERR and PTR_ERR hwrng: core - remove redundant initialization of variable ret crypto: x86/curve25519 - Remove unused carry variables crypto: ingenic - Add hardware RNG for Ingenic JZ4780 and X1000 dt-bindings: RNG: Add Ingenic RNG bindings. crypto: caam/qi2 - add module alias crypto: caam - add more RNG hw error codes crypto: caam/jr - remove incorrect reference to caam_jr_register() crypto: caam - silence .setkey in case of bad key length crypto: caam/qi2 - create ahash shared descriptors only once crypto: caam/qi2 - fix error reporting for caam_hash_alloc crypto: caam - remove deadcode on 32-bit platforms crypto: ccp - use generic power management crypto: xts - Replace memcpy() invocation with simple assignment crypto: marvell/cesa - irq balance crypto: inside-secure - irq balance crypto: ecc - SP800-56A rev 3 local public key validation crypto: dh - SP800-56A rev 3 local public key validation crypto: dh - check validity of Z before export lib/mpi: Add mpi_sub_ui() ...
2020-08-03Merge tag 'irqchip-5.9' of ↵Thomas Gleixner2-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core Pull irqchip updates from Marc Zyngier: - Add infrastructure to allow DT irqchip platform drivers to be built as modules - Allow qcom-pdc, mtk-cirq and mtk-sysirq to be built as module - Fix ACPI probing to avoid abusing function pointer casting - Allow bcm7120-l2 and brcmstb-l2 to be used as wake-up sources - Teach NXP's IMX INTMUX some power management - Allow stm32-exti to be used as a hierarchical irqchip - Let stm32-exti use the hw spinlock API in its full glory - A couple of GICv4.1 fixes - Tons of cleanups (mtk-sysirq, aic5, bcm7038-l1, imx-intmux, brcmstb-l2, ativic32, ti-sci-inta, lonsoon, MIPS GIC, GICv3)
2020-08-03Merge branches 'pm-sleep', 'pm-domains', 'powercap' and 'pm-tools'Rafael J. Wysocki3-6/+6
* pm-sleep: PM: sleep: spread "const char *" correctness PM: hibernate: fix white space in a few places freezer: Add unsafe version of freezable_schedule_timeout_interruptible() for NFS PM: sleep: core: Emit changed uevent on wakeup_sysfs_add/remove * pm-domains: PM: domains: Restore comment indentation for generic_pm_domain.child_links PM: domains: Fix up terminology with parent/child * powercap: powercap: Add Power Limit4 support powercap: idle_inject: Replace play_idle() with play_idle_precise() in comments powercap: intel_rapl: add support for Sapphire Rapids * pm-tools: pm-graph v5.7 - important s2idle fixes cpupower: Replace HTTP links with HTTPS ones cpupower: Fix NULL but dereferenced coccicheck errors cpupower: Fix comparing pointer to 0 coccicheck warns
2020-08-03Merge branch 'pm-cpufreq'Rafael J. Wysocki1-5/+1
* pm-cpufreq: (24 commits) cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode cpufreq: intel_pstate: Rearrange the storing of new EPP values cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported cpufreq: intel_pstate: Clean up aperf_mperf_shift description cpufreq: powernv: Make some symbols static cpufreq: amd_freq_sensitivity: Mark sometimes used ID structs as __maybe_unused cpufreq: intel_pstate: Supply struct attribute description for get_aperf_mperf_shift() cpufreq: pcc-cpufreq: Mark sometimes used ID structs as __maybe_unused cpufreq: powernow-k8: Mark 'hi' and 'lo' dummy variables as __always_unused cpufreq: acpi-cpufreq: Mark sometimes used ID structs as __maybe_unused cpufreq: acpi-cpufreq: Mark 'dummy' variable as __always_unused cpufreq: powernv-cpufreq: Fix a bunch of kerneldoc related issues cpufreq: pasemi: Include header file for {check,restore}_astate prototypes cpufreq: cpufreq_governor: Demote store_sampling_rate() header to standard comment block cpufreq: cpufreq: Demote lots of function headers unworthy of kerneldoc status cpufreq: freq_table: Demote obvious misuse of kerneldoc to standard comment blocks cpufreq: Replace HTTP links with HTTPS ones cpufreq: intel_pstate: Fix static checker warning for epp variable cpufreq: Remove the weakly defined cpufreq_default_governor() cpufreq: Specify default governor on command line ...
2020-08-03Merge branches 'pm-em' and 'pm-core'Rafael J. Wysocki3-109/+203
* pm-em: OPP: refactor dev_pm_opp_of_register_em() and update related drivers Documentation: power: update Energy Model description PM / EM: change name of em_pd_energy to em_cpu_energy PM / EM: remove em_register_perf_domain PM / EM: add support for other devices than CPUs in Energy Model PM / EM: update callback structure and add device pointer PM / EM: introduce em_dev_register_perf_domain function PM / EM: change naming convention from 'capacity' to 'performance' * pm-core: mmc: jz4740: Use pm_ptr() macro PM: Make *_DEV_PM_OPS macros use __maybe_unused PM: core: introduce pm_ptr() macro
2020-08-03Merge branch 'locking/nmi' into locking/core, to pick up completed topic branchIngo Molnar1-1/+7
Signed-off-by: Ingo Molnar <mingo@kernel.org>