summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-12-01exit: Put an upper limit on how often we can oopsJann Horn1-0/+42
Many Linux systems are configured to not panic on oops; but allowing an attacker to oops the system **really** often can make even bugs that look completely unexploitable exploitable (like NULL dereferences and such) if each crash elevates a refcount by one or a lock is taken in read mode, and this causes a counter to eventually overflow. The most interesting counters for this are 32 bits wide (like open-coded refcounts that don't use refcount_t). (The ldsem reader count on 32-bit platforms is just 16 bits, but probably nobody cares about 32-bit platforms that much nowadays.) So let's panic the system if the kernel is constantly oopsing. The speed of oopsing 2^32 times probably depends on several factors, like how long the stack trace is and which unwinder you're using; an empirically important one is whether your console is showing a graphical environment or a text console that oopses will be printed to. In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run. (Adding more threads makes this faster, but the actual oops printing happens under &die_lock on x86, so you can maybe speed this up by a factor of around 2 and then any further improvement gets eaten up by lock contention.) It looks like it would take around 8-12 days to overflow a 32-bit counter with repeated oopsing on a multi-core X86 system running a graphical environment; both me (in an X86 VM) and Seth (with a distro kernel on normal hardware in a standard configuration) got numbers in that ballpark. 12 days aren't *that* short on a desktop system, and you'd likely need much longer on a typical server system (assuming that people don't run graphical desktop environments on their servers), and this is a *very* noisy and violent approach to exploiting the kernel; and it also seems to take orders of magnitude longer on some machines, probably because stuff like EFI pstore will slow it down a ton if that's active. Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
2022-12-01panic: Separate sysctl logic from CONFIG_SMPKees Cook1-1/+3
In preparation for adding more sysctls directly in kernel/panic.c, split CONFIG_SMP from the logic that adds sysctls. Cc: Petr Mladek <pmladek@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-1-keescook@chromium.org
2022-12-01block: bdev & blktrace: use consistent function doc. notationRandy Dunlap1-2/+2
Use only one hyphen in kernel-doc notation between the function name and its short description. The is the documented kerenl-doc format. It also fixes the HTML presentation to be consistent with other functions. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Link: https://lore.kernel.org/r/20221201070331.25685-1-rdunlap@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01clockevents: Repair kernel-doc for clockevent_delta2ns()Lukas Bulwahn1-1/+1
Since the introduction of clockevents, i.e., commit d316c57ff6bf ("clockevents: add core functionality"), there has been a mismatch between the function and the kernel-doc comment for clockevent_delta2ns(). Hence, ./scripts/kernel-doc -none kernel/time/clockevents.c warns about it. Adjust the kernel-doc comment for clockevent_delta2ns() for make W=1 happiness. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221102091048.15068-1-lukas.bulwahn@gmail.com
2022-12-01printk: use strscpy() to instead of strlcpy()Xu Panda1-1/+1
The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL terminated strings. Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Signed-off-by: Yang Yang <yang.yang29@zte.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/202211301601416229001@zte.com.cn
2022-12-01vdso/timens: Refactor copy-pasted find_timens_vvar_page() helper into one copyJann Horn1-0/+18
find_timens_vvar_page() is not architecture-specific, as can be seen from how all five per-architecture versions of it are the same. (arm64, powerpc and riscv are exactly the same; x86 and s390 have two characters difference inside a comment, less blank lines, and mark the !CONFIG_TIME_NS version as inline.) Refactor the five copies into a central copy in kernel/time/namespace.c. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20221130115320.2918447-1-jannh@google.com
2022-11-30bpf: Fix a compilation failure with clang lto buildYonghong Song3-7/+4
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the following compilation error will appear: $ make LLVM=1 LLVM_IAS=1 -j ... ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined cgroup_storage_map_btf_ids:; ^ make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1 In local_storage.c, we have BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map) Commit c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE definition in bpf_cgrp_storage.c. With duplicated definitions, llvm linker complains with lto build. Also, extracting btf_id of 'struct bpf_local_storage_map' is defined four times for sk, inode, task and cgrp local storages. Let us define a single global one with a different name than cgroup_storage_map_btf_ids, which also fixed the lto compilation error. Fixes: c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
2022-11-30kernel/user: Allow user_struct::locked_vm to be usable for iommufdJason Gunthorpe1-0/+1
Following the pattern of io_uring, perf, skb, and bpf, iommfd will use user->locked_vm for accounting pinned pages. Ensure the value is included in the struct and export free_uid() as iommufd is modular. user->locked_vm is the good accounting to use for ulimit because it is per-user, and the security sandboxing of locked pages is not supposed to be per-process. Other places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or mm->locked_vm for accounting pinned pages, but this is only per-process and inconsistent with the new FOLL_LONGTERM users in the kernel. Concurrent work is underway to try to put this in a cgroup, so everything can be consistent and the kernel can provide a FOLL_LONGTERM limit that actually provides security. Link: https://lore.kernel.org/r/7-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-30acct: fix potential integer overflow in encode_comp_t()Zheng Yejian1-0/+2
The integer overflow is descripted with following codes: > 317 static comp_t encode_comp_t(u64 value) > 318 { > 319 int exp, rnd; ...... > 341 exp <<= MANTSIZE; > 342 exp += value; > 343 return exp; > 344 } Currently comp_t is defined as type of '__u16', but the variable 'exp' is type of 'int', so overflow would happen when variable 'exp' in line 343 is greater than 65535. Link: https://lkml.kernel.org/r/20210515140631.369106-3-zhengyejian1@huawei.com Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Jinhao <zhangjinhao2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30acct: fix accuracy loss for input value of encode_comp_t()Zheng Yejian1-2/+2
Patch series "Fix encode_comp_t()". Type conversion in encode_comp_t() may look a bit problematic. This patch (of 2): See calculation of ac_{u,s}time in fill_ac(): > ac->ac_utime = encode_comp_t(nsec_to_AHZ(pacct->ac_utime)); > ac->ac_stime = encode_comp_t(nsec_to_AHZ(pacct->ac_stime)); Return value of nsec_to_AHZ() is always type of 'u64', but it is handled as type of 'unsigned long' in encode_comp_t, and accuracy loss would happen on 32-bit platform when 'unsigned long' value is 32-bit-width. So 'u64' value of encode_comp_t() may look better. Link: https://lkml.kernel.org/r/20210515140631.369106-1-zhengyejian1@huawei.com Link: https://lkml.kernel.org/r/20210515140631.369106-2-zhengyejian1@huawei.com Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Randy Dunlap <rdunlap@infradead.org> # build-tested Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Jinhao <zhangjinhao2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30vmcoreinfo: warn if we exceed vmcoreinfo data sizeStephen Brennan1-0/+3
Though vmcoreinfo is intended to be small, at just one page, useful information is still added to it, so we risk running out of space. Currently there is no runtime check to see whether the vmcoreinfo buffer has been exhausted. Add a warning for this case. Currently, my static checking tool[1] indicates that a good upper bound for vmcoreinfo size is currently 3415 bytes, but the best time to add warnings is before the risk becomes too high. [1] https://github.com/brenns10/kernel_stuff/blob/master/vmcoreinfosize/vmcoreinfosize.py Link: https://lkml.kernel.org/r/20221027205008.312534-1-stephen.s.brennan@oracle.com Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30percpu_counter: add percpu_counter_sum_all interfaceShakeel Butt1-0/+5
The percpu_counter is used for scenarios where performance is more important than the accuracy. For percpu_counter users, who want more accurate information in their slowpath, percpu_counter_sum is provided which traverses all the online CPUs to accumulate the data. The reason it only needs to traverse online CPUs is because percpu_counter does implement CPU offline callback which syncs the local data of the offlined CPU. However there is a small race window between the online CPUs traversal of percpu_counter_sum and the CPU offline callback. The offline callback has to traverse all the percpu_counters on the system to flush the CPU local data which can be a lot. During that time, the CPU which is going offline has already been published as offline to all the readers. So, as the offline callback is running, percpu_counter_sum can be called for one counter which has some state on the CPU going offline. Since percpu_counter_sum only traverses online CPUs, it will skip that specific CPU and the offline callback might not have flushed the state for that specific percpu_counter on that offlined CPU. Normally this is not an issue because percpu_counter users can deal with some inaccuracy for small time window. However a new user i.e. mm_struct on the cleanup path wants to check the exact state of the percpu_counter through check_mm(). For such users, this patch introduces percpu_counter_sum_all() which traverses all possible CPUs and it is used in fork.c:check_mm() to avoid the potential race. This issue is exposed by the later patch "mm: convert mm's rss stats into percpu_counter". Link: https://lkml.kernel.org/r/20221109012011.881058-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: convert mm's rss stats into percpu_counterShakeel Butt1-1/+15
Currently mm_struct maintains rss_stats which are updated on page fault and the unmapping codepaths. For page fault codepath the updates are cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. The reason for caching is performance for multithreaded applications otherwise the rss_stats updates may become hotspot for such applications. However this optimization comes with the cost of error margin in the rss stats. The rss_stats for applications with large number of threads can be very skewed. At worst the error margin is (nr_threads * 64) and we have a lot of applications with 100s of threads, so the error margin can be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. Recently we started seeing the unbounded errors for rss_stats for specific applications which use TCP rx0cp. It seems like vm_insert_pages() codepath does not sync rss_stats at all. This patch converts the rss_stats into percpu_counter to convert the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). However this conversion enable us to get the accurate stats for situations where accuracy is more important than the cpu cost. This patch does not make such tradeoffs - we can just use percpu_counter_add_local() for the updates and percpu_counter_sum() (or percpu_counter_sync() + percpu_counter_read) for the readers. At the moment the readers are either procfs interface, oom_killer and memory reclaim which I think are not performance critical and should be ok with slow read. However I think we can make that change in a separate patch. Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30bpf: Tighten ptr_to_btf_id checks.Alexei Starovoitov1-3/+14
The networking programs typically don't require CAP_PERFMON, but through kfuncs like bpf_cast_to_kern_ctx() they can access memory through PTR_TO_BTF_ID. In such case enforce CAP_PERFMON. Also make sure that only GPL programs can access kernel data structures. All kfuncs require GPL already. Also remove allow_ptr_to_map_access. It's the same as allow_ptr_leaks and different name for the same check only causes confusion. Fixes: fd264ca02094 ("bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx") Fixes: 50c6b8a9aea2 ("selftests/bpf: Add a test for btf_type_tag "percpu"") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221125220617.26846-1-alexei.starovoitov@gmail.com
2022-12-01mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATEDVlastimil Babka1-2/+3
As explained in [1], we would like to remove SLOB if possible. - There are no known users that need its somewhat lower memory footprint so much that they cannot handle SLUB (after some modifications by the previous patches) instead. - It is an extra maintenance burden, and a number of features are incompatible with it. - It blocks the API improvement of allowing kfree() on objects allocated via kmem_cache_alloc(). As the first step, rename the CONFIG_SLOB option in the slab allocator configuration choice to CONFIG_SLOB_DEPRECATED. Add CONFIG_SLOB depending on CONFIG_SLOB_DEPRECATED as an internal option to avoid code churn. This will cause existing .config files and defconfigs with CONFIG_SLOB=y to silently switch to the default (and recommended replacement) SLUB, while still allowing SLOB to be configured by anyone that notices and needs it. But those should contact the slab maintainers and linux-mm@kvack.org as explained in the updated help. With no valid objections, the plan is to update the existing defconfigs to SLUB and remove SLOB in a few cycles. To make SLUB more suitable replacement for SLOB, a CONFIG_SLUB_TINY option was introduced to limit SLUB's memory overhead. There is a number of defconfigs specifying CONFIG_SLOB=y. As part of this patch, update them to select CONFIG_SLUB and CONFIG_SLUB_TINY. [1] https://lore.kernel.org/all/b35c3f82-f67b-2103-7d82-7a7ba7521439@suse.cz/ Cc: Russell King <linux@armlinux.org.uk> Cc: Aaro Koskinen <aaro.koskinen@iki.fi> Cc: Janusz Krzysztofik <jmkrzyszt@gmail.com> Cc: Tony Lindgren <tony@atomide.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Conor Dooley <conor@kernel.org> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Aaro Koskinen <aaro.koskinen@iki.fi> # OMAP1 Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> # riscv k210 Acked-by: Arnd Bergmann <arnd@arndb.de> # arm Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
2022-11-30Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton1-0/+2
2022-11-30Merge branches 'doc.2022.10.20a', 'fixes.2022.10.21a', 'lazy.2022.11.30a', ↵Paul E. McKenney14-153/+545
'srcunmisafe.2022.11.09a', 'torture.2022.10.18c' and 'torturescript.2022.10.20a' into HEAD doc.2022.10.20a: Documentation updates. fixes.2022.10.21a: Miscellaneous fixes. lazy.2022.11.30a: Lazy call_rcu() and NOCB updates. srcunmisafe.2022.11.09a: NMI-safe SRCU readers. torture.2022.10.18c: Torture-test updates. torturescript.2022.10.20a: Torture-test scripting updates.
2022-11-30workqueue: Make queue_rcu_work() use call_rcu_hurry()Uladzislau Rezki1-1/+1
Earlier commits in this series allow battery-powered systems to build their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option. This Kconfig option causes call_rcu() to delay its callbacks in order to batch them. This means that a given RCU grace period covers more callbacks, thus reducing the number of grace periods, in turn reducing the amount of energy consumed, which increases battery lifetime which can be a very good thing. This is not a subtle effect: In some important use cases, the battery lifetime is increased by more than 10%. This CONFIG_RCU_LAZY=y option is available only for CPUs that offload callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y. Delaying callbacks is normally not a problem because most callbacks do nothing but free memory. If the system is short on memory, a shrinker will kick all currently queued lazy callbacks out of their laziness, thus freeing their memory in short order. Similarly, the rcu_barrier() function, which blocks until all currently queued callbacks are invoked, will also kick lazy callbacks, thus enabling rcu_barrier() to complete in a timely manner. However, there are some cases where laziness is not a good option. For example, synchronize_rcu() invokes call_rcu(), and blocks until the newly queued callback is invoked. It would not be a good for synchronize_rcu() to block for ten seconds, even on an idle system. Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a given CPU kicks any lazy callbacks that might be already queued on that CPU. After all, if there is going to be a grace period, all callbacks might as well get full benefit from it. Yes, this could be done the other way around by creating a call_rcu_lazy(), but earlier experience with this approach and feedback at the 2022 Linux Plumbers Conference shifted the approach to call_rcu() being lazy with call_rcu_hurry() for the few places where laziness is inappropriate. And another call_rcu() instance that cannot be lazy is the one in queue_rcu_work(), given that callers to queue_rcu_work() are not necessarily OK with long delays. Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert to the old behavior. [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Signed-off-by: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-30notifier: repair slips in kernel-doc commentsLukas Bulwahn1-3/+3
Invoking ./scripts/kernel-doc -none kernel/notifier.c warns: kernel/notifier.c:71: warning: Excess function parameter 'returns' description in 'notifier_call_chain' kernel/notifier.c:119: warning: Function parameter or member 'v' not described in 'notifier_call_chain_robust' These two warning are easy to fix, as they are just due to some minor slips that makes the comment not follow kernel-doc's syntactic expectation. Fix those minor slips in kernel-doc comments for make W=1 happiness. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-11-30Revert "blk-cgroup: Flush stats at blkgs destruction path"Jens Axboe1-20/+0
This reverts commit dae590a6c96c799434e0ff8156ef29b88c257e60. We've had a few reports on this causing a crash at boot time, because of a reference issue. While this problem seemginly did exist before the patch and needs solving separately, this patch makes it a lot easier to trigger. Link: https://lore.kernel.org/linux-block/CA+QYu4oxiRKC6hJ7F27whXy-PRBx=Tvb+-7TQTONN8qTtV3aDA@mail.gmail.com/ Link: https://lore.kernel.org/linux-block/69af7ccb-6901-c84c-0e95-5682ccfb750c@acm.org/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30genirq/irqdesc: Don't try to remove non-existing sysfs filesYang Yingliang2-6/+11
Fault injection tests trigger warnings like this: kernfs: can not remove 'chip_name', no directory WARNING: CPU: 0 PID: 253 at fs/kernfs/dir.c:1616 kernfs_remove_by_name_ns+0xce/0xe0 RIP: 0010:kernfs_remove_by_name_ns+0xce/0xe0 Call Trace: <TASK> remove_files.isra.1+0x3f/0xb0 sysfs_remove_group+0x68/0xe0 sysfs_remove_groups+0x41/0x70 __kobject_del+0x45/0xc0 kobject_del+0x29/0x40 free_desc+0x42/0x70 irq_free_descs+0x5e/0x90 The reason is that the interrupt descriptor sysfs handling does not roll back on a failing kobject_add() during allocation. If the descriptor is freed later on, kobject_del() is invoked with a not added kobject resulting in the above warnings. A proper rollback in case of a kobject_add() failure would be the straight forward solution. But this is not possible due to the way how interrupt descriptor sysfs handling works. Interrupt descriptors are allocated before sysfs becomes available. So the sysfs files for the early allocated descriptors are added later in the boot process. At this point there can be nothing useful done about a failing kobject_add(). For consistency the interrupt descriptor allocation always treats kobject_add() failures as non-critical and just emits a warning. To solve this problem, keep track in the interrupt descriptor whether kobject_add() was successful or not and make the invocation of kobject_del() conditional on that. [ tglx: Massage changelog, comments and use a state bit. ] Fixes: ecb3f394c5db ("genirq: Expose interrupt information through sysfs") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20221128151612.1786122-1-yangyingliang@huawei.com
2022-11-29rcu: Make SRCU mandatoryPaul E. McKenney4-24/+16
Kernels configured with CONFIG_PRINTK=n and CONFIG_SRCU=n get build failures. This causes trouble for deep embedded systems. But given that there are more than 25 instances of "select SRCU" in the kernel, it is hard to believe that there are many kernels running in production without SRCU. This commit therefore makes SRCU mandatory. The SRCU Kconfig option remains for backwards compatibility, and will be removed when it is no longer used. [ paulmck: Update per kernel test robot feedback. ] Reported-by: John Ogness <john.ogness@linutronix.de> Reported-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: <linux-arch@vger.kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Reviewed-by: John Ogness <john.ogness@linutronix.de>
2022-11-29rcu/rcutorture: Use call_rcu_hurry() where neededJoel Fernandes (Google)1-8/+8
call_rcu() changes to save power will change the behavior of rcutorture tests. Use the call_rcu_hurry() API instead which reverts to the old behavior. [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29rcu/rcuscale: Use call_rcu_hurry() for async reader testJoel Fernandes (Google)1-1/+1
rcuscale uses call_rcu() to queue async readers. With recent changes to save power, the test will have fewer async readers in flight. Use the call_rcu_hurry() API instead to revert to the old behavior. [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29rcu/sync: Use call_rcu_hurry() instead of call_rcuJoel Fernandes (Google)1-1/+1
call_rcu() changes to save power will slow down rcu sync. Use the call_rcu_hurry() API instead which reverts to the old behavior. [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29rcuscale: Add laziness and kfree testsJoel Fernandes (Google)1-2/+65
This commit adds 2 tests to rcuscale. The first one is a startup test to check whether we are not too lazy or too hard working. The second one causes kfree_rcu() itself to use call_rcu() and checks memory pressure. Testing indicates that the new call_rcu() keeps memory pressure under control roughly as well as does kfree_rcu(). [ paulmck: Apply checkpatch feedback. ] Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29rcu: Shrinker for lazy rcuVineeth Pillai1-0/+52
The shrinker is used to speed up the free'ing of memory potentially held by RCU lazy callbacks. RCU kernel module test cases show this to be effective. Test is introduced in a later patch. Signed-off-by: Vineeth Pillai <vineeth@bitbyteword.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()Joel Fernandes (Google)1-8/+9
This consolidates the code a bit and makes it cleaner. Functionally it is the same. Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29rcu: Make call_rcu() lazy to save powerJoel Fernandes (Google)7-82/+237
Implement timer-based RCU callback batching (also known as lazy callbacks). With this we save about 5-10% of power consumed due to RCU requests that happen when system is lightly loaded or idle. By default, all async callbacks (queued via call_rcu) are marked lazy. An alternate API call_rcu_hurry() is provided for the few users, for example synchronize_rcu(), that need the old behavior. The batch is flushed whenever a certain amount of time has passed, or the batch on a particular CPU grows too big. Also memory pressure will flush it in a future patch. To handle several corner cases automagically (such as rcu_barrier() and hotplug), we re-use bypass lists which were originally introduced to address lock contention, to handle lazy CBs as well. The bypass list length has the lazy CB length included in it. A separate lazy CB length counter is also introduced to keep track of the number of lazy CBs. [ paulmck: Fix formatting of inline call_rcu_lazy() definition. ] [ paulmck: Apply Zqiang feedback. ] [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Suggested-by: Paul McKenney <paulmck@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski22-115/+263
tools/lib/bpf/ringbuf.c 927cbb478adf ("libbpf: Handle size overflow for ringbuf mmap") b486d19a0ab0 ("libbpf: checkpatch: Fixed code alignments in ringbuf.c") https://lore.kernel.org/all/20221121122707.44d1446a@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29Merge tag 'net-6.1-rc8-2' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, can and wifi. Current release - new code bugs: - eth: mlx5e: - use kvfree() in mlx5e_accel_fs_tcp_create() - MACsec, fix RX data path 16 RX security channel limit - MACsec, fix memory leak when MACsec device is deleted - MACsec, fix update Rx secure channel active field - MACsec, fix add Rx security association (SA) rule memory leak Previous releases - regressions: - wifi: cfg80211: don't allow multi-BSSID in S1G - stmmac: set MAC's flow control register to reflect current settings - eth: mlx5: - E-switch, fix duplicate lag creation - fix use-after-free when reverting termination table Previous releases - always broken: - ipv4: fix route deletion when nexthop info is not specified - bpf: fix a local storage BPF map bug where the value's spin lock field can get initialized incorrectly - tipc: re-fetch skb cb after tipc_msg_validate - wifi: wilc1000: fix Information Element parsing - packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE - sctp: fix memory leak in sctp_stream_outq_migrate() - can: can327: fix potential skb leak when netdev is down - can: add number of missing netdev freeing on error paths - aquantia: do not purge addresses when setting the number of rings - wwan: iosm: - fix incorrect skb length leading to truncated packet - fix crash in peek throughput test due to skb UAF" * tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits) net: ethernet: renesas: ravb: Fix promiscuous mode after system resumed MAINTAINERS: Update maintainer list for chelsio drivers ionic: update MAINTAINERS entry sctp: fix memory leak in sctp_stream_outq_migrate() packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE net/mlx5: Lag, Fix for loop when checking lag Revert "net/mlx5e: MACsec, remove replay window size limitation in offload path" net: marvell: prestera: Fix a NULL vs IS_ERR() check in some functions net: tun: Fix use-after-free in tun_detach() net: mdiobus: fix unbalanced node reference count net: hsr: Fix potential use-after-free tipc: re-fetch skb cb after tipc_msg_validate mptcp: fix sleep in atomic at close time mptcp: don't orphan ssk in mptcp_close() dsa: lan9303: Correct stat name ipv4: Fix route deletion when nexthop info is not specified net: wwan: iosm: fix incorrect skb length net: wwan: iosm: fix crash in peek throughput test net: wwan: iosm: fix dma_alloc_coherent incompatible pointer type net: wwan: iosm: fix kernel test robot reported error ...
2022-11-29perf: Fix perf_pending_task() UaFPeter Zijlstra1-4/+13
Per syzbot it is possible for perf_pending_task() to run after the event is free()'d. There are two related but distinct cases: - the task_work was already queued before destroying the event; - destroying the event itself queues the task_work. The first cannot be solved using task_work_cancel() since perf_release() itself might be called from a task_work (____fput), which means the current->task_works list is already empty and task_work_cancel() won't be able to find the perf_pending_task() entry. The simplest alternative is extending the perf_event lifetime to cover the task_work. The second is just silly, queueing a task_work while you know the event is going away makes no sense and is easily avoided by re-arranging how the event is marked STATE_DEAD and ensuring it goes through STATE_OFF on the way down. Reported-by: syzbot+9228d6098455bb209ec8@syzkaller.appspotmail.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com>
2022-11-28Daniel Borkmann says:Jakub Kicinski14-577/+2401
==================== bpf-next 2022-11-25 We've added 101 non-merge commits during the last 11 day(s) which contain a total of 109 files changed, 8827 insertions(+), 1129 deletions(-). The main changes are: 1) Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF, from Kumar Kartikeya Dwivedi. 2) Add bpf_rcu_read_{,un}lock() support for sleepable programs, from Yonghong Song. 3) Add support storing struct task_struct objects as kptrs in maps, from David Vernet. 4) Batch of BPF map documentation improvements, from Maryam Tahhan and Donald Hunter. 5) Improve BPF verifier to propagate nullness information for branches of register to register comparisons, from Eduard Zingerman. 6) Fix cgroup BPF iter infra to hold reference on the start cgroup, from Hou Tao. 7) Fix BPF verifier to not mark fentry/fexit program arguments as trusted given it is not the case for them, from Alexei Starovoitov. 8) Improve BPF verifier's realloc handling to better play along with dynamic runtime analysis tools like KASAN and friends, from Kees Cook. 9) Remove legacy libbpf mode support from bpftool, from Sahid Orentino Ferdjaoui. 10) Rework zero-len skb redirection checks to avoid potentially breaking existing BPF test infra users, from Stanislav Fomichev. 11) Two small refactorings which are independent and have been split out of the XDP queueing RFC series, from Toke Høiland-Jørgensen. 12) Fix a memory leak in LSM cgroup BPF selftest, from Wang Yufen. 13) Documentation on how to run BPF CI without patch submission, from Daniel Müller. Signed-off-by: Jakub Kicinski <kuba@kernel.org> ==================== Link: https://lore.kernel.org/r/20221125012450.441-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-28Merge tag 'trace-v6.1-rc6' of ↵Linus Torvalds8-11/+35
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix osnoise duration type to 64bit not 32bit - Have histogram triggers be able to handle an unexpected NULL pointer for the record event, which can happen when the histogram first starts up - Clear out ring buffers when dynamic events are removed, as the type that is saved in the ring buffer is used to read the event, and a stale type that is reused by another event could cause use after free issues - Trivial comment fix - Fix memory leak in user_event_create() * tag 'trace-v6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Free buffers when a used dynamic event is removed tracing: Add tracing_reset_all_online_cpus_unlocked() function tracing: Fix race where histograms can be called before the event tracing/osnoise: Fix duration type tracing/user_events: Fix memory leak in user_event_create() tracing/hist: add in missing * in comment blocks
2022-11-28tracing/user_events: Fix call print_fmt leakBeau Belgrave1-0/+1
If user_event_trace_register() fails within user_event_parse() the call's print_fmt member is not freed. Add kfree call to fix this. Link: https://lkml.kernel.org/r/20221123183248.554-1-beaub@linux.microsoft.com Fixes: aa3b2b4c6692 ("user_events: Add print_fmt generation support for basic types") Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-28kprobes: Fix check for probe enabled in kill_kprobe()Li Huafei1-8/+8
In kill_kprobe(), the check whether disarm_kprobe_ftrace() needs to be called always fails. This is because before that we set the KPROBE_FLAG_GONE flag for kprobe so that "!kprobe_disabled(p)" is always false. The disarm_kprobe_ftrace() call introduced by commit: 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") to fix the NULL pointer reference problem. When the probe is enabled, if we do not disarm it, this problem still exists. Fix it by putting the probe enabled check before setting the KPROBE_FLAG_GONE flag. Link: https://lore.kernel.org/all/20221126114316.201857-1-lihuafei1@huawei.com/ Fixes: 3031313eb3d54 ("kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()") Signed-off-by: Li Huafei <lihuafei1@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2022-11-27Merge tag 'perf_urgent_for_v6.1_rc7' of ↵Linus Torvalds1-2/+25
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: "Two more fixes to the perf sigtrap handling: - output the address in the sample only when it has been requested - handle the case where user-only events can hit in kernel and thus upset the sigtrap sanity checking" * tag 'perf_urgent_for_v6.1_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Consider OS filter fail perf: Fixup SIGTRAP and sample_flags interaction
2022-11-25Merge tag 'pm-6.1-rc7' of ↵Linus Torvalds1-15/+15
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These revert a recent change in the schedutil cpufreq governor that had not been expected to make any functional difference, but turned out to introduce a performance regression, fix an initialization issue in the amd-pstate driver and make it actually replace the venerable ACPI cpufreq driver on the supported systems by default. Specifics: - Revert a recent schedutil cpufreq governor change that introduced a performace regression on Pixel 6 (Sam Wu) - Fix amd-pstate driver initialization after running the kernel via kexec (Wyes Karny) - Turn amd-pstate into a built-in driver which allows it to take precedence over acpi-cpufreq by default on supported systems and amend it with a mechanism to disable this behavior (Perry Yuan) - Update amd-pstate documentation in accordance with the other changes made to it (Perry Yuan)" * tag 'pm-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: Documentation: add amd-pstate kernel command line options Documentation: amd-pstate: add driver working mode introduction cpufreq: amd-pstate: add amd-pstate driver parameter for mode selection cpufreq: amd-pstate: change amd-pstate driver to be built-in type cpufreq: amd-pstate: cpufreq: amd-pstate: reset MSR_AMD_PERF_CTL register at init Revert "cpufreq: schedutil: Move max CPU capacity to sugov_policy"
2022-11-25Merge tag 'mm-hotfixes-stable-2022-11-24' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "24 MM and non-MM hotfixes. 8 marked cc:stable and 16 for post-6.0 issues. There have been a lot of hotfixes this cycle, and this is quite a large batch given how far we are into the -rc cycle. Presumably a reflection of the unusually large amount of MM material which went into 6.1-rc1" * tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits) test_kprobes: fix implicit declaration error of test_kprobes nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1 mm: fix unexpected changes to {failslab|fail_page_alloc}.attr swapfile: fix soft lockup in scan_swap_map_slots hugetlb: fix __prep_compound_gigantic_page page flag setting kfence: fix stack trace pruning proc/meminfo: fix spacing in SecPageTables mm: multi-gen LRU: retry folios written back while isolated mailmap: update email address for Satya Priya mm/migrate_device: return number of migrating pages in args->cpages kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible MAINTAINERS: update Alex Hung's email address mailmap: update Alex Hung's email address mm: mmap: fix documentation for vma_mas_szero mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed mm/memory: return vm_fault_t result from migrate_to_ram() callback mm: correctly charge compressed memory to its memcg ipc/shm: call underlying open/close vm_ops gcov: clang: fix the buffer overflow issue ...
2022-11-25use less confusing names for iov_iter direction initializersAl Viro1-1/+1
READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25padata: Fix list iterator in padata_do_serial()Daniel Jordan1-2/+5
list_for_each_entry_reverse() assumes that the iterated list is nonempty and that every list_head is embedded in the same type, but its use in padata_do_serial() breaks both rules. This doesn't cause any issues now because padata_priv and padata_list happen to have their list fields at the same offset, but we really shouldn't be relying on that. Fixes: bfde23ce200e ("padata: unbind parallel jobs from specific CPUs") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-11-25padata: Always leave BHs disabled when running ->parallel()Daniel Jordan1-3/+5
A deadlock can happen when an overloaded system runs ->parallel() in the context of the current task: padata_do_parallel ->parallel() pcrypt_aead_enc/dec padata_do_serial spin_lock(&reorder->lock) // BHs still enabled <interrupt> ... __do_softirq ... padata_do_serial spin_lock(&reorder->lock) It's a bug for BHs to be on in _do_serial as Steffen points out, so ensure they're off in the "current task" case like they are in padata_parallel_worker to avoid this situation. Reported-by: syzbot+bc05445bc14148d51915@syzkaller.appspotmail.com Fixes: 4611ce224688 ("padata: allocate work structures for parallel jobs from a pool") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-11-24bpf: Don't mark arguments to fentry/fexit programs as trusted.Alexei Starovoitov2-6/+13
The PTR_TRUSTED flag should only be applied to pointers where the verifier can guarantee that such pointers are valid. The fentry/fexit/fmod_ret programs are not in this category. Only arguments of SEC("tp_btf") and SEC("iter") programs are trusted (which have BPF_TRACE_RAW_TP and BPF_TRACE_ITER attach_type correspondingly) This bug was masked because convert_ctx_accesses() was converting trusted loads into BPF_PROBE_MEM loads. Fix it as well. The loads from trusted pointers don't need exception handling. Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20221124215314.55890-1-alexei.starovoitov@gmail.com
2022-11-24bpf: Add kfunc bpf_rcu_read_lock/unlock()Yonghong Song3-29/+148
Add two kfunc's bpf_rcu_read_lock() and bpf_rcu_read_unlock(). These two kfunc's can be used for all program types. The following is an example about how rcu pointer are used w.r.t. bpf_rcu_read_lock()/bpf_rcu_read_unlock(). struct task_struct { ... struct task_struct *last_wakee; struct task_struct __rcu *real_parent; ... }; Let us say prog does 'task = bpf_get_current_task_btf()' to get a 'task' pointer. The basic rules are: - 'real_parent = task->real_parent' should be inside bpf_rcu_read_lock region. This is to simulate rcu_dereference() operation. The 'real_parent' is marked as MEM_RCU only if (1). task->real_parent is inside bpf_rcu_read_lock region, and (2). task is a trusted ptr. So MEM_RCU marked ptr can be 'trusted' inside the bpf_rcu_read_lock region. - 'last_wakee = real_parent->last_wakee' should be inside bpf_rcu_read_lock region since it tries to access rcu protected memory. - the ptr 'last_wakee' will be marked as PTR_UNTRUSTED since in general it is not clear whether the object pointed by 'last_wakee' is valid or not even inside bpf_rcu_read_lock region. The verifier will reset all rcu pointer register states to untrusted at bpf_rcu_read_unlock() kfunc call site, so any such rcu pointer won't be trusted any more outside the bpf_rcu_read_lock() region. The current implementation does not support nested rcu read lock region in the prog. Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221124053217.2373910-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-24bpf: Introduce might_sleep field in bpf_func_protoYonghong Song4-4/+13
Introduce bpf_func_proto->might_sleep to indicate a particular helper might sleep. This will make later check whether a helper might be sleepable or not easier. Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221124053211.2373553-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-24timers: Provide timer_shutdown[_sync]()Thomas Gleixner1-0/+66
Tearing down timers which have circular dependencies to other functionality, e.g. workqueues, where the timer can schedule work and work can arm timers, is not trivial. In those cases it is desired to shutdown the timer in a way which prevents rearming of the timer. The mechanism to do so is to set timer->function to NULL and use this as an indicator for the timer arming functions to ignore the (re)arm request. Expose new interfaces for this: timer_shutdown_sync() and timer_shutdown(). timer_shutdown_sync() has the same functionality as timer_delete_sync() plus the NULL-ification of the timer function. timer_shutdown() has the same functionality as timer_delete() plus the NULL-ification of the timer function. In both cases the rearming of the timer is prevented by silently discarding rearm attempts due to timer->function being NULL. Co-developed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org Link: https://lore.kernel.org/r/20221123201625.314230270@linutronix.de
2022-11-24timers: Add shutdown mechanism to the internal functionsThomas Gleixner1-8/+54
Tearing down timers which have circular dependencies to other functionality, e.g. workqueues, where the timer can schedule work and work can arm timers, is not trivial. In those cases it is desired to shutdown the timer in a way which prevents rearming of the timer. The mechanism to do so is to set timer->function to NULL and use this as an indicator for the timer arming functions to ignore the (re)arm request. Add a shutdown argument to the relevant internal functions which makes the actual deactivation code set timer->function to NULL which in turn prevents rearming of the timer. Co-developed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org Link: https://lore.kernel.org/r/20221123201625.253883224@linutronix.de
2022-11-24timers: Split [try_to_]del_timer[_sync]() to prepare for shutdown modeThomas Gleixner1-51/+92
Tearing down timers which have circular dependencies to other functionality, e.g. workqueues, where the timer can schedule work and work can arm timers, is not trivial. In those cases it is desired to shutdown the timer in a way which prevents rearming of the timer. The mechanism to do so is to set timer->function to NULL and use this as an indicator for the timer arming functions to ignore the (re)arm request. Split the inner workings of try_do_del_timer_sync(), del_timer_sync() and del_timer() into helper functions to prepare for implementing the shutdown functionality. No functional change. Co-developed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org Link: https://lore.kernel.org/r/20221123201625.195147423@linutronix.de
2022-11-24timers: Silently ignore timers with a NULL functionThomas Gleixner1-5/+52
Tearing down timers which have circular dependencies to other functionality, e.g. workqueues, where the timer can schedule work and work can arm timers, is not trivial. In those cases it is desired to shutdown the timer in a way which prevents rearming of the timer. The mechanism to do so is to set timer->function to NULL and use this as an indicator for the timer arming functions to ignore the (re)arm request. In preparation for that replace the warnings in the relevant code paths with checks for timer->function == NULL. If the pointer is NULL, then discard the rearm request silently. Add debug_assert_init() instead of the WARN_ON_ONCE(!timer->function) checks so that debug objects can warn about non-initialized timers. The warning of debug objects does not warn if timer->function == NULL. It warns when timer was not initialized using timer_setup[_on_stack]() or via DEFINE_TIMER(). If developers fail to enable debug objects and then waste lots of time to figure out why their non-initialized timer is not firing, they deserve it. Same for initializing a timer with a NULL function. Co-developed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org Link: https://lore.kernel.org/r/87wn7kdann.ffs@tglx
2022-11-24timers: Rename del_timer() to timer_delete()Thomas Gleixner1-3/+3
The timer related functions do not have a strict timer_ prefixed namespace which is really annoying. Rename del_timer() to timer_delete() and provide del_timer() as a wrapper. Document that del_timer() is not for new code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/20221123201625.015535022@linutronix.de