summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-12-13Merge tag 'audit-pr-20221212' of ↵Linus Torvalds1-36/+39
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Two performance oriented patches for the audit subsystem: one consolidates similar code to gain some caching advantages, while the other stores a value in a stack variable to avoid repeated lookups in a loop. The commit descriptions have more information, including some before/after performance measurements" * tag 'audit-pr-20221212' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: unify audit_filter_{uring(), inode_name(), syscall()} audit: cache ctx->major in audit_filter_syscall()
2022-12-13Merge tag 'dma-mapping-6.2-2022-12-13' of ↵Linus Torvalds2-24/+47
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - reduce the swiotlb buffer size on allocation failure (Alexey Kardashevskiy) - clean up passing of bogus GFP flags to the dma-coherent allocator (Christoph Hellwig) * tag 'dma-mapping-6.2-2022-12-13' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: reject __GFP_COMP in dma_alloc_attrs ALSA: memalloc: don't pass bogus GFP_ flags to dma_alloc_* s390/ism: don't pass bogus GFP_ flags to dma_alloc_coherent cnic: don't pass bogus GFP_ flags to dma_alloc_coherent RDMA/qib: don't pass bogus GFP_ flags to dma_alloc_coherent RDMA/hfi1: don't pass bogus GFP_ flags to dma_alloc_coherent media: videobuf-dma-contig: use dma_mmap_coherent swiotlb: reduce the swiotlb buffer size on allocation failure
2022-12-12Merge tag 'fs.vfsuid.conversion.v6.2' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfsuid updates from Christian Brauner: "Last cycle we introduced the vfs{g,u}id_t types and associated helpers to gain type safety when dealing with idmapped mounts. That initial work already converted a lot of places over but there were still some left, This converts all remaining places that still make use of non-type safe idmapping helpers to rely on the new type safe vfs{g,u}id based helpers. Afterwards it removes all the old non-type safe helpers" * tag 'fs.vfsuid.conversion.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fs: remove unused idmapping helpers ovl: port to vfs{g,u}id_t and associated helpers fuse: port to vfs{g,u}id_t and associated helpers ima: use type safe idmapping helpers apparmor: use type safe idmapping helpers caps: use type safe idmapping helpers fs: use type safe idmapping helpers mnt_idmapping: add missing helpers
2022-12-12livepatch: Call klp_match_callback() in klp_find_callback() to avoid code ↵Zhen Lei1-22/+11
duplication The implementation of function klp_match_callback() is identical to the partial implementation of function klp_find_callback(). So call function klp_match_callback() in function klp_find_callback() instead of the duplicated code. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-12-12Merge tag 'pull-iov_iter' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "iov_iter work; most of that is about getting rid of direction misannotations and (hopefully) preventing more of the same for the future" * tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: use less confusing names for iov_iter direction initializers iov_iter: saner checks for attempt to copy to/from iterator [xen] fix "direction" argument of iov_iter_kvec() [vhost] fix 'direction' argument of iov_iter_{init,bvec}() [target] fix iov_iter_bvec() "direction" argument [s390] memcpy_real(): WRITE is "data source", not destination... [s390] zcore: WRITE is "data source", not destination... [infiniband] READ is "data destination", not source... [fsi] WRITE is "data source", not destination... [s390] copy_oldmem_kernel() - WRITE is "data source", not destination csum_and_copy_to_iter(): handle ITER_DISCARD get rid of unlikely() on page_copy_sane() calls
2022-12-12Merge tag 'pull-elfcore' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull elf coredumping updates from Al Viro: "Unification of regset and non-regset sides of ELF coredump handling. Collecting per-thread register values is the only thing that needs to be ifdefed there..." * tag 'pull-elfcore' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [elf] get rid of get_note_info_size() [elf] unify regset and non-regset cases [elf][non-regset] use elf_core_copy_task_regs() for dumper as well [elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument) elf_core_copy_task_regs(): task_pt_regs is defined everywhere [elf][regset] simplify thread list handling in fill_note_info() [elf][regset] clean fill_note_info() a bit kill extern of vsyscall32_sysctl kill coredump_params->regs kill signal_pt_regs()
2022-12-12Merge tag 'mm-nonmm-stable-2022-12-12' of ↵Linus Torvalds8-18/+19
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - A ptrace API cleanup series from Sergey Shtylyov - Fixes and cleanups for kexec from ye xingchen - nilfs2 updates from Ryusuke Konishi - squashfs feature work from Xiaoming Ni: permit configuration of the filesystem's compression concurrency from the mount command line - A series from Akinobu Mita which addresses bound checking errors when writing to debugfs files - A series from Yang Yingliang to address rapidio memory leaks - A series from Zheng Yejian to address possible overflow errors in encode_comp_t() - And a whole shower of singleton patches all over the place * tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (79 commits) ipc: fix memory leak in init_mqueue_fs() hfsplus: fix bug causing custom uid and gid being unable to be assigned with mount rapidio: devices: fix missing put_device in mport_cdev_open kcov: fix spelling typos in comments hfs: Fix OOB Write in hfs_asc2mac hfs: fix OOB Read in __hfs_brec_find relay: fix type mismatch when allocating memory in relay_create_buf() ocfs2: always read both high and low parts of dinode link count io-mapping: move some code within the include guarded section kernel: kcsan: kcsan_test: build without structleak plugin mailmap: update email for Iskren Chernev eventfd: change int to __u64 in eventfd_signal() ifndef CONFIG_EVENTFD rapidio: fix possible UAF when kfifo_alloc() fails relay: use strscpy() is more robust and safer cpumask: limit visibility of FORCE_NR_CPUS acct: fix potential integer overflow in encode_comp_t() acct: fix accuracy loss for input value of encode_comp_t() linux/init.h: include <linux/build_bug.h> and <linux/stringify.h> rapidio: rio: fix possible name leak in rio_register_mport() rapidio: fix possible name leaks when rio_add_device() fails ...
2022-12-12Merge tag 'random-6.2-rc1-for-linus' of ↵Linus Torvalds5-16/+8
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Replace prandom_u32_max() and various open-coded variants of it, there is now a new family of functions that uses fast rejection sampling to choose properly uniformly random numbers within an interval: get_random_u32_below(ceil) - [0, ceil) get_random_u32_above(floor) - (floor, U32_MAX] get_random_u32_inclusive(floor, ceil) - [floor, ceil] Coccinelle was used to convert all current users of prandom_u32_max(), as well as many open-coded patterns, resulting in improvements throughout the tree. I'll have a "late" 6.1-rc1 pull for you that removes the now unused prandom_u32_max() function, just in case any other trees add a new use case of it that needs to converted. According to linux-next, there may be two trivial cases of prandom_u32_max() reintroductions that are fixable with a 's/.../.../'. So I'll have for you a final conversion patch doing that alongside the removal patch during the second week. This is a treewide change that touches many files throughout. - More consistent use of get_random_canary(). - Updates to comments, documentation, tests, headers, and simplification in configuration. - The arch_get_random*_early() abstraction was only used by arm64 and wasn't entirely useful, so this has been replaced by code that works in all relevant contexts. - The kernel will use and manage random seeds in non-volatile EFI variables, refreshing a variable with a fresh seed when the RNG is initialized. The RNG GUID namespace is then hidden from efivarfs to prevent accidental leakage. These changes are split into random.c infrastructure code used in the EFI subsystem, in this pull request, and related support inside of EFISTUB, in Ard's EFI tree. These are co-dependent for full functionality, but the order of merging doesn't matter. - Part of the infrastructure added for the EFI support is also used for an improvement to the way vsprintf initializes its siphash key, replacing an sleep loop wart. - The hardware RNG framework now always calls its correct random.c input function, add_hwgenerator_randomness(), rather than sometimes going through helpers better suited for other cases. - The add_latent_entropy() function has long been called from the fork handler, but is a no-op when the latent entropy gcc plugin isn't used, which is fine for the purposes of latent entropy. But it was missing out on the cycle counter that was also being mixed in beside the latent entropy variable. So now, if the latent entropy gcc plugin isn't enabled, add_latent_entropy() will expand to a call to add_device_randomness(NULL, 0), which adds a cycle counter, without the absent latent entropy variable. - The RNG is now reseeded from a delayed worker, rather than on demand when used. Always running from a worker allows it to make use of the CPU RNG on platforms like S390x, whose instructions are too slow to do so from interrupts. It also has the effect of adding in new inputs more frequently with more regularity, amounting to a long term transcript of random values. Plus, it helps a bit with the upcoming vDSO implementation (which isn't yet ready for 6.2). - The jitter entropy algorithm now tries to execute on many different CPUs, round-robining, in hopes of hitting even more memory latencies and other unpredictable effects. It also will mix in a cycle counter when the entropy timer fires, in addition to being mixed in from the main loop, to account more explicitly for fluctuations in that timer firing. And the state it touches is now kept within the same cache line, so that it's assured that the different execution contexts will cause latencies. * tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits) random: include <linux/once.h> in the right header random: align entropy_timer_state to cache line random: mix in cycle counter when jitter timer fires random: spread out jitter callback to different CPUs random: remove extraneous period and add a missing one in comments efi: random: refresh non-volatile random seed when RNG is initialized vsprintf: initialize siphash key using notifier random: add back async readiness notifier random: reseed in delayed work rather than on-demand random: always mix cycle counter in add_latent_entropy() hw_random: use add_hwgenerator_randomness() for early entropy random: modernize documentation comment on get_random_bytes() random: adjust comment to account for removed function random: remove early archrandom abstraction random: use random.trust_{bootloader,cpu} command line option only stackprotector: actually use get_random_canary() stackprotector: move get_random_canary() into stackprotector.h treewide: use get_random_u32_inclusive() when possible treewide: use get_random_u32_{above,below}() instead of manual loop treewide: use get_random_u32_below() instead of deprecated function ...
2022-12-12Merge tag 'livepatching-for-6.2' of ↵Linus Torvalds1-27/+27
git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching Pull livepatching update from Petr Mladek: - code cleanup * tag 'livepatching-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching: livepatch: Move the result-invariant calculation out of the loop
2022-12-12Merge tag 'cgroup-for-6.2' of ↵Linus Torvalds2-11/+45
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing too interesting: - Add CONFIG_DEBUG_GROUP_REF which makes cgroup refcnt operations kprobable - A couple cpuset optimizations - Other misc changes including doc and test updates" * tag 'cgroup-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: remove rcu_read_lock()/rcu_read_unlock() in critical section of spin_lock_irq() cgroup/cpuset: Improve cpuset_css_alloc() description kselftest/cgroup: Add cleanup() to test_cpuset_prs.sh cgroup/cpuset: Optimize cpuset_attach() on v2 cgroup/cpuset: Skip spread flags update on v2 kselftest/cgroup: Fix gathering number of CPUs cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF cgroup: Implement DEBUG_CGROUP_REF
2022-12-12Merge tag 'sched-core-2022-12-12' of ↵Linus Torvalds6-180/+603
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Implement persistent user-requested affinity: introduce affinity_context::user_mask and unconditionally preserve the user-requested CPU affinity masks, for long-lived tasks to better interact with cpusets & CPU hotplug events over longer timespans, without destroying the original affinity intent if the underlying topology changes. - Uclamp updates: fix relationship between uclamp and fits_capacity() - PSI fixes - Misc fixes & updates * tag 'sched-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Clear ttwu_pending after enqueue_task() sched/psi: Use task->psi_flags to clear in CPU migration sched/psi: Stop relying on timer_pending() for poll_work rescheduling sched/psi: Fix avgs_work re-arm in psi_avgs_work() sched/psi: Fix possible missing or delayed pending event sched: Always clear user_cpus_ptr in do_set_cpus_allowed() sched: Enforce user requested affinity sched: Always preserve the user requested cpumask sched: Introduce affinity_context sched: Add __releases annotations to affine_move_task() sched/fair: Check if prev_cpu has highest spare cap in feec() sched/fair: Consider capacity inversion in util_fits_cpu() sched/fair: Detect capacity inversion sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition sched/uclamp: Make cpu_overutilized() use util_fits_cpu() sched/uclamp: Make asym_fits_capacity() use util_fits_cpu() sched/uclamp: Make select_idle_capacity() use util_fits_cpu() sched/uclamp: Fix fits_capacity() check in feec() sched/uclamp: Make task_fits_capacity() use util_fits_cpu() sched/uclamp: Fix relationship between uclamp and migration margin
2022-12-12Merge tag 'perf-core-2022-12-12' of ↵Linus Torvalds1-1013/+1087
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events updates from Ingo Molnar: - Thoroughly rewrite the data structures that implement perf task context handling, with the goal of fixing various quirks and unfeatures both in already merged, and in upcoming proposed code. The old data structure is the per task and per cpu perf_event_contexts: task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context ^ | ^ | ^ `---------------------------------' | `--> pmu ---' v ^ perf_event ------' In this new design this is replaced with a single task context and a single CPU context, plus intermediate data-structures: task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context ^ | ^ ^ `---------------------------' | | | | perf_cpu_pmu_context <--. | `----. ^ | | | | | | v v | | ,--> perf_event_pmu_context | | | | | | | v v | perf_event ---> pmu ----------------' [ See commit bd2756811766 for more details. ] This rewrite was developed by Peter Zijlstra and Ravi Bangoria. - Optimize perf_tp_event() - Update the Intel uncore PMU driver, extending it with UPI topology discovery on various hardware models. - Misc fixes & cleanups * tag 'perf-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) perf/x86/intel/uncore: Fix reference count leak in __uncore_imc_init_box() perf/x86/intel/uncore: Fix reference count leak in snr_uncore_mmio_map() perf/x86/intel/uncore: Fix reference count leak in hswep_has_limit_sbox() perf/x86/intel/uncore: Fix reference count leak in sad_cfg_iio_topology() perf/x86/intel/uncore: Make set_mapping() procedure void perf/x86/intel/uncore: Update sysfs-devices-mapping file perf/x86/intel/uncore: Enable UPI topology discovery for Sapphire Rapids perf/x86/intel/uncore: Enable UPI topology discovery for Icelake Server perf/x86/intel/uncore: Get UPI NodeID and GroupID perf/x86/intel/uncore: Enable UPI topology discovery for Skylake Server perf/x86/intel/uncore: Generalize get_topology() for SKX PMUs perf/x86/intel/uncore: Disable I/O stacks to PMU mapping on ICX-D perf/x86/intel/uncore: Clear attr_update properly perf/x86/intel/uncore: Introduce UPI topology type perf/x86/intel/uncore: Generalize IIO topology support perf/core: Don't allow grouping events from different hw pmus perf/amd/ibs: Make IBS a core pmu perf: Fix function pointer case perf/x86/amd: Remove the repeated declaration perf: Fix possible memleak in pmu_dev_alloc() ...
2022-12-12Merge tag 'locking-core-2022-12-12' of ↵Linus Torvalds2-15/+19
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Two changes in this cycle: - a micro-optimization in static_key_slow_inc_cpuslocked() - fix futex death-notification wakeup bug" * tag 'locking-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Resend potentially swallowed owner death notification jump_label: Use atomic_try_cmpxchg() in static_key_slow_inc_cpuslocked()
2022-12-12Merge tag 'cxl-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxlLinus Torvalds1-5/+8
Pull cxl updates from Dan Williams: "Compute Express Link (CXL) updates for 6.2. While it may seem backwards, the CXL update this time around includes some focus on CXL 1.x enabling where the work to date had been with CXL 2.0 (VH topologies) in mind. First generation CXL can mostly be supported via BIOS, similar to DDR, however it became clear there are use cases for OS native CXL error handling and some CXL 3.0 endpoint features can be deployed on CXL 1.x hosts (Restricted CXL Host (RCH) topologies). So, this update brings RCH topologies into the Linux CXL device model. In support of the ongoing CXL 2.0+ enabling two new core kernel facilities are added. One is the ability for the kernel to flag collisions between userspace access to PCI configuration registers and kernel accesses. This is brought on by the PCIe Data-Object-Exchange (DOE) facility, a hardware mailbox over config-cycles. The other is a cpu_cache_invalidate_memregion() API that maps to wbinvd_on_all_cpus() on x86. To prevent abuse it is disabled in guest VMs and architectures that do not support it yet. The CXL paths that need it, dynamic memory region creation and security commands (erase / unlock), are disabled when it is not present. As for the CXL 2.0+ this cycle the subsystem gains support Persistent Memory Security commands, error handling in response to PCIe AER notifications, and support for the "XOR" host bridge interleave algorithm. Summary: - Add the cpu_cache_invalidate_memregion() API for cache flushing in response to physical memory reconfiguration, or memory-side data invalidation from operations like secure erase or memory-device unlock. - Add a facility for the kernel to warn about collisions between kernel and userspace access to PCI configuration registers - Add support for Restricted CXL Host (RCH) topologies (formerly CXL 1.1) - Add handling and reporting of CXL errors reported via the PCIe AER mechanism - Add support for CXL Persistent Memory Security commands - Add support for the "XOR" algorithm for CXL host bridge interleave - Rework / simplify CXL to NVDIMM interactions - Miscellaneous cleanups and fixes" * tag 'cxl-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (71 commits) cxl/region: Fix memdev reuse check cxl/pci: Remove endian confusion cxl/pci: Add some type-safety to the AER trace points cxl/security: Drop security command ioctl uapi cxl/mbox: Add variable output size validation for internal commands cxl/mbox: Enable cxl_mbox_send_cmd() users to validate output size cxl/security: Fix Get Security State output payload endian handling cxl: update names for interleave ways conversion macros cxl: update names for interleave granularity conversion macros cxl/acpi: Warn about an invalid CHBCR in an existing CHBS entry tools/testing/cxl: Require cache invalidation bypass cxl/acpi: Fail decoder add if CXIMS for HBIG is missing cxl/region: Fix spelling mistake "memergion" -> "memregion" cxl/regs: Fix sparse warning cxl/acpi: Set ACPI's CXL _OSC to indicate RCD mode support tools/testing/cxl: Add an RCH topology cxl/port: Add RCD endpoint port enumeration cxl/mem: Move devm_cxl_add_endpoint() from cxl_core to cxl_mem tools/testing/cxl: Add XOR Math support to cxl_test cxl/acpi: Support CXL XOR Interleave Math (CXIMS) ...
2022-12-12Merge tag 'pm-6.2-rc1' of ↵Linus Torvalds3-22/+21
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These include two new drivers (cpufreq driver for Apple SoC CPU P-states and the SCMI Powercap based power capping driver), other new hardware support and driver extensions (Qualcomm cpufreq driver and its DT bindings, TI cpufreq driver, intel_pstate, intel-uncore-freq), a bunch of fixes and cleanups all over and a cpupower utility update including new features related to RAPL support. Specifics: - Fix nasty and hard to debug race condition introduced by mistake in the runtime PM core code and clean up that code somewhat on top of the fix (Rafael Wysocki) - Generalize of_perf_domain_get_sharing_cpumask phandle format (Hector Martin) - Add new cpufreq driver for Apple SoC CPU P-states (Hector Martin) - Update Qualcomm cpufreq driver (Manivannan Sadhasivam, Chen Hui): - CPU clock provider support - Generic cleanups or reorganization - Potential memleak fix - Fix of the return value of cpufreq_driver->get() - Update Qualcomm cpufreq driver's DT bindings (Manivannan Sadhasivam, Rob Herring, Melody Olvera): - Support for CPU clock provider - Missing cache-related properties fixes - Support for QDU1000/QRU1000 - Add support for ti,am625 SoC and enable build of ti-cpufreq for ARCH_K3 (Dave Gerlach, and Vibhore Vardhan) - Use flexible array to simplify memory allocation in the tegra186 cpufreq driver (Christophe JAILLET) - Convert cpufreq statistics code to use sysfs_emit_at() (ye xingchen) - Allow intel_pstate to use no-HWP mode on Sapphire Rapids (Giovanni Gherdovich) - Add missing pci_dev_put() to the amd_freq_sensitivity cpufreq driver (Xiongfeng Wang) - Initialize the kobj_unregister completion before calling kobject_init_and_add() in the cpufreq core code (Yongqiang Liu) - Defer setting boost MSRs in the ACPI cpufreq driver (Stuart Hayes, Nathan Chancellor) - Make intel_pstate accept initial EPP value of 0x80 (Srinivas Pandruvada) - Make read-only array sys_clk_src in the SPEAr cpufreq driver static (Colin Ian King) - Make array speeds in the longhaul cpufreq driver static (Colin Ian King) - Use str_enabled_disabled() helper in the ACPI cpufreq driver (Andy Shevchenko) - Drop a reference to CVS from cpufreq documentation (Conghui Wang) - Improve kernel messages printed by the PSCI cpuidle driver (Ulf Hansson) - Make the DT cpuidle driver return the correct number of parsed idle states, clean it up and clarify a comment in it (Ulf Hansson) - Modify the tasks freezing code to avoid using pr_cont() and refine an error message printed by it (Rafael Wysocki) - Make the hibernation core code complain about memory map mismatches during resume to help diagnostics (Xueqin Luo) - Fix mistake in a kerneldoc comment in the hibernation code (xiongxin) - Reverse the order of performance and enabling operations in the generic power domains code (Abel Vesa) - Power off[on] domains in hibernate .freeze[thaw]_noirq hook of in the generic power domains code (Abel Vesa) - Consolidate genpd_restore_noirq() and genpd_resume_noirq() (Shawn Guo) - Pass generic PM noirq hooks to genpd_finish_suspend() (Shawn Guo) - Drop generic power domain status manipulation during hibernate restore (Shawn Guo) - Fix compiler warnings with make W=1 in the idle_inject power capping driver (Srinivas Pandruvada) - Use kstrtobool() instead of strtobool() in the power capping sysfs interface (Christophe JAILLET) - Add SCMI Powercap based power capping driver (Cristian Marussi) - Add Emerald Rapids support to the intel-uncore-freq driver (Artem Bityutskiy) - Repair slips in kernel-doc comments in the generic notifier code (Lukas Bulwahn) - Fix several DT issues in the OPP library reorganize code around opp-microvolt-<named> DT property (Viresh Kumar) - Allow any of opp-microvolt, opp-microamp, or opp-microwatt properties to be present without the others present (James Calligeros) - Fix clock-latency-ns property in DT example (Serge Semin) - Add a private governor_data for devfreq governors (Kant Fan) - Reorganize devfreq code to use device_match_of_node() and devm_platform_get_and_ioremap_resource() instead of open coding them (ye xingchen, Minghao Chi) - Make cpupower choose base_cpu to display default cpupower details instead of picking CPU 0 (Saket Kumar Bhaskar) - Add Georgian translation to cpupower documentation (Zurab Kargareteli) - Introduce powercap intel-rapl library, powercap-info command, and RAPL monitor into cpupower (Thomas Renninger)" * tag 'pm-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (64 commits) PM: runtime: Adjust white space in the core code cpufreq: Remove CVS version control contents from documentation cpufreq: stats: Convert to use sysfs_emit_at() API cpufreq: ACPI: Only set boost MSRs on supported CPUs PM: sleep: Refine error message in try_to_freeze_tasks() PM: sleep: Avoid using pr_cont() in the tasks freezing code PM: runtime: Relocate rpm_callback() right after __rpm_callback() PM: runtime: Do not call __rpm_callback() from rpm_idle() PM / devfreq: event: use devm_platform_get_and_ioremap_resource() PM / devfreq: event: Use device_match_of_node() PM / devfreq: Use device_match_of_node() powercap: idle_inject: Fix warnings with make W=1 PM: hibernate: Complain about memory map mismatches during resume dt-bindings: cpufreq: cpufreq-qcom-hw: Add QDU1000/QRU1000 cpufreq cpufreq: tegra186: Use flexible array to simplify memory allocation cpupower: rapl monitor - shows the used power consumption in uj for each rapl domain cpupower: Introduce powercap intel-rapl library and powercap-info command cpupower: Add Georgian translation cpufreq: intel_pstate: Add Sapphire Rapids support in no-HWP mode cpufreq: amd_freq_sensitivity: Add missing pci_dev_put() ...
2022-12-12Merge tag 'timers-core-2022-12-10' of ↵Linus Torvalds3-99/+348
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Updates for timers, timekeeping and drivers: Core: - The timer_shutdown[_sync]() infrastructure: Tearing down timers can be tedious when there are circular dependencies to other things which need to be torn down. A prime example is timer and workqueue where the timer schedules work and the work arms the timer. What needs to prevented is that pending work which is drained via destroy_workqueue() does not rearm the previously shutdown timer. Nothing in that shutdown sequence relies on the timer being functional. The conclusion was that the semantics of timer_shutdown_sync() should be: - timer is not enqueued - timer callback is not running - timer cannot be rearmed Preventing the rearming of shutdown timers is done by discarding rearm attempts silently. A warning for the case that a rearm attempt of a shutdown timer is detected would not be really helpful because it's entirely unclear how it should be acted upon. The only way to address such a case is to add 'if (in_shutdown)' conditionals all over the place. This is error prone and in most cases of teardown not required all. - The real fix for the bluetooth HCI teardown based on timer_shutdown_sync(). A larger scale conversion to timer_shutdown_sync() is work in progress. - Consolidation of VDSO time namespace helper functions - Small fixes for timer and timerqueue Drivers: - Prevent integer overflow on the XGene-1 TVAL register which causes an never ending interrupt storm. - The usual set of new device tree bindings - Small fixes and improvements all over the place" * tag 'timers-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) dt-bindings: timer: renesas,cmt: Add r8a779g0 CMT support dt-bindings: timer: renesas,tmu: Add r8a779g0 support clocksource/drivers/arm_arch_timer: Use kstrtobool() instead of strtobool() clocksource/drivers/timer-ti-dm: Fix missing clk_disable_unprepare in dmtimer_systimer_init_clock() clocksource/drivers/timer-ti-dm: Clear settings on probe and free clocksource/drivers/timer-ti-dm: Make timer_get_irq static clocksource/drivers/timer-ti-dm: Fix warning for omap_timer_match clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math error clocksource/drivers/timer-npcm7xx: Enable timer 1 clock before use dt-bindings: timer: nuvoton,npcm7xx-timer: Allow specifying all clocks dt-bindings: timer: rockchip: Add rockchip,rk3128-timer clockevents: Repair kernel-doc for clockevent_delta2ns() clocksource/drivers/ingenic-ost: Define pm functions properly in platform_driver struct clocksource/drivers/sh_cmt: Access registers according to spec vdso/timens: Refactor copy-pasted find_timens_vvar_page() helper into one copy Bluetooth: hci_qca: Fix the teardown problem for real timers: Update the documentation to reflect on the new timer_shutdown() API timers: Provide timer_shutdown[_sync]() timers: Add shutdown mechanism to the internal functions timers: Split [try_to_]del_timer[_sync]() to prepare for shutdown mode ...
2022-12-12Merge tag 'smp-core-2022-12-10' of ↵Linus Torvalds1-17/+44
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug updates from Thomas Gleixner: "A small set of updates for CPU hotplug: - Prevent stale CPU hotplug state in the cpu_down() path which was detected by stress testing the sysfs interface - Ensure that the target CPU hotplug state for the boot CPU is CPUHP_ONLINE instead of the compile time init value CPUHP_OFFLINE. - Switch back to the original behaviour of warning when a CPU hotplug callback in the DYING/STARTING section returns an error code. Otherwise a buggy callback can leave the CPUs in an non recoverable state" * tag 'smp-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Do not bail-out in DYING/STARTING sections cpu/hotplug: Set cpuhp target for boot cpu cpu/hotplug: Make target_store() a nop when target == state
2022-12-12Merge tag 'for-netdev' of ↵Jakub Kicinski8-299/+605
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-12-11 We've added 74 non-merge commits during the last 11 day(s) which contain a total of 88 files changed, 3362 insertions(+), 789 deletions(-). The main changes are: 1) Decouple prune and jump points handling in the verifier, from Andrii. 2) Do not rely on ALLOW_ERROR_INJECTION for fmod_ret, from Benjamin. Merged from hid tree. 3) Do not zero-extend kfunc return values. Necessary fix for 32-bit archs, from Björn. 4) Don't use rcu_users to refcount in task kfuncs, from David. 5) Three reg_state->id fixes in the verifier, from Eduard. 6) Optimize bpf_mem_alloc by reusing elements from free_by_rcu, from Hou. 7) Refactor dynptr handling in the verifier, from Kumar. 8) Remove the "/sys" mount and umount dance in {open,close}_netns in bpf selftests, from Martin. 9) Enable sleepable support for cgrp local storage, from Yonghong. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (74 commits) selftests/bpf: test case for relaxed prunning of active_lock.id selftests/bpf: Add pruning test case for bpf_spin_lock bpf: use check_ids() for active_lock comparison selftests/bpf: verify states_equal() maintains idmap across all frames bpf: states_equal() must build idmap for all function frames selftests/bpf: test cases for regsafe() bug skipping check_id() bpf: regsafe() must not skip check_ids() docs/bpf: Add documentation for BPF_MAP_TYPE_SK_STORAGE selftests/bpf: Add test for dynptr reinit in user_ringbuf callback bpf: Use memmove for bpf_dynptr_{read,write} bpf: Move PTR_TO_STACK alignment check to process_dynptr_func bpf: Rework check_func_arg_reg_off bpf: Rework process_dynptr_func bpf: Propagate errors from process_* checks in check_func_arg bpf: Refactor ARG_PTR_TO_DYNPTR checks into process_dynptr_func bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true bpf: Reuse freed element in free_by_rcu during allocation selftests/bpf: Bring test_offload.py back to life bpf: Fix comment error in fixup_kfunc_call function bpf: Do not zero-extend kfunc return values ... ==================== Link: https://lore.kernel.org/r/20221212024701.73809-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12Merge tag 'irq-core-2022-12-10' of ↵Linus Torvalds6-189/+761
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt core and driver subsystem: The bulk is the rework of the MSI subsystem to support per device MSI interrupt domains. This solves conceptual problems of the current PCI/MSI design which are in the way of providing support for PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device. IMS (Interrupt Message Store] is a new specification which allows device manufactures to provide implementation defined storage for MSI messages (as opposed to PCI/MSI and PCI/MSI-X that has a specified message store which is uniform accross all devices). The PCI/MSI[-X] uniformity allowed us to get away with "global" PCI/MSI domains. IMS not only allows to overcome the size limitations of the MSI-X table, but also gives the device manufacturer the freedom to store the message in arbitrary places, even in host memory which is shared with the device. There have been several attempts to glue this into the current MSI code, but after lengthy discussions it turned out that there is a fundamental design problem in the current PCI/MSI-X implementation. This needs some historical background. When PCI/MSI[-X] support was added around 2003, interrupt management was completely different from what we have today in the actively developed architectures. Interrupt management was completely architecture specific and while there were attempts to create common infrastructure the commonalities were rudimentary and just providing shared data structures and interfaces so that drivers could be written in an architecture agnostic way. The initial PCI/MSI[-X] support obviously plugged into this model which resulted in some basic shared infrastructure in the PCI core code for setting up MSI descriptors, which are a pure software construct for holding data relevant for a particular MSI interrupt, but the actual association to Linux interrupts was completely architecture specific. This model is still supported today to keep museum architectures and notorious stragglers alive. In 2013 Intel tried to add support for hot-pluggable IO/APICs to the kernel, which was creating yet another architecture specific mechanism and resulted in an unholy mess on top of the existing horrors of x86 interrupt handling. The x86 interrupt management code was already an incomprehensible maze of indirections between the CPU vector management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X] implementation. At roughly the same time ARM struggled with the ever growing SoC specific extensions which were glued on top of the architected GIC interrupt controller. This resulted in a fundamental redesign of interrupt management and provided the today prevailing concept of hierarchical interrupt domains. This allowed to disentangle the interactions between x86 vector domain and interrupt remapping and also allowed ARM to handle the zoo of SoC specific interrupt components in a sane way. The concept of hierarchical interrupt domains aims to encapsulate the functionality of particular IP blocks which are involved in interrupt delivery so that they become extensible and pluggable. The X86 encapsulation looks like this: |--- device 1 [Vector]---[Remapping]---[PCI/MSI]--|... |--- device N where the remapping domain is an optional component and in case that it is not available the PCI/MSI[-X] domains have the vector domain as their parent. This reduced the required interaction between the domains pretty much to the initialization phase where it is obviously required to establish the proper parent relation ship in the components of the hierarchy. While in most cases the model is strictly representing the chain of IP blocks and abstracting them so they can be plugged together to form a hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware it's clear that the actual PCI/MSI[-X] interrupt controller is not a global entity, but strict a per PCI device entity. Here we took a short cut on the hierarchical model and went for the easy solution of providing "global" PCI/MSI domains which was possible because the PCI/MSI[-X] handling is uniform across the devices. This also allowed to keep the existing PCI/MSI[-X] infrastructure mostly unchanged which in turn made it simple to keep the existing architecture specific management alive. A similar problem was created in the ARM world with support for IP block specific message storage. Instead of going all the way to stack a IP block specific domain on top of the generic MSI domain this ended in a construct which provides a "global" platform MSI domain which allows overriding the irq_write_msi_msg() callback per allocation. In course of the lengthy discussions we identified other abuse of the MSI infrastructure in wireless drivers, NTB etc. where support for implementation specific message storage was just mindlessly glued into the existing infrastructure. Some of this just works by chance on particular platforms but will fail in hard to diagnose ways when the driver is used on platforms where the underlying MSI interrupt management code does not expect the creative abuse. Another shortcoming of today's PCI/MSI-X support is the inability to allocate or free individual vectors after the initial enablement of MSI-X. This results in an works by chance implementation of VFIO (PCI pass-through) where interrupts on the host side are not set up upfront to avoid resource exhaustion. They are expanded at run-time when the guest actually tries to use them. The way how this is implemented is that the host disables MSI-X and then re-enables it with a larger number of vectors again. That works by chance because most device drivers set up all interrupts before the device actually will utilize them. But that's not universally true because some drivers allocate a large enough number of vectors but do not utilize them until it's actually required, e.g. for acceleration support. But at that point other interrupts of the device might be in active use and the MSI-X disable/enable dance can just result in losing interrupts and therefore hard to diagnose subtle problems. Last but not least the "global" PCI/MSI-X domain approach prevents to utilize PCI/MSI[-X] and PCI/IMS on the same device due to the fact that IMS is not longer providing a uniform storage and configuration model. The solution to this is to implement the missing step and switch from global PCI/MSI domains to per device PCI/MSI domains. The resulting hierarchy then looks like this: |--- [PCI/MSI] device 1 [Vector]---[Remapping]---|... |--- [PCI/MSI] device N which in turn allows to provide support for multiple domains per device: |--- [PCI/MSI] device 1 |--- [PCI/IMS] device 1 [Vector]---[Remapping]---|... |--- [PCI/MSI] device N |--- [PCI/IMS] device N This work converts the MSI and PCI/MSI core and the x86 interrupt domains to the new model, provides new interfaces for post-enable allocation/free of MSI-X interrupts and the base framework for PCI/IMS. PCI/IMS has been verified with the work in progress IDXD driver. There is work in progress to convert ARM over which will replace the platform MSI train-wreck. The cleanup of VFIO, NTB and other creative "solutions" are in the works as well. Drivers: - Updates for the LoongArch interrupt chip drivers - Support for MTK CIRQv2 - The usual small fixes and updates all over the place" * tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (134 commits) irqchip/ti-sci-inta: Fix kernel doc irqchip/gic-v2m: Mark a few functions __init irqchip/gic-v2m: Include arm-gic-common.h irqchip/irq-mvebu-icu: Fix works by chance pointer assignment iommu/amd: Enable PCI/IMS iommu/vt-d: Enable PCI/IMS x86/apic/msi: Enable PCI/IMS PCI/MSI: Provide pci_ims_alloc/free_irq() PCI/MSI: Provide IMS (Interrupt Message Store) support genirq/msi: Provide constants for PCI/IMS support x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X PCI/MSI: Provide prepare_desc() MSI domain op PCI/MSI: Split MSI-X descriptor setup genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN genirq/msi: Provide msi_domain_alloc_irq_at() genirq/msi: Provide msi_domain_ops:: Prepare_desc() genirq/msi: Provide msi_desc:: Msi_data genirq/msi: Provide struct msi_map x86/apic/msi: Remove arch_create_remap_msi_irq_domain() ...
2022-12-12rtmutex: Add acquire semantics for rtmutex lock acquisition slow pathMel Gorman2-12/+49
Jan Kara reported the following bug triggering on 6.0.5-rt14 running dbench on XFS on arm64. kernel BUG at fs/inode.c:625! Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP CPU: 11 PID: 6611 Comm: dbench Tainted: G E 6.0.0-rt14-rt+ #1 pc : clear_inode+0xa0/0xc0 lr : clear_inode+0x38/0xc0 Call trace: clear_inode+0xa0/0xc0 evict+0x160/0x180 iput+0x154/0x240 do_unlinkat+0x184/0x300 __arm64_sys_unlinkat+0x48/0xc0 el0_svc_common.constprop.4+0xe4/0x2c0 do_el0_svc+0xac/0x100 el0_svc+0x78/0x200 el0t_64_sync_handler+0x9c/0xc0 el0t_64_sync+0x19c/0x1a0 It also affects 6.1-rc7-rt5 and affects a preempt-rt fork of 5.14 so this is likely a bug that existed forever and only became visible when ARM support was added to preempt-rt. The same problem does not occur on x86-64 and he also reported that converting sb->s_inode_wblist_lock to raw_spinlock_t makes the problem disappear indicating that the RT spinlock variant is the problem. Which in turn means that RT mutexes on ARM64 and any other weakly ordered architecture are affected by this independent of RT. Will Deacon observed: "I'd be more inclined to be suspicious of the slowpath tbh, as we need to make sure that we have acquire semantics on all paths where the lock can be taken. Looking at the rtmutex code, this really isn't obvious to me -- for example, try_to_take_rt_mutex() appears to be able to return via the 'takeit' label without acquire semantics and it looks like we might be relying on the caller's subsequent _unlock_ of the wait_lock for ordering, but that will give us release semantics which aren't correct." Sebastian Andrzej Siewior prototyped a fix that does work based on that comment but it was a little bit overkill and added some fences that should not be necessary. The lock owner is updated with an IRQ-safe raw spinlock held, but the spin_unlock does not provide acquire semantics which are needed when acquiring a mutex. Adds the necessary acquire semantics for lock owner updates in the slow path acquisition and the waiter bit logic. It successfully completed 10 iterations of the dbench workload while the vanilla kernel fails on the first iteration. [ bigeasy@linutronix.de: Initial prototype fix ] Fixes: 700318d1d7b38 ("locking/rtmutex: Use acquire/release semantics") Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core") Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221202100223.6mevpbl7i6x5udfd@techsingularity.net
2022-12-12Merge tag 'arm64-upstream' of ↵Linus Torvalds4-8/+17
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The highlights this time are support for dynamically enabling and disabling Clang's Shadow Call Stack at boot and a long-awaited optimisation to the way in which we handle the SVE register state on system call entry to avoid taking unnecessary traps from userspace. Summary: ACPI: - Enable FPDT support for boot-time profiling - Fix CPU PMU probing to work better with PREEMPT_RT - Update SMMUv3 MSI DeviceID parsing to latest IORT spec - APMT support for probing Arm CoreSight PMU devices CPU features: - Advertise new SVE instructions (v2.1) - Advertise range prefetch instruction - Advertise CSSC ("Common Short Sequence Compression") scalar instructions, adding things like min, max, abs, popcount - Enable DIT (Data Independent Timing) when running in the kernel - More conversion of system register fields over to the generated header CPU misfeatures: - Workaround for Cortex-A715 erratum #2645198 Dynamic SCS: - Support for dynamic shadow call stacks to allow switching at runtime between Clang's SCS implementation and the CPU's pointer authentication feature when it is supported (complete with scary DWARF parser!) Tracing and debug: - Remove static ftrace in favour of, err, dynamic ftrace! - Seperate 'struct ftrace_regs' from 'struct pt_regs' in core ftrace and existing arch code - Introduce and implement FTRACE_WITH_ARGS on arm64 to replace the old FTRACE_WITH_REGS - Extend 'crashkernel=' parameter with default value and fallback to placement above 4G physical if initial (low) allocation fails SVE: - Optimisation to avoid disabling SVE unconditionally on syscall entry and just zeroing the non-shared state on return instead Exceptions: - Rework of undefined instruction handling to avoid serialisation on global lock (this includes emulation of user accesses to the ID registers) Perf and PMU: - Support for TLP filters in Hisilicon's PCIe PMU device - Support for the DDR PMU present in Amlogic Meson G12 SoCs - Support for the terribly-named "CoreSight PMU" architecture from Arm (and Nvidia's implementation of said architecture) Misc: - Tighten up our boot protocol for systems with memory above 52 bits physical - Const-ify static keys to satisty jump label asm constraints - Trivial FFA driver cleanups in preparation for v1.1 support - Export the kernel_neon_* APIs as GPL symbols - Harden our instruction generation routines against instrumentation - A bunch of robustness improvements to our arch-specific selftests - Minor cleanups and fixes all over (kbuild, kprobes, kfence, PMU, ...)" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (151 commits) arm64: kprobes: Return DBG_HOOK_ERROR if kprobes can not handle a BRK arm64: kprobes: Let arch do_page_fault() fix up page fault in user handler arm64: Prohibit instrumentation on arch_stack_walk() arm64:uprobe fix the uprobe SWBP_INSN in big-endian arm64: alternatives: add __init/__initconst to some functions/variables arm_pmu: Drop redundant armpmu->map_event() in armpmu_event_init() kselftest/arm64: Allow epoll_wait() to return more than one result kselftest/arm64: Don't drain output while spawning children kselftest/arm64: Hold fp-stress children until they're all spawned arm64/sysreg: Remove duplicate definitions from asm/sysreg.h arm64/sysreg: Convert ID_DFR1_EL1 to automatic generation arm64/sysreg: Convert ID_DFR0_EL1 to automatic generation arm64/sysreg: Convert ID_AFR0_EL1 to automatic generation arm64/sysreg: Convert ID_MMFR5_EL1 to automatic generation arm64/sysreg: Convert MVFR2_EL1 to automatic generation arm64/sysreg: Convert MVFR1_EL1 to automatic generation arm64/sysreg: Convert MVFR0_EL1 to automatic generation arm64/sysreg: Convert ID_PFR2_EL1 to automatic generation arm64/sysreg: Convert ID_PFR1_EL1 to automatic generation arm64/sysreg: Convert ID_PFR0_EL1 to automatic generation ...
2022-12-12Merge tag 'slab-for-6.2-rc1' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - SLOB deprecation and SLUB_TINY The SLOB allocator adds maintenance burden and stands in the way of API improvements [1]. Deprecate it by renaming the config option (to make users notice) to CONFIG_SLOB_DEPRECATED with updated help text. SLUB should be used instead as SLAB will be the next on the removal list. Based on reports from a riscv k210 board with 8MB RAM, add a CONFIG_SLUB_TINY option to minimize SLUB's memory usage at the expense of scalability. This has resolved the k210 regression [2] so in case there are no others (that wouldn't be resolvable by further tweaks to SLUB_TINY) plan is to remove SLOB in a few cycles. Existing defconfigs with CONFIG_SLOB are converted to CONFIG_SLUB_TINY. - kmalloc() slub_debug redzone improvements A series from Feng Tang that builds on the tracking or requested size for kmalloc() allocations (for caches with debugging enabled) added in 6.1, to make redzone checks consider the requested size and not the rounded up one, in order to catch more subtle buffer overruns. Includes new slub_kunit test. - struct slab fields reordering to accomodate larger rcu_head RCU folks would like to grow rcu_head with debugging options, which breaks current struct slab layout's assumptions, so reorganize it to make this possible. - Miscellaneous improvements/fixes: - __alloc_size checking compiler workaround (Kees Cook) - Optimize and cleanup SLUB's sysfs init (Rasmus Villemoes) - Make SLAB compatible with PROVE_RAW_LOCK_NESTING (Jiri Kosina) - Correct SLUB's percpu allocation estimates (Baoquan He) - Re-enableS LUB's run-time failslab sysfs control (Alexander Atanasov) - Make tools/vm/slabinfo more user friendly when not run as root (Rong Tao) - Dead code removal in SLUB (Hyeonggon Yoo) * tag 'slab-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (31 commits) mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED mm, slub: don't aggressively inline with CONFIG_SLUB_TINY mm, slub: remove percpu slabs with CONFIG_SLUB_TINY mm, slub: split out allocations from pre/post hooks mm/slub, kunit: Add a test case for kmalloc redzone check mm/slub, kunit: add SLAB_SKIP_KFENCE flag for cache creation mm, slub: refactor free debug processing mm, slab: ignore SLAB_RECLAIM_ACCOUNT with CONFIG_SLUB_TINY mm, slub: don't create kmalloc-rcl caches with CONFIG_SLUB_TINY mm, slub: lower the default slub_max_order with CONFIG_SLUB_TINY mm, slub: retain no free slabs on partial list with CONFIG_SLUB_TINY mm, slub: disable SYSFS support with CONFIG_SLUB_TINY mm, slub: add CONFIG_SLUB_TINY mm, slab: ignore hardened usercopy parameters when disabled slab: Remove special-casing of const 0 size allocations slab: Clean up SLOB vs kmalloc() definition mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head mm/migrate: make isolate_movable_page() skip slab pages mm/slab: move and adjust kernel-doc for kmem_cache_alloc mm/slub, percpu: correct the calculation of early percpu allocation size ...
2022-12-12Merge tag 'printk-for-6.2' of ↵Linus Torvalds8-140/+428
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Add NMI-safe SRCU reader API. It uses atomic_inc() instead of this_cpu_inc() on strong load-store architectures. - Introduce new console_list_lock to synchronize a manipulation of the list of registered consoles and their flags. This is a first step in removing the big-kernel-lock-like behavior of console_lock(). This semaphore still serializes console->write() calbacks against: - each other. It primary prevents potential races between early and proper console drivers using the same device. - suspend()/resume() callbacks and init() operations in some drivers. - various other operations in the tty/vt and framebufer susbsystems. It is likely that console_lock() serializes even operations that are not directly conflicting with the console->write() callbacks here. This is the most complicated big-kernel-lock aspect of the console_lock() that will be hard to untangle. - Introduce new console_srcu lock that is used to safely iterate and access the registered console drivers under SRCU read lock. This is a prerequisite for introducing atomic console drivers and console kthreads. It will reduce the complexity of serialization against normal consoles and console_lock(). Also it should remove the risk of deadlock during critical situations, like Oops or panic, when only atomic consoles are registered. - Check whether the console is registered instead of enabled on many locations. It was a historical leftover. - Cleanly force a preferred console in xenfb code instead of a dirty hack. - A lot of code and comment clean ups and improvements. * tag 'printk-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (47 commits) printk: htmldocs: add missing description tty: serial: sh-sci: use setup() callback for early console printk: relieve console_lock of list synchronization duties tty: serial: kgdboc: use console_list_lock to trap exit tty: serial: kgdboc: synchronize tty_find_polling_driver() and register_console() tty: serial: kgdboc: use console_list_lock for list traversal tty: serial: kgdboc: use srcu console list iterator proc: consoles: use console_list_lock for list iteration tty: tty_io: use console_list_lock for list synchronization printk, xen: fbfront: create/use safe function for forcing preferred netconsole: avoid CON_ENABLED misuse to track registration usb: early: xhci-dbc: use console_is_registered() tty: serial: xilinx_uartps: use console_is_registered() tty: serial: samsung_tty: use console_is_registered() tty: serial: pic32_uart: use console_is_registered() tty: serial: earlycon: use console_is_registered() tty: hvc: use console_is_registered() efi: earlycon: use console_is_registered() tty: nfcon: use console_is_registered() serial_core: replace uart_console_enabled() with uart_console_registered() ...
2022-12-12Merge tag 'execve-v6.2-rc1' of ↵Linus Torvalds2-11/+21
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: "Most are small refactorings and bug fixes, but three things stand out: switching timens (which got reverted before) looks solid now, FOLL_FORCE has been removed (no failures seen yet across several weeks in -next), and some whitespace cleanups (which are long overdue). - Add timens support (when switching mm). This version has survived in -next for the entire cycle (Andrei Vagin) - Various small bug fixes, refactoring, and readability improvements (Bernd Edlinger, Rolf Eike Beer, Bo Liu, Li Zetao Liu Shixin) - Remove FOLL_FORCE for stack setup (Kees Cook) - Whitespace cleanups (Rolf Eike Beer, Kees Cook)" * tag 'execve-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_misc: fix shift-out-of-bounds in check_special_flags binfmt: Fix error return code in load_elf_fdpic_binary() exec: Remove FOLL_FORCE for stack setup binfmt_elf: replace IS_ERR() with IS_ERR_VALUE() binfmt_elf: simplify error handling in load_elf_phdrs() binfmt_elf: fix documented return value for load_elf_phdrs() exec: simplify initial stack size expansion binfmt: Fix whitespace issues exec: Add comments on check_unsafe_exec() fs counting ELF uapi: add spaces before '{' selftests/timens: add a test for vfork+exit fs/exec: switch timens when a task gets a new mm
2022-12-12Merge tag 'seccomp-v6.2-rc1' of ↵Linus Torvalds1-6/+11
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: - Add missing kerndoc parameter (Randy Dunlap) - Improve seccomp selftest to check CAP_SYS_ADMIN (Gautam Menghani) - Fix allocation leak when cloned thread immediately dies (Kuniyuki Iwashima) * tag 'seccomp-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: document the "filter_count" field seccomp: Move copy_seccomp() to no failure path. selftests/seccomp: Check CAP_SYS_ADMIN capability in the test mode_filter_without_nnp
2022-12-12Merge tag 'kcsan.2022.12.02a' of ↵Linus Torvalds1-0/+50
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull KCSAN updates from Paul McKenney: - Add instrumentation for memcpy(), memset(), and memmove() for Clang v16+'s new function names that are used when the -fsanitize=thread argument is given - Fix objtool warnings from KCSAN's volatile instrumentation, and typos in a pair of Kconfig options' help clauses * tag 'kcsan.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: kcsan: Fix trivial typo in Kconfig help comments objtool, kcsan: Add volatile read/write instrumentation to whitelist kcsan: Instrument memcpy/memset/memmove with newer Clang
2022-12-12Merge tag 'rcu.2022.12.02a' of ↵Linus Torvalds14-153/+545
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates. This is the second in a series from an ongoing review of the RCU documentation. - Miscellaneous fixes. - Introduce a default-off Kconfig option that depends on RCU_NOCB_CPU that, on CPUs mentioned in the nohz_full or rcu_nocbs boot-argument CPU lists, causes call_rcu() to introduce delays. These delays result in significant power savings on nearly idle Android and ChromeOS systems. These savings range from a few percent to more than ten percent. This series also includes several commits that change call_rcu() to a new call_rcu_hurry() function that avoids these delays in a few cases, for example, where timely wakeups are required. Several of these are outside of RCU and thus have acks and reviews from the relevant maintainers. - Create an srcu_read_lock_nmisafe() and an srcu_read_unlock_nmisafe() for architectures that support NMIs, but which do not provide NMI-safe this_cpu_inc(). These NMI-safe SRCU functions are required by the upcoming lockless printk() work by John Ogness et al. - Changes providing minor but important increases in torture test coverage for the new RCU polled-grace-period APIs. - Changes to torturescript that avoid redundant kernel builds, thus providing about a 30% speedup for the torture.sh acceptance test. * tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits) net: devinet: Reduce refcount before grace period net: Use call_rcu_hurry() for dst_release() workqueue: Make queue_rcu_work() use call_rcu_hurry() percpu-refcount: Use call_rcu_hurry() for atomic switch scsi/scsi_error: Use call_rcu_hurry() instead of call_rcu() rcu/rcutorture: Use call_rcu_hurry() where needed rcu/rcuscale: Use call_rcu_hurry() for async reader test rcu/sync: Use call_rcu_hurry() instead of call_rcu rcuscale: Add laziness and kfree tests rcu: Shrinker for lazy rcu rcu: Refactor code a bit in rcu_nocb_do_flush_bypass() rcu: Make call_rcu() lazy to save power rcu: Implement lockdep_rcu_enabled for !CONFIG_DEBUG_LOCK_ALLOC srcu: Debug NMI safety even on archs that don't require it srcu: Explain the reason behind the read side critical section on GP start srcu: Warn when NMI-unsafe API is used in NMI arch/s390: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option arch/loongarch: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() rcu-tasks: Make grace-period-age message human-readable ...
2022-12-12Merge branches 'powercap', 'pm-x86', 'pm-opp' and 'pm-misc'Rafael J. Wysocki1-3/+3
Merge power capping code updates, x86-specific power management pdate, operating performance points library updates and miscellaneous power management updates for 6.2-rc1: - Fix compiler warnings with make W=1 in the idle_inject power capping driver (Srinivas Pandruvada). - Use kstrtobool() instead of strtobool() in the power capping sysfs interface (Christophe JAILLET). - Add SCMI Powercap based power capping driver (Cristian Marussi). - Add Emerald Rapids support to the intel-uncore-freq driver (Artem Bityutskiy). - Repair slips in kernel-doc comments in the generic notifier code (Lukas Bulwahn). - Fix several DT issues in the OPP library reorganize code around opp-microvolt-<named> DT property (Viresh Kumar). - Allow any of opp-microvolt, opp-microamp, or opp-microwatt properties to be present without the others present (James Calligeros). - Fix clock-latency-ns property in DT example (Serge Semin). * powercap: powercap: idle_inject: Fix warnings with make W=1 powercap: Use kstrtobool() instead of strtobool() powercap: arm_scmi: Add SCMI Powercap based driver * pm-x86: platform/x86: intel-uncore-freq: add Emerald Rapids support * pm-opp: dt-bindings: opp-v2: Fix clock-latency-ns prop in example OPP: decouple dt properties in opp_parse_supplies() OPP: Simplify opp_parse_supplies() by restructuring it OPP: Parse named opp-microwatt property too dt-bindings: opp: Fix named microwatt property dt-bindings: opp: Fix usage of current in microwatt property * pm-misc: notifier: repair slips in kernel-doc comments
2022-12-12tracing: Fix infinite loop in tracing_read_pipe on overflowed print_trace_lineYang Jihong1-1/+14
print_trace_line may overflow seq_file buffer. If the event is not consumed, the while loop keeps peeking this event, causing a infinite loop. Link: https://lkml.kernel.org/r/20221129113009.182425-1-yangjihong1@huawei.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: 088b1e427dbba ("ftrace: pipe fixes") Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-12Merge branches 'pm-cpuidle', 'pm-sleep' and 'pm-domains'Rafael J. Wysocki2-19/+18
Merge cpuidle changes, updates related to system sleep amd generic power domains code fixes for 6.2-rc1: - Improve kernel messages printed by the cpuidle PCI driver (Ulf Hansson). - Make the DT cpuidle driver return the correct number of parsed idle states, clean it up and clarify a comment in it (Ulf Hansson). - Modify the tasks freezing code to avoid using pr_cont() and refine an error message printed by it (Rafael Wysocki). - Make the hibernation core code complain about memory map mismatches during resume to help diagnostics (Xueqin Luo). - Fix mistake in a kerneldoc comment in the hibernation code (xiongxin). - Reverse the order of performance and enabling operations in the generic power domains code (Abel Vesa). - Power off[on] domains in hibernate .freeze[thaw]_noirq hook of in the generic power domains code (Abel Vesa). - Consolidate genpd_restore_noirq() and genpd_resume_noirq() (Shawn Guo). - Pass generic PM noirq hooks to genpd_finish_suspend() (Shawn Guo). - Drop generic power domain status manipulation during hibernate restore (Shawn Guo). * pm-cpuidle: cpuidle: dt: Clarify a comment and simplify code in dt_init_idle_driver() cpuidle: dt: Return the correct numbers of parsed idle states cpuidle: psci: Extend information in log about OSI/PC mode * pm-sleep: PM: sleep: Refine error message in try_to_freeze_tasks() PM: sleep: Avoid using pr_cont() in the tasks freezing code PM: hibernate: Complain about memory map mismatches during resume PM: hibernate: Fix mistake in kerneldoc comment * pm-domains: PM: domains: Reverse the order of performance and enabling ops PM: domains: Power off[on] domain in hibernate .freeze[thaw]_noirq hook PM: domains: Consolidate genpd_restore_noirq() and genpd_resume_noirq() PM: domains: Pass generic PM noirq hooks to genpd_finish_suspend() PM: domains: Drop genpd status manipulation for hibernate restore
2022-12-11relay: fix type mismatch when allocating memory in relay_create_buf()Gavrilov Ilia1-2/+2
The 'padding' field of the 'rchan_buf' structure is an array of 'size_t' elements, but the memory is allocated for an array of 'size_t *' elements. Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lkml.kernel.org/r/20221129092002.3538384-1-Ilia.Gavrilov@infotecs.ru Fixes: b86ff981a825 ("[PATCH] relay: migrate from relayfs to a generic relay API") Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru> Cc: Colin Ian King <colin.i.king@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: wuchi <wuchi.zero@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11kernel: kcsan: kcsan_test: build without structleak pluginAnders Roxell1-0/+1
Building kcsan_test with structleak plugin enabled makes the stack frame size to grow. kernel/kcsan/kcsan_test.c:704:1: error: the frame size of 3296 bytes is larger than 2048 bytes [-Werror=frame-larger-than=] Turn off the structleak plugin checks for kcsan_test. Link: https://lkml.kernel.org/r/20221128104358.2660634-1-anders.roxell@linaro.org Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Marco Elver <elver@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Gow <davidgow@google.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11relay: use strscpy() is more robust and saferXu Panda1-2/+2
The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL terminated strings. Link: https://lkml.kernel.org/r/202211220853259244666@zte.com.cn Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Signed-off-by: Yang Yang <yang.yang29@zte.com> Cc: Colin Ian King <colin.i.king@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: wuchi <wuchi.zero@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11lockdep: allow instrumenting lockdep.c with KMSANAlexander Potapenko1-1/+0
Lockdep and KMSAN used to play badly together, causing deadlocks when KMSAN instrumentation of lockdep.c called lockdep functions recursively. Looks like this is no more the case, and a kernel can run (yet slower) with both KMSAN and lockdep enabled. This patch should fix false positives on wq_head->lock->dep_map, which KMSAN used to consider uninitialized because of lockdep.c not being instrumented. Link: https://lore.kernel.org/lkml/Y3b9AAEKp2Vr3e6O@sol.localdomain/ Link: https://lkml.kernel.org/r/20221128094541.2645890-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Eric Biggers <ebiggers@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Elver <elver@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-10bpf: use check_ids() for active_lock comparisonEduard Zingerman1-3/+13
An update for verifier.c:states_equal()/regsafe() to use check_ids() for active spin lock comparisons. This fixes the issue reported by Kumar Kartikeya Dwivedi in [1] using technique suggested by Edward Cree. W/o this commit the verifier might be tricked to accept the following program working with a map containing spin locks: 0: r9 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1. 1: r8 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2. 2: if r9 == 0 goto exit ; r9 -> PTR_TO_MAP_VALUE. 3: if r8 == 0 goto exit ; r8 -> PTR_TO_MAP_VALUE. 4: r7 = ktime_get_ns() ; Unbound SCALAR_VALUE. 5: r6 = ktime_get_ns() ; Unbound SCALAR_VALUE. 6: bpf_spin_lock(r8) ; active_lock.id == 2. 7: if r6 > r7 goto +1 ; No new information about the state ; is derived from this check, thus ; produced verifier states differ only ; in 'insn_idx'. 8: r9 = r8 ; Optionally make r9.id == r8.id. --- checkpoint --- ; Assume is_state_visisted() creates a ; checkpoint here. 9: bpf_spin_unlock(r9) ; (a,b) active_lock.id == 2. ; (a) r9.id == 2, (b) r9.id == 1. 10: exit(0) Consider two verification paths: (a) 0-10 (b) 0-7,9-10 The path (a) is verified first. If checkpoint is created at (8) the (b) would assume that (8) is safe because regsafe() does not compare register ids for registers of type PTR_TO_MAP_VALUE. [1] https://lore.kernel.org/bpf/20221111202719.982118-1-memxor@gmail.com/ Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Suggested-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10bpf: states_equal() must build idmap for all function framesEduard Zingerman1-1/+2
verifier.c:states_equal() must maintain register ID mapping across all function frames. Otherwise the following example might be erroneously marked as safe: main: fp[-24] = map_lookup_elem(...) ; frame[0].fp[-24].id == 1 fp[-32] = map_lookup_elem(...) ; frame[0].fp[-32].id == 2 r1 = &fp[-24] r2 = &fp[-32] call foo() r0 = 0 exit foo: 0: r9 = r1 1: r8 = r2 2: r7 = ktime_get_ns() 3: r6 = ktime_get_ns() 4: if (r6 > r7) goto skip_assign 5: r9 = r8 skip_assign: ; <--- checkpoint 6: r9 = *r9 ; (a) frame[1].r9.id == 2 ; (b) frame[1].r9.id == 1 7: if r9 == 0 goto exit: ; mark_ptr_or_null_regs() transfers != 0 info ; for all regs sharing ID: ; (a) r9 != 0 => &frame[0].fp[-32] != 0 ; (b) r9 != 0 => &frame[0].fp[-24] != 0 8: r8 = *r8 ; (a) r8 == &frame[0].fp[-32] ; (b) r8 == &frame[0].fp[-32] 9: r0 = *r8 ; (a) safe ; (b) unsafe exit: 10: exit While processing call to foo() verifier considers the following execution paths: (a) 0-10 (b) 0-4,6-10 (There is also path 0-7,10 but it is not interesting for the issue at hand. (a) is verified first.) Suppose that checkpoint is created at (6) when path (a) is verified, next path (b) is verified and (6) is reached. If states_equal() maintains separate 'idmap' for each frame the mapping at (6) for frame[1] would be empty and regsafe(r9)::check_ids() would add a pair 2->1 and return true, which is an error. If states_equal() maintains single 'idmap' for all frames the mapping at (6) would be { 1->1, 2->2 } and regsafe(r9)::check_ids() would return false when trying to add a pair 2->1. This issue was suggested in the following discussion: https://lore.kernel.org/bpf/CAEf4BzbFB5g4oUfyxk9rHy-PJSLQ3h8q9mV=rVoXfr_JVm8+1Q@mail.gmail.com/ Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10bpf: regsafe() must not skip check_ids()Eduard Zingerman1-21/+8
The verifier.c:regsafe() has the following shortcut: equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, parent)) == 0; ... if (equal) return true; Which is executed regardless old register type. This is incorrect for register types that might have an ID checked by check_ids(), namely: - PTR_TO_MAP_KEY - PTR_TO_MAP_VALUE - PTR_TO_PACKET_META - PTR_TO_PACKET The following pattern could be used to exploit this: 0: r9 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1. 1: r8 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2. 2: r7 = ktime_get_ns() ; Unbound SCALAR_VALUE. 3: r6 = ktime_get_ns() ; Unbound SCALAR_VALUE. 4: if r6 > r7 goto +1 ; No new information about the state ; is derived from this check, thus ; produced verifier states differ only ; in 'insn_idx'. 5: r9 = r8 ; Optionally make r9.id == r8.id. --- checkpoint --- ; Assume is_state_visisted() creates a ; checkpoint here. 6: if r9 == 0 goto <exit> ; Nullness info is propagated to all ; registers with matching ID. 7: r1 = *(u64 *) r8 ; Not always safe. Verifier first visits path 1-7 where r8 is verified to be not null at (6). Later the jump from 4 to 6 is examined. The checkpoint for (6) looks as follows: R8_rD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0) R9_rwD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0) R10=fp0 The current state is: R0=... R6=... R7=... fp-8=... R8=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0) R9=map_value_or_null(id=1,off=0,ks=4,vs=8,imm=0) R10=fp0 Note that R8 states are byte-to-byte identical, so regsafe() would exit early and skip call to check_ids(), thus ID mapping 2->2 will not be added to 'idmap'. Next, states for R9 are compared: these are not identical and check_ids() is executed, but 'idmap' is empty, so check_ids() adds mapping 2->1 to 'idmap' and returns success. This commit pushes the 'equal' down to register types that don't need check_ids(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10tracing/osnoise: Add preempt and/or irq disabled optionsDaniel Bristot de Oliveira1-5/+43
The osnoise workload runs with preemption and IRQs enabled in such a way as to allow all sorts of noise to disturb osnoise's execution. hwlat tracer has a similar workload but works with irq disabled, allowing only NMIs and the hardware to generate noise. While thinking about adding an options file to hwlat tracer to allow the system to panic, and other features I was thinking to add, like having a tracepoint at each noise detection, it came to my mind that is easier to make osnoise and also do hardware latency detection than making hwlat "feature compatible" with osnoise. Other points are: - osnoise already has an independent cpu file. - osnoise has a more intuitive interface, e.g., runtime/period vs. window/width (and people often need help remembering what it is). - osnoise: tracepoints - osnoise stop options - osnoise options file itself Moreover, the user-space side (in rtla) is simplified by reusing the existing osnoise code. Finally, people have been asking me about using osnoise for hw latency detection, and I have to explain that it was sufficient but not necessary. These options make it sufficient and necessary. Adding a Suggested-by Clark, as he often asked me about this possibility. Link: https://lkml.kernel.org/r/d9c6c19135497054986900f94c8e47410b15316a.1670623111.git.bristot@kernel.org Cc: Suggested-by: Clark Williams <williams@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10tracing/osnoise: Add PANIC_ON_STOP optionDaniel Bristot de Oliveira1-1/+5
Often the latency observed in a CPU is not caused by the work being done in the CPU itself, but by work done on another CPU that causes the hardware to stall all CPUs. In this case, it is interesting to know what is happening on ALL CPUs, and the best way to do this is via crash dump analysis. Add the PANIC_ON_STOP option to osnoise/timerlat tracers. The default behavior is having this option off. When enabled by the user, the system will panic after hitting a stop tracing condition. This option was motivated by a real scenario that Juri Lelli and I were debugging. Link: https://lkml.kernel.org/r/249ce4287c6725543e6db845a6e0df621dc67db5.1670623111.git.bristot@kernel.org Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10tracing: Fix some checker warningsDavid Howells2-3/+4
Fix some checker warnings in the trace code by adding __printf attributes to a number of trace functions and their declarations. Changes: ======== ver #2) - Dropped the fix for the unconditional tracing_max_lat_fops decl[1]. Link: https://lore.kernel.org/r/20221205180617.9b9d3971cbe06ee536603523@kernel.org/ [1] Link: https://lore.kernel.org/r/166992525941.1716618.13740663757583361463.stgit@warthog.procyon.org.uk/ # v1 Link: https://lkml.kernel.org/r/167023571258.382307.15314866482834835192.stgit@warthog.procyon.org.uk Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10tracing/osnoise: Make osnoise_options staticDaniel Bristot de Oliveira1-2/+2
Make osnoise_options static, as reported by the kernel test robot. Link: https://lkml.kernel.org/r/63255826485400d7a2270e9c5e66111079671e7a.1670228712.git.bristot@kernel.org Reported-by: kernel test robot <lkp@intel.com> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10tracing: remove unnecessary trace_trigger ifdefRoss Zwisler1-6/+0
The trace_trigger command line option introduced by commit a01fdc897fa5 ("tracing: Add trace_trigger kernel command line option") doesn't need to depend on the CONFIG_HIST_TRIGGERS kernel config option. This code doesn't depend on the histogram code, and the run-time selection of triggers is usable without CONFIG_HIST_TRIGGERS. Link: https://lore.kernel.org/linux-trace-kernel/20221209003310.1737039-1-zwisler@google.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: a01fdc897fa5 ("tracing: Add trace_trigger kernel command line option") Signed-off-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10ring-buffer: Handle resize in early boot upSteven Rostedt1-7/+25
With the new command line option that allows trace event triggers to be added at boot, the "snapshot" trigger will allocate the snapshot buffer very early, when interrupts can not be enabled. Allocating the ring buffer is not the problem, but it also resizes it, which is, as the resize code does synchronization that can not be preformed at early boot. To handle this, first change the raw_spin_lock_irq() in rb_insert_pages() to raw_spin_lock_irqsave(), such that the unlocking of that spin lock will not enable interrupts. Next, where it calls schedule_work_on(), disable migration and check if the CPU to update is the current CPU, and if so, perform the work directly, otherwise re-enable migration and call the schedule_work_on() to the CPU that is being updated. The rb_insert_pages() just needs to be run on the CPU that it is updating, and does not need preemption nor interrupts disabled when calling it. Link: https://lore.kernel.org/lkml/Y5J%2FCajlNh1gexvo@google.com/ Link: https://lore.kernel.org/linux-trace-kernel/20221209101151.1fec1167@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: a01fdc897fa5 ("tracing: Add trace_trigger kernel command line option") Reported-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10tracing/hist: Fix issue of losting command info in error_logZheng Yejian1-1/+1
When input some constructed invalid 'trigger' command, command info in 'error_log' are lost [1]. The root cause is that there is a path that event_hist_trigger_parse() is recursely called once and 'last_cmd' which save origin command is cleared, then later calling of hist_err() will no longer record origin command info: event_hist_trigger_parse() { last_cmd_set() // <1> 'last_cmd' save origin command here at first create_actions() { onmatch_create() { action_create() { trace_action_create() { trace_action_create_field_var() { create_field_var_hist() { event_hist_trigger_parse() { // <2> recursely called once hist_err_clear() // <3> 'last_cmd' is cleared here } hist_err() // <4> No longer find origin command!!! Since 'glob' is empty string while running into the recurse call, we can trickly check it and bypass the call of hist_err_clear() to solve it. [1] # cd /sys/kernel/tracing # echo "my_synth_event int v1; int v2; int v3;" >> synthetic_events # echo 'hist:keys=pid' >> events/sched/sched_waking/trigger # echo "hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(\ pid,pid1)" >> events/sched/sched_switch/trigger # cat error_log [ 8.405018] hist:sched:sched_switch: error: Couldn't find synthetic event Command: hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1) ^ [ 8.816902] hist:sched:sched_switch: error: Couldn't find field Command: hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1) ^ [ 8.816902] hist:sched:sched_switch: error: Couldn't parse field variable Command: hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1) ^ [ 8.999880] : error: Couldn't find field Command: ^ [ 8.999880] : error: Couldn't parse field variable Command: ^ [ 8.999880] : error: Couldn't find field Command: ^ [ 8.999880] : error: Couldn't create histogram for field Command: ^ Link: https://lore.kernel.org/linux-trace-kernel/20221207135326.3483216-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <zanussi@kernel.org> Fixes: f404da6e1d46 ("tracing: Add 'last error' error facility for hist triggers") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10tracing: Fix issue of missing one synthetic fieldZheng Yejian1-1/+1
The maximum number of synthetic fields supported is defined as SYNTH_FIELDS_MAX which value currently is 64, but it actually fails when try to generate a synthetic event with 64 fields by executing like: # echo "my_synth_event int v1; int v2; int v3; int v4; int v5; int v6;\ int v7; int v8; int v9; int v10; int v11; int v12; int v13; int v14;\ int v15; int v16; int v17; int v18; int v19; int v20; int v21; int v22;\ int v23; int v24; int v25; int v26; int v27; int v28; int v29; int v30;\ int v31; int v32; int v33; int v34; int v35; int v36; int v37; int v38;\ int v39; int v40; int v41; int v42; int v43; int v44; int v45; int v46;\ int v47; int v48; int v49; int v50; int v51; int v52; int v53; int v54;\ int v55; int v56; int v57; int v58; int v59; int v60; int v61; int v62;\ int v63; int v64" >> /sys/kernel/tracing/synthetic_events Correct the field counting to fix it. Link: https://lore.kernel.org/linux-trace-kernel/20221207091557.3137904-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: c9e759b1e845 ("tracing: Rework synthetic event command parsing") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10tracing/hist: Fix out-of-bound write on 'action_data.var_ref_idx'Zheng Yejian1-2/+8
When generate a synthetic event with many params and then create a trace action for it [1], kernel panic happened [2]. It is because that in trace_action_create() 'data->n_params' is up to SYNTH_FIELDS_MAX (current value is 64), and array 'data->var_ref_idx' keeps indices into array 'hist_data->var_refs' for each synthetic event param, but the length of 'data->var_ref_idx' is TRACING_MAP_VARS_MAX (current value is 16), so out-of-bound write happened when 'data->n_params' more than 16. In this case, 'data->match_data.event' is overwritten and eventually cause the panic. To solve the issue, adjust the length of 'data->var_ref_idx' to be SYNTH_FIELDS_MAX and add sanity checks to avoid out-of-bound write. [1] # cd /sys/kernel/tracing/ # echo "my_synth_event int v1; int v2; int v3; int v4; int v5; int v6;\ int v7; int v8; int v9; int v10; int v11; int v12; int v13; int v14;\ int v15; int v16; int v17; int v18; int v19; int v20; int v21; int v22;\ int v23; int v24; int v25; int v26; int v27; int v28; int v29; int v30;\ int v31; int v32; int v33; int v34; int v35; int v36; int v37; int v38;\ int v39; int v40; int v41; int v42; int v43; int v44; int v45; int v46;\ int v47; int v48; int v49; int v50; int v51; int v52; int v53; int v54;\ int v55; int v56; int v57; int v58; int v59; int v60; int v61; int v62;\ int v63" >> synthetic_events # echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="bash"' >> \ events/sched/sched_waking/trigger # echo "hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(\ pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,\ pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,\ pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,\ pid,pid,pid,pid,pid,pid,pid,pid,pid)" >> events/sched/sched_switch/trigger [2] BUG: unable to handle page fault for address: ffff91c900000000 PGD 61001067 P4D 61001067 PUD 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 2 PID: 322 Comm: bash Tainted: G W 6.1.0-rc8+ #229 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 RIP: 0010:strcmp+0xc/0x30 Code: 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee c3 cc cc cc cc 0f 1f 00 31 c0 eb 08 48 83 c0 01 84 d2 74 13 <0f> b6 14 07 3a 14 06 74 ef 19 c0 83 c8 01 c3 cc cc cc cc 31 c3 RSP: 0018:ffff9b3b00f53c48 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffffffffba958a68 RCX: 0000000000000000 RDX: 0000000000000010 RSI: ffff91c943d33a90 RDI: ffff91c900000000 RBP: ffff91c900000000 R08: 00000018d604b529 R09: 0000000000000000 R10: ffff91c9483eddb1 R11: ffff91ca483eddab R12: ffff91c946171580 R13: ffff91c9479f0538 R14: ffff91c9457c2848 R15: ffff91c9479f0538 FS: 00007f1d1cfbe740(0000) GS:ffff91c9bdc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff91c900000000 CR3: 0000000006316000 CR4: 00000000000006e0 Call Trace: <TASK> __find_event_file+0x55/0x90 action_create+0x76c/0x1060 event_hist_trigger_parse+0x146d/0x2060 ? event_trigger_write+0x31/0xd0 trigger_process_regex+0xbb/0x110 event_trigger_write+0x6b/0xd0 vfs_write+0xc8/0x3e0 ? alloc_fd+0xc0/0x160 ? preempt_count_add+0x4d/0xa0 ? preempt_count_add+0x70/0xa0 ksys_write+0x5f/0xe0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f1d1d0cf077 Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 RSP: 002b:00007ffcebb0e568 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000143 RCX: 00007f1d1d0cf077 RDX: 0000000000000143 RSI: 00005639265aa7e0 RDI: 0000000000000001 RBP: 00005639265aa7e0 R08: 000000000000000a R09: 0000000000000142 R10: 000056392639c017 R11: 0000000000000246 R12: 0000000000000143 R13: 00007f1d1d1ae6a0 R14: 00007f1d1d1aa4a0 R15: 00007f1d1d1a98a0 </TASK> Modules linked in: CR2: ffff91c900000000 ---[ end trace 0000000000000000 ]--- RIP: 0010:strcmp+0xc/0x30 Code: 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee c3 cc cc cc cc 0f 1f 00 31 c0 eb 08 48 83 c0 01 84 d2 74 13 <0f> b6 14 07 3a 14 06 74 ef 19 c0 83 c8 01 c3 cc cc cc cc 31 c3 RSP: 0018:ffff9b3b00f53c48 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffffffffba958a68 RCX: 0000000000000000 RDX: 0000000000000010 RSI: ffff91c943d33a90 RDI: ffff91c900000000 RBP: ffff91c900000000 R08: 00000018d604b529 R09: 0000000000000000 R10: ffff91c9483eddb1 R11: ffff91ca483eddab R12: ffff91c946171580 R13: ffff91c9479f0538 R14: ffff91c9457c2848 R15: ffff91c9479f0538 FS: 00007f1d1cfbe740(0000) GS:ffff91c9bdc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff91c900000000 CR3: 0000000006316000 CR4: 00000000000006e0 Link: https://lore.kernel.org/linux-trace-kernel/20221207035143.2278781-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: d380dcde9a07 ("tracing: Fix now invalid var_ref_vals assumption in trace action") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-10tracing/hist: Fix wrong return value in parse_action_params()Zheng Yejian1-0/+1
When number of synth fields is more than SYNTH_FIELDS_MAX, parse_action_params() should return -EINVAL. Link: https://lore.kernel.org/linux-trace-kernel/20221207034635.2253990-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: c282a386a397 ("tracing: Add 'onmatch' hist trigger action support") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-09tracing: Fix complicated dependency of CONFIG_TRACER_MAX_TRACEMasami Hiramatsu (Google)3-15/+18
Both CONFIG_OSNOISE_TRACER and CONFIG_HWLAT_TRACER partially enables the CONFIG_TRACER_MAX_TRACE code, but that is complicated and has introduced a bug; It declares tracing_max_lat_fops data structure outside of #ifdefs, but since it is defined only when CONFIG_TRACER_MAX_TRACE=y or CONFIG_HWLAT_TRACER=y, if only CONFIG_OSNOISE_TRACER=y, that declaration comes to a definition(!). To fix this issue, and do not repeat the similar problem, makes CONFIG_OSNOISE_TRACER and CONFIG_HWLAT_TRACER enables the CONFIG_TRACER_MAX_TRACE always. It has there benefits; - Fix the tracing_max_lat_fops bug - Simplify the #ifdefs - CONFIG_TRACER_MAX_TRACE code is fully enabled, or not. Link: https://lore.kernel.org/linux-trace-kernel/167033628155.4111793.12185405690820208159.stgit@devnote3 Fixes: 424b650f35c7 ("tracing: Fix missing osnoise tracer on max_latency") Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: stable@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/all/166992525941.1716618.13740663757583361463.stgit@warthog.procyon.org.uk/ (original thread and v1) Link: https://lore.kernel.org/all/202212052253.VuhZ2ulJ-lkp@intel.com/T/#u (v1 error report) Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-09tracing/probes: Handle system names with hyphensSteven Rostedt (Google)2-4/+17
When creating probe names, a check is done to make sure it matches basic C standard variable naming standards. Basically, starts with alphabetic or underline, and then the rest of the characters have alpha-numeric or underline in them. But system names do not have any true naming conventions, as they are created by the TRACE_SYSTEM macro and nothing tests to see what they are. The "xhci-hcd" trace events has a '-' in the system name. When trying to attach a eprobe to one of these trace points, it fails because the system name does not follow the variable naming convention because of the hyphen, and the eprobe checks fail on this. Allow hyphens in the system name so that eprobes can attach to the "xhci-hcd" trace events. Link: https://lore.kernel.org/all/Y3eJ8GiGnEvVd8%2FN@macondo/ Link: https://lore.kernel.org/linux-trace-kernel/20221122122345.160f5077@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: 5b7a96220900e ("tracing/probe: Check event/group naming rule at parsing") Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-09trace/kprobe: remove duplicated calls of ring_buffer_event_dataSong Chen1-2/+0
Function __kprobe_trace_func calls ring_buffer_event_data to get a ring buffer, however, it has been done in above call trace_event_buffer_reserve. So does __kretprobe_trace_func. This patch removes those duplicated calls. Link: https://lore.kernel.org/all/1666145478-4706-1-git-send-email-chensong_2000@189.cn/ Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Song Chen <chensong_2000@189.cn> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>