summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-08-02Merge tag 'pm-5.20-rc1' of ↵Linus Torvalds3-11/+30
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These are mostly minor improvements all over including new CPU IDs for the Intel RAPL driver, an Energy Model rework to use micro-Watt as the power unit, cpufreq fixes and cleanus, cpuidle updates, devfreq updates, documentation cleanups and a new version of the pm-graph suite of utilities. Specifics: - Make cpufreq_show_cpus() more straightforward (Viresh Kumar). - Drop unnecessary CPU hotplug locking from store() used by cpufreq sysfs attributes (Viresh Kumar). - Make the ACPI cpufreq driver support the boost control interface on Zhaoxin/Centaur processors (Tony W Wang-oc). - Print a warning message on attempts to free an active cpufreq policy which should never happen (Viresh Kumar). - Fix grammar in the Kconfig help text for the loongson2 cpufreq driver (Randy Dunlap). - Use cpumask_var_t for an on-stack CPU mask in the ondemand cpufreq governor (Zhao Liu). - Add trace points for guest_halt_poll_ns grow/shrink to the haltpoll cpuidle driver (Eiichi Tsukata). - Modify intel_idle to treat C1 and C1E as independent idle states on Sapphire Rapids (Artem Bityutskiy). - Extend support for wakeirq to callback wrappers used during system suspend and resume (Ulf Hansson). - Defer waiting for device probe before loading a hibernation image till the first actual device access to avoid possible deadlocks reported by syzbot (Tetsuo Handa). - Unify device_init_wakeup() for PM_SLEEP and !PM_SLEEP (Bjorn Helgaas). - Add Raptor Lake-P to the list of processors supported by the Intel RAPL driver (George D Sworo). - Add Alder Lake-N and Raptor Lake-P to the list of processors for which Power Limit4 is supported in the Intel RAPL driver (Sumeet Pawnikar). - Make pm_genpd_remove() check genpd_debugfs_dir against NULL before attempting to remove it (Hsin-Yi Wang). - Change the Energy Model code to represent power in micro-Watts and adjust its users accordingly (Lukasz Luba). - Add new devfreq driver for Mediatek CCI (Cache Coherent Interconnect) (Johnson Wang). - Convert the Samsung Exynos SoC Bus bindings to DT schema of exynos-bus.c (Krzysztof Kozlowski). - Address kernel-doc warnings by adding the description for unused function parameters in devfreq core (Mauro Carvalho Chehab). - Use NULL to pass a null pointer rather than zero according to the function propotype in imx-bus.c (Colin Ian King). - Print error message instead of error interger value in tegra30-devfreq.c (Dmitry Osipenko). - Add checks to prevent setting negative frequency QoS limits for CPUs (Shivnandan Kumar). - Update the pm-graph suite of utilities to the latest revision 5.9 including multiple improvements (Todd Brandt). - Drop pme_interrupt reference from the PCI power management documentation (Mario Limonciello)" * tag 'pm-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (27 commits) powercap: RAPL: Add Power Limit4 support for Alder Lake-N and Raptor Lake-P PM: QoS: Add check to make sure CPU freq is non-negative PM: hibernate: defer device probing when resuming from hibernation intel_idle: make SPR C1 and C1E be independent cpufreq: ondemand: Use cpumask_var_t for on-stack cpu mask cpufreq: loongson2: fix Kconfig "its" grammar pm-graph v5.9 cpufreq: Warn users while freeing active policy cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1 firmware: arm_scmi: Get detailed power scale from perf Documentation: EM: Switch to micro-Watts scale PM: EM: convert power field to micro-Watts precision and align drivers PM / devfreq: tegra30: Add error message for devm_devfreq_add_device() PM / devfreq: imx-bus: use NULL to pass a null pointer rather than zero PM / devfreq: shut up kernel-doc warnings dt-bindings: interconnect: samsung,exynos-bus: convert to dtschema PM / devfreq: mediatek: Introduce MediaTek CCI devfreq driver dt-bindings: interconnect: Add MediaTek CCI dt-bindings PM: domains: Ensure genpd_debugfs_dir exists before remove PM: runtime: Extend support for wakeirq for force_suspend|resume ...
2022-08-01Merge tag 'irq-core-2022-08-01' of ↵Linus Torvalds9-31/+30
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for interrupt core and drivers: Core: - Fix a few inconsistencies between UP and SMP vs interrupt affinities - Small updates and cleanups all over the place New drivers: - LoongArch interrupt controller - Renesas RZ/G2L interrupt controller Updates: - Hotpath optimization for SiFive PLIC - Workaround for broken PLIC edge triggered interrupts - Simall cleanups and improvements as usual" * tag 'irq-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) irqchip/mmp: Declare init functions in common header file irqchip/mips-gic: Check the return value of ioremap() in gic_of_init() genirq: Use for_each_action_of_desc in actions_show() irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch irqchip: Add LoongArch CPU interrupt controller support irqchip: Add Loongson Extended I/O interrupt controller support irqchip/loongson-liointc: Add ACPI init support irqchip/loongson-pch-msi: Add ACPI init support irqchip/loongson-pch-pic: Add ACPI init support irqchip: Add Loongson PCH LPC controller support LoongArch: Prepare to support multiple pch-pic and pch-msi irqdomain LoongArch: Use ACPI_GENERIC_GSI for gsi handling genirq/generic_chip: Export irq_unmap_generic_chip ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallback APCI: irq: Add support for multiple GSI domains LoongArch: Provisionally add ACPICA data structures irqdomain: Use hwirq_max instead of revmap_size for NOMAP domains irqdomain: Report irq number for NOMAP domains irqchip/gic-v3: Fix comment typo dt-bindings: interrupt-controller: renesas,rzg2l-irqc: Document RZ/V2L SoC ...
2022-08-01Merge tag 'perf-core-2022-08-01' of ↵Linus Torvalds2-4/+22
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events updates from Ingo Molnar: - Fix Intel Alder Lake PEBS memory access latency & data source profiling info bugs. - Use Intel large-PEBS hardware feature in more circumstances, to reduce PMI overhead & reduce sampling data. - Extend the lost-sample profiling output with the PERF_FORMAT_LOST ABI variant, which tells tooling the exact number of samples lost. - Add new IBS register bits definitions. - AMD uncore events: Add PerfMonV2 DF (Data Fabric) enhancements. * tag 'perf-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/ibs: Add new IBS register bits into header perf/x86/intel: Fix PEBS data source encoding for ADL perf/x86/intel: Fix PEBS memory access info encoding for ADL perf/core: Add a new read format to get a number of lost samples perf/x86/amd/uncore: Add PerfMonV2 RDPMC assignments perf/x86/amd/uncore: Add PerfMonV2 DF event format perf/x86/amd/uncore: Detect available DF counters perf/x86/amd/uncore: Use attr_update for format attributes perf/x86/amd/uncore: Use dynamic events array x86/events/intel/ds: Enable large PEBS for PERF_SAMPLE_WEIGHT_TYPE
2022-08-01Merge tag 'locking-core-2022-08-01' of ↵Linus Torvalds2-38/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "This was a fairly quiet cycle for the locking subsystem: - lockdep: Fix a handful of the more complex lockdep_init_map_*() primitives that can lose the lock_type & cause false reports. No such mishap was observed in the wild. - jump_label improvements: simplify the cross-arch support of initial NOP patching by making it arch-specific code (used on MIPS only), and remove the s390 initial NOP patching that was superfluous" * tag 'locking-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/lockdep: Fix lockdep_init_map_*() confusion jump_label: make initial NOP patching the special case jump_label: mips: move module NOP patching into arch code jump_label: s390: avoid pointless initial NOP patching
2022-08-01Merge tag 'sched-core-2022-08-01' of ↵Linus Torvalds13-448/+837
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Load-balancing improvements: - Improve NUMA balancing on AMD Zen systems for affine workloads. - Improve the handling of reduced-capacity CPUs in load-balancing. - Energy Model improvements: fix & refine all the energy fairness metrics (PELT), and remove the conservative threshold requiring 6% energy savings to migrate a task. Doing this improves power efficiency for most workloads, and also increases the reliability of energy-efficiency scheduling. - Optimize/tweak select_idle_cpu() to spend (much) less time searching for an idle CPU on overloaded systems. There's reports of several milliseconds spent there on large systems with large workloads ... [ Since the search logic changed, there might be behavioral side effects. ] - Improve NUMA imbalance behavior. On certain systems with spare capacity, initial placement of tasks is non-deterministic, and such an artificial placement imbalance can persist for a long time, hurting (and sometimes helping) performance. The fix is to make fork-time task placement consistent with runtime NUMA balancing placement. Note that some performance regressions were reported against this, caused by workloads that are not memory bandwith limited, which benefit from the artificial locality of the placement bug(s). Mel Gorman's conclusion, with which we concur, was that consistency is better than random workload benefits from non-deterministic bugs: "Given there is no crystal ball and it's a tradeoff, I think it's better to be consistent and use similar logic at both fork time and runtime even if it doesn't have universal benefit." - Improve core scheduling by fixing a bug in sched_core_update_cookie() that caused unnecessary forced idling. - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly woken tasks. - Fix a newidle balancing bug that introduced unnecessary wakeup latencies. ABI improvements/fixes: - Do not check capabilities and do not issue capability check denial messages when a scheduler syscall doesn't require privileges. (Such as increasing niceness.) - Add forced-idle accounting to cgroups too. - Fix/improve the RSEQ ABI to not just silently accept unknown flags. (No existing tooling is known to have learned to rely on the previous behavior.) - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags. Optimizations: - Optimize & simplify leaf_cfs_rq_list() - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg(). Misc fixes & cleanups: - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems. - Fix a full-NOHZ bug that can in some cases result in the tick not being re-enabled when the last SCHED_RT task is gone from a runqueue but there's still SCHED_OTHER tasks around. - Various PREEMPT_RT related fixes. - Misc cleanups & smaller fixes" * tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits) rseq: Kill process when unknown flags are encountered in ABI structures rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags sched/core: Fix the bug that task won't enqueue into core tree when update cookie nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt() sched/core: Always flush pending blk_plug sched/fair: fix case with reduced capacity CPU sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling sched/core: add forced idle accounting for cgroups sched/fair: Remove the energy margin in feec() sched/fair: Remove task_util from effective utilization in feec() sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() sched/fair: Rename select_idle_mask to select_rq_mask sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() sched/fair: Decay task PELT values during wakeup migration sched/fair: Provide u64 read for 32-bits arch helper sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg sched: only perform capability check on privileged operation sched: Remove unused function group_first_cpu() sched/fair: Remove redundant word " *" selftests/rseq: check if libc rseq support is registered ...
2022-08-01rseq: Kill process when unknown flags are encountered in ABI structuresMathieu Desnoyers1-2/+2
rseq_abi()->flags and rseq_abi()->rseq_cs->flags 29 upper bits are currently unused. The current behavior when those bits are set is to ignore them. This is not an ideal behavior, because when future features will start using those flags, if user-space fails to correctly validate that the kernel indeed supports those flags (e.g. with a new sys_rseq flags bit) before using them, it may incorrectly assume that the kernel will handle those flags way when in fact those will be silently ignored on older kernels. Validating that unused flags bits are cleared will allow a smoother transition when those flags will start to be used by allowing applications to fail early, and obviously, when they attempt to use the new flags on an older kernel that does not support them. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20220622194617.1155957-2-mathieu.desnoyers@efficios.com
2022-08-01rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flagsMathieu Desnoyers1-15/+8
The pretty much unused RSEQ_CS_FLAG_NO_RESTART_ON_* flags introduce complexity in rseq, and are subtly buggy [1]. Solving those issues requires introducing additional complexity in the rseq implementation for each supported architecture. Considering that it complexifies the rseq ABI, I am proposing that we deprecate those flags. [2] So far there appears to be consensus from maintainers of user-space projects impacted by this feature that its removal would be a welcome simplification. [3] The deprecation approach proposed here is to issue WARN_ON_ONCE() when encountering those flags and kill the offending process with sigsegv. This should allow us to quickly identify whether anyone yells at us for removing this. Link: https://lore.kernel.org/lkml/20220618182515.95831-1-minhquangbui99@gmail.com/ [1] Link: https://lore.kernel.org/lkml/258546133.12151.1655739550814.JavaMail.zimbra@efficios.com/ [2] Link: https://lore.kernel.org/lkml/87pmj1enjh.fsf@email.froward.int.ebiederm.org/ [3] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/lkml/20220622194617.1155957-1-mathieu.desnoyers@efficios.com
2022-07-31Merge tag 'x86_urgent_for_v5.19' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Update the 'mitigations=' kernel param documentation - Check the IBPB feature flag before enabling IBPB in firmware calls because cloud vendors' fantasy when it comes to creating guest configurations is unlimited - Unexport sev_es_ghcb_hv_call() before 5.19 releases now that HyperV doesn't need it anymore - Remove dead CONFIG_* items * tag 'x86_urgent_for_v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: docs/kernel-parameters: Update descriptions for "mitigations=" param with retbleed x86/bugs: Do not enable IBPB at firmware entry when IBPB is not available Revert "x86/sev: Expose sev_es_ghcb_hv_call() for use by HyperV" x86/configs: Update configs in x86_debug.config
2022-07-31Merge tag 'locking_urgent_for_v5.19' of ↵Linus Torvalds1-10/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Borislav Petkov: - Avoid rwsem lockups in certain situations when handling the handoff bit * tag 'locking_urgent_for_v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rwsem: Allow slowpath writer to ignore handoff bit if not set by first waiter
2022-07-30locking/rwsem: Allow slowpath writer to ignore handoff bit if not set by ↵Waiman Long1-10/+20
first waiter With commit d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent"), the writer that sets the handoff bit can be interrupted out without clearing the bit if the wait queue isn't empty. This disables reader and writer optimistic lock spinning and stealing. Now if a non-first writer in the queue is somehow woken up or a new waiter enters the slowpath, it can't acquire the lock. This is not the case before commit d257cc8cb8d5 as the writer that set the handoff bit will clear it when exiting out via the out_nolock path. This is less efficient as the busy rwsem stays in an unlock state for a longer time. In some cases, this new behavior may cause lockups as shown in [1] and [2]. This patch allows a non-first writer to ignore the handoff bit if it is not originally set or initiated by the first waiter. This patch is shown to be effective in fixing the lockup problem reported in [1]. [1] https://lore.kernel.org/lkml/20220617134325.GC30825@techsingularity.net/ [2] https://lore.kernel.org/lkml/3f02975c-1a9d-be20-32cf-f1d8e3dfafcc@oracle.com/ Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Donnelly <john.p.donnelly@oracle.com> Tested-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220622200419.778799-1-longman@redhat.com
2022-07-29Merge tag 'wq-for-5.19-rc8-fixes' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "Just one commit to suppress a spurious warning added during the 5.19 cycle" * tag 'wq-for-5.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Avoid a false warning in unbind_workers()
2022-07-29workqueue: Avoid a false warning in unbind_workers()Lai Jiangshan1-1/+4
Doing set_cpus_allowed_ptr() with wq_unbound_cpumask can be possible fails and trigger the false warning. Use cpu_possible_mask instead when wq_unbound_cpumask has no active CPUs. It is very easy to trigger the warning: Set wq_unbound_cpumask to a small set of CPUs. Offline all the CPUs of wq_unbound_cpumask. Offline an extra CPU and trigger the warning. Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs") Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-07-29Merge branches 'pm-devfreq', 'pm-qos', 'pm-tools' and 'pm-docs'Rafael J. Wysocki1-2/+2
Merge devfreq changes, PM QoS change, and power management tools and documentation changes for v5.20-rc1: - Add new devfreq driver for Mediatek CCI (Cache Coherent Interconnect) (Johnson Wang). - Convert the Samsung Exynos SoC Bus bindings to DT schema of exynos-bus.c (Krzysztof Kozlowski). - Address kernel-doc warnings by adding the description for unused fucntion parameters in devfreq core (Mauro Carvalho Chehab). - Use NULL to pass a null pointer rather than zero according to the function propotype in imx-bus.c (Colin Ian King). - Print error message instead of error interger value in tegra30-devfreq.c (Dmitry Osipenko). - Add checks to prevent setting negative frequency QoS limits for CPUs (Shivnandan Kumar). - Update the pm-graph suite of utilities to the latest revision 5.9 including multiple improvements (Todd Brandt). - Drop pme_interrupt reference from the PCI power management documentation (Mario Limonciello). * pm-devfreq: PM / devfreq: tegra30: Add error message for devm_devfreq_add_device() PM / devfreq: imx-bus: use NULL to pass a null pointer rather than zero PM / devfreq: shut up kernel-doc warnings dt-bindings: interconnect: samsung,exynos-bus: convert to dtschema PM / devfreq: mediatek: Introduce MediaTek CCI devfreq driver dt-bindings: interconnect: Add MediaTek CCI dt-bindings * pm-qos: PM: QoS: Add check to make sure CPU freq is non-negative * pm-tools: pm-graph v5.9 * pm-docs: Documentation: PM: Drop pme_interrupt reference
2022-07-29Merge branches 'pm-core', 'pm-sleep', 'powercap', 'pm-domains' and 'pm-em'Rafael J. Wysocki2-9/+28
Merge core device power management changes for v5.20-rc1: - Extend support for wakeirq to callback wrappers used during system suspend and resume (Ulf Hansson). - Defer waiting for device probe before loading a hibernation image till the first actual device access to avoid possible deadlocks reported by syzbot (Tetsuo Handa). - Unify device_init_wakeup() for PM_SLEEP and !PM_SLEEP (Bjorn Helgaas). - Add Raptor Lake-P to the list of processors supported by the Intel RAPL driver (George D Sworo). - Add Alder Lake-N and Raptor Lake-P to the list of processors for which Power Limit4 is supported in the Intel RAPL driver (Sumeet Pawnikar). - Make pm_genpd_remove() check genpd_debugfs_dir against NULL before attempting to remove it (Hsin-Yi Wang). - Change the Energy Model code to represent power in micro-Watts and adjust its users accordingly (Lukasz Luba). * pm-core: PM: runtime: Extend support for wakeirq for force_suspend|resume * pm-sleep: PM: hibernate: defer device probing when resuming from hibernation PM: wakeup: Unify device_init_wakeup() for PM_SLEEP and !PM_SLEEP * powercap: powercap: RAPL: Add Power Limit4 support for Alder Lake-N and Raptor Lake-P powercap: intel_rapl: Add support for RAPTORLAKE_P * pm-domains: PM: domains: Ensure genpd_debugfs_dir exists before remove * pm-em: cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1 firmware: arm_scmi: Get detailed power scale from perf Documentation: EM: Switch to micro-Watts scale PM: EM: convert power field to micro-Watts precision and align drivers
2022-07-28watch_queue: Fix missing locking in add_watch_to_object()Linus Torvalds1-22/+36
If a watch is being added to a queue, it needs to guard against interference from addition of a new watch, manual removal of a watch and removal of a watch due to some other queue being destroyed. KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by holding the key->sem writelocked and by holding refs on both the key and the queue - but that doesn't prevent interaction from other {key,queue} pairs. While add_watch_to_object() does take the spinlock on the event queue, it doesn't take the lock on the source's watch list. The assumption was that the caller would prevent that (say by taking key->sem) - but that doesn't prevent interference from the destruction of another queue. Fix this by locking the watcher list in add_watch_to_object(). Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> cc: keyrings@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-28watch_queue: Fix missing rcu annotationDavid Howells1-1/+1
Since __post_watch_notification() walks wlist->watchers with only the RCU read lock held, we need to use RCU methods to add to the list (we already use RCU methods to remove from the list). Fix add_watch_to_object() to use hlist_add_head_rcu() instead of hlist_add_head() for that list. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-28Merge tag 'irqchip-5.20' of ↵Thomas Gleixner18-42/+135
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core Pull irqchip/genirq updates from Marc Zyngier: * Core code update: - Non-SMP IRQ affinity fixes, allowing UP kernel to behave similarly to SMP ones for the purpose of interrupt affinity - Let irq_set_chip_handler_name_locked() take a const struct irq_chip * - Tidy-up the NOMAP irqdomain API variant - Teach action_show() to use for_each_action_of_desc() - Make irq_chip_request_resources_parent() allow the parent callback to be optional - Remove dynamic allocations from populate_parent_alloc_arg() * New drivers: - Merge the long awaited IRQ support for the LoongArch architecture, with the provisional ACPICA update (to be reverted once the official support lands) - New Renesas RZ/G2L IRQC driver, equipped with its companion GPIO driver * Driver updates - Optimise the hot path operations for the SiFive PLIC, trading the locking for per-CPU priority masking masking operations which are apparently faster - Work around broken PLIC implementations that deal pretty badly with edge-triggered interrupts. Flag two implementations as affected. - Simplify the irq-stm32-exti driver, particularly the table that remaps the interrupts from exti to the GIC, reducing the memory usage - Convert the ocelot irq_chip to being immutable - Check ioremap() return value in the MIPS GIC driver - Move MMP driver init function declarations into the common .h - The obligatory typo fixes Link: https://lore.kernel.org/all/20220727192356.1860546-1-maz@kernel.org
2022-07-27x86/configs: Update configs in x86_debug.configLukas Bulwahn1-2/+1
Commit 4675ff05de2d ("kmemcheck: rip it out") removed kmemcheck and its corresponding build config KMEMCHECK. Commit 0f620cefd775 ("objtool: Rename "VMLINUX_VALIDATION" -> "NOINSTR_VALIDATION"") renamed the debug config option. Adjust x86_debug.config to those changes in debug configs. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220722121815.27535-1-lukas.bulwahn@gmail.com
2022-07-26PM: QoS: Add check to make sure CPU freq is non-negativeShivnandan Kumar1-2/+2
CPU frequency should never be negative. If some client driver calls freq_qos_update_request with a negative value which will be very high in absolute terms, then frequency QoS sets max CPU freq at fmax as it considers it's absolute value but it will add plist node with negative priority. plist node has priority from INT_MIN (highest) to INT_MAX(lowest). Once priority is set as negative, another client will not be able to reduce CPU frequency. Adding check to make sure CPU freq is non-negative will fix this problem. Signed-off-by: Shivnandan Kumar <quic_kshivnan@quicinc.com> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-07-26PM: hibernate: defer device probing when resuming from hibernationTetsuo Handa1-1/+12
syzbot is reporting hung task at misc_open() [1], for there is a race window of AB-BA deadlock which involves probe_count variable. Currently wait_for_device_probe() from snapshot_open() from misc_open() can sleep forever with misc_mtx held if probe_count cannot become 0. When a device is probed by hub_event() work function, probe_count is incremented before the probe function starts, and probe_count is decremented after the probe function completed. There are three cases that can prevent probe_count from dropping to 0. (a) A device being probed stopped responding (i.e. broken/malicious hardware). (b) A process emulating a USB device using /dev/raw-gadget interface stopped responding for some reason. (c) New device probe requests keeps coming in before existing device probe requests complete. The phenomenon syzbot is reporting is (b). A process which is holding system_transition_mutex and misc_mtx is waiting for probe_count to become 0 inside wait_for_device_probe(), but the probe function which is called from hub_event() work function is waiting for the processes which are blocked at mutex_lock(&misc_mtx) to respond via /dev/raw-gadget interface. This patch mitigates (b) by deferring wait_for_device_probe() from snapshot_open() to snapshot_write() and snapshot_ioctl(). Please note that the possibility of (b) remains as long as any thread which is emulating a USB device via /dev/raw-gadget interface can be blocked by uninterruptible blocking operations (e.g. mutex_lock()). Please also note that (a) and (c) are not addressed. Regarding (c), we should change the code to wait for only one device which contains the image for resuming from hibernation. I don't know how to address (a), for use of timeout for wait_for_device_probe() might result in loss of user data in the image. Maybe we should require the userland to wait for the image device before opening /dev/snapshot interface. Link: https://syzkaller.appspot.com/bug?extid=358c9ab4c93da7b7238c [1] Reported-by: syzbot <syzbot+358c9ab4c93da7b7238c@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot <syzbot+358c9ab4c93da7b7238c@syzkaller.appspotmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-07-25Merge branch irq/misc-5.20 into irq/irqchip-nextMarc Zyngier2-7/+9
* irq/misc-5.20: : . : Misc IRQ changes for 5.20: : : - Let irq_set_chip_handler_name_locked() take a const struct irq_chip * : : - Convert the ocelot irq_chip to being immutable (depends on the above) : : - Tidy-up the NOMAP irqdomain API variant : : - Teach action_show() to use for_each_action_of_desc() : : - Check ioremap() return value in the MIPS GIC driver : : - Move MMP driver init function declarations into the common .h : : - The obligatory typo fixes : . irqchip/mmp: Declare init functions in common header file irqchip/mips-gic: Check the return value of ioremap() in gic_of_init() genirq: Use for_each_action_of_desc in actions_show() irqdomain: Use hwirq_max instead of revmap_size for NOMAP domains irqdomain: Report irq number for NOMAP domains irqchip/gic-v3: Fix comment typo pinctrl: ocelot: Make irq_chip immutable genirq: Allow irq_set_chip_handler_name_locked() to take a const irq_chip Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-07-24Merge tag 'sched_urgent_for_v5.19_rc8' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "A single fix to correct a wrong BUG_ON() condition for deboosted tasks" * tag 'sched_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix BUG_ON condition for deboosted tasks
2022-07-22Merge tag 'rcu-urgent.2022.07.21a' of ↵Linus Torvalds1-24/+74
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "This contains a pair of commits that fix 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU"), which was itself a fix to an SRCU expedited grace-period problem that could prevent kernel live patching (KLP) from completing. That SRCU fix for KLP introduced large (as in minutes) boot-time delays to embedded Linux kernels running on qemu/KVM. These delays were due to the emulation of certain MMIO operations controlling memory layout, which were emulated with one expedited grace period per access. Common configurations required thousands of boot-time MMIO accesses, and thus thousands of boot-time expedited SRCU grace periods. In these configurations, the occasional sleeps that allowed KLP to proceed caused excessive boot delays. These commits preserve enough sleeps to permit KLP to proceed, but few enough that the virtual embedded kernels still boot reasonably quickly. This represents a regression introduced in the v5.19 merge window, and the bug is causing significant inconvenience" * tag 'rcu-urgent.2022.07.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: srcu: Make expedited RCU grace periods block even less frequently srcu: Block less aggressively for expedited grace periods
2022-07-21watch-queue: remove spurious double semicolonLinus Torvalds1-1/+1
Sedat Dilek noticed that I had an extraneous semicolon at the end of a line in the previous patch. It's harmless, but unintentional, and while compilers just treat it as an extra empty statement, for all I know some other tooling might warn about it. So clean it up before other people notice too ;) Fixes: 353f7988dd84 ("watchqueue: make sure to serialize 'wqueue->defunct' properly") Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
2022-07-21sched/core: Fix the bug that task won't enqueue into core tree when update ↵Cruz Zhao1-4/+5
cookie In function sched_core_update_cookie(), a task will enqueue into the core tree only when it enqueued before, that is, if an uncookied task is cookied, it will not enqueue into the core tree until it enqueue again, which will result in unnecessary force idle. Here follows the scenario: CPU x and CPU y are a pair of SMT siblings. 1. Start task a running on CPU x without sleeping, and task b and task c running on CPU y without sleeping. 2. We create a cookie and share it to task a and task b, and then we create another cookie and share it to task c. 3. Simpling core_forceidle_sum of task a and b from /proc/PID/sched And we will find out that core_forceidle_sum of task a takes 30% time of the sampling period, which shouldn't happen as task a and b have the same cookie. Then we migrate task a to CPU x', migrate task b and c to CPU y', where CPU x' and CPU y' are a pair of SMT siblings, and sampling again, we will found out that core_forceidle_sum of task a and b are almost zero. To solve this problem, we enqueue the task into the core tree if it's on rq. Fixes: 6e33cad0af49("sched: Trivial core scheduling cookie management") Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1656403045-100840-2-git-send-email-CruzZhao@linux.alibaba.com
2022-07-21nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()Nicolas Saenz Julienne1-6/+9
dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having called sched_update_tick_dependency() preventing it from re-enabling the tick on systems that no longer have pending SCHED_RT tasks but have multiple runnable SCHED_OTHER tasks: dequeue_task_rt() dequeue_rt_entity() dequeue_rt_stack() dequeue_top_rt_rq() sub_nr_running() // decrements rq->nr_running sched_update_tick_dependency() sched_can_stop_tick() // checks rq->rt.rt_nr_running, ... __dequeue_rt_entity() dec_rt_tasks() // decrements rq->rt.rt_nr_running ... Every other scheduler class performs the operation in the opposite order, and sched_update_tick_dependency() expects the values to be updated as such. So avoid the misbehaviour by inverting the order in which the above operations are performed in the RT scheduler. Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model") Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
2022-07-21sched/deadline: Fix BUG_ON condition for deboosted tasksJuri Lelli1-1/+4
Tasks the are being deboosted from SCHED_DEADLINE might enter enqueue_task_dl() one last time and hit an erroneous BUG_ON condition: since they are not boosted anymore, the if (is_dl_boosted()) branch is not taken, but the else if (!dl_prio) is and inside this one we BUG_ON(!is_dl_boosted), which is of course false (BUG_ON triggered) otherwise we had entered the if branch above. Long story short, the current condition doesn't make sense and always leads to triggering of a BUG. Fix this by only checking enqueue flags, properly: ENQUEUE_REPLENISH has to be present, but additional flags are not a problem. Fixes: 64be6f1f5f71 ("sched/deadline: Don't replenish from a !SCHED_DEADLINE entity") Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220714151908.533052-1-juri.lelli@redhat.com
2022-07-20watchqueue: make sure to serialize 'wqueue->defunct' properlyLinus Torvalds1-16/+37
When the pipe is closed, we mark the associated watchqueue defunct by calling watch_queue_clear(). However, while that is protected by the watchqueue lock, new watchqueue entries aren't actually added under that lock at all: they use the pipe->rd_wait.lock instead, and looking up that pipe happens without any locking. The watchqueue code uses the RCU read-side section to make sure that the wqueue entry itself hasn't disappeared, but that does not protect the pipe_info in any way. So make sure to actually hold the wqueue lock when posting watch events, properly serializing against the pipe being torn down. Reported-by: Noam Rathaus <noamr@ssd-disclosure.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-20Merge branch irq/loongarch into irq/irqchip-nextMarc Zyngier1-1/+1
* irq/loongarch: : . : Merge the long awaited IRQ support for the LoongArch architecture. : : From the cover letter: : : "Currently, LoongArch based processors (e.g. Loongson-3A5000) : can only work together with LS7A chipsets. The irq chips in : LoongArch computers include CPUINTC (CPU Core Interrupt : Controller), LIOINTC (Legacy I/O Interrupt Controller), : EIOINTC (Extended I/O Interrupt Controller), PCH-PIC (Main : Interrupt Controller in LS7A chipset), PCH-LPC (LPC Interrupt : Controller in LS7A chipset) and PCH-MSI (MSI Interrupt Controller)." : : Note that this comes with non-official, arch private ACPICA : definitions until the official ACPICA update is realeased. : . irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch irqchip: Add LoongArch CPU interrupt controller support irqchip: Add Loongson Extended I/O interrupt controller support irqchip/loongson-liointc: Add ACPI init support irqchip/loongson-pch-msi: Add ACPI init support irqchip/loongson-pch-pic: Add ACPI init support irqchip: Add Loongson PCH LPC controller support LoongArch: Prepare to support multiple pch-pic and pch-msi irqdomain LoongArch: Use ACPI_GENERIC_GSI for gsi handling genirq/generic_chip: Export irq_unmap_generic_chip ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallback APCI: irq: Add support for multiple GSI domains LoongArch: Provisionally add ACPICA data structures Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-07-20genirq: Use for_each_action_of_desc in actions_show()Paran Lee1-1/+1
Refactor action_show() to use for_each_action_of_desc instead of a similar open-coded loop. Signed-off-by: Paran Lee <p4ranlee@gmail.com> [maz: reword commit message] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220710112614.19410-1-p4ranlee@gmail.com
2022-07-20genirq/generic_chip: Export irq_unmap_generic_chipJianmin Lv1-1/+1
Some irq controllers have to re-implement a private version for irq_generic_chip_ops, because they have a different xlate to translate hwirq. Export irq_unmap_generic_chip to allow reusing in drivers. Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1658314292-35346-5-git-send-email-lvjianmin@loongson.cn
2022-07-19srcu: Make expedited RCU grace periods block even less frequentlyNeeraj Upadhyay1-19/+63
The purpose of commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") was to prevent a long series of never-blocking expedited SRCU grace periods from blocking kernel-live-patching (KLP) progress. Although it was successful, it also resulted in excessive boot times on certain embedded workloads running under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive" means increasing the boot time up into the three-to-four minute range. This increase in boot time was due to the more than 6000 back-to-back invocations of synchronize_rcu_expedited() within the KVM host OS, which in turn resulted from qemu's emulation of a long series of MMIO accesses. Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") did not significantly help this particular use case. Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values of non-sleeping per phase counts on a system with preemption enabled, and observed the following boot times: +──────────────────────────+────────────────+ | SRCU_MAX_NODELAY_PHASE | Boot time (s) | +──────────────────────────+────────────────+ | 100 | 30.053 | | 150 | 25.151 | | 200 | 20.704 | | 250 | 15.748 | | 500 | 11.401 | | 1000 | 11.443 | | 10000 | 11.258 | | 1000000 | 11.154 | +──────────────────────────+────────────────+ Analysis on the experiment results show additional improvements with CPU-bound delays approaching one jiffy in duration. This improvement was also seen when number of per-phase iterations were scaled to one jiffy. This commit therefore scales per-grace-period phase number of non-sleeping polls so that non-sleeping polls extend for about one jiffy. In addition, the delay-calculation call to srcu_get_delay() in srcu_gp_end() is replaced with a simple check for an expedited grace period. This change schedules callback invocation immediately after expedited grace periods complete, which results in greatly improved boot times. Testing done by Marc and Zhangfei confirms that this change recovers most of the performance degradation in boottime; for CONFIG_HZ_250 configuration, specifically, boot times improve from 3m50s to 41s on Marc's setup; and from 2m40s to ~9.7s on Zhangfei's setup. In addition to the changes to default per phase delays, this change adds 3 new kernel parameters - srcutree.srcu_max_nodelay, srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay. This allows users to configure the srcu grace period scanning delays in order to more quickly react to additional use cases. Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reported-by: yueluck <yueluck@163.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Tested-by: Marc Zyngier <maz@kernel.org> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19srcu: Block less aggressively for expedited grace periodsPaul E. McKenney1-7/+13
Commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") fixed a problem where a long-running expedited SRCU grace period could block kernel live patching. It did so by giving up on expediting once a given SRCU expedited grace period grew too old. Unfortunately, this added excessive delays to boots of virtual embedded systems specifying "-bios QEMU_EFI.fd" to qemu. This commit therefore makes the transition away from expediting less aggressive, increasing the per-grace-period phase number of non-sleeping polls of readers from one to three and increasing the required grace-period age from one jiffy (actually from zero to one jiffies) to two jiffies (actually from one to two jiffies). Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reported-by: chenxiang (M)" <chenxiang66@hisilicon.com> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
2022-07-19irqdomain: Use hwirq_max instead of revmap_size for NOMAP domainsXu Qiang1-6/+6
NOMAP irq domains use the revmap_size field to indicate the maximum hwirq number the domain accepts. This is a bit confusing as revmap_size is usually used to indicate the size of the revmap array, which a NOMAP domain doesn't have. Instead, use the hwirq_max field which has the correct semantics, and keep revmap_size to 0 for a NOMAP domain. Signed-off-by: Xu Qiang <xuqiang36@huawei.com> [maz: commit message] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220719063641.56541-3-xuqiang36@huawei.com
2022-07-19irqdomain: Report irq number for NOMAP domainsXu Qiang1-0/+2
When using a NOMAP domain, __irq_resolve_mapping() doesn't store the Linux IRQ number at the address optionally provided by the caller. While this isn't a huge deal (the returned value is guaranteed to the hwirq that was passed as a parameter), let's honour the letter of the API by writing the expected value. Fixes: d22558dd0a6c (“irqdomain: Introduce irq_resolve_mapping()”) Signed-off-by: Xu Qiang <xuqiang36@huawei.com> [maz: commit message] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220719063641.56541-2-xuqiang36@huawei.com
2022-07-17Merge tag 'perf_urgent_for_v5.19_rc7' of ↵Linus Torvalds1-14/+31
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - A single data race fix on the perf event cleanup path to avoid endless loops due to insufficient locking * tag 'perf_urgent_for_v5.19_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix data race between perf_event_set_output() and perf_mmap_close()
2022-07-16Merge tag 'printk-for-5.19-rc7' of ↵Linus Torvalds1-2/+11
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Make pr_flush() fast when consoles are suspended. * tag 'printk-for-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: do not wait for consoles when suspended
2022-07-15PM: EM: convert power field to micro-Watts precision and align driversLukasz Luba1-8/+16
The milli-Watts precision causes rounding errors while calculating efficiency cost for each OPP. This is especially visible in the 'simple' Energy Model (EM), where the power for each OPP is provided from OPP framework. This can cause some OPPs to be marked inefficient, while using micro-Watts precision that might not happen. Update all EM users which access 'power' field and assume the value is in milli-Watts. Solve also an issue with potential overflow in calculation of energy estimation on 32bit machine. It's needed now since the power value (thus the 'cost' as well) are higher. Example calculation which shows the rounding error and impact: power = 'dyn-power-coeff' * volt_mV * volt_mV * freq_MHz power_a_uW = (100 * 600mW * 600mW * 500MHz) / 10^6 = 18000 power_a_mW = (100 * 600mW * 600mW * 500MHz) / 10^9 = 18 power_b_uW = (100 * 605mW * 605mW * 600MHz) / 10^6 = 21961 power_b_mW = (100 * 605mW * 605mW * 600MHz) / 10^9 = 21 max_freq = 2000MHz cost_a_mW = 18 * 2000MHz/500MHz = 72 cost_a_uW = 18000 * 2000MHz/500MHz = 72000 cost_b_mW = 21 * 2000MHz/600MHz = 70 // <- artificially better cost_b_uW = 21961 * 2000MHz/600MHz = 73203 The 'cost_b_mW' (which is based on old milli-Watts) is misleadingly better that the 'cost_b_uW' (this patch uses micro-Watts) and such would have impact on the 'inefficient OPPs' information in the Cpufreq framework. This patch set removes the rounding issue. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-07-15Merge tag 'sysctl-fixes-5.19-rc7' of ↵Linus Torvalds1-9/+11
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pyll sysctl fix from Luis Chamberlain: "Only one fix for sysctl" * tag 'sysctl-fixes-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: mm: sysctl: fix missing numa_stat when !CONFIG_HUGETLB_PAGE
2022-07-15Merge branch 'rework/kthreads' into for-linusPetr Mladek1-2/+11
2022-07-15printk: do not wait for consoles when suspendedJohn Ogness1-2/+11
The console_stop() and console_start() functions call pr_flush(). When suspending, these functions are called by the serial subsystem while the serial port is suspended. In this scenario, if there are any pending messages, a call to pr_flush() will always result in a timeout because the serial port cannot make forward progress. This causes longer suspend and resume times. Add a check in pr_flush() so that it will immediately timeout if the consoles are suspended. Fixes: 3b604ca81202 ("printk: add pr_flush()") Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220715061042.373640-2-john.ogness@linutronix.de
2022-07-14mm: sysctl: fix missing numa_stat when !CONFIG_HUGETLB_PAGEMuchun Song1-9/+11
"numa_stat" should not be included in the scope of CONFIG_HUGETLB_PAGE, if CONFIG_HUGETLB_PAGE is not configured even if CONFIG_NUMA is configured, "numa_stat" is missed form /proc. Move it out of CONFIG_HUGETLB_PAGE to fix it. Fixes: 4518085e127d ("mm, sysctl: make NUMA stats configurable") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-07-14Merge tag 'net-5.19-rc7' of ↵Linus Torvalds3-24/+33
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, bpf and wireless. Still no major regressions, the release continues to be calm. An uptick of fixes this time around due to trivial data race fixes and patches flowing down from subtrees. There has been a few driver fixes (particularly a few fixes for false positives due to 66e4c8d95008 which went into -next in May!) that make me worry the wide testing is not exactly fully through. So "calm" but not "let's just cut the final ASAP" vibes over here. Current release - regressions: - wifi: rtw88: fix write to const table of channel parameters Current release - new code bugs: - mac80211: add gfp_t arg to ieeee80211_obss_color_collision_notify - mlx5: - TC, allow offload from uplink to other PF's VF - Lag, decouple FDB selection and shared FDB - Lag, correct get the port select mode str - bnxt_en: fix and simplify XDP transmit path - r8152: fix accessing unset transport header Previous releases - regressions: - conntrack: fix crash due to confirmed bit load reordering (after atomic -> refcount conversion) - stmmac: dwc-qos: disable split header for Tegra194 Previous releases - always broken: - mlx5e: ring the TX doorbell on DMA errors - bpf: make sure mac_header was set before using it - mac80211: do not wake queues on a vif that is being stopped - mac80211: fix queue selection for mesh/OCB interfaces - ip: fix dflt addr selection for connected nexthop - seg6: fix skb checksums for SRH encapsulation/insertion - xdp: fix spurious packet loss in generic XDP TX path - bunch of sysctl data race fixes - nf_log: incorrect offset to network header Misc: - bpf: add flags arg to bpf_dynptr_read and bpf_dynptr_write APIs" * tag 'net-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits) nfp: flower: configure tunnel neighbour on cmsg rx net/tls: Check for errors in tls_device_init MAINTAINERS: Add an additional maintainer to the AMD XGBE driver xen/netback: avoid entering xenvif_rx_next_skb() with an empty rx queue selftests/net: test nexthop without gw ip: fix dflt addr selection for connected nexthop net: atlantic: remove aq_nic_deinit() when resume net: atlantic: remove deep parameter on suspend/resume functions sfc: fix kernel panic when creating VF seg6: bpf: fix skb checksum in bpf_push_seg6_encap() seg6: fix skb checksum in SRv6 End.B6 and End.B6.Encaps behaviors seg6: fix skb checksum evaluation in SRH encapsulation/insertion sfc: fix use after free when disabling sriov net: sunhme: output link status with a single print. r8152: fix accessing unset transport header net: stmmac: fix leaks in probe net: ftgmac100: Hold reference returned by of_get_child_by_name() nexthop: Fix data-races around nexthop_compat_mode. ipv4: Fix data-races around sysctl_ip_dynaddr. tcp: Fix a data-race around sysctl_tcp_ecn_fallback. ...
2022-07-14Merge tag 'integrity-v5.19-fix' of ↵Linus Torvalds1-1/+10
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull integrity fixes from Mimi Zohar: "Here are a number of fixes for recently found bugs. Only 'ima: fix violation measurement list record' was introduced in the current release. The rest address existing bugs" * tag 'integrity-v5.19-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: ima: Fix potential memory leak in ima_init_crypto() ima: force signature verification when CONFIG_KEXEC_SIG is configured ima: Fix a potential integer overflow in ima_appraise_measurement ima: fix violation measurement list record Revert "evm: Fix memleak in init_desc"
2022-07-13Merge tag 'cgroup-for-5.19-rc6-fixes' of ↵Linus Torvalds1-14/+23
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "Fix an old and subtle bug in the migration path. css_sets are used to track tasks and migrations are tasks moving from a group of css_sets to another group of css_sets. The migration path pins all source and destination css_sets in the prep stage. Unfortunately, it was overloading the same list_head entry to track sources and destinations, which got confused for migrations which are partially identity leading to use-after-frees. Fixed by using dedicated list_heads for tracking sources and destinations" * tag 'cgroup-for-5.19-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Use separate src/dst nodes when preloading css_sets for migration
2022-07-13ima: force signature verification when CONFIG_KEXEC_SIG is configuredCoiby Xu1-1/+10
Currently, an unsigned kernel could be kexec'ed when IMA arch specific policy is configured unless lockdown is enabled. Enforce kernel signature verification check in the kexec_file_load syscall when IMA arch specific policy is configured. Fixes: 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE") Reported-and-suggested-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Coiby Xu <coxu@redhat.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2022-07-13sysctl: Fix data-races in proc_dointvec_ms_jiffies().Kuniyuki Iwashima1-4/+4
A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. This patch changes proc_dointvec_ms_jiffies() to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. For now, proc_dointvec_ms_jiffies() itself is tolerant to a data-race, but we still need to add annotations on the other subsystem's side. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-13sysctl: Fix data-races in proc_dou8vec_minmax().Kuniyuki Iwashima1-2/+2
A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. This patch changes proc_dou8vec_minmax() to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. For now, proc_dou8vec_minmax() itself is tolerant to a data-race, but we still need to add annotations on the other subsystem's side. Fixes: cb9444130662 ("sysctl: add proc_dou8vec_minmax()") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-13sched/core: Always flush pending blk_plugJohn Keeping1-2/+6
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two normal priority tasks (SCHED_OTHER, nice level zero): INFO: task kworker/u8:0:8 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000 Workqueue: writeback wb_workfn (flush-7:0) [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174) [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>] +(rt_mutex_slowlock.constprop.0+0xac/0x174) [<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54) [<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec) [<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c) [<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8) [<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4) [<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4) [<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c) [<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8) [<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178) [<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38) Exception stack(0xc10e3fb0 to 0xc10e3ff8) 3fa0: 00000000 00000000 00000000 00000000 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 INFO: task tar:2083 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000 [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24) [<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30) [<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8) [<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0) [<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144) [<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4) [<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250) [<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148) [<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98) [<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec) [<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c) Exception stack(0xc2e1bfa8 to 0xc2e1bff0) bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000 bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190 bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc Here the kworker is waiting on msdos_sb_info::s_lock which is held by tar which is in turn waiting for a buffer which is locked waiting to be flushed, but this operation is plugged in the kworker. The lock is a normal struct mutex, so tsk_is_pi_blocked() will always return false on !RT and thus the behaviour changes for RT. It seems that the intent here is to skip blk_flush_plug() in the case where a non-preemptible lock (such as a spinlock) has been converted to a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT schedule flag. But sched_submit_work() is only called from schedule() which is never called in this scenario, so the check can simply be deleted. Looking at the history of the -rt patchset, in fact this change was present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b ("sched/core: Rework the __schedule() preempt argument"). As described in [1]: The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock and rwlock): - rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This can not deadlock because this also happens for non-RT. There should be a warning if the scheduling point is within a RCU read section. - spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. Similarly to being preempted, there should be no warning if the scheduling point is within a RCU read section. and with the tsk_is_pi_blocked() in the scheduler path, we hit the first issue. [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
2022-07-13sched/fair: fix case with reduced capacity CPUVincent Guittot1-12/+42
The capacity of the CPU available for CFS tasks can be reduced because of other activities running on the latter. In such case, it's worth trying to move CFS tasks on a CPU with more available capacity. The rework of the load balance has filtered the case when the CPU is classified to be fully busy but its capacity is reduced. Check if CPU's capacity is reduced while gathering load balance statistic and classify it group_misfit_task instead of group_fully_busy so we can try to move the load on another CPU. Reported-by: David Chen <david.chen@nutanix.com> Reported-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: David Chen <david.chen@nutanix.com> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Link: https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@linaro.org