summaryrefslogtreecommitdiffstats
path: root/kernel/locking
AgeCommit message (Collapse)AuthorFilesLines
2018-10-31mm: remove include/linux/bootmem.hMike Rapoport1-1/+1
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-23Merge branch 'locking-core-for-linus' of ↵Linus Torvalds10-167/+252
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and misc x86 updates from Ingo Molnar: "Lots of changes in this cycle - in part because locking/core attracted a number of related x86 low level work which was easier to handle in a single tree: - Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E. McKenney, Andrea Parri) - lockdep scalability improvements and micro-optimizations (Waiman Long) - rwsem improvements (Waiman Long) - spinlock micro-optimization (Matthew Wilcox) - qspinlocks: Provide a liveness guarantee (more fairness) on x86. (Peter Zijlstra) - Add support for relative references in jump tables on arm64, x86 and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens) - Be a lot less permissive on weird (kernel address) uaccess faults on x86: BUG() when uaccess helpers fault on kernel addresses (Jann Horn) - macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav Amit) - ... and a handful of other smaller changes as well" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) locking/lockdep: Make global debug_locks* variables read-mostly locking/lockdep: Fix debug_locks off performance problem locking/pvqspinlock: Extend node size when pvqspinlock is configured locking/qspinlock_stat: Count instances of nested lock slowpaths locking/qspinlock, x86: Provide liveness guarantee x86/asm: 'Simplify' GEN_*_RMWcc() macros locking/qspinlock: Rework some comments locking/qspinlock: Re-order code locking/lockdep: Remove duplicated 'lock_class_ops' percpu array x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y futex: Replace spin_is_locked() with lockdep locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs x86/extable: Macrofy inline assembly code to work around GCC inlining bugs x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs x86/refcount: Work around GCC inlining bug x86/objtool: Use asm macros to work around GCC inlining bugs ...
2018-10-19locking/lockdep: Fix debug_locks off performance problemWaiman Long1-2/+2
It was found that when debug_locks was turned off because of a problem found by the lockdep code, the system performance could drop quite significantly when the lock_stat code was also configured into the kernel. For instance, parallel kernel build time on a 4-socket x86-64 server nearly doubled. Further analysis into the cause of the slowdown traced back to the frequent call to debug_locks_off() from the __lock_acquired() function probably due to some inconsistent lockdep states with debug_locks off. The debug_locks_off() function did an unconditional atomic xchg to write a 0 value into debug_locks which had already been set to 0. This led to severe cacheline contention in the cacheline that held debug_locks. As debug_locks is being referenced in quite a few different places in the kernel, this greatly slow down the system performance. To prevent that trashing of debug_locks cacheline, lock_acquired() and lock_contended() now checks the state of debug_locks before proceeding. The debug_locks_off() function is also modified to check debug_locks before calling __debug_locks_off(). Signed-off-by: Waiman Long <longman@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1539913518-15598-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-17locking/pvqspinlock: Extend node size when pvqspinlock is configuredWaiman Long2-11/+27
The qspinlock code supports up to 4 levels of slowpath nesting using four per-CPU mcs_spinlock structures. For 64-bit architectures, they fit nicely in one 64-byte cacheline. For para-virtualized (PV) qspinlocks it needs to store more information in the per-CPU node structure than there is space for. It uses a trick to use a second cacheline to hold the extra information that it needs. So PV qspinlock needs to access two extra cachelines for its information whereas the native qspinlock code only needs one extra cacheline. Freshly added counter profiling of the qspinlock code, however, revealed that it was very rare to use more than two levels of slowpath nesting. So it doesn't make sense to penalize PV qspinlock code in order to have four mcs_spinlock structures in the same cacheline to optimize for a case in the native qspinlock code that rarely happens. Extend the per-CPU node structure to have two more long words when PV qspinlock locks are configured to hold the extra data that it needs. As a result, the PV qspinlock code will enjoy the same benefit of using just one extra cacheline like the native counterpart, for most cases. [ mingo: Minor changelog edits. ] Signed-off-by: Waiman Long <longman@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1539697507-28084-2-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-17locking/qspinlock_stat: Count instances of nested lock slowpathsWaiman Long2-0/+11
Queued spinlock supports up to 4 levels of lock slowpath nesting - user context, soft IRQ, hard IRQ and NMI. However, we are not sure how often the nesting happens. So add 3 more per-CPU stat counters to track the number of instances where nesting index goes to 1, 2 and 3 respectively. On a dual-socket 64-core 128-thread Zen server, the following were the new stat counter values under different circumstances: State slowpath index1 index2 index3 ----- -------- ------ ------ ------- After bootup 1,012,150 82 0 0 After parallel build + perf-top 125,195,009 82 0 0 So the chance of having more than 2 levels of nesting is extremely low. [ mingo: Minor changelog edits. ] Signed-off-by: Waiman Long <longman@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1539697507-28084-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-16locking/qspinlock, x86: Provide liveness guaranteePeter Zijlstra1-1/+15
On x86 we cannot do fetch_or() with a single instruction and thus end up using a cmpxchg loop, this reduces determinism. Replace the fetch_or() with a composite operation: tas-pending + load. Using two instructions of course opens a window we previously did not have. Consider the scenario: CPU0 CPU1 CPU2 1) lock trylock -> (0,0,1) 2) lock trylock /* fail */ 3) unlock -> (0,0,0) 4) lock trylock -> (0,0,1) 5) tas-pending -> (0,1,1) load-val <- (0,1,0) from 3 6) clear-pending-set-locked -> (0,0,1) FAIL: _2_ owners where 5) is our new composite operation. When we consider each part of the qspinlock state as a separate variable (as we can when _Q_PENDING_BITS == 8) then the above is entirely possible, because tas-pending will only RmW the pending byte, so the later load is able to observe prior tail and lock state (but not earlier than its own trylock, which operates on the whole word, due to coherence). To avoid this we need 2 things: - the load must come after the tas-pending (obviously, otherwise it can trivially observe prior state). - the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for example, such that we cannot observe other state prior to setting pending. On x86 we can realize this by using "LOCK BTS m32, r32" for tas-pending followed by a regular load. Note that observing later state is not a problem: - if we fail to observe a later unlock, we'll simply spin-wait for that store to become visible. - if we observe a later xchg_tail(), there is no difference from that xchg_tail() having taken place before the tas-pending. Suggested-by: Will Deacon <will.deacon@arm.com> Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: andrea.parri@amarulasolutions.com Cc: longman@redhat.com Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath") Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-16locking/qspinlock: Rework some commentsPeter Zijlstra1-10/+26
While working my way through the code again; I felt the comments could use help. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: andrea.parri@amarulasolutions.com Cc: longman@redhat.com Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-16locking/qspinlock: Re-order codePeter Zijlstra1-29/+27
Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL) such that we loose the indent. This also result in a more natural code flow IMO. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: andrea.parri@amarulasolutions.com Cc: longman@redhat.com Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-16Merge branch 'x86/build' into locking/core, to pick up dependent patches and ↵Ingo Molnar2-2/+1
unify jump-label work Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-16locking/lockdep: Remove duplicated 'lock_class_ops' percpu arrayWaiman Long1-1/+0
Remove the duplicated 'lock_class_ops' percpu array that is not used anywhere. Signed-off-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: 8ca2b56cd7da ("locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y") Link: http://lkml.kernel.org/r/1539380547-16726-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-09locking/lockdep: Make class->ops a percpu counter and move it under ↵Waiman Long3-4/+36
CONFIG_DEBUG_LOCKDEP=y A sizable portion of the CPU cycles spent on the __lock_acquire() is used up by the atomic increment of the class->ops stat counter. By taking it out from the lock_class structure and changing it to a per-cpu per-lock-class counter, we can reduce the amount of cacheline contention on the class structure when multiple CPUs are trying to acquire locks of the same class simultaneously. To limit the increase in memory consumption because of the percpu nature of that counter, it is now put back under the CONFIG_DEBUG_LOCKDEP config option. So the memory consumption increase will only occur if CONFIG_DEBUG_LOCKDEP is defined. The lock_class structure, however, is reduced in size by 16 bytes on 64-bit archs after ops removal and a minor restructuring of the fields. This patch also fixes a bug in the increment code as the counter is of the 'unsigned long' type, but atomic_inc() was used to increment it. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/d66681f3-8781-9793-1dcf-2436a284550b@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-03locking/ww_mutex: Fix runtime warning in the WW mutex selftestGuenter Roeck1-4/+6
If CONFIG_WW_MUTEX_SELFTEST=y is enabled, booting an image in an arm64 virtual machine results in the following traceback if 8 CPUs are enabled: DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current) WARNING: CPU: 2 PID: 537 at kernel/locking/mutex.c:1033 __mutex_unlock_slowpath+0x1a8/0x2e0 ... Call trace: __mutex_unlock_slowpath() ww_mutex_unlock() test_cycle_work() process_one_work() worker_thread() kthread() ret_from_fork() If requesting b_mutex fails with -EDEADLK, the error variable is reassigned to the return value from calling ww_mutex_lock on a_mutex again. If this call fails, a_mutex is not locked. It is, however, unconditionally unlocked subsequently, causing the reported warning. Fix the problem by using two error variables. With this change, the selftest still fails as follows: cyclic deadlock not resolved, ret[7/8] = -35 However, the traceback is gone. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: d1b42b800e5d0 ("locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks") Link: http://lkml.kernel.org/r/1538516929-9734-1-git-send-email-linux@roeck-us.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-03locking/lockdep: Add a faster path in __lock_release()Waiman Long1-3/+14
When __lock_release() is called, the most likely unlock scenario is on the innermost lock in the chain. In this case, we can skip some of the checks and provide a faster path to completion. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1538511560-10090-4-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-03locking/lockdep: Eliminate redundant IRQs check in __lock_acquire()Waiman Long1-8/+7
The static __lock_acquire() function has only two callers: 1) lock_acquire() 2) reacquire_held_locks() In lock_acquire(), raw_local_irq_save() is called beforehand. So IRQs must have been disabled. So the check: DEBUG_LOCKS_WARN_ON(!irqs_disabled()) is kind of redundant in this case. So move the above check to reacquire_held_locks() to eliminate redundant code in the lock_acquire() path. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1538511560-10090-3-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-03locking/lockdep: Remove add_chain_cache_classes()Waiman Long1-70/+0
The inline function add_chain_cache_classes() is defined, but has no caller. Just remove it. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1538511560-10090-2-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-11locking/rtmutex: Fix the preprocessor logic with normal #ifdef #else #endifSteven Rostedt (VMware)1-2/+2
Merging v4.14.68 into v4.14-rt I tripped over a conflict in the rtmutex.c code. There I found that we had: #ifdef CONFIG_DEBUG_LOCK_ALLOC [..] #endif #ifndef CONFIG_DEBUG_LOCK_ALLOC [..] #endif Really this should be: #ifdef CONFIG_DEBUG_LOCK_ALLOC [..] #else [..] #endif This cleans up that logic. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Rosin <peda@axentia.se> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180910214638.55926030@vmware.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic"Colin Ian King1-1/+1
Trivial fix to spelling mistake in pr_err() error message Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20180824112235.8842-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/lockdep: Delete unnecessary #includeBen Hutchings1-1/+0
Commit: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") added the inclusion of <trace/events/preemptirq.h>. liblockdep doesn't have a stub version of that header so now fails to build. However, commit: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"") removed the use of functions declared in that header. So delete the #include. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <alexander.levin@verizon.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize ...") Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints ...") Link: http://lkml.kernel.org/r/20180828203315.GD18030@decadent.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/rwsem: Make owner store task pointer of last owning readerWaiman Long3-28/+76
Currently, when a reader acquires a lock, it only sets the RWSEM_READER_OWNED bit in the owner field. The other bits are simply not used. When debugging hanging cases involving rwsems and readers, the owner value does not provide much useful information at all. This patch modifies the current behavior to always store the task_struct pointer of the last rwsem-acquiring reader in a reader-owned rwsem. This may be useful in debugging rwsem hanging cases especially if only one reader is involved. However, the task in the owner field may not the real owner or one of the real owners at all when the owner value is examined, for example, in a crash dump. So it is just an additional hint about the past history. If CONFIG_DEBUG_RWSEMS=y is enabled, the owner field will be checked at unlock time too to make sure the task pointer value is valid. That does have a slight performance cost and so is only enabled as part of that debug option. From the performance point of view, it is expected that the changes shouldn't have any noticeable performance impact. A rwsem microbenchmark (with 48 worker threads and 1:1 reader/writer ratio) was ran on a 2-socket 24-core 48-thread Haswell system. The locking rates on a 4.19-rc1 based kernel were as follows: 1) Unpatched kernel: 543.3 kops/s 2) Patched kernel: 549.2 kops/s 3) Patched kernel (CONFIG_DEBUG_RWSEMS on): 546.6 kops/s There was actually a slight increase in performance (1.1%) in this particular case. Maybe it was caused by the elimination of a branch or just a testing noise. Turning on the CONFIG_DEBUG_RWSEMS option also had less than the expected impact on performance. The least significant 2 bits of the owner value are now used to designate the rwsem is readers owned and the owners are anonymous. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1536265114-10842-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/rwsem: Exit read lock slowpath if queue empty & no writerWaiman Long1-1/+12
It was discovered that a constant stream of readers with occassional writers pounding on a rwsem may cause many of the readers to enter the slowpath unnecessarily thus increasing latency and lowering performance. In the current code, a reader entering the slowpath critical section will unconditionally set the WAITING_BIAS, if not set yet, and clear its active count even if no one is in the wait queue and no writer is present. This causes some incoming readers to observe the presence of waiters in the wait queue and hence have to go into the slowpath themselves. With sufficient numbers of readers and a relatively short lock hold time, the WAITING_BIAS may be repeatedly turned on and off and a substantial portion of the readers will go into the slowpath sustaining a rather long queue in the wait queue spinlock and repeated WAITING_BIAS on/off cycle until the logjam is broken opportunistically. To avoid this situation from happening, an additional check is added to detect the special case that the reader in the critical section is the only one in the wait queue and no writer is present. When that happens, it can just exit the slowpath and return immediately as its active count has already been set in the lock. Other incoming readers won't observe the presence of waiters and so will not be forced into the slowpath. The issue was found in a customer site where they had an application that pounded on the pread64 syscalls heavily on an XFS filesystem. The application was run in a recent 4-socket boxes with a lot of CPUs. They saw significant spinlock contention in the rwsem_down_read_failed() call. With this patch applied, the system CPU usage went down from 85% to 57%, and the spinlock contention in the pread64 syscalls was gone. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Joe Mario <jmario@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1532459425-19204-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/mutex: Fix mutex debug call and ww_mutex documentationThomas Hellstrom1-2/+1
The following commit: 08295b3b5bee ("Implement an algorithm choice for Wound-Wait mutexes") introduced a reference in the documentation to a function that was removed in an earlier commit. It also forgot to remove a call to debug_mutex_add_waiter() which is now unconditionally called by __mutex_add_waiter(). Fix those bugs. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dri-devel@lists.freedesktop.org Fixes: 08295b3b5bee ("Implement an algorithm choice for Wound-Wait mutexes") Link: http://lkml.kernel.org/r/20180903140708.2401-1-thellstrom@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-08-20Merge tag 'trace-v4.19' of ↵Linus Torvalds1-30/+5
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - Restructure of lockdep and latency tracers This is the biggest change. Joel Fernandes restructured the hooks from irqs and preemption disabling and enabling. He got rid of a lot of the preprocessor #ifdef mess that they caused. He turned both lockdep and the latency tracers to use trace events inserted in the preempt/irqs disabling paths. But unfortunately, these started to cause issues in corner cases. Thus, parts of the code was reverted back to where lockdep and the latency tracers just get called directly (without using the trace events). But because the original change cleaned up the code very nicely we kept that, as well as the trace events for preempt and irqs disabling, but they are limited to not being called in NMIs. - Have trace events use SRCU for "rcu idle" calls. This was required for the preempt/irqs off trace events. But it also had to not allow them to be called in NMI context. Waiting till Paul makes an NMI safe SRCU API. - New notrace SRCU API to allow trace events to use SRCU. - Addition of mcount-nop option support - SPDX headers replacing GPL templates. - Various other fixes and clean ups. - Some fixes are marked for stable, but were not fully tested before the merge window opened. * tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits) tracing: Fix SPDX format headers to use C++ style comments tracing: Add SPDX License format tags to tracing files tracing: Add SPDX License format to bpf_trace.c blktrace: Add SPDX License format header s390/ftrace: Add -mfentry and -mnop-mcount support tracing: Add -mcount-nop option support tracing: Avoid calling cc-option -mrecord-mcount for every Makefile tracing: Handle CC_FLAGS_FTRACE more accurately Uprobe: Additional argument arch_uprobe to uprobe_write_opcode() Uprobes: Simplify uprobe_register() body tracepoints: Free early tracepoints after RCU is initialized uprobes: Use synchronize_rcu() not synchronize_sched() tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister() ftrace: Remove unused pointer ftrace_swapper_pid tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage" tracing/irqsoff: Handle preempt_count for different configs tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage" tracing: irqsoff: Account for additional preempt_disable trace: Use rcu_dereference_raw for hooks from trace-event subsystem tracing/kprobes: Fix within_notrace_func() to check only notrace functions ...
2018-08-15Merge tag 'drm-next-2018-08-15' of git://anongit.freedesktop.org/drm/drmLinus Torvalds3-85/+264
Pull drm updates from Dave Airlie: "This is the main drm pull request for 4.19. Rob has some new hardware support for new qualcomm hw that I'll send along separately. This has the display part of it, the remaining pull is for the acceleration engine. This also contains a wound-wait/wait-die mutex rework, Peter has acked it for merging via my tree. Otherwise mostly the usual level of activity. Summary: core: - Wound-wait/wait-die mutex rework - Add writeback connector type - Add "content type" property for HDMI - Move GEM bo to drm_framebuffer - Initial gpu scheduler documentation - GPU scheduler fixes for dying processes - Console deferred fbcon takeover support - Displayport support for CEC tunneling over AUX panel: - otm8009a panel driver fixes - Innolux TV123WAM and G070Y2-L01 panel driver - Ilitek ILI9881c panel driver - Rocktech RK070ER9427 LCD - EDT ETM0700G0EDH6 and EDT ETM0700G0BDH6 - DLC DLC0700YZG-1 - BOE HV070WSA-100 - newhaven, nhd-4.3-480272ef-atxl LCD - DataImage SCF0700C48GGU18 - Sharp LQ035Q7DB03 - p079zca: Refactor to support multiple panels tinydrm: - ILI9341 display panel New driver: - vkms - virtual kms driver to testing. i915: - Icelake: Display enablement DSI support IRQ support Powerwell support - GPU reset fixes and improvements - Full ppgtt support refactoring - PSR fixes and improvements - Execlist improvments - GuC related fixes amdgpu: - Initial amdgpu documentation - JPEG engine support on VCN - CIK uses powerplay by default - Move to using core PCIE functionality for gens/lanes - DC/Powerplay interface rework - Stutter mode support for RV - Vega12 Powerplay updates - GFXOFF fixes - GPUVM fault debugging - Vega12 GFXOFF - DC improvements - DC i2c/aux changes - UVD 7.2 fixes - Powerplay fixes for Polaris12, CZ/ST - command submission bo_list fixes amdkfd: - Raven support - Power management fixes udl: - Cleanups and fixes nouveau: - misc fixes and cleanups. msm: - DPU1 support display controller in sdm845 - GPU coredump support. vmwgfx: - Atomic modesetting validation fixes - Support for multisample surfaces armada: - Atomic modesetting support completed. exynos: - IPPv2 fixes - Move g2d to component framework - Suspend/resume support cleanups - Driver cleanups imx: - CSI configuration improvements - Driver cleanups - Use atomic suspend/resume helpers - ipu-v3 V4L2 XRGB32/XBGR32 support pl111: - Add Nomadik LCDC variant v3d: - GPU scheduler jobs management sun4i: - R40 display engine support - TCON TOP driver mediatek: - MT2712 SoC support rockchip: - vop fixes omapdrm: - Workaround for DRA7 errata i932 - Fix mm_list locking mali-dp: - Writeback implementation PM improvements - Internal error reporting debugfs tilcdc: - Single fix for deferred probing hdlcd: - Teardown fixes tda998x: - Converted to a bridge driver. etnaviv: - Misc fixes" * tag 'drm-next-2018-08-15' of git://anongit.freedesktop.org/drm/drm: (1506 commits) drm/amdgpu/sriov: give 8s for recover vram under RUNTIME drm/scheduler: fix param documentation drm/i2c: tda998x: correct PLL divider calculation drm/i2c: tda998x: get rid of private fill_modes function drm/i2c: tda998x: move mode_valid() to bridge drm/i2c: tda998x: register bridge outside of component helper drm/i2c: tda998x: cleanup from previous changes drm/i2c: tda998x: allocate tda998x_priv inside tda998x_create() drm/i2c: tda998x: convert to bridge driver drm/scheduler: fix timeout worker setup for out of order job completions drm/amd/display: display connected to dp-1 does not light up drm/amd/display: update clk for various HDMI color depths drm/amd/display: program display clock on cache match drm/amd/display: Add NULL check for enabling dp ss drm/amd/display: add vbios table check for enabling dp ss drm/amd/display: Don't share clk source between DP and HDMI drm/amd/display: Fix DP HBR2 Eye Diagram Pattern on Carrizo drm/amd/display: Use calculated disp_clk_khz value for dce110 drm/amd/display: Implement custom degamma lut on dcn drm/amd/display: Destroy aux_engines only once ...
2018-08-13Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Thomas Gleixner: "A large update to RCU: Preparatory work for consolidating the RCU flavors: - Introduce grace-period sequence numbers to the RCU-bh, RCU-preempt, and RCU-sched flavors, replacing the old ->gpnum and ->completed pair of fields. This change allows lockless code to obtain the complete grace-period state with a single READ_ONCE(), which is needed to maintain tolerable lock contention during the upcoming consolidation of the three RCU flavors. Note that grace-period sequence numbers are already used by rcu_barrier(), expedited RCU grace periods, and SRCU, and are thus already heavily used and well-tested. Joel Fernandes contributed a number of excellent fixes and improvements. - Clean up some grace-period-reporting loose ends, including improving the handling of quiescent states from offline CPUs and fixing some false-positive WARN_ON_ONCE() invocations. (Strictly speaking, the WARN_ON_ONCE() invocations were quite correct, but their invariants were (harmlessly) violated by the earlier sloppy handling of quiescent states from offline CPUs.) In addition, improve grace-period forward-progress guarantees so as to allow removal of fail-safe checks that required otherwise needless lock acquisitions. Finally, add more diagnostics to help debug the upcoming consolidation of the RCU-bh, RCU-preempt, and RCU-sched flavors. The rest: - SRCU updates - Updates to rcutorture and associated scripting. - The usual pile of miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (118 commits) rcutorture: Fix rcu_barrier successes counter rcutorture: Add support to detect if boost kthread prio is too low rcutorture: Use monotonic timestamp for stall detection rcutorture: Make boost test more robust rcutorture: Disable RT throttling for boost tests rcutorture: Emphasize testing of single reader protection type rcutorture: Handle extended read-side critical sections rcutorture: Make rcu_torture_timer() use rcu_torture_one_read() rcutorture: Use per-CPU random state for rcu_torture_timer() rcutorture: Use atomic increment for n_rcu_torture_timers rcutorture: Extract common code from rcu_torture_reader() rcuperf: Remove unused torturing_tasks() function rcu: Remove rcutorture test version and sequence number rcutorture: Change units of onoff_interval to jiffies rcu: Assign higher prio to RCU threads if rcutorture is built-in rculist: Improve documentation for list_for_each_entry_from_rcu() srcu: Add grace-period number to rcutorture statistics printout rcu: Print stall-warning NMI dyntick state in hexadecimal MAINTAINERS: Update RCU, SRCU, and TORTURE-TEST entries rcu: Make rcu_seq_diff() more exact ...
2018-08-10tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and ↵Steven Rostedt (VMware)1-12/+2
unify their usage" Joel Fernandes created a nice patch that cleaned up the duplicate hooks used by lockdep and irqsoff latency tracer. It made both use tracepoints. But it caused lockdep to trigger several false positives. We have not figured out why yet, but removing lockdep from using the trace event hooks and just call its helper functions directly (like it use to), makes the problem go away. This is a partial revert of the clean up patch c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") that adds direct calls for lockdep, but also keeps most of the clean up done to get rid of the horrible preprocessor if statements. Link: http://lkml.kernel.org/r/20180806155058.5ee875f4@gandalf.local.home Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-31tracing: Centralize preemptirq tracepoints and unify their usageJoel Fernandes (Google)1-21/+14
This patch detaches the preemptirq tracepoints from the tracers and keeps it separate. Advantages: * Lockdep and irqsoff event can now run in parallel since they no longer have their own calls. * This unifies the usecase of adding hooks to an irqsoff and irqson event, and a preemptoff and preempton event. 3 users of the events exist: - Lockdep - irqsoff and preemptoff tracers - irqs and preempt trace events The unification cleans up several ifdefs and makes the code in preempt tracer and irqsoff tracers simpler. It gets rid of all the horrific ifdeferry around PROVE_LOCKING and makes configuration of the different users of the tracepoints more easy and understandable. It also gets rid of the time_* function calls from the lockdep hooks used to call into the preemptirq tracer which is not needed anymore. The negative delta in lines of code in this patch is quite large too. In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS as a single point for registering probes onto the tracepoints. With this, the web of config options for preempt/irq toggle tracepoints and its users becomes: PREEMPT_TRACER PREEMPTIRQ_EVENTS IRQSOFF_TRACER PROVE_LOCKING | | \ | | \ (selects) / \ \ (selects) / TRACE_PREEMPT_TOGGLE ----> TRACE_IRQFLAGS \ / \ (depends on) / PREEMPTIRQ_TRACEPOINTS Other than the performance tests mentioned in the previous patch, I also ran the locking API test suite. I verified that all tests cases are passing. I also injected issues by not registering lockdep probes onto the tracepoints and I see failures to confirm that the probes are indeed working. This series + lockdep probes not registered (just to inject errors): [ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] sirq-safe-A => hirqs-on/12:FAILED|FAILED| ok | [ 0.000000] sirq-safe-A => hirqs-on/21:FAILED|FAILED| ok | [ 0.000000] hard-safe-A + irqs-on/12:FAILED|FAILED| ok | [ 0.000000] soft-safe-A + irqs-on/12:FAILED|FAILED| ok | [ 0.000000] hard-safe-A + irqs-on/21:FAILED|FAILED| ok | [ 0.000000] soft-safe-A + irqs-on/21:FAILED|FAILED| ok | [ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok | [ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok | With this series + lockdep probes registered, all locking tests pass: [ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok | [ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok | [ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok | [ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok | [ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok | [ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok | [ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok | [ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok | [ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok | Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-30lockdep: Use this_cpu_ptr instead of get_cpu_var statsJoel Fernandes (Google)1-9/+1
get_cpu_var disables preemption which has the potential to call into the preemption disable trace points causing some complications. There's also no need to disable preemption in uses of get_lock_stats anyway since preempt is already disabled. So lets simplify the code. Link: http://lkml.kernel.org/r/20180730222423.196630-2-joel@joelfernandes.org Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-25locking/rtmutex: Allow specifying a subclass for nested lockingPeter Rosin1-4/+25
Needed for annotating rt_mutex locks. Tested-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Peter Rosin <peda@axentia.se> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Deepa Dinamani <deepadinamani@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Chang <dpf@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Wolfram Sang <wsa@the-dreams.de> Link: http://lkml.kernel.org/r/20180720083914.1950-2-peda@axentia.se Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17Merge branch 'for-mingo' of ↵Ingo Molnar1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - An optimization and a fix for RCU expedited grace periods, with the fix being from Boqun Feng. - Miscellaneous fixes, including a lockdep-annotation fix from Boqun Feng. - SRCU updates. - Updates to rcutorture and associated scripting. - Introduce grace-period sequence numbers to the RCU-bh, RCU-preempt, and RCU-sched flavors, replacing the old ->gpnum and ->completed pair of fields. This change allows lockless code to obtain the complete grace-period state with a single READ_ONCE(), which is needed to maintain tolerable lock contention during the upcoming consolidation of the three RCU flavors. Note that grace-period sequence numbers are already used by rcu_barrier(), expedited RCU grace periods, and SRCU, and are thus already heavily used and well-tested. Joel Fernandes contributed a number of excellent fixes and improvements. - Clean up some grace-period-reporting loose ends, including improving the handling of quiescent states from offline CPUs and fixing some false-positive WARN_ON_ONCE() invocations. (Strictly speaking, the WARN_ON_ONCE() invocations were quite correct, but their invariants were (harmlessly) violated by the earlier sloppy handling of quiescent states from offline CPUs.) In addition, improve grace-period forward-progress guarantees so as to allow removal of fail-safe checks that required otherwise needless lock acquisitions. Finally, add more diagnostics to help debug the upcoming consolidation of the RCU-bh, RCU-preempt, and RCU-sched flavors. - Additional miscellaneous fixes, including those contributed by Byungchul Park, Mauro Carvalho Chehab, Joe Perches, Joel Fernandes, Steven Rostedt, Andrea Parri, and Neil Brown. - Additional torture-test changes, including several contributed by Arnd Bergmann and Joel Fernandes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-06Merge branch 'vmwgfx-next' of git://people.freedesktop.org/~thomash/linux ↵Dave Airlie3-85/+264
into drm-next A patchset worked out together with Peter Zijlstra. Ingo is OK with taking it through the DRM tree: This is a small fallout from a work to allow batching WW mutex locks and unlocks. Our Wound-Wait mutexes actually don't use the Wound-Wait algorithm but the Wait-Die algorithm. One could perhaps rename those mutexes tree-wide to "Wait-Die mutexes" or "Deadlock Avoidance mutexes". Another approach suggested here is to implement also the "Wound-Wait" algorithm as a per-WW-class choice, as it has advantages in some cases. See for example http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html Now Wound-Wait is a preemptive algorithm, and the preemption is implemented using a lazy scheme: If a wounded transaction is about to go to sleep on a contended WW mutex, we return -EDEADLK. That is sufficient for deadlock prevention. Since with WW mutexes we also require the aborted transaction to sleep waiting to lock the WW mutex it was aborted on, this choice also provides a suitable WW mutex to sleep on. If we were to return -EDEADLK on the first WW mutex lock after the transaction was wounded whether the WW mutex was contended or not, the transaction might frequently be restarted without a wait, which is far from optimal. Note also that with the lazy preemption scheme, contrary to Wait-Die there will be no rollbacks on lock contention of locks held by a transaction that has completed its locking sequence. The modeset locks are then changed from Wait-Die to Wound-Wait since the typical locking pattern of those locks very well matches the criterion for a substantial reduction in the number of rollbacks. For reservation objects, the benefit is more unclear at this point and they remain using Wait-Die. Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180703105339.4461-1-thellstrom@vmware.com
2018-07-03locking: Implement an algorithm choice for Wound-Wait mutexesThomas Hellstrom3-19/+150
The current Wound-Wait mutex algorithm is actually not Wound-Wait but Wait-Die. Implement also Wound-Wait as a per-ww-class choice. Wound-Wait is, contrary to Wait-Die a preemptive algorithm and is known to generate fewer backoffs. Testing reveals that this is true if the number of simultaneous contending transactions is small. As the number of simultaneous contending threads increases, Wait-Wound becomes inferior to Wait-Die in terms of elapsed time. Possibly due to the larger number of held locks of sleeping transactions. Update documentation and callers. Timings using git://people.freedesktop.org/~thomash/ww_mutex_test tag patch-18-06-15 Each thread runs 100000 batches of lock / unlock 800 ww mutexes randomly chosen out of 100000. Four core Intel x86_64: Algorithm #threads Rollbacks time Wound-Wait 4 ~100 ~17s. Wait-Die 4 ~150000 ~19s. Wound-Wait 16 ~360000 ~109s. Wait-Die 16 ~450000 ~82s. Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Sean Paul <seanpaul@chromium.org> Cc: David Airlie <airlied@linux.ie> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-doc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Co-authored-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-07-03locking: WW mutex cleanupPeter Ziljstra1-77/+125
Make the WW mutex code more readable by adding comments, splitting up functions and pointing out that we're actually using the Wait-Die algorithm. Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Sean Paul <seanpaul@chromium.org> Cc: David Airlie <airlied@linux.ie> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-doc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Co-authored-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-25torture: Keep old-school dmesg formatPaul E. McKenney1-0/+3
This commit adds "#define pr_fmt(fmt) fmt" to the torture-test files in order to keep the current dmesg format. Once Joe's commits have hit mainline, these definitions will be changed in order to automatically generate the dmesg line prefix that the scripts expect. This will have the beneficial side-effect of allowing printk() formats to be used more widely and of shortening some pr_*() lines. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Joe Perches <joe@perches.com>
2018-06-25torture: Make online/offline messages appear only for verbose=2Paul E. McKenney1-1/+1
Some bugs reproduce quickly only at high CPU-hotplug rates, so the rcutorture TREE03 scenario now has only 200 milliseconds spacing between CPU-hotplug operations. At this rate, the torture-test pair of console messages per operation becomes a bit voluminous. This commit therefore converts the torture-test set of "verbose" kernel-boot arguments from bool to int, and prints the extra console messages only when verbose=2. The default is still verbose=1. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-06-24Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds2-6/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "A set of fixes and updates for the locking code: - Prevent lockdep from updating irq state within its own code and thereby confusing itself. - Buid fix for older GCCs which mistreat anonymous unions - Add a missing lockdep annotation in down_read_non_onwer() which causes up_read_non_owner() to emit a lockdep splat - Remove the custom alpha dec_and_lock() implementation which is incorrect in terms of ordering and use the generic one. The remaining two commits are not strictly fixes. They provide irqsave variants of atomic_dec_and_lock() and refcount_dec_and_lock(). These are required to merge the relevant updates and cleanups into different maintainer trees for 4.19, so routing them into mainline without actual users is the sanest approach. They should have been in -rc1, but last weekend I took the liberty to just avoid computers in order to regain some mental sanity" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/qspinlock: Fix build for anonymous union in older GCC compilers locking/lockdep: Do not record IRQ state within lockdep code locking/rwsem: Fix up_read_non_owner() warning with DEBUG_RWSEMS locking/refcounts: Implement refcount_dec_and_lock_irqsave() atomic: Add irqsave variant of atomic_dec_and_lock() alpha: Remove custom dec_and_lock() implementation
2018-06-21locking/lockdep: Do not record IRQ state within lockdep codeSteven Rostedt (VMware)1-6/+6
While debugging where things were going wrong with mapping enabling/disabling interrupts with the lockdep state and actual real enabling and disabling interrupts, I had to silent the IRQ disabling/enabling in debug_check_no_locks_freed() because it was always showing up as it was called before the splat was. Use raw_local_irq_save/restore() for not only debug_check_no_locks_freed() but for all internal lockdep functions, as they hide useful information about where interrupts were used incorrectly last. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/lkml/20180404140630.3f4f4c7a@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-20locking/rwsem: Fix up_read_non_owner() warning with DEBUG_RWSEMSWaiman Long1-0/+1
It was found that the use of up_read_non_owner() in NFS was causing the following warning when DEBUG_RWSEMS was configured. DEBUG_LOCKS_WARN_ON(sem->owner != ((struct task_struct *)(1UL << 0))) Looking into the rwsem.c file, it was discovered that the corresponding down_read_non_owner() function was not setting the owner field properly. This is fixed now, and the warning should be gone. Fixes: 5149cbac4235 ("locking/rwsem: Add DEBUG_RWSEMS to look for lock/unlock mismatches") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Gavin Schenk <g.schenk@eckelmann.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-nfs@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1527168398-4291-1-git-send-email-longman@redhat.com
2018-06-12treewide: kzalloc() -> kcalloc()Kees Cook1-2/+4
The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12treewide: kmalloc() -> kmalloc_array()Kees Cook1-2/+6
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-04Merge branch 'locking-core-for-linus' of ↵Linus Torvalds7-236/+177
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Lots of tidying up changes all across the map for Linux's formal memory/locking-model tooling, by Alan Stern, Akira Yokosawa, Andrea Parri, Paul E. McKenney and SeongJae Park. Notable changes beyond an overall update in the tooling itself is the tidying up of spin_is_locked() semantics, which spills over into the kernel proper as well. - qspinlock improvements: the locking algorithm now guarantees forward progress whereas the previous implementation in mainline could starve threads indefinitely in cmpxchg() loops. Also other related cleanups to the qspinlock code (Will Deacon) - misc smaller improvements, cleanups and fixes all across the locking subsystem * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) locking/rwsem: Simplify the is-owner-spinnable checks tools/memory-model: Add reference for 'Simplifying ARM concurrency' tools/memory-model: Update ASPLOS information MAINTAINERS, tools/memory-model: Update e-mail address for Andrea Parri tools/memory-model: Fix coding style in 'lock.cat' tools/memory-model: Remove out-of-date comments and code from lock.cat tools/memory-model: Improve mixed-access checking in lock.cat tools/memory-model: Improve comments in lock.cat tools/memory-model: Remove duplicated code from lock.cat tools/memory-model: Flag "cumulativity" and "propagation" tests tools/memory-model: Add model support for spin_is_locked() tools/memory-model: Add scripts to test memory model tools/memory-model: Fix coding style in 'linux-kernel.def' tools/memory-model: Model 'smp_store_mb()' tools/memory-order: Update the cheat-sheet to show that smp_mb__after_atomic() orders later RMW operations tools/memory-order: Improve key for SELF and SV tools/memory-model: Fix cheat sheet typo tools/memory-model: Update required version of herdtools7 tools/memory-model: Redefine rb in terms of rcu-fence tools/memory-model: Rename link and rcu-path to rcu-link and rb ...
2018-06-04Merge branch 'hch.procfs' of ↵Linus Torvalds1-42/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull procfs updates from Al Viro: "Christoph's proc_create_... cleanups series" * 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits) xfs, proc: hide unused xfs procfs helpers isdn/gigaset: add back gigaset_procinfo assignment proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields tty: replace ->proc_fops with ->proc_show ide: replace ->proc_fops with ->proc_show ide: remove ide_driver_proc_write isdn: replace ->proc_fops with ->proc_show atm: switch to proc_create_seq_private atm: simplify procfs code bluetooth: switch to proc_create_seq_data netfilter/x_tables: switch to proc_create_seq_private netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data neigh: switch to proc_create_seq_data hostap: switch to proc_create_{seq,single}_data bonding: switch to proc_create_seq_data rtc/proc: switch to proc_create_single_data drbd: switch to proc_create_single resource: switch to proc_create_seq_data staging/rtl8192u: simplify procfs code jfs: simplify procfs code ...
2018-05-25locking/rwsem: Simplify the is-owner-spinnable checksOleg Nesterov1-12/+13
Add the trivial owner_on_cpu() helper for rwsem_can_spin_on_owner() and rwsem_spin_on_owner(), it also allows to make rwsem_can_spin_on_owner() a bit more clear. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Theodore Y. Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180518165534.GA22348@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-25Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar3-21/+30
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-16locking/percpu-rwsem: Annotate rwsem ownership transfer by setting ↵Waiman Long1-0/+2
RWSEM_OWNER_UNKNOWN The filesystem freezing code needs to transfer ownership of a rwsem embedded in a percpu-rwsem from the task that does the freezing to another one that does the thawing by calling percpu_rwsem_release() after freezing and percpu_rwsem_acquire() before thawing. However, the new rwsem debug code runs afoul with this scheme by warning that the task that releases the rwsem isn't the one that acquires it, as reported by Amir Goldstein: DEBUG_LOCKS_WARN_ON(sem->owner != get_current()) WARNING: CPU: 1 PID: 1401 at /home/amir/build/src/linux/kernel/locking/rwsem.c:133 up_write+0x59/0x79 Call Trace: percpu_up_write+0x1f/0x28 thaw_super_locked+0xdf/0x120 do_vfs_ioctl+0x270/0x5f1 ksys_ioctl+0x52/0x71 __x64_sys_ioctl+0x16/0x19 do_syscall_64+0x5d/0x167 entry_SYSCALL_64_after_hwframe+0x49/0xbe To work properly with the rwsem debug code, we need to annotate that the rwsem ownership is unknown during the tranfer period until a brave soul comes forward to acquire the ownership. During that period, optimistic spinning will be disabled. Reported-by: Amir Goldstein <amir73il@gmail.com> Tested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Theodore Y. Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-fsdevel@vger.kernel.org Link: http://lkml.kernel.org/r/1526420991-21213-3-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-16locking/rwsem: Add a new RWSEM_ANONYMOUSLY_OWNED flagWaiman Long3-21/+28
There are use cases where a rwsem can be acquired by one task, but released by another task. In thess cases, optimistic spinning may need to be disabled. One example will be the filesystem freeze/thaw code where the task that freezes the filesystem will acquire a write lock on a rwsem and then un-owns it before returning to userspace. Later on, another task will come along, acquire the ownership, thaw the filesystem and release the rwsem. Bit 0 of the owner field was used to designate that it is a reader owned rwsem. It is now repurposed to mean that the owner of the rwsem is not known. If only bit 0 is set, the rwsem is reader owned. If bit 0 and other bits are set, it is writer owned with an unknown owner. One such value for the latter case is (-1L). So we can set owner to 1 for reader-owned, -1 for writer-owned. The owner is unknown in both cases. To handle transfer of rwsem ownership, the higher level code should set the owner field to -1 to indicate a write-locked rwsem with unknown owner. Optimistic spinning will be disabled in this case. Once the higher level code figures who the new owner is, it can then set the owner field accordingly. Tested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Theodore Y. Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-fsdevel@vger.kernel.org Link: http://lkml.kernel.org/r/1526420991-21213-2-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-16proc: introduce proc_create_single{,_data}Christoph Hellwig1-15/+1
Variants of proc_create{,_data} that directly take a seq_file show callback and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16proc: introduce proc_create_seq{,_data}Christoph Hellwig1-27/+2
Variants of proc_create{,_data} that directly take a struct seq_operations argument and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-14locking/lockdep: Move sanity check to inside lockdep_print_held_locks()Tetsuo Handa1-16/+13
Calling lockdep_print_held_locks() on a running thread is considered unsafe. Since all callers should follow that rule and the sanity check is not heavy, this patch moves the sanity check to inside lockdep_print_held_locks(). As a side effect of this patch, the number of locks held by running threads will be printed as well. This change will be preferable when we want to know which threads might be relevant to a problem but are unable to print any clues because that thread is running. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1523011279-8206-2-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14locking/lockdep: Use for_each_process_thread() for debug_show_all_locks()Tetsuo Handa1-35/+8
debug_show_all_locks() tries to grab the tasklist_lock for two seconds, but calling while_each_thread() without tasklist_lock held is not safe. See the following commit for more information: 4449a51a7c281602 ("vm_is_stack: use for_each_thread() rather then buggy while_each_thread()") Change debug_show_all_locks() from "do_each_thread()/while_each_thread() with possibility of missing tasklist_lock" to "for_each_process_thread() with RCU", and add a call to touch_all_softlockup_watchdogs() like show_state_filter() does. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1523011279-8206-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-04locking/mutex: Optimize __mutex_trylock_fast()Peter Zijlstra1-1/+2
Use try_cmpxchg to avoid the pointless TEST instruction.. And add the (missing) atomic_long_try_cmpxchg*() wrappery. On x86_64 this gives: 0000000000000710 <mutex_lock>: 0000000000000710 <mutex_lock>: 710: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx 710: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx 717: 00 00 717: 00 00 715: R_X86_64_32S current_task 715: R_X86_64_32S current_task 719: 31 c0 xor %eax,%eax 719: 31 c0 xor %eax,%eax 71b: f0 48 0f b1 17 lock cmpxchg %rdx,(%rdi) 71b: f0 48 0f b1 17 lock cmpxchg %rdx,(%rdi) 720: 48 85 c0 test %rax,%rax 720: 75 02 jne 724 <mutex_lock+0x14> 723: 75 02 jne 727 <mutex_lock+0x17> 722: f3 c3 repz retq 725: f3 c3 repz retq 724: eb da jmp 700 <__mutex_lock_slowpath> 727: eb d7 jmp 700 <__mutex_lock_slowpath> 726: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) 729: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 72d: 00 00 00 On ARM64 this gives: 000000000000638 <mutex_lock>: 0000000000000638 <mutex_lock>: 638: d5384101 mrs x1, sp_el0 638: d5384101 mrs x1, sp_el0 63c: d2800002 mov x2, #0x0 63c: d2800002 mov x2, #0x0 640: f9800011 prfm pstl1strm, [x0] 640: f9800011 prfm pstl1strm, [x0] 644: c85ffc03 ldaxr x3, [x0] 644: c85ffc03 ldaxr x3, [x0] 648: ca020064 eor x4, x3, x2 648: ca020064 eor x4, x3, x2 64c: b5000064 cbnz x4, 658 <mutex_lock+0x20> 64c: b5000064 cbnz x4, 658 <mutex_lock+0x20> 650: c8047c01 stxr w4, x1, [x0] 650: c8047c01 stxr w4, x1, [x0] 654: 35ffff84 cbnz w4, 644 <mutex_lock+0xc> 654: 35ffff84 cbnz w4, 644 <mutex_lock+0xc> 658: b40000c3 cbz x3, 670 <mutex_lock+0x38> 658: b5000043 cbnz x3, 660 <mutex_lock+0x28> 65c: a9bf7bfd stp x29, x30, [sp,#-16]! 65c: d65f03c0 ret 660: 910003fd mov x29, sp 660: a9bf7bfd stp x29, x30, [sp,#-16]! 664: 97ffffef bl 620 <__mutex_lock_slowpath> 664: 910003fd mov x29, sp 668: a8c17bfd ldp x29, x30, [sp],#16 668: 97ffffee bl 620 <__mutex_lock_slowpath> 66c: d65f03c0 ret 66c: a8c17bfd ldp x29, x30, [sp],#16 670: d65f03c0 ret 670: d65f03c0 ret Reported-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>