rcu: Fix TREE_PREEMPT_RCU CPU_HOTPLUG bad-luck hang

If the following sequence of events occurs, then TREE_PREEMPT_RCU will hang waiting for a grace period to complete, eventually OOMing the system: o A TREE_PREEMPT_RCU build of the kernel is booted on a system with more than 64 physical CPUs present (32 on a 32-bit system). Alternatively, a TREE_PREEMPT_RCU build of the kernel is booted with RCU_FANOUT set to a sufficiently small value that the physical CPUs populate two or more leaf rcu_node structures. o A task is preempted in an RCU read-side critical section while running on a CPU corresponding to a given leaf rcu_node structure. o All CPUs corresponding to this same leaf rcu_node structure record quiescent states for the current grace period. o All of these same CPUs go offline (hence the need for enough physical CPUs to populate more than one leaf rcu_node structure). This causes the preempted task to be moved to the root rcu_node structure. At this point, there is nothing left to cause the quiescent state to be propagated up the rcu_node tree, so the current grace period never completes. The simplest fix, especially after considering the deadlock possibilities, is to detect this situation when the last CPU is offlined, and to set that CPU's ->qsmask bit in its leaf rcu_node structure. This will cause the next invocation of force_quiescent_state() to end the grace period. Without this fix, this hang can be triggered in an hour or so on some machines with rcutorture and random CPU onlining/offlining. With this fix, these same machines pass a full 10 hours of this sort of abuse. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20091015162614.GA19131@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> 2009-10-15 09:26:14 -0700
committer: Ingo Molnar <mingo@elte.hu> 2009-10-15 20:33:01 +0200
commit: 237c80c5c8fb7ec128cf2a756b550dc41ad7eac7 (patch)
tree: 5d6b3346f2c53cd3f7471001479a8dbd741533a3 /kernel/rcutree.c
parent: 019129d595caaa5bd0b41d128308da1be6a91869 (diff)
download: linux-237c80c5c8fb7ec128cf2a756b550dc41ad7eac7.tar.bz2
1 files changed, 14 insertions, 1 deletions
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ddbf111e9e18..0536125b0497 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -913,7 +913,20 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
 			spin_unlock(&rnp->lock); /* irqs remain disabled. */
 			break;
 		}
-		rcu_preempt_offline_tasks(rsp, rnp, rdp);
+
+		/*
+		 * If there was a task blocking the current grace period,
+		 * and if all CPUs have checked in, we need to propagate
+		 * the quiescent state up the rcu_node hierarchy.  But that
+		 * is inconvenient at the moment due to deadlock issues if
+		 * this should end the current grace period.  So set the
+		 * offlined CPU's bit in ->qsmask in order to force the
+		 * next force_quiescent_state() invocation to clean up this
+		 * mess in a deadlock-free manner.
+		 */
+		if (rcu_preempt_offline_tasks(rsp, rnp, rdp) && !rnp->qsmask)
+			rnp->qsmask |= mask;
+
 		mask = rnp->grpmask;
 		spin_unlock(&rnp->lock);	/* irqs remain disabled. */
 		rnp = rnp->parent;
author	Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2009-10-15 09:26:14 -0700
committer	Ingo Molnar <mingo@elte.hu>	2009-10-15 20:33:01 +0200
commit	237c80c5c8fb7ec128cf2a756b550dc41ad7eac7 (patch)
tree	5d6b3346f2c53cd3f7471001479a8dbd741533a3 /kernel/rcutree.c
parent	019129d595caaa5bd0b41d128308da1be6a91869 (diff)
download	linux-237c80c5c8fb7ec128cf2a756b550dc41ad7eac7.tar.bz2