2013-08-07 10:21:01

by Lai Jiangshan

Subject: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

Although all articles declare that the RCU read site is deadlock-immune,
this is not true for rcu-preempt: it can deadlock if the RCU read site
overlaps with a scheduler lock.

Commits ec433f0c, 10f39bb1 and 016a8d5b only partially solve it; the RCU
read site is still not deadlock-immune, and the problem described in
016a8d5b still exists (rcu_read_unlock_special() calls wake_up()).

The problem is fixed in patch5.
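
The deadlock scenario, sketched (rq->lock here stands in for any lock that
can be (chained) nested in rcu_read_unlock_special()):

	rcu_read_lock();
	preempt_schedule[_irq]();	/* preempted: RCU_READ_UNLOCK_BLOCKED
					 * is set for this reader */
	...
	raw_spin_lock(&rq->lock);	/* scheduler lock, irqs disabled */
	/* access and check rcu-protected data */
	rcu_read_unlock();		/* outermost unlock:
					 *   rcu_read_unlock_special()
					 *     -> rt_mutex_unlock()/wake_up()
					 *       -> wants rq->lock: DEADLOCK */
	raw_spin_unlock(&rq->lock);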

Lai Jiangshan (8):
rcu: add a warn to rcu_preempt_note_context_switch()
rcu: rcu_read_unlock_special() can be nested in irq/softirq 10f39bb1
rcu: keep irqs disabled in rcu_read_unlock_special()
rcu: delay task rcu state cleanup in exit_rcu()
rcu: eliminate rcu read site deadlock
rcu: call rcu_read_unlock_special() in rcu_preempt_check_callbacks()
rcu: add # of deferred _special() statistics
rcu: remove irq work for rsp_wakeup()

include/linux/rcupdate.h | 2 +-
kernel/rcupdate.c | 2 +-
kernel/rcutree.c | 17 +--------
kernel/rcutree.h | 2 +-
kernel/rcutree_plugin.h | 82 ++++++++++++++++++++++++++++++++++-----------
kernel/rcutree_trace.c | 1 +
6 files changed, 68 insertions(+), 38 deletions(-)

--
1.7.4.4


2013-08-07 10:21:05

by Lai Jiangshan

Subject: [PATCH 3/8] rcu: keep irqs disabled in rcu_read_unlock_special()

rcu_read_unlock_special() may enable irqs temporarily before it finishes
its last piece of work. This doesn't introduce anything extremely bad, but
it does add more task RCU machine states and more complexity, which makes
the code harder to review.

Also, if the task is preempted while irqs are enabled,
synchronize_rcu_expedited() will be slowed down, and the task can't get
help from rcu_boost.
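
A minimal sketch of the flag-juggling pattern the diff below relies on:
"flags" holds the caller's irq state, while "irq_disabled_flags" is a
snapshot taken after irqs are already off, so "restoring" it keeps irqs
disabled:

	unsigned long flags, irq_disabled_flags;

	local_irq_save(flags);			/* save caller's irq state */
	local_save_flags(irq_disabled_flags);	/* snapshot with irqs off  */

	raw_spin_lock(&rnp->lock);
	/* ... */
	raw_spin_unlock_irqrestore(&rnp->lock, irq_disabled_flags);
	/* irqs remain disabled here */

	/* ... finish the remaining work with irqs still off ... */

	local_irq_restore(flags);		/* caller's state, restored
						 * only at the very end */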

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcutree_plugin.h | 13 ++++++++-----
1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 54f7e45..6b23b6f 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -338,7 +338,7 @@ void rcu_read_unlock_special(struct task_struct *t)
int empty;
int empty_exp;
int empty_exp_now;
- unsigned long flags;
+ unsigned long flags, irq_disabled_flags;
struct list_head *np;
#ifdef CONFIG_RCU_BOOST
struct rt_mutex *rbmp = NULL;
@@ -351,6 +351,7 @@ void rcu_read_unlock_special(struct task_struct *t)
return;

local_irq_save(flags);
+ local_save_flags(irq_disabled_flags);

/*
* If RCU core is waiting for this CPU to exit critical section,
@@ -414,9 +415,12 @@ void rcu_read_unlock_special(struct task_struct *t)
rnp->grplo,
rnp->grphi,
!!rnp->gp_tasks);
- rcu_report_unblock_qs_rnp(rnp, flags);
+ rcu_report_unblock_qs_rnp(rnp, irq_disabled_flags);
+ /* irqs remain disabled. */
} else {
- raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ raw_spin_unlock_irqrestore(&rnp->lock,
+ irq_disabled_flags);
+ /* irqs remain disabled. */
}

#ifdef CONFIG_RCU_BOOST
@@ -431,9 +435,8 @@ void rcu_read_unlock_special(struct task_struct *t)
*/
if (!empty_exp && empty_exp_now)
rcu_report_exp_rnp(&rcu_preempt_state, rnp, true);
- } else {
- local_irq_restore(flags);
}
+ local_irq_restore(flags);
}

#ifdef CONFIG_RCU_CPU_STALL_VERBOSE
--
1.7.4.4

2013-08-07 10:21:12

by Lai Jiangshan

Subject: [PATCH 8/8] rcu: remove irq work for rsp_wakeup()

It is now safe to acquire scheduler locks while holding rnp->lock, since
the RCU read site is deadlock-immune (rnp->lock can never end up nested
inside a scheduler lock).

This partially reverts commit 016a8d5b.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcutree.c | 17 ++---------------
kernel/rcutree.h | 1 -
2 files changed, 2 insertions(+), 16 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e08abb9..6c91edc 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1524,14 +1524,6 @@ static int __noreturn rcu_gp_kthread(void *arg)
}
}

-static void rsp_wakeup(struct irq_work *work)
-{
- struct rcu_state *rsp = container_of(work, struct rcu_state, wakeup_work);
-
- /* Wake up rcu_gp_kthread() to start the grace period. */
- wake_up(&rsp->gp_wq);
-}
-
/*
* Start a new RCU grace period if warranted, re-initializing the hierarchy
* in preparation for detecting the next grace period. The caller must hold
@@ -1556,12 +1548,8 @@ rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
}
rsp->gp_flags = RCU_GP_FLAG_INIT;

- /*
- * We can't do wakeups while holding the rnp->lock, as that
- * could cause possible deadlocks with the rq->lock. Deter
- * the wakeup to interrupt context.
- */
- irq_work_queue(&rsp->wakeup_work);
+ /* Wake up rcu_gp_kthread() to start the grace period. */
+ wake_up(&rsp->gp_wq);
}

/*
@@ -3153,7 +3141,6 @@ static void __init rcu_init_one(struct rcu_state *rsp,

rsp->rda = rda;
init_waitqueue_head(&rsp->gp_wq);
- init_irq_work(&rsp->wakeup_work, rsp_wakeup);
rnp = rsp->level[rcu_num_lvls - 1];
for_each_possible_cpu(i) {
while (i > rnp->grphi)
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index a5e9643..5892a43 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -449,7 +449,6 @@ struct rcu_state {
char *name; /* Name of structure. */
char abbr; /* Abbreviated name. */
struct list_head flavors; /* List of RCU flavors. */
- struct irq_work wakeup_work; /* Postponed wakeups */
};

/* Values for rcu_state structure's gp_flags field. */
--
1.7.4.4

2013-08-07 10:21:34

by Lai Jiangshan

Subject: [PATCH 7/8] rcu: add # of deferred _special() statistics

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcutree.h | 1 +
kernel/rcutree_plugin.h | 1 +
kernel/rcutree_trace.c | 1 +
3 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 4a39d36..a5e9643 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -290,6 +290,7 @@ struct rcu_data {
unsigned long n_force_qs_snap;
/* did other CPU force QS recently? */
long blimit; /* Upper limit on a processed batch */
+ unsigned long n_defer_special;/* # of deferred _special() */

/* 3) dynticks interface. */
struct rcu_dynticks *dynticks; /* Shared per-CPU dynticks state. */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index c9ff9f1..d828eec 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -396,6 +396,7 @@ void rcu_read_unlock_special(struct task_struct *t, bool unlock)
* is still unlikely to be true.
*/
if (unlikely(unlock && irqs_disabled_flags(flags))) {
+ this_cpu_ptr(&rcu_preempt_data)->n_defer_special++;
set_need_resched();
local_irq_restore(flags);
return;
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index cf6c174..17d8b2c 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -149,6 +149,7 @@ static void print_one_rcu_data(struct seq_file *m, struct rcu_data *rdp)
seq_printf(m, " ci=%lu nci=%lu co=%lu ca=%lu\n",
rdp->n_cbs_invoked, rdp->n_nocbs_invoked,
rdp->n_cbs_orphaned, rdp->n_cbs_adopted);
+ seq_printf(m, " ds=%lu\n", rdp->n_defer_special);
}

static int show_rcudata(struct seq_file *m, void *v)
--
1.7.4.4

2013-08-07 10:21:47

by Lai Jiangshan

Subject: [PATCH 4/8] rcu: delay task rcu state cleanup in exit_rcu()

exit_rcu() tries to clean up the task's RCU state when the task is exiting.
It used to do so by calling __rcu_read_unlock().

Actually, calling rcu_read_unlock_special() is enough. This patch defers
the cleanup to the rcu_preempt_note_context_switch() of the next schedule().

This patch prepares for the next patch, which defers
rcu_read_unlock_special() if irqs are disabled when __rcu_read_unlock() is
called. So __rcu_read_unlock() can't do the work here (irqs are disabled
here) once the next patch is applied.
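
For reference, a condensed sketch (taken from the next patch's hunk) of
where the deferred cleanup then happens: the exiting task reaches
rcu_preempt_note_context_switch() with rcu_read_lock_nesting == INT_MIN
on the next schedule() in do_exit():

	} else if (t->rcu_read_lock_nesting == 0 ||
		   (t->rcu_read_lock_nesting < 0 &&
		    !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
		/* exit_rcu() left nesting at INT_MIN; finish the cleanup. */
		if (t->rcu_read_unlock_special)
			rcu_read_unlock_special(t, false);
	}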

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcutree_plugin.h | 11 +++++++----
1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 6b23b6f..fc8b36f 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -942,10 +942,13 @@ void exit_rcu(void)

if (likely(list_empty(&current->rcu_node_entry)))
return;
- t->rcu_read_lock_nesting = 1;
- barrier();
- t->rcu_read_unlock_special = RCU_READ_UNLOCK_BLOCKED;
- __rcu_read_unlock();
+ WARN_ON_ONCE(!(t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED));
+ /*
+ * The task RCU state (rcu_node_entry) of this task will be cleaned up
+ * by the rcu_preempt_note_context_switch() of the next schedule()
+ * in do_exit().
+ */
+ t->rcu_read_lock_nesting = INT_MIN;
}

#else /* #ifdef CONFIG_TREE_PREEMPT_RCU */
--
1.7.4.4

2013-08-07 10:22:55

by Lai Jiangshan

Subject: [PATCH 2/8] rcu: remove irq/softirq context check in rcu_read_unlock_special()

After commit 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
in irq or softirq context (because RCU_READ_UNLOCK_BLOCKED can only be set
on preemption), so the irq/softirq check can be removed.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcutree_plugin.h | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 8fd947e..54f7e45 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -361,12 +361,6 @@ void rcu_read_unlock_special(struct task_struct *t)
rcu_preempt_qs(smp_processor_id());
}

- /* Hardware IRQ handlers cannot block. */
- if (in_irq() || in_serving_softirq()) {
- local_irq_restore(flags);
- return;
- }
-
/* Clean up if blocked during RCU read-side critical section. */
if (special & RCU_READ_UNLOCK_BLOCKED) {
t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
--
1.7.4.4

2013-08-07 10:22:53

by Lai Jiangshan

Subject: [PATCH 6/8] rcu: call rcu_read_unlock_special() in rcu_preempt_check_callbacks()

If rcu_read_unlock_special() has been deferred, we can invoke it earlier,
from the scheduler tick, rather than waiting for the next context switch.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcutree_plugin.h | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 997b424..c9ff9f1 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -684,8 +684,11 @@ static void rcu_preempt_check_callbacks(int cpu)
{
struct task_struct *t = current;

- if (t->rcu_read_lock_nesting == 0) {
+ if (t->rcu_read_lock_nesting == 0 ||
+ t->rcu_read_lock_nesting == INT_MIN) {
rcu_preempt_qs(cpu);
+ if (t->rcu_read_unlock_special)
+ rcu_read_unlock_special(t, false);
return;
}
if (t->rcu_read_lock_nesting > 0 &&
--
1.7.4.4

2013-08-07 10:22:52

by Lai Jiangshan

Subject: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

Background)

Although all articles declare that the RCU read site is deadlock-immune,
this is not true for rcu-preempt: it can deadlock if the RCU read site
overlaps with a scheduler lock.

Commits ec433f0c, 10f39bb1 and 016a8d5b only partially solve it; the RCU
read site is still not deadlock-immune, and the problem described in
016a8d5b still exists (rcu_read_unlock_special() calls wake_up()).

Aim)

We want to fix the problem for good: keep the RCU read site
deadlock-immune, as the books say it is.

How)

The problem is solved by the rule "if rcu_read_unlock_special() is called
while holding any lock which can be (chained) nested inside
rcu_read_unlock_special(), we defer rcu_read_unlock_special()".
Such locks include rnp->lock, the scheduler locks, perf's ctx->lock, the
locks taken in printk()/WARN_ON(), and all locks nested, or chained
nested, in these locks.

The problem is thus reduced to "how do we recognize all these locks
(contexts)?". We don't try to distinguish them individually; we only rely
on the fact that all of them must be taken with local irqs disabled.

So we simply assume that when rcu_read_unlock_special() is called in an
irqs-disabled context, it may be running under one of these suspect
locks, and we defer rcu_read_unlock_special().

This enlarges the probability of deferring, but the probability is still
very, very low.

Deferring does add a small overhead, but it buys us:
1) real deadlock immunity for the RCU read site
2) removal of the irq-work overhead (about 250 invocations per second on
   average)
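
Condensed, the check added below is (the second argument is true only for
the outermost __rcu_read_unlock(), which is the only caller that might be
running under a suspect lock; rcu_preempt_note_context_switch() and, with
a later patch, rcu_preempt_check_callbacks() pass false and are never
deferred):

	if (special & RCU_READ_UNLOCK_BLOCKED) {
		if (unlikely(unlock && irqs_disabled_flags(flags))) {
			/*
			 * Possibly under rnp->lock, rq->lock, pi->lock, ...
			 * Don't risk wake_up()/rt_mutex_unlock() here; ask
			 * for a reschedule and let the next context switch
			 * (or the schedule tick) finish the job.
			 */
			set_need_resched();
			local_irq_restore(flags);
			return;
		}
		/* otherwise the usual slow path runs as before */
		...
	}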

Signed-off-by: Lai Jiangshan <[email protected]>
---
include/linux/rcupdate.h | 2 +-
kernel/rcupdate.c | 2 +-
kernel/rcutree_plugin.h | 47 +++++++++++++++++++++++++++++++++++++++++----
3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 4b14bdc..00b4220 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -180,7 +180,7 @@ extern void synchronize_sched(void);

extern void __rcu_read_lock(void);
extern void __rcu_read_unlock(void);
-extern void rcu_read_unlock_special(struct task_struct *t);
+extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
void synchronize_rcu(void);

/*
diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index cce6ba8..33b89a3 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
#endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
barrier(); /* assign before ->rcu_read_unlock_special load */
if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
- rcu_read_unlock_special(t);
+ rcu_read_unlock_special(t, true);
barrier(); /* ->rcu_read_unlock_special load before assign */
t->rcu_read_lock_nesting = 0;
}
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index fc8b36f..997b424 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
? rnp->gpnum
: rnp->gpnum + 1);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
- } else if (t->rcu_read_lock_nesting < 0 &&
- !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
- t->rcu_read_unlock_special) {
+ } else if (t->rcu_read_lock_nesting == 0 ||
+ (t->rcu_read_lock_nesting < 0 &&
+ !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {

/*
* Complete exit from RCU read-side critical section on
* behalf of preempted instance of __rcu_read_unlock().
*/
- rcu_read_unlock_special(t);
+ if (t->rcu_read_unlock_special)
+ rcu_read_unlock_special(t, false);
}

/*
@@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
* notify RCU core processing or task having blocked during the RCU
* read-side critical section.
*/
-void rcu_read_unlock_special(struct task_struct *t)
+void rcu_read_unlock_special(struct task_struct *t, bool unlock)
{
int empty;
int empty_exp;
@@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)

/* Clean up if blocked during RCU read-side critical section. */
if (special & RCU_READ_UNLOCK_BLOCKED) {
+ /*
+ * If rcu read lock overlaps with scheduler lock,
+ * rcu_read_unlock_special() may lead to deadlock:
+ *
+ * rcu_read_lock();
+ * preempt_schedule[_irq]() (when preemption)
+ * scheduler lock; (or some other locks can be (chained) nested
+ * in rcu_read_unlock_special()/rnp->lock)
+ * access and check rcu data
+ * rcu_read_unlock();
+ * rcu_read_unlock_special();
+ * wake_up(); DEAD LOCK
+ *
+ * To avoid all these kinds of deadlocks, we quit
+ * rcu_read_unlock_special() here and defer it to
+ * rcu_preempt_note_context_switch() or to the next outermost
+ * rcu_read_unlock() whenever this case may have happened.
+ *
+ * We can't know whether the current _special() is nested
+ * inside a scheduler lock or not, but we do know that irqs
+ * are always disabled in that case, so we simply quit and
+ * defer it to rcu_preempt_note_context_switch() whenever
+ * irqs are disabled.
+ *
+ * This means we always defer _special() when it is
+ * nested in an irqs-disabled context, but
+ * (special & RCU_READ_UNLOCK_BLOCKED) &&
+ * irqs_disabled_flags(flags)
+ * is still unlikely to be true.
+ */
+ if (unlikely(unlock && irqs_disabled_flags(flags))) {
+ set_need_resched();
+ local_irq_restore(flags);
+ return;
+ }
+
t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;

/*
--
1.7.4.4

2013-08-07 10:30:27

by Lai Jiangshan

Subject: [PATCH 1/8] rcu: add a warn to rcu_preempt_note_context_switch()

It is expected that _nesting == INT_MIN whenever _nesting < 0.
Add a warning so that we notice if something unexpected happens.

Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/rcutree_plugin.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 63098a5..8fd947e 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -243,6 +243,7 @@ static void rcu_preempt_note_context_switch(int cpu)
: rnp->gpnum + 1);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
} else if (t->rcu_read_lock_nesting < 0 &&
+ !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
t->rcu_read_unlock_special) {

/*
--
1.7.4.4

2013-08-07 12:52:18

by Paul E. McKenney

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

On Wed, Aug 07, 2013 at 06:24:56PM +0800, Lai Jiangshan wrote:
> Although all articles declare that rcu read site is deadlock-immunity.
> It is not true for rcu-preempt, it will be deadlock if rcu read site
> overlaps with scheduler lock.

The real rule is that if the scheduler does its outermost rcu_read_unlock()
with one of those locks held, it has to have avoided enabling preemption
through the entire RCU read-side critical section.
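
In sketch form (illustrative only; rq->lock stands in for any of the
relevant scheduler/pi locks), the pattern the rule permits is:

	raw_spin_lock_irqsave(&rq->lock, flags); /* irqs/preemption off */
	rcu_read_lock();
	/*
	 * No preemption point anywhere in here, so this reader cannot
	 * block and RCU_READ_UNLOCK_BLOCKED cannot be set for it ...
	 */
	rcu_read_unlock();	/* ... so the outermost unlock stays on the
				 * fast path and never does wake_up() or
				 * rt_mutex_unlock() under rq->lock. */
	raw_spin_unlock_irqrestore(&rq->lock, flags);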

That said, avoiding the need for this rule would be a good thing.

How did you test this? The rcutorture tests will not exercise this.
(Intentionally so, given that it can deadlock!)

> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> is still not deadlock-immunity. And the problem described in 016a8d5b
> is still existed(rcu_read_unlock_special() calls wake_up).
>
> The problem is fixed in patch5.

This is going to require some serious review and testing. One requirement
is that RCU priority boosting not persist significantly beyond the
re-enabling of interrupts associated with the irq-disabled lock. To do
otherwise breaks RCU priority boosting. At first glance, the added
set_need_resched() might handle this, but that is part of the review
and testing required.

Steven, would you and Carsten be willing to try this and see if it
helps with the issues you are seeing in -rt? (My guess is "no", since
a deadlock would block forever rather than waking up after a couple
thousand seconds, but worth a try.)

Thanx, Paul

> Lai Jiangshan (8):
> rcu: add a warn to rcu_preempt_note_context_switch()
> rcu: rcu_read_unlock_special() can be nested in irq/softirq 10f39bb1
> rcu: keep irqs disabled in rcu_read_unlock_special()
> rcu: delay task rcu state cleanup in exit_rcu()
> rcu: eliminate rcu read site deadlock
> rcu: call rcu_read_unlock_special() in rcu_preempt_check_callbacks()
> rcu: add # of deferred _special() statistics
> rcu: remove irq work for rsp_wakeup()
>
> include/linux/rcupdate.h | 2 +-
> kernel/rcupdate.c | 2 +-
> kernel/rcutree.c | 17 +--------
> kernel/rcutree.h | 2 +-
> kernel/rcutree_plugin.h | 82 ++++++++++++++++++++++++++++++++++-----------
> kernel/rcutree_trace.c | 1 +
> 6 files changed, 68 insertions(+), 38 deletions(-)
>
> --
> 1.7.4.4
>

2013-08-07 16:42:23

by Paul E. McKenney

Subject: Re: [PATCH 7/8] rcu: add # of deferred _special() statistics

On Wed, Aug 07, 2013 at 06:25:03PM +0800, Lai Jiangshan wrote:
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> kernel/rcutree.h | 1 +
> kernel/rcutree_plugin.h | 1 +
> kernel/rcutree_trace.c | 1 +
> 3 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 4a39d36..a5e9643 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -290,6 +290,7 @@ struct rcu_data {
> unsigned long n_force_qs_snap;
> /* did other CPU force QS recently? */
> long blimit; /* Upper limit on a processed batch */
> + unsigned long n_defer_special;/* # of deferred _special() */
>
> /* 3) dynticks interface. */
> struct rcu_dynticks *dynticks; /* Shared per-CPU dynticks state. */
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index c9ff9f1..d828eec 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -396,6 +396,7 @@ void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> * is still unlikey to be true.
> */
> if (unlikely(unlock && irqs_disabled_flags(flags))) {
> + this_cpu_ptr(&rcu_preempt_data)->n_defer_special++;
> set_need_resched();
> local_irq_restore(flags);
> return;
> diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
> index cf6c174..17d8b2c 100644
> --- a/kernel/rcutree_trace.c
> +++ b/kernel/rcutree_trace.c
> @@ -149,6 +149,7 @@ static void print_one_rcu_data(struct seq_file *m, struct rcu_data *rdp)
> seq_printf(m, " ci=%lu nci=%lu co=%lu ca=%lu\n",

The above "\n" needs to be removed. (Fixed while queueing.)

Thanx, Paul

> rdp->n_cbs_invoked, rdp->n_nocbs_invoked,
> rdp->n_cbs_orphaned, rdp->n_cbs_adopted);
> + seq_printf(m, " ds=%lu\n", rdp->n_defer_special);
> }
>
> static int show_rcudata(struct seq_file *m, void *v)
> --
> 1.7.4.4
>

2013-08-07 19:52:45

by Paul E. McKenney

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

On Wed, Aug 07, 2013 at 09:29:07PM +0200, Carsten Emde wrote:
> Hi Paul,
>
> >>Although all articles declare that rcu read site is deadlock-immunity.
> >>It is not true for rcu-preempt, it will be deadlock if rcu read site
> >>overlaps with scheduler lock.
> >
> >The real rule is that if the scheduler does its outermost rcu_read_unlock()
> >with one of those locks held, it has to have avoided enabling preemption
> >through the entire RCU read-side critical section.
> >
> >That said, avoiding the need for this rule would be a good thing.
> >
> >How did you test this? The rcutorture tests will not exercise this.
> >(Intentionally so, given that it can deadlock!)
> >
> >>ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> >>is still not deadlock-immunity. And the problem described in 016a8d5b
> >>is still existed(rcu_read_unlock_special() calls wake_up).
> >>
> >>The problem is fixed in patch5.
> >
> >This is going to require some serious review and testing. One requirement
> >is that RCU priority boosting not persist significantly beyond the
> >re-enabling of interrupts associated with the irq-disabled lock. To do
> >otherwise breaks RCU priority boosting. At first glance, the added
> >set_need_resched() might handle this, but that is part of the review
> >and testing required.
> >
> >Steven, would you and Carsten be willing to try this and see if it
> >helps with the issues you are seeing in -rt? (My guess is "no", since
> >a deadlock would block forever rather than waking up after a couple
> >thousand seconds, but worth a try.)
> Your guess was correct, applying this patch doesn't heal the
> NO_HZ_FULL+PREEMPT_RT_FULL 3.10.4 based system; it still is hanging
> at -> synchronize_rcu -> wait_rcu_gp.

I would rather have been wrong, but thank you for trying it out!

Thanx, Paul

2013-08-07 20:03:27

by Carsten Emde

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

Hi Paul,

>> Although all articles declare that rcu read site is deadlock-immunity.
>> It is not true for rcu-preempt, it will be deadlock if rcu read site
>> overlaps with scheduler lock.
>
> The real rule is that if the scheduler does its outermost rcu_read_unlock()
> with one of those locks held, it has to have avoided enabling preemption
> through the entire RCU read-side critical section.
>
> That said, avoiding the need for this rule would be a good thing.
>
> How did you test this? The rcutorture tests will not exercise this.
> (Intentionally so, given that it can deadlock!)
>
>> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
>> is still not deadlock-immunity. And the problem described in 016a8d5b
>> is still existed(rcu_read_unlock_special() calls wake_up).
>>
>> The problem is fixed in patch5.
>
> This is going to require some serious review and testing. One requirement
> is that RCU priority boosting not persist significantly beyond the
> re-enabling of interrupts associated with the irq-disabled lock. To do
> otherwise breaks RCU priority boosting. At first glance, the added
> set_need_resched() might handle this, but that is part of the review
> and testing required.
>
> Steven, would you and Carsten be willing to try this and see if it
> helps with the issues you are seeing in -rt? (My guess is "no", since
> a deadlock would block forever rather than waking up after a couple
> thousand seconds, but worth a try.)
Your guess was correct, applying this patch doesn't heal the
NO_HZ_FULL+PREEMPT_RT_FULL 3.10.4 based system; it still is hanging at
-> synchronize_rcu -> wait_rcu_gp.

-Carsten.

2013-08-08 00:36:54

by Paul E. McKenney

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

On Wed, Aug 07, 2013 at 05:38:27AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 07, 2013 at 06:24:56PM +0800, Lai Jiangshan wrote:
> > Although all articles declare that rcu read site is deadlock-immunity.
> > It is not true for rcu-preempt, it will be deadlock if rcu read site
> > overlaps with scheduler lock.
>
> The real rule is that if the scheduler does its outermost rcu_read_unlock()
> with one of those locks held, it has to have avoided enabling preemption
> through the entire RCU read-side critical section.
>
> That said, avoiding the need for this rule would be a good thing.
>
> How did you test this? The rcutorture tests will not exercise this.
> (Intentionally so, given that it can deadlock!)
>
> > ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> > is still not deadlock-immunity. And the problem described in 016a8d5b
> > is still existed(rcu_read_unlock_special() calls wake_up).
> >
> > The problem is fixed in patch5.
>
> This is going to require some serious review and testing. One requirement
> is that RCU priority boosting not persist significantly beyond the
> re-enabling of interrupts associated with the irq-disabled lock. To do
> otherwise breaks RCU priority boosting. At first glance, the added
> set_need_resched() might handle this, but that is part of the review
> and testing required.
>
> Steven, would you and Carsten be willing to try this and see if it
> helps with the issues you are seeing in -rt? (My guess is "no", since
> a deadlock would block forever rather than waking up after a couple
> thousand seconds, but worth a try.)

No joy from either Steven or Carsten on the -rt hangs.

I pushed this to -rcu and ran tests. I hit this in one of the
configurations:

[ 393.641012] =================================
[ 393.641012] [ INFO: inconsistent lock state ]
[ 393.641012] 3.11.0-rc1+ #1 Not tainted
[ 393.641012] ---------------------------------
[ 393.641012] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[ 393.641012] rcu_torture_rea/697 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 393.641012] (&lock->wait_lock){?.+...}, at: [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
[ 393.641012] {HARDIRQ-ON-W} state was registered at:
[ 393.641012] [<ffffffff810aecb1>] __lock_acquire+0x651/0x1d40
[ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
[ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
[ 393.641012] [<ffffffff81886329>] rt_mutex_slowlock+0x39/0x170
[ 393.641012] [<ffffffff818864ca>] rt_mutex_lock+0x2a/0x30
[ 393.641012] [<ffffffff810ebc03>] rcu_boost_kthread+0x173/0x800
[ 393.641012] [<ffffffff810759d6>] kthread+0xd6/0xe0
[ 393.641012] [<ffffffff8188f52c>] ret_from_fork+0x7c/0xb0
[ 393.641012] irq event stamp: 96581116
[ 393.641012] hardirqs last enabled at (96581115): [<ffffffff81887ba0>] restore_args+0x0/0x30
[ 393.641012] hardirqs last disabled at (96581116): [<ffffffff8189022a>] apic_timer_interrupt+0x6a/0x80
[ 393.641012] softirqs last enabled at (96576304): [<ffffffff81051844>] __do_softirq+0x174/0x470
[ 393.641012] softirqs last disabled at (96576275): [<ffffffff81051ca6>] irq_exit+0x96/0xc0
[ 393.641012]
[ 393.641012] other info that might help us debug this:
[ 393.641012] Possible unsafe locking scenario:
[ 393.641012]
[ 393.641012] CPU0
[ 393.641012] ----
[ 393.641012] lock(&lock->wait_lock);
[ 393.641012] <Interrupt>
[ 393.641012] lock(&lock->wait_lock);
[ 393.641012]
[ 393.641012] *** DEADLOCK ***
[ 393.641012]
[ 393.641012] no locks held by rcu_torture_rea/697.
[ 393.641012]
[ 393.641012] stack backtrace:
[ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
[ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
[ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
[ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
[ 393.641012] Call Trace:
[ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
[ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
[ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
[ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
[ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
[ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
[ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
[ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
[ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
[ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
[ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
[ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
[ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
[ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
[ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
[ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
[ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
[ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
[ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
[ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
[ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
[ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
[ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
[ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
[ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260
[ 393.641012] [<ffffffff81030173>] local_apic_timer_interrupt+0x33/0x60
[ 393.641012] [<ffffffff8103059e>] smp_apic_timer_interrupt+0x3e/0x60
[ 393.641012] [<ffffffff8189022f>] apic_timer_interrupt+0x6f/0x80
[ 393.641012] <EOI> [<ffffffff810ee250>] ? rcu_scheduler_starting+0x60/0x60
[ 393.641012] [<ffffffff81072101>] ? __rcu_read_unlock+0x91/0xa0
[ 393.641012] [<ffffffff810e80e3>] rcu_torture_read_unlock+0x33/0x70
[ 393.641012] [<ffffffff810e8f54>] rcu_torture_reader+0xe4/0x450
[ 393.641012] [<ffffffff810e92c0>] ? rcu_torture_reader+0x450/0x450
[ 393.641012] [<ffffffff810e8e70>] ? rcutorture_trace_dump+0x30/0x30
[ 393.641012] [<ffffffff810759d6>] kthread+0xd6/0xe0
[ 393.641012] [<ffffffff818874bb>] ? _raw_spin_unlock_irq+0x2b/0x60
[ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130
[ 393.641012] [<ffffffff8188f52c>] ret_from_fork+0x7c/0xb0
[ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130

I don't see this without your patches.

.config attached. The other configurations completed without errors.
Short tests, 30 minutes per configuration.

Thoughts?

Thanx, Paul


Attachments:
.config (88.85 kB)

2013-08-08 01:13:08

by Lai Jiangshan

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

On 08/08/2013 03:29 AM, Carsten Emde wrote:
> Hi Paul,
>
>>> Although all articles declare that rcu read site is deadlock-immunity.
>>> It is not true for rcu-preempt, it will be deadlock if rcu read site
>>> overlaps with scheduler lock.
>>
>> The real rule is that if the scheduler does its outermost rcu_read_unlock()
>> with one of those locks held, it has to have avoided enabling preemption
>> through the entire RCU read-side critical section.
>>
>> That said, avoiding the need for this rule would be a good thing.
>>
>> How did you test this? The rcutorture tests will not exercise this.
>> (Intentionally so, given that it can deadlock!)
>>
>>> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
>>> is still not deadlock-immunity. And the problem described in 016a8d5b
>>> is still existed(rcu_read_unlock_special() calls wake_up).
>>>
>>> The problem is fixed in patch5.
>>
>> This is going to require some serious review and testing. One requirement
>> is that RCU priority boosting not persist significantly beyond the
>> re-enabling of interrupts associated with the irq-disabled lock. To do
>> otherwise breaks RCU priority boosting. At first glance, the added
>> set_need_resched() might handle this, but that is part of the review
>> and testing required.
>>
>> Steven, would you and Carsten be willing to try this and see if it
>> helps with the issues you are seeing in -rt? (My guess is "no", since
>> a deadlock would block forever rather than waking up after a couple
>> thousand seconds, but worth a try.)
> Your guess was correct, applying this patch doesn't heal the NO_HZ_FULL+PREEMPT_RT_FULL 3.10.4 based system; it still is hanging at -> synchronize_rcu -> wait_rcu_gp.
>
> -Carsten.
>

I couldn't find the problem you reported; could you give me a URL?

Thanx,
Lai

2013-08-08 01:43:51

by Lai Jiangshan

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

On 08/08/2013 08:36 AM, Paul E. McKenney wrote:
> On Wed, Aug 07, 2013 at 05:38:27AM -0700, Paul E. McKenney wrote:
>> On Wed, Aug 07, 2013 at 06:24:56PM +0800, Lai Jiangshan wrote:
>>> Although all articles declare that rcu read site is deadlock-immunity.
>>> It is not true for rcu-preempt, it will be deadlock if rcu read site
>>> overlaps with scheduler lock.
>>
>> The real rule is that if the scheduler does its outermost rcu_read_unlock()
>> with one of those locks held, it has to have avoided enabling preemption
>> through the entire RCU read-side critical section.
>>
>> That said, avoiding the need for this rule would be a good thing.
>>
>> How did you test this? The rcutorture tests will not exercise this.
>> (Intentionally so, given that it can deadlock!)
>>
>>> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
>>> is still not deadlock-immunity. And the problem described in 016a8d5b
>>> is still existed(rcu_read_unlock_special() calls wake_up).
>>>
>>> The problem is fixed in patch5.
>>
>> This is going to require some serious review and testing. One requirement
>> is that RCU priority boosting not persist significantly beyond the
>> re-enabling of interrupts associated with the irq-disabled lock. To do
>> otherwise breaks RCU priority boosting. At first glance, the added
>> set_need_resched() might handle this, but that is part of the review
>> and testing required.
>>
>> Steven, would you and Carsten be willing to try this and see if it
>> helps with the issues you are seeing in -rt? (My guess is "no", since
>> a deadlock would block forever rather than waking up after a couple
>> thousand seconds, but worth a try.)
>
> No joy from either Steven or Carsten on the -rt hangs.
>
> I pushed this to -rcu and ran tests. I hit this in one of the
> configurations:
>
> [ 393.641012] =================================
> [ 393.641012] [ INFO: inconsistent lock state ]
> [ 393.641012] 3.11.0-rc1+ #1 Not tainted
> [ 393.641012] ---------------------------------
> [ 393.641012] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
> [ 393.641012] rcu_torture_rea/697 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [ 393.641012] (&lock->wait_lock){?.+...}, at: [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
> [ 393.641012] {HARDIRQ-ON-W} state was registered at:
> [ 393.641012] [<ffffffff810aecb1>] __lock_acquire+0x651/0x1d40
> [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
> [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
> [ 393.641012] [<ffffffff81886329>] rt_mutex_slowlock+0x39/0x170
> [ 393.641012] [<ffffffff818864ca>] rt_mutex_lock+0x2a/0x30
> [ 393.641012] [<ffffffff810ebc03>] rcu_boost_kthread+0x173/0x800
> [ 393.641012] [<ffffffff810759d6>] kthread+0xd6/0xe0
> [ 393.641012] [<ffffffff8188f52c>] ret_from_fork+0x7c/0xb0
> [ 393.641012] irq event stamp: 96581116
> [ 393.641012] hardirqs last enabled at (96581115): [<ffffffff81887ba0>] restore_args+0x0/0x30
> [ 393.641012] hardirqs last disabled at (96581116): [<ffffffff8189022a>] apic_timer_interrupt+0x6a/0x80
> [ 393.641012] softirqs last enabled at (96576304): [<ffffffff81051844>] __do_softirq+0x174/0x470
> [ 393.641012] softirqs last disabled at (96576275): [<ffffffff81051ca6>] irq_exit+0x96/0xc0
> [ 393.641012]
> [ 393.641012] other info that might help us debug this:
> [ 393.641012] Possible unsafe locking scenario:
> [ 393.641012]
> [ 393.641012] CPU0
> [ 393.641012] ----
> [ 393.641012] lock(&lock->wait_lock);
> [ 393.641012] <Interrupt>
> [ 393.641012] lock(&lock->wait_lock);

Patch 2 causes it!
When I collected all the locks which can be (chained) nested in
rcu_read_unlock_special(), I didn't notice that rtmutex's lock->wait_lock
is not always taken with irqs disabled.

Two ways to fix it:
1) change rtmutex's lock->wait_lock so that it is always taken with irqs
   disabled (rough sketch below)
2) revert my patch 2
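
A rough sketch of option 1 (hypothetical; every wait_lock taker in
kernel/rtmutex.c, e.g. rt_mutex_slowunlock(), would need the same
treatment):

-	raw_spin_lock(&lock->wait_lock);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	...
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);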

> [ 393.641012]
> [ 393.641012] *** DEADLOCK ***
> [ 393.641012]
> [ 393.641012] no locks held by rcu_torture_rea/697.
> [ 393.641012]
> [ 393.641012] stack backtrace:
> [ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
> [ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> [ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
> [ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
> [ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
> [ 393.641012] Call Trace:
> [ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
> [ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
> [ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
> [ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
> [ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
> [ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
> [ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> [ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
> [ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
> [ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
> [ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
> [ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
> [ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
> [ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
> [ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
> [ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
> [ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
> [ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
> [ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260
> [ 393.641012] [<ffffffff81030173>] local_apic_timer_interrupt+0x33/0x60
> [ 393.641012] [<ffffffff8103059e>] smp_apic_timer_interrupt+0x3e/0x60
> [ 393.641012] [<ffffffff8189022f>] apic_timer_interrupt+0x6f/0x80
> [ 393.641012] <EOI> [<ffffffff810ee250>] ? rcu_scheduler_starting+0x60/0x60
> [ 393.641012] [<ffffffff81072101>] ? __rcu_read_unlock+0x91/0xa0
> [ 393.641012] [<ffffffff810e80e3>] rcu_torture_read_unlock+0x33/0x70
> [ 393.641012] [<ffffffff810e8f54>] rcu_torture_reader+0xe4/0x450
> [ 393.641012] [<ffffffff810e92c0>] ? rcu_torture_reader+0x450/0x450
> [ 393.641012] [<ffffffff810e8e70>] ? rcutorture_trace_dump+0x30/0x30
> [ 393.641012] [<ffffffff810759d6>] kthread+0xd6/0xe0
> [ 393.641012] [<ffffffff818874bb>] ? _raw_spin_unlock_irq+0x2b/0x60
> [ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130
> [ 393.641012] [<ffffffff8188f52c>] ret_from_fork+0x7c/0xb0
> [ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130
>
> I don't see this without your patches.
>
> .config attached. The other configurations completed without errors.
> Short tests, 30 minutes per configuration.
>
> Thoughts?
>
> Thanx, Paul

2013-08-08 02:12:16

by Steven Rostedt

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:

> > [ 393.641012] CPU0
> > [ 393.641012] ----
> > [ 393.641012] lock(&lock->wait_lock);
> > [ 393.641012] <Interrupt>
> > [ 393.641012] lock(&lock->wait_lock);
>
> Patch2 causes it!
> When I found all lock which can (chained) nested in rcu_read_unlock_special(),
> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
>
> Two ways to fix it:
> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
> 2) revert my patch2

Your patch 2 states:

"After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
when preemption)"

But then below we have:


>
> > [ 393.641012]
> > [ 393.641012] *** DEADLOCK ***
> > [ 393.641012]
> > [ 393.641012] no locks held by rcu_torture_rea/697.
> > [ 393.641012]
> > [ 393.641012] stack backtrace:
> > [ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
> > [ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> > [ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
> > [ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
> > [ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
> > [ 393.641012] Call Trace:
> > [ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
> > [ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
> > [ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
> > [ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
> > [ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
> > [ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
> > [ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
> > [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> > [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> > [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
> > [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> > [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
> > [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> > [ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
> > [ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
> > [ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
> > [ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
> > [ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
> > [ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
> > [ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
> > [ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
> > [ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
> > [ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
> > [ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
> > [ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260

The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
Did it first call a rt_mutex_lock?

If patch two was the culprit, I'm thinking the idea behind patch two is
wrong. The only option is to remove patch number two!

Or perhaps I missed something.

-- Steve


> > [ 393.641012] [<ffffffff81030173>] local_apic_timer_interrupt+0x33/0x60
> > [ 393.641012] [<ffffffff8103059e>] smp_apic_timer_interrupt+0x3e/0x60
> > [ 393.641012] [<ffffffff8189022f>] apic_timer_interrupt+0x6f/0x80
> > [ 393.641012] <EOI> [<ffffffff810ee250>] ? rcu_scheduler_starting+0x60/0x60
> > [ 393.641012] [<ffffffff81072101>] ? __rcu_read_unlock+0x91/0xa0
> > [ 393.641012] [<ffffffff810e80e3>] rcu_torture_read_unlock+0x33/0x70
> > [ 393.641012] [<ffffffff810e8f54>] rcu_torture_reader+0xe4/0x450
> > [ 393.641012] [<ffffffff810e92c0>] ? rcu_torture_reader+0x450/0x450
> > [ 393.641012] [<ffffffff810e8e70>] ? rcutorture_trace_dump+0x30/0x30
> > [ 393.641012] [<ffffffff810759d6>] kthread+0xd6/0xe0
> > [ 393.641012] [<ffffffff818874bb>] ? _raw_spin_unlock_irq+0x2b/0x60
> > [ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130
> > [ 393.641012] [<ffffffff8188f52c>] ret_from_fork+0x7c/0xb0
> > [ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130
> >
> > I don't see this without your patches.
> >
> > .config attached. The other configurations completed without errors.
> > Short tests, 30 minutes per configuration.
> >
> > Thoughts?
> >
> > Thanx, Paul

2013-08-08 02:29:11

by Lai Jiangshan

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

On 08/08/2013 10:12 AM, Steven Rostedt wrote:
> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
>
>>> [ 393.641012] CPU0
>>> [ 393.641012] ----
>>> [ 393.641012] lock(&lock->wait_lock);
>>> [ 393.641012] <Interrupt>
>>> [ 393.641012] lock(&lock->wait_lock);
>>
>> Patch2 causes it!
>> When I found all lock which can (chained) nested in rcu_read_unlock_special(),
>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
>>
>> Two ways to fix it:
>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
>> 2) revert my patch2
>
> Your patch 2 states:
>
> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
> when preemption)"

Patch 5 adds "special & RCU_READ_UNLOCK_BLOCKED" back in irq and softirq
context. That new case is handled in patch 5, unless I did something wrong
there. (What I didn't notice in patch 5 is that rtmutex's lock->wait_lock
is not irqs-disabled.)

>
> But then below we have:
>
>
>>
>>> [ 393.641012]
>>> [ 393.641012] *** DEADLOCK ***
>>> [ 393.641012]
>>> [ 393.641012] no locks held by rcu_torture_rea/697.
>>> [ 393.641012]
>>> [ 393.641012] stack backtrace:
>>> [ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
>>> [ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
>>> [ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
>>> [ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
>>> [ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
>>> [ 393.641012] Call Trace:
>>> [ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
>>> [ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
>>> [ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
>>> [ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
>>> [ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
>>> [ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
>>> [ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
>>> [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
>>> [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
>>> [ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
>>> [ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
>>> [ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
>>> [ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
>>> [ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
>>> [ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
>>> [ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
>>> [ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
>>> [ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
>>> [ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
>>> [ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
>>> [ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260
>
> The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
> Did it first call a rt_mutex_lock?
>
> If patch two was the culprit, I'm thinking the idea behind patch two is
> wrong. The only option is to remove patch number two!

Removing patch number two can solve the problem Paul found, but it is not
the best option, because then I can't declare that RCU is deadlock-immune
(it will still deadlock if the RCU read site overlaps with rtmutex's
lock->wait_lock if I only remove patch 2).
I must do more work, but I think that is still better than changing
rtmutex's lock->wait_lock.

Thanks,
Lai

>
> Or perhaps I missed something.
>
> -- Steve
>
>
>>> [ 393.641012] [<ffffffff81030173>] local_apic_timer_interrupt+0x33/0x60
>>> [ 393.641012] [<ffffffff8103059e>] smp_apic_timer_interrupt+0x3e/0x60
>>> [ 393.641012] [<ffffffff8189022f>] apic_timer_interrupt+0x6f/0x80
>>> [ 393.641012] <EOI> [<ffffffff810ee250>] ? rcu_scheduler_starting+0x60/0x60
>>> [ 393.641012] [<ffffffff81072101>] ? __rcu_read_unlock+0x91/0xa0
>>> [ 393.641012] [<ffffffff810e80e3>] rcu_torture_read_unlock+0x33/0x70
>>> [ 393.641012] [<ffffffff810e8f54>] rcu_torture_reader+0xe4/0x450
>>> [ 393.641012] [<ffffffff810e92c0>] ? rcu_torture_reader+0x450/0x450
>>> [ 393.641012] [<ffffffff810e8e70>] ? rcutorture_trace_dump+0x30/0x30
>>> [ 393.641012] [<ffffffff810759d6>] kthread+0xd6/0xe0
>>> [ 393.641012] [<ffffffff818874bb>] ? _raw_spin_unlock_irq+0x2b/0x60
>>> [ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130
>>> [ 393.641012] [<ffffffff8188f52c>] ret_from_fork+0x7c/0xb0
>>> [ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130
>>>
>>> I don't see this without your patches.
>>>
>>> .config attached. The other configurations completed without errors.
>>> Short tests, 30 minutes per configuration.
>>>
>>> Thoughts?
>>>
>>> Thanx, Paul
>
>
>

2013-08-08 02:34:05

by Paul E. McKenney

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

On Thu, Aug 08, 2013 at 10:33:15AM +0800, Lai Jiangshan wrote:
> On 08/08/2013 10:12 AM, Steven Rostedt wrote:
> > On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
> >
> >>> [ 393.641012] CPU0
> >>> [ 393.641012] ----
> >>> [ 393.641012] lock(&lock->wait_lock);
> >>> [ 393.641012] <Interrupt>
> >>> [ 393.641012] lock(&lock->wait_lock);
> >>
> >> Patch2 causes it!
> >> When I found all lock which can (chained) nested in rcu_read_unlock_special(),
> >> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
> >>
> >> Two ways to fix it:
> >> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
> >> 2) revert my patch2
> >
> > Your patch 2 states:
> >
> > "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
> > in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
> > when preemption)"
>
> Patch5 adds "special & RCU_READ_UNLOCK_BLOCKED" back in irq nor softirq.
> This new thing is handle in patch5 if I did not do wrong things in patch5.
> (I don't notice rtmutex's lock->wait_lock is not irqs-disabled in patch5)
>
> >
> > But then below we have:
> >
> >
> >>
> >>> [ 393.641012]
> >>> [ 393.641012] *** DEADLOCK ***
> >>> [ 393.641012]
> >>> [ 393.641012] no locks held by rcu_torture_rea/697.
> >>> [ 393.641012]
> >>> [ 393.641012] stack backtrace:
> >>> [ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
> >>> [ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> >>> [ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
> >>> [ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
> >>> [ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
> >>> [ 393.641012] Call Trace:
> >>> [ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
> >>> [ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
> >>> [ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
> >>> [ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
> >>> [ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
> >>> [ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
> >>> [ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
> >>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> >>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> >>> [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
> >>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> >>> [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
> >>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> >>> [ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
> >>> [ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
> >>> [ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
> >>> [ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
> >>> [ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
> >>> [ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
> >>> [ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
> >>> [ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
> >>> [ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
> >>> [ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
> >>> [ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
> >>> [ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260
> >
> > The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
> > Did it first call a rt_mutex_lock?
> >
> > If patch two was the culprit, I'm thinking the idea behind patch two is
> > wrong. The only option is to remove patch number two!
>
> removing patch number two can solve the problem found be Paul, but it is not the best.
> because I can't declare that rcu is deadlock-immunity
> (it will be deadlock if rcu read site overlaps with rtmutex's lock->wait_lock
> if I only remove patch2)
> I must do more things, but I think it is still better than changing rtmutex's lock->wait_lock.

NP, I will remove your current patches and wait for an updated set.

Thanx, Paul

2013-08-08 02:41:44

by Lai Jiangshan

Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immune

On 08/08/2013 10:12 AM, Steven Rostedt wrote:
> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
>
>>> [ 393.641012] CPU0
>>> [ 393.641012] ----
>>> [ 393.641012] lock(&lock->wait_lock);
>>> [ 393.641012] <Interrupt>
>>> [ 393.641012] lock(&lock->wait_lock);
>>
>> Patch2 causes it!
>> When I found all lock which can (chained) nested in rcu_read_unlock_special(),
>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
>>
>> Two ways to fix it:
>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
>> 2) revert my patch2
>
> Your patch 2 states:
>
> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
> when preemption)"
>
> But then below we have:
>
>
>>
>>> [ 393.641012]
>>> [ 393.641012] *** DEADLOCK ***
>>> [ 393.641012]
>>> [ 393.641012] no locks held by rcu_torture_rea/697.
>>> [ 393.641012]
>>> [ 393.641012] stack backtrace:
>>> [ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
>>> [ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
>>> [ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
>>> [ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
>>> [ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
>>> [ 393.641012] Call Trace:
>>> [ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
>>> [ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
>>> [ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
>>> [ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
>>> [ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
>>> [ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
>>> [ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
>>> [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
>>> [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
>>> [ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
>>> [ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
>>> [ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
>>> [ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
>>> [ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
>>> [ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
>>> [ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
>>> [ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
>>> [ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
>>> [ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
>>> [ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
>>> [ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260
>
> The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
> Did it first call a rt_mutex_lock?

Sorry, I forgot to answer this question of yours.
The rt_mutex is acquired by proxy, on behalf of the to-be-boosted reader, in the rcu_boost thread:

rt_mutex_init_proxy_locked(&mtx, t);
t->rcu_boost_mutex = &mtx;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */
rt_mutex_unlock(&mtx); /* Keep lockdep happy. */
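
And the matching deboost happens later on the reader side. Roughly, as a sketch
of the CONFIG_RCU_BOOST part of rcu_read_unlock_special() (from memory; the
surrounding rnp->lock handling and blocked-task list manipulation are omitted):

#ifdef CONFIG_RCU_BOOST
	/* Snapshot and clear ->rcu_boost_mutex while rnp->lock is still held. */
	if (t->rcu_boost_mutex) {
		rbmp = t->rcu_boost_mutex;
		t->rcu_boost_mutex = NULL;
	}
#endif
	...
#ifdef CONFIG_RCU_BOOST
	/* Unboost: this is the rt_mutex_unlock() seen in the splat above. */
	if (rbmp)
		rt_mutex_unlock(rbmp);
#endif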

>
> If patch two was the culprit, I'm thinking the idea behind patch two is
> wrong. The only option is to remove patch number two!
>
> Or perhaps I missed something.
>
> -- Steve
>
>
>>> [ 393.641012] [<ffffffff81030173>] local_apic_timer_interrupt+0x33/0x60
>>> [ 393.641012] [<ffffffff8103059e>] smp_apic_timer_interrupt+0x3e/0x60
>>> [ 393.641012] [<ffffffff8189022f>] apic_timer_interrupt+0x6f/0x80
>>> [ 393.641012] <EOI> [<ffffffff810ee250>] ? rcu_scheduler_starting+0x60/0x60
>>> [ 393.641012] [<ffffffff81072101>] ? __rcu_read_unlock+0x91/0xa0
>>> [ 393.641012] [<ffffffff810e80e3>] rcu_torture_read_unlock+0x33/0x70
>>> [ 393.641012] [<ffffffff810e8f54>] rcu_torture_reader+0xe4/0x450
>>> [ 393.641012] [<ffffffff810e92c0>] ? rcu_torture_reader+0x450/0x450
>>> [ 393.641012] [<ffffffff810e8e70>] ? rcutorture_trace_dump+0x30/0x30
>>> [ 393.641012] [<ffffffff810759d6>] kthread+0xd6/0xe0
>>> [ 393.641012] [<ffffffff818874bb>] ? _raw_spin_unlock_irq+0x2b/0x60
>>> [ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130
>>> [ 393.641012] [<ffffffff8188f52c>] ret_from_fork+0x7c/0xb0
>>> [ 393.641012] [<ffffffff81075900>] ? flush_kthread_worker+0x130/0x130
>>>
>>> I don't see this without your patches.
>>>
>>> .config attached. The other configurations completed without errors.
>>> Short tests, 30 minutes per configuration.
>>>
>>> Thoughts?
>>>
>>> Thanx, Paul
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-08-08 03:06:42

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

On 08/08/2013 10:33 AM, Paul E. McKenney wrote:
> On Thu, Aug 08, 2013 at 10:33:15AM +0800, Lai Jiangshan wrote:
>> On 08/08/2013 10:12 AM, Steven Rostedt wrote:
>>> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
>>>
>>>>> [ 393.641012] CPU0
>>>>> [ 393.641012] ----
>>>>> [ 393.641012] lock(&lock->wait_lock);
>>>>> [ 393.641012] <Interrupt>
>>>>> [ 393.641012] lock(&lock->wait_lock);
>>>>
>>>> Patch2 causes it!
>>>> When I found all lock which can (chained) nested in rcu_read_unlock_special(),
>>>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
>>>>
>>>> Two ways to fix it:
>>>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
>>>> 2) revert my patch2
>>>
>>> Your patch 2 states:
>>>
>>> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
>>> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
>>> when preemption)"
>>
>> Patch5 adds "special & RCU_READ_UNLOCK_BLOCKED" back in irq nor softirq.
>> This new thing is handle in patch5 if I did not do wrong things in patch5.
>> (I don't notice rtmutex's lock->wait_lock is not irqs-disabled in patch5)
>>
>>>
>>> But then below we have:
>>>
>>>
>>>>
>>>>> [ 393.641012]
>>>>> [ 393.641012] *** DEADLOCK ***
>>>>> [ 393.641012]
>>>>> [ 393.641012] no locks held by rcu_torture_rea/697.
>>>>> [ 393.641012]
>>>>> [ 393.641012] stack backtrace:
>>>>> [ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
>>>>> [ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
>>>>> [ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
>>>>> [ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
>>>>> [ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
>>>>> [ 393.641012] Call Trace:
>>>>> [ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
>>>>> [ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
>>>>> [ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
>>>>> [ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
>>>>> [ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
>>>>> [ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
>>>>> [ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
>>>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
>>>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
>>>>> [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
>>>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
>>>>> [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
>>>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
>>>>> [ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
>>>>> [ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
>>>>> [ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
>>>>> [ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
>>>>> [ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
>>>>> [ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
>>>>> [ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
>>>>> [ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
>>>>> [ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
>>>>> [ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
>>>>> [ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
>>>>> [ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260
>>>
>>> The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
>>> Did it first call a rt_mutex_lock?
>>>
>>> If patch two was the culprit, I'm thinking the idea behind patch two is
>>> wrong. The only option is to remove patch number two!
>>
>> removing patch number two can solve the problem found be Paul, but it is not the best.
>> because I can't declare that rcu is deadlock-immunity
>> (it will be deadlock if rcu read site overlaps with rtmutex's lock->wait_lock
>> if I only remove patch2)
>> I must do more things, but I think it is still better than changing rtmutex's lock->wait_lock.
>
> NP, I will remove your current patches and wait for an updated set.

Hi, Paul

Would you agree to moving the rt_mutex_unlock() into rcu_preempt_note_context_switch()?

thanks,
Lai

>
> Thanx, Paul
>
>

2013-08-08 04:18:35

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

On Thu, Aug 08, 2013 at 11:10:47AM +0800, Lai Jiangshan wrote:
> On 08/08/2013 10:33 AM, Paul E. McKenney wrote:
> > On Thu, Aug 08, 2013 at 10:33:15AM +0800, Lai Jiangshan wrote:
> >> On 08/08/2013 10:12 AM, Steven Rostedt wrote:
> >>> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
> >>>
> >>>>> [ 393.641012] CPU0
> >>>>> [ 393.641012] ----
> >>>>> [ 393.641012] lock(&lock->wait_lock);
> >>>>> [ 393.641012] <Interrupt>
> >>>>> [ 393.641012] lock(&lock->wait_lock);
> >>>>
> >>>> Patch2 causes it!
> >>>> When I found all lock which can (chained) nested in rcu_read_unlock_special(),
> >>>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
> >>>>
> >>>> Two ways to fix it:
> >>>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
> >>>> 2) revert my patch2
> >>>
> >>> Your patch 2 states:
> >>>
> >>> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
> >>> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
> >>> when preemption)"
> >>
> >> Patch5 adds "special & RCU_READ_UNLOCK_BLOCKED" back in irq nor softirq.
> >> This new thing is handle in patch5 if I did not do wrong things in patch5.
> >> (I don't notice rtmutex's lock->wait_lock is not irqs-disabled in patch5)
> >>
> >>>
> >>> But then below we have:
> >>>
> >>>
> >>>>
> >>>>> [ 393.641012]
> >>>>> [ 393.641012] *** DEADLOCK ***
> >>>>> [ 393.641012]
> >>>>> [ 393.641012] no locks held by rcu_torture_rea/697.
> >>>>> [ 393.641012]
> >>>>> [ 393.641012] stack backtrace:
> >>>>> [ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
> >>>>> [ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> >>>>> [ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
> >>>>> [ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
> >>>>> [ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
> >>>>> [ 393.641012] Call Trace:
> >>>>> [ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
> >>>>> [ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
> >>>>> [ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
> >>>>> [ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
> >>>>> [ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
> >>>>> [ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
> >>>>> [ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
> >>>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> >>>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> >>>>> [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
> >>>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> >>>>> [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
> >>>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> >>>>> [ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100

The really strange thing here is that I thought that your passing false
in as the new second parameter to rcu_read_unlock_special() was supposed
to prevent rt_mutex_unlock() from being called.

But then why is the call from rcu_preempt_note_context_switch() also
passing false? I would have expected that one to pass true. Probably
I don't understand your intent with the "unlock" argument.

> >>>>> [ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
> >>>>> [ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
> >>>>> [ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
> >>>>> [ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
> >>>>> [ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
> >>>>> [ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
> >>>>> [ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
> >>>>> [ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
> >>>>> [ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
> >>>>> [ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
> >>>>> [ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260
> >>>
> >>> The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
> >>> Did it first call a rt_mutex_lock?
> >>>
> >>> If patch two was the culprit, I'm thinking the idea behind patch two is
> >>> wrong. The only option is to remove patch number two!
> >>
> >> removing patch number two can solve the problem found be Paul, but it is not the best.
> >> because I can't declare that rcu is deadlock-immunity
> >> (it will be deadlock if rcu read site overlaps with rtmutex's lock->wait_lock
> >> if I only remove patch2)
> >> I must do more things, but I think it is still better than changing rtmutex's lock->wait_lock.
> >
> > NP, I will remove your current patches and wait for an updated set.
>
> Hi, Paul
>
> Would you agree to moving the rt_mutex_unlock() into rcu_preempt_note_context_switch()?

My guess is that changing rcu_preempt_note_context_switch()'s call to
rcu_read_unlock_special(t, true) would accomplish this in a nicer way.
Except that I would first need to understand why rcu_check_callbacks()'s
call to rcu_read_unlock_special() resulted in rt_mutex_unlock() being
called.

Oh!

In rcu_read_unlock_special, shouldn't the:

if (unlikely(unlock && irqs_disabled_flags(flags))) {

Instead be:

if (unlikely(!unlock || irqs_disabled_flags(flags))) {

Here I am guessing that the "unlock" parameter means "It is OK for
rcu_read_unlock_special() to invoke rt_mutex_unlock()", so it would be
passed in as false from rcu_check_callbacks() and true everywhere else.
If it means something else, please let me know what that might be.
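
For comparison, spelling both guards out with the deferral body from the posted
patch (a sketch only, to make the two readings concrete):

	/* As posted: defer only on the rcu_read_unlock() path when irqs are off. */
	if (unlikely(unlock && irqs_disabled_flags(flags))) {
		set_need_resched();
		local_irq_restore(flags);
		return;
	}

	/* Under the "deboosting is safe here" reading of "unlock" guessed above: */
	if (unlikely(!unlock || irqs_disabled_flags(flags))) {
		set_need_resched();
		local_irq_restore(flags);
		return;
	}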

Though at this point, it is not clear to me how it helps to call
rcu_read_unlock_special() from rcu_check_callbacks(). After all,
rcu_check_callbacks() has interrupts disabled always, and so it is never
safe for anything that it calls to invoke rt_mutex_unlock().

In any case, the opinion that really matters is not mine, but rather
that of the hundreds of millions of computer systems that might soon be
executing this code. As RCU maintainer, I just try my best to predict
what their opinions will be. ;-)

Thanx, Paul

> thanks,
> Lai
>
> >
> > Thanx, Paul
> >
> >
>

2013-08-08 05:23:50

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

On 08/08/2013 12:18 PM, Paul E. McKenney wrote:
> On Thu, Aug 08, 2013 at 11:10:47AM +0800, Lai Jiangshan wrote:
>> On 08/08/2013 10:33 AM, Paul E. McKenney wrote:
>>> On Thu, Aug 08, 2013 at 10:33:15AM +0800, Lai Jiangshan wrote:
>>>> On 08/08/2013 10:12 AM, Steven Rostedt wrote:
>>>>> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
>>>>>
>>>>>>> [ 393.641012] CPU0
>>>>>>> [ 393.641012] ----
>>>>>>> [ 393.641012] lock(&lock->wait_lock);
>>>>>>> [ 393.641012] <Interrupt>
>>>>>>> [ 393.641012] lock(&lock->wait_lock);
>>>>>>
>>>>>> Patch2 causes it!
>>>>>> When I found all lock which can (chained) nested in rcu_read_unlock_special(),
>>>>>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
>>>>>>
>>>>>> Two ways to fix it:
>>>>>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
>>>>>> 2) revert my patch2
>>>>>
>>>>> Your patch 2 states:
>>>>>
>>>>> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
>>>>> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
>>>>> when preemption)"
>>>>
>>>> Patch5 adds "special & RCU_READ_UNLOCK_BLOCKED" back in irq nor softirq.
>>>> This new thing is handle in patch5 if I did not do wrong things in patch5.
>>>> (I don't notice rtmutex's lock->wait_lock is not irqs-disabled in patch5)
>>>>
>>>>>
>>>>> But then below we have:
>>>>>
>>>>>
>>>>>>
>>>>>>> [ 393.641012]
>>>>>>> [ 393.641012] *** DEADLOCK ***
>>>>>>> [ 393.641012]
>>>>>>> [ 393.641012] no locks held by rcu_torture_rea/697.
>>>>>>> [ 393.641012]
>>>>>>> [ 393.641012] stack backtrace:
>>>>>>> [ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
>>>>>>> [ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
>>>>>>> [ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
>>>>>>> [ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
>>>>>>> [ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
>>>>>>> [ 393.641012] Call Trace:
>>>>>>> [ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
>>>>>>> [ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
>>>>>>> [ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
>>>>>>> [ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
>>>>>>> [ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
>>>>>>> [ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
>>>>>>> [ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
>>>>>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
>>>>>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
>>>>>>> [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
>>>>>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
>>>>>>> [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
>>>>>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
>>>>>>> [ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
>
> The really strange thing here is that I thought that your passing false
> in as the new second parameter to rcu_read_unlock_special() was supposed
> to prevent rt_mutex_unlock() from being called.
>
> But then why is the call from rcu_preempt_note_context_switch() also
> passing false? I would have expected that one to pass true. Probably
> I don't understand your intent with the "unlock" argument.
>
>>>>>>> [ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
>>>>>>> [ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
>>>>>>> [ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
>>>>>>> [ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
>>>>>>> [ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
>>>>>>> [ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
>>>>>>> [ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
>>>>>>> [ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
>>>>>>> [ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
>>>>>>> [ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
>>>>>>> [ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260
>>>>>
>>>>> The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
>>>>> Did it first call a rt_mutex_lock?
>>>>>
>>>>> If patch two was the culprit, I'm thinking the idea behind patch two is
>>>>> wrong. The only option is to remove patch number two!
>>>>
>>>> removing patch number two can solve the problem found be Paul, but it is not the best.
>>>> because I can't declare that rcu is deadlock-immunity
>>>> (it will be deadlock if rcu read site overlaps with rtmutex's lock->wait_lock
>>>> if I only remove patch2)
>>>> I must do more things, but I think it is still better than changing rtmutex's lock->wait_lock.
>>>
>>> NP, I will remove your current patches and wait for an updated set.
>>
>> Hi, Paul
>>
>> Would you agree to moving the rt_mutex_unlock() into rcu_preempt_note_context_switch()?
>
> My guess is that changing rcu_preempt_note_context_switch()'s call to
> rcu_read_unlock_special(t, true) would accomplish this in a nicer way.
> Except that I would first need to understand why rcu_check_callbacks()'s
> call to rcu_read_unlock_special() resulted in rt_mutex_unlock() being
> called.
>
> Oh!
>
> In rcu_read_unlock_special, shouldn't the:
>
> if (unlikely(unlock && irqs_disabled_flags(flags))) {

Sorry.
@unlock means rcu_read_unlock_special() is being called from the rcu_read_unlock() path.

If rcu_read_unlock() is called with irqs disabled, it may be running inside a
suspect lock (scheduler locks, ...), so rcu_read_unlock_special() must not
acquire rnp->lock or invoke wake_up() in that case.
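
In patch5 the two call sites differ only in this flag (copied from the diff,
context trimmed):

	/* __rcu_read_unlock(): the rcu_read_unlock() path. */
	if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
		rcu_read_unlock_special(t, true);

	/* rcu_preempt_note_context_switch(): where the deferred work is finished. */
	if (t->rcu_read_unlock_special)
		rcu_read_unlock_special(t, false);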


>
> Instead be:
>
> if (unlikely(!unlock || irqs_disabled_flags(flags))) {
>
> Here I am guessing that the "unlock" parameter means "It is OK for
> rcu_read_unlock_special() to invoke rt_mutex_unlock()", so it would be
> passed in as false from rcu_check_callbacks() and true everywhere else.
> If it means something else, please let me know what that might be.
>
> Though at this point, it is not clear to me how it helps to call
> rcu_read_unlock_special() from rcu_check_callbacks(). After all,
> rcu_check_callbacks() has interrupts disabled always, and so it is never
> safe for anything that it calls to invoke rt_mutex_unlock().
>
> In any case, the opinion that really matters is not mine, but rather
> that of the hundreds of millions of computer systems that might soon be
> executing this code. As RCU maintainer, I just try my best to predict
> what their opinions will be. ;-)
>
> Thanx, Paul
>
>> thanks,
>> Lai
>>
>>>
>>> Thanx, Paul
>>>
>>>
>>
>
>

2013-08-08 07:06:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

On Thu, Aug 08, 2013 at 01:27:55PM +0800, Lai Jiangshan wrote:
> On 08/08/2013 12:18 PM, Paul E. McKenney wrote:
> > On Thu, Aug 08, 2013 at 11:10:47AM +0800, Lai Jiangshan wrote:
> >> On 08/08/2013 10:33 AM, Paul E. McKenney wrote:
> >>> On Thu, Aug 08, 2013 at 10:33:15AM +0800, Lai Jiangshan wrote:
> >>>> On 08/08/2013 10:12 AM, Steven Rostedt wrote:
> >>>>> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
> >>>>>
> >>>>>>> [ 393.641012] CPU0
> >>>>>>> [ 393.641012] ----
> >>>>>>> [ 393.641012] lock(&lock->wait_lock);
> >>>>>>> [ 393.641012] <Interrupt>
> >>>>>>> [ 393.641012] lock(&lock->wait_lock);
> >>>>>>
> >>>>>> Patch2 causes it!
> >>>>>> When I found all lock which can (chained) nested in rcu_read_unlock_special(),
> >>>>>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
> >>>>>>
> >>>>>> Two ways to fix it:
> >>>>>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
> >>>>>> 2) revert my patch2
> >>>>>
> >>>>> Your patch 2 states:
> >>>>>
> >>>>> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
> >>>>> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
> >>>>> when preemption)"
> >>>>
> >>>> Patch5 adds "special & RCU_READ_UNLOCK_BLOCKED" back in irq nor softirq.
> >>>> This new thing is handle in patch5 if I did not do wrong things in patch5.
> >>>> (I don't notice rtmutex's lock->wait_lock is not irqs-disabled in patch5)
> >>>>
> >>>>>
> >>>>> But then below we have:
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>> [ 393.641012]
> >>>>>>> [ 393.641012] *** DEADLOCK ***
> >>>>>>> [ 393.641012]
> >>>>>>> [ 393.641012] no locks held by rcu_torture_rea/697.
> >>>>>>> [ 393.641012]
> >>>>>>> [ 393.641012] stack backtrace:
> >>>>>>> [ 393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ #1
> >>>>>>> [ 393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> >>>>>>> [ 393.641012] ffffffff8586fea0 ffff88001fcc3a78 ffffffff8187b4cb ffffffff8104a261
> >>>>>>> [ 393.641012] ffff88001e1a20c0 ffff88001fcc3ad8 ffffffff818773e4 0000000000000000
> >>>>>>> [ 393.641012] ffff880000000000 ffff880000000001 ffffffff81010a0a 0000000000000001
> >>>>>>> [ 393.641012] Call Trace:
> >>>>>>> [ 393.641012] <IRQ> [<ffffffff8187b4cb>] dump_stack+0x4f/0x84
> >>>>>>> [ 393.641012] [<ffffffff8104a261>] ? console_unlock+0x291/0x410
> >>>>>>> [ 393.641012] [<ffffffff818773e4>] print_usage_bug+0x1f5/0x206
> >>>>>>> [ 393.641012] [<ffffffff81010a0a>] ? save_stack_trace+0x2a/0x50
> >>>>>>> [ 393.641012] [<ffffffff810ae603>] mark_lock+0x283/0x2e0
> >>>>>>> [ 393.641012] [<ffffffff810ada10>] ? print_irq_inversion_bug.part.40+0x1f0/0x1f0
> >>>>>>> [ 393.641012] [<ffffffff810aef66>] __lock_acquire+0x906/0x1d40
> >>>>>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> >>>>>>> [ 393.641012] [<ffffffff810ae94b>] ? __lock_acquire+0x2eb/0x1d40
> >>>>>>> [ 393.641012] [<ffffffff810b0a65>] lock_acquire+0x95/0x210
> >>>>>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> >>>>>>> [ 393.641012] [<ffffffff81886d26>] _raw_spin_lock+0x36/0x50
> >>>>>>> [ 393.641012] [<ffffffff818860f3>] ? rt_mutex_unlock+0x53/0x100
> >>>>>>> [ 393.641012] [<ffffffff818860f3>] rt_mutex_unlock+0x53/0x100
> >
> > The really strange thing here is that I thought that your passing false
> > in as the new second parameter to rcu_read_unlock_special() was supposed
> > to prevent rt_mutex_unlock() from being called.
> >
> > But then why is the call from rcu_preempt_note_context_switch() also
> > passing false? I would have expected that one to pass true. Probably
> > I don't understand your intent with the "unlock" argument.
> >
> >>>>>>> [ 393.641012] [<ffffffff810ee3ca>] rcu_read_unlock_special+0x17a/0x2a0
> >>>>>>> [ 393.641012] [<ffffffff810ee803>] rcu_check_callbacks+0x313/0x950
> >>>>>>> [ 393.641012] [<ffffffff8107a6bd>] ? hrtimer_run_queues+0x1d/0x180
> >>>>>>> [ 393.641012] [<ffffffff810abb9d>] ? trace_hardirqs_off+0xd/0x10
> >>>>>>> [ 393.641012] [<ffffffff8105bae3>] update_process_times+0x43/0x80
> >>>>>>> [ 393.641012] [<ffffffff810a9801>] tick_sched_handle.isra.10+0x31/0x40
> >>>>>>> [ 393.641012] [<ffffffff810a98f7>] tick_sched_timer+0x47/0x70
> >>>>>>> [ 393.641012] [<ffffffff8107941c>] __run_hrtimer+0x7c/0x490
> >>>>>>> [ 393.641012] [<ffffffff810a260d>] ? ktime_get_update_offsets+0x4d/0xe0
> >>>>>>> [ 393.641012] [<ffffffff810a98b0>] ? tick_nohz_handler+0xa0/0xa0
> >>>>>>> [ 393.641012] [<ffffffff8107a017>] hrtimer_interrupt+0x107/0x260
> >>>>>
> >>>>> The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
> >>>>> Did it first call a rt_mutex_lock?
> >>>>>
> >>>>> If patch two was the culprit, I'm thinking the idea behind patch two is
> >>>>> wrong. The only option is to remove patch number two!
> >>>>
> >>>> removing patch number two can solve the problem found be Paul, but it is not the best.
> >>>> because I can't declare that rcu is deadlock-immunity
> >>>> (it will be deadlock if rcu read site overlaps with rtmutex's lock->wait_lock
> >>>> if I only remove patch2)
> >>>> I must do more things, but I think it is still better than changing rtmutex's lock->wait_lock.
> >>>
> >>> NP, I will remove your current patches and wait for an updated set.
> >>
> >> Hi, Paul
> >>
> >> Would you agree to moving the rt_mutex_unlock() into rcu_preempt_note_context_switch()?
> >
> > My guess is that changing rcu_preempt_note_context_switch()'s call to
> > rcu_read_unlock_special(t, true) would accomplish this in a nicer way.
> > Except that I would first need to understand why rcu_check_callbacks()'s
> > call to rcu_read_unlock_special() resulted in rt_mutex_unlock() being
> > called.
> >
> > Oh!
> >
> > In rcu_read_unlock_special, shouldn't the:
> >
> > if (unlikely(unlock && irqs_disabled_flags(flags))) {
>
> Sorry.
> @unlock means rcu_read_unlock_special() is being called from the rcu_read_unlock() path.
>
> If rcu_read_unlock() is called with irqs disabled, it may be running inside a
> suspect lock (scheduler locks, ...), so rcu_read_unlock_special() must not
> acquire rnp->lock or invoke wake_up() in that case.

Hmmm... And what about the case where rcu_read_unlock_special() is
called from rcu_preempt_note_context_switch()?

Thanx, Paul

> > Instead be:
> >
> > if (unlikely(!unlock || irqs_disabled_flags(flags))) {
> >
> > Here I am guessing that the "unlock" parameter means "It is OK for
> > rcu_read_unlock_special() to invoke rt_mutex_unlock()", so it would be
> > passed in as false from rcu_check_callbacks() and true everywhere else.
> > If it means something else, please let me know what that might be.
> >
> > Though at this point, it is not clear to me how it helps to call
> > rcu_read_unlock_special() from rcu_check_callbacks(). After all,
> > rcu_check_callbacks() has interrupts disabled always, and so it is never
> > safe for anything that it calls to invoke rt_mutex_unlock().
> >
> > In any case, the opinion that really matters is not mine, but rather
> > that of the hundreds of millions of computer systems that might soon be
> > executing this code. As RCU maintainer, I just try my best to predict
> > what their opinions will be. ;-)
> >
> > Thanx, Paul
> >
> >> thanks,
> >> Lai
> >>
> >>>
> >>> Thanx, Paul
> >>>
> >>>
> >>
> >
> >
>

2013-08-08 20:40:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> Background)
>
> Although all articles declare that rcu read site is deadlock-immunity.
> It is not true for rcu-preempt, it will be deadlock if rcu read site
> overlaps with scheduler lock.
>
> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> is still not deadlock-immunity. And the problem described in 016a8d5b
> is still existed(rcu_read_unlock_special() calls wake_up).
>
> Aim)
>
> We want to fix the problem forever, we want to keep rcu read site
> is deadlock-immunity as books say.
>
> How)
>
> The problem is solved by "if rcu_read_unlock_special() is called inside
> any lock which can be (chained) nested in rcu_read_unlock_special(),
> we defer rcu_read_unlock_special()".
> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> in printk()/WARN_ON() and all locks nested in these locks or chained nested
> in these locks.
>
> The problem is reduced to "how to distinguish all these locks(context)",
> We don't distinguish all these locks, we know that all these locks
> should be nested in local_irqs_disable().
>
> we just consider if rcu_read_unlock_special() is called in irqs-disabled
> context, it may be called in these suspect locks, we should defer
> rcu_read_unlock_special().
>
> The algorithm enlarges the probability of deferring, but the probability
> is still very very low.
>
> Deferring does add a small overhead, but it offers us:
> 1) really deadlock-immunity for rcu read site
> 2) remove the overhead of the irq-work(250 times per second in avg.)

One problem here -- it may take quite some time for a set_need_resched()
to take effect. This is especially a problem for RCU priority boosting,
but can also needlessly delay preemptible-RCU grace periods because
local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.

OK, alternatives...

o Keep the current rule saying that if the scheduler is going
to exit an RCU read-side critical section while holding
one of its spinlocks, preemption has to have been disabled
throughout the full duration of that critical section.
Well, we can certainly do this, but it would be nice to get
rid of this rule.

o Use per-CPU variables, possibly injecting delay. This has ugly
disadvantages as noted above.

o irq_work_queue() can wait a jiffy (or on some architectures,
quite a bit longer) before actually doing anything.

o raise_softirq() is more immediate and is an easy change, but
adds a softirq vector -- which people are really trying to
get rid of. Also, wakeup_softirqd() calls things that acquire
the scheduler locks, which is exactly what we were trying to
avoid doing.

o invoke_rcu_core() can invoke raise_softirq() as above.

o IPI to self. From what I can see, not all architectures
support this. Easy to fake if you have at least two CPUs,
but not so good from an OS jitter viewpoint...

o Add a check to local_irq_disable() and friends. I would guess
that this suggestion would not make architecture maintainers
happy.

Other thoughts?

Thanx, Paul

> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> include/linux/rcupdate.h | 2 +-
> kernel/rcupdate.c | 2 +-
> kernel/rcutree_plugin.h | 47 +++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 44 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 4b14bdc..00b4220 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
>
> extern void __rcu_read_lock(void);
> extern void __rcu_read_unlock(void);
> -extern void rcu_read_unlock_special(struct task_struct *t);
> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> void synchronize_rcu(void);
>
> /*
> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> index cce6ba8..33b89a3 100644
> --- a/kernel/rcupdate.c
> +++ b/kernel/rcupdate.c
> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> barrier(); /* assign before ->rcu_read_unlock_special load */
> if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> - rcu_read_unlock_special(t);
> + rcu_read_unlock_special(t, true);
> barrier(); /* ->rcu_read_unlock_special load before assign */
> t->rcu_read_lock_nesting = 0;
> }
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index fc8b36f..997b424 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> ? rnp->gpnum
> : rnp->gpnum + 1);
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> - } else if (t->rcu_read_lock_nesting < 0 &&
> - !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> - t->rcu_read_unlock_special) {
> + } else if (t->rcu_read_lock_nesting == 0 ||
> + (t->rcu_read_lock_nesting < 0 &&
> + !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
>
> /*
> * Complete exit from RCU read-side critical section on
> * behalf of preempted instance of __rcu_read_unlock().
> */
> - rcu_read_unlock_special(t);
> + if (t->rcu_read_unlock_special)
> + rcu_read_unlock_special(t, false);
> }
>
> /*
> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
> * notify RCU core processing or task having blocked during the RCU
> * read-side critical section.
> */
> -void rcu_read_unlock_special(struct task_struct *t)
> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> {
> int empty;
> int empty_exp;
> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
>
> /* Clean up if blocked during RCU read-side critical section. */
> if (special & RCU_READ_UNLOCK_BLOCKED) {
> + /*
> + * If rcu read lock overlaps with scheduler lock,
> + * rcu_read_unlock_special() may lead to deadlock:
> + *
> + * rcu_read_lock();
> + * preempt_schedule[_irq]() (when preemption)
> + * scheduler lock; (or some other locks can be (chained) nested
> + * in rcu_read_unlock_special()/rnp->lock)
> + * access and check rcu data
> + * rcu_read_unlock();
> + * rcu_read_unlock_special();
> + * wake_up(); DEAD LOCK
> + *
> + * To avoid all these kinds of deadlock, we should quit
> + * rcu_read_unlock_special() here and defer it to
> + * rcu_preempt_note_context_switch() or next outmost
> + * rcu_read_unlock() if we consider this case may happen.
> + *
> + * Although we can't know whether current _special()
> + * is nested in scheduler lock or not. But we know that
> + * irqs are always disabled in this case. so we just quit
> + * and defer it to rcu_preempt_note_context_switch()
> + * when irqs are disabled.
> + *
> + * It means we always defer _special() when it is
> + * nested in irqs disabled context, but
> + * (special & RCU_READ_UNLOCK_BLOCKED) &&
> + * irqs_disabled_flags(flags)
> + * is still unlikely to be true.
> + */
> + if (unlikely(unlock && irqs_disabled_flags(flags))) {
> + set_need_resched();
> + local_irq_restore(flags);
> + return;
> + }
> +
> t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
>
> /*
> --
> 1.7.4.4
>

2013-08-09 09:28:21

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
>> Background)
>>
>> Although all articles declare that rcu read site is deadlock-immunity.
>> It is not true for rcu-preempt, it will be deadlock if rcu read site
>> overlaps with scheduler lock.
>>
>> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
>> is still not deadlock-immunity. And the problem described in 016a8d5b
>> is still existed(rcu_read_unlock_special() calls wake_up).
>>
>> Aim)
>>
>> We want to fix the problem forever, we want to keep rcu read site
>> is deadlock-immunity as books say.
>>
>> How)
>>
>> The problem is solved by "if rcu_read_unlock_special() is called inside
>> any lock which can be (chained) nested in rcu_read_unlock_special(),
>> we defer rcu_read_unlock_special()".
>> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
>> in printk()/WARN_ON() and all locks nested in these locks or chained nested
>> in these locks.
>>
>> The problem is reduced to "how to distinguish all these locks(context)",
>> We don't distinguish all these locks, we know that all these locks
>> should be nested in local_irqs_disable().
>>
>> we just consider if rcu_read_unlock_special() is called in irqs-disabled
>> context, it may be called in these suspect locks, we should defer
>> rcu_read_unlock_special().
>>
>> The algorithm enlarges the probability of deferring, but the probability
>> is still very very low.
>>
>> Deferring does add a small overhead, but it offers us:
>> 1) really deadlock-immunity for rcu read site
>> 2) remove the overhead of the irq-work(250 times per second in avg.)
>
> One problem here -- it may take quite some time for a set_need_resched()
> to take effect. This is especially a problem for RCU priority boosting,
> but can also needlessly delay preemptible-RCU grace periods because
> local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.


The final effect of deboosting (rt_mutex_unlock()) is also accomplished
via set_need_resched()/set_tsk_need_resched().
set_need_resched() is enough for the RCU priority-boosting issue here.

Since rcu_read_unlock_special() is deferred, it does take quite some time
for the QS report to take effect.


>
> OK, alternatives...
>
> o Keep the current rule saying that if the scheduler is going
> to exit an RCU read-side critical section while holding
> one of its spinlocks, preemption has to have been disabled

Note that rtmutex's lock->wait_lock is taken with neither irqs nor bh disabled.

This kind of spinlock includes the scheduler locks, rtmutex's lock->wait_lock,
and every lock that can be acquired in irq or softirq context.

So the rule is not only applied to scheduler locks; it would have to be
applied to almost all spinlocks in the kernel.

I found it hard to accept that the rcu read site is not deadlock-immune.

Thanks,
Lai

> throughout the full duration of that critical section.
> Well, we can certainly do this, but it would be nice to get
> rid of this rule.
>
> o Use per-CPU variables, possibly injecting delay. This has ugly
> disadvantages as noted above.
>
> o irq_work_queue() can wait a jiffy (or on some architectures,
> quite a bit longer) before actually doing anything.
>
> o raise_softirq() is more immediate and is an easy change, but
> adds a softirq vector -- which people are really trying to
> get rid of. Also, wakeup_softirqd() calls things that acquire
> the scheduler locks, which is exactly what we were trying to
> avoid doing.
>
> o invoke_rcu_core() can invoke raise_softirq() as above.
>
> o IPI to self. From what I can see, not all architectures
> support this. Easy to fake if you have at least two CPUs,
> but not so good from an OS jitter viewpoint...
>
> o Add a check to local_irq_disable() and friends. I would guess
> that this suggestion would not make architecture maintainers
> happy.
>
> Other thoughts?
>
> Thanx, Paul
>
>> Signed-off-by: Lai Jiangshan <[email protected]>
>> ---
>> include/linux/rcupdate.h | 2 +-
>> kernel/rcupdate.c | 2 +-
>> kernel/rcutree_plugin.h | 47 +++++++++++++++++++++++++++++++++++++++++----
>> 3 files changed, 44 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 4b14bdc..00b4220 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
>>
>> extern void __rcu_read_lock(void);
>> extern void __rcu_read_unlock(void);
>> -extern void rcu_read_unlock_special(struct task_struct *t);
>> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
>> void synchronize_rcu(void);
>>
>> /*
>> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
>> index cce6ba8..33b89a3 100644
>> --- a/kernel/rcupdate.c
>> +++ b/kernel/rcupdate.c
>> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
>> #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
>> barrier(); /* assign before ->rcu_read_unlock_special load */
>> if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
>> - rcu_read_unlock_special(t);
>> + rcu_read_unlock_special(t, true);
>> barrier(); /* ->rcu_read_unlock_special load before assign */
>> t->rcu_read_lock_nesting = 0;
>> }
>> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
>> index fc8b36f..997b424 100644
>> --- a/kernel/rcutree_plugin.h
>> +++ b/kernel/rcutree_plugin.h
>> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
>> ? rnp->gpnum
>> : rnp->gpnum + 1);
>> raw_spin_unlock_irqrestore(&rnp->lock, flags);
>> - } else if (t->rcu_read_lock_nesting < 0 &&
>> - !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
>> - t->rcu_read_unlock_special) {
>> + } else if (t->rcu_read_lock_nesting == 0 ||
>> + (t->rcu_read_lock_nesting < 0 &&
>> + !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
>>
>> /*
>> * Complete exit from RCU read-side critical section on
>> * behalf of preempted instance of __rcu_read_unlock().
>> */
>> - rcu_read_unlock_special(t);
>> + if (t->rcu_read_unlock_special)
>> + rcu_read_unlock_special(t, false);
>> }
>>
>> /*
>> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
>> * notify RCU core processing or task having blocked during the RCU
>> * read-side critical section.
>> */
>> -void rcu_read_unlock_special(struct task_struct *t)
>> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
>> {
>> int empty;
>> int empty_exp;
>> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
>>
>> /* Clean up if blocked during RCU read-side critical section. */
>> if (special & RCU_READ_UNLOCK_BLOCKED) {
>> + /*
>> + * If rcu read lock overlaps with scheduler lock,
>> + * rcu_read_unlock_special() may lead to deadlock:
>> + *
>> + * rcu_read_lock();
>> + * preempt_schedule[_irq]() (when preemption)
>> + * scheduler lock; (or some other locks can be (chained) nested
>> + * in rcu_read_unlock_special()/rnp->lock)
>> + * access and check rcu data
>> + * rcu_read_unlock();
>> + * rcu_read_unlock_special();
>> + * wake_up(); DEAD LOCK
>> + *
>> + * To avoid all these kinds of deadlock, we should quit
>> + * rcu_read_unlock_special() here and defer it to
>> + * rcu_preempt_note_context_switch() or next outmost
>> + * rcu_read_unlock() if we consider this case may happen.
>> + *
>> + * Although we can't know whether current _special()
>> + * is nested in scheduler lock or not. But we know that
>> + * irqs are always disabled in this case. so we just quit
>> + * and defer it to rcu_preempt_note_context_switch()
>> + * when irqs are disabled.
>> + *
>> + * It means we always defer _special() when it is
>> + * nested in irqs disabled context, but
>> + * (special & RCU_READ_UNLOCK_BLOCKED) &&
>> + * irqs_disabled_flags(flags)
>> + * is still unlikely to be true.
>> + */
>> + if (unlikely(unlock && irqs_disabled_flags(flags))) {
>> + set_need_resched();
>> + local_irq_restore(flags);
>> + return;
>> + }
>> +
>> t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
>>
>> /*
>> --
>> 1.7.4.4
>>
>
>

2013-08-09 17:59:07

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Fri, Aug 09, 2013 at 05:31:27PM +0800, Lai Jiangshan wrote:
> On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> >> Background)
> >>
> >> Although all articles declare that rcu read site is deadlock-immunity.
> >> It is not true for rcu-preempt, it will be deadlock if rcu read site
> >> overlaps with scheduler lock.
> >>
> >> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> >> is still not deadlock-immunity. And the problem described in 016a8d5b
> >> is still existed(rcu_read_unlock_special() calls wake_up).
> >>
> >> Aim)
> >>
> >> We want to fix the problem forever, we want to keep rcu read site
> >> is deadlock-immunity as books say.
> >>
> >> How)
> >>
> >> The problem is solved by "if rcu_read_unlock_special() is called inside
> >> any lock which can be (chained) nested in rcu_read_unlock_special(),
> >> we defer rcu_read_unlock_special()".
> >> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> >> in printk()/WARN_ON() and all locks nested in these locks or chained nested
> >> in these locks.
> >>
> >> The problem is reduced to "how to distinguish all these locks(context)",
> >> We don't distinguish all these locks, we know that all these locks
> >> should be nested in local_irqs_disable().
> >>
> >> we just consider if rcu_read_unlock_special() is called in irqs-disabled
> >> context, it may be called in these suspect locks, we should defer
> >> rcu_read_unlock_special().
> >>
> >> The algorithm enlarges the probability of deferring, but the probability
> >> is still very very low.
> >>
> >> Deferring does add a small overhead, but it offers us:
> >> 1) really deadlock-immunity for rcu read site
> >> 2) remove the overhead of the irq-work(250 times per second in avg.)
> >
> > One problem here -- it may take quite some time for a set_need_resched()
> > to take effect. This is especially a problem for RCU priority boosting,
> > but can also needlessly delay preemptible-RCU grace periods because
> > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
>
> The final effect of deboosting (rt_mutex_unlock()) is also accomplished
> via set_need_resched()/set_tsk_need_resched().
> set_need_resched() is enough for the RCU priority-boosting issue here.

Eventually, yes. But all that set_need_resched() does is set the
TIF_NEED_RESCHED. This is checked by the outermost preempt_enable(),
return from interrupt, return to userspace, and things like
cond_resched(). So it might well be quite some time until the boosted
reader actually gets around to deboosting itself.
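
Roughly speaking, set_need_resched() is nothing more than the following sketch;
it just sets the flag and relies on some later event to notice it:

	static inline void set_need_resched(void)
	{
		set_thread_flag(TIF_NEED_RESCHED);	/* no IPI, no immediate reschedule */
	}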

> Since rcu_read_unlock_special() is deferred, it does take quite some time
> for the QS report to take effect.

Agreed.

> > OK, alternatives...
> >
> > o Keep the current rule saying that if the scheduler is going
> > to exit an RCU read-side critical section while holding
> > one of its spinlocks, preemption has to have been disabled
>
> Note that rtmutex's lock->wait_lock is taken with neither irqs nor bh disabled.

Yep, because this rule prevents the call to rt_mutex_unlock() from
happening whenever one of the scheduler's irq-disable locks is held.

> This kind of spinlock includes the scheduler locks, rtmutex's lock->wait_lock,
> and every lock that can be acquired in irq or softirq context.
>
> So the rule is not only applied to scheduler locks; it would have to be
> applied to almost all spinlocks in the kernel.

No, only those locks acquired by the scheduler in the wakeup path.
Or am I missing something here?

> I found it hard to accept that the rcu read site is not deadlock-immune.

So did I, see http://lwn.net/Articles/453002/. RCU was a victim of its
own success. ;-)

And it would be really cool to restore full deadlock immunity to
rcu_read_unlock(), no two ways about it! Hmmm... Any way that a
self-IPI could be made safe for all architectures?

Thanx, Paul

> Thanks,
> Lai
>
> > throughout the full duration of that critical section.
> > Well, we can certainly do this, but it would be nice to get
> > rid of this rule.
> >
> > o Use per-CPU variables, possibly injecting delay. This has ugly
> > disadvantages as noted above.
> >
> > o irq_work_queue() can wait a jiffy (or on some architectures,
> > quite a bit longer) before actually doing anything.
> >
> > o raise_softirq() is more immediate and is an easy change, but
> > adds a softirq vector -- which people are really trying to
> > get rid of. Also, wakeup_softirqd() calls things that acquire
> > the scheduler locks, which is exactly what we were trying to
> > avoid doing.
> >
> > o invoke_rcu_core() can invoke raise_softirq() as above.
> >
> > o IPI to self. From what I can see, not all architectures
> > support this. Easy to fake if you have at least two CPUs,
> > but not so good from an OS jitter viewpoint...
> >
> > o Add a check to local_irq_disable() and friends. I would guess
> > that this suggestion would not make architecture maintainers
> > happy.
> >
> > Other thoughts?
> >
> > Thanx, Paul
> >
> >> Signed-off-by: Lai Jiangshan <[email protected]>
> >> ---
> >> include/linux/rcupdate.h | 2 +-
> >> kernel/rcupdate.c | 2 +-
> >> kernel/rcutree_plugin.h | 47 +++++++++++++++++++++++++++++++++++++++++----
> >> 3 files changed, 44 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> >> index 4b14bdc..00b4220 100644
> >> --- a/include/linux/rcupdate.h
> >> +++ b/include/linux/rcupdate.h
> >> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
> >>
> >> extern void __rcu_read_lock(void);
> >> extern void __rcu_read_unlock(void);
> >> -extern void rcu_read_unlock_special(struct task_struct *t);
> >> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> >> void synchronize_rcu(void);
> >>
> >> /*
> >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> >> index cce6ba8..33b89a3 100644
> >> --- a/kernel/rcupdate.c
> >> +++ b/kernel/rcupdate.c
> >> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> >> #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> >> barrier(); /* assign before ->rcu_read_unlock_special load */
> >> if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> >> - rcu_read_unlock_special(t);
> >> + rcu_read_unlock_special(t, true);
> >> barrier(); /* ->rcu_read_unlock_special load before assign */
> >> t->rcu_read_lock_nesting = 0;
> >> }
> >> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> >> index fc8b36f..997b424 100644
> >> --- a/kernel/rcutree_plugin.h
> >> +++ b/kernel/rcutree_plugin.h
> >> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> >> ? rnp->gpnum
> >> : rnp->gpnum + 1);
> >> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >> - } else if (t->rcu_read_lock_nesting < 0 &&
> >> - !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> >> - t->rcu_read_unlock_special) {
> >> + } else if (t->rcu_read_lock_nesting == 0 ||
> >> + (t->rcu_read_lock_nesting < 0 &&
> >> + !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
> >>
> >> /*
> >> * Complete exit from RCU read-side critical section on
> >> * behalf of preempted instance of __rcu_read_unlock().
> >> */
> >> - rcu_read_unlock_special(t);
> >> + if (t->rcu_read_unlock_special)
> >> + rcu_read_unlock_special(t, false);
> >> }
> >>
> >> /*
> >> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
> >> * notify RCU core processing or task having blocked during the RCU
> >> * read-side critical section.
> >> */
> >> -void rcu_read_unlock_special(struct task_struct *t)
> >> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> >> {
> >> int empty;
> >> int empty_exp;
> >> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
> >>
> >> /* Clean up if blocked during RCU read-side critical section. */
> >> if (special & RCU_READ_UNLOCK_BLOCKED) {
> >> + /*
> >> + * If rcu read lock overlaps with scheduler lock,
> >> + * rcu_read_unlock_special() may lead to deadlock:
> >> + *
> >> + * rcu_read_lock();
> >> + * preempt_schedule[_irq]() (when preemption)
> >> + * scheduler lock; (or some other locks can be (chained) nested
> >> + * in rcu_read_unlock_special()/rnp->lock)
> >> + * access and check rcu data
> >> + * rcu_read_unlock();
> >> + * rcu_read_unlock_special();
> >> + * wake_up(); DEAD LOCK
> >> + *
> >> + * To avoid all these kinds of deadlock, we should quit
> >> + * rcu_read_unlock_special() here and defer it to
> >> + * rcu_preempt_note_context_switch() or next outmost
> >> + * rcu_read_unlock() if we consider this case may happen.
> >> + *
> >> + * Although we can't know whether current _special()
> >> + * is nested in scheduler lock or not. But we know that
> >> + * irqs are always disabled in this case. so we just quit
> >> + * and defer it to rcu_preempt_note_context_switch()
> >> + * when irqs are disabled.
> >> + *
> >> + * It means we always defer _special() when it is
> >> + * nested in irqs disabled context, but
> >> + * (special & RCU_READ_UNLOCK_BLOCKED) &&
> >> + * irqs_disabled_flags(flags)
> >> + * is still unlikely to be true.
> >> + */
> >> + if (unlikely(unlock && irqs_disabled_flags(flags))) {
> >> + set_need_resched();
> >> + local_irq_restore(flags);
> >> + return;
> >> + }
> >> +
> >> t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> >>
> >> /*
> >> --
> >> 1.7.4.4
> >>
> >
> >
>

2013-08-10 03:40:52

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

Hi, Steven

I had been treating rtmutex's lock->wait_lock as a scheduler lock,
but it is not; it is just a process-context spinlock.
I hope you will change it to an irq-context spinlock (irqs disabled while it is held).

1) It makes the rcu read site more prone to deadlock. Example, where x is a
softirq-context spinlock (taken with spin_lock_bh()):

CPU1                                    CPU2 (rcu boost)
rcu_read_lock()                         rt_mutex_lock()
<preemption and reschedule back>          raw_spin_lock(lock->wait_lock)
spin_lock_bh(x)                           <interrupt; softirq runs after the irq>
rcu_read_unlock()                           do_softirq()
  rcu_read_unlock_special()
    rt_mutex_unlock()
      raw_spin_lock(lock->wait_lock)          spin_lock_bh(x)   DEADLOCK

This example can happen with any one of these code bases:
    without my patchset
    with my patchset
    with my patchset but with patch2 reverted

2) Why it makes deadlock more likely: it extends the range of suspect locks.
#define suspect locks: any lock that can be (chained) nested in rcu_read_unlock_special().

So the suspect locks are: rnp->lock, scheduler locks, rtmutex's lock->wait_lock,
locks taken in printk()/WARN_ON(), and all locks which can be chained/indirectly
nested in the above locks.

If rtmutex's lock->wait_lock is an irq-context spinlock, all suspect locks are
irq-context spinlocks.

If rtmutex's lock->wait_lock is a process-context spinlock, the suspect locks
are extended to: all irq-context spinlocks, all softirq-context spinlocks,
and all process-context spinlocks that can be nested in rtmutex's lock->wait_lock.

We can see from the definition that if rcu_read_unlock_special() is called while
any suspect lock is held, it may deadlock as in the example above. A process-context
lock->wait_lock extends the range of suspect locks and so makes deadlock more likely.

3) How my algorithm works, and why a smaller range of suspect locks helps us.
Since rcu_read_unlock_special() can't be called while a suspect lock is held,
we must defer rcu_read_unlock_special() in those contexts.
It is hard to tell whether the current context holds a suspect lock or not,
but we can make a conservative decision based on irq/softirq/process context:

If all suspect locks are spinlocks of irq context:
	if (irqs_disabled)			/* we may be in suspect-lock context */
		defer rcu_read_unlock_special().

If all suspect locks are spinlocks of irq/softirq/process context:
	if (irqs_disabled || in_atomic())	/* we may be in suspect-lock context */
		defer rcu_read_unlock_special().
In the second case the deferring becomes much more frequent, which I can't accept.
So I have to narrow the range of suspect locks. Two choices:
A) Don't call rt_mutex_unlock() from rcu_read_unlock(); only call it from
   rcu_preempt_note_context_switch(). We would need to rework these two
   functions, it would add complexity to RCU, and it still leaves some
   probability of deferring.
B) Change rtmutex's lock->wait_lock so that it is always taken with irqs disabled.

4) From rtmutex's own point of view, I think it would be better if ->wait_lock
were irqs-disabled:
A) Like the trylock of mutex/rw_sem, we may want to call rt_mutex_trylock()
   from irq context in the future.
B) The critical section of ->wait_lock is short, so making it irqs-disabled
   does not hurt responsiveness/latency.
C) Almost all of the ->wait_lock critical section already runs with irqs
   disabled (due to task->pi_lock), so I think converting the whole critical
   section to irqs-disabled is OK.

So I hope you change rtmutex's lock->wait_lock.
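
Concretely, the conversion would look roughly like this inside kernel/rtmutex.c
(an illustrative sketch, not a posted patch; every wait_lock site would need
the same treatment):

-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	...
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);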

Any feedback from anyone is welcome.

Thanks,
Lai

On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
>> Background)
>>
>> Although all articles declare that rcu read site is deadlock-immunity.
>> It is not true for rcu-preempt, it will be deadlock if rcu read site
>> overlaps with scheduler lock.
>>
>> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
>> is still not deadlock-immunity. And the problem described in 016a8d5b
>> is still existed(rcu_read_unlock_special() calls wake_up).
>>
>> Aim)
>>
>> We want to fix the problem forever, we want to keep rcu read site
>> is deadlock-immunity as books say.
>>
>> How)
>>
>> The problem is solved by "if rcu_read_unlock_special() is called inside
>> any lock which can be (chained) nested in rcu_read_unlock_special(),
>> we defer rcu_read_unlock_special()".
>> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
>> in printk()/WARN_ON() and all locks nested in these locks or chained nested
>> in these locks.
>>
>> The problem is reduced to "how to distinguish all these locks(context)",
>> We don't distinguish all these locks, we know that all these locks
>> should be nested in local_irqs_disable().
>>
>> we just consider if rcu_read_unlock_special() is called in irqs-disabled
>> context, it may be called in these suspect locks, we should defer
>> rcu_read_unlock_special().
>>
>> The algorithm enlarges the probability of deferring, but the probability
>> is still very very low.
>>
>> Deferring does add a small overhead, but it offers us:
>> 1) really deadlock-immunity for rcu read site
>> 2) remove the overhead of the irq-work(250 times per second in avg.)
>
> One problem here -- it may take quite some time for a set_need_resched()
> to take effect. This is especially a problem for RCU priority boosting,
> but can also needlessly delay preemptible-RCU grace periods because
> local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
>
> OK, alternatives...
>
> o Keep the current rule saying that if the scheduler is going
> to exit an RCU read-side critical section while holding
> one of its spinlocks, preemption has to have been disabled
> throughout the full duration of that critical section.
> Well, we can certainly do this, but it would be nice to get
> rid of this rule.
>
> o Use per-CPU variables, possibly injecting delay. This has ugly
> disadvantages as noted above.
>
> o irq_work_queue() can wait a jiffy (or on some architectures,
> quite a bit longer) before actually doing anything.
>
> o raise_softirq() is more immediate and is an easy change, but
> adds a softirq vector -- which people are really trying to
> get rid of. Also, wakeup_softirqd() calls things that acquire
> the scheduler locks, which is exactly what we were trying to
> avoid doing.
>
> o invoke_rcu_core() can invoke raise_softirq() as above.
>
> o IPI to self. From what I can see, not all architectures
> support this. Easy to fake if you have at least two CPUs,
> but not so good from an OS jitter viewpoint...
>
> o Add a check to local_irq_disable() and friends. I would guess
> that this suggestion would not make architecture maintainers
> happy.
>
> Other thoughts?
>
> Thanx, Paul
>
>> Signed-off-by: Lai Jiangshan <[email protected]>
>> ---
>> include/linux/rcupdate.h | 2 +-
>> kernel/rcupdate.c | 2 +-
>> kernel/rcutree_plugin.h | 47 +++++++++++++++++++++++++++++++++++++++++----
>> 3 files changed, 44 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 4b14bdc..00b4220 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
>>
>> extern void __rcu_read_lock(void);
>> extern void __rcu_read_unlock(void);
>> -extern void rcu_read_unlock_special(struct task_struct *t);
>> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
>> void synchronize_rcu(void);
>>
>> /*
>> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
>> index cce6ba8..33b89a3 100644
>> --- a/kernel/rcupdate.c
>> +++ b/kernel/rcupdate.c
>> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
>> #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
>> barrier(); /* assign before ->rcu_read_unlock_special load */
>> if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
>> - rcu_read_unlock_special(t);
>> + rcu_read_unlock_special(t, true);
>> barrier(); /* ->rcu_read_unlock_special load before assign */
>> t->rcu_read_lock_nesting = 0;
>> }
>> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
>> index fc8b36f..997b424 100644
>> --- a/kernel/rcutree_plugin.h
>> +++ b/kernel/rcutree_plugin.h
>> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
>> ? rnp->gpnum
>> : rnp->gpnum + 1);
>> raw_spin_unlock_irqrestore(&rnp->lock, flags);
>> - } else if (t->rcu_read_lock_nesting < 0 &&
>> - !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
>> - t->rcu_read_unlock_special) {
>> + } else if (t->rcu_read_lock_nesting == 0 ||
>> + (t->rcu_read_lock_nesting < 0 &&
>> + !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
>>
>> /*
>> * Complete exit from RCU read-side critical section on
>> * behalf of preempted instance of __rcu_read_unlock().
>> */
>> - rcu_read_unlock_special(t);
>> + if (t->rcu_read_unlock_special)
>> + rcu_read_unlock_special(t, false);
>> }
>>
>> /*
>> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
>> * notify RCU core processing or task having blocked during the RCU
>> * read-side critical section.
>> */
>> -void rcu_read_unlock_special(struct task_struct *t)
>> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
>> {
>> int empty;
>> int empty_exp;
>> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
>>
>> /* Clean up if blocked during RCU read-side critical section. */
>> if (special & RCU_READ_UNLOCK_BLOCKED) {
>> + /*
>> + * If rcu read lock overlaps with scheduler lock,
>> + * rcu_read_unlock_special() may lead to deadlock:
>> + *
>> + * rcu_read_lock();
>> + * preempt_schedule[_irq]() (when preemption)
>> + * scheduler lock; (or some other locks can be (chained) nested
>> + * in rcu_read_unlock_special()/rnp->lock)
>> + * access and check rcu data
>> + * rcu_read_unlock();
>> + * rcu_read_unlock_special();
>> + * wake_up(); DEAD LOCK
>> + *
>> + * To avoid all these kinds of deadlock, we should quit
>> + * rcu_read_unlock_special() here and defer it to
>> + * rcu_preempt_note_context_switch() or next outmost
>> + * rcu_read_unlock() if we consider this case may happen.
>> + *
>> + * Although we can't know whether current _special()
>> + * is nested in scheduler lock or not. But we know that
>> + * irqs are always disabled in this case. so we just quit
>> + * and defer it to rcu_preempt_note_context_switch()
>> + * when irqs are disabled.
>> + *
>> + * It means we always defer _special() when it is
>> + * nested in irqs disabled context, but
>> + * (special & RCU_READ_UNLOCK_BLOCKED) &&
>> + * irqs_disabled_flags(flags)
>> + * is still unlikely to be true.
>> + */
>> + if (unlikely(unlock && irqs_disabled_flags(flags))) {
>> + set_need_resched();
>> + local_irq_restore(flags);
>> + return;
>> + }
>> +
>> t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
>>
>> /*
>> --
>> 1.7.4.4
>>
>
>

2013-08-10 15:07:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> Hi, Steven
>
> I was considering rtmutex's lock->wait_lock is a scheduler lock,
> But it is not, and it is just a spinlock of process context.
> I hope you change it to a spinlock of irq context.
>
> 1) it causes rcu read site more deadlockable, example:
> x is a spinlock of softirq context.
>
> CPU1 cpu2(rcu boost)
> rcu_read_lock() rt_mutext_lock()
> <preemption and reschedule back> raw_spin_lock(lock->wait_lock)
> spin_lock_bh(x) <interrupt and doing softirq after irq>
> rcu_read_unlock() do_softirq()
> rcu_read_unlock_special()
> rt_mutext_unlock()
> raw_spin_lock(lock->wait_lock) spin_lock_bh(x) DEADLOCK
>
> This example can happen on any one of these code:
> without my patchset
> with my patchset
> with my patchset but reverted patch2
>
> 2) why it causes more deadlockable: it extends the range of suspect locks.
> #DEFINE suspect locks: any lock can be (chained) nested in rcu_read_unlock_special().
>
> So the suspect locks are: rnp->lock, scheduler locks, rtmutex's lock->wait_lock,
> locks in prink()/WARN_ON() and the locks which can be chained/indirectly nested
> in the above locks.
>
> If the rtmutex's lock->wait_lock is a spinlock of irq context, all suspect locks are
> some spinlocks of irq context.
>
> If the rtmutex's lock->wait_lock is a spinlock of process context, suspect locks
> will be extended to, all spinlocks of irq context, all spinlocks of softirq context,
> and (all spinlocks of process context but nested in rtmutex's lock->wait_lock).
>
> We can see from the definition, if rcu_read_unlock_special() is called from
> any suspect lock, it may be deadlock like the example. the rtmutex's lock->wait_lock
> extends the range of suspect locks, it causes more deadlockable.
>
> 3) How my algorithm works, why smaller range of suspect locks help us.
> Since rcu_read_unlock_special() can't be called from suspect locks context,
> we should defer rcu_read_unlock_special() when in these contexts.
> It is hard to find out current context is suspect locks context or not,
> but we can determine it based on irq/softirq/process context.
>
> if all suspect locks are some spinlocks of irq context:
> if (irqs_disabled) /* we may be in suspect locks context */
> defer rcu_read_unlock_special().
>
> if all suspect locks are some spinlocks of irq/softirq/process context:
> if (irqs_disabled || in_atomic()) /* we may be in suspect locks context */
> defer rcu_read_unlock_special().
> In this case, the deferring becomes large more, I can't accept it.
> So I have to narrow the range of suspect locks. Two choices:
> A) don't call rt_mutex_unlock() from rcu_read_unlock(), only call it
> from rcu_preempt_not_context_switch(). we need to rework these
> two functions and it will add complexity to RCU, and it also still
> adds some probability of deferring.

One advantage of bh-disable locks is that enabling bh checks
TIF_NEED_RESCHED, so that there is no deferring beyond that
needed by bh disable. The same of course applies to preempt_disable().

So one approach is to defer when rcu_read_unlock_special() is entered
with either preemption or bh disabled. Your current set_need_resched()
trick would work fine in this case. Unfortunately, re-enabling interrupts
does -not- check TIF_NEED_RESCHED, which is why we have latency problems
in that case. (Hence my earlier question about making self-IPI safe
on all arches, which would result in an interrupt as soon as interrupts
were re-enabled.)

Another possibility is to defer only when preemption or bh are disabled
on entry to rcu_read_unlock_special(), but to retain the current
(admittedly ugly) nesting rules for the scheduler locks.
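
A rough sketch of the preempt/bh-disabled check -- the preempt_count() mask
test is an assumption about how "preemption or bh disabled" would be
detected, not code from any posted patch:

	if (unlikely(unlock &&
		     (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
		/* preempt_enable()/local_bh_enable() recheck TIF_NEED_RESCHED,
		 * so the deferral lasts only as long as those sections do. */
		set_need_resched();
		local_irq_restore(flags);
		return;
	}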

> B) change rtmutex's lock->wait_lock to irqs-disabled.

I have to defer to Steven on this one.

Thanx, Paul

> 4) In the view of rtmutex, I think it will be better if ->wait_lock is irqs-disabled.
> A) like trylock of mutex/rw_sem, we may call rt_mutex_trylock() in irq in future.
> B) the critical section of ->wait_lock is short,
> making it irqs-disabled don't hurts responsibility/latency.
> C) almost all time of the critical section of ->wait_lock is irqs-disabled
> (due to task->pi_lock), I think converting whole time of the critical section
> of ->wait_lock to irqs-disabled is OK.
>
> So I hope you change rtmutex's lock->wait_lock.
>
> Any feedback from anyone is welcome.
>
> Thanks,
> Lai
>
> On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> >> Background)
> >>
> >> Although all articles declare that rcu read site is deadlock-immunity.
> >> It is not true for rcu-preempt, it will be deadlock if rcu read site
> >> overlaps with scheduler lock.
> >>
> >> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> >> is still not deadlock-immunity. And the problem described in 016a8d5b
> >> is still existed(rcu_read_unlock_special() calls wake_up).
> >>
> >> Aim)
> >>
> >> We want to fix the problem forever, we want to keep rcu read site
> >> is deadlock-immunity as books say.
> >>
> >> How)
> >>
> >> The problem is solved by "if rcu_read_unlock_special() is called inside
> >> any lock which can be (chained) nested in rcu_read_unlock_special(),
> >> we defer rcu_read_unlock_special()".
> >> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> >> in printk()/WARN_ON() and all locks nested in these locks or chained nested
> >> in these locks.
> >>
> >> The problem is reduced to "how to distinguish all these locks(context)",
> >> We don't distinguish all these locks, we know that all these locks
> >> should be nested in local_irqs_disable().
> >>
> >> we just consider if rcu_read_unlock_special() is called in irqs-disabled
> >> context, it may be called in these suspect locks, we should defer
> >> rcu_read_unlock_special().
> >>
> >> The algorithm enlarges the probability of deferring, but the probability
> >> is still very very low.
> >>
> >> Deferring does add a small overhead, but it offers us:
> >> 1) really deadlock-immunity for rcu read site
> >> 2) remove the overhead of the irq-work(250 times per second in avg.)
> >
> > One problem here -- it may take quite some time for a set_need_resched()
> > to take effect. This is especially a problem for RCU priority boosting,
> > but can also needlessly delay preemptible-RCU grace periods because
> > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
> >
> > OK, alternatives...
> >
> > o Keep the current rule saying that if the scheduler is going
> > to exit an RCU read-side critical section while holding
> > one of its spinlocks, preemption has to have been disabled
> > throughout the full duration of that critical section.
> > Well, we can certainly do this, but it would be nice to get
> > rid of this rule.
> >
> > o Use per-CPU variables, possibly injecting delay. This has ugly
> > disadvantages as noted above.
> >
> > o irq_work_queue() can wait a jiffy (or on some architectures,
> > quite a bit longer) before actually doing anything.
> >
> > o raise_softirq() is more immediate and is an easy change, but
> > adds a softirq vector -- which people are really trying to
> > get rid of. Also, wakeup_softirqd() calls things that acquire
> > the scheduler locks, which is exactly what we were trying to
> > avoid doing.
> >
> > o invoke_rcu_core() can invoke raise_softirq() as above.
> >
> > o IPI to self. From what I can see, not all architectures
> > support this. Easy to fake if you have at least two CPUs,
> > but not so good from an OS jitter viewpoint...
> >
> > o Add a check to local_irq_disable() and friends. I would guess
> > that this suggestion would not make architecture maintainers
> > happy.
> >
> > Other thoughts?
> >
> > Thanx, Paul
> >
> >> Signed-off-by: Lai Jiangshan <[email protected]>
> >> ---
> >> include/linux/rcupdate.h | 2 +-
> >> kernel/rcupdate.c | 2 +-
> >> kernel/rcutree_plugin.h | 47 +++++++++++++++++++++++++++++++++++++++++----
> >> 3 files changed, 44 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> >> index 4b14bdc..00b4220 100644
> >> --- a/include/linux/rcupdate.h
> >> +++ b/include/linux/rcupdate.h
> >> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
> >>
> >> extern void __rcu_read_lock(void);
> >> extern void __rcu_read_unlock(void);
> >> -extern void rcu_read_unlock_special(struct task_struct *t);
> >> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> >> void synchronize_rcu(void);
> >>
> >> /*
> >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> >> index cce6ba8..33b89a3 100644
> >> --- a/kernel/rcupdate.c
> >> +++ b/kernel/rcupdate.c
> >> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> >> #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> >> barrier(); /* assign before ->rcu_read_unlock_special load */
> >> if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> >> - rcu_read_unlock_special(t);
> >> + rcu_read_unlock_special(t, true);
> >> barrier(); /* ->rcu_read_unlock_special load before assign */
> >> t->rcu_read_lock_nesting = 0;
> >> }
> >> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> >> index fc8b36f..997b424 100644
> >> --- a/kernel/rcutree_plugin.h
> >> +++ b/kernel/rcutree_plugin.h
> >> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> >> ? rnp->gpnum
> >> : rnp->gpnum + 1);
> >> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >> - } else if (t->rcu_read_lock_nesting < 0 &&
> >> - !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> >> - t->rcu_read_unlock_special) {
> >> + } else if (t->rcu_read_lock_nesting == 0 ||
> >> + (t->rcu_read_lock_nesting < 0 &&
> >> + !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
> >>
> >> /*
> >> * Complete exit from RCU read-side critical section on
> >> * behalf of preempted instance of __rcu_read_unlock().
> >> */
> >> - rcu_read_unlock_special(t);
> >> + if (t->rcu_read_unlock_special)
> >> + rcu_read_unlock_special(t, false);
> >> }
> >>
> >> /*
> >> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
> >> * notify RCU core processing or task having blocked during the RCU
> >> * read-side critical section.
> >> */
> >> -void rcu_read_unlock_special(struct task_struct *t)
> >> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> >> {
> >> int empty;
> >> int empty_exp;
> >> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
> >>
> >> /* Clean up if blocked during RCU read-side critical section. */
> >> if (special & RCU_READ_UNLOCK_BLOCKED) {
> >> + /*
> >> + * If rcu read lock overlaps with scheduler lock,
> >> + * rcu_read_unlock_special() may lead to deadlock:
> >> + *
> >> + * rcu_read_lock();
> >> + * preempt_schedule[_irq]() (when preemption)
> >> + * scheduler lock; (or some other locks can be (chained) nested
> >> + * in rcu_read_unlock_special()/rnp->lock)
> >> + * access and check rcu data
> >> + * rcu_read_unlock();
> >> + * rcu_read_unlock_special();
> >> + * wake_up(); DEAD LOCK
> >> + *
> >> + * To avoid all these kinds of deadlock, we should quit
> >> + * rcu_read_unlock_special() here and defer it to
> >> + * rcu_preempt_note_context_switch() or next outmost
> >> + * rcu_read_unlock() if we consider this case may happen.
> >> + *
> >> + * Although we can't know whether current _special()
> >> + * is nested in scheduler lock or not. But we know that
> >> + * irqs are always disabled in this case. so we just quit
> >> + * and defer it to rcu_preempt_note_context_switch()
> >> + * when irqs are disabled.
> >> + *
> >> + * It means we always defer _special() when it is
> >> + * nested in irqs disabled context, but
> >> + * (special & RCU_READ_UNLOCK_BLOCKED) &&
> >> + * irqs_disabled_flags(flags)
> >> + * is still unlikely to be true.
> >> + */
> >> + if (unlikely(unlock && irqs_disabled_flags(flags))) {
> >> + set_need_resched();
> >> + local_irq_restore(flags);
> >> + return;
> >> + }
> >> +
> >> t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> >>
> >> /*
> >> --
> >> 1.7.4.4
> >>
> >
> >
>

2013-08-10 15:08:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Sat, Aug 10, 2013 at 08:07:15AM -0700, Paul E. McKenney wrote:
> On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> > Hi, Steven
> >
> > I was considering rtmutex's lock->wait_lock is a scheduler lock,
> > But it is not, and it is just a spinlock of process context.
> > I hope you change it to a spinlock of irq context.
> >
> > 1) it causes rcu read site more deadlockable, example:
> > x is a spinlock of softirq context.
> >
> > CPU1 cpu2(rcu boost)
> > rcu_read_lock() rt_mutext_lock()
> > <preemption and reschedule back> raw_spin_lock(lock->wait_lock)
> > spin_lock_bh(x) <interrupt and doing softirq after irq>
> > rcu_read_unlock() do_softirq()
> > rcu_read_unlock_special()
> > rt_mutext_unlock()
> > raw_spin_lock(lock->wait_lock) spin_lock_bh(x) DEADLOCK
> >
> > This example can happen on any one of these code:
> > without my patchset
> > with my patchset
> > with my patchset but reverted patch2
> >
> > 2) why it causes more deadlockable: it extends the range of suspect locks.
> > #DEFINE suspect locks: any lock can be (chained) nested in rcu_read_unlock_special().
> >
> > So the suspect locks are: rnp->lock, scheduler locks, rtmutex's lock->wait_lock,
> > locks in prink()/WARN_ON() and the locks which can be chained/indirectly nested
> > in the above locks.
> >
> > If the rtmutex's lock->wait_lock is a spinlock of irq context, all suspect locks are
> > some spinlocks of irq context.
> >
> > If the rtmutex's lock->wait_lock is a spinlock of process context, suspect locks
> > will be extended to, all spinlocks of irq context, all spinlocks of softirq context,
> > and (all spinlocks of process context but nested in rtmutex's lock->wait_lock).
> >
> > We can see from the definition, if rcu_read_unlock_special() is called from
> > any suspect lock, it may be deadlock like the example. the rtmutex's lock->wait_lock
> > extends the range of suspect locks, it causes more deadlockable.
> >
> > 3) How my algorithm works, why smaller range of suspect locks help us.
> > Since rcu_read_unlock_special() can't be called from suspect locks context,
> > we should defer rcu_read_unlock_special() when in these contexts.
> > It is hard to find out current context is suspect locks context or not,
> > but we can determine it based on irq/softirq/process context.
> >
> > if all suspect locks are some spinlocks of irq context:
> > if (irqs_disabled) /* we may be in suspect locks context */
> > defer rcu_read_unlock_special().
> >
> > if all suspect locks are some spinlocks of irq/softirq/process context:
> > if (irqs_disabled || in_atomic()) /* we may be in suspect locks context */
> > defer rcu_read_unlock_special().
> > In this case, the deferring becomes large more, I can't accept it.
> > So I have to narrow the range of suspect locks. Two choices:
> > A) don't call rt_mutex_unlock() from rcu_read_unlock(), only call it
> > from rcu_preempt_not_context_switch(). we need to rework these
> > two functions and it will add complexity to RCU, and it also still
> > adds some probability of deferring.
>
> One advantage of bh-disable locks is that enabling bh checks
> TIF_NEED_RESCHED, so that there is no deferring beyond that
> needed by bh disable. The same of course applies to preempt_disable().
>
> So one approach is to defer when rcu_read_unlock_special() is entered
> with either preemption or bh disabled. Your current set_need_resched()
> trick would work fine in this case. Unfortunately, re-enabling interrupts
> does -not- check TIF_NEED_RESCHED, which is why we have latency problems
> in that case. (Hence my earlier question about making self-IPI safe
> on all arches, which would result in an interrupt as soon as interrupts
> were re-enabled.)
>
> Another possibility is to defer only when preemption or bh are disabled
> on entry ro rcu_read_unlock_special(), but to retain the current
> (admittedly ugly) nesting rules for the scheduler locks.
>
> > B) change rtmutex's lock->wait_lock to irqs-disabled.
>
> I have to defer to Steven on this one.

C) Remove support for RCU priority boosting.

I am reluctant to do this, but...

Thanx, Paul

> > 4) In the view of rtmutex, I think it will be better if ->wait_lock is irqs-disabled.
> > A) like trylock of mutex/rw_sem, we may call rt_mutex_trylock() in irq in future.
> > B) the critical section of ->wait_lock is short,
> > making it irqs-disabled don't hurts responsibility/latency.
> > C) almost all time of the critical section of ->wait_lock is irqs-disabled
> > (due to task->pi_lock), I think converting whole time of the critical section
> > of ->wait_lock to irqs-disabled is OK.
> >
> > So I hope you change rtmutex's lock->wait_lock.
> >
> > Any feedback from anyone is welcome.
> >
> > Thanks,
> > Lai
> >
> > On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > > On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> > >> Background)
> > >>
> > >> Although all articles declare that rcu read site is deadlock-immunity.
> > >> It is not true for rcu-preempt, it will be deadlock if rcu read site
> > >> overlaps with scheduler lock.
> > >>
> > >> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> > >> is still not deadlock-immunity. And the problem described in 016a8d5b
> > >> is still existed(rcu_read_unlock_special() calls wake_up).
> > >>
> > >> Aim)
> > >>
> > >> We want to fix the problem forever, we want to keep rcu read site
> > >> is deadlock-immunity as books say.
> > >>
> > >> How)
> > >>
> > >> The problem is solved by "if rcu_read_unlock_special() is called inside
> > >> any lock which can be (chained) nested in rcu_read_unlock_special(),
> > >> we defer rcu_read_unlock_special()".
> > >> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> > >> in printk()/WARN_ON() and all locks nested in these locks or chained nested
> > >> in these locks.
> > >>
> > >> The problem is reduced to "how to distinguish all these locks(context)",
> > >> We don't distinguish all these locks, we know that all these locks
> > >> should be nested in local_irqs_disable().
> > >>
> > >> we just consider if rcu_read_unlock_special() is called in irqs-disabled
> > >> context, it may be called in these suspect locks, we should defer
> > >> rcu_read_unlock_special().
> > >>
> > >> The algorithm enlarges the probability of deferring, but the probability
> > >> is still very very low.
> > >>
> > >> Deferring does add a small overhead, but it offers us:
> > >> 1) really deadlock-immunity for rcu read site
> > >> 2) remove the overhead of the irq-work(250 times per second in avg.)
> > >
> > > One problem here -- it may take quite some time for a set_need_resched()
> > > to take effect. This is especially a problem for RCU priority boosting,
> > > but can also needlessly delay preemptible-RCU grace periods because
> > > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
> > >
> > > OK, alternatives...
> > >
> > > o Keep the current rule saying that if the scheduler is going
> > > to exit an RCU read-side critical section while holding
> > > one of its spinlocks, preemption has to have been disabled
> > > throughout the full duration of that critical section.
> > > Well, we can certainly do this, but it would be nice to get
> > > rid of this rule.
> > >
> > > o Use per-CPU variables, possibly injecting delay. This has ugly
> > > disadvantages as noted above.
> > >
> > > o irq_work_queue() can wait a jiffy (or on some architectures,
> > > quite a bit longer) before actually doing anything.
> > >
> > > o raise_softirq() is more immediate and is an easy change, but
> > > adds a softirq vector -- which people are really trying to
> > > get rid of. Also, wakeup_softirqd() calls things that acquire
> > > the scheduler locks, which is exactly what we were trying to
> > > avoid doing.
> > >
> > > o invoke_rcu_core() can invoke raise_softirq() as above.
> > >
> > > o IPI to self. From what I can see, not all architectures
> > > support this. Easy to fake if you have at least two CPUs,
> > > but not so good from an OS jitter viewpoint...
> > >
> > > o Add a check to local_irq_disable() and friends. I would guess
> > > that this suggestion would not make architecture maintainers
> > > happy.
> > >
> > > Other thoughts?
> > >
> > > Thanx, Paul
> > >
> > >> Signed-off-by: Lai Jiangshan <[email protected]>
> > >> ---
> > >> include/linux/rcupdate.h | 2 +-
> > >> kernel/rcupdate.c | 2 +-
> > >> kernel/rcutree_plugin.h | 47 +++++++++++++++++++++++++++++++++++++++++----
> > >> 3 files changed, 44 insertions(+), 7 deletions(-)
> > >>
> > >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > >> index 4b14bdc..00b4220 100644
> > >> --- a/include/linux/rcupdate.h
> > >> +++ b/include/linux/rcupdate.h
> > >> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
> > >>
> > >> extern void __rcu_read_lock(void);
> > >> extern void __rcu_read_unlock(void);
> > >> -extern void rcu_read_unlock_special(struct task_struct *t);
> > >> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> > >> void synchronize_rcu(void);
> > >>
> > >> /*
> > >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> > >> index cce6ba8..33b89a3 100644
> > >> --- a/kernel/rcupdate.c
> > >> +++ b/kernel/rcupdate.c
> > >> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> > >> #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> > >> barrier(); /* assign before ->rcu_read_unlock_special load */
> > >> if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> > >> - rcu_read_unlock_special(t);
> > >> + rcu_read_unlock_special(t, true);
> > >> barrier(); /* ->rcu_read_unlock_special load before assign */
> > >> t->rcu_read_lock_nesting = 0;
> > >> }
> > >> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > >> index fc8b36f..997b424 100644
> > >> --- a/kernel/rcutree_plugin.h
> > >> +++ b/kernel/rcutree_plugin.h
> > >> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> > >> ? rnp->gpnum
> > >> : rnp->gpnum + 1);
> > >> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > >> - } else if (t->rcu_read_lock_nesting < 0 &&
> > >> - !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> > >> - t->rcu_read_unlock_special) {
> > >> + } else if (t->rcu_read_lock_nesting == 0 ||
> > >> + (t->rcu_read_lock_nesting < 0 &&
> > >> + !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
> > >>
> > >> /*
> > >> * Complete exit from RCU read-side critical section on
> > >> * behalf of preempted instance of __rcu_read_unlock().
> > >> */
> > >> - rcu_read_unlock_special(t);
> > >> + if (t->rcu_read_unlock_special)
> > >> + rcu_read_unlock_special(t, false);
> > >> }
> > >>
> > >> /*
> > >> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
> > >> * notify RCU core processing or task having blocked during the RCU
> > >> * read-side critical section.
> > >> */
> > >> -void rcu_read_unlock_special(struct task_struct *t)
> > >> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> > >> {
> > >> int empty;
> > >> int empty_exp;
> > >> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
> > >>
> > >> /* Clean up if blocked during RCU read-side critical section. */
> > >> if (special & RCU_READ_UNLOCK_BLOCKED) {
> > >> + /*
> > >> + * If rcu read lock overlaps with scheduler lock,
> > >> + * rcu_read_unlock_special() may lead to deadlock:
> > >> + *
> > >> + * rcu_read_lock();
> > >> + * preempt_schedule[_irq]() (when preemption)
> > >> + * scheduler lock; (or some other locks can be (chained) nested
> > >> + * in rcu_read_unlock_special()/rnp->lock)
> > >> + * access and check rcu data
> > >> + * rcu_read_unlock();
> > >> + * rcu_read_unlock_special();
> > >> + * wake_up(); DEAD LOCK
> > >> + *
> > >> + * To avoid all these kinds of deadlock, we should quit
> > >> + * rcu_read_unlock_special() here and defer it to
> > >> + * rcu_preempt_note_context_switch() or next outmost
> > >> + * rcu_read_unlock() if we consider this case may happen.
> > >> + *
> > >> + * Although we can't know whether current _special()
> > >> + * is nested in scheduler lock or not. But we know that
> > >> + * irqs are always disabled in this case. so we just quit
> > >> + * and defer it to rcu_preempt_note_context_switch()
> > >> + * when irqs are disabled.
> > >> + *
> > >> + * It means we always defer _special() when it is
> > >> + * nested in irqs disabled context, but
> > >> + * (special & RCU_READ_UNLOCK_BLOCKED) &&
> > >> + * irqs_disabled_flags(flags)
> > >> + * is still unlikely to be true.
> > >> + */
> > >> + if (unlikely(unlock && irqs_disabled_flags(flags))) {
> > >> + set_need_resched();
> > >> + local_irq_restore(flags);
> > >> + return;
> > >> + }
> > >> +
> > >> t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> > >>
> > >> /*
> > >> --
> > >> 1.7.4.4
> > >>
> > >
> > >
> >

2013-08-12 13:53:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> Hi, Steven
>
> I was considering rtmutex's lock->wait_lock is a scheduler lock,
> But it is not, and it is just a spinlock of process context.
> I hope you change it to a spinlock of irq context.

rtmutex::wait_lock is irq-safe; it had better be, because it's taken under
task_struct::pi_lock.

2013-08-12 13:55:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Fri, Aug 09, 2013 at 05:31:27PM +0800, Lai Jiangshan wrote:
> On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > One problem here -- it may take quite some time for a set_need_resched()
> > to take effect. This is especially a problem for RCU priority boosting,
> > but can also needlessly delay preemptible-RCU grace periods because
> > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
>
>
> The final effect of deboosting(rt_mutex_unlock()) is also accomplished
> via set_need_resched()/set_tsk_need_resched().
> set_need_resched() is enough for RCU priority boosting issue here.

But there's a huge difference between the boosting and deboosting sides
of things. rcu_read_unlock_special() starts the boost; the deboosting
only matters if/when you reschedule.

2013-08-12 14:10:14

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Mon, 12 Aug 2013 15:53:10 +0200
Peter Zijlstra <[email protected]> wrote:

> On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> > Hi, Steven
> >
> > I was considering rtmutex's lock->wait_lock is a scheduler lock,
> > But it is not, and it is just a spinlock of process context.
> > I hope you change it to a spinlock of irq context.
>
> rwmutex::wait_lock is irq-safe; it had better be because its taken under
> task_struct::pi_lock.

It is? I thought it was the other way around. That is, pi_lock is taken
under wait_lock.

task_blocks_on_rt_mutex() is called with wait_lock held. The first
thing it does is to grab the pi_lock.
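
Roughly (simplified; not a verbatim excerpt from kernel/rtmutex.c):

	raw_spin_lock(&lock->wait_lock);	/* mutex-internal lock, process context */
	task_blocks_on_rt_mutex(lock, &waiter, task, detect_deadlock);
		/* first thing inside: */
		raw_spin_lock_irqsave(&task->pi_lock, flags);	/* pi_lock nests inside wait_lock */
		...
		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
	...
	raw_spin_unlock(&lock->wait_lock);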

The wait_lock is the mutex-internal lock. Currently, no interrupt
context should take that lock, nor should an interrupt-safe lock be
held when it is taken. The lock should be taken in the same contexts
that a mutex can be taken in.

By making it an irq-safe lock, we need to disable interrupts every time
it is taken, which means the entire pi-chain walk in
rt_mutex_adjust_prio_chain() will pretty much run with interrupts
disabled. I really would like to avoid making wait_lock irq-safe if we
can. I'm sure it will have a large impact on the -rt kernel if
we convert it.


-- Steve

2013-08-12 14:13:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

On Thu, Aug 08, 2013 at 12:59:55PM -0500, Eric Boxer wrote:
> Eric Boxer liked your message with Boxer. On Wed, Aug 07, 2013 at 07:36 PM, Paul E. McKenney wrote:

WTF kinda crap is this and can we stop this?

2013-08-12 14:15:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Mon, Aug 12, 2013 at 10:10:08AM -0400, Steven Rostedt wrote:
> On Mon, 12 Aug 2013 15:53:10 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> > > Hi, Steven
> > >
> > > I was considering rtmutex's lock->wait_lock is a scheduler lock,
> > > But it is not, and it is just a spinlock of process context.
> > > I hope you change it to a spinlock of irq context.
> >
> > rwmutex::wait_lock is irq-safe; it had better be because its taken under
> > task_struct::pi_lock.
>
> It is? I thought it was the other way around. That is, pi_lock is taken
> under wait_lock.

Urgh, right you are.

2013-08-12 15:17:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Mon, Aug 12, 2013 at 03:55:44PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 09, 2013 at 05:31:27PM +0800, Lai Jiangshan wrote:
> > On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > > One problem here -- it may take quite some time for a set_need_resched()
> > > to take effect. This is especially a problem for RCU priority boosting,
> > > but can also needlessly delay preemptible-RCU grace periods because
> > > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
> >
> >
> > The final effect of deboosting(rt_mutex_unlock()) is also accomplished
> > via set_need_resched()/set_tsk_need_resched().
> > set_need_resched() is enough for RCU priority boosting issue here.
>
> But there's a huge difference between the boosting and deboosting side
> of things. rcu_read_unlock_special() starts the boost, the deboosting
> only matters if/when you reschedule.

Or if there is a pre-existing runnable task whose priority is such that
deboosting makes it the highest-priority task.

Thanx, Paul

2013-08-12 16:21:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Mon, Aug 12, 2013 at 08:16:18AM -0700, Paul E. McKenney wrote:
> On Mon, Aug 12, 2013 at 03:55:44PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 09, 2013 at 05:31:27PM +0800, Lai Jiangshan wrote:
> > > On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > > > One problem here -- it may take quite some time for a set_need_resched()
> > > > to take effect. This is especially a problem for RCU priority boosting,
> > > > but can also needlessly delay preemptible-RCU grace periods because
> > > > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
> > >
> > >
> > > The final effect of deboosting(rt_mutex_unlock()) is also accomplished
> > > via set_need_resched()/set_tsk_need_resched().
> > > set_need_resched() is enough for RCU priority boosting issue here.
> >
> > But there's a huge difference between the boosting and deboosting side
> > of things. rcu_read_unlock_special() starts the boost, the deboosting
> > only matters if/when you reschedule.
>
> Or if there is a pre-existing runnable task whose priority is such that
> deboosting makes it the highest-priority task.

Right, I got horribly lost in rt_mutex, but I suspect we deal with that
case the right way. -rt people would've noticed us screwing that up ;-)

But there too, we're fully limited by how fast we can get a
reschedule(). Deboosting sooner than we can reschedule to run the other
task is effectively pointless. The converse is obviously not true; we
must not be able to reschedule sooner than we can deboost ;-)

2013-08-12 16:44:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Mon, Aug 12, 2013 at 06:21:26PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 12, 2013 at 08:16:18AM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 12, 2013 at 03:55:44PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 09, 2013 at 05:31:27PM +0800, Lai Jiangshan wrote:
> > > > On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > > > > One problem here -- it may take quite some time for a set_need_resched()
> > > > > to take effect. This is especially a problem for RCU priority boosting,
> > > > > but can also needlessly delay preemptible-RCU grace periods because
> > > > > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
> > > >
> > > >
> > > > The final effect of deboosting(rt_mutex_unlock()) is also accomplished
> > > > via set_need_resched()/set_tsk_need_resched().
> > > > set_need_resched() is enough for RCU priority boosting issue here.
> > >
> > > But there's a huge difference between the boosting and deboosting side
> > > of things. rcu_read_unlock_special() starts the boost, the deboosting
> > > only matters if/when you reschedule.
> >
> > Or if there is a pre-existing runnable task whose priority is such that
> > deboosting makes it the highest-priority task.
>
> Right, I got horribly lost in rt_mutex, but I suspect we deal with that
> case the right way. -rt people would've noticed us screwing that up ;-)
>
> But there too, we're fully limited to how fast we can get a
> reschedule(). Deboosting sooner than we can reschedule to run the other
> task is effectively pointless. The converse is obviously not true; we
> must not be able to reschedule sooner than we can deboost ;-)

In addition, the proposed change was to defer the deboost based on
a set_need_resched(), which would provide additional opportunity for
delay -- the running task would retain its high priority until the
scheduler acted on the set_need_resched().

Thanx, Paul

2013-08-21 03:17:35

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Sat, Aug 10, 2013 at 08:07:15AM -0700, Paul E. McKenney wrote:
> On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:

[ . . . ]

> > So I have to narrow the range of suspect locks. Two choices:
> > A) don't call rt_mutex_unlock() from rcu_read_unlock(), only call it
> > from rcu_preempt_not_context_switch(). we need to rework these
> > two functions and it will add complexity to RCU, and it also still
> > adds some probability of deferring.
>
> One advantage of bh-disable locks is that enabling bh checks
> TIF_NEED_RESCHED, so that there is no deferring beyond that
> needed by bh disable. The same of course applies to preempt_disable().
>
> So one approach is to defer when rcu_read_unlock_special() is entered
> with either preemption or bh disabled. Your current set_need_resched()
> trick would work fine in this case. Unfortunately, re-enabling interrupts
> does -not- check TIF_NEED_RESCHED, which is why we have latency problems
> in that case. (Hence my earlier question about making self-IPI safe
> on all arches, which would result in an interrupt as soon as interrupts
> were re-enabled.)
>
> Another possibility is to defer only when preemption or bh are disabled
> on entry ro rcu_read_unlock_special(), but to retain the current
> (admittedly ugly) nesting rules for the scheduler locks.

Would you be willing to do a patch that deferred rt_mutex_unlock() in
the preempt/bh cases? This of course does not solve the irq-disable
case, but it should at least narrow the problem to the scheduler locks.

Not a big hurry; given the testing required, this is 3.13 or 3.14 material,
I think.

If you are busy, no problem, I can do it; I just figured you have priority
if you want it.

Thanx, Paul

> > B) change rtmutex's lock->wait_lock to irqs-disabled.
>
> I have to defer to Steven on this one.
>
> Thanx, Paul
>
> > 4) In the view of rtmutex, I think it will be better if ->wait_lock is irqs-disabled.
> > A) like trylock of mutex/rw_sem, we may call rt_mutex_trylock() in irq in future.
> > B) the critical section of ->wait_lock is short,
> > making it irqs-disabled don't hurts responsibility/latency.
> > C) almost all time of the critical section of ->wait_lock is irqs-disabled
> > (due to task->pi_lock), I think converting whole time of the critical section
> > of ->wait_lock to irqs-disabled is OK.
> >
> > So I hope you change rtmutex's lock->wait_lock.
> >
> > Any feedback from anyone is welcome.
> >
> > Thanks,
> > Lai
> >
> > On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > > On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> > >> Background)
> > >>
> > >> Although all articles declare that rcu read site is deadlock-immunity.
> > >> It is not true for rcu-preempt, it will be deadlock if rcu read site
> > >> overlaps with scheduler lock.
> > >>
> > >> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> > >> is still not deadlock-immunity. And the problem described in 016a8d5b
> > >> is still existed(rcu_read_unlock_special() calls wake_up).
> > >>
> > >> Aim)
> > >>
> > >> We want to fix the problem forever, we want to keep rcu read site
> > >> is deadlock-immunity as books say.
> > >>
> > >> How)
> > >>
> > >> The problem is solved by "if rcu_read_unlock_special() is called inside
> > >> any lock which can be (chained) nested in rcu_read_unlock_special(),
> > >> we defer rcu_read_unlock_special()".
> > >> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> > >> in printk()/WARN_ON() and all locks nested in these locks or chained nested
> > >> in these locks.
> > >>
> > >> The problem is reduced to "how to distinguish all these locks(context)",
> > >> We don't distinguish all these locks, we know that all these locks
> > >> should be nested in local_irqs_disable().
> > >>
> > >> we just consider if rcu_read_unlock_special() is called in irqs-disabled
> > >> context, it may be called in these suspect locks, we should defer
> > >> rcu_read_unlock_special().
> > >>
> > >> The algorithm enlarges the probability of deferring, but the probability
> > >> is still very very low.
> > >>
> > >> Deferring does add a small overhead, but it offers us:
> > >> 1) really deadlock-immunity for rcu read site
> > >> 2) remove the overhead of the irq-work(250 times per second in avg.)
> > >
> > > One problem here -- it may take quite some time for a set_need_resched()
> > > to take effect. This is especially a problem for RCU priority boosting,
> > > but can also needlessly delay preemptible-RCU grace periods because
> > > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
> > >
> > > OK, alternatives...
> > >
> > > o Keep the current rule saying that if the scheduler is going
> > > to exit an RCU read-side critical section while holding
> > > one of its spinlocks, preemption has to have been disabled
> > > throughout the full duration of that critical section.
> > > Well, we can certainly do this, but it would be nice to get
> > > rid of this rule.
> > >
> > > o Use per-CPU variables, possibly injecting delay. This has ugly
> > > disadvantages as noted above.
> > >
> > > o irq_work_queue() can wait a jiffy (or on some architectures,
> > > quite a bit longer) before actually doing anything.
> > >
> > > o raise_softirq() is more immediate and is an easy change, but
> > > adds a softirq vector -- which people are really trying to
> > > get rid of. Also, wakeup_softirqd() calls things that acquire
> > > the scheduler locks, which is exactly what we were trying to
> > > avoid doing.
> > >
> > > o invoke_rcu_core() can invoke raise_softirq() as above.
> > >
> > > o IPI to self. From what I can see, not all architectures
> > > support this. Easy to fake if you have at least two CPUs,
> > > but not so good from an OS jitter viewpoint...
> > >
> > > o Add a check to local_irq_disable() and friends. I would guess
> > > that this suggestion would not make architecture maintainers
> > > happy.
> > >
> > > Other thoughts?
> > >
> > > Thanx, Paul
> > >
> > >> Signed-off-by: Lai Jiangshan <[email protected]>
> > >> ---
> > >> include/linux/rcupdate.h | 2 +-
> > >> kernel/rcupdate.c | 2 +-
> > >> kernel/rcutree_plugin.h | 47 +++++++++++++++++++++++++++++++++++++++++----
> > >> 3 files changed, 44 insertions(+), 7 deletions(-)
> > >>
> > >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > >> index 4b14bdc..00b4220 100644
> > >> --- a/include/linux/rcupdate.h
> > >> +++ b/include/linux/rcupdate.h
> > >> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
> > >>
> > >> extern void __rcu_read_lock(void);
> > >> extern void __rcu_read_unlock(void);
> > >> -extern void rcu_read_unlock_special(struct task_struct *t);
> > >> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> > >> void synchronize_rcu(void);
> > >>
> > >> /*
> > >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> > >> index cce6ba8..33b89a3 100644
> > >> --- a/kernel/rcupdate.c
> > >> +++ b/kernel/rcupdate.c
> > >> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> > >> #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> > >> barrier(); /* assign before ->rcu_read_unlock_special load */
> > >> if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> > >> - rcu_read_unlock_special(t);
> > >> + rcu_read_unlock_special(t, true);
> > >> barrier(); /* ->rcu_read_unlock_special load before assign */
> > >> t->rcu_read_lock_nesting = 0;
> > >> }
> > >> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > >> index fc8b36f..997b424 100644
> > >> --- a/kernel/rcutree_plugin.h
> > >> +++ b/kernel/rcutree_plugin.h
> > >> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> > >> ? rnp->gpnum
> > >> : rnp->gpnum + 1);
> > >> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > >> - } else if (t->rcu_read_lock_nesting < 0 &&
> > >> - !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> > >> - t->rcu_read_unlock_special) {
> > >> + } else if (t->rcu_read_lock_nesting == 0 ||
> > >> + (t->rcu_read_lock_nesting < 0 &&
> > >> + !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
> > >>
> > >> /*
> > >> * Complete exit from RCU read-side critical section on
> > >> * behalf of preempted instance of __rcu_read_unlock().
> > >> */
> > >> - rcu_read_unlock_special(t);
> > >> + if (t->rcu_read_unlock_special)
> > >> + rcu_read_unlock_special(t, false);
> > >> }
> > >>
> > >> /*
> > >> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
> > >> * notify RCU core processing or task having blocked during the RCU
> > >> * read-side critical section.
> > >> */
> > >> -void rcu_read_unlock_special(struct task_struct *t)
> > >> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> > >> {
> > >> int empty;
> > >> int empty_exp;
> > >> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
> > >>
> > >> /* Clean up if blocked during RCU read-side critical section. */
> > >> if (special & RCU_READ_UNLOCK_BLOCKED) {
> > >> + /*
> > >> + * If rcu read lock overlaps with scheduler lock,
> > >> + * rcu_read_unlock_special() may lead to deadlock:
> > >> + *
> > >> + * rcu_read_lock();
> > >> + * preempt_schedule[_irq]() (when preemption)
> > >> + * scheduler lock; (or some other locks can be (chained) nested
> > >> + * in rcu_read_unlock_special()/rnp->lock)
> > >> + * access and check rcu data
> > >> + * rcu_read_unlock();
> > >> + * rcu_read_unlock_special();
> > >> + * wake_up(); DEAD LOCK
> > >> + *
> > >> + * To avoid all these kinds of deadlock, we should quit
> > >> + * rcu_read_unlock_special() here and defer it to
> > >> + * rcu_preempt_note_context_switch() or next outmost
> > >> + * rcu_read_unlock() if we consider this case may happen.
> > >> + *
> > >> + * Although we can't know whether current _special()
> > >> + * is nested in scheduler lock or not. But we know that
> > >> + * irqs are always disabled in this case. so we just quit
> > >> + * and defer it to rcu_preempt_note_context_switch()
> > >> + * when irqs are disabled.
> > >> + *
> > >> + * It means we always defer _special() when it is
> > >> + * nested in irqs disabled context, but
> > >> + * (special & RCU_READ_UNLOCK_BLOCKED) &&
> > >> + * irqs_disabled_flags(flags)
> > >> + * is still unlikely to be true.
> > >> + */
> > >> + if (unlikely(unlock && irqs_disabled_flags(flags))) {
> > >> + set_need_resched();
> > >> + local_irq_restore(flags);
> > >> + return;
> > >> + }
> > >> +
> > >> t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> > >>
> > >> /*
> > >> --
> > >> 1.7.4.4
> > >>
> > >
> > >
> >

2013-08-21 03:21:38

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On 08/21/2013 11:17 AM, Paul E. McKenney wrote:
> On Sat, Aug 10, 2013 at 08:07:15AM -0700, Paul E. McKenney wrote:
>> On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
>
> [ . . . ]
>
>>> So I have to narrow the range of suspect locks. Two choices:
>>> A) don't call rt_mutex_unlock() from rcu_read_unlock(), only call it
>>> from rcu_preempt_not_context_switch(). we need to rework these
>>> two functions and it will add complexity to RCU, and it also still
>>> adds some probability of deferring.
>>
>> One advantage of bh-disable locks is that enabling bh checks
>> TIF_NEED_RESCHED, so that there is no deferring beyond that
>> needed by bh disable. The same of course applies to preempt_disable().
>>
>> So one approach is to defer when rcu_read_unlock_special() is entered
>> with either preemption or bh disabled. Your current set_need_resched()
>> trick would work fine in this case. Unfortunately, re-enabling interrupts
>> does -not- check TIF_NEED_RESCHED, which is why we have latency problems
>> in that case. (Hence my earlier question about making self-IPI safe
>> on all arches, which would result in an interrupt as soon as interrupts
>> were re-enabled.)
>>
>> Another possibility is to defer only when preemption or bh are disabled
>> on entry ro rcu_read_unlock_special(), but to retain the current
>> (admittedly ugly) nesting rules for the scheduler locks.
>
> Would you be willing to do a patch that deferred rt_mutex_unlock() in
> the preempt/bh cases? This of course does not solve the irq-disable
> case, but it should at least narrow the problem to the scheduler locks.
>
> Not a big hurry, given the testing required, this is 3.13 or 3.14 material,
> I think.
>
> If you are busy, no problem, I can do it, just figured you have priority
> if you want it.
>
>


I'm writing a special rt_mutex_unlock() for rcu deboost only.
I hope Steven accepts it.

Thanks,
Lai

2013-08-21 13:42:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Wed, Aug 21, 2013 at 11:25:55AM +0800, Lai Jiangshan wrote:
> On 08/21/2013 11:17 AM, Paul E. McKenney wrote:
> > On Sat, Aug 10, 2013 at 08:07:15AM -0700, Paul E. McKenney wrote:
> >> On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> >
> > [ . . . ]
> >
> >>> So I have to narrow the range of suspect locks. Two choices:
> >>> A) don't call rt_mutex_unlock() from rcu_read_unlock(), only call it
> >>> from rcu_preempt_not_context_switch(). we need to rework these
> >>> two functions and it will add complexity to RCU, and it also still
> >>> adds some probability of deferring.
> >>
> >> One advantage of bh-disable locks is that enabling bh checks
> >> TIF_NEED_RESCHED, so that there is no deferring beyond that
> >> needed by bh disable. The same of course applies to preempt_disable().
> >>
> >> So one approach is to defer when rcu_read_unlock_special() is entered
> >> with either preemption or bh disabled. Your current set_need_resched()
> >> trick would work fine in this case. Unfortunately, re-enabling interrupts
> >> does -not- check TIF_NEED_RESCHED, which is why we have latency problems
> >> in that case. (Hence my earlier question about making self-IPI safe
> >> on all arches, which would result in an interrupt as soon as interrupts
> >> were re-enabled.)
> >>
> >> Another possibility is to defer only when preemption or bh are disabled
> >> on entry ro rcu_read_unlock_special(), but to retain the current
> >> (admittedly ugly) nesting rules for the scheduler locks.
> >
> > Would you be willing to do a patch that deferred rt_mutex_unlock() in
> > the preempt/bh cases? This of course does not solve the irq-disable
> > case, but it should at least narrow the problem to the scheduler locks.
> >
> > Not a big hurry, given the testing required, this is 3.13 or 3.14 material,
> > I think.
> >
> > If you are busy, no problem, I can do it, just figured you have priority
> > if you want it.
>
> I'm writing a special rt_mutex_unlock() for rcu deboost only.
> I hope Steven accept it.

That would be very cool, though if I understand the requirements,
especially for -rt, very challenging. Looking forward to seeing it!

Thanx, Paul

2013-08-22 14:34:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Thu, 22 Aug 2013 22:23:09 +0800
Lai Jiangshan <[email protected]> wrote:


> > By making it a irq-safe lock, we need to disable interrupts every time
> > it is taken, which means the entire pi-chain walk in
> > rt_mutex_adjust_prio_chain() will pretty much be with interrupts
> > disabled.
>
>
> I didn't catch your meaning.
> rt_mutex_adjust_prio_chain() is called without wait_lock held.
> current C.S. of wait_lock are really short.
>

There is quite a bit of overhead to enable and disable interrupts. If
we are doing this in a very fast path, it's going to slow down -rt even
more.

It would be really nice to avoid making wait_lock be irq safe.

-- Steve

2013-08-22 14:41:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Thu, 22 Aug 2013 10:34:31 -0400
Steven Rostedt <[email protected]> wrote:

> On Thu, 22 Aug 2013 22:23:09 +0800
> Lai Jiangshan <[email protected]> wrote:
>
>
> > > By making it a irq-safe lock, we need to disable interrupts every time
> > > it is taken, which means the entire pi-chain walk in
> > > rt_mutex_adjust_prio_chain() will pretty much be with interrupts
> > > disabled.
> >
> >
> > I didn't catch your meaning.
> > rt_mutex_adjust_prio_chain() is called without wait_lock held.
> > current C.S. of wait_lock are really short.
> >
>
> There is quite a bit of overhead to enable and disable interrupts. If
> we are doing this in a very fast path, it's going to slow down -rt even
> more.

Looking at rt_mutex_adjust_prio_chain(), it's not that bad because
the pi_lock needs irqs disabled too, and we would just extend that
section.

But it still extends the irqs-off region.

And in -rt, it is used in rt_spin_lock_slowlock(), where it can currently
exit the code without ever having to disable interrupts.

>
> It would be really nice to avoid making wait_lock be irq safe.

I still would like to exhaust other options before just making
wait_lock irq safe.

-- Steve

2013-08-23 06:52:16

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

[PATCH] rcu/rt_mutex: eliminate a kind of deadlock for rcu read site

Current rtmutex's lock->wait_lock disables neither softirqs nor irqs, so the
rcu read site can deadlock when rcu overlaps with any softirq-context/irq-context lock.

@L is a spinlock of softirq or irq context.

CPU1                                      cpu2 (rcu boost)
rcu_read_lock()                           rt_mutex_lock()
<preemption and reschedule back>            raw_spin_lock(lock->wait_lock)
spin_lock_XX(L)                           <interrupt and doing softirq or irq>
rcu_read_unlock()                           do_softirq()
  rcu_read_unlock_special()
    rt_mutex_unlock()
      raw_spin_lock(lock->wait_lock)          spin_lock_XX(L)   **DEADLOCK**

This patch fixes this kind of deadlock by removing rt_mutex_unlock() from
rcu_read_unlock(); the new rt_mutex_rcu_deboost_unlock() is called instead.
Thus rtmutex's lock->wait_lock will no longer be taken from rcu_read_unlock().

This patch does not eliminate all kinds of rcu-read-site deadlock: if @L is a
scheduler lock, the deadlock remains, and we should apply Paul's rule in that
case (avoid overlapping, or use preempt_disable()).

rt_mutex_rcu_deboost_unlock() requires that the @waiter is queued, so we
can't directly call rt_mutex_lock(&mtx) in the rcu_boost thread; instead we
split rt_mutex_lock(&mtx) into two steps, just like pi-futex. This results in
an internal state in the rcu_boost thread and makes it a bit more complicated.

Thanks
Lai

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 5cd0f09..8830874 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -102,7 +102,7 @@ extern struct group_info init_groups;

#ifdef CONFIG_RCU_BOOST
#define INIT_TASK_RCU_BOOST() \
- .rcu_boost_mutex = NULL,
+ .rcu_boost_waiter = NULL,
#else
#define INIT_TASK_RCU_BOOST()
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9995eb..1eca99f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1078,7 +1078,7 @@ struct task_struct {
struct rcu_node *rcu_blocked_node;
#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
#ifdef CONFIG_RCU_BOOST
- struct rt_mutex *rcu_boost_mutex;
+ struct rt_mutex_waiter *rcu_boost_waiter;
#endif /* #ifdef CONFIG_RCU_BOOST */

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
@@ -1723,7 +1723,7 @@ static inline void rcu_copy_process(struct task_struct *p)
p->rcu_blocked_node = NULL;
#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
#ifdef CONFIG_RCU_BOOST
- p->rcu_boost_mutex = NULL;
+ p->rcu_boost_waiter = NULL;
#endif /* #ifdef CONFIG_RCU_BOOST */
INIT_LIST_HEAD(&p->rcu_node_entry);
}
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 769e12e..d207ddd 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -33,6 +33,7 @@
#define RCU_KTHREAD_PRIO 1

#ifdef CONFIG_RCU_BOOST
+#include "rtmutex_common.h"
#define RCU_BOOST_PRIO CONFIG_RCU_BOOST_PRIO
#else
#define RCU_BOOST_PRIO RCU_KTHREAD_PRIO
@@ -340,7 +341,7 @@ void rcu_read_unlock_special(struct task_struct *t)
unsigned long flags;
struct list_head *np;
#ifdef CONFIG_RCU_BOOST
- struct rt_mutex *rbmp = NULL;
+ struct rt_mutex_waiter *waiter = NULL;
#endif /* #ifdef CONFIG_RCU_BOOST */
struct rcu_node *rnp;
int special;
@@ -397,10 +398,10 @@ void rcu_read_unlock_special(struct task_struct *t)
#ifdef CONFIG_RCU_BOOST
if (&t->rcu_node_entry == rnp->boost_tasks)
rnp->boost_tasks = np;
- /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */
- if (t->rcu_boost_mutex) {
- rbmp = t->rcu_boost_mutex;
- t->rcu_boost_mutex = NULL;
+ /* Snapshot/clear ->rcu_boost_waiter with rcu_node lock held. */
+ if (t->rcu_boost_waiter) {
+ waiter = t->rcu_boost_waiter;
+ t->rcu_boost_waiter = NULL;
}
#endif /* #ifdef CONFIG_RCU_BOOST */

@@ -426,8 +427,8 @@ void rcu_read_unlock_special(struct task_struct *t)

#ifdef CONFIG_RCU_BOOST
/* Unboost if we were boosted. */
- if (rbmp)
- rt_mutex_unlock(rbmp);
+ if (waiter)
+ rt_mutex_rcu_deboost_unlock(t, waiter);
#endif /* #ifdef CONFIG_RCU_BOOST */

/*
@@ -1129,9 +1130,6 @@ void exit_rcu(void)
#endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */

#ifdef CONFIG_RCU_BOOST
-
-#include "rtmutex_common.h"
-
#ifdef CONFIG_RCU_TRACE

static void rcu_initiate_boost_trace(struct rcu_node *rnp)
@@ -1181,14 +1179,15 @@ static int rcu_boost(struct rcu_node *rnp)
{
unsigned long flags;
struct rt_mutex mtx;
+ struct rt_mutex_waiter rcu_boost_waiter;
struct task_struct *t;
struct list_head *tb;
+ int ret;

if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL)
return 0; /* Nothing left to boost. */

raw_spin_lock_irqsave(&rnp->lock, flags);
-
/*
* Recheck under the lock: all tasks in need of boosting
* might exit their RCU read-side critical sections on their own.
@@ -1215,7 +1214,7 @@ static int rcu_boost(struct rcu_node *rnp)

/*
* We boost task t by manufacturing an rt_mutex that appears to
- * be held by task t. We leave a pointer to that rt_mutex where
+ * be held by task t. We leave a pointer to that rt_mutex_waiter where
* task t can find it, and task t will release the mutex when it
* exits its outermost RCU read-side critical section. Then
* simply acquiring this artificial rt_mutex will boost task
@@ -1230,11 +1229,30 @@ static int rcu_boost(struct rcu_node *rnp)
* section.
*/
t = container_of(tb, struct task_struct, rcu_node_entry);
+ get_task_struct(t);
rt_mutex_init_proxy_locked(&mtx, t);
- t->rcu_boost_mutex = &mtx;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
- rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */
- rt_mutex_unlock(&mtx); /* Keep lockdep happy. */
+
+ debug_rt_mutex_init_waiter(&rcu_boost_waiter);
+ /* Side effect: boosts task t's priority. */
+ ret = rt_mutex_start_proxy_lock(&mtx, &rcu_boost_waiter, current, 0);
+ if (WARN_ON_ONCE(ret)) {
+ put_task_struct(t);
+ return 0; /* temporary stop boosting */
+ }
+
+ raw_spin_lock_irqsave(&rnp->lock, flags);
+ if (&t->rcu_node_entry == rnp->exp_tasks ||
+ &t->rcu_node_entry == rnp->boost_tasks) {
+ t->rcu_boost_waiter = &rcu_boost_waiter;
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ } else {
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ rt_mutex_rcu_deboost_unlock(t, &rcu_boost_waiter);
+ }
+
+ put_task_struct(t);
+ rt_mutex_finish_proxy_lock(&mtx, NULL, &rcu_boost_waiter, 0);

return ACCESS_ONCE(rnp->exp_tasks) != NULL ||
ACCESS_ONCE(rnp->boost_tasks) != NULL;
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 0dd6aec..2f3caee 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -734,6 +734,43 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
rt_mutex_adjust_prio(current);
}

+#ifdef CONFIG_RCU_BOOST
+/*
+ * rt_mutex_rcu_deboost_unlock() - unlock in irq/bh/process context
+ *
+ * please revert the patch which introduces this function when
+ * rt_mutex's ->wait_lock is irq-off.
+ */
+void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
+ struct rt_mutex_waiter *waiter)
+{
+ unsigned long flags;
+ struct rt_mutex *lock = waiter->lock;
+
+ /*
+ * The correction of the following code is based on
+ * 1) current lock is owned by @owner
+ * 2) only one task(@waiter->task) is waiting on the @lock
+ * 3) the @waiter has been queued and keeps been queued
+ */
+ if (WARN_ON_ONCE(rt_mutex_owner(lock) != owner))
+ return; /* 1) */
+ if (WARN_ON_ONCE(rt_mutex_top_waiter(lock) != waiter))
+ return; /* 2) & 3) */
+ if (WARN_ON_ONCE(plist_node_empty(&waiter->pi_list_entry)))
+ return; /* 2) & 3) */
+
+ raw_spin_lock_irqsave(&owner->pi_lock, flags);
+ plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
+ lock->owner = NULL;
+ raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
+
+ wake_up_process(waiter->task);
+ /* Undo pi boosting if necessary: */
+ rt_mutex_adjust_prio(owner);
+}
+#endif /* #ifdef CONFIG_RCU_BOOST */
+
/*
* debug aware fast / slowpath lock,trylock,unlock
*
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 53a66c8..3cdbe82 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -117,6 +117,11 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
int detect_deadlock);

+#ifdef CONFIG_RCU_BOOST
+void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
+ struct rt_mutex_waiter *waiter);
+#endif /* #ifdef CONFIG_RCU_BOOST */
+
#ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h"
#else

2013-08-25 17:43:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Sun, Aug 25, 2013 at 11:19:37PM +0800, Lai Jiangshan wrote:
> Hi, Steven
>
> Any comments about this patch?

For whatever it is worth, it ran without incident for two hours worth
of rcutorture on my P5 test (boosting but no CPU hotplug).

Lai, do you have a specific test for this patch? Your deadlock
scenario looks plausible, but is apparently not occurring in the
mainline kernel.

Thanx, Paul

> Thanks,
> Lai
>
>
> On Fri, Aug 23, 2013 at 2:26 PM, Lai Jiangshan <[email protected]> wrote:
>
> > [PATCH] rcu/rt_mutex: eliminate a kind of deadlock for rcu read site
> >
> > Current rtmutex's lock->wait_lock doesn't disables softirq nor irq, it will
> > cause rcu read site deadlock when rcu overlaps with any
> > softirq-context/irq-context lock.
> >
> > @L is a spinlock of softirq or irq context.
> >
> > CPU1 cpu2(rcu boost)
> > rcu_read_lock() rt_mutext_lock()
> > <preemption and reschedule back> raw_spin_lock(lock->wait_lock)
> > spin_lock_XX(L) <interrupt and doing softirq or
> > irq>
> > rcu_read_unlock() do_softirq()
> > rcu_read_unlock_special()
> > rt_mutext_unlock()
> > raw_spin_lock(lock->wait_lock) spin_lock_XX(L) **DEADLOCK**
> >
> > This patch fixes this kind of deadlock by removing rt_mutext_unlock() from
> > rcu_read_unlock(), new rt_mutex_rcu_deboost_unlock() is called instead.
> > Thus rtmutex's lock->wait_lock will not be called from rcu_read_unlock().
> >
> > This patch does not eliminate all kinds of rcu-read-site deadlock,
> > if @L is a scheduler lock, it will be deadlock, we should apply Paul's rule
> > in this case.(avoid overlapping or preempt_disable()).
> >
> > rt_mutex_rcu_deboost_unlock() requires the @waiter is queued, so we
> > can't directly call rt_mutex_lock(&mtx) in the rcu_boost thread,
> > we split rt_mutex_lock(&mtx) into two steps just like pi-futex.
> > This result a internal state in rcu_boost thread and cause
> > rcu_boost thread a bit more complicated.
> >
> > Thanks
> > Lai
> >
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index 5cd0f09..8830874 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -102,7 +102,7 @@ extern struct group_info init_groups;
> >
> > #ifdef CONFIG_RCU_BOOST
> > #define INIT_TASK_RCU_BOOST() \
> > - .rcu_boost_mutex = NULL,
> > + .rcu_boost_waiter = NULL,
> > #else
> > #define INIT_TASK_RCU_BOOST()
> > #endif
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index e9995eb..1eca99f 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1078,7 +1078,7 @@ struct task_struct {
> > struct rcu_node *rcu_blocked_node;
> > #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> > #ifdef CONFIG_RCU_BOOST
> > - struct rt_mutex *rcu_boost_mutex;
> > + struct rt_mutex_waiter *rcu_boost_waiter;
> > #endif /* #ifdef CONFIG_RCU_BOOST */
> >
> > #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> > @@ -1723,7 +1723,7 @@ static inline void rcu_copy_process(struct
> > task_struct *p)
> > p->rcu_blocked_node = NULL;
> > #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> > #ifdef CONFIG_RCU_BOOST
> > - p->rcu_boost_mutex = NULL;
> > + p->rcu_boost_waiter = NULL;
> > #endif /* #ifdef CONFIG_RCU_BOOST */
> > INIT_LIST_HEAD(&p->rcu_node_entry);
> > }
> > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > index 769e12e..d207ddd 100644
> > --- a/kernel/rcutree_plugin.h
> > +++ b/kernel/rcutree_plugin.h
> > @@ -33,6 +33,7 @@
> > #define RCU_KTHREAD_PRIO 1
> >
> > #ifdef CONFIG_RCU_BOOST
> > +#include "rtmutex_common.h"
> > #define RCU_BOOST_PRIO CONFIG_RCU_BOOST_PRIO
> > #else
> > #define RCU_BOOST_PRIO RCU_KTHREAD_PRIO
> > @@ -340,7 +341,7 @@ void rcu_read_unlock_special(struct task_struct *t)
> > unsigned long flags;
> > struct list_head *np;
> > #ifdef CONFIG_RCU_BOOST
> > - struct rt_mutex *rbmp = NULL;
> > + struct rt_mutex_waiter *waiter = NULL;
> > #endif /* #ifdef CONFIG_RCU_BOOST */
> > struct rcu_node *rnp;
> > int special;
> > @@ -397,10 +398,10 @@ void rcu_read_unlock_special(struct task_struct *t)
> > #ifdef CONFIG_RCU_BOOST
> > if (&t->rcu_node_entry == rnp->boost_tasks)
> > rnp->boost_tasks = np;
> > - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock
> > held. */
> > - if (t->rcu_boost_mutex) {
> > - rbmp = t->rcu_boost_mutex;
> > - t->rcu_boost_mutex = NULL;
> > + /* Snapshot/clear ->rcu_boost_waiter with rcu_node lock
> > held. */
> > + if (t->rcu_boost_waiter) {
> > + waiter = t->rcu_boost_waiter;
> > + t->rcu_boost_waiter = NULL;
> > }
> > #endif /* #ifdef CONFIG_RCU_BOOST */
> >
> > @@ -426,8 +427,8 @@ void rcu_read_unlock_special(struct task_struct *t)
> >
> > #ifdef CONFIG_RCU_BOOST
> > /* Unboost if we were boosted. */
> > - if (rbmp)
> > - rt_mutex_unlock(rbmp);
> > + if (waiter)
> > + rt_mutex_rcu_deboost_unlock(t, waiter);
> > #endif /* #ifdef CONFIG_RCU_BOOST */
> >
> > /*
> > @@ -1129,9 +1130,6 @@ void exit_rcu(void)
> > #endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
> >
> > #ifdef CONFIG_RCU_BOOST
> > -
> > -#include "rtmutex_common.h"
> > -
> > #ifdef CONFIG_RCU_TRACE
> >
> > static void rcu_initiate_boost_trace(struct rcu_node *rnp)
> > @@ -1181,14 +1179,15 @@ static int rcu_boost(struct rcu_node *rnp)
> > {
> > unsigned long flags;
> > struct rt_mutex mtx;
> > + struct rt_mutex_waiter rcu_boost_waiter;
> > struct task_struct *t;
> > struct list_head *tb;
> > + int ret;
> >
> > if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL)
> > return 0; /* Nothing left to boost. */
> >
> > raw_spin_lock_irqsave(&rnp->lock, flags);
> > -
> > /*
> > * Recheck under the lock: all tasks in need of boosting
> > * might exit their RCU read-side critical sections on their own.
> > @@ -1215,7 +1214,7 @@ static int rcu_boost(struct rcu_node *rnp)
> >
> > /*
> > * We boost task t by manufacturing an rt_mutex that appears to
> > - * be held by task t. We leave a pointer to that rt_mutex where
> > + * be held by task t. We leave a pointer to that rt_mutex_waiter
> > where
> > * task t can find it, and task t will release the mutex when it
> > * exits its outermost RCU read-side critical section. Then
> > * simply acquiring this artificial rt_mutex will boost task
> > @@ -1230,11 +1229,30 @@ static int rcu_boost(struct rcu_node *rnp)
> > * section.
> > */
> > t = container_of(tb, struct task_struct, rcu_node_entry);
> > + get_task_struct(t);
> > rt_mutex_init_proxy_locked(&mtx, t);
> > - t->rcu_boost_mutex = &mtx;
> > raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > - rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */
> > - rt_mutex_unlock(&mtx); /* Keep lockdep happy. */
> > +
> > + debug_rt_mutex_init_waiter(&rcu_boost_waiter);
> > + /* Side effect: boosts task t's priority. */
> > + ret = rt_mutex_start_proxy_lock(&mtx, &rcu_boost_waiter, current,
> > 0);
> > + if (WARN_ON_ONCE(ret)) {
> > + put_task_struct(t);
> > + return 0; /* temporary stop boosting */
> > + }
> > +
> > + raw_spin_lock_irqsave(&rnp->lock, flags);
> > + if (&t->rcu_node_entry == rnp->exp_tasks ||
> > + &t->rcu_node_entry == rnp->boost_tasks) {
> > + t->rcu_boost_waiter = &rcu_boost_waiter;
> > + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > + } else {
> > + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > + rt_mutex_rcu_deboost_unlock(t, &rcu_boost_waiter);
> > + }
> > +
> > + put_task_struct(t);
> > + rt_mutex_finish_proxy_lock(&mtx, NULL, &rcu_boost_waiter, 0);
> >
> > return ACCESS_ONCE(rnp->exp_tasks) != NULL ||
> > ACCESS_ONCE(rnp->boost_tasks) != NULL;
> > diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> > index 0dd6aec..2f3caee 100644
> > --- a/kernel/rtmutex.c
> > +++ b/kernel/rtmutex.c
> > @@ -734,6 +734,43 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
> > rt_mutex_adjust_prio(current);
> > }
> >
> > +#ifdef CONFIG_RCU_BOOST
> > +/*
> > + * rt_mutex_rcu_deboost_unlock() - unlock in irq/bh/process context
> > + *
> > + * please revert the patch which introduces this function when
> > + * rt_mutex's ->wait_lock is irq-off.
> > + */
> > +void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
> > + struct rt_mutex_waiter *waiter)
> > +{
> > + unsigned long flags;
> > + struct rt_mutex *lock = waiter->lock;
> > +
> > + /*
> > + * The correction of the following code is based on
> > + * 1) current lock is owned by @owner
> > + * 2) only one task(@waiter->task) is waiting on the @lock
> > + * 3) the @waiter has been queued and keeps been queued
> > + */
> > + if (WARN_ON_ONCE(rt_mutex_owner(lock) != owner))
> > + return; /* 1) */
> > + if (WARN_ON_ONCE(rt_mutex_top_waiter(lock) != waiter))
> > + return; /* 2) & 3) */
> > + if (WARN_ON_ONCE(plist_node_empty(&waiter->pi_list_entry)))
> > + return; /* 2) & 3) */
> > +
> > + raw_spin_lock_irqsave(&owner->pi_lock, flags);
> > + plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
> > + lock->owner = NULL;
> > + raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
> > +
> > + wake_up_process(waiter->task);
> > + /* Undo pi boosting if necessary: */
> > + rt_mutex_adjust_prio(owner);
> > +}
> > +#endif /* #ifdef CONFIG_RCU_BOOST */
> > +
> > /*
> > * debug aware fast / slowpath lock,trylock,unlock
> > *
> > diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
> > index 53a66c8..3cdbe82 100644
> > --- a/kernel/rtmutex_common.h
> > +++ b/kernel/rtmutex_common.h
> > @@ -117,6 +117,11 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex
> > *lock,
> > struct rt_mutex_waiter *waiter,
> > int detect_deadlock);
> >
> > +#ifdef CONFIG_RCU_BOOST
> > +void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
> > + struct rt_mutex_waiter *waiter);
> > +#endif /* #ifdef CONFIG_RCU_BOOST */
> > +
> > #ifdef CONFIG_DEBUG_RT_MUTEXES
> > # include "rtmutex-debug.h"
> > #else
> >

2013-08-26 02:35:18

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On 08/26/2013 01:43 AM, Paul E. McKenney wrote:
> On Sun, Aug 25, 2013 at 11:19:37PM +0800, Lai Jiangshan wrote:
>> Hi, Steven
>>
>> Any comments about this patch?
>
> For whatever it is worth, it ran without incident for two hours worth
> of rcutorture on my P5 test (boosting but no CPU hotplug).
>
> Lai, do you have a specific test for this patch?

Also rcutorture.
(A special module is added to ensure all paths of my code are covered.)

> Your deadlock
> scenario looks plausible, but is apparently not occurring in the
> mainline kernel.

Yes, you can leave this possible bug until the real problem happens,
or just disallow overlapping.
I can write some debug code for it which would allow us to find
the problems earlier.

I guess this is a useful usage pattern of rcu:

again:
	rcu_read_lock();
	obj = rcu_dereference(ptr);
	spin_lock_XX(obj->lock);
	if (obj is invalid) {
		spin_unlock_XX(obj->lock);
		rcu_read_unlock();
		goto again;
	}
	rcu_read_unlock();
	# use obj
	spin_unlock_XX(obj->lock);

If we encourage this pattern, we should fix all the related problems.
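
For concreteness, here is a minimal sketch of this pattern with a bh-disabling
lock playing the role of @L. The struct and function names (myobj, use_obj,
ptr) are purely illustrative and do not come from any patch in this thread:

#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/errno.h>

struct myobj {
	spinlock_t lock;	/* @L: a softirq-context lock, taken with _bh */
	bool dead;		/* set under ->lock before the object is freed */
	int data;
};

static struct myobj __rcu *ptr;

static int use_obj(void)
{
	struct myobj *obj;

again:
	rcu_read_lock();
	obj = rcu_dereference(ptr);
	if (!obj) {
		rcu_read_unlock();
		return -ENOENT;
	}
	spin_lock_bh(&obj->lock);		/* spin_lock_XX(L) */
	if (obj->dead) {			/* "obj is invalid" */
		spin_unlock_bh(&obj->lock);
		rcu_read_unlock();
		goto again;
	}
	/*
	 * The dangerous unlock: if this task was preempted (and boosted)
	 * earlier in the critical section, rcu_read_unlock_special() ends
	 * up in rt_mutex_unlock() while @L is still held.
	 */
	rcu_read_unlock();
	obj->data++;				/* use obj */
	spin_unlock_bh(&obj->lock);
	return 0;
}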

Thanks,
Lai

>
> Thanx, Paul
>
>> Thanks,
>> Lai
>>
>>
>> On Fri, Aug 23, 2013 at 2:26 PM, Lai Jiangshan <[email protected]> wrote:
>>
>>> [PATCH] rcu/rt_mutex: eliminate a kind of deadlock for rcu read site
>>>
>>> Current rtmutex's lock->wait_lock doesn't disables softirq nor irq, it will
>>> cause rcu read site deadlock when rcu overlaps with any
>>> softirq-context/irq-context lock.
>>>
>>> @L is a spinlock of softirq or irq context.
>>>
>>> CPU1 cpu2(rcu boost)
>>> rcu_read_lock() rt_mutext_lock()
>>> <preemption and reschedule back> raw_spin_lock(lock->wait_lock)
>>> spin_lock_XX(L) <interrupt and doing softirq or
>>> irq>
>>> rcu_read_unlock() do_softirq()
>>> rcu_read_unlock_special()
>>> rt_mutext_unlock()
>>> raw_spin_lock(lock->wait_lock) spin_lock_XX(L) **DEADLOCK**
>>>
>>> This patch fixes this kind of deadlock by removing rt_mutext_unlock() from
>>> rcu_read_unlock(), new rt_mutex_rcu_deboost_unlock() is called instead.
>>> Thus rtmutex's lock->wait_lock will not be called from rcu_read_unlock().
>>>
>>> This patch does not eliminate all kinds of rcu-read-site deadlock,
>>> if @L is a scheduler lock, it will be deadlock, we should apply Paul's rule
>>> in this case.(avoid overlapping or preempt_disable()).
>>>
>>> rt_mutex_rcu_deboost_unlock() requires the @waiter is queued, so we
>>> can't directly call rt_mutex_lock(&mtx) in the rcu_boost thread,
>>> we split rt_mutex_lock(&mtx) into two steps just like pi-futex.
>>> This result a internal state in rcu_boost thread and cause
>>> rcu_boost thread a bit more complicated.
>>>
>>> Thanks
>>> Lai
>>>
>>> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
>>> index 5cd0f09..8830874 100644
>>> --- a/include/linux/init_task.h
>>> +++ b/include/linux/init_task.h
>>> @@ -102,7 +102,7 @@ extern struct group_info init_groups;
>>>
>>> #ifdef CONFIG_RCU_BOOST
>>> #define INIT_TASK_RCU_BOOST() \
>>> - .rcu_boost_mutex = NULL,
>>> + .rcu_boost_waiter = NULL,
>>> #else
>>> #define INIT_TASK_RCU_BOOST()
>>> #endif
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index e9995eb..1eca99f 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1078,7 +1078,7 @@ struct task_struct {
>>> struct rcu_node *rcu_blocked_node;
>>> #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
>>> #ifdef CONFIG_RCU_BOOST
>>> - struct rt_mutex *rcu_boost_mutex;
>>> + struct rt_mutex_waiter *rcu_boost_waiter;
>>> #endif /* #ifdef CONFIG_RCU_BOOST */
>>>
>>> #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
>>> @@ -1723,7 +1723,7 @@ static inline void rcu_copy_process(struct
>>> task_struct *p)
>>> p->rcu_blocked_node = NULL;
>>> #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
>>> #ifdef CONFIG_RCU_BOOST
>>> - p->rcu_boost_mutex = NULL;
>>> + p->rcu_boost_waiter = NULL;
>>> #endif /* #ifdef CONFIG_RCU_BOOST */
>>> INIT_LIST_HEAD(&p->rcu_node_entry);
>>> }
>>> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
>>> index 769e12e..d207ddd 100644
>>> --- a/kernel/rcutree_plugin.h
>>> +++ b/kernel/rcutree_plugin.h
>>> @@ -33,6 +33,7 @@
>>> #define RCU_KTHREAD_PRIO 1
>>>
>>> #ifdef CONFIG_RCU_BOOST
>>> +#include "rtmutex_common.h"
>>> #define RCU_BOOST_PRIO CONFIG_RCU_BOOST_PRIO
>>> #else
>>> #define RCU_BOOST_PRIO RCU_KTHREAD_PRIO
>>> @@ -340,7 +341,7 @@ void rcu_read_unlock_special(struct task_struct *t)
>>> unsigned long flags;
>>> struct list_head *np;
>>> #ifdef CONFIG_RCU_BOOST
>>> - struct rt_mutex *rbmp = NULL;
>>> + struct rt_mutex_waiter *waiter = NULL;
>>> #endif /* #ifdef CONFIG_RCU_BOOST */
>>> struct rcu_node *rnp;
>>> int special;
>>> @@ -397,10 +398,10 @@ void rcu_read_unlock_special(struct task_struct *t)
>>> #ifdef CONFIG_RCU_BOOST
>>> if (&t->rcu_node_entry == rnp->boost_tasks)
>>> rnp->boost_tasks = np;
>>> - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock
>>> held. */
>>> - if (t->rcu_boost_mutex) {
>>> - rbmp = t->rcu_boost_mutex;
>>> - t->rcu_boost_mutex = NULL;
>>> + /* Snapshot/clear ->rcu_boost_waiter with rcu_node lock
>>> held. */
>>> + if (t->rcu_boost_waiter) {
>>> + waiter = t->rcu_boost_waiter;
>>> + t->rcu_boost_waiter = NULL;
>>> }
>>> #endif /* #ifdef CONFIG_RCU_BOOST */
>>>
>>> @@ -426,8 +427,8 @@ void rcu_read_unlock_special(struct task_struct *t)
>>>
>>> #ifdef CONFIG_RCU_BOOST
>>> /* Unboost if we were boosted. */
>>> - if (rbmp)
>>> - rt_mutex_unlock(rbmp);
>>> + if (waiter)
>>> + rt_mutex_rcu_deboost_unlock(t, waiter);
>>> #endif /* #ifdef CONFIG_RCU_BOOST */
>>>
>>> /*
>>> @@ -1129,9 +1130,6 @@ void exit_rcu(void)
>>> #endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
>>>
>>> #ifdef CONFIG_RCU_BOOST
>>> -
>>> -#include "rtmutex_common.h"
>>> -
>>> #ifdef CONFIG_RCU_TRACE
>>>
>>> static void rcu_initiate_boost_trace(struct rcu_node *rnp)
>>> @@ -1181,14 +1179,15 @@ static int rcu_boost(struct rcu_node *rnp)
>>> {
>>> unsigned long flags;
>>> struct rt_mutex mtx;
>>> + struct rt_mutex_waiter rcu_boost_waiter;
>>> struct task_struct *t;
>>> struct list_head *tb;
>>> + int ret;
>>>
>>> if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL)
>>> return 0; /* Nothing left to boost. */
>>>
>>> raw_spin_lock_irqsave(&rnp->lock, flags);
>>> -
>>> /*
>>> * Recheck under the lock: all tasks in need of boosting
>>> * might exit their RCU read-side critical sections on their own.
>>> @@ -1215,7 +1214,7 @@ static int rcu_boost(struct rcu_node *rnp)
>>>
>>> /*
>>> * We boost task t by manufacturing an rt_mutex that appears to
>>> - * be held by task t. We leave a pointer to that rt_mutex where
>>> + * be held by task t. We leave a pointer to that rt_mutex_waiter
>>> where
>>> * task t can find it, and task t will release the mutex when it
>>> * exits its outermost RCU read-side critical section. Then
>>> * simply acquiring this artificial rt_mutex will boost task
>>> @@ -1230,11 +1229,30 @@ static int rcu_boost(struct rcu_node *rnp)
>>> * section.
>>> */
>>> t = container_of(tb, struct task_struct, rcu_node_entry);
>>> + get_task_struct(t);
>>> rt_mutex_init_proxy_locked(&mtx, t);
>>> - t->rcu_boost_mutex = &mtx;
>>> raw_spin_unlock_irqrestore(&rnp->lock, flags);
>>> - rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */
>>> - rt_mutex_unlock(&mtx); /* Keep lockdep happy. */
>>> +
>>> + debug_rt_mutex_init_waiter(&rcu_boost_waiter);
>>> + /* Side effect: boosts task t's priority. */
>>> + ret = rt_mutex_start_proxy_lock(&mtx, &rcu_boost_waiter, current,
>>> 0);
>>> + if (WARN_ON_ONCE(ret)) {
>>> + put_task_struct(t);
>>> + return 0; /* temporary stop boosting */
>>> + }
>>> +
>>> + raw_spin_lock_irqsave(&rnp->lock, flags);
>>> + if (&t->rcu_node_entry == rnp->exp_tasks ||
>>> + &t->rcu_node_entry == rnp->boost_tasks) {
>>> + t->rcu_boost_waiter = &rcu_boost_waiter;
>>> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
>>> + } else {
>>> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
>>> + rt_mutex_rcu_deboost_unlock(t, &rcu_boost_waiter);
>>> + }
>>> +
>>> + put_task_struct(t);
>>> + rt_mutex_finish_proxy_lock(&mtx, NULL, &rcu_boost_waiter, 0);
>>>
>>> return ACCESS_ONCE(rnp->exp_tasks) != NULL ||
>>> ACCESS_ONCE(rnp->boost_tasks) != NULL;
>>> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
>>> index 0dd6aec..2f3caee 100644
>>> --- a/kernel/rtmutex.c
>>> +++ b/kernel/rtmutex.c
>>> @@ -734,6 +734,43 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
>>> rt_mutex_adjust_prio(current);
>>> }
>>>
>>> +#ifdef CONFIG_RCU_BOOST
>>> +/*
>>> + * rt_mutex_rcu_deboost_unlock() - unlock in irq/bh/process context
>>> + *
>>> + * please revert the patch which introduces this function when
>>> + * rt_mutex's ->wait_lock is irq-off.
>>> + */
>>> +void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
>>> + struct rt_mutex_waiter *waiter)
>>> +{
>>> + unsigned long flags;
>>> + struct rt_mutex *lock = waiter->lock;
>>> +
>>> + /*
>>> + * The correction of the following code is based on
>>> + * 1) current lock is owned by @owner
>>> + * 2) only one task(@waiter->task) is waiting on the @lock
>>> + * 3) the @waiter has been queued and keeps been queued
>>> + */
>>> + if (WARN_ON_ONCE(rt_mutex_owner(lock) != owner))
>>> + return; /* 1) */
>>> + if (WARN_ON_ONCE(rt_mutex_top_waiter(lock) != waiter))
>>> + return; /* 2) & 3) */
>>> + if (WARN_ON_ONCE(plist_node_empty(&waiter->pi_list_entry)))
>>> + return; /* 2) & 3) */
>>> +
>>> + raw_spin_lock_irqsave(&owner->pi_lock, flags);
>>> + plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
>>> + lock->owner = NULL;
>>> + raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
>>> +
>>> + wake_up_process(waiter->task);
>>> + /* Undo pi boosting if necessary: */
>>> + rt_mutex_adjust_prio(owner);
>>> +}
>>> +#endif /* #ifdef CONFIG_RCU_BOOST */
>>> +
>>> /*
>>> * debug aware fast / slowpath lock,trylock,unlock
>>> *
>>> diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
>>> index 53a66c8..3cdbe82 100644
>>> --- a/kernel/rtmutex_common.h
>>> +++ b/kernel/rtmutex_common.h
>>> @@ -117,6 +117,11 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex
>>> *lock,
>>> struct rt_mutex_waiter *waiter,
>>> int detect_deadlock);
>>>
>>> +#ifdef CONFIG_RCU_BOOST
>>> +void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
>>> + struct rt_mutex_waiter *waiter);
>>> +#endif /* #ifdef CONFIG_RCU_BOOST */
>>> +
>>> #ifdef CONFIG_DEBUG_RT_MUTEXES
>>> # include "rtmutex-debug.h"
>>> #else
>>>

2013-08-30 02:05:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

On Mon, Aug 26, 2013 at 10:39:32AM +0800, Lai Jiangshan wrote:
> On 08/26/2013 01:43 AM, Paul E. McKenney wrote:
> > On Sun, Aug 25, 2013 at 11:19:37PM +0800, Lai Jiangshan wrote:
> >> Hi, Steven
> >>
> >> Any comments about this patch?
> >
> > For whatever it is worth, it ran without incident for two hours worth
> > of rcutorture on my P5 test (boosting but no CPU hotplug).
> >
> > Lai, do you have a specific test for this patch?
>
> Also rcutorture.
> (A special module is added to ensure all paths of my code are covered.)

OK, good! Could you please send along your rcutorture changes as well?

Also, it would be good to have Steven Rostedt's Acked-by or Reviewed-by.

> > Your deadlock
> > scenario looks plausible, but is apparently not occurring in the
> > mainline kernel.
>
> Yes, you can leave this possible bug until the real problem happens
> or just disallow overlapping.
> I can write some debug code for it which allow us find out
> the problems earlier.
>
> I guess this is an useful usage pattern of rcu:
>
> again:
> rcu_read_lock();
> obj = read_dereference(ptr);
> spin_lock_XX(obj->lock);
> if (obj is invalid) {
> spin_unlock_XX(obj->lock);
> rcu_read_unlock();
> goto again;
> }
> rcu_read_unlock();
> # use obj
> spin_unlock_XX(obj->lock);
>
> If we encourage this pattern, we should fix all the related problems.

Given that I have had to ask people to move away from this pattern,
it would be good to allow it to work. The transformation to currently
permitted usage is as follows, for whatever it is worth:

again:
	disable_XX();
	rcu_read_lock();
	obj = rcu_dereference(ptr);
	spin_lock(obj->lock);
	if (obj is invalid) {
		spin_unlock_XX(obj->lock);
		rcu_read_unlock();
		goto again;
	}
	rcu_read_unlock();
	# use obj
	spin_unlock_XX(obj->lock);

In mainline, this prevents preemption within the RCU read-side critical
section, avoiding the problem.
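
For example, using preempt_disable() as the disable_XX() step, and assuming
the same illustrative struct myobj { spinlock_t lock; bool dead; int data; }
and ptr as in the earlier sketch (balanced enables added for clarity):

static int use_obj_mainline(void)
{
	struct myobj *obj;

again:
	preempt_disable();			/* disable_XX() */
	rcu_read_lock();
	obj = rcu_dereference(ptr);
	if (!obj) {
		rcu_read_unlock();
		preempt_enable();
		return -ENOENT;
	}
	spin_lock_bh(&obj->lock);		/* still needs _bh for a softirq lock */
	if (obj->dead) {
		spin_unlock_bh(&obj->lock);
		rcu_read_unlock();
		preempt_enable();
		goto again;
	}
	/*
	 * Preemption was disabled across the whole read-side critical
	 * section, so this task cannot have blocked and been boosted,
	 * and this rcu_read_unlock() never calls rt_mutex_unlock().
	 */
	rcu_read_unlock();
	obj->data++;				/* use obj */
	spin_unlock_bh(&obj->lock);
	preempt_enable();
	return 0;
}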

That said, if we allow your original pattern, that would be even better!

Thanx, Paul

> Thanks,
> Lai
>
> >
> > Thanx, Paul
> >
> >> Thanks,
> >> Lai
> >>
> >>
> >> On Fri, Aug 23, 2013 at 2:26 PM, Lai Jiangshan <[email protected]> wrote:
> >>
> >>> [PATCH] rcu/rt_mutex: eliminate a kind of deadlock for rcu read site
> >>>
> >>> Current rtmutex's lock->wait_lock doesn't disables softirq nor irq, it will
> >>> cause rcu read site deadlock when rcu overlaps with any
> >>> softirq-context/irq-context lock.
> >>>
> >>> @L is a spinlock of softirq or irq context.
> >>>
> >>> CPU1 cpu2(rcu boost)
> >>> rcu_read_lock() rt_mutext_lock()
> >>> <preemption and reschedule back> raw_spin_lock(lock->wait_lock)
> >>> spin_lock_XX(L) <interrupt and doing softirq or
> >>> irq>
> >>> rcu_read_unlock() do_softirq()
> >>> rcu_read_unlock_special()
> >>> rt_mutext_unlock()
> >>> raw_spin_lock(lock->wait_lock) spin_lock_XX(L) **DEADLOCK**
> >>>
> >>> This patch fixes this kind of deadlock by removing rt_mutext_unlock() from
> >>> rcu_read_unlock(), new rt_mutex_rcu_deboost_unlock() is called instead.
> >>> Thus rtmutex's lock->wait_lock will not be called from rcu_read_unlock().
> >>>
> >>> This patch does not eliminate all kinds of rcu-read-site deadlock,
> >>> if @L is a scheduler lock, it will be deadlock, we should apply Paul's rule
> >>> in this case.(avoid overlapping or preempt_disable()).
> >>>
> >>> rt_mutex_rcu_deboost_unlock() requires the @waiter is queued, so we
> >>> can't directly call rt_mutex_lock(&mtx) in the rcu_boost thread,
> >>> we split rt_mutex_lock(&mtx) into two steps just like pi-futex.
> >>> This result a internal state in rcu_boost thread and cause
> >>> rcu_boost thread a bit more complicated.
> >>>
> >>> Thanks
> >>> Lai
> >>>
> >>> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> >>> index 5cd0f09..8830874 100644
> >>> --- a/include/linux/init_task.h
> >>> +++ b/include/linux/init_task.h
> >>> @@ -102,7 +102,7 @@ extern struct group_info init_groups;
> >>>
> >>> #ifdef CONFIG_RCU_BOOST
> >>> #define INIT_TASK_RCU_BOOST() \
> >>> - .rcu_boost_mutex = NULL,
> >>> + .rcu_boost_waiter = NULL,
> >>> #else
> >>> #define INIT_TASK_RCU_BOOST()
> >>> #endif
> >>> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >>> index e9995eb..1eca99f 100644
> >>> --- a/include/linux/sched.h
> >>> +++ b/include/linux/sched.h
> >>> @@ -1078,7 +1078,7 @@ struct task_struct {
> >>> struct rcu_node *rcu_blocked_node;
> >>> #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> >>> #ifdef CONFIG_RCU_BOOST
> >>> - struct rt_mutex *rcu_boost_mutex;
> >>> + struct rt_mutex_waiter *rcu_boost_waiter;
> >>> #endif /* #ifdef CONFIG_RCU_BOOST */
> >>>
> >>> #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> >>> @@ -1723,7 +1723,7 @@ static inline void rcu_copy_process(struct
> >>> task_struct *p)
> >>> p->rcu_blocked_node = NULL;
> >>> #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> >>> #ifdef CONFIG_RCU_BOOST
> >>> - p->rcu_boost_mutex = NULL;
> >>> + p->rcu_boost_waiter = NULL;
> >>> #endif /* #ifdef CONFIG_RCU_BOOST */
> >>> INIT_LIST_HEAD(&p->rcu_node_entry);
> >>> }
> >>> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> >>> index 769e12e..d207ddd 100644
> >>> --- a/kernel/rcutree_plugin.h
> >>> +++ b/kernel/rcutree_plugin.h
> >>> @@ -33,6 +33,7 @@
> >>> #define RCU_KTHREAD_PRIO 1
> >>>
> >>> #ifdef CONFIG_RCU_BOOST
> >>> +#include "rtmutex_common.h"
> >>> #define RCU_BOOST_PRIO CONFIG_RCU_BOOST_PRIO
> >>> #else
> >>> #define RCU_BOOST_PRIO RCU_KTHREAD_PRIO
> >>> @@ -340,7 +341,7 @@ void rcu_read_unlock_special(struct task_struct *t)
> >>> unsigned long flags;
> >>> struct list_head *np;
> >>> #ifdef CONFIG_RCU_BOOST
> >>> - struct rt_mutex *rbmp = NULL;
> >>> + struct rt_mutex_waiter *waiter = NULL;
> >>> #endif /* #ifdef CONFIG_RCU_BOOST */
> >>> struct rcu_node *rnp;
> >>> int special;
> >>> @@ -397,10 +398,10 @@ void rcu_read_unlock_special(struct task_struct *t)
> >>> #ifdef CONFIG_RCU_BOOST
> >>> if (&t->rcu_node_entry == rnp->boost_tasks)
> >>> rnp->boost_tasks = np;
> >>> - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock
> >>> held. */
> >>> - if (t->rcu_boost_mutex) {
> >>> - rbmp = t->rcu_boost_mutex;
> >>> - t->rcu_boost_mutex = NULL;
> >>> + /* Snapshot/clear ->rcu_boost_waiter with rcu_node lock
> >>> held. */
> >>> + if (t->rcu_boost_waiter) {
> >>> + waiter = t->rcu_boost_waiter;
> >>> + t->rcu_boost_waiter = NULL;
> >>> }
> >>> #endif /* #ifdef CONFIG_RCU_BOOST */
> >>>
> >>> @@ -426,8 +427,8 @@ void rcu_read_unlock_special(struct task_struct *t)
> >>>
> >>> #ifdef CONFIG_RCU_BOOST
> >>> /* Unboost if we were boosted. */
> >>> - if (rbmp)
> >>> - rt_mutex_unlock(rbmp);
> >>> + if (waiter)
> >>> + rt_mutex_rcu_deboost_unlock(t, waiter);
> >>> #endif /* #ifdef CONFIG_RCU_BOOST */
> >>>
> >>> /*
> >>> @@ -1129,9 +1130,6 @@ void exit_rcu(void)
> >>> #endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
> >>>
> >>> #ifdef CONFIG_RCU_BOOST
> >>> -
> >>> -#include "rtmutex_common.h"
> >>> -
> >>> #ifdef CONFIG_RCU_TRACE
> >>>
> >>> static void rcu_initiate_boost_trace(struct rcu_node *rnp)
> >>> @@ -1181,14 +1179,15 @@ static int rcu_boost(struct rcu_node *rnp)
> >>> {
> >>> unsigned long flags;
> >>> struct rt_mutex mtx;
> >>> + struct rt_mutex_waiter rcu_boost_waiter;
> >>> struct task_struct *t;
> >>> struct list_head *tb;
> >>> + int ret;
> >>>
> >>> if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL)
> >>> return 0; /* Nothing left to boost. */
> >>>
> >>> raw_spin_lock_irqsave(&rnp->lock, flags);
> >>> -
> >>> /*
> >>> * Recheck under the lock: all tasks in need of boosting
> >>> * might exit their RCU read-side critical sections on their own.
> >>> @@ -1215,7 +1214,7 @@ static int rcu_boost(struct rcu_node *rnp)
> >>>
> >>> /*
> >>> * We boost task t by manufacturing an rt_mutex that appears to
> >>> - * be held by task t. We leave a pointer to that rt_mutex where
> >>> + * be held by task t. We leave a pointer to that rt_mutex_waiter
> >>> where
> >>> * task t can find it, and task t will release the mutex when it
> >>> * exits its outermost RCU read-side critical section. Then
> >>> * simply acquiring this artificial rt_mutex will boost task
> >>> @@ -1230,11 +1229,30 @@ static int rcu_boost(struct rcu_node *rnp)
> >>> * section.
> >>> */
> >>> t = container_of(tb, struct task_struct, rcu_node_entry);
> >>> + get_task_struct(t);
> >>> rt_mutex_init_proxy_locked(&mtx, t);
> >>> - t->rcu_boost_mutex = &mtx;
> >>> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >>> - rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */
> >>> - rt_mutex_unlock(&mtx); /* Keep lockdep happy. */
> >>> +
> >>> + debug_rt_mutex_init_waiter(&rcu_boost_waiter);
> >>> + /* Side effect: boosts task t's priority. */
> >>> + ret = rt_mutex_start_proxy_lock(&mtx, &rcu_boost_waiter, current,
> >>> 0);
> >>> + if (WARN_ON_ONCE(ret)) {
> >>> + put_task_struct(t);
> >>> + return 0; /* temporary stop boosting */
> >>> + }
> >>> +
> >>> + raw_spin_lock_irqsave(&rnp->lock, flags);
> >>> + if (&t->rcu_node_entry == rnp->exp_tasks ||
> >>> + &t->rcu_node_entry == rnp->boost_tasks) {
> >>> + t->rcu_boost_waiter = &rcu_boost_waiter;
> >>> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >>> + } else {
> >>> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >>> + rt_mutex_rcu_deboost_unlock(t, &rcu_boost_waiter);
> >>> + }
> >>> +
> >>> + put_task_struct(t);
> >>> + rt_mutex_finish_proxy_lock(&mtx, NULL, &rcu_boost_waiter, 0);
> >>>
> >>> return ACCESS_ONCE(rnp->exp_tasks) != NULL ||
> >>> ACCESS_ONCE(rnp->boost_tasks) != NULL;
> >>> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> >>> index 0dd6aec..2f3caee 100644
> >>> --- a/kernel/rtmutex.c
> >>> +++ b/kernel/rtmutex.c
> >>> @@ -734,6 +734,43 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
> >>> rt_mutex_adjust_prio(current);
> >>> }
> >>>
> >>> +#ifdef CONFIG_RCU_BOOST
> >>> +/*
> >>> + * rt_mutex_rcu_deboost_unlock() - unlock in irq/bh/process context
> >>> + *
> >>> + * please revert the patch which introduces this function when
> >>> + * rt_mutex's ->wait_lock is irq-off.
> >>> + */
> >>> +void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
> >>> + struct rt_mutex_waiter *waiter)
> >>> +{
> >>> + unsigned long flags;
> >>> + struct rt_mutex *lock = waiter->lock;
> >>> +
> >>> + /*
> >>> + * The correction of the following code is based on
> >>> + * 1) current lock is owned by @owner
> >>> + * 2) only one task(@waiter->task) is waiting on the @lock
> >>> + * 3) the @waiter has been queued and keeps been queued
> >>> + */
> >>> + if (WARN_ON_ONCE(rt_mutex_owner(lock) != owner))
> >>> + return; /* 1) */
> >>> + if (WARN_ON_ONCE(rt_mutex_top_waiter(lock) != waiter))
> >>> + return; /* 2) & 3) */
> >>> + if (WARN_ON_ONCE(plist_node_empty(&waiter->pi_list_entry)))
> >>> + return; /* 2) & 3) */
> >>> +
> >>> + raw_spin_lock_irqsave(&owner->pi_lock, flags);
> >>> + plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
> >>> + lock->owner = NULL;
> >>> + raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
> >>> +
> >>> + wake_up_process(waiter->task);
> >>> + /* Undo pi boosting if necessary: */
> >>> + rt_mutex_adjust_prio(owner);
> >>> +}
> >>> +#endif /* #ifdef CONFIG_RCU_BOOST */
> >>> +
> >>> /*
> >>> * debug aware fast / slowpath lock,trylock,unlock
> >>> *
> >>> diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
> >>> index 53a66c8..3cdbe82 100644
> >>> --- a/kernel/rtmutex_common.h
> >>> +++ b/kernel/rtmutex_common.h
> >>> @@ -117,6 +117,11 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex
> >>> *lock,
> >>> struct rt_mutex_waiter *waiter,
> >>> int detect_deadlock);
> >>>
> >>> +#ifdef CONFIG_RCU_BOOST
> >>> +void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
> >>> + struct rt_mutex_waiter *waiter);
> >>> +#endif /* #ifdef CONFIG_RCU_BOOST */
> >>> +
> >>> #ifdef CONFIG_DEBUG_RT_MUTEXES
> >>> # include "rtmutex-debug.h"
> >>> #else
> >>>

2013-09-05 15:23:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site


Sorry for taking so long to review. So many other things to do :-/

On Fri, 23 Aug 2013 14:26:39 +0800
Lai Jiangshan <[email protected]> wrote:

> [PATCH] rcu/rt_mutex: eliminate a kind of deadlock for rcu read site

"rcu read site"?

This is specific to boosting, thus boosting should be in the subject,
perhaps something like:

"Eliminate deadlock due to rcu boosting"

?

>
> Current rtmutex's lock->wait_lock doesn't disables softirq nor irq, it will
> cause rcu read site deadlock when rcu overlaps with any softirq-context/irq-context lock.
>
> @L is a spinlock of softirq or irq context.
>
> CPU1 cpu2(rcu boost)
> rcu_read_lock() rt_mutext_lock()
> <preemption and reschedule back> raw_spin_lock(lock->wait_lock)
> spin_lock_XX(L) <interrupt and doing softirq or irq>
> rcu_read_unlock() do_softirq()
> rcu_read_unlock_special()
> rt_mutext_unlock()
> raw_spin_lock(lock->wait_lock) spin_lock_XX(L) **DEADLOCK**
>
> This patch fixes this kind of deadlock by removing rt_mutext_unlock() from
> rcu_read_unlock(), new rt_mutex_rcu_deboost_unlock() is called instead.
> Thus rtmutex's lock->wait_lock will not be called from rcu_read_unlock().
>
> This patch does not eliminate all kinds of rcu-read-site deadlock,
> if @L is a scheduler lock, it will be deadlock, we should apply Paul's rule
> in this case.(avoid overlapping or preempt_disable()).
>
> rt_mutex_rcu_deboost_unlock() requires the @waiter is queued, so we
> can't directly call rt_mutex_lock(&mtx) in the rcu_boost thread,
> we split rt_mutex_lock(&mtx) into two steps just like pi-futex.
> This result a internal state in rcu_boost thread and cause
> rcu_boost thread a bit more complicated.
>
> Thanks
> Lai
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 5cd0f09..8830874 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -102,7 +102,7 @@ extern struct group_info init_groups;
>
> #ifdef CONFIG_RCU_BOOST
> #define INIT_TASK_RCU_BOOST() \
> - .rcu_boost_mutex = NULL,
> + .rcu_boost_waiter = NULL,
> #else
> #define INIT_TASK_RCU_BOOST()
> #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e9995eb..1eca99f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1078,7 +1078,7 @@ struct task_struct {
> struct rcu_node *rcu_blocked_node;
> #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> #ifdef CONFIG_RCU_BOOST
> - struct rt_mutex *rcu_boost_mutex;
> + struct rt_mutex_waiter *rcu_boost_waiter;
> #endif /* #ifdef CONFIG_RCU_BOOST */
>
> #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> @@ -1723,7 +1723,7 @@ static inline void rcu_copy_process(struct task_struct *p)
> p->rcu_blocked_node = NULL;
> #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
> #ifdef CONFIG_RCU_BOOST
> - p->rcu_boost_mutex = NULL;
> + p->rcu_boost_waiter = NULL;
> #endif /* #ifdef CONFIG_RCU_BOOST */
> INIT_LIST_HEAD(&p->rcu_node_entry);
> }
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index 769e12e..d207ddd 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -33,6 +33,7 @@
> #define RCU_KTHREAD_PRIO 1
>
> #ifdef CONFIG_RCU_BOOST
> +#include "rtmutex_common.h"
> #define RCU_BOOST_PRIO CONFIG_RCU_BOOST_PRIO
> #else
> #define RCU_BOOST_PRIO RCU_KTHREAD_PRIO
> @@ -340,7 +341,7 @@ void rcu_read_unlock_special(struct task_struct *t)
> unsigned long flags;
> struct list_head *np;
> #ifdef CONFIG_RCU_BOOST
> - struct rt_mutex *rbmp = NULL;
> + struct rt_mutex_waiter *waiter = NULL;
> #endif /* #ifdef CONFIG_RCU_BOOST */
> struct rcu_node *rnp;
> int special;
> @@ -397,10 +398,10 @@ void rcu_read_unlock_special(struct task_struct *t)
> #ifdef CONFIG_RCU_BOOST
> if (&t->rcu_node_entry == rnp->boost_tasks)
> rnp->boost_tasks = np;
> - /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */
> - if (t->rcu_boost_mutex) {
> - rbmp = t->rcu_boost_mutex;
> - t->rcu_boost_mutex = NULL;
> + /* Snapshot/clear ->rcu_boost_waiter with rcu_node lock held. */
> + if (t->rcu_boost_waiter) {
> + waiter = t->rcu_boost_waiter;
> + t->rcu_boost_waiter = NULL;
> }
> #endif /* #ifdef CONFIG_RCU_BOOST */
>
> @@ -426,8 +427,8 @@ void rcu_read_unlock_special(struct task_struct *t)
>
> #ifdef CONFIG_RCU_BOOST
> /* Unboost if we were boosted. */
> - if (rbmp)
> - rt_mutex_unlock(rbmp);
> + if (waiter)
> + rt_mutex_rcu_deboost_unlock(t, waiter);
> #endif /* #ifdef CONFIG_RCU_BOOST */
>
> /*
> @@ -1129,9 +1130,6 @@ void exit_rcu(void)
> #endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
>
> #ifdef CONFIG_RCU_BOOST
> -
> -#include "rtmutex_common.h"
> -
> #ifdef CONFIG_RCU_TRACE
>
> static void rcu_initiate_boost_trace(struct rcu_node *rnp)
> @@ -1181,14 +1179,15 @@ static int rcu_boost(struct rcu_node *rnp)
> {
> unsigned long flags;
> struct rt_mutex mtx;
> + struct rt_mutex_waiter rcu_boost_waiter;
> struct task_struct *t;
> struct list_head *tb;
> + int ret;
>
> if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL)
> return 0; /* Nothing left to boost. */
>
> raw_spin_lock_irqsave(&rnp->lock, flags);
> -
> /*
> * Recheck under the lock: all tasks in need of boosting
> * might exit their RCU read-side critical sections on their own.
> @@ -1215,7 +1214,7 @@ static int rcu_boost(struct rcu_node *rnp)
>
> /*
> * We boost task t by manufacturing an rt_mutex that appears to
> - * be held by task t. We leave a pointer to that rt_mutex where
> + * be held by task t. We leave a pointer to that rt_mutex_waiter where
> * task t can find it, and task t will release the mutex when it
> * exits its outermost RCU read-side critical section. Then
> * simply acquiring this artificial rt_mutex will boost task
> @@ -1230,11 +1229,30 @@ static int rcu_boost(struct rcu_node *rnp)
> * section.
> */
> t = container_of(tb, struct task_struct, rcu_node_entry);
> + get_task_struct(t);
> rt_mutex_init_proxy_locked(&mtx, t);
> - t->rcu_boost_mutex = &mtx;
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> - rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */
> - rt_mutex_unlock(&mtx); /* Keep lockdep happy. */
> +
> + debug_rt_mutex_init_waiter(&rcu_boost_waiter);
> + /* Side effect: boosts task t's priority. */
> + ret = rt_mutex_start_proxy_lock(&mtx, &rcu_boost_waiter, current, 0);
> + if (WARN_ON_ONCE(ret)) {
> + put_task_struct(t);
> + return 0; /* temporary stop boosting */
> + }
> +
> + raw_spin_lock_irqsave(&rnp->lock, flags);
> + if (&t->rcu_node_entry == rnp->exp_tasks ||
> + &t->rcu_node_entry == rnp->boost_tasks) {
> + t->rcu_boost_waiter = &rcu_boost_waiter;
> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + } else {
> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + rt_mutex_rcu_deboost_unlock(t, &rcu_boost_waiter);
> + }
> +
> + put_task_struct(t);
> + rt_mutex_finish_proxy_lock(&mtx, NULL, &rcu_boost_waiter, 0);
>
> return ACCESS_ONCE(rnp->exp_tasks) != NULL ||
> ACCESS_ONCE(rnp->boost_tasks) != NULL;
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 0dd6aec..2f3caee 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -734,6 +734,43 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
> rt_mutex_adjust_prio(current);
> }
>
> +#ifdef CONFIG_RCU_BOOST
> +/*
> + * rt_mutex_rcu_deboost_unlock() - unlock in irq/bh/process context
> + *
> + * please revert the patch which introduces this function when
> + * rt_mutex's ->wait_lock is irq-off.

I don't think we ever want wait_lock to disable interrupts. Doing so
for just rcu boosting is not enough IMO.

Please remove that comment.

Honestly, I like this solution better than the original :-) It only
uses the pi boosting and not the rest of the rt_mutex, which is really
just overhead.

Looks good to me. Other than what I already commented:

Reviewed-by: Steven Rostedt <[email protected]>

-- Steve

> + */
> +void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
> + struct rt_mutex_waiter *waiter)
> +{
> + unsigned long flags;
> + struct rt_mutex *lock = waiter->lock;
> +
> + /*
> + * The correction of the following code is based on
> + * 1) current lock is owned by @owner
> + * 2) only one task(@waiter->task) is waiting on the @lock
> + * 3) the @waiter has been queued and keeps been queued

"keeps been queued"? Do you mean "keeps being queued"?

> + */
> + if (WARN_ON_ONCE(rt_mutex_owner(lock) != owner))
> + return; /* 1) */
> + if (WARN_ON_ONCE(rt_mutex_top_waiter(lock) != waiter))
> + return; /* 2) & 3) */
> + if (WARN_ON_ONCE(plist_node_empty(&waiter->pi_list_entry)))
> + return; /* 2) & 3) */
> +
> + raw_spin_lock_irqsave(&owner->pi_lock, flags);
> + plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
> + lock->owner = NULL;
> + raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
> +
> + wake_up_process(waiter->task);
> + /* Undo pi boosting if necessary: */
> + rt_mutex_adjust_prio(owner);
> +}
> +#endif /* #ifdef CONFIG_RCU_BOOST */
> +
> /*
> * debug aware fast / slowpath lock,trylock,unlock
> *
> diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
> index 53a66c8..3cdbe82 100644
> --- a/kernel/rtmutex_common.h
> +++ b/kernel/rtmutex_common.h
> @@ -117,6 +117,11 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
> struct rt_mutex_waiter *waiter,
> int detect_deadlock);
>
> +#ifdef CONFIG_RCU_BOOST
> +void rt_mutex_rcu_deboost_unlock(struct task_struct *owner,
> + struct rt_mutex_waiter *waiter);
> +#endif /* #ifdef CONFIG_RCU_BOOST */
> +
> #ifdef CONFIG_DEBUG_RT_MUTEXES
> # include "rtmutex-debug.h"
> #else

2013-10-30 11:02:52

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 1/8] rcu: add a warn to rcu_preempt_note_context_switch()

On Wed, Aug 07, 2013 at 06:24:57PM +0800, Lai Jiangshan wrote:
> It is expected that _nesting == INT_MIN if _nesting < 0.
> Add a warning to it if something unexpected happen.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> kernel/rcutree_plugin.h | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index 63098a5..8fd947e 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -243,6 +243,7 @@ static void rcu_preempt_note_context_switch(int cpu)
> : rnp->gpnum + 1);
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> } else if (t->rcu_read_lock_nesting < 0 &&
> + !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&

Finally getting back to this...

From what I can see, this is safe right now because
->rcu_read_lock_nesting is incremented only in case of an interrupt, NMI,
or softirq interrupting the rcu_read_unlock() code path on the one hand,
and because the functions called from rcu_read_unlock_special() currently
disable interrupts before doing any rcu_read_lock()s. With this in
mind, it is currently impossible to have a context switch occur within
an RCU read-side critical section that is invoked (either directly or
indirectly) from the portion of __rcu_read_unlock() that has negative
->rcu_read_lock_nesting.
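
For reference, the negative-nesting window in question comes from
__rcu_read_unlock(); roughly (paraphrased from the 3.11-era kernel/rcupdate.c,
not an exact quote of the source):

void __rcu_read_unlock(void)
{
	struct task_struct *t = current;

	if (t->rcu_read_lock_nesting != 1) {
		--t->rcu_read_lock_nesting;	/* nested: just decrement */
	} else {
		barrier();	/* critical section before exit code */
		t->rcu_read_lock_nesting = INT_MIN;	/* window opens */
		barrier();	/* assign before ->rcu_read_unlock_special load */
		if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
			rcu_read_unlock_special(t);	/* deboost, report QS */
		barrier();	/* ->rcu_read_unlock_special load before assign */
		t->rcu_read_lock_nesting = 0;		/* window closes */
	}
}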

But this could change if any part of the rt_mutex_unlock() code
called from rcu_read_unlock_special() were to do rcu_read_lock() before
disabling interrupts. Is there any reason we should prohibit such a
pattern in rt_mutex_unlock()?

(For the record, I am currently OK prohibiting this pattern in
rcu_report_exp_rnp(), which is also called from rcu_read_unlock_special()
-- it seems unlikely that someone would use RCU to protect RCU's
expedited-grace-period data structures.)

Thanx, Paul


> t->rcu_read_unlock_special) {
>
> /*
> --
> 1.7.4.4
>

2013-10-30 11:19:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/8] rcu: remove irq/softirq context check in rcu_read_unlock_special()

On Wed, Aug 07, 2013 at 06:24:58PM +0800, Lai Jiangshan wrote:
> After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
> when preemption)
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> kernel/rcutree_plugin.h | 6 ------
> 1 files changed, 0 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index 8fd947e..54f7e45 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -361,12 +361,6 @@ void rcu_read_unlock_special(struct task_struct *t)
> rcu_preempt_qs(smp_processor_id());
> }
>
> - /* Hardware IRQ handlers cannot block. */
> - if (in_irq() || in_serving_softirq()) {
> - local_irq_restore(flags);
> - return;
> - }
> -

Good point, it is time to relax the redundant checking. Paranoid that
I am, I took an intermediate position, wrapping a WARN_ON_ONCE() around
the check as follows:

	if (WARN_ON_ONCE(in_irq() || in_serving_softirq())) {
		local_irq_restore(flags);
		return;
	}

If this warning never triggers over a period of some time, we can remove
it entirely.

I have queued this for 3.14 with your Signed-off-by. Please let me know
if you have any objections.

Thanx, Paul

> /* Clean up if blocked during RCU read-side critical section. */
> if (special & RCU_READ_UNLOCK_BLOCKED) {
> t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> --
> 1.7.4.4
>