one of my boxes boots with "threadirqs" and since commit 05f415715ce45
("rcu: Speed up expedited GPs when interrupting RCU reader") I run
reliably into the following deadlock:
| ============================================
| WARNING: possible recursive locking detected
| 5.2.0-rc6 #279 Not tainted
| --------------------------------------------
| (cron)/2109 is trying to acquire lock:
| 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
|
| but task is already holding lock:
| 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
|
| other info that might help us debug this:
| Possible unsafe locking scenario:
|
| CPU0
| ----
| lock(&p->pi_lock);
| lock(&p->pi_lock);
|
| *** DEADLOCK ***
|
| May be due to missing lock nesting notation
|
| 4 locks held by (cron)/2109:
| #0: 00000000c0ae63d9 (&sb->s_type->i_mutex_key){++++}, at: iterate_dir+0x3d/0x170
| #1: 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
| #2: 00000000f62f14cf (&rq->lock){-.-.}, at: try_to_wake_up+0x209/0x700
| #3: 000000000d32568e (rcu_read_lock){....}, at: cpuacct_charge+0x37/0x1e0
|
| stack backtrace:
| CPU: 3 PID: 2109 Comm: (cron) Not tainted 5.2.0-rc6 #279
| Call Trace:
| <IRQ>
| dump_stack+0x67/0x90
| __lock_acquire.cold.63+0x142/0x23a
| lock_acquire+0x9b/0x1a0
| ? try_to_wake_up+0x37/0x700
| _raw_spin_lock_irqsave+0x33/0x50
| ? try_to_wake_up+0x37/0x700
| try_to_wake_up+0x37/0x700
wake up ksoftirqd
| rcu_read_unlock_special+0x61/0xa0
| __rcu_read_unlock+0x58/0x60
| cpuacct_charge+0xeb/0x1e0
| update_curr+0x15d/0x350
| enqueue_entity+0x115/0x7e0
| enqueue_task_fair+0x78/0x450
| activate_task+0x41/0x90
| ttwu_do_activate+0x49/0x80
| try_to_wake_up+0x23f/0x700
wake up ksoftirqd
| irq_exit+0xba/0xc0
| smp_apic_timer_interrupt+0xb2/0x2a0
| apic_timer_interrupt+0xf/0x20
| </IRQ>
based on the commit it seems the problem was always there, but now the
mix of raise_softirq_irqoff() and set_tsk_need_resched() seems to hit
the window quite reliably. Replacing it with
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 1102765f91fd1..baab36f4d0f45 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -627,14 +627,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
if (preempt_bh_were_disabled || irqs_were_disabled) {
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
/* Need to defer quiescent state until everything is enabled. */
- if (irqs_were_disabled) {
- /* Enabling irqs does not reschedule, so... */
- raise_softirq_irqoff(RCU_SOFTIRQ);
- } else {
- /* Enabling BH or preempt does reschedule, so... */
- set_tsk_need_resched(current);
- set_preempt_need_resched();
- }
+ raise_softirq_irqoff(RCU_SOFTIRQ);
local_irq_restore(flags);
return;
}
will make it go away.
Any suggestions?
Sebastian
On Wed, Jun 26, 2019 at 03:54:47PM +0200, Sebastian Andrzej Siewior wrote:
> one of my boxes boots with "threadirqs" and since commit 05f415715ce45
> ("rcu: Speed up expedited GPs when interrupting RCU reader") I run
> reliably into the following deadlock:
>
> | ============================================
> | WARNING: possible recursive locking detected
> | 5.2.0-rc6 #279 Not tainted
> | --------------------------------------------
> | (cron)/2109 is trying to acquire lock:
> | 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> |
> | but task is already holding lock:
> | 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> |
> | other info that might help us debug this:
> | Possible unsafe locking scenario:
> |
> | CPU0
> | ----
> | lock(&p->pi_lock);
> | lock(&p->pi_lock);
> |
> | *** DEADLOCK ***
> |
> | May be due to missing lock nesting notation
> |
> | 4 locks held by (cron)/2109:
> | #0: 00000000c0ae63d9 (&sb->s_type->i_mutex_key){++++}, at: iterate_dir+0x3d/0x170
> | #1: 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> | #2: 00000000f62f14cf (&rq->lock){-.-.}, at: try_to_wake_up+0x209/0x700
> | #3: 000000000d32568e (rcu_read_lock){....}, at: cpuacct_charge+0x37/0x1e0
> |
> | stack backtrace:
> | CPU: 3 PID: 2109 Comm: (cron) Not tainted 5.2.0-rc6 #279
> | Call Trace:
> | <IRQ>
> | dump_stack+0x67/0x90
> | __lock_acquire.cold.63+0x142/0x23a
> | lock_acquire+0x9b/0x1a0
> | ? try_to_wake_up+0x37/0x700
> | _raw_spin_lock_irqsave+0x33/0x50
> | ? try_to_wake_up+0x37/0x700
> | try_to_wake_up+0x37/0x700
> wake up ksoftirqd
>
> | rcu_read_unlock_special+0x61/0xa0
> | __rcu_read_unlock+0x58/0x60
> | cpuacct_charge+0xeb/0x1e0
> | update_curr+0x15d/0x350
> | enqueue_entity+0x115/0x7e0
> | enqueue_task_fair+0x78/0x450
> | activate_task+0x41/0x90
> | ttwu_do_activate+0x49/0x80
> | try_to_wake_up+0x23f/0x700
>
> wake up ksoftirqd
>
> | irq_exit+0xba/0xc0
> | smp_apic_timer_interrupt+0xb2/0x2a0
> | apic_timer_interrupt+0xf/0x20
> | </IRQ>
>
> based on the commit it seems the problem was always there, but now the
> mix of raise_softirq_irqoff() and set_tsk_need_resched() seems to hit
> the window quite reliably. Replacing it with
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 1102765f91fd1..baab36f4d0f45 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -627,14 +627,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> if (preempt_bh_were_disabled || irqs_were_disabled) {
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> /* Need to defer quiescent state until everything is enabled. */
> - if (irqs_were_disabled) {
> - /* Enabling irqs does not reschedule, so... */
> - raise_softirq_irqoff(RCU_SOFTIRQ);
> - } else {
> - /* Enabling BH or preempt does reschedule, so... */
> - set_tsk_need_resched(current);
> - set_preempt_need_resched();
> - }
> + raise_softirq_irqoff(RCU_SOFTIRQ);
> local_irq_restore(flags);
> return;
> }
>
> will make it go away.
Color me confused. Neither set_tsk_need_resched() nor
set_preempt_need_resched() acquire locks or do wakeups.
Yet raise_softirq_irqoff() can do a wakeup if not called
from hardirq/softirq/NMI context, so I would instead expect
raise_softirq_irqoff() to be the source of troubles when
interrupts are threaded.
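For reference, the wakeup in question (kernel/softirq.c of that era,
lightly trimmed):

inline void raise_softirq_irqoff(unsigned int nr)
{
	__raise_softirq_irqoff(nr);

	/*
	 * If we're in an interrupt or softirq, we're done (this also
	 * catches softirq-disabled code).  Otherwise wake ksoftirqd
	 * to make sure the softirq runs soon.
	 */
	if (!in_interrupt())
		wakeup_softirqd();
}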
What am I missing here?
> Any suggestions?
Does something like IRQ work help? Please see -rcu commit 0864f057b050
("rcu: Use irq_work to get scheduler's attention in clean context")
for one way of doing this. Perhaps in combination with -rcu commit
a69987a515c8 ("rcu: Simplify rcu_read_unlock_special() deferred wakeups").
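Roughly, the first of those commits has the unlock path do
init_irq_work() / irq_work_queue_on() on the current CPU; the irq_work
interrupt itself is what gets the scheduler's attention, and its
handler just clears the pending flag:

static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
{
	struct rcu_data *rdp;

	rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
	rdp->defer_qs_iw_pending = false;
}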
Thanx, Paul
On 2019-06-26 09:25:58 [-0700], Paul E. McKenney wrote:
> On Wed, Jun 26, 2019 at 03:54:47PM +0200, Sebastian Andrzej Siewior wrote:
> > one of my boxes boots with "threadirqs" and since commit 05f415715ce45
> > ("rcu: Speed up expedited GPs when interrupting RCU reader") I run
> > reliably into the following deadlock:
> >
> > | ============================================
> > | WARNING: possible recursive locking detected
> > | 5.2.0-rc6 #279 Not tainted
> > | --------------------------------------------
> > | (cron)/2109 is trying to acquire lock:
> > | 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> > |
> > | but task is already holding lock:
> > | 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> > |
> > | other info that might help us debug this:
> > | Possible unsafe locking scenario:
> > |
> > | CPU0
> > | ----
> > | lock(&p->pi_lock);
> > | lock(&p->pi_lock);
> > |
> > | *** DEADLOCK ***
> > |
> > | May be due to missing lock nesting notation
> > |
> > | 4 locks held by (cron)/2109:
> > | #0: 00000000c0ae63d9 (&sb->s_type->i_mutex_key){++++}, at: iterate_dir+0x3d/0x170
> > | #1: 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> > | #2: 00000000f62f14cf (&rq->lock){-.-.}, at: try_to_wake_up+0x209/0x700
> > | #3: 000000000d32568e (rcu_read_lock){....}, at: cpuacct_charge+0x37/0x1e0
> > |
> > | stack backtrace:
> > | CPU: 3 PID: 2109 Comm: (cron) Not tainted 5.2.0-rc6 #279
> > | Call Trace:
> > | <IRQ>
> > | dump_stack+0x67/0x90
> > | __lock_acquire.cold.63+0x142/0x23a
> > | lock_acquire+0x9b/0x1a0
> > | ? try_to_wake_up+0x37/0x700
> > | _raw_spin_lock_irqsave+0x33/0x50
> > | ? try_to_wake_up+0x37/0x700
> > | try_to_wake_up+0x37/0x700
> > wake up ksoftirqd
> >
> > | rcu_read_unlock_special+0x61/0xa0
> > | __rcu_read_unlock+0x58/0x60
> > | cpuacct_charge+0xeb/0x1e0
> > | update_curr+0x15d/0x350
> > | enqueue_entity+0x115/0x7e0
> > | enqueue_task_fair+0x78/0x450
> > | activate_task+0x41/0x90
> > | ttwu_do_activate+0x49/0x80
> > | try_to_wake_up+0x23f/0x700
> >
> > wake up ksoftirqd
> >
> > | irq_exit+0xba/0xc0
> > | smp_apic_timer_interrupt+0xb2/0x2a0
> > | apic_timer_interrupt+0xf/0x20
> > | </IRQ>
> >
> > based on the commit it seems the problem was always there, but now the
> > mix of raise_softirq_irqoff() and set_tsk_need_resched() seems to hit
> > the window quite reliably. Replacing it with
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 1102765f91fd1..baab36f4d0f45 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -627,14 +627,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > if (preempt_bh_were_disabled || irqs_were_disabled) {
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > /* Need to defer quiescent state until everything is enabled. */
> > - if (irqs_were_disabled) {
> > - /* Enabling irqs does not reschedule, so... */
> > - raise_softirq_irqoff(RCU_SOFTIRQ);
> > - } else {
> > - /* Enabling BH or preempt does reschedule, so... */
> > - set_tsk_need_resched(current);
> > - set_preempt_need_resched();
> > - }
> > + raise_softirq_irqoff(RCU_SOFTIRQ);
> > local_irq_restore(flags);
> > return;
> > }
> >
> > will make it go away.
>
> Color me confused. Neither set_tsk_need_resched() nor
> set_preempt_need_resched() acquire locks or do wakeups.
This is correct.
> Yet raise_softirq_irqoff() can do a wakeup if not called
> from hardirq/softirq/NMI context, so I would instead expect
> raise_softirq_irqoff() to be the source of troubles when
> interrupts are threaded.
also correct and it is.
> What am I missing here?
Timing. If raise_softirq_irqoff() is always invoked then we end up in a
state where the ksoftirqd thread either isn't woken or is already
running, so the wake-up is skipped early (because ->state ==
TASK_RUNNING or something).
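For reference, that early skip is visible in wakeup_softirqd()
(kernel/softirq.c):

static void wakeup_softirqd(void)
{
	/* Interrupts are disabled: no need to stop preemption */
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

	if (tsk && tsk->state != TASK_RUNNING)
		wake_up_process(tsk);
}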
Please be aware that timing is crucial here to trigger it. I have a
test-case running as an init-script which triggers the bug. Running the
test-case later manually does not trigger it.
> > Any suggestions?
>
> Does something like IRQ work help? Please see -rcu commit 0864f057b050
> ("rcu: Use irq_work to get scheduler's attention in clean context")
> for one way of doing this. Perhaps in combination with -rcu commit
> a69987a515c8 ("rcu: Simplify rcu_read_unlock_special() deferred wakeups").
I don't think this will help. The problem is that irq_exit() invokes
wake_up_process(ksoftirqd), and wake_up_process() then invokes itself
recursively on the same task via rcu_read_unlock() /
rcu_read_unlock_special(). I don't think this changes here.
> Thanx, Paul
Sebastian
On Wed, Jun 26, 2019 at 09:25:58AM -0700, Paul E. McKenney wrote:
> On Wed, Jun 26, 2019 at 03:54:47PM +0200, Sebastian Andrzej Siewior wrote:
> > one of my boxes boots with "threadirqs" and since commit 05f415715ce45
> > ("rcu: Speed up expedited GPs when interrupting RCU reader") I run
> > reliably into the following deadlock:
> >
> > | ============================================
> > | WARNING: possible recursive locking detected
> > | 5.2.0-rc6 #279 Not tainted
> > | --------------------------------------------
> > | (cron)/2109 is trying to acquire lock:
> > | 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> > |
> > | but task is already holding lock:
> > | 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> > |
> > | other info that might help us debug this:
> > | Possible unsafe locking scenario:
> > |
> > | CPU0
> > | ----
> > | lock(&p->pi_lock);
> > | lock(&p->pi_lock);
> > |
> > | *** DEADLOCK ***
> > |
> > | May be due to missing lock nesting notation
> > |
> > | 4 locks held by (cron)/2109:
> > | #0: 00000000c0ae63d9 (&sb->s_type->i_mutex_key){++++}, at: iterate_dir+0x3d/0x170
> > | #1: 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> > | #2: 00000000f62f14cf (&rq->lock){-.-.}, at: try_to_wake_up+0x209/0x700
> > | #3: 000000000d32568e (rcu_read_lock){....}, at: cpuacct_charge+0x37/0x1e0
> > |
> > | stack backtrace:
> > | CPU: 3 PID: 2109 Comm: (cron) Not tainted 5.2.0-rc6 #279
> > | Call Trace:
> > | <IRQ>
> > | dump_stack+0x67/0x90
> > | __lock_acquire.cold.63+0x142/0x23a
> > | lock_acquire+0x9b/0x1a0
> > | ? try_to_wake_up+0x37/0x700
> > | _raw_spin_lock_irqsave+0x33/0x50
> > | ? try_to_wake_up+0x37/0x700
> > | try_to_wake_up+0x37/0x700
> > wake up ksoftirqd
> >
> > | rcu_read_unlock_special+0x61/0xa0
> > | __rcu_read_unlock+0x58/0x60
> > | cpuacct_charge+0xeb/0x1e0
> > | update_curr+0x15d/0x350
> > | enqueue_entity+0x115/0x7e0
> > | enqueue_task_fair+0x78/0x450
> > | activate_task+0x41/0x90
> > | ttwu_do_activate+0x49/0x80
> > | try_to_wake_up+0x23f/0x700
> >
> > wake up ksoftirqd
> >
> > | irq_exit+0xba/0xc0
> > | smp_apic_timer_interrupt+0xb2/0x2a0
> > | apic_timer_interrupt+0xf/0x20
> > | </IRQ>
> >
> > based on the commit it seems the problem was always there, but now the
> > mix of raise_softirq_irqoff() and set_tsk_need_resched() seems to hit
> > the window quite reliably. Replacing it with
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 1102765f91fd1..baab36f4d0f45 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -627,14 +627,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > if (preempt_bh_were_disabled || irqs_were_disabled) {
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > /* Need to defer quiescent state until everything is enabled. */
> > - if (irqs_were_disabled) {
> > - /* Enabling irqs does not reschedule, so... */
> > - raise_softirq_irqoff(RCU_SOFTIRQ);
> > - } else {
> > - /* Enabling BH or preempt does reschedule, so... */
> > - set_tsk_need_resched(current);
> > - set_preempt_need_resched();
> > - }
> > + raise_softirq_irqoff(RCU_SOFTIRQ);
> > local_irq_restore(flags);
> > return;
> > }
> >
> > will make it go away.
>
> Color me confused. Neither set_tsk_need_resched() nor
> set_preempt_need_resched() acquire locks or do wakeups.
> Yet raise_softirq_irqoff() can do a wakeup if not called
> from hardirq/softirq/NMI context, so I would instead expect
> raise_softirq_irqoff() to be the source of troubles when
> interrupts are threaded.
>
> What am I missing here?
This issue I think is
(in normal process context)
spin_lock_irqsave(rq_lock); // which disables both preemption and interrupt
// but this was done in normal process context,
// not from IRQ handler
rcu_read_lock();
<---------- IPI comes in and sets exp_hint
rcu_read_unlock()
-> rcu_read_unlock_special
-> raise_softirq_irqoff
-> wakeup_softirqd <--- because in_interrupt() returns false.
I think the issue is in_interrupt() does not know about threaded interrupts.
If it did, then the ksoftirqd wake up would not happen.
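For reference, in_interrupt() is defined purely in terms of
preempt_count bits (include/linux/preempt.h):

#define irq_count()	(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \
				 | NMI_MASK))
#define in_interrupt()	(irq_count())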
Did I get something wrong?
thanks,
- Joel
> > Any suggestions?
>
> Does something like IRQ work help? Please see -rcu commit 0864f057b050
> ("rcu: Use irq_work to get scheduler's attention in clean context")
> for one way of doing this. Perhaps in combination with -rcu commit
> a69987a515c8 ("rcu: Simplify rcu_read_unlock_special() deferred wakeups").
>
> Thanx, Paul
On Thu, 27 Jun 2019 10:24:36 -0400
Joel Fernandes <[email protected]> wrote:
> > What am I missing here?
>
> This issue I think is
>
> (in normal process context)
> spin_lock_irqsave(rq_lock); // which disables both preemption and interrupt
> // but this was done in normal process context,
> // not from IRQ handler
> rcu_read_lock();
> <---------- IPI comes in and sets exp_hint
How would an IPI come in here with interrupts disabled?
-- Steve
> rcu_read_unlock()
> -> rcu_read_unlock_special
> -> raise_softirq_irqoff
> -> wakeup_softirqd <--- because in_interrupt() returns false.
>
> I think the issue is in_interrupt() does not know about threaded interrupts.
> If it did, then the ksoftirqd wake up would not happen.
>
> Did I get something wrong?
On Thu, Jun 27, 2019 at 10:34:55AM -0400, Steven Rostedt wrote:
> On Thu, 27 Jun 2019 10:24:36 -0400
> Joel Fernandes <[email protected]> wrote:
>
> > > What am I missing here?
> >
> > This issue I think is
> >
> > (in normal process context)
> > spin_lock_irqsave(rq_lock); // which disables both preemption and interrupt
> > // but this was done in normal process context,
> > // not from IRQ handler
> > rcu_read_lock();
> > <---------- IPI comes in and sets exp_hint
>
> How would an IPI come in here with interrupts disabled?
>
> -- Steve
This is true, could it be rcu_read_unlock_special() got called for some
reason *other* than the IPI then?
Per Sebastian's stack trace of the recursive lock scenario, it is happening
during cpuacct_charge() which is called with the rq_lock held.
The only other reasons I know of to call rcu_read_unlock_special() are if
1. the tick indicated that the CPU has to report a QS
2. an IPI in the middle of the reader section for expedited GPs
3. preemption in the middle of a preemptible RCU reader section
1. and 2. are not possible because interrupts are disabled, that's why the
wakeup_softirqd even happened.
3. is not possible because we are holding rq_lock in the RCU reader section.
So I am at a bit of a loss how this can happen :-(
Maybe a spurious call to rcu_read_unlock_special(), when it should not
have been called?
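For reference, the slow path is entered from __rcu_read_unlock()
whenever any bit in the special union is still set (paraphrased from
kernel/rcu/tree_plugin.h around v5.2):

	if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
		rcu_read_unlock_special(t);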
thanks,
- Joel
On Thu, Jun 27, 2019 at 11:30 AM Joel Fernandes <[email protected]> wrote:
>
> On Thu, Jun 27, 2019 at 10:34:55AM -0400, Steven Rostedt wrote:
> > On Thu, 27 Jun 2019 10:24:36 -0400
> > Joel Fernandes <[email protected]> wrote:
> >
> > > > What am I missing here?
> > >
> > > This issue I think is
> > >
> > > (in normal process context)
> > > spin_lock_irqsave(rq_lock); // which disables both preemption and interrupt
> > > // but this was done in normal process context,
> > > // not from IRQ handler
> > > rcu_read_lock();
> > > <---------- IPI comes in and sets exp_hint
> >
> > How would an IPI come in here with interrupts disabled?
> >
> > -- Steve
>
> This is true, could it be rcu_read_unlock_special() got called for some
> reason *other* than the IPI then?
>
> Per Sebastian's stack trace of the recursive lock scenario, it is happening
> during cpuacct_charge() which is called with the rq_lock held.
>
> The only other reasons I know of to call rcu_read_unlock_special() are if
> 1. the tick indicated that the CPU has to report a QS
> 2. an IPI in the middle of the reader section for expedited GPs
> 3. preemption in the middle of a preemptible RCU reader section
>
> 1. and 2. are not possible because interrupts are disabled, that's why the
> wakeup_softirqd even happened.
> 3. is not possible because we are holding rq_lock in the RCU reader section.
>
> So I am at a bit of a loss how this can happen :-(
Sebastian it would be nice if possible to trace where the
t->rcu_read_unlock_special is set for this scenario of calling
rcu_read_unlock_special, to give a clear idea about whether it was
really because of an IPI. I guess we could also add additional RCU
debug fields to task_struct (just for debugging) to see where the
unlock_special is set.
Is there a test to reproduce this, or do I just boot an intel x86_64
machine with "threadirqs" and run into it?
On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> Sebastian it would be nice if possible to trace where the
> t->rcu_read_unlock_special is set for this scenario of calling
> rcu_read_unlock_special, to give a clear idea about whether it was
> really because of an IPI. I guess we could also add additional RCU
> debug fields to task_struct (just for debugging) to see where the
> unlock_special is set.
>
> Is there a test to reproduce this, or do I just boot an intel x86_64
> machine with "threadirqs" and run into it?
Do you want to send me a patch or should I send you my kvm image which
triggers the bug on boot?
Sebastian
On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > Sebastian it would be nice if possible to trace where the
> > t->rcu_read_unlock_special is set for this scenario of calling
> > rcu_read_unlock_special, to give a clear idea about whether it was
> > really because of an IPI. I guess we could also add additional RCU
> > debug fields to task_struct (just for debugging) to see where the
> > unlock_special is set.
> >
> > Is there a test to reproduce this, or do I just boot an intel x86_64
> > machine with "threadirqs" and run into it?
>
> Do you want to send me a patch or should I send you my kvm image which
> triggers the bug on boot?
I think the kvm image would be better so I reproduce it on my side.
After that I can share any debug patches with you if they are useful.
Thanks.
On Thu, Jun 27, 2019 at 09:47:05AM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-06-26 09:25:58 [-0700], Paul E. McKenney wrote:
> > On Wed, Jun 26, 2019 at 03:54:47PM +0200, Sebastian Andrzej Siewior wrote:
> > > one of my boxes boots with "threadirqs" and since commit 05f415715ce45
> > > ("rcu: Speed up expedited GPs when interrupting RCU reader") I run
> > > reliably into the following deadlock:
> > >
> > > | ============================================
> > > | WARNING: possible recursive locking detected
> > > | 5.2.0-rc6 #279 Not tainted
> > > | --------------------------------------------
> > > | (cron)/2109 is trying to acquire lock:
> > > | 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> > > |
> > > | but task is already holding lock:
> > > | 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> > > |
> > > | other info that might help us debug this:
> > > | Possible unsafe locking scenario:
> > > |
> > > | CPU0
> > > | ----
> > > | lock(&p->pi_lock);
> > > | lock(&p->pi_lock);
> > > |
> > > | *** DEADLOCK ***
> > > |
> > > | May be due to missing lock nesting notation
> > > |
> > > | 4 locks held by (cron)/2109:
> > > | #0: 00000000c0ae63d9 (&sb->s_type->i_mutex_key){++++}, at: iterate_dir+0x3d/0x170
> > > | #1: 0000000088464daa (&p->pi_lock){-.-.}, at: try_to_wake_up+0x37/0x700
> > > | #2: 00000000f62f14cf (&rq->lock){-.-.}, at: try_to_wake_up+0x209/0x700
> > > | #3: 000000000d32568e (rcu_read_lock){....}, at: cpuacct_charge+0x37/0x1e0
> > > |
> > > | stack backtrace:
> > > | CPU: 3 PID: 2109 Comm: (cron) Not tainted 5.2.0-rc6 #279
> > > | Call Trace:
> > > | <IRQ>
> > > | dump_stack+0x67/0x90
> > > | __lock_acquire.cold.63+0x142/0x23a
> > > | lock_acquire+0x9b/0x1a0
> > > | ? try_to_wake_up+0x37/0x700
> > > | _raw_spin_lock_irqsave+0x33/0x50
> > > | ? try_to_wake_up+0x37/0x700
> > > | try_to_wake_up+0x37/0x700
> > > wake up ksoftirqd
> > >
> > > | rcu_read_unlock_special+0x61/0xa0
> > > | __rcu_read_unlock+0x58/0x60
> > > | cpuacct_charge+0xeb/0x1e0
> > > | update_curr+0x15d/0x350
> > > | enqueue_entity+0x115/0x7e0
> > > | enqueue_task_fair+0x78/0x450
> > > | activate_task+0x41/0x90
> > > | ttwu_do_activate+0x49/0x80
> > > | try_to_wake_up+0x23f/0x700
> > >
> > > wake up ksoftirqd
> > >
> > > | irq_exit+0xba/0xc0
> > > | smp_apic_timer_interrupt+0xb2/0x2a0
> > > | apic_timer_interrupt+0xf/0x20
> > > | </IRQ>
> > >
> > > based on the commit it seems the problem was always there, but now the
> > > mix of raise_softirq_irqoff() and set_tsk_need_resched() seems to hit
> > > the window quite reliably. Replacing it with
> > >
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 1102765f91fd1..baab36f4d0f45 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -627,14 +627,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > > if (preempt_bh_were_disabled || irqs_were_disabled) {
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > /* Need to defer quiescent state until everything is enabled. */
> > > - if (irqs_were_disabled) {
> > > - /* Enabling irqs does not reschedule, so... */
> > > - raise_softirq_irqoff(RCU_SOFTIRQ);
> > > - } else {
> > > - /* Enabling BH or preempt does reschedule, so... */
> > > - set_tsk_need_resched(current);
> > > - set_preempt_need_resched();
> > > - }
> > > + raise_softirq_irqoff(RCU_SOFTIRQ);
> > > local_irq_restore(flags);
> > > return;
> > > }
> > >
> > > will make it go away.
> >
> > Color me confused. Neither set_tsk_need_resched() nor
> > set_preempt_need_resched() acquire locks or do wakeups.
>
> This is correct.
>
> > Yet raise_softirq_irqoff() can do a wakeup if not called
> > from hardirq/softirq/NMI context, so I would instead expect
> > raise_softirq_irqoff() to be the source of troubles when
> > interrupts are threaded.
>
> also correct and it is.
>
> > What am I missing here?
>
> Timing. If raise_softirq_irqoff() is always invoked then we end up in a
> state where the ksoftirqd thread either isn't woken or is already
> running, so the wake-up is skipped early (because ->state ==
> TASK_RUNNING or something).
> Please be aware that timing is crucial here to trigger it. I have a
> test-case running as an init-script which triggers the bug. Running the
> test-case later manually does not trigger it.
>
> > > Any suggestions?
> >
> > Does something like IRQ work help? Please see -rcu commit 0864f057b050
> > ("rcu: Use irq_work to get scheduler's attention in clean context")
> > for one way of doing this. Perhaps in combination with -rcu commit
> > a69987a515c8 ("rcu: Simplify rcu_read_unlock_special() deferred wakeups").
>
> I don't think this will help. The problem is that irq_exit() invokes
> wake_up_process(ksoftirqd), and wake_up_process() then invokes itself
> recursively on the same task via rcu_read_unlock() /
> rcu_read_unlock_special(). I don't think this changes here.
I could always just do a self-IPI. :-/
Another approach would be to kick off a short-duration timer from
rcu_read_unlock_special().
Of course neither of these would help with irq_exit() invoking
wake_up_process(ksoftirqd). Except that if we are doing an irq_exit(),
how is it that any of the scheduler rq/pi locks are held?
Am I right in guessing that the threaded interrupts increase the
probability of this sort of thing happening due to softirqs (but not
interrupts) being disabled while the interrupt is running in threaded
mode? (Referring to the local_bh_disable() in irq_forced_thread_fn(),
although irq_thread() doesn't obviously disable bh.)
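For reference, a trimmed sketch of irq_forced_thread_fn() from
kernel/irq/manage.c; the thread_fn runs in task context under
local_bh_disable():

static irqreturn_t
irq_forced_thread_fn(struct irq_desc *desc, struct irqaction *action)
{
	irqreturn_t ret;

	local_bh_disable();
	ret = action->thread_fn(action->irq, action->dev_id);
	irq_finalize_oneshot(desc, action);
	local_bh_enable();
	return ret;
}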
Thomas may need to order a large quantity of confused-color paint...
Thanx, Paul
On Thu, Jun 27, 2019 at 11:30:31AM -0400, Joel Fernandes wrote:
> On Thu, Jun 27, 2019 at 10:34:55AM -0400, Steven Rostedt wrote:
> > On Thu, 27 Jun 2019 10:24:36 -0400
> > Joel Fernandes <[email protected]> wrote:
> >
> > > > What am I missing here?
> > >
> > > This issue I think is
> > >
> > > (in normal process context)
> > > spin_lock_irqsave(rq_lock); // which disables both preemption and interrupt
> > > // but this was done in normal process context,
> > > // not from IRQ handler
> > > rcu_read_lock();
> > > <---------- IPI comes in and sets exp_hint
> >
> > How would an IPI come in here with interrupts disabled?
> >
> > -- Steve
>
> This is true, could it be rcu_read_unlock_special() got called for some
> reason *other* than the IPI then?
>
> Per Sebastian's stack trace of the recursive lock scenario, it is happening
> during cpuacct_charge() which is called with the rq_lock held.
>
> The only other reasons I know of to call rcu_read_unlock_special() are if
> 1. the tick indicated that the CPU has to report a QS
> 2. an IPI in the middle of the reader section for expedited GPs
> 3. preemption in the middle of a preemptible RCU reader section
4. Some previous reader section was IPIed or preempted, but either
interrupts, softirqs, or preemption was disabled across the
rcu_read_unlock() of that previous reader section.
I -think- that this is what Sebastian is seeing.
Thanx, Paul
> 1. and 2. are not possible because interrupts are disabled, that's why the
> wakeup_softirqd even happened.
> 3. is not possible because we are holding rq_lock in the RCU reader section.
>
> So I am at a bit of a loss how this can happen :-(
>
> Maybe a spurious call to rcu_read_unlock_special(), when it should not
> have been called?
>
> thanks,
>
> - Joel
On Thu, Jun 27, 2019 at 11:55 AM Paul E. McKenney <[email protected]> wrote:
>
> On Thu, Jun 27, 2019 at 11:30:31AM -0400, Joel Fernandes wrote:
> > On Thu, Jun 27, 2019 at 10:34:55AM -0400, Steven Rostedt wrote:
> > > On Thu, 27 Jun 2019 10:24:36 -0400
> > > Joel Fernandes <[email protected]> wrote:
> > >
> > > > > What am I missing here?
> > > >
> > > > This issue I think is
> > > >
> > > > (in normal process context)
> > > > spin_lock_irqsave(rq_lock); // which disables both preemption and interrupt
> > > > // but this was done in normal process context,
> > > > // not from IRQ handler
> > > > rcu_read_lock();
> > > > <---------- IPI comes in and sets exp_hint
> > >
> > > How would an IPI come in here with interrupts disabled?
> > >
> > > -- Steve
> >
> > This is true, could it be rcu_read_unlock_special() got called for some
> > reason *other* than the IPI then?
> >
> > Per Sebastian's stack trace of the recursive lock scenario, it is happening
> > during cpuacct_charge() which is called with the rq_lock held.
> >
> > The only other reasons I know of to call rcu_read_unlock_special() are if
> > 1. the tick indicated that the CPU has to report a QS
> > 2. an IPI in the middle of the reader section for expedited GPs
> > 3. preemption in the middle of a preemptible RCU reader section
>
> 4. Some previous reader section was IPIed or preempted, but either
> interrupts, softirqs, or preemption was disabled across the
> rcu_read_unlock() of that previous reader section.
Hi Paul, I did not fully understand 4. The previous RCU reader section
could not have been IPI'ed or been preempted if interrupts were
disabled across. Also, if softirq/preempt is disabled across the
previous reader section, the previous reader could not be preempted in
these case.
That leaves us with the only scenario where the previous reader was
IPI'ed while softirq/preempt was disabled across it. Is that what you
meant? But in this scenario, the previous reader should have set
exp_hint to false in the previous reader's rcu_read_unlock_special()
invocation itself. So I would think t->rcu_read_unlock_special should
be 0 during the new reader's invocation thus I did not understand how
rcu_read_unlock_special can be called because of a previous reader.
I'll borrow some of that confused color paint if you don't mind ;-)
And we should document this somewhere for future sanity preservation
:-D
thanks,
- Joel
>
> I -think- that this is what Sebastian is seeing.
>
> Thanx, Paul
>
> > 1. and 2. are not possible because interrupts are disabled, that's why the
> > wakeup_softirqd even happened.
> > 3. is not possible because we are holding rq_lock in the RCU reader section.
> >
> > So I am at a bit of a loss how this can happen :-(
> >
> > Maybe a spurious call to rcu_read_unlock_special(), when it should not
> > have been called?
> >
> > thanks,
> >
> > - Joel
On Thu, Jun 27, 2019 at 12:47:24PM -0400, Joel Fernandes wrote:
> On Thu, Jun 27, 2019 at 11:55 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Thu, Jun 27, 2019 at 11:30:31AM -0400, Joel Fernandes wrote:
> > > On Thu, Jun 27, 2019 at 10:34:55AM -0400, Steven Rostedt wrote:
> > > > On Thu, 27 Jun 2019 10:24:36 -0400
> > > > Joel Fernandes <[email protected]> wrote:
> > > >
> > > > > > What am I missing here?
> > > > >
> > > > > This issue I think is
> > > > >
> > > > > (in normal process context)
> > > > > spin_lock_irqsave(rq_lock); // which disables both preemption and interrupt
> > > > > // but this was done in normal process context,
> > > > > // not from IRQ handler
> > > > > rcu_read_lock();
> > > > > <---------- IPI comes in and sets exp_hint
> > > >
> > > > How would an IPI come in here with interrupts disabled?
> > > >
> > > > -- Steve
> > >
> > > This is true, could it be rcu_read_unlock_special() got called for some
> > > reason *other* than the IPI then?
> > >
> > > Per Sebastian's stack trace of the recursive lock scenario, it is happening
> > > during cpuacct_charge() which is called with the rq_lock held.
> > >
> > > The only other reasons I know of to call rcu_read_unlock_special() are if
> > > 1. the tick indicated that the CPU has to report a QS
> > > 2. an IPI in the middle of the reader section for expedited GPs
> > > 3. preemption in the middle of a preemptible RCU reader section
> >
> > 4. Some previous reader section was IPIed or preempted, but either
> > interrupts, softirqs, or preemption was disabled across the
> > rcu_read_unlock() of that previous reader section.
>
> Hi Paul, I did not fully understand 4. The previous RCU reader section
> could not have been IPI'ed or been preempted if interrupts were
> disabled across. Also, if softirq/preempt is disabled across the
> previous reader section, the previous reader could not be preempted in
> these case.
Like this, courtesy of the consolidation of RCU flavors:
previous_reader()
{
rcu_read_lock();
do_something(); /* Preemption happened here. */
local_irq_disable(); /* Cannot be the scheduler! */
do_something_else();
rcu_read_unlock(); /* Must defer QS, task still queued. */
do_some_other_thing();
local_irq_enable();
}
current_reader() /* QS from previous_reader() is still deferred. */
{
local_irq_disable(); /* Might be the scheduler. */
do_whatever();
rcu_read_lock();
do_whatever_else();
rcu_read_unlock(); /* Must still defer reporting QS. */
do_whatever_comes_to_mind();
local_irq_enable();
}
Both instances of rcu_read_unlock() need to cause some later thing
to report the quiescent state, and in some cases it will do a wakeup.
Now, previous_reader()'s IRQ disabling cannot be due to scheduler rq/pi
locks, given the rule that they be held across the entire RCU reader
if they are held across the rcu_read_unlock(). But current_reader()'s
IRQ disabling might well be due to the scheduler rq/pi locks, so
current_reader() must be careful about doing wakeups.
> That leaves us with the only scenario where the previous reader was
> IPI'ed while softirq/preempt was disabled across it. Is that what you
> meant?
No, but that can also happen.
> But in this scenario, the previous reader should have set
> exp_hint to false in the previous reader's rcu_read_unlock_special()
> invocation itself. So I would think t->rcu_read_unlock_special should
> be 0 during the new reader's invocation thus I did not understand how
> rcu_read_unlock_special can be called because of a previous reader.
Yes, exp_hint would unconditionally be set to false in the first
reader's rcu_read_unlock(). But .blocked won't be.
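For reference, those bits all live in one union in task_struct
(include/linux/sched.h around v5.2), so any one of them keeps
->rcu_read_unlock_special.s nonzero:

union rcu_special {
	struct {
		u8 blocked;
		u8 need_qs;
		u8 exp_hint;	/* Hint for performance. */
		u8 deferred_qs;
	} b;	/* Bits. */
	u32 s;	/* Set of bits. */
};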
> I'll borrow some of that confused color paint if you don't mind ;-)
> And we should document this somewhere for future sanity preservation
> :-D
Or adjust the code and requirements to make it more sane, if feasible.
My current (probably wildly unreliable) guess is that the conditions in
rcu_read_unlock_special() need adjusting. I was assuming that in_irq()
implies a hardirq context, in other words that in_irq() would return
false from a threaded interrupt handler. If in_irq() instead returns
true from within a threaded interrupt handler, then this code in
rcu_read_unlock_special() needs fixing:
if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
(in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
// Using softirq, safe to awaken, and we get
// no help from enabling irqs, unlike bh/preempt.
raise_softirq_irqoff(RCU_SOFTIRQ);
The fix would be replacing the calls to in_irq() with something that
returns true only if called from within a hardirq context.
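For reference, in_irq() keys off only the HARDIRQ bits of preempt_count
(include/linux/preempt.h):

#define hardirq_count()	(preempt_count() & HARDIRQ_MASK)
#define in_irq()	(hardirq_count())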
Thoughts?
Ugh. Same question about IRQ work. Will the current use of it by
rcu_read_unlock_special() cause breakage in the presence of threaded
interrupt handlers?
Thanx, Paul
> thanks,
> - Joel
>
>
>
> >
> > I -think- that this is what Sebastian is seeing.
> >
> > Thanx, Paul
> >
> > > 1. and 2. are not possible because interrupts are disabled, that's why the
> > > wakeup_softirqd even happened.
> > > 3. is not possible because we are holding rq_lock in the RCU reader section.
> > >
> > > So I am at a bit of a loss how this can happen :-(
> > >
> > > Maybe a spurious call to rcu_read_unlock_special(), when it should not
> > > have been called?
> > >
> > > thanks,
> > >
> > > - Joel
On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > Sebastian it would be nice if possible to trace where the
> > t->rcu_read_unlock_special is set for this scenario of calling
> > rcu_read_unlock_special, to give a clear idea about whether it was
> > really because of an IPI. I guess we could also add additional RCU
> > debug fields to task_struct (just for debugging) to see where the
> > unlock_special is set.
> >
> > Is there a test to reproduce this, or do I just boot an intel x86_64
> > machine with "threadirqs" and run into it?
>
> Do you want to send me a patch or should I send you my kvm image which
> triggers the bug on boot?
I could reproduce this as well just booting Linus tree with threadirqs
command line and running rcutorture. In 15 seconds or so it locks
up... gdb backtrace shows the recursive lock:
(gdb) bt
#0 queued_spin_lock_slowpath (lock=0xffff88813af16b44, val=0) at kernel/locking/qspinlock.c:381
#1 0xffffffff81a75a0f in queued_spin_lock (lock=<optimized out>) at ./include/asm-generic/qspinlock.h:81
#2 do_raw_spin_lock_flags (flags=<optimized out>, lock=<optimized out>) at ./include/linux/spinlock.h:193
#3 __raw_spin_lock_irqsave (lock=<optimized out>) at ./include/linux/spinlock_api_smp.h:119
#4 _raw_spin_lock_irqsave (lock=0xffff88813af16b44) at kernel/locking/spinlock.c:159
#5 0xffffffff81092158 in try_to_wake_up (p=0xffff88813af16400, state=3, wake_flags=0) at kernel/sched/core.c:2000
#6 0xffffffff8109265c in wake_up_process (p=<optimized out>) at kernel/sched/core.c:2114
#7 0xffffffff8106b019 in wakeup_softirqd () at kernel/softirq.c:76
#8 0xffffffff8106b8ef in raise_softirq_irqoff (nr=<optimized out>) at kernel/softirq.c:437
#9 0xffffffff810d080d in rcu_read_unlock_special (t=<optimized out>) at kernel/rcu/tree_plugin.h:632
#10 0xffffffff810d0868 in __rcu_read_unlock () at kernel/rcu/tree_plugin.h:414
#11 0xffffffff810aa58e in rcu_read_unlock () at ./include/linux/rcupdate.h:646
#12 cpuacct_charge (tsk=<optimized out>, cputime=<optimized out>) at kernel/sched/cpuacct.c:352
#13 0xffffffff81097994 in cgroup_account_cputime (delta_exec=<optimized out>, task=<optimized out>) at ./include/linux/cgroup.h:764
#14 update_curr (cfs_rq=<optimized out>) at kernel/sched/fair.c:843
#15 0xffffffff8109a882 in enqueue_entity (flags=<optimized out>, se=<optimized out>, cfs_rq=<optimized out>) at kernel/sched/fair.c:3901
#16 enqueue_task_fair (rq=<optimized out>, p=<optimized out>, flags=9) at kernel/sched/fair.c:5194
#17 0xffffffff81091699 in enqueue_task (flags=<optimized out>, p=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:774
#18 activate_task (rq=0xffff88813af16b44, p=0x0 <fixed_percpu_data>, flags=<optimized out>) at kernel/sched/core.c:795
#19 0xffffffff81091a1a in ttwu_do_activate (rq=0xffff88813af16b44, p=0x0 <fixed_percpu_data>, wake_flags=0, rf=<optimized out>) at kernel/sched/core.c:1737
#20 0xffffffff810924b3 in ttwu_queue (wake_flags=<optimized out>, cpu=<optimized out>, p=<optimized out>) at kernel/sched/core.c:1882
#21 try_to_wake_up (p=0xffff88813af16400, state=<optimized out>, wake_flags=0) at kernel/sched/core.c:2092
#22 0xffffffff8109265c in wake_up_process (p=<optimized out>) at kernel/sched/core.c:2114
#23 0xffffffff8106b019 in wakeup_softirqd () at kernel/softirq.c:76
#24 0xffffffff8106b843 in invoke_softirq () at kernel/softirq.c:383
#25 irq_exit () at kernel/softirq.c:413
#26 0xffffffff81c01f8e in exiting_irq () at ./arch/x86/include/asm/apic.h:536
>
> Sebastian
On Thu, Jun 27, 2019 at 1:43 PM Joel Fernandes <[email protected]> wrote:
>
> On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
> <[email protected]> wrote:
> >
> > On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > > Sebastian it would be nice if possible to trace where the
> > > t->rcu_read_unlock_special is set for this scenario of calling
> > > rcu_read_unlock_special, to give a clear idea about whether it was
> > > really because of an IPI. I guess we could also add additional RCU
> > > debug fields to task_struct (just for debugging) to see where the
> > > unlock_special is set.
> > >
> > > Is there a test to reproduce this, or do I just boot an intel x86_64
> > > machine with "threadirqs" and run into it?
> >
> > Do you want to send me a patch or should I send you my kvm image which
> > triggers the bug on boot?
>
> I could reproduce this as well just booting Linus tree with threadirqs
> command line and running rcutorture. In 15 seconds or so it locks
> up... gdb backtrace shows the recursive lock:
Sorry that got badly wrapped, so I pasted it here:
https://hastebin.com/ajivofomik.shell
On Thu, Jun 27, 2019 at 01:46:27PM -0400, Joel Fernandes wrote:
> On Thu, Jun 27, 2019 at 1:43 PM Joel Fernandes <[email protected]> wrote:
> >
> > On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
> > <[email protected]> wrote:
> > >
> > > On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > > > Sebastian it would be nice if possible to trace where the
> > > > t->rcu_read_unlock_special is set for this scenario of calling
> > > > rcu_read_unlock_special, to give a clear idea about whether it was
> > > > really because of an IPI. I guess we could also add additional RCU
> > > > debug fields to task_struct (just for debugging) to see where the
> > > > unlock_special is set.
> > > >
> > > > Is there a test to reproduce this, or do I just boot an intel x86_64
> > > > machine with "threadirqs" and run into it?
> > >
> > > Do you want to send me a patch or should I send you my kvm image which
> > > triggers the bug on boot?
> >
> > I could reproduce this as well just booting Linus tree with threadirqs
> > command line and running rcutorture. In 15 seconds or so it locks
> > up... gdb backtrace shows the recursive lock:
>
> Sorry that got badly wrapped, so I pasted it here:
> https://hastebin.com/ajivofomik.shell
Which rcutorture scenario would that be? TREE03 is thus far refusing
to fail for me when run this way:
$ tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --trust-make --configs "TREE03" --bootargs "threadirqs"
If it had failed, I would have tried the patch shown below. I know that
Sebastian has some concerns about the bug happening anyway, but we have
to start somewhere! ;-)
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 82c925df1d92..be7bafc2c0a0 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -624,25 +624,16 @@ static void rcu_read_unlock_special(struct task_struct *t)
(rdp->grpmask & rnp->expmask) ||
tick_nohz_full_cpu(rdp->cpu);
// Need to defer quiescent state until everything is enabled.
- if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
- (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
- // Using softirq, safe to awaken, and we get
- // no help from enabling irqs, unlike bh/preempt.
- raise_softirq_irqoff(RCU_SOFTIRQ);
- } else {
- // Enabling BH or preempt does reschedule, so...
- // Also if no expediting or NO_HZ_FULL, slow is OK.
- set_tsk_need_resched(current);
- set_preempt_need_resched();
- if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
- !rdp->defer_qs_iw_pending && exp) {
- // Get scheduler to re-evaluate and call hooks.
- // If !IRQ_WORK, FQS scan will eventually IPI.
- init_irq_work(&rdp->defer_qs_iw,
- rcu_preempt_deferred_qs_handler);
- rdp->defer_qs_iw_pending = true;
- irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
- }
+ set_tsk_need_resched(current);
+ set_preempt_need_resched();
+ if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
+ !rdp->defer_qs_iw_pending && exp) {
+ // Get scheduler to re-evaluate and call hooks.
+ // If !IRQ_WORK, FQS scan will eventually IPI.
+ init_irq_work(&rdp->defer_qs_iw,
+ rcu_preempt_deferred_qs_handler);
+ rdp->defer_qs_iw_pending = true;
+ irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
}
t->rcu_read_unlock_special.b.deferred_qs = true;
local_irq_restore(flags);
On Thu, Jun 27, 2019 at 10:38:31AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 12:47:24PM -0400, Joel Fernandes wrote:
> > On Thu, Jun 27, 2019 at 11:55 AM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Thu, Jun 27, 2019 at 11:30:31AM -0400, Joel Fernandes wrote:
> > > > On Thu, Jun 27, 2019 at 10:34:55AM -0400, Steven Rostedt wrote:
> > > > > On Thu, 27 Jun 2019 10:24:36 -0400
> > > > > Joel Fernandes <[email protected]> wrote:
> > > > >
> > > > > > > What am I missing here?
> > > > > >
> > > > > > This issue I think is
> > > > > >
> > > > > > (in normal process context)
> > > > > > spin_lock_irqsave(rq_lock); // which disables both preemption and interrupt
> > > > > > // but this was done in normal process context,
> > > > > > // not from IRQ handler
> > > > > > rcu_read_lock();
> > > > > > <---------- IPI comes in and sets exp_hint
> > > > >
> > > > > How would an IPI come in here with interrupts disabled?
> > > > >
> > > > > -- Steve
> > > >
> > > > This is true, could it be rcu_read_unlock_special() got called for some
> > > > reason *other* than the IPI then?
> > > >
> > > > Per Sebastian's stack trace of the recursive lock scenario, it is happening
> > > > during cpuacct_charge() which is called with the rq_lock held.
> > > >
> > > > The only other reasons I know of to call rcu_read_unlock_special() are if
> > > > 1. the tick indicated that the CPU has to report a QS
> > > > 2. an IPI in the middle of the reader section for expedited GPs
> > > > 3. preemption in the middle of a preemptible RCU reader section
> > >
> > > 4. Some previous reader section was IPIed or preempted, but either
> > > interrupts, softirqs, or preemption was disabled across the
> > > rcu_read_unlock() of that previous reader section.
> >
> > Hi Paul, I did not fully understand 4. The previous RCU reader section
> > could not have been IPI'ed or been preempted if interrupts were
> > disabled across. Also, if softirq/preempt is disabled across the
> > previous reader section, the previous reader could not be preempted in
> > these case.
>
> Like this, courtesy of the consolidation of RCU flavors:
>
> previous_reader()
> {
> rcu_read_lock();
> do_something(); /* Preemption happened here. */
> local_irq_disable(); /* Cannot be the scheduler! */
> do_something_else();
> rcu_read_unlock(); /* Must defer QS, task still queued. */
> do_some_other_thing();
> local_irq_enable();
> }
>
> current_reader() /* QS from previous_reader() is still deferred. */
> {
> local_irq_disable(); /* Might be the scheduler. */
> do_whatever();
> rcu_read_lock();
> do_whatever_else();
> rcu_read_unlock(); /* Must still defer reporting QS. */
> do_whatever_comes_to_mind();
> local_irq_enable();
> }
>
> Both instances of rcu_read_unlock() need to cause some later thing
> to report the quiescent state, and in some cases it will do a wakeup.
> Now, previous_reader()'s IRQ disabling cannot be due to scheduler rq/pi
> locks, given the rule that they be held across the entire RCU reader
> if they are held across the rcu_read_unlock(). But current_reader()'s
> IRQ disabling might well be due to the scheduler rq/pi locks, so
> current_reader() must be careful about doing wakeups.
Makes sense now, thanks.
> > That leaves us with the only scenario where the previous reader was
> > IPI'ed while softirq/preempt was disabled across it. Is that what you
> > meant?
>
> No, but that can also happen.
>
> > But in this scenario, the previous reader should have set
> > exp_hint to false in the previous reader's rcu_read_unlock_special()
> > invocation itself. So I would think t->rcu_read_unlock_special should
> > be 0 during the new reader's invocation thus I did not understand how
> > rcu_read_unlock_special can be called because of a previous reader.
>
> Yes, exp_hint would unconditionally be set to false in the first
> reader's rcu_read_unlock(). But .blocked won't be.
Makes sense.
> > I'll borrow some of that confused color paint if you don't mind ;-)
> > And we should document this somewhere for future sanity preservation
> > :-D
>
> Or adjust the code and requirements to make it more sane, if feasible.
>
> My current (probably wildly unreliable) guess is that the conditions in
> rcu_read_unlock_special() need adjusting. I was assuming that in_irq()
> implies a hardirq context, in other words that in_irq() would return
> false from a threaded interrupt handler. If in_irq() instead returns
> true from within a threaded interrupt handler, then this code in
> rcu_read_unlock_special() needs fixing:
>
> if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> // Using softirq, safe to awaken, and we get
> // no help from enabling irqs, unlike bh/preempt.
> raise_softirq_irqoff(RCU_SOFTIRQ);
>
> The fix would be replacing the calls to in_irq() with something that
> returns true only if called from within a hardirq context.
> Thoughts?
I am not sure if this will fix all cases though?
I think the crux of the problem is doing a recursive wake up. The threaded
IRQ probably just happens to be causing it here; it seems to me this problem
can also occur on a non-threaded irq system (say, current_reader() in your
example executed in a scheduler path in process-context and not from an
interrupt). Is that not possible?
I think the fix should be to prevent the wake-up based not on whether we
are in hard/soft-interrupt mode but on whether we are doing the
rcu_read_unlock() from a scheduler path (if we can detect that).
I lost track of this code:
if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
(in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
Was this patch posted to the list? I will blame it to try to get some
context. It sounds like you added more conditions on when to kick the
softirq.
> Ugh. Same question about IRQ work. Will the current use of it by
> rcu_read_unlock_special() cause breakage in the presence of threaded
> interrupt handlers?
/me needs to understand why the irq work stuff was added here as well. Have
my work cut out for the day! ;-)
thanks,
- Joel
>
> Thanx, Paul
>
> > thanks,
> > - Joel
> >
> >
> >
> > >
> > > I -think- that this is what Sebastian is seeing.
> > >
> > > Thanx, Paul
> > >
> > > > 1. and 2. are not possible because interrupts are disabled, that's why the
> > > > wakeup_softirqd even happened.
> > > > 3. is not possible because we are holding rq_lock in the RCU reader section.
> > > >
> > > > So I am at a bit of a loss how this can happen :-(
> > > >
> > > > Maybe a spurious call to rcu_read_unlock_special(), when it should not
> > > > have been called?
> > > >
> > > > thanks,
> > > >
> > > > - Joel
>
On Thu, Jun 27, 2019 at 11:11:12AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 01:46:27PM -0400, Joel Fernandes wrote:
> > On Thu, Jun 27, 2019 at 1:43 PM Joel Fernandes <[email protected]> wrote:
> > >
> > > On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
> > > <[email protected]> wrote:
> > > >
> > > > On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > > > > Sebastian it would be nice if possible to trace where the
> > > > > t->rcu_read_unlock_special is set for this scenario of calling
> > > > > rcu_read_unlock_special, to give a clear idea about whether it was
> > > > > really because of an IPI. I guess we could also add additional RCU
> > > > > debug fields to task_struct (just for debugging) to see where the
> > > > > unlock_special is set.
> > > > >
> > > > > Is there a test to reproduce this, or do I just boot an intel x86_64
> > > > > machine with "threadirqs" and run into it?
> > > >
> > > > Do you want to send me a patch or should I send you my kvm image which
> > > > triggers the bug on boot?
> > >
> > > I could reproduce this as well just booting Linus tree with threadirqs
> > > command line and running rcutorture. In 15 seconds or so it locks
> > > up... gdb backtrace shows the recursive lock:
> >
> > Sorry that got badly wrapped, so I pasted it here:
> > https://hastebin.com/ajivofomik.shell
>
> Which rcutorture scenario would that be? TREE03 is thus far refusing
> to fail for me when run this way:
>
> $ tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --trust-make --configs "TREE03" --bootargs "threadirqs"
I built x86_64_defconfig with CONFIG_PREEMPT enabled, then I ran it with
the following boot params:
rcutorture.shutdown_secs=60 rcutorture.n_barrier_cbs=4 rcutree.kthread_prio=2
and also "threadirqs"
This was not a TREE config, but just my simple RCU test using qemu.
I will try this diff and let you know.
> If it had failed, I would have tried the patch shown below. I know that
> Sebastian has some concerns about the bug happening anyway, but we have
> to start somewhere! ;-)
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 82c925df1d92..be7bafc2c0a0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -624,25 +624,16 @@ static void rcu_read_unlock_special(struct task_struct *t)
> (rdp->grpmask & rnp->expmask) ||
> tick_nohz_full_cpu(rdp->cpu);
> // Need to defer quiescent state until everything is enabled.
> - if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> - (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> - // Using softirq, safe to awaken, and we get
> - // no help from enabling irqs, unlike bh/preempt.
> - raise_softirq_irqoff(RCU_SOFTIRQ);
> - } else {
> - // Enabling BH or preempt does reschedule, so...
> - // Also if no expediting or NO_HZ_FULL, slow is OK.
> - set_tsk_need_resched(current);
> - set_preempt_need_resched();
> - if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> - !rdp->defer_qs_iw_pending && exp) {
> - // Get scheduler to re-evaluate and call hooks.
> - // If !IRQ_WORK, FQS scan will eventually IPI.
> - init_irq_work(&rdp->defer_qs_iw,
> - rcu_preempt_deferred_qs_handler);
> - rdp->defer_qs_iw_pending = true;
> - irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> - }
> + set_tsk_need_resched(current);
> + set_preempt_need_resched();
> + if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> + !rdp->defer_qs_iw_pending && exp) {
> + // Get scheduler to re-evaluate and call hooks.
> + // If !IRQ_WORK, FQS scan will eventually IPI.
> + init_irq_work(&rdp->defer_qs_iw,
> + rcu_preempt_deferred_qs_handler);
> + rdp->defer_qs_iw_pending = true;
> + irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
Nice to see the code here got simplified ;-)
On Thu, Jun 27, 2019 at 11:11:12AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 01:46:27PM -0400, Joel Fernandes wrote:
> > On Thu, Jun 27, 2019 at 1:43 PM Joel Fernandes <[email protected]> wrote:
> > >
> > > On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
> > > <[email protected]> wrote:
> > > >
> > > > On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > > > > Sebastian it would be nice if possible to trace where the
> > > > > t->rcu_read_unlock_special is set for this scenario of calling
> > > > > rcu_read_unlock_special, to give a clear idea about whether it was
> > > > > really because of an IPI. I guess we could also add additional RCU
> > > > > debug fields to task_struct (just for debugging) to see where the
> > > > > unlock_special is set.
> > > > >
> > > > > Is there a test to reproduce this, or do I just boot an intel x86_64
> > > > > machine with "threadirqs" and run into it?
> > > >
> > > > Do you want to send me a patch or should I send you my kvm image which
> > > > triggers the bug on boot?
> > >
> > > I could reproduce this as well just booting Linus tree with threadirqs
> > > command line and running rcutorture. In 15 seconds or so it locks
> > > up... gdb backtrace shows the recursive lock:
> >
> > Sorry that got badly wrapped, so I pasted it here:
> > https://hastebin.com/ajivofomik.shell
>
> Which rcutorture scenario would that be? TREE03 is thus far refusing
> to fail for me when run this way:
>
> $ tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --trust-make --configs "TREE03" --bootargs "threadirqs"
Ah, but I was running -rcu. TREE03 fails at 38 seconds for me on v5.2.
Now to find out which -rcu commit fixed it. Or at least made it much
less probable, to Sebastian's point.
> If it had failed, I would have tried the patch shown below. I know that
> Sebastian has some concerns about the bug happening anyway, but we have
> to start somewhere! ;-)
This patch might thus be completely unnecessary.
Thanx, Paul
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 82c925df1d92..be7bafc2c0a0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -624,25 +624,16 @@ static void rcu_read_unlock_special(struct task_struct *t)
> (rdp->grpmask & rnp->expmask) ||
> tick_nohz_full_cpu(rdp->cpu);
> // Need to defer quiescent state until everything is enabled.
> - if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> - (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> - // Using softirq, safe to awaken, and we get
> - // no help from enabling irqs, unlike bh/preempt.
> - raise_softirq_irqoff(RCU_SOFTIRQ);
> - } else {
> - // Enabling BH or preempt does reschedule, so...
> - // Also if no expediting or NO_HZ_FULL, slow is OK.
> - set_tsk_need_resched(current);
> - set_preempt_need_resched();
> - if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> - !rdp->defer_qs_iw_pending && exp) {
> - // Get scheduler to re-evaluate and call hooks.
> - // If !IRQ_WORK, FQS scan will eventually IPI.
> - init_irq_work(&rdp->defer_qs_iw,
> - rcu_preempt_deferred_qs_handler);
> - rdp->defer_qs_iw_pending = true;
> - irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> - }
> + set_tsk_need_resched(current);
> + set_preempt_need_resched();
> + if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> + !rdp->defer_qs_iw_pending && exp) {
> + // Get scheduler to re-evaluate and call hooks.
> + // If !IRQ_WORK, FQS scan will eventually IPI.
> + init_irq_work(&rdp->defer_qs_iw,
> + rcu_preempt_deferred_qs_handler);
> + rdp->defer_qs_iw_pending = true;
> + irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> }
> t->rcu_read_unlock_special.b.deferred_qs = true;
> local_irq_restore(flags);
>
On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> On Thu, Jun 27, 2019 at 10:38:31AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 12:47:24PM -0400, Joel Fernandes wrote:
> > > On Thu, Jun 27, 2019 at 11:55 AM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Thu, Jun 27, 2019 at 11:30:31AM -0400, Joel Fernandes wrote:
> > > > > On Thu, Jun 27, 2019 at 10:34:55AM -0400, Steven Rostedt wrote:
> > > > > > On Thu, 27 Jun 2019 10:24:36 -0400
> > > > > > Joel Fernandes <[email protected]> wrote:
> > > > > >
> > > > > > > > What am I missing here?
> > > > > > >
> > > > > > > This issue I think is
> > > > > > >
> > > > > > > (in normal process context)
> > > > > > > spin_lock_irqsave(rq_lock); // which disables both preemption and interrupt
> > > > > > > // but this was done in normal process context,
> > > > > > > // not from IRQ handler
> > > > > > > rcu_read_lock();
> > > > > > > <---------- IPI comes in and sets exp_hint
> > > > > >
> > > > > > How would an IPI come in here with interrupts disabled?
> > > > > >
> > > > > > -- Steve
> > > > >
> > > > > This is true, but could rcu_read_unlock_special() have been called for
> > > > > some reason *other* than the IPI, then?
> > > > >
> > > > > Per Sebastian's stack trace of the recursive lock scenario, it is happening
> > > > > during cpuacct_charge(), which is called with the rq_lock held.
> > > > >
> > > > > The only other reasons I know of to call rcu_read_unlock_special() are if
> > > > > 1. the tick indicated that the CPU has to report a QS
> > > > > 2. an IPI in the middle of the reader section for expedited GPs
> > > > > 3. preemption in the middle of a preemptible RCU reader section
> > > >
> > > > 4. Some previous reader section was IPIed or preempted, but either
> > > > interrupts, softirqs, or preemption was disabled across the
> > > > rcu_read_unlock() of that previous reader section.
> > >
> > > Hi Paul, I did not fully understand 4. The previous RCU reader section
> > > could not have been IPI'ed or preempted if interrupts were disabled
> > > across it. Also, if softirq/preempt is disabled across the previous
> > > reader section, the previous reader could not be preempted in that
> > > case.
> >
> > Like this, courtesy of the consolidation of RCU flavors:
> >
> > previous_reader()
> > {
> > rcu_read_lock();
> > do_something(); /* Preemption happened here. */
> > local_irq_disable(); /* Cannot be the scheduler! */
> > do_something_else();
> > rcu_read_unlock(); /* Must defer QS, task still queued. */
> > do_some_other_thing();
> > local_irq_enable();
> > }
> >
> > current_reader() /* QS from previous_reader() is still deferred. */
> > {
> > local_irq_disable(); /* Might be the scheduler. */
> > do_whatever();
> > rcu_read_lock();
> > do_whatever_else();
> > rcu_read_unlock(); /* Must still defer reporting QS. */
> > do_whatever_comes_to_mind();
> > local_irq_enable();
> > }
> >
> > Both instances of rcu_read_unlock() need to cause some later thing
> > to report the quiescent state, and in some cases it will do a wakeup.
> > Now, previous_reader()'s IRQ disabling cannot be due to scheduler rq/pi
> > locks due to the rule about holding them across the entire RCU reader
> > if they are held across the rcu_read_unlock(). But current_reader()'s
> > IRQ disabling might well be due to the scheduler rq/pi locks, so
> > current_reader() must be careful about doing wakeups.
>
> Makes sense now, thanks.
>
> > > That leaves us with the only scenario where the previous reader was
> > > IPI'ed while softirq/preempt was disabled across it. Is that what you
> > > meant?
> >
> > No, but that can also happen.
> >
> > > But in this scenario, the previous reader should have set
> > > exp_hint to false in the previous reader's rcu_read_unlock_special()
> > > invocation itself. So I would think t->rcu_read_unlock_special should
> > > be 0 during the new reader's invocation thus I did not understand how
> > > rcu_read_unlock_special can be called because of a previous reader.
> >
> > Yes, exp_hint would unconditionally be set to false in the first
> > reader's rcu_read_unlock(). But .blocked won't be.
>
> Makes sense.
>
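For reference, ->rcu_read_unlock_special is a union of per-reason bytes, so
clearing .exp_hint leaves .blocked intact. Its layout at the time (roughly,
from include/linux/sched.h in v5.2) was:

union rcu_special {
	struct {
		u8 blocked;
		u8 need_qs;
		u8 exp_hint;	/* Hint that a QS should be expedited. */
		u8 deferred_qs;
	} b;	/* Individual bits. */
	u32 s;	/* The whole set, tested at once in __rcu_read_unlock(). */
};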
> > > I'll borrow some of that confused color paint if you don't mind ;-)
> > > And we should document this somewhere for future sanity preservation
> > > :-D
> >
> > Or adjust the code and requirements to make it more sane, if feasible.
> >
> > My current (probably wildly unreliable) guess is that the conditions in
> > rcu_read_unlock_special() need adjusting. I was assuming that in_irq()
> > implies a hardirq context, in other words that in_irq() would return
> > false from a threaded interrupt handler. If in_irq() instead returns
> > true from within a threaded interrupt handler, then this code in
> > rcu_read_unlock_special() needs fixing:
> >
> > if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > // Using softirq, safe to awaken, and we get
> > // no help from enabling irqs, unlike bh/preempt.
> > raise_softirq_irqoff(RCU_SOFTIRQ);
> >
> > The fix would be replacing the calls to in_irq() with something that
> > returns true only if called from within a hardirq context.
> > Thoughts?
>
> I am not sure that this will fix all cases, though.
>
> I think the crux of the problem is doing a recursive wake up. The threaded
> IRQ probably just happens to be causing it here; it seems to me this problem
> can also occur on a non-threaded IRQ system (say, current_reader() in your
> example executing in a scheduler path in process context rather than from an
> interrupt). Is that not possible?
In the non-threaded case, invoking raise_softirq*() from hardirq context
just sets a bit in a per-CPU variable. Now, to Sebastian's point, we
are only sort of in hardirq context in this case due to being called
from irq_exit(), but the failure we are seeing might well be a ways
downstream of the actual root-cause bug.
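For context, raise_softirq_irqoff() is roughly the following (simplified from
kernel/softirq.c of this era); in interrupt context it only sets the per-CPU
pending bit, and the ksoftirqd wakeup at issue happens only on the
!in_interrupt() path:

inline void raise_softirq_irqoff(unsigned int nr)
{
	__raise_softirq_irqoff(nr);	/* set the per-CPU pending bit */

	/*
	 * In an interrupt or softirq, the pending softirq runs on
	 * return from the irq or softirq.  Otherwise ksoftirqd must
	 * be awakened; that is the wakeup that is unsafe while the
	 * scheduler's rq/pi locks are held.
	 */
	if (!in_interrupt())
		wakeup_softirqd();
}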
> I think the fix should be to prevent the wake-up based not on whether we are
> in hard/soft-interrupt mode but on whether we are doing the rcu_read_unlock()
> from a scheduler path (if we can detect that).
Or just don't do the wakeup at all, if it comes to that. I don't know
of any way to determine whether rcu_read_unlock() is being called from
the scheduler, but it has been some time since I asked Peter Zijlstra
about that.
Of course, unconditionally refusing to do the wakeup might not be a happy
thing for NO_HZ_FULL kernels that don't implement IRQ work.
> I lost track of this code:
> if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
>
> Was this patch posted to the list? I will blame it to try to get some
> context. It sounds like you added more conditions on when to kick the
> softirq.
This is from the dev branch of my -rcu tree. It has at least one
patch in this area that is currently slated for v5.4, so I would not
have sent that as part of an official patch series.
> > Ugh. Same question about IRQ work. Will the current use of it by
> > rcu_read_unlock_special() cause breakage in the presence of threaded
> > interrupt handlers?
>
> /me needs to understand why the irq work stuff was added here as well. Have
> my work cut out for the day! ;-)
New code, so more likely to contain bugs than usual. ;-)
The point was to get a wakeup soonish without risk of rq/pi deadlocks.
Thanx, Paul
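For context, the irq_work handler queued above is deliberately minimal. In
the -rcu commit that introduced it (0864f057b050, listed later in this
thread), it just clears the pending flag; the real effect is the IPI itself,
which forces the scheduler hooks to run:

static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
{
	struct rcu_data *rdp;

	rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
	rdp->defer_qs_iw_pending = false;
}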
> thanks,
>
> - Joel
>
>
> >
> > Thanx, Paul
> >
> > > thanks,
> > > - Joel
> > >
> > >
> > >
> > > >
> > > > I -think- that this is what Sebastian is seeing.
> > > >
> > > > Thanx, Paul
> > > >
> > > > > 1. and 2. are not possible because interrupts are disabled; that's why the
> > > > > ksoftirqd wakeup even happened.
> > > > > 3. is not possible because we are holding rq_lock in the RCU reader section.
> > > > >
> > > > > So I am at a bit of a loss how this can happen :-(
> > > > >
> > > > > Perhaps a spurious call to rcu_read_unlock_special(), when it should not
> > > > > have been called?
> > > > >
> > > > > thanks,
> > > > >
> > > > > - Joel
> >
>
On Thu, Jun 27, 2019 at 02:27:22PM -0400, Joel Fernandes wrote:
> On Thu, Jun 27, 2019 at 11:11:12AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 01:46:27PM -0400, Joel Fernandes wrote:
> > > On Thu, Jun 27, 2019 at 1:43 PM Joel Fernandes <[email protected]> wrote:
> > > >
> > > > On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
> > > > <[email protected]> wrote:
> > > > >
> > > > > On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > > > > > Sebastian it would be nice if possible to trace where the
> > > > > > t->rcu_read_unlock_special is set for this scenario of calling
> > > > > > rcu_read_unlock_special, to give a clear idea about whether it was
> > > > > > really because of an IPI. I guess we could also add additional RCU
> > > > > > debug fields to task_struct (just for debugging) to see where the
> > > > > > unlock_special is set.
> > > > > >
> > > > > > Is there a test to reproduce this, or do I just boot an intel x86_64
> > > > > > machine with "threadirqs" and run into it?
> > > > >
> > > > > Do you want to send me a patch or should I send you my kvm image which
> > > > > triggers the bug on boot?
> > > >
> > > > I could reproduce this as well just booting Linus tree with threadirqs
> > > > command line and running rcutorture. In 15 seconds or so it locks
> > > > up... gdb backtrace shows the recursive lock:
> > >
> > > Sorry that got badly wrapped, so I pasted it here:
> > > https://hastebin.com/ajivofomik.shell
> >
> > Which rcutorture scenario would that be? TREE03 is thus far refusing
> > to fail for me when run this way:
> >
> > $ tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --trust-make --configs "TREE03" --bootargs "threadirqs"
>
> I built x86_64_defconfig with CONFIG_PREEMPT enabled, then I ran it with
> following boot params:
> rcutorture.shutdown_secs=60 rcutorture.n_barrier_cbs=4 rcutree.kthread_prio=2
>
> and also "threadirqs"
>
> This was not a TREE config, but just my simple RCU test using qemu.
Ah, it seems that the issue is reproducible only in Linus's tree (which matches
the initial diff Sebastian posted). It cannot be reproduced with your dev
branch. So perhaps the in_irq() check indeed works.
Looking further, in_irq() tests the HARDIRQ_MASK bits of the preempt_count,
which are set courtesy of:
#define __irq_enter() \
do { \
account_irq_enter_time(current); \
preempt_count_add(HARDIRQ_OFFSET); \
trace_hardirq_enter(); \
} while (0)
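And in_irq() tests exactly those bits (include/linux/preempt.h):

#define hardirq_count()	(preempt_count() & HARDIRQ_MASK)
#define in_irq()	(hardirq_count())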
I dumped the stack at this point as well, even with "threadirqs", just to
double-check that this is the case.
So probably the in_irq() check is sufficient. However, I am still a bit
nervous about this issue manifesting in other paths of the scheduler that
do not execute from an interrupt handler but would still have RCU reader
sections with spinlocks held. I am not sure whether this is possible, but
it does make me nervous.
Thanks!
>
>
> I will try this diff and let you know.
>
> > If it had failed, I would have tried the patch shown below. I know that
> > Sebastian has some concerns about the bug happening anyway, but we have
> > to start somewhere! ;-)
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 82c925df1d92..be7bafc2c0a0 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -624,25 +624,16 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > (rdp->grpmask & rnp->expmask) ||
> > tick_nohz_full_cpu(rdp->cpu);
> > // Need to defer quiescent state until everything is enabled.
> > - if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > - (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > - // Using softirq, safe to awaken, and we get
> > - // no help from enabling irqs, unlike bh/preempt.
> > - raise_softirq_irqoff(RCU_SOFTIRQ);
> > - } else {
> > - // Enabling BH or preempt does reschedule, so...
> > - // Also if no expediting or NO_HZ_FULL, slow is OK.
> > - set_tsk_need_resched(current);
> > - set_preempt_need_resched();
> > - if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> > - !rdp->defer_qs_iw_pending && exp) {
> > - // Get scheduler to re-evaluate and call hooks.
> > - // If !IRQ_WORK, FQS scan will eventually IPI.
> > - init_irq_work(&rdp->defer_qs_iw,
> > - rcu_preempt_deferred_qs_handler);
> > - rdp->defer_qs_iw_pending = true;
> > - irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> > - }
> > + set_tsk_need_resched(current);
> > + set_preempt_need_resched();
> > + if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> > + !rdp->defer_qs_iw_pending && exp) {
> > + // Get scheduler to re-evaluate and call hooks.
> > + // If !IRQ_WORK, FQS scan will eventually IPI.
> > + init_irq_work(&rdp->defer_qs_iw,
> > + rcu_preempt_deferred_qs_handler);
> > + rdp->defer_qs_iw_pending = true;
> > + irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>
> Nice to see the code here got simplified ;-)
>
On Thu, Jun 27, 2019 at 02:51:03PM -0400, Joel Fernandes wrote:
> On Thu, Jun 27, 2019 at 02:27:22PM -0400, Joel Fernandes wrote:
> > On Thu, Jun 27, 2019 at 11:11:12AM -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 27, 2019 at 01:46:27PM -0400, Joel Fernandes wrote:
> > > > On Thu, Jun 27, 2019 at 1:43 PM Joel Fernandes <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > > > > > > Sebastian it would be nice if possible to trace where the
> > > > > > > t->rcu_read_unlock_special is set for this scenario of calling
> > > > > > > rcu_read_unlock_special, to give a clear idea about whether it was
> > > > > > > really because of an IPI. I guess we could also add additional RCU
> > > > > > > debug fields to task_struct (just for debugging) to see where the
> > > > > > > unlock_special is set.
> > > > > > >
> > > > > > > Is there a test to reproduce this, or do I just boot an intel x86_64
> > > > > > > machine with "threadirqs" and run into it?
> > > > > >
> > > > > > Do you want to send me a patch or should I send you my kvm image which
> > > > > > triggers the bug on boot?
> > > > >
> > > > > I could reproduce this as well just booting Linus tree with threadirqs
> > > > > command line and running rcutorture. In 15 seconds or so it locks
> > > > > up... gdb backtrace shows the recursive lock:
> > > >
> > > > Sorry that got badly wrapped, so I pasted it here:
> > > > https://hastebin.com/ajivofomik.shell
> > >
> > > Which rcutorture scenario would that be? TREE03 is thus far refusing
> > > to fail for me when run this way:
> > >
> > > $ tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --trust-make --configs "TREE03" --bootargs "threadirqs"
> >
> > I built x86_64_defconfig with CONFIG_PREEMPT enabled, then I ran it with
> > following boot params:
> > rcutorture.shutdown_secs=60 rcutorture.n_barrier_cbs=4 rcutree.kthread_prio=2
> >
> > and also "threadirqs"
> >
> > This was not a TREE config, but just my simple RCU test using qemu.
>
> Ah, it seems that the issue is reproducible only in Linus's tree (which matches
> the initial diff Sebastian posted). It cannot be reproduced with your dev
> branch. So perhaps the in_irq() check indeed works.
>
> Looking further, in_irq() tests the HARDIRQ_MASK bits of the preempt_count,
> which are set courtesy of:
> #define __irq_enter() \
> do { \
> account_irq_enter_time(current); \
> preempt_count_add(HARDIRQ_OFFSET); \
> trace_hardirq_enter(); \
> } while (0)
>
> I dumped the stack at this point as well, even with "threadirqs", just to
> double-check that this is the case.
>
> So probably the in_irq() check is sufficient. However, I am still a bit
> nervous about this issue manifesting in other paths of the scheduler that
> do not execute from an interrupt handler but would still have RCU reader
> sections with spinlocks held. I am not sure whether this is possible, but
> it does make me nervous.
I have gotten back to this -rcu commit:
385b599e8c04 ("rcu: Allow rcu_read_unlock_special() to raise_softirq() if in_irq()")
It works there, and that will be part of my pull request later today.
I am continuing an informal manual bisection. ;-)
Thanx, Paul
> Thanks!
>
> >
> >
> > I will try this diff and let you know.
> >
> > > If it had failed, I would have tried the patch shown below. I know that
> > > Sebastian has some concerns about the bug happening anyway, but we have
> > > to start somewhere! ;-)
> > >
> > > ------------------------------------------------------------------------
> > >
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 82c925df1d92..be7bafc2c0a0 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -624,25 +624,16 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > > (rdp->grpmask & rnp->expmask) ||
> > > tick_nohz_full_cpu(rdp->cpu);
> > > // Need to defer quiescent state until everything is enabled.
> > > - if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > > - (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > > - // Using softirq, safe to awaken, and we get
> > > - // no help from enabling irqs, unlike bh/preempt.
> > > - raise_softirq_irqoff(RCU_SOFTIRQ);
> > > - } else {
> > > - // Enabling BH or preempt does reschedule, so...
> > > - // Also if no expediting or NO_HZ_FULL, slow is OK.
> > > - set_tsk_need_resched(current);
> > > - set_preempt_need_resched();
> > > - if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> > > - !rdp->defer_qs_iw_pending && exp) {
> > > - // Get scheduler to re-evaluate and call hooks.
> > > - // If !IRQ_WORK, FQS scan will eventually IPI.
> > > - init_irq_work(&rdp->defer_qs_iw,
> > > - rcu_preempt_deferred_qs_handler);
> > > - rdp->defer_qs_iw_pending = true;
> > > - irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> > > - }
> > > + set_tsk_need_resched(current);
> > > + set_preempt_need_resched();
> > > + if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> > > + !rdp->defer_qs_iw_pending && exp) {
> > > + // Get scheduler to re-evaluate and call hooks.
> > > + // If !IRQ_WORK, FQS scan will eventually IPI.
> > > + init_irq_work(&rdp->defer_qs_iw,
> > > + rcu_preempt_deferred_qs_handler);
> > > + rdp->defer_qs_iw_pending = true;
> > > + irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> >
> > Nice to see the code here got simplified ;-)
> >
On Thu, Jun 27, 2019 at 02:27:22PM -0400, Joel Fernandes wrote:
> On Thu, Jun 27, 2019 at 11:11:12AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 01:46:27PM -0400, Joel Fernandes wrote:
> > > On Thu, Jun 27, 2019 at 1:43 PM Joel Fernandes <[email protected]> wrote:
> > > >
> > > > On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
> > > > <[email protected]> wrote:
> > > > >
> > > > > On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > > > > > Sebastian it would be nice if possible to trace where the
> > > > > > t->rcu_read_unlock_special is set for this scenario of calling
> > > > > > rcu_read_unlock_special, to give a clear idea about whether it was
> > > > > > really because of an IPI. I guess we could also add additional RCU
> > > > > > debug fields to task_struct (just for debugging) to see where the
> > > > > > unlock_special is set.
> > > > > >
> > > > > > Is there a test to reproduce this, or do I just boot an intel x86_64
> > > > > > machine with "threadirqs" and run into it?
> > > > >
> > > > > Do you want to send me a patch or should I send you my kvm image which
> > > > > triggers the bug on boot?
> > > >
> > > > I could reproduce this as well just booting Linus tree with threadirqs
> > > > command line and running rcutorture. In 15 seconds or so it locks
> > > > up... gdb backtrace shows the recursive lock:
> > >
> > > Sorry that got badly wrapped, so I pasted it here:
> > > https://hastebin.com/ajivofomik.shell
> >
> > Which rcutorture scenario would that be? TREE03 is thus far refusing
> > to fail for me when run this way:
> >
> > $ tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --trust-make --configs "TREE03" --bootargs "threadirqs"
>
> I built x86_64_defconfig with CONFIG_PREEMPT enabled, then I ran it with
> following boot params:
> rcutorture.shutdown_secs=60 rcutorture.n_barrier_cbs=4 rcutree.kthread_prio=2
>
> and also "threadirqs"
>
> This was not a TREE config, but just my simple RCU test using qemu.
>
>
> I will try this diff and let you know.
>
> > If it had failed, I would have tried the patch shown below. I know that
> > Sebastian has some concerns about the bug happening anyway, but we have
> > to start somewhere! ;-)
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 82c925df1d92..be7bafc2c0a0 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -624,25 +624,16 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > (rdp->grpmask & rnp->expmask) ||
> > tick_nohz_full_cpu(rdp->cpu);
> > // Need to defer quiescent state until everything is enabled.
> > - if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > - (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > - // Using softirq, safe to awaken, and we get
> > - // no help from enabling irqs, unlike bh/preempt.
> > - raise_softirq_irqoff(RCU_SOFTIRQ);
> > - } else {
> > - // Enabling BH or preempt does reschedule, so...
> > - // Also if no expediting or NO_HZ_FULL, slow is OK.
> > - set_tsk_need_resched(current);
> > - set_preempt_need_resched();
> > - if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> > - !rdp->defer_qs_iw_pending && exp) {
> > - // Get scheduler to re-evaluate and call hooks.
> > - // If !IRQ_WORK, FQS scan will eventually IPI.
> > - init_irq_work(&rdp->defer_qs_iw,
> > - rcu_preempt_deferred_qs_handler);
> > - rdp->defer_qs_iw_pending = true;
> > - irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> > - }
> > + set_tsk_need_resched(current);
> > + set_preempt_need_resched();
> > + if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> > + !rdp->defer_qs_iw_pending && exp) {
> > + // Get scheduler to re-evaluate and call hooks.
> > + // If !IRQ_WORK, FQS scan will eventually IPI.
> > + init_irq_work(&rdp->defer_qs_iw,
> > + rcu_preempt_deferred_qs_handler);
> > + rdp->defer_qs_iw_pending = true;
> > + irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>
> Nice to see the code here got simplified ;-)
Assuming that it still works... But it also looks to be unnecessary,
at least in brief testing. I will of course be adding threadirqs
to TREE03.boot. :-/
Thanx, Paul
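For reference, assuming the usual rcutorture layout, that is a one-line
addition to tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot:

threadirqs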
On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> >
> > I think the fix should be to prevent the wake-up based not on whether we are
> > in hard/soft-interrupt mode but on whether we are doing the rcu_read_unlock()
> > from a scheduler path (if we can detect that).
>
> Or just don't do the wakeup at all, if it comes to that. I don't know
> of any way to determine whether rcu_read_unlock() is being called from
> the scheduler, but it has been some time since I asked Peter Zijlstra
> about that.
>
> Of course, unconditionally refusing to do the wakeup might not be a happy
> thing for NO_HZ_FULL kernels that don't implement IRQ work.
Couldn't smp_send_reschedule() be used instead?
-Scott
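As a purely illustrative sketch (not a posted patch), that suggestion would
replace the irq_work sequence in the patch above with something like the
following; smp_send_reschedule() just fires the scheduler IPI at the target
CPU, so the sending side performs no task wakeup and takes no rq/pi locks:

	/* Hypothetical alternative to the irq_work path shown earlier. */
	set_tsk_need_resched(current);
	set_preempt_need_resched();
	if (irqs_were_disabled && exp)
		/* Scheduler hooks run on return from the IPI. */
		smp_send_reschedule(rdp->cpu);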
On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > >
> > > I think the fix should be to prevent the wake-up based not on whether we are
> > > in hard/soft-interrupt mode but on whether we are doing the rcu_read_unlock()
> > > from a scheduler path (if we can detect that).
> >
> > Or just don't do the wakeup at all, if it comes to that. I don't know
> > of any way to determine whether rcu_read_unlock() is being called from
> > the scheduler, but it has been some time since I asked Peter Zijlstra
> > about that.
> >
> > Of course, unconditionally refusing to do the wakeup might not be a happy
> > thing for NO_HZ_FULL kernels that don't implement IRQ work.
>
> Couldn't smp_send_reschedule() be used instead?
Good point. If current -rcu doesn't fix things for Sebastian's case,
that would be well worth looking at. But there must be some reason
why Peter Zijlstra didn't suggest it when he instead suggested using
the IRQ work approach.
Peter, thoughts?
Thanx, Paul
On Thu, Jun 27, 2019 at 11:30:22AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 11:11:12AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 01:46:27PM -0400, Joel Fernandes wrote:
> > > On Thu, Jun 27, 2019 at 1:43 PM Joel Fernandes <[email protected]> wrote:
> > > >
> > > > On Thu, Jun 27, 2019 at 11:40 AM Sebastian Andrzej Siewior
> > > > <[email protected]> wrote:
> > > > >
> > > > > On 2019-06-27 11:37:10 [-0400], Joel Fernandes wrote:
> > > > > > Sebastian it would be nice if possible to trace where the
> > > > > > t->rcu_read_unlock_special is set for this scenario of calling
> > > > > > rcu_read_unlock_special, to give a clear idea about whether it was
> > > > > > really because of an IPI. I guess we could also add additional RCU
> > > > > > debug fields to task_struct (just for debugging) to see where the
> > > > > > unlock_special is set.
> > > > > >
> > > > > > Is there a test to reproduce this, or do I just boot an intel x86_64
> > > > > > machine with "threadirqs" and run into it?
> > > > >
> > > > > Do you want to send me a patch or should I send you my kvm image which
> > > > > triggers the bug on boot?
> > > >
> > > > I could reproduce this as well just booting Linus tree with threadirqs
> > > > command line and running rcutorture. In 15 seconds or so it locks
> > > > up... gdb backtrace shows the recursive lock:
> > >
> > > Sorry that got badly wrapped, so I pasted it here:
> > > https://hastebin.com/ajivofomik.shell
> >
> > Which rcutorture scenario would that be? TREE03 is thus far refusing
> > to fail for me when run this way:
> >
> > $ tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --trust-make --configs "TREE03" --bootargs "threadirqs"
>
> Ah, but I was running -rcu. TREE03 fails at 38 seconds for me on v5.2.
>
> Now to find out which -rcu commit fixed it. Or at least made it much
> less probable, to Sebastian's point.
And it still works before this one:
a69987a515c8 ("rcu: Simplify rcu_read_unlock_special() deferred wakeups")
(This is included in the dev branch of the -rcu tree, and is
currently slated for v5.4.)
And it works at these commits:
0864f057b050 ("rcu: Use irq_work to get scheduler's attention in clean context")
385b599e8c04 ("rcu: Allow rcu_read_unlock_special() to raise_softirq() if in_irq()")
25102de65fdd ("rcu: Only do rcu_read_unlock_special() wakeups if expedited")
23634ebc1d94 ("rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()")
I checked the last one twice, both completing without problems
other than false positives due to security-induced pointer
obfuscation. (These will be included in my v5.3 pull request.)
But not at this commit:
48d07c04b4cc ("rcu: Enable elimination of Tree-RCU softirq processing")
This gets RCU CPU stall warnings rather than the double wakeup
of ksoftirqd. Works fine without threadirqs.
v5.2-rc1 locks up hard at early boot (three of three attempts):
[ 2.525153] clocksource: Switched to clocksource tsc
[ 2.878858] random: fast init done
[ 2.881122] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 2.884183] probe of serio1 returned 1 after 671008 usecs
[ 9.969474] hrtimer: interrupt took 3992554 ns
v5.1 gets the double wakeup of ksoftirqd.
So it looks like I need to get that pull request sent out, doesn't it? ;-)
Thanx, Paul
On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > >
> > > > I think the fix should be to prevent the wake-up based not on whether we are
> > > > in hard/soft-interrupt mode but on whether we are doing the rcu_read_unlock()
> > > > from a scheduler path (if we can detect that).
> > >
> > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > of any way to determine whether rcu_read_unlock() is being called from
> > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > about that.
> > >
> > > Of course, unconditionally refusing to do the wakeup might not be a happy
> > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> >
> > Couldn't smp_send_reschedule() be used instead?
>
> Good point. If current -rcu doesn't fix things for Sebastian's case,
> that would be well worth looking at. But there must be some reason
> why Peter Zijlstra didn't suggest it when he instead suggested using
> the IRQ work approach.
>
> Peter, thoughts?
Hello,
Isn't the following scenario possible?
The original code
-----------------
rcu_read_lock();
...
/* Expedite */
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
...
__rcu_read_unlock();
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
rcu_preempt_deferred_qs_irqrestore(t, flags);
barrier(); /* ->rcu_read_unlock_special load before assign */
t->rcu_read_lock_nesting = 0;
The reordered code by machine
-----------------------------
rcu_read_lock();
...
/* Expedite */
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
...
__rcu_read_unlock();
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
rcu_preempt_deferred_qs_irqrestore(t, flags);
barrier(); /* ->rcu_read_unlock_special load before assign */
An interrupt happens
--------------------
rcu_read_lock();
...
/* Expedite */
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
...
__rcu_read_unlock();
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
<--- Handle an (any) irq
rcu_read_lock();
/* This call should be skipped */
rcu_read_unlock_special(t);
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
rcu_preempt_deferred_qs_irqrestore(t, flags);
barrier(); /* ->rcu_read_unlock_special load before assign */
We should not have to handle the special work twice like this, which is one
cause of the problem, though another problem is of course calling ttwu
without being aware that it is within a context holding the pi lock.
Apart from the discussion about how to avoid ttwu in an improper
condition, I think the following is necessary. I may be missing
something, so I would appreciate it if you let me know in case I'm wrong.
Anyway, logically I think we should prevent reordering between
t->rcu_read_lock_nesting and t->rcu_read_unlock_special.b.exp_hint not
only by the compiler but also by the CPU, as below.
Am I missing something?
Thanks,
Byungchul
---8<---
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3c8444e..9b137f1 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -412,7 +412,13 @@ void __rcu_read_unlock(void)
barrier(); /* assign before ->rcu_read_unlock_special load */
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
- barrier(); /* ->rcu_read_unlock_special load before assign */
+ /*
+ * Prevent reordering between clearing
+ * t->rcu_read_unlock_special in
+ * rcu_read_unlock_special() and the following
+ * assignment to t->rcu_read_lock_nesting.
+ */
+ smp_wmb();
t->rcu_read_lock_nesting = 0;
}
if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
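For readers following along, the distinction being drawn here: barrier() is
a compiler-only fence, while smp_wmb() also orders the stores as observed by
other CPUs. Roughly, in kernels of this era:

/* Compiler barrier: constrains the compiler, not the CPU. */
#define barrier()	__asm__ __volatile__("": : :"memory")

/*
 * smp_wmb() orders prior stores before later stores for all CPUs.
 * On x86 (whose stores are already ordered) it is just barrier();
 * on arm64 it is a real store fence, dmb(ishst).
 */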
On Fri, Jun 28, 2019 at 04:31:38PM +0900, Byungchul Park wrote:
> On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > >
> > > > > I think the fix should be to prevent the wake-up based not on whether we are
> > > > > in hard/soft-interrupt mode but on whether we are doing the rcu_read_unlock()
> > > > > from a scheduler path (if we can detect that).
> > > >
> > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > about that.
> > > >
> > > > Of course, unconditionally refusing to do the wakeup might not be a happy
> > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > >
> > > Couldn't smp_send_reschedule() be used instead?
> >
> > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > that would be well worth looking at. But there must be some reason
> > why Peter Zijlstra didn't suggest it when he instead suggested using
> > the IRQ work approach.
> >
> > Peter, thoughts?
>
+cc [email protected]
(I'm sorry for more noise on the thread.)
> Hello,
>
> Isn't the following scenario possible?
>
> The original code
> -----------------
> rcu_read_lock();
> ...
> /* Expedite */
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> ...
> __rcu_read_unlock();
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> barrier(); /* ->rcu_read_unlock_special load before assign */
> t->rcu_read_lock_nesting = 0;
>
> The reordered code by machine
> -----------------------------
> rcu_read_lock();
> ...
> /* Expedite */
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> ...
> __rcu_read_unlock();
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> barrier(); /* ->rcu_read_unlock_special load before assign */
>
> An interrupt happens
> --------------------
> rcu_read_lock();
> ...
> /* Expedite */
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> ...
> __rcu_read_unlock();
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> <--- Handle an (any) irq
> rcu_read_lock();
> /* This call should be skipped */
> rcu_read_unlock_special(t);
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> barrier(); /* ->rcu_read_unlock_special load before assign */
>
> We should not have to handle the special work twice like this, which is one
> cause of the problem, though another problem is of course calling ttwu
> without being aware that it is within a context holding the pi lock.
>
> Apart from the discussion about how to avoid ttwu in an improper
> condition, I think the following is necessary. I may be missing
> something, so I would appreciate it if you let me know in case I'm wrong.
>
> Anyway, logically I think we should prevent reordering between
> t->rcu_read_lock_nesting and t->rcu_read_unlock_special.b.exp_hint not
> only by the compiler but also by the CPU, as below.
>
> Am I missing something?
>
> Thanks,
> Byungchul
>
> ---8<---
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 3c8444e..9b137f1 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -412,7 +412,13 @@ void __rcu_read_unlock(void)
> barrier(); /* assign before ->rcu_read_unlock_special load */
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> - barrier(); /* ->rcu_read_unlock_special load before assign */
> + /*
> + * Prevent reordering between clearing
> + * t->rcu_read_unlock_special in
> + * rcu_read_unlock_special() and the following
> + * assignment to t->rcu_read_lock_nesting.
> + */
> + smp_wmb();
> t->rcu_read_lock_nesting = 0;
> }
> if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>
>
On Fri, Jun 28, 2019 at 04:43:50PM +0900, Byungchul Park wrote:
> On Fri, Jun 28, 2019 at 04:31:38PM +0900, Byungchul Park wrote:
> > On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > > >
> > > > > > I think the fix should be to prevent the wake-up based not on whether we are
> > > > > > in hard/soft-interrupt mode but on whether we are doing the rcu_read_unlock()
> > > > > > from a scheduler path (if we can detect that).
> > > > >
> > > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > > about that.
> > > > >
> > > > > Of course, unconditionally refusing to do the wakeup might not be a happy
> > > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > > >
> > > > Couldn't smp_send_reschedule() be used instead?
> > >
> > > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > > that would be well worth looking at. But there must be some reason
> > > why Peter Zijlstra didn't suggest it when he instead suggested using
> > > the IRQ work approach.
> > >
> > > Peter, thoughts?
> >
>
> +cc [email protected]
> (I'm sorry for more noise on the thread.)
>
> > Hello,
> >
> > Isn't the following scenario possible?
> >
> > The original code
> > -----------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> > t->rcu_read_lock_nesting = 0;
> >
> > The reordered code by machine
> > -----------------------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> >
> > An interrupt happens
> > --------------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > <--- Handle an (any) irq
> > rcu_read_lock();
> > /* This call should be skipped */
> > rcu_read_unlock_special(t);
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> >
> > We should not have to handle the special work twice like this, which is one
> > cause of the problem, though another problem is of course calling ttwu
> > without being aware that it is within a context holding the pi lock.
> >
> > Apart from the discussion about how to avoid ttwu in an improper
> > condition, I think the following is necessary. I may be missing
> > something, so I would appreciate it if you let me know in case I'm wrong.
> >
> > Anyway, logically I think we should prevent reordering between
> > t->rcu_read_lock_nesting and t->rcu_read_unlock_special.b.exp_hint not
> > only by the compiler but also by the CPU, as below.
> >
> > Am I missing something?
> >
> > Thanks,
> > Byungchul
> >
> > ---8<---
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 3c8444e..9b137f1 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -412,7 +412,13 @@ void __rcu_read_unlock(void)
> > barrier(); /* assign before ->rcu_read_unlock_special load */
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > - barrier(); /* ->rcu_read_unlock_special load before assign */
> > + /*
> > + * Prevent reordering between clearing
> > + * t->rcu_read_unlock_special in
> > + * rcu_read_unlock_special() and the following
> > + * assignment to t->rcu_read_lock_nesting.
> > + */
> > + smp_wmb();
Ah. But the problem is that this makes rcu_read_unlock() heavier, which is
too bad. We need to consider something else. But I am still curious
whether the scenario I described is correct.
> > t->rcu_read_lock_nesting = 0;
> > }
> > if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> >
> >
On Fri, Jun 28, 2019 at 05:14:32PM +0900, Byungchul Park wrote:
> On Fri, Jun 28, 2019 at 04:43:50PM +0900, Byungchul Park wrote:
> > On Fri, Jun 28, 2019 at 04:31:38PM +0900, Byungchul Park wrote:
> > > On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > > > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > > > >
> > > > > > > I think the fix should be to prevent the wake-up based not on whether we are
> > > > > > > in hard/soft-interrupt mode but on whether we are doing the rcu_read_unlock()
> > > > > > > from a scheduler path (if we can detect that).
> > > > > >
> > > > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > > > about that.
> > > > > >
> > > > > > Of course, unconditionally refusing to do the wakeup might not be a happy
> > > > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > > > >
> > > > > Couldn't smp_send_reschedule() be used instead?
> > > >
> > > > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > > > that would be well worth looking at. But there must be some reason
> > > > why Peter Zijlstra didn't suggest it when he instead suggested using
> > > > the IRQ work approach.
> > > >
> > > > Peter, thoughts?
> > >
> >
> > +cc [email protected]
> > (I'm sorry for more noise on the thread.)
> >
> > > Hello,
> > >
> > > Isn't the following scenario possible?
> > >
> > > The original code
> > > -----------------
> > > rcu_read_lock();
> > > ...
> > > /* Expedite */
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > ...
> > > __rcu_read_unlock();
> > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > rcu_read_unlock_special(t);
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > barrier(); /* ->rcu_read_unlock_special load before assign */
> > > t->rcu_read_lock_nesting = 0;
> > >
> > > The reordered code by machine
> > > -----------------------------
> > > rcu_read_lock();
> > > ...
> > > /* Expedite */
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > ...
> > > __rcu_read_unlock();
> > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > rcu_read_unlock_special(t);
> > > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > barrier(); /* ->rcu_read_unlock_special load before assign */
> > >
> > > An interrupt happens
> > > --------------------
> > > rcu_read_lock();
> > > ...
> > > /* Expedite */
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > ...
> > > __rcu_read_unlock();
> > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > rcu_read_unlock_special(t);
> > > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > > <--- Handle an (any) irq
> > > rcu_read_lock();
> > > /* This call should be skipped */
> > > rcu_read_unlock_special(t);
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > barrier(); /* ->rcu_read_unlock_special load before assign */
> > >
> > > We should not have to handle the special work twice like this, which is one
> > > cause of the problem, though another problem is of course calling ttwu
> > > without being aware that it is within a context holding the pi lock.
> > >
> > > Apart from the discussion about how to avoid ttwu in an improper
> > > condition, I think the following is necessary. I may be missing
> > > something, so I would appreciate it if you let me know in case I'm wrong.
> > >
> > > Anyway, logically I think we should prevent reordering between
> > > t->rcu_read_lock_nesting and t->rcu_read_unlock_special.b.exp_hint not
> > > only by the compiler but also by the CPU, as below.
> > >
> > > Am I missing something?
> > >
> > > Thanks,
> > > Byungchul
> > >
> > > ---8<---
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 3c8444e..9b137f1 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -412,7 +412,13 @@ void __rcu_read_unlock(void)
> > > barrier(); /* assign before ->rcu_read_unlock_special load */
> > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > rcu_read_unlock_special(t);
> > > - barrier(); /* ->rcu_read_unlock_special load before assign */
> > > + /*
> > > + * Prevent reordering between clearing
> > > + * t->rcu_read_unlock_special in
> > > + * rcu_read_unlock_special() and the following
> > > + * assignment to t->rcu_read_lock_nesting.
> > > + */
> > > + smp_wmb();
>
> Ah. But the problem is that this makes rcu_read_unlock() heavier, which is
> too bad. We need to consider something else. But I am still curious
> whether the scenario I described is correct.
Instead, this patch should be replaced with the following:
---8<---
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3c8444e..f103e98 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -624,8 +624,15 @@ static void rcu_read_unlock_special(struct task_struct *t)
local_irq_save(flags);
irqs_were_disabled = irqs_disabled_flags(flags);
+
+ WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
+ /*
+ * Prevent reordering between rcu_read_unlock_special.b.exp_hint
+ * above and rcu_read_lock_nesting outside of this function.
+ */
+ smp_wmb();
+
if (preempt_bh_were_disabled || irqs_were_disabled) {
- WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
/* Need to defer quiescent state until everything is enabled. */
if (irqs_were_disabled) {
/* Enabling irqs does not reschedule, so... */
@@ -638,7 +645,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
local_irq_restore(flags);
return;
}
- WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
rcu_preempt_deferred_qs_irqrestore(t, flags);
}
On Fri, Jun 28, 2019 at 04:43:50PM +0900, Byungchul Park wrote:
> On Fri, Jun 28, 2019 at 04:31:38PM +0900, Byungchul Park wrote:
> > On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > > >
> > > > > > I think the fix should be to prevent the wake-up based not on whether we are
> > > > > > in hard/soft-interrupt mode but on whether we are doing the rcu_read_unlock()
> > > > > > from a scheduler path (if we can detect that).
> > > > >
> > > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > > about that.
> > > > >
> > > > > Of course, unconditionally refusing to do the wakeup might not be a happy
> > > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > > >
> > > > Couldn't smp_send_reschedule() be used instead?
> > >
> > > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > > that would be well worth looking at. But there must be some reason
> > > why Peter Zijlstra didn't suggest it when he instead suggested using
> > > the IRQ work approach.
> > >
> > > Peter, thoughts?
> >
>
> +cc [email protected]
> (I'm sorry for more noise on the thread.)
>
> > Hello,
> >
> > Isn't the following scenario possible?
> >
> > The original code
> > -----------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> > t->rcu_read_lock_nesting = 0;
> >
> > The reordered code by machine
> > -----------------------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> >
> > An interrupt happens
> > --------------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > <--- Handle an (any) irq
> > rcu_read_lock();
> > /* This call should be skipped */
> > rcu_read_unlock_special(t);
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
I was confused; it was a LOAD access. The example should be changed a bit.
The original code
-----------------
rcu_read_lock();
...
/* Expedite */
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
...
__rcu_read_unlock();
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
rcu_preempt_deferred_qs_irqrestore(t, flags);
barrier(); /* ->rcu_read_unlock_special load before assign */
t->rcu_read_lock_nesting = 0;
The reordered code by machine
-----------------------------
rcu_read_lock();
...
/* Expedite */
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
...
__rcu_read_unlock();
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
rcu_preempt_deferred_qs_irqrestore(t, flags);
barrier(); /* ->rcu_read_unlock_special load before assign */
t->rcu_read_lock_nesting = 0;
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
An interrupt happens
--------------------
rcu_read_lock();
...
/* Expedite */
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
...
__rcu_read_unlock();
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
rcu_read_unlock_special(t);
rcu_preempt_deferred_qs_irqrestore(t, flags);
barrier(); /* ->rcu_read_unlock_special load before assign */
t->rcu_read_lock_nesting = 0;
<--- Handle an (any) irq
rcu_read_lock();
/* This call should be skipped */
rcu_read_unlock_special(t);
WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
Now that I have re-made the example, I'm afraid there is no problem after
all, because everything happens within a single CPU, which can see into its
own store buffer.
I'm sorry. Please ignore my suggestion here.
On Fri, Jun 28, 2019 at 06:10:42PM +0900, Byungchul Park wrote:
> On Fri, Jun 28, 2019 at 04:43:50PM +0900, Byungchul Park wrote:
> > On Fri, Jun 28, 2019 at 04:31:38PM +0900, Byungchul Park wrote:
> > > On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > > > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > > > >
> > > > > > > I think the fix should be to prevent the wake-up based not on whether we are
> > > > > > > in hard/soft-interrupt mode but on whether we are doing the rcu_read_unlock()
> > > > > > > from a scheduler path (if we can detect that).
> > > > > >
> > > > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > > > about that.
> > > > > >
> > > > > > Of course, unconditionally refusing to do the wakeup might not be a happy
> > > > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > > > >
> > > > > Couldn't smp_send_reschedule() be used instead?
> > > >
> > > > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > > > that would be well worth looking at. But there must be some reason
> > > > why Peter Zijlstra didn't suggest it when he instead suggested using
> > > > the IRQ work approach.
> > > >
> > > > Peter, thoughts?
> > >
> >
> > +cc [email protected]
> > (I'm sorry for more noise on the thread.)
> >
> > > Hello,
> > >
> > > Isn't the following scenario possible?
> > >
> > > The original code
> > > -----------------
> > > rcu_read_lock();
> > > ...
> > > /* Expedite */
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > ...
> > > __rcu_read_unlock();
> > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > rcu_read_unlock_special(t);
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > barrier(); /* ->rcu_read_unlock_special load before assign */
> > > t->rcu_read_lock_nesting = 0;
> > >
> > > The reordered code by machine
> > > -----------------------------
> > > rcu_read_lock();
> > > ...
> > > /* Expedite */
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > ...
> > > __rcu_read_unlock();
> > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > rcu_read_unlock_special(t);
> > > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > barrier(); /* ->rcu_read_unlock_special load before assign */
> > >
> > > An interrupt happens
> > > --------------------
> > > rcu_read_lock();
> > > ...
> > > /* Expedite */
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > ...
> > > __rcu_read_unlock();
> > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > rcu_read_unlock_special(t);
> > > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > > <--- Handle an (any) irq
> > > rcu_read_lock();
> > > /* This call should be skipped */
> > > rcu_read_unlock_special(t);
> > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > barrier(); /* ->rcu_read_unlock_special load before assign */
>
> I was confused; I thought it was a LOAD access. The example should be changed a bit.
>
>
>
> The original code
> -----------------
> rcu_read_lock();
> ...
> /* Expedite */
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> ...
> __rcu_read_unlock();
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> barrier(); /* ->rcu_read_unlock_special load before assign */
> t->rcu_read_lock_nesting = 0;
>
> The reordered code by machine
> -----------------------------
> rcu_read_lock();
> ...
> /* Expedite */
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> ...
> __rcu_read_unlock();
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> barrier(); /* ->rcu_read_unlock_special load before assign */
> t->rcu_read_lock_nesting = 0;
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
>
> An interrupt happens
> --------------------
> rcu_read_lock();
> ...
> /* Expedite */
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> ...
> __rcu_read_unlock();
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> barrier(); /* ->rcu_read_unlock_special load before assign */
> t->rcu_read_lock_nesting = 0;
> <--- Handle an (any) irq
> rcu_read_lock();
> /* This call should be skipped */
> rcu_read_unlock_special(t);
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
>
>
>
> Now that I have re-made the example, I'm afraid it is not a problem
> after all, because everything happens within a single CPU, which can
> see the contents of its own store buffer.
>
> I'm sorry. Please ignore my suggestion here.
Even though the example is wrong, I think it may have confused what I
was trying to say. It was about (1) a LOAD performed in advance and
(2) reordering within a CPU's store buffer, not about reordering of
instructions - though I wrote the example as if it were about the latter.

Sorry for the noise again.
Thanks,
Byungchul
On Fri, Jun 28, 2019 at 04:31:38PM +0900, Byungchul Park wrote:
> On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > >
> > > > > I think the fix should be to prevent the wake-up not based on whether we
> > > > > are
> > > > > in hard/soft-interrupt mode but that we are doing the rcu_read_unlock()
> > > > > from
> > > > > a scheduler path (if we can detect that)
> > > >
> > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > about that.
> > > >
> > > > Of course, unconditionally refusing to do the wakeup might not be happy
> > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > >
> > > Couldn't smp_send_reschedule() be used instead?
> >
> > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > that would be well worth looking at. But there must be some reason
> > why Peter Zijlstra didn't suggest it when he instead suggested using
> > the IRQ work approach.
> >
> > Peter, thoughts?
>
> Hello,
>
> Isn't the following scenario possible?
>
> The original code
> -----------------
> rcu_read_lock();
> ...
> /* Expedite */
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> ...
> __rcu_read_unlock();
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> barrier(); /* ->rcu_read_unlock_special load before assign */
> t->rcu_read_lock_nesting = 0;
>
> The reordered code by machine
> -----------------------------
> rcu_read_lock();
> ...
> /* Expedite */
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> ...
> __rcu_read_unlock();
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> barrier(); /* ->rcu_read_unlock_special load before assign */
>
> An interrupt happens
> --------------------
> rcu_read_lock();
> ...
> /* Expedite */
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> ...
> __rcu_read_unlock();
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> <--- Handle an (any) irq
> rcu_read_lock();
> /* This call should be skipped */
> rcu_read_unlock_special(t);
Wait.. I got a little bit confused about reordering.

This 'STORE rcu_read_lock_nesting = 0' can happen before
'STORE rcu_read_unlock_special.b.exp_hint = false' regardless of the
order the compiler generated under the barrier(), because the two
stores are independent, so reordering them is within an arch's rights.

Then.. is this scenario possible? Or do all archs properly deal with
interrupts across this kind of reordering?
Thanks,
Byungchul
> WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> barrier(); /* ->rcu_read_unlock_special load before assign */
>
> We should not end up handling the special work twice like this, which is
> one cause of the problem, even though another problem is of course
> calling ttwu without being aware that we are in a context holding the
> pi lock.
>
> Apart from the discussion about how to avoid ttwu in an improper
> condition, I think the following is necessary. I may be missing
> something. It would be appreciated if you let me know in case I'm wrong.
>
> Anyway, logically I think we should prevent reordering between
> t->rcu_read_lock_nesting and t->rcu_read_unlock_special.b.exp_hint not
> only by the compiler but also by the machine, as below.
>
> Am I missing something?
>
> Thanks,
> Byungchul
>
> ---8<---
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 3c8444e..9b137f1 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -412,7 +412,13 @@ void __rcu_read_unlock(void)
> barrier(); /* assign before ->rcu_read_unlock_special load */
> if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> rcu_read_unlock_special(t);
> - barrier(); /* ->rcu_read_unlock_special load before assign */
> + /*
> + * Prevent reordering between clearing
> + * t->rcu_read_unlock_special in
> + * rcu_read_unlock_special() and the following
> + * assignment to t->rcu_read_lock_nesting.
> + */
> + smp_wmb();
> t->rcu_read_lock_nesting = 0;
> }
> if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>
>
On Fri, Jun 28, 2019 at 04:43:50PM +0900, Byungchul Park wrote:
> On Fri, Jun 28, 2019 at 04:31:38PM +0900, Byungchul Park wrote:
> > On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > > >
> > > > > > I think the fix should be to prevent the wake-up not based on whether we
> > > > > > are
> > > > > > in hard/soft-interrupt mode but that we are doing the rcu_read_unlock()
> > > > > > from
> > > > > > a scheduler path (if we can detect that)
> > > > >
> > > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > > about that.
> > > > >
> > > > > Of course, unconditionally refusing to do the wakeup might not be happy
> > > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > > >
> > > > Couldn't smp_send_reschedule() be used instead?
> > >
> > > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > > that would be well worth looking at. But there must be some reason
> > > why Peter Zijlstra didn't suggest it when he instead suggested using
> > > the IRQ work approach.
> > >
> > > Peter, thoughts?
> >
>
> +cc [email protected]
> (I'm sorry for more noise on the thread.)
>
> > Hello,
> >
> > Isn't the following scenario possible?
> >
> > The original code
> > -----------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
Note the barrier() just above ...
> > t->rcu_read_lock_nesting = 0;
> >
> > The reordered code by machine
> > -----------------------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
... which prevents the compiler from moving the above up.
Or am I missing something subtle here?
Thanx, Paul
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> >
> > An interrupt happens
> > --------------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > <--- Handle an (any) irq
> > rcu_read_lock();
> > /* This call should be skipped */
> > rcu_read_unlock_special(t);
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> >
> > We should not end up handling the special work twice like this, which is
> > one cause of the problem, even though another problem is of course
> > calling ttwu without being aware that we are in a context holding the
> > pi lock.
> >
> > Apart from the discussion about how to avoid ttwu in an improper
> > condition, I think the following is necessary. I may be missing
> > something. It would be appreciated if you let me know in case I'm wrong.
> >
> > Anyway, logically I think we should prevent reordering between
> > t->rcu_read_lock_nesting and t->rcu_read_unlock_special.b.exp_hint not
> > only by the compiler but also by the machine, as below.
> >
> > Am I missing something?
> >
> > Thanks,
> > Byungchul
> >
> > ---8<---
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 3c8444e..9b137f1 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -412,7 +412,13 @@ void __rcu_read_unlock(void)
> > barrier(); /* assign before ->rcu_read_unlock_special load */
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > - barrier(); /* ->rcu_read_unlock_special load before assign */
> > + /*
> > + * Prevent reordering between clearing
> > + * t->rcu_read_unlock_special in
> > + * rcu_read_unlock_special() and the following
> > + * assignment to t->rcu_read_lock_nesting.
> > + */
> > + smp_wmb();
> > t->rcu_read_lock_nesting = 0;
> > }
> > if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> >
> >
>
On Fri, Jun 28, 2019 at 05:24:38PM +0900, Byungchul Park wrote:
> On Fri, Jun 28, 2019 at 05:14:32PM +0900, Byungchul Park wrote:
> > On Fri, Jun 28, 2019 at 04:43:50PM +0900, Byungchul Park wrote:
> > > On Fri, Jun 28, 2019 at 04:31:38PM +0900, Byungchul Park wrote:
> > > > On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > > > > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > > > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > > > > >
> > > > > > > > I think the fix should be to prevent the wake-up not based on whether we
> > > > > > > > are
> > > > > > > > in hard/soft-interrupt mode but that we are doing the rcu_read_unlock()
> > > > > > > > from
> > > > > > > > a scheduler path (if we can detect that)
> > > > > > >
> > > > > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > > > > about that.
> > > > > > >
> > > > > > > Of course, unconditionally refusing to do the wakeup might not be happy
> > > > > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > > > > >
> > > > > > Couldn't smp_send_reschedule() be used instead?
> > > > >
> > > > > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > > > > that would be well worth looking at. But there must be some reason
> > > > > why Peter Zijlstra didn't suggest it when he instead suggested using
> > > > > the IRQ work approach.
> > > > >
> > > > > Peter, thoughts?
> > > >
> > >
> > > +cc [email protected]
> > > (I'm sorry for more noise on the thread.)
> > >
> > > > Hello,
> > > >
> > > > Isn't the following scenario possible?
> > > >
> > > > The original code
> > > > -----------------
> > > > rcu_read_lock();
> > > > ...
> > > > /* Expedite */
> > > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > > ...
> > > > __rcu_read_unlock();
> > > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > > rcu_read_unlock_special(t);
> > > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > > barrier(); /* ->rcu_read_unlock_special load before assign */
> > > > t->rcu_read_lock_nesting = 0;
> > > >
> > > > The reordered code by machine
> > > > -----------------------------
> > > > rcu_read_lock();
> > > > ...
> > > > /* Expedite */
> > > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > > ...
> > > > __rcu_read_unlock();
> > > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > > rcu_read_unlock_special(t);
> > > > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > > barrier(); /* ->rcu_read_unlock_special load before assign */
> > > >
> > > > An interrupt happens
> > > > --------------------
> > > > rcu_read_lock();
> > > > ...
> > > > /* Expedite */
> > > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > > ...
> > > > __rcu_read_unlock();
> > > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > > rcu_read_unlock_special(t);
> > > > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > > > <--- Handle an (any) irq
> > > > rcu_read_lock();
> > > > /* This call should be skipped */
> > > > rcu_read_unlock_special(t);
> > > > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > > > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > > barrier(); /* ->rcu_read_unlock_special load before assign */
> > > >
> > > > We should not end up handling the special work twice like this, which is
> > > > one cause of the problem, even though another problem is of course
> > > > calling ttwu without being aware that we are in a context holding the
> > > > pi lock.
> > > >
> > > > Apart from the discussion about how to avoid ttwu in an improper
> > > > condition, I think the following is necessary. I may be missing
> > > > something. It would be appreciated if you let me know in case I'm wrong.
> > > >
> > > > Anyway, logically I think we should prevent reordering between
> > > > t->rcu_read_lock_nesting and t->rcu_read_unlock_special.b.exp_hint not
> > > > only by the compiler but also by the machine, as below.
> > > >
> > > > Am I missing something?
> > > >
> > > > Thanks,
> > > > Byungchul
> > > >
> > > > ---8<---
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index 3c8444e..9b137f1 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -412,7 +412,13 @@ void __rcu_read_unlock(void)
> > > > barrier(); /* assign before ->rcu_read_unlock_special load */
> > > > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > > rcu_read_unlock_special(t);
> > > > - barrier(); /* ->rcu_read_unlock_special load before assign */
> > > > + /*
> > > > + * Prevent reordering between clearing
> > > > + * t->rcu_read_unlock_special in
> > > > + * rcu_read_unlock_special() and the following
> > > > + * assignment to t->rcu_read_lock_nesting.
> > > > + */
> > > > + smp_wmb();
> >
> > Ah. But the problem is that this makes rcu_read_unlock() heavier, which
> > is too bad. I need to consider something else. But I'm still curious
> > whether the scenario I described is correct.
>
> Instead, this patch should be replaced with the following:
>
> ---8<---
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 3c8444e..f103e98 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -624,8 +624,15 @@ static void rcu_read_unlock_special(struct task_struct *t)
>
> local_irq_save(flags);
> irqs_were_disabled = irqs_disabled_flags(flags);
> +
> + WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> + /*
> + * Prevent reordering between rcu_read_unlock_special.b.exp_hint
> + * above and rcu_read_lock_nesting outside of this function.
> + */
> + smp_wmb();
Except that these are manipulated by the current CPU (aside from debug
code), so inter-CPU ordering is not needed. Plus .exp_hint is just a
heuristic.
Or am I missing something subtle here? If so, please provide a
step-by-step explanation of the failure scenario.
Thanx, Paul
> +
> if (preempt_bh_were_disabled || irqs_were_disabled) {
> - WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> /* Need to defer quiescent state until everything is enabled. */
> if (irqs_were_disabled) {
> /* Enabling irqs does not reschedule, so... */
> @@ -638,7 +645,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
> local_irq_restore(flags);
> return;
> }
> - WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> rcu_preempt_deferred_qs_irqrestore(t, flags);
> }
>
>
On Fri, Jun 28, 2019 at 07:40:45PM +0900, Byungchul Park wrote:
> On Fri, Jun 28, 2019 at 04:31:38PM +0900, Byungchul Park wrote:
> > On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > > >
> > > > > > I think the fix should be to prevent the wake-up not based on whether we
> > > > > > are
> > > > > > in hard/soft-interrupt mode but that we are doing the rcu_read_unlock()
> > > > > > from
> > > > > > a scheduler path (if we can detect that)
> > > > >
> > > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > > about that.
> > > > >
> > > > > Of course, unconditionally refusing to do the wakeup might not be happy
> > > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > > >
> > > > Couldn't smp_send_reschedule() be used instead?
> > >
> > > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > > that would be well worth looking at. But there must be some reason
> > > why Peter Zijlstra didn't suggest it when he instead suggested using
> > > the IRQ work approach.
> > >
> > > Peter, thoughts?
> >
> > Hello,
> >
> > Isn't the following scenario possible?
> >
> > The original code
> > -----------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> > t->rcu_read_lock_nesting = 0;
> >
> > The reordered code by machine
> > -----------------------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> >
> > An interrupt happens
> > --------------------
> > rcu_read_lock();
> > ...
> > /* Expedite */
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > ...
> > __rcu_read_unlock();
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > t->rcu_read_lock_nesting = 0; <--- LOOK AT THIS!!!
> > <--- Handle an (any) irq
> > rcu_read_lock();
> > /* This call should be skipped */
> > rcu_read_unlock_special(t);
>
> Wait.. I got a little bit confused about reordering.
>
> This 'STORE rcu_read_lock_nesting = 0' can happen before
> 'STORE rcu_read_unlock_special.b.exp_hint = false' regardless of the
> order the compiler generated under the barrier(), because the two
> stores are independent, so reordering them is within an arch's rights.
>
> Then.. is this scenario possible? Or do all archs properly deal with
> interrupts across this kind of reordering?
Interrupts are "exact" in that they happen between a pair of consecutive
instructions from the viewpoint of the CPU taking the interrupt. And
again, these fields are local to their CPU/task, give or take debug code.
Thanx, Paul
> Thanks,
> Byungchul
>
> > WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> > rcu_preempt_deferred_qs_irqrestore(t, flags);
> > barrier(); /* ->rcu_read_unlock_special load before assign */
> >
> > We should not end up handling the special work twice like this, which is
> > one cause of the problem, even though another problem is of course
> > calling ttwu without being aware that we are in a context holding the
> > pi lock.
> >
> > Apart from the discussion about how to avoid ttwu in an improper
> > condition, I think the following is necessary. I may be missing
> > something. It would be appreciated if you let me know in case I'm wrong.
> >
> > Anyway, logically I think we should prevent reordering between
> > t->rcu_read_lock_nesting and t->rcu_read_unlock_special.b.exp_hint not
> > only by the compiler but also by the machine, as below.
> >
> > Am I missing something?
> >
> > Thanks,
> > Byungchul
> >
> > ---8<---
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 3c8444e..9b137f1 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -412,7 +412,13 @@ void __rcu_read_unlock(void)
> > barrier(); /* assign before ->rcu_read_unlock_special load */
> > if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > rcu_read_unlock_special(t);
> > - barrier(); /* ->rcu_read_unlock_special load before assign */
> > + /*
> > + * Prevent reordering between clearing
> > + * t->rcu_read_unlock_special in
> > + * rcu_read_unlock_special() and the following
> > + * assignment to t->rcu_read_lock_nesting.
> > + */
> > + smp_wmb();
> > t->rcu_read_lock_nesting = 0;
> > }
> > if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> >
> >
>
On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> Or just don't do the wakeup at all, if it comes to that. I don't know
> of any way to determine whether rcu_read_unlock() is being called from
> the scheduler, but it has been some time since I asked Peter Zijlstra
> about that.
There (still) is no 'in-scheduler' state.
On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > >
> > > > I think the fix should be to prevent the wake-up not based on whether we
> > > > are
> > > > in hard/soft-interrupt mode but that we are doing the rcu_read_unlock()
> > > > from
> > > > a scheduler path (if we can detect that)
> > >
> > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > of any way to determine whether rcu_read_unlock() is being called from
> > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > about that.
> > >
> > > Of course, unconditionally refusing to do the wakeup might not be happy
> > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> >
> > Couldn't smp_send_reschedule() be used instead?
>
> Good point. If current -rcu doesn't fix things for Sebastian's case,
> that would be well worth looking at. But there must be some reason
> why Peter Zijlstra didn't suggest it when he instead suggested using
> the IRQ work approach.
>
> Peter, thoughts?
I've not exactly kept up with the thread; but irq_work allows you to run
some actual code on the remote CPU which is often useful and it is only
a little more expensive than smp_send_reschedule().
Also, just smp_send_reschedule() doesn't really do anything without
first poking TIF_NEED_RESCHED (or other scheduler state) and if you want
to do both, there's other helpers you should use, like resched_cpu().
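A minimal sketch of that irq_work pattern (the names here are made up
for illustration, not the actual -rcu ones): queue a callback that runs
a bit later in hard-irq context on the local CPU, where no scheduler
locks are held:

#include <linux/irq_work.h>
#include <linux/preempt.h>
#include <linux/sched.h>

/* Runs later, from the irq_work IPI, in hard-irq context. */
static void defer_qs_iw_handler(struct irq_work *iwp)
{
	/* Safe point to ask for a reschedule; no rq/pi locks held here. */
	set_tsk_need_resched(current);
	set_preempt_need_resched();
}

static struct irq_work defer_qs_iw;

static void defer_qs_setup(void)	/* call once at init time */
{
	init_irq_work(&defer_qs_iw, defer_qs_iw_handler);
}

/* Called from a context where a direct wakeup could deadlock: */
static void defer_qs_report(void)
{
	irq_work_queue(&defer_qs_iw);	/* self-IPI; handler runs shortly */
}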
On Fri, Jun 28, 2019 at 03:54:33PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> > Or just don't do the wakeup at all, if it comes to that. I don't know
> > of any way to determine whether rcu_read_unlock() is being called from
> > the scheduler, but it has been some time since I asked Peter Zijlstra
> > about that.
>
> There (still) is no 'in-scheduler' state.
Well, my TREE03 + threadirqs rcutorture test ran for ten hours last
night with no problems, so we just might be OK.
The apparent fix is below, though my approach would be to do backports
for the full set of related changes.
Joel, Sebastian, how goes any testing from your end? Any reason
to believe that this does not represent a fix? (Me, I am still
concerned about doing raise_softirq() from within a threaded
interrupt, but am not seeing failures.)
Thanx, Paul
------------------------------------------------------------------------
commit 23634ebc1d946f19eb112d4455c1d84948875e31
Author: Paul E. McKenney <[email protected]>
Date: Sun Mar 24 15:25:51 2019 -0700
rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
kthreads, a full and unconditional wakeup is required to initiate RCU
core processing. In contrast, when RCU core processing is carried
out by RCU_SOFTIRQ, a raise_softirq() suffices. Of course, there are
situations where raise_softirq() does a full wakeup, but these do not
occur with normal usage of rcu_read_unlock().
The reason that full wakeups can be problematic is that the scheduler
sometimes invokes rcu_read_unlock() with its pi or rq locks held,
which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
rcu_read_unlock() invokes the scheduler. Scheduler invocations can happen
in the following situations: (1) The just-ended reader has been subjected
to RCU priority boosting, in which case rcu_read_unlock() must deboost,
(2) Interrupts were disabled across the call to rcu_read_unlock(), so
the quiescent state must be deferred, requiring a wakeup of the rcuc
kthread corresponding to the current CPU.
Now, the scheduler may hold one of its locks across rcu_read_unlock()
only if preemption has been disabled across the entire RCU read-side
critical section, which in the days prior to RCU flavor consolidation
meant that rcu_read_unlock() never needed to do wakeups. However, this
is no longer the case for any but the first rcu_read_unlock() following a
condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
attention. For example, an RCU read-side critical section might be
preempted, but preemption might be disabled across the rcu_read_unlock().
The rcu_read_unlock() must defer the quiescent state, and therefore
leaves the task queued on its leaf rcu_node structure. If a scheduler
interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
one of its locks held. However, the preempted task is still queued, so
rcu_read_unlock() will attempt to defer the quiescent state once more.
When RCU core processing is carried out by RCU_SOFTIRQ, this works just
fine: The raise_softirq() function simply sets a bit in a per-CPU mask
and the RCU core processing will be undertaken upon return from interrupt.
Not so when RCU core processing is carried out by the rcuc kthread: In this
case, the required wakeup can result in deadlock.
The initial solution to this problem was to use set_tsk_need_resched() and
set_preempt_need_resched() to force a future context switch, which allows
rcu_preempt_note_context_switch() to report the deferred quiescent state
to RCU's core processing. Unfortunately for expedited grace periods,
there can be a significant delay between the call for a context switch
and the actual context switch.
This commit therefore introduces a ->deferred_qs flag to the task_struct
structure's rcu_special structure. This flag is initially false, and
is set to true by the first call to rcu_read_unlock() requiring special
attention, then finally reset back to false when the quiescent state is
finally reported. Then rcu_read_unlock() attempts full wakeups only when
->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
special attention. Note that a chain of RCU readers linked by some other
sort of reader may find that a later rcu_read_unlock() is once again able
to do a full wakeup, courtesy of an intervening preemption:
rcu_read_lock();
/* preempted */
local_irq_disable();
rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
rcu_read_lock();
local_irq_enable();
preempt_disable()
rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
rcu_read_lock();
preempt_enable();
/* preempted, ->deferred_qs reset. */
local_irq_disable();
rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */
Such linked RCU readers do not yet seem to appear in the Linux kernel, and
it is probably best if they don't. However, RCU needs to handle them, and
some variations on this theme could make even raise_softirq() unsafe due to
the possibility of its doing a full wakeup. This commit therefore also
avoids invoking raise_softirq() when the ->deferred_qs flag is set.
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11837410690f..942a44c1b8eb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -565,7 +565,7 @@ union rcu_special {
u8 blocked;
u8 need_qs;
u8 exp_hint; /* Hint for performance. */
- u8 pad; /* No garbage from compiler! */
+ u8 deferred_qs;
} b; /* Bits. */
u32 s; /* Set of bits. */
};
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 21611862e083..75110ea75d01 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -455,6 +455,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
local_irq_restore(flags);
return;
}
+ t->rcu_read_unlock_special.b.deferred_qs = false;
if (special.b.need_qs) {
rcu_qs();
t->rcu_read_unlock_special.b.need_qs = false;
@@ -605,16 +606,24 @@ static void rcu_read_unlock_special(struct task_struct *t)
local_irq_save(flags);
irqs_were_disabled = irqs_disabled_flags(flags);
if (preempt_bh_were_disabled || irqs_were_disabled) {
- WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
- /* Need to defer quiescent state until everything is enabled. */
- if (irqs_were_disabled && use_softirq) {
- /* Enabling irqs does not reschedule, so... */
+ t->rcu_read_unlock_special.b.exp_hint = false;
+ // Need to defer quiescent state until everything is enabled.
+ if (irqs_were_disabled && use_softirq &&
+ (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
+ // Using softirq, safe to awaken, and we get
+ // no help from enabling irqs, unlike bh/preempt.
raise_softirq_irqoff(RCU_SOFTIRQ);
+ } else if (irqs_were_disabled && !use_softirq &&
+ !t->rcu_read_unlock_special.b.deferred_qs) {
+ // Safe to awaken and we get no help from enabling
+ // irqs, unlike bh/preempt.
+ invoke_rcu_core();
} else {
- /* Enabling BH or preempt does reschedule, so... */
+ // Enabling BH or preempt does reschedule, so...
set_tsk_need_resched(current);
set_preempt_need_resched();
}
+ t->rcu_read_unlock_special.b.deferred_qs = true;
local_irq_restore(flags);
return;
}
On Fri, 28 Jun 2019 19:40:45 +0900
Byungchul Park <[email protected]> wrote:
> Wait.. I got a little bit confused about reordering.
>
> This 'STORE rcu_read_lock_nesting = 0' can happen before
> 'STORE rcu_read_unlock_special.b.exp_hint = false' regardless of the
> order the compiler generated under the barrier(), because the two
> stores are independent, so reordering them is within an arch's rights.
>
> Then.. is this scenario possible? Or do all archs properly deal with
> interrupts across this kind of reordering?
As Paul stated, interrupts are synchronization points. Archs can only
play games with ordering when dealing with entities outside the CPU
(devices and other CPUs). But if you have assembly that has two stores,
and an interrupt comes in, the arch must guarantee that the stores are
done in that order as the interrupt sees it.
If this is not the case, there's a hell of a lot more broken in the
kernel than just this, and "barrier()" would also be meaningless, as
that is used mostly to deal with interrupts.
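For example (a sketch; a and b both initially zero):

	CPU 0				CPU 1
	-----				-----
	WRITE_ONCE(a, 1);		r1 = READ_ONCE(b);
	barrier();			smp_rmb();
	WRITE_ONCE(b, 1);		r2 = READ_ONCE(a);

Here r1 == 1 && r2 == 0 is allowed on weakly ordered architectures,
because barrier() only constrains the compiler and CPU 1 is an observer
outside CPU 0. An interrupt handler running on CPU 0 itself, by
contrast, can never see b == 1 while a == 0, given the order the
compiler emitted.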
-- Steve
On Fri, Jun 28, 2019 at 04:15:22PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > > > >
> > > > > I think the fix should be to prevent the wake-up not based on whether we
> > > > > are
> > > > > in hard/soft-interrupt mode but that we are doing the rcu_read_unlock()
> > > > > from
> > > > > a scheduler path (if we can detect that)
> > > >
> > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > about that.
> > > >
> > > > Of course, unconditionally refusing to do the wakeup might not be happy
> > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > >
> > > Couldn't smp_send_reschedule() be used instead?
> >
> > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > that would be well worth looking at. But there must be some reason
> > why Peter Zijlstra didn't suggest it when he instead suggested using
> > the IRQ work approach.
> >
> > Peter, thoughts?
>
> I've not exactly kept up with the thread; but irq_work allows you to run
> some actual code on the remote CPU which is often useful and it is only
> a little more expensive than smp_send_reschedule().
>
> Also, just smp_send_reschedule() doesn't really do anything without
> first poking TIF_NEED_RESCHED (or other scheduler state) and if you want
> to do both, there's other helpers you should use, like resched_cpu().
Thank you! Plus it looks like scheduler_ipi() takes an early exit if
->wake_list is empty, regardless of TIF_NEED_RESCHED, right?
Thanx, Paul
On Fri, Jun 28, 2019 at 08:54:04AM -0700, Paul E. McKenney wrote:
> Thank you! Plus it looks like scheduler_ipi() takes an early exit if
> ->wake_list is empty, regardless of TIF_NEED_RESCHED, right?
Yes, TIF_NEED_RESCHED is checked in the interrupt return path.
On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
[snip]
> > > > And we should document this somewhere for future sanity preservation
> > > > :-D
> > >
> > > Or adjust the code and requirements to make it more sane, if feasible.
> > >
> > > My current (probably wildly unreliable) guess is that the conditions in
> > > rcu_read_unlock_special() need adjusting. I was assuming that in_irq()
> > > implies a hardirq context, in other words that in_irq() would return
> > > false from a threaded interrupt handler. If in_irq() instead returns
> > > true from within a threaded interrupt handler, then this code in
> > > rcu_read_unlock_special() needs fixing:
> > >
> > > if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > > (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > > // Using softirq, safe to awaken, and we get
> > > // no help from enabling irqs, unlike bh/preempt.
> > > raise_softirq_irqoff(RCU_SOFTIRQ);
> > >
> > > The fix would be replacing the calls to in_irq() with something that
> > > returns true only if called from within a hardirq context.
> > > Thoughts?
> >
> > I am not sure if this will fix all cases though?
> >
> > I think the crux of the problem is doing a recursive wake up. The threaded
> > IRQ probably just happens to be causing it here, it seems to me this problem
> > can also occur on a non-threaded irq system (say current_reader() in your
> > example executed in a scheduler path in process-context and not from an
> > interrupt). Is that not possible?
>
> In the non-threaded case, invoking raise_softirq*() from hardirq context
> just sets a bit in a per-CPU variable. Now, to Sebastian's point, we
> are only sort of in hardirq context in this case due to being called
> from irq_exit(), but the failure we are seeing might well be a ways
> downstream of the actual root-cause bug.
Hi Paul,
I was talking about rcu_read_unlock_special() being called from normal
process context by the scheduler.

The traces below show that only the PREEMPT_MASK offset is set at the
time of the issue. Neither the hardirq nor the softirq mask is set,
which means the lockup is from normal process context.
I think I finally understood why the issue shows up only with threadirqs in
my setup. If I build x86_64_defconfig, the CONFIG_IRQ_FORCED_THREADING=y
option is set. And booting this with threadirqs, invoke_softirq() always
tries to wake up ksoftirqd.

I believe what happens is that, at an inopportune time when the .blocked
field is set for the preempted task, an interrupt is received. This timing
is quite inauspicious because t->rcu_read_unlock_special just happens to
have its .blocked field set even though the task is not in a reader section.

The interrupt return path now does a wakeup of ksoftirqd. The wakeup path
calls cpuacct_charge(), which starts a new reader section. During the
unlock of this section, though, it notices .blocked and calls
rcu_read_unlock_special(). That does a raise_softirq; in_interrupt() says
it is OK to wake up ksoftirqd. The wakeup is attempted, and we have a
recursive wakeup. This probably does not happen on non-threadirqs machines
because ksoftirqd is probably not woken up as often (invoke_softirq()).
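If I read the traces right, the call chain is roughly this (a sketch
reconstructed from the traces below, so take the exact frames with a
grain of salt):

irq_exit()
  invoke_softirq()
    wakeup_softirqd()			/* first wakeup of ksoftirqd */
      try_to_wake_up()			/* takes p->pi_lock, rq->lock */
        cpuacct_charge()
          rcu_read_lock();
          rcu_read_unlock();
            rcu_read_unlock_special()	/* .blocked happens to be set */
              raise_softirq_irqoff(RCU_SOFTIRQ)
                wakeup_softirqd()	/* !in_interrupt(): second ttwu */
                  try_to_wake_up()	/* p->pi_lock again: deadlock */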
The traces for this look like the following (the patch to produce them is
later in this email). "rus" stands for t->rcu_read_unlock_special; "cur1"
means we just locked, and "cur2" means we are about to unlock (sorry, I am
weird with names):
[ 19.703436] rcu_tort-85 0d.s4 19528994us : sched_waking: comm=rcu_preempt pid=10 prio=120 target_cpu=000
[ 19.704770] rcu_tort-85 0d.s5 19528995us : cpuacct_charge: cur1 rus=0
[ 19.705706] rcu_tort-85 0d.s5 19528995us : cpuacct_charge: cur2 rus=0
[ 19.706657] rcu_tort-85 0dNs5 19528996us : sched_wakeup: comm=rcu_preempt pid=10 prio=120 target_cpu=000
[ 19.707947] rcu_tort-85 0dN.1 19528997us : rcu_note_context_switch: rcu_note_context_switch preempt=1
[ 19.709239] rcu_tort-85 0d..2 19528998us : sched_switch: prev_comm=rcu_torture_rea prev_pid=85 prev_prio=139 prev_state=R+ ==>0
[ 19.711361] rcu_pree-10 0d..1 19529001us : rcu_note_context_switch: rcu_note_context_switch preempt=0
[ 19.712714] rcu_pree-10 0d..2 19529002us : cpuacct_charge: cur1 rus=0
[ 19.713640] rcu_pree-10 0d..2 19529002us : cpuacct_charge: cur2 rus=0
[ 19.714612] rcu_pree-10 0d..2 19529003us : sched_switch: prev_comm=rcu_preempt prev_pid=10 prev_prio=120 prev_state=I ==> next9
[ 19.716639] rcu_tort-83 0d..1 19529022us : rcu_note_context_switch: rcu_note_context_switch preempt=0
[ 19.717887] rcu_tort-83 0d..2 19529023us : cpuacct_charge: cur1 rus=0
[ 19.718828] rcu_tort-83 0d..2 19529023us : cpuacct_charge: cur2 rus=0
[ 19.719772] rcu_tort-83 0d..2 19529023us : sched_switch: prev_comm=rcu_torture_fak prev_pid=83 prev_prio=139 prev_state=D ==> 9
[ 19.721985] rcu_tort-81 0d..1 19529752us : rcu_note_context_switch: rcu_note_context_switch preempt=0
[ 19.723252] rcu_tort-81 0d..2 19529752us : cpuacct_charge: cur1 rus=0
[ 19.724188] rcu_tort-81 0d..2 19529752us : cpuacct_charge: cur2 rus=0
[ 19.725125] rcu_tort-81 0d..2 19529753us : sched_switch: prev_comm=rcu_torture_fak prev_pid=81 prev_prio=139 prev_state=D ==> 9
[ 19.727372] rcu_tort-85 0.... 19529754us : __rcu_read_unlock: call rcu_read_unlock_special nest=1
[ 19.728587] rcu_tort-85 0.... 19529754us : rcu_read_unlock_special: unlock_special nest=-2147483647 c=1
[ 19.729955] rcu_tort-85 0dN.2 19530008us : irq_enter: irq_enter
[ 19.730821] rcu_tort-85 0dNh2 19530010us : __rcu_read_unlock: call rcu_read_unlock_special nest=1
[ 19.732106] rcu_tort-85 0dNh2 19530010us : rcu_read_unlock_special: unlock_special nest=-2147483647 c=1
[ 19.733464] rcu_tort-85 0dNh2 19530011us : __rcu_read_unlock: call rcu_read_unlock_special nest=1
[ 19.734713] rcu_tort-85 0dNh2 19530011us : rcu_read_unlock_special: unlock_special nest=-2147483647 c=1
[ 19.736051] rcu_tort-85 0dNh3 19530012us : cpuacct_charge: cur1 rus=1
[ 19.736995] rcu_tort-85 0dNh3 19530012us : cpuacct_charge: cur2 rus=1
[ 19.737950] rcu_tort-85 0dNh3 19530012us : __rcu_read_unlock: call rcu_read_unlock_special nest=1
[ 19.739205] rcu_tort-85 0dNh3 19530012us : rcu_read_unlock_special: unlock_special nest=-2147483647 c=1
[ 19.740526] rcu_tort-85 0dNh3 19530012us : __rcu_read_unlock: call rcu_read_unlock_special nest=1
[ 19.741828] rcu_tort-85 0dNh3 19530013us : rcu_read_unlock_special: unlock_special nest=-2147483647 c=1
[ 19.743171] rcu_tort-85 0dNh2 19530014us : irq_exit: irq_exit
[ 19.743984] rcu_tort-85 0dN.2 19530014us : irq_exit: invoke_softirq: wakeup_softirqd
[ 19.745002] rcu_tort-85 0dN.3 19530015us : sched_waking: comm=ksoftirqd/0 pid=9 prio=120 target_cpu=000
[ 19.746383] rcu_tort-85 0dN.4 19530015us : cpuacct_charge: cur1 rus=1
[ 19.747300] rcu_tort-85 0dN.4 19530016us : cpuacct_charge: cur2 rus=1
[ 19.748215] rcu_tort-85 0dN.4 19530016us : __rcu_read_unlock: call rcu_read_unlock_special nest=1
[ 19.749487] rcu_tort-85 0dN.4 19530016us : rcu_read_unlock_special: unlock_special nest=-2147483647 c=1
[ 19.750832] rcu_tort-85 0dN.4 19530016us : raise_softirq_irqoff: raise_softirq_irqoff: waking softirqd
---8<-----------------------
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index c47788fa85f9..0a8d0805c5ef 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2189,6 +2189,7 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
pr_warn("\nstack backtrace:\n");
dump_stack();
+ BUG();
return 0;
}
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 1102765f91fd..cf825503a740 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -308,6 +308,8 @@ void rcu_note_context_switch(bool preempt)
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
struct rcu_node *rnp;
+ trace_printk("rcu_note_context_switch preempt=%d\n", preempt);
+
barrier(); /* Avoid RCU read-side critical sections leaking down. */
trace_rcu_utilization(TPS("Start context switch"));
lockdep_assert_irqs_disabled();
@@ -341,6 +343,7 @@ void rcu_note_context_switch(bool preempt)
* Complete exit from RCU read-side critical section on
* behalf of preempted instance of __rcu_read_unlock().
*/
+ trace_printk("rcu_note_context_switch->unlock_special pre=%d\n", preempt);
rcu_read_unlock_special(t);
rcu_preempt_deferred_qs(t);
} else {
@@ -403,15 +406,19 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
void __rcu_read_unlock(void)
{
struct task_struct *t = current;
+ int prev;
if (t->rcu_read_lock_nesting != 1) {
--t->rcu_read_lock_nesting;
} else {
barrier(); /* critical section before exit code. */
+ prev = t->rcu_read_lock_nesting;
t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
barrier(); /* assign before ->rcu_read_unlock_special load */
- if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
+ if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s))) {
+ trace_printk("call rcu_read_unlock_special nest=%d\n", prev);
rcu_read_unlock_special(t);
+ }
barrier(); /* ->rcu_read_unlock_special load before assign */
t->rcu_read_lock_nesting = 0;
}
@@ -618,6 +625,8 @@ static void rcu_read_unlock_special(struct task_struct *t)
!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
bool irqs_were_disabled;
+ trace_printk("unlock_special nest=%d c=%d\n",t->rcu_read_lock_nesting, t==current);
+
/* NMI handlers cannot block and cannot safely manipulate state. */
if (in_nmi())
return;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9fbb10383434..1caacf936466 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -340,15 +340,20 @@ void cpuacct_charge(struct task_struct *tsk, u64 cputime)
struct cpuacct *ca;
int index = CPUACCT_STAT_SYSTEM;
struct pt_regs *regs = task_pt_regs(tsk);
+ struct task_struct *t = current;
if (regs && user_mode(regs))
index = CPUACCT_STAT_USER;
+ tracing_on();
rcu_read_lock();
+ trace_printk("cur1 rus=%lx\n", (unsigned long)t->rcu_read_unlock_special.s);
+
for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
this_cpu_ptr(ca->cpuusage)->usages[index] += cputime;
+ trace_printk("cur2 rus=%lx\n", (unsigned long)t->rcu_read_unlock_special.s);
rcu_read_unlock();
}
diff --git a/kernel/softirq.c b/kernel/softirq.c
index a6b81c6b6bff..17402467ae31 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -311,6 +311,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
--max_restart)
goto restart;
+ trace_printk("__do_softirq recurse: wakeup_softirqd\n");
wakeup_softirqd();
}
@@ -344,6 +345,7 @@ asmlinkage __visible void do_softirq(void)
*/
void irq_enter(void)
{
+ trace_printk("irq_enter\n");
rcu_irq_enter();
if (is_idle_task(current) && !in_interrupt()) {
/*
@@ -380,6 +382,7 @@ static inline void invoke_softirq(void)
do_softirq_own_stack();
#endif
} else {
+ trace_printk("invoke_softirq: wakeup_softirqd\n");
wakeup_softirqd();
}
}
@@ -402,6 +405,7 @@ static inline void tick_irq_exit(void)
*/
void irq_exit(void)
{
+ trace_printk("irq_exit\n");
#ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
local_irq_disable();
#else
@@ -433,8 +437,10 @@ inline void raise_softirq_irqoff(unsigned int nr)
* Otherwise we wake up ksoftirqd to make sure we
* schedule the softirq soon.
*/
- if (!in_interrupt())
+ if (!in_interrupt()) {
+ trace_printk("raise_softirq_irqoff: waking softirqd\n");
wakeup_softirqd();
+ }
}
void raise_softirq(unsigned int nr)
On Fri, Jun 28, 2019 at 12:40:08PM -0400, Joel Fernandes wrote:
> On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> [snip]
> > > > > And we should document this somewhere for future sanity preservation
> > > > > :-D
> > > >
> > > > Or adjust the code and requirements to make it more sane, if feasible.
> > > >
> > > > My current (probably wildly unreliable) guess is that the conditions in
> > > > rcu_read_unlock_special() need adjusting. I was assuming that in_irq()
> > > > implies a hardirq context, in other words that in_irq() would return
> > > > false from a threaded interrupt handler. If in_irq() instead returns
> > > > true from within a threaded interrupt handler, then this code in
> > > > rcu_read_unlock_special() needs fixing:
> > > >
> > > > if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > > > (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > > > // Using softirq, safe to awaken, and we get
> > > > // no help from enabling irqs, unlike bh/preempt.
> > > > raise_softirq_irqoff(RCU_SOFTIRQ);
> > > >
> > > > The fix would be replacing the calls to in_irq() with something that
> > > > returns true only if called from within a hardirq context.
> > > > Thoughts?
> > >
> > > I am not sure if this will fix all cases though?
> > >
> > > I think the crux of the problem is doing a recursive wake up. The threaded
> > > IRQ probably just happens to be causing it here, it seems to me this problem
> > > can also occur on a non-threaded irq system (say current_reader() in your
> > > example executed in a scheduler path in process-context and not from an
> > > interrupt). Is that not possible?
> >
> > In the non-threaded case, invoking raise_softirq*() from hardirq context
> > just sets a bit in a per-CPU variable. Now, to Sebastian's point, we
> > are only sort of in hardirq context in this case due to being called
> > from irq_exit(), but the failure we are seeing might well be a ways
> > downstream of the actual root-cause bug.
>
> Hi Paul,
> I was talking about rcu_read_unlock_special() being called from normal
> process context by the scheduler.
>
> The traces below show that only the PREEMPT_MASK offset is set at the
> time of the issue. Neither the hardirq nor the softirq mask is set,
> which means the lockup is from normal process context.
>
> I think I finally understood why the issue shows up only with threadirqs in
> my setup. If I build x86_64_defconfig, the CONFIG_IRQ_FORCED_THREADING=y
> option is set. And booting this with threadirqs, invoke_softirq() always
> tries to wake up ksoftirqd.
>
> I believe what happens is that, at an inopportune time when the .blocked
> field is set for the preempted task, an interrupt is received. This timing
> is quite inauspicious because t->rcu_read_unlock_special just happens to
> have its .blocked field set even though the task is not in a reader section.
I believe the .blocked field remains set even though we are no longer in a
reader section because of the deferred processing of the blocked lists that
you mentioned yesterday.
On Fri, Jun 28, 2019 at 06:04:08PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 28, 2019 at 08:54:04AM -0700, Paul E. McKenney wrote:
> > Thank you! Plus it looks like scheduler_ipi() takes an early exit if
> > ->wake_list is empty, regardless of TIF_NEED_RESCHED, right?
>
> Yes, TIF_NEED_RESCHED is checked in the interrupt return path.
OK, got it. So the following sequence would be a valid way to get the
scheduler's attention on the current CPU shortly after interrupts
are re-enabled, even if the current CPU is already holding some
rq or pi locks, correct?
set_tsk_need_resched(current);
set_preempt_need_resched();
smp_send_reschedule(smp_processor_id());
Thanx, Paul
On Fri, Jun 28, 2019 at 12:45:59PM -0400, Joel Fernandes wrote:
> On Fri, Jun 28, 2019 at 12:40:08PM -0400, Joel Fernandes wrote:
> > On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> > [snip]
> > > > > > And we should document this somewhere for future sanity preservation
> > > > > > :-D
> > > > >
> > > > > Or adjust the code and requirements to make it more sane, if feasible.
> > > > >
> > > > > My current (probably wildly unreliable) guess is that the conditions in
> > > > > rcu_read_unlock_special() need adjusting. I was assuming that in_irq()
> > > > > implies a hardirq context, in other words that in_irq() would return
> > > > > false from a threaded interrupt handler. If in_irq() instead returns
> > > > > true from within a threaded interrupt handler, then this code in
> > > > > rcu_read_unlock_special() needs fixing:
> > > > >
> > > > > if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > > > > (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > > > > // Using softirq, safe to awaken, and we get
> > > > > // no help from enabling irqs, unlike bh/preempt.
> > > > > raise_softirq_irqoff(RCU_SOFTIRQ);
> > > > >
> > > > > The fix would be replacing the calls to in_irq() with something that
> > > > > returns true only if called from within a hardirq context.
> > > > > Thoughts?
> > > >
> > > > I am not sure if this will fix all cases though?
> > > >
> > > > I think the crux of the problem is doing a recursive wake up. The threaded
> > > > IRQ probably just happens to be causing it here, it seems to me this problem
> > > > can also occur on a non-threaded irq system (say current_reader() in your
> > > > example executed in a scheduler path in process-context and not from an
> > > > interrupt). Is that not possible?
> > >
> > > In the non-threaded case, invoking raise_softirq*() from hardirq context
> > > just sets a bit in a per-CPU variable. Now, to Sebastian's point, we
> > > are only sort of in hardirq context in this case due to being called
> > > from irq_exit(), but the failure we are seeing might well be a ways
> > > downstream of the actual root-cause bug.
> >
> > Hi Paul,
> > I was talking about calling of rcu_read_unlock_special from a normal process
> > context from the scheduler.
> >
> > In the below traces, it shows that only the PREEMPT_MASK offset is set at the
> > time of the issue. Both HARD AND SOFT IRQ masks are not enabled, which means
> > the lock up is from a normal process context.
> >
> > I think I finally understood why the issue shows up only with threadirqs in
> > my setup. If I build x86_64_defconfig, the CONFIG_IRQ_FORCED_THREADING=y
> > option is set. And booting this with threadirqs, invoke_softirq() always
> > tries to wake up ksoftirqd.
> >
> > I believe what happens is that, at an inopportune time when the .blocked
> > field is set for the preempted task, an interrupt is received. This timing
> > is quite inauspicious because t->rcu_read_unlock_special just happens to
> > have its .blocked field set even though the task is not in a reader section.
Thank you for tracing through this!
> I believe the .blocked field remains set even though we are no longer in a
> reader section because of the deferred processing of the blocked lists that
> you mentioned yesterday.
That can indeed happen. However, in current -rcu, that would mean
that .deferred_qs is also set, which (if in_irq()) would prevent
the raise_softirq_irqoff() from being invoked. Which was why I was
asking the questions about whether in_irq() returns true within threaded
interrupts yesterday. If it does, I need to find out whether there is
some way of determining whether rcu_read_unlock_special() is being called
from a threaded interrupt in order to suppress the call to raise_softirq()
in that case.
But which version of the kernel are you using here? Current -rcu?
v5.2-rc1? Something else?
Thanx, Paul
On Fri, Jun 28, 2019 at 10:30:11AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 28, 2019 at 12:45:59PM -0400, Joel Fernandes wrote:
> > On Fri, Jun 28, 2019 at 12:40:08PM -0400, Joel Fernandes wrote:
> > > On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> > > [snip]
> > > > > > > And we should document this somewhere for future sanity preservation
> > > > > > > :-D
> > > > > >
> > > > > > Or adjust the code and requirements to make it more sane, if feasible.
> > > > > >
> > > > > > My current (probably wildly unreliable) guess is that the conditions in
> > > > > > rcu_read_unlock_special() need adjusting. I was assuming that in_irq()
> > > > > > implies a hardirq context, in other words that in_irq() would return
> > > > > > false from a threaded interrupt handler. If in_irq() instead returns
> > > > > > true from within a threaded interrupt handler, then this code in
> > > > > > rcu_read_unlock_special() needs fixing:
> > > > > >
> > > > > > if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > > > > > (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > > > > > // Using softirq, safe to awaken, and we get
> > > > > > // no help from enabling irqs, unlike bh/preempt.
> > > > > > raise_softirq_irqoff(RCU_SOFTIRQ);
> > > > > >
> > > > > > The fix would be replacing the calls to in_irq() with something that
> > > > > > returns true only if called from within a hardirq context.
> > > > > > Thoughts?
> > > > >
> > > > > I am not sure if this will fix all cases though?
> > > > >
> > > > > I think the crux of the problem is doing a recursive wakeup. The threaded
> > > > > IRQ probably just happens to be causing it here; it seems to me this problem
> > > > > can also occur on a non-threaded irq system (say current_reader() in your
> > > > > example executed in a scheduler path in process-context and not from an
> > > > > interrupt). Is that not possible?
> > > >
> > > > In the non-threaded case, invoking raise_softirq*() from hardirq context
> > > > just sets a bit in a per-CPU variable. Now, to Sebastian's point, we
> > > > are only sort of in hardirq context in this case due to being called
> > > > from irq_exit(), but the failure we are seeing might well be a ways
> > > > downstream of the actual root-cause bug.
> > >
> > > Hi Paul,
> > > I was talking about calling of rcu_read_unlock_special from a normal process
> > > context from the scheduler.
> > >
> > > The traces below show that only the PREEMPT_MASK offset is set at the
> > > time of the issue. Neither the HARDIRQ nor the SOFTIRQ mask is set, which means
> > > the lockup is from a normal process context.
> > >
> > > I think I finally understood why the issue shows up only with threadirqs in
> > > my setup. If I build x86_64_defconfig, the CONFIG_IRQ_FORCED_THREADING=y
> > > option is set. And booting this with threadirqs, it always tries to
> > > wakeup_ksoftirqd in invoke_softirq.
> > >
> > > I believe what happens is that, at an inopportune time when the .blocked field
> > > is set for the preempted task, an interrupt is received. This timing is quite
> > > inauspicious because t->rcu_read_unlock_special just happens to have its
> > > .blocked field set even though the task is not in a reader section.
>
> Thank you for tracing through this!
>
> > I believe the .blocked field remains set even though we are not any more in a
> > reader section because of deferred processing of the blocked lists that you
> > mentioned yesterday.
>
> That can indeed happen. However, in current -rcu, that would mean
> that .deferred_qs is also set, which (if in_irq()) would prevent
> the raise_softirq_irqoff() from being invoked. Which was why I was
> asking the questions about whether in_irq() returns true within threaded
> interrupts yesterday. If it does, I need to find if there is some way
> of determining whether rcu_read_unlock_special() is being called from
> a threaded interrupt in order to suppress the call to raise_softirq()
> in that case.
>
> But which version of the kernel are you using here? Current -rcu?
> v5.2-rc1? Something else?
And if this turns out to be current -rcu, and if there is no reasonable
way for rcu_read_unlock_special() to know if it is being invoked from
within a threaded interrupt handler, then the patch below would be one
way out.
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 82c925df1d92..5140e792c1c2 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -625,7 +625,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
tick_nohz_full_cpu(rdp->cpu);
// Need to defer quiescent state until everything is enabled.
if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
- (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
+ !t->rcu_read_unlock_special.b.deferred_qs) {
// Using softirq, safe to awaken, and we get
// no help from enabling irqs, unlike bh/preempt.
raise_softirq_irqoff(RCU_SOFTIRQ);
On 2019-06-28 10:30:11 [-0700], Paul E. McKenney wrote:
> > I believe the .blocked field remains set even though we are not any more in a
> > reader section because of deferred processing of the blocked lists that you
> > mentioned yesterday.
>
> That can indeed happen. However, in current -rcu, that would mean
> that .deferred_qs is also set, which (if in_irq()) would prevent
> the raise_softirq_irqoff() from being invoked. Which was why I was
> asking the questions about whether in_irq() returns true within threaded
> interrupts yesterday. If it does, I need to find if there is some way
> of determining whether rcu_read_unlock_special() is being called from
> a threaded interrupt in order to suppress the call to raise_softirq()
> in that case.
Please note that:
| void irq_exit(void)
| {
|…
in_irq() returns true
| preempt_count_sub(HARDIRQ_OFFSET);
in_irq() returns false
| if (!in_interrupt() && local_softirq_pending())
| invoke_softirq();
-> invoke_softirq() does
| if (!force_irqthreads) {
| __do_softirq();
| } else {
| wakeup_softirqd();
| }
so with `force_irqthreads', rcu_read_unlock_special() invoked from within
wakeup_softirqd() will see in_irq() as false.
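(For reference, in_irq() and in_interrupt() are just preempt_count() tests;
paraphrasing include/linux/preempt.h:

| #define in_irq()	(preempt_count() & HARDIRQ_MASK)
| #define in_interrupt()	(preempt_count() & \
|			 (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))

so once preempt_count_sub(HARDIRQ_OFFSET) has run, both of them report false
on the wakeup_softirqd() path above.)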
> Thanx, Paul
Sebastian
On Fri, Jun 28, 2019 at 10:30:11AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 28, 2019 at 12:45:59PM -0400, Joel Fernandes wrote:
> > On Fri, Jun 28, 2019 at 12:40:08PM -0400, Joel Fernandes wrote:
> > > On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> > > [snip]
> > > > > > > And we should document this somewhere for future sanity preservation
> > > > > > > :-D
> > > > > >
> > > > > > Or adjust the code and requirements to make it more sane, if feasible.
> > > > > >
> > > > > > My current (probably wildly unreliable) guess is that the conditions in
> > > > > > rcu_read_unlock_special() need adjusting. I was assuming that in_irq()
> > > > > > implies a hardirq context, in other words that in_irq() would return
> > > > > > false from a threaded interrupt handler. If in_irq() instead returns
> > > > > > true from within a threaded interrupt handler, then this code in
> > > > > > rcu_read_unlock_special() needs fixing:
> > > > > >
> > > > > > if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > > > > > (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > > > > > // Using softirq, safe to awaken, and we get
> > > > > > // no help from enabling irqs, unlike bh/preempt.
> > > > > > raise_softirq_irqoff(RCU_SOFTIRQ);
> > > > > >
> > > > > > The fix would be replacing the calls to in_irq() with something that
> > > > > > returns true only if called from within a hardirq context.
> > > > > > Thoughts?
> > > > >
> > > > > I am not sure if this will fix all cases though?
> > > > >
> > > > > I think the crux of the problem is doing a recursive wakeup. The threaded
> > > > > IRQ probably just happens to be causing it here; it seems to me this problem
> > > > > can also occur on a non-threaded irq system (say current_reader() in your
> > > > > example executed in a scheduler path in process-context and not from an
> > > > > interrupt). Is that not possible?
> > > >
> > > > In the non-threaded case, invoking raise_softirq*() from hardirq context
> > > > just sets a bit in a per-CPU variable. Now, to Sebastian's point, we
> > > > are only sort of in hardirq context in this case due to being called
> > > > from irq_exit(), but the failure we are seeing might well be a ways
> > > > downstream of the actual root-cause bug.
> > >
> > > Hi Paul,
> > > I was talking about calling of rcu_read_unlock_special from a normal process
> > > context from the scheduler.
> > >
> > > The traces below show that only the PREEMPT_MASK offset is set at the
> > > time of the issue. Neither the HARDIRQ nor the SOFTIRQ mask is set, which means
> > > the lockup is from a normal process context.
> > >
> > > I think I finally understood why the issue shows up only with threadirqs in
> > > my setup. If I build x86_64_defconfig, the CONFIG_IRQ_FORCED_THREADING=y
> > > option is set. And booting this with threadirqs, it always tries to
> > > wakeup_ksoftirqd in invoke_softirq.
> > >
> > > I believe what happens is that, at an inopportune time when the .blocked field
> > > is set for the preempted task, an interrupt is received. This timing is quite
> > > inauspicious because t->rcu_read_unlock_special just happens to have its
> > > .blocked field set even though the task is not in a reader section.
>
> Thank you for tracing through this!
My pleasure ;)
> > I believe the .blocked field remains set even though we are not any more in a
> > reader section because of deferred processing of the blocked lists that you
> > mentioned yesterday.
>
> That can indeed happen. However, in current -rcu, that would mean
> that .deferred_qs is also set, which (if in_irq()) would prevent
> the raise_softirq_irqoff() from being invoked. Which was why I was
> asking the questions about whether in_irq() returns true within threaded
> interrupts yesterday. If it does, I need to find if there is some way
> of determining whether rcu_read_unlock_special() is being called from
> a threaded interrupt in order to suppress the call to raise_softirq()
> in that case.
Thanks. I will take a look at the -rcu tree a bit and reply to this.
> But which version of the kernel are you using here? Current -rcu?
> v5.2-rc1? Something else?
This is v5.2-rc6 kernel version from Linus tree which was showing the issue.
thanks!
On Fri, Jun 28, 2019 at 07:45:45PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-06-28 10:30:11 [-0700], Paul E. McKenney wrote:
> > > I believe the .blocked field remains set even though we are not any more in a
> > > reader section because of deferred processing of the blocked lists that you
> > > mentioned yesterday.
> >
> > That can indeed happen. However, in current -rcu, that would mean
> > that .deferred_qs is also set, which (if in_irq()) would prevent
> > the raise_softirq_irqoff() from being invoked. Which was why I was
> > asking the questions about whether in_irq() returns true within threaded
> > interrupts yesterday. If it does, I need to find if there is some way
> > of determining whether rcu_read_unlock_special() is being called from
> > a threaded interrupt in order to suppress the call to raise_softirq()
> > in that case.
>
> Please note that:
> | void irq_exit(void)
> | {
> |…
> in_irq() returns true
> | preempt_count_sub(HARDIRQ_OFFSET);
> in_irq() returns false
> | if (!in_interrupt() && local_softirq_pending())
> | invoke_softirq();
>
> -> invoke_softirq() does
> | if (!force_irqthreads) {
> | __do_softirq();
> | } else {
> | wakeup_softirqd();
> | }
>
In my traces, which I shared in a previous email, wakeup_softirqd() gets
called.
I thought the force_irqthreads value is decided at boot time, so I got a bit
lost with your comment.
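(For what it's worth, it does seem to be fixed once at boot; quoting
kernel/irq/manage.c from memory, under CONFIG_IRQ_FORCED_THREADING there is
roughly:

__read_mostly bool force_irqthreads;

static int __init setup_forced_irqthreads(char *arg)
{
	force_irqthreads = true;
	return 0;
}
early_param("threadirqs", setup_forced_irqthreads);

so booting with "threadirqs" sets it once and it never changes afterwards.)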
On 2019-06-28 14:07:27 [-0400], Joel Fernandes wrote:
> On Fri, Jun 28, 2019 at 07:45:45PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-06-28 10:30:11 [-0700], Paul E. McKenney wrote:
> > > > I believe the .blocked field remains set even though we are not any more in a
> > > > reader section because of deferred processing of the blocked lists that you
> > > > mentioned yesterday.
> > >
> > > That can indeed happen. However, in current -rcu, that would mean
> > > that .deferred_qs is also set, which (if in_irq()) would prevent
> > > the raise_softirq_irqoff() from being invoked. Which was why I was
> > > asking the questions about whether in_irq() returns true within threaded
> > > interrupts yesterday. If it does, I need to find if there is some way
> > > of determining whether rcu_read_unlock_special() is being called from
> > > a threaded interrupt in order to suppress the call to raise_softirq()
> > > in that case.
> >
> > Please note that:
> > | void irq_exit(void)
> > | {
> > |…
> > in_irq() returns true
> > | preempt_count_sub(HARDIRQ_OFFSET);
> > in_irq() returns false
> > | if (!in_interrupt() && local_softirq_pending())
> > | invoke_softirq();
> >
> > -> invoke_softirq() does
> > | if (!force_irqthreads) {
> > | __do_softirq();
> > | } else {
> > | wakeup_softirqd();
> > | }
> >
>
> In my traces, which I shared in a previous email, wakeup_softirqd() gets
> called.
>
> I thought the force_irqthreads value is decided at boot time, so I got a bit
> lost with your comment.
It does. I just wanted to point out that in this case
rcu_read_unlock() / rcu_read_unlock_special() won't see in_irq() as true.
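(wakeup_softirqd() itself is, roughly:

| static void wakeup_softirqd(void)
| {
|	/* Interrupts are disabled: no need to stop preemption */
|	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
|
|	if (tsk && tsk->state != TASK_RUNNING)
|		wake_up_process(tsk);
| }

and since irq_exit() calls it only after dropping HARDIRQ_OFFSET, any
rcu_read_unlock() reached via wake_up_process() sees in_irq() == false.)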
Sebastian
On Fri, Jun 28, 2019 at 07:45:45PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-06-28 10:30:11 [-0700], Paul E. McKenney wrote:
> > > I believe the .blocked field remains set even though we are not any more in a
> > > reader section because of deferred processing of the blocked lists that you
> > > mentioned yesterday.
> >
> > That can indeed happen. However, in current -rcu, that would mean
> > that .deferred_qs is also set, which (if in_irq()) would prevent
> > the raise_softirq_irqoff() from being invoked. Which was why I was
> > asking the questions about whether in_irq() returns true within threaded
> > interrupts yesterday. If it does, I need to find if there is some way
> > of determining whether rcu_read_unlock_special() is being called from
> > a threaded interrupt in order to suppress the call to raise_softirq()
> > in that case.
>
> Please note that:
> | void irq_exit(void)
> | {
> |…
> in_irq() returns true
> | preempt_count_sub(HARDIRQ_OFFSET);
> in_irq() returns false
> | if (!in_interrupt() && local_softirq_pending())
> | invoke_softirq();
>
> -> invoke_softirq() does
> | if (!force_irqthreads) {
> | __do_softirq();
> | } else {
> | wakeup_softirqd();
> | }
>
> so with `force_irqthreads', rcu_read_unlock_special() invoked from within
> wakeup_softirqd() will see in_irq() as false.
OK, fair point. How about the following instead, again on -rcu?
Here is the rationale for the new version of the "if" statement:
1. irqs_were_disabled: If interrupts are enabled, we should
instead let the upcoming irq_enable()/local_bh_enable()
do the rescheduling for us.
2. use_softirq: If we aren't using softirq, then
raise_softirq_irqoff() will be unhelpful.
3a. in_interrupt(): If this returns true, the subsequent
call to raise_softirq_irqoff() is guaranteed not to
do a wakeup, so that call will be both very cheap and
quite safe.
3b. Otherwise, if !in_interrupt(), if exp (an expedited RCU grace
period is being blocked), then incurring wakeup overhead
is worthwhile, and if also !.deferred_qs then scheduler locks
cannot be held so the wakeup will be safe.
Does that make more sense?
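To make the mapping explicit, the new test reads, keyed to the numbers above
(same code as in the patch below):

	if (irqs_were_disabled && use_softirq &&	// #1 and #2
	    (in_interrupt() ||				// #3a
	     (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {	// #3b
		raise_softirq_irqoff(RCU_SOFTIRQ);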
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 82c925df1d92..83333cfe8707 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -624,8 +624,9 @@ static void rcu_read_unlock_special(struct task_struct *t)
(rdp->grpmask & rnp->expmask) ||
tick_nohz_full_cpu(rdp->cpu);
// Need to defer quiescent state until everything is enabled.
- if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
- (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
+ if (irqs_were_disabled && use_softirq &&
+ (in_interrupt() ||
+ (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
// Using softirq, safe to awaken, and we get
// no help from enabling irqs, unlike bh/preempt.
raise_softirq_irqoff(RCU_SOFTIRQ);
On Fri, Jun 28, 2019 at 02:05:37PM -0400, Joel Fernandes wrote:
> On Fri, Jun 28, 2019 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 28, 2019 at 12:45:59PM -0400, Joel Fernandes wrote:
> > > On Fri, Jun 28, 2019 at 12:40:08PM -0400, Joel Fernandes wrote:
> > > > On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> > > > [snip]
> > > > > > > > And we should document this somewhere for future sanity preservation
> > > > > > > > :-D
> > > > > > >
> > > > > > > Or adjust the code and requirements to make it more sane, if feasible.
> > > > > > >
> > > > > > > My current (probably wildly unreliable) guess is that the conditions in
> > > > > > > rcu_read_unlock_special() need adjusting. I was assuming that in_irq()
> > > > > > > implies a hardirq context, in other words that in_irq() would return
> > > > > > > false from a threaded interrupt handler. If in_irq() instead returns
> > > > > > > true from within a threaded interrupt handler, then this code in
> > > > > > > rcu_read_unlock_special() needs fixing:
> > > > > > >
> > > > > > > if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > > > > > > (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > > > > > > // Using softirq, safe to awaken, and we get
> > > > > > > // no help from enabling irqs, unlike bh/preempt.
> > > > > > > raise_softirq_irqoff(RCU_SOFTIRQ);
> > > > > > >
> > > > > > > The fix would be replacing the calls to in_irq() with something that
> > > > > > > returns true only if called from within a hardirq context.
> > > > > > > Thoughts?
> > > > > >
> > > > > > I am not sure if this will fix all cases though?
> > > > > >
> > > > > > I think the crux of the problem is doing a recursive wakeup. The threaded
> > > > > > IRQ probably just happens to be causing it here; it seems to me this problem
> > > > > > can also occur on a non-threaded irq system (say current_reader() in your
> > > > > > example executed in a scheduler path in process-context and not from an
> > > > > > interrupt). Is that not possible?
> > > > >
> > > > > In the non-threaded case, invoking raise_softirq*() from hardirq context
> > > > > just sets a bit in a per-CPU variable. Now, to Sebastian's point, we
> > > > > are only sort of in hardirq context in this case due to being called
> > > > > from irq_exit(), but the failure we are seeing might well be a ways
> > > > > downstream of the actual root-cause bug.
> > > >
> > > > Hi Paul,
> > > > I was talking about calling of rcu_read_unlock_special from a normal process
> > > > context from the scheduler.
> > > >
> > > > The traces below show that only the PREEMPT_MASK offset is set at the
> > > > time of the issue. Neither the HARDIRQ nor the SOFTIRQ mask is set, which means
> > > > the lockup is from a normal process context.
> > > >
> > > > I think I finally understood why the issue shows up only with threadirqs in
> > > > my setup. If I build x86_64_defconfig, the CONFIG_IRQ_FORCED_THREADING=y
> > > > option is set. And booting this with threadirqs, it always tries to
> > > > wakeup_ksoftirqd in invoke_softirq.
> > > >
> > > > I believe what happens is that, at an inopportune time when the .blocked field
> > > > is set for the preempted task, an interrupt is received. This timing is quite
> > > > inauspicious because t->rcu_read_unlock_special just happens to have its
> > > > .blocked field set even though the task is not in a reader section.
> >
> > Thank you for tracing through this!
>
> My pleasure ;)
>
> > > I believe the .blocked field remains set even though we are not any more in a
> > > reader section because of deferred processing of the blocked lists that you
> > > mentioned yesterday.
> >
> > That can indeed happen. However, in current -rcu, that would mean
> > that .deferred_qs is also set, which (if in_irq()) would prevent
> > the raise_softirq_irqoff() from being invoked. Which was why I was
> > asking the questions about whether in_irq() returns true within threaded
> > interrupts yesterday. If it does, I need to find if there is some way
> > of determining whether rcu_read_unlock_special() is being called from
> > a threaded interrupt in order to suppress the call to raise_softirq()
> > in that case.
>
> Thanks. I will take a look at the -rcu tree a bit and reply to this.
>
> > But which version of the kernel are you using here? Current -rcu?
> > v5.2-rc1? Something else?
>
> This is v5.2-rc6 kernel version from Linus tree which was showing the issue.
OK, that version does not contain the alleged fixes that are in -rcu.
Thanx, Paul
On 2019-06-28 08:30:50 [-0700], Paul E. McKenney wrote:
> On Fri, Jun 28, 2019 at 03:54:33PM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > of any way to determine whether rcu_read_unlock() is being called from
> > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > about that.
> >
> > There (still) is no 'in-scheduler' state.
>
> Well, my TREE03 + threadirqs rcutorture test ran for ten hours last
> night with no problems, so we just might be OK.
>
> The apparent fix is below, though my approach would be to do backports
> for the full set of related changes.
>
> Joel, Sebastian, how goes any testing from your end? Any reason
> to believe that this does not represent a fix? (Me, I am still
> concerned about doing raise_softirq() from within a threaded
> interrupt, but am not seeing failures.)
For some reason it does not trigger as well as it did yesterday.
Commit
- 23634ebc1d946 ("rcu: Check for wakeup-safe conditions in
rcu_read_unlock_special()") does not trigger the bug within 94
attempts.
- 48d07c04b4cc1 ("rcu: Enable elimination of Tree-RCU softirq
processing") needed 12 attempts to trigger the bug.
> Thanx, Paul
>
Sebastian
On Fri, Jun 28, 2019 at 08:40:26PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-06-28 08:30:50 [-0700], Paul E. McKenney wrote:
> > On Fri, Jun 28, 2019 at 03:54:33PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > about that.
> > >
> > > There (still) is no 'in-scheduler' state.
> >
> > Well, my TREE03 + threadirqs rcutorture test ran for ten hours last
> > night with no problems, so we just might be OK.
> >
> > The apparent fix is below, though my approach would be to do backports
> > for the full set of related changes.
> >
> > Joel, Sebastian, how goes any testing from your end? Any reason
> > to believe that this does not represent a fix? (Me, I am still
> > concerned about doing raise_softirq() from within a threaded
> > interrupt, but am not seeing failures.)
>
> For some reason it does not trigger as well as it did yesterday.
I swear that I wasn't watching!!! ;-)
But I do know that feeling.
> Commit
> - 23634ebc1d946 ("rcu: Check for wakeup-safe conditions in
> rcu_read_unlock_special()") does not trigger the bug within 94
> attempts.
>
> - 48d07c04b4cc1 ("rcu: Enable elimination of Tree-RCU softirq
> processing") needed 12 attempts to trigger the bug.
That matches my belief that 23634ebc1d946 ("rcu: Check for wakeup-safe
conditions in rcu_read_unlock_special()") will at least greatly decrease
the probability of this bug occurring.
Thanx, Paul
On Fri, Jun 28, 2019 at 11:52:19AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 28, 2019 at 08:40:26PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-06-28 08:30:50 [-0700], Paul E. McKenney wrote:
> > > On Fri, Jun 28, 2019 at 03:54:33PM +0200, Peter Zijlstra wrote:
> > > > On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> > > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > > about that.
> > > >
> > > > There (still) is no 'in-scheduler' state.
> > >
> > > Well, my TREE03 + threadirqs rcutorture test ran for ten hours last
> > > night with no problems, so we just might be OK.
> > >
> > > The apparent fix is below, though my approach would be to do backports
> > > for the full set of related changes.
> > >
> > > Joel, Sebastian, how goes any testing from your end? Any reason
> > > to believe that this does not represent a fix? (Me, I am still
> > > concerned about doing raise_softirq() from within a threaded
> > > interrupt, but am not seeing failures.)
Are you also concerned about a regular process context executing in the
scheduler and using RCU having this issue?
(Not anything to do with threaded or non-threaded IRQs, but just a path in the
scheduler that uses RCU.)
I don't think Sebastian's lockup has to do with the fact that an interrupt
is threaded or not, except that ksoftirqd is awakened in the case where
threadirqs is passed.
> > For some reason it does not trigger as well as it did yesterday.
>
> I swear that I wasn't watching!!! ;-)
>
> But I do know that feeling.
:-)
> > Commit
> > - 23634ebc1d946 ("rcu: Check for wakeup-safe conditions in
> > rcu_read_unlock_special()") does not trigger the bug within 94
> > attempts.
> >
> > - 48d07c04b4cc1 ("rcu: Enable elimination of Tree-RCU softirq
> > processing") needed 12 attempts to trigger the bug.
>
> That matches my belief that 23634ebc1d946 ("rcu: Check for wakeup-safe
> conditions in rcu_read_unlock_special()") will at least greatly decrease
> the probability of this bug occurring.
I was just typing a reply that I can't reproduce it with:
rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
I am trying to revert enough of this patch to see what would break things;
however, I think a better exercise might be to understand more about what the
patch does and why it fixes things in the first place ;-) It is probably the
deferred_qs thing.
thanks!
On Fri, Jun 28, 2019 at 11:22:16AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 28, 2019 at 07:45:45PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-06-28 10:30:11 [-0700], Paul E. McKenney wrote:
> > > > I believe the .blocked field remains set even though we are not any more in a
> > > > reader section because of deferred processing of the blocked lists that you
> > > > mentioned yesterday.
> > >
> > > That can indeed happen. However, in current -rcu, that would mean
> > > that .deferred_qs is also set, which (if in_irq()) would prevent
> > > the raise_softirq_irqoff() from being invoked. Which was why I was
> > > asking the questions about whether in_irq() returns true within threaded
> > > interrupts yesterday. If it does, I need to find if there is some way
> > > of determining whether rcu_read_unlock_special() is being called from
> > > a threaded interrupt in order to suppress the call to raise_softirq()
> > > in that case.
> >
> > Please note that:
> > | void irq_exit(void)
> > | {
> > |…
> > in_irq() returns true
> > | preempt_count_sub(HARDIRQ_OFFSET);
> > in_irq() returns false
> > | if (!in_interrupt() && local_softirq_pending())
> > | invoke_softirq();
> >
> > -> invoke_softirq() does
> > | if (!force_irqthreads) {
> > | __do_softirq();
> > | } else {
> > | wakeup_softirqd();
> > | }
> >
> > so with `force_irqthreads', rcu_read_unlock_special() invoked from within
> > wakeup_softirqd() will see in_irq() as false.
>
> OK, fair point. How about the following instead, again on -rcu?
>
> Here is the rationale for the new version of the "if" statement:
>
> 1. irqs_were_disabled: If interrupts are enabled, we should
> instead let the upcoming irq_enable()/local_bh_enable()
> do the rescheduling for us.
> 2. use_softirq: If we aren't using softirq, then
> raise_softirq_irqoff() will be unhelpful.
> 3a. in_interrupt(): If this returns true, the subsequent
> call to raise_softirq_irqoff() is guaranteed not to
> do a wakeup, so that call will be both very cheap and
> quite safe.
> 3b. Otherwise, if !in_interrupt(), if exp (an expedited RCU grace
> period is being blocked), then incurring wakeup overhead
> is worthwhile, and if also !.deferred_qs then scheduler locks
> cannot be held so the wakeup will be safe.
>
> Does that make more sense?
This makes a lot of sense. It would be nice to stick these comments on top of
rcu_read_unlock_special() for future reference.
thanks,
- Joel
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 82c925df1d92..83333cfe8707 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -624,8 +624,9 @@ static void rcu_read_unlock_special(struct task_struct *t)
> (rdp->grpmask & rnp->expmask) ||
> tick_nohz_full_cpu(rdp->cpu);
> // Need to defer quiescent state until everything is enabled.
> - if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> - (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> + if (irqs_were_disabled && use_softirq &&
> + (in_interrupt() ||
> + (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
> // Using softirq, safe to awaken, and we get
> // no help from enabling irqs, unlike bh/preempt.
> raise_softirq_irqoff(RCU_SOFTIRQ);
>
On Fri, 2019-06-28 at 16:15 +0200, Peter Zijlstra wrote:
> On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > Of course, unconditionally refusing to do the wakeup might not be
> > > > happy
> > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > >
> > > Couldn't smp_send_reschedule() be used instead?
> >
> > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > that would be well worth looking at. But there must be some reason
> > why Peter Zijlstra didn't suggest it when he instead suggested using
> > the IRQ work approach.
> >
> > Peter, thoughts?
>
> I've not exactly kept up with the thread; but irq_work allows you to run
> some actual code on the remote CPU, which is often useful, and it is only
> a little more expensive than smp_send_reschedule().
>
> Also, just smp_send_reschedule() doesn't really do anything without
> first poking TIF_NEED_RESCHED (or other scheduler state) and if you want
> to do both, there are other helpers you should use, like resched_cpu().
resched_cpu() will not send an IPI to the current CPU[1]. Plus, the RCU
code needs to set need_resched even in cases where it doesn't need to send
the IPI. And worst of all, resched_cpu() takes the rq lock which is the
deadlock scenario we're trying to avoid.
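For reference, resched_cpu() is roughly this (kernel/sched/core.c):

void resched_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags;

	/* This is the rq lock that may already be held on the
	 * rcu_read_unlock() path, hence the deadlock concern. */
	raw_spin_lock_irqsave(&rq->lock, flags);
	if (cpu_online(cpu) || cpu == smp_processor_id())
		resched_curr(rq);
	raw_spin_unlock_irqrestore(&rq->lock, flags);
}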
-Scott
[1] Which makes me nervous about latency if there are any wakeups with irqs
disabled, without a preempt_enable() after irqs are enabled again, and not
inside an interrupt.
On Fri, Jun 28, 2019 at 03:24:07PM -0400, Joel Fernandes wrote:
> On Fri, Jun 28, 2019 at 11:52:19AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 28, 2019 at 08:40:26PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2019-06-28 08:30:50 [-0700], Paul E. McKenney wrote:
> > > > On Fri, Jun 28, 2019 at 03:54:33PM +0200, Peter Zijlstra wrote:
> > > > > On Thu, Jun 27, 2019 at 11:41:07AM -0700, Paul E. McKenney wrote:
> > > > > > Or just don't do the wakeup at all, if it comes to that. I don't know
> > > > > > of any way to determine whether rcu_read_unlock() is being called from
> > > > > > the scheduler, but it has been some time since I asked Peter Zijlstra
> > > > > > about that.
> > > > >
> > > > > There (still) is no 'in-scheduler' state.
> > > >
> > > > Well, my TREE03 + threadirqs rcutorture test ran for ten hours last
> > > > night with no problems, so we just might be OK.
> > > >
> > > > The apparent fix is below, though my approach would be to do backports
> > > > for the full set of related changes.
> > > >
> > > > Joel, Sebastian, how goes any testing from your end? Any reason
> > > > to believe that this does not represent a fix? (Me, I am still
> > > > concerned about doing raise_softirq() from within a threaded
> > > > interrupt, but am not seeing failures.)
>
> Are you also concerned about a regular process context executing in the
> scheduler and using RCU having this issue?
> (Not anything to do with threaded or non-threaded IRQs, but just a path in the
> scheduler that uses RCU.)
>
> I don't think Sebastian's lockup has to do with the fact that an interrupt
> is threaded or not, except that ksoftirqd is awakened in the case where
> threadirqs is passed.
In current -rcu, the checks should suffice in the absence of threaded
interrupts. They might also suffice for threaded interrupts, but a more
direct approach would be better, hence the in_interrupt() patch.
> > > For some reason it does not trigger as well as it did yesterday.
> >
> > I swear that I wasn't watching!!! ;-)
> >
> > But I do know that feeling.
>
> :-)
>
> > > Commit
> > > - 23634ebc1d946 ("rcu: Check for wakeup-safe conditions in
> > > rcu_read_unlock_special()") does not trigger the bug within 94
> > > attempts.
> > >
> > > - 48d07c04b4cc1 ("rcu: Enable elimination of Tree-RCU softirq
> > > processing") needed 12 attempts to trigger the bug.
> >
> > That matches my belief that 23634ebc1d946 ("rcu: Check for wakeup-safe
> > conditions in rcu_read_unlock_special()") will at least greatly decrease
> > the probability of this bug occurring.
>
> I was just typing a reply that I can't reproduce it with:
> rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
>
> I am trying to revert enough of this patch to see what would break things;
> however, I think a better exercise might be to understand more about what the
> patch does and why it fixes things in the first place ;-) It is probably the
> deferred_qs thing.
The deferred_qs flag is part of it! Looking forward to hearing what
you come up with as being the critical piece of this commit.
Thanx, Paul
On Fri, Jun 28, 2019 at 03:29:23PM -0400, Joel Fernandes wrote:
> On Fri, Jun 28, 2019 at 11:22:16AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 28, 2019 at 07:45:45PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2019-06-28 10:30:11 [-0700], Paul E. McKenney wrote:
> > > > > I believe the .blocked field remains set even though we are not any more in a
> > > > > reader section because of deferred processing of the blocked lists that you
> > > > > mentioned yesterday.
> > > >
> > > > That can indeed happen. However, in current -rcu, that would mean
> > > > that .deferred_qs is also set, which (if in_irq()) would prevent
> > > > the raise_softirq_irqoff() from being invoked. Which was why I was
> > > > asking the questions about whether in_irq() returns true within threaded
> > > > interrupts yesterday. If it does, I need to find if there is some way
> > > > of determining whether rcu_read_unlock_special() is being called from
> > > > a threaded interrupt in order to suppress the call to raise_softirq()
> > > > in that case.
> > >
> > > Please note that:
> > > | void irq_exit(void)
> > > | {
> > > |…
> > > in_irq() returns true
> > > | preempt_count_sub(HARDIRQ_OFFSET);
> > > in_irq() returns false
> > > | if (!in_interrupt() && local_softirq_pending())
> > > | invoke_softirq();
> > >
> > > -> invoke_softirq() does
> > > | if (!force_irqthreads) {
> > > | __do_softirq();
> > > | } else {
> > > | wakeup_softirqd();
> > > | }
> > >
> > > so with `force_irqthreads', rcu_read_unlock_special() invoked from within
> > > wakeup_softirqd() will see in_irq() as false.
> >
> > OK, fair point. How about the following instead, again on -rcu?
> >
> > Here is the rationale for the new version of the "if" statement:
> >
> > 1. irqs_were_disabled: If interrupts are enabled, we should
> > instead let the upcoming irq_enable()/local_bh_enable()
> > do the rescheduling for us.
> > 2. use_softirq: If we aren't using softirq, then
> > raise_softirq_irqoff() will be unhelpful.
> > 3a. in_interrupt(): If this returns true, the subsequent
> > call to raise_softirq_irqoff() is guaranteed not to
> > do a wakeup, so that call will be both very cheap and
> > quite safe.
> > 3b. Otherwise, if !in_interrupt(), if exp (an expedited RCU grace
> > period is being blocked), then incurring wakeup overhead
> > is worthwhile, and if also !.deferred_qs then scheduler locks
> > cannot be held so the wakeup will be safe.
> >
> > Does that make more sense?
>
> This makes a lot of sense. It would be nice to stick these comments on top of
> rcu_read_unlock_special() for future reference.
I do have an expanded version in the commit log. I hope to get a more
high-level description in the comments.
Thanx, Paul
> thanks,
>
> - Joel
>
>
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 82c925df1d92..83333cfe8707 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -624,8 +624,9 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > (rdp->grpmask & rnp->expmask) ||
> > tick_nohz_full_cpu(rdp->cpu);
> > // Need to defer quiescent state until everything is enabled.
> > - if ((exp || in_irq()) && irqs_were_disabled && use_softirq &&
> > - (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> > + if (irqs_were_disabled && use_softirq &&
> > + (in_interrupt() ||
> > + (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
> > // Using softirq, safe to awaken, and we get
> > // no help from enabling irqs, unlike bh/preempt.
> > raise_softirq_irqoff(RCU_SOFTIRQ);
> >
>
Hi Paul,
On Fri, Jun 28, 2019 at 01:04:23PM -0700, Paul E. McKenney wrote:
[snip]
> > > > Commit
> > > > - 23634ebc1d946 ("rcu: Check for wakeup-safe conditions in
> > > > rcu_read_unlock_special()") does not trigger the bug within 94
> > > > attempts.
> > > >
> > > > - 48d07c04b4cc1 ("rcu: Enable elimination of Tree-RCU softirq
> > > > processing") needed 12 attempts to trigger the bug.
> > >
> > > That matches my belief that 23634ebc1d946 ("rcu: Check for wakeup-safe
> > > conditions in rcu_read_unlock_special()") will at least greatly decrease
> > > the probability of this bug occurring.
> >
> > I was just typing a reply that I can't reproduce it with:
> > rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
> >
> > I am trying to revert enough of this patch to see what would break things;
> > however, I think a better exercise might be to understand more about what the
> > patch does and why it fixes things in the first place ;-) It is probably the
> > deferred_qs thing.
>
> The deferred_qs flag is part of it! Looking forward to hearing what
> you come up with as being the critical piece of this commit.
The new deferred_qs flag indeed saves the machine from the deadlock.
If we don't want the deferred_qs flag, then the below patch also fixes the issue.
However, I am more sure than not that it does not handle all cases (such as
what if we previously had an expedited grace period IPI in a previous reader
section and had to defer processing; then it seems a similar deadlock
would present). But anyway, the below patch does fix it for me! It is based on
your -rcu tree commit 23634ebc1d946f19eb112d4455c1d84948875e31 (rcu: Check
for wakeup-safe conditions in rcu_read_unlock_special()).
---8<-----------------------
From: "Joel Fernandes (Google)" <[email protected]>
Subject: [PATCH] Fix RCU recursive deadlock
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/rcu/tree_plugin.h | 17 +++++++++++++----
2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 942a44c1b8eb..347e6dfcc91b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -565,7 +565,7 @@ union rcu_special {
u8 blocked;
u8 need_qs;
u8 exp_hint; /* Hint for performance. */
- u8 deferred_qs;
+ u8 pad;
} b; /* Bits. */
u32 s; /* Set of bits. */
};
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 75110ea75d01..5b9b12c1ba5c 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -455,7 +455,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
local_irq_restore(flags);
return;
}
- t->rcu_read_unlock_special.b.deferred_qs = false;
if (special.b.need_qs) {
rcu_qs();
t->rcu_read_unlock_special.b.need_qs = false;
@@ -608,13 +607,24 @@ static void rcu_read_unlock_special(struct task_struct *t)
if (preempt_bh_were_disabled || irqs_were_disabled) {
t->rcu_read_unlock_special.b.exp_hint = false;
// Need to defer quiescent state until everything is enabled.
+
+ /* If unlock_special was called in the current reader section
+ * just because we were blocked in a previous reader section,
+ * then raising softirqs can deadlock. This is because the
+ * scheduler executes RCU sections with preemption disabled,
+ * however it may have previously blocked in a previous
+ * non-scheduler reader section and .blocked got set. It is
+ * never safe to call unlock_special from the scheduler path
+ * due to recursive wakeups (unless we are in_irq()), so
+ * prevent this by checking if we were previously blocked.
+ */
if (irqs_were_disabled && use_softirq &&
- (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
+ (!t->rcu_read_unlock_special.b.blocked || in_irq())) {
// Using softirq, safe to awaken, and we get
// no help from enabling irqs, unlike bh/preempt.
raise_softirq_irqoff(RCU_SOFTIRQ);
} else if (irqs_were_disabled && !use_softirq &&
- !t->rcu_read_unlock_special.b.deferred_qs) {
+ !t->rcu_read_unlock_special.b.blocked) {
// Safe to awaken and we get no help from enabling
// irqs, unlike bh/preempt.
invoke_rcu_core();
@@ -623,7 +633,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
set_tsk_need_resched(current);
set_preempt_need_resched();
}
- t->rcu_read_unlock_special.b.deferred_qs = true;
local_irq_restore(flags);
return;
}
--
2.22.0.410.gd8fdbe21b5-goog
On Fri, Jun 28, 2019 at 05:40:18PM -0400, Joel Fernandes wrote:
> Hi Paul,
>
> On Fri, Jun 28, 2019 at 01:04:23PM -0700, Paul E. McKenney wrote:
> [snip]
> > > > > Commit
> > > > > - 23634ebc1d946 ("rcu: Check for wakeup-safe conditions in
> > > > > rcu_read_unlock_special()") does not trigger the bug within 94
> > > > > attempts.
> > > > >
> > > > > - 48d07c04b4cc1 ("rcu: Enable elimination of Tree-RCU softirq
> > > > > processing") needed 12 attempts to trigger the bug.
> > > >
> > > > That matches my belief that 23634ebc1d946 ("rcu: Check for wakeup-safe
> > > > conditions in rcu_read_unlock_special()") will at least greatly decrease
> > > > the probability of this bug occurring.
> > >
> > > I was just typing a reply that I can't reproduce it with:
> > > rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
> > >
> > > I am trying to revert enough of this patch to see what would break things;
> > > however, I think a better exercise might be to understand more about what the
> > > patch does and why it fixes things in the first place ;-) It is probably the
> > > deferred_qs thing.
> >
> > The deferred_qs flag is part of it! Looking forward to hearing what
> > you come up with as being the critical piece of this commit.
>
> The new deferred_qs flag indeed saves the machine from the deadlock.
>
> If we don't want the deferred_qs flag, then the below patch also fixes the issue.
> However, I am more sure than not that it does not handle all cases (such as
> what if we previously had an expedited grace period IPI in a previous reader
> section and had to defer processing; then it seems a similar deadlock
> would present). But anyway, the below patch does fix it for me! It is based on
> your -rcu tree commit 23634ebc1d946f19eb112d4455c1d84948875e31 (rcu: Check
> for wakeup-safe conditions in rcu_read_unlock_special()).
The point here being that you rely on .b.blocked rather than
.b.deferred_qs. Hmmm... There are a number of places that check all
the bits via the .s leg of the rcu_special union. The .s check in
rcu_preempt_need_deferred_qs() should be OK because it is conditioned
on t->rcu_read_lock_nesting being zero or negative.
Do the rest of those also work out OK?
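(For reference, that check currently reads, roughly:

static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
{
	return (__this_cpu_read(rcu_data.deferred_qs) ||
		READ_ONCE(t->rcu_read_unlock_special.s)) &&
	       t->rcu_read_lock_nesting <= 0;
}

so a leftover .s bit has no effect there while still inside a reader.)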
It would be nice to remove the flag, but doing so clearly needs careful
review and testing.
Thanx, Paul
> ---8<-----------------------
>
> From: "Joel Fernandes (Google)" <[email protected]>
> Subject: [PATCH] Fix RCU recursive deadlock
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> ---
> include/linux/sched.h | 2 +-
> kernel/rcu/tree_plugin.h | 17 +++++++++++++----
> 2 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 942a44c1b8eb..347e6dfcc91b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -565,7 +565,7 @@ union rcu_special {
> u8 blocked;
> u8 need_qs;
> u8 exp_hint; /* Hint for performance. */
> - u8 deferred_qs;
> + u8 pad;
> } b; /* Bits. */
> u32 s; /* Set of bits. */
> };
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 75110ea75d01..5b9b12c1ba5c 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -455,7 +455,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> local_irq_restore(flags);
> return;
> }
> - t->rcu_read_unlock_special.b.deferred_qs = false;
> if (special.b.need_qs) {
> rcu_qs();
> t->rcu_read_unlock_special.b.need_qs = false;
> @@ -608,13 +607,24 @@ static void rcu_read_unlock_special(struct task_struct *t)
> if (preempt_bh_were_disabled || irqs_were_disabled) {
> t->rcu_read_unlock_special.b.exp_hint = false;
> // Need to defer quiescent state until everything is enabled.
> +
> + /* If unlock_special was called in the current reader section
> + * just because we were blocked in a previous reader section,
> + * then raising softirqs can deadlock. This is because the
> + * scheduler executes RCU sections with preemption disabled,
> + * however it may have previously blocked in a previous
> + * non-scheduler reader section and .blocked got set. It is
> + * never safe to call unlock_special from the scheduler path
> + * due to recursive wakeups (unless we are in_irq()), so
> + * prevent this by checking if we were previously blocked.
> + */
> if (irqs_were_disabled && use_softirq &&
> - (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> + (!t->rcu_read_unlock_special.b.blocked || in_irq())) {
> // Using softirq, safe to awaken, and we get
> // no help from enabling irqs, unlike bh/preempt.
> raise_softirq_irqoff(RCU_SOFTIRQ);
> } else if (irqs_were_disabled && !use_softirq &&
> - !t->rcu_read_unlock_special.b.deferred_qs) {
> + !t->rcu_read_unlock_special.b.blocked) {
> // Safe to awaken and we get no help from enabling
> // irqs, unlike bh/preempt.
> invoke_rcu_core();
> @@ -623,7 +633,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
> set_tsk_need_resched(current);
> set_preempt_need_resched();
> }
> - t->rcu_read_unlock_special.b.deferred_qs = true;
> local_irq_restore(flags);
> return;
> }
> --
> 2.22.0.410.gd8fdbe21b5-goog
>
On Fri, Jun 28, 2019 at 03:25:47PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 28, 2019 at 05:40:18PM -0400, Joel Fernandes wrote:
> > Hi Paul,
> >
> > On Fri, Jun 28, 2019 at 01:04:23PM -0700, Paul E. McKenney wrote:
> > [snip]
> > > > > > Commit
> > > > > > - 23634ebc1d946 ("rcu: Check for wakeup-safe conditions in
> > > > > > rcu_read_unlock_special()") does not trigger the bug within 94
> > > > > > attempts.
> > > > > >
> > > > > > - 48d07c04b4cc1 ("rcu: Enable elimination of Tree-RCU softirq
> > > > > > processing") needed 12 attempts to trigger the bug.
> > > > >
> > > > > That matches my belief that 23634ebc1d946 ("rcu: Check for wakeup-safe
> > > > > conditions in rcu_read_unlock_special()") will at least greatly decrease
> > > > > the probability of this bug occurring.
> > > >
> > > > I was just typing a reply that I can't reproduce it with:
> > > > rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
> > > >
> > > > I am trying to revert enough of this patch to see what would break things;
> > > > however, I think a better exercise might be to understand more about what the
> > > > patch does and why it fixes things in the first place ;-) It is probably the
> > > > deferred_qs thing.
> > >
> > > The deferred_qs flag is part of it! Looking forward to hearing what
> > > you come up with as being the critical piece of this commit.
> >
> > The new deferred_qs flag indeed saves the machine from the deadlock.
> >
> > If we don't want the deferred_qs flag, then the below patch also fixes the issue.
> > However, I am more sure than not that it does not handle all cases (such as
> > what if we previously had an expedited grace period IPI in a previous reader
> > section and had to defer processing; then it seems a similar deadlock
> > would present). But anyway, the below patch does fix it for me! It is based on
> > your -rcu tree commit 23634ebc1d946f19eb112d4455c1d84948875e31 (rcu: Check
> > for wakeup-safe conditions in rcu_read_unlock_special()).
>
> The point here being that you rely on .b.blocked rather than
> .b.deferred_qs. Hmmm... There are a number of places that check all
> the bits via the .s leg of the rcu_special union. The .s check in
> rcu_preempt_need_deferred_qs() should be OK because it is conditioned
> on t->rcu_read_lock_nesting being zero or negative.
> Do the rest of those also work out OK?
>
> It would be nice to remove the flag, but doing so clearly needs careful
> review and testing.
Agreed. I am planning to do an audit of this code within the next couple of
weeks so I will be on the lookout for any optimization opportunities related
to this. Will let you know if this can work. For now I like your patch better
because it is more conservative and doesn't cause any space overhead.
If you'd like, please feel free to include my Tested-by on it:
Tested-by: Joel Fernandes (Google) <[email protected]>
If you have a chance, could you also point me to any tests that show a
performance improvement with the irqwork patch on the expedited GP use case?
I'd like to try it out as well. I guess rcuperf should have some?
thanks!
- Joel
On Fri, Jun 28, 2019 at 07:12:41PM -0400, Joel Fernandes wrote:
> On Fri, Jun 28, 2019 at 03:25:47PM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 28, 2019 at 05:40:18PM -0400, Joel Fernandes wrote:
> > > Hi Paul,
> > >
> > > On Fri, Jun 28, 2019 at 01:04:23PM -0700, Paul E. McKenney wrote:
> > > [snip]
> > > > > > > Commit
> > > > > > > - 23634ebc1d946 ("rcu: Check for wakeup-safe conditions in
> > > > > > > rcu_read_unlock_special()") does not trigger the bug within 94
> > > > > > > attempts.
> > > > > > >
> > > > > > > - 48d07c04b4cc1 ("rcu: Enable elimination of Tree-RCU softirq
> > > > > > > processing") needed 12 attempts to trigger the bug.
> > > > > >
> > > > > > That matches my belief that 23634ebc1d946 ("rcu: Check for wakeup-safe
> > > > > > conditions in rcu_read_unlock_special()") will at least greatly decrease
> > > > > > the probability of this bug occurring.
> > > > >
> > > > > I was just typing a reply that I can't reproduce it with:
> > > > > rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
> > > > >
> > > > > I am trying to revert enough of this patch to see what would break things;
> > > > > however, I think a better exercise might be to understand more about what the
> > > > > patch does and why it fixes things in the first place ;-) It is probably the
> > > > > deferred_qs thing.
> > > >
> > > > The deferred_qs flag is part of it! Looking forward to hearing what
> > > > you come up with as being the critical piece of this commit.
> > >
> > > The new deferred_qs flag indeed saves the machine from the deadlock.
> > >
> > > If we don't want the deferred_qs flag, then the below patch also fixes the issue.
> > > However, I am more sure than not that it does not handle all cases (such as
> > > what if we previously had an expedited grace period IPI in a previous reader
> > > section and had to defer processing; then it seems a similar deadlock
> > > would present). But anyway, the below patch does fix it for me! It is based on
> > > your -rcu tree commit 23634ebc1d946f19eb112d4455c1d84948875e31 (rcu: Check
> > > for wakeup-safe conditions in rcu_read_unlock_special()).
> >
> > The point here being that you rely on .b.blocked rather than
> > .b.deferred_qs. Hmmm... There are a number of places that check all
> > the bits via the .s leg of the rcu_special union. The .s check in
> > rcu_preempt_need_deferred_qs() should be OK because it is conditioned
> > on t->rcu_read_lock_nesting being zero or negative.
> > Do the rest of those also work out OK?
> >
> > It would be nice to remove the flag, but doing so clearly needs careful
> > review and testing.
>
> Agreed. I am planning to do an audit of this code within the next couple of
> weeks so I will be on the lookout for any optimization opportunities related
> to this. Will let you know if this can work. For now I like your patch better
> because it is more conservative and doesn't cause any space overhead.
Fixing the bug in a maintainable manner is the priority, to be sure.
However, simplifications, assuming that they work, are very much worth
considering as well.
And Murphy says that there are still a number of bugs and optimization
opportunities. ;-)
> If you'd like, please feel free to include my Tested-by on it:
>
> Tested-by: Joel Fernandes (Google) <[email protected]>
Will do, thank you!
> If you have a chance, could you also point me to any tests that show a
> performance improvement with the irqwork patch on the expedited GP use case?
> I'd like to try it out as well. I guess rcuperf should have some?
As a first thing to try, I suggest running rcuperf with both readers and
writers, with only expedited grace periods, and with most (or maybe even
all) CPUs having nohz_full enabled.
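One possible invocation, with the rcuperf module and boot parameters quoted
from memory, so please adjust as needed:

	tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuperf \
		--duration 10 --configs TREE \
		--bootargs "rcuperf.gp_exp=1 nohz_full=1-7"

with rcuperf.nreaders= and rcuperf.nwriters= added to set the reader/writer
mix.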
Thanx, Paul
Hi Steve,
> As Paul stated, interrupts are synchronization points. Archs can only
> play games with ordering when dealing with entities outside the CPU
> (devices and other CPUs). But if you have assembly that has two stores,
> and an interrupt comes in, the arch must guarantee that the stores are
> done in that order as the interrupt sees it.
Hopefully I'm not derailing the conversation too much with my questions
... but I was wondering if we had any documentation (or inline comments)
elaborating on this "interrupts are synchronization points"?
Thanks,
Andrea
On Sat, Jun 29, 2019 at 05:12:36PM +0200, Andrea Parri wrote:
> Hi Steve,
>
> > As Paul stated, interrupts are synchronization points. Archs can only
> > play games with ordering when dealing with entities outside the CPU
> > (devices and other CPUs). But if you have assembly that has two stores,
> > and an interrupt comes in, the arch must guarantee that the stores are
> > done in that order as the interrupt sees it.
>
> Hopefully I'm not derailing the conversation too much with my questions
> ... but I was wondering if we had any documentation (or inline comments)
> elaborating on this "interrupts are synchronization points"?
I don't know of any, but I would suggest instead looking at something
like the Hennessy and Patterson computer-architecture textbook.
Please keep in mind that the rather detailed documentation on RCU is a
bit of an outlier due to the fact that there are not so many textbooks
that cover RCU. If we tried to replicate all of the relevant textbooks
in the Documentation directory, it would be quite a large mess. ;-)
Thanx, Paul
On Sat, Jun 29, 2019 at 09:55:33AM -0700, Paul E. McKenney wrote:
> On Sat, Jun 29, 2019 at 05:12:36PM +0200, Andrea Parri wrote:
> > Hi Steve,
> >
> > > As Paul stated, interrupts are synchronization points. Archs can only
> > > play games with ordering when dealing with entities outside the CPU
> > > (devices and other CPUs). But if you have assembly that has two stores,
> > > and an interrupt comes in, the arch must guarantee that the stores are
> > > done in that order as the interrupt sees it.
> >
> > Hopefully I'm not derailing the conversation too much with my questions
> > ... but I was wondering if we had any documentation (or inline comments)
> > elaborating on this "interrupts are synchronization points"?
>
> I don't know of any, but I would suggest instead looking at something
> like the Hennessy and Patterson computer-architecture textbook.
>
> Please keep in mind that the rather detailed documentation on RCU is a
> bit of an outlier due to the fact that there are not so many textbooks
> that cover RCU. If we tried to replicate all of the relevant textbooks
> in the Documentation directory, it would be quite a large mess. ;-)
You know, some developers considered it worthwhile to develop formal specs in
order to better understand concepts such as "synchronization" and "IRQs
(processing)"! ... ;-) I still think that adding a few paragraphs (if
only in informal prose) to explain that "interrupts are synchronization
points" wouldn't hurt. And you're right, I guess we may well start from
a reference to H&P...
Remark: we do have code which (while acknowledging that "interrupts are
synchronization points") doesn't quite seem to "believe it"; cf., e.g.,
kernel/sched/membarrier.c:ipi_mb(). So, I guess the follow-up question
would be "Should we be (more) paranoid? ..."
Thanks,
Andrea
> Remark: we do have code which (while acknowledging that "interrupts are
> synchronization points") doesn't quite seem to "believe it"; cf., e.g.,
> kernel/sched/membarrier.c:ipi_mb(). So, I guess the follow-up question
> would be "Should we be (more) paranoid? ..."
should have been "IPIs are serializing" (so a different kind of "ordering" altogether...)
Andrea
On Sat, Jun 29, 2019 at 08:09:10PM +0200, Andrea Parri wrote:
> On Sat, Jun 29, 2019 at 09:55:33AM -0700, Paul E. McKenney wrote:
> > On Sat, Jun 29, 2019 at 05:12:36PM +0200, Andrea Parri wrote:
> > > Hi Steve,
> > >
> > > > As Paul stated, interrupts are synchronization points. Archs can only
> > > > play games with ordering when dealing with entities outside the CPU
> > > > (devices and other CPUs). But if you have assembly that has two stores,
> > > > and an interrupt comes in, the arch must guarantee that the stores are
> > > > done in that order as the interrupt sees it.
> > >
> > > Hopefully I'm not derailing the conversation too much with my questions
> > > ... but I was wondering if we had any documentation (or inline comments)
> > > elaborating on this "interrupts are synchronization points"?
> >
> > I don't know of any, but I would suggest instead looking at something
> > like the Hennessy and Patterson computer-architecture textbook.
> >
> > Please keep in mind that the rather detailed documentation on RCU is a
> > bit of an outlier due to the fact that there are not so many textbooks
> > that cover RCU. If we tried to replicate all of the relevant textbooks
> > in the Documentation directory, it would be quite a large mess. ;-)
>
> You know, some developers considered it worthwhile to develop formal specs in
> order to better understand concepts such as "synchronization" and "IRQs
> (processing)"! ... ;-) I still think that adding a few paragraphs (if
> only in informal prose) to explain that "interrupts are synchronization
> points" wouldn't hurt. And you're right, I guess we may well start from
> a reference to H&P...
>
> > Remark: we do have code which (while acknowledging that "interrupts are
> > synchronization points") doesn't quite seem to "believe it"; cf., e.g.,
> > kernel/sched/membarrier.c:ipi_mb(). So, I guess the follow-up question
> > would be "Should we be (more) paranoid? ..."
As Steve said that I said, they are synchronization points from the
viewpoint of code within the interrupted CPU. Unless the architecture
code does an smp_mb() on interrupt entry and exit (which perhaps some
do, for all I know, maybe all of them do by now), memory accesses could
still be reordered across the interrupt from the perspective of other
CPUs and devices on the system.
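To illustrate with a quick sketch (mine, just for this discussion;
WRITE_ONCE() and barrier() come from <linux/compiler.h>):

	static int x, y;

	void writer(void)	/* runs with interrupts enabled */
	{
		WRITE_ONCE(x, 1);
		barrier();	/* compiler-only ordering */
		WRITE_ONCE(y, 1);
	}

An interrupt handler that fires on this same CPU and sees y == 1 is
guaranteed to also see x == 1, because it executes in program order with
respect to the code it interrupted. Code running on some other CPU has
no such guarantee: it might well see y == 1 while still seeing x == 0,
unless the stores are ordered with smp_wmb() or smp_store_release().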
Thanx, Paul
On Sat, Jun 29, 2019 at 12:15:15PM -0700, Paul E. McKenney wrote:
> On Sat, Jun 29, 2019 at 08:09:10PM +0200, Andrea Parri wrote:
> > On Sat, Jun 29, 2019 at 09:55:33AM -0700, Paul E. McKenney wrote:
> > > On Sat, Jun 29, 2019 at 05:12:36PM +0200, Andrea Parri wrote:
> > > > Hi Steve,
> > > >
> > > > > As Paul stated, interrupts are synchronization points. Archs can only
> > > > > play games with ordering when dealing with entities outside the CPU
> > > > > (devices and other CPUs). But if you have assembly that has two stores,
> > > > > and an interrupt comes in, the arch must guarantee that the stores are
> > > > > done in that order as the interrupt sees it.
> > > >
> > > > Hopefully I'm not derailing the conversation too much with my questions
> > > > ... but I was wondering if we had any documentation (or inline comments)
> > > > elaborating on this "interrupts are synchronization points"?
> > >
> > > I don't know of any, but I would suggest instead looking at something
> > > like the Hennessy and Patterson computer-architecture textbook.
> > >
> > > Please keep in mind that the rather detailed documentation on RCU is a
> > > bit of an outlier due to the fact that there are not so many textbooks
> > > that cover RCU. If we tried to replicate all of the relevant textbooks
> > > in the Documentation directory, it would be quite a large mess. ;-)
> >
> > You know, some developers considered it worthwhile to develop formal specs in
> > order to better understand concepts such as "synchronization" and "IRQs
> > (processing)"! ... ;-) I still think that adding a few paragraphs (if
> > only in informal prose) to explain that "interrupts are synchronization
> > points" wouldn't hurt. And you're right, I guess we may well start from
> > a reference to H&P...
> >
> > Remark: we do have code which (while acknowledging that "interrupts are
> > synchronization points") doesn't quite seem to "believe it"; cf., e.g.,
> > kernel/sched/membarrier.c:ipi_mb(). So, I guess the follow-up question
> > would be "Should we be (more) paranoid? ..."
>
> As Steve said that I said, they are synchronization points from the
> viewpoint of code within the interrupted CPU. Unless the architecture
> code does an smp_mb() on interrupt entry and exit (which perhaps some
> do, for all I know, maybe all of them do by now), memory accesses could
> still be reordered across the interrupt from the perspective of other
> CPUs and devices on the system.
Yes, I got it. See:
https://lkml.kernel.org/r/20190629182132.GA5666@andrea
Still...
Andrea
On Fri, Jun 28, 2019 at 11:44:11AM -0400, Steven Rostedt wrote:
> On Fri, 28 Jun 2019 19:40:45 +0900
> Byungchul Park <[email protected]> wrote:
>
> > Wait.. I got a little bit confused about reordering.
> >
> > This 'STORE rcu_read_lock_nesting = 0' can happen before
> > 'STORE rcu_read_unlock_special.b.exp_hint = false' regardless of the
> > order the compiler generated due to the barrier(), because the two
> > stores are independent, so reordering them is within an arch's rights.
> >
> > Then.. is this scenario possible? Or do all archs properly deal with
> > interrupts across this kind of reordering?
>
> As Paul stated, interrupts are synchronization points. Archs can only
> play games with ordering when dealing with entities outside the CPU
> (devices and other CPUs). But if you have assembly that has two stores,
> and an interrupt comes in, the arch must guarantee that the stores are
> done in that order as the interrupt sees it.
>
> If this is not the case, there's a hell of a lot more broken in the
> kernel than just this, and "barrier()" would also be meaningless, as
> that is used mostly to deal with interrupts.
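So, restating the pattern in question as a sketch to check my
understanding (simplified from __rcu_read_unlock() and
rcu_read_unlock_special(), so the details may differ):

	WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
	barrier();	/* compiler-only ordering */
	t->rcu_read_lock_nesting = 0;

Even though another CPU could observe these two stores in either order,
an interrupt taken on this CPU sees them in program order, so the irq
path cannot observe the nesting store without the exp_hint store.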
Clear. Dear Paul and Steve, Thank you.
> -- Steve
On Fri, Jun 28, 2019 at 08:20:45PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-06-28 14:07:27 [-0400], Joel Fernandes wrote:
> > On Fri, Jun 28, 2019 at 07:45:45PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2019-06-28 10:30:11 [-0700], Paul E. McKenney wrote:
> > > > > I believe the .blocked field remains set even though we are no longer in a
> > > > > reader section, because of the deferred processing of the blocked lists that
> > > > > you mentioned yesterday.
> > > >
> > > > That can indeed happen. However, in current -rcu, that would mean
> > > > that .deferred_qs is also set, which (if in_irq()) would prevent
> > > > the raise_softirq_irqoff() from being invoked. Which was why I was
> > > > asking the questions about whether in_irq() returns true within threaded
> > > > interrupts yesterday. If it does, I need to find if there is some way
> > > > of determining whether rcu_read_unlock_special() is being called from
> > > > a threaded interrupt in order to suppress the call to raise_softirq()
> > > > in that case.
> > >
> > > Please note that:
> > > | void irq_exit(void)
> > > | {
> > > |…
> > > in_irq() returns true
> > > | preempt_count_sub(HARDIRQ_OFFSET);
> > > in_irq() returns false
> > > | if (!in_interrupt() && local_softirq_pending())
> > > | invoke_softirq();
> > >
> > > -> invoke_softirq() does
> > > | if (!force_irqthreads) {
> > > | __do_softirq();
> > > | } else {
> > > | wakeup_softirqd();
> > > | }
> > >
> >
> > In my traces, which I shared in a previous email, wakeup_softirqd() gets
> > called.
> >
> > I thought the force_irqthreads value was decided at boot time, so I got a
> > bit lost with your comment.
>
> It does. I just wanted to point out that in this case
> rcu_read_unlock() / rcu_read_unlock_special() won't see in_irq() return true.
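For reference, in_irq() is (from memory, so see include/linux/preempt.h
for the authoritative version) just a test of the HARDIRQ bits in
preempt_count():

| #define hardirq_count()	(preempt_count() & HARDIRQ_MASK)
| #define in_irq()		(hardirq_count())

and, as shown above, irq_exit() drops HARDIRQ_OFFSET before
invoke_softirq() runs, so everything called from there sees
in_irq() == false.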
Paul, Sebastian, this commit breaks kvm.sh for me on both my machines. I have
to revert it to make things work; otherwise the test just runs for 10 seconds
or so and dies.
"rcutorture: Tweak kvm options"
This reverts commit a6fda6dab93c2c06ef4b8cb4b9258df6674d2438.
Any thoughts or debug ideas? I am running v4.19.37.
thanks,
- Joel
On Fri, Jun 28, 2019 at 10:20:56AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 28, 2019 at 06:04:08PM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 28, 2019 at 08:54:04AM -0700, Paul E. McKenney wrote:
> > > Thank you! Plus it looks like scheduler_ipi() takes an early exit if
> > > ->wake_list is empty, regardless of TIF_NEED_RESCHED, right?
> >
> > Yes, TIF_NEED_RESCHED is checked in the interrupt return path.
>
> OK, got it. So the following sequence would be a valid way to get the
> scheduler's attention on the current CPU shortly after interrupts
> are re-enabled, even if the current CPU is already holding some
> rq or pi locks, correct?
>
> set_tsk_need_resched(current);
> set_preempt_need_resched();
> smp_send_reschedule(smp_processor_id());
I'm not sure if smp_send_reschedule() can be used as a self-IPI; some
hardware doesn't particularly like that, IIRC. That is, hardware might
only have interfaces to IPI _other_ CPUs, but not self.
The normal scheduler code takes care to not call smp_send_reschedule()
to self.
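Roughly this pattern (from memory, cf. resched_curr() in
kernel/sched/core.c; the early exits and the polling-idle case elided):

	if (cpu == smp_processor_id()) {
		/* Local CPU: flag only; the next IRQ return reschedules. */
		set_tsk_need_resched(curr);
		set_preempt_need_resched();
	} else {
		/* Remote CPU: an IPI is fine here. */
		smp_send_reschedule(cpu);
	}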
On Fri, Jun 28, 2019 at 03:01:36PM -0500, Scott Wood wrote:
> On Fri, 2019-06-28 at 16:15 +0200, Peter Zijlstra wrote:
> > On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > > Of course, unconditionally refusing to do the wakeup might not be
> > > > > a happy thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > > >
> > > > Couldn't smp_send_reschedule() be used instead?
> > >
> > > Good point. If current -rcu doesn't fix things for Sebastian's case,
> > > that would be well worth looking at. But there must be some reason
> > > why Peter Zijlstra didn't suggest it when he instead suggested using
> > > the IRQ work approach.
> > >
> > > Peter, thoughts?
> >
> > I've not exactly kept up with the thread; but irq_work allows you to run
> > some actual code on the remote CPU, which is often useful, and it is only
> > a little more expensive than smp_send_reschedule().
> >
> > Also, just smp_send_reschedule() doesn't really do anything without
> > first poking TIF_NEED_RESCHED (or other scheduler state) and if you want
> > to do both, there's other helpers you should use, like resched_cpu().
>
> resched_cpu() will not send an IPI to the current CPU[1].
Correct; smp_send_reschedule() might not work for self, since not all
hardware can self-IPI.
> Plus, the RCU
> code needs to set need_resched even in cases where it doesn't need to send
> the IPI. And worst of all, resched_cpu() takes the rq lock, which is the
> deadlock scenario we're trying to avoid.
>
> -Scott
>
> [1] Which makes me nervous about latency if there are any wakeups with irqs
> disabled, without a preempt_enable() after irqs are enabled again, and not
> inside an interrupt.
All good points; and those are all solved with irq_work. That provides a
'clean' IRQ context.
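For concreteness, the shape of the thing would be something like the
sketch below -- my sketch, not an actual patch, and the names are made
up:

	#include <linux/irq_work.h>
	#include <linux/sched.h>

	/* Runs later in a clean hard-IRQ context, no rq/pi locks held. */
	static void deferred_qs_handler(struct irq_work *iwp)
	{
		set_tsk_need_resched(current);
		set_preempt_need_resched();
	}

	static DEFINE_IRQ_WORK(deferred_qs_iw, deferred_qs_handler);

	/* Called from the irqs-disabled, possibly lock-holding context. */
	static void defer_qs_report(void)
	{
		irq_work_queue(&deferred_qs_iw); /* self-IPI, safe with irqs off */
	}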
On 2019-07-01 11:42:15 [+0200], Peter Zijlstra wrote:
> I'm not sure if smp_send_reschedule() can be used as a self-IPI; some
> hardware doesn't particularly like that, IIRC. That is, hardware might
> only have interfaces to IPI _other_ CPUs, but not self.
>
> The normal scheduler code takes care to not call smp_send_reschedule()
> to self.
and irq_work:
471ba0e686cb1 ("irq_work: Do not raise an IPI when queueing work on the local CPU")
Sebastian
On Mon, Jul 01, 2019 at 12:24:42PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-07-01 11:42:15 [+0200], Peter Zijlstra wrote:
> > I'm not sure if smp_send_reschedule() can be used as a self-IPI; some
> > hardware doesn't particularly like that, IIRC. That is, hardware might
> > only have interfaces to IPI _other_ CPUs, but not self.
> >
> > The normal scheduler code takes care to not call smp_send_reschedule()
> > to self.
>
> and irq_work:
> 471ba0e686cb1 ("irq_work: Do not raise an IPI when queueing work on the local CPU")
OK, so it looks like I will need to use something else. But thank you
for calling my attention to this commit.
Thanx, Paul
On Mon, Jul 01, 2019 at 05:23:05AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 01, 2019 at 12:24:42PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-07-01 11:42:15 [+0200], Peter Zijlstra wrote:
> > > I'm not sure if smp_send_reschedule() can be used as a self-IPI; some
> > > hardware doesn't particularly like that, IIRC. That is, hardware might
> > > only have interfaces to IPI _other_ CPUs, but not self.
> > >
> > > The normal scheduler code takes care to not call smp_send_reschedule()
> > > to self.
> >
> > and irq_work:
> > 471ba0e686cb1 ("irq_work: Do not raise an IPI when queueing work on the local CPU")
>
> OK, so it looks like I will need to use something else. But thank you
> for calling my attention to this commit.
I think that commit is worded slightly confusingly -- sorry, I should've
paid more attention.
irq_work _does_ work locally, and arch_irq_work_raise() must self-IPI,
otherwise everything is horribly broken.
But what happened was that irq_work_queue() and irq_work_queue_on(.cpu
= smp_processor_id()) weren't using the same code, and the latter would
try to self-IPI through arch_send_call_function_single_ipi().
Nick fixed that so that irq_work_queue() and irq_work_queue_on(.cpu =
smp_processor_id()) now both use arch_irq_work_raise(), and remote stuff
uses arch_send_call_function_single_ipi().
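So, after that commit, the dispatch is roughly (my summary, modulo
details):

	irq_work_queue(w)                -> arch_irq_work_raise() (self-IPI)
	irq_work_queue_on(w, this_cpu)   -> same local path as above
	irq_work_queue_on(w, remote_cpu) -> arch_send_call_function_single_ipi()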
On Mon, Jul 01, 2019 at 04:00:53PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 01, 2019 at 05:23:05AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 01, 2019 at 12:24:42PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2019-07-01 11:42:15 [+0200], Peter Zijlstra wrote:
> > > > I'm not sure if smp_send_reschedule() can be used as a self-IPI; some
> > > > hardware doesn't particularly like that, IIRC. That is, hardware might
> > > > only have interfaces to IPI _other_ CPUs, but not self.
> > > >
> > > > The normal scheduler code takes care to not call smp_send_reschedule()
> > > > to self.
> > >
> > > and irq_work:
> > > 471ba0e686cb1 ("irq_work: Do not raise an IPI when queueing work on the local CPU")
> >
> > OK, so it looks like I will need to use something else. But thank you
> > for calling my attention to this commit.
>
> I think that commit is worded slightly confusingly -- sorry, I should've
> paid more attention.
>
> irq_work _does_ work locally, and arch_irq_work_raise() must self-IPI,
> otherwise everything is horribly broken.
>
> But what happened was that irq_work_queue() and irq_work_queue_on(.cpu
> = smp_processor_id()) weren't using the same code, and the latter would
> try to self-IPI through arch_send_call_function_single_ipi().
>
> Nick fixed that so that irq_work_queue() and irq_work_queue_on(.cpu =
> smp_processor_id()) now both use arch_irq_work_raise(), and remote stuff
> uses arch_send_call_function_single_ipi().
OK, thank you for looking into this!
I therefore continue relying on IRQ work. Should there be problems with
kernels not supporting IRQ work, and if there is a legitimate reason
why they should not support IRQ work, I can look into things like timers
for those kernels.
Thanx, Paul