2014-04-01 15:09:00

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Mon, Mar 31, 2014 at 05:48:01PM -0700, Paul E. McKenney wrote:

> Alternatively, I suppose a wakeup could be getting lost. The main change
> related to that this merge window was ffa83fb565fb, which eliminated
> idle wakeups from RCU in the CONFIG_RCU_NOCB_CPU_ALL=y case.
>
> So, could you please try reverting ffa83fb565fb?

Did a run overnight with that reverted, no luck..

[11869.580161] INFO: rcu_preempt detected stalls on CPUs/tasks:
[11869.581245] (detected by 0, t=6502 jiffies, g=449892, c=449891, q=0)
[11869.582047] INFO: Stall ended before state dump start
[12064.517461] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12064.518539] (detected by 0, t=26007 jiffies, g=449892, c=449891, q=0)
[12064.519347] INFO: Stall ended before state dump start
[12259.454767] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12259.455818] (detected by 0, t=45512 jiffies, g=449892, c=449891, q=0)
[12259.456602] INFO: Stall ended before state dump start
[12454.392112] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12454.393131] (detected by 0, t=65017 jiffies, g=449892, c=449891, q=0)
[12454.393942] INFO: Stall ended before state dump start
[12649.329446] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12649.330446] (detected by 0, t=84522 jiffies, g=449892, c=449891, q=0)
[12649.331267] INFO: Stall ended before state dump start
[12844.266776] INFO: rcu_preempt detected stalls on CPUs/tasks:
[12844.267695] (detected by 0, t=104027 jiffies, g=449892, c=449891, q=0)
[12844.268571] INFO: Stall ended before state dump start
[13039.204102] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13039.205119] (detected by 0, t=123532 jiffies, g=449892, c=449891, q=0)
[13039.205871] INFO: Stall ended before state dump start
[13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13234.142448] (detected by 0, t=143037 jiffies, g=449892, c=449891, q=0)
[13234.143339] INFO: Stall ended before state dump start
[13429.078730] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13429.079771] (detected by 0, t=162542 jiffies, g=449892, c=449891, q=0)
[13429.080591] INFO: Stall ended before state dump start
[13624.016058] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13624.017116] (detected by 0, t=182047 jiffies, g=449892, c=449891, q=0)
[13624.018033] INFO: Stall ended before state dump start
[13818.953370] INFO: rcu_preempt detected stalls on CPUs/tasks:
[13818.954437] (detected by 0, t=201552 jiffies, g=449892, c=449891, q=0)
[13818.955334] INFO: Stall ended before state dump start
[14013.890682] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14013.891761] (detected by 0, t=221057 jiffies, g=449892, c=449891, q=0)
[14013.892691] INFO: Stall ended before state dump start
[14208.827993] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14208.829088] (detected by 0, t=240562 jiffies, g=449892, c=449891, q=0)
[14208.829977] INFO: Stall ended before state dump start
[14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
[14403.766405] (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
and on and on..

so kernel space still works like before, but userspace is locked up.

> If that doesn't work, I will need to put together some diagnostic patches.
> Starting with the one below.

bugger, I forgot to apply this last night. I'll do it on the next run.

Dave


2014-04-01 15:30:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 11:08:49AM -0400, Dave Jones wrote:
> On Mon, Mar 31, 2014 at 05:48:01PM -0700, Paul E. McKenney wrote:
>
> > Alternatively, I suppose a wakeup could be getting lost. The main change
> > related to that this merge window was ffa83fb565fb, which eliminated
> > idle wakeups from RCU in the CONFIG_RCU_NOCB_CPU_ALL=y case.
> >
> > So, could you please try reverting ffa83fb565fb?
>
> Did a run overnight with that reverted, no luck..
>
> [11869.580161] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [11869.581245] (detected by 0, t=6502 jiffies, g=449892, c=449891, q=0)
> [11869.582047] INFO: Stall ended before state dump start
> [12064.517461] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12064.518539] (detected by 0, t=26007 jiffies, g=449892, c=449891, q=0)
> [12064.519347] INFO: Stall ended before state dump start
> [12259.454767] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12259.455818] (detected by 0, t=45512 jiffies, g=449892, c=449891, q=0)
> [12259.456602] INFO: Stall ended before state dump start
> [12454.392112] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12454.393131] (detected by 0, t=65017 jiffies, g=449892, c=449891, q=0)
> [12454.393942] INFO: Stall ended before state dump start
> [12649.329446] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12649.330446] (detected by 0, t=84522 jiffies, g=449892, c=449891, q=0)
> [12649.331267] INFO: Stall ended before state dump start
> [12844.266776] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [12844.267695] (detected by 0, t=104027 jiffies, g=449892, c=449891, q=0)
> [12844.268571] INFO: Stall ended before state dump start
> [13039.204102] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13039.205119] (detected by 0, t=123532 jiffies, g=449892, c=449891, q=0)
> [13039.205871] INFO: Stall ended before state dump start
> [13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13234.141424] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13234.142448] (detected by 0, t=143037 jiffies, g=449892, c=449891, q=0)
> [13234.143339] INFO: Stall ended before state dump start
> [13429.078730] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13429.079771] (detected by 0, t=162542 jiffies, g=449892, c=449891, q=0)
> [13429.080591] INFO: Stall ended before state dump start
> [13624.016058] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13624.017116] (detected by 0, t=182047 jiffies, g=449892, c=449891, q=0)
> [13624.018033] INFO: Stall ended before state dump start
> [13818.953370] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [13818.954437] (detected by 0, t=201552 jiffies, g=449892, c=449891, q=0)
> [13818.955334] INFO: Stall ended before state dump start
> [14013.890682] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [14013.891761] (detected by 0, t=221057 jiffies, g=449892, c=449891, q=0)
> [14013.892691] INFO: Stall ended before state dump start
> [14208.827993] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [14208.829088] (detected by 0, t=240562 jiffies, g=449892, c=449891, q=0)
> [14208.829977] INFO: Stall ended before state dump start
> [14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [14403.766405] (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
> and on and on..
>
> so kernel space still works like before, but userspace is locked up.

Interesting. I suspect that if you reverted the rest of this merge
window's RCU patches, you would get the same result.

> > If that doesn't work, I will need to put together some diagnostic patches.
> > Starting with the one below.
>
> bugger, I forgot to apply this last night. I'll do it on the next run.

But let's see what this shows first.

Thanx, Paul

2014-04-01 17:22:56

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 08:30:32AM -0700, Paul E. McKenney wrote:
> > [14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [14403.766405] (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
> > and on and on..
> >
> > so kernel space still works like before, but userspace is locked up.
>
> Interesting. I suspect that if you reverted the rest of this merge
> window's RCU patches, you would get the same result.
>
> > > If that doesn't work, I will need to put together some diagnostic patches.
> > > Starting with the one below.
> >
> > bugger, I forgot to apply this last night. I'll do it on the next run.
>
> But let's see what this shows first.

[ 1173.923463] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1173.924457] (detected by 0, t=6502 jiffies, g=47728, c=47727, q=0)
[ 1173.925199] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1368.859744] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1368.860754] (detected by 0, t=26007 jiffies, g=47728, c=47727, q=0)
[ 1368.861559] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1563.796416] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1563.797426] (detected by 0, t=45512 jiffies, g=47728, c=47727, q=0)
[ 1563.798293] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1758.733747] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1758.734725] (detected by 0, t=65017 jiffies, g=47728, c=47727, q=0)
[ 1758.735526] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1953.670944] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1953.671932] (detected by 0, t=84522 jiffies, g=47728, c=47727, q=0)
[ 1953.672735] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2148.609140] (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
etc etc.

Dave

2014-04-01 17:55:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 01:22:44PM -0400, Dave Jones wrote:
> On Tue, Apr 01, 2014 at 08:30:32AM -0700, Paul E. McKenney wrote:
> > > [14403.765300] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [14403.766405] (detected by 0, t=260067 jiffies, g=449892, c=449891, q=0)
> > > and on and on..
> > >
> > > so kernel space still works like before, but userspace is locked up.
> >
> > Interesting. I suspect that if you reverted the rest of this merge
> > window's RCU patches, you would get the same result.
> >
> > > > If that doesn't work, I will need to put together some diagnostic patches.
> > > > Starting with the one below.
> > >
> > > bugger, I forgot to apply this last night. I'll do it on the next run.
> >
> > But let's see what this shows first.
>
> [ 1173.923463] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1173.924457] (detected by 0, t=6502 jiffies, g=47728, c=47727, q=0)
> [ 1173.925199] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1368.859744] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1368.860754] (detected by 0, t=26007 jiffies, g=47728, c=47727, q=0)
> [ 1368.861559] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1563.796416] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1563.797426] (detected by 0, t=45512 jiffies, g=47728, c=47727, q=0)
> [ 1563.798293] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1758.733747] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1758.734725] (detected by 0, t=65017 jiffies, g=47728, c=47727, q=0)
> [ 1758.735526] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1953.670944] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1953.671932] (detected by 0, t=84522 jiffies, g=47728, c=47727, q=0)
> [ 1953.672735] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2148.609140] (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
> etc etc.

Waiting uninterruptibly. Presumably blocked on mutex_lock(). But
you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
reported.

Given that you have CONFIG_RCU_TRACE=y, could you please enable the
following trace events and dump the trace before things hang?

trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init

If it is not feasible to dump the trace before things hang, let me
know, and I will work out some other diagnostic regime.

Thanx, Paul

2014-04-01 18:04:26

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
> > > > so kernel space still works like before, but userspace is locked up.
> > >
> > > Interesting. I suspect that if you reverted the rest of this merge
> > > window's RCU patches, you would get the same result.

Something that occurred to me is that this might be something in the x86 merge
that's just changing timings enough to expose this problem.
At some point this evening, I'll try bisecting it if we don't get any closer.

> > [ 1953.672735] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> > [ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [ 2148.609140] (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
> > etc etc.
>
> Waiting uninterruptibly. Presumably blocked on mutex_lock(). But
> you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
> reported.

Lockdep had reported something a little earlier (timestamped at 1108.xxxxxx)
but that's a known false-positive in xfs.

> Given that you have CONFIG_RCU_TRACE=y, could you please enable the
> following trace events and dump the trace before things hang?
>
> trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
>
> If it is not feasible to dump the trace before things hang, let me
> know, and I will work out some other diagnostic regime.

I'll give that a shot when I get back in a few hours.

Dave

2014-04-01 18:32:52

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
> On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
> > > > > so kernel space still works like before, but userspace is locked up.
> > > >
> > > > Interesting. I suspect that if you reverted the rest of this merge
> > > > window's RCU patches, you would get the same result.
>
> Something that occurred to me is that this might be something in the x86 merge
> that's just changing timings enough to expose this problem.
> At some point this evening, I'll try bisecting it if we don't get any closer.

OK. ;-)

> > > [ 1953.672735] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> > > [ 2148.608132] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [ 2148.609140] (detected by 0, t=104027 jiffies, g=47728, c=47727, q=0)
> > > etc etc.
> >
> > Waiting uninterruptibly. Presumably blocked on mutex_lock(). But
> > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
> > reported.
>
> Lockdep had reported something a little earlier (timestamped at 1108.xxxxxx)
> but that's a known false-positive in xfs.

Yep, I would be very surprised if that was related to the grace-period hang.

> > Given that you have CONFIG_RCU_TRACE=y, could you please enable the
> > following trace events and dump the trace before things hang?
> >
> > trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
> >
> > If it is not feasible to dump the trace before things hang, let me
> > know, and I will work out some other diagnostic regime.
>
> I'll give that a shot when I get back in a few hours.

Cool!

Thanx, Paul

2014-04-01 22:16:27

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:

> > > Given that you have CONFIG_RCU_TRACE=y, could you please enable the
> > > following trace events and dump the trace before things hang?
> > >
> > > trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
> > >
> > > If it is not feasible to dump the trace before things hang, let me
> > > know, and I will work out some other diagnostic regime.
> >
> > I'll give that a shot when I get back in a few hours.

ok, this is tricky, and I'm not sure how helpful the below is.
because I don't know when the hang is going to happen, in parallel to my usual workload
I did this..
while [ 1 ];
do
cat trace
echo > trace
done

and got this...



# entries-in-buffer/entries-written: 7/7 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [000] dNs3 851.748475: rcu_grace_period: rcu_preempt 21342 cpuend
<idle>-0 [000] dNs3 851.748477: rcu_grace_period: rcu_preempt 21343 cpustart
<idle>-0 [000] dNs3 851.748477: rcu_grace_period: rcu_sched 20080 cpuend
<idle>-0 [000] dN.2 851.748480: rcu_grace_period: rcu_preempt 21343 cpuqs
rcu_preempt-9 [000] ...1 851.748485: rcu_grace_period: rcu_preempt 21343 fqsstart
rcu_preempt-9 [000] ...1 851.748487: rcu_grace_period: rcu_preempt 21343 fqsend
rcu_preempt-9 [000] ...1 851.748487: rcu_grace_period: rcu_preempt 21343 fqswait
# tracer: nop
#
# entries-in-buffer/entries-written: 13/13 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
rcu_preempt-9 [003] d..2 851.878457: rcu_grace_period: rcu_preempt 21349 end
rcu_preempt-9 [003] d..2 851.878459: rcu_grace_period: rcu_preempt 21349 newreq
rcu_preempt-9 [003] ...1 851.878459: rcu_grace_period: rcu_preempt 21349 reqwait
rcu_preempt-9 [003] d..2 851.878460: rcu_grace_period: rcu_preempt 21350 start
rcu_preempt-9 [003] d..2 851.878461: rcu_grace_period: rcu_preempt 21350 cpustart
rcu_preempt-9 [003] d..2 851.878462: rcu_grace_period_init: rcu_preempt 21350 0 0 7 f
rcu_preempt-9 [003] ...1 851.878463: rcu_grace_period: rcu_preempt 21350 fqswait
rcu_preempt-9 [003] d..2 851.878464: rcu_grace_period: rcu_preempt 21350 cpuqs
<idle>-0 [000] dNs3 851.878482: rcu_grace_period: rcu_preempt 21349 cpuend
<idle>-0 [000] dNs3 851.878483: rcu_grace_period: rcu_preempt 21350 cpustart
<idle>-0 [000] dNs3 851.878484: rcu_grace_period: rcu_sched 20086 cpuend
<idle>-0 [000] .N.2 851.878486: rcu_grace_period: rcu_sched 20086 cpuqs
<idle>-0 [000] dN.2 851.878487: rcu_grace_period: rcu_preempt 21350 cpuqs


followed by a half dozen 'empty' traces before it totally locked up.

Dave

2014-04-01 23:18:33

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 06:16:16PM -0400, Dave Jones wrote:
> On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
>
> > > > Given that you have CONFIG_RCU_TRACE=y, could you please enable the
> > > > following trace events and dump the trace before things hang?
> > > >
> > > > trace_event=rcu:rcu_grace_period,rcu:rcu_grace_period_init
> > > >
> > > > If it is not feasible to dump the trace before things hang, let me
> > > > know, and I will work out some other diagnostic regime.
> > >
> > > I'll give that a shot when I get back in a few hours.
>
> ok, this is tricky, and I'm not sure how helpful the below is.
> because I don't know when the hang is going to happen, in parallel to my usual workload
> I did this..
> while [ 1 ];
> do
> cat trace
> echo > trace
> done
>
> and got this...
>
>
>
> # entries-in-buffer/entries-written: 7/7 #P:4
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
> <idle>-0 [000] dNs3 851.748475: rcu_grace_period: rcu_preempt 21342 cpuend
> <idle>-0 [000] dNs3 851.748477: rcu_grace_period: rcu_preempt 21343 cpustart
> <idle>-0 [000] dNs3 851.748477: rcu_grace_period: rcu_sched 20080 cpuend
> <idle>-0 [000] dN.2 851.748480: rcu_grace_period: rcu_preempt 21343 cpuqs
> rcu_preempt-9 [000] ...1 851.748485: rcu_grace_period: rcu_preempt 21343 fqsstart
> rcu_preempt-9 [000] ...1 851.748487: rcu_grace_period: rcu_preempt 21343 fqsend
> rcu_preempt-9 [000] ...1 851.748487: rcu_grace_period: rcu_preempt 21343 fqswait
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 13/13 #P:4
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
> rcu_preempt-9 [003] d..2 851.878457: rcu_grace_period: rcu_preempt 21349 end
> rcu_preempt-9 [003] d..2 851.878459: rcu_grace_period: rcu_preempt 21349 newreq
> rcu_preempt-9 [003] ...1 851.878459: rcu_grace_period: rcu_preempt 21349 reqwait
> rcu_preempt-9 [003] d..2 851.878460: rcu_grace_period: rcu_preempt 21350 start
> rcu_preempt-9 [003] d..2 851.878461: rcu_grace_period: rcu_preempt 21350 cpustart
> rcu_preempt-9 [003] d..2 851.878462: rcu_grace_period_init: rcu_preempt 21350 0 0 7 f
> rcu_preempt-9 [003] ...1 851.878463: rcu_grace_period: rcu_preempt 21350 fqswait
> rcu_preempt-9 [003] d..2 851.878464: rcu_grace_period: rcu_preempt 21350 cpuqs
> <idle>-0 [000] dNs3 851.878482: rcu_grace_period: rcu_preempt 21349 cpuend
> <idle>-0 [000] dNs3 851.878483: rcu_grace_period: rcu_preempt 21350 cpustart
> <idle>-0 [000] dNs3 851.878484: rcu_grace_period: rcu_sched 20086 cpuend
> <idle>-0 [000] .N.2 851.878486: rcu_grace_period: rcu_sched 20086 cpuqs
> <idle>-0 [000] dN.2 851.878487: rcu_grace_period: rcu_preempt 21350 cpuqs
>
>
> followed by a half dozen 'empty' traces before it totally locked up.

Cool, thank you!

Could you please also send the RCU CPU stall warning messages?

Thanx, Paul

2014-04-01 23:31:40

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 04:18:22PM -0700, Paul E. McKenney wrote:

> > # entries-in-buffer/entries-written: 7/7 #P:4
> > #
> > # _-----=> irqs-off
> > # / _----=> need-resched
> > # | / _---=> hardirq/softirq
> > # || / _--=> preempt-depth
> > # ||| / delay
> > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > # | | | |||| | |
> > <idle>-0 [000] dNs3 851.748475: rcu_grace_period: rcu_preempt 21342 cpuend
> > <idle>-0 [000] dNs3 851.748477: rcu_grace_period: rcu_preempt 21343 cpustart
> > <idle>-0 [000] dNs3 851.748477: rcu_grace_period: rcu_sched 20080 cpuend
> > <idle>-0 [000] dN.2 851.748480: rcu_grace_period: rcu_preempt 21343 cpuqs
> > rcu_preempt-9 [000] ...1 851.748485: rcu_grace_period: rcu_preempt 21343 fqsstart
> > rcu_preempt-9 [000] ...1 851.748487: rcu_grace_period: rcu_preempt 21343 fqsend
> > rcu_preempt-9 [000] ...1 851.748487: rcu_grace_period: rcu_preempt 21343 fqswait
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 13/13 #P:4
> > #
> > # _-----=> irqs-off
> > # / _----=> need-resched
> > # | / _---=> hardirq/softirq
> > # || / _--=> preempt-depth
> > # ||| / delay
> > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > # | | | |||| | |
> > rcu_preempt-9 [003] d..2 851.878457: rcu_grace_period: rcu_preempt 21349 end
> > rcu_preempt-9 [003] d..2 851.878459: rcu_grace_period: rcu_preempt 21349 newreq
> > rcu_preempt-9 [003] ...1 851.878459: rcu_grace_period: rcu_preempt 21349 reqwait
> > rcu_preempt-9 [003] d..2 851.878460: rcu_grace_period: rcu_preempt 21350 start
> > rcu_preempt-9 [003] d..2 851.878461: rcu_grace_period: rcu_preempt 21350 cpustart
> > rcu_preempt-9 [003] d..2 851.878462: rcu_grace_period_init: rcu_preempt 21350 0 0 7 f
> > rcu_preempt-9 [003] ...1 851.878463: rcu_grace_period: rcu_preempt 21350 fqswait
> > rcu_preempt-9 [003] d..2 851.878464: rcu_grace_period: rcu_preempt 21350 cpuqs
> > <idle>-0 [000] dNs3 851.878482: rcu_grace_period: rcu_preempt 21349 cpuend
> > <idle>-0 [000] dNs3 851.878483: rcu_grace_period: rcu_preempt 21350 cpustart
> > <idle>-0 [000] dNs3 851.878484: rcu_grace_period: rcu_sched 20086 cpuend
> > <idle>-0 [000] .N.2 851.878486: rcu_grace_period: rcu_sched 20086 cpuqs
> > <idle>-0 [000] dN.2 851.878487: rcu_grace_period: rcu_preempt 21350 cpuqs
> >
> >
> > followed by a half dozen 'empty' traces before it totally locked up.
>
> Cool, thank you!
>
> Could you please also send the RCU CPU stall warning messages?

>From that run they were..

[ 917.790265] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 917.791189] (detected by 0, t=6502 jiffies, g=21390, c=21389, q=0)
[ 917.792058] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 917.792799] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 917.793534] (detected by 0, t=6502 jiffies, g=20128, c=20127, q=0)
[ 917.794265] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1112.727480] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1112.728425] (detected by 0, t=26007 jiffies, g=21390, c=21389, q=0)
[ 1112.729300] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1112.730035] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1112.730744] (detected by 0, t=26007 jiffies, g=20128, c=20127, q=0)
[ 1112.731459] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1307.664710] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1307.665623] (detected by 0, t=45512 jiffies, g=21390, c=21389, q=0)
[ 1307.666386] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1307.667113] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1307.667839] (detected by 0, t=45512 jiffies, g=20128, c=20127, q=0)
[ 1307.668572] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1502.601961] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1502.602905] (detected by 0, t=65017 jiffies, g=21390, c=21389, q=0)
[ 1502.603765] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1502.604550] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1502.605305] (detected by 0, t=65017 jiffies, g=20128, c=20127, q=0)
[ 1502.606061] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1697.539217] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1697.540189] (detected by 0, t=84522 jiffies, g=21390, c=21389, q=0)
[ 1697.541052] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1697.541826] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1697.542600] (detected by 0, t=84522 jiffies, g=20128, c=20127, q=0)
[ 1697.543378] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1892.476433] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1892.477545] (detected by 0, t=104027 jiffies, g=21390, c=21389, q=0)
[ 1892.478402] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 1892.479208] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1892.480010] (detected by 0, t=104027 jiffies, g=20128, c=20127, q=0)
[ 1892.480831] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 2087.413694] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2087.414748] (detected by 0, t=123532 jiffies, g=21390, c=21389, q=0)
[ 2087.415722] INFO: Stall ended before state dump start, gp_kthread state: 0x2
[ 2087.416558] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2087.417392] (detected by 0, t=123532 jiffies, g=20128, c=20127, q=0)
[ 2087.418231] INFO: Stall ended before state dump start, gp_kthread state: 0x2


2014-04-01 23:57:34

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 07:31:30PM -0400, Dave Jones wrote:
> On Tue, Apr 01, 2014 at 04:18:22PM -0700, Paul E. McKenney wrote:
>
> > > # entries-in-buffer/entries-written: 7/7 #P:4
> > > #
> > > # _-----=> irqs-off
> > > # / _----=> need-resched
> > > # | / _---=> hardirq/softirq
> > > # || / _--=> preempt-depth
> > > # ||| / delay
> > > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > > # | | | |||| | |
> > > <idle>-0 [000] dNs3 851.748475: rcu_grace_period: rcu_preempt 21342 cpuend
> > > <idle>-0 [000] dNs3 851.748477: rcu_grace_period: rcu_preempt 21343 cpustart
> > > <idle>-0 [000] dNs3 851.748477: rcu_grace_period: rcu_sched 20080 cpuend
> > > <idle>-0 [000] dN.2 851.748480: rcu_grace_period: rcu_preempt 21343 cpuqs
> > > rcu_preempt-9 [000] ...1 851.748485: rcu_grace_period: rcu_preempt 21343 fqsstart
> > > rcu_preempt-9 [000] ...1 851.748487: rcu_grace_period: rcu_preempt 21343 fqsend
> > > rcu_preempt-9 [000] ...1 851.748487: rcu_grace_period: rcu_preempt 21343 fqswait
> > > # tracer: nop
> > > #
> > > # entries-in-buffer/entries-written: 13/13 #P:4
> > > #
> > > # _-----=> irqs-off
> > > # / _----=> need-resched
> > > # | / _---=> hardirq/softirq
> > > # || / _--=> preempt-depth
> > > # ||| / delay
> > > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > > # | | | |||| | |
> > > rcu_preempt-9 [003] d..2 851.878457: rcu_grace_period: rcu_preempt 21349 end
> > > rcu_preempt-9 [003] d..2 851.878459: rcu_grace_period: rcu_preempt 21349 newreq
> > > rcu_preempt-9 [003] ...1 851.878459: rcu_grace_period: rcu_preempt 21349 reqwait
> > > rcu_preempt-9 [003] d..2 851.878460: rcu_grace_period: rcu_preempt 21350 start
> > > rcu_preempt-9 [003] d..2 851.878461: rcu_grace_period: rcu_preempt 21350 cpustart
> > > rcu_preempt-9 [003] d..2 851.878462: rcu_grace_period_init: rcu_preempt 21350 0 0 7 f
> > > rcu_preempt-9 [003] ...1 851.878463: rcu_grace_period: rcu_preempt 21350 fqswait
> > > rcu_preempt-9 [003] d..2 851.878464: rcu_grace_period: rcu_preempt 21350 cpuqs
> > > <idle>-0 [000] dNs3 851.878482: rcu_grace_period: rcu_preempt 21349 cpuend
> > > <idle>-0 [000] dNs3 851.878483: rcu_grace_period: rcu_preempt 21350 cpustart
> > > <idle>-0 [000] dNs3 851.878484: rcu_grace_period: rcu_sched 20086 cpuend
> > > <idle>-0 [000] .N.2 851.878486: rcu_grace_period: rcu_sched 20086 cpuqs
> > > <idle>-0 [000] dN.2 851.878487: rcu_grace_period: rcu_preempt 21350 cpuqs

OK, so the trace ended on grace period #21350...

> > >
> > >
> > > followed by a half dozen 'empty' traces before it totally locked up.
> >
> > Cool, thank you!
> >
> > Could you please also send the RCU CPU stall warning messages?
>
> >From that run they were..
>
> [ 917.790265] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 917.791189] (detected by 0, t=6502 jiffies, g=21390, c=21389, q=0)

But the hang didn't happen until grace period #21390. Sigh! That means
that the trace doesn't tell me about the current grace period.

Thank you for getting this info, but it is beginning to look like
bisection is necessary here. :-(

Thanx, Paul

> [ 917.792058] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 917.792799] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 917.793534] (detected by 0, t=6502 jiffies, g=20128, c=20127, q=0)
> [ 917.794265] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1112.727480] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1112.728425] (detected by 0, t=26007 jiffies, g=21390, c=21389, q=0)
> [ 1112.729300] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1112.730035] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1112.730744] (detected by 0, t=26007 jiffies, g=20128, c=20127, q=0)
> [ 1112.731459] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1307.664710] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1307.665623] (detected by 0, t=45512 jiffies, g=21390, c=21389, q=0)
> [ 1307.666386] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1307.667113] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1307.667839] (detected by 0, t=45512 jiffies, g=20128, c=20127, q=0)
> [ 1307.668572] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1502.601961] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1502.602905] (detected by 0, t=65017 jiffies, g=21390, c=21389, q=0)
> [ 1502.603765] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1502.604550] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1502.605305] (detected by 0, t=65017 jiffies, g=20128, c=20127, q=0)
> [ 1502.606061] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1697.539217] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1697.540189] (detected by 0, t=84522 jiffies, g=21390, c=21389, q=0)
> [ 1697.541052] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1697.541826] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1697.542600] (detected by 0, t=84522 jiffies, g=20128, c=20127, q=0)
> [ 1697.543378] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1892.476433] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1892.477545] (detected by 0, t=104027 jiffies, g=21390, c=21389, q=0)
> [ 1892.478402] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 1892.479208] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1892.480010] (detected by 0, t=104027 jiffies, g=20128, c=20127, q=0)
> [ 1892.480831] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 2087.413694] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2087.414748] (detected by 0, t=123532 jiffies, g=21390, c=21389, q=0)
> [ 2087.415722] INFO: Stall ended before state dump start, gp_kthread state: 0x2
> [ 2087.416558] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 2087.417392] (detected by 0, t=123532 jiffies, g=20128, c=20127, q=0)
> [ 2087.418231] INFO: Stall ended before state dump start, gp_kthread state: 0x2
>
>
>

2014-04-02 00:08:14

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 04:57:24PM -0700, Paul E. McKenney wrote:

> > [ 917.790265] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [ 917.791189] (detected by 0, t=6502 jiffies, g=21390, c=21389, q=0)
>
> But the hang didn't happen until grace period #21390. Sigh! That means
> that the trace doesn't tell me about the current grace period.
>
> Thank you for getting this info, but it is beginning to look like
> bisection is necessary here. :-(

ok, I was hoping that wouldn't be the case, but let's see what it turns up.

Dave

2014-04-02 16:20:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
> > On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > Waiting uninterruptibly. Presumably blocked on mutex_lock(). But
> > > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
> > > reported.
> >
> > Lockdep had reported something a little earlier (timestamped at 1108.xxxxxx)
> > but that's a known false-positive in xfs.
>
> Yep, I would be very surprised if that was related to the grace-period hang.

Ah, but it could be suppressing later lockdep splats. So if this can be
reproduced without xfs, we might get additional information from lockdep.

Thanx, Paul

2014-04-02 16:23:40

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Wed, Apr 02, 2014 at 09:20:43AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
> > On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
> > > On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
>
> [ . . . ]
>
> > > > Waiting uninterruptibly. Presumably blocked on mutex_lock(). But
> > > > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
> > > > reported.
> > >
> > > Lockdep had reported something a little earlier (timestamped at 1108.xxxxxx)
> > > but that's a known false-positive in xfs.
> >
> > Yep, I would be very surprised if that was related to the grace-period hang.
>
> Ah, but it could be suppressing later lockdep splats. So if this can be
> reproduced without xfs, we might get additional information from lockdep.

Hmm, I lack another test machine, and don't really feel like reinstalling it right now.
I'm halfway through a bisect, hopefully that's fruitful.

Dave

2014-04-02 22:48:51

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Wed, Apr 02, 2014 at 09:20:43AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 01, 2014 at 11:32:45AM -0700, Paul E. McKenney wrote:
> > On Tue, Apr 01, 2014 at 02:04:14PM -0400, Dave Jones wrote:
> > > On Tue, Apr 01, 2014 at 10:55:45AM -0700, Paul E. McKenney wrote:
>
> [ . . . ]
>
> > > > Waiting uninterruptibly. Presumably blocked on mutex_lock(). But
> > > > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
> > > > reported.
> > >
> > > Lockdep had reported something a little earlier (timestamped at 1108.xxxxxx)
> > > but that's a known false-positive in xfs.
> >
> > Yep, I would be very surprised if that was related to the grace-period hang.
>
> Ah, but it could be suppressing later lockdep splats. So if this can be
> reproduced without xfs, we might get additional information from lockdep.

Hrmph.

$ git bisect bad
The merge base 5cb480f6b488128140c940abff3c36f524a334a8 is bad.
This means the bug has been fixed between 5cb480f6b488128140c940abff3c36f524a334a8 and [455c6fdbd219161bd09b1165f11699d6d73de11c 62c206bd514600d4d73751ade00dca8e488390a3 e086481baf9d0436bdd6e9b739bfa4a83fb89ef5].

Not sure where to go from here..

The 'good' news is I can reproduce it pretty reliably now.
I start my fuzz tester, and immediately do a git diff in my working tree,
and then boom..

Dave

2014-04-03 20:02:01

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Wed, Apr 02, 2014 at 06:48:40PM -0400, Dave Jones wrote:

> > > > > Waiting uninterruptibly. Presumably blocked on mutex_lock(). But
> > > > > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
> > > > > reported.
> > > >
> > > > Lockdep had reported something a little earlier (timestamped at 1108.xxxxxx)
> > > > but that's a known false-positive in xfs.
> > >
> > > Yep, I would be very surprised if that was related to the grace-period hang.
> >
> > Ah, but it could be suppressing later lockdep splats. So if this can be
> > reproduced without xfs, we might get additional information from lockdep.
>
> Hrmph.
>
> $ git bisect bad
> The merge base 5cb480f6b488128140c940abff3c36f524a334a8 is bad.
> This means the bug has been fixed between 5cb480f6b488128140c940abff3c36f524a334a8 and [455c6fdbd219161bd09b1165f11699d6d73de11c 62c206bd514600d4d73751ade00dca8e488390a3 e086481baf9d0436bdd6e9b739bfa4a83fb89ef5].
>
> Not sure where to go from here..
>
> The 'good' news is I can reproduce it pretty reliably now.
> I start my fuzz tester, and immediately do a git diff in my working tree,
> and then boom..

Even better, now I realise I don't even need my fuzzer in the mix. Just doing
a fair amount of disk io (like a git diff on a dirty tree) will trigger it.

I've tried adding a show_state() call when the stall happens, but another stall
seems to occur before it gets a chance to even dump everything over the usb-serial console.
And of course nothing ever makes it to disk, even though I can sysrq-sync, on the
next reboot systemd has stuffed a bunch of ^@ in the log where the interesting
stuff should be.

Any other ideas ?

Dave

2014-04-03 20:47:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Thu, Apr 03, 2014 at 04:01:43PM -0400, Dave Jones wrote:
> On Wed, Apr 02, 2014 at 06:48:40PM -0400, Dave Jones wrote:
>
> > > > > > Waiting uninterruptibly. Presumably blocked on mutex_lock(). But
> > > > > > you have CONFIG_PROVE_LOCKING(), so any deadlocks should have been
> > > > > > reported.
> > > > >
> > > > > Lockdep had reported something a little earlier (timestamped at 1108.xxxxxx)
> > > > > but that's a known false-positive in xfs.
> > > >
> > > > Yep, I would be very surprised if that was related to the grace-period hang.
> > >
> > > Ah, but it could be suppressing later lockdep splats. So if this can be
> > > reproduced without xfs, we might get additional information from lockdep.
> >
> > Hrmph.
> >
> > $ git bisect bad
> > The merge base 5cb480f6b488128140c940abff3c36f524a334a8 is bad.
> > This means the bug has been fixed between 5cb480f6b488128140c940abff3c36f524a334a8 and [455c6fdbd219161bd09b1165f11699d6d73de11c 62c206bd514600d4d73751ade00dca8e488390a3 e086481baf9d0436bdd6e9b739bfa4a83fb89ef5].
> >
> > Not sure where to go from here..
> >
> > The 'good' news is I can reproduce it pretty reliably now.
> > I start my fuzz tester, and immediately do a git diff in my working tree,
> > and then boom..
>
> Even better, now I realise I don't even need my fuzzer in the mix. Just doing
> a fair amount of disk io (like a git diff on a dirty tree) will trigger it.
>
> I've tried adding a show_state() call when the stall happens, but another stall
> seems to occur before it gets a chance to even dump everything over the usb-serial console.
> And of course nothing ever makes it to disk, even though I can sysrq-sync, on the
> next reboot systemd has stuffed a bunch of ^@ in the log where the interesting
> stuff should be.
>
> Any other ideas ?

This bug does seem to be doing an effective job of defending itself,
doesn't it? I guess producing additional debug output just isn't going
to cut it in this case.

So, how about reverting each commit in the RCU series, and then bisecting
through the reverts? Something like the following:

wherever=linus/master # or substitute whatever point you wish.
# Create a revert branch for the rcu.2014.02.26a branch
git checkout -b anti-fixes $wherever
git revert 5cb5c6e18f822b19bd41a2c0f9930c82b3ec0bc9
git revert 7a754743185a4b05818e10058fa2fbe4e6969085
git revert 8857563b819b140aa8c9be920cfe44d5d3f808b7
git revert add1f0995454374d90c9d6b2c420d2fba3d0a4e3
git revert ae1670339c95c3ff96ab10582506cf827c5fecc8
git revert 52e2bb958ac4f9b3c4bdd78606d279852fd72922
git revert 88c1863066ccfa456797e12c5d8b4631aa1ad0d0
git revert 0adab9b9aa18d7e90337d43567f1eec3d5401b81
git revert 41f4abd92a34f9c5110bbb870c04f8854604e28d
git revert cb1e78cfa267453bb19e7edafd214c03834b664c
git revert 87de1cfdc55b16b794e245b07322340725149d62
git revert 3660c2813fb6d0ba48ee44bcbf9feddf7218c11d
# Create a revert branch for the rt.2014.02.17b branch
git checkout -b anti-rt $wherever
git revert f1f399d1281ea339a08469f7e58193624992f620
git revert ffa83fb565fbc397cbafb4b71fd1cce276d4c3b6
git revert 2f33b512a5460578f6cf11d7b7867bed53157c7c
git merge anti-fixes

Then bisect through these reverts.

I am assuming, perhaps naively, that changes under Documentation and
to torture testing should not be affecting you.

Does this make sense?

Thanx, Paul

2014-04-03 21:44:45

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Thu, Apr 03, 2014 at 01:46:56PM -0700, Paul E. McKenney wrote:
> So, how about reverting each commit in the RCU series, and then bisecting
> through the reverts? Something like the following:
>
> wherever=linus/master # or substitute whatever point you wish.
> # Create a revert branch for the rcu.2014.02.26a branch
> git checkout -b anti-fixes $wherever
> git revert 5cb5c6e18f822b19bd41a2c0f9930c82b3ec0bc9
> git revert 7a754743185a4b05818e10058fa2fbe4e6969085
> git revert 8857563b819b140aa8c9be920cfe44d5d3f808b7
> git revert add1f0995454374d90c9d6b2c420d2fba3d0a4e3
> git revert ae1670339c95c3ff96ab10582506cf827c5fecc8
> git revert 52e2bb958ac4f9b3c4bdd78606d279852fd72922
> git revert 88c1863066ccfa456797e12c5d8b4631aa1ad0d0
> git revert 0adab9b9aa18d7e90337d43567f1eec3d5401b81
> git revert 41f4abd92a34f9c5110bbb870c04f8854604e28d
> git revert cb1e78cfa267453bb19e7edafd214c03834b664c
> git revert 87de1cfdc55b16b794e245b07322340725149d62
> git revert 3660c2813fb6d0ba48ee44bcbf9feddf7218c11d
> # Create a revert branch for the rt.2014.02.17b branch
> git checkout -b anti-rt $wherever
> git revert f1f399d1281ea339a08469f7e58193624992f620
> git revert ffa83fb565fbc397cbafb4b71fd1cce276d4c3b6
> git revert 2f33b512a5460578f6cf11d7b7867bed53157c7c
> git merge anti-fixes
>
> Then bisect through these reverts.
>
> I am assuming, perhaps naively, that changes under Documentation and
> to torture testing should not be affecting you.
>
> Does this make sense?

I tried reverting every one of those and could still hit it, which
seems to indicate the bug is elsewhere, and rcu is just a victim of it.
I'm going to start trying to revert the lockdep changes, given this seems
to happen after that has warned..

Dave

2014-04-03 22:37:17

by Dave Jones

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Thu, Apr 03, 2014 at 05:44:30PM -0400, Dave Jones wrote:
> On Thu, Apr 03, 2014 at 01:46:56PM -0700, Paul E. McKenney wrote:
> > git merge anti-fixes
> >
> > Then bisect through these reverts.
> >
> > I am assuming, perhaps naively, that changes under Documentation and
> > to torture testing should not be affecting you.
> >
> > Does this make sense?
>
> I tried reverting every one of those and could still hit it, which
> seems to indicate the bug is elsewhere, and rcu is just a victim of it.
> I'm going to start trying to revert the lockdep changes, given this seems
> to happen after that has warned..

And that was also without any success.
Tomorrow I'll try a bisect on just arch/x86.

Dave

2014-04-04 17:06:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_prempt stalls / lockup

On Thu, Apr 03, 2014 at 06:37:00PM -0400, Dave Jones wrote:
> On Thu, Apr 03, 2014 at 05:44:30PM -0400, Dave Jones wrote:
> > On Thu, Apr 03, 2014 at 01:46:56PM -0700, Paul E. McKenney wrote:
> > > git merge anti-fixes
> > >
> > > Then bisect through these reverts.
> > >
> > > I am assuming, perhaps naively, that changes under Documentation and
> > > to torture testing should not be affecting you.
> > >
> > > Does this make sense?
> >
> > I tried reverting every one of those and could still hit it, which
> > seems to indicate the bug is elsewhere, and rcu is just a victim of it.
> > I'm going to start trying to revert the lockdep changes, given this seems
> > to happen after that has warned..
>
> And that was also without any success.
> Tomorrow I'll try a bisect on just arch/x86.

I cannot think of any better debugging strategy at the moment.

Thanx, Paul