2009-11-09 11:49:06

by Rafael J. Wysocki

Subject: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

Hi,

For the last couple of days I've been trying to debug resume problems on a
Toshiba Portege R500 that seem to be related to a leakage of preempt_count in
the "events" thread.

To start with, more-or-less since 2.6.32-rc4 I've been observing a
resume-from-hibernation issue causing the box to hang solid, apparently during
the final switch from a VT to X (it ends up with the X cursor in the black
background). Initially I thought it was related to the graphics driver (i915),
but I didn't use KMS on this box and the non-KMS suspend-resume handling
hasn't really changed in there since 2.6.31.

However, after 2.6.32-rc6 I started to observe similar symptoms during resume
from suspend to RAM. Then, I thought it was a separate issue and tried to
narrow it down. I thought it might be related to the VGA framebuffer, so I
started to boot with vga=0 and it turned out that it apparently was a NULL
pointer dereference in worker_thread() of the "events" thread (with the
graphics framebuffer the box just hung with the X cursor on the black
background; sometimes it managed to display the windows and then hung).

I'm saying "apparently", because I've never been able to see the entire oops
message. I only saw a call trace pointing to autoremove_wake_function() and
saying that RIP was somewhere in worker_thread(), so I used gdb to find the
code this address corresponded to and it turned out to be the first instruction
of __list_del(). This, in turn, appeared to mean that the list_del_init() in
run_workqueue(), which was inlined, hit the NULL pointer and that seemed quite
strange to me.

Nevertheless, I started to search for commits that might have caused this
to happen and I first found out that the problem was not reproducible without
commit 9905d1b411946fb3fb228e8c6529fd94afda8a92 (PM / yenta: Split resume into
early and late parts (rev. 4)) from myself, but this commit didn't do anything
that might result in the observed behavior. Moreover, it didn't cause any
problems to happen when applied on top of 2.6.31.5 on the very same box with
a very similar .config. So, I thought that the commit just made the issue more
likely to happen and I started to use it as a "trigger" in further debugging.

Next, I started to look for kernel versions in which I was able to reproduce
the problem and I found that it was already present in 2.6.32-rc1 (or
2.6.32-rc2), so apparently it had to be introduced during the merge window.
So, I thought I'd carry out a bisection of the merge window commits and
I verified that the problem was not reproducible with the kernel where
commit 5d1fe0c98f2aef99fb57aaf6dd25e793c186cea3 was the head
(approximately in the middle of the merge window). However, in the next
step of bisection, in which commit c720f5655df159a630fa0290a0bd67c93e92b0bf was
the head, I got the following call trace:

[ 2016.865041] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/29920
[ 2016.865344] caller is vmstat_update+0x13/0x48
[ 2016.865522] Pid: 29920, comm: events/1 Not tainted 2.6.31-tst #158
[ 2016.865700] Call Trace:
[ 2016.865877] [<ffffffff811608e8>] debug_smp_processor_id+0xc4/0xd4
[ 2016.866052] [<ffffffff810a9ae1>] vmstat_update+0x13/0x48
[ 2016.866232] [<ffffffff81051ee6>] worker_thread+0x18b/0x22a
[ 2016.866409] [<ffffffff810a9ace>] ? vmstat_update+0x0/0x48
[ 2016.866578] [<ffffffff810556a5>] ? autoremove_wake_function+0x0/0x38
[ 2016.866749] [<ffffffff81288803>] ? _spin_unlock_irqrestore+0x35/0x37
[ 2016.866935] [<ffffffff81051d5b>] ? worker_thread+0x0/0x22a
[ 2016.867113] [<ffffffff8105547d>] kthread+0x69/0x71
[ 2016.867278] [<ffffffff8100c1aa>] child_rip+0xa/0x20
[ 2016.867450] [<ffffffff81055414>] ? kthread+0x0/0x71
[ 2016.867618] [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
[ 2016.867781] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/29920
[ 2016.868077] caller is vmstat_update+0x2e/0x48
[ 2016.868239] Pid: 29920, comm: events/1 Not tainted 2.6.31-tst #158
[ 2016.868406] Call Trace:
[ 2016.868576] [<ffffffff811608e8>] debug_smp_processor_id+0xc4/0xd4
[ 2016.868756] [<ffffffff810a9afc>] vmstat_update+0x2e/0x48
[ 2016.868938] [<ffffffff81051ee6>] worker_thread+0x18b/0x22a
[ 2016.869121] [<ffffffff810a9ace>] ? vmstat_update+0x0/0x48
[ 2016.869292] [<ffffffff810556a5>] ? autoremove_wake_function+0x0/0x38
[ 2016.869466] [<ffffffff81288803>] ? _spin_unlock_irqrestore+0x35/0x37
[ 2016.869635] [<ffffffff81051d5b>] ? worker_thread+0x0/0x22a
[ 2016.869808] [<ffffffff8105547d>] kthread+0x69/0x71
[ 2016.869979] [<ffffffff8100c1aa>] child_rip+0xa/0x20
[ 2016.870149] [<ffffffff81055414>] ? kthread+0x0/0x71
[ 2016.870317] [<ffffffff8100c1a0>] ? child_rip+0x0/0x20

where, according to gdb, worker_thread+0x18b corresponded to the

if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {

condition in run_workqueue(). That didn't kill the box, but I wonder if the
leakage of preempt_count may lead to the solid hang observed before.

I also am not sure if I should mark this commit as "bad" or "good" in the
bisection process.

Please advise.

Rafael


2009-11-09 12:02:40

by Ingo Molnar

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd


* Rafael J. Wysocki <[email protected]> wrote:

> [ 2016.865041] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/29920
> [ 2016.865344] caller is vmstat_update+0x13/0x48
> [ 2016.865522] Pid: 29920, comm: events/1 Not tainted 2.6.31-tst #158
> [ 2016.865700] Call Trace:
> [ 2016.865877] [<ffffffff811608e8>] debug_smp_processor_id+0xc4/0xd4
> [ 2016.866052] [<ffffffff810a9ae1>] vmstat_update+0x13/0x48
> [ 2016.866232] [<ffffffff81051ee6>] worker_thread+0x18b/0x22a
> [ 2016.866409] [<ffffffff810a9ace>] ? vmstat_update+0x0/0x48
> [ 2016.866578] [<ffffffff810556a5>] ? autoremove_wake_function+0x0/0x38
> [ 2016.866749] [<ffffffff81288803>] ? _spin_unlock_irqrestore+0x35/0x37
> [ 2016.866935] [<ffffffff81051d5b>] ? worker_thread+0x0/0x22a
> [ 2016.867113] [<ffffffff8105547d>] kthread+0x69/0x71
> [ 2016.867278] [<ffffffff8100c1aa>] child_rip+0xa/0x20
> [ 2016.867450] [<ffffffff81055414>] ? kthread+0x0/0x71
> [ 2016.867618] [<ffffffff8100c1a0>] ? child_rip+0x0/0x20

a bug producing similar looking messages was fixed by:

fd21073: sched: Fix affinity logic in select_task_rq_fair()

but that bug was introduced by:

a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()

Which is for v2.6.33, not v2.6.32.

Ingo

2009-11-09 12:23:08

by Rafael J. Wysocki

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Monday 09 November 2009, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > [ 2016.865041] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/29920
> > [ 2016.865344] caller is vmstat_update+0x13/0x48
> > [ 2016.865522] Pid: 29920, comm: events/1 Not tainted 2.6.31-tst #158
> > [ 2016.865700] Call Trace:
> > [ 2016.865877] [<ffffffff811608e8>] debug_smp_processor_id+0xc4/0xd4
> > [ 2016.866052] [<ffffffff810a9ae1>] vmstat_update+0x13/0x48
> > [ 2016.866232] [<ffffffff81051ee6>] worker_thread+0x18b/0x22a
> > [ 2016.866409] [<ffffffff810a9ace>] ? vmstat_update+0x0/0x48
> > [ 2016.866578] [<ffffffff810556a5>] ? autoremove_wake_function+0x0/0x38
> > [ 2016.866749] [<ffffffff81288803>] ? _spin_unlock_irqrestore+0x35/0x37
> > [ 2016.866935] [<ffffffff81051d5b>] ? worker_thread+0x0/0x22a
> > [ 2016.867113] [<ffffffff8105547d>] kthread+0x69/0x71
> > [ 2016.867278] [<ffffffff8100c1aa>] child_rip+0xa/0x20
> > [ 2016.867450] [<ffffffff81055414>] ? kthread+0x0/0x71
> > [ 2016.867618] [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
>
> a bug producing similar looking messages was fixed by:
>
> fd21073: sched: Fix affinity logic in select_task_rq_fair()
>
> but that bug was introduced by:
>
> a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()

I guess these are tip commits?

> Which is for v2.6.33, not v2.6.32.

The one I saw was in the Linus' tree, quite obviously.

Also, I should have noted that the problem is not really easy to trigger.
Usually, it involves 20-25 suspend-resume cycles to reproduce it. For this
reason the kernels that I mark as "good" may still be affected, although in
such a way that the problem is much harder to reproduce with them.

Thanks,
Rafael

2009-11-09 12:49:55

by Ingo Molnar

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd


* Rafael J. Wysocki <[email protected]> wrote:

> On Monday 09 November 2009, Ingo Molnar wrote:
> >
> > * Rafael J. Wysocki <[email protected]> wrote:
> >
> > > [ 2016.865041] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/29920
> > > [ 2016.865344] caller is vmstat_update+0x13/0x48
> > > [ 2016.865522] Pid: 29920, comm: events/1 Not tainted 2.6.31-tst #158
> > > [ 2016.865700] Call Trace:
> > > [ 2016.865877] [<ffffffff811608e8>] debug_smp_processor_id+0xc4/0xd4
> > > [ 2016.866052] [<ffffffff810a9ae1>] vmstat_update+0x13/0x48
> > > [ 2016.866232] [<ffffffff81051ee6>] worker_thread+0x18b/0x22a
> > > [ 2016.866409] [<ffffffff810a9ace>] ? vmstat_update+0x0/0x48
> > > [ 2016.866578] [<ffffffff810556a5>] ? autoremove_wake_function+0x0/0x38
> > > [ 2016.866749] [<ffffffff81288803>] ? _spin_unlock_irqrestore+0x35/0x37
> > > [ 2016.866935] [<ffffffff81051d5b>] ? worker_thread+0x0/0x22a
> > > [ 2016.867113] [<ffffffff8105547d>] kthread+0x69/0x71
> > > [ 2016.867278] [<ffffffff8100c1aa>] child_rip+0xa/0x20
> > > [ 2016.867450] [<ffffffff81055414>] ? kthread+0x0/0x71
> > > [ 2016.867618] [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
> >
> > a bug producing similar looking messages was fixed by:
> >
> > fd21073: sched: Fix affinity logic in select_task_rq_fair()
> >
> > but that bug was introduced by:
> >
> > a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()
>
> I guess these are tip commits?

yep, tip:sched/core ones.

> > Which is for v2.6.33, not v2.6.32.
>
> The one I saw was in the Linus' tree, quite obviously.

ok, then my observation should not apply.

Ingo

2009-11-09 14:04:09

by Thomas Gleixner

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Mon, 9 Nov 2009, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > On Monday 09 November 2009, Ingo Molnar wrote:
> > >
> > > * Rafael J. Wysocki <[email protected]> wrote:
> > >
> > > > [ 2016.865041] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/29920
> > > > [ 2016.865344] caller is vmstat_update+0x13/0x48
> > > > [ 2016.865522] Pid: 29920, comm: events/1 Not tainted 2.6.31-tst #158
> > > > [ 2016.865700] Call Trace:
> > > > [ 2016.865877] [<ffffffff811608e8>] debug_smp_processor_id+0xc4/0xd4
> > > > [ 2016.866052] [<ffffffff810a9ae1>] vmstat_update+0x13/0x48
> > > > [ 2016.866232] [<ffffffff81051ee6>] worker_thread+0x18b/0x22a
> > > > [ 2016.866409] [<ffffffff810a9ace>] ? vmstat_update+0x0/0x48
> > > > [ 2016.866578] [<ffffffff810556a5>] ? autoremove_wake_function+0x0/0x38
> > > > [ 2016.866749] [<ffffffff81288803>] ? _spin_unlock_irqrestore+0x35/0x37
> > > > [ 2016.866935] [<ffffffff81051d5b>] ? worker_thread+0x0/0x22a
> > > > [ 2016.867113] [<ffffffff8105547d>] kthread+0x69/0x71
> > > > [ 2016.867278] [<ffffffff8100c1aa>] child_rip+0xa/0x20
> > > > [ 2016.867450] [<ffffffff81055414>] ? kthread+0x0/0x71
> > > > [ 2016.867618] [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
> > >
> > > a bug producing similar looking messages was fixed by:
> > >
> > > fd21073: sched: Fix affinity logic in select_task_rq_fair()
> > >
> > > but that bug was introduced by:
> > >
> > > a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()
> >
> > I guess these are tip commits?
>
> yep, tip:sched/core ones.
>
> > > Which is for v2.6.33, not v2.6.32.
> >
> > The one I saw was in the Linus' tree, quite obviously.
>
> ok, then my observation should not apply.

I think it _IS_ related because the worker_thread is CPU affine and
the debug_smp_processor_id() check does:

if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))

which suppresses warnings about the usage of smp_processor_id() in
ksoftirqd and keventd in preempt-enabled regions.

We saw exactly the same back trace with fd21073 (sched: Fix affinity
logic in select_task_rq_fair()).

Rafael, can you please add a printk to debug_smp_processor_id() so we
can see on which CPU we are running ? I suspect we are on the wrong
one.

Thanks,

tglx

2009-11-09 14:16:16

by Mike Galbraith

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Mon, 2009-11-09 at 15:02 +0100, Thomas Gleixner wrote:
> On Mon, 9 Nov 2009, Ingo Molnar wrote:
> >

> > ok, then my observation should not apply.
>
> I think it _IS_ related because the worker_thread is CPU affine and
> the debug_smp_processor_id() check does:
>
> if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
>
> which suppresses warnings about the usage of smp_processor_id() in
> ksoftirqd and keventd in preempt-enabled regions.
>
> We saw exactly the same back trace with fd21073 (sched: Fix affinity
> logic in select_task_rq_fair()).
>
> Rafael, can you please add a printk to debug_smp_processor_id() so we
> can see on which CPU we are running ? I suspect we are on the wrong
> one.

I wonder if that's not intimately related to the problem I had, namely
newidle balancing offline CPUs as they're coming up, making a mess of
cpu enumeration.

-Mike

2009-11-09 14:24:46

by Rafael J. Wysocki

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Monday 09 November 2009, Thomas Gleixner wrote:
> On Mon, 9 Nov 2009, Ingo Molnar wrote:
> >
> > * Rafael J. Wysocki <[email protected]> wrote:
> >
> > > On Monday 09 November 2009, Ingo Molnar wrote:
> > > >
> > > > * Rafael J. Wysocki <[email protected]> wrote:
> > > >
> > > > > [ 2016.865041] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/29920
> > > > > [ 2016.865344] caller is vmstat_update+0x13/0x48
> > > > > [ 2016.865522] Pid: 29920, comm: events/1 Not tainted 2.6.31-tst #158
> > > > > [ 2016.865700] Call Trace:
> > > > > [ 2016.865877] [<ffffffff811608e8>] debug_smp_processor_id+0xc4/0xd4
> > > > > [ 2016.866052] [<ffffffff810a9ae1>] vmstat_update+0x13/0x48
> > > > > [ 2016.866232] [<ffffffff81051ee6>] worker_thread+0x18b/0x22a
> > > > > [ 2016.866409] [<ffffffff810a9ace>] ? vmstat_update+0x0/0x48
> > > > > [ 2016.866578] [<ffffffff810556a5>] ? autoremove_wake_function+0x0/0x38
> > > > > [ 2016.866749] [<ffffffff81288803>] ? _spin_unlock_irqrestore+0x35/0x37
> > > > > [ 2016.866935] [<ffffffff81051d5b>] ? worker_thread+0x0/0x22a
> > > > > [ 2016.867113] [<ffffffff8105547d>] kthread+0x69/0x71
> > > > > [ 2016.867278] [<ffffffff8100c1aa>] child_rip+0xa/0x20
> > > > > [ 2016.867450] [<ffffffff81055414>] ? kthread+0x0/0x71
> > > > > [ 2016.867618] [<ffffffff8100c1a0>] ? child_rip+0x0/0x20
> > > >
> > > > a bug producing similar looking messages was fixed by:
> > > >
> > > > fd21073: sched: Fix affinity logic in select_task_rq_fair()
> > > >
> > > > but that bug was introduced by:
> > > >
> > > > a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()
> > >
> > > I guess these are tip commits?
> >
> > yep, tip:sched/core ones.
> >
> > > > Which is for v2.6.33, not v2.6.32.
> > >
> > > The one I saw was in the Linus' tree, quite obviously.
> >
> > ok, then my observation should not apply.
>
> I think it _IS_ related because the worker_thread is CPU affine and
> the debug_smp_processor_id() check does:
>
> if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
>
> which suppresses warnings about the usage of smp_processor_id() in
> ksoftirqd and keventd in preempt-enabled regions.
>
> We saw exactly the same back trace with fd21073 (sched: Fix affinity
> logic in select_task_rq_fair()).
>
> Rafael, can you please add a printk to debug_smp_processor_id() so we
> can see on which CPU we are running ? I suspect we are on the wrong
> one.

Well, I can add the printk(), but I can't guarantee that I will get the call
trace once again. So far I've seen it only once after 20-25 consecutive
suspend-resume cycles, so ... you get the idea.

However, running on a wrong CPU would very nicely explain all of the observed
symptoms, so I guess we can try a House M.D.-like approach and assume that the
answer is "yes, we're running on the wrong CPU". What would we do next if that
was the case?

Rafael

2009-11-09 14:25:30

by Rafael J. Wysocki

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Monday 09 November 2009, Mike Galbraith wrote:
> On Mon, 2009-11-09 at 15:02 +0100, Thomas Gleixner wrote:
> > On Mon, 9 Nov 2009, Ingo Molnar wrote:
> > >
>
> > > ok, then my observation should not apply.
> >
> > I think it _IS_ related because the worker_thread is CPU affine and
> > the debug_smp_processor_id() check does:
> >
> > if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
> >
> > which suppresses warnings about the usage of smp_processor_id() in
> > ksoftirqd and keventd in preempt-enabled regions.
> >
> > We saw exactly the same back trace with fd21073 (sched: Fix affinity
> > logic in select_task_rq_fair()).
> >
> > Rafael, can you please add a printk to debug_smp_processor_id() so we
> > can see on which CPU we are running ? I suspect we are on the wrong
> > one.
>
> I wonder if that's not intimately related to the problem I had, namely
> newidle balancing offline CPUs as they're coming up, making a mess of
> cpu enumeration.

Very likely. What did you do to fix it?

Rafael

2009-11-09 14:30:42

by Mike Galbraith

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Mon, 2009-11-09 at 15:27 +0100, Rafael J. Wysocki wrote:
> On Monday 09 November 2009, Mike Galbraith wrote:
> > On Mon, 2009-11-09 at 15:02 +0100, Thomas Gleixner wrote:
> > > On Mon, 9 Nov 2009, Ingo Molnar wrote:
> > > >
> >
> > > > ok, then my observation should not apply.
> > >
> > > I think it _IS_ related because the worker_thread is CPU affine and
> > > the debug_smp_processor_id() check does:
> > >
> > > if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
> > >
> > > which suppresses warnings about the usage of smp_processor_id() in
> > > ksoftirqd and keventd in preempt-enabled regions.
> > >
> > > We saw exactly the same back trace with fd21073 (sched: Fix affinity
> > > logic in select_task_rq_fair()).
> > >
> > > Rafael, can you please add a printk to debug_smp_processor_id() so we
> > > can see on which CPU we are running ? I suspect we are on the wrong
> > > one.
> >
> > I wonder if that's not intimately related to the problem I had, namely
> > newidle balancing offline CPUs as they're coming up, making a mess of
> > cpu enumeration.
>
> Very likely. What did you do to fix it?

You don't really wanna know. In 31 with newidle enabled, the below
fixed it. It won't fix 32, though it might cure the resume problem.

diff --git a/kernel/sched.c b/kernel/sched.c
index 1b59e26..6e71932 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4032,7 +4049,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);

- cpumask_setall(cpus);
+ cpumask_copy(cpus, cpu_online_mask);

/*
* When power savings policy is enabled for the parent domain, idle
@@ -4195,7 +4212,7 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd)
int all_pinned = 0;
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);

- cpumask_setall(cpus);
+ cpumask_copy(cpus, cpu_online_mask);

/*
* When power savings policy is enabled for the parent domain, idle

2009-11-09 14:44:56

by Mike Galbraith

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Mon, 2009-11-09 at 15:26 +0100, Rafael J. Wysocki wrote:

> Well, I can add the printk(), but I can't guarantee that I will get the call
> trace once again. So far I've seen it only once after 20-25 consecutive
> suspend-resume cycles, so ... you get the idea.
>
> However, running on a wrong CPU would very nicely explain all of the observed
> symptoms, so I guess we can try a House M.D.-like approach and assume that the
> answer is "yes, we're running on the wrong CPU". What would we do next if that
> was the case?

If you're interested in trying a patch, I have a couple that fix things
up here. My "get a bigger hammer" patch would be best if it's what I
think it is. If it only fails once in a blue moon though, maybe not
worth even trying. If I could suspend/resume, I'd try it myself, but I
have a (steenking) nvidia card. No eyecandy or resume for me, though I
can suspend... once.

-Mike

2009-11-09 15:45:51

by Rafael J. Wysocki

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Monday 09 November 2009, Mike Galbraith wrote:
> On Mon, 2009-11-09 at 15:26 +0100, Rafael J. Wysocki wrote:
>
> > Well, I can add the printk(), but I can't guarantee that I will get the call
> > trace once again. So far I've seen it only once after 20-25 consecutive
> > suspend-resume cycles, so ... you get the idea.
> >
> > However, running on a wrong CPU would very nicely explain all of the observed
> > symptoms, so I guess we can try a House M.D.-like approach and assume that the
> > answer is "yes, we're running on the wrong CPU". What would we do next if that
> > was the case?
>
> If you're interested in trying a patch, I have a couple that fix things
> up here. My "get a bigger hammer" patch would be best if it's what I
> think it is. If it only fails once in a blue moon though, maybe not
> worth even trying. If I could suspend/resume, I'd try it myself, but I
> have a (steenking) nvidia card. No eyecandy or resume for me, though I
> can suspend... once.

OK, please send me (a) pointer(s) to the patches.

I'll first try the patch you've just sent and report back.

Thanks,
Rafael

2009-11-09 15:46:16

by Rafael J. Wysocki

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Monday 09 November 2009, Mike Galbraith wrote:
> On Mon, 2009-11-09 at 15:27 +0100, Rafael J. Wysocki wrote:
> > On Monday 09 November 2009, Mike Galbraith wrote:
> > > On Mon, 2009-11-09 at 15:02 +0100, Thomas Gleixner wrote:
> > > > On Mon, 9 Nov 2009, Ingo Molnar wrote:
> > > > >
> > >
> > > > > ok, then my observation should not apply.
> > > >
> > > > I think it _IS_ related because the worker_thread is CPU affine and
> > > > the debug_smp_processor_id() check does:
> > > >
> > > > if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
> > > >
> > > > which suppresses warnings about the usage of smp_processor_id() in
> > > > ksoftirqd and keventd in preempt-enabled regions.
> > > >
> > > > We saw exactly the same back trace with fd21073 (sched: Fix affinity
> > > > logic in select_task_rq_fair()).
> > > >
> > > > Rafael, can you please add a printk to debug_smp_processor_id() so we
> > > > can see on which CPU we are running ? I suspect we are on the wrong
> > > > one.
> > >
> > > I wonder if that's not intimately related to the problem I had, namely
> > > newidle balancing offline CPUs as they're coming up, making a mess of
> > > cpu enumeration.
> >
> > Very likely. What did you do to fix it?
>
> You don't really wanna know. In 31 with newidle enabled, the below
> fixed it. It won't fix 32, though it might cure the resume problem.

OK, I'll give it a try.

> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1b59e26..6e71932 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4032,7 +4049,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> unsigned long flags;
> struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
>
> - cpumask_setall(cpus);
> + cpumask_copy(cpus, cpu_online_mask);
>
> /*
> * When power savings policy is enabled for the parent domain, idle
> @@ -4195,7 +4212,7 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd)
> int all_pinned = 0;
> struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
>
> - cpumask_setall(cpus);
> + cpumask_copy(cpus, cpu_online_mask);
>
> /*
> * When power savings policy is enabled for the parent domain, idle

Thanks,
Rafael

2009-11-09 15:58:26

by Linus Torvalds

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd



On Mon, 9 Nov 2009, Thomas Gleixner wrote:
>
> I think it _IS_ releated because the worker_thread is CPU affine and
> the debug_smp_processor_id() check does:

Hmm. We do know CPU affinity is destroyed by CPU hotplug. People have
complained about that before (for user-space processes that get moved
around due to hot-unplug/plug).

And the suspend/resume process does CPU hotplug to take down all but one
CPU. The workqueues should act on those events already, but maybe there's
a bug somewhere. None of that is new to 32-rc, though - is it?

And workqueue_cpu_callback() seems buggy. It loops over the 'workqueues'
list with no protection. Yes, we do 'stop_machine' for CPU hotplug events,
but only for the very internal one (CPU_DYING) will the CPU notifiers be
called with the machine stopped.

Hmm. I don't see any changes in kernel/cpu.c or kernel/workqueue.c that
look at all relevant. But scheduler changes could certainly matter.

Linus

2009-11-09 16:19:13

by Mike Galbraith

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Mon, 2009-11-09 at 16:47 +0100, Rafael J. Wysocki wrote:
> On Monday 09 November 2009, Mike Galbraith wrote:

> > > Very likely. What did you do to fix it?
> >
> > You don't really wanna know. In 31 with newidle enabled, the below
> > fixed it. It won't fix 32, though it might cure the resume problem.
>
> OK, I'll give it a try.

I just tried to trigger badness via high speed online/offline combined
with taskset with CONFIG_PREEMPT enabled, and couldn't make it explode.

(damn, wish i could s2ram this box)

-Mike

2009-11-09 17:35:08

by Rafael J. Wysocki

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Monday 09 November 2009, Mike Galbraith wrote:
> On Mon, 2009-11-09 at 16:47 +0100, Rafael J. Wysocki wrote:
> > On Monday 09 November 2009, Mike Galbraith wrote:
>
> > > > Very likely. What did you do to fix it?
> > >
> > > You don't really wanna know. In 31 with newidle enabled, the below
> > > fixed it. It won't fix 32, though it might cure the resume problem.
> >
> > OK, I'll give it a try.

It doesn't help.

Also, I can reproduce the issue with current -git and kernel preemption
disabled.

> I just tried to trigger badness via high speed online/offline combined
> with taskset with CONFIG_PREEMPT enabled, and couldn't make it explode.

I'm not able to do it this way either, so resume seems to be necessary to trigger
it. I'm going to try with the suspend debug in the "core" mode.

> (damn, wish i could s2ram this box)

That need not suffice. I have two other boxes that suspend and resume
correctly with 2.6.32-rc, AFAICS.

However, there seems to be a systematic error somewhere, since the failure
always happens at the same place, ie. list_del_init(cwq->worklist.next); in
run_workqueue(), in preemptible as well as in non-preemptible kernels.

Which is kind of strange, given the !list_empty(&cwq->worklist) test right
before it.

Thanks,
Rafael

2009-11-09 18:50:37

by Thomas Gleixner

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:

> On Monday 09 November 2009, Mike Galbraith wrote:
> > On Mon, 2009-11-09 at 16:47 +0100, Rafael J. Wysocki wrote:
> > > On Monday 09 November 2009, Mike Galbraith wrote:
> >
> > > > > Very likely. What did you do to fix it?
> > > >
> > > > You don't really wanna know. In 31 with newidle enabled, the below
> > > > fixed it. It won't fix 32, though it might cure the resume problem.
> > >
> > > OK, I'll give it a try.
>
> It doesn't help.
>
> Also, I can reproduce the issue with current -git and kernel preemption
> disabled.
>
> > I just tried to trigger badness via high speed online/offline combined
> > with taskset with CONFIG_PREEMPT enabled, and couldn't make it explode.
>
> I'm not able to do it this way either, so resume seems to be necessary to trigger
> it. I'm going to try with the suspend debug in the "core" mode.
>
> > (damn, wish i could s2ram this box)
>
> That need not suffice. I have two other boxes that suspend and resume
> correctly with 2.6.32-rc, AFAICS.
>
> However, there seems to be a systematic error somewhere, since the failure
> always happens at the same place, ie. list_del_init(cwq->worklist.next); in
> run_workqueue(), in preemptible as well as in non-preemptible kernels.
>
> Which is kind of strange, given the !list_empty(&cwq->worklist) test right
> before it.

Does that happen before or after the secondary CPU has been brought up ?

Thanks,

tglx

2009-11-09 19:15:07

by Thomas Gleixner

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:
> On Monday 09 November 2009, Mike Galbraith wrote:
> > On Mon, 2009-11-09 at 16:47 +0100, Rafael J. Wysocki wrote:
> > > On Monday 09 November 2009, Mike Galbraith wrote:
> >
> > > > > Very likely. What did you do to fix it?
> > > >
> > > > You don't really wanna know. In 31 with newidle enabled, the below
> > > > fixed it. It won't fix 32, though it might cure the resume problem.
> > >
> > > OK, I'll give it a try.
>
> It doesn't help.
>
> Also, I can reproduce the issue with current -git and kernel preemption
> disabled.
>
> > I just tried to trigger badness via high speed online/offline combined
> > with taskset with CONFIG_PREEMPT enabled, and couldn't make it explode.
>
> I'm not able to do it this way either, so resume seems to be necessary to trigger
> it. I'm going to try with the suspend debug in the "core" mode.
>
> > (damn, wish i could s2ram this box)
>
> That need not suffice. I have two other boxes that suspend and resume
> correctly with 2.6.32-rc, AFAICS.
>
> However, there seems to be a systematic error somewhere, since the failure
> always happens at the same place, ie. list_del_init(cwq->worklist.next); in
> run_workqueue(), in preemptible as well as in non-preemptible kernels.
>
> Which is kind of strange, given the !list_empty(&cwq->worklist) test right
> before it.

Hmm, no. list_empty() just tells you whether the list is empty or
not. It does not care whether the list is corrupted or not. Have you
tried to run with list debugging enabled ?

Thanks,

tglx

2009-11-09 19:59:44

by Rafael J. Wysocki

Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Monday 09 November 2009, Thomas Gleixner wrote:
> On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:
>
> > On Monday 09 November 2009, Mike Galbraith wrote:
> > > On Mon, 2009-11-09 at 16:47 +0100, Rafael J. Wysocki wrote:
> > > > On Monday 09 November 2009, Mike Galbraith wrote:
> > >
> > > > > > Very likely. What did you do to fix it?
> > > > >
> > > > > You don't really wanna know. In 31 with newidle enabled, the below
> > > > > fixed it. It won't fix 32, though it might cure the resume problem.
> > > >
> > > > OK, I'll give it a try.
> >
> > It doesn't help.
> >
> > Also, I can reproduce the issue with current -git and kernel preemption
> > disabled.
> >
> > > I just tried to trigger badness via high speed online/offline combined
> > > with taskset with CONFIG_PREEMPT enabled, and couldn't make it explode.
> >
> > I'm not able to do it this way either, so resume seems to be necessary to trigger
> > it. I'm going to try with the suspend debug in the "core" mode.
> >
> > > (damn, wish i could s2ram this box)
> >
> > That need not suffice. I have two other boxes that suspend and resume
> > correctly with 2.6.32-rc, AFAICS.
> >
> > However, there seems to be a systematic error somewhere, since the failure
> > always happens at the same place, ie. list_del_init(cwq->worklist.next); in
> > run_workqueue(), in preemptible as well as in non-preemptible kernels.
> >
> > Which is kind of strange, given the !list_empty(&cwq->worklist) test right
> > before it.
>
> Does that happen before or after the secondary CPU has been brought up ?

Way after. It seems to happen more-or-less during or right after the thawing
of tasks.

Moreover, the call trace I get is (manual transcription):

? autoremove_wake_function+0x0
? worker_thread+0x0
kthread+0x69
child_rip+0xa

where kthread+0x69 is the do_exit(ret); in kthread(). Afterwards it says
that "events/0" exited with preempt_count = 1 (it sometimes is "events/1"
IIRC).

Still, RIP always points to list_del_init(cwq->worklist.next); in
run_workqueue().

Thanks,
Rafael

2009-11-09 20:02:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Monday 09 November 2009, Thomas Gleixner wrote:
> On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:
> > On Monday 09 November 2009, Mike Galbraith wrote:
> > > On Mon, 2009-11-09 at 16:47 +0100, Rafael J. Wysocki wrote:
> > > > On Monday 09 November 2009, Mike Galbraith wrote:
> > >
> > > > > > Very likely. What did you do to fix it?
> > > > >
> > > > > You don't really wanna know. In 31 with newidle enabled, the below
> > > > > fixed it. It won't fix 32, though it might cure the resume problem.
> > > >
> > > > OK, I'll give it a try.
> >
> > It doesn't help.
> >
> > Also, I can reproduce the issue with current -git and kernel preemption
> > disabled.
> >
> > > I just tried to trigger badness via high speed online/offline combined
> > > with taskset with CONFIG_PREEMPT enabled, and couldn't make it explode.
> >
> > I'm not able to do it this way either, so resume seems to be necessary to trigger
> > it. I'm going to try with the suspend debug in the "core" mode.
> >
> > > (damn, wish i could s2ram this box)
> >
> > That need not suffice. I have two other boxes that suspend and resume
> > correctly with 2.6.32-rc, AFAICS.
> >
> > However, there seems to be a systematic error somewhere, since the failure
> > always happens at the same place, ie. list_del_init(cwq->worklist.next); in
> > run_workqueue(), in preemptible as well as in non-preemptible kernels.
> >
> > Which is kind of strange, given the !list_empty(&cwq->worklist) test right
> > before it.
>
> Hmm, no. list_empty() just tells you whether the list is empty or
> not. It does not care whether the list is corrupted or not. Have you
> tried to run with list debugging enabled?

I did, but the result was exactly the same as without it.

Also, I don't seem to be able to reproduce the issue without actually
suspending the box. I tried with offlining/onlining CPU1 and with
the /sys/power/pm_test "core" test and it didn't show up.

Thanks,
Rafael

2009-11-09 20:31:44

by Alan Stern

[permalink] [raw]
Subject: Re: [linux-pm] Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:

> Still, RIP always points to list_del_init(cwq->worklist.next); in
> run_workqueue().

Use a big hammer: Create a new global variable, set it to 1 while
resuming and back to 0 after the tasks have been thawed. While the
variable is nonzero, print in the log the list pointers in
cwq->worklist just before executing the list_del_init(). Maybe also
print some other interesting information about cwq.

Alan Stern

2009-11-09 20:44:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Monday 09 November 2009, Rafael J. Wysocki wrote:
> On Monday 09 November 2009, Thomas Gleixner wrote:
> > On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:
> >
> > > On Monday 09 November 2009, Mike Galbraith wrote:
> > > > On Mon, 2009-11-09 at 16:47 +0100, Rafael J. Wysocki wrote:
> > > > > On Monday 09 November 2009, Mike Galbraith wrote:
> > > >
> > > > > > > Very likely. What did you do to fix it?
> > > > > >
> > > > > > You don't really wanna know. In 31 with newidle enabled, the below
> > > > > > fixed it. It won't fix 32, though it might cure the resume problem.
> > > > >
> > > > > OK, I'll give it a try.
> > >
> > > It doesn't help.
> > >
> > > Also, I can reproduce the issue with current -git and kernel preemption
> > > disabled.
> > >
> > > > I just tried to trigger badness via high speed online/offline combined
> > > > with taskset with CONFIG_PREEMPT enabled, and couldn't make it explode.
> > >
> > > I'm not able to do it this way either, so resume seems to be necessary to trigger
> > > it. I'm going to try with the suspend debug in the "core" mode.
> > >
> > > > (damn, wish i could s2ram this box)
> > >
> > > That need not suffice. I have two other boxes that suspend and resume
> > > correctly with 2.6.32-rc, AFAICS.
> > >
> > > However, there seems to be a systematic error somewhere, since the failure
> > > always happens at the same place, ie. list_del_init(cwq->worklist.next); in
> > > run_workqueue(), in preemptible as well as in non-preemptible kernels.
> > >
> > > Which is kind of strange, given the !list_empty(&cwq->worklist) test right
> > > before it.
> >
> > Does that happen before or after the secondary CPU has been brought up ?
>
> Way after. It seems to happen more-or-less during or right after the thawing
> of tasks.
>
> Moreover, the call trace I get is (manual transcription):

OK, below is the full call trace I found in the kernel log.

[ 51.520183] PM: Finishing wakeup.
[ 51.520186] Restarting tasks ...
[ 51.520387] usb 5-2: USB disconnect, address 2
[ 51.544197] done.
[ 52.013018] general protection fault: 0000 [#1] PREEMPT SMP
[ 52.013431] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.0/usb1/1-2/1-2:1.3/ttyUSB3/port_number
[ 52.013700] CPU 0
[ 52.013900] Modules linked in: ip6t_LOG af_packet xt_tcpudp xt_pkttype ipt_LOG xt_limit bnep sco rfcomm l2cap crc16 snd_pcm_oss snd_mixer_oss snd_seq binfmt_misc snd_seq_device ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 cpufreq_conservative nf_conntrack nf_defrag_ipv4 cpufreq_ondemand ip_tables cpufreq_userspace cpufreq_powersave acpi_cpufreq ip6table_filter ip6_tables x_tables freq_table ipv6 microcode fuse loop sr_mod cdrom dm_mod arc4 ecb btusb snd_hda_codec_realtek bluetooth iwlagn snd_hda_intel snd_hda_codec iwlcore pcmcia snd_hwdep snd_pcm sdhci_pci mac80211 snd_timer joydev sdhci toshiba_acpi yenta_socket usbhid cfg80211 snd option rtc_cmos mmc_core firewire_ohci video rsrc_nonstatic psmouse firewire_core backlight soundcore iTCO_wdt rtc_core hid battery ac intel_agp button usb_storage snd_page_alloc usbserial rfkill pcmcia_core iTCO_vendor_support e1000e rtc_lib led_class serio_raw crc_itu_t output uinput sg ehci_hcd uhci_hcd sd_mod crc_t10dif usbcore ext3 jbd fan ahci libata thermal processor
[ 52.016961] Pid: 9, comm: events/0 Not tainted 2.6.32-rc6-tst #160 PORTEGE R500
[ 52.016961] RIP: 0010:[<ffffffff81054bff>] [<ffffffff81054bff>] worker_thread+0x15b/0x22a
[ 52.016961] RSP: 0018:ffff88007f0d9e40 EFLAGS: 00010046
[ 52.016961] RAX: ffff88007e056b68 RBX: ffff88007f09bd48 RCX: 6b6b6b6b6b6b6b6b
[ 52.016961] RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000000 RDI: ffff880001613d00
[ 52.016961] RBP: ffff88007f0d9ee0 R08: ffff88007f0b9178 R09: ffff88007f0d9e10
[ 52.016961] R10: ffff880001613d00 R11: 0000000000000001 R12: ffff88007e056b60
[ 52.016961] R13: ffff880001613d00 R14: ffff88007f0b9140 R15: ffff88007f0b9140
[ 52.016961] FS: 0000000000000000(0000) GS:ffff880001600000(0000) knlGS:0000000000000000
[ 52.016961] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 52.016961] CR2: 00007f786667d060 CR3: 0000000001001000 CR4: 00000000000006f0
[ 52.016961] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 52.016961] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 52.016961] Process events/0 (pid: 9, threadinfo ffff88007f0d8000, task ffff88007f0b9140)
[ 52.016961] Stack:
[ 52.016961] 000000000000c918 ffff88007f0b9578 ffff88007f0d9fd8 ffff88007f0b9140
[ 52.016961] <0> ffff880001613d08 ffff88007f0b9140 ffff880001613d18 6b6b6b6b6b6b6b6b
[ 52.016961] <0> 0000000000000000 ffff88007f0b9140 ffffffff81058281 ffff88007f0d9e98
[ 52.016961] Call Trace:
[ 52.016961] [<ffffffff81058281>] ? autoremove_wake_function+0x0/0x38
[ 52.016961] [<ffffffff81054aa4>] ? worker_thread+0x0/0x22a
[ 52.016961] [<ffffffff8105805a>] kthread+0x69/0x71
[ 52.016961] [<ffffffff8100c16a>] child_rip+0xa/0x20
[ 52.016961] [<ffffffff81057ff1>] ? kthread+0x0/0x71
[ 52.016961] [<ffffffff8100c160>] ? child_rip+0x0/0x20
[ 52.016961] Code: 74 12 4c 89 e6 4c 89 f7 ff 13 48 83 c3 08 48 83 3b 00 eb ec e8 3d ef ff ff 49 8b 45 08 4d 89 65 30 4c 89 ef 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 48 89 40 08 48 89 00 e8 f6 11 24 00 49 8b
[ 52.016961] RIP [<ffffffff81054bff>] worker_thread+0x15b/0x22a
[ 52.016961] RSP <ffff88007f0d9e40>
[ 52.016961] ---[ end trace 1d831fad17e9eb5d ]---
[ 52.016961] note: events/0[9] exited with preempt_count 1

So, this actually is a general protection fault that killed events and it
happened exactly in list_del_init(cwq->worklist.next); in run_workqueue().

Thanks,
Rafael

2009-11-09 20:46:28

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [linux-pm] Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Monday 09 November 2009, Alan Stern wrote:
> On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:
>
> > Still, RIP always points to list_del_init(cwq->worklist.next); in
> > run_workqueue().
>
> Use a big hammer: Create a new global variable, set it to 1 while
> resuming and back to 0 after the tasks have been thawed. While the
> variable is nonzero, print in the log the list pointers in
> cwq->worklist just before executing the list_del_init(). Maybe also
> print some other interesting information about cwq.

I've just sent a message containing full call trace.

It shows the problem is a general protection fault that happens _after_ we've
thawed tasks.

Thanks,
Rafael

2009-11-09 21:24:53

by Alan Stern

[permalink] [raw]
Subject: Re: [linux-pm] Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd

On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:

> On Monday 09 November 2009, Alan Stern wrote:
> > On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:
> >
> > > Still, RIP always points to list_del_init(cwq->worklist.next); in
> > > run_workqueue().
> >
> > Use a big hammer: Create a new global variable, set it to 1 while
> > resuming and back to 0 after the tasks have been thawed. While the
> > variable is nonzero, print in the log the list pointers in
> > cwq->worklist just before executing the list_del_init(). Maybe also
> > print some other interesting information about cwq.
>
> I've just sent a message containing full call trace.
>
> It shows the problem is a general protection fault that happens _after_ we've
> thawed tasks.

Okay, so set the variable to 1 when the tasks are thawed and back to 0
a second later. The point is that you _can_ get more detailed
information about the bad data values without flooding the system log.

Alan Stern

2009-11-09 21:43:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)



On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:
>
> [ 52.013018] general protection fault: 0000 [#1] PREEMPT SMP

It's a GP fault only because it's an invalid pointer - it's the same as a
NULL pointer page fault, except pointing into the invalid hole between the
low virtual memory and high virtual memory mappings.

> [ 52.013431] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.0/usb1/1-2/1-2:1.3/ttyUSB3/port_number
> [ 52.013700] CPU 0
> [ 52.013900] Modules linked in: ip6t_LOG af_packet xt_tcpudp xt_pkttype ipt_LOG xt_limit bnep sco rfcomm l2cap crc16 snd_pcm_oss snd_mixer_oss snd_seq binfmt_misc snd_seq_device ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 cpufreq_conservative nf_conntrack nf_defrag_ipv4 cpufreq_ondemand ip_tables cpufreq_userspace cpufreq_powersave acpi_cpufreq ip6table_filter ip6_tables x_tables freq_table ipv6 microcode fuse loop sr_mod cdrom dm_mod arc4 ecb btusb snd_hda_codec_realtek bluetooth iwlagn snd_hda_intel snd_hda_codec iwlcore pcmcia snd_hwdep snd_pcm sdhci_pci mac80211 snd_timer joydev sdhci toshiba_acpi yenta_socket usbhid cfg80211 snd option rtc_cmos mmc_core firewire_ohci video rsrc_nonstatic psmouse firewire_core backlight soundcore iTCO_wdt rtc_core hid battery ac intel_agp button usb_storage snd_page_alloc usbserial rfkill pcmcia_core iTCO_vendor_support e1000e rtc_lib led_class serio_raw crc_itu_t output uinput sg ehci_hcd uhci_hcd sd_mod crc_t10dif usbcore ext3 jbd fan ahci libata thermal processor
> [ 52.016961] Pid: 9, comm: events/0 Not tainted 2.6.32-rc6-tst #160 PORTEGE R500
> [ 52.016961] RIP: 0010:[<ffffffff81054bff>] [<ffffffff81054bff>] worker_thread+0x15b/0x22a
> [ 52.016961] RSP: 0018:ffff88007f0d9e40 EFLAGS: 00010046
> [ 52.016961] RAX: ffff88007e056b68 RBX: ffff88007f09bd48 RCX: 6b6b6b6b6b6b6b6b

The code is

8: ff 13 callq *(%rbx)
a: 48 83 c3 08 add $0x8,%rbx
e: 48 83 3b 00 cmpq $0x0,(%rbx)
12: eb ec jmp 0x0
14: e8 3d ef ff ff callq 0xffffffffffffef56
19: 49 8b 45 08 mov 0x8(%r13),%rax
1d: 4d 89 65 30 mov %r12,0x30(%r13)
21: 4c 89 ef mov %r13,%rdi
24: 48 8b 08 mov (%rax),%rcx
27: 48 8b 50 08 mov 0x8(%rax),%rdx
2b:* 48 89 51 08 mov %rdx,0x8(%rcx) <-- trapping instruction
2f: 48 89 0a mov %rcx,(%rdx)
32: 48 89 40 08 mov %rax,0x8(%rax)

and %rcx is 0x6b6b6b6b6b6b6b6b, which is the POISON_FREE byte sequence.

So somebody is trying to remove a list entry that was already free'd.

Linus

2009-11-10 00:18:20

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Monday 09 November 2009, Linus Torvalds wrote:
>
> On Mon, 9 Nov 2009, Rafael J. Wysocki wrote:
> >
> > [ 52.013018] general protection fault: 0000 [#1] PREEMPT SMP
>
> It's a GP fault only because it's an invalid pointer - it's the same as a
> NULL pointer page fault, except pointing into the invalid hole between the
> low virtual memory and high virtual memory mappings.
>
> > [ 52.013431] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.0/usb1/1-2/1-2:1.3/ttyUSB3/port_number
> > [ 52.013700] CPU 0
> > [ 52.013900] Modules linked in: ip6t_LOG af_packet xt_tcpudp xt_pkttype ipt_LOG xt_limit bnep sco rfcomm l2cap crc16 snd_pcm_oss snd_mixer_oss snd_seq binfmt_misc snd_seq_device ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 cpufreq_conservative nf_conntrack nf_defrag_ipv4 cpufreq_ondemand ip_tables cpufreq_userspace cpufreq_powersave acpi_cpufreq ip6table_filter ip6_tables x_tables freq_table ipv6 microcode fuse loop sr_mod cdrom dm_mod arc4 ecb btusb snd_hda_codec_realtek bluetooth iwlagn snd_hda_intel snd_hda_codec iwlcore pcmcia snd_hwdep snd_pcm sdhci_pci mac80211 snd_timer joydev sdhci toshiba_acpi yenta_socket usbhid cfg80211 snd option rtc_cmos mmc_core firewire_ohci video rsrc_nonstatic psmouse firewire_core backlight soundcore iTCO_wdt rtc_core hid battery ac intel_agp button usb_storage snd_page_alloc usbserial rfkill pcmcia_core iTCO_vendor_support e1000e rtc_lib led_class serio_raw crc_itu_t output uinput sg ehci_hcd uhci_hcd sd_mod crc_t10dif usbcore ext3 jbd fan ahci libata thermal processor
> > [ 52.016961] Pid: 9, comm: events/0 Not tainted 2.6.32-rc6-tst #160 PORTEGE R500
> > [ 52.016961] RIP: 0010:[<ffffffff81054bff>] [<ffffffff81054bff>] worker_thread+0x15b/0x22a
> > [ 52.016961] RSP: 0018:ffff88007f0d9e40 EFLAGS: 00010046
> > [ 52.016961] RAX: ffff88007e056b68 RBX: ffff88007f09bd48 RCX: 6b6b6b6b6b6b6b6b
>
> The code is
>
> 8: ff 13 callq *(%rbx)
> a: 48 83 c3 08 add $0x8,%rbx
> e: 48 83 3b 00 cmpq $0x0,(%rbx)
> 12: eb ec jmp 0x0
> 14: e8 3d ef ff ff callq 0xffffffffffffef56
> 19: 49 8b 45 08 mov 0x8(%r13),%rax
> 1d: 4d 89 65 30 mov %r12,0x30(%r13)
> 21: 4c 89 ef mov %r13,%rdi
> 24: 48 8b 08 mov (%rax),%rcx
> 27: 48 8b 50 08 mov 0x8(%rax),%rdx
> 2b:* 48 89 51 08 mov %rdx,0x8(%rcx) <-- trapping instruction
> 2f: 48 89 0a mov %rcx,(%rdx)
> 32: 48 89 40 08 mov %rax,0x8(%rax)
>
> and %rcx is 0x6b6b6b6b6b6b6b6b, which is the POISON_FREE byte sequence.
>
> So somebody is trying to remove a list entry that was already free'd.

The entry is being removed by run_workqueue(), so I wonder what the scenario is.

It appears that the problem is not reproducible with init=/bin/bash and the
minimal set of modules loaded, so perhaps one of the (modular) drivers
removes a work item from keventd_wq in a racy way.

In the meantime I got another trace, this time with a slab corruption involved.
Note that it crashed in exactly the same place as previously.

[ 210.450676] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB0
[ 210.450725] option 1-2:1.1: GSM modem (1-port) converter detected
[ 210.450997] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB1
[ 210.451044] option 1-2:1.2: GSM modem (1-port) converter detected
[ 210.451422] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB2
[ 210.451470] option 1-2:1.3: GSM modem (1-port) converter detected
[ 210.451746] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB3
[ 210.452092] PM: Finishing wakeup.
[ 210.452095] Restarting tasks ...
[ 210.452293] usb 5-2: USB disconnect, address 2
[ 210.470519] done.
[ 210.845547] Slab corruption: size-512 start=ffff88007f1182b0, len=512
[ 210.845718] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[ 210.845880] Last user: [<ffffffff810c8bdb>](kfree+0x7c/0x88)
[ 210.846142] 050: 6b 6b 6b 6b 6b 6b 6b 6b 38 e7 60 01 00 88 ff ff
[ 210.847114] Prev obj: start=ffff88007f118098, len=512
[ 210.847275] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[ 210.847436] Last user: [<ffffffffa01ffce0>](kzalloc+0xf/0x11 [snd])
[ 210.847724] 000: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
[ 210.850735] 010: 50 43 4d 20 50 6c 61 79 62 61 63 6b 20 56 6f 6c
[ 210.851844] Next obj: start=ffff88007f1184c8, len=512
[ 210.852220] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[ 210.852384] Last user: [<ffffffff8106a4c8>](kzalloc+0xf/0x11)
[ 210.852655] 000: 17 ea 3b 81 ff ff ff ff 00 00 00 00 00 00 00 00
[ 210.853981] 010: 24 01 00 00 00 00 00 00 0e a9 06 81 ff ff ff ff
[ 210.887295] e1000e 0000:01:00.0: irq 27 for MSI/MSI-X
[ 210.940224] e1000e 0000:01:00.0: irq 27 for MSI/MSI-X
[ 210.941776] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 211.061316] general protection fault: 0000 [#1] PREEMPT SMP
[ 211.061626] last sysfs file: /sys/class/pcmcia_socket/pcmcia_socket0/card_insert
[ 211.061891] CPU 0
[ 211.062090] Modules linked in: ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG xt_limit af_packet bnep sco rfcomm l2cap crc16 snd_pcm_oss snd_mixer_oss snd_seq binfmt_misc snd_seq_device ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 cpufreq_conservative ip_tables cpufreq_ondemand cpufreq_userspace cpufreq_powersave ip6table_filter acpi_cpufreq ip6_tables x_tables freq_table ipv6 microcode fuse loop sr_mod cdrom dm_mod btusb snd_hda_codec_realtek bluetooth arc4 snd_hda_intel ecb snd_hda_codec pcmcia snd_hwdep iwlcore snd_pcm mac80211 toshiba_acpi joydev snd_timer yenta_socket usbhid option video cfg80211 backlight rtc_cmos snd rsrc_nonstatic psmouse usb_storage hid output usbserial rtc_core iTCO_wdt soundcore battery pcmcia_core ac rfkill led_class button intel_agp rtc_lib crc_itu_t serio_raw e1000e iTCO_vendor_support snd_page_alloc uinput sg ehci_hcd uhci_hcd sd_mod crc_t10dif usbcore ext3 jbd fan ahci libata thermal processor [last unloaded: iwlagn]
[ 211.064780] Pid: 9, comm: events/0 Not tainted 2.6.32-rc6-tst #169 PORTEGE R500
[ 211.064780] RIP: 0010:[<ffffffff81054c07>] [<ffffffff81054c07>] worker_thread+0x15b/0x22a
[ 211.064780] RSP: 0018:ffff88007f0d9e40 EFLAGS: 00010046
[ 211.064780] RAX: ffff88007f118308 RBX: ffff88007f09bd48 RCX: 6b6b6b6b6b6b6b6b
[ 211.064780] RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000000 RDI: ffff880001613d00
[ 211.064780] RBP: ffff88007f0d9ee0 R08: ffff88007f0d9cf0 R09: ffff88007f0d9e02
[ 211.064780] R10: ffff880001613d00 R11: 0000000000000000 R12: ffff88007f118300
[ 211.064780] R13: ffff880001613d00 R14: ffff88007f0b9140 R15: ffff88007f0b9140
[ 211.064780] FS: 0000000000000000(0000) GS:ffff880001600000(0000) knlGS:0000000000000000
[ 211.064780] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 211.064780] CR2: 00007f7678984660 CR3: 000000007d3ce000 CR4: 00000000000006f0
[ 211.064780] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 211.064780] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 211.064780] Process events/0 (pid: 9, threadinfo ffff88007f0d8000, task ffff88007f0b9140)
[ 211.064780] Stack:
[ 211.064780] 000000000000c918 ffff88007f0b9578 ffff88007f0d9fd8 ffff88007f0b9140
[ 211.064780] <0> ffff880001613d08 ffff88007f0b9140 ffff880001613d18 6b6b6b6b6b6b6b6b
[ 211.064780] <0> 0000000000000000 ffff88007f0b9140 ffffffff81058289 ffff88007f0d9e98
[ 211.064780] Call Trace:
[ 211.064780] [<ffffffff81058289>] ? autoremove_wake_function+0x0/0x38
[ 211.064780] [<ffffffff81054aac>] ? worker_thread+0x0/0x22a
[ 211.064780] [<ffffffff81058062>] kthread+0x69/0x71
[ 211.064780] [<ffffffff8100c16a>] child_rip+0xa/0x20
[ 211.064780] [<ffffffff81057ff9>] ? kthread+0x0/0x71
[ 211.064780] [<ffffffff8100c160>] ? child_rip+0x0/0x20
[ 211.064780] Code: 74 12 4c 89 e6 4c 89 f7 ff 13 48 83 c3 08 48 83 3b 00 eb ec e8 3d ef ff ff 49 8b 45 08 4d 89 65 30 4c 89 ef 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 48 89 40 08 48 89 00 e8 f6 11 24 00 49 8b
[ 211.064780] RIP [<ffffffff81054c07>] worker_thread+0x15b/0x22a
[ 211.064780] RSP <ffff88007f0d9e40>
[ 211.064780] ---[ end trace 586a2f2b96b909e0 ]---
[ 211.064780] note: events/0[9] exited with preempt_count 1

Rafael

2009-11-10 22:03:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)



On Tue, 10 Nov 2009, Rafael J. Wysocki wrote:
> >
> > So somebody is trying to remove a list entry that was already free'd.
>
> The entry is being removed by run_workqueue(), so I wonder what the scenario is.

If the workqueue code runs on the wrong CPU, any random crap can easily
happen. If a workqueue entry gets queued on the wrong CPU, people who try
to unqueue it may not find it again, and will thus fail to remove it even
though it's still listed.

And the code really is pretty subtle. queue_delayed_work_on(), for
example, uses raw_smp_processor_id() to basically pick a target CPU for
the work ("whatever the CPU happens to be now"). But does that have to
match the CPU that the timeout will trigger on? I dunno. I've walked
through the code many times, but every time I end up forgetting all the
subtleties a few days later.

The workqueue code is very fragile in many ways. I'm adding Oleg and Tejun
to the Cc, because Oleg is definitely one of the workqueue locking
masters, and Tejun has worked on it to make it way more robust, so they
may have ideas.

Oleg, Tejun?

> It appears that the problem is not reproducible with init=/bin/bash and the
> minimal set of modules loaded, so perhaps one of the (modular) drivers
> removes a work item from keventd_wq in a racy way.

It's not necessarily that they remove a work item racily - maybe there is
something that frees a work-item without removing it at all (or, equally
possible: maybe it does remove it, but has a racy re-attach that means
that it gets added back to the workqueue again and then freed).

> In the meantime I got another trace, this time with a slab corruption involved.
> Note that it crashed in exactly the same place as previously.

I'm leaving your crash log appended for the new cc's, and I would not be
at all surprised to hear that the slab corruption is related. The whole
6b6b6b6b pattern does imply a use-after-free on the workqueue, after all,
and slab corruption could easily be another result.

Sadly, the 'size-512' thing doesn't really say much. There's a hint that
it may be sound-related, because the second redzone report (the first
kzalloc) has the data being

- 16 bytes of zero
- followed by 16 bytes of "PCM Playback Vol"

and the module information there also says [snd]. But because it's a
size-512 thing (ie kmalloc/kfree), it may be that the sound driver was
just the next user of the invalid 512-byte block, and wasn't the _cause_
of the problem.

So I don't know, but it might be worthwhile to see if the problem goes
away if you disable sound. Just to narrow down what has changed.

Linus
---
> [ 210.450676] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB0
> [ 210.450725] option 1-2:1.1: GSM modem (1-port) converter detected
> [ 210.450997] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB1
> [ 210.451044] option 1-2:1.2: GSM modem (1-port) converter detected
> [ 210.451422] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB2
> [ 210.451470] option 1-2:1.3: GSM modem (1-port) converter detected
> [ 210.451746] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB3
> [ 210.452092] PM: Finishing wakeup.
> [ 210.452095] Restarting tasks ...
> [ 210.452293] usb 5-2: USB disconnect, address 2
> [ 210.470519] done.
> [ 210.845547] Slab corruption: size-512 start=ffff88007f1182b0, len=512
> [ 210.845718] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> [ 210.845880] Last user: [<ffffffff810c8bdb>](kfree+0x7c/0x88)
> [ 210.846142] 050: 6b 6b 6b 6b 6b 6b 6b 6b 38 e7 60 01 00 88 ff ff
> [ 210.847114] Prev obj: start=ffff88007f118098, len=512
> [ 210.847275] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
> [ 210.847436] Last user: [<ffffffffa01ffce0>](kzalloc+0xf/0x11 [snd])
> [ 210.847724] 000: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
> [ 210.850735] 010: 50 43 4d 20 50 6c 61 79 62 61 63 6b 20 56 6f 6c
> [ 210.851844] Next obj: start=ffff88007f1184c8, len=512
> [ 210.852220] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
> [ 210.852384] Last user: [<ffffffff8106a4c8>](kzalloc+0xf/0x11)
> [ 210.852655] 000: 17 ea 3b 81 ff ff ff ff 00 00 00 00 00 00 00 00
> [ 210.853981] 010: 24 01 00 00 00 00 00 00 0e a9 06 81 ff ff ff ff
> [ 210.887295] e1000e 0000:01:00.0: irq 27 for MSI/MSI-X
> [ 210.940224] e1000e 0000:01:00.0: irq 27 for MSI/MSI-X
> [ 210.941776] ADDRCONF(NETDEV_UP): eth0: link is not ready
> [ 211.061316] general protection fault: 0000 [#1] PREEMPT SMP
> [ 211.061626] last sysfs file: /sys/class/pcmcia_socket/pcmcia_socket0/card_insert
> [ 211.061891] CPU 0
> > [ 211.062090] Modules linked in: ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG xt_limit af_packet bnep sco rfcomm l2cap crc16 snd_pcm_oss snd_mixer_oss snd_seq binfmt_misc snd_seq_device ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 cpufreq_conservative ip_tables cpufreq_ondemand cpufreq_userspace cpufreq_powersave ip6table_filter acpi_cpufreq ip6_tables x_tables freq_table ipv6 microcode fuse loop sr_mod cdrom dm_mod btusb snd_hda_codec_realtek bluetooth arc4 snd_hda_intel ecb snd_hda_codec pcmcia snd_hwdep iwlcore snd_pcm mac80211 toshiba_acpi joydev snd_timer yenta_socket usbhid option video cfg80211 backlight rtc_cmos snd rsrc_nonstatic psmouse usb_storage hid output usbserial rtc_core iTCO_wdt soundcore battery pcmcia_core ac rfkill led_class button intel_agp rtc_lib crc_itu_t serio_raw e1000e iTCO_vendor_support snd_page_alloc uinput sg ehci_hcd uhci_hcd sd_mod crc_t10dif usbcore ext3 jbd fan ahci libata thermal processor [last unloaded: iwlagn]
> [ 211.064780] Pid: 9, comm: events/0 Not tainted 2.6.32-rc6-tst #169 PORTEGE R500
> [ 211.064780] RIP: 0010:[<ffffffff81054c07>] [<ffffffff81054c07>] worker_thread+0x15b/0x22a
> [ 211.064780] RSP: 0018:ffff88007f0d9e40 EFLAGS: 00010046
> [ 211.064780] RAX: ffff88007f118308 RBX: ffff88007f09bd48 RCX: 6b6b6b6b6b6b6b6b
> [ 211.064780] RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000000 RDI: ffff880001613d00
> [ 211.064780] RBP: ffff88007f0d9ee0 R08: ffff88007f0d9cf0 R09: ffff88007f0d9e02
> [ 211.064780] R10: ffff880001613d00 R11: 0000000000000000 R12: ffff88007f118300
> [ 211.064780] R13: ffff880001613d00 R14: ffff88007f0b9140 R15: ffff88007f0b9140
> [ 211.064780] FS: 0000000000000000(0000) GS:ffff880001600000(0000) knlGS:0000000000000000
> [ 211.064780] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 211.064780] CR2: 00007f7678984660 CR3: 000000007d3ce000 CR4: 00000000000006f0
> [ 211.064780] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 211.064780] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 211.064780] Process events/0 (pid: 9, threadinfo ffff88007f0d8000, task ffff88007f0b9140)
> [ 211.064780] Stack:
> [ 211.064780] 000000000000c918 ffff88007f0b9578 ffff88007f0d9fd8 ffff88007f0b9140
> [ 211.064780] <0> ffff880001613d08 ffff88007f0b9140 ffff880001613d18 6b6b6b6b6b6b6b6b
> [ 211.064780] <0> 0000000000000000 ffff88007f0b9140 ffffffff81058289 ffff88007f0d9e98
> [ 211.064780] Call Trace:
> [ 211.064780] [<ffffffff81058289>] ? autoremove_wake_function+0x0/0x38
> [ 211.064780] [<ffffffff81054aac>] ? worker_thread+0x0/0x22a
> [ 211.064780] [<ffffffff81058062>] kthread+0x69/0x71
> [ 211.064780] [<ffffffff8100c16a>] child_rip+0xa/0x20
> [ 211.064780] [<ffffffff81057ff9>] ? kthread+0x0/0x71
> [ 211.064780] [<ffffffff8100c160>] ? child_rip+0x0/0x20
> [ 211.064780] Code: 74 12 4c 89 e6 4c 89 f7 ff 13 48 83 c3 08 48 83 3b 00 eb ec e8 3d ef ff ff 49 8b 45 08 4d 89 65 30 4c 89 ef 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 48 89 40 08 48 89 00 e8 f6 11 24 00 49 8b
> [ 211.064780] RIP [<ffffffff81054c07>] worker_thread+0x15b/0x22a
> [ 211.064780] RSP <ffff88007f0d9e40>
> [ 211.064780] ---[ end trace 586a2f2b96b909e0 ]---
> [ 211.064780] note: events/0[9] exited with preempt_count 1
>
> Rafael
>

2009-11-11 08:09:22

by Tejun Heo

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume

Hello,

Linus Torvalds wrote:
> On Tue, 10 Nov 2009, Rafael J. Wysocki wrote:
> And the code really is pretty subtle. queue_delayed_work_on(), for
> example, uses raw_smp_processor_id() to basically pick a target CPU for
> the work ("whatever the CPU happens to be now"). But does that have to
> match the CPU that the timeout will trigger on? I dunno.

It will queue on the cpu on which the timer callback runs, so the delayed
work will follow the timer, which can be a little surprising at times, I
suppose.

> I've walked through the code many times, but every time I end up
> forgetting all the subtleties a few days later.

A lot of those subtleties come from the fact that the struct work is not
guaranteed to be around once the work function is invoked, so after that
the workqueue code doesn't have much to work with but the pointer value
itself.

> The workqueue code is very fragile in many ways. I'm adding Oleg and Tejun
> to the Cc, because Oleg is definitely one of the workqueue locking
> masters, and Tejun has worked on it to make it way more robust, so they
> may have ideas.

One thing I can think of which might cause this early release is
self-requeueing works which assume that only one instance of the
function will be executed at any given time. While preparing to bring
down a cpu, worker threads are unbound from the cpu. After the cpu is
brought down, the workqueue for that cpu is flushed. This means that
any work which was running or on the queue at the time of cpu down will
run on a different cpu. So, let's assume there's a work function
which looks like the following:

void my_work_fn(struct work_struct *work)
{
        struct my_struct *me = container_of(work, something...);

        DO SOMETHING;

        if (--me->todo)
                schedule_work(work);
        else
                free(me);
}

This will work perfectly as long as all cpus stay alive, because the work
will be pinned to a single cpu and the cwq guarantees that a single work
is never executed in parallel. However, if a cpu is brought down
while my_work_fn() was doing SOMETHING and me->todo was above 1,
schedule_work() will schedule the work on a different cpu, which will
happily execute it in parallel.

As worker threads become unbound, they may bounce among different cpus
while executing and create more than two instances of parallel
execution of the work. With the above code that can lead to freeing a
work which is still on a workqueue list; with different code, just two
parallel executions should be able to achieve it.

Another related issue is the behavior of flush_work() when a work ends up
scheduled on a different cpu. flush_work() only looks at a single cpu
workqueue on each flush attempt, and if the work is not on that cpu, or
is there but also running on other cpus, it won't do anything about it.
So it's not too difficult to write code where the caller incorrectly
assumes the work is done after flush_work() returns, while the work
actually ended up being scheduled on a different cpu.

One way to debug this that I can think of is to record the work pointer ->
function mapping in a percpu ring buffer and, when the bug occurs (which
can be detected by matching the 6b poison pattern before dereferencing
the next pointer), print out the function pointer for the matching work
pointer. The async nature makes the problem very difficult to track down. :-(

Thanks.

--
tejun

2009-11-11 11:51:25

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Tuesday 10 November 2009, Linus Torvalds wrote:
>
...
>
> So I don't know, but it might be worthwhile to see if the problem goes
> away if you disable sound. Just to narrow down what has changed.

I thought that the problem was somehow related to user space, because it only
happens after we've thawed tasks. At least, all of the call traces I was able
to collect indicated so.

Moreover, in a few cases I got

kernel: PM: Finishing wakeup.
kernel: Restarting tasks ...
kernel: usb 5-2: USB disconnect, address 2
kernel: done.
bluetoothd[3445]: HCI dev 0 unregistered
bluetoothd[3445]: Unregister path: /org/bluez/3445/hci0
bluetoothd[3445]: Unregistered interface org.bluez.NetworkPeer on path /org/bluez/3445/hci0
bluetoothd[3445]: Unregistered interface org.bluez.NetworkHub on path /org/bluez/3445/hci0
bluetoothd[3445]: Unregistered interface org.bluez.NetworkRouter on path /org/bluez/3445/hci0
kernel: Slab corruption: size-512 start=ffff88007f1182b0, len=512

and so on (of course, the bluetoothd PID was different each time), so I thought
that the problem might be related to Bluetooth.

So, I've disabled the Bluetooth subsystem in the kernel config and I'm not able
to reproduce the problem any more (at least not within 50 consecutive
suspend-resume and hibernate-resume cycles). Thus Bluetooth seems to be
at least necessary to reproduce the issue and perhaps it's also the cause of
it.

Rafael

2009-11-11 16:19:57

by Oleg Nesterov

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On 11/10, Linus Torvalds wrote:
>
> > In the meantime I got another trace, this time with a slab corruption involved.
> > Note that it crashed in exactly the same place as previously.
>
> I'm leaving your crash log appended for the new cc's, and I would not be
> at all surprised to hear that the slab corruption is related. The whole
> 6b6b6b6b pattern does imply a use-after-free on the workqueue,

Yes, RCX = 6b6b6b6b6b6b6b6b, and according to decodecode the faulting
instruction is "mov %rdx,0x8(%rcx)". Looks like the pending work was
freed.

Rafael, could you reproduce the problem with the debugging patch below?
It tries to detect the case when the pending work was corrupted and
prints its work->func (saved in the previous item). It should work
if the work_struct was freed and poisoned, or if it was re-initialized.
See ck_work().

Oleg.

--- TH/include/linux/workqueue.h~WQ_DBG 2009-10-15 12:09:50.000000000 +0200
+++ TH/include/linux/workqueue.h 2009-11-11 16:20:16.000000000 +0100
@@ -27,6 +27,7 @@ struct work_struct {
#define WORK_STRUCT_PENDING 0 /* T if work item pending execution */
#define WORK_STRUCT_FLAG_MASK (3UL)
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
+ work_func_t next_func;
struct list_head entry;
work_func_t func;
#ifdef CONFIG_LOCKDEP
@@ -65,7 +66,7 @@ struct execute_work {
#define __WORK_INITIALIZER(n, f) { \
.data = WORK_DATA_INIT(), \
.entry = { &(n).entry, &(n).entry }, \
- .func = (f), \
+ .func = (f), .next_func = NULL, \
__WORK_INIT_LOCKDEP_MAP(#n, &(n)) \
}

--- TH/kernel/workqueue.c~WQ_DBG 2009-10-17 20:25:07.000000000 +0200
+++ TH/kernel/workqueue.c 2009-11-11 16:49:53.000000000 +0100
@@ -36,6 +36,7 @@
#define CREATE_TRACE_POINTS
#include <trace/events/workqueue.h>

+#define tow(p) list_entry((p), struct work_struct, entry)
/*
* The per-CPU workqueue (if single thread, we always use the first
* possible cpu).
@@ -44,7 +45,9 @@ struct cpu_workqueue_struct {

spinlock_t lock;

+ work_func_t next_func;
struct list_head worklist;
+
wait_queue_head_t more_work;
struct work_struct *current_work;

@@ -137,6 +140,10 @@ static void insert_work(struct cpu_workq
*/
smp_wmb();
list_add_tail(&work->entry, head);
+
+ work->next_func = tow(work->entry.next)->func;
+ tow(work->entry.prev)->next_func = work->func;
+
wake_up(&cwq->more_work);
}

@@ -261,9 +268,22 @@ int queue_delayed_work_on(int cpu, struc
}
EXPORT_SYMBOL_GPL(queue_delayed_work_on);

+static int ck_work(struct cpu_workqueue_struct *cwq, struct work_struct *work)
+{
+ if (cwq->next_func == work->func && cwq == get_wq_data(work) &&
+ work->entry.prev == &cwq->worklist &&
+ test_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
+ return 0;
+
+ printk(KERN_CRIT "ERR!! ");
+ print_symbol("%s\n", (unsigned long)cwq->next_func);
+ return 1;
+}
+
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
spin_lock_irq(&cwq->lock);
+again:
while (!list_empty(&cwq->worklist)) {
struct work_struct *work = list_entry(cwq->worklist.next,
struct work_struct, entry);
@@ -279,8 +299,20 @@ static void run_workqueue(struct cpu_wor
*/
struct lockdep_map lockdep_map = work->lockdep_map;
#endif
+ if (ck_work(cwq, work)) {
+ struct list_head *pos = &cwq->worklist;
+ while (pos->prev != &work->entry)
+ pos = pos->prev;
+
+ cwq->next_func = tow(pos)->func;
+ cwq->worklist.next = pos;
+ pos->prev = &cwq->worklist;
+ goto again;
+ }
+
trace_workqueue_execution(cwq->thread, work);
cwq->current_work = work;
+ cwq->next_func = work->next_func;
list_del_init(cwq->worklist.next);
spin_unlock_irq(&cwq->lock);

@@ -485,6 +517,7 @@ static int try_to_grab_pending(struct wo
*/
smp_rmb();
if (cwq == get_wq_data(work)) {
+ tow(work->entry.prev)->next_func = work->next_func;
list_del_init(&work->entry);
ret = 1;
}

2009-11-11 17:24:16

by Oleg Nesterov

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On 11/10, Linus Torvalds wrote:
>
> And the code really is pretty subtle. queue_delayed_work_on(), for
> example, uses raw_smp_processor_id() to basically pick a target CPU for
> the work ("whatever the CPU happens to be now"). But does that have to
> match the CPU that the timeout will trigger on?

Yes, but this doesn't matter.

queue_delayed_work_on() does:

/* This stores cwq for the moment, for the timer_fn */
set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));

this is only needed to ensure that delayed_work_timer_fn() which does

struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
struct workqueue_struct *wq = cwq->wq;

gets the correct workqueue_struct, cpu_workqueue_struct can be "wrong"
and even destroyed. queue_delayed_work_on() could use any CPU from
cpu_possible_mask instead of raw_smp_processor_id().

I still owe you the promised changes which should fix the problems
with the "overlapping" works (although I don't agree that we never want
to run the same work entry on multiple CPUs at once, so I am not
sure work_structs should always be "single-threaded"), and the very
first change will be

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -246,7 +246,8 @@ int queue_delayed_work_on(int cpu, struc
timer_stats_timer_set_start_info(&dwork->timer);

/* This stores cwq for the moment, for the timer_fn */
- set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
+ if (!get_wq_data(work))
+ set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
timer->expires = jiffies + delay;
timer->data = (unsigned long)dwork;
timer->function = delayed_work_timer_fn;

except this change is not always right. Not only can the same work_struct
run on multiple CPUs, it can run on different workqueues. Perhaps
nobody does this, but it is possible.

IOW, I agree it makes sense to introduce a WORK_STRUCT_SINGLE_THREADED flag,
and perhaps it can even be set by default (not sure), but in any case I
think we need work_{set,clear}_single_threaded().

> The workqueue code is very fragile in many ways.

Well, yes. Because any buggy user can easily kill the system, and we
constantly have bugs like this one.

I think we should teach workqueue.c to use debugobjects.c at least.

But otherwise I don't see how we can improve things very much. The
problem is not that the code itself is fragile, just that it has a lot
of buggy users. Once queue_work(work) adds a work to ->worklist,
the kernel depends on the owner of that work, which can crash the kernel
in many ways.

Oleg.

2009-11-11 18:21:08

by Oleg Nesterov

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume

On 11/11, Tejun Heo wrote:
>
> One thing that I can think of which might cause this early release is
> self-requeueing works which assume that only one instance of the
> function will be executed at any given time. While preparing to bring
> down a cpu, worker threads are unbound from the cpu. After cpu is
> brought down, the workqueue for that cpu is flushed. This means that
> any work which was running or on queue at the time of cpu down will
> run on a different cpu. So, let's assume there's a work function
> which looks like the following,
>
> void my_work_fn(struct work_struct *work)
> {
>         struct my_struct *me = container_of(work, something...);
>
>         DO SOMETHING;
>
>         if (--me->todo)
>                 schedule_work(work);
>         else
>                 free(me);
> }
>
> Which will work perfectly as long as all cpus stay alive as the work
> will be pinned on a single cpu and cwq guarantees that a single work
> is never executed in parallel. However, if a cpu is brought down
> while my_work_fn() was doing SOMETHING and me->todo was above 1,
> schedule_work() will schedule itself to a different cpu which will
> happily execute the work in parallel.
>
> As worker threads become unbound, they may bounce among different cpus
> while executing and create more than two instances

Well, "more than two instances" is not possible in this particular
case.

But in general I agree. If a self-requeueing work assumes it stays on
the same CPU or it assumes it can never race with itself, it should hook
CPU_DOWN_PREPARE and cancel the work. Like slab.c does with reap_work.

This is even documented, the comment above queue_work() says:

* We queue the work to the CPU on which it was submitted, but if the CPU dies
* it can be processed by another CPU.

We can improve things, see http://marc.info/?l=linux-kernel&m=125562105103769

But then we should also change workqueue_cpu_callback(CPU_POST_DEAD).
Instead of flushing, we should carefully move the pending works to
another CPU, otherwise the self-requeueing work can block cpu_down().

> Another related issue is the behavior of flush_work() when a work ends up
> scheduled on a different cpu. flush_work() only looks at a single
> cpu workqueue on each flush attempt, and if the work is not on that cpu,
> or is there but also running on other cpus, it won't do anything about it.
> So it's not too difficult to write code where the caller incorrectly
> assumes the work is done after flush_work() returns, while the work
> actually ended up being scheduled on a different cpu.

Yes, flush_work() is not even supposed to work "correctly" in this case.
Please note the changelog for db700897224b5ebdf852f2d38920ce428940d059
In particular:

More precisely, it "flushes" the result of the last
queue_work() which is visible to the caller.

but we can add flush_work_sync().

But flush_work() does not have too many callers. Instead, people often
use flush_workqueue(), which just can't help if the work_struct is
self-requeueing or if it is a delayed_work.

> One way to debug I can think of is to record work pointer -> function
> mapping in a percpu ring buffer

We can record work->func in work->entry.prev, which is either another
work or cwq. Please see the debugging patch I sent.

Not sure this patch will help, but I bet that the actual reason for
this bug is much simpler than the subtle races above ;)

Oleg.

2009-11-11 19:53:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)



On Wed, 11 Nov 2009, Rafael J. Wysocki wrote:
>
> I thought that the problem was somehow related to user space, because it only
> happens after we've thawed tasks. At least, all of the call traces I was able
> to collect indicated so.
>
> Moreover, in a few cases I got
>
> kernel: PM: Finishing wakeup.
> kernel: Restarting tasks ...
> kernel: usb 5-2: USB disconnect, address 2
> kernel: done.
> bluetoothd[3445]: HCI dev 0 unregistered
> bluetoothd[3445]: Unregister path: /org/bluez/3445/hci0
> bluetoothd[3445]: Unregistered interface org.bluez.NetworkPeer on path /org/bluez/3445/hci0
> bluetoothd[3445]: Unregistered interface org.bluez.NetworkHub on path /org/bluez/3445/hci0
> bluetoothd[3445]: Unregistered interface org.bluez.NetworkRouter on path /org/bluez/3445/hci0
> kernel: Slab corruption: size-512 start=ffff88007f1182b0, len=512
>
> and so on (of course, the bluetoothd PID was different each time), so I thought
> that the problem might be related to Bluetooth.

Hmm. Sounds reasonable. It's still that 'size-512', but if the sound
subsystem and the bluetooth code both happen to use that size, that would
explain why there was sound data in the slab.

> So, I've disabled the Bluetooth subsystem in the kernel config and I'm not able
> to reproduce the problem any more (at least not within 50 consecutive
> suspend-resume and hibernate-resume cycles). Thus Bluetooth seems to be
> at least necessary to reproduce the issue and perhaps it's also the cause of
> it.

Which BT driver are you using? Maybe it's specific to the low-level
driver?

For example, I could imagine that (say) a USB bluetooth dongle (I think
they are common for mice, and are sometimes built-in on the
motherboard) could get the USB "disconnect" event, and get freed while
some work from the resume is still pending.

I'm looking at btusb_disconnect(), for example. It's one of the few BT
drivers that seem to use workqueues, and I'm not seeing a
cancel_work_sync() in the disconnect routine - but maybe the btusb_close()
routine is called indirectly some way that I just don't see.

Marcel?

Linus

2009-11-11 19:59:06

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Wednesday 11 November 2009, Oleg Nesterov wrote:
> On 11/10, Linus Torvalds wrote:
> >
> > > In the meantime I got another trace, this time with a slab corruption involved.
> > > Note that it crashed in exactly the same place as previously.
> >
> > I'm leaving your crash log appended for the new cc's, and I would not be
> > at all surprised to hear that the slab corruption is related. The whole
> > 6b6b6b6b pattern does imply a use-after-free on the workqueue,
>
> Yes, RCX = 6b6b6b6b6b6b6b6b, and according to decodecode the faulting
> instruction is "mov %rdx,0x8(%rcx)". Looks like the pending work was
> freed.
>
> Rafael, could you reproduce the problem with the debugging patch below?
> It tries to detect the case when the pending work was corrupted and
> prints its work->func (saved in the previous item). It should work
> if the work_struct was freed and poisoned, or if it was re-initialized.
> See ck_work().

I applied the patch and this is the result of 'dmesg | grep ERR' after 10-or-so
consecutive suspend-resume and hibernate-resume cycles:

[ 129.008689] ERR!! btusb_waker+0x0/0x27 [btusb]
[ 166.477373] ERR!! btusb_waker+0x0/0x27 [btusb]
[ 203.983665] ERR!! btusb_waker+0x0/0x27 [btusb]
[ 241.636547] ERR!! btusb_waker+0x0/0x27 [btusb]

which kind of confirms my previous observation that the problem was not
reproducible without Bluetooth.

So, it looks like the bug is in btusb_destruct(), which should call
cancel_work_sync() on data->waker before freeing 'data'. I guess it should
do the same for data->work.

I'm going to test the appended patch, then.

Thanks,
Rafael

---
drivers/bluetooth/btusb.c | 3 +++
1 file changed, 3 insertions(+)

Index: linux-2.6/drivers/bluetooth/btusb.c
===================================================================
--- linux-2.6.orig/drivers/bluetooth/btusb.c
+++ linux-2.6/drivers/bluetooth/btusb.c
@@ -738,6 +738,9 @@ static void btusb_destruct(struct hci_de

BT_DBG("%s", hdev->name);

+ cancel_work_sync(&data->work);
+ cancel_work_sync(&data->waker);
+
kfree(data);
}

2009-11-11 20:12:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)



On Wed, 11 Nov 2009, Rafael J. Wysocki wrote:
>
> I applied the patch and this is the result of 'dmesg | grep ERR' after 10-or-so
> consecutive suspend-resume and hibernate-resume cycles:
>
> [ 129.008689] ERR!! btusb_waker+0x0/0x27 [btusb]
> [ 166.477373] ERR!! btusb_waker+0x0/0x27 [btusb]
> [ 203.983665] ERR!! btusb_waker+0x0/0x27 [btusb]
> [ 241.636547] ERR!! btusb_waker+0x0/0x27 [btusb]
>
> which kind of confirms my previous observation that the problem was not
> reproducible without Bluetooth.

.. and that btusb thing matches my observation that only a few BT drivers
seem to use workqueues, and btusb_disconnect() isn't doing any work
cancel.

> I'm going to test the appended patch, then.

Hmm, the USB disconnect doesn't call hci_dev_put(), it calls
hci_free_dev() and doesn't seem to call the ->destruct thing.

Although again, maybe that gets called indirectly.

So I'd put the cancel work in the actual disconnect routine. But maybe
that's just me.

Linus

2009-11-11 20:18:39

by Marcel Holtmann

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

Hi Linus,

> > I thought that the problem was somehow related to user space, because it only
> > happens after we've thawed tasks. At least, all of the call traces I was able
> > to collect indicated so.
> >
> > Moreover, in a few cases I got
> >
> > kernel: PM: Finishing wakeup.
> > kernel: Restarting tasks ...
> > kernel: usb 5-2: USB disconnect, address 2
> > kernel: done.
> > bluetoothd[3445]: HCI dev 0 unregistered
> > bluetoothd[3445]: Unregister path: /org/bluez/3445/hci0
> > bluetoothd[3445]: Unregistered interface org.bluez.NetworkPeer on path /org/bluez/3445/hci0
> > bluetoothd[3445]: Unregistered interface org.bluez.NetworkHub on path /org/bluez/3445/hci0
> > bluetoothd[3445]: Unregistered interface org.bluez.NetworkRouter on path /org/bluez/3445/hci0
> > kernel: Slab corruption: size-512 start=ffff88007f1182b0, len=512
> >
> > and so on (of course, the bluetoothd PID was different each time), so I thought
> > that the problem might be related to Bluetooth.
>
> Hmm. Sounds reasonable. It's still that 'size-512', but if the sound
> subsystem and the bluetooth code both happen to use that size, that would
> explain why there was sound data in the slab.
>
> > So, I've disabled the Bluetooth subsystem in the kernel config and I'm not able
> > to reproduce the problem any more (at least not within 50 consecutive
> > suspend-resume and hibernate-resume cycles). Thus Bluetooth seems to be
> > at least necessary to reproduce the issue and perhaps it's also the cause of
> > it.
>
> Which BT driver are you using? Maybe it's specific to the low-level
> driver?
>
> For example, I could imagine that (say) a USB bluetooth dongle (I think
> they are common for for mice, and are sometimes built-in on the
> motherboard) could get the USB "disconnect" event, and get freed while
> some work from the resume is still pending.
>
> I'm looking at btusb_disconnect(), for example. It's one of the few BT
> drivers that seem to use workqueues, and I'm not seeing a
> cancel_work_sync() in the disconnect routine - but maybe the btusb_close()
> routine is called indirectly some way that I just don't see.

so btusb_close() should be called before btusb_destruct(), and the
destruct() callback is only called when the last reference count gets
dropped and we have to free the memory. So it seems we are doing
something wrong in btusb_close(). The close() callback is triggered via
hci_unregister_dev() from btusb_disconnect().

As it seems, btusb_close() only cancels the work workqueue and not
the waker workqueue. Could that be the problem?

Oliver, what do you think?

Regards

Marcel

2009-11-11 20:20:26

by Marcel Holtmann

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

Hi Linus,

> > I applied the patch and this is the result of 'dmesg | grep ERR' after 10-or-so
> > consecutive suspend-resume and hibernate-resume cycles:
> >
> > [ 129.008689] ERR!! btusb_waker+0x0/0x27 [btusb]
> > [ 166.477373] ERR!! btusb_waker+0x0/0x27 [btusb]
> > [ 203.983665] ERR!! btusb_waker+0x0/0x27 [btusb]
> > [ 241.636547] ERR!! btusb_waker+0x0/0x27 [btusb]
> >
> > which kind of confirms my previous observation that the problem was not
> > reproducible without Bluetooth.
>
> .. and that btusb thing matches my observation that only a few BT drivers
> seem to use workqueues, and btusb_disconnect() isn't doing any work
> cancel.
>
> > I'm going to test the appended patch, then.
>
> Hmm. the USB disconnect doesn't call hci_dev_put(), it calls
> hci_free_dev() and doesn't seem to call the ->destruct thing.
>
> Although again, maybe that gets called indirectly.
>
> So I'd put the cancel work in the actual disconnect routine. But maybe
> thar's just me.

see my other email, I think we have to cancel the waker in btusb_close()
to fix this. However, the suspend/resume and auto-suspend work has been
done mostly by Oliver, and I'd like to see what he thinks about this.

Regards

Marcel

2009-11-11 20:32:09

by Oleg Nesterov

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On 11/11, Rafael J. Wysocki wrote:
>
> On Wednesday 11 November 2009, Oleg Nesterov wrote:
> >
> > Rafael, could you reproduce the problem with the debugging patch below?
> > It tries to detect the case when the pending work was corrupted and
> > prints its work->func (saved in the previous item). It should work
> > if the work_struct was freed and poisoned, or if it was re-initialized.
> > See ck_work().
>
> I applied the patch and this is the result of 'dmesg | grep ERR' after 10-or-so
> consecutive suspend-resume and hibernate-resume cycles:
>
> [ 129.008689] ERR!! btusb_waker+0x0/0x27 [btusb]
> [ 166.477373] ERR!! btusb_waker+0x0/0x27 [btusb]
> [ 203.983665] ERR!! btusb_waker+0x0/0x27 [btusb]
> [ 241.636547] ERR!! btusb_waker+0x0/0x27 [btusb]
>
> which kind of confirms my previous observation that the problem was not
> reproducible without Bluetooth.

Great, thanks.

> So, it looks like the bug is in btusb_destruct(), which should call
> cancel_work_sync() on data->waker before freeing 'data'. I guess it should
> do the same for data->work.

Or. btusb_suspend() and btusb_close() do cancel_work_sync(data->work),
perhaps they should cancel data->waker as well, I dunno.

I added Oliver to cc.

Oleg.

2009-11-11 20:26:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)



On Wed, 11 Nov 2009, Marcel Holtmann wrote:
>
> As it seems, btusb_close() only cancels the work workqueue and not
> the waker workqueue. Could that be the problem?

It's the waker workqueue that Rafael apparently sees with the debug patch
from Oleg, so yeah, that sounds possible.

Rafael, since apparently btusb_close() _is_ called, does the following
simpler patch fix it for you? I bet your patch does it too.

Linus

---
drivers/bluetooth/btusb.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index 7ba91aa..2fb3802 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -591,6 +591,7 @@ static int btusb_close(struct hci_dev *hdev)
return 0;

cancel_work_sync(&data->work);
+ cancel_work_sync(&data->waker);

clear_bit(BTUSB_ISOC_RUNNING, &data->flags);
clear_bit(BTUSB_BULK_RUNNING, &data->flags);

2009-11-11 21:11:00

by Oliver Neukum

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Wednesday, 11 November 2009 21:18:39, Marcel Holtmann wrote:
> > I'm looking at btusb_disconnect(), for example. It's one of the few BT
> > drivers that seem to use workqueues, and I'm not seeing a
> > cancel_work_sync() in the disconnect routine - but maybe the
> > btusb_close() routine is called indirectly some way that I just don't
> > see.
>
> so the btusb_close() should be called before btusb_destruct() and the
> destruct() callback is only when the last reference count gets dropped
> and we do have to free the memory. So it seems we are doing something
> wrong in btusb_close(). The close() callback is triggered via
> hci_unregister_dev() from btusb_disconnect().
>
> As it seems, btusb_close() only cancels the work workqueue and not
> the waker workqueue. Could that be the problem?

Yes, btusb_close() needs to cancel the waker workqueue, too.
In addition, in order to avoid a memory leak it must call
usb_scuttle_anchored_urbs(&data->deferred)
to make sure any deferred data is destroyed.

Regards
Oliver

2009-11-11 21:12:36

by Oliver Neukum

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Wednesday, 11 November 2009 21:24:33, Oleg Nesterov wrote:
> > So, it looks like the bug is in btusb_destruct(), which should call
> > cancel_work_sync() on data->waker before freeing 'data'. I guess it
> > should do the same for data->work.
>
> Or. btusb_suspend() and btusb_close() do cancel_work_sync(data->work),
> perhaps they should cancel data->waker as well, I dunno.
>
> I added Oliver to cc.

btusb_close() must do it. btusb_suspend() must not do it.

Regards
Oliver

2009-11-11 21:17:34

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Wednesday 11 November 2009, Linus Torvalds wrote:
>
> On Wed, 11 Nov 2009, Marcel Holtmann wrote:
> >
> > As it seems, btusb_close() only cancels the work workqueue and not
> > the waker workqueue. Could that be the problem?
>
> It's the waker workqueue that Rafael apparently sees with the debug patch
> from Oleg, so yeah, that sounds possible.
>
> Rafael, since apparently btusb_close() _is_ called, does the following
> simpler patch fix it for you? I bet your patch does it too.

Yes and yes.

And in fact this is a post-2.6.31 regression introduced by commit
7bee549e197c9c0e92b89857a409675c1d5e9dff (Bluetooth: Add USB autosuspend
support to btusb driver).

Thanks,
Rafael

2009-11-11 21:39:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)



On Wed, 11 Nov 2009, Oliver Neukum wrote:
>
> Yes, btusb_close() needs to cancel the waker workqueue, too.
> In addition, in order to avoid a memory leak it must call
> usb_scuttle_anchored_urbs(&data->deferred)
> to make sure any deferred data is destroyed.

Ok. I committed the one-liner patch to fix the oops (confirmed by Rafael),
but I'm leaving the memory leak for you as I'm not going to commit some
untested patch at this stage.

Thanks to everybody involved, I was personally very nervous that this was
something much more fundamental (we've had some scary bugs during the
32-rc series), and am relieved that it was "just" a silly driver bug.

Linus

2009-11-11 21:42:04

by Oliver Neukum

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Wednesday, 11 November 2009 22:38:08, Linus Torvalds wrote:
> On Wed, 11 Nov 2009, Oliver Neukum wrote:
> > Yes, btusb_close() needs to cancel the waker workqueue, too.
> > In addition, in order to avoid a memory leak it must call
> > usb_scuttle_anchored_urbs(&data->deferred)
> > to make sure any deferred data is destroyed.
>
> Ok. I committed the one-liner patch to fix the oops (confirmed by Rafael),
> but I'm leaving the memory leak for you as I'm not going to commit some
> untested patch at this stage.
>
> Thanks to everybody involved, I was personally very nervous that this was
> something much more fundamental (we've had some scary bugs during the
> 32-rc series), and am relieved that it was "just" a silly driver bug.

Thanks. A patch is in the works.

Regards
Oliver

2009-11-12 04:57:40

by Tejun Heo

Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume

Hello,

Oleg Nesterov wrote:
> Well, "more than two instances" is not possible in this particular
> case.

Right, the second one won't leave the cpu and cpu downs don't overlap.

> But in general I agree. If a self-requeueing work assumes it stays on
> the same CPU or it assumes it can never race with itself, it should hook
> CPU_DOWN_PREPARE and cancel the work. Like slab.c does with reap_work.
>
> This is even documented, the comment above queue_work() says:
>
> * We queue the work to the CPU on which it was submitted, but if the CPU dies
> * it can be processed by another CPU.
>
> We can improve things, see http://marc.info/?l=linux-kernel&m=125562105103769
>
> But then we should also change workqueue_cpu_callback(CPU_POST_DEAD).
> Instead of flushing, we should carefully move the pending works to
> another CPU, otherwise the self-requeueing work can block cpu_down().

That looks like an excellent idea. I don't think it will add
noticeable overhead even when done by default, and it will magically
make the "how to implement single-threaded wq semantics in concurrency
managed wq" problem go away. I'll work on it.

>> Another related issue is the behavior flush_work() when a work ends up
>> scheduled on a different cpu. flush_work() will only look at a single
>> cpu workqueue on each flush attempt, and if the work is not on that
>> cpu, or is there but also running on other cpus, it won't do anything
>> about it.
>> So, it's not too difficult to write code where the caller incorrectly
>> assumes the work is done after flush_work() is finished while the work
>> actually ended up being scheduled on a different cpu.
>
> Yes, flush_work() is not even supposed to work "correctly" in this case.
> Please note the changelog for db700897224b5ebdf852f2d38920ce428940d059
> In particular:
>
> More precisely, it "flushes" the result of the last
> queue_work() which is visible to the caller.
>
> but we can add flush_work_sync().
>
> But flush_work() does not have too many callers. Instead people often
> use flush_workqueue() which just can't help if the work_struct is
> self-requeueing or if it is delayed_work.

If you look at the workqueue code itself very closely, all subtleties
are properly defined and described. The problem is that it's not very
clear and way too subtle when seen from outside, and workqueues are
used very widely. I think if we enforce global single-threadedness
with the above change, making flush_work() behave as
flush_work_sync() by default should be doable without too much
overhead. I'll give it a shot.
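
To make the flush_work() subtlety concrete, here is a hypothetical
(invented) driver teardown where flush_work() can return while the
work is still running on another cpu:

```c
/* Hypothetical driver teardown - a sketch, not real driver code. */
static void mydrv_remove(struct mydrv *drv)
{
	/*
	 * flush_work() only inspects the cpu workqueue the work was
	 * last queued on; if the work migrated after a cpu went down,
	 * or requeued itself on another cpu, it may still be running
	 * when we free the structure below.
	 */
	flush_work(&drv->work);
	kfree(drv);			/* potential use-after-free */
}

/* The safe pattern is to cancel instead of flush: */
static void mydrv_remove_safe(struct mydrv *drv)
{
	cancel_work_sync(&drv->work);	/* waits out every instance */
	kfree(drv);
}
```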

>> One way to debug I can think of is to record work pointer -> function
>> mapping in a percpu ring buffer
>
> We can record work->func in work->entry.prev, which is either another
> work or cwq. Please see the debugging patch I sent.

That's much better. It would be nice to have such debug code in
upstream dependent on a debug config.

> Not sure this patch will help, but I bet that the actual reason for
> this bug is much simpler than the subtle races above ;)

And yes it was, but I'm still fairly sure the unexpected races described
above are happening. They would probably be quite infrequent, and in many
cases the outcome would not be bad enough to be noticed. I think we really
need to plug the hole. It's scary.

Thanks.

--
tejun

2009-11-12 17:34:09

by Thomas Gleixner

Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Wed, 11 Nov 2009, Oleg Nesterov wrote:
> Well, yes. Because any buggy user can easily kill the system, and we
> constantly have the bugs like this one.
>
> I think we should teach workqueue.c to use debugobjects.c at least.

Here you go. Would be interesting to know whether it catches the
problem at hand.

Thanks,

tglx
----
Subject: workqueue: Add debugobjects support
From: Thomas Gleixner <[email protected]>
Date: Thu, 12 Nov 2009 16:25:34 +0100

Add debugobject support to track the life time of work_structs.

Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/smpboot.c | 4 +
include/linux/workqueue.h | 38 +++++++++----
kernel/workqueue.c | 131 ++++++++++++++++++++++++++++++++++++++++++++--
lib/Kconfig.debug | 8 ++
4 files changed, 166 insertions(+), 15 deletions(-)

Index: linux-2.6/arch/x86/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smpboot.c
+++ linux-2.6/arch/x86/kernel/smpboot.c
@@ -687,7 +687,7 @@ static int __cpuinit do_boot_cpu(int api
.done = COMPLETION_INITIALIZER_ONSTACK(c_idle.done),
};

- INIT_WORK(&c_idle.work, do_fork_idle);
+ INIT_WORK_ON_STACK(&c_idle.work, do_fork_idle);

alternatives_smp_switch(1);

@@ -713,6 +713,7 @@ static int __cpuinit do_boot_cpu(int api

if (IS_ERR(c_idle.idle)) {
printk("failed fork for CPU %d\n", cpu);
+ destroy_work_on_stack(&c_idle.work);
return PTR_ERR(c_idle.idle);
}

@@ -831,6 +832,7 @@ do_rest:
smpboot_restore_warm_reset_vector();
}

+ destroy_work_on_stack(&c_idle.work);
return boot_error;
}

Index: linux-2.6/include/linux/workqueue.h
===================================================================
--- linux-2.6.orig/include/linux/workqueue.h
+++ linux-2.6/include/linux/workqueue.h
@@ -25,6 +25,7 @@ typedef void (*work_func_t)(struct work_
struct work_struct {
atomic_long_t data;
#define WORK_STRUCT_PENDING 0 /* T if work item pending execution */
+#define WORK_STRUCT_STATIC 1 /* static initializer (debugobjects) */
#define WORK_STRUCT_FLAG_MASK (3UL)
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct list_head entry;
@@ -35,6 +36,7 @@ struct work_struct {
};

#define WORK_DATA_INIT() ATOMIC_LONG_INIT(0)
+#define WORK_DATA_STATIC_INIT() ATOMIC_LONG_INIT(2)

struct delayed_work {
struct work_struct work;
@@ -63,7 +65,7 @@ struct execute_work {
#endif

#define __WORK_INITIALIZER(n, f) { \
- .data = WORK_DATA_INIT(), \
+ .data = WORK_DATA_STATIC_INIT(), \
.entry = { &(n).entry, &(n).entry }, \
.func = (f), \
__WORK_INIT_LOCKDEP_MAP(#n, &(n)) \
@@ -91,6 +93,14 @@ struct execute_work {
#define PREPARE_DELAYED_WORK(_work, _func) \
PREPARE_WORK(&(_work)->work, (_func))

+#ifdef CONFIG_DEBUG_OBJECTS_WORK
+extern void __init_work(struct work_struct *work, int onstack);
+extern void destroy_work_on_stack(struct work_struct *work);
+#else
+static inline void __init_work(struct work_struct *work, int onstack) { }
+static inline void destroy_work_on_stack(struct work_struct *work) { }
+#endif
+
/*
* initialize all of a work item in one go
*
@@ -99,24 +109,36 @@ struct execute_work {
* to generate better code.
*/
#ifdef CONFIG_LOCKDEP
-#define INIT_WORK(_work, _func) \
+#define __INIT_WORK(_work, _func, _onstack) \
do { \
static struct lock_class_key __key; \
\
+ __init_work((_work), _onstack); \
(_work)->data = (atomic_long_t) WORK_DATA_INIT(); \
lockdep_init_map(&(_work)->lockdep_map, #_work, &__key, 0);\
INIT_LIST_HEAD(&(_work)->entry); \
PREPARE_WORK((_work), (_func)); \
} while (0)
#else
-#define INIT_WORK(_work, _func) \
+#define __INIT_WORK(_work, _func, _onstack) \
do { \
+ __init_work((_work), _onstack); \
(_work)->data = (atomic_long_t) WORK_DATA_INIT(); \
INIT_LIST_HEAD(&(_work)->entry); \
PREPARE_WORK((_work), (_func)); \
} while (0)
#endif

+#define INIT_WORK(_work, _func) \
+ do { \
+ __INIT_WORK((_work), (_func), 0); \
+ } while (0)
+
+#define INIT_WORK_ON_STACK(_work, _func) \
+ do { \
+ __INIT_WORK((_work), (_func), 1); \
+ } while (0)
+
#define INIT_DELAYED_WORK(_work, _func) \
do { \
INIT_WORK(&(_work)->work, (_func)); \
@@ -125,22 +147,16 @@ struct execute_work {

#define INIT_DELAYED_WORK_ON_STACK(_work, _func) \
do { \
- INIT_WORK(&(_work)->work, (_func)); \
+ INIT_WORK_ON_STACK(&(_work)->work, (_func)); \
init_timer_on_stack(&(_work)->timer); \
} while (0)

-#define INIT_DELAYED_WORK_DEFERRABLE(_work, _func) \
+#define INIT_DELAYED_WORK_DEFERRABLE(_work, _func) \
do { \
INIT_WORK(&(_work)->work, (_func)); \
init_timer_deferrable(&(_work)->timer); \
} while (0)

-#define INIT_DELAYED_WORK_ON_STACK(_work, _func) \
- do { \
- INIT_WORK(&(_work)->work, (_func)); \
- init_timer_on_stack(&(_work)->timer); \
- } while (0)
-
/**
* work_pending - Find out whether a work item is currently pending
* @work: The work item in question
Index: linux-2.6/kernel/workqueue.c
===================================================================
--- linux-2.6.orig/kernel/workqueue.c
+++ linux-2.6/kernel/workqueue.c
@@ -68,6 +68,116 @@ struct workqueue_struct {
#endif
};

+#ifdef CONFIG_DEBUG_OBJECTS_WORK
+
+static struct debug_obj_descr work_debug_descr;
+
+/*
+ * fixup_init is called when:
+ * - an active object is initialized
+ */
+static int work_fixup_init(void *addr, enum debug_obj_state state)
+{
+ struct work_struct *work = addr;
+
+ switch (state) {
+ case ODEBUG_STATE_ACTIVE:
+ cancel_work_sync(work);
+ debug_object_init(work, &work_debug_descr);
+ return 1;
+ default:
+ return 0;
+ }
+}
+
+/*
+ * fixup_activate is called when:
+ * - an active object is activated
+ * - an unknown object is activated (might be a statically initialized object)
+ */
+static int work_fixup_activate(void *addr, enum debug_obj_state state)
+{
+ struct work_struct *work = addr;
+
+ switch (state) {
+
+ case ODEBUG_STATE_NOTAVAILABLE:
+ /*
+ * This is not really a fixup. The work struct was
+ * statically initialized. We just make sure that it
+ * is tracked in the object tracker.
+ */
+ if (test_bit(WORK_STRUCT_STATIC, work_data_bits(work))) {
+ debug_object_init(work, &work_debug_descr);
+ debug_object_activate(work, &work_debug_descr);
+ return 0;
+ }
+ WARN_ON_ONCE(1);
+ return 0;
+
+ case ODEBUG_STATE_ACTIVE:
+ WARN_ON(1);
+
+ default:
+ return 0;
+ }
+}
+
+/*
+ * fixup_free is called when:
+ * - an active object is freed
+ */
+static int work_fixup_free(void *addr, enum debug_obj_state state)
+{
+ struct work_struct *work = addr;
+
+ switch (state) {
+ case ODEBUG_STATE_ACTIVE:
+ cancel_work_sync(work);
+ debug_object_free(work, &work_debug_descr);
+ return 1;
+ default:
+ return 0;
+ }
+}
+
+static struct debug_obj_descr work_debug_descr = {
+ .name = "work_struct",
+ .fixup_init = work_fixup_init,
+ .fixup_activate = work_fixup_activate,
+ .fixup_free = work_fixup_free,
+};
+
+static inline void debug_work_activate(struct work_struct *work)
+{
+ debug_object_activate(work, &work_debug_descr);
+}
+
+static inline void debug_work_deactivate(struct work_struct *work)
+{
+ debug_object_deactivate(work, &work_debug_descr);
+}
+
+void __init_work(struct work_struct *work, int onstack)
+{
+ if (onstack)
+ debug_object_init_on_stack(work, &work_debug_descr);
+ else
+ debug_object_init(work, &work_debug_descr);
+}
+EXPORT_SYMBOL_GPL(__init_work);
+
+void destroy_work_on_stack(struct work_struct *work)
+{
+ debug_object_free(work, &work_debug_descr);
+}
+EXPORT_SYMBOL_GPL(destroy_work_on_stack);
+
+#else
+static inline void debug_work_activate(struct work_struct *work) { }
+static inline void debug_work_deactivate(struct work_struct *work) { }
+#endif
+
/* Serializes the accesses to the list of workqueues. */
static DEFINE_SPINLOCK(workqueue_lock);
static LIST_HEAD(workqueues);
@@ -145,6 +255,7 @@ static void __queue_work(struct cpu_work
{
unsigned long flags;

+ debug_work_activate(work);
spin_lock_irqsave(&cwq->lock, flags);
insert_work(cwq, work, &cwq->worklist);
spin_unlock_irqrestore(&cwq->lock, flags);
@@ -280,6 +391,7 @@ static void run_workqueue(struct cpu_wor
struct lockdep_map lockdep_map = work->lockdep_map;
#endif
trace_workqueue_execution(cwq->thread, work);
+ debug_work_deactivate(work);
cwq->current_work = work;
list_del_init(cwq->worklist.next);
spin_unlock_irq(&cwq->lock);
@@ -350,11 +462,18 @@ static void wq_barrier_func(struct work_
static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
struct wq_barrier *barr, struct list_head *head)
{
- INIT_WORK(&barr->work, wq_barrier_func);
+ /*
+ * debugobject calls are safe here even with cwq->lock locked
+ * as we know for sure that this will not trigger any of the
+ * checks and call back into the fixup functions where we
+ * might deadlock.
+ */
+ INIT_WORK_ON_STACK(&barr->work, wq_barrier_func);
__set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));

init_completion(&barr->done);

+ debug_work_activate(&barr->work);
insert_work(cwq, &barr->work, head);
}

@@ -372,8 +491,10 @@ static int flush_cpu_workqueue(struct cp
}
spin_unlock_irq(&cwq->lock);

- if (active)
+ if (active) {
wait_for_completion(&barr.done);
+ destroy_work_on_stack(&barr.work);
+ }

return active;
}
@@ -451,6 +572,7 @@ out:
return 0;

wait_for_completion(&barr.done);
+ destroy_work_on_stack(&barr.work);
return 1;
}
EXPORT_SYMBOL_GPL(flush_work);
@@ -485,6 +607,7 @@ static int try_to_grab_pending(struct wo
*/
smp_rmb();
if (cwq == get_wq_data(work)) {
+ debug_work_deactivate(work);
list_del_init(&work->entry);
ret = 1;
}
@@ -507,8 +630,10 @@ static void wait_on_cpu_work(struct cpu_
}
spin_unlock_irq(&cwq->lock);

- if (unlikely(running))
+ if (unlikely(running)) {
wait_for_completion(&barr.done);
+ destroy_work_on_stack(&barr.work);
+ }
}

static void wait_on_work(struct work_struct *work)
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,14 @@ config DEBUG_OBJECTS_TIMERS
timer routines to track the life time of timer objects and
validate the timer operations.

+config DEBUG_OBJECTS_WORK
+ bool "Debug work objects"
+ depends on DEBUG_OBJECTS
+ help
+ If you say Y here, additional code will be inserted into the
+ work queue routines to track the life time of work objects and
+ validate the work operations.
+
config DEBUG_OBJECTS_ENABLE_DEFAULT
int "debug_objects bootup default value (0-1)"
range 0 1

2009-11-12 18:41:29

by Oleg Nesterov

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume

On 11/12, Tejun Heo wrote:
>
> Oleg Nesterov wrote:
> >
> > This is even documented, the comment above queue_work() says:
> >
> > * We queue the work to the CPU on which it was submitted, but if the CPU dies
> > * it can be processed by another CPU.
> >
> > We can improve things, see http://marc.info/?l=linux-kernel&m=125562105103769
> >
> > But then we should also change workqueue_cpu_callback(CPU_POST_DEAD).
> > Instead of flushing, we should carefully move the pending works to
> > another CPU, otherwise the self-requeueing work can block cpu_down().
>
> That looks like an excellent idea. I don't think it will add
> noticeable overhead even when done by default, and it will magically
> make the "how to implement single-threaded wq semantics in concurrency
> managed wq" problem go away. I'll work on it.

I am still not sure all work_structs should be single-threaded by default.

To clarify, I am not arguing, I just don't know. I mean, this change can
break the existing code, and it is not easy to notice the problem.

> If you look at the workqueue code itself very closely, all subtleties
> are properly defined and described. The problem is that it's not very
> clear and way too subtle when seen from outside and workqueue is
> something used very widely.

Yes, agreed.

> making flush_work() behave as
> flush_work_sync() by default should be doable without too much
> overhead. I'll give it a shot.

Well, I disagree. Imho it is better to have both flush_work() and
flush_work_sync(). flush_work() is "special" and should be used with
care. But this is minor, and if the work_struct is single-threaded then
flush_work() == flush_work_sync().

(Perhaps this is what you meant)

> > Not sure this patch will help, but I bet that the actual reason for
> > this bug is much simpler than the subtle races above ;)
>
> And yes it was but still I'm fairly sure unexpected races described
> above are happening.

Yes, sure, I never argued.

My only point was, it is not that workqueues are buggy, they were
designed (and even documented) to work this way. I can't judge if it
was right or not, but personally I think everything is "logical".

That said, I agree that we have too many buggy users, perhaps we
should simplify the rules.

I just noticed that schedule_on_each_cpu() was recently changed by

HWPOISON: Allow schedule_on_each_cpu() from keventd
commit: 65a64464349883891e21e74af16c05d6e1eeb4e9

Surprisingly, even this simple change is not exactly right.

/*
* when running in keventd don't schedule a work item on itself.
* Can just call directly because the work queue is already bound.
* This also is faster.
* Make this a generic parameter for other workqueues?
*/
if (current_is_keventd()) {
orig = raw_smp_processor_id();
INIT_WORK(per_cpu_ptr(works, orig), func);
func(per_cpu_ptr(works, orig));
}

OK, but this code should be moved down, under get_online_cpus().

schedule_on_each_cpu() should guarantee that func() can't race with
CPU hotplug, can safely use per-cpu data, etc. That is why flush_work()
is called before put_online_cpus().
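
Concretely, the suggestion is to move the keventd special case inside the
get_online_cpus() section, roughly like this (an untested sketch against
the code added by the commit above, including the simplification mentioned
below):

```c
int schedule_on_each_cpu(work_func_t func)
{
	int cpu, orig = -1;
	struct work_struct *works;

	works = alloc_percpu(struct work_struct);
	if (!works)
		return -ENOMEM;

	get_online_cpus();
	/*
	 * Moved down: with cpu hotplug excluded, the "run directly when
	 * in keventd" shortcut can no longer race with cpu_down() while
	 * func() is using per-cpu data.
	 */
	if (current_is_keventd())
		orig = raw_smp_processor_id();

	for_each_online_cpu(cpu) {
		struct work_struct *work = per_cpu_ptr(works, cpu);

		INIT_WORK(work, func);
		if (cpu != orig)
			schedule_work_on(cpu, work);
	}
	if (orig >= 0)
		func(per_cpu_ptr(works, orig));

	for_each_online_cpu(cpu)
		if (cpu != orig)
			flush_work(per_cpu_ptr(works, cpu));
	put_online_cpus();
	free_percpu(works);
	return 0;
}
```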

Another reason to move this code down is that current_is_keventd()
itself is racy, the "preempt-safe: keventd is per-cpu" comment is not
right. I think the do_boot_cpu() case is fine though.

(off-topic, but we can also simplify the code a little bit, the second
"if (cpu != orig)" is not necessary).


Perhaps you and Linus are right, and we should simplify the rules
unconditionally. But note that the problem above has nothing to do with
single-threaded behaviour, and I do not think it is possible to guarantee
work->func() can't be moved to another CPU.

Oleg.

2009-11-12 19:13:44

by Tejun Heo

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume

Hello, Oleg.

11/13/2009 03:35 AM, Oleg Nesterov wrote:
>> That looks like an excellent idea. I don't think it will add
>> noticeable overhead even when done by default, and it will magically
>> make the "how to implement single-threaded wq semantics in concurrency
>> managed wq" problem go away. I'll work on it.
>
> I am still not sure all work_structs should be single-threaded by default.
>
> To clarify, I am not arguing, I just don't know. I mean, this change can
> break the existing code, and it is not easy to notice the problem.

Hmm... I can't think of anything that will break if a single thread
(or execution instance) is enforced. Are you worrying about ignoring
the cpu designation? I'm still working on it, but it seems possible to
honor the cpu parameter and still enforce a single thread.

>> making flush_work() behave as
>> flush_work_sync() by default should be doable without too much
>> overhead. I'll give it a shot.
>
> Well, I disagree. Imho it is better to have both flush_work() and
> flush_work_sync(). flush_work() is "special" and should be used with
> care. But this is minor, and if the work_struct is single-threaded then
> flush_work() == flush_work_sync().
>
> (Perhaps this is what you meant)

Yeap, that was what I meant. :-)

> My only point was, it is not that workqueues are buggy, they were
> designed (and even documented) to work this way. I can't judge if it
> was right or not, but personally I think everything is "logical".

Yes, it's very focused on lowering cross-cpu overhead wherever
possible. I think the one design decision which added most to
complexity and subtlety was allowing work functions to free or reuse
work structs leaving the workqueue code only the pointer value to work
with for synchronization. Maybe it's worth it. I don't know. At any
rate changing it would be too costly at this point.

> That said, I agree that we have too many buggy users, perhaps we
> should simplify the rules.
>
> I just noticed that schedule_on_each_cpu() was recently changed by
>
> HWPOISON: Allow schedule_on_each_cpu() from keventd
> commit: 65a64464349883891e21e74af16c05d6e1eeb4e9
>
> Surprisingly, even this simple change is not exactly right.
>
> /*
> * when running in keventd don't schedule a work item on itself.
> * Can just call directly because the work queue is already bound.
> * This also is faster.
> * Make this a generic parameter for other workqueues?
> */
> if (current_is_keventd()) {
> orig = raw_smp_processor_id();
> INIT_WORK(per_cpu_ptr(works, orig), func);
> func(per_cpu_ptr(works, orig));
> }
>
> OK, but this code should be moved down, under get_online_cpus().

Yeap, I already have the patch in my queue. It will be going out in a
few days.

> Perhaps you and Linus are right, and we should simplify the rules
> unconditionally. But note that the problem above has nothing to do with
> single-threaded behaviour, and I do not think it is possible to guarantee
> work->func() can't be moved to another CPU.

I don't know about Linus, but I definitely would like single-threaded
behavior by default. Currently, a singlethread workqueue is used for two
things - to limit the number of workers and to avoid works executing
simultaneously on different cpus - but I don't think all users are
careful enough about the second point.
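
The second property is what a dedicated singlethread workqueue provides
today - a sketch, with "mydrv" and my_work as placeholder names:

```c
/* A singlethread workqueue has exactly one worker thread, so a work
 * item queued on it can never run concurrently with itself. */
struct workqueue_struct *wq;

wq = create_singlethread_workqueue("mydrv");
if (!wq)
	return -ENOMEM;

queue_work(wq, &my_work);	/* executes on the single worker */

/* on teardown: */
destroy_workqueue(wq);		/* flushes pending work, stops the worker */
```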

Thanks.

--
tejun

2009-11-12 19:17:37

by Tejun Heo

Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume

11/13/2009 02:33 AM, Thomas Gleixner wrote:
> Subject: workqueue: Add debugobjects support
> From: Thomas Gleixner <[email protected]>
> Date: Thu, 12 Nov 2009 16:25:34 +0100
>
> Add debugobject support to track the life time of work_structs.
>
> Signed-off-by: Thomas Gleixner <[email protected]>

This looks great. I'm about to create a wq tree for concurrency
managed workqueue patches. Do you mind if I route this patch through
there?

Thanks.

--
tejun

2009-11-12 20:51:45

by Rafael J. Wysocki

Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Thursday 12 November 2009, Thomas Gleixner wrote:
> On Wed, 11 Nov 2009, Oleg Nesterov wrote:
> > Well, yes. Because any buggy user can easily kill the system, and we
> > constantly have the bugs like this one.
> >
> > I think we should teach workqueue.c to use debugobjects.c at least.
>
> Here you go. Would be interesting to know whether it catches the
> problem at hand.

How would I tell if it did?

Rafael

2009-11-12 20:54:41

by Thomas Gleixner

Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume

On Fri, 13 Nov 2009, Tejun Heo wrote:
> 11/13/2009 02:33 AM, Thomas Gleixner wrote:
> > Subject: workqueue: Add debugobjects support
> > From: Thomas Gleixner <[email protected]>
> > Date: Thu, 12 Nov 2009 16:25:34 +0100
> >
> > Add debugobject support to track the life time of work_structs.
> >
> > Signed-off-by: Thomas Gleixner <[email protected]>
>
> This looks great. I'm about to create a wq tree for concurrency
> managed workqueue patches. Do you mind if I route this patch through
> there?

Not at all. That patch is barely tested, and since you are fooling
around with the workqueue code anyway, it's the best place for it to go.

Thanks,

tglx

2009-11-12 20:57:31

by Thomas Gleixner

Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Thu, 12 Nov 2009, Rafael J. Wysocki wrote:

> On Thursday 12 November 2009, Thomas Gleixner wrote:
> > On Wed, 11 Nov 2009, Oleg Nesterov wrote:
> > > Well, yes. Because any buggy user can easily kill the system, and we
> > > constantly have the bugs like this one.
> > >
> > > I think we should teach workqueue.c to use debugobjects.c at least.
> >
> > Here you go. Would be interesting to know whether it catches the
> > problem at hand.
>
> How would I tell if it did?

Apply the patch, enable debugobjects and try to reproduce the failure
- w/o the fix patch of course :)

debugobjects should catch the attempt to free the memory which holds
the work_struct, cancel it and print a backtrace where this happened.

Thanks,

tglx

2009-11-12 22:53:57

by Rafael J. Wysocki

Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Thursday 12 November 2009, Thomas Gleixner wrote:
> On Thu, 12 Nov 2009, Rafael J. Wysocki wrote:
>
> > On Thursday 12 November 2009, Thomas Gleixner wrote:
> > > On Wed, 11 Nov 2009, Oleg Nesterov wrote:
> > > > Well, yes. Because any buggy user can easily kill the system, and we
> > > > constantly have the bugs like this one.
> > > >
> > > > I think we should teach workqueue.c to use debugobjects.c at least.
> > >
> > > Here you go. Would be interesting to know whether it catches the
> > > problem at hand.
> >
> > How would I tell if it did?
>
> Apply the patch, enable debugobjects and try to reproduce the failure
> - w/o the fix patch of course :)
>
> debugobjects should catch the attempt to free the memory which holds
> the work_struct, cancel it and print a backtrace where this happened.

I got the appended trace, so I guess it works.

Thanks,
Rafael


[ 92.775800] PM: Finishing wakeup.
[ 92.775803] Restarting tasks ...
[ 92.776020] usb 5-2: USB disconnect, address 3
[ 92.800258] done.
Nov 12 23:40:53 tosh bluetoothd[3424]: HCI dev 0 unregistered
Nov 12 23:40:53 tosh bluetoothd[3424]: Unregister path: /org/bluez/3424/hci0
Nov 12 23:40:53 tosh bluetoothd[3424]: Unregistered interface org.bluez.NetworkPeer on path /org/bluez/3424/hci0
Nov 12 23:40:53 tosh bluetoothd[3424]: Unregistered interface org.bluez.NetworkHub on path /org/bluez/3424/hci0
Nov 12 23:40:53 tosh bluetoothd[3424]: Unregistered interface org.bluez.NetworkRouter on path /org/bluez/3424/hci0
[ 93.028345] ------------[ cut here ]------------
[ 93.028525] WARNING: at /home/rafael/src/linux-2.6/lib/debugobjects.c:255 debug_print_object+0x4f/0x57()
[ 93.028833] Hardware name: PORTEGE R500
[ 93.028997] ODEBUG: free active object type: work_struct
[ 93.029174] Modules linked in: af_packet ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG xt_limit bnep sco rfcomm l2cap crc16 snd_pcm_oss snd_mixer_oss snd_seq binfmt_misc snd_seq_device ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables cpufreq_conservative cpufreq_ondemand ip6table_filter cpufreq_userspace ip6_tables cpufreq_powersave x_tables acpi_cpufreq ipv6 freq_table microcode fuse loop sr_mod cdrom dm_mod snd_hda_codec_realtek btusb arc4 bluetooth snd_hda_intel ecb snd_hda_codec iwlagn snd_hwdep iwlcore sdhci_pci snd_pcm pcmcia mac80211 toshiba_acpi snd_timer joydev sdhci firewire_ohci option usbhid video snd rtc_cmos yenta_socket cfg80211 backlight mmc_core firewire_core rsrc_nonstatic iTCO_wdt soundcore psmouse rtc_core usb_storage hid usbserial output battery ac button led_class e1000e rtc_lib pcmcia_core rfkill crc_itu_t iTCO_vendor_
support snd_page_alloc serio_raw intel_agp uinput sg ehci_hcd uhci_hcd sd_mod crc_t10dif usbcore ext3 jbd fan ahci libata thermal processor
[ 93.032029] Pid: 1185, comm: khubd Not tainted 2.6.32-rc6-tst #176
[ 93.032029] Call Trace:
[ 93.032029] [<ffffffff8116b773>] ? debug_print_object+0x4f/0x57
[ 93.032029] [<ffffffff81040df6>] warn_slowpath_common+0x77/0x8f
[ 93.032029] [<ffffffff81040e83>] warn_slowpath_fmt+0x64/0x66
[ 93.032029] [<ffffffff81294f96>] ? _spin_lock_irq+0x17/0x30
[ 93.032029] [<ffffffff812951cb>] ? _spin_unlock_irq+0x2b/0x36
[ 93.032029] [<ffffffff81294fc7>] ? _spin_lock_irqsave+0x18/0x34
[ 93.032029] [<ffffffff81295202>] ? _spin_unlock_irqrestore+0x2c/0x37
[ 93.032029] [<ffffffff81164caa>] ? __up_read+0x84/0x8c
[ 93.032029] [<ffffffff81294fc7>] ? _spin_lock_irqsave+0x18/0x34
[ 93.038521] [<ffffffff8116b773>] debug_print_object+0x4f/0x57
[ 93.038521] [<ffffffff8116bae4>] debug_check_no_obj_freed+0x98/0x15e
[ 93.038521] [<ffffffff81295202>] ? _spin_unlock_irqrestore+0x2c/0x37
[ 93.039180] [<ffffffff810c8d7c>] kfree+0x71/0x88
[ 93.039282] [<ffffffffa03d36f1>] btusb_destruct+0x10/0x12 [btusb]
[ 93.039282] [<ffffffffa03d36d0>] btusb_disconnect+0x81/0x92 [btusb]
[ 93.039282] [<ffffffffa00c6612>] usb_unbind_interface+0x5e/0xe1 [usbcore]
[ 93.039282] [<ffffffff811ebead>] __device_release_driver+0x70/0xbe
[ 93.039282] [<ffffffff811ebfc6>] device_release_driver+0x1e/0x2b
[ 93.040317] [<ffffffff811ebb46>] bus_remove_device+0x83/0x92
[ 93.040317] [<ffffffff811e9940>] device_del+0x12e/0x170
[ 93.040317] [<ffffffffa00c3adf>] usb_disable_device+0x76/0xf5 [usbcore]
[ 93.040317] [<ffffffffa00becfb>] usb_disconnect+0x96/0xfa [usbcore]
[ 93.040317] [<ffffffffa00bf927>] hub_thread+0x493/0xe2c [usbcore]
[ 93.040317] [<ffffffff81058465>] ? autoremove_wake_function+0x0/0x38
[ 93.040317] [<ffffffff81293de6>] ? preempt_schedule+0x3b/0x41
[ 93.040317] [<ffffffffa00bf494>] ? hub_thread+0x0/0xe2c [usbcore]
[ 93.040317] [<ffffffff8105823e>] kthread+0x69/0x71
[ 93.040317] [<ffffffff8100c16a>] child_rip+0xa/0x20
[ 93.042128] [<ffffffff810581d5>] ? kthread+0x0/0x71
[ 93.042393] [<ffffffff8100c160>] ? child_rip+0x0/0x20
[ 93.042478] ---[ end trace 6915b286c29296ce ]---

2009-11-12 23:10:00

by Thomas Gleixner

Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Thu, 12 Nov 2009, Rafael J. Wysocki wrote:
> > debugobjects should catch the attempt to free the memory which holds
> > the work_struct, cancel it and print a backtrace where this happened.
>
> I got the appended trace, so I guess it works.

Yup. I assume the box survived, otherwise you wouldn't have retrieved
the backtrace.

> [ 93.028997] ODEBUG: free active object type: work_struct

Active work_struct is freed

> [ 93.039180] [<ffffffff810c8d7c>] kfree+0x71/0x88
> [ 93.039282] [<ffffffffa03d36f1>] btusb_destruct+0x10/0x12 [btusb]

Here is the offending code and its call chain.

> [ 93.039282] [<ffffffffa03d36d0>] btusb_disconnect+0x81/0x92 [btusb]
> [ 93.039282] [<ffffffffa00c6612>] usb_unbind_interface+0x5e/0xe1 [usbcore]
> [ 93.039282] [<ffffffff811ebead>] __device_release_driver+0x70/0xbe
> [ 93.039282] [<ffffffff811ebfc6>] device_release_driver+0x1e/0x2b
> [ 93.040317] [<ffffffff811ebb46>] bus_remove_device+0x83/0x92
> [ 93.040317] [<ffffffff811e9940>] device_del+0x12e/0x170
> [ 93.040317] [<ffffffffa00c3adf>] usb_disable_device+0x76/0xf5 [usbcore]
> [ 93.040317] [<ffffffffa00becfb>] usb_disconnect+0x96/0xfa [usbcore]
> [ 93.040317] [<ffffffffa00bf927>] hub_thread+0x493/0xe2c [usbcore]
> [ 93.040317] [<ffffffff81058465>] ? autoremove_wake_function+0x0/0x38
> [ 93.040317] [<ffffffff81293de6>] ? preempt_schedule+0x3b/0x41
> [ 93.040317] [<ffffffffa00bf494>] ? hub_thread+0x0/0xe2c [usbcore]
> [ 93.040317] [<ffffffff8105823e>] kthread+0x69/0x71
> [ 93.040317] [<ffffffff8100c16a>] child_rip+0xa/0x20
> [ 93.042128] [<ffffffff810581d5>] ? kthread+0x0/0x71
> [ 93.042393] [<ffffffff8100c160>] ? child_rip+0x0/0x20
> [ 93.042478] ---[ end trace 6915b286c29296ce ]---

That does not necessarily tell you which of the work_structs in the
bluetooth code it is, but it narrows the search for culprits
significantly.

Thanks for testing and confirming.

tglx

2009-11-15 23:37:03

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Thu, Nov 12, 2009 at 06:33:00PM +0100, Thomas Gleixner wrote:
> @@ -145,6 +255,7 @@ static void __queue_work(struct cpu_work
> {
> unsigned long flags;
>
> + debug_work_activate(work);
> spin_lock_irqsave(&cwq->lock, flags);
> insert_work(cwq, work, &cwq->worklist);



Since you are doing that from insert_wq_barrier() too, whose endpoint
is also insert_work(), why not put debug_work_activate() there
instead? Or maybe you really prefer to do this outside the spinlock
(which in the off case is zero-overhead). Or maybe it can sleep?

2009-11-15 23:40:27

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume (was: Re: Help needed: Resume problems in 2.6.32-rc, perhaps related to preempt_count leakage in keventd)

On Mon, Nov 16, 2009 at 12:37:06AM +0100, Frederic Weisbecker wrote:
> On Thu, Nov 12, 2009 at 06:33:00PM +0100, Thomas Gleixner wrote:
> > @@ -145,6 +255,7 @@ static void __queue_work(struct cpu_work
> > {
> > unsigned long flags;
> >
> > + debug_work_activate(work);
> > spin_lock_irqsave(&cwq->lock, flags);
> > insert_work(cwq, work, &cwq->worklist);
>
>
>
> Since you are doing that from insert_wq_barrier() too, whose endpoint
> is also insert_work(), why not put debug_work_activate() there
> instead? Or maybe you really prefer to do this outside the spinlock
> (which in the off case is zero-overhead). Or maybe it can sleep?


/me now remembers this path can't sleep since we can queue a
work from anywhere... so I guess this is to avoid bloating the lock
overhead.

Whatever, this is really a small detail.

2009-11-16 11:01:42

by Tejun Heo

[permalink] [raw]
Subject: Re: GPF in run_workqueue()/list_del_init(cwq->worklist.next) on resume

On 11/13/2009 04:14 AM, Tejun Heo wrote:
>>> making flush_work() behave as
>>> flush_work_sync() by default should be doable without too much
>>> overhead. I'll give it a shot.

Aieee... my estimate for cross-CPU work queueing was way off.
Queueing on different CPUs is very common - sometimes more so than on
the same CPU - so it looks like this can't be done without adding
significant overhead to the queueing path. Giving up. :-(

Thanks.

--
tejun