2007-09-11 19:01:48

by Pavel Machek

[permalink] [raw]
Subject: regression: fireware causes oops during system

Hi!

I noticed empty suspend stopped working around 2.6.23-rc4, and it is
still present in 2.6.23-rc6. To reproduce

swapoff -a
echo disk > /sys/power/state
echo disk > /sys/power/state

Unsetting

CONFIG_IEEE1394=y

solves the problem.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


2007-09-11 19:30:00

by Stefan Richter

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

Ben Collins wrote:
> On Tue, 2007-09-11 at 21:00 +0200, Pavel Machek wrote:
>> I noticed empty suspend stopped working around 2.6.23-rc4, and it is
>> still present in 2.6.23-rc6. To reproduce
>>
>> swapoff -a
>> echo disk > /sys/power/state
>> echo disk > /sys/power/state
>>
>> Unsetting
>>
>> CONFIG_IEEE1394=y
>>
>> solves the problem.
>
> Since this is easy to reproduce, can you bisect it down?

I will check in a minute what I pushed out after 2.6.23-rc1, may save
Pavel some time.
--
Stefan Richter
-=====-=-=== =--= -=-==
http://arcgraph.de/sr/

2007-09-11 19:41:04

by Ben Collins

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system


On Tue, 2007-09-11 at 21:00 +0200, Pavel Machek wrote:
> Hi!
>
> I noticed empty suspend stopped working around 2.6.23-rc4, and it is
> still present in 2.6.23-rc6. To reproduce
>
> swapoff -a
> echo disk > /sys/power/state
> echo disk > /sys/power/state
>
> Unsetting
>
> CONFIG_IEEE1394=y
>
> solves the problem.

Since this is easy to reproduce, can you bisect it down?

--
Ubuntu : http://www.ubuntu.com/
Linux1394: http://wiki.linux1394.org/
SwissDisk: http://www.swissdisk.com/

2007-09-11 19:46:40

by Stefan Richter

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

>> On Tue, 2007-09-11 at 21:00 +0200, Pavel Machek wrote:
>>> I noticed empty suspend stopped working around 2.6.23-rc4, and it is
>>> still present in 2.6.23-rc6.
...
>>> Unsetting
>>>
>>> CONFIG_IEEE1394=y
>>>
>>> solves the problem.
...

Between -rc3 and -rc4:

ieee1394: sbp2: fix sbp2_remove_device for error cases
a2ee3f9bbb0ce57102dad8928d54f59acdc4b8f7
should not occur in suspend path

Between -rc1 and -rc2:

ieee1394: sbp2: more correct Kconfig dependencies
e4f8cac5e07528f7e0bc21d3682c16c9de993ecb
unrelated

ieee1394: revert "sbp2: enforce 32bit DMA mapping"
a9c2f18800753c82c45fc13b27bdc148849bdbb2
unrelated

(not via linux1394-2.6.git)
raw1394 __user annotation
5b26e64ea39e45802c5736c8261bf8a8704d212f
unrelated

So it must be something older which was somehow uncovered.
Dmesg with the oops or/and bisection would be good.
--
Stefan Richter
-=====-=-=== =--= -=-==
http://arcgraph.de/sr/

2007-09-16 17:53:49

by Pavel Machek

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

Hi!

> >> On Tue, 2007-09-11 at 21:00 +0200, Pavel Machek wrote:
> >>> I noticed empty suspend stopped working around 2.6.23-rc4, and it is
> >>> still present in 2.6.23-rc6.
> ...
> >>> Unsetting
> >>>
> >>> CONFIG_IEEE1394=y
> >>>
> >>> solves the problem.
> ...
>
> Between -rc3 and -rc4:
>
> ieee1394: sbp2: fix sbp2_remove_device for error cases
> a2ee3f9bbb0ce57102dad8928d54f59acdc4b8f7
> should not occur in suspend path
>
> Between -rc1 and -rc2:
>
> ieee1394: sbp2: more correct Kconfig dependencies
> e4f8cac5e07528f7e0bc21d3682c16c9de993ecb
> unrelated
>
> ieee1394: revert "sbp2: enforce 32bit DMA mapping"
> a9c2f18800753c82c45fc13b27bdc148849bdbb2
> unrelated
>
> (not via linux1394-2.6.git)
> raw1394 __user annotation
> 5b26e64ea39e45802c5736c8261bf8a8704d212f
> unrelated
>
> So it must be something older which was somehow uncovered.
> Dmesg with the oops or/and bisection would be good.

Sorry, had to hand-copy. It is oops at virtual adddress 6b6b6b7b --
looks like slab poison to me?

EIP is in task_rq_lock, backtrace is
try_to_wake_up
highlevel_host_reset
ohci_irq_handler
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-09-16 17:57:17

by Pavel Machek

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

On Tue 2007-09-11 21:45:58, Stefan Richter wrote:
> >> On Tue, 2007-09-11 at 21:00 +0200, Pavel Machek wrote:
> >>> I noticed empty suspend stopped working around 2.6.23-rc4, and it is
> >>> still present in 2.6.23-rc6.
> ...
> >>> Unsetting
> >>>
> >>> CONFIG_IEEE1394=y
> >>>
> >>> solves the problem.
> ...
>
> Between -rc3 and -rc4:
>
> ieee1394: sbp2: fix sbp2_remove_device for error cases
> a2ee3f9bbb0ce57102dad8928d54f59acdc4b8f7
> should not occur in suspend path

Plus I do not have firewire attached disk here, so sbp2 should not be
used, right?

> Between -rc1 and -rc2:
>
> ieee1394: sbp2: more correct Kconfig dependencies
> e4f8cac5e07528f7e0bc21d3682c16c9de993ecb
> unrelated
>
> ieee1394: revert "sbp2: enforce 32bit DMA mapping"
> a9c2f18800753c82c45fc13b27bdc148849bdbb2
> unrelated
>
> (not via linux1394-2.6.git)
> raw1394 __user annotation
> 5b26e64ea39e45802c5736c8261bf8a8704d212f
> unrelated
>
> So it must be something older which was somehow uncovered.
> Dmesg with the oops or/and bisection would be good.

The others look even more innocent. I wonder if this may have been
responsible?

commit 831441862956fffa17b9801db37e6ea1650b0f69
tree b0334921341f8f1734bdd3243de76d676329d21c
parent 787d2214c19bcc9b6ac48af0ce098277a801eded
author Rafael J. Wysocki <[email protected]> Tue, 17 Jul 2007 04:03:35 -0700
committer Linus Torvalds <[email protected]> Tue, 17
Jul 2007 10:23:02 -0700

Freezer: make kernel threads nonfreezable by default

Or this one?

commit 20c2df83d25c6a95affe6157a4c9cac4cf5ffaac
tree 415c4453d2b17a50abe7a3e515177e1fa337bd67
parent 64fb98fc40738ae1a98bcea9ca3145b89fb71524
author Paul Mundt <[email protected]> Fri, 20 Jul 2007 10:11:58
+0900
committer Paul Mundt <[email protected]> Fri, 20 Jul 2007 10:11:58
+0900

mm: Remove slab destructors from kmem_cache_create().

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.


Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-09-16 18:42:49

by Stefan Richter

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

(Adding Cc: Rafael, Ingo)

Pavel Machek wrote:
>> Dmesg with the oops or/and bisection would be good.
>
> Sorry, had to hand-copy. It is oops at virtual adddress 6b6b6b7b --
> looks like slab poison to me?
>
> EIP is in task_rq_lock, backtrace is
> try_to_wake_up
> highlevel_host_reset
> ohci_irq_handler

In reverse order, this trace is most certainly

drivers/ieee1394/ohci1394.c::ohci_irq_handler
drivers/ieee1394/highlevel.c::highlevel_host_reset
drivers/ieee1394/highlevel.c::nodemgr_highlevel.host_reset
== nodemgr_host_reset
kernel/sched.c::wake_up_process(hi->thread);

with hi->thread being the kthread which executes
drivers/ieee1394/nodemgr.c::nodemgr_host_thread of the ieee1394 core driver.

The ieee1394 core has two (or more) threads: One nodemgr_host_thread
alias [knodemgrd_*] for each card, and one hpsbpkt_thread alias
[khpsbpkt] for all cards. The knodemgrd should be frozen during suspend
or hibernate, while khpsbpkt should not be frozen in order to let
transactions to go on in the case of saving the hibernation image to a
FireWire disk.


Quoting your other post:
> On Tue 2007-09-11 21:45:58, Stefan Richter wrote:
>> Between -rc3 and -rc4:
>>
>> ieee1394: sbp2: fix sbp2_remove_device for error cases
>> a2ee3f9bbb0ce57102dad8928d54f59acdc4b8f7
>> should not occur in suspend path
>
> Plus I do not have firewire attached disk here, so sbp2 should not be
> used, right?

Sbp2's host_reset handler would do something if it was loaded even
without any SBP-2 devices attached,but it wouldn't under any
circumstances call try_to_wake_up. Actually with no devices attached it
would simply "iterate" over an empty list.

>> Between -rc1 and -rc2:
>>
>> ieee1394: sbp2: more correct Kconfig dependencies
>> e4f8cac5e07528f7e0bc21d3682c16c9de993ecb
>> unrelated
>>
>> ieee1394: revert "sbp2: enforce 32bit DMA mapping"
>> a9c2f18800753c82c45fc13b27bdc148849bdbb2
>> unrelated
>>
>> (not via linux1394-2.6.git)
>> raw1394 __user annotation
>> 5b26e64ea39e45802c5736c8261bf8a8704d212f
>> unrelated
>>
>> So it must be something older which was somehow uncovered.
>> Dmesg with the oops or/and bisection would be good.
>
> The others look even more innocent. I wonder if this may have been
> responsible?
>
> commit 831441862956fffa17b9801db37e6ea1650b0f69
> tree b0334921341f8f1734bdd3243de76d676329d21c
> parent 787d2214c19bcc9b6ac48af0ce098277a801eded
> author Rafael J. Wysocki <[email protected]> Tue, 17 Jul 2007 04:03:35 -0700
> committer Linus Torvalds <[email protected]> Tue, 17
> Jul 2007 10:23:02 -0700
>
> Freezer: make kernel threads nonfreezable by default
>
> Or this one?
>
> commit 20c2df83d25c6a95affe6157a4c9cac4cf5ffaac
> tree 415c4453d2b17a50abe7a3e515177e1fa337bd67
> parent 64fb98fc40738ae1a98bcea9ca3145b89fb71524
> author Paul Mundt <[email protected]> Fri, 20 Jul 2007 10:11:58
> +0900
> committer Paul Mundt <[email protected]> Fri, 20 Jul 2007 10:11:58
> +0900
>
> mm: Remove slab destructors from kmem_cache_create().
>
> Slab destructors were no longer supported after Christoph's
> c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
> BUGs for both slab and slub, and slob never supported them
> either.

Both of these went in during the merge window between 2.6.22 and 2.6.23-rc1.

The bug could have been introduced by a change outside of the ieee1394
subsystem, like freezer or scheduler. The try_to_wake_up in your
backtrace would be directed towards a thread which should be frozen at
some point.
--
Stefan Richter
-=====-=-=== =--= =----
http://arcgraph.de/sr/

2007-09-16 19:40:51

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

On Sunday, 16 September 2007 20:41, Stefan Richter wrote:
> (Adding Cc: Rafael, Ingo)
>
> Pavel Machek wrote:
> >> Dmesg with the oops or/and bisection would be good.
> >
> > Sorry, had to hand-copy. It is oops at virtual adddress 6b6b6b7b --
> > looks like slab poison to me?
> >
> > EIP is in task_rq_lock, backtrace is
> > try_to_wake_up
> > highlevel_host_reset
> > ohci_irq_handler
>
> In reverse order, this trace is most certainly
>
> drivers/ieee1394/ohci1394.c::ohci_irq_handler
> drivers/ieee1394/highlevel.c::highlevel_host_reset
> drivers/ieee1394/highlevel.c::nodemgr_highlevel.host_reset
> == nodemgr_host_reset
> kernel/sched.c::wake_up_process(hi->thread);
>
> with hi->thread being the kthread which executes
> drivers/ieee1394/nodemgr.c::nodemgr_host_thread of the ieee1394 core driver.
>
> The ieee1394 core has two (or more) threads: One nodemgr_host_thread
> alias [knodemgrd_*] for each card, and one hpsbpkt_thread alias
> [khpsbpkt] for all cards. The knodemgrd should be frozen during suspend
> or hibernate, while khpsbpkt should not be frozen in order to let
> transactions to go on in the case of saving the hibernation image to a
> FireWire disk.
>
>
> Quoting your other post:
> > On Tue 2007-09-11 21:45:58, Stefan Richter wrote:
> >> Between -rc3 and -rc4:
> >>
> >> ieee1394: sbp2: fix sbp2_remove_device for error cases
> >> a2ee3f9bbb0ce57102dad8928d54f59acdc4b8f7
> >> should not occur in suspend path
> >
> > Plus I do not have firewire attached disk here, so sbp2 should not be
> > used, right?
>
> Sbp2's host_reset handler would do something if it was loaded even
> without any SBP-2 devices attached,but it wouldn't under any
> circumstances call try_to_wake_up. Actually with no devices attached it
> would simply "iterate" over an empty list.
>
> >> Between -rc1 and -rc2:
> >>
> >> ieee1394: sbp2: more correct Kconfig dependencies
> >> e4f8cac5e07528f7e0bc21d3682c16c9de993ecb
> >> unrelated
> >>
> >> ieee1394: revert "sbp2: enforce 32bit DMA mapping"
> >> a9c2f18800753c82c45fc13b27bdc148849bdbb2
> >> unrelated
> >>
> >> (not via linux1394-2.6.git)
> >> raw1394 __user annotation
> >> 5b26e64ea39e45802c5736c8261bf8a8704d212f
> >> unrelated
> >>
> >> So it must be something older which was somehow uncovered.
> >> Dmesg with the oops or/and bisection would be good.
> >
> > The others look even more innocent. I wonder if this may have been
> > responsible?
> >
> > commit 831441862956fffa17b9801db37e6ea1650b0f69
> > tree b0334921341f8f1734bdd3243de76d676329d21c
> > parent 787d2214c19bcc9b6ac48af0ce098277a801eded
> > author Rafael J. Wysocki <[email protected]> Tue, 17 Jul 2007 04:03:35 -0700
> > committer Linus Torvalds <[email protected]> Tue, 17
> > Jul 2007 10:23:02 -0700
> >
> > Freezer: make kernel threads nonfreezable by default

Well, I don't think so.

nodemgr_host_thread() calls set_freezable() as it should.

Greetings,
Rafael

2007-09-16 19:58:41

by Stefan Richter

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

Rafael J. Wysocki wrote:
>> Pavel Machek wrote:
>>> commit 831441862956fffa17b9801db37e6ea1650b0f69
>>> tree b0334921341f8f1734bdd3243de76d676329d21c
>>> parent 787d2214c19bcc9b6ac48af0ce098277a801eded
>>> author Rafael J. Wysocki <[email protected]> Tue, 17 Jul 2007 04:03:35 -0700
>>> committer Linus Torvalds <[email protected]> Tue, 17
>>> Jul 2007 10:23:02 -0700
>>>
>>> Freezer: make kernel threads nonfreezable by default
>
> Well, I don't think so.
>
> nodemgr_host_thread() calls set_freezable() as it should.

This commit is certainly OK, as it should merely preserve status quo.
Also note that Pavel wrote in his initial post that the problem became
apparent way after -rc1. Full quote:

| I noticed empty suspend stopped working around 2.6.23-rc4, and it is
| still present in 2.6.23-rc6. To reproduce
|
| swapoff -a
| echo disk > /sys/power/state
| echo disk > /sys/power/state
|
| Unsetting
|
| CONFIG_IEEE1394=y
|
| solves the problem.

--
Stefan Richter
-=====-=-=== =--= =----
http://arcgraph.de/sr/

2007-09-16 20:03:52

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

On Sunday, 16 September 2007 21:58, Stefan Richter wrote:
> Rafael J. Wysocki wrote:
> >> Pavel Machek wrote:
> >>> commit 831441862956fffa17b9801db37e6ea1650b0f69
> >>> tree b0334921341f8f1734bdd3243de76d676329d21c
> >>> parent 787d2214c19bcc9b6ac48af0ce098277a801eded
> >>> author Rafael J. Wysocki <[email protected]> Tue, 17 Jul 2007 04:03:35 -0700
> >>> committer Linus Torvalds <[email protected]> Tue, 17
> >>> Jul 2007 10:23:02 -0700
> >>>
> >>> Freezer: make kernel threads nonfreezable by default
> >
> > Well, I don't think so.
> >
> > nodemgr_host_thread() calls set_freezable() as it should.
>
> This commit is certainly OK, as it should merely preserve status quo.
> Also note that Pavel wrote in his initial post that the problem became
> apparent way after -rc1. Full quote:
>
> | I noticed empty suspend stopped working around 2.6.23-rc4, and it is
> | still present in 2.6.23-rc6. To reproduce
> |
> | swapoff -a
> | echo disk > /sys/power/state
> | echo disk > /sys/power/state
> |
> | Unsetting
> |
> | CONFIG_IEEE1394=y
> |
> | solves the problem.

Yes.

I thought that there might be a later change that exposed a bug in it.

Hm. Pavel, can you do

# echo test > /sys/power/disk
# echo disk > /sys/power/state

and see what happens?

Greetings,
Rafael

2007-10-05 07:09:05

by Pavel Machek

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

Hi!

> > This commit is certainly OK, as it should merely preserve status quo.
> > Also note that Pavel wrote in his initial post that the problem became
> > apparent way after -rc1. Full quote:
> >
> > | I noticed empty suspend stopped working around 2.6.23-rc4, and it is
> > | still present in 2.6.23-rc6. To reproduce
> > |
> > | swapoff -a
> > | echo disk > /sys/power/state
> > | echo disk > /sys/power/state
> > |
> > | Unsetting
> > |
> > | CONFIG_IEEE1394=y
> > |
> > | solves the problem.
>
> Yes.
>
> I thought that there might be a later change that exposed a bug in it.
>
> Hm. Pavel, can you do
>
> # echo test > /sys/power/disk
> # echo disk > /sys/power/state
>
> and see what happens?

root@amd:~# echo test > /sys/power/disk
root@amd:~# echo disk > /sys/power/state
root@amd:~# echo disk > /sys/power/state

Produces nothing interesting... I also did few hibernation/resume
cycles, and everything seems to work ok. But when I do swapoff then
try to hibernate, it fails on second try. Weird.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-10-05 21:13:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: regression: fireware causes oops during system

On Friday, 5 October 2007 09:08, Pavel Machek wrote:
> Hi!
>
> > > This commit is certainly OK, as it should merely preserve status quo.
> > > Also note that Pavel wrote in his initial post that the problem became
> > > apparent way after -rc1. Full quote:
> > >
> > > | I noticed empty suspend stopped working around 2.6.23-rc4, and it is
> > > | still present in 2.6.23-rc6. To reproduce
> > > |
> > > | swapoff -a
> > > | echo disk > /sys/power/state
> > > | echo disk > /sys/power/state
> > > |
> > > | Unsetting
> > > |
> > > | CONFIG_IEEE1394=y
> > > |
> > > | solves the problem.
> >
> > Yes.
> >
> > I thought that there might be a later change that exposed a bug in it.
> >
> > Hm. Pavel, can you do
> >
> > # echo test > /sys/power/disk
> > # echo disk > /sys/power/state
> >
> > and see what happens?
>
> root@amd:~# echo test > /sys/power/disk
> root@amd:~# echo disk > /sys/power/state
> root@amd:~# echo disk > /sys/power/state
>
> Produces nothing interesting... I also did few hibernation/resume
> cycles, and everything seems to work ok. But when I do swapoff then
> try to hibernate, it fails on second try. Weird.

Weird indeed, but it means that there's some history that causes things to
break on the second attempt.

To summarize, after running swapoff the suspending of devices during the second
attempt to hibernate fails (100% of the time) unless CONFIG_IEEE1394 is unset?

What happens for CONFIG_IEEE1394=m?

Greetings,
Rafael