2009-09-24 18:23:01

by Alexander Huemer

Subject: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

the problem appears under heavy system load and slows down the system to
unusable speed.
kernels before .30 were not affected.
irqpoll does not change behavior.

error message from .31:

[157152.418524] irq 23: nobody cared (try booting with the "irqpoll" option)
[157152.418530] Pid: 1359, comm: cc1plus Tainted: G W 2.6.31-gentoo-blackbit #2
[157152.418532] Call Trace:
[157152.418534] <IRQ> [<ffffffff81066e3f>] ? __report_bad_irq+0x30/0x7d
[157152.418544] [<ffffffff81066f93>] ? note_interrupt+0x107/0x170
[157152.418547] [<ffffffff81067580>] ? handle_fasteoi_irq+0x8a/0xaa
[157152.418551] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
[157152.418554] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
[157152.418558] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
[157152.418559] <EOI>
[157152.418560] handlers:
[157152.418562] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
[157152.418566] Disabling IRQ #23


bios of the machine is up to date,
i tried all related bios settings, no change.

kernel config for .31 http://xx.vu/~ahuemer/config_ahuemer_20090923.gz
lspci -vxxx http://xx.vu/~ahuemer/lspci_ahuemer_20090923
lsusb -v http://xx.vu/~ahuemer/lsusb_ahuemer_20090923
/proc/interrupts
http://xx.vu/~ahuemer/proc_interrupts_ahuemer_20090923
thread in gentoo forums
http://forums.gentoo.org/viewtopic-t-780725-start-0.html

please tell me what additional info is needed.
please CC me on replies, i am not subscribed.

-alex
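
The report above has a fixed shape: an "irq N: nobody cared" line, a call trace, and a "handlers:" list naming the functions registered on that interrupt. As a minimal sketch (not part of the original mail), the following pulls the timestamp, the IRQ number and the handler names out of such a log. The function name `parse_nobody_cared` and the regular expressions are assumptions made for illustration; the sample text is copied from the report with the mail line wrapping removed.

```python
import re

# Sample copied from the report above (wrapped lines joined, trace omitted).
REPORT = """\
[157152.418524] irq 23: nobody cared (try booting with the "irqpoll" option)
[157152.418560] handlers:
[157152.418562] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
[157152.418566] Disabling IRQ #23
"""

# "[ 1234.567890] irq 23: nobody cared ..."
IRQ_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+irq (\d+): nobody cared")
# "(ahci_interrupt+0x0/0x426)" -> symbol name before the offset
HANDLER_RE = re.compile(r"\(([A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+\)")


def parse_nobody_cared(log_text):
    """Return a list of (timestamp, irq, [handler names]) tuples."""
    events = []
    current = None
    in_handlers = False
    for line in log_text.splitlines():
        m = IRQ_RE.search(line)
        if m:
            # A new report starts here.
            current = (float(m.group(1)), int(m.group(2)), [])
            events.append(current)
            in_handlers = False
            continue
        if current is None:
            continue
        if "handlers:" in line:
            in_handlers = True
            continue
        if in_handlers:
            h = HANDLER_RE.search(line)
            if h:
                current[2].append(h.group(1))
            else:
                # First non-handler line ends the handler list.
                in_handlers = False
    return events


for ts, irq, handlers in parse_nobody_cared(REPORT):
    print(ts, irq, handlers)
```

Run over a full kernel log, this gives one tuple per occurrence, which makes it easier to compare when the message appears across several test runs.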


2009-09-24 18:32:37

by David Daney

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Alexander Huemer wrote:
> the problem appears under heavy system load and slows down the system to
> unusable speed.
> kernels before .30 were not affected.
> irqpoll does not change behavior.
>
> error message from .31:
>
> [157152.418524] irq 23: nobody cared (try booting with the "irqpoll"
> option)
> [157152.418530] Pid: 1359, comm: cc1plus Tainted: G W

Right here is the problem -> ^^^^^^^^

Haven't you read all the threads about the evil of C++? This is just
one more example of why we shouldn't be using it. :-)

David Daney

> 2.6.31-gentoo-blackbit #2
> [157152.418532] Call Trace:
> [157152.418534] <IRQ> [<ffffffff81066e3f>] ?
> __report_bad_irq+0x30/0x7d
> [157152.418544] [<ffffffff81066f93>] ? note_interrupt+0x107/0x170
> [157152.418547] [<ffffffff81067580>] ? handle_fasteoi_irq+0x8a/0xaa
> [157152.418551] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
> [157152.418554] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
> [157152.418558] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
> [157152.418559] <EOI>
> [157152.418560] handlers:
> [157152.418562] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
> [157152.418566] Disabling IRQ #23
>
>
> bios of the machine is up to date,
> i tried all related bios settings, no change.
>
> kernel config for .31 http://xx.vu/~ahuemer/config_ahuemer_20090923.gz
> lspci -vxxx http://xx.vu/~ahuemer/lspci_ahuemer_20090923
> lsusb -v http://xx.vu/~ahuemer/lsusb_ahuemer_20090923
> /proc/interrupts
> http://xx.vu/~ahuemer/proc_interrupts_ahuemer_20090923
> thread in gentoo forums
> http://forums.gentoo.org/viewtopic-t-780725-start-0.html
>
> please tell me what additional info is needed.
> please CC me on replies, i am not subscribed.
>
> -alex

2009-09-24 19:15:30

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

David Daney wrote:
> Alexander Huemer wrote:
>> the problem appears under heavy system load and slows down the system to
>> unusable speed.
>> kernels before .30 were not affected.
>> irqpoll does not change behavior.
>>
>> error message from .31:
>>
>> [157152.418524] irq 23: nobody cared (try booting with the "irqpoll"
>> option)
>> [157152.418530] Pid: 1359, comm: cc1plus Tainted: G W
>
> Right here is the problem -> ^^^^^^^^
>
> Haven't you read all the threads about the evil of C++? This is just
> one more example of why we shouldn't be using it. :-)
>
> David Daney
>
>> 2.6.31-gentoo-blackbit #2
>> [157152.418532] Call Trace:
>> [157152.418534] <IRQ> [<ffffffff81066e3f>] ?
>> __report_bad_irq+0x30/0x7d
>> [157152.418544] [<ffffffff81066f93>] ? note_interrupt+0x107/0x170
>> [157152.418547] [<ffffffff81067580>] ? handle_fasteoi_irq+0x8a/0xaa
>> [157152.418551] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
>> [157152.418554] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
>> [157152.418558] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
>> [157152.418559] <EOI>
>> [157152.418560] handlers:
>> [157152.418562] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
>> [157152.418566] Disabling IRQ #23
>>
>>
>> bios of the machine is up to date,
>> i tried all related bios settings, no change.
>>
>> kernel config for .31
>> http://xx.vu/~ahuemer/config_ahuemer_20090923.gz
>> lspci -vxxx http://xx.vu/~ahuemer/lspci_ahuemer_20090923
>> lsusb -v http://xx.vu/~ahuemer/lsusb_ahuemer_20090923
>> /proc/interrupts
>> http://xx.vu/~ahuemer/proc_interrupts_ahuemer_20090923
>> thread in gentoo forums
>> http://forums.gentoo.org/viewtopic-t-780725-start-0.html
>>
>> please tell me what additional info is needed.
>> please CC me on replies, i am not subscribed.
>>
>> -alex
>
thanks for your quick answer, david.
so, this isn't a kernel issue at all ?
imho a user process shouldn't be able to cause such a situation.
what can i do against that phenomenon ?

-alex

2009-09-24 19:24:43

by Frans Pop

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Adding linux-ide to CC.

Alexander Huemer wrote:
> the problem appears under heavy system load and slows down the system to
> unusable speed.
> kernels before .30 were not affected.
> irqpoll does not change behavior.
>
> error message from .31:
> [157152.418524] irq 23: nobody cared (try booting with the "irqpoll" option)
> [157152.418530] Pid: 1359, comm: cc1plus Tainted: G W 2.6.31-gentoo-blackbit #2
> [157152.418532] Call Trace:
> [157152.418534] <IRQ> [<ffffffff81066e3f>] ? __report_bad_irq+0x30/0x7d
> [157152.418544] [<ffffffff81066f93>] ? note_interrupt+0x107/0x170
> [157152.418547] [<ffffffff81067580>] ? handle_fasteoi_irq+0x8a/0xaa
> [157152.418551] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
> [157152.418554] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
> [157152.418558] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
> [157152.418559] <EOI>
> [157152.418560] handlers:
> [157152.418562] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
> [157152.418566] Disabling IRQ #23
>
> bios of the machine is up to date,
> i tried all related bios settings, no change.
>
> kernel config for .31 http://xx.vu/~ahuemer/config_ahuemer_20090923.gz
> lspci -vxxx http://xx.vu/~ahuemer/lspci_ahuemer_20090923
> lsusb -v http://xx.vu/~ahuemer/lsusb_ahuemer_20090923
> /proc/interrupts http://xx.vu/~ahuemer/proc_interrupts_ahuemer_20090923
> thread in gentoo forums http://forums.gentoo.org/viewtopic-t-780725-start-0.html
>
> please tell me what additional info is needed.

A full dmesg (or kernel log) starting from a clean boot up to the error
could be useful.

If no others reply and the issue can be reproduced reliably, running a
git bisect between v2.6.29 and v2.6.30 to trace the cause of the regression
could be an option.

Cheers,
FJP
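
The bisection suggested above is a binary search over the commit history between a known-good and a known-bad version. The sketch below only models that search logic on a made-up history; it does not drive git. The function `first_bad`, the 1000 numbered commits and the position of the regression are all hypothetical.

```python
def first_bad(commits, is_bad):
    """Return (first bad commit, number of tests run).

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1   # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                 # one build-and-test cycle per step
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests


# Hypothetical history: 1000 numbered commits, regression introduced at 617.
history = list(range(1000))
culprit, tests = first_bad(history, lambda c: c >= 617)
print(culprit, tests)
```

The number of test cycles grows only with the logarithm of the history length, which is why bisecting a whole release cycle is feasible. The search does rely on every verdict being correct: for a bug that shows up only intermittently, a single wrong "good" verdict sends the search into the wrong half.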

2009-09-24 19:31:18

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Frans Pop wrote:
> Adding linux-ide to CC.
>
> Alexander Huemer wrote:
>> the problem appears under heavy system load and slows down the system to
>> unusable speed.
>> kernels before .30 were not affected.
>> irqpoll does not change behavior.
>>
>> error message from .31:
>> [157152.418524] irq 23: nobody cared (try booting with the "irqpoll" option)
>> [157152.418530] Pid: 1359, comm: cc1plus Tainted: G W 2.6.31-gentoo-blackbit #2
>> [157152.418532] Call Trace:
>> [157152.418534] <IRQ> [<ffffffff81066e3f>] ? __report_bad_irq+0x30/0x7d
>> [157152.418544] [<ffffffff81066f93>] ? note_interrupt+0x107/0x170
>> [157152.418547] [<ffffffff81067580>] ? handle_fasteoi_irq+0x8a/0xaa
>> [157152.418551] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
>> [157152.418554] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
>> [157152.418558] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
>> [157152.418559] <EOI>
>> [157152.418560] handlers:
>> [157152.418562] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
>> [157152.418566] Disabling IRQ #23
>>
>> bios of the machine is up to date,
>> i tried all related bios settings, no change.
>>
>> kernel config for .31 http://xx.vu/~ahuemer/config_ahuemer_20090923.gz
>> lspci -vxxx http://xx.vu/~ahuemer/lspci_ahuemer_20090923
>> lsusb -v http://xx.vu/~ahuemer/lsusb_ahuemer_20090923
>> /proc/interrupts http://xx.vu/~ahuemer/proc_interrupts_ahuemer_20090923
>> thread in gentoo forums http://forums.gentoo.org/viewtopic-t-780725-start-0.html
>>
>> please tell me what additional info is needed.
>
> A full dmesg (or kernel log) starting from a clean boot up to the error
> could be useful.
>
> If no others reply and the issue can be reproduced reliably, running a
> git bisect between v2.6.29 and v2.6.30 to trace the cause of the regression
> could be an option.
>
> Cheers,
> FJP
http://xx.vu/~ahuemer/dmesg_ahuemer_20090923
i rebooted and will try to reproduce the error.
the last time the problem appeared during compilation of gcc-4.3.4.

regards
-alex

2009-09-24 19:39:33

by David Daney

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Alexander Huemer wrote:
> David Daney wrote:
>> Alexander Huemer wrote:
>>> the problem appears under heavy system load and slows down the system to
>>> unusable speed.
>>> kernels before .30 were not affected.
>>> irqpoll does not change behavior.
>>>
>>> error message from .31:
>>>
>>> [157152.418524] irq 23: nobody cared (try booting with the "irqpoll"
>>> option)
>>> [157152.418530] Pid: 1359, comm: cc1plus Tainted: G W
>> Right here is the problem -> ^^^^^^^^
>>
>> Haven't you read all the threads about the evil of C++? This is just
>> one more example of why we shouldn't be using it. :-)
>>
>> David Daney
>>
[...]
> thanks for your quick answer, david.
> so, this isn't a kernel issue at all ?
> imho a user process shouldn't be able to cause such a situation.
> what can i do against that phenomenon ?
>

Just for avoidance of doubt, it was an attempt at a joke.

You are of course correct. It looks like a real bug. User-space
shouldn't matter.

David Daney

2009-09-24 19:40:09

by Frans Pop

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

On Thursday 24 September 2009, Frans Pop wrote:
> > error message from .31:
> > [157152.418524] irq 23: nobody cared
>
> If no others reply and the issue can be reproduced reliably, running a
> git bisect between v2.6.29 and v2.6.30 to trace the cause of the
> regression could be an option.

Looking at the changes in drivers/ata/ahci.c, it might be worth to try if
reverting the following commit fixes the issue:

commit a5bfc4714b3f01365aef89a92673f2ceb1ccf246
Author: Tejun Heo <[email protected]>
Date: Fri Jan 23 11:31:39 2009 +0900

ahci: drop intx manipulation on msi enable

It's a bit of a wild guess though.

2009-09-24 19:43:30

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Frans Pop wrote:
> On Thursday 24 September 2009, Frans Pop wrote:
>
>>> error message from .31:
>>> [157152.418524] irq 23: nobody cared
>>>
>> If no others reply and the issue can be reproduced reliably, running a
>> git bisect between v2.6.29 and v2.6.30 to trace the cause of the
>> regression could be an option.
>>
>
> Looking at the changes in drivers/ata/ahci.c, it might be worth to try if
> reverting the following commit fixes the issue:
>
> commit a5bfc4714b3f01365aef89a92673f2ceb1ccf246
> Author: Tejun Heo <[email protected]>
> Date: Fri Jan 23 11:31:39 2009 +0900
>
> ahci: drop intx manipulation on msi enable
>
> It's a bit of a wild guess though.
>
thanks for the hint.
i'll wait for the end of the compilation of gcc-4.3.4. that will take ~ 45m.
afterwards i'll check out the kernel sources from git and try the revert.
many thanks till then.

-alex

2009-09-25 00:03:00

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Frans Pop wrote:
> On Thursday 24 September 2009, Frans Pop wrote:
>>> error message from .31:
>>> [157152.418524] irq 23: nobody cared
>> If no others reply and the issue can be reproduced reliably, running a
>> git bisect between v2.6.29 and v2.6.30 to trace the cause of the
>> regression could be an option.
>
> Looking at the changes in drivers/ata/ahci.c, it might be worth to try if
> reverting the following commit fixes the issue:
>
> commit a5bfc4714b3f01365aef89a92673f2ceb1ccf246
> Author: Tejun Heo <[email protected]>
> Date: Fri Jan 23 11:31:39 2009 +0900
>
> ahci: drop intx manipulation on msi enable
>
> It's a bit of a wild guess though.
i reproduced the issue.

[ 3486.747729] Pid: 9573, comm: jc1 Tainted: G W 2.6.31-gentoo-blackbit #2
[ 3486.747731] Call Trace:
[ 3486.747733] <IRQ> [<ffffffff81066e3f>] ? __report_bad_irq+0x30/0x7d
[ 3486.747743] [<ffffffff81066f93>] ? note_interrupt+0x107/0x170
[ 3486.747746] [<ffffffff81067580>] ? handle_fasteoi_irq+0x8a/0xaa
[ 3486.747750] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
[ 3486.747752] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
[ 3486.747756] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
[ 3486.747758] <EOI>
[ 3486.747759] handlers:
[ 3486.747761] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
[ 3486.747765] Disabling IRQ #23

i will report back after a compile run of gcc-4.3.4 with a kernel
without the commit you suggested.

-alex

2009-09-25 11:28:35

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Alexander Huemer wrote:
> Frans Pop wrote:
>> On Thursday 24 September 2009, Frans Pop wrote:
>>>> error message from .31:
>>>> [157152.418524] irq 23: nobody cared
>>> If no others reply and the issue can be reproduced reliably, running a
>>> git bisect between v2.6.29 and v2.6.30 to trace the cause of the
>>> regression could be an option.
>> Looking at the changes in drivers/ata/ahci.c, it might be worth to try if
>> reverting the following commit fixes the issue:
>>
>> commit a5bfc4714b3f01365aef89a92673f2ceb1ccf246
>> Author: Tejun Heo <[email protected]>
>> Date: Fri Jan 23 11:31:39 2009 +0900
>>
>> ahci: drop intx manipulation on msi enable
>>
>> It's a bit of a wild guess though.
> i reproduced the issue.
>
> [ 3486.747729] Pid: 9573, comm: jc1 Tainted: G W
> 2.6.31-gentoo-blackbit #2
> [ 3486.747731] Call Trace:
> [ 3486.747733] <IRQ> [<ffffffff81066e3f>] ? __report_bad_irq+0x30/0x7d
> [ 3486.747743] [<ffffffff81066f93>] ? note_interrupt+0x107/0x170
> [ 3486.747746] [<ffffffff81067580>] ? handle_fasteoi_irq+0x8a/0xaa
> [ 3486.747750] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
> [ 3486.747752] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
> [ 3486.747756] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
> [ 3486.747758] <EOI>
> [ 3486.747759] handlers:
> [ 3486.747761] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
> [ 3486.747765] Disabling IRQ #23
>
> i will report back after a compile run of gcc-4.3.4 with a kernel
> without the commit you suggested.
>
> -alex
4 compilation runs of gcc-4.3.4 finished without the issue re-appearing.
it seems like you guessed right, Frans.
i also found this:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=31b239ad1ba7225435e13f5afc47e48eb674c0cc
i'll report on bugzilla.

thanks for the help.
-alex

2009-09-25 12:24:24

by Frans Pop

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

On Friday 25 September 2009, Alexander Huemer wrote:
> Alexander Huemer wrote:
> > Frans Pop wrote:
> >> On Thursday 24 September 2009, Frans Pop wrote:
> >>>> error message from .31:
> >>>> [157152.418524] irq 23: nobody cared
> >>
> >> Looking at the changes in drivers/ata/ahci.c, it might be worth to
> >> try if reverting the following commit fixes the issue:
> >>
> >> commit a5bfc4714b3f01365aef89a92673f2ceb1ccf246
> >> Author: Tejun Heo <[email protected]>
> >> Date: Fri Jan 23 11:31:39 2009 +0900
> >>
> >> ahci: drop intx manipulation on msi enable
> >
> > i reproduced the issue.
> >
> > [ 3486.747729] Pid: 9573, comm: jc1 Tainted: G W 2.6.31-gentoo-blackbit #2
> > [ 3486.747731] Call Trace:
> > [ 3486.747733] <IRQ> [<ffffffff81066e3f>] ? __report_bad_irq+0x30/0x7d
> > [ 3486.747743] [<ffffffff81066f93>] ? note_interrupt+0x107/0x170
> > [ 3486.747746] [<ffffffff81067580>] ? handle_fasteoi_irq+0x8a/0xaa
> > [ 3486.747750] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
> > [ 3486.747752] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
> > [ 3486.747756] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
> > [ 3486.747758] <EOI>
> > [ 3486.747759] handlers:
> > [ 3486.747761] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
> > [ 3486.747765] Disabling IRQ #23
> >
> > i will report back after a compile run of gcc-4.3.4 with a kernel
> > without the commit you suggested.
>
> 4 compilation runs of gcc-4.3.4 finished without the issue re-appearing.
> it seems like you guessed right, Frans.

Great. Glad to hear it worked out.

> i also found this:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=31b239ad1ba7225435e13f5afc47e48eb674c0cc
> i'll report on bugzilla.

So with the revert already in mainline for .32, the only thing left is for
that to get included in stable updates for .30 and .31.

Cheers,
FJP

2009-09-25 12:27:34

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Frans Pop wrote:
> On Friday 25 September 2009, Alexander Huemer wrote:
>
>> Alexander Huemer wrote:
>>
>>> Frans Pop wrote:
>>>
>>>> On Thursday 24 September 2009, Frans Pop wrote:
>>>>
>>>>>> error message from .31:
>>>>>> [157152.418524] irq 23: nobody cared
>>>>>>
>>>> Looking at the changes in drivers/ata/ahci.c, it might be worth to
>>>> try if reverting the following commit fixes the issue:
>>>>
>>>> commit a5bfc4714b3f01365aef89a92673f2ceb1ccf246
>>>> Author: Tejun Heo <[email protected]>
>>>> Date: Fri Jan 23 11:31:39 2009 +0900
>>>>
>>>> ahci: drop intx manipulation on msi enable
>>>>
>>> i reproduced the issue.
>>>
>>> [ 3486.747729] Pid: 9573, comm: jc1 Tainted: G W 2.6.31-gentoo-blackbit #2
>>> [ 3486.747731] Call Trace:
>>> [ 3486.747733] <IRQ> [<ffffffff81066e3f>] ? __report_bad_irq+0x30/0x7d
>>> [ 3486.747743] [<ffffffff81066f93>] ? note_interrupt+0x107/0x170
>>> [ 3486.747746] [<ffffffff81067580>] ? handle_fasteoi_irq+0x8a/0xaa
>>> [ 3486.747750] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
>>> [ 3486.747752] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
>>> [ 3486.747756] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
>>> [ 3486.747758] <EOI>
>>> [ 3486.747759] handlers:
>>> [ 3486.747761] [<ffffffff813d2a6f>] (ahci_interrupt+0x0/0x426)
>>> [ 3486.747765] Disabling IRQ #23
>>>
>>> i will report back after a compile run of gcc-4.3.4 with a kernel
>>> without the commit you suggested.
>>>
>> 4 compilation runs of gcc-4.3.4 finished without the issue re-appearing.
>> it seems like you guessed right, Frans.
>>
>
> Great. Glad to hear it worked out.
>
>
>> i also found this:
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=31b239ad1ba7225435e13f5afc47e48eb674c0cc
>> i'll report on bugzilla.
>>
>
> So with the revert already in mainline for .32, the only thing left is for
> that to get included in stable updates for .30 and .31.
>
> Cheers,
> FJP
>
please see the last comment in [1].
can i do anything else to help ?

thanks again
-alex

[1] http://bugzilla.kernel.org/show_bug.cgi?id=14124

2009-09-25 12:48:18

by Frans Pop

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

On Friday 25 September 2009, Alexander Huemer wrote:
> > So with the revert already in mainline for .32, the only thing left is
> > for that to get included in stable updates for .30 and .31.
>
> please see the last comment in [1].
> can i do anything else to help ?

> [1] http://bugzilla.kernel.org/show_bug.cgi?id=14124

Yes, adding that comment was excellent. I also added the relevant people
in the CC of my previous mail, so it should get taken care of now. Unless
they have additional questions no further action from you should be
needed.

2009-10-08 12:01:17

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Frans Pop wrote:
> On Friday 25 September 2009, Alexander Huemer wrote:
>>> So with the revert already in mainline for .32, the only thing left is
>>> for that to get included in stable updates for .30 and .31.
>> please see the last comment in [1].
>> can i do anything else to help ?
>
>> [1] http://bugzilla.kernel.org/show_bug.cgi?id=14124
>
> Yes, adding that comment was excellent. I also added the relevant people
> in the CC of my previous mail, so it should get taken care of now. Unless
> they have additional questions no further action from you should be
> needed.
it seems like the problem is _not_ solved.
i just booted with 2.6.31.3.
2.6.31-gentoo-r2 is vanilla-2.6.31-r2 with a few unrelated patches.
did the usual verification (compilation of gcc-4.3.4),
and got this again:

[ 1018.059729] irq 23: nobody cared (try booting with the "irqpoll" option)
[ 1018.059734] Pid: 8656, comm: sh Tainted: G W 2.6.31-gentoo-r2-blackbit #1
[ 1018.059736] Call Trace:
[ 1018.059738] <IRQ> [<ffffffff81066ecf>] ? __report_bad_irq+0x30/0x7d
[ 1018.059748] [<ffffffff81067023>] ? note_interrupt+0x107/0x170
[ 1018.059751] [<ffffffff81067610>] ? handle_fasteoi_irq+0x8a/0xaa
[ 1018.059755] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
[ 1018.059757] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
[ 1018.059761] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
[ 1018.059762] <EOI> [<ffffffff815c7d2c>] ? do_page_fault+0xed/0x2ef
[ 1018.059769] [<ffffffff815c7f12>] ? do_page_fault+0x2d3/0x2ef
[ 1018.059773] [<ffffffff812dd5ed>] ? __put_user_4+0x1d/0x30
[ 1018.059776] [<ffffffff815c5fdf>] ? page_fault+0x1f/0x30
[ 1018.059777] handlers:
[ 1018.059778] [<ffffffff813d2d8c>] (ahci_interrupt+0x0/0x426)
[ 1018.059783] Disabling IRQ #23

so in my opinion reverting commit [1] with commit [2] missed the point.
please comment.

-alex

[1] a5bfc4714b3f01365aef89a92673f2ceb1ccf246
[2] 31b239ad1ba7225435e13f5afc47e48eb674c0cc

2009-10-09 21:31:11

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Alexander Huemer wrote:
> Frans Pop wrote:
>> On Friday 25 September 2009, Alexander Huemer wrote:
>>>> So with the revert already in mainline for .32, the only thing left is
>>>> for that to get included in stable updates for .30 and .31.
>>> please see the last comment in [1].
>>> can i do anything else to help ?
>>> [1] http://bugzilla.kernel.org/show_bug.cgi?id=14124
>> Yes, adding that comment was excellent. I also added the relevant people
>> in the CC of my previous mail, so it should get taken care of now. Unless
>> they have additional questions no further action from you should be
>> needed.
> it seems like the problem is _not_ solved.
> i just booted with 2.6.31.3.
> 2.6.31-gentoo-r2 is vanilla-2.6.31-r2 with a few unrelated patches.
> did the usual verification (compilation of gcc-4.3.4),
> and got this again:
>
> [ 1018.059729] irq 23: nobody cared (try booting with the "irqpoll"
> option)
> [ 1018.059734] Pid: 8656, comm: sh Tainted: G W
> 2.6.31-gentoo-r2-blackbit #1
> [ 1018.059736] Call Trace:
> [ 1018.059738] <IRQ> [<ffffffff81066ecf>] ? __report_bad_irq+0x30/0x7d
> [ 1018.059748] [<ffffffff81067023>] ? note_interrupt+0x107/0x170
> [ 1018.059751] [<ffffffff81067610>] ? handle_fasteoi_irq+0x8a/0xaa
> [ 1018.059755] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
> [ 1018.059757] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
> [ 1018.059761] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
> [ 1018.059762] <EOI> [<ffffffff815c7d2c>] ? do_page_fault+0xed/0x2ef
> [ 1018.059769] [<ffffffff815c7f12>] ? do_page_fault+0x2d3/0x2ef
> [ 1018.059773] [<ffffffff812dd5ed>] ? __put_user_4+0x1d/0x30
> [ 1018.059776] [<ffffffff815c5fdf>] ? page_fault+0x1f/0x30
> [ 1018.059777] handlers:
> [ 1018.059778] [<ffffffff813d2d8c>] (ahci_interrupt+0x0/0x426)
> [ 1018.059783] Disabling IRQ #23
>
> so in my opinion reverting commit [1] with commit [2] missed the point.
> please comment.
>
> -alex
>
> [1] a5bfc4714b3f01365aef89a92673f2ceb1ccf246
> [2] 31b239ad1ba7225435e13f5afc47e48eb674c0cc
>
i hope i do not annoy anybody by posting again, but i am afraid my last
message was not noticed by anybody.
is there something i should know but don't? it seems the problem
still exists.
i would be happy to test whatever is needed to trace the problem.

please respond.
regards
-alex

2009-10-10 13:14:34

by Frans Pop

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

(dropped stable from CC)

On Thursday 08 October 2009, you wrote:
> Frans Pop wrote:
> > On Friday 25 September 2009, Alexander Huemer wrote:
> >>> So with the revert already in mainline for .32, the only thing left
> >>> is for that to get included in stable updates for .30 and .31.
> >>
> >> please see the last comment in [1].
> >> can i do anything else to help ?
> >>
> >> [1] http://bugzilla.kernel.org/show_bug.cgi?id=14124
>
> it seems like the problem is _not_ solved.
> i just booted with 2.6.31.3.
> 2.6.31-gentoo-r2 is vanilla-2.6.31-r2 with a few unrelated patches.

I don't know what vanilla-2.6.31-r2 is, but I assume it's based on either
2.6.31.3 or 2.6.31.2.

> did the usual verification (compilation of gcc-4.3.4),

> so in my opinion reverting commit [1] with commit [2] missed the point.
>
> [1] a5bfc4714b3f01365aef89a92673f2ceb1ccf246
> [2] 31b239ad1ba7225435e13f5afc47e48eb674c0cc

The most likely explanation is that your earlier test from which you
concluded that the revert did fix the problem was incorrect. It seems
unlikely that some other stable commit interferes here.

So basically we're back where we started.

> [ 1018.059729] irq 23: nobody cared (try booting with the "irqpoll" option)
> [ 1018.059734] Pid: 8656, comm: sh Tainted: G W 2.6.31-gentoo-r2-blackbit #1
> [ 1018.059736] Call Trace:
> [ 1018.059738] <IRQ> [<ffffffff81066ecf>] ? __report_bad_irq+0x30/0x7d
> [ 1018.059748] [<ffffffff81067023>] ? note_interrupt+0x107/0x170
> [ 1018.059751] [<ffffffff81067610>] ? handle_fasteoi_irq+0x8a/0xaa
> [ 1018.059755] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
> [ 1018.059757] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
> [ 1018.059761] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
> [ 1018.059762] <EOI> [<ffffffff815c7d2c>] ? do_page_fault+0xed/0x2ef
> [ 1018.059769] [<ffffffff815c7f12>] ? do_page_fault+0x2d3/0x2ef
> [ 1018.059773] [<ffffffff812dd5ed>] ? __put_user_4+0x1d/0x30
> [ 1018.059776] [<ffffffff815c5fdf>] ? page_fault+0x1f/0x30
> [ 1018.059777] handlers:
> [ 1018.059778] [<ffffffff813d2d8c>] (ahci_interrupt+0x0/0x426)
> [ 1018.059783] Disabling IRQ #23

How reproducible is the error for you? Do you see it every time or not?
If it is reliably reproducible, can you think of any explanation why your
earlier test was a success while we now see that the revert does not help?

Does the error *only* occur during gcc compilation, or was that just the
simplest way to reproduce it? Does it always occur at the same point during
the compilation or does it vary?
Can you create a test case that does not require doing the whole
compilation, but only executes the step that triggers the error?

If you can find a reliable and fairly quick way to reproduce the error, I
would suggest doing a bisection.

Jeff, Tejun: do you have any ideas what could cause this issue to suddenly
appear or how to debug/instrument it?

Cheers,
FJP
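
The reproducibility question above bears directly on the earlier "4 clean runs" result. If the bug triggers in only a fraction of compile runs, a handful of clean runs can happen by chance even with the bug still present. The arithmetic is sketched below under the assumption that runs are independent; the per-run failure rate of 0.3 is purely illustrative and does not come from the thread.

```python
import math


def p_all_clean(p_fail, runs):
    """Probability that `runs` independent runs all pass when each run
    triggers the bug with probability p_fail (0 < p_fail < 1)."""
    return (1.0 - p_fail) ** runs


def runs_needed(p_fail, confidence):
    """Smallest number of clean runs after which a still-present bug
    would have shown up with at least `confidence` probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# Illustrative only: assume the bug hits 30% of compile runs.
print(p_all_clean(0.3, 4))      # chance of 4 clean runs despite the bug
print(runs_needed(0.3, 0.95))   # clean runs needed for 95% confidence
```

With that assumed rate, four clean runs would still occur roughly a quarter of the time with the bug present, so a small number of passes is weak evidence that a revert fixed anything.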

2009-10-11 20:57:53

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Frans Pop wrote:
> I don't know what vanilla-2.6.31-r2 is, but I assume it's based on
> either 2.6.31.3 or 2.6.31.2.

vanilla just means the unpatched kernel from kernel.org.

> The most likely explanation is that your earlier test from which you
> concluded that the revert did fix the problem was incorrect. It seems
> unlikely that some other stable commit interferes here.
>
> So basically we're back where we started.

unfortunately you seem to be right.

> How reproducible is the error for you? Do you see it every time or not?
> If it is reliably reproducible, can you think of any explanation why your
> earlier test was a success while we now see that the revert does not help?

the error is reproducible. i'll try to pin it down to certain kernel
versions in the next days.

> Does the error *only* occur during gcc compilation, or was that just the
> simplest way to reproduce it? Does it always occur at the same point during
> the compilation or does it vary?

it was the simplest way.
i don't know how i could find out if the error actually always
happens exactly the same time.
i'll think about that.

> Can you create a test case that does not require doing the whole
> compilation, but only executes the step that triggers the error?

surely, if i know what happens when the error occurs.

> If you can find a reliable and fairly quick way to reproduce the error, I
> would suggest doing a bisection.

i would be happy to do that.

thanks for now.

2009-10-12 07:50:56

by Tejun Heo

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Hello,

Frans Pop wrote:
>> so in my opinion reverting commit [1] with commit [2] missed the point.
>>
>> [1] a5bfc4714b3f01365aef89a92673f2ceb1ccf246
>> [2] 31b239ad1ba7225435e13f5afc47e48eb674c0cc
>
> The most likely explanation is that your earlier test from which you
> concluded that the revert did fix the problem was incorrect. It seems
> unlikely that some other stable commit interferes here.

Hmm...

> So basically we're back where we started.
>
>> [ 1018.059729] irq 23: nobody cared (try booting with the "irqpoll" option)
>> [ 1018.059734] Pid: 8656, comm: sh Tainted: G W 2.6.31-gentoo-r2-blackbit #1
>> [ 1018.059736] Call Trace:
>> [ 1018.059738] <IRQ> [<ffffffff81066ecf>] ? __report_bad_irq+0x30/0x7d
>> [ 1018.059748] [<ffffffff81067023>] ? note_interrupt+0x107/0x170
>> [ 1018.059751] [<ffffffff81067610>] ? handle_fasteoi_irq+0x8a/0xaa
>> [ 1018.059755] [<ffffffff8100d1cf>] ? handle_irq+0x17/0x1d
>> [ 1018.059757] [<ffffffff8100c84b>] ? do_IRQ+0x54/0xb2
>> [ 1018.059761] [<ffffffff8100b6d3>] ? ret_from_intr+0x0/0xa
>> [ 1018.059762] <EOI> [<ffffffff815c7d2c>] ? do_page_fault+0xed/0x2ef
>> [ 1018.059769] [<ffffffff815c7f12>] ? do_page_fault+0x2d3/0x2ef
>> [ 1018.059773] [<ffffffff812dd5ed>] ? __put_user_4+0x1d/0x30
>> [ 1018.059776] [<ffffffff815c5fdf>] ? page_fault+0x1f/0x30
>> [ 1018.059777] handlers:
>> [ 1018.059778] [<ffffffff813d2d8c>] (ahci_interrupt+0x0/0x426)
>> [ 1018.059783] Disabling IRQ #23
>
> How reproducible is the error for you? Do you see it every time or not?
> If it is reliably reproducible, can you think of any explanation why your
> earlier test was a success while we now see that the revert does not help?
>
> Does the error *only* occur during gcc compilation, or was that just the
> simplest way to reproduce it? Does it always occur at the same point during
> the compilation or does it vary?
> Can you create a test case that does not require doing the whole
> compilation, but only executes the step that triggers the error?
>
> If you can find a reliable and fairly quick way to reproduce the error, I
> would suggest doing a bisection.
>
> Jeff, Tejun: do you have any ideas what could cause this issue to suddenly
> appear or how to debug/instrument it?

Alexander, can you please attach full boot log and the output of
"lspci -nn"? Also, how reproducible is the problem? You already
answered to Frans' question but can you be more specific?

Thanks.

--
tejun

2009-10-12 09:48:55

by Frans Pop

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

On Monday 12 October 2009, Tejun Heo wrote:
> Alexander, can you please attach full boot log and the output of
> "lspci -nn"? Also, how reproducible is the problem? You already
> answered to Frans' question but can you be more specific?

Full dmesg was made available earlier at:
http://xx.vu/~ahuemer/dmesg_ahuemer_20090923

2009-10-12 09:53:19

by Tejun Heo

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Frans Pop wrote:
> On Monday 12 October 2009, Tejun Heo wrote:
>> Alexander, can you please attach full boot log and the output of
>> "lspci -nn"? Also, how reproducible is the problem? You already
>> answered to Frans' question but can you be more specific?
>
> Full dmesg was made available earlier at:
> http://xx.vu/~ahuemer/dmesg_ahuemer_20090923

Does blacklisting i801_smbus make any difference?

--
tejun

2009-10-12 09:56:13

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Tejun Heo wrote:
> Frans Pop wrote:
>
>> On Monday 12 October 2009, Tejun Heo wrote:
>>
>>> Alexander, can you please attach full boot log and the output of
>>> "lspci -nn"? Also, how reproducible is the problem? You already
>>> answered to Frans' question but can you be more specific?
>>>
>> Full dmesg was made available earlier at:
>> http://xx.vu/~ahuemer/dmesg_ahuemer_20090923
>>
>
> Does blacklisting i801_smbus make any difference?
>
>
lspci -nn:
http://xx.vu/~ahuemer/lspci_nn_ahuemer_20091012

what do you mean with "blacklisting i801_smbus" ?

regards
-alex

2009-10-12 10:08:03

by Tejun Heo

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Alexander Huemer wrote:
> Tejun Heo wrote:
>> Frans Pop wrote:
>>
>>> On Monday 12 October 2009, Tejun Heo wrote:
>>>
>>>> Alexander, can you please attach full boot log and the output of
>>>> "lspci -nn"? Also, how reproducible is the problem? You already
>>>> answered to Frans' question but can you be more specific?
>>>>
>>> Full dmesg was made available earlier at:
>>> http://xx.vu/~ahuemer/dmesg_ahuemer_20090923
>>>
>>
>> Does blacklisting i801_smbus make any difference?
>>
>>
> lspci -nn:
> http://xx.vu/~ahuemer/lspci_nn_ahuemer_20091012
>
> what do you mean with "blacklisting i801_smbus" ?

[ 3.872387] i2c /dev entries driver
[ 3.873943] i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level, low) -> IRQ 23
[ 3.875580] w83627hf: Found W83627HF chip at 0x290

IRQ23 is also used by i801_smbus and it would be nice to confirm
whether the problem can still be triggered with that driver not
loaded. Adding "blacklist i2c_i801" to /etc/modprobe.d/blacklist
should probably do the trick.
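
To see which devices the boot log routes to a given IRQ, the "PCI INT ... -> IRQ N" lines can be grouped by IRQ number. The following is a minimal sketch, not an existing tool: the function name and regular expression are made up for illustration, the ahci line in the sample (timestamp and PCI address) is hypothetical, and only the i801_smbus line is taken from the log above. It assumes each log entry sits on a single, unwrapped line.

```python
import re
from collections import defaultdict

# Matches boot-log lines such as:
#   i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level, low) -> IRQ 23
ROUTE_RE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCI INT (?P<pin>[A-D]) -> GSI (?P<gsi>\d+) .*-> IRQ (?P<irq>\d+)"
)


def devices_by_irq(dmesg_text):
    """Group (driver, PCI address) pairs by the IRQ they were routed to."""
    table = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = ROUTE_RE.search(line)
        if match:
            irq = int(match.group("irq"))
            table[irq].append((match.group("driver"), match.group("bdf")))
    return dict(table)


# First line is from the log quoted above; the ahci line is HYPOTHETICAL
# (invented timestamp and PCI address) and only there to show sharing.
SAMPLE = """\
[ 3.873943] i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level, low) -> IRQ 23
[ 3.900000] ahci 0000:00:1f.2: PCI INT B -> GSI 23 (level, low) -> IRQ 23
[ 3.875580] w83627hf: Found W83627HF chip at 0x290
"""

if __name__ == "__main__":
    for irq, devices in devices_by_irq(SAMPLE).items():
        print(irq, devices)
```

Any IRQ with more than one entry is shared. Note that a device routed to an IRQ need not have registered a handler on it.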

Thanks.

--
tejun

2009-10-12 10:11:45

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Tejun Heo wrote:
> Alexander Huemer wrote:
>
>> Tejun Heo wrote:
>>
>>> Frans Pop wrote:
>>>
>>>
>>>> On Monday 12 October 2009, Tejun Heo wrote:
>>>>
>>>>
>>>>> Alexander, can you please attach full boot log and the output of
>>>>> "lspci -nn"? Also, how reproducible is the problem? You already
>>>>> answered to Frans' question but can you be more specific?
>>>>>
>>>>>
>>>> Full dmesg was made available earlier at:
>>>> http://xx.vu/~ahuemer/dmesg_ahuemer_20090923
>>>>
>>>>
>>> Does blacklisting i801_smbus make any difference?
>>>
>>>
>>>
>> lspci -nn:
>> http://xx.vu/~ahuemer/lspci_nn_ahuemer_20091012
>>
>> what do you mean with "blacklisting i801_smbus" ?
>>
>
> [ 3.872387] i2c /dev entries driver
> [ 3.873943] i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level, low) -> IRQ 23
> [ 3.875580] w83627hf: Found W83627HF chip at 0x290
>
> IRQ23 is also used by i801_smbus and it would be nice to confirm
> whether the problem can still be triggered with that driver not
> loaded. Adding "blacklist i2c_i801" to /etc/modprobe.d/blacklist
> should probably do the trick.
>
> Thanks.
>
>
okay, i think you assume that i2c_i801 is a module.
it is indeed built into the kernel.
i'll rebuild the kernel without that component and run a test again.

regards
-alex

2009-10-12 15:04:35

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Alexander Huemer wrote:
> Tejun Heo wrote:
>> Alexander Huemer wrote:
>>
>>> Tejun Heo wrote:
>>>
>>>> Frans Pop wrote:
>>>>
>>>>
>>>>> On Monday 12 October 2009, Tejun Heo wrote:
>>>>>
>>>>>> Alexander, can you please attach full boot log and the output of
>>>>>> "lspci -nn"? Also, how reproducible is the problem? You already
>>>>>> answered to Frans' question but can you be more specific?
>>>>>>
>>>>> Full dmesg was made available earlier at:
>>>>> http://xx.vu/~ahuemer/dmesg_ahuemer_20090923
>>>>>
>>>> Does blacklisting i801_smbus make any difference?
>>>>
>>>>
>>> lspci -nn:
>>> http://xx.vu/~ahuemer/lspci_nn_ahuemer_20091012
>>>
>>> what do you mean with "blacklisting i801_smbus" ?
>>>
>>
>> [ 3.872387] i2c /dev entries driver
>> [ 3.873943] i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level,
>> low) -> IRQ 23
>> [ 3.875580] w83627hf: Found W83627HF chip at 0x290
>>
>> IRQ23 is also used by i801_smbus and it would be nice to confirm
>> whether the problem can still be triggered with that driver not
>> loaded. Adding "blacklist i2c_i801" to /etc/modprobe.d/blacklist
>> should probably do the trick.
>>
>> Thanks.
>>
>>
> okay, i think you assume that i2c_i801 is a module.
> it is indeed built into the kernel.
> i'll rebuild the kernel without that component and run a test again.
>
> regards
> -alex
tejun, it seems you hit an interesting point.
i compiled kernel-2.6.31.3 with my usual config _without_ i2c_i801.
my usual test (compilation of gcc-4.3.2) finished 5 times without the error.
i'll let it run some more times over night.
does anybody have an idea how i can trace what exactly causes the error
during the compilation run so that i can create a short test program ?

regards
-alex

2009-10-12 17:29:39

by Robert Hancock

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

On 10/12/2009 09:03 AM, Alexander Huemer wrote:
> Alexander Huemer wrote:
>> Tejun Heo wrote:
>>> Alexander Huemer wrote:
>>>
>>>> Tejun Heo wrote:
>>>>> Frans Pop wrote:
>>>>>
>>>>>> On Monday 12 October 2009, Tejun Heo wrote:
>>>>>>> Alexander, can you please attach full boot log and the output of
>>>>>>> "lspci -nn"? Also, how reproducible is the problem? You already
>>>>>>> answered to Frans' question but can you be more specific?
>>>>>> Full dmesg was made available earlier at:
>>>>>> http://xx.vu/~ahuemer/dmesg_ahuemer_20090923
>>>>> Does blacklisting i801_smbus make any difference?
>>>>>
>>>> lspci -nn:
>>>> http://xx.vu/~ahuemer/lspci_nn_ahuemer_20091012
>>>>
>>>> what do you mean with "blacklisting i801_smbus" ?
>>>
>>> [ 3.872387] i2c /dev entries driver
>>> [ 3.873943] i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level, low)
>>> -> IRQ 23
>>> [ 3.875580] w83627hf: Found W83627HF chip at 0x290
>>>
>>> IRQ23 is also used by i801_smbus and it would be nice to confirm
>>> whether the problem can still be triggered with that driver not
>>> loaded. Adding "blacklist i2c_i801" to /etc/modprobe.d/blacklist
>>> should probably do the trick.
>>>
>>> Thanks.
>>>
>> okay, i think you assume that i2c_i801 is a module.
>> it is indeed built into the kernel.
>> i'll rebuild the kernel without that component and run a test again.
>>
>> regards
>> -alex
> tejun, it seems you hit an interesting point.
> i compiled kernel-2.6.31.3 with my usual config _without_ i2c_i801.
> my usual test (compilation of gcc-4.3.2) finished 5 times without the
> error.
> i'll let it run some more times over night.
> does anybody have an idea how i can trace what exactly causes the error
> during the compilation run so that i can create a short test program ?

Do you have any hardware sensors monitoring software running (such as
the GNOME sensors panel applet or something?) Something like that would
be the most likely cause for something to access the smbus driver.

Interesting that the device seems to be on the same interrupt but it
hasn't registered itself as a handler (it looks like that driver doesn't
use interrupts). If the device did generate an interrupt though, it
would indeed cause this problem.
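
A simplified model of the check behind the "nobody cared" message may help here. The thresholds (a window of 100,000 interrupts, with more than 99,900 of them unhandled) are quoted from memory of kernel/irq/spurious.c and should be treated as an assumption. The class and handler names are made up for illustration, and the real code's time-based reset of the unhandled counter is left out.

```python
WINDOW = 100_000           # interrupts counted per evaluation window (assumed)
UNHANDLED_LIMIT = 99_900   # more unhandled than this disables the line (assumed)


class IrqLine:
    """Toy model of one shared, level-triggered interrupt line."""

    def __init__(self, handlers):
        self.handlers = handlers   # callables: source -> True if claimed
        self.count = 0
        self.unhandled = 0
        self.disabled = False

    def fire(self, source):
        """Deliver one interrupt raised by `source` to every handler."""
        if self.disabled:
            return
        # Every registered handler is asked, as on a shared line.
        results = [handler(source) for handler in self.handlers]
        if not any(results):
            self.unhandled += 1
        self.count += 1
        if self.count < WINDOW:
            return
        # End of window: disable the line if almost nothing was claimed.
        if self.unhandled > UNHANDLED_LIMIT:
            self.disabled = True   # corresponds to "Disabling IRQ #23"
        self.count = 0
        self.unhandled = 0


def ahci_handler(source):
    """Hypothetical handler that only claims interrupts raised by ahci."""
    return source == "ahci"


if __name__ == "__main__":
    line = IrqLine([ahci_handler])
    for _ in range(WINDOW):
        line.fire("smbus")         # no handler claims these
    print("disabled:", line.disabled)
```

In this model a device that raises interrupts without a registered handler gets the whole line disabled, including for ahci, which would match the slowdown reported at the start of the thread.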

2009-10-13 02:18:37

by Tejun Heo

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

[cc'ing Jean and quoting whole body]

Hello, Jean.

It seems i2c_i801 is triggering IRQ storm on Alexander's machine. The
original thread is

http://thread.gmane.org/gmane.linux.kernel/894187

Any ideas?

Thanks.

Alexander Huemer wrote:
> Alexander Huemer wrote:
>> Tejun Heo wrote:
>>> Alexander Huemer wrote:
>>>
>>>> Tejun Heo wrote:
>>>>
>>>>> Frans Pop wrote:
>>>>>
>>>>>
>>>>>> On Monday 12 October 2009, Tejun Heo wrote:
>>>>>>
>>>>>>> Alexander, can you please attach full boot log and the output of
>>>>>>> "lspci -nn"? Also, how reproducible is the problem? You already
>>>>>>> answered to Frans' question but can you be more specific?
>>>>>>>
>>>>>> Full dmesg was made available earlier at:
>>>>>> http://xx.vu/~ahuemer/dmesg_ahuemer_20090923
>>>>>>
>>>>> Does blacklisting i801_smbus make any difference?
>>>>>
>>>>>
>>>> lspci -nn:
>>>> http://xx.vu/~ahuemer/lspci_nn_ahuemer_20091012
>>>>
>>>> what do you mean with "blacklisting i801_smbus" ?
>>>>
>>>
>>> [ 3.872387] i2c /dev entries driver
>>> [ 3.873943] i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level,
>>> low) -> IRQ 23
>>> [ 3.875580] w83627hf: Found W83627HF chip at 0x290
>>>
>>> IRQ23 is also used by i801_smbus and it would be nice to confirm
>>> whether the problem can still be triggered with that driver not
>>> loaded. Adding "blacklist i2c_i801" to /etc/modprobe.d/blacklist
>>> should probably do the trick.
>>>
>>> Thanks.
>>>
>>>
>> okay, i think you assume that i2c_i801 is a module.
>> it is indeed built into the kernel.
>> i'll rebuild the kernel without that component and run a test again.
>>
>> regards
>> -alex
> tejun, it seems you hit an interesting point.
> i compiled kernel-2.6.31.3 with my usual config _without_ i2c_i801.
> my usual test (compilation of gcc-4.3.2) finished 5 times without the
> error.
> i'll let it run some more times over night.
> does anybody have an idea how i can trace what exactly causes the error
> during the compilation run so that i can create a short test program ?
>
> regards
> -alex

--
tejun

2009-10-13 06:50:08

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Tejun Heo wrote:
> [cc'ing Jean and quoting whole body]
>
> Hello, Jean.
>
> It seems i2c_i801 is triggering IRQ storm on Alexander's machine. The
> original thread is
>
> http://thread.gmane.org/gmane.linux.kernel/894187
>
> Any ideas?
>
> Thanks.
>
> Alexander Huemer wrote:
>
>> Alexander Huemer wrote:
>>
>>> Tejun Heo wrote:
>>>
>>>> Alexander Huemer wrote:
>>>>
>>>>
>>>>> Tejun Heo wrote:
>>>>>
>>>>>
>>>>>> Frans Pop wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Monday 12 October 2009, Tejun Heo wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Alexander, can you please attach full boot log and the output of
>>>>>>>> "lspci -nn"? Also, how reproducible is the problem? You already
>>>>>>>> answered to Frans' question but can you be more specific?
>>>>>>>>
>>>>>>>>
>>>>>>> Full dmesg was made available earlier at:
>>>>>>> http://xx.vu/~ahuemer/dmesg_ahuemer_20090923
>>>>>>>
>>>>>>>
>>>>>> Does blacklisting i801_smbus make any difference?
>>>>>>
>>>>>>
>>>>>>
>>>>> lspci -nn:
>>>>> http://xx.vu/~ahuemer/lspci_nn_ahuemer_20091012
>>>>>
>>>>> what do you mean with "blacklisting i801_smbus" ?
>>>>>
>>>>>
>>>> [ 3.872387] i2c /dev entries driver
>>>> [ 3.873943] i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level,
>>>> low) -> IRQ 23
>>>> [ 3.875580] w83627hf: Found W83627HF chip at 0x290
>>>>
>>>> IRQ23 is also used by i801_smbus and it would be nice to confirm
>>>> whether the problem can still be triggered with that driver not
>>>> loaded. Adding "blacklist i2c_i801" to /etc/modprobe.d/blacklist
>>>> should probably do the trick.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>> okay, i think you assume that i2c_i801 is a module.
>>> it is indeed built into the kernel.
>>> i'll rebuild the kernel without that component and run a test again.
>>>
>>> regards
>>> -alex
>>>
>> tejun, it seems you hit an interesting point.
>> i compiled kernel-2.6.31.3 with my usual config _without_ i2c_i801.
>> my usual test (compilation of gcc-4.3.2) finished 5 times without the
>> error.
>> i'll let it run some more times over night.
>> does anybody have an idea how i can trace what exactly causes the error
>> during the compilation run so that i can create a short test program ?
>>
>> regards
>> -alex
>>
>
>
hi,

i compiled gcc in a loop over night, 14 times. no error.
it really seems i2c_i801 was the cause...
unfortunately i still don't know how i can extract the part of the gcc
compilation process that causes the error on an affected kernel.
that would enable me to create a simple test program.

regards
-alex

2009-10-13 12:36:12

by Tejun Heo

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Alexander Huemer wrote:
> i compiled gcc in a loop over night, 14 times. no error.
> it really seems i2c_i801 was the cause...
> unfortunately i still don't know how i can extract the part of the gcc
> compilation process that causes the error on an affected kernel.
> that would enable me to create a simple test program.

Given that i2c is used for temperature monitoring, I think it is not
triggered by any single step of the compiling but rather by the
accumulated heat load during compilation. Let's wait for Jean to
chime in. :-)

Thanks.

--
tejun

2009-10-14 11:46:11

by Jean Delvare

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

On Tuesday 13 October 2009, Tejun Heo wrote:
> Alexander Huemer wrote:
> > i compiled gcc in a loop over night, 14 times. no error.
> > it really seems i2c_i801 was the cause...
> > unfortunately i still don't know how i can extract the part of the gcc
> > compilation process that causes the error on an affected kernel.
> > that would enable me to create a simple test program.
>
> Given that i2c is used for temperature monitoring, I think it is not
> triggered by any single step of the compiling but rather by the
> accumulated heat load during compilation. Let's wait for Jean to
> chime in. :-)

Sorry, I'm somewhat busy at the moment, I'll give it a look as soon
as I get a moment.

--
Jean Delvare
Suse L3

2009-10-21 08:38:48

by Jean Delvare

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Hi Tejun, Alexander,

On Tuesday 13 October 2009, Tejun Heo wrote:
> Alexander Huemer wrote:
> > i compiled gcc in a loop over night, 14 times. no error.
> > it really seems i2c_i801 was the cause...
> > unfortunately i still don't know how i can extract the part of the gcc
> > compilation process that causes the error on an affected kernel.
> > that would enable me to create a simple test program.
>
> Given that i2c is used for temperature monitoring, I think it is not
> triggered by any single step of the compiling but rather by the
> accumulated heat load during compilation. Let's wait for Jean to
> chime in. :-)

OK, here I am, sorry for the delay. I've read the discussion thread.
Here are the few data points I can offer, in the hope it will help:

* While the i2c-i801 driver received some changes in kernel 2.6.30,
none of these are related to PCI nor interrupts. So as the problem
is new in kernel 2.6.30, the i2c-i801 driver alone is unlikely to
cause it. This may, however, be a combination of something i2c-i801
does and something the pci subsystem does since kernel 2.6.30. For
this reason, I would still recommend a bisection if the problem can
be reliably reproduced. I know it takes time, but it is always
easier to fix a bug when we know which commit introduced it.

* The i2c-i801 driver does _not_ make use of interrupts. It is
poll-based (I am not exactly proud of that, but that's the way it
is.)

#define ENABLE_INT9 0 /* set to 0x01 to enable - untested */

So I am very surprised to read that this driver would cause an IRQ
storm.

* One thing the i2c-i801 driver does on the PCI device is:

err = pci_enable_device(dev);

I presume this is what causes the following message in dmesg:

i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level, low) -> IRQ 23

Basically, even though the driver doesn't make use of interrupts,
the IRQ is still registered because this is how the hardware is
setup.

As a conclusion, I suspect that 2 things may be happening: either
the SMBus is triggering interrupts when told not to. The ICH6 is a
bit different from all the other supported chips, I'll double check
if we may have missed something. Or, something else is triggering
SMBus transactions. SMI and ACPI come to mind. If this is the case
then you do not want to use i2c-i801 on this motherboard.

Questions to Alexander :

* Can I please see the output of "sensors" on your system?
* What are the brand and model of your motherboard?
* Can we get an acpidump for your system?

--
Jean Delvare
Suse L3

2009-10-21 10:01:39

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Jean Delvare wrote:
> Hi Tejun, Alexander,
>
> On Tuesday 13 October 2009, Tejun Heo wrote:
>
>> Alexander Huemer wrote:
>>
>>> i compiled gcc in a loop over night, 14 times. no error.
>>> it really seems i2c_i801 was the cause...
>>> unfortunately i still don't know how i can extract the part of the gcc
>>> compilation process that causes the error on an affected kernel.
>>> that would enable me to create a simple test program.
>>>
>> Given that i2c is used for temperature monitoring, I think it is not
>> triggered by any single step of the compiling but rather by the
>> accumulated heat load during compilation. Let's wait for Jean to
>> chime in. :-)
>>
>
> OK, here I am, sorry for the delay. I've read the discussion thread.
> Here are the few data points I can offer, in the hope it will help:
>
> * While the i2c-i801 driver received some changes in kernel 2.6.30,
> none of these are related to PCI nor interrupts. So as the problem
> is new in kernel 2.6.30, the i2c-i801 driver alone is unlikely to
> cause it. This may, however, be a combination of something i2c-i801
> does and something the pci subsystem does since kernel 2.6.30. For
> this reason, I would still recommend a bisection if the problem can
> be reliably reproduced. I know it takes time, but it is always
> easier to fix a bug when we know which commit introduced it.
>
> * The i2c-i801 driver does _not_ make use of interrupts. It is
> poll-based (I am not exactly proud of that, but that's the way it
> is.)
>
> #define ENABLE_INT9 0 /* set to 0x01 to enable - untested */
>
> So I am very surprised to read that this driver would cause an IRQ
> storm.
>
> * One thing the i2c-i801 driver does on the PCI device is:
>
> err = pci_enable_device(dev);
>
> I presume this is what causes the following message in dmesg:
>
> i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level, low) -> IRQ 23
>
> Basically, even though the driver doesn't make use of interrupts,
> the IRQ is still registered because this is how the hardware is
> setup.
>
> As a conclusion, I suspect that 2 things may be happening: either
> the SMBus is triggering interrupts when told not to. The ICH6 is a
> bit different from all the other supported chips, I'll double check
> if we may have missed something. Or, something else is triggering
> SMBus transactions. SMI and ACPI come to mind. If this is the case
> then you do not want to use i2c-i801 on this motherboard.
>
> Questions to Alexander :
>
> * Can I please see the output of "sensors" on your system?
> * What are the brand and model of your motherboard?
> * Can we get an acpidump for your system?
>
>
many thanks for your response. i appreciate that.
first, the data you requested:

sensors: http://xx.vu/~ahuemer/sensors-ahuemer-20091021.txt
acpidump: http://xx.vu/~ahuemer/acpidump-ahuemer-20091021.txt
motherboard: tyan tempest i5400pw/s5397 with one intel xeon e5420.

the output of sensors was made _without_ i801_smbus in the kernel.
i noticed that the data of w83627hf-isa-0290 is quite weird. i do not
have an explanation for that.
if a bisection is what will bring light into this, i am willing to take
the time.
so that would be a bisection between 2.6.29 and 2.6.30 ?
a quicker test case would be good for that, but i don't have one yet,
just the compilation of gcc, which takes time, even on this machine with
tmpfs and ccache.
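
As a rough guide to the cost of such a bisection: each round halves the range of candidate commits, so the number of build-and-test rounds grows with the base-2 logarithm of the range size. The sketch below is only an estimate under assumptions: the commit count and the minutes per round are hypothetical placeholders, not figures for 2.6.29..2.6.30 or for this machine, and the function names are made up.

```python
def bisect_rounds(n_commits):
    """Approximate rounds needed to isolate one commit out of n_commits."""
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return (n_commits - 1).bit_length()


def total_hours(n_commits, minutes_per_round):
    """Total wall-clock estimate if every round takes minutes_per_round."""
    return bisect_rounds(n_commits) * minutes_per_round / 60


if __name__ == "__main__":
    commits = 10_000     # HYPOTHETICAL size of the range being bisected
    minutes = 30         # HYPOTHETICAL build + boot + test time per round
    print(bisect_rounds(commits), "rounds,",
          total_hours(commits, minutes), "hours")
```

Because the error is intermittent, a round that passes may need the test repeated several times before the commit is marked good, which multiplies the per-round time.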

2009-10-21 11:28:36

by Jean Delvare

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

On Wednesday 21 October 2009, Alexander Huemer wrote:
> Jean Delvare wrote:
> > OK, here I am, sorry for the delay. I've read the discussion thread.
> > Here are the few data points I can offer, in the hope it will help:
> >
> > * While the i2c-i801 driver received some changes in kernel 2.6.30,
> > none of these are related to PCI nor interrupts. So as the problem
> > is new in kernel 2.6.30, the i2c-i801 driver alone is unlikely to
> > cause it. This may, however, be a combination of something i2c-i801
> > does and something the pci subsystem does since kernel 2.6.30. For
> > this reason, I would still recommend a bisection if the problem can
> > be reliably reproduced. I know it takes time, but it is always
> > easier to fix a bug when we know which commit introduced it.
> >
> > * The i2c-i801 driver does _not_ make use of interrupts. It is
> > poll-based (I am not exactly proud of that, but that's the way it
> > is.)
> >
> > #define ENABLE_INT9 0 /* set to 0x01 to enable - untested */
> >
> > So I am very surprised to read that this driver would cause an IRQ
> > storm.
> >
> > * One thing the i2c-i801 driver does on the PCI device is:
> >
> > err = pci_enable_device(dev);
> >
> > I presume this is what causes the following message in dmesg:
> >
> > i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level, low) -> IRQ 23
> >
> > Basically, even though the driver doesn't make use of interrupts,
> > the IRQ is still registered because this is how the hardware is
> > setup.
> >
> > As a conclusion, I suspect that 2 things may be happening: either
> > the SMBus is triggering interrupts when told not to. The ICH6 is a
> > bit different from all the other supported chips, I'll double check

My bad, it's an 63xxESB-based board, not ICH6. I must have been
mixing data from a different bug.

> > if we may have missed something. Or, something else is triggering
> > SMBus transactions. SMI and ACPI come to mind. If this is the case
> > then you do not want to use i2c-i801 on this motherboard.
> >
> > Questions to Alexander :
> >
> > * Can I please see the output of "sensors" on your system?
> > * What are the brand and model of your motherboard?
> > * Can we get an acpidump for your system?
> >
> >
> many thanks for your response. i appreciate that.
> first, the data you requested:
>
> sensors: http://xx.vu/~ahuemer/sensors-ahuemer-20091021.txt
> acpidump: http://xx.vu/~ahuemer/acpidump-ahuemer-20091021.txt

The good news is that I can't see any access to the SMBus in the
ACPI tables. Nothing can be said about the SMIs though, without an
intimate knowledge of the BIOS.

> motherboard: tyan tempest i5400pw/s5397 with one intel xeon e5420.
>
> the output of sensors was made _without_ i801_smbus in the kernel.

Then please once again with it. My whole point was to know whether
there was any hardware monitoring chip connected to the SMBus. Your
initial kernel configuration suggests that you have a W83793G chip
there.

> i noticed that the data of w83627hf-isa-0290 is quite weird. i do not
> have an explanation for that.

I do. This happens when the manufacturer decides that the hardware
monitoring features of the Super-I/O are insufficient for their
needs. They add a dedicated chip for the hardware monitoring. This
is particularly frequent on server boards from Tyan and SuperMicro.
Ideally they would _also_ disable the feature on the Super-I/O side,
but often they do not, so the driver still loads, but outputs
garbage.

You can see the following messages in your log:
[ 3.878703] w83627hf w83627hf.656: Enabling temp2, readings might not make sense
[ 3.881708] w83627hf w83627hf.656: Enabling temp3, readings might not make sense
This is a good hint that this is the case (if the nonsensical data
displayed by "sensors" wasn't enough to convince you.)

So you should stop loading/including kernel module w83627hf.

> if a bisection is what will bring light into this, i am willing to take
> the time.
> so that would be a bisection between 2.6.29 and 2.6.30 ?
> a quicker test case would be good for that, but i don't have one yet,
> just the compilation of gcc, which takes time, even on this machine with
> tmpfs and ccache.

--
Jean Delvare
Suse L3

2009-10-26 15:02:07

by Alexander Huemer

Subject: Re: 2.6.{30,31} x86_64 ahci problem - irq 23: nobody cared

Jean Delvare wrote:
> On Wednesday 21 October 2009, Alexander Huemer wrote:
>
>> Jean Delvare wrote:
>>
>>> OK, here I am, sorry for the delay. I've read the discussion thread.
>>> Here are the few data points I can offer, in the hope it will help:
>>>
>>> * While the i2c-i801 driver received some changes in kernel 2.6.30,
>>> none of these are related to PCI nor interrupts. So as the problem
>>> is new in kernel 2.6.30, the i2c-i801 driver alone is unlikely to
>>> cause it. This may, however, be a combination of something i2c-i801
>>> does and something the pci subsystem does since kernel 2.6.30. For
>>> this reason, I would still recommend a bisection if the problem can
>>> be reliably reproduced. I know it takes time, but it is always
>>> easier to fix a bug when we know which commit introduced it.
>>>
>>> * The i2c-i801 driver does _not_ make use of interrupts. It is
>>> poll-based (I am not exactly proud of that, but that's the way it
>>> is.)
>>>
>>> #define ENABLE_INT9 0 /* set to 0x01 to enable - untested */
>>>
>>> So I am very surprised to read that this driver would cause an IRQ
>>> storm.
>>>
>>> * One thing the i2c-i801 driver does on the PCI device is:
>>>
>>> err = pci_enable_device(dev);
>>>
>>> I presume this is what causes the following message in dmesg:
>>>
>>> i801_smbus 0000:00:1f.3: PCI INT B -> GSI 23 (level, low) -> IRQ 23
>>>
>>> Basically, even though the driver doesn't make use of interrupts,
>>> the IRQ is still registered because this is how the hardware is
>>> setup.
>>>
>>> As a conclusion, I suspect that 2 things may be happening: either
>>> the SMBus is triggering interrupts when told not to. The ICH6 is a
>>> bit different from all the other supported chips, I'll double check
>>>
>
> My bad, it's an 63xxESB-based board, not ICH6. I must have been
> mixing data from a different bug.
>
>
>>> if we may have missed something. Or, something else is triggering
>>> SMBus transactions. SMI and ACPI come to mind. If this is the case
>>> then you do not want to use i2c-i801 on this motherboard.
>>>
>>> Questions to Alexander :
>>>
>>> * Can I please see the output of "sensors" on your system?
>>> * What are the brand and model of your motherboard?
>>> * Can we get an acpidump for your system?
>>>
>>>
>>>
>> many thanks for your response. i appreciate that.
>> first, the data you requested:
>>
>> sensors: http://xx.vu/~ahuemer/sensors-ahuemer-20091021.txt
>> acpidump: http://xx.vu/~ahuemer/acpidump-ahuemer-20091021.txt
>>
>
> The good news is that I can't see any access to the SMBus in the
> ACPI tables. Nothing can be said about the SMIs though, without an
> intimate knowledge of the BIOS.
>
>
>> motherboard: tyan tempest i5400pw/s5397 with one intel xeon e5420.
>>
>> the output of sensors was made _without_ i801_smbus in the kernel.
>>
>
> Then please once again with it. My whole point was to know whether
> there was any hardware monitoring chip connected to the SMBus. Your
> initial kernel configuration suggests that you have a W83793G chip
> there.
>
>
>> i noticed that the data of w83627hf-isa-0290 is quite weird. i do not
>> have an explanation for that.
>>
>
> I do. This happens when the manufacturer decides that the hardware
> monitoring features of the Super-I/O are insufficient for their
> needs. They add a dedicated chip for the hardware monitoring. This
> is particularly frequent on server boards from Tyan and SuperMicro.
> Ideally they would _also_ disable the feature on the Super-I/O side,
> but often they do not, so the driver still loads, but outputs
> garbage.
>
> You can see the following messages in your log:
> [ 3.878703] w83627hf w83627hf.656: Enabling temp2, readings might not make sense
> [ 3.881708] w83627hf w83627hf.656: Enabling temp3, readings might not make sense
> This is a good hint that this is the case (if the nonsensical data
> displayed by "sensors" wasn't enough to convince you.)
>
> So you should stop loading/including kernel module w83627hf.
>
>
>> if a bisection is what will bring light into this, i am willing to take
>> the time.
>> so that would be a bisection between 2.6.29 and 2.6.30 ?
>> a quicker test case would be good for that, but i don't have one yet,
>> just the compilation of gcc, which takes time, even on this machine with
>> tmpfs and ccache.
>>
>
>
here is the output you requested:
http://xx.vu/~ahuemer/sensors_ahuemer_with_i801_20091026.txt
i am currently in the middle of a bisection between 2.6.29 and 2.6.30, 8
steps left.
many thanks for the info on hardware monitoring.
i'll report back when bisection is finished.
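
For each bisection round, the kernel log can be scanned for the error signature to decide whether the round counts as good or bad. The following is a minimal sketch with a made-up function name; the only thing taken from the thread is the "irq 23: nobody cared" message text. It classifies log text it is given and does not build, boot or test anything itself.

```python
import re

# Signature of the failure as it appears in the logs in this thread.
NOBODY_CARED_RE = re.compile(r"irq (\d+): nobody cared")


def classify_round(dmesg_text, irq=23):
    """Return "bad" if the log reports the given IRQ as unhandled, else "good"."""
    for match in NOBODY_CARED_RE.finditer(dmesg_text):
        if int(match.group(1)) == irq:
            return "bad"
    return "good"


if __name__ == "__main__":
    log = ('[ 1018.059729] irq 23: nobody cared '
           '(try booting with the "irqpoll" option)\n'
           '[ 1018.059783] Disabling IRQ #23\n')
    print(classify_round(log))
```

A "good" result only means the message did not show up during that test run. With an intermittent error, a single clean run is weak evidence, so repeating the test before marking a commit good is the safer choice.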

regards
-alex