2012-02-02 19:21:20

by Edward Donovan

Subject: Re: ASM1083 PCIx-PCI bridge interrupts - widespread problems

(I'm just a bystander here, but interested, since I've been asked
about it a few times)

On Tue, Jan 31, 2012 at 7:08 AM, Chris Palmer <[email protected]> wrote:
>
> On 31/01/2012 02:12, Robert Hancock wrote:
>> On 01/30/2012 09:04 AM, Chris Palmer wrote:
>>> Linus et al
>>>
>>>
>>> For about 6 months many users have been having interrupt problems
>>> with PCI boards, but it hasn't been
>>> easy trying to find where the problem may be. However, it is now
>>> looking likely that the problem lies
>>> in the ASM1083 PCIe-PCI bridge chipset, as used by Asus in many
>>> Sandybridge and AMD boards.
>>>
>>> My original bug report is:
>>>     https://bugzilla.kernel.org/show_bug.cgi?id=38632  (Sandybridge)
>>>
>>> and there are several other similar ones. However there is also extensive
>>> investigation in the following thread:
>>>     http://www.gossamer-threads.com/lists/linux/kernel/1466185  (AMD)
>>>
>>> There have also been reports of Windows users having similar problems.
>>>
>>> This problem prevents use of PCI boards in any motherboard with that
>>> bridge chipset - including most
>>> ASUS boards. At the moment though we don't know whether the chipset
>>> or drivers are faulty, and if a
>>> workaround is possible.
>>>
>>> At the moment my bug is assigned to drivers_network, but this doesn't
>>> look appropriate.
>>>
>>> Hoping someone can help...
>>
>> If the analysis posted in the "Unhandled IRQs on AMD E-450" thread is
>> correct, then it sounds like the bridge chip is delaying PCIe INTx
>> deassert messages. In that case there isn't much the kernel is likely
>> to be able to do to fix it properly, at least not without input from ASMedia
>> or someone else with detailed knowledge of the chip.
>>
>> The workaround posted in that thread (switching to IRQ polling mode on
>> the interrupt for some period of time after a screaming IRQ is
>> detected) might help, but definitely would be considered a
>> hack.
>>
>> Do you have a source/link for people having issues with this on
>> Windows? I wouldn't be surprised though - I doubt Windows has any
>> special handling for unhandled IRQs so likely it just hammers the IRQ
>> handler until the IRQ gets deasserted. In that case the only thing a
>> user might notice would be poor performance whenever the devices
>> behind that bridge raise interrupts.
>>
>
> Nothing definitive about Windows, but Edward found this discussion. It's
> a bit emotive, but suggests the problem may be manifesting itself there too:
>
> http://forums.planetz.com/viewtopic.php?f=19&t=30557&sid=00d319732500eaf99c586b73060a9602
>
> Chris

If we end up helpless with this chip, will we at least warn the user
that it's known to be buggy? I don't know if there's a standard
procedure when documenting bad hardware.

I've CC'd a few more people who have reported this, and Clemens, who
got to the bottom of it.

Thanks,

Ed





>>> On 09/09/2011 00:51, Andrew Morton wrote:
>>>
>>>> On Thu, 08 Sep 2011 12:28:40 +0100
>>>> Chris Palmer<[email protected]> wrote:
>>>>
>>>>> Andrew
>>>>>
>>>>> I'm writing to ask if you could cast a quick eye over the following
>>>>> bugs, to give an opinion on where they should be assigned. Mine has
>>>>> been
>>>>> reassigned to Network Drivers but I'm not convinced that is right,
>>>>> and I
>>>>> think the problem is wider than that.
>>>>>
>>>>> In summary, interrupt handling for *PCI boards with ASUS Sandybridge
>>>>> motherboards* seems to be broken.
>>>>>
>>>>> It has been seen with network and non-network PCI boards. PCIx network
>>>>> boards work OK. And all reports are for ASUS motherboards.
>>>>>
>>>>> My bug report is
>>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=38632
>>>>>
>>>>> Others that I know of are:
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=713351
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=35332
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=34242
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=32242
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=39122
>>>>>
>>>>>
>>>>> I'm now on kernel 3.0.4 with the problem still there. The only thing
>>>>> that seems to make a difference is acpi=off (although one person
>>>>> reported that it merely changed it from minutes to days before
>>>>> occurring).
>>>>>
>>>>> I'd appreciate anything you could do to move this in the right
>>>>> direction...
>>>>>
>>>>
>>>> Most likely ACPI, I expect. I think that's
>>>> [email protected]. kernel.org DNS is dead at
>>>> present and I can't check.
>>>>
>>>> Len, can you suggest how to triage these please?
>>


2012-02-02 19:28:30

by Linus Torvalds

Subject: Re: ASM1083 PCIx-PCI bridge interrupts - widespread problems

On Thu, Feb 2, 2012 at 11:20 AM, Edward Donovan
<[email protected]> wrote:
>
> If we end up helpless with this chip, will we at least warn the user
> that it's known to be buggy? I don't know if there's a standard
> procedure when documenting bad hardware.

That's probably a good idea.

That said, the "switch to polled mode and then try to reenable every
100ms" approach sounds like a good idea regardless. The more robust we
can be, the better.

I realize that the people with *this* particular problem would
probably want to reenable them even more often than 100ms or so, but
that could lead to problems for people with seriously screaming
interrupts (which has definitely happened too), so we need to balance
those two issues out against each other.

And we'd probably need to limit the warning messages if we start
re-enabling it - so that people with constantly screaming interrupts
don't get a constant stream of 10 "nobody cared, disabling" messages
per second.

So I'd take a tested patch that looks sane for both the "warning: this
pcie-pci bridge is dodgy" and for the "try polling, then re-enable for
a while" approach.

Linus

2012-02-02 20:23:09

by Edward Donovan

Subject: Re: ASM1083 PCIx-PCI bridge interrupts - widespread problems

On Thu, Feb 2, 2012 at 2:28 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 2, 2012 at 11:20 AM, Edward Donovan
> <[email protected]> wrote:
>>
>> If we end up helpless with this chip, will we at least warn the user
>> that it's known to be buggy? I don't know if there's a standard
>> procedure when documenting bad hardware.
>
> That's probably a good idea.
>
> That said, the "switch to polled mode and then try to reenable every
> 100ms" approach sounds like a good idea regardless. The more robust we
> can be, the better.
>
> I realize that the people with *this* particular problem would
> probably want to reenable them even more often than 100ms or so, but
> that could lead to problems for people with seriously screaming
> interrupts (which has definitely happened too), so we need to balance
> those two issues out against each other.
>
> And we'd probably need to limit the warning messages if we start
> re-enabling it - so that people with constantly screaming interrupts
> don't get a constant stream of 10 "nobody cared, disabling" messages
> per second.
>
> So I'd take a tested patch that looks sane for both the "warning: this
> pcie-pci bridge is dodgy" and for the "try polling, then re-enable for
> a while" approach.

I don't have the bad chip, so I won't try to work that up myself. And
I'd have to ponder before trying the generic parts of this. But let
me see if I'm following you. Is that, potentially, these two or three
patches?

* New logic in the generic IRQ code, in spurious.c, adding a "try
polling, then re-enable for
a while" method, for everybody?

* A warning message about ASM1083, under arch/ or drivers/ ? A better
place for special checks, than the genirq code. (Right?)

* Could there be more hardware-specific code, to crank up the
frequency, when you do have this chip? I don't think we have this
facility at present: would we let the arch-or-drivers code set a
variable, to be honored by irq/spurious.c?

I speak with hastiness and naivete, especially on that last one. I
imagine you and Ingo and Thomas have considered such possibly-lousy
ideas a lot more than me, so I hope wisdom will be dispensed.

Thanks,

Ed

2012-02-02 21:40:09

by Clemens Ladisch

Subject: Re: ASM1083 PCIx-PCI bridge interrupts - widespread problems

Edward Donovan wrote:
> * New logic in the generic IRQ code, in spurious.c, adding a "try
> polling, then re-enable for
> a while" method, for everybody?

This is useful in the general case, if the interrupt line eventually
gets unstuck. With the ASM1083, we know this happens when another
interrupt comes in, but we don't know when (the sound card mentioned in
the link above issues interrupts every few milliseconds; Jeroen's on-
board FireWire controller fires a timer every 64 s; his network card
might not get any action if there isn't any traffic).

> * A warning message about ASM1083, under arch/ or drivers/ ? A better
> place for special checks, than the genirq code. (Right?)

drivers/pci/quirks.c

> * Could there be more hardware-specific code, to crank up the
> frequency,

... and lower the threshold for detecting a stuck interrupt, ...

> when you do have this chip?

This would be sensible, as this is not a catch-all debugging measure but
a workaround for a known problem.

> I don't think we have this facility at present: would we let the
> arch-or-drivers code set a variable, to be honored by irq/spurious.c?

Wouldn't be the first one that affects generic code.


Regards,
Clemens

2012-02-02 22:41:47

by Jeroen Van den Keybus

Subject: Re: ASM1083 PCIx-PCI bridge interrupts - widespread problems

>> And we'd probably need to limit the warning messages if we start
>> re-enabling it - so that people with constantly screaming interrupts
>> don't get a constant stream of 10 "nobody cared, disabling" messages
>> per second.

That is the main reason the patch comments out __report_bad_irq(). There
are other printk's in there right now, allowing monitoring of the patch's
(hack's) performance. After 2 months of testing, the following is a
typical result on my Asus E45M1-M board:

[1675739.482843] Disabling IRQ 16
[1675739.488056] Polling IRQ 16
[1675740.288244] Reenabling IRQ 16
[1675740.288363] Disabling IRQ 16
[1675740.296132] Polling IRQ 16
...
[1675802.512233] Polling IRQ 16
[1675803.312244] Reenabling IRQ 16
[1675803.312362] Disabling IRQ 16
[1675803.320233] Polling IRQ 16
[1675804.120229] Reenabling IRQ 16

Because this particular IRQ is only asserted once every 64 s, polling
mode stays active for that amount of time, until a new INTx Deassert
is received. So it appears that INTx Deassert is not delayed, but
simply lost (either not sent or not received by the IO-APIC).

I did a test (https://lkml.org/lkml/2011/12/8/329) in which I
programmed an e1000 to issue one of the interrupts it doesn't use/need
(RXT0). From that log, it is clear that raising and clearing the IRQ
after more than 60 µs did not generate the expected INTx Deassert
either. There is serious trouble with this device.

> * New logic in the generic IRQ code, in spurious.c, adding a "try
> polling, then re-enable for
> a while" method, for everybody?

Something like it (poll for a while, then try reenabling). Please see
the patch (https://lkml.org/lkml/2012/1/30/432) for the general idea.

> * Could there be more hardware-specific code, to crank up the
> frequency, when you do have this chip? I don't think we have this
> facility at present: would we let the arch-or-drivers code set a
> variable, to be honored by irq/spurious.c?

I would propose to crank up the frequency adaptively. Whenever
reenabling fails (i.e. is followed by a storm immediately), the
poll_spurious_irq_timer interval may be increased gradually up to,
say, 100ms (the value currently used by irqpoll). In the other cases,
it could be decreased progressively to e.g. 1 ms. So add or subtract
one ms of polling time whenever reenabling fails or succeeds,
respectively. An attempt to reenable an IRQ could occur after a fixed
number of polling cycles (100).

Alternatively, the interval could also be modified by the number of
interrupts received in an interval.
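
The add-or-subtract-one-millisecond scheme described above could look roughly like the following. This is a stand-alone C sketch of the arithmetic only, not the patch itself: struct irq_poll_state and the irq_poll_* functions are made-up names, the starting interval is an assumption supplied by the caller, and no real timer or IRQ handling is involved.

```c
#include <assert.h>
#include <stdbool.h>

#define POLL_MIN_MS        1  /* lower bound proposed above */
#define POLL_MAX_MS      100  /* upper bound proposed above */
#define POLL_STEP_MS       1  /* adjustment per reenable attempt */
#define POLL_CYCLES_MAX  100  /* poll cycles before trying to reenable */

/* Hypothetical per-IRQ bookkeeping; not an existing kernel structure. */
struct irq_poll_state {
    unsigned int interval_ms; /* current polling interval */
    unsigned int cycles;      /* poll cycles since the last attempt */
};

static void irq_poll_init(struct irq_poll_state *s, unsigned int start_ms)
{
    assert(start_ms >= POLL_MIN_MS && start_ms <= POLL_MAX_MS);
    s->interval_ms = start_ms;
    s->cycles = 0;
}

/* Count one poll cycle; returns true when a reenable attempt is due. */
static bool irq_poll_tick(struct irq_poll_state *s)
{
    if (++s->cycles < POLL_CYCLES_MAX)
        return false;
    s->cycles = 0;
    return true;
}

/*
 * Record the outcome of a reenable attempt. `storm` is true when the
 * attempt failed, i.e. an interrupt storm followed immediately.
 */
static void irq_poll_reenable_result(struct irq_poll_state *s, bool storm)
{
    if (storm) {
        /* Failure: poll more slowly, up to the maximum. */
        s->interval_ms += POLL_STEP_MS;
        if (s->interval_ms > POLL_MAX_MS)
            s->interval_ms = POLL_MAX_MS;
    } else {
        /* Success: poll faster next time, down to the minimum. */
        if (s->interval_ms > POLL_MIN_MS + POLL_STEP_MS)
            s->interval_ms -= POLL_STEP_MS;
        else
            s->interval_ms = POLL_MIN_MS;
    }
}
```

A caller would invoke irq_poll_tick() from each poll, try to reenable the line when it returns true, and then report whether a storm followed via irq_poll_reenable_result().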

- When, Heaven forbid, more than one IRQ is handled by this mechanism,
what would the polling interval need to be? The shortest of them all?
Can/should timers be created dynamically?
- Would struct irq_desc be an appropriate place to keep the per-irq
variables to accomplish all this? I noticed that there is also
irq_data.
- Are there any alignment/security requirements on struct irq_desc to
be aware of?
- Alan Cox also suggested that IRQ 0 'magic' could be an issue. I
cannot really find what he refers to in spurious.c, but it may be
important?
- Are there any drivers that would not be able to operate in a polling fashion?



J.

2012-02-03 01:59:29

by Edward Donovan

Subject: Re: ASM1083 PCIx-PCI bridge interrupts - widespread problems

Clemens and Jeroen - um, wow, that is a lot of strong thinking. I
don't have any code in mind, to match, yet. Especially given your set
of questions, Jeroen. I'll try to get time to think and read the
code. Hopefully the group will be ahead of my pace. :)

(I have a more minor patch for spurious.c, that I need to resubmit,
too, and I should probably get that over with, first. So if you see
me post about "better error messages for spurious IRQs", it won't be
directly related to this.)

Thanks,

Ed

2012-02-03 09:17:47

by Müller Keve

Subject: RE: ASM1083 PCIx-PCI bridge interrupts - widespread problems

Gentlemen,

Below is a comment from AsMedia. I will continue now with ASUS, but they are
known to just say "the system is not supporting Linux".

As evidence is growing that the chip might have a severe timing issue, I
believe that all related kernel bug reports should be grouped under one major
bug that names the chipset. Can somebody please advise on how I could
perform this re-routing and whether that makes sense? I would simply take
all PCI-related kernel bug reports from systems with an ASM1083 and make
them dependent on a newly created (empty) report.

IMO the bridge should finally get a human-readable tag in the kernel (so far
it is numeric only: 1b21:1080), possibly including a note in the tag saying
"buggy". This should alert others running into problems and
channel them to the appropriate place.
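
As an illustration of what such a tag could amount to, here is a small stand-alone C sketch that maps the numeric ID to a readable name plus a "buggy" note. It is only a sketch: the table, the structure and the function names are hypothetical, the name string is an assumption, and a real change would presumably live in drivers/pci/quirks.c, as suggested earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical table entry: numeric PCI ID plus a readable tag. */
struct pci_tag {
    uint16_t vendor;
    uint16_t device;
    const char *name;   /* assumed wording, for illustration only */
    bool known_buggy;
};

static const struct pci_tag pci_tags[] = {
    { 0x1b21, 0x1080, "ASMedia ASM1083 PCIe-PCI bridge", true },
};

/* Returns the matching entry, or NULL if the ID is not in the table. */
static const struct pci_tag *pci_tag_lookup(uint16_t vendor, uint16_t device)
{
    for (size_t i = 0; i < sizeof(pci_tags) / sizeof(pci_tags[0]); i++) {
        if (pci_tags[i].vendor == vendor && pci_tags[i].device == device)
            return &pci_tags[i];
    }
    return NULL;
}

/* Formats "vvvv:dddd name (buggy)" into buf; returns snprintf's result. */
static int pci_tag_describe(uint16_t vendor, uint16_t device,
                            char *buf, size_t len)
{
    const struct pci_tag *t = pci_tag_lookup(vendor, device);

    assert(buf != NULL && len > 0);
    if (t == NULL)
        return snprintf(buf, len, "%04x:%04x",
                        (unsigned int)vendor, (unsigned int)device);
    return snprintf(buf, len, "%04x:%04x %s%s",
                    (unsigned int)vendor, (unsigned int)device,
                    t->name, t->known_buggy ? " (buggy)" : "");
}
```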

Having the hardware and plenty of different PCI cards I am open to test any
suggestion you have and gather data/evidence that sheds light on how to best
treat the issue in the kernel.

Thank you for your continuing support!

Best regards,

Keve


Here is the mail from AsMedia (an ASUS subsidiary...).

Dear Keve,
Thanks for your valuable opinion. We are glad to receive the opinion from
the end-user. However, we are sorry to inform you that Asmedia is an IC
design house and so far we only provide direct support to the manufacturer,
OEM/ODM, and brand companies. Because most of software is customized and
our IC is one of the parts in the system and we even don't know if that
system supports Linux. Actually we are not authorized to release those
software by ourselves. We are afraid that we are not able to provide the
appropriate solution to you, so we suggest that you contact the service
department of that product and they might provide the appropriate support to
you. We also thank you for purchasing the product on which our IC is used.

Best Regards,
Asmedia Service


-----Original Message-----
From: Edward Donovan [mailto:[email protected]] On Behalf Of Edward Donovan
Sent: Friday, February 03, 2012 2:59 AM
To: Jeroen Van den Keybus
Cc: Linus Torvalds; Chris Palmer; Robert Hancock; Andrew Morton; Len Brown;
[email protected]; [email protected]; [email protected];
[email protected]; [email protected]; [email protected]; Thomas
Gleixner; Ingo Molnar
Subject: Re: ASM1083 PCIx-PCI bridge interrupts - widespread problems

Clemens and Jeroen - um, wow, that is a lot of strong thinking. I don't
have any code in mind, to match, yet. Especially given your set of
questions, Jeroen. I'll try to get time to think and read the code.
Hopefully the group will be ahead of my pace. :)

(I have a more minor patch for spurious.c, that I need to resubmit, too, and
I should probably get that over with, first. So if you see me post about
"better error messages for spurious IRQs", it won't be directly related to
this.)

Thanks,

Ed