2004-04-09 13:41:22

by Jon Grimm

Subject: io_apic & timer_ack fix

Hmmm....

I see that the following patch got pulled in by Andrew:
http://linux.bkbits.net:8080/linux-2.5/diffs/arch/i386/kernel/[email protected]?nav=index.html|src/|src/arch|src/arch/i386|src/arch/i386/kernel|hist/arch/i386/kernel/io_apic.c

The patch had a couple bugs:
http://seclists.org/lists/linux-kernel/2004/Mar/4152.html

But the patch was pulled out entirely by Linus:
http://linux.bkbits.net:8080/linux-2.5/diffs/arch/i386/kernel/[email protected]?nav=index.html|src/|src/arch|src/arch/i386|src/arch/i386/kernel|hist/arch/i386/kernel/io_apic.c

Was it determined that the fix was bogus? damaging? fixable?

I ask because I am seeing behavior identical to what this patch seems to
have been originally written to address (a buggy SMM BIOS is at fault, but
this was a workaround in the OS).

http://marc.theaimsgroup.com/?l=linux-kernel&m=101604672921823&w=2
http://www.ussg.iu.edu/hypermail/linux/kernel/0203.2/0698.html

It's a fair answer to force the BIOS vendor to fix it, but in the meantime
I'm trying to figure out how safe or unsafe the workaround patch is.
I've run with it overnight (with the semicolons fixed) and it hasn't
exhibited the troubling behavior (where timer interrupts seem stuck or
in some cases just extremely slow, and the 8259 IMR is mucked up when
Linux isn't even touching it anymore).

NMI interrupts are coming in at a furious rate, but I see a patch in
Andrew's tree to handle that.

Best Regards,
Jon Grimm


2004-04-09 14:25:07

by Philippe Elie

Subject: Re: io_apic & timer_ack fix

On Fri, 09 Apr 2004 at 08:39 +0000, Jon Grimm wrote:

> Hmmm....
>
> I see that the following patch got pulled in by Andrew:
> http://linux.bkbits.net:8080/linux-2.5/diffs/arch/i386/kernel/[email protected]?nav=index.html|src/|src/arch|src/arch/i386|src/arch/i386/kernel|hist/arch/i386/kernel/io_apic.c
>
> The patch had a couple bugs:
> http://seclists.org/lists/linux-kernel/2004/Mar/4152.html
>
> But the patch was pulled out entirely by Linus:
> http://linux.bkbits.net:8080/linux-2.5/diffs/arch/i386/kernel/[email protected]?nav=index.html|src/|src/arch|src/arch/i386|src/arch/i386/kernel|hist/arch/i386/kernel/io_apic.c
>
> Was it determined that the fix was bogus? damaging? fixable?

http://marc.theaimsgroup.com/?l=linux-kernel&m=107840458123059&w=2

What's the right fix? This patch fixes timer_ack in three places; the
last two look like typos (a spurious ';' after if ()), and the first chunk
apparently causes higher temperatures on some motherboards.
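
For anyone who has not been bitten by this kind of typo, here is a minimal
sketch in C of what a stray ';' after if () does. It is not the code from
the patch: acks_buggy, acks_fixed, demo and the timer_ack_flag parameter are
hypothetical names used only for illustration.

```c
#include <assert.h>

/* Hypothetical illustration only; none of this is taken from io_apic.c. */

/* Buggy form: the ';' is an empty statement that forms the whole if body,
 * so the braces that follow are an ordinary block that always runs. */
static int acks_buggy(int timer_ack_flag)
{
	int acks = 0;

	if (timer_ack_flag);
	{
		acks++;
	}
	return acks;
}

/* Fixed form: the block is the if body and runs only when the flag is set. */
static int acks_fixed(int timer_ack_flag)
{
	int acks = 0;

	if (timer_ack_flag) {
		acks++;
	}
	return acks;
}

/* Self-check documenting the difference between the two forms. */
static void demo(void)
{
	assert(acks_buggy(0) == 1);	/* acked although the flag is clear */
	assert(acks_fixed(0) == 0);	/* flag clear, no ack */
	assert(acks_fixed(1) == 1);	/* flag set, one ack */
}
```

With the typo in place the flag is effectively ignored, which would make the
last two chunks behave the same whether or not timer_ack is set. gcc can warn
about the empty body (-Wempty-body, enabled by -Wextra).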

> I ask because I am seeing behavior identical to what this patch seems to
> have been originally written to address (a buggy SMM BIOS is at fault, but
> this was a workaround in the OS).
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=101604672921823&w=2
> http://www.ussg.iu.edu/hypermail/linux/kernel/0203.2/0698.html
>
> It's a fair answer to force the BIOS vendor to fix it, but in the meantime
> I'm trying to figure out how safe or unsafe the workaround patch is.
> I've run with it overnight (with the semicolons fixed) and it hasn't
> exhibited the troubling behavior (where timer interrupts seem stuck or
> in some cases just extremely slow, and the 8259 IMR is mucked up when
> Linux isn't even touching it anymore).

I agree, but it actually causes trouble on non-buggy motherboards; can this be fixed?

regards,
Philippe Elie


2004-04-09 15:27:35

by Jon Grimm

Subject: Re: io_apic & timer_ack fix

Philippe Elie wrote:

>On Fri, 09 Apr 2004 at 08:39 +0000, Jon Grimm wrote:
>
>
>
>>Hmmm....
>>
>>I see that the following patch got pulled in by Andrew:
>>http://linux.bkbits.net:8080/linux-2.5/diffs/arch/i386/kernel/[email protected]?nav=index.html|src/|src/arch|src/arch/i386|src/arch/i386/kernel|hist/arch/i386/kernel/io_apic.c
>>
>>The patch had a couple bugs:
>>http://seclists.org/lists/linux-kernel/2004/Mar/4152.html
>>
>>But the patch was pulled out entirely by Linus:
>>http://linux.bkbits.net:8080/linux-2.5/diffs/arch/i386/kernel/[email protected]?nav=index.html|src/|src/arch|src/arch/i386|src/arch/i386/kernel|hist/arch/i386/kernel/io_apic.c
>>
>>Was it determined that the fix was bogus? damaging? fixable?
>>
>>
>
>http://marc.theaimsgroup.com/?l=linux-kernel&m=107840458123059&w=2
>
>What's the right fix? This patch fixes timer_ack in three places; the
>last two look like typos (a spurious ';' after if ()), and the first chunk
>apparently causes higher temperatures on some motherboards.
>
>
>
I have the spurious ';' removed in my testing.

>>I ask because I am seeing behavior identical to what this patch seems to
>>have been originally written to address (a buggy SMM BIOS is at fault, but
>>this was a workaround in the OS).
>>
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=101604672921823&w=2
>>http://www.ussg.iu.edu/hypermail/linux/kernel/0203.2/0698.html
>>
>>It's a fair answer to force the BIOS vendor to fix it, but in the meantime
>>I'm trying to figure out how safe or unsafe the workaround patch is.
>>I've run with it overnight (with the semicolons fixed) and it hasn't
>>exhibited the troubling behavior (where timer interrupts seem stuck or
>>in some cases just extremely slow, and the 8259 IMR is mucked up when
>>Linux isn't even touching it anymore).
>>
>>
>
>I agree, but it actually causes trouble on non-buggy motherboards; can this be fixed?
>
>
>

I hope so. It looks like a real problem on a box I have.

>regards,
>Philippe Elie
>
>

2004-04-09 17:36:12

by Ross Dickson

Subject: Re: io_apic & timer_ack fix

On Fri, 09 Apr 2004 at 08:39 +0000, Jon Grimm wrote:

> Hmmm....
>
> I see that the following patch got pulled in by Andrew:
> http://linux.bkbits.net:8080/linux-2.5/diffs/arch/i386/kernel/[email protected]?nav=index.html|src/|src/arch|src/arch/i386|src/arch/i386/kernel|hist/arch/i386/kernel/io_apic.c
>
> The patch had a couple bugs:
> http://seclists.org/lists/linux-kernel/2004/Mar/4152.html
>
> But the patch was pulled out entirely by Linus:
> http://linux.bkbits.net:8080/linux-2.5/diffs/arch/i386/kernel/[email protected]?nav=index.html|src/|src/arch|src/arch/i386|src/arch/i386/kernel|hist/arch/i386/kernel/io_apic.c
>
> Was it determined that the fix was bogus? damaging? fixable?

I thought the patch was OK with typos fixed.

> I ask because I am seeing behavior identical to what this patch seems to
> have been originally written to address (a buggy SMM BIOS is at fault, but
> this was a workaround in the OS).
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=101604672921823&w=2
> http://www.ussg.iu.edu/hypermail/linux/kernel/0203.2/0698.html
>
> It's a fair answer to force the BIOS vendor to fix it, but in the meantime
> I'm trying to figure out how safe or unsafe the workaround patch is.
> I've run with it overnight (with the semicolons fixed) and it hasn't
> exhibited the troubling behavior (where timer interrupts seem stuck or
> in some cases just extremely slow, and the 8259 IMR is mucked up when
> Linux isn't even touching it anymore).

I read the thread you mention about the IMR muckup along the way to creating
my nforce2 patches - it was most enlightening as to how bad consumer computers
can be.

Prakash tracked his overheat to a buggy binary nvidia driver
http://marc.theaimsgroup.com/?l=linux-kernel&m=108059111721363&w=2
and not Maciej's patch.

Thomas was tracking down the C1, C2, etc. states, but I do not know the
results of his search.
http://marc.theaimsgroup.com/?l=linux-kernel&m=107972277920929&w=2
Was it a problem only with one machine?

I do not recollect any other threads indicating problems with the patch.

I remember rediffing my nforce2 io-apic patch using the 2.6.3-mm3 kernel with
Maciej's patch and having no heat trouble. I am surprised it got pulled out but
then I only tested it on one type of chipset.

BTW I just rebooted to my modified 2.6.3-mm3 and got my normal 38C cpu.
I have to have timer_ack=0 in my io-apic timer routing patch for nforce2 to
get nmi_debug=1 to work. This was all along the way to trying to stop lockups.
In fact I have been running no timer_ack kernel mods since December on 4
machines and all have been cool and hard lockup free.

Regards
Ross Dickson



2004-04-13 12:20:31

by Maciej W. Rozycki

Subject: Re: io_apic & timer_ack fix

On Sat, 10 Apr 2004, Ross Dickson wrote:

> > Was it determined that the fix was bogus? damaging? fixable?
>
> I thought the patch was OK with typos fixed.

I consider it final.

> Thomas was tracking down the C1, C2, etc. states, but I do not know the
> results of his search.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107972277920929&w=2
> Was it a problem only with one machine?

The effect seems impossible, but I can't argue with facts. I need to
get at the AMD processor documents to find out what the conditions are
for switching between the power-save states. Perhaps this is yet another
SMM bug, e.g. the pair of PIC poll I/O accesses interacts with the SMM
somehow.

Unfortunately I lack time to dig into it right now -- perhaps someone
else can do the research?

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2004-04-14 04:55:04

by Jon Grimm

Subject: Re: io_apic & timer_ack fix

Maciej W. Rozycki wrote:

>On Sat, 10 Apr 2004, Ross Dickson wrote:
>
>> > Was it determined that the fix was bogus? damaging? fixable?
>>
>>I thought the patch was OK with typos fixed.
>>
>>
>
> I consider it final.
>
>
I've been running the patch on a 2.4.21 code base. I note that
/proc/interrupts shows NMI interrupts coming in fast and furious, where
without it there were none. I'm not sure what to think of this.

Any idea whether this could/should be expected?

Best Regards,
Jon


2004-04-14 10:14:42

by Maciej W. Rozycki

Subject: Re: io_apic & timer_ack fix

On Tue, 13 Apr 2004, Jon Grimm wrote:

> I've been running the patch on a 2.4.21 code base. I note that
> /proc/interrupts shows NMI interrupts coming in fast and furious, where
> without it there were none. I'm not sure what to think of this.

This is the NMI watchdog -- an aid to debug lock-ups. You can control it
with the "nmi_watchdog=" kernel option -- see documentation. I suppose
without the timer_ack change the watchdog wouldn't work for some reason
-- perhaps due to problems with your 8259A core or with the SMM.
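
To put a number on "fast and furious", one could total the NMI row of
/proc/interrupts at two points in time and compare. Below is a minimal
sketch in C; nmi_total is a hypothetical helper, and the layout it expects
(a row labelled "NMI:" followed by one decimal counter per CPU) is an
assumption about the /proc/interrupts format on the kernel in question.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on the "NMI:" row of /proc/interrupts-style
 * text. Returns -1 if there is no such row. Hypothetical helper. */
static long nmi_total(const char *text)
{
	const char *line = text;

	assert(text != NULL);
	while (*line) {
		const char *p = line;

		while (*p == ' ' || *p == '\t')
			p++;
		if (strncmp(p, "NMI:", 4) == 0) {
			long total = 0;

			p += 4;
			for (;;) {
				char *end;

				/* Skip blanks only, so we never run onto the next row. */
				while (*p == ' ' || *p == '\t')
					p++;
				if (!isdigit((unsigned char)*p))
					break;
				total += strtol(p, &end, 10);
				p = end;
			}
			return total;
		}
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return -1;
}
```

Reading /proc/interrupts into a buffer twice, a known interval apart, and
subtracting the two totals gives an approximate NMI rate to compare against
the timer interrupt rate.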

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +