2005-01-05 15:52:06

by Martin Drab

Subject: APIC/LAPIC hanging problems on nForce2 system.

Hi,

I'm witnessing a total freeze on my system when the APIC and LAPIC are
enabled in kernel 2.6.10-bk7.

The freeze seems to occur whenever there is heavy interrupt activity,
usually high network load or high HDD activity. It does not occur after a
constant time under the heavy interrupt load, but it ALWAYS occurs, and
always after quite a short time. The freeze is total: nothing reacts, and
even the cursor on the HW text console stops blinking. Only a cold reset
helps.

The problem disappears when I turn off APIC and LAPIC (with the "noapic
nolapic" options on the kernel boot command line). I tried to turn off
only the APIC (i.e., only "noapic"); at first it seemed to work, but it
froze anyway, only a bit later. I also tried to turn off only the LAPIC
(i.e., only "nolapic"), but then my HDD was losing interrupts, so the
system didn't even boot.

I also tried the native kernel from MDK 10.1 i586, i.e. 2.6.8.1-12mdk and
it works without any problem with both APIC and LAPIC enabled.

Does anybody have a clue what could be wrong?

System info follows (and attached).

Thanks in advance.
Martin Drab

Basic HW:

MB: Gigabyte K7NNXP (nForce2 Ultra 400, Intel E1000 Gb LAN)
CPU: AMD AthlonXP 3200+ Barton 400MHz FSB
MEM: 1 GB Dual Channel 400MHz


Attachments:
lspci.log (2.19 kB)
lspci
kernel2a.log (18.70 kB)
Kernel log
config (48.38 kB)
kernel config
interrupts.log (635.00 B)
Interrupts assignment under both APIC and LAPIC enabled.

2005-01-05 16:09:01

by Zwane Mwaikambo

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.

On Wed, 5 Jan 2005, Martin Drab wrote:

> I'm witnessing a total freeze on my system when the APIC and LAPIC are
> enabled in kernel 2.6.10-bk7.
>
> The freeze seems to occur whenever there is heavy interrupt activity,
> usually high network load or high HDD activity. It does not occur after a
> constant time under the heavy interrupt load, but it ALWAYS occurs, and
> always after quite a short time. The freeze is total: nothing reacts, and
> even the cursor on the HW text console stops blinking. Only a cold reset
> helps.
>
> The problem disappears when I turn off APIC and LAPIC (with the "noapic
> nolapic" options on the kernel boot command line). I tried to turn off
> only the APIC (i.e., only "noapic"); at first it seemed to work, but it
> froze anyway, only a bit later. I also tried to turn off only the LAPIC
> (i.e., only "nolapic"), but then my HDD was losing interrupts, so the
> system didn't even boot.
>
> I also tried the native kernel from MDK 10.1 i586, i.e. 2.6.8.1-12mdk and
> it works without any problem with both APIC and LAPIC enabled.
>
> Does anybody have a clue what could be wrong?

I'm assuming that 2.6.10 is ok?

2005-01-05 16:50:26

by Prakash K. Cheemplavam

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.

Martin Drab schrieb:
> Hi,
>
> I'm witnessing a total freeze on my system when the APIC and LAPIC are
> enabled in kernel 2.6.10-bk7.

Do you know whether your bios already contains the C1 halt disconnect
fix? I couldn't find this line in your dmesg:


PCI: nForce2 C1 Halt Disconnect fixup

Did it occur with earlier kernels? If yes, this is a regression.

As a workaround, check whether

athcool off

makes your system stable. If it does, you need the above fix activated.

Cheers,

Prakash


Attachments:
signature.asc (189.00 B)
OpenPGP digital signature

2005-01-05 16:56:58

by Martin Drab

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.



On Wed, 5 Jan 2005, Zwane Mwaikambo wrote:

> On Wed, 5 Jan 2005, Martin Drab wrote:
>
> > I'm witnessing a total freeze on my system when the APIC and LAPIC are
> > enabled in kernel 2.6.10-bk7.
> >
> > The freeze seems to occur whenever there is heavy interrupt activity,
> > usually high network load or high HDD activity. It does not occur after a
> > constant time under the heavy interrupt load, but it ALWAYS occurs, and
> > always after quite a short time. The freeze is total: nothing reacts, and
> > even the cursor on the HW text console stops blinking. Only a cold reset
> > helps.
> >
> > The problem disappears when I turn off APIC and LAPIC (with the "noapic
> > nolapic" options on the kernel boot command line). I tried to turn off
> > only the APIC (i.e., only "noapic"); at first it seemed to work, but it
> > froze anyway, only a bit later. I also tried to turn off only the LAPIC
> > (i.e., only "nolapic"), but then my HDD was losing interrupts, so the
> > system didn't even boot.
> >
> > I also tried the native kernel from MDK 10.1 i586, i.e. 2.6.8.1-12mdk and
> > it works without any problem with both APIC and LAPIC enabled.
> >
> > Does anybody have a clue what could be wrong?
>
> I'm assuming that 2.6.10 is ok?

If I remember correctly, it does not work either.

Martin

2005-01-05 17:06:32

by Martin Drab

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.



On Wed, 5 Jan 2005, Prakash K. Cheemplavam wrote:

> Martin Drab schrieb:
> > Hi,
> >
> > I'm witnessing a total freeze on my system when the APIC and LAPIC are
> > enabled in kernel 2.6.10-bk7.
>
> Do you know whether your bios already contains the C1 halt disconnect
> fix? I couldn't find this line in your dmesg:

Aha! That might be the problem. The board still has the factory BIOS,
which is F11. I'll try the current F20 when I get home and I'll let you
know.

> PCI: nForce2 C1 Halt Disconnect fixup

OK, I'll check it out.

> Did it occur with earlier kernels? If yes, this is a regression.

Well, as I said, with the native Mandrake kernel 2.6.8.1-12mdk everything
was OK. The first vanilla kernel I tried on this MB was something around
2.6.9-rc2, if I remember correctly, and it already had the problem, and all
subsequent ones had it as well.

> Try as workaround if
>
> athcool off

OK, I'll try that.

> makes your system stable. If yes, you need above fix activated.

OK, I'll take a look at it and let you know the results (hopefully in
a few hours).

Thanks,
Martin

2005-01-05 17:19:11

by Prakash K. Cheemplavam

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.

Martin Drab schrieb:
>
> On Wed, 5 Jan 2005, Prakash K. Cheemplavam wrote:
>
>
>>Martin Drab schrieb:
>>
>>>Hi,
>>>
>>>I'm witnessing a total freeze on my system when the APIC and LAPIC are
>>>enabled in kernel 2.6.10-bk7.
>>
>>Do you know whether your bios already contains the C1 halt disconnect
>>fix? I couldn't find this line in your dmesg:
>
>
> Aha! That might be the problem. Because there is still the factory BIOS,
> which is F11. I'll try the current F20 when I get home and I'll let you
> know.
>
>
>>PCI: nForce2 C1 Halt Disconnect fixup
>
>
> OK, I'll check it out.

Just to avoid confusion: if your BIOS does *not* contain the fix, the
kernel should apply it and the above line should appear. (It does here with
2.6.10.) So if it doesn't in your case (and your BIOS does not contain
that fix), the detection code probably isn't sufficient. -> This should be
fixed in the kernel.

When you use a fixed BIOS, though, the above line should not appear, and your
system should be stable.

Prakash


Attachments:
signature.asc (189.00 B)
OpenPGP digital signature

2005-01-05 17:23:08

by Martin Drab

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.



On Wed, 5 Jan 2005, Prakash K. Cheemplavam wrote:

> Martin Drab schrieb:
> Just to avoid confusion: If your bios does *not contain the fix, the
> kernel should fix it and above line should appear. (It does here with
> 2.6.10) So if it doesn't in your case (and your bios does not contain
> that fix), the detection code probably isn't enough. -> This should be
> fixed in kernel.
>
> When you use a fixed bios though, above line should not appear, and your
> system should be stable.

Is there some other way to find out whether the BIOS already contains the
fix?

Martin

2005-01-05 17:28:10

by Prakash K. Cheemplavam

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.

Martin Drab schrieb:
>
> On Wed, 5 Jan 2005, Prakash K. Cheemplavam wrote:
>
>
>>Martin Drab schrieb:
>>Just to avoid confusion: If your bios does *not contain the fix, the
>>kernel should fix it and above line should appear. (It does here with
>>2.6.10) So if it doesn't in your case (and your bios does not contain
>>that fix), the detection code probably isn't enough. -> This should be
>>fixed in kernel.
>>
>>When you use a fixed bios though, above line should not appear, and your
>>system should be stable.
>
>
> Is there some other way to find out whether the BIOS already contains the
> fix?

lspci -xxx

then check

0000:00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different
version?) (rev c1)
00: de 10 e0 01 06 00 b0 00 c1 00 00 06 00 00 80 00
10: 08 00 00 a0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 7b 14 00 1c
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00
40: 02 60 30 00 1b 42 00 1f 02 03 00 00 ff ff ff ff
50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
60: 08 00 01 20 20 00 88 80 10 00 00 00 01 ff 01 9f <----
70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
80: 00 01 00 00 ff ff ff 3f 01 00 00 00 01 80 00 00
90: 14 80 40 a7 14 80 40 a5 00 30 00 00 00 00 00 00
a0: 40 00 00 00 32 fb 10 00 01 00 00 00 00 00 00 00
b0: cc ff 07 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 33 33 03 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: d6 01 47 00 16 30 00 10 00 00 00 00 00 00 00 00
f0: 0f 00 00 08 00 00 00 00 00 00 00 00 00 00 00 00


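From fixup.c:
 * Chip  Old value   New value
 * C17   0x1F0FFF01  0x1F01FF01
 * C18D  0x9F0FFF01  0x9F01FF01

Look at the last four bytes of the marked line (register 0x6c, stored
least significant byte first). If they hold the old value, the fix needs
to be applied.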
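
A minimal sketch of that check, assuming the four bytes at offsets
0x6c..0x6f are copied by hand from the "lspci -xxx" dump. The helper names
are hypothetical and not part of fixup.c; the check looks only at bits
16-23, because that is the only byte that differs between the old and new
values in the table.

```c
#include <stdint.h>

/*
 * Assemble the 32-bit register value from the four bytes that
 * "lspci -xxx" prints for offsets 0x6c..0x6f. PCI config space is
 * little-endian, so the first byte printed is the least significant.
 */
static uint32_t dword_from_lspci_bytes(const uint8_t b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}

/*
 * Return 1 if bits 16-23 already hold 0x01 (the "new value" column),
 * 0 otherwise. Hypothetical helper for illustration only.
 */
static int c1_fix_present(uint32_t reg6c)
{
	return ((reg6c >> 16) & 0xFF) == 0x01;
}
```

With the dump above, the bytes "01 ff 01 9f" assemble to 0x9F01FF01, which
is the fixed C18D value.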

Prakash


Attachments:
signature.asc (189.00 B)
OpenPGP digital signature

2005-01-05 17:32:55

by Martin Drab

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.



On Wed, 5 Jan 2005, Prakash K. Cheemplavam wrote:

> Martin Drab schrieb:
> > Is there some other way to find out whether the BIOS already contains the
> > fix?
>
> lspci -xxx
>
> then check
>
> 0000:00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different
> version?) (rev c1)
> 00: de 10 e0 01 06 00 b0 00 c1 00 00 06 00 00 80 00
> 10: 08 00 00 a0 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 7b 14 00 1c
> 30: 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00
> 40: 02 60 30 00 1b 42 00 1f 02 03 00 00 ff ff ff ff
> 50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 60: 08 00 01 20 20 00 88 80 10 00 00 00 01 ff 01 9f <----
> 70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 80: 00 01 00 00 ff ff ff 3f 01 00 00 00 01 80 00 00
> 90: 14 80 40 a7 14 80 40 a5 00 30 00 00 00 00 00 00
> a0: 40 00 00 00 32 fb 10 00 01 00 00 00 00 00 00 00
> b0: cc ff 07 00 00 00 00 00 00 00 00 00 00 00 00 00
> c0: 33 33 03 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: d6 01 47 00 16 30 00 10 00 00 00 00 00 00 00 00
> f0: 0f 00 00 08 00 00 00 00 00 00 00 00 00 00 00 00
>
>
> From fixup.c:
> * Chip Old value New value
> * C17 0x1F0FFF01 0x1F01FF01
> * C18D 0x9F0FFF01 0x9F01FF01
>
> If there is old value, it needs to be fixed.

OK, I'll check it out.

Thanks,
Martin

2005-01-06 00:17:06

by Martin Drab

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.

On Wed, 5 Jan 2005, Prakash K. Cheemplavam wrote:
> lspci -xxx
>
> then check
>
> 0000:00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different
> version?) (rev c1)
> 00: de 10 e0 01 06 00 b0 00 c1 00 00 06 00 00 80 00
> 10: 08 00 00 a0 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 7b 14 00 1c
> 30: 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00
> 40: 02 60 30 00 1b 42 00 1f 02 03 00 00 ff ff ff ff
> 50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 60: 08 00 01 20 20 00 88 80 10 00 00 00 01 ff 01 9f <----
> 70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 80: 00 01 00 00 ff ff ff 3f 01 00 00 00 01 80 00 00
> 90: 14 80 40 a7 14 80 40 a5 00 30 00 00 00 00 00 00
> a0: 40 00 00 00 32 fb 10 00 01 00 00 00 00 00 00 00
> b0: cc ff 07 00 00 00 00 00 00 00 00 00 00 00 00 00
> c0: 33 33 03 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: d6 01 47 00 16 30 00 10 00 00 00 00 00 00 00 00
> f0: 0f 00 00 08 00 00 00 00 00 00 00 00 00 00 00 00
>
>
> From fixup.c:
> * Chip Old value New value
> * C17 0x1F0FFF01 0x1F01FF01
> * C18D 0x9F0FFF01 0x9F01FF01
>
> If there is old value, it needs to be fixed.

OK, so I investigated a bit and found the following interesting
and seemingly unexpected situation. I wrote a trivial debugging patch
(attached as "nForce2-Fixup-DEBUG.diff") to take a closer look at what is
going on in there. The result is attached as "kernel.log". The interesting
part:

...
DEBUG: pci_fixup_nforce2() called.
DEBUG: nForce2 revision byte = 0xC1.
DEBUG: fixed value = 0x9F01FF01.
DEBUG: current value = 0x8F0FFF01. <---------------
...

So that means that the device doesn't have "C1 Halt Disconnect"
enabled at that point, and therefore no fixup is done. However, if you take
a closer look at the result of "lspci -xxx" (attached as "lspci-xxx.log"),

00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev c1)
...
60: 08 00 01 20 20 00 88 80 10 00 00 00 01 ff 0f 9f <-------
...

you'll notice that all of a sudden bit 28 of PCI.0x6c *is* set!! That
means that sometime later, after pci_fixup_nforce2() is called,
something, somewhere, somehow sets the bit to 1. But this part of
arch/i386/pci/fixup.c prevents the fixup from being applied.

	/*
	 * Apply fixup only if C1 Halt Disconnect is enabled
	 * (bit28) because it is not supported on some boards.
	 */
	    vvvvvvvvvvvvvvvvv
	if ((val & (1 << 28)) && val != fixed_val) {
		printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n");
		pci_write_config_dword(dev, 0x6c, fixed_val);
	}

So my question is: is the condition necessary? If there really are boards
that don't support this, then it would probably need a more
sophisticated test, or the fixup would have to be called again later, when
the flag is set. BTW: any clue as to what could possibly set the flag?
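
To illustrate why the fixup is skipped here, the sketch below replays the
quoted condition on the values from the debug log. The function name and
the macro are hypothetical; only the expression itself is taken from
fixup.c.

```c
#include <stdint.h>

#define C1_DISCONNECT_BIT (1u << 28)	/* bit 28 of PCI reg. 0x6c */

/*
 * Same test as the quoted condition: the fixup is written only when
 * C1 Halt Disconnect is already enabled and the value differs from
 * the fixed one.
 */
static int old_fixup_would_apply(uint32_t val, uint32_t fixed_val)
{
	return (val & C1_DISCONNECT_BIT) && val != fixed_val;
}
```

For the boot-time value 0x8F0FFF01, bit 28 is clear, so the function
returns 0 and nothing is written. For 0x9F0FFF01 (the state seen later in
"lspci -xxx") it would return 1, but by then the fixup code has already
run.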

Anyway, I wrote a trivial patch (attached as "nForce2-Fixup-Fix.diff")
which simply removes the condition mentioned above. With this patch
the fixup IS done on my MB and then everything works OK. So the problem
really *is* here. But I'm afraid that this trivial solution is an
ugly workaround rather than a real fix for this problem.

Upgrading my BIOS would probably solve the problem, but I think it should
really be fixed in the kernel, so that others don't have to upgrade
their BIOS either. So I haven't upgraded yet, in order to
be able to test any better solution that there might be (if there is any).

Martin


Attachments:
nForce2-Fixup-DEBUG.diff (932.00 B)
nForce2-Fixup-DEBUG.diff
kernel.log (20.59 kB)
kernel.log
lspci-xxx.log (1.00 kB)
lspci-xxx.log
nForce2-Fixup-Fix.diff (538.00 B)
nForce2-Fixup-Fix.diff

2005-01-06 09:04:03

by Prakash K. Cheemplavam

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.

Martin Drab schrieb:
> On Wed, 5 Jan 2005, Prakash K. Cheemplavam wrote:
>
> ...
> DEBUG: pci_fixup_nforce2() called.
> DEBUG: nForce2 revision byte = 0xC1.
> DEBUG: fixed value = 0x9F01FF01.
> DEBUG: current value = 0x8F0FFF01. <---------------
> ...
>
> So that means, that the device doesn't have the "C1 Halt Disconnect"
> enabled at that point, and, though, no fixup is done. However, if you take
> a closer look at the result of "lspci -xxx" (attached as "lspci-xxx.log"),
>
> 00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev c1)
> ...
> 60: 08 00 01 20 20 00 88 80 10 00 00 00 01 ff 0f 9f <-------
> ...
>
> you'll notice that all of a sudden bit 28 of PCI.0x6c *is* set!! That
> means that sometime later, after pci_fixup_nforce2() is called,
> something, somewhere, somehow sets the bit to 1. But this part of
> arch/i386/pci/fixup.c prevents the fixup from being applied.

You are not by chance using athcool or something to enable disconnect?

>
> /*
> * Apply fixup only if C1 Halt Disconnect is enabled
> * (bit28) because it is not supported on some boards.
> */
> vvvvvvvvvvvvvvvvv
> if ((val & (1 << 28)) && val != fixed_val) {
> printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n");
> pci_write_config_dword(dev, 0x6c, fixed_val);
> }
>
> So my question is: Is the condition necessary? If there really are boards,
> that don't support this, then is would probably have to be a more
> sophisticated test, or the fixup would have to be called again later, when
> the flag is set. BTW.: Any clue on what could possibly set the flag?

Well, I also think it is quite stupid to apply the fix only if
disconnect is enabled at boot time and not otherwise. The
kernel dev responsible for it is rather pedantic: fix only when needed,
i.e. don't apply anything preemptively (to prevent what could break);
if you change something in userspace, do it correctly. (Not the exact words,
of course, but the gist of them.) I.e. if you enable disconnect outside
of BIOS and kernel, you should also apply the fix by hand...

Easy workaround: Enable disconnect in bios, if possible, then the kernel
will fix it for you...

I admit there is logic behind the dev's point of view, nevertheless it
is not a very near-to-life-and-make-it-simpler-for-the-user logic. There
is often a difference in point of view of kernel dev and average user...

Prakash


Attachments:
signature.asc (189.00 B)
OpenPGP digital signature
Subject: Re: APIC/LAPIC hanging problems on nForce2 system.

Hi,

On Thu, 06 Jan 2005 10:03:44 +0100, Prakash K. Cheemplavam
<[email protected]> wrote:
> Martin Drab schrieb:
> > On Wed, 5 Jan 2005, Prakash K. Cheemplavam wrote:
> >
> > ...
> > DEBUG: pci_fixup_nforce2() called.
> > DEBUG: nForce2 revision byte = 0xC1.
> > DEBUG: fixed value = 0x9F01FF01.
> > DEBUG: current value = 0x8F0FFF01. <---------------
> > ...
> >
> > So that means, that the device doesn't have the "C1 Halt Disconnect"
> > enabled at that point, and, though, no fixup is done. However, if you take
> > a closer look at the result of "lspci -xxx" (attached as "lspci-xxx.log"),
> >
> > 00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev c1)
> > ...
> > 60: 08 00 01 20 20 00 88 80 10 00 00 00 01 ff 0f 9f <-------
> > ...
> >
> > you'll notice that all of a sudden bit 28 of PCI.0x6c *is* set!! That
> > means that sometime later, after pci_fixup_nforce2() is called,
> > something, somewhere, somehow sets the bit to 1. But this part of
> > arch/i386/pci/fixup.c prevents the fixup from being applied.
>
> You are not by chance using athcool or something to enable disconnect?
>
> >
> > /*
> > * Apply fixup only if C1 Halt Disconnect is enabled
> > * (bit28) because it is not supported on some boards.
> > */
> > vvvvvvvvvvvvvvvvv
> > if ((val & (1 << 28)) && val != fixed_val) {
> > printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n");
> > pci_write_config_dword(dev, 0x6c, fixed_val);
> > }
> >
> > So my question is: Is the condition necessary? If there really are boards,
> > that don't support this, then is would probably have to be a more
> > sophisticated test, or the fixup would have to be called again later, when
> > the flag is set. BTW.: Any clue on what could possibly set the flag?
>
> Well, I also think it is quite stupid to only apply the fix if
> disconnect is enabled at boot time and don't apply it if it is not. The
> kernel dev responsible for it is rather pedantic: Fix only when needed,

Hey, I only coded it because I was getting a lot of false IDE bugreports... ;-)

> ie don't apply anything in a foreseeing way (prevent what could break),
> if change something in userspace, do it correctly. (not exact words of
> course, but the conclusion of it.) Ie if you enable disconnect outside
> of bios and kernel, you should also set the fix by hand...
>
> Easy workaround: Enable disconnect in bios, if possible, then the kernel
> will fix it for you...
>
> I admit there is logic behind the dev's point of view, nevertheless it
> is not a very near-to-life-and-make-it-simpler-for-the-user logic. There
> is often a difference in point of view of kernel dev and average user...

Changing _only_ "fixup" bits seems like a reasonable compromise IMO.
Could you (or Martin) make a patch and submit it to -mm for testing?

Bartlomiej

2005-01-06 14:18:13

by Martin Drab

Subject: Re: APIC/LAPIC hanging problems on nForce2 system.



On Thu, 6 Jan 2005, Prakash K. Cheemplavam wrote:

> Martin Drab schrieb:
> > On Wed, 5 Jan 2005, Prakash K. Cheemplavam wrote:
> >
> > ...
> > DEBUG: pci_fixup_nforce2() called.
> > DEBUG: nForce2 revision byte = 0xC1.
> > DEBUG: fixed value = 0x9F01FF01.
> > DEBUG: current value = 0x8F0FFF01. <---------------
> > ...
> >
> > So that means, that the device doesn't have the "C1 Halt Disconnect"
> > enabled at that point, and, though, no fixup is done. However, if you take
> > a closer look at the result of "lspci -xxx" (attached as "lspci-xxx.log"),
> >
> > 00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?)
> > (rev c1)
> > ...
> > 60: 08 00 01 20 20 00 88 80 10 00 00 00 01 ff 0f 9f <-------
> > ...
> >
> > you'll notice that all of a sudden bit 28 of PCI.0x6c *is* set!! That
> > means that sometime later, after pci_fixup_nforce2() is called,
> > something, somewhere, somehow sets the bit to 1. But this part of
> > arch/i386/pci/fixup.c prevents the fixup from being applied.
>
> You are not by chance using athcool or something to enable disconnect?

Yes, in fact I am. So that enables it then, OK.

> >
> > /*
> > * Apply fixup only if C1 Halt Disconnect is enabled
> > * (bit28) because it is not supported on some boards.
> > */
> > vvvvvvvvvvvvvvvvv
> > if ((val & (1 << 28)) && val != fixed_val) {
> > printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect
> > fixup\n");
> > pci_write_config_dword(dev, 0x6c, fixed_val);
> > }
> >
> > So my question is: Is the condition necessary? If there really are boards,
> > that don't support this, then is would probably have to be a more
> > sophisticated test, or the fixup would have to be called again later, when
> > the flag is set. BTW.: Any clue on what could possibly set the flag?
>
> Well, I also think it is quite stupid to only apply the fix if
> disconnect is enabled at boot time and don't apply it if it is not. The
> kernel dev responsible for it is rather pedantic: Fix only when needed,
> ie don't apply anything in a foreseeing way (prevent what could break),
> if change something in userspace, do it correctly. (not exact words of
> course, but the conclusion of it.) Ie if you enable disconnect outside
> of bios and kernel, you should also set the fix by hand...
>
> Easy workaround: Enable disconnect in bios, if possible, then the kernel
> will fix it for you...

That assumes that the BIOS allows enabling it.

> I admit there is logic behind the dev's point of view, nevertheless it
> is not a very near-to-life-and-make-it-simpler-for-the-user logic. There
> is often a difference in point of view of kernel dev and average user...

Right. And how about applying the fix but leaving the disconnect bit in
its previous state? Would that help? Something like

	fixed_val = (val & (1<<28)) | (fixed_val & ~(1<<28));
	if (val != fixed_val) {
		printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n");
		pci_write_config_dword(dev, 0x6c, fixed_val);
	}
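
As a rough check of that expression, here is a standalone sketch that
computes the value that would be written. The function name is
hypothetical, and the revision-based choice of the fixed value is copied
from the existing fixup.c code quoted elsewhere in this thread.

```c
#include <stdint.h>

/*
 * Take every bit from the chip's fixed value except bit 28, which is
 * carried over from the current register value so that the disconnect
 * state is left as it was found.
 */
static uint32_t fixed_value_keeping_disconnect(uint32_t val, uint8_t rev)
{
	/* 0xC1 or greater is C18D, anything lower is treated as C17 */
	uint32_t fixed_val = rev < 0xC1 ? 0x1F01FF01u : 0x9F01FF01u;

	return (val & (1u << 28)) | (fixed_val & ~(1u << 28));
}
```

With the boot-time value 0x8F0FFF01 and revision 0xC1 this gives
0x8F01FF01: the fix is in place and disconnect stays disabled. Note that
every other bit of the register is still overwritten with the table value.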

Martin

2005-01-06 15:08:15

by Prakash K. Cheemplavam

Subject: [PATCH] Re: APIC/LAPIC hanging problems on nForce2 system.


>>Well, I also think it is quite stupid to only apply the fix if
>>disconnect is enabled at boot time and don't apply it if it is not. The
>>kernel dev responsible for it is rather pedantic: Fix only when needed,

[..]
> Changing _only_ "fixup" bits seems like a reasonable compromise IMO.
> Could you (or Martin) make a patch and submit it to -mm for testing?

OK, here it goes. It's the first time I've written a patch for the kernel, so
please don't bash me. I hope my logic is right, so please
proofread it. I haven't tested it yet...

It simplifies the function to

static void __init pci_fixup_nforce2(struct pci_dev *dev)
{
	u32 val;

	/*
	 * Chip  Old value   New value
	 * C17   0x1F0FFF01  0x1F01FF01
	 * C18D  0x9F0FFF01  0x9F01FF01
	 *
	 * Northbridge chip version may be determined by
	 * reading the PCI revision ID (0xC1 or greater is C18D).
	 */
	pci_read_config_dword(dev, 0x6c, &val);

	/*
	 * Apply fixup if needed, but don't touch disconnect state
	 */
	if ((val & 0x00FF0000) != 0x00010000) {
		printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n");
		pci_write_config_dword(dev, 0x6c, (val & 0xFF00FFFF) | 0x00010000);
	}
}



This patch applies the nForce2 C1 halt disconnect fix, no matter whether
disconnect is enabled or not. I don't know whether checking the whole
affected byte is necessary or the nibble would be enough (I am no Nvidia
engineer).
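
The masking can be tried outside the kernel with a small userspace sketch.
The two helper names are hypothetical; the constants are the ones used in
the patch.

```c
#include <stdint.h>

/* The fix is needed whenever bits 16-23 are anything other than 0x01. */
static int needs_fix(uint32_t val)
{
	return (val & 0x00FF0000u) != 0x00010000u;
}

/* Force bits 16-23 to 0x01 and leave all other bits, including bit 28, alone. */
static uint32_t apply_fix(uint32_t val)
{
	return (val & 0xFF00FFFFu) | 0x00010000u;
}
```

Both table rows come out as expected (0x1F0FFF01 becomes 0x1F01FF01,
0x9F0FFF01 becomes 0x9F01FF01), and a value with disconnect disabled such
as 0x8F0FFF01 becomes 0x8F01FF01, so bit 28 is not touched.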

Signed-off-by: Prakash Punnoor <[email protected]>

(My name is soon officially to be changed, in case you are wondering.)


--- arch/i386/pci/fixup.c.o 2005-01-06 15:43:40.535842320 +0100
+++ arch/i386/pci/fixup.c 2005-01-06 16:00:50.174313480 +0100
@@ -227,10 +227,7 @@
  */
 static void __init pci_fixup_nforce2(struct pci_dev *dev)
 {
-	u32 val, fixed_val;
-	u8 rev;
-
-	pci_read_config_byte(dev, PCI_REVISION_ID, &rev);
+	u32 val;

 	/*
 	 * Chip  Old value   New value
@@ -240,17 +237,14 @@
 	 * Northbridge chip version may be determined by
 	 * reading the PCI revision ID (0xC1 or greater is C18D).
 	 */
-	fixed_val = rev < 0xC1 ? 0x1F01FF01 : 0x9F01FF01;
-
 	pci_read_config_dword(dev, 0x6c, &val);

 	/*
-	 * Apply fixup only if C1 Halt Disconnect is enabled
-	 * (bit28) because it is not supported on some boards.
+	 * Apply fixup if needed, but don't touch disconnect state
 	 */
-	if ((val & (1 << 28)) && val != fixed_val) {
+	if ((val & 0x00FF0000) != 0x00010000) {
 		printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n");
-		pci_write_config_dword(dev, 0x6c, fixed_val);
+		pci_write_config_dword(dev, 0x6c, (val & 0xFF00FFFF) | 0x00010000);
 	}
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE2, pci_fixup_nforce2);



Attachments:
always_nforce2_c1_fix.patch (1.12 kB)
signature.asc (189.00 B)
OpenPGP digital signature

2005-01-06 23:45:31

by Andrew Morton

Subject: Re: [PATCH] Re: APIC/LAPIC hanging problems on nForce2 system.

"Prakash K. Cheemplavam" <[email protected]> wrote:
>
> This patch applies the Nforce2 C1 halt disconnect fix, no matter if
> disconnect is enabled of not. I don't know whether checking the whole
> affected byte is necessary or the nibble would be enough (I am no Nvidia
> engineer).

The patch doesn't apply to the current tree. Here's what we currently have:

static void __init pci_fixup_nforce2(struct pci_dev *dev)
{
	u32 val, fixed_val;
	u8 rev;

	pci_read_config_byte(dev, PCI_REVISION_ID, &rev);

	/*
	 * Chip  Old value   New value
	 * C17   0x1F0FFF01  0x1F01FF01
	 * C18D  0x9F0FFF01  0x9F01FF01
	 *
	 * Northbridge chip version may be determined by
	 * reading the PCI revision ID (0xC1 or greater is C18D).
	 */
	fixed_val = rev < 0xC1 ? 0x1F01FF01 : 0x9F01FF01;

	pci_read_config_dword(dev, 0x6c, &val);

	/*
	 * Apply fixup only if C1 Halt Disconnect is enabled
	 * (bit28) because it is not supported on some boards.
	 */
	if ((val & (1 << 28)) && val != fixed_val) {
		printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n");
		pci_write_config_dword(dev, 0x6c, fixed_val);
	}
}
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE2, pci_fixup_nforce2);

If you think this still needs fixing, please submit a new patch. I think
we'd need to see a better explanation of the rationale for the change as
well, please. What it does, why, how, etc.

2005-01-07 00:58:18

by Andrew Morton

Subject: Re: [PATCH] Re: APIC/LAPIC hanging problems on nForce2 system.

"Prakash K. Cheemplavam" <[email protected]> wrote:
>
> Perhaps firfox fscked up the inlined patch, so please
> try the attached version. If it goes alright, I'll resubmit it,
> inlcuding more detailed description.

There was no attachment.

Please go ahead and prepare a final patch against Linus's latest tree. The
simplest way to obtain that is via the topmost link at
http://www.kernel.org/pub/linux/kernel/v2.5/testing/cset/.

2005-01-07 02:15:48

by Prakash K. Cheemplavam

Subject: Re: [PATCH] Re: APIC/LAPIC hanging problems on nForce2 system.

Andrew Morton schrieb:
> "Prakash K. Cheemplavam" <[email protected]> wrote:
>
>>This patch applies the Nforce2 C1 halt disconnect fix, no matter if
>>disconnect is enabled of not. I don't know whether checking the whole
>>affected byte is necessary or the nibble would be enough (I am no Nvidia
>>engineer).
>
>
> The patch doesn't apply to the current tree. Here's what we currently have:

Well, I just got 2.6.10-mm1, went into its dir and here

tachyon linux-2.6.10-mm1 # patch -p0 </home/light/always_nforce2_c1_fix.patch
patching file arch/i386/pci/fixup.c
tachyon linux-2.6.10-mm1 #

it went alright. Perhaps Firefox fscked up the inlined patch, so please
try the attached version. If it goes alright, I'll resubmit it,
including a more detailed description.

Prakash


Attachments:
signature.asc (189.00 B)
OpenPGP digital signature

2005-01-07 11:48:24

by Martin Drab

Subject: Re: [PATCH] Re: APIC/LAPIC hanging problems on nForce2 system.



On Thu, 6 Jan 2005, Andrew Morton wrote:

> "Prakash K. Cheemplavam" <[email protected]> wrote:
> >
> > Perhaps firfox fscked up the inlined patch, so please
> > try the attached version. If it goes alright, I'll resubmit it,
> > inlcuding more detailed description.
>
> There was no attachment.
>
> Please go ahead and prepare a final patch against Linus's latest tree. The
> simplest way to obtain that is via the topmost link at
> http://www.kernel.org/pub/linux/kernel/v2.5/testing/cset/.

That is strange. I got it with the attachment. I tried it and it applies
to the vanilla 2.6.10-bk9 just fine with

cd /usr/src/linux
patch -p0 <always_nforce2_c1_fix.patch

Maybe the problem is that the diff was made *inside* the tree and so needs
to be applied from /usr/src/linux (or whatever your linux directory is)
with -p0. Usually patches have one more directory level, so you apply them
with -p1 or so. But otherwise it should apply. The section that you
quoted in the previous mail is exactly the one that it applies to (it
seems).

About the rationale. The problem was (as you may read in my previous mails
to LKML with this subject) that the BIOS on my board (Gigabyte GA-7NNXP)
doesn't enable the C1 Halt Disconnect bit (bit 28 of PCI reg. 0x6C).
The fix that really needs to be done for C1 Halt Disconnect
to work properly, as you may read in the comment of the original fixup
function, is changing the 3rd byte of that PCI.0x6C register from 0x0F to
0x01. The problem is that the original fixup function didn't apply the fix at
all when C1 Halt Disconnect wasn't set at the moment the function was
called (which happens only during bootup initialization of the
nForce2), and so when C1 Halt Disconnect is enabled later (e.g., by
the athcool utility), the fix isn't applied and the whole system becomes
*VERY* unstable (at least it did for me: total freeze) under heavy interrupt
load (e.g., high network load, high HDD activity, etc.).

This patch really solves the problem for me and probably for others with
an unfixed BIOS as well, and (though I'm not an nForce expert) I don't
think it can harm anyone for whom it worked before, because except for
that little difference of applying even when C1 Halt Disconnect is
disabled, it does exactly the same thing.

Martin

2005-01-07 13:54:02

by Prakash K. Cheemplavam

Subject: Re: [PATCH] Re: APIC/LAPIC hanging problems on nForce2 system.

Andrew Morton wrote:
> "Prakash K. Cheemplavam" <[email protected]> wrote:
>
>>Perhaps Firefox fscked up the inlined patch, so please
>>try the attached version. If it goes alright, I'll resubmit it,
>>including a more detailed description.
>
>
> There was no attachment.

*sigh* Not in my last email, but when I submitted the patch...

> Please go ahead and prepare a final patch against Linus's latest tree. The
> simplest way to obtain that is via the topmost link at
> http://www.kernel.org/pub/linux/kernel/v2.5/testing/cset/.

It applies cleanly there. Nevertheless, here it is once again, with more
details. If the inlined version doesn't apply, please try the attached one!

current state:
Systems with nForce2 could freeze on high disk I/O activity in APIC mode
when CPU disconnect is enabled. If the BIOS doesn't fix this, the current
kernel fix changes the registers according to the following table:

 * Chip  Old value   New value
 * C17   0x1F0FFF01  0x1F01FF01
 * C18D  0x9F0FFF01  0x9F01FF01

But this is only done if CPU disconnect has been enabled in the BIOS.
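
As a quick arithmetic check of the table, the old and new values can be
XORed to see which bits the fix touches. The helper name changed_bits() is
made up for this illustration and is not part of the kernel.

```c
#include <stdint.h>

/* Bits that differ between two register values. */
static uint32_t changed_bits(uint32_t old_val, uint32_t new_val)
{
    return old_val ^ new_val;
}
```

For both chips the result is 0x000E0000, i.e. only the third byte changes
(0x0F becomes 0x01), while the C17 and C18D values themselves differ only
in bit 31.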


why change this:
If CPU disconnect is not enabled in the BIOS, and the BIOS is broken (some
manufacturers like Abit don't care about their customers and even the
latest BIOS doesn't fix this; I have an Abit mainboard), the kernel
doesn't apply the fix, so if CPU disconnect is enabled at a later stage
(in userspace), the system will be unstable and most likely freeze.

new behaviour:
The fix is now applied regardless of whether CPU disconnect is enabled at
boot time. As only byte 3 has to be changed to 0x01, reading out the
chipset version isn't needed, so the patch simplifies the fix. Now
turning CPU disconnect on at a later stage won't break the system, and if
it was already enabled, it gets fixed, as with the old version.
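
The new behaviour can be sketched in userspace C as follows. This is a
model of the masking logic only, under the assumption that the register
is handed in as a plain 32-bit value; needs_fixup() and new_fixup() are
hypothetical names, not kernel functions.

```c
#include <stdint.h>

#define BYTE3_MASK  UINT32_C(0x00FF0000)  /* bits 16-23 of PCI reg. 0x6C */
#define BYTE3_FIXED UINT32_C(0x00010000)  /* required value 0x01 in that byte */

/* Return 1 if the register still needs the fix, 0 otherwise. */
static int needs_fixup(uint32_t val)
{
    return (val & BYTE3_MASK) != BYTE3_FIXED;
}

/* Rewrite only the third byte; every other bit, including bit 28, is kept. */
static uint32_t new_fixup(uint32_t val)
{
    if (needs_fixup(val))
        return (val & ~BYTE3_MASK) | BYTE3_FIXED;
    return val;
}
```

Because only the third byte is rewritten, the same code maps 0x1F0FFF01
to 0x1F01FF01 and 0x9F0FFF01 to 0x9F01FF01 without knowing the chip
revision, and a register with bit 28 clear (for example 0x0F0FFF01) gets
the fix while its disconnect state stays off.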


Signed-off-by: Prakash Punnoor <[email protected]>


--- arch/i386/pci/fixup.c.o	2005-01-06 15:43:40.535842320 +0100
+++ arch/i386/pci/fixup.c	2005-01-06 16:00:50.174313480 +0100
@@ -227,10 +227,7 @@
  */
 static void __init pci_fixup_nforce2(struct pci_dev *dev)
 {
-	u32 val, fixed_val;
-	u8 rev;
-
-	pci_read_config_byte(dev, PCI_REVISION_ID, &rev);
+	u32 val;

 	/*
 	 * Chip  Old value   New value
@@ -240,17 +237,14 @@
 	 * Northbridge chip version may be determined by
 	 * reading the PCI revision ID (0xC1 or greater is C18D).
 	 */
-	fixed_val = rev < 0xC1 ? 0x1F01FF01 : 0x9F01FF01;
-
 	pci_read_config_dword(dev, 0x6c, &val);

 	/*
-	 * Apply fixup only if C1 Halt Disconnect is enabled
-	 * (bit28) because it is not supported on some boards.
+	 * Apply fixup if needed, but don't touch disconnect state
 	 */
-	if ((val & (1 << 28)) && val != fixed_val) {
+	if ((val & 0x00FF0000) != 0x00010000) {
 		printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n");
-		pci_write_config_dword(dev, 0x6c, fixed_val);
+		pci_write_config_dword(dev, 0x6c, (val & 0xFF00FFFF) | 0x00010000);
 	}
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE2, pci_fixup_nforce2);


Attachments:
always_nforce2_c1_fix.patch (1.12 kB)
signature.asc (189.00 B)
OpenPGP digital signature
Subject: Re: [PATCH] Re: APIC/LAPIC hanging problems on nForce2 system.

On Fri, 07 Jan 2005 14:53:42 +0100, Prakash K. Cheemplavam
<[email protected]> wrote:
> Andrew Morton wrote:
> > "Prakash K. Cheemplavam" <[email protected]> wrote:
> >
> >>Perhaps Firefox fscked up the inlined patch, so please
> >>try the attached version. If it goes alright, I'll resubmit it,
> >>including a more detailed description.
> >
> >
> > There was no attachment.
>
> *sigh* Not in my last email, but when I submitted the patch...
>
> > Please go ahead and prepare a final patch against Linus's latest tree. The
> > simplest way to obtain that is via the topmost link at
> > http://www.kernel.org/pub/linux/kernel/v2.5/testing/cset/.
>
> It applies cleanly there. Nevertheless, here it is once again, with more
> details. If the inlined version doesn't apply, please try the attached one!
>
> current state:
> Systems with nForce2 could freeze on high disk I/O activity in APIC mode
> when CPU disconnect is enabled. If the BIOS doesn't fix this, the current
> kernel fix changes the registers according to the following table:
>
>  * Chip  Old value   New value
>  * C17   0x1F0FFF01  0x1F01FF01
>  * C18D  0x9F0FFF01  0x9F01FF01
>
> But this is only done if CPU disconnect has been enabled in the BIOS.
>
> why change this:
> If CPU disconnect is not enabled in the BIOS, and the BIOS is broken (some
> manufacturers like Abit don't care about their customers and even the
> latest BIOS doesn't fix this; I have an Abit mainboard), the kernel
> doesn't apply the fix, so if CPU disconnect is enabled at a later stage
> (in userspace), the system will be unstable and most likely freeze.
>
> new behaviour:
> The fix is now applied regardless of whether CPU disconnect is enabled at
> boot time. As only byte 3 has to be changed to 0x01, reading out the
> chipset version isn't needed, so the patch simplifies the fix. Now
> turning CPU disconnect on at a later stage won't break the system, and if
> it was already enabled, it gets fixed, as with the old version.
>
>
> Signed-off-by: Prakash Punnoor <[email protected]>

Patch looks fine (thanks!) and since I added the original quirk...

Acked-by: Bartlomiej Zolnierkiewicz <[email protected]>

> --- arch/i386/pci/fixup.c.o	2005-01-06 15:43:40.535842320 +0100
> +++ arch/i386/pci/fixup.c	2005-01-06 16:00:50.174313480 +0100
> @@ -227,10 +227,7 @@
>   */
>  static void __init pci_fixup_nforce2(struct pci_dev *dev)
>  {
> -	u32 val, fixed_val;
> -	u8 rev;
> -
> -	pci_read_config_byte(dev, PCI_REVISION_ID, &rev);
> +	u32 val;
>
>  	/*
>  	 * Chip  Old value   New value
> @@ -240,17 +237,14 @@
>  	 * Northbridge chip version may be determined by
>  	 * reading the PCI revision ID (0xC1 or greater is C18D).
>  	 */
> -	fixed_val = rev < 0xC1 ? 0x1F01FF01 : 0x9F01FF01;
> -
>  	pci_read_config_dword(dev, 0x6c, &val);
>
>  	/*
> -	 * Apply fixup only if C1 Halt Disconnect is enabled
> -	 * (bit28) because it is not supported on some boards.
> +	 * Apply fixup if needed, but don't touch disconnect state
>  	 */
> -	if ((val & (1 << 28)) && val != fixed_val) {
> +	if ((val & 0x00FF0000) != 0x00010000) {
>  		printk(KERN_WARNING "PCI: nForce2 C1 Halt Disconnect fixup\n");
> -		pci_write_config_dword(dev, 0x6c, fixed_val);
> +		pci_write_config_dword(dev, 0x6c, (val & 0xFF00FFFF) | 0x00010000);
>  	}
>  }
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE2, pci_fixup_nforce2);