2004-06-01 21:47:54

by Arthur Perry

Subject: Re: GART Error 11

Hi Saurabh,

I am working on this issue as we speak.
It is interesting that your machine crashes entirely with iommu disabled.

I am starting to think there is more to this than just the kernel
misreporting other hardware errors (being improperly decoded as GART
errors).
On my machine, I am actually getting GART errors on 3 out of 4 CPUs when I
use Red Hat's 2.4.21-9EL kernel. This same kernel, when rebuilt from source
without AGP support, does not produce GART errors.

Here is my extended error code (bits 19-16 on 0:[18,19,1b]:3 at offset 0x44):
0101 = GART error

So, this is not a translation issue on my side.
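
For anyone decoding these values by hand, the check described above amounts to extracting a four-bit field. Below is a minimal Python sketch, assuming only the layout stated in this message (extended error code in bits 19-16, with 0101 meaning GART error). The function and constant names are made up for illustration and are not taken from the kernel.

```python
# Hypothetical helper names; only the bit positions come from the message above.
GART_ERROR = 0b0101  # per the message above: 0101 = GART error


def extended_error_code(status_low: int) -> int:
    """Return bits 19-16 of a 32-bit register value."""
    return (status_low >> 16) & 0xF


def is_gart_error(status_low: int) -> bool:
    """True when the extended error code field equals the GART code."""
    return extended_error_code(status_low) == GART_ERROR


if __name__ == "__main__":
    sample = 0x00050000  # made-up value with only the 0101 field set
    print(format(extended_error_code(sample), "04b"), is_gart_error(sample))
```

Any value returned by pcitweak can be pasted in place of the sample value.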

Can you do this for me?

pcitweak -r 0:18:3 0x44
and
pcitweak -r 0:19:3 0x44


Thanks!


Arthur Perry
Lead Linux Developer / Linux Systems Architect
Validation, CSU Celestica
Sair/Linux Gnu Certified Professional
Providing professional Linux solutions for 7+ years

On Tue, 1 Jun 2004, Saurabh Barve wrote:

> Hi,
>
> I know this has been posted before on this list, but the solution
> suggested does not seem to work for me.
>
> I have a dual opteron system with 8 GB of RAM. I am running RHEL 3.0 AS on
> it. The kernel version is 2.4.21-4.ELsmp. The motherboard I am using is
> the Tyan Thunder K8S Pro - 2882 motherboard.
>
> I am getting the following error every two minutes or so:
>
> GART error 11
> Lost an northbridge error
> NB error address some-hex-number
> Error uncorrected
>
> I checked the various postings on the list, and someone suggested that
> passing iommu=off option to the kernel solved the problem for him.
> However, when I tried that, it got the kernel to panic. I read somewhere
> that a newer kernel would fix these 'bugs' in the default RHEL kernel.
> However, I am using the onboard SATA controller for my hard disks. This
> requires binary drivers from Tyan. I already downloaded a newer kernel,
> however, it breaks the drivers, so I can't boot into the new kernel.
>
> Here is my output from lspci:
>
> 00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
> 00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
> 00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
> 00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
> 00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
> 00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge
> (rev 12)
> 00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
> 00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge
> (rev 12)
> 00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
> 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
> 02:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
> Gigabit Ethernet (rev 03)
> 02:09.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
> Gigabit Ethernet (rev 03)
> 03:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
> 03:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
> 03:05.0 Unknown mass storage controller: Silicon Image, Inc. (formerly CMD
> Technology Inc) Silicon Image SiI 3114 SATARaid Controller (rev 02)
> 03:06.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
> 03:08.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev
> 10)
>
> The dmesg output was too large to include inline, so I am attaching it as
> a text file.
>
> I tried passing the following options to the kernel:
>
> iommu=noagp
> iommu=noforce
> iommu=off (results in kernel-panic)
> mce=off
> mce=0
>
> I tried all the above in various combinations, but none of them worked.
> The machine doesn't crash, and everything else seems to work fine, but I'd
> like to get rid of these errors.
>
> There are some snippets from the dmesg output that I found to be of
> interest:
>
> ------------------------------------------------------------
> Linux agpgart interface v0.99 (c) Jeff Hartmann
> agpgart: Maximum main memory to use for agp memory: 7956M
> agpgart: no supported devices found.
> PCI-DMA: Disabling AGP.
> PCI-DMA: aperture base @ 10000000 size 65536 KB
> PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
> -----------------------------------------------------------
>
> -----------------------------------------------------------
>
> GART error 11
> Lost an northbridge error
> NB error address 00000000fbfe4398
> Error uncorrected
> Northbridge status a40000000005001b
>
> ----------------------------------------------------------
>
>
> Any suggestions?
>
> Thanks,
> Saurabh.
>


2004-06-01 21:54:57

by Arthur Perry

Subject: Re: GART Error 11

Hi Saurabh,

I almost forgot.
Can you also tell me which AMD CPUs you are using?
Preferably by number if you know it (starts with OSA, I believe), or at
least the CPU speed.
Thanks!

Arthur Perry


2004-06-01 22:54:30

by Saurabh Barve

Subject: Re: GART Error 11

Arthur,

Here is all the information I have, right off the bat:

AMD Opteron Model 246, 1 MB L2 Cache 64-bit processor
Model : AMD Opteron Model 246
Core : Hammer
Operating Frequency : 2 GHz
Cache : L1/128K, L2/1024K
Socket: Socket 940

Is that info enough?

I just remembered something while looking at /var/log/dmesg again. There was
a line saying that the IOMMU was not enabled in my BIOS and that I should
enable it. However, I can't see any option in my BIOS for enabling or
disabling the IOMMU.

Thanks,
Saurabh.


--
===============================================================================
Saurabh Barve Phone:
System Administrator/Data Specialist 970-491-7714 (voice)
Montgomery Research Group, 970-491-8449 (Fax)
Atmospheric Sciences Department,
Fort Collins, Colorado
Colorado State University

Mail : [email protected]
Web : http://fjortoft.atmos.colostate.edu/~sa

2004-06-01 23:07:33

by Saurabh Barve

Subject: Re: GART Error 11

Arthur,

Here are the results that I got:

> Can you do this for me?
>
> pcitweak -r 0:18:3 0x44

0x02400040

> and
> pcitweak -r 0:19:3 0x44

0x02400040

Hope this helps,
Saurabh.



2004-06-02 14:15:38

by Arthur Perry

Subject: Re: GART Error 11


Hello,

Oops. Sorry, I made a mistake in my earlier messages.
It was after 5pm yesterday, and it was a long day...
It's not offset 0x44 that we are interested in.
My listings were from offset 0x48, which is the MCA NB Status Low register.
Sorry, I did not mean to confuse anybody.

So Saurabh, can you please do this again with the corrected lines?

pcitweak -r 0:18:3 0x48
and
pcitweak -r 0:19:3 0x48

While you are at it, can you send us status high as well?

pcitweak -r 0:18:3 0x4c
and
pcitweak -r 0:19:3 0x4c


Thanks, and sorry about the confusion.

Arthur Perry





2004-06-02 17:14:54

by Saurabh Barve

Subject: Re: GART Error 11

Sorry about the delay in my reply. Just got in to work!
Here is the output:

> pcitweak -r 0:18:3 0x48

0x0005001B

> and
> pcitweak -r 0:19:3 0x48

0x00000000

> While you are at it, can you send us status high as well?
>
> pcitweak -r 0:18:3 0x4c

0xA4000000

> and
> pcitweak -r 0:19:3 0x4c

0x00000000
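
The two halves reported for 0:18:3 line up with the dmesg line quoted earlier in the thread ("Northbridge status a40000000005001b"): the value read at 0x4c (status high) forms the upper 32 bits and the value read at 0x48 (status low) forms the lower 32 bits. Here is a small Python sketch of that arithmetic; the function name is invented for the example.

```python
def combine_status(high: int, low: int) -> int:
    """Join two 32-bit register reads into one 64-bit status value."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


if __name__ == "__main__":
    # Values reported for 0:18:3 in this message (0x4c and 0x48).
    status = combine_status(0xA4000000, 0x0005001B)
    print(format(status, "016x"))  # a40000000005001b
```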

I don't know if this would help, but below is a part of my cronwatch log:

--------------------- Init Begin ------------------------

**Unmatched Entries**
Trying to re-exec init
Trying to re-exec init

---------------------- Init End -------------------------


--------------------- Kernel Begin ------------------------


WARNING: Kernel Errors Present
uteval-0098: *** Error: Method executio...: 4Time(s)
psparse-1121: *** Error: Method executio...: 8Time(s)
Error uncorrected...: 538Time(s)
GART error 11...: 538Time(s)
Lost an northbridge error...: 538Time(s)
NB error address 00000000...: 538Time(s)

---------------------- Kernel End -------------------------


--------------------- ModProbe Begin ------------------------


Can't locate these modules:
char-major-10-134: 4 Time(s)
sound-service-0-3: 6 Time(s)
xp0: 3 Time(s)
sound-slot-0: 6 Time(s)
char-major-188: 15 Time(s)

---------------------- ModProbe End -------------------------


Thanks,
Saurabh.


2004-06-02 18:37:51

by Arthur Perry

Subject: Re: GART Error 11

Hi Saurabh,

Thanks. It looks like you also have true GART errors as reported by hardware, on CPU0.
So our common failure mode here is actual GART errors and not something else being reported as a GART error because of erroneous kernel translation.
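
A rough way to see which node is reporting, using the two 0x48 readings from this thread. This is only a sketch: the dictionary keys are the PCI device numbers used in the pcitweak commands (0x18 and 0x19), the helper name is invented, and the 0101 code is the one given earlier in the thread.

```python
GART_ERROR = 0x5  # extended error code 0101, found in bits 19-16


def devices_with_gart_error(status_low_by_device):
    """Return the PCI device numbers whose status-low value carries the GART code."""
    return sorted(
        dev
        for dev, low in status_low_by_device.items()
        if (low >> 16) & 0xF == GART_ERROR
    )


if __name__ == "__main__":
    # Status-low readings posted for 0:18:3 and 0:19:3.
    readings = {0x18: 0x0005001B, 0x19: 0x00000000}
    print([hex(dev) for dev in devices_with_gart_error(readings)])
```

With those readings only device 0x18 is selected, which matches the "on CPU0" conclusion above.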

It's possible that a device driver somewhere is misbehaving and using the
GART or IOMMU improperly. My guess is that it may be the AGP device driver
shipped by Red Hat, i.e., they may not have patched in the most recent
version, which may contain a lot of fixes.

Thanks for your feedback.

As for making your messages go away, I would tell you to disable the GartTableWalk in MCE, but that does not seem to work on my machine.
I'll let you know what does work without turning off Northbridge MC* entirely once I discover it.
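
As an illustration of what disabling the GART table walk check amounts to, the sketch below clears a single enable bit in a northbridge machine-check control value and leaves every other check enabled. It only does the bit arithmetic and does not touch any hardware. The bit position (10) is an assumption that should be verified against AMD's documentation, the starting value is hypothetical, and the names are invented for this example. As noted above, this approach did not appear to work on Arthur's machine.

```python
# ASSUMPTION: the GART table walk enable bit sits at position 10 of the
# northbridge machine-check control register. Verify before using.
GART_TBL_WK_EN_BIT = 10


def without_gart_table_walk(ctl_value):
    """Return the 32-bit control value with only the GART table walk bit cleared."""
    return ctl_value & ~(1 << GART_TBL_WK_EN_BIT) & 0xFFFFFFFF


# Hypothetical starting value with the low 13 enable bits set;
# this was NOT read from a real machine.
example_ctl = 0x00001FFF
new_ctl = without_gart_table_walk(example_ctl)
print(f"{example_ctl:#010x} -> {new_ctl:#010x}")
```

Masking one bit rather than writing zero to the whole register keeps the other northbridge checks (ECC and so on) active, which is the point of not turning off Northbridge MC* entirely.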

-Arthur Perry




2004-06-02 18:43:40

by Arthur Perry

Subject: Re: GART Error 11

Or actually, I should say, you "most likely" have this as well, since I asked you to gather the information through the more quirky interface.
The bits for this error case match perfectly, so I'd say it's probably a good bet.

Arthur Perry



2004-06-02 18:49:24

by Saurabh Barve

Subject: Re: GART Error 11

Thanks Arthur!

The machine seems to work except for the errors. Is there a way to update
the drivers in the OS without having to upgrade the kernel? I guess we'll
first have to find out which driver is misbehaving!

I'll try the 'mce=off' and 'iommu=off' options again. I'll keep you
posted.

Thanks again,
Saurabh.

