2002-09-25 13:20:28

by Jan Kasprzak

Subject: AMD 768 erratum 10 (solved: AMD 760MPX DMA lockup)

Hello, all!

Two weeks ago I posted the following message to the LKML:

[...]
: my dual athlon box is unstable in some situations. I can consistently
: lock it up by running the following code:
:
: fd = open("/dev/hda3", O_RDWR);
: for (i=0; i<1024*1024; i++) {
: read(fd, buffer, 8192);
: lseek(fd, -8192, SEEK_CUR);
: write(fd, buffer, 8192);
: }
[...]
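
The quoted fragment omits declarations and error checking. A minimal self-contained sketch of the same read/seek-back/write loop follows; the function name `rewrite_loop` and its parameters are illustrative, not part of the original test. It rewrites the target in place, so it should only be pointed at a scratch device or file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read one block, seek back over it, and write the same block again,
 * repeating `iterations` times. The file offset advances by one block
 * per iteration, as in the quoted loop.
 * Returns 0 on success, -1 on the first failed or short call.
 */
int rewrite_loop(const char *path, size_t block, long iterations)
{
    assert(path != NULL && block > 0);

    char *buffer = malloc(block);
    if (buffer == NULL)
        return -1;

    int fd = open(path, O_RDWR);
    if (fd < 0) {
        free(buffer);
        return -1;
    }

    int rc = 0;
    for (long i = 0; i < iterations; i++) {
        if (read(fd, buffer, block) != (ssize_t)block ||
            lseek(fd, -(off_t)block, SEEK_CUR) == (off_t)-1 ||
            write(fd, buffer, block) != (ssize_t)block) {
            rc = -1;   /* short read/write or seek error */
            break;
        }
    }

    close(fd);
    free(buffer);
    return rc;
}
```

Calling `rewrite_loop("/dev/hda3", 8192, 1024L * 1024)` would mirror the parameters of the quoted test.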

I think I have been hit by AMD 768 southbridge erratum number 10.
After plugging in the PS/2 mouse, the server is able to run 10 iterations
of bonnie++ without any problem (without the PS/2 mouse it locks up in the
first or second iteration).

Could everyone who replied that the above code works on their
760MPX-based system please re-run it (or run the bonnie++ benchmark
several times in a loop), but _without_ the PS/2 mouse connected?

Since this is an official AMD erratum, we should have a workaround
for it, or at least a big fat warning during boot when the 768
southbridge is detected, something like the following:

WARNING: Using the system with AMD 768 southbridge without the PS/2
WARNING: mouse plugged in can cause instabilities. See the AMD 768 erratum #10
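
A check along these lines could decide when to print such a warning. This is only a sketch: it assumes the southbridge can be recognized by its ISA bridge function with PCI vendor ID 0x1022 (AMD) and device ID 0x7440. The device ID is an assumption that should be verified against the PCI ID list, the function names are hypothetical, and real kernel code would hook into the PCI quirk machinery rather than use stdio.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VENDOR_AMD        0x1022u  /* AMD's PCI vendor ID */
#define DEVICE_AMD768_ISA 0x7440u  /* ASSUMED device ID of the AMD-768 ISA bridge */

/* True if the vendor/device pair matches the assumed AMD-768 ID. */
bool is_amd768(uint16_t vendor, uint16_t device)
{
    return vendor == VENDOR_AMD && device == DEVICE_AMD768_ISA;
}

/* Print the proposed warning for a matching device; returns true if printed. */
bool warn_if_amd768(uint16_t vendor, uint16_t device, FILE *out)
{
    if (!is_amd768(vendor, device))
        return false;
    fputs("WARNING: Using the system with AMD 768 southbridge without the PS/2\n"
          "WARNING: mouse plugged in can cause instabilities. See the AMD 768 erratum #10\n",
          out);
    return true;
}
```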

The AMD 768 Revision Guide is at the following URL:

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24472.pdf

Erratum #10 is described on page 7 (pstotext output, manually edited):

: 10 Multiprocessor System May Hang While in FULL APIC Mode
: and IOAPIC Interrupt is Masked
:
: Products Affected. B1, B2
:
: Normal Specified Operation. The AMD-768 peripheral bus controller is
: designed to support FULL APIC mode in multiprocessor systems for system
: management events. If an interrupt is masked in the APIC controller of
: the AMD-768, then the corresponding interrupt message should not be
: sent to the processor via the 3-wire APIC bus.
:
: Non-conformance. The AMD-768 peripheral bus controller will send an
: interrupt message via the 3-wire APIC bus regardless if the interrupt
: is masked or not.
:
: Potential Effect on System. Since the processor had previously masked
: the APIC interrupt, it is not expecting to receive future APIC messages
: for the masked interrupt. The APIC controller will continuously send
: the interrupt message via the 3-wire bus until a processor accepts the
: message, causing the system to hang.
:
: A system hang has been observed when executing a server shutdown
: command in Novell Netware versions 5.0 or 5.1 while using a serial
: mouse. During the server shutdown sequence, software writes an invalid
: CPU ID to the IOAPIC redirection table, and the system does not
: complete the shutdown.
:
: Note: No failure has been observed when using a PS/2 mouse.
:
: Suggested Workaround. None.
:
: Resolution Status: No fix planned.


--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Czech Linux Homepage: http://www.linux.cz/ |
|----------- If you want the holes in your knowledge showing up -----------|
|----------- try teaching someone. -- Alan Cox -----------|


2002-09-26 15:00:35

by Alan

Subject: Re: AMD 768 erratum 10 (solved: AMD 760MPX DMA lockup)

Looks like the obvious fix is to simply disable the APIC on all such
systems

2002-09-26 15:28:16

by Dave Jones

Subject: Re: AMD 768 erratum 10 (solved: AMD 760MPX DMA lockup)

On Thu, Sep 26, 2002 at 04:08:10PM +0100, Alan Cox wrote:
> Looks like the obvious fix is to simply disable the APIC on all such
> systems

Converting a *lot* of MP systems to UP due to an erratum
that only occurs with no PS/2 mouse seems a bit extreme.
Can we safely probe the PS/2 port to see if it's empty or not
and do a runtime APIC/SMP disable really early in the boot?
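
The decision being proposed can be sketched as a pure function. Both inputs are hypothetical: whether the chipset is present would come from PCI detection, and whether the PS/2 port can be probed reliably this early is exactly the open question. The names below are illustrative only.

```c
#include <stdbool.h>

enum irq_setup {
    IRQ_SETUP_APIC,     /* leave the APIC enabled */
    IRQ_SETUP_NO_APIC   /* fall back: disable the APIC early in boot */
};

/*
 * Disable the APIC only on affected chipsets that have no PS/2 mouse
 * attached; every other system keeps its normal configuration.
 */
enum irq_setup choose_irq_setup(bool amd768_present, bool ps2_mouse_present)
{
    if (amd768_present && !ps2_mouse_present)
        return IRQ_SETUP_NO_APIC;
    return IRQ_SETUP_APIC;
}
```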

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-09-26 16:03:26

by Manfred Spraul

Subject: Re: AMD 768 erratum 10 (solved: AMD 760MPX DMA lockup)

The erratum is not PS/2 mouse specific:
it says that the IO-APIC doesn't implement interrupt masking correctly.

Linux uses masking aggressively - disable_irq() is implemented by
masking the interrupt in the io apic. I'm surprised that this doesn't
cause frequent problems. Perhaps the problem only occurs if an invalid
cpu id is written into the target register, as done by Netware?
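
To make the difference concrete, here is a toy model (not kernel code) of the specified behavior versus the behavior the erratum text describes. The mask bit is bit 16 of the low word of an IO-APIC redirection table entry; the function names are made up for illustration, and the model says nothing about when the real hardware actually misbehaves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask bit in the low 32 bits of an IO-APIC redirection table entry. */
#define IOAPIC_REDIR_MASKED (1u << 16)

/* Specified behavior: a masked entry never generates an interrupt message. */
bool sends_message_spec(uint32_t redir_low, bool irq_asserted)
{
    return irq_asserted && !(redir_low & IOAPIC_REDIR_MASKED);
}

/* Behavior as described by erratum #10: the mask bit is ignored. */
bool sends_message_erratum(uint32_t redir_low, bool irq_asserted)
{
    (void)redir_low;  /* mask state has no effect in this model */
    return irq_asserted;
}
```

With a masked entry and an asserted IRQ line, the first function reports no message while the second still reports one, which is the case disable_irq() would run into.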

Is someone around with a ne2k-pci card and an AMD-760MPX-based system?

Regarding Jan's problem: I'm not sure if his problems are related to
this erratum. It says that using a PS/2 mouse instead of a serial mouse
with Novell Netware avoids the hang during shutdown, probably because
Netware then doesn't mask the IRQ.

--
Manfred

2002-09-26 16:42:26

by Alan

Subject: Re: AMD 768 erratum 10 (solved: AMD 760MPX DMA lockup)

On Thu, 2002-09-26 at 16:34, Dave Jones wrote:
> Converting a *lot* of MP systems to UP due to an erratum
> that only occurs with no PS/2 mouse seems a bit extreme.

It would help no end in reducing power bills 8)

I'm just talking about keeping the system running SMP with PIC mode
interrupts.

2002-09-27 06:41:21

by Jan Kasprzak

Subject: Re: AMD 768 erratum 10 (solved: AMD 760MPX DMA lockup)

Manfred Spraul wrote:
: The errata is not PS/2 mouse specific:
: it says that the io apic doesn't implement masking interrupts correctly.

Yes, but it says that the problem has not been observed when running
with a PS/2 mouse. Which is exactly what I observe on my system.
:
: Linux uses masking aggressively - disable_irq() is implemented by
: masking the interrupt in the io apic. I'm surprised that this doesn't
: cause frequent problems.

The erratum does not say that interrupt masking in the IO-APIC
fails in all situations. I read it as: the lock-up (caused by
nonfunctional IRQ masking) occurs sometimes, nobody knows when,
and it has been observed on some Netware boxes without the PS/2 mouse.
Maybe even AMD does not know exactly what is going on there.

: Regarding Jan's problem: I'm not sure if his problems are related to
: this erratum. It says that using a PS/2 mouse instead of a serial mouse
: with Novell Netware avoids the hang during shutdown, probably because
: Netware then doesn't mask the IRQ.

Yes, it may be a faulty board as well, but I think the symptoms
are too close to this AMD erratum to be a coincidence.

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Czech Linux Homepage: http://www.linux.cz/ |
|----------- If you want the holes in your knowledge showing up -----------|
|----------- try teaching someone. -- Alan Cox -----------|

2002-09-27 14:19:20

by Bruno A. Crespo

Subject: Re: AMD 768 erratum 10 (solved: AMD 760MPX DMA lockup)

Jan Kasprzak wrote:

> Manfred Spraul wrote:
> : The errata is not PS/2 mouse specific:
> : it says that the io apic doesn't implement masking interrupts correctly.
>
> Yes, but it says that problem has not been observed when running
> with PS/2 mouse. Which is exactly what I observe on my system.

I observed the same problem with an MSI K7D Master mainboard, and
plugging in a PS/2 mouse fixes the problem.

BTW: I can also fix the problem by unplugging the AGP card.


Bruno

2002-09-30 21:42:23

by Maxwell Spangler

Subject: Re: AMD 768 erratum 10 (solved: AMD 760MPX DMA lockup)


Does this mouse have to be actively used?

Could one plug a cheap ps/2 mouse into that port and yet continue to use a USB
mouse if X is configured to use it?

What is actually happening that causes the ps/2 mouse to cure this or any
other problem of this nature?

Curious.

On Fri, 27 Sep 2002, Jan Kasprzak wrote:

> Manfred Spraul wrote:
> : The errata is not PS/2 mouse specific:
> : it says that the io apic doesn't implement masking interrupts correctly.
>
> Yes, but it says that problem has not been observed when running
> with PS/2 mouse. Which is exactly what I observe on my system.
> :
> : Linux uses masking aggressively - disable_irq() is implemented by
> : masking the interrupt in the io apic. I'm surprised that this doesn't
> : cause frequent problems.
>
> The errata does not say that the interrupt masking in the IO-APIC
> does not work in all situations. I read it as the lock-up (caused by
> nonfunctional IRQ masking) sometimes occurs, nobody knows when,
> and it has been observed on some Netware boxes w/o the PS/2 mouse.
> Maybe even AMD does not know exactly what is going on there.
>
> : Regarding Jan's problem: I'm not sure if his problems are related to
> : this erratum. It says that using a PS/2 mouse instead of a serial mouse
> : with Novell Netware avoids the hang during shutdown, probably because
> : Netware then doesn't mask the IRQ.
>
> Yes, it may be a faulty board as well, but I think it is
> too close to this AMD errata.
>
> -Yenya

-- ----------------------------------------------------------------------------
Maxwell Spangler
Program Writer
Greenbelt, Maryland, U.S.A.
Washington D.C. Metropolitan Area