2002-09-12 14:08:18

by Jan Kasprzak

Subject: AMD 760MPX DMA lockup

Hello, kernel hackers,

my dual athlon box is unstable in some situations. I can consistently
lock it up by running the following code:

fd = open("/dev/hda3", O_RDWR);
for (i=0; i<1024*1024; i++) {
	read(fd, buffer, 8192);
	lseek(fd, -8192, SEEK_CUR);
	write(fd, buffer, 8192);
}
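
For anyone who wants to try the loop above on their own hardware, here is a
fuller sketch with error checking. The function name rewrite_in_place(), the
max_blocks parameter and the return convention are assumptions made for
illustration, not part of the original test program. Like the original, it
overwrites blocks in place, so it should only be pointed at a scratch
partition or file.

```c
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

#define BLOCK_SIZE 8192

/*
 * Read each 8 KiB block and write it straight back to the same offset,
 * as in the loop above. Returns the number of read/write iterations
 * completed, or -1 on an I/O error. Name and return convention are
 * illustrative only.
 */
long rewrite_in_place(const char *path, long max_blocks)
{
	static char buffer[BLOCK_SIZE];
	long done = 0;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;

	while (done < max_blocks) {
		ssize_t n = read(fd, buffer, BLOCK_SIZE);

		if (n < 0)
			goto fail;
		if (n == 0)	/* end of device or file */
			break;
		/* Step back over what was just read, then write it back. */
		if (lseek(fd, -n, SEEK_CUR) == (off_t)-1)
			goto fail;
		if (write(fd, buffer, (size_t)n) != n)
			goto fail;
		done++;
	}
	close(fd);
	return done;

fail:
	close(fd);
	return -1;
}
```

Calling rewrite_in_place("/dev/hda3", 1024 * 1024) would correspond to the
loop above. On a 32-bit system it would presumably need to be built with
-D_FILE_OFFSET_BITS=64 so that lseek() can handle offsets past 2GB.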

It locks up in a minute or so (solid lock up, it does not react even
to a NumLock key or console switching). It can surely be a HW problem
(this is a new box), but how to tell whether this is the case?

The mainboard is MSI K7D Master, AMD 760MPX chipset, 460W power supply,
1GB RAM.

The box survived a whole night of memtest86 and a whole night of three kernel
compiles running in parallel in an infinite loop.

This problem is on many recent kernels (tried 2.4.18-11 from RedHat "null",
2.4.20-pre5-ac1, 2.4.20-pre5-ac5, 2.4.20-pre6). It does not matter whether
I compile the kernel SMP or UP, with or without CONFIG_HIGHMEM.

I tried several disks (WD1200JB, WD1200BB, IBM 120GXP).
I tried to remove all other PCI cards and 512MB of RAM. No change.
I created an ext3 filesystem on /dev/hda3, mounted it
as /mnt, created a big file /mnt/bigfile and ran the above code
on /mnt/bigfile. The system still locks up.

I tried putting the tested disk on a separate IDE controller
(Promise PDC20269 PCI card) - then I do not get a complete lockup,
just the drive starts to complain about the DMA timeout, and the kernel
resets the controller. However, DMA timeouts start to occur even on
the primary controller.

When I switch off the DMA (hdparm -d0 /dev/hda), the problem goes away
(however, the disk is very slow, as expected).

Is anybody able to run the above code on an AMD 760MPX-based system?
Is it a kernel problem or hardware problem?

Thanks in advance,

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Czech Linux Homepage: http://www.linux.cz/ |
|----------- If you want the holes in your knowledge showing up -----------|
|----------- try teaching someone. -- Alan Cox -----------|


2002-09-12 15:03:36

by Justin Cormack

Subject: Re: AMD 760MPX DMA lockup

>
> Hello, kernel hackers,
>
> my dual athlon box is unstable in some situations. I can consistently
> lock it up by running the following code:
>
> fd = open("/dev/hda3", O_RDWR);
> for (i=0; i<1024*1024; i++) {
> 	read(fd, buffer, 8192);
> 	lseek(fd, -8192, SEEK_CUR);
> 	write(fd, buffer, 8192);
> }
>
> It locks up in a minute or so (solid lock up, it does not react even
> to a NumLock key or console switching). It can surely be a HW problem
> (this is a new box), but how to tell whether this is the case?
>
> The mainboard is MSI K7D Master, AMD 760MPX chipset, 460W power supply,
> 1GB RAM.
>
> The box survived a whole night of memtest86 and a whole night of three kernel
> compiles running in parallel in an infinite loop.
>
> This problem is on many recent kernels (tried 2.4.18-11 from RedHat "null",
> 2.4.20-pre5-ac1, 2.4.20-pre5-ac5, 2.4.20-pre6). It does not matter whether
> I compile the kernel SMP or UP, with or without CONFIG_HIGHMEM.

Well I have run this several times on my MPX, and it is fine.

This is 2.4.20-pre1, dual AMD 2000MP, only difference is it is the Tyan
version of the MPX, not the MSI.

Justin

2002-09-12 16:17:37

by Petr Konecny

Subject: Re: AMD 760MPX DMA lockup

>>>>> Jan Kasprzak (Yenya) wrote:

Yenya> Is anybody able to run the above code on an AMD 760MPX-based system?
Yenya> Is it a kernel problem or hardware problem?
Runs fine on an ASUS A7M266-D MoBo with a WD800JB disk.

Petr

2002-09-12 18:15:09

by Denis Vlasenko

Subject: Re: AMD 760MPX DMA lockup

On 12 September 2002 12:12, Jan Kasprzak wrote:

> my dual athlon box is unstable in some situations. I can consistently
> lock it up by running the following code:
>
> fd = open("/dev/hda3", O_RDWR);
> for (i=0; i<1024*1024; i++) {
> 	read(fd, buffer, 8192);
> 	lseek(fd, -8192, SEEK_CUR);
> 	write(fd, buffer, 8192);
> }

8 GB... Can you make it loop over a much smaller size?

for (j=0; j<1024; j++) {
	fd = open("/dev/hda3", O_RDWR);
	for (i=0; i<1024; i++) {
		read(fd, buffer, 8192);
		lseek(fd, -8192, SEEK_CUR);
		write(fd, buffer, 8192);
	}
	close(fd);
	printf(<some stats>);
}
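
The difference between the two loops can be worked out with a little
arithmetic. A sketch, assuming the partition is at least as large as the span
and that each open() starts at offset 0; the struct and function names are
invented for illustration:

```c
#include <stdint.h>

struct io_footprint {
	uint64_t span;		/* distinct bytes of the device touched */
	uint64_t traffic;	/* total bytes read plus bytes written */
};

/*
 * Each iteration reads one block, seeks back and rewrites it, so the
 * file position advances by one block per iteration. Reopening the
 * device restarts at offset 0, so the span depends only on the inner
 * loop, while the traffic scales with both loops.
 */
struct io_footprint footprint(uint64_t opens, uint64_t iters_per_open,
			      uint64_t block_size)
{
	struct io_footprint f;

	f.span = iters_per_open * block_size;
	f.traffic = 2 * opens * iters_per_open * block_size;
	return f;
}
```

footprint(1, 1024 * 1024, 8192) gives a span of 8GB for the original loop,
while footprint(1024, 1024, 8192) gives a span of only 8MB for the loop above,
with the same 16GB of total traffic in both cases. A span that small is far
below the 1GB of RAM in the box, so it could plausibly be served largely from
the page cache.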

I assume removing read+lseek eliminates lockup?

> I tried putting the tested disk on a separate IDE controller
> (Promise PDC20269 PCI card) - then I do not get a complete lockup,
> just the drive starts to complain about the DMA timeout, and the kernel
> resets the controller. However, DMA timeouts start to occur even on
> the primary controller.

Is it IDE related or not?
If you can test it over SCSI/NFS/ramdisk/???...

> When I switch off the DMA (hdparm -d0 /dev/hda), the problem goes away
> (however, the disk is very slow, as expected).

At which DMA/UDMA mode does it start to fail?
--
vda

2002-09-12 19:10:24

by Jan Kasprzak

Subject: Re: AMD 760MPX DMA lockup (partly solved)

Justin Cormack wrote:
: Well I have run this several times on my MPX, and it is fine.
:
: This is 2.4.20-pre1, dual AMD 2000MP, only difference is it is the Tyan
: version of the MPX, not the MSI.
:
: Justin

Justin, thanks for this! I've tried 2.4.20-pre1 with your
.config (and then with my .config), and it works!

Further investigation showed that the problem first appeared
somewhere between 2.4.20-pre2 (works for me) and 2.4.20-pre5 (has the
lock-up problem I've described). I was not able to test -pre3 and -pre4,
because these kernels died on me during boot after the
"Initializing RT netlink socket" message.

I think the bug got merged from the -ac kernels, because it is
present both in the kernel 2.4.19-11 from RedHat "null" beta
and in 2.4.20-pre2-ac1 (although the latter crashes instead of locking up).

Denis Vlasenko wrote:
:
: 8 GB... Can you make it loop over a much smaller size?
:
With 2GB it still fails. I didn't try less, because with 1GB of RAM
it would not have any effect.

: I assume removing read+lseek eliminates lockup?

Partly. I've tried

dd if=/dev/hda3 of=/dev/null bs=1024k, and it still causes filesystem
corruption (although no lockup).

: Is it IDE related or not?
: If you can test it over SCSI/NFS/ramdisk/???...

I think it is IDE or DMA related.

: > When I switch off the DMA (hdparm -d0 /dev/hda), the problem goes away
: > (however, the disk is very slow, as expected).
:
: At which DMA/UDMA mode does it start to fail?

-d1 -X33 fails.

-Y.


2002-09-12 20:38:02

by Alan

Subject: Re: AMD 760MPX DMA lockup (partly solved)

On Thu, 2002-09-12 at 20:14, Jan Kasprzak wrote:
> I think the bug got merged from the -ac kernels, because it is
> present both in the kernel 2.4.19-11 from RedHat "null" beta
> and in 2.4.20-pre2-ac1 (although the latter crashes instead of locking up).

That would be strange actually. The Red Hat beta kernel has 2.4.18-like IDE,
not -ac-like IDE.


2002-09-12 21:32:08

by Vojtech Pavlik

Subject: Re: AMD 760MPX DMA lockup (partly solved)

On Thu, Sep 12, 2002 at 09:14:52PM +0200, Jan Kasprzak wrote:

> : > When I switch off the DMA (hdparm -d0 /dev/hda), the problem goes away
> : > (however, the disk is very slow, as expected).
> :
> : At which DMA/UDMA mode does it start to fail?
>
> -d1 -X33 fails.

X33? X33 doesn't make sense.

--
Vojtech Pavlik
SuSE Labs

2002-09-13 07:02:38

by Denis Vlasenko

Subject: Re: AMD 760MPX DMA lockup (partly solved)

On 12 September 2002 17:14, Jan Kasprzak wrote:
> : This is 2.4.20-pre1, dual AMD 2000MP, only difference is it is the Tyan
> : version of the MPX, not the MSI.
> :
> : Justin
>
> Justin, thanks for this! I've tried 2.4.20-pre1 with your
> .config (and then with my .config), and it works!
>
> Further investigation showed that the problem first appeared
> somewhere between 2.4.20-pre2 (works for me) and 2.4.20-pre5 (has the
> lock-up problem I've described). I was not able to test -pre3 and -pre4,
> because these kernels died on me during boot after the
> "Initializing RT netlink socket" message.

It would be interesting to test 2.4.20-pre5 on Justin's box
(if he can risk fs damage)
--
vda

2002-09-13 09:37:21

by Jan Kasprzak

Subject: Re: AMD 760MPX DMA lockup (partly solved)

Alan Cox wrote:
: On Thu, 2002-09-12 at 20:14, Jan Kasprzak wrote:
: > I think the bug got merged from the -ac kernels, because it is
: > present both in the kernel 2.4.19-11 from RedHat "null" beta
: > and in 2.4.20-pre2-ac1 (although the latter crashes instead of locking up).
:
: That would be strange actually. The Red Hat beta kernel has 2.4.18-like IDE,
: not -ac-like IDE.
:
Well, it is probably not IDE-related at all, but rather it has
something to do with PCI or maybe the scheduling changes that -pre2-ac2 makes.
Currently I positively know that 2.4.20-pre2 works, but 2.4.20-pre2-ac2 and
2.4.20-pre5 do not. I have tested 2.4.20-pre2-ac1 and 2.4.20-pre[34],
but none of these work for me, probably for some other reason.

So I have taken patch-2.4.20-pre2-ac2, and deleted all
changes except the ones that are IDE-related, and 2.4.20-pre2 plus the
following parts of -pre2-ac2 works:

drivers/ide/Config.in | 15
drivers/ide/Makefile | 9
drivers/ide/amd74xx.c | 292 +++---
drivers/ide/hd.c | 42
drivers/ide/ide-disk.c | 879 +++++++++++-------
drivers/ide/ide-dma.c | 347 +++----
drivers/ide/ide-features.c | 385 --------
drivers/ide/ide-pci.c | 539 +++++------
drivers/ide/ide-probe.c | 126 +-
drivers/ide/ide-proc.c | 58 -
drivers/ide/ide-taskfile.c | 2159 +++++++++++++++++++++++++++++++++------------
drivers/ide/ide.c | 651 +++----------
drivers/ide/pdc202xx.c | 1098 +++++++++++++---------
drivers/ide/pdc4030.c | 272 ++++-
drivers/ide/pdcadma.c | 106 ++
include/asm-i386/ide.h | 26
include/asm-i386/system.h | 14
include/linux/hdreg.h | 93 +
include/linux/ide.h | 409 +++++---
include/linux/pci_ids.h | 16
20 files changed, 4524 insertions, 3012 deletions

I had to delete the first chunk of the patch to
include/asm-i386/ide.h, and I deleted the call to
pci_enable_device_bars() in drivers/ide/ide-pci.c to be able to compile
this, but other than that it is exactly the same as the 2.4.20-pre2-ac2 IDE code.
I will send you this as a patch against 2.4.20-pre2 if you want.

This still works, so it means the problem has to be in some other
part of 2.4.20-pre2-ac2.

Vojtech Pavlik wrote:
:
: X33? X33 doesn't make sense.
:
X34, sorry. DMA 33.

-Y.


2002-09-13 09:42:22

by Vojtech Pavlik

Subject: Re: AMD 760MPX DMA lockup (partly solved)

On Fri, Sep 13, 2002 at 11:41:49AM +0200, Jan Kasprzak wrote:

> Vojtech Pavlik wrote:
> :
> : X33? X33 doesn't make sense.
> :
> X34, sorry. DMA 33.

Still not right. -X34 is MWDMA16, for UDMA33 you need -X66.
I know it's confusing, but these are mode numbers from the ATA spec.
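
The numbering can be sketched in C, assuming the usual hdparm -X encoding in
which the value is the ATA transfer mode byte: 8 + n for PIO mode n, 32 + n
for multiword DMA mode n, and 64 + n for UltraDMA mode n. The function name
describe_xfer_mode() is made up for this illustration, and it does not check
whether the mode number is one a drive could actually support.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Turn an hdparm -X value into a short name such as "UDMA2".
 * Returns 0 on success, or -1 if the value does not belong to one of
 * the numbered mode families handled here.
 */
int describe_xfer_mode(unsigned value, char *out, size_t len)
{
	unsigned type = value >> 3;	/* upper five bits: mode family */
	unsigned mode = value & 7;	/* low three bits: mode number */
	const char *family;

	switch (type) {
	case 1: family = "PIO";   break;	/* 8 + n */
	case 2: family = "SWDMA"; break;	/* 16 + n, single-word DMA (obsolete) */
	case 4: family = "MWDMA"; break;	/* 32 + n */
	case 8: family = "UDMA";  break;	/* 64 + n */
	default:
		return -1;
	}
	snprintf(out, len, "%s%u", family, mode);
	return 0;
}
```

With this, 34 decodes to MWDMA2 and 66 to UDMA2, the mode usually advertised
as UDMA33.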

--
Vojtech Pavlik
SuSE Labs

2002-09-13 11:40:59

by Justin Cormack

Subject: Re: AMD 760MPX DMA lockup (partly solved)

>
> On 12 September 2002 17:14, Jan Kasprzak wrote:
> > : This is 2.4.20-pre1, dual AMD 2000MP, only difference is it is the Tyan
> > : version of the MPX, not the MSI.
> > :
> > : Justin
> >
> > Justin, thanks for this! I've tried 2.4.20-pre1 with your
> > .config (and then with my .config), and it works!
> >
> > Further investigation showed that the problem first appeared
> > somewhere between 2.4.20-pre2 (works for me) and 2.4.20-pre5 (has the
> > lock-up problem I've described). I was not able to test -pre3 and -pre4,
> > because these kernels died on me during boot after the
> > "Initializing RT netlink socket" message.
>
> It would be interesting to test 2.4.20-pre5 on Justin's box
> (if he can risk fs damage)

ok, tried it on 2.4.20-pre5, and it is fine.

I would send your board back...

Justin