LinuxLists.cc - Problem: IDE data corruption with VIA chipsets on 2.4.20-19.8+others

2003-09-11 05:33:04

Subject: Problem: IDE data corruption with VIA chipsets on 2.4.20-19.8+others

The core issue:
----------------
Random crashes and general operating system instability on a RedHat 8 Linux
install running a moderately heavy-use database server (IBM Lotus Domino 5
or 6). All current indications point to data corruption or an IDE
incompatibility between the Linux IDE driver and various VIA
chipsets. The problem only occurs during heavy database server load.

What happened:
----------------
We decided to migrate our Windows 2000 Pro Lotus Domino installations to
RedHat Linux 8. Everything worked perfectly on the office servers, which are
older 300-400 MHz computers based on the good old Intel "BX" chipset.

A few months ago we launched a large website for a large US state
department/organization. The website received an extreme amount of positive
press, and as a result the server load became more than we could handle. We
built an interim Linux server which we hoped would tide us over until we
could purchase permanent, proper server hardware. After some testing
everything looked good and we launched the server. The server had two
80 GB IDE hard drives in a software RAID-1 configuration. We chose IDE due to
the extremely high cost of SCSI and the fact that we weren't sure if this was
going to be a permanent configuration. The installation used RedHat 8, with
all of the latest patches installed (via Red Hat Network).

About one week later, the machine started behaving badly: random crashes
and other problems made for a very unstable server. We suspected the IDE
subsystem. We built another server using a different brand of motherboard
and hard drives and launched it. One or two weeks later, same thing: major
problems and crashes. In fact, we've launched about 4 different hardware
configurations, some RAID, some non-RAID, with different hard-drive brands
and different versions of Lotus Domino. They have only three things in
common: they all crash, they all run RedHat Linux, and they all use IDE.

We also get various IDE errors in the /var/log files, such as the ones
below. There is also a weird IRQ error, even though we're running a stock
config (i.e. no PCI devices other than one NIC). I have no idea whether it
is related to the IDE thing <sigh>; the IRQ thing appears to be USB related.
== during server runtime ==
kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=150637065,
sector=150636992
kernel: end_request: I/O error, dev 16:01 (hdc), sector 150636992
kernel: hdc: dma_intr: status=0x53 { DriveReady SeekComplete Index Error }
kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=150630007,
sector=150629920
kernel: end_request: I/O error, dev 16:01 (hdc), sector 150629920
== during server boot ==
kernel: usb.c: new USB bus registered, assigned bus number 3
kernel: hub.c: USB hub found
kernel: hub.c: 2 ports detected
kernel: PCI: Found IRQ 3 for device 00:10.2
kernel: IRQ routing conflict for 00:10.0, have irq 9, want irq 3
kernel: IRQ routing conflict for 00:10.1, have irq 9, want irq 3
kernel: IRQ routing conflict for 00:10.2, have irq 9, want irq 3
kernel: IRQ routing conflict for 00:10.3, have irq 9, want irq 3
kernel: usb-uhci.c: USB UHCI at I/O 0x9400, IRQ 9

Usually after the IDE errors appear, the server crashes fairly soon. Other
times no errors are logged but the server is *extremely* slow, likely due
to disk performance. My guess is that there is some sort of internal IDE
error (a "CorrectableError"?) that the kernel is recovering from without
writing a message to the log.
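
For reference, a minimal sketch (not part of the original report) of how the
status= and error= values in the log lines above break down into flags. The
bit values follow the ATA status and error register layout; the flag names
are the ones the 2.4 IDE driver is assumed to print, and the decode() helper
is purely illustrative. Note that the status register has its own
"CorrectedError" bit (0x04), which is not set in either logged status value.

```python
# Bit masks follow the ATA status/error register layout. The names are
# the ones the 2.4 IDE driver is assumed to print in its log messages.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

ERROR_BITS = [
    (0x80, "BadSector"),
    (0x40, "UncorrectableError"),
    (0x20, "MediaChanged"),
    (0x10, "SectorIdNotFound"),
    (0x08, "MediaChangeRequested"),
    (0x04, "DriveStatusError"),
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode(value, table):
    """Return the names of all flags set in value, highest bit first."""
    return [name for mask, name in table if value & mask]


# The three distinct values that appear in the posted log.
print(decode(0x51, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode(0x53, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Index', 'Error']
print(decode(0x40, ERROR_BITS))   # ['UncorrectableError']
```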

Once in a while after a major kernel panic or reboot, the system refused to
reboot at all, going into an endless cycle of disk checking. We tried RAM
from various brand-name suppliers in case it was RAM corruption, with no
luck. Everything points to IDE.

We tried various motherboards, including the following chipsets:
VIA KT333
VIA KT400
SiS 745

The problem definitely occurs on 2.4.20-19.8, but also on some earlier kernel
versions (which I can't remember). It only happens during extremely
high disk usage (I've seen it fail at about 8% CPU and memory usage...).
It's not our database server: it runs fine on our older "BX boards" (the
older 300 MHz Intels) and on various Windows NT/2000 boxes. The configuration
of the database server is exactly the same as on the other servers (I
double-checked at least 5 times...).

After about 4 months of random server crashes and corruption, various trials
and testing, I'm fairly certain it must be a hardware interaction between
the IDE hard drives and the Linux kernel. We've gone through 20 IDE
hard drives that work fine and look fine after the crashes, so it is
definitely not a physical hard-drive problem. It is definitely not a
motherboard problem either, because we've been through 4 different
motherboards from different manufacturers with 3 different chipsets.

Ideas..? Help..? :( The clients are going to be banging down the doors any
day if we don't get operational servers, and so far the only solution that I
believe works 100% is to install Windows (doh). As a last resort I'm asking
to see if any of you Kernel-gurus have ideas :-)

Thanks,
-Eric Bickle

2003-09-11 08:28:31

by Sebastian Piecha

Subject: Re: Problem: IDE data corruption with VIA chipsets on 2.4.20-19.8+others

On 10 Sep 2003 at 22:32, Eric Bickle wrote:

> The core issue:
> ----------------
> Random crashes, general operating system instability with a RedHat 8 Linux
> install running a moderately heavy-use database server (IBM Lotus Domino 5
> or 6). All current indications point to a data
> corruption/ide-incompatibility between the linux IDE driver and various VIA
> chipsets. The problem only occurs during heavy database server load.
> ...
I reported a problem to the lkml (see "PROBLEM: Powerquest Drive
Image let the kernel panic" and "PROBLEM: kernel panic when accessing
data via samba") describing a kernel panic when accessing a huge
amount of data via Samba. Unfortunately I haven't received any response
yet.

I'm using a SuSE 8.2 distribution with kernel 2.4.20.

>
> We also get various IDE errors in the /var/log files, such as: (also another
> weird IRQ error - except we're running a stock config! (ie/ no PCI devices
> other than one NIC... No idea if they're related to the IDE thing <sigh>.
> The IRQ thing appears to be USB related)
> == during server runtime ==
> kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=150637065,
> sector=150636992
> kernel: end_request: I/O error, dev 16:01 (hdc), sector 150636992
> kernel: hdc: dma_intr: status=0x53 { DriveReady SeekComplete Index Error }
> kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=150630007,
> sector=150629920
> kernel: end_request: I/O error, dev 16:01 (hdc), sector 150629920
> == during server boot ==
> kernel: usb.c: new USB bus registered, assigned bus number 3
> kernel: hub.c: USB hub found
> kernel: hub.c: 2 ports detected
> kernel: PCI: Found IRQ 3 for device 00:10.2
> kernel: IRQ routing conflict for 00:10.0, have irq 9, want irq 3
> kernel: IRQ routing conflict for 00:10.1, have irq 9, want irq 3
> kernel: IRQ routing conflict for 00:10.2, have irq 9, want irq 3
> kernel: IRQ routing conflict for 00:10.3, have irq 9, want irq 3
> kernel: usb-uhci.c: USB UHCI at I/O 0x9400, IRQ 9
> ...

I'm getting similar errors:
kernel: hdc: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
kernel: hdc: drive_cmd: error=0x04Aborted Command

I don't get any IRQ routing conflicts.

But in my configuration hdc is a CD-ROM drive attached to the onboard IDE.

> ...
> We tried various motherboards, including the following chipsets:
> VIA KT333
> VIA KT400
> SiS 745
>
> ...
My onboard IDE is an Intel 82371AB/EB/MB PIIX4 IDE. My two hard disks
(120 GB each) are attached to a Promise Ultra 133TX2
controller.

Best regards,
Sebastian Piecha

EMail: [email protected]

2003-09-11 11:26:39

by Andre Tomt

Subject: Re: Problem: IDE data corruption with VIA chipsets on 2.4.20-19.8+others

Eric Bickle wrote:
> The core issue:
> ----------------
> Random crashes, general operating system instability with a RedHat 8 Linux
> install running a moderately heavy-use database server (IBM Lotus Domino 5
> or 6). All current indications point to a data
> corruption/ide-incompatibility between the linux IDE driver and various VIA
> chipsets. The problem only occurs during heavy database server load.
<snip long story>

If I didn't misread your story, the other common factor is Red Hat's
kernel, which is heavily patched with several good (and not so good)
patches. You may have better luck with mainline 2.4.22 (kernel.org).

In any case Red Hat kernel problems should probably be reported to Red
Hat, their bugzilla comes to mind.

--
Cheers,
André Tomt
[email protected]

2003-09-11 14:38:18

by Alan

Subject: Re: Problem: IDE data corruption with VIA chipsets on 2.4.20-19.8+others

PS: These wouldn't all happen to be, say, IBM drives of 40 GB or so, would
they, and maybe about one to two years old?

2003-09-11 14:33:37

by Alan

Subject: Re: Problem: IDE data corruption with VIA chipsets on 2.4.20-19.8+others

On Thu, 2003-09-11 at 06:32, Eric Bickle wrote:
> == during server runtime ==
> kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=150637065,
> sector=150636992

This is a physical failure reported by the hard disk, *NOT* a Linux problem.

> kernel: end_request: I/O error, dev 16:01 (hdc), sector 150636992
> kernel: hdc: dma_intr: status=0x53 { DriveReady SeekComplete Index Error }
> kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=150630007,
> sector=150629920

Ditto

So the only things you've posted here are physical drive failures.
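
One way to check this reading against a longer log, sketched here as an
assumption rather than anything posted in the thread: collect the distinct
LBAs that each drive reports with the uncorrectable-error bit (0x40) set. If
the same sectors keep coming back on the same drive, that is consistent with
bad media. The regular expression is written against the 2.4-style dma_intr
lines quoted above, and the failing_sectors() helper is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=150637065,
LBA_RE = re.compile(
    r"(hd[a-z]): dma_intr: error=0x([0-9a-fA-F]+) .*LBAsect=(\d+)"
)


def failing_sectors(lines):
    """Map drive name -> sorted distinct LBAs logged with error bit 0x40 set."""
    found = defaultdict(set)
    for line in lines:
        match = LBA_RE.search(line)
        if match and int(match.group(2), 16) & 0x40:
            found[match.group(1)].add(int(match.group(3)))
    return {drive: sorted(lbas) for drive, lbas in found.items()}


# Lines copied from the original report (the wrapped "sector=" tails omitted).
sample = [
    "kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }",
    "kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=150637065,",
    "kernel: hdc: dma_intr: status=0x53 { DriveReady SeekComplete Index Error }",
    "kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=150630007,",
]
print(failing_sectors(sample))  # {'hdc': [150630007, 150637065]}
```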

2003-09-11 20:28:39

by Francois Romieu

Subject: Re: Problem: IDE data corruption with VIA chipsets on 2.4.20-19.8+others

Eric Bickle <[email protected]> :
[...]
> About one week later, the machine started behaving badly. Random crashes,
> problems, etc. A very unstable server. We suspected it was the IDE. We built
> another server using a different brand of motherboard and hard-drives and
> launched the server. One or two weeks later - same thing - major problems,
> crashes. In fact, we've launched about 4 different hardware configurations,
> some raid, some non-raid, different hard-drive brands, different versions of
> Lotus Domino. There is only one thing in common - they all crash, they all
> run RedHat Linux, and they all use IDE.

Overheating?

What does the output of smartctl -a and sensors, captured at regular
intervals, say?

Regards

--
Ueimor
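
A minimal sketch of how such regular snapshots could be collected, assuming
smartmontools and lm-sensors are installed and that the drive of interest is
/dev/hdc as in the posted logs. The function names, the log path in the
usage note, and the five-minute default interval are all hypothetical
choices, not something suggested in the thread.

```python
import subprocess
import time
from datetime import datetime


def snapshot(commands, run=subprocess.run):
    """Run each command once and return one timestamped block of text."""
    parts = ["=== " + datetime.now().isoformat(timespec="seconds") + " ==="]
    for cmd in commands:
        try:
            result = run(cmd, capture_output=True, text=True)
            output = result.stdout + result.stderr
        except OSError as exc:
            # For example, the tool is not installed on this machine.
            output = "could not run: " + str(exc) + "\n"
        parts.append("$ " + " ".join(cmd) + "\n" + output)
    return "\n".join(parts) + "\n"


def log_forever(path, commands, interval_s=300):
    """Append a snapshot to path every interval_s seconds until interrupted."""
    while True:
        with open(path, "a") as handle:
            handle.write(snapshot(commands))
        time.sleep(interval_s)
```

Run as root, a call such as
log_forever("/var/log/drive-health.log", [["smartctl", "-a", "/dev/hdc"], ["sensors"]])
would leave a history of SMART attributes and temperatures to compare
against the times of the crashes.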

2003-09-12 00:10:52

by Resident Boxholder

Subject: Re: Problem: IDE data corruption with VIA chipsets on 2.4.20-19.8+others

> VIA KT333
> VIA KT400
> SiS 745
>
> The problem definately occurs on 2.4.20-19.8

B> I saw that on a VIA KT266 with kernel 2.6.0-test5, and it
seems to go away if I use the anticipatory I/O scheduler
instead of the deadline scheduler in the kernel config and
don't use aggressive memory settings in the CMOS setup. And
USB on a server: what's that for?

Did you flash all your BIOSes before trying anything?
I saw ACPI-derived problems go away after flashing
an Award BIOS on another server. It seems there was
no graceful failure or default when ACPI ID info is not
in the BIOS, so the improvement was drastic. ACPI problems
cause USB to go clunk when IDE is active, and downstream
conflicts like that are not too informative or productive
to look at if flashing the BIOS would "fix the code" as
if by magic.

-Bob

Eric Bickle wrote:

>....
>RedHat Linux 8....two IDE
>80gig hard-drives in a Software RAID-1... they all crash, they all
>run RedHat Linux, and they all use IDE.