2004-01-15 02:21:30

by Jonathan Kamens

Subject: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

Hello everyone,

I'd like to provide an update on my efforts to understand what causes
"DriveStatusError BadCRC" errors when using UDMA drives, how to debug
these errors in general, the specific progress I've made at resolving
these errors on my system, and subsequent problems I've encountered
when doing so.

Recall that I was getting these errors from 2.4.22-ac4 on my
dual-processor (550MHz Pentium III Katmai) system:

hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }

from a Seagate 160GB drive (ST3160021A) plugged into its own channel
on a Promise Ultra66 (PDC20262) controller.
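
For readers following along, the two hex values in those log lines are bitmasks of the ATA status and error registers. The following is a minimal sketch of decoding them, assuming the standard ATA bit assignments and the label names the 2.4 IDE driver prints; the table and function names (STATUS_BITS, decode_status, decode_error) are made up for this illustration and are not kernel code.

```python
# Illustrative decoder for the status= and error= values in the log lines.
# Masks are the standard ATA register bits; labels mirror what the driver
# prints. All names in this block are hypothetical helpers.

STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(value):
    """Return the labels for every bit set in the status register."""
    return [name for mask, name in STATUS_BITS if value & mask]


def decode_error(value):
    """Return the labels for every bit set in the error register."""
    names = []
    aborted = bool(value & 0x04)
    if aborted:
        names.append("DriveStatusError")
    if value & 0x80:
        # Bit 7 is reported as BadCRC only when the abort bit is also set;
        # otherwise the driver labels it BadSector.
        names.append("BadCRC" if aborted else "BadSector")
    if value & 0x40:
        names.append("UncorrectableError")
    if value & 0x10:
        names.append("SectorIdNotFound")
    if value & 0x02:
        names.append("TrackZeroNotFound")
    if value & 0x01:
        names.append("AddrMarkNotFound")
    return names


print(decode_status(0x51))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode_error(0x84))   # ['DriveStatusError', 'BadCRC']
```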

The suggestion most frequently given to me and others for resolving
BadCRC errors is to replace the IDE cable with one that conforms to
the Ultra ATA spec (80-conductor, flat, two drive connectors, single
drive connected to the end connector). I tried several such cables,
none of which made the BadCRC errors go away.

Other suggestions given to me included:

* Make sure the IDE cable is not running parallel to another cable.
* Make sure the cable is not passing near magnets inside the case,
e.g., speaker magnets.
* Update the IDE controller's firmware.
* Check to make sure the PCI bus speed is valid (33MHz, normally).
* Make sure the PCI latency timer is set in the BIOS to at least 64.

I tried all of these suggestions, and none of them worked.

I tried swapping the drives on the controller's two channels, and the
BadCRC errors traveled with the drive. Then I swapped the cables on
the two channels, and the errors still remained on the same drive.

Next, I bought a SIIG Ultra ATA 133 controller, compiled SIIMAGE
support into my kernel, plugged in the new SIIG controller, and moved
the drive getting the BadCRC errors over to it. They stopped -- I
haven't seen a single BadCRC error since I moved the drive to the SIIG
controller a couple of weeks ago.

Alas, another problem has presented itself. Twice after I installed
the SIIG controller and moved the Seagate drive to it, my system hung
(all activity seemed to stop, syslogd stopped logging, X server
stopped responding, couldn't switch VTs). Both times, Alt-SysRq-s and
Alt-SysRq-u appeared to have no effect, but Alt-SysRq-b successfully
rebooted the system. I couldn't get any more information because I
don't have a serial console and my monitor was in X when the hang
happened; since I couldn't switch VTs, I couldn't get to one where the
magic SysRq sequences would display information.

After the second hang, I tried two more things -- moving the other
drive to the SIIG controller, such that the Promise controller no
longer has any drives on it (but it's still plugged in, and also, my
motherboard's PIIX4 controller still has a hard drive, CD-ROM and
OnStream DI-30 drive plugged into it as hda, hdc and hdd
respectively), and turning off unmask IRQ for the drives on the SIIG
controller, as suggested in other messages here. Unfortunately, even
with these two additional steps, I'm still seeing kernel hangs, albeit
seemingly less frequently -- I just had another one about an hour ago.

I've just enabled the NMI watchdog, compiled software watchdog support
into my kernel and installed and enabled the watchdog daemon. If
anyone can suggest anything else I can do to debug these hangs, I'm
all ears.

Thanks for reading this far. :-)

Jonathan Kamens


2004-01-16 03:47:47

by Jonathan Kamens

Subject: Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

The drive which stopped reporting BadCRC errors for weeks when I
transferred it from the Promise PDC20262 Ultra66 controller to the
SIIG SIi680 Ultra133 controller just reported this:

Jan 15 22:03:13 jik kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
Jan 15 22:03:13 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }
Jan 15 22:03:20 jik kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
Jan 15 22:03:20 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }

I don't know whether it's relevant that these errors are tagged
"drive_cmd" and the BadCRC errors were tagged "dma_intr".

Are the errors above close enough to the BadCRC errors I was getting,
i.e.:

hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }

for me to now start suspecting a problem with the drive, given that
two different controllers with two different chipsets are reporting
problems with it?
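
One way to compare the two reports mechanically is to parse the log lines and look at which error-register bits they share. This is only a sketch: the regular expression is an assumption based on the handful of lines quoted in this thread, and parse_ide_line is a hypothetical helper, not an existing tool.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not a complete grammar for IDE driver messages).
LINE_RE = re.compile(
    r"(?P<dev>hd[a-z]): (?P<ctx>\w+): "
    r"(?P<reg>status|error)=0x(?P<val>[0-9a-fA-F]{2})"
)


def parse_ide_line(line):
    """Return (device, context, register, value) or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return m.group("dev"), m.group("ctx"), m.group("reg"), int(m.group("val"), 16)


old = parse_ide_line("hde: dma_intr: error=0x84 { DriveStatusError BadCRC }")
new = parse_ide_line(
    "Jan 15 22:03:13 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }"
)

shared = old[3] & new[3]      # bits set in both reports
only_old = old[3] & ~new[3]   # bits set only in the dma_intr report

print(old[1], new[1])         # dma_intr drive_cmd
print(hex(shared))            # 0x4  (the abort bit, DriveStatusError)
print(hex(only_old))          # 0x80 (the CRC bit)
```

The two reports share only the abort bit; the CRC bit appears solely in the dma_intr report.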

Thanks,

Jonathan Kamens

2004-01-16 07:40:06

by John Bradford

Subject: Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

Quote from Jonathan Kamens <[email protected]>:
> The drive which stopped reporting BadCRC errors for weeks when I
> transferred it from the Promise PDC20262 Ultra66 controller to the
> SIIG SIi680 Ultra133 controller just reported this:
>
> Jan 15 22:03:13 jik kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> Jan 15 22:03:13 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }
> Jan 15 22:03:20 jik kernel: hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> Jan 15 22:03:20 jik kernel: hde: drive_cmd: error=0x04 { DriveStatusError }

The drive doesn't seem to understand the command it was sent.

> I don't know whether it's relevant that these errors are tagged
> "drive_cmd" and the BadCRC errors were tagged "dma_intr".
>
> Are the errors above close enough to the BadCRC errors I was getting,
> i.e.:
>
> hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
>
> for me to now start suspecting a problem with the drive, given that
> two different controllers with two different chipsets are reporting
> problems with it?

No.

John.

2004-01-16 15:27:30

by Jonathan Kamens

Subject: Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

John Bradford writes:
> Quote from Jonathan Kamens <[email protected]>:
> > ... hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > ... hde: drive_cmd: error=0x04 { DriveStatusError }
>
> The drive doesn't seem to understand the command it was sent.

I'm not sure what this means, but assuming that it's going to happen
again at some point, what do I need to do to my kernel/configuration
now to be able to capture additional debugging information the next
time it happens?

Thanks,

jik

2004-01-16 15:39:03

by John Bradford

Subject: Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

Quote from Jonathan Kamens <[email protected]>:
> John Bradford writes:
> > Quote from Jonathan Kamens <[email protected]>:
> > > ... hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > ... hde: drive_cmd: error=0x04 { DriveStatusError }
> >
> > The drive doesn't seem to understand the command it was sent.
>
> I'm not sure what this means, but assuming that it's going to happen
> again at some point,

Maybe not - the most common cause I've seen for that message in the logs is trying to access S.M.A.R.T. information when S.M.A.R.T. is disabled.

I.e., the error should be reproducible with:

# smartctl -d /dev/hda
# smartctl -a /dev/hda

Are you sure you weren't trying to get S.M.A.R.T. info from the drive at the time the error was logged?

John.

2004-01-16 15:49:09

by Jonathan Kamens

Subject: Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

John Bradford writes:

> Maybe not - the most common cause I've seen for that message in the
> logs is trying to access S.M.A.R.T. information when S.M.A.R.T. is
> disabled.
>
> I.e., the error should be reproducible with:
>
> # smartctl -d /dev/hda
> # smartctl -a /dev/hda
>
> Are you sure you weren't trying to get S.M.A.R.T. info from the
> drive at the time the error was logged?

My smartctl wants "-s off" rather than "-d", but other than that,
you're correct, that sequence of commands does cause the same error to
appear in the logs. But why/how would SMART be disabled on the drive?
I've been running smartd on the drive for weeks with no errors of this
sort, and I fail to see how SMART would suddenly be disabled on the
drive with no action on my part, so it seems more likely that some
other condition caused the error.

Thanks,

jik

2004-01-16 16:13:15

by Ed Sweetman

Subject: Re: Updated on UDMA BadCRC errors + subsequent problems

John Bradford wrote:
> Quote from Jonathan Kamens <[email protected]>:
>
>>John Bradford writes:
>> > Quote from Jonathan Kamens <[email protected]>:
>> > > ... hde: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
>> > > ... hde: drive_cmd: error=0x04 { DriveStatusError }
>> >
>> > The drive doesn't seem to understand the command it was sent.
>>
>>I'm not sure what this means, but assuming that it's going to happen
>>again at some point,
>
>
> Maybe not - the most common cause I've seen for that message in the logs is trying to access S.M.A.R.T. information when S.M.A.R.T. is disabled.
>
> I.e., the error should be reproducible with:
>
> # smartctl -d /dev/hda
> # smartctl -a /dev/hda
>
> Are you sure you weren't trying to get S.M.A.R.T. info from the drive at the time the error was logged?
>
> John.

hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }


I guess some drives report the exact error, like mine here. These occur
when I'm transferring from my tulip-based NIC to my HDD at 8 MB/s
(avg). The fs is ext3. When I'm transferring files of about 1.5GB over,
the drive seems to freak out. Timing sources (TSC) can also lose so
many ticks that the other time source has to be used.

What I don't understand is why the ATA drivers don't handle CRC errors
correctly. Instead of resetting the IDE bus and turning DMA off, why
don't they throttle down the DMA modes one by one? When the rate of CRC
errors reaches a certain reasonable number, drop a UDMA level. If that
CRC error rate is reached again, drop another level. Keep doing that
until you hit PIO mode. Usually the problem is solved by simply using a
lower DMA mode. That way my system wouldn't have to reach loads of 20,
with I/O sucking up all my CPU, while I'm trying to re-enable DMA so I
can actually figure out what's going on. CRC errors are caused by timing
problems as well as physical problems around the cabling in the
computer. Normal HDD-to-HDD transfers (which avg about 30 MB/s) do not
cause these errors for me.
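
The step-down policy proposed above can be sketched in a few lines. This is an illustration of the idea only, not how the kernel's ATA drivers behave: the mode ladder, the threshold of 5 errors, the 60-second window, and the CrcThrottle class are all assumptions made up for this example, and the multiword DMA modes are left out for brevity.

```python
from collections import deque

# Hypothetical mode ladder, fastest first. Real hardware may support a
# different set of modes.
MODES = ["udma5", "udma4", "udma3", "udma2", "udma1", "udma0", "pio"]


class CrcThrottle:
    """Drop one transfer mode each time CRC errors exceed a rate limit."""

    def __init__(self, max_errors=5, window_seconds=60.0, modes=MODES):
        self.max_errors = max_errors
        self.window = window_seconds
        self.modes = list(modes)
        self.level = 0           # index into self.modes
        self.errors = deque()    # timestamps of recent CRC errors

    @property
    def mode(self):
        return self.modes[self.level]

    def record_crc_error(self, now):
        """Record a CRC error at time `now` (seconds); return the mode to use."""
        self.errors.append(now)
        # Forget errors that fall outside the sliding window.
        while self.errors and now - self.errors[0] > self.window:
            self.errors.popleft()
        at_slowest = self.level == len(self.modes) - 1
        if len(self.errors) >= self.max_errors and not at_slowest:
            self.level += 1      # step down exactly one mode
            self.errors.clear()  # start counting afresh at the new mode
        return self.mode


throttle = CrcThrottle(max_errors=3, window_seconds=10.0)
for t in (0.0, 1.0, 2.0):
    current = throttle.record_crc_error(t)
print(current)  # udma4
```

Clearing the error history after each step gives the slower mode a fair chance before another drop is considered; isolated errors spread further apart than the window never trigger a change.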


2004-01-16 16:40:54

by John Bradford

Subject: Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

Quote from Jonathan Kamens <[email protected]>:
> John Bradford writes:
>
> > Maybe not - the most common cause I've seen for that message in the
> > logs is trying to access S.M.A.R.T. information when S.M.A.R.T. is
> > disabled.
> >
> > I.e., the error should be reproducible with:
> >
> > # smartctl -d /dev/hda
> > # smartctl -a /dev/hda
> >
> > Are you sure you weren't trying to get S.M.A.R.T. info from the
> > drive at the time the error was logged?
>
> My smartctl wants "-s off" rather than "-d", but other than that,
> you're correct, that sequence of commands does cause the same error to
> appear in the logs. But why/how would SMART be disabled on the drive?
> I've been running smartd on the drive for weeks with no errors of this
> sort, and I fail to see how SMART would suddenly be disabled on the
> drive with no action on my part,

Some motherboard BIOSes disable S.M.A.R.T. on drives connected to
their on-board controllers on each boot. Quite possibly some PCI IDE
cards do as well. It's possible, (but probably not likely), that by
trying the drive on different controllers a BIOS somewhere has
disabled S.M.A.R.T.

> so it seems more likely that some
> other condition caused the error.

Quite possibly, but I can only really guess as to what that might be
at this point.

I _think_ that UDMA CRC checking is only done on data transfers, not
commands. I've CC'ed Alan in the hope of getting some confirmation on
this. Maybe a command being corrupted on the wire could theoretically
cause that error.

John.

2004-01-16 18:04:51

by Jonathan Kamens

Subject: Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

John Bradford writes:
> Some motherboard BIOSes disable S.M.A.R.T. on drives connected to
> their on-board controllers on each boot. Quite possibly some PCI IDE
> cards do as well. It's possible, (but probably not likely), that by
> trying the drive on different controllers a BIOS somewhere has
> disabled S.M.A.R.T.

The error occurred at least a week after the drive was moved to its
current controller. In that week both the drive and smartd ran just
fine with no errors.

jik

2004-01-16 20:53:28

by Alan Cox

Subject: Re: Updated on UDMA BadCRC errors + subsequent problems (was: Is it safe to ignore UDMA BadCRC errors?)

On Fri, Jan 16, 2004 at 04:48:34PM +0000, John Bradford wrote:
> I _think_ that UDMA CRC checking is only done on data transfers, not
> commands. I've CC'ed Alan in the hope of getting some confirmation on
> this. Maybe a command being corrupted on the wire could theoretically
> cause that error.

You are correct for PATA, but the situation there is very, very unlikely,
let alone for it to be repeatable.