2001-11-25 04:57:43

by Jonathan Kamens

Subject: IDE: 2.2.19+IDE patches works fine; 2.4.x fails miserably; please help me figure out why!

For months now, I've been trying every 2.4.x kernel as it comes out.
Every time, I start getting IDE errors shortly after booting into the
2.4.x kernel. My filesystems aren't totally trashed, but lots of the
new data being written to the filesystems is trashed, and I have to
fix a bunch of errors with fsck and recreate those trashed new files
after reverting to my 2.2.19 kernel (to which I have applied Andre's
IDE patches).

When I use "hdparm" to examine the settings of all of my hard drives
in 2.2.19 and 2.4.x, the only difference is that the 2.4.x kernel
sets multcount to 16 by default while 2.2.19 sets it to 0 by default.
Setting multcount to 0 with 2.4.x for all my drives does not help -- I
still get the errors as soon as I start doing a lot of disk
activity.

Here's an example of the errors I got in the last go-around before I
gave up on 2.4.16-pre1 (with irrelevant fields removed to make the
syslog output easier to read):

22:58:56 hde: timeout waiting for DMA
22:58:58 hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
22:58:58 hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
22:58:59 hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
22:58:59 hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
22:58:59 hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
22:58:59 hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
22:58:59 hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
22:58:59 hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
22:59:19 hde: timeout waiting for DMA
23:00:23 hde: timeout waiting for DMA
23:00:24 hde: timeout waiting for DMA
23:00:24 hdg: timeout waiting for DMA
23:00:31 hdg: timeout waiting for DMA
23:00:33 hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
23:00:33 hdg: drive not ready for command
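
For reference, the status= and error= values in these log lines are bitmasks, and the names in braces are the set bits spelled out. The Python sketch below decodes them. The masks for DataRequest, Error, DriveStatusError and BadCRC can be inferred by comparing the values logged in this thread (0x59, 0x58, 0x51, 0x50, 0x84, 0x04); which of 0x40 and 0x10 is DriveReady and which is SeekComplete, and every entry marked "assumed", follow the conventional ATA register layout and are not confirmed by anything in the thread.

```python
# Status register bits. Names match the labels printed in the kernel log.
# Entries marked "assumed" never appear in the logs in this thread; they are
# taken from the conventional ATA layout and should be checked against the
# kernel source before being relied on.
STATUS_BITS = [
    (0x80, "Busy"),            # assumed
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),     # assumed
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),  # assumed
    (0x02, "Index"),           # assumed
    (0x01, "Error"),
]


def decode_status(value):
    """Return the names of the bits set in an IDE status byte."""
    return [name for mask, name in STATUS_BITS if value & mask]


def decode_error(value):
    """Return names for the two error bits seen in this thread.

    0x04 is logged as DriveStatusError. 0x80 is logged as BadCRC when 0x04
    is also set; labelling it BadSector otherwise is an assumption.
    """
    names = []
    if value & 0x04:
        names.append("DriveStatusError")
    if value & 0x80:
        names.append("BadCRC" if value & 0x04 else "BadSector")
    return names


if __name__ == "__main__":
    for status in (0x59, 0x58, 0x51, 0x50):
        print(f"status=0x{status:02x}", decode_status(status))
    for error in (0x84, 0x04):
        print(f"error=0x{error:02x}", decode_error(error))
```

Read this way, status=0x59 with error=0x84 says the drive was ready and still had data pending when it aborted the command and flagged a CRC error on the transfer.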

I've seen people mention in comp.os.linux.development.system that the
BadCRC error may indicate a cable problem. However, (a) I'm pretty
certain that I'm using Ultra66 cables for both hde and hdg, and (b) if
that's the problem, why don't I get the same errors with 2.2.19?

As for (a), I believe I've got the right cables because I checked when
I installed them and because the controller (Promise Ultra66)
recognizes both hde and hdg as Ultra-capable drives when it starts up
(which it wouldn't do if I didn't have the correct cables -- I know
this because it wasn't doing it when I didn't have the correct cables
;-).

As for (b), is 2.4.x more paranoid about and/or better at checking
CRCs than 2.2.19 was?

I should note that when the errors shown in the log above are
happening, I'm also seeing "Lost interrupt" messages on my console for
hde or hdg.

Appended below are the pertinent details about the two drives that are
giving me trouble. If anyone can offer *any* insights into what I can
do to debug and solve this problem, I'd much appreciate it. Until I
can solve it, I'm stuck using 2.2.x, which is unfortunate since (a)
Andre has stopped maintaining his IDE backport patches for new 2.2.x
versions and (b) there's functionality in 2.4.x that I want to use.

Thank you,

Jonathan Kamens

*************************

/dev/hde:
multcount = 0 (off)
I/O support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 524/255/63, sectors = 8421840, start = 0

/dev/hde:

non-removable ATA device, with non-removable media
    Model Number:       SAMSUNG SV0432D
    Serial Number:      0125J1EK821690
    Firmware Revision:  KS100
Standards:
    Supported: 1 2 3
    Likely used: 4
Configuration:
    Logical         max     current
    cylinders       8912    8912
    heads           15      15
    sectors/track   63      63
    bytes/track:    32256   (obsolete)
    bytes/sector:   512     (obsolete)
    current sector capacity: 8421840
    LBA user addressable sectors = 8421840
Capabilities:
    LBA, IORDY(can be disabled)
    Buffer size: 480.0kB    ECC bytes: 4    Queue depth: 1
    Standby timer values: spec'd by Vendor
    r/w multiple sector transfer: Max = 16  Current = 0
    DMA: sdma0 sdma1 sdma2 *mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 (?)
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
            Host Protected Area feature set
            Power Management feature set
            SMART feature set
            DOWNLOAD MICROCODE cmd

/dev/hdg:
multcount = 0 (off)
I/O support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 1868/255/63, sectors = 30015216, start = 0

/dev/hdg:

non-removable ATA device, with non-removable media
    Model Number:       Maxtor 51536U3
    Serial Number:      K3H0XSDC
    Firmware Revision:  DA620CQ0
Standards:
    Used: ATA/ATAPI-4 T13 1153D revision 17
    Supported: 1 2 3 4 5 & some of 5
Configuration:
    Logical         max     current
    cylinders       16383   16383
    heads           16      16
    sectors/track   63      63
    bytes/track:    0       (obsolete)
    bytes/sector:   0       (obsolete)
    current sector capacity: 16514064
    LBA user addressable sectors = 30015216
Capabilities:
    LBA, IORDY(can be disabled)
    Buffer size: 2048.0kB   ECC bytes: 57   Queue depth: 1
    Standby timer values: spec'd by standard, no device specific minimum
    r/w multiple sector transfer: Max = 16  Current = 0
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 *udma4
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    NOP cmd
       *    READ BUFFER cmd
       *    WRITE BUFFER cmd
       *    Host Protected Area feature set
       *    look-ahead
       *    write cache
       *    Power Management feature set
            SMART feature set
            Advanced Power Management feature set
       *    DOWNLOAD MICROCODE cmd
HW reset results:
    CBLID- above Vih
    Device num = 0 determined by the jumper
Checksum: correct
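
As a sanity check on the two identify dumps, the sector counts can be recomputed from the reported geometry. The Python sketch below does only that arithmetic, using figures copied from the dumps. The 512-byte sector size is taken from the hde "bytes/sector" line and is assumed for hdg, which reports 0 there.

```python
SECTOR_BYTES = 512  # reported for hde; assumed for hdg


def chs_sectors(cylinders, heads, sectors_per_track):
    """Number of sectors addressable through a cylinder/head/sector geometry."""
    return cylinders * heads * sectors_per_track


# hde (SAMSUNG SV0432D): the logical geometry covers the whole drive.
hde_chs = chs_sectors(8912, 15, 63)
hde_lba = 8421840

# hdg (Maxtor 51536U3): the logical geometry tops out at 16383 cylinders,
# so the CHS figure is smaller than the LBA sector count.
hdg_chs = chs_sectors(16383, 16, 63)
hdg_lba = 30015216

if __name__ == "__main__":
    for name, chs, lba in (("hde", hde_chs, hde_lba), ("hdg", hdg_chs, hdg_lba)):
        gigabytes = lba * SECTOR_BYTES / 1e9
        print(f"{name}: CHS={chs} sectors, LBA={lba} sectors, {gigabytes:.2f} GB")
```

Both results agree with the "current sector capacity" lines in the dumps (8421840 and 16514064), so the identify data itself looks self-consistent on both drives.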


2001-11-25 13:23:52

by Phil Howard

Subject: Re: IDE: 2.2.19+IDE patches works fine; 2.4.x fails miserably; please help me figure out why!

On Sat, Nov 24, 2001 at 11:57:18PM -0500, Jonathan Kamens wrote:

| I've seen people mention in comp.os.linux.development.system that the
| BadCRC error may indicate a cable problem. However, (a) I'm pretty
| certain that I'm using Ultra66 cables for both hde and hdg, and (b) if
| that's the problem, why don't I get the same errors with 2.2.19?
|
| As for (a), I believe I've got the right cables because I checked when
| I installed them and because the controller (Promise Ultra66)
| recognizes both hde and hdg as Ultra-capable drives when it starts up
| (which it wouldn't do if I didn't have the correct cables -- I know
| this because it wasn't doing it when I didn't have the correct cables
| ;-).

I've had drive problems simply as a result of bad _instances_ of cables,
even though it was the right kind of cable. I've also had drive problems
as a result of looseness of connections. Some connector _instances_
seem to be more prone to this, and some headers on some motherboards
also seem to be more prone to this.

<paranoid>
Pull the cable and clean all the connections on the cable, board, and
drive. High pressure air is good enough in most cases, but liquid
cleaner might be needed in some. Check the cable for dings, nicks,
and sharp bends. These have more impact on ATA66 cables. If it has
any, get a new cable and cut the old one in half. If you have a
single ATA66 drive, be sure it is on the _end_ connector. While you
can cut off the end of older IDE cables to make short ones, not so
with ATA66 cables. Leave the middle connector dangling or get a cable
with just 2 connectors. Also, don't "round" an ATA66 cable.
</paranoid>

And finally, I actually have a couple drives that exhibit these problems
no matter what I do. The drives _can_ simply be bad.

--
-----------------------------------------------------------------
| Phil Howard - KA9WGN | Dallas | http://linuxhomepage.com/ |
| [email protected] | Texas, USA | http://phil.ipal.org/ |
-----------------------------------------------------------------

2001-11-25 20:55:24

by Jonathan Kamens

Subject: Re: IDE: 2.2.19+IDE patches works fine; 2.4.x fails miserably; please help me figure out why!

OK, so I opened up my case. I cleaned everything out with compressed
air. I unplugged the two IDE cables connecting my drives to my Promise
Ultra66 controller. I replaced them with brand new Maxtor ATA/100
cables I just bought today. In the process, I discovered that one of
the two old cables was reversed, i.e., the end that was supposed to be
connected to the controller was connected to the drive instead, but this
does not seem to have been the problem. I closed up the computer,
powered it up, booted 2.4.16-pre1, and did some disk-intensive stuff on
my hde drive. It's still barfing:

hde: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
ide2: reset: success
hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
ide2: reset: success
hde: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
hde: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
hde: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
hdg: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
hdg: status error: status=0x51 { DriveReady SeekComplete Error }
hdg: status error: error=0x04 { DriveStatusError }
hdg: no DRQ after issuing MULTWRITE
hdg: status error: status=0x51 { DriveReady SeekComplete Error }
hdg: status error: error=0x04 { DriveStatusError }
hdg: no DRQ after issuing MULTWRITE
hdg: status error: status=0x51 { DriveReady SeekComplete Error }
hdg: status error: error=0x04 { DriveStatusError }
hdg: no DRQ after issuing MULTWRITE
hdg: status error: status=0x51 { DriveReady SeekComplete Error }
hdg: status error: error=0x04 { DriveStatusError }
hdg: no DRQ after issuing WRITE
ide3: reset: success
hde: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hde: drive not ready for command
hdg: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hdg: drive not ready for command
hdg: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hdg: drive not ready for command
hdg: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
hde: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hde: drive not ready for command
hde: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hde: drive not ready for command
hde: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hde: drive not ready for command
hde: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hde: drive not ready for command
hde: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hde: drive not ready for command
hde: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hde: drive not ready for command
hde: status error: status=0x58 { DriveReady SeekComplete DataRequest }
hde: drive not ready for command

I set multcount to 0 and the problems persist, just as they did with
the old cables.

This clearly isn't a problem with my cables (and I've just wasted over
$40 to prove it, unless I can convince Staples to take back the opened
cables).

If it's a problem with my drives, then how is it that I don't have any
problems at all when I run 2.2.19+IDE on exactly the same hardware?
And isn't it a mighty big coincidence that *both* of the drives on
this controller are having problems?

Someone asked me in E-mail to post details about the IDE controller
that's having the problems. Here's /proc/ide/pdc202xx:

                                PDC20262 Chipset.
------------------------------- General Status ---------------------------------
Burst Mode                           : enabled
Host Mode                            : Normal
Bus Clocking                         : 33 PCI Internal
IO pad select                        : 10 mA
Status Polling Period                : 15
Interrupt Check Status Polling Delay : 13
--------------- Primary Channel ---------------- Secondary Channel -------------
                enabled                          enabled
66 Clocking     enabled                          enabled
           Mode PCI                         Mode PCI
                FIFO Empty                       FIFO Empty
--------------- drive0 --------- drive1 -------- drive0 ---------- drive1 ------
DMA enabled:    no               yes             no                yes
DMA Mode:       UDMA 4           NOTSET          UDMA 4            NOTSET
PIO Mode:       PIO 4            NOTSET          PIO 4             NOTSET

2001-11-25 21:32:40

by Jonathan Kamens

Subject: Re: IDE: 2.2.19+IDE patches works fine; 2.4.x fails miserably; please help me figure out why!

(Responding to E-mail sent to me privately by Mark Hahn....)

> > This clearly isn't a problem with my cables (and I've just wasted over
> > $40 to prove it, unless I can convince Staples to take back the opened
> > cables).
>
> are the cables 18"? lots of places sell 24" cables, which have never
> been valid...

No, both the new cables I put in and my old ones were 18" cables.

> > If it's a problem with my drives, then how is it that I don't have any
> > problems at all when I run 2.2.19+IDE on exactly the same hardware?
>
> 2.2 doesn't contain chipset-specific mode tuning code, afaik.
> in general, it just uses the mode as programmed by the bios.

Perhaps I have not explained myself clearly enough, or perhaps my
understanding of what Andre's 2.2.x IDE patch does is incorrect.

I am not using stock 2.2. I am using 2.2 plus Andre Hedrick's IDE
backport patch. I thought that the whole point of this patch was to
backport the enhanced IDE functionality from 2.4 back to 2.2.

Without Andre's patch, I wouldn't be able to send you the output of
/proc/ide/pdc202xx, because it wouldn't exist, because (as you point
out) there would be no code in the kernel specific to that chipset.
With the patch, there *is* code in the kernel specific to that
chipset.

> > Bus Clocking : 33 PCI Internal
>
> my best theory is that this is wrong. I assume you're not overclocking,
> but have you scrutinized your bios settings? the ide clock is usually
> hung off the PCI clock, divided down from AGP, divided down from FSB.

This is all Greek to me. Could you translate a bit for the
kernel-internals-impaired? What should I be looking at/for, exactly?

> wrong clocking (or the driver somehow using the wrong timing)
> would explain both the messages you're seeing.

But the settings are the same as those used by 2.2.19+IDE, with which
I'm not having any problems. Here's /proc/ide/pdc202xx when I'm
running 2.2.19+IDE:

                                PDC20262 Chipset.
------------------------------- General Status ---------------------------------
Burst Mode                           : enabled
Host Mode                            : Normal
Bus Clocking                         : 33 PCI Internal
IO pad select                        : 10 mA
Status Polling Period                : 15
Interrupt Check Status Polling Delay : 13
--------------- Primary Channel ---------------- Secondary Channel -------------
                enabled                          enabled
66 Clocking     enabled                          enabled
           Mode PCI                         Mode PCI
                FIFO Empty                       FIFO Empty
--------------- drive0 --------- drive1 -------- drive0 ---------- drive1 ------
DMA enabled:    yes              yes             yes               yes
DMA Mode:       UDMA 4           NOTSET          UDMA 4            NOTSET
PIO Mode:       PIO 4            NOTSET          PIO 4             NOTSET
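
Comparing the two /proc/ide/pdc202xx dumps in this thread row by row turns up one difference: the "DMA enabled" row reads no/yes/no/yes under 2.4.16-pre1 and yes/yes/yes/yes under 2.2.19+IDE. Whether that row reflects a real difference in how the driver programmed the chip is not established here. The Python sketch below shows one way to make the comparison mechanical. The function name is made up for illustration, and the snapshots are the status and per-drive rows copied from the two dumps with whitespace collapsed.

```python
# Rows copied from the two /proc/ide/pdc202xx dumps in this thread.
DUMP_2_4 = """\
Burst Mode : enabled
Host Mode : Normal
Bus Clocking : 33 PCI Internal
IO pad select : 10 mA
DMA enabled: no yes no yes
DMA Mode: UDMA 4 NOTSET UDMA 4 NOTSET
PIO Mode: PIO 4 NOTSET PIO 4 NOTSET
"""

DUMP_2_2 = """\
Burst Mode : enabled
Host Mode : Normal
Bus Clocking : 33 PCI Internal
IO pad select : 10 mA
DMA enabled: yes yes yes yes
DMA Mode: UDMA 4 NOTSET UDMA 4 NOTSET
PIO Mode: PIO 4 NOTSET PIO 4 NOTSET
"""


def differing_rows(first, second):
    """Return (label, first_value, second_value) for each row that differs.

    Rows are paired by position and compared after collapsing whitespace,
    so column alignment in the original dump does not matter.
    """
    result = []
    for row_a, row_b in zip(first.splitlines(), second.splitlines()):
        norm_a = " ".join(row_a.split())
        norm_b = " ".join(row_b.split())
        if norm_a != norm_b:
            label, _, value_a = norm_a.partition(":")
            _, _, value_b = norm_b.partition(":")
            result.append((label.strip(), value_a.strip(), value_b.strip()))
    return result


if __name__ == "__main__":
    for label, new, old in differing_rows(DUMP_2_4, DUMP_2_2):
        print(f"{label}: 2.4.16-pre1 = {new!r}, 2.2.19+IDE = {old!r}")
```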

Thanks,

Jonathan Kamens

2001-11-27 00:36:14

by nakai

Subject: Re: IDE: 2.2.19+IDE patches works fine; 2.4.x fails miserably; please help me figure out why!

Jonathan Kamens wrote:
> For months now, I've been trying every 2.4.x kernel as it comes out.
> Every time, I start getting IDE errors shortly after booting into the
> 2.4.x kernel. My filesystems aren't totally trashed, but lots of the
> new data being written to the filesystems is trashed, and I have to
> fix a bunch of errors with fsck and recreate those trashed new files
> after reverting to my 2.2.19 kernel (to which I have applied Andre's
> IDE patches).

Would you try 2.4.2? Some ide-pci code related to Promise was
changed between 2.4.2 and 2.4.10. I also had a problem in 2.4.10
with Promise.

But I needed to run 2.4.10, because 2.4.10 supports the Promise 100tx2
(pdc20268 chip). 2.4.2 doesn't support the Promise 100tx2. I had to
edit the ide-pci code. It works with no problem, but I'm not sure
it is correct. If you want to run >2.4.10, I will send you diffs.

--
-=-=-=-= SHINKO ELECTRIC INDUSTRIES CO., LTD. =-=-=-=-
=-=-=-=- Core Technology Research & Laboratory, -=-=-=-=
-=-=-=-= Infomation Technology Research Dept. =-=-=-=-
=-=-=-=- Name:Hisakazu Nakai TEL:026-283-2866 -=-=-=-=
-=-=-=-= Mail:[email protected] FAX:026-283-2820 =-=-=-=-

2001-11-27 08:15:35

by Jonathan Kamens

Subject: Re: IDE: 2.2.19+IDE patches works fine; 2.4.x fails miserably; please help me figure out why!

> Date: Tue, 27 Nov 2001 09:35:40 +0900
> From: nakai <[email protected]>
>
> Would you try 2.4.2?

No dice. It fails pretty spectacularly:

Nov 26 21:29:47 jik kernel: hde: timeout waiting for DMA
Nov 26 21:29:47 jik kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Nov 26 21:29:47 jik kernel: hde: irq timeout: status=0x50 { DriveReady SeekComplete }
Nov 26 21:29:48 jik kernel: hdg: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 26 21:29:48 jik kernel: hdg: dma_intr: error=0x04 { DriveStatusError }
Nov 26 21:29:50 jik kernel: hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Nov 26 21:29:50 jik kernel: hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
Nov 26 21:29:50 jik kernel: hdg: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 26 21:29:50 jik kernel: hdg: dma_intr: error=0x04 { DriveStatusError }
Nov 26 21:29:50 jik kernel: attempt to access beyond end of device
Nov 26 21:29:50 jik kernel: 22:01: rw=0, want=1101324772, limit=7502323
[lots more lines like the last two]
Nov 26 21:29:52 jik kernel: hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Nov 26 21:29:52 jik kernel: hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
Nov 26 21:29:52 jik kernel: hde: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Nov 26 21:29:52 jik kernel: hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
Nov 26 21:29:52 jik kernel: ide2: reset: success
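
The "attempt to access beyond end of device" lines in this log can be unpacked a little further. The Python sketch below decodes the "22:01" device field and compares want against limit. It rests on several assumptions not stated in the thread: that the device field is hexadecimal major:minor, that the IDE major numbers follow the usual Linux assignment (3, 22, 33, 34), that each drive gets 64 minors, and that want and limit are counted in 1 KiB blocks.

```python
# Conventional Linux IDE major numbers (assumed, not taken from this thread).
IDE_MAJORS = {
    3: ("hda", "hdb"),
    22: ("hdc", "hdd"),
    33: ("hde", "hdf"),
    34: ("hdg", "hdh"),
}


def decode_dev(field):
    """Turn a hex 'major:minor' field such as '22:01' into a device name."""
    major, minor = (int(part, 16) for part in field.split(":"))
    unit, partition = minor >> 6, minor & 0x3F  # 64 minors per drive (assumed)
    name = IDE_MAJORS[major][unit]
    return name + (str(partition) if partition else "")


# Values copied from the log line.
WANT = 1101324772
LIMIT = 7502323

# hdg's LBA sector count, from the hdparm dump earlier in the thread.
HDG_SECTORS = 30015216

if __name__ == "__main__":
    print("device:", decode_dev("22:01"))
    print("want / limit:", round(WANT / LIMIT, 1))
    # With 1 KiB blocks, the partition spans LIMIT * 2 sectors of 512 bytes.
    print("partition sectors:", LIMIT * 2, "of", HDG_SECTORS)
```

Under those assumptions the device is hdg1, a partition covering about half of hdg, and the requested block is more than a hundred times past its end. That pattern is what one would expect if block numbers read back from the disk were themselves garbage, though the log alone does not prove it.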

With Mark Hahn's guidance, I have learned the following additional
information about these problems:

* It only occurs when I use an SMP kernel (I have two processors).
With the uniprocessor kernel (i.e., letting my second processor sit
idle), there are no disk errors.

* Changing the BIOS settings from MPS 1.4 to MPS 1.1 reduces the
frequency of, but does not completely eliminate, the problems.

Furthermore, I happened to have a SIIG Ultra ATA/66 controller lying
around, so I decided to see what would happen if I used it instead of
the Promise controller. I rebuilt my 2.4.16 kernel with HPT366,
swapped the controllers and rebooted, and the problems persist with
the SIIG controller.

On the other hand, at least while the SIIG controller is working, it
seems to be much faster than the Promise controller, so I think I'll
leave it in and switch back to 2.2.19+IDE for the time being :-).

Thanks,

Jonathan Kamens