2002-08-29 14:01:45

by Mike Isely

Subject: 2.4.20-pre4-ac1 trashed my system


Hi,

I've been a Linux user since 1994. I have always built my own
kernels. I have never trashed a system, until 2 nights ago, when I
ran 2.4.20-pre4-ac1...

Unfortunately what I have is very short on detail, because most of the
"evidence burned in the fire" when the disk's file systems were
destroyed. I do have .deb's of the suspect kernel (I'm a Debian user)
and I think I can recover the kernel .config file from that. Here's
what else I can supply:

o System: Athlon XP 1700+ CPU, built on an Asus A7V266-E mainboard.

o Ram: 512MB

o Disk: 160GB Maxtor IDE (note: >137GB)

o Controller: Promise 20265 rev 02 (as reported by lspci)

o Note: The Promise controller is not the primary controller on the
board. This is a second controller equipped on the board. I
point this out because the 160GB disk was connected to this
Promise controller, not the motherboard's default controller.

o I was using ext3 everywhere at the time things exploded.

o The previous kernel before 2.4.20-pre4-ac1 that I ran was
2.4.19-ac4, which ran OK on this hardware combination.

The first symptom I observed was a directory that listed incorrectly
as a file. It wasn't on my root file system so I unmounted it and
attempted an fsck. At this point I wasn't suspecting the new kernel
(I should have), otherwise I would have backed off to 2.4.19-ac4
first. But I didn't. This file system was about 120GB, most of the
disk; I hadn't noticed any trouble with other file systems yet.

The fsck went through about 60% of the file system cleanly and then
just went nuts reporting / fixing errors. Then fsck gave up,
complaining about something wrong with the journal file. I
reattempted it (second mistake); this time it died right away with the
same error. Then I noticed other processes hanging on the system. I
was unable to shut down so I power-cycled. The reboot panicked after
failing to find the init executable (though it did manage to mount
root).

Going further down this trail of damage, I then tried to boot a rescue
partition (about 200MB) previously set up on the same disk. This was
a partition that was _not_ _mounted_ when 2.4.20-pre4-ac1 was running.
This boot attempt got as far as trying to start things in /etc/init.d
before croaking with a pile of SEGVs. I managed to fsck the rescue
partition but the damage had been done and that partition never worked
right again (i.e. corrupted files).

As far as I can tell now, the entire disk has been scrambled (except
for the partition table, which seems to have survived unscathed).

Also FWIW I did check kernel log message output during this melee and
saw nothing unusual, specifically no errors from the IDE subsystem.

The only things I see about this that might be noteworthy are:

1. I was using the Promise controller for my system disk, not the
board's primary controller.

2. I was using a 160GB drive, which exceeds the 137GB limit of ATA-5
(?). Notably, the initial fsck got ugly about 60% of the way
through, which I _think_ would have put that right near the 137GB
boundary of the disk, given where that particular partition was
set up.
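
For reference, the 137GB figure is the ceiling of 28-bit LBA addressing with 512-byte sectors. A minimal sketch of the arithmetic, assuming the drive's nominal 160GB is counted in decimal gigabytes (the function name is illustrative only):

```python
SECTOR_BYTES = 512  # traditional ATA sector size


def max_capacity_bytes(address_bits: int) -> int:
    """Largest capacity addressable with the given number of LBA bits."""
    return (1 << address_bits) * SECTOR_BYTES


lba28_limit = max_capacity_bytes(28)
nominal_drive = 160 * 10**9  # assumption: vendor-style decimal gigabytes

print(f"28-bit LBA limit: {lba28_limit} bytes")
print(f"  = {lba28_limit / 10**9:.1f} GB decimal, {lba28_limit // 2**30} GiB binary")
print(f"share of a 160GB drive below the limit: {lba28_limit / nominal_drive:.1%}")
```

On those assumptions the limit sits well past the middle of the drive. An fsck progress percentage is presumably not a linear measure of disk offset, so this neither confirms nor rules out the guess above.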

Is there a possible problem here with huge disk support using the
Promise 20265 controller in 2.4.20-pre4-ac1?

Unfortunately I need that system back so I'm rebuilding it now (and
moving the disk off of the Promise controller out of paranoia).
Sorry...

Is this a known problem with 2.4.20-pre4-ac1? I did note Alan's
statement about using the -ac series for further IDE development and
wonder if perhaps I got caught in the crosshairs.

-Mike


                        |         Mike Isely          |     PGP fingerprint
 POSITIVELY NO          |                             | 03 54 43 4D 75 E5 CC 92
 UNSOLICITED JUNK MAIL! |   isely @ pobox (dot) com   | 71 16 01 E2 B5 F5 C1 E8
                        |   (spam-foiling address)    |


2002-08-29 15:25:26

by Alan

Subject: Re: 2.4.20-pre4-ac1 trashed my system

The Promise 20265 does need special handling for LBA48, I believe. The
code should also be handling it correctly. Cc'd to Andre to investigate
further.

2002-08-29 17:13:47

by Andre Hedrick

Subject: Re: 2.4.20-pre4-ac1 trashed my system


That host does have a flag check on the primary channel.
The Secondary has been observed and many people have verified the second
channel works okay in 48-bit.

If you have a system which has a 28-bit limited host, and it has been
openly discussed on lkml for many months, why would one not use the
jumpon.exe from Maxtor to prevent such problems?

What I want is details from the last kernel you booted and worked, because
I am positive AC's code does the correct thing. I was one of the first
people to find the 48-bit bomb in that asic during prototype of the large
drive technology.

So please add more details, and regardless this is a semi-development
thread and nobody else has reported this error.

On 29 Aug 2002, Alan Cox wrote:

> The promise 20265 does need special handling for LBA48 I believe. The
> code should also be handling it correctly. Cc'd to Andre to investigate
> further
>

Cheers,

Andre Hedrick
LAD Storage Consulting Group

2002-08-29 17:58:21

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Thu, 29 Aug 2002, Andre Hedrick wrote:

>
> That host does have a flag check on the primary channel.
> The Secondary has been observed and many people have verified the second
> channel works okay in 48-bit.
>
> If you have a system which has a 28-bit limited host, and it has been
> openly discussed on lkml for many months, why would one not use the
> jumpon.exe from maxtor to prevent such problems.

First, I'm new to lkml. I did search the recent archives for
information on this topic, hoping that if it was a new problem it would
have shown up here. Since the trouble for me began with
2.4.20-pre4-ac1, I did not search that far back.

I have never used "jumpon.exe" from Maxtor. I don't even know what it
is (yet, I'm sure I'm going to find out awful quick now...). When I set
up the system, it "just worked" from day 1 with the existing IDE driver
in the 2.4.19-preX series so I had no reason to go looking for issues
like this.


>
> What I want is details from the last kernel you booted and worked, because
> I am positive AC's code does the correct thing. I was one of the first
> people to find the 48-bit bomb in that asic during prototype of the large
> drive technology.

I don't doubt that it worked at some point. It had to have worked,
otherwise my hardware would never have worked at all at any time. The
fact is that the system was stable for several months; I had installed a
full Debian setup on that hard drive, while attached to that controller,
and dumped tens of GB to it over that time without incident. The
trouble happened when I updated the kernel to 2.4.20-pre4-ac1.

The previous kernel I had booted was 2.4.19-ac4, configured similarly
(copied its .config forward to build 2.4.20-pre4-ac1). And before that,
I had run 2.4.19-pre10-ac2 without any IDE problems.


>
> So please add more details, and regardless this is a semi-development
> thread and nobody else has reported this error.

I'll add more details as I learn them, but right now I must point out:

The same hardware configuration ran 2.4.19-ac4 just fine. The only
change to the system was booting the newer kernel. No hardware changes,
no BIOS updates, nothing else. Whatever went wrong got introduced
somewhere between that version and 2.4.20-pre4-ac1.

Unfortunately as I said originally, all the gory details "burned in the
fire" so I have precious little else to offer.

I will go back further in lkml and get up to speed on what happened back
then with the "48 bit bomb", and I will look into your references about
"28-bit limited host" and jumpon.exe.

-Mike



2002-08-29 19:11:39

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system


On Thu Aug 29 13:46:10 2002, Mike Isely wrote:
>
> On Thu, 29 Aug 2002, Andre Hedrick wrote:
>
> > If you have a system which has a 28-bit limited host, and it has been
> > openly discussed on lkml for many months, why would one not use the
> > jumpon.exe from maxtor to prevent such problems.
>
> I have never used "jumpon.exe" from Maxtor. I don't even know what it
> is (yet, I'm sure I'm going to find out awful quick now...). When I set
> up the system, it "just worked" from day 1 with the existing IDE driver
> in the 2.4.19-preX series so I had no reason to go looking for issues
> like this.

I did some digging and I think I can answer these points a little
better now.

If "28-bit limited host" refers to a system BIOS which can't do LBA48,
then I don't think that's a problem here. I've been successfully
booting this system without any special tweaks / fixes (hardware or
software) for quite some time now. The Asus A7V266-E motherboard does
indeed use an Award BIOS, but it's Award version 6.0 dated 2000, not
1999 as in some previous posts about there being trouble booting >32GB
hard drives. Note: Since I was booting from the onboard Promise
controller, the Promise BIOS was in play here too. It's version
2.01.0 build 43, copyright 2001.

I understand now that jumpon.exe is a Maxtor utility to help boot hard
drives >32GB in systems which otherwise can't do this. I never
learned about it before because I've never had this problem. Indeed,
the first OS I put on the hardware was Linux (last December using a
2.4.18 kernel with additional IDE patches to support LBA48); it didn't
even see a DOS/Windows type boot disk until months later.

So I don't think any of this is an issue.

>
> I'll add more details as I learn them, but right now I must point out:
>
> The same hardware configuration ran 2.4.19-ac4 just fine. The only
> change to the system was booting the newer kernel. No hardware changes,
> no BIOS updates, nothing else. Whatever went wrong got introduced
> somewhere between that version and 2.4.20-pre4-ac1.

I think the above point is extremely important.


>
> I will go back further in lkml and get up to speed on what happened back
> then with the "48 bit bomb", and I will look into your references about
> "28-bit limited host" and jumpon.exe.
>

I've done some more looking through the lkml archives and I found
discussions from March / April about LBA48 problems and the Promise
controller. Clearly from that, exactly how well LBA48 works seems to
depend a lot on whether or not PIO vs DMA vs UltraDMA is being used.
Also it looks like if CONFIG_IDE_TASKFILE_IO is on then things may yet
be different. To those points, I can add these details for my
situation: I believe the driver was in UltraDMA mode at the time and I
had CONFIG_IDE_TASKFILE_IO turned on.

I do understand your response here to my post. I'm making an
extraordinary claim here for something that should just not happen at
all. I understand the doubt. The simple fact however is that I still
have a trashed system, and it happened only after updating the kernel.
I know that's not a lot to go on, and again I apologize for lack of
detail. I originally wasn't going to post to lkml about this; I have
been a quiet Linux user for 8+ years and really felt that a problem of
this severity would probably already have been noticed. I really
didn't want to jump into the fray with this sort of "information".
However several others that I work with (who are closer to the lkml
community than I) really insisted that I post this information,
however incomplete it is. So I did.

If I'm the only one that has hit this - another reason for doubt -
then I guess I have no choice but to dig deeper. I can't really leave
the broken system like this to play with. However I do have a smaller
spare hard drive and I'll make that the new system disk, leaving the
160GB Maxtor attached to the Promise controller (with nothing valuable
on it). I should be able to replicate the corruption and provide more
information here, hopefully while still having a usable system.

-Mike



2002-08-29 19:21:10

by Alan

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Thu, 2002-08-29 at 20:15, Mike Isely wrote:
> I've done some more looking through the lkml archives and I found
> discussions from March / April about LBA48 problems and the Promise
> controller. Clearly from that, exactly how well LBA48 works seems to

That was when the original work got done if I remember rightly

> depend a lot on whether or not PIO vs DMA vs UltraDMA is being used.
> Also it looks like if CONFIG_IDE_TASKFILE_IO is on then things may yet
> be different. To those points, I can add these details for my
> situation: I believe the driver was in UltraDMA mode at the time and I
> had CONFIG_IDE_TASKFILE_IO turned on.

PIO LBA48 seems to work on all Promise.
Early Promise needs a helping hand with DMA LBA48, one Promise doesn't
seem to do DMA LBA48 on secondary at all, and newer stuff gets it right.

> all. I understand the doubt. The simple fact however is that I still
> have a trashed system, and it happened only after updating the kernel.
> I know that's not a lot to go on, and again I apologize for lack of
> detail. I originally wasn't going to post to lkml about this; I have
> been a quiet Linux user for 8+ years and really felt that a problem of
> this severity would probably already have been noticed. I really

You've actually provided pretty much all the key information. The things
that matter are:

The file system was known good, passed fsck before you ran the
recent kernel

The file system wasn't good after this

The problem is reproducible

And which controller/drives were involved, which you've provided.


> If I'm the only one that has hit this - another reason for doubt -
> then I guess I have no choice but to dig deeper. I can't really leave
> the broken system like this to play with. However I do have a smaller
> spare hard drive and I'll make that the new system disk, leaving the
> 160GB Maxtor attached to the Promise controller (with nothing valuable
> on it). I should be able to replicate the corruption and provide more
> information here, hopefully while still having a usable system.

If you can replicate it and find out where the problem begins that would
be wonderful in itself.

2002-08-29 19:27:48

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On 29 Aug 2002, Alan Cox wrote:

>
> PIO LBA48 seems to work on all Promise.
> Early Promise needs a helping hand with DMA LBA48, one Promise doesn't
> seem to do DMA LBA48 on secondary at all, and newer stuff gets it right.
>
>
> And which controller/drives were involved, which you've provided.

Another detail: The drive was on the primary cable, configured as
master. It came up as /dev/hde (because hd[a-d] was for the
motherboard's "native" controller).


>
> If you can replicate it and find out where the problem begins that would
> be wonderful in itself.
>

I'll do what I can. I never have enough time. However I've benefitted
from this excellent OS for too long; I should be doing more in return.

-Mike



2002-08-30 07:03:14

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system


OK, I have some good news and some bad news.

The bad news is that I replicated the corruption.

The good news is that I replicated the corruption. Oh, and I can
cause it on demand, and not lose my system in the process. I can
provide LOTS and LOTS of details now. What do you want to know?

Some additional background: The 160GB Maxtor has a number of file
systems on it. Here's the fdisk -l output:

Disk /dev/hde: 255 heads, 63 sectors, 19929 cylinders
Units = cylinders of 16065 * 512 bytes

Device   Boot     Start       End     Blocks   Id  System
/dev/hde1   *         1       912    7325608+   c  Win95 FAT32 (LBA)
/dev/hde2           913     19929  152754052+   5  Extended
/dev/hde5           913       936     192748+  83  Linux
/dev/hde6           937       985     393561   83  Linux
/dev/hde7           986      1058     586341   82  Linux swap
/dev/hde8          1059      1423    2931831   83  Linux
/dev/hde9          1424     19929  148649413+  83  Linux

The file system that started all the fireworks was the big one at the
end, hde9. The rescue partition that booted up corrupted
afterwards was hde6. The toasted root partition was hde8.
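
As a rough cross-check of where the 28-bit LBA limit lands in this layout, the listing can be run through a short sketch. It assumes 512-byte sectors, treats each partition as starting exactly on its cylinder boundary (ignoring the small offset at the start of each logical partition), and uses names that are illustrative only:

```python
SECTORS_PER_CYLINDER = 255 * 63   # 16065, matching the "Units" line
LBA28_SECTORS = 1 << 28           # sectors reachable with a 28-bit address

# (name, start cylinder, end cylinder) copied from the fdisk listing
PARTITIONS = [
    ("hde1", 1, 912),
    ("hde5", 913, 936),
    ("hde6", 937, 985),
    ("hde7", 986, 1058),
    ("hde8", 1059, 1423),
    ("hde9", 1424, 19929),
]


def fraction_below_limit(start_cyl, end_cyl):
    """Return how far into the partition the 28-bit limit falls (0.0 to 1.0),
    or None if the whole partition lies below the limit."""
    first = (start_cyl - 1) * SECTORS_PER_CYLINDER
    end = end_cyl * SECTORS_PER_CYLINDER  # exclusive end sector
    if end <= LBA28_SECTORS:
        return None
    return max(0.0, (LBA28_SECTORS - first) / (end - first))


for name, start, end in PARTITIONS:
    frac = fraction_below_limit(start, end)
    if frac is None:
        print(f"{name}: entirely below the 28-bit limit")
    else:
        print(f"{name}: limit falls {frac:.1%} of the way in")
```

Under these assumptions only hde9 crosses the limit, roughly 83% of the way in; hde6 and hde8 lie entirely below it. fsck's progress figure is presumably not proportional to disk offset, so this is only a sanity check on the layout.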


Here's what I did:

1. I pulled a spare hard drive (80GB Maxtor) and installed it in the
system as hda (primary controller, primary channel, master).

2. I put a Debian installation there. Updated the kernel to
2.4.19-ac4.

3. With a stable system on the spare drive, I moved the 160GB Maxtor
to be hdc (primary controller, secondary channel, master).

4. Using an alternate superblock I managed to fsck the fsck'ed up file
systems on the 160GB drive while running as hdc, while booted under
2.4.19-ac4.

5. I then ran additional fsck passes on the 160GB drive, checking all
partitions. Just for paranoia's sake. All now passed clean.

6. I shut down the system, moved the 160GB drive to be hde (Promise
controller, primary channel, master), and rebooted.

7. I ran the fsck passes again on the drive. Note: This is still
under 2.4.19-ac4, but using the Promise controller. All passed,
squeaky clean. So under 2.4.19-ac4 there's no problem.

8. I rebooted the system to 2.4.20-pre4-ac1 and fsck'ed the big
partition again. Splat. Some time after 50% done it reported an
error.

Unlike the initial carnage, I wasn't an idiot and didn't use the -y
fsck option this time, so it stopped after the first error and since
I'm not writing to the drive, the contents hopefully should still be
OK. I've already rebooted again and repeated the last step. I should
be able to repeat this experiment as often as needed.

Clearly there's something wrong in 2.4.20-pre4-ac1 that wasn't wrong
in 2.4.19-ac4 that is impacting my setup.

Some additional datapoints:

1. During bootup of 2.4.20-pre4-ac1, I found the following message
in the kernel log, not previously seen:

> hde: Maxtor 4G160J8, ATA DISK drive
> ULTRA 66/100/133: Primary channel of Ultra 66/100/133 requires an 80-pin cable for Ultra66 operation.
> Switching to Ultra33 mode.
> Warning: Primary channel requires an 80-pin cable for operation.
> hde reduced to Ultra33 mode.

What makes this notable is that there is indeed an 80 pin cable
connecting the 160GB drive to that controller. I hadn't noticed
this message in 2.4.19-ac4, but honestly I didn't directly look
for it yet. I'll check that.

2. I did something else that night that may have been less than
smart. I remembered it tonight and repeated the experiment. I
tried to read-only mount hde9 while the fsck was running. When
this happens, the fsck process gets a short read and complains.
Obviously that's going to mess up fsck. However that little
shenanigan is not needed to screw things up. Tonight I ran step
8 (above) twice. The first time was after restarting fsck, after
fsck had failed on account of my trying to ro-mount the file
system. The second time - after rebooting - I still got the fsck
failure some time after 50% completion, without having to try to
mount anything.

I've got a system here that I can foul-up on demand now. What would
you like me to do?

-Mike



2002-08-31 04:59:56

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system



> OK, I have some good news and some bad news.
>
> The bad news is that I replicated the corruption.
>
> The good news is that I replicated the corruption. Oh, and I can
> cause it on demand, and not lose my system in the process. I can
> provide LOTS and LOTS of details now. What do you want to know?
>

[...]

I've done some more tests and have more information now. No smoking
gun yet, but a few more clues.

1. I moved the 160GB drive away from the Promise controller and
reattached it to the motherboard chipset's controller ("VIA
Technologies, Inc. Bus Master IDE (rev 06)", by the way according
to lspci). Then I booted 2.4.20-pre4-ac1 (the "bad" kernel) and
fsck'ed the big partition again. It passed. Then I moved the
drive back to the Promise controller, booted the same OS and
fsck'ed again. Failure.

2. I booted 2.4.19-ac4 with the 160GB drive attached to the Promise
controller and watched the kernel log output. There's no message
about any missing 80 pin cable. This is different than
2.4.20-pre4-ac1 which complains that I allegedly don't have an 80
pin cable plugged. However the cable is there but the driver
downshifts the interface to 33MHz anyway. I described this
observation before and now today I noticed another poster on the
lkml bringing up the same issue with his Promise 20269 controller
(but in -pre5-ac1 instead - look for subject "2.4.20-pre5-ac1
PDC20269 80-pin acble misdetection" [sic]).

3. Still looking for the low-hanging fruit, I extracted lots of other
info from the system. I grabbed fdisk -l output, dmesg output, the
kernel source .config file and a bunch of stuff out of /proc/ide,
once apiece for each kernel version (while the 160GB drive remained
on the Promise controller). I then diff'ed it all. I have all
this saved, but in the spirit of not wasting more bandwidth, I am
not including the raw data here. However here's a summary of the
differences I found:

o Lots of dmesg differences, but nothing I saw really relevant
beyond the thing about the 80 pin cable.

o fdisk -l output was unchanged between the kernel versions, so I
guess at least disk geometry hasn't been messed up.

o hdparm output is different between the kernel versions. This
should not be a big surprise since the 2.4.20-pre4-ac1 driver is
downshifting the bus speed. hdparm -i (and -I) reports udma2 for
the suspect kernel while I get udma5 for the stable kernel. I
did see one other alarming(?) change however; hdparm -I is
reporting different configurations:

2.4.19-ac4:
  Configuration:
        Logical           max      current
        cylinders         16383    65535
        heads             16       1
        sectors/track     63       63
        bytes/track:      0        (obsolete)
        bytes/sector:     0        (obsolete)
        current sector capacity: 4128705
        LBA user addressable sectors = 268435455

2.4.20-pre4-ac1:
  Configuration:
        Logical           max      current
        cylinders         16383    16383
        heads             16       16
        sectors/track     63       63
        bytes/track:      0        (obsolete)
        bytes/sector:     0        (obsolete)
        current sector capacity: 16514064
        LBA user addressable sectors = 268435455

Note the different sector capacity, cylinder counts, and head
counts. And yes, the entry reporting the _larger_ capacity is
the suspect kernel (double-checked). Is this significant?

o Timings (hdparm -t -T output) are also different. The "bad"
kernel (2.4.20-pre4-ac1) is only getting 30MB/sec off the device
while 2.4.19-ac4 is reading 35MB/sec. Not exactly a fantastic
difference, but 35MB/sec exceeds UDMA33 rate so that would
suggest that 2.4.19-ac4 really is running the Promise controller
at something better than udma2.

o Output from /proc/ide/pdc202xx is identical between the kernels.

o There are differences in the files in /proc/ide/ide2/hde/*
between the kernels but the differences are too cryptic for me to
decipher in any meaningful way (but if you want the data, ask).

o The two kernel source .config files have more differences than I
expected. Notably, I see new CONFIG_PDC202XX_* options that
weren't there before. CONFIG_BLK_DEV_PDC202XX has _OLD and
_NEW variants now (both are set). Also CONFIG_PDC202XX_FORCE is
new (and not set). And CONFIG_PDC202XX_BURST was previously set
but for some unexplained reason I have it not set in the "bad"
kernel. For the record, here are the currently enabled
CONFIG_IDE* settings (same for both kernels):

CONFIG_IDE=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_IDEDISK_STROKE=y
CONFIG_IDEDMA_AUTO=y
CONFIG_IDEDMA_ONLYDISK=y
CONFIG_IDEDMA_PCI_AUTO=y
CONFIG_IDEPCI_SHARE_IRQ=y
CONFIG_IDE_CHIPSETS=y
CONFIG_IDE_TASKFILE_IO=y
CONFIG_IDE_TASK_IOCTL=y
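
The two hdparm -I "current sector capacity" figures above can be checked against the current CHS fields with a short sketch. It only verifies the arithmetic in the reported numbers; the helper name is illustrative, and nothing here shows whether the translation difference is related to the corruption:

```python
def chs_sectors(cylinders: int, heads: int, sectors_per_track: int) -> int:
    """Sector count implied by a cylinder/head/sector geometry."""
    return cylinders * heads * sectors_per_track


# "current" columns as reported by hdparm -I under each kernel
good_kernel = chs_sectors(65535, 1, 63)    # 2.4.19-ac4
bad_kernel = chs_sectors(16383, 16, 63)    # 2.4.20-pre4-ac1

print("2.4.19-ac4 current CHS capacity:     ", good_kernel)
print("2.4.20-pre4-ac1 current CHS capacity:", bad_kernel)

# The LBA figure is the same under both kernels and equals the 28-bit maximum.
lba_user_sectors = 268435455
print("LBA count is 2**28 - 1:", lba_user_sectors == 2**28 - 1)
```

Both capacities are simply the product of the current CHS values each kernel reports, so the larger figure appears to follow from a different current CHS translation rather than from a different drive size. The LBA count is identical in both listings and sits at the 28-bit ceiling.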


I'll build another 2.4.20-pre4-ac1 instance with CONFIG_PDC202XX_BURST
turned on and see if that makes a difference. Any advice on the
...PDC202XX_OLD vs ...PDC202XX_NEW settings? Turn one of them off?
What's the difference? (Don't answer that last one; I haven't checked
the Configure help yet for it.)

Another thing I can try is to force the driver to downshift to udma2
in 2.4.19-ac4 and see if then the problem appears there.
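
As a rough aid for that comparison, the measured hdparm -t rates can be set against the nominal peak rate of each Ultra DMA mode. This is a sketch only: the table holds nominal burst rates, sustained disk reads are usually lower than the bus peak, and the function name is illustrative:

```python
# Nominal peak burst rates for Ultra DMA modes, in MB/s.
UDMA_PEAK_MB_S = {0: 16.7, 1: 25.0, 2: 33.3, 3: 44.4, 4: 66.7, 5: 100.0}


def min_udma_mode(measured_mb_s: float):
    """Lowest UDMA mode whose nominal peak rate covers the measured rate,
    or None if no mode in the table is fast enough."""
    for mode in sorted(UDMA_PEAK_MB_S):
        if UDMA_PEAK_MB_S[mode] >= measured_mb_s:
            return mode
    return None


for kernel, rate in (("2.4.19-ac4", 35.0), ("2.4.20-pre4-ac1", 30.0)):
    print(f"{kernel}: {rate} MB/s needs at least udma{min_udma_mode(rate)}")
```

A 35MB/sec reading cannot fit inside udma2's 33.3MB/sec ceiling, which agrees with hdparm reporting udma5 under 2.4.19-ac4, while 30MB/sec is compatible with the udma2 mode reported under 2.4.20-pre4-ac1.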

I can also build a new kernel from the newest sources and see if
the problem still exists.

Is there anything else I should try? Advice on a better direction?
Should I sit down and shut up already? Are you all still reading this
far down the message?

-Mike



2002-08-31 05:54:01

by Andre Hedrick

Subject: Re: 2.4.20-pre4-ac1 trashed my system


Your data is not trashed.
Linux failed to understand cut off partitions.
When you said you put it on primary channel, I realized that you have a
system that breaks the rules of Promise and I am not sure.
This will make it more painful to parse systems which can 48-bit and those
which can not.

This is not going to be fun.

grep "hwif->addressing" pdc202xx.c

Stub out the three lines.

Recompile and reboot, it will be fixed

Andre Hedrick
LAD Storage Consulting Group


2002-08-31 06:02:39

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Fri, 30 Aug 2002, Andre Hedrick wrote:

>
> Your data is not trashed.

Well actually it was. After the driver read bad data from the disk
(presumably mis-addressed) my knee-jerk reaction was to run e2fsck -y to
"fix" it. And _that_ trashed the data.


> Linux failed to understand cut off partitions.

???


> When you said you put it on primary channel, I realized that you have a
> system that breaks the rules of Promise and I am not sure.

What are the "rules of Promise" or where may I find such information?


> This will make it more painful to parse systems which can 48-bit and those
> which can not.
>
> This is not going to be fun.

But this wasn't a problem in 2.4.19-ac4; what confounding factor now is
making it difficult?


>
> grep "hwif->addressing" pdc202xx.c
>
> Stub out the three lines.
>
> Recompile and reboot, it will be fixed

Will do. Thanks. If you have a more permanent fix you'd like me to
test, let me know.

-Mike

| Mike Isely | PGP fingerprint
POSITIVELY NO | | 03 54 43 4D 75 E5 CC 92
UNSOLICITED JUNK MAIL! | isely @ pobox (dot) com | 71 16 01 E2 B5 F5 C1 E8
| (spam-foiling address) |

2002-08-31 06:21:17

by Andre Hedrick

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Sat, 31 Aug 2002, Mike Isely wrote:

> On Fri, 30 Aug 2002, Andre Hedrick wrote:
>
> >
> > Your data is not trashed.
>
> Well actually it was. After the driver read bad data from the disk
> (presumably mis-addressed) my knee-jerk reaction was to run e2fsck -y to
> "fix" it. And _that_ trashed the data.

Okay that sounds more like it. The driver did not damage the data; only
user space forcing writes down through the driver trashed it. Regardless
of the definition of "is", your system was wrecked.

>
> > Linux failed to understand cut off partitions.
>
> ???

This was a great concern of mine when 48-bit was introduced.

>
> > When you said you put it on primary channel, I realized that you have a
> > system that breaks the rules of Promise and I am not sure.
>
> What are the "rules of Promise" or where may I find such information?

You do not want to sign the NDAs to get the data sheets, acquire all the
hardware to test, generate tables of irregularities, query Promise, and
then scratch your head why.

I have a FastTrak 100 TX4 the BIOS fails to see beyond 128GB, but in
practice it does.

The PDC20267 will puke in 48-bit DMA, but run clean in 48-bit PIO :-/
Oh but that is the primary channel, Secondary Channel is clean both ways :-\

PDC20262 works in 48-bit DMA everywhere.

PDC20265 similar to PDC20267 except yours.

Rules are empirical tests and rants back at the OEM, and ....

>
> > This will make it more painful to parse systems which can 48-bit and those
> > which can not.
> >
> > This is not going to be fun.
>
> But this wasn't a problem in 2.4.19-ac4; what confounding factor now is
> making it difficult?

Because there were reports coming in of PDC20265/PDC20267 deadlocking.
Thanks for the wrinkle in the fabric of ruleless world. :-)

> >
> > grep "hwif->addressing" pdc202xx.c
> >
> > Stub out the three lines.
> >
> > Recompile and reboot, it will be fixed
>
> Will do. Thanks. If you have a more permanent fix you'd like me to
> test, let me know.

Oh another dang piece of the puzzle found and it does not fit anywhere!

Cheers,

Andre Hedrick
LAD Storage Consulting Group

2002-08-31 06:53:05

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Fri, 30 Aug 2002, Andre Hedrick wrote:

> On Sat, 31 Aug 2002, Mike Isely wrote:
>
> > On Fri, 30 Aug 2002, Andre Hedrick wrote:
> >
>
> Okay that sounds more like it. The driver did not damage the data, only
> user space forced down the driver trashed it. Regardless of the
> definition of "is" your system was wrecked.

No permanent harm. It was a workstation, and most of the 160GB drive
was being used primarily as a backup device for a separate file server
machine. Obviously I'd like to get that "backup device" up and running
again.


>
> >
> > > Linux failed to understand cut off partitions.
> >
> > ???
>
> This was a great concern of mine when 48-bit was introduced.

Ah, a riddle answered with another riddle. I know what 48 bit addressing
is; I'm just curious to understand why my system seems to have run afoul
of it, especially since things were ok before. (but read on...)


> > What are the "rules of Promise" or where may I find such information?
>
> You do not want to sign the NDAs to get the data sheets, acquire all the
> hardware to test, generate tables of irregularities, query Promise, and
> then scratch your head why.

OK, Uncle! I detect a lot of pain here and perhaps I'm exacerbating it
by asking. The technical side of me just wants to understand. I write
code for a living and have had my share of pain with crappy hardware
(though nothing even close to the scale at which you are working). I
hate I2C, by the way, and don't ever ask me about the P.O.S. Philips
pcf8584.


>
> I have a FastTrak 100 TX4 the BIOS fails to see beyond 128GB, but in
> practice it does.
>
> The PDC20267 will puke in 48-bit DMA, but run clean in 48-bit PIO :-/
> Oh but that is the primary channel, Secondary Channel is clean both ways :-\

Oh goodie. This can't be by design, but rather by stupid
implementation. But I'll stop now before aggravating your ulcer :-)


>
> PDC20262 works in 48-bit DMA everywhere.
>
> PDC20265 similar to PDC20267 except yours.

But I'd still like to understand why my PDC20265 seems unique. Earlier
hardware rev? Later hardware rev? Promise BIOS issue? The Asus
A7V266-E motherboard was purchased December 2001. If it's any help, I'm
staring at the chip on the board now. The label shows:

PROMISE (R)
TECHNOLOGY INC.
PDC20265R
(C) 2000-0113

Maybe there is another cleaner way to go at this problem.


>
> Rules are empirical tests and rants back at the OEM, and ....
>

Sounds to me like you need a vacation ;-)


> >
> > But this wasn't a problem in 2.4.19-ac4; what confounding factor now is
> > making it difficult?
>
> Cause there were reports of PDC20265/PDC20267 coming in as deadlocking.
> Thanks for the wrinkle in the fabric of ruleless world. :-)
>

You're welcome :-)

-Mike


| Mike Isely | PGP fingerprint
POSITIVELY NO | | 03 54 43 4D 75 E5 CC 92
UNSOLICITED JUNK MAIL! | isely @ pobox (dot) com | 71 16 01 E2 B5 F5 C1 E8
| (spam-foiling address) |

2002-08-31 10:50:11

by Vojtech Pavlik

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Sat, Aug 31, 2002 at 12:04:20AM -0500, Mike Isely wrote:
>
>
> > OK, I have some good news and some bad news.
> >
> > The bad news is that I replicated the corruption.
> >
> > The good news is that I replicated the corruption. Oh, and I can
> > cause it on demand, and not lose my system in the process. I can
> > provide LOTS and LOTS of details now. What do you want to know?
> >
>
> [...]
>
> I've done some more tests and have more information now. No smoking
> gun yet, but a few more clues.
>
> 1. I moved the 160GB drive away from the Promise controller and
> reattached it to the motherboard chipset's controller ("VIA
> Technologies, Inc. Bus Master IDE (rev 06)", by the way according
> to lspci). Then I booted 2.4.20-pre4-ac1 (the "bad" kernel) and
> fsck'ed the big partition again. It passed. Then I moved the
> drive back to the Promise controller, booted the same OS and
> fsck'ed again. Failure.
>
> 2. I booted 2.4.19-ac4 with the 160GB drive attached to the Promise
> controller and watched the kernel log output. There's no message
> about any missing 80 pin cable. This is different than
> 2.4.20-pre4-ac1 which complains that I allegedly don't have an 80
> pin cable plugged. However the cable is there but the driver
> downshifts the interface to 33MHz anyway. I described this

Note that 33 MHz isn't 33 MB/sec (UDMA2). The question remains which one you
meant.


--
Vojtech Pavlik
SuSE Labs

2002-09-01 02:54:55

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Fri, 30 Aug 2002, Andre Hedrick wrote:

>
>
> This is not going to be fun.
>
> grep "hwif->addressing" pdc202xx.c
>
> Stub out the three lines.
>
> Recompile and reboot, it will be fixed
>

What version of the driver source are you using? In 2.4.20-pre4-ac1 and
2.4.20-pre5-ac1,

grep "hwif->addressing" pdc202xx.c

finds only 1 line.

There are however two other places in pdc202xx.c where one can find
"drive->addressing", each used as a condition in an if-statement (which
looks a lot like this might be the LBA48 fix you and Alan have been
telling me about). What exactly do you want me to do? Knock out the
if-conditions (and matching close-braces)? Knock out the entire block
(and assumedly the LBA48 fix along with it) in each case?

I've been trying different combinations but so far either the result has
been no effect or fatally broken DMA (timeouts / failures at boot and
then the driver falls back to PIO). I can post more details later, but
I wonder if I'm missing something blindingly obvious here...

Side note: In /proc/ide/ide2/hde/settings in 2.4.19-ac4, the "address"
field reports a value of 1. However in 2.4.20-pre4-ac1, I instead find
the value 0. Is this related to the addressing field in ide_hwif_t?

Oh, and while futzing with things I tried another experiment with the
hardware. This may be a totally unrelated problem. I attached a
Plextor CD burner to the second cable of the Promise controller. It
should show up as hdg. Under 2.4.19-ac4 it shows up. Under
2.4.20-pre4-ac1 it isn't anywhere to be found - no errors, no hints
anywhere in the system that it might exist. Yes, the IDE CDROM driver
is compiled into the kernel.

-Mike


| Mike Isely | PGP fingerprint
POSITIVELY NO | | 03 54 43 4D 75 E5 CC 92
UNSOLICITED JUNK MAIL! | isely @ pobox (dot) com | 71 16 01 E2 B5 F5 C1 E8
| (spam-foiling address) |

2002-09-01 05:10:48

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system


Another update and more information on the "Linux 2.4.20-pre4-ac1 ate
my system" problem...

Question: I am new to this mailing list; should I keep copying these
messages to lkml or should I just pester Andre and/or Alan privately
now?

I've been studying pdc202xx.c and anything else in the other IDE
driver source files which reference any identifier named "addressing".
I think I understand the picture better now.

The addressing field of ide_drive_t describes how the drive is to be
addressed. Indeed, ide.h defines these values: 0= 28 bit, 1= 48 bit,
2= 48 bit doing 28 bit (a hack?) and 3=64 bit. Also it appears that
idedisk_add_settings() defines the "address" attribute viewable from
/proc/ide/<host>/<device>/settings which shows the current value of
this field.

Notably, also in pdc202xx.c we find two places where drive->addressing
is used, once each in pdc202xx_old_ide_dma_begin() and
pdc202xx_old_ide_dma_end(). If addressing is 1 (48 bit), then extra
logic is inserted here to manipulate the hardware. I presume this is
Alan's Promise controller LBA48 fix...

In ide-disk.c we find the function probe_lba_addressing() which
appears to be the only real place where this addressing field gets set
(I think it can also be set somehow through /proc/ide/whatever but I
doubt that's a path I should be concerned about). There's also
set_lba_addressing() but it just does nothing but pass through to
probe_lba_addressing(). Further down in ide-disk.c we find that
idedisk_setup() calls probe_lba_addressing() with hardcoded arguments
such that it attempts to set the addressing mode to 1 (48 bit
addressing). So far so good. However, back in probe_lba_addressing()
there is an interesting thing going on. It first initializes
addressing to 0 (28 bit), and then after a few checks sets it to the
requested value (second argument of the function). One of those
checks appears to be a check of another "addressing" field which is a
member of the ide_hwif_t structure. If _this_ addressing field is
non-zero, then the rest of probe_lba_addressing() is aborted - thus
addressing gets forced to zero (28 bit). It seems that if
ide_hwif_t::addressing is non-zero, we force 28 bit addressing no
matter what.
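
To make the control flow above concrete, here is a minimal sketch of it in
C. This is NOT the kernel source: the struct and function names carrying a
_sketch suffix are hypothetical, and the "drive supports LBA48" test is an
assumed stand-in for the other checks mentioned above.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down stand-ins for ide_hwif_t and ide_drive_t. */
struct hwif_sketch {
    int channel;        /* 0 = primary, 1 = secondary */
    int addressing;     /* non-zero: interface refuses 48 bit addressing */
};

struct drive_sketch {
    struct hwif_sketch *hwif;
    bool supports_lba48;    /* assumed stand-in for the drive-side checks */
    int addressing;         /* 0 = 28 bit, 1 = 48 bit */
};

/* Returns the addressing mode the drive ends up with. */
static int probe_lba_addressing_sketch(struct drive_sketch *drive, int requested)
{
    drive->addressing = 0;              /* always start at 28 bit */

    if (drive->hwif->addressing)        /* interface veto: stay at 28 bit */
        return drive->addressing;

    if (!drive->supports_lba48)         /* drive cannot do 48 bit */
        return drive->addressing;

    drive->addressing = requested;      /* all checks passed */
    return drive->addressing;
}
```

With the interface's addressing field set to 1, a request for mode 1 still
comes back as 0, which matches what the settings file reports on the "bad"
kernel.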

Here I should point out the contents of /proc/ide/ide2/hde/settings.
When my system is running under 2.4.19-ac4 (the "good" kernel), the
"address" field is 1 - which makes sense. I have a 160GB drive as hde
so naturally it should be addressed LBA48 style. However, that same
field in that same file is 0 while running under 2.4.20-pre4-ac1 (the
"bad" kernel, and 2.4.20-pre5-ac1 - I checked). So this would suggest
that my drive is being treated with 28 bit addressing, which would go
a long ways towards explaining why I'm getting the corruption.
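
For reference, the arithmetic behind the 137GB boundary can be sketched as
below, assuming 512-byte sectors. The truncation helper only illustrates how
a sector number above the 28 bit range could alias onto a low sector; whether
the driver actually mis-addressed requests in exactly this way is not
established here, and both function names are made up for the example.

```c
#include <stdint.h>

#define SECTOR_BYTES 512ULL    /* assumed sector size */

/* Largest capacity in bytes reachable with an LBA of the given bit width. */
static uint64_t lba_capacity_bytes(unsigned bits)
{
    return (1ULL << bits) * SECTOR_BYTES;
}

/* What a sector number becomes if only its low 28 bits reach the drive. */
static uint64_t truncate_to_lba28(uint64_t lba)
{
    return lba & ((1ULL << 28) - 1);
}
```

lba_capacity_bytes(28) works out to 137,438,953,472 bytes, which is where the
">137GB" figure comes from. A 160GB drive extends past that point, so any
sector beyond it cannot be expressed in 28 bits.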

So why is addressing being set to 0? Yesterday Andre described a
work-around I should try. Edit pdc202xx.c and stub out the line
matching "hwif->addressing". That occurs in init_hwif_pdc202xx()
inside a switch statement based on chip id. What I found was:

hwif->addressing = (hwif->channel) ? 0 : 1

in the case for PCI_DEVICE_ID_PROMISE_20265. Well that makes sense.
I'm on the primary channel, thus the condition is false and
hwif->addressing gets a value of 1. That kills probe_lba_addressing()
and I'm stuck at 28 bits. What's more, this line is not in the driver
that's part of 2.4.19-ac4 (but it's still there for the 20267 as Andre
points out). So the advice to comment out that line makes sense. I
killed the line, built a new kernel and tried again.

Result?

DMA is completely fubared. As soon as the new kernel tries to read
hde's partition table, a flurry of DMA timeouts takes place and the
system either hangs or falls back to PIO mode (I had both cases
happen). I have no idea why this is happening. Ideas?

In addition to commenting out the line, I also tried forcing
hwif->addressing to 1 (no effect) and 0 (fubared DMA again). That
result of course makes sense.

Another thing I did was to force on Alan's LBA48 fix code in
pdc202xx.c (by removing the LBA48 check). This had no effect (still
got disk corruption). I don't think Alan's code has anything to do
with the root cause here.

Additional things I am trying right now:

1. Try the same experiment with 2.4.20-pre5-ac1, i.e. kill the
hwif->addressing setting in pdc202xx.c and see if it works OK
there.

2. Turn off taskfile mode (CONFIG_IDE_TASKFILE_IO), comment out the
line and see if DMA still works now.

In summary, it seems that we want ide_drive_t::addressing to be 1, but
it's currently 0. It was 1 in 2.4.19-ac4 (where things worked). It's
zero because of a piece of duct tape in pdc202xx.c where it recognizes
the 20265 chip and prevents the driver from enabling 48 bit mode on
the primary channel. However if I remove that duct tape and
presumably let the addressing field go to 1, then all DMA to that
device times out.

I want to solve this problem. I know I'm probably being an annoying
pest by now, but I'm willing to try things, not just sit here and
scream "please fix this or I'll hold my breath until I turn purple".
I know very little unfortunately about IDE standards, but I can learn
quickly. Use me to help find the cause. I'm here. What would you
like me to try? Is there anything I can do? Am I missing something
blindingly obvious? It's happened before :-) Am I asking too many
questions? :-)

-Mike

| Mike Isely | PGP fingerprint
POSITIVELY NO | | 03 54 43 4D 75 E5 CC 92
UNSOLICITED JUNK MAIL! | isely @ pobox (dot) com | 71 16 01 E2 B5 F5 C1 E8
| (spam-foiling address) |

2002-09-02 08:12:12

by Joachim Breuer

Subject: Re: 2.4.20-pre4-ac1 trashed my system

Andre Hedrick <[email protected]> writes:

> On Sat, 31 Aug 2002, Mike Isely wrote:
>
>> On Fri, 30 Aug 2002, Andre Hedrick wrote:
>>
>> > When you said you put it on primary channel, I realized that you have a
>> > system that breaks the rules of Promise and I am not sure.
>>
>> What are the "rules of Promise" or where may I find such information?
>
> You do not want to sign the NDAs to get the data sheets, acquire all the
> hardware to test, generate tables of irregularities, query Promise, and
> then scratch your head why.
>
> I have a FastTrak 100 TX4 the BIOS fails to see beyond 128GB, but in
> practice it does.
>
> The PDC20267 will puke in 48-bit DMA, but run clean in 48-bit PIO :-/
> Oh but that is the primary channel, Secondary Channel is clean both ways :-\
>
> PDC20262 works in 48-bit DMA everywhere.
>
> PDC20265 similar to PDC20267 except yours.
>
> Rules are empirical tests and rants back at the OEM, and ....

Another data point: My experiences with 2.4.20-pre4-ac2 are remarkably
similar to Mike Isely's, but for a few interesting differences:

- 2.4.18 runs O.K.
- 2.4.19 hangs when checking for partitions
- 2.4.19-ac4 hangs, too
- 2.4.20-pre4 hangs, too
- 2.4.20-pre4-ac2 does not hang, but shows problems exactly as Mike is
describing:
- Claims 80pin cable is missing
- wrong data read from disk, write based on wrong read trashes fs

My hardware:
o Promise PDC20262 On-Board on a GigaByte GA-6BX7+ (Intel 440BX)
o Maxtor 120G (4G120J6)

>> > grep "hwif->addressing" pdc202xx.c
>> >
>> > Stub out the three lines.
>> >
>> > Recompile and reboot, it will be fixed
>>
>> Will do. Thanks. If you have a more permanent fix you'd like me to
>> test, let me know.
>
> Oh another dang piece of the puzzle found and it does not fit anywhere!

Does this fix the bogus 80-pin message or does it just have to do with
block addressing and thus the "corruption" issue?

I'm asking because the 20262 seems to break ATAPI devices completely
once it is in a "wrong" mode. I.e. if my PX-W1610 on the second
channel is correctly detected as MDMA2 it works, if it is detected as
something else and I try to tweak it the channel and/or controller
hangs.

Can I somewhere get a complete picture of what is *supposed* to work
with the '62 and what not?

Thanks a lot!


So long,
Joe

--
"I use emacs, which might be thought of as a thermonuclear
word processor."
-- Neal Stephenson, "In the beginning... was the command line"

2002-09-02 08:15:07

by Joachim Breuer

Subject: Re: 2.4.20-pre4-ac1 trashed my system

Mike Isely <[email protected]> writes:

> Another update and more information on the "Linux 2.4.20-pre4-ac1 ate
> my system" problem...
>
> Question: I am new to this mailing list; should I keep copying these
> messages to lkml or should I just pester Andre and/or Alan privately
> now?

PLEASE do continue to copy to the list; you got some discussion on
that issue started and there's more people out here who want to figure
out what's going on with Linux IDE support... PLEASE!

The thread so far was invaluable to me; I couldn't find similarly
relevant information in the archives... that should have changed now ;-)

Thanks!


So long,
Joe

--
"I use emacs, which might be thought of as a thermonuclear
word processor."
-- Neal Stephenson, "In the beginning... was the command line"

2002-09-03 12:34:41

by mbs

Subject: 2.4.20-pre4-ac1 trashed my system

it trashed mine also.

supermicro p4dp8-g2 mobo
2x 2.2 Xeon
e7500 chipset
wd400 40gb hd

2.4.20-pre4-ac2 + RML preempt patch (applied cleanly)

boot it and everything runs fine for a short while, then I start getting "bad
CRC" errors and "seek failure" errors.

I have had this problem with both ext2 and ext3

initially I thought it was a bad HD, so I installed a new one on a new cable
and did a complete rh7.3 install, ran for a while with no problems, then built
the same kernel over again, rebooted into the new kernel and within seconds
was having problems again.

2.4.19-rc3-ac4 +rml preempt has been dead stable, as has (so far) 2.4.19-ac4
+rml and RH 2.4.18-3 and -5

I am not doing anything funky with hd setup, not even specifying idebus=

this has happened with 40 and 80 wire cables.

if there is any additional info I can provide please let me know.
--
/**************************************************
** Mark Salisbury || [email protected] **
** If you would like to sponsor me for the **
** Mass Getaway, a 150 mile bicycle ride to for **
** MS, contact me to donate by cash or check or **
** click the link below to donate by credit card **
**************************************************/
https://www.nationalmssociety.org/pledge/pledge.asp?participantid=86736

2002-09-03 14:29:29

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Tue, 3 Sep 2002, mbs wrote:

> it trashed mine also.
>
> supermicro p4dp8-g2 mobo
> 2x 2.2 Xeon
> e7500 chipset
> wd400 40gb hd
>
> 2.4.20-pre4-ac2 + RML preempt patch (applied cleanly)
>
> boot it and everything runs fine for a short while, then I start getting "bad
> CRC" errors and "seek failure" errors.
>
> I have had this problem with both ext2 and ext3
>
> initially I thought it was a bad HD, so I installed a new one on a new cable
> and did a complete rh7.3 install, ran for a while with no problems, then built
> the same kernel over again, rebooted into the new kernel and within seconds
> was having problems again.
>
> 2.4.19-rc3-ac4 +rml preempt has been dead stable, as has (so far) 2.4.19-ac4
> +rml and RH 2.4.18-3 and -5
>
> I am not doing anything funky with hd setup, not even specifying idebus=
>

This is likely different than the problem I've been seeing.

My situation appears to be due to the fact that on my Promise
controller, LBA48 addressing mode had been turned off on the primary
channel, which then causes access problems with my 160GB Maxtor drive.
Turning LBA48 mode back on (by removing the hack which turned it off,
which wasn't in 2.4.19-ac4) breaks DMA. Either that or I screwed
something up when removing the hack. I'm wondering if a bug appeared
after 2.4.19-ac4 which breaks DMA on Promise 20265 primary channel
access, and that a work-around was put in place that disables LBA48
addressing. There are in fact well over 100 diffs in pdc202xx.c between
2.4.19-ac4 and 2.4.20-pre4-ac1. This wrong addressing is what
(indirectly) wrecked my system. I've posted my findings on this so far
along with some questions for further investigation, but I haven't seen
any answers yet (or even a "go away you're bothering me" reply).

Unfortunately you've said you are using a 40GB drive and something other
than a Promise controller so your situation may be a different problem.

-Mike


| Mike Isely | PGP fingerprint
POSITIVELY NO | | 03 54 43 4D 75 E5 CC 92
UNSOLICITED JUNK MAIL! | isely @ pobox (dot) com | 71 16 01 E2 B5 F5 C1 E8
| (spam-foiling address) |

2002-09-03 15:57:17

by Alan

Subject: Re: 2.4.20-pre4-ac1 trashed my system

> Unfortunately you've said you are using a 40GB drive and something other
> than a Promise controller so your situation may be a different problem.

The 40Gb drives may well be trying to pick LBA48, but LBA48 works on the
Intel hardware

2002-09-03 15:54:00

by Alan

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Tue, 2002-09-03 at 13:41, mbs wrote:
> 2.4.20-pre4-ac2 + RML preempt patch (applied cleanly)

I'm not interested in any bug reports with the pre-empt patch involved.
It just muddies the waters

2002-09-03 18:25:52

by mbs

Subject: Re: 2.4.20-pre4-ac1 trashed my system

ok, tomorrow, I'll let you know how it goes without preempt.

On Tuesday 03 September 2002 11:59, Alan Cox wrote:
> On Tue, 2002-09-03 at 13:41, mbs wrote:
> > 2.4.20-pre4-ac2 + RML preempt patch (applied cleanly)
>
> I'm not interested in any bug reports with the pre-empt patch involved.
> It just muddies the waters

--
/**************************************************
** Mark Salisbury || [email protected] **
** If you would like to sponsor me for the **
** Mass Getaway, a 150 mile bicycle ride to for **
** MS, contact me to donate by cash or check or **
** click the link below to donate by credit card **
**************************************************/
https://www.nationalmssociety.org/pledge/pledge.asp?participantid=86736

2002-09-04 10:18:08

by Rogier Wolff

Subject: Re: 2.4.20-pre4-ac1 trashed my system

On Tue, Sep 03, 2002 at 05:00:58PM +0100, Alan Cox wrote:
> > Unfortunately you've said you are using a 40GB drive and something other
> > than a Promise controller so your situation may be a different problem.
>
> The 40Gb drives may well be trying to pick LBA48, but LBA48 works on the
> Intel hardware

The maxtor drives

XxYYYyZ

where
X is a number
x is a letter
YYY is the capacity (040 for a 40G)
y is a letter
Z is a number.

When Z = 2, the disk is single platter, and COULD have been 4 times as
large by adding three more platters. That adds up to 160G so the

Xy040z2 drives will certainly do LBA48.
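
A rough sketch of splitting a model string along the XxYYYyZ pattern above,
for anyone who wants to check their own drive. The struct and function names
are hypothetical, the parsing is deliberately loose, and no meaning is
attached to the fields beyond what is described above.

```c
#include <stdio.h>

/* Hypothetical container for the five fields of the XxYYYyZ pattern. */
struct maxtor_model_sketch {
    int  x_num;         /* X: leading digit */
    char x_letter;      /* x: first letter */
    int  capacity_gb;   /* YYY: capacity, e.g. 040 for a 40G drive */
    char y_letter;      /* y: second letter */
    int  z_num;         /* Z: trailing digit */
};

/* Returns 1 if the whole string matches the pattern, 0 otherwise. */
static int parse_maxtor_model_sketch(const char *s, struct maxtor_model_sketch *out)
{
    int consumed = 0;
    int fields = sscanf(s, "%1d%c%3d%c%1d%n",
                        &out->x_num, &out->x_letter, &out->capacity_gb,
                        &out->y_letter, &out->z_num, &consumed);

    /* %n is not counted in the return value, so five fields means success. */
    return fields == 5 && s[consumed] == '\0';
}
```

Fed the "4G120J6" string mentioned earlier in this thread, this yields a
capacity field of 120 and a trailing digit of 6.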

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currenly in such an *
* excursion: The stable situation does not include humans. ***************

2002-09-05 05:50:08

by Mike Isely

Subject: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed


The trivial patch at the end of this text fixes DMA w/ LBA48 problems
on the Promise 20265 controller and probably also the 20267. This
patch is against 2.4.20-pre5-ac2 but I suspect it should apply cleanly
against anything after 2.4.19-ac4.

Problem: LBA48 DMA stopped working on Promise 20265 some time after
kernel version 2.4.19-ac4. This manifested itself on systems with
large (>137GB) hard drives; addressing above 137GB stopped working
correctly, leading to file system errors, and after a foolish
"e2fsck -y" operation, massive corruption.

Cause: The DMA was broken due to a bad if-statement in the function
init_hwif_pdc202xx() in pdc202xx.c. Because "!" binds more
tightly than "==", the check against PCI_DEVICE_ID_PROMISE_20246
was incorrect, which prevented the Promise controller LBA48 fix
logic from basically ever being turned on. Obfuscating this
further was logic in that same function which disabled LBA48
addressing mode for devices on the primary channel of the 20265 or
20267.

Solution: Apply parentheses to get evaluation ordering correct. Then
remove duct tape which disabled LBA48 addressing.

Verification: Before this fix, inspecting
/proc/ide/<host>/<device>/settings would show "0" for the
"address" attribute, owing to LBA48 being off. Just removing the
duct tape causing this however results in a broken system (driver
DMA completely fails). After also fixing the if-statement, the
system comes up successfully, the "address" attribute reads back
as "1" (confirms LBA48 addressing on), and most importantly,
fsck'ing the big drive comes back clean!

This problem did not exist in 2.4.19-ac4 because the code has since
been rearranged / rewritten. The new code harbored the bug and
the LBA48 regression. Note: I have not tested this fix against the
Promise 20267, but I suspect (since 2.4.19-ac4 didn't hack up the
20267 either) that the same fix applies there so I deleted the duct
tape rather than just moving it.
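
The precedence problem can be reproduced outside the kernel with a few lines
of C. This is only an illustration: the function names are made up and the
device ID values should be treated as placeholders. The buggy form is written
with explicit parentheses around "!device" to show how the compiler groups
the original expression.

```c
#include <stdbool.h>

/* Placeholder IDs for illustration; do not rely on the numeric values. */
#define ID_PROMISE_20246_SKETCH 0x4d33u
#define ID_PROMISE_20265_SKETCH 0x0d30u

/* How "!device == ID" is actually parsed: "!" is applied first, giving
 * 0 or 1, and that result is compared against the ID. For any non-zero
 * device ID this is false. */
static bool buggy_is_not_20246(unsigned device)
{
    return (!device) == ID_PROMISE_20246_SKETCH;
}

/* The intended test: true for every device except the 20246. */
static bool fixed_is_not_20246(unsigned device)
{
    return !(device == ID_PROMISE_20246_SKETCH);
}
```

For a 20265 the buggy form returns false and the fixed form returns true, so
with the original code the branch guarded by this test is skipped for every
chip, not just the 20246.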

-Mike


diff -u -r linux-2.4.20-pre5-ac2/drivers/ide/pci/pdc202xx.c linux-2.4.20-pre5-ac2.fixed/drivers/ide/pci/pdc202xx.c
--- linux-2.4.20-pre5-ac2/drivers/ide/pci/pdc202xx.c 2002-09-05 00:09:43.000000000 -0500
+++ linux-2.4.20-pre5-ac2.fixed/drivers/ide/pci/pdc202xx.c 2002-09-05 00:16:43.000000000 -0500
@@ -952,7 +952,6 @@
break;
case PCI_DEVICE_ID_PROMISE_20267:
case PCI_DEVICE_ID_PROMISE_20265:
- hwif->addressing = (hwif->channel) ? 0 : 1;
case PCI_DEVICE_ID_PROMISE_20263:
case PCI_DEVICE_ID_PROMISE_20262:
hwif->busproc = &pdc202xx_tristate;
@@ -979,7 +978,7 @@
if (!(hwif->udma_four))
hwif->udma_four = (!(hwif->INB(hwif->dma_vendor3) & 0x04));
} else {
- if (!hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246) {
+ if (!(hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246)) {
u16 mask = (hwif->channel) ? (1<<11) : (1<<10);
u16 CIS = 0;
hwif->ide_dma_begin = &pdc202xx_old_ide_dma_begin;




| Mike Isely | PGP fingerprint
POSITIVELY NO | | 03 54 43 4D 75 E5 CC 92
UNSOLICITED JUNK MAIL! | isely @ pobox (dot) com | 71 16 01 E2 B5 F5 C1 E8
| (spam-foiling address) |


2002-09-05 06:00:51

by Mike Isely

Subject: Re: 2.4.20-pre4-ac1 trashed my system


Final update on this thread.

1. I fixed the busted DMA. Full explanation of the bug and the
associated 2-line patch can be found in a separate message with subject
"[PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed".

2. The problem I saw with CDROM detection in 2.4.20-pre4-ac1 was PEBCAK.
I had the cable backwards (host connector in the drive, master connector
in the controller). 2.4.19-ac4 doesn't seem to be sensitive to this.

3. Still no idea about the broken 80 pin cable detection - yes, I
double-checked the cable orientation :-)

-Mike


| Mike Isely | PGP fingerprint
POSITIVELY NO | | 03 54 43 4D 75 E5 CC 92
UNSOLICITED JUNK MAIL! | isely @ pobox (dot) com | 71 16 01 E2 B5 F5 C1 E8
| (spam-foiling address) |



Subject: Re: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed

Mike Isely <[email protected]> writes:

>The trivial patch at the end of this text fixes DMA w/ LBA48 problems

More readable would be:

>- if (!hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246) {
>+ if (!(hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246)) {

if (hwif->pci_dev->device != PCI_DEVICE_ID_PROMISE_20246) {

Regards
Henning

--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20

2002-09-05 14:07:30

by Mike Isely

Subject: Re: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed

On Thu, 5 Sep 2002, Henning P. Schmiedehausen wrote:

> Mike Isely <[email protected]> writes:
>
> >The trivial patch at the end of this text fixes DMA w/ LBA48 problems
>
> More readable would be:
>
> >- if (!hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246) {
> >+ if (!(hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246)) {
>
> if (hwif->pci_dev->device != PCI_DEVICE_ID_PROMISE_20246) {
>

Yes that is true. But this is Andre's code and it seemed to me to be
more important to follow his style. But whatever...

-Mike


| Mike Isely | PGP fingerprint
POSITIVELY NO | | 03 54 43 4D 75 E5 CC 92
UNSOLICITED JUNK MAIL! | isely @ pobox (dot) com | 71 16 01 E2 B5 F5 C1 E8
| (spam-foiling address) |


2002-09-05 14:29:19

by Mike Isely

Subject: Re: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed

On 5 Sep 2002, Alan Cox wrote:

> On Thu, 2002-09-05 at 15:12, Mike Isely wrote:
> > Yes that is true. But this is Andre's code and it seemed to me to be
> > more important to follow his style. But whatever...
>
> It's a good general rule but for the IDE, break it ;)
>

Point taken. I should have expected to hear this :-)

-Mike


| Mike Isely | PGP fingerprint
POSITIVELY NO | | 03 54 43 4D 75 E5 CC 92
UNSOLICITED JUNK MAIL! | isely @ pobox (dot) com | 71 16 01 E2 B5 F5 C1 E8
| (spam-foiling address) |

2002-09-05 14:26:18

by Alan

Subject: Re: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed

On Thu, 2002-09-05 at 15:12, Mike Isely wrote:
> Yes that is true. But this is Andre's code and it seemed to me to be
> more important to follow his style. But whatever...

It's a good general rule but for the IDE, break it ;)

2002-09-05 14:31:11

by Horst H. von Brand

Subject: Re: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed

Mike Isely <[email protected]> said:
> On Thu, 5 Sep 2002, Henning P. Schmiedehausen wrote:
>
> > Mike Isely <[email protected]> writes:
> >
> > >The trivial patch at the end of this text fixes DMA w/ LBA48 problems
> >
> > More readable would be:
> >
> > >- if (!hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246) {
> > >+ if (!(hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246)) {
> >
> > if (hwif->pci_dev->device != PCI_DEVICE_ID_PROMISE_20246) {
> >
>
> Yes that is true. But this is Andre's code and it seemed to me to be
> more important to follow his style. But whatever...

What is wrong with != here?
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2002-09-05 14:37:00

by Tomas Szepe

Subject: Re: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed

> - if (!hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246) {
> + if (!(hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246)) {

Good eye, btw. I was looking at this line a couple times and always
assumed this kind of obfuscation had a purpose of some sort.

And it does after all! It's a bug :)
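
For anyone wondering why the original line is a bug: in C, unary `!` binds more tightly than `==`, so `!x == ID` is grouped as `(!x) == ID`. Since `!x` can only be 0 or 1, the test can never be true for a device ID other than 0 or 1. The following is only a minimal sketch of the three spellings discussed in this thread, not the driver code itself. The function names are hypothetical, and the buggy form is written with explicit parentheses to show the grouping the compiler applies to the unpatched line.

```c
/* Hypothetical helpers for illustration; not taken from pdc202xx.c.
 * 'device' stands in for hwif->pci_dev->device and 'id' for the
 * constant it is compared against (PCI_DEVICE_ID_PROMISE_20246 in
 * the patch). */

/* Unpatched form: "!device == id" is grouped as "(!device) == id".
 * (!device) is 0 or 1, so this is 0 for any id other than 0 or 1. */
int differs_unpatched(unsigned short device, unsigned short id)
{
    return (!device) == id;
}

/* Patched form from the thread: negate the whole comparison. */
int differs_patched(unsigned short device, unsigned short id)
{
    return !(device == id);
}

/* Form suggested in the replies: same result as the patched form. */
int differs_plain(unsigned short device, unsigned short id)
{
    return device != id;
}
```

With any two distinct IDs greater than 1 passed as `device` and `id`, the unpatched form returns 0 while the patched and plain forms both return 1. The last two agree for every input, which is why the choice between them is purely one of style.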

2002-09-05 14:43:14

by Mike Isely

Subject: Re: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed

On Thu, 5 Sep 2002, Horst von Brand wrote:

> Mike Isely said:
> > On Thu, 5 Sep 2002, Henning P. Schmiedehausen wrote:
> >
> > > Mike Isely writes:
> > >
> > > >The trivial patch at the end of this text fixes DMA w/ LBA48 problems
> > >
> > > More readable would be:
> > >
> > > >- if (!hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246) {
> > > >+ if (!(hwif->pci_dev->device == PCI_DEVICE_ID_PROMISE_20246)) {
> > >
> > > if (hwif->pci_dev->device != PCI_DEVICE_ID_PROMISE_20246) {
> > >
> >
> > Yes that is true. But this is Andre's code and it seemed to me to be
> > more important to follow his style. But whatever...
>
> What is wrong with != here?

Nothing whatsoever. If I wrote the code I would have used "!=". But
when editing code written by someone else I try to adopt that person's
style, for better or for worse. Using !(a == b) is more obtuse but it
is still unambiguous and readable. So I didn't feel it was that big of
a deal to leave it in that form. Besides, there are many MANY other
places in that driver far worse than this - just try to follow the code
that sets up DMA operations or look at the mostly dead code which tries
to identify if it is a cause for an asserted interrupt. If we want to
start nitpicking issues as small as this then I invite you to inspect
the rest of pdc202xx.c. Have the antacids ready...

But in the future, if I post more fixes to the IDE driver (probably
won't), I'll sanitize as I go along.

I find it amusing that a post from me which describes evidence of
completely broken Promise controller DMA goes unresponded to, yet there
are concerns about whether to spell code as "a != b" or "!(a == b)".

-Mike


                        |         Mike Isely          |     PGP fingerprint
    POSITIVELY NO       |                             | 03 54 43 4D 75 E5 CC 92
 UNSOLICITED JUNK MAIL! |   isely @ pobox (dot) com   | 71 16 01 E2 B5 F5 C1 E8
                        |   (spam-foiling address)    |

2002-09-05 14:51:52

by Tomas Szepe

Subject: Re: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed

> But in the future, if I post more fixes to the IDE driver (probably
> won't), I'll sanitize as I go along.

From what Andre said in the past I've gathered he's very much ok with
code sanitizing and cleanups... Knock yourself out if you please.

> I find it amusing that a post from me which describes evidence of
> completely broken Promise controller DMA goes unresponded to, yet there
> are concerns about whether to spell code as "a != b" or "!(a == b)".

Well, your patch is obviously correct -- there's not much to comment on.

T.

2002-09-05 15:08:24

by Mike Isely

Subject: Re: [PATCH] 2.4.20-pre5-ac2: Promise Controller LBA48 DMA fixed

On Thu, 5 Sep 2002, Tomas Szepe wrote:

> > But in the future, if I post more fixes to the IDE driver (probably
> > won't), I'll sanitize as I go along.
>
> From what Andre said in the past I've gathered he's very much ok with
> code sanitizing and cleanups... Knock yourself out if you please.
>
> > I find it amusing that a post from me which describes evidence of
> > completely broken Promise controller DMA goes unresponded to, yet there
> > are concerns about whether to spell code as "a != b" or "!(a == b)".
>
> Well, your patch is obviously correct -- there's not much to comment on.
>

I was referring to the longer unanswered messages posted over the weekend
(search for subject "trashed") asking for guidance on how to proceed.
Having never debugged IDE before, I was hoping for some help.

-Mike

                        |         Mike Isely          |     PGP fingerprint
    POSITIVELY NO       |                             | 03 54 43 4D 75 E5 CC 92
 UNSOLICITED JUNK MAIL! |   isely @ pobox (dot) com   | 71 16 01 E2 B5 F5 C1 E8
                        |   (spam-foiling address)    |