2003-09-02 13:28:45

by Steve Bennett

Subject: corruption with A7A266+200GB disk?


I just got a new 200GB disk (WDC WD2000JB) for my home machine (Asus A7A266,
Ali chipset). I put some partitions on it like so:
hda1: 100MB - /boot
hda2: 8192MB - /
hda3: 1024MB - swap
hda4: the rest (about 190GB I guess) - /home

I find that when I run mkfs on /home, I get massive filesystem corruption on /.
When I fsck / (and restore the deleted files), I get massive filesystem
corruption on /home. Luckily all my real data is still on my old disk...

I reduced the size of /home to 40GB and everything was fine.
I see the same behaviour with both 2.6.0test3 and 2.4.22.
My guess is that writes to very high numbered blocks are wrapping round
to lower numbered blocks in some way.

so...anyone else seen this? Is it a known driver problem?
Or is it a hardware issue?
Anyone care to suggest stuff to try? The contents of the disk are toast
(pretty much) so I can do destructive tests if it'll help...

Output from lspci looks like this:
00:00.0 Host bridge: ALi Corporation M1647 Northbridge [MAGiK 1 / MobileMAGiK 1] (rev 04)
00:01.0 PCI bridge: ALi Corporation PCI to AGP Controller
00:02.0 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:04.0 IDE interface: ALi Corporation M5229 IDE (rev c4)
00:05.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10)
00:06.0 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:07.0 ISA bridge: ALi Corporation M1533 PCI to ISA Bridge [Aladdin IV]
00:0a.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
00:0b.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
00:0d.0 Multimedia audio controller: Ensoniq 5880 AudioPCI (rev 02)
00:11.0 Bridge: ALi Corporation M7101 PMU
01:00.0 VGA compatible controller: ATI Technologies Inc Rage 128 PF/PRO AGP 4x TMDS

Thanks in advance,

Steve Bennett


2003-09-02 21:55:49

by Andries Brouwer

Subject: Re: corruption with A7A266+200GB disk?

On Tue, Sep 02, 2003 at 02:28:16PM +0100, [email protected] wrote:

> I just got a new 200GB disk (WDC WD2000JB) for my home machine (Asus A7A266,
> Ali chipset). I put some partitions on it like so:
> hda1: 100MB - /boot
> hda2: 8192MB - /
> hda3: 1024MB - swap
> hda4: the rest (about 190GB I guess) - /home
>
> I find that when I mkfs on /home, I get massive filesystem corruption on /
> When I fsck / (and restore the deleted files) I get massive filesystem corruption on /home.
>
> so...anyone else seen this? Is it a known driver problem?

No doubt wraparound at 137 GB. (2^28 sectors of 2^9 bytes give 2^37 bytes,
that is, a 128 GiB limit; to get past this you need support for lba48.)
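
To make the arithmetic concrete, here is a small Python sketch of the 28-bit limit and of what a wraparound would look like. The wrap model (the sector number simply truncated to its low 28 bits) is an assumption for illustration, not something confirmed for this driver or controller, and the function name is made up.

```python
SECTOR_BYTES = 2 ** 9          # 512-byte sectors
LBA28_SECTORS = 2 ** 28        # sectors addressable with a 28-bit LBA

limit_bytes = LBA28_SECTORS * SECTOR_BYTES


def wrapped_sector(sector: int) -> int:
    """Assumed failure model: only the low 28 bits reach the drive."""
    return sector & (LBA28_SECTORS - 1)


print(limit_bytes)                            # 137438953472 bytes
print(limit_bytes // 2 ** 30)                 # 128 (GiB)
print(wrapped_sector(LBA28_SECTORS + 1000))   # 1000
```

Under this model a write aimed just past the 128 GiB mark lands near the start of the disk, which is where /boot and / sit in the partition layout quoted above.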

Recently we discussed a case where Linux decided that the hardware
could not handle lba48 but forgot to adapt the total capacity.
That was a Linux bug.

In fact, if I am not mistaken, the idea that that hardware could not
handle lba48 was due to a misunderstanding. That was another Linux bug.

Maybe these have now been fixed in some kernel versions.

So, you must check (i) what Linux thinks your hardware can do, and
(ii) what your hardware can do in reality.
Maybe the former can be seen in /proc/ide/hdX/settings under "address"
or so.
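
A rough way to do check (i) from a script, as a Python sketch. It assumes the settings file is a whitespace-separated table whose first two columns are the setting name and its current value, and that a setting called "address" exists; both are assumptions to verify on the kernel in question. The sample text is made up for the demonstration.

```python
from typing import Dict


def parse_ide_settings(text: str) -> Dict[str, int]:
    """Map setting name -> current value from a name/value table."""
    settings = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        try:
            settings[fields[0]] = int(fields[1])
        except ValueError:
            continue  # header row or dashed separator row
    return settings


# Made-up sample in the assumed layout, not output from the poster's machine.
SAMPLE = """\
name            value   min   max   mode
----            -----   ---   ---   ----
address         0       0     2     rw
other_setting   5       0     9     rw
"""

settings = parse_ide_settings(SAMPLE)
print(settings.get("address"))
```

On a real system the text would come from reading /proc/ide/hda/settings; how to interpret the number is best checked against the driver source.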

by Bartlomiej Zolnierkiewicz

Subject: Re: corruption with A7A266+200GB disk?


Corruption is fixed in 2.6.0-test4.

Unfortunately, it seems your IDE chipset doesn't support LBA48,
so you won't be able to access the full capacity (137GB limit).

If you are ready to take a risk (again ;-) ) you can remove the
"hwif->no_lba48 = ..." line from drivers/ide/pci/alim15x3.c,
recompile and retest without using DMA (add the "ide=nodma"
boot option). Maybe LBA48 will work in PIO mode.

--bartlomiej

On Tuesday 02 of September 2003 15:28, [email protected] wrote:
> I just got a new 200GB disk (WDC WD2000JB) for my home machine (Asus
> A7A266, Ali chipset). I put some partitions on it like so:
> hda1: 100MB - /boot
> hda2: 8192MB - /
> hda3: 1024MB - swap
> hda4: the rest (about 190GB I guess) - /home
>
> I find that when I mkfs on /home, I get massive filesystem corruption on /
> When I fsck / (and restore the deleted files) I get massive filesystem
> corruption on /home. Luckily all my real data is still on my old disk...
>
> I reduced the size of /home to 40GB and everything was fine.
> I see the same behaviour with both 2.6.0test3 and 2.4.22.
> My guess is that writes to very high numbered blocks are wrapping round
> to lower numbered blocks in some way.
>
> so...anyone else seen this? Is it a known driver problem?
> Or is it a hardware issue?
> Anyone care to suggest stuff to try? The contents of the disk are toast
> (pretty much) so I can do destructive tests if it'll help...
>
> Output from lspci looks like this:
> 00:00.0 Host bridge: ALi Corporation M1647 Northbridge [MAGiK 1 / MobileMAGiK 1] (rev 04)
> 00:01.0 PCI bridge: ALi Corporation PCI to AGP Controller
> 00:02.0 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
> 00:04.0 IDE interface: ALi Corporation M5229 IDE (rev c4)
> 00:05.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10)
> 00:06.0 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
> 00:07.0 ISA bridge: ALi Corporation M1533 PCI to ISA Bridge [Aladdin IV]
> 00:0a.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
> 00:0b.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
> 00:0d.0 Multimedia audio controller: Ensoniq 5880 AudioPCI (rev 02)
> 00:11.0 Bridge: ALi Corporation M7101 PMU
> 01:00.0 VGA compatible controller: ATI Technologies Inc Rage 128 PF/PRO AGP 4x TMDS
>
> Thanks in advance,
>
> Steve Bennett
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2003-09-03 01:37:41

by Erik Andersen

Subject: Re: corruption with A7A266+200GB disk?

On Tue Sep 02, 2003 at 02:28:16PM +0100, [email protected] wrote:
>
> I just got a new 200GB disk (WDC WD2000JB) for my home machine (Asus A7A266,
> Ali chipset). I put some partitions on it like so:
> hda1: 100MB - /boot
> hda2: 8192MB - /
> hda3: 1024MB - swap
> hda4: the rest (about 190GB I guess) - /home
>
> I find that when I mkfs on /home, I get massive filesystem corruption on /
> When I fsck / (and restore the deleted files) I get massive filesystem corruption on /home. Luckily all my real data is still on my old disk...
>
> I reduced the size of /home to 40GB and everything was fine.
> I see the same behaviour with both 2.6.0test3 and 2.4.22.

Known problem. For some reason Marcelo has not yet applied
the fix for this problem to the 2.4.x kernels...

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2003-09-03 13:37:26

by Alan

Subject: Re: corruption with A7A266+200GB disk?

On Mer, 2003-09-03 at 01:55, Bartlomiej Zolnierkiewicz wrote:
> If you are ready to take a risk (again ;-) ) you can remove
> "hwif->no_lba48 = ..." line from a drivers/ide/pci/alim15x3.c,
> recompile and retest without using DMA (add "ide=nodma"
> boot option). Maybe LBA48 will work in PIO mode.

ALi does support LBA48 in PIO mode. Right now the choice is
DMA and 137GB or no DMA and 200GB. Ideally it should be DMA,
falling back to PIO for the top 70GB, but not for a while yet.
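
The "ideal" behaviour described here could look roughly like the following Python sketch: use DMA with 28-bit addressing for requests that fit entirely below the 2^28-sector mark, and fall back to PIO with LBA48 for anything that touches sectors above it. This only illustrates the idea; the function and its return values are hypothetical and it is not how any kernel driver is actually written.

```python
LBA28_SECTORS = 2 ** 28   # first sector number that needs LBA48


def pick_transfer_mode(start_sector: int, nr_sectors: int) -> str:
    """Choose a transfer mode for one request (hypothetical policy)."""
    if nr_sectors <= 0:
        raise ValueError("request must cover at least one sector")
    end_sector = start_sector + nr_sectors   # one past the last sector
    if end_sector <= LBA28_SECTORS:
        return "dma-lba28"
    return "pio-lba48"


print(pick_transfer_mode(0, 256))                   # dma-lba28
print(pick_transfer_mode(LBA28_SECTORS - 8, 16))    # pio-lba48
print(pick_transfer_mode(300_000_000, 256))         # pio-lba48
```

A request that straddles the boundary is sent down the PIO path as a whole in this sketch; splitting it would be another option.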

I've actually not yet found a controller in my testing that cannot
manage LBA48 PIO, including nailing a 160GB drive to a Cyrix box with
a VIA VP2.


2003-09-03 18:42:21

by Alan

Subject: Re: corruption with A7A266+200GB disk?

On Mer, 2003-09-03 at 19:07, Marcelo Tosatti wrote:
> > Known problem. For some reason Marcelo has not yet applied
> > the fix for this problem to the 2.4.x kernels...
>
> Alan (who, unlike me, has a clue about IDE) had complaints about your
> approach, right?

Bart pointed out that the case in question can only occur when you
physically move a disk between interfaces. So the last IDE changes I sent
you included a minimal version of Erik's change.

2003-09-03 18:07:59

by Marcelo Tosatti

Subject: Re: corruption with A7A266+200GB disk?



On Tue, 2 Sep 2003, Erik Andersen wrote:

> On Tue Sep 02, 2003 at 02:28:16PM +0100, [email protected] wrote:
> >
> > I just got a new 200GB disk (WDC WD2000JB) for my home machine (Asus A7A266,
> > Ali chipset). I put some partitions on it like so:
> > hda1: 100MB - /boot
> > hda2: 8192MB - /
> > hda3: 1024MB - swap
> > hda4: the rest (about 190GB I guess) - /home
> >
> > I find that when I mkfs on /home, I get massive filesystem corruption on /
> > When I fsck / (and restore the deleted files) I get massive filesystem corruption on /home. Luckily all my real data is still on my old disk...
> >
> > I reduced the size of /home to 40GB and everything was fine.
> > I see the same behaviour with both 2.6.0test3 and 2.4.22.
>
> Known problem. For some reason Marcelo has not yet applied
> the fix for this problem to the 2.4.x kernels...

Alan (who, unlike me, has a clue about IDE) had complaints about your
approach, right?



2003-09-03 19:55:14

by Marcelo Tosatti

Subject: Re: corruption with A7A266+200GB disk?



On Tue, 2 Sep 2003, Erik Andersen wrote:

> On Tue Sep 02, 2003 at 02:28:16PM +0100, [email protected] wrote:
> >
> > I just got a new 200GB disk (WDC WD2000JB) for my home machine (Asus A7A266,
> > Ali chipset). I put some partitions on it like so:
> > hda1: 100MB - /boot
> > hda2: 8192MB - /
> > hda3: 1024MB - swap
> > hda4: the rest (about 190GB I guess) - /home
> >
> > I find that when I mkfs on /home, I get massive filesystem corruption on /
> > When I fsck / (and restore the deleted files) I get massive filesystem corruption on /home. Luckily all my real data is still on my old disk...
> >
> > I reduced the size of /home to 40GB and everything was fine.
> > I see the same behaviour with both 2.6.0test3 and 2.4.22.
>
> Known problem. For some reason Marcelo has not yet applied
> the fix for this problem to the 2.4.x kernels...

So it seems the fix is already in 2.4.23-pre2 (it came in through Alan's
IDE changes).

Steve, it seems 2.4.23-pre2 fixes your problem.

2003-09-03 20:09:59

by Erik Andersen

Subject: Re: corruption with A7A266+200GB disk?

On Wed Sep 03, 2003 at 04:54:28PM -0300, Marcelo Tosatti wrote:
> > > I reduced the size of /home to 40GB and everything was fine.
> > > I see the same behaviour with both 2.6.0test3 and 2.4.22.
> >
> > Known problem. For some reason Marcelo has not yet applied
> > the fix for this problem to the 2.4.x kernels...
>
> So it seems the fix is already in 2.4.23-pre2 (came in through Alan IDE
> changes).
>
> Steve, it seems 2.4.23-pre2 fixes your problem.

Marcelo, I think you are mistaken... You have indeed applied
some IDE fixes from Alan. But I just read all the IDE changes
again, and unless I have gone blind, this problem is not yet
fixed.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2003-09-05 09:44:03

by Steve Bennett

Subject: Re: corruption with A7A266+200GB disk?

> ALi does support LBA48 in PIO mode. Right now the choice is
> DMA and 137Gb or no DMA and 200Gb, ideally it should be DMA
> and fall back to PIO for the top 70Gb, but not yet a while.

OK, having actually read what dmesg says (instead of making assumptions),
I see:
hda: max request size: 128KiB
hda: cannot use LBA48 - full capacity 390721968 sectors (200049 MB)
hda: 268435456 sectors (137438 MB) w/8192KiB Cache, CHS=16709/255/63, UDMA(100)
hda: hda1 hda2 hda3 hda4

and fdisk reports:
# /sbin/fdisk -l

Disk /dev/hda: 137.4 GB, 137438953472 bytes
255 heads, 63 sectors/track, 16709 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/hda1   *         1        13    104391   83  Linux
/dev/hda2            14      1057   8385930   83  Linux
/dev/hda3          1058      1188   1052257+  82  Linux swap
/dev/hda4          1189      6169  40009882+  83  Linux

So the disk is being correctly downgraded to a non-lba48-compatible size.
In which case, why is the disk getting trashed?
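
The numbers in the dmesg and fdisk output can be cross-checked with a little arithmetic. The Python sketch below uses only values quoted above (sector counts, 16065 sectors per cylinder, partition end cylinders) and checks that the clipped capacity is exactly 2^28 sectors and that every partition ends below it. The variable names are ad hoc, and cylinder-aligned partitions are assumed.

```python
SECTOR_BYTES = 512
SECTORS_PER_CYL = 255 * 63          # 16065, from the fdisk "Units" line
LBA28_SECTORS = 2 ** 28

full_sectors = 390721968            # "full capacity" from dmesg
clipped_sectors = 268435456         # capacity the kernel actually uses

# End cylinder of each partition, from the fdisk table.
end_cyl = {"hda1": 13, "hda2": 1057, "hda3": 1188, "hda4": 6169}

print(full_sectors * SECTOR_BYTES // 10**6)      # 200049 (MB)
print(clipped_sectors * SECTOR_BYTES // 10**6)   # 137438 (MB)
print(clipped_sectors == LBA28_SECTORS)          # True

# End sector (exclusive) of each partition, assuming cylinder alignment.
end_sector = {name: cyl * SECTORS_PER_CYL for name, cyl in end_cyl.items()}
print(end_sector["hda4"])                        # 99104985
print(all(s <= clipped_sectors for s in end_sector.values()))   # True
```

So, by this table, nothing is partitioned beyond the 28-bit limit, which agrees with the observation that the clipping itself looks correct.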

Maybe there's a fault on the disk itself? I'll find a system that does lba48
and try it there...

Steve.

2003-09-05 11:43:40

by Dave Gilbert (Home)

Subject: Re: corruption with A7A266+200GB disk?

Steve Bennett wrote:

> So the disk is being correctly downgraded to a non-lba48-compatible size.
> In which case, why is the disk getting trashed?
>
> Maybe there's a fault on the disk itself? I'll find a system that does lba48
> and try it there...

This is a Western Digital 200GB disc? Stop. Back off. Upgrade the
firmware to the latest on their site. Try again.

Dave