2002-09-04 10:22:04

by Marius Gedminas

Subject: ext3 corruption on 2.4.18 (LVM, vt82c586b, no DMA)

There's an old Compaq Deskpro 2000 (Pentium MMX 166 MHz, 384M RAM)
that's being used as an Internet gateway (NAT) and FTP server for about
200 users. It was previously running that other operating system, and I
helped convert it to Linux (Debian 3.0).

There are three disk drives. Two and a half are used for a single ext3
filesystem using LVM. That 140GB partition is only used for FTP.

About 20 hours after mke2fs the first errors started cropping up:

kernel: EXT3-fs error (device lvm(58,0)): ext3_add_entry: bad entry in directory #8568833: rec_len %% 4 != 0 - offset=0, inode=1104134607, rec_len=16847, name_len=207
kernel: EXT3-fs error (device lvm(58,0)): ext3_add_entry: bad entry in directory #8568833: rec_len %% 4 != 0 - offset=0, inode=1104134607, rec_len=16847, name_len=207
kernel: EXT3-fs error (device lvm(58,0)): ext3_add_entry: bad entry in directory #8650753: rec_len %% 4 != 0 - offset=0, inode=1054162645, rec_len=16093, name_len=213
kernel: EXT3-fs error (device lvm(58,0)): ext3_add_entry: bad entry in directory #8650753: rec_len %% 4 != 0 - offset=0, inode=1054162645, rec_len=16093, name_len=213
kernel: EXT3-fs error (device lvm(58,0)): ext3_add_entry: bad entry in directory #8650753: rec_len %% 4 != 0 - offset=0, inode=1503025555, rec_len=22942, name_len=147
[...]
kernel: EXT3-fs error (device lvm(58,0)): ext3_readdir: bad entry in directory #8847362: rec_len %% 4 != 0 - offset=0, inode=1861709558, rec_len=28414, name_len=247
[...]
kernel: EXT3-fs warning (device lvm(58,0)): empty_dir: bad directory (dir #8830978) - no `.' or `..'
kernel: EXT3-fs warning (device lvm(58,0)): ext3_rmdir: empty directory has nlink!=2 (3)
[...]
kernel: EXT3-fs error (device lvm(58,0)): ext3_readdir: bad entry in directory #8683521: rec_len is too small for name_len - offset=0, inode=11305005, rec_len=44, name_len=45
kernel: attempt to access beyond end of device
kernel: 3a:00: rw=0, want=1123713500, limit=154509312
kernel: attempt to access beyond end of device
kernel: 3a:00: rw=0, want=586860000, limit=154509312
[...]
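
The checks behind these messages can be reproduced offline. Here is a minimal sketch in Python, assuming the standard ext2/ext3 directory entry layout (an 8-byte header followed by the name, with each record padded to a multiple of 4 bytes). The function names are made up for illustration and this is not the kernel's own code:

```python
import re

# Fields as they appear in the "bad entry in directory" messages above.
ENTRY_RE = re.compile(r"rec_len=(\d+), name_len=(\d+)")
RANGE_RE = re.compile(r"want=(\d+), limit=(\d+)")


def min_rec_len(name_len):
    """Smallest valid record: 8-byte header plus the name, rounded up to 4."""
    return (name_len + 8 + 3) & ~3


def check_entry(line):
    """Return the sanity checks that a logged directory entry violates."""
    match = ENTRY_RE.search(line)
    if match is None:
        return []
    rec_len, name_len = int(match.group(1)), int(match.group(2))
    problems = []
    if rec_len % 4 != 0:
        problems.append("rec_len not a multiple of 4")
    if rec_len < min_rec_len(name_len):
        problems.append("rec_len too small for name_len")
    return problems


def beyond_end(line):
    """True if a 'want=..., limit=...' line asks for a block past the limit."""
    match = RANGE_RE.search(line)
    return match is not None and int(match.group(1)) > int(match.group(2))
```

Feeding it the logged values gives the same verdicts as the kernel: rec_len=16847 fails the multiple-of-4 check, rec_len=44 with name_len=45 is too short for its name, and want=1123713500 lies far beyond limit=154509312.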

Unfortunately I noticed this only two days later. e2fsck found *lots*
of errors, and it keeps restarting from the beginning for some reason.
I'm starting to have doubts if it will ever finish.

Is this an unfortunate interaction between ext3 and LVM, or should I
suspect flaky hardware? RAM, disks, IDE cable? There were problems
with /dev/hdd earlier that hinted at a broken cable (borken model name in
hdparm -i), and the cable was replaced with a new one.

I gather from Configure.help that DMA is broken on Via VP2, but it is
turned off here. I see only one series of hardware-related messages in
syslog:

kernel: hda: status timeout: status=0xd0 { Busy }
kernel: hda: no DRQ after issuing WRITE
kernel: ide0: reset timed-out, status=0x80

But these appeared about 4 hours after the first EXT3 error messages.

The kernel comes from a Debian package (2.4.18-586tsc), as does LVM
(lvm10-1:1.0.4-4).

Excerpt from dmesg:

Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller on PCI bus 00 dev 39
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: VIA vt82c586b (rev 31) IDE UDMA33 controller on pci00:07.1
ide0: BM-DMA at 0x1400-0x1407, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0x1408-0x140f, BIOS settings: hdc:DMA, hdd:DMA
hda: IC35L040AVVN07-0, ATA DISK drive
hdc: MAXTOR 6L060J3, ATA DISK drive
hdd: MAXTOR 6L080J4, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: 80418240 sectors (41174 MB) w/1863KiB Cache, CHS=79780/16/63
hdc: 117266688 sectors (60041 MB) w/1819KiB Cache, CHS=116336/16/63
hdd: 156355584 sectors (80054 MB) w/1819KiB Cache, CHS=155114/16/63
Partition check:
/dev/ide/host0/bus0/target0/lun0: [PTBL] [5005/255/63] p1 p2 p3 p4
/dev/ide/host0/bus1/target0/lun0: [PTBL] [7299/255/63] p1 < p5 >
/dev/ide/host0/bus1/target1/lun0: [PTBL] [9732/255/63] p1

/proc/ide/via:

----------VIA BusMastering IDE Configuration----------------
Driver Version:                     3.29
South Bridge:                       VIA vt82c586b
Revision:                           ISA 0x31 IDE 0x6
Highest DMA rate:                   UDMA33
BM-DMA base:                        0x1400
PCI clock:                          33MHz
Master Read Cycle IRDY:             1ws
Master Write Cycle IRDY:            1ws
BM IDE Status Register Read Retry:  yes
Max DRDY Pulse Width:               No limit
-----------------------Primary IDE-------Secondary IDE------
Read DMA FIFO flush:          yes                 yes
End Sector FIFO flush:         no                  no
Prefetch Buffer:              yes                 yes
Post Write Buffer:            yes                 yes
Enabled:                      yes                 yes
Simplex only:                  no                  no
Cable Type:                   40w                 40w
-------------------drive0----drive1----drive2----drive3-----
Transfer Mode:        PIO       PIO       PIO       PIO
Address Setup:       30ns     120ns      30ns      30ns
Cmd Active:          90ns      90ns      90ns      90ns
Cmd Recovery:        30ns      30ns      30ns      30ns
Data Active:         90ns     330ns      90ns      90ns
Data Recovery:       30ns     270ns      30ns      30ns
Cycle Time:         120ns     600ns     120ns     120ns
Transfer Rate:   16.5MB/s   3.3MB/s  16.5MB/s  16.5MB/s

lspci -v

00:00.0 Host bridge: VIA Technologies, Inc. VT82C595/97 [Apollo VP2/97] (rev 03)
        Flags: bus master, 66Mhz, medium devsel, latency 8

00:02.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139 (rev 10)
        Subsystem: Compex: Unknown device 8139
        Flags: bus master, medium devsel, latency 66, IRQ 11
        I/O ports at 1800 [size=256]
        Memory at 44000000 (32-bit, non-prefetchable) [size=256]
        Expansion ROM at <unassigned> [disabled] [size=64K]
        Capabilities: [50] Power Management version 2

00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139 (rev 10)
        Subsystem: Compex: Unknown device 8139
        Flags: bus master, medium devsel, latency 66, IRQ 11
        I/O ports at 2000 [size=256]
        Memory at 44200000 (32-bit, non-prefetchable) [size=256]
        Expansion ROM at <unassigned> [disabled] [size=64K]
        Capabilities: [50] Power Management version 2

00:04.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139 (rev 10)
        Subsystem: Compex: Unknown device 8139
        Flags: bus master, medium devsel, latency 66, IRQ 11
        I/O ports at 1c00 [size=256]
        Memory at 44100000 (32-bit, non-prefetchable) [size=256]
        Expansion ROM at <unassigned> [disabled] [size=64K]
        Capabilities: [50] Power Management version 2

00:07.0 ISA bridge: VIA Technologies, Inc. VT82C586/A/B PCI-to-ISA [Apollo VP] (rev 31)
        Flags: bus master, medium devsel, latency 0

00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06) (prog-if 8a [Master SecP PriP])
        Flags: bus master, medium devsel, latency 66
        I/O ports at 1400 [size=16]

00:07.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 02) (prog-if 00 [UHCI])
        Subsystem: Unknown device 0925:1234
        Flags: bus master, medium devsel, latency 66, IRQ 11
        I/O ports at 1420 [size=32]

00:07.3 Non-VGA unclassified device: VIA Technologies, Inc. VT82C586B ACPI
        Flags: medium devsel
        I/O ports at 1000 [size=256]

00:0f.0 VGA compatible controller: S3 Inc. Trio 64V2/DX or /GX (rev 04) (prog-if 00 [VGA])
        Subsystem: Compaq Computer Corporation: Unknown device b031
        Flags: medium devsel, IRQ 11
        Memory at 40000000 (32-bit, non-prefetchable) [size=64M]
        Expansion ROM at <unassigned> [disabled] [size=64K]


hdparm -i /dev/hda

/dev/hda:

Model=IC35L040AVVN07-0, FwRev=VA2OAG0A, SerialNo=VNP210B2R7MVGB
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=52
BuffType=DualPortCache, BuffSize=1863kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=80418240
IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5
AdvancedPM=yes: disabled (255) WriteCache=enabled
Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-2 ATA-3 ATA-4 ATA-5

hdparm -i /dev/hdc

/dev/hdc:

Model=MAXTOR 6L060J3, FwRev=A93.0500, SerialNo=663202014354
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117266688
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5 udma6
AdvancedPM=no WriteCache=enabled
Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-1 ATA-2 ATA-3 ATA-4 ATA-5

hdparm -i /dev/hdd

/dev/hdd:

Model=MAXTOR 6L080J4, FwRev=A93.0500, SerialNo=664203751576
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=156355584
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5 udma6
AdvancedPM=no WriteCache=enabled
Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-1 ATA-2 ATA-3 ATA-4 ATA-5


Marius Gedminas
--
This host is a black hole at HTTP wavelengths. GETs go in, and nothing
comes out, not even Hawking radiation.
-- Graaagh the Mighty on rec.games.roguelike.angband



2002-09-06 17:29:40

by Stephen C. Tweedie

Subject: Re: ext3 corruption on 2.4.18 (LVM, vt82c586b, no DMA)

Hi,

On Wed, Sep 04, 2002 at 12:26:06PM +0200, Marius Gedminas wrote:
> There's an old Compaq Deskpro 2000 (Pentium MMX 166 MHz, 384M RAM)
> that's being used as an Internet gateway (NAT) and FTP server for about
> 200 users. It was previously running that other operating system, and I
> helped convert it to Linux (Debian 3.0).

> About 20 hours after mke2fs the first errors started cropping up:
>
> kernel: EXT3-fs error (device lvm(58,0)): ext3_add_entry: bad entry in directory #8568833: rec_len %% 4 != 0 - offset=0, inode=1104134607, rec_len=16847, name_len=207

Well, there are a couple of ext3 fixes that have just been merged into
Marcelo's bk tree, so you could try that and see if it helps.
However, I suspect it won't, because:

> Unfortunately I noticed this only two days later. e2fsck found *lots*
> of errors, and it keeps restarting from the beginning for some reason.
> I'm starting to have doubts if it will ever finish.

This suggests that e2fsck is finding new corruption each time it is
scanning the disk. That sounds as if a hardware or driver-level
problem is more likely. What sorts of errors are you getting from the
fsck passes?

> Is this an unfortunate interaction between ext3 and LVM, or should I
> suspect flaky hardware? RAM, disks, IDE cable? There were problems
> with /dev/hdd earlier that hinted at a broken cable (borken model name in
> hdparm -i), and the cable was replaced with a new one.

One thing that would help would be to try a surface scan which writes
stuff to the disk and verifies it. The "badblocks" code from e2fsck
can do that, but the most effective form of "badblocks" for such a
case is highly destructive to your data, so it's only useful if you
don't need to preserve the data already on the filesystem.
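
To make the idea concrete, here is a minimal Python sketch of such a write-then-verify pass. It is only an illustration, not what badblocks itself does internally: the function names are hypothetical, it assumes `size` is a multiple of `block_size`, and because it uses ordinary buffered I/O the read-back may be served from the page cache rather than from the disk. It overwrites everything in the target, so point it at a scratch file or an unused device only.

```python
import os

# The four test patterns that "badblocks -w" writes by default.
PATTERNS = (0xAA, 0x55, 0xFF, 0x00)


def write_pattern(path, size, pattern, block_size=4096):
    """Overwrite the first `size` bytes of `path` with one repeated byte."""
    block = bytes([pattern]) * block_size
    with open(path, "r+b") as f:
        for _ in range(size // block_size):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the OS write cache


def verify_pattern(path, size, pattern, block_size=4096):
    """Read the data back and return the offsets of blocks that differ."""
    block = bytes([pattern]) * block_size
    bad = []
    with open(path, "rb") as f:
        for index in range(size // block_size):
            if f.read(block_size) != block:
                bad.append(index * block_size)
    return bad


def surface_scan(path, size, block_size=4096):
    """Run every pattern over the target; return sorted offsets of bad blocks."""
    bad = set()
    for pattern in PATTERNS:
        write_pattern(path, size, pattern, block_size)
        bad.update(verify_pattern(path, size, pattern, block_size))
    return sorted(bad)
```

An empty list from `surface_scan` means every block read back exactly as written for all four patterns.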

> I gather from Configure.help that DMA is broken on Via VP2, but it is
> turned off here.

Unfortunately, if you disable UDMA mode, you also lose the checksums
between drive and controller which can detect cable data corruption.
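
As a rough illustration of why such a checksum matters, the sketch below uses a generic bitwise CRC-16 with the CCITT polynomial. The initial value, bit ordering and function names are assumptions chosen for the example, so it should not be read as the exact scheme used on the IDE bus; the point is only that a single bit flipped in transit changes the checksum.

```python
def crc16_ccitt(data, crc=0xFFFF):
    """Bitwise CRC-16, polynomial 0x1021 (x^16 + x^12 + x^5 + 1), MSB first."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def corrupted_in_transit(sent, received):
    """Compare checksums computed independently at each end of the cable."""
    return crc16_ccitt(sent) != crc16_ccitt(received)


# A 512-byte "sector" with one bit flipped on the way across the cable.
sector = bytes(range(256)) * 2
damaged = bytearray(sector)
damaged[100] ^= 0x04
print(corrupted_in_transit(sector, bytes(damaged)))  # True
```

In PIO mode there is no equivalent check, so the flipped bit would simply be written to disk.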

Cheers,
Stephen

2002-09-06 18:58:00

by Marius Gedminas

Subject: Re: ext3 corruption on 2.4.18 (LVM, vt82c586b, no DMA)

On Fri, Sep 06, 2002 at 06:34:15PM +0100, Stephen C. Tweedie wrote:
> On Wed, Sep 04, 2002 at 12:26:06PM +0200, Marius Gedminas wrote:
> > There's an old Compaq Deskpro 2000 (Pentium MMX 166 MHz, 384M RAM)
> > that's being used as an Internet gateway (NAT) and FTP server for about
> > 200 users. It was previously running that other operating system, and I
> > helped convert it to Linux (Debian 3.0).
>
> > About 20 hours after mke2fs the first errors started cropping up:
> >
> > kernel: EXT3-fs error (device lvm(58,0)): ext3_add_entry: bad entry in directory #8568833: rec_len %% 4 != 0 - offset=0, inode=1104134607, rec_len=16847, name_len=207
>
> Well, there are a couple of ext3 fixes that have just been merged into
> Marcelo's bk tree, so you could try that and see if it helps.
> However, I suspect it won't, because:
>
> > Unfortunately I noticed this only two days later.

(Now all partitions are mounted with errors=remount-ro. I didn't know
that wasn't the default.)

> > e2fsck found *lots*
> > of errors, and it keeps restarting from the beginning for some reason.
> > I'm starting to have doubts if it will ever finish.
>
> This suggests that e2fsck is finding new corruption each time it is
> scanning the disk. That sounds as if a hardware or driver-level
> problem is more likely. What sorts of errors are you getting from the
> fsck passes?

Mostly "Inode XXX is in use, but has dtime set", some "Inode XXX has
compression flag set on filesystem without compression support", a
couple of "Inode XXX has illegal block(s)", which are followed by "Too
many illegal blocks in inode XXX" (AFAIU that forces the restart of pass
1).

I have a full typescript (1.4M) if you're interested. The different
iterations of pass 1 are very similar, but there are also differences.
I got tired after five iterations and just mkfs'ed it again.

This is starting to look more and more like a hardware problem. The
next day I suddenly noticed that / and /var (both on /dev/hda) were
remounted read-only, probably because of errors. That 140G LVM
partition was still rw. An ls or grep would fail in any directory, but
some of the files could be accessed by their full path. Trying to
access other files (even files in /proc) would result in I/O or Bus
error. Actually, I'm only 90% sure about /proc -- could be it was
/usr/bin/less that was unreadable. I was a bit excited at the time.
But I usually cat things in /proc, and I clearly remember that I could
successfully cat some files in /etc.

A reboot helped, but the syslog contained nothing interesting. I ran
memtest for at least 20 hours -- it found no errors. I think I already
mentioned that a read-only badblocks scan also found nothing.

BTW /, /var and about 20GB of the LVM volume are all on /dev/hda, which
is a 40G IBM drive. It might be pure superstition, but the words "IBM hard
drive" do not inspire much confidence in me.

> > Is this an unfortunate interaction between ext3 and LVM, or should I
> > suspect flaky hardware? RAM, disks, IDE cable? There were problems
> > with /dev/hdd earlier that hinted at a broken cable (borken model name in
> > hdparm -i), and the cable was replaced with a new one.
>
> One thing that would help would be to try a surface scan which writes
> stuff to the disk and verifies it. The "badblocks" code from e2fsck
> can do that, but the most effective form of "badblocks" for such a
> case is highly destructive to your data, so it's only useful if you
> don't need to preserve the data already on the filesystem.

It might be difficult to arrange at the moment, but I'll keep it in mind.

> > I gather from Configure.help that DMA is broken on Via VP2, but it is
> > turned off here.
>
> Unfortunately, if you disable UDMA mode, you also lose the checksums
> between drive and controller which can detect cable data corruption.

Ah, that's good to know. Is there a kernel tree that supports DMA on
VP2? I think I heard somewhere that the new IDE code did that, but I
haven't been paying much attention to its fate lately.

Thanks,
Marius Gedminas
--
It's not illegal to disagree with my opinions (*).
[...]
(*) Although it obviously _should_ be. Mwhaahahahahaaa... You unbelievers
will all be shot when the revolution comes!
-- Linus Torvalds



2002-09-06 21:00:15

by Alan

Subject: Re: ext3 corruption on 2.4.18 (LVM, vt82c586b, no DMA)

On Fri, 2002-09-06 at 18:34, Stephen C. Tweedie wrote:
> > I gather from Configure.help that DMA is broken on Via VP2, but it is
> > turned off here.
>
> Unfortunately, if you disable UDMA mode, you also lose the checksums
> between drive and controller which can detect cable data corruption.

I'd have to look it up to be sure, but I believe the VIA VP2 only goes
up to DMA, not UDMA.