2003-01-05 04:36:30

by Carl Wilhelm Soderstrom

[permalink] [raw]
Subject: fs corruption with 2.4.20 IDE+md+LVM

I observed filesystem corruption on my home workstation recently. I was
running kernel 2.4.20 (built myself with gcc 2.95.4), and ext3 with the
default journaling mode (ordered?).

I was downloading files, and noticed that they weren't being saved. I
immediately did a 'df -h', and it reported my home partition as having 7.3T
used, -64Z free.

I (foolishly) immediately did a 'du -sch ~/*' to see what might be taking up
all the space. after realizing what was going on (du reported filesystem
permission errors on files it shouldn't have), I shut down all programs, and
dropped to runlevel 1.

I unmounted my LVM'ed partitions (/var /usr /home), and tried to fsck
/dev/sys/home (the /home partition). it couldn't find a good superblock; and
fell back to using another backup superblock. fsck reported that the journal
was corrupt, and discarded it. many of the low-numbered inodes had wrong
refcounts, or wrong modes.
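
(for anyone hitting the same thing: the backup-superblock fallback can also be driven by hand. this is only a sketch; the device name matches my setup above, and the backup locations depend on the filesystem's block size, so list them first rather than trusting my numbers.)

```shell
# Dry run (-n): report where mke2fs *would* place the backup
# superblocks, without writing anything to the device.
mke2fs -n /dev/sys/home

# Then point e2fsck at one of the reported backups. 8193 is the
# usual first backup for 1KB-block filesystems, 32768 for 4KB.
e2fsck -b 32768 /dev/sys/home
```

(must be run as root, on the *unmounted* volume.)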

eventually it fixed the filesystem; but everything ended up scattered across
many files & directories under lost+found. (I had to reassemble each home
dir from one or more directories under lost+found.)

after fixing the filesystem, I gratuitously fsck -f'ed all my other
partitions; they came up clean.

fortunately, it looks like the only stuff I really lost was some chunks of my
XFree86 source tree, and some linux kernel sources. easily replaceable
stuff.

here's my system architecture:
2x Western Digital 80GB Special Edition IDE drives (hde, hdf)
- / is an ext3 RAID1 /dev/md0 made of hde1 and hdf1
- /dev/md1 is LVM-formatted RAID1, made of hde2 and hdf2. this partition
contains /var, /usr, and /home.
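
(for the curious, a layout like this could be reproduced roughly as follows.
this is a sketch in modern mdadm/LVM2 syntax rather than the raidtools/LVM1
commands I actually used; device names match the ones above, but the volume
sizes here are made-up examples.)

```shell
# Two RAID1 mirrors across the matching partitions of hde and hdf
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hde1 /dev/hdf1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/hde2 /dev/hdf2

# Layer LVM on top of md1 and carve out /var, /usr, and /home
pvcreate /dev/md1
vgcreate sys /dev/md1
lvcreate -L 10G -n var  sys
lvcreate -L 10G -n usr  sys
lvcreate -L 40G -n home sys
mkfs.ext3 /dev/sys/home    # likewise for /dev/sys/var and /dev/sys/usr
```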

/home is the only place that I saw this corruption.

I have since reverted to kernel 2.4.18.

I'm thinking that my reaction *should* have been to power-cycle the box
immediately upon noticing the problem, to prevent further fs corruption,
and bring it back up in single-user read-only mode. shutting down programs
nicely would have written more stuff to disk, worsening the corruption.
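
(an intermediate option, short of yanking the power, is to freeze writes
the moment the corruption is noticed. a sketch, assuming the shell still
works and magic SysRq was compiled into the kernel:)

```shell
# Stop new writes to the sick filesystem without flushing
# application state out to it:
mount -o remount,ro /home

# If the fs is too far gone for a clean remount, the magic SysRq
# trigger can force *everything* read-only (CONFIG_MAGIC_SYSRQ):
echo u > /proc/sysrq-trigger    # emergency remount read-only
```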

I will also point out that kernel 2.4.20-ac1 and 2.4.21-pre6 will not boot
on my machine; they kernel panic when detecting my IDE devices. I have not
tried 2.4.20-ac2 or 2.4.21-pre2 yet. 2.4.20 and 2.4.18 boot quite happily
tho. I suppose I ought to try the latest versions and set up a serial
console to capture the oops, before reporting a bug on this.
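
(a serial console only needs a kernel argument and a null-modem cable to a
second box. a sketch; ttyS0 and the baud rate are assumptions for my
hardware, and the minicom flags are from a reasonably recent minicom:)

```shell
# Append to the kernel command line (lilo.conf or grub menu.lst):
#   console=ttyS0,115200n8 console=tty0
# Then capture on the second machine, logging to a file:
minicom -D /dev/ttyS0 -b 115200 -C oops.log
```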

Carl Soderstrom.
--
Systems Administrator
Real-Time Enterprises
http://www.real-time.com


2003-01-05 05:39:48

by Carl Wilhelm Soderstrom

[permalink] [raw]
Subject: Re: fs corruption with 2.4.20 IDE+md+LVM

On Sat, Jan 04, 2003 at 10:45:00PM -0600, Carl Wilhelm Soderstrom wrote:
> I observed filesystem corruption on my home workstation recently. I was
> running kernel 2.4.20 (built myself with gcc 2.95.4), and ext3 with the
> default journaling mode (ordered?).

I should probably include some details about my IDE devices.

here's the controller for the devices in question. it's the second
controller on the mobo. (first controller is only ATA-66)

00:11.0 Unknown mass storage controller: Promise Technology, Inc. 20265 (rev 02)
Subsystem: Promise Technology, Inc. Ultra100
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 32
Interrupt: pin A routed to IRQ 9
Region 0: I/O ports at 9400 [size=8]
Region 1: I/O ports at 9000 [size=4]
Region 2: I/O ports at 8800 [size=8]
Region 3: I/O ports at 8400 [size=4]
Region 4: I/O ports at 8000 [size=64]
Region 5: Memory at de800000 (32-bit, non-prefetchable) [size=128K]
Expansion ROM at <unassigned> [disabled] [size=64K]
Capabilities: <available only to root>

and here's the output of hdparm. (yes, I know it could probably be tweaked
a bit for performance. this is a brand-new drive arrangement, and I was
trying to run it with 'safe' settings for a while to see if anything would
go wrong. well, it did.)
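
(for completeness, the less-safe tuning alluded to above would look
something like this. a sketch only; verify each flag against your own
controller before trusting data to it, and note that -u1 in particular has
historically been risky on some chipsets:)

```shell
# Enable the features shown as off in the output below:
#   -c1: 32-bit IO_support   -u1: IRQ unmasking   -m16: multcount 16
hdparm -c1 -u1 -m16 /dev/hde

# Benchmark before and after to see whether it was worth it:
hdparm -tT /dev/hde
```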

~# hdparm /dev/hde

/dev/hde:
multcount = 0 (off)
IO_support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 155061/16/63, sectors = 156301488, start = 0
~# hdparm /dev/hdf

/dev/hdf:
multcount = 0 (off)
IO_support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 155061/16/63, sectors = 156301488, start = 0

Carl Soderstrom.
--
Systems Administrator
Real-Time Enterprises
http://www.real-time.com

2003-01-06 02:11:49

by Dmitry Volkoff

[permalink] [raw]
Subject: Re: fs corruption with 2.4.20 IDE+md+LVM

> I observed filesystem corruption on my home workstation recently. I was
> running kernel 2.4.20 (built myself with gcc 2.95.4), and ext3 with the
> default journaling mode (ordered?).

Hello,

Same problem here. I have software RAID-1 on 2 IDE Seagate 80G drives,
kernel 2.4.20aa1 built with gcc-3.2, all filesystems ext2, no LVM.
FS corruption appeared after running the Cerberus test for about 8 hours.

> I will also point out that kernel 2.4.20-ac1 and 2.4.21-pre6 will not
> boot on my machine; they kernel panic when detecting my IDE devices.

I can confirm. Kernel 2.4.21-pre2 does not boot from a RAID device
(/dev/md0).

--

D.V.

2003-01-06 04:40:42

by Carl Wilhelm Soderstrom

[permalink] [raw]
Subject: Re: fs corruption with 2.4.20 IDE+md+LVM

On Mon, Jan 06, 2003 at 05:14:12AM +0300, Dmitry Volkoff wrote:
> > I observed filesystem corruption on my home workstation recently. I was
> > running kernel 2.4.20 (built myself with gcc 2.95.4), and ext3 with the
> > default journaling mode (ordered?).
>
> Hello,
>
> Same problem here. I have software raid-1 on 2 IDE Seagate 80G, kernel
> 2.4.20aa1 built with gcc-3.2, all filesystems are ext2, no LVM.
> FS corruption after running Cerberus test for about 8 hours.

glad to know I'm not the only one.

someone pointed out to me in a private e-mail that the corruption may be
related to my VIA KT133 chipset. (they had a similar problem.)

> > I will also point out that kernel 2.4.20-ac1 and 2.4.21-pre6 will not
> > boot on my machine; they kernel panic when detecting my IDE devices.
>
> I can confirm. Kernel 2.4.21-pre2 does not boot from a RAID device
> (/dev/md0).

sorry about the thinko in my mail. I meant 2.4.21-pre1. Glad to know I'm not
crazy, but hopefully confirmation means it'll get fixed before 2.4.21-final.

<flamebait>
maybe I just missed the arguments since I wasn't reading LKML at the time;
but *why* is IDE being revamped in the middle of a "stable" kernel series?
however much better the new code may be, I don't regard the existing
situation as bad enough to justify the risk.
</flamebait>

Carl Soderstrom.
--
Systems Administrator
Real-Time Enterprises
http://www.real-time.com

2003-01-06 14:09:02

by Alan

[permalink] [raw]
Subject: Re: fs corruption with 2.4.20 IDE+md+LVM

On Mon, 2003-01-06 at 04:49, Carl Wilhelm Soderstrom wrote:
> <flamebait>
> maybe I just missed the arguments since I wasn't reading LKML at the time;
> but *why* is IDE being revamped in the middle of a "stable" kernel series?
> however better it may be, I don't regard the existing situation as being bad
> enough to justify the risk.
> </flamebait>

You are reporting problems in 2.4.20. 2.4.20 doesn't have the revamped IDE...

The IDE is getting updated because

- Lots of new controllers don't work with the old code
- Lots of LBA48 problems exist with the older code
- SATA is right out with the older code
- Several existing controllers have weird bugs with the older code

I'd much prefer we didn't have to update the IDE too 8)

2003-01-06 16:12:41

by Carl Wilhelm Soderstrom

[permalink] [raw]
Subject: Re: fs corruption with 2.4.20 IDE+md+LVM

On Mon, Jan 06, 2003 at 03:02:02PM +0000, Alan Cox wrote:
> You are reporting problems in 2.4.20. 2.4.20 doesn't have the revamped IDE...

I know; which is why I put that comment after the section of my mail
regarding the md bugs in 2.4.21.

> The IDE is getting updated because
>
> - Lots of new controllers dont work with the old code
> - Lots of LBA48 problems exist with the older code
> - SATA is right out with the older code
> - Several existing controllers have weird bugs with the older code
>
> I'd much prefer we didn't have to update the IDE too 8)

ok. I didn't see an extensive discussion of this on any of the kernel-digest
forums (kerneltrap, kernel traffic).
I'll trust that you're doing the right thing, and try to avoid stepping in
any other flamebait. ;)

Carl Soderstrom.
--
Systems Administrator
Real-Time Enterprises
http://www.real-time.com