2002-11-25 04:06:17

by Clemmitt Sigler

Subject: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

(I'm not subscribed to lkml, please CC: [email protected] -- Thanks :^)

Hi,

I'd been running 2.4.20-rc3 for two days. While rebooting it tonight
fsck.ext3 corrupted my / partition during an automatic fsck of the
partition (caused by the maximal mount count being reached). (I had
backups so I was able to recover :^) The symptoms were that some files
like /etc/fstab and dirs like /etc/rc2.d disappeared -- not good.

My system is Debian Testing, with Debian e2fsprogs version
1.29+1.30-WIP-0930-1. I use ext3 partitions with all options set to
the defaults (ordered data mode). This is an SMP system, in case
that matters. Please e-mail me for any other details that might help.

I'm wondering if this change between -rc1 and -rc2 might be a factor ->

<[email protected]>
HTREE backwards compatibility patch.

Upon rebooting to 2.4.19 (SMP kernel also), the system did another
auto-fsck.ext3, this time on /usr. I held my breath, but all went fine.
This seems to me to narrow it down to a kernel/e2fsprogs incompatibility
(but I'm not an expert).

If this is indeed the case, please put a LOUD WARNING in the kernel
notes that some versions of e2fsprogs are incompatible. HTH.

Clemmitt Sigler


2002-11-25 10:50:46

by Hugo Mills

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Mon, Nov 25, 2002 at 12:12:55AM -0500, Clemmitt Sigler wrote:
> I'd been running 2.4.20-rc3 for two days. While rebooting it tonight
> fsck.ext3 corrupted my / partition during an automatic fsck of the
> partition (caused by the maximal mount count being reached). (I had
> backups so I was able to recover :^) The symptoms were that some files
> like /etc/fstab and dirs like /etc/rc2.d disappeared -- not good.

Did you also get some duplicated entries in /etc? I've had a
similar problem, with some files in /etc vanishing entirely, and
others being duplicated (so I got, for example, /etc/fstab appearing
twice in `ls /etc`). This has happened, as far as I can tell,
spontaneously -- no reboot, no run of fsck, not even any non-daemon
processes running other than X and an xterm. The machine wasn't being
used at all. Nothing turns up in the syslogs -- no oops, no bug,
nothing.

Running fsck recovers the missing files into lost+found, but
doesn't remove the duplicated filenames. Duplicate files can be
deleted, but only one "filename" is removed, and the file then no
longer exists except to ls -- it shows up in `ls /etc`, but (e.g.)
`cat /etc/fstab` gives a "No such file or directory" error.

> My system is Debian Testing, with Debian e2fsprogs version
> 1.29+1.30-WIP-0930-1. I use ext3 partitions with all options set to
> the defaults (ordered data mode). This is an SMP system, in case
> that matters.

I'm also using Debian testing with the same e2fsprogs. I saw the
effect on ext2, on a UP box.

> Please e-mail me for any other details that might help.
>
> I'm wondering if this change between -rc1 and -rc2 might be a factor ->
>
> <[email protected]>
> HTREE backwards compatibility patch.

I remember seeing a comment about HTREE running past when I tried
the e2fsck for the first time. Don't know if this is relevant.

I've seen this happen twice now -- both times requiring a day or so
of effort to recover the system configuration (it wasn't up for long
enough between times to do a system backup :( ). I can't afford any
more downtime on this machine for the next month or so, so I'm no
longer running ext2 on my root partition -- I moved to reiserfs in the
hope that it'll be more stable.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP: 1024D/1C335860 from wwwkeys.eu.pgp.net or http://www.carfax.nildram.co.uk
--- On Mondays, Wednesdays and Fridays they called it a particle. ---
On Tuesdays, Thursdays and Saturdays, they called it a
wave. On Sundays, they just prayed.



2002-11-25 13:14:48

by Clemmitt Sigler

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

Hi Hugo et al,

On Mon, 25 Nov 2002, Hugo Mills wrote:
> Did you also get some duplicated entries in /etc? I've had a
> similar problem, with some files in /etc vanishing entirely, and
> others being duplicated (so I got, for example, /etc/fstab appearing
> twice in `ls /etc`).

Unfortunately, I didn't check thoroughly to see if duplicate files were
listed by ls. I wish I had known to look! This is my production
workstation. It needed to get back up and running, so I didn't save
the "evidence." :^(

> This has happened, as far as I can tell, spontaneously
> Running fsck recovers the missing files into lost+found, but
> doesn't remove the duplicated filenames.

My problem was clearly related to fsck.ext3 which ran automatically upon
a reboot. Before that, no problems whatsoever. After the boot-time
fsck was over, the system failed to finish coming up. The boot output
contained the surprising error "Can't find /etc/fstab" and complained
about the rc script being missing (entire /etc/rc2.d dir was gone).

> I remember seeing a comment about HTREE running past when I tried
> the e2fsck for the first time. Don't know if this is relevant.

Neither do I (not enough in-depth knowledge to know), but that info may
be of use to Ted and Marcelo(?) If my conclusions about 2.4.20-rc3 and
some e2fsprogs tools being incompatible are erroneous, my apologies.

Clemmitt Sigler

2002-11-25 17:08:31

by Matthias Andree

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Mon, 25 Nov 2002, Clemmitt Sigler wrote:

> I'd been running 2.4.20-rc3 for two days. While rebooting it tonight
> fsck.ext3 corrupted my / partition during an automatic fsck of the
> partition (caused by the maximal mount count being reached). (I had
> backups so I was able to recover :^) The symptoms were that some files
> like /etc/fstab and dirs like /etc/rc2.d disappeared -- not good.
>
> My system is Debian Testing, with Debian e2fsprogs version
> 1.29+1.30-WIP-0930-1. I use ext3 partitions with all options set to
> the defaults (ordered data mode). This is an SMP system, in case
> that matters. Please e-mail me for any other details that might help.

Retry with 1.32. I don't think the corruption is kernel-related, but I
may be wrong.

--
Matthias Andree

2002-11-25 17:32:34

by Tomas Konir

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Mon, 25 Nov 2002, Matthias Andree wrote:

> On Mon, 25 Nov 2002, Clemmitt Sigler wrote:
>
> > I'd been running 2.4.20-rc3 for two days. While rebooting it tonight
> > fsck.ext3 corrupted my / partition during an automatic fsck of the
> > partition (caused by the maximal mount count being reached). (I had
> > backups so I was able to recover :^) The symptoms were that some files
> > like /etc/fstab and dirs like /etc/rc2.d disappeared -- not good.
> >
> > My system is Debian Testing, with Debian e2fsprogs version
> > 1.29+1.30-WIP-0930-1. I use ext3 partitions with all options set to
> > the defaults (ordered data mode). This is an SMP system, in case
> > that matters. Please e-mail me for any other details that might help.
>
> Retry with 1.32. I don't think the corruption is kernel-related, but I
> may be wrong.
>
hm
My XFS / partition was corrupted after rebooting 2.4.20-rc3-xfs.
I rebooted twice and the same thing happened after each reboot.
In my opinion this is kernel-related.

MOJE

--
Tomas Konir
Brno
ICQ 25849167


2002-11-25 17:31:09

by Clemmitt Sigler

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

Hi Matthias,

On Mon, 25 Nov 2002, Matthias Andree wrote:
> > I'd been running 2.4.20-rc3 for two days. While rebooting it tonight
> > fsck.ext3 corrupted my / partition during an automatic fsck of the
> > partition (caused by the maximal mount count being reached).
> > My system is Debian Testing, with Debian e2fsprogs version
> > 1.29+1.30-WIP-0930-1. I use ext3 partitions with all options set to
> > the defaults (ordered data mode).
>
> Retry with 1.32. I don't think the corruption is kernel-related, but I
> may be wrong.

I just checked the Debian packages, and e2fsprogs 1.32 is available
on Debian Unstable. If it is indeed a mismatch between Debian Testing's
e2fsprogs version and 2.4.20-rc3, Debian users should upgrade e2fsprogs
before upgrading their kernel. Forewarned is forearmed :^) Thanks.

Clemmitt Sigler

2002-11-26 02:40:30

by Theodore Ts'o

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Mon, Nov 25, 2002 at 12:12:55AM -0500, Clemmitt Sigler wrote:
> I'd been running 2.4.20-rc3 for two days. While rebooting it tonight
> fsck.ext3 corrupted my / partition during an automatic fsck of the
> partition (caused by the maximal mount count being reached). (I had
> backups so I was able to recover :^) The symptoms were that some files
> like /etc/fstab and dirs like /etc/rc2.d disappeared -- not good.
>
> My system is Debian Testing, with Debian e2fsprogs version
> 1.29+1.30-WIP-0930-1. I use ext3 partitions with all options set to
> the defaults (ordered data mode). This is an SMP system, in case
> that matters. Please e-mail me for any other details that might help.
>
> I'm wondering if this change between -rc1 and -rc2 might be a factor ->
>
> <[email protected]>
> HTREE backwards compatibility patch.

Nope; I really doubt it. All the HTREE compatibility patch does is
clear the INDEX_FL flag in a directory inode if the directory inode is
modified. It's a very, very innocuous patch.
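
As a rough illustration of what "clearing INDEX_FL" amounts to, here is
a minimal Python sketch. It is not the kernel patch: the DirInode class
and the mark_modified() helper are hypothetical, and the 0x1000 value
for the flag is an assumption taken from the ext2/ext3 headers.

```python
from dataclasses import dataclass

# Assumed value of the "directory is hash-indexed" inode flag.
EXT3_INDEX_FL = 0x00001000


@dataclass
class DirInode:
    """Hypothetical stand-in for a directory inode; only the flags matter here."""
    flags: int


def mark_modified(inode: DirInode) -> None:
    """Model a kernel without HTREE support changing a directory.

    The directory is edited linearly, so any on-disk hash index may be
    stale. Dropping the flag tells later readers to ignore the index;
    every other flag bit is left untouched.
    """
    inode.flags &= ~EXT3_INDEX_FL


# A directory that is indexed and also carries one unrelated flag bit.
inode = DirInode(flags=EXT3_INDEX_FL | 0x00000080)
mark_modified(inode)
print(hex(inode.flags))  # only the unrelated bit remains
```

The point of the sketch is that a single bit is cleared and nothing in
the directory contents is rewritten.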

> Upon rebooting to 2.4.19 (SMP kernel also), the system did another
> auto-fsck.ext3, this time on /usr. I held my breath, but all went fine.
> This seems to me to narrow it down to a kernel/e2fsprogs incompatibility
> (but I'm not an expert).

Well, no; it could also be that some kind of filesystem corruption
either made the directories disappear, or caused e2fsck to believe
that the files needed to be removed or moved into lost+found. There
are a million possible explanations, including a bug in a device
driver, the VM layer, or just pure coincidence.

Without some clear indication of what e2fsck actually printed we'd
only be speculating.

> If this is indeed the case, please put a LOUD WARNING in the kernel
> notes that some versions of e2fsprogs are incompatible. HTH.

No, there shouldn't be any kind of compatibility problems. All of the
various extensions to ext2/ext3 are clearly marked with feature
flags in the superblock, and need to be explicitly enabled before they
take effect.
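
The feature-flag convention can be sketched as follows. This is a
simplified Python model, not kernel code: mount_decision() is a
hypothetical helper, and the bit values are assumed from the ext2/ext3
headers. The general rule it models is that unknown "compat" bits are
harmless, unknown "ro_compat" bits allow only a read-only mount, and
unknown "incompat" bits cause the mount to be refused.

```python
# Assumed feature bit values, for illustration only.
COMPAT_DIR_INDEX = 0x0020        # HTREE directory indexing
INCOMPAT_FILETYPE = 0x0002       # file type stored in directory entries
RO_COMPAT_SPARSE_SUPER = 0x0001  # sparse superblock backups
RO_COMPAT_LARGE_FILE = 0x0002    # files larger than 2 GB


def mount_decision(compat, incompat, ro_compat, known_incompat, known_ro_compat):
    """Decide how an implementation may mount a filesystem.

    compat is accepted but never checked: by convention an
    implementation may ignore compat features it does not know.
    """
    if incompat & ~known_incompat:
        return "refuse"      # unknown incompatible feature
    if ro_compat & ~known_ro_compat:
        return "read-only"   # writing could damage unknown structures
    return "read-write"


# An old kernel that has never heard of DIR_INDEX still mounts read-write,
# because DIR_INDEX is a compat feature.
print(mount_decision(COMPAT_DIR_INDEX, INCOMPAT_FILETYPE, RO_COMPAT_SPARSE_SUPER,
                     known_incompat=INCOMPAT_FILETYPE,
                     known_ro_compat=RO_COMPAT_SPARSE_SUPER))
```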

Can you duplicate the problem?

- Ted

2002-11-26 03:52:01

by Theodore Ts'o

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Mon, Nov 25, 2002 at 10:57:39AM +0000, Hugo Mills wrote:
> Running fsck recovers the missing files into lost+found, but
> doesn't remove the duplicated filenames.

E2fsck doesn't currently check for duplicated files in a directory.
It probably should, but doing so would increase its required memory
usage and/or its run-time, at least in the non-HTREE case, which is
why I never implemented it. (Currently we scan all directory blocks
in the filesystem sorted by disk block number, to avoid seeks. So in
order to check for duplicates, we would either need to allocate enough
memory to store all directory entries on the filesystem
--- which would take up lots of memory --- or scan directory blocks
directory by directory, which would significantly increase the number
of seeks needed by e2fsck in its pass 2 processing.)
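
The memory-hungry variant of that trade-off can be sketched in a few
lines of Python. This is an illustration only, not e2fsck's pass 2
code: find_duplicates() and the (disk_block, dir_inode, names) tuples
are hypothetical. Blocks are visited in disk order to avoid seeks, so
the names seen so far must be remembered for every directory at once.

```python
from collections import defaultdict


def find_duplicates(dir_blocks):
    """Return (dir_inode, name) pairs that occur more than once.

    dir_blocks is an iterable of (disk_block, dir_inode, names) tuples.
    Scanning in disk-block order means blocks of different directories
    interleave, so every directory's name set stays in memory until the
    whole scan is finished.
    """
    seen = defaultdict(set)  # dir_inode -> names seen so far
    duplicates = []
    for _block, dir_inode, names in sorted(dir_blocks, key=lambda b: b[0]):
        for name in names:
            if name in seen[dir_inode]:
                duplicates.append((dir_inode, name))
            else:
                seen[dir_inode].add(name)
    return duplicates


blocks = [
    (900, 2, ["fstab", "passwd"]),  # second block of directory inode 2
    (120, 2, ["fstab", "hosts"]),   # first block of directory inode 2
    (500, 11, ["fstab"]),           # a different directory, same name is fine
]
print(find_duplicates(blocks))  # [(2, 'fstab')]
```

The alternative mentioned above, scanning directory by directory, would
need only one name set at a time but would give up the sorted-by-block
read order.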

If the directory is indexed using the HTREE format, or the directories
are going to be optimized anyway by specifying the -D option in newer
versions of e2fsck, adding support for detecting duplicate filenames
would be really cheap, so I'll very likely end up adding this feature
in an upcoming release of e2fsck.

Given the reports that people with xfs filesystems are also seeing
filesystem corruption, it sounds like filesystem blocks are getting
written to the wrong location on disk. That would explain the
duplicate filenames in the directory, as well as files disappearing.
In general this sort of corruption is relatively rare, which is
another reason why I haven't bothered with adding that support to
e2fsck.

- Ted

2002-11-26 14:48:30

by Clemmitt Sigler

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

Hi,

On Mon, 25 Nov 2002, Theodore Ts'o wrote:
> Well, no; it could also be that some kind of filesystem corruption
> either made the directories disappear, or caused e2fsck to believe
> that the files needed to be removed or moved into lost+found. There
> are a million possible explanations, including a bug in a device
> driver, the VM layer, or just pure coincidence.

I got no indication that anything was moved into lost+found similar
to what I've gotten on test systems where the filesystem was
intentionally crashed. No human y/n intervention required this time :^(
it just trashed things and then went on its merry way, and failed to
boot because things were messed up in /etc (and perhaps other places
in the / partition).

> Without some clear indication of what e2fsck actually printed we'd
> only be speculating.
> Can you duplicate the problem?

The e2fsck run seemed to me to go normally. It reported that it
optimized some directories, but this has happened on other auto-fscks
of my ext3 filesystems without corruption under earlier kernels. (This
is the first corruption I've seen in many, many years.) But I didn't
capture the messages :^( and they don't get written into
/var/log/messages (that I could find).

When 2.4.20 final comes out, I'll set up a mirror system and try to
duplicate the problem. I'll be sure to check lost+found, too.
(The system this happened on is my production workstation and isn't
suitable for testing.) Thanks.

Clemmitt Sigler

2002-11-27 12:49:27

by Theodore Ts'o

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Tue, Nov 26, 2002 at 10:55:10AM -0500, Clemmitt Sigler wrote:
> The e2fsck run seemed to me to go normally. It reported that it
> optimized some directories, but this has happened on other auto-fscks
> of my ext3 filesystems without corruption under earlier kernels. (This
> is the first corruption I've seen in many, many years.) But I didn't
> capture the messages :^( and they don't get written into
> /var/log/messages (that I could find).

Ah, ha. I think I know what happened.

What version of e2fsprogs were you using? If it was 1.28, that would
explain what you saw. There was a fencepost error that could corrupt
directories when it was optimizing/rehashing them. This bug was fixed
in the next version, which was rushed out the door as a result of
this bug. Fortunately, 1.28 didn't get adopted by any distro's as far
as I know, and not that many people downloaded and compiled e2fsprogs
1.28.

If you're not using the latest version of e2fsprogs, which is
e2fsprogs 1.32, I'd strongly suggest updating to it. Version 1.28 is
just *so* three months ago. :-)

- Ted

P.S. If you do have a directory which is corrupted by e2fsck 1.28, no
data is lost; it just created a directory entry which is too small, so
it triggers the sanity checks in the kernel. Running e2fsck version
1.29 or later will unbork the directory.
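
For readers wondering what "a directory entry which is too small" might
look like to a sanity check, here is a hedged Python sketch. It assumes
the usual ext2 directory entry layout (an 8-byte header followed by the
name, with record lengths padded to a multiple of 4). The helper names
are hypothetical, and the thread does not say that this is the exact
check the 1.28 bug tripped.

```python
DIRENT_HEADER = 8  # inode (4) + rec_len (2) + name_len (1) + file_type (1)


def min_rec_len(name_len: int) -> int:
    """Smallest record length that can hold a name of name_len bytes,
    rounded up to the next multiple of 4."""
    return (name_len + DIRENT_HEADER + 3) & ~3


def entry_is_sane(rec_len: int, name_len: int) -> bool:
    """Reject entries whose record is misaligned or too short for the name."""
    return rec_len % 4 == 0 and rec_len >= min_rec_len(name_len)


# "fstab" is 5 bytes long, so its entry needs at least 16 bytes.
print(min_rec_len(5))        # 16
print(entry_is_sane(16, 5))  # True
print(entry_is_sane(12, 5))  # False: record too small for the name
```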

2002-11-27 13:21:24

by Clemmitt Sigler

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

Hi,

On Wed, 27 Nov 2002, Theodore Ts'o wrote:
> What version of e2fsprogs were you using? If it was 1.28, that would
> explain what you saw. There was a fencepost error that could corrupt
> directories when it was optimizing/rehashing them.

I'm using Debian Testing, which is (the maintainer's own?) version
1.29+1.30-WIP-0930-1. Debian Unstable is now on version 1.32 (and
Testing should get updated to this soon?). Mind you, what I have
installed is almost 2 months old.

Just in case, I'll upgrade to 1.32 before I try to duplicate the
problem on 2.4.20. Thanks for your time and trouble :^)

Clemmitt Sigler

2002-11-27 13:31:35

by Margit Schubert-While

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

>Fortunately, 1.28 didn't get adopted by any distro's as far as I know,
It got into Suse 8.1

Margit

2002-11-27 14:41:02

by Alan

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Wed, 2002-11-27 at 13:39, Margit Schubert-While wrote:
> >Fortunately, 1.28 didn't get adopted by any distro's as far as I know,
> It got into Suse 8.1

I guess that's a bit of a showstopper for this change 8(

2002-11-27 20:40:07

by Theodore Ts'o

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Wed, Nov 27, 2002 at 03:18:57PM +0000, Alan Cox wrote:
> On Wed, 2002-11-27 at 13:39, Margit Schubert-While wrote:
> > >Fortunately, 1.28 didn't get adopted by any distro's as far as I know,
> > It got into Suse 8.1
>
> I guess that's a bit of a showstopper for this change 8(

I forgot about SuSE; yeah, they are using 1.28, but they did take the
patch. (It's needed regardless of whether or not you're using HTREE
in the kernel; it can cause corrupted directories even if you're not
using ext3 htree's).

That being said, from what I can tell, SuSE does *not* support the
HTREE changes, and you would be strongly advised to update
to the latest version of e2fsprogs before you enable HTREE. The
version of e2fsck that SuSE ships is missing some checks that detect some
inconsistencies in the htree structure (not a big deal, but it might
miss some corrupted filesystems) and is also missing some endian
bugfixes in the htree code (only a problem if you're using a
big-endian machine).

- Ted

2002-11-28 22:42:13

by Andries Brouwer

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Wed, Nov 27, 2002 at 07:55:48AM -0500, Theodore Ts'o wrote:

> Ah, ha. I think I know what happened.
>
> What version of e2fsprogs were you using? If it was 1.28, that would
> explain what you saw. There was a fencepost error that could corrupt
> directories when it was optimizing/rehashing them. This bug was fixed
> in the next version, which was rushed out the door as a result of
> this bug. Fortunately, 1.28 didn't get adopted by any distro's as far
> as I know

Hmm. On a recently installed SuSE 8.1 machine:

% rpm -qf `which e2fsck`
e2fsprogs-1.28-18

(maybe the -18 contains the fix?)

2002-11-28 23:40:30

by GertJan Spoelman

Subject: Re: 2.4.20-rc3 ext3 fsck corruption -- tool update warning needed?

On Thursday 28 November 2002 23:49, Andries Brouwer wrote:
> On Wed, Nov 27, 2002 at 07:55:48AM -0500, Theodore Ts'o wrote:
> > Ah, ha. I think I know what happened.
> >
> > What version of e2fsprogs were you using? If it was 1.28, that would
> > explain what you saw. There was a fencepost error that could corrupt
> > directories when it was optimizing/rehashing them. This bug was fixed
> > in the next version, which was rushed out the door as a result of
> > this bug. Fortunately, 1.28 didn't get adopted by any distro's as far
> > as I know
>
> Hmm. On a recently installed SuSE 8.1 machine:
>
> % rpm -qf `which e2fsck`
> e2fsprogs-1.28-18
>
> (maybe the -18 contains the fix?)

No, see below, I copied it from the SuSE security list.

>> This could be slightly off-topic, but isn't data corruption security
>> related?
>> SuSE 8.1 was delivered with e2fsprogs-1.28, which has a "fencepost
>> error" (???) which could cause directory corruption.
>
> According to the maintainer of our e2fsprogs package, this refers to the
> htree (hash tree) stuff in e2fsprogs which is *not* enabled in
> SuSE's version of e2fsprogs.
>
> HTH
>
> best regards,
> Rainer Link
> (SuSE Labs)

--

GertJan