2008-10-15 03:24:55

by Andrey Borzenkov

Subject: Possible ext3 corruption with 1K block size

There is a long-standing open bug report on Mandriva which is currently
believed to have its root cause in file system corruption. It shows itself
as RPM DB corruption (at least, there is no other known method to trigger
it). So far all reported cases happened on filesystems with 1K block size
and stopped when the RPM DB was moved to a FS with 4K block size.
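
To check whether a given machine matches this pattern, the block size of
the filesystem holding the RPM DB can be read with os.statvfs. This is a
minimal sketch, not something taken from the bug reports: the path
/var/lib/rpm is an assumed DB location, the helper names are made up for
illustration, and f_bsize is assumed to report the ext3 block size.

```python
import os

RPM_DB_DIR = "/var/lib/rpm"  # assumed DB location; adjust for your system


def fs_block_size(path):
    """Return the block size in bytes of the filesystem that holds path."""
    return os.statvfs(path).f_bsize


def uses_1k_blocks(path):
    """True when the filesystem holding path reports 1K (1024-byte) blocks."""
    return fs_block_size(path) == 1024
```

Calling uses_1k_blocks(RPM_DB_DIR) on an affected system would be expected
to return True if the reports above hold.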

There are similar RH reports as well.

Here are references:

https://qa.mandriva.com/show_bug.cgi?id=32547

This one is rather long. Interesting bits are probably around

https://qa.mandriva.com/show_bug.cgi?id=32547#c177
https://qa.mandriva.com/show_bug.cgi?id=32547#c148 (many users reporting
dumpe2fs output)

https://bugzilla.redhat.com/show_bug.cgi?id=230362
https://bugzilla.redhat.com/show_bug.cgi?id=375931
https://bugzilla.redhat.com/show_bug.cgi?id=305301

The Mandriva bugzilla also mentions this mail from Stephen Tweedie
http://lkml.org/lkml/2007/9/18/232

which indicates some issues with 1K blocks, but according to the last comment:
https://qa.mandriva.com/show_bug.cgi?id=32547#c300

the corruption is still present in 2.6.27 (or at least was present in -rc6)

There was a kernel bug report http://bugzilla.kernel.org/show_bug.cgi?id=11564,
but in this case it was identified as a hardware issue.


2008-10-15 12:50:09

by Eric Sandeen

Subject: Re: Possible ext3 corruption with 1K block size

Andrey Borzenkov wrote:
> There is long standing open bug report on Mandriva which is currently
> believed to have root cause in file system corruption. It shows itself
> in RPM DB corruption (at least, there is no other known method to trigger
> it). So far all reported cases happened on filesystem with 1K block size
> and stopped when RPM DB was moved to FS with 4K block size.
>
> There are also similar RH reports as well.
>
> Here are references:
>
> https://qa.mandriva.com/show_bug.cgi?id=32547
>
> This one is rather long.

Yep, unfortunately IIRC most of the bug is "me too"s and "how do I do
the workaround" :)

> Interesting bits are probably around
>
> https://qa.mandriva.com/show_bug.cgi?id=32547#c177
> https://qa.mandriva.com/show_bug.cgi?id=32547#c148 (many users reporting
> dumpe2fs)
>
> https://bugzilla.redhat.com/show_bug.cgi?id=230362
> https://bugzilla.redhat.com/show_bug.cgi?id=375931
> https://bugzilla.redhat.com/show_bug.cgi?id=305301
>
> The Mandriva bugzilla also mentions this mail from Stephen Tweedie
> http://lkml.org/lkml/2007/9/18/232

I don't think this is related, in the end... there was some possibility
of corruption from that, but I think it's doubtful it'd hit 1k block
filesystems more, and in any case, the corruption has been seen since
then if I read it right.

> which indicates some issues with 1K blocks, but according to last comment:
> https://qa.mandriva.com/show_bug.cgi?id=32547#c300
>
> it is still present in 2.6.27 (at least was present on -rc6)
>
> There was a kernel bug report http://bugzilla.kernel.org/show_bug.cgi?id=11564,
> but in this case it was identified as hardware issue.

My kingdom for a testcase... does anyone have simple steps to reproduce
this? Or do they all start with "install mandriva on a 1k block size
system?" :)

-Eric

2008-10-15 14:24:51

by Andrey Borzenkov

Subject: Re: Possible ext3 corruption with 1K block size

On Wednesday 15 October 2008, Eric Sandeen wrote:
> Andrey Borzenkov wrote:
> > There is long standing open bug report on Mandriva which is currently
> > believed to have root cause in file system corruption. It shows itself
> > in RPM DB corruption (at least, there is no other known method to trigger
> > it). So far all reported cases happened on filesystem with 1K block size
> > and stopped when RPM DB was moved to FS with 4K block size.
> >
> > There are also similar RH reports as well.
> >
> > Here are references:
> >
> > https://qa.mandriva.com/show_bug.cgi?id=32547
> >
> > This one is rather long.
>
> yep, unfortunately IIRC most of the bug is "me too's" and "how do I do
> the workaround" :)
>
> > Interesting bits are probably around
> >
> > https://qa.mandriva.com/show_bug.cgi?id=32547#c177
> > https://qa.mandriva.com/show_bug.cgi?id=32547#c148 (many users reporting
> > dumpe2fs)
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=230362
> > https://bugzilla.redhat.com/show_bug.cgi?id=375931
> > https://bugzilla.redhat.com/show_bug.cgi?id=305301
> >
> > The Mandriva bugzilla also mentions this mail from Stephen Tweedie
> > http://lkml.org/lkml/2007/9/18/232
>
> I don't think this is related, in the end... there was some possibility
> of corruption from that, but I think it's doubtful it'd hit 1k block
> filesystems more, and in any case, the corruption has been seen since
> then if I read it right.
>
> > which indicates some issues with 1K blocks, but according to last comment:
> > https://qa.mandriva.com/show_bug.cgi?id=32547#c300
> >
> > it is still present in 2.6.27 (at least was present on -rc6)
> >
> > There was a kernel bug report http://bugzilla.kernel.org/show_bug.cgi?id=11564,
> > but in this case it was identified as hardware issue.
>
> My kingdom for a testcase... does anyone have simple steps to reproduce
> this? Or do they all start with "install mandriva on a 1k block size
> system?" :)
>

Maybe RH will do? :)

As indicated by the last comment, Pascal has some ways to trigger it; I
forgot to Cc him initially; doing it now.


2008-10-15 14:44:19

by Eric Sandeen

Subject: Re: Possible ext3 corruption with 1K block size

Andrey Borzenkov wrote:
> On Wednesday 15 October 2008, Eric Sandeen wrote:


>> My kingdom for a testcase... does anyone have simple steps to reproduce
>> this? Or do they all start with "install mandriva on a 1k block size
>> system?" :)
>>
>
> Maybe RH will do? :)

I did try a 1k-block root fs Fedora install, and didn't see any problems...

> As indicated by last comment, Pascal has some ways to trigger it; I
> forgot to Cc to him initially; doing it now.

Ok, good deal.

-Eric

2008-10-16 13:48:08

by Pascal Terjan

Subject: Re: Possible ext3 corruption with 1K block size

On Wednesday, 15 October 2008 at 09:43 -0500, Eric Sandeen wrote:
> Andrey Borzenkov wrote:
> > On Wednesday 15 October 2008, Eric Sandeen wrote:
>
>
> >> My kingdom for a testcase... does anyone have simple steps to reproduce
> >> this? Or do they all start with "install mandriva on a 1k block size
> >> system?" :)
> >>
> >
> > Maybe RH will do? :)
>
> I did try a 1k-block root fs Fedora install, and didn't see any problems...
>
> > As indicated by last comment, Pascal has some ways to trigger it; I
> > forgot to Cc to him initially; doing it now.
>
> Ok, good deal.
>

On my test machine I can reproduce it easily: run rpm --rebuilddb, and if
the db is not detected as corrupted right away, it will be after
installing a few packages (tested again with 2.6.27).

If I do the rebuilddb on 2.6.17 and then reboot into a recent kernel,
then I can install/uninstall thousands of packages without any
corruption.

I wanted to try a few things including copying the partition to a file
and trying to reproduce in a vm.
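
One possible way to prepare such a copy is sketched below. It only builds
a dd command line and parses the "Block size:" line that dumpe2fs -h
prints, so the image can be confirmed to still have 1K blocks before it
is attached to a VM. The helper names and the example device and image
paths are hypothetical; nothing here was run as part of this thread.

```python
import re


def copy_command(partition, image):
    """Build a dd command that copies a partition into an image file."""
    return ["dd", f"if={partition}", f"of={image}", "bs=1M"]


def block_size_from_dumpe2fs(text):
    """Extract the block size from `dumpe2fs -h` output, or None if absent."""
    match = re.search(r"^Block size:\s+(\d+)\s*$", text, re.MULTILINE)
    return int(match.group(1)) if match else None
```

The command list can be handed to subprocess.run, and the output of
dumpe2fs -h on the resulting image passed to block_size_from_dumpe2fs.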

Given how I can reproduce and repair it, I can even write a bisecting
script, which would basically be an initscript that does:

if on test kernel
    - rebuild the db
    - install 10 rpm
    - remove the 10 rpm
    - check the db
    - do the good/bad
    - reboot onto 2.6.17
else if on 2.6.17
    - rebuild the db
    - build the kernel
    - reboot on test kernel

and let it run :)

All I need is to find some time with nothing more urgent...

2008-10-16 14:38:56

by Eric Sandeen

Subject: Re: Possible ext3 corruption with 1K block size

Pascal Terjan wrote:
> On Wednesday, 15 October 2008 at 09:43 -0500, Eric Sandeen wrote:
>> Andrey Borzenkov wrote:
>>> On Wednesday 15 October 2008, Eric Sandeen wrote:
>>
>>>> My kingdom for a testcase... does anyone have simple steps to reproduce
>>>> this? Or do they all start with "install mandriva on a 1k block size
>>>> system?" :)
>>>>
>>> Maybe RH will do? :)
>> I did try a 1k-block root fs Fedora install, and didn't see any problems...
>>
>>> As indicated by last comment, Pascal has some ways to trigger it; I
>>> forgot to Cc to him initially; doing it now.
>> Ok, good deal.
>>
>
> On my test machine I reproduce it easily : rpm --rebuilddb and if the db
> is not detected to be corrupted yet it will be after installing a few
> packages (tested again with 2.6.27).
>
> If I do the rebuilddb on a 2.6.17 and then reboot on a recent kernel,
> then I can install/uninstall thousands of packages without any
> corruption.

So it seems to be the database rebuild, under a recent kernel, which
causes the problem? Installing under a recent kernel is OK, as long as
the db was created on an older kernel?

Ok that's a good clue...

-Eric

2008-10-16 14:41:00

by Pascal Terjan

Subject: Re: Possible ext3 corruption with 1K block size

On Thursday, 16 October 2008 at 09:38 -0500, Eric Sandeen wrote:
> Pascal Terjan wrote:
> > On Wednesday, 15 October 2008 at 09:43 -0500, Eric Sandeen wrote:
> >> Andrey Borzenkov wrote:
> >>> On Wednesday 15 October 2008, Eric Sandeen wrote:
> >>
> >>>> My kingdom for a testcase... does anyone have simple steps to reproduce
> >>>> this? Or do they all start with "install mandriva on a 1k block size
> >>>> system?" :)
> >>>>
> >>> Maybe RH will do? :)
> >> I did try a 1k-block root fs Fedora install, and didn't see any problems...
> >>
> >>> As indicated by last comment, Pascal has some ways to trigger it; I
> >>> forgot to Cc to him initially; doing it now.
> >> Ok, good deal.
> >>
> >
> > On my test machine I reproduce it easily : rpm --rebuilddb and if the db
> > is not detected to be corrupted yet it will be after installing a few
> > packages (tested again with 2.6.27).
> >
> > If I do the rebuilddb on a 2.6.17 and then reboot on a recent kernel,
> > then I can install/uninstall thousands of packages without any
> > corruption.
>
> so it seems to be the database rebuilding, under a recent kernel, which
> causes the problem? installing under a recent kernel is ok, as long as
> the db was created on an older kernel?

Yes

2008-12-18 18:13:17

by Jan Kara

Subject: Re: Possible ext3 corruption with 1K block size

Hi Eric,

> Pascal Terjan wrote:
> > On Wednesday, 15 October 2008 at 09:43 -0500, Eric Sandeen wrote:
> >> Andrey Borzenkov wrote:
> >>> On Wednesday 15 October 2008, Eric Sandeen wrote:
> >>
> >>>> My kingdom for a testcase... does anyone have simple steps to reproduce
> >>>> this? Or do they all start with "install mandriva on a 1k block size
> >>>> system?" :)
> >>>>
> >>> Maybe RH will do? :)
> >> I did try a 1k-block root fs Fedora install, and didn't see any problems...
> >>
> >>> As indicated by last comment, Pascal has some ways to trigger it; I
> >>> forgot to Cc to him initially; doing it now.
> >> Ok, good deal.
> >>
> >
> > On my test machine I reproduce it easily : rpm --rebuilddb and if the db
> > is not detected to be corrupted yet it will be after installing a few
> > packages (tested again with 2.6.27).
> >
> > If I do the rebuilddb on a 2.6.17 and then reboot on a recent kernel,
> > then I can install/uninstall thousands of packages without any
> > corruption.
>
> so it seems to be the database rebuilding, under a recent kernel, which
> causes the problem? installing under a recent kernel is ok, as long as
> the db was created on an older kernel?
>
> Ok that's a good clue...
Have you been able to track this down? Anything interesting?

Honza
--
Jan Kara
SuSE CR Labs

2008-12-18 18:20:37

by Eric Sandeen

Subject: Re: Possible ext3 corruption with 1K block size

Jan Kara wrote:
> Hi Eric,
>


> Have you been able to track this down? Anything interesting?
>
> Honza

No, unfortunately this kind of dropped off my radar recently...

-Eric