2018-01-15 07:55:53

by Nikola Ciprich

Subject: e2fsck -D lead to severely damaged filesystem

Hello dear ext4 developers,

I'd like to ask about the following problem I hit yesterday
(and which I'm partly responsible for, I guess).

we were dealing with slow access to directories with lots of
files (large maildirs), so after some tests, I came to the conclusion
that optimizing directories using e2fsck -D (on an unmounted FS, of course)
helps a lot. So after testing this on our test box, I did it on the production
mailserver mail volume. Then I decided to do some tests on a newer kernel,
so I rebooted the test box and got lots of fs errors..

I checked production box, and it got bad as well:

lots of dx_probe:829: inode #15949784: block 35579: comm deliver: Directory hole found
messages..


so I unmounted the fs again, ran fsck, and got zillions of:

Inode 18378187 ref count is 2, should be 1. Fix? yes

Unattached inode 18378194
Connect to /lost+found? yes

messages..


after ~3 hours, I gave up and restored the FS from backup.. checking the fs after the
"repair" showed that some of the large mailboxes had vanished completely (and appeared in lost+found)

I think I can rule out a hardware problem, since it appeared on two completely different
systems after the same operation.. but I'll try to prepare a new test environment and reproduce it.

What I think might be my big mistake is that I was using quite an old e2fsprogs, 1.42.6;
the kernel was 4.4.52 (which I know is also a bit old, we're already testing 4.14.x)

My question is: was this a known e2fsck problem that got fixed in a newer version?

Or did I do something wrong?

I'm going to retry using 1.43.8, but I'd still feel a bit calmer knowing it was a known problem
that got fixed :)

If I can provide any more information, please let me know..

BR

nik

PS: both systems were running the latest CentOS 6 (but with a newer kernel and e2fsprogs)


--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
http://www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: [email protected]
-------------------------------------


2018-01-16 09:03:54

by Jan Kara

Subject: Re: e2fsck -D lead to severely damaged filesystem

Hello,

On Mon 15-01-18 08:23:29, Nikola Ciprich wrote:
> we were dealing with slow access to directories with lots of
> files (large maildirs), so after some tests, I came to the conclusion
> that optimizing directories using e2fsck -D (on an unmounted FS, of course)
> helps a lot. So after testing this on our test box, I did it on the production
> mailserver mail volume. Then I decided to do some tests on a newer kernel,
> so I rebooted the test box and got lots of fs errors..
>
> I checked production box, and it got bad as well:
>
> lots of dx_probe:829: inode #15949784: block 35579: comm deliver: Directory hole found
> messages..
>
>
> so I unmounted the fs again, ran fsck, and got zillions of:
>
> Inode 18378187 ref count is 2, should be 1. Fix? yes
>
> Unattached inode 18378194
> Connect to /lost+found? yes
>
> messages..
>
>
> after ~3 hours, I gave up and restored the FS from backup.. checking the fs after the
> "repair" showed that some of the large mailboxes had vanished completely (and appeared in lost+found)
>
> I think I can rule out a hardware problem, since it appeared on two completely different
> systems after the same operation.. but I'll try to prepare a new test environment and reproduce it.
>
> What I think might be my big mistake is that I was using quite an old e2fsprogs, 1.42.6;
> the kernel was 4.4.52 (which I know is also a bit old, we're already testing 4.14.x)
>
> My question is: was this a known e2fsck problem that got fixed in a newer version?

Commit 19961cd000 "e2fsck: fix e2fsck -fD directory truncation" looks like
it fixes a problem similar to the one you've observed. So there's a reasonable chance
newer e2fsprogs will handle the filesystem fine. But if not, please run
"e2image -r <device> - | xz -c >ext4.image" *before* running e2fsck -D and
put the image somewhere for download. That way we can experiment with the metadata
image and see what exactly e2fsck does wrong. Thanks!
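A minimal bash sketch of that order of operations (metadata image first, then e2fsck -fD). It assumes e2fsprogs and xz are installed and that the filesystem is unmounted. The helper name backup_then_optimize, the default output name ext4.image.xz, the /proc/mounts check and the use of -y are illustrative assumptions, not part of the advice in this thread.

```bash
#!/usr/bin/env bash
# Sketch: save a compressed raw metadata image of an unmounted ext4
# filesystem, then let e2fsck rebuild its directories (-fD).
# backup_then_optimize is a hypothetical helper, not an e2fsprogs command.
set -uo pipefail

backup_then_optimize() {
    local dev=$1
    local img=${2:-ext4.image.xz}   # default output name is an assumption
    local rc=0

    # Refuse to run if /proc/mounts lists the device as mounted.
    # Simple check only: symlinks and /dev/mapper aliases are not resolved.
    if awk -v d="$dev" '$1 == d { found = 1 } END { exit (found ? 0 : 1) }' /proc/mounts; then
        echo "error: $dev appears to be mounted" >&2
        return 1
    fi

    # 1. Raw metadata image written to stdout (-) and compressed with xz.
    #    With pipefail, a failing e2image makes the whole pipeline fail.
    if ! e2image -r "$dev" - | xz -c > "$img"; then
        echo "error: e2image failed, not running e2fsck" >&2
        rm -f "$img"
        return 1
    fi

    # 2. Forced check (-f) with directory optimization (-D).
    #    -y answers "yes" to every prompt; drop it to review each fix.
    e2fsck -f -D -y "$dev" || rc=$?

    # e2fsck exit status: 0-3 = clean, errors corrected, or reboot advised;
    # 4 and above = errors left uncorrected or an operational failure.
    if (( rc >= 4 )); then
        echo "error: e2fsck exited with $rc; metadata image kept in $img" >&2
        return "$rc"
    fi
    return 0
}
```

For example, `backup_then_optimize /dev/sdX1 mail-meta.xz` (device name hypothetical). The image holds only metadata, but that includes directory blocks and therefore file names, which may matter before putting it up for download.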

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-01-16 10:32:17

by Nikola Ciprich

Subject: Re: e2fsck -D lead to severely damaged filesystem

Hello Jan,

thanks for the reply (and really sorry for the double post: after sending the
first email, I noticed I was no longer subscribed, so I sent it again
after resubscribing, but apparently both emails made it to the list after all)

you're right, this commit looks like the one I'm looking for.. I'll try
to reproduce with and without it and report back.. anyway, it's time to move on
to newer e2fsprogs..

with best regards

nik
