2009-09-09 13:52:03

by Markus Trippelsdorf

Subject: Filesystem corruptions while using rsnapshot

I'm using rsnapshot (http://rsnapshot.org/) to automatically back up my
root filesystem (btrfs) to a second hard drive running ext4. Rsnapshot
uses rsync and a massive number of hard links to keep multiple backups
instantly available.
It seems that the sheer number of hard links overwhelms ext4 and results
in problems that require manual fsck.ext4 runs to bring the fs back to
normal.
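
For reference, each rsnapshot run boils down to roughly the following
(paths and retention are illustrative, not my exact config):

# rotate old snapshots; the oldest one is removed
rm -rf /var/backup/hourly.3
mv /var/backup/hourly.2 /var/backup/hourly.3
mv /var/backup/hourly.1 /var/backup/hourly.2
# duplicate the newest snapshot as a farm of hard links
cp -al /var/backup/hourly.0 /var/backup/hourly.1
# sync changes into hourly.0; rsync replaces changed files with new
# inodes, so the older snapshots keep the old contents
rsync -ax --delete / /var/backup/hourly.0/localhost/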

For example, this is output from fsck.ext4:

Problem in HTREE directory inode <some number>
... node <increasing number> has invalid depth (2)
... node <increasing number> has bad max hash
... node <increasing number> not referenced

This output is repeated ad nauseam while <increasing number> increases
at each round.

The bug is very simple to reproduce here. Just run rsnapshot several
times per day and you will eventually hit the problem.
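
Here rsnapshot is driven from cron; an illustrative crontab entry (not
necessarily my exact schedule) would be:

0 */4 * * * /usr/bin/rsnapshot hourly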

# tune2fs -l /dev/sda1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name: <none>
Last mounted on: /var
Filesystem UUID: 46e99ee2-615d-4ce8-9641-a8c0118fdaa3
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 6545408
Block count: 26165860
Reserved block count: 1308292
Free blocks: 18575443
Free inodes: 5761975
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1017
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
RAID stride: 32732
Flex block group size: 16
Filesystem created: Mon Jun 8 12:08:51 2009
Last mount time: Wed Sep 9 15:03:04 2009
Last write time: Wed Sep 9 15:03:04 2009
Mount count: 1
Maximum mount count: 5
Last checked: Wed Sep 9 15:02:21 2009
Check interval: 15552000 (6 months)
Next check after: Mon Mar 8 14:02:21 2010
Lifetime writes: 434 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 28e527ba-4f7f-4588-a956-c87042f237e6
Journal backup: inode blocks

This machine is running the latest git kernel (2.6.31-rc9).
--
Markus


2009-09-09 14:41:11

by Eric Sandeen

Subject: Re: Filesystem corruptions while using rsnapshot

Markus Trippelsdorf wrote:
> I'm using rsnapshot (http://rsnapshot.org/) to automatically back up my
> root filesystem (btrfs) to a second hard drive running ext4. Rsnapshot
> uses rsync and a massive number of hard links to keep multiple backups
> instantly available.
> It seems that the sheer number of hard links overwhelms ext4 and results
> in problems that require manual fsck.ext4 runs to bring the fs back to
> normal.
>
> For example, this is output from fsck.ext4:
>
> Problem in HTREE directory inode <some number>
> ... node <increasing number> has invalid depth (2)
> ... node <increasing number> has bad max hash
> ... node <increasing number> not referenced
>
> This output is repeated ad nauseam while <increasing number> increases
> at each round.
>
> The bug is very simple to reproduce here. Just run rsnapshot several
> times per day and you will eventually hit the problem.

Could you provide a bzip2'd e2image -r of a corrupted filesystem for
analysis?
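
Something along these lines should do it (device and output names are
placeholders):

# e2image -r /dev/sdXN - | bzip2 > fs.e2i.bz2

e2image -r captures only the filesystem metadata, no file contents, so
the compressed image should stay reasonably small.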

Thanks,
-Eric

2009-09-09 14:54:16

by Markus Trippelsdorf

Subject: Re: Filesystem corruptions while using rsnapshot

On Wed, Sep 09, 2009 at 09:40:41AM -0500, Eric Sandeen wrote:
> Markus Trippelsdorf wrote:
> > I'm using rsnapshot (http://rsnapshot.org/) to automatically back up my
> > root filesystem (btrfs) to a second hard drive running ext4. Rsnapshot
> > uses rsync and a massive number of hard links to keep multiple backups
> > instantly available.
> > It seems that the sheer number of hard links overwhelms ext4 and results
> > in problems that require manual fsck.ext4 runs to bring the fs back to
> > normal.
> >
> > For example, this is output from fsck.ext4:
> >
> > Problem in HTREE directory inode <some number>
> > ... node <increasing number> has invalid depth (2)
> > ... node <increasing number> has bad max hash
> > ... node <increasing number> not referenced
> >
> > This output is repeated ad nauseam while <increasing number> increases
> > at each round.
> >
> > The bug is very simple to reproduce here. Just run rsnapshot several
> > times per day and you will eventually hit the problem.
>
> Could you provide a bzip2'd e2image -r of a corrupted filesystem for
> analysis?

Sure, I will post a link the next time I hit the problem...
--
Markus

2009-09-09 17:20:14

by Markus Trippelsdorf

Subject: Re: Filesystem corruptions while using rsnapshot

On Wed, Sep 09, 2009 at 09:40:41AM -0500, Eric Sandeen wrote:
> Markus Trippelsdorf wrote:
> > I'm using rsnapshot (http://rsnapshot.org/) to automatically back up my
> > root filesystem (btrfs) to a second hard drive running ext4. Rsnapshot
> > uses rsync and a massive number of hard links to keep multiple backups
> > instantly available.
> > It seems that the sheer number of hard links overwhelms ext4 and results
> > in problems that require manual fsck.ext4 runs to bring the fs back to
> > normal.
> >
> > For example, this is output from fsck.ext4:
> >
> > Problem in HTREE directory inode <some number>
> > ... node <increasing number> has invalid depth (2)
> > ... node <increasing number> has bad max hash
> > ... node <increasing number> not referenced
> >
> > This output is repeated ad nauseam while <increasing number> increases
> > at each round.
> >
> > The bug is very simple to reproduce here. Just run rsnapshot several
> > times per day and you will eventually hit the problem.
>
> Could you provide a bzip2'd e2image -r of a corrupted filesystem for
> analysis?

OK I've uploaded the file (~80 MB):
http://www.2shared.com/file/7681344/d19d7154/disc_e2image.html

Hope it helps.
--
Markus

2009-09-09 17:21:33

by Eric Sandeen

Subject: Re: Filesystem corruptions while using rsnapshot

Markus Trippelsdorf wrote:
> On Wed, Sep 09, 2009 at 09:40:41AM -0500, Eric Sandeen wrote:
>> Markus Trippelsdorf wrote:

...

>>> The bug is very simple to reproduce here. Just run rsnapshot several
>>> times per day and you will eventually hit the problem.
>> Could you provide a bzip2'd e2image -r of a corrupted filesystem for
>> analysis?
>
> OK I've uploaded the file (~80 MB):
> http://www.2shared.com/file/7681344/d19d7154/disc_e2image.html
>
> Hope it helps.

Thanks, can't guarantee that the post-mortem will lead to the answer but
it's worth a look.

-Eric

2009-09-09 21:29:24

by Markus Trippelsdorf

Subject: Re: Filesystem corruptions while using rsnapshot

On Wed, Sep 09, 2009 at 12:21:21PM -0500, Eric Sandeen wrote:
> Markus Trippelsdorf wrote:
> > On Wed, Sep 09, 2009 at 09:40:41AM -0500, Eric Sandeen wrote:
> >> Markus Trippelsdorf wrote:
>
> ...
>
> >>> The bug is very simple to reproduce here. Just run rsnapshot several
> >>> times per day and you will eventually hit the problem.
> >> Could you provide a bzip2'd e2image -r of a corrupted filesystem for
> >> analysis?
> >
> > OK I've uploaded the file (~80 MB):
> > http://www.2shared.com/file/7681344/d19d7154/disc_e2image.html
> >
> > Hope it helps.
>
> Thanks, can't guarantee that the post-mortem will lead to the answer but
> it's worth a look.

I took a look myself, and it turned out that my maildir, which contains
the LKML messages of the last three months, is the root of the problem.
I can now reproduce the bug by simply running (e.g.):

cp -al /var/backup/hourly.3/localhost/home/markus/.maildir/lkml/. /var/tmp/debug

where /var is mounted as: "/dev/sda1 on /var type ext4 (rw,noatime,commit=60)"
and /var/backup/hourly.3/localhost/ is a previous backup produced by rsnapshot.

--
Markus

2009-09-09 21:35:59

by Eric Sandeen

Subject: Re: Filesystem corruptions while using rsnapshot

Markus Trippelsdorf wrote:
> On Wed, Sep 09, 2009 at 12:21:21PM -0500, Eric Sandeen wrote:
>> Markus Trippelsdorf wrote:
>>> On Wed, Sep 09, 2009 at 09:40:41AM -0500, Eric Sandeen wrote:
>>>> Markus Trippelsdorf wrote:
>> ...
>>
>>>>> The bug is very simple to reproduce here. Just run rsnapshot several
>>>>> times per day and you will eventually hit the problem.
>>>> Could you provide a bzip2'd e2image -r of a corrupted filesystem for
>>>> analysis?
>>> OK I've uploaded the file (~80 MB):
>>> http://www.2shared.com/file/7681344/d19d7154/disc_e2image.html
>>>
>>> Hope it helps.
>> Thanks, can't guarantee that the post-mortem will lead to the answer but
>> it's worth a look.
>
> I took a look myself, and it turned out that my maildir, which contains
> the LKML messages of the last three months, is the root of the problem.
> I can now reproduce the bug by simply running (e.g.):
>
> cp -al /var/backup/hourly.3/localhost/home/markus/.maildir/lkml/. /var/tmp/debug
>
> where /var is mounted as: "/dev/sda1 on /var type ext4 (rw,noatime,commit=60)"
> and /var/backup/hourly.3/localhost/ is a previous backup produced by rsnapshot.

From the original post it wasn't quite clear: do you hit runtime
corruption, indicated by errors in the kernel logs while the fs is
mounted, which a subsequent fsck then also finds?

Just for completeness, if there are errors in the system logs, can you
post them here as well?
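
(i.e. anything that, say,

# dmesg | grep -i 'EXT4-fs'

turns up, or matching entries in /var/log/messages.)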

Thanks,
-Eric

2009-09-09 21:42:23

by Markus Trippelsdorf

Subject: Re: Filesystem corruptions while using rsnapshot

On Wed, Sep 09, 2009 at 04:35:45PM -0500, Eric Sandeen wrote:
> Markus Trippelsdorf wrote:
> > On Wed, Sep 09, 2009 at 12:21:21PM -0500, Eric Sandeen wrote:
> >> Markus Trippelsdorf wrote:
> >>> On Wed, Sep 09, 2009 at 09:40:41AM -0500, Eric Sandeen wrote:
> >>>> Markus Trippelsdorf wrote:
> >> ...
> >>
> >>>>> The bug is very simple to reproduce here. Just run rsnapshot several
> >>>>> times per day and you will eventually hit the problem.
> >>>> Could you provide a bzip2'd e2image -r of a corrupted filesystem for
> >>>> analysis?
> >>> OK I've uploaded the file (~80 MB):
> >>> http://www.2shared.com/file/7681344/d19d7154/disc_e2image.html
> >>>
> >>> Hope it helps.
> >> Thanks, can't guarantee that the post-mortem will lead to the answer but
> >> it's worth a look.
> >
> > I took a look myself, and it turned out that my maildir, which contains
> > the LKML messages of the last three months, is the root of the problem.
> > I can now reproduce the bug by simply running (e.g.):
> >
> > cp -al /var/backup/hourly.3/localhost/home/markus/.maildir/lkml/. /var/tmp/debug
> >
> > where /var is mounted as: "/dev/sda1 on /var type ext4 (rw,noatime,commit=60)"
> > and /var/backup/hourly.3/localhost/ is a previous backup produced by rsnapshot.
>
> From the original post it wasn't quite clear: do you hit runtime
> corruption, indicated by errors in the kernel logs while the fs is
> mounted, which a subsequent fsck then also finds?
>
> Just for completeness, if there are errors in the system logs, can you
> post them here as well?

There are no errors in the system logs.
I only learned about this bug when a regular boot-time fsck ran and
printed the problems mentioned above.
--
Markus

2009-09-09 21:46:34

by Eric Sandeen

Subject: Re: Filesystem corruptions while using rsnapshot

Markus Trippelsdorf wrote:
> On Wed, Sep 09, 2009 at 04:35:45PM -0500, Eric Sandeen wrote:

...

>> Just for completeness, if there are errors in the system logs, can you
>> post them here as well?
>
> There are no errors in the system logs.
> I only learned about this bug when a regular boot-time fsck ran and
> printed the problems mentioned above.

Hm, but there should be no regular boot-time fsck unless you have your
forced-fsck interval set pretty low; a normal boot-time fsck will simply
replay the log. Was that it? Did you just happen to catch this on the
21st boot, or whatever the interval may be?

Otherwise, the fs was flagged with errors, and -that- forced the full
fsck at boot time, and hopefully something was logged somewhere
beforehand ...
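
For reference, the interval is easy to check (device name is just an
example):

# tune2fs -l /dev/sdXN | grep -i 'mount count'

"Maximum mount count" is the forced-check interval; -1 disables the
mount-count-based check entirely.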

-Eric

2009-09-09 21:50:53

by Markus Trippelsdorf

Subject: Re: Filesystem corruptions while using rsnapshot

On Wed, Sep 09, 2009 at 04:46:12PM -0500, Eric Sandeen wrote:
> Markus Trippelsdorf wrote:
> > On Wed, Sep 09, 2009 at 04:35:45PM -0500, Eric Sandeen wrote:
>
> ...
>
> >> Just for completeness, if there are errors in the system logs, can you
> >> post them here as well?
> >
> > There are no errors in the system logs.
> > I only learned about this bug when a regular boot-time fsck ran and
> > printed the problems mentioned above.
>
> Hm, but there should be no regular boot-time fsck unless you have your
> forced-fsck interval set pretty low; a normal boot-time fsck will simply
> replay the log. Was that it? Did you just happen to catch this on the
> 21st boot, or whatever the interval may be?

Yes, sorry I didn't make myself clear; I just happened to catch this on
the 21st boot...
After that I booted from a USB stick and ran fsck by hand from there.
--
Markus

2009-09-10 01:11:10

by Markus Trippelsdorf

Subject: Re: Filesystem corruptions while using rsnapshot

On Wed, Sep 09, 2009 at 05:29:54PM -0500, Eric Sandeen wrote:
> Markus Trippelsdorf wrote:
> > On Wed, Sep 09, 2009 at 04:46:12PM -0500, Eric Sandeen wrote:
> >> Markus Trippelsdorf wrote:
> >>> On Wed, Sep 09, 2009 at 04:35:45PM -0500, Eric Sandeen wrote:
> >> ...
> >>
> >>>> Just for completeness, if there are errors in the system logs, can you
> >>>> post them here as well?
> >>> There are no errors in the system logs.
> >>> I only learned about this bug when a regular boot-time fsck ran and
> >>> printed the problems mentioned above.
> >> Hm, but there should be no regular boot-time fsck unless you have your
> >> forced-fsck interval set pretty low; a normal boot-time fsck will simply
> >> replay the log. Was that it? Did you just happen to catch this on the
> >> 21st boot, or whatever the interval may be?
> >
> > Yes, sorry I didn't make myself clear; I just happened to catch this on
> > the 21st boot...
> > After that I booted from a USB stick and ran fsck by hand from there.
>
> Are you certain there are no errors logged? The fs -is- flagged with
> errors:
>
> # file disc_e2image
> disc_e2image: Linux rev 1.0 ext4 filesystem data (needs journal
> recovery) (errors) (extents) (large files) (huge files)
>
> At what point did you make that image?

I did the cp -al; after that I ran "tune2fs -C 99 /dev/sda1" to force an
e2fsck -p at the next boot. I rebooted, and the fs was marked with errors
by e2fsck -p. Then I ran "e2image -r /dev/sda1 - | bzip2 > disc_e2image.bz2"
on the mounted fs.
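
In command form, the sequence was roughly:

cp -al /var/backup/hourly.3/localhost/home/markus/.maildir/lkml/. /var/tmp/debug
tune2fs -C 99 /dev/sda1    # mount count now exceeds the maximum,
                           # so the next boot runs e2fsck -p
reboot                     # e2fsck -p marks the fs as having errors
e2image -r /dev/sda1 - | bzip2 > disc_e2image.bz2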

BTW the image shows the errors nicely:

# fsck.ext4 -n disc_e2image
e2fsck 1.41.9 (22-Aug-2009)
Warning: skipping journal recovery because doing a read-only filesystem check.
disc_e2image contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Problem in HTREE directory inode 2409285: node (1) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (1) has bad max hash
Problem in HTREE directory inode 2409285: node (1) not referenced
Problem in HTREE directory inode 2409285: node (2) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (2) has bad max hash
Problem in HTREE directory inode 2409285: node (2) not referenced
Problem in HTREE directory inode 2409285: node (3) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (3) has bad max hash
Problem in HTREE directory inode 2409285: node (3) not referenced
Problem in HTREE directory inode 2409285: node (4) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (4) has bad max hash
Problem in HTREE directory inode 2409285: node (4) not referenced
Problem in HTREE directory inode 2409285: node (5) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (5) has bad max hash
Problem in HTREE directory inode 2409285: node (5) not referenced
Problem in HTREE directory inode 2409285: node (6) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (6) has bad max hash
Problem in HTREE directory inode 2409285: node (6) not referenced
Problem in HTREE directory inode 2409285: node (7) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (7) has bad max hash
Problem in HTREE directory inode 2409285: node (7) not referenced
Problem in HTREE directory inode 2409285: node (8) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (8) has bad max hash
Problem in HTREE directory inode 2409285: node (8) not referenced
.
.
.
Problem in HTREE directory inode 2409285: node (1026) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (1026) has bad max hash
Problem in HTREE directory inode 2409285: node (1026) not referenced
Problem in HTREE directory inode 2409285: node (1027) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (1027) has bad max hash
Problem in HTREE directory inode 2409285: node (1027) not referenced
Problem in HTREE directory inode 2409285: node (1028) has invalid depth (2)
Problem in HTREE directory inode 2409285: node (1028) has bad max hash
Problem in HTREE directory inode 2409285: node (1028) not referenced
Invalid HTREE directory inode 2409285 (/backup/hourly.2/localhost/home/markus/.maildir/lkml/cur). Clear HTree index? no

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (18497534, counted=18497453).
Fix? no


disc_e2image: ********** WARNING: Filesystem still has errors **********

disc_e2image: 809496/6545408 files (0.3% non-contiguous), 7668326/26165860 blocks
--
Markus

2009-09-17 20:33:44

by Markus Trippelsdorf

Subject: Re: Filesystem corruptions while using rsnapshot

On Wed, Sep 09, 2009 at 09:40:41AM -0500, Eric Sandeen wrote:
> Markus Trippelsdorf wrote:
> > I'm using rsnapshot (http://rsnapshot.org/) to automatically back up my
> > root filesystem (btrfs) to a second hard drive running ext4. Rsnapshot
> > uses rsync and a massive number of hard links to keep multiple backups
> > instantly available.
> > It seems that the sheer number of hard links overwhelms ext4 and results
> > in problems that require manual fsck.ext4 runs to bring the fs back to
> > normal.
> >
> > For example, this is output from fsck.ext4:
> >
> > Problem in HTREE directory inode <some number>
> > ... node <increasing number> has invalid depth (2)
> > ... node <increasing number> has bad max hash
> > ... node <increasing number> not referenced
> >
> > This output is repeated ad nauseam while <increasing number> increases
> > at each round.
> >
> > The bug is very simple to reproduce here. Just run rsnapshot several
> > times per day and you will eventually hit the problem.
>
> Could you provide a bzip2'd e2image -r of a corrupted filesystem for
> analysis?

Eric, I'm sorry that I may have wasted your time with this.
It turned out that a bad RAM module was the most likely cause of these
corruptions. (I only found out today, after I got btrfs csum errors
in my log, see: http://thread.gmane.org/gmane.comp.file-systems.btrfs/3613)

So please accept my apologies.
Thanks.
--
Markus

2009-09-17 20:39:31

by Eric Sandeen

Subject: Re: Filesystem corruptions while using rsnapshot

Markus Trippelsdorf wrote:
...

> Eric, I'm sorry that I may have wasted your time with this.
> It turned out that a bad RAM module was the most likely cause of these
> corruptions. (I only found out today, after I got btrfs csum errors
> in my log, see: http://thread.gmane.org/gmane.comp.file-systems.btrfs/3613)
>
> So please accept my apologies.
> Thanks.

That's ok, I've been so swamped I haven't had much time at all to look
at it, and was feeling bad. So this is actually good news on several
fronts. ;)

Thanks for the followup!

-Eric