Hi - I'm trying out ext4 on a large 8.2 TB software raid device (md). On
rebooting (cleanly unmounting), I tried an fsck on the device. I get the
following:
[root@xback2 ~]# fsck /dev/md0
fsck 1.41.4 (27-Jan-2009)
e2fsck 1.41.4 (27-Jan-2009)
fsck.ext4: Group descriptors look bad... trying backup blocks...
Group descriptor 0 checksum is invalid. Fix<y>?
It then finds lots of bad group descriptors.
This is Fedora 10, 2.6.27.21-170.2.56.fc10.x86_64 and
e2fsprogs-1.41.4-4.fc10.x86_64.
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
Some more information about the device:
[root@xback2 ~]# dumpe2fs /dev/md0 | head -100
dumpe2fs 1.41.4 (27-Jan-2009)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 508aee62-79af-4b4c-95a6-222b3868834c
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype extent flex_bg sparse_super large_file huge_file uninit_bg
dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 549314560
Block count: 2197239840
Reserved block count: 0
Free blocks: 1508301443
Free inodes: 545311753
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 500
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
RAID stride: 8
RAID stripe width: 72
Flex block group size: 16
Filesystem created: Fri Apr 10 17:13:08 2009
Last mount time: Mon Apr 13 11:00:22 2009
Last write time: Fri Apr 17 11:53:05 2009
Mount count: 1
Maximum mount count: -1
Last checked: Fri Apr 10 17:13:08 2009
Check interval: 0 (<none>)
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 9c9b9fd6-5af2-4ee0-bceb-25827cb008f9
Journal backup: inode blocks
Journal size: 128M
Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
Checksum 0xd3b2, unused inodes 4032
Primary superblock at 0, Group descriptors at 1-524
Reserved GDT blocks at 525-1024
Block bitmap at 1025 (+1025), Inode bitmap at 1041 (+1041)
Inode table at 1057-1568 (+1057)
856 free blocks, 4032 free inodes, 268 directories, 4032 unused inodes
Group 1: (Blocks 32768-65535) [INODE_UNINIT, ITABLE_ZEROED]
Checksum 0xc586, unused inodes 8192
Backup superblock at 32768, Group descriptors at 32769-33292
Reserved GDT blocks at 33293-33792
Block bitmap at 1026 (+4294935554), Inode bitmap at 1042 (+4294935570)
Inode table at 1569-2080 (+4294936097)
939 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Group 2: (Blocks 65536-98303) [INODE_UNINIT, ITABLE_ZEROED]
...
On Fri, Apr 17, 2009 at 12:03:33PM +0100, Jeremy Sanders wrote:
> Hi - I'm trying out ext4 on a large 8.2 TB software raid device (md). On
> rebooting (cleanly unmounting), I tried an fsck on the device. I get the
> following:
>
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix<y>?
>
> It then finds lots of bad group descriptors.
What happened afterwards? Did fsck complete successfully?
I see from the dumpe2fs that you sent it had only been in use for a
week. How were you using the filesystem? Did you try using the
online resize feature at any time?
The problem is that any number of things could have caused the block
group descriptors to be corrupted.
- Ted
Theodore Tso wrote:
> What happened afterwards? Did fsck complete successfully?
I was waiting to see whether you wanted me to do something else. I've just
tried it and it didn't:
[root@xback2 ~]# fsck -a /dev/md0
fsck 1.41.4 (27-Jan-2009)
/dev/md0: Group descriptor 384 checksum is invalid. FIXED.
/dev/md0: Group descriptor 385 checksum is invalid. FIXED.
/dev/md0: Group descriptor 386 checksum is invalid. FIXED.
/dev/md0: Group descriptor 387 checksum is invalid. FIXED.
/dev/md0: Group descriptor 388 checksum is invalid. FIXED.
/dev/md0: Group descriptor 389 checksum is invalid. FIXED.
/dev/md0: Group descriptor 390 checksum is invalid. FIXED.
/dev/md0: Group descriptor 391 checksum is invalid. FIXED.
/dev/md0: Group descriptor 392 checksum is invalid. FIXED.
/dev/md0: Group descriptor 393 checksum is invalid. FIXED.
/dev/md0: Group descriptor 394 checksum is invalid. FIXED.
/dev/md0: Group descriptor 395 checksum is invalid. FIXED.
/dev/md0: Group descriptor 396 checksum is invalid. FIXED.
/dev/md0: Group descriptor 397 checksum is invalid. FIXED.
/dev/md0: Group descriptor 398 checksum is invalid. FIXED.
/dev/md0: Group descriptor 399 checksum is invalid. FIXED.
/dev/md0: Group descriptor 400 checksum is invalid. FIXED.
/dev/md0: Group descriptor 401 checksum is invalid. FIXED.
/dev/md0: Group descriptor 402 checksum is invalid. FIXED.
/dev/md0: Group descriptor 403 checksum is invalid. FIXED.
/dev/md0: Group descriptor 404 checksum is invalid. FIXED.
/dev/md0: Note: if several inode or block bitmap blocks or part
of the inode table require relocation, you may wish to try
running e2fsck with the '-b 32768' option first. The problem
may lie only with the primary block group descriptors, and
the backup block group descriptors may be OK.
/dev/md0: Block bitmap for group 405 is not in group. (block 3393946179)
/dev/md0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
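(For reference, the invocation that note suggests would be something
like "e2fsck -b 32768 /dev/md0" - 32768 being the first backup
superblock location on a 4096-byte-block filesystem, as the dumpe2fs
output earlier also shows.)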
** When I run it manually I get:
Pass 1: Checking inodes, blocks, and sizes
Inode 8355 has imagic flag set. Clear<y>? yes
Inode 8355 has a extra size (62017) which is invalid
Fix<y>? yes
Inode 8355 has compression flag set on filesystem without compression
support. Clear<y>? yes
Inode 8355 has a bad extended attribute block 2170352193. Clear<y>? yes
Inode 8355 has INDEX_FL flag set but is not a directory.
Clear HTree index<y>? yes
Inode 8355, i_size is 9321591691907232321, should be 0. Fix<y>? yes
Inode 8355, i_blocks is 266363157148225, should be 0. Fix<y>? yes
Inode 8356 is in use, but has dtime set. Fix<y>? yes
Inode 8356 has imagic flag set. Clear<y>? yes
Inode 8356 has a extra size (62017) which is invalid
Fix<y>? yes
Inode 8356 has compression flag set on filesystem without compression
support. Clear<y>? yes
Inode 8356 has a bad extended attribute block 2170352193. Clear<y>? yes
Inode 8356 has INDEX_FL flag set but is not a directory.
Clear HTree index<y>? yes
Inode 8356, i_size is 9321591691907232321, should be 0. Fix<y>? yes
Inode 8356, i_blocks is 266363157148225, should be 0. Fix<y>? yes
Inode 8357 is in use, but has dtime set. Fix<y>? yes
Inode 8357 has imagic flag set. Clear<y>? yes
Inode 8357 has a extra size (62017) which is invalid
Fix<y>? yes
Inode 8357 has compression flag set on filesystem without compression
support. Clear<y>? yes
Inode 8357 has a bad extended attribute block 2170352193. Clear<y>? yes
Inode 8357 has INDEX_FL flag set but is not a directory.
Clear HTree index<y>? yes
> I see from the dumpe2fs that you sent it had only been in use for a
> week. How were you using the filesystem? Did you try using the
> online resize feature at any time?
No. The filesystem was used to store rsync snapshots of other file systems
(using the hard link feature). I had only rsynced the initial data and run a
couple of rsync backups on to it. The filesystem was created using:
mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0
> The problem is that any number of things could have caused the block
> group descriptors to be corrupted.
Oh dear. The system has ECC RAM (though Linux doesn't know about it, so it
may not be working) and the md device is using 10 drives in RAID5 on a
3ware controller.
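For what it's worth, the -E values in the mkfs command above are
self-consistent with that geometry, assuming a 32KiB md chunk size
(which is what stride=8 implies with 4KiB blocks):

stride       = chunk size / block size = 32KiB / 4KiB = 8
stripe-width = stride * data disks     = 8 * (10 - 1) = 72   (10-drive RAID5)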
Maybe I should force an md RAID5 resync to check the drives agree with each
other.
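Something like this should kick off a consistency check via sysfs (a
sketch; the interface is part of md, but the paths assume the array is md0):

echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat                        # watch progress
cat /sys/block/md0/md/mismatch_cnt      # non-zero afterwards => drives disagreed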
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
Theodore Tso wrote:
> I see from the dumpe2fs that you sent it had only been in use for a
> week. How were you using the filesystem? Did you try using the
> online resize feature at any time?
I assume that this isn't enough to corrupt the filesystem?
[root@xback2 ~]# tune2fs -i -1 /dev/md0
tune2fs 1.41.4 (27-Jan-2009)
Setting interval between checks to 18446744073709465216 seconds
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
On Fri, Apr 17, 2009 at 01:24:16PM +0100, Jeremy Sanders wrote:
> Theodore Tso wrote:
>
> > I see from the dumpe2fs that you sent it had only been in use for a
> > week. How were you using the filesystem? Did you try using the
> > online resize feature at any time?
>
> I assume that this isn't enough to corrupt the filesystem?
>
> [root@xback2 ~]# tune2fs -i -1 /dev/md0
> tune2fs 1.41.4 (27-Jan-2009)
> Setting interval between checks to 18446744073709465216 seconds
No, but it won't do what you want, either. To disable time-based
checks, you should use "tune2fs -i 0 /dev/md0".
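For example (a sketch; -c 0 similarly disables the mount-count-based
check, and tune2fs -l shows both fields):

tune2fs -i 0 -c 0 /dev/md0
tune2fs -l /dev/md0 | grep -E 'Check interval|Maximum mount count'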
Tune2fs should have flagged an error when you specified -1; I'll have
to fix that.
- Ted
Jeremy Sanders wrote:
> Hi - I'm trying out ext4 on a large 8.2 TB software raid device (md). On
> rebooting (cleanly unmounting), I tried an fsck on the device. I get the
> following:
>
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix<y>?
>
> It then finds lots of bad group descriptors.
>
> This is Fedora 10, 2.6.27.21-170.2.56.fc10.x86_64 and
> e2fsprogs-1.41.4-4.fc10.x86_64.
>
Jeremy, if you're willing, could you upgrade to the 2.6.29 kernel that's
in F10 updates-testing? That way the ext4 code is a bit more of a
recent, common codebase. Also, if this is a test fs, re-mkfs'ing from
scratch might not be a bad way to go.
Depending on how hard it is to reproduce, it may also be interesting to
try a filesystem just shy of 8TB (2^31 blocks) in case there is some
32-bit wrap-around there, since you're at 8.2T....
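A sketch of that test, reusing the same options (mkfs.ext4 accepts an
optional blocks-count after the device, so the filesystem can be capped
just below 2^31 4KiB blocks, i.e. just under 8TiB):

mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0 2147480000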
-Eric
Jeremy Sanders wrote:
> Theodore Tso wrote:
...
>> I see from the dumpe2fs that you sent it had only been in use for a
>> week. How were you using the filesystem? Did you try using the
>> online resize feature at any time?
>
> No. The filesystem was used to store rsync snapshots of other file systems
> (using the hard link feature). I had only rsynced the initial data and run a
> couple of rsync backups on to it. The filesystem was created using:
>
> mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0
Can you show us exactly how you're using rsync? Is this with
rdiff-backup or some similar tool?
Thanks,
-Eric
On Fri, 17 Apr 2009, Eric Sandeen wrote:
> Jeremy Sanders wrote:
>> Theodore Tso wrote:
>
> ...
>
>>> I see from the dumpe2fs that you sent it had only been in use for a
>>> week. How were you using the filesystem? Did you try using the
>>> online resize feature at any time?
>>
>> No. The filesystem was used to store rsync snapshots of other file systems
>> (using the hard link feature). I had only rsynced the initial data and run a
>> couple of rsync backups on to it. The filesystem was created using:
>>
>> mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0
>
> Can you show us exactly how you're using rsync? Is this with
> rdiff-backup or some similar tool?
No, plain rsync. We have a script which does something like
rsync -raHSx --stats --whole-file --numeric-ids \
--link-dest=/mnt/username/20090418/ host:/data/username/ \
/mnt/username/20090419/
for a set of users.
This command copies the files from /data/username on host to
/mnt/username/20090419, but creates hard links to the previous copy
(/mnt/username/20090418/) for unchanged files.
It worked fine on ext3, at least for a 2.4TB device.
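For completeness, the surrounding loop is roughly this (a sketch; the
user list and date handling are illustrative, not the exact script):

today=$(date +%Y%m%d)
prev=$(date -d yesterday +%Y%m%d)
for user in $users; do
    rsync -raHSx --stats --whole-file --numeric-ids \
        --link-dest=/mnt/$user/$prev/ \
        host:/data/$user/ /mnt/$user/$today/
done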
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
Eric Sandeen wrote:
> Jeremy, if you're willing, could you upgrade to the 2.6.29 kernel that's
> in F10 updates-testing? That way the ext4 code is a bit more of a
> recent, common codebase. Also, if this is a test fs, re-mkfs'ing from
> scratch might not be a bad way to go.
>
> Depending on how hard it is to reproduce, it may also be interesting to
> try a filesystem just shy of 8TB (2^31 blocks) in case there is some
> 32-bit wrap-around there, since you're at 8.2T....
I wasn't able to trivially reproduce the problem with the old kernel, but I
updated to 2.6.29.1-30.fc10.x86_64 in updates-testing. This introduced some
further problems: a USB issue and some sort of stack dump, probably
associated with the r8169 driver (see bugzilla).
However, the system seems to mostly work, so I recreated the ext4 device,
I've just run my backup script again and fsck'd the device. It seems the
problem is reproducible with the new kernel:
[root@xback2 ~]# fsck /dev/md0
fsck 1.41.4 (27-Jan-2009)
e2fsck 1.41.4 (27-Jan-2009)
fsck.ext4: Group descriptors look bad... trying backup blocks...
Group descriptor 0 checksum is invalid. Fix<y>?
Looks like there's a real problem in ext4 causing this under certain
circumstances (unless an obscure hardware error is somehow giving the same
problem).
To cause this, all I did was rsync a set of directories to the disk. No hard
link trees were created.
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
On Mon, Apr 20, 2009 at 10:33:09AM +0100, Jeremy Sanders wrote:
>
> However, the system seems to mostly work, so I recreated the ext4 device,
> I've just run my backup script again and fsck'd the device. It seems the
> problem is reproducible with the new kernel:
When you say reproducible, how many times have you tried it, and were
you able to reproduce it every single time? 50% of the time? I do
believe there is a problem, but we haven't been able to come up with
something where it's easily reproducible. So if you can easily reproduce this,
this is definitely very exciting.
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix<y>?
Do you have to reboot to see this, or is it enough to unmount the
filesystem? How big is the ext4 filesystem, and how big was the
amount of data that you rsync'ed? One thing that would be worth
trying if you can easily reproduce is whether it happens on a single
device disk, or whether it only shows up when you use a /dev/mdX
device.
Thanks,
- Ted
On Mon, 20 Apr 2009, Theodore Tso wrote:
> On Mon, Apr 20, 2009 at 10:33:09AM +0100, Jeremy Sanders wrote:
>>
>> However, the system seems to mostly work, so I recreated the ext4 device,
>> I've just run my backup script again and fsck'd the device. It seems the
>> problem is reproducible with the new kernel:
>
> When you say reproducible, how many times have you tried it, and were
> you able to reproduce it every single time? 50% of the time? I do
> believe there is a problem, but we haven't been able to come up with
> something where it's easily reproducible. So if you can easily reproduce this,
> this is definitely very exciting.
It takes a day or two to do the sync. I've only done it twice (once with
the old kernel, once with the new Fedora testing kernel) and it happened
both times. I'm afraid the statistics are based on rather low numbers here.
I did a different faster test (just copying my home directory lots of
times), but I wasn't able to get it to fail. That test didn't use much
disk space, however. Maybe it's worth just dd'ing a few TB of data onto
the device and seeing whether that fails.
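Something like this would do that, writing throwaway files into the
mounted filesystem rather than over the raw device (a sketch; mount
point and sizes are illustrative):

for i in $(seq 1 25); do
    dd if=/dev/zero of=/mnt/test/fill.$i bs=1M count=102400  # 100GiB each, ~2.5TB total
done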
>> [root@xback2 ~]# fsck /dev/md0
>> fsck 1.41.4 (27-Jan-2009)
>> e2fsck 1.41.4 (27-Jan-2009)
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> Group descriptor 0 checksum is invalid. Fix<y>?
>
> Do you have to reboot to see this, or is it enough to unmount the
> filesystem? How big is the ext4 filesystem, and how big was the
> amount of data that you rsync'ed? One thing that would be worth
> trying if you can easily reproduce is whether it happens on a single
> device disk, or whether it only shows up when you use a /dev/mdX
> device.
I didn't reboot this time - I did last time. I just unmounted the file
system and fsckd it. The filesystem is 8.2TB and the data is around 2.5TB.
The drives are on a 3ware card, so I could configure the card as a single
raid5 device and try to reproduce it there. It may take a day or two to
copy the data if I try this.
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
> It takes a day or two to do the sync. I've only done it twice (once with
> the old kernel, once with the new Fedora testing kernel) and it happened
> both times. I'm afraid the statistics are based on rather low numbers here.
>
> I did a different faster test (just copying my home directory lots of
> times), but I wasn't able to get it to fail. That test didn't use much
> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
> the device and seeing whether that fails.
>
> I didn't reboot this time - I did last time. I just unmounted the file
> system and fsckd it. The filesystem is 8.2TB and the data is around
> 2.5TB.
That's useful data. I wish we could make it fail more quickly
on a smaller rsync, but the fact that you didn't need to reboot is
definitely useful information.
And this is a fresh rsync so no files were being deleted, rsync should
have just been writing new files to .filename.XXXXX and then renaming
them to filename when it is done, right?
OK, let me think about this a little. I think we can create a patch
which checks for writes to the block group descriptors and dumps a
stack trace. That would allow us to catch the failing code in question
in the act, and maybe figure out what is going on.
- Ted
On Mon, 20 Apr 2009, Theodore Tso wrote:
> That's useful data. I wish we could make it fail more quickly
> on a smaller rsync, but the fact that you didn't need to reboot is
> definitely useful information.
>
> And this is a fresh rsync so no files were being deleted, rsync should
> have just been writing new files to .filename.XXXXX and then renaming
> them to filename when it is done, right?
That's what I'd guess. It was onto a clean filesystem, so there shouldn't
be any deletions.
> OK, let me think about this a little. I think we can create a patch
> which checks for writes to the block group descriptors and dumps a
> stack trace. That would allow us to catch the failing code in question
> in the act, and maybe figure out what is going on.
Ok.
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
Theodore Tso wrote:
> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>> It takes a day or two to do the sync. I've only done it twice (once with
>> the old kernel, once with the new Fedora testing kernel) and it happened
>> both times. I'm afraid the statistics are based on rather low numbers here.
>>
>> I did a different faster test (just copying my home directory lots of
>> times), but I wasn't able to get it to fail. That test didn't use much
>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>> the device and seeing whether that fails.
>>
>> I didn't reboot this time - I did last time. I just unmounted the file
>> system and fsckd it. The filesystem is 8.2TB and the data is around
>> 2.5TB.
I think trying a filesystem with just under 8T would be a useful test too.
> That's useful data. I wish we could make it fail more quickly
> on a smaller rsync, but the fact that you didn't need to reboot is
> definitely useful information.
>
> And this is a fresh rsync so no files were being deleted, rsync should
> have just been writing new files to .filename.XXXXX and then renaming
> them to filename when it is done, right?
>
> OK, let me think about this a little. I think we can create a patch
> which checks for writes to the block group descriptors and dumps a
> stack trace. That would allow us to catch the failing code in question
> in the act, and maybe figure out what is going on.
XFS has block-zero tests, because there was once a bug where
uninitialized block numbers in buffers were clobbering the superblock at
block 0. It was helpful, so I think this is a good idea, Ted.
-Eric
Eric Sandeen wrote:
> Theodore Tso wrote:
>> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>>> It takes a day or two to do the sync. I've only done it twice (once with
>>> the old kernel, once with the new Fedora testing kernel) and it happened
>>> both times. I'm afraid the statistics are based on rather low numbers here.
>>>
>>> I did a different faster test (just copying my home directory lots of
>>> times), but I wasn't able to get it to fail. That test didn't use much
>>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>>> the device and seeing whether that fails.
>>>
>>> I didn't reboot this time - I did last time. I just unmounted the file
>>> system and fsckd it. The filesystem is 8.2TB and the data is around
>>> 2.5TB.
>
> I think trying a filesystem with just under 8T would be a useful test too.
One other question - do you make use of xattrs on this filesystem?
In case it's not obvious, we are very interested in this reproducible
testcase; thank you for being so willing to provide feedback and
testing...
-Eric
On Mon, 20 Apr 2009, Eric Sandeen wrote:
> Eric Sandeen wrote:
>> Theodore Tso wrote:
>>> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>>>> It takes a day or two to do the sync. I've only done it twice (once with
>>>> the old kernel, once with the new Fedora testing kernel) and it happened
>>>> both times. I'm afraid the statistics are based on rather low numbers here.
>>>>
>>>> I did a different faster test (just copying my home directory lots of
>>>> times), but I wasn't able to get it to fail. That test didn't use much
>>>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>>>> the device and seeing whether that fails.
>>>>
>>>> I didn't reboot this time - I did last time. I just unmounted the file
>>>> system and fsckd it. The filesystem is 8.2TB and the data is around
>>>> 2.5TB.
>>
>> I think trying a filesystem with just under 8T would be a useful test too.
>
> One other question - do you make use of xattrs on this filesystem?
No.
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
Jeremy Sanders wrote:
> On Mon, 20 Apr 2009, Eric Sandeen wrote:
...
>> One other question - do you make use of xattrs on this filesystem?
>
> No.
I've commandeered about 10T of disk space to see if I can hit this.
Would you mind providing dumpe2fs -h output for your 8.2T filesystem so
I can exactly replicate the geometry?
Thanks,
-Eric
On Mon, 20 Apr 2009, Eric Sandeen wrote:
> Jeremy Sanders wrote:
>> On Mon, 20 Apr 2009, Eric Sandeen wrote:
> ...
>
>>> One other question - do you make use of xattrs on this filesystem?
>>
>> No.
>
> I've commandeered about 10T of disk space to see if I can hit this.
> Would you mind providing dumpe2fs -h output for your 8.2T filesystem so
> I can exactly replicate the geometry?
I formatted with
mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0
[root@xback2 ~]# dumpe2fs -h /dev/md0
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 34fefacb-0494-4df7-b189-e11b2064dd90
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 549314560
Block count: 2197239840
Reserved block count: 0
Free blocks: 2162717221
Free inodes: 549314549
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 500
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
RAID stride: 8
RAID stripe width: 72
Flex block group size: 16
Filesystem created: Mon Apr 20 17:29:14 2009
Last mount time: n/a
Last write time: Mon Apr 20 17:38:28 2009
Mount count: 0
Maximum mount count: 38
Last checked: Mon Apr 20 17:29:14 2009
Check interval: 15552000 (6 months)
Next check after: Sat Oct 17 17:29:14 2009
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 06d43af3-a75c-405a-8f25-e51517dae7f6
Journal backup: inode blocks
Journal size: 128M
On Apr 20, 2009 16:53 +0100, Jeremy Sanders wrote:
> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> Theodore Tso wrote:
>>>> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>>>>> It takes a day or two to do the sync. I've only done it twice (once with
>>>>> the old kernel, once with the new Fedora testing kernel) and it happened
>>>>> both times. I'm afraid the statistics are based on rather low numbers here.
>>>>>
>>>>> I did a different faster test (just copying my home directory lots of
>>>>> times), but I wasn't able to get it to fail. That test didn't use much
>>>>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>>>>> the device and seeing whether that fails.
>>>>>
>>>>> I didn't reboot this time - I did last time. I just unmounted the file
>>>>> system and fsckd it. The filesystem is 8.2TB and the data is around
>>>>> 2.5TB.
>>>
>>> I think trying a filesystem with just under 8T would be a useful test too.
>>
>> One other question - do you make use of xattrs on this filesystem?
>
> No.
If you use anything like SELinux or ACLs you would also (indirectly) be
using xattrs.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Mon, 20 Apr 2009, Andreas Dilger wrote:
> On Apr 20, 2009 16:53 +0100, Jeremy Sanders wrote:
>> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>>> One other question - do you make use of xattrs on this filesystem?
>>
>> No.
>
> If you use anything like SELinux or ACLs you would also (indirectly) be
> using xattrs.
SELinux is switched off and we haven't (knowingly) been using xattrs, but
I remember rsync might copy xattrs, so perhaps they get written in
some way...
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
On Apr 20, 2009 19:55 +0100, Jeremy Sanders wrote:
> On Mon, 20 Apr 2009, Andreas Dilger wrote:
>> On Apr 20, 2009 16:53 +0100, Jeremy Sanders wrote:
>>> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>>>> One other question - do you make use of xattrs on this filesystem?
>>>
>>> No.
>>
>> If you use anything like SELinux or ACLs you would also (indirectly) be
>> using xattrs.
>
> SELinux is switched off and we haven't (knowingly) been using xattrs, but
> I remember rsync might copy xattrs, so perhaps they get written in
> some way...
You can check this with:
debugfs -c -R "stat {path to file inside filesystem}" /dev/XXX
and check if the "File ACL" field is non-zero:
debugfs -c -R "stat etc/hosts" /dev/sda2
debugfs 1.40.11.sun1 (17-June-2008)
/dev/sda2: catastrophic mode - not reading inode or group bitmaps
Inode: 259128 Type: regular Mode: 0644 Flags: 0x0 Generation: 2075236634
User: 0 Group: 0 Size: 2258
File ACL: 0 Directory ACL: 0
^^^^^^^^^^^ ##### this would be non-zero #####
Links: 2 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x49812ef4 -- Wed Jan 28 21:22:12 2009
atime: 0x49ebdce3 -- Sun Apr 19 20:24:35 2009
mtime: 0x49812ef4 -- Wed Jan 28 21:22:12 2009
Size of extra inode fields: 4
Inode version: 0
BLOCKS:
(0):534546
TOTAL: 1
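To sweep a whole tree from userspace instead, something like this
should also work (a sketch; getfattr is in the attr package, and
"-m -" matches all attribute names):

getfattr -R -d -m - /mnt 2>/dev/null | less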
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Jeremy Sanders <[email protected]> writes:
> However, the system seems to mostly work, so I recreated the ext4 device,
> I've just run my backup script again and fsck'd the device. It seems the
> problem is reproducible with the new kernel:
>
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix<y>?
>
> Looks like there's a real problem in ext4 causing this under certain
> circumstances (unless an obscure hardware error is somehow giving the same
> problem).
>
> To cause this, all I did was rsync a set of directories to the disk. No hard
> link trees were created.
For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
On reboot, I got such errors.
The HD was partitioned (all ext4) as:
/ (5GB) | /usr (20GB) | /pub (1.5TB)
The smaller system filesystems didn't see those errors.
Thierry Vignaud wrote:
> Jeremy Sanders <[email protected]> writes:
>
>> However, the system seems to mostly work, so I recreated the ext4 device,
>> I've just run my backup script again and fsck'd the device. It seems the
>> problem is reproducible with the new kernel:
>>
>> [root@xback2 ~]# fsck /dev/md0
>> fsck 1.41.4 (27-Jan-2009)
>> e2fsck 1.41.4 (27-Jan-2009)
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> Group descriptor 0 checksum is invalid. Fix<y>?
>>
>> Looks like there's a real problem in ext4 causing this under certain
>> circumstances (unless an obscure hardware error is somehow giving the same
>> problem).
>>
>> To cause this, all I did was rsync a set of directories to the disk. No hard
>> link trees were created.
>
> For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
> 1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
> On reboot, I got such errors.
> The HD was partitioned (all ext4) as:
> / (5GB) | /usr (20GB) | /pub (1.5TB)
>
> The smaller system filesystems didn't see those errors.
Can you provide a little more info on how you copied the 20GB, and
exactly what the errors were?
Thanks,
-Eric
Thierry Vignaud wrote:
> Eric Sandeen <[email protected]> writes:
>
>>>> However, the system seems to mostly work, so I recreated the ext4 device,
>>>> I've just run my backup script again and fsck'd the device. It seems the
>>>> problem is reproducible with the new kernel:
>>>>
>>>> [root@xback2 ~]# fsck /dev/md0
>>>> fsck 1.41.4 (27-Jan-2009)
>>>> e2fsck 1.41.4 (27-Jan-2009)
>>>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>>>> Group descriptor 0 checksum is invalid. Fix<y>?
>>>>
>>>> Looks like there's a real problem in ext4 causing this under certain
>>>> circumstances (unless an obscure hardware error is somehow giving the same
>>>> problem).
>>>>
>>>> To cause this, all I did was rsync a set of directories to the disk. No hard
>>>> link trees were created.
>>> For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
>>> 1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
>>> On reboot, I got such errors.
>>> The HD was partitioned (all ext4) as:
>>> / (5GB) | /usr (20GB) | /pub (1.5TB)
>>>
>>> The smaller system filesystems didn't see those errors.
>> Can you provide a little more info on how you copied the 20GB, and
>> exactly what the errors were?
>
> I just copied some files from a USB hard disc with cp onto the big
> partition (the one that showed the issues).
> The other system partitions (which showed _no_ problems) were filled with
> something like "rsync -rvltpx / /where/it/was/mounted"
>
> Here's the fsck log:
>
>
>-----------------------------------------------------------------------
Wow, awful.
Could you send me dumpe2fs -h output of the large target device, as well
as an "e2image -r" image of the source filesystem? That way I can
hopefully perfectly replicate your target filesystem as well as the data
you're using to populate it, try the cp myself, and see if I hit the
same thing.
e2image only sends metadata information, not data. If you are concerned
about filenames, use -s to scramble them, though this *might* impact my
ability to reproduce it...
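The usual way to generate one is along these lines (a sketch;
substitute the real source device, and compress it, since the raw
image is sparse):

e2image -r /dev/sdXN - | bzip2 > fs.e2i.bz2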
Thanks,
-Eric
On Tue, Apr 21, 2009 at 05:14:40PM +0200, Thierry Vignaud wrote:
> For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
> 1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
> On reboot, I got such errors.
> The HD was partitioned (all ext4) as:
> / (5GB) | /usr (20GB) | /pub (1.5TB)
Thierry, are you willing to try to see if you can get a reliable
reproduction case? That's what we need, very badly. The fact that
you only copied 20GB is very good; better than 2TB. If you can
reliably reproduce the failure 2 or 3 times, can you give us exact
reproduction instructions? That would be extremely useful.
Thanks in advance,
- Ted
Eric Sandeen <[email protected]> writes:
> Could you send me dumpe2fs -h output of the large target device, as
> well as an "e2image -r" image of the source filesystem? That way I
> can hopefully perfectly replicate your target filesystem as well as
> the data you're using to populate it, try the cp myself, and see if I
> hit the same thing.
>
> e2image only sends metadata information, not data. If you are
> concerned about filenames, use -s to scramble them, though this
> *might* impact my ability to reproduce it...
I'll do that (the disk's at home).
Filesystems were formatted with standard mkfs.ext4 (some were formatted
with mkfs.ext4 -F, which is what diskdrake defaults to), thus using the
standard /etc/mke2fs.conf.
Theodore Tso wrote:
> Do you have to reboot to see this, or is it enough to unmount the
> filesystem? How big is the ext4 filesystem, and how big was the
> amount of data that you rsync'ed? One thing that would be worth
> trying if you can easily reproduce is whether it happens on a single
> device disk, or whether it only shows up when you use a /dev/mdX
> device.
I've been able to reproduce it on a single device disk (it was partitioned
to have the same number of blocks as the md device).
Jeremy
--
Jeremy Sanders <[email protected]> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053