2009-08-31 16:32:48

by Ric Wheeler

Subject: large file system & high object count testing


We have put together a very large, relatively slow JBOD to test
scalability with (big server, 40GB of DRAM, 8 CPUs + 4 SAS expansion
shelves, each with 16 2TB WD S-ATA drives).

In all, this is pulled together with DM (striped) to give us a bit over
116TB.

Testing was done on 2.6.31-rc6 along with the "pu" branch of e2fsprogs.

Everything went well until after the fsck - I think that I have
reproduced that earlier issue with a failed mount.

mkfs took a very long time - longer than fsck. fsck (with around 500
million 20KB files) finished in just under 2 hours.

logs below,

ric


[[email protected] e2fsprogs]# time /sbin/mkfs.ext4
/dev/vg_wdc_disks/lv_wdc_disks
mke2fs 1.41.8 (20-Jul-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
4287627264 inodes, 31138512896 blocks
1556925644 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=33285996544
950272 block groups
32768 blocks per group, 32768 fragments per group
4512 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000, 3855122432, 5804752896, 12800000000, 17414258688,
26985857024

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 38 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

real 230m6.362s
user 2m30.844s
sys 200m1.002s
[[email protected] e2fsprogs]# mount /dev/vg_wdc_disks/lv_wdc_disks /test_fs/
[[email protected] e2fsprogs]# df -H /test_fs/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_wdc_disks-lv_wdc_disks
127T 256M 121T 1% /test_fs

FSCK time:

[[email protected] e2fsck]# time ./e2fsck -f -tt /dev/vg_wdc_disks/lv_wdc_disks
e2fsck 1.41.8 (20-Jul-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 1280k/18014398508273796k (1130k/151k), time:
4630.05/780.40/3580.01
Pass 1: I/O read: 126019MB, write: 0MB, rate: 27.22MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 1280k/18014398508921888k (65k/1216k), time:
1215.10/454.21/705.79
Pass 2: I/O read: 34221MB, write: 0MB, rate: 28.16MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 1280k/18014398509445284k (65k/1216k), time:
5884.30/1263.59/4295.71
Pass 3A: Memory used: 1280k/18014398509445284k (65k/1216k), time: 0.00/
0.00/ 0.00
Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 3: Memory used: 1280k/18014398508921888k (56k/1225k), time: 1.49/
0.33/ 1.14
Pass 3: I/O read: 1MB, write: 0MB, rate: 0.67MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 1280k/724124k (56k/1225k), time: 91.59/89.70/ 1.88
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Pass 5: Memory used: 312k/200728k (56k/257k), time: 685.24/170.49/73.72
Pass 5: I/O read: 713MB, write: 0MB, rate: 1.04MB/s
/dev/vg_wdc_disks/lv_wdc_disks: 516142418/4287627264 files (0.0%
non-contiguous), 2859838991/31138512896 blocks
Memory used: 312k/200728k (56k/257k), time: 6679.27/1541.45/4371.67
I/O read: 161012MB, write: 1MB, rate: 24.11MB/s

real 112m14.925s
user 25m41.557s
sys 73m46.849s


REMOUNT:

[[email protected] e2fsck]# mount /dev/vg_wdc_disks/lv_wdc_disks /test_fs/
mount: wrong fs type, bad option, bad superblock on
/dev/mapper/vg_wdc_disks-lv_wdc_disks,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

[[email protected] ~]# tail -20 /var/log/messages
<snip>
Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
corrupted!






2009-08-31 17:00:55

by Ric Wheeler

Subject: Re: large file system & high object count testing

On 08/31/2009 12:34 PM, Ric Wheeler wrote:
>
> We have put together a very large, relatively slow JBOD to test
> scalability with (big server, 40GB of DRAM, 8 CPU's + 4 SAS expansion
> shelves, each with 16 2TB WD S-ATA drives).
>
> In all, this is pulled together with DM (striped) to give us a bit
> over 116TB.
>
> Testing was done on 2.6.31-rc6 along with the "pu" branch of e2fsprogs.
>
> Everything went well until after the fsck - I think that I have
> reproduced that earlier issue with a failed mount.
>
> mkfs took a very long time - longer than fsck. fsck (with around 500
> million 20KB files) finished in just under 2 hours.
>
> logs below,
>
> ric


One more note - this file system was filled using fs_mark, but without
doing any fsync() calls.

The unmount took several minutes (which I did not time), but the
following was logged during that:

Mount:

Aug 28 23:46:14 megadeth kernel: EXT4-fs (dm-75): barriers enabled
Aug 28 23:46:14 megadeth kernel: EXT4-fs (dm-75): internal journal on
dm-75:8
Aug 28 23:46:14 megadeth kernel: EXT4-fs (dm-75): delayed allocation enabled
Aug 28 23:46:14 megadeth kernel: EXT4-fs: file extents enabled
Aug 28 23:46:21 megadeth kernel: EXT4-fs: mballoc enabled
Aug 28 23:46:21 megadeth kernel: EXT4-fs (dm-75): mounted filesystem
with ordered data mode

umount:

Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 2580708130 blocks
516141626 reqs (511081408 success)
Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 5060218 extents
scanned, 0 goal hits, 5060218 2^N hits, 0 breaks, 0 lost
Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 85164 generated and
it took 471527376
Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 2590831616
preallocated, 10120312 discarded

Mount after fsck:
Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
corrupted!

The MBALLOC messages are a bit worrying - what exactly gets discarded
during an unmount?

ric

>
>
> [[email protected] e2fsprogs]# time /sbin/mkfs.ext4
> /dev/vg_wdc_disks/lv_wdc_disks
> mke2fs 1.41.8 (20-Jul-2009)
> Filesystem label=
> OS type: Linux
> Block size=4096 (log=2)
> Fragment size=4096 (log=2)
> 4287627264 inodes, 31138512896 blocks
> 1556925644 blocks (5.00%) reserved for the super user
> First data block=0
> Maximum filesystem blocks=33285996544
> 950272 block groups
> 32768 blocks per group, 32768 fragments per group
> 4512 inodes per group
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
> 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> 2560000000, 3855122432, 5804752896, 12800000000, 17414258688,
> 26985857024
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
>
> This filesystem will be automatically checked every 38 mounts or
> 180 days, whichever comes first. Use tune2fs -c or -i to override.
>
> real 230m6.362s
> user 2m30.844s
> sys 200m1.002s
> [[email protected] e2fsprogs]# mount /dev/vg_wdc_disks/lv_wdc_disks /test_fs/
> [[email protected] e2fsprogs]# df -H /test_fs/
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/vg_wdc_disks-lv_wdc_disks
> 127T 256M 121T 1% /test_fs
>
> FSCK time:
>
> [[email protected] e2fsck]# time ./e2fsck -f -tt
> /dev/vg_wdc_disks/lv_wdc_disks
> e2fsck 1.41.8 (20-Jul-2009)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 1: Memory used: 1280k/18014398508273796k (1130k/151k), time:
> 4630.05/780.40/3580.01
> Pass 1: I/O read: 126019MB, write: 0MB, rate: 27.22MB/s
> Pass 2: Checking directory structure
> Pass 2: Memory used: 1280k/18014398508921888k (65k/1216k), time:
> 1215.10/454.21/705.79
> Pass 2: I/O read: 34221MB, write: 0MB, rate: 28.16MB/s
> Pass 3: Checking directory connectivity
> Peak memory: Memory used: 1280k/18014398509445284k (65k/1216k), time:
> 5884.30/1263.59/4295.71
> Pass 3A: Memory used: 1280k/18014398509445284k (65k/1216k), time:
> 0.00/ 0.00/ 0.00
> Pass 3A: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
> Pass 3: Memory used: 1280k/18014398508921888k (56k/1225k), time:
> 1.49/ 0.33/ 1.14
> Pass 3: I/O read: 1MB, write: 0MB, rate: 0.67MB/s
> Pass 4: Checking reference counts
> Pass 4: Memory used: 1280k/724124k (56k/1225k), time: 91.59/89.70/ 1.88
> Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
> Pass 5: Checking group summary information
> Pass 5: Memory used: 312k/200728k (56k/257k), time: 685.24/170.49/73.72
> Pass 5: I/O read: 713MB, write: 0MB, rate: 1.04MB/s
> /dev/vg_wdc_disks/lv_wdc_disks: 516142418/4287627264 files (0.0%
> non-contiguous), 2859838991/31138512896 blocks
> Memory used: 312k/200728k (56k/257k), time: 6679.27/1541.45/4371.67
> I/O read: 161012MB, write: 1MB, rate: 24.11MB/s
>
> real 112m14.925s
> user 25m41.557s
> sys 73m46.849s
>
>
> REMOUNT:
>
> [[email protected] e2fsck]# mount /dev/vg_wdc_disks/lv_wdc_disks /test_fs/
> mount: wrong fs type, bad option, bad superblock on
> /dev/mapper/vg_wdc_disks-lv_wdc_disks,
> missing codepage or helper program, or other error
> In some cases useful info is found in syslog - try
> dmesg | tail or so
>
> [[email protected] ~]# tail -20 /var/log/messages
> <snip>
> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
> ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
> corrupted!
>
>
>
>


2009-08-31 20:19:47

by Andreas Dilger

Subject: Re: large file system & high object count testing

On Aug 31, 2009 12:34 -0400, Ric Wheeler wrote:
> We have put together a very large, relatively slow JBOD to test
> scalability with (big server, 40GB of DRAM, 8 CPU's + 4 SAS expansion
> shelves, each with 16 2TB WD S-ATA drives).
>
> In all, this is pulled together with DM (striped) to give us a bit over
> 116TB.
>
> Testing was done on 2.6.31-rc6 along with the "pu" branch of e2fsprogs.
>
> Everything went well until after the fsck - I think that I have
> reproduced that earlier issue with a failed mount.
>
> mkfs took a very long time - longer than fsck. fsck (with around 500
> million 20KB files) finished in just under 2 hours.

Fixing the kernel to do the "safe zeroing of inode table blocks" would
allow mke2fs to be MUCH faster than it is today...

> real 230m6.362s
> user 2m30.844s
> sys 200m1.002s

Ouch, 4h is a long time, but hopefully not many people have to reformat
their 120TB filesystem on a regular basis.

> [[email protected] e2fsck]# time ./e2fsck -f -tt /dev/vg_wdc_disks/lv_wdc_disks
> e2fsck 1.41.8 (20-Jul-2009)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 1: Memory used: 1280k/18014398508273796k (1130k/151k), time:
> 4630.05/780.40/3580.01

Sigh, we need better memory accounting in e2fsck. Rather than depending
on the VM/glibc to track that for us, how hard would it be to just add
a counter into e2fsck_{get,free,resize}_mem() to track this?

> REMOUNT:
>
> [[email protected] e2fsck]# mount /dev/vg_wdc_disks/lv_wdc_disks /test_fs/
> mount: wrong fs type, bad option, bad superblock on
> /dev/mapper/vg_wdc_disks-lv_wdc_disks,
> missing codepage or helper program, or other error
> In some cases useful info is found in syslog - try
> dmesg | tail or so
>
> [[email protected] ~]# tail -20 /var/log/messages
> <snip>
> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
> ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
> corrupted!

Hmm, is e2fsck computing the 64-byte group descriptor checksum differently
than the kernel? Can we dump the group descriptors before and after the
e2fsck run to see whether they have been modified without any messages to
the console?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-31 20:56:04

by Andreas Dilger

Subject: Re: large file system & high object count testing

On Aug 31, 2009 13:02 -0400, Ric Wheeler wrote:
> One more note - this file system was filled using fs_mark, but without
> doing any fsync() calls.
>
> umount:
>
> Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 2580708130 blocks
> 516141626 reqs (511081408 success)
> Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 5060218 extents
> scanned, 0 goal hits, 5060218 2^N hits, 0 breaks, 0 lost
> Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 85164 generated and
> it took 471527376
> Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 2590831616
> preallocated, 10120312 discarded
>
> Mount after fsck:
> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
> ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
> corrupted!
>
> The MBALLOC messages are a bit worrying - what exactly gets discarded
> during an unmount?

The in-memory preallocation areas are discarded. This is reporting
that of the 2590M preallocation areas it reserved, only 10M of them
were discarded during the lifetime of the filesystem.

Of the other stats:
- 471 seconds were spent in total generating the 85k buddy bitmaps
(this is done incrementally at runtime)
- 516M calls to mballoc to find a chunk of blocks, 511M calls were able
to find the requested chunk (not surprising given it is a new filesystem,
probably the 5M calls that failed were when the fs was nearly full)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-31 21:00:14

by Ric Wheeler

Subject: Re: large file system & high object count testing

On 08/31/2009 04:19 PM, Andreas Dilger wrote:
> On Aug 31, 2009 12:34 -0400, Ric Wheeler wrote:
>> We have put together a very large, relatively slow JBOD to test
>> scalability with (big server, 40GB of DRAM, 8 CPU's + 4 SAS expansion
>> shelves, each with 16 2TB WD S-ATA drives).
>>
>> In all, this is pulled together with DM (striped) to give us a bit over
>> 116TB.
>>
>> Testing was done on 2.6.31-rc6 along with the "pu" branch of e2fsprogs.
>>
>> Everything went well until after the fsck - I think that I have
>> reproduced that earlier issue with a failed mount.
>>
>> mkfs took a very long time - longer than fsck. fsck (with around 500
>> million 20KB files) finished in just under 2 hours.
>
> Fixing the kernel to do the "safe zeroing of inode table blocks" would
> allow mke2fs to be MUCH faster than it is today...
>
>> real 230m6.362s
>> user 2m30.844s
>> sys 200m1.002s
>
> Ouch, 4h is a long time, but hopefully not many people have to reformat
> their 120TB filesystem on a regular basis.

Seems that it should not take longer than fsck in any case? Might be interesting
to use blktrace/seekwatcher to see if it is thrashing these big, slow drives
around...

>
>> [[email protected] e2fsck]# time ./e2fsck -f -tt /dev/vg_wdc_disks/lv_wdc_disks
>> e2fsck 1.41.8 (20-Jul-2009)
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 1: Memory used: 1280k/18014398508273796k (1130k/151k), time:
>> 4630.05/780.40/3580.01
>
> Sigh, we need better memory accounting in e2fsck. Rather than depending
> on the VM/glibc to track that for us, how hard would it be to just add
> a counter into e2fsck_{get,free,resize}_mem() to track this?

That second number looks like a bug, not a real memory number. The largest
memory allocation I saw while it ran with top was around 6-7GB iirc.

>
>> REMOUNT:
>>
>> [[email protected] e2fsck]# mount /dev/vg_wdc_disks/lv_wdc_disks /test_fs/
>> mount: wrong fs type, bad option, bad superblock on
>> /dev/mapper/vg_wdc_disks-lv_wdc_disks,
>> missing codepage or helper program, or other error
>> In some cases useful info is found in syslog - try
>> dmesg | tail or so
>>
>> [[email protected] ~]# tail -20 /var/log/messages
>> <snip>
>> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
>> ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
>> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
>> corrupted!
>
> Hmm, is e2fsck computing the 64-byte group descriptor checksum differently
> than the kernel? Can we dump the group descriptors before and after the
> e2fsck run to see whether they have been modified without any messages to
> the console?
>
> Cheers, Andreas

I tried to verify that by redoing a shorter run with fs_mark, unmount/remount
(no fsck in the middle).

That file system remounted with no corrupted group descriptors.

Running fsck on it & remounting reproduces the error (although, again, no fixes
reported during the run).

Running fsck on it after the first corruption did indeed fix it & I could remount.

Do you have a specific debugfs/other command I should use to poke at it with?

Thanks!

Ric




2009-08-31 21:01:28

by Ric Wheeler

Subject: Re: large file system & high object count testing

On 08/31/2009 04:56 PM, Andreas Dilger wrote:
> On Aug 31, 2009 13:02 -0400, Ric Wheeler wrote:
>> One more note - this file system was filled using fs_mark, but without
>> doing any fsync() calls.
>>
>> umount:
>>
>> Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 2580708130 blocks
>> 516141626 reqs (511081408 success)
>> Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 5060218 extents
>> scanned, 0 goal hits, 5060218 2^N hits, 0 breaks, 0 lost
>> Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 85164 generated and
>> it took 471527376
>> Aug 31 10:19:27 megadeth kernel: EXT4-fs: mballoc: 2590831616
>> preallocated, 10120312 discarded
>>
>> Mount after fsck:
>> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
>> ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
>> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
>> corrupted!
>>
>> The MBALLOC messages are a bit worrying - what exactly gets discarded
>> during an unmount?
>
> The in-memory preallocation areas are discarded. This is reporting
> that of the 2590M preallocation areas it reserved, only 10M of them
> were discarded during the lifetime of the filesystem.
>
> Of the other stats:
> - 471 seconds were spent in total generating the 85k buddy bitmaps
> (this is done incrementally at runtime)
> - 516M calls to mballoc to find a chunk of blocks, 511M calls were able
> to find the requested chunk (not surprising given it is a new filesystem,
> probably the 5M calls that failed were when the fs was nearly full)
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>

This file system was never more than 7% full - the 511M calls were more or
less one for each of the 20KB files, I guess.

ric


2009-08-31 21:25:07

by Justin Maggard

Subject: Re: large file system & high object count testing

On Aug 31, 2009 13:02 -0400, Ric Wheeler wrote:
> Mount after fsck:
> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
> ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
> corrupted!

Ah, so it's not just me. It looks like you're seeing the exact same
thing I reported a few days ago in the ">16TB issues" thread. You
don't even have to do anything fancy to make this happen. My test
case involves simply creating 5 directories on the newly-created
64-bit filesystem, and running e2fsck on it immediately after
unmounting to get the same results.

-Justin

2009-08-31 22:20:48

by Ric Wheeler

Subject: Re: large file system & high object count testing

On 08/31/2009 05:25 PM, Justin Maggard wrote:
> On Aug 31, 2009 13:02 -0400, Ric Wheeler wrote:
>
>> Mount after fsck:
>> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
>> ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
>> Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
>> corrupted!
>>
> Ah, so it's not just me. It looks like you're seeing the exact same
> thing I reported a few days ago in the ">16TB issues" thread. You
> don't even have to do anything fancy to make this happen. My test
> case involves simply creating 5 directories on the newly-created
> 64-bit filesystem, and running e2fsck on it immediately after
> unmounting to get the same results.
>
> -Justin
>
Much faster than my fill-it-over-the-weekend test :-)

ric


2009-08-31 23:13:34

by Andreas Dilger

Subject: Re: large file system & high object count testing

On Aug 31, 2009 14:25 -0700, Justin Maggard wrote:
> On Aug 31, 2009 13:02 -0400, Ric Wheeler wrote:
> > Mount after fsck:
> > Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75):
> > ext4_check_descriptors: Checksum for group 487 failed (59799!=46827)
> > Aug 31 12:27:12 megadeth kernel: EXT4-fs (dm-75): group descriptors
> > corrupted!
>
> Ah, so it's not just me. It looks like you're seeing the exact same
> thing I reported a few days ago in the ">16TB issues" thread. You
> don't even have to do anything fancy to make this happen. My test
> case involves simply creating 5 directories on the newly-created
> 64-bit filesystem, and running e2fsck on it immediately after
> unmounting to get the same results.

Justin, could you please replicate this corruption, collecting some
additional information before & after. My recollection is that the
corruption appears in the first few groups, so 64kB should be plenty
to capture the group descriptor tables (where the checksum is kept).

- mke2fs
- dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-new.gz
- mkdir ...
- sync
- dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-mkdir.gz
- umount
- dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-umount.gz
- e2fsck
- dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-e2fsck.gz

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-08-31 23:16:43

by Andreas Dilger

Subject: Re: large file system & high object count testing

On Aug 31, 2009 17:01 -0400, Ric Wheeler wrote:
> On 08/31/2009 04:19 PM, Andreas Dilger wrote:
>> Ouch, 4h is a long time, but hopefully not many people have to reformat
>> their 120TB filesystem on a regular basis.
>
> Seems that it should not take longer than fsck in any case? Might be
> interesting to use blktrace/seekwatcher to see if it is thrashing these
> big, slow drives around...

Well, e2fsck + gdt_csum can skip reading large parts of an empty
filesystem, while ironically mke2fs is required to initialize it all.

>>> [[email protected] e2fsck]# time ./e2fsck -f -tt /dev/vg_wdc_disks/lv_wdc_disks
>>> e2fsck 1.41.8 (20-Jul-2009)
>>> Pass 1: Checking inodes, blocks, and sizes
>>> Pass 1: Memory used: 1280k/18014398508273796k (1130k/151k), time:
>>> 4630.05/780.40/3580.01
>>
>> Sigh, we need better memory accounting in e2fsck. Rather than depending
>> on the VM/glibc to track that for us, how hard would it be to just add
>> a counter into e2fsck_{get,free,resize}_mem() to track this?
>
> That second number looks like a bug, not a real memory number. The
> largest memory allocation I saw while it ran with top was around 6-7GB
> iirc.

Sure, it is a 32-bit overflow (which is the most this API can provide),
which is why we should fix it.

>> Hmm, is e2fsck computing the 64-byte group descriptor checksum differently
>> than the kernel? Can we dump the group descriptors before and after the
>> e2fsck run to see whether they have been modified without any messages to
>> the console?
>
> I tried to verify that by redoing a shorter run with fs_mark,
> unmount/remount (no fsck in the middle).
>
> That file system remounted with no corrupted group descriptors.
>
> Running fsck on it & remounting reproduces the error (although, again, no
> fixes reported during the run).
>
> Running fsck on it after the first corruption did indeed fix it & I could remount.
>
> Do you have a specific debugfs/other command I should use to poke at it with?

Getting dumps of the corrupted group descriptors before/after corruption,
to see what the values are, per my other email.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-31 23:43:40

by Justin Maggard

Subject: Re: large file system & high object count testing

On Mon, Aug 31, 2009 at 4:13 PM, Andreas Dilger<[email protected]> wrote:
> Justin, could you please replicate this corruption, collecting some
> additional information before & after. My recollection is that the
> corruption appears in the first few groups, so 64kB should be plenty
> to capture the group descriptor tables (where the checksum is kept).
>
> - mke2fs
> - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-new.gz
> - mkdir ...
> - sync
> - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-mkdir.gz
> - umount
> - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-umount.gz
> - e2fsck
> - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-e2fsck.gz
>

No problem. I just sent you an email with those four attached. If
anyone would like me to upload them somewhere else, just let me know.

-Justin

2009-09-02 09:15:40

by Andreas Dilger

Subject: Re: large file system & high object count testing

On Aug 31, 2009 16:37 -0700, Justin Maggard wrote:
> On Mon, Aug 31, 2009 at 4:13 PM, Andreas Dilger<[email protected]> wrote:
> > Justin, could you please replicate this corruption, collecting some
> > additional information before & after. My recollection is that the
> > corruption appears in the first few groups, so 64kB should be plenty
> > to capture the group descriptor tables (where the checksum is kept).
> >
> > - mke2fs
> > - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-new.gz
> > - mkdir ...
> > - sync
> > - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-mkdir.gz
> > - umount
> > - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-umount.gz
> > - e2fsck
> > - dd if=/dev/XXX bs=4k count=16 | gzip -9 > /tmp/gdt-e2fsck.gz
> >
>
> No problem. I just sent you an email with those four attached. If
> anyone would like me to upload them somewhere else, just let me know.

Comparing the GDT dumps you gave makes it fairly clear what is wrong:

--- gdt-umount.od 2009-09-02 02:54:40.148704651 -0600
+++ gdt-e2fsck.od 2009-09-02 02:54:54.809699151 -0600
001000 00000ee5 00000ef5 00000f05 07f568f0
001010 00040002 00000000 00000000 339107f5
001020 00000000 00000000 00000000 00000000
-*
+001030 00000000 00000000 00000000 000007fe

It seems that e2fsck isn't keeping one of the reserved fields zero,
so this is confusing the checksum.

struct ext4_group_desc
{
	:
	:
/*30*/	__le16	bg_used_dirs_count_hi;	/* Directories count MSB */
	__le16	bg_itable_unused_hi;	/* Unused inodes count MSB */
	__u32	bg_reserved2[3];
};

The bg_reserved2[2] field is being changed incorrectly.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.