2009-07-02 22:23:39

by Justin Maggard

Subject: >16TB issues

I've been toying with ext4 and e2fsprogs pu branch (pulled from git
yesterday) on very large volumes, and I've run into some issues. What
I've found so far with a 19TB MD RAID0 volume, running 2.6.29.4 (I'm
planning on trying 2.6.30 soon):

- mkfs.ext4 *appears* to work fine, reporting no errors. Examining
the superblock info with dumpe2fs -h looks normal -- although I'm
unfamiliar with the "Lifetime writes" field, and I'm not sure why it's at
73GB immediately after doing mkfs, before ever mounting it.

- Immediately running e2fsck on the volume before ever mounting it
will not complete, and results in the following:
# e2fsck -n /dev/md2
e2fsck 1.41.7 (29-June-2009)
Error reading block 2435874816 (Attempt to read block from filesystem
resulted in short read). Ignore error? no
/dev/md2: Attempt to read block from filesystem resulted in short read
while reading block 2435874816
/dev/md2: Attempt to read block from filesystem resulted in short read
reading journal superblock
e2fsck: Attempt to read block from filesystem resulted in short read
while checking ext3 journal for /dev/md2

- Trying to mount normally with no options does not work. The kernel
log contains these messages:
EXT4-fs: barriers enabled
JBD: no valid journal superblock found
EXT4-fs: error loading journal.

- Mounting with -o noload does appear to work, and reading and
writing seems to work fine.

- Setting default mount options with tune2fs works fine, as expected.

- Then, I went on to check out filesystem resizing. I created an LVM
15TB LV, and ran mkfs.ext4 on it. Looking at the superblock info, it
did not contain the 64bit flag, which I assume is expected behavior.
I extended the LV to ~18TB and tried resize2fs, and got this error:
resize2fs: Can't read an block bitmap while trying to resize /dev/data/data0

If there's anything else anyone would have me try, or any patches to
test, just let me know.

Thanks!
-Justin


2009-07-03 14:38:30

by Andreas Dilger

Subject: Re: >16TB issues

On Jul 02, 2009 15:23 -0700, Justin Maggard wrote:
> I've been toying with ext4 and e2fsprogs pu branch (pulled from git
> yesterday) on very large volumes, and I've run into some issues. What
> I've found so far with a 19TB MD RAID0 volume, running 2.6.29.4 (I'm
> planning on trying 2.6.30 soon):
>
> - mkfs.ext4 *appears* to work fine, reporting no errors. Examining
> the superblock info with dumpe2fs -h looks normal -- although I'm
> unfamiliar with the "Lifetime writes" field, and I'm not sure why it's at
> 73GB immediately after doing mkfs, before ever mounting it.
>
> - Immediately running e2fsck on the volume before ever mounting it
> will not complete, and results in the following:
> # e2fsck -n /dev/md2
> e2fsck 1.41.7 (29-June-2009)
> Error reading block 2435874816 (Attempt to read block from filesystem
> resulted in short read). Ignore error? no
> /dev/md2: Attempt to read block from filesystem resulted in short read
> while reading block 2435874816
> /dev/md2: Attempt to read block from filesystem resulted in short read
> reading journal superblock
> e2fsck: Attempt to read block from filesystem resulted in short read
> while checking ext3 journal for /dev/md2

It looks like there may be some problem with the underlying device?
I posted a program here a few months ago called "ll_ver_dev" which
can quickly (or slowly) verify that writes and reads to different
offsets in a block device return consistent data. The quick version
will detect such problems as 32-bit overflows, but if you are having
strange problems you might need to run the full version.
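
For readers who have not seen the tool, the idea described above can be
sketched roughly as follows. This is only an illustration in Python, not
the actual ll_ver_dev program; the helper names (write_stamps,
verify_stamps, demo) and the 8-byte stamp format are assumptions made for
the example. Each chosen offset is stamped with its own value, so a read
that returns some other offset's stamp hints at address wrap-around such
as a 32-bit overflow.

```python
import os
import struct
import tempfile

STAMP = struct.Struct("<Q")  # hypothetical format: the offset as a 64-bit integer


def write_stamps(path, offsets):
    """Write each offset's own value at that offset."""
    with open(path, "r+b") as f:
        for off in offsets:
            f.seek(off)
            f.write(STAMP.pack(off))
        f.flush()
        os.fsync(f.fileno())


def verify_stamps(path, offsets):
    """Return the offsets whose stored stamp does not match the offset."""
    bad = []
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            data = f.read(STAMP.size)
            if len(data) != STAMP.size or STAMP.unpack(data)[0] != off:
                bad.append(off)
    return bad


def demo():
    """Run the check against a small scratch file instead of a real device."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "r+b") as f:
            f.truncate(1 << 20)
        offsets = [0, 4096, (1 << 20) - STAMP.size]
        write_stamps(path, offsets)
        return verify_stamps(path, offsets)
    finally:
        os.remove(path)
```

On a real block device the offsets would be spread across the whole
device, including positions past 2TB and 16TB, and the write pass destroys
whatever data is there.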

You could also try running with a filesystem just under 16TB and
verifying that works.

> - Mounting with -o noload does appear to work, and reading and
> writing seems to work fine.

That's because the journal is not being used, which is what seems to
be having the problem. I wonder if the journal is beyond 8TB or
beyond 16TB for some reason and this is causing grief?

> - Setting default mount options with tune2fs works fine, as expected.
>
> - Then, I went on to check out filesystem resizing. I created an LVM
> 15TB LV, and ran mkfs.ext4 on it. Looking at the superblock info, it
> did not contain the 64bit flag, which I assume is expected behavior.
> I extended the LV to ~18TB and tried resize2fs, and got this error:
> resize2fs: Can't read an block bitmap while trying to resize /dev/data/data0

This is known not to work, AFAIR.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-07-16 18:04:43

by Justin Maggard

Subject: Re: >16TB issues

On Fri, Jul 3, 2009 at 7:38 AM, Andreas Dilger<[email protected]> wrote:
>> - Immediately running e2fsck on the volume before ever mounting it
>> will not complete, and results in the following:
>> # e2fsck -n /dev/md2
>> e2fsck 1.41.7 (29-June-2009)
>> Error reading block 2435874816 (Attempt to read block from filesystem
>> resulted in short read). Ignore error? no
>> /dev/md2: Attempt to read block from filesystem resulted in short read
>> while reading block 2435874816
>> /dev/md2: Attempt to read block from filesystem resulted in short read
>> reading journal superblock
>> e2fsck: Attempt to read block from filesystem resulted in short read
>> while checking ext3 journal for /dev/md2
>
> It looks like there may be some problem with the underlying device?
> I posted a program here a few months ago called "ll_ver_dev" which
> can quickly (or slowly) verify that writes and reads to different
> offsets in a block device return consistent data. The quick version
> will detect such problems as 32-bit overflows, but if you are having
> strange problems you might need to run the full version.
>
> You could also try running with a filesystem just under 16TB and
> verifying that works.
>

Running with a filesystem just under 16TB works fine. Forgive my
ignorance, but for the life of me I couldn't find a reference
anywhere about your "ll_ver_dev" program. But doing dd if=/dev/zero
across the entire ~18TB didn't report any errors, so I believe the
underlying device is in good shape.

Running e2fsck with an external journal did change the behavior
though. Basically it no longer chokes on the journal, but it does
somewhere else:

e2fsck 1.41.8 (11-July-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Error reading block 576192512 (Attempt to read block from filesystem
resulted in short read) while reading inode and block bitmaps. Ignore
error? no

e2fsck: Can't read an block bitmap while retrying to read bitmaps for /dev/md2
e2fsck: aborted

>> - Mounting with -o noload does appear to work, and reading and
>> writing seems to work fine.
>
> That's because the journal is not being used, which is what seems to
> be having the problem. I wonder if the journal is beyond 8TB or
> beyond 16TB for some reason and this is causing grief?
>

Perhaps, but I'm not sure. Using an external journal device also
worked the same as not loading the journal.

-Justin

2009-07-16 18:59:16

by Valerie Aurora

Subject: Re: >16TB issues

On Thu, Jul 16, 2009 at 11:04:41AM -0700, Justin Maggard wrote:
> On Fri, Jul 3, 2009 at 7:38 AM, Andreas Dilger<[email protected]> wrote:
> >> - Immediately running e2fsck on the volume before ever mounting it
> >> will not complete, and results in the following:
> >> # e2fsck -n /dev/md2
> >> e2fsck 1.41.7 (29-June-2009)
> >> Error reading block 2435874816 (Attempt to read block from filesystem
> >> resulted in short read). Ignore error? no
> >> /dev/md2: Attempt to read block from filesystem resulted in short read
> >> while reading block 2435874816
> >> /dev/md2: Attempt to read block from filesystem resulted in short read
> >> reading journal superblock
> >> e2fsck: Attempt to read block from filesystem resulted in short read
> >> while checking ext3 journal for /dev/md2
> >
> > It looks like there may be some problem with the underlying device?
> > I posted a program here a few months ago called "ll_ver_dev" which
> > can quickly (or slowly) verify that writes and reads to different
> > offsets in a block device return consistent data. The quick version
> > will detect such problems as 32-bit overflows, but if you are having
> > strange problems you might need to run the full version.
> >
> > You could also try running with a filesystem just under 16TB and
> > verifying that works.
> >
>
> Running with a filesystem just under 16TB works fine. Forgive my
> ignorance, but for the life of me I couldn't find a reference
> anywhere about your "ll_ver_dev" program. But doing dd if=/dev/zero
> across the entire ~18TB didn't report any errors, so I believe the
> underlying device is in good shape.

Excellent point. You can get the programs from here:

http://valhenson.livejournal.com/38933.html

Please do run llverdev if you have the chance - at this point, we are
stuck trying to figure out how to reproduce this bug.

We really appreciate your testing! This definitely needs to get fixed.

-VAL

2009-07-21 16:10:59

by Andreas Dilger

Subject: Re: >16TB issues

On Jul 16, 2009 11:04 -0700, Justin Maggard wrote:
> On Fri, Jul 3, 2009 at 7:38 AM, Andreas Dilger<[email protected]> wrote:
> >> - Immediately running e2fsck on the volume before ever mounting it
> >> will not complete, and results in the following:
> >> # e2fsck -n /dev/md2
> >> e2fsck 1.41.7 (29-June-2009)
> >> Error reading block 2435874816 (Attempt to read block from filesystem
> >> resulted in short read). Ignore error? no
>
> Error reading block 576192512 (Attempt to read block from filesystem
> resulted in short read) while reading inode and block bitmaps. Ignore
> error? no
>
> e2fsck: Can't read an block bitmap while retrying to read bitmaps for /dev/md2
> e2fsck: aborted

What is very strange here is that the block numbers being reported as
having read errors are not even beyond the 16TB limit. Assuming 4kB blocks:

576192512 * 4kB = 2304770048kB = 2198GB
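
The same conversion can be written out as a small Python sketch, assuming
4 KiB blocks as above. The function name block_to_bytes and the constant
names are made up for the example; the two block numbers are the ones
e2fsck reported earlier in this thread.

```python
BLOCK_SIZE = 4096            # assumed 4 KiB filesystem blocks
LIMIT_16T = 16 * 2**40       # 16 TiB in bytes


def block_to_bytes(block, block_size=BLOCK_SIZE):
    """Byte offset of the start of a filesystem block."""
    return block * block_size


for block in (576192512, 2435874816):
    offset = block_to_bytes(block)
    side = "below" if offset < LIMIT_16T else "beyond"
    print(block, offset // 2**30, "GiB,", side, "16 TiB")
```

Both reported blocks come out below the 16 TiB mark, which is what makes
the read errors surprising.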

Are there error messages in syslog/dmesg when this happens?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-07-21 18:52:36

by Justin Maggard

Subject: Re: >16TB issues

On Tue, Jul 21, 2009 at 9:10 AM, Andreas Dilger<[email protected]> wrote:
>> Error reading block 576192512 (Attempt to read block from filesystem
>> resulted in short read) while reading inode and block bitmaps. Ignore
>> error? no
>>
>> e2fsck: Can't read an block bitmap while retrying to read bitmaps for /dev/md2
>> e2fsck: aborted
>
> What is very strange here is that the block numbers being reported as
> having read errors are not even beyond the 16TB limit. Assuming 4kB blocks:
>
> 576192512 * 4kB = 2304770048kB = 2198GB
>
> Are there error messages in syslog/dmesg when this happens?

No, no error messages from the kernel. But your llverdev utility
ended up showing problems on the device. After asking around on the
MD mailing list, that was apparently because of the page cache index
limit (at the time I was using a 32-bit kernel).
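
As a rough illustration of where that limit sits, here is a Python sketch.
It assumes a 32-bit page index and 4 KiB pages, which is the usual case on
32-bit x86; the function names are invented for the example. It only shows
the arithmetic and does not by itself explain why reads at lower offsets
failed on this particular MD device.

```python
PAGE_SIZE = 4096     # assumed page size (typical for x86)
INDEX_BITS = 32      # assumed width of the page cache index on a 32-bit kernel


def max_page_cache_bytes(index_bits=INDEX_BITS, page_size=PAGE_SIZE):
    """Largest byte range addressable with an index of the given width."""
    return (1 << index_bits) * page_size


def page_index(byte_offset, index_bits=INDEX_BITS, page_size=PAGE_SIZE):
    """Page index as it would look after truncation to index_bits bits."""
    return (byte_offset // page_size) & ((1 << index_bits) - 1)


print(max_page_cache_bytes() // 2**40, "TiB")
# An offset one page past 16 TiB aliases to page index 1.
print(page_index(16 * 2**40 + PAGE_SIZE))
```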

Switching to a 64-bit kernel allowed me to pass the llverdev test and
get much further with a very large filesystem, but I'm running into
other issues now. I wrote up a very simple script to write 2TB files
onto the filesystem until the device fills up. It was able to write
~16TB, but after that it ran into some problems. My kernel log now
has lots of messages like these:
EXT4-fs error (device md2): ext4_mb_generate_buddy: EXT4-fs: group
163548: 32744 blocks in bitmap, 32768 in gd
- and -
EXT4-fs error (device md2): ext4_mb_mark_diskspace_used: Allocating
block 4294967391 in system zone of 131072 group
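
A quick bit of arithmetic on the second message, as a Python sketch. It
assumes 32768 blocks per group (the usual value for 4 KiB blocks, and the
figure shown in the first message); the helper name locate is made up.
This is only arithmetic on the logged numbers, not a diagnosis.

```python
BLOCKS_PER_GROUP = 32768     # assumed, matches "32768 in gd" above


def locate(block, blocks_per_group=BLOCKS_PER_GROUP):
    """Return (group number, offset within that group) for a block."""
    return divmod(block, blocks_per_group)


block = 4294967391
group, offset = locate(block)
print(group, offset)            # group 131072, offset 95
print(block - 2**32)            # 95: the block is just past the 32-bit boundary
print(block & 0xFFFFFFFF)       # 95: what a 32-bit variable would hold
```

Group 131072 is the first group that starts at block 2^32, so the failing
allocation is 95 blocks into the first group that needs more than 32 bits
to address.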

I shouldn't need e2fsprogs to be compiled 64-bit as well, right?
Currently I've got a 64-bit kernel with 32-bit userspace.

-Justin

2009-07-21 18:57:09

by Eric Sandeen

Subject: Re: >16TB issues

Justin Maggard wrote:
> On Tue, Jul 21, 2009 at 9:10 AM, Andreas Dilger<[email protected]> wrote:
>>> Error reading block 576192512 (Attempt to read block from filesystem
>>> resulted in short read) while reading inode and block bitmaps. Ignore
>>> error? no
>>>
>>> e2fsck: Can't read an block bitmap while retrying to read bitmaps for /dev/md2
>>> e2fsck: aborted
>> What is very strange here is that the block numbers being reported as
>> having read errors are not even beyond the 16TB limit. Assuming 4kB blocks:
>>
>> 576192512 * 4kB = 2304770048kB = 2198GB
>>
>> Are there error messages in syslog/dmesg when this happens?
>
> No, no error messages from the kernel. But your llverdev utility
> ended up showing problems on the device. After asking around on the
> MD mailing list, that was apparently because of the page cache index
> limit (at the time I was using a 32-bit kernel).
>
> Switching to a 64-bit kernel allowed me to pass the llverdev test and
> get much further with a very large filesystem, but I'm running into
> other issues now. I wrote up a very simple script to write 2TB files
> onto the filesystem until the device fills up. It was able to write
> ~16TB, but after that it ran into some problems. My kernel log now
> has lots of messages like these:
> EXT4-fs error (device md2): ext4_mb_generate_buddy: EXT4-fs: group
> 163548: 32744 blocks in bitmap, 32768 in gd
> - and -
> EXT4-fs error (device md2): ext4_mb_mark_diskspace_used: Allocating
> block 4294967391 in system zone of 131072 group
>
> I shouldn't need e2fsprogs to be compiled 64-bit as well, right?
> Currently I've got a 64-bit kernel with 32-bit userspace.

It -should- work but it is probably more bug-prone if "unsigned longs"
still lurk.

-Eric

2009-07-21 19:21:54

by Andreas Dilger

Subject: Re: >16TB issues

On Jul 21, 2009 11:52 -0700, Justin Maggard wrote:
> No, no error messages from the kernel. But your llverdev utility
> ended up showing problems on the device. After asking around on the
> MD mailing list, that was apparently because of the page cache index
> limit (at the time I was using a 32-bit kernel).
>
> Switching to a 64-bit kernel allowed me to pass the llverdev test and
> get much further with a very large filesystem, but I'm running into
> other issues now. I wrote up a very simple script to write 2TB files
> onto the filesystem until the device fills up. It was able to write
> ~16TB, but after that it ran into some problems. My kernel log now
> has lots of messages like these:

There is a matching "llverfs" tool that does essentially this, with
data verification also.

> EXT4-fs error (device md2): ext4_mb_generate_buddy: EXT4-fs: group
> 163548: 32744 blocks in bitmap, 32768 in gd
> - and -
> EXT4-fs error (device md2): ext4_mb_mark_diskspace_used: Allocating
> block 4294967391 in system zone of 131072 group
>
> I shouldn't need e2fsprogs to be compiled 64-bit as well, right?
> Currently I've got a 64-bit kernel with 32-bit userspace.

Yes, that is a potential problem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-07-22 22:27:24

by Justin Maggard

Subject: Re: >16TB issues

On Tue, Jul 21, 2009 at 12:21 PM, Andreas Dilger<[email protected]> wrote:
>> I shouldn't need e2fsprogs to be compiled 64-bit as well, right?
>> Currently I've got a 64-bit kernel with 32-bit userspace.
>
> Yes, that is a potential problem.

It looks like it certainly is a problem with current e2fsprogs "pu"
branch. My latest findings from basic testing (just mkfs.ext4, then
e2fsck -fy, and -- if e2fsck modified the filesystem -- another e2fsck
-fy) are as follows:

1) 64-bit mke2fs + 64-bit e2fsck
Appears to work fine. No errors reported anywhere.

2) 64-bit mke2fs + 32-bit e2fsck
Also appears to work fine. llverfs --partial looked okay, and e2fsck
reported no errors.

3) 32-bit mke2fs + 64-bit e2fsck
Mkfs.ext4 must have done something wrong, but e2fsck was able to fix
it up, and future e2fsck (32 or 64-bit) runs reported no issues.
e2fsck output was:
Block bitmap differences: +(1063780365--1063780367)
+(1063780381--1063780383) +(1063782048--1063782431)
-(5359140864--5359173631)
Running mkfs through valgrind doesn't show any obvious errors.

4) 32-bit mke2fs + 32-bit e2fsck
Same as (3), for the mkfs and the first e2fsck again reported fixing
the same block bitmap differences. But after the first e2fsck was
complete, the second e2fsck run reported:
e2fsck: Superblock invalid, trying backup blocks...
Group descriptor 0 checksum is invalid. Fix? yes
Group descriptor 1 checksum is invalid. Fix? yes
...
Group descriptor 81774 checksum is invalid. Fix? yes
followed by tons of block bitmap differences.
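
For anyone who wants to repeat the procedure (mkfs.ext4, e2fsck -fy, and a
second e2fsck -fy only if the first one changed something), here is a
hedged Python sketch. The exit status bits are the ones documented in the
e2fsck(8) man page; the function names and the injectable run parameter
are assumptions made for the example. Note that mkfs.ext4 destroys
whatever is on the device.

```python
import subprocess

# Exit status bits from the e2fsck(8) man page.
FSCK_ERRORS_CORRECTED = 1
FSCK_REBOOT_NEEDED = 2
FSCK_ERRORS_UNCORRECTED = 4
FSCK_OPERATIONAL_ERROR = 8


def modified(status):
    """True if e2fsck reported that it corrected (changed) the filesystem."""
    return bool(status & (FSCK_ERRORS_CORRECTED | FSCK_REBOOT_NEEDED))


def mkfs_then_check(device, run=subprocess.call):
    """Run mkfs.ext4, then e2fsck -fy, then e2fsck -fy again if needed.

    `run` takes an argument list and returns the exit status; it defaults
    to subprocess.call and can be replaced for dry runs.
    """
    results = [("mkfs.ext4", run(["mkfs.ext4", device]))]
    first = run(["e2fsck", "-fy", device])
    results.append(("e2fsck #1", first))
    if modified(first):
        results.append(("e2fsck #2", run(["e2fsck", "-fy", device])))
    return results
```

Running the same function once with 32-bit binaries and once with 64-bit
binaries on the PATH would cover cases (1) and (4); the mixed cases need
the two tools taken from different builds.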

-Justin

2009-07-27 22:03:10

by Valerie Aurora

Subject: Re: >16TB issues

On Wed, Jul 22, 2009 at 03:27:24PM -0700, Justin Maggard wrote:
> On Tue, Jul 21, 2009 at 12:21 PM, Andreas Dilger<[email protected]> wrote:
> >> I shouldn't need e2fsprogs to be compiled 64-bit as well, right?
> >> Currently I've got a 64-bit kernel with 32-bit userspace.
> >
> > Yes, that is a potential problem.
>
> It looks like it certainly is a problem with current e2fsprogs "pu"
> branch. My latest findings from basic testing (just mkfs.ext4, then
> e2fsck -fy, and -- if e2fsck modified the filesystem -- another e2fsck
> -fy) are as follows:
>
> 1) 64-bit mke2fs + 64-bit e2fsck
> Appears to work fine. No errors reported anywhere.
>
> 2) 64-bit mke2fs + 32-bit e2fsck
> Also appears to work fine. llverfs --partial looked okay, and e2fsck
> reported no errors.
>
> 3) 32-bit mke2fs + 64-bit e2fsck
> Mkfs.ext4 must have done something wrong, but e2fsck was able to fix
> it up, and future e2fsck (32 or 64-bit) runs reported no issues.
> e2fsck output was:
> Block bitmap differences: +(1063780365--1063780367)
> +(1063780381--1063780383) +(1063782048--1063782431)
> -(5359140864--5359173631)
> Running mkfs through valgrind doesn't show any obvious errors.
>
> 4) 32-bit mke2fs + 32-bit e2fsck
> Same as (3), for the mkfs and the first e2fsck again reported fixing
> the same block bitmap differences. But after the first e2fsck was
> complete, the second e2fsck run reported:
> e2fsck: Superblock invalid, trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix? yes
> Group descriptor 1 checksum is invalid. Fix? yes
> ...
> Group descriptor 81774 checksum is invalid. Fix? yes
> followed by tons of block bitmap differences.

Great, this is really helpful. I've added it to my todo list.

-VAL

2009-07-30 22:23:51

by Valerie Aurora

Subject: Re: >16TB issues

On Wed, Jul 22, 2009 at 03:27:24PM -0700, Justin Maggard wrote:
> On Tue, Jul 21, 2009 at 12:21 PM, Andreas Dilger<[email protected]> wrote:
> >> I shouldn't need e2fsprogs to be compiled 64-bit as well, right?
> >> Currently I've got a 64-bit kernel with 32-bit userspace.
> >
> > Yes, that is a potential problem.
>
> It looks like it certainly is a problem with current e2fsprogs "pu"
> branch. My latest findings from basic testing (just mkfs.ext4, then
> e2fsck -fy, and -- if e2fsck modified the filesystem -- another e2fsck
> -fy) are as follows:
>
> 1) 64-bit mke2fs + 64-bit e2fsck
> Appears to work fine. No errors reported anywhere.
>
> 2) 64-bit mke2fs + 32-bit e2fsck
> Also appears to work fine. llverfs --partial looked okay, and e2fsck
> reported no errors.
>
> 3) 32-bit mke2fs + 64-bit e2fsck
> Mkfs.ext4 must have done something wrong, but e2fsck was able to fix
> it up, and future e2fsck (32 or 64-bit) runs reported no issues.
> e2fsck output was:
> Block bitmap differences: +(1063780365--1063780367)
> +(1063780381--1063780383) +(1063782048--1063782431)
> -(5359140864--5359173631)
> Running mkfs through valgrind doesn't show any obvious errors.
>
> 4) 32-bit mke2fs + 32-bit e2fsck
> Same as (3), for the mkfs and the first e2fsck again reported fixing
> the same block bitmap differences. But after the first e2fsck was
> complete, the second e2fsck run reported:
> e2fsck: Superblock invalid, trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix? yes
> Group descriptor 1 checksum is invalid. Fix? yes
> ...
> Group descriptor 81774 checksum is invalid. Fix? yes
> followed by tons of block bitmap differences.

Justin,

Any chance you can give us the exact mke2fs command lines, outputs,
and device sizes you are using? dumpe2fs -h would be bonus.

If you use IRC, showing up on #ext4 on irc.oftc.net would be useful,
too.

Thanks!

-VAL

2009-08-01 01:31:45

by Justin Maggard

Subject: Re: >16TB issues

On Thu, Jul 30, 2009 at 3:23 PM, Valerie Aurora<[email protected]> wrote:
> Any chance you can give us the exact mke2fs command lines, outputs,
> and device sizes you are using? dumpe2fs -h would be bonus.
>

Sure thing. Here's the output using an x86_64 kernel with all 32-bit userland:

2TB-Monster:~# uname -a
Linux 2TB-Monster 2.6.30.1 #1 SMP Fri Jul 17 15:43:57 PDT 2009 x86_64 GNU/Linux

2TB-Monster:~# cat /etc/mke2fs.conf
[defaults]
base_features = sparse_super,filetype,resize_inode,dir_index,ext_attr
blocksize = 4096
inode_size = 128
inode_ratio = 65536

[fs_types]
ext4 = {
features = has_journal,extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
inode_size = 256
}

2TB-Monster:~# mkfs.ext4 /dev/md2
mke2fs 1.41.8 (20-Jul-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
365400064 inodes, 5846380224 blocks
292319011 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=3699376128
178418 block groups
32768 blocks per group, 32768 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000, 3855122432, 5804752896

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

2TB-Monster:~# dumpe2fs -h /dev/md2
dumpe2fs 1.41.8 (20-Jul-2009)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 482b503d-68e0-481e-906e-52eb70e7842c
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype extent 64bit flex_bg sparse_super large_file huge_file
uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 365400064
Block count: 5846380224
Reserved block count: 292319011
Free blocks: 5823053972
Free inodes: 365400053
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1024
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 2048
Inode blocks per group: 128
Flex block group size: 16
Filesystem created: Fri Jul 31 16:23:29 2009
Last mount time: n/a
Last write time: Fri Jul 31 16:26:45 2009
Mount count: 0
Maximum mount count: 36
Last checked: Fri Jul 31 16:23:29 2009
Check interval: 15552000 (6 months)
Next check after: Wed Jan 27 15:23:29 2010
Lifetime writes: 88 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: d92d80aa-9c24-4d1c-a452-9ddadd1ebc94
Journal backup: inode blocks
Journal size: 128M

2TB-Monster:~# e2fsck -C0 -fy /dev/md2
e2fsck 1.41.8 (20-Jul-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: +(1551368194--1551368207)
+(1551368210--1551368223) +(1551368480--1551370271)
+(4294967296--4294969375) +(4295491584--4295491599) +4295491601
+(4295491609--4295491610)
... tons more like this ...
-(5846368341--5846368342) -5846368345 -5846368347 -5846368350
-5846368384 -(5846368386--5846368387) -5846368391 -5846368399
-5846368401
Fix? yes

/dev/md2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md2: 11/365400064 files (0.0% non-contiguous), 23326252/5846380224 blocks

2TB-Monster:~# e2fsck -C0 -fy /dev/md2
e2fsck 1.41.8 (20-Jul-2009)
e2fsck: Superblock invalid, trying backup blocks...
Group descriptor 0 checksum is invalid. Fix? yes
Group descriptor 1 checksum is invalid. Fix? yes
... every number sequentially ...
Group descriptor 89208 checksum is invalid. Fix? yes
Pass 1: Checking inodes, blocks, and sizes
Block bitmap differences: -1081356 -(1081362--1081363) -1081366
-(1081369--1081370) -1081372 -1081374 -1081410 -1081412
-(1081414--1081416) -1081419 -1081421 -(1081423--1081426)
-(1081429--1081430) -1081433
... tons more like this ...
(182679623--182679625) -182679628 -182679630 -(182679632--182679635)
-(182679638--182679639) -182679642 -182679644 -182679647
-(182679681--182679683) -182679687 -182679690 -182679697
Fix? yes

/dev/md2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md2: 11/365400064 files (0.0% non-contiguous), 23326252/5846380224 blocks
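
A few of the numbers in the output above can be cross-checked with simple
arithmetic. The Python sketch below uses only values printed by mke2fs,
dumpe2fs and e2fsck in this message; the variable names are made up. The
comparison with flex group metadata assumes the bitmaps and inode tables
of one flex group are laid out together, which is an assumption about the
layout rather than something the output states.

```python
BLOCKS = 5846380224            # "Block count" from dumpe2fs -h
BLOCKS_PER_GROUP = 32768       # "Blocks per group"
FLEX_SIZE = 16                 # "Flex block group size"
INODE_BLOCKS_PER_GROUP = 128   # "Inode blocks per group"

# Ceiling division: block groups needed to cover the device.
groups = -(-BLOCKS // BLOCKS_PER_GROUP)
print(groups)                  # 178418, matching the mke2fs output

# Number of blocks at or above the 32-bit boundary.
blocks_above_32bit = BLOCKS - 2**32
print(blocks_above_32bit)      # 1551412928

# The "+(4294967296--4294969375)" range from the first e2fsck run.
range_start, range_end = 4294967296, 4294969375
range_len = range_end - range_start + 1
print(range_start == 2**32, range_len)     # True 2080

# Block bitmaps + inode bitmaps + inode tables for one flex group.
flex_metadata = FLEX_SIZE * (1 + 1 + INODE_BLOCKS_PER_GROUP)
print(flex_metadata)           # 2080

# The second run stopped at group descriptor 89208.
second_run_groups = 89208 + 1
print(second_run_groups * 2 == groups)     # True
```

So the range starts exactly at block 2^32 and is as long as one flex
group's worth of bitmaps and inode tables, and the second e2fsck run
walked exactly half of the block groups. Whether either coincidence points
at the actual bug is not established here.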

2009-08-03 17:22:45

by Valerie Aurora

Subject: Re: >16TB issues

On Fri, Jul 31, 2009 at 06:24:50PM -0700, Justin Maggard wrote:
> On Thu, Jul 30, 2009 at 3:23 PM, Valerie Aurora<[email protected]> wrote:
> > Any chance you can give us the exact mke2fs command lines, outputs,
> > and device sizes you are using? dumpe2fs -h would be bonus.
> >
>
> Sure thing. Here's the output using an x86_64 kernel with all 32-bit userland:

Perfect, thank you!

-VAL

2009-08-11 21:39:50

by Valerie Aurora

Subject: Re: >16TB issues

On Fri, Jul 31, 2009 at 06:24:50PM -0700, Justin Maggard wrote:
> On Thu, Jul 30, 2009 at 3:23 PM, Valerie Aurora<[email protected]> wrote:
> > Any chance you can give us the exact mke2fs command lines, outputs,
> > and device sizes you are using? dumpe2fs -h would be bonus.
> >
>
> Sure thing. Here's the output using an x86_64 kernel with all 32-bit userland:

By the way, I'm just waiting for Ted's rebase/rewrite of the 64-bit
patches to try reproducing and fixing this bug.

-VAL

2009-08-11 22:05:26

by Theodore Ts'o

Subject: Re: >16TB issues

On Tue, Aug 11, 2009 at 05:39:48PM -0400, Valerie Aurora wrote:
> On Fri, Jul 31, 2009 at 06:24:50PM -0700, Justin Maggard wrote:
> > On Thu, Jul 30, 2009 at 3:23 PM, Valerie Aurora<[email protected]> wrote:
> > > Any chance you can give us the exact mke2fs command lines, outputs,
> > > and device sizes you are using? dumpe2fs -h would be bonus.
> > >
> >
> > Sure thing. Here's the output using an x86_64 kernel with all 32-bit userland:
>
> By the way, I'm just waiting for Ted's rebase/rewrite of the 64-bit
> patches to try reproducing and fixing this bug.

The 'pu' branch is always kept building and has the latest 64-bit
patches. I'm keeping it regularly rebased off of the development
branch, so feel free to reproduce the bug at any time. There's no
need to wait until it is completely merged into the mainline e2fsprogs
sources.

- Ted

2009-08-12 01:26:28

by Valerie Aurora

Subject: Re: >16TB issues

On Tue, Aug 11, 2009 at 06:05:11PM -0400, Theodore Tso wrote:
> On Tue, Aug 11, 2009 at 05:39:48PM -0400, Valerie Aurora wrote:
> > By the way, I'm just waiting for Ted's rebase/rewrite of the 64-bit
> > patches to try reproducing and fixing this bug.
>
> The 'pu' branch is always kept building and has the latest 64-bit
> patches. I'm keeping it regularly rebased off of the development
> branch, so feel free to reproduce the bug at any time. There's no
> need to wait until it is completely merged into the mainline e2fsprogs
> sources.

Oh, I wasn't waiting for a mainline merge, just for you to finish your
planned rewrites to this patch set. I didn't realize you were done, I
must have missed an email or something.

I am going to stop tracking and fixing 64 bit bugs now. I stopped
actively working on this project in February but wanted to follow up
on bugs for a few months. This is the only outstanding bug report I
know of (I'm sure there are more!).

Thanks to everyone who worked on the 64 bit feature! In particular,
Jose R. Santos for writing the first half of the 64 bit conversion,
Nick Dokos for lots of testing and bug fixes, Justin Maggard for more
testing, Andreas Dilger for code review, Julia Lawall and the
Coccinelle team for spatch, and of course, Theodore Ts'o for design,
rewrites, code review, merges, bug fixes, and other maintainerly
duties.

-VAL

2009-08-12 02:04:44

by Theodore Ts'o

Subject: Re: >16TB issues

Just to be clear, since it's not clear to me what you said here:

On Tue, Aug 11, 2009 at 05:39:48PM -0400, Valerie Aurora wrote:
> By the way, I'm just waiting for Ted's rebase/rewrite of the 64-bit
> patches to try reproducing and fixing this bug.

Vs. a few hours later:

On Tue, Aug 11, 2009 at 09:25:39PM -0400, Valerie Aurora wrote:
> I am going to stop tracking and fixing 64 bit bugs now. I stopped
> actively working on this project in February but wanted to follow up
> on bugs for a few months. This is the only outstanding bug report I
> know of (I'm sure there are more!).

Are you going to try to reproduce and fix this bug or not?

- Ted

2009-08-12 04:21:53

by Eric Sandeen

Subject: Re: >16TB issues

Justin Maggard wrote:
> I've been toying with ext4 and e2fsprogs pu branch (pulled from git
> yesterday) on very large volumes, and I've run into some issues. What
> I've found so far with a 19TB MD RAID0 volume, running 2.6.29.4 (I'm
> planning on trying 2.6.30 soon):
>
> - mkfs.ext4 *appears* to work fine, reporting no errors. Examining
> the superblock info with dumpe2fs -h looks normal -- although I'm
> unfamiliar with the "Lifetime writes" field, and I'm not sure why it's at
> 73GB immediately after doing mkfs, before ever mounting it.

Guessing that's how much metadata actually gets written during mkfs.

> - Immediately running e2fsck on the volume before ever mounting it
> will not complete, and results in the following:
> # e2fsck -n /dev/md2
> e2fsck 1.41.7 (29-June-2009)
> Error reading block 2435874816 (Attempt to read block from filesystem
> resulted in short read). Ignore error? no

This is roughly halfway through the filesystem; the journal is roughly
in the middle of the filesystem; this is just over 2^31 blocks. I bet
there are still ints or longs in the userspace journal code. I'll take
a look.
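
To make the suspicion concrete, here is a small Python sketch of what
happens to that block number if it passes through a signed 32-bit
variable. It only illustrates the hypothesis above, using ctypes to mimic
C integer widths; it does not show that any particular line of the
journal code actually does this.

```python
import ctypes

block = 2435874816            # block number from the e2fsck error above

# The block number lies between 2^31 and 2^32.
print(2**31 < block < 2**32)                 # True

# Stored in an unsigned 32-bit variable it survives unchanged...
print(ctypes.c_uint32(block).value)          # 2435874816

# ...but a signed 32-bit int wraps to a negative number.
print(ctypes.c_int32(block).value)           # -1859092480
```

A negative block number handed to a read routine would be one plausible
way to end up with a short read.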

-Eric

2009-08-12 05:35:55

by Justin Maggard

Subject: Re: >16TB issues

On Tue, Aug 11, 2009 at 9:21 PM, Eric Sandeen<[email protected]> wrote:
>> - Immediately running e2fsck on the volume before ever mounting it
>> will not complete, and results in the following:
>> # e2fsck -n /dev/md2
>> e2fsck 1.41.7 (29-June-2009)
>> Error reading block 2435874816 (Attempt to read block from filesystem
>> resulted in short read). Ignore error? no
>
> This is roughly halfway through the filesystem; the journal is roughly
> in the middle of the filesystem; this is just over 2^31 blocks. I bet
> there are still ints or longs in the userspace journal code. I'll take
> a look.

Thanks for looking into this. That log message is from when I was
running a 32-bit kernel, and I was apparently running into page cache
index limitations. That error went away when I switched to a x86_64
kernel, but there are still other errors, as posted in one of my more
recent messages.

-Justin

2009-08-12 14:12:25

by Eric Sandeen

Subject: Re: >16TB issues

Justin Maggard wrote:
> On Tue, Aug 11, 2009 at 9:21 PM, Eric Sandeen wrote:
>>> - Immediately running e2fsck on the volume before ever mounting it
>>> will not complete, and results in the following:
>>> # e2fsck -n /dev/md2
>>> e2fsck 1.41.7 (29-June-2009)
>>> Error reading block 2435874816 (Attempt to read block from filesystem
>>> resulted in short read). Ignore error? no
>> This is roughly halfway through the filesystem; the journal is roughly
>> in the middle of the filesystem; this is just over 2^31 blocks. I bet
>> there are still ints or longs in the userspace journal code. I'll take
>> a look.
>
> Thanks for looking into this. That log message is from when I was
> running a 32-bit kernel,

Oh! OK, I missed that tidbit. Although even full 32-bit should be able
to reach 16T at least. Hmmm.
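
The 16T figure can be checked with a little arithmetic. The sketch below assumes 4 KiB pages and 4 KiB filesystem blocks (neither is stated in the thread) and compares the byte offset of the failing block with what a 32-bit page index can address. It only shows where the numbers fall and does not identify the actual failure.

```python
PAGE_SIZE = 4096    # assumption: 4 KiB pages
BLOCK_SIZE = 4096   # assumption: 4 KiB filesystem blocks
TIB = 2**40

FAILING_BLOCK = 2435874816  # from the e2fsck error message


def byte_offset(block: int, block_size: int = BLOCK_SIZE) -> int:
    """Byte offset of a block from the start of the device."""
    return block * block_size


def max_addressable(index_bits: int, page_size: int = PAGE_SIZE) -> int:
    """Bytes addressable with a page index of the given width."""
    return (2**index_bits) * page_size


if __name__ == "__main__":
    offset = byte_offset(FAILING_BLOCK)
    print(f"failing block offset: {offset / TIB:.2f} TiB")          # about 9.07 TiB
    print(f"unsigned 32-bit index: {max_addressable(32) // TIB} TiB")  # 16 TiB
    print(f"signed 32-bit index:   {max_addressable(31) // TIB} TiB")  # 8 TiB
```

Under these assumptions the failing read sits above the 8 TiB a signed 32-bit index would cover but below the 16 TiB an unsigned one allows, which fits the puzzlement above about a full 32 bits reaching 16T.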

> and I was apparently running into page cache
> index limitations. That error went away when I switched to a x86_64
> kernel, but there are still other errors, as posted in one of my more
> recent messages.

Guess I need to catch up a bit. :)

-Eric

p.s. I was wrong about the 73G lifetime writes; just mkfsing a 19T
filesystem, even w/ lazy_itable_init=1, did 351G of writes at mkfs
time, so that's not it.

> -Justin


2009-08-12 18:32:13

by Valerie Aurora

Subject: Re: >16TB issues

(Removing Jose from the cc list - his email at IBM is bouncing.)

On Tue, Aug 11, 2009 at 10:04:39PM -0400, Theodore Tso wrote:
> Just to be clear, since it's not clear to me from what you said here:
>
> On Tue, Aug 11, 2009 at 05:39:48PM -0400, Valerie Aurora wrote:
> > By the way, I'm just waiting for Ted's rebase/rewrite of the 64-bit
> > patches to try reproducing and fixing this bug.
>
> Vs. a few hours later:
>
> On Tue, Aug 11, 2009 at 09:25:39PM -0400, Valerie Aurora wrote:
> > I am going to stop tracking and fixing 64 bit bugs now. I stopped
> > actively working on this project in February but wanted to follow up
> > on bugs for a few months. This is the only outstanding bug report I
> > know of (I'm sure there are more!).
>
> Are you going to try to reproduce and fix this bug or not?

I didn't realize this wouldn't be clear, but the second email does
override the first. I won't be trying to reproduce and fix this
particular bug (more likely, set of bugs).

I'm stepping back from tracking bugs in this patch set for a couple of
reasons. First, I've used all the time allocated for this project for
me by Red Hat. I am now working on other projects, union mounts at
present, and btrfs utilities next. Second, you're now the most active
developer on this feature, and the overhead of trying to synchronize
between two developers is far more than the benefit I can contribute.
I feel the project is in good hands now.

I will of course be happy to assist other people working on bugs in
the 64 bit patches, I just won't be taking the lead in tracking and
fixing them. I'm also happy to answer questions, do code review, help
with spatch scripts, etc. - again, just not taking the lead since I'm
not the lead developer.

-VAL

2009-08-28 02:30:45

by Justin Maggard

Subject: Re: >16TB issues

I've been testing the latest 64-bit e2fsprogs from the git pu branch
(kernel.org 2.6.30.5/x86_64) on a 64-bit (~22TB) filesystem for a
couple days, since it seems like 32-bit e2fsprogs on a 64-bit
filesystem is going to take a while longer. I'm able to create and
check a filesystem without any problem. I've also run Andreas'
llverfs utility for a few hours and not had any complaints. But, I'm
running into another strange issue. Here's what I'm doing:
# mkfs.ext4 /dev/md0
# mount /dev/md0 /mnt
# mkdir /mnt/1 /mnt/2 /mnt/3 /mnt/4 /mnt/5
# umount /mnt
# fsck.ext4 -C0 -fy /dev/md0
** No errors at all at this point. fsck returns 0. **
# mount /dev/md0 /mnt
The last mount command fails, and the kernel log contains:
EXT4-fs: ext4_check_descriptors: Checksum for group 0 failed (3412!=9428)
EXT4-fs: group descriptors corrupted!

If I redo the same steps without the mkdir, or with fsck.ext4 -fn instead of -fy,
the mount works fine. I'm running a full llverfs run just to make
sure, but it looks like it will be okay. Has anyone had success with
this yet? Any other suggestions?
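
For readers unfamiliar with the check that fails here: as I understand the gdt_csum scheme, each group descriptor carries a CRC-16 computed over the filesystem UUID, the group number, and the descriptor itself minus the checksum field, with the tail of a 64-byte descriptor included on 64bit filesystems. The sketch below is a simplified model of that scheme, not the kernel or e2fsprogs code. The field offsets, UUID, and sample values are assumptions chosen for illustration. It shows that any change to a descriptor field (for example the used-directories count after a mkdir) needs the checksum recomputed, and that two implementations disagreeing on what to cover would report different values.

```python
import struct


def crc16(crc: int, data: bytes) -> int:
    """Bitwise CRC-16 with reflected polynomial 0xA001 (the CRC-16/ARC family)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc & 0xFFFF


CHECKSUM_OFFSET = 0x1E   # assumed offset of the 16-bit checksum field
USED_DIRS_OFFSET = 0x10  # assumed offset of the used-directories count


def group_desc_checksum(uuid: bytes, group: int, desc: bytes) -> int:
    """Checksum over UUID, little-endian group number, and descriptor body."""
    crc = crc16(0xFFFF, uuid)
    crc = crc16(crc, struct.pack("<I", group))
    crc = crc16(crc, desc[:CHECKSUM_OFFSET])
    # Larger (64-bit) descriptors: also cover the bytes after the checksum field.
    if len(desc) > CHECKSUM_OFFSET + 2:
        crc = crc16(crc, desc[CHECKSUM_OFFSET + 2:])
    return crc


def make_desc(used_dirs: int, size: int = 64) -> bytes:
    """Hypothetical, mostly zeroed descriptor with only the directory count set."""
    desc = bytearray(size)
    struct.pack_into("<H", desc, USED_DIRS_OFFSET, used_dirs)
    return bytes(desc)


if __name__ == "__main__":
    uuid = bytes(range(16))  # made-up UUID
    before = group_desc_checksum(uuid, 0, make_desc(used_dirs=2))
    after = group_desc_checksum(uuid, 0, make_desc(used_dirs=7))
    print(before != after)  # True: a stale checksum no longer matches
```

Nothing here says which side (kernel or e2fsck) computes the wrong value in the report above; it only models the kind of check that ext4_check_descriptors performs.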

-Justin

2009-08-28 12:40:23

by Theodore Ts'o

Subject: Re: >16TB issues

On Thu, Aug 27, 2009 at 07:30:47PM -0700, Justin Maggard wrote:
> I've been testing the latest 64-bit e2fsprogs from the git pu branch
> (kernel.org 2.6.30.5/x86_64) on a 64-bit (~22TB) filesystem for a
> couple days, since it seems like 32-bit e2fsprogs on a 64-bit
> filesystem is going to take a while longer. I'm able to create and
> check a filesystem without any problem. I've also run Andreas'
> llverfs utility for a few hours and not had any complaints. But, I'm
> running into another strange issue. Here's what I'm doing:
> # mkfs.ext4 /dev/md0
> # mount /dev/md0 /mnt
> # mkdir /mnt/1 /mnt/2 /mnt/3 /mnt/4 /mnt/5
> # umount /mnt
> # fsck.ext4 -C0 -fy /dev/md0
> ** No errors at all at this point. fsck returns 0. **
> # mount /dev/md0 /mnt
> The last mount command fails, and the kernel log contains:
> EXT4-fs: ext4_check_descriptors: Checksum for group 0 failed (3412!=9428)
> EXT4-fs: group descriptors corrupted!

Um, that's interesting. What happens if you run fsck.ext4 twice? i.e:

# mkfs.ext4 /dev/md0
# mount /dev/md0 /mnt
# mkdir /mnt/1 /mnt/2 /mnt/3 /mnt/4 /mnt/5
# umount /mnt
# fsck.ext4 -C0 -fy /dev/md0
# fsck.ext4 -C0 -fy /dev/md0

- Ted


2009-08-28 20:27:24

by Justin Maggard

Subject: Re: >16TB issues

On Fri, Aug 28, 2009 at 5:40 AM, Theodore Tso wrote:
> Um, that's interesting. What happens if you run fsck.ext4 twice? i.e:
>
Here's what happens with an immediate fsck after the first one.
Again, if I don't create the directories, there are no errors.

# fsck.ext4 -C0 -fy /dev/md0
e2fsck 1.41.8 (20-Jul-2009)
One or more block group descriptor checksums are invalid. Fix? yes

Group descriptor 0 checksum is invalid. FIXED.
Group descriptor 1 checksum is invalid. FIXED.
Group descriptor 2 checksum is invalid. FIXED.
Group descriptor 3 checksum is invalid. FIXED.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/md0: 16/365234176 files (0.0% non-contiguous), 23315701/5843746816 blocks
# echo $?
0
# fsck.ext4 -C0 -fy /dev/md0
e2fsck 1.41.8 (20-Jul-2009)
One or more block group descriptor checksums are invalid. Fix? yes

Group descriptor 1 checksum is invalid. FIXED.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/md0: 16/365234176 files (0.0% non-contiguous), 23315701/5843746816 blocks
# echo $?
0
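
Both runs above print FIXED lines, yet `echo $?` shows 0. The e2fsck man page documents the exit status as a sum of condition bits, so a run that corrected something would normally be expected to have bit 1 set. A minimal sketch for decoding the status follows; the function name is hypothetical.

```python
# Exit status bits as documented in the e2fsck(8) man page.
E2FSCK_EXIT_BITS = {
    1: "file system errors corrected",
    2: "system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_exit_status(status: int) -> list:
    """Return the documented meanings for each bit set in an e2fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [meaning for bit, meaning in E2FSCK_EXIT_BITS.items() if status & bit]


if __name__ == "__main__":
    print(decode_exit_status(0))   # ['no errors']
    print(decode_exit_status(1))   # ['file system errors corrected']
    print(decode_exit_status(12))  # uncorrected errors plus an operational error
```

A script wrapping fsck.ext4 could use something like this to flag runs where the output and the exit status disagree, as they appear to here.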

-Justin