2009-05-01 08:45:47

by Nick Dokos

Subject: [PATCH 0/6][64-bit] Overview

With this set of patches, I can go through a mkfs/fsck cycle with a
32TiB filesystem in four different configurations:

o flex_bg off, no raid parameters
o flex_bg off, raid parameters
o flex_bg on, no raid parameters
o flex_bg on, raid parameters

There are no errors and the layouts seem reasonable: in the first two
cases, I've checked the block and inode bitmaps of the four groups that
are not marked BG_BLOCK_UNINIT and they look correct. I'm spot-checking
some bitmaps in the last two cases, but that's a longer process.

The fs is built on an LVM volume that consists of 16 physical volumes,
with a stripe size of 128 KiB. Each physical volume is a striped LUN
(also with a 128 KiB stripe size) exported by an MSA1000 RAID
controller. There are 4 controllers, each with 28 300 GiB, 15K rpm SCSI
disks. Each controller exports 4 LUNs. Each LUN is 2TiB (that's
a limitation of the hardware). So each controller exports 8TiB and
four of them provide the 32TiB for the filesystem.
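
(For the record, with that geometry the raid parameters work out to
stride = 128 KiB / 4 KiB = 32 blocks and stripe-width = 32 * 16 = 512
blocks, assuming 4 KiB filesystem blocks, so the "raid parameters"
configurations used something along these lines - illustrative, not
the exact invocation:

    mke2fs -t ext4 -O flex_bg -E stride=32,stripe-width=512 /dev/vg0/lv0

where /dev/vg0/lv0 stands in for the actual LVM volume and -O flex_bg
is dropped for the flex_bg-off runs.)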

The machine is a DL585 G5: 4 sockets, each with a quad-core AMD CPU
(/proc/cpuinfo says:

vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : Quad-Core AMD Opteron(tm) Processor 8356
stepping : 3
cpu MHz : 2310.961
cache size : 512 KB
)

Even though I thought I had done this before (with the third
configuration), I could not replicate it: when running e2fsck, I
started getting checksum errors before the first pass and block
conflicts in pass 1. See the patch entitled "Eliminate erroneous blk_t
casts in ext2fs_get_free_blocks2()" for more details.
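
To make the failure mode concrete, here is a minimal sketch (illustrative
only, not the actual e2fsprogs code) of what such a cast does to a block
number in the upper half of a filesystem this size - with 4 KiB blocks,
32 TiB means 2^33 blocks, so anything past 2^32 - 1 gets mangled:

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t blk_t;    /* legacy 32-bit block number */
    typedef uint64_t blk64_t;  /* 64-bit block number */

    int main(void)
    {
            blk64_t blk = 5000000000ULL;    /* a block past the 2^32 mark */
            blk_t truncated = (blk_t) blk;  /* the erroneous cast */

            printf("%llu -> %u\n", (unsigned long long) blk, truncated);
            /* prints: 5000000000 -> 705032704 */
            return 0;
    }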

Even after these fixes, dumpe2fs and e2fsck were complaining that the
last group (group #250337) had block bitmap differences. It turned out
that the bitmaps were being written to the wrong place because of 32-bit
truncation. The patch entitled "write_bitmaps(): blk_t -> blk64_t" fixes
that.
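
(Assuming the default geometry of 4 KiB blocks and 32768 blocks per
group, a quick calculation shows why that group trips the truncation:
it starts at block 250337 * 32768 = 8,203,042,816, well past
2^32 = 4,294,967,296, so squeezed through a 32-bit blk_t it wraps to
3,908,075,520 - roughly four billion blocks away from where the
bitmaps belonged.)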

mke2fs is supposed to zero out the last 16 blocks of the volume to make
sure that any old MD RAID metadata at the end of the device is wiped
out, but it was zeroing out the wrong blocks. The patch entitled
"mke2fs 64-bit miscellaneous fixes" fixes that, as well as a
few display issues.
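
The shape of that bug, sketched (simplified; ext2fs_blocks_count() is
the 64-bit accessor in the development branch, but this is not the
literal patch):

    /* The start of the wipe region must be computed in 64 bits: */
    blk64_t nr_blocks = ext2fs_blocks_count(fs->super);
    blk64_t start = nr_blocks - 16;    /* last 16 blocks of the volume */

    /* With 32-bit arithmetic this goes wrong: nr_blocks exceeds 2^32
     * here, so (blk_t) nr_blocks keeps only the low 32 bits and the
     * subtraction starts from the wrong place entirely. */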

dumpe2fs needed the EXT2_FLAG_NEW_BITMAPS flag and had a few display
problems of its own. These are fixed in the patch entitled
"enable dumpe2fs 64-bitness and fix printf formats."

There are two patches for problems found by visual inspection: "(blk_t)
cast in ext2fs_new_block2()" and "__u32 -> __u64 in ba_resize_bmap() and
blk_t -> blk64_t in ext2fs_check_desc()".

Thanks,
Nick


2009-05-01 10:58:16

by Andreas Dilger

Subject: Re: [PATCH 0/6][64-bit] Overview

On May 01, 2009 04:46 -0400, Nick Dokos wrote:
> With this set of patches, I can go through a mkfs/fsck cycle with a
> 32TiB filesystem in four different configurations:
>
> o flex_bg off, no raid parameters
> o flex_bg off, raid parameters
> o flex_bg on, no raid parameters
> o flex_bg on, raid parameters

Nick,
sorry to be so slow getting back to you. Attached are the relatively
simple test programs we use to verify whether large block devices and
large filesystems are suffering from block address aliasing.

The first tool (llverdev) will write either partial or full data patterns
to the disk, then read them back and verify the data is still correct.

The second tool (llverfs) will try to allocate directories spread across
the filesystem (if possible, using EXT2_TOPDIR_FL), fill the filesystem
partially or fully with a data pattern in ~1GB files, and then read them
back for verification.

This isn't really a stress test, but rather just a sanity check for
variable overflows at different levels of the IO stack.
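
The core trick, boiled down to a sketch (the real tools are more careful
about I/O sizes, direct I/O, timestamps and error reporting; buf must
hold CHUNK bytes):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)

    /* Write pass: stamp each chunk with its own byte offset. */
    static int stamp_pass(int fd, uint64_t dev_size, char *buf)
    {
            uint64_t off;

            for (off = 0; off + CHUNK <= dev_size; off += CHUNK) {
                    memcpy(buf, &off, sizeof(off));
                    if (pwrite(fd, buf, CHUNK, (off_t) off) != CHUNK)
                            return -1;
            }
            return 0;
    }

    /* Read pass: a chunk that comes back carrying a different offset
     * means two addresses aliased to the same spot on the device. */
    static int check_pass(int fd, uint64_t dev_size, char *buf)
    {
            uint64_t off, got;

            for (off = 0; off + CHUNK <= dev_size; off += CHUNK) {
                    if (pread(fd, buf, CHUNK, (off_t) off) != CHUNK)
                            return -1;
                    memcpy(&got, buf, sizeof(got));
                    if (got != off)
                            return 1;    /* aliasing detected */
            }
            return 0;
    }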

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


Attachments:
llverdev.c (14.99 kB)
llverfs.c (18.14 kB)

2009-05-01 15:37:15

by Nick Dokos

Subject: Re: [PATCH 0/6][64-bit] Overview

Andreas Dilger <[email protected]> wrote:

> Attached are the relatively
> simple test programs we use to verify whether large block devices and
> large filesystems are suffering from block address aliasing.
> [...]

Andreas,

thanks very much! I'll do some runs over the weekend with them,
as well as some e2fsck runs w/blktrace (with and without lazy
itable init).

Thanks,
Nick

PS. BTW, I have problems receiving email right now - the Reply-To
address above seems to work but my "official" address in the From header
does not.

2009-05-04 06:28:00

by Valerie Aurora

Subject: Re: [PATCH 0/6][64-bit] Overview

On Fri, May 01, 2009 at 04:46:00AM -0400, Nick Dokos wrote:
> With this set of patches, I can go through a mkfs/fsck cycle with a
> 32TiB filesystem in four different configurations:
>
> o flex_bg off, no raid parameters
> o flex_bg off, raid parameters
> o flex_bg on, no raid parameters
> o flex_bg on, raid parameters
> [...]

Great! I pulled them into my public git repo.

-VAL