2011-11-21 11:36:47

by Dave Chinner

Subject: [RFC][PATCH 0/8] xfstests: rework large filesystem testing

This series changes the way xfstests configures large filesystems
for testing. The assumption is that a sparse device is being used
for the large filesystem, be it a loop device or a thin provisioned
LUN. The key to making this work is marking large amounts of the
filesystem as used without having to actually write data to the
filesystem.

In the case of XFS, it always used to use a special xfs_db hack to
modify the free space in the AG headers to make them appear full.
This meant that xfs_check needed special options to avoid checking
free space, as this way of marking space used really leaves the
filesystem in a corrupted state. Before we can use xfs_repair on such
filesystems, we need to change the way we mark blocks used.

So, change the method of marking space used to use preallocation.
For XFS, we can simply preallocate as much space as we need to
consume in a single file, essentially giving us a free "that's a
frickin' huge file" test. This is slower than the old xfs_db method,
but leaves the filesystem in a consistent state. It also means the
free space is not in the last AG - instead the free space will
usually be located in the same AG as the log.
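
As a rough illustration of the sizing involved, here is a minimal
sketch that computes how large the fill file would be and prints the
xfs_io command that would preallocate it. The helper name
fill_file_size and the default mount point are assumptions made up
for this example; the 50GB default reserve and the truncate/falloc -k
commands follow what patch 4/8 of this series does.

```shell
#!/bin/sh
# Sketch only: fill_file_size is an illustrative name, not an xfstests helper.

# Space left free by default (50GB), as in patch 4/8 of this series.
DEFAULT_FREE=$((50 * 1024 * 1024 * 1024))

# fill_file_size <fs_size_bytes> [extra_free_bytes]
# Prints the number of bytes to preallocate, or 0 if nothing should be filled.
fill_file_size()
{
    fs_size=$1
    extra=${2:-0}
    size=$((fs_size - DEFAULT_FREE - extra))
    [ "$size" -gt 0 ] || size=0
    echo "$size"
}

# Dry run for a 20TB filesystem: print the command rather than execute it.
# The mount point default is an assumption for this example.
mnt=${SCRATCH_MNT:-/mnt/scratch}
size=$(fill_file_size $((20 * 1024 * 1024 * 1024 * 1024)))
echo "xfs_io -f -c \"truncate $size\" -c \"falloc -k 0 $size\" $mnt/.use_space"
```

Printing the command instead of running it keeps the sketch safe to
try on a machine that has no scratch device configured.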

This means that we can now use an unmodified xfs_repair binary to
check the consistency of the filesystem. We still need to avoid
free-space checking with xfs_check because of its memory
consumption, but we at least will now get that checked by
xfs_repair.

There are numerous other cleanups and ease-of-use modifications such
as command line parameters for executing large filesystem testing
rather than having to know about magic environment variables.

Further, the same preallocation technique can be used for testing on
ext4. The last patch of the series (not well tested yet) enables
the preallocation space filling technique for ext4 filesystems.

ext4, however, still has serious issues with this - either we take
the mkfs.ext4 time hit to initialise all the block groups, or we
take it during the preallocation. IOWs, the "don't do work at mkfs
but do it after mount" hack^Wtradeoff simply does not work for
testing large filesystems in this manner. While it is possible to
run large filesystem tests on ext4 using this mechanism, it is
extremely painful to do so.

Indeed, test runtime on ext4 is abysmal compared to XFS. XFS takes
about 15-20s to mkfs a 20TB filesystem and preallocate a 19.8TB
file, and about 2m to check it. ext4 took somewhere in the
order of 5 minutes to do the same operation on a loopback fs on a
SATA drive, while e2fsck -f takes 20 minutes to run. e.g: test 223
runs mkfs 4 times:

$ sudo ./check --large-fs 223
FSTYP -- ext4
PLATFORM -- Linux/x86_64 test-2 3.2.0-rc2-dgc+
MKFS_OPTIONS -- /dev/loop0
MOUNT_OPTIONS -- -o acl,user_xattr /dev/loop0 /mnt/scratch/scratch

223 143s ... 1567s
Ran: 223
Passed all 1 tests
$ sudo time e2fsck -f /dev/loop0
e2fsck 1.42-WIP (16-Oct-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/loop0: 54/335544320 files (0.0% non-contiguous),
5368709120/5368709120 blocks
1131.16user 4.36system 19:12.59elapsed 98%CPU (0avgtext+0avgdata
6153616maxresident)k
0inputs+0outputs (3major+933709minor)pagefaults 0swaps

compared to XFS:

$ sudo ./check --large-fs 223
FSTYP -- xfs (non-debug)
PLATFORM -- Linux/x86_64 test-2 3.2.0-rc2-dgc+
MKFS_OPTIONS -- -f -bsize=4096 /dev/loop0
MOUNT_OPTIONS -- /dev/loop0 /mnt/scratch/scratch

223 1567s ... 144s
Ran: 223
Passed all 1 tests
dave@test-2:~/src/xfstests-dev$ sudo time xfs_repair /dev/loop0
Phase 1 - find and verify superblock...
Not enough RAM available for repair to enable prefetching.
This will be _slow_.
You need at least 3261MB RAM to run with prefetching enabled.
Phase 2 - using internal log
......
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
0.00user 0.26system 2:23.09elapsed 0%CPU (0avgtext+0avgdata 11200maxresident)k
0inputs+0outputs (5major+951minor)pagefaults 0swaps

This is why I haven't really tested it all that much - I'm not even
really sure it is working properly yet because execution of a single
test can take half an hour for a 20TB filesystem. I encourage the
ext4 developers to work towards fixing these problems to help speed
up large filesystem testing cycles.

FWIW, I haven't yet written the btrfs code to enable this form of
large filesystem testing - that's the next patch I'm going to write.
I'm not sure what to expect from that.

Comments, flames, suggestions all welcome....



2011-11-21 11:36:42

by Dave Chinner

Subject: [PATCH 2/8] xfstests: rename USE_BIG_LOOPFS to be more generic

From: Dave Chinner <[email protected]>

USE_BIG_LOOPFS is really misnamed - it can be used on real devices just as
easily as loop devices. It really means we are testing a large scratch device
and that we should enable the special filesystem filling and checking options
that enable xfstests to be run sanely on large XFS filesystems.

Signed-off-by: Dave Chinner <[email protected]>
---
004 | 2 +-
015 | 2 +-
030 | 2 +-
031 | 2 +-
032 | 2 +-
033 | 4 ++--
041 | 2 +-
049 | 2 +-
083 | 2 +-
092 | 2 +-
148 | 2 +-
149 | 2 +-
common | 2 +-
common.rc | 12 ++++++------
setup | 5 +++--
15 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/004 b/004
index 9f28e17..23729da 100755
--- a/004
+++ b/004
@@ -64,7 +64,7 @@ _supported_os IRIX Linux

_need_to_be_root
_require_scratch
-_require_nobigloopfs
+_require_no_large_scratch_dev

rm -f $seq.full

diff --git a/015 b/015
index 4206b93..a99f1ed 100755
--- a/015
+++ b/015
@@ -52,7 +52,7 @@ _supported_fs generic
_supported_os IRIX Linux

_require_scratch
-_require_nobigloopfs
+_require_no_large_scratch_dev

_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
|| _fail "mkfs failed"
diff --git a/030 b/030
index 74147d4..cd040a9 100755
--- a/030
+++ b/030
@@ -63,8 +63,8 @@ _check_ag()
_supported_fs xfs
_supported_os IRIX Linux

-_require_nobigloopfs
_require_scratch
+_require_no_large_scratch_dev

DSIZE="-dsize=100m,agcount=6"

diff --git a/031 b/031
index b062277..fb6f15b 100755
--- a/031
+++ b/031
@@ -97,8 +97,8 @@ EOF
_supported_fs xfs
_supported_os IRIX Linux

-_require_nobigloopfs
_require_scratch
+_require_no_large_scratch_dev

# sanity test - default + one root directory entry
# Note: must do this proto/mkfs now for later inode size calcs
diff --git a/032 b/032
index 4261ca2..d093b45 100755
--- a/032
+++ b/032
@@ -41,8 +41,8 @@ rm -f $seq.full
_supported_fs xfs
_supported_os Linux

-_require_nobigloopfs
_require_scratch
+_require_no_large_scratch_dev

echo "Silence is golden."
for fs in `echo /sbin/mkfs.* | sed -e 's/.sbin.mkfs.//g'`
diff --git a/033 b/033
index 9651f26..68a688e 100755
--- a/033
+++ b/033
@@ -76,9 +76,9 @@ _filter_bad_ids()
# real QA test starts here
_supported_fs xfs
_supported_os IRIX Linux
-
-_require_nobigloopfs
+
_require_scratch
+_require_no_large_scratch_dev

# devzero blows away 512byte blocks, so make 512byte inodes (at least)
_scratch_mkfs_xfs | _filter_mkfs 2>$tmp.mkfs
diff --git a/041 b/041
index 2800811..28dcb33 100755
--- a/041
+++ b/041
@@ -50,7 +50,7 @@ _supported_fs xfs
_supported_os IRIX Linux

_require_scratch
-_require_nobigloopfs
+_require_no_large_scratch_dev
umount $SCRATCH_DEV 2>/dev/null

_fill()
diff --git a/049 b/049
index c6c4faa..e37b2d3 100755
--- a/049
+++ b/049
@@ -60,9 +60,9 @@ _log()
echo "--- $*" >> $seq.full
}

-_require_nobigloopfs
_require_nonexternal
_require_scratch
+_require_no_large_scratch_dev
_require_loop
_require_ext2

diff --git a/083 b/083
index e0670b9..7a73f30 100755
--- a/083
+++ b/083
@@ -58,7 +58,7 @@ _supported_fs generic
_supported_os IRIX Linux

_require_scratch
-_require_nobigloopfs
+_require_no_large_scratch_dev

rm -f $seq.full

diff --git a/092 b/092
index 429fa80..02ccc71 100755
--- a/092
+++ b/092
@@ -48,7 +48,7 @@ _cleanup()
_supported_fs xfs
_supported_os IRIX Linux
_require_scratch
-_require_nobigloopfs
+_require_no_large_scratch_dev

MOUNT_OPTIONS="$MOUNT_OPTIONS -o inode64"
_scratch_mkfs_xfs | _filter_mkfs 2>/dev/null
diff --git a/148 b/148
index 76cbf37..7bb1722 100755
--- a/148
+++ b/148
@@ -66,8 +66,8 @@ _check_ag()
_supported_fs xfs
_supported_os IRIX Linux

-_require_nobigloopfs
_require_scratch
+_require_no_large_scratch_dev

DSIZE="-dsize=100m"

diff --git a/149 b/149
index 5131a45..193e6d7 100755
--- a/149
+++ b/149
@@ -100,8 +100,8 @@ EOF
_supported_fs xfs
_supported_os IRIX Linux

-_require_nobigloopfs
_require_scratch
+_require_no_large_scratch_dev

# sanity test - default + one root directory entry
# Note: must do this proto/mkfs now for later inode size calcs
diff --git a/common b/common
index 7d13078..da86cd9 100644
--- a/common
+++ b/common
@@ -239,7 +239,7 @@ s/ .*//p
;;

--large-fs)
- export USE_BIG_LOOPFS=yes
+ export LARGE_SCRATCH_DEV=yes
xpand=false
;;

diff --git a/common.rc b/common.rc
index cab0b64..fdeef1c 100644
--- a/common.rc
+++ b/common.rc
@@ -310,7 +310,7 @@ _scratch_mkfs_xfs()
cat $tmp_dir.mkfsstd
rm -f $tmp_dir.mkfserr $tmp_dir.mkfsstd

- if [ "$USE_BIG_LOOPFS" = yes ]; then
+ if [ "$LARGE_SCRATCH_DEV" = yes ]; then
[ -z "$RETAIN_AG_BYTES" ] && RETAIN_AG_BYTES=0
./tools/ag-wipe -q -r $RETAIN_AG_BYTES $SCRATCH_DEV
fi
@@ -432,7 +432,7 @@ _scratch_xfs_repair()
SCRATCH_OPTIONS="-l$SCRATCH_LOGDEV"
[ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_RTDEV" ] && \
SCRATCH_OPTIONS=$SCRATCH_OPTIONS" -r$SCRATCH_RTDEV"
- [ "$USE_BIG_LOOPFS" = yes ] && SCRATCH_OPTIONS=$SCRATCH_OPTIONS" -t"
+ [ "$LARGE_SCRATCH_DEV" = yes ] && SCRATCH_OPTIONS=$SCRATCH_OPTIONS" -t"
$XFS_REPAIR_PROG $SCRATCH_OPTIONS $* $SCRATCH_DEV
}

@@ -821,9 +821,9 @@ _require_ext2()

# this test requires that (large) loopback device files are not in use
#
-_require_nobigloopfs()
+_require_no_large_scratch_dev()
{
- [ "$USE_BIG_LOOPFS" = yes ] && \
+ [ "$LARGE_SCRATCH_DEV" = yes ] && \
_notrun "Large filesystem testing in progress, skipped this test"
}

@@ -1164,7 +1164,7 @@ _check_xfs_filesystem()

[ "$FSTYP" != xfs ] && return 0
testoption=""
- [ "$USE_BIG_LOOPFS" = yes ] && testoption=-t
+ [ "$LARGE_SCRATCH_DEV" = yes ] && testoption=-t

type=`_fs_type $device`
ok=1
@@ -1203,7 +1203,7 @@ _check_xfs_filesystem()
ok=0
fi
# repair doesn't scale massively at this stage, optionally skip it for now
- [ "$USE_BIG_LOOPFS" = yes ] || \
+ [ "$LARGE_SCRATCH_DEV" = yes ] || \
$XFS_REPAIR_PROG -n $extra_log_options $extra_rt_options $device >$tmp.repair 2>&1
if [ $? -ne 0 ]
then
diff --git a/setup b/setup
index 62798cc..5225951 100755
--- a/setup
+++ b/setup
@@ -23,7 +23,7 @@ fi

[ "$USE_EXTERNAL" = yes ] || USE_EXTERNAL=no
[ "$USE_LBD_PATCH" = yes ] || USE_LBD_PATCH=no
-[ "$USE_BIG_LOOPFS" = yes ] || USE_BIG_LOOPFS=no
+[ "$LARGE_SCRATCH_DEV" = yes ] || LARGE_SCRATCH_DEV=no
[ "$USE_ATTR_SECURE" = yes ] || USE_ATTR_SECURE=no
[ -z "$FSTYP" ] && FSTYP="xfs"

@@ -31,5 +31,6 @@ cat <<EOF
TEST: DIR=$TEST_DIR DEV=$TEST_DEV rt=[$TEST_RTDEV] log=[$TEST_LOGDEV]
TAPE: dev=[$TAPE_DEV] rmt=[$RMT_TAPE_DEV] rmtirix=[$RMT_TAPE_USER@$RMT_IRIXTAPE_DEV]
SCRATCH: MNT=$SCRATCH_MNT DEV=$SCRATCH_DEV rt=[$SCRATCH_RTDEV] log=[$SCRATCH_LOGDEV]
-VARIABLES: external=$USE_EXTERNAL largeblk=$USE_LBD_PATCH fstyp=$FSTYP bigloopfs=$USE_BIG_LOOPFS attrsecure=$USE_ATTR_SECURE
+VARIABLES: external=$USE_EXTERNAL largeblk=$USE_LBD_PATCH fstyp=$FSTYP
+ large_scratch_dev=$LARGE_SCRATCH_DEV attrsecure=$USE_ATTR_SECURE
EOF
--
1.7.5.4


2011-11-21 11:36:46

by Dave Chinner

Subject: [PATCH 1/8] xfstests: add --largefs check option

From: Dave Chinner <[email protected]>

Make it easier to check large filesystems quickly by adding a
--large-fs option to check to turn on shortcuts for large scratch
device filesystem testing.

Also, reject invalid command line options with a usage message.

Signed-off-by: Dave Chinner <[email protected]>
---
common | 63 +++++++++++++++++++++++++++++++++++++++------------------------
1 files changed, 39 insertions(+), 24 deletions(-)

diff --git a/common b/common
index 0723224..7d13078 100644
--- a/common
+++ b/common
@@ -27,6 +27,35 @@ _setenvironment()
export MSGVERB
}

+usage()
+{
+ echo "Usage: $0 [options] [testlist]"'
+
+common options
+ -v verbose
+
+check options
+ -xfs test XFS (default)
+ -udf test UDF
+ -nfs test NFS
+ -l line mode diff
+ -xdiff graphical mode diff
+ -udiff show unified diff (default)
+ -n show me, do not run tests
+ -q quick [deprecated]
+ -T output timestamps
+ -r randomize test order
+ --large-fs optimise scratch device for large filesystems
+
+testlist options
+ -g group[,group...] include tests from these groups
+ -x group[,group...] exclude tests from these groups
+ NNN include test NNN
+ NNN-NNN include test range (eg. 012-021)
+'
+ exit 0
+}
+
here=`pwd`
rm -f $here/$iam.out
_setenvironment
@@ -117,30 +146,7 @@ s/ .*//p
in

-\? | -h | --help) # usage
- echo "Usage: $0 [options] [testlist]"'
-
-common options
- -v verbose
-
-check options
- -xfs test XFS (default)
- -udf test UDF
- -nfs test NFS
- -l line mode diff
- -xdiff graphical mode diff
- -udiff show unified diff (default)
- -n show me, do not run tests
- -q quick [deprecated]
- -T output timestamps
- -r randomize test order
-
-testlist options
- -g group[,group...] include tests from these groups
- -x group[,group...] exclude tests from these groups
- NNN include test NNN
- NNN-NNN include test range (eg. 012-021)
-'
- exit 0
+ usage
;;

-udf) # -udf ... set FSTYP to udf
@@ -232,6 +238,15 @@ testlist options
fi
;;

+ --large-fs)
+ export USE_BIG_LOOPFS=yes
+ xpand=false
+ ;;
+
+ -*)
+ usage
+ ;;
+
*)
start=$r
end=$r
--
1.7.5.4


2011-11-21 11:36:48

by Dave Chinner

Subject: [PATCH 3/8] xfstests: rename RETAIN_AG_BYTES

From: Dave Chinner <[email protected]>

Rename the $RETAIN_AG_BYTES variable to be more generic so that it
reflects the fact that it is designed to retain a certain amount of
extra free space above the default amount in the filesystem when
doing large scratch device testing.

Signed-off-by: Dave Chinner <[email protected]>
---
common.rc | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/common.rc b/common.rc
index fdeef1c..455886d 100644
--- a/common.rc
+++ b/common.rc
@@ -311,8 +311,8 @@ _scratch_mkfs_xfs()
rm -f $tmp_dir.mkfserr $tmp_dir.mkfsstd

if [ "$LARGE_SCRATCH_DEV" = yes ]; then
- [ -z "$RETAIN_AG_BYTES" ] && RETAIN_AG_BYTES=0
- ./tools/ag-wipe -q -r $RETAIN_AG_BYTES $SCRATCH_DEV
+ [ -z "$SCRATCH_DEV_EMPTY_SPACE" ] && SCRATCH_DEV_EMPTY_SPACE=0
+ ./tools/ag-wipe -q -r $SCRATCH_DEV_EMPTY_SPACE $SCRATCH_DEV
fi

return $mkfs_status
--
1.7.5.4


2011-11-21 11:31:26

by Dave Chinner

Subject: [PATCH 6/8] xfstest: enable xfs_repair for large filesystem testing

From: Dave Chinner <[email protected]>

Now that large filesystem testing does not play free space games to
fill the space without IO, we can enable xfs_repair when running in
this mode. xfs_repair has had its scalability problems solved, too,
so this is a safe thing to do.

Signed-off-by: Dave Chinner <[email protected]>
---
common.rc | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/common.rc b/common.rc
index 34467ec..12bd349 100644
--- a/common.rc
+++ b/common.rc
@@ -1200,12 +1200,12 @@ _check_xfs_filesystem()
extra_mount_options=""
device=$1
if [ "$2" != "none" ]; then
- extra_log_options="-l$2"
+ extra_log_options="-l$2"
extra_mount_options="-ologdev=$2"
fi

if [ "$3" != "none" ]; then
- extra_rt_options="-r$3"
+ extra_rt_options="-r$3"
extra_mount_options=$extra_mount_options" -ortdev=$3"
fi
extra_mount_options=$extra_mount_options" $MOUNT_OPTIONS"
@@ -1250,8 +1250,7 @@ _check_xfs_filesystem()

ok=0
fi
- # repair doesn't scale massively at this stage, optionally skip it for now
- [ "$LARGE_SCRATCH_DEV" = yes ] || \
+
$XFS_REPAIR_PROG -n $extra_log_options $extra_rt_options $device >$tmp.repair 2>&1
if [ $? -ne 0 ]
then
--
1.7.5.4


2011-11-21 11:36:51

by Dave Chinner

Subject: [PATCH 5/8] xfstests: use command line option for setting extra space

From: Dave Chinner <[email protected]>

Allow the amount of extra free space to leave in large scratch
filesystems to be specified by a command line option rather than
just via an environment variable.

Signed-off-by: Dave Chinner <[email protected]>
---
common | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/common b/common
index da86cd9..78ac654 100644
--- a/common
+++ b/common
@@ -247,6 +247,11 @@ s/ .*//p
usage
;;

+ --extra-space=*)
+ export SCRATCH_DEV_EMPTY_SPACE=${r#*=}
+ xpand=false
+ ;;
+
*)
start=$r
end=$r
--
1.7.5.4
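
For illustration, a minimal standalone sketch of this style of option
handling follows. The function parse_opts is a hypothetical stand-in
for the option loop in "common"; only the two option names and the
two variable names are taken from this series.

```shell
#!/bin/sh
# Sketch only: parse_opts is a hypothetical stand-in for the option loop in
# "common"; only the option and variable names come from the series.

parse_opts()
{
    LARGE_SCRATCH_DEV=no
    SCRATCH_DEV_EMPTY_SPACE=0
    for r in "$@"; do
        case "$r" in
        --large-fs)
            LARGE_SCRATCH_DEV=yes
            ;;
        --extra-space=*)
            # strip everything up to and including the first "="
            SCRATCH_DEV_EMPTY_SPACE=${r#*=}
            ;;
        -*)
            # catch-all for unrecognised options
            echo "unknown option: $r" >&2
            return 1
            ;;
        esac
    done
    return 0
}

parse_opts --large-fs --extra-space=107374182400 223
echo "large=$LARGE_SCRATCH_DEV extra=$SCRATCH_DEV_EMPTY_SPACE"
```

Note that a shell case statement takes the first pattern that
matches, so the specific option patterns have to be listed before
the "-*" catch-all for them to be reachable.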


2011-11-21 11:36:50

by Dave Chinner

Subject: [PATCH 4/8] xfstests: use preallocation for ag-wiper

From: Dave Chinner <[email protected]>

To enable sane testing of large scale filesystems, the --large-fs
test option uses xfs_db magic to mark AGs full without doing any IO.
This leaves only a small amount of free space left in the filesystem
to stress the high AGs of the filesystem rather than the low AGs.

This method requires us to have special filesystem check options to
avoid free space checking in xfs_check, and we cannot currently run
xfs_repair on such a filesystem at all. As it is, free space
checking in xfs_check does not scale, so we still need to avoid this
checking regardless of how we fill the filesystem.

We can achieve exactly the same fill behaviour by preallocating a
single large file in the filesystem immediately after creating it.
This is a filesystem independent manner of filling the filesystem,
and allows us to do large filesystem testing on more than just XFS.

Further, this preallocation method effectively adds a new "very
large file" test. It also enables us to run an unmodified xfs_repair
or filesystem specific fsck program to check the filesystem for
sanity, so we can now do full sanity checking of such large
filesystems.

Signed-off-by: Dave Chinner <[email protected]>
---
common.rc | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/common.rc b/common.rc
index 455886d..34467ec 100644
--- a/common.rc
+++ b/common.rc
@@ -276,6 +276,47 @@ _scratch_mkfs_options()
echo $SCRATCH_OPTIONS $MKFS_OPTIONS $* $SCRATCH_DEV
}

+
+_setup_large_xfs_fs()
+{
+ fs_size=$1
+ local tmp_dir=/tmp/
+
+ [ "$LARGE_SCRATCH_DEV" != yes ] && return 0
+ [ -z "$SCRATCH_DEV_EMPTY_SPACE" ] && SCRATCH_DEV_EMPTY_SPACE=0
+ [ $SCRATCH_DEV_EMPTY_SPACE -ge $fs_size ] && return 0
+
+ # calculate the size of the file we need to allocate.
+ # Default free space in the FS is 50GB, but you can specify more via
+ # SCRATCH_DEV_EMPTY_SPACE
+ file_size=$(($fs_size - 50*1024*1024*1024))
+ file_size=$(($file_size - $SCRATCH_DEV_EMPTY_SPACE))
+
+ # mount the filesystem, create the file, unmount it
+ _scratch_mount >$tmp_dir/mnt.err 2>&1
+ local status=$?
+ if [ $status -ne 0 ]; then
+ echo "mount failed"
+ cat $tmp_dir/mnt.err >&2
+ rm -f $tmp_dir/mnt.err
+ return $status
+ fi
+ rm -f $tmp_dir/mnt.err
+
+ xfs_io -F -f \
+ -c "truncate $file_size" \
+ -c "falloc -k 0 $file_size" \
+ $SCRATCH_MNT/.use_space 2>&1 > /dev/null
+ status=$?
+ umount $SCRATCH_MNT
+ if [ $status -ne 0 ]; then
+ echo "large file prealloc failed"
+ cat $tmp_dir/mnt.err >&2
+ return $status
+ fi
+ return 0
+}
+
_scratch_mkfs_xfs()
{
# extra mkfs options can be added by tests
@@ -305,16 +346,23 @@ _scratch_mkfs_xfs()
mkfs_status=$?
fi

+ if [ $mkfs_status -eq 0 -a "$LARGE_SCRATCH_DEV" = yes ]; then
+ # manually parse the mkfs output to get the fs size in bytes
+ local fs_size
+ fs_size=`cat $tmp_dir.mkfsstd | perl -ne '
+ if (/^data\s+=\s+bsize=(\d+)\s+blocks=(\d+)/) {
+ my $size = $1 * $2;
+ print STDOUT "$size\n";
+ }'`
+ _setup_large_xfs_fs $fs_size
+ mkfs_status=$?
+ fi
+
# output stored mkfs output
cat $tmp_dir.mkfserr >&2
cat $tmp_dir.mkfsstd
rm -f $tmp_dir.mkfserr $tmp_dir.mkfsstd

- if [ "$LARGE_SCRATCH_DEV" = yes ]; then
- [ -z "$SCRATCH_DEV_EMPTY_SPACE" ] && SCRATCH_DEV_EMPTY_SPACE=0
- ./tools/ag-wipe -q -r $SCRATCH_DEV_EMPTY_SPACE $SCRATCH_DEV
- fi
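
The size calculation in the patch above parses the "data" line of
the mkfs.xfs output with perl. As a rough sketch of the same idea
using only sed and shell arithmetic, assuming a hypothetical helper
name xfs_size_from_mkfs and an illustrative sample line rather than
captured output:

```shell
#!/bin/sh
# Sketch only: xfs_size_from_mkfs is a hypothetical helper name, and the
# sample line below is illustrative mkfs.xfs-style output, not captured data.

# Reads mkfs.xfs output on stdin and prints the data section size in bytes
# (block size multiplied by block count).
xfs_size_from_mkfs()
{
    sed -n 's/^data *= *bsize=\([0-9]*\) *blocks=\([0-9]*\).*/\1 \2/p' |
    while read -r bsize blocks; do
        echo $((bsize * blocks))
    done
}

echo "data     =                       bsize=4096   blocks=5368709120, imaxpct=5" |
    xfs_size_from_mkfs
```

This relies on the shell doing 64-bit arithmetic, which is needed
once the filesystem size in bytes no longer fits in 32 bits.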

2011-11-21 11:36:51

by Dave Chinner

Subject: [PATCH 8/8] xfstests: enable large fs testing on ext4

From: Dave Chinner <[email protected]>

Now that setting up large filesystem testing on sparse loopback
devices uses a generic method for filling the filesystem, extend
support to ext4 filesystems.

ext4 is slightly more complex to fill as it does not support files
larger than 16TB. Hence a slightly more complex method of using
multiple smaller files to fill the space is necessary.

Signed-off-by: Dave Chinner <[email protected]>
---
common.rc | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 90 insertions(+), 0 deletions(-)

diff --git a/common.rc b/common.rc
index 9b9041f..7bb8f50 100644
--- a/common.rc
+++ b/common.rc
@@ -366,6 +366,93 @@ _scratch_mkfs_xfs()
return $mkfs_status
}

+_setup_large_ext4_fs()
+{
+ fs_size=$1
+ local tmp_dir=/tmp/
+
+ [ "$LARGE_SCRATCH_DEV" != yes ] && return 0
+ [ -z "$SCRATCH_DEV_EMPTY_SPACE" ] && SCRATCH_DEV_EMPTY_SPACE=0
+ [ $SCRATCH_DEV_EMPTY_SPACE -ge $fs_size ] && return 0
+
+ # Default free space in the FS is 50GB, but you can specify more via
+ # SCRATCH_DEV_EMPTY_SPACE
+ space_to_consume=$(($fs_size - 50*1024*1024*1024 - $SCRATCH_DEV_EMPTY_SPACE))
+
+ # mount the filesystem and create 16TB - 4KB files until we consume
+ # all the necessary space.
+ _scratch_mount >$tmp_dir/mnt.err 2>&1
+ local status=$?
+ if [ $status -ne 0 ]; then
+ echo "mount failed"
+ cat $tmp_dir/mnt.err >&2
+ rm -f $tmp_dir/mnt.err
+ return $status
+ fi
+ rm -f $tmp_dir/mnt.err
+
+ file_size=$((16*1024*1024*1024*1024 - 4096))
+ nfiles=0
+ while [ $space_to_consume -gt $file_size ]; do
+
+ xfs_io -F -f \
+ -c "truncate $file_size" \
+ -c "falloc -k 0 $file_size" \
+ $SCRATCH_MNT/.use_space.$nfiles 2>&1
+ status=$?
+ if [ $status -ne 0 ]; then
+ break;
+ fi
+
+ space_to_consume=$(( $space_to_consume - $file_size ))
+ nfiles=$(($nfiles + 1))
+ done
+
+ # consume the remaining space.
+ if [ $space_to_consume -gt 0 ]; then
+ xfs_io -F -f \
+ -c "truncate $space_to_consume" \
+ -c "falloc -k 0 $space_to_consume" \
+ $SCRATCH_MNT/.use_space.$nfiles 2>&1
+ status=$?
+ fi
+
+ umount $SCRATCH_MNT
+ if [ $status -ne 0 ]; then
+ echo "large file prealloc failed"
+ cat $tmp_dir/mnt.err >&2
+ return $status
+ fi
+ return 0
+}
+_scratch_mkfs_ext4()
+{
+ local tmp_dir=/tmp/
+
+ /sbin/mkfs -t $FSTYP -- $MKFS_OPTIONS $* $SCRATCH_DEV \
+ 2>$tmp_dir.mkfserr 1>$tmp_dir.mkfsstd
+ local mkfs_status=$?
+
+ if [ $mkfs_status -eq 0 -a "$LARGE_SCRATCH_DEV" = yes ]; then
+ # manually parse the mkfs output to get the fs size in bytes
+ fs_size=`cat $tmp_dir.mkfsstd | awk ' \
+ /^Block size/ { split($2, a, "="); bs = a[2] ; } \
+ / inodes, / { blks = $3 } \
+ /reserved for the super user/ { resv = $1 } \
+ END { fssize = bs * blks - resv; print fssize }'`
+
+ _setup_large_ext4_fs $fs_size
+ mkfs_status=$?
+ fi
+
+ # output stored mkfs output
+ cat $tmp_dir.mkfserr >&2
+ cat $tmp_dir.mkfsstd
+ rm -f $tmp_dir.mkfserr $tmp_dir.mkfsstd
+
+ return $mkfs_status
+}
+
_scratch_mkfs()
{
case $FSTYP in
@@ -381,6 +468,9 @@ _scratch_mkfs()
btrfs)
$MKFS_BTRFS_PROG $MKFS_OPTIONS $* $SCRATCH_DEV > /dev/null
;;
+ ext4)
+ _scratch_mkfs_ext4 $*
+ ;;
*)
/sbin/mkfs -t $FSTYP -- $MKFS_OPTIONS $* $SCRATCH_DEV
;;
--
1.7.5.4
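
To make the multi-file fill easier to follow, here is a minimal
sketch of just the splitting logic. The helper plan_fill_files is
hypothetical and only prints the file sizes that would be
preallocated; the 16TB - 4KB per-file size is the one used in the
patch above.

```shell
#!/bin/sh
# Sketch only: plan_fill_files is a hypothetical helper; it prints the sizes
# of the files that would be preallocated instead of creating them.

# Largest file used per iteration: 16TB - 4KB, as in the patch above.
MAX_FILE=$((16 * 1024 * 1024 * 1024 * 1024 - 4096))

# plan_fill_files <bytes_to_consume>
# Prints one "<index> <size>" line per file.
plan_fill_files()
{
    remaining=$1
    n=0
    # full-size files while more than one file's worth of space remains
    while [ "$remaining" -gt "$MAX_FILE" ]; do
        echo "$n $MAX_FILE"
        remaining=$((remaining - MAX_FILE))
        n=$((n + 1))
    done
    # one final file for whatever is left over
    if [ "$remaining" -gt 0 ]; then
        echo "$n $remaining"
    fi
}

# Consuming 40TB needs two full-size files plus a remainder file.
plan_fill_files $((40 * 1024 * 1024 * 1024 * 1024))
```

Each printed line would correspond to one .use_space.N file in the
real function.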


2011-11-21 11:31:27

by Dave Chinner

Subject: [PATCH 7/8] xfstests: always use test option when checking large scratch device

From: Dave Chinner <[email protected]>

Some tests call _check_scratch_device directly and when using a
large filesystem this needs to run with a -t option to avoid
consuming large amounts of memory. Make this happen in all cases
that the scratch device is checked.

Signed-off-by: Dave Chinner <[email protected]>
---
017 | 7 ++-----
common.rc | 2 ++
2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/017 b/017
index 9ca0e72..0a3ede3 100755
--- a/017
+++ b/017
@@ -48,9 +48,6 @@ _supported_os Linux

_require_scratch

-checkopts=""
-[ "$USE_BIG_LOOPFS" = yes ] && checkopts=-t
-
echo "*** init FS"

rm -f $seq.full
@@ -81,8 +78,8 @@ do
echo "" >>$seq.full
echo "*** XFS_CHECK ***" >>$seq.full
echo "" >>$seq.full
- _scratch_xfs_check $checkopts >>$seq.full 2>&1 \
- || _fail "xfs_check $checkopts failed"
+ _scratch_xfs_check >>$seq.full 2>&1 \
+ || _fail "xfs_check failed"
_scratch_mount -o remount,rw \
|| _fail "remount rw failed"
done
diff --git a/common.rc b/common.rc
index 12bd349..9b9041f 100644
--- a/common.rc
+++ b/common.rc
@@ -470,6 +470,8 @@ _scratch_xfs_check()
SCRATCH_OPTIONS=""
[ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
SCRATCH_OPTIONS="-l $SCRATCH_LOGDEV"
+ [ "$LARGE_SCRATCH_DEV" = yes ] && \
+ SCRATCH_OPTIONS=$SCRATCH_OPTIONS" -t"
$XFS_CHECK_PROG $SCRATCH_OPTIONS $* $SCRATCH_DEV
}

--
1.7.5.4


2011-11-21 12:10:47

by Theodore Ts'o

Subject: Re: [RFC][PATCH 0/8] xfstests: rework large filesystem testing


On Nov 21, 2011, at 6:31 AM, Dave Chinner wrote:

> ext4, however, still has serious issues with this - either we take
> the mkfs.ext4 time hit to initialise all the block groups, or we
> take it during the preallocation. IOWs, the "don't do work at mkfs
> but do it after mount" hack^Wtradeoff simply does not work for
> testing large filesystems in this manner. While it is possible to
> run large filesystem tests on ext4 using this mechanism, it is
> extremely painful to do so.

For testing, we can disable the "do it after the mount" aspect
of ext4 by using the mount option "noinit_itable". We basically
only need to zero the inode table to make sure e2fsck doesn't
confuse old inode tables as new ones in the event that the block
group descriptors get compromised and we can't trust them to
determine the high watermark of inodes used per block group,
something which is only a concern in the case of kernel bugs
or hardware failures (or power failures in no journal mode).
(We could also compare the inode ctime with the fs mkfs time
in the superblock, but ext4 gets used on desktops and
on things like android tablets where I've learned through
bitter experience that we can't trust the system clock to be
correct.)

In any case it's safe to turn off the inode table initialization for
testing purposes. In the long term, once we get checksums
into the inode table block, we won't need to zero out the inode
tables at all.

As far as xfstests are concerned, if there's a convenient way
to add mount options automatically (on a per file system
basis) when --large-fs is specified, we should be able to
make this work for ext4 file systems.
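
A minimal sketch of what such a hook might look like, purely as an
assumption about how it could be wired up: _large_fs_mount_options
is a hypothetical helper, not an existing xfstests function, while
LARGE_SCRATCH_DEV and MOUNT_OPTIONS are the variables xfstests
already uses.

```shell
#!/bin/sh
# Sketch only: _large_fs_mount_options is a hypothetical helper, not an
# existing xfstests function.

# Prints extra mount options for the given filesystem type when large
# filesystem testing is enabled, and nothing otherwise.
_large_fs_mount_options()
{
    [ "$LARGE_SCRATCH_DEV" = yes ] || return 0
    case "$1" in
    ext4)
        # skip the deferred inode table zeroing after mount
        echo "-o noinit_itable"
        ;;
    esac
}

LARGE_SCRATCH_DEV=yes
MOUNT_OPTIONS="-o acl,user_xattr"
MOUNT_OPTIONS="$MOUNT_OPTIONS $(_large_fs_mount_options ext4)"
echo "$MOUNT_OPTIONS"
```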

Regards,

-- Ted


2011-11-22 09:28:49

by Dave Chinner

Subject: Re: [RFC][PATCH 0/8] xfstests: rework large filesystem testing

On Mon, Nov 21, 2011 at 07:10:45AM -0500, Theodore Tso wrote:
>
> On Nov 21, 2011, at 6:31 AM, Dave Chinner wrote:
>
> > ext4, however, still has serious issues with this - either we take
> > the mkfs.ext4 time hit to initialise all the block groups, or we
> > take it during the preallocation. IOWs, the "don't do work at mkfs
> > but do it after mount" hack^Wtradeoff simply does not work for
> > testing large filesystems in this manner. While it is possible to
> > run large filesystem tests on ext4 using this mechanism, it is
> > extremely painful to do so.
>
> For testing, we can disable the "do it after the mount " aspect
> of ext4 by using the mount option "noinit_itable". We basically
> only need to zero the inode table to make sure e2fsck doesn't
> confuse old inode tables as new ones in the event that the block

It's not the deferred inode table initialisation that is the problem
for the preallocation immediately after a mkfs and mount - it's
initialising block groups that is the problem:

[363806.042907] SysRq : Show Blocked State
[363806.044586] task PC stack pid father
[363806.046400] xfs_io D ffff8801099aed08 0 7264 7064 0x00000000
[363806.046400] ffff880117e33868 0000000000000086 0000000000000000 ffffffffb13a2903
[363806.046400] ffff8801099ae980 ffff880117e33fd8 ffff880117e33fd8 ffff880117e33fd8
[363806.046400] ffff88011afb44c0 ffff8801099ae980 ffff880117e33868 00000001810b59ed
[363806.046400] Call Trace:
[363806.046400] [<ffffffff8118eec0>] ? __wait_on_buffer+0x30/0x30
[363806.046400] [<ffffffff81aab3af>] schedule+0x3f/0x60
[363806.046400] [<ffffffff81aab45f>] io_schedule+0x8f/0xd0
[363806.046400] [<ffffffff8118eece>] sleep_on_buffer+0xe/0x20
[363806.046400] [<ffffffff81aabc2f>] __wait_on_bit+0x5f/0x90
[363806.046400] [<ffffffff8167e177>] ? generic_make_request+0xc7/0x100
[363806.046400] [<ffffffff8118eec0>] ? __wait_on_buffer+0x30/0x30
[363806.046400] [<ffffffff81aabcdc>] out_of_line_wait_on_bit+0x7c/0x90
[363806.046400] [<ffffffff810ac360>] ? autoremove_wake_function+0x40/0x40
[363806.046400] [<ffffffff8118eebe>] __wait_on_buffer+0x2e/0x30
[363806.046400] [<ffffffff812824c3>] ext4_mb_init_cache+0x223/0x9c0
[363806.046400] [<ffffffff81118583>] ? add_to_page_cache_locked+0xb3/0x100
[363806.046400] [<ffffffff81282dae>] ext4_mb_init_group+0x14e/0x210
[363806.046400] [<ffffffff812832d9>] ext4_mb_load_buddy+0x339/0x350
[363806.046400] [<ffffffff8128465b>] ext4_mb_find_by_goal+0x6b/0x2b0
[363806.046400] [<ffffffff81285034>] ext4_mb_regular_allocator+0x64/0x430
[363806.046400] [<ffffffff81286d8d>] ext4_mb_new_blocks+0x40d/0x560
[363806.046400] [<ffffffff81aad1ee>] ? _raw_spin_lock+0xe/0x20
[363806.046400] [<ffffffff81aad1ee>] ? _raw_spin_lock+0xe/0x20
[363806.046400] [<ffffffff8127c6a1>] ext4_ext_map_blocks+0xfa1/0x1d10
[363806.046400] [<ffffffff8129a6aa>] ? jbd2__journal_start+0xca/0x110
[363806.046400] [<ffffffff81252535>] ext4_map_blocks+0x1b5/0x280
[363806.046400] [<ffffffff8127ddf5>] ext4_fallocate+0x1c5/0x530
[363806.046400] [<ffffffff8115e992>] do_fallocate+0xf2/0x160
[363806.046400] [<ffffffff8115ea4b>] sys_fallocate+0x4b/0x70
[363806.046400] [<ffffffff81ab5082>] system_call_fastpath+0x16/0x1b

This initialisation runs at about 50MB/s for some periods of the
preallocation. Sample from iostat -d -x -m 5:

Device:  rrqm/s   wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
vdc        0.00  1352.30  46.91  178.64   0.18  51.44    468.74      4.38  19.42    18.57    19.64   4.00  90.30

Device:  rrqm/s   wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
vdc        0.00  1405.40  47.20  184.40   0.18  50.97    452.34      5.99  25.84    18.31    27.77   3.91  90.56

Device:  rrqm/s   wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
vdc        0.00  4302.40  38.60  377.40   0.15  57.49    283.79     31.68  76.17    23.50    81.55   2.20  91.68

shows it is close to IO bound. This is on a 12 disk RAID-0 array w/
a 512MB BBWC. That indicates that most of the IO being done is
random. perf top shows that the limited amount of CPU time
being spent is distributed like this:

samples  pcnt function                      DSO
_______ _____ _____________________________ _________________

  83.00  6.7% ext4_init_block_bitmap        [kernel.kallsyms]
  82.00  6.6% crc16                         [kernel.kallsyms]
  73.00  5.9% __find_get_block              [kernel.kallsyms]
  65.00  5.2% ext4_num_overhead_clusters    [kernel.kallsyms]
  62.00  5.0% ext4_set_bits                 [kernel.kallsyms]
  56.00  4.5% ext4_ext_find_extent          [kernel.kallsyms]
  55.00  4.4% ext4_mark_iloc_dirty          [kernel.kallsyms]
  53.00  4.3% jbd2_journal_add_journal_head [kernel.kallsyms]
  50.00  4.0% do_get_write_access           [kernel.kallsyms]
  45.00  3.6% mb_find_order_for_block       [kernel.kallsyms]
  41.00  3.3% ext4_ext_map_blocks           [kernel.kallsyms]
  34.00  2.7% jbd2_journal_cancel_revoke    [kernel.kallsyms]
  28.00  2.3% jbd2_journal_dirty_metadata   [kernel.kallsyms]
  27.00  2.2% jbd2_journal_put_journal_head [kernel.kallsyms]

The rest of the time, there is no IO and the preallocation is
severely CPU bound. Top shows:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7523 root 20 0 10848 792 636 R 99 0.0 0:23.39 xfs_io

and perf top -p <pid of xfs_io> shows:

 samples  pcnt function                    DSO
 _______ _____ ___________________________ _________________

13840.00 89.2% ext4_mb_good_group          [kernel.kallsyms]
 1218.00  7.8% ext4_mb_regular_allocator   [kernel.kallsyms]
  148.00  1.0% mb_find_order_for_block     [kernel.kallsyms]
   85.00  0.5% find_next_zero_bit          [kernel.kallsyms]
   78.00  0.5% radix_tree_lookup_element   [kernel.kallsyms]
   54.00  0.3% find_get_page               [kernel.kallsyms]
   53.00  0.3% mb_find_extent.constprop.31 [kernel.kallsyms]
   12.00  0.1% mb_find_buddy               [kernel.kallsyms]
   10.00  0.1% ext4_mb_load_buddy          [kernel.kallsyms]

which, if I read the code correctly, is CPU bound searching for a
block group to allocate from.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-12-04 21:14:58

by Christoph Hellwig

Subject: Re: [PATCH 1/8] xfstests: add --largefs check option

On Mon, Nov 21, 2011 at 10:31:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Make it easier to check large filesystems quickly by adding a
> --large-fs option to check to turn on shortcuts for large scratch
> device filesystem testing.
>
> Also, reject invalid command line options with a usage message.

Looks good, except that the help text for it doesn't look overly useful.
I don't think we "optimise" for large filesystems, we simply fill most
of it, which is what the documentation should mention.


2011-12-04 21:16:32

by Christoph Hellwig

Subject: Re: [PATCH 5/8] xfstests: use command line option for setting extra space

On Mon, Nov 21, 2011 at 10:31:25PM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Allow the extra free space to leave in large scratch filesystems to
> be specified by a command line option rather than just via an
> environment variable.
>
> Signed-off-by: Dave Chinner <[email protected]>

This probably should be documented in the help text.


2012-01-16 16:23:25

by Mark Tinguely

Subject: Re: [PATCH 2/8] xfstests: rename USE_BIG_LOOPFS to be more generic

On 01/-10/63 13:59, Dave Chinner wrote:
> From: Dave Chinner<[email protected]>
>
> USE_BIG_LOOPFS is really misnamed - it can be used on real devices just as
> easily as loop devices. It really means we are testing a large scratch device
> and that we should enable the special filesystem filling and checking options
> that enable xfstests to be run sanely on large XFS filesystems.
>

Looks good.

Reviewed-by: Mark Tinguely <[email protected]>

2012-01-16 17:04:38

by Mark Tinguely

Subject: Re: [PATCH 6/8] xfstest: enable xfs_repair for large filesystem testing

On 01/-10/63 13:59, Dave Chinner wrote:
> From: Dave Chinner<[email protected]>
>
> Now that large filesystem testing does not play free space games to
> fill the space without IO, we can enable xfs_repair when running in
> this mode. xfs_repair has had it's scalability problems solved, too,
> so this is a safe thing to do.
>

Looks good.

Reviewed-by: Mark Tinguely <[email protected]>