2007-10-22 23:32:32

by Chris Mason

Subject: compilebench numbers for ext4

Hello everyone,

I recently posted some performance numbers for Btrfs with different
blocksizes, and to help establish a baseline I did comparisons with
Ext3.

The graphs, numbers and a basic description of compilebench are here:

http://oss.oracle.com/~mason/blocksizes/

Ext3 easily wins the read phase, but scores poorly while creating files
and deleting them. Since ext3 is winning the read phase, we can assume
the file layout is fairly good. I think most of the problems during the
write phase are caused by pdflush doing metadata writeback. The file
data and metadata are written separately, and so we end up seeking
between things that are actually close together.

Andreas asked me to give ext4 a try, so I grabbed the patch queue from
Friday along with the latest Linus kernel. The FS was created with:

mkfs.ext3 -I 256 /dev/xxxx
mount -o delalloc,mballoc,data=ordered -t ext4dev /dev/xxxx

I did expect delayed allocation to help the write phases of
compilebench, especially the parts where it writes out .o files in
random order (basically writing medium sized files all over the
directory tree). But, every phase except reads showed huge
improvements.

http://oss.oracle.com/~mason/compilebench/ext4/ext-create-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-compile-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-rm-compare.png

To match the ext4 numbers with Btrfs, I'd probably have to turn off data
checksumming...

But oddly enough I saw very bad ext4 read throughput even when reading
a single kernel tree (outside of compilebench). The time to read the
tree was almost 2x ext3. Have others seen similar problems?

I think the ext4 delete times are so much better than ext3 because this
is a single threaded test. delayed allocation is able to get
everything into a few extents, and these all end up in the inode. So,
the delete phase only needs to seek around in small directories and
seek to well grouped inodes. ext3 probably had to seek all over for
the direct/indirect blocks.
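
(Rough arithmetic, assuming 4k blocks: the 60 byte i_block area in an ext4
inode holds a 12 byte extent header plus up to (60 - 12) / 12 = 4 extents,
so a file that stays contiguous needs no metadata blocks outside the inode
at all, while ext3 only has 12 direct pointers and needs an indirect block
for anything past 12 * 4k = 48k.)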

So, tomorrow I'll run a few tests with delalloc and mballoc
independently, but if there are other numbers people are interested in,
please let me know.

(test box was a desktop machine with single sata drive, barriers were
not used).

-chris


2007-10-22 23:50:54

by Chris Mason

Subject: Re: compilebench numbers for ext4

On Mon, 22 Oct 2007 19:31:04 -0400
Chris Mason <[email protected]> wrote:

> I did expect delayed allocation to help the write phases of
> compilebench, especially the parts where it writes out .o files in
> random order (basically writing medium sized files all over the
> directory tree). But, every phase except reads showed huge
> improvements.
>
> http://oss.oracle.com/~mason/compilebench/ext4/ext-create-compare.png
> http://oss.oracle.com/~mason/compilebench/ext4/ext-compile-compare.png
> http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
> http://oss.oracle.com/~mason/compilebench/ext4/ext-rm-compare.png

This might make the IO during reads a little easier to see. The dirs
will look like the kernel after a make -j. So each directory will have
a bunch of small .c files that are close together and a bunch of .o
files that are randomly created across the tree.

http://oss.oracle.com/~mason/compilebench/ext4/ext4-read.mpg

-chris

2007-10-23 00:12:58

by Mingming Cao

Subject: Re: compilebench numbers for ext4

On Mon, 2007-10-22 at 19:31 -0400, Chris Mason wrote:
> Hello everyone,
>
> I recently posted some performance numbers for Btrfs with different
> blocksizes, and to help establish a baseline I did comparisons with
> Ext3.
>

Thanks for doing this, Chris!

> The graphs, numbers and a basic description of compilebench are here:
>
> http://oss.oracle.com/~mason/blocksizes/
>
> Ext3 easily wins the read phase, but scores poorly while creating files
> and deleting them. Since ext3 is winning the read phase, we can assume
> the file layout is fairly good. I think most of the problems during the
> write phase are caused by pdflush doing metadata writeback. The file
> data and metadata are written separately, and so we end up seeking
> between things that are actually close together.
>
> Andreas asked me to give ext4 a try, so I grabbed the patch queue from
> Friday along with the latest Linus kernel. The FS was created with:
>
> mkfs.ext3 -I 256 /dev/xxxx
> mount -o delalloc,mballoc,data=ordered -t ext4dev /dev/xxxx
>
> I did expect delayed allocation to help the write phases of
> compilebench, especially the parts where it writes out .o files in
> random order (basically writing medium sized files all over the
> directory tree).

Unfortunately delayed allocation support for ordered mode is not there
yet.

> But, every phase except reads showed huge
> improvements.
>
> http://oss.oracle.com/~mason/compilebench/ext4/ext-create-compare.png
> http://oss.oracle.com/~mason/compilebench/ext4/ext-compile-compare.png
> http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
> http://oss.oracle.com/~mason/compilebench/ext4/ext-rm-compare.png
>
> To match the ext4 numbers with Btrfs, I'd probably have to turn off data
> checksumming...
>
> But oddly enough I saw very bad ext4 read throughput even when reading
> a single kernel tree (outside of compilebench). The time to read the
> tree was almost 2x ext3. Have others seen similar problems?
>
Thanks for pointing this out, will run compilebench.

Trying to understand the Disk IO graph
http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
it looks like with ext3 the blocks are spread over the disk, while with
ext4 they stay more around the same place, is this right?

> I think the ext4 delete times are so much better than ext3 because this
> is a single threaded test. delayed allocation is able to get
> everything into a few extents, and these all end up in the inode. So,
> the delete phase only needs to seek around in small directories and
> seek to well grouped inodes. ext3 probably had to seek all over for
> the direct/indirect blocks.
>
> So, tomorrow I'll run a few tests with delalloc and mballoc
> independently, but if there are other numbers people are interested in,
> please let me know.
>
> (test box was a desktop machine with single sata drive, barriers were
> not used).
>
> -chris

2007-10-23 00:57:09

by Chris Mason

Subject: Re: compilebench numbers for ext4

On Mon, 22 Oct 2007 17:12:58 -0700
Mingming Cao <[email protected]> wrote:

> On Mon, 2007-10-22 at 19:31 -0400, Chris Mason wrote:
> > Hello everyone,
> >
> > I recently posted some performance numbers for Btrfs with different
> > blocksizes, and to help establish a baseline I did comparisons with
> > Ext3.
> >
>
> Thanks for doing this, Chris!
>
> > The graphs, numbers and a basic description of compilebench are
> > here:
> >
> > http://oss.oracle.com/~mason/blocksizes/
> >
> > Ext3 easily wins the read phase, but scores poorly while creating
> > files and deleting them. Since ext3 is winning the read phase, we
> > can assume the file layout is fairly good. I think most of the
> > problems during the write phase are caused by pdflush doing
> > metadata writeback. The file data and metadata are written
> > separately, and so we end up seeking between things that are
> > actually close together.
> >
> > Andreas asked me to give ext4 a try, so I grabbed the patch queue
> > from Friday along with the latest Linus kernel. The FS was created
> > with:
> >
> > mkfs.ext3 -I 256 /dev/xxxx
> > mount -o delalloc,mballoc,data=ordered -t ext4dev /dev/xxxx
> >
> > I did expect delayed allocation to help the write phases of
> > compilebench, especially the parts where it writes out .o files in
> > random order (basically writing medium sized files all over the
> > directory tree).
>
> Unfortunately delayed allocation support for ordered mode is not there
> yet.

Sorry, I meant to write data=writeback, not sure how my fingers typed
ordered instead.

>
> > But, every phase except reads showed huge
> > improvements.
> >
> > http://oss.oracle.com/~mason/compilebench/ext4/ext-create-compare.png
> > http://oss.oracle.com/~mason/compilebench/ext4/ext-compile-compare.png
> > http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
> > http://oss.oracle.com/~mason/compilebench/ext4/ext-rm-compare.png
> >
> > To match the ext4 numbers with Btrfs, I'd probably have to turn off
> > data checksumming...
> >
> > But oddly enough I saw very bad ext4 read throughput even when
> > reading a single kernel tree (outside of compilebench). The time
> > to read the tree was almost 2x ext3. Have others seen similar
> > problems?
> >
> Thanks for pointing this out, will run compilebench.
>
> Trying to understand the Disk IO graph
> http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
> it looks like with ext3 the blocks are spread over the disk, while with
> ext4 they stay more around the same place, is this right?

It does look like that, but the ext4 movie shows the middle line a
little differently than the graph. The middle ext4 line is actually
made up of a lot of seeks.

For comparison, here's the ext3 movie:

http://oss.oracle.com/~mason/compilebench/ext4/ext3-read.mpg

Even though the ext3 data looks more spread out, there are more
throughput peaks, and fewer seeks overall in ext3.

-chris

2007-10-23 13:10:25

by Chris Mason

Subject: Re: compilebench numbers for ext4

On Tue, 23 Oct 2007 18:13:53 +0530
"Aneesh Kumar K.V" <[email protected]> wrote:

>
> I get this error while running compilebench
>
> http://oss.oracle.com/~mason/compilebench/compilebench-0.4.tar.bz2

I've uploaded compilebench-0.6.tar.bz2 and updated the docs on the
compilebench page. This includes the --makej option that I used for
the numbers I have posted (sorry, I thought that was pushed out
already).

For consistency with seekwatcher, I changed the -d working_dir option
into -D working_dir. The actual run I used was:

./compilebench -D /mnt --makej -i 20 -d /dev/xxxx -t trace-ext4

-d and -t make compilebench start blktrace for you at the start of each
phase, which allows easy creation of the graphs, but this isn't
required.

>
>
> elm3b138:~/compilebench-0.4# ./compilebench -d /ext4/
> Traceback (most recent call last):
> File "./compilebench", line 541, in ?
> total_runs += func(dset, rnd)
> File "./compilebench", line 431, in create_one_dir
> mbs = run_directory(dset.unpatched, dirname, "create dir")
> File "./compilebench", line 217, in run_directory
> fp = file(fname, 'a+')
> IOError: [Errno 2] No such file or directory:
> '/ext4/kernel-75618/fs/smbfs/symlink.c' elm3b138:~/compilebench-0.4#

I'm not sure, did you run out of space?

-chris

2007-10-23 13:27:35

by Aneesh Kumar K.V

Subject: Re: compilebench numbers for ext4


I get this error while running compilebench

http://oss.oracle.com/~mason/compilebench/compilebench-0.4.tar.bz2


elm3b138:~/compilebench-0.4# ./compilebench -d /ext4/
using working directory /ext4/, 30 intial dirs 100 runs
native unpatched native-0 222MB in 9.17 seconds (24.25 MB/s)
native patched native-0 109MB in 3.12 seconds (35.15 MB/s)
native unpatched compiled native-0 680MB in 12.81 seconds (53.13 MB/s)
native patched compiled native-0 691MB in 11.99 seconds (57.68 MB/s)
create dir kernel-0 222MB in 8.48 seconds (26.22 MB/s)
create dir kernel-1 222MB in 8.09 seconds (27.49 MB/s)
create dir kernel-2 222MB in 8.02 seconds (27.73 MB/s)
create dir kernel-3 222MB in 8.09 seconds (27.49 MB/s)
create dir kernel-4 222MB in 8.26 seconds (26.92 MB/s)
create dir kernel-5 222MB in 9.26 seconds (24.01 MB/s)
create dir kernel-6 222MB in 8.60 seconds (25.86 MB/s)
create dir kernel-7 222MB in 8.02 seconds (27.73 MB/s)
create dir kernel-8 222MB in 8.11 seconds (27.42 MB/s)
create dir kernel-9 222MB in 7.95 seconds (27.97 MB/s)
create dir kernel-10 222MB in 8.04 seconds (27.66 MB/s)
create dir kernel-11 222MB in 8.04 seconds (27.66 MB/s)
create dir kernel-12 222MB in 7.99 seconds (27.83 MB/s)
create dir kernel-13 222MB in 8.11 seconds (27.42 MB/s)
create dir kernel-14 222MB in 8.46 seconds (26.29 MB/s)
create dir kernel-15 222MB in 7.97 seconds (27.90 MB/s)
create dir kernel-16 222MB in 8.70 seconds (25.56 MB/s)
create dir kernel-17 222MB in 7.99 seconds (27.83 MB/s)
create dir kernel-18 222MB in 8.12 seconds (27.39 MB/s)
create dir kernel-19 222MB in 8.25 seconds (26.95 MB/s)
create dir kernel-20 222MB in 8.58 seconds (25.92 MB/s)
create dir kernel-21 222MB in 11.96 seconds (18.59 MB/s)
create dir kernel-22 222MB in 9.34 seconds (23.81 MB/s)
create dir kernel-23 222MB in 9.04 seconds (24.60 MB/s)
create dir kernel-24 222MB in 8.34 seconds (26.66 MB/s)
create dir kernel-25 222MB in 8.65 seconds (25.71 MB/s)
create dir kernel-26 222MB in 8.48 seconds (26.22 MB/s)
create dir kernel-27 222MB in 9.14 seconds (24.33 MB/s)
create dir kernel-28 222MB in 8.75 seconds (25.41 MB/s)
create dir kernel-29 222MB in 8.26 seconds (26.92 MB/s)
compile dir kernel-6 680MB in 17.90 seconds (38.02 MB/s)
stat dir kernel-1 in 5.27 seconds
delete kernel-6 in 11.98 seconds
patch dir kernel-26 109MB in 12.39 seconds (8.85 MB/s)
patch dir kernel-23 109MB in 10.08 seconds (10.88 MB/s)
create dir kernel-86372 222MB in 11.75 seconds (18.93 MB/s)
compile dir kernel-16 680MB in 23.44 seconds (29.04 MB/s)
Traceback (most recent call last):
  File "./compilebench", line 541, in ?
    total_runs += func(dset, rnd)
  File "./compilebench", line 431, in create_one_dir
    mbs = run_directory(dset.unpatched, dirname, "create dir")
  File "./compilebench", line 217, in run_directory
    fp = file(fname, 'a+')
IOError: [Errno 2] No such file or directory: '/ext4/kernel-75618/fs/smbfs/symlink.c'
elm3b138:~/compilebench-0.4#

2007-10-23 13:43:29

by Aneesh Kumar K.V

Subject: Re: compilebench numbers for ext4



Chris Mason wrote:
> On Tue, 23 Oct 2007 18:13:53 +0530
> "Aneesh Kumar K.V" <[email protected]> wrote:
>
>> I get this error while running compilebench
>>
>> http://oss.oracle.com/~mason/compilebench/compilebench-0.4.tar.bz2
>
> I've uploaded compilebench-0.6.tar.bz2 and updated the docs on the
> compilebench page. This includes the --makej option that I used for
> the numbers I have posted (sorry, I thought that was pushed out
> already).
>
> For consistency with seekwatcher, I changed the -d working_dir option
> into -D working_dir. The actual run I used was:
>
> ./compilebench -D /mnt --makej -i 20 -d /dev/xxxx -t trace-ext4
>
> -d and -t make compilebench start blktrace for you at the start of each
> phase, which allows easy creation of the graphs, but this isn't
> required.
>
>>
>> elm3b138:~/compilebench-0.4# ./compilebench -d /ext4/
>> Traceback (most recent call last):
>> File "./compilebench", line 541, in ?
>> total_runs += func(dset, rnd)
>> File "./compilebench", line 431, in create_one_dir
>> mbs = run_directory(dset.unpatched, dirname, "create dir")
>> File "./compilebench", line 217, in run_directory
>> fp = file(fname, 'a+')
>> IOError: [Errno 2] No such file or directory:
>> '/ext4/kernel-75618/fs/smbfs/symlink.c' elm3b138:~/compilebench-0.4#
>
> I'm not sure, did you run out of space?
>
>

yes.

Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 9.1G 9.1G 0 100% /mnt




-aneesh


2007-10-25 18:01:19

by Jose R. Santos

Subject: Re: compilebench numbers for ext4

On Mon, 22 Oct 2007 19:31:04 -0400
Chris Mason <[email protected]> wrote:

> Hello everyone,
>
> I recently posted some performance numbers for Btrfs with different
> blocksizes, and to help establish a baseline I did comparisons with
> Ext3.
>
> The graphs, numbers and a basic description of compilebench are here:
>
> http://oss.oracle.com/~mason/blocksizes/

I've been playing a bit with the workload and I have a couple of
comments.

1) I find the averaging of results at the end of the run misleading
unless you run a high number of directories. A single very good result
due to page caching effects seems to skew the final output. Have you
considered also printing the standard deviation of the data points, to
show how widely the results are spread?
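
Just to illustrate the kind of thing I mean, a rough sketch (not part of
compilebench; it assumes the run's output was saved to a log file and that
the phase name is passed on the command line) could be as simple as:

#!/usr/bin/env python
# Rough sketch: pull the per-run MB/s figures for one phase out of a saved
# compilebench log and print the mean and standard deviation next to it.
import sys, re, math

logfile = sys.argv[1]        # e.g. the captured stdout of a compilebench run
phase = "create dir"         # phase prefix to match in the output lines
if len(sys.argv) > 2:
    phase = sys.argv[2]

rates = []
for line in open(logfile):
    m = re.search(r'\(([0-9.]+) MB/s\)', line)
    if m and line.startswith(phase):
        rates.append(float(m.group(1)))

if rates:
    mean = sum(rates) / len(rates)
    stdev = math.sqrt(sum((r - mean) ** 2 for r in rates) / len(rates))
    print("%s: %d runs, mean %.2f MB/s, stdev %.2f MB/s"
          % (phase, len(rates), mean, stdev))

Run against a captured log like the one posted earlier in the thread, the
slow outliers would show up in the stdev instead of disappearing into the
average.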

2) You mentioned that one of the goals of the benchmark is to measure
locality during directory aging, but the workload seems too well
ordered to truly age the filesystem. At least that's what I can gather
from the output the benchmark spits out. It may be that I'm not
understanding the relationship between INITIAL_DIRS and RUNS, but the
workload seems to be localized to operations on a single dir at a
time. Just wondering if this is truly stressing the allocation
algorithms in a significant or realistic way.

Still playing and reading the code, so I hope to have a clearer
understanding of how it stresses the filesystem. This would be a hard
one to simulate in ffsb (my favorite workload) due to the locality in
the way the dataset is accessed. It would be interesting to let ffsb
age the filesystem and then run compilebench to see how it does on an
unclean filesystem with lots of holes.

> Ext3 easily wins the read phase, but scores poorly while creating files
> and deleting them. Since ext3 is winning the read phase, we can assume
> the file layout is fairly good. I think most of the problems during the
> write phase are caused by pdflush doing metadata writeback. The file
> data and metadata are written separately, and so we end up seeking
> between things that are actually close together.

If I understand how compilebench works, directories would be allocated
within one or two block group boundaries, so the data and metadata
would be in very close proximity. I assume that doing random lookups
through the entire file set would show some weakness in the ext3
metadata layout.
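
By random lookup I mean something like the sketch below (made-up /mnt
path, not part of compilebench): collect the whole file set once, then
stat it in random order.

#!/usr/bin/env python
# Sketch of the random lookup idea: collect every file under the
# compilebench working directory, then stat the whole set in random order.
import os, random, time

top = '/mnt'    # made-up path; wherever the compilebench dirs live
files = []
for path, subdirs, names in os.walk(top):
    for name in names:
        files.append(os.path.join(path, name))
random.shuffle(files)

start = time.time()
for f in files:
    os.lstat(f)
print("%d random stats in %.2f seconds" % (len(files), time.time() - start))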

> Andreas asked me to give ext4 a try, so I grabbed the patch queue from
> Friday along with the latest Linus kernel. The FS was created with:
>
> mkfs.ext3 -I 256 /dev/xxxx
> mount -o delalloc,mballoc,data=ordered -t ext4dev /dev/xxxx
>
> I did expect delayed allocation to help the write phases of
> compilebench, especially the parts where it writes out .o files in
> random order (basically writing medium sized files all over the
> directory tree). But, every phase except reads showed huge
> improvements.
>
> http://oss.oracle.com/~mason/compilebench/ext4/ext-create-compare.png
> http://oss.oracle.com/~mason/compilebench/ext4/ext-compile-compare.png
> http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
> http://oss.oracle.com/~mason/compilebench/ext4/ext-rm-compare.png

I really want to use seekwatcher to test some of the stuff that I'm
doing for flex_bg feature but it barfs on me in my test machine.

running :sleep 10:
done running sleep 10
Device: /dev/sdh
CPU 0: 0 events, 121 KiB data
CPU 1: 0 events, 231 KiB data
CPU 2: 0 events, 121 KiB data
CPU 3: 0 events, 208 KiB data
CPU 4: 0 events, 137 KiB data
CPU 5: 0 events, 213 KiB data
CPU 6: 0 events, 120 KiB data
CPU 7: 0 events, 220 KiB data
Total: 0 events (dropped 0), 1368 KiB data
blktrace done
Traceback (most recent call last):
  File "/usr/bin/seekwatcher", line 534, in ?
    add_range(hist, step, start, size)
  File "/usr/bin/seekwatcher", line 522, in add_range
    val = hist[slot]
IndexError: list index out of range

This is running on a PPC64/gentoo combination. Don't know if this means
anything to you. I have a very basic algorithm to take advantage of
block group metadata grouping and want to be able to better visualize how
different IO patterns take advantage of or are hurt by the feature.

> To match the ext4 numbers with Btrfs, I'd probably have to turn off data
> checksumming...
>
> But oddly enough I saw very bad ext4 read throughput even when reading
> a single kernel tree (outside of compilebench). The time to read the
> tree was almost 2x ext3. Have others seen similar problems?
>
> I think the ext4 delete times are so much better than ext3 because this
> is a single threaded test. delayed allocation is able to get
> everything into a few extents, and these all end up in the inode. So,
> the delete phase only needs to seek around in small directories and
> seek to well grouped inodes. ext3 probably had to seek all over for
> the direct/indirect blocks.
>
> So, tomorrow I'll run a few tests with delalloc and mballoc
> independently, but if there are other numbers people are interested in,
> please let me know.
>
> (test box was a desktop machine with single sata drive, barriers were
> not used).

More details please....

1. CPU info (type, count, speed)
2. Memory info (mostly amount)
3. Disk info (partition size, disk rpms, interface, internal cache size)
4. Benchmark cmdline parameters.

All good info when trying to explain and reproduce results, since some
of the components of the workload are very sensitive to the hw
configuration. For example, with the algorithms for flex_bg grouping
of meta-data, the speed improvement is about 10 times greater than
with the standard allocation in ext4. This is caused by the fact that
I'm running on a SCSI subsystem which has write caching disabled on the
disk (like in most servers). Testing on a desktop sata drive with
write caching enabled would probably yield much different results, so I
find the details of the system under test important when looking at the
performance characteristics of different solutions.

> -chris

-JRS

2007-10-25 18:45:35

by Chris Mason

Subject: Re: compilebench numbers for ext4

On Thu, 25 Oct 2007 10:34:49 -0500
"Jose R. Santos" <[email protected]> wrote:

> On Mon, 22 Oct 2007 19:31:04 -0400
> Chris Mason <[email protected]> wrote:
>
> > Hello everyone,
> >
> > I recently posted some performance numbers for Btrfs with different
> > blocksizes, and to help establish a baseline I did comparisons with
> > Ext3.
> >
> > The graphs, numbers and a basic description of compilebench are
> > here:
> >
> > http://oss.oracle.com/~mason/blocksizes/
>
> I've been playing a bit with the workload and I have a couple of
> comments.
>
> 1) I find the averaging of results at the end of the run misleading
> unless you run a high number of directories. A single very good
> result due to page caching effects seems to skew the final output.
> Have you considered also printing the standard deviation of the data
> points, to show how widely the results are spread?

This is the main reason I keep the output from each run. Stdev would
definitely help as well, I'll put it on the todo list.

>
> 2) You mentioned that one of the goals of the benchmark is to measure
> locality during directory aging, but the workload seems too well
> ordered to truly age the filesystem. At least that's what I can gather
> from the output the benchmark spits out. It may be that I'm not
> understanding the relationship between INITIAL_DIRS and RUNS, but the
> workload seems to be localized to operations on a single dir at a
> time. Just wondering if this is truly stressing the allocation
> algorithms in a significant or realistic way.

A good question. compilebench has two modes, and the default is better
at aging than the run I graphed on ext4. compilebench isn't trying to
fragment individual files, but it is instead trying to fragment
locality, and lower the overall performance of a directory tree.

In the default run, the patch, clean, and compile operations end up
changing around groups of files in a somewhat random fashion (at least
from the FS point of view). But, it is still a workload where a good
FS should be able to maintain locality and provide consistent results
over time.

The ext4 numbers I sent here are from compilebench --makej, which is a
shorter and less complex run. It has a few simple phases:

* create some number of kernel trees sequentially
* write new files into those trees in random order
* read three of the trees
* delete all the trees

It is a very basic test that can give you a picture of directory
layout, writeback performance and overall locality.
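
The random write phase is roughly the idea sketched below (a sketch only,
with a made-up tree path; it is not the actual compilebench code):

#!/usr/bin/env python
# Sketch of the --makej random write phase: visit the directories of one
# of the kernel trees from the create phase in random order and drop a
# medium sized .o style file into each one.
import os, random

tree = '/mnt/kernel-0'    # made-up path; one tree from the create phase
dirs = [path for path, subdirs, names in os.walk(tree)]
random.shuffle(dirs)

for i, d in enumerate(dirs):
    f = open(os.path.join(d, 'sketch-%d.o' % i), 'w')
    f.write('x' * random.randint(4096, 64 * 1024))
    f.close()

The point is just that the new file data lands all over the directory
tree instead of in creation order, which is what stresses writeback
locality.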

>
> If I understand how compilebench works, directories would be allocated
> within one or two block group boundaries, so the data and metadata
> would be in very close proximity. I assume that doing random lookups
> through the entire file set would show some weakness in the ext3
> metadata layout.

Probably.

>
> I really want to use seekwatcher to test some of the stuff that I'm
> doing for flex_bg feature but it barfs on me in my test machine.
>
> running :sleep 10:
> done running sleep 10
> Device: /dev/sdh
> Total: 0 events (dropped 0), 1368 KiB data
> blktrace done
> Traceback (most recent call last):
> File "/usr/bin/seekwatcher", line 534, in ?
> add_range(hist, step, start, size)
> File "/usr/bin/seekwatcher", line 522, in add_range
> val = hist[slot]
> IndexError: list index out of range

I don't think you have any events in the trace. Try this instead:

echo 3 > /proc/sys/vm/drop_caches
seekwatcher -t find-trace -d /dev/xxxx -p 'find /usr/local -type f'

>
> This is running on a PPC64/gentoo combination. Don't know if this
> means anything to you. I have a very basic algorithm to take
> advantage of block group metadata grouping and want to be able to
> better visualize how different IO patterns take advantage of or are
> hurt by the feature.

I wanted to benchmark flexbg too, but couldn't quite figure out the
correct patch combination ;)

>
> > To match the ext4 numbers with Btrfs, I'd probably have to turn off
> > data checksumming...
> >
> > But oddly enough I saw very bad ext4 read throughput even when
> > reading a single kernel tree (outside of compilebench). The time
> > to read the tree was almost 2x ext3. Have others seen similar
> > problems?
> >
> > I think the ext4 delete times are so much better than ext3 because
> > this is a single threaded test. delayed allocation is able to get
> > everything into a few extents, and these all end up in the inode.
> > So, the delete phase only needs to seek around in small directories
> > and seek to well grouped inodes. ext3 probably had to seek all
> > over for the direct/indirect blocks.
> >
> > So, tomorrow I'll run a few tests with delalloc and mballoc
> > independently, but if there are other numbers people are interested
> > in, please let me know.
> >
> > (test box was a desktop machine with single sata drive, barriers
> > were not used).
>
> More details please....
>
> 1. CPU info (type, count, speed)

Dual core 3ghz x86-64

> 2. Memory info (mostly amount)

2GB

> 3. Disk info (partition size, disk rpms, interface, internal cache

SAMSUNG HD160JJ (sataII w/ncq), the FS was on a 40GB lvm volume.
Single spindle.

> size)
> 4. Benchmark cmdline parameters.

mkdir ext4
compilebench --makej -D /mnt -d /dev/mapper/xxxx -t ext4/trace -i 20 >& ext4/out

-chris

2007-10-25 22:39:27

by Jose R. Santos

Subject: Re: compilebench numbers for ext4

On Thu, 25 Oct 2007 14:43:55 -0400
Chris Mason <[email protected]> wrote:
> >
> > 2) You mentioned that one of the goals of the benchmark is to measure
> > locality during directory aging, but the workload seems too well
> > ordered to truly age the filesystem. At least that's what I can gather
> > from the output the benchmark spits out. It may be that I'm not
> > understanding the relationship between INITIAL_DIRS and RUNS, but the
> > workload seems to be localized to operations on a single dir at a
> > time. Just wondering if this is truly stressing the allocation
> > algorithms in a significant or realistic way.
>
> A good question. compilebench has two modes, and the default is better
> at aging than the run I graphed on ext4. compilebench isn't trying to
> fragment individual files, but it is instead trying to fragment
> locality, and lower the overall performance of a directory tree.
>
> In the default run, the patch, clean, and compile operations end up
> changing around groups of files in a somewhat random fashion (at least
> from the FS point of view). But, it is still a workload where a good
> FS should be able to maintain locality and provide consistent results
> over time.
>
> The ext4 numbers I sent here are from compilebench --makej, which is a
> shorter and less complex run. It has a few simple phases:
>
> * create some number of kernel trees sequentially
> * write new files into those trees in random order
> * read three of the trees
> * delete all the trees
>
> It is a very basic test that can give you a picture of directory
> layout, writeback performance and overall locality.

Thanks. This clears up a couple of things, and I think I now follow the
direction you're heading in with this workload.

> >
> > I really want to use seekwatcher to test some of the stuff that I'm
> > doing for flex_bg feature but it barfs on me in my test machine.
> >
> > running :sleep 10:
> > done running sleep 10
> > Device: /dev/sdh
> > Total: 0 events (dropped 0), 1368 KiB data
> > blktrace done
> > Traceback (most recent call last):
> > File "/usr/bin/seekwatcher", line 534, in ?
> > add_range(hist, step, start, size)
> > File "/usr/bin/seekwatcher", line 522, in add_range
> > val = hist[slot]
> > IndexError: list index out of range
>
> I don't think you have any events in the trace. Try this instead:
>
> echo 3 > /proc/sys/vm/drop_caches
> seekwatcher -t find-trace -d /dev/xxxx -p 'find /usr/local -type f'

Nope, get the same error. There does seem to be data recorded in the
trace files and iostat does show activity on the disk.

toolssf2 ~ # echo 3 > /proc/sys/vm/drop_caches
toolssf2 ~ # seekwatcher -t find-trace -d /dev/sdb3 -p 'find /root -type f >/dev/null'
running :find /root -type f >/dev/null:
done running find /root -type f >/dev/null
Device: /dev/sdb3
CPU 0: 0 events, 303 KiB data
CPU 1: 0 events, 262 KiB data
CPU 2: 0 events, 205 KiB data
CPU 3: 0 events, 302 KiB data
CPU 4: 0 events, 240 KiB data
CPU 5: 0 events, 281 KiB data
CPU 6: 0 events, 191 KiB data
CPU 7: 0 events, 281 KiB data
Total: 0 events (dropped 0), 2061 KiB data
blktrace done
Traceback (most recent call last):
  File "/usr/bin/seekwatcher", line 534, in ?
    add_range(hist, step, start, size)
  File "/usr/bin/seekwatcher", line 522, in add_range
    val = hist[slot]
IndexError: list index out of range

> > This is running on a PPC64/gentoo combination. Don't know if this
> > means anything to you. I have a very basic algorithm to take
> > advantage of block group metadata grouping and want to be able to
> > better visualize how different IO patterns take advantage of or are
> > hurt by the feature.
>
> I wanted to benchmark flexbg too, but couldn't quite figure out the
> correct patch combination ;)

I'll attach e2fsprogs and kernel patches, but do realize that these are
experimental patches that I'm using to test what layout would work
best. Don't take them too seriously, as they are largely incomplete.

Currently trying to come up with workloads to test this and other
changes with. I am warming up to yours :)

To create a filesystem with the feature just do:
mke2fs -j -I 256 -O flex_bg /dev/xxx

Currently the number of block groups whose metadata is grouped together
is EXT4_DESC_PER_BLOCK(), which matches the meta_bg feature. This turns
out to be 128 block groups. This may (and probably will) change in the
future, but it gives a general idea of what benefits can be had with
large groupings of metadata.

On compilebench it seems to show a 10x improvement on "create dir",
since I'm currently testing on a SCSI disk with the write cache
disabled. I would think the improvements would be a lot less noticeable
on a SATA drive, since those usually ship with write caching enabled.
All other tests from the --makej runs were measurably better. Would
love to see seekwatcher working to tune a bit better though.


-JRS


Attachments:
flex_bg_test.tar.bz2 (5.66 kB)

2007-10-25 23:47:22

by Chris Mason

Subject: Re: compilebench numbers for ext4

On Thu, 25 Oct 2007 17:40:25 -0500
"Jose R. Santos" <[email protected]> wrote:

> > >
> > > I really want to use seekwatcher to test some of the stuff that
> > > I'm doing for flex_bg feature but it barfs on me in my test
> > > machine.
> > >
> > > running :sleep 10:
> > > done running sleep 10
> > > Device: /dev/sdh
> > > Total: 0 events (dropped 0), 1368 KiB
> > > data blktrace done
> > > Traceback (most recent call last):
> > > File "/usr/bin/seekwatcher", line 534, in ?
> > > add_range(hist, step, start, size)
> > > File "/usr/bin/seekwatcher", line 522, in add_range
> > > val = hist[slot]
> > > IndexError: list index out of range
> >
> > I don't think you have any events in the trace. Try this instead:
> >
> > echo 3 > /proc/sys/vm/drop_caches
> > seekwatcher -t find-trace -d /dev/xxxx -p 'find /usr/local -type f'
>
> Nope, get the same error. There does seem to be data recorded in the
> trace files and iostat does show activity on the disk.

Hmmm, could you please send me your trace files? There will be one for
each CPU, starting with find-trace-blktrace.

> > I wanted to benchmark flexbg too, but couldn't quite figure out the
> > correct patch combination ;)
>
> I'll attach e2fsprogs and kernel patches, but do realize that these are
> experimental patches that I'm using to test what layout would work
> best. Don't take them too seriously, as they are largely incomplete.

Thanks, I'll try this out.

>
> Currently trying to come up with workloads to test this and other
> changes with. I am warming up to yours :)

At least for the write phases of compilebench, it should benefit from
data and metadata separation. It made a very big difference in btrfs
(from 20MB/s up to 32MB/s on create). However, it did make the read
phases slower.

-chris