LinuxLists.cc - Eric Whitney's ext4 scaling data

2013-03-26 04:00:50

Subject: Eric Whitney's ext4 scaling data

Eric Whitney has very thoughtfully provided an updated set of ext4
scalability data (with comparisons against ext3, xfs, and btrfs)
comparing performance between 3.1 and 3.2, and comparing performance
between 3.2 and 3.6-rc3.

I've made his compressed tar file available at:

https://www.kernel.org/pub/linux/kernel/people/tytso/ext4_scaling_data.tar.xz
https://www.kernel.org/pub/linux/kernel/people/tytso/ext4_scaling_data.tar.gz

His comments on this data are:

It contains two sets of data - one comparing 3.2 and 3.1 (this was
the last data set I posted publicly) and another comparing 3.6-rc3
and 3.2. 3.6-rc3 was the last data set I collected, and until now, I
hadn't prepared graphs for it. The graphical results are consistent
with what I'd reported verbally over the first 2/3 of last year - not
much change between 3.2 and 3.6-rc3. The last large change I could
see occurred in 3.2, as mentioned in the notes.

The tarball unpacks into a directory named ext4_scaling_data and
contains a few subdirectories. The directories named 3.2 and 3.6-rc3
map to the data sets described above. Each contains a file named
index.html which you can open with a web browser to see the graphs,
browse the raw data, ffsb profiles and lockstats, etc.

Hopefully you'll find the lockstats and other information useful,
even though stale (3.6-rc3 became available the last week in August
2012).

Thanks, Eric, for making this data available!

- Ted

P.S. The btrfs numbers were shockingly bad, even for the random write
workload, which was unexpected for me. I wonder if checksumming was
enabled by default, or some such, and this was hampering their
performance...

2013-03-27 03:17:41

by Zheng Liu

Subject: Re: Eric Whitney's ext4 scaling data

On Tue, Mar 26, 2013 at 12:00:48AM -0400, Theodore Ts'o wrote:
>
> Eric Whitney has very thoughtfully provided an updated set of ext4
> scalability data (with comparisons against ext3, xfs, and btrfs)
> comparing performance between 3.1 and 3.2, and comparing performance
> between 3.2 and 3.6-rc3.
>
> I've made his compressed tar file available at:
>
> https://www.kernel.org/pub/linux/kernel/people/tytso/ext4_scaling_data.tar.xz
> https://www.kernel.org/pub/linux/kernel/people/tytso/ext4_scaling_data.tar.gz
>
> His comments on this data are:
>
> It contains two sets of data - one comparing 3.2 and 3.1 (this was
> the last data set I posted publicly) and another comparing 3.6-rc3
> and 3.2. 3.6-rc3 was the last data set I collected, and until now, I
> hadn't prepared graphs for it. The graphical results are consistent
> with what I'd reported verbally over the first 2/3 of last year - not
> much change between 3.2 and 3.6-rc3. The last large change I could
> see occurred in 3.2, as mentioned in the notes.
>
> The tarball unpacks into a directory named ext4_scaling_data and
> contains a few subdirectories. The directories named 3.2 and 3.6-rc3
> map to the data sets described above. Each contains a file named
> index.html which you can open with a web browser to see the graphs,
> browse the raw data, ffsb profiles and lockstats, etc.
>
> Hopefully you'll find the lockstats and other information useful,
> even though stale (3.6-rc3 became available the last week in August
> 2012).
>
> Thanks, Eric, for making this data available!

Thanks for sharing this with us. I have a rough idea that we can create
a project, which has some test cases to test the performance of file
system. We can use fio to simulate all kinds of scenarios, such as

- logger app (buffered io, append write, sequential write)
- distributed file system (preallocate, buffered io, random read/write)
- database (direct io, random read/write)
- search (mmapped, random read, periodic append write)
- ...

If we want to measure the performance of a file system, we could simply
run a script that runs some of these cases and collects the results.
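
As a rough illustration of the scenario list above, the sketch below
generates one fio job file with a section per scenario. The scenario
names, block sizes, queue depth, size and runtime are assumptions made
for the example, not values proposed in this thread. The append-write
behaviour of the logger and search cases is only approximated by a
sequential write and a random read.

```python
# Hypothetical mapping from the scenarios above to fio job options.
# Block sizes and iodepth are illustrative guesses, not tuned values.
SCENARIOS = {
    "logger": {"ioengine": "psync", "rw": "write", "direct": 0, "bs": "4k"},
    "distributed-fs": {"ioengine": "psync", "rw": "randrw", "direct": 0,
                       "fallocate": "posix", "bs": "64k"},
    "database": {"ioengine": "libaio", "rw": "randrw", "direct": 1,
                 "bs": "8k", "iodepth": 16},
    "search": {"ioengine": "mmap", "rw": "randread", "direct": 0, "bs": "4k"},
}


def fio_job_file(directory, size="1g", runtime=60):
    """Return the text of a fio job file covering every scenario."""
    lines = [
        "[global]",
        f"directory={directory}",
        f"size={size}",
        f"runtime={runtime}",
        "time_based=1",
        "",
    ]
    for name, options in SCENARIOS.items():
        lines.append(f"[{name}]")
        # stonewall makes fio wait for earlier jobs, so the scenarios
        # run one after another instead of interfering with each other.
        lines.append("stonewall")
        lines.extend(f"{key}={value}" for key, value in options.items())
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(fio_job_file("/mnt/test"))
```

The output can be saved to a file and passed to fio on the command
line; /mnt/test is a placeholder for wherever the file system under
test is mounted.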

Currently we already have xfstests, but AFAIK it can only verify that
there is no bug, deadlock, etc. in a file system; it can't tell us
whether there is a performance regression after some patches are
applied. (Please correct me if I am wrong.) So the question is whether
it is worth creating a new project, or whether we should add these test
cases to xfstests.

Regards,
- Zheng

2013-03-27 03:29:27

by Theodore Ts'o

Subject: Re: Eric Whitney's ext4 scaling data

On Tue, Mar 26, 2013 at 04:06:47PM +0100, Lukáš Czerner wrote:
> It'll take me some time to process the results, but just one small
> nitpick is that in the mail server workload the reads and writes are
> not really representative for "just reads" or "just writes" as with
> the other tests since both interfere with each other. I am
> mentioning this just so that people do not misinterpret the results.

Yes, that's true. The mail server workload is also one where the
benchmark results tend to be more variable, and so Eric has mentioned
in the past that he's had to run the benchmark several times to make
sure he's getting good, stable, numbers.

The numbers are useful for seeing whether we've accidentally regressed
on scalability, but at least for previous results from Eric's
scalability testing, what I've found most interesting to look at
are the lockstat reports.

- Ted

2013-03-27 03:35:57

by Theodore Ts'o

Subject: Re: Eric Whitney's ext4 scaling data

On Wed, Mar 27, 2013 at 11:33:23AM +0800, Zheng Liu wrote:
>
> Thanks for sharing this with us. I have a rough idea that we can create
> a project, which has some test cases to test the performance of file
> system.....

There is bitrotted benchmarking support in xfstests. I know some of
the folks at SGI have wished that it could be nursed back to health,
but having not looked at it, it's not clear to me whether it's better
to try to add benchmarking capabilities into xfstests, or as a
separate project.

The real challenge with doing this is that it tends to be very system
specific; if you change the amount of memory, number of CPU's, type of
storage, etc., you'll get very different results. So any kind of
system which is trying to detect performance regression really needs
to be run on a specific system, and what's important is the delta from
previous kernel versions.

The other thing I'll note is that Eric's results were especially
interesting because he had (in the past) access to a system with a
combination of a fast storage (via a large RAID array), and a large
number of CPU cores, which is useful for testing scalability.

These days, fast PCIe-attached storage can sometimes replace a large
RAID array, but most of us don't necessarily have access to a very
large (4 or more CPU sockets) system.

- Ted

2013-03-27 07:05:20

by Zheng Liu

Subject: Re: Eric Whitney's ext4 scaling data

On Tue, Mar 26, 2013 at 11:35:54PM -0400, Theodore Ts'o wrote:
> On Wed, Mar 27, 2013 at 11:33:23AM +0800, Zheng Liu wrote:
> >
> > Thanks for sharing this with us. I have a rough idea that we can create
> > a project, which has some test cases to test the performance of file
> > system.....
>
> There is bitrotted benchmarking support in xfstests. I know some of
> the folks at SGI have wished that it could be nursed back to health,
> but having not looked at it, it's not clear to me whether it's better
> to try to add benchmarking capabilities into xfstests, or as a
> separate project.

The key issue with adding test cases to xfstests is that we need to
handle some filesystem-specific features. As we discussed with Dave:
what is an extent? IMHO xfstests is getting more complicated because
it needs to handle this problem, e.g. punch hole for indirect-based
files in ext4.

>
> The real challenge with doing this is that it tends to be very system
> specific; if you change the amount of memory, number of CPU's, type of
> storage, etc., you'll get very different results. So any kind of
> system which is trying to detect performance regression really needs
> to be run on a specific system, and what's important is the delta from
> previous kernel versions.

Yes, the test depends on the specific system. That means that if
someone wants to make sure there is no performance regression, they need
to have a baseline result, and run the test again on the same machine.
So everyone has their own results. But that doesn't stop us from
highlighting a performance regression. If I run a test and find a
regression, I will post it to the mailing list, and other folks can
notice it, run the same tests in their own environment, and compare the
results. I think a real regression should be reproducible in other
environments.

>
> The other thing I'll note is that Eric's results were especially
> interesting because he had (in the past) access to a system with a
> combination of a fast storage (via a large RAID array), and a large
> number of CPU cores, which is useful for testing scalability.

Yes, at an internet company, we don't have any high-end machines. We just
have a lot of commodity x86 servers. :-(

>
> These days, fast PCIe-attached storage can sometimes replace a large
> RAID array, but most of us don't necessarily have access to a very
> large (4 or more CPU sockets) system.

Yeah, we couldn't test all kinds of devices. That is impossible.

Regards,
- Zheng

2013-03-27 15:10:12

by Theodore Ts'o

Subject: Re: Eric Whitney's ext4 scaling data

On Wed, Mar 27, 2013 at 03:21:02PM +0800, Zheng Liu wrote:
>
> The key issue that we add test case into xfstests is that we need to
> handle some filesystem-specific feature. Just like we had discussed
> with Dave, what is an extent? IMHO now xfstests gets more complicated
> because it needs to handle this problem. e.g. punch hole for
> indirect-based file in ext4.

Yes, that means among other things the test framework needs to keep
track of which file system features were being used when we run a
particular test, as well as the hardware configuration.

I suspect that what this means is that we're better off trying to
create a new test framework that does what we want, and automates as
much of this as possible.

It would probably be a good idea to bring in Eric Whitney into this
discussion, since he has a huge amount of expertise about what sort of
things need to be done in order to get good results. He was doing a
number of things by hand, including re-running the tests multiple
times to make sure the results were stable. I could imagine that if
the framework could keep track of what the standard deviation was for
a particular test, it could try to do this automatically, and then we
could also throw up a flag if the average result hadn't changed, but
the standard deviation had increased, since that might be an
indication that some change had caused a lot more variability.
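
A minimal sketch of that idea, using only the Python standard library.
The function names and the thresholds (a 5% tolerance on the mean, a 2x
growth in standard deviation, a 5% coefficient of variation) are
assumptions chosen for illustration; a real framework would have to
calibrate them per benchmark and per machine.

```python
from statistics import mean, stdev


def run_until_stable(run_once, min_runs=3, max_runs=10, max_cv=0.05):
    """Repeat a benchmark until its coefficient of variation is small.

    run_once is any callable returning one positive result (for example
    throughput in MB/s). min_runs must be at least 2 so that a standard
    deviation can be computed.
    """
    results = [run_once() for _ in range(min_runs)]
    while len(results) < max_runs and stdev(results) / mean(results) > max_cv:
        results.append(run_once())
    return results


def variability_flag(baseline, current, mean_tol=0.05, stdev_factor=2.0):
    """Return True if the mean is unchanged but the spread has grown."""
    base_mean, cur_mean = mean(baseline), mean(current)
    mean_unchanged = abs(cur_mean - base_mean) <= mean_tol * base_mean
    noisier = stdev(current) > stdev_factor * stdev(baseline)
    return mean_unchanged and noisier


if __name__ == "__main__":
    old_kernel = [100, 101, 99, 100, 100]   # made-up throughput samples
    new_kernel = [90, 110, 95, 105, 100]    # same mean, wider spread
    print(variability_flag(old_kernel, new_kernel))
```

With the made-up samples above, both kernels average 100 but the second
set is far noisier, so the flag is raised even though a comparison of
averages alone would show no change.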

(Note by the way that one of the things that is going to be critically
important for companies using ext4 for web backends is not just the
average throughput, which is what FFSB mostly tests, but also 99.99th
percentile latency. And sometimes the best workloads which show this
will only be mixed workloads, when under memory pressure. For
example, consider the recent "page eviction from the buddy cache"
e-mail. That's something which might result in only a slight increase
for average throughput numbers, but could have a much more profound
impact on 99.9% latency numbers, especially if while we are reading in
a bitmap block, we are holding some lock or preventing a journal
commit from closing.)

Cheers,

- Ted

2013-03-28 04:33:25

by Zheng Liu

Subject: Re: Eric Whitney's ext4 scaling data

[add Eric into cc list]

On Wed, Mar 27, 2013 at 11:10:11AM -0400, Theodore Ts'o wrote:
> On Wed, Mar 27, 2013 at 03:21:02PM +0800, Zheng Liu wrote:
> >
> > The key issue that we add test case into xfstests is that we need to
> > handle some filesystem-specific feature. Just like we had discussed
> > with Dave, what is an extent? IMHO now xfstests gets more complicated
> > because it needs to handle this problem. e.g. punch hole for
> > indirect-based file in ext4.
>
> Yes, that means among other things the test framework needs to keep
> track of which file system features were being used when we run a
> particular test, as well as the hardware configuration.
>
> I suspect that what this means is that we're better off trying to
> create a new test framework that does what we want, and automates as
> much of this as possible.

Yes, that means that we need to build a new wheel to do this work.
That is why I want to discuss it with other folks, because this is not
a small project.

>
> It would probably be a good idea to bring in Eric Whitney into this
> discussion, since he has a huge amount of expertise about what sort of
> things need to be done in order to get good results. He was doing a
> number of things by hand, including re-running the tests multiple
> times to make sure the results were stable. I could imagine that if
> the framework could keep track of what the standard deviation was for
> a particular test, it could try to do this automatically, and then we
> could also throw up a flag if the average result hadn't changed, but
> the standard deviation had increased, since that might be an
> indication that some change had caused a lot more variability.

Average and standard deviation are very important data for a
performance test framework. Some performance regressions cause only a
very subtle impact. This means that we need to run a test case several
times, and compute the average and standard deviation alongside
throughput, IOPS, latency, etc.

>
> (Note by the way that one of the things that is going to be critically
> important for companies using ext4 for web backends is not just the
> average throughput, which is what FFSB mostly tests, but also 99.99th
> percentile latency. And sometimes the best workloads which show this
> will only be mixed workloads, when under memory pressure. For
> example, consider the recent "page eviction from the buddy cache"
> e-mail. That's something which might result in only a slight increase
> for average throughput numbers, but could have a much more profound
> impact on 99.9% latency numbers, especially if while we are reading in
> a bitmap block, we are holding some lock or preventing a journal
> commit from closing.)

Definitely, latency is very important for us. At Taobao, most apps
are latency-sensitive. They expect the file system to provide a stable
latency. They can accept a stable but high latency on every write
(e.g. 100ms, quite big :-)) because the designer will take this factor
into account. However, they hate it when we provide a small but
unstable latency (e.g. 3ms on 99% of writes, and 500ms on 1% of writes).
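
The two cases in that example can be put side by side with a small
calculation. This is only a sketch: the 1000-write sample size and the
nearest-rank percentile definition are assumptions made here, and the
latencies are the illustrative figures from the paragraph above, not
measurements.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 1000 hypothetical write latencies in milliseconds for each case.
stable = [100.0] * 1000                  # 100ms on every write
spiky = [3.0] * 990 + [500.0] * 10       # 3ms on 99%, 500ms on 1%

if __name__ == "__main__":
    for name, samples in (("stable", stable), ("spiky", spiky)):
        print(name,
              "mean:", mean(samples),
              "p99:", percentile(samples, 99),
              "p99.9:", percentile(samples, 99.9))
```

The spiky case has by far the better average (about 8ms against 100ms),
yet its 99.9th percentile is 500ms against 100ms for the stable case,
which is why an average-only benchmark would rank the two the wrong way
round for a latency-sensitive application.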

Regards,
- Zheng

2013-03-28 05:07:42

by Dave Chinner

Subject: Re: Eric Whitney's ext4 scaling data

On Tue, Mar 26, 2013 at 11:35:54PM -0400, Theodore Ts'o wrote:
> On Wed, Mar 27, 2013 at 11:33:23AM +0800, Zheng Liu wrote:
> >
> > Thanks for sharing this with us. I have a rough idea that we can create
> > a project, which has some test cases to test the performance of file
> > system.....
>
> There is bitrotted benchmarking support in xfstests. I know some of
> the folks at SGI have wished that it could be nursed back to health,
> but having not looked at it, it's not clear to me whether it's better
> to try to add benchmarking capabilities into xfstests, or as a
> separate project.

The stuff that was in xfstests was useless. It was some simple
wrappers around dbench, metaperf, dirperf and dd, and not much else.

SGI are looking to reintroduce a framework into xfstests, but we
have no information on what that may contain so I can't tell you
anything about it.

> The real challenge with doing this is that it tends to be very system
> specific; if you change the amount of memory, number of CPU's, type of
> storage, etc., you'll get very different results. So any kind of
> system which is trying to detect performance regression really needs
> to be run on a specific system, and what's important is the delta from
> previous kernel versions.

Right, and the other important thing is that you know what the
expected variance of each benchmark is going to be so you can tell
if the difference between kernels is statistically significant or
not.
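
One common way to make that comparison concrete is Welch's t-test on
repeated runs from each kernel. It is shown here as an assumed choice of
test, not something this thread prescribes. The sketch computes only the
statistic and the degrees of freedom with the standard library; turning
them into a p-value needs a t-distribution table or a statistics
package. It assumes roughly normally distributed results and at least
some variation in the samples.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, degrees_of_freedom) for two independent sample sets."""
    # Squared standard error of each sample mean.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    old_kernel = [100, 102, 98, 101, 99]   # made-up throughput samples
    new_kernel = [90, 92, 88, 91, 89]
    print(welch_t(old_kernel, new_kernel))
```

A difference in means only matters when |t| is large relative to the
critical value for the computed degrees of freedom; with noisy
benchmarks the same difference in averages can easily fall below it.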

This was the real problem with the old xfstests stuff - I could
never get results that were consistent from run to run. Sometimes it
would be fine, but it wasn't reliable. That's where most
benchmarking efforts fail - they are unable to provide consistent,
deterministic results.....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-03-28 05:14:29

by Dave Chinner

Subject: Re: Eric Whitney's ext4 scaling data

On Wed, Mar 27, 2013 at 11:10:11AM -0400, Theodore Ts'o wrote:
> On Wed, Mar 27, 2013 at 03:21:02PM +0800, Zheng Liu wrote:
> >
> > The key issue that we add test case into xfstests is that we need to
> > handle some filesystem-specific feature. Just like we had discussed
> > with Dave, what is an extent? IMHO now xfstests gets more complicated
> > because it needs to handle this problem. e.g. punch hole for
> > indirect-based file in ext4.
>
> Yes, that means among other things the test framework needs to keep
> track of which file system features were being used when we run a
> particular test, as well as the hardware configuration.
>
> I suspect that what this means is that we're better off trying to
> create a new test framework that does what we want, and automates as
> much of this as possible.

Well, tracking the hardware, configuration, results over time, etc
is really orthogonal to the benchmarking harness. We're already
modifying xfstests to make it easier to do this sort of thing (like
user specified results directories, configurable expunged files,
etc) so that you can control and archive individual xfstests from a
higher level automated harness.

So I don't see this as a problem that a low level benchmarking
framework needs to concern itself directly with - what you seem to
be wanting is a better automation and archiving framework on top of
the low level harness that runs the specific tests/benchmarks....

> It would probably be a good idea to bring in Eric Whitney into this
> discussion, since he has a huge amount of expertise about what sort of
> things need to be done in order to get good results. He was doing a
> number of things by hand, including re-running the tests multiple
> times to make sure the results were stable. I could imagine that if
> the framework could keep track of what the standard deviation was for
> a particular test, it could try to do this automatically, and then we
> could also throw up a flag if the average result hadn't changed, but
> the standard deviation had increased, since that might be an
> indication that some change had caused a lot more variability.

Yup, you need to have result archives and post-process them to do
this sort of thing, which is why I think it's a separate problem to
that of actually defining and running the benchmarks...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-04-01 03:43:57

by Eric Whitney

Subject: Re: Eric Whitney's ext4 scaling data

* Dave Chinner <[email protected]>:
> On Wed, Mar 27, 2013 at 11:10:11AM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 27, 2013 at 03:21:02PM +0800, Zheng Liu wrote:
> > >
> > > The key issue that we add test case into xfstests is that we need to
> > > handle some filesystem-specific feature. Just like we had discussed
> > > with Dave, what is an extent? IMHO now xfstests gets more complicated
> > > because it needs to handle this problem. e.g. punch hole for
> > > indirect-based file in ext4.
> >
> > Yes, that means among other things the test framework needs to keep
> > track of which file system features were being used when we run a
> > particular test, as well as the hardware configuration.
> >
> > I suspect that what this means is that we're better off trying to
> > create a new test framework that does what we want, and automates as
> > much of this as possible.
>
> Well, tracking the hardware, configuration, results over time, etc
> is really orthogonal to the benchmarking harness. We're already
> modifying xfstests to make it easier to do this sort of thing (like
> user specified results directories, configurable expunged files,
> etc) so that you can control and archive individual xfstests from a
> higher level automated harness.
>
> So I don't see this a problem that a low level benchmarking
> framework needs to concern itself directly with - what you seem to
> be wanting is a better automation and archiving framework on top of
> the low level harness that runs the specific tests/benchmarks....
>
> > It would probably be a good idea to bring in Eric Whitney into this
> > discussion, since he has a huge amount of expertise about what sort of
> > things need to be done in order to get good results. He was doing a
> > number of things by hand, including re-running the tests multiple
> > times to make sure the results were stable. I could imagine that if
> > the framework could keep track of what the standard deviation was for
> > a particular test, it could try to do this automatically, and then we
> > could also throw up a flag if the average result hadn't changed, but
> > the standard deviation had increased, since that might be an
> > indication that some change had caused a lot more variability.
>
> Yup, you need to have result archives and post-process them to do
> this sort of thing, which is why I think it's a separate problem to
> that of actually defining and running the benchmarks...
>

I think it's important to also consider building good tools to explore
and visualize the data. The web pages in the tar archive I sent Ted are a
poor approximation, since their content was generated by hand rather than
automatically. Instead, you might have a tool whose user interface is a
web page with links to all collected data sets in an archive, and filters
which could be used to select specific test systems, individual benchmarks,
and metrics of interest (including configuration info). Once you select a
group of data sets, test systems, a benchmark, and a metric, the page
produces graphs or tables of data for comparison.

We built something like this at my previous employer, and it was invaluable
(I'm sure similar things must have been done elsewhere). It made it very
easy to quickly review a new incoming data set and compare it with older
data, to look for progressive changes over time, or to examine behavioral
differences across system configurations. When you collect enough benchmark
data over time, fully exploiting all that information leaves you with a
significant data mining problem.

It's helpful if benchmark and workload design supports analysis as well as
measurement. I tend to like a layered approach where, for example, a base
layer might consist of a set of block layer microbenchmarks that help
characterize storage system performance. A second layer would consist of
simple file system microbenchmarks - the usual sequential/random read/write
plus selected metadata operations, etc. More elaborate workloads
representative of important use cases would sit on top of that. Ideally,
it should be possible to relate changes in higher level workloads/benchmarks
to those below and vice versa. For example, the block layer microbenchmarks
ought to help determine the maximum performance bounds for the file system
microbenchmarks, etc. (fio ought to be suitable for the two lower levels in
this scheme; more elaborate workloads might require some scripting around
fio or some new code.)
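
As a sketch of how two adjacent layers might be related, the function
below compares file system microbenchmark results against block layer
results for the same access pattern. The workload names, the MB/s
figures and the 5% slack are all hypothetical. A ratio well above 1.0
would suggest the two runs are not comparable (page cache effects, for
instance) rather than that the file system beat the device.

```python
def compare_layers(block_layer, filesystem, slack=1.05):
    """Relate file system results to block layer bounds.

    Both arguments map a workload name to a throughput figure. Returns
    (ratios, suspicious): the file system result as a fraction of the
    block layer result, and the workloads whose ratio exceeds slack.
    """
    ratios = {}
    suspicious = []
    for workload, fs_value in filesystem.items():
        bound = block_layer.get(workload)
        if bound is None:
            continue  # no block layer counterpart was measured
        ratios[workload] = fs_value / bound
        if ratios[workload] > slack:
            suspicious.append(workload)
    return ratios, suspicious


if __name__ == "__main__":
    block = {"seq_read": 1000.0, "rand_write": 200.0}   # made-up MB/s
    fs = {"seq_read": 900.0, "rand_write": 260.0}
    print(compare_layers(block, fs))
```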

When working with benchmarks on a test system that can yield significant
variation, I do tend to like to take multiple sets and compare them. This
could certainly be handled statistically; my usual practice is to do this
manually so as to get a better feel for how the benchmark and the hardware
run together. Ideally, more experience with the test configuration leads
to hardware reconfiguration or kernel tweaks that can yield more consistent
results (common on NUMA systems, for example). Strong variation is sometimes
an indication of a problem somewhere (in my experience, at least), so trying
to understand and reduce the variation sometimes leads to a useful fix.

FWIW, I used the Autotest client code for my ext4 work to run the benchmarks
and collect the data, system configuration particulars, run logs, etc. Most
of what I had to do involved scripting test scenarios that would run selected
sets of benchmarks in predefined test environments (mkfs and mount options,
etc.). Hooks to run code before and after tests in the standard test
framework made it easy to add lockstat and other instrumentation.
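
The before/after hook pattern can be sketched in a few lines. This is
not Autotest's API; the names below are hypothetical and only illustrate
the idea. The lockstat helpers assume a kernel built with
CONFIG_LOCK_STAT, root privileges, and the usual /proc/lock_stat
interface, where writing 0 clears the accumulated statistics.

```python
from contextlib import contextmanager
from pathlib import Path

LOCK_STAT = Path("/proc/lock_stat")


def clear_lockstat():
    """Reset lock statistics before a benchmark run (needs root)."""
    LOCK_STAT.write_text("0\n")


def save_lockstat(destination):
    """Copy the current lock statistics into the results directory."""
    Path(destination).write_text(LOCK_STAT.read_text())


@contextmanager
def instrumented(before_hooks, after_hooks):
    """Run every before hook, then the body, then every after hook."""
    for hook in before_hooks:
        hook()
    try:
        yield
    finally:
        # After hooks run even if the benchmark itself fails, so that
        # partial instrumentation data is still collected.
        for hook in after_hooks:
            hook()
```

A benchmark run would then sit inside
`with instrumented([clear_lockstat], [lambda: save_lockstat("results/lock_stat.txt")]):`
where the results path is again just a placeholder.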

It worked well enough, though Autotest contained a number of test environment
assumptions that conflicted with what I wanted to do from time to time, and
required custom workarounds to its framework. A number of new versions have
been released since then, and a quick look suggests that there have been some
substantial changes (contains some test scenarios for fio, ffsb, xfstests).
Using Autotest means working in Python, though, and some prefer a simpler
approach using shell scripts.

Autotest's server code can be used to control client code on test systems,
scheduling and operating tests, archiving results in data bases, and
postprocessing data. That was more complexity than I wanted, so I simply
archived my results in my own filesystem directory structure.

Eric