From: Daniel Phillips <daniel@phunq.net>
To: Dave Chinner <david@fromorbit.com>
Cc: <linux-kernel@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
        <tux3@tux3.org>, "Theodore Ts'o" <tytso@mit.edu>
Subject: Re: Tux3 Report: How fast can we fsync?
Date: Thu, 30 Apr 2015 03:28:13 -0700
User-Agent: Trojita/v0.5-14-g8a2496c; Qt/4.8.6; X11; Linux; Ubuntu 14.04.2 LTS
MIME-Version: 1.0
Message-ID: <81488fcb-b5d5-4761-b8ae-936dce9c1f89@phunq.net>
In-Reply-To: <20150430014616.GZ15810@dastard>
References: <8f886f13-6550-4322-95be-93244ae61045@phunq.net>
 <20150430014616.GZ15810@dastard>
Organization: tux3.org
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 9871
Lines: 238

On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:
>> I measured fsync performance using a 7200 RPM disk as a virtual
>> drive under KVM, configured with cache=none so that asynchronous
>> writes are cached and synchronous writes translate into direct
>> writes to the block device.
>
> Yup, a slow single spindle, so fsync performance is determined by
> seek latency of the filesystem. Hence the filesystem that "wins"
> will be the filesystem that minimises fsync seek latency above all
> other considerations.
>
> http://www.spinics.net/lists/kernel/msg1978216.html

If you want to declare that XFS only works well on solid state disks 
and big storage arrays, that is your business. But if you do, you can no
longer call XFS a general purpose filesystem. And if you would rather 
disparage people who report genuine performance bugs than get down to
fixing them, that is your business too. Don't expect to be able to stop 
the bug reports by bluster.

> So, to demonstrate, I'll run the same tests but using a 256GB
> samsung 840 EVO SSD and show how much the picture changes.

I will go you one better: I ran a series of fsync tests using tmpfs,
and I now have a very clear picture of how the picture changes. The
executive summary is: Tux3 is still way faster, and still scales way
better to large numbers of tasks. I have every confidence that the same
is true of SSD.

> I didn't test tux3, you don't make it easy to get or build.

There is no need to apologize for not testing Tux3. However, it is 
unseemly to throw mud at the same time. Remember, you are the person 
who put so much energy into blocking Tux3 from merging last summer. If
it now takes you a little extra work to build it then it is hard to be 
really sympathetic. Mike apparently did not find it very hard.

>> To focus purely on fsync, I wrote a
>> small utility (at the end of this post) that forks a number of
>> tasks, each of which continuously appends to and fsyncs its own
>> file. For a single task doing 1,000 fsyncs of 1K each, we have:
>> 
>>    Ext4:  34.34s
>>    XFS:   23.63s
>>    Btrfs: 34.84s
>>    Tux3:  17.24s
>
>    Ext4:   1.94s
>    XFS:    2.06s
>    Btrfs:  2.06s
>
> All equally fast, so I can't see how tux3 would be much faster here.

Running the same thing on tmpfs, Tux3 is significantly faster:

     Ext4:   1.40s
     XFS:    1.10s
     Btrfs:  1.56s
     Tux3:   1.07s

>    Tasks:   10      100    1,000    10,000
>    Ext4:   0.05s   0.12s    0.48s     3.99s
>    XFS:    0.25s   0.41s    0.96s     4.07s
>    Btrfs   0.22s   0.50s    2.86s   161.04s
>              (lower is better)
>
> Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
> very much faster as most of the elapsed time in the test is from
> forking the processes that do the IO and fsyncs.

You wish. In fact, Tux3 is a lot faster. You must have made a mistake in 
estimating your fork overhead. It is easy to check: just run "syncs foo 
0 10000". I get 0.23 seconds to fork 10,000 processes, create the files 
and exit. Here are my results on tmpfs, triple checked and reproducible:

    Tasks:   10      100    1,000    10,000
    Ext4:   0.05     0.14    1.53     26.56
    XFS:    0.05     0.16    2.10     29.76
    Btrfs:  0.08     0.37    3.18     34.54
    Tux3:   0.02     0.05    0.18      2.16
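
For anyone repeating the fork overhead check, here is a minimal sketch
of that kind of test in Python. It is not the utility from the original
post: run_sync_test and its parameters are hypothetical names, and the
argument order (basename, syncs per task, number of tasks) is only
inferred from the command lines in this message. Each forked task
appends to and fsyncs its own file. With zero syncs, the elapsed time
is just fork, file creation and exit.

```python
import os
import time


def run_sync_test(basename, syncs, tasks, payload=b"hello world!\n"):
    """Fork `tasks` children; each appends `payload` to its own file and
    fsyncs it, `syncs` times. Returns elapsed wall-clock seconds."""
    start = time.monotonic()
    pids = []
    for i in range(tasks):
        pid = os.fork()
        if pid == 0:
            # Child: work on a private file named <basename><task number>.
            status = 1
            try:
                fd = os.open(f"{basename}{i}",
                             os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
                for _ in range(syncs):
                    os.write(fd, payload)
                    os.fsync(fd)  # block until this file's data is stable
                os.close(fd)
                status = 0
            finally:
                # Always leave the child here, never fall back into the
                # parent's loop, and skip the parent's exit handlers.
                os._exit(status)
        pids.append(pid)

    # Parent: wait for every task and count the ones that failed.
    failed = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            failed += 1
    if failed:
        raise RuntimeError(f"{failed} of {tasks} tasks failed")
    return time.monotonic() - start
```

Calling run_sync_test("foo", 0, 10000) corresponds to the fork overhead
check, and a nonzero syncs count gives the loaded case. Fork and
interpreter costs in Python differ from a C utility, so absolute times
from this sketch are not comparable to the numbers in the tables.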

Note: you should recheck your final number for Btrfs. I have seen Btrfs 
fall off the rails and take wildly longer on some tests just like that.
We know Btrfs has corner case issues; I don't think they deny it. 
Unlike you, Chris Mason is a gentleman when faced with issues. Instead 
of insulting his colleagues and hurling around the sort of abuse that 
has gained LKML its current unenviable reputation, he gets down to work 
and fixes things.

You should do that too: your own house is not in order. XFS has major 
issues. One easily reproducible one is a denial of service during the 
10,000 task test where it takes multiple seconds to cat small files. I 
saw XFS do this on both spinning disk and tmpfs, and I have seen it 
hang for minutes trying to list a directory. I looked a bit into it, and 
I see that you are blocking for aeons trying to acquire a lock in open.

Here is an example. While doing "sync6 fs/foo 10 10000":

time cat fs/foo999
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!

real    0m2.282s
user    0m0.000s
sys     0m0.000s
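
A rough way to take the same measurement from a script, assuming Python
and a hypothetical helper named timed_read. It times only the open plus
read of one file, so unlike "time cat" it excludes process startup.

```python
import time


def timed_read(path):
    """Return (elapsed_seconds, data) for opening and reading `path`."""
    start = time.monotonic()
    with open(path, "rb") as f:  # the open is included in the timing
        data = f.read()
    return time.monotonic() - start, data
```

Run it from a second shell while the 10,000 task test is in progress,
for example timed_read("fs/foo999"), and compare the elapsed time with
the same call on an idle filesystem.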

You and I both know the truth: Ext4 is the only really reliable general 
purpose filesystem on Linux at the moment. XFS is definitely not; I 
have seen ample evidence with my own eyes. What you need is people 
helping you fix your issues instead of making your colleagues angry at 
you with your incessant attacks.

> FWIW, btrfs shows it's horrible fsync implementation here, burning
> huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2
> and a half minutes in that 10000 fork test so wasn't IO bound at
> all.

Btrfs is hot and cold. In my tmpfs tests, Btrfs beats XFS at high 
task counts. It is actually amazing the progress Btrfs has made in 
performance. I for one appreciate the work they are doing and I admire 
the way Chris conducts both himself and his project. I wish you were 
more like Chris, and I wish I was for that matter.

I agree that Btrfs uses too much CPU, but there is no need to be rude 
about it. I think the Btrfs team knows how to use a profiler.

>> Is there any practical use for fast parallel fsync of tens of thousands
>> of tasks? This could be useful for a scalable transaction server
>> that sits directly on the filesystem instead of a database, as is
>> the fashion for big data these days. It certainly can't hurt to know
>> that if you need that kind of scaling, Tux3 will do it.
>
> Ext4 and XFS already do that just fine, too, when you use storage
> suited to such a workload and you have a sane interface for
> submitting tens of thousands of concurrent fsync operations. e.g
>
> http://oss.sgi.com/archives/xfs/2014-06/msg00214.html

Tux3 turns in really great performance with an ordinary, cheap spinning 
disk using standard POSIX ops. It is not for you to tell people they 
don't care about that, and it is wrong for you to imply that we only 
perform well on spinning disk - you don't know that, and it's not true.

By the way, I like your asynchronous fsync, nice work. It by no means
obviates the need for a fast implementation of the standard operation.

> On a SSD (256GB samsung 840 EVO), running 4.0.0:
>
>    Tasks:       8           16           32
>    Ext4:    598.27 MB/s    981.13 MB/s 1233.77 MB/s
>    XFS:     884.62 MB/s   1328.21 MB/s 1373.66 MB/s
>    Btrfs:   201.64 MB/s    137.55 MB/s  108.56 MB/s
>
> dbench looks *very different* when there is no seek latency,
> doesn't it?

It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an assert
for me earlier this evening. It is rare but it happens. I rebooted and 
got sane numbers. Running dbench -t10 on tmpfs I get:

     Tasks:       8            16            32
     Ext4:    660.69 MB/s   708.81 MB/s   720.12 MB/s
     XFS:     692.01 MB/s   388.53 MB/s   134.84 MB/s
     Btrfs:   229.66 MB/s   341.27 MB/s   377.97 MB/s
     Tux3:   1147.12 MB/s  1401.61 MB/s  1283.74 MB/s

Looks like XFS hit a bump and fell off the cliff at 32 threads. I reran
that one many times because I don't want to give you an inaccurate 
report.

Tux3 turned in a great performance. I am not pleased with the negative 
scaling at 32 threads, but it still finishes way ahead.

>> Dbench -t10 -s (all file operations synchronous)
>> 
>>    Tasks:       8           16           32
>>    Ext4:     4.51 MB/s    6.25 MB/s    7.72 MB/s
>>    XFS:      4.24 MB/s    4.77 MB/s    5.15 MB/s
>>    Btrfs:    7.98 MB/s   13.87 MB/s   22.87 MB/s
>>    Tux3:    15.41 MB/s   25.56 MB/s   39.15 MB/s
>>                   (higher is better)
>
>     Ext4:   173.54 MB/s  294.41 MB/s  424.11 MB/s
>     XFS:    172.98 MB/s  342.78 MB/s  458.87 MB/s
>     Btrfs:   36.92 MB/s   34.52 MB/s   55.19 MB/s
>
> Again, the numbers are completely the other way around on a SSD,
> with the conventional filesystems being 5-10x faster than the
> WA/COW style filesystem.

I wouldn't be so sure about that...

     Tasks:       8            16            32
     Ext4:     93.06 MB/s    98.67 MB/s   102.16 MB/s
     XFS:      81.10 MB/s    79.66 MB/s    73.27 MB/s
     Btrfs:    43.77 MB/s    64.81 MB/s    90.35 MB/s
     Tux3:    198.49 MB/s   279.00 MB/s   318.41 MB/s

>> In the full disclosure department, Tux3 is still not properly
>> optimized in some areas. One of them is fragmentation: it is not
>> very hard to make Tux3 slow down by running long tests. Our current
>
> Oh, that still hasn't been fixed?

Count your blessings while you can.

> Until you sort of how you are going to scale allocation to tens of
> TB and not fragment free space over time, fsync performance of the
> filesystem is pretty much irrelevant. Changing the allocation
> algorithms will fundamentally alter the IO patterns and so all these
> benchmarks are essentially meaningless.

Ahem, are you the same person for whom fsync was the most important 
issue in the world last time the topic came up, to the extent of 
spreading around FUD and entirely ignoring the great work we had 
accomplished for regular file operations? I said then that when we got 
around to a proper fsync it would be competitive. Now here it is, so you 
want to change the topic. I understand.

Honestly, you would be a lot better off investigating why our fsync 
algorithm is so good.

Regards,

Daniel