From: Daniel Phillips
To: Dave Chinner
Cc: "Theodore Ts'o"
Subject: Re: Tux3 Report: How fast can we fsync?
Date: Thu, 30 Apr 2015 03:28:13 -0700
Message-ID: <81488fcb-b5d5-4761-b8ae-936dce9c1f89@phunq.net>
In-Reply-To: <20150430014616.GZ15810@dastard>
References: <8f886f13-6550-4322-95be-93244ae61045@phunq.net> <20150430014616.GZ15810@dastard>
Organization: tux3.org

On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:
>> I measured fsync performance using a 7200 RPM disk as a virtual
>> drive under KVM, configured with cache=none so that asynchronous
>> writes are cached and synchronous writes translate into direct
>> writes to the block device.
>
> Yup, a slow single spindle, so fsync performance is determined by
> seek latency of the filesystem. Hence the filesystem that "wins"
> will be the filesystem that minimises fsync seek latency above all
> other considerations.
>
> http://www.spinics.net/lists/kernel/msg1978216.html

If you want to declare that XFS only works well on solid state disks
and big storage arrays, that is your business. But if you do, you can
no longer call XFS a general purpose filesystem. And if you would
rather disparage people who report genuine performance bugs than get
down to fixing them, that is your business too.
Don't expect to be able to stop the bug reports by bluster.

> So, to demonstrate, I'll run the same tests but using a 256GB
> samsung 840 EVO SSD and show how much the picture changes.

I will go you one better: I ran a series of fsync tests using tmpfs,
and I now have a very clear picture of how the picture changes. The
executive summary is: Tux3 is still way faster, and still scales way
better to large numbers of tasks. I have every confidence that the
same is true of SSD.

> I didn't test tux3, you don't make it easy to get or build.

There is no need to apologize for not testing Tux3. However, it is
unseemly to throw mud at the same time. Remember, you are the person
who put so much energy into blocking Tux3 from merging last summer.
If it now takes you a little extra work to build it then it is hard
to be really sympathetic. Mike apparently did not find it very hard.

>> To focus purely on fsync, I wrote a
>> small utility (at the end of this post) that forks a number of
>> tasks, each of which continuously appends to and fsyncs its own
>> file. For a single task doing 1,000 fsyncs of 1K each, we have:
>>
>>    Ext4:  34.34s
>>    XFS:   23.63s
>>    Btrfs: 34.84s
>>    Tux3:  17.24s
>
>    Ext4:   1.94s
>    XFS:    2.06s
>    Btrfs:  2.06s
>
> All equally fast, so I can't see how tux3 would be much faster here.

Running the same thing on tmpfs, Tux3 is significantly faster:

    Ext4:   1.40s
    XFS:    1.10s
    Btrfs:  1.56s
    Tux3:   1.07s

>    Tasks:    10       100     1,000    10,000
>    Ext4:    0.05s    0.12s    0.48s     3.99s
>    XFS:     0.25s    0.41s    0.96s     4.07s
>    Btrfs    0.22s    0.50s    2.86s   161.04s
>    (lower is better)
>
> Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
> very much faster as most of the elapsed time in the test is from
> forking the processes that do the IO and fsyncs.

You wish. In fact, Tux3 is a lot faster. You must have made a mistake
in estimating your fork overhead. It is easy to check, just run
"syncs foo 0 10000". I get 0.23 seconds to fork 10,000 processes,
create the files and exit.
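For readers without the test utility at hand, here is a minimal sketch
of the kind of test discussed above, written in Python. It is not the
syncs utility from the original post: the function names run_task and
run_benchmark, the "hello world!\n" payload and the basename-plus-index
file naming are assumptions made for illustration, loosely modeled on
the "syncs foo 0 10000" invocation and the "fs/foo999" file name.
Timings from an interpreted sketch like this will not be comparable to
the figures in this thread.

```python
import os
import time


def run_task(path, syncs, payload):
    """Create path if needed, then append payload and fsync, syncs times."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for _ in range(syncs):
            os.write(fd, payload)
            os.fsync(fd)  # force this file's data to stable storage
    finally:
        os.close(fd)


def run_benchmark(basename, syncs, tasks, payload=b"hello world!\n"):
    """Fork `tasks` children, each appending to its own file.

    Returns the elapsed wall-clock seconds. With syncs == 0 the children
    only create their files and exit, which isolates the fork and create
    overhead from the fsync cost.
    """
    start = time.monotonic()
    children = []
    for i in range(tasks):
        pid = os.fork()
        if pid == 0:
            # Child process: do the work and leave immediately, without
            # running any of the parent's cleanup handlers.
            status = 0
            try:
                run_task(f"{basename}{i}", syncs, payload)
            except OSError:
                status = 1
            os._exit(status)
        children.append(pid)

    failed = 0
    for pid in children:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            failed += 1
    if failed:
        raise RuntimeError(f"{failed} of {tasks} tasks failed")
    return time.monotonic() - start
```

Calling run_benchmark("foo", 0, 10000) would then estimate the fork and
create overhead alone, and run_benchmark("foo", 10, 10000) the same run
with ten fsyncs per task. The process limit (ulimit -u) may need to be
raised for task counts this large.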
Here are my results on tmpfs, triple checked and reproducible:

    Tasks:    10      100     1,000    10,000
    Ext4:    0.05     0.14     1.53     26.56
    XFS:     0.05     0.16     2.10     29.76
    Btrfs:   0.08     0.37     3.18     34.54
    Tux3:    0.02     0.05     0.18      2.16

Note: you should recheck your final number for Btrfs. I have seen
Btrfs fall off the rails and take wildly longer on some tests just
like that.

We know Btrfs has corner case issues, I don't think they deny it.
Unlike you, Chris Mason is a gentleman when faced with issues.
Instead of insulting his colleagues and hurling around the sort of
abuse that has gained LKML its current unenviable reputation, he gets
down to work and fixes things.

You should do that too, your own house is not in order. XFS has major
issues. One easily reproducible one is a denial of service during the
10,000 task test where it takes multiple seconds to cat small files.
I saw XFS do this on both spinning disk and tmpfs, and I have seen it
hang for minutes trying to list a directory. I looked a bit into it,
and I see that you are blocking for aeons trying to acquire a lock in
open.

Here is an example. While doing "sync6 fs/foo 10 10000":

    time cat fs/foo999
    hello world!
    hello world!
    hello world!
    hello world!
    hello world!
    hello world!
    hello world!
    hello world!
    hello world!
    hello world!

    real    0m2.282s
    user    0m0.000s
    sys     0m0.000s

You and I both know the truth: Ext4 is the only really reliable
general purpose filesystem on Linux at the moment. XFS is definitely
not, I have seen ample evidence with my own eyes. What you need is
people helping you fix your issues instead of making your colleagues
angry at you with your incessant attacks.

> FWIW, btrfs shows it's horrible fsync implementation here, burning
> huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2
> and a half minutes in that 10000 fork test so wasn't IO bound at
> all.

Btrfs is hot and cold. In my tmpfs tests, Btrfs beats XFS at high
task counts. It is actually amazing the progress Btrfs has made in
performance.
I for one appreciate the work they are doing and I admire the way
Chris conducts both himself and his project. I wish you were more
like Chris, and I wish I was for that matter.

I agree that Btrfs uses too much CPU, but there is no need to be rude
about it. I think the Btrfs team knows how to use a profiler.

>> Is there any practical use for fast parallel fsync of tens of thousands
>> of tasks? This could be useful for a scalable transaction server
>> that sits directly on the filesystem instead of a database, as is
>> the fashion for big data these days. It certainly can't hurt to know
>> that if you need that kind of scaling, Tux3 will do it.
>
> Ext4 and XFS already do that just fine, too, when you use storage
> suited to such a workload and you have a sane interface for
> submitting tens of thousands of concurrent fsync operations. e.g
>
> http://oss.sgi.com/archives/xfs/2014-06/msg00214.html

Tux3 turns in really great performance with an ordinary, cheap
spinning disk using standard Posix ops. It is not for you to tell
people they don't care about that, and it is wrong for you to imply
that we only perform well on spinning disk - you don't know that, and
it's not true.

By the way, I like your asynchronous fsync, nice work. It by no means
obviates the need for a fast implementation of the standard
operation.

> On a SSD (256GB samsung 840 EVO), running 4.0.0:
>
>    Tasks:        8              16              32
>    Ext4:    598.27 MB/s    981.13 MB/s   1233.77 MB/s
>    XFS:     884.62 MB/s   1328.21 MB/s   1373.66 MB/s
>    Btrfs:   201.64 MB/s    137.55 MB/s    108.56 MB/s
>
> dbench looks *very different* when there is no seek latency,
> doesn't it?

It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an
assert for me earlier this evening. It is rare but it happens. I
rebooted and got sane numbers.
Running dbench -t10 on tmpfs I get:

    Tasks:        8              16              32
    Ext4:    660.69 MB/s    708.81 MB/s    720.12 MB/s
    XFS:     692.01 MB/s    388.53 MB/s    134.84 MB/s
    Btrfs:   229.66 MB/s    341.27 MB/s    377.97 MB/s
    Tux3:   1147.12 MB/s   1401.61 MB/s   1283.74 MB/s

Looks like XFS hit a bump and fell off the cliff at 32 threads. I
reran that one many times because I don't want to give you an
inaccurate report.

Tux3 turned in a great performance. I am not pleased with the
negative scaling at 32 threads, but it still finishes way ahead.

>> Dbench -t10 -s (all file operations synchronous)
>>
>>    Tasks:       8             16             32
>>    Ext4:     4.51 MB/s     6.25 MB/s     7.72 MB/s
>>    XFS:      4.24 MB/s     4.77 MB/s     5.15 MB/s
>>    Btrfs:    7.98 MB/s    13.87 MB/s    22.87 MB/s
>>    Tux3:    15.41 MB/s    25.56 MB/s    39.15 MB/s
>>    (higher is better)
>
>    Ext4:   173.54 MB/s   294.41 MB/s   424.11 MB/s
>    XFS:    172.98 MB/s   342.78 MB/s   458.87 MB/s
>    Btrfs:   36.92 MB/s    34.52 MB/s    55.19 MB/s
>
> Again, the numbers are completely the other way around on a SSD,
> with the conventional filesystems being 5-10x faster than the
> WA/COW style filesystem.

I wouldn't be so sure about that...

    Tasks:       8              16              32
    Ext4:    93.06 MB/s     98.67 MB/s    102.16 MB/s
    XFS:     81.10 MB/s     79.66 MB/s     73.27 MB/s
    Btrfs:   43.77 MB/s     64.81 MB/s     90.35 MB/s
    Tux3:   198.49 MB/s    279.00 MB/s    318.41 MB/s

>> In the full disclosure department, Tux3 is still not properly
>> optimized in some areas. One of them is fragmentation: it is not
>> very hard to make Tux3 slow down by running long tests. Our current
>
> Oh, that still hasn't been fixed?

Count your blessings while you can.

> Until you sort of how you are going to scale allocation to tens of
> TB and not fragment free space over time, fsync performance of the
> filesystem is pretty much irrelevant. Changing the allocation
> algorithms will fundamentally alter the IO patterns and so all these
> benchmarks are essentially meaningless.
Ahem, are you the same person for whom fsync was the most important
issue in the world last time the topic came up, to the extent of
spreading around FUD and entirely ignoring the great work we had
accomplished for regular file operations? I said then that when we
got around to a proper fsync it would be competitive. Now here it is,
so you want to change the topic. I understand.

Honestly, you would be a lot better off investigating why our fsync
algorithm is so good.

Regards,

Daniel