Date: Thu, 30 Apr 2015 11:46:16 +1000
From: Dave Chinner
To: Daniel Phillips
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    tux3@tux3.org, "Theodore Ts'o"
Subject: Re: Tux3 Report: How fast can we fsync?
Message-ID: <20150430014616.GZ15810@dastard>
In-Reply-To: <8f886f13-6550-4322-95be-93244ae61045@phunq.net>

On Tue, Apr 28, 2015 at 04:13:18PM -0700, Daniel Phillips wrote:
> Greetings,
>
> This post is dedicated to Ted, who raised doubts a while back about
> whether Tux3 can ever have a fast fsync:
>
>   https://lkml.org/lkml/2013/5/11/128
>   "Re: Tux3 Report: Faster than tmpfs, what?"

[snip]

> I measured fsync performance using a 7200 RPM disk as a virtual
> drive under KVM, configured with cache=none so that asynchronous
> writes are cached and synchronous writes translate into direct
> writes to the block device.

Yup, a slow single spindle, so fsync performance is determined by the
seek latency of the filesystem. Hence the filesystem that "wins" will
be the one that minimises fsync seek latency above all other
considerations.

http://www.spinics.net/lists/kernel/msg1978216.html

So, to demonstrate, I'll run the same tests using a 256GB Samsung 840
EVO SSD and show how much the picture changes. I didn't test tux3;
you don't make it easy to get or build.

> To focus purely on fsync, I wrote a small utility (at the end of
> this post) that forks a number of tasks, each of which continuously
> appends to and fsyncs its own file. For a single task doing 1,000
> fsyncs of 1K each, we have:
>
>   Ext4:   34.34s
>   XFS:    23.63s
>   Btrfs:  34.84s
>   Tux3:   17.24s

  Ext4:    1.94s
  XFS:     2.06s
  Btrfs:   2.06s

All equally fast, so I can't see how tux3 would be much faster here.

> Things get more interesting with parallel fsyncs. In this test, each
> task does ten fsyncs and task count scales from ten to ten thousand.
> We see that all tested filesystems are able to combine fsyncs into
> group commits, with varying degrees of success:
>
>   Tasks:     10      100    1,000    10,000
>   Ext4:    0.79s    0.98s    4.62s    61.45s
>   XFS:     0.75s    1.68s   20.97s   238.23s
>   Btrfs:   0.53s    0.78s    3.80s    84.34s
>   Tux3:    0.27s    0.34s    1.00s     6.86s

  Tasks:     10      100    1,000    10,000
  Ext4:    0.05s    0.12s    0.48s     3.99s
  XFS:     0.25s    0.41s    0.96s     4.07s
  Btrfs:   0.22s    0.50s    2.86s   161.04s
  (lower is better)

Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
very much faster here, as most of the elapsed time in the test is
spent forking the processes that do the IO and fsyncs.

FWIW, btrfs shows its horrible fsync implementation here, burning
huge amounts of CPU to do bugger all IO: it burnt all 16p for two and
a half minutes in that 10,000-fork test, so it wasn't IO bound at
all.
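As an aside, Daniel's test utility was snipped above, so for reference
here is a minimal sketch of what that kind of fork/append/fsync tester
might look like. This is a reconstruction under stated assumptions,
not the actual utility: the command-line interface, the file naming
and the 1KB append size are illustrative only.

/*
 * Minimal sketch of a fork/append/fsync tester (illustrative, not
 * Daniel's actual utility).  Each forked child appends 1KB to its
 * own file and fsyncs it, repeating nfsyncs times; the parent waits
 * for all children and reports the total elapsed time.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void run_task(const char *dir, int id, int nfsyncs)
{
	char path[4096];
	char buf[1024];
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	snprintf(path, sizeof(path), "%s/fsync-test-%d", dir, id);

	/* Append-only file per task, created if it doesn't exist. */
	fd = open(path, O_CREAT | O_WRONLY | O_APPEND, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < nfsyncs; i++) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			exit(1);
		}
		if (fsync(fd) < 0) {
			perror("fsync");
			exit(1);
		}
	}
	close(fd);
	exit(0);
}

int main(int argc, char **argv)
{
	struct timeval start, end;
	int ntasks, nfsyncs, i;
	const char *dir;
	double elapsed;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <ntasks> <nfsyncs> [dir]\n",
			argv[0]);
		return 1;
	}
	ntasks = atoi(argv[1]);
	nfsyncs = atoi(argv[2]);
	dir = argc > 3 ? argv[3] : ".";

	gettimeofday(&start, NULL);

	/* Fork one child per task; each does its own append+fsync loop. */
	for (i = 0; i < ntasks; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0)
			run_task(dir, i, nfsyncs);
	}

	/* Wait for every child before stopping the clock. */
	while (wait(NULL) > 0)
		;

	gettimeofday(&end, NULL);
	elapsed = (end.tv_sec - start.tv_sec) +
		  (end.tv_usec - start.tv_usec) / 1e6;
	printf("%d tasks x %d fsyncs each: %.2fs\n", ntasks, nfsyncs, elapsed);
	return 0;
}

The single-task case above would correspond to something like
"./fsync-tester 1 1000 /mnt/test", and the largest parallel case to
"./fsync-tester 10000 10 /mnt/test". Note that the fork/wait overhead
is included in the elapsed time, which matters for the 10,000-task
runs.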
> Is there any practical use for fast parallel fsync of tens of
> thousands of tasks? This could be useful for a scalable transaction
> server that sits directly on the filesystem instead of a database,
> as is the fashion for big data these days. It certainly can't hurt
> to know that if you need that kind of scaling, Tux3 will do it.

Ext4 and XFS already do that just fine, too, when you use storage
suited to such a workload and you have a sane interface for
submitting tens of thousands of concurrent fsync operations, e.g.:

http://oss.sgi.com/archives/xfs/2014-06/msg00214.html

> Of course, a pure fsync load could be viewed as somewhat unnatural.
> We also need to know what happens under a realistic load with
> buffered operations mixed with fsyncs. We turn to an old friend,
> dbench:
>
> Dbench -t10
>
>   Tasks:       8            16            32
>   Ext4:    35.32 MB/s    34.08 MB/s    39.71 MB/s
>   XFS:     32.12 MB/s    25.08 MB/s    30.12 MB/s
>   Btrfs:   54.40 MB/s    75.09 MB/s   102.81 MB/s
>   Tux3:    85.82 MB/s   133.69 MB/s   159.78 MB/s
>   (higher is better)

On an SSD (256GB Samsung 840 EVO), running 4.0.0:

  Tasks:       8             16             32
  Ext4:    598.27 MB/s    981.13 MB/s   1233.77 MB/s
  XFS:     884.62 MB/s   1328.21 MB/s   1373.66 MB/s
  Btrfs:   201.64 MB/s    137.55 MB/s    108.56 MB/s

dbench looks *very different* when there is no seek latency, doesn't
it?

> Dbench -t10 -s (all file operations synchronous)
>
>   Tasks:      8            16            32
>   Ext4:    4.51 MB/s    6.25 MB/s    7.72 MB/s
>   XFS:     4.24 MB/s    4.77 MB/s    5.15 MB/s
>   Btrfs:   7.98 MB/s   13.87 MB/s   22.87 MB/s
>   Tux3:   15.41 MB/s   25.56 MB/s   39.15 MB/s
>   (higher is better)

  Tasks:      8             16             32
  Ext4:   173.54 MB/s   294.41 MB/s   424.11 MB/s
  XFS:    172.98 MB/s   342.78 MB/s   458.87 MB/s
  Btrfs:   36.92 MB/s    34.52 MB/s    55.19 MB/s

Again, the numbers are completely the other way around on an SSD,
with the conventional filesystems being 5-10x faster than the WA/COW
style filesystem.

....

> In the full disclosure department, Tux3 is still not properly
> optimized in some areas. One of them is fragmentation: it is not
> very hard to make Tux3 slow down by running long tests. Our current

Oh, that still hasn't been fixed? Until you sort out how you are
going to scale allocation to tens of TB and not fragment free space
over time, fsync performance of the filesystem is pretty much
irrelevant. Changing the allocation algorithms will fundamentally
alter the IO patterns, and so all these benchmarks are essentially
meaningless.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com