Date: Sat, 2 May 2015 01:38:55 +1000
From: Dave Chinner
To: Daniel Phillips
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, tux3@tux3.org, "Theodore Ts'o"
Subject: Re: Tux3 Report: How fast can we fsync?
Message-ID: <20150501153855.GB15810@dastard>

On Thu, Apr 30, 2015 at 03:28:13AM -0700, Daniel Phillips wrote:
> On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:
> >>I measured fsync performance using a 7200 RPM disk as a virtual
> >>drive under KVM, configured with cache=none so that asynchronous
> >>writes are cached and synchronous writes translate into direct
> >>writes to the block device.
> >
> >Yup, a slow single spindle, so fsync performance is determined by
> >seek latency of the filesystem. Hence the filesystem that "wins"
> >will be the filesystem that minimises fsync seek latency above
> >all other considerations.
> >
> >http://www.spinics.net/lists/kernel/msg1978216.html
>
> If you want to declare that XFS only works well on solid state
> disks and big storage arrays, that is your business. But if you
> do, you can no longer call XFS a general purpose filesystem. And

Well, yes - I never claimed XFS is a general purpose filesystem. It
is a high performance filesystem. It is also becoming more relevant
to general purpose systems as low cost storage gains capabilities
that used to be considered the domain of high performance storage...

> >So, to demonstrate, I'll run the same tests but using a 256GB
> >samsung 840 EVO SSD and show how much the picture changes.
>
> I will go you one better, I ran a series of fsync tests using
> tmpfs, and I now have a very clear picture of how the picture
> changes. The executive summary is: Tux3 is still way faster, and
> still scales way better to large numbers of tasks. I have every
> confidence that the same is true of SSD.

/dev/ramX can't be compared to an SSD. Yes, they both have low
seek/IO latency but they have very different dispatch and IO
concurrency models. One is synchronous, the other is fully
asynchronous.

This is an important distinction, as we'll see later on....

> >I didn't test tux3, you don't make it easy to get or build.
>
> There is no need to apologize for not testing Tux3, however, it is
> unseemly to throw mud at the same time. Remember, you are the

These trees:

git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git

have not been updated for 11 months. I thought tux3 had died long
ago.

You should keep them up to date, and send patches for xfstests to
support tux3, and then you'll get a lot more people running,
testing and breaking tux3....

> >>To focus purely on fsync, I wrote a
> >>small utility (at the end of this post) that forks a number of
> >>tasks, each of which continuously appends to and fsyncs its own
> >>file.
> >>For a single task doing 1,000 fsyncs of 1K each, we have:
.....
> >All equally fast, so I can't see how tux3 would be much faster here.
>
> Running the same thing on tmpfs, Tux3 is significantly faster:
>
>     Ext4:   1.40s
>     XFS:    1.10s
>     Btrfs:  1.56s
>     Tux3:   1.07s

3% is not "significantly faster". It's within run to run variation!

> >   Tasks:    10      100    1,000    10,000
> >   Ext4:    0.05s   0.12s   0.48s     3.99s
> >   XFS:     0.25s   0.41s   0.96s     4.07s
> >   Btrfs:   0.22s   0.50s   2.86s   161.04s
> >   (lower is better)
> >
> >Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
> >very much faster as most of the elapsed time in the test is from
> >forking the processes that do the IO and fsyncs.
>
> You wish. In fact, Tux3 is a lot faster.

Yes, it's easy to be fast when you have simple, naive
algorithms and an empty filesystem.

> triple checked and reproducible:
>
>    Tasks:    10      100    1,000    10,000
>    Ext4:    0.05    0.14    1.53     26.56
>    XFS:     0.05    0.16    2.10     29.76
>    Btrfs:   0.08    0.37    3.18     34.54
>    Tux3:    0.02    0.05    0.18      2.16

Yet I can't reproduce those XFS or ext4 numbers you are quoting
there. e.g. XFS on a 4GB ram disk:

$ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time ./test-fsync /mnt/test/foo 10 $i; done

real    0m0.030s
user    0m0.000s
sys     0m0.014s

real    0m0.031s
user    0m0.008s
sys     0m0.157s

real    0m0.305s
user    0m0.029s
sys     0m1.555s

real    0m3.624s
user    0m0.219s
sys     0m17.631s
$

That's roughly 10x faster than your numbers. Can you describe your
test setup in detail? e.g. post the full log from block device
creation to benchmark completion so I can reproduce what you are
doing exactly?

> Note: you should recheck your final number for Btrfs. I have seen
> Btrfs fall off the rails and take wildly longer on some tests just
> like that.

Completely reproducible:

$ sudo mkfs.btrfs -f /dev/vdc
Btrfs v3.16.2
See http://btrfs.wiki.kernel.org for more information.
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdc
        nodesize 16384 leafsize 16384 sectorsize 4096 size 500.00TiB
$ sudo mount /dev/vdc /mnt/test
$ sudo chmod 777 /mnt/test
$ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time ./test-fsync /mnt/test/foo 10 $i; done

real    0m0.068s
user    0m0.000s
sys     0m0.061s

real    0m0.563s
user    0m0.001s
sys     0m2.047s

real    0m2.851s
user    0m0.040s
sys     0m24.503s

real    2m38.713s
user    0m0.533s
sys     38m34.831s

Same result - ~160s burning all 16 CPUs, as can be seen by the
system time.

And even on a 4GB ram disk, the 10000 process test comes in at:

real    0m35.567s
user    0m0.707s
sys     6m1.922s

That's the same wall time as your test, but the CPU burn on my
machine is still clearly evident. You indicated that it's not doing
this on your machine, so I don't think we can really use btrfs
numbers for comparison purposes if it is behaving so differently on
different machines....

[snip]

> One easily reproducible one is a denial of service
> during the 10,000 task test where it takes multiple seconds to cat
> small files. I saw XFS do this on both spinning disk and tmpfs, and
> I have seen it hang for minutes trying to list a directory. I looked
> a bit into it, and I see that you are blocking for aeons trying to
> acquire a lock in open.

Yes, that's the usual case when XFS is waiting on buffer readahead
IO completion. The latency of which is completely determined by
block layer queuing and scheduling behaviour. And the block device
queue is being dominated by the 10,000 concurrent write processes
you just ran.....

"Doctor, it hurts when I do this!"

[snip]

> You and I both know the truth: Ext4 is the only really reliable
> general purpose filesystem on Linux at the moment.
BWAHAHAHAHAHAHAH-*choke*

*cough*

*cough*

/me wipes tears from his eyes

That's the funniest thing I've read in a long time :)

[snip]

> >On a SSD (256GB samsung 840 EVO), running 4.0.0:
> >
> >   Tasks:        8             16             32
> >   Ext4:    598.27 MB/s    981.13 MB/s   1233.77 MB/s
> >   XFS:     884.62 MB/s   1328.21 MB/s   1373.66 MB/s
> >   Btrfs:   201.64 MB/s    137.55 MB/s    108.56 MB/s
> >
> >dbench looks *very different* when there is no seek latency,
> >doesn't it?
>
> It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an assert
> for me earlier this evening. It is rare but it happens. I rebooted
> and got sane numbers. Running dbench -t10 on tmpfs I get:
>
>     Tasks:        8             16             32
>     Ext4:    660.69 MB/s    708.81 MB/s    720.12 MB/s
>     XFS:     692.01 MB/s    388.53 MB/s    134.84 MB/s
>     Btrfs:   229.66 MB/s    341.27 MB/s    377.97 MB/s
>     Tux3:   1147.12 MB/s   1401.61 MB/s   1283.74 MB/s
>
> Looks like XFS hit a bump and fell off the cliff at 32 threads. I reran
> that one many times because I don't want to give you an inaccurate
> report.

I can't reproduce those numbers, either. On /dev/ram0:

    Tasks:        8             16             32
    Ext4:   1416.11 MB/s   1585.81 MB/s   1406.18 MB/s
    XFS:    2580.58 MB/s   1367.48 MB/s    994.46 MB/s
    Btrfs:   151.89 MB/s     84.88 MB/s     73.16 MB/s

Still, that negative XFS scalability shouldn't be occurring - it
should level off and be much flatter if everything is working
correctly.

Ah. Ram disks and synchronous IO.....

The XFS journal is a completely asynchronous IO engine and the
synchronous IO done by the ram disk really screws with the
concurrency model. There are journal write aggregation optimisations
that are based on the "buffer under IO" state detection, which is
completely skipped when journal IO is synchronous and completed in
the submission context. This problem doesn't occur on actual storage
devices where IO is asynchronous.

So, yes, dbench can trigger an interesting behaviour in XFS, but
it's well understood and doesn't actually affect normal storage
devices.
If you need a volatile filesystem for performance reasons then
tmpfs is what you want, not XFS....

[ Feel free to skip the detail:

Let's go back to that SSD, which does asynchronous IO and so
the journal operates fully asynchronously:

$ for i in 8 16 32 64 128 256; do dbench -t10 $i -D /mnt/test; done
Throughput 811.806 MB/sec  8 clients  8 procs  max_latency=12.152 ms
Throughput 1285.47 MB/sec  16 clients  16 procs  max_latency=22.880 ms
Throughput 1516.22 MB/sec  32 clients  32 procs  max_latency=73.381 ms
Throughput 1724.57 MB/sec  64 clients  64 procs  max_latency=256.681 ms
Throughput 2046.91 MB/sec  128 clients  128 procs  max_latency=1068.169 ms
Throughput 1895.4 MB/sec  256 clients  256 procs  max_latency=3157.738 ms

So performance improves out to 128 processes and then the SSD runs
out of capacity - it's doing >400MB/s write IO at 128 clients. That
makes latency blow out as we add more load, so it doesn't go any
faster and we start to back up on the log. Hence we slowly start to
go backwards as client count continues to increase and contention
builds up on global wait queues.

Now, XFS has 8 log buffers and so can issue 8 concurrent journal
writes. Let's run dbench with fewer processes on a ram disk, and see
what happens as we increase the number of processes doing IO and
hence triggering journal writes:

$ for i in 1 2 4 6 8; do dbench -t10 $i -D /mnt/test |grep Throughput; done
Throughput 653.163 MB/sec  1 clients  1 procs  max_latency=0.355 ms
Throughput 1273.65 MB/sec  2 clients  2 procs  max_latency=3.947 ms
Throughput 2189.19 MB/sec  4 clients  4 procs  max_latency=7.582 ms
Throughput 2318.33 MB/sec  6 clients  6 procs  max_latency=8.023 ms
Throughput 2212.85 MB/sec  8 clients  8 procs  max_latency=9.120 ms

Yeah, ok, we scale out to 4 processes, then level off. That's going
to be limited by allocation concurrency during writes, not the
journal (the default is 4 AGs on a filesystem so small). Let's make
16 AGs, cause seeks don't matter on a ram disk.
$ sudo mkfs.xfs -f -d agcount=16 /dev/ram0
....
$ for i in 1 2 4 6 8; do dbench -t10 $i -D /mnt/test |grep Throughput; done
Throughput 656.189 MB/sec  1 clients  1 procs  max_latency=0.565 ms
Throughput 1277.25 MB/sec  2 clients  2 procs  max_latency=3.739 ms
Throughput 2350.73 MB/sec  4 clients  4 procs  max_latency=5.126 ms
Throughput 2754.3 MB/sec  6 clients  6 procs  max_latency=8.063 ms
Throughput 3135.11 MB/sec  8 clients  8 procs  max_latency=6.746 ms

Yup, as expected we continue to increase performance out to 8
processes now that there isn't an allocation concurrency limit being
hit.

What happens as we pass 8 processes now?

$ for i in 4 8 12 16; do dbench -t10 $i -D /mnt/test |grep Throughput; done
Throughput 2277.53 MB/sec  4 clients  4 procs  max_latency=5.778 ms
Throughput 3070.3 MB/sec  8 clients  8 procs  max_latency=7.808 ms
Throughput 2555.29 MB/sec  12 clients  12 procs  max_latency=8.518 ms
Throughput 1868.96 MB/sec  16 clients  16 procs  max_latency=14.193 ms
$

As expected, past 8 processes performance tails off because the
journal state machine is not scheduling after dispatch of the
journal IO, and hence is not allowing other threads to aggregate
journal writes into the next active log buffer: there is no
"under IO" stage in the state machine for it to trigger log write
aggregation delays off.

I'd completely forgotten about this - I discovered it 3 or 4 years
ago, and then simply stopped using ramdisks for performance testing
because I could get better performance from XFS on highly
concurrent workloads from real storage.
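
The shape of that last run is easier to compare across filesystems and devices if the summary lines are reduced to "percent of peak". The following is only a rough Python sketch of that bookkeeping: the names parse_dbench and peak are made up for illustration, the regular expression assumes the one-line "Throughput ... clients ... procs max_latency=... ms" summary format exactly as quoted in this mail, and the sample data is the 4/8/12/16 client run on the agcount=16 ram disk.

```python
import re

# Matches the dbench summary lines quoted in this thread, e.g.
# "Throughput 3070.3 MB/sec 8 clients 8 procs max_latency=7.808 ms"
SUMMARY_RE = re.compile(
    r"Throughput\s+([\d.]+)\s+MB/sec\s+(\d+)\s+clients\s+\d+\s+procs"
    r"\s+max_latency=([\d.]+)\s+ms"
)


def parse_dbench(text):
    """Return a list of (clients, throughput_mb_s, max_latency_ms) tuples."""
    return [
        (int(m.group(2)), float(m.group(1)), float(m.group(3)))
        for m in SUMMARY_RE.finditer(text)
    ]


def peak(results):
    """Return the result tuple with the highest throughput."""
    return max(results, key=lambda r: r[1])


# Numbers copied from the 4/8/12/16 client run on the agcount=16 ram disk.
SAMPLE = """\
Throughput 2277.53 MB/sec 4 clients 4 procs max_latency=5.778 ms
Throughput 3070.3 MB/sec 8 clients 8 procs max_latency=7.808 ms
Throughput 2555.29 MB/sec 12 clients 12 procs max_latency=8.518 ms
Throughput 1868.96 MB/sec 16 clients 16 procs max_latency=14.193 ms
"""

results = parse_dbench(SAMPLE)
best_clients, best_mbps, _ = peak(results)
for clients, mbps, latency in results:
    # Throughput of each run relative to the best run, as a percentage.
    share = 100 * mbps / best_mbps
    print(f"{clients:3d} clients: {mbps:8.2f} MB/s ({share:5.1f}% of peak)")
```

On this data the peak is the 8 client run, which lines up with the 8 log buffers mentioned earlier. Piping real dbench output through the same parser would let the ram disk and SSD runs be compared on the same footing.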
]

> >>Dbench -t10 -s (all file operations synchronous)
> >>
> >>   Tasks:       8            16            32
> >>   Ext4:     4.51 MB/s    6.25 MB/s    7.72 MB/s
> >>   XFS:      4.24 MB/s    4.77 MB/s    5.15 MB/s
> >>   Btrfs:    7.98 MB/s   13.87 MB/s   22.87 MB/s
> >>   Tux3:    15.41 MB/s   25.56 MB/s   39.15 MB/s
> >>   (higher is better)
> >
> >    Ext4:   173.54 MB/s  294.41 MB/s  424.11 MB/s
> >    XFS:    172.98 MB/s  342.78 MB/s  458.87 MB/s
> >    Btrfs:   36.92 MB/s   34.52 MB/s   55.19 MB/s
> >
> >Again, the numbers are completely the other way around on a SSD,
> >with the conventional filesystems being 5-10x faster than the
> >WA/COW style filesystem.
>
> I wouldn't be so sure about that...
>
>     Tasks:       8            16            32
>     Ext4:    93.06 MB/s   98.67 MB/s  102.16 MB/s
>     XFS:     81.10 MB/s   79.66 MB/s   73.27 MB/s
>     Btrfs:   43.77 MB/s   64.81 MB/s   90.35 MB/s
>     Tux3:   198.49 MB/s  279.00 MB/s  318.41 MB/s

    Ext4:   807.21 MB/s 1089.89 MB/s  867.55 MB/s
    XFS:    997.77 MB/s 1011.51 MB/s  876.49 MB/s
    Btrfs:   55.66 MB/s   56.77 MB/s   60.30 MB/s

Numbers are again very different for XFS and ext4 on /dev/ramX on my
system. Need to work out why yours are so low....

> >Until you sort out how you are going to scale allocation to tens of
> >TB and not fragment free space over time, fsync performance of the
> >filesystem is pretty much irrelevant. Changing the allocation
> >algorithms will fundamentally alter the IO patterns and so all these
> >benchmarks are essentially meaningless.
>
> Ahem, are you the same person for whom fsync was the most important
> issue in the world last time the topic came up, to the extent of
> spreading around FUD and entirely ignoring the great work we had
> accomplished for regular file operations?

Actually, I don't remember any discussions about fsync.

Things I remember that needed addressing are:
	- the lack of ENOSPC detection
	- the writeback integration issues
	- the code cleanliness issues (ifdef mess, etc)
	- the page forking design problems
	- the lack of scalable inode and space allocation
	  algorithms.

Those are the things I remember, and fsync performance pales in
comparison to those.

> I said then that when we
> got around to a proper fsync it would be competitive. Now here it
> is, so you want to change the topic. I understand.

I haven't changed the topic, just the storage medium. The simple
fact is that the world is moving away from slow sata storage at a
pretty rapid pace and it's mostly going solid state. Spinning disks
are also changing - they are going to ZBC based SMR, which is a
completely different problem space which doesn't even appear to be
on the tux3 radar....

So where does tux3 fit into a storage future of byte addressable
persistent memory and ZBC based SMR devices?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
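
For anyone trying to reproduce the test-fsync runs quoted in this thread: the utility itself is not included in this message, so what follows is only a minimal Python sketch of the behaviour described for it (fork a number of tasks, each of which appends to and fsyncs its own file). It is not Daniel's program. The function name run_fsync_test is hypothetical, and the reading of `./test-fsync /mnt/test/foo 10 $i` as "base path, fsyncs per task, number of tasks" is an assumption based on the task counts in the tables; the 1K write size follows the "1,000 fsyncs of 1K each" description.

```python
import os
import time


def run_fsync_test(basepath, steps, tasks, size=1024):
    """Fork `tasks` children; each appends `size` bytes and fsyncs, `steps` times.

    Returns elapsed wall-clock seconds for the whole run. The argument
    order (basepath, steps, tasks) is an assumed reading of the
    `./test-fsync /mnt/test/foo 10 $i` invocations, not a documented one.
    """
    payload = b"x" * size
    start = time.monotonic()
    children = []
    for task in range(tasks):
        pid = os.fork()
        if pid == 0:
            # Child: append to its own file and force every write to
            # stable storage before issuing the next one.
            status = 1
            try:
                fd = os.open(
                    f"{basepath}{task}",
                    os.O_WRONLY | os.O_CREAT | os.O_APPEND,
                    0o644,
                )
                for _ in range(steps):
                    os.write(fd, payload)
                    os.fsync(fd)
                os.close(fd)
                status = 0
            finally:
                # Never fall back into the parent's code path.
                os._exit(status)
        children.append(pid)

    # Parent: reap every child and count the ones that did not exit cleanly.
    failed = 0
    for pid in children:
        _, wait_status = os.waitpid(pid, 0)
        if wait_status != 0:
            failed += 1
    if failed:
        raise RuntimeError(f"{failed} of {tasks} tasks failed")
    return time.monotonic() - start
```

Calling run_fsync_test("/mnt/test/foo", 10, 100) would then correspond, under the assumptions above, to the 100 task column. Fork and interpreter overhead in Python will differ from a compiled utility, so absolute times from this sketch are not comparable with the numbers in the thread; it is only useful for checking relative behaviour on one machine.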