From: Daniel Phillips
Date: Fri, 01 May 2015 16:20:54 -0700
To: Dave Chinner
CC: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 tux3@tux3.org, "Theodore Ts'o"
Subject: Re: Tux3 Report: How fast can we fsync?

On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>
> Well, yes - I never claimed XFS is a general purpose filesystem. It
> is a high performance filesystem. It is also becoming more relevant
> to general purpose systems as low cost storage gains capabilities
> that used to be considered the domain of high performance storage...

OK. Well, Tux3 is general purpose, and that means we care about a
single spinning disk and small systems.

>>> So, to demonstrate, I'll run the same tests but using a 256GB
>>> samsung 840 EVO SSD and show how much the picture changes.
>>
>> I will go you one better: I ran a series of fsync tests using
>> tmpfs, and I now have a very clear picture of how the picture
>> changes. The executive summary is: Tux3 is still way faster, and
>> still scales way better to large numbers of tasks. I have every
>> confidence that the same is true of SSD.
>
> /dev/ramX can't be compared to an SSD. Yes, they both have low
> seek/IO latency but they have very different dispatch and IO
> concurrency models. One is synchronous, the other is fully
> asynchronous.

I had RAM available and no SSD handy to abuse. I was interested in
measuring the filesystem overhead with the device factored out. I
mounted loopback on a tmpfs file, which seems to be about the same as
/dev/ram, maybe slightly faster, but much easier to configure (a rough
sketch of the setup is below).

I ran some tests on a ramdisk just now and was mortified to find that
I have to reboot to empty the disk. It would take a compelling reason
before I do that again.

> This is an important distinction, as we'll see later on....

I regard it as predictive of Tux3 performance on NVM.

> These trees:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
>  git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git
>
> have not been updated for 11 months. I thought tux3 had died long
> ago.
>
> You should keep them up to date, and send patches for xfstests to
> support tux3, and then you'll get a lot more people running,
> testing and breaking tux3....

People are starting to show up to do testing now, pretty much for the
first time, so we must do some housecleaning. It is gratifying that
Tux3 never broke for Mike, but of course at the moment it will still
assert just by running out of space. As you rightly point out, that
fix is urgent, and it is my current project.
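For anyone who wants to poke at the same configuration, the
loopback-on-tmpfs setup amounts to roughly the following. The sizes,
paths and choice of mkfs are only examples, not a transcript of my
test logs:

    # Back a filesystem image with tmpfs and attach it via loopback.
    mkdir -p /tmp/ramback /mnt/test
    mount -t tmpfs -o size=6g tmpfs /tmp/ramback
    truncate -s 4g /tmp/ramback/fs.img
    losetup /dev/loop0 /tmp/ramback/fs.img
    mkfs.ext4 /dev/loop0   # or the mkfs for whichever filesystem is under test
    mount /dev/loop0 /mnt/test

Unlike the ramdisk, tearing this down is just umount, losetup -d and
unmounting the tmpfs; no reboot required.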
>> Running the same thing on tmpfs, Tux3 is significantly faster:
>>
>>    Ext4:   1.40s
>>    XFS:    1.10s
>>    Btrfs:  1.56s
>>    Tux3:   1.07s
>
> 3% is not "significantly faster". It's within run-to-run variation!

You are right, XFS and Tux3 are within experimental error for single
syncs on the ram disk, while Ext4 and Btrfs are way slower:

   Ext4:   1.59s
   XFS:    1.11s
   Btrfs:  1.70s
   Tux3:   1.11s

A distinct performance gap appears between Tux3 and XFS as the number
of parallel tasks increases.

>> You wish. In fact, Tux3 is a lot faster. ...
>
> Yes, it's easy to be fast when you have simple, naive algorithms and
> an empty filesystem.

No, it isn't, or the others would be fast too. In any case, our
algorithms are far from naive, except for allocation. You can rest
assured that when allocation is brought up to a respectable standard
in the fullness of time, it will be competitive and will not harm our
clean-filesystem performance at all.

There is no call for you to disparage our current achievements, which
are significant. I do not mind some healthy skepticism about the
allocation work; you know as well as anyone how hard it is. However,
your denial of our current result is irritating and creates the
impression that you have an agenda. If you want to complain about
something real, complain that our current code drop is not done yet.
I will humbly apologize, and the same for enospc.

>> triple checked and reproducible:
>>
>>    Tasks:   10     100    1,000   10,000
>>    Ext4:   0.05    0.14    1.53    26.56
>>    XFS:    0.05    0.16    2.10    29.76
>>    Btrfs:  0.08    0.37    3.18    34.54
>>    Tux3:   0.02    0.05    0.18     2.16
>
> Yet I can't reproduce those XFS or ext4 numbers you are quoting
> there. eg. XFS on a 4GB ram disk:
>
> $ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time
> ./test-fsync /mnt/test/foo 10 $i; done
>
> real    0m0.030s
> user    0m0.000s
> sys     0m0.014s
>
> real    0m0.031s
> user    0m0.008s
> sys     0m0.157s
>
> real    0m0.305s
> user    0m0.029s
> sys     0m1.555s
>
> real    0m3.624s
> user    0m0.219s
> sys     0m17.631s
> $
>
> That's roughly 10x faster than your numbers. Can you describe your
> test setup in detail? e.g. post the full log from block device
> creation to benchmark completion so I can reproduce what you are
> doing exactly?

Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.
Clearly the curve is the same: your numbers increase 10x going from
100 to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve
is significantly flatter and starts from a lower base, so it ends
with a really wide gap. You will need to take my word for that for
now. I promise that the beer is on me should you not find it
reproducible.

The repository delay is just about not bothering Hirofumi for a merge
while he finishes up his inode table anti-fragmentation work.

>> Note: you should recheck your final number for Btrfs. I have seen
>> Btrfs fall off the rails and take wildly longer on some tests just
>> like that.
>
> Completely reproducible...

I believe you. I found that Btrfs does that way too much. So does XFS
from time to time, when it gets up to lots of tasks. Read starvation
on XFS is much worse than on Btrfs, and XFS also exhibits some very
undesirable behavior with initial file create.

Note: Ext4 and Tux3 have roughly zero read starvation in any of these
tests, which pretty much proves it is not just a block scheduler
thing. I don't think this is something you should dismiss.
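Since the test-fsync program itself has not been posted in this
thread, here is a crude stand-in that reproduces the shape of the
load rather than the exact benchmark: N tasks, each writing its own
small file and fsyncing it (dd's conv=fsync issues an fsync before dd
exits), plus a trivial probe for the read starvation mentioned above:

    # N tasks, each writing and fsyncing its own small file.
    tasks=10000
    time (
        for i in $(seq $tasks); do
            dd if=/dev/zero of=/mnt/test/foo$i bs=4k count=1 \
               conv=fsync 2>/dev/null &
        done
        wait
    )

    # Meanwhile, from another shell, once the load is running:
    time cat /mnt/test/foo1 > /dev/null

A dd per file is obviously heavier than a dedicated C test program,
so absolute numbers will differ; the point is only the pattern of
many independent tasks each calling fsync.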
>> One easily reproducible one is a denial of service
>> during the 10,000 task test where it takes multiple seconds to cat
>> small files. I saw XFS do this on both spinning disk and tmpfs, and
>> I have seen it hang for minutes trying to list a directory. I
>> looked a bit into it, and I see that you are blocking for aeons
>> trying to acquire a lock in open.
>
> Yes, that's the usual case when XFS is waiting on buffer readahead
> IO completion. The latency of which is completely determined by
> block layer queuing and scheduling behaviour. And the block device
> queue is being dominated by the 10,000 concurrent write processes
> you just ran.....
>
> "Doctor, it hurts when I do this!"

It only hurts XFS (and sometimes Btrfs) when you do that. I believe
your theory about the cause is wrong, or at least Ext4 and Tux3 skirt
that issue somehow. We definitely did not do anything special to
avoid it.

>> You and I both know the truth: Ext4 is the only really reliable
>> general purpose filesystem on Linux at the moment.
>
> That's the funniest thing I've read in a long time :)

I'm glad I could lighten your day, but I remain uncomfortable with
the read starvation issues and the massively long lock holds I see.
Perhaps XFS is stable if you don't push too many tasks at it.

[snipped the interesting ramdisk performance bug hunt]

OK, fair enough, you get a return match on SSD when I get hold of
one.

>> I wouldn't be so sure about that...
>>
>>    Tasks:       8            16           32
>>    Ext4:     93.06 MB/s   98.67 MB/s  102.16 MB/s
>>    XFS:      81.10 MB/s   79.66 MB/s   73.27 MB/s
>>    Btrfs:    43.77 MB/s   64.81 MB/s   90.35 MB/s
>> ...
>
>    Ext4:    807.21 MB/s  1089.89 MB/s   867.55 MB/s
>    XFS:     997.77 MB/s  1011.51 MB/s   876.49 MB/s
>    Btrfs:    55.66 MB/s    56.77 MB/s    60.30 MB/s
>
> Numbers are again very different for XFS and ext4 on /dev/ramX on my
> system. Need to work out why yours are so low....

Your machine makes mine look like a PCjr. (A rough recipe for
generating that kind of multi-stream write load is sketched further
down.)

>> Ahem, are you the same person for whom fsync was the most important
>> issue in the world last time the topic came up, to the extent of
>> spreading around FUD and entirely ignoring the great work we had
>> accomplished for regular file operations? ...
>
> Actually, I don't remember any discussions about fsync.

Here: http://www.spinics.net/lists/linux-fsdevel/msg64825.html
(Re: Tux3 Report: Faster than tmpfs, what?)

It still rankles that you took my innocent omission of the detail
that Hirofumi had removed the fsyncs from dbench and turned it into a
major FUD attack, casting aspersions on our integrity. We removed the
fsyncs because we were not interested in measuring something we had
not implemented yet; it is that simple.

That, plus Ted's silly pronouncements that I could not answer at the
time, is what motivated me to design and implement an fsync that
would not just be competitive, but would righteously kick the tails
of XFS and Ext4, which is now done. If I were you, I would wait for
the code drop, verify it, and then give credit where credit is due.
Then I would figure out how to make XFS work like that.

> Things I remember that needed addressing are:
> - the lack of ENOSPC detection
> - the writeback integration issues
> - the code cleanliness issues (ifdef mess, etc)
> - the page forking design problems
> - the lack of scalable inode and space allocation
>   algorithms.
>
> Those are the things I remember, and fsync performance pales in
> comparison to those.

With the exception of "page forking design", it is the same list as
ours, with progress on all of them.
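As an aside, since the MB/s tables above come without a recipe, a
rough way to generate the same kind of multi-task streaming write
load is the dd loop below. It is a stand-in under my own assumptions
about the workload, not the benchmark that produced those numbers:

    # N tasks, each streaming size_mb of sequential writes, then fsync.
    tasks=8 size_mb=1024
    start=$(date +%s.%N)
    for i in $(seq $tasks); do
        dd if=/dev/zero of=/mnt/test/stream$i bs=1M count=$size_mb \
           conv=fsync 2>/dev/null &
    done
    wait
    end=$(date +%s.%N)
    echo "aggregate: $(echo "$tasks * $size_mb / ($end - $start)" | bc -l) MB/s"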
I freely admit that optimized fsync was not on the critical path, but
you made it an issue, so I addressed it. Anyway, I needed to hone my
kernel debugging skills, and that worked out well.

>> I said then that when we
>> got around to a proper fsync it would be competitive. Now here it
>> is, so you want to change the topic. I understand.
>
> I haven't changed the topic, just the storage medium. The simple
> fact is that the world is moving away from slow sata storage at a
> pretty rapid pace and it's mostly going solid state. Spinning disks
> are also changing - they are going to ZBC based SMR, which is a
> completely different problem space which doesn't even appear to be
> on the tux3 radar....
>
> So where does tux3 fit into a storage future of byte addressable
> persistent memory and ZBC based SMR devices?

You won't convince us to abandon spinning rust; it is going to be
around a lot longer than you think. Obviously, we care about SSD, and
I believe you will find that Tux3 is more than competitive there. We
lay things out in a very erase-block-friendly way. We need to address
the volume wrap issue of course, and that is in progress. This is
much easier than spinning disk.

Tux3's redirect-on-write[1] is obviously a natural for SMR; however,
I will not get excited about it unless a vendor waves money.

Regards,

Daniel

[1] Copy-on-write is a misnomer because there is no copy. The proper
term is "redirect-on-write".