Date: Wed, 16 Mar 2016 15:23:43 -0700
From: Chris Mason <clm@fb.com>
To: <sandeen@redhat.com>, Linus Torvalds <torvalds@linux-foundation.org>,
        Dave Chinner <david@fromorbit.com>, "Theodore Ts'o" <tytso@mit.edu>,
        Ric Wheeler <rwheeler@redhat.com>,
        Andy Lutomirski <luto@amacapital.net>,
        One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>,
        Gregory Farnum <greg@gregs42.com>,
        "Martin K. Petersen" <martin.petersen@oracle.com>,
        Christoph Hellwig <hch@infradead.org>,
        "Darrick J. Wong" <darrick.wong@oracle.com>,
        Jens Axboe <axboe@kernel.dk>,
        Andrew Morton <akpm@linux-foundation.org>,
        Linux API <linux-api@vger.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        <shane.seymour@hpe.com>, Bruce Fields <bfields@fieldses.org>,
        linux-fsdevel <linux-fsdevel@vger.kernel.org>,
        Jeff Layton <jlayton@poochiereds.net>
Subject: Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of
 blocks
Message-ID: <20160316222343.GA53649@clm-mbp.thefacebook.com>
Mail-Followup-To: Chris Mason <clm@fb.com>, sandeen@redhat.com,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>,
	Ric Wheeler <rwheeler@redhat.com>,
	Andy Lutomirski <luto@amacapital.net>,
	One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>,
	Gregory Farnum <greg@gregs42.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Christoph Hellwig <hch@infradead.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jens Axboe <axboe@kernel.dk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux API <linux-api@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	shane.seymour@hpe.com, Bruce Fields <bfields@fieldses.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jeff Layton <jlayton@poochiereds.net>
References: <56E69398.7030508@redhat.com>
 <20160314144603.GO29218@thunk.org>
 <20160315201431.GG30721@dastard>
 <CA+55aFwHLJffmN-Dw=yZCGKzxe_2Tm9h2GjdaFL3JdvYXNstRw@mail.gmail.com>
 <20160315223313.GH30721@dastard>
 <CA+55aFxCpza3_3J8kP9E0gLKHiSitZHjC4UMS_ZfZ_HBTC9=Bg@mail.gmail.com>
 <20160315235216.GI30721@dastard>
 <CA+55aFxZ=feGd+QGdhCN28kjd2XJO3PCj9NBoJJZ5E8_WMJiMA@mail.gmail.com>
 <56E8A916.8050702@redhat.com>
 <20160316005117.GA34410@clm-mbp.thefacebook.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <20160316005117.GA34410@clm-mbp.thefacebook.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2393
Lines: 57

On Tue, Mar 15, 2016 at 05:51:17PM -0700, Chris Mason wrote:
> On Tue, Mar 15, 2016 at 07:30:14PM -0500, Eric Sandeen wrote:
> > On 3/15/16 7:06 PM, Linus Torvalds wrote:
> > > On Tue, Mar 15, 2016 at 4:52 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >> >
> > >> > It is pretty clear that the onus is on the patch submitter to
> > >> > provide justification for inclusion, not for the reviewer/Maintainer
> > >> > to have to prove that the solution is unworkable.
> > > I agree, but quite frankly, performance is a good justification.
> > > 
> > > So if Ted can give performance numbers, that's justification enough.
> > > We've certainly taken changes with less.
> > 
> > I've been away from ext4 for a while, so I'm really not on top of the
> > mechanics of the underlying problem at the moment.
> > 
> > But I would say that in addition to numbers showing that ext4 has trouble
> > with unwritten extent conversion, we should have an explanation of
> > why it can't be solved in a way that doesn't open up these concerns.
> > 
> > XFS certainly has different mechanisms, but is the demonstrated workload
> > problematic on XFS (or btrfs) as well?  If not, can ext4 adopt any of the
> > solutions that make the workload perform better on other filesystems?
> 
> When I've benchmarked this in the past, doing small random buffered writes
> into a preallocated extent was dramatically (3x or more) slower on xfs
> than doing them into a fully written extent.  That was two years ago,
> but I can redo it.

So I re-ran some benchmarks, with 4K O_DIRECT random IOs on NVMe (4.5
kernel).  This is O_DIRECT without O_SYNC.  I don't think XFS does a
commit for each IO into the prealloc file, does it?  O_SYNC makes it much
slower, so hopefully I've got this right.

The test runs for 60 seconds, and I used an iodepth of 4:
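
For anyone who wants to poke at the two cases, here is a rough sketch of
the setup.  It is not the tool that produced the numbers in this mail,
and everything in it is an assumption for illustration: the helper names
(make_prealloc, make_written, random_write_iops) are made up, it assumes
Linux with Python 3, and it issues one synchronous pwrite at a time, so
it behaves like iodepth 1 rather than 4 or 512.  Treat it as a way to see
the difference between the two files, not to reproduce absolute numbers.

```python
import mmap
import os
import random
import time

BLOCK = 4096  # 4K IO size, matching the test described above


def make_prealloc(path, size):
    """Prealloc case: reserve blocks with fallocate, never write them."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


def make_written(path, size):
    """Overwrite case: write every block once so later IO is a pure overwrite."""
    chunk = b"\0" * (1 << 20)
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o644)
    try:
        remaining = size
        while remaining > 0:
            remaining -= os.write(fd, chunk[:min(len(chunk), remaining)])
        os.fsync(fd)
    finally:
        os.close(fd)


def random_write_iops(path, seconds, direct=True):
    """Issue synchronous 4K random writes for `seconds`, return writes/sec."""
    flags = os.O_RDWR | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags)
    # An anonymous mmap is page-aligned, which O_DIRECT requires of the buffer.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(b"\xab" * BLOCK)
    nblocks = os.fstat(fd).st_size // BLOCK
    done = 0
    deadline = time.monotonic() + seconds
    try:
        while time.monotonic() < deadline:
            os.pwrite(fd, buf, random.randrange(nblocks) * BLOCK)
            done += 1
    finally:
        buf.close()
        os.close(fd)
    return done / seconds
```

Usage would be something like make_prealloc("a", 1 << 30) and
make_written("b", 1 << 30) on the filesystem under test, then comparing
random_write_iops("a", 60) against random_write_iops("b", 60).  Note that
O_DIRECT opens fail on filesystems that don't support it (tmpfs, for
one), which is why the sketch has a direct=False escape hatch.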

prealloc file: 32,000 iops
overwrite:    121,000 iops

If I bump the iodepth up to 512:

prealloc file:  33,000 iops
overwrite:     279,000 iops
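
To put those four numbers side by side, a small bit of arithmetic (the
dict layout and function names are just for illustration, the values are
the ones quoted above): overwrite comes out roughly 3.8x faster than
prealloc at iodepth 4 and roughly 8.5x faster at iodepth 512, and the
prealloc case barely moves when the queue depth goes up.

```python
# iops figures from the runs above, keyed by iodepth
results = {
    4: {"prealloc": 32_000, "overwrite": 121_000},
    512: {"prealloc": 33_000, "overwrite": 279_000},
}


def slowdown(run):
    """How many times faster overwrite is than prealloc in one run."""
    return run["overwrite"] / run["prealloc"]


def scaling(kind):
    """How much one case gains going from iodepth 4 to iodepth 512."""
    return results[512][kind] / results[4][kind]


for depth, run in results.items():
    print(f"iodepth {depth}: overwrite is {slowdown(run):.1f}x faster")
for kind in ("prealloc", "overwrite"):
    print(f"{kind}: {scaling(kind):.2f}x iops from iodepth 4 to 512")
```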

For streaming writes, XFS converts prealloc to written much better when
the IO isn't random.  You can start seeing the difference at 16K
sequential O_DIRECT writes, but really it's not a huge impact.  The worst
case is 4K:

prealloc file: 227MB/s
overwrite:     340MB/s
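
For comparison with the random-IO runs, the streaming figures can be
turned into 4K operations per second.  This is only a back-of-the-envelope
conversion and assumes MB means 10^6 bytes (the mail doesn't say); the
function name is made up for the example.

```python
IO_SIZE = 4096  # 4K sequential writes, the worst case above


def mb_per_s_to_iops(mb_per_s, bytes_per_mb=10**6):
    """Convert a throughput figure to IOs per second at IO_SIZE bytes each."""
    return mb_per_s * bytes_per_mb / IO_SIZE


prealloc_mb, overwrite_mb = 227, 340  # MB/s, from the figures above

print(f"prealloc:  ~{mb_per_s_to_iops(prealloc_mb):,.0f} iops")
print(f"overwrite: ~{mb_per_s_to_iops(overwrite_mb):,.0f} iops")
print(f"overwrite is {overwrite_mb / prealloc_mb:.1f}x faster")
```

So even at 4K, sequential prealloc is about 1.5x slower than overwrite,
against the several-x gap in the random case.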

I can't think of sequential workloads where this will matter, since they
will either end up with bigger IO or the performance impact won't get
noticed.

-chris