Date: Wed, 16 Mar 2016 20:15:02 -0400
From: "Theodore Ts'o" <tytso@thunk.org>
To: Andreas Dilger
Cc: "Darrick J. Wong", Dave Chinner, Linus Torvalds, Ric Wheeler,
	Andy Lutomirski, One Thousand Gnomes, Gregory Farnum,
	Martin Petersen, Christoph Hellwig, Jens Axboe, Andrew Morton,
	Linux API, Linux Kernel Mailing List, shane.seymour@hpe.com,
	Bruce Fields, linux-fsdevel, Jeff Layton, Eric Sandeen
Subject: Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
Message-ID: <20160317001502.GF23593@thunk.org>
In-Reply-To: <7674C689-C07E-4D38-85EB-4FD9B55CBB35@dilger.ca>
References: <20160313233049.GA30721@dastard> <56E69398.7030508@redhat.com>
	<20160314144603.GO29218@thunk.org> <20160315201431.GG30721@dastard>
	<20160315223313.GH30721@dastard> <20160315225224.GD23848@thunk.org>
	<20160316015139.GC5826@birch.djwong.org>
	<7674C689-C07E-4D38-85EB-4FD9B55CBB35@dilger.ca>

On Wed, Mar 16, 2016 at 03:45:49PM -0600, Andreas Dilger wrote:
> > Clearly, the performance hit of unwritten extent conversion is large
> > enough to tempt people to ask for no-hide-stale.  But I'd rather hear
> > that directly from a developer, Ceph or otherwise.
>
> I suspect that this gets significantly worse if you are running with
> random writes instead of sequential overwrites.  With sequential overwrites
> there is only a single boundary between init and uninit extents, so at
> most one extra extent in the tree.  The above performance deltas will also
> be much larger when real disks are involved and seek latency is a factor.

It will vary a lot depending on your use case.  If you are running with
data=ordered and with journalling enabled, then even if it is only a
single extent that is modified, the fact that a journal transaction is
involved, along with a forced data block flush to avoid revealing stale
data, is certainly going to be measurable.

The other thing is that if you are worried about tail latency, which is
a major concern at Google[1], and you are running your disks close to
flat out, the extra seek you have to do to update the extent tree is a
seek that you can't be using for useful work --- and worse, it could
delay a low-latency read from completing within your SLO.

[1] https://research.google.com/pubs/pub44830.html

Part of what's challenging about giving numbers is that it's trivially
easy to construct some worst-case scenario where the numbers are really
terrible.
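To make the shape of that worst case concrete, here's a minimal sketch of
such a microbenchmark: preallocate a file, then hammer it with synchronous
4k random writes so that nearly every write has to convert an unwritten
extent.  The file name, size, seed, and write count below are arbitrary;
this is an illustration of the workload shape, not a benchmark anyone has
actually run.

	/* Worst-case sketch: random 4k O_SYNC writes into a fallocated file. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/types.h>
	#include <unistd.h>

	#define FILE_SIZE	(1ULL << 30)	/* 1 GiB preallocated file */
	#define BLOCK_SIZE	4096
	#define NUM_WRITES	100000

	int main(void)
	{
		char buf[BLOCK_SIZE];
		long i;
		int fd;

		memset(buf, 0xaa, sizeof(buf));
		srandom(12345);			/* fixed seed for repeatability */

		fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* Preallocate unwritten extents; no stale data is visible yet. */
		if (fallocate(fd, 0, 0, FILE_SIZE) < 0) {
			perror("fallocate");
			return 1;
		}

		for (i = 0; i < NUM_WRITES; i++) {
			off_t block = random() % (FILE_SIZE / BLOCK_SIZE);

			/*
			 * Each O_SYNC write into a still-unwritten region forces
			 * an unwritten->written conversion plus the associated
			 * journal and flush work.
			 */
			if (pwrite(fd, buf, BLOCK_SIZE, block * BLOCK_SIZE) != BLOCK_SIZE) {
				perror("pwrite");
				return 1;
			}
		}

		close(fd);
		return 0;
	}

The same workload is just as easy to express as an fio job (rw=randwrite,
bs=4k against a preallocated file), which is what I'd actually use to
generate numbers.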
A 4k random write benchmark into a fallocated file, even with XFS, would
have pretty bad numbers, but of course people would say it's not very
realistic.  Those numbers are just the easiest to get.  The most realistic
numbers are going to be a lot harder to get, and wouldn't necessarily make
much sense without revealing a lot of proprietary information.  I will say
that Google does have a fairly large number of disks[2], and so even a
small fractional percentage gain multiplied by gazillions of disks starts
turning into a dollar number with enough zeros that people really sit up
and take notice.  I'll also note that map reduce can be quite nasty as far
as random I/O is concerned[3], and while map reduce jobs are often not
high-priority jobs, they can interfere with low-latency reads from
important applications (e.g., web search, user-visible gmail operations,
etc.).

[2] https://what-if.xkcd.com/63/
[3] https://pdfs.semanticscholar.org/6238/e5f0fd807f634f5999701c7aa6a09d88dfc8.pdf

So I'm not sure what numbers I can really give that would satisfy people.
Doing a random write fio job is not hard, and will result in fairly
impressive numbers.  If that's enough, then either I can do this, or Chris
Mason can reproduce his experiment using XFS (which would presumably
eliminate the excuse that it's because ext4 sucks at extent operations).
But if that's not going to convince people, then I'd much rather not waste
my time.

Besides, at Google it's easy enough for me to maintain the patch
out-of-tree.  It's the Ceph folks who would need to, at the very least,
have such a patch ship in Red Hat Enterprise Linux.  So it's probably
better for them to justify it, if numbers are really necessary.

						- Ted