Date: Wed, 16 Mar 2016 19:38:14 -0700
From: "Darrick J. Wong"
To: Dave Chinner
Cc: "Theodore Ts'o", Linus Torvalds, Ric Wheeler, Andy Lutomirski,
	One Thousand Gnomes, Gregory Farnum, "Martin K. Petersen",
	Christoph Hellwig, Jens Axboe, Andrew Morton, Linux API,
	Linux Kernel Mailing List, shane.seymour@hpe.com, Bruce Fields,
	linux-fsdevel, Jeff Layton, Eric Sandeen
Subject: Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
Message-ID: <20160317023814.GF5826@birch.djwong.org>
In-Reply-To: <20160317010116.GK30721@dastard>

On Thu, Mar 17, 2016 at 12:01:16PM +1100, Dave Chinner wrote:
> On Tue, Mar 15, 2016 at 06:51:39PM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 15, 2016 at 06:52:24PM -0400, Theodore Ts'o wrote:
> > > On Wed, Mar 16, 2016 at 09:33:13AM +1100, Dave Chinner wrote:
> > > >
> > > > Stale data escaping containment is a security issue.  Enabling
> > > > generic kernel mechanisms to *enable containment escape* is
> > > > fundamentally wrong, and relying on userspace to Do The Right
> > > > Thing is even more of a gamble, IMO.
> > >
> > > We already have generic kernel mechanisms such as "the block
> > > device".
> > >
> > > > It's a practical concern because if we enable this functionality
> > > > in fallocate, it will get used by more than just special storage
> > > > apps.  i.e. this can't be waved away with "only properly managed
> > > > applications will use it" arguments.
> > >
> > > It requires a mount option.  How is this going to allow random
> > > applications to use this feature, again?
> > >
> > > > I also don't make a habit of publicising the fact that, since we
> > > > disabled the "-d unwritten=X" mkfs parameter (because of speed
> > > > racer blogs such as the above and configuration cargo-culting
> > > > resulting in unsuspecting users exposing stale data
> > > > unintentionally), the functionality still exists in the kernel
> > > > code, and that it only takes a single xfs_db command to turn off
> > > > unwritten extents in XFS.  i.e. we can easily make fallocate on
> > > > XFS expose stale data, filesystem wide, without requiring mount
> > > > options, kernel or application modifications.
> > >
> > > So you have something even more dangerous in XFS and it's in the
> > > kernel tree?  Has Red Hat threatened to have a distro-specific
> > > patch
> >
> > xfs_db is the XFS debugger, so you can only enable that bit of
> > functionality with magical commands, which IMHO isn't much different
> > than people messing with their ext4 filesystems with debugfs.  You
> > had better know what you're doing, and if you break the filesystem
> > you can eat both pieces. :P
> >
> > > to comment out this code to make sure irresponsible users can't
> > > use it?  What I've been suggesting has even more controls than
> > > what you have.  And I've been keeping it as an out-of-tree kernel
> > > patch mainly because you've been arguing that it's such a horrible
> > > thing.
> >
> > One could lock it down even more -- hide it behind a Kconfig option
> > that depends on CONFIG_EXPERT=y and itself defaults to n, require a
> > mount option, only allow the file owner to call no-hide-stale and
> > only if the file is 0600 (or the appropriate group equivalents like
> > Ted's existing patch), and upon adding stale extents, set an inode
> > flag that locks uid/gid/mode/flags.  Obviously root can still get to
> > the file, but at least there's hard evidence that one is entering
> > the twilight zone.
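
To make that concrete, the gate for setting no-hide-stale might look
something like the sketch below.  Every helper and config name in it
is made up for illustration -- it's just the policy from the
paragraph above transcribed into C:

	/*
	 * Hypothetical sketch; none of these names exist upstream.
	 * Assumes kernel context (linux/fs.h, linux/cred.h,
	 * linux/uidgid.h).
	 */
	static int may_set_no_hide_stale(struct inode *inode)
	{
		/* Kconfig gate: depends on EXPERT, defaults to n */
		if (!IS_ENABLED(CONFIG_FALLOC_NO_HIDE_STALE))
			return -EOPNOTSUPP;
		/* filesystem must be mounted with the opt-in option */
		if (!sb_opted_into_no_hide_stale(inode->i_sb))
			return -EPERM;
		/* only the file's owner may ask for stale extents */
		if (!uid_eq(current_fsuid(), inode->i_uid))
			return -EPERM;
		/* no group/other access bits, i.e. 0600 or stricter */
		if (inode->i_mode & 0077)
			return -EPERM;
		return 0;
	}

with the first stale-extent allocation then setting the inode flag
that pins uid/gid/mode/flags.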

> > > > Making Google's hack more widely available through the fallocate
> > > > API is entirely dependent on proving that:
> > >
> > > Ceph is about to completely bypass the file system because of your
> > > intransigence, and reimplement a userspace file system.  They seem
> > > to believe it's necessary; I'll let them make the case.  And if
> > > not, if Linus sides with you and doesn't want to take the patch,
> > > I'll just keep it as a Google-specific out-of-tree patch.  I don't
> > > *need* to have this thing upstream.
> >
> > Frankly, I second Eric Sandeen's comments -- just how bad is ext4's
> > unwritten extent conversion for these folks?
> >
> > I ran this crappy looping microbenchmark against a ramdisk:
> >   fallocate 400M
> >   write 400M
> >   fsync
> >   rewrite the 400M
> >   fsync
> > on kernel 4.5.
> >
> > For writing 400M through the page cache in 4k chunks,
> > ext4:  ~460MB/s -> ~580MB/s (~20%)
> > XFS:   ~660MB/s -> ~870MB/s (~25%)
> > btrfs: ~130MB/s -> ~200MB/s (~35%)
> >
> > For writing 400M in 80M chunks,
> > ext4:  ~480MB/s -> ~720MB/s (~30%)
> > XFS:   ~1GB/s   -> ~1.5GB/s (~35%)
> > btrfs: ~590MB/s -> ~590MB/s (no change)
> >
> > For directio writing 400MB in 4k chunks,
> > ext4:  25MB/s -> 26MB/s (~5%)
> > XFS:   25MB/s -> 27MB/s (~8%)
> > btrfs: 22MB/s -> 18MB/s (...)
> >
> > For directio writing 1200MB in 80M chunks,
> > ext4:  ~2.9GB/s -> ~3.3GB/s (~13%)
> > XFS:   ~3.2GB/s -> ~3.5GB/s (~9%)
> > btrfs: ~2.3GB/s -> ~2.2GB/s (...)
>
> That's not comparing apples to apples.  Overwrite does not require
> allocation/extent manipulation at all, so it is comparing the
> performance of completely different extent operations.  The
> operations we should be comparing are "first write" operations,
> namely "write over hole with allocation" vs "write over
> preallocation".  Overwrite performance should be the same regardless
> of the method used for the initial write/allocation.

Eh, ok, let's compare writing a fallocated region vs writing an empty
file (laptop this time, so the numbers aren't the same):

For writing 400M in 4k chunks,
ext4:  ~720MB/s -> ~620MB/s (~14%)
XFS:   ~560MB/s -> ~540MB/s (~4%)
btrfs: ~260MB/s -> ~580MB/s

For writing 400M in 80M chunks,
ext4:  ~960MB/s  -> ~730MB/s (~24%)
XFS:   ~1000MB/s -> ~980MB/s (~2%)
btrfs: ~950MB/s  -> ~930MB/s (~3%)

For directio writing 1200MB in 80M chunks,
ext4:  ~2.9GB/s -> ~2.8GB/s (~4%)
XFS:   ~3.2GB/s -> ~3.2GB/s (no change)
btrfs: ~2.3GB/s -> ~2.3GB/s (no change)
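
FWIW, each pass of the microbenchmark boils down to roughly the C
below.  This is a sketch, not the actual test program -- the mount
point and fill pattern are made up, and the timing and error checking
are elided:

	#define _GNU_SOURCE		/* for fallocate() */
	#include <fcntl.h>
	#include <unistd.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		/* fs under test assumed to be mounted at /mnt/test */
		int fd = open("/mnt/test/blah",
			      O_RDWR | O_CREAT | O_TRUNC, 0644);
		const size_t chunk = 4096;	/* or 80M for big runs */
		const off_t total = 400LL << 20;	/* 400M */
		char *buf = malloc(chunk);

		memset(buf, 0x5a, chunk);

		/* preallocate, i.e. lay down unwritten extents */
		fallocate(fd, 0, 0, total);

		/* first write: converts unwritten extents as it goes */
		for (off_t off = 0; off < total; off += chunk)
			pwrite(fd, buf, chunk, off);
		fsync(fd);

		/* rewrite: pure overwrite, no extent state changes */
		for (off_t off = 0; off < total; off += chunk)
			pwrite(fd, buf, chunk, off);
		fsync(fd);

		free(buf);
		close(fd);
		return 0;
	}

(For the "empty file" runs above, drop the fallocate() call so that
the first write pass allocates over holes instead.)

--D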