Return-Path: Received: from userp1040.oracle.com ([156.151.31.81]:31310 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752916AbbIIBUS (ORCPT ); Tue, 8 Sep 2015 21:20:18 -0400 Date: Tue, 8 Sep 2015 18:19:33 -0700 From: "Darrick J. Wong" To: Andy Lutomirski Cc: =?iso-8859-1?Q?P=E1draig?= Brady , Anna Schumaker , linux-nfs@vger.kernel.org, Linux btrfs Developers List , Linux FS Devel , Linux API , Zach Brown , Al Viro , Chris Mason , Michael Kerrisk-manpages , andros@netapp.com, Christoph Hellwig , Coreutils Subject: Re: [PATCH v1 0/8] VFS: In-kernel copy system call Message-ID: <20150909011933.GF30681@birch.djwong.org> References: <1441397823-1203-1-git-send-email-Anna.Schumaker@Netapp.com> <55EEFCEE.5090000@draigBrady.com> <55EF279B.3020101@Netapp.com> <55EF3EFD.3080302@draigBrady.com> <20150908212907.GD30681@birch.djwong.org> <20150908223959.GE30681@birch.djwong.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 In-Reply-To: Sender: linux-nfs-owner@vger.kernel.org List-ID: On Tue, Sep 08, 2015 at 04:08:43PM -0700, Andy Lutomirski wrote: > On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong wrote: > > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote: > >> On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong wrote: > >> > On Tue, Sep 08, 2015 at 09:03:09PM +0100, P?draig Brady wrote: > >> >> On 08/09/15 20:10, Andy Lutomirski wrote: > >> >> > On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker > >> >> > wrote: > >> >> >> On 09/08/2015 11:21 AM, P?draig Brady wrote: > >> >> >>> I see copy_file_range() is a reflink() on BTRFS? > >> >> >>> That's a bit surprising, as it avoids the copy completely. > >> >> >>> cp(1) for example considered doing a BTRFS clone by default, > >> >> >>> but didn't due to expectations that users actually wanted > >> >> >>> the data duplicated on disk for resilience reasons, > >> >> >>> and for performance reasons so that write latencies were > >> >> >>> restricted to the copy operation, rather than being > >> >> >>> introduced at usage time as the dest file is CoW'd. > >> >> >>> > >> >> >>> If reflink() is a possibility for copy_file_range() > >> >> >>> then could it be done optionally with a flag? > >> >> >> > >> >> >> The idea is that filesystems get to choose how to handle copies in the > >> >> >> default case. BTRFS could do a reflink, but NFS could do a server side > >> > > >> > Eww, different default behaviors depending on the filesystem. :) > >> > > >> >> >> copy instead. I can change the default behavior to only do a data copy > >> >> >> (unless the reflink flag is specified) instead, if that is desirable. > >> >> >> > >> >> >> What does everybody think? > >> >> > > >> >> > I think the best you could do is to have a hint asking politely for > >> >> > the data to be deep-copied. After all, some filesystems reserve the > >> >> > right to transparently deduplicate. > >> >> > > >> >> > Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no > >> >> > advantage to deep copying unless you actually want two copies for > >> >> > locality reasons. > >> >> > >> >> Agreed. The relink and server side copy are separate things. > >> >> There's no advantage to not doing a server side copy, > >> >> but as mentioned there may be advantages to doing deep copies on BTRFS > >> >> (another reason not previous mentioned in this thread, would be > >> >> to avoid ENOSPC errors at some time in the future). > >> >> > >> >> So having control over the deep copy seems useful. > >> >> It's debatable whether ALLOW_REFLINK should be on/off by default > >> >> for copy_file_range(). I'd be inclined to have such a setting off by default, > >> >> but cp(1) at least will work with whatever is chosen. > >> > > >> > So far it looks like people are interested in at least these "make data appear > >> > in this other place" filesystem operations: > >> > > >> > 1. reflink > >> > 2. reflink, but only if the contents are the same (dedupe) > >> > >> What I meant by this was: if you ask for "regular copy", you may end > >> up with a reflink anyway. Anyway, how can you reflink a range and > >> have the contents *not* be the same? > > > > reflink forcibly remaps fd_dest's range to fd_src's range. If they didn't > > match before, they will afterwards. > > > > dedupe remaps fd_dest's range to fd_src's range only if they match, of course. > > > > Perhaps I should have said "...if the contents are the same before the call"? > > > > Oh, I see. > > Can we have a clean way to figure out whether two file ranges are the > same in a way that allows false negatives? I.e. return 1 if the > ranges are reflinks of each other and 0 if not? Pretty please? I've > implemented that in the past on btrfs by syncing the ranges and then > comparing FIEMAP output, but that's hideous. Another mode for this call... :) > >> > >> > 3. regular copy > >> > 4. regular copy, but make the hardware do it for us > >> > 5. regular copy, but require a second copy on the media (no-dedupe) > >> > >> If this comes from me, I have no desire to ever use this as a flag. > > > > I meant (5) as a "disable auto-dedupe for this operation" flag, not as > > a "reallocate all the shared blocks now" op... > > Hmm, interesting. What effect does it have on systems that do > deferred auto-dedupe? If it's a userspace deferred auto-dedupe, then hopefully the program coordinates with the dedupe program. Otherwise, it's only effective with a dedupe that runs in the write-path. > >> > >> I think we should focus on what the actual legit use cases might be. > >> Certainly we want to support a mode that's "reflink or fail". We > >> could have these flags: > >> > >> COPY_FILE_RANGE_ALLOW_REFLINK > >> COPY_FILE_RANGE_ALLOW_COPY > >> > >> Setting neither gets -EINVAL. Setting both works as is. Setting just > >> ALLOW_REFLINK will fail if a reflink can't be supported. Setting just > >> ALLOW_COPY will make a best-effort attempt not to reflink but > >> expressly permits reflinking in cases where either (a) plain old > >> write(2) might also result in a reflink or (b) there is no advantage > >> to not reflinking. > > > > I don't agree with having a 'copy' flag that can reflink when we also have a > > 'reflink' flag. I guess I just don't like having a flag with different > > meanings depending on context. > > > > Users should be able to get the default behavior by passing '0' for flags, so > > provide FORBID_REFLINK and FORBID_COPY flags to turn off those behaviors, with > > an admonishment that one should only use them if they have a goooood reason. > > Passing neither gets you reflink-xor-copy, which is what I think we both want > > in the general case. > > > > FORBID_REFLINK = 1 > > FORBID_COPY = 2 > > CHECK_SAME = 4 > > HW_COPY = 8 > > > > DEDUPE = (FORBID_COPY | CHECK_SAME) > > > > What do you say to that? > > What does HW_COPY mean? It /probably/ means that the FS tells the storage device to copy the block rather than streaming it through the page cache. Your autodedupe thinp device, for example, would "copy" block X to Y by mapping both X and Y to the same piece of media. (Effectively the same thing as FS reflink/dedupe, but in the storage dev.) > > If we have enough weird combinations, maybe having a mode instead of > flags makes sense. Let's hope not. :) --D > > --Andy