Return-Path: Received: from mail-ob0-f175.google.com ([209.85.214.175]:36741 "EHLO mail-ob0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752643AbbIHXJE convert rfc822-to-8bit (ORCPT ); Tue, 8 Sep 2015 19:09:04 -0400 Received: by obqa2 with SMTP id a2so96743034obq.3 for ; Tue, 08 Sep 2015 16:09:03 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20150908223959.GE30681@birch.djwong.org> References: <1441397823-1203-1-git-send-email-Anna.Schumaker@Netapp.com> <55EEFCEE.5090000@draigBrady.com> <55EF279B.3020101@Netapp.com> <55EF3EFD.3080302@draigBrady.com> <20150908212907.GD30681@birch.djwong.org> <20150908223959.GE30681@birch.djwong.org> From: Andy Lutomirski Date: Tue, 8 Sep 2015 16:08:43 -0700 Message-ID: Subject: Re: [PATCH v1 0/8] VFS: In-kernel copy system call To: "Darrick J. Wong" Cc: =?UTF-8?Q?P=C3=A1draig_Brady?= , Anna Schumaker , linux-nfs@vger.kernel.org, Linux btrfs Developers List , Linux FS Devel , Linux API , Zach Brown , Al Viro , Chris Mason , Michael Kerrisk-manpages , andros@netapp.com, Christoph Hellwig , Coreutils Content-Type: text/plain; charset=UTF-8 Sender: linux-nfs-owner@vger.kernel.org List-ID: On Tue, Sep 8, 2015 at 3:39 PM, Darrick J. Wong wrote: > On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote: >> On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong wrote: >> > On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote: >> >> On 08/09/15 20:10, Andy Lutomirski wrote: >> >> > On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker >> >> > wrote: >> >> >> On 09/08/2015 11:21 AM, Pádraig Brady wrote: >> >> >>> I see copy_file_range() is a reflink() on BTRFS? >> >> >>> That's a bit surprising, as it avoids the copy completely. >> >> >>> cp(1) for example considered doing a BTRFS clone by default, >> >> >>> but didn't due to expectations that users actually wanted >> >> >>> the data duplicated on disk for resilience reasons, >> >> >>> and for performance reasons so that write latencies were >> >> >>> restricted to the copy operation, rather than being >> >> >>> introduced at usage time as the dest file is CoW'd. >> >> >>> >> >> >>> If reflink() is a possibility for copy_file_range() >> >> >>> then could it be done optionally with a flag? >> >> >> >> >> >> The idea is that filesystems get to choose how to handle copies in the >> >> >> default case. BTRFS could do a reflink, but NFS could do a server side >> > >> > Eww, different default behaviors depending on the filesystem. :) >> > >> >> >> copy instead. I can change the default behavior to only do a data copy >> >> >> (unless the reflink flag is specified) instead, if that is desirable. >> >> >> >> >> >> What does everybody think? >> >> > >> >> > I think the best you could do is to have a hint asking politely for >> >> > the data to be deep-copied. After all, some filesystems reserve the >> >> > right to transparently deduplicate. >> >> > >> >> > Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no >> >> > advantage to deep copying unless you actually want two copies for >> >> > locality reasons. >> >> >> >> Agreed. The relink and server side copy are separate things. >> >> There's no advantage to not doing a server side copy, >> >> but as mentioned there may be advantages to doing deep copies on BTRFS >> >> (another reason not previous mentioned in this thread, would be >> >> to avoid ENOSPC errors at some time in the future). >> >> >> >> So having control over the deep copy seems useful. >> >> It's debatable whether ALLOW_REFLINK should be on/off by default >> >> for copy_file_range(). I'd be inclined to have such a setting off by default, >> >> but cp(1) at least will work with whatever is chosen. >> > >> > So far it looks like people are interested in at least these "make data appear >> > in this other place" filesystem operations: >> > >> > 1. reflink >> > 2. reflink, but only if the contents are the same (dedupe) >> >> What I meant by this was: if you ask for "regular copy", you may end >> up with a reflink anyway. Anyway, how can you reflink a range and >> have the contents *not* be the same? > > reflink forcibly remaps fd_dest's range to fd_src's range. If they didn't > match before, they will afterwards. > > dedupe remaps fd_dest's range to fd_src's range only if they match, of course. > > Perhaps I should have said "...if the contents are the same before the call"? > Oh, I see. Can we have a clean way to figure out whether two file ranges are the same in a way that allows false negatives? I.e. return 1 if the ranges are reflinks of each other and 0 if not? Pretty please? I've implemented that in the past on btrfs by syncing the ranges and then comparing FIEMAP output, but that's hideous. >> >> > 3. regular copy >> > 4. regular copy, but make the hardware do it for us >> > 5. regular copy, but require a second copy on the media (no-dedupe) >> >> If this comes from me, I have no desire to ever use this as a flag. > > I meant (5) as a "disable auto-dedupe for this operation" flag, not as > a "reallocate all the shared blocks now" op... Hmm, interesting. What effect does it have on systems that do deferred auto-dedupe? >> >> I think we should focus on what the actual legit use cases might be. >> Certainly we want to support a mode that's "reflink or fail". We >> could have these flags: >> >> COPY_FILE_RANGE_ALLOW_REFLINK >> COPY_FILE_RANGE_ALLOW_COPY >> >> Setting neither gets -EINVAL. Setting both works as is. Setting just >> ALLOW_REFLINK will fail if a reflink can't be supported. Setting just >> ALLOW_COPY will make a best-effort attempt not to reflink but >> expressly permits reflinking in cases where either (a) plain old >> write(2) might also result in a reflink or (b) there is no advantage >> to not reflinking. > > I don't agree with having a 'copy' flag that can reflink when we also have a > 'reflink' flag. I guess I just don't like having a flag with different > meanings depending on context. > > Users should be able to get the default behavior by passing '0' for flags, so > provide FORBID_REFLINK and FORBID_COPY flags to turn off those behaviors, with > an admonishment that one should only use them if they have a goooood reason. > Passing neither gets you reflink-xor-copy, which is what I think we both want > in the general case. > > FORBID_REFLINK = 1 > FORBID_COPY = 2 > CHECK_SAME = 4 > HW_COPY = 8 > > DEDUPE = (FORBID_COPY | CHECK_SAME) > > What do you say to that? What does HW_COPY mean? If we have enough weird combinations, maybe having a mode instead of flags makes sense. --Andy