2020-04-12 11:30:24

by Amir Goldstein

Subject: Re: Same mountpoint restriction in FICLONE ioctls

+CC XFS,NFS,CIFS

On Sun, Apr 12, 2020 at 1:06 PM Keno Fischer <[email protected]> wrote:
>
> Hello,
>
> I was curious about the reasoning behind the
> same-mountpoint restriction in the FICLONE
> ioctl. I saw that in commit
>
> [913b86e92] vfs: allow vfs_clone_file_range() across mount points
>
> this check was moved from the vfs layer into
> the ioctl itself, so it appears to be a policy restriction
> rather than a technical limitation. I understand why
> hardlinks are disallowed across mount point boundaries,
> but it seems like that rationale would not apply to clones,
> since modifying the clone would not affect the original
> file. Is there some other reason that the ioctl enforces
> this restriction?
>

I don't know. I suppose that when FICLONE was introduced
there wasn't any use case for cross mount clone.

Note that copy_file_range() also had this restriction, which was
recently lifted, because NFSv4 and CIFS needed this functionality.

As far as I can tell, CIFS and NFSv4 can also support cross mount
clone, but nobody stepped up to request or implement that.

The question is: do you *really* need cross mount clone?
Can you use copy_file_range() instead?
It attempts to do remap_file_range() (clone) before falling back to
kernel copy_file_range().
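
A minimal C sketch of the copy_file_range() route, for illustration only.
The helper name copy_whole_file is made up here and does not come from rr or
the kernel, and the sketch assumes a glibc that provides the
copy_file_range() wrapper (2.27 or later). The kernel decides on its own
whether to clone or copy, so the caller cannot tell from the return value
which one happened.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for copy_file_range() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy all of src_fd into dst_fd, starting at the
 * current file position of each descriptor.
 * Returns 0 on success, -1 on error with errno set. */
int copy_whole_file(int src_fd, int dst_fd)
{
    struct stat st;
    off_t remaining;

    if (fstat(src_fd, &st) < 0)
        return -1;

    remaining = st.st_size;
    while (remaining > 0) {
        /* NULL offsets: the kernel uses and advances both file positions. */
        ssize_t n = copy_file_range(src_fd, NULL, dst_fd, NULL,
                                    (size_t)remaining, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;  /* source turned out shorter than fstat() reported */
        remaining -= n;
    }
    return 0;
}
```

A caller would open the source O_RDONLY and the destination O_WRONLY, call
copy_whole_file(src_fd, dst_fd), and treat -1 as failure.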

> Removing this restriction would have some performance
> advantages for us, but I figured there must be a good reason
> why it's there that I just don't know about, so I figured I'd ask.
>

You did not specify your use case.
Across which filesystem mounts are you trying to clone?

Thanks,
Amir.


2020-04-13 05:42:58

by Keno Fischer

Subject: Re: Same mountpoint restriction in FICLONE ioctls

> You did not specify your use case.

My use case is recording (https://rr-project.org/) executions
of containers (which often make heavy use of bind mounts on
the same file system, thus me running into this restriction).
In essence, at relevant read or mmap operations,
rr needs to checkpoint the file that was opened,
in case it later gets deleted or modified.
It always tries to FICLONE the file first,
before deciding heuristically whether to
instead create a copy (if it decides there is a low
likelihood the file will get changed - e.g. because
it's a system file - it may decide to take the chance and
not copy it at the risk of creating a broken recording).
That's often a decent trade-off, but of course it's not
100% perfect.
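
A rough C sketch of the "try FICLONE first, then decide" step described
above. The names try_clone and clone_result are hypothetical and are not
taken from rr's source. The grouping of errno values is an assumption about
which failures mean "cloning is not available here" as opposed to a real
error.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>    /* FICLONE */
#include <sys/ioctl.h>

enum clone_result {
    CLONE_OK,           /* dst now shares its data with src */
    CLONE_UNAVAILABLE,  /* cannot clone here; caller applies its heuristics */
    CLONE_ERROR         /* unexpected failure, errno is set */
};

/* Hypothetical helper: attempt a whole-file clone of src_fd into dst_fd. */
enum clone_result try_clone(int src_fd, int dst_fd)
{
    if (ioctl(dst_fd, FICLONE, src_fd) == 0)
        return CLONE_OK;

    switch (errno) {
    case EXDEV:       /* not the same mount, which includes two bind mounts */
    case EOPNOTSUPP:  /* filesystem has no reflink support */
    case ENOTTY:      /* ioctl not recognised */
    case EINVAL:      /* filesystem cannot reflink these particular files */
        return CLONE_UNAVAILABLE;
    default:
        return CLONE_ERROR;
    }
}
```

With two bind mounts of the same btrfs filesystem, it is the EXDEV case that
sends the caller down the heuristic path even though a clone would have been
cheap.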

> The question is: do you *really* need cross mount clone?
> Can you use copy_file_range() instead?

Good question. copy_file_range doesn't quite work
for that initial clone, because we do want it to fail if
cloning doesn't work (so that we can apply the
heuristics). However, you make a good point that
the copy fallback should probably use copy_file_range.
At least that way, if it does decide to copy, the
performance will be better.

It would still be nice for FICLONE to ease this restriction,
since it reduces the chance of the heuristics getting
it wrong and preventing the copy, even if such
a copy would have been cheap.

> Across which filesystem mounts are you trying to clone?

This functionality was written with btrfs in mind, so that's
what I was testing with. The mounts themselves are just
different bindmounts into the same filesystem.

Keno

2020-04-13 09:55:20

by Amir Goldstein

Subject: Re: Same mountpoint restriction in FICLONE ioctls

On Mon, Apr 13, 2020 at 1:28 AM Keno Fischer <[email protected]> wrote:
>
> > You did not specify your use case.
>
> My use case is recording (https://rr-project.org/) executions

Cool! I should try that ;-)

> of containers (which often make heavy use of bind mounts on
> the same file system, thus me running into this restriction).
> In essence, at relevant read or mmap operations,
> rr needs to checkpoint the file that was opened,
> in case it later gets deleted or modified.
> It always tries to FICLONE the file first,
> before deciding heuristically whether to
> instead create a copy (if it decides there is a low
> likelihood the file will get changed - e.g. because
> it's a system file - it may decide to take the chance and
> not copy it at the risk of creating a broken recording).
> That's often a decent trade-off, but of course it's not
> 100% perfect.
>
> > The question is: do you *really* need cross mount clone?
> > Can you use copy_file_range() instead?
>
> Good question. copy_file_range doesn't quite work
> for that initial clone, because we do want it to fail if
> cloning doesn't work (so that we can apply the
> heuristics). However, you make a good point that
> the copy fallback should probably use copy_file_range.
> At least that way, if it does decide to copy, the
> performance will be better.
>
> It would still be nice for FICLONE to ease this restriction,
> since it reduces the chance of the heuristics getting
> it wrong and preventing the copy, even if such
> a copy would have been cheap.
>

You make it sound like the heuristic decision must be made
*after* trying to clone, but it can be made before and pass
flags to the kernel whether or not to fall back to copy.

copy_file_range(2) has an unused flags argument.
Adding support for flags like:
COPY_FILE_RANGE_BY_FS
COPY_FILE_RANGE_BY_KERNEL

or any other names elected after bike shedding would let the caller
control whether to use filesystem internal clone/copy methods
and/or to fall back to kernel copy.

I think this functionality will be useful to many.

> > Across which filesystem mounts are you trying to clone?
>
> This functionality was written with btrfs in mind, so that's
> what I was testing with. The mounts themselves are just
> different bindmounts into the same filesystem.
>

I can also suggest a workaround for you.
If your only problem is bind mounts and if recorder is a privileged
process (CAP_DAC_READ_SEARCH) then you can use a "master"
bind mount to perform all clone operations on.
Use name_to_handle_at(2) to get sb file handle of source file.
Use open_by_handle_at(2) to get an open file descriptor of the source
file under the "master" bind mount.
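
A sketch of those two calls in C, under the assumption that the caller
already holds a descriptor on the "master" bind mount (for example its root
directory). The helper name reopen_under_mount is hypothetical. Without
CAP_DAC_READ_SEARCH the open_by_handle_at() step is expected to fail with
EPERM.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* name_to_handle_at(), open_by_handle_at() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/* Hypothetical helper: return a new descriptor for the file behind src_fd,
 * opened relative to the mount that master_mount_fd lives on.
 * Returns the new fd, or -1 with errno set. */
int reopen_under_mount(int src_fd, int master_mount_fd, int open_flags)
{
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    int mount_id, fd, saved_errno;

    if (fh == NULL)
        return -1;
    fh->handle_bytes = MAX_HANDLE_SZ;

    /* An empty path with AT_EMPTY_PATH means "the file src_fd refers to". */
    if (name_to_handle_at(src_fd, "", fh, &mount_id, AT_EMPTY_PATH) < 0) {
        saved_errno = errno;
        free(fh);
        errno = saved_errno;
        return -1;
    }

    /* The handle is resolved against the mount of master_mount_fd. */
    fd = open_by_handle_at(master_mount_fd, fh, open_flags);
    saved_errno = errno;
    free(fh);
    errno = saved_errno;
    return fd;
}
```

The returned descriptor, together with a destination opened through the same
"master" mount, can then be handed to FICLONE.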

Thanks,
Amir.

2020-04-14 13:31:03

by Keno Fischer

Subject: Re: Same mountpoint restriction in FICLONE ioctls

> You make it sound like the heuristic decision must be made
> *after* trying to clone, but it can be made before and pass
> flags to the kernel whether or not to fall back to copy.

True, though I simplified slightly. There's other things we try
first if the clone fails, like creating a hardlink. If cloning fails,
we also often only want to copy a part of the file (again
heuristically, whether more than what the program asked
for will be useful for debugging)

> copy_file_range(2) has an unused flags argument.
> Adding support for flags like:
> COPY_FILE_RANGE_BY_FS
> COPY_FILE_RANGE_BY_KERNEL

That would solve it of course, and I'd be happy with that
solution, but it seems like we'd end up with just another
spelling for the cloning ioctls then that have subtly different
semantics.

> I can also suggest a workaround for you.
> If your only problem is bind mounts and if recorder is a privileged
> process (CAP_DAC_READ_SEARCH) then you can use a "master"
> bind mount to perform all clone operations on.
> Use name_to_handle_at(2) to get sb file handle of source file.
> Use open_by_handle_at(2) to get an open file descriptor of the source
> file under the "master" bind mount.

Thanks, that's a very valuable suggestion - I hadn't considered
that. Unfortunately, I don't think the recorder does generally have
those privileges. It doesn't help in my use case, since I'm recording
a container that makes use of user namespaces, so nothing requires
privilege, but it does seem like it would be useful if the recorder does
have appropriate capabilities (rr already has a mode where it runs
with privilege, e.g. for recording setuid binaries).

Keno

2020-04-14 14:00:20

by Amir Goldstein

Subject: Re: Same mountpoint restriction in FICLONE ioctls

On Mon, Apr 13, 2020 at 10:40 PM Keno Fischer <[email protected]> wrote:
>
> > You make it sound like the heuristic decision must be made
> > *after* trying to clone, but it can be made before and pass
> > flags to the kernel whether or not to fall back to copy.
>
> True, though I simplified slightly. There's other things we try
> first if the clone fails, like creating a hardlink. If cloning fails,
> we also often only want to copy a part of the file (again
> heuristically, whether more than what the program asked
> for will be useful for debugging)

Fair enough.

>
> > copy_file_range(2) has an unused flags argument.
> > Adding support for flags like:
> > COPY_FILE_RANGE_BY_FS
> > COPY_FILE_RANGE_BY_KERNEL
>
> That would solve it of course, and I'd be happy with that
> solution, but it seems like we'd end up with just another
> spelling for the cloning ioctls then that have subtly different
> semantics.
>

Yeh. Another spelling is a common way to change behavior.
In fact, it is the only way if you want to avoid changing behavior
of existing applications.

Generally speaking, syscall interface is an improvement over ioctl
interface. Flags like:
COPY_FILE_RANGE_REFLINK
COPY_FILE_RANGE_NO_XDEV
along with proper documentation, can help make the change of behavior
explicit. The flags mentioned above would describe the existing
FICLONERANGE semantics.
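
For reference, a minimal C sketch of how FICLONERANGE is called from
userspace today. The wrapper name clone_range is hypothetical; struct
file_clone_range and FICLONERANGE come from <linux/fs.h>. The
COPY_FILE_RANGE_* flags named above are only proposals and do not exist, so
they are not used here.

```c
#include <errno.h>
#include <stdint.h>
#include <linux/fs.h>    /* FICLONERANGE, struct file_clone_range */
#include <sys/ioctl.h>

/* Hypothetical wrapper: share len bytes of src_fd, starting at src_off,
 * into dst_fd at dst_off. Returns 0 on success, -1 with errno set.
 * Fails with EXDEV when the two files are not on the same mount. */
int clone_range(int src_fd, uint64_t src_off, uint64_t len,
                int dst_fd, uint64_t dst_off)
{
    struct file_clone_range arg = {
        .src_fd      = src_fd,
        .src_offset  = src_off,
        .src_length  = len,      /* 0 means "up to the end of the source" */
        .dest_offset = dst_off,
    };

    return ioctl(dst_fd, FICLONERANGE, &arg);
}
```

Filesystems generally expect the offsets and length to be aligned to their
block size, so a caller cloning part of a file should be ready for EINVAL.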

But the thing is that the above is not just a fancy maneuver for relaxing the
same mnt restriction of FICLONERANGE.
I believe that enhancing the semantics of copy_file_range(2) has benefits
beyond your use case.
Copy tools could make use of nfs/cifs server side copy without falling back
to kernel copy.

Thanks,
Amir.