2009-09-11 20:06:34

by Joel Becker

[permalink] [raw]
Subject: [GIT PULL] ocfs2 changes for 2.6.32

Linus, et al,
Here are the ocfs2 feature changes for 2.6.32. The big ticket
item is the reflinkat(2) system call and ocfs2's support for it. The
ocfs2 support accounts for all but a handful of the changes. The
remaining few patches are fixes.
Please pull.

Joel

The following changes since commit 8379e7c46cc48f51197dd663fc6676f47f2a1e71:
Sunil Mushran (1):
ocfs2: ocfs2_write_begin_nolock() should handle len=0

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git upstream-linus

Coly Li (1):
dlmglue.c: add missed mlog lines

Joel Becker (41):
ocfs2: Make the ocfs2_caching_info structure self-contained.
ocfs2: Change metadata caching locks to an operations structure.
ocfs2: Take the inode out of the metadata read/write paths.
ocfs2: move ip_last_trans to struct ocfs2_caching_info
ocfs2: move ip_created_trans to struct ocfs2_caching_info
ocfs2: Pass struct ocfs2_caching_info to the journal functions.
ocfs2: Store the ocfs2_caching_info on ocfs2_extent_tree.
ocfs2: Pass ocfs2_caching_info to ocfs2_read_extent_block().
ocfs2: ocfs2_find_path() only needs the caching info
ocfs2: ocfs2_create_new_meta_bhs() doesn't need struct inode.
ocfs2: Pass ocfs2_extent_tree to ocfs2_unlink_path()
ocfs2: ocfs2_complete_edge_insert() doesn't need struct inode at all.
ocfs2: Get inode out of ocfs2_rotate_subtree_root_right().
ocfs2: Pass ocfs2_extent_tree to ocfs2_get_subtree_root()
ocfs2: Drop struct inode from ocfs2_extent_tree_operations.
ocfs2: ocfs2_rotate_tree_right() doesn't need struct inode.
ocfs2: ocfs2_update_edge_lengths() doesn't need struct inode.
ocfs2: ocfs2_rotate_subtree_left() doesn't need struct inode.
ocfs2: __ocfs2_rotate_tree_left() doesn't need struct inode.
ocfs2: ocfs2_rotate_tree_left() no longer needs struct inode.
ocfs2: ocfs2_merge_rec_left/right() no longer need struct inode.
ocfs2: ocfs2_try_to_merge_extent() doesn't need struct inode.
ocfs2: ocfs2_grow_branch() and ocfs2_append_rec_to_path() lose struct inode.
ocfs2: ocfs2_truncate_rec() doesn't need struct inode.
ocfs2: Make truncating the extent map an extent_tree_operation.
ocfs2: ocfs2_insert_at_leaf() doesn't need struct inode.
ocfs2: Give ocfs2_split_record() an extent_tree instead of an inode.
ocfs2: ocfs2_do_insert_extent() and ocfs2_insert_path() no longer need an inode.
ocfs2: ocfs2_extent_contig() only requires the superblock.
ocfs2: Swap inode for extent_tree in ocfs2_figure_merge_contig_type().
ocfs2: Remove inode from ocfs2_figure_extent_contig().
ocfs2: ocfs2_figure_insert_type() no longer needs struct inode.
ocfs2: Make extent map insertion an extent_tree_operation.
ocfs2: ocfs2_insert_extent() no longer needs struct inode.
ocfs2: ocfs2_add_clusters_in_btree() no longer needs struct inode.
ocfs2: ocfs2_remove_extent() no longer needs struct inode.
ocfs2: ocfs2_split_and_insert() no longer needs struct inode.
ocfs2: Teach ocfs2_replace_extent_rec() to use an extent_tree.
ocfs2: __ocfs2_mark_extent_written() doesn't need struct inode.
ocfs2: Pass ocfs2_caching_info into ocfs_init_*_extent_tree().
fs: Add the reflink() operation and reflinkat(2) system call.

Sunil Mushran (1):
ocfs2: __ocfs2_abort() should not enable panic for local mounts

Tao Ma (42):
ocfs2: Define refcount tree structure.
ocfs2: Add metaecc for ocfs2_refcount_block.
ocfs2: Add ocfs2_read_refcount_block.
ocfs2: Abstract caching info checkpoint.
ocfs2: Add new refcount tree lock resource in dlmglue.
ocfs2: Add caching info for refcount tree.
ocfs2: Add refcount tree lock mechanism.
ocfs2: Basic tree root operation.
ocfs2: Wrap ocfs2_extent_contig in ocfs2_extent_tree.
ocfs2: Abstract extent split process.
ocfs2: Add refcount b-tree as a new extent tree.
ocfs2: move tree path functions to alloc.h.
ocfs2: Add support for incrementing refcount in the tree.
ocfs2: Add support of decrementing refcount for delete.
ocfs2: Add functions for extents refcounted.
ocfs2: Decrement refcount when truncating refcounted extents.
ocfs2: Add CoW support.
ocfs2: CoW refcount tree improvement.
ocfs2: Integrate CoW in file write.
ocfs2: CoW a reflinked cluster when it is truncated.
ocfs2: Add normal functions for reflink a normal file's extents.
ocfs2: handle file attributes issue for reflink.
ocfs2: Return extent flags for xattr value tree.
ocfs2: Abstract duplicate clusters process in CoW.
ocfs2: Add CoW support for xattr.
ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value.
ocfs2: Abstract the creation of xattr block.
ocfs2: Abstract ocfs2 xattr tree extend rec iteration process.
ocfs2: Attach xattr clusters to refcount tree.
ocfs2: Call refcount tree remove process properly.
ocfs2: Create an xattr indexed block if needed.
ocfs2: Add reflink support for xattr.
ocfs2: Modify removing xattr process for refcount.
ocfs2: Don't merge in 1st refcount ops of reflink.
ocfs2: Make transaction extend more efficient.
ocfs2: Use proper parameter for some inode operation.
ocfs2: Create reflinked file in orphan dir.
ocfs2: Add preserve to reflink.
ocfs2: Implement ocfs2_reflink.
ocfs2: Enable refcount tree support.
ocfs2: Add ioctl for reflink.
ocfs2: Use buffer IO if we are appending a file.

Wengang Wang (1):
ocfs2: add spinlock protection when dealing with lockres->purge.

Documentation/filesystems/reflink.txt | 174 ++
Documentation/filesystems/vfs.txt | 4 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 2 +
arch/x86/kernel/syscall_table_32.S | 1 +
fs/namei.c | 137 ++
fs/ocfs2/Makefile | 1 +
fs/ocfs2/alloc.c | 1342 ++++++-----
fs/ocfs2/alloc.h | 101 +-
fs/ocfs2/aops.c | 37 +-
fs/ocfs2/aops.h | 2 +
fs/ocfs2/buffer_head_io.c | 47 +-
fs/ocfs2/buffer_head_io.h | 8 +-
fs/ocfs2/cluster/masklog.c | 1 +
fs/ocfs2/cluster/masklog.h | 1 +
fs/ocfs2/dir.c | 107 +-
fs/ocfs2/dlm/dlmthread.c | 6 +-
fs/ocfs2/dlmglue.c | 105 +-
fs/ocfs2/dlmglue.h | 6 +
fs/ocfs2/extent_map.c | 33 +-
fs/ocfs2/extent_map.h | 8 +-
fs/ocfs2/file.c | 151 ++-
fs/ocfs2/file.h | 2 +
fs/ocfs2/inode.c | 86 +-
fs/ocfs2/inode.h | 20 +-
fs/ocfs2/ioctl.c | 14 +
fs/ocfs2/journal.c | 82 +-
fs/ocfs2/journal.h | 94 +-
fs/ocfs2/localalloc.c | 12 +-
fs/ocfs2/namei.c | 343 +++-
fs/ocfs2/namei.h | 6 +
fs/ocfs2/ocfs2.h | 52 +-
fs/ocfs2/ocfs2_fs.h | 107 +-
fs/ocfs2/ocfs2_lockid.h | 5 +
fs/ocfs2/quota_global.c | 5 +-
fs/ocfs2/quota_local.c | 26 +-
fs/ocfs2/refcounttree.c | 4249 +++++++++++++++++++++++++++++++++
fs/ocfs2/refcounttree.h | 108 +
fs/ocfs2/resize.c | 16 +-
fs/ocfs2/slot_map.c | 10 +-
fs/ocfs2/suballoc.c | 35 +-
fs/ocfs2/super.c | 13 +-
fs/ocfs2/uptodate.c | 265 ++-
fs/ocfs2/uptodate.h | 51 +-
fs/ocfs2/xattr.c | 2056 +++++++++++++++--
fs/ocfs2/xattr.h | 15 +-
include/linux/fcntl.h | 8 +
include/linux/fs.h | 2 +
include/linux/security.h | 23 +
include/linux/syscalls.h | 3 +
security/capability.c | 7 +
security/security.c | 8 +
53 files changed, 8823 insertions(+), 1176 deletions(-)
create mode 100644 Documentation/filesystems/reflink.txt
create mode 100644 fs/ocfs2/refcounttree.c
create mode 100644 fs/ocfs2/refcounttree.h
--

Life's Little Instruction Book #99

"Think big thoughts, but relish small pleasures."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127


2009-09-14 21:33:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Fri, 11 Sep 2009, Joel Becker wrote:
>
> Linus, et al,
> Here are the ocfs2 feature changes for 2.6.32. The big ticket
> item is the reflinkat(2) system call and ocfs2's support for it. The
> ocfs2 support accounts for all but a handful of the changes. The
> remaining few patches are fixes.

I _really_ want some kind of ack's for new filesystem system calls like
this. I'm not going to pull a new 'reflink[at]()' system call just based
on a single filesystem.

Yes, there's clearly been _some_ discussion, but (a) I've not seen it
(since it's been on 'fsdevel', which is one of those single-topic mailing
lists that I'm totally uninterested in, since they tend to become clique
groups) and (b) you don't even say whether the thing has been acked by
things like the security angle etc.

So I'm not pulling this. Not until I get the feeling that there is
consensus.

I also don't understand why it's called 'reflink'. Why not 'copyfile'? We
should not name things by implementation, we should name things by what
they _do_. And I'm not seeing what is so 'reflink' about this that it's
not a 'copyfile'. I also am not entirely clear on why you need the source
name, and not - for example - an 'fd'.

Are we going to add 'freflink[at]()' at some point?

So I want explanations for the naming, I want sign-offs from other
filesystem (and security) people, etc. What I do _not_ want is to get a
"please pull" request for a filesystem, and notice that it's suddenly not
all about just that particular filesystem, without any indication of who
you've been talking to etc etc.

Linus

2009-09-14 22:16:09

by Joel Becker

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Mon, Sep 14, 2009 at 02:32:36PM -0700, Linus Torvalds wrote:
> On Fri, 11 Sep 2009, Joel Becker wrote:
> >
> > Linus, et al,
> > Here are the ocfs2 feature changes for 2.6.32. The big ticket
> > item is the reflinkat(2) system call and ocfs2's support for it. The
> > ocfs2 support accounts for all but a handful of the changes. The
> > remaining few patches are fixes.
>
> I _really_ want some kind of ack's for new filesystem system calls like
> this. I'm not going to pull a new 'reflink[at]()' system call just based
> on a single filesystem.

I'll get specific acks. I sent it via ocfs2.git because others
recommended I not send it upstream in June but instead wait until
I had at least one filesystem implementing it.

> Yes, there's clearly been _some_ discussion, but (a) I've not seen it
> (since it's been on 'fsdevel', which is one of those single-topic mailing
> lists that I'm totally uninterested in, since they tend to become clique
> groups) and (b) you don't even say whether the thing has been acked by
> things like the security angle etc.

Fair enough. Don't worry, the security folks were involved.
I'll get direct acks.

> I also don't understand why it's called 'reflink'. Why not 'copyfile'? We
> should not name things by implementation, we should name things by what
> they _do_. And I'm not seeing what is so 'reflink' about this that it's
> not a 'copyfile'. I also am not entirely clear on why you need the source
> name, and not - for example - an 'fd'.
>
> Are we going to add 'freflink[at]()' at some point?

It's a link(2) analogue. symlink(2) has the loosest coupling,
and reflink(2) the highest. We're not going to add freflink[at]().
It's a snap, not a copy. It can be used to implement a copy, and
copyfile() in libc can be written with reflinkat(2), but it isn't just a
copy.

Joel

--

"There is shadow under this red rock.
(Come in under the shadow of this red rock)
And I will show you something different from either
Your shadow at morning striding behind you
Or your shadow at evening rising to meet you.
I will show you fear in a handful of dust."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-14 23:28:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Mon, 14 Sep 2009, Joel Becker wrote:
>
> It's a link(2) analogue. symlink(2) has the loosest coupling,
> and reflink(2) the highest. We're not going to add freflink[at]().
> It's a snap, not a copy. It can be used to implement a copy, and
> copyfile() in libc can be written with reflinkat(2), but it isn't just a
> copy.

>From all but a performance standpoint, it's a copy. It has absolutely
_zero_ "link" semantics. When you do a symlink or a hardlink, you see it
in the resulting semantics: changing one changes the other.

This 'reflink' has no such semantics that I can tell. It has purely copy
semantics, never mind that it's optimized.

And the thing to note is that it doesn't even have to be optimized as a
"link". Think about network filesystems: maybe they want to implement this
thing as a server-side "copy" operation (with atomicity guarantees).

In other words, I can well imagine that for some filesystems, there really
is no refcounting or linking implied, and that's why I think naming should
be about semantics, not some random implementation issue that just happens
to be true for some particular class of filesystems.

So tell me - are there actually any non-copying semantics as far as the
_user_ is concerned? Is there some reason why a NFS server might not
implement this as a server-side copy? Is there something fundamentally in
this all that is about reference counting as far as a user is concerned?

I also still didn't get any answer to the "freflink()" question. You just
said that we wouldn't do it, with no explanation. Why? We've discussed
'flink()' in the past, I just want to know that when we do a new system
call there is some _reason_ why it's not going to explode into many
different variants later...

Linus

2009-09-15 00:06:01

by Joel Becker

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Mon, Sep 14, 2009 at 04:27:59PM -0700, Linus Torvalds wrote:
> On Mon, 14 Sep 2009, Joel Becker wrote:
> >From all but a performance standpoint, it's a copy. It has absolutely
> _zero_ "link" semantics. When you do a symlink or a hardlink, you see it
> in the resulting semantics: changing one changes the other.

It's creating a new entry in the name space based on an old one.

> And the thing to note is that it doesn't even have to be optimized as a
> "link". Think about network filesystems: maybe they want to implement this
> thing as a server-side "copy" operation (with atomicity guarantees).

reflink doesn't merely guarantee atomicity, it guarantees the
shared data extents. Under the auspices of reflink a network filesystem
cannot merely provide an atomic copy. A separate copyfile call might
allow that, but reflink doesn't. This is deliberate, because the caller
wants the shared storage, not just a copy.

> I also still didn't get any answer to the "freflink()" question. You just
> said that we wouldn't do it, with no explanation. Why? We've discussed
> 'flink()' in the past, I just want to know that when we do a new system
> call there is some _reason_ why it's not going to explode into many
> different variants later...

Well, obviously I started from the fact that we don't have
flink(). But it doesn't really fit anyway. reflink is a namespace
operation - give me a new item in the namespace that shares the data
extents of the old item. So working from a file descriptor doesn't
quite fit. Plus, flink and freflink would have to deal with
recovering already-orphaned inodes.
Where do you stand on flink? If it actually makes sense to
you, then perhaps we should consider it and freflinkat. It doesn't
strike me as the way to go, but throughout all the discussion I'm quite
willing to be convinced.

Joel

--

"I don't know anything about music. In my line you don't have
to."
- Elvis Presley

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-15 00:31:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Mon, 14 Sep 2009, Joel Becker wrote:
>
> It's creating a new entry in the name space based on an old one.

That's just a cumbersome way of saying "copyfile".

Here's a challenge for you: go outside, take the first five people you
meet at random, and ask them what a 'copyfile()' system call would do.

Then, do the same thing with 'reflink()'.

Feel free to stack the deck, so that the people you ask about 'reflink()'
actually know computers.

Then report back which group guessed better what the system call does.

> reflink doesn't merely guarantee atomicity, it guarantees the
> shared data extents.

Why?

That just limits its usefulness. What's the reason for that sophistry,
except to try to argue for a name that makes no sense?

> Well, obviously I started from the fact that we don't have
> flink(). But it doesn't really fit anyway. reflink is a namespace
> operation - give me a new item in the namespace that shares the data
> extents of the old item.

That's not a namespace op, EXCEPT FOR THE NEW NAME.

The data you share from has no namespace component to it, except as a
lookup. But a 'fd' is equally descriptive of the shared data.

Linus

2009-09-15 00:56:50

by Joel Becker

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Mon, Sep 14, 2009 at 05:31:27PM -0700, Linus Torvalds wrote:
> On Mon, 14 Sep 2009, Joel Becker wrote:
> > reflink doesn't merely guarantee atomicity, it guarantees the
> > shared data extents.
>
> Why?
>
> That just limits its usefulness. What's the reason for that sophistry,
> except to try to argue for a name that makes no sense?

This originally came from the idea of creating file snapshots.
That was our original goal, but the more generic reflink call allows
more than snapshots to be built. You can use it to implement copyfile
or clone or a variety of things. But the snapshot capability is what
really motivates, and removing the shared data requirement means
removing that capability. Like any API we have, if it can degrade, you
have to assume it degraded. A reflink/copyfile that can just copy means
you have assume it copied and didn't conserve space. This makes it
useless for snapshotting or cloning.
In the reflink discussion before, I proposed that a separate
copyfile() syscall could be written that uses the same ->reflink() inode
operation but allows degradation in the storage handling. This would be
a little more capable than a glibc copyfile() written around reflink
because it would get the atomicity right. The separate copyfile/reflink
calls would handle the different requirements of storage handling. I
just concentrated on reflink and didn't worry about that alternate
copyfile at the time being.
I'm open to another proposal on how to do it. As a user, I need
a way to ask for a reflink/copyfile that fails if it can't share the
data. Things like snapshots and cloning gold VM images can't be
doubling the storage. They become pointless.
About the name, the reflink name came out of "you call it like
link(2)" and "the storage is reference counted CoW". It really works
well as "ln -r". Folks at the filesystem summit liked it, so I didn't
change it. It's not so much that it has to be "reflink", but I've
avoided "copyfile" because copyfile intuitively sounds like you
describe, including the plain-copy fallback. Want me to call the
requires-shared-data-because-its-a-snap version snapfileat(2)?
Something better?

> > Well, obviously I started from the fact that we don't have
> > flink(). But it doesn't really fit anyway. reflink is a namespace
> > operation - give me a new item in the namespace that shares the data
> > extents of the old item.
>
> That's not a namespace op, EXCEPT FOR THE NEW NAME.
>
> The data you share from has no namespace component to it, except as a
> lookup. But a 'fd' is equally descriptive of the shared data.

Ok, I gather that you find freflink (and by extension, flink)
compelling. I can certainly implement it.

Joel

--

A good programming language should have features that make the
kind of people who use the phrase "software engineering" shake
their heads disapprovingly.
- Paul Graham

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-15 02:01:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Mon, 14 Sep 2009, Joel Becker wrote:
>
> In the reflink discussion before, I proposed that a separate
> copyfile() syscall could be written that uses the same ->reflink() inode
> operation but allows degradation in the storage handling.

.. exactly how?

If you're talking about falling back to manually just copying the data,
then nobody is interested in that. User space can do that better with a
simple read-write loop or with splice, or whatever. There's no reaason
what-so-ever to do that.

But the thing is, network filesystems may be able to do server-side
copies, and the point being that they can do so _without_ transferring the
data to the client (and back). And if we do 'copyfile' (under whatever
name) for one filesystem, then I think we should strive to make sure that
it's useful for other filesystems too.

Just google for "NFS Server-side Copy". And SMB has had a COPY command
from the very beginning, I think.

And as far as I can tell, neither NFS nor CIFS could use your definition
of 'reflink()'. They aren't reflinks. Or rather, _could_ be, on the
server, of course, but what some people want to do is to avoid moving data
over the network. So it's not about "don't use more diskspace" for that
kind of application.

Do we really want to introduce a new filesystem operation that is likely
to be broken for something like that?

Now, it's possible that nobody will ever care, and that NFS server-side
copy goes the way of a lot of other failed trials. But I really hope you
have at least _talked_ to some CIFS/NFS people about this.

[ Btw, it's quite possible that CIFS/NFS people would want more than a
single entrypoint. I think they might want partial copies and status
updates etc, which would likely mean that a single ->copyfile() thing
isn't sufficient.

Maybe it's not worth it, and the complexity of something like that gets
to be too annoying. But I don't get the feeling that you've even _tried_
to see if this can be generalized to something that would be much more
widely useful ]

Now, I can see that you might want to say "fail rather than use double
the diskspace for data". But why not just do that as a flag? You already
have flags for 'copy extended attributes or not'. Why not have a flag that
says 'copy only if you can do it without any extra space'?

Linus

2009-09-15 04:02:30

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Mon, 14 Sep 2009 19:01:06 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Mon, 14 Sep 2009, Joel Becker wrote:
> >
> > In the reflink discussion before, I proposed that a separate
> > copyfile() syscall could be written that uses the same ->reflink()
> > inode operation but allows degradation in the storage handling.
>
> .. exactly how?
>
> If you're talking about falling back to manually just copying the
> data, then nobody is interested in that. User space can do that
> better with a simple read-write loop or with splice, or whatever.
> There's no reaason what-so-ever to do that.
>
> But the thing is, network filesystems may be able to do server-side
> copies, and the point being that they can do so _without_
> transferring the data to the client (and back). And if we do
> 'copyfile' (under whatever name) for one filesystem, then I think we
> should strive to make sure that it's useful for other filesystems too.

COW filesystems like btrfs may also be able to do interesting things
with copyfile() btw by just sharing all data blocks COW.

That would make copyfile() useful for me, much more so than the network
filesystem side...

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-15 04:08:58

by Joel Becker

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Mon, Sep 14, 2009 at 07:01:06PM -0700, Linus Torvalds wrote:
> On Mon, 14 Sep 2009, Joel Becker wrote:
> >
> > In the reflink discussion before, I proposed that a separate
> > copyfile() syscall could be written that uses the same ->reflink() inode
> > operation but allows degradation in the storage handling.
>
> .. exactly how?
>
> If you're talking about falling back to manually just copying the data,
> then nobody is interested in that. User space can do that better with a
> simple read-write loop or with splice, or whatever. There's no reaason
> what-so-ever to do that.

I'm talking about any facility for copying that isn't just a
userspace loop. Like your discussion of network filesystems.

> But the thing is, network filesystems may be able to do server-side
> copies, and the point being that they can do so _without_ transferring the
> data to the client (and back). And if we do 'copyfile' (under whatever
> name) for one filesystem, then I think we should strive to make sure that
> it's useful for other filesystems too.

Hence I brought this to the filesystem summit and then fsdevel
rather than just implementing it in ocfs2. I know NFS folks were in the
room in April, and they said the call definition was workable. Can't
remember if CIFS folks were there, but I think so.

> [ Btw, it's quite possible that CIFS/NFS people would want more than a
> single entrypoint. I think they might want partial copies and status
> updates etc, which would likely mean that a single ->copyfile() thing
> isn't sufficient.
>
> Maybe it's not worth it, and the complexity of something like that gets
> to be too annoying. But I don't get the feeling that you've even _tried_
> to see if this can be generalized to something that would be much more
> widely useful ]

I brought it up in a forum with everyone there precisely so that
I wouldn't miss their concerns via myopia. reflink() is a generic
application of the specific "let's snapshot inodes" idea. It doesn't do
"atomic copy of data into duplicate storage", nor does it do "send byte
ranges". The goal was something straightforward, not a kitchen sink.

> Now, I can see that you might want to say "fail rather than use double
> the diskspace for data". But why not just do that as a flag? You already
> have flags for 'copy extended attributes or not'. Why not have a flag that
> says 'copy only if you can do it without any extra space'?

We could. Like I said, I really wanted something simple and
clean. I tried hard to avoid that other flag, but I had to give up due
to (correct) concerns from the security folks.
I'm looking at both the ease of calling the call and how we
define userspace programs to use it. reflink(1), the program, is
essentially a synonym for 'ln -r' right now. That's pretty nice to use
from a script. Other ideas have been 'cp --reflink' or 'cp --clone',
but every proposal for a cp argument has felt awful and clunky.
If I were doing a straight copyfile(), ignoring the reflink
symantics, I'd want something that could be done by cp(1) at all times
(rc = copyfile(); if ENOSYS do_normal_copy()). I mean, if we do it
right, why not take advantage at all times. Using reflink here violates
peoples expectations, because a reflink, with its shared data extents,
can ENOSPC when you do CoW. Whereas a copyfile() that expects to
duplicate the storage can fit within defualt cp.

Joel

--

"Heav'n hath no rage like love to hatred turn'd, nor Hell a fury,
like a woman scorn'd."
- William Congreve

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-15 04:38:21

by Joel Becker

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Tue, Sep 15, 2009 at 06:05:59AM +0200, Arjan van de Ven wrote:
> COW filesystems like btrfs may also be able to do interesting things
> with copyfile() btw by just sharing all data blocks COW.
>
> That would make copyfile() useful for me, much more so than the network
> filesystem side...

btrfs reflink is definitely in the cards.

Joel

--

"The first thing we do, let's kill all the lawyers."
-Henry VI, IV:ii

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-15 06:44:52

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Mon, 14 Sep 2009, Linus Torvalds wrote:
> Are we going to add 'freflink[at]()' at some point?

We already have an interface for "freflink()" called splice(). Splice
has all the arguments needed to implement refcounted data copies and
more.

Yes, splice's name implies more of a "piecing data together" type of
operation, but we can look at reflink() or copyfile() as just "splice
one big piece from here to there".

Thanks,
Miklos

2009-09-15 16:31:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Mon, 14 Sep 2009, Joel Becker wrote:
> >
> > If you're talking about falling back to manually just copying the data,
> > then nobody is interested in that. User space can do that better with a
> > simple read-write loop or with splice, or whatever. There's no reaason
> > what-so-ever to do that.
>
> I'm talking about any facility for copying that isn't just a
> userspace loop. Like your discussion of network filesystems.

HOW?

We need to have a per-filesystem interface to that.

Having a '->copyfile()' function would be great.

But don't you see how _idiotic_ it is to then also having a '->reflink()'
function that does _conceptually_ the exact same thing, except it does it
by incrementing a usage count instead?

Do you see why I'm so unhappy to add a ->reflink() function?

> Hence I brought this to the filesystem summit and then fsdevel
> rather than just implementing it in ocfs2. I know NFS folks were in the
> room in April, and they said the call definition was workable. Can't
> remember if CIFS folks were there, but I think so.

It's not workable if you define the 'reflink()' function to not use any
disk space on the filesystem. Because SMB _will_ do a copy (and I presume
the NFS thing will too). So it would not in general be what you call
reflink, it will not be a "snapshot".

So if you _define_ the semantics of "reflink" to be that it's atomic and
doesn't use any new diskspace (apart from the new inode/directory entry,
of course), then it will be almost totally useless to other filesystems.

In fact, it's entirely possible to have filesystems that can avoid copying
the _data_ blocks, but would need to copy the indirect blocks - maybe the
data blocks are ref-counted, but the metadata needs to be per-file (I can
see many reasons to do it that way, even if it's organized as a tree -
it's how we do page table COW, for example, and it makes some things much
simpler).

Would that be a 'reflink()' or not? I have no way of knowing, because you
have decided on reflink on a purely ocfs2-specific implementation basis.
But I do know that such a filesystem would be perfectly happy to have a
'copyfile' function.

This is why I want the VFS pointers to be about _semantics_, not about
some random implementation detail.

Linus

2009-09-15 21:47:29

by Joel Becker

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Tue, Sep 15, 2009 at 09:30:54AM -0700, Linus Torvalds wrote:
> HOW?
>
> We need to have a per-filesystem interface to that.

No argument here.

> But don't you see how _idiotic_ it is to then also having a '->reflink()'
> function that does _conceptually_ the exact same thing, except it does it
> by incrementing a usage count instead?
>
> Do you see why I'm so unhappy to add a ->reflink() function?

I got it the first time. You see reflink() as a copyfile(), and
distinguishing the inode operations doesn't make sense to you. Quite
frankly, it doesn't to me either. There is the user<->kernel interface
of the system call, and there is the filesystem interface of the inode
operation. One inode op that can support multiple variations of
user<->kernel is find with me!
Let's step back a second. I'm not married to the name
'reflink'. I'm not opposed to a copyfile() syscall. I think I have a
clearer idea of what I see. More below.

> Would that be a 'reflink()' or not? I have no way of knowing, because you
> have decided on reflink on a purely ocfs2-specific implementation basis.
> But I do know that such a filesystem would be perfectly happy to have a
> 'copyfile' function.

That's not fair. I deliberately defined it as something outside
of the ocfs2 implementation. Apparently I didn't do a good enough job.

> This is why I want the VFS pointers to be about _semantics_, not about
> some random implementation detail.

Again, no argument here. The syscall interface better be
reasonably obvious to the userspace programmer. The VFS pointer better
be an efficient and clean way to implement the syscall interface.
I'm seeing three things here:

1. A CoW snapshot of an inode. This is reflink. It expressly defines
metadata as copyable, but data must be shared in a CoW fashion (to
answer your question about indirect blocks). You either get a
snapshot or nothing. Call it snapfile() if you like. Don't care.

2. An efficient copy. This is what you're talking about with CIFS COPY,
etc. You want to be guaranteed it does NOT do CoW, because it would
be great for a naive cp(1) to use it without the ENOSPC surprise of
CoW. You'd like the kernel call to fail if you're just going to get
read-write-loops, because userspace can implement that better. Maybe
we have it such that only network filesystems implement this action,
all the others return -ENOTSUPP, and then glibc handles the
read-write-loop. This allows everyone to call copyfile() and get
what they expected.

3. A space-saving copy. This is doing CoW linkup of the data storage if
possible, like a snapshot but without the atomicity guarantee. It
has the ENOSPC surprise, but someone using it should know that.

I think it would be great for Linux to provide all three. I
chose to only attack (1) because I could define it well. I left (2) and
(3), what I see as copyfile(), for later work. And I fully expected
that the VFS operation could change later - it's an internal thing,
after all. I want to get a good user<->kernel interface, because that's
the one that is set in stone. What I didn't want was to create another
kitchen-sink call, or another POSIXy thing that has a million special
cases that trip folks up.
I'm glad you've taken an interest, because you're pretty damned
good at architecture. If we can expand to cover copyfile sanely too,
win-win. To me, the user<->kernel interface really is two system calls:
reflink/snapfile for (1) and copyfile for (2) & (3). The kernel VFS
interface I would think you could do in one inode operation. If you
want to name it ->copyfile, that's fine.
Perhaps ->copyfile takes the following flags:

#define ALLOW_COW_SHARED 0x0001
#define REQUIRE_COW_SHARED 0x0002
#define REQUIRE_BASIC_ATTRS 0x0004
#define REQUIRE_FULL_ATTRS 0x0008
#define REQUIRE_ATOMIC 0x0010
#define SNAPSHOT (REQUIRE_COW_SHARED |
REQUIRE_BASIC_ATTRS |
REQUIRE_ATOMIC)
#define SNAPSHOT_PRESERVE (SNAPSHOT | REQUIRE_FULL_ATTRS)

Thus, sys_reflink/sys_snapfile(oldpath, newpath, 0) becomes:

->copyfile(oldpath, newpath, SNAPSHOT)

and sys_reflink/sys_snapfile(oldpath, newpath, ATTR_PRESERVE) becomes:

->copyfile(oldpath, newpath, SNAPSHOT_PRESERVE)

while sys_copyfile(oldpath, newpath, 0) is:

->copyfile(oldpath, newpath, 0)

and sys_copyfile(oldpath, newpath, ALLOW_COW) is:

->copyfile(oldpath, newpath, ALLOW_COW_SHARED)

What do you think? Other ideas?

Joel
--

"The lawgiver, of all beings, most owes the law allegiance. He of all
men should behave as though the law compelled him. But it is the
universal weakness of mankind that what we are given to administer we
presently imagine we own."
- H.G. Wells

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-16 04:20:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Tue, 15 Sep 2009, Joel Becker wrote:
>
> Perhaps ->copyfile takes the following flags:
>
> #define ALLOW_COW_SHARED 0x0001
> #define REQUIRE_COW_SHARED 0x0002
> #define REQUIRE_BASIC_ATTRS 0x0004
> #define REQUIRE_FULL_ATTRS 0x0008
> #define REQUIRE_ATOMIC 0x0010
> #define SNAPSHOT (REQUIRE_COW_SHARED |
> REQUIRE_BASIC_ATTRS |
> REQUIRE_ATOMIC)
> #define SNAPSHOT_PRESERVE (SNAPSHOT | REQUIRE_FULL_ATTRS)
>
> Thus, sys_reflink/sys_snapfile(oldpath, newpath, 0) becomes:
> ...

Yes. The above all sounds sane to me.

I still worry that especially the non-atomic case will want some kind of
partial-copy updates (think graphical file managers that want to show the
progress of the copy), and that (think EINTR and continuing) makes me
think "that could get really complex really quickly", but that's something
that the NFS/SMB people would have to pipe up on. I'm pretty sure the NFS
spec has some kind "partial completion notification" model, I dunno about
SMB.

Linus

2009-09-16 04:42:23

by Joel Becker

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Tue, Sep 15, 2009 at 09:20:47PM -0700, Linus Torvalds wrote:
> On Tue, 15 Sep 2009, Joel Becker wrote:
> >
> > Perhaps ->copyfile takes the following flags:
> >
> > #define ALLOW_COW_SHARED 0x0001
> > #define REQUIRE_COW_SHARED 0x0002
> > #define REQUIRE_BASIC_ATTRS 0x0004
> > #define REQUIRE_FULL_ATTRS 0x0008
> > #define REQUIRE_ATOMIC 0x0010
> > #define SNAPSHOT (REQUIRE_COW_SHARED |
> > REQUIRE_BASIC_ATTRS |
> > REQUIRE_ATOMIC)
> > #define SNAPSHOT_PRESERVE (SNAPSHOT | REQUIRE_FULL_ATTRS)
> >
> > Thus, sys_reflink/sys_snapfile(oldpath, newpath, 0) becomes:
> > ...
>
> Yes. The above all sounds sane to me.

Ok. Where do you see the exposure level? What I mean is, I
just defined a vfs op that handles these things, but accessed it via two
syscalls, sys_snapfile() and sys_copyfile(). We could also just provide
one system call and allow userspace to use these flags itself, creating
snapfile(3) and copyfile(3) in libc, hiding the details (kind of like
clone being hidden by pthreads, though ignoring that pthreads has
"issues"). Or we could explicitly make this the public API and expect
something like cp(1) to directly use the flags. Thoughts?

> I still worry that especially the non-atomic case will want some kind of
> partial-copy updates (think graphical file managers that want to show the
> progress of the copy), and that (think EINTR and continuing) makes me
> think "that could get really complex really quickly", but that's something
> that the NFS/SMB people would have to pipe up on. I'm pretty sure the NFS
> spec has some kind "partial completion notification" model, I dunno about
> SMB.

I'm really wary of combining a ranged interface with this one.
Not only does it make no sense for snapshots, but I think it falls down
in any "create a new inode" scheme entirely.
btrfs has an ioctl that basically says "link up range x->y of
file 1 to file 2". Chris is using the underlying machinery to implement
reflink, but I think the concept actually would work nicely as a splice
flag. If you have two existing files, not creating one, you can just
ask splice to do efficient things with a SPLICE_F_EFFICIENT_COPY for
yoru CIFS COPY-style thing or SPLICE_F_COW_COPY for btrfs- and
ocfs2-style data sharing.

Joel

--

"Nothing is wrong with California that a rise in the ocean level
wouldn't cure."
- Ross MacDonald

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-17 16:29:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Tue, 15 Sep 2009, Joel Becker wrote:
>
> Ok. Where do you see the exposure level? What I mean is, I
> just defined a vfs op that handles these things, but accessed it via two
> syscalls, sys_snapfile() and sys_copyfile(). We could also just provide
> one system call and allow userspace to use these flags itself, creating
> snapfile(3) and copyfile(3) in libc

Why would anybody want to hide it at all? Why even the libc hiding?

Nobody is going to use this except for special apps. Let them see what
they can do, in all its glory.

> > I still worry that especially the non-atomic case will want some kind of
> > partial-copy updates (think graphical file managers that want to show the
> > progress of the copy), and that (think EINTR and continuing) makes me
> > think "that could get really complex really quickly", but that's something
> > that the NFS/SMB people would have to pipe up on. I'm pretty sure the NFS
> > spec has some kind "partial completion notification" model, I dunno about
> > SMB.
>
> I'm really wary of combining a ranged interface with this one.
> Not only does it make no sense for snapshots, but I think it falls down
> in any "create a new inode" scheme entirely.

Oh, I wouldn't suggest a ranged interface, just one that allows for status
updates and cancelling - _if_ the initial op isn't atomic to begin with.
There's also the issue of concurrency in IO: maybe you want to start
several things without necessarily waiting for them (think high-throughput
"cp -R" on NFS or something like that).

So I'd suggest something like having two system calls: one to start the
operation, and one to control it. And for a filesystem that does atomic
copies, the 'start' one obviously would also finish it, so the 'control'
it would be a no-op, because there would never be any outstanding ones.

See what I'm saying? It wouldn't complicate _your_ life, but it would
allow for filesystems that can't do it atomically (or even quickly).

So the first one would be something like

int copyfile(const char *src, const char *dest, unsigned long flags);

which would return:

- zero on success
- negative (with errno) on error
- positive cookie on "I started it, here's my cookie". For extra bonus
points, maybe the cookie would actually be a file descriptor (for
poll/select users), but it would _not_ be a file descriptor to the
resulting _file_, it would literally be a "cookie" to the actual
copyfile event.

and then for ocfs2 you'd never return positive cookies. You'd never have
to worry about it.

Then the second interface would be something like

int copyfile_ctrl(long cookie, unsigned long cmd);

where you'd just have some way to wait for completion and ask how much has
been copied. The 'cmd' would be some set of 'cancel', 'status' or
'uninterruptible wait' or whatever, and the return value would again be

- negative (with errno) for errors (copy failed) - cookie released
- zero for 'done' - cookie released
- positive for 'percent remaining' or whatever - cookie still valid

and this would be another callback into the filesystem code, but you'd
never have to worry about it, since you'd never see it (just leave it
NULL).

NOTE! The above is a rough idea - I have not spent tons of time thinking
about it, or looking at exactly what something like NFS would really want.
But the _concept_ is simple, and usage should be pretty trivial. A simple
case would be something like this:

int copy_file(const char *src, const char *dst)
{
/* Start a file copy */
int cookie = copyfile(src, dst, 0);

/* Async case? */
if (cookie > 0) {
int ret;

while ((ret = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
/* nothing */;

/* Error handling is shared for async/sync */
cookie = ret;
}
if (cookie < 0) {
perror("copyfile failed");
return -1;
}
return 0;
}

doesn't that look fairly easy to use?

And the advantage here is that you _can_ - still fairly easily - do much
more involved things. For example, let's say that you wanted to do a very
efficient parallel copy, so you'd do something like this:

#define MAX_PEND 10
static int pending[MAX_PEND];
static int nr_pending = 0;

static int wait_for_completion(int nr_left)
{
int ret;

while (nr_pending > nr_left) {
int cookie = pending[0], i;

/* Wait for completion of the oldest entry */
while ((i = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
/* nothing */;

/* Save the "we had an error" case */
if (i < 0)
ret = i;

/* Move the other entries down */
memmove(pending, pending+1, sizeof(int)*--nr_pending);
}
return ret;
}

int start_copy(src, dst)
{
int cookie, ret;

cookie = copyfile(src, dst, 0);
if (cookie <= 0)
return cookie;

ret = 0;
if (nr_pending == MAX_PENDING)
ret = wait_for_completion(pending, MAX_PENDING/2);

pending[nr_pending++] = cookie;
return ret;
}

int stop_copy(void)
{
return wait_for_completion(pending, 0);
}

which basically ends up having ten copyfile() calls outstanding (and when
we hit the limit, we wait for half of them to complete), so now you can do
an efficient "cp -R" with concurrent server-side IO. And it wasn't so
hard, was it?

(Ok, so the above would need to be fleshed out to remember the filenames
so that you can report _which_ file failed etc, but you get the idea).

And again, it wouldn't be any more complicated for your case. Your
copyfile would always just return 0 or negative for error. But it would be
_way_ more powerful for filesystems that want to do potentially lots of IO
for the file copy.

I dunno. The above seems like a fairly simple and powerful interface, and
I _think_ it would be ok for NFS and CIFS. And in fact, if that whole
"background copy" ends up being used a lot, maybe even a local filesystem
would implement it just to get easy overlapping IO - even if it would just
be a trivial common wrapper function that says "start a thread to do a
trivial manual copy".

Linus

2009-09-17 16:38:17

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Thu, 17 Sep 2009 09:29:14 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Tue, 15 Sep 2009, Joel Becker wrote:
> >
> > Ok. Where do you see the exposure level? What I mean is, I
> > just defined a vfs op that handles these things, but accessed it
> > via two syscalls, sys_snapfile() and sys_copyfile(). We could also
> > just provide one system call and allow userspace to use these flags
> > itself, creating snapfile(3) and copyfile(3) in libc
>
> Why would anybody want to hide it at all? Why even the libc hiding?
>
> Nobody is going to use this except for special apps. Let them see
> what they can do, in all its glory.
>
> > > I still worry that especially the non-atomic case will want some
> > > kind of partial-copy updates (think graphical file managers that
> > > want to show the progress of the copy), and that (think EINTR and
> > > continuing) makes me think "that could get really complex really
> > > quickly", but that's something that the NFS/SMB people would have
> > > to pipe up on. I'm pretty sure the NFS spec has some kind
> > > "partial completion notification" model, I dunno about SMB.
> >
> > I'm really wary of combining a ranged interface with this
> > one. Not only does it make no sense for snapshots, but I think it
> > falls down in any "create a new inode" scheme entirely.
>
> Oh, I wouldn't suggest a ranged interface, just one that allows for
> status updates and cancelling - _if_ the initial op isn't atomic to
> begin with. There's also the issue of concurrency in IO: maybe you
> want to start several things without necessarily waiting for them
> (think high-throughput "cp -R" on NFS or something like that).
>
> So I'd suggest something like having two system calls: one to start
> the operation, and one to control it. And for a filesystem that does
> atomic copies, the 'start' one obviously would also finish it, so the
> 'control' it would be a no-op, because there would never be any
> outstanding ones.
>
> See what I'm saying? It wouldn't complicate _your_ life, but it would
> allow for filesystems that can't do it atomically (or even quickly).
>
> So the first one would be something like
>
> int copyfile(const char *src, const char *dest, unsigned long
> flags);
>
> which would return:
>
> - zero on success
> - negative (with errno) on error
> - positive cookie on "I started it, here's my cookie". For extra
> bonus points, maybe the cookie would actually be a file descriptor
> (for poll/select users), but it would _not_ be a file descriptor to
> the resulting _file_, it would literally be a "cookie" to the actual
> copyfile event.
>
> and then for ocfs2 you'd never return positive cookies. You'd never
> have to worry about it.
>
> Then the second interface would be something like
>
> int copyfile_ctrl(long cookie, unsigned long cmd);
>
> where you'd just have some way to wait for completion and ask how
> much has been copied. The 'cmd' would be some set of 'cancel',
> 'status' or 'uninterruptible wait' or whatever, and the return value
> would again be
>
> - negative (with errno) for errors (copy failed) - cookie released
> - zero for 'done' - cookie released
> - positive for 'percent remaining' or whatever - cookie still valid
>
> and this would be another callback into the filesystem code, but
> you'd never have to worry about it, since you'd never see it (just
> leave it NULL).
>
> NOTE! The above is a rough idea - I have not spent tons of time
> thinking about it, or looking at exactly what something like NFS
> would really want. But the _concept_ is simple, and usage should be
> pretty trivial. A simple case would be something like this:
>
> int copy_file(const char *src, const char *dst)
> {
> /* Start a file copy */
> int cookie = copyfile(src, dst, 0);
>
> /* Async case? */
> if (cookie > 0) {
> int ret;
>
> while ((ret = copyfile_ctrl(cookie, COPYFILE_WAIT)) >
> 0) /* nothing */;
>
> /* Error handling is shared for async/sync */
> cookie = ret;
> }
> if (cookie < 0) {
> perror("copyfile failed");
> return -1;
> }
> return 0;
> }
>
> doesn't that look fairly easy to use?
>
> And the advantage here is that you _can_ - still fairly easily - do
> much more involved things. For example, let's say that you wanted to
> do a very efficient parallel copy, so you'd do something like this:
>
> #define MAX_PEND 10
> static int pending[MAX_PEND];
> static int nr_pending = 0;
>
> static int wait_for_completion(int nr_left)
> {
> int ret;
>
> while (nr_pending > nr_left) {
> int cookie = pending[0], i;
>
> /* Wait for completion of the oldest entry */
> while ((i = copyfile_ctrl(cookie,
> COPYFILE_WAIT)) > 0) /* nothing */;
>
> /* Save the "we had an error" case */
> if (i < 0)
> ret = i;
>
> /* Move the other entries down */
> memmove(pending, pending+1,
> sizeof(int)*--nr_pending); }
> return ret;
> }
>
> int start_copy(src, dst)
> {
> int cookie, ret;
>
> cookie = copyfile(src, dst, 0);
> if (cookie <= 0)
> return cookie;
>
> ret = 0;
> if (nr_pending == MAX_PENDING)
> ret = wait_for_completion(pending,
> MAX_PENDING/2);
>
> pending[nr_pending++] = cookie;
> return ret;
> }
>
> int stop_copy(void)
> {
> return wait_for_completion(pending, 0);
> }
>
> which basically ends up having ten copyfile() calls outstanding (and
> when we hit the limit, we wait for half of them to complete), so now
> you can do an efficient "cp -R" with concurrent server-side IO. And
> it wasn't so hard, was it?
>
> (Ok, so the above would need to be fleshed out to remember the
> filenames so that you can report _which_ file failed etc, but you get
> the idea).
>
> And again, it wouldn't be any more complicated for your case. Your
> copyfile would always just return 0 or negative for error. But it
> would be _way_ more powerful for filesystems that want to do
> potentially lots of IO for the file copy.
>
> I dunno. The above seems like a fairly simple and powerful interface,
> and I _think_ it would be ok for NFS and CIFS. And in fact, if that
> whole "background copy" ends up being used a lot, maybe even a local
> filesystem would implement it just to get easy overlapping IO - even
> if it would just be a trivial common wrapper function that says
> "start a thread to do a trivial manual copy".

or make it one level simpler?
Have a "wait for all started copies" call only.... saves a ton of book
keeping, and is likely what people will use it for in the end anyway.


(implementation wise the fallback implementation could then just use
the async function calls if it wanted to, and just wait for all copies
to finish in the complete call)


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-17 18:40:33

by Roland Dreier

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32


> int copy_file(const char *src, const char *dst)
> {
> /* Start a file copy */
> int cookie = copyfile(src, dst, 0);
>
> /* Async case? */
> if (cookie > 0) {
> int ret;
>
> while ((ret = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
> /* nothing */;
>
> /* Error handling is shared for async/sync */
> cookie = ret;
> }
> if (cookie < 0) {
> perror("copyfile failed");
> return -1;
> }

I guess one bit of semantics to figure out is what happens if copyfile()
does the async case but then copyfile_ctrl() returns an error halfway
through... is the state of the dest file just undefined?

- R.

2009-09-17 20:16:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Thu, 17 Sep 2009, Arjan van de Ven wrote:
>
> or make it one level simpler?
> Have a "wait for all started copies" call only.... saves a ton of book
> keeping, and is likely what people will use it for in the end anyway.

No. That wouldn't work. For a few reasons:

- the case I didn't show was the "graphical file manager client" thing
that wants to show the copy as it progresses. It needs to know how much
is left, and for which file.

- if errors happen, you need to indicate which file had an error. Again,
my example code didn't show that, since it was written as an example
and obviously just while writing email anyway. But it's a major
requirement for any sane and reliable filesystem model!

- even in my example, I wanted to show how you don't want to wait for
_all_ of them in the middle, you just want to wait for some of them. If
you wait for all of them - just to make room for more - you're going to
have hickups in your IO patterns and you cannot saturate your server or
disks well.

So I really think there needs to be a cookie per outstanding file copy (of
course, the kernel is likely to not allow a single user more than 'n'
outstanding copies anyway, but that's a separate issue).

Linus

2009-09-17 20:18:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Thu, 17 Sep 2009, Roland Dreier wrote:
>
> I guess one bit of semantics to figure out is what happens if copyfile()
> does the async case but then copyfile_ctrl() returns an error halfway
> through... is the state of the dest file just undefined?

I think that's the one that most filesystems would prefer. Maybe the file
is there, it's just that it's only half copied because the filesystem
filled up.

Making filesystems give atomicity guarantees would be hard for the async
case.

Of course, if the filesystem can do the copy entirely atomically (ie by
just incrementing a refcount), then it can give atomicity guarantees, but
then you'd never see the async case either.

Linus

2009-09-17 20:36:38

by Joel Becker

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

On Thu, Sep 17, 2009 at 01:17:55PM -0700, Linus Torvalds wrote:
> On Thu, 17 Sep 2009, Roland Dreier wrote:
> >
> > I guess one bit of semantics to figure out is what happens if copyfile()
> > does the async case but then copyfile_ctrl() returns an error halfway
> > through... is the state of the dest file just undefined?
>
> I think that's the one that most filesystems would prefer. Maybe the file
> is there, it's just that it's only half copied because the filesystem
> filled up.

I have to say, adding 'undefined behavior' things isn't fun in a
call that is already potentially confusing. We have a bunch of flags
and behaviors we're covering.

> Making filesystems give atomicity guarantees would be hard for the async
> case.

Note that "cleaning up after an error" and "atomic" are not the
same. Atomicity implies that not only do you see all or none, but that
the contents are a point-in-time of the source file. A non-atomic
implementation may be affected by writes that happen during the copy
(like any read-write-loop copy would be).
As an example, ocfs2_reflink() builds the target inode in the
orphan directory. If the operation fails at any point, it's removed.
If we crash, orphan cleanup happens. Only if it succeeds do we move it
to the target directory. ocfs2_reflink() is an atomic snapshot, of
course, but recoverability is certainly possible for a non-atomic
copyfile() on filesystems with similar orphan schemes (ext3 is the
obvious example).
Of course, how the network filesystems might see it, I don't
know. NFS/CIFS folks, please speak up.

> Of course, if the filesystem can do the copy entirely atomically (ie by
> just incrementing a refcount), then it can give atomicity guarantees, but
> then you'd never see the async case either.

Even the atomic copy might take a little time (say, to bump up
and write out the metadata structures). Do you want to define that as
not being async? I was figuring COPYFILE_ATOMIC and COPYFILE_WAIT to be
separate flags.

Joel

--

"Behind every successful man there's a lot of unsuccessful years."
- Bob Brown

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-17 20:42:29

by Roland Dreier

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32


> > I guess one bit of semantics to figure out is what happens if copyfile()
> > does the async case but then copyfile_ctrl() returns an error halfway
> > through... is the state of the dest file just undefined?

> I think that's the one that most filesystems would prefer. Maybe the file
> is there, it's just that it's only half copied because the filesystem
> filled up.

Makes sense... and even having the state of having the file half-copied
is not really well-defined since a filesystem may want to optimize
things by copying out of order etc.

2009-09-17 20:56:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Thu, 17 Sep 2009, Roland Dreier wrote:
> > > I guess one bit of semantics to figure out is what happens if copyfile()
> > > does the async case but then copyfile_ctrl() returns an error halfway
> > > through... is the state of the dest file just undefined?
>
> > I think that's the one that most filesystems would prefer. Maybe the file
> > is there, it's just that it's only half copied because the filesystem
> > filled up.
>
> Makes sense... and even having the state of having the file half-copied
> is not really well-defined since a filesystem may want to optimize
> things by copying out of order etc.

Yeah.

The thing to remember is that where you'd use a non-atomic 'copyfile()'
system call is as a replacement for just doing the same thing by hand in
user space, so any users of this system call would basically have to
handle the "oops, it failed with ENOSPC in the middle" _anyway_.

So there is no downside to saying "ok, it failed in the middle, you clean
up the result", because the user needs to support that anyway.

The ones that use copyfile for filesystem-specific snapshots and use the
ATOMIC bit to say so obviously don't have this issue. But they aren't
looking for a "faster 'cp'" thing - they're looking for very specific
semantics, and for the ATOMIC case we can provide those kinds of nice
atomic "everything or nothing" semantics.

So the fact that some random server-side copy over CIFS/NFS may need
cleanup by the user after failing at half-point is not actually a
downside. It doesn't affect Joel's kind of fancy use.

Linus

2009-09-18 00:30:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32



On Thu, 17 Sep 2009, Joel Becker wrote:
>
> I have to say, adding 'undefined behavior' things isn't fun in a
> call that is already potentially confusing. We have a bunch of flags
> and behaviors we're covering.

I don't think it's "undefined". It's just not complete.

>From a user standpoint, there is no difference between such a system call
and doing the thing as a library routine (which has to be the fallback
anyway for something like 'cp').

> Note that "cleaning up after an error" and "atomic" are not the
> same. Atomicity implies that not only do you see all or none, but that
> the contents are a point-in-time of the source file. A non-atomic
> implementation may be affected by writes that happen during the copy
> (like any read-write-loop copy would be).

Sure. There are middle grounds that may not need the cleanup. I was more
looking at the two extreme ends.

> > Of course, if the filesystem can do the copy entirely atomically (ie by
> > just incrementing a refcount), then it can give atomicity guarantees, but
> > then you'd never see the async case either.
>
> Even the atomic copy might take a little time (say, to bump up
> and write out the metadata structures). Do you want to define that as
> not being async? I was figuring COPYFILE_ATOMIC and COPYFILE_WAIT to be
> separate flags.

I don't think that's wrong, and yeah, you could decide that you actually
want to be able to support the "ten outstanding 'copy' commands from user
space" model too. So yeah, separate COPYFILE_ATOMIC (only succeed if you
can do it as an atomic op in the filesystem) and COPYFILE_WAIT (only
return when fully done) bits sounds conceptually fine to me.

Whether it's worth it for a filesystem that effectively only needs a
couple of writes (that can be buffered too), I dunno. But it's certainly
not something I'd object to on an interface level.

Linus

2009-09-18 01:45:16

by Joel Becker

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32

On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
> Why would anybody want to hide it at all? Why even the libc hiding?
>
> Nobody is going to use this except for special apps. Let them see what
> they can do, in all its glory.

I expect everyone will use this through cp(1), so that cp(1) can
try to get server-side copy on the network filesystms.
Speaking of "all its glory", what we have now is:

int sys_copyfileat(int oldfd, const char *oldname, int newfd,
const char *newname, int flags, int atflags)

> So I'd suggest something like having two system calls: one to start the
> operation, and one to control it. And for a filesystem that does atomic
> copies, the 'start' one obviously would also finish it, so the 'control'
> it would be a no-op, because there would never be any outstanding ones.
>
> See what I'm saying? It wouldn't complicate _your_ life, but it would
> allow for filesystems that can't do it atomically (or even quickly).
>
> So the first one would be something like
>
> int copyfile(const char *src, const char *dest, unsigned long flags);
>
> which would return:
>
> - zero on success
> - negative (with errno) on error
> - positive cookie on "I started it, here's my cookie". For extra bonus
> points, maybe the cookie would actually be a file descriptor (for
> poll/select users), but it would _not_ be a file descriptor to the
> resulting _file_, it would literally be a "cookie" to the actual
> copyfile event.

Actually, if the cookie is a magic file descriptor, you don't
need ctl. You can play tricks like polling for completoin,
read(magic_fd, &remain, sizeof(loff_t)) for status, and close(magic_fd)
for cancel. Might be a bit overloaded, though.

> and then for ocfs2 you'd never return positive cookies. You'd never have
> to worry about it.

I suspect we'll later take advantage of copyfile's other
modes. I did reflink as reflink only for the simple fact of doing one
thing and well, not because I think copyfile isn't good.

> Then the second interface would be something like
>
> int copyfile_ctrl(long cookie, unsigned long cmd);
>
> where you'd just have some way to wait for completion and ask how much has
> been copied. The 'cmd' would be some set of 'cancel', 'status' or
> 'uninterruptible wait' or whatever, and the return value would again be
>
> - negative (with errno) for errors (copy failed) - cookie released
> - zero for 'done' - cookie released
> - positive for 'percent remaining' or whatever - cookie still valid
>
> and this would be another callback into the filesystem code, but you'd
> never have to worry about it, since you'd never see it (just leave it
> NULL).

I was going to ask about how to fit both calls into one inode
operation, but I see you're giving this as an additional inode
operation.
This leaves us with a simliar-to-reflink inode copyfile op and a
control op:

->copyfile(old_dentry, dir_inode, new_dentry, flags)
->copyfile_ctl(int cookie, unsigned int cmd)

I have to change the flags a little, as my original proposal
didn't handle backoff correctly.

#define COPYFILE_WAIT 0x0001 /* Block until complete */
#define COPYFILE_ATOMIC 0x0002 /* Things copied must be
point-in-time and it must
fail or succeed completely. */
#define COPYFILE_ALLOW_COW 0x0004 /* The filesystem may share data
extents between the source
and target in a Copy-on-Write
fashion. If neither
COPYFILE_ALLOW_COW nor
COPYFILE_REQUIRE_COW are
specified, data extents must
NOT be shared. When neither
COW flag is provided, most
filesystems should return
-ENOTSUPP, as userspace can
do read-write looping
itself */
#define COPYFILE_REQUIRE_COW 0x0008 /* Data extents MUST be shared
between the source and target
in a Copy-on-Write fashion */
#define COPYFILE_UNPRIV_ATTRS 0x0010 /* Unprivileged attributes
should be copied from the
source to the target */
#define COPYFILE_PRIV_ATTRS 0x0020 /* Privileged attributes should
be copied from the source to
the target if the caller has
the necessary privileges */
#define COPYFILE_REQUIRE_ATTRS 0x0040 /* Combined with the other
attribute flags, the call
MUST fail if the caller lacks
the necessary privileges to
copy ever attribute
requested */

#define COPYFILE_SNAPSHOT_ASYNC (COPYFILE_REQUIRE_COW |
COPYFILE_UNPRIV_ATTRS |
COPYFILE_PRIV_ATTRS |
COPYFILE_ATOMIC)
#define COPYFILE_SNAPSHOT_STRICT_ASYNC (COPYFILE_SNAPSHOT_ASYNC |
COPYFILE_REQUIRE_ATTRS)
#define COPYFILE_SNAPSHOT (COPYFILE_SNAPSHOT_ASYNC |
COPYFILE_WAIT)
#define COPYFILE_SNAPSHOT_STRICT (COPYFILE_SNAPSHOT_STRICT_ASYNC |
COPYFILE_WAIT)

> I dunno. The above seems like a fairly simple and powerful interface, and
> I _think_ it would be ok for NFS and CIFS. And in fact, if that whole
> "background copy" ends up being used a lot, maybe even a local filesystem
> would implement it just to get easy overlapping IO - even if it would just
> be a trivial common wrapper function that says "start a thread to do a
> trivial manual copy".

NFS and CIFS folks, please speak up.

Joel

--

"There is no more evil thing on earth than race prejudice, none at
all. I write deliberately -- it is the worst single thing in life
now. It justifies and holds together more baseness, cruelty and
abomination than any other sort of error in the world."
- H. G. Wells

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-18 13:42:45

by Pádraig Brady

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32

Joel Becker wrote:
> On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
>> Why would anybody want to hide it at all? Why even the libc hiding?
>>
>> Nobody is going to use this except for special apps. Let them see what
>> they can do, in all its glory.
>
> I expect everyone will use this through cp(1), so that cp(1) can
> try to get server-side copy on the network filesystms.

For reference, cp(1) has a --reflink option as of
coreutils-7.5 which currently just does:

ioctl (dest_fd, BTRFS_IOC_CLONE, src_fd);

There is a specific option in cp to do this because
a "reflink copy" was seen to have these disadvantages:

1. one copy of data blocks so more chances of data loss
2. disk head seeking deferred to modification process
3. possible fragmentation on write
4. possible ENOSPC on write

Now 2. will go away with time, and 3 & 4 may be alleviated
by the use of fallocate(), but 1. was deemed important
enough to not enable by default.

cheers,
P?draig.

2009-09-18 17:23:53

by Peter W. Morreale

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32

On Thu, 2009-09-17 at 18:43 -0700, Joel Becker wrote:
> On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
> > Why would anybody want to hide it at all? Why even the libc hiding?
> >
> > Nobody is going to use this except for special apps. Let them see what
> > they can do, in all its glory.
>
> I expect everyone will use this through cp(1), so that cp(1) can
> try to get server-side copy on the network filesystms.
> Speaking of "all its glory", what we have now is:
>
> int sys_copyfileat(int oldfd, const char *oldname, int newfd,
> const char *newname, int flags, int atflags)


Would it be worthwhile to consider adding an offset and length?

Then we get dd as well. (potentially)


Best,
-PWM

>
> > So I'd suggest something like having two system calls: one to start the
> > operation, and one to control it. And for a filesystem that does atomic
> > copies, the 'start' one obviously would also finish it, so the 'control'
> > it would be a no-op, because there would never be any outstanding ones.
> >
> > See what I'm saying? It wouldn't complicate _your_ life, but it would
> > allow for filesystems that can't do it atomically (or even quickly).
> >
> > So the first one would be something like
> >
> > int copyfile(const char *src, const char *dest, unsigned long flags);
> >
> > which would return:
> >
> > - zero on success
> > - negative (with errno) on error
> > - positive cookie on "I started it, here's my cookie". For extra bonus
> > points, maybe the cookie would actually be a file descriptor (for
> > poll/select users), but it would _not_ be a file descriptor to the
> > resulting _file_, it would literally be a "cookie" to the actual
> > copyfile event.
>
> Actually, if the cookie is a magic file descriptor, you don't
> need ctl. You can play tricks like polling for completoin,
> read(magic_fd, &remain, sizeof(loff_t)) for status, and close(magic_fd)
> for cancel. Might be a bit overloaded, though.
>
> > and then for ocfs2 you'd never return positive cookies. You'd never have
> > to worry about it.
>
> I suspect we'll later take advantage of copyfile's other
> modes. I did reflink as reflink only for the simple fact of doing one
> thing and well, not because I think copyfile isn't good.
>
> > Then the second interface would be something like
> >
> > int copyfile_ctrl(long cookie, unsigned long cmd);
> >
> > where you'd just have some way to wait for completion and ask how much has
> > been copied. The 'cmd' would be some set of 'cancel', 'status' or
> > 'uninterruptible wait' or whatever, and the return value would again be
> >
> > - negative (with errno) for errors (copy failed) - cookie released
> > - zero for 'done' - cookie released
> > - positive for 'percent remaining' or whatever - cookie still valid
> >
> > and this would be another callback into the filesystem code, but you'd
> > never have to worry about it, since you'd never see it (just leave it
> > NULL).
>
> I was going to ask about how to fit both calls into one inode
> operation, but I see you're giving this as an additional inode
> operation.
> This leaves us with a simliar-to-reflink inode copyfile op and a
> control op:
>
> ->copyfile(old_dentry, dir_inode, new_dentry, flags)
> ->copyfile_ctl(int cookie, unsigned int cmd)
>
> I have to change the flags a little, as my original proposal
> didn't handle backoff correctly.
>
> #define COPYFILE_WAIT 0x0001 /* Block until complete */
> #define COPYFILE_ATOMIC 0x0002 /* Things copied must be
> point-in-time and it must
> fail or succeed completely. */
> #define COPYFILE_ALLOW_COW 0x0004 /* The filesystem may share data
> extents between the source
> and target in a Copy-on-Write
> fashion. If neither
> COPYFILE_ALLOW_COW nor
> COPYFILE_REQUIRE_COW are
> specified, data extents must
> NOT be shared. When neither
> COW flag is provided, most
> filesystems should return
> -ENOTSUPP, as userspace can
> do read-write looping
> itself */
> #define COPYFILE_REQUIRE_COW 0x0008 /* Data extents MUST be shared
> between the source and target
> in a Copy-on-Write fashion */
> #define COPYFILE_UNPRIV_ATTRS 0x0010 /* Unprivileged attributes
> should be copied from the
> source to the target */
> #define COPYFILE_PRIV_ATTRS 0x0020 /* Privileged attributes should
> be copied from the source to
> the target if the caller has
> the necessary privileges */
> #define COPYFILE_REQUIRE_ATTRS 0x0040 /* Combined with the other
> attribute flags, the call
> MUST fail if the caller lacks
> the necessary privileges to
> copy ever attribute
> requested */
>
> #define COPYFILE_SNAPSHOT_ASYNC (COPYFILE_REQUIRE_COW |
> COPYFILE_UNPRIV_ATTRS |
> COPYFILE_PRIV_ATTRS |
> COPYFILE_ATOMIC)
> #define COPYFILE_SNAPSHOT_STRICT_ASYNC (COPYFILE_SNAPSHOT_ASYNC |
> COPYFILE_REQUIRE_ATTRS)
> #define COPYFILE_SNAPSHOT (COPYFILE_SNAPSHOT_ASYNC |
> COPYFILE_WAIT)
> #define COPYFILE_SNAPSHOT_STRICT (COPYFILE_SNAPSHOT_STRICT_ASYNC |
> COPYFILE_WAIT)
>
> > I dunno. The above seems like a fairly simple and powerful interface, and
> > I _think_ it would be ok for NFS and CIFS. And in fact, if that whole
> > "background copy" ends up being used a lot, maybe even a local filesystem
> > would implement it just to get easy overlapping IO - even if it would just
> > be a trivial common wrapper function that says "start a thread to do a
> > trivial manual copy".
>
> NFS and CIFS folks, please speak up.
>
> Joel
>

2009-09-18 18:38:16

by Joel Becker

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32

On Fri, Sep 18, 2009 at 02:34:18PM +0100, P?draig Brady wrote:
> Joel Becker wrote:
> > On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
> >> Why would anybody want to hide it at all? Why even the libc hiding?
> >>
> >> Nobody is going to use this except for special apps. Let them see what
> >> they can do, in all its glory.
> >
> > I expect everyone will use this through cp(1), so that cp(1) can
> > try to get server-side copy on the network filesystms.
>
> For reference, cp(1) has a --reflink option as of
> coreutils-7.5 which currently just does:
>
> ioctl (dest_fd, BTRFS_IOC_CLONE, src_fd);

Note that the btrfs ioctl is not a reflink(), so this probably
wants changing (OCFS2_IOC_REFLINK is the ocfs2 ioctl, sys_reflink() was
going to be the syscall).

> There is a specific option in cp to do this because
> a "reflink copy" was seen to have these disadvantages:
>
> 1. one copy of data blocks so more chances of data loss
> 2. disk head seeking deferred to modification process
> 3. possible fragmentation on write
> 4. possible ENOSPC on write
>
> Now 2. will go away with time, and 3 & 4 may be alleviated
> by the use of fallocate(), but 1. was deemed important
> enough to not enable by default.

1, 2, and 3 are definitely in the category of "it would be nice
to choose the behavior". 4 is the big one, because it breaks default
cp(1) assumptions. The good news is that the current copyfile
idea of copyfile(src, dst, 0) would satisfy 1-4 and be efficient or
return -ENOTSUPP/-ENOSYS if it couldn't be. Then cp(1) falls back to
the read-write loop.
cp --reflink would become copyfile(src, dst, COPYFILE_SNAPSHOT)

Joel

--

Life's Little Instruction Book #451

"Don't be afraid to say, 'I'm sorry.'"

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-18 18:40:48

by Joel Becker

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32

On Fri, Sep 18, 2009 at 11:23:33AM -0600, Peter W. Morreale wrote:
> On Thu, 2009-09-17 at 18:43 -0700, Joel Becker wrote:
> > On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
> > > Why would anybody want to hide it at all? Why even the libc hiding?
> > >
> > > Nobody is going to use this except for special apps. Let them see what
> > > they can do, in all its glory.
> >
> > I expect everyone will use this through cp(1), so that cp(1) can
> > try to get server-side copy on the network filesystms.
> > Speaking of "all its glory", what we have now is:
> >
> > int sys_copyfileat(int oldfd, const char *oldname, int newfd,
> > const char *newname, int flags, int atflags)
>
>
> Would it be worthwhile to consider adding an offset and length?
>
> Then we get dd as well. (potentially)

I'm with Linus that a range attribute really makes this
complicated. I also think it doesn't work well with a call that is
supposed to create newname.

Joel

--

Pitchers and catchers report.

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-09-22 00:51:42

by George Spelvin

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

> Perhaps ->copyfile takes the following flags:
>
> #define ALLOW_COW_SHARED 0x0001
> #define REQUIRE_COW_SHARED 0x0002
> #define REQUIRE_BASIC_ATTRS 0x0004
> #define REQUIRE_FULL_ATTRS 0x0008
> #define REQUIRE_ATOMIC 0x0010
> #define SNAPSHOT (REQUIRE_COW_SHARED |
> REQUIRE_BASIC_ATTRS |
> REQUIRE_ATOMIC)
> #define SNAPSHOT_PRESERVE (SNAPSHOT | REQUIRE_FULL_ATTRS)

Um, could I strongly suggest that flags == 0 be the "succeed if at all
possible case", and various options limit it.

In particular, invert ALLOW_COW_SHARED to REQUIRE_ALLOCATE.

Another possibly useful flag would be REQUIRE_OPTIMIZED.
I.e. if it's not appreciably faster than a read/write loop, perhaps
the application would prefer to do it itself.

We also have to define the error code to return in case of a flag
violation. ENOTSUP?

2009-09-22 03:28:01

by George Spelvin

[permalink] [raw]
Subject: Re: [GIT PULL] ocfs2 changes for 2.6.32

> or make it one level simpler?
> Have a "wait for all started copies" call only.... saves a ton of book
> keeping, and is likely what people will use it for in the end anyway.

Aieeee! No, no, a thousand times, no!

We do NOT need another blocking primitive that can't play well with others.
That would be a HORRIBLE design mistake.

The whole benefit of Linus' scheme is that it returns a file descriptor.
Any waiting should be done by the standard event-wait system call, poll().
It should return POLLIN when there's an interesting event (such as copy
completion), and should remain valid until explicitly close()d.

There's nothing wrong with a convenience function that waits for all
started copies, but I don't see a reason to design a new kernel interface
for the purpose.

A few more points:
- If the file descriptor returned by copyfile() is guaranteed not to be 0
(even if that is available), perhaps it should be guaranteed to be >= 3.
- We might as well make the returned file descriptor O_CLOEXEC by default.
You can always change it back if you want to.

2009-09-23 11:02:41

by Joel Becker

[permalink] [raw]
Subject: [GIT PULL] ocfs2 changes for 2.6.32 (take 2, no syscall)

On Mon, Sep 14, 2009 at 02:32:36PM -0700, Linus Torvalds wrote:
> So I'm not pulling this. Not until I get the feeling that there is
> consensus.

Linus,
Here are all of the ocfs2 changes without the reflink(2)
system call. The snapshot functionality is still available via the
ocfs2 ioctl(2). I'm going to formulate up a new system call based on
the discussion we've had, but clearly that's not something for 2.6.32
anymore.
Please pull.

The following changes since commit 8379e7c46cc48f51197dd663fc6676f47f2a1e71:
Sunil Mushran (1):
ocfs2: ocfs2_write_begin_nolock() should handle len=0

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git upstream-linus

Coly Li (1):
dlmglue.c: add missed mlog lines

Joel Becker (40):
ocfs2: Make the ocfs2_caching_info structure self-contained.
ocfs2: Change metadata caching locks to an operations structure.
ocfs2: Take the inode out of the metadata read/write paths.
ocfs2: move ip_last_trans to struct ocfs2_caching_info
ocfs2: move ip_created_trans to struct ocfs2_caching_info
ocfs2: Pass struct ocfs2_caching_info to the journal functions.
ocfs2: Store the ocfs2_caching_info on ocfs2_extent_tree.
ocfs2: Pass ocfs2_caching_info to ocfs2_read_extent_block().
ocfs2: ocfs2_find_path() only needs the caching info
ocfs2: ocfs2_create_new_meta_bhs() doesn't need struct inode.
ocfs2: Pass ocfs2_extent_tree to ocfs2_unlink_path()
ocfs2: ocfs2_complete_edge_insert() doesn't need struct inode at all.
ocfs2: Get inode out of ocfs2_rotate_subtree_root_right().
ocfs2: Pass ocfs2_extent_tree to ocfs2_get_subtree_root()
ocfs2: Drop struct inode from ocfs2_extent_tree_operations.
ocfs2: ocfs2_rotate_tree_right() doesn't need struct inode.
ocfs2: ocfs2_update_edge_lengths() doesn't need struct inode.
ocfs2: ocfs2_rotate_subtree_left() doesn't need struct inode.
ocfs2: __ocfs2_rotate_tree_left() doesn't need struct inode.
ocfs2: ocfs2_rotate_tree_left() no longer needs struct inode.
ocfs2: ocfs2_merge_rec_left/right() no longer need struct inode.
ocfs2: ocfs2_try_to_merge_extent() doesn't need struct inode.
ocfs2: ocfs2_grow_branch() and ocfs2_append_rec_to_path() lose struct inode.
ocfs2: ocfs2_truncate_rec() doesn't need struct inode.
ocfs2: Make truncating the extent map an extent_tree_operation.
ocfs2: ocfs2_insert_at_leaf() doesn't need struct inode.
ocfs2: Give ocfs2_split_record() an extent_tree instead of an inode.
ocfs2: ocfs2_do_insert_extent() and ocfs2_insert_path() no longer need an inode.
ocfs2: ocfs2_extent_contig() only requires the superblock.
ocfs2: Swap inode for extent_tree in ocfs2_figure_merge_contig_type().
ocfs2: Remove inode from ocfs2_figure_extent_contig().
ocfs2: ocfs2_figure_insert_type() no longer needs struct inode.
ocfs2: Make extent map insertion an extent_tree_operation.
ocfs2: ocfs2_insert_extent() no longer needs struct inode.
ocfs2: ocfs2_add_clusters_in_btree() no longer needs struct inode.
ocfs2: ocfs2_remove_extent() no longer needs struct inode.
ocfs2: ocfs2_split_and_insert() no longer needs struct inode.
ocfs2: Teach ocfs2_replace_extent_rec() to use an extent_tree.
ocfs2: __ocfs2_mark_extent_written() doesn't need struct inode.
ocfs2: Pass ocfs2_caching_info into ocfs_init_*_extent_tree().

Sunil Mushran (1):
ocfs2: __ocfs2_abort() should not enable panic for local mounts

Tao Ma (42):
ocfs2: Define refcount tree structure.
ocfs2: Add metaecc for ocfs2_refcount_block.
ocfs2: Add ocfs2_read_refcount_block.
ocfs2: Abstract caching info checkpoint.
ocfs2: Add new refcount tree lock resource in dlmglue.
ocfs2: Add caching info for refcount tree.
ocfs2: Add refcount tree lock mechanism.
ocfs2: Basic tree root operation.
ocfs2: Wrap ocfs2_extent_contig in ocfs2_extent_tree.
ocfs2: Abstract extent split process.
ocfs2: Add refcount b-tree as a new extent tree.
ocfs2: move tree path functions to alloc.h.
ocfs2: Add support for incrementing refcount in the tree.
ocfs2: Add support of decrementing refcount for delete.
ocfs2: Add functions for extents refcounted.
ocfs2: Decrement refcount when truncating refcounted extents.
ocfs2: Add CoW support.
ocfs2: CoW refcount tree improvement.
ocfs2: Integrate CoW in file write.
ocfs2: CoW a reflinked cluster when it is truncated.
ocfs2: Add normal functions for reflink a normal file's extents.
ocfs2: handle file attributes issue for reflink.
ocfs2: Return extent flags for xattr value tree.
ocfs2: Abstract duplicate clusters process in CoW.
ocfs2: Add CoW support for xattr.
ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value.
ocfs2: Abstract the creation of xattr block.
ocfs2: Abstract ocfs2 xattr tree extend rec iteration process.
ocfs2: Attach xattr clusters to refcount tree.
ocfs2: Call refcount tree remove process properly.
ocfs2: Create an xattr indexed block if needed.
ocfs2: Add reflink support for xattr.
ocfs2: Modify removing xattr process for refcount.
ocfs2: Don't merge in 1st refcount ops of reflink.
ocfs2: Make transaction extend more efficient.
ocfs2: Use proper parameter for some inode operation.
ocfs2: Create reflinked file in orphan dir.
ocfs2: Add preserve to reflink.
ocfs2: Implement ocfs2_reflink.
ocfs2: Enable refcount tree support.
ocfs2: Add ioctl for reflink.
ocfs2: Use buffer IO if we are appending a file.

Wengang Wang (1):
ocfs2: add spinlock protection when dealing with lockres->purge.

fs/ocfs2/Makefile | 1 +
fs/ocfs2/alloc.c | 1342 ++++++++------
fs/ocfs2/alloc.h | 101 +-
fs/ocfs2/aops.c | 37 +-
fs/ocfs2/aops.h | 2 +
fs/ocfs2/buffer_head_io.c | 47 +-
fs/ocfs2/buffer_head_io.h | 8 +-
fs/ocfs2/cluster/masklog.c | 1 +
fs/ocfs2/cluster/masklog.h | 1 +
fs/ocfs2/dir.c | 107 +-
fs/ocfs2/dlm/dlmthread.c | 6 +-
fs/ocfs2/dlmglue.c | 105 +-
fs/ocfs2/dlmglue.h | 6 +
fs/ocfs2/extent_map.c | 33 +-
fs/ocfs2/extent_map.h | 8 +-
fs/ocfs2/file.c | 151 ++-
fs/ocfs2/file.h | 2 +
fs/ocfs2/inode.c | 86 +-
fs/ocfs2/inode.h | 20 +-
fs/ocfs2/ioctl.c | 14 +
fs/ocfs2/journal.c | 82 +-
fs/ocfs2/journal.h | 94 +-
fs/ocfs2/localalloc.c | 12 +-
fs/ocfs2/namei.c | 341 ++++-
fs/ocfs2/namei.h | 6 +
fs/ocfs2/ocfs2.h | 52 +-
fs/ocfs2/ocfs2_fs.h | 107 ++-
fs/ocfs2/ocfs2_lockid.h | 5 +
fs/ocfs2/quota_global.c | 5 +-
fs/ocfs2/quota_local.c | 26 +-
fs/ocfs2/refcounttree.c | 4313 ++++++++++++++++++++++++++++++++++++++++++++
fs/ocfs2/refcounttree.h | 106 ++
fs/ocfs2/resize.c | 16 +-
fs/ocfs2/slot_map.c | 10 +-
fs/ocfs2/suballoc.c | 35 +-
fs/ocfs2/super.c | 13 +-
fs/ocfs2/uptodate.c | 265 ++--
fs/ocfs2/uptodate.h | 51 +-
fs/ocfs2/xattr.c | 2056 +++++++++++++++++++--
fs/ocfs2/xattr.h | 15 +-
40 files changed, 8512 insertions(+), 1176 deletions(-)
create mode 100644 fs/ocfs2/refcounttree.c
create mode 100644 fs/ocfs2/refcounttree.h
--

"But then she looks me in the eye
And says, 'We're going to last forever,'
And man you know I can't begin to doubt it.
Cause it just feels so good and so free and so right,
I know we ain't never going to change our minds about it, Hey!
Here comes my girl."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127