2013-09-11 17:08:13

by Zach Brown

Subject: [RFC] extending splice for copy offloading

When I first started on this stuff I followed the lead of previous
work and added a new syscall for the copy operation:

https://lkml.org/lkml/2013/5/14/618

Towards the end of that thread Eric Wong asked why we didn't just
extend splice. I immediately replied with some dumb dismissive
answer. Once I sat down and looked at it, though, it does make a
lot of sense. So good job, Eric. +10 Dummy points for me.

Extending splice avoids all the noise of adding a new syscall and
naturally falls back to buffered copying as that's what the direct
splice path does for sendfile() today.

So that's what this patch series demonstrates. It adds a flag that
lets splice get at the same direct splicing that sendfile() does.
We then add a file system file_operations method to accelerate the
copy which has access to both files.
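
To make that concrete, here's a sketch of what the call looks like from
userspace; the flag value matches the patches in this series, everything
else is just illustrative:

#define _GNU_SOURCE
#include <fcntl.h>

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT	(0x10)	/* neither splice fd is a pipe */
#endif

/*
 * Copy len bytes between two regular files at the given offsets. With
 * SPLICE_F_DIRECT the kernel tries the file system's .splice_direct
 * acceleration and falls back to the buffered page cache copy when the
 * file system can't accelerate the range.
 */
static ssize_t copy_range(int in_fd, loff_t in_off,
			  int out_fd, loff_t out_off, size_t len)
{
	return splice(in_fd, &in_off, out_fd, &out_off, len,
		      SPLICE_F_DIRECT);
}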

Some things to talk about:
- I really don't care about the naming here. If you do, holler.
- We might want different flags for file-to-file splicing and acceleration
- We might want flags to require or forbid acceleration
- We might want to provide all these flags to sendfile, too

Thoughts? Objections?

Bryan, do you see any problems with wiring the NFS COPY RPC under this?

Martin, are we any closer to getting blk_() calls to kick off XCOPY
bios?

OCFS2 friends, is it a manageable amount of work to implement an
ocfs2_splice_direct() that only modifies a region of the destination
file?

Finally, there's a slot in the Plumbers schedule next week to talk
about this stuff. Come say hi if you're interested.

-z



2013-09-30 19:34:29

by Myklebust, Trond

Subject: Re: [RFC] extending splice for copy offloading

On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:
> On 09/30/2013 08:02 PM, Myklebust, Trond wrote:
> > On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
> >> On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
> >>> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
> >>>> It would be nice if there would be way if the file system would get a
> >>>> hint that the target file is supposed to be copy of another file. That
> >>>> way distributed file systems could also create the target-file with the
> >>>> correct meta-information (same storage targets as in-file has).
> >>>> Well, if we cannot agree on that, file system with a custom protocol at
> >>>> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
> >>>> sure if this would work for pNFS, though.
> >>>
> >>> splice() does not create new files. What you appear to be asking for
> >>> lies way outside the scope of that system call interface.
> >>>
> >>
> >> Sorry I know, definitely outside the scope of splice, but in the context
> >> of offloaded file copies. So the question is, what is the best way to
> >> address/discuss that?
> >
> > Why does it need to be addressed in the first place?
>
> An offloaded copy is still not efficient if different storage
> servers/targets used by from-file and to-file.

So?

> >
> > What is preventing an application from retrieving and setting this
> > information using standard libc functions such as fstat()+open(), and
> > supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
> > where appropriate?
> >
>
> At a minimum this requires network and metadata overhead. And while I'm
> working on FhGFS now, I still wonder what other file system need to do -
> for example Lustre pre-allocates storage-target files on creating a
> file, so file layout changes mean even more overhead there.

The problem you are describing is limited to a narrow set of storage
architectures. If copy offload using splice() doesn't make sense for
those architectures, then don't implement it for them.
You might be able to provide ioctls() to do these special hinted file
creations for those filesystems that need it, but the vast majority
don't, and you shouldn't enforce it on them.

> Anyway, if we could agree on to use libattr or libacl to teach the file
> system about the upcoming splice call I would be fine.

libattr and libacl are generic libraries that exist to manipulate xattrs
and acls. They do not need to contain Lustre-specific code.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2013-09-11 17:16:45

by Zach Brown

Subject: [PATCH 3/3] btrfs: implement .splice_direct extent copying

This patch re-uses the existing btrfs file cloning ioctl code to
implement the .splice_direct copy offloading file operation.

The existing extent item copying btrfs_ioctl_clone() is renamed to a
shared btrfs_clone_extents(). The ioctl specific code (mostly simple
entry-point stuff that splice() already does elsewhere) is moved to a
new much smaller btrfs_ioctl_clone().

btrfs_splice_direct() thus inherits the conservative limitations of the
btrfs clone ioctl: it only allows block-aligned copies between files on
the same snapshot.

Signed-off-by: Zach Brown <[email protected]>
---
fs/btrfs/ctree.h | 2 ++
fs/btrfs/file.c | 11 ++++++++++
fs/btrfs/ioctl.c | 64 +++++++++++++++++++++++++++++++-------------------------
3 files changed, 48 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e795bf1..f73830e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3648,6 +3648,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
u64 newer_than, unsigned long max_pages);
void btrfs_get_block_group_info(struct list_head *groups_list,
struct btrfs_ioctl_space_info *space);
+long btrfs_clone_extents(struct file *file, struct file *src_file, u64 off,
+ u64 olen, u64 destoff);

/* file.c */
int btrfs_auto_defrag_init(void);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4d2eb64..82aec93 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2557,6 +2557,16 @@ out:
return offset;
}

+static long btrfs_splice_direct(struct file *in, loff_t in_pos,
+ struct file *out, loff_t out_pos, size_t len,
+ unsigned int flags)
+{
+ int ret = btrfs_clone_extents(out, in, in_pos, len, out_pos);
+ if (ret == 0)
+ ret = len;
+ return ret;
+}
+
const struct file_operations btrfs_file_operations = {
.llseek = btrfs_file_llseek,
.read = do_sync_read,
@@ -2573,6 +2583,7 @@ const struct file_operations btrfs_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = btrfs_ioctl,
#endif
+ .splice_direct = btrfs_splice_direct,
};

void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 238a055..cddf6ef 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2469,13 +2469,12 @@ out:
return ret;
}

-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
- u64 off, u64 olen, u64 destoff)
+long btrfs_clone_extents(struct file *file, struct file *src_file, u64 off,
+ u64 olen, u64 destoff)
{
struct inode *inode = file_inode(file);
+ struct inode *src = file_inode(src_file);
struct btrfs_root *root = BTRFS_I(inode)->root;
- struct fd src_file;
- struct inode *src;
struct btrfs_trans_handle *trans;
struct btrfs_path *path;
struct extent_buffer *leaf;
@@ -2498,10 +2497,6 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
* they don't overlap)?
*/

- /* the destination must be opened for writing */
- if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
- return -EINVAL;
-
if (btrfs_root_readonly(root))
return -EROFS;

@@ -2509,48 +2504,36 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
if (ret)
return ret;

- src_file = fdget(srcfd);
- if (!src_file.file) {
- ret = -EBADF;
- goto out_drop_write;
- }
-
ret = -EXDEV;
- if (src_file.file->f_path.mnt != file->f_path.mnt)
- goto out_fput;
-
- src = file_inode(src_file.file);
+ if (src_file->f_path.mnt != file->f_path.mnt)
+ goto out_drop_write;

ret = -EINVAL;
if (src == inode)
same_inode = 1;

- /* the src must be open for reading */
- if (!(src_file.file->f_mode & FMODE_READ))
- goto out_fput;
-
/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
- goto out_fput;
+ goto out_drop_write;

ret = -EISDIR;
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
- goto out_fput;
+ goto out_drop_write;

ret = -EXDEV;
if (src->i_sb != inode->i_sb)
- goto out_fput;
+ goto out_drop_write;

ret = -ENOMEM;
buf = vmalloc(btrfs_level_size(root, 0));
if (!buf)
- goto out_fput;
+ goto out_drop_write;

path = btrfs_alloc_path();
if (!path) {
vfree(buf);
- goto out_fput;
+ goto out_drop_write;
}
path->reada = 2;

@@ -2867,13 +2850,36 @@ out_unlock:
mutex_unlock(&inode->i_mutex);
vfree(buf);
btrfs_free_path(path);
-out_fput:
- fdput(src_file);
out_drop_write:
mnt_drop_write_file(file);
return ret;
}

+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+ u64 off, u64 olen, u64 destoff)
+{
+ struct fd src_file;
+ int ret;
+
+ /* the destination must be opened for writing */
+ if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+ return -EINVAL;
+
+ src_file = fdget(srcfd);
+ if (!src_file.file)
+ return -EBADF;
+
+ /* the src must be open for reading */
+ if (!(src_file.file->f_mode & FMODE_READ))
+ ret = -EINVAL;
+ else
+ ret = btrfs_clone_extents(file, src_file.file, off, olen,
+ destoff);
+
+ fdput(src_file);
+ return ret;
+}
+
static long btrfs_ioctl_clone_range(struct file *file, void __user *argp)
{
struct btrfs_ioctl_clone_range_args args;
--
1.7.11.7


2013-09-30 14:51:12

by Miklos Szeredi

Subject: Re: [RFC] extending splice for copy offloading

On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <[email protected]> wrote:
>> My other worry is about interruptibility/restartability. Ideas?
>>
>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>> Can the page cache copy be made restartable? Or should splice() be
>> allowed to return a short count? What happens on (non-reflink) remote
>> copies and huge request sizes?
>
> If I were writing an application that required copies to be restartable,
> I'd probably use the largest possible range in the reflink case but
> break the copy into smaller chunks in the splice case.
>

The app really doesn't want to care about that. And it doesn't want
to care about restartability, etc.. It's something the *kernel* has
to care about. You just can't have uninterruptible syscalls that
sleep for a "long" time, otherwise first you'll just have annoyed
users pressing ^C in vain; then, if the sleep is even longer, warnings
about task sleeping too long.

One idea is letting splice() return a short count, and so the app can
safely issue SIZE_MAX requests and the kernel can decide if it can
copy the whole file in one go or if it wants to do it in smaller
chunks.
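
Schematically, from the app side (a sketch only; SPLICE_F_DIRECT is the
flag from Zach's patches, and in_fd/out_fd are assumed to be open
regular files):

#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <err.h>

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT	(0x10)
#endif

static void copy_loop(int in_fd, int out_fd)
{
	off_t off = 0;

	for (;;) {
		loff_t in_off = off, out_off = off;
		/* Maximal request; the kernel picks whatever chunk it
		 * can service in one uninterruptible go. */
		ssize_t n = splice(in_fd, &in_off, out_fd, &out_off,
				   SSIZE_MAX, SPLICE_F_DIRECT);

		if (n == -1)
			err(1, "splice");
		if (n == 0)
			break;		/* EOF: copy complete */
		off += n;		/* short count: just continue */
	}
}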

Thanks,
Miklos

2013-09-26 15:34:03

by J. Bruce Fields

Subject: Re: [RFC] extending splice for copy offloading

On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <[email protected]> wrote:
> >> A client-side copy will be slower, but I guess it does have the
> >> advantage that the application can track progress to some degree, and
> >> abort it fairly quickly without leaving the file in a totally undefined
> >> state--and both might be useful if the copy's not a simple constant-time
> >> operation.
> >
> > I suppose, but can't the app achieve a nice middle ground by copying the
> > file in smaller syscalls? Avoid bulk data motion back to the client,
> > but still get notification every, I dunno, few hundred meg?
>
> Yes. And if "cp" could just be switched from a read+write syscall
> pair to a single splice syscall using the same buffer size.

Will the various magic fs-specific copy operations become inefficient
when the range copied is too small?

(Totally naive question, as I have no idea how they really work.)

--b.

> And then
> the user would only notice that things got faster in case of server
> side copy. No problems with long blocking times (at least not much
> worse than it was).
>
> However "cp" doesn't do reflinking by default, it has a switch for
> that. If we just want "cp" and the like to use splice without fearing
> side effects then by default we should try to be as close to
> read+write behavior as possible. No? That's what I'm really
> worrying about when you want to wire up splice to reflink by default.
> I do think there should be a flag for that. And if on the block level
> some magic happens, so be it. It's not the fs developer's worry any
> more ;)
>
> Thanks,
> Miklos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-09-25 19:06:58

by Zach Brown

Subject: Re: [RFC] extending splice for copy offloading

On Wed, Sep 25, 2013 at 03:02:29PM -0400, Anna Schumaker wrote:
> On Wed, Sep 25, 2013 at 2:38 PM, Zach Brown <[email protected]> wrote:
> >
> > Hrmph. I had composed a reply to you during Plumbers but.. something
> > happened to it :). Here's another try now that I'm back.
> >
> >> > Some things to talk about:
> >> > - I really don't care about the naming here. If you do, holler.
> >> > - We might want different flags for file-to-file splicing and acceleration
> >>
> >> Yes, I think "copy" and "reflink" needs to be differentiated.
> >
> > I initially agreed but I'm not so sure now. The problem is that we
> > can't know whether the acceleration is copying or not. XCOPY on some
> > array may well do some shared referencing tricks. The nfs COPY op can
> > have a server use btrfs reflink, or ext* and XCOPY, or .. who knows. At
> > some point we have to admit that we have no way to determine the
> > relative durability of writes. Storage can do a lot to make writes more
> > or less fragile that we have no visibility of. SSD FTLs can log a bunch
> > of unrelated sectors on to one flash failure domain.
> >
> > And if such a flag couldn't *actually* guarantee anything for a bunch of
> > storage topologies, well, let's not bother with it.
> >
> > The only flag I'm in favour of now is one that has splice return rather
> > than falling back to manual page cache reads and writes. It's more like
> > O_NONBLOCK than any kind of data durability hint.
>
> For reference, I'm planning to have the NFS server do the fallback
> when it copies since any local copy will be faster than a read and
> write over the network.

Agreed, this is definitely the reasonable thing to do.

- z

2013-09-30 18:02:14

by Myklebust, Trond

Subject: Re: [RFC] extending splice for copy offloading

On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
> On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
> > On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
> >> It would be nice if there would be way if the file system would get a
> >> hint that the target file is supposed to be copy of another file. That
> >> way distributed file systems could also create the target-file with the
> >> correct meta-information (same storage targets as in-file has).
> >> Well, if we cannot agree on that, file system with a custom protocol at
> >> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
> >> sure if this would work for pNFS, though.
> >
> > splice() does not create new files. What you appear to be asking for
> > lies way outside the scope of that system call interface.
> >
>
> Sorry I know, definitely outside the scope of splice, but in the context
> of offloaded file copies. So the question is, what is the best way to
> address/discuss that?

Why does it need to be addressed in the first place?

What is preventing an application from retrieving and setting this
information using standard libc functions such as fstat()+open(), and
supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
where appropriate?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2013-09-30 20:10:13

by Myklebust, Trond

Subject: Re: [RFC] extending splice for copy offloading

On Mon, 2013-09-30 at 22:00 +0200, Bernd Schubert wrote:
> On 09/30/2013 09:34 PM, Myklebust, Trond wrote:
> > On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:
> >> On 09/30/2013 08:02 PM, Myklebust, Trond wrote:
> >>> On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
> >>>> On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
> >>>>> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
> >>>>>> It would be nice if there would be way if the file system would get a
> >>>>>> hint that the target file is supposed to be copy of another file. That
> >>>>>> way distributed file systems could also create the target-file with the
> >>>>>> correct meta-information (same storage targets as in-file has).
> >>>>>> Well, if we cannot agree on that, file system with a custom protocol at
> >>>>>> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
> >>>>>> sure if this would work for pNFS, though.
> >>>>>
> >>>>> splice() does not create new files. What you appear to be asking for
> >>>>> lies way outside the scope of that system call interface.
> >>>>>
> >>>>
> >>>> Sorry I know, definitely outside the scope of splice, but in the context
> >>>> of offloaded file copies. So the question is, what is the best way to
> >>>> address/discuss that?
> >>>
> >>> Why does it need to be addressed in the first place?
> >>
> >> An offloaded copy is still not efficient if different storage
> >> servers/targets used by from-file and to-file.
> >
> > So?
>
> mds1: orig-file
> oss1/target1: orig-chunk1
>
> mds1: target-file
> ossN/targetN: target-chunk1
>
> clientN: Performs the copy
>
> Ideally, orig-chunk1 and target-chunk1 are on the same server and same
> target. Copy offload then even could done from the underlying fs,
> similiar as local splice.
> If different ossN servers are used copies still have to be done over
> network by these storage servers, although the client only would need to
> initiate the copy. Still faster, but also not ideal.
>
> >
> >>>
> >>> What is preventing an application from retrieving and setting this
> >>> information using standard libc functions such as fstat()+open(), and
> >>> supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
> >>> where appropriate?
> >>>
> >>
> >> At a minimum this requires network and metadata overhead. And while I'm
> >> working on FhGFS now, I still wonder what other file system need to do -
> >> for example Lustre pre-allocates storage-target files on creating a
> >> file, so file layout changes mean even more overhead there.
> >
> > The problem you are describing is limited to a narrow set of storage
> > architectures. If copy offload using splice() doesn't make sense for
> > those architectures, then don't implement it for them.
>
> But it _does_ make sense. The file system just needs a hint that a
> splice copy is going to come up.

Just wait for the splice() system call. How is this any different from
write()?

> > You might be able to provide ioctls() to do these special hinted file
> > creations for those filesystems that need it, but the vast majority
> > don't, and you shouldn't enforce it on them.
>
> And exactly for that we need a standard - it does not make sense if each
> and every distributed file system implements its own
> ioctl/libattr/libacl interface for that.
>
> >
> >>> Anyway, if we could agree on to use libattr or libacl to teach the file
> >>> system about the upcoming splice call I would be fine.
> >>
> >> libattr and libacl are generic libraries that exist to manipulate xattrs
> >> and acls. They do not need to contain Lustre-specific code.
> >>
>
> pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own
> interface? And userspace needs to address all of them differently?
>
> I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry,
> didn't find a better name yet), which would take in-file-path and
> out-file-path and allow the file system to create out-file-path with the
> same meta-layout as in-file-path. And it would need some flags, such as
> AUTO (file system decides if it makes sense to do a local copy) and
> FORCE (always try a local copy).

splice() is not a whole-file copy operation; it's a byte range copy. How
does the above help other than in the whole-file case?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2013-09-26 19:53:17

by Miklos Szeredi

Subject: Re: [RFC] extending splice for copy offloading

On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown <[email protected]> wrote:

>> But I'm not sure it's worth the effort; 99% of the use of this
>> interface will be copying whole files. And for that perhaps we need a
>> different API, one which has been discussed some time ago:
>> asynchronous copyfile() returns immediately with a pollable event
>> descriptor indicating copy progress, and some way to cancel the copy.
>> And that can internally rely on ->direct_splice(), with appropriate
>> algorithms for determining the optimal chunk size.
>
> And perhaps we don't. Perhaps we can provide this much simpler
> data-plane interface that works well enough for most everyone and can
> avoid going down the async rat hole, yet again.

I think either buffering or async is needed to get good performance
without too much complexity in the app (which is not good). Buffering
works quite well for regular I/O, so maybe it's the way to go here as
well.

Thanks,
Miklos

2013-09-25 21:08:16

by Zach Brown

Subject: Re: [RFC] extending splice for copy offloading

> A client-side copy will be slower, but I guess it does have the
> advantage that the application can track progress to some degree, and
> abort it fairly quickly without leaving the file in a totally undefined
> state--and both might be useful if the copy's not a simple constant-time
> operation.

I suppose, but can't the app achieve a nice middle ground by copying the
file in smaller syscalls? Avoid bulk data motion back to the client,
but still get notification every, I dunno, few hundred meg?

> So maybe a way to pass your NONBLOCKy flag to the server would be
> useful?

Maybe, but maybe it also just won't be used in practice. I'm to the
point where I'd rather we get the stupidest possible thing out there so
that we can learn from actual use of the interface.

- z

2013-09-30 15:46:39

by Miklos Szeredi

Subject: Re: [RFC] extending splice for copy offloading

On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler <[email protected]> wrote:
> The way the array-based offload (and some software-side reflink) works is
> not a byte-by-byte copy. We cannot assume that a valid count can be returned
> or that such a count would be an indication of a sequential segment of good
> data. The whole thing would normally have to be reissued.
>
> To make that a true assumption, you would have to mandate that in each of
> the specifications (and sw targets)...

You're missing my point.

- user issues SIZE_MAX splice request
- fs issues *64M* (or whatever) request to offload
- when that completes *fully* then we return 64M to userspace
- if it completes partially, then we return an error to userspace

Again, wouldn't that work?

Thanks,
Miklos

2013-09-28 05:49:47

by Miklos Szeredi

Subject: Re: [RFC] extending splice for copy offloading

On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown <[email protected]> wrote:
>> Also, I don't get the first option above at all. The argument is that
>> it's safer to have more copies? How much safety does another copy on
>> the same disk really give you? Do systems that do dedup provide
>> interfaces to turn it off per-file?

I don't find the safety argument very compelling either. There are
real semantic differences, however: ENOSPC on a write to an
(apparently) already allocated block. That could be a bit
unexpected. Do we need a fallocate extension to deal with shared
blocks?

Thanks,
Miklos

2013-09-30 15:57:44

by Miklos Szeredi

Subject: Re: [RFC] extending splice for copy offloading

On Mon, Sep 30, 2013 at 4:49 PM, Ric Wheeler <[email protected]> wrote:
> On 09/30/2013 10:46 AM, Miklos Szeredi wrote:
>>
>> On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler <[email protected]> wrote:
>>>
>>> The way the array-based offload (and some software-side reflink) works is
>>> not a byte-by-byte copy. We cannot assume that a valid count can be
>>> returned or that such a count would be an indication of a sequential
>>> segment of good data. The whole thing would normally have to be reissued.
>>>
>>> To make that a true assumption, you would have to mandate that in each of
>>> the specifications (and sw targets)...
>>
>> You're missing my point.
>>
>> - user issues SIZE_MAX splice request
>> - fs issues *64M* (or whatever) request to offload
>> - when that completes *fully* then we return 64M to userspace
>> - if it completes partially, then we return an error to userspace
>>
>> Again, wouldn't that work?
>>
>> Thanks,
>> Miklos
>
>
> Yes, if you send a copy offload command and it works, you can assume that it
> worked fully. It would be pretty interesting if that were not true :)
>
> If it fails, we cannot assume anything about partial completion.

Sure, that was my understanding from the start. Maybe I wasn't
precise enough in my explanation.

Thanks,
Miklos

2013-09-30 15:42:27

by Ric Wheeler

Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 10:38 AM, Miklos Szeredi wrote:
> On Mon, Sep 30, 2013 at 4:28 PM, Ric Wheeler <[email protected]> wrote:
>> On 09/30/2013 10:24 AM, Miklos Szeredi wrote:
>>> On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <[email protected]> wrote:
>>>> On 09/30/2013 10:51 AM, Miklos Szeredi wrote:
>>>>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <[email protected]>
>>>>> wrote:
>>>>>>> My other worry is about interruptibility/restartability. Ideas?
>>>>>>>
>>>>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>>>>>>> Can the page cache copy be made restartable? Or should splice() be
>>>>>>> allowed to return a short count? What happens on (non-reflink) remote
>>>>>>> copies and huge request sizes?
>>>>>> If I were writing an application that required copies to be
>>>>>> restartable,
>>>>>> I'd probably use the largest possible range in the reflink case but
>>>>>> break the copy into smaller chunks in the splice case.
>>>>>>
>>>>> The app really doesn't want to care about that. And it doesn't want
>>>>> to care about restartability, etc.. It's something the *kernel* has
>>>>> to care about. You just can't have uninterruptible syscalls that
>>>>> sleep for a "long" time, otherwise first you'll just have annoyed
>>>>> users pressing ^C in vain; then, if the sleep is even longer, warnings
>>>>> about task sleeping too long.
>>>>>
>>>>> One idea is letting splice() return a short count, and so the app can
>>>>> safely issue SIZE_MAX requests and the kernel can decide if it can
>>>>> copy the whole file in one go or if it wants to do it in smaller
>>>>> chunks.
>>>>>
>>>> You cannot rely on a short count. That implies that an offloaded copy
>>>> starts at byte 0 and the short count first bytes are all valid.
>>> Huh?
>>>
>>> - app calls splice(from, 0, to, 0, SIZE_MAX)
>>> 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
>>> 1.a) fs reflinks the whole file in a jiffy and returns the size of
>>> the file
>>> 1.b) fs does copy offload of, say, 64MB and returns 64M
>>> 2) VFS does page copy of, say, 1MB and returns 1MB
>>> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
>>> ...
>>>
>>> The point is: the app is always doing the same (incrementing offset
>>> with the return value from splice) and the kernel can decide what is
>>> the best size it can service within a single uninterruptible syscall.
>>>
>>> Wouldn't that work?
>>>
>> No.
>>
>> Keep in mind that the offload operation in (1) might fail partially. The
>> target file (the copy) is allocated, the question is what ranges have valid
>> data.
> You are talking about case 1.a, right? So if the offload copy 0-64MB
> fails partially, we return failure from splice, yet some of the copy
> did succeed. Is that the problem? Why?
>
> Thanks,
> Miklos

The way the array-based offload (and some software-side reflink) works is not a
byte-by-byte copy. We cannot assume that a valid count can be returned or that
such a count would be an indication of a sequential segment of good data. The
whole thing would normally have to be reissued.

To make that a true assumption, you would have to mandate that in each of the
specifications (and sw targets)...

ric


2013-09-26 19:06:44

by Zach Brown

Subject: Re: [RFC] extending splice for copy offloading

On Thu, Sep 26, 2013 at 08:06:41PM +0200, Miklos Szeredi wrote:
> On Thu, Sep 26, 2013 at 5:34 PM, J. Bruce Fields <[email protected]> wrote:
> > On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
> >> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <[email protected]> wrote:
> >> >> A client-side copy will be slower, but I guess it does have the
> >> >> advantage that the application can track progress to some degree, and
> >> >> abort it fairly quickly without leaving the file in a totally undefined
> >> >> state--and both might be useful if the copy's not a simple constant-time
> >> >> operation.
> >> >
> >> > I suppose, but can't the app achieve a nice middle ground by copying the
> >> > file in smaller syscalls? Avoid bulk data motion back to the client,
> >> > but still get notification every, I dunno, few hundred meg?
> >>
> >> Yes. And if "cp" could just be switched from a read+write syscall
> >> pair to a single splice syscall using the same buffer size.
> >
> > Will the various magic fs-specific copy operations become inefficient
> > when the range copied is too small?
>
> We could treat splice-copy operations just like write operations (can
> be buffered, coalesced, synced).
>
> But I'm not sure it's worth the effort; 99% of the use of this
> interface will be copying whole files. And for that perhaps we need a
> different API, one which has been discussed some time ago:
> asynchronous copyfile() returns immediately with a pollable event
> descriptor indicating copy progress, and some way to cancel the copy.
> And that can internally rely on ->direct_splice(), with appropriate
> algorithms for determining the optimal chunk size.

And perhaps we don't. Perhaps we can provide this much simpler
data-plane interface that works well enough for most everyone and can
avoid going down the async rat hole, yet again.

- z

2013-09-30 15:24:27

by Miklos Szeredi

Subject: Re: [RFC] extending splice for copy offloading

On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <[email protected]> wrote:
> On 09/30/2013 10:51 AM, Miklos Szeredi wrote:
>>
>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <[email protected]>
>> wrote:
>>>>
>>>> My other worry is about interruptibility/restartability. Ideas?
>>>>
>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>>>> Can the page cache copy be made restartable? Or should splice() be
>>>> allowed to return a short count? What happens on (non-reflink) remote
>>>> copies and huge request sizes?
>>>
>>> If I were writing an application that required copies to be restartable,
>>> I'd probably use the largest possible range in the reflink case but
>>> break the copy into smaller chunks in the splice case.
>>>
>> The app really doesn't want to care about that. And it doesn't want
>> to care about restartability, etc.. It's something the *kernel* has
>> to care about. You just can't have uninterruptible syscalls that
>> sleep for a "long" time, otherwise first you'll just have annoyed
>> users pressing ^C in vain; then, if the sleep is even longer, warnings
>> about task sleeping too long.
>>
>> One idea is letting splice() return a short count, and so the app can
>> safely issue SIZE_MAX requests and the kernel can decide if it can
>> copy the whole file in one go or if it wants to do it in smaller
>> chunks.
>>

>
> You cannot rely on a short count. That implies that an offloaded copy starts
> at byte 0 and the short count first bytes are all valid.

Huh?

- app calls splice(from, 0, to, 0, SIZE_MAX)
1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
1.a) fs reflinks the whole file in a jiffy and returns the size of the file
1.b) fs does copy offload of, say, 64MB and returns 64M
2) VFS does page copy of, say, 1MB and returns 1MB
- app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
...

The point is: the app is always doing the same (incrementing offset
with the return value from splice) and the kernel can decide what is
the best size it can service within a single uninterruptible syscall.

Wouldn't that work?

Thanks,
Miklos

2013-09-30 16:31:28

by Miklos Szeredi

Subject: Re: [RFC] extending splice for copy offloading

Here's an example "cp" app using direct splice (and without fallback to
non-splice, which is obviously required unless the kernel is known to support
direct splice).

Untested, but trivial enough...

The important part is, I think, that the app must not assume that the kernel can
complete the request in one go.

Thanks,
Miklos

----
#define _GNU_SOURCE

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <sys/stat.h>
#include <err.h>

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT (0x10) /* neither splice fd is a pipe */
#endif

int main(int argc, char *argv[])
{
	struct stat stbuf;
	int in_fd;
	int out_fd;
	int res;
	off_t off = 0;

	if (argc != 3)
		errx(1, "usage: %s from to", argv[0]);

	in_fd = open(argv[1], O_RDONLY);
	if (in_fd == -1)
		err(1, "opening %s", argv[1]);

	res = fstat(in_fd, &stbuf);
	if (res == -1)
		err(1, "fstat");

	out_fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, stbuf.st_mode);
	if (out_fd == -1)
		err(1, "opening %s", argv[2]);

	do {
		/* splice() takes loff_t offsets; advance both in lockstep */
		loff_t in_off = off, out_off = off;
		ssize_t rres;

		rres = splice(in_fd, &in_off, out_fd, &out_off, SSIZE_MAX,
			      SPLICE_F_DIRECT);
		if (rres == -1)
			err(1, "splice");
		if (rres == 0)
			break;

		off += rres;
	} while (off < stbuf.st_size);

	res = close(in_fd);
	if (res == -1)
		err(1, "close");

	res = fsync(out_fd);
	if (res == -1)
		err(1, "fsync");

	res = close(out_fd);
	if (res == -1)
		err(1, "close");

	return 0;
}

2013-09-26 21:25:09

by Ric Wheeler

Subject: Re: [RFC] extending splice for copy offloading

On 09/26/2013 03:53 PM, Miklos Szeredi wrote:
> On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown <[email protected]> wrote:
>
>>> But I'm not sure it's worth the effort; 99% of the use of this
>>> interface will be copying whole files. And for that perhaps we need a
>>> different API, one which has been discussed some time ago:
>>> asynchronous copyfile() returns immediately with a pollable event
>>> descriptor indicating copy progress, and some way to cancel the copy.
>>> And that can internally rely on ->direct_splice(), with appropriate
>>> algorithms for determining the optimal chunk size.
>> And perhaps we don't. Perhaps we can provide this much simpler
>> data-plane interface that works well enough for most everyone and can
>> avoid going down the async rat hole, yet again.
> I think either buffering or async is needed to get good performance
> without too much complexity in the app (which is not good). Buffering
> works quite well for regular I/O, so maybe it's the way to go here as
> well.
>
> Thanks,
> Miklos
>

Buffering misses the whole point of the copy offload - the idea is *not* to
read or write the actual data in the most interesting cases, which offload the
operation to a smart target device or file system.

Regards,

Ric


2013-09-25 18:39:05

by Zach Brown

Subject: Re: [RFC] extending splice for copy offloading


Hrmph. I had composed a reply to you during Plumbers but.. something
happened to it :). Here's another try now that I'm back.

> > Some things to talk about:
> > - I really don't care about the naming here. If you do, holler.
> > - We might want different flags for file-to-file splicing and acceleration
>
> Yes, I think "copy" and "reflink" needs to be differentiated.

I initially agreed but I'm not so sure now. The problem is that we
can't know whether the acceleration is copying or not. XCOPY on some
array may well do some shared referencing tricks. The nfs COPY op can
have a server use btrfs reflink, or ext* and XCOPY, or .. who knows. At
some point we have to admit that we have no way to determine the
relative durability of writes. Storage can do a lot to make writes more
or less fragile that we have no visibility of. SSD FTLs can log a bunch
of unrelated sectors on to one flash failure domain.

And if such a flag couldn't *actually* guarantee anything for a bunch of
storage topologies, well, let's not bother with it.

The only flag I'm in favour of now is one that has splice return rather
than falling back to manual page cache reads and writes. It's more like
O_NONBLOCK than any kind of data durability hint.

> > - We might want flags to require or forbid acceleration
> > - We might want to provide all these flags to sendfile, too
> >
> > Thoughts? Objections?
>
> Can filesystem support "whole file copy" only? Or arbitrary
> block-to-block copy should be mandatory?

I'm not sure I understand what you're asking. The interface specifies
byte ranges. File systems can return errors if they can't accelerate
the copy. We *can't* mandate copy acceleration granularity as some
formats and protocols just can't do it. splice() will fall back to
doing buffered copies when the file system returns an error.

> Splice has size_t argument for the size, which is limited to 4G on 32
> bit. Won't this be an issue for whole-file-copy? We could have
> special value (-1) for whole file, but that's starting to be hackish.

It will be an issue, yeah. Just like it is with write() today. I think
it's reasonable to start with a simple interface that matches current IO
syscalls. I won't implement a special whole-file value, no.

And it's not just 32bit size_t. While do_splice_direct() doesn't use
the truncated length that's returned from rw_verify_area(), it then
silently truncates the lengths to unsigned int in the splice_desc struct
fields. It seems like we might want to address that :/.
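
For reference, the fields in question look roughly like this (sketched
from memory of include/linux/splice.h around this kernel version, so
double-check against your tree):

struct splice_desc {
	unsigned int len, total_len;	/* current and remaining length;
					 * a 64-bit size_t from the caller
					 * is silently narrowed here */
	unsigned int flags;		/* splice flags */
	/* ... */
};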

> We are talking about copying large amounts of data in a single
> syscall, which will possibly take a long time. Will the syscall be
> interruptible? Restartable?

In as much as file systems let it be, yeah. As ever, you're not going
to have a lot of luck interrupting a process stuck in lock_page(),
mutex_lock(), wait_on_page_writeback(), etc. Though you did remind me
to investigate restarting. Thanks.

- z

2013-09-27 20:05:56

by J. Bruce Fields

Subject: Re: [RFC] extending splice for copy offloading

On Thu, Sep 26, 2013 at 05:26:39PM -0400, Ric Wheeler wrote:
> On 09/26/2013 02:55 PM, Zach Brown wrote:
> >On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
> >>On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <[email protected]> wrote:
> >>>>A client-side copy will be slower, but I guess it does have the
> >>>>advantage that the application can track progress to some degree, and
> >>>>abort it fairly quickly without leaving the file in a totally undefined
> >>>>state--and both might be useful if the copy's not a simple constant-time
> >>>>operation.
> >>>I suppose, but can't the app achieve a nice middle ground by copying the
> >>>file in smaller syscalls? Avoid bulk data motion back to the client,
> >>>but still get notification every, I dunno, few hundred meg?
> >>Yes. And if "cp" could just be switched from a read+write syscall
> >>pair to a single splice syscall using the same buffer size. And then
> >>the user would only notice that things got faster in case of server
> >>side copy. No problems with long blocking times (at least not much
> >>worse than it was).
> >Hmm, yes, that would be a nice outcome.
> >
> >>However "cp" doesn't do reflinking by default, it has a switch for
> >>that. If we just want "cp" and the like to use splice without fearing
> >>side effects then by default we should try to be as close to
> >>read+write behavior as possible. No?
> >I guess? I don't find requiring --reflink hugely compelling. But there
> >it is.
> >
> >>That's what I'm really
> >>worrying about when you want to wire up splice to reflink by default.
> >>I do think there should be a flag for that. And if on the block level
> >>some magic happens, so be it. It's not the fs deverloper's worry any
> >>more ;)
> >Sure. So we'd have:
> >
> >- no flag default that forbids knowingly copying with shared references
> > so that it will be used by default by people who feel strongly about
> > their assumptions about independent write durability.
> >
> >- a flag that allows shared references for people who would otherwise
> > use the file system shared reference ioctls (ocfs2 reflink, btrfs
> > clone) but would like it to also do server-side read/write copies
> > over nfs without additional intervention.
> >
> >- a flag that requires shared references for callers who don't want
> > giant copies to take forever if they aren't instant. (The qemu guys
> > asked for this at Plumbers.)

Why not implement only the last flag as the first step? It seems
like the simplest one. So I think that would mean:

- no worrying about cancelling, etc.
- apps should be told to pass the entire range at once (normally
the whole file).
- The NFS server probably shouldn't do the internal copy loop by
default.

We can't prevent some storage system from implementing a high-latency
copy operation, but we can refuse to provide them any help (providing no
progress reports or easy way to cancel) and then they can deal with the
complaints from their users.

Also, I don't get the first option above at all. The argument is that
it's safer to have more copies? How much safety does another copy on
the same disk really give you? Do systems that do dedup provide
interfaces to turn it off per-file?

> This last flag should not prevent a remote target device (NFS or
> SCSI array) copy from working though since they often do reflink
> like operations inside of the remote target device....

In fact maybe that's the only case to care about on the first pass.

But I understand that Zach's tired of the woodshedding and I could live
with the above I guess....

--b.

2013-09-30 14:49:59

by Ric Wheeler

Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 10:34 AM, J. Bruce Fields wrote:
> On Mon, Sep 30, 2013 at 02:20:30PM +0200, Miklos Szeredi wrote:
>> On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler <[email protected]> wrote:
>>
>>>>> I don't find the safety argument very compelling either. There are real
>>>>> semantic differences, however: ENOSPC on a write to an
>>>>> (apparently) already allocated block. That could be a bit unexpected.
>>>>> Do we
>>>>> need a fallocate extension to deal with shared blocks?
>>>> The above has been the case for all enterprise storage arrays ever since
>>>> the invention of snapshots. The NFSv4.2 spec does allow you to set a
>>>> per-file attribute that causes the storage server to always preallocate
>>>> enough buffers to guarantee that you can rewrite the entire file, however
>>>> the fact that we've lived without it for said 20 years leads me to believe
>>>> that demand for it is going to be limited. I haven't put it top of the list
>>>> of features we care to implement...
>>>>
>>>> Cheers,
>>>> Trond
>>>
>>> I agree - this has been common behaviour for a very long time in the array
>>> space. Even without an array, this is the same as overwriting a block in
>>> btrfs or any file system with a read-write LVM snapshot.
>> Okay, I'm convinced.
>>
>> So I suggest
>>
>> - mount(..., MNT_REFLINK): *allow* splice to reflink. If this is not
>> set, fall back to page cache copy.
>> - splice(... SPLICE_REFLINK): fail non-reflink copy. With this app
>> can force reflink.
>>
>> Both are trivial to implement and make sure that no backward
>> incompatibility surprises happen.
>>
>> My other worry is about interruptibility/restartability. Ideas?
>>
>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>> Can the page cache copy be made restartable? Or should splice() be
>> allowed to return a short count? What happens on (non-reflink) remote
>> copies and huge request sizes?
> If I were writing an application that required copies to be restartable,
> I'd probably use the largest possible range in the reflink case but
> break the copy into smaller chunks in the splice case.
>
> For that reason I don't like the idea of a mount option--the choice is
> something that the application probably wants to make (or at least to
> know about).
>
> The NFS COPY operation, as specified in current drafts, allows for
> asynchronous copies but leaves the state of the file undefined in the
> case of an aborted COPY. I worry that agreeing on standard behavior in
> the case of an abort might be difficult.
>
> --b.

I think that this is still confusing - reflink and array copy offload should not
be differentiated. In effect, they should often be the same order of magnitude
in performance and possibly even use the same or very similar techniques (just
on different sides of the initiator/target transaction!).

It is much simpler to let the application fail if the offload (or reflink) is
not supported and let it do the traditional copy. Then you always send
the largest possible offload operation and do whatever you do now if that fails.
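
In code, that policy could look something like the sketch below. It
assumes splice() is made to fail rather than fall back when it can't
offload (one of the flag proposals in this thread), and that a failed
offload either fully succeeds or fails as a whole; SPLICE_F_DIRECT is
from Zach's patches, and error handling is trimmed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT	(0x10)
#endif

static int copy_file(int in_fd, int out_fd, off_t size)
{
	loff_t in = 0, out = 0;
	char buf[64 * 1024];
	ssize_t n;

	/* One maximal offload request; if it works, it worked fully. */
	if (splice(in_fd, &in, out_fd, &out, size, SPLICE_F_DIRECT) == size)
		return 0;

	/* Offload unsupported or failed: traditional copy from scratch. */
	if (lseek(in_fd, 0, SEEK_SET) == -1 ||
	    lseek(out_fd, 0, SEEK_SET) == -1 ||
	    ftruncate(out_fd, 0) == -1)
		return -1;
	while ((n = read(in_fd, buf, sizeof(buf))) > 0)
		if (write(out_fd, buf, (size_t)n) != n)
			return -1;
	return n == 0 ? 0 : -1;
}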

thanks!

Ric


2013-09-11 17:08:12

by Zach Brown

Subject: [PATCH 2/3] splice: add f_op->splice_direct

The splice_direct file_operations method gives file systems the
opportunity to accelerate copying a region between two files.

The generic path attempts to copy the remainder of the region that the
file system fails to accelerate, for whatever reason. We may choose to
dial this back a bit if the caller wants to avoid unaccelerated copying,
perhaps by setting behavioural flags.

The SPLICE_F_DIRECT flag is arguably misused here to indicate both
file-to-file "direct" splicing *and* acceleration.

Signed-off-by: Zach Brown <[email protected]>
---
fs/bad_inode.c | 8 ++++++++
fs/splice.c | 28 +++++++++++++++++++++++-----
include/linux/fs.h | 1 +
3 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/fs/bad_inode.c b/fs/bad_inode.c
index 7c93953..394914b 100644
--- a/fs/bad_inode.c
+++ b/fs/bad_inode.c
@@ -145,6 +145,13 @@ static ssize_t bad_file_splice_read(struct file *in, loff_t *ppos,
return -EIO;
}

+static ssize_t bad_file_splice_direct(struct file *in, loff_t in_pos,
+ struct file *out, loff_t out_pos, size_t len,
+ unsigned int flags)
+{
+ return -EIO;
+}
+
static const struct file_operations bad_file_ops =
{
.llseek = bad_file_llseek,
@@ -170,6 +177,7 @@ static const struct file_operations bad_file_ops =
.flock = bad_file_flock,
.splice_write = bad_file_splice_write,
.splice_read = bad_file_splice_read,
+ .splice_direct = bad_file_splice_direct,
};

static int bad_inode_create (struct inode *dir, struct dentry *dentry,
diff --git a/fs/splice.c b/fs/splice.c
index c0f4e27..eac310f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1284,14 +1284,12 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
loff_t *opos, size_t len, unsigned int flags)
{
struct splice_desc sd = {
- .len = len,
- .total_len = len,
.flags = flags,
- .pos = *ppos,
.u.file = out,
.opos = opos,
};
long ret;
+ long bytes = 0;

if (unlikely(!(out->f_mode & FMODE_WRITE)))
return -EBADF;
@@ -1303,11 +1301,31 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
if (unlikely(ret < 0))
return ret;

+ if ((flags & SPLICE_F_DIRECT) && out->f_op->splice_direct) {
+ ret = out->f_op->splice_direct(in, *ppos, out, *opos, len,
+ flags);
+ if (ret > 0) {
+ bytes += ret;
+ len -= ret;
+ *opos += ret;
+ *ppos += ret;
+
+ if (len == 0)
+ return ret;
+ }
+ }
+
+ sd.len = len;
+ sd.total_len = len;
+ sd.pos = *ppos;
+
ret = splice_direct_to_actor(in, &sd, direct_splice_actor);
- if (ret > 0)
+ if (ret > 0) {
+ bytes += ret;
*ppos = sd.pos;
+ }

- return ret;
+ return bytes ? bytes : ret;
}

static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 529d871..725e6fc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1553,6 +1553,7 @@ struct file_operations {
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
+ ssize_t (*splice_direct)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
--
1.7.11.7


2013-09-26 18:06:42

by Miklos Szeredi

Subject: Re: [RFC] extending splice for copy offloading

On Thu, Sep 26, 2013 at 5:34 PM, J. Bruce Fields <[email protected]> wrote:
> On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
>> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <[email protected]> wrote:
>> >> A client-side copy will be slower, but I guess it does have the
>> >> advantage that the application can track progress to some degree, and
>> >> abort it fairly quickly without leaving the file in a totally undefined
>> >> state--and both might be useful if the copy's not a simple constant-time
>> >> operation.
>> >
>> > I suppose, but can't the app achieve a nice middle ground by copying the
>> > file in smaller syscalls? Avoid bulk data motion back to the client,
>> > but still get notification every, I dunno, few hundred meg?
>>
>> Yes. And if "cp" could just be switched from a read+write syscall
>> pair to a single splice syscall using the same buffer size.
>
> Will the various magic fs-specific copy operations become inefficient
> when the range copied is too small?

We could treat splice-copy operations just like write operations (can
be buffered, coalesced, synced).

But I'm not sure it's worth the effort; 99% of the use of this
interface will be copying whole files. And for that perhaps we need a
different API, one which has been discussed some time ago:
asynchronous copyfile() returns immediately with a pollable event
descriptor indicating copy progress, and some way to cancel the copy.
And that can internally rely on ->direct_splice(), with appropriate
algorithms for determining the optimal chunk size.

Thanks,
Miklos

2013-09-30 17:44:26

by Myklebust, Trond

Subject: Re: [RFC] extending splice for copy offloading

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
> It would be nice if there would be way if the file system would get a
> hint that the target file is supposed to be copy of another file. That
> way distributed file systems could also create the target-file with the
> correct meta-information (same storage targets as in-file has).
> Well, if we cannot agree on that, file system with a custom protocol at
> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
> sure if this would work for pNFS, though.

splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2013-09-27 20:50:54

by Zach Brown

Subject: Re: [RFC] extending splice for copy offloading


> > >Sure. So we'd have:
> > >
> > >- no flag default that forbids knowingly copying with shared references
> > > so that it will be used by default by people who feel strongly about
> > > their assumptions about independent write durability.
> > >
> > >- a flag that allows shared references for people who would otherwise
> > > use the file system shared reference ioctls (ocfs2 reflink, btrfs
> > > clone) but would like it to also do server-side read/write copies
> > > over nfs without additional intervention.
> > >
> > >- a flag that requires shared references for callers who don't want
> > > giant copies to take forever if they aren't instant. (The qemu guys
> > > asked for this at Plumbers.)
>
> Why not implement only the last flag as the first step? It seems
> like the simplest one. So I think that would mean:
>
> - no worrying about cancelling, etc.
> - apps should be told to pass the entire range at once (normally
> the whole file).
> - The NFS server probably shouldn't do the internal copy loop by
> default.
>
> We can't prevent some storage system from implementing a high-latency
> copy operation, but we can refuse to provide them any help (providing no
> progress reports or easy way to cancel) and then they can deal with the
> complaints from their users.

I can see where you're going with that, yeah.

It'd make less sense as a splice extension, then, perhaps. It'd be more
like a generic entry point for the existing ioctls. Maybe even just
defining the semantics of a common ioctl.

Hmm.

> Also, I don't get the first option above at all. The argument is that
> it's safer to have more copies? How much safety does another copy on
> the same disk really give you? Do systems that do dedup provide
> interfaces to turn it off per-file?

Yeah, got me. It's certainly nonsense on a lot of FTL logging
implementations (which are making their way into SMR drives in the
future).

> But I understand that Zach's tired of the woodshedding and I could live
> with the above I guess....

No, it's fine. At least people are expressing some interest in the
interface! That's a marked improvement over the state of things in the
past.

- z

2013-09-30 15:38:34

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Mon, Sep 30, 2013 at 4:28 PM, Ric Wheeler <[email protected]> wrote:
> On 09/30/2013 10:24 AM, Miklos Szeredi wrote:
>>
>> On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <[email protected]> wrote:
>>>
>>> On 09/30/2013 10:51 AM, Miklos Szeredi wrote:
>>>>
>>>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <[email protected]>
>>>> wrote:
>>>>>>
>>>>>> My other worry is about interruptibility/restartability. Ideas?
>>>>>>
>>>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>>>>>> Can the page cache copy be made restartable? Or should splice() be
>>>>>> allowed to return a short count? What happens on (non-reflink) remote
>>>>>> copies and huge request sizes?
>>>>>
>>>>> If I were writing an application that required copies to be
>>>>> restartable,
>>>>> I'd probably use the largest possible range in the reflink case but
>>>>> break the copy into smaller chunks in the splice case.
>>>>>
>>>> The app really doesn't want to care about that. And it doesn't want
>>>> to care about restartability, etc.. It's something the *kernel* has
>>>> to care about. You just can't have uninterruptible syscalls that
>>>> sleep for a "long" time, otherwise first you'll just have annoyed
>>>> users pressing ^C in vain; then, if the sleep is even longer, warnings
>>>> about task sleeping too long.
>>>>
>>>> One idea is letting splice() return a short count, and so the app can
>>>> safely issue SIZE_MAX requests and the kernel can decide if it can
>>>> copy the whole file in one go or if it wants to do it in smaller
>>>> chunks.
>>>>
>>> You cannot rely on a short count. That implies that an offloaded copy
>>> starts
>>> at byte 0 and the short count first bytes are all valid.
>>
>> Huh?
>>
>> - app calls splice(from, 0, to, 0, SIZE_MAX)
>> 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
>> 1.a) fs reflinks the whole file in a jiffy and returns the size of
>> the file
>> 1 b) fs does copy offload of, say, 64MB and returns 64M
>> 2) VFS does page copy of, say, 1MB and returns 1MB
>> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
>> ...
>>
>> The point is: the app is always doing the same (incrementing offset
>> with the return value from splice) and the kernel can decide what is
>> the best size it can service within a single uninterruptible syscall.
>>
>> Wouldn't that work?
>>

>
> No.
>
> Keep in mind that the offload operation in (1) might fail partially. The
> target file (the copy) is allocated, the question is what ranges have valid
> data.

You are talking about case 1.a, right? So if the offload copy 0-64MB
fails partially, we return failure from splice, yet some of the copy
did succeed. Is that the problem? Why?

Thanks,
Miklos

2013-09-27 14:01:40

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/27/2013 12:47 AM, Miklos Szeredi wrote:
> On Thu, Sep 26, 2013 at 11:23 PM, Ric Wheeler <[email protected]> wrote:
>> On 09/26/2013 03:53 PM, Miklos Szeredi wrote:
>>> On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown <[email protected]> wrote:
>>>
>>>>> But I'm not sure it's worth the effort; 99% of the use of this
>>>>> interface will be copying whole files. And for that perhaps we need a
>>>>> different API, one which has been discussed some time ago:
>>>>> asynchronous copyfile() returns immediately with a pollable event
>>>>> descriptor indicating copy progress, and some way to cancel the copy.
>>>>> And that can internally rely on ->direct_splice(), with appropriate
>>>>> algorithms for determining the optimal chunk size.
>>>> And perhaps we don't. Perhaps we can provide this much simpler
>>>> data-plane interface that works well enough for most everyone and can
>>>> avoid going down the async rat hole, yet again.
>>> I think either buffering or async is needed to get good performance
>>> without too much complexity in the app (which is not good). Buffering
>>> works quite well for regular I/O, so maybe it's the way to go here as
>>> well.
>>>
>>> Thanks,
>>> Miklos
>>>
>> Buffering misses the whole point of the copy offload - the idea is *not* to
>> read or write the actual data in the most interesting cases which offload
>> the operation to a smart target device or file system.
> I meant buffering the COPY, not the data. Doing the COPY
> synchronously will always incur a performance penalty, the amount
> depending on the latency, which can be significant with networking.
>
> We think of write(2) as a synchronous interface, because that's the
> appearance we get from all that hard work the page cache and delayed
> writeback code does to make an asynchronous operation look as if it
> was synchronous. So from a userspace API perspective a sync interface
> is nice, but inside we almost always have async interfaces to do the
> actual work.
>
> Thanks,
> Miklos

I think that you are an order of magnitude off here in thinking about the scale
of the operations.

An enabled, synchronous copy offload to an array (or one that turns into a
reflink locally) is effectively the cost of the call itself. Let's say no slower
than one IO to a S-ATA disk (10ms?) as a pessimistic guess. Realistically, that
call is much faster than that worst case number.

Copying any substantial amount of data - like the target workload of VM images
or media files - would be hundreds of MB's per copy and that would take seconds
or minutes.

We should really work on getting the basic mechanism working and robust without
any complications, then we can look at real, measured performance and see if
there is any justification for adding complexity.

thanks!

Ric

>


2013-09-30 14:53:34

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 10:51 AM, Miklos Szeredi wrote:
> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <[email protected]> wrote:
>>> My other worry is about interruptibility/restartability. Ideas?
>>>
>>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>>> Can the page cache copy be made restartable? Or should splice() be
>>> allowed to return a short count? What happens on (non-reflink) remote
>>> copies and huge request sizes?
>> If I were writing an application that required copies to be restartable,
>> I'd probably use the largest possible range in the reflink case but
>> break the copy into smaller chunks in the splice case.
>>
> The app really doesn't want to care about that. And it doesn't want
> to care about restartability, etc.. It's something the *kernel* has
> to care about. You just can't have uninterruptible syscalls that
> sleep for a "long" time, otherwise first you'll just have annoyed
> users pressing ^C in vain; then, if the sleep is even longer, warnings
> about task sleeping too long.
>
> One idea is letting splice() return a short count, and so the app can
> safely issue SIZE_MAX requests and the kernel can decide if it can
> copy the whole file in one go or if it wants to do it in smaller
> chunks.
>
> Thanks,
> Miklos

You cannot rely on a short count. That implies that an offloaded copy starts at
byte 0 and the short count first bytes are all valid.

I don't believe that is in fact required by all (any?) versions of the spec :)

Best just to fail and restart the whole operation.

Ric


2013-09-30 20:27:32

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Mon, 2013-09-30 at 16:08 -0400, Ric Wheeler wrote:
> On 09/30/2013 04:00 PM, Bernd Schubert wrote:
> > pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own
> > interface? And userspace needs to address all of them differently?
>
> The NFS and SCSI groups have each defined a standard which Zach's proposal
> abstracts into a common user API.
>
> Distributed file systems tend to be rather unique and do not have similar
> standard bodies, but a lot of them could hide server specific implementations
> under the current proposed interfaces.
>
> What is not a good idea is to drag out the core, simple copy offload discussion
> for another 5 years to pull in every odd use case :)

Agreed. The whole idea of a common system call interface should be to
allow us to abstract away the underlying storage and filesystem
architectures. If filesystem developers also want a way to expose that
underlying architecture to applications in order to enable further
optimisations, then that belongs in a separate discussion.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2013-09-28 15:20:20

by Myklebust, Trond

[permalink] [raw]
Subject: RE: [RFC] extending splice for copy offloading

> -----Original Message-----
> From: Miklos Szeredi [mailto:[email protected]]
> Sent: Saturday, September 28, 2013 12:50 AM
> To: Zach Brown
> Cc: J. Bruce Fields; Ric Wheeler; Anna Schumaker; Kernel Mailing List; Linux-
> Fsdevel; [email protected]; Myklebust, Trond; Schumaker, Bryan;
> Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong
> Subject: Re: [RFC] extending splice for copy offloading
>
> On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown <[email protected]> wrote:
> >> Also, I don't get the first option above at all.  The argument is
> >> that it's safer to have more copies?  How much safety does another
> >> copy on the same disk really give you?  Do systems that do dedup
> >> provide interfaces to turn it off per-file?
>
> I don't see the safety argument very compelling either.  There are real
> semantic differences, however: ENOSPC on a write to an
> (apparently) already allocated block.  That could be a bit unexpected.
> Do we need a fallocate extension to deal with shared blocks?

The above has been the case for all enterprise storage arrays ever
since the invention of snapshots. The NFSv4.2 spec does allow you to
set a per-file attribute that causes the storage server to always
preallocate enough buffers to guarantee that you can rewrite the
entire file, however the fact that we've lived without it for said 20
years leads me to believe that demand for it is going to be limited.
I haven't put it top of the list of features we care to implement...

Cheers,
  Trond

2013-09-27 14:39:17

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Fri, Sep 27, 2013 at 4:00 PM, Ric Wheeler <[email protected]> wrote:

> I think that you are an order of magnitude off here in thinking about the
> scale of the operations.
>
> An enabled, synchronous copy offload to an array (or one that turns into a
> reflink locally) is effectively the cost of the call itself. Let's say no
> slower than one IO to a S-ATA disk (10ms?) as a pessimistic guess.
> Realistically, that call is much faster than that worst case number.
>
> Copying any substantial amount of data - like the target workload of VM
> images or media files - would be hundreds of MB's per copy and that would
> take seconds or minutes.

Will a single splice-copy operation be interruptible/restartable? If
not, how should apps size one request so that it doesn't take too much
time? Even for slow devices (usb stick)? If it will be restartable,
how? Can remote copy be done with this? Over a high latency
network?

Those are the questions I'm worried about.

>
> We should really work on getting the basic mechanism working and robust
> without any complications, then we can look at real, measured performance
> and see if there is any justification for adding complexity.

Go for that. But don't forget that at the end of the day actual apps
like file managers and "dd" and "cp" will need to be converted, and we
definitely don't want a userspace library to have to figure out how
the copy is done most efficiently; it's something for the kernel to
figure out.

Thanks,
Miklos

2013-09-26 16:47:56

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/26/2013 11:34 AM, J. Bruce Fields wrote:
> On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
>> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <[email protected]> wrote:
>>>> A client-side copy will be slower, but I guess it does have the
>>>> advantage that the application can track progress to some degree, and
>>>> abort it fairly quickly without leaving the file in a totally undefined
>>>> state--and both might be useful if the copy's not a simple constant-time
>>>> operation.
>>> I suppose, but can't the app achieve a nice middle ground by copying the
>>> file in smaller syscalls? Avoid bulk data motion back to the client,
>>> but still get notification every, I dunno, few hundred meg?
>> Yes. And if "cp" could just be switched from a read+write syscall
>> pair to a single splice syscall using the same buffer size.
> Will the various magic fs-specific copy operations become inefficient
> when the range copied is too small?
>
> (Totally naive question, as I have no idea how they really work.)
>
> --b.

I think that it is not really possible to tell when we invoke it. How long
it takes is very much dependent on the target device (or file system, etc.).
It could be as simple as a reflink copying in a smallish amount of metadata,
or it could fall back to a full byte-by-byte copy. Also note that speed is
not the only impact here; some of the mechanisms actually do not consume
more space (they just increment shared data references).

It would probably make more sense to send it off to the target device and have
it return an error when not appropriate (then the app can fall back to the
old-fashioned copy).
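
The caller side of that fallback could look something like this (a
sketch only; it assumes the SPLICE_F_DIRECT flag from Zach's series,
and the exact errno a target would use to refuse the offload is not
settled -- EOPNOTSUPP is just a guess):

----
ssize_t n = splice(in_fd, &in_off, out_fd, &out_off, chunk,
		   SPLICE_F_DIRECT);
if (n == -1 && (errno == EOPNOTSUPP || errno == EINVAL)) {
	/* offload refused: old-fashioned read+write copy */
	char buf[65536];
	ssize_t r;

	while ((r = pread(in_fd, buf, sizeof(buf), in_off)) > 0) {
		if (pwrite(out_fd, buf, r, out_off) != r)
			err(1, "pwrite");
		in_off += r;
		out_off += r;
	}
}
----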

ric

>
>> And then
>> the user would only notice that things got faster in case of server
>> side copy. No problems with long blocking times (at least not much
>> worse than it was).
>>
>> However "cp" doesn't do reflinking by default, it has a switch for
>> that. If we just want "cp" and the like to use splice without fearing
>> side effects then by default we should try to be as close to
>> read+write behavior as possible. No? That's what I'm really
>> worrying about when you want to wire up splice to reflink by default.
>> I do think there should be a flag for that. And if on the block level
>> some magic happens, so be it. It's not the fs developer's worry any
>> more ;)
>>
>> Thanks,
>> Miklos
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/


2013-09-30 17:29:46

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 06:31 PM, Miklos Szeredi wrote:
> Here's an example "cp" app using direct splice (and without fallback to
> non-splice, which is obviously required unless the kernel is known to support
> direct splice).
>
> Untested, but trivial enough...
>
> The important part is, I think, that the app must not assume that the kernel can
> complete the request in one go.
>
> Thanks,
> Miklos
>
> ----
> #define _GNU_SOURCE
>
> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <limits.h>
> #include <sys/stat.h>
> #include <err.h>
>
> #ifndef SPLICE_F_DIRECT
> #define SPLICE_F_DIRECT (0x10) /* neither splice fd is a pipe */
> #endif
>
> int main(int argc, char *argv[])
> {
> struct stat stbuf;
> int in_fd;
> int out_fd;
> int res;
> off_t off;

off_t off = 0;

>
> if (argc != 3)
> errx(1, "usage: %s from to", argv[0]);
>
> in_fd = open(argv[1], O_RDONLY);
> if (in_fd == -1)
> err(1, "opening %s", argv[1]);
>
> res = fstat(in_fd, &stbuf);
> if (res == -1)
> err(1, "fstat");
>
> out_fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, stbuf.st_mode);
> if (out_fd == -1)
> err(1, "opening %s", argv[2]);
>
> do {
> off_t in_off = off, out_off = off;
> ssize_t rres;
>
> rres = splice(in_fd, &in_off, out_fd, &out_off, SSIZE_MAX,
> SPLICE_F_DIRECT);
> if (rres == -1)
> err(1, "splice");
> if (rres == 0)
> break;
>
> off += rres;
> } while (off < stbuf.st_size);
>
> res = close(in_fd);
> if (res == -1)
> err(1, "close");
>
> res = fsync(out_fd);
> if (res == -1)
> err(1, "fsync");
>
> res = close(out_fd);
> if (res == -1)
> err(1, "close");
>
> return 0;
> }


It would be nice if there were a way for the file system to get a
hint that the target file is supposed to be a copy of another file. That
way distributed file systems could also create the target file with the
correct meta-information (same storage targets as the in-file has).
Well, if we cannot agree on that, a file system with a custom protocol at
least can detect a splice from 0 to SSIZE_MAX and then reset metadata. I'm
not sure if this would work for pNFS, though.


Bernd




2013-09-30 20:00:47

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 09:34 PM, Myklebust, Trond wrote:
> On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:
>> On 09/30/2013 08:02 PM, Myklebust, Trond wrote:
>>> On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
>>>> On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
>>>>> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
>>>>>> It would be nice if there were a way for the file system to get a
>>>>>> hint that the target file is supposed to be a copy of another file. That
>>>>>> way distributed file systems could also create the target file with the
>>>>>> correct meta-information (same storage targets as the in-file has).
>>>>>> Well, if we cannot agree on that, a file system with a custom protocol at
>>>>>> least can detect a splice from 0 to SSIZE_MAX and then reset metadata. I'm
>>>>>> not sure if this would work for pNFS, though.
>>>>>
>>>>> splice() does not create new files. What you appear to be asking for
>>>>> lies way outside the scope of that system call interface.
>>>>>
>>>>
>>>> Sorry I know, definitely outside the scope of splice, but in the context
>>>> of offloaded file copies. So the question is, what is the best way to
>>>> address/discuss that?
>>>
>>> Why does it need to be addressed in the first place?
>>
>> An offloaded copy is still not efficient if different storage
>> servers/targets used by from-file and to-file.
>
> So?

mds1: orig-file
oss1/target1: orig-chunk1

mds1: target-file
ossN/targetN: target-chunk1

clientN: Performs the copy

Ideally, orig-chunk1 and target-chunk1 are on the same server and the same
target. Copy offload could then even be done by the underlying fs,
similar to a local splice.
If different ossN servers are used, copies still have to be done over
the network by these storage servers, although the client would only need
to initiate the copy. Still faster, but also not ideal.

>
>>>
>>> What is preventing an application from retrieving and setting this
>>> information using standard libc functions such as fstat()+open(), and
>>> supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
>>> where appropriate?
>>>
>>
>> At a minimum this requires network and metadata overhead. And while I'm
>> working on FhGFS now, I still wonder what other file systems need to do -
>> for example Lustre pre-allocates storage-target files on creating a
>> file, so file layout changes mean even more overhead there.
>
> The problem you are describing is limited to a narrow set of storage
> architectures. If copy offload using splice() doesn't make sense for
> those architectures, then don't implement it for them.

But it _does_ make sense. The file system just needs a hint that a
splice copy is going to come up.

> You might be able to provide ioctls() to do these special hinted file
> creations for those filesystems that need it, but the vast majority
> don't, and you shouldn't enforce it on them.

And exactly for that we need a standard - it does not make sense if each
and every distributed file system implements its own
ioctl/libattr/libacl interface for that.

>
>> Anyway, if we could agree on using libattr or libacl to teach the file
>> system about the upcoming splice call I would be fine.
>
> libattr and libacl are generic libraries that exist to manipulate xattrs
> and acls. They do not need to contain Lustre-specific code.
>

pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own
interface? And userspace needs to address all of them differently?

I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry,
didn't find a better name yet), which would take in-file-path and
out-file-path and allow the file system to create out-file-path with the
same meta-layout as in-file-path. And it would need some flags, such as
AUTO (file system decides if it makes sense to do a local copy) and
FORCE (always try a local copy).
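
Roughly like the sketch below (every name here, including
SPLICE_META_COPY itself, is a placeholder from this mail, not an
existing kernel interface):

----
#define SPLICE_META_AUTO	0x1	/* fs decides if a local copy makes sense */
#define SPLICE_META_FORCE	0x2	/* always try a local copy */

struct splice_meta_copy {
	const char *in_path;	/* existing source file */
	const char *out_path;	/* created with the same meta-layout */
	unsigned int flags;	/* SPLICE_META_AUTO or SPLICE_META_FORCE */
};

/* hint the fs before the actual splice copy; fd could be the mount root */
ioctl(fd, SPLICE_META_COPY, &smc);
----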


Thanks,
Bernd

2013-09-27 04:47:06

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Thu, Sep 26, 2013 at 11:23 PM, Ric Wheeler <[email protected]> wrote:
> On 09/26/2013 03:53 PM, Miklos Szeredi wrote:
>>
>> On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown <[email protected]> wrote:
>>
>>>> But I'm not sure it's worth the effort; 99% of the use of this
>>>> interface will be copying whole files. And for that perhaps we need a
>>>> different API, one which has been discussed some time ago:
>>>> asynchronous copyfile() returns immediately with a pollable event
>>>> descriptor indicating copy progress, and some way to cancel the copy.
>>>> And that can internally rely on ->direct_splice(), with appropriate
>>>> algorithms for determining the optimal chunk size.
>>>
>>> And perhaps we don't. Perhaps we can provide this much simpler
>>> data-plane interface that works well enough for most everyone and can
>>> avoid going down the async rat hole, yet again.
>>
>> I think either buffering or async is needed to get good performance
>> without too much complexity in the app (which is not good). Buffering
>> works quite well for regular I/O, so maybe it's the way to go here as
>> well.
>>
>> Thanks,
>> Miklos
>>
>
> Buffering misses the whole point of the copy offload - the idea is *not* to
> read or write the actual data in the most interesting cases which offload
> the operation to a smart target device or file system.

I meant buffering the COPY, not the data. Doing the COPY
synchronously will always incur a performance penalty, the amount
depending on the latency, which can be significant with networking.

We think of write(2) as a synchronous interface, because that's the
appearance we get from all that hard work the page cache and delayed
writeback code does to make an asynchronous operation look as if it
was synchronous. So from a userspace API perspective a sync interface
is nice, but inside we almost always have async interfaces to do the
actual work.

Thanks,
Miklos


>
> Regards,
>
> Ric
>

2013-09-30 15:33:49

by Myklebust, Trond

[permalink] [raw]
Subject: RE: [RFC] extending splice for copy offloading

> -----Original Message-----
> From: Ric Wheeler [mailto:[email protected]]
> Sent: Monday, September 30, 2013 10:29 AM
> To: Miklos Szeredi
> Cc: J. Bruce Fields; Myklebust, Trond; Zach Brown; Anna Schumaker; Kernel
> Mailing List; Linux-Fsdevel; [email protected]; Schumaker, Bryan;
> Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong
> Subject: Re: [RFC] extending splice for copy offloading
>
> On 09/30/2013 10:24 AM, Miklos Szeredi wrote:
> > On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <[email protected]>
> wrote:
> >> On 09/30/2013 10:51 AM, Miklos Szeredi wrote:
> >>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields
> >>> <[email protected]>
> >>> wrote:
> >>>>> My other worry is about interruptibility/restartability.  Ideas?
> >>>>>
> >>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
> >>>>> Can the page cache copy be made restartable?   Or should splice() be
> >>>>> allowed to return a short count?  What happens on (non-reflink)
> >>>>> remote copies and huge request sizes?
> >>>> If I were writing an application that required copies to be
> >>>> restartable, I'd probably use the largest possible range in the
> >>>> reflink case but break the copy into smaller chunks in the splice case.
> >>>>
> >>> The app really doesn't want to care about that.  And it doesn't want
> >>> to care about restartability, etc..  It's something the *kernel* has
> >>> to care about.   You just can't have uninterruptible syscalls that
> >>> sleep for a "long" time, otherwise first you'll just have annoyed
> >>> users pressing ^C in vain; then, if the sleep is even longer,
> >>> warnings about task sleeping too long.
> >>>
> >>> One idea is letting splice() return a short count, and so the app
> >>> can safely issue SIZE_MAX requests and the kernel can decide if it
> >>> can copy the whole file in one go or if it wants to do it in smaller
> >>> chunks.
> >>>
> >> You cannot rely on a short count. That implies that an offloaded copy
> >> starts at byte 0 and the short count first bytes are all valid.
> > Huh?
> >
> > - app calls splice(from, 0, to, 0, SIZE_MAX)
> >   1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
> >      1.a) fs reflinks the whole file in a jiffy and returns the size of the file
> >      1 b) fs does copy offload of, say, 64MB and returns 64M
> >   2) VFS does page copy of, say, 1MB and returns 1MB
> > - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
> > ...
> >
> > The point is: the app is always doing the same (incrementing offset
> > with the return value from splice) and the kernel can decide what is
> > the best size it can service within a single uninterruptible syscall.
> >
> > Wouldn't that work?
> >
> > Thanks,
> > Miklos
>
> No.
>
> Keep in mind that the offload operation in (1) might fail partially. The target
> file (the copy) is allocated, the question is what ranges have valid data.
>
> I don't see that (2) is interesting or really needed to be done in the kernel.
> If nothing else, it tends to confuse the discussion....
>

Anna's figures, that were presented at Plumber's, show that (2) is
still worth doing on the _server_ for the case of NFS.

Cheers
  Trond

2013-09-30 12:20:31

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler <[email protected]> wrote:

>>> I don't see the safety argument very compelling either. There are real
>>> semantic differences, however: ENOSPC on a write to an
>>> (apparently) already allocated block. That could be a bit unexpected.
>>> Do we
>>> need a fallocate extension to deal with shared blocks?
>>
>> The above has been the case for all enterprise storage arrays ever since
>> the invention of snapshots. The NFSv4.2 spec does allow you to set a
>> per-file attribute that causes the storage server to always preallocate
>> enough buffers to guarantee that you can rewrite the entire file, however
>> the fact that we've lived without it for said 20 years leads me to believe
>> that demand for it is going to be limited. I haven't put it top of the list
>> of features we care to implement...
>>
>> Cheers,
>> Trond
>
>
> I agree - this has been common behaviour for a very long time in the array
> space. Even without an array, this is the same as overwriting a block in
> btrfs or any file system with a read-write LVM snapshot.

Okay, I'm convinced.

So I suggest

- mount(..., MNT_REFLINK): *allow* splice to reflink. If this is not
set, fall back to page cache copy.
- splice(... SPLICE_REFLINK): fail non-reflink copy. With this app
can force reflink.

Both are trivial to implement and make sure that no backward
incompatibility surprises happen.
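
From the app side that would look like the sketch below (the
SPLICE_REFLINK flag above, spelled SPLICE_F_REFLINK here to match the
existing flag namespace; the value and errno are illustrative only):

----
#define SPLICE_F_REFLINK (0x20)	/* proposed: fail non-reflink copy */

ssize_t n = splice(in_fd, &in_off, out_fd, &out_off, len,
		   SPLICE_F_DIRECT | SPLICE_F_REFLINK);
if (n == -1 && errno == EOPNOTSUPP) {
	/* no reflink support; app decides whether a data copy is OK */
	n = splice(in_fd, &in_off, out_fd, &out_off, len,
		   SPLICE_F_DIRECT);
}
----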

My other worry is about interruptibility/restartability. Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable? Or should splice() be
allowed to return a short count? What happens on (non-reflink) remote
copies and huge request sizes?

Thanks,
Miklos

2013-09-20 09:49:56

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Wed, Sep 11, 2013 at 7:06 PM, Zach Brown <[email protected]> wrote:
>
> When I first started on this stuff I followed the lead of previous
> work and added a new syscall for the copy operation:
>
> https://lkml.org/lkml/2013/5/14/618
>
> Towards the end of that thread Eric Wong asked why we didn't just
> extend splice. I immediately replied with some dumb dismissive
> answer. Once I sat down and looked at it, though, it does make a
> lot of sense. So good job, Eric. +10 Dummie points for me.
>
> Extending splice avoids all the noise of adding a new syscall and
> naturally falls back to buffered copying as that's what the direct
> splice path does for sendfile() today.

Nice idea.

>
> So that's what this patch series demonstrates. It adds a flag that
> lets splice get at the same direct splicing that sendfile() does.
> We then add a file system file_operations method to accelerate the
> copy which has access to both files.
>
> Some things to talk about:
> - I really don't care about the naming here. If you do, holler.
> - We might want different flags for file-to-file splicing and acceleration

Yes, I think "copy" and "reflink" needs to be differentiated.

> - We might want flags to require or forbid acceleration
> - We might want to provide all these flags to sendfile, too
>
> Thoughts? Objections?

Can a filesystem support "whole file copy" only? Or should arbitrary
block-to-block copy be mandatory?

Splice has a size_t argument for the size, which is limited to 4G on
32-bit. Won't this be an issue for whole-file copy? We could have a
special value (-1) for the whole file, but that's starting to be hackish.

We are talking about copying large amounts of data in a single
syscall, which will possibly take a long time. Will the syscall be
interruptible? Restartable?

Thanks,
Miklos

2013-09-30 17:49:02

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
>> It would be nice if there were a way for the file system to get a
>> hint that the target file is supposed to be a copy of another file. That
>> way distributed file systems could also create the target file with the
>> correct meta-information (same storage targets as the in-file has).
>> Well, if we cannot agree on that, a file system with a custom protocol at
>> least can detect a splice from 0 to SSIZE_MAX and then reset metadata. I'm
>> not sure if this would work for pNFS, though.
>
> splice() does not create new files. What you appear to be asking for
> lies way outside the scope of that system call interface.
>

Sorry I know, definitely outside the scope of splice, but in the context
of offloaded file copies. So the question is, what is the best way to
address/discuss that?

Thanks,
Bernd

2013-09-11 21:24:12

by Eric Wong

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

Zach Brown <[email protected]> wrote:
> Towards the end of that thread Eric Wong asked why we didn't just
> extend splice. I immediately replied with some dumb dismissive
> answer. Once I sat down and looked at it, though, it does make a
> lot of sense. So good job, Eric. +10 Dummie points for me.

Thanks for revisiting that :>

> Some things to talk about:
> - I really don't care about the naming here. If you do, holler.

Exposing "DIRECT" to userspace now might confuse users into expecting
O_DIRECT behavior. I say this as an easily-confused user.

In the future, perhaps O_DIRECT behavior can become per-splice (instead
of just per-open) and can save SPLICE_F_DIRECT for that.

> - We might want different flags for file-to-file splicing and acceleration
> - We might want flags to require or forbid acceleration

> - We might want to provide all these flags to sendfile, too

Another syscall? I prefer not. Better to just maintain the sendfile
API as-is for compatibility reasons and nudge users towards splice.

> Thoughts? Objections?

I'll try to test/comment more in a week or two (not much time for
computing until then).

2013-09-28 21:21:40

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/28/2013 11:20 AM, Myklebust, Trond wrote:
>> -----Original Message-----
>> From: Miklos Szeredi [mailto:[email protected]]
>> Sent: Saturday, September 28, 2013 12:50 AM
>> To: Zach Brown
>> Cc: J. Bruce Fields; Ric Wheeler; Anna Schumaker; Kernel Mailing List; Linux-
>> Fsdevel; [email protected]; Myklebust, Trond; Schumaker, Bryan;
>> Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong
>> Subject: Re: [RFC] extending splice for copy offloading
>>
>> On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown <[email protected]> wrote:
>>>> Also, I don't get the first option above at all. The argument is
>>>> that it's safer to have more copies? How much safety does another
>>>> copy on the same disk really give you? Do systems that do dedup
>>>> provide interfaces to turn it off per-file?
>> I don't see the safety argument very compelling either. There are real
>> semantic differences, however: ENOSPC on a write to an
>> (apparently) already allocated block. That could be a bit unexpected. Do we
>> need a fallocate extension to deal with shared blocks?
> The above has been the case for all enterprise storage arrays ever since the invention of snapshots. The NFSv4.2 spec does allow you to set a per-file attribute that causes the storage server to always preallocate enough buffers to guarantee that you can rewrite the entire file, however the fact that we've lived without it for said 20 years leads me to believe that demand for it is going to be limited. I haven't put it top of the list of features we care to implement...
>
> Cheers,
> Trond

I agree - this has been common behaviour for a very long time in the array
space. Even without an array, this is the same as overwriting a block in btrfs
or any file system with a read-write LVM snapshot.

Regards,

Ric


2013-09-30 15:49:36

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 10:46 AM, Miklos Szeredi wrote:
> On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler <[email protected]> wrote:
>> The way the array-based offload (and some software-side reflink) works is
>> not a byte-by-byte copy. We cannot assume that a valid count can be returned
>> or that such a count would be an indication of a sequential segment of good
>> data. The whole thing would normally have to be reissued.
>>
>> To make that a true assumption, you would have to mandate that in each of
>> the specifications (and sw targets)...
> You're missing my point.
>
> - user issues SIZE_MAX splice request
> - fs issues *64M* (or whatever) request to offload
> - when that completes *fully* then we return 64M to userspace
> - if it completes partially, then we return an error to userspace
>
> Again, wouldn't that work?
>
> Thanks,
> Miklos

Yes, if you send a copy offload command and it works, you can assume that it
worked fully. It would be pretty interesting if that were not true :)

If it fails, we cannot assume anything about partial completion.

Ric


2013-09-30 14:34:38

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Mon, Sep 30, 2013 at 02:20:30PM +0200, Miklos Szeredi wrote:
> On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler <[email protected]> wrote:
>
> >>> I don't see the safety argument very compelling either. There are real
> >>> semantic differences, however: ENOSPC on a write to an
> >>> (apparently) already allocated block. That could be a bit unexpected.
> >>> Do we
> >>> need a fallocate extension to deal with shared blocks?
> >>
> >> The above has been the case for all enterprise storage arrays ever since
> >> the invention of snapshots. The NFSv4.2 spec does allow you to set a
> >> per-file attribute that causes the storage server to always preallocate
> >> enough buffers to guarantee that you can rewrite the entire file, however
> >> the fact that we've lived without it for said 20 years leads me to believe
> >> that demand for it is going to be limited. I haven't put it top of the list
> >> of features we care to implement...
> >>
> >> Cheers,
> >> Trond
> >
> >
> > I agree - this has been common behaviour for a very long time in the array
> > space. Even without an array, this is the same as overwriting a block in
> > btrfs or any file system with a read-write LVM snapshot.
>
> Okay, I'm convinced.
>
> So I suggest
>
> - mount(..., MNT_REFLINK): *allow* splice to reflink. If this is not
> set, fall back to page cache copy.
> - splice(... SPLICE_REFLINK): fail non-reflink copy. With this app
> can force reflink.
>
> Both are trivial to implement and make sure that no backward
> incompatibility surprises happen.
>
> My other worry is about interruptibility/restartability. Ideas?
>
> What happens on splice(from, to, 4G) and it's a non-reflink copy?
> Can the page cache copy be made restartable? Or should splice() be
> allowed to return a short count? What happens on (non-reflink) remote
> copies and huge request sizes?

If I were writing an application that required copies to be restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.

For that reason I don't like the idea of a mount option--the choice is
something that the application probably wants to make (or at least to
know about).

The NFS COPY operation, as specified in current drafts, allows for
asynchronous copies but leaves the state of the file undefined in the
case of an aborted COPY. I worry that agreeing on standard behavior in
the case of an abort might be difficult.

--b.

2013-09-11 17:12:07

by Zach Brown

[permalink] [raw]
Subject: [PATCH 1/3] splice: add DIRECT flag for splicing between files

sendfile() is implemented by performing an internal "direct" splice
between two regular files. A per-task pipe buffer is allocated to
splice between the reads from the source page cache and writes to the
destination file page cache.

This patch lets userspace perform these direct splices with sys_splice()
by setting the SPLICE_F_DIRECT flag. This provides a single syscall for
copying a region between files without either having to store the
destination offset in the descriptor for sendfile or having to use
multiple splicing syscalls to and from a pipe.

Providing both files to the method lets the file system lock both for
the duration of the copy, should it need to. If the method refuses to
accelerate the copy, for whatever reason, we can naturally fall back to
the generic direct splice method that sendfile uses today.
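
A file-to-file copy then becomes a single call (a sketch against this
patch; with NULL offsets the file positions are used and updated, as in
the code below):

----
/* copy len bytes from src to dst at their current file positions */
ssize_t n = splice(src_fd, NULL, dst_fd, NULL, len, SPLICE_F_DIRECT);
----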

Signed-off-by: Zach Brown <[email protected]>
---
fs/splice.c | 38 ++++++++++++++++++++++++++++++++++++--
include/linux/splice.h | 1 +
2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3b7ee65..c0f4e27 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1347,7 +1347,7 @@ static long do_splice(struct file *in, loff_t __user *off_in,
}

if (ipipe) {
- if (off_in)
+ if (off_in || (flags & SPLICE_F_DIRECT))
return -ESPIPE;
if (off_out) {
if (!(out->f_mode & FMODE_PWRITE))
@@ -1381,7 +1381,7 @@ static long do_splice(struct file *in, loff_t __user *off_in,
}

if (opipe) {
- if (off_out)
+ if (off_out || (flags & SPLICE_F_DIRECT))
return -ESPIPE;
if (off_in) {
if (!(in->f_mode & FMODE_PREAD))
@@ -1402,6 +1402,40 @@ static long do_splice(struct file *in, loff_t __user *off_in,
return ret;
}

+ if (flags & SPLICE_F_DIRECT) {
+ loff_t out_pos;
+
+ if (off_in) {
+ if (!(in->f_mode & FMODE_PREAD))
+ return -EINVAL;
+ if (copy_from_user(&offset, off_in, sizeof(loff_t)))
+ return -EFAULT;
+ } else
+ offset = in->f_pos;
+
+ if (off_out) {
+ if (!(out->f_mode & FMODE_PWRITE))
+ return -EINVAL;
+ if (copy_from_user(&out_pos, off_out, sizeof(loff_t)))
+ return -EFAULT;
+ } else
+ out_pos = out->f_pos;
+
+ ret = do_splice_direct(in, &offset, out, &out_pos, len, flags);
+
+ if (!off_in)
+ in->f_pos = offset;
+ else if (copy_to_user(off_in, &offset, sizeof(loff_t)))
+ ret = -EFAULT;
+
+ if (!off_out)
+ out->f_pos = out_pos;
+ else if (copy_to_user(off_out, &out_pos, sizeof(loff_t)))
+ ret = -EFAULT;
+
+ return ret;
+ }
+
return -EINVAL;
}

diff --git a/include/linux/splice.h b/include/linux/splice.h
index 74575cb..e1aa3ad 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -19,6 +19,7 @@
/* from/to, of course */
#define SPLICE_F_MORE (0x04) /* expect more data */
#define SPLICE_F_GIFT (0x08) /* pages passed in are a gift */
+#define SPLICE_F_DIRECT (0x10) /* neither splice fd is a pipe */

/*
* Passed to the actors
--
1.7.11.7


2013-09-30 15:29:21

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 10:24 AM, Miklos Szeredi wrote:
> On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <[email protected]> wrote:
>> On 09/30/2013 10:51 AM, Miklos Szeredi wrote:
>>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <[email protected]>
>>> wrote:
>>>>> My other worry is about interruptibility/restartability. Ideas?
>>>>>
>>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>>>>> Can the page cache copy be made restartable? Or should splice() be
>>>>> allowed to return a short count? What happens on (non-reflink) remote
>>>>> copies and huge request sizes?
>>>> If I were writing an application that required copies to be restartable,
>>>> I'd probably use the largest possible range in the reflink case but
>>>> break the copy into smaller chunks in the splice case.
>>>>
>>> The app really doesn't want to care about that. And it doesn't want
>>> to care about restartability, etc.. It's something the *kernel* has
>>> to care about. You just can't have uninterruptible syscalls that
>>> sleep for a "long" time, otherwise first you'll just have annoyed
>>> users pressing ^C in vain; then, if the sleep is even longer, warnings
>>> about task sleeping too long.
>>>
>>> One idea is letting splice() return a short count, and so the app can
>>> safely issue SIZE_MAX requests and the kernel can decide if it can
>>> copy the whole file in one go or if it wants to do it in smaller
>>> chunks.
>>>
>> You cannot rely on a short count. That implies that an offloaded copy starts
>> at byte 0 and the short count first bytes are all valid.
> Huh?
>
> - app calls splice(from, 0, to, 0, SIZE_MAX)
> 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
> 1.a) fs reflinks the whole file in a jiffy and returns the size of the file
> 1 b) fs does copy offload of, say, 64MB and returns 64M
> 2) VFS does page copy of, say, 1MB and returns 1MB
> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
> ...
>
> The point is: the app is always doing the same (incrementing offset
> with the return value from splice) and the kernel can decide what is
> the best size it can service within a single uninterruptible syscall.
>
> Wouldn't that work?
>
> Thanks,
> Miklos

No.

Keep in mind that the offload operation in (1) might fail partially. The target
file (the copy) is allocated, the question is what ranges have valid data.

I don't see that (2) is interesting or really needed to be done in the kernel.
If nothing else, it tends to confuse the discussion....

ric


2013-09-25 19:55:31

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Wed, Sep 25, 2013 at 12:06:20PM -0700, Zach Brown wrote:
> On Wed, Sep 25, 2013 at 03:02:29PM -0400, Anna Schumaker wrote:
> > On Wed, Sep 25, 2013 at 2:38 PM, Zach Brown <[email protected]> wrote:
> > >
> > > Hrmph. I had composed a reply to you during Plumbers but.. something
> > > happened to it :). Here's another try now that I'm back.
> > >
> > >> > Some things to talk about:
> > >> > - I really don't care about the naming here. If you do, holler.
> > >> > - We might want different flags for file-to-file splicing and acceleration
> > >>
> > >> Yes, I think "copy" and "reflink" needs to be differentiated.
> > >
> > > I initially agreed but I'm not so sure now. The problem is that we
> > > can't know whether the acceleration is copying or not. XCOPY on some
> > > array may well do some shared referencing tricks. The nfs COPY op can
> > > have a server use btrfs reflink, or ext* and XCOPY, or .. who knows. At
> > > some point we have to admit that we have no way to determine the
> > > relative durability of writes. Storage can do a lot to make writes more
> > > or less fragile in ways that we have no visibility of. SSD FTLs can log a bunch
> > > of unrelated sectors on to one flash failure domain.
> > >
> > > And if such a flag couldn't *actually* guarantee anything for a bunch of
> > > storage topologies, well, let's not bother with it.
> > >
> > > The only flag I'm in favour of now is one that has splice return rather
> > > than falling back to manual page cache reads and writes. It's more like
> > > O_NONBLOCK than any kind of data durability hint.
> >
> > For reference, I'm planning to have the NFS server do the fallback
> > when it copies since any local copy will be faster than a read and
> > write over the network.
>
> Agreed, this is definitely the reasonable thing to do.

A client-side copy will be slower, but I guess it does have the
advantage that the application can track progress to some degree, and
abort it fairly quickly without leaving the file in a totally undefined
state--and both might be useful if the copy's not a simple constant-time
operation.

So maybe a way to pass your NONBLOCKy flag to the server would be
useful?

FWIW the protocol doesn't seem frozen yet, so I assume we could still
add an extra flag field if you think it would be worthwhile.

--b.

2013-09-17 04:42:48

by Rob Landley

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/11/2013 04:17:23 PM, Eric Wong wrote:
> Zach Brown <[email protected]> wrote:
> > Towards the end of that thread Eric Wong asked why we didn't just
> > extend splice. I immediately replied with some dumb dismissive
> > answer. Once I sat down and looked at it, though, it does make a
> > lot of sense. So good job, Eric. +10 Dummie points for me.
>
> Thanks for revisiting that :>
>
> > Some things to talk about:
> > - I really don't care about the naming here. If you do, holler.
>
> Exposing "DIRECT" to userspace now might confuse users into expecting
> O_DIRECT behavior. I say this as an easily-confused user.
>
> In the future, perhaps O_DIRECT behavior can become per-splice
> (instead
> of just per-open) and can save SPLICE_F_DIRECT for that.
>
> > - We might want different flags for file-to-file splicing and
> acceleration
> > - We might want flags to require or forbid acceleration
>
> > - We might want to provide all these flags to sendfile, too
>
> Another syscall? I prefer not. Better to just maintain the sendfile
> API as-is for compatibility reasons and nudge users towards splice.
>
> > Thoughts? Objections?
>
> I'll try to test/comment more in a week or two (not much time for
> computing until then).

Just a vague note that I've wanted to use splice to implement cp and
patch and cat and so on in toybox, but couldn't because it needs a pipe.

So I'm quite interested in moves to lift this restriction...

Rob

2013-09-19 13:09:12

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Wed, 11 Sep 2013 21:17:23 +0000
Eric Wong <[email protected]> wrote:

> Zach Brown <[email protected]> wrote:
> > Towards the end of that thread Eric Wong asked why we didn't just
> > extend splice. I immediately replied with some dumb dismissive
> > answer. Once I sat down and looked at it, though, it does make a
> > lot of sense. So good job, Eric. +10 Dummie points for me.
>
> Thanks for revisiting that :>
>
> > Some things to talk about:
> > - I really don't care about the naming here. If you do, holler.
>
> Exposing "DIRECT" to userspace now might confuse users into expecting
> O_DIRECT behavior. I say this as an easily-confused user.
>
> In the future, perhaps O_DIRECT behavior can become per-splice (instead
> of just per-open) and can save SPLICE_F_DIRECT for that.
>
> > - We might want different flags for file-to-file splicing and acceleration
> > - We might want flags to require or forbid acceleration
>

Do we need new flags at all? If both fds refer to files, then perhaps
we can just take it that SPLICE_F_DIRECT behavior is implied?

I'd probably suggest that we not add any more flags than are necessary
until use-cases for them become clear.

> > - We might want to provide all these flags to sendfile, too
>
> Another syscall? I prefer not. Better to just maintain the sendfile
> API as-is for compatibility reasons and nudge users towards splice.
>

Agreed.

> > Thoughts? Objections?
>
> I'll try to test/comment more in a week or two (not much time for
> computing until then).

On the whole, the concept looks sound.

I'll note too that by simply lifting the restriction that one of the
fd's to splice must always be a pipe, that may also give us a relatively
simple way to add recvfile() as well, even if only as a macro wrapper
around splice(). That's been a long sought-after feature of the samba
developers...

Just allow userland to do a splice straight from a socket fd to a file.
We may end up having to copy data if the alignment isn't right, but it'd
still be valuable to do that directly in the kernel in a single syscall.
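
For instance (a sketch of the macro-wrapper idea; recvfile() does not
exist, and this assumes the lifted pipe restriction lets splice take a
socket as the source):

----
static inline ssize_t recvfile(int out_fd, int sock_fd, loff_t *off,
			       size_t count)
{
	/* socket in, file out, one syscall */
	return splice(sock_fd, NULL, out_fd, off, count, SPLICE_F_DIRECT);
}
----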

--
Jeff Layton <[email protected]>

2013-09-30 18:49:48

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 08:02 PM, Myklebust, Trond wrote:
> On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
>> On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
>>> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
>>>> It would be nice if there were a way for the file system to get a
>>>> hint that the target file is supposed to be a copy of another file. That
>>>> way distributed file systems could also create the target file with the
>>>> correct meta-information (same storage targets as the in-file has).
>>>> Well, if we cannot agree on that, a file system with a custom protocol at
>>>> least can detect a splice from 0 to SSIZE_MAX and then reset metadata. I'm
>>>> not sure if this would work for pNFS, though.
>>>
>>> splice() does not create new files. What you appear to be asking for
>>> lies way outside the scope of that system call interface.
>>>
>>
>> Sorry I know, definitely outside the scope of splice, but in the context
>> of offloaded file copies. So the question is, what is the best way to
>> address/discuss that?
>
> Why does it need to be addressed in the first place?

An offloaded copy is still not efficient if different storage
servers/targets used by from-file and to-file.

>
> What is preventing an application from retrieving and setting this
> information using standard libc functions such as fstat()+open(), and
> supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
> where appropriate?
>

At a minimum this requires network and metadata overhead. And while I'm
working on FhGFS now, I still wonder what other file systems need to do -
for example Lustre pre-allocates storage-target files on creating a
file, so file layout changes mean even more overhead there.
Anyway, if we could agree on using libattr or libacl to teach the file
system about the upcoming splice call I would be fine. Metadata overhead
is probably negligible for large files.




Thanks,
Bernd


2013-09-26 08:58:07

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <[email protected]> wrote:
>> A client-side copy will be slower, but I guess it does have the
>> advantage that the application can track progress to some degree, and
>> abort it fairly quickly without leaving the file in a totally undefined
>> state--and both might be useful if the copy's not a simple constant-time
>> operation.
>
> I suppose, but can't the app achieve a nice middle ground by copying the
> file in smaller syscalls? Avoid bulk data motion back to the client,
> but still get notification every, I dunno, few hundred meg?

Yes. And if "cp" could just be switched from a read+write syscall
pair to a single splice syscall using the same buffer size. And then
the user would only notice that things got faster in case of server
side copy. No problems with long blocking times (at least not much
worse than it was).
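
I.e. the inner loop would barely change (a sketch, assuming the
SPLICE_F_DIRECT flag from Zach's series):

----
/* was: n = read(in_fd, buf, BUFSIZE); write(out_fd, buf, n); */
ssize_t n;

while ((n = splice(in_fd, NULL, out_fd, NULL, BUFSIZE,
		   SPLICE_F_DIRECT)) > 0)
	;	/* file positions advance just as with read+write */
----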

However "cp" doesn't do reflinking by default, it has a switch for
that. If we just want "cp" and the like to use splice without fearing
side effects then by default we should try to be as close to
read+write behavior as possible. No? That's what I'm really
worrying about when you want to wire up splice to reflink by default.
I do think there should be a flag for that. And if on the block level
some magic happens, so be it. It's not the fs developer's worry any
more ;)

Thanks,
Miklos

2013-09-30 20:09:25

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/30/2013 04:00 PM, Bernd Schubert wrote:
> pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own
> interface? And userspace needs to address all of them differently?

The NFS and SCSI groups have each defined a standard which Zach's proposal
abstracts into a common user API.

Distributed file systems tend to be rather unique and do not have similar
standards bodies, but a lot of them could hide server-specific implementations
under the currently proposed interfaces.

What is not a good idea is to drag out the core, simple copy offload discussion
for another 5 years to pull in every odd use case :)

ric


2013-09-26 18:55:54

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <[email protected]> wrote:
> >> A client-side copy will be slower, but I guess it does have the
> >> advantage that the application can track progress to some degree, and
> >> abort it fairly quickly without leaving the file in a totally undefined
> >> state--and both might be useful if the copy's not a simple constant-time
> >> operation.
> >
> > I suppose, but can't the app achieve a nice middle ground by copying the
> > file in smaller syscalls? Avoid bulk data motion back to the client,
> > but still get notification every, I dunno, few hundred meg?
>
> Yes. And if "cp" could just be switched from a read+write syscall
> pair to a single splice syscall using the same buffer size. And then
> the user would only notice that things got faster in case of server
> side copy. No problems with long blocking times (at least not much
> worse than it was).

Hmm, yes, that would be a nice outcome.

> However "cp" doesn't do reflinking by default, it has a switch for
> that. If we just want "cp" and the like to use splice without fearing
> side effects then by default we should try to be as close to
> read+write behavior as possible. No?

I guess? I don't find requiring --reflink hugely compelling. But there
it is.

> That's what I'm really
> worried about when you want to wire up splice to reflink by default.
> I do think there should be a flag for that. And if some magic happens
> on the block level, so be it. It's not the fs developer's worry any
> more ;)

Sure. So we'd have:

- no flag default that forbids knowingly copying with shared references
so that it will be used by default by people who feel strongly about
their assumptions about independent write durability.

- a flag that allows shared references for people who would otherwise
use the file system shared reference ioctls (ocfs2 reflink, btrfs
clone) but would like it to also do server-side read/write copies
over nfs without additional intervention.

- a flag that requires shared references for callers who don't want
giant copies to take forever if they aren't instant. (The qemu guys
asked for this at Plumbers.)

I think I can live with that.
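
To make that concrete, here is a sketch of how the three behaviours might
look to a caller -- every flag name below is invented for illustration
(continuing after SPLICE_F_GIFT, 0x08), nothing is settled:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* Hypothetical flag values -- names invented for illustration only. */
	#define SPLICE_F_COPY         0x10	/* direct file-to-file copy */
	#define SPLICE_F_REFLINK      0x20	/* shared references allowed */
	#define SPLICE_F_REQ_REFLINK  0x40	/* fail unless refs can be shared */

	static ssize_t copy_chunk(int src, loff_t *soff, int dst, loff_t *doff,
				  size_t len, int allow_shared, int require_shared)
	{
		unsigned int flags = SPLICE_F_COPY;	/* default: no shared refs */

		if (allow_shared)
			flags |= SPLICE_F_REFLINK;	/* reflink/server-side ok */
		if (require_shared)
			flags |= SPLICE_F_REQ_REFLINK;	/* qemu-style: instant or fail */

		return splice(src, soff, dst, doff, len, flags);
	}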

- z

2013-09-25 19:02:51

by Anna Schumaker

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Wed, Sep 25, 2013 at 2:38 PM, Zach Brown <[email protected]> wrote:
>
> Hrmph. I had composed a reply to you during Plumbers but.. something
> happened to it :). Here's another try now that I'm back.
>
>> > Some things to talk about:
>> > - I really don't care about the naming here. If you do, holler.
>> > - We might want different flags for file-to-file splicing and acceleration
>>
>> Yes, I think "copy" and "reflink" needs to be differentiated.
>
> I initially agreed but I'm not so sure now. The problem is that we
> can't know whether the acceleration is copying or not. XCOPY on some
> array may well do some shared referencing tricks. The nfs COPY op can
> have a server use btrfs reflink, or ext* and XCOPY, or .. who knows. At
> some point we have to admit that we have no way to determine the
> relative durability of writes. Storage can do a lot that we have no
> visibility of to make writes more or less fragile. SSD FTLs can log a
> bunch of unrelated sectors onto one flash failure domain.
>
> And if such a flag couldn't *actually* guarantee anything for a bunch of
> storage topologies, well, let's not bother with it.
>
> The only flag I'm in favour of now is one that has splice return rather
> than falling back to manual page cache reads and writes. It's more like
> O_NONBLOCK than any kind of data durability hint.

For reference, I'm planning to have the NFS server do the fallback
when it copies since any local copy will be faster than a read and
write over the network.

Anna

>
>> > - We might want flags to require or forbid acceleration
>> > - We might want to provide all these flags to sendfile, too
>> >
>> > Thoughts? Objections?
>>
>> Can a filesystem support "whole file copy" only? Or should arbitrary
>> block-to-block copy be mandatory?
>
> I'm not sure I understand what you're asking. The interface specifies
> byte ranges. File systems can return errors if they can't accelerate
> the copy. We *can't* mandate copy acceleration granularity as some
> formats and protocols just can't do it. splice() will fall back to
> doing buffered copies when the file system returns an error.
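
(As a kernel-side sketch of that contract: the method name, signature, and
alignment rule below are all invented here; only the convention that the fs
rejects what it can't accelerate and splice() then falls back to a buffered
copy comes from the thread.)

	#include <linux/fs.h>
	#include <linux/errno.h>

	#define EXAMPLE_BLOCK_SIZE 4096	/* made-up offload granularity */

	/* hypothetical fs helper that actually issues the offload */
	extern ssize_t example_offload_copy(struct file *in, loff_t in_pos,
					    struct file *out, loff_t out_pos,
					    size_t len);

	static ssize_t example_splice_direct(struct file *in, loff_t in_pos,
					     struct file *out, loff_t out_pos,
					     size_t len)
	{
		/* pretend this format can only offload block-aligned ranges */
		if ((in_pos | out_pos | (loff_t)len) & (EXAMPLE_BLOCK_SIZE - 1))
			return -EOPNOTSUPP;	/* VFS then does a buffered copy */

		return example_offload_copy(in, in_pos, out, out_pos, len);
	}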
>
>> Splice has a size_t argument for the size, which is limited to 4G on 32
>> bit. Won't this be an issue for whole-file copy? We could have a
>> special value (-1) for the whole file, but that's starting to be hackish.
>
> It will be an issue, yeah. Just like it is with write() today. I think
> it's reasonable to start with a simple interface that matches current IO
> syscalls. I won't implement a special whole-file value, no.
>
> And it's not just the 32-bit size_t. While do_splice_direct() doesn't use
> the truncated length that's returned from rw_verify_area(), it then
> silently truncates the lengths to unsigned int in the splice_desc struct
> fields. It seems like we might want to address that :/.
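
(For reference, the narrowing is visible right in the struct -- roughly as
it looked in 3.x kernels, abridged:)

	/* include/linux/splice.h, 3.x era (abridged) */
	struct splice_desc {
		size_t total_len;	/* remaining length */
		unsigned int len;	/* current length: narrowed to an int */
		unsigned int flags;	/* splice flags */
		/* ... remaining members elided ... */
	};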
>
>> We are talking about copying large amounts of data in a single
>> syscall, which will possibly take a long time. Will the syscall be
>> interruptible? Restartable?
>
> In as much as file systems let it be, yeah. As ever, you're not going
> to have a lot of luck interrupting a process stuck in lock_page(),
> mutex_lock(), wait_on_page_writeback(), etc. Though you did remind me
> to investigate restarting. Thanks.
>
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-09-26 21:27:52

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/26/2013 02:55 PM, Zach Brown wrote:
> On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
>> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <[email protected]> wrote:
>>>> A client-side copy will be slower, but I guess it does have the
>>>> advantage that the application can track progress to some degree, and
>>>> abort it fairly quickly without leaving the file in a totally undefined
>>>> state--and both might be useful if the copy's not a simple constant-time
>>>> operation.
>>> I suppose, but can't the app achieve a nice middle ground by copying the
>>> file in smaller syscalls? Avoid bulk data motion back to the client,
>>> but still get notification every, I dunno, few hundred meg?
>> Yes. And if "cp" could just be switched from a read+write syscall
>> pair to a single splice syscall using the same buffer size. And then
>> the user would only notice that things got faster in case of server
>> side copy. No problems with long blocking times (at least not much
>> worse than it was).
> Hmm, yes, that would be a nice outcome.
>
>> However "cp" doesn't do reflinking by default, it has a switch for
>> that. If we just want "cp" and the like to use splice without fearing
>> side effects then by default we should try to be as close to
>> read+write behavior as possible. No?
> I guess? I don't find requiring --reflink hugely compelling. But there
> it is.
>
>> That's what I'm really
>> worried about when you want to wire up splice to reflink by default.
>> I do think there should be a flag for that. And if some magic happens
>> on the block level, so be it. It's not the fs developer's worry any
>> more ;)
> Sure. So we'd have:
>
> - no flag default that forbids knowingly copying with shared references
> so that it will be used by default by people who feel strongly about
> their assumptions about independent write durability.
>
> - a flag that allows shared references for people who would otherwise
> use the file system shared reference ioctls (ocfs2 reflink, btrfs
> clone) but would like it to also do server-side read/write copies
> over nfs without additional intervention.
>
> - a flag that requires shared references for callers who don't want
> giant copies to take forever if they aren't instant. (The qemu guys
> asked for this at Plumbers.)
>
> I think I can live with that.
>
> - z

This last flag should not prevent a remote target device (NFS or SCSI array)
copy from working, though, since they often do reflink-like operations inside
the remote target device....

ric



2013-10-02 12:58:41

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Tue 01-10-13 12:58:17, Zach Brown wrote:
> > - app calls splice(from, 0, to, 0, SIZE_MAX)
> > 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
> > 1.a) fs reflinks the whole file in a jiffy and returns the size of the file
> > 1.b) fs does copy offload of, say, 64MB and returns 64MB
> > 2) VFS does page copy of, say, 1MB and returns 1MB
> > - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
>
> (It's not SIZE_MAX. It's MAX_RW_COUNT. INT_MAX with some
> PAGE_CACHE_SIZE rounding noise. For fear of weird corners of fs code
> paths that still use int, one assumes.)
>
> > The point is: the app is always doing the same (incrementing offset
> > with the return value from splice) and the kernel can decide what is
> > the best size it can service within a single uninterruptible syscall.
> >
> > Wouldn't that work?
>
> It seems like it should, if people are willing to allow splice() to
> return partial counts. Quite a lot of IO syscalls technically do return
> partial counts today if you try to write > MAX_RW_COUNT :).
Yes. Also, POSIX says that applications must handle that case for read &
write. But in practice programmers are lazy.

> But returning partial counts on the order of a handful of megs that the
> file systems make up as the point of diminishing returns is another
> thing entirely. I can imagine people being anxious about that.
>
> I guess we'll find out!
Return 4 KB once in a while to screw up buggy applications from the
start :-p
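
A caller that copes with arbitrary partial counts needs the usual
write()-style loop anyway -- a minimal userspace sketch, assuming the
proposed file-to-file splice that takes offsets for both files (today's
splice(2) requires a pipe on one side):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>
	#include <errno.h>

	/* Copy len bytes, tolerating whatever each call returns: a whole-file
	 * reflink, a 64MB offload chunk, 1MB of page copy, or a spiteful 4KB. */
	static int copy_range(int src, int dst, loff_t off, size_t len)
	{
		while (len > 0) {
			loff_t in = off, out = off;
			ssize_t n = splice(src, &in, dst, &out, len, 0);

			if (n < 0) {
				if (errno == EINTR)
					continue;
				return -1;	/* caller may fall back to read+write */
			}
			if (n == 0)
				break;		/* EOF on the source */
			off += n;
			len -= n;
		}
		return 0;
	}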

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-10-02 13:32:48

by David Lang

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Wed, 2 Oct 2013, Jan Kara wrote:

> On Tue 01-10-13 12:58:17, Zach Brown wrote:
>>> - app calls splice(from, 0, to, 0, SIZE_MAX)
>>> 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
>>> 1.a) fs reflinks the whole file in a jiffy and returns the size of the file
>>> 1.b) fs does copy offload of, say, 64MB and returns 64MB
>>> 2) VFS does page copy of, say, 1MB and returns 1MB
>>> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
>>
>> (It's not SIZE_MAX. It's MAX_RW_COUNT. INT_MAX with some
>> PAGE_CACHE_SIZE rounding noise. For fear of weird corners of fs code
>> paths that still use int, one assumes.)
>>
>>> The point is: the app is always doing the same (incrementing offset
>>> with the return value from splice) and the kernel can decide what is
>>> the best size it can service within a single uninterruptible syscall.
>>>
>>> Wouldn't that work?
>>
>> It seems like it should, if people are willing to allow splice() to
>> return partial counts. Quite a lot of IO syscalls technically do return
>> partial counts today if you try to write > MAX_RW_COUNT :).
> Yes. Also, POSIX says that applications must handle that case for read &
> write. But in practice programmers are lazy.
>
>> But returning partial counts on the order of a handful of megs that the
>> file systems make up as the point of diminishing returns is another
>> thing entirely. I can imagine people being anxious about that.
>>
>> I guess we'll find out!
> Return 4 KB once in a while to screw up buggy applications from the
> start :-p

or at least have a debugging option early on that does this so people can use it
to find such buggy apps.

David Lang

2013-10-06 08:42:38

by Rob Landley

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 09/26/2013 01:06:41 PM, Miklos Szeredi wrote:
> On Thu, Sep 26, 2013 at 5:34 PM, J. Bruce Fields <[email protected]> wrote:
> > On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
> >> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <[email protected]> wrote:
> >> >> A client-side copy will be slower, but I guess it does have the
> >> >> advantage that the application can track progress to some degree, and
> >> >> abort it fairly quickly without leaving the file in a totally undefined
> >> >> state--and both might be useful if the copy's not a simple constant-time
> >> >> operation.
> >> >
> >> > I suppose, but can't the app achieve a nice middle ground by copying the
> >> > file in smaller syscalls? Avoid bulk data motion back to the client,
> >> > but still get notification every, I dunno, few hundred meg?
> >>
> >> Yes. And "cp" could just be switched from a read+write syscall
> >> pair to a single splice syscall using the same buffer size.
> >
> > Will the various magic fs-specific copy operations become inefficient
> > when the range copied is too small?
>
> We could treat splice-copy operations just like write operations (they can
> be buffered, coalesced, synced).
>
> But I'm not sure it's worth the effort; 99% of the use of this
> interface will be copying whole files.

My "patch" implementation (in busybox and toybox) hits a point where it
wants to copy the rest of the file, once there are no more hunks to
apply. This is not copying a whole file. A similar thing happens with
tail when you use the +N syntax to skip start instead of end lines. I
can see sed doing a similar thing when told to operate on line ranges...

Note sure your 99% holds up here.

Rob

2013-10-01 18:42:16

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Mon, Sep 30, 2013 at 05:46:38PM +0200, Miklos Szeredi wrote:
> On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler <[email protected]> wrote:
> > The way the array-based offload (and some software-side reflink) works is
> > not a byte-by-byte copy. We cannot assume that a valid count can be returned
> > or that such a count would be an indication of a sequential segment of good
> > data. The whole thing would normally have to be reissued.
> >
> > To make that a true assumption, you would have to mandate that in each of
> > the specifications (and sw targets)...
>
> You're missing my point.
>
> - user issues SIZE_MAX splice request
> - fs issues *64M* (or whatever) request to offload
> - when that completes *fully* then we return 64M to userspace
> - if it completes partially, then we return an error to userspace
>
> Again, wouldn't that work?

So if implementations fall into two categories:

- "instant": latency is on the order of a single IO.

- "slow": latency is seconds or minutes, but still faster than a
normal copy. (See Anna's NFS server implementation that does
an ordinary copy internally.)

Then to me it still seems simplest to design only for the "instant"
case.

But if we want to add some minimal help for the "slow" case then
Miklos's proposal looks fine: the application doesn't have to know which
case it's dealing with ahead of time--it always just submits the largest
range it knows about--but a "slow" implementation isn't forced to leave
the application waiting in one syscall for minutes with no indication
what's going on.

--b.

2013-10-01 19:58:56

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

> - app calls splice(from, 0, to, 0, SIZE_MAX)
> 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
> 1.a) fs reflinks the whole file in a jiffy and returns the size of the file
1.b) fs does copy offload of, say, 64MB and returns 64MB
> 2) VFS does page copy of, say, 1MB and returns 1MB
> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset

(It's not SIZE_MAX. It's MAX_RW_COUNT. INT_MAX with some
PAGE_CACHE_SIZE rounding noise. For fear of weird corners of fs code
paths that still use int, one assumes.)

> The point is: the app is always doing the same (incrementing offset
> with the return value from splice) and the kernel can decide what is
> the best size it can service within a single uninterruptible syscall.
>
> Wouldn't that work?

It seems like it should, if people are willing to allow splice() to
return partial counts. Quite a lot of IO syscalls technically do return
partial counts today if you try to write > MAX_RW_COUNT :).

But returning partial counts on the order of a handful of megs that the
file systems make up as the point of diminishing returns is another
thing entirely. I can imagine people being anxious about that.

I guess we'll find out!

- z

2013-12-18 12:41:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Wed, Sep 11, 2013 at 10:06:47AM -0700, Zach Brown wrote:
> When I first started on this stuff I followed the lead of previous
> work and added a new syscall for the copy operation:
>
> https://lkml.org/lkml/2013/5/14/618
>
> Towards the end of that thread Eric Wong asked why we didn't just
> extend splice. I immediately replied with some dumb dismissive
> answer. Once I sat down and looked at it, though, it does make a
> lot of sense. So good job, Eric. +10 Dummie points for me.
>
> Extending splice avoids all the noise of adding a new syscall and
> naturally falls back to buffered copying as that's what the direct
> splice path does for sendfile() today.

Given the convoluted mess that the splice code already is, I'd rather
not overload it even further.

Instead I'd first split out the sendfile code path, which already
behaves differently in practice, and then generalize it into a
copy-chunk syscall using the same code path.

We can still fall back to the splice code as a last resort if nothing
better is provided, but I think making the splice code handle even more
totally different cases is the wrong direction.
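
For illustration, such a copy-chunk syscall might have a shape like the
following -- a purely hypothetical prototype (the name and argument order
are invented here), mirroring the splice() argument types:

	/* hypothetical prototype only -- nothing in this thread defines it */
	ssize_t copy_range(int fd_in, loff_t *off_in,
			   int fd_out, loff_t *off_out,
			   size_t len, unsigned int flags);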


2013-12-18 17:11:56

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On Wed, Dec 18, 2013 at 04:41:26AM -0800, Christoph Hellwig wrote:
> On Wed, Sep 11, 2013 at 10:06:47AM -0700, Zach Brown wrote:
> > When I first started on this stuff I followed the lead of previous
> > work and added a new syscall for the copy operation:
> >
> > https://lkml.org/lkml/2013/5/14/618
> >
> > Towards the end of that thread Eric Wong asked why we didn't just
> > extend splice. I immediately replied with some dumb dismissive
> > answer. Once I sat down and looked at it, though, it does make a
> > lot of sense. So good job, Eric. +10 Dummie points for me.
> >
> > Extending splice avoids all the noise of adding a new syscall and
> > naturally falls back to buffered copying as that's what the direct
> > splice path does for sendfile() today.
>
> Given the convoluted mess that the splice code already is, I'd rather
> not overload it even further.

I agree after trying to weave the copy offloading API into the splice
interface. There are also weird cases that we haven't really discussed
so far (preserving unwritten allocations between the copied files?) that
would muddy the waters even further.

The further the APIs drift from each other, the more I prefer giving
copy offloading its own clean syscall, even if the argument types
superficially match the splice() ABI.

> We can still fall back to the splice code as a last resort if nothing
> better is provided, but I think making the splice code handle even more
> totally different cases is the wrong direction.

I'm with you. I'll have another version out sometime after the US
holiday break.. say in a few weeks?

- z

2013-12-18 17:26:21

by Anna Schumaker

[permalink] [raw]
Subject: Re: [RFC] extending splice for copy offloading

On 12/18/2013 12:10 PM, Zach Brown wrote:
> On Wed, Dec 18, 2013 at 04:41:26AM -0800, Christoph Hellwig wrote:
>> On Wed, Sep 11, 2013 at 10:06:47AM -0700, Zach Brown wrote:
>>> When I first started on this stuff I followed the lead of previous
>>> work and added a new syscall for the copy operation:
>>>
>>> https://lkml.org/lkml/2013/5/14/618
>>>
>>> Towards the end of that thread Eric Wong asked why we didn't just
>>> extend splice. I immediately replied with some dumb dismissive
>>> answer. Once I sat down and looked at it, though, it does make a
>>> lot of sense. So good job, Eric. +10 Dummie points for me.
>>>
>>> Extending splice avoids all the noise of adding a new syscall and
>>> naturally falls back to buffered copying as that's what the direct
>>> splice path does for sendfile() today.
>> Given the convoluted mess that the splice code already is, I'd rather
>> not overload it even further.
> I agree after trying to weave the copy offloading API into the splice
> interface. There are also weird cases that we haven't really discussed
> so far (preserving unwritten allocations between the copied files?) that
> would muddy the waters even further.
>
> The further the APIs drift from each other, the more I prefer giving
> copy offloading its own clean syscall, even if the argument types
> superficially match the splice() ABI.
>
>> We can still fall back to the splice code as a last resort if nothing
>> better is provided, but I think making the splice code handle even more
>> totally different cases is the wrong direction.
> I'm with you. I'll have another version out sometime after the US
> holiday break.. say in a few weeks?

That'll work for me; I'll update my NFS code once your new patches are out.

Anna

>
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html