2018-10-19 23:36:06

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 00/11] client-side support for "inter" SSC copy

From: Olga Kornievskaia <[email protected]>

This patch series adds client-side support for doing NFSv4.2 "inter"
copy offload between different NFS servers.

In case of the "inter" SSC copy files reside on different servers and
thus under different superblocks and require that VFS removes the
restriction that src and dst files must be on the same superblock.

NFS's copy_file_range() determines if the copy is "intra" or "inter"
and for "inter" it sends the COPY_NOTIFY to the source server. Then,
it would send of an asynchronous COPY to the destination server. If
an application cancels an in-flight COPY, OFFLOAD_CANCEL is sent to
both of the servers.

This patch series also include necessary client-side additions that
are performed by the destination server. The server needs an NFS
open that represents a source file without doing an actual open.
Two function nfs42_ssc_open/nfs42_ssc_close() are introduced to
accomplish it that make use of the VFS's alloc_file_pseudo() to
represent an open.

Also this particular open is marked (NFS_SVC_SSC_COPY_STATE) so
that if the destination server ever to receive stateid errors on
this stateid, it knows not to initiate state recovery (in case
when source server reboots). The recovery must be done by the
client and a new copy must be initiated. Therefore, in this case
the recovery needs to fail with EIO.

Anna Schumaker (1):
fs: Don't copy beyond the end of the file

Olga Kornievskaia (10):
VFS permit cross device vfs_copy_file_range
NFS test for intra vs inter COPY
NFS NFSD defining nl4_servers structure needed by both
NFS add COPY_NOTIFY operation
NFS add ca_source_server<> to COPY
NFS also send OFFLOAD_CANCEL to source server
NFS inter ssc open
NFS skip recovery of copy open on dest server
NFS for "inter" copy treat ESTALE as ENOTSUPP
NFS COPY handle ERR_OFFLOAD_DENIED

Documentation/filesystems/vfs.txt | 4 +-
fs/nfs/nfs42.h | 15 ++-
fs/nfs/nfs42proc.c | 129 ++++++++++++++++++++++---
fs/nfs/nfs42xdr.c | 193 +++++++++++++++++++++++++++++++++++++-
fs/nfs/nfs4_fs.h | 8 ++
fs/nfs/nfs4file.c | 123 +++++++++++++++++++++++-
fs/nfs/nfs4proc.c | 6 +-
fs/nfs/nfs4state.c | 14 +++
fs/nfs/nfs4xdr.c | 1 +
fs/read_write.c | 16 ++--
include/linux/nfs4.h | 25 +++++
include/linux/nfs_fs_sb.h | 1 +
include/linux/nfs_xdr.h | 17 ++++
13 files changed, 526 insertions(+), 26 deletions(-)

--
1.8.3.1


2018-10-19 23:36:13

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 06/11] NFS add ca_source_server<> to COPY

From: Olga Kornievskaia <[email protected]>

Support only one source server address: the same address that
the client and source server use.

Signed-off-by: Andy Adamson <[email protected]>
Signed-off-by: Olga Kornievskaia <[email protected]>
---
fs/nfs/nfs42.h | 3 ++-
fs/nfs/nfs42proc.c | 26 +++++++++++++++++---------
fs/nfs/nfs42xdr.c | 12 ++++++++++--
fs/nfs/nfs4file.c | 7 ++++++-
include/linux/nfs_xdr.h | 1 +
5 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/nfs42.h b/fs/nfs/nfs42.h
index bcd0222..66e8b29 100644
--- a/fs/nfs/nfs42.h
+++ b/fs/nfs/nfs42.h
@@ -15,7 +15,8 @@

/* nfs4.2proc.c */
int nfs42_proc_allocate(struct file *, loff_t, loff_t);
-ssize_t nfs42_proc_copy(struct file *, loff_t, struct file *, loff_t, size_t);
+ssize_t nfs42_proc_copy(struct file *, loff_t, struct file *, loff_t, size_t,
+ struct nl4_server *, nfs4_stateid *);
int nfs42_proc_deallocate(struct file *, loff_t, loff_t);
loff_t nfs42_proc_llseek(struct file *, loff_t, int);
int nfs42_proc_layoutstats_generic(struct nfs_server *,
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index b1c57a4..bb9e799 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -242,7 +242,9 @@ static ssize_t _nfs42_proc_copy(struct file *src,
struct file *dst,
struct nfs_lock_context *dst_lock,
struct nfs42_copy_args *args,
- struct nfs42_copy_res *res)
+ struct nfs42_copy_res *res,
+ struct nl4_server *nss,
+ nfs4_stateid *cnr_stateid)
{
struct rpc_message msg = {
.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_COPY],
@@ -256,11 +258,15 @@ static ssize_t _nfs42_proc_copy(struct file *src,
size_t count = args->count;
ssize_t status;

- status = nfs4_set_rw_stateid(&args->src_stateid, src_lock->open_context,
- src_lock, FMODE_READ);
- if (status)
- return status;
-
+ if (nss) {
+ args->cp_src = nss;
+ nfs4_stateid_copy(&args->src_stateid, cnr_stateid);
+ } else {
+ status = nfs4_set_rw_stateid(&args->src_stateid,
+ src_lock->open_context, src_lock, FMODE_READ);
+ if (status)
+ return status;
+ }
status = nfs_filemap_write_and_wait_range(file_inode(src)->i_mapping,
pos_src, pos_src + (loff_t)count - 1);
if (status)
@@ -324,8 +330,9 @@ static ssize_t _nfs42_proc_copy(struct file *src,
}

ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
- struct file *dst, loff_t pos_dst,
- size_t count)
+ struct file *dst, loff_t pos_dst, size_t count,
+ struct nl4_server *nss,
+ nfs4_stateid *cnr_stateid)
{
struct nfs_server *server = NFS_SERVER(file_inode(dst));
struct nfs_lock_context *src_lock;
@@ -370,7 +377,8 @@ ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
inode_lock(file_inode(dst));
err = _nfs42_proc_copy(src, src_lock,
dst, dst_lock,
- &args, &res);
+ &args, &res,
+ nss, cnr_stateid);
inode_unlock(file_inode(dst));

if (err >= 0)
diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
index e6e7cbf..c96c3f8 100644
--- a/fs/nfs/nfs42xdr.c
+++ b/fs/nfs/nfs42xdr.c
@@ -21,7 +21,10 @@
#define encode_copy_maxsz (op_encode_hdr_maxsz + \
XDR_QUADLEN(NFS4_STATEID_SIZE) + \
XDR_QUADLEN(NFS4_STATEID_SIZE) + \
- 2 + 2 + 2 + 1 + 1 + 1)
+ 2 + 2 + 2 + 1 + 1 + 1 +\
+ 1 + /* One cnr_source_server */\
+ 1 + /* nl4_type */ \
+ 1 + XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
#define decode_copy_maxsz (op_decode_hdr_maxsz + \
NFS42_WRITE_RES_SIZE + \
1 /* cr_consecutive */ + \
@@ -186,7 +189,12 @@ static void encode_copy(struct xdr_stream *xdr,

encode_uint32(xdr, 1); /* consecutive = true */
encode_uint32(xdr, args->sync);
- encode_uint32(xdr, 0); /* src server list */
+ if (args->cp_src == NULL) { /* intra-ssc */
+ encode_uint32(xdr, 0); /* no src server list */
+ return;
+ }
+ encode_uint32(xdr, 1); /* supporting 1 server */
+ encode_nl4_server(xdr, args->cp_src);
}

static void encode_offload_cancel(struct xdr_stream *xdr,
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 3a888f4..005862e 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -134,6 +134,8 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
size_t count, unsigned int flags)
{
struct nfs42_copy_notify_res *cn_resp = NULL;
+ struct nl4_server *nss = NULL;
+ nfs4_stateid *cnrs = NULL;
ssize_t ret;

if (file_inode(file_in) == file_inode(file_out))
@@ -151,8 +153,11 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
ret = nfs42_proc_copy_notify(file_in, file_out, cn_resp);
if (ret)
goto out;
+ nss = &cn_resp->cnr_src;
+ cnrs = &cn_resp->cnr_stateid;
}
- ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
+ ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count, nss,
+ cnrs);

out:
kfree(cn_resp);
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index dfc59bc..3a40b17 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1400,6 +1400,7 @@ struct nfs42_copy_args {

u64 count;
bool sync;
+ struct nl4_server *cp_src;
};

struct nfs42_write_res {
--
1.8.3.1

2018-10-19 23:36:18

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 10/11] NFS for "inter" copy treat ESTALE as ENOTSUPP

From: Olga Kornievskaia <[email protected]>

If the client sends an "inter" copy to the destination server but
it only supports "intra" copy, it can return ESTALE (since it
doesn't know anything about the file handle from the other server
and does not recognize the special case of "inter" copy). Translate
this error as ENOTSUPP and also send OFFLOAD_CANCEL to teh source
server so that it can clean up state.

Signed-off-by: Olga Kornievskaia <[email protected]>
---
fs/nfs/nfs42proc.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 98fe00b..00809b3 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -395,6 +395,11 @@ ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
args.sync = true;
dst_exception.retry = 1;
continue;
+ } else if (err == -ESTALE &&
+ !nfs42_files_from_same_server(src, dst)) {
+ nfs42_do_offload_cancel_async(src, &args.src_stateid);
+ err = -EOPNOTSUPP;
+ break;
}

err2 = nfs4_handle_exception(server, err, &src_exception);
--
1.8.3.1

2018-10-19 23:36:19

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 11/11] NFS COPY handle ERR_OFFLOAD_DENIED

From: Olga Kornievskaia <[email protected]>

If server sends ERR_OFFLOAD_DENIED error, the client must fall
back on doing copy the normal way. Return ENOTSUPP to the vfs and
fallback to regular copy.

Signed-off-by: Olga Kornievskaia <[email protected]>
---
fs/nfs/nfs42proc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 00809b3..c7c2ffa 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -395,7 +395,8 @@ ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
args.sync = true;
dst_exception.retry = 1;
continue;
- } else if (err == -ESTALE &&
+ } else if ((err == -ESTALE ||
+ err == -NFS4ERR_OFFLOAD_DENIED) &&
!nfs42_files_from_same_server(src, dst)) {
nfs42_do_offload_cancel_async(src, &args.src_stateid);
err = -EOPNOTSUPP;
--
1.8.3.1

2018-10-19 23:36:13

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 05/11] NFS add COPY_NOTIFY operation

From: Olga Kornievskaia <[email protected]>

Try using the delegation stateid, then the open stateid.

Only NL4_NETATTR, No support for NL4_NAME and NL4_URL.
Allow only one source server address to be returned for now.

Signed-off-by: Andy Adamson <[email protected]>
Signed-off-by: Olga Kornievskaia <[email protected]>
---
fs/nfs/nfs42.h | 3 +-
fs/nfs/nfs42proc.c | 91 +++++++++++++++++++++++
fs/nfs/nfs42xdr.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/nfs4file.c | 17 +++++
fs/nfs/nfs4proc.c | 1 +
fs/nfs/nfs4xdr.c | 1 +
include/linux/nfs4.h | 1 +
include/linux/nfs_fs_sb.h | 1 +
include/linux/nfs_xdr.h | 16 ++++
9 files changed, 311 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs42.h b/fs/nfs/nfs42.h
index 5abff4d..bcd0222 100644
--- a/fs/nfs/nfs42.h
+++ b/fs/nfs/nfs42.h
@@ -21,7 +21,8 @@
int nfs42_proc_layoutstats_generic(struct nfs_server *,
struct nfs42_layoutstat_data *);
int nfs42_proc_clone(struct file *, struct file *, loff_t, loff_t, loff_t);
-
+int nfs42_proc_copy_notify(struct file *, struct file *,
+ struct nfs42_copy_notify_res *);
static inline bool nfs42_files_from_same_server(struct file *in,
struct file *out)
{
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index ac5b784..b1c57a4 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -3,6 +3,7 @@
* Copyright (c) 2014 Anna Schumaker <[email protected]>
*/
#include <linux/fs.h>
+#include <linux/sunrpc/addr.h>
#include <linux/sunrpc/sched.h>
#include <linux/nfs.h>
#include <linux/nfs3.h>
@@ -15,10 +16,30 @@
#include "pnfs.h"
#include "nfs4session.h"
#include "internal.h"
+#include "delegation.h"

#define NFSDBG_FACILITY NFSDBG_PROC
static int nfs42_do_offload_cancel_async(struct file *dst, nfs4_stateid *std);

+static void nfs42_set_netaddr(struct file *filep, struct nfs42_netaddr *naddr)
+{
+ struct nfs_client *clp = (NFS_SERVER(file_inode(filep)))->nfs_client;
+ unsigned short port = 2049;
+
+ rcu_read_lock();
+ naddr->netid_len = scnprintf(naddr->netid,
+ sizeof(naddr->netid), "%s",
+ rpc_peeraddr2str(clp->cl_rpcclient,
+ RPC_DISPLAY_NETID));
+ naddr->addr_len = scnprintf(naddr->addr,
+ sizeof(naddr->addr),
+ "%s.%u.%u",
+ rpc_peeraddr2str(clp->cl_rpcclient,
+ RPC_DISPLAY_ADDR),
+ port >> 8, port & 255);
+ rcu_read_unlock();
+}
+
static int _nfs42_proc_fallocate(struct rpc_message *msg, struct file *filep,
struct nfs_lock_context *lock, loff_t offset, loff_t len)
{
@@ -461,6 +482,76 @@ static int nfs42_do_offload_cancel_async(struct file *dst,
return status;
}

+int _nfs42_proc_copy_notify(struct file *src, struct file *dst,
+ struct nfs42_copy_notify_args *args,
+ struct nfs42_copy_notify_res *res)
+{
+ struct nfs_server *src_server = NFS_SERVER(file_inode(src));
+ struct rpc_message msg = {
+ .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_COPY_NOTIFY],
+ .rpc_argp = args,
+ .rpc_resp = res,
+ };
+ int status;
+ struct nfs_open_context *ctx;
+ struct nfs_lock_context *l_ctx;
+
+ ctx = get_nfs_open_context(nfs_file_open_context(src));
+ l_ctx = nfs_get_lock_context(ctx);
+ if (IS_ERR(l_ctx))
+ return PTR_ERR(l_ctx);
+
+ status = nfs4_set_rw_stateid(&args->cna_src_stateid, ctx, l_ctx,
+ FMODE_READ);
+ nfs_put_lock_context(l_ctx);
+ if (status)
+ return status;
+
+ status = nfs4_call_sync(src_server->client, src_server, &msg,
+ &args->cna_seq_args, &res->cnr_seq_res, 0);
+ if (status == -ENOTSUPP)
+ src_server->caps &= ~NFS_CAP_COPY_NOTIFY;
+
+ put_nfs_open_context(nfs_file_open_context(src));
+ return status;
+}
+
+int nfs42_proc_copy_notify(struct file *src, struct file *dst,
+ struct nfs42_copy_notify_res *res)
+{
+ struct nfs_server *src_server = NFS_SERVER(file_inode(src));
+ struct nfs42_copy_notify_args *args;
+ struct nfs4_exception exception = {
+ .inode = file_inode(src),
+ };
+ int status;
+
+ if (!(src_server->caps & NFS_CAP_COPY_NOTIFY))
+ return -EOPNOTSUPP;
+
+ args = kzalloc(sizeof(struct nfs42_copy_notify_args), GFP_NOFS);
+ if (args == NULL)
+ return -ENOMEM;
+
+ args->cna_src_fh = NFS_FH(file_inode(src)),
+ args->cna_dst.nl4_type = NL4_NETADDR;
+ nfs42_set_netaddr(dst, &args->cna_dst.u.nl4_addr);
+ exception.stateid = &args->cna_src_stateid;
+
+ do {
+ status = _nfs42_proc_copy_notify(src, dst, args, res);
+ if (status == -ENOTSUPP) {
+ status = -EOPNOTSUPP;
+ goto out;
+ }
+ status = nfs4_handle_exception(src_server, status, &exception);
+ } while (exception.retry);
+
+out:
+ kfree(args);
+ return status;
+}
+
static loff_t _nfs42_proc_llseek(struct file *filep,
struct nfs_lock_context *lock, loff_t offset, int whence)
{
diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
index 69f72ed..e6e7cbf 100644
--- a/fs/nfs/nfs42xdr.c
+++ b/fs/nfs/nfs42xdr.c
@@ -29,6 +29,16 @@
#define encode_offload_cancel_maxsz (op_encode_hdr_maxsz + \
XDR_QUADLEN(NFS4_STATEID_SIZE))
#define decode_offload_cancel_maxsz (op_decode_hdr_maxsz)
+#define encode_copy_notify_maxsz (op_encode_hdr_maxsz + \
+ XDR_QUADLEN(NFS4_STATEID_SIZE) + \
+ 1 + /* nl4_type */ \
+ 1 + XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
+#define decode_copy_notify_maxsz (op_decode_hdr_maxsz + \
+ 3 + /* cnr_lease_time */\
+ XDR_QUADLEN(NFS4_STATEID_SIZE) + \
+ 1 + /* Support 1 cnr_source_server */\
+ 1 + /* nl4_type */ \
+ 1 + XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
#define encode_deallocate_maxsz (op_encode_hdr_maxsz + \
encode_fallocate_maxsz)
#define decode_deallocate_maxsz (op_decode_hdr_maxsz)
@@ -84,6 +94,12 @@
#define NFS4_dec_offload_cancel_sz (compound_decode_hdr_maxsz + \
decode_putfh_maxsz + \
decode_offload_cancel_maxsz)
+#define NFS4_enc_copy_notify_sz (compound_encode_hdr_maxsz + \
+ encode_putfh_maxsz + \
+ encode_copy_notify_maxsz)
+#define NFS4_dec_copy_notify_sz (compound_decode_hdr_maxsz + \
+ decode_putfh_maxsz + \
+ decode_copy_notify_maxsz)
#define NFS4_enc_deallocate_sz (compound_encode_hdr_maxsz + \
encode_putfh_maxsz + \
encode_deallocate_maxsz + \
@@ -137,6 +153,25 @@ static void encode_allocate(struct xdr_stream *xdr,
encode_fallocate(xdr, args);
}

+static void encode_nl4_server(struct xdr_stream *xdr, const struct nl4_server *ns)
+{
+ encode_uint32(xdr, ns->nl4_type);
+ switch (ns->nl4_type) {
+ case NL4_NAME:
+ case NL4_URL:
+ encode_string(xdr, ns->u.nl4_str_sz, ns->u.nl4_str);
+ break;
+ case NL4_NETADDR:
+ encode_string(xdr, ns->u.nl4_addr.netid_len,
+ ns->u.nl4_addr.netid);
+ encode_string(xdr, ns->u.nl4_addr.addr_len,
+ ns->u.nl4_addr.addr);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ }
+}
+
static void encode_copy(struct xdr_stream *xdr,
const struct nfs42_copy_args *args,
struct compound_hdr *hdr)
@@ -162,6 +197,15 @@ static void encode_offload_cancel(struct xdr_stream *xdr,
encode_nfs4_stateid(xdr, &args->osa_stateid);
}

+static void encode_copy_notify(struct xdr_stream *xdr,
+ const struct nfs42_copy_notify_args *args,
+ struct compound_hdr *hdr)
+{
+ encode_op_hdr(xdr, OP_COPY_NOTIFY, decode_copy_notify_maxsz, hdr);
+ encode_nfs4_stateid(xdr, &args->cna_src_stateid);
+ encode_nl4_server(xdr, &args->cna_dst);
+}
+
static void encode_deallocate(struct xdr_stream *xdr,
const struct nfs42_falloc_args *args,
struct compound_hdr *hdr)
@@ -298,6 +342,25 @@ static void nfs4_xdr_enc_offload_cancel(struct rpc_rqst *req,
}

/*
+ * Encode COPY_NOTIFY request
+ */
+static void nfs4_xdr_enc_copy_notify(struct rpc_rqst *req,
+ struct xdr_stream *xdr,
+ const void *data)
+{
+ const struct nfs42_copy_notify_args *args = data;
+ struct compound_hdr hdr = {
+ .minorversion = nfs4_xdr_minorversion(&args->cna_seq_args),
+ };
+
+ encode_compound_hdr(xdr, req, &hdr);
+ encode_sequence(xdr, &args->cna_seq_args, &hdr);
+ encode_putfh(xdr, args->cna_src_fh, &hdr);
+ encode_copy_notify(xdr, args, &hdr);
+ encode_nops(&hdr);
+}
+
+/*
* Encode DEALLOCATE request
*/
static void nfs4_xdr_enc_deallocate(struct rpc_rqst *req,
@@ -416,6 +479,58 @@ static int decode_write_response(struct xdr_stream *xdr,
return -EIO;
}

+static int decode_nl4_server(struct xdr_stream *xdr, struct nl4_server *ns)
+{
+ struct nfs42_netaddr *naddr;
+ uint32_t dummy;
+ char *dummy_str;
+ __be32 *p;
+ int status;
+
+ /* nl_type */
+ p = xdr_inline_decode(xdr, 4);
+ if (unlikely(!p))
+ return -EIO;
+ ns->nl4_type = be32_to_cpup(p);
+ switch (ns->nl4_type) {
+ case NL4_NAME:
+ case NL4_URL:
+ status = decode_opaque_inline(xdr, &dummy, &dummy_str);
+ if (unlikely(status))
+ return status;
+ if (unlikely(dummy > NFS4_OPAQUE_LIMIT))
+ return -EIO;
+ memcpy(&ns->u.nl4_str, dummy_str, dummy);
+ ns->u.nl4_str_sz = dummy;
+ break;
+ case NL4_NETADDR:
+ naddr = &ns->u.nl4_addr;
+
+ /* netid string */
+ status = decode_opaque_inline(xdr, &dummy, &dummy_str);
+ if (unlikely(status))
+ return status;
+ if (unlikely(dummy > RPCBIND_MAXNETIDLEN))
+ return -EIO;
+ naddr->netid_len = dummy;
+ memcpy(naddr->netid, dummy_str, naddr->netid_len);
+
+ /* uaddr string */
+ status = decode_opaque_inline(xdr, &dummy, &dummy_str);
+ if (unlikely(status))
+ return status;
+ if (unlikely(dummy > RPCBIND_MAXUADDRLEN))
+ return -EIO;
+ naddr->addr_len = dummy;
+ memcpy(naddr->addr, dummy_str, naddr->addr_len);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EIO;
+ }
+ return 0;
+}
+
static int decode_copy_requirements(struct xdr_stream *xdr,
struct nfs42_copy_res *res) {
__be32 *p;
@@ -458,6 +573,46 @@ static int decode_offload_cancel(struct xdr_stream *xdr,
return decode_op_hdr(xdr, OP_OFFLOAD_CANCEL);
}

+static int decode_copy_notify(struct xdr_stream *xdr,
+ struct nfs42_copy_notify_res *res)
+{
+ __be32 *p;
+ int status, count;
+
+ status = decode_op_hdr(xdr, OP_COPY_NOTIFY);
+ if (status)
+ return status;
+ /* cnr_lease_time */
+ p = xdr_inline_decode(xdr, 12);
+ if (unlikely(!p))
+ goto out_overflow;
+ p = xdr_decode_hyper(p, &res->cnr_lease_time.seconds);
+ res->cnr_lease_time.nseconds = be32_to_cpup(p);
+
+ status = decode_opaque_fixed(xdr, &res->cnr_stateid, NFS4_STATEID_SIZE);
+ if (unlikely(status))
+ goto out_overflow;
+
+ /* number of source addresses */
+ p = xdr_inline_decode(xdr, 4);
+ if (unlikely(!p))
+ goto out_overflow;
+
+ count = be32_to_cpup(p);
+ if (count > 1)
+ pr_warn("NFS: %s: nsvr %d > Supported. Use first servers\n",
+ __func__, count);
+
+ status = decode_nl4_server(xdr, &res->cnr_src);
+ if (unlikely(status))
+ goto out_overflow;
+ return 0;
+
+out_overflow:
+ print_overflow_msg(__func__, xdr);
+ return -EIO;
+}
+
static int decode_deallocate(struct xdr_stream *xdr, struct nfs42_falloc_res *res)
{
return decode_op_hdr(xdr, OP_DEALLOCATE);
@@ -585,6 +740,32 @@ static int nfs4_xdr_dec_offload_cancel(struct rpc_rqst *rqstp,
}

/*
+ * Decode COPY_NOTIFY response
+ */
+static int nfs4_xdr_dec_copy_notify(struct rpc_rqst *rqstp,
+ struct xdr_stream *xdr,
+ void *data)
+{
+ struct nfs42_copy_notify_res *res = data;
+ struct compound_hdr hdr;
+ int status;
+
+ status = decode_compound_hdr(xdr, &hdr);
+ if (status)
+ goto out;
+ status = decode_sequence(xdr, &res->cnr_seq_res, rqstp);
+ if (status)
+ goto out;
+ status = decode_putfh(xdr);
+ if (status)
+ goto out;
+ status = decode_copy_notify(xdr, res);
+
+out:
+ return status;
+}
+
+/*
* Decode DEALLOCATE request
*/
static int nfs4_xdr_dec_deallocate(struct rpc_rqst *rqstp,
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 4288a6e..3a888f4 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -133,12 +133,29 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t count, unsigned int flags)
{
+ struct nfs42_copy_notify_res *cn_resp = NULL;
ssize_t ret;

if (file_inode(file_in) == file_inode(file_out))
return -EINVAL;
retry:
+ if (nfs42_files_from_same_server(file_in, file_out)) { /* Intra-ssc */
+ if (file_in->f_op != file_out->f_op)
+ return -EXDEV;
+ } else { /* Inter-ssc */
+ cn_resp = kzalloc(sizeof(struct nfs42_copy_notify_res),
+ GFP_NOFS);
+ if (unlikely(cn_resp == NULL))
+ return -ENOMEM;
+
+ ret = nfs42_proc_copy_notify(file_in, file_out, cn_resp);
+ if (ret)
+ goto out;
+ }
ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
+
+out:
+ kfree(cn_resp);
if (ret == -EAGAIN)
goto retry;
return ret;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index db84b4a..fec6e6b 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -9692,6 +9692,7 @@ static bool nfs4_match_stateid(const nfs4_stateid *s1,
| NFS_CAP_ALLOCATE
| NFS_CAP_COPY
| NFS_CAP_OFFLOAD_CANCEL
+ | NFS_CAP_COPY_NOTIFY
| NFS_CAP_DEALLOCATE
| NFS_CAP_SEEK
| NFS_CAP_LAYOUTSTATS
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index b7bde12..2163900 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -7790,6 +7790,7 @@ int nfs4_decode_dirent(struct xdr_stream *xdr, struct nfs_entry *entry,
PROC42(CLONE, enc_clone, dec_clone),
PROC42(COPY, enc_copy, dec_copy),
PROC42(OFFLOAD_CANCEL, enc_offload_cancel, dec_offload_cancel),
+ PROC42(COPY_NOTIFY, enc_copy_notify, dec_copy_notify),
PROC(LOOKUPP, enc_lookupp, dec_lookupp),
};

diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 4d76f87..d80b25e 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -537,6 +537,7 @@ enum {
NFSPROC4_CLNT_CLONE,
NFSPROC4_CLNT_COPY,
NFSPROC4_CLNT_OFFLOAD_CANCEL,
+ NFSPROC4_CLNT_COPY_NOTIFY,

NFSPROC4_CLNT_LOOKUPP,
};
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 0fc0b91..e5d89ff 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -261,5 +261,6 @@ struct nfs_server {
#define NFS_CAP_CLONE (1U << 23)
#define NFS_CAP_COPY (1U << 24)
#define NFS_CAP_OFFLOAD_CANCEL (1U << 25)
+#define NFS_CAP_COPY_NOTIFY (1U << 26)

#endif
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 0e01625..dfc59bc 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1428,6 +1428,22 @@ struct nfs42_offload_status_res {
int osr_status;
};

+struct nfs42_copy_notify_args {
+ struct nfs4_sequence_args cna_seq_args;
+
+ struct nfs_fh *cna_src_fh;
+ nfs4_stateid cna_src_stateid;
+ struct nl4_server cna_dst;
+};
+
+struct nfs42_copy_notify_res {
+ struct nfs4_sequence_res cnr_seq_res;
+
+ struct nfstime4 cnr_lease_time;
+ nfs4_stateid cnr_stateid;
+ struct nl4_server cnr_src;
+};
+
struct nfs42_seek_args {
struct nfs4_sequence_args seq_args;

--
1.8.3.1

2018-10-19 23:36:14

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 07/11] NFS also send OFFLOAD_CANCEL to source server

From: Olga Kornievskaia <[email protected]>

In case of copy is cancelled, also send OFFLOAD_CANCEL to the source
server.

Signed-off-by: Olga Kornievskaia <[email protected]>
---
fs/nfs/nfs42proc.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index bb9e799..98fe00b 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -205,12 +205,14 @@ static int handle_async_copy(struct nfs42_copy_res *res,
memcpy(&res->write_res.verifier, &copy->verf, sizeof(copy->verf));
status = -copy->error;

+out_free:
kfree(copy);
return status;
out_cancel:
nfs42_do_offload_cancel_async(dst, &copy->stateid);
- kfree(copy);
- return status;
+ if (!nfs42_files_from_same_server(src, dst))
+ nfs42_do_offload_cancel_async(src, src_stateid);
+ goto out_free;
}

static int process_copy_commit(struct file *dst, loff_t pos_dst,
--
1.8.3.1

2018-10-19 23:36:10

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 04/11] NFS NFSD defining nl4_servers structure needed by both

From: Olga Kornievskaia <[email protected]>

These structures are needed by COPY_NOTIFY on the client and needed
by the nfsd as well

Signed-off-by: Olga Kornievskaia <[email protected]>
---
include/linux/nfs4.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 1b06f0b..4d76f87 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -16,6 +16,7 @@
#include <linux/list.h>
#include <linux/uidgid.h>
#include <uapi/linux/nfs4.h>
+#include <linux/sunrpc/msg_prot.h>

enum nfs4_acl_whotype {
NFS4_ACL_WHO_NAMED = 0,
@@ -672,4 +673,27 @@ struct nfs4_op_map {
} u;
};

+struct nfs42_netaddr {
+ char netid[RPCBIND_MAXNETIDLEN];
+ char addr[RPCBIND_MAXUADDRLEN + 1];
+ u32 netid_len;
+ u32 addr_len;
+};
+
+enum netloc_type4 {
+ NL4_NAME = 1,
+ NL4_URL = 2,
+ NL4_NETADDR = 3,
+};
+
+struct nl4_server {
+ enum netloc_type4 nl4_type;
+ union {
+ struct { /* NL4_NAME, NL4_URL */
+ int nl4_str_sz;
+ char nl4_str[NFS4_OPAQUE_LIMIT + 1];
+ };
+ struct nfs42_netaddr nl4_addr; /* NL4_NETADDR */
+ } u;
+};
#endif
--
1.8.3.1

2018-10-19 23:36:09

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 03/11] NFS test for intra vs inter COPY

From: Olga Kornievskaia <[email protected]>

Signed-off-by: Olga Kornievskaia <[email protected]>
---
fs/nfs/nfs42.h | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/fs/nfs/nfs42.h b/fs/nfs/nfs42.h
index 19ec38f8..5abff4d 100644
--- a/fs/nfs/nfs42.h
+++ b/fs/nfs/nfs42.h
@@ -6,6 +6,7 @@
#ifndef __LINUX_FS_NFS_NFS4_2_H
#define __LINUX_FS_NFS_NFS4_2_H

+#include <linux/sunrpc/addr.h>
/*
* FIXME: four LAYOUTSTATS calls per compound at most! Do we need to support
* more? Need to consider not to pre-alloc too much for a compound.
@@ -21,4 +22,14 @@ int nfs42_proc_layoutstats_generic(struct nfs_server *,
struct nfs42_layoutstat_data *);
int nfs42_proc_clone(struct file *, struct file *, loff_t, loff_t, loff_t);

+static inline bool nfs42_files_from_same_server(struct file *in,
+ struct file *out)
+{
+ struct nfs_client *c_in = (NFS_SERVER(file_inode(in)))->nfs_client;
+ struct nfs_client *c_out = (NFS_SERVER(file_inode(out)))->nfs_client;
+
+ return rpc_cmp_addr((struct sockaddr *)&c_in->cl_addr,
+ (struct sockaddr *)&c_out->cl_addr);
+}
+
#endif /* __LINUX_FS_NFS_NFS4_2_H */
--
1.8.3.1

2018-10-19 23:36:08

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 02/11] VFS permit cross device vfs_copy_file_range

From: Olga Kornievskaia <[email protected]>

Allow copy_file_range to copy between different superblocks but only
of the same file system types. This feature was of interest to CIFS
as well as NFS.

This feature is needed by NFSv4.2 to perform file copy operation on
the same server or file copy between different NFSv4.2 servers.

If a file system's fileoperations copy_file_range operation prohibits
cross-device copies, fall back to do_splice_direct. This would be
needed for the NFS (destination) server side implementation of the
file copy and currently for CIFS.

Besides NFS, there is only 1 implementor of the copy_file_range FS
operation -- CIFS. CIFS assumes incoming file descriptors are both
CIFS but it will check if they are coming from different servers and
return error code to fall back to do_splice_direct.

NFS will allow for copies between different NFS servers.

Adding to the vfs.txt documentation to explicitly warn about allowing
for different superblocks of the same file type to be passed into the
copy_file_range for the future users of copy_file_range method.

Signed-off-by: Olga Kornievskaia <[email protected]>
---
Documentation/filesystems/vfs.txt | 4 +++-
fs/read_write.c | 13 ++++++-------
2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index a6c6a8a..5e520de 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -958,7 +958,9 @@ otherwise noted.

fallocate: called by the VFS to preallocate blocks or punch a hole.

- copy_file_range: called by the copy_file_range(2) system call.
+ copy_file_range: called by copy_file_range(2) system call. This method
+ works on two file descriptors that might reside on
+ different superblocks of the same type of file system.

clone_file_range: called by the ioctl(2) system call for FICLONERANGE and
FICLONE commands.
diff --git a/fs/read_write.c b/fs/read_write.c
index c60790f..474e740 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1578,10 +1578,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
(file_out->f_flags & O_APPEND))
return -EBADF;

- /* this could be relaxed once a method supports cross-fs copies */
- if (inode_in->i_sb != inode_out->i_sb)
- return -EXDEV;
-
if (len == 0)
return 0;

@@ -1591,7 +1587,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
* Try cloning first, this is supported by more file systems, and
* more efficient if both clone and copy are supported (e.g. NFS).
*/
- if (file_in->f_op->clone_file_range) {
+ if (inode_in->i_sb == inode_out->i_sb &&
+ file_in->f_op->clone_file_range) {
ret = file_in->f_op->clone_file_range(file_in, pos_in,
file_out, pos_out, len);
if (ret == 0) {
@@ -1600,10 +1597,12 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
}
}

- if (file_out->f_op->copy_file_range) {
+ if (file_out->f_op->copy_file_range &&
+ (file_in->f_op->copy_file_range ==
+ file_out->f_op->copy_file_range)) {
ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
pos_out, len, flags);
- if (ret != -EOPNOTSUPP)
+ if (ret != -EOPNOTSUPP && ret != -EXDEV)
goto done;
}

--
1.8.3.1

2018-10-19 23:36:16

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 09/11] NFS skip recovery of copy open on dest server

From: Olga Kornievskaia <[email protected]>

Mark the open created for the source file on the destination
server. Then if this open is going thru a recovery, then fail
the recovery as we don't need to be recoving a "fake" open.
We need to fail the ongoing READs and vfs_copy_file_range().

Signed-off-by: Olga Kornievskaia <[email protected]>
---
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4file.c | 1 +
fs/nfs/nfs4state.c | 14 ++++++++++++++
3 files changed, 16 insertions(+)

diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index f229864..8b00c90 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -165,6 +165,7 @@ enum {
NFS_STATE_CHANGE_WAIT, /* A state changing operation is outstanding */
#ifdef CONFIG_NFS_V4_2
NFS_CLNT_DST_SSC_COPY_STATE, /* dst server open state on client*/
+ NFS_SRV_SSC_COPY_STATE, /* ssc state on the dst server */
#endif /* CONFIG_NFS_V4_2 */
};

diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 7f226f4..6cc4ae4 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -331,6 +331,7 @@ struct file *
if (ctx->state == NULL)
goto out_stateowner;

+ set_bit(NFS_SRV_SSC_COPY_STATE, &ctx->state->flags);
set_bit(NFS_OPEN_STATE, &ctx->state->flags);
memcpy(&ctx->state->open_stateid.other, &stateid->other,
NFS4_STATEID_OTHER_SIZE);
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 62ae0fd..b0b82c6 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1606,6 +1606,9 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
{
struct nfs4_state *state;
int status = 0;
+#ifdef CONFIG_NFS_V4_2
+ bool found_ssc_copy_state = false;
+#endif /* CONFIG_NFS_V4_2 */

/* Note: we rely on the sp->so_states list being ordered
* so that we always reclaim open(O_RDWR) and/or open(O_WRITE)
@@ -1625,6 +1628,13 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
continue;
if (state->state == 0)
continue;
+#ifdef CONFIG_NFS_V4_2
+ if (test_bit(NFS_SRV_SSC_COPY_STATE, &state->flags)) {
+ nfs4_state_mark_recovery_failed(state, -EIO);
+ found_ssc_copy_state = true;
+ continue;
+ }
+#endif /* CONFIG_NFS_V4_2 */
refcount_inc(&state->count);
spin_unlock(&sp->so_lock);
status = __nfs4_reclaim_open_state(sp, state, ops);
@@ -1671,6 +1681,10 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
}
raw_write_seqcount_end(&sp->so_reclaim_seqcount);
spin_unlock(&sp->so_lock);
+#ifdef CONFIG_NFS_V4_2
+ if (found_ssc_copy_state)
+ return -EIO;
+#endif /* CONFIG_NFS_V4_2 */
return 0;
out_err:
nfs4_put_open_state(state);
--
1.8.3.1

2018-10-19 23:36:07

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

From: Anna Schumaker <[email protected]>

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/read_write.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 39b4a21..c60790f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (unlikely(ret))
return ret;

+ if (pos_in >= i_size_read(inode_in))
+ return -EINVAL;
+
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
(file_out->f_flags & O_APPEND))
--
1.8.3.1

2018-10-19 23:36:16

by Olga Kornievskaia

[permalink] [raw]
Subject: [PATCH v1 08/11] NFS inter ssc open

From: Olga Kornievskaia <[email protected]>

NFSv4.2 inter server to server copy requires the destination server to
READ the data from the source server using the provided stateid and
file handle.

Given an NFSv4 stateid and filehandle from the COPY operaion, provide the
destination server with an NFS client function to create a struct file
suitable for the destiniation server to READ the data to be copied.

Signed-off-by: Andy Adamson <[email protected]>
Signed-off-by: Olga Kornievskaia <[email protected]>
---
fs/nfs/nfs4_fs.h | 7 ++++
fs/nfs/nfs4file.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/nfs4proc.c | 5 ++-
3 files changed, 107 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 8d59c96..f229864 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -307,6 +307,13 @@ extern int nfs4_set_rw_stateid(nfs4_stateid *stateid,
const struct nfs_open_context *ctx,
const struct nfs_lock_context *l_ctx,
fmode_t fmode);
+extern int nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh *fhandle,
+ struct nfs_fattr *fattr, struct nfs4_label *label,
+ struct inode *inode);
+extern int update_open_stateid(struct nfs4_state *state,
+ const nfs4_stateid *open_stateid,
+ const nfs4_stateid *deleg_stateid,
+ fmode_t fmode);

#if defined(CONFIG_NFS_V4_1)
extern int nfs41_sequence_done(struct rpc_task *, struct nfs4_sequence_res *);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 005862e..7f226f4 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -8,6 +8,7 @@
#include <linux/file.h>
#include <linux/falloc.h>
#include <linux/nfs_fs.h>
+#include <linux/file.h>
#include "delegation.h"
#include "internal.h"
#include "iostat.h"
@@ -264,6 +265,103 @@ static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
out:
return ret;
}
+
+static int read_name_gen = 1;
+#define SSC_READ_NAME_BODY "ssc_read_%d"
+
+struct file *
+nfs42_ssc_open(struct vfsmount *ss_mnt, struct nfs_fh *src_fh,
+ nfs4_stateid *stateid)
+{
+ struct nfs_fattr fattr;
+ struct file *filep, *res;
+ struct nfs_server *server;
+ struct inode *r_ino = NULL;
+ struct nfs_open_context *ctx;
+ struct nfs4_state_owner *sp;
+ char *read_name;
+ int len, status = 0;
+
+ server = NFS_SERVER(ss_mnt->mnt_root->d_inode);
+
+ nfs_fattr_init(&fattr);
+
+ status = nfs4_proc_getattr(server, src_fh, &fattr, NULL, NULL);
+ if (status < 0) {
+ res = ERR_PTR(status);
+ goto out;
+ }
+
+ res = ERR_PTR(-ENOMEM);
+ len = strlen(SSC_READ_NAME_BODY) + 16;
+ read_name = kzalloc(len, GFP_NOFS);
+ if (read_name == NULL)
+ goto out;
+ snprintf(read_name, len, SSC_READ_NAME_BODY, read_name_gen++);
+ dprintk("%s read_name %s\n", __func__, read_name);
+
+ r_ino = nfs_fhget(ss_mnt->mnt_root->d_inode->i_sb, src_fh, &fattr,
+ NULL);
+ if (IS_ERR(r_ino)) {
+ res = ERR_CAST(r_ino);
+ goto out;
+ }
+
+ filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, FMODE_READ,
+ r_ino->i_fop);
+ if (IS_ERR(filep)) {
+ res = ERR_CAST(filep);
+ goto out;
+ }
+ filep->f_mode |= FMODE_READ;
+
+ ctx = alloc_nfs_open_context(filep->f_path.dentry, filep->f_mode,
+ filep);
+ if (IS_ERR(ctx)) {
+ res = ERR_CAST(ctx);
+ goto out_filep;
+ }
+
+ res = ERR_PTR(-EINVAL);
+ sp = nfs4_get_state_owner(server, ctx->cred, GFP_KERNEL);
+ if (sp == NULL)
+ goto out_ctx;
+
+ ctx->state = nfs4_get_open_state(r_ino, sp);
+ if (ctx->state == NULL)
+ goto out_stateowner;
+
+ set_bit(NFS_OPEN_STATE, &ctx->state->flags);
+ memcpy(&ctx->state->open_stateid.other, &stateid->other,
+ NFS4_STATEID_OTHER_SIZE);
+ update_open_stateid(ctx->state, stateid, NULL, filep->f_mode);
+
+ nfs_file_set_open_context(filep, ctx);
+ put_nfs_open_context(ctx);
+
+ file_ra_state_init(&filep->f_ra, filep->f_mapping->host->i_mapping);
+ res = filep;
+out:
+ dprintk("<-- %s error %ld filep %p r_ino %p\n",
+ __func__, IS_ERR(res) ? PTR_ERR(res) : 0, res, r_ino);
+
+ return res;
+out_stateowner:
+ nfs4_put_state_owner(sp);
+out_ctx:
+ put_nfs_open_context(ctx);
+out_filep:
+ fput(filep);
+ goto out;
+}
+EXPORT_SYMBOL_GPL(nfs42_ssc_open);
+void nfs42_ssc_close(struct file *filep)
+{
+ struct nfs_open_context *ctx = nfs_file_open_context(filep);
+
+ ctx->state->flags = 0;
+}
+EXPORT_SYMBOL_GPL(nfs42_ssc_close);
#endif /* CONFIG_NFS_V4_2 */

const struct file_operations nfs4_file_operations = {
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index fec6e6b..e5178b2f 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -91,7 +91,6 @@
static int _nfs4_recover_proc_open(struct nfs4_opendata *data);
static int nfs4_do_fsinfo(struct nfs_server *, struct nfs_fh *, struct nfs_fsinfo *);
static void nfs_fixup_referral_attributes(struct nfs_fattr *fattr);
-static int nfs4_proc_getattr(struct nfs_server *, struct nfs_fh *, struct nfs_fattr *, struct nfs4_label *label, struct inode *inode);
static int _nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh *fhandle, struct nfs_fattr *fattr, struct nfs4_label *label, struct inode *inode);
static int nfs4_do_setattr(struct inode *inode, struct rpc_cred *cred,
struct nfs_fattr *fattr, struct iattr *sattr,
@@ -1653,7 +1652,7 @@ static void nfs_state_clear_delegation(struct nfs4_state *state)
write_sequnlock(&state->seqlock);
}

-static int update_open_stateid(struct nfs4_state *state,
+int update_open_stateid(struct nfs4_state *state,
const nfs4_stateid *open_stateid,
const nfs4_stateid *delegation,
fmode_t fmode)
@@ -3936,7 +3935,7 @@ static int _nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh *fhandle,
return nfs4_call_sync(server->client, server, &msg, &args.seq_args, &res.seq_res, 0);
}

-static int nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh *fhandle,
+int nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh *fhandle,
struct nfs_fattr *fattr, struct nfs4_label *label,
struct inode *inode)
{
--
1.8.3.1

2018-10-20 00:20:42

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

T24gRnJpLCAyMDE4LTEwLTE5IGF0IDExOjI5IC0wNDAwLCBPbGdhIEtvcm5pZXZza2FpYSB3cm90
ZToNCj4gRnJvbTogQW5uYSBTY2h1bWFrZXIgPEFubmEuU2NodW1ha2VyQE5ldGFwcC5jb20+DQo+
IA0KPiBTaWduZWQtb2ZmLWJ5OiBBbm5hIFNjaHVtYWtlciA8QW5uYS5TY2h1bWFrZXJATmV0YXBw
LmNvbT4NCj4gLS0tDQo+ICBmcy9yZWFkX3dyaXRlLmMgfCAzICsrKw0KPiAgMSBmaWxlIGNoYW5n
ZWQsIDMgaW5zZXJ0aW9ucygrKQ0KPiANCj4gZGlmZiAtLWdpdCBhL2ZzL3JlYWRfd3JpdGUuYyBi
L2ZzL3JlYWRfd3JpdGUuYw0KPiBpbmRleCAzOWI0YTIxLi5jNjA3OTBmIDEwMDY0NA0KPiAtLS0g
YS9mcy9yZWFkX3dyaXRlLmMNCj4gKysrIGIvZnMvcmVhZF93cml0ZS5jDQo+IEBAIC0xNTcwLDYg
KzE1NzAsOSBAQCBzc2l6ZV90IHZmc19jb3B5X2ZpbGVfcmFuZ2Uoc3RydWN0IGZpbGUNCj4gKmZp
bGVfaW4sIGxvZmZfdCBwb3NfaW4sDQo+ICAJaWYgKHVubGlrZWx5KHJldCkpDQo+ICAJCXJldHVy
biByZXQ7DQo+ICANCj4gKwlpZiAocG9zX2luID49IGlfc2l6ZV9yZWFkKGlub2RlX2luKSkNCj4g
KwkJcmV0dXJuIC1FSU5WQUw7DQo+ICsNCj4gIAlpZiAoIShmaWxlX2luLT5mX21vZGUgJiBGTU9E
RV9SRUFEKSB8fA0KPiAgCSAgICAhKGZpbGVfb3V0LT5mX21vZGUgJiBGTU9ERV9XUklURSkgfHwN
Cj4gIAkgICAgKGZpbGVfb3V0LT5mX2ZsYWdzICYgT19BUFBFTkQpKQ0KDQpUaGlzIHBhdGNoIHJl
cXVpcmVzIGFuIEFDSyBmcm9tIHRoZSBWRlMgbWFpbnRhaW5lcnMgaWYgSSdtIHRvIHB1c2ggaXQN
CnVwc3RyZWFtLg0KQ2M6IEFsIGFuZCBMaW51eC1mc2RldmVsLg0KDQotLSANClRyb25kIE15a2xl
YnVzdA0KTGludXggTkZTIGNsaWVudCBtYWludGFpbmVyLCBIYW1tZXJzcGFjZQ0KdHJvbmQubXlr
bGVidXN0QGhhbW1lcnNwYWNlLmNvbQ0KDQoNCg==

2018-10-20 00:22:21

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH v1 02/11] VFS permit cross device vfs_copy_file_range

T24gRnJpLCAyMDE4LTEwLTE5IGF0IDExOjI5IC0wNDAwLCBPbGdhIEtvcm5pZXZza2FpYSB3cm90
ZToNCj4gRnJvbTogT2xnYSBLb3JuaWV2c2thaWEgPGtvbGdhQG5ldGFwcC5jb20+DQo+IA0KPiBB
bGxvdyBjb3B5X2ZpbGVfcmFuZ2UgdG8gY29weSBiZXR3ZWVuIGRpZmZlcmVudCBzdXBlcmJsb2Nr
cyBidXQgb25seQ0KPiBvZiB0aGUgc2FtZSBmaWxlIHN5c3RlbSB0eXBlcy4gVGhpcyBmZWF0dXJl
IHdhcyBvZiBpbnRlcmVzdCB0byBDSUZTDQo+IGFzIHdlbGwgYXMgTkZTLg0KPiANCj4gVGhpcyBm
ZWF0dXJlIGlzIG5lZWRlZCBieSBORlN2NC4yIHRvIHBlcmZvcm0gZmlsZSBjb3B5IG9wZXJhdGlv
biBvbg0KPiB0aGUgc2FtZSBzZXJ2ZXIgb3IgZmlsZSBjb3B5IGJldHdlZW4gZGlmZmVyZW50IE5G
U3Y0LjIgc2VydmVycy4NCj4gDQo+IElmIGEgZmlsZSBzeXN0ZW0ncyBmaWxlb3BlcmF0aW9ucyBj
b3B5X2ZpbGVfcmFuZ2Ugb3BlcmF0aW9uIHByb2hpYml0cw0KPiBjcm9zcy1kZXZpY2UgY29waWVz
LCBmYWxsIGJhY2sgdG8gZG9fc3BsaWNlX2RpcmVjdC4gVGhpcyB3b3VsZCBiZQ0KPiBuZWVkZWQg
Zm9yIHRoZSBORlMgKGRlc3RpbmF0aW9uKSBzZXJ2ZXIgc2lkZSBpbXBsZW1lbnRhdGlvbiBvZiB0
aGUNCj4gZmlsZSBjb3B5IGFuZCBjdXJyZW50bHkgZm9yIENJRlMuDQo+IA0KPiBCZXNpZGVzIE5G
UywgdGhlcmUgaXMgb25seSAxIGltcGxlbWVudG9yIG9mIHRoZSBjb3B5X2ZpbGVfcmFuZ2UgRlMN
Cj4gb3BlcmF0aW9uIC0tIENJRlMuIENJRlMgYXNzdW1lcyBpbmNvbWluZyBmaWxlIGRlc2NyaXB0
b3JzIGFyZSBib3RoDQo+IENJRlMgYnV0IGl0IHdpbGwgY2hlY2sgaWYgdGhleSBhcmUgY29taW5n
IGZyb20gZGlmZmVyZW50IHNlcnZlcnMgYW5kDQo+IHJldHVybiBlcnJvciBjb2RlIHRvIGZhbGwg
YmFjayB0byBkb19zcGxpY2VfZGlyZWN0Lg0KPiANCj4gTkZTIHdpbGwgYWxsb3cgZm9yIGNvcGll
cyBiZXR3ZWVuIGRpZmZlcmVudCBORlMgc2VydmVycy4NCj4gDQo+IEFkZGluZyB0byB0aGUgdmZz
LnR4dCBkb2N1bWVudGF0aW9uIHRvIGV4cGxpY2l0bHkgd2FybiBhYm91dCBhbGxvd2luZw0KPiBm
b3IgZGlmZmVyZW50IHN1cGVyYmxvY2tzIG9mIHRoZSBzYW1lIGZpbGUgdHlwZSB0byBiZSBwYXNz
ZWQgaW50byB0aGUNCj4gY29weV9maWxlX3JhbmdlIGZvciB0aGUgZnV0dXJlIHVzZXJzIG9mIGNv
cHlfZmlsZV9yYW5nZSBtZXRob2QuDQo+IA0KPiBTaWduZWQtb2ZmLWJ5OiBPbGdhIEtvcm5pZXZz
a2FpYSA8a29sZ2FAbmV0YXBwLmNvbT4NCj4gLS0tDQo+ICBEb2N1bWVudGF0aW9uL2ZpbGVzeXN0
ZW1zL3Zmcy50eHQgfCAgNCArKystDQo+ICBmcy9yZWFkX3dyaXRlLmMgICAgICAgICAgICAgICAg
ICAgfCAxMyArKysrKystLS0tLS0tDQo+ICAyIGZpbGVzIGNoYW5nZWQsIDkgaW5zZXJ0aW9ucygr
KSwgOCBkZWxldGlvbnMoLSkNCj4gDQo+IGRpZmYgLS1naXQgYS9Eb2N1bWVudGF0aW9uL2ZpbGVz
eXN0ZW1zL3Zmcy50eHQNCj4gYi9Eb2N1bWVudGF0aW9uL2ZpbGVzeXN0ZW1zL3Zmcy50eHQNCj4g
aW5kZXggYTZjNmE4YS4uNWU1MjBkZSAxMDA2NDQNCj4gLS0tIGEvRG9jdW1lbnRhdGlvbi9maWxl
c3lzdGVtcy92ZnMudHh0DQo+ICsrKyBiL0RvY3VtZW50YXRpb24vZmlsZXN5c3RlbXMvdmZzLnR4
dA0KPiBAQCAtOTU4LDcgKzk1OCw5IEBAIG90aGVyd2lzZSBub3RlZC4NCj4gIA0KPiAgICBmYWxs
b2NhdGU6IGNhbGxlZCBieSB0aGUgVkZTIHRvIHByZWFsbG9jYXRlIGJsb2NrcyBvciBwdW5jaCBh
DQo+IGhvbGUuDQo+ICANCj4gLSAgY29weV9maWxlX3JhbmdlOiBjYWxsZWQgYnkgdGhlIGNvcHlf
ZmlsZV9yYW5nZSgyKSBzeXN0ZW0gY2FsbC4NCj4gKyAgY29weV9maWxlX3JhbmdlOiBjYWxsZWQg
YnkgY29weV9maWxlX3JhbmdlKDIpIHN5c3RlbSBjYWxsLiBUaGlzDQo+IG1ldGhvZA0KPiArCQkg
ICB3b3JrcyBvbiB0d28gZmlsZSBkZXNjcmlwdG9ycyB0aGF0IG1pZ2h0IHJlc2lkZSBvbg0KPiAr
CQkgICBkaWZmZXJlbnQgc3VwZXJibG9ja3Mgb2YgdGhlIHNhbWUgdHlwZSBvZiBmaWxlDQo+IHN5
c3RlbS4NCj4gIA0KPiAgICBjbG9uZV9maWxlX3JhbmdlOiBjYWxsZWQgYnkgdGhlIGlvY3RsKDIp
IHN5c3RlbSBjYWxsIGZvcg0KPiBGSUNMT05FUkFOR0UgYW5kDQo+ICAJRklDTE9ORSBjb21tYW5k
cy4NCj4gZGlmZiAtLWdpdCBhL2ZzL3JlYWRfd3JpdGUuYyBiL2ZzL3JlYWRfd3JpdGUuYw0KPiBp
bmRleCBjNjA3OTBmLi40NzRlNzQwIDEwMDY0NA0KPiAtLS0gYS9mcy9yZWFkX3dyaXRlLmMNCj4g
KysrIGIvZnMvcmVhZF93cml0ZS5jDQo+IEBAIC0xNTc4LDEwICsxNTc4LDYgQEAgc3NpemVfdCB2
ZnNfY29weV9maWxlX3JhbmdlKHN0cnVjdCBmaWxlDQo+ICpmaWxlX2luLCBsb2ZmX3QgcG9zX2lu
LA0KPiAgCSAgICAoZmlsZV9vdXQtPmZfZmxhZ3MgJiBPX0FQUEVORCkpDQo+ICAJCXJldHVybiAt
RUJBREY7DQo+ICANCj4gLQkvKiB0aGlzIGNvdWxkIGJlIHJlbGF4ZWQgb25jZSBhIG1ldGhvZCBz
dXBwb3J0cyBjcm9zcy1mcyBjb3BpZXMNCj4gKi8NCj4gLQlpZiAoaW5vZGVfaW4tPmlfc2IgIT0g
aW5vZGVfb3V0LT5pX3NiKQ0KPiAtCQlyZXR1cm4gLUVYREVWOw0KPiAtDQo+ICAJaWYgKGxlbiA9
PSAwKQ0KPiAgCQlyZXR1cm4gMDsNCj4gIA0KPiBAQCAtMTU5MSw3ICsxNTg3LDggQEAgc3NpemVf
dCB2ZnNfY29weV9maWxlX3JhbmdlKHN0cnVjdCBmaWxlDQo+ICpmaWxlX2luLCBsb2ZmX3QgcG9z
X2luLA0KPiAgCSAqIFRyeSBjbG9uaW5nIGZpcnN0LCB0aGlzIGlzIHN1cHBvcnRlZCBieSBtb3Jl
IGZpbGUgc3lzdGVtcywNCj4gYW5kDQo+ICAJICogbW9yZSBlZmZpY2llbnQgaWYgYm90aCBjbG9u
ZSBhbmQgY29weSBhcmUgc3VwcG9ydGVkIChlLmcuDQo+IE5GUykuDQo+ICAJICovDQo+IC0JaWYg
KGZpbGVfaW4tPmZfb3AtPmNsb25lX2ZpbGVfcmFuZ2UpIHsNCj4gKwlpZiAoaW5vZGVfaW4tPmlf
c2IgPT0gaW5vZGVfb3V0LT5pX3NiICYmDQo+ICsJCQlmaWxlX2luLT5mX29wLT5jbG9uZV9maWxl
X3JhbmdlKSB7DQo+ICAJCXJldCA9IGZpbGVfaW4tPmZfb3AtPmNsb25lX2ZpbGVfcmFuZ2UoZmls
ZV9pbiwgcG9zX2luLA0KPiAgCQkJCWZpbGVfb3V0LCBwb3Nfb3V0LCBsZW4pOw0KPiAgCQlpZiAo
cmV0ID09IDApIHsNCj4gQEAgLTE2MDAsMTAgKzE1OTcsMTIgQEAgc3NpemVfdCB2ZnNfY29weV9m
aWxlX3JhbmdlKHN0cnVjdCBmaWxlDQo+ICpmaWxlX2luLCBsb2ZmX3QgcG9zX2luLA0KPiAgCQl9
DQo+ICAJfQ0KPiAgDQo+IC0JaWYgKGZpbGVfb3V0LT5mX29wLT5jb3B5X2ZpbGVfcmFuZ2UpIHsN
Cj4gKwlpZiAoZmlsZV9vdXQtPmZfb3AtPmNvcHlfZmlsZV9yYW5nZSAmJg0KPiArCQkJKGZpbGVf
aW4tPmZfb3AtPmNvcHlfZmlsZV9yYW5nZSA9PQ0KPiArCQkJCWZpbGVfb3V0LT5mX29wLT5jb3B5
X2ZpbGVfcmFuZ2UpKSB7DQo+ICAJCXJldCA9IGZpbGVfb3V0LT5mX29wLT5jb3B5X2ZpbGVfcmFu
Z2UoZmlsZV9pbiwgcG9zX2luLA0KPiBmaWxlX291dCwNCj4gIAkJCQkJCSAgICAgIHBvc19vdXQs
IGxlbiwNCj4gZmxhZ3MpOw0KPiAtCQlpZiAocmV0ICE9IC1FT1BOT1RTVVBQKQ0KPiArCQlpZiAo
cmV0ICE9IC1FT1BOT1RTVVBQICYmIHJldCAhPSAtRVhERVYpDQo+ICAJCQlnb3RvIGRvbmU7DQo+
ICAJfQ0KPiAgDQoNCkRpdHRvLiAgVGhpcyBhbHNvIG5lZWRzIGFuIEFDSyBmcm9tIHRoZSBWRlMg
bWFpbnRhaW5lcnMuDQoNCkNjOiBBbCBhbmQgbGludXgtZnNkZXZlbA0KLS0gDQpUcm9uZCBNeWts
ZWJ1c3QNCkxpbnV4IE5GUyBjbGllbnQgbWFpbnRhaW5lciwgSGFtbWVyc3BhY2UNCnRyb25kLm15
a2xlYnVzdEBoYW1tZXJzcGFjZS5jb20NCg0KDQo=

2018-10-20 00:33:48

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v1 02/11] VFS permit cross device vfs_copy_file_range

On Fri, Oct 19, 2018 at 12:14 PM Trond Myklebust
<[email protected]> wrote:
>
> On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > From: Olga Kornievskaia <[email protected]>
> >
> > Allow copy_file_range to copy between different superblocks but only
> > of the same file system types. This feature was of interest to CIFS
> > as well as NFS.
> >
> > This feature is needed by NFSv4.2 to perform file copy operation on
> > the same server or file copy between different NFSv4.2 servers.
> >
> > If a file system's fileoperations copy_file_range operation prohibits
> > cross-device copies, fall back to do_splice_direct. This would be
> > needed for the NFS (destination) server side implementation of the
> > file copy and currently for CIFS.
> >
> > Besides NFS, there is only 1 implementor of the copy_file_range FS
> > operation -- CIFS. CIFS assumes incoming file descriptors are both
> > CIFS but it will check if they are coming from different servers and
> > return error code to fall back to do_splice_direct.
> >
> > NFS will allow for copies between different NFS servers.
> >
> > Adding to the vfs.txt documentation to explicitly warn about allowing
> > for different superblocks of the same file type to be passed into the
> > copy_file_range for the future users of copy_file_range method.
> >
> > Signed-off-by: Olga Kornievskaia <[email protected]>
> > ---
> > Documentation/filesystems/vfs.txt | 4 +++-
> > fs/read_write.c | 13 ++++++-------
> > 2 files changed, 9 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/filesystems/vfs.txt
> > b/Documentation/filesystems/vfs.txt
> > index a6c6a8a..5e520de 100644
> > --- a/Documentation/filesystems/vfs.txt
> > +++ b/Documentation/filesystems/vfs.txt
> > @@ -958,7 +958,9 @@ otherwise noted.
> >
> > fallocate: called by the VFS to preallocate blocks or punch a
> > hole.
> >
> > - copy_file_range: called by the copy_file_range(2) system call.
> > + copy_file_range: called by copy_file_range(2) system call. This
> > method
> > + works on two file descriptors that might reside on
> > + different superblocks of the same type of file
> > system.
> >
> > clone_file_range: called by the ioctl(2) system call for
> > FICLONERANGE and
> > FICLONE commands.
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index c60790f..474e740 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1578,10 +1578,6 @@ ssize_t vfs_copy_file_range(struct file
> > *file_in, loff_t pos_in,
> > (file_out->f_flags & O_APPEND))
> > return -EBADF;
> >
> > - /* this could be relaxed once a method supports cross-fs copies
> > */
> > - if (inode_in->i_sb != inode_out->i_sb)
> > - return -EXDEV;
> > -
> > if (len == 0)
> > return 0;
> >
> > @@ -1591,7 +1587,8 @@ ssize_t vfs_copy_file_range(struct file
> > *file_in, loff_t pos_in,
> > * Try cloning first, this is supported by more file systems,
> > and
> > * more efficient if both clone and copy are supported (e.g.
> > NFS).
> > */
> > - if (file_in->f_op->clone_file_range) {
> > + if (inode_in->i_sb == inode_out->i_sb &&
> > + file_in->f_op->clone_file_range) {
> > ret = file_in->f_op->clone_file_range(file_in, pos_in,
> > file_out, pos_out, len);
> > if (ret == 0) {
> > @@ -1600,10 +1597,12 @@ ssize_t vfs_copy_file_range(struct file
> > *file_in, loff_t pos_in,
> > }
> > }
> >
> > - if (file_out->f_op->copy_file_range) {
> > + if (file_out->f_op->copy_file_range &&
> > + (file_in->f_op->copy_file_range ==
> > + file_out->f_op->copy_file_range)) {
> > ret = file_out->f_op->copy_file_range(file_in, pos_in,
> > file_out,
> > pos_out, len,
> > flags);
> > - if (ret != -EOPNOTSUPP)
> > + if (ret != -EOPNOTSUPP && ret != -EXDEV)
> > goto done;
> > }
> >
>
> Ditto. This also needs an ACK from the VFS maintainers.
>
> Cc: Al and linux-fsdevel

Yeah I sent VFS as separate patches to the linux-fsdevel and included
other folks (glibc/CIFS?) that were interested in this functionality.
Apologizes for double sent. It was easier to send this as a patch
series to just linux-nfs first.

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2018-10-21 14:29:14

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> From: Anna Schumaker <[email protected]>
>
> Signed-off-by: Anna Schumaker <[email protected]>
> ---
> fs/read_write.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 39b4a21..c60790f 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (unlikely(ret))
> return ret;
>
> + if (pos_in >= i_size_read(inode_in))
> + return -EINVAL;
> +
> if (!(file_in->f_mode & FMODE_READ) ||
> !(file_out->f_mode & FMODE_WRITE) ||
> (file_out->f_flags & O_APPEND))

The patch description could use a bit more fleshing-out. The
copy_file_range manpage says:

EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.

So I guess this is intended to satisfy that requirement? If so,
i_size_read is just going to return whatever is in inode->isize.

Could a copy_file_range call end up getting issued to copy from a file
that is already open on a range that it doesn't know about yet? i.e.
where the inode cache has not yet been updated.

It seems like that could on network filesystems (like NFS). Would this
be better handled in ->copy_file_range instead, where the driver could
make a better determination of the file size?
--
Jeff Layton <[email protected]>


2018-10-21 14:44:30

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] NFS test for intra vs inter COPY

On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> From: Olga Kornievskaia <[email protected]>
>
> Signed-off-by: Olga Kornievskaia <[email protected]>
> ---
> fs/nfs/nfs42.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/fs/nfs/nfs42.h b/fs/nfs/nfs42.h
> index 19ec38f8..5abff4d 100644
> --- a/fs/nfs/nfs42.h
> +++ b/fs/nfs/nfs42.h
> @@ -6,6 +6,7 @@
> #ifndef __LINUX_FS_NFS_NFS4_2_H
> #define __LINUX_FS_NFS_NFS4_2_H
>
> +#include <linux/sunrpc/addr.h>
> /*
> * FIXME: four LAYOUTSTATS calls per compound at most! Do we need to support
> * more? Need to consider not to pre-alloc too much for a compound.
> @@ -21,4 +22,14 @@ int nfs42_proc_layoutstats_generic(struct nfs_server *,
> struct nfs42_layoutstat_data *);
> int nfs42_proc_clone(struct file *, struct file *, loff_t, loff_t, loff_t);
>
> +static inline bool nfs42_files_from_same_server(struct file *in,
> + struct file *out)
> +{
> + struct nfs_client *c_in = (NFS_SERVER(file_inode(in)))->nfs_client;
> + struct nfs_client *c_out = (NFS_SERVER(file_inode(out)))->nfs_client;
> +
> + return rpc_cmp_addr((struct sockaddr *)&c_in->cl_addr,
> + (struct sockaddr *)&c_out->cl_addr);
> +}
> +
> #endif /* __LINUX_FS_NFS_NFS4_2_H */

What about trunking? If the addresses don't match, should you compare
cl_serverowner or something too? Or maybe just do that instead of
testing addresses?

It's usually best to add infrastructure like this in the same patch with
an initial caller so we can see how it's intended to be used.
--
Jeff Layton <[email protected]>


2018-10-22 17:48:36

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] NFS test for intra vs inter COPY

On Sun, Oct 21, 2018 at 10:44 AM Jeff Layton <[email protected]> wrote:
>
> On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > From: Olga Kornievskaia <[email protected]>
> >
> > Signed-off-by: Olga Kornievskaia <[email protected]>
> > ---
> > fs/nfs/nfs42.h | 11 +++++++++++
> > 1 file changed, 11 insertions(+)
> >
> > diff --git a/fs/nfs/nfs42.h b/fs/nfs/nfs42.h
> > index 19ec38f8..5abff4d 100644
> > --- a/fs/nfs/nfs42.h
> > +++ b/fs/nfs/nfs42.h
> > @@ -6,6 +6,7 @@
> > #ifndef __LINUX_FS_NFS_NFS4_2_H
> > #define __LINUX_FS_NFS_NFS4_2_H
> >
> > +#include <linux/sunrpc/addr.h>
> > /*
> > * FIXME: four LAYOUTSTATS calls per compound at most! Do we need to support
> > * more? Need to consider not to pre-alloc too much for a compound.
> > @@ -21,4 +22,14 @@ int nfs42_proc_layoutstats_generic(struct nfs_server *,
> > struct nfs42_layoutstat_data *);
> > int nfs42_proc_clone(struct file *, struct file *, loff_t, loff_t, loff_t);
> >
> > +static inline bool nfs42_files_from_same_server(struct file *in,
> > + struct file *out)
> > +{
> > + struct nfs_client *c_in = (NFS_SERVER(file_inode(in)))->nfs_client;
> > + struct nfs_client *c_out = (NFS_SERVER(file_inode(out)))->nfs_client;
> > +
> > + return rpc_cmp_addr((struct sockaddr *)&c_in->cl_addr,
> > + (struct sockaddr *)&c_out->cl_addr);
> > +}
> > +
> > #endif /* __LINUX_FS_NFS_NFS4_2_H */
>
> What about trunking? If the addresses don't match, should you compare
> cl_serverowner or something too? Or maybe just do that instead of
> testing addresses?
>
> It's usually best to add infrastructure like this in the same patch with
> an initial caller so we can see how it's intended to be used.

Thanks, I will change it to use nfs4_check_serverowner_major_id()
instead and merge it with one of the callers.

> --
> Jeff Layton <[email protected]>
>

2018-10-22 18:33:13

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <[email protected]> wrote:
>
> On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > From: Anna Schumaker <[email protected]>
> >
> > Signed-off-by: Anna Schumaker <[email protected]>
> > ---
> > fs/read_write.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 39b4a21..c60790f 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > if (unlikely(ret))
> > return ret;
> >
> > + if (pos_in >= i_size_read(inode_in))
> > + return -EINVAL;
> > +
> > if (!(file_in->f_mode & FMODE_READ) ||
> > !(file_out->f_mode & FMODE_WRITE) ||
> > (file_out->f_flags & O_APPEND))
>
> The patch description could use a bit more fleshing-out. The
> copy_file_range manpage says:
>
> EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
>
> So I guess this is intended to satisfy that requirement?

I agree the description of the patch is poor. It sort of falls under
the the man page's description of range beyond the end of the source
file. But in NFSv4.2, there is an explicit wording for the validity of
the input parameters and having an input source offset that's beyond
the end of the file is what this patch is attempting to check.

> If so,
> i_size_read is just going to return whatever is in inode->isize.

> Could a copy_file_range call end up getting issued to copy from a file
> that is already open on a range that it doesn't know about yet? i.e.
> where the inode cache has not yet been updated.

I thought that with NFSv4 cache consistency, the inode->isize is
accurate. If previous open had a read delegation, any modification on
a server would trigger a CB_RECALL and the open for the copy offload
would retrieve the latest size. In case of no delegations, the open
retrieves the latest size and the call to copy_file_range() would have
an update size.

> It seems like that could on network filesystems (like NFS). Would this
> be better handled in ->copy_file_range instead, where the driver could
> make a better determination of the file size?

I'm not opposed to moving the size check into the NFS's copy_file_size
(again in my opinion NFS attribute cache has the same file size as the
inode's size). I think the thought was that such check should be done
at the VFS layer as oppose to doing it by each of the file systems.

> --
> Jeff Layton <[email protected]>
>

2018-10-22 23:23:35

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <[email protected]> wrote:
> >
> > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > From: Anna Schumaker <[email protected]>
> > >
> > > Signed-off-by: Anna Schumaker <[email protected]>
> > > ---
> > > fs/read_write.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index 39b4a21..c60790f 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > if (unlikely(ret))
> > > return ret;
> > >
> > > + if (pos_in >= i_size_read(inode_in))
> > > + return -EINVAL;
> > > +
> > > if (!(file_in->f_mode & FMODE_READ) ||
> > > !(file_out->f_mode & FMODE_WRITE) ||
> > > (file_out->f_flags & O_APPEND))
> >
> > The patch description could use a bit more fleshing-out. The
> > copy_file_range manpage says:
> >
> > EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> >
> > So I guess this is intended to satisfy that requirement?
>
> I agree the description of the patch is poor. It sort of falls under
> the the man page's description of range beyond the end of the source
> file. But in NFSv4.2, there is an explicit wording for the validity of
> the input parameters and having an input source offset that's beyond
> the end of the file is what this patch is attempting to check.
>

Side note:

One has to wonder why they decided to make that an -EINVAL condition?
The system call returns ssize_t. Why not just return a fewer number of
bytes in that situation?

In fact, the RETURN VALUE section of the manpage says:

Upon successful completion, copy_file_range() will return the number of
bytes copied between files. This could be less than the length origi‐
nally requested.

Under what conditions would that occur that do not include the file
being shorter than the range you wanted to copy?

I wonder if we ought to lobby to get that changed.

> > If so,
> > i_size_read is just going to return whatever is in inode->isize.
> > Could a copy_file_range call end up getting issued to copy from a file
> > that is already open on a range that it doesn't know about yet? i.e.
> > where the inode cache has not yet been updated.
>
> I thought that with NFSv4 cache consistency, the inode->isize is
> accurate. If previous open had a read delegation, any modification on
> a server would trigger a CB_RECALL and the open for the copy offload
> would retrieve the latest size. In case of no delegations, the open
> retrieves the latest size and the call to copy_file_range() would have
> an update size.
>

> It seems like that could on network filesystems (like NFS). Would this
> > be better handled in ->copy_file_range instead, where the driver could
> > make a better determination of the file size?
>
> I'm not opposed to moving the size check into the NFS's copy_file_size
> (again in my opinion NFS attribute cache has the same file size as the
> inode's size). I think the thought was that such check should be done
> at the VFS layer as oppose to doing it by each of the file systems.
>
>

The attribute cache is not revalidated before the i_size is fetched with
i_size_read. You're just reading what happens to be in the in-memory
inode structure. So both clients have the file open already, and neither
has a delegation:

client 1: fetches attributes from file and sees a size of 1000
client 2: writes 20 bytes at offset 1000
client 1: calls copy file range to copy 1020 bytes starting at offset 0

If client1 didn't get an attribute update before the copy_file_range
call came in, then it would still think the size was 1000 and fail the
operation. It may even be many seconds before client1 sees the updated
size.

You could argue that we're not using locking here so you're just subject
to normal open-to-close cache coherency behavior, but that's rather "not
nice".

I think we probably ought to also push this check down into the
filesystem operations as well, and have copy_file_range ensure that the
attribute cache is updated. We're dealing with copy offload here so
doing an extra GETATTR beforehand shouldn't be terribly costly.
--
Jeff Layton <[email protected]>


2018-10-23 15:50:33

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v1 05/11] NFS add COPY_NOTIFY operation

Hi Olga,

On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> From: Olga Kornievskaia <[email protected]>
>
> Try using the delegation stateid, then the open stateid.
>
> Only NL4_NETATTR, No support for NL4_NAME and NL4_URL.
> Allow only one source server address to be returned for now.
>
> Signed-off-by: Andy Adamson <[email protected]>
> Signed-off-by: Olga Kornievskaia <[email protected]>
> ---
> fs/nfs/nfs42.h | 3 +-
> fs/nfs/nfs42proc.c | 91 +++++++++++++++++++++++
> fs/nfs/nfs42xdr.c | 181
> ++++++++++++++++++++++++++++++++++++++++++++++
> fs/nfs/nfs4file.c | 17 +++++
> fs/nfs/nfs4proc.c | 1 +
> fs/nfs/nfs4xdr.c | 1 +
> include/linux/nfs4.h | 1 +
> include/linux/nfs_fs_sb.h | 1 +
> include/linux/nfs_xdr.h | 16 ++++
> 9 files changed, 311 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/nfs42.h b/fs/nfs/nfs42.h
> index 5abff4d..bcd0222 100644
> --- a/fs/nfs/nfs42.h
> +++ b/fs/nfs/nfs42.h
> @@ -21,7 +21,8 @@
> int nfs42_proc_layoutstats_generic(struct nfs_server *,
> struct nfs42_layoutstat_data *);
> int nfs42_proc_clone(struct file *, struct file *, loff_t, loff_t, loff_t);
> -
> +int nfs42_proc_copy_notify(struct file *, struct file *,
> + struct nfs42_copy_notify_res *);

Can you wrap this in a CONFIG_NFS_V4_2 check? Otherwise the compiler will
complain that nfs42_copy_notify_res doesn't exist.

Thanks,
Anna

> static inline bool nfs42_files_from_same_server(struct file *in,
> struct file *out)
> {
> diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
> index ac5b784..b1c57a4 100644
> --- a/fs/nfs/nfs42proc.c
> +++ b/fs/nfs/nfs42proc.c
> @@ -3,6 +3,7 @@
> * Copyright (c) 2014 Anna Schumaker <[email protected]>
> */
> #include <linux/fs.h>
> +#include <linux/sunrpc/addr.h>
> #include <linux/sunrpc/sched.h>
> #include <linux/nfs.h>
> #include <linux/nfs3.h>
> @@ -15,10 +16,30 @@
> #include "pnfs.h"
> #include "nfs4session.h"
> #include "internal.h"
> +#include "delegation.h"
>
> #define NFSDBG_FACILITY NFSDBG_PROC
> static int nfs42_do_offload_cancel_async(struct file *dst, nfs4_stateid
> *std);
>
> +static void nfs42_set_netaddr(struct file *filep, struct nfs42_netaddr
> *naddr)
> +{
> + struct nfs_client *clp = (NFS_SERVER(file_inode(filep)))->nfs_client;
> + unsigned short port = 2049;
> +
> + rcu_read_lock();
> + naddr->netid_len = scnprintf(naddr->netid,
> + sizeof(naddr->netid), "%s",
> + rpc_peeraddr2str(clp->cl_rpcclient,
> + RPC_DISPLAY_NETID));
> + naddr->addr_len = scnprintf(naddr->addr,
> + sizeof(naddr->addr),
> + "%s.%u.%u",
> + rpc_peeraddr2str(clp->cl_rpcclient,
> + RPC_DISPLAY_ADDR),
> + port >> 8, port & 255);
> + rcu_read_unlock();
> +}
> +
> static int _nfs42_proc_fallocate(struct rpc_message *msg, struct file *filep,
> struct nfs_lock_context *lock, loff_t offset, loff_t len)
> {
> @@ -461,6 +482,76 @@ static int nfs42_do_offload_cancel_async(struct file
> *dst,
> return status;
> }
>
> +int _nfs42_proc_copy_notify(struct file *src, struct file *dst,
> + struct nfs42_copy_notify_args *args,
> + struct nfs42_copy_notify_res *res)
> +{
> + struct nfs_server *src_server = NFS_SERVER(file_inode(src));
> + struct rpc_message msg = {
> + .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_COPY_NOTIFY],
> + .rpc_argp = args,
> + .rpc_resp = res,
> + };
> + int status;
> + struct nfs_open_context *ctx;
> + struct nfs_lock_context *l_ctx;
> +
> + ctx = get_nfs_open_context(nfs_file_open_context(src));
> + l_ctx = nfs_get_lock_context(ctx);
> + if (IS_ERR(l_ctx))
> + return PTR_ERR(l_ctx);
> +
> + status = nfs4_set_rw_stateid(&args->cna_src_stateid, ctx, l_ctx,
> + FMODE_READ);
> + nfs_put_lock_context(l_ctx);
> + if (status)
> + return status;
> +
> + status = nfs4_call_sync(src_server->client, src_server, &msg,
> + &args->cna_seq_args, &res->cnr_seq_res, 0);
> + if (status == -ENOTSUPP)
> + src_server->caps &= ~NFS_CAP_COPY_NOTIFY;
> +
> + put_nfs_open_context(nfs_file_open_context(src));
> + return status;
> +}
> +
> +int nfs42_proc_copy_notify(struct file *src, struct file *dst,
> + struct nfs42_copy_notify_res *res)
> +{
> + struct nfs_server *src_server = NFS_SERVER(file_inode(src));
> + struct nfs42_copy_notify_args *args;
> + struct nfs4_exception exception = {
> + .inode = file_inode(src),
> + };
> + int status;
> +
> + if (!(src_server->caps & NFS_CAP_COPY_NOTIFY))
> + return -EOPNOTSUPP;
> +
> + args = kzalloc(sizeof(struct nfs42_copy_notify_args), GFP_NOFS);
> + if (args == NULL)
> + return -ENOMEM;
> +
> + args->cna_src_fh = NFS_FH(file_inode(src)),
> + args->cna_dst.nl4_type = NL4_NETADDR;
> + nfs42_set_netaddr(dst, &args->cna_dst.u.nl4_addr);
> + exception.stateid = &args->cna_src_stateid;
> +
> + do {
> + status = _nfs42_proc_copy_notify(src, dst, args, res);
> + if (status == -ENOTSUPP) {
> + status = -EOPNOTSUPP;
> + goto out;
> + }
> + status = nfs4_handle_exception(src_server, status, &exception);
> + } while (exception.retry);
> +
> +out:
> + kfree(args);
> + return status;
> +}
> +
> static loff_t _nfs42_proc_llseek(struct file *filep,
> struct nfs_lock_context *lock, loff_t offset, int whence)
> {
> diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
> index 69f72ed..e6e7cbf 100644
> --- a/fs/nfs/nfs42xdr.c
> +++ b/fs/nfs/nfs42xdr.c
> @@ -29,6 +29,16 @@
> #define encode_offload_cancel_maxsz (op_encode_hdr_maxsz + \
> XDR_QUADLEN(NFS4_STATEID_SIZE))
> #define decode_offload_cancel_maxsz (op_decode_hdr_maxsz)
> +#define encode_copy_notify_maxsz (op_encode_hdr_maxsz + \
> + XDR_QUADLEN(NFS4_STATEID_SIZE) + \
> + 1 + /* nl4_type */ \
> + 1 + XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
> +#define decode_copy_notify_maxsz (op_decode_hdr_maxsz + \
> + 3 + /* cnr_lease_time */\
> + XDR_QUADLEN(NFS4_STATEID_SIZE) + \
> + 1 + /* Support 1 cnr_source_server */\
> + 1 + /* nl4_type */ \
> + 1 + XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
> #define encode_deallocate_maxsz (op_encode_hdr_maxsz + \
> encode_fallocate_maxsz)
> #define decode_deallocate_maxsz (op_decode_hdr_maxsz)
> @@ -84,6 +94,12 @@
> #define NFS4_dec_offload_cancel_sz (compound_decode_hdr_maxsz + \
> decode_putfh_maxsz + \
> decode_offload_cancel_maxsz)
> +#define NFS4_enc_copy_notify_sz (compound_encode_hdr_maxsz + \
> + encode_putfh_maxsz + \
> + encode_copy_notify_maxsz)
> +#define NFS4_dec_copy_notify_sz (compound_decode_hdr_maxsz + \
> + decode_putfh_maxsz + \
> + decode_copy_notify_maxsz)
> #define NFS4_enc_deallocate_sz (compound_encode_hdr_maxsz + \
> encode_putfh_maxsz + \
> encode_deallocate_maxsz + \
> @@ -137,6 +153,25 @@ static void encode_allocate(struct xdr_stream *xdr,
> encode_fallocate(xdr, args);
> }
>
> +static void encode_nl4_server(struct xdr_stream *xdr, const struct nl4_server
> *ns)
> +{
> + encode_uint32(xdr, ns->nl4_type);
> + switch (ns->nl4_type) {
> + case NL4_NAME:
> + case NL4_URL:
> + encode_string(xdr, ns->u.nl4_str_sz, ns->u.nl4_str);
> + break;
> + case NL4_NETADDR:
> + encode_string(xdr, ns->u.nl4_addr.netid_len,
> + ns->u.nl4_addr.netid);
> + encode_string(xdr, ns->u.nl4_addr.addr_len,
> + ns->u.nl4_addr.addr);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + }
> +}
> +
> static void encode_copy(struct xdr_stream *xdr,
> const struct nfs42_copy_args *args,
> struct compound_hdr *hdr)
> @@ -162,6 +197,15 @@ static void encode_offload_cancel(struct xdr_stream *xdr,
> encode_nfs4_stateid(xdr, &args->osa_stateid);
> }
>
> +static void encode_copy_notify(struct xdr_stream *xdr,
> + const struct nfs42_copy_notify_args *args,
> + struct compound_hdr *hdr)
> +{
> + encode_op_hdr(xdr, OP_COPY_NOTIFY, decode_copy_notify_maxsz, hdr);
> + encode_nfs4_stateid(xdr, &args->cna_src_stateid);
> + encode_nl4_server(xdr, &args->cna_dst);
> +}
> +
> static void encode_deallocate(struct xdr_stream *xdr,
> const struct nfs42_falloc_args *args,
> struct compound_hdr *hdr)
> @@ -298,6 +342,25 @@ static void nfs4_xdr_enc_offload_cancel(struct rpc_rqst
> *req,
> }
>
> /*
> + * Encode COPY_NOTIFY request
> + */
> +static void nfs4_xdr_enc_copy_notify(struct rpc_rqst *req,
> + struct xdr_stream *xdr,
> + const void *data)
> +{
> + const struct nfs42_copy_notify_args *args = data;
> + struct compound_hdr hdr = {
> + .minorversion = nfs4_xdr_minorversion(&args->cna_seq_args),
> + };
> +
> + encode_compound_hdr(xdr, req, &hdr);
> + encode_sequence(xdr, &args->cna_seq_args, &hdr);
> + encode_putfh(xdr, args->cna_src_fh, &hdr);
> + encode_copy_notify(xdr, args, &hdr);
> + encode_nops(&hdr);
> +}
> +
> +/*
> * Encode DEALLOCATE request
> */
> static void nfs4_xdr_enc_deallocate(struct rpc_rqst *req,
> @@ -416,6 +479,58 @@ static int decode_write_response(struct xdr_stream *xdr,
> return -EIO;
> }
>
> +static int decode_nl4_server(struct xdr_stream *xdr, struct nl4_server *ns)
> +{
> + struct nfs42_netaddr *naddr;
> + uint32_t dummy;
> + char *dummy_str;
> + __be32 *p;
> + int status;
> +
> + /* nl_type */
> + p = xdr_inline_decode(xdr, 4);
> + if (unlikely(!p))
> + return -EIO;
> + ns->nl4_type = be32_to_cpup(p);
> + switch (ns->nl4_type) {
> + case NL4_NAME:
> + case NL4_URL:
> + status = decode_opaque_inline(xdr, &dummy, &dummy_str);
> + if (unlikely(status))
> + return status;
> + if (unlikely(dummy > NFS4_OPAQUE_LIMIT))
> + return -EIO;
> + memcpy(&ns->u.nl4_str, dummy_str, dummy);
> + ns->u.nl4_str_sz = dummy;
> + break;
> + case NL4_NETADDR:
> + naddr = &ns->u.nl4_addr;
> +
> + /* netid string */
> + status = decode_opaque_inline(xdr, &dummy, &dummy_str);
> + if (unlikely(status))
> + return status;
> + if (unlikely(dummy > RPCBIND_MAXNETIDLEN))
> + return -EIO;
> + naddr->netid_len = dummy;
> + memcpy(naddr->netid, dummy_str, naddr->netid_len);
> +
> + /* uaddr string */
> + status = decode_opaque_inline(xdr, &dummy, &dummy_str);
> + if (unlikely(status))
> + return status;
> + if (unlikely(dummy > RPCBIND_MAXUADDRLEN))
> + return -EIO;
> + naddr->addr_len = dummy;
> + memcpy(naddr->addr, dummy_str, naddr->addr_len);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + return -EIO;
> + }
> + return 0;
> +}
> +
> static int decode_copy_requirements(struct xdr_stream *xdr,
> struct nfs42_copy_res *res) {
> __be32 *p;
> @@ -458,6 +573,46 @@ static int decode_offload_cancel(struct xdr_stream *xdr,
> return decode_op_hdr(xdr, OP_OFFLOAD_CANCEL);
> }
>
> +static int decode_copy_notify(struct xdr_stream *xdr,
> + struct nfs42_copy_notify_res *res)
> +{
> + __be32 *p;
> + int status, count;
> +
> + status = decode_op_hdr(xdr, OP_COPY_NOTIFY);
> + if (status)
> + return status;
> + /* cnr_lease_time */
> + p = xdr_inline_decode(xdr, 12);
> + if (unlikely(!p))
> + goto out_overflow;
> + p = xdr_decode_hyper(p, &res->cnr_lease_time.seconds);
> + res->cnr_lease_time.nseconds = be32_to_cpup(p);
> +
> + status = decode_opaque_fixed(xdr, &res->cnr_stateid, NFS4_STATEID_SIZE);
> + if (unlikely(status))
> + goto out_overflow;
> +
> + /* number of source addresses */
> + p = xdr_inline_decode(xdr, 4);
> + if (unlikely(!p))
> + goto out_overflow;
> +
> + count = be32_to_cpup(p);
> + if (count > 1)
> + pr_warn("NFS: %s: nsvr %d > Supported. Use first servers\n",
> + __func__, count);
> +
> + status = decode_nl4_server(xdr, &res->cnr_src);
> + if (unlikely(status))
> + goto out_overflow;
> + return 0;
> +
> +out_overflow:
> + print_overflow_msg(__func__, xdr);
> + return -EIO;
> +}
> +
> static int decode_deallocate(struct xdr_stream *xdr, struct nfs42_falloc_res
> *res)
> {
> return decode_op_hdr(xdr, OP_DEALLOCATE);
> @@ -585,6 +740,32 @@ static int nfs4_xdr_dec_offload_cancel(struct rpc_rqst
> *rqstp,
> }
>
> /*
> + * Decode COPY_NOTIFY response
> + */
> +static int nfs4_xdr_dec_copy_notify(struct rpc_rqst *rqstp,
> + struct xdr_stream *xdr,
> + void *data)
> +{
> + struct nfs42_copy_notify_res *res = data;
> + struct compound_hdr hdr;
> + int status;
> +
> + status = decode_compound_hdr(xdr, &hdr);
> + if (status)
> + goto out;
> + status = decode_sequence(xdr, &res->cnr_seq_res, rqstp);
> + if (status)
> + goto out;
> + status = decode_putfh(xdr);
> + if (status)
> + goto out;
> + status = decode_copy_notify(xdr, res);
> +
> +out:
> + return status;
> +}
> +
> +/*
> * Decode DEALLOCATE request
> */
> static int nfs4_xdr_dec_deallocate(struct rpc_rqst *rqstp,
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 4288a6e..3a888f4 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -133,12 +133,29 @@ static ssize_t nfs4_copy_file_range(struct file
> *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t count, unsigned int flags)
> {
> + struct nfs42_copy_notify_res *cn_resp = NULL;
> ssize_t ret;
>
> if (file_inode(file_in) == file_inode(file_out))
> return -EINVAL;
> retry:
> + if (nfs42_files_from_same_server(file_in, file_out)) { /* Intra-ssc */
> + if (file_in->f_op != file_out->f_op)
> + return -EXDEV;
> + } else { /* Inter-ssc */
> + cn_resp = kzalloc(sizeof(struct nfs42_copy_notify_res),
> + GFP_NOFS);
> + if (unlikely(cn_resp == NULL))
> + return -ENOMEM;
> +
> + ret = nfs42_proc_copy_notify(file_in, file_out, cn_resp);
> + if (ret)
> + goto out;
> + }
> ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
> +
> +out:
> + kfree(cn_resp);
> if (ret == -EAGAIN)
> goto retry;
> return ret;
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index db84b4a..fec6e6b 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -9692,6 +9692,7 @@ static bool nfs4_match_stateid(const nfs4_stateid *s1,
> | NFS_CAP_ALLOCATE
> | NFS_CAP_COPY
> | NFS_CAP_OFFLOAD_CANCEL
> + | NFS_CAP_COPY_NOTIFY
> | NFS_CAP_DEALLOCATE
> | NFS_CAP_SEEK
> | NFS_CAP_LAYOUTSTATS
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index b7bde12..2163900 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -7790,6 +7790,7 @@ int nfs4_decode_dirent(struct xdr_stream *xdr, struct
> nfs_entry *entry,
> PROC42(CLONE, enc_clone, dec_clone),
> PROC42(COPY, enc_copy, dec_copy),
> PROC42(OFFLOAD_CANCEL, enc_offload_cancel, dec_offload_cancel),
> + PROC42(COPY_NOTIFY, enc_copy_notify, dec_copy_notify),
> PROC(LOOKUPP, enc_lookupp, dec_lookupp),
> };
>
> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> index 4d76f87..d80b25e 100644
> --- a/include/linux/nfs4.h
> +++ b/include/linux/nfs4.h
> @@ -537,6 +537,7 @@ enum {
> NFSPROC4_CLNT_CLONE,
> NFSPROC4_CLNT_COPY,
> NFSPROC4_CLNT_OFFLOAD_CANCEL,
> + NFSPROC4_CLNT_COPY_NOTIFY,
>
> NFSPROC4_CLNT_LOOKUPP,
> };
> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> index 0fc0b91..e5d89ff 100644
> --- a/include/linux/nfs_fs_sb.h
> +++ b/include/linux/nfs_fs_sb.h
> @@ -261,5 +261,6 @@ struct nfs_server {
> #define NFS_CAP_CLONE (1U << 23)
> #define NFS_CAP_COPY (1U << 24)
> #define NFS_CAP_OFFLOAD_CANCEL (1U << 25)
> +#define NFS_CAP_COPY_NOTIFY (1U << 26)
>
> #endif
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index 0e01625..dfc59bc 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -1428,6 +1428,22 @@ struct nfs42_offload_status_res {
> int osr_status;
> };
>
> +struct nfs42_copy_notify_args {
> + struct nfs4_sequence_args cna_seq_args;
> +
> + struct nfs_fh *cna_src_fh;
> + nfs4_stateid cna_src_stateid;
> + struct nl4_server cna_dst;
> +};
> +
> +struct nfs42_copy_notify_res {
> + struct nfs4_sequence_res cnr_seq_res;
> +
> + struct nfstime4 cnr_lease_time;
> + nfs4_stateid cnr_stateid;
> + struct nl4_server cnr_src;
> +};
> +
> struct nfs42_seek_args {
> struct nfs4_sequence_args seq_args;
>

2018-10-23 16:51:11

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <[email protected]> wrote:
>
> On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <[email protected]> wrote:
> > >
> > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > From: Anna Schumaker <[email protected]>
> > > >
> > > > Signed-off-by: Anna Schumaker <[email protected]>
> > > > ---
> > > > fs/read_write.c | 3 +++
> > > > 1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > index 39b4a21..c60790f 100644
> > > > --- a/fs/read_write.c
> > > > +++ b/fs/read_write.c
> > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > if (unlikely(ret))
> > > > return ret;
> > > >
> > > > + if (pos_in >= i_size_read(inode_in))
> > > > + return -EINVAL;
> > > > +
> > > > if (!(file_in->f_mode & FMODE_READ) ||
> > > > !(file_out->f_mode & FMODE_WRITE) ||
> > > > (file_out->f_flags & O_APPEND))
> > >
> > > The patch description could use a bit more fleshing-out. The
> > > copy_file_range manpage says:
> > >
> > > EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > >
> > > So I guess this is intended to satisfy that requirement?
> >
> > I agree the description of the patch is poor. It sort of falls under
> > the the man page's description of range beyond the end of the source
> > file. But in NFSv4.2, there is an explicit wording for the validity of
> > the input parameters and having an input source offset that's beyond
> > the end of the file is what this patch is attempting to check.
> >
>
> Side note:
>
> One has to wonder why they decided to make that an -EINVAL condition?

To me that sounds like a correct use of -EINVAL error to check for
invalid arguments.

You can argue that since there were no bytes to copy then returning 0
would be an appropriate "size" of the copy. However, I would argue how
would a caller distinguish this 0 size which really means don't try to
copy more vs another case when copy_file_range() returned less byte
(0) and so that caller should loop to get the rest.

> The system call returns ssize_t. Why not just return a fewer number of
> bytes in that situation?
>
> In fact, the RETURN VALUE section of the manpage says:
>
> Upon successful completion, copy_file_range() will return the number of
> bytes copied between files. This could be less than the length origi‐
> nally requested.
>
> Under what conditions would that occur that do not include the file
> being shorter than the range you wanted to copy?
>
> I wonder if we ought to lobby to get that changed.

Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
"written in stone".

>
> > > If so,
> > > i_size_read is just going to return whatever is in inode->isize.
> > > Could a copy_file_range call end up getting issued to copy from a file
> > > that is already open on a range that it doesn't know about yet? i.e.
> > > where the inode cache has not yet been updated.
> >
> > I thought that with NFSv4 cache consistency, the inode->isize is
> > accurate. If previous open had a read delegation, any modification on
> > a server would trigger a CB_RECALL and the open for the copy offload
> > would retrieve the latest size. In case of no delegations, the open
> > retrieves the latest size and the call to copy_file_range() would have
> > an update size.
> >
>
> > It seems like that could on network filesystems (like NFS). Would this
> > > be better handled in ->copy_file_range instead, where the driver could
> > > make a better determination of the file size?
> >
> > I'm not opposed to moving the size check into the NFS's copy_file_size
> > (again in my opinion NFS attribute cache has the same file size as the
> > inode's size). I think the thought was that such check should be done
> > at the VFS layer as oppose to doing it by each of the file systems.
> >
> >
>
> The attribute cache is not revalidated before the i_size is fetched with
> i_size_read. You're just reading what happens to be in the in-memory
> inode structure. So both clients have the file open already, and neither
> has a delegation:
>
> client 1: fetches attributes from file and sees a size of 1000
> client 2: writes 20 bytes at offset 1000
> client 1: calls copy file range to copy 1020 bytes starting at offset 0
>
> If client1 didn't get an attribute update before the copy_file_range
> call came in, then it would still think the size was 1000 and fail the
> operation. It may even be many seconds before client1 sees the updated
> size.
>
> You could argue that we're not using locking here so you're just subject
> to normal open-to-close cache coherency behavior, but that's rather "not
> nice".

Yes I would argue that locking needs to be used. In case your
describing it is impossible to get an accurate file size. Even if the
client issues a GETATTR there is nothing prevents another client from
changing it right after that GETATTR was already replied to.

What nfscopy application does is it opens then file and then locks it
(as suggested by the spec), then calls the copy_file_range() system
call. You can argue (if we didn't get a delegation) that a file might
have been changed between the reply to the OPEN and the LOCK operation
and therefore, we should send another GETATTR just in case. To guard
against that, I can add a getattr call though I think it's an overkill
specially since linux server always grants us a read delegation.

> I think we probably ought to also push this check down into the
> filesystem operations as well, and have copy_file_range ensure that the
> attribute cache is updated. We're dealing with copy offload here so
> doing an extra GETATTR beforehand shouldn't be terribly costly.
> --
> Jeff Layton <[email protected]>
>

2018-10-23 20:23:07

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v1 08/11] NFS inter ssc open

Hi Olga,

On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> From: Olga Kornievskaia <[email protected]>
>
> NFSv4.2 inter server to server copy requires the destination server to
> READ the data from the source server using the provided stateid and
> file handle.
>
> Given an NFSv4 stateid and filehandle from the COPY operaion, provide the
> destination server with an NFS client function to create a struct file
> suitable for the destiniation server to READ the data to be copied.
>
> Signed-off-by: Andy Adamson <[email protected]>
> Signed-off-by: Olga Kornievskaia <[email protected]>
> ---
> fs/nfs/nfs4_fs.h | 7 ++++
> fs/nfs/nfs4file.c | 98
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/nfs/nfs4proc.c | 5 ++-
> 3 files changed, 107 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
> index 8d59c96..f229864 100644
> --- a/fs/nfs/nfs4_fs.h
> +++ b/fs/nfs/nfs4_fs.h
> @@ -307,6 +307,13 @@ extern int nfs4_set_rw_stateid(nfs4_stateid *stateid,
> const struct nfs_open_context *ctx,
> const struct nfs_lock_context *l_ctx,
> fmode_t fmode);
> +extern int nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh
> *fhandle,
> + struct nfs_fattr *fattr, struct nfs4_label *label,
> + struct inode *inode);
> +extern int update_open_stateid(struct nfs4_state *state,
> + const nfs4_stateid *open_stateid,
> + const nfs4_stateid *deleg_stateid,
> + fmode_t fmode);
>
> #if defined(CONFIG_NFS_V4_1)
> extern int nfs41_sequence_done(struct rpc_task *, struct nfs4_sequence_res
> *);
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 005862e..7f226f4 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -8,6 +8,7 @@
> #include <linux/file.h>
> #include <linux/falloc.h>
> #include <linux/nfs_fs.h>
> +#include <linux/file.h>
> #include "delegation.h"
> #include "internal.h"
> #include "iostat.h"
> @@ -264,6 +265,103 @@ static int nfs42_clone_file_range(struct file *src_file,
> loff_t src_off,
> out:
> return ret;
> }
> +
> +static int read_name_gen = 1;
> +#define SSC_READ_NAME_BODY "ssc_read_%d"
> +
> +struct file *
> +nfs42_ssc_open(struct vfsmount *ss_mnt, struct nfs_fh *src_fh,
> + nfs4_stateid *stateid)
> +{
> + struct nfs_fattr fattr;
> + struct file *filep, *res;
> + struct nfs_server *server;
> + struct inode *r_ino = NULL;
> + struct nfs_open_context *ctx;
> + struct nfs4_state_owner *sp;
> + char *read_name;
> + int len, status = 0;
> +
> + server = NFS_SERVER(ss_mnt->mnt_root->d_inode);
> +
> + nfs_fattr_init(&fattr);
> +
> + status = nfs4_proc_getattr(server, src_fh, &fattr, NULL, NULL);
> + if (status < 0) {
> + res = ERR_PTR(status);
> + goto out;
> + }
> +
> + res = ERR_PTR(-ENOMEM);
> + len = strlen(SSC_READ_NAME_BODY) + 16;
> + read_name = kzalloc(len, GFP_NOFS);
> + if (read_name == NULL)
> + goto out;
> + snprintf(read_name, len, SSC_READ_NAME_BODY, read_name_gen++);
> + dprintk("%s read_name %s\n", __func__, read_name);

Does this dprintk() need to be here? I'm wondering if it would work better as a
tracepoint (or if it should just be removed altogether).

Thanks,
Anna

> +
> + r_ino = nfs_fhget(ss_mnt->mnt_root->d_inode->i_sb, src_fh, &fattr,
> + NULL);
> + if (IS_ERR(r_ino)) {
> + res = ERR_CAST(r_ino);
> + goto out;
> + }
> +
> + filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, FMODE_READ,
> + r_ino->i_fop);
> + if (IS_ERR(filep)) {
> + res = ERR_CAST(filep);
> + goto out;
> + }
> + filep->f_mode |= FMODE_READ;
> +
> + ctx = alloc_nfs_open_context(filep->f_path.dentry, filep->f_mode,
> + filep);
> + if (IS_ERR(ctx)) {
> + res = ERR_CAST(ctx);
> + goto out_filep;
> + }
> +
> + res = ERR_PTR(-EINVAL);
> + sp = nfs4_get_state_owner(server, ctx->cred, GFP_KERNEL);
> + if (sp == NULL)
> + goto out_ctx;
> +
> + ctx->state = nfs4_get_open_state(r_ino, sp);
> + if (ctx->state == NULL)
> + goto out_stateowner;
> +
> + set_bit(NFS_OPEN_STATE, &ctx->state->flags);
> + memcpy(&ctx->state->open_stateid.other, &stateid->other,
> + NFS4_STATEID_OTHER_SIZE);
> + update_open_stateid(ctx->state, stateid, NULL, filep->f_mode);
> +
> + nfs_file_set_open_context(filep, ctx);
> + put_nfs_open_context(ctx);
> +
> + file_ra_state_init(&filep->f_ra, filep->f_mapping->host->i_mapping);
> + res = filep;
> +out:
> + dprintk("<-- %s error %ld filep %p r_ino %p\n",
> + __func__, IS_ERR(res) ? PTR_ERR(res) : 0, res, r_ino);
> +
> + return res;
> +out_stateowner:
> + nfs4_put_state_owner(sp);
> +out_ctx:
> + put_nfs_open_context(ctx);
> +out_filep:
> + fput(filep);
> + goto out;
> +}
> +EXPORT_SYMBOL_GPL(nfs42_ssc_open);
> +void nfs42_ssc_close(struct file *filep)
> +{
> + struct nfs_open_context *ctx = nfs_file_open_context(filep);
> +
> + ctx->state->flags = 0;
> +}
> +EXPORT_SYMBOL_GPL(nfs42_ssc_close);
> #endif /* CONFIG_NFS_V4_2 */
>
> const struct file_operations nfs4_file_operations = {
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index fec6e6b..e5178b2f 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -91,7 +91,6 @@
> static int _nfs4_recover_proc_open(struct nfs4_opendata *data);
> static int nfs4_do_fsinfo(struct nfs_server *, struct nfs_fh *, struct
> nfs_fsinfo *);
> static void nfs_fixup_referral_attributes(struct nfs_fattr *fattr);
> -static int nfs4_proc_getattr(struct nfs_server *, struct nfs_fh *, struct
> nfs_fattr *, struct nfs4_label *label, struct inode *inode);
> static int _nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh
> *fhandle, struct nfs_fattr *fattr, struct nfs4_label *label, struct inode
> *inode);
> static int nfs4_do_setattr(struct inode *inode, struct rpc_cred *cred,
> struct nfs_fattr *fattr, struct iattr *sattr,
> @@ -1653,7 +1652,7 @@ static void nfs_state_clear_delegation(struct nfs4_state
> *state)
> write_sequnlock(&state->seqlock);
> }
>
> -static int update_open_stateid(struct nfs4_state *state,
> +int update_open_stateid(struct nfs4_state *state,
> const nfs4_stateid *open_stateid,
> const nfs4_stateid *delegation,
> fmode_t fmode)
> @@ -3936,7 +3935,7 @@ static int _nfs4_proc_getattr(struct nfs_server *server,
> struct nfs_fh *fhandle,
> return nfs4_call_sync(server->client, server, &msg, &args.seq_args,
> &res.seq_res, 0);
> }
>
> -static int nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh
> *fhandle,
> +int nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh *fhandle,
> struct nfs_fattr *fattr, struct nfs4_label
> *label,
> struct inode *inode)
> {

2018-10-24 02:58:07

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v1 08/11] NFS inter ssc open

On Tue, Oct 23, 2018 at 4:23 PM Schumaker, Anna
<[email protected]> wrote:
>
> Hi Olga,
>
> On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > From: Olga Kornievskaia <[email protected]>
> >
> > NFSv4.2 inter server to server copy requires the destination server to
> > READ the data from the source server using the provided stateid and
> > file handle.
> >
> > Given an NFSv4 stateid and filehandle from the COPY operaion, provide the
> > destination server with an NFS client function to create a struct file
> > suitable for the destiniation server to READ the data to be copied.
> >
> > Signed-off-by: Andy Adamson <[email protected]>
> > Signed-off-by: Olga Kornievskaia <[email protected]>
> > ---
> > fs/nfs/nfs4_fs.h | 7 ++++
> > fs/nfs/nfs4file.c | 98
> > +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/nfs/nfs4proc.c | 5 ++-
> > 3 files changed, 107 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
> > index 8d59c96..f229864 100644
> > --- a/fs/nfs/nfs4_fs.h
> > +++ b/fs/nfs/nfs4_fs.h
> > @@ -307,6 +307,13 @@ extern int nfs4_set_rw_stateid(nfs4_stateid *stateid,
> > const struct nfs_open_context *ctx,
> > const struct nfs_lock_context *l_ctx,
> > fmode_t fmode);
> > +extern int nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh
> > *fhandle,
> > + struct nfs_fattr *fattr, struct nfs4_label *label,
> > + struct inode *inode);
> > +extern int update_open_stateid(struct nfs4_state *state,
> > + const nfs4_stateid *open_stateid,
> > + const nfs4_stateid *deleg_stateid,
> > + fmode_t fmode);
> >
> > #if defined(CONFIG_NFS_V4_1)
> > extern int nfs41_sequence_done(struct rpc_task *, struct nfs4_sequence_res
> > *);
> > diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> > index 005862e..7f226f4 100644
> > --- a/fs/nfs/nfs4file.c
> > +++ b/fs/nfs/nfs4file.c
> > @@ -8,6 +8,7 @@
> > #include <linux/file.h>
> > #include <linux/falloc.h>
> > #include <linux/nfs_fs.h>
> > +#include <linux/file.h>
> > #include "delegation.h"
> > #include "internal.h"
> > #include "iostat.h"
> > @@ -264,6 +265,103 @@ static int nfs42_clone_file_range(struct file *src_file,
> > loff_t src_off,
> > out:
> > return ret;
> > }
> > +
> > +static int read_name_gen = 1;
> > +#define SSC_READ_NAME_BODY "ssc_read_%d"
> > +
> > +struct file *
> > +nfs42_ssc_open(struct vfsmount *ss_mnt, struct nfs_fh *src_fh,
> > + nfs4_stateid *stateid)
> > +{
> > + struct nfs_fattr fattr;
> > + struct file *filep, *res;
> > + struct nfs_server *server;
> > + struct inode *r_ino = NULL;
> > + struct nfs_open_context *ctx;
> > + struct nfs4_state_owner *sp;
> > + char *read_name;
> > + int len, status = 0;
> > +
> > + server = NFS_SERVER(ss_mnt->mnt_root->d_inode);
> > +
> > + nfs_fattr_init(&fattr);
> > +
> > + status = nfs4_proc_getattr(server, src_fh, &fattr, NULL, NULL);
> > + if (status < 0) {
> > + res = ERR_PTR(status);
> > + goto out;
> > + }
> > +
> > + res = ERR_PTR(-ENOMEM);
> > + len = strlen(SSC_READ_NAME_BODY) + 16;
> > + read_name = kzalloc(len, GFP_NOFS);
> > + if (read_name == NULL)
> > + goto out;
> > + snprintf(read_name, len, SSC_READ_NAME_BODY, read_name_gen++);
> > + dprintk("%s read_name %s\n", __func__, read_name);
>
> Does this dprintk() need to be here?

Nope, doesn't need to be here, a left over debugging. I'll remove it.

> I'm wondering if it would work better as a
> tracepoint (or if it should just be removed altogether).
>
> Thanks,
> Anna
>
> > +
> > + r_ino = nfs_fhget(ss_mnt->mnt_root->d_inode->i_sb, src_fh, &fattr,
> > + NULL);
> > + if (IS_ERR(r_ino)) {
> > + res = ERR_CAST(r_ino);
> > + goto out;
> > + }
> > +
> > + filep = alloc_file_pseudo(r_ino, ss_mnt, read_name, FMODE_READ,
> > + r_ino->i_fop);
> > + if (IS_ERR(filep)) {
> > + res = ERR_CAST(filep);
> > + goto out;
> > + }
> > + filep->f_mode |= FMODE_READ;
> > +
> > + ctx = alloc_nfs_open_context(filep->f_path.dentry, filep->f_mode,
> > + filep);
> > + if (IS_ERR(ctx)) {
> > + res = ERR_CAST(ctx);
> > + goto out_filep;
> > + }
> > +
> > + res = ERR_PTR(-EINVAL);
> > + sp = nfs4_get_state_owner(server, ctx->cred, GFP_KERNEL);
> > + if (sp == NULL)
> > + goto out_ctx;
> > +
> > + ctx->state = nfs4_get_open_state(r_ino, sp);
> > + if (ctx->state == NULL)
> > + goto out_stateowner;
> > +
> > + set_bit(NFS_OPEN_STATE, &ctx->state->flags);
> > + memcpy(&ctx->state->open_stateid.other, &stateid->other,
> > + NFS4_STATEID_OTHER_SIZE);
> > + update_open_stateid(ctx->state, stateid, NULL, filep->f_mode);
> > +
> > + nfs_file_set_open_context(filep, ctx);
> > + put_nfs_open_context(ctx);
> > +
> > + file_ra_state_init(&filep->f_ra, filep->f_mapping->host->i_mapping);
> > + res = filep;
> > +out:
> > + dprintk("<-- %s error %ld filep %p r_ino %p\n",
> > + __func__, IS_ERR(res) ? PTR_ERR(res) : 0, res, r_ino);
> > +
> > + return res;
> > +out_stateowner:
> > + nfs4_put_state_owner(sp);
> > +out_ctx:
> > + put_nfs_open_context(ctx);
> > +out_filep:
> > + fput(filep);
> > + goto out;
> > +}
> > +EXPORT_SYMBOL_GPL(nfs42_ssc_open);
> > +void nfs42_ssc_close(struct file *filep)
> > +{
> > + struct nfs_open_context *ctx = nfs_file_open_context(filep);
> > +
> > + ctx->state->flags = 0;
> > +}
> > +EXPORT_SYMBOL_GPL(nfs42_ssc_close);
> > #endif /* CONFIG_NFS_V4_2 */
> >
> > const struct file_operations nfs4_file_operations = {
> > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > index fec6e6b..e5178b2f 100644
> > --- a/fs/nfs/nfs4proc.c
> > +++ b/fs/nfs/nfs4proc.c
> > @@ -91,7 +91,6 @@
> > static int _nfs4_recover_proc_open(struct nfs4_opendata *data);
> > static int nfs4_do_fsinfo(struct nfs_server *, struct nfs_fh *, struct
> > nfs_fsinfo *);
> > static void nfs_fixup_referral_attributes(struct nfs_fattr *fattr);
> > -static int nfs4_proc_getattr(struct nfs_server *, struct nfs_fh *, struct
> > nfs_fattr *, struct nfs4_label *label, struct inode *inode);
> > static int _nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh
> > *fhandle, struct nfs_fattr *fattr, struct nfs4_label *label, struct inode
> > *inode);
> > static int nfs4_do_setattr(struct inode *inode, struct rpc_cred *cred,
> > struct nfs_fattr *fattr, struct iattr *sattr,
> > @@ -1653,7 +1652,7 @@ static void nfs_state_clear_delegation(struct nfs4_state
> > *state)
> > write_sequnlock(&state->seqlock);
> > }
> >
> > -static int update_open_stateid(struct nfs4_state *state,
> > +int update_open_stateid(struct nfs4_state *state,
> > const nfs4_stateid *open_stateid,
> > const nfs4_stateid *delegation,
> > fmode_t fmode)
> > @@ -3936,7 +3935,7 @@ static int _nfs4_proc_getattr(struct nfs_server *server,
> > struct nfs_fh *fhandle,
> > return nfs4_call_sync(server->client, server, &msg, &args.seq_args,
> > &res.seq_res, 0);
> > }
> >
> > -static int nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh
> > *fhandle,
> > +int nfs4_proc_getattr(struct nfs_server *server, struct nfs_fh *fhandle,
> > struct nfs_fattr *fattr, struct nfs4_label
> > *label,
> > struct inode *inode)
> > {

2018-10-24 03:00:14

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v1 05/11] NFS add COPY_NOTIFY operation

On Tue, Oct 23, 2018 at 11:50 AM Schumaker, Anna
<[email protected]> wrote:
>
> Hi Olga,
>
> On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > From: Olga Kornievskaia <[email protected]>
> >
> > Try using the delegation stateid, then the open stateid.
> >
> > Only NL4_NETATTR, No support for NL4_NAME and NL4_URL.
> > Allow only one source server address to be returned for now.
> >
> > Signed-off-by: Andy Adamson <[email protected]>
> > Signed-off-by: Olga Kornievskaia <[email protected]>
> > ---
> > fs/nfs/nfs42.h | 3 +-
> > fs/nfs/nfs42proc.c | 91 +++++++++++++++++++++++
> > fs/nfs/nfs42xdr.c | 181
> > ++++++++++++++++++++++++++++++++++++++++++++++
> > fs/nfs/nfs4file.c | 17 +++++
> > fs/nfs/nfs4proc.c | 1 +
> > fs/nfs/nfs4xdr.c | 1 +
> > include/linux/nfs4.h | 1 +
> > include/linux/nfs_fs_sb.h | 1 +
> > include/linux/nfs_xdr.h | 16 ++++
> > 9 files changed, 311 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/nfs/nfs42.h b/fs/nfs/nfs42.h
> > index 5abff4d..bcd0222 100644
> > --- a/fs/nfs/nfs42.h
> > +++ b/fs/nfs/nfs42.h
> > @@ -21,7 +21,8 @@
> > int nfs42_proc_layoutstats_generic(struct nfs_server *,
> > struct nfs42_layoutstat_data *);
> > int nfs42_proc_clone(struct file *, struct file *, loff_t, loff_t, loff_t);
> > -
> > +int nfs42_proc_copy_notify(struct file *, struct file *,
> > + struct nfs42_copy_notify_res *);
>
> Can you wrap this in a CONFIG_NFS_V4_2 check? Otherwise the compiler will
> complain that nfs42_copy_notify_res doesn't exist.

Will take care of it. Thank you.

>
> Thanks,
> Anna
>
> > static inline bool nfs42_files_from_same_server(struct file *in,
> > struct file *out)
> > {
> > diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
> > index ac5b784..b1c57a4 100644
> > --- a/fs/nfs/nfs42proc.c
> > +++ b/fs/nfs/nfs42proc.c
> > @@ -3,6 +3,7 @@
> > * Copyright (c) 2014 Anna Schumaker <[email protected]>
> > */
> > #include <linux/fs.h>
> > +#include <linux/sunrpc/addr.h>
> > #include <linux/sunrpc/sched.h>
> > #include <linux/nfs.h>
> > #include <linux/nfs3.h>
> > @@ -15,10 +16,30 @@
> > #include "pnfs.h"
> > #include "nfs4session.h"
> > #include "internal.h"
> > +#include "delegation.h"
> >
> > #define NFSDBG_FACILITY NFSDBG_PROC
> > static int nfs42_do_offload_cancel_async(struct file *dst, nfs4_stateid
> > *std);
> >
> > +static void nfs42_set_netaddr(struct file *filep, struct nfs42_netaddr
> > *naddr)
> > +{
> > + struct nfs_client *clp = (NFS_SERVER(file_inode(filep)))->nfs_client;
> > + unsigned short port = 2049;
> > +
> > + rcu_read_lock();
> > + naddr->netid_len = scnprintf(naddr->netid,
> > + sizeof(naddr->netid), "%s",
> > + rpc_peeraddr2str(clp->cl_rpcclient,
> > + RPC_DISPLAY_NETID));
> > + naddr->addr_len = scnprintf(naddr->addr,
> > + sizeof(naddr->addr),
> > + "%s.%u.%u",
> > + rpc_peeraddr2str(clp->cl_rpcclient,
> > + RPC_DISPLAY_ADDR),
> > + port >> 8, port & 255);
> > + rcu_read_unlock();
> > +}
> > +
> > static int _nfs42_proc_fallocate(struct rpc_message *msg, struct file *filep,
> > struct nfs_lock_context *lock, loff_t offset, loff_t len)
> > {
> > @@ -461,6 +482,76 @@ static int nfs42_do_offload_cancel_async(struct file
> > *dst,
> > return status;
> > }
> >
> > +int _nfs42_proc_copy_notify(struct file *src, struct file *dst,
> > + struct nfs42_copy_notify_args *args,
> > + struct nfs42_copy_notify_res *res)
> > +{
> > + struct nfs_server *src_server = NFS_SERVER(file_inode(src));
> > + struct rpc_message msg = {
> > + .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_COPY_NOTIFY],
> > + .rpc_argp = args,
> > + .rpc_resp = res,
> > + };
> > + int status;
> > + struct nfs_open_context *ctx;
> > + struct nfs_lock_context *l_ctx;
> > +
> > + ctx = get_nfs_open_context(nfs_file_open_context(src));
> > + l_ctx = nfs_get_lock_context(ctx);
> > + if (IS_ERR(l_ctx))
> > + return PTR_ERR(l_ctx);
> > +
> > + status = nfs4_set_rw_stateid(&args->cna_src_stateid, ctx, l_ctx,
> > + FMODE_READ);
> > + nfs_put_lock_context(l_ctx);
> > + if (status)
> > + return status;
> > +
> > + status = nfs4_call_sync(src_server->client, src_server, &msg,
> > + &args->cna_seq_args, &res->cnr_seq_res, 0);
> > + if (status == -ENOTSUPP)
> > + src_server->caps &= ~NFS_CAP_COPY_NOTIFY;
> > +
> > + put_nfs_open_context(nfs_file_open_context(src));
> > + return status;
> > +}
> > +
> > +int nfs42_proc_copy_notify(struct file *src, struct file *dst,
> > + struct nfs42_copy_notify_res *res)
> > +{
> > + struct nfs_server *src_server = NFS_SERVER(file_inode(src));
> > + struct nfs42_copy_notify_args *args;
> > + struct nfs4_exception exception = {
> > + .inode = file_inode(src),
> > + };
> > + int status;
> > +
> > + if (!(src_server->caps & NFS_CAP_COPY_NOTIFY))
> > + return -EOPNOTSUPP;
> > +
> > + args = kzalloc(sizeof(struct nfs42_copy_notify_args), GFP_NOFS);
> > + if (args == NULL)
> > + return -ENOMEM;
> > +
> > + args->cna_src_fh = NFS_FH(file_inode(src)),
> > + args->cna_dst.nl4_type = NL4_NETADDR;
> > + nfs42_set_netaddr(dst, &args->cna_dst.u.nl4_addr);
> > + exception.stateid = &args->cna_src_stateid;
> > +
> > + do {
> > + status = _nfs42_proc_copy_notify(src, dst, args, res);
> > + if (status == -ENOTSUPP) {
> > + status = -EOPNOTSUPP;
> > + goto out;
> > + }
> > + status = nfs4_handle_exception(src_server, status, &exception);
> > + } while (exception.retry);
> > +
> > +out:
> > + kfree(args);
> > + return status;
> > +}
> > +
> > static loff_t _nfs42_proc_llseek(struct file *filep,
> > struct nfs_lock_context *lock, loff_t offset, int whence)
> > {
> > diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
> > index 69f72ed..e6e7cbf 100644
> > --- a/fs/nfs/nfs42xdr.c
> > +++ b/fs/nfs/nfs42xdr.c
> > @@ -29,6 +29,16 @@
> > #define encode_offload_cancel_maxsz (op_encode_hdr_maxsz + \
> > XDR_QUADLEN(NFS4_STATEID_SIZE))
> > #define decode_offload_cancel_maxsz (op_decode_hdr_maxsz)
> > +#define encode_copy_notify_maxsz (op_encode_hdr_maxsz + \
> > + XDR_QUADLEN(NFS4_STATEID_SIZE) + \
> > + 1 + /* nl4_type */ \
> > + 1 + XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
> > +#define decode_copy_notify_maxsz (op_decode_hdr_maxsz + \
> > + 3 + /* cnr_lease_time */\
> > + XDR_QUADLEN(NFS4_STATEID_SIZE) + \
> > + 1 + /* Support 1 cnr_source_server */\
> > + 1 + /* nl4_type */ \
> > + 1 + XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
> > #define encode_deallocate_maxsz (op_encode_hdr_maxsz + \
> > encode_fallocate_maxsz)
> > #define decode_deallocate_maxsz (op_decode_hdr_maxsz)
> > @@ -84,6 +94,12 @@
> > #define NFS4_dec_offload_cancel_sz (compound_decode_hdr_maxsz + \
> > decode_putfh_maxsz + \
> > decode_offload_cancel_maxsz)
> > +#define NFS4_enc_copy_notify_sz (compound_encode_hdr_maxsz + \
> > + encode_putfh_maxsz + \
> > + encode_copy_notify_maxsz)
> > +#define NFS4_dec_copy_notify_sz (compound_decode_hdr_maxsz + \
> > + decode_putfh_maxsz + \
> > + decode_copy_notify_maxsz)
> > #define NFS4_enc_deallocate_sz (compound_encode_hdr_maxsz + \
> > encode_putfh_maxsz + \
> > encode_deallocate_maxsz + \
> > @@ -137,6 +153,25 @@ static void encode_allocate(struct xdr_stream *xdr,
> > encode_fallocate(xdr, args);
> > }
> >
> > +static void encode_nl4_server(struct xdr_stream *xdr, const struct nl4_server
> > *ns)
> > +{
> > + encode_uint32(xdr, ns->nl4_type);
> > + switch (ns->nl4_type) {
> > + case NL4_NAME:
> > + case NL4_URL:
> > + encode_string(xdr, ns->u.nl4_str_sz, ns->u.nl4_str);
> > + break;
> > + case NL4_NETADDR:
> > + encode_string(xdr, ns->u.nl4_addr.netid_len,
> > + ns->u.nl4_addr.netid);
> > + encode_string(xdr, ns->u.nl4_addr.addr_len,
> > + ns->u.nl4_addr.addr);
> > + break;
> > + default:
> > + WARN_ON_ONCE(1);
> > + }
> > +}
> > +
> > static void encode_copy(struct xdr_stream *xdr,
> > const struct nfs42_copy_args *args,
> > struct compound_hdr *hdr)
> > @@ -162,6 +197,15 @@ static void encode_offload_cancel(struct xdr_stream *xdr,
> > encode_nfs4_stateid(xdr, &args->osa_stateid);
> > }
> >
> > +static void encode_copy_notify(struct xdr_stream *xdr,
> > + const struct nfs42_copy_notify_args *args,
> > + struct compound_hdr *hdr)
> > +{
> > + encode_op_hdr(xdr, OP_COPY_NOTIFY, decode_copy_notify_maxsz, hdr);
> > + encode_nfs4_stateid(xdr, &args->cna_src_stateid);
> > + encode_nl4_server(xdr, &args->cna_dst);
> > +}
> > +
> > static void encode_deallocate(struct xdr_stream *xdr,
> > const struct nfs42_falloc_args *args,
> > struct compound_hdr *hdr)
> > @@ -298,6 +342,25 @@ static void nfs4_xdr_enc_offload_cancel(struct rpc_rqst
> > *req,
> > }
> >
> > /*
> > + * Encode COPY_NOTIFY request
> > + */
> > +static void nfs4_xdr_enc_copy_notify(struct rpc_rqst *req,
> > + struct xdr_stream *xdr,
> > + const void *data)
> > +{
> > + const struct nfs42_copy_notify_args *args = data;
> > + struct compound_hdr hdr = {
> > + .minorversion = nfs4_xdr_minorversion(&args->cna_seq_args),
> > + };
> > +
> > + encode_compound_hdr(xdr, req, &hdr);
> > + encode_sequence(xdr, &args->cna_seq_args, &hdr);
> > + encode_putfh(xdr, args->cna_src_fh, &hdr);
> > + encode_copy_notify(xdr, args, &hdr);
> > + encode_nops(&hdr);
> > +}
> > +
> > +/*
> > * Encode DEALLOCATE request
> > */
> > static void nfs4_xdr_enc_deallocate(struct rpc_rqst *req,
> > @@ -416,6 +479,58 @@ static int decode_write_response(struct xdr_stream *xdr,
> > return -EIO;
> > }
> >
> > +static int decode_nl4_server(struct xdr_stream *xdr, struct nl4_server *ns)
> > +{
> > + struct nfs42_netaddr *naddr;
> > + uint32_t dummy;
> > + char *dummy_str;
> > + __be32 *p;
> > + int status;
> > +
> > + /* nl_type */
> > + p = xdr_inline_decode(xdr, 4);
> > + if (unlikely(!p))
> > + return -EIO;
> > + ns->nl4_type = be32_to_cpup(p);
> > + switch (ns->nl4_type) {
> > + case NL4_NAME:
> > + case NL4_URL:
> > + status = decode_opaque_inline(xdr, &dummy, &dummy_str);
> > + if (unlikely(status))
> > + return status;
> > + if (unlikely(dummy > NFS4_OPAQUE_LIMIT))
> > + return -EIO;
> > + memcpy(&ns->u.nl4_str, dummy_str, dummy);
> > + ns->u.nl4_str_sz = dummy;
> > + break;
> > + case NL4_NETADDR:
> > + naddr = &ns->u.nl4_addr;
> > +
> > + /* netid string */
> > + status = decode_opaque_inline(xdr, &dummy, &dummy_str);
> > + if (unlikely(status))
> > + return status;
> > + if (unlikely(dummy > RPCBIND_MAXNETIDLEN))
> > + return -EIO;
> > + naddr->netid_len = dummy;
> > + memcpy(naddr->netid, dummy_str, naddr->netid_len);
> > +
> > + /* uaddr string */
> > + status = decode_opaque_inline(xdr, &dummy, &dummy_str);
> > + if (unlikely(status))
> > + return status;
> > + if (unlikely(dummy > RPCBIND_MAXUADDRLEN))
> > + return -EIO;
> > + naddr->addr_len = dummy;
> > + memcpy(naddr->addr, dummy_str, naddr->addr_len);
> > + break;
> > + default:
> > + WARN_ON_ONCE(1);
> > + return -EIO;
> > + }
> > + return 0;
> > +}
> > +
> > static int decode_copy_requirements(struct xdr_stream *xdr,
> > struct nfs42_copy_res *res) {
> > __be32 *p;
> > @@ -458,6 +573,46 @@ static int decode_offload_cancel(struct xdr_stream *xdr,
> > return decode_op_hdr(xdr, OP_OFFLOAD_CANCEL);
> > }
> >
> > +static int decode_copy_notify(struct xdr_stream *xdr,
> > + struct nfs42_copy_notify_res *res)
> > +{
> > + __be32 *p;
> > + int status, count;
> > +
> > + status = decode_op_hdr(xdr, OP_COPY_NOTIFY);
> > + if (status)
> > + return status;
> > + /* cnr_lease_time */
> > + p = xdr_inline_decode(xdr, 12);
> > + if (unlikely(!p))
> > + goto out_overflow;
> > + p = xdr_decode_hyper(p, &res->cnr_lease_time.seconds);
> > + res->cnr_lease_time.nseconds = be32_to_cpup(p);
> > +
> > + status = decode_opaque_fixed(xdr, &res->cnr_stateid, NFS4_STATEID_SIZE);
> > + if (unlikely(status))
> > + goto out_overflow;
> > +
> > + /* number of source addresses */
> > + p = xdr_inline_decode(xdr, 4);
> > + if (unlikely(!p))
> > + goto out_overflow;
> > +
> > + count = be32_to_cpup(p);
> > + if (count > 1)
> > + pr_warn("NFS: %s: nsvr %d > Supported. Use first servers\n",
> > + __func__, count);
> > +
> > + status = decode_nl4_server(xdr, &res->cnr_src);
> > + if (unlikely(status))
> > + goto out_overflow;
> > + return 0;
> > +
> > +out_overflow:
> > + print_overflow_msg(__func__, xdr);
> > + return -EIO;
> > +}
> > +
> > static int decode_deallocate(struct xdr_stream *xdr, struct nfs42_falloc_res
> > *res)
> > {
> > return decode_op_hdr(xdr, OP_DEALLOCATE);
> > @@ -585,6 +740,32 @@ static int nfs4_xdr_dec_offload_cancel(struct rpc_rqst
> > *rqstp,
> > }
> >
> > /*
> > + * Decode COPY_NOTIFY response
> > + */
> > +static int nfs4_xdr_dec_copy_notify(struct rpc_rqst *rqstp,
> > + struct xdr_stream *xdr,
> > + void *data)
> > +{
> > + struct nfs42_copy_notify_res *res = data;
> > + struct compound_hdr hdr;
> > + int status;
> > +
> > + status = decode_compound_hdr(xdr, &hdr);
> > + if (status)
> > + goto out;
> > + status = decode_sequence(xdr, &res->cnr_seq_res, rqstp);
> > + if (status)
> > + goto out;
> > + status = decode_putfh(xdr);
> > + if (status)
> > + goto out;
> > + status = decode_copy_notify(xdr, res);
> > +
> > +out:
> > + return status;
> > +}
> > +
> > +/*
> > * Decode DEALLOCATE request
> > */
> > static int nfs4_xdr_dec_deallocate(struct rpc_rqst *rqstp,
> > diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> > index 4288a6e..3a888f4 100644
> > --- a/fs/nfs/nfs4file.c
> > +++ b/fs/nfs/nfs4file.c
> > @@ -133,12 +133,29 @@ static ssize_t nfs4_copy_file_range(struct file
> > *file_in, loff_t pos_in,
> > struct file *file_out, loff_t pos_out,
> > size_t count, unsigned int flags)
> > {
> > + struct nfs42_copy_notify_res *cn_resp = NULL;
> > ssize_t ret;
> >
> > if (file_inode(file_in) == file_inode(file_out))
> > return -EINVAL;
> > retry:
> > + if (nfs42_files_from_same_server(file_in, file_out)) { /* Intra-ssc */
> > + if (file_in->f_op != file_out->f_op)
> > + return -EXDEV;
> > + } else { /* Inter-ssc */
> > + cn_resp = kzalloc(sizeof(struct nfs42_copy_notify_res),
> > + GFP_NOFS);
> > + if (unlikely(cn_resp == NULL))
> > + return -ENOMEM;
> > +
> > + ret = nfs42_proc_copy_notify(file_in, file_out, cn_resp);
> > + if (ret)
> > + goto out;
> > + }
> > ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
> > +
> > +out:
> > + kfree(cn_resp);
> > if (ret == -EAGAIN)
> > goto retry;
> > return ret;
> > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > index db84b4a..fec6e6b 100644
> > --- a/fs/nfs/nfs4proc.c
> > +++ b/fs/nfs/nfs4proc.c
> > @@ -9692,6 +9692,7 @@ static bool nfs4_match_stateid(const nfs4_stateid *s1,
> > | NFS_CAP_ALLOCATE
> > | NFS_CAP_COPY
> > | NFS_CAP_OFFLOAD_CANCEL
> > + | NFS_CAP_COPY_NOTIFY
> > | NFS_CAP_DEALLOCATE
> > | NFS_CAP_SEEK
> > | NFS_CAP_LAYOUTSTATS
> > diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> > index b7bde12..2163900 100644
> > --- a/fs/nfs/nfs4xdr.c
> > +++ b/fs/nfs/nfs4xdr.c
> > @@ -7790,6 +7790,7 @@ int nfs4_decode_dirent(struct xdr_stream *xdr, struct
> > nfs_entry *entry,
> > PROC42(CLONE, enc_clone, dec_clone),
> > PROC42(COPY, enc_copy, dec_copy),
> > PROC42(OFFLOAD_CANCEL, enc_offload_cancel, dec_offload_cancel),
> > + PROC42(COPY_NOTIFY, enc_copy_notify, dec_copy_notify),
> > PROC(LOOKUPP, enc_lookupp, dec_lookupp),
> > };
> >
> > diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> > index 4d76f87..d80b25e 100644
> > --- a/include/linux/nfs4.h
> > +++ b/include/linux/nfs4.h
> > @@ -537,6 +537,7 @@ enum {
> > NFSPROC4_CLNT_CLONE,
> > NFSPROC4_CLNT_COPY,
> > NFSPROC4_CLNT_OFFLOAD_CANCEL,
> > + NFSPROC4_CLNT_COPY_NOTIFY,
> >
> > NFSPROC4_CLNT_LOOKUPP,
> > };
> > diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> > index 0fc0b91..e5d89ff 100644
> > --- a/include/linux/nfs_fs_sb.h
> > +++ b/include/linux/nfs_fs_sb.h
> > @@ -261,5 +261,6 @@ struct nfs_server {
> > #define NFS_CAP_CLONE (1U << 23)
> > #define NFS_CAP_COPY (1U << 24)
> > #define NFS_CAP_OFFLOAD_CANCEL (1U << 25)
> > +#define NFS_CAP_COPY_NOTIFY (1U << 26)
> >
> > #endif
> > diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> > index 0e01625..dfc59bc 100644
> > --- a/include/linux/nfs_xdr.h
> > +++ b/include/linux/nfs_xdr.h
> > @@ -1428,6 +1428,22 @@ struct nfs42_offload_status_res {
> > int osr_status;
> > };
> >
> > +struct nfs42_copy_notify_args {
> > + struct nfs4_sequence_args cna_seq_args;
> > +
> > + struct nfs_fh *cna_src_fh;
> > + nfs4_stateid cna_src_stateid;
> > + struct nl4_server cna_dst;
> > +};
> > +
> > +struct nfs42_copy_notify_res {
> > + struct nfs4_sequence_res cnr_seq_res;
> > +
> > + struct nfstime4 cnr_lease_time;
> > + nfs4_stateid cnr_stateid;
> > + struct nl4_server cnr_src;
> > +};
> > +
> > struct nfs42_seek_args {
> > struct nfs4_sequence_args seq_args;
> >

2018-10-24 11:09:36

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

On Tue, 2018-10-23 at 12:50 -0400, Olga Kornievskaia wrote:
> On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > > From: Anna Schumaker <[email protected]>
> > > > >
> > > > > Signed-off-by: Anna Schumaker <[email protected]>
> > > > > ---
> > > > > fs/read_write.c | 3 +++
> > > > > 1 file changed, 3 insertions(+)
> > > > >
> > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > index 39b4a21..c60790f 100644
> > > > > --- a/fs/read_write.c
> > > > > +++ b/fs/read_write.c
> > > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > > if (unlikely(ret))
> > > > > return ret;
> > > > >
> > > > > + if (pos_in >= i_size_read(inode_in))
> > > > > + return -EINVAL;
> > > > > +
> > > > > if (!(file_in->f_mode & FMODE_READ) ||
> > > > > !(file_out->f_mode & FMODE_WRITE) ||
> > > > > (file_out->f_flags & O_APPEND))
> > > >
> > > > The patch description could use a bit more fleshing-out. The
> > > > copy_file_range manpage says:
> > > >
> > > > EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > > >
> > > > So I guess this is intended to satisfy that requirement?
> > >
> > > I agree the description of the patch is poor. It sort of falls under
> > > the the man page's description of range beyond the end of the source
> > > file. But in NFSv4.2, there is an explicit wording for the validity of
> > > the input parameters and having an input source offset that's beyond
> > > the end of the file is what this patch is attempting to check.
> > >
> >
> > Side note:
> >
> > One has to wonder why they decided to make that an -EINVAL condition?
>
> To me that sounds like a correct use of -EINVAL error to check for
> invalid arguments.
>
> You can argue that since there were no bytes to copy then returning 0
> would be an appropriate "size" of the copy. However, I would argue how
> would a caller distinguish this 0 size which really means don't try to
> copy more vs another case when copy_file_range() returned less byte
> (0) and so that caller should loop to get the rest.
>

I don't know -- it seems to run contrary to how read(2) and write(2)
work with an EOF condition. I don't see why the implementation wouldn't
want to just copy what you can up to the EOF of the source file. Maybe I
need to go back and review the discussion from when the syscall was
merged...

> > The system call returns ssize_t. Why not just return a fewer number of
> > bytes in that situation?
> >
> > In fact, the RETURN VALUE section of the manpage says:
> >
> > Upon successful completion, copy_file_range() will return the number of
> > bytes copied between files. This could be less than the length origi‐
> > nally requested.
> >
> > Under what conditions would that occur that do not include the file
> > being shorter than the range you wanted to copy?
> >
> > I wonder if we ought to lobby to get that changed.
>
> Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
> "written in stone".
>

No, I meant the copy_file_range spec (such as it is). I guess the v4.2
spec has this though:

If the source offset or the source offset plus count is greater than
the size of the source file, the operation MUST fail with
NFS4ERR_INVAL.

I wonder if you'd be better off not trying to enforce this on the client
and simply let the server return NFS4ERR_INVAL if we're beyond EOF on
the source? That way it doesn't matter whether the client's attr cache
is up to date or not.

> >
> > > > If so,
> > > > i_size_read is just going to return whatever is in inode->isize.
> > > > Could a copy_file_range call end up getting issued to copy from a file
> > > > that is already open on a range that it doesn't know about yet? i.e.
> > > > where the inode cache has not yet been updated.
> > >
> > > I thought that with NFSv4 cache consistency, the inode->isize is
> > > accurate. If previous open had a read delegation, any modification on
> > > a server would trigger a CB_RECALL and the open for the copy offload
> > > would retrieve the latest size. In case of no delegations, the open
> > > retrieves the latest size and the call to copy_file_range() would have
> > > an update size.
> > >
> > > It seems like that could on network filesystems (like NFS). Would this
> > > > be better handled in ->copy_file_range instead, where the driver could
> > > > make a better determination of the file size?
> > >
> > > I'm not opposed to moving the size check into the NFS's copy_file_size
> > > (again in my opinion NFS attribute cache has the same file size as the
> > > inode's size). I think the thought was that such check should be done
> > > at the VFS layer as oppose to doing it by each of the file systems.
> > >
> > >
> >
> > The attribute cache is not revalidated before the i_size is fetched with
> > i_size_read. You're just reading what happens to be in the in-memory
> > inode structure. So both clients have the file open already, and neither
> > has a delegation:
> >
> > client 1: fetches attributes from file and sees a size of 1000
> > client 2: writes 20 bytes at offset 1000
> > client 1: calls copy file range to copy 1020 bytes starting at offset 0
> >
> > If client1 didn't get an attribute update before the copy_file_range
> > call came in, then it would still think the size was 1000 and fail the
> > operation. It may even be many seconds before client1 sees the updated
> > size.
> >
> > You could argue that we're not using locking here so you're just subject
> > to normal open-to-close cache coherency behavior, but that's rather "not
> > nice".
>
> Yes I would argue that locking needs to be used. In case your
> describing it is impossible to get an accurate file size. Even if the
> client issues a GETATTR there is nothing prevents another client from
> changing it right after that GETATTR was already replied to.
>
> What nfscopy application does is it opens then file and then locks it
> (as suggested by the spec), then calls the copy_file_range() system
> call. You can argue (if we didn't get a delegation) that a file might
> have been changed between the reply to the OPEN and the LOCK operation
> and therefore, we should send another GETATTR just in case. To guard
> against that, I can add a getattr call though I think it's an overkill
> specially since linux server always grants us a read delegation.
>

The spec does say you might need to lock the files but they don't say
you SHOULD, only that servers need to be able to deal with lock
stateids. hat said, you have a good point. We're not violating anything
by not revalidating the cache beforehand. Maybe we don't need to do this
after all.

Personally, I think this would be best enforced by the NFS server. Just
fire off the COPY request as-is and let the server vet the lengths.

Side note: a delegation is never guaranteed. knfsd won't grant one if
the inode falls afoul of the bloom filter, for instance, and that can
easily happen if (for example) there is a hash collision between
different inodes.

> > I think we probably ought to also push this check down into the
> > filesystem operations as well, and have copy_file_range ensure that the
> > attribute cache is updated. We're dealing with copy offload here so
> > doing an extra GETATTR beforehand shouldn't be terribly costly.
> > --
> > Jeff Layton <[email protected]>

--
Jeff Layton <[email protected]>


2018-10-24 16:00:01

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

On Wed, Oct 24, 2018 at 7:09 AM Jeff Layton <[email protected]> wrote:
>
> On Tue, 2018-10-23 at 12:50 -0400, Olga Kornievskaia wrote:
> > On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <[email protected]> wrote:
> > >
> > > On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > > > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <[email protected]> wrote:
> > > > >
> > > > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > > > From: Anna Schumaker <[email protected]>
> > > > > >
> > > > > > Signed-off-by: Anna Schumaker <[email protected]>
> > > > > > ---
> > > > > > fs/read_write.c | 3 +++
> > > > > > 1 file changed, 3 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > > index 39b4a21..c60790f 100644
> > > > > > --- a/fs/read_write.c
> > > > > > +++ b/fs/read_write.c
> > > > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > > > if (unlikely(ret))
> > > > > > return ret;
> > > > > >
> > > > > > + if (pos_in >= i_size_read(inode_in))
> > > > > > + return -EINVAL;
> > > > > > +
> > > > > > if (!(file_in->f_mode & FMODE_READ) ||
> > > > > > !(file_out->f_mode & FMODE_WRITE) ||
> > > > > > (file_out->f_flags & O_APPEND))
> > > > >
> > > > > The patch description could use a bit more fleshing-out. The
> > > > > copy_file_range manpage says:
> > > > >
> > > > > EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > > > >
> > > > > So I guess this is intended to satisfy that requirement?
> > > >
> > > > I agree the description of the patch is poor. It sort of falls under
> > > > the the man page's description of range beyond the end of the source
> > > > file. But in NFSv4.2, there is an explicit wording for the validity of
> > > > the input parameters and having an input source offset that's beyond
> > > > the end of the file is what this patch is attempting to check.
> > > >
> > >
> > > Side note:
> > >
> > > One has to wonder why they decided to make that an -EINVAL condition?
> >
> > To me that sounds like a correct use of -EINVAL error to check for
> > invalid arguments.
> >
> > You can argue that since there were no bytes to copy then returning 0
> > would be an appropriate "size" of the copy. However, I would argue how
> > would a caller distinguish this 0 size which really means don't try to
> > copy more vs another case when copy_file_range() returned less byte
> > (0) and so that caller should loop to get the rest.
> >
>
> I don't know -- it seems to run contrary to how read(2) and write(2)

That's because copy_file_range is not just read/write but also lseek.
I think of it as doing lseek(to source offset)->read()->lseek(to dst
offset)->write. and lseek() does return EINVAL when position is beyond
the end of the file.

> work with an EOF condition. I don't see why the implementation wouldn't
> want to just copy what you can up to the EOF of the source file. Maybe I
> need to go back and review the discussion from when the syscall was
> merged...

>
> > > The system call returns ssize_t. Why not just return a fewer number of
> > > bytes in that situation?
> > >
> > > In fact, the RETURN VALUE section of the manpage says:
> > >
> > > Upon successful completion, copy_file_range() will return the number of
> > > bytes copied between files. This could be less than the length origi‐
> > > nally requested.
> > >
> > > Under what conditions would that occur that do not include the file
> > > being shorter than the range you wanted to copy?
> > >
> > > I wonder if we ought to lobby to get that changed.
> >
> > Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
> > "written in stone".
> >
>
> No, I meant the copy_file_range spec (such as it is). I guess the v4.2
> spec has this though:
>
> If the source offset or the source offset plus count is greater than
> the size of the source file, the operation MUST fail with
> NFS4ERR_INVAL.
>
> I wonder if you'd be better off not trying to enforce this on the client
> and simply let the server return NFS4ERR_INVAL if we're beyond EOF on
> the source? That way it doesn't matter whether the client's attr cache
> is up to date or not.
>
> > >
> > > > > If so,
> > > > > i_size_read is just going to return whatever is in inode->isize.
> > > > > Could a copy_file_range call end up getting issued to copy from a file
> > > > > that is already open on a range that it doesn't know about yet? i.e.
> > > > > where the inode cache has not yet been updated.
> > > >
> > > > I thought that with NFSv4 cache consistency, the inode->isize is
> > > > accurate. If previous open had a read delegation, any modification on
> > > > a server would trigger a CB_RECALL and the open for the copy offload
> > > > would retrieve the latest size. In case of no delegations, the open
> > > > retrieves the latest size and the call to copy_file_range() would have
> > > > an update size.
> > > >
> > > > It seems like that could on network filesystems (like NFS). Would this
> > > > > be better handled in ->copy_file_range instead, where the driver could
> > > > > make a better determination of the file size?
> > > >
> > > > I'm not opposed to moving the size check into the NFS's copy_file_size
> > > > (again in my opinion NFS attribute cache has the same file size as the
> > > > inode's size). I think the thought was that such check should be done
> > > > at the VFS layer as oppose to doing it by each of the file systems.
> > > >
> > > >
> > >
> > > The attribute cache is not revalidated before the i_size is fetched with
> > > i_size_read. You're just reading what happens to be in the in-memory
> > > inode structure. So both clients have the file open already, and neither
> > > has a delegation:
> > >
> > > client 1: fetches attributes from file and sees a size of 1000
> > > client 2: writes 20 bytes at offset 1000
> > > client 1: calls copy file range to copy 1020 bytes starting at offset 0
> > >
> > > If client1 didn't get an attribute update before the copy_file_range
> > > call came in, then it would still think the size was 1000 and fail the
> > > operation. It may even be many seconds before client1 sees the updated
> > > size.
> > >
> > > You could argue that we're not using locking here so you're just subject
> > > to normal open-to-close cache coherency behavior, but that's rather "not
> > > nice".
> >
> > Yes I would argue that locking needs to be used. In case your
> > describing it is impossible to get an accurate file size. Even if the
> > client issues a GETATTR there is nothing prevents another client from
> > changing it right after that GETATTR was already replied to.
> >
> > What nfscopy application does is it opens then file and then locks it
> > (as suggested by the spec), then calls the copy_file_range() system
> > call. You can argue (if we didn't get a delegation) that a file might
> > have been changed between the reply to the OPEN and the LOCK operation
> > and therefore, we should send another GETATTR just in case. To guard
> > against that, I can add a getattr call though I think it's an overkill
> > specially since linux server always grants us a read delegation.
> >
>
> The spec does say you might need to lock the files but they don't say
> you SHOULD, only that servers need to be able to deal with lock
> stateids. hat said, you have a good point. We're not violating anything
> by not revalidating the cache beforehand. Maybe we don't need to do this
> after all.
>
> Personally, I think this would be best enforced by the NFS server. Just
> fire off the COPY request as-is and let the server vet the lengths.
>
> Side note: a delegation is never guaranteed. knfsd won't grant one if
> the inode falls afoul of the bloom filter, for instance, and that can
> easily happen if (for example) there is a hash collision between
> different inodes.

I could push the checking validity of the arguments to the server (but
to me it sounds lame: a failure will require a network roundtrip). But
one can argue the server needs to check the validity of the arguments
anyway (as it can't rely on the client to play nice).

I still think the check should be done in both places (client and
server). And I feel like VFS is the right place to do so (as this
check implements what the man page states). But I don't feel strongly
about dropping the patch all together (I'll add a patch to the
server).

> > > I think we probably ought to also push this check down into the
> > > filesystem operations as well, and have copy_file_range ensure that the
> > > attribute cache is updated. We're dealing with copy offload here so
> > > doing an extra GETATTR beforehand shouldn't be terribly costly.
> > > --
> > > Jeff Layton <[email protected]>
>
> --
> Jeff Layton <[email protected]>
>

2018-10-24 18:54:14

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

On Wed, Oct 24, 2018 at 11:59 AM Olga Kornievskaia
<[email protected]> wrote:
>
> On Wed, Oct 24, 2018 at 7:09 AM Jeff Layton <[email protected]> wrote:
> >
> > On Tue, 2018-10-23 at 12:50 -0400, Olga Kornievskaia wrote:
> > > On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > > > > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > > > > From: Anna Schumaker <[email protected]>
> > > > > > >
> > > > > > > Signed-off-by: Anna Schumaker <[email protected]>
> > > > > > > ---
> > > > > > > fs/read_write.c | 3 +++
> > > > > > > 1 file changed, 3 insertions(+)
> > > > > > >
> > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > > > index 39b4a21..c60790f 100644
> > > > > > > --- a/fs/read_write.c
> > > > > > > +++ b/fs/read_write.c
> > > > > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > > > > if (unlikely(ret))
> > > > > > > return ret;
> > > > > > >
> > > > > > > + if (pos_in >= i_size_read(inode_in))
> > > > > > > + return -EINVAL;
> > > > > > > +
> > > > > > > if (!(file_in->f_mode & FMODE_READ) ||
> > > > > > > !(file_out->f_mode & FMODE_WRITE) ||
> > > > > > > (file_out->f_flags & O_APPEND))
> > > > > >
> > > > > > The patch description could use a bit more fleshing-out. The
> > > > > > copy_file_range manpage says:
> > > > > >
> > > > > > EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > > > > >
> > > > > > So I guess this is intended to satisfy that requirement?
> > > > >
> > > > > I agree the description of the patch is poor. It sort of falls under
> > > > > the the man page's description of range beyond the end of the source
> > > > > file. But in NFSv4.2, there is an explicit wording for the validity of
> > > > > the input parameters and having an input source offset that's beyond
> > > > > the end of the file is what this patch is attempting to check.
> > > > >
> > > >
> > > > Side note:
> > > >
> > > > One has to wonder why they decided to make that an -EINVAL condition?
> > >
> > > To me that sounds like a correct use of -EINVAL error to check for
> > > invalid arguments.
> > >
> > > You can argue that since there were no bytes to copy then returning 0
> > > would be an appropriate "size" of the copy. However, I would argue how
> > > would a caller distinguish this 0 size which really means don't try to
> > > copy more vs another case when copy_file_range() returned less byte
> > > (0) and so that caller should loop to get the rest.
> > >
> >
> > I don't know -- it seems to run contrary to how read(2) and write(2)
>
> That's because copy_file_range is not just read/write but also lseek.
> I think of it as doing lseek(to source offset)->read()->lseek(to dst
> offset)->write. and lseek() does return EINVAL when position is beyond
> the end of the file.
>
> > work with an EOF condition. I don't see why the implementation wouldn't
> > want to just copy what you can up to the EOF of the source file. Maybe I
> > need to go back and review the discussion from when the syscall was
> > merged...
>
> >
> > > > The system call returns ssize_t. Why not just return a fewer number of
> > > > bytes in that situation?
> > > >
> > > > In fact, the RETURN VALUE section of the manpage says:
> > > >
> > > > Upon successful completion, copy_file_range() will return the number of
> > > > bytes copied between files. This could be less than the length origi‐
> > > > nally requested.
> > > >
> > > > Under what conditions would that occur that do not include the file
> > > > being shorter than the range you wanted to copy?
> > > >
> > > > I wonder if we ought to lobby to get that changed.
> > >
> > > Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
> > > "written in stone".
> > >
> >
> > No, I meant the copy_file_range spec (such as it is). I guess the v4.2
> > spec has this though:
> >
> > If the source offset or the source offset plus count is greater than
> > the size of the source file, the operation MUST fail with
> > NFS4ERR_INVAL.
> >
> > I wonder if you'd be better off not trying to enforce this on the client
> > and simply let the server return NFS4ERR_INVAL if we're beyond EOF on
> > the source? That way it doesn't matter whether the client's attr cache
> > is up to date or not.
> >
> > > >
> > > > > > If so,
> > > > > > i_size_read is just going to return whatever is in inode->isize.
> > > > > > Could a copy_file_range call end up getting issued to copy from a file
> > > > > > that is already open on a range that it doesn't know about yet? i.e.
> > > > > > where the inode cache has not yet been updated.
> > > > >
> > > > > I thought that with NFSv4 cache consistency, the inode->isize is
> > > > > accurate. If previous open had a read delegation, any modification on
> > > > > a server would trigger a CB_RECALL and the open for the copy offload
> > > > > would retrieve the latest size. In case of no delegations, the open
> > > > > retrieves the latest size and the call to copy_file_range() would have
> > > > > an update size.
> > > > >
> > > > > It seems like that could on network filesystems (like NFS). Would this
> > > > > > be better handled in ->copy_file_range instead, where the driver could
> > > > > > make a better determination of the file size?
> > > > >
> > > > > I'm not opposed to moving the size check into the NFS's copy_file_size
> > > > > (again in my opinion NFS attribute cache has the same file size as the
> > > > > inode's size). I think the thought was that such check should be done
> > > > > at the VFS layer as oppose to doing it by each of the file systems.
> > > > >
> > > > >
> > > >
> > > > The attribute cache is not revalidated before the i_size is fetched with
> > > > i_size_read. You're just reading what happens to be in the in-memory
> > > > inode structure. So both clients have the file open already, and neither
> > > > has a delegation:
> > > >
> > > > client 1: fetches attributes from file and sees a size of 1000
> > > > client 2: writes 20 bytes at offset 1000
> > > > client 1: calls copy file range to copy 1020 bytes starting at offset 0
> > > >
> > > > If client1 didn't get an attribute update before the copy_file_range
> > > > call came in, then it would still think the size was 1000 and fail the
> > > > operation. It may even be many seconds before client1 sees the updated
> > > > size.
> > > >
> > > > You could argue that we're not using locking here so you're just subject
> > > > to normal open-to-close cache coherency behavior, but that's rather "not
> > > > nice".
> > >
> > > Yes I would argue that locking needs to be used. In case your
> > > describing it is impossible to get an accurate file size. Even if the
> > > client issues a GETATTR there is nothing prevents another client from
> > > changing it right after that GETATTR was already replied to.
> > >
> > > What nfscopy application does is it opens then file and then locks it
> > > (as suggested by the spec), then calls the copy_file_range() system
> > > call. You can argue (if we didn't get a delegation) that a file might
> > > have been changed between the reply to the OPEN and the LOCK operation
> > > and therefore, we should send another GETATTR just in case. To guard
> > > against that, I can add a getattr call though I think it's an overkill
> > > specially since linux server always grants us a read delegation.
> > >
> >
> > The spec does say you might need to lock the files but they don't say
> > you SHOULD, only that servers need to be able to deal with lock
> > stateids. hat said, you have a good point. We're not violating anything
> > by not revalidating the cache beforehand. Maybe we don't need to do this
> > after all.
> >
> > Personally, I think this would be best enforced by the NFS server. Just
> > fire off the COPY request as-is and let the server vet the lengths.
> >
> > Side note: a delegation is never guaranteed. knfsd won't grant one if
> > the inode falls afoul of the bloom filter, for instance, and that can
> > easily happen if (for example) there is a hash collision between
> > different inodes.
>
> I could push the checking validity of the arguments to the server (but
> to me it sounds lame: a failure will require a network roundtrip). But
> one can argue the server needs to check the validity of the arguments
> anyway (as it can't rely on the client to play nice).
>
> I still think the check should be done in both places (client and
> server). And I feel like VFS is the right place to do so (as this
> check implements what the man page states). But I don't feel strongly
> about dropping the patch all together (I'll add a patch to the
> server).

To add to my case, would it be acceptable to add a check *same as*
it's done for the vfs_clone_file_range()
pos_in+len > i_size_read() return EINVAL.

knfsd just calls in to the vfs to do the copy_file_range when it
receives it, so these checks should be a part of the VFS.

> > > > I think we probably ought to also push this check down into the
> > > > filesystem operations as well, and have copy_file_range ensure that the
> > > > attribute cache is updated. We're dealing with copy offload here so
> > > > doing an extra GETATTR beforehand shouldn't be terribly costly.
> > > > --
> > > > Jeff Layton <[email protected]>
> >
> > --
> > Jeff Layton <[email protected]>
> >

2018-10-24 19:21:43

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] fs: Don't copy beyond the end of the file

On Wed, Oct 24, 2018 at 2:53 PM Olga Kornievskaia
<[email protected]> wrote:
>
> On Wed, Oct 24, 2018 at 11:59 AM Olga Kornievskaia
> <[email protected]> wrote:
> >
> > On Wed, Oct 24, 2018 at 7:09 AM Jeff Layton <[email protected]> wrote:
> > >
> > > On Tue, 2018-10-23 at 12:50 -0400, Olga Kornievskaia wrote:
> > > > On Mon, Oct 22, 2018 at 7:23 PM Jeff Layton <[email protected]> wrote:
> > > > >
> > > > > On Mon, 2018-10-22 at 14:32 -0400, Olga Kornievskaia wrote:
> > > > > > On Sun, Oct 21, 2018 at 10:29 AM Jeff Layton <[email protected]> wrote:
> > > > > > >
> > > > > > > On Fri, 2018-10-19 at 11:29 -0400, Olga Kornievskaia wrote:
> > > > > > > > From: Anna Schumaker <[email protected]>
> > > > > > > >
> > > > > > > > Signed-off-by: Anna Schumaker <[email protected]>
> > > > > > > > ---
> > > > > > > > fs/read_write.c | 3 +++
> > > > > > > > 1 file changed, 3 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > > > > index 39b4a21..c60790f 100644
> > > > > > > > --- a/fs/read_write.c
> > > > > > > > +++ b/fs/read_write.c
> > > > > > > > @@ -1570,6 +1570,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > > > > > if (unlikely(ret))
> > > > > > > > return ret;
> > > > > > > >
> > > > > > > > + if (pos_in >= i_size_read(inode_in))
> > > > > > > > + return -EINVAL;
> > > > > > > > +
> > > > > > > > if (!(file_in->f_mode & FMODE_READ) ||
> > > > > > > > !(file_out->f_mode & FMODE_WRITE) ||
> > > > > > > > (file_out->f_flags & O_APPEND))
> > > > > > >
> > > > > > > The patch description could use a bit more fleshing-out. The
> > > > > > > copy_file_range manpage says:
> > > > > > >
> > > > > > > EINVAL Requested range extends beyond the end of the source file; or the flags argument is not 0.
> > > > > > >
> > > > > > > So I guess this is intended to satisfy that requirement?
> > > > > >
> > > > > > I agree the description of the patch is poor. It sort of falls under
> > > > > > the the man page's description of range beyond the end of the source
> > > > > > file. But in NFSv4.2, there is an explicit wording for the validity of
> > > > > > the input parameters and having an input source offset that's beyond
> > > > > > the end of the file is what this patch is attempting to check.
> > > > > >
> > > > >
> > > > > Side note:
> > > > >
> > > > > One has to wonder why they decided to make that an -EINVAL condition?
> > > >
> > > > To me that sounds like a correct use of -EINVAL error to check for
> > > > invalid arguments.
> > > >
> > > > You can argue that since there were no bytes to copy then returning 0
> > > > would be an appropriate "size" of the copy. However, I would argue how
> > > > would a caller distinguish this 0 size which really means don't try to
> > > > copy more vs another case when copy_file_range() returned less byte
> > > > (0) and so that caller should loop to get the rest.
> > > >
> > >
> > > I don't know -- it seems to run contrary to how read(2) and write(2)
> >
> > That's because copy_file_range is not just read/write but also lseek.
> > I think of it as doing lseek(to source offset)->read()->lseek(to dst
> > offset)->write. and lseek() does return EINVAL when position is beyond
> > the end of the file.
> >
> > > work with an EOF condition. I don't see why the implementation wouldn't
> > > want to just copy what you can up to the EOF of the source file. Maybe I
> > > need to go back and review the discussion from when the syscall was
> > > merged...
> >
> > >
> > > > > The system call returns ssize_t. Why not just return a fewer number of
> > > > > bytes in that situation?
> > > > >
> > > > > In fact, the RETURN VALUE section of the manpage says:
> > > > >
> > > > > Upon successful completion, copy_file_range() will return the number of
> > > > > bytes copied between files. This could be less than the length origi‐
> > > > > nally requested.
> > > > >
> > > > > Under what conditions would that occur that do not include the file
> > > > > being shorter than the range you wanted to copy?
> > > > >
> > > > > I wonder if we ought to lobby to get that changed.
> > > >
> > > > Do you mean ask NFSv4.2 spec to be changed? I thought that's stuff is
> > > > "written in stone".
> > > >
> > >
> > > No, I meant the copy_file_range spec (such as it is). I guess the v4.2
> > > spec has this though:
> > >
> > > If the source offset or the source offset plus count is greater than
> > > the size of the source file, the operation MUST fail with
> > > NFS4ERR_INVAL.
> > >
> > > I wonder if you'd be better off not trying to enforce this on the client
> > > and simply let the server return NFS4ERR_INVAL if we're beyond EOF on
> > > the source? That way it doesn't matter whether the client's attr cache
> > > is up to date or not.
> > >
> > > > >
> > > > > > > If so,
> > > > > > > i_size_read is just going to return whatever is in inode->isize.
> > > > > > > Could a copy_file_range call end up getting issued to copy from a file
> > > > > > > that is already open on a range that it doesn't know about yet? i.e.
> > > > > > > where the inode cache has not yet been updated.
> > > > > >
> > > > > > I thought that with NFSv4 cache consistency, the inode->isize is
> > > > > > accurate. If previous open had a read delegation, any modification on
> > > > > > a server would trigger a CB_RECALL and the open for the copy offload
> > > > > > would retrieve the latest size. In case of no delegations, the open
> > > > > > retrieves the latest size and the call to copy_file_range() would have
> > > > > > an update size.
> > > > > >
> > > > > > It seems like that could on network filesystems (like NFS). Would this
> > > > > > > be better handled in ->copy_file_range instead, where the driver could
> > > > > > > make a better determination of the file size?
> > > > > >
> > > > > > I'm not opposed to moving the size check into the NFS's copy_file_size
> > > > > > (again in my opinion NFS attribute cache has the same file size as the
> > > > > > inode's size). I think the thought was that such check should be done
> > > > > > at the VFS layer as oppose to doing it by each of the file systems.
> > > > > >
> > > > > >
> > > > >
> > > > > The attribute cache is not revalidated before the i_size is fetched with
> > > > > i_size_read. You're just reading what happens to be in the in-memory
> > > > > inode structure. So both clients have the file open already, and neither
> > > > > has a delegation:
> > > > >
> > > > > client 1: fetches attributes from file and sees a size of 1000
> > > > > client 2: writes 20 bytes at offset 1000
> > > > > client 1: calls copy file range to copy 1020 bytes starting at offset 0
> > > > >
> > > > > If client1 didn't get an attribute update before the copy_file_range
> > > > > call came in, then it would still think the size was 1000 and fail the
> > > > > operation. It may even be many seconds before client1 sees the updated
> > > > > size.
> > > > >
> > > > > You could argue that we're not using locking here so you're just subject
> > > > > to normal open-to-close cache coherency behavior, but that's rather "not
> > > > > nice".
> > > >
> > > > Yes I would argue that locking needs to be used. In case your
> > > > describing it is impossible to get an accurate file size. Even if the
> > > > client issues a GETATTR there is nothing prevents another client from
> > > > changing it right after that GETATTR was already replied to.
> > > >
> > > > What nfscopy application does is it opens then file and then locks it
> > > > (as suggested by the spec), then calls the copy_file_range() system
> > > > call. You can argue (if we didn't get a delegation) that a file might
> > > > have been changed between the reply to the OPEN and the LOCK operation
> > > > and therefore, we should send another GETATTR just in case. To guard
> > > > against that, I can add a getattr call though I think it's an overkill
> > > > specially since linux server always grants us a read delegation.
> > > >
> > >
> > > The spec does say you might need to lock the files but they don't say
> > > you SHOULD, only that servers need to be able to deal with lock
> > > stateids. hat said, you have a good point. We're not violating anything
> > > by not revalidating the cache beforehand. Maybe we don't need to do this
> > > after all.
> > >
> > > Personally, I think this would be best enforced by the NFS server. Just
> > > fire off the COPY request as-is and let the server vet the lengths.
> > >
> > > Side note: a delegation is never guaranteed. knfsd won't grant one if
> > > the inode falls afoul of the bloom filter, for instance, and that can
> > > easily happen if (for example) there is a hash collision between
> > > different inodes.
> >
> > I could push the checking validity of the arguments to the server (but
> > to me it sounds lame: a failure will require a network roundtrip). But
> > one can argue the server needs to check the validity of the arguments
> > anyway (as it can't rely on the client to play nice).
> >
> > I still think the check should be done in both places (client and
> > server). And I feel like VFS is the right place to do so (as this
> > check implements what the man page states). But I don't feel strongly
> > about dropping the patch all together (I'll add a patch to the
> > server).
>
> To add to my case, would it be acceptable to add a check *same as*
> it's done for the vfs_clone_file_range()
> pos_in+len > i_size_read() return EINVAL.

Sigh. this is not what we want either because in previous discussion,
it was agreed that it would be beneficial to just return however much
the copy could do.

The problem is that copy doesn't spell out that copy of size 0 means
"end of copy" like an EOF. So either that needs to agreed upon and
then receiving size 0 would indicate stopping. However, if return of
size 0 copy isn't signaling an end, then we need to enforce if (pos_in
> i_size_read()) EINVAL check.

Question to the community: should the semantics of copy_file_range()
be clarified to give return of 0 a significance?

> knfsd just calls in to the vfs to do the copy_file_range when it
> receives it, so these checks should be a part of the VFS.
>
> > > > > I think we probably ought to also push this check down into the
> > > > > filesystem operations as well, and have copy_file_range ensure that the
> > > > > attribute cache is updated. We're dealing with copy offload here so
> > > > > doing an extra GETATTR beforehand shouldn't be terribly costly.
> > > > > --
> > > > > Jeff Layton <[email protected]>
> > >
> > > --
> > > Jeff Layton <[email protected]>
> > >