2020-08-17 17:34:50

by Anna Schumaker

Subject: [PATCH v4 0/5] NFSD: Add support for the v4.2 READ_PLUS operation

From: Anna Schumaker <[email protected]>

These patches add server support for the READ_PLUS operation, which
breaks read requests into several "data" and "hole" segments when
replying to the client.

- Changes since v3:
- Combine first two patches related to xdr_reserve_space_vec()
- Remove unnecessary call to svc_encode_read_payload()

Here are the results of some performance tests I ran on some lab
machines. I tested by reading various 2G files from a few different underlying
filesystems and across several NFS versions. I used the `vmtouch` utility
to make sure files were only cached when we wanted them to be. In addition
to 100% data and 100% hole cases, I also tested with files that alternate
between data and hole segments. These files have either 4K, 8K, 16K, or 32K
segment sizes and start with either data or hole segments. So the file
mixed-4d has a 4K segment size beginning with a data segment, but mixed-32h
has 32K segments beginning with a hole. Times are reported in seconds; for
each NFS version, the first number is the uncached read time and the second
is the read time when the file is cached on the server.
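For reference, alternating files like the mixed-* ones described above can be generated with a short sparse-file writer. The sketch below is illustrative only (the fill pattern, sizes, and helper name are assumptions, not part of the actual test harness): data segments are written with pwrite() and hole segments are simply left unwritten, with a final ftruncate() so a trailing hole still counts toward the file size.

```c
/*
 * Illustrative sketch only (not part of the patch series): create a
 * sparse file alternating between data and hole segments, similar to
 * the mixed-* test files described above.
 */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int make_alternating_file(const char *path, size_t seg_size,
				 size_t total, int start_with_data)
{
	size_t off = 0;
	int data = start_with_data;
	int ret = -1;
	char *buf;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	buf = malloc(seg_size);
	if (!buf)
		goto out_close;
	memset(buf, 0xaa, seg_size);

	while (off < total) {
		size_t len = seg_size < total - off ? seg_size : total - off;

		/* write data segments; leave hole segments unwritten */
		if (data && pwrite(fd, buf, len, off) != (ssize_t)len)
			goto out_free;
		off += len;
		data = !data;
	}
	/* extend the file so a trailing hole still counts toward the size */
	ret = ftruncate(fd, total);
out_free:
	free(buf);
out_close:
	close(fd);
	return ret;
}
```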

I added some extra data collection (client CPU percentage and system
time), but with the extra columns I couldn't find a way to fit everything
into a concise table. I cut the v3 and v4.0 performance numbers to keep
the size down, but I kept v4.1 for comparison because it uses the same
code path that v4.2 uses when READ_PLUS is disabled.


Read Plus Results (ext4):
data
:... v4.1 ... Uncached ... 20.540 s, 105 MB/s, 0.65 s kern, 3% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
:... v4.2 ... Uncached ... 20.605 s, 104 MB/s, 0.65 s kern, 3% cpu
:....... Cached ..... 18.253 s, 118 MB/s, 0.67 s kern, 3% cpu
hole
:... v4.1 ... Uncached ... 18.255 s, 118 MB/s, 0.72 s kern, 3% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.72 s kern, 3% cpu
:... v4.2 ... Uncached ... 0.847 s, 2.5 GB/s, 0.73 s kern, 86% cpu
:....... Cached ..... 0.845 s, 2.5 GB/s, 0.72 s kern, 85% cpu
mixed-4d
:... v4.1 ... Uncached ... 54.691 s, 39 MB/s, 0.75 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.71 s kern, 3% cpu
:... v4.2 ... Uncached ... 51.587 s, 42 MB/s, 0.75 s kern, 1% cpu
:....... Cached ..... 9.215 s, 233 MB/s, 0.67 s kern, 7% cpu
mixed-8d
:... v4.1 ... Uncached ... 37.072 s, 58 MB/s, 0.67 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.71 s kern, 3% cpu
:... v4.2 ... Uncached ... 33.259 s, 65 MB/s, 0.68 s kern, 2% cpu
:....... Cached ..... 9.172 s, 234 MB/s, 0.67 s kern, 7% cpu
mixed-16d
:... v4.1 ... Uncached ... 27.138 s, 79 MB/s, 0.73 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.71 s kern, 3% cpu
:... v4.2 ... Uncached ... 23.042 s, 93 MB/s, 0.73 s kern, 3% cpu
:....... Cached ..... 9.150 s, 235 MB/s, 0.66 s kern, 7% cpu
mixed-32d
:... v4.1 ... Uncached ... 25.326 s, 85 MB/s, 0.68 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
:... v4.2 ... Uncached ... 21.125 s, 102 MB/s, 0.69 s kern, 3% cpu
:....... Cached ..... 9.140 s, 235 MB/s, 0.67 s kern, 7% cpu
mixed-4h
:... v4.1 ... Uncached ... 58.317 s, 37 MB/s, 0.75 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
:... v4.2 ... Uncached ... 51.878 s, 41 MB/s, 0.74 s kern, 1% cpu
:....... Cached ..... 9.215 s, 233 MB/s, 0.68 s kern, 7% cpu
mixed-8h
:... v4.1 ... Uncached ... 36.855 s, 58 MB/s, 0.68 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.72 s kern, 3% cpu
:... v4.2 ... Uncached ... 29.457 s, 73 MB/s, 0.68 s kern, 2% cpu
:....... Cached ..... 9.172 s, 234 MB/s, 0.67 s kern, 7% cpu
mixed-16h
:... v4.1 ... Uncached ... 26.460 s, 81 MB/s, 0.74 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.71 s kern, 3% cpu
:... v4.2 ... Uncached ... 19.587 s, 110 MB/s, 0.74 s kern, 3% cpu
:....... Cached ..... 9.150 s, 235 MB/s, 0.67 s kern, 7% cpu
mixed-32h
:... v4.1 ... Uncached ... 25.495 s, 84 MB/s, 0.69 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.65 s kern, 3% cpu
:... v4.2 ... Uncached ... 17.634 s, 122 MB/s, 0.69 s kern, 3% cpu
:....... Cached ..... 9.140 s, 235 MB/s, 0.68 s kern, 7% cpu



Read Plus Results (xfs):
data
:... v4.1 ... Uncached ... 20.230 s, 106 MB/s, 0.65 s kern, 3% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.68 s kern, 3% cpu
:... v4.2 ... Uncached ... 20.724 s, 104 MB/s, 0.65 s kern, 3% cpu
:....... Cached ..... 18.253 s, 118 MB/s, 0.67 s kern, 3% cpu
hole
:... v4.1 ... Uncached ... 18.255 s, 118 MB/s, 0.68 s kern, 3% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.69 s kern, 3% cpu
:... v4.2 ... Uncached ... 0.904 s, 2.4 GB/s, 0.72 s kern, 79% cpu
:....... Cached ..... 0.908 s, 2.4 GB/s, 0.73 s kern, 80% cpu
mixed-4d
:... v4.1 ... Uncached ... 57.553 s, 37 MB/s, 0.77 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
:... v4.2 ... Uncached ... 37.162 s, 58 MB/s, 0.73 s kern, 1% cpu
:....... Cached ..... 9.215 s, 233 MB/s, 0.67 s kern, 7% cpu
mixed-8d
:... v4.1 ... Uncached ... 36.754 s, 58 MB/s, 0.69 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.68 s kern, 3% cpu
:... v4.2 ... Uncached ... 24.454 s, 88 MB/s, 0.69 s kern, 2% cpu
:....... Cached ..... 9.172 s, 234 MB/s, 0.66 s kern, 7% cpu
mixed-16d
:... v4.1 ... Uncached ... 27.156 s, 79 MB/s, 0.73 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.71 s kern, 3% cpu
:... v4.2 ... Uncached ... 22.934 s, 94 MB/s, 0.72 s kern, 3% cpu
:....... Cached ..... 9.150 s, 235 MB/s, 0.68 s kern, 7% cpu
mixed-32d
:... v4.1 ... Uncached ... 27.849 s, 77 MB/s, 0.68 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.72 s kern, 3% cpu
:... v4.2 ... Uncached ... 23.670 s, 91 MB/s, 0.67 s kern, 2% cpu
:....... Cached ..... 9.139 s, 235 MB/s, 0.64 s kern, 7% cpu
mixed-4h
:... v4.1 ... Uncached ... 57.639 s, 37 MB/s, 0.72 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.69 s kern, 3% cpu
:... v4.2 ... Uncached ... 35.503 s, 61 MB/s, 0.72 s kern, 2% cpu
:....... Cached ..... 9.215 s, 233 MB/s, 0.66 s kern, 7% cpu
mixed-8h
:... v4.1 ... Uncached ... 37.044 s, 58 MB/s, 0.71 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.68 s kern, 3% cpu
:... v4.2 ... Uncached ... 23.779 s, 90 MB/s, 0.69 s kern, 2% cpu
:....... Cached ..... 9.172 s, 234 MB/s, 0.65 s kern, 7% cpu
mixed-16h
:... v4.1 ... Uncached ... 27.167 s, 79 MB/s, 0.73 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.67 s kern, 3% cpu
:... v4.2 ... Uncached ... 19.088 s, 113 MB/s, 0.75 s kern, 3% cpu
:....... Cached ..... 9.159 s, 234 MB/s, 0.66 s kern, 7% cpu
mixed-32h
:... v4.1 ... Uncached ... 27.592 s, 78 MB/s, 0.71 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.68 s kern, 3% cpu
:... v4.2 ... Uncached ... 19.682 s, 109 MB/s, 0.67 s kern, 3% cpu
:....... Cached ..... 9.140 s, 235 MB/s, 0.67 s kern, 7% cpu



Read Plus Results (btrfs):
data
:... v4.1 ... Uncached ... 21.317 s, 101 MB/s, 0.63 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.67 s kern, 3% cpu
:... v4.2 ... Uncached ... 28.665 s, 75 MB/s, 0.65 s kern, 2% cpu
:....... Cached ..... 18.253 s, 118 MB/s, 0.66 s kern, 3% cpu
hole
:... v4.1 ... Uncached ... 18.256 s, 118 MB/s, 0.70 s kern, 3% cpu
: :....... Cached ..... 18.254 s, 118 MB/s, 0.73 s kern, 4% cpu
:... v4.2 ... Uncached ... 0.851 s, 2.5 GB/s, 0.72 s kern, 84% cpu
:....... Cached ..... 0.847 s, 2.5 GB/s, 0.73 s kern, 86% cpu
mixed-4d
:... v4.1 ... Uncached ... 56.857 s, 38 MB/s, 0.76 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.72 s kern, 3% cpu
:... v4.2 ... Uncached ... 54.455 s, 39 MB/s, 0.73 s kern, 1% cpu
:....... Cached ..... 9.215 s, 233 MB/s, 0.68 s kern, 7% cpu
mixed-8d
:... v4.1 ... Uncached ... 36.641 s, 59 MB/s, 0.68 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
:... v4.2 ... Uncached ... 33.205 s, 65 MB/s, 0.67 s kern, 2% cpu
:....... Cached ..... 9.172 s, 234 MB/s, 0.65 s kern, 7% cpu
mixed-16d
:... v4.1 ... Uncached ... 28.653 s, 75 MB/s, 0.72 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
:... v4.2 ... Uncached ... 25.748 s, 83 MB/s, 0.71 s kern, 2% cpu
:....... Cached ..... 9.150 s, 235 MB/s, 0.64 s kern, 7% cpu
mixed-32d
:... v4.1 ... Uncached ... 28.886 s, 74 MB/s, 0.67 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.71 s kern, 3% cpu
:... v4.2 ... Uncached ... 24.724 s, 87 MB/s, 0.74 s kern, 2% cpu
:....... Cached ..... 9.140 s, 235 MB/s, 0.63 s kern, 6% cpu
mixed-4h
:... v4.1 ... Uncached ... 52.181 s, 41 MB/s, 0.73 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.66 s kern, 3% cpu
:... v4.2 ... Uncached ... 150.341 s, 14 MB/s, 0.72 s kern, 0% cpu
:....... Cached ..... 9.216 s, 233 MB/s, 0.63 s kern, 6% cpu
mixed-8h
:... v4.1 ... Uncached ... 36.945 s, 58 MB/s, 0.68 s kern, 1% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.65 s kern, 3% cpu
:... v4.2 ... Uncached ... 79.781 s, 27 MB/s, 0.68 s kern, 0% cpu
:....... Cached ..... 9.172 s, 234 MB/s, 0.66 s kern, 7% cpu
mixed-16h
:... v4.1 ... Uncached ... 28.651 s, 75 MB/s, 0.73 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.66 s kern, 3% cpu
:... v4.2 ... Uncached ... 47.428 s, 45 MB/s, 0.71 s kern, 1% cpu
:....... Cached ..... 9.150 s, 235 MB/s, 0.67 s kern, 7% cpu
mixed-32h
:... v4.1 ... Uncached ... 28.618 s, 75 MB/s, 0.69 s kern, 2% cpu
: :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
:... v4.2 ... Uncached ... 38.813 s, 55 MB/s, 0.67 s kern, 1% cpu
:....... Cached ..... 9.140 s, 235 MB/s, 0.61 s kern, 6% cpu



Thoughts?
Anna


Anna Schumaker (5):
SUNRPC/NFSD: Implement xdr_reserve_space_vec()
NFSD: Add READ_PLUS data support
NFSD: Add READ_PLUS hole segment encoding
NFSD: Return both a hole and a data segment
NFSD: Encode a full READ_PLUS reply

fs/nfsd/nfs4proc.c | 17 ++++
fs/nfsd/nfs4xdr.c | 167 +++++++++++++++++++++++++++++++------
include/linux/sunrpc/xdr.h | 2 +
net/sunrpc/xdr.c | 45 ++++++++++
4 files changed, 204 insertions(+), 27 deletions(-)

--
2.28.0


2020-08-17 17:34:51

by Anna Schumaker

Subject: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

From: Anna Schumaker <[email protected]>

This patch adds READ_PLUS support for returning a single
NFS4_CONTENT_DATA segment to the client. This is basically the same as
the READ operation, only with the extra information about data segments.
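For context, each data segment on the wire is preceded by a small XDR header (content type, offset, byte count), matching the three write_bytes_to_xdr_buf() calls in the patch below. A minimal userspace sketch of that header encoding, assuming NFS4_CONTENT_DATA encodes as 0 per RFC 7862 (helper names are illustrative):

```c
/*
 * Illustrative sketch only: the XDR header that precedes each
 * READ_PLUS data segment -- content type, offset, count -- matching
 * the three write_bytes_to_xdr_buf() calls in the patch. Assumes
 * NFS4_CONTENT_DATA encodes as 0 (RFC 7862).
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static void put_be32(unsigned char *p, uint32_t v)
{
	p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v;
}

static void put_be64(unsigned char *p, uint64_t v)
{
	put_be32(p, v >> 32);
	put_be32(p + 4, (uint32_t)v);
}

/* Returns the number of header bytes written (always 4 + 8 + 4). */
static size_t encode_data_segment_hdr(unsigned char *p, uint64_t offset,
				      uint32_t count)
{
	put_be32(p, 0);			/* NFS4_CONTENT_DATA */
	put_be64(p + 4, offset);	/* segment offset */
	put_be32(p + 12, count);	/* length of the data that follows */
	return 16;
}
```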

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfsd/nfs4proc.c | 17 ++++++++++
fs/nfsd/nfs4xdr.c | 83 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index a09c35f0f6f0..9630d33211f2 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -2523,6 +2523,16 @@ static inline u32 nfsd4_read_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
return (op_encode_hdr_size + 2 + XDR_QUADLEN(rlen)) * sizeof(__be32);
}

+static inline u32 nfsd4_read_plus_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
+{
+ u32 maxcount = svc_max_payload(rqstp);
+ u32 rlen = min(op->u.read.rd_length, maxcount);
+ /* enough extra xdr space for encoding either a hole or data segment. */
+ u32 segments = 1 + 2 + 2;
+
+ return (op_encode_hdr_size + 2 + segments + XDR_QUADLEN(rlen)) * sizeof(__be32);
+}
+
static inline u32 nfsd4_readdir_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
{
u32 maxcount = 0, rlen = 0;
@@ -3059,6 +3069,13 @@ static const struct nfsd4_operation nfsd4_ops[] = {
.op_name = "OP_COPY",
.op_rsize_bop = nfsd4_copy_rsize,
},
+ [OP_READ_PLUS] = {
+ .op_func = nfsd4_read,
+ .op_release = nfsd4_read_release,
+ .op_name = "OP_READ_PLUS",
+ .op_rsize_bop = nfsd4_read_plus_rsize,
+ .op_get_currentstateid = nfsd4_get_readstateid,
+ },
[OP_SEEK] = {
.op_func = nfsd4_seek,
.op_name = "OP_SEEK",
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 6a1c0a7fae05..9af92f538000 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1957,7 +1957,7 @@ static const nfsd4_dec nfsd4_dec_ops[] = {
[OP_LAYOUTSTATS] = (nfsd4_dec)nfsd4_decode_notsupp,
[OP_OFFLOAD_CANCEL] = (nfsd4_dec)nfsd4_decode_offload_status,
[OP_OFFLOAD_STATUS] = (nfsd4_dec)nfsd4_decode_offload_status,
- [OP_READ_PLUS] = (nfsd4_dec)nfsd4_decode_notsupp,
+ [OP_READ_PLUS] = (nfsd4_dec)nfsd4_decode_read,
[OP_SEEK] = (nfsd4_dec)nfsd4_decode_seek,
[OP_WRITE_SAME] = (nfsd4_dec)nfsd4_decode_notsupp,
[OP_CLONE] = (nfsd4_dec)nfsd4_decode_clone,
@@ -4367,6 +4367,85 @@ nfsd4_encode_offload_status(struct nfsd4_compoundres *resp, __be32 nfserr,
return nfserr_resource;
p = xdr_encode_hyper(p, os->count);
*p++ = cpu_to_be32(0);
+ return nfserr;
+}
+
+static __be32
+nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
+ struct nfsd4_read *read,
+ unsigned long maxcount, u32 *eof)
+{
+ struct xdr_stream *xdr = &resp->xdr;
+ struct file *file = read->rd_nf->nf_file;
+ int starting_len = xdr->buf->len;
+ __be32 nfserr;
+ __be32 *p, tmp;
+ __be64 tmp64;
+
+ /* Content type, offset, byte count */
+ p = xdr_reserve_space(xdr, 4 + 8 + 4);
+ if (!p)
+ return nfserr_resource;
+
+ read->rd_vlen = xdr_reserve_space_vec(xdr, resp->rqstp->rq_vec, maxcount);
+ if (read->rd_vlen < 0)
+ return nfserr_resource;
+
+ nfserr = nfsd_readv(resp->rqstp, read->rd_fhp, file, read->rd_offset,
+ resp->rqstp->rq_vec, read->rd_vlen, &maxcount, eof);
+ if (nfserr)
+ return nfserr;
+
+ tmp = htonl(NFS4_CONTENT_DATA);
+ write_bytes_to_xdr_buf(xdr->buf, starting_len, &tmp, 4);
+ tmp64 = cpu_to_be64(read->rd_offset);
+ write_bytes_to_xdr_buf(xdr->buf, starting_len + 4, &tmp64, 8);
+ tmp = htonl(maxcount);
+ write_bytes_to_xdr_buf(xdr->buf, starting_len + 12, &tmp, 4);
+ return nfs_ok;
+}
+
+static __be32
+nfsd4_encode_read_plus(struct nfsd4_compoundres *resp, __be32 nfserr,
+ struct nfsd4_read *read)
+{
+ unsigned long maxcount;
+ struct xdr_stream *xdr = &resp->xdr;
+ struct file *file;
+ int starting_len = xdr->buf->len;
+ int segments = 0;
+ __be32 *p, tmp;
+ u32 eof;
+
+ if (nfserr)
+ return nfserr;
+ file = read->rd_nf->nf_file;
+
+ /* eof flag, segment count */
+ p = xdr_reserve_space(xdr, 4 + 4);
+ if (!p)
+ return nfserr_resource;
+ xdr_commit_encode(xdr);
+
+ maxcount = svc_max_payload(resp->rqstp);
+ maxcount = min_t(unsigned long, maxcount,
+ (xdr->buf->buflen - xdr->buf->len));
+ maxcount = min_t(unsigned long, maxcount, read->rd_length);
+
+ eof = read->rd_offset >= i_size_read(file_inode(file));
+ if (!eof) {
+ nfserr = nfsd4_encode_read_plus_data(resp, read, maxcount, &eof);
+ segments++;
+ }
+
+ if (nfserr)
+ xdr_truncate_encode(xdr, starting_len);
+ else {
+ tmp = htonl(eof);
+ write_bytes_to_xdr_buf(xdr->buf, starting_len, &tmp, 4);
+ tmp = htonl(segments);
+ write_bytes_to_xdr_buf(xdr->buf, starting_len + 4, &tmp, 4);
+ }

return nfserr;
}
@@ -4509,7 +4588,7 @@ static const nfsd4_enc nfsd4_enc_ops[] = {
[OP_LAYOUTSTATS] = (nfsd4_enc)nfsd4_encode_noop,
[OP_OFFLOAD_CANCEL] = (nfsd4_enc)nfsd4_encode_noop,
[OP_OFFLOAD_STATUS] = (nfsd4_enc)nfsd4_encode_offload_status,
- [OP_READ_PLUS] = (nfsd4_enc)nfsd4_encode_noop,
+ [OP_READ_PLUS] = (nfsd4_enc)nfsd4_encode_read_plus,
[OP_SEEK] = (nfsd4_enc)nfsd4_encode_seek,
[OP_WRITE_SAME] = (nfsd4_enc)nfsd4_encode_noop,
[OP_CLONE] = (nfsd4_enc)nfsd4_encode_noop,
--
2.28.0

2020-08-17 17:35:03

by Anna Schumaker

Subject: [PATCH v4 1/5] SUNRPC/NFSD: Implement xdr_reserve_space_vec()

From: Anna Schumaker <[email protected]>

Reserving space for a large READ payload requires special handling of
the xdr buffer pages. One problem is use of the scratch buffer, which
provides a pointer to a contiguous region of data up to PAGE_SIZE. When
the scratch buffer is in use, calls to xdr_commit_encode() shift the
data to its proper alignment in the xdr buffer. If we've reserved
several pages in a vector, this could potentially invalidate earlier
pointers and result in incorrect READ data being sent to the client.

I get around this by looking at the amount of space left in the current
page and never reserving more than that for each entry in the read
vector. This lets us place data directly where it needs to go in the
buffer pages.
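The resulting per-page split can be modeled in userspace. This sketch (with an illustrative 4K page size and helper name) shows how each vector entry is capped at the space remaining in the current page, so no entry spans a page boundary:

```c
/*
 * Userspace model (illustrative sketch, not the kernel code) of the
 * splitting described above: each vector entry is capped at the space
 * remaining in the current page, so no entry ever spans a page
 * boundary and a later commit cannot shift it.
 */
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096	/* illustrative page size */

/* Returns the number of entries used; lens[] receives each length. */
static int split_reservation(size_t page_len, size_t nbytes,
			     size_t *lens, int max_entries)
{
	int v = 0;

	while (nbytes && v < max_entries) {
		size_t used = page_len % MODEL_PAGE_SIZE;
		size_t thislen = MODEL_PAGE_SIZE - used;

		if (thislen > nbytes)
			thislen = nbytes;
		lens[v++] = thislen;
		page_len += thislen;
		nbytes -= thislen;
	}
	return v;
}
```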

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfsd/nfs4xdr.c | 28 +++---------------------
include/linux/sunrpc/xdr.h | 2 ++
net/sunrpc/xdr.c | 45 ++++++++++++++++++++++++++++++++++++++
3 files changed, 50 insertions(+), 25 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 996ac01ee977..6a1c0a7fae05 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3584,36 +3584,14 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
{
struct xdr_stream *xdr = &resp->xdr;
u32 eof;
- int v;
int starting_len = xdr->buf->len - 8;
- long len;
- int thislen;
__be32 nfserr;
__be32 tmp;
- __be32 *p;
int pad;

- /*
- * svcrdma requires every READ payload to start somewhere
- * in xdr->pages.
- */
- if (xdr->iov == xdr->buf->head) {
- xdr->iov = NULL;
- xdr->end = xdr->p;
- }
-
- len = maxcount;
- v = 0;
- while (len) {
- thislen = min_t(long, len, PAGE_SIZE);
- p = xdr_reserve_space(xdr, thislen);
- WARN_ON_ONCE(!p);
- resp->rqstp->rq_vec[v].iov_base = p;
- resp->rqstp->rq_vec[v].iov_len = thislen;
- v++;
- len -= thislen;
- }
- read->rd_vlen = v;
+ read->rd_vlen = xdr_reserve_space_vec(xdr, resp->rqstp->rq_vec, maxcount);
+ if (read->rd_vlen < 0)
+ return nfserr_resource;

nfserr = nfsd_readv(resp->rqstp, read->rd_fhp, file, read->rd_offset,
resp->rqstp->rq_vec, read->rd_vlen, &maxcount,
diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index 22c207b2425f..bac459584dd0 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -234,6 +234,8 @@ typedef int (*kxdrdproc_t)(struct rpc_rqst *rqstp, struct xdr_stream *xdr,
extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf,
__be32 *p, struct rpc_rqst *rqst);
extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes);
+extern int xdr_reserve_space_vec(struct xdr_stream *xdr, struct kvec *vec,
+ size_t nbytes);
extern void xdr_commit_encode(struct xdr_stream *xdr);
extern void xdr_truncate_encode(struct xdr_stream *xdr, size_t len);
extern int xdr_restrict_buflen(struct xdr_stream *xdr, int newbuflen);
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index be11d672b5b9..6dfe5dc8b35f 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -648,6 +648,51 @@ __be32 * xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes)
}
EXPORT_SYMBOL_GPL(xdr_reserve_space);

+
+/**
+ * xdr_reserve_space_vec - Reserves a large amount of buffer space for sending
+ * @xdr: pointer to xdr_stream
+ * @vec: pointer to a kvec array
+ * @nbytes: number of bytes to reserve
+ *
+ * Reserves enough buffer space to encode 'nbytes' of data and stores the
+ * pointers in 'vec'. The size argument passed to xdr_reserve_space() is
+ * determined based on the number of bytes remaining in the current page to
+ * avoid invalidating iov_base pointers when xdr_commit_encode() is called.
+ */
+int xdr_reserve_space_vec(struct xdr_stream *xdr, struct kvec *vec, size_t nbytes)
+{
+ int thislen;
+ int v = 0;
+ __be32 *p;
+
+ /*
+ * svcrdma requires every READ payload to start somewhere
+ * in xdr->pages.
+ */
+ if (xdr->iov == xdr->buf->head) {
+ xdr->iov = NULL;
+ xdr->end = xdr->p;
+ }
+
+ while (nbytes) {
+ thislen = xdr->buf->page_len % PAGE_SIZE;
+ thislen = min_t(size_t, nbytes, PAGE_SIZE - thislen);
+
+ p = xdr_reserve_space(xdr, thislen);
+ if (!p)
+ return -EIO;
+
+ vec[v].iov_base = p;
+ vec[v].iov_len = thislen;
+ v++;
+ nbytes -= thislen;
+ }
+
+ return v;
+}
+EXPORT_SYMBOL_GPL(xdr_reserve_space_vec);
+
/**
* xdr_truncate_encode - truncate an encode buffer
* @xdr: pointer to xdr_stream
--
2.28.0

2020-08-17 17:37:01

by Anna Schumaker

Subject: [PATCH v4 5/5] NFSD: Encode a full READ_PLUS reply

From: Anna Schumaker <[email protected]>

Reply to the client with multiple hole and data segments. I use the
result of the first vfs_llseek() call for encoding as an optimization so
we don't have to immediately repeat the call.
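The alternating data/hole walk can be illustrated from userspace with lseek(2) and SEEK_DATA/SEEK_HOLE, the syscall counterparts of the vfs_llseek() calls used here. This sketch (helper name is illustrative) counts a file's segments the same way the encoder loop alternates:

```c
/*
 * Illustrative userspace sketch of the alternating walk described
 * above, using lseek(2) with SEEK_DATA/SEEK_HOLE (the syscall
 * counterpart of vfs_llseek()). Counts a file's segments the same
 * way the encoder loop alternates between data and hole.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int count_segments(int fd, off_t size)
{
	off_t pos = 0;
	int segments = 0;

	while (pos < size) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		off_t end;

		if (data == pos)	/* inside data: ends at the next hole */
			end = lseek(fd, pos, SEEK_HOLE);
		else			/* inside a hole: ends at next data or EOF */
			end = (data < 0) ? size : data;

		if (end <= pos)
			break;
		pos = end;
		segments++;
	}
	return segments;
}
```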

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfsd/nfs4xdr.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 3f4860103b25..fb71e52accee 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4373,16 +4373,18 @@ nfsd4_encode_offload_status(struct nfsd4_compoundres *resp, __be32 nfserr,
static __be32
nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- unsigned long *maxcount, u32 *eof)
+ unsigned long *maxcount, u32 *eof,
+ loff_t *pos)
{
struct xdr_stream *xdr = &resp->xdr;
struct file *file = read->rd_nf->nf_file;
int starting_len = xdr->buf->len;
- loff_t hole_pos = vfs_llseek(file, read->rd_offset, SEEK_HOLE);
+ loff_t hole_pos;
__be32 nfserr;
__be32 *p, tmp;
__be64 tmp64;

+ hole_pos = pos ? *pos : vfs_llseek(file, read->rd_offset, SEEK_HOLE);
if (hole_pos > read->rd_offset)
*maxcount = min_t(unsigned long, *maxcount, hole_pos - read->rd_offset);

@@ -4449,6 +4451,7 @@ nfsd4_encode_read_plus(struct nfsd4_compoundres *resp, __be32 nfserr,
int starting_len = xdr->buf->len;
int segments = 0;
__be32 *p, tmp;
+ bool is_data;
loff_t pos;
u32 eof;

@@ -4472,29 +4475,21 @@ nfsd4_encode_read_plus(struct nfsd4_compoundres *resp, __be32 nfserr,
if (eof)
goto out;

- pos = vfs_llseek(file, read->rd_offset, SEEK_DATA);
- if (pos == -ENXIO)
- pos = i_size_read(file_inode(file));
- else if (pos < 0)
- pos = read->rd_offset;
+ pos = vfs_llseek(file, read->rd_offset, SEEK_HOLE);
+ is_data = pos > read->rd_offset;

- if (pos == read->rd_offset) {
+ while (count > 0 && !eof) {
maxcount = count;
- nfserr = nfsd4_encode_read_plus_data(resp, read, &maxcount, &eof);
- if (nfserr)
- goto out;
- count -= maxcount;
- read->rd_offset += maxcount;
- segments++;
- }
-
- if (count > 0 && !eof) {
- maxcount = count;
- nfserr = nfsd4_encode_read_plus_hole(resp, read, &maxcount, &eof);
+ if (is_data)
+ nfserr = nfsd4_encode_read_plus_data(resp, read, &maxcount, &eof,
+ segments == 0 ? &pos : NULL);
+ else
+ nfserr = nfsd4_encode_read_plus_hole(resp, read, &maxcount, &eof);
if (nfserr)
goto out;
count -= maxcount;
read->rd_offset += maxcount;
+ is_data = !is_data;
segments++;
}

--
2.28.0

2020-08-19 17:08:47

by Chuck Lever

Subject: Re: [PATCH v4 0/5] NFSD: Add support for the v4.2 READ_PLUS operation


> On Aug 17, 2020, at 12:53 PM, [email protected] wrote:
>
> From: Anna Schumaker <[email protected]>
>
> These patches add server support for the READ_PLUS operation, which
> breaks read requests into several "data" and "hole" segments when
> replying to the client.
>
> - Changes since v3:
> - Combine first two patches related to xdr_reserve_space_vec()
> - Remove unnecessary call to svc_encode_read_payload()

My vote is let's merge v3 and continue refining.



--
Chuck Lever



2020-08-26 21:56:56

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] NFSD: Add support for the v4.2 READ_PLUS operation

On Mon, Aug 17, 2020 at 12:53:05PM -0400, [email protected] wrote:
> From: Anna Schumaker <[email protected]>
>
> These patches add server support for the READ_PLUS operation, which
> breaks read requests into several "data" and "hole" segments when
> replying to the client.
>
> - Changes since v3:
> - Combine first two patches related to xdr_reserve_space_vec()
> - Remove unnecessary call to svc_encode_read_payload()
>
> Here are the results of some performance tests I ran on some lab
> machines.

What's the hardware setup? (Do you know the network and disk bandwidth?)

> I tested by reading various 2G files from a few different underlying
> filesystems and across several NFS versions. I used the `vmtouch` utility
> to make sure files were only cached when we wanted them to be. In addition
> to 100% data and 100% hole cases, I also tested with files that alternate
> between data and hole segments. These files have either 4K, 8K, 16K, or 32K
> segment sizes and start with either data or hole segments. So the file
> mixed-4d has a 4K segment size beginning with a data segment, but mixed-32h
> has 32K segments beginning with a hole. The units are in seconds, with the
> first number for each NFS version being the uncached read time and the second
> number is for when the file is cached on the server.

The only numbers that look really strange are in the btrfs uncached
results: the data-only case and the mixed cases that start with a hole.
Do we have any idea what's up there?

--b.

> Read Plus Results (btrfs):
> data
> :... v4.1 ... Uncached ... 21.317 s, 101 MB/s, 0.63 s kern, 2% cpu
> : :....... Cached ..... 18.252 s, 118 MB/s, 0.67 s kern, 3% cpu
> :... v4.2 ... Uncached ... 28.665 s, 75 MB/s, 0.65 s kern, 2% cpu
> :....... Cached ..... 18.253 s, 118 MB/s, 0.66 s kern, 3% cpu
> hole
> :... v4.1 ... Uncached ... 18.256 s, 118 MB/s, 0.70 s kern, 3% cpu
> : :....... Cached ..... 18.254 s, 118 MB/s, 0.73 s kern, 4% cpu
> :... v4.2 ... Uncached ... 0.851 s, 2.5 GB/s, 0.72 s kern, 84% cpu
> :....... Cached ..... 0.847 s, 2.5 GB/s, 0.73 s kern, 86% cpu
> mixed-4d
> :... v4.1 ... Uncached ... 56.857 s, 38 MB/s, 0.76 s kern, 1% cpu
> : :....... Cached ..... 18.252 s, 118 MB/s, 0.72 s kern, 3% cpu
> :... v4.2 ... Uncached ... 54.455 s, 39 MB/s, 0.73 s kern, 1% cpu
> :....... Cached ..... 9.215 s, 233 MB/s, 0.68 s kern, 7% cpu
> mixed-8d
> :... v4.1 ... Uncached ... 36.641 s, 59 MB/s, 0.68 s kern, 1% cpu
> : :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
> :... v4.2 ... Uncached ... 33.205 s, 65 MB/s, 0.67 s kern, 2% cpu
> :....... Cached ..... 9.172 s, 234 MB/s, 0.65 s kern, 7% cpu
> mixed-16d
> :... v4.1 ... Uncached ... 28.653 s, 75 MB/s, 0.72 s kern, 2% cpu
> : :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
> :... v4.2 ... Uncached ... 25.748 s, 83 MB/s, 0.71 s kern, 2% cpu
> :....... Cached ..... 9.150 s, 235 MB/s, 0.64 s kern, 7% cpu
> mixed-32d
> :... v4.1 ... Uncached ... 28.886 s, 74 MB/s, 0.67 s kern, 2% cpu
> : :....... Cached ..... 18.252 s, 118 MB/s, 0.71 s kern, 3% cpu
> :... v4.2 ... Uncached ... 24.724 s, 87 MB/s, 0.74 s kern, 2% cpu
> :....... Cached ..... 9.140 s, 235 MB/s, 0.63 s kern, 6% cpu
> mixed-4h
> :... v4.1 ... Uncached ... 52.181 s, 41 MB/s, 0.73 s kern, 1% cpu
> : :....... Cached ..... 18.252 s, 118 MB/s, 0.66 s kern, 3% cpu
> :... v4.2 ... Uncached ... 150.341 s, 14 MB/s, 0.72 s kern, 0% cpu
> :....... Cached ..... 9.216 s, 233 MB/s, 0.63 s kern, 6% cpu
> mixed-8h
> :... v4.1 ... Uncached ... 36.945 s, 58 MB/s, 0.68 s kern, 1% cpu
> : :....... Cached ..... 18.252 s, 118 MB/s, 0.65 s kern, 3% cpu
> :... v4.2 ... Uncached ... 79.781 s, 27 MB/s, 0.68 s kern, 0% cpu
> :....... Cached ..... 9.172 s, 234 MB/s, 0.66 s kern, 7% cpu
> mixed-16h
> :... v4.1 ... Uncached ... 28.651 s, 75 MB/s, 0.73 s kern, 2% cpu
> : :....... Cached ..... 18.252 s, 118 MB/s, 0.66 s kern, 3% cpu
> :... v4.2 ... Uncached ... 47.428 s, 45 MB/s, 0.71 s kern, 1% cpu
> :....... Cached ..... 9.150 s, 235 MB/s, 0.67 s kern, 7% cpu
> mixed-32h
> :... v4.1 ... Uncached ... 28.618 s, 75 MB/s, 0.69 s kern, 2% cpu
> : :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
> :... v4.2 ... Uncached ... 38.813 s, 55 MB/s, 0.67 s kern, 1% cpu
> :....... Cached ..... 9.140 s, 235 MB/s, 0.61 s kern, 6% cpu

2020-08-28 21:26:02

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Mon, Aug 17, 2020 at 12:53:07PM -0400, [email protected] wrote:
> From: Anna Schumaker <[email protected]>
>
> This patch adds READ_PLUS support for returning a single
> NFS4_CONTENT_DATA segment to the client. This is basically the same as
> the READ operation, only with the extra information about data segments.
>
> Signed-off-by: Anna Schumaker <[email protected]>
> ---
> fs/nfsd/nfs4proc.c | 17 ++++++++++
> fs/nfsd/nfs4xdr.c | 83 ++++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 98 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index a09c35f0f6f0..9630d33211f2 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -2523,6 +2523,16 @@ static inline u32 nfsd4_read_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> return (op_encode_hdr_size + 2 + XDR_QUADLEN(rlen)) * sizeof(__be32);
> }
>
> +static inline u32 nfsd4_read_plus_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> +{
> + u32 maxcount = svc_max_payload(rqstp);
> + u32 rlen = min(op->u.read.rd_length, maxcount);
> + /* enough extra xdr space for encoding either a hole or data segment. */
> + u32 segments = 1 + 2 + 2;
> +
> + return (op_encode_hdr_size + 2 + segments + XDR_QUADLEN(rlen)) * sizeof(__be32);

I'm not sure I understand this calculation.

In the final code, there's no fixed limit on the number of segments
returned by a single READ_PLUS op, right?

--b.

> +}
> +
> static inline u32 nfsd4_readdir_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> {
> u32 maxcount = 0, rlen = 0;
> @@ -3059,6 +3069,13 @@ static const struct nfsd4_operation nfsd4_ops[] = {
> .op_name = "OP_COPY",
> .op_rsize_bop = nfsd4_copy_rsize,
> },
> + [OP_READ_PLUS] = {
> + .op_func = nfsd4_read,
> + .op_release = nfsd4_read_release,
> + .op_name = "OP_READ_PLUS",
> + .op_rsize_bop = nfsd4_read_plus_rsize,
> + .op_get_currentstateid = nfsd4_get_readstateid,
> + },
> [OP_SEEK] = {
> .op_func = nfsd4_seek,
> .op_name = "OP_SEEK",
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 6a1c0a7fae05..9af92f538000 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1957,7 +1957,7 @@ static const nfsd4_dec nfsd4_dec_ops[] = {
> [OP_LAYOUTSTATS] = (nfsd4_dec)nfsd4_decode_notsupp,
> [OP_OFFLOAD_CANCEL] = (nfsd4_dec)nfsd4_decode_offload_status,
> [OP_OFFLOAD_STATUS] = (nfsd4_dec)nfsd4_decode_offload_status,
> - [OP_READ_PLUS] = (nfsd4_dec)nfsd4_decode_notsupp,
> + [OP_READ_PLUS] = (nfsd4_dec)nfsd4_decode_read,
> [OP_SEEK] = (nfsd4_dec)nfsd4_decode_seek,
> [OP_WRITE_SAME] = (nfsd4_dec)nfsd4_decode_notsupp,
> [OP_CLONE] = (nfsd4_dec)nfsd4_decode_clone,
> @@ -4367,6 +4367,85 @@ nfsd4_encode_offload_status(struct nfsd4_compoundres *resp, __be32 nfserr,
> return nfserr_resource;
> p = xdr_encode_hyper(p, os->count);
> *p++ = cpu_to_be32(0);
> + return nfserr;
> +}
> +
> +static __be32
> +nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
> + struct nfsd4_read *read,
> + unsigned long maxcount, u32 *eof)
> +{
> + struct xdr_stream *xdr = &resp->xdr;
> + struct file *file = read->rd_nf->nf_file;
> + int starting_len = xdr->buf->len;
> + __be32 nfserr;
> + __be32 *p, tmp;
> + __be64 tmp64;
> +
> + /* Content type, offset, byte count */
> + p = xdr_reserve_space(xdr, 4 + 8 + 4);
> + if (!p)
> + return nfserr_resource;
> +
> + read->rd_vlen = xdr_reserve_space_vec(xdr, resp->rqstp->rq_vec, maxcount);
> + if (read->rd_vlen < 0)
> + return nfserr_resource;
> +
> + nfserr = nfsd_readv(resp->rqstp, read->rd_fhp, file, read->rd_offset,
> + resp->rqstp->rq_vec, read->rd_vlen, &maxcount, eof);
> + if (nfserr)
> + return nfserr;
> +
> + tmp = htonl(NFS4_CONTENT_DATA);
> + write_bytes_to_xdr_buf(xdr->buf, starting_len, &tmp, 4);
> + tmp64 = cpu_to_be64(read->rd_offset);
> + write_bytes_to_xdr_buf(xdr->buf, starting_len + 4, &tmp64, 8);
> + tmp = htonl(maxcount);
> + write_bytes_to_xdr_buf(xdr->buf, starting_len + 12, &tmp, 4);
> + return nfs_ok;
> +}
> +
> +static __be32
> +nfsd4_encode_read_plus(struct nfsd4_compoundres *resp, __be32 nfserr,
> + struct nfsd4_read *read)
> +{
> + unsigned long maxcount;
> + struct xdr_stream *xdr = &resp->xdr;
> + struct file *file;
> + int starting_len = xdr->buf->len;
> + int segments = 0;
> + __be32 *p, tmp;
> + u32 eof;
> +
> + if (nfserr)
> + return nfserr;
> + file = read->rd_nf->nf_file;
> +
> + /* eof flag, segment count */
> + p = xdr_reserve_space(xdr, 4 + 4);
> + if (!p)
> + return nfserr_resource;
> + xdr_commit_encode(xdr);
> +
> + maxcount = svc_max_payload(resp->rqstp);
> + maxcount = min_t(unsigned long, maxcount,
> + (xdr->buf->buflen - xdr->buf->len));
> + maxcount = min_t(unsigned long, maxcount, read->rd_length);
> +
> + eof = read->rd_offset >= i_size_read(file_inode(file));
> + if (!eof) {
> + nfserr = nfsd4_encode_read_plus_data(resp, read, maxcount, &eof);
> + segments++;
> + }
> +
> + if (nfserr)
> + xdr_truncate_encode(xdr, starting_len);
> + else {
> + tmp = htonl(eof);
> + write_bytes_to_xdr_buf(xdr->buf, starting_len, &tmp, 4);
> + tmp = htonl(segments);
> + write_bytes_to_xdr_buf(xdr->buf, starting_len + 4, &tmp, 4);
> + }
>
> return nfserr;
> }
> @@ -4509,7 +4588,7 @@ static const nfsd4_enc nfsd4_enc_ops[] = {
> [OP_LAYOUTSTATS] = (nfsd4_enc)nfsd4_encode_noop,
> [OP_OFFLOAD_CANCEL] = (nfsd4_enc)nfsd4_encode_noop,
> [OP_OFFLOAD_STATUS] = (nfsd4_enc)nfsd4_encode_offload_status,
> - [OP_READ_PLUS] = (nfsd4_enc)nfsd4_encode_noop,
> + [OP_READ_PLUS] = (nfsd4_enc)nfsd4_encode_read_plus,
> [OP_SEEK] = (nfsd4_enc)nfsd4_encode_seek,
> [OP_WRITE_SAME] = (nfsd4_enc)nfsd4_encode_noop,
> [OP_CLONE] = (nfsd4_enc)nfsd4_encode_noop,
> --
> 2.28.0
>

2020-08-28 21:57:19

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Fri, Aug 28, 2020 at 05:25:21PM -0400, J. Bruce Fields wrote:
> On Mon, Aug 17, 2020 at 12:53:07PM -0400, [email protected] wrote:
> > From: Anna Schumaker <[email protected]>
> >
> > This patch adds READ_PLUS support for returning a single
> > NFS4_CONTENT_DATA segment to the client. This is basically the same as
> > the READ operation, only with the extra information about data segments.
> >
> > Signed-off-by: Anna Schumaker <[email protected]>
> > ---
> > fs/nfsd/nfs4proc.c | 17 ++++++++++
> > fs/nfsd/nfs4xdr.c | 83 ++++++++++++++++++++++++++++++++++++++++++++--
> > 2 files changed, 98 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> > index a09c35f0f6f0..9630d33211f2 100644
> > --- a/fs/nfsd/nfs4proc.c
> > +++ b/fs/nfsd/nfs4proc.c
> > @@ -2523,6 +2523,16 @@ static inline u32 nfsd4_read_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> > return (op_encode_hdr_size + 2 + XDR_QUADLEN(rlen)) * sizeof(__be32);
> > }
> >
> > +static inline u32 nfsd4_read_plus_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> > +{
> > + u32 maxcount = svc_max_payload(rqstp);
> > + u32 rlen = min(op->u.read.rd_length, maxcount);
> > + /* enough extra xdr space for encoding either a hole or data segment. */
> > + u32 segments = 1 + 2 + 2;
> > +
> > + return (op_encode_hdr_size + 2 + segments + XDR_QUADLEN(rlen)) * sizeof(__be32);
>
> I'm not sure I understand this calculation.
>
> In the final code, there's no fixed limit on the number of segments
> returned by a single READ_PLUS op, right?

I think the worst-case overhead to represent a hole is around 50 bytes.

So as long as we don't encode any holes less than that, then we can just
use rlen as an upper bound.

We really don't want to bother encoding small holes. I doubt
filesystems want to bother with them either. Do they give us any
guarantees as to the minimum size of a hole?

--b.

>
> [... remainder of quoted patch trimmed ...]

2020-08-31 18:18:54

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Fri, Aug 28, 2020 at 5:56 PM J. Bruce Fields <[email protected]> wrote:
>
> On Fri, Aug 28, 2020 at 05:25:21PM -0400, J. Bruce Fields wrote:
> > On Mon, Aug 17, 2020 at 12:53:07PM -0400, [email protected] wrote:
> > > From: Anna Schumaker <[email protected]>
> > >
> > > This patch adds READ_PLUS support for returning a single
> > > NFS4_CONTENT_DATA segment to the client. This is basically the same as
> > > the READ operation, only with the extra information about data segments.
> > >
> > > Signed-off-by: Anna Schumaker <[email protected]>
> > > ---
> > > fs/nfsd/nfs4proc.c | 17 ++++++++++
> > > fs/nfsd/nfs4xdr.c | 83 ++++++++++++++++++++++++++++++++++++++++++++--
> > > 2 files changed, 98 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> > > index a09c35f0f6f0..9630d33211f2 100644
> > > --- a/fs/nfsd/nfs4proc.c
> > > +++ b/fs/nfsd/nfs4proc.c
> > > @@ -2523,6 +2523,16 @@ static inline u32 nfsd4_read_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> > > return (op_encode_hdr_size + 2 + XDR_QUADLEN(rlen)) * sizeof(__be32);
> > > }
> > >
> > > +static inline u32 nfsd4_read_plus_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> > > +{
> > > + u32 maxcount = svc_max_payload(rqstp);
> > > + u32 rlen = min(op->u.read.rd_length, maxcount);
> > > + /* enough extra xdr space for encoding either a hole or data segment. */
> > > + u32 segments = 1 + 2 + 2;
> > > +
> > > + return (op_encode_hdr_size + 2 + segments + XDR_QUADLEN(rlen)) * sizeof(__be32);
> >
> > I'm not sure I understand this calculation.
> >
> > In the final code, there's no fixed limit on the number of segments
> > returned by a single READ_PLUS op, right?
>
> I think the worst-case overhead to represent a hole is around 50 bytes.
>
> So as long as we don't encode any holes less than that, then we can just
> use rlen as an upper bound.
>
> We really don't want to bother encoding small holes. I doubt
> filesystems want to bother with them either. Do they give us any
> guarantees as to the minimum size of a hole?

The minimum size seems to be PAGE_SIZE from everything I've seen.

>
> [... quoted patch trimmed ...]

2020-08-31 18:35:14

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] NFSD: Add support for the v4.2 READ_PLUS operation

On Wed, Aug 26, 2020 at 5:54 PM J. Bruce Fields <[email protected]> wrote:
>
> On Mon, Aug 17, 2020 at 12:53:05PM -0400, [email protected] wrote:
> > From: Anna Schumaker <[email protected]>
> >
> > These patches add server support for the READ_PLUS operation, which
> > breaks read requests into several "data" and "hole" segments when
> > replying to the client.
> >
> > - Changes since v3:
> > - Combine first two patches related to xdr_reserve_space_vec()
> > - Remove unnecessary call to svc_encode_read_payload()
> >
> > Here are the results of some performance tests I ran on some lab
> > machines.
>
> What's the hardware setup (do you know network and disk bandwidth?).

I used iperf to benchmark the network, and it said it transferred 1.10
GBytes with a bandwidth of 941 Mbits/sec.

I ran hdparm -tT to benchmark reads on the disk and it said this:
Timing cached reads: 13394 MB in 2.00 seconds = 6713.72 MB/sec
Timing buffered disk reads: 362 MB in 3.00 seconds = 120.60 MB/sec

>
> > I tested by reading various 2G files from a few different underlying
> > filesystems and across several NFS versions. I used the `vmtouch` utility
> > to make sure files were only cached when we wanted them to be. In addition
> > to 100% data and 100% hole cases, I also tested with files that alternate
> > between data and hole segments. These files have either 4K, 8K, 16K, or 32K
> > segment sizes and start with either data or hole segments. So the file
> > mixed-4d has a 4K segment size beginning with a data segment, but mixed-32h
> > has 32K segments beginning with a hole. The units are in seconds, with the
> > first number for each NFS version being the uncached read time and the second
> > number is for when the file is cached on the server.
>
> The only numbers that look really strange are in the btrfs uncached
> case, in the data-only case and the mixed case that start with a hole.
> Do we have any idea what's up there?

I'm not really sure. BTRFS does some work to make sure the page cache
is synced up with its internal extent representation as part of llseek,
so my guess is it's something related to that. (But it's been a while
since I looked into that code, so I'm not sure if that's still how it
works.)

Anna

>
> [... quoted results table trimmed ...]

2020-09-01 16:50:33

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Mon, Aug 31, 2020 at 02:16:26PM -0400, Anna Schumaker wrote:
> On Fri, Aug 28, 2020 at 5:56 PM J. Bruce Fields <[email protected]> wrote:
> >
> > On Fri, Aug 28, 2020 at 05:25:21PM -0400, J. Bruce Fields wrote:
> > > On Mon, Aug 17, 2020 at 12:53:07PM -0400, [email protected] wrote:
> > > > From: Anna Schumaker <[email protected]>
> > > >
> > > > This patch adds READ_PLUS support for returning a single
> > > > NFS4_CONTENT_DATA segment to the client. This is basically the same as
> > > > the READ operation, only with the extra information about data segments.
> > > >
> > > > Signed-off-by: Anna Schumaker <[email protected]>
> > > > ---
> > > > fs/nfsd/nfs4proc.c | 17 ++++++++++
> > > > fs/nfsd/nfs4xdr.c | 83 ++++++++++++++++++++++++++++++++++++++++++++--
> > > > 2 files changed, 98 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> > > > index a09c35f0f6f0..9630d33211f2 100644
> > > > --- a/fs/nfsd/nfs4proc.c
> > > > +++ b/fs/nfsd/nfs4proc.c
> > > > @@ -2523,6 +2523,16 @@ static inline u32 nfsd4_read_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> > > > return (op_encode_hdr_size + 2 + XDR_QUADLEN(rlen)) * sizeof(__be32);
> > > > }
> > > >
> > > > +static inline u32 nfsd4_read_plus_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> > > > +{
> > > > + u32 maxcount = svc_max_payload(rqstp);
> > > > + u32 rlen = min(op->u.read.rd_length, maxcount);
> > > > + /* enough extra xdr space for encoding either a hole or data segment. */
> > > > + u32 segments = 1 + 2 + 2;
> > > > +
> > > > + return (op_encode_hdr_size + 2 + segments + XDR_QUADLEN(rlen)) * sizeof(__be32);
> > >
> > > I'm not sure I understand this calculation.
> > >
> > > In the final code, there's no fixed limit on the number of segments
> > > returned by a single READ_PLUS op, right?
> >
> > I think the worst-case overhead to represent a hole is around 50 bytes.
> >
> > So as long as we don't encode any holes less than that, then we can just
> > use rlen as an upper bound.
> >
> > We really don't want to bother encoding small holes. I doubt
> > filesystems want to bother with them either. Do they give us any
> > guarantees as to the minimum size of a hole?
>
> The minimum size seems to be PAGE_SIZE from everything I've seen.

OK, can we make that assumption explicit? It'd simplify stuff like
this.

--b.

2020-09-01 17:42:29

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Tue, Sep 1, 2020 at 12:49 PM J. Bruce Fields <[email protected]> wrote:
>
> On Mon, Aug 31, 2020 at 02:16:26PM -0400, Anna Schumaker wrote:
> > On Fri, Aug 28, 2020 at 5:56 PM J. Bruce Fields <[email protected]> wrote:
> > >
> > > On Fri, Aug 28, 2020 at 05:25:21PM -0400, J. Bruce Fields wrote:
> > > > On Mon, Aug 17, 2020 at 12:53:07PM -0400, [email protected] wrote:
> > > > > From: Anna Schumaker <[email protected]>
> > > > >
> > > > > This patch adds READ_PLUS support for returning a single
> > > > > NFS4_CONTENT_DATA segment to the client. This is basically the same as
> > > > > the READ operation, only with the extra information about data segments.
> > > > >
> > > > > Signed-off-by: Anna Schumaker <[email protected]>
> > > > > ---
> > > > > fs/nfsd/nfs4proc.c | 17 ++++++++++
> > > > > fs/nfsd/nfs4xdr.c | 83 ++++++++++++++++++++++++++++++++++++++++++++--
> > > > > 2 files changed, 98 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> > > > > index a09c35f0f6f0..9630d33211f2 100644
> > > > > --- a/fs/nfsd/nfs4proc.c
> > > > > +++ b/fs/nfsd/nfs4proc.c
> > > > > @@ -2523,6 +2523,16 @@ static inline u32 nfsd4_read_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> > > > > return (op_encode_hdr_size + 2 + XDR_QUADLEN(rlen)) * sizeof(__be32);
> > > > > }
> > > > >
> > > > > +static inline u32 nfsd4_read_plus_rsize(struct svc_rqst *rqstp, struct nfsd4_op *op)
> > > > > +{
> > > > > + u32 maxcount = svc_max_payload(rqstp);
> > > > > + u32 rlen = min(op->u.read.rd_length, maxcount);
> > > > > + /* enough extra xdr space for encoding either a hole or data segment. */
> > > > > + u32 segments = 1 + 2 + 2;
> > > > > +
> > > > > + return (op_encode_hdr_size + 2 + segments + XDR_QUADLEN(rlen)) * sizeof(__be32);
> > > >
> > > > I'm not sure I understand this calculation.
> > > >
> > > > In the final code, there's no fixed limit on the number of segments
> > > > returned by a single READ_PLUS op, right?
> > >
> > > I think the worst-case overhead to represent a hole is around 50 bytes.
> > >
> > > So as long as we don't encode any holes less than that, then we can just
> > > use rlen as an upper bound.
> > >
> > > We really don't want to bother encoding small holes. I doubt
> > > filesystems want to bother with them either. Do they give us any
> > > guarantees as to the minimum size of a hole?
> >
> > The minimum size seems to be PAGE_SIZE from everything I've seen.
>
> OK, can we make that assumption explicit? It'd simplify stuff like
> this.

I'm okay with that, but it's technically up to the underlying filesystem.

>
> --b.

2020-09-01 19:19:33

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Tue, Sep 01, 2020 at 01:40:16PM -0400, Anna Schumaker wrote:
> On Tue, Sep 1, 2020 at 12:49 PM J. Bruce Fields <[email protected]> wrote:
> >
> > On Mon, Aug 31, 2020 at 02:16:26PM -0400, Anna Schumaker wrote:
> > > On Fri, Aug 28, 2020 at 5:56 PM J. Bruce Fields <[email protected]> wrote:
> > > > We really don't want to bother encoding small holes. I doubt
> > > > filesystems want to bother with them either. Do they give us any
> > > > guarantees as to the minimum size of a hole?
> > >
> > > The minimum size seems to be PAGE_SIZE from everything I've seen.
> >
> > OK, can we make that assumption explicit? It'd simplify stuff like
> > this.
>
> I'm okay with that, but it's technically up to the underlying filesystem.

Maybe we should ask on linux-fsdevel.

Maybe minimum hole length isn't the right question: suppose at time 1 a
file has a single hole at bytes 100-200, then it's modified so at time 2
it has a hole at bytes 50-150. If you lseek(fd, 0, SEEK_HOLE) at time
1, you'll get 100. Then if you lseek(fd, 100, SEEK_DATA) at time 2,
you'll get 150. So you'll encode a 50-byte hole in the READ_PLUS reply
even though the file never had a hole smaller than 100 bytes.

Minimum hole alignment might be the right idea.

If we can't get that: maybe just teach encode_read to stop when it
*either* returns maxcount worth of file data (and holes) *or* maxcount
of encoded xdr data, just to prevent a weird filesystem from triggering
a bug.

--b.
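[Editorial illustration: the fallback suggested above — stop encoding when *either* the file-byte budget *or* the encoded-XDR budget is exhausted — can be sketched as a loop with two counters. This is illustrative userspace code, not the actual nfsd4 encoder; `struct seg`, `SEG_OVERHEAD`, and `encode_read_plus` are made-up names, and the 50-byte per-segment overhead is the worst-case estimate mentioned earlier in the thread.]

```c
#include <assert.h>
#include <stddef.h>

/* Assumed worst-case XDR bytes needed to describe one segment
 * (content type + 64-bit offset + length/data header). */
#define SEG_OVERHEAD 50

struct seg {
	int    is_hole;
	size_t len;	/* file bytes this segment represents */
};

/* Walk the segments a filesystem reports, charging two budgets:
 * file bytes represented, and bytes of encoded reply. Stop as soon
 * as adding the next segment would exceed either one, so even a
 * filesystem returning pathologically tiny segments cannot make the
 * reply overrun the reserved buffer. Returns file bytes encoded. */
static size_t encode_read_plus(const struct seg *segs, size_t nsegs,
			       size_t maxcount)
{
	size_t file_bytes = 0, xdr_bytes = 0;

	for (size_t i = 0; i < nsegs; i++) {
		size_t cost = SEG_OVERHEAD +
			      (segs[i].is_hole ? 0 : segs[i].len);

		if (file_bytes + segs[i].len > maxcount ||
		    xdr_bytes + cost > maxcount)
			break;	/* one of the two budgets is exhausted */
		file_bytes += segs[i].len;
		xdr_bytes  += cost;
	}
	return file_bytes;
}
```

With a "weird" filesystem returning one-byte holes, the XDR budget is what stops the loop; with normal data segments, the file-byte budget dominates.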

2020-09-04 13:54:33

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Tue, Sep 01, 2020 at 03:18:54PM -0400, J. Bruce Fields wrote:
> On Tue, Sep 01, 2020 at 01:40:16PM -0400, Anna Schumaker wrote:
> > On Tue, Sep 1, 2020 at 12:49 PM J. Bruce Fields <[email protected]> wrote:
> > >
> > > On Mon, Aug 31, 2020 at 02:16:26PM -0400, Anna Schumaker wrote:
> > > > On Fri, Aug 28, 2020 at 5:56 PM J. Bruce Fields <[email protected]> wrote:
> > > > > We really don't want to bother encoding small holes. I doubt
> > > > > filesystems want to bother with them either. Do they give us any
> > > > > guarantees as to the minimum size of a hole?
> > > >
> > > > The minimum size seems to be PAGE_SIZE from everything I've seen.
> > >
> > > OK, can we make that assumption explicit? It'd simplify stuff like
> > > this.
> >
> > I'm okay with that, but it's technically up to the underlying filesystem.
>
> Maybe we should ask on linux-fsdevel.
>
> Maybe minimum hole length isn't the right question: suppose at time 1 a
> file has a single hole at bytes 100-200, then it's modified so at time 2
> it has a hole at bytes 50-150. If you lseek(fd, 0, SEEK_HOLE) at time
> 1, you'll get 100. Then if you lseek(fd, 100, SEEK_DATA) at time 2,
> you'll get 150. So you'll encode a 50-byte hole in the READ_PLUS reply
> even though the file never had a hole smaller than 100 bytes.
>
> Minimum hole alignment might be the right idea.
>
> If we can't get that: maybe just teach encode_read to stop when it
> *either* returns maxcount worth of file data (and holes) *or* maxcount
> of encoded xdr data, just to prevent a weird filesystem from triggering
> a bug.

Alternatively, if it's easier, we could enforce a minimum alignment by
rounding up the result of SEEK_HOLE to the nearest multiple of (say) 512
bytes, and rounding down the result of SEEK_DATA.

--b.
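[Editorial illustration: the rounding rule above comes down to two one-line helpers — round the SEEK_HOLE result up and the SEEK_DATA result down to the chosen alignment, so any hole the server reports can only shrink, and a hole smaller than the alignment disappears entirely. The helper names and the 512-byte constant are illustrative only.]

```c
#define HOLE_ALIGN 512L	/* illustrative minimum hole alignment */

/* Round a SEEK_HOLE result up: the reported hole never starts
 * earlier than the real one. */
static long hole_start(long seek_hole_ret)
{
	return (seek_hole_ret + HOLE_ALIGN - 1) & ~(HOLE_ALIGN - 1);
}

/* Round a SEEK_DATA result down: the reported hole never ends
 * later than the real one. */
static long hole_end(long seek_data_ret)
{
	return seek_data_ret & ~(HOLE_ALIGN - 1);
}
```

In the racy example upthread (SEEK_HOLE returned 100, a later SEEK_DATA returned 150), the rounded range is empty, so that 50-byte hole would simply be encoded as data.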

2020-09-04 13:58:58

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support



> On Sep 4, 2020, at 9:52 AM, J. Bruce Fields <[email protected]> wrote:
>
> On Tue, Sep 01, 2020 at 03:18:54PM -0400, J. Bruce Fields wrote:
>> On Tue, Sep 01, 2020 at 01:40:16PM -0400, Anna Schumaker wrote:
>>> On Tue, Sep 1, 2020 at 12:49 PM J. Bruce Fields <[email protected]> wrote:
>>>>
>>>> On Mon, Aug 31, 2020 at 02:16:26PM -0400, Anna Schumaker wrote:
>>>>> On Fri, Aug 28, 2020 at 5:56 PM J. Bruce Fields <[email protected]> wrote:
>>>>>> We really don't want to bother encoding small holes. I doubt
>>>>>> filesystems want to bother with them either. Do they give us any
>>>>>> guarantees as to the minimum size of a hole?
>>>>>
>>>>> The minimum size seems to be PAGE_SIZE from everything I've seen.
>>>>
>>>> OK, can we make that assumption explicit? It'd simplify stuff like
>>>> this.
>>>
>>> I'm okay with that, but it's technically up to the underlying filesystem.
>>
>> Maybe we should ask on linux-fsdevel.
>>
>> Maybe minimum hole length isn't the right question: suppose at time 1 a
>> file has a single hole at bytes 100-200, then it's modified so at time 2
>> it has a hole at bytes 50-150. If you lseek(fd, 0, SEEK_HOLE) at time
>> 1, you'll get 100. Then if you lseek(fd, 100, SEEK_DATA) at time 2,
>> you'll get 150. So you'll encode a 50-byte hole in the READ_PLUS reply
>> even though the file never had a hole smaller than 100 bytes.
>>
>> Minimum hole alignment might be the right idea.
>>
>> If we can't get that: maybe just teach encode_read to stop when it
>> *either* returns maxcount worth of file data (and holes) *or* maxcount
>> of encoded xdr data, just to prevent a weird filesystem from triggering
>> a bug.
>
> Alternatively, if it's easier, we could enforce a minimum alignment by
> rounding up the result of SEEK_HOLE to the nearest multiple of (say) 512
> bytes, and rounding down the result of SEEK_DATA.

Perhaps it goes without saying, but is there an effort to
ensure that the set of holes is represented in exactly the
same way when accessing a file via READ_PLUS and
SEEK_DATA/HOLE ?


--
Chuck Lever



2020-09-04 14:06:02

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Fri, Sep 04, 2020 at 09:56:19AM -0400, Chuck Lever wrote:
>
>
> > On Sep 4, 2020, at 9:52 AM, J. Bruce Fields <[email protected]> wrote:
> >
> > On Tue, Sep 01, 2020 at 03:18:54PM -0400, J. Bruce Fields wrote:
> >> On Tue, Sep 01, 2020 at 01:40:16PM -0400, Anna Schumaker wrote:
> >>> On Tue, Sep 1, 2020 at 12:49 PM J. Bruce Fields <[email protected]> wrote:
> >>>>
> >>>> On Mon, Aug 31, 2020 at 02:16:26PM -0400, Anna Schumaker wrote:
> >>>>> On Fri, Aug 28, 2020 at 5:56 PM J. Bruce Fields <[email protected]> wrote:
> >>>>>> We really don't want to bother encoding small holes. I doubt
> >>>>>> filesystems want to bother with them either. Do they give us any
> >>>>>> guarantees as to the minimum size of a hole?
> >>>>>
> >>>>> The minimum size seems to be PAGE_SIZE from everything I've seen.
> >>>>
> >>>> OK, can we make that assumption explicit? It'd simplify stuff like
> >>>> this.
> >>>
> >>> I'm okay with that, but it's technically up to the underlying filesystem.
> >>
> >> Maybe we should ask on linux-fsdevel.
> >>
> >> Maybe minimum hole length isn't the right question: suppose at time 1 a
> >> file has a single hole at bytes 100-200, then it's modified so at time 2
> >> it has a hole at bytes 50-150. If you lseek(fd, 0, SEEK_HOLE) at time
> >> 1, you'll get 100. Then if you lseek(fd, 100, SEEK_DATA) at time 2,
> >> you'll get 150. So you'll encode a 50-byte hole in the READ_PLUS reply
> >> even though the file never had a hole smaller than 100 bytes.
> >>
> >> Minimum hole alignment might be the right idea.
> >>
> >> If we can't get that: maybe just teach encode_read to stop when it
> >> *either* returns maxcount worth of file data (and holes) *or* maxcount
> >> of encoded xdr data, just to prevent a weird filesystem from triggering
> >> a bug.
> >
> > Alternatively, if it's easier, we could enforce a minimum alignment by
> > rounding up the result of SEEK_HOLE to the nearest multiple of (say) 512
> > bytes, and rounding down the result of SEEK_DATA.
>
> Perhaps it goes without saying, but is there an effort to
> ensure that the set of holes is represented in exactly the
> same way when accessing a file via READ_PLUS and
> SEEK_DATA/HOLE ?

So you're thinking of something like a pynfs test that creates a file
with holes and then tries reading through it with READ_PLUS and SEEK and
comparing the results?

There are lots of legitimate reasons that test might "fail"--servers
aren't required to support holes at all, and have a lot of latitude
about how to report them.

But it might be a good idea to test anyway.

--b.
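[Editorial illustration: the server-local half of such a test is just a SEEK_DATA/SEEK_HOLE walk over the file; the resulting segment map is what a READ_PLUS reply would be compared against. The sketch below is a hypothetical userspace illustration, not pynfs code; how many segments it prints depends entirely on the underlying filesystem's hole support, but the segments must always tile the whole file.]

```c
#define _GNU_SOURCE		/* for SEEK_DATA / SEEK_HOLE */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Walk fd with SEEK_DATA/SEEK_HOLE, printing the data/hole segment
 * map that a consistency test would compare against a READ_PLUS
 * reply. Returns the number of bytes covered, which must equal the
 * file size whether or not the filesystem reports holes. */
static long long map_segments(int fd)
{
	off_t size = lseek(fd, 0, SEEK_END);
	off_t pos = 0;
	long long covered = 0;

	while (pos < size) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		if (data < 0 || data > size)
			data = size;	/* nothing but hole to EOF */
		if (data > pos) {
			printf("hole @%lld len %lld\n",
			       (long long)pos, (long long)(data - pos));
			covered += data - pos;
		}
		if (data == size)
			break;
		/* EOF always counts as a hole, so this succeeds. */
		off_t hole = lseek(fd, data, SEEK_HOLE);
		printf("data @%lld len %lld\n",
		       (long long)data, (long long)(hole - data));
		covered += hole - data;
		pos = hole;
	}
	return covered;
}

/* Build a sparse 1 MiB file with one small data extent and map it.
 * Returns 1 when the segment map tiles the file exactly. */
static int demo(void)
{
	char path[] = "/tmp/sparse-XXXXXX";
	int fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, 1 << 20) < 0 ||
	    pwrite(fd, "data", 4, 64 * 1024) != 4) {
		close(fd);
		return -1;
	}
	long long covered = map_segments(fd);
	close(fd);
	return covered == (1 << 20);
}
```

A pynfs test in this spirit would build the same map over the wire with SEEK, then check that every READ_PLUS hole segment falls inside a mapped hole (allowing for the latitude servers have in reporting them).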

2020-09-04 14:08:42

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support



> On Sep 4, 2020, at 10:03 AM, Bruce Fields <[email protected]> wrote:
>
> On Fri, Sep 04, 2020 at 09:56:19AM -0400, Chuck Lever wrote:
>>
>>
>>> On Sep 4, 2020, at 9:52 AM, J. Bruce Fields <[email protected]> wrote:
>>>
>>> On Tue, Sep 01, 2020 at 03:18:54PM -0400, J. Bruce Fields wrote:
>>>> On Tue, Sep 01, 2020 at 01:40:16PM -0400, Anna Schumaker wrote:
>>>>> On Tue, Sep 1, 2020 at 12:49 PM J. Bruce Fields <[email protected]> wrote:
>>>>>>
>>>>>> On Mon, Aug 31, 2020 at 02:16:26PM -0400, Anna Schumaker wrote:
>>>>>>> On Fri, Aug 28, 2020 at 5:56 PM J. Bruce Fields <[email protected]> wrote:
>>>>>>>> We really don't want to bother encoding small holes. I doubt
>>>>>>>> filesystems want to bother with them either. Do they give us any
>>>>>>>> guarantees as to the minimum size of a hole?
>>>>>>>
>>>>>>> The minimum size seems to be PAGE_SIZE from everything I've seen.
>>>>>>
>>>>>> OK, can we make that assumption explicit? It'd simplify stuff like
>>>>>> this.
>>>>>
>>>>> I'm okay with that, but it's technically up to the underlying filesystem.
>>>>
>>>> Maybe we should ask on linux-fsdevel.
>>>>
>>>> Maybe minimum hole length isn't the right question: suppose at time 1 a
>>>> file has a single hole at bytes 100-200, then it's modified so at time 2
>>>> it has a hole at bytes 50-150. If you lseek(fd, 0, SEEK_HOLE) at time
>>>> 1, you'll get 100. Then if you lseek(fd, 100, SEEK_DATA) at time 2,
>>>> you'll get 150. So you'll encode a 50-byte hole in the READ_PLUS reply
>>>> even though the file never had a hole smaller than 100 bytes.
>>>>
>>>> Minimum hole alignment might be the right idea.
>>>>
>>>> If we can't get that: maybe just teach encode_read to stop when it
>>>> *either* returns maxcount worth of file data (and holes) *or* maxcount
>>>> of encoded xdr data, just to prevent a weird filesystem from triggering
>>>> a bug.
>>>
>>> Alternatively, if it's easier, we could enforce a minimum alignment by
>>> rounding up the result of SEEK_HOLE to the nearest multiple of (say) 512
>>> bytes, and rounding down the result of SEEK_DATA.
>>
>> Perhaps it goes without saying, but is there an effort to
>> ensure that the set of holes is represented in exactly the
>> same way when accessing a file via READ_PLUS and
>> SEEK_DATA/HOLE ?
>
> So you're thinking of something like a pynfs test that creates a file
> with holes and then tries reading through it with READ_PLUS and SEEK and
> comparing the results?

I hadn't considered a particular test platform, but yes, a regression
test like that would be appropriate.


> There are lots of legitimate reasons that test might "fail"--servers
> aren't required to support holes at all, and have a lot of latitude
> about how to report them.

Agreed that the test would need to account for server support for holes.


> But it might be a good idea to test anyway.

My primary concern is that the result of a file copy operation should
look the same on NFS/TCP (with READ_PLUS) and NFS/RDMA (with SEEK_DATA/HOLE).


--
Chuck Lever



2020-09-04 14:30:00

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Fri, Sep 04, 2020 at 10:07:22AM -0400, Chuck Lever wrote:
> My primary concern is that the result of a file copy operation should
> look the same on NFS/TCP (with READ_PLUS) and NFS/RDMA (with SEEK_DATA/HOLE).

I'm not sure what you mean.

I don't see the spec providing any guarantee of consistency between
READ_PLUS and SEEK. It also doesn't guarantee that the results tell you
anything about how the file is actually stored--a returned "hole" could
represent an unallocated segment, or a fully allocated segment that's
filled with zeroes, or some combination.

So, for example, if you implemented an optimized copy that used
ALLOCATE, DEALLOCATE, SEEK and/or READ_PLUS to avoid reading and writing
a lot of zeroes--there's no guarantee that the target file would end up
allocated in the same way as the source.

--b.

2020-09-04 14:37:19

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support



> On Sep 4, 2020, at 10:29 AM, Bruce Fields <[email protected]> wrote:
>
> On Fri, Sep 04, 2020 at 10:07:22AM -0400, Chuck Lever wrote:
>> My primary concern is that the result of a file copy operation should
>> look the same on NFS/TCP (with READ_PLUS) and NFS/RDMA (with SEEK_DATA/HOLE).
>
> I'm not sure what you mean.
>
> I don't see the spec providing any guarantee of consistency between
> READ_PLUS and SEEK.

IMO this is an implementation quality issue, not a spec compliance
issue.


> It also doesn't guarantee that the results tell you
> anything about how the file is actually stored--a returned "hole" could
> represent an unallocated segment, or a fully allocated segment that's
> filled with zeroes, or some combination.

Understood, but the resulting copied file should look the same whether
it was read from the server using READ_PLUS or SEEK_DATA/HOLE.


> So, for example, if you implemented an optimized copy that used
> ALLOCATE, DEALLOCATE, SEEK and/or READ_PLUS to avoid reading and writing
> a lot of zeroes--there's no guarantee that the target file would end up
> allocated in the same way as the source.



--
Chuck Lever



2020-09-04 14:50:22

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Fri, Sep 04, 2020 at 10:36:36AM -0400, Chuck Lever wrote:
> > On Sep 4, 2020, at 10:29 AM, Bruce Fields <[email protected]> wrote:
> > It also doesn't guarantee that the results tell you
> > anything about how the file is actually stored--a returned "hole" could
> > represent an unallocated segment, or a fully allocated segment that's
> > filled with zeroes, or some combination.
>
> Understood, but the resulting copied file should look the same whether
> it was read from the server using READ_PLUS or SEEK_DATA/HOLE.

I'm uncomfortable about promising that. What do you think might go
wrong otherwise?

--b.

2020-09-04 15:01:40

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support



> On Sep 4, 2020, at 10:49 AM, J. Bruce Fields <[email protected]> wrote:
>
> On Fri, Sep 04, 2020 at 10:36:36AM -0400, Chuck Lever wrote:
>>> On Sep 4, 2020, at 10:29 AM, Bruce Fields <[email protected]> wrote:
>>> It also doesn't guarantee that the results tell you
>>> anything about how the file is actually stored--a returned "hole" could
>>> represent an unallocated segment, or a fully allocated segment that's
>>> filled with zeroes, or some combination.
>>
>> Understood, but the resulting copied file should look the same whether
>> it was read from the server using READ_PLUS or SEEK_DATA/HOLE.
>
> I'm uncomfortable about promising that.

The server should be able to represent a file with holes
in exactly the same way with both mechanisms, unless there
is a significant flaw in the protocols or implementation.

The client's behavior is also important here, so the
guarantee would have to be about how the server presents
the holes. A quality client implementation would be able
to use this guarantee to reconstruct the holes exactly.


> What do you think might go wrong otherwise?

I don't see a data corruption issue here, if that's what
you mean.

Suppose the server has a large file with a lot of holes,
and these holes are all unallocated. This might be
typical of a container image.

Suppose further the client is able to punch holes in a
destination file as a thin provisioning mechanism.

Now, suppose we copy the file via TCP/READ_PLUS, and
that preserves the holes.

Copy with RDMA/SEEK_HOLE and maybe it doesn't preserve
holes. The destination file is now significantly larger
and less efficiently stored.

Or maybe it's the other way around. Either way, one
mechanism is hole-preserving and one isn't.

A quality implementation would try to preserve holes as
much as possible so that the server can make smart storage
provisioning decisions.

--
Chuck Lever



2020-09-04 15:25:42

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Fri, Sep 04, 2020 at 10:58:43AM -0400, Chuck Lever wrote:
> > What do you think might go wrong otherwise?
>
> I don't see a data corruption issue here, if that's what
> you mean.
>
> Suppose the server has a large file with a lot of holes,
> and these holes are all unallocated. This might be
> typical of a container image.
>
> Suppose further the client is able to punch holes in a
> destination file as a thin provisioning mechanism.
>
> Now, suppose we copy the file via TCP/READ_PLUS, and
> that preserves the holes.
>
> Copy with RDMA/SEEK_HOLE and maybe it doesn't preserve
> holes. The destination file is now significantly larger
> and less efficiently stored.
>
> Or maybe it's the other way around. Either way, one
> mechanism is hole-preserving and one isn't.
>
> A quality implementation would try to preserve holes as
> much as possible so that the server can make smart storage
> provisioning decisions.

OK, I can see that, thanks.

So, I was trying to make sure we handle cases where SEEK results are
weirdly aligned or segments returned are very small. I don't think
that'll happen with any "normal" setup, I think it probably requires
strange FUSE filesystems or unusual races or malicious users or some
combination thereof. So suboptimal handling is OK, I just don't want to
violate the protocol or crash or hang or something.

I'm not seeing the RDMA connection, by the way. SEEK and READ_PLUS
should work the same over TCP and RDMA.

--b.

2020-09-04 15:59:59

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] NFSD: Add support for the v4.2 READ_PLUS operation

On Mon, Aug 31, 2020 at 02:33:30PM -0400, Anna Schumaker wrote:
> On Wed, Aug 26, 2020 at 5:54 PM J. Bruce Fields <[email protected]> wrote:
> > On Mon, Aug 17, 2020 at 12:53:05PM -0400, [email protected] wrote:
> > > I tested by reading various 2G files from a few different underlying
> > > filesystems and across several NFS versions. I used the `vmtouch` utility
> > > to make sure files were only cached when we wanted them to be. In addition
> > > to 100% data and 100% hole cases, I also tested with files that alternate
> > > between data and hole segments. These files have either 4K, 8K, 16K, or 32K
> > > segment sizes and start with either data or hole segments. So the file
> > > mixed-4d has a 4K segment size beginning with a data segment, but mixed-32h
> > > has 32K segments beginning with a hole. The units are in seconds, with the
> > > first number for each NFS version being the uncached read time and the second
> > > number is for when the file is cached on the server.
> >
> > The only numbers that look really strange are in the btrfs uncached
> > case, in the data-only case and the mixed case that start with a hole.
> > Do we have any idea what's up there?
>
> I'm not really sure. BTRFS does some work to make sure the page cache
> is synced up with their internal extent representation as part of
> llseek, so my guess is something related to that (But it's been a
> while since I looked into that code, so I'm not sure if that's still
> how it works)

Adding linux-btrfs in case they have any updates--are btrfs developers
aware of known performance issues with SEEK_HOLE/SEEK_DATA, and is it
something anyone's working on?

Anna's implementing a read optimization where the server uses seek to
identify holes to save transmitting all those zeroes back to the client,
and it's working as expected for ext4 and xfs but performing weirdly for
btrfs.

Original message:
https://lore.kernel.org/linux-nfs/[email protected]/

--b.


> > > Read Plus Results (btrfs):
> > > data
> > > :... v4.1 ... Uncached ... 21.317 s, 101 MB/s, 0.63 s kern, 2% cpu
> > > : :....... Cached ..... 18.252 s, 118 MB/s, 0.67 s kern, 3% cpu
> > > :... v4.2 ... Uncached ... 28.665 s, 75 MB/s, 0.65 s kern, 2% cpu
> > > :....... Cached ..... 18.253 s, 118 MB/s, 0.66 s kern, 3% cpu
> > > hole
> > > :... v4.1 ... Uncached ... 18.256 s, 118 MB/s, 0.70 s kern, 3% cpu
> > > : :....... Cached ..... 18.254 s, 118 MB/s, 0.73 s kern, 4% cpu
> > > :... v4.2 ... Uncached ... 0.851 s, 2.5 GB/s, 0.72 s kern, 84% cpu
> > > :....... Cached ..... 0.847 s, 2.5 GB/s, 0.73 s kern, 86% cpu
> > > mixed-4d
> > > :... v4.1 ... Uncached ... 56.857 s, 38 MB/s, 0.76 s kern, 1% cpu
> > > : :....... Cached ..... 18.252 s, 118 MB/s, 0.72 s kern, 3% cpu
> > > :... v4.2 ... Uncached ... 54.455 s, 39 MB/s, 0.73 s kern, 1% cpu
> > > :....... Cached ..... 9.215 s, 233 MB/s, 0.68 s kern, 7% cpu
> > > mixed-8d
> > > :... v4.1 ... Uncached ... 36.641 s, 59 MB/s, 0.68 s kern, 1% cpu
> > > : :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
> > > :... v4.2 ... Uncached ... 33.205 s, 65 MB/s, 0.67 s kern, 2% cpu
> > > :....... Cached ..... 9.172 s, 234 MB/s, 0.65 s kern, 7% cpu
> > > mixed-16d
> > > :... v4.1 ... Uncached ... 28.653 s, 75 MB/s, 0.72 s kern, 2% cpu
> > > : :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
> > > :... v4.2 ... Uncached ... 25.748 s, 83 MB/s, 0.71 s kern, 2% cpu
> > > :....... Cached ..... 9.150 s, 235 MB/s, 0.64 s kern, 7% cpu
> > > mixed-32d
> > > :... v4.1 ... Uncached ... 28.886 s, 74 MB/s, 0.67 s kern, 2% cpu
> > > : :....... Cached ..... 18.252 s, 118 MB/s, 0.71 s kern, 3% cpu
> > > :... v4.2 ... Uncached ... 24.724 s, 87 MB/s, 0.74 s kern, 2% cpu
> > > :....... Cached ..... 9.140 s, 235 MB/s, 0.63 s kern, 6% cpu
> > > mixed-4h
> > > :... v4.1 ... Uncached ... 52.181 s, 41 MB/s, 0.73 s kern, 1% cpu
> > > : :....... Cached ..... 18.252 s, 118 MB/s, 0.66 s kern, 3% cpu
> > > :... v4.2 ... Uncached ... 150.341 s, 14 MB/s, 0.72 s kern, 0% cpu
> > > :....... Cached ..... 9.216 s, 233 MB/s, 0.63 s kern, 6% cpu
> > > mixed-8h
> > > :... v4.1 ... Uncached ... 36.945 s, 58 MB/s, 0.68 s kern, 1% cpu
> > > : :....... Cached ..... 18.252 s, 118 MB/s, 0.65 s kern, 3% cpu
> > > :... v4.2 ... Uncached ... 79.781 s, 27 MB/s, 0.68 s kern, 0% cpu
> > > :....... Cached ..... 9.172 s, 234 MB/s, 0.66 s kern, 7% cpu
> > > mixed-16h
> > > :... v4.1 ... Uncached ... 28.651 s, 75 MB/s, 0.73 s kern, 2% cpu
> > > : :....... Cached ..... 18.252 s, 118 MB/s, 0.66 s kern, 3% cpu
> > > :... v4.2 ... Uncached ... 47.428 s, 45 MB/s, 0.71 s kern, 1% cpu
> > > :....... Cached ..... 9.150 s, 235 MB/s, 0.67 s kern, 7% cpu
> > > mixed-32h
> > > :... v4.1 ... Uncached ... 28.618 s, 75 MB/s, 0.69 s kern, 2% cpu
> > > : :....... Cached ..... 18.252 s, 118 MB/s, 0.70 s kern, 3% cpu
> > > :... v4.2 ... Uncached ... 38.813 s, 55 MB/s, 0.67 s kern, 1% cpu
> > > :....... Cached ..... 9.140 s, 235 MB/s, 0.61 s kern, 6% cpu
> >

2020-09-04 16:21:15

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support



> On Sep 4, 2020, at 11:24 AM, Bruce Fields <[email protected]> wrote:
>
> I'm not seeing the RDMA connection, by the way. SEEK and READ_PLUS
> should work the same over TCP and RDMA.

The READ_PLUS result is not DDP-eligible because there's no way for
the client to predict in advance whether there will be data (which
can be moved by RDMA), holes or patterns (which cannot), nor how
many of each there might be.

Therefore I've asked Anna to leave READ_PLUS disabled on RDMA mounts.


--
Chuck Lever



2020-09-04 16:26:47

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support

On Fri, Sep 04, 2020 at 12:17:29PM -0400, Chuck Lever wrote:
>
>
> > On Sep 4, 2020, at 11:24 AM, Bruce Fields <[email protected]> wrote:
> >
> > I'm not seeing the RDMA connection, by the way. SEEK and READ_PLUS
> > should work the same over TCP and RDMA.
>
> The READ_PLUS result is not DDP-eligible because there's no way for
> the client to predict in advance whether there will be data (which
> can be moved by RDMA), holes or patterns (which cannot), nor how
> many of each there might be.
>
> Therefore I've asked Anna to leave READ_PLUS disabled on RDMA mounts.

Oh, of course, makes sense.

I think we should leave ourselves the options of handling some of these
unlikely corner cases differently in the two cases, but I wouldn't
expect that to cause a noticeable difference in any normal use.

--b.

2020-09-04 16:33:03

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] NFSD: Add READ_PLUS data support



> On Sep 4, 2020, at 12:17 PM, Chuck Lever <[email protected]> wrote:
>
>
>
>> On Sep 4, 2020, at 11:24 AM, Bruce Fields <[email protected]> wrote:
>>
>> I'm not seeing the RDMA connection, by the way. SEEK and READ_PLUS
>> should work the same over TCP and RDMA.
>
> The READ_PLUS result is not DDP-eligible because there's no way for
> the client to predict in advance whether there will be data (which
> can be moved by RDMA), holes or patterns (which cannot), nor how
> many of each there might be.
>
> Therefore I've asked Anna to leave READ_PLUS disabled on RDMA mounts.

Since this is a public forum, I should clarify that READ_PLUS does
work on NFS/RDMA mounts. It's just not as efficient as NFS READ with
RDMA, and that kind of defeats its purpose as a replacement for NFS
READ in everyday usage.


--
Chuck Lever