From: Anna Schumaker <[email protected]>
These patches add client support for the READ_PLUS operation. READ_PLUS
is meant to improve read performance when working with sparse files,
but the server's underlying filesystem can also affect the results.
I've done a bunch of testing on virtual machines, and I found that
READ_PLUS performs best when:
1) The file being read is not yet in the server's page cache,
2) The read request begins with a hole segment, and
3) The server only performs one llseek() call during encoding.
I've added a "noreadplus" mount option to allow users to disabl ethe new
operation if it becomes a problem, similar to the "nordirplus" mount
option that we already have.
Here are the results of my performance tests, separated by underlying
filesystem and by whether the file is already in the server's cache. The NFS v4.2
column is for the standard READ operation, and v4.2+ is with READ_PLUS.
In addition to the 100% data and 100% hole cases, I also tested files
that alternate between data and hole chunks, using two files for each
chunk size: one beginning with a data segment and one beginning with a
hole. I used the `vmtouch` utility to load and clear the file
from the server's cache, and I used the following `dd` command on the
client for reading back the file:
$ dd if=$src of=/dev/null bs=$rsize_from_mount 2>&1
xfs (uncached) | NFS v3 NFS v4.0 NFS v4.1 NFS v4.2 NFS v4.2+
------------------------+-------------------------------------------------
Whole File (data) | 3.228s 3.361s 3.679s 3.382s 3.483s
Whole File (hole) | 1.276s 1.086s 1.143s 1.066s 0.805s
Sparse 4K (data) | 3.473s 3.953s 3.740s 3.535s 3.515s
Sparse 4K (hole) | 3.373s 3.192s 3.120s 3.113s 2.709s
Sparse 8K (data) | 3.782s 3.527s 3.589s 3.476s 3.494s
Sparse 8K (hole) | 3.161s 3.328s 2.974s 2.889s 2.863s
Sparse 16K (data) | 3.804s 3.945s 3.885s 3.507s 3.569s
Sparse 16K (hole) | 2.961s 3.124s 3.413s 3.136s 2.712s
Sparse 32K (data) | 2.891s 3.632s 3.833s 3.643s 3.485s
Sparse 32K (hole) | 2.592s 2.216s 2.545s 2.665s 2.829s
xfs (cached) | NFS v3 NFS v4.0 NFS v4.1 NFS v4.2 NFS v4.2+
------------------------+-------------------------------------------------
Whole File (data) | 0.939s 0.943s 0.939s 0.942s 1.153s
Whole File (hole) | 0.982s 1.007s 0.991s 0.946s 0.826s
Sparse 4K (data) | 0.980s 0.999s 0.961s 0.996s 1.166s
Sparse 4K (hole) | 1.001s 0.972s 0.997s 1.001s 1.201s
Sparse 8K (data) | 1.272s 1.053s 0.999s 0.974s 1.200s
Sparse 8K (hole) | 0.965s 1.004s 1.036s 1.006s 1.248s
Sparse 16K (data) | 0.995s 0.993s 1.035s 1.054s 1.210s
Sparse 16K (hole) | 0.966s 0.982s 1.091s 1.038s 1.214s
Sparse 32K (data) | 1.054s 0.968s 1.045s 0.990s 1.203s
Sparse 32K (hole) | 1.019s 0.960s 1.001s 0.983s 1.254s
ext4 (uncached) | NFS v3 NFS v4.0 NFS v4.1 NFS v4.2 NFS v4.2+
------------------------+-------------------------------------------------
Whole File (data) | 6.089s 6.104s 6.489s 6.342s 6.137s
Whole File (hole) | 2.603s 2.258s 2.226s 2.315s 1.715s
Sparse 4K (data) | 7.063s 7.372s 7.064s 7.149s 7.459s
Sparse 4K (hole) | 7.231s 6.709s 6.495s 6.880s 6.138s
Sparse 8K (data) | 6.576s 6.938s 6.386s 6.086s 6.154s
Sparse 8K (hole) | 5.903s 6.089s 5.555s 5.578s 5.442s
Sparse 16K (data) | 6.556s 6.257s 6.135s 5.588s 5.856s
Sparse 16K (hole) | 5.504s 5.290s 5.545s 5.195s 4.983s
Sparse 32K (data) | 5.047s 5.490s 5.734s 5.578s 5.378s
Sparse 32K (hole) | 4.232s 3.860s 4.299s 4.466s 4.633s
ext4 (cached) | NFS v3 NFS v4.0 NFS v4.1 NFS v4.2 NFS v4.2+
------------------------+-------------------------------------------------
Whole File (data) | 1.873s 1.881s 1.869s 1.890s 2.344s
Whole File (hole) | 1.929s 2.009s 1.963s 1.917s 1.554s
Sparse 4K (data) | 1.961s 1.974s 1.957s 1.986s 2.408s
Sparse 4K (hole) | 2.056s 2.025s 1.977s 1.988s 2.458s
Sparse 8K (data) | 2.297s 2.038s 2.008s 1.954s 2.437s
Sparse 8K (hole) | 1.939s 2.011s 2.024s 2.015s 2.509s
Sparse 16K (data) | 1.907s 1.973s 2.053s 2.070s 2.411s
Sparse 16K (hole) | 1.940s 1.964s 2.075s 1.996s 2.422s
Sparse 32K (data) | 2.045s 1.921s 2.021s 2.013s 2.388s
Sparse 32K (hole) | 1.984s 1.944s 1.997s 1.974s 2.398s
btrfs (uncached) | NFS v3 NFS v4.0 NFS v4.1 NFS v4.2 NFS v4.2+
------------------------+-------------------------------------------------
Whole File (data) | 9.369s 9.438s 9.837s 9.840s 11.790s
Whole File (hole) | 4.052s 3.390s 3.380s 3.619s 2.519s
Sparse 4K (data) | 9.738s 10.110s 9.774s 9.819s 12.471s
Sparse 4K (hole) | 9.907s 9.504s 9.241s 9.610s 9.054s
Sparse 8K (data) | 9.132s 9.453s 8.954s 8.660s 10.555s
Sparse 8K (hole) | 8.290s 8.489s 8.305s 8.332s 7.850s
Sparse 16K (data) | 8.742s 8.507s 8.667s 8.002s 9.940s
Sparse 16K (hole) | 7.635s 7.604s 7.967s 7.558s 7.062s
Sparse 32K (data) | 7.279s 7.670s 8.006s 7.705s 9.219s
Sparse 32K (hole) | 6.200s 5.713s 6.268s 6.464s 6.486s
btrfs (cached) | NFS v3 NFS v4.0 NFS v4.1 NFS v4.2 NFS v4.2+
------------------------+-------------------------------------------------
Whole File (data) | 2.770s 2.814s 2.841s 2.854s 3.492s
Whole File (hole) | 2.871s 2.970s 3.001s 2.929s 2.372s
Sparse 4K (data) | 2.945s 2.905s 2.930s 2.951s 3.663s
Sparse 4K (hole) | 3.032s 3.057s 2.962s 3.050s 3.705s
Sparse 8K (data) | 3.277s 3.069s 3.127s 3.034s 3.652s
Sparse 8K (hole) | 2.866s 2.959s 3.078s 2.989s 3.762s
Sparse 16K (data) | 2.916s 2.923s 3.060s 3.081s 3.631s
Sparse 16K (hole) | 2.948s 2.969s 3.108s 2.990s 3.623s
Sparse 32K (data) | 3.044s 2.881s 3.052s 2.962s 3.585s
Sparse 32K (hole) | 2.954s 2.957s 3.018s 2.951s 3.639s
I also have performance numbers for the case where every hole and data
segment is encoded, but I figured this email was long enough already.
I'm happy to share them on request!
Thoughts?
Anna
-------------------------------------------------------------------------
Anna Schumaker (6):
SUNRPC: Split out a function for setting current page
SUNRPC: Add the ability to expand holes in data pages
SUNRPC: Add the ability to shift data to a specific offset
NFS: Add basic READ_PLUS support
NFS: Add support for decoding multiple segments
NFS: Add a mount option for READ_PLUS
fs/nfs/nfs42xdr.c | 164 +++++++++++++++++++++++++
fs/nfs/nfs4client.c | 3 +
fs/nfs/nfs4proc.c | 32 ++++-
fs/nfs/nfs4xdr.c | 1 +
fs/nfs/super.c | 21 ++++
include/linux/nfs4.h | 3 +-
include/linux/nfs_fs_sb.h | 2 +
include/linux/nfs_xdr.h | 2 +-
include/linux/sunrpc/xdr.h | 2 +
net/sunrpc/xdr.c | 244 ++++++++++++++++++++++++++++++++++++-
10 files changed, 467 insertions(+), 7 deletions(-)
--
2.20.1
From: Anna Schumaker <[email protected]>
I'm going to need this bit of code in a few places for READ_PLUS
decoding, so let's make it a helper function.
Signed-off-by: Anna Schumaker <[email protected]>
---
net/sunrpc/xdr.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index f302c6eb8779..0fb9bbd2f3c7 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -792,6 +792,12 @@ static int xdr_set_page_base(struct xdr_stream *xdr,
return 0;
}
+static void xdr_set_page(struct xdr_stream *xdr, unsigned int base)
+{
+ if (xdr_set_page_base(xdr, base, PAGE_SIZE) < 0)
+ xdr_set_iov(xdr, xdr->buf->tail, xdr->nwords << 2);
+}
+
static void xdr_set_next_page(struct xdr_stream *xdr)
{
unsigned int newbase;
@@ -799,8 +805,7 @@ static void xdr_set_next_page(struct xdr_stream *xdr)
newbase = (1 + xdr->page_ptr - xdr->buf->pages) << PAGE_SHIFT;
newbase -= xdr->buf->page_base;
- if (xdr_set_page_base(xdr, newbase, PAGE_SIZE) < 0)
- xdr_set_iov(xdr, xdr->buf->tail, xdr->nwords << 2);
+ xdr_set_page(xdr, newbase);
}
static bool xdr_set_next_buffer(struct xdr_stream *xdr)
--
2.20.1
From: Anna Schumaker <[email protected]>
This patch adds the ability to "read a hole" into a set of XDR data
pages by taking the following steps:
1) Shift all data after the current xdr->p to the right, possibly into
the tail,
2) Zero the specified range, and
3) Update xdr->p to point beyond the hole.
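For illustration, a READ_PLUS decoder that has just pulled a hole
segment's length off the wire could use this roughly as follows (sketch
only; the real caller is added later in this series):

	recvd = xdr_expand_hole(xdr, 0, length);
	if (recvd < length)	/* hole was truncated to fit the buffer */
		res->eof = 0;
	res->count = recvd;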
Signed-off-by: Anna Schumaker <[email protected]>
---
include/linux/sunrpc/xdr.h | 1 +
net/sunrpc/xdr.c | 100 +++++++++++++++++++++++++++++++++++++
2 files changed, 101 insertions(+)
diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index 2ec128060239..47e2d1fff59a 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -243,6 +243,7 @@ extern __be32 *xdr_inline_decode(struct xdr_stream *xdr, size_t nbytes);
extern unsigned int xdr_read_pages(struct xdr_stream *xdr, unsigned int len);
extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len);
extern int xdr_process_buf(struct xdr_buf *buf, unsigned int offset, unsigned int len, int (*actor)(struct scatterlist *, void *), void *data);
+extern size_t xdr_expand_hole(struct xdr_stream *, size_t, uint64_t);
/**
* xdr_stream_remaining - Return the number of bytes remaining in the stream
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 0fb9bbd2f3c7..27dd3a507ef6 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -253,6 +253,40 @@ _shift_data_right_pages(struct page **pages, size_t pgto_base,
} while ((len -= copy) != 0);
}
+static void
+_shift_data_right_tail(struct xdr_buf *buf, size_t pgfrom_base, size_t len)
+{
+ struct kvec *tail = buf->tail;
+
+ /* Make room for new data. */
+ if (tail->iov_len > 0)
+		memmove((char *)tail->iov_base + len, tail->iov_base, tail->iov_len);
+
+ _copy_from_pages((char *)tail->iov_base,
+ buf->pages,
+ buf->page_base + pgfrom_base,
+ len);
+
+ tail->iov_len += len;
+}
+
+static void
+_shift_data_right(struct xdr_buf *buf, size_t to, size_t from, size_t len)
+{
+ size_t shift = len;
+
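+	/* Bytes that would land beyond the page data spill into the tail. */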
+ if ((to + len) > buf->page_len) {
+ shift = (to + len) - buf->page_len;
+ _shift_data_right_tail(buf, (from + len) - shift, shift);
+ shift = len - shift;
+ }
+
+ _shift_data_right_pages(buf->pages,
+ buf->page_base + to,
+ buf->page_base + from,
+ shift);
+}
+
/**
* _copy_to_pages
* @pages: array of pages
@@ -337,6 +371,33 @@ _copy_from_pages(char *p, struct page **pages, size_t pgbase, size_t len)
}
EXPORT_SYMBOL_GPL(_copy_from_pages);
+/**
+ * _zero_data_pages
+ * @pages: array of pages
+ * @pgbase: beginning page vector address
+ * @len: length
+ */
+static void
+_zero_data_pages(struct page **pages, size_t pgbase, size_t len)
+{
+ struct page **page;
+ size_t zero;
+
+ page = pages + (pgbase >> PAGE_SHIFT);
+ pgbase &= ~PAGE_MASK;
+
+ do {
+ zero = len;
+ if (pgbase + zero > PAGE_SIZE)
+ zero = PAGE_SIZE - pgbase;
+
+ zero_user_segment(*page, pgbase, pgbase + zero);
+ page++;
+ pgbase = 0;
+
+ } while ((len -= zero) != 0);
+}
+
/**
* xdr_shrink_bufhead
* @buf: xdr_buf
@@ -478,6 +539,24 @@ unsigned int xdr_stream_pos(const struct xdr_stream *xdr)
}
EXPORT_SYMBOL_GPL(xdr_stream_pos);
+/**
+ * xdr_page_pos - Return the current offset from the start of the xdr->buf->pages
+ * @xdr: pointer to struct xdr_stream
+ */
+static size_t xdr_page_pos(const struct xdr_stream *xdr)
+{
+ unsigned int offset;
+ unsigned int base = xdr->buf->page_len;
+	void *kaddr = xdr->buf->tail->iov_base;
+
+ if (xdr->page_ptr) {
+ base = (xdr->page_ptr - xdr->buf->pages) * PAGE_SIZE;
+ kaddr = page_address(*xdr->page_ptr);
+ }
+ offset = xdr->p - (__be32 *)kaddr;
+ return base + (offset * sizeof(__be32));
+}
+
/**
* xdr_init_encode - Initialize a struct xdr_stream for sending data.
* @xdr: pointer to xdr_stream struct
@@ -1014,6 +1093,27 @@ unsigned int xdr_read_pages(struct xdr_stream *xdr, unsigned int len)
}
EXPORT_SYMBOL_GPL(xdr_read_pages);
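+/**
+ * xdr_expand_hole - zero a hole directly into the xdr stream's pages
+ * @xdr: pointer to xdr_stream struct
+ * @offset: offset of the hole within the page data
+ * @length: length of the hole in bytes
+ *
+ * Shifts the remaining page data to the right to make room, zeroes
+ * @length bytes starting at @offset, and leaves the stream positioned
+ * just past the hole.  Returns the number of bytes zeroed, which may
+ * be less than @length if the hole extends past the end of the pages.
+ */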
+size_t xdr_expand_hole(struct xdr_stream *xdr, size_t offset, uint64_t length)
+{
+ struct xdr_buf *buf = xdr->buf;
+ size_t from = 0;
+
+ if ((offset + length) < offset ||
+ (offset + length) > buf->page_len)
+ length = buf->page_len - offset;
+
+ if (offset == 0)
+ xdr_align_pages(xdr, xdr->nwords << 2);
+ else
+ from = xdr_page_pos(xdr);
+
+ _shift_data_right(buf, offset + length, from, xdr->nwords << 2);
+ _zero_data_pages(buf->pages, buf->page_base + offset, length);
+ xdr_set_page(xdr, offset + length);
+ return length;
+}
+EXPORT_SYMBOL_GPL(xdr_expand_hole);
+
/**
* xdr_enter_page - decode data from the XDR page
* @xdr: pointer to xdr_stream struct
--
2.20.1
From: Anna Schumaker <[email protected]>
Expanding holes tends to put the data content a few bytes to the right
of where we want it. This patch implements a left-shift operation to
line everything up properly.
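For illustration, the READ_PLUS data-segment decoder later in this
series ends up using it roughly like this (sketch only):

	recvd = xdr_align_data(xdr, res->count, count);
	if (recvd < count)
		res->eof = 0;
	res->count += recvd;

where res->count is how much of the reply has been placed so far and
count is the length of the data segment just decoded.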
Signed-off-by: Anna Schumaker <[email protected]>
---
include/linux/sunrpc/xdr.h | 1 +
net/sunrpc/xdr.c | 135 +++++++++++++++++++++++++++++++++++++
2 files changed, 136 insertions(+)
diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index 47e2d1fff59a..c7e49dc06b0b 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -244,6 +244,7 @@ extern unsigned int xdr_read_pages(struct xdr_stream *xdr, unsigned int len);
extern void xdr_enter_page(struct xdr_stream *xdr, unsigned int len);
extern int xdr_process_buf(struct xdr_buf *buf, unsigned int offset, unsigned int len, int (*actor)(struct scatterlist *, void *), void *data);
extern size_t xdr_expand_hole(struct xdr_stream *, size_t, uint64_t);
+extern uint64_t xdr_align_data(struct xdr_stream *, uint64_t, uint64_t);
/**
* xdr_stream_remaining - Return the number of bytes remaining in the stream
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 27dd3a507ef6..f85f83da663c 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -17,6 +17,9 @@
#include <linux/sunrpc/msg_prot.h>
#include <linux/bvec.h>
+static void _copy_to_pages(struct page **, size_t, const char *, size_t);
+
+
/*
* XDR functions for basic NFS types
*/
@@ -287,6 +290,117 @@ _shift_data_right(struct xdr_buf *buf, size_t to, size_t from, size_t len)
shift);
}
+
+/**
+ * _shift_data_left_pages
+ * @pages: vector of pages containing both the source and dest memory area.
+ * @pgto_base: page vector address of destination
+ * @pgfrom_base: page vector address of source
+ * @len: number of bytes to copy
+ *
+ * Note: the addresses pgto_base and pgfrom_base are both calculated in
+ * the same way:
+ * if a memory area starts at byte 'base' in page 'pages[i]',
+ *		then its address is given as (i << PAGE_SHIFT) + base
+ * Also note: pgto_base must be < pgfrom_base, but the memory areas
+ * they point to may overlap.
+ */
+static void
+_shift_data_left_pages(struct page **pages, size_t pgto_base,
+ size_t pgfrom_base, size_t len)
+{
+ struct page **pgfrom, **pgto;
+ char *vfrom, *vto;
+ size_t copy;
+
+ BUG_ON(pgfrom_base <= pgto_base);
+
+ pgto = pages + (pgto_base >> PAGE_SHIFT);
+ pgfrom = pages + (pgfrom_base >> PAGE_SHIFT);
+
+ pgto_base = pgto_base % PAGE_SIZE;
+ pgfrom_base = pgfrom_base % PAGE_SIZE;
+
+ do {
+ if (pgto_base >= PAGE_SIZE) {
+ pgto_base = 0;
+ pgto++;
+ }
+		if (pgfrom_base >= PAGE_SIZE) {
+ pgfrom_base = 0;
+ pgfrom++;
+ }
+
+ copy = len;
+ if (copy > (PAGE_SIZE - pgto_base))
+ copy = PAGE_SIZE - pgto_base;
+ if (copy > (PAGE_SIZE - pgfrom_base))
+ copy = PAGE_SIZE - pgfrom_base;
+
+ vto = kmap_atomic(*pgto);
+ if (*pgto != *pgfrom) {
+ vfrom = kmap_atomic(*pgfrom);
+ memcpy(vto + pgto_base, vfrom + pgfrom_base, copy);
+ kunmap_atomic(vfrom);
+ } else
+ memmove(vto + pgto_base, vto + pgfrom_base, copy);
+ flush_dcache_page(*pgto);
+ kunmap_atomic(vto);
+
+ pgto_base += copy;
+ pgfrom_base += copy;
+
+ } while ((len -= copy) != 0);
+}
+
+static void
+_shift_data_left_tail(struct xdr_buf *buf, size_t pgto_base,
+ size_t tail_from, size_t len)
+{
+ struct kvec *tail = buf->tail;
+ size_t shift = len;
+
+ if (len == 0)
+ return;
+ if (pgto_base + len > buf->page_len)
+ shift = buf->page_len - pgto_base;
+
+ _copy_to_pages(buf->pages,
+ buf->page_base + pgto_base,
+ (char *)(tail->iov_base + tail_from),
+ shift);
+
+ memmove((char *)tail->iov_base, tail->iov_base + tail_from + shift, shift);
+ tail->iov_len -= (tail_from + shift);
+}
+
+static void
+_shift_data_left(struct xdr_buf *buf, size_t to, size_t from, size_t len)
+{
+ size_t shift = len;
+
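+	/* Move what fits within the pages first; the rest comes from the tail. */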
+ if (from < buf->page_len) {
+ shift = min(len, buf->page_len - from);
+ _shift_data_left_pages(buf->pages,
+ buf->page_base + to,
+ buf->page_base + from,
+ shift);
+ to += shift;
+ from += shift;
+ shift = len - shift;
+ }
+
+ if (shift == 0)
+ return;
+ if (from >= buf->page_len)
+ from -= buf->page_len;
+
+ _shift_data_left_tail(buf, to, from, shift);
+}
+
/**
* _copy_to_pages
* @pages: array of pages
@@ -1114,6 +1228,27 @@ size_t xdr_expand_hole(struct xdr_stream *xdr, size_t offset, uint64_t length)
}
EXPORT_SYMBOL_GPL(xdr_expand_hole);
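+/**
+ * xdr_align_data - move decoded data to its final offset in the pages
+ * @xdr: pointer to xdr_stream struct
+ * @offset: offset within the page data where this segment belongs
+ * @length: length of the data segment in bytes
+ *
+ * Shifts the segment left so that it begins at @offset, undoing the
+ * displacement introduced by expanding earlier holes.  Returns the
+ * number of bytes placed, which may be less than @length if the
+ * segment extends past the end of the page data.
+ */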
+uint64_t xdr_align_data(struct xdr_stream *xdr, uint64_t offset, uint64_t length)
+{
+ struct xdr_buf *buf = xdr->buf;
+ size_t from = offset;
+
+ if (offset + length > buf->page_len)
+ length = buf->page_len - offset;
+
+ if (offset == 0)
+ xdr_align_pages(xdr, xdr->nwords << 2);
+ else {
+ from = xdr_page_pos(xdr);
+ _shift_data_left(buf, offset, from, length);
+ }
+
+ xdr->nwords -= XDR_QUADLEN(length);
+ xdr_set_page(xdr, from + length);
+ return length;
+}
+EXPORT_SYMBOL_GPL(xdr_align_data);
+
/**
* xdr_enter_page - decode data from the XDR page
* @xdr: pointer to xdr_stream struct
--
2.20.1
From: Anna Schumaker <[email protected]>
This patch adds support for decoding a single NFS4_CONTENT_DATA or
NFS4_CONTENT_HOLE segment returned by the server. This gives a simple
implementation that does not need to spend a lot of time shifting data
around.
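For reference, the reply being decoded here is laid out roughly as
follows on the wire (per the NFSv4.2 spec, RFC 7862):

	READ_PLUS4resok:
		rpr_eof				1 word
		rpr_contents count		1 word
		per segment:
			data_content4 type	1 word
			NFS4_CONTENT_DATA:	d_offset (2 words) + opaque data
			NFS4_CONTENT_HOLE:	di_offset (2 words) + di_length (2 words)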
Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfs/nfs42xdr.c | 160 ++++++++++++++++++++++++++++++++++++++
fs/nfs/nfs4proc.c | 32 +++++++-
fs/nfs/nfs4xdr.c | 1 +
include/linux/nfs4.h | 1 +
include/linux/nfs_fs_sb.h | 1 +
include/linux/nfs_xdr.h | 2 +-
6 files changed, 193 insertions(+), 4 deletions(-)
diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
index 69f72ed2bf87..57ec9c0fc00a 100644
--- a/fs/nfs/nfs42xdr.c
+++ b/fs/nfs/nfs42xdr.c
@@ -32,6 +32,14 @@
#define encode_deallocate_maxsz (op_encode_hdr_maxsz + \
encode_fallocate_maxsz)
#define decode_deallocate_maxsz (op_decode_hdr_maxsz)
+#define encode_read_plus_maxsz (op_encode_hdr_maxsz + \
+ encode_stateid_maxsz + 3)
+#define decode_read_plus_maxsz (op_decode_hdr_maxsz + \
+ 1 /* rpr_eof */ + \
+ 1 /* rpr_contents count */ + \
+ 1 /* data_content4 */ + \
+ 2 /* data_info4.di_offset */ + \
+ 2 /* data_info4.di_length */)
#define encode_seek_maxsz (op_encode_hdr_maxsz + \
encode_stateid_maxsz + \
2 /* offset */ + \
@@ -92,6 +100,12 @@
decode_putfh_maxsz + \
decode_deallocate_maxsz + \
decode_getattr_maxsz)
+#define NFS4_enc_read_plus_sz (compound_encode_hdr_maxsz + \
+ encode_putfh_maxsz + \
+ encode_read_plus_maxsz)
+#define NFS4_dec_read_plus_sz (compound_decode_hdr_maxsz + \
+ decode_putfh_maxsz + \
+ decode_read_plus_maxsz)
#define NFS4_enc_seek_sz (compound_encode_hdr_maxsz + \
encode_putfh_maxsz + \
encode_seek_maxsz)
@@ -170,6 +184,16 @@ static void encode_deallocate(struct xdr_stream *xdr,
encode_fallocate(xdr, args);
}
+static void encode_read_plus(struct xdr_stream *xdr,
+ const struct nfs_pgio_args *args,
+ struct compound_hdr *hdr)
+{
+ encode_op_hdr(xdr, OP_READ_PLUS, decode_read_plus_maxsz, hdr);
+ encode_nfs4_stateid(xdr, &args->stateid);
+ encode_uint64(xdr, args->offset);
+ encode_uint32(xdr, args->count);
+}
+
static void encode_seek(struct xdr_stream *xdr,
const struct nfs42_seek_args *args,
struct compound_hdr *hdr)
@@ -317,6 +341,29 @@ static void nfs4_xdr_enc_deallocate(struct rpc_rqst *req,
encode_nops(&hdr);
}
+/*
+ * Encode READ_PLUS request
+ */
+static void nfs4_xdr_enc_read_plus(struct rpc_rqst *req,
+ struct xdr_stream *xdr,
+ const void *data)
+{
+ const struct nfs_pgio_args *args = data;
+ struct compound_hdr hdr = {
+ .minorversion = nfs4_xdr_minorversion(&args->seq_args),
+ };
+
+ encode_compound_hdr(xdr, req, &hdr);
+ encode_sequence(xdr, &args->seq_args, &hdr);
+ encode_putfh(xdr, args->fh, &hdr);
+ encode_read_plus(xdr, args, &hdr);
+
+ xdr_inline_pages(&req->rq_rcv_buf, hdr.replen << 2,
+ args->pages, args->pgbase, args->count);
+ req->rq_rcv_buf.flags |= XDRBUF_READ;
+ encode_nops(&hdr);
+}
+
/*
* Encode SEEK request
*/
@@ -463,6 +510,92 @@ static int decode_deallocate(struct xdr_stream *xdr, struct nfs42_falloc_res *re
return decode_op_hdr(xdr, OP_DEALLOCATE);
}
+static int decode_read_plus_data(struct xdr_stream *xdr, struct nfs_pgio_res *res)
+{
+ __be32 *p;
+ uint32_t count, recvd;
+ uint64_t offset;
+
+ p = xdr_inline_decode(xdr, 8 + 4);
+ if (unlikely(!p))
+ goto out_overflow;
+
+ p = xdr_decode_hyper(p, &offset);
+ count = be32_to_cpup(p);
+
+ recvd = xdr_read_pages(xdr, count);
+ if (recvd < count)
+ res->eof = 0;
+
+ res->count = recvd;
+ return 0;
+out_overflow:
+ print_overflow_msg(__func__, xdr);
+ return -EIO;
+}
+
+static int decode_read_plus_hole(struct xdr_stream *xdr, struct nfs_pgio_res *res)
+{
+ __be32 *p;
+ uint64_t offset, length;
+ size_t recvd;
+
+ p = xdr_inline_decode(xdr, 8 + 8);
+ if (unlikely(!p))
+ goto out_overflow;
+
+ p = xdr_decode_hyper(p, &offset);
+ p = xdr_decode_hyper(p, &length);
+
+ recvd = xdr_expand_hole(xdr, 0, length);
+ if (recvd < length)
+ res->eof = 0;
+
+ res->count = recvd;
+ return 0;
+out_overflow:
+ print_overflow_msg(__func__, xdr);
+ return -EIO;
+}
+
+static int decode_read_plus(struct xdr_stream *xdr, struct nfs_pgio_res *res)
+{
+ __be32 *p;
+ int status, type;
+ uint32_t segments;
+
+ status = decode_op_hdr(xdr, OP_READ_PLUS);
+ if (status)
+ return status;
+
+ p = xdr_inline_decode(xdr, 4 + 4);
+ if (unlikely(!p))
+ goto out_overflow;
+
+ res->count = 0;
+ res->eof = be32_to_cpup(p++);
+ segments = be32_to_cpup(p++);
+ if (segments == 0)
+ return 0;
+
+ p = xdr_inline_decode(xdr, 4);
+ if (unlikely(!p))
+ goto out_overflow;
+
+ type = be32_to_cpup(p++);
+ if (type == NFS4_CONTENT_DATA)
+ status = decode_read_plus_data(xdr, res);
+ else
+ status = decode_read_plus_hole(xdr, res);
+
+ if (segments > 1)
+ res->eof = 0;
+ return status;
+out_overflow:
+ print_overflow_msg(__func__, xdr);
+ return -EIO;
+}
+
static int decode_seek(struct xdr_stream *xdr, struct nfs42_seek_res *res)
{
int status;
@@ -612,6 +745,33 @@ static int nfs4_xdr_dec_deallocate(struct rpc_rqst *rqstp,
return status;
}
+/*
+ * Decode READ_PLUS request
+ */
+static int nfs4_xdr_dec_read_plus(struct rpc_rqst *rqstp,
+ struct xdr_stream *xdr,
+ void *data)
+{
+ struct nfs_pgio_res *res = data;
+ struct compound_hdr hdr;
+ int status;
+
+ status = decode_compound_hdr(xdr, &hdr);
+ if (status)
+ goto out;
+ status = decode_sequence(xdr, &res->seq_res, rqstp);
+ if (status)
+ goto out;
+ status = decode_putfh(xdr);
+ if (status)
+ goto out;
+ status = decode_read_plus(xdr, res);
+ if (!status)
+ status = res->count;
+out:
+ return status;
+}
+
/*
* Decode SEEK request
*/
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 557a5d636183..0aabddc900e0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -69,6 +69,10 @@
#include "nfs4trace.h"
+#ifdef CONFIG_NFS_V4_2
+#include "nfs42.h"
+#endif /* CONFIG_NFS_V4_2 */
+
#define NFSDBG_FACILITY NFSDBG_PROC
#define NFS4_BITMASK_SZ 3
@@ -5007,9 +5011,15 @@ static bool nfs4_read_stateid_changed(struct rpc_task *task,
static int nfs4_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
{
-
+ struct nfs_server *server = NFS_SERVER(hdr->inode);
dprintk("--> %s\n", __func__);
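+	/*
+	 * The server doesn't support READ_PLUS: clear the capability and
+	 * restart the call, which will fall back to a plain READ.
+	 */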
+ if ((server->caps & NFS_CAP_READ_PLUS) && (task->tk_status == -ENOTSUPP)) {
+ server->caps &= ~NFS_CAP_READ_PLUS;
+ if (rpc_restart_call_prepare(task))
+ task->tk_status = 0;
+ return -EAGAIN;
+ }
if (!nfs4_sequence_done(task, &hdr->res.seq_res))
return -EAGAIN;
if (nfs4_read_stateid_changed(task, &hdr->args))
@@ -5020,13 +5030,28 @@ static int nfs4_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
nfs4_read_done_cb(task, hdr);
}
+#ifdef CONFIG_NFS_V4_2
+static void nfs42_read_plus_support(struct nfs_server *server, struct rpc_message *msg)
+{
+ if (server->caps & NFS_CAP_READ_PLUS)
+ msg->rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_READ_PLUS];
+ else
+ msg->rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_READ];
+}
+#else
+static void nfs42_read_plus_support(struct nfs_server *server, struct rpc_message *msg)
+{
+ msg->rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_READ];
+}
+#endif /* CONFIG_NFS_V4_2 */
+
static void nfs4_proc_read_setup(struct nfs_pgio_header *hdr,
struct rpc_message *msg)
{
hdr->timestamp = jiffies;
if (!hdr->pgio_done_cb)
hdr->pgio_done_cb = nfs4_read_done_cb;
- msg->rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_READ];
+ nfs42_read_plus_support(NFS_SERVER(hdr->inode), msg);
nfs4_init_sequence(&hdr->args.seq_args, &hdr->res.seq_res, 0, 0);
}
@@ -9691,7 +9716,8 @@ static const struct nfs4_minor_version_ops nfs_v4_2_minor_ops = {
| NFS_CAP_DEALLOCATE
| NFS_CAP_SEEK
| NFS_CAP_LAYOUTSTATS
- | NFS_CAP_CLONE,
+ | NFS_CAP_CLONE
+ | NFS_CAP_READ_PLUS,
.init_client = nfs41_init_client,
.shutdown_client = nfs41_shutdown_client,
.match_stateid = nfs41_match_stateid,
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 2fc8f6fa25e4..b18a0143a4af 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -7790,6 +7790,7 @@ const struct rpc_procinfo nfs4_procedures[] = {
PROC42(CLONE, enc_clone, dec_clone),
PROC42(COPY, enc_copy, dec_copy),
PROC42(OFFLOAD_CANCEL, enc_offload_cancel, dec_offload_cancel),
+ PROC42(READ_PLUS, enc_read_plus, dec_read_plus),
PROC(LOOKUPP, enc_lookupp, dec_lookupp),
};
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 1b06f0b28453..db465ad6659b 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -536,6 +536,7 @@ enum {
NFSPROC4_CLNT_CLONE,
NFSPROC4_CLNT_COPY,
NFSPROC4_CLNT_OFFLOAD_CANCEL,
+ NFSPROC4_CLNT_READ_PLUS,
NFSPROC4_CLNT_LOOKUPP,
};
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 6aa8cc83c3b6..e431c2a7affd 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -261,5 +261,6 @@ struct nfs_server {
#define NFS_CAP_CLONE (1U << 23)
#define NFS_CAP_COPY (1U << 24)
#define NFS_CAP_OFFLOAD_CANCEL (1U << 25)
+#define NFS_CAP_READ_PLUS (1U << 26)
#endif
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 441a93ebcac0..4fb9b9d11685 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -620,7 +620,7 @@ struct nfs_pgio_args {
struct nfs_pgio_res {
struct nfs4_sequence_res seq_res;
struct nfs_fattr * fattr;
- __u32 count;
+ __u64 count;
__u32 op_status;
union {
struct {
--
2.20.1
From: Anna Schumaker <[email protected]>
We now have everything we need to read holes and then shift data into
the place it belongs, so decode every segment the server returns rather
than just the first one: holes are zeroed in place with
xdr_expand_hole() and data segments are moved into position with
xdr_align_data().
Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfs/nfs42xdr.c | 36 ++++++++++++++++++++----------------
1 file changed, 20 insertions(+), 16 deletions(-)
diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
index 57ec9c0fc00a..2b8b3a2524c4 100644
--- a/fs/nfs/nfs42xdr.c
+++ b/fs/nfs/nfs42xdr.c
@@ -523,11 +523,11 @@ static int decode_read_plus_data(struct xdr_stream *xdr, struct nfs_pgio_res *re
p = xdr_decode_hyper(p, &offset);
count = be32_to_cpup(p);
- recvd = xdr_read_pages(xdr, count);
+ recvd = xdr_align_data(xdr, res->count, count);
if (recvd < count)
res->eof = 0;
- res->count = recvd;
+ res->count += recvd;
return 0;
out_overflow:
print_overflow_msg(__func__, xdr);
@@ -547,11 +547,11 @@ static int decode_read_plus_hole(struct xdr_stream *xdr, struct nfs_pgio_res *re
p = xdr_decode_hyper(p, &offset);
p = xdr_decode_hyper(p, &length);
- recvd = xdr_expand_hole(xdr, 0, length);
+ recvd = xdr_expand_hole(xdr, res->count, length);
if (recvd < length)
res->eof = 0;
- res->count = recvd;
+ res->count += recvd;
return 0;
out_overflow:
print_overflow_msg(__func__, xdr);
@@ -562,7 +562,7 @@ static int decode_read_plus(struct xdr_stream *xdr, struct nfs_pgio_res *res)
{
__be32 *p;
int status, type;
- uint32_t segments;
+ uint32_t i, segments;
status = decode_op_hdr(xdr, OP_READ_PLUS);
if (status)
@@ -575,20 +575,24 @@ static int decode_read_plus(struct xdr_stream *xdr, struct nfs_pgio_res *res)
res->count = 0;
res->eof = be32_to_cpup(p++);
segments = be32_to_cpup(p++);
- if (segments == 0)
- return 0;
- p = xdr_inline_decode(xdr, 4);
- if (unlikely(!p))
- goto out_overflow;
+ for (i = 0; i < segments; i++) {
+ p = xdr_inline_decode(xdr, 4);
+ if (unlikely(!p))
+ goto out_overflow;
- type = be32_to_cpup(p++);
- if (type == NFS4_CONTENT_DATA)
- status = decode_read_plus_data(xdr, res);
- else
- status = decode_read_plus_hole(xdr, res);
+ type = be32_to_cpup(p);
+ if (type == NFS4_CONTENT_DATA)
+ status = decode_read_plus_data(xdr, res);
+ else
+ status = decode_read_plus_hole(xdr, res);
+ if (status)
+ break;
+ if (res->count == xdr->buf->page_len)
+ break;
+ }
- if (segments > 1)
+ if (i < segments)
res->eof = 0;
return status;
out_overflow:
--
2.20.1
From: Anna Schumaker <[email protected]>
There are some workloads where READ_PLUS might end up hurting
performance, so let's be nice to users and provide a way to disable this
operation, similar to how READDIR_PLUS can be disabled with "nordirplus".
Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfs/nfs4client.c | 3 +++
fs/nfs/super.c | 21 +++++++++++++++++++++
include/linux/nfs4.h | 4 ++--
include/linux/nfs_fs_sb.h | 1 +
4 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 2548405da1f7..2bb603d1a80f 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -998,6 +998,9 @@ static int nfs4_server_common_setup(struct nfs_server *server,
server->caps |= server->nfs_client->cl_mvops->init_caps;
if (server->flags & NFS_MOUNT_NORDIRPLUS)
server->caps &= ~NFS_CAP_READDIRPLUS;
+ if (server->options & NFS_OPTION_NO_READ_PLUS)
+ server->caps &= ~NFS_CAP_READ_PLUS;
+
/*
* Don't use NFS uid/gid mapping if we're using AUTH_SYS or lower
* authentication.
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 0570391eaa16..5b8701fca5b9 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -90,6 +90,7 @@ enum {
Opt_resvport, Opt_noresvport,
Opt_fscache, Opt_nofscache,
Opt_migration, Opt_nomigration,
+ Opt_readplus, Opt_noreadplus,
/* Mount options that take integer arguments */
Opt_port,
@@ -151,6 +152,8 @@ static const match_table_t nfs_mount_option_tokens = {
{ Opt_nofscache, "nofsc" },
{ Opt_migration, "migration" },
{ Opt_nomigration, "nomigration" },
+ { Opt_readplus, "readplus" },
+ { Opt_noreadplus, "noreadplus" },
{ Opt_port, "port=%s" },
{ Opt_rsize, "rsize=%s" },
@@ -690,6 +693,11 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
if (nfss->options & NFS_OPTION_MIGRATION)
seq_printf(m, ",migration");
+ if (nfss->options & NFS_OPTION_NO_READ_PLUS)
+		seq_printf(m, ",noreadplus");
+	else
+		seq_printf(m, ",readplus");
+
if (nfss->flags & NFS_MOUNT_LOOKUP_CACHE_NONEG) {
if (nfss->flags & NFS_MOUNT_LOOKUP_CACHE_NONE)
seq_printf(m, ",lookupcache=none");
@@ -1324,6 +1332,12 @@ static int nfs_parse_mount_options(char *raw,
case Opt_nomigration:
mnt->options &= ~NFS_OPTION_MIGRATION;
break;
+ case Opt_readplus:
+ mnt->options &= ~NFS_OPTION_NO_READ_PLUS;
+ break;
+ case Opt_noreadplus:
+ mnt->options |= NFS_OPTION_NO_READ_PLUS;
+ break;
/*
* options that take numeric values
@@ -1626,6 +1640,9 @@ static int nfs_parse_mount_options(char *raw,
if (mnt->options & NFS_OPTION_MIGRATION &&
(mnt->version != 4 || mnt->minorversion != 0))
goto out_migration_misuse;
+ if (mnt->options & NFS_OPTION_NO_READ_PLUS &&
+ (mnt->version != 4 || mnt->minorversion < 2))
+ goto out_noreadplus_misuse;
/*
* verify that any proto=/mountproto= options match the address
@@ -1668,6 +1685,10 @@ static int nfs_parse_mount_options(char *raw,
printk(KERN_INFO
"NFS: 'migration' not supported for this NFS version\n");
return 0;
+out_noreadplus_misuse:
+ printk(KERN_INFO
+ "NFS: 'noreadplus' not supported for this NFS version\n");
+ return 0;
out_nomem:
printk(KERN_INFO "NFS: not enough memory to parse option\n");
return 0;
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index db465ad6659b..2fd3cf2061c2 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -535,10 +535,10 @@ enum {
NFSPROC4_CLNT_LAYOUTSTATS,
NFSPROC4_CLNT_CLONE,
NFSPROC4_CLNT_COPY,
- NFSPROC4_CLNT_OFFLOAD_CANCEL,
- NFSPROC4_CLNT_READ_PLUS,
NFSPROC4_CLNT_LOOKUPP,
+ NFSPROC4_CLNT_OFFLOAD_CANCEL,
+ NFSPROC4_CLNT_READ_PLUS,
};
/* nfs41 types */
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index e431c2a7affd..c95be09b84f1 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -157,6 +157,7 @@ struct nfs_server {
unsigned int clone_blksize; /* granularity of a CLONE operation */
#define NFS_OPTION_FSCACHE 0x00000001 /* - local caching enabled */
#define NFS_OPTION_MIGRATION 0x00000002 /* - NFSv4 migration enabled */
+#define NFS_OPTION_NO_READ_PLUS	0x00000004	/* - NFSv4.2 READ_PLUS disabled */
struct nfs_fsid fsid;
__u64 maxfilesize; /* maximum file size */
--
2.20.1