2011-04-12 15:30:15

by Badari Pulavarty

Subject: [RFC][PATCH] Vector read/write support for NFS (DIO) client

Hi,

We recently ran into a serious performance issue with the NFS client.
It turned out to be due to the lack of readv/writev support in the
NFS (O_DIRECT) client.

Here is our use-case:

In our cloud environment, our storage is over NFS. Files
on NFS are passed as block devices to the guest (using
O_DIRECT). When the guest does IO on these block devices,
the requests end up as O_DIRECT writes to NFS (on the KVM host).

QEMU (on the host) gets a vector from the virtio ring and
submits it. Older versions of QEMU linearized the vector
they got from KVM (copied it into a single buffer) and
submitted that buffer, so the NFS client always received a
single buffer.

Later versions of QEMU eliminate this copy and submit the
vector directly using preadv()/pwritev().

The NFS client loops through the vector and submits each
element as a separate request whenever the IO is < wsize. In our
case (negotiated wsize=1MB), a 256K IO arrives as 64
vectors of 4K each, so we end up submitting 64 4K FILE_SYNC IOs,
and the server commits each 4K write synchronously. This causes
serious performance degradation. We are trying to see whether
performance improves if we convert the IOs to ASYNC, but
our initial results don't look good.

Supporting readv/writev in the NFS client for all possible cases is
hard. However, if all the vectors are page-aligned and the IO
sizes are page-multiples, the change fits the current code easily.
Luckily, the QEMU use case meets these requirements.
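
To make the requirement concrete, here is a rough userspace sketch of
the kind of vector that qualifies for the fast path: every buffer
page-aligned, every length a page multiple, and the total no larger
than wsize. (Illustration only - the sizes and the helper name are
made up, this is not QEMU code, and fd is assumed to be already open
with O_DIRECT.)

#define _GNU_SOURCE             /* for pwritev() */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define NR_SEGS   64
#define SEG_SIZE  4096          /* one page per segment, 256K total */

static ssize_t submit_aligned_vector(int fd, off_t pos)
{
	struct iovec iov[NR_SEGS];
	int i;

	for (i = 0; i < NR_SEGS; i++) {
		/* page-aligned base, page-multiple length */
		if (posix_memalign(&iov[i].iov_base, 4096, SEG_SIZE))
			return -1;
		iov[i].iov_len = SEG_SIZE;
	}
	/*
	 * 64 x 4K segments submitted as one vectored O_DIRECT write;
	 * the buffers are leaked here for brevity.
	 */
	return pwritev(fd, iov, NR_SEGS, pos);
}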

Here is the patch to add this support. Comments?

Thanks,
Badari

Special readv/writev support for NFS O_DIRECT. Currently the NFS
client O_DIRECT read/write path iterates over the individual elements
in the vector, submits an IO for each one, and waits for them to finish.
If the individual IO size is < r/wsize, each IO is sent FILE_SYNC and
the server has to complete each of these smaller IOs synchronously.
This causes serious performance degradation.

This patch adds a special case that submits one larger IO from the
client, but only if all the IO buffers are page-aligned, each
individual IO is a page-multiple, and the total IO size does not
exceed r/wsize.

This is the common case for QEMU usage.

Signed-off-by: Badari Pulavarty <[email protected]>
---
fs/nfs/direct.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 259 insertions(+)

Index: linux-2.6.38.2/fs/nfs/direct.c
===================================================================
--- linux-2.6.38.2.orig/fs/nfs/direct.c 2011-04-12 10:56:29.374266292 -0400
+++ linux-2.6.38.2/fs/nfs/direct.c 2011-04-12 11:01:21.883266310 -0400
@@ -271,6 +271,52 @@ static const struct rpc_call_ops nfs_rea
.rpc_release = nfs_direct_read_release,
};

+static int nfs_direct_check_single_io(struct nfs_direct_req *dreq,
+ const struct iovec *iov,
+ unsigned long nr_segs, int rw)
+{
+ unsigned long seg;
+ int pages = 0;
+ struct nfs_open_context *ctx = dreq->ctx;
+ struct inode *inode = ctx->path.dentry->d_inode;
+ size_t size;
+
+
+ /* If its a single vector - use old code */
+ if (nr_segs == 1)
+ return pages;
+ /*
+ * Check to see if all IO buffers are page-aligned and
+ * individual IO sizes are page-multiple.
+ * If so, we can submit them in a single IO.
+ * Otherwise, use old code
+ */
+ for (seg = 0; seg < nr_segs; seg++) {
+ unsigned long user_addr = (unsigned long)iov[seg].iov_base;
+
+ if ((user_addr & (PAGE_SIZE - 1)) ||
+ (iov[seg].iov_len & (PAGE_SIZE - 1))) {
+ pages = 0;
+ break;
+ }
+ pages += (iov[seg].iov_len >> PAGE_SHIFT);
+ }
+
+ if (rw)
+ size = NFS_SERVER(inode)->wsize;
+ else
+ size = NFS_SERVER(inode)->rsize;
+
+ /*
+ * If IOsize > wsize/rsize, we need to split IOs into r/wsize anyway
+ * - fall back to old code.
+ */
+ if (pages > (size >> PAGE_SHIFT))
+ pages = 0;
+
+ return pages;
+}
+
/*
* For each rsize'd chunk of the user's buffer, dispatch an NFS READ
* operation. If nfs_readdata_alloc() or get_user_pages() fails,
@@ -385,6 +431,101 @@ static ssize_t nfs_direct_read_schedule_
return result < 0 ? (ssize_t) result : -EFAULT;
}

+/*
+ * Special Case: Efficient readv() support - only if all the IO buffers
+ * in the vector are page-aligned and IO sizes are page-multiple
+ * + total IO size is < rsize. We can map all of the vectors together
+ * and submit them in a single IO instead of operating on single entry
+ * in the vector.
+ */
+static ssize_t nfs_direct_read_schedule_single_io(struct nfs_direct_req *dreq,
+ const struct iovec *iov,
+ unsigned long nr_segs,
+ int pages,
+ loff_t pos)
+{
+ struct nfs_open_context *ctx = dreq->ctx;
+ struct inode *inode = ctx->path.dentry->d_inode;
+ unsigned long user_addr = (unsigned long)iov->iov_base;
+ size_t bytes = pages << PAGE_SHIFT;
+ struct rpc_task *task;
+ struct rpc_message msg = {
+ .rpc_cred = ctx->cred,
+ };
+ struct rpc_task_setup task_setup_data = {
+ .rpc_client = NFS_CLIENT(inode),
+ .rpc_message = &msg,
+ .callback_ops = &nfs_read_direct_ops,
+ .workqueue = nfsiod_workqueue,
+ .flags = RPC_TASK_ASYNC,
+ };
+ unsigned int pgbase;
+ int result;
+ int seg;
+ int mapped = 0;
+ int nr_pages;
+ ssize_t started = 0;
+ struct nfs_read_data *data;
+
+ result = -ENOMEM;
+ data = nfs_readdata_alloc(pages);
+ if (unlikely(!data))
+ return result;
+
+ pgbase = user_addr & ~PAGE_MASK;
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ user_addr = (unsigned long)iov[seg].iov_base;
+ nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
+
+ down_read(&current->mm->mmap_sem);
+ result = get_user_pages(current, current->mm, user_addr,
+ nr_pages, 1, 0, &data->pagevec[mapped],
+ NULL);
+ up_read(&current->mm->mmap_sem);
+ if (result < 0) {
+ /* unmap what is done so far */
+ nfs_direct_release_pages(data->pagevec, mapped);
+ nfs_readdata_free(data);
+ return result;
+ }
+ mapped += result;
+ }
+
+ get_dreq(dreq);
+
+ data->req = (struct nfs_page *) dreq;
+ data->inode = inode;
+ data->cred = msg.rpc_cred;
+ data->args.fh = NFS_FH(inode);
+ data->args.context = ctx;
+ data->args.offset = pos;
+ data->args.pgbase = pgbase;
+ data->args.pages = data->pagevec;
+ data->args.count = bytes;
+ data->res.fattr = &data->fattr;
+ data->res.eof = 0;
+ data->res.count = bytes;
+ nfs_fattr_init(&data->fattr);
+
+ task_setup_data.task = &data->task;
+ task_setup_data.callback_data = data;
+ msg.rpc_argp = &data->args;
+ msg.rpc_resp = &data->res;
+ NFS_PROTO(inode)->read_setup(data, &msg);
+
+ task = rpc_run_task(&task_setup_data);
+ if (IS_ERR(task))
+ goto out;
+ rpc_put_task(task);
+
+ started += bytes;
+out:
+ if (started)
+ return started;
+ return result < 0 ? (ssize_t) result : -EFAULT;
+}
+
static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
const struct iovec *iov,
unsigned long nr_segs,
@@ -396,6 +537,16 @@ static ssize_t nfs_direct_read_schedule_

get_dreq(dreq);

+ result = nfs_direct_check_single_io(dreq, iov, nr_segs, 0);
+ if (result) {
+ result = nfs_direct_read_schedule_single_io(dreq, iov, nr_segs,
+ result, pos);
+ if (result < 0)
+ goto out;
+ requested_bytes += result;
+ pos += result;
+ goto out;
+ }
for (seg = 0; seg < nr_segs; seg++) {
const struct iovec *vec = &iov[seg];
result = nfs_direct_read_schedule_segment(dreq, vec, pos);
@@ -416,6 +567,7 @@ static ssize_t nfs_direct_read_schedule_
return result < 0 ? result : -EIO;
}

+out:
if (put_dreq(dreq))
nfs_direct_complete(dreq);
return 0;
@@ -821,6 +973,102 @@ static ssize_t nfs_direct_write_schedule
return result < 0 ? (ssize_t) result : -EFAULT;
}

+/*
+ * Special Case: Efficient writev() support - only if all the IO buffers
+ * in the vector are page-aligned and IO sizes are page-multiple
+ * + total IO size is < wsize. We can map all of the vectors together
+ * and submit them in a single IO instead of operating on single entry
+ * in the vector.
+ */
+static ssize_t nfs_direct_write_schedule_single_io(struct nfs_direct_req *dreq,
+ const struct iovec *iov,
+ unsigned long nr_segs,
+ int pages,
+ loff_t pos, int sync)
+{
+ struct nfs_open_context *ctx = dreq->ctx;
+ struct inode *inode = ctx->path.dentry->d_inode;
+ unsigned long user_addr = (unsigned long)iov->iov_base;
+ size_t bytes = pages << PAGE_SHIFT;
+ struct rpc_task *task;
+ struct rpc_message msg = {
+ .rpc_cred = ctx->cred,
+ };
+ struct rpc_task_setup task_setup_data = {
+ .rpc_client = NFS_CLIENT(inode),
+ .rpc_message = &msg,
+ .callback_ops = &nfs_write_direct_ops,
+ .workqueue = nfsiod_workqueue,
+ .flags = RPC_TASK_ASYNC,
+ };
+ unsigned int pgbase;
+ int result;
+ int seg;
+ int mapped = 0;
+ int nr_pages;
+ ssize_t started = 0;
+ struct nfs_write_data *data;
+
+ result = -ENOMEM;
+ data = nfs_writedata_alloc(pages);
+ if (unlikely(!data))
+ return result;
+
+ pgbase = user_addr & ~PAGE_MASK;
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ user_addr = (unsigned long)iov[seg].iov_base;
+ nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
+
+ down_read(&current->mm->mmap_sem);
+ result = get_user_pages(current, current->mm, user_addr,
+ nr_pages, 0, 0, &data->pagevec[mapped], NULL);
+ up_read(&current->mm->mmap_sem);
+ if (result < 0) {
+ /* unmap what is done so far */
+ nfs_direct_release_pages(data->pagevec, mapped);
+ nfs_writedata_free(data);
+ return result;
+ }
+ mapped += result;
+ }
+
+ get_dreq(dreq);
+ list_move_tail(&data->pages, &dreq->rewrite_list);
+
+ data->req = (struct nfs_page *) dreq;
+ data->inode = inode;
+ data->cred = msg.rpc_cred;
+ data->args.fh = NFS_FH(inode);
+ data->args.context = ctx;
+ data->args.offset = pos;
+ data->args.pgbase = pgbase;
+ data->args.pages = data->pagevec;
+ data->args.count = bytes;
+ data->args.stable = sync;
+ data->res.fattr = &data->fattr;
+ data->res.count = bytes;
+ data->res.verf = &data->verf;
+ nfs_fattr_init(&data->fattr);
+
+ task_setup_data.task = &data->task;
+ task_setup_data.callback_data = data;
+ msg.rpc_argp = &data->args;
+ msg.rpc_resp = &data->res;
+ NFS_PROTO(inode)->write_setup(data, &msg);
+
+ task = rpc_run_task(&task_setup_data);
+ if (IS_ERR(task))
+ goto out;
+ rpc_put_task(task);
+
+ started += bytes;
+out:
+ if (started)
+ return started;
+ return result < 0 ? (ssize_t) result : -EFAULT;
+}
+
static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
const struct iovec *iov,
unsigned long nr_segs,
@@ -832,6 +1080,16 @@ static ssize_t nfs_direct_write_schedule

get_dreq(dreq);

+ result = nfs_direct_check_single_io(dreq, iov, nr_segs, 1);
+ if (result) {
+ result = nfs_direct_write_schedule_single_io(dreq, iov, nr_segs,
+ result, pos, sync);
+ if (result < 0)
+ goto out;
+ requested_bytes += result;
+ pos += result;
+ goto out;
+ }
for (seg = 0; seg < nr_segs; seg++) {
const struct iovec *vec = &iov[seg];
result = nfs_direct_write_schedule_segment(dreq, vec,
@@ -853,6 +1111,7 @@ static ssize_t nfs_direct_write_schedule
return result < 0 ? result : -EIO;
}

+out:
if (put_dreq(dreq))
nfs_direct_write_complete(dreq, dreq->inode);
return 0;




2011-04-12 16:35:51

by Badari Pulavarty

Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
>
> > Hi,
> >
> > We recently ran into serious performance issue with NFS client.
> > It turned out that its due to lack of readv/write support for
> > NFS (O_DIRECT) client.
> >
> > Here is our use-case:
> >
> > In our cloud environment, our storage is over NFS. Files
> > on NFS are passed as a blockdevices to the guest (using
> > O_DIRECT). When guest is doing IO on these block devices,
> > they will end up as O_DIRECT writes to NFS (on KVM host).
> >
> > QEMU (on the host) gets a vector from virtio-ring and
> > submits them. Old versions of QEMU, linearized the vector
> > it got from KVM (copied them into a buffer) and submits
> > the buffer. So, NFS client always received a single buffer.
> >
> > Later versions of QEMU, eliminated this copy and submits
> > a vector directly using preadv/pwritev().
> >
> > NFS client loops through the vector and submits each
> > vector as separate request for each IO < wsize. In our
> > case (negotiated wsize=1MB), for 256K IO - we get 64
> > vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
> > Server end up doing each 4K synchronously. This causes
> > serious performance degrade. We are trying to see if the
> > performance improves if we convert IOs to ASYNC - but
> > our initial results doesn't look good.
> >
> > readv/writev support NFS client for all possible cases is
> > hard. Instead, if all vectors are page-aligned and
> > iosizes page-multiple - it fits the current code easily.
> > Luckily, QEMU use-case fits these requirements.
> >
> > Here is the patch to add this support. Comments ?
>
> Restricting buffer alignment requirements would be an onerous API change, IMO.

I am not suggesting an API change at all. All I am doing is: if all
the IOs are aligned, we can take a fast path and do them in a single
IO request (as if we got a single buffer). Otherwise, we do it the
hard way as today - loop through each element and handle them individually.

>
> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed. I think Jeff Layton already has a patch that does this.

We are trying that patch. It improves performance a little, but nowhere
close to doing it as a single vector/buffer.

Khoa, can you share your performance data for all the
suggestions/patches you tried so far?

Thanks,
Badari


> >
> > Thanks,
> > Badari
> >
> > Special readv/writev support for NFS O_DIRECT. Currently NFS
> > client O_DIRECT read/write iterates over individual elements
> > in the vector and submits the IO and waits for them to finish.
> > If individual IOsize is < r/wsize, each IO will be FILE_SYNC.
> > Server has to finish individual smaller IOs synchronous. This
> > causes serious performance degrade.
> >
> > This patch is a special case to submit larger IOs from the client
> > only if all the IO buffers are page-aligned and each individual
> > IOs are page-multiple + total IO size is < r/wsize.
> >
> > This is the common case for QEMU usage.
> >
> > Signed-off-by: Badari Pulavarty <[email protected]>
> > ---
> > fs/nfs/direct.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 259 insertions(+)
> >
> > Index: linux-2.6.38.2/fs/nfs/direct.c
> > ===================================================================
> > --- linux-2.6.38.2.orig/fs/nfs/direct.c 2011-04-12 10:56:29.374266292 -0400
> > +++ linux-2.6.38.2/fs/nfs/direct.c 2011-04-12 11:01:21.883266310 -0400
> > @@ -271,6 +271,52 @@ static const struct rpc_call_ops nfs_rea
> > .rpc_release = nfs_direct_read_release,
> > };
> >
> > +static int nfs_direct_check_single_io(struct nfs_direct_req *dreq,
> > + const struct iovec *iov,
> > + unsigned long nr_segs, int rw)
> > +{
> > + unsigned long seg;
> > + int pages = 0;
> > + struct nfs_open_context *ctx = dreq->ctx;
> > + struct inode *inode = ctx->path.dentry->d_inode;
> > + size_t size;
> > +
> > +
> > + /* If its a single vector - use old code */
> > + if (nr_segs == 1)
> > + return pages;
> > + /*
> > + * Check to see if all IO buffers are page-aligned and
> > + * individual IO sizes are page-multiple.
> > + * If so, we can submit them in a single IO.
> > + * Otherwise, use old code
> > + */
> > + for (seg = 0; seg < nr_segs; seg++) {
> > + unsigned long user_addr = (unsigned long)iov[seg].iov_base;
> > +
> > + if ((user_addr & (PAGE_SIZE - 1)) ||
> > + (iov[seg].iov_len & (PAGE_SIZE - 1))) {
> > + pages = 0;
> > + break;
> > + }
> > + pages += (iov[seg].iov_len >> PAGE_SHIFT);
> > + }
> > +
> > + if (rw)
> > + size = NFS_SERVER(inode)->wsize;
> > + else
> > + size = NFS_SERVER(inode)->rsize;
> > +
> > + /*
> > + * If IOsize > wsize/rsize, we need to split IOs into r/wsize anyway
> > + * - fall back to old code.
> > + */
> > + if (pages > (size >> PAGE_SHIFT))
> > + pages = 0;
> > +
> > + return pages;
> > +}
> > +
> > /*
> > * For each rsize'd chunk of the user's buffer, dispatch an NFS READ
> > * operation. If nfs_readdata_alloc() or get_user_pages() fails,
> > @@ -385,6 +431,101 @@ static ssize_t nfs_direct_read_schedule_
> > return result < 0 ? (ssize_t) result : -EFAULT;
> > }
> >
> > +/*
> > + * Special Case: Efficient readv() support - only if all the IO buffers
> > + * in the vector are page-aligned and IO sizes are page-multiple
> > + * + total IO size is < rsize. We can map all of the vectors together
> > + * and submit them in a single IO instead of operating on single entry
> > + * in the vector.
> > + */
> > +static ssize_t nfs_direct_read_schedule_single_io(struct nfs_direct_req *dreq,
> > + const struct iovec *iov,
> > + unsigned long nr_segs,
> > + int pages,
> > + loff_t pos)
> > +{
> > + struct nfs_open_context *ctx = dreq->ctx;
> > + struct inode *inode = ctx->path.dentry->d_inode;
> > + unsigned long user_addr = (unsigned long)iov->iov_base;
> > + size_t bytes = pages << PAGE_SHIFT;
> > + struct rpc_task *task;
> > + struct rpc_message msg = {
> > + .rpc_cred = ctx->cred,
> > + };
> > + struct rpc_task_setup task_setup_data = {
> > + .rpc_client = NFS_CLIENT(inode),
> > + .rpc_message = &msg,
> > + .callback_ops = &nfs_read_direct_ops,
> > + .workqueue = nfsiod_workqueue,
> > + .flags = RPC_TASK_ASYNC,
> > + };
> > + unsigned int pgbase;
> > + int result;
> > + int seg;
> > + int mapped = 0;
> > + int nr_pages;
> > + ssize_t started = 0;
> > + struct nfs_read_data *data;
> > +
> > + result = -ENOMEM;
> > + data = nfs_readdata_alloc(pages);
> > + if (unlikely(!data))
> > + return result;
> > +
> > + pgbase = user_addr & ~PAGE_MASK;
> > +
> > + for (seg = 0; seg < nr_segs; seg++) {
> > + user_addr = (unsigned long)iov[seg].iov_base;
> > + nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
> > +
> > + down_read(&current->mm->mmap_sem);
> > + result = get_user_pages(current, current->mm, user_addr,
> > + nr_pages, 1, 0, &data->pagevec[mapped],
> > + NULL);
> > + up_read(&current->mm->mmap_sem);
> > + if (result < 0) {
> > + /* unmap what is done so far */
> > + nfs_direct_release_pages(data->pagevec, mapped);
> > + nfs_readdata_free(data);
> > + return result;
> > + }
> > + mapped += result;
> > + }
> > +
> > + get_dreq(dreq);
> > +
> > + data->req = (struct nfs_page *) dreq;
> > + data->inode = inode;
> > + data->cred = msg.rpc_cred;
> > + data->args.fh = NFS_FH(inode);
> > + data->args.context = ctx;
> > + data->args.offset = pos;
> > + data->args.pgbase = pgbase;
> > + data->args.pages = data->pagevec;
> > + data->args.count = bytes;
> > + data->res.fattr = &data->fattr;
> > + data->res.eof = 0;
> > + data->res.count = bytes;
> > + nfs_fattr_init(&data->fattr);
> > +
> > + task_setup_data.task = &data->task;
> > + task_setup_data.callback_data = data;
> > + msg.rpc_argp = &data->args;
> > + msg.rpc_resp = &data->res;
> > + NFS_PROTO(inode)->read_setup(data, &msg);
> > +
> > + task = rpc_run_task(&task_setup_data);
> > + if (IS_ERR(task))
> > + goto out;
> > + rpc_put_task(task);
> > +
> > + started += bytes;
> > +out:
> > + if (started)
> > + return started;
> > + return result < 0 ? (ssize_t) result : -EFAULT;
> > +}
> > +
> > static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
> > const struct iovec *iov,
> > unsigned long nr_segs,
> > @@ -396,6 +537,16 @@ static ssize_t nfs_direct_read_schedule_
> >
> > get_dreq(dreq);
> >
> > + result = nfs_direct_check_single_io(dreq, iov, nr_segs, 0);
> > + if (result) {
> > + result = nfs_direct_read_schedule_single_io(dreq, iov, nr_segs,
> > + result, pos);
> > + if (result < 0)
> > + goto out;
> > + requested_bytes += result;
> > + pos += result;
> > + goto out;
> > + }
> > for (seg = 0; seg < nr_segs; seg++) {
> > const struct iovec *vec = &iov[seg];
> > result = nfs_direct_read_schedule_segment(dreq, vec, pos);
> > @@ -416,6 +567,7 @@ static ssize_t nfs_direct_read_schedule_
> > return result < 0 ? result : -EIO;
> > }
> >
> > +out:
> > if (put_dreq(dreq))
> > nfs_direct_complete(dreq);
> > return 0;
> > @@ -821,6 +973,102 @@ static ssize_t nfs_direct_write_schedule
> > return result < 0 ? (ssize_t) result : -EFAULT;
> > }
> >
> > +/*
> > + * Special Case: Efficient writev() support - only if all the IO buffers
> > + * in the vector are page-aligned and IO sizes are page-multiple
> > + * + total IO size is < wsize. We can map all of the vectors together
> > + * and submit them in a single IO instead of operating on single entry
> > + * in the vector.
> > + */
> > +static ssize_t nfs_direct_write_schedule_single_io(struct nfs_direct_req *dreq,
> > + const struct iovec *iov,
> > + unsigned long nr_segs,
> > + int pages,
> > + loff_t pos, int sync)
> > +{
> > + struct nfs_open_context *ctx = dreq->ctx;
> > + struct inode *inode = ctx->path.dentry->d_inode;
> > + unsigned long user_addr = (unsigned long)iov->iov_base;
> > + size_t bytes = pages << PAGE_SHIFT;
> > + struct rpc_task *task;
> > + struct rpc_message msg = {
> > + .rpc_cred = ctx->cred,
> > + };
> > + struct rpc_task_setup task_setup_data = {
> > + .rpc_client = NFS_CLIENT(inode),
> > + .rpc_message = &msg,
> > + .callback_ops = &nfs_write_direct_ops,
> > + .workqueue = nfsiod_workqueue,
> > + .flags = RPC_TASK_ASYNC,
> > + };
> > + unsigned int pgbase;
> > + int result;
> > + int seg;
> > + int mapped = 0;
> > + int nr_pages;
> > + ssize_t started = 0;
> > + struct nfs_write_data *data;
> > +
> > + result = -ENOMEM;
> > + data = nfs_writedata_alloc(pages);
> > + if (unlikely(!data))
> > + return result;
> > +
> > + pgbase = user_addr & ~PAGE_MASK;
> > +
> > + for (seg = 0; seg < nr_segs; seg++) {
> > + user_addr = (unsigned long)iov[seg].iov_base;
> > + nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
> > +
> > + down_read(&current->mm->mmap_sem);
> > + result = get_user_pages(current, current->mm, user_addr,
> > + nr_pages, 0, 0, &data->pagevec[mapped], NULL);
> > + up_read(&current->mm->mmap_sem);
> > + if (result < 0) {
> > + /* unmap what is done so far */
> > + nfs_direct_release_pages(data->pagevec, mapped);
> > + nfs_writedata_free(data);
> > + return result;
> > + }
> > + mapped += result;
> > + }
> > +
> > + get_dreq(dreq);
> > + list_move_tail(&data->pages, &dreq->rewrite_list);
> > +
> > + data->req = (struct nfs_page *) dreq;
> > + data->inode = inode;
> > + data->cred = msg.rpc_cred;
> > + data->args.fh = NFS_FH(inode);
> > + data->args.context = ctx;
> > + data->args.offset = pos;
> > + data->args.pgbase = pgbase;
> > + data->args.pages = data->pagevec;
> > + data->args.count = bytes;
> > + data->args.stable = sync;
> > + data->res.fattr = &data->fattr;
> > + data->res.count = bytes;
> > + data->res.verf = &data->verf;
> > + nfs_fattr_init(&data->fattr);
> > +
> > + task_setup_data.task = &data->task;
> > + task_setup_data.callback_data = data;
> > + msg.rpc_argp = &data->args;
> > + msg.rpc_resp = &data->res;
> > + NFS_PROTO(inode)->write_setup(data, &msg);
> > +
> > + task = rpc_run_task(&task_setup_data);
> > + if (IS_ERR(task))
> > + goto out;
> > + rpc_put_task(task);
> > +
> > + started += bytes;
> > +out:
> > + if (started)
> > + return started;
> > + return result < 0 ? (ssize_t) result : -EFAULT;
> > +}
> > +
> > static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
> > const struct iovec *iov,
> > unsigned long nr_segs,
> > @@ -832,6 +1080,16 @@ static ssize_t nfs_direct_write_schedule
> >
> > get_dreq(dreq);
> >
> > + result = nfs_direct_check_single_io(dreq, iov, nr_segs, 1);
> > + if (result) {
> > + result = nfs_direct_write_schedule_single_io(dreq, iov, nr_segs,
> > + result, pos, sync);
> > + if (result < 0)
> > + goto out;
> > + requested_bytes += result;
> > + pos += result;
> > + goto out;
> > + }
> > for (seg = 0; seg < nr_segs; seg++) {
> > const struct iovec *vec = &iov[seg];
> > result = nfs_direct_write_schedule_segment(dreq, vec,
> > @@ -853,6 +1111,7 @@ static ssize_t nfs_direct_write_schedule
> > return result < 0 ? result : -EIO;
> > }
> >
> > +out:
> > if (put_dreq(dreq))
> > nfs_direct_write_complete(dreq, dreq->inode);
> > return 0;
> >
> >
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
>


2011-04-13 14:02:35

by Jeff Layton

Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Wed, 13 Apr 2011 06:43:53 -0700
Badari Pulavarty <[email protected]> wrote:

> On 4/13/2011 5:36 AM, Jeff Layton wrote:
> > On Tue, 12 Apr 2011 10:46:00 -0700
> > Badari Pulavarty<[email protected]> wrote:
> >
> >
> >> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
> >>
> >>> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
> >>>
> >>>
> >>>> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
> >>>>
> >>>>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
> >>>>>
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> We recently ran into serious performance issue with NFS client.
> >>>>>> It turned out that its due to lack of readv/write support for
> >>>>>> NFS (O_DIRECT) client.
> >>>>>>
> >>>>>> Here is our use-case:
> >>>>>>
> >>>>>> In our cloud environment, our storage is over NFS. Files
> >>>>>> on NFS are passed as a blockdevices to the guest (using
> >>>>>> O_DIRECT). When guest is doing IO on these block devices,
> >>>>>> they will end up as O_DIRECT writes to NFS (on KVM host).
> >>>>>>
> >>>>>> QEMU (on the host) gets a vector from virtio-ring and
> >>>>>> submits them. Old versions of QEMU, linearized the vector
> >>>>>> it got from KVM (copied them into a buffer) and submits
> >>>>>> the buffer. So, NFS client always received a single buffer.
> >>>>>>
> >>>>>> Later versions of QEMU, eliminated this copy and submits
> >>>>>> a vector directly using preadv/pwritev().
> >>>>>>
> >>>>>> NFS client loops through the vector and submits each
> >>>>>> vector as separate request for each IO< wsize. In our
> >>>>>> case (negotiated wsize=1MB), for 256K IO - we get 64
> >>>>>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
> >>>>>> Server end up doing each 4K synchronously. This causes
> >>>>>> serious performance degrade. We are trying to see if the
> >>>>>> performance improves if we convert IOs to ASYNC - but
> >>>>>> our initial results doesn't look good.
> >>>>>>
> >>>>>> readv/writev support NFS client for all possible cases is
> >>>>>> hard. Instead, if all vectors are page-aligned and
> >>>>>> iosizes page-multiple - it fits the current code easily.
> >>>>>> Luckily, QEMU use-case fits these requirements.
> >>>>>>
> >>>>>> Here is the patch to add this support. Comments ?
> >>>>>>
> >>>>> Restricting buffer alignment requirements would be an onerous API change, IMO.
> >>>>>
> >>>> I am not suggesting an API change at all. All I am doing is, if all
> >>>> the IOs are aligned - we can do a fast path as we can do in a single
> >>>> IO request. (as if we got a single buffer). Otherwise, we do hard
> >>>> way as today - loop through each one and do them individually.
> >>>>
> >>> Thanks for the clarification. That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
> >>>
> >>>
> >>>>> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed. I think Jeff Layton already has a patch that does this.
> >>>>>
> >>>> We are trying that patch. It does improve the performance by little,
> >>>> but not anywhere close to doing it as a single vector/buffer.
> >>>>
> >>>> Khoa, can you share your performance data for all the
> >>>> suggestions/patches you tried so far ?
> >>>>
> >>> The individual WRITEs should be submitted in parallel. If there is additional performance overhead, it is probably due to the small RPC slot table size. Have you tried increasing it?
> >>>
> >> We haven't tried both fixes together (RPC slot increase, Turn into
> >> ASYNC). Each one individually didn't help much. We will try them
> >> together.
> >>
> >>
> > I must have missed some of these emails, but here's the patch that
> > Chuck mentioned and based on his suggestion. It may be reasonable as a
> > stopgap solution until Trond's overhaul of the DIO code is complete.
> >
> > With this + a larger slot table size, I would think you'd have a
> > substantial write performance improvement. Probably not as good as if
> > the writes were coalesced, but it should help.
> >
> > Badari, if you can let us know whether this plus increasing the slot
> > table size helps, then I'll plan to "officially" submit it. This one is
> > against RHEL6 but it should apply to mainline kernels with a small
> > offset.
> >
> > -----------------[snip]-----------------
> >
> > BZ#694309: nfs: use unstable writes for bigger groups of DIO writes
> >
> > Currently, the client uses FILE_SYNC whenever it's writing an amount of
> > data less than or equal to the wsize. This is a problem though if we have a bunch
> > of small iovec's batched up in a single writev call. The client will
> > iterate over them and do a single FILE_SYNC WRITE for each.
> >
> > Instead, change the code to do unstable writes when we'll need to do
> > multiple WRITE RPC's in order to satisfy the request. While we're at
> > it, optimize away the allocation of commit_data when we aren't going
> > to use it anyway.
> >
> > Signed-off-by: Jeff Layton<[email protected]>
> >
>
> Khoa is running tests with this + the RPC slot table change. Single-thread
> performance improved, but the multi-thread tests don't scale (I guess we
> are running out of RPC slot table entries even with 128?).
> He will share the results shortly.
>
> Thanks,
> Badari
>

Ok, good to know.

I tend to think that the whole slot table concept needs some
reconsideration anyway. Off the top of my head...

We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
mempool with a minimum number of slots. Have them all be allocated with
GFP_NOWAIT. If the allocation returns a NULL pointer, then the task can
sleep on the waitqueue like it does today. That way, clients can allocate
rpc_rqst's as they need them, for as long as memory holds out.

We have the reserve_xprt stuff to handle congestion control anyway so I
don't really see the value in the artificial limits that the slot table
provides.
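
Very rough sketch of what I mean (names are invented, error handling and
locking are omitted, and ->rqst_pool is not a real rpc_xprt field today):

#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/sunrpc/xprt.h>

/* created with kmem_cache_create() at module init (not shown) */
static struct kmem_cache *rpc_rqst_slab;

/* give each transport a guaranteed minimum number of slots */
static int xprt_alloc_slot_pool(struct rpc_xprt *xprt, int min_slots)
{
	xprt->rqst_pool = mempool_create_slab_pool(min_slots, rpc_rqst_slab);
	return xprt->rqst_pool ? 0 : -ENOMEM;
}

static struct rpc_rqst *xprt_get_rqst(struct rpc_xprt *xprt)
{
	/*
	 * GFP_NOWAIT: if neither the slab nor the pool's reserve can
	 * hand us a slot, this returns NULL and the task goes to sleep
	 * on the backlog queue just like it does today.
	 */
	return mempool_alloc(xprt->rqst_pool, GFP_NOWAIT);
}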

Maybe I should hack up a patchset for this...

--
Jeff Layton <[email protected]>

2011-04-13 19:04:15

by Jeff Layton

Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Wed, 13 Apr 2011 14:47:05 -0400
Chuck Lever <[email protected]> wrote:

>
> On Apr 13, 2011, at 2:14 PM, Trond Myklebust wrote:
>
> > On Wed, 2011-04-13 at 13:56 -0400, Andy Adamson wrote:
> >> On Apr 13, 2011, at 1:20 PM, Jeff Layton wrote:
> >>
> >>> On Wed, 13 Apr 2011 10:22:13 -0400
> >>> Trond Myklebust <[email protected]> wrote:
> >>>
> >>>> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> >>>>> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> >>>>> mempool with a minimum number of slots. Have them all be allocated with
> >>>>> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> >>>>> the waitqueue like it does today. Then, the clients can allocate
> >>>>> rpc_rqst's as they need as long as memory holds out for it.
> >>>>>
> >>>>> We have the reserve_xprt stuff to handle congestion control anyway so I
> >>>>> don't really see the value in the artificial limits that the slot table
> >>>>> provides.
> >>>>>
> >>>>> Maybe I should hack up a patchset for this...
> >>>>
> >>>> This issue has come up several times recently. My preference would be to
> >>>> tie the availability of slots to the TCP window size, and basically say
> >>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> >>>> off allocating more slots until we get a ->write_space() callback which
> >>>> clears that flag.
> >>>>
> >>>> For the RDMA case, we can continue to use the current system of a fixed
> >>>> number of preallocated slots.
> >>>>
> >>>
> >>> I take it then that we'd want a similar scheme for UDP as well? I guess
> >>> I'm just not sure what the slot table is supposed to be for.
> >>
> >> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
> >> can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
> >> For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
> >> There is no reason to allocate more rpc_rqsts than can fit on the wire.
> >
> > Agreed, but as I said earlier, there is no reason to even try to use UDP
> > on high bandwidth links, so I suggest we just leave it as-is.
>
> I think Jeff is suggesting that all the transports should use the same logic, but UDP and RDMA should simply have fixed upper limits on their slot table size. UDP would then behave the same as before, but would share code with the others. That might be cleaner than maintaining separate slot allocation mechanisms for each transport.
>
> In other words, share the code, but parametrize it so that UDP and RDMA have effectively fixed slot tables as before, but TCP is allowed to expand.
>

That was my initial thought, but Trond has a point that there's no
reason to allocate info for a call that we're not able to send.

The idea of hooking up congestion feedback from the networking layer
into the slot allocation code sounds intriguing, so for now I'll stop
armchair quarterbacking and just wait to see what Andy comes up with :)

--
Jeff Layton <[email protected]>

2011-04-14 06:39:54

by Dean

Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client



On 4/13/11 5:42 PM, Trond Myklebust wrote:
> On Wed, 2011-04-13 at 17:21 -0700, Dean wrote:
>>>>> This issue has come up several times recently. My preference would be to
>>>>> tie the availability of slots to the TCP window size, and basically say
>>>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>>>>> off allocating more slots until we get a ->write_space() callback which
>>>>> clears that flag.
>>>>>
>>>>> For the RDMA case, we can continue to use the current system of a fixed
>>>>> number of preallocated slots.
>>>>>
>>>> I take it then that we'd want a similar scheme for UDP as well? I guess
>>>> I'm just not sure what the slot table is supposed to be for.
>>> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
>>> can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
>>> For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
>>> There is no reason to allocate more rpc_rqsts than can fit on the wire.
>> I agree with checking for space on the link.
>>
>> The above formula is a good lower bound on the maximum number of slots,
>> but there are many times when a client could use more slots than the
>> above formula. For example, we don't want to punish writes if rsize >
>> wsize. Also, you have to account for the server memory, which can
>> sometimes hold several write requests while waiting for them to be
>> sync'd to disk, leaving the TCP buffers less than full.
> Err... No... On the contrary, it is a good _upper_ bound on the number
> of slots. There is no point in allocating a slot for an RPC request
> which you know you have no ability to transmit. That has nothing to do
> with rsize or wsize values: if the socket is backed up, it won't take
> more data.

Absolutely. I'm just trying to point out that checking the
SOCK_ASYNC_NOSPACE flag seems to be the only way to guarantee that the
link won't take more data.
Dean
> Trond
>

2011-04-13 18:47:27

by Chuck Lever III

Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client


On Apr 13, 2011, at 2:14 PM, Trond Myklebust wrote:

> On Wed, 2011-04-13 at 13:56 -0400, Andy Adamson wrote:
>> On Apr 13, 2011, at 1:20 PM, Jeff Layton wrote:
>>
>>> On Wed, 13 Apr 2011 10:22:13 -0400
>>> Trond Myklebust <[email protected]> wrote:
>>>
>>>> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
>>>>> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
>>>>> mempool with a minimum number of slots. Have them all be allocated with
>>>>> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
>>>>> the waitqueue like it does today. Then, the clients can allocate
>>>>> rpc_rqst's as they need as long as memory holds out for it.
>>>>>
>>>>> We have the reserve_xprt stuff to handle congestion control anyway so I
>>>>> don't really see the value in the artificial limits that the slot table
>>>>> provides.
>>>>>
>>>>> Maybe I should hack up a patchset for this...
>>>>
>>>> This issue has come up several times recently. My preference would be to
>>>> tie the availability of slots to the TCP window size, and basically say
>>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>>>> off allocating more slots until we get a ->write_space() callback which
>>>> clears that flag.
>>>>
>>>> For the RDMA case, we can continue to use the current system of a fixed
>>>> number of preallocated slots.
>>>>
>>>
>>> I take it then that we'd want a similar scheme for UDP as well? I guess
>>> I'm just not sure what the slot table is supposed to be for.
>>
>> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
>> can handle - basically the #slots should = double the bandwidth-delay product divided by the max(rsize/wsize).
>> For TCP, this is the window size. (ping of max MTU packet * interface bandwidth).
>> There is no reason to allocate more rpc_rqsts than can fit on the wire.
>
> Agreed, but as I said earlier, there is no reason to even try to use UDP
> on high bandwidth links, so I suggest we just leave it as-is.

I think Jeff is suggesting that all the transports should use the same logic, but UDP and RDMA should simply have fixed upper limits on their slot table size. UDP would then behave the same as before, but would share code with the others. That might be cleaner than maintaining separate slot allocation mechanisms for each transport.

In other words, share the code, but parametrize it so that UDP and RDMA have effectively fixed slot tables as before, but TCP is allowed to expand.
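
Roughly this shape, just to illustrate (hypothetical names, nothing like
this exists in the tree):

#include <linux/sunrpc/xprt.h>

struct rpc_slot_policy {
	unsigned int	fixed_max;	/* non-zero for UDP and RDMA */
	unsigned int	(*dynamic_max)(struct rpc_xprt *);
					/* TCP: e.g. derived from the window */
};

static unsigned int xprt_max_slots(struct rpc_xprt *xprt,
				   const struct rpc_slot_policy *pol)
{
	if (pol->fixed_max)		/* UDP, RDMA: behave exactly as before */
		return pol->fixed_max;
	return pol->dynamic_max(xprt);	/* TCP: allowed to expand */
}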

My two pfennig's worth.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2011-04-12 16:26:13

by Trond Myklebust

Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Tue, 2011-04-12 at 09:17 -0700, Badari Pulavarty wrote:
> On Tue, 2011-04-12 at 11:49 -0400, Trond Myklebust wrote:
> > On Tue, 2011-04-12 at 08:32 -0700, Badari Pulavarty wrote:
> > > Hi,
> > >
> > > We recently ran into serious performance issue with NFS client.
> > > It turned out that its due to lack of readv/write support for
> > > NFS (O_DIRECT) client.
> > >
> > > Here is our use-case:
> > >
> > > In our cloud environment, our storage is over NFS. Files
> > > on NFS are passed as a blockdevices to the guest (using
> > > O_DIRECT). When guest is doing IO on these block devices,
> > > they will end up as O_DIRECT writes to NFS (on KVM host).
> > >
> > > QEMU (on the host) gets a vector from virtio-ring and
> > > submits them. Old versions of QEMU, linearized the vector
> > > it got from KVM (copied them into a buffer) and submits
> > > the buffer. So, NFS client always received a single buffer.
> > >
> > > Later versions of QEMU, eliminated this copy and submits
> > > a vector directly using preadv/pwritev().
> > >
> > > NFS client loops through the vector and submits each
> > > vector as separate request for each IO < wsize. In our
> > > case (negotiated wsize=1MB), for 256K IO - we get 64
> > > vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
> > > Server end up doing each 4K synchronously. This causes
> > > serious performance degrade. We are trying to see if the
> > > performance improves if we convert IOs to ASYNC - but
> > > our initial results doesn't look good.
> > >
> > > readv/writev support NFS client for all possible cases is
> > > hard. Instead, if all vectors are page-aligned and
> > > iosizes page-multiple - it fits the current code easily.
> > > Luckily, QEMU use-case fits these requirements.
> > >
> > > Here is the patch to add this support. Comments ?
> >
> > Your approach goes in the direction of further special-casing O_DIRECT
> > in the NFS client. I'd like to move away from that and towards
> > integration with the ordinary read/write codepaths so that aside from
> > adding request coalescing, we can also enable pNFS support.
> >
>
> I completely agree. But it's a major undertaking :(

Sure, but it is one that I'm working on. I'm just explaining why I'd
prefer not to include more stop-gap O_DIRECT patches at this point. We
can afford to wait for one more release cycle if it means fixing
O_DIRECT once and for all.

Cheers,
Trond


2011-04-12 17:43:51

by Badari Pulavarty

Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
>
> > On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
> >> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
> >>
> >>> Hi,
> >>>
> >>> We recently ran into serious performance issue with NFS client.
> >>> It turned out that its due to lack of readv/write support for
> >>> NFS (O_DIRECT) client.
> >>>
> >>> Here is our use-case:
> >>>
> >>> In our cloud environment, our storage is over NFS. Files
> >>> on NFS are passed as a blockdevices to the guest (using
> >>> O_DIRECT). When guest is doing IO on these block devices,
> >>> they will end up as O_DIRECT writes to NFS (on KVM host).
> >>>
> >>> QEMU (on the host) gets a vector from virtio-ring and
> >>> submits them. Old versions of QEMU, linearized the vector
> >>> it got from KVM (copied them into a buffer) and submits
> >>> the buffer. So, NFS client always received a single buffer.
> >>>
> >>> Later versions of QEMU, eliminated this copy and submits
> >>> a vector directly using preadv/pwritev().
> >>>
> >>> NFS client loops through the vector and submits each
> >>> vector as separate request for each IO < wsize. In our
> >>> case (negotiated wsize=1MB), for 256K IO - we get 64
> >>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
> >>> Server end up doing each 4K synchronously. This causes
> >>> serious performance degrade. We are trying to see if the
> >>> performance improves if we convert IOs to ASYNC - but
> >>> our initial results doesn't look good.
> >>>
> >>> readv/writev support NFS client for all possible cases is
> >>> hard. Instead, if all vectors are page-aligned and
> >>> iosizes page-multiple - it fits the current code easily.
> >>> Luckily, QEMU use-case fits these requirements.
> >>>
> >>> Here is the patch to add this support. Comments ?
> >>
> >> Restricting buffer alignment requirements would be an onerous API change, IMO.
> >
> > I am not suggesting an API change at all. All I am doing is, if all
> > the IOs are aligned - we can do a fast path as we can do in a single
> > IO request. (as if we got a single buffer). Otherwise, we do hard
> > way as today - loop through each one and do them individually.
>
> Thanks for the clarification. That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
>
> >> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed. I think Jeff Layton already has a patch that does this.
> >
> > We are trying that patch. It does improve the performance by little,
> > but not anywhere close to doing it as a single vector/buffer.
> >
> > Khoa, can you share your performance data for all the
> > suggestions/patches you tried so far ?
>
> The individual WRITEs should be submitted in parallel. If there is additional performance overhead, it is probably due to the small RPC slot table size. Have you tried increasing it?

We haven't tried both fixes together (increasing the RPC slot table and
converting the IOs to ASYNC). Each one individually didn't help much.
We will try them together.

Thanks,
Badari


2011-04-13 14:22:15

by Myklebust, Trond

Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> mempool with a minimum number of slots. Have them all be allocated with
> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> the waitqueue like it does today. Then, the clients can allocate
> rpc_rqst's as they need as long as memory holds out for it.
>
> We have the reserve_xprt stuff to handle congestion control anyway so I
> don't really see the value in the artificial limits that the slot table
> provides.
>
> Maybe I should hack up a patchset for this...

This issue has come up several times recently. My preference would be to
tie the availability of slots to the TCP window size, and basically say
that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
off allocating more slots until we get a ->write_space() callback which
clears that flag.
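
Hand-waving sketch of the check I have in mind (not against any real
tree, and the function name is made up):

#include <linux/net.h>
#include <linux/types.h>
#include <linux/sunrpc/xprt.h>

static bool xprt_may_add_slot(struct socket *sock)
{
	/*
	 * The socket layer sets SOCK_ASYNC_NOSPACE when a write would
	 * block; the ->write_space() callback clears it once the send
	 * buffer drains. While it is set, stop handing out new slots
	 * and queue the task instead.
	 */
	return !test_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
}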

For the RDMA case, we can continue to use the current system of a fixed
number of preallocated slots.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-04-12 16:45:06

by Chuck Lever III

Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client


On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:

> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
>>
>>> Hi,
>>>
>>> We recently ran into serious performance issue with NFS client.
>>> It turned out that its due to lack of readv/write support for
>>> NFS (O_DIRECT) client.
>>>
>>> Here is our use-case:
>>>
>>> In our cloud environment, our storage is over NFS. Files
>>> on NFS are passed as a blockdevices to the guest (using
>>> O_DIRECT). When guest is doing IO on these block devices,
>>> they will end up as O_DIRECT writes to NFS (on KVM host).
>>>
>>> QEMU (on the host) gets a vector from virtio-ring and
>>> submits them. Old versions of QEMU, linearized the vector
>>> it got from KVM (copied them into a buffer) and submits
>>> the buffer. So, NFS client always received a single buffer.
>>>
>>> Later versions of QEMU, eliminated this copy and submits
>>> a vector directly using preadv/pwritev().
>>>
>>> NFS client loops through the vector and submits each
>>> vector as separate request for each IO < wsize. In our
>>> case (negotiated wsize=1MB), for 256K IO - we get 64
>>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
>>> Server end up doing each 4K synchronously. This causes
>>> serious performance degrade. We are trying to see if the
>>> performance improves if we convert IOs to ASYNC - but
>>> our initial results doesn't look good.
>>>
>>> readv/writev support NFS client for all possible cases is
>>> hard. Instead, if all vectors are page-aligned and
>>> iosizes page-multiple - it fits the current code easily.
>>> Luckily, QEMU use-case fits these requirements.
>>>
>>> Here is the patch to add this support. Comments ?
>>
>> Restricting buffer alignment requirements would be an onerous API change, IMO.
>
> I am not suggesting an API change at all. All I am doing is, if all
> the IOs are aligned - we can do a fast path as we can do in a single
> IO request. (as if we got a single buffer). Otherwise, we do hard
> way as today - loop through each one and do them individually.

Thanks for the clarification. That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.

>> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed. I think Jeff Layton already has a patch that does this.
>
> We are trying that patch. It does improve the performance by little,
> but not anywhere close to doing it as a single vector/buffer.
>
> Khoa, can you share your performance data for all the
> suggestions/patches you tried so far ?

The individual WRITEs should be submitted in parallel. If there is additional performance overhead, it is probably due to the small RPC slot table size. Have you tried increasing it?

>
> Thanks,
> Badari
>
>
>>>
>>> Thanks,
>>> Badari
>>>
>>> Special readv/writev support for NFS O_DIRECT. Currently NFS
>>> client O_DIRECT read/write iterates over individual elements
>>> in the vector and submits the IO and waits for them to finish.
>>> If individual IOsize is < r/wsize, each IO will be FILE_SYNC.
>>> Server has to finish individual smaller IOs synchronous. This
>>> causes serious performance degrade.
>>>
>>> This patch is a special case to submit larger IOs from the client
>>> only if all the IO buffers are page-aligned and each individual
>>> IOs are page-multiple + total IO size is < r/wsize.
>>>
>>> This is the common case for QEMU usage.
>>>
>>> Signed-off-by: Badari Pulavarty <[email protected]>
>>> ---
>>> fs/nfs/direct.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 259 insertions(+)
>>>
>>> Index: linux-2.6.38.2/fs/nfs/direct.c
>>> ===================================================================
>>> --- linux-2.6.38.2.orig/fs/nfs/direct.c 2011-04-12 10:56:29.374266292 -0400
>>> +++ linux-2.6.38.2/fs/nfs/direct.c 2011-04-12 11:01:21.883266310 -0400
>>> @@ -271,6 +271,52 @@ static const struct rpc_call_ops nfs_rea
>>> .rpc_release = nfs_direct_read_release,
>>> };
>>>
>>> +static int nfs_direct_check_single_io(struct nfs_direct_req *dreq,
>>> + const struct iovec *iov,
>>> + unsigned long nr_segs, int rw)
>>> +{
>>> + unsigned long seg;
>>> + int pages = 0;
>>> + struct nfs_open_context *ctx = dreq->ctx;
>>> + struct inode *inode = ctx->path.dentry->d_inode;
>>> + size_t size;
>>> +
>>> +
>>> + /* If its a single vector - use old code */
>>> + if (nr_segs == 1)
>>> + return pages;
>>> + /*
>>> + * Check to see if all IO buffers are page-aligned and
>>> + * individual IO sizes are page-multiple.
>>> + * If so, we can submit them in a single IO.
>>> + * Otherwise, use old code
>>> + */
>>> + for (seg = 0; seg < nr_segs; seg++) {
>>> + unsigned long user_addr = (unsigned long)iov[seg].iov_base;
>>> +
>>> + if ((user_addr & (PAGE_SIZE - 1)) ||
>>> + (iov[seg].iov_len & (PAGE_SIZE - 1))) {
>>> + pages = 0;
>>> + break;
>>> + }
>>> + pages += (iov[seg].iov_len >> PAGE_SHIFT);
>>> + }
>>> +
>>> + if (rw)
>>> + size = NFS_SERVER(inode)->wsize;
>>> + else
>>> + size = NFS_SERVER(inode)->rsize;
>>> +
>>> + /*
>>> + * If IOsize > wsize/rsize, we need to split IOs into r/wsize anyway
>>> + * - fall back to old code.
>>> + */
>>> + if (pages > (size >> PAGE_SHIFT))
>>> + pages = 0;
>>> +
>>> + return pages;
>>> +}
>>> +
>>> /*
>>> * For each rsize'd chunk of the user's buffer, dispatch an NFS READ
>>> * operation. If nfs_readdata_alloc() or get_user_pages() fails,
>>> @@ -385,6 +431,101 @@ static ssize_t nfs_direct_read_schedule_
>>> return result < 0 ? (ssize_t) result : -EFAULT;
>>> }
>>>
>>> +/*
>>> + * Special Case: Efficient readv() support - only if all the IO buffers
>>> + * in the vector are page-aligned and IO sizes are page-multiple
>>> + * + total IO size is < rsize. We can map all of the vectors together
>>> + * and submit them in a single IO instead of operating on single entry
>>> + * in the vector.
>>> + */
>>> +static ssize_t nfs_direct_read_schedule_single_io(struct nfs_direct_req *dreq,
>>> + const struct iovec *iov,
>>> + unsigned long nr_segs,
>>> + int pages,
>>> + loff_t pos)
>>> +{
>>> + struct nfs_open_context *ctx = dreq->ctx;
>>> + struct inode *inode = ctx->path.dentry->d_inode;
>>> + unsigned long user_addr = (unsigned long)iov->iov_base;
>>> + size_t bytes = pages << PAGE_SHIFT;
>>> + struct rpc_task *task;
>>> + struct rpc_message msg = {
>>> + .rpc_cred = ctx->cred,
>>> + };
>>> + struct rpc_task_setup task_setup_data = {
>>> + .rpc_client = NFS_CLIENT(inode),
>>> + .rpc_message = &msg,
>>> + .callback_ops = &nfs_read_direct_ops,
>>> + .workqueue = nfsiod_workqueue,
>>> + .flags = RPC_TASK_ASYNC,
>>> + };
>>> + unsigned int pgbase;
>>> + int result;
>>> + int seg;
>>> + int mapped = 0;
>>> + int nr_pages;
>>> + ssize_t started = 0;
>>> + struct nfs_read_data *data;
>>> +
>>> + result = -ENOMEM;
>>> + data = nfs_readdata_alloc(pages);
>>> + if (unlikely(!data))
>>> + return result;
>>> +
>>> + pgbase = user_addr & ~PAGE_MASK;
>>> +
>>> + for (seg = 0; seg < nr_segs; seg++) {
>>> + user_addr = (unsigned long)iov[seg].iov_base;
>>> + nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
>>> +
>>> + down_read(&current->mm->mmap_sem);
>>> + result = get_user_pages(current, current->mm, user_addr,
>>> + nr_pages, 1, 0, &data->pagevec[mapped],
>>> + NULL);
>>> + up_read(&current->mm->mmap_sem);
>>> + if (result < 0) {
>>> + /* unmap what is done so far */
>>> + nfs_direct_release_pages(data->pagevec, mapped);
>>> + nfs_readdata_free(data);
>>> + return result;
>>> + }
>>> + mapped += result;
>>> + }
>>> +
>>> + get_dreq(dreq);
>>> +
>>> + data->req = (struct nfs_page *) dreq;
>>> + data->inode = inode;
>>> + data->cred = msg.rpc_cred;
>>> + data->args.fh = NFS_FH(inode);
>>> + data->args.context = ctx;
>>> + data->args.offset = pos;
>>> + data->args.pgbase = pgbase;
>>> + data->args.pages = data->pagevec;
>>> + data->args.count = bytes;
>>> + data->res.fattr = &data->fattr;
>>> + data->res.eof = 0;
>>> + data->res.count = bytes;
>>> + nfs_fattr_init(&data->fattr);
>>> +
>>> + task_setup_data.task = &data->task;
>>> + task_setup_data.callback_data = data;
>>> + msg.rpc_argp = &data->args;
>>> + msg.rpc_resp = &data->res;
>>> + NFS_PROTO(inode)->read_setup(data, &msg);
>>> +
>>> + task = rpc_run_task(&task_setup_data);
>>> + if (IS_ERR(task))
>>> + goto out;
>>> + rpc_put_task(task);
>>> +
>>> + started += bytes;
>>> +out:
>>> + if (started)
>>> + return started;
>>> + return result < 0 ? (ssize_t) result : -EFAULT;
>>> +}
>>> +
>>> static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
>>> const struct iovec *iov,
>>> unsigned long nr_segs,
>>> @@ -396,6 +537,16 @@ static ssize_t nfs_direct_read_schedule_
>>>
>>> get_dreq(dreq);
>>>
>>> + result = nfs_direct_check_single_io(dreq, iov, nr_segs, 0);
>>> + if (result) {
>>> + result = nfs_direct_read_schedule_single_io(dreq, iov, nr_segs,
>>> + result, pos);
>>> + if (result < 0)
>>> + goto out;
>>> + requested_bytes += result;
>>> + pos += result;
>>> + goto out;
>>> + }
>>> for (seg = 0; seg < nr_segs; seg++) {
>>> const struct iovec *vec = &iov[seg];
>>> result = nfs_direct_read_schedule_segment(dreq, vec, pos);
>>> @@ -416,6 +567,7 @@ static ssize_t nfs_direct_read_schedule_
>>> return result < 0 ? result : -EIO;
>>> }
>>>
>>> +out:
>>> if (put_dreq(dreq))
>>> nfs_direct_complete(dreq);
>>> return 0;
>>> @@ -821,6 +973,102 @@ static ssize_t nfs_direct_write_schedule
>>> return result < 0 ? (ssize_t) result : -EFAULT;
>>> }
>>>
>>> +/*
>>> + * Special Case: Efficient writev() support - only if all the IO buffers
>>> + * in the vector are page-aligned and IO sizes are page-multiple
>>> + * + total IO size is < wsize. We can map all of the vectors together
>>> + * and submit them in a single IO instead of operating on single entry
>>> + * in the vector.
>>> + */
>>> +static ssize_t nfs_direct_write_schedule_single_io(struct nfs_direct_req *dreq,
>>> + const struct iovec *iov,
>>> + unsigned long nr_segs,
>>> + int pages,
>>> + loff_t pos, int sync)
>>> +{
>>> + struct nfs_open_context *ctx = dreq->ctx;
>>> + struct inode *inode = ctx->path.dentry->d_inode;
>>> + unsigned long user_addr = (unsigned long)iov->iov_base;
>>> + size_t bytes = pages << PAGE_SHIFT;
>>> + struct rpc_task *task;
>>> + struct rpc_message msg = {
>>> + .rpc_cred = ctx->cred,
>>> + };
>>> + struct rpc_task_setup task_setup_data = {
>>> + .rpc_client = NFS_CLIENT(inode),
>>> + .rpc_message = &msg,
>>> + .callback_ops = &nfs_write_direct_ops,
>>> + .workqueue = nfsiod_workqueue,
>>> + .flags = RPC_TASK_ASYNC,
>>> + };
>>> + unsigned int pgbase;
>>> + int result;
>>> + int seg;
>>> + int mapped = 0;
>>> + int nr_pages;
>>> + ssize_t started = 0;
>>> + struct nfs_write_data *data;
>>> +
>>> + result = -ENOMEM;
>>> + data = nfs_writedata_alloc(pages);
>>> + if (unlikely(!data))
>>> + return result;
>>> +
>>> + pgbase = user_addr & ~PAGE_MASK;
>>> +
>>> + for (seg = 0; seg < nr_segs; seg++) {
>>> + user_addr = (unsigned long)iov[seg].iov_base;
>>> + nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
>>> +
>>> + down_read(&current->mm->mmap_sem);
>>> + result = get_user_pages(current, current->mm, user_addr,
>>> + nr_pages, 0, 0, &data->pagevec[mapped], NULL);
>>> + up_read(&current->mm->mmap_sem);
>>> + if (result < 0) {
>>> + /* unmap what is done so far */
>>> + nfs_direct_release_pages(data->pagevec, mapped);
>>> + nfs_writedata_free(data);
>>> + return result;
>>> + }
>>> + mapped += result;
>>> + }
>>> +
>>> + get_dreq(dreq);
>>> + list_move_tail(&data->pages, &dreq->rewrite_list);
>>> +
>>> + data->req = (struct nfs_page *) dreq;
>>> + data->inode = inode;
>>> + data->cred = msg.rpc_cred;
>>> + data->args.fh = NFS_FH(inode);
>>> + data->args.context = ctx;
>>> + data->args.offset = pos;
>>> + data->args.pgbase = pgbase;
>>> + data->args.pages = data->pagevec;
>>> + data->args.count = bytes;
>>> + data->args.stable = sync;
>>> + data->res.fattr = &data->fattr;
>>> + data->res.count = bytes;
>>> + data->res.verf = &data->verf;
>>> + nfs_fattr_init(&data->fattr);
>>> +
>>> + task_setup_data.task = &data->task;
>>> + task_setup_data.callback_data = data;
>>> + msg.rpc_argp = &data->args;
>>> + msg.rpc_resp = &data->res;
>>> + NFS_PROTO(inode)->write_setup(data, &msg);
>>> +
>>> + task = rpc_run_task(&task_setup_data);
>>> + if (IS_ERR(task))
>>> + goto out;
>>> + rpc_put_task(task);
>>> +
>>> + started += bytes;
>>> +out:
>>> + if (started)
>>> + return started;
>>> + return result < 0 ? (ssize_t) result : -EFAULT;
>>> +}
>>> +
>>> static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
>>> const struct iovec *iov,
>>> unsigned long nr_segs,
>>> @@ -832,6 +1080,16 @@ static ssize_t nfs_direct_write_schedule
>>>
>>> get_dreq(dreq);
>>>
>>> + result = nfs_direct_check_single_io(dreq, iov, nr_segs, 1);
>>> + if (result) {
>>> + result = nfs_direct_write_schedule_single_io(dreq, iov, nr_segs,
>>> + result, pos, sync);
>>> + if (result < 0)
>>> + goto out;
>>> + requested_bytes += result;
>>> + pos += result;
>>> + goto out;
>>> + }
>>> for (seg = 0; seg < nr_segs; seg++) {
>>> const struct iovec *vec = &iov[seg];
>>> result = nfs_direct_write_schedule_segment(dreq, vec,
>>> @@ -853,6 +1111,7 @@ static ssize_t nfs_direct_write_schedule
>>> return result < 0 ? result : -EIO;
>>> }
>>>
>>> +out:
>>> if (put_dreq(dreq))
>>> nfs_direct_write_complete(dreq, dreq->inode);
>>> return 0;
>>>
>>>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-04-12 15:49:37

by Trond Myklebust

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Tue, 2011-04-12 at 08:32 -0700, Badari Pulavarty wrote:
> Hi,
>
> We recently ran into serious performance issue with NFS client.
> It turned out that its due to lack of readv/write support for
> NFS (O_DIRECT) client.
>
> Here is our use-case:
>
> In our cloud environment, our storage is over NFS. Files
> on NFS are passed as a blockdevices to the guest (using
> O_DIRECT). When guest is doing IO on these block devices,
> they will end up as O_DIRECT writes to NFS (on KVM host).
>
> QEMU (on the host) gets a vector from virtio-ring and
> submits them. Old versions of QEMU, linearized the vector
> it got from KVM (copied them into a buffer) and submits
> the buffer. So, NFS client always received a single buffer.
>
> Later versions of QEMU, eliminated this copy and submits
> a vector directly using preadv/pwritev().
>
> NFS client loops through the vector and submits each
> vector as separate request for each IO < wsize. In our
> case (negotiated wsize=1MB), for 256K IO - we get 64
> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
> Server end up doing each 4K synchronously. This causes
> serious performance degrade. We are trying to see if the
> performance improves if we convert IOs to ASYNC - but
> our initial results doesn't look good.
>
> readv/writev support NFS client for all possible cases is
> hard. Instead, if all vectors are page-aligned and
> iosizes page-multiple - it fits the current code easily.
> Luckily, QEMU use-case fits these requirements.
>
> Here is the patch to add this support. Comments ?

Your approach goes in the direction of further special-casing O_DIRECT
in the NFS client. I'd like to move away from that and towards
integration with the ordinary read/write codepaths so that aside from
adding request coalescing, we can also enable pNFS support.

Cheers
Trond


2011-04-15 18:00:43

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Fri, 2011-04-15 at 13:33 -0400, Christoph Hellwig wrote:
> On Tue, Apr 12, 2011 at 11:49:29AM -0400, Trond Myklebust wrote:
> > Your approach goes in the direction of further special-casing O_DIRECT
> > in the NFS client. I'd like to move away from that and towards
> > integration with the ordinary read/write codepaths so that aside from
> > adding request coalescing, we can also enable pNFS support.
>
> What is the exact plan? Split the direct I/O into two passes, one
> to lock down the user pages and then a second one to send the pages
> over the wire, which is shared with the writeback code? If that's
> the case it should naturally allow plugging in a scheme like Badari's
> to send pages from different iovecs in a single on-the-wire request -
> after all, page cache pages are non-contiguous in virtual and physical
> memory, too.

You can't lock the user pages unfortunately: they may need to be faulted
in.

What I was thinking of doing is splitting out the code in the RPC
callbacks that plays around with page flags and puts the pages on the
inode's dirty list so that they don't get called in the case of
O_DIRECT.
I then want to attach the O_DIRECT pages to the nfsi->nfs_page_tree
radix tree so that they can be tracked by the NFS layer. I'm assuming
that nobody is going to be silly enough to require simultaneous writes
via O_DIRECT to the same locations.
Then we can feed the O_DIRECT pages into nfs_page_async_flush() so that
they share the existing page cache write coalescing and pnfs code.

The commit code will be easy to reuse too, since the requests are listed
in the radix tree and so nfs_scan_list() can find and process them in
the usual fashion.

The main problem that I have yet to figure out is what to do if the
server flags a reboot and the requests need to be resent. One option I'm
looking into is using the aio 'kick handler' to resubmit the writes.
Another may be to just resend directly from the nfsiod work queue.

> When do you plan to release your read/write code re-write? If it's
> not anytime soon how is applying Badari's patch going to hurt? Most
> of it probably will get reverted with a complete rewrite, but at least
> the logic to check which direct I/O iovecs can be coalesced would stay
> in the new world order.

I'm hoping that I can do the rewrite fairly quickly once the resend
problem is solved. It shouldn't be more than a couple of weeks of
coding.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-04-13 17:36:00

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Wed, 2011-04-13 at 13:20 -0400, Jeff Layton wrote:
> On Wed, 13 Apr 2011 10:22:13 -0400
> Trond Myklebust <[email protected]> wrote:
>
> > On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> > > We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> > > mempool with a minimum number of slots. Have them all be allocated with
> > > GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> > > the waitqueue like it does today. Then, the clients can allocate
> > > rpc_rqst's as they need as long as memory holds out for it.
> > >
> > > We have the reserve_xprt stuff to handle congestion control anyway so I
> > > don't really see the value in the artificial limits that the slot table
> > > provides.
> > >
> > > Maybe I should hack up a patchset for this...
> >
> > This issue has come up several times recently. My preference would be to
> > tie the availability of slots to the TCP window size, and basically say
> > that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> > off allocating more slots until we get a ->write_space() callback which
> > clears that flag.
> >
> > For the RDMA case, we can continue to use the current system of a fixed
> > number of preallocated slots.
> >
>
> I take it then that we'd want a similar scheme for UDP as well? I guess
> I'm just not sure what the slot table is supposed to be for.

No. I don't see UDP as having much of a future in 10GigE and other high
bandwidth environments (or even in 1GigE setups). Let's just leave it as
it is...

> Possibly naive question, and maybe you or Andy have scoped this out
> already...
>
> Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
> needed, and manage congestion control in reserve_xprt ? It appears that
> that at least is what xprt_reserve_xprt_cong is supposed to do. The TCP
> variant (xprt_reserve_xprt) doesn't do that currently, but we could do
> it there and that would seem to make for more parity between the TCP
> and UDP in this sense.
>
> We could do that similarly for RDMA too. Simply keep track of how many
> RPCs are in flight and only allow reserving the xprt when that number
> hasn't crossed the max number of slots...

What is the point of allocating a lot of resources when you lack the
bandwidth to do anything with them?

The reason for tying this to the TCP window size is to try to queue up
as much data as we can possibly transmit, without eating too much out of
the same GFP_ATOMIC pool of memory that the networking layer also uses.
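
A minimal sketch of the kind of gating I have in mind (illustrative only;
xprt_alloc_slot_dynamic and rpc_slot_cachep are made-up names, and locking
and error handling are omitted):

	#include <linux/net.h>
	#include <linux/slab.h>
	#include <linux/sunrpc/xprt.h>

	/* hypothetical slab cache for dynamically allocated rpc_rqsts */
	static struct kmem_cache *rpc_slot_cachep;

	/*
	 * Only grow the slot table while the transport socket still has
	 * send buffer space. If SOCK_ASYNC_NOSPACE is set, the caller
	 * sleeps until the ->write_space() callback clears it, much like
	 * waiting on the backlog queue today.
	 */
	static struct rpc_rqst *xprt_alloc_slot_dynamic(struct rpc_xprt *xprt,
							struct socket *sock)
	{
		struct rpc_rqst *req;

		if (test_bit(SOCK_ASYNC_NOSPACE, &sock->flags))
			return NULL;

		/* GFP_NOWAIT: never block, never dip into emergency pools */
		req = kmem_cache_zalloc(rpc_slot_cachep, GFP_NOWAIT);
		if (!req)
			return NULL;

		req->rq_xprt = xprt;
		return req;
	}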

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-04-13 17:56:40

by Andy Adamson

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client


On Apr 13, 2011, at 1:20 PM, Jeff Layton wrote:

> On Wed, 13 Apr 2011 10:22:13 -0400
> Trond Myklebust <[email protected]> wrote:
>
>> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
>>> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
>>> mempool with a minimum number of slots. Have them all be allocated with
>>> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
>>> the waitqueue like it does today. Then, the clients can allocate
>>> rpc_rqst's as they need as long as memory holds out for it.
>>>
>>> We have the reserve_xprt stuff to handle congestion control anyway so I
>>> don't really see the value in the artificial limits that the slot table
>>> provides.
>>>
>>> Maybe I should hack up a patchset for this...
>>
>> This issue has come up several times recently. My preference would be to
>> tie the availability of slots to the TCP window size, and basically say
>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>> off allocating more slots until we get a ->write_space() callback which
>> clears that flag.
>>
>> For the RDMA case, we can continue to use the current system of a fixed
>> number of preallocated slots.
>>
>
> I take it then that we'd want a similar scheme for UDP as well? I guess
> I'm just not sure what the slot table is supposed to be for.

[andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
can handle - basically the #slots should be double the bandwidth-delay product divided by max(rsize, wsize).
For TCP, the bandwidth-delay product is the window size (RTT of a max-MTU packet * interface bandwidth).
There is no reason to allocate more rpc_rqsts than can fit on the wire.
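
As a back-of-the-envelope illustration of that formula (the numbers and the
helper below are made up for illustration; 16 is the usual minimum slot count):

	#include <stdio.h>

	static unsigned long suggested_slots(double bandwidth_Bps, double rtt_sec,
					     unsigned long max_rw_size,
					     unsigned long min_slots)
	{
		double bdp = bandwidth_Bps * rtt_sec;	/* bytes in flight */
		unsigned long slots = (unsigned long)(2.0 * bdp / max_rw_size + 0.5);

		return slots > min_slots ? slots : min_slots;
	}

	int main(void)
	{
		/* 10GigE (~1.25e9 B/s), 1 ms RTT, wsize = 1MB, floor of 16 slots */
		printf("%lu slots\n", suggested_slots(1.25e9, 0.001, 1 << 20, 16));
		return 0;
	}

With these numbers the formula itself gives only ~2 requests in flight, so the
16-slot floor dominates; small r/wsize values or long RTTs are where the
maximum needs to grow.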

>
> Possibly naive question, and maybe you or Andy have scoped this out
> already...
>
> Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
> needed, and manage congestion control in reserve_xprt ?

[andros] Congestion control is not what the rpc_slot table is managing. It does need to have
a minimum which experience has set at 16. It's the maximum that needs to be dynamic.
Congestion control by the lower layers should work unfettered within the # of rpc_slots. Today that
is not always the case when 16 slots is not enough to fill the wire, and the administrator has
not changed the # of rpc_slots.

> It appears that
> that at least is what xprt_reserve_xprt_cong is supposed to do. The TCP
> variant (xprt_reserve_xprt) doesn't do that currently, but we could do
> it there and that would seem to make for more parity between the TCP
> and UDP in this sense.
>
> We could do that similarly for RDMA too. Simply keep track of how many
> RPCs are in flight and only allow reserving the xprt when that number
> hasn't crossed the max number of slots...


>
> --
> Jeff Layton <[email protected]>


2011-04-13 17:20:30

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Wed, 13 Apr 2011 10:22:13 -0400
Trond Myklebust <[email protected]> wrote:

> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
> > We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
> > mempool with a minimum number of slots. Have them all be allocated with
> > GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
> > the waitqueue like it does today. Then, the clients can allocate
> > rpc_rqst's as they need as long as memory holds out for it.
> >
> > We have the reserve_xprt stuff to handle congestion control anyway so I
> > don't really see the value in the artificial limits that the slot table
> > provides.
> >
> > Maybe I should hack up a patchset for this...
>
> This issue has come up several times recently. My preference would be to
> tie the availability of slots to the TCP window size, and basically say
> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> off allocating more slots until we get a ->write_space() callback which
> clears that flag.
>
> For the RDMA case, we can continue to use the current system of a fixed
> number of preallocated slots.
>

I take it then that we'd want a similar scheme for UDP as well? I guess
I'm just not sure what the slot table is supposed to be for.

Possibly naive question, and maybe you or Andy have scoped this out
already...

Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
needed, and manage congestion control in reserve_xprt ? It appears that
that at least is what xprt_reserve_xprt_cong is supposed to do. The TCP
variant (xprt_reserve_xprt) doesn't do that currently, but we could do
it there and that would seem to make for more parity between the TCP
and UDP in this sense.

We could do that similarly for RDMA too. Simply keep track of how many
RPCs are in flight and only allow reserving the xprt when that number
hasn't crossed the max number of slots...

--
Jeff Layton <[email protected]>

2011-04-13 14:27:31

by Andy Adamson

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client


On Apr 13, 2011, at 10:22 AM, Trond Myklebust wrote:

> On Wed, 2011-04-13 at 10:02 -0400, Jeff Layton wrote:
>> We could put the rpc_rqst's into a slabcache, and give each rpc_xprt a
>> mempool with a minimum number of slots. Have them all be allocated with
>> GFP_NOWAIT. If it gets a NULL pointer back, then the task can sleep on
>> the waitqueue like it does today. Then, the clients can allocate
>> rpc_rqst's as they need as long as memory holds out for it.
>>
>> We have the reserve_xprt stuff to handle congestion control anyway so I
>> don't really see the value in the artificial limits that the slot table
>> provides.
>>
>> Maybe I should hack up a patchset for this...
>
> This issue has come up several times recently. My preference would be to
> tie the availability of slots to the TCP window size, and basically say
> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> off allocating more slots until we get a ->write_space() callback which
> clears that flag.

I am scoping the dynamic rpc_slot allocation work and plan to prototype.

-->Andy

>
> For the RDMA case, we can continue to use the current system of a fixed
> number of preallocated slots.
>
> Cheers
> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>


2011-04-14 00:43:07

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Wed, 2011-04-13 at 17:21 -0700, Dean wrote:
> >>> This issue has come up several times recently. My preference would be to
> >>> tie the availability of slots to the TCP window size, and basically say
> >>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
> >>> off allocating more slots until we get a ->write_space() callback which
> >>> clears that flag.
> >>>
> >>> For the RDMA case, we can continue to use the current system of a fixed
> >>> number of preallocated slots.
> >>>
> >> I take it then that we'd want a similar scheme for UDP as well? I guess
> >> I'm just not sure what the slot table is supposed to be for.
> > [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
> > can handle - basically the #slots should be double the bandwidth-delay product divided by max(rsize, wsize).
> > For TCP, the bandwidth-delay product is the window size (RTT of a max-MTU packet * interface bandwidth).
> > There is no reason to allocate more rpc_rqsts than can fit on the wire.
> I agree with checking for space on the link.
>
> The above formula is a good lower bound on the maximum number of slots,
> but there are many times when a client could use more slots than the
> above formula. For example, we don't want to punish writes if rsize >
> wsize. Also, you have to account for the server memory, which can
> sometimes hold several write requests while waiting for them to be
> sync'd to disk, leaving the TCP buffers less than full.

Err... No... On the contrary, it is a good _upper_ bound on the number
of slots. There is no point in allocating a slot for an RPC request
which you know you have no ability to transmit. That has nothing to do
with rsize or wsize values: if the socket is backed up, it won't take
more data.

Trond

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-04-12 15:36:27

by Chuck Lever III

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client


On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:

> Hi,
>
> We recently ran into serious performance issue with NFS client.
> It turned out that its due to lack of readv/write support for
> NFS (O_DIRECT) client.
>
> Here is our use-case:
>
> In our cloud environment, our storage is over NFS. Files
> on NFS are passed as a blockdevices to the guest (using
> O_DIRECT). When guest is doing IO on these block devices,
> they will end up as O_DIRECT writes to NFS (on KVM host).
>
> QEMU (on the host) gets a vector from virtio-ring and
> submits them. Old versions of QEMU, linearized the vector
> it got from KVM (copied them into a buffer) and submits
> the buffer. So, NFS client always received a single buffer.
>
> Later versions of QEMU, eliminated this copy and submits
> a vector directly using preadv/pwritev().
>
> NFS client loops through the vector and submits each
> vector as separate request for each IO < wsize. In our
> case (negotiated wsize=1MB), for 256K IO - we get 64
> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
> Server end up doing each 4K synchronously. This causes
> serious performance degrade. We are trying to see if the
> performance improves if we convert IOs to ASYNC - but
> our initial results doesn't look good.
>
> readv/writev support NFS client for all possible cases is
> hard. Instead, if all vectors are page-aligned and
> iosizes page-multiple - it fits the current code easily.
> Luckily, QEMU use-case fits these requirements.
>
> Here is the patch to add this support. Comments ?

Restricting buffer alignment requirements would be an onerous API change, IMO.

If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed. I think Jeff Layton already has a patch that does this.

>
> Thanks,
> Badari
>
> Special readv/writev support for NFS O_DIRECT. Currently NFS
> client O_DIRECT read/write iterates over individual elements
> in the vector and submits the IO and waits for them to finish.
> If individual IOsize is < r/wsize, each IO will be FILE_SYNC.
> Server has to finish individual smaller IOs synchronous. This
> causes serious performance degrade.
>
> This patch is a special case to submit larger IOs from the client
> only if all the IO buffers are page-aligned and each individual
> IOs are page-multiple + total IO size is < r/wsize.
>
> This is the common case for QEMU usage.
>
> Signed-off-by: Badari Pulavarty <[email protected]>
> ---
> fs/nfs/direct.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 259 insertions(+)
>
> Index: linux-2.6.38.2/fs/nfs/direct.c
> ===================================================================
> --- linux-2.6.38.2.orig/fs/nfs/direct.c 2011-04-12 10:56:29.374266292 -0400
> +++ linux-2.6.38.2/fs/nfs/direct.c 2011-04-12 11:01:21.883266310 -0400
> @@ -271,6 +271,52 @@ static const struct rpc_call_ops nfs_rea
> .rpc_release = nfs_direct_read_release,
> };
>
> +static int nfs_direct_check_single_io(struct nfs_direct_req *dreq,
> + const struct iovec *iov,
> + unsigned long nr_segs, int rw)
> +{
> + unsigned long seg;
> + int pages = 0;
> + struct nfs_open_context *ctx = dreq->ctx;
> + struct inode *inode = ctx->path.dentry->d_inode;
> + size_t size;
> +
> +
> + /* If its a single vector - use old code */
> + if (nr_segs == 1)
> + return pages;
> + /*
> + * Check to see if all IO buffers are page-aligned and
> + * individual IO sizes are page-multiple.
> + * If so, we can submit them in a single IO.
> + * Otherwise, use old code
> + */
> + for (seg = 0; seg < nr_segs; seg++) {
> + unsigned long user_addr = (unsigned long)iov[seg].iov_base;
> +
> + if ((user_addr & (PAGE_SIZE - 1)) ||
> + (iov[seg].iov_len & (PAGE_SIZE - 1))) {
> + pages = 0;
> + break;
> + }
> + pages += (iov[seg].iov_len >> PAGE_SHIFT);
> + }
> +
> + if (rw)
> + size = NFS_SERVER(inode)->wsize;
> + else
> + size = NFS_SERVER(inode)->rsize;
> +
> + /*
> + * If IOsize > wsize/rsize, we need to split IOs into r/wsize anyway
> + * - fall back to old code.
> + */
> + if (pages > (size >> PAGE_SHIFT))
> + pages = 0;
> +
> + return pages;
> +}
> +
> /*
> * For each rsize'd chunk of the user's buffer, dispatch an NFS READ
> * operation. If nfs_readdata_alloc() or get_user_pages() fails,
> @@ -385,6 +431,101 @@ static ssize_t nfs_direct_read_schedule_
> return result < 0 ? (ssize_t) result : -EFAULT;
> }
>
> +/*
> + * Special Case: Efficient readv() support - only if all the IO buffers
> + * in the vector are page-aligned and IO sizes are page-multiple
> + * + total IO size is < rsize. We can map all of the vectors together
> + * and submit them in a single IO instead of operating on single entry
> + * in the vector.
> + */
> +static ssize_t nfs_direct_read_schedule_single_io(struct nfs_direct_req *dreq,
> + const struct iovec *iov,
> + unsigned long nr_segs,
> + int pages,
> + loff_t pos)
> +{
> + struct nfs_open_context *ctx = dreq->ctx;
> + struct inode *inode = ctx->path.dentry->d_inode;
> + unsigned long user_addr = (unsigned long)iov->iov_base;
> + size_t bytes = pages << PAGE_SHIFT;
> + struct rpc_task *task;
> + struct rpc_message msg = {
> + .rpc_cred = ctx->cred,
> + };
> + struct rpc_task_setup task_setup_data = {
> + .rpc_client = NFS_CLIENT(inode),
> + .rpc_message = &msg,
> + .callback_ops = &nfs_read_direct_ops,
> + .workqueue = nfsiod_workqueue,
> + .flags = RPC_TASK_ASYNC,
> + };
> + unsigned int pgbase;
> + int result;
> + int seg;
> + int mapped = 0;
> + int nr_pages;
> + ssize_t started = 0;
> + struct nfs_read_data *data;
> +
> + result = -ENOMEM;
> + data = nfs_readdata_alloc(pages);
> + if (unlikely(!data))
> + return result;
> +
> + pgbase = user_addr & ~PAGE_MASK;
> +
> + for (seg = 0; seg < nr_segs; seg++) {
> + user_addr = (unsigned long)iov[seg].iov_base;
> + nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
> +
> + down_read(&current->mm->mmap_sem);
> + result = get_user_pages(current, current->mm, user_addr,
> + nr_pages, 1, 0, &data->pagevec[mapped],
> + NULL);
> + up_read(&current->mm->mmap_sem);
> + if (result < 0) {
> + /* unmap what is done so far */
> + nfs_direct_release_pages(data->pagevec, mapped);
> + nfs_readdata_free(data);
> + return result;
> + }
> + mapped += result;
> + }
> +
> + get_dreq(dreq);
> +
> + data->req = (struct nfs_page *) dreq;
> + data->inode = inode;
> + data->cred = msg.rpc_cred;
> + data->args.fh = NFS_FH(inode);
> + data->args.context = ctx;
> + data->args.offset = pos;
> + data->args.pgbase = pgbase;
> + data->args.pages = data->pagevec;
> + data->args.count = bytes;
> + data->res.fattr = &data->fattr;
> + data->res.eof = 0;
> + data->res.count = bytes;
> + nfs_fattr_init(&data->fattr);
> +
> + task_setup_data.task = &data->task;
> + task_setup_data.callback_data = data;
> + msg.rpc_argp = &data->args;
> + msg.rpc_resp = &data->res;
> + NFS_PROTO(inode)->read_setup(data, &msg);
> +
> + task = rpc_run_task(&task_setup_data);
> + if (IS_ERR(task))
> + goto out;
> + rpc_put_task(task);
> +
> + started += bytes;
> +out:
> + if (started)
> + return started;
> + return result < 0 ? (ssize_t) result : -EFAULT;
> +}
> +
> static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
> const struct iovec *iov,
> unsigned long nr_segs,
> @@ -396,6 +537,16 @@ static ssize_t nfs_direct_read_schedule_
>
> get_dreq(dreq);
>
> + result = nfs_direct_check_single_io(dreq, iov, nr_segs, 0);
> + if (result) {
> + result = nfs_direct_read_schedule_single_io(dreq, iov, nr_segs,
> + result, pos);
> + if (result < 0)
> + goto out;
> + requested_bytes += result;
> + pos += result;
> + goto out;
> + }
> for (seg = 0; seg < nr_segs; seg++) {
> const struct iovec *vec = &iov[seg];
> result = nfs_direct_read_schedule_segment(dreq, vec, pos);
> @@ -416,6 +567,7 @@ static ssize_t nfs_direct_read_schedule_
> return result < 0 ? result : -EIO;
> }
>
> +out:
> if (put_dreq(dreq))
> nfs_direct_complete(dreq);
> return 0;
> @@ -821,6 +973,102 @@ static ssize_t nfs_direct_write_schedule
> return result < 0 ? (ssize_t) result : -EFAULT;
> }
>
> +/*
> + * Special Case: Efficient writev() support - only if all the IO buffers
> + * in the vector are page-aligned and IO sizes are page-multiple
> + * + total IO size is < wsize. We can map all of the vectors together
> + * and submit them in a single IO instead of operating on single entry
> + * in the vector.
> + */
> +static ssize_t nfs_direct_write_schedule_single_io(struct nfs_direct_req *dreq,
> + const struct iovec *iov,
> + unsigned long nr_segs,
> + int pages,
> + loff_t pos, int sync)
> +{
> + struct nfs_open_context *ctx = dreq->ctx;
> + struct inode *inode = ctx->path.dentry->d_inode;
> + unsigned long user_addr = (unsigned long)iov->iov_base;
> + size_t bytes = pages << PAGE_SHIFT;
> + struct rpc_task *task;
> + struct rpc_message msg = {
> + .rpc_cred = ctx->cred,
> + };
> + struct rpc_task_setup task_setup_data = {
> + .rpc_client = NFS_CLIENT(inode),
> + .rpc_message = &msg,
> + .callback_ops = &nfs_write_direct_ops,
> + .workqueue = nfsiod_workqueue,
> + .flags = RPC_TASK_ASYNC,
> + };
> + unsigned int pgbase;
> + int result;
> + int seg;
> + int mapped = 0;
> + int nr_pages;
> + ssize_t started = 0;
> + struct nfs_write_data *data;
> +
> + result = -ENOMEM;
> + data = nfs_writedata_alloc(pages);
> + if (unlikely(!data))
> + return result;
> +
> + pgbase = user_addr & ~PAGE_MASK;
> +
> + for (seg = 0; seg < nr_segs; seg++) {
> + user_addr = (unsigned long)iov[seg].iov_base;
> + nr_pages = iov[seg].iov_len >> PAGE_SHIFT;
> +
> + down_read(&current->mm->mmap_sem);
> + result = get_user_pages(current, current->mm, user_addr,
> + nr_pages, 0, 0, &data->pagevec[mapped], NULL);
> + up_read(&current->mm->mmap_sem);
> + if (result < 0) {
> + /* unmap what is done so far */
> + nfs_direct_release_pages(data->pagevec, mapped);
> + nfs_writedata_free(data);
> + return result;
> + }
> + mapped += result;
> + }
> +
> + get_dreq(dreq);
> + list_move_tail(&data->pages, &dreq->rewrite_list);
> +
> + data->req = (struct nfs_page *) dreq;
> + data->inode = inode;
> + data->cred = msg.rpc_cred;
> + data->args.fh = NFS_FH(inode);
> + data->args.context = ctx;
> + data->args.offset = pos;
> + data->args.pgbase = pgbase;
> + data->args.pages = data->pagevec;
> + data->args.count = bytes;
> + data->args.stable = sync;
> + data->res.fattr = &data->fattr;
> + data->res.count = bytes;
> + data->res.verf = &data->verf;
> + nfs_fattr_init(&data->fattr);
> +
> + task_setup_data.task = &data->task;
> + task_setup_data.callback_data = data;
> + msg.rpc_argp = &data->args;
> + msg.rpc_resp = &data->res;
> + NFS_PROTO(inode)->write_setup(data, &msg);
> +
> + task = rpc_run_task(&task_setup_data);
> + if (IS_ERR(task))
> + goto out;
> + rpc_put_task(task);
> +
> + started += bytes;
> +out:
> + if (started)
> + return started;
> + return result < 0 ? (ssize_t) result : -EFAULT;
> +}
> +
> static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
> const struct iovec *iov,
> unsigned long nr_segs,
> @@ -832,6 +1080,16 @@ static ssize_t nfs_direct_write_schedule
>
> get_dreq(dreq);
>
> + result = nfs_direct_check_single_io(dreq, iov, nr_segs, 1);
> + if (result) {
> + result = nfs_direct_write_schedule_single_io(dreq, iov, nr_segs,
> + result, pos, sync);
> + if (result < 0)
> + goto out;
> + requested_bytes += result;
> + pos += result;
> + goto out;
> + }
> for (seg = 0; seg < nr_segs; seg++) {
> const struct iovec *vec = &iov[seg];
> result = nfs_direct_write_schedule_segment(dreq, vec,
> @@ -853,6 +1111,7 @@ static ssize_t nfs_direct_write_schedule
> return result < 0 ? result : -EIO;
> }
>
> +out:
> if (put_dreq(dreq))
> nfs_direct_write_complete(dreq, dreq->inode);
> return 0;
>
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-04-14 00:21:40

by Dean

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

>>> This issue has come up several times recently. My preference would be to
>>> tie the availability of slots to the TCP window size, and basically say
>>> that if the SOCK_ASYNC_NOSPACE flag is set on the socket, then we hold
>>> off allocating more slots until we get a ->write_space() callback which
>>> clears that flag.
>>>
>>> For the RDMA case, we can continue to use the current system of a fixed
>>> number of preallocated slots.
>>>
>> I take it then that we'd want a similar scheme for UDP as well? I guess
>> I'm just not sure what the slot table is supposed to be for.
> [andros] I look at the rpc_slot table as a representation of the amount of data the connection to the server
> can handle - basically the #slots should be double the bandwidth-delay product divided by max(rsize, wsize).
> For TCP, the bandwidth-delay product is the window size (RTT of a max-MTU packet * interface bandwidth).
> There is no reason to allocate more rpc_rqsts than can fit on the wire.
I agree with checking for space on the link.

The above formula is a good lower bound on the maximum number of slots,
but there are many times when a client could use more slots than the
above formula. For example, we don't want to punish writes if rsize >
wsize. Also, you have to account for the server memory, which can
sometimes hold several write requests while waiting for them to be
sync'd to disk, leaving the TCP buffers less than full.

Also, I think any solution should allow admins to limit the maximum
number of slots. Too many slots can increase request randomness at the
server, and sometimes severely reduce performance.

Dean

>> Possibly naive question, and maybe you or Andy have scoped this out
>> already...
>>
>> Wouldn't it make more sense to allow the code to allocate rpc_rqst's as
>> needed, and manage congestion control in reserve_xprt ?
> [andros] Congestion control is not what the rpc_slot table is managing. It does need to have
> a minimum which experience has set at 16. It's the maximum that needs to be dynamic.
> Congestion control by the lower layers should work unfettered within the # of rpc_slots. Today that
> is not always the case when 16 slots is not enough to fill the wire, and the administrator has
> not changed the # of rpc_slots.
>
>> It appears that
>> that at least is what xprt_reserve_xprt_cong is supposed to do. The TCP
>> variant (xprt_reserve_xprt) doesn't do that currently, but we could do
>> it there and that would seem to make for more parity between the TCP
>> and UDP in this sense.
>>
>> We could do that similarly for RDMA too. Simply keep track of how many
>> RPCs are in flight and only allow reserving the xprt when that number
>> hasn't crossed the max number of slots...
>
>> --
>> Jeff Layton<[email protected]>

2011-04-13 12:37:01

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Tue, 12 Apr 2011 10:46:00 -0700
Badari Pulavarty <[email protected]> wrote:

> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
> > On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
> >
> > > On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
> > >> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> We recently ran into serious performance issue with NFS client.
> > >>> It turned out that its due to lack of readv/write support for
> > >>> NFS (O_DIRECT) client.
> > >>>
> > >>> Here is our use-case:
> > >>>
> > >>> In our cloud environment, our storage is over NFS. Files
> > >>> on NFS are passed as a blockdevices to the guest (using
> > >>> O_DIRECT). When guest is doing IO on these block devices,
> > >>> they will end up as O_DIRECT writes to NFS (on KVM host).
> > >>>
> > >>> QEMU (on the host) gets a vector from virtio-ring and
> > >>> submits them. Old versions of QEMU, linearized the vector
> > >>> it got from KVM (copied them into a buffer) and submits
> > >>> the buffer. So, NFS client always received a single buffer.
> > >>>
> > >>> Later versions of QEMU, eliminated this copy and submits
> > >>> a vector directly using preadv/pwritev().
> > >>>
> > >>> NFS client loops through the vector and submits each
> > >>> vector as separate request for each IO < wsize. In our
> > >>> case (negotiated wsize=1MB), for 256K IO - we get 64
> > >>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
> > >>> Server end up doing each 4K synchronously. This causes
> > >>> serious performance degrade. We are trying to see if the
> > >>> performance improves if we convert IOs to ASYNC - but
> > >>> our initial results doesn't look good.
> > >>>
> > >>> readv/writev support NFS client for all possible cases is
> > >>> hard. Instead, if all vectors are page-aligned and
> > >>> iosizes page-multiple - it fits the current code easily.
> > >>> Luckily, QEMU use-case fits these requirements.
> > >>>
> > >>> Here is the patch to add this support. Comments ?
> > >>
> > >> Restricting buffer alignment requirements would be an onerous API change, IMO.
> > >
> > > I am not suggesting an API change at all. All I am doing is, if all
> > > the IOs are aligned - we can do a fast path as we can do in a single
> > > IO request. (as if we got a single buffer). Otherwise, we do hard
> > > way as today - loop through each one and do them individually.
> >
> > Thanks for the clarification. That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
> >
> > >> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed. I think Jeff Layton already has a patch that does this.
> > >
> > > We are trying that patch. It does improve the performance by little,
> > > but not anywhere close to doing it as a single vector/buffer.
> > >
> > > Khoa, can you share your performance data for all the
> > > suggestions/patches you tried so far ?
> >
> > The individual WRITEs should be submitted in parallel. If there is additional performance overhead, it is probably due to the small RPC slot table size. Have you tried increasing it?
>
> We haven't tried both fixes together (RPC slot increase, Turn into
> ASYNC). Each one individually didn't help much. We will try them
> together.
>

I must have missed some of these emails, but here's the patch that Chuck
mentioned, based on his suggestion. It may be reasonable as a stopgap
solution until Trond's overhaul of the DIO code is complete.

With this + a larger slot table size, I would think you'd have a
substantial write performance improvement. Probably not as good as if
the writes were coalesced, but it should help.

Badari, if you can let us know whether this plus increasing the slot
table size helps, then I'll plan to "officially" submit it. This one is
against RHEL6 but it should apply to mainline kernels with a small
offset.

-----------------[snip]-----------------

BZ#694309: nfs: use unstable writes for bigger groups of DIO writes

Currently, the client uses FILE_SYNC whenever the amount of data it's
writing is less than or equal to the wsize. This is a problem though if we
have a bunch of small iovecs batched up in a single writev call. The client
will iterate over them and do a separate FILE_SYNC WRITE for each.

Instead, change the code to do unstable writes when we'll need to do
multiple WRITE RPC's in order to satisfy the request. While we're at
it, optimize away the allocation of commit_data when we aren't going
to use it anyway.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/direct.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 398f8ed..d2ed659 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -870,9 +870,18 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
dreq = nfs_direct_req_alloc();
if (!dreq)
goto out;
- nfs_alloc_commit_data(dreq);

- if (dreq->commit_data == NULL || count <= wsize)
+ if (count > wsize || nr_segs > 1)
+ nfs_alloc_commit_data(dreq);
+ else
+ dreq->commit_data = NULL;
+
+ /*
+ * If we couldn't allocate commit data, or we'll just be doing a
+ * single write, then make this a NFS_FILE_SYNC write and do away
+ * with the commit.
+ */
+ if (dreq->commit_data == NULL)
sync = NFS_FILE_SYNC;

dreq->inode = inode;
--
1.7.1
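
(For the QEMU case from the original report -- 64 iovecs of 4K each with
wsize = 1MB -- count is 256K <= wsize but nr_segs is 64 > 1, so with this
patch the commit data gets allocated and the 64 WRITEs should go out
unstable with a single COMMIT at the end, instead of 64 FILE_SYNC WRITEs.)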


2011-04-13 13:43:56

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On 4/13/2011 5:36 AM, Jeff Layton wrote:
> On Tue, 12 Apr 2011 10:46:00 -0700
> Badari Pulavarty<[email protected]> wrote:
>
>
>> On Tue, 2011-04-12 at 12:42 -0400, Chuck Lever wrote:
>>
>>> On Apr 12, 2011, at 12:15 PM, Badari Pulavarty wrote:
>>>
>>>
>>>> On Tue, 2011-04-12 at 11:36 -0400, Chuck Lever wrote:
>>>>
>>>>> On Apr 12, 2011, at 11:32 AM, Badari Pulavarty wrote:
>>>>>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We recently ran into serious performance issue with NFS client.
>>>>>> It turned out that its due to lack of readv/write support for
>>>>>> NFS (O_DIRECT) client.
>>>>>>
>>>>>> Here is our use-case:
>>>>>>
>>>>>> In our cloud environment, our storage is over NFS. Files
>>>>>> on NFS are passed as a blockdevices to the guest (using
>>>>>> O_DIRECT). When guest is doing IO on these block devices,
>>>>>> they will end up as O_DIRECT writes to NFS (on KVM host).
>>>>>>
>>>>>> QEMU (on the host) gets a vector from virtio-ring and
>>>>>> submits them. Old versions of QEMU, linearized the vector
>>>>>> it got from KVM (copied them into a buffer) and submits
>>>>>> the buffer. So, NFS client always received a single buffer.
>>>>>>
>>>>>> Later versions of QEMU, eliminated this copy and submits
>>>>>> a vector directly using preadv/pwritev().
>>>>>>
>>>>>> NFS client loops through the vector and submits each
>>>>>> vector as separate request for each IO < wsize. In our
>>>>>> case (negotiated wsize=1MB), for 256K IO - we get 64
>>>>>> vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
>>>>>> Server end up doing each 4K synchronously. This causes
>>>>>> serious performance degrade. We are trying to see if the
>>>>>> performance improves if we convert IOs to ASYNC - but
>>>>>> our initial results doesn't look good.
>>>>>>
>>>>>> readv/writev support NFS client for all possible cases is
>>>>>> hard. Instead, if all vectors are page-aligned and
>>>>>> iosizes page-multiple - it fits the current code easily.
>>>>>> Luckily, QEMU use-case fits these requirements.
>>>>>>
>>>>>> Here is the patch to add this support. Comments ?
>>>>>>
>>>>> Restricting buffer alignment requirements would be an onerous API change, IMO.
>>>>>
>>>> I am not suggesting an API change at all. All I am doing is, if all
>>>> the IOs are aligned - we can do a fast path as we can do in a single
>>>> IO request. (as if we got a single buffer). Otherwise, we do hard
>>>> way as today - loop through each one and do them individually.
>>>>
>>> Thanks for the clarification. That means you don't also address the problem of doing multiple small segments with FILE_SYNC writes.
>>>
>>>
>>>>> If the NFS write path is smart enough not to set FILE_SYNC when there are multiple segments to write, then the problem should be mostly fixed. I think Jeff Layton already has a patch that does this.
>>>>>
>>>> We are trying that patch. It does improve the performance by little,
>>>> but not anywhere close to doing it as a single vector/buffer.
>>>>
>>>> Khoa, can you share your performance data for all the
>>>> suggestions/patches you tried so far ?
>>>>
>>> The individual WRITEs should be submitted in parallel. If there is additional performance overhead, it is probably due to the small RPC slot table size. Have you tried increasing it?
>>>
>> We haven't tried both fixes together (RPC slot increase, Turn into
>> ASYNC). Each one individually didn't help much. We will try them
>> together.
>>
>>
> I must have missed some of these emails, but here's the patch that
> Chuck mentioned and based on his suggestion. It may be reasonable as a
> stopgap solution until Trond's overhaul of the DIO code is complete.
>
> With this + a larger slot table size, I would think you'd have a
> substantial write performance improvement. Probably not as good as if
> the writes were coalesced, but it should help.
>
> Badari, if you can let us know whether this plus increasing the slot
> table size helps, then I'll plan to "officially" submit it. This one is
> against RHEL6 but it should apply to mainline kernels with a small
> offset.
>
> -----------------[snip]-----------------
>
> BZ#694309: nfs: use unstable writes for bigger groups of DIO writes
>
> Currently, the client uses FILE_SYNC whenever the amount of data it's
> writing is less than or equal to the wsize. This is a problem though if we
> have a bunch of small iovecs batched up in a single writev call. The client
> will iterate over them and do a separate FILE_SYNC WRITE for each.
>
> Instead, change the code to do unstable writes when we'll need to do
> multiple WRITE RPC's in order to satisfy the request. While we're at
> it, optimize away the allocation of commit_data when we aren't going
> to use it anyway.
>
> Signed-off-by: Jeff Layton<[email protected]>
>

Khoa is running tests with this + the RPC slot table change. Single-thread
performance improved, but multi-thread tests don't scale (I guess we are
running out of RPC slot table entries even with 128?). He will share the
results shortly.

Thanks,
Badari

> ---
> fs/nfs/direct.c | 13 +++++++++++--
> 1 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 398f8ed..d2ed659 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -870,9 +870,18 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
> dreq = nfs_direct_req_alloc();
> if (!dreq)
> goto out;
> - nfs_alloc_commit_data(dreq);
>
> - if (dreq->commit_data == NULL || count <= wsize)
> + if (count > wsize || nr_segs > 1)
> + nfs_alloc_commit_data(dreq);
> + else
> + dreq->commit_data = NULL;
> +
> + /*
> + * If we couldn't allocate commit data, or we'll just be doing a
> + * single write, then make this a NFS_FILE_SYNC write and do away
> + * with the commit.
> + */
> + if (dreq->commit_data == NULL)
> sync = NFS_FILE_SYNC;
>
> dreq->inode = inode;
>



2011-04-15 17:33:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Tue, Apr 12, 2011 at 11:49:29AM -0400, Trond Myklebust wrote:
> Your approach goes in the direction of further special-casing O_DIRECT
> in the NFS client. I'd like to move away from that and towards
> integration with the ordinary read/write codepaths so that aside from
> adding request coalescing, we can also enable pNFS support.

What is the exact plan? Split the direct I/O into two passes, one
to lock down the user pages and then a second one to send the pages
over the wire, which is shared with the writeback code? If that's
the case it should naturally allow plugging in a scheme like Badari's
to send pages from different iovecs in a single on-the-wire request -
after all, page cache pages are non-contiguous in virtual and physical
memory, too.

When do you plan to release your read/write code re-write? If it's
not anytime soon how is applying Badari's patch going to hurt? Most
of it probably will get reverted with a complete rewrite, but at least
the logic to check which direct I/O iovecs can be coalesced would stay
in the new world order.


2011-04-12 16:15:06

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC][PATCH] Vector read/write support for NFS (DIO) client

On Tue, 2011-04-12 at 11:49 -0400, Trond Myklebust wrote:
> On Tue, 2011-04-12 at 08:32 -0700, Badari Pulavarty wrote:
> > Hi,
> >
> > We recently ran into serious performance issue with NFS client.
> > It turned out that its due to lack of readv/write support for
> > NFS (O_DIRECT) client.
> >
> > Here is our use-case:
> >
> > In our cloud environment, our storage is over NFS. Files
> > on NFS are passed as a blockdevices to the guest (using
> > O_DIRECT). When guest is doing IO on these block devices,
> > they will end up as O_DIRECT writes to NFS (on KVM host).
> >
> > QEMU (on the host) gets a vector from virtio-ring and
> > submits them. Old versions of QEMU, linearized the vector
> > it got from KVM (copied them into a buffer) and submits
> > the buffer. So, NFS client always received a single buffer.
> >
> > Later versions of QEMU, eliminated this copy and submits
> > a vector directly using preadv/pwritev().
> >
> > NFS client loops through the vector and submits each
> > vector as separate request for each IO < wsize. In our
> > case (negotiated wsize=1MB), for 256K IO - we get 64
> > vectors, each 4K. So, we end up submitting 64 4K FILE_SYNC IOs.
> > Server end up doing each 4K synchronously. This causes
> > serious performance degrade. We are trying to see if the
> > performance improves if we convert IOs to ASYNC - but
> > our initial results doesn't look good.
> >
> > readv/writev support NFS client for all possible cases is
> > hard. Instead, if all vectors are page-aligned and
> > iosizes page-multiple - it fits the current code easily.
> > Luckily, QEMU use-case fits these requirements.
> >
> > Here is the patch to add this support. Comments ?
>
> Your approach goes in the direction of further special-casing O_DIRECT
> in the NFS client. I'd like to move away from that and towards
> integration with the ordinary read/write codepaths so that aside from
> adding request coalescing, we can also enable pNFS support.
>

I completely agree. But it's a major undertaking :(

Thanks,
Badari