2012-08-08 02:03:26

by Peng Tao

Subject: [PATCH RFC 0/3] NFS41: optimize layoutget

The patches optimize the layoutget size by sending the real IO size instead
of just a small length in each layoutget. This helps the server make better
decisions about layout segment size and thus greatly reduces the number of
layoutget operations.

Boaz, please see if you like this behavior in the generic layer. If not, I can
make it block-layout only.

Thanks,
Tao

Peng Tao (3):
NFS: track direct IO left bytes
NFS41: send real write size in layoutget
NFS41: send real read size in layoutget for DIO

fs/nfs/direct.c | 12 +++++++++
fs/nfs/internal.h | 1 +
fs/nfs/pnfs.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 76 insertions(+), 3 deletions(-)



2012-08-12 18:30:44

by Boaz Harrosh

Subject: Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget

On 08/08/2012 05:03 AM, Peng Tao wrote:

> From: Peng Tao <[email protected]>
>
> For bufferred write, scan dirty pages to find out longest continuous
> dirty pages. In this case, also allow layout driver to specify a
> maximum layoutget size which is useful to avoid busy scanning dirty pages
> for block layout client.
>
> For direct write, just use dreq->bytes_left.
>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/nfs/direct.c | 7 ++++++
> fs/nfs/internal.h | 1 +
> fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 64 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index c39f775..c1899dd 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -46,6 +46,7 @@
> #include <linux/kref.h>
> #include <linux/slab.h>
> #include <linux/task_io_accounting_ops.h>
> +#include <linux/module.h>
>
> #include <linux/nfs_fs.h>
> #include <linux/nfs_page.h>
> @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq)
> kref_put(&dreq->kref, nfs_direct_req_free);
> }
>
> +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq)
> +{
> + return dreq->bytes_left;
> +}
> +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left);
> +
> /*
> * Collects and returns the final error value/byte-count.
> */
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index 31fdb03..e68d329 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode)
> {
> inode_dio_wait(inode);
> }
> +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq);
>


Why is this an EXPORT_SYMBOL_GPL in the .c file? Why not just an inline
here?

> /* nfs4proc.c */
> extern void __nfs4_read_done_cb(struct nfs_read_data *);
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 2e00fea..e61a373 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -29,6 +29,7 @@
>
> #include <linux/nfs_fs.h>
> #include <linux/nfs_page.h>
> +#include <linux/pagevec.h>
> #include <linux/module.h>
> #include "internal.h"
> #include "pnfs.h"
> @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r
> }
> EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read);
>
> +/*
> + * Return the number of contiguous bytes in dirty pages for a given inode
> + * starting at page frame idx.
> + */
> +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + pgoff_t index;
> + struct pagevec pvec;
> + pgoff_t num = 1; /* self */
> + int i, done = 0;
> +
> + pagevec_init(&pvec, 0);
> + idx++; /* self */
> + while (!done) {
> + index = idx;
> + pagevec_lookup_tag(&pvec, mapping, &index,
> + PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE);
> + if (pagevec_count(&pvec) == 0)
> + break;
> +
> + for (i = 0; i < pagevec_count(&pvec); i++) {
> + struct page *page = pvec.pages[i];
> +
> + lock_page(page);
> + if (unlikely(page->mapping != mapping) ||
> + !PageDirty(page) ||
> + PageWriteback(page) ||
> + page->index != idx) {
> + done = 1;
> + unlock_page(page);
> + break;
> + }
> + unlock_page(page);
> + if (done)
> + break;
> + idx++;
> + num++;
> + }
> + pagevec_release(&pvec);
> + }
> + return num << PAGE_CACHE_SHIFT;
> +}
> +


Same as what Trond said: radix_tree_next_hole() would be nicer. We never need
to do any locking; this is just a hint, not a transactional guarantee. A best
first-guess approximation is all we need.

Also, you might want to optimize for the most common case of a linear write from
zero. You can do this by comparing i_size / PAGE_SIZE with the number of dirty
pages; if they are the same, you know there are no holes and need not scan.
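
Something along these lines is what I have in mind, as a rough, untested
sketch (it assumes fs/nfs/pnfs.c context with <linux/radix-tree.h> available;
the function name is just illustrative):

static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t idx)
{
	pgoff_t end;

	/* optimize the common case of a dense linear write from zero */
	end = DIV_ROUND_UP(i_size_read(inode), PAGE_CACHE_SIZE);
	if (end != NFS_I(inode)->npages) {
		/* no page locks: the result is only a hint for layoutget sizing */
		rcu_read_lock();
		end = radix_tree_next_hole(&inode->i_mapping->page_tree,
					   idx + 1, ULONG_MAX);
		rcu_read_unlock();
	}

	if (!end)	/* rare index wrap-around in radix_tree_next_hole() */
		return i_size_read(inode) - ((loff_t)idx << PAGE_CACHE_SHIFT);
	return (u64)(end - idx) << PAGE_CACHE_SHIFT;
}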

> void
> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio,
> + struct nfs_page *req)


Nothing changed here, please don't do this

> {
> + u64 wb_size;
> +
> BUG_ON(pgio->pg_lseg != NULL);
>
> if (req->wb_offset != req->wb_pgbase) {
> nfs_pageio_reset_write_mds(pgio);
> return;
> }
> +
> + if (pgio->pg_dreq == NULL)
> + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
> + else
> + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
> +
> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
> req->wb_context,
> req_offset(req),
> - req->wb_bytes,
> + wb_size?:req->wb_bytes,


wb_size is always set above in the if() or else. No need to check here.

> IOMODE_RW,
> GFP_NOFS);
> /* If no lseg, fall back to write through mds */



But no!

This is much, much better than last time; thank you for working on this.
But it is not optimal for objects and files
(once "files" supports segments).

For blocks, yes, radix_tree_next_hole() is the perfect fit. But for
objects (and files) it is i_size_read(). An objects/files server usually
determines its topology according to the file size, and it does not consume
any more resources whether there are holes or not. (The files example I am
thinking of is Ceph.)

So for objects the wasting factor is the actual i_size extension caused by
layout_get, not the number of pages served. So for us the gain is if the
client, which has much newer information about i_size, sends it on the first
layout_get. That way the file size is extended only once, on the first
layout_get, and not on every layout_get.

So the small change I want is:

+enum pnfs_layout_get_strategy {
+ PLGS_USE_ISIZE,
+ PLGS_SEARCH_FIRST_HOLE,
+ PLGS_ALL_FILE,
+};

-pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
+pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req,
+ enum pnfs_layout_get_strategy plgs)


...

+ if (pgio->pg_dreq == NULL) {
+ switch (plgs) {
+ case PLGS_SEARCH_FIRST_HOLE:
+ wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
+ break;
+ case PLGS_USE_ISIZE:
+ wb_size = i_size_read(pgio->pg_inode) - req_offset(req);
+ break;
+ case PLGS_ALL_FILE:
+ wb_size = NFS4_MAX_UINT64;
+ break;
+ default:
+ WARN_ON(1);
+ wb_size = PAGE_SIZE;
+ }
+ } else
+ wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);

The call to pnfs_generic_pg_init_write is always made by the LD's .pg_init
and each LD should pass the proper pnfs_layout_get_strategy flag.

But yes, this is much more like it. Thanks again, nice progress.

Thanks
Boaz

2012-08-08 02:03:36

by Peng Tao

Subject: [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO

From: Peng Tao <[email protected]>

We don't have the real IO size information for buffered reads, but for direct
reads we can get it from dreq->bytes_left. Let's make use of it.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/pnfs.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e61a373..e4cfd1e 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1154,16 +1154,22 @@ out_forget_reply:
void
pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
{
+ u64 rd_size = req->wb_bytes;
+
BUG_ON(pgio->pg_lseg != NULL);

if (req->wb_offset != req->wb_pgbase) {
nfs_pageio_reset_read_mds(pgio);
return;
}
+
+ if (pgio->pg_dreq != NULL)
+ rd_size = nfs_dreq_bytes_left(pgio->pg_dreq);
+
pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
req->wb_context,
req_offset(req),
- req->wb_bytes,
+ rd_size,
IOMODE_READ,
GFP_KERNEL);
/* If no lseg, fall back to read through mds */
--
1.7.1.262.g5ef3d


2012-08-08 18:50:22

by Myklebust, Trond

Subject: Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget

On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
> From: Peng Tao <[email protected]>
> 
> For bufferred write, scan dirty pages to find out longest continuous
> dirty pages. In this case, also allow layout driver to specify a
> maximum layoutget size which is useful to avoid busy scanning dirty pages
> for block layout client.

This is _really_ ugly, and is a source of potential deadlocks if
multiple threads are scanning+locking pages at the same time.

Why not simplify this by dropping the complicated tests for whether or
not the pages are dirty? That gets rid of the deadlock-prone lock_page()
calls, and would enable you to just use radix_tree_next_hole().

> For direct write, just use dreq->bytes_left.

This makes sense.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2012-08-13 06:15:49

by Peng Tao

Subject: Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget

On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <[email protected]> wrote:
> On 08/08/2012 05:03 AM, Peng Tao wrote:
>
>> From: Peng Tao <[email protected]>
>>
>> For bufferred write, scan dirty pages to find out longest continuous
>> dirty pages. In this case, also allow layout driver to specify a
>> maximum layoutget size which is useful to avoid busy scanning dirty pages
>> for block layout client.
>>
>> For direct write, just use dreq->bytes_left.
>>
>> Signed-off-by: Peng Tao <[email protected]>
>> ---
>> fs/nfs/direct.c | 7 ++++++
>> fs/nfs/internal.h | 1 +
>> fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 3 files changed, 64 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
>> index c39f775..c1899dd 100644
>> --- a/fs/nfs/direct.c
>> +++ b/fs/nfs/direct.c
>> @@ -46,6 +46,7 @@
>> #include <linux/kref.h>
>> #include <linux/slab.h>
>> #include <linux/task_io_accounting_ops.h>
>> +#include <linux/module.h>
>>
>> #include <linux/nfs_fs.h>
>> #include <linux/nfs_page.h>
>> @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq)
>> kref_put(&dreq->kref, nfs_direct_req_free);
>> }
>>
>> +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq)
>> +{
>> + return dreq->bytes_left;
>> +}
>> +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left);
>> +
>> /*
>> * Collects and returns the final error value/byte-count.
>> */
>> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
>> index 31fdb03..e68d329 100644
>> --- a/fs/nfs/internal.h
>> +++ b/fs/nfs/internal.h
>> @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode)
>> {
>> inode_dio_wait(inode);
>> }
>> +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq);
>>
>
>
> Why is this an EXPORT_SYMBOL_GPL at .c file. Why not just an inline
> here ?
>
struct nfs_direct_req is internal to direct.c. Its members cannot be
accessed outside of it without an exported API.

>> /* nfs4proc.c */
>> extern void __nfs4_read_done_cb(struct nfs_read_data *);
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index 2e00fea..e61a373 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -29,6 +29,7 @@
>>
>> #include <linux/nfs_fs.h>
>> #include <linux/nfs_page.h>
>> +#include <linux/pagevec.h>
>> #include <linux/module.h>
>> #include "internal.h"
>> #include "pnfs.h"
>> @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r
>> }
>> EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read);
>>
>> +/*
>> + * Return the number of contiguous bytes in dirty pages for a given inode
>> + * starting at page frame idx.
>> + */
>> +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx)
>> +{
>> + struct address_space *mapping = inode->i_mapping;
>> + pgoff_t index;
>> + struct pagevec pvec;
>> + pgoff_t num = 1; /* self */
>> + int i, done = 0;
>> +
>> + pagevec_init(&pvec, 0);
>> + idx++; /* self */
>> + while (!done) {
>> + index = idx;
>> + pagevec_lookup_tag(&pvec, mapping, &index,
>> + PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE);
>> + if (pagevec_count(&pvec) == 0)
>> + break;
>> +
>> + for (i = 0; i < pagevec_count(&pvec); i++) {
>> + struct page *page = pvec.pages[i];
>> +
>> + lock_page(page);
>> + if (unlikely(page->mapping != mapping) ||
>> + !PageDirty(page) ||
>> + PageWriteback(page) ||
>> + page->index != idx) {
>> + done = 1;
>> + unlock_page(page);
>> + break;
>> + }
>> + unlock_page(page);
>> + if (done)
>> + break;
>> + idx++;
>> + num++;
>> + }
>> + pagevec_release(&pvec);
>> + }
>> + return num << PAGE_CACHE_SHIFT;
>> +}
>> +
>
>
> Same as what Trond said. radix_tree_next_hole() should be nicer. We need never
> do any locking this is just an hint, and not a transaction guaranty. Best first
> guess approximation is all we need.
>
> Also you might want to optimize for the most common case of a linear write from
> zero. This you can do by comparing i_size / PAGE_SIZE and the number of dirty
> pages, if they are the same you know there are no holes, and need not scan.
>
Sounds reasonable. Will do it in v2.

>> void
>> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
>> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio,
>> + struct nfs_page *req)
>
>
> Nothing changed here, please don't do this
>
>> {
>> + u64 wb_size;
>> +
>> BUG_ON(pgio->pg_lseg != NULL);
>>
>> if (req->wb_offset != req->wb_pgbase) {
>> nfs_pageio_reset_write_mds(pgio);
>> return;
>> }
>> +
>> + if (pgio->pg_dreq == NULL)
>> + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
>> + else
>> + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
>> +
>> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
>> req->wb_context,
>> req_offset(req),
>> - req->wb_bytes,
>> + wb_size?:req->wb_bytes,
>
>
> wb_size is always set above in the if() or else. No need to check here.
>
>> IOMODE_RW,
>> GFP_NOFS);
>> /* If no lseg, fall back to write through mds */
>
>
>
> But No!
>
> much much better then last time, thank you for working on this
> but it is not optimum for objects and files
> (when "files" supports segments)
>
> For blocks, Yes radix_tree_next_hole() is the perfect fit. But for
> objects (and files) it is i_size_read(). The objects/files server usually
> determines it's topology according to file-size. And it does not have any
> bigger resources because of holes or no holes. (The files example I think of
> is CEPH)
>
> So for objects the wasting factor is the actual i_size extending as a cause
> of layout_get, and not the number of pages served. So for us the gain is if
> client, that has a much newer information about i_size, sends it on first
> layout_get. Though extending file size only once on first layout_get and
> not on every layout_get.
>
> So the small change I want is:
>
> +enum pnfs_layout_get_strategy {
> + PLGS_USE_ISIZE,
> + PLGS_SEARCH_FIRST_HOLE,
> + PLGS_ALL_FILE,
> +};
>
> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req,
> + enum pnfs_layout_get_strategy plgs)
>
>
> ...
>
> + if (pgio->pg_dreq == NULL) {
> + switch (plgs) {
> + case PLGS_SEARCH_FIRST_HOLE:
> + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
> + break;
> + case PLGS_USE_ISIZE:
> + wb_size = i_size_read() - req_offset(req);
> + break;
> + case PLGS_ALL_FILE:
> + wb_size = NFS4_MAX_UINT64;
> + break;
> + default:
> + WARN_ON(1);
> + wb_size = PAGE_SIZE;
> + }
> + else
> + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
>
> The call to pnfs_generic_pg_init_write is always made by the LD's .pg_init
> and each LD should pass the proper pnfs_layout_get_strategy flag.
>
Will add it in v2. Thanks for reviewing.

Cheers,
Tao

2012-08-12 18:40:23

by Boaz Harrosh

Subject: Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget

On 08/12/2012 09:30 PM, Boaz Harrosh wrote:

> So for objects the wasting factor is the actual i_size extending as a cause
> of layout_get, and not the number of pages served. So for us the gain is if
> client, that has a much newer information about i_size, sends it on first
> layout_get. Though extending file size only once on first layout_get and
> not on every layout_get.
>


I want to clarify here: the i_size does not and must not grow as part of
a layout_get. Only a layout_commit may extend i_size.

The "file-size" I meant above is the current maximum size that can be
described by the inode's layout device map. The device map does grow on
layout_get, both for objects and, for example, for a Ceph cluster. If we
send i_size from the client, then we only need to extend the device map once
during the complete writeout (if it needs extending at all).

Thanks
Boaz

2012-08-13 09:45:05

by Peng Tao

Subject: Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget

On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <[email protected]> wrote:
> On 08/08/2012 05:03 AM, Peng Tao wrote:
>
>> From: Peng Tao <[email protected]>
>>
>> For bufferred write, scan dirty pages to find out longest continuous
>> dirty pages. In this case, also allow layout driver to specify a
>> maximum layoutget size which is useful to avoid busy scanning dirty pages
>> for block layout client.
>>
>> For direct write, just use dreq->bytes_left.
>>
>> Signed-off-by: Peng Tao <[email protected]>
>> ---
>> fs/nfs/direct.c | 7 ++++++
>> fs/nfs/internal.h | 1 +
>> fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 3 files changed, 64 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
>> index c39f775..c1899dd 100644
>> --- a/fs/nfs/direct.c
>> +++ b/fs/nfs/direct.c
>> @@ -46,6 +46,7 @@
>> #include <linux/kref.h>
>> #include <linux/slab.h>
>> #include <linux/task_io_accounting_ops.h>
>> +#include <linux/module.h>
>>
>> #include <linux/nfs_fs.h>
>> #include <linux/nfs_page.h>
>> @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq)
>> kref_put(&dreq->kref, nfs_direct_req_free);
>> }
>>
>> +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq)
>> +{
>> + return dreq->bytes_left;
>> +}
>> +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left);
>> +
>> /*
>> * Collects and returns the final error value/byte-count.
>> */
>> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
>> index 31fdb03..e68d329 100644
>> --- a/fs/nfs/internal.h
>> +++ b/fs/nfs/internal.h
>> @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode)
>> {
>> inode_dio_wait(inode);
>> }
>> +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq);
>>
>
>
> Why is this an EXPORT_SYMBOL_GPL at .c file. Why not just an inline
> here ?
>
>> /* nfs4proc.c */
>> extern void __nfs4_read_done_cb(struct nfs_read_data *);
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index 2e00fea..e61a373 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -29,6 +29,7 @@
>>
>> #include <linux/nfs_fs.h>
>> #include <linux/nfs_page.h>
>> +#include <linux/pagevec.h>
>> #include <linux/module.h>
>> #include "internal.h"
>> #include "pnfs.h"
>> @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r
>> }
>> EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read);
>>
>> +/*
>> + * Return the number of contiguous bytes in dirty pages for a given inode
>> + * starting at page frame idx.
>> + */
>> +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx)
>> +{
>> + struct address_space *mapping = inode->i_mapping;
>> + pgoff_t index;
>> + struct pagevec pvec;
>> + pgoff_t num = 1; /* self */
>> + int i, done = 0;
>> +
>> + pagevec_init(&pvec, 0);
>> + idx++; /* self */
>> + while (!done) {
>> + index = idx;
>> + pagevec_lookup_tag(&pvec, mapping, &index,
>> + PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE);
>> + if (pagevec_count(&pvec) == 0)
>> + break;
>> +
>> + for (i = 0; i < pagevec_count(&pvec); i++) {
>> + struct page *page = pvec.pages[i];
>> +
>> + lock_page(page);
>> + if (unlikely(page->mapping != mapping) ||
>> + !PageDirty(page) ||
>> + PageWriteback(page) ||
>> + page->index != idx) {
>> + done = 1;
>> + unlock_page(page);
>> + break;
>> + }
>> + unlock_page(page);
>> + if (done)
>> + break;
>> + idx++;
>> + num++;
>> + }
>> + pagevec_release(&pvec);
>> + }
>> + return num << PAGE_CACHE_SHIFT;
>> +}
>> +
>
>
> Same as what Trond said. radix_tree_next_hole() should be nicer. We need never
> do any locking this is just an hint, and not a transaction guaranty. Best first
> guess approximation is all we need.
>
> Also you might want to optimize for the most common case of a linear write from
> zero. This you can do by comparing i_size / PAGE_SIZE and the number of dirty
> pages, if they are the same you know there are no holes, and need not scan.
>
>> void
>> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
>> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio,
>> + struct nfs_page *req)
>
>
> Nothing changed here, please don't do this
>
>> {
>> + u64 wb_size;
>> +
>> BUG_ON(pgio->pg_lseg != NULL);
>>
>> if (req->wb_offset != req->wb_pgbase) {
>> nfs_pageio_reset_write_mds(pgio);
>> return;
>> }
>> +
>> + if (pgio->pg_dreq == NULL)
>> + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
>> + else
>> + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
>> +
>> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
>> req->wb_context,
>> req_offset(req),
>> - req->wb_bytes,
>> + wb_size?:req->wb_bytes,
>
>
> wb_size is always set above in the if() or else. No need to check here.
>
>> IOMODE_RW,
>> GFP_NOFS);
>> /* If no lseg, fall back to write through mds */
>
>
>
> But No!
>
> much much better then last time, thank you for working on this
> but it is not optimum for objects and files
> (when "files" supports segments)
>
> For blocks, Yes radix_tree_next_hole() is the perfect fit. But for
> objects (and files) it is i_size_read(). The objects/files server usually
> determines it's topology according to file-size. And it does not have any
> bigger resources because of holes or no holes. (The files example I think of
> is CEPH)
>
> So for objects the wasting factor is the actual i_size extending as a cause
> of layout_get, and not the number of pages served. So for us the gain is if
> client, that has a much newer information about i_size, sends it on first
> layout_get. Though extending file size only once on first layout_get and
> not on every layout_get.
>
> So the small change I want is:
>
> +enum pnfs_layout_get_strategy {
> + PLGS_USE_ISIZE,
> + PLGS_SEARCH_FIRST_HOLE,
> + PLGS_ALL_FILE,
> +};
>
Just a second thought, since each layout driver would use one
strategy, it is more reasonable to set the strategy in
pnfs_curr_ld->flags instead of changing pg_init API to pass it in. I
will do it this way.
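
Something like the following is what I have in mind (untested sketch; the
flag name is just a placeholder, not an existing bit in pnfs_layoutdriver_type):

/* hypothetical flag a layout driver could set in pnfs_curr_ld->flags */
#define PNFS_LAYOUTGET_USE_ISIZE	0x00000010

	if (pgio->pg_dreq == NULL) {
		struct pnfs_layoutdriver_type *ld =
			NFS_SERVER(pgio->pg_inode)->pnfs_curr_ld;

		if (ld->flags & PNFS_LAYOUTGET_USE_ISIZE)
			wb_size = i_size_read(pgio->pg_inode) - req_offset(req);
		else
			wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
	} else
		wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);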

--
Thanks,
Tao

2012-08-09 02:24:39

by Peng Tao

Subject: Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget

On Thu, Aug 9, 2012 at 2:50 AM, Myklebust, Trond
<[email protected]> wrote:
> On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
>> From: Peng Tao <[email protected]>
>>
>> For bufferred write, scan dirty pages to find out longest continuous
>> dirty pages. In this case, also allow layout driver to specify a
>> maximum layoutget size which is useful to avoid busy scanning dirty pages
>> for block layout client.
>
> This is _really_ ugly, and is a source of potential deadlocks if
> multiple threads are scanning+locking pages at the same time.
Multiple threads won't be able to scan into the same region, as the borders
are protected by the page writeback bit.

>
> Why not simplify this by dropping the complicated tests for whether or
> not the pages are dirty? That gets rid of the deadlock-prone lock_page()
> calls, and would enable you to just use radix_tree_next_hole().
>
OK. I will cook a patch to drop the complicated page dirty/writeback
checks and just use radix_tree_next_hole.

--
Thanks,
Tao

2012-08-08 02:03:30

by Peng Tao

Subject: [PATCH RFC 1/3] NFS: track direct IO left bytes

From: Peng Tao <[email protected]>

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/direct.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 34a6282..c39f775 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -78,6 +78,7 @@ struct nfs_direct_req {
atomic_t io_count; /* i/os we're waiting for */
spinlock_t lock; /* protect completion state */
ssize_t count, /* bytes actually processed */
+ bytes_left, /* bytes left to be sent */
error; /* any reported error */
struct completion completion; /* wait for i/o completion */

@@ -390,6 +391,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de
user_addr += req_len;
pos += req_len;
count -= req_len;
+ dreq->bytes_left -= req_len;
}
/* The nfs_page now hold references to these pages */
nfs_direct_release_pages(pagevec, npages);
@@ -456,6 +458,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov,
goto out;

dreq->inode = inode;
+ dreq->bytes_left = iov_length(iov, nr_segs);
dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
dreq->l_ctx = nfs_get_lock_context(dreq->ctx);
if (dreq->l_ctx == NULL)
@@ -706,6 +709,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *d
user_addr += req_len;
pos += req_len;
count -= req_len;
+ dreq->bytes_left -= req_len;
}
/* The nfs_page now hold references to these pages */
nfs_direct_release_pages(pagevec, npages);
@@ -855,6 +859,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
goto out;

dreq->inode = inode;
+ dreq->bytes_left = count;
dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
dreq->l_ctx = nfs_get_lock_context(dreq->ctx);
if (dreq->l_ctx == NULL)
--
1.7.1.262.g5ef3d


2012-08-13 20:21:39

by Myklebust, Trond

Subject: Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget

On Mon, 2012-08-13 at 23:13 +0300, Boaz Harrosh wrote:
> On 08/13/2012 12:44 PM, Peng Tao wrote:
> 
> > On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <[email protected]> wrote:
> >> So the small change I want is:
> >>
> >> +enum pnfs_layout_get_strategy {
> >> +       PLGS_USE_ISIZE,
> >> +       PLGS_SEARCH_FIRST_HOLE,
> >> +       PLGS_ALL_FILE,
> >> +};
> >>
> > Just a second thought, since each layout driver would use one
> > strategy, it is more reasonable to set the strategy in
> > pnfs_curr_ld->flags instead of changing pg_init API to pass it in. I
> > will do it this way.
> 
> It's fine, as you see fit. I think it's more flexible this way but
> both ways will work for now.
> 
> Please note that for files, once it will support segments, it would
> want to use i_size like objects.

Layout-specific flags in the generic code are not acceptable. If the
strategy is layout specific, then there should be no need to pass it
around the generic layers at all. Just do the right thing in the driver
(i.e. pg_init) and we're done.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2012-08-08 02:03:33

by Peng Tao

Subject: [PATCH RFC 2/3] NFS41: send real write size in layoutget

From: Peng Tao <[email protected]>

For buffered writes, scan the dirty pages to find the longest run of
contiguous dirty pages. In this case, also allow the layout driver to specify
a maximum layoutget size, which is useful to avoid busily scanning dirty pages
for the block layout client.

For direct writes, just use dreq->bytes_left.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/direct.c | 7 ++++++
fs/nfs/internal.h | 1 +
fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index c39f775..c1899dd 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -46,6 +46,7 @@
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/module.h>

#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
@@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq)
kref_put(&dreq->kref, nfs_direct_req_free);
}

+ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq)
+{
+ return dreq->bytes_left;
+}
+EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left);
+
/*
* Collects and returns the final error value/byte-count.
*/
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 31fdb03..e68d329 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode)
{
inode_dio_wait(inode);
}
+extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq);

/* nfs4proc.c */
extern void __nfs4_read_done_cb(struct nfs_read_data *);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 2e00fea..e61a373 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -29,6 +29,7 @@

#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
+#include <linux/pagevec.h>
#include <linux/module.h>
#include "internal.h"
#include "pnfs.h"
@@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r
}
EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read);

+/*
+ * Return the number of contiguous bytes in dirty pages for a given inode
+ * starting at page frame idx.
+ */
+static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx)
+{
+ struct address_space *mapping = inode->i_mapping;
+ pgoff_t index;
+ struct pagevec pvec;
+ pgoff_t num = 1; /* self */
+ int i, done = 0;
+
+ pagevec_init(&pvec, 0);
+ idx++; /* self */
+ while (!done) {
+ index = idx;
+ pagevec_lookup_tag(&pvec, mapping, &index,
+ PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE);
+ if (pagevec_count(&pvec) == 0)
+ break;
+
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct page *page = pvec.pages[i];
+
+ lock_page(page);
+ if (unlikely(page->mapping != mapping) ||
+ !PageDirty(page) ||
+ PageWriteback(page) ||
+ page->index != idx) {
+ done = 1;
+ unlock_page(page);
+ break;
+ }
+ unlock_page(page);
+ if (done)
+ break;
+ idx++;
+ num++;
+ }
+ pagevec_release(&pvec);
+ }
+ return num << PAGE_CACHE_SHIFT;
+}
+
void
-pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
+pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio,
+ struct nfs_page *req)
{
+ u64 wb_size;
+
BUG_ON(pgio->pg_lseg != NULL);

if (req->wb_offset != req->wb_pgbase) {
nfs_pageio_reset_write_mds(pgio);
return;
}
+
+ if (pgio->pg_dreq == NULL)
+ wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
+ else
+ wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
+
pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
req->wb_context,
req_offset(req),
- req->wb_bytes,
+ wb_size?:req->wb_bytes,
IOMODE_RW,
GFP_NOFS);
/* If no lseg, fall back to write through mds */
--
1.7.1.262.g5ef3d


2012-08-09 02:30:23

by Peng Tao

Subject: Re: [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO

On Thu, Aug 9, 2012 at 2:57 AM, Myklebust, Trond
<[email protected]> wrote:
> On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
>> From: Peng Tao <[email protected]>
>>
>> We don't have the real IO size information in buffer read, but for direct
>> read, we can get it from dreq->bytes_left. Let's make use of it.
>
> What's wrong with using the nfs_readpages arguments? The readahead code
> will tell you exactly how many pages it is trying to read.
>
The nr_pages information gets dropped before calling into the pnfs code...

How about something like below for buffered reads? (Untested patch, and it
needs to be integrated with the DIO size for sure.)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 2e00fea..cec79c2 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1162,7 +1162,7 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r
pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
req->wb_context,
req_offset(req),
- req->wb_bytes,
+ (unsigned)pgio->pg_layout_private,
IOMODE_READ,
GFP_KERNEL);
/* If no lseg, fall back to read through mds */
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index b6bdb18..64fb0e2 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -649,6 +649,7 @@ int nfs_readpages(struct file *filp, struct address_space *mapping,

NFS_PROTO(inode)->read_pageio_init(&pgio, inode,
&nfs_async_read_completion_ops);

+ pgio.pg_layout_private = (void *)nr_pages;
ret = read_cache_pages(mapping, pages, readpage_async_filler, &desc);

nfs_pageio_complete(&pgio);


--
Thanks,
Tao

2012-08-08 18:57:08

by Myklebust, Trond

Subject: Re: [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO

On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
> From: Peng Tao <[email protected]>
> 
> We don't have the real IO size information in buffer read, but for direct
> read, we can get it from dreq->bytes_left. Let's make use of it.

What's wrong with using the nfs_readpages arguments? The readahead code
will tell you exactly how many pages it is trying to read.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2012-08-13 20:14:03

by Boaz Harrosh

Subject: Re: [PATCH RFC 2/3] NFS41: send real write size in layoutget

On 08/13/2012 12:44 PM, Peng Tao wrote:

> On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <[email protected]> wrote:
>> On 08/08/2012 05:03 AM, Peng Tao wrote:
>>
>>> From: Peng Tao <[email protected]>
>>>
>>> For bufferred write, scan dirty pages to find out longest continuous
>>> dirty pages. In this case, also allow layout driver to specify a
>>> maximum layoutget size which is useful to avoid busy scanning dirty pages
>>> for block layout client.
>>>
>>> For direct write, just use dreq->bytes_left.
>>>
>>> Signed-off-by: Peng Tao <[email protected]>
>>> ---
>>> fs/nfs/direct.c | 7 ++++++
>>> fs/nfs/internal.h | 1 +
>>> fs/nfs/pnfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>> 3 files changed, 64 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
>>> index c39f775..c1899dd 100644
>>> --- a/fs/nfs/direct.c
>>> +++ b/fs/nfs/direct.c
>>> @@ -46,6 +46,7 @@
>>> #include <linux/kref.h>
>>> #include <linux/slab.h>
>>> #include <linux/task_io_accounting_ops.h>
>>> +#include <linux/module.h>
>>>
>>> #include <linux/nfs_fs.h>
>>> #include <linux/nfs_page.h>
>>> @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq)
>>> kref_put(&dreq->kref, nfs_direct_req_free);
>>> }
>>>
>>> +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq)
>>> +{
>>> + return dreq->bytes_left;
>>> +}
>>> +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left);
>>> +
>>> /*
>>> * Collects and returns the final error value/byte-count.
>>> */
>>> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
>>> index 31fdb03..e68d329 100644
>>> --- a/fs/nfs/internal.h
>>> +++ b/fs/nfs/internal.h
>>> @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode)
>>> {
>>> inode_dio_wait(inode);
>>> }
>>> +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq);
>>>
>>
>>
>> Why is this an EXPORT_SYMBOL_GPL at .c file. Why not just an inline
>> here ?
>>
>>> /* nfs4proc.c */
>>> extern void __nfs4_read_done_cb(struct nfs_read_data *);
>>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>>> index 2e00fea..e61a373 100644
>>> --- a/fs/nfs/pnfs.c
>>> +++ b/fs/nfs/pnfs.c
>>> @@ -29,6 +29,7 @@
>>>
>>> #include <linux/nfs_fs.h>
>>> #include <linux/nfs_page.h>
>>> +#include <linux/pagevec.h>
>>> #include <linux/module.h>
>>> #include "internal.h"
>>> #include "pnfs.h"
>>> @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r
>>> }
>>> EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read);
>>>
>>> +/*
>>> + * Return the number of contiguous bytes in dirty pages for a given inode
>>> + * starting at page frame idx.
>>> + */
>>> +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx)
>>> +{
>>> + struct address_space *mapping = inode->i_mapping;
>>> + pgoff_t index;
>>> + struct pagevec pvec;
>>> + pgoff_t num = 1; /* self */
>>> + int i, done = 0;
>>> +
>>> + pagevec_init(&pvec, 0);
>>> + idx++; /* self */
>>> + while (!done) {
>>> + index = idx;
>>> + pagevec_lookup_tag(&pvec, mapping, &index,
>>> + PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE);
>>> + if (pagevec_count(&pvec) == 0)
>>> + break;
>>> +
>>> + for (i = 0; i < pagevec_count(&pvec); i++) {
>>> + struct page *page = pvec.pages[i];
>>> +
>>> + lock_page(page);
>>> + if (unlikely(page->mapping != mapping) ||
>>> + !PageDirty(page) ||
>>> + PageWriteback(page) ||
>>> + page->index != idx) {
>>> + done = 1;
>>> + unlock_page(page);
>>> + break;
>>> + }
>>> + unlock_page(page);
>>> + if (done)
>>> + break;
>>> + idx++;
>>> + num++;
>>> + }
>>> + pagevec_release(&pvec);
>>> + }
>>> + return num << PAGE_CACHE_SHIFT;
>>> +}
>>> +
>>
>>
>> Same as what Trond said. radix_tree_next_hole() should be nicer. We need never
>> do any locking this is just an hint, and not a transaction guaranty. Best first
>> guess approximation is all we need.
>>
>> Also you might want to optimize for the most common case of a linear write from
>> zero. This you can do by comparing i_size / PAGE_SIZE and the number of dirty
>> pages, if they are the same you know there are no holes, and need not scan.
>>
>>> void
>>> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
>>> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio,
>>> + struct nfs_page *req)
>>
>>
>> Nothing changed here, please don't do this
>>
>>> {
>>> + u64 wb_size;
>>> +
>>> BUG_ON(pgio->pg_lseg != NULL);
>>>
>>> if (req->wb_offset != req->wb_pgbase) {
>>> nfs_pageio_reset_write_mds(pgio);
>>> return;
>>> }
>>> +
>>> + if (pgio->pg_dreq == NULL)
>>> + wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
>>> + else
>>> + wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
>>> +
>>> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
>>> req->wb_context,
>>> req_offset(req),
>>> - req->wb_bytes,
>>> + wb_size?:req->wb_bytes,
>>
>>
>> wb_size is always set above in the if() or else. No need to check here.
>>
>>> IOMODE_RW,
>>> GFP_NOFS);
>>> /* If no lseg, fall back to write through mds */
>>
>>
>>
>> But No!
>>
>> much much better then last time, thank you for working on this
>> but it is not optimum for objects and files
>> (when "files" supports segments)
>>
>> For blocks, Yes radix_tree_next_hole() is the perfect fit. But for
>> objects (and files) it is i_size_read(). The objects/files server usually
>> determines it's topology according to file-size. And it does not have any
>> bigger resources because of holes or no holes. (The files example I think of
>> is CEPH)
>>
>> So for objects the wasting factor is the actual i_size extending as a cause
>> of layout_get, and not the number of pages served. So for us the gain is if
>> client, that has a much newer information about i_size, sends it on first
>> layout_get. Though extending file size only once on first layout_get and
>> not on every layout_get.
>>
>> So the small change I want is:
>>
>> +enum pnfs_layout_get_strategy {
>> + PLGS_USE_ISIZE,
>> + PLGS_SEARCH_FIRST_HOLE,
>> + PLGS_ALL_FILE,
>> +};
>>
> Just a second thought, since each layout driver would use one
> strategy, it is more reasonable to set the strategy in
> pnfs_curr_ld->flags instead of changing pg_init API to pass it in. I
> will do it this way.
>


It's fine, as you see fit. I think it's more flexible this way, but
both ways will work for now.

Please note that for files, once it supports segments, it would
want to use i_size like objects.

Thanks
Boaz

2012-08-12 17:41:00

by Boaz Harrosh

Subject: Re: [PATCH RFC 3/3] NFS41: send real read size in layoutget for DIO

On 08/09/2012 05:30 AM, Peng Tao wrote:

> On Thu, Aug 9, 2012 at 2:57 AM, Myklebust, Trond
> <[email protected]> wrote:
>> On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
>>> From: Peng Tao <[email protected]>
>>>
>>> We don't have the real IO size information in buffer read, but for direct
>>> read, we can get it from dreq->bytes_left. Let's make use of it.
>>
>> What's wrong with using the nfs_readpages arguments? The readahead code
>> will tell you exactly how many pages it is trying to read.
>>
> The nr_pages information gets dropped before calling into pnfs code...
>
> How about something like bellow for buffer reads? (untested patch and
> needs to integrate with DIO size for sure).
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 2e00fea..cec79c2 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1162,7 +1162,7 @@ pnfs_generic_pg_init_read(struct
> nfs_pageio_descriptor *pgio, struct nfs_page *r
> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
> req->wb_context,
> req_offset(req),
> - req->wb_bytes,
> + (unsigned)pgio->pg_layout_private,
> IOMODE_READ,
> GFP_KERNEL);
> /* If no lseg, fall back to read through mds */
> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
> index b6bdb18..64fb0e2 100644
> --- a/fs/nfs/read.c
> +++ b/fs/nfs/read.c
> @@ -649,6 +649,7 @@ int nfs_readpages(struct file *filp, struct
> address_space *mapping,
>
> NFS_PROTO(inode)->read_pageio_init(&pgio, inode,
> &nfs_async_read_completion_ops);
>
> + pgio.pg_layout_private = (void *)nr_pages;


NACK!!! pgio.pg_layout_private is for use by the LD only. Generic code should
never touch it. If you actually need it, and this info is really lost, please
use a new member.

> ret = read_cache_pages(mapping, pages, readpage_async_filler, &desc);
>
> nfs_pageio_complete(&pgio);
>
>


For reads, for both files and objects, but I think also for blocks, the optimum
size is just offset-to-i_size. I do not see any resource constraint at the server.
If the extent list gets too big to be sent as one RPC for the block layout (as can
be the case for objects when spanning groups), the server can just send the biggest
chunk that fits, or pick some other optimum maximum size according to its internal
topology knowledge.
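
Something like the following is all I mean for the read side (rough, untested
sketch of pnfs_generic_pg_init_read() on top of what patch 3/3 already does):

	/* ask for everything from the request offset up to i_size and let
	 * the server trim the layout as it sees fit */
	u64 rd_size = i_size_read(pgio->pg_inode) - req_offset(req);

	if (pgio->pg_dreq != NULL)
		rd_size = nfs_dreq_bytes_left(pgio->pg_dreq);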

The server may always send a smaller layout as long as it includes "offset". If so, the client
will come back for the remainder. I test this all the time and it works perfectly.

Actually for exofs, regardless of the requested size, I always return a full group_size
for read-layouts of the group that includes "offset".
Only write-layouts need to be sliced thin in exofs.

Boaz