2014-05-15 15:57:04

by Weston Andros Adamson

Subject: [PATCH v3 00/18] nfs: support multiple requests per page

This patchset changes the read and write paths to be more flexible in dealing
with requests that are not page aligned. Until now there was a 1:1 mapping
of struct nfs_page (referred to as "nfs requests") to struct page, which
limited the client to page aligned I/O in several pNFS scenarios.

This patchset allows multiple requests per page, loosely following
the approach taken with struct buffer_head (part of kernel bio interface).
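
As a point of reference, here is a minimal sketch (our illustration, not code
from the patchset) of the linkage introduced in patch 06 ("nfs: add support
for multiple nfs reqs per page"): every request covering the same page sits
on a circular list via wb_this_page, with wb_head pointing at the head
request.

	/* Walk every nfs_page in the group sharing head->wb_page.
	 * The helper name is ours; the fields come from the patchset. */
	static void for_each_request_in_group(struct nfs_page *head)
	{
		struct nfs_page *tmp = head;

		do {
			/* tmp covers [wb_pgbase, wb_pgbase + wb_bytes) of the page */
			tmp = tmp->wb_this_page;
		} while (tmp != head);
	}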

With this patchset the client now supports:
- non-page-aligned O_DIRECT I/O to DSes (instead of reverting to MDS)
- arbitrary pnfs layout segment boundaries
- arbitrary pnfs filelayout stripe sizes

This patchset also includes a lot of cleanup - notably we no longer need
a separate code path to support rsize/wsize < PAGE_SIZE.

This new approach opens the door to many optimizations, such as not having to
flush a page on a non-contiguous write, but for the time being we are focusing
on correctness -- this patchset touches the read and write path for *all*
versions of NFS!

This has been tested against v2, v3, v4.0, and v4.1 (no pnfs) servers with
different rsize/wsize settings, and against pynfs filelayout servers hacked to
have non-page-aligned stripe sizes.

This has already been through some code review (with the resulting changes
applied), and we've been testing it pretty extensively for the last month or
more, focusing mostly on v2, v3, and v4.x (no pnfs).

The patchset applies against Trond's testing branch and also includes the fix
I posted earlier today, "pnfs: fix race in filelayout commit path", as that
race seems to be easier to hit with this patchset applied.

I'm pretty sure I didn't break anything in the object and block layouts, but
some extra attention there would be helpful.

I plan on sharing some performance numbers once I'm able to run some nfsometer
workloads.

Changes in V3:
- rebased onto Anna's newest patches, which merge pageio.c into pagelist.c

Upcoming:
- a ~5-patch series that cleans up nfs_pgio_header and nfs_pgio_data

I just pushed an updated version of my ‘pgio’ branch to:

git://git.linux-nfs.org/projects/dros/linux-nfs

The branch has the following patches:

from me (this patchset!):

7d8976b nfs: support page groups in nfs_read_completion
f0d7a82 pnfs: filelayout: support non page aligned layouts
79a9c89 pnfs: allow non page aligned pnfs layout segments
cf25fb9 pnfs: support multiple verfs per direct req
c081af5 nfs: remove data list from pgio header
6313def nfs: use > 1 request to handle bsize < PAGE_SIZE
9f89e40 nfs: chain calls to pg_test
d65d153 nfs: allow coalescing of subpage requests
12cf560 pnfs: clean up filelayout_alloc_commit_info
49fa6d6 nfs: page group support in nfs_mark_uptodate
ba94674 nfs: page group syncing in write path
00ddd94 nfs: page group syncing in read path
d6dec2d nfs: add support for multiple nfs reqs per page
1ea228a nfs: call nfs_can_coalesce_requests for every req
104ab5c nfs: modify pg_test interface to return size_t
27f3d0c nfs: remove unused arg from nfs_create_request
fc9eeaa nfs: clean up PG_* flags
366d632 pnfs: fix race in filelayout commit path

from Anna:

6e35ebf NFS: Create a common nfs_pageio_ops struct
9a0dabd NFS: Create a common generic_pg_pgios()
9a074ee NFS: Create a common multiple_pgios() function
a7c7f54 NFS: Create a common initiate_pgio() function
fd42750 NFS: Create a generic_pgio function
4003d02 NFS: Create a common pgio_error function
c9f494c NFS: Create a common rpcsetup function for reads and writes
1220495 NFS: Create a common rpc_call_ops struct
4e07fcf NFS: Create a common nfs_pgio_result_common function
1986d26 NFS: Create a common pgio_rpc_prepare function
9892bb1 NFS: Create a common rw_header_alloc and rw_header_free function
76c449c NFS: Create a common pgio_alloc and pgio_release function
4fca7e6 NFS: Move the write verifier into the nfs_pgio_header
fd279ad NFS: Create a common read and write header struct
59d6da2 NFS: Create a common read and write data struct
9726903 NFS: Create a common results structure for reads and writes
0ec19c2 NFS: Create a common argument structure for reads and writes

from Christoph:

140c1e0 nfs: remove ->read_pageio_init from rpc ops
75fd2e5 nfs: remove ->write_pageio_init from rpc ops


The diffstat of the pgio branch (including patches from Anna and Christoph):

fs/nfs/blocklayout/blocklayout.c | 38 ++-
fs/nfs/direct.c | 117 +++++++-
fs/nfs/internal.h | 33 +--
fs/nfs/nfs2xdr.c | 14 +-
fs/nfs/nfs3proc.c | 21 +-
fs/nfs/nfs3xdr.c | 16 +-
fs/nfs/nfs4_fs.h | 4 +-
fs/nfs/nfs4filelayout.c | 182 ++++++------
fs/nfs/nfs4proc.c | 54 ++--
fs/nfs/nfs4trace.h | 8 +-
fs/nfs/nfs4xdr.c | 19 +-
fs/nfs/objlayout/objio_osd.c | 24 +-
fs/nfs/objlayout/objlayout.c | 16 +-
fs/nfs/objlayout/objlayout.h | 8 +-
fs/nfs/pagelist.c | 617 ++++++++++++++++++++++++++++++++++++---
fs/nfs/pnfs.c | 146 ++++-----
fs/nfs/pnfs.h | 30 +-
fs/nfs/proc.c | 21 +-
fs/nfs/read.c | 414 +++++---------------------
fs/nfs/write.c | 561 ++++++++++-------------------------
include/linux/nfs.h | 5 +-
include/linux/nfs_fs.h | 2 -
include/linux/nfs_page.h | 46 ++-
include/linux/nfs_xdr.h | 106 +++----
24 files changed, 1276 insertions(+), 1226 deletions(-)

-dros



Weston Andros Adamson (18):
pnfs: fix race in filelayout commit path
nfs: clean up PG_* flags
nfs: remove unused arg from nfs_create_request
nfs: modify pg_test interface to return size_t
nfs: call nfs_can_coalesce_requests for every req
nfs: add support for multiple nfs reqs per page
nfs: page group syncing in read path
nfs: page group syncing in write path
nfs: page group support in nfs_mark_uptodate
pnfs: clean up filelayout_alloc_commit_info
nfs: allow coalescing of subpage requests
nfs: chain calls to pg_test
nfs: use > 1 request to handle bsize < PAGE_SIZE
nfs: remove data list from pgio header
pnfs: support multiple verfs per direct req
pnfs: allow non page aligned pnfs layout segments
pnfs: filelayout: support non page aligned layouts
nfs: support page groups in nfs_read_completion

fs/nfs/blocklayout/blocklayout.c | 16 +-
fs/nfs/direct.c | 111 ++++++++++-
fs/nfs/nfs4filelayout.c | 142 +++++++-------
fs/nfs/objlayout/objio_osd.c | 18 +-
fs/nfs/pagelist.c | 387 ++++++++++++++++++++++++++-------------
fs/nfs/pnfs.c | 77 ++++----
fs/nfs/pnfs.h | 3 +-
fs/nfs/read.c | 51 ++++--
fs/nfs/write.c | 119 ++++++++++--
include/linux/nfs.h | 5 +-
include/linux/nfs_page.h | 33 ++--
include/linux/nfs_xdr.h | 7 +-
12 files changed, 666 insertions(+), 303 deletions(-)

--
1.8.5.2 (Apple Git-48)



2014-05-15 15:57:36

by Weston Andros Adamson

Subject: [PATCH v3 17/18] pnfs: filelayout: support non page aligned layouts

Use the new pg_test interface to adjust requests to fit in the current
stripe / segment.

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/nfs4filelayout.c | 50 ++++++++++++++++++-------------------------------
1 file changed, 18 insertions(+), 32 deletions(-)

diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index 0ebc521..9419061 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -639,7 +639,6 @@ filelayout_check_layout(struct pnfs_layout_hdr *lo,
struct nfs4_deviceid_node *d;
struct nfs4_file_layout_dsaddr *dsaddr;
int status = -EINVAL;
- struct nfs_server *nfss = NFS_SERVER(lo->plh_inode);

dprintk("--> %s\n", __func__);

@@ -657,7 +656,7 @@ filelayout_check_layout(struct pnfs_layout_hdr *lo,
goto out;
}

- if (!fl->stripe_unit || fl->stripe_unit % PAGE_SIZE) {
+ if (!fl->stripe_unit) {
dprintk("%s Invalid stripe unit (%u)\n",
__func__, fl->stripe_unit);
goto out;
@@ -694,12 +693,6 @@ filelayout_check_layout(struct pnfs_layout_hdr *lo,
goto out_put;
}

- if (fl->stripe_unit % nfss->rsize || fl->stripe_unit % nfss->wsize) {
- dprintk("%s Stripe unit (%u) not aligned with rsize %u "
- "wsize %u\n", __func__, fl->stripe_unit, nfss->rsize,
- nfss->wsize);
- }
-
status = 0;
out:
dprintk("--> %s returns %d\n", __func__, status);
@@ -936,44 +929,40 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
{
unsigned int size;
u64 p_stripe, r_stripe;
- u32 stripe_unit;
+ u32 stripe_offset;
+ u64 segment_offset = pgio->pg_lseg->pls_range.offset;
+ u32 stripe_unit = FILELAYOUT_LSEG(pgio->pg_lseg)->stripe_unit;

/* calls nfs_generic_pg_test */
size = pnfs_generic_pg_test(pgio, prev, req);
if (!size)
return 0;

+ /* see if req and prev are in the same stripe */
if (prev) {
- p_stripe = (u64)req_offset(prev);
- r_stripe = (u64)req_offset(req);
- stripe_unit = FILELAYOUT_LSEG(pgio->pg_lseg)->stripe_unit;
-
+ p_stripe = (u64)req_offset(prev) - segment_offset;
+ r_stripe = (u64)req_offset(req) - segment_offset;
do_div(p_stripe, stripe_unit);
do_div(r_stripe, stripe_unit);

if (p_stripe != r_stripe)
return 0;
}
- return min(size, req->wb_bytes);
+
+ /* calculate remaining bytes in the current stripe */
+ stripe_offset = ((u64)req_offset(req) - segment_offset) % stripe_unit;
+ WARN_ON_ONCE(stripe_offset > stripe_unit);
+ if (stripe_offset >= stripe_unit)
+ return 0;
+ return min(stripe_unit - (unsigned int)stripe_offset, size);
}

static void
filelayout_pg_init_read(struct nfs_pageio_descriptor *pgio,
struct nfs_page *req)
{
- WARN_ON_ONCE(pgio->pg_lseg != NULL);
-
- if (req->wb_offset != req->wb_pgbase) {
- /*
- * Handling unaligned pages is difficult, because have to
- * somehow split a req in two in certain cases in the
- * pg.test code. Avoid this by just not using pnfs
- * in this case.
- */
- nfs_pageio_reset_read_mds(pgio);
- return;
- }
- pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
+ if (!pgio->pg_lseg)
+ pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
req->wb_context,
0,
NFS4_MAX_UINT64,
@@ -991,11 +980,8 @@ filelayout_pg_init_write(struct nfs_pageio_descriptor *pgio,
struct nfs_commit_info cinfo;
int status;

- WARN_ON_ONCE(pgio->pg_lseg != NULL);
-
- if (req->wb_offset != req->wb_pgbase)
- goto out_mds;
- pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
+ if (!pgio->pg_lseg)
+ pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
req->wb_context,
0,
NFS4_MAX_UINT64,
--
1.8.5.2 (Apple Git-48)
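
As a standalone illustration of the new pg_test arithmetic (hypothetical
numbers, not taken from the patch): with a non-page-aligned stripe unit, a
request is trimmed to the bytes remaining in its stripe.

	u64 segment_offset = 0;  /* pgio->pg_lseg->pls_range.offset */
	u32 stripe_unit = 1536;  /* 3 * 512; no longer needs to be a multiple of PAGE_SIZE */
	u64 req_off = 4000;      /* req_offset(req) */

	u32 stripe_offset = (req_off - segment_offset) % stripe_unit; /* 4000 % 1536 = 928 */
	u32 remaining = stripe_unit - stripe_offset;                  /* 1536 - 928 = 608 */
	/* pg_test returns min(remaining, size): the request is capped at 608
	 * bytes, so the next subrequest starts on the following stripe boundary */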


2014-05-15 15:57:34

by Weston Andros Adamson

Subject: [PATCH v3 16/18] pnfs: allow non page aligned pnfs layout segments

Remove the alignment checks that would revert to the MDS, and change pg_test
to return the maximum amount left in the segment (or allowed by other pg_test
calls), up to the size of the passed request, or 0 if no space is left.

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/pnfs.c | 25 ++++++++++---------------
1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 6ef108b..ce46a41 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1388,11 +1388,6 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r

WARN_ON_ONCE(pgio->pg_lseg != NULL);

- if (req->wb_offset != req->wb_pgbase) {
- nfs_pageio_reset_read_mds(pgio);
- return;
- }
-
if (pgio->pg_dreq == NULL)
rd_size = i_size_read(pgio->pg_inode) - req_offset(req);
else
@@ -1417,11 +1412,6 @@ pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio,
{
WARN_ON_ONCE(pgio->pg_lseg != NULL);

- if (req->wb_offset != req->wb_pgbase) {
- nfs_pageio_reset_write_mds(pgio);
- return;
- }
-
pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
req->wb_context,
req_offset(req),
@@ -1443,9 +1433,9 @@ pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
struct nfs_page *req)
{
unsigned int size;
+ u64 end;

size = nfs_generic_pg_test(pgio, prev, req);
-
if (!size)
return 0;

@@ -1463,11 +1453,16 @@ pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
* first byte that lies outside the pnfs_layout_range. FIXME?
*
*/
- if (req_offset(req) >= end_offset(pgio->pg_lseg->pls_range.offset,
- pgio->pg_lseg->pls_range.length))
- return 0;
+ if (pgio->pg_lseg) {
+ end = end_offset(pgio->pg_lseg->pls_range.offset,
+ pgio->pg_lseg->pls_range.length);
+ WARN_ON_ONCE(req_offset(req) > end);
+ if (req_offset(req) >= end)
+ return 0;
+ size = min((unsigned int)(end - req_offset(req)), size);
+ }

- return min(size, req->wb_bytes);
+ return size;
}
EXPORT_SYMBOL_GPL(pnfs_generic_pg_test);

--
1.8.5.2 (Apple Git-48)
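
For a concrete (hypothetical) example of the new clamping: take a layout
segment covering bytes [8192, 18192) and a request starting at offset 17000.

	u64 seg_start = 8192, seg_len = 10000;
	u64 end = seg_start + seg_len;  /* end_offset(offset, length) = 18192 */
	u64 req_off = 17000;            /* inside the segment */
	unsigned int size = 4096;       /* allowed by nfs_generic_pg_test */

	size = min((unsigned int)(end - req_off), size);  /* min(1192, 4096) = 1192 */
	/* the remaining 2904 bytes are re-queued as a new subrequest, which
	 * may then pick up a different layout segment */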


2014-05-15 15:57:13

by Weston Andros Adamson

Subject: [PATCH v3 05/18] nfs: call nfs_can_coalesce_requests for every req

Call nfs_can_coalesce_requests for every request, even the first one.
This is needed for future patches to give pg_test a way to inform
add_request to reduce the size of the request.

As a result, @prev can now be NULL in nfs_can_coalesce_requests and in the pg_test functions.

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/nfs4filelayout.c | 3 +++
fs/nfs/pagelist.c | 34 +++++++++++++++++++---------------
2 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index ba9a9aa..9319427 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -929,6 +929,9 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
!nfs_generic_pg_test(pgio, prev, req))
return 0;

+ if (!prev)
+ return req->wb_bytes;
+
p_stripe = (u64)req_offset(prev);
r_stripe = (u64)req_offset(req);
stripe_unit = FILELAYOUT_LSEG(pgio->pg_lseg)->stripe_unit;
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 8d9e457..a71655a 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -292,6 +292,8 @@ nfs_wait_on_request(struct nfs_page *req)
size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
struct nfs_page *prev, struct nfs_page *req)
{
+ if (!prev)
+ return req->wb_bytes;
/*
* FIXME: ideally we should be able to coalesce all requests
* that are not block boundary aligned, but currently this
@@ -762,17 +764,20 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev,
{
size_t size;

- if (!nfs_match_open_context(req->wb_context, prev->wb_context))
- return false;
- if (req->wb_context->dentry->d_inode->i_flock != NULL &&
- !nfs_match_lock_context(req->wb_lock_context, prev->wb_lock_context))
- return false;
- if (req->wb_pgbase != 0)
- return false;
- if (prev->wb_pgbase + prev->wb_bytes != PAGE_CACHE_SIZE)
- return false;
- if (req_offset(req) != req_offset(prev) + prev->wb_bytes)
- return false;
+ if (prev) {
+ if (!nfs_match_open_context(req->wb_context, prev->wb_context))
+ return false;
+ if (req->wb_context->dentry->d_inode->i_flock != NULL &&
+ !nfs_match_lock_context(req->wb_lock_context,
+ prev->wb_lock_context))
+ return false;
+ if (req->wb_pgbase != 0)
+ return false;
+ if (prev->wb_pgbase + prev->wb_bytes != PAGE_CACHE_SIZE)
+ return false;
+ if (req_offset(req) != req_offset(prev) + prev->wb_bytes)
+ return false;
+ }
size = pgio->pg_ops->pg_test(pgio, prev, req);
WARN_ON_ONCE(size && size != req->wb_bytes);
return size > 0;
@@ -789,17 +794,16 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev,
static int nfs_pageio_do_add_request(struct nfs_pageio_descriptor *desc,
struct nfs_page *req)
{
+ struct nfs_page *prev = NULL;
if (desc->pg_count != 0) {
- struct nfs_page *prev;
-
prev = nfs_list_entry(desc->pg_list.prev);
- if (!nfs_can_coalesce_requests(prev, req, desc))
- return 0;
} else {
if (desc->pg_ops->pg_init)
desc->pg_ops->pg_init(desc, req);
desc->pg_base = req->wb_pgbase;
}
+ if (!nfs_can_coalesce_requests(prev, req, desc))
+ return 0;
nfs_list_remove_request(req);
nfs_list_add_request(req, &desc->pg_list);
desc->pg_count += req->wb_bytes;
--
1.8.5.2 (Apple Git-48)
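
A note on why the first request needs pg_test at all (hypothetical numbers,
tying this patch to the later segment-boundary changes):

	/*
	 * Suppose a layout segment ends at byte 18192 and the *first* request
	 * queued on a fresh descriptor starts at offset 17000 with
	 * wb_bytes = 4096. Before this patch the first request bypassed
	 * pg_test entirely and could never be trimmed; now pg_test runs with
	 * prev == NULL and, once the later patches land, can return 1192 so
	 * the request is split exactly at the segment boundary.
	 */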


2014-05-15 15:57:25

by Weston Andros Adamson

Subject: [PATCH v3 11/18] nfs: allow coalescing of subpage requests

Remove the checks that require a request to cover a whole page, allowing
subpage requests to be coalesced.

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/pagelist.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index ce3faad..44aef5a 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -922,10 +922,6 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev,
!nfs_match_lock_context(req->wb_lock_context,
prev->wb_lock_context))
return false;
- if (req->wb_pgbase != 0)
- return false;
- if (prev->wb_pgbase + prev->wb_bytes != PAGE_CACHE_SIZE)
- return false;
if (req_offset(req) != req_offset(prev) + prev->wb_bytes)
return false;
}
--
1.8.5.2 (Apple Git-48)


2014-05-19 16:20:59

by Weston Andros Adamson

Subject: Re: [PATCH v3 00/18] nfs: support multiple requests per page


On May 19, 2014, at 11:37 AM, Christoph Hellwig <[email protected]> wrote:

> On Thu, May 15, 2014 at 11:56:39AM -0400, Weston Andros Adamson wrote:
>> This patchset changes the read and write paths to be more flexible in dealing
>> with requests that are not page aligned. Until now there was a 1:1 mapping
>> of struct nfs_page (referred to as "nfs requests") to struct page, which
>
> Any chance you could throw in a rename nfs_page to nfs_request patch
> at the end of the series? The nfs_page name is highly confusing,
> and gets even more so with the series.
>

I was thinking the same thing! It will be a huge change, but now is the time - we're
changing everything else...

I'll post it in v2 of this patchset.
-dros

2014-05-15 15:57:19

by Weston Andros Adamson

Subject: [PATCH v3 08/18] nfs: page group syncing in write path

Operations that modify state for a whole page must be synchronized across
all requests within a page group. In the write path, this means calling
end_page_writeback and removing the head request from the inode.
Neither of these operations should be performed until all requests
in a page group have reached the point where they would perform them.

This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/pagelist.c | 2 ++
fs/nfs/write.c | 32 ++++++++++++++++++++------------
include/linux/nfs_page.h | 2 ++
3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 87cdb4b..ce3faad 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -397,6 +397,8 @@ static void nfs_free_request(struct nfs_page *req)
WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));
WARN_ON_ONCE(test_bit(PG_UNLOCKPAGE, &req->wb_flags));
WARN_ON_ONCE(test_bit(PG_UPTODATE, &req->wb_flags));
+ WARN_ON_ONCE(test_bit(PG_WB_END, &req->wb_flags));
+ WARN_ON_ONCE(test_bit(PG_REMOVE, &req->wb_flags));

/* Release struct file and open context */
nfs_clear_request(req);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index d0f30f1..5d75276 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -201,12 +201,15 @@ static void nfs_set_page_writeback(struct page *page)
}
}

-static void nfs_end_page_writeback(struct page *page)
+static void nfs_end_page_writeback(struct nfs_page *req)
{
- struct inode *inode = page_file_mapping(page)->host;
+ struct inode *inode = page_file_mapping(req->wb_page)->host;
struct nfs_server *nfss = NFS_SERVER(inode);

- end_page_writeback(page);
+ if (!nfs_page_group_sync_on_bit(req, PG_WB_END))
+ return;
+
+ end_page_writeback(req->wb_page);
if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
}
@@ -397,15 +400,20 @@ static void nfs_inode_remove_request(struct nfs_page *req)
{
struct inode *inode = req->wb_context->dentry->d_inode;
struct nfs_inode *nfsi = NFS_I(inode);
+ struct nfs_page *head;

- spin_lock(&inode->i_lock);
- if (likely(!PageSwapCache(req->wb_page))) {
- set_page_private(req->wb_page, 0);
- ClearPagePrivate(req->wb_page);
- clear_bit(PG_MAPPED, &req->wb_flags);
+ if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
+ head = req->wb_head;
+
+ spin_lock(&inode->i_lock);
+ if (likely(!PageSwapCache(head->wb_page))) {
+ set_page_private(head->wb_page, 0);
+ ClearPagePrivate(head->wb_page);
+ clear_bit(PG_MAPPED, &head->wb_flags);
+ }
+ nfsi->npages--;
+ spin_unlock(&inode->i_lock);
}
- nfsi->npages--;
- spin_unlock(&inode->i_lock);
nfs_release_request(req);
}

@@ -599,7 +607,7 @@ remove_req:
nfs_inode_remove_request(req);
next:
nfs_unlock_request(req);
- nfs_end_page_writeback(req->wb_page);
+ nfs_end_page_writeback(req);
do_destroy = !test_bit(NFS_IOHDR_NEED_COMMIT, &hdr->flags);
nfs_release_request(req);
}
@@ -964,7 +972,7 @@ static void nfs_redirty_request(struct nfs_page *req)
{
nfs_mark_request_dirty(req);
nfs_unlock_request(req);
- nfs_end_page_writeback(req->wb_page);
+ nfs_end_page_writeback(req);
nfs_release_request(req);
}

diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 6385175..7d9096d 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -31,6 +31,8 @@ enum {
PG_TEARDOWN, /* page group sync for destroy */
PG_UNLOCKPAGE, /* page group sync bit in read path */
PG_UPTODATE, /* page group sync bit in read path */
+ PG_WB_END, /* page group sync bit in write path */
+ PG_REMOVE, /* page group sync bit in write path */
};

struct nfs_inode;
--
1.8.5.2 (Apple Git-48)
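
A hypothetical timeline for a page group with two subrequests, showing how
nfs_page_group_sync_on_bit serializes end_page_writeback:

	/*
	 * Page split into subrequests A and B:
	 *
	 *   A completes: nfs_end_page_writeback(A)
	 *     -> nfs_page_group_sync_on_bit(A, PG_WB_END) sets A's bit,
	 *        sees B's bit still clear, and returns false
	 *     -> end_page_writeback() is not called yet
	 *   B completes: nfs_end_page_writeback(B)
	 *     -> sync_on_bit sets B's bit, finds it set on every request,
	 *        clears all PG_WB_END bits, and returns true
	 *     -> end_page_writeback(req->wb_page) runs exactly once
	 */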


2014-05-15 15:57:08

by Weston Andros Adamson

Subject: [PATCH v3 02/18] nfs: clean up PG_* flags

Remove unused flags PG_NEED_COMMIT and PG_NEED_RESCHED.
Add comments describing how each flag is used.

Signed-off-by: Weston Andros Adamson <[email protected]>
---
include/linux/nfs_page.h | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index c6a587f..eb2eb63 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -22,12 +22,10 @@
* Valid flags for a dirty buffer
*/
enum {
- PG_BUSY = 0,
- PG_MAPPED,
- PG_CLEAN,
- PG_NEED_COMMIT,
- PG_NEED_RESCHED,
- PG_COMMIT_TO_DS,
+ PG_BUSY = 0, /* nfs_{un}lock_request */
+ PG_MAPPED, /* page private set for buffered io */
+ PG_CLEAN, /* write succeeded */
+ PG_COMMIT_TO_DS, /* used by pnfs layouts */
};

struct nfs_inode;
--
1.8.5.2 (Apple Git-48)


2014-05-15 15:57:17

by Weston Andros Adamson

Subject: [PATCH v3 07/18] nfs: page group syncing in read path

Operations that modify state for a whole page must be synchronized across
all requests within a page group. In the read path, this means calling
unlock_page and SetPageUptodate. Neither of these functions should be
called until all requests in a page group have reached the point where
they would call them.

This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/pagelist.c | 2 ++
fs/nfs/read.c | 22 +++++++++++++++++-----
include/linux/nfs_page.h | 2 ++
3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 517c617..87cdb4b 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -395,6 +395,8 @@ static void nfs_free_request(struct nfs_page *req)

/* extra debug: make sure no sync bits are still set */
WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));
+ WARN_ON_ONCE(test_bit(PG_UNLOCKPAGE, &req->wb_flags));
+ WARN_ON_ONCE(test_bit(PG_UPTODATE, &req->wb_flags));

/* Release struct file and open context */
nfs_clear_request(req);
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 902ba2c..53d5b83 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -105,10 +105,16 @@ static void nfs_readpage_release(struct nfs_page *req)
{
struct inode *d_inode = req->wb_context->dentry->d_inode;

- if (PageUptodate(req->wb_page))
- nfs_readpage_to_fscache(d_inode, req->wb_page, 0);
+ dprintk("NFS: read done (%s/%llu %d@%lld)\n", d_inode->i_sb->s_id,
+ (unsigned long long)NFS_FILEID(d_inode), req->wb_bytes,
+ (long long)req_offset(req));

- unlock_page(req->wb_page);
+ if (nfs_page_group_sync_on_bit(req, PG_UNLOCKPAGE)) {
+ if (PageUptodate(req->wb_page))
+ nfs_readpage_to_fscache(d_inode, req->wb_page, 0);
+
+ unlock_page(req->wb_page);
+ }

dprintk("NFS: read done (%s/%Lu %d@%Ld)\n",
req->wb_context->dentry->d_inode->i_sb->s_id,
@@ -118,6 +124,12 @@ static void nfs_readpage_release(struct nfs_page *req)
nfs_release_request(req);
}

+static void nfs_page_group_set_uptodate(struct nfs_page *req)
+{
+ if (nfs_page_group_sync_on_bit(req, PG_UPTODATE))
+ SetPageUptodate(req->wb_page);
+}
+
/* Note io was page aligned */
static void nfs_read_completion(struct nfs_pgio_header *hdr)
{
@@ -140,9 +152,9 @@ static void nfs_read_completion(struct nfs_pgio_header *hdr)
bytes += req->wb_bytes;
if (test_bit(NFS_IOHDR_ERROR, &hdr->flags)) {
if (bytes <= hdr->good_bytes)
- SetPageUptodate(page);
+ nfs_page_group_set_uptodate(req);
} else
- SetPageUptodate(page);
+ nfs_page_group_set_uptodate(req);
nfs_list_remove_request(req);
nfs_readpage_release(req);
}
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 986c0c2..6385175 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -29,6 +29,8 @@ enum {
PG_INODE_REF, /* extra ref held by inode (head req only) */
PG_HEADLOCK, /* page group lock of wb_head */
PG_TEARDOWN, /* page group sync for destroy */
+ PG_UNLOCKPAGE, /* page group sync bit in read path */
+ PG_UPTODATE, /* page group sync bit in read path */
};

struct nfs_inode;
--
1.8.5.2 (Apple Git-48)
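
The same pattern guards SetPageUptodate; a hypothetical sketch with two
subrequests:

	/*
	 * Page covered by subrequests A (bytes 0-2047) and B (2048-4095).
	 * nfs_page_group_set_uptodate() is called once per successfully
	 * completed subrequest, but only the call that finds PG_UPTODATE set
	 * on every request in the group calls SetPageUptodate(), so a short
	 * read or an error on either half leaves the page !Uptodate.
	 */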


2014-05-15 21:12:51

by Anna Schumaker

Subject: Re: [PATCH v3 06/18] nfs: add support for multiple nfs reqs per page

On 05/15/2014 11:56 AM, Weston Andros Adamson wrote:
> Add "page groups" - a circular list of nfs requests (struct nfs_page)
> that all reference the same page. This gives nfs read and write paths
> the ability to account for sub-page regions independently. This
> somewhat follows the design of struct buffer_head's sub-page
> accounting.
>
> Only "head" requests are ever added/removed from the inode list in
> the buffered write path. "head" and "sub" requests are treated the
> same through the read path and the rest of the write/commit path.
> Requests are given an extra reference across the life of the list.
>
> Page groups are never rejoined after being split. If the read/write
> request fails and the client falls back to another path (ie revert
> to MDS in PNFS case), the already split requests are pushed through
> the recoalescing code again, which may split them further and then
> coalesce them into properly sized requests on the wire. Fragmentation
> shouldn't be a problem with the current design, because we flush all
> requests in page group when a non-contiguous request is added, so
> the only time resplitting should occur is on a resend of a read or
> write.
>
> This patch lays the groundwork for sub-page splitting, but does not
> actually do any splitting. For now all page groups have one request
> as pg_test functions don't yet split pages. There are several related
> patches that are needed to support multiple requests per page group.
>
> Signed-off-by: Weston Andros Adamson <[email protected]>
> ---
> fs/nfs/direct.c | 7 +-
> fs/nfs/pagelist.c | 220 ++++++++++++++++++++++++++++++++++++++++++++---
> fs/nfs/read.c | 4 +-
> fs/nfs/write.c | 13 ++-
> include/linux/nfs_page.h | 13 ++-
> 5 files changed, 236 insertions(+), 21 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 1dd8c62..2c0e08f 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -380,7 +380,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de
> struct nfs_page *req;
> unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
> /* XXX do we need to do the eof zeroing found in async_filler? */
> - req = nfs_create_request(dreq->ctx, pagevec[i],
> + req = nfs_create_request(dreq->ctx, pagevec[i], NULL,
> pgbase, req_len);
> if (IS_ERR(req)) {
> result = PTR_ERR(req);
> @@ -749,7 +749,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *d
> struct nfs_page *req;
> unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
>
> - req = nfs_create_request(dreq->ctx, pagevec[i],
> + req = nfs_create_request(dreq->ctx, pagevec[i], NULL,
> pgbase, req_len);
> if (IS_ERR(req)) {
> result = PTR_ERR(req);
> @@ -827,6 +827,8 @@ static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
> spin_unlock(&dreq->lock);
>
> while (!list_empty(&hdr->pages)) {
> + bool do_destroy = true;
> +
> req = nfs_list_entry(hdr->pages.next);
> nfs_list_remove_request(req);
> switch (bit) {
> @@ -834,6 +836,7 @@ static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
> case NFS_IOHDR_NEED_COMMIT:
> kref_get(&req->wb_kref);
> nfs_mark_request_commit(req, hdr->lseg, &cinfo);
> + do_destroy = false;
> }
> nfs_unlock_and_release_request(req);
> }
> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
> index a71655a..517c617 100644
> --- a/fs/nfs/pagelist.c
> +++ b/fs/nfs/pagelist.c
> @@ -29,6 +29,8 @@
> static struct kmem_cache *nfs_page_cachep;
> static const struct rpc_call_ops nfs_pgio_common_ops;
>
> +static void nfs_free_request(struct nfs_page *);
> +
> static bool nfs_pgarray_set(struct nfs_page_array *p, unsigned int pagecount)
> {
> p->npages = pagecount;
> @@ -136,10 +138,151 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
> return __nfs_iocounter_wait(c);
> }
>
> +/*
> + * nfs_page_group_lock - lock the head of the page group
> + * @req - request in group that is to be locked
> + *
> + * this lock must be held if modifying the page group list
> + */
> +void
> +nfs_page_group_lock(struct nfs_page *req)
> +{
> + struct nfs_page *head = req->wb_head;
> + int err = -EAGAIN;
> +
> + WARN_ON_ONCE(head != head->wb_head);
> +
> + while (err)
> + err = wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
> + nfs_wait_bit_killable, TASK_KILLABLE);
> +}
> +
> +/*
> + * nfs_page_group_unlock - unlock the head of the page group
> + * @req - request in group that is to be unlocked
> + */
> +void
> +nfs_page_group_unlock(struct nfs_page *req)
> +{
> + struct nfs_page *head = req->wb_head;
> +
> + WARN_ON_ONCE(head != head->wb_head);
> +
> + smp_mb__before_clear_bit();
> + clear_bit(PG_HEADLOCK, &head->wb_flags);
> + smp_mb__after_clear_bit();
> + wake_up_bit(&head->wb_flags, PG_HEADLOCK);
> +}
> +
> +/*
> + * nfs_page_group_sync_on_bit_locked
> + *
> + * must be called with page group lock held
> + */
> +static bool
> +nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
> +{
> + struct nfs_page *head = req->wb_head;
> + struct nfs_page *tmp;
> +
> + WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &head->wb_flags));
> + WARN_ON_ONCE(test_and_set_bit(bit, &req->wb_flags));
> +
> + tmp = req->wb_this_page;
> + while (tmp != req) {
> + if (!test_bit(bit, &tmp->wb_flags))
> + return false;
> + tmp = tmp->wb_this_page;
> + }
> +
> + /* true! reset all bits */
> + tmp = req;
> + do {
> + clear_bit(bit, &tmp->wb_flags);
> + tmp = tmp->wb_this_page;
> + } while (tmp != req);
> +
> + return true;
> +}
> +
> +/*
> + * nfs_page_group_sync_on_bit - set bit on current request, but only
> + * return true if the bit is set for all requests in page group
> + * @req - request in page group
> + * @bit - PG_* bit that is used to sync page group
> + */
> +bool nfs_page_group_sync_on_bit(struct nfs_page *req, unsigned int bit)
> +{
> + bool ret;
> +
> + nfs_page_group_lock(req);
> + ret = nfs_page_group_sync_on_bit_locked(req, bit);
> + nfs_page_group_unlock(req);
> +
> + return ret;
> +}
> +
> +/*
> + * nfs_page_group_init - Initialize the page group linkage for @req
> + * @req - a new nfs request
> + * @prev - the previous request in page group, or NULL if @req is the first
> + * or only request in the group (the head).
> + */
> +static inline void
> +nfs_page_group_init(struct nfs_page *req, struct nfs_page *prev)
> +{
> + WARN_ON_ONCE(prev == req);
> +
> + if (!prev) {
> + req->wb_head = req;
> + req->wb_this_page = req;
> + } else {
> + WARN_ON_ONCE(prev->wb_this_page != prev->wb_head);
> + WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &prev->wb_head->wb_flags));
> + req->wb_head = prev->wb_head;
> + req->wb_this_page = prev->wb_this_page;
> + prev->wb_this_page = req;
> +
> + /* grab extra ref if head request has extra ref from
> + * the write/commit path to handle handoff between write
> + * and commit lists */
> + if (test_bit(PG_INODE_REF, &prev->wb_head->wb_flags))
> + kref_get(&req->wb_kref);
> + }
> +}
> +
> +/*
> + * nfs_page_group_destroy - sync the destruction of page groups
> + * @req - request that no longer needs the page group
> + *
> + * releases the page group reference from each member once all
> + * members have called this function.
> + */
> +static void
> +nfs_page_group_destroy(struct kref *kref)
> +{
> + struct nfs_page *req = container_of(kref, struct nfs_page, wb_kref);
> + struct nfs_page *tmp, *next;
> +
> + if (!nfs_page_group_sync_on_bit(req, PG_TEARDOWN))
> + return;
> +
> + tmp = req;
> + do {
> + next = tmp->wb_this_page;
> + /* unlink and free */
> + tmp->wb_this_page = tmp;
> + tmp->wb_head = tmp;
> + nfs_free_request(tmp);
> + tmp = next;
> + } while (tmp != req);
> +}
> +
> /**
> * nfs_create_request - Create an NFS read/write request.
> * @ctx: open context to use
> * @page: page to write
> + * @last: last nfs request created for this page group or NULL if head
> * @offset: starting offset within the page for the write
> * @count: number of bytes to read/write
> *
> @@ -149,7 +292,8 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
> */
> struct nfs_page *
> nfs_create_request(struct nfs_open_context *ctx, struct page *page,
> - unsigned int offset, unsigned int count)
> + struct nfs_page *last, unsigned int offset,
> + unsigned int count)
> {
> struct nfs_page *req;
> struct nfs_lock_context *l_ctx;
> @@ -181,6 +325,7 @@ nfs_create_request(struct nfs_open_context *ctx, struct page *page,
> req->wb_bytes = count;
> req->wb_context = get_nfs_open_context(ctx);
> kref_init(&req->wb_kref);
> + nfs_page_group_init(req, last);
> return req;
> }
>
> @@ -238,16 +383,18 @@ static void nfs_clear_request(struct nfs_page *req)
> }
> }
>
> -
> /**
> * nfs_release_request - Release the count on an NFS read/write request
> * @req: request to release
> *
> * Note: Should never be called with the spinlock held!
> */
> -static void nfs_free_request(struct kref *kref)
> +static void nfs_free_request(struct nfs_page *req)
> {
> - struct nfs_page *req = container_of(kref, struct nfs_page, wb_kref);
> + WARN_ON_ONCE(req->wb_this_page != req);
> +
> + /* extra debug: make sure no sync bits are still set */
> + WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));
>
> /* Release struct file and open context */
> nfs_clear_request(req);
> @@ -256,7 +403,7 @@ static void nfs_free_request(struct kref *kref)
>
> void nfs_release_request(struct nfs_page *req)
> {
> - kref_put(&req->wb_kref, nfs_free_request);
> + kref_put(&req->wb_kref, nfs_page_group_destroy);
> }
>
> static int nfs_wait_bit_uninterruptible(void *word)
> @@ -833,21 +980,66 @@ static void nfs_pageio_doio(struct nfs_pageio_descriptor *desc)
> * @desc: destination io descriptor
> * @req: request
> *
> + * This may split a request into subrequests which are all part of the
> + * same page group.
> + *
> * Returns true if the request 'req' was successfully coalesced into the
> * existing list of pages 'desc'.
> */
> static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
> struct nfs_page *req)
> {
> - while (!nfs_pageio_do_add_request(desc, req)) {
> - desc->pg_moreio = 1;
> - nfs_pageio_doio(desc);
> - if (desc->pg_error < 0)
> - return 0;
> - desc->pg_moreio = 0;
> - if (desc->pg_recoalesce)
> - return 0;
> - }
> + struct nfs_page *subreq;
> + unsigned int bytes_left = 0;
> + unsigned int offset, pgbase;
> +
> + nfs_page_group_lock(req);
> +
> + subreq = req;
> + bytes_left = subreq->wb_bytes;
> + offset = subreq->wb_offset;
> + pgbase = subreq->wb_pgbase;
> +
> + do {
> + if (!nfs_pageio_do_add_request(desc, subreq)) {
> + /* make sure pg_test call(s) did nothing */
> + WARN_ON_ONCE(subreq->wb_bytes != bytes_left);
> + WARN_ON_ONCE(subreq->wb_offset != offset);
> + WARN_ON_ONCE(subreq->wb_pgbase != pgbase);
> +
> + nfs_page_group_unlock(req);
> + desc->pg_moreio = 1;
> + nfs_pageio_doio(desc);
> + if (desc->pg_error < 0)
> + return 0;
> + desc->pg_moreio = 0;
> + if (desc->pg_recoalesce)
> + return 0;

Would it make sense to 'or' together the desc->pg_error and desc->pg_recoalesce checks, rather than having two distinct if statements with the same return value?

Anna

> + /* retry add_request for this subreq */
> + nfs_page_group_lock(req);
> + continue;
> + }
> +
> + /* check for buggy pg_test call(s) */
> + WARN_ON_ONCE(subreq->wb_bytes + subreq->wb_pgbase > PAGE_SIZE);
> + WARN_ON_ONCE(subreq->wb_bytes > bytes_left);
> + WARN_ON_ONCE(subreq->wb_bytes == 0);
> +
> + bytes_left -= subreq->wb_bytes;
> + offset += subreq->wb_bytes;
> + pgbase += subreq->wb_bytes;
> +
> + if (bytes_left) {
> + subreq = nfs_create_request(req->wb_context,
> + req->wb_page,
> + subreq, pgbase, bytes_left);
> + nfs_lock_request(subreq);
> + subreq->wb_offset = offset;
> + subreq->wb_index = req->wb_index;
> + }
> + } while (bytes_left > 0);
> +
> + nfs_page_group_unlock(req);
> return 1;
> }
>
> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
> index 46d9044..902ba2c 100644
> --- a/fs/nfs/read.c
> +++ b/fs/nfs/read.c
> @@ -85,7 +85,7 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
> len = nfs_page_length(page);
> if (len == 0)
> return nfs_return_empty_page(page);
> - new = nfs_create_request(ctx, page, 0, len);
> + new = nfs_create_request(ctx, page, NULL, 0, len);
> if (IS_ERR(new)) {
> unlock_page(page);
> return PTR_ERR(new);
> @@ -311,7 +311,7 @@ readpage_async_filler(void *data, struct page *page)
> if (len == 0)
> return nfs_return_empty_page(page);
>
> - new = nfs_create_request(desc->ctx, page, 0, len);
> + new = nfs_create_request(desc->ctx, page, NULL, 0, len);
> if (IS_ERR(new))
> goto out_error;
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index e773df2..d0f30f1 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -367,6 +367,8 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
> {
> struct nfs_inode *nfsi = NFS_I(inode);
>
> + WARN_ON_ONCE(req->wb_this_page != req);
> +
> /* Lock the request! */
> nfs_lock_request(req);
>
> @@ -383,6 +385,7 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
> set_page_private(req->wb_page, (unsigned long)req);
> }
> nfsi->npages++;
> + set_bit(PG_INODE_REF, &req->wb_flags);
> kref_get(&req->wb_kref);
> spin_unlock(&inode->i_lock);
> }
> @@ -567,6 +570,7 @@ static void nfs_write_completion(struct nfs_pgio_header *hdr)
> {
> struct nfs_commit_info cinfo;
> unsigned long bytes = 0;
> + bool do_destroy;
>
> if (test_bit(NFS_IOHDR_REDO, &hdr->flags))
> goto out;
> @@ -596,6 +600,7 @@ remove_req:
> next:
> nfs_unlock_request(req);
> nfs_end_page_writeback(req->wb_page);
> + do_destroy = !test_bit(NFS_IOHDR_NEED_COMMIT, &hdr->flags);
> nfs_release_request(req);
> }
> out:
> @@ -700,6 +705,10 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
> if (req == NULL)
> goto out_unlock;
>
> + /* should be handled by nfs_flush_incompatible */
> + WARN_ON_ONCE(req->wb_head != req);
> + WARN_ON_ONCE(req->wb_this_page != req);
> +
> rqend = req->wb_offset + req->wb_bytes;
> /*
> * Tell the caller to flush out the request if
> @@ -761,7 +770,7 @@ static struct nfs_page * nfs_setup_write_request(struct nfs_open_context* ctx,
> req = nfs_try_to_update_request(inode, page, offset, bytes);
> if (req != NULL)
> goto out;
> - req = nfs_create_request(ctx, page, offset, bytes);
> + req = nfs_create_request(ctx, page, NULL, offset, bytes);
> if (IS_ERR(req))
> goto out;
> nfs_inode_add_request(inode, req);
> @@ -805,6 +814,8 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
> return 0;
> l_ctx = req->wb_lock_context;
> do_flush = req->wb_page != page || req->wb_context != ctx;
> + /* for now, flush if more than 1 request in page_group */
> + do_flush |= req->wb_this_page != req;
> if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
> do_flush |= l_ctx->lockowner.l_owner != current->files
> || l_ctx->lockowner.l_pid != current->tgid;
> diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
> index 13d59af..986c0c2 100644
> --- a/include/linux/nfs_page.h
> +++ b/include/linux/nfs_page.h
> @@ -26,6 +26,9 @@ enum {
> PG_MAPPED, /* page private set for buffered io */
> PG_CLEAN, /* write succeeded */
> PG_COMMIT_TO_DS, /* used by pnfs layouts */
> + PG_INODE_REF, /* extra ref held by inode (head req only) */
> + PG_HEADLOCK, /* page group lock of wb_head */
> + PG_TEARDOWN, /* page group sync for destroy */
> };
>
> struct nfs_inode;
> @@ -41,6 +44,8 @@ struct nfs_page {
> struct kref wb_kref; /* reference count */
> unsigned long wb_flags;
> struct nfs_write_verifier wb_verf; /* Commit cookie */
> + struct nfs_page *wb_this_page; /* list of reqs for this page */
> + struct nfs_page *wb_head; /* head pointer for req list */
> };
>
> struct nfs_pageio_descriptor;
> @@ -87,9 +92,10 @@ struct nfs_pageio_descriptor {
>
> extern struct nfs_page *nfs_create_request(struct nfs_open_context *ctx,
> struct page *page,
> + struct nfs_page *last,
> unsigned int offset,
> unsigned int count);
> -extern void nfs_release_request(struct nfs_page *req);
> +extern void nfs_release_request(struct nfs_page *);
>
>
> extern void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
> @@ -108,7 +114,10 @@ extern size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
> struct nfs_page *req);
> extern int nfs_wait_on_request(struct nfs_page *);
> extern void nfs_unlock_request(struct nfs_page *req);
> -extern void nfs_unlock_and_release_request(struct nfs_page *req);
> +extern void nfs_unlock_and_release_request(struct nfs_page *);
> +extern void nfs_page_group_lock(struct nfs_page *);
> +extern void nfs_page_group_unlock(struct nfs_page *);
> +extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
>
> /*
> * Lock the page of an asynchronous request
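
To make the splitting loop in __nfs_pageio_add_request above concrete (the
numbers are hypothetical, and actual trimming only happens once the later
pg_test patches in the series are applied):

	/*
	 * A 4096-byte request on a descriptor whose pg_test caps coalescing
	 * at 1024 bytes:
	 *
	 *   pass 1: subreq trimmed to 1024; bytes_left = 3072;
	 *           nfs_create_request() links a new subreq via wb_this_page
	 *   pass 2: 1024 added; bytes_left = 2048
	 *   pass 3: 1024 added; bytes_left = 1024
	 *   pass 4: final 1024 added; bytes_left = 0, loop exits
	 *
	 * Result: one page group of four requests sharing req->wb_page, each
	 * coalesced and issued as its own on-the-wire I/O.
	 */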


2014-05-16 12:58:04

by Anna Schumaker

Subject: Re: [PATCH v3 06/18] nfs: add support for multiple nfs reqs per page

On 05/15/2014 06:19 PM, Weston Andros Adamson wrote:
> On May 15, 2014, at 5:12 PM, Anna Schumaker <[email protected]> wrote:
>
>> On 05/15/2014 11:56 AM, Weston Andros Adamson wrote:
>>> Add "page groups" - a circular list of nfs requests (struct nfs_page)
>>> that all reference the same page. This gives nfs read and write paths
>>> the ability to account for sub-page regions independently. This
>>> somewhat follows the design of struct buffer_head's sub-page
>>> accounting.
>>>
>>> Only "head" requests are ever added/removed from the inode list in
>>> the buffered write path. "head" and "sub" requests are treated the
>>> same through the read path and the rest of the write/commit path.
>>> Requests are given an extra reference across the life of the list.
>>>
>>> Page groups are never rejoined after being split. If the read/write
>>> request fails and the client falls back to another path (ie revert
>>> to MDS in PNFS case), the already split requests are pushed through
>>> the recoalescing code again, which may split them further and then
>>> coalesce them into properly sized requests on the wire. Fragmentation
>>> shouldn't be a problem with the current design, because we flush all
>>> requests in page group when a non-contiguous request is added, so
>>> the only time resplitting should occur is on a resend of a read or
>>> write.
>>>
>>> This patch lays the groundwork for sub-page splitting, but does not
>>> actually do any splitting. For now all page groups have one request
>>> as pg_test functions don't yet split pages. There are several related
>>> patches that are needed to support multiple requests per page group.
>>>
>>> Signed-off-by: Weston Andros Adamson <[email protected]>
>>> ---
>>> fs/nfs/direct.c | 7 +-
>>> fs/nfs/pagelist.c | 220 ++++++++++++++++++++++++++++++++++++++++++++---
>>> fs/nfs/read.c | 4 +-
>>> fs/nfs/write.c | 13 ++-
>>> include/linux/nfs_page.h | 13 ++-
>>> 5 files changed, 236 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
>>> index 1dd8c62..2c0e08f 100644
>>> --- a/fs/nfs/direct.c
>>> +++ b/fs/nfs/direct.c
>>> @@ -380,7 +380,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de
>>> struct nfs_page *req;
>>> unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
>>> /* XXX do we need to do the eof zeroing found in async_filler? */
>>> - req = nfs_create_request(dreq->ctx, pagevec[i],
>>> + req = nfs_create_request(dreq->ctx, pagevec[i], NULL,
>>> pgbase, req_len);
>>> if (IS_ERR(req)) {
>>> result = PTR_ERR(req);
>>> @@ -749,7 +749,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *d
>>> struct nfs_page *req;
>>> unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
>>>
>>> - req = nfs_create_request(dreq->ctx, pagevec[i],
>>> + req = nfs_create_request(dreq->ctx, pagevec[i], NULL,
>>> pgbase, req_len);
>>> if (IS_ERR(req)) {
>>> result = PTR_ERR(req);
>>> @@ -827,6 +827,8 @@ static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
>>> spin_unlock(&dreq->lock);
>>>
>>> while (!list_empty(&hdr->pages)) {
>>> + bool do_destroy = true;
>>> +
>>> req = nfs_list_entry(hdr->pages.next);
>>> nfs_list_remove_request(req);
>>> switch (bit) {
>>> @@ -834,6 +836,7 @@ static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
>>> case NFS_IOHDR_NEED_COMMIT:
>>> kref_get(&req->wb_kref);
>>> nfs_mark_request_commit(req, hdr->lseg, &cinfo);
>>> + do_destroy = false;
>>> }
>>> nfs_unlock_and_release_request(req);
>>> }
>>> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
>>> index a71655a..517c617 100644
>>> --- a/fs/nfs/pagelist.c
>>> +++ b/fs/nfs/pagelist.c
>>> @@ -29,6 +29,8 @@
>>> static struct kmem_cache *nfs_page_cachep;
>>> static const struct rpc_call_ops nfs_pgio_common_ops;
>>>
>>> +static void nfs_free_request(struct nfs_page *);
>>> +
>>> static bool nfs_pgarray_set(struct nfs_page_array *p, unsigned int pagecount)
>>> {
>>> p->npages = pagecount;
>>> @@ -136,10 +138,151 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
>>> return __nfs_iocounter_wait(c);
>>> }
>>>
>>> +/*
>>> + * nfs_page_group_lock - lock the head of the page group
>>> + * @req - request in group that is to be locked
>>> + *
>>> + * this lock must be held if modifying the page group list
>>> + */
>>> +void
>>> +nfs_page_group_lock(struct nfs_page *req)
>>> +{
>>> + struct nfs_page *head = req->wb_head;
>>> + int err = -EAGAIN;
>>> +
>>> + WARN_ON_ONCE(head != head->wb_head);
>>> +
>>> + while (err)
>>> + err = wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
>>> + nfs_wait_bit_killable, TASK_KILLABLE);
>>> +}
>>> +
>>> +/*
>>> + * nfs_page_group_unlock - unlock the head of the page group
>>> + * @req - request in group that is to be unlocked
>>> + */
>>> +void
>>> +nfs_page_group_unlock(struct nfs_page *req)
>>> +{
>>> + struct nfs_page *head = req->wb_head;
>>> +
>>> + WARN_ON_ONCE(head != head->wb_head);
>>> +
>>> + smp_mb__before_clear_bit();
>>> + clear_bit(PG_HEADLOCK, &head->wb_flags);
>>> + smp_mb__after_clear_bit();
>>> + wake_up_bit(&head->wb_flags, PG_HEADLOCK);
>>> +}
>>> +
>>> +/*
>>> + * nfs_page_group_sync_on_bit_locked
>>> + *
>>> + * must be called with page group lock held
>>> + */
>>> +static bool
>>> +nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
>>> +{
>>> + struct nfs_page *head = req->wb_head;
>>> + struct nfs_page *tmp;
>>> +
>>> + WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &head->wb_flags));
>>> + WARN_ON_ONCE(test_and_set_bit(bit, &req->wb_flags));
>>> +
>>> + tmp = req->wb_this_page;
>>> + while (tmp != req) {
>>> + if (!test_bit(bit, &tmp->wb_flags))
>>> + return false;
>>> + tmp = tmp->wb_this_page;
>>> + }
>>> +
>>> + /* true! reset all bits */
>>> + tmp = req;
>>> + do {
>>> + clear_bit(bit, &tmp->wb_flags);
>>> + tmp = tmp->wb_this_page;
>>> + } while (tmp != req);
>>> +
>>> + return true;
>>> +}
>>> +
>>> +/*
>>> + * nfs_page_group_sync_on_bit - set bit on current request, but only
>>> + * return true if the bit is set for all requests in page group
>>> + * @req - request in page group
>>> + * @bit - PG_* bit that is used to sync page group
>>> + */
>>> +bool nfs_page_group_sync_on_bit(struct nfs_page *req, unsigned int bit)
>>> +{
>>> + bool ret;
>>> +
>>> + nfs_page_group_lock(req);
>>> + ret = nfs_page_group_sync_on_bit_locked(req, bit);
>>> + nfs_page_group_unlock(req);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +/*
>>> + * nfs_page_group_init - Initialize the page group linkage for @req
>>> + * @req - a new nfs request
>>> + * @prev - the previous request in page group, or NULL if @req is the first
>>> + * or only request in the group (the head).
>>> + */
>>> +static inline void
>>> +nfs_page_group_init(struct nfs_page *req, struct nfs_page *prev)
>>> +{
>>> + WARN_ON_ONCE(prev == req);
>>> +
>>> + if (!prev) {
>>> + req->wb_head = req;
>>> + req->wb_this_page = req;
>>> + } else {
>>> + WARN_ON_ONCE(prev->wb_this_page != prev->wb_head);
>>> + WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &prev->wb_head->wb_flags));
>>> + req->wb_head = prev->wb_head;
>>> + req->wb_this_page = prev->wb_this_page;
>>> + prev->wb_this_page = req;
>>> +
>>> + /* grab extra ref if head request has extra ref from
>>> + * the write/commit path to handle handoff between write
>>> + * and commit lists */
>>> + if (test_bit(PG_INODE_REF, &prev->wb_head->wb_flags))
>>> + kref_get(&req->wb_kref);
>>> + }
>>> +}
>>> +
>>> +/*
>>> + * nfs_page_group_destroy - sync the destruction of page groups
>>> + * @req - request that no longer needs the page group
>>> + *
>>> + * releases the page group reference from each member once all
>>> + * members have called this function.
>>> + */
>>> +static void
>>> +nfs_page_group_destroy(struct kref *kref)
>>> +{
>>> + struct nfs_page *req = container_of(kref, struct nfs_page, wb_kref);
>>> + struct nfs_page *tmp, *next;
>>> +
>>> + if (!nfs_page_group_sync_on_bit(req, PG_TEARDOWN))
>>> + return;
>>> +
>>> + tmp = req;
>>> + do {
>>> + next = tmp->wb_this_page;
>>> + /* unlink and free */
>>> + tmp->wb_this_page = tmp;
>>> + tmp->wb_head = tmp;
>>> + nfs_free_request(tmp);
>>> + tmp = next;
>>> + } while (tmp != req);
>>> +}
>>> +
>>> /**
>>> * nfs_create_request - Create an NFS read/write request.
>>> * @ctx: open context to use
>>> * @page: page to write
>>> + * @last: last nfs request created for this page group or NULL if head
>>> * @offset: starting offset within the page for the write
>>> * @count: number of bytes to read/write
>>> *
>>> @@ -149,7 +292,8 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
>>> */
>>> struct nfs_page *
>>> nfs_create_request(struct nfs_open_context *ctx, struct page *page,
>>> - unsigned int offset, unsigned int count)
>>> + struct nfs_page *last, unsigned int offset,
>>> + unsigned int count)
>>> {
>>> struct nfs_page *req;
>>> struct nfs_lock_context *l_ctx;
>>> @@ -181,6 +325,7 @@ nfs_create_request(struct nfs_open_context *ctx, struct page *page,
>>> req->wb_bytes = count;
>>> req->wb_context = get_nfs_open_context(ctx);
>>> kref_init(&req->wb_kref);
>>> + nfs_page_group_init(req, last);
>>> return req;
>>> }
>>>
>>> @@ -238,16 +383,18 @@ static void nfs_clear_request(struct nfs_page *req)
>>> }
>>> }
>>>
>>> -
>>> /**
>>> * nfs_release_request - Release the count on an NFS read/write request
>>> * @req: request to release
>>> *
>>> * Note: Should never be called with the spinlock held!
>>> */
>>> -static void nfs_free_request(struct kref *kref)
>>> +static void nfs_free_request(struct nfs_page *req)
>>> {
>>> - struct nfs_page *req = container_of(kref, struct nfs_page, wb_kref);
>>> + WARN_ON_ONCE(req->wb_this_page != req);
>>> +
>>> + /* extra debug: make sure no sync bits are still set */
>>> + WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));
>>>
>>> /* Release struct file and open context */
>>> nfs_clear_request(req);
>>> @@ -256,7 +403,7 @@ static void nfs_free_request(struct kref *kref)
>>>
>>> void nfs_release_request(struct nfs_page *req)
>>> {
>>> - kref_put(&req->wb_kref, nfs_free_request);
>>> + kref_put(&req->wb_kref, nfs_page_group_destroy);
>>> }
>>>
>>> static int nfs_wait_bit_uninterruptible(void *word)
>>> @@ -833,21 +980,66 @@ static void nfs_pageio_doio(struct nfs_pageio_descriptor *desc)
>>> * @desc: destination io descriptor
>>> * @req: request
>>> *
>>> + * This may split a request into subrequests which are all part of the
>>> + * same page group.
>>> + *
>>> * Returns true if the request 'req' was successfully coalesced into the
>>> * existing list of pages 'desc'.
>>> */
>>> static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
>>> struct nfs_page *req)
>>> {
>>> - while (!nfs_pageio_do_add_request(desc, req)) {
>>> - desc->pg_moreio = 1;
>>> - nfs_pageio_doio(desc);
>>> - if (desc->pg_error < 0)
>>> - return 0;
>>> - desc->pg_moreio = 0;
>>> - if (desc->pg_recoalesce)
>>> - return 0;
>>> - }
>>> + struct nfs_page *subreq;
>>> + unsigned int bytes_left = 0;
>>> + unsigned int offset, pgbase;
>>> +
>>> + nfs_page_group_lock(req);
>>> +
>>> + subreq = req;
>>> + bytes_left = subreq->wb_bytes;
>>> + offset = subreq->wb_offset;
>>> + pgbase = subreq->wb_pgbase;
>>> +
>>> + do {
>>> + if (!nfs_pageio_do_add_request(desc, subreq)) {
>>> + /* make sure pg_test call(s) did nothing */
>>> + WARN_ON_ONCE(subreq->wb_bytes != bytes_left);
>>> + WARN_ON_ONCE(subreq->wb_offset != offset);
>>> + WARN_ON_ONCE(subreq->wb_pgbase != pgbase);
>>> +
>>> + nfs_page_group_unlock(req);
>>> + desc->pg_moreio = 1;
>>> + nfs_pageio_doio(desc);
>>> + if (desc->pg_error < 0)
>>> + return 0;
>>> + desc->pg_moreio = 0;
>>> + if (desc->pg_recoalesce)
>>> + return 0;
>> Would it make sense to 'or' together the desc->pg_error and desc->pg_recoalesce checks, rather than having two distinct if statements with the same return value?
>>
>> Anna
>>
> That's a copy-paste from the original __nfs_pageio_add_request. I didn't combine the
> statements, because desc->pg_moreio is set to 0 iff desc->pg_error < 0 evaluates to false.
>
> We might be able to clean this up, but it's not just as simple as combining them into one
> statement.

Fair enough! I must have missed the copy-and-paste part.

Anna
>
> -dros
>
>>> + /* retry add_request for this subreq */
>>> + nfs_page_group_lock(req);
>>> + continue;
>>> + }
>>> +
>>> + /* check for buggy pg_test call(s) */
>>> + WARN_ON_ONCE(subreq->wb_bytes + subreq->wb_pgbase > PAGE_SIZE);
>>> + WARN_ON_ONCE(subreq->wb_bytes > bytes_left);
>>> + WARN_ON_ONCE(subreq->wb_bytes == 0);
>>> +
>>> + bytes_left -= subreq->wb_bytes;
>>> + offset += subreq->wb_bytes;
>>> + pgbase += subreq->wb_bytes;
>>> +
>>> + if (bytes_left) {
>>> + subreq = nfs_create_request(req->wb_context,
>>> + req->wb_page,
>>> + subreq, pgbase, bytes_left);
>>> + nfs_lock_request(subreq);
>>> + subreq->wb_offset = offset;
>>> + subreq->wb_index = req->wb_index;
>>> + }
>>> + } while (bytes_left > 0);
>>> +
>>> + nfs_page_group_unlock(req);
>>> return 1;
>>> }
>>>
>>> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
>>> index 46d9044..902ba2c 100644
>>> --- a/fs/nfs/read.c
>>> +++ b/fs/nfs/read.c
>>> @@ -85,7 +85,7 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
>>> len = nfs_page_length(page);
>>> if (len == 0)
>>> return nfs_return_empty_page(page);
>>> - new = nfs_create_request(ctx, page, 0, len);
>>> + new = nfs_create_request(ctx, page, NULL, 0, len);
>>> if (IS_ERR(new)) {
>>> unlock_page(page);
>>> return PTR_ERR(new);
>>> @@ -311,7 +311,7 @@ readpage_async_filler(void *data, struct page *page)
>>> if (len == 0)
>>> return nfs_return_empty_page(page);
>>>
>>> - new = nfs_create_request(desc->ctx, page, 0, len);
>>> + new = nfs_create_request(desc->ctx, page, NULL, 0, len);
>>> if (IS_ERR(new))
>>> goto out_error;
>>>
>>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>>> index e773df2..d0f30f1 100644
>>> --- a/fs/nfs/write.c
>>> +++ b/fs/nfs/write.c
>>> @@ -367,6 +367,8 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
>>> {
>>> struct nfs_inode *nfsi = NFS_I(inode);
>>>
>>> + WARN_ON_ONCE(req->wb_this_page != req);
>>> +
>>> /* Lock the request! */
>>> nfs_lock_request(req);
>>>
>>> @@ -383,6 +385,7 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
>>> set_page_private(req->wb_page, (unsigned long)req);
>>> }
>>> nfsi->npages++;
>>> + set_bit(PG_INODE_REF, &req->wb_flags);
>>> kref_get(&req->wb_kref);
>>> spin_unlock(&inode->i_lock);
>>> }
>>> @@ -567,6 +570,7 @@ static void nfs_write_completion(struct nfs_pgio_header *hdr)
>>> {
>>> struct nfs_commit_info cinfo;
>>> unsigned long bytes = 0;
>>> + bool do_destroy;
>>>
>>> if (test_bit(NFS_IOHDR_REDO, &hdr->flags))
>>> goto out;
>>> @@ -596,6 +600,7 @@ remove_req:
>>> next:
>>> nfs_unlock_request(req);
>>> nfs_end_page_writeback(req->wb_page);
>>> + do_destroy = !test_bit(NFS_IOHDR_NEED_COMMIT, &hdr->flags);
>>> nfs_release_request(req);
>>> }
>>> out:
>>> @@ -700,6 +705,10 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
>>> if (req == NULL)
>>> goto out_unlock;
>>>
>>> + /* should be handled by nfs_flush_incompatible */
>>> + WARN_ON_ONCE(req->wb_head != req);
>>> + WARN_ON_ONCE(req->wb_this_page != req);
>>> +
>>> rqend = req->wb_offset + req->wb_bytes;
>>> /*
>>> * Tell the caller to flush out the request if
>>> @@ -761,7 +770,7 @@ static struct nfs_page * nfs_setup_write_request(struct nfs_open_context* ctx,
>>> req = nfs_try_to_update_request(inode, page, offset, bytes);
>>> if (req != NULL)
>>> goto out;
>>> - req = nfs_create_request(ctx, page, offset, bytes);
>>> + req = nfs_create_request(ctx, page, NULL, offset, bytes);
>>> if (IS_ERR(req))
>>> goto out;
>>> nfs_inode_add_request(inode, req);
>>> @@ -805,6 +814,8 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
>>> return 0;
>>> l_ctx = req->wb_lock_context;
>>> do_flush = req->wb_page != page || req->wb_context != ctx;
>>> + /* for now, flush if more than 1 request in page_group */
>>> + do_flush |= req->wb_this_page != req;
>>> if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
>>> do_flush |= l_ctx->lockowner.l_owner != current->files
>>> || l_ctx->lockowner.l_pid != current->tgid;
>>> diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
>>> index 13d59af..986c0c2 100644
>>> --- a/include/linux/nfs_page.h
>>> +++ b/include/linux/nfs_page.h
>>> @@ -26,6 +26,9 @@ enum {
>>> PG_MAPPED, /* page private set for buffered io */
>>> PG_CLEAN, /* write succeeded */
>>> PG_COMMIT_TO_DS, /* used by pnfs layouts */
>>> + PG_INODE_REF, /* extra ref held by inode (head req only) */
>>> + PG_HEADLOCK, /* page group lock of wb_head */
>>> + PG_TEARDOWN, /* page group sync for destroy */
>>> };
>>>
>>> struct nfs_inode;
>>> @@ -41,6 +44,8 @@ struct nfs_page {
>>> struct kref wb_kref; /* reference count */
>>> unsigned long wb_flags;
>>> struct nfs_write_verifier wb_verf; /* Commit cookie */
>>> + struct nfs_page *wb_this_page; /* list of reqs for this page */
>>> + struct nfs_page *wb_head; /* head pointer for req list */
>>> };
>>>
>>> struct nfs_pageio_descriptor;
>>> @@ -87,9 +92,10 @@ struct nfs_pageio_descriptor {
>>>
>>> extern struct nfs_page *nfs_create_request(struct nfs_open_context *ctx,
>>> struct page *page,
>>> + struct nfs_page *last,
>>> unsigned int offset,
>>> unsigned int count);
>>> -extern void nfs_release_request(struct nfs_page *req);
>>> +extern void nfs_release_request(struct nfs_page *);
>>>
>>>
>>> extern void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
>>> @@ -108,7 +114,10 @@ extern size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
>>> struct nfs_page *req);
>>> extern int nfs_wait_on_request(struct nfs_page *);
>>> extern void nfs_unlock_request(struct nfs_page *req);
>>> -extern void nfs_unlock_and_release_request(struct nfs_page *req);
>>> +extern void nfs_unlock_and_release_request(struct nfs_page *);
>>> +extern void nfs_page_group_lock(struct nfs_page *);
>>> +extern void nfs_page_group_unlock(struct nfs_page *);
>>> +extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
>>>
>>> /*
>>> * Lock the page of an asynchronous request


2014-05-15 15:57:15

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 06/18] nfs: add support for multiple nfs reqs per page

Add "page groups" - a circular list of nfs requests (struct nfs_page)
that all reference the same page. This gives nfs read and write paths
the ability to account for sub-page regions independently. This
somewhat follows the design of struct buffer_head's sub-page
accounting.

Only "head" requests are ever added/removed from the inode list in
the buffered write path. "head" and "sub" requests are treated the
same through the read path and the rest of the write/commit path.
Requests are given an extra reference across the life of the list.

Page groups are never rejoined after being split. If the read/write
request fails and the client falls back to another path (i.e. reverting
to MDS in the pNFS case), the already split requests are pushed through
the recoalescing code again, which may split them further and then
coalesce them into properly sized requests on the wire. Fragmentation
shouldn't be a problem with the current design, because we flush all
requests in a page group when a non-contiguous request is added, so
the only time resplitting should occur is on a resend of a read or
write.

This patch lays the groundwork for sub-page splitting, but does not
actually do any splitting. For now all page groups have one request
as pg_test functions don't yet split pages. There are several related
patches that are needed to support multiple requests per page group.
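
For readers who want to see the linkage in isolation, here is a minimal
userspace sketch. This is not the kernel code: struct req is a made-up
stand-in for struct nfs_page with only the group fields kept, and
group_init mirrors just the linkage logic of nfs_page_group_init.

#include <stdio.h>

/* simplified stand-in for struct nfs_page */
struct req {
        struct req *wb_this_page;       /* next request in circular list */
        struct req *wb_head;            /* head request of the group */
        unsigned int wb_pgbase;
        unsigned int wb_bytes;
};

/* mirrors the linkage logic of nfs_page_group_init */
static void group_init(struct req *req, struct req *prev)
{
        if (!prev) {
                req->wb_head = req;             /* head points to itself */
                req->wb_this_page = req;
        } else {
                req->wb_head = prev->wb_head;
                req->wb_this_page = prev->wb_this_page;
                prev->wb_this_page = req;       /* splice into the circle */
        }
}

int main(void)
{
        struct req a = { .wb_pgbase = 0, .wb_bytes = 1024 };
        struct req b = { .wb_pgbase = 1024, .wb_bytes = 3072 };
        struct req *tmp = &a;

        group_init(&a, NULL);   /* head request */
        group_init(&b, &a);     /* subrequest for the rest of the page */

        /* the canonical do/while walk used throughout this series */
        do {
                printf("req covers [%u, %u)\n", tmp->wb_pgbase,
                       tmp->wb_pgbase + tmp->wb_bytes);
                tmp = tmp->wb_this_page;
        } while (tmp != &a);
        return 0;
}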

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/direct.c | 7 +-
fs/nfs/pagelist.c | 220 ++++++++++++++++++++++++++++++++++++++++++++---
fs/nfs/read.c | 4 +-
fs/nfs/write.c | 13 ++-
include/linux/nfs_page.h | 13 ++-
5 files changed, 236 insertions(+), 21 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1dd8c62..2c0e08f 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -380,7 +380,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de
struct nfs_page *req;
unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
/* XXX do we need to do the eof zeroing found in async_filler? */
- req = nfs_create_request(dreq->ctx, pagevec[i],
+ req = nfs_create_request(dreq->ctx, pagevec[i], NULL,
pgbase, req_len);
if (IS_ERR(req)) {
result = PTR_ERR(req);
@@ -749,7 +749,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *d
struct nfs_page *req;
unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);

- req = nfs_create_request(dreq->ctx, pagevec[i],
+ req = nfs_create_request(dreq->ctx, pagevec[i], NULL,
pgbase, req_len);
if (IS_ERR(req)) {
result = PTR_ERR(req);
@@ -827,6 +827,8 @@ static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
spin_unlock(&dreq->lock);

while (!list_empty(&hdr->pages)) {
+ bool do_destroy = true;
+
req = nfs_list_entry(hdr->pages.next);
nfs_list_remove_request(req);
switch (bit) {
@@ -834,6 +836,7 @@ static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
case NFS_IOHDR_NEED_COMMIT:
kref_get(&req->wb_kref);
nfs_mark_request_commit(req, hdr->lseg, &cinfo);
+ do_destroy = false;
}
nfs_unlock_and_release_request(req);
}
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index a71655a..517c617 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -29,6 +29,8 @@
static struct kmem_cache *nfs_page_cachep;
static const struct rpc_call_ops nfs_pgio_common_ops;

+static void nfs_free_request(struct nfs_page *);
+
static bool nfs_pgarray_set(struct nfs_page_array *p, unsigned int pagecount)
{
p->npages = pagecount;
@@ -136,10 +138,151 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
return __nfs_iocounter_wait(c);
}

+/*
+ * nfs_page_group_lock - lock the head of the page group
+ * @req - request in group that is to be locked
+ *
+ * this lock must be held if modifying the page group list
+ */
+void
+nfs_page_group_lock(struct nfs_page *req)
+{
+ struct nfs_page *head = req->wb_head;
+ int err = -EAGAIN;
+
+ WARN_ON_ONCE(head != head->wb_head);
+
+ while (err)
+ err = wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
+ nfs_wait_bit_killable, TASK_KILLABLE);
+}
+
+/*
+ * nfs_page_group_unlock - unlock the head of the page group
+ * @req - request in group that is to be unlocked
+ */
+void
+nfs_page_group_unlock(struct nfs_page *req)
+{
+ struct nfs_page *head = req->wb_head;
+
+ WARN_ON_ONCE(head != head->wb_head);
+
+ smp_mb__before_clear_bit();
+ clear_bit(PG_HEADLOCK, &head->wb_flags);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&head->wb_flags, PG_HEADLOCK);
+}
+
+/*
+ * nfs_page_group_sync_on_bit_locked
+ *
+ * must be called with page group lock held
+ */
+static bool
+nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
+{
+ struct nfs_page *head = req->wb_head;
+ struct nfs_page *tmp;
+
+ WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &head->wb_flags));
+ WARN_ON_ONCE(test_and_set_bit(bit, &req->wb_flags));
+
+ tmp = req->wb_this_page;
+ while (tmp != req) {
+ if (!test_bit(bit, &tmp->wb_flags))
+ return false;
+ tmp = tmp->wb_this_page;
+ }
+
+ /* true! reset all bits */
+ tmp = req;
+ do {
+ clear_bit(bit, &tmp->wb_flags);
+ tmp = tmp->wb_this_page;
+ } while (tmp != req);
+
+ return true;
+}
+
+/*
+ * nfs_page_group_sync_on_bit - set bit on current request, but only
+ * return true if the bit is set for all requests in page group
+ * @req - request in page group
+ * @bit - PG_* bit that is used to sync page group
+ */
+bool nfs_page_group_sync_on_bit(struct nfs_page *req, unsigned int bit)
+{
+ bool ret;
+
+ nfs_page_group_lock(req);
+ ret = nfs_page_group_sync_on_bit_locked(req, bit);
+ nfs_page_group_unlock(req);
+
+ return ret;
+}
+
+/*
+ * nfs_page_group_init - Initialize the page group linkage for @req
+ * @req - a new nfs request
+ * @prev - the previous request in page group, or NULL if @req is the first
+ * or only request in the group (the head).
+ */
+static inline void
+nfs_page_group_init(struct nfs_page *req, struct nfs_page *prev)
+{
+ WARN_ON_ONCE(prev == req);
+
+ if (!prev) {
+ req->wb_head = req;
+ req->wb_this_page = req;
+ } else {
+ WARN_ON_ONCE(prev->wb_this_page != prev->wb_head);
+ WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &prev->wb_head->wb_flags));
+ req->wb_head = prev->wb_head;
+ req->wb_this_page = prev->wb_this_page;
+ prev->wb_this_page = req;
+
+ /* grab extra ref if head request has extra ref from
+ * the write/commit path to handle handoff between write
+ * and commit lists */
+ if (test_bit(PG_INODE_REF, &prev->wb_head->wb_flags))
+ kref_get(&req->wb_kref);
+ }
+}
+
+/*
+ * nfs_page_group_destroy - sync the destruction of page groups
+ * @req - request that no longer needs the page group
+ *
+ * releases the page group reference from each member once all
+ * members have called this function.
+ */
+static void
+nfs_page_group_destroy(struct kref *kref)
+{
+ struct nfs_page *req = container_of(kref, struct nfs_page, wb_kref);
+ struct nfs_page *tmp, *next;
+
+ if (!nfs_page_group_sync_on_bit(req, PG_TEARDOWN))
+ return;
+
+ tmp = req;
+ do {
+ next = tmp->wb_this_page;
+ /* unlink and free */
+ tmp->wb_this_page = tmp;
+ tmp->wb_head = tmp;
+ nfs_free_request(tmp);
+ tmp = next;
+ } while (tmp != req);
+}
+
/**
* nfs_create_request - Create an NFS read/write request.
* @ctx: open context to use
* @page: page to write
+ * @last: last nfs request created for this page group or NULL if head
* @offset: starting offset within the page for the write
* @count: number of bytes to read/write
*
@@ -149,7 +292,8 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
*/
struct nfs_page *
nfs_create_request(struct nfs_open_context *ctx, struct page *page,
- unsigned int offset, unsigned int count)
+ struct nfs_page *last, unsigned int offset,
+ unsigned int count)
{
struct nfs_page *req;
struct nfs_lock_context *l_ctx;
@@ -181,6 +325,7 @@ nfs_create_request(struct nfs_open_context *ctx, struct page *page,
req->wb_bytes = count;
req->wb_context = get_nfs_open_context(ctx);
kref_init(&req->wb_kref);
+ nfs_page_group_init(req, last);
return req;
}

@@ -238,16 +383,18 @@ static void nfs_clear_request(struct nfs_page *req)
}
}

-
/**
* nfs_release_request - Release the count on an NFS read/write request
* @req: request to release
*
* Note: Should never be called with the spinlock held!
*/
-static void nfs_free_request(struct kref *kref)
+static void nfs_free_request(struct nfs_page *req)
{
- struct nfs_page *req = container_of(kref, struct nfs_page, wb_kref);
+ WARN_ON_ONCE(req->wb_this_page != req);
+
+ /* extra debug: make sure no sync bits are still set */
+ WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));

/* Release struct file and open context */
nfs_clear_request(req);
@@ -256,7 +403,7 @@ static void nfs_free_request(struct kref *kref)

void nfs_release_request(struct nfs_page *req)
{
- kref_put(&req->wb_kref, nfs_free_request);
+ kref_put(&req->wb_kref, nfs_page_group_destroy);
}

static int nfs_wait_bit_uninterruptible(void *word)
@@ -833,21 +980,66 @@ static void nfs_pageio_doio(struct nfs_pageio_descriptor *desc)
* @desc: destination io descriptor
* @req: request
*
+ * This may split a request into subrequests which are all part of the
+ * same page group.
+ *
* Returns true if the request 'req' was successfully coalesced into the
* existing list of pages 'desc'.
*/
static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
struct nfs_page *req)
{
- while (!nfs_pageio_do_add_request(desc, req)) {
- desc->pg_moreio = 1;
- nfs_pageio_doio(desc);
- if (desc->pg_error < 0)
- return 0;
- desc->pg_moreio = 0;
- if (desc->pg_recoalesce)
- return 0;
- }
+ struct nfs_page *subreq;
+ unsigned int bytes_left = 0;
+ unsigned int offset, pgbase;
+
+ nfs_page_group_lock(req);
+
+ subreq = req;
+ bytes_left = subreq->wb_bytes;
+ offset = subreq->wb_offset;
+ pgbase = subreq->wb_pgbase;
+
+ do {
+ if (!nfs_pageio_do_add_request(desc, subreq)) {
+ /* make sure pg_test call(s) did nothing */
+ WARN_ON_ONCE(subreq->wb_bytes != bytes_left);
+ WARN_ON_ONCE(subreq->wb_offset != offset);
+ WARN_ON_ONCE(subreq->wb_pgbase != pgbase);
+
+ nfs_page_group_unlock(req);
+ desc->pg_moreio = 1;
+ nfs_pageio_doio(desc);
+ if (desc->pg_error < 0)
+ return 0;
+ desc->pg_moreio = 0;
+ if (desc->pg_recoalesce)
+ return 0;
+ /* retry add_request for this subreq */
+ nfs_page_group_lock(req);
+ continue;
+ }
+
+ /* check for buggy pg_test call(s) */
+ WARN_ON_ONCE(subreq->wb_bytes + subreq->wb_pgbase > PAGE_SIZE);
+ WARN_ON_ONCE(subreq->wb_bytes > bytes_left);
+ WARN_ON_ONCE(subreq->wb_bytes == 0);
+
+ bytes_left -= subreq->wb_bytes;
+ offset += subreq->wb_bytes;
+ pgbase += subreq->wb_bytes;
+
+ if (bytes_left) {
+ subreq = nfs_create_request(req->wb_context,
+ req->wb_page,
+ subreq, pgbase, bytes_left);
+ nfs_lock_request(subreq);
+ subreq->wb_offset = offset;
+ subreq->wb_index = req->wb_index;
+ }
+ } while (bytes_left > 0);
+
+ nfs_page_group_unlock(req);
return 1;
}

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 46d9044..902ba2c 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -85,7 +85,7 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
len = nfs_page_length(page);
if (len == 0)
return nfs_return_empty_page(page);
- new = nfs_create_request(ctx, page, 0, len);
+ new = nfs_create_request(ctx, page, NULL, 0, len);
if (IS_ERR(new)) {
unlock_page(page);
return PTR_ERR(new);
@@ -311,7 +311,7 @@ readpage_async_filler(void *data, struct page *page)
if (len == 0)
return nfs_return_empty_page(page);

- new = nfs_create_request(desc->ctx, page, 0, len);
+ new = nfs_create_request(desc->ctx, page, NULL, 0, len);
if (IS_ERR(new))
goto out_error;

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index e773df2..d0f30f1 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -367,6 +367,8 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
{
struct nfs_inode *nfsi = NFS_I(inode);

+ WARN_ON_ONCE(req->wb_this_page != req);
+
/* Lock the request! */
nfs_lock_request(req);

@@ -383,6 +385,7 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
set_page_private(req->wb_page, (unsigned long)req);
}
nfsi->npages++;
+ set_bit(PG_INODE_REF, &req->wb_flags);
kref_get(&req->wb_kref);
spin_unlock(&inode->i_lock);
}
@@ -567,6 +570,7 @@ static void nfs_write_completion(struct nfs_pgio_header *hdr)
{
struct nfs_commit_info cinfo;
unsigned long bytes = 0;
+ bool do_destroy;

if (test_bit(NFS_IOHDR_REDO, &hdr->flags))
goto out;
@@ -596,6 +600,7 @@ remove_req:
next:
nfs_unlock_request(req);
nfs_end_page_writeback(req->wb_page);
+ do_destroy = !test_bit(NFS_IOHDR_NEED_COMMIT, &hdr->flags);
nfs_release_request(req);
}
out:
@@ -700,6 +705,10 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
if (req == NULL)
goto out_unlock;

+ /* should be handled by nfs_flush_incompatible */
+ WARN_ON_ONCE(req->wb_head != req);
+ WARN_ON_ONCE(req->wb_this_page != req);
+
rqend = req->wb_offset + req->wb_bytes;
/*
* Tell the caller to flush out the request if
@@ -761,7 +770,7 @@ static struct nfs_page * nfs_setup_write_request(struct nfs_open_context* ctx,
req = nfs_try_to_update_request(inode, page, offset, bytes);
if (req != NULL)
goto out;
- req = nfs_create_request(ctx, page, offset, bytes);
+ req = nfs_create_request(ctx, page, NULL, offset, bytes);
if (IS_ERR(req))
goto out;
nfs_inode_add_request(inode, req);
@@ -805,6 +814,8 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
return 0;
l_ctx = req->wb_lock_context;
do_flush = req->wb_page != page || req->wb_context != ctx;
+ /* for now, flush if more than 1 request in page_group */
+ do_flush |= req->wb_this_page != req;
if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
do_flush |= l_ctx->lockowner.l_owner != current->files
|| l_ctx->lockowner.l_pid != current->tgid;
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 13d59af..986c0c2 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -26,6 +26,9 @@ enum {
PG_MAPPED, /* page private set for buffered io */
PG_CLEAN, /* write succeeded */
PG_COMMIT_TO_DS, /* used by pnfs layouts */
+ PG_INODE_REF, /* extra ref held by inode (head req only) */
+ PG_HEADLOCK, /* page group lock of wb_head */
+ PG_TEARDOWN, /* page group sync for destroy */
};

struct nfs_inode;
@@ -41,6 +44,8 @@ struct nfs_page {
struct kref wb_kref; /* reference count */
unsigned long wb_flags;
struct nfs_write_verifier wb_verf; /* Commit cookie */
+ struct nfs_page *wb_this_page; /* list of reqs for this page */
+ struct nfs_page *wb_head; /* head pointer for req list */
};

struct nfs_pageio_descriptor;
@@ -87,9 +92,10 @@ struct nfs_pageio_descriptor {

extern struct nfs_page *nfs_create_request(struct nfs_open_context *ctx,
struct page *page,
+ struct nfs_page *last,
unsigned int offset,
unsigned int count);
-extern void nfs_release_request(struct nfs_page *req);
+extern void nfs_release_request(struct nfs_page *);


extern void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
@@ -108,7 +114,10 @@ extern size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
struct nfs_page *req);
extern int nfs_wait_on_request(struct nfs_page *);
extern void nfs_unlock_request(struct nfs_page *req);
-extern void nfs_unlock_and_release_request(struct nfs_page *req);
+extern void nfs_unlock_and_release_request(struct nfs_page *);
+extern void nfs_page_group_lock(struct nfs_page *);
+extern void nfs_page_group_unlock(struct nfs_page *);
+extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);

/*
* Lock the page of an asynchronous request
--
1.8.5.2 (Apple Git-48)


2014-05-15 15:57:27

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 12/18] nfs: chain calls to pg_test

Now that pg_test can change the size of the request (by returning a non-zero
size smaller than the request), pg_test functions that call other
pg_test functions must return the minimum of the results - or 0 if any fail.

Also clean up the logic of some pg_test functions so that all checks are
for conditions where coalescing is not possible.
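
For illustration, the rule reduced to a standalone sketch (names
invented; the real pg_test functions operate on pageio descriptors and
nfs_page requests, not raw byte counts):

#include <stdio.h>
#include <stddef.h>

/* stand-in for a chained pg_test: 0 means "cannot coalesce at all",
 * otherwise the number of bytes that may be coalesced */
static size_t generic_pg_test(size_t req_bytes)
{
        return req_bytes;       /* pretend the generic test allows all */
}

/* a driver pg_test that chains to the generic one: propagate failure
 * (0) and otherwise return the minimum of all results */
static size_t driver_pg_test(size_t req_bytes, size_t driver_limit)
{
        size_t size = generic_pg_test(req_bytes);

        if (!size)
                return 0;       /* any failure fails the whole chain */
        if (size > driver_limit)
                size = driver_limit;
        return size;
}

int main(void)
{
        /* a 9000-byte request against a driver allowing 4096 per call */
        printf("%zu\n", driver_pg_test(9000, 4096));    /* prints 4096 */
        return 0;
}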

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/nfs4filelayout.c | 27 ++++++++++++++-------------
fs/nfs/objlayout/objio_osd.c | 9 ++++++---
fs/nfs/pnfs.c | 15 ++++++++++-----
3 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index 9cea935..7a665e0 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -928,26 +928,27 @@ static size_t
filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
struct nfs_page *req)
{
+ unsigned int size;
u64 p_stripe, r_stripe;
u32 stripe_unit;

- if (!pnfs_generic_pg_test(pgio, prev, req) ||
- !nfs_generic_pg_test(pgio, prev, req))
+ /* calls nfs_generic_pg_test */
+ size = pnfs_generic_pg_test(pgio, prev, req);
+ if (!size)
return 0;

- if (!prev)
- return req->wb_bytes;
+ if (prev) {
+ p_stripe = (u64)req_offset(prev);
+ r_stripe = (u64)req_offset(req);
+ stripe_unit = FILELAYOUT_LSEG(pgio->pg_lseg)->stripe_unit;

- p_stripe = (u64)req_offset(prev);
- r_stripe = (u64)req_offset(req);
- stripe_unit = FILELAYOUT_LSEG(pgio->pg_lseg)->stripe_unit;
+ do_div(p_stripe, stripe_unit);
+ do_div(r_stripe, stripe_unit);

- do_div(p_stripe, stripe_unit);
- do_div(r_stripe, stripe_unit);
-
- if (p_stripe == r_stripe)
- return req->wb_bytes;
- return 0;
+ if (p_stripe != r_stripe)
+ return 0;
+ }
+ return min(size, req->wb_bytes);
}

static void
diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
index 71b9c69..6113207 100644
--- a/fs/nfs/objlayout/objio_osd.c
+++ b/fs/nfs/objlayout/objio_osd.c
@@ -571,12 +571,15 @@ int objio_write_pagelist(struct nfs_pgio_data *wdata, int how)
static size_t objio_pg_test(struct nfs_pageio_descriptor *pgio,
struct nfs_page *prev, struct nfs_page *req)
{
- if (!pnfs_generic_pg_test(pgio, prev, req) ||
- pgio->pg_count + req->wb_bytes >
+ unsigned int size;
+
+ size = pnfs_generic_pg_test(pgio, prev, req);
+
+ if (!size || pgio->pg_count + req->wb_bytes >
(unsigned long)pgio->pg_layout_private)
return 0;

- return req->wb_bytes;
+ return min(size, req->wb_bytes);
}

static void objio_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index de6eb16..354c53c 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1442,8 +1442,12 @@ size_t
pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
struct nfs_page *req)
{
- if (pgio->pg_lseg == NULL)
- return nfs_generic_pg_test(pgio, prev, req);
+ unsigned int size;
+
+ size = nfs_generic_pg_test(pgio, prev, req);
+
+ if (!size)
+ return 0;

/*
* Test if a nfs_page is fully contained in the pnfs_layout_range.
@@ -1459,10 +1463,11 @@ pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
* first byte that lies outside the pnfs_layout_range. FIXME?
*
*/
- if (req_offset(req) < end_offset(pgio->pg_lseg->pls_range.offset,
+ if (req_offset(req) >= end_offset(pgio->pg_lseg->pls_range.offset,
pgio->pg_lseg->pls_range.length))
- return req->wb_bytes;
- return 0;
+ return 0;
+
+ return min(size, req->wb_bytes);
}
EXPORT_SYMBOL_GPL(pnfs_generic_pg_test);

--
1.8.5.2 (Apple Git-48)


2014-05-15 15:57:06

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 01/18] pnfs: fix race in filelayout commit path

Hold the lock while modifying commit info dataserver buckets.

The following oops can be reproduced by running iozone for a while against
a 2 DS pynfs filelayout server.

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: nfs_layout_nfsv41_files rpcsec_gss_krb5 nfsv4 nfs fscache
CPU: 0 PID: 903 Comm: iozone Not tainted 3.15.0-rc1-branch-dros_testing+ #44
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference
task: ffff880078164480 ti: ffff88006e972000 task.ti: ffff88006e972000
RIP: 0010:[<ffffffffa01936e1>] [<ffffffffa01936e1>] nfs_init_commit+0x22/0x
RSP: 0018:ffff88006e973d30 EFLAGS: 00010246
RAX: ffff88006e973e00 RBX: ffff88006e828800 RCX: ffff88006e973e10
RDX: 0000000000000000 RSI: ffff88006e973e00 RDI: dead4ead00000000
RBP: ffff88006e973d38 R08: ffff88006e8289d8 R09: 0000000000000000
R10: ffff88006e8289d8 R11: 0000000000016988 R12: ffff88006e973b98
R13: ffff88007a0a6648 R14: ffff88006e973e10 R15: ffff88006e828800
FS: 00007f2ce396b740(0000) GS:ffff88007f200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f03278a1000 CR3: 0000000079043000 CR4: 00000000001407f0
Stack:
ffff88006e8289d8 ffff88006e973da8 ffffffffa00f144f ffff88006e9478c0
ffff88006e973e00 ffff88006de21080 0000000100000002 ffff880079be6c48
ffff88006e973d70 ffff88006e973d70 ffff88006e973e10 ffff88006de21080
Call Trace:
[<ffffffffa00f144f>] filelayout_commit_pagelist+0x1ae/0x34a [nfs_layout_nfsv
[<ffffffffa0194f72>] nfs_generic_commit_list+0x92/0xc4 [nfs]
[<ffffffffa0195053>] nfs_commit_inode+0xaf/0x114 [nfs]
[<ffffffffa01892bd>] nfs_file_fsync_commit+0x82/0xbe [nfs]
[<ffffffffa01ceb0d>] nfs4_file_fsync+0x59/0x9b [nfsv4]
[<ffffffff8114ee3c>] vfs_fsync_range+0x18/0x20
[<ffffffff8114ee60>] vfs_fsync+0x1c/0x1e
[<ffffffffa01891c2>] nfs_file_flush+0x7f/0x84 [nfs]
[<ffffffff81127a43>] filp_close+0x3c/0x72
[<ffffffff81140e12>] __close_fd+0x82/0x9a
[<ffffffff81127a9c>] SyS_close+0x23/0x4c
[<ffffffff814acd12>] system_call_fastpath+0x16/0x1b
Code: 5b 41 5c 41 5d 41 5e 5d c3 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 8
RIP [<ffffffffa01936e1>] nfs_init_commit+0x22/0xe1 [nfs]
RSP <ffff88006e973d30>
---[ end trace 732fe6419b235e2f ]---
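
One way to read the fix, reduced to a standalone userspace sketch (a
mutex stands in for cinfo->lock, put_lseg for pnfs_put_lseg, and the
names are otherwise invented): the lseg pointer is detached from the
bucket while the lock is held, and the reference is dropped only after
the lock is released, so no other thread can observe a half-updated
bucket.

#include <pthread.h>
#include <stdlib.h>

struct bucket {
        void *wlseg;    /* stand-in for struct pnfs_layout_segment * */
};

static pthread_mutex_t cinfo_lock = PTHREAD_MUTEX_INITIALIZER;

/* may free the object, so it must not run under cinfo_lock */
static void put_lseg(void *lseg)
{
        free(lseg);
}

static void recover_bucket(struct bucket *b)
{
        void *freeme;

        pthread_mutex_lock(&cinfo_lock);
        freeme = b->wlseg;      /* detach the pointer under the lock */
        b->wlseg = NULL;
        pthread_mutex_unlock(&cinfo_lock);
        put_lseg(freeme);       /* drop the reference outside the lock */
}

int main(void)
{
        struct bucket b = { malloc(16) };

        recover_bucket(&b);
        return 0;
}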

Suggested-by: Trond Myklebust <[email protected]>
Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/nfs4filelayout.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index 7954e16..9fd7ceb 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -1067,6 +1067,7 @@ filelayout_choose_commit_list(struct nfs_page *req,
*/
j = nfs4_fl_calc_j_index(lseg, req_offset(req));
i = select_bucket_index(fl, j);
+ spin_lock(cinfo->lock);
buckets = cinfo->ds->buckets;
list = &buckets[i].written;
if (list_empty(list)) {
@@ -1080,6 +1081,7 @@ filelayout_choose_commit_list(struct nfs_page *req,
}
set_bit(PG_COMMIT_TO_DS, &req->wb_flags);
cinfo->ds->nwritten++;
+ spin_unlock(cinfo->lock);
return list;
}

@@ -1176,6 +1178,7 @@ transfer_commit_list(struct list_head *src, struct list_head *dst,
return ret;
}

+/* Note called with cinfo->lock held. */
static int
filelayout_scan_ds_commit_list(struct pnfs_commit_bucket *bucket,
struct nfs_commit_info *cinfo,
@@ -1220,15 +1223,18 @@ static void filelayout_recover_commit_reqs(struct list_head *dst,
struct nfs_commit_info *cinfo)
{
struct pnfs_commit_bucket *b;
+ struct pnfs_layout_segment *freeme;
int i;

+restart:
spin_lock(cinfo->lock);
for (i = 0, b = cinfo->ds->buckets; i < cinfo->ds->nbuckets; i++, b++) {
if (transfer_commit_list(&b->written, dst, cinfo, 0)) {
- spin_unlock(cinfo->lock);
- pnfs_put_lseg(b->wlseg);
+ freeme = b->wlseg;
b->wlseg = NULL;
- spin_lock(cinfo->lock);
+ spin_unlock(cinfo->lock);
+ pnfs_put_lseg(freeme);
+ goto restart;
}
}
cinfo->ds->nwritten = 0;
@@ -1243,6 +1249,7 @@ alloc_ds_commits(struct nfs_commit_info *cinfo, struct list_head *list)
struct nfs_commit_data *data;
int i, j;
unsigned int nreq = 0;
+ struct pnfs_layout_segment *freeme;

fl_cinfo = cinfo->ds;
bucket = fl_cinfo->buckets;
@@ -1253,8 +1260,10 @@ alloc_ds_commits(struct nfs_commit_info *cinfo, struct list_head *list)
if (!data)
break;
data->ds_commit_index = i;
+ spin_lock(cinfo->lock);
data->lseg = bucket->clseg;
bucket->clseg = NULL;
+ spin_unlock(cinfo->lock);
list_add(&data->pages, list);
nreq++;
}
@@ -1264,8 +1273,11 @@ alloc_ds_commits(struct nfs_commit_info *cinfo, struct list_head *list)
if (list_empty(&bucket->committing))
continue;
nfs_retry_commit(&bucket->committing, bucket->clseg, cinfo);
- pnfs_put_lseg(bucket->clseg);
+ spin_lock(cinfo->lock);
+ freeme = bucket->clseg;
bucket->clseg = NULL;
+ spin_unlock(cinfo->lock);
+ pnfs_put_lseg(freeme);
}
/* Caller will clean up entries put on list */
return nreq;
--
1.8.5.2 (Apple Git-48)


2014-05-15 15:57:37

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 18/18] nfs: support page groups in nfs_read_completion

nfs_read_completion relied on the fact that there was a 1:1 mapping
of page to nfs_request, but this has now changed.

Regions not covered by a request have already been zeroed elsewhere.
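
As a worked example of the new arithmetic (numbers invented): a page is
covered by two requests of 2048 bytes each (wb_pgbase 0 and 2048), and
a short read at EOF leaves hdr->good_bytes = 3072. For the first
request, good_bytes - bytes = 3072 >= wb_bytes, so nothing is zeroed
and bytes advances to 2048; for the second, good_bytes - bytes = 1024
< wb_bytes, so start becomes 2048 + 1024 and zero_user_segment(page,
3072, 4096) clears only the bad tail.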

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/read.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 53d5b83..e818a47 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -130,7 +130,6 @@ static void nfs_page_group_set_uptodate(struct nfs_page *req)
SetPageUptodate(req->wb_page);
}

-/* Note io was page aligned */
static void nfs_read_completion(struct nfs_pgio_header *hdr)
{
unsigned long bytes = 0;
@@ -140,14 +139,25 @@ static void nfs_read_completion(struct nfs_pgio_header *hdr)
while (!list_empty(&hdr->pages)) {
struct nfs_page *req = nfs_list_entry(hdr->pages.next);
struct page *page = req->wb_page;
+ unsigned long start = req->wb_pgbase;
+ unsigned long end = req->wb_pgbase + req->wb_bytes;

if (test_bit(NFS_IOHDR_EOF, &hdr->flags)) {
- if (bytes > hdr->good_bytes)
- zero_user(page, 0, PAGE_SIZE);
- else if (hdr->good_bytes - bytes < PAGE_SIZE)
- zero_user_segment(page,
- hdr->good_bytes & ~PAGE_MASK,
- PAGE_SIZE);
+ /* note: regions of the page not covered by a
+ * request are zeroed in nfs_readpage_async /
+ * readpage_async_filler */
+ if (bytes > hdr->good_bytes) {
+ /* nothing in this request was good, so zero
+ * the full extent of the request */
+ zero_user_segment(page, start, end);
+
+ } else if (hdr->good_bytes - bytes < req->wb_bytes) {
+ /* part of this request has good bytes, but
+ * not all. zero the bad bytes */
+ start += hdr->good_bytes - bytes;
+ WARN_ON(start < req->wb_pgbase);
+ zero_user_segment(page, start, end);
+ }
}
bytes += req->wb_bytes;
if (test_bit(NFS_IOHDR_ERROR, &hdr->flags)) {
--
1.8.5.2 (Apple Git-48)


2014-05-15 15:57:33

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 15/18] pnfs: support multiple verfs per direct req

Support direct requests that span multiple pnfs data servers by
comparing nfs_pgio_header->verf to a cached verf in pnfs_commit_bucket.
Continue to use dreq->verf if the MDS is used / non-pNFS.
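
The bookkeeping, reduced to a standalone sketch: struct verf and
set_or_cmp_verf below are invented stand-ins for nfs_writeverf and the
nfs_direct_set_or_cmp_hdr_verf logic in the patch.

#include <stdio.h>
#include <string.h>

#define NBUCKETS 4
#define NFS_INVALID_STABLE_HOW (-1)

/* toy verifier: committed < 0 marks the slot as unset */
struct verf {
        int committed;
        unsigned char data[8];
};

static struct verf bucket_verf[NBUCKETS];       /* one cached verf per DS */

/* record the first verifier seen from a DS, compare on later
 * completions; nonzero means mismatch (writes must be rescheduled) */
static int set_or_cmp_verf(int ds_idx, const struct verf *seen)
{
        struct verf *v = &bucket_verf[ds_idx];

        if (v->committed == NFS_INVALID_STABLE_HOW) {
                *v = *seen;
                return 0;
        }
        return memcmp(v->data, seen->data, sizeof(v->data));
}

int main(void)
{
        struct verf v1 = { 2, "aaaaaaa" };
        struct verf v2 = { 2, "bbbbbbb" };
        int i;

        for (i = 0; i < NBUCKETS; i++)
                bucket_verf[i].committed = NFS_INVALID_STABLE_HOW;

        printf("%d\n", set_or_cmp_verf(0, &v1));        /* 0: recorded */
        printf("%d\n", set_or_cmp_verf(0, &v1));        /* 0: match */
        printf("%d\n", set_or_cmp_verf(0, &v2) != 0);   /* 1: mismatch */
        return 0;
}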

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/direct.c | 102 +++++++++++++++++++++++++++++++++++++++++++++---
fs/nfs/nfs4filelayout.c | 6 +++
include/linux/nfs.h | 5 ++-
include/linux/nfs_xdr.h | 2 +
4 files changed, 109 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 2c0e08f..4ad7bc3 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -108,6 +108,97 @@ static inline int put_dreq(struct nfs_direct_req *dreq)
return atomic_dec_and_test(&dreq->io_count);
}

+/*
+ * nfs_direct_select_verf - select the right verifier
+ * @dreq - direct request possibly spanning multiple servers
+ * @ds_clp - nfs_client of data server or NULL if MDS / non-pnfs
+ * @ds_idx - index of data server in data server list, only valid if ds_clp set
+ *
+ * returns the correct verifier to use given the role of the server
+ */
+static struct nfs_writeverf *
+nfs_direct_select_verf(struct nfs_direct_req *dreq,
+ struct nfs_client *ds_clp,
+ int ds_idx)
+{
+ struct nfs_writeverf *verfp = &dreq->verf;
+
+#ifdef CONFIG_NFS_V4_1
+ if (ds_clp) {
+ /* pNFS is in use, use the DS verf */
+ if (ds_idx >= 0 && ds_idx < dreq->ds_cinfo.nbuckets)
+ verfp = &dreq->ds_cinfo.buckets[ds_idx].direct_verf;
+ else
+ WARN_ON_ONCE(1);
+ }
+#endif
+ return verfp;
+}
+
+
+/*
+ * nfs_direct_set_hdr_verf - set the write/commit verifier
+ * @dreq - direct request possibly spanning multiple servers
+ * @hdr - pageio header to validate against previously seen verfs
+ *
+ * Set the server's (MDS or DS) "seen" verifier
+ */
+static void nfs_direct_set_hdr_verf(struct nfs_direct_req *dreq,
+ struct nfs_pgio_header *hdr)
+{
+ struct nfs_writeverf *verfp;
+
+ verfp = nfs_direct_select_verf(dreq, hdr->data->ds_clp,
+ hdr->data->ds_idx);
+ WARN_ON_ONCE(verfp->committed >= 0);
+ memcpy(verfp, &hdr->verf, sizeof(struct nfs_writeverf));
+ WARN_ON_ONCE(verfp->committed < 0);
+}
+
+/*
+ * nfs_direct_set_or_cmp_hdr_verf - set or compare verifier for pgio header
+ * @dreq - direct request possibly spanning multiple servers
+ * @hdr - pageio header to validate against previously seen verf
+ *
+ * set the server's "seen" verf if not initialized.
+ * returns result of comparison between @hdr->verf and the "seen"
+ * verf of the server used by @hdr (DS or MDS)
+ */
+static int nfs_direct_set_or_cmp_hdr_verf(struct nfs_direct_req *dreq,
+ struct nfs_pgio_header *hdr)
+{
+ struct nfs_writeverf *verfp;
+
+ verfp = nfs_direct_select_verf(dreq, hdr->data->ds_clp,
+ hdr->data->ds_idx);
+ if (verfp->committed < 0) {
+ nfs_direct_set_hdr_verf(dreq, hdr);
+ return 0;
+ }
+ return memcmp(verfp, &hdr->verf, sizeof(struct nfs_writeverf));
+}
+
+#if IS_ENABLED(CONFIG_NFS_V3) || IS_ENABLED(CONFIG_NFS_V4)
+/*
+ * nfs_direct_cmp_commit_data_verf - compare verifier for commit data
+ * @dreq - direct request possibly spanning multiple servers
+ * @data - commit data to validate against previously seen verf
+ *
+ * returns result of comparison between @data->verf and the verf of
+ * the server used by @data (DS or MDS)
+ */
+static int nfs_direct_cmp_commit_data_verf(struct nfs_direct_req *dreq,
+ struct nfs_commit_data *data)
+{
+ struct nfs_writeverf *verfp;
+
+ verfp = nfs_direct_select_verf(dreq, data->ds_clp,
+ data->ds_commit_index);
+ WARN_ON_ONCE(verfp->committed < 0);
+ return memcmp(verfp, &data->verf, sizeof(struct nfs_writeverf));
+}
+#endif
+
/**
* nfs_direct_IO - NFS address space operation for direct I/O
* @rw: direction (read or write)
@@ -168,6 +259,7 @@ static inline struct nfs_direct_req *nfs_direct_req_alloc(void)
kref_get(&dreq->kref);
init_completion(&dreq->completion);
INIT_LIST_HEAD(&dreq->mds_cinfo.list);
+ dreq->verf.committed = NFS_INVALID_STABLE_HOW; /* not set yet */
INIT_WORK(&dreq->work, nfs_direct_write_schedule_work);
spin_lock_init(&dreq->lock);

@@ -602,7 +694,7 @@ static void nfs_direct_commit_complete(struct nfs_commit_data *data)
dprintk("NFS: %5u commit failed with error %d.\n",
data->task.tk_pid, status);
dreq->flags = NFS_ODIRECT_RESCHED_WRITES;
- } else if (memcmp(&dreq->verf, &data->verf, sizeof(data->verf))) {
+ } else if (nfs_direct_cmp_commit_data_verf(dreq, data)) {
dprintk("NFS: %5u commit verify failed\n", data->task.tk_pid);
dreq->flags = NFS_ODIRECT_RESCHED_WRITES;
}
@@ -811,13 +903,13 @@ static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
if (dreq->flags == NFS_ODIRECT_RESCHED_WRITES)
bit = NFS_IOHDR_NEED_RESCHED;
else if (dreq->flags == 0) {
- memcpy(&dreq->verf, &hdr->verf,
- sizeof(dreq->verf));
+ nfs_direct_set_hdr_verf(dreq, hdr);
bit = NFS_IOHDR_NEED_COMMIT;
dreq->flags = NFS_ODIRECT_DO_COMMIT;
} else if (dreq->flags == NFS_ODIRECT_DO_COMMIT) {
- if (memcmp(&dreq->verf, &hdr->verf, sizeof(dreq->verf))) {
- dreq->flags = NFS_ODIRECT_RESCHED_WRITES;
+ if (nfs_direct_set_or_cmp_hdr_verf(dreq, hdr)) {
+ dreq->flags =
+ NFS_ODIRECT_RESCHED_WRITES;
bit = NFS_IOHDR_NEED_RESCHED;
} else
bit = NFS_IOHDR_NEED_COMMIT;
diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index 7a665e0..0ebc521 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -560,6 +560,7 @@ filelayout_read_pagelist(struct nfs_pgio_data *data)
/* No multipath support. Use first DS */
atomic_inc(&ds->ds_clp->cl_count);
data->ds_clp = ds->ds_clp;
+ data->ds_idx = idx;
fh = nfs4_fl_select_ds_fh(lseg, j);
if (fh)
data->args.fh = fh;
@@ -603,6 +604,7 @@ filelayout_write_pagelist(struct nfs_pgio_data *data, int sync)
data->pgio_done_cb = filelayout_write_done_cb;
atomic_inc(&ds->ds_clp->cl_count);
data->ds_clp = ds->ds_clp;
+ data->ds_idx = idx;
fh = nfs4_fl_select_ds_fh(lseg, j);
if (fh)
data->args.fh = fh;
@@ -875,6 +877,8 @@ filelayout_alloc_commit_info(struct pnfs_layout_segment *lseg,
for (i = 0; i < size; i++) {
INIT_LIST_HEAD(&buckets[i].written);
INIT_LIST_HEAD(&buckets[i].committing);
+ /* mark direct verifier as unset */
+ buckets[i].direct_verf.committed = NFS_INVALID_STABLE_HOW;
}

spin_lock(cinfo->lock);
@@ -885,6 +889,8 @@ filelayout_alloc_commit_info(struct pnfs_layout_segment *lseg,
&buckets[i].written);
list_splice(&cinfo->ds->buckets[i].committing,
&buckets[i].committing);
+ buckets[i].direct_verf.committed =
+ cinfo->ds->buckets[i].direct_verf.committed;
buckets[i].wlseg = cinfo->ds->buckets[i].wlseg;
buckets[i].clseg = cinfo->ds->buckets[i].clseg;
}
diff --git a/include/linux/nfs.h b/include/linux/nfs.h
index 3e794c1..610af51 100644
--- a/include/linux/nfs.h
+++ b/include/linux/nfs.h
@@ -46,6 +46,9 @@ static inline void nfs_copy_fh(struct nfs_fh *target, const struct nfs_fh *sourc
enum nfs3_stable_how {
NFS_UNSTABLE = 0,
NFS_DATA_SYNC = 1,
- NFS_FILE_SYNC = 2
+ NFS_FILE_SYNC = 2,
+
+ /* used by direct.c to mark verf as invalid */
+ NFS_INVALID_STABLE_HOW = -1
};
#endif /* _LINUX_NFS_H */
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index ae63601..9a1396e 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1112,6 +1112,7 @@ struct pnfs_commit_bucket {
struct list_head committing;
struct pnfs_layout_segment *wlseg;
struct pnfs_layout_segment *clseg;
+ struct nfs_writeverf direct_verf;
};

struct pnfs_ds_commit_info {
@@ -1294,6 +1295,7 @@ struct nfs_pgio_data {
__u64 mds_offset; /* Filelayout dense stripe */
struct nfs_page_array pages;
struct nfs_client *ds_clp; /* pNFS data server */
+ int ds_idx; /* ds index if ds_clp is set */
};

struct nfs_rw_header {
--
1.8.5.2 (Apple Git-48)


2014-05-19 15:37:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 00/18] nfs: support multiple requests per page

On Thu, May 15, 2014 at 11:56:39AM -0400, Weston Andros Adamson wrote:
> This patchset changes the read and write paths to be more flexible in dealing
> with requests that are not page aligned. Until now there was a 1:1 mapping
> of struct nfs_page (referred to as "nfs requests") to struct page, which

Any chance you could throw in a rename nfs_page to nfs_request patch
at the end of the series? The nfs_page name is highly confusing,
and gets even more so with the series.


2014-05-15 15:57:21

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 09/18] nfs: page group support in nfs_mark_uptodate

Change how nfs_mark_uptodate checks to see if writes cover a whole page.

This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.
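
A worked example (hypothetical numbers): a 4096-byte page written by a
group of two requests with wb_pgbase/wb_bytes of 0/1024 and 1024/3072.
nfs_page_group_covers_page searches at pos 0, finds the first request
and advances pos to 1024, then finds the second and advances pos to
4096 == nfs_page_length, so the page can be marked uptodate. If the
second request instead started at 2048, the search at pos 1024 would
find nothing and the page would be left not uptodate.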

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/write.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 67 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5d75276..17b9895 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -154,18 +154,78 @@ static void nfs_set_pageerror(struct page *page)
nfs_zap_mapping(page_file_mapping(page)->host, page_file_mapping(page));
}

+/*
+ * nfs_page_group_search_locked
+ * @head - head request of page group
+ * @page_offset - offset into page
+ *
+ * Search page group with head @head to find a request that contains the
+ * page offset @page_offset.
+ *
+ * Returns a pointer to the first matching nfs request, or NULL if no
+ * match is found.
+ *
+ * Must be called with the page group lock held
+ */
+static struct nfs_page *
+nfs_page_group_search_locked(struct nfs_page *head, unsigned int page_offset)
+{
+ struct nfs_page *req;
+
+ WARN_ON_ONCE(head != head->wb_head);
+ WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &head->wb_head->wb_flags));
+
+ req = head;
+ do {
+ if (page_offset >= req->wb_pgbase &&
+ page_offset < (req->wb_pgbase + req->wb_bytes))
+ return req;
+
+ req = req->wb_this_page;
+ } while (req != head);
+
+ return NULL;
+}
+
+/*
+ * nfs_page_group_covers_page
+ * @req - a request in the page group
+ *
+ * Returns true if the page group containing @req covers the whole page,
+ * false otherwise
+ */
+static bool nfs_page_group_covers_page(struct nfs_page *req)
+{
+ struct nfs_page *tmp;
+ unsigned int pos = 0;
+ unsigned int len = nfs_page_length(req->wb_page);
+
+ nfs_page_group_lock(req);
+
+ do {
+ tmp = nfs_page_group_search_locked(req->wb_head, pos);
+ if (tmp) {
+ /* no way this should happen */
+ WARN_ON_ONCE(tmp->wb_pgbase != pos);
+ pos += tmp->wb_bytes - (pos - tmp->wb_pgbase);
+ }
+ } while (tmp && pos < len);
+
+ nfs_page_group_unlock(req);
+ WARN_ON_ONCE(pos > len);
+ return pos == len;
+}
+
/* We can set the PG_uptodate flag if we see that a write request
* covers the full page.
*/
-static void nfs_mark_uptodate(struct page *page, unsigned int base, unsigned int count)
+static void nfs_mark_uptodate(struct nfs_page *req)
{
- if (PageUptodate(page))
- return;
- if (base != 0)
+ if (PageUptodate(req->wb_page))
return;
- if (count != nfs_page_length(page))
+ if (!nfs_page_group_covers_page(req))
return;
- SetPageUptodate(page);
+ SetPageUptodate(req->wb_page);
}

static int wb_priority(struct writeback_control *wbc)
@@ -796,7 +856,7 @@ static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
return PTR_ERR(req);
/* Update file length */
nfs_grow_file(page, offset, count);
- nfs_mark_uptodate(page, req->wb_pgbase, req->wb_bytes);
+ nfs_mark_uptodate(req);
nfs_mark_request_dirty(req);
nfs_unlock_and_release_request(req);
return 0;
--
1.8.5.2 (Apple Git-48)


2014-05-15 22:19:48

by Weston Andros Adamson

[permalink] [raw]
Subject: Re: [PATCH v3 06/18] nfs: add support for multiple nfs reqs per page

On May 15, 2014, at 5:12 PM, Anna Schumaker <[email protected]> wrote:

> On 05/15/2014 11:56 AM, Weston Andros Adamson wrote:
>> Add "page groups" - a circular list of nfs requests (struct nfs_page)
>> that all reference the same page. This gives nfs read and write paths
>> the ability to account for sub-page regions independently. This
>> somewhat follows the design of struct buffer_head's sub-page
>> accounting.
>>
>> Only "head" requests are ever added/removed from the inode list in
>> the buffered write path. "head" and "sub" requests are treated the
>> same through the read path and the rest of the write/commit path.
>> Requests are given an extra reference across the life of the list.
>>
>> Page groups are never rejoined after being split. If the read/write
>> request fails and the client falls back to another path (i.e. reverting
>> to MDS in the pNFS case), the already split requests are pushed through
>> the recoalescing code again, which may split them further and then
>> coalesce them into properly sized requests on the wire. Fragmentation
>> shouldn't be a problem with the current design, because we flush all
>> requests in a page group when a non-contiguous request is added, so
>> the only time resplitting should occur is on a resend of a read or
>> write.
>>
>> This patch lays the groundwork for sub-page splitting, but does not
>> actually do any splitting. For now all page groups have one request
>> as pg_test functions don't yet split pages. There are several related
>> patches that are needed to support multiple requests per page group.
>>
>> Signed-off-by: Weston Andros Adamson <[email protected]>
>> ---
>> fs/nfs/direct.c | 7 +-
>> fs/nfs/pagelist.c | 220 ++++++++++++++++++++++++++++++++++++++++++++---
>> fs/nfs/read.c | 4 +-
>> fs/nfs/write.c | 13 ++-
>> include/linux/nfs_page.h | 13 ++-
>> 5 files changed, 236 insertions(+), 21 deletions(-)
>>
>> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
>> index 1dd8c62..2c0e08f 100644
>> --- a/fs/nfs/direct.c
>> +++ b/fs/nfs/direct.c
>> @@ -380,7 +380,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de
>> struct nfs_page *req;
>> unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
>> /* XXX do we need to do the eof zeroing found in async_filler? */
>> - req = nfs_create_request(dreq->ctx, pagevec[i],
>> + req = nfs_create_request(dreq->ctx, pagevec[i], NULL,
>> pgbase, req_len);
>> if (IS_ERR(req)) {
>> result = PTR_ERR(req);
>> @@ -749,7 +749,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *d
>> struct nfs_page *req;
>> unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
>>
>> - req = nfs_create_request(dreq->ctx, pagevec[i],
>> + req = nfs_create_request(dreq->ctx, pagevec[i], NULL,
>> pgbase, req_len);
>> if (IS_ERR(req)) {
>> result = PTR_ERR(req);
>> @@ -827,6 +827,8 @@ static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
>> spin_unlock(&dreq->lock);
>>
>> while (!list_empty(&hdr->pages)) {
>> + bool do_destroy = true;
>> +
>> req = nfs_list_entry(hdr->pages.next);
>> nfs_list_remove_request(req);
>> switch (bit) {
>> @@ -834,6 +836,7 @@ static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
>> case NFS_IOHDR_NEED_COMMIT:
>> kref_get(&req->wb_kref);
>> nfs_mark_request_commit(req, hdr->lseg, &cinfo);
>> + do_destroy = false;
>> }
>> nfs_unlock_and_release_request(req);
>> }
>> diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
>> index a71655a..517c617 100644
>> --- a/fs/nfs/pagelist.c
>> +++ b/fs/nfs/pagelist.c
>> @@ -29,6 +29,8 @@
>> static struct kmem_cache *nfs_page_cachep;
>> static const struct rpc_call_ops nfs_pgio_common_ops;
>>
>> +static void nfs_free_request(struct nfs_page *);
>> +
>> static bool nfs_pgarray_set(struct nfs_page_array *p, unsigned int pagecount)
>> {
>> p->npages = pagecount;
>> @@ -136,10 +138,151 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
>> return __nfs_iocounter_wait(c);
>> }
>>
>> +/*
>> + * nfs_page_group_lock - lock the head of the page group
>> + * @req - request in group that is to be locked
>> + *
>> + * this lock must be held if modifying the page group list
>> + */
>> +void
>> +nfs_page_group_lock(struct nfs_page *req)
>> +{
>> + struct nfs_page *head = req->wb_head;
>> + int err = -EAGAIN;
>> +
>> + WARN_ON_ONCE(head != head->wb_head);
>> +
>> + while (err)
>> + err = wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
>> + nfs_wait_bit_killable, TASK_KILLABLE);
>> +}
>> +
>> +/*
>> + * nfs_page_group_unlock - unlock the head of the page group
>> + * @req - request in group that is to be unlocked
>> + */
>> +void
>> +nfs_page_group_unlock(struct nfs_page *req)
>> +{
>> + struct nfs_page *head = req->wb_head;
>> +
>> + WARN_ON_ONCE(head != head->wb_head);
>> +
>> + smp_mb__before_clear_bit();
>> + clear_bit(PG_HEADLOCK, &head->wb_flags);
>> + smp_mb__after_clear_bit();
>> + wake_up_bit(&head->wb_flags, PG_HEADLOCK);
>> +}
>> +
>> +/*
>> + * nfs_page_group_sync_on_bit_locked
>> + *
>> + * must be called with page group lock held
>> + */
>> +static bool
>> +nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
>> +{
>> + struct nfs_page *head = req->wb_head;
>> + struct nfs_page *tmp;
>> +
>> + WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &head->wb_flags));
>> + WARN_ON_ONCE(test_and_set_bit(bit, &req->wb_flags));
>> +
>> + tmp = req->wb_this_page;
>> + while (tmp != req) {
>> + if (!test_bit(bit, &tmp->wb_flags))
>> + return false;
>> + tmp = tmp->wb_this_page;
>> + }
>> +
>> + /* true! reset all bits */
>> + tmp = req;
>> + do {
>> + clear_bit(bit, &tmp->wb_flags);
>> + tmp = tmp->wb_this_page;
>> + } while (tmp != req);
>> +
>> + return true;
>> +}
>> +
>> +/*
>> + * nfs_page_group_sync_on_bit - set bit on current request, but only
>> + * return true if the bit is set for all requests in page group
>> + * @req - request in page group
>> + * @bit - PG_* bit that is used to sync page group
>> + */
>> +bool nfs_page_group_sync_on_bit(struct nfs_page *req, unsigned int bit)
>> +{
>> + bool ret;
>> +
>> + nfs_page_group_lock(req);
>> + ret = nfs_page_group_sync_on_bit_locked(req, bit);
>> + nfs_page_group_unlock(req);
>> +
>> + return ret;
>> +}
>> +
>> +/*
>> + * nfs_page_group_init - Initialize the page group linkage for @req
>> + * @req - a new nfs request
>> + * @prev - the previous request in page group, or NULL if @req is the first
>> + * or only request in the group (the head).
>> + */
>> +static inline void
>> +nfs_page_group_init(struct nfs_page *req, struct nfs_page *prev)
>> +{
>> + WARN_ON_ONCE(prev == req);
>> +
>> + if (!prev) {
>> + req->wb_head = req;
>> + req->wb_this_page = req;
>> + } else {
>> + WARN_ON_ONCE(prev->wb_this_page != prev->wb_head);
>> + WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &prev->wb_head->wb_flags));
>> + req->wb_head = prev->wb_head;
>> + req->wb_this_page = prev->wb_this_page;
>> + prev->wb_this_page = req;
>> +
>> + /* grab extra ref if head request has extra ref from
>> + * the write/commit path to handle handoff between write
>> + * and commit lists */
>> + if (test_bit(PG_INODE_REF, &prev->wb_head->wb_flags))
>> + kref_get(&req->wb_kref);
>> + }
>> +}
>> +
>> +/*
>> + * nfs_page_group_destroy - sync the destruction of page groups
>> + * @req - request that no longer needs the page group
>> + *
>> + * releases the page group reference from each member once all
>> + * members have called this function.
>> + */
>> +static void
>> +nfs_page_group_destroy(struct kref *kref)
>> +{
>> + struct nfs_page *req = container_of(kref, struct nfs_page, wb_kref);
>> + struct nfs_page *tmp, *next;
>> +
>> + if (!nfs_page_group_sync_on_bit(req, PG_TEARDOWN))
>> + return;
>> +
>> + tmp = req;
>> + do {
>> + next = tmp->wb_this_page;
>> + /* unlink and free */
>> + tmp->wb_this_page = tmp;
>> + tmp->wb_head = tmp;
>> + nfs_free_request(tmp);
>> + tmp = next;
>> + } while (tmp != req);
>> +}
>> +
>> /**
>> * nfs_create_request - Create an NFS read/write request.
>> * @ctx: open context to use
>> * @page: page to write
>> + * @last: last nfs request created for this page group or NULL if head
>> * @offset: starting offset within the page for the write
>> * @count: number of bytes to read/write
>> *
>> @@ -149,7 +292,8 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
>> */
>> struct nfs_page *
>> nfs_create_request(struct nfs_open_context *ctx, struct page *page,
>> - unsigned int offset, unsigned int count)
>> + struct nfs_page *last, unsigned int offset,
>> + unsigned int count)
>> {
>> struct nfs_page *req;
>> struct nfs_lock_context *l_ctx;
>> @@ -181,6 +325,7 @@ nfs_create_request(struct nfs_open_context *ctx, struct page *page,
>> req->wb_bytes = count;
>> req->wb_context = get_nfs_open_context(ctx);
>> kref_init(&req->wb_kref);
>> + nfs_page_group_init(req, last);
>> return req;
>> }
>>
>> @@ -238,16 +383,18 @@ static void nfs_clear_request(struct nfs_page *req)
>> }
>> }
>>
>> -
>> /**
>> * nfs_release_request - Release the count on an NFS read/write request
>> * @req: request to release
>> *
>> * Note: Should never be called with the spinlock held!
>> */
>> -static void nfs_free_request(struct kref *kref)
>> +static void nfs_free_request(struct nfs_page *req)
>> {
>> - struct nfs_page *req = container_of(kref, struct nfs_page, wb_kref);
>> + WARN_ON_ONCE(req->wb_this_page != req);
>> +
>> + /* extra debug: make sure no sync bits are still set */
>> + WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));
>>
>> /* Release struct file and open context */
>> nfs_clear_request(req);
>> @@ -256,7 +403,7 @@ static void nfs_free_request(struct kref *kref)
>>
>> void nfs_release_request(struct nfs_page *req)
>> {
>> - kref_put(&req->wb_kref, nfs_free_request);
>> + kref_put(&req->wb_kref, nfs_page_group_destroy);
>> }
>>
>> static int nfs_wait_bit_uninterruptible(void *word)
>> @@ -833,21 +980,66 @@ static void nfs_pageio_doio(struct nfs_pageio_descriptor *desc)
>> * @desc: destination io descriptor
>> * @req: request
>> *
>> + * This may split a request into subrequests which are all part of the
>> + * same page group.
>> + *
>> * Returns true if the request 'req' was successfully coalesced into the
>> * existing list of pages 'desc'.
>> */
>> static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
>> struct nfs_page *req)
>> {
>> - while (!nfs_pageio_do_add_request(desc, req)) {
>> - desc->pg_moreio = 1;
>> - nfs_pageio_doio(desc);
>> - if (desc->pg_error < 0)
>> - return 0;
>> - desc->pg_moreio = 0;
>> - if (desc->pg_recoalesce)
>> - return 0;
>> - }
>> + struct nfs_page *subreq;
>> + unsigned int bytes_left = 0;
>> + unsigned int offset, pgbase;
>> +
>> + nfs_page_group_lock(req);
>> +
>> + subreq = req;
>> + bytes_left = subreq->wb_bytes;
>> + offset = subreq->wb_offset;
>> + pgbase = subreq->wb_pgbase;
>> +
>> + do {
>> + if (!nfs_pageio_do_add_request(desc, subreq)) {
>> + /* make sure pg_test call(s) did nothing */
>> + WARN_ON_ONCE(subreq->wb_bytes != bytes_left);
>> + WARN_ON_ONCE(subreq->wb_offset != offset);
>> + WARN_ON_ONCE(subreq->wb_pgbase != pgbase);
>> +
>> + nfs_page_group_unlock(req);
>> + desc->pg_moreio = 1;
>> + nfs_pageio_doio(desc);
>> + if (desc->pg_error < 0)
>> + return 0;
>> + desc->pg_moreio = 0;
>> + if (desc->pg_recoalesce)
>> + return 0;
>
> Would it make sense to 'or' together the desc->pg_error and desc->pg_recoalesce checks, rather than having two distinct if statements with the same return value?
>
> Anna
>

That's a copy-paste from the original __nfs_pageio_add_request. I didn't combine the
statements, because desc->pg_moreio is set to 0 iff desc->pg_error < 0 evaluates to false.

We might be able to clean this up, but it's not quite as simple as combining them into one
statement.
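
To make the ordering concrete, here is the control flow boiled down to
a standalone sketch (toy struct, not the kernel source). Merging the
two checks into one would either skip the pg_moreio reset on the
recoalesce path or perform it on the error path, and both differ from
the current behavior:

struct pgio {
        int pg_moreio;
        int pg_error;
        int pg_recoalesce;
};

static void do_io(struct pgio *desc)
{
        (void)desc;     /* flush; may set desc->pg_error */
}

static int slow_path(struct pgio *desc)
{
        desc->pg_moreio = 1;
        do_io(desc);
        if (desc->pg_error < 0)
                return 0;       /* error path: pg_moreio stays 1 */
        desc->pg_moreio = 0;    /* reset runs only when there was no error */
        if (desc->pg_recoalesce)
                return 0;       /* recoalesce path: pg_moreio already 0 */
        return 1;
}

int main(void)
{
        struct pgio desc = { 0, 0, 0 };

        return slow_path(&desc) ? 0 : 1;
}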

-dros

>> + /* retry add_request for this subreq */
>> + nfs_page_group_lock(req);
>> + continue;
>> + }
>> +
>> + /* check for buggy pg_test call(s) */
>> + WARN_ON_ONCE(subreq->wb_bytes + subreq->wb_pgbase > PAGE_SIZE);
>> + WARN_ON_ONCE(subreq->wb_bytes > bytes_left);
>> + WARN_ON_ONCE(subreq->wb_bytes == 0);
>> +
>> + bytes_left -= subreq->wb_bytes;
>> + offset += subreq->wb_bytes;
>> + pgbase += subreq->wb_bytes;
>> +
>> + if (bytes_left) {
>> + subreq = nfs_create_request(req->wb_context,
>> + req->wb_page,
>> + subreq, pgbase, bytes_left);
>> + nfs_lock_request(subreq);
>> + subreq->wb_offset = offset;
>> + subreq->wb_index = req->wb_index;
>> + }
>> + } while (bytes_left > 0);
>> +
>> + nfs_page_group_unlock(req);
>> return 1;
>> }
>>
>> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
>> index 46d9044..902ba2c 100644
>> --- a/fs/nfs/read.c
>> +++ b/fs/nfs/read.c
>> @@ -85,7 +85,7 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
>> len = nfs_page_length(page);
>> if (len == 0)
>> return nfs_return_empty_page(page);
>> - new = nfs_create_request(ctx, page, 0, len);
>> + new = nfs_create_request(ctx, page, NULL, 0, len);
>> if (IS_ERR(new)) {
>> unlock_page(page);
>> return PTR_ERR(new);
>> @@ -311,7 +311,7 @@ readpage_async_filler(void *data, struct page *page)
>> if (len == 0)
>> return nfs_return_empty_page(page);
>>
>> - new = nfs_create_request(desc->ctx, page, 0, len);
>> + new = nfs_create_request(desc->ctx, page, NULL, 0, len);
>> if (IS_ERR(new))
>> goto out_error;
>>
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index e773df2..d0f30f1 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -367,6 +367,8 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
>> {
>> struct nfs_inode *nfsi = NFS_I(inode);
>>
>> + WARN_ON_ONCE(req->wb_this_page != req);
>> +
>> /* Lock the request! */
>> nfs_lock_request(req);
>>
>> @@ -383,6 +385,7 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
>> set_page_private(req->wb_page, (unsigned long)req);
>> }
>> nfsi->npages++;
>> + set_bit(PG_INODE_REF, &req->wb_flags);
>> kref_get(&req->wb_kref);
>> spin_unlock(&inode->i_lock);
>> }
>> @@ -567,6 +570,7 @@ static void nfs_write_completion(struct nfs_pgio_header *hdr)
>> {
>> struct nfs_commit_info cinfo;
>> unsigned long bytes = 0;
>> + bool do_destroy;
>>
>> if (test_bit(NFS_IOHDR_REDO, &hdr->flags))
>> goto out;
>> @@ -596,6 +600,7 @@ remove_req:
>> next:
>> nfs_unlock_request(req);
>> nfs_end_page_writeback(req->wb_page);
>> + do_destroy = !test_bit(NFS_IOHDR_NEED_COMMIT, &hdr->flags);
>> nfs_release_request(req);
>> }
>> out:
>> @@ -700,6 +705,10 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
>> if (req == NULL)
>> goto out_unlock;
>>
>> + /* should be handled by nfs_flush_incompatible */
>> + WARN_ON_ONCE(req->wb_head != req);
>> + WARN_ON_ONCE(req->wb_this_page != req);
>> +
>> rqend = req->wb_offset + req->wb_bytes;
>> /*
>> * Tell the caller to flush out the request if
>> @@ -761,7 +770,7 @@ static struct nfs_page * nfs_setup_write_request(struct nfs_open_context* ctx,
>> req = nfs_try_to_update_request(inode, page, offset, bytes);
>> if (req != NULL)
>> goto out;
>> - req = nfs_create_request(ctx, page, offset, bytes);
>> + req = nfs_create_request(ctx, page, NULL, offset, bytes);
>> if (IS_ERR(req))
>> goto out;
>> nfs_inode_add_request(inode, req);
>> @@ -805,6 +814,8 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
>> return 0;
>> l_ctx = req->wb_lock_context;
>> do_flush = req->wb_page != page || req->wb_context != ctx;
>> + /* for now, flush if more than 1 request in page_group */
>> + do_flush |= req->wb_this_page != req;
>> if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
>> do_flush |= l_ctx->lockowner.l_owner != current->files
>> || l_ctx->lockowner.l_pid != current->tgid;
>> diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
>> index 13d59af..986c0c2 100644
>> --- a/include/linux/nfs_page.h
>> +++ b/include/linux/nfs_page.h
>> @@ -26,6 +26,9 @@ enum {
>> PG_MAPPED, /* page private set for buffered io */
>> PG_CLEAN, /* write succeeded */
>> PG_COMMIT_TO_DS, /* used by pnfs layouts */
>> + PG_INODE_REF, /* extra ref held by inode (head req only) */
>> + PG_HEADLOCK, /* page group lock of wb_head */
>> + PG_TEARDOWN, /* page group sync for destroy */
>> };
>>
>> struct nfs_inode;
>> @@ -41,6 +44,8 @@ struct nfs_page {
>> struct kref wb_kref; /* reference count */
>> unsigned long wb_flags;
>> struct nfs_write_verifier wb_verf; /* Commit cookie */
>> + struct nfs_page *wb_this_page; /* list of reqs for this page */
>> + struct nfs_page *wb_head; /* head pointer for req list */
>> };
>>
>> struct nfs_pageio_descriptor;
>> @@ -87,9 +92,10 @@ struct nfs_pageio_descriptor {
>>
>> extern struct nfs_page *nfs_create_request(struct nfs_open_context *ctx,
>> struct page *page,
>> + struct nfs_page *last,
>> unsigned int offset,
>> unsigned int count);
>> -extern void nfs_release_request(struct nfs_page *req);
>> +extern void nfs_release_request(struct nfs_page *);
>>
>>
>> extern void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
>> @@ -108,7 +114,10 @@ extern size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
>> struct nfs_page *req);
>> extern int nfs_wait_on_request(struct nfs_page *);
>> extern void nfs_unlock_request(struct nfs_page *req);
>> -extern void nfs_unlock_and_release_request(struct nfs_page *req);
>> +extern void nfs_unlock_and_release_request(struct nfs_page *);
>> +extern void nfs_page_group_lock(struct nfs_page *);
>> +extern void nfs_page_group_unlock(struct nfs_page *);
>> +extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
>>
>> /*
>> * Lock the page of an asynchronous request
>


2014-05-28 23:23:18

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH v3 00/18] nfs: support multiple requests per page

On Thu, May 15, 2014 at 11:56 AM, Weston Andros Adamson
<[email protected]> wrote:
> This patchset changes the read and write paths to be more flexible in dealing
> with requests that are not page aligned. Until now there was a 1:1 mapping
> of struct nfs_page (referred to as "nfs requests") to struct page, which
> limited the client to page aligned I/O in several pNFS scenarios.
>
> This patchset allows multiple requests per page, loosely following
> the approach taken with struct buffer_head (part of kernel bio interface).
>
> With this patchset the client now supports:
> - non-page-aligned O_DIRECT I/O to DSes (instead of reverting to MDS)
> - arbitrary pnfs layout segment boundaries
> - arbitrary pnfs filelayout stripe sizes
>
> This patchset also includes a lot of cleanup - notably we no longer need
> a separate code path to support rsize/wsize < PAGE_SIZE.
>
> This new approach opens the door to many optimizations, such as not having to
> flush a page on a non-contiguous write, but for the time being we are focusing
> on correctness -- this patchset touches the read and write path for *all*
> versions of NFS!
>
> This has been tested against v2, v3, v4.0 and v4.1 (no pnfs) servers with
> different rsize/wsize settings, and against pynfs filelayout servers hacked to
> have non page aligned stripe sizes.
>
> I had some code review already (with changes applied) and we've been testing
> this pretty extensively for the last month+ - focusing mostly on v2, v3, v4.x
> (no pnfs).
>
> The patchset applies against Trond's testing branch, but should also include
> the fix I posted earlier today: "pnfs: fix race in filelayout commit path"
> as the race seems to be easier to hit with this patchset applied.
>
> I'm pretty sure I didn't break anything in the object and block layouts, but
> some extra attention there would be helpful.
>
> I plan on sharing some performance numbers once I'm able to run some nfsometer
> workloads.
>
> Changes in V3:
> - rebased to Anna's newest patches, which merges pageio.c into pagelist.c

Thanks! Applied...

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]

2014-05-15 15:57:23

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 10/18] pnfs: clean up filelayout_alloc_commit_info

Remove an unneeded else statement and clean up how the commit info's
dataserver buckets are replaced.
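
For reference, the grow-and-swap pattern the new code follows can be
sketched in plain userspace C (stand-in types, a pthread mutex in place
of cinfo->lock; an illustration of the pattern, not the kernel code):

#include <stdlib.h>
#include <pthread.h>

struct bucket { int written, committing; };	/* stand-ins for the lists */

struct ds_info {
	pthread_mutex_t lock;
	struct bucket *buckets;
	int nbuckets;
};

/* Allocate the larger array outside the lock, migrate old contents and
 * swap pointers under the lock, then free whichever array is unused. */
static int grow_buckets(struct ds_info *ds, int size)
{
	struct bucket *new = calloc(size, sizeof(*new));
	struct bucket *tmp;
	int i;

	if (!new)
		return -1;
	pthread_mutex_lock(&ds->lock);
	if (ds->nbuckets >= size)
		goto out;	/* raced with another grower; keep theirs */
	for (i = 0; i < ds->nbuckets; i++)
		new[i] = ds->buckets[i];	/* like the list_splice()s */
	tmp = ds->buckets;
	ds->buckets = new;
	ds->nbuckets = size;
	new = tmp;
out:
	pthread_mutex_unlock(&ds->lock);
	free(new);	/* the old array, or ours if we lost the race */
	return 0;
}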

Suggested-by: Trond Myklebust <[email protected]>
Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/nfs4filelayout.c | 46 ++++++++++++++++++++++++++--------------------
1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index 9319427..9cea935 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -850,11 +850,15 @@ filelayout_alloc_commit_info(struct pnfs_layout_segment *lseg,
{
struct nfs4_filelayout_segment *fl = FILELAYOUT_LSEG(lseg);
struct pnfs_commit_bucket *buckets;
- int size;
+ int size, i;

if (fl->commit_through_mds)
return 0;
- if (cinfo->ds->nbuckets != 0) {
+
+ size = (fl->stripe_type == STRIPE_SPARSE) ?
+ fl->dsaddr->ds_num : fl->dsaddr->stripe_count;
+
+ if (cinfo->ds->nbuckets >= size) {
/* This assumes there is only one IOMODE_RW lseg. What
* we really want to do is have a layout_hdr level
* dictionary of <multipath_list4, fh> keys, each
@@ -864,30 +868,32 @@ filelayout_alloc_commit_info(struct pnfs_layout_segment *lseg,
return 0;
}

- size = (fl->stripe_type == STRIPE_SPARSE) ?
- fl->dsaddr->ds_num : fl->dsaddr->stripe_count;
-
buckets = kcalloc(size, sizeof(struct pnfs_commit_bucket),
gfp_flags);
if (!buckets)
return -ENOMEM;
- else {
- int i;
+ for (i = 0; i < size; i++) {
+ INIT_LIST_HEAD(&buckets[i].written);
+ INIT_LIST_HEAD(&buckets[i].committing);
+ }

- spin_lock(cinfo->lock);
- if (cinfo->ds->nbuckets != 0)
- kfree(buckets);
- else {
- cinfo->ds->buckets = buckets;
- cinfo->ds->nbuckets = size;
- for (i = 0; i < size; i++) {
- INIT_LIST_HEAD(&buckets[i].written);
- INIT_LIST_HEAD(&buckets[i].committing);
- }
- }
- spin_unlock(cinfo->lock);
- return 0;
+ spin_lock(cinfo->lock);
+ if (cinfo->ds->nbuckets >= size)
+ goto out;
+ for (i = 0; i < cinfo->ds->nbuckets; i++) {
+ list_splice(&cinfo->ds->buckets[i].written,
+ &buckets[i].written);
+ list_splice(&cinfo->ds->buckets[i].committing,
+ &buckets[i].committing);
+ buckets[i].wlseg = cinfo->ds->buckets[i].wlseg;
+ buckets[i].clseg = cinfo->ds->buckets[i].clseg;
}
+ swap(cinfo->ds->buckets, buckets);
+ cinfo->ds->nbuckets = size;
+out:
+ spin_unlock(cinfo->lock);
+ kfree(buckets);
+ return 0;
}

static struct pnfs_layout_segment *
--
1.8.5.2 (Apple Git-48)


2014-05-15 15:57:29

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 13/18] nfs: use > 1 request to handle bsize < PAGE_SIZE

Use the newly added support for multiple requests per page to handle
rsize/wsize < PAGE_SIZE, instead of having multiple read/write
data structures per pageio header.

This allows us to get rid of nfs_pgio_multi.
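
The effect is easiest to see with numbers: with pg_bsize = 2048 and 1536
bytes already coalesced, a 4096-byte request is trimmed to 512 bytes and
the remainder is carried by further subrequests. A minimal userspace
model of the new nfs_generic_pg_test() arithmetic (simplified fields;
resetting pg_count stands in for flushing a full descriptor):

#include <stdio.h>
#include <stddef.h>

/* New pg_test contract: how many bytes of the request fit, 0 if none */
static size_t pg_test(size_t pg_bsize, size_t pg_count, size_t wb_bytes)
{
	if (pg_count > pg_bsize)
		return 0;	/* should never happen */
	return pg_bsize - pg_count < wb_bytes ? pg_bsize - pg_count
					      : wb_bytes;
}

int main(void)
{
	size_t left = 4096, pg_count = 1536, pg_bsize = 2048;

	while (left) {
		size_t n = pg_test(pg_bsize, pg_count, left);

		if (n == 0) {		/* descriptor is full: flush it */
			pg_count = 0;
			continue;
		}
		printf("subrequest of %zu bytes\n", n);	/* 512, 2048, 1536 */
		left -= n;
		pg_count += n;
	}
	return 0;
}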

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/pagelist.c | 80 ++++++++-----------------------------------------------
1 file changed, 11 insertions(+), 69 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 44aef5a..0871a95 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -443,21 +443,13 @@ nfs_wait_on_request(struct nfs_page *req)
size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
struct nfs_page *prev, struct nfs_page *req)
{
- if (!prev)
- return req->wb_bytes;
- /*
- * FIXME: ideally we should be able to coalesce all requests
- * that are not block boundary aligned, but currently this
- * is problematic for the case of bsize < PAGE_CACHE_SIZE,
- * since nfs_flush_multi and nfs_pagein_multi assume you
- * can have only one struct nfs_page.
- */
- if (desc->pg_bsize < PAGE_SIZE)
+ if (desc->pg_count > desc->pg_bsize) {
+ /* should never happen */
+ WARN_ON_ONCE(1);
return 0;
+ }

- if (desc->pg_count + req->wb_bytes <= desc->pg_bsize)
- return req->wb_bytes;
- return 0;
+ return min(desc->pg_bsize - desc->pg_count, (size_t)req->wb_bytes);
}
EXPORT_SYMBOL_GPL(nfs_generic_pg_test);

@@ -767,50 +759,6 @@ static void nfs_pgio_result(struct rpc_task *task, void *calldata)
}

/*
- * Generate multiple small requests to read or write a single
- * contiguous dirty on one page.
- */
-static int nfs_pgio_multi(struct nfs_pageio_descriptor *desc,
- struct nfs_pgio_header *hdr)
-{
- struct nfs_page *req = hdr->req;
- struct page *page = req->wb_page;
- struct nfs_pgio_data *data;
- size_t wsize = desc->pg_bsize, nbytes;
- unsigned int offset;
- int requests = 0;
- struct nfs_commit_info cinfo;
-
- nfs_init_cinfo(&cinfo, desc->pg_inode, desc->pg_dreq);
-
- if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
- (desc->pg_moreio || nfs_reqs_to_commit(&cinfo) ||
- desc->pg_count > wsize))
- desc->pg_ioflags &= ~FLUSH_COND_STABLE;
-
- offset = 0;
- nbytes = desc->pg_count;
- do {
- size_t len = min(nbytes, wsize);
-
- data = nfs_pgio_data_alloc(hdr, 1);
- if (!data)
- return nfs_pgio_error(desc, hdr);
- data->pages.pagevec[0] = page;
- nfs_pgio_rpcsetup(data, len, offset, desc->pg_ioflags, &cinfo);
- list_add(&data->list, &hdr->rpc_list);
- requests++;
- nbytes -= len;
- offset += len;
- } while (nbytes != 0);
-
- nfs_list_remove_request(req);
- nfs_list_add_request(req, &hdr->pages);
- desc->pg_rpc_callops = &nfs_pgio_common_ops;
- return 0;
-}
-
-/*
* Create an RPC task for the given read or write request and kick it.
* The page must have been locked by the caller.
*
@@ -818,8 +766,8 @@ static int nfs_pgio_multi(struct nfs_pageio_descriptor *desc,
* This is the case if nfs_updatepage detects a conflicting request
* that has been written but not committed.
*/
-static int nfs_pgio_one(struct nfs_pageio_descriptor *desc,
- struct nfs_pgio_header *hdr)
+int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,
+ struct nfs_pgio_header *hdr)
{
struct nfs_page *req;
struct page **pages;
@@ -851,6 +799,7 @@ static int nfs_pgio_one(struct nfs_pageio_descriptor *desc,
desc->pg_rpc_callops = &nfs_pgio_common_ops;
return 0;
}
+EXPORT_SYMBOL_GPL(nfs_generic_pgio);

static int nfs_generic_pg_pgios(struct nfs_pageio_descriptor *desc)
{
@@ -876,15 +825,6 @@ static int nfs_generic_pg_pgios(struct nfs_pageio_descriptor *desc)
return ret;
}

-int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,
- struct nfs_pgio_header *hdr)
-{
- if (desc->pg_bsize < PAGE_CACHE_SIZE)
- return nfs_pgio_multi(desc, hdr);
- return nfs_pgio_one(desc, hdr);
-}
-EXPORT_SYMBOL_GPL(nfs_generic_pgio);
-
static bool nfs_match_open_context(const struct nfs_open_context *ctx1,
const struct nfs_open_context *ctx2)
{
@@ -926,7 +866,9 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev,
return false;
}
size = pgio->pg_ops->pg_test(pgio, prev, req);
- WARN_ON_ONCE(size && size != req->wb_bytes);
+ WARN_ON_ONCE(size > req->wb_bytes);
+ if (size && size < req->wb_bytes)
+ req->wb_bytes = size;
return size > 0;
}

--
1.8.5.2 (Apple Git-48)


2014-05-15 15:57:31

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 14/18] nfs: remove data list from pgio header

Since the ability to split pages into subpage requests has been added,
nfs_pgio_header->rpc_list only ever holds a single nfs_pgio_data, so
replace the list with a direct pointer.
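
The resulting shape, as a declarations-only sketch (suffixed names so
both versions can sit side by side; only the relevant member is shown):

#include <stddef.h>

struct nfs_pgio_data;				/* opaque here */

/* before: one header could fan out into a list of nfs_pgio_data */
struct pgio_header_before {
	struct { void *next, *prev; } rpc_list;	/* stand-in list_head */
};

/* after: exactly one nfs_pgio_data per header */
struct pgio_header_after {
	struct nfs_pgio_data *data;
};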

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/pagelist.c | 39 ++++++---------------------------------
fs/nfs/pnfs.c | 41 +++++++++++++++--------------------------
include/linux/nfs_xdr.h | 5 +++--
3 files changed, 24 insertions(+), 61 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 0871a95..0c4aac4 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -470,7 +470,6 @@ struct nfs_rw_header *nfs_rw_header_alloc(const struct nfs_rw_ops *ops)
struct nfs_pgio_header *hdr = &header->header;

INIT_LIST_HEAD(&hdr->pages);
- INIT_LIST_HEAD(&hdr->rpc_list);
spin_lock_init(&hdr->lock);
atomic_set(&hdr->refcnt, 0);
hdr->rw_ops = ops;
@@ -649,27 +648,6 @@ out:
}
EXPORT_SYMBOL_GPL(nfs_initiate_pgio);

-static int nfs_do_multiple_pgios(struct list_head *head,
- const struct rpc_call_ops *call_ops,
- int how)
-{
- struct nfs_pgio_data *data;
- int ret = 0;
-
- while (!list_empty(head)) {
- int ret2;
-
- data = list_first_entry(head, struct nfs_pgio_data, list);
- list_del_init(&data->list);
-
- ret2 = nfs_initiate_pgio(NFS_CLIENT(data->header->inode),
- data, call_ops, how, 0);
- if (ret == 0)
- ret = ret2;
- }
- return ret;
-}
-
/**
* nfs_pgio_error - Clean up from a pageio error
* @desc: IO descriptor
@@ -678,14 +656,9 @@ static int nfs_do_multiple_pgios(struct list_head *head,
static int nfs_pgio_error(struct nfs_pageio_descriptor *desc,
struct nfs_pgio_header *hdr)
{
- struct nfs_pgio_data *data;
-
set_bit(NFS_IOHDR_REDO, &hdr->flags);
- while (!list_empty(&hdr->rpc_list)) {
- data = list_first_entry(&hdr->rpc_list, struct nfs_pgio_data, list);
- list_del(&data->list);
- nfs_pgio_data_release(data);
- }
+ nfs_pgio_data_release(hdr->data);
+ hdr->data = NULL;
desc->pg_completion_ops->error_cleanup(&desc->pg_list);
return -ENOMEM;
}
@@ -795,7 +768,7 @@ int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,

/* Set up the argument struct */
nfs_pgio_rpcsetup(data, desc->pg_count, 0, desc->pg_ioflags, &cinfo);
- list_add(&data->list, &hdr->rpc_list);
+ hdr->data = data;
desc->pg_rpc_callops = &nfs_pgio_common_ops;
return 0;
}
@@ -817,9 +790,9 @@ static int nfs_generic_pg_pgios(struct nfs_pageio_descriptor *desc)
atomic_inc(&hdr->refcnt);
ret = nfs_generic_pgio(desc, hdr);
if (ret == 0)
- ret = nfs_do_multiple_pgios(&hdr->rpc_list,
- desc->pg_rpc_callops,
- desc->pg_ioflags);
+ ret = nfs_initiate_pgio(NFS_CLIENT(hdr->inode),
+ hdr->data, desc->pg_rpc_callops,
+ desc->pg_ioflags, 0);
if (atomic_dec_and_test(&hdr->refcnt))
hdr->completion_ops->completion(hdr);
return ret;
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 354c53c..6ef108b 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1573,23 +1573,18 @@ pnfs_try_to_write_data(struct nfs_pgio_data *wdata,
}

static void
-pnfs_do_multiple_writes(struct nfs_pageio_descriptor *desc, struct list_head *head, int how)
+pnfs_do_write(struct nfs_pageio_descriptor *desc,
+ struct nfs_pgio_header *hdr, int how)
{
- struct nfs_pgio_data *data;
+ struct nfs_pgio_data *data = hdr->data;
const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
struct pnfs_layout_segment *lseg = desc->pg_lseg;
+ enum pnfs_try_status trypnfs;

desc->pg_lseg = NULL;
- while (!list_empty(head)) {
- enum pnfs_try_status trypnfs;
-
- data = list_first_entry(head, struct nfs_pgio_data, list);
- list_del_init(&data->list);
-
- trypnfs = pnfs_try_to_write_data(data, call_ops, lseg, how);
- if (trypnfs == PNFS_NOT_ATTEMPTED)
- pnfs_write_through_mds(desc, data);
- }
+ trypnfs = pnfs_try_to_write_data(data, call_ops, lseg, how);
+ if (trypnfs == PNFS_NOT_ATTEMPTED)
+ pnfs_write_through_mds(desc, data);
pnfs_put_lseg(lseg);
}

@@ -1623,7 +1618,7 @@ pnfs_generic_pg_writepages(struct nfs_pageio_descriptor *desc)
pnfs_put_lseg(desc->pg_lseg);
desc->pg_lseg = NULL;
} else
- pnfs_do_multiple_writes(desc, &hdr->rpc_list, desc->pg_ioflags);
+ pnfs_do_write(desc, hdr, desc->pg_ioflags);
if (atomic_dec_and_test(&hdr->refcnt))
hdr->completion_ops->completion(hdr);
return ret;
@@ -1731,23 +1726,17 @@ pnfs_try_to_read_data(struct nfs_pgio_data *rdata,
}

static void
-pnfs_do_multiple_reads(struct nfs_pageio_descriptor *desc, struct list_head *head)
+pnfs_do_read(struct nfs_pageio_descriptor *desc, struct nfs_pgio_header *hdr)
{
- struct nfs_pgio_data *data;
+ struct nfs_pgio_data *data = hdr->data;
const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
struct pnfs_layout_segment *lseg = desc->pg_lseg;
+ enum pnfs_try_status trypnfs;

desc->pg_lseg = NULL;
- while (!list_empty(head)) {
- enum pnfs_try_status trypnfs;
-
- data = list_first_entry(head, struct nfs_pgio_data, list);
- list_del_init(&data->list);
-
- trypnfs = pnfs_try_to_read_data(data, call_ops, lseg);
- if (trypnfs == PNFS_NOT_ATTEMPTED)
- pnfs_read_through_mds(desc, data);
- }
+ trypnfs = pnfs_try_to_read_data(data, call_ops, lseg);
+ if (trypnfs == PNFS_NOT_ATTEMPTED)
+ pnfs_read_through_mds(desc, data);
pnfs_put_lseg(lseg);
}

@@ -1782,7 +1771,7 @@ pnfs_generic_pg_readpages(struct nfs_pageio_descriptor *desc)
pnfs_put_lseg(desc->pg_lseg);
desc->pg_lseg = NULL;
} else
- pnfs_do_multiple_reads(desc, &hdr->rpc_list);
+ pnfs_do_read(desc, hdr);
if (atomic_dec_and_test(&hdr->refcnt))
hdr->completion_ops->completion(hdr);
return ret;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index adef7bd..ae63601 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1256,11 +1256,13 @@ enum {
NFS_IOHDR_NEED_RESCHED,
};

+struct nfs_pgio_data;
+
struct nfs_pgio_header {
struct inode *inode;
struct rpc_cred *cred;
struct list_head pages;
- struct list_head rpc_list;
+ struct nfs_pgio_data *data;
atomic_t refcnt;
struct nfs_page *req;
struct nfs_writeverf verf; /* Used for writes */
@@ -1282,7 +1284,6 @@ struct nfs_pgio_header {

struct nfs_pgio_data {
struct nfs_pgio_header *header;
- struct list_head list;
struct rpc_task task;
struct nfs_fattr fattr;
struct nfs_writeverf verf; /* Used for writes */
--
1.8.5.2 (Apple Git-48)


2014-05-15 15:57:09

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 03/18] nfs: remove unused arg from nfs_create_request

@inode is passed but not used.
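
Call sites simply drop the argument, e.g. in nfs_setup_write_request():

-	req = nfs_create_request(ctx, inode, page, offset, bytes);
+	req = nfs_create_request(ctx, page, offset, bytes);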

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/direct.c | 6 ++----
fs/nfs/pagelist.c | 4 +---
fs/nfs/read.c | 5 ++---
fs/nfs/write.c | 2 +-
include/linux/nfs_page.h | 1 -
5 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 164b016..1dd8c62 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -380,8 +380,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de
struct nfs_page *req;
unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
/* XXX do we need to do the eof zeroing found in async_filler? */
- req = nfs_create_request(dreq->ctx, dreq->inode,
- pagevec[i],
+ req = nfs_create_request(dreq->ctx, pagevec[i],
pgbase, req_len);
if (IS_ERR(req)) {
result = PTR_ERR(req);
@@ -750,8 +749,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *d
struct nfs_page *req;
unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);

- req = nfs_create_request(dreq->ctx, dreq->inode,
- pagevec[i],
+ req = nfs_create_request(dreq->ctx, pagevec[i],
pgbase, req_len);
if (IS_ERR(req)) {
result = PTR_ERR(req);
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 88b6442..22b0f51 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -139,7 +139,6 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
/**
* nfs_create_request - Create an NFS read/write request.
* @ctx: open context to use
- * @inode: inode to which the request is attached
* @page: page to write
* @offset: starting offset within the page for the write
* @count: number of bytes to read/write
@@ -149,8 +148,7 @@ nfs_iocounter_wait(struct nfs_io_counter *c)
* User should ensure it is safe to sleep in this function.
*/
struct nfs_page *
-nfs_create_request(struct nfs_open_context *ctx, struct inode *inode,
- struct page *page,
+nfs_create_request(struct nfs_open_context *ctx, struct page *page,
unsigned int offset, unsigned int count)
{
struct nfs_page *req;
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 3986668..46d9044 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -85,7 +85,7 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
len = nfs_page_length(page);
if (len == 0)
return nfs_return_empty_page(page);
- new = nfs_create_request(ctx, inode, page, 0, len);
+ new = nfs_create_request(ctx, page, 0, len);
if (IS_ERR(new)) {
unlock_page(page);
return PTR_ERR(new);
@@ -303,7 +303,6 @@ static int
readpage_async_filler(void *data, struct page *page)
{
struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
- struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *new;
unsigned int len;
int error;
@@ -312,7 +311,7 @@ readpage_async_filler(void *data, struct page *page)
if (len == 0)
return nfs_return_empty_page(page);

- new = nfs_create_request(desc->ctx, inode, page, 0, len);
+ new = nfs_create_request(desc->ctx, page, 0, len);
if (IS_ERR(new))
goto out_error;

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 2680f29..e773df2 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -761,7 +761,7 @@ static struct nfs_page * nfs_setup_write_request(struct nfs_open_context* ctx,
req = nfs_try_to_update_request(inode, page, offset, bytes);
if (req != NULL)
goto out;
- req = nfs_create_request(ctx, inode, page, offset, bytes);
+ req = nfs_create_request(ctx, page, offset, bytes);
if (IS_ERR(req))
goto out;
nfs_inode_add_request(inode, req);
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index eb2eb63..be0b098 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -85,7 +85,6 @@ struct nfs_pageio_descriptor {
#define NFS_WBACK_BUSY(req) (test_bit(PG_BUSY,&(req)->wb_flags))

extern struct nfs_page *nfs_create_request(struct nfs_open_context *ctx,
- struct inode *inode,
struct page *page,
unsigned int offset,
unsigned int count);
--
1.8.5.2 (Apple Git-48)


2014-05-15 15:57:11

by Weston Andros Adamson

[permalink] [raw]
Subject: [PATCH v3 04/18] nfs: modify pg_test interface to return size_t

This is a step toward allowing pg_test to inform the
coalescing code to reduce the size of requests so they may fit in
whatever scheme the pg_test callback wants to define.

For now, just return the size of the request if there is space, or 0
if there is not. This shouldn't change any behavior as it acts
the same as when the pg_test functions returned bool.
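
As a sketch of the conversion pattern (my_pg_test_*() and fits() are
hypothetical stand-ins for a pg_test callback and its predicate, not
kernel symbols):

#include <stddef.h>
#include <stdbool.h>

struct req { size_t wb_bytes; };

static bool fits(const struct req *req)
{
	return req->wb_bytes <= 4096;	/* arbitrary example predicate */
}

/* old contract: coalesce or not */
static bool my_pg_test_bool(const struct req *req)
{
	return fits(req);
}

/* new contract: 0 means "cannot coalesce"; otherwise the number of
 * bytes (at most req->wb_bytes) that may be coalesced */
static size_t my_pg_test_size(const struct req *req)
{
	return fits(req) ? req->wb_bytes : 0;
}

int main(void)
{
	struct req r = { 512 };

	return my_pg_test_bool(&r) && my_pg_test_size(&r) == 512 ? 0 : 1;
}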

Signed-off-by: Weston Andros Adamson <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 16 ++++++++++++----
fs/nfs/nfs4filelayout.c | 12 +++++++-----
fs/nfs/objlayout/objio_osd.c | 15 ++++++++++-----
fs/nfs/pagelist.c | 22 +++++++++++++++++++---
fs/nfs/pnfs.c | 12 +++++++++---
fs/nfs/pnfs.h | 3 ++-
include/linux/nfs_page.h | 5 +++--
7 files changed, 62 insertions(+), 23 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 206cc68..9b431f4 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1189,13 +1189,17 @@ bl_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
pnfs_generic_pg_init_read(pgio, req);
}

-static bool
+/*
+ * Return 0 if @req cannot be coalesced into @pgio, otherwise return the number
+ * of bytes (maximum @req->wb_bytes) that can be coalesced.
+ */
+static size_t
bl_pg_test_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
struct nfs_page *req)
{
if (pgio->pg_dreq != NULL &&
!is_aligned_req(req, SECTOR_SIZE))
- return false;
+ return 0;

return pnfs_generic_pg_test(pgio, prev, req);
}
@@ -1241,13 +1245,17 @@ bl_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
}
}

-static bool
+/*
+ * Return 0 if @req cannot be coalesced into @pgio, otherwise return the number
+ * of bytes (maximum @req->wb_bytes) that can be coalesced.
+ */
+static size_t
bl_pg_test_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
struct nfs_page *req)
{
if (pgio->pg_dreq != NULL &&
!is_aligned_req(req, PAGE_CACHE_SIZE))
- return false;
+ return 0;

return pnfs_generic_pg_test(pgio, prev, req);
}
diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index 9fd7ceb..ba9a9aa 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -915,10 +915,10 @@ filelayout_alloc_lseg(struct pnfs_layout_hdr *layoutid,
/*
* filelayout_pg_test(). Called by nfs_can_coalesce_requests()
*
- * return true : coalesce page
- * return false : don't coalesce page
+ * Return 0 if @req cannot be coalesced into @pgio, otherwise return the number
+ * of bytes (maximum @req->wb_bytes) that can be coalesced.
*/
-static bool
+static size_t
filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
struct nfs_page *req)
{
@@ -927,7 +927,7 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,

if (!pnfs_generic_pg_test(pgio, prev, req) ||
!nfs_generic_pg_test(pgio, prev, req))
- return false;
+ return 0;

p_stripe = (u64)req_offset(prev);
r_stripe = (u64)req_offset(req);
@@ -936,7 +936,9 @@ filelayout_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
do_div(p_stripe, stripe_unit);
do_div(r_stripe, stripe_unit);

- return (p_stripe == r_stripe);
+ if (p_stripe == r_stripe)
+ return req->wb_bytes;
+ return 0;
}

static void
diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
index 426b366..71b9c69 100644
--- a/fs/nfs/objlayout/objio_osd.c
+++ b/fs/nfs/objlayout/objio_osd.c
@@ -564,14 +564,19 @@ int objio_write_pagelist(struct nfs_pgio_data *wdata, int how)
return 0;
}

-static bool objio_pg_test(struct nfs_pageio_descriptor *pgio,
+/*
+ * Return 0 if @req cannot be coalesced into @pgio, otherwise return the number
+ * of bytes (maximum @req->wb_bytes) that can be coalesced.
+ */
+static size_t objio_pg_test(struct nfs_pageio_descriptor *pgio,
struct nfs_page *prev, struct nfs_page *req)
{
- if (!pnfs_generic_pg_test(pgio, prev, req))
- return false;
+ if (!pnfs_generic_pg_test(pgio, prev, req) ||
+ pgio->pg_count + req->wb_bytes >
+ (unsigned long)pgio->pg_layout_private)
+ return 0;

- return pgio->pg_count + req->wb_bytes <=
- (unsigned long)pgio->pg_layout_private;
+ return req->wb_bytes;
}

static void objio_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 22b0f51..8d9e457 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -280,7 +280,17 @@ nfs_wait_on_request(struct nfs_page *req)
TASK_UNINTERRUPTIBLE);
}

-bool nfs_generic_pg_test(struct nfs_pageio_descriptor *desc, struct nfs_page *prev, struct nfs_page *req)
+/*
+ * nfs_generic_pg_test - determine if requests can be coalesced
+ * @desc: pointer to descriptor
+ * @prev: previous request in desc, or NULL
+ * @req: this request
+ *
+ * Returns zero if @req cannot be coalesced into @desc, otherwise it returns
+ * the size of the request.
+ */
+size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
+ struct nfs_page *prev, struct nfs_page *req)
{
/*
* FIXME: ideally we should be able to coalesce all requests
@@ -292,7 +302,9 @@ bool nfs_generic_pg_test(struct nfs_pageio_descriptor *desc, struct nfs_page *pr
if (desc->pg_bsize < PAGE_SIZE)
return 0;

- return desc->pg_count + req->wb_bytes <= desc->pg_bsize;
+ if (desc->pg_count + req->wb_bytes <= desc->pg_bsize)
+ return req->wb_bytes;
+ return 0;
}
EXPORT_SYMBOL_GPL(nfs_generic_pg_test);

@@ -748,6 +760,8 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev,
struct nfs_page *req,
struct nfs_pageio_descriptor *pgio)
{
+ size_t size;
+
if (!nfs_match_open_context(req->wb_context, prev->wb_context))
return false;
if (req->wb_context->dentry->d_inode->i_flock != NULL &&
@@ -759,7 +773,9 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev,
return false;
if (req_offset(req) != req_offset(prev) + prev->wb_bytes)
return false;
- return pgio->pg_ops->pg_test(pgio, prev, req);
+ size = pgio->pg_ops->pg_test(pgio, prev, req);
+ WARN_ON_ONCE(size && size != req->wb_bytes);
+ return size > 0;
}

/**
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 0fe6701..de6eb16 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1434,7 +1434,11 @@ pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio,
}
EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_write);

-bool
+/*
+ * Return 0 if @req cannot be coalesced into @pgio, otherwise return the number
+ * of bytes (maximum @req->wb_bytes) that can be coalesced.
+ */
+size_t
pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
struct nfs_page *req)
{
@@ -1455,8 +1459,10 @@ pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev,
* first byte that lies outside the pnfs_layout_range. FIXME?
*
*/
- return req_offset(req) < end_offset(pgio->pg_lseg->pls_range.offset,
- pgio->pg_lseg->pls_range.length);
+ if (req_offset(req) < end_offset(pgio->pg_lseg->pls_range.offset,
+ pgio->pg_lseg->pls_range.length))
+ return req->wb_bytes;
+ return 0;
}
EXPORT_SYMBOL_GPL(pnfs_generic_pg_test);

diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 0031267..dccf182 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -187,7 +187,8 @@ int pnfs_generic_pg_readpages(struct nfs_pageio_descriptor *desc);
void pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio,
struct nfs_page *req, u64 wb_size);
int pnfs_generic_pg_writepages(struct nfs_pageio_descriptor *desc);
-bool pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio, struct nfs_page *prev, struct nfs_page *req);
+size_t pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio,
+ struct nfs_page *prev, struct nfs_page *req);
void pnfs_set_lo_fail(struct pnfs_layout_segment *lseg);
struct pnfs_layout_segment *pnfs_layout_process(struct nfs4_layoutget *lgp);
void pnfs_free_lseg_list(struct list_head *tmp_list);
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index be0b098..13d59af 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -46,7 +46,8 @@ struct nfs_page {
struct nfs_pageio_descriptor;
struct nfs_pageio_ops {
void (*pg_init)(struct nfs_pageio_descriptor *, struct nfs_page *);
- bool (*pg_test)(struct nfs_pageio_descriptor *, struct nfs_page *, struct nfs_page *);
+ size_t (*pg_test)(struct nfs_pageio_descriptor *, struct nfs_page *,
+ struct nfs_page *);
int (*pg_doio)(struct nfs_pageio_descriptor *);
};

@@ -102,7 +103,7 @@ extern int nfs_pageio_add_request(struct nfs_pageio_descriptor *,
struct nfs_page *);
extern void nfs_pageio_complete(struct nfs_pageio_descriptor *desc);
extern void nfs_pageio_cond_complete(struct nfs_pageio_descriptor *, pgoff_t);
-extern bool nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
+extern size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
struct nfs_page *prev,
struct nfs_page *req);
extern int nfs_wait_on_request(struct nfs_page *);
--
1.8.5.2 (Apple Git-48)