2017-07-19 22:10:01

by Trond Myklebust

Subject: [PATCH 00/20] Reducing inode->i_lock contention in writebacks

Chuck Lever has presented measurements that would appear to indicate that
the inode->i_lock can be a source of contention when doing writeback using
high performance networks. This patch series is therefore an attempt to
address the sources of contention that his measurements pointed to.

According to Chuck's table, the main sources of contention appear to
be in nfs_flush_incompatible(), nfs_lock_and_join_requests() and
nfs_updatepage(). Most of this patch series focuses on fixing
nfs_lock_and_join_requests(), since it holds the inode lock, while
taking locks on the page group and the sub-requests themselves,
rolling them all back if ever there is contention.
By noting a few simple rules that are mostly already in place, we
can simplify that locking to ensure that we only have to keep the
spin lock while we're dereferencing and locking the head page pointer
that is stored in page_private(page).
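
To make the intended locking order concrete, here is a rough sketch
(illustrative pseudo-code only, not the final code: the wrapper name is
invented for this sketch, error handling and the subrequest joining are
elided, and the helpers are the existing ones in fs/nfs/write.c):

static struct nfs_page *lock_and_join_sketch(struct page *page)
{
        struct inode *inode = page_file_mapping(page)->host;
        struct nfs_page *head;

retry:
        if (!(PagePrivate(page) || PageSwapCache(page)))
                return NULL;
        /* The spin lock is held only while dereferencing page_private(page)
         * and pinning/locking the head request it points to. */
        spin_lock(&inode->i_lock);
        head = nfs_page_find_head_request_locked(NFS_I(inode), page);
        if (head == NULL) {
                spin_unlock(&inode->i_lock);
                return NULL;
        }
        if (!nfs_lock_request(head)) {
                spin_unlock(&inode->i_lock);
                nfs_wait_on_request(head);
                nfs_release_request(head);
                goto retry;
        }
        spin_unlock(&inode->i_lock);

        /* Everything else runs under the head request and page group locks. */
        nfs_page_group_lock(head);
        /* ... lock, validate and join the subrequests here ... */
        nfs_page_group_unlock(head);
        return head;
}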

Along the way, the patches also simplify the code a little, and fix
a number of subtle races which mainly occur when you set wsize to
some value smaller than the page size. The most notable such race
occurs between nfs_lock_and_join_requests() and nfs_page_group_destroy(),
and could lead to a double free() of some of the sub-requests.

Finally, there are 2 patches tacked onto the end of the series that
attempt to improve the throttling of writes when the RPC layer is
congested. They do so by forcing each caller of nfs_initiate_pgio()
to wait until the request is being transmitted. The expectation is
that this might help improve latencies when there are several processes
competing for access to the RPC transport.
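
As a rough illustration of the throttling idea (a sketch only, not the
exact code from the last patch; the function name is invented here and the
synchronous/asynchronous distinction is simplified to a boolean):

static int initiate_pgio_sketch(const struct rpc_task_setup *setup, bool sync)
{
        struct rpc_task *task = rpc_run_task(setup);
        int ret = 0;

        if (IS_ERR(task))
                return PTR_ERR(task);
        if (sync) {
                /* Synchronous callers still wait for the RPC to complete. */
                ret = rpc_wait_for_completion_task(task);
                if (ret == 0)
                        ret = task->tk_status;
        } else {
                /* Asynchronous writeback now blocks only until the message
                 * has started to go out on the transport. */
                rpc_wait_for_msg_send(task);
        }
        rpc_put_task(task);
        return ret;
}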

Trond Myklebust (20):
NFS: Simplify page writeback
NFS: Reduce lock contention in nfs_page_find_head_request()
NFS: Reduce lock contention in nfs_try_to_update_request()
NFS: Ensure we always dereference the page head last
NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()
NFS: Fix an ABBA issue in nfs_lock_and_join_requests()
NFS: Don't check request offset and size without holding a lock
NFS: Don't unlock writebacks before declaring PG_WB_END
NFS: Fix the inode request accounting when pages have subrequests
NFS: Teach nfs_try_to_update_request() to deal with request
page_groups
NFS: Remove page group limit in nfs_flush_incompatible()
NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()
NFS: Further optimise nfs_lock_and_join_requests()
NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests()
race cases
NFS: Remove nfs_page_group_clear_bits()
NFS: Remove unused function nfs_page_group_lock_wait()
NFS: Remove unused parameter from nfs_page_group_lock()
NFS: Fix up nfs_page_group_covers_page()
SUNRPC: Add a function to allow waiting for RPC transmission
NFS: Throttle I/O to the NFS server

fs/nfs/pagelist.c | 64 +++------
fs/nfs/write.c | 313 ++++++++++++++++++-------------------------
include/linux/nfs_page.h | 3 +-
include/linux/sunrpc/sched.h | 3 +
net/sunrpc/sched.c | 22 +++
net/sunrpc/xprt.c | 6 +
6 files changed, 177 insertions(+), 234 deletions(-)

--
2.13.3



2017-07-19 22:10:02

by Trond Myklebust

Subject: [PATCH 01/20] NFS: Simplify page writeback

We don't expect the page header lock to ever be held across I/O, so
it should always be safe to wait for it, even if we're doing nonblocking
writebacks.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 30 ++++++++++--------------------
1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b1af5dee5e0a..1d447e37f472 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -366,7 +366,6 @@ nfs_page_group_clear_bits(struct nfs_page *req)
* @inode - inode associated with request page group, must be holding inode lock
* @head - head request of page group, must be holding head lock
* @req - request that couldn't lock and needs to wait on the req bit lock
- * @nonblock - if true, don't actually wait
*
* NOTE: this must be called holding page_group bit lock and inode spin lock
* and BOTH will be released before returning.
@@ -375,7 +374,7 @@ nfs_page_group_clear_bits(struct nfs_page *req)
*/
static int
nfs_unroll_locks_and_wait(struct inode *inode, struct nfs_page *head,
- struct nfs_page *req, bool nonblock)
+ struct nfs_page *req)
__releases(&inode->i_lock)
{
struct nfs_page *tmp;
@@ -396,10 +395,7 @@ nfs_unroll_locks_and_wait(struct inode *inode, struct nfs_page *head,
/* release ref from nfs_page_find_head_request_locked */
nfs_release_request(head);

- if (!nonblock)
- ret = nfs_wait_on_request(req);
- else
- ret = -EAGAIN;
+ ret = nfs_wait_on_request(req);
nfs_release_request(req);

return ret;
@@ -464,7 +460,6 @@ nfs_destroy_unlinked_subrequests(struct nfs_page *destroy_list,
* operations for this page.
*
* @page - the page used to lookup the "page group" of nfs_page structures
- * @nonblock - if true, don't block waiting for request locks
*
* This function joins all sub requests to the head request by first
* locking all requests in the group, cancelling any pending operations
@@ -478,7 +473,7 @@ nfs_destroy_unlinked_subrequests(struct nfs_page *destroy_list,
* error was encountered.
*/
static struct nfs_page *
-nfs_lock_and_join_requests(struct page *page, bool nonblock)
+nfs_lock_and_join_requests(struct page *page)
{
struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *head, *subreq;
@@ -511,14 +506,9 @@ nfs_lock_and_join_requests(struct page *page, bool nonblock)
if (ret < 0) {
spin_unlock(&inode->i_lock);

- if (!nonblock && ret == -EAGAIN) {
- nfs_page_group_lock_wait(head);
- nfs_release_request(head);
- goto try_again;
- }
-
+ nfs_page_group_lock_wait(head);
nfs_release_request(head);
- return ERR_PTR(ret);
+ goto try_again;
}

/* lock each request in the page group */
@@ -543,7 +533,7 @@ nfs_lock_and_join_requests(struct page *page, bool nonblock)
/* releases page group bit lock and
* inode spin lock and all references */
ret = nfs_unroll_locks_and_wait(inode, head,
- subreq, nonblock);
+ subreq);

if (ret == 0)
goto try_again;
@@ -624,12 +614,12 @@ nfs_error_is_fatal_on_server(int err)
* May return an error if the user signalled nfs_wait_on_request().
*/
static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio,
- struct page *page, bool nonblock)
+ struct page *page)
{
struct nfs_page *req;
int ret = 0;

- req = nfs_lock_and_join_requests(page, nonblock);
+ req = nfs_lock_and_join_requests(page);
if (!req)
goto out;
ret = PTR_ERR(req);
@@ -672,7 +662,7 @@ static int nfs_do_writepage(struct page *page, struct writeback_control *wbc,
int ret;

nfs_pageio_cond_complete(pgio, page_index(page));
- ret = nfs_page_async_flush(pgio, page, wbc->sync_mode == WB_SYNC_NONE);
+ ret = nfs_page_async_flush(pgio, page);
if (ret == -EAGAIN) {
redirty_page_for_writepage(wbc, page);
ret = 0;
@@ -2015,7 +2005,7 @@ int nfs_wb_page_cancel(struct inode *inode, struct page *page)

/* blocking call to cancel all requests and join to a single (head)
* request */
- req = nfs_lock_and_join_requests(page, false);
+ req = nfs_lock_and_join_requests(page);

if (IS_ERR(req)) {
ret = PTR_ERR(req);
--
2.13.3


2017-07-19 22:10:04

by Trond Myklebust

Subject: [PATCH 02/20] NFS: Reduce lock contention in nfs_page_find_head_request()

Add a lockless check for whether or not the page might be carrying
an existing writeback before we grab the inode->i_lock.

Reported-by: Chuck Lever <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1d447e37f472..06e150c4e315 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -190,9 +190,11 @@ static struct nfs_page *nfs_page_find_head_request(struct page *page)
struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *req = NULL;

- spin_lock(&inode->i_lock);
- req = nfs_page_find_head_request_locked(NFS_I(inode), page);
- spin_unlock(&inode->i_lock);
+ if (PagePrivate(page) || PageSwapCache(page)) {
+ spin_lock(&inode->i_lock);
+ req = nfs_page_find_head_request_locked(NFS_I(inode), page);
+ spin_unlock(&inode->i_lock);
+ }
return req;
}

--
2.13.3


2017-07-19 22:10:04

by Trond Myklebust

Subject: [PATCH 03/20] NFS: Reduce lock contention in nfs_try_to_update_request()

Micro-optimisation to move the lockless check into the for(;;) loop.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 06e150c4e315..bb019096c331 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1097,13 +1097,12 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
unsigned int end;
int error;

- if (!PagePrivate(page))
- return NULL;
-
end = offset + bytes;
- spin_lock(&inode->i_lock);

for (;;) {
+ if (!(PagePrivate(page) || PageSwapCache(page)))
+ return NULL;
+ spin_lock(&inode->i_lock);
req = nfs_page_find_head_request_locked(NFS_I(inode), page);
if (req == NULL)
goto out_unlock;
@@ -1132,7 +1131,6 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
nfs_release_request(req);
if (error != 0)
goto out_err;
- spin_lock(&inode->i_lock);
}

/* Okay, the request matches. Update the region */
--
2.13.3


2017-07-19 22:10:08

by Trond Myklebust

Subject: [PATCH 06/20] NFS: Fix an ABBA issue in nfs_lock_and_join_requests()

All other callers of nfs_page_group_lock() appear to already hold the
page lock on the head page, so doing it in the opposite order here
is inefficient, although not deadlock prone since we roll back all
locks on contention.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1ca759719429..c940e615f5dc 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -383,7 +383,7 @@ nfs_unroll_locks_and_wait(struct inode *inode, struct nfs_page *head,
int ret;

/* relinquish all the locks successfully grabbed this run */
- for (tmp = head ; tmp != req; tmp = tmp->wb_this_page)
+ for (tmp = head->wb_this_page ; tmp != req; tmp = tmp->wb_this_page)
nfs_unlock_request(tmp);

WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));
@@ -395,7 +395,7 @@ nfs_unroll_locks_and_wait(struct inode *inode, struct nfs_page *head,
spin_unlock(&inode->i_lock);

/* release ref from nfs_page_find_head_request_locked */
- nfs_release_request(head);
+ nfs_unlock_and_release_request(head);

ret = nfs_wait_on_request(req);
nfs_release_request(req);
@@ -484,10 +484,6 @@ nfs_lock_and_join_requests(struct page *page)
int ret;

try_again:
- total_bytes = 0;
-
- WARN_ON_ONCE(destroy_list);
-
spin_lock(&inode->i_lock);

/*
@@ -502,6 +498,16 @@ nfs_lock_and_join_requests(struct page *page)
return NULL;
}

+ /* lock the page head first in order to avoid an ABBA inefficiency */
+ if (!nfs_lock_request(head)) {
+ spin_unlock(&inode->i_lock);
+ ret = nfs_wait_on_request(head);
+ nfs_release_request(head);
+ if (ret < 0)
+ return ERR_PTR(ret);
+ goto try_again;
+ }
+
/* holding inode lock, so always make a non-blocking call to try the
* page group lock */
ret = nfs_page_group_lock(head, true);
@@ -509,13 +515,14 @@ nfs_lock_and_join_requests(struct page *page)
spin_unlock(&inode->i_lock);

nfs_page_group_lock_wait(head);
- nfs_release_request(head);
+ nfs_unlock_and_release_request(head);
goto try_again;
}

/* lock each request in the page group */
- subreq = head;
- do {
+ total_bytes = head->wb_bytes;
+ for (subreq = head->wb_this_page; subreq != head;
+ subreq = subreq->wb_this_page) {
/*
* Subrequests are always contiguous, non overlapping
* and in order - but may be repeated (mirrored writes).
@@ -541,9 +548,7 @@ nfs_lock_and_join_requests(struct page *page)

return ERR_PTR(ret);
}
-
- subreq = subreq->wb_this_page;
- } while (subreq != head);
+ }

/* Now that all requests are locked, make sure they aren't on any list.
* Commit list removal accounting is done after locks are dropped */
--
2.13.3


2017-07-19 22:10:10

by Trond Myklebust

Subject: [PATCH 08/20] NFS: Don't unlock writebacks before declaring PG_WB_END

We don't want nfs_lock_and_join_requests() to start fiddling with
the request before the call to nfs_page_group_sync_on_bit().

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 84b6818e5278..bb38c881fc48 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -335,8 +335,11 @@ static void nfs_end_page_writeback(struct nfs_page *req)
{
struct inode *inode = page_file_mapping(req->wb_page)->host;
struct nfs_server *nfss = NFS_SERVER(inode);
+ bool is_done;

- if (!nfs_page_group_sync_on_bit(req, PG_WB_END))
+ is_done = nfs_page_group_sync_on_bit(req, PG_WB_END);
+ nfs_unlock_request(req);
+ if (!is_done)
return;

end_page_writeback(req->wb_page);
@@ -596,7 +599,6 @@ nfs_lock_and_join_requests(struct page *page)

static void nfs_write_error_remove_page(struct nfs_page *req)
{
- nfs_unlock_request(req);
nfs_end_page_writeback(req);
generic_error_remove_page(page_file_mapping(req->wb_page),
req->wb_page);
@@ -1019,7 +1021,6 @@ static void nfs_write_completion(struct nfs_pgio_header *hdr)
remove_req:
nfs_inode_remove_request(req);
next:
- nfs_unlock_request(req);
nfs_end_page_writeback(req);
nfs_release_request(req);
}
@@ -1406,7 +1407,6 @@ static void nfs_redirty_request(struct nfs_page *req)
{
nfs_mark_request_dirty(req);
set_bit(NFS_CONTEXT_RESEND_WRITES, &req->wb_context->flags);
- nfs_unlock_request(req);
nfs_end_page_writeback(req);
nfs_release_request(req);
}
--
2.13.3


2017-07-19 22:10:10

by Trond Myklebust

Subject: [PATCH 07/20] NFS: Don't check request offset and size without holding a lock

Request offsets and sizes are not guaranteed to be stable unless you
are holding the request locked.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index c940e615f5dc..84b6818e5278 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -523,6 +523,17 @@ nfs_lock_and_join_requests(struct page *page)
total_bytes = head->wb_bytes;
for (subreq = head->wb_this_page; subreq != head;
subreq = subreq->wb_this_page) {
+ if (!nfs_lock_request(subreq)) {
+ /* releases page group bit lock and
+ * inode spin lock and all references */
+ ret = nfs_unroll_locks_and_wait(inode, head,
+ subreq);
+
+ if (ret == 0)
+ goto try_again;
+
+ return ERR_PTR(ret);
+ }
/*
* Subrequests are always contiguous, non overlapping
* and in order - but may be repeated (mirrored writes).
@@ -533,21 +544,10 @@ nfs_lock_and_join_requests(struct page *page)
} else if (WARN_ON_ONCE(subreq->wb_offset < head->wb_offset ||
((subreq->wb_offset + subreq->wb_bytes) >
(head->wb_offset + total_bytes)))) {
+ nfs_unlock_request(subreq);
nfs_unroll_locks_and_wait(inode, head, subreq);
return ERR_PTR(-EIO);
}
-
- if (!nfs_lock_request(subreq)) {
- /* releases page group bit lock and
- * inode spin lock and all references */
- ret = nfs_unroll_locks_and_wait(inode, head,
- subreq);
-
- if (ret == 0)
- goto try_again;
-
- return ERR_PTR(ret);
- }
}

/* Now that all requests are locked, make sure they aren't on any list.
--
2.13.3


2017-07-19 22:10:06

by Trond Myklebust

Subject: [PATCH 04/20] NFS: Ensure we always dereference the page head last

This fixes a race with nfs_page_group_sync_on_bit() whereby the
call to wake_up_bit() in nfs_page_group_unlock() could occur after
the page header had been freed.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/pagelist.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index de9066a92c0d..a6f2bbd709ba 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -306,14 +306,11 @@ static void
nfs_page_group_destroy(struct kref *kref)
{
struct nfs_page *req = container_of(kref, struct nfs_page, wb_kref);
+ struct nfs_page *head = req->wb_head;
struct nfs_page *tmp, *next;

- /* subrequests must release the ref on the head request */
- if (req->wb_head != req)
- nfs_release_request(req->wb_head);
-
if (!nfs_page_group_sync_on_bit(req, PG_TEARDOWN))
- return;
+ goto out;

tmp = req;
do {
@@ -324,6 +321,10 @@ nfs_page_group_destroy(struct kref *kref)
nfs_free_request(tmp);
tmp = next;
} while (tmp != req);
+out:
+ /* subrequests must release the ref on the head request */
+ if (head != req)
+ nfs_release_request(head);
}

/**
--
2.13.3


2017-07-19 22:10:08

by Trond Myklebust

Subject: [PATCH 05/20] NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()

Yes, this is a situation that should never happen (hence the WARN_ON)
but we should still ensure that we free up the locks and references to
the faulty pages.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index bb019096c331..1ca759719429 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -526,8 +526,7 @@ nfs_lock_and_join_requests(struct page *page)
} else if (WARN_ON_ONCE(subreq->wb_offset < head->wb_offset ||
((subreq->wb_offset + subreq->wb_bytes) >
(head->wb_offset + total_bytes)))) {
- nfs_page_group_unlock(head);
- spin_unlock(&inode->i_lock);
+ nfs_unroll_locks_and_wait(inode, head, subreq);
return ERR_PTR(-EIO);
}

--
2.13.3


2017-07-19 22:10:13

by Trond Myklebust

Subject: [PATCH 10/20] NFS: Teach nfs_try_to_update_request() to deal with request page_groups

Simplify the code, and avoid some flushes to disk.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 60 ++++++++++++++++++++--------------------------------------
1 file changed, 20 insertions(+), 40 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index ee981353d4aa..0b4d1ef168e0 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1107,39 +1107,19 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,

end = offset + bytes;

- for (;;) {
- if (!(PagePrivate(page) || PageSwapCache(page)))
- return NULL;
- spin_lock(&inode->i_lock);
- req = nfs_page_find_head_request_locked(NFS_I(inode), page);
- if (req == NULL)
- goto out_unlock;
-
- /* should be handled by nfs_flush_incompatible */
- WARN_ON_ONCE(req->wb_head != req);
- WARN_ON_ONCE(req->wb_this_page != req);
-
- rqend = req->wb_offset + req->wb_bytes;
- /*
- * Tell the caller to flush out the request if
- * the offsets are non-contiguous.
- * Note: nfs_flush_incompatible() will already
- * have flushed out requests having wrong owners.
- */
- if (offset > rqend
- || end < req->wb_offset)
- goto out_flushme;
-
- if (nfs_lock_request(req))
- break;
+ req = nfs_lock_and_join_requests(page);
+ if (IS_ERR_OR_NULL(req))
+ return req;

- /* The request is locked, so wait and then retry */
- spin_unlock(&inode->i_lock);
- error = nfs_wait_on_request(req);
- nfs_release_request(req);
- if (error != 0)
- goto out_err;
- }
+ rqend = req->wb_offset + req->wb_bytes;
+ /*
+ * Tell the caller to flush out the request if
+ * the offsets are non-contiguous.
+ * Note: nfs_flush_incompatible() will already
+ * have flushed out requests having wrong owners.
+ */
+ if (offset > rqend || end < req->wb_offset)
+ goto out_flushme;

/* Okay, the request matches. Update the region */
if (offset < req->wb_offset) {
@@ -1150,17 +1130,17 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
req->wb_bytes = end - req->wb_offset;
else
req->wb_bytes = rqend - req->wb_offset;
-out_unlock:
- if (req)
- nfs_clear_request_commit(req);
- spin_unlock(&inode->i_lock);
return req;
out_flushme:
- spin_unlock(&inode->i_lock);
- nfs_release_request(req);
+ /*
+ * Note: we mark the request dirty here because
+ * nfs_lock_and_join_requests() cannot preserve
+ * commit flags, so we have to replay the write.
+ */
+ nfs_mark_request_dirty(req);
+ nfs_unlock_and_release_request(req);
error = nfs_wb_page(inode, page);
-out_err:
- return ERR_PTR(error);
+ return (error < 0) ? ERR_PTR(error) : NULL;
}

/*
--
2.13.3


2017-07-19 22:10:12

by Trond Myklebust

Subject: [PATCH 09/20] NFS: Fix the inode request accounting when pages have subrequests

Both nfs_destroy_unlinked_subrequests() and nfs_lock_and_join_requests()
manipulate the inode flags, so they also need to adjust the
NFS_I(inode)->nrequests accounting accordingly.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index bb38c881fc48..ee981353d4aa 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -418,7 +418,8 @@ nfs_unroll_locks_and_wait(struct inode *inode, struct nfs_page *head,
*/
static void
nfs_destroy_unlinked_subrequests(struct nfs_page *destroy_list,
- struct nfs_page *old_head)
+ struct nfs_page *old_head,
+ struct inode *inode)
{
while (destroy_list) {
struct nfs_page *subreq = destroy_list;
@@ -443,9 +444,12 @@ nfs_destroy_unlinked_subrequests(struct nfs_page *destroy_list,
nfs_page_group_clear_bits(subreq);

/* release the PG_INODE_REF reference */
- if (test_and_clear_bit(PG_INODE_REF, &subreq->wb_flags))
+ if (test_and_clear_bit(PG_INODE_REF, &subreq->wb_flags)) {
nfs_release_request(subreq);
- else
+ spin_lock(&inode->i_lock);
+ NFS_I(inode)->nrequests--;
+ spin_unlock(&inode->i_lock);
+ } else
WARN_ON_ONCE(1);
} else {
WARN_ON_ONCE(test_bit(PG_CLEAN, &subreq->wb_flags));
@@ -572,25 +576,24 @@ nfs_lock_and_join_requests(struct page *page)
head->wb_bytes = total_bytes;
}

+ /* Postpone destruction of this request */
+ if (test_and_clear_bit(PG_REMOVE, &head->wb_flags)) {
+ set_bit(PG_INODE_REF, &head->wb_flags);
+ kref_get(&head->wb_kref);
+ NFS_I(inode)->nrequests++;
+ }
+
/*
* prepare head request to be added to new pgio descriptor
*/
nfs_page_group_clear_bits(head);

- /*
- * some part of the group was still on the inode list - otherwise
- * the group wouldn't be involved in async write.
- * grab a reference for the head request, iff it needs one.
- */
- if (!test_and_set_bit(PG_INODE_REF, &head->wb_flags))
- kref_get(&head->wb_kref);
-
nfs_page_group_unlock(head);

/* drop lock to clean uprequests on destroy list */
spin_unlock(&inode->i_lock);

- nfs_destroy_unlinked_subrequests(destroy_list, head);
+ nfs_destroy_unlinked_subrequests(destroy_list, head, inode);

/* still holds ref on head from nfs_page_find_head_request_locked
* and still has lock on head from lock loop */
--
2.13.3


2017-07-19 22:10:15

by Trond Myklebust

Subject: [PATCH 12/20] NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()

We should no longer need the inode->i_lock, now that we've
straightened out the request locking. The locking schema is now:

1) Lock page head request
2) Lock the page group
3) Lock the subrequests one by one

Note that there is a subtle race with nfs_inode_remove_request(), since
the latter does not lock the page head when removing it from the
struct page (only the last subrequest is locked). Hence we need to
re-check that PagePrivate(page) is still set after we've locked all
the subrequests.
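
Roughly, in sketch form (illustrative only; the real code below also
handles the error paths and the actual joining of the subrequests):

        /* 1) lock the head request, dropping the i_lock as soon as possible */
        spin_lock(&inode->i_lock);
        head = nfs_page_find_head_request_locked(NFS_I(inode), page);
        /* ... nfs_lock_request(head), or wait and retry ... */
        spin_unlock(&inode->i_lock);

        /* 2) lock the page group, 3) then the subrequests one by one */
        nfs_page_group_lock(head, false);
        /* ... nfs_lock_request() each subrequest ... */
        nfs_page_group_unlock(head);

        /* re-check for the race with nfs_inode_remove_request() described above */
        if (!(PagePrivate(page) || PageSwapCache(page))) {
                nfs_unlock_and_release_request(head);
                return NULL;
        }
        return head;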

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 40 ++++++++++++++++++++++------------------
1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 08c1ce968cce..ff7c90c7ff79 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -372,15 +372,14 @@ nfs_page_group_clear_bits(struct nfs_page *req)
* @head - head request of page group, must be holding head lock
* @req - request that couldn't lock and needs to wait on the req bit lock
*
- * NOTE: this must be called holding page_group bit lock and inode spin lock
- * and BOTH will be released before returning.
+ * NOTE: this must be called holding page_group bit lock
+ * which will be released before returning.
*
* returns 0 on success, < 0 on error.
*/
static int
nfs_unroll_locks_and_wait(struct inode *inode, struct nfs_page *head,
struct nfs_page *req)
- __releases(&inode->i_lock)
{
struct nfs_page *tmp;
int ret;
@@ -395,7 +394,6 @@ nfs_unroll_locks_and_wait(struct inode *inode, struct nfs_page *head,
kref_get(&req->wb_kref);

nfs_page_group_unlock(head);
- spin_unlock(&inode->i_lock);

/* release ref from nfs_page_find_head_request_locked */
nfs_unlock_and_release_request(head);
@@ -491,8 +489,9 @@ nfs_lock_and_join_requests(struct page *page)
int ret;

try_again:
+ if (!(PagePrivate(page) || PageSwapCache(page)))
+ return NULL;
spin_lock(&inode->i_lock);
-
/*
* A reference is taken only on the head request which acts as a
* reference to the whole page group - the group will not be destroyed
@@ -514,16 +513,12 @@ nfs_lock_and_join_requests(struct page *page)
return ERR_PTR(ret);
goto try_again;
}
+ spin_unlock(&inode->i_lock);

- /* holding inode lock, so always make a non-blocking call to try the
- * page group lock */
- ret = nfs_page_group_lock(head, true);
+ ret = nfs_page_group_lock(head, false);
if (ret < 0) {
- spin_unlock(&inode->i_lock);
-
- nfs_page_group_lock_wait(head);
nfs_unlock_and_release_request(head);
- goto try_again;
+ return ERR_PTR(ret);
}

/* lock each request in the page group */
@@ -531,8 +526,10 @@ nfs_lock_and_join_requests(struct page *page)
for (subreq = head->wb_this_page; subreq != head;
subreq = subreq->wb_this_page) {
if (!nfs_lock_request(subreq)) {
- /* releases page group bit lock and
- * inode spin lock and all references */
+ /*
+ * releases page group bit lock and
+ * page locks and all references
+ */
ret = nfs_unroll_locks_and_wait(inode, head,
subreq);

@@ -580,7 +577,9 @@ nfs_lock_and_join_requests(struct page *page)
if (test_and_clear_bit(PG_REMOVE, &head->wb_flags)) {
set_bit(PG_INODE_REF, &head->wb_flags);
kref_get(&head->wb_kref);
+ spin_lock(&inode->i_lock);
NFS_I(inode)->nrequests++;
+ spin_unlock(&inode->i_lock);
}

/*
@@ -590,11 +589,14 @@ nfs_lock_and_join_requests(struct page *page)

nfs_page_group_unlock(head);

- /* drop lock to clean uprequests on destroy list */
- spin_unlock(&inode->i_lock);
-
nfs_destroy_unlinked_subrequests(destroy_list, head, inode);

+ /* Did we lose a race with nfs_inode_remove_request()? */
+ if (!(PagePrivate(page) || PageSwapCache(page))) {
+ nfs_unlock_and_release_request(head);
+ return NULL;
+ }
+
/* still holds ref on head from nfs_page_find_head_request_locked
* and still has lock on head from lock loop */
return head;
@@ -968,7 +970,7 @@ nfs_clear_page_commit(struct page *page)
WB_RECLAIMABLE);
}

-/* Called holding inode (/cinfo) lock */
+/* Called holding the request lock on @req */
static void
nfs_clear_request_commit(struct nfs_page *req)
{
@@ -977,9 +979,11 @@ nfs_clear_request_commit(struct nfs_page *req)
struct nfs_commit_info cinfo;

nfs_init_cinfo_from_inode(&cinfo, inode);
+ spin_lock(&inode->i_lock);
if (!pnfs_clear_request_commit(req, &cinfo)) {
nfs_request_remove_commit_list(req, &cinfo);
}
+ spin_unlock(&inode->i_lock);
nfs_clear_page_commit(req->wb_page);
}
}
--
2.13.3


2017-07-19 22:10:13

by Trond Myklebust

Subject: [PATCH 11/20] NFS: Remove page group limit in nfs_flush_incompatible()

nfs_try_to_update_request() should be able to cope now.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 0b4d1ef168e0..08c1ce968cce 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1205,8 +1205,6 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
l_ctx = req->wb_lock_context;
do_flush = req->wb_page != page ||
!nfs_match_open_context(req->wb_context, ctx);
- /* for now, flush if more than 1 request in page_group */
- do_flush |= req->wb_this_page != req;
if (l_ctx && flctx &&
!(list_empty_careful(&flctx->flc_posix) &&
list_empty_careful(&flctx->flc_flock))) {
--
2.13.3


2017-07-19 22:10:17

by Trond Myklebust

Subject: [PATCH 15/20] NFS: Remove nfs_page_group_clear_bits()

At this point, PG_REMOVE and PG_TEARDOWN are the only flags we ever
expect to see set on the subrequests.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 29 +++--------------------------
1 file changed, 3 insertions(+), 26 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index ffb9934607ef..20d44ea328b6 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -347,22 +347,6 @@ static void nfs_end_page_writeback(struct nfs_page *req)
clear_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
}

-
-/* nfs_page_group_clear_bits
- * @req - an nfs request
- * clears all page group related bits from @req
- */
-static void
-nfs_page_group_clear_bits(struct nfs_page *req)
-{
- clear_bit(PG_TEARDOWN, &req->wb_flags);
- clear_bit(PG_UNLOCKPAGE, &req->wb_flags);
- clear_bit(PG_UPTODATE, &req->wb_flags);
- clear_bit(PG_WB_END, &req->wb_flags);
- clear_bit(PG_REMOVE, &req->wb_flags);
-}
-
-
/*
* nfs_unroll_locks_and_wait - unlock all newly locked reqs and wait on @req
*
@@ -417,13 +401,12 @@ nfs_destroy_unlinked_subrequests(struct nfs_page *destroy_list,
/* make sure old group is not used */
subreq->wb_this_page = subreq;

+ clear_bit(PG_REMOVE, &subreq->wb_flags);
+
/* Note: races with nfs_page_group_destroy() */
if (!kref_read(&subreq->wb_kref)) {
- bool freeme = test_bit(PG_TEARDOWN, &subreq->wb_flags);
-
- nfs_page_group_clear_bits(subreq);
/* Check if we raced with nfs_page_group_destroy() */
- if (freeme)
+ if (test_and_clear_bit(PG_TEARDOWN, &subreq->wb_flags))
nfs_free_request(subreq);
continue;
}
@@ -437,7 +420,6 @@ nfs_destroy_unlinked_subrequests(struct nfs_page *destroy_list,
spin_unlock(&inode->i_lock);
}

- nfs_page_group_clear_bits(subreq);
/* subreq is now totally disconnected from page group or any
* write / commit lists. last chance to wake any waiters */
nfs_unlock_and_release_request(subreq);
@@ -573,11 +555,6 @@ nfs_lock_and_join_requests(struct page *page)
spin_unlock(&inode->i_lock);
}

- /*
- * prepare head request to be added to new pgio descriptor
- */
- nfs_page_group_clear_bits(head);
-
nfs_page_group_unlock(head);

nfs_destroy_unlinked_subrequests(destroy_list, head, inode);
--
2.13.3


2017-07-19 22:10:18

by Trond Myklebust

Subject: [PATCH 16/20] NFS: Remove unused function nfs_page_group_lock_wait()

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/pagelist.c | 21 ---------------------
include/linux/nfs_page.h | 1 -
2 files changed, 22 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index a6f2bbd709ba..ced7974622dd 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -166,27 +166,6 @@ nfs_page_group_lock(struct nfs_page *req, bool nonblock)
}

/*
- * nfs_page_group_lock_wait - wait for the lock to clear, but don't grab it
- * @req - a request in the group
- *
- * This is a blocking call to wait for the group lock to be cleared.
- */
-void
-nfs_page_group_lock_wait(struct nfs_page *req)
-{
- struct nfs_page *head = req->wb_head;
-
- WARN_ON_ONCE(head != head->wb_head);
-
- if (!test_bit(PG_HEADLOCK, &head->wb_flags))
- return;
- set_bit(PG_CONTENDED1, &head->wb_flags);
- smp_mb__after_atomic();
- wait_on_bit(&head->wb_flags, PG_HEADLOCK,
- TASK_UNINTERRUPTIBLE);
-}
-
-/*
* nfs_page_group_unlock - unlock the head of the page group
* @req - request in group that is to be unlocked
*/
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index d67b67ae6c8b..de1d24cedaa2 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -140,7 +140,6 @@ extern int nfs_wait_on_request(struct nfs_page *);
extern void nfs_unlock_request(struct nfs_page *req);
extern void nfs_unlock_and_release_request(struct nfs_page *);
extern int nfs_page_group_lock(struct nfs_page *, bool);
-extern void nfs_page_group_lock_wait(struct nfs_page *);
extern void nfs_page_group_unlock(struct nfs_page *);
extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
extern bool nfs_async_iocounter_wait(struct rpc_task *, struct nfs_lock_context *);
--
2.13.3


2017-07-19 22:10:22

by Trond Myklebust

Subject: [PATCH 19/20] SUNRPC: Add a function to allow waiting for RPC transmission

Sometimes, we only want to know that the RPC message is in the process
of being transmitted. Add a function to allow that.

Signed-off-by: Trond Myklebust <[email protected]>
---
include/linux/sunrpc/sched.h | 3 +++
net/sunrpc/sched.c | 22 ++++++++++++++++++++++
net/sunrpc/xprt.c | 6 ++++++
3 files changed, 31 insertions(+)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 50a99a117da7..15bc1cd6ee5c 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -139,6 +139,8 @@ struct rpc_task_setup {
#define RPC_TASK_RUNNING 0
#define RPC_TASK_QUEUED 1
#define RPC_TASK_ACTIVE 2
+#define RPC_TASK_MSG_XMIT 3
+#define RPC_TASK_MSG_XMIT_WAIT 4

#define RPC_IS_RUNNING(t) test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
#define rpc_set_running(t) set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
@@ -244,6 +246,7 @@ void rpc_free(struct rpc_task *);
int rpciod_up(void);
void rpciod_down(void);
int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *);
+int rpc_wait_for_msg_send(struct rpc_task *task);
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
struct net;
void rpc_show_tasks(struct net *);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 0cc83839c13c..30509a4d7e00 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -320,6 +320,19 @@ int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *act
EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);

/*
+ * Allow callers to wait for the RPC message to be transmitted
+ */
+int rpc_wait_for_msg_send(struct rpc_task *task)
+{
+ if (!test_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate))
+ return 0;
+ set_bit(RPC_TASK_MSG_XMIT_WAIT, &task->tk_runstate);
+ return wait_on_bit_action(&task->tk_runstate, RPC_TASK_MSG_XMIT,
+ rpc_wait_bit_killable, TASK_KILLABLE);
+}
+EXPORT_SYMBOL_GPL(rpc_wait_for_msg_send);
+
+/*
* Make an RPC task runnable.
*
* Note: If the task is ASYNC, and is being made runnable after sitting on an
@@ -700,6 +713,7 @@ rpc_reset_task_statistics(struct rpc_task *task)
{
task->tk_timeouts = 0;
task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_KILLED|RPC_TASK_SENT);
+ set_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate);

rpc_init_task_statistics(task);
}
@@ -928,6 +942,7 @@ static void rpc_init_task(struct rpc_task *task, const struct rpc_task_setup *ta
memset(task, 0, sizeof(*task));
atomic_set(&task->tk_count, 1);
task->tk_flags = task_setup_data->flags;
+ task->tk_runstate = BIT(RPC_TASK_MSG_XMIT);
task->tk_ops = task_setup_data->callback_ops;
task->tk_calldata = task_setup_data->callback_data;
INIT_LIST_HEAD(&task->tk_task);
@@ -1012,6 +1027,13 @@ static void rpc_async_release(struct work_struct *work)

static void rpc_release_resources_task(struct rpc_task *task)
{
+ if (test_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate)) {
+ clear_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate);
+ smp_mb__after_atomic();
+ if (test_bit(RPC_TASK_MSG_XMIT_WAIT, &task->tk_runstate))
+ wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_XMIT);
+ }
+
xprt_release(task);
if (task->tk_msg.rpc_cred) {
put_rpccred(task->tk_msg.rpc_cred);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 4654a9934269..788c1b6138c2 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -937,6 +937,12 @@ bool xprt_prepare_transmit(struct rpc_task *task)
goto out_unlock;
}
ret = true;
+ if (test_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate)) {
+ clear_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate);
+ smp_mb__after_atomic();
+ if (test_bit(RPC_TASK_MSG_XMIT_WAIT, &task->tk_runstate))
+ wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_XMIT);
+ }
out_unlock:
spin_unlock_bh(&xprt->transport_lock);
return ret;
--
2.13.3


2017-07-19 22:10:16

by Trond Myklebust

Subject: [PATCH 13/20] NFS: Further optimise nfs_lock_and_join_requests()

When locking the entire group in order to remove subrequests,
the locks are always taken in order, and with the page group
lock being taken after the page head is locked. The intention
is that:

1) The lock on the group head guarantees that requests may not
be removed from the group (although new entries could be appended
if we're not holding the group lock).
2) It is safe to drop and retake the page group lock while iterating
through the list, in particular when waiting for a subrequest lock.
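
To make points 1) and 2) above concrete, the subrequest wait loop ends up
looking roughly like this (a sketch only; see the diff below for the real
thing):

        for (subreq = head->wb_this_page; subreq != head;
             subreq = subreq->wb_this_page) {
                while (!nfs_lock_request(subreq)) {
                        /* safe to drop the group lock while we wait (point 2) */
                        nfs_page_group_unlock(head);
                        ret = nfs_wait_on_request(subreq);
                        if (!ret)
                                ret = nfs_page_group_lock(head, false);
                        if (ret < 0) {
                                nfs_unroll_locks(inode, head, subreq);
                                nfs_unlock_and_release_request(head);
                                return ERR_PTR(ret);
                        }
                }
                /* subreq cannot have been unlinked while we held the lock on
                 * the head request (point 1), so the walk simply continues */
        }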

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 45 ++++++++++++++++++---------------------------
1 file changed, 18 insertions(+), 27 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index ff7c90c7ff79..1ee5d89380d9 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -377,31 +377,17 @@ nfs_page_group_clear_bits(struct nfs_page *req)
*
* returns 0 on success, < 0 on error.
*/
-static int
-nfs_unroll_locks_and_wait(struct inode *inode, struct nfs_page *head,
+static void
+nfs_unroll_locks(struct inode *inode, struct nfs_page *head,
struct nfs_page *req)
{
struct nfs_page *tmp;
- int ret;

/* relinquish all the locks successfully grabbed this run */
for (tmp = head->wb_this_page ; tmp != req; tmp = tmp->wb_this_page)
nfs_unlock_request(tmp);

WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));
-
- /* grab a ref on the request that will be waited on */
- kref_get(&req->wb_kref);
-
- nfs_page_group_unlock(head);
-
- /* release ref from nfs_page_find_head_request_locked */
- nfs_unlock_and_release_request(head);
-
- ret = nfs_wait_on_request(req);
- nfs_release_request(req);
-
- return ret;
}

/*
@@ -525,18 +511,21 @@ nfs_lock_and_join_requests(struct page *page)
total_bytes = head->wb_bytes;
for (subreq = head->wb_this_page; subreq != head;
subreq = subreq->wb_this_page) {
- if (!nfs_lock_request(subreq)) {
+
+ while (!nfs_lock_request(subreq)) {
/*
- * releases page group bit lock and
- * page locks and all references
+ * Unlock page to allow nfs_page_group_sync_on_bit()
+ * to succeed
*/
- ret = nfs_unroll_locks_and_wait(inode, head,
- subreq);
-
- if (ret == 0)
- goto try_again;
-
- return ERR_PTR(ret);
+ nfs_page_group_unlock(head);
+ ret = nfs_wait_on_request(subreq);
+ if (!ret)
+ ret = nfs_page_group_lock(head, false);
+ if (ret < 0) {
+ nfs_unroll_locks(inode, head, subreq);
+ nfs_unlock_and_release_request(head);
+ return ERR_PTR(ret);
+ }
}
/*
* Subrequests are always contiguous, non overlapping
@@ -549,7 +538,9 @@ nfs_lock_and_join_requests(struct page *page)
((subreq->wb_offset + subreq->wb_bytes) >
(head->wb_offset + total_bytes)))) {
nfs_unlock_request(subreq);
- nfs_unroll_locks_and_wait(inode, head, subreq);
+ nfs_unroll_locks(inode, head, subreq);
+ nfs_page_group_unlock(head);
+ nfs_unlock_and_release_request(head);
return ERR_PTR(-EIO);
}
}
--
2.13.3


2017-07-19 22:10:17

by Trond Myklebust

Subject: [PATCH 14/20] NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases

Since nfs_page_group_destroy() does not take any locks on the requests
to be freed, we need to ensure that we don't inadvertently free the
request in nfs_destroy_unlinked_subrequests() while the last reference
is being released elsewhere.

Do this by:

1) Taking a reference to the request unless it is already being freed
2) Checking (under the page group lock) if PG_TEARDOWN is already set before
freeing an unreferenced request in nfs_destroy_unlinked_subrequests()

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 58 +++++++++++++++++++++++++++++-----------------------------
1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1ee5d89380d9..ffb9934607ef 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -384,10 +384,11 @@ nfs_unroll_locks(struct inode *inode, struct nfs_page *head,
struct nfs_page *tmp;

/* relinquish all the locks successfully grabbed this run */
- for (tmp = head->wb_this_page ; tmp != req; tmp = tmp->wb_this_page)
- nfs_unlock_request(tmp);
-
- WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));
+ for (tmp = head->wb_this_page ; tmp != req; tmp = tmp->wb_this_page) {
+ if (!kref_read(&tmp->wb_kref))
+ continue;
+ nfs_unlock_and_release_request(tmp);
+ }
}

/*
@@ -414,36 +415,32 @@ nfs_destroy_unlinked_subrequests(struct nfs_page *destroy_list,
WARN_ON_ONCE(old_head != subreq->wb_head);

/* make sure old group is not used */
- subreq->wb_head = subreq;
subreq->wb_this_page = subreq;

- /* subreq is now totally disconnected from page group or any
- * write / commit lists. last chance to wake any waiters */
- nfs_unlock_request(subreq);
-
- if (!test_bit(PG_TEARDOWN, &subreq->wb_flags)) {
- /* release ref on old head request */
- nfs_release_request(old_head);
+ /* Note: races with nfs_page_group_destroy() */
+ if (!kref_read(&subreq->wb_kref)) {
+ bool freeme = test_bit(PG_TEARDOWN, &subreq->wb_flags);

nfs_page_group_clear_bits(subreq);
+ /* Check if we raced with nfs_page_group_destroy() */
+ if (freeme)
+ nfs_free_request(subreq);
+ continue;
+ }

- /* release the PG_INODE_REF reference */
- if (test_and_clear_bit(PG_INODE_REF, &subreq->wb_flags)) {
- nfs_release_request(subreq);
- spin_lock(&inode->i_lock);
- NFS_I(inode)->nrequests--;
- spin_unlock(&inode->i_lock);
- } else
- WARN_ON_ONCE(1);
- } else {
- WARN_ON_ONCE(test_bit(PG_CLEAN, &subreq->wb_flags));
- /* zombie requests have already released the last
- * reference and were waiting on the rest of the
- * group to complete. Since it's no longer part of a
- * group, simply free the request */
- nfs_page_group_clear_bits(subreq);
- nfs_free_request(subreq);
+ subreq->wb_head = subreq;
+
+ if (test_and_clear_bit(PG_INODE_REF, &subreq->wb_flags)) {
+ nfs_release_request(subreq);
+ spin_lock(&inode->i_lock);
+ NFS_I(inode)->nrequests--;
+ spin_unlock(&inode->i_lock);
}
+
+ nfs_page_group_clear_bits(subreq);
+ /* subreq is now totally disconnected from page group or any
+ * write / commit lists. last chance to wake any waiters */
+ nfs_unlock_and_release_request(subreq);
}
}

@@ -512,6 +509,8 @@ nfs_lock_and_join_requests(struct page *page)
for (subreq = head->wb_this_page; subreq != head;
subreq = subreq->wb_this_page) {

+ if (!kref_get_unless_zero(&subreq->wb_kref))
+ continue;
while (!nfs_lock_request(subreq)) {
/*
* Unlock page to allow nfs_page_group_sync_on_bit()
@@ -523,6 +522,7 @@ nfs_lock_and_join_requests(struct page *page)
ret = nfs_page_group_lock(head, false);
if (ret < 0) {
nfs_unroll_locks(inode, head, subreq);
+ nfs_release_request(subreq);
nfs_unlock_and_release_request(head);
return ERR_PTR(ret);
}
@@ -537,8 +537,8 @@ nfs_lock_and_join_requests(struct page *page)
} else if (WARN_ON_ONCE(subreq->wb_offset < head->wb_offset ||
((subreq->wb_offset + subreq->wb_bytes) >
(head->wb_offset + total_bytes)))) {
- nfs_unlock_request(subreq);
nfs_unroll_locks(inode, head, subreq);
+ nfs_unlock_and_release_request(subreq);
nfs_page_group_unlock(head);
nfs_unlock_and_release_request(head);
return ERR_PTR(-EIO);
--
2.13.3


2017-07-19 22:10:19

by Trond Myklebust

Subject: [PATCH 17/20] NFS: Remove unused parameter from nfs_page_group_lock()

nfs_page_group_lock() is now always called with the 'nonblock'
parameter set to 'false'.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/pagelist.c | 31 +++++++++++--------------------
fs/nfs/write.c | 6 +++---
include/linux/nfs_page.h | 2 +-
3 files changed, 15 insertions(+), 24 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index ced7974622dd..af6731dd4324 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -134,19 +134,14 @@ EXPORT_SYMBOL_GPL(nfs_async_iocounter_wait);
/*
* nfs_page_group_lock - lock the head of the page group
* @req - request in group that is to be locked
- * @nonblock - if true don't block waiting for lock
*
- * this lock must be held if modifying the page group list
+ * this lock must be held when traversing or modifying the page
+ * group list
*
- * return 0 on success, < 0 on error: -EDELAY if nonblocking or the
- * result from wait_on_bit_lock
- *
- * NOTE: calling with nonblock=false should always have set the
- * lock bit (see fs/buffer.c and other uses of wait_on_bit_lock
- * with TASK_UNINTERRUPTIBLE), so there is no need to check the result.
+ * return 0 on success, < 0 on error
*/
int
-nfs_page_group_lock(struct nfs_page *req, bool nonblock)
+nfs_page_group_lock(struct nfs_page *req)
{
struct nfs_page *head = req->wb_head;

@@ -155,14 +150,10 @@ nfs_page_group_lock(struct nfs_page *req, bool nonblock)
if (!test_and_set_bit(PG_HEADLOCK, &head->wb_flags))
return 0;

- if (!nonblock) {
- set_bit(PG_CONTENDED1, &head->wb_flags);
- smp_mb__after_atomic();
- return wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
+ set_bit(PG_CONTENDED1, &head->wb_flags);
+ smp_mb__after_atomic();
+ return wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
TASK_UNINTERRUPTIBLE);
- }
-
- return -EAGAIN;
}

/*
@@ -225,7 +216,7 @@ bool nfs_page_group_sync_on_bit(struct nfs_page *req, unsigned int bit)
{
bool ret;

- nfs_page_group_lock(req, false);
+ nfs_page_group_lock(req);
ret = nfs_page_group_sync_on_bit_locked(req, bit);
nfs_page_group_unlock(req);

@@ -1016,7 +1007,7 @@ static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
unsigned int bytes_left = 0;
unsigned int offset, pgbase;

- nfs_page_group_lock(req, false);
+ nfs_page_group_lock(req);

subreq = req;
bytes_left = subreq->wb_bytes;
@@ -1038,7 +1029,7 @@ static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
if (mirror->pg_recoalesce)
return 0;
/* retry add_request for this subreq */
- nfs_page_group_lock(req, false);
+ nfs_page_group_lock(req);
continue;
}

@@ -1135,7 +1126,7 @@ int nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,

for (midx = 0; midx < desc->pg_mirror_count; midx++) {
if (midx) {
- nfs_page_group_lock(req, false);
+ nfs_page_group_lock(req);

/* find the last request */
for (lastreq = req->wb_head;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 20d44ea328b6..0f418d825185 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -271,7 +271,7 @@ static bool nfs_page_group_covers_page(struct nfs_page *req)
unsigned int pos = 0;
unsigned int len = nfs_page_length(req->wb_page);

- nfs_page_group_lock(req, false);
+ nfs_page_group_lock(req);

do {
tmp = nfs_page_group_search_locked(req->wb_head, pos);
@@ -480,7 +480,7 @@ nfs_lock_and_join_requests(struct page *page)
}
spin_unlock(&inode->i_lock);

- ret = nfs_page_group_lock(head, false);
+ ret = nfs_page_group_lock(head);
if (ret < 0) {
nfs_unlock_and_release_request(head);
return ERR_PTR(ret);
@@ -501,7 +501,7 @@ nfs_lock_and_join_requests(struct page *page)
nfs_page_group_unlock(head);
ret = nfs_wait_on_request(subreq);
if (!ret)
- ret = nfs_page_group_lock(head, false);
+ ret = nfs_page_group_lock(head);
if (ret < 0) {
nfs_unroll_locks(inode, head, subreq);
nfs_release_request(subreq);
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index de1d24cedaa2..2f4fdafb6746 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -139,7 +139,7 @@ extern size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
extern int nfs_wait_on_request(struct nfs_page *);
extern void nfs_unlock_request(struct nfs_page *req);
extern void nfs_unlock_and_release_request(struct nfs_page *);
-extern int nfs_page_group_lock(struct nfs_page *, bool);
+extern int nfs_page_group_lock(struct nfs_page *);
extern void nfs_page_group_unlock(struct nfs_page *);
extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
extern bool nfs_async_iocounter_wait(struct rpc_task *, struct nfs_lock_context *);
--
2.13.3


2017-07-19 22:10:21

by Trond Myklebust

Subject: [PATCH 18/20] NFS: Fix up nfs_page_group_covers_page()

Fix up the test in nfs_page_group_covers_page(). The simplest implementation
is to check that we have a set of intersecting or contiguous subrequests
that connect page offset 0 to nfs_page_length(req->wb_page).

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 0f418d825185..759e37d26acf 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -243,9 +243,6 @@ nfs_page_group_search_locked(struct nfs_page *head, unsigned int page_offset)
{
struct nfs_page *req;

- WARN_ON_ONCE(head != head->wb_head);
- WARN_ON_ONCE(!test_bit(PG_HEADLOCK, &head->wb_head->wb_flags));
-
req = head;
do {
if (page_offset >= req->wb_pgbase &&
@@ -273,18 +270,15 @@ static bool nfs_page_group_covers_page(struct nfs_page *req)

nfs_page_group_lock(req);

- do {
+ for (;;) {
tmp = nfs_page_group_search_locked(req->wb_head, pos);
- if (tmp) {
- /* no way this should happen */
- WARN_ON_ONCE(tmp->wb_pgbase != pos);
- pos += tmp->wb_bytes - (pos - tmp->wb_pgbase);
- }
- } while (tmp && pos < len);
+ if (!tmp)
+ break;
+ pos = tmp->wb_pgbase + tmp->wb_bytes;
+ }

nfs_page_group_unlock(req);
- WARN_ON_ONCE(pos > len);
- return pos == len;
+ return pos >= len;
}

/* We can set the PG_uptodate flag if we see that a write request
--
2.13.3


2017-07-19 22:10:22

by Trond Myklebust

Subject: [PATCH 20/20] NFS: Throttle I/O to the NFS server

Ensure that we do not build up large queues of asynchronous I/O work in the
RPC layer that are not being handled by the socket.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/pagelist.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index af6731dd4324..a5db892c272b 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -625,7 +625,8 @@ int nfs_initiate_pgio(struct rpc_clnt *clnt, struct nfs_pgio_header *hdr,
ret = rpc_wait_for_completion_task(task);
if (ret == 0)
ret = task->tk_status;
- }
+ } else
+ rpc_wait_for_msg_send(task);
rpc_put_task(task);
out:
return ret;
--
2.13.3


2017-07-20 06:52:07

by Chuck Lever

Subject: Re: [PATCH 00/20] Reducing inode->i_lock contention in writebacks

Hi Trond-


> On Jul 20, 2017, at 00:09, Trond Myklebust <[email protected]> wrote:
>
> Chuck Lever has presented measurements that would appear to indicate that
> the inode->i_lock can be a source of contention when doing writeback using
> high performance networks. This patch series is therefore an attempt to
> address the sources of contention that his measurements pointed to.
>
> According to Chuck's table, the main sources of contention appear to
> be in nfs_flush_incompatible(), nfs_lock_and_join_requests() and
> nfs_updatepage(). Most of this patch series focuses on fixing
> nfs_lock_and_join_requests(), since it holds the inode lock, while
> taking locks on the page group and the sub-requests themselves,
> rolling them all back if ever there is contention.
> By noting a few simple rules that are mostly already in place, we
> can simplify that locking to ensure that we only have to keep the
> spin lock while we're dereferencing and locking the head page pointer
> that is stored in page_private(page).
>
> Along the way, the patches also simplify the code a little, and fix
> a number of subtle races which mainly occur when you set wsize to
> some value smaller than the page size. The most notable such race
> occurs between nfs_lock_and_join_requests() and nfs_page_group_destroy(),
> and could lead to a double free() of some of the sub-requests.
>
> Finally, there are 2 patches tacked onto the end of the series that
> attempt to improve the throttling of writes when the RPC layer is
> congested. They do so by forcing each caller of nfs_initiate_pgio()
> to wait until the request is being transmitted. The expectation is
> that this might help improve latencies when there are several processes
> competing for access to the RPC transport.

Thanks for looking into this. I'm especially intrigued by
the write throttling patches!

I can't test these immediately, but I will try to review
them in the next few days.


> Trond Myklebust (20):
> NFS: Simplify page writeback
> NFS: Reduce lock contention in nfs_page_find_head_request()
> NFS: Reduce lock contention in nfs_try_to_update_request()
> NFS: Ensure we always dereference the page head last
> NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()
> NFS: Fix an ABBA issue in nfs_lock_and_join_requests()
> NFS: Don't check request offset and size without holding a lock
> NFS: Don't unlock writebacks before declaring PG_WB_END
> NFS: Fix the inode request accounting when pages have subrequests
> NFS: Teach nfs_try_to_update_request() to deal with request
> page_groups
> NFS: Remove page group limit in nfs_flush_incompatible()
> NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()
> NFS: Further optimise nfs_lock_and_join_requests()
> NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests()
> race cases
> NFS: Remove nfs_page_group_clear_bits()
> NFS: Remove unused function nfs_page_group_lock_wait()
> NFS: Remove unused parameter from nfs_page_group_lock()
> NFS: Fix up nfs_page_group_covers_page()
> SUNRPC: Add a function to allow waiting for RPC transmission
> NFS: Throttle I/O to the NFS server
>
> fs/nfs/pagelist.c | 64 +++------
> fs/nfs/write.c | 313 ++++++++++++++++++-------------------------
> include/linux/nfs_page.h | 3 +-
> include/linux/sunrpc/sched.h | 3 +
> net/sunrpc/sched.c | 22 +++
> net/sunrpc/xprt.c | 6 +
> 6 files changed, 177 insertions(+), 234 deletions(-)
>
> --
> 2.13.3
>

--
Chuck Lever




2017-07-25 04:13:04

by Weston Andros Adamson

Subject: Re: [PATCH 00/20] Reducing inode->i_lock contention in writebacks


> On Jul 19, 2017, at 6:09 PM, Trond Myklebust =
<[email protected]> wrote:
>=20
> Chuck Lever has presented measurements, that would appear to indicate =
that
> the inode->i_lock can be a source of contention when doing writeback =
using
> high performance networks. This patch series is therefore an attempt =
to
> address the sources of contention that his measurements pointed to.
>=20
> According to Chuck's table, the main sources of contention appear to
> be in nfs_flush_incompatible(), nfs_lock_and_join_requests() and
> nfs_updatepage(). Most of this patch series focuses on fixing
> nfs_lock_and_join_requests(), since it holds the inode lock, while
> taking locks on the page group and the sub-requests themselves,
> rolling them all back if ever there is contention.
> By noting a few simple rules that are mostly already in place, we
> can simplify that locking to ensure that we only have to keep the
> spin lock while we're dereferencing and locking the head page pointer
> that is stored in page_private(page).
>=20
> Along the way, the patches also simplify the code a little, and fix
> a number of subtle races which mainly occur when you set wsize to
> some value smaller than the page size. The most notable such race
> occurs between nfs_lock_and_join_requests() and =
nfs_page_group_destroy(),
> and could lead to a double free() of some of the sub-requests.
>=20
> Finally, there are 2 patches tacked onto the end of the series that
> attempt to improve the throttling of writes when the RPC layer is
> congested. They do so by forcing each caller of nfs_initiate_pgio()
> to wait until the request is being transmitted. The expectation is
> that this might help improve latencies when there are several =
processes
> competing for access to the RPC transport.
>=20
> Trond Myklebust (20):
> NFS: Simplify page writeback
> NFS: Reduce lock contention in nfs_page_find_head_request()
> NFS: Reduce lock contention in nfs_try_to_update_request()
> NFS: Ensure we always dereference the page head last
> NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()
> NFS: Fix an ABBA issue in nfs_lock_and_join_requests()
> NFS: Don't check request offset and size without holding a lock
> NFS: Don't unlock writebacks before declaring PG_WB_END
> NFS: Fix the inode request accounting when pages have subrequests
> NFS: Teach nfs_try_to_update_request() to deal with request
> page_groups
> NFS: Remove page group limit in nfs_flush_incompatible()
> NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()
> NFS: Further optimise nfs_lock_and_join_requests()
> NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests()
> race cases
> NFS: Remove nfs_page_group_clear_bits()
> NFS: Remove unuse function nfs_page_group_lock_wait()
> NFS: Remove unused parameter from nfs_page_group_lock()
> NFS: Fix up nfs_page_group_covers_page()
> SUNRPC: Add a function to allow waiting for RPC transmission
> NFS: Throttle I/O to the NFS server

This was a lot to look at, but it looks good to me.

-dros

>
> fs/nfs/pagelist.c | 64 +++------
> fs/nfs/write.c | 313 ++++++++++++++++++-------------------------
> include/linux/nfs_page.h | 3 +-
> include/linux/sunrpc/sched.h | 3 +
> net/sunrpc/sched.c | 22 +++
> net/sunrpc/xprt.c | 6 +
> 6 files changed, 177 insertions(+), 234 deletions(-)
>
> --
> 2.13.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2017-07-25 15:55:23

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 00/20] Reducing inode->i_lock contention in writebacks

On Wed, 2017-07-19 at 18:09 -0400, Trond Myklebust wrote:
> Chuck Lever has presented measurements, that would appear to indicate that
> the inode->i_lock can be a source of contention when doing writeback using
> high performance networks. This patch series is therefore an attempt to
> address the sources of contention that his measurements pointed to.
>
> According to Chuck's table, the main sources of contention appear to
> be in nfs_flush_incompatible(), nfs_lock_and_join_requests() and
> nfs_updatepage(). Most of this patch series focuses on fixing
> nfs_lock_and_join_requests(), since it holds the inode lock, while
> taking locks on the page group and the sub-requests themselves,
> rolling them all back if ever there is contention.
> By noting a few simple rules that are mostly already in place, we
> can simplify that locking to ensure that we only have to keep the
> spin lock while we're dereferencing and locking the head page pointer
> that is stored in page_private(page).
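
Just to spell out how I'm reading that: the lookup ends up looking roughly
like the sketch below (illustrative only, with simplified names -- not the
exact helper from the series). The inode spin lock is held only while
dereferencing page_private(page) and pinning the head request; the page
group and sub-request locking then happens without i_lock held.

/*
 * Illustrative sketch only -- simplified names, not the exact code in
 * the series.  i_lock protects the page -> head request mapping kept
 * in page_private(page); everything else (page group lock, sub-request
 * locks) is taken after i_lock has been dropped.
 */
static struct nfs_page *sketch_find_head_request(struct page *page)
{
        struct inode *inode = page_file_mapping(page)->host;
        struct nfs_page *head = NULL;

        if (!PagePrivate(page))
                return NULL;

        spin_lock(&inode->i_lock);
        head = (struct nfs_page *)page_private(page);
        if (head != NULL)
                kref_get(&head->wb_kref);  /* pin it before dropping i_lock */
        spin_unlock(&inode->i_lock);

        /* caller takes the page group lock, then releases the reference */
        return head;
}
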
>
> Along the way, the patches also simplify the code a little, and fix
> a number of subtle races which mainly occur when you set wsize to
> some value smaller than the page size. The most notable such race
> occurs between nfs_lock_and_join_requests() and nfs_page_group_destroy(),
> and could lead to a double free() of some of the sub-requests.
>
> Finally, there are 2 patches tacked onto the end of the series that
> attempt to improve the throttling of writes when the RPC layer is
> congested. They do so by forcing each caller of nfs_initiate_pgio()
> to wait until the request is being transmitted. The expectation is
> that this might help improve latencies when there are several processes
> competing for access to the RPC transport.
>
> Trond Myklebust (20):
> NFS: Simplify page writeback
> NFS: Reduce lock contention in nfs_page_find_head_request()
> NFS: Reduce lock contention in nfs_try_to_update_request()
> NFS: Ensure we always dereference the page head last
> NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()
> NFS: Fix an ABBA issue in nfs_lock_and_join_requests()
> NFS: Don't check request offset and size without holding a lock
> NFS: Don't unlock writebacks before declaring PG_WB_END
> NFS: Fix the inode request accounting when pages have subrequests
> NFS: Teach nfs_try_to_update_request() to deal with request
> page_groups
> NFS: Remove page group limit in nfs_flush_incompatible()
> NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()
> NFS: Further optimise nfs_lock_and_join_requests()
> NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests()
> race cases
> NFS: Remove nfs_page_group_clear_bits()
> NFS: Remove unuse function nfs_page_group_lock_wait()
> NFS: Remove unused parameter from nfs_page_group_lock()
> NFS: Fix up nfs_page_group_covers_page()
> SUNRPC: Add a function to allow waiting for RPC transmission
> NFS: Throttle I/O to the NFS server
>
> fs/nfs/pagelist.c | 64 +++------
> fs/nfs/write.c | 313 ++++++++++++++++++-------------------------
> include/linux/nfs_page.h | 3 +-
> include/linux/sunrpc/sched.h | 3 +
> net/sunrpc/sched.c | 22 +++
> net/sunrpc/xprt.c | 6 +
> 6 files changed, 177 insertions(+), 234 deletions(-)
>

Looks good:

Reviewed-by: Jeff Layton <[email protected]>

2017-07-28 19:59:55

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH 00/20] Reducing inode->i_lock contention in writebacks


> On Jul 28, 2017, at 1:26 PM, Chuck Lever <[email protected]> wrote:
>
>
>> On Jul 19, 2017, at 6:09 PM, Trond Myklebust <[email protected]> wrote:
>>
>> Chuck Lever has presented measurements, that would appear to indicate that
>> the inode->i_lock can be a source of contention when doing writeback using
>> high performance networks. This patch series is therefore an attempt to
>> address the sources of contention that his measurements pointed to.
>>
>> According to Chuck's table, the main sources of contention appear to
>> be in nfs_flush_incompatible(), nfs_lock_and_join_requests() and
>> nfs_updatepage(). Most of this patch series focuses on fixing
>> nfs_lock_and_join_requests(), since it holds the inode lock, while
>> taking locks on the page group and the sub-requests themselves,
>> rolling them all back if ever there is contention.
>> By noting a few simple rules that are mostly already in place, we
>> can simplify that locking to ensure that we only have to keep the
>> spin lock while we're dereferencing and locking the head page pointer
>> that is stored in page_private(page).
>>
>> Along the way, the patches also simplify the code a little, and fix
>> a number of subtle races which mainly occur when you set wsize to
>> some value smaller than the page size. The most notable such race
>> occurs between nfs_lock_and_join_requests() and nfs_page_group_destroy(),
>> and could lead to a double free() of some of the sub-requests.
>>
>> Finally, there are 2 patches tacked onto the end of the series that
>> attempt to improve the throttling of writes when the RPC layer is
>> congested. They do so by forcing each caller of nfs_initiate_pgio()
>> to wait until the request is being transmitted. The expectation is
>> that this might help improve latencies when there are several processes
>> competing for access to the RPC transport.
>>
>> Trond Myklebust (20):
>> NFS: Simplify page writeback
>> NFS: Reduce lock contention in nfs_page_find_head_request()
>> NFS: Reduce lock contention in nfs_try_to_update_request()
>> NFS: Ensure we always dereference the page head last
>> NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()
>> NFS: Fix an ABBA issue in nfs_lock_and_join_requests()
>> NFS: Don't check request offset and size without holding a lock
>> NFS: Don't unlock writebacks before declaring PG_WB_END
>> NFS: Fix the inode request accounting when pages have subrequests
>> NFS: Teach nfs_try_to_update_request() to deal with request
>> page_groups
>> NFS: Remove page group limit in nfs_flush_incompatible()
>> NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()
>> NFS: Further optimise nfs_lock_and_join_requests()
>> NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests()
>> race cases
>> NFS: Remove nfs_page_group_clear_bits()
>> NFS: Remove unuse function nfs_page_group_lock_wait()
>> NFS: Remove unused parameter from nfs_page_group_lock()
>> NFS: Fix up nfs_page_group_covers_page()
>> SUNRPC: Add a function to allow waiting for RPC transmission
>> NFS: Throttle I/O to the NFS server
>>
>> fs/nfs/pagelist.c | 64 +++------
>> fs/nfs/write.c | 313 ++++++++++++++++++-------------------------
>> include/linux/nfs_page.h | 3 +-
>> include/linux/sunrpc/sched.h | 3 +
>> net/sunrpc/sched.c | 22 +++
>> net/sunrpc/xprt.c | 6 +
>> 6 files changed, 177 insertions(+), 234 deletions(-)
>
> Sorry for the delay.
>
> I reviewed the patches. 1-10 were straightforward, but 11-18 appear to
> assume some local knowledge about how this code path is architected, so
> I can't make an informed comment on those. The only thing I would say is
> that if you go with 19 and 20, consider adding a similar governor to the
> DELEGRETURN path.
>
> I've had a few moments to run some basic tests with this series, applied
> to v4.13-rc1. The client is a dual-socket, 12-core CPU with CX3 Pro
> InfiniBand running at 56Gbps.
>
> • No new functional problems were observed.
> • RPC backlog queues definitely are smaller.
> • Throughput of large payload I/O (>1MB) is higher.
> • Latency of 1KB I/O is a little faster.
> • fio IOPS throughput is lower.
>
> The fio 8KB 70/30 test concerns me: it's usually about 105/45 KIOPS, but
> with the patches it's 76/33. I also noticed that my iozone-based 8KB IOPS
> test shows read IOPS throughput is about 5% lower.
>
> So, it's a mixed bag of macro benchmark results. I can try bisecting if
> you are interested.

Popping off 19 and 20 eliminated the fio performance regression.
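
That result makes some sense to me. As I read 19 and 20, every
nfs_initiate_pgio() caller now sleeps until its RPC has actually been handed
to the transport -- conceptually something like the sketch below (the names
are mine, not the actual SUNRPC interface added by the series):

/*
 * Conceptual sketch only -- illustrative names, not the real SUNRPC
 * helper.  The submitter blocks until the transmit path signals that
 * the request went out on the wire, so a congested transport slows
 * new submissions down instead of growing the backlog queue.
 */
struct xmit_gate {
        struct completion done;         /* completed by the transmit path */
};

static void xmit_gate_init(struct xmit_gate *gate)
{
        init_completion(&gate->done);
}

/* Transport side: called once the request has been transmitted. */
static void xmit_gate_signal(struct xmit_gate *gate)
{
        complete(&gate->done);
}

/* Submitter side: wait for transmission before issuing more I/O. */
static int xmit_gate_wait(struct xmit_gate *gate)
{
        /* killable, so a fatal signal can still interrupt the writer */
        return wait_for_completion_killable(&gate->done);
}

If that reading is right, the extra per-request wait lands exactly where a
small-block fio workload used to benefit from deep pipelining, which would
line up with the IOPS drop above.
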


> I also ran lock contention tests. I used:
>
> iozone -i0 -i1 -s2g -y1k -az -c
>
> Contention appears to be reduced, but i_lock is still the most contended
> lock in the system by far during this test. During an equivalent direct
> I/O test, I notice i_lock acquisitions increasing, but contentions do not.
>
> A copy of /proc/lock_stat is attached for a stock v4.13-rc1 kernel, and
> for the same kernel plus your patches. This is for just the iozone
> command line listed above.
>
>
> --
> Chuck Lever
>
>
> <lock_stat-4.13.0-rc1><lock_stat-4.13.0-rc1-00020-g1ee454a>

--
Chuck Lever