2017-08-16 16:19:33

by Trond Myklebust

Subject: [PATCH v2 0/6] RPC client latency fixes

The following patches apply on top of the writeback patches sent last
week. They start the process of reducing contention in the RPC client
receive code due to excessive holding of the transport lock.

The first patch is the most important in that it allows the client to
drop the transport lock while copying data from the socket into the
RPC call's reply buffers.
The second ensures that we don't hog the workqueue worker thread
forever; instead, we reschedule the receive job if it keeps looping.
Finally, a small cleanup to avoid having to make a copy of the
struct xdr_skb_reader on the stack.

v2:
- Reorder patches to add I/O throttling after the core RPC latency
  changes
- Fix a crash when receiving backchannel requests.

Trond Myklebust (6):
SUNRPC: Don't hold the transport lock across socket copy operations
SUNRPC: Don't hold the transport lock when receiving backchannel data
SUNRPC: Don't loop forever in xs_tcp_data_receive()
SUNRPC: Cleanup xs_tcp_read_common()
SUNRPC: Add a function to allow waiting for RPC transmission
NFS: Throttle I/O to the NFS server

fs/nfs/pagelist.c | 3 +-
include/linux/sunrpc/sched.h | 5 ++++
include/linux/sunrpc/xprt.h | 2 ++
net/sunrpc/backchannel_rqst.c | 4 +--
net/sunrpc/sched.c | 22 +++++++++++++++
net/sunrpc/xprt.c | 48 +++++++++++++++++++++++++++++++
net/sunrpc/xprtsock.c | 66 ++++++++++++++++++++++---------------------
7 files changed, 115 insertions(+), 35 deletions(-)

--
2.13.5



2017-08-16 16:19:35

by Trond Myklebust

Subject: [PATCH v2 2/6] SUNRPC: Don't hold the transport lock when receiving backchannel data

The backchannel request has no associated task, so it is going nowhere
until we call xprt_complete_bc_request().

Signed-off-by: Trond Myklebust <[email protected]>
---
net/sunrpc/backchannel_rqst.c | 4 ++--
net/sunrpc/xprtsock.c | 5 +----
2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
index ac701c28f44f..c2c68a15b59d 100644
--- a/net/sunrpc/backchannel_rqst.c
+++ b/net/sunrpc/backchannel_rqst.c
@@ -171,10 +171,10 @@ int xprt_setup_bc(struct rpc_xprt *xprt, unsigned int min_reqs)
/*
* Add the temporary list to the backchannel preallocation list
*/
- spin_lock_bh(&xprt->bc_pa_lock);
+ spin_lock(&xprt->bc_pa_lock);
list_splice(&tmp_list, &xprt->bc_pa_list);
xprt_inc_alloc_count(xprt, min_reqs);
- spin_unlock_bh(&xprt->bc_pa_lock);
+ spin_unlock(&xprt->bc_pa_lock);

dprintk("RPC: setup backchannel transport done\n");
return 0;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 04dbc7027712..a2312f14beb4 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1389,11 +1389,9 @@ static int xs_tcp_read_callback(struct rpc_xprt *xprt,
container_of(xprt, struct sock_xprt, xprt);
struct rpc_rqst *req;

- /* Look up and lock the request corresponding to the given XID */
- spin_lock_bh(&xprt->transport_lock);
+ /* Look up the request corresponding to the given XID */
req = xprt_lookup_bc_request(xprt, transport->tcp_xid);
if (req == NULL) {
- spin_unlock_bh(&xprt->transport_lock);
printk(KERN_WARNING "Callback slot table overflowed\n");
xprt_force_disconnect(xprt);
return -1;
@@ -1404,7 +1402,6 @@ static int xs_tcp_read_callback(struct rpc_xprt *xprt,

if (!(transport->tcp_flags & TCP_RCV_COPY_DATA))
xprt_complete_bc_request(req, transport->tcp_copied);
- spin_unlock_bh(&xprt->transport_lock);

return 0;
}
--
2.13.5


2017-08-16 16:19:41

by Trond Myklebust

Subject: [PATCH v2 4/6] SUNRPC: Cleanup xs_tcp_read_common()

Simplify the code to avoid a full copy of the struct xdr_skb_reader.

Signed-off-by: Trond Myklebust <[email protected]>
---
net/sunrpc/xprtsock.c | 25 ++++++++-----------------
1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index e8f44fc76754..a344bea15fc7 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1287,25 +1287,12 @@ static inline void xs_tcp_read_common(struct rpc_xprt *xprt,
}

len = desc->count;
- if (len > transport->tcp_reclen - transport->tcp_offset) {
- struct xdr_skb_reader my_desc;
-
- len = transport->tcp_reclen - transport->tcp_offset;
- memcpy(&my_desc, desc, sizeof(my_desc));
- my_desc.count = len;
- r = xdr_partial_copy_from_skb(rcvbuf, transport->tcp_copied,
- &my_desc, xdr_skb_read_bits);
- desc->count -= r;
- desc->offset += r;
- } else
- r = xdr_partial_copy_from_skb(rcvbuf, transport->tcp_copied,
+ if (len > transport->tcp_reclen - transport->tcp_offset)
+ desc->count = transport->tcp_reclen - transport->tcp_offset;
+ r = xdr_partial_copy_from_skb(rcvbuf, transport->tcp_copied,
desc, xdr_skb_read_bits);

- if (r > 0) {
- transport->tcp_copied += r;
- transport->tcp_offset += r;
- }
- if (r != len) {
+ if (desc->count) {
/* Error when copying to the receive buffer,
* usually because we weren't able to allocate
* additional buffer pages. All we can do now
@@ -1325,6 +1312,10 @@ static inline void xs_tcp_read_common(struct rpc_xprt *xprt,
return;
}

+ transport->tcp_copied += r;
+ transport->tcp_offset += r;
+ desc->count = len - r;
+
dprintk("RPC: XID %08x read %zd bytes\n",
ntohl(transport->tcp_xid), r);
dprintk("RPC: xprt = %p, tcp_copied = %lu, tcp_offset = %u, "
--
2.13.5


2017-08-16 16:19:34

by Trond Myklebust

Subject: [PATCH v2 1/6] SUNRPC: Don't hold the transport lock across socket copy operations

Instead add a mechanism to ensure that the request doesn't disappear
from underneath us while copying from the socket. We do this by
preventing xprt_release() from freeing the XDR buffers until the
flag RPC_TASK_MSG_RECV has been cleared from the request.
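The pinning scheme can be illustrated with a userspace analogue. This is a hedged sketch in Python, not the kernel code: the kernel manipulates bits in tk_runstate with set_bit()/clear_bit() and sleeps in wait_on_bit(), and all the names below (Request, pin_rqst(), release(), and so on) are invented for illustration:

```python
import threading

class Request:
    """Userspace stand-in for an rpc_rqst and its task state bits."""
    def __init__(self):
        self._pinned = False
        self._cond = threading.Condition()
        self.buffers = bytearray(4096)   # stands in for the XDR reply buffers

def pin_rqst(req):
    # Analogue of xprt_pin_rqst(): called while still holding the lock
    # that made the lookup atomic, before the receive path drops it.
    with req._cond:
        req._pinned = True

def unpin_rqst(req):
    # Analogue of xprt_unpin_rqst(): drop the pin and wake any waiter.
    with req._cond:
        req._pinned = False
        req._cond.notify_all()

def release(req):
    # Analogue of xprt_release() + xprt_wait_on_pinned_rqst(): block
    # until the receive path is done copying, then free the buffers.
    with req._cond:
        while req._pinned:
            req._cond.wait()
    req.buffers = None
```

With this in place the receive path can pin the request under the lock, drop the lock, copy data from the socket at leisure, and unpin; a concurrent release() cannot free the reply buffers mid-copy.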

Signed-off-by: Trond Myklebust <[email protected]>
Reviewed-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/sched.h | 2 ++
include/linux/sunrpc/xprt.h | 2 ++
net/sunrpc/xprt.c | 42 ++++++++++++++++++++++++++++++++++++++++++
net/sunrpc/xprtsock.c | 23 ++++++++++++++++++-----
4 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 50a99a117da7..c1768f9d993b 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -139,6 +139,8 @@ struct rpc_task_setup {
#define RPC_TASK_RUNNING 0
#define RPC_TASK_QUEUED 1
#define RPC_TASK_ACTIVE 2
+#define RPC_TASK_MSG_RECV 3
+#define RPC_TASK_MSG_RECV_WAIT 4

#define RPC_IS_RUNNING(t) test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
#define rpc_set_running(t) set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index eab1c749e192..65b9e0224753 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -372,6 +372,8 @@ void xprt_write_space(struct rpc_xprt *xprt);
void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
struct rpc_rqst * xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
void xprt_complete_rqst(struct rpc_task *task, int copied);
+void xprt_pin_rqst(struct rpc_rqst *req);
+void xprt_unpin_rqst(struct rpc_rqst *req);
void xprt_release_rqst_cong(struct rpc_task *task);
void xprt_disconnect_done(struct rpc_xprt *xprt);
void xprt_force_disconnect(struct rpc_xprt *xprt);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 4654a9934269..a558e8c620c0 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -844,6 +844,47 @@ struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid)
}
EXPORT_SYMBOL_GPL(xprt_lookup_rqst);

+/**
+ * xprt_pin_rqst - Pin a request on the transport receive list
+ * @req: Request to pin
+ *
+ * Caller must ensure this is atomic with the call to xprt_lookup_rqst(),
+ * so it should be holding the xprt transport lock.
+ */
+void xprt_pin_rqst(struct rpc_rqst *req)
+{
+ set_bit(RPC_TASK_MSG_RECV, &req->rq_task->tk_runstate);
+}
+
+/**
+ * xprt_unpin_rqst - Unpin a request on the transport receive list
+ * @req: Request to unpin
+ *
+ * Caller should be holding the xprt transport lock.
+ */
+void xprt_unpin_rqst(struct rpc_rqst *req)
+{
+ struct rpc_task *task = req->rq_task;
+
+ clear_bit(RPC_TASK_MSG_RECV, &task->tk_runstate);
+ if (test_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate))
+ wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_RECV);
+}
+
+static void xprt_wait_on_pinned_rqst(struct rpc_rqst *req)
+__must_hold(&req->rq_xprt->transport_lock)
+{
+ struct rpc_task *task = req->rq_task;
+ if (test_bit(RPC_TASK_MSG_RECV, &task->tk_runstate)) {
+ spin_unlock_bh(&req->rq_xprt->transport_lock);
+ set_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
+ wait_on_bit(&task->tk_runstate, RPC_TASK_MSG_RECV,
+ TASK_UNINTERRUPTIBLE);
+ clear_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
+ spin_lock_bh(&req->rq_xprt->transport_lock);
+ }
+}
+
static void xprt_update_rtt(struct rpc_task *task)
{
struct rpc_rqst *req = task->tk_rqstp;
@@ -1295,6 +1336,7 @@ void xprt_release(struct rpc_task *task)
list_del(&req->rq_list);
xprt->last_used = jiffies;
xprt_schedule_autodisconnect(xprt);
+ xprt_wait_on_pinned_rqst(req);
spin_unlock_bh(&xprt->transport_lock);
if (req->rq_buffer)
xprt->ops->buf_free(task);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 4f154d388748..04dbc7027712 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -973,6 +973,8 @@ static void xs_local_data_read_skb(struct rpc_xprt *xprt,
rovr = xprt_lookup_rqst(xprt, *xp);
if (!rovr)
goto out_unlock;
+ xprt_pin_rqst(rovr);
+ spin_unlock_bh(&xprt->transport_lock);
task = rovr->rq_task;

copied = rovr->rq_private_buf.buflen;
@@ -981,11 +983,14 @@ static void xs_local_data_read_skb(struct rpc_xprt *xprt,

if (xs_local_copy_to_xdr(&rovr->rq_private_buf, skb)) {
dprintk("RPC: sk_buff copy failed\n");
- goto out_unlock;
+ spin_lock_bh(&xprt->transport_lock);
+ goto out_unpin;
}

+ spin_lock_bh(&xprt->transport_lock);
xprt_complete_rqst(task, copied);
-
+out_unpin:
+ xprt_unpin_rqst(rovr);
out_unlock:
spin_unlock_bh(&xprt->transport_lock);
}
@@ -1054,6 +1059,8 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt,
rovr = xprt_lookup_rqst(xprt, *xp);
if (!rovr)
goto out_unlock;
+ xprt_pin_rqst(rovr);
+ spin_unlock_bh(&xprt->transport_lock);
task = rovr->rq_task;

if ((copied = rovr->rq_private_buf.buflen) > repsize)
@@ -1062,14 +1069,17 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt,
/* Suck it into the iovec, verify checksum if not done by hw. */
if (csum_partial_copy_to_xdr(&rovr->rq_private_buf, skb)) {
__UDPX_INC_STATS(sk, UDP_MIB_INERRORS);
- goto out_unlock;
+ spin_lock_bh(&xprt->transport_lock);
+ goto out_unpin;
}

__UDPX_INC_STATS(sk, UDP_MIB_INDATAGRAMS);

+ spin_lock_bh(&xprt->transport_lock);
xprt_adjust_cwnd(xprt, task, copied);
xprt_complete_rqst(task, copied);
-
+out_unpin:
+ xprt_unpin_rqst(rovr);
out_unlock:
spin_unlock_bh(&xprt->transport_lock);
}
@@ -1351,12 +1361,15 @@ static inline int xs_tcp_read_reply(struct rpc_xprt *xprt,
spin_unlock_bh(&xprt->transport_lock);
return -1;
}
+ xprt_pin_rqst(req);
+ spin_unlock_bh(&xprt->transport_lock);

xs_tcp_read_common(xprt, desc, req);

+ spin_lock_bh(&xprt->transport_lock);
if (!(transport->tcp_flags & TCP_RCV_COPY_DATA))
xprt_complete_rqst(req->rq_task, transport->tcp_copied);
-
+ xprt_unpin_rqst(req);
spin_unlock_bh(&xprt->transport_lock);
return 0;
}
--
2.13.5


2017-08-16 16:19:43

by Trond Myklebust

Subject: [PATCH v2 5/6] SUNRPC: Add a function to allow waiting for RPC transmission

Sometimes, we only want to know that the RPC message is in the process
of being transmitted. Add a function to allow that.
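The intent can be sketched with a userspace analogue. This is a hedged illustration with invented names; the kernel actually waits for the RPC_TASK_MSG_XMIT bit in tk_runstate to be cleared, rather than for an Event to be set:

```python
import threading

class Task:
    """Sketch of the RPC_TASK_MSG_XMIT idea: a task starts out
    'not yet transmitted', and callers may block until the transport
    has at least begun sending the message."""
    def __init__(self):
        self._sending = threading.Event()   # set once transmission starts

    def prepare_transmit(self):
        # Analogue of xprt_prepare_transmit() clearing RPC_TASK_MSG_XMIT
        # and waking anyone in rpc_wait_for_msg_send().
        self._sending.set()

    def wait_for_msg_send(self, timeout=None):
        # Analogue of rpc_wait_for_msg_send(): returns once the message
        # is in the process of being transmitted.
        return self._sending.wait(timeout)
```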

Signed-off-by: Trond Myklebust <[email protected]>
---
include/linux/sunrpc/sched.h | 3 +++
net/sunrpc/sched.c | 22 ++++++++++++++++++++++
net/sunrpc/xprt.c | 6 ++++++
3 files changed, 31 insertions(+)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index c1768f9d993b..148cd6c5d2eb 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -141,6 +141,8 @@ struct rpc_task_setup {
#define RPC_TASK_ACTIVE 2
#define RPC_TASK_MSG_RECV 3
#define RPC_TASK_MSG_RECV_WAIT 4
+#define RPC_TASK_MSG_XMIT 5
+#define RPC_TASK_MSG_XMIT_WAIT 6

#define RPC_IS_RUNNING(t) test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
#define rpc_set_running(t) set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
@@ -246,6 +248,7 @@ void rpc_free(struct rpc_task *);
int rpciod_up(void);
void rpciod_down(void);
int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *);
+int rpc_wait_for_msg_send(struct rpc_task *task);
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
struct net;
void rpc_show_tasks(struct net *);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 0cc83839c13c..30509a4d7e00 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -320,6 +320,19 @@ int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *act
EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);

/*
+ * Allow callers to wait for the start of RPC message transmission
+ */
+int rpc_wait_for_msg_send(struct rpc_task *task)
+{
+ if (!test_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate))
+ return 0;
+ set_bit(RPC_TASK_MSG_XMIT_WAIT, &task->tk_runstate);
+ return wait_on_bit_action(&task->tk_runstate, RPC_TASK_MSG_XMIT,
+ rpc_wait_bit_killable, TASK_KILLABLE);
+}
+EXPORT_SYMBOL_GPL(rpc_wait_for_msg_send);
+
+/*
* Make an RPC task runnable.
*
* Note: If the task is ASYNC, and is being made runnable after sitting on an
@@ -700,6 +713,7 @@ rpc_reset_task_statistics(struct rpc_task *task)
{
task->tk_timeouts = 0;
task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_KILLED|RPC_TASK_SENT);
+ set_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate);

rpc_init_task_statistics(task);
}
@@ -928,6 +942,7 @@ static void rpc_init_task(struct rpc_task *task, const struct rpc_task_setup *ta
memset(task, 0, sizeof(*task));
atomic_set(&task->tk_count, 1);
task->tk_flags = task_setup_data->flags;
+ task->tk_runstate = BIT(RPC_TASK_MSG_XMIT);
task->tk_ops = task_setup_data->callback_ops;
task->tk_calldata = task_setup_data->callback_data;
INIT_LIST_HEAD(&task->tk_task);
@@ -1012,6 +1027,13 @@ static void rpc_async_release(struct work_struct *work)

static void rpc_release_resources_task(struct rpc_task *task)
{
+ if (test_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate)) {
+ clear_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate);
+ smp_mb__after_atomic();
+ if (test_bit(RPC_TASK_MSG_XMIT_WAIT, &task->tk_runstate))
+ wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_XMIT);
+ }
+
xprt_release(task);
if (task->tk_msg.rpc_cred) {
put_rpccred(task->tk_msg.rpc_cred);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a558e8c620c0..62c379865f7c 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -978,6 +978,12 @@ bool xprt_prepare_transmit(struct rpc_task *task)
goto out_unlock;
}
ret = true;
+ if (test_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate)) {
+ clear_bit(RPC_TASK_MSG_XMIT, &task->tk_runstate);
+ smp_mb__after_atomic();
+ if (test_bit(RPC_TASK_MSG_XMIT_WAIT, &task->tk_runstate))
+ wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_XMIT);
+ }
out_unlock:
spin_unlock_bh(&xprt->transport_lock);
return ret;
--
2.13.5


2017-08-16 16:19:40

by Trond Myklebust

Subject: [PATCH v2 3/6] SUNRPC: Don't loop forever in xs_tcp_data_receive()

Ensure that we don't hog the workqueue thread by requeuing the job
every 64 loops.
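The pattern generalizes to any work item that could otherwise monopolize its worker: process a bounded number of batches, then requeue yourself if data is still pending. A hedged Python sketch with invented names (the real code loops over tcp_read_sock() and requeues transport->recv_worker on xprtiod_workqueue):

```python
from collections import deque

MAX_LOOPS = 64

def receive_job(socket_data, workqueue):
    """Drain at most MAX_LOOPS records, then requeue ourselves if more
    data is still pending, rather than hogging the worker thread."""
    processed = 0
    for _ in range(MAX_LOOPS):
        if not socket_data:
            return processed            # nothing left: we're done
        socket_data.popleft()           # stand-in for tcp_read_sock()
        processed += 1
    if socket_data:                     # data still ready: reschedule
        workqueue.append(lambda: receive_job(socket_data, workqueue))
    return processed
```

Other queued work gets a turn between the two passes instead of waiting behind an unbounded receive loop.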

Signed-off-by: Trond Myklebust <[email protected]>
---
net/sunrpc/xprtsock.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index a2312f14beb4..e8f44fc76754 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1526,6 +1526,7 @@ static void xs_tcp_data_receive(struct sock_xprt *transport)
.arg.data = xprt,
};
unsigned long total = 0;
+ int loop;
int read = 0;

mutex_lock(&transport->recv_mutex);
@@ -1534,20 +1535,20 @@ static void xs_tcp_data_receive(struct sock_xprt *transport)
goto out;

/* We use rd_desc to pass struct xprt to xs_tcp_data_recv */
- for (;;) {
+ for (loop = 0; loop < 64; loop++) {
lock_sock(sk);
read = tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv);
if (read <= 0) {
clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state);
release_sock(sk);
- if (!test_bit(XPRT_SOCK_DATA_READY, &transport->sock_state))
- break;
- } else {
- release_sock(sk);
- total += read;
+ break;
}
+ release_sock(sk);
+ total += read;
rd_desc.count = 65536;
}
+ if (test_bit(XPRT_SOCK_DATA_READY, &transport->sock_state))
+ queue_work(xprtiod_workqueue, &transport->recv_worker);
out:
mutex_unlock(&transport->recv_mutex);
trace_xs_tcp_data_ready(xprt, read, total);
--
2.13.5


2017-08-16 16:19:44

by Trond Myklebust

Subject: [PATCH v2 6/6] NFS: Throttle I/O to the NFS server

Ensure that we do not build up large queues of asynchronous I/O work in the
RPC layer that are not being handled by the socket.
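The throttle amounts to making an asynchronous submitter block until the transport has picked its message up, so the queue ahead of the socket stays bounded. A hedged sketch with invented names (the real change simply calls rpc_wait_for_msg_send() on the async rpc_task in nfs_initiate_pgio()):

```python
import queue
import threading

def submit_async(task_queue, payload):
    # Hand the request off, then block until the transport has begun
    # transmitting it (analogue of rpc_wait_for_msg_send()), so work
    # cannot pile up faster than the socket drains it.
    sent = threading.Event()
    task_queue.put((payload, sent))
    sent.wait()

def transport_loop(task_queue):
    # Transmitter: mark each message as in flight when it is picked up.
    while True:
        payload, sent = task_queue.get()
        if payload is None:             # sentinel: shut down
            return
        sent.set()                      # message is now being sent
```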

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/pagelist.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 548ebc7256ff..c9a664730e94 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -625,7 +625,8 @@ int nfs_initiate_pgio(struct rpc_clnt *clnt, struct nfs_pgio_header *hdr,
ret = rpc_wait_for_completion_task(task);
if (ret == 0)
ret = task->tk_status;
- }
+ } else
+ rpc_wait_for_msg_send(task);
rpc_put_task(task);
out:
return ret;
--
2.13.5


2017-08-16 16:43:12

by Chuck Lever III

Subject: Re: [PATCH v2 5/6] SUNRPC: Add a function to allow waiting for RPC transmission


> On Aug 16, 2017, at 12:19 PM, Trond Myklebust <[email protected]> wrote:
>
> Sometimes, we only want to know that the RPC message is in the process
> of being transmitted. Add a function to allow that.

The last time I tried these last two, they resulted in a significant
performance regression. Is there anything I can do to help track
that down?



--
Chuck Lever




2017-08-16 18:54:04

by Anna Schumaker

Subject: Re: [PATCH v2 1/6] SUNRPC: Don't hold the transport lock across socket copy operations

Hi Trond,

On 08/16/2017 12:19 PM, Trond Myklebust wrote:
> Instead add a mechanism to ensure that the request doesn't disappear
> from underneath us while copying from the socket. We do this by
> preventing xprt_release() from freeing the XDR buffers until the
> flag RPC_TASK_MSG_RECV has been cleared from the request.

I'm seeing a different oops on this version, still with xfstests generic/089:

[ 67.896287] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 67.897466] IP: xprt_release+0x100/0x3b0 [sunrpc]
[ 67.897911] PGD 39014067
[ 67.897911] P4D 39014067
[ 67.898169] PUD 32b36067
[ 67.898424] PMD 0
[ 67.898675]
[ 67.898928] Oops: 0000 [#1] PREEMPT SMP
[ 67.899188] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma cfg80211 rfkill crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc vmwgfx snd_hda_codec_generic joydev mousedev ppdev drm_kms_helper snd_hda_intel syscopyarea sysfillrect sysimgblt fb_sys_fops aesni_intel snd_hda_codec ttm aes_x86_64 snd_hwdep crypto_simd cryptd snd_hda_core glue_helper snd_pcm drm evdev input_leds led_class psmouse snd_timer pcspkr mac_hid snd parport_pc intel_agp pvpanic parport i2c_piix4 soundcore intel_gtt button sch_fq_codel nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables ata_generic pata_acpi serio_raw atkbd libps2 uhci_hcd ehci_pci ehci_hcd floppy i8042 serio ata_piix usbcore usb_common libata scsi_mod xfs libcrc32c crc32c_generic virtio_balloon crc32c_intel
[ 67.903802] virtio_net virtio_pci virtio_blk virtio_ring virtio
[ 67.904206] CPU: 1 PID: 728 Comm: NFSv4 callback Not tainted 4.13.0-rc5-ANNA+ #9642
[ 67.904716] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 67.905097] task: ffff88003c735880 task.stack: ffffc900015cc000
[ 67.905801] RIP: 0010:xprt_release+0x100/0x3b0 [sunrpc]
[ 67.906436] RSP: 0000:ffffc900015cfd18 EFLAGS: 00010212
[ 67.907064] RAX: ffff8800375e24b8 RBX: ffff88003db1e800 RCX: ffffffff81a054c8
[ 67.907824] RDX: ffff88003db1ec80 RSI: 00000000fffef000 RDI: ffff88003db1e800
[ 67.908579] RBP: ffffc900015cfd48 R08: 0000000000000000 R09: 0000000000000000
[ 67.909328] R10: 0000000000000010 R11: 0000000000000000 R12: ffff8800375cce00
[ 67.910071] R13: ffff880037874600 R14: ffff8800375cceb8 R15: ffff88003db1ec38
[ 67.910816] FS: 0000000000000000(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[ 67.911916] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 67.912592] CR2: 0000000000000030 CR3: 0000000032a5f000 CR4: 00000000001406e0
[ 67.913369] Call Trace:
[ 67.913858] ? _raw_spin_unlock_bh+0x31/0x40
[ 67.914477] rpc_release_resources_task+0x25/0x50 [sunrpc]
[ 67.915161] __rpc_execute+0x1ef/0x430 [sunrpc]
[ 67.915776] rpc_execute+0x70/0xe0 [sunrpc]
[ 67.916361] rpc_run_bc_task+0x9a/0xf0 [sunrpc]
[ 67.916963] bc_svc_process+0x26a/0x330 [sunrpc]
[ 67.917569] nfs41_callback_svc+0x12e/0x1c0 [nfsv4]
[ 67.918178] ? wait_woken+0x90/0x90
[ 67.918689] kthread+0x134/0x150
[ 67.919178] ? nfs_map_gid_to_group+0x130/0x130 [nfsv4]
[ 67.919786] ? kthread_create_on_node+0x80/0x80
[ 67.920345] ret_from_fork+0x25/0x30
[ 67.920838] Code: 35 a6 03 75 e1 48 8b 83 80 04 00 00 48 8d 93 80 04 00 00 48 39 c2 48 89 b3 18 04 00 00 0f 84 4c 02 00 00 4d 8b 84 24 88 00 00 00 <49> 8b 40 30 a8 08 0f 85 09 01 00 00 4c 89 ff e8 2c 4b 43 e1 49
[ 67.922831] RIP: xprt_release+0x100/0x3b0 [sunrpc] RSP: ffffc900015cfd18
[ 67.924646] CR2: 0000000000000030
[ 67.925582] ---[ end trace 69addd132abaa51b ]---
[ 67.926568] Kernel panic - not syncing: Fatal exception in interrupt
[ 67.927826] Kernel Offset: disabled
[ 67.928655] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Thanks,
Anna


2017-08-16 20:02:20

by Trond Myklebust

Subject: Re: [PATCH v2 5/6] SUNRPC: Add a function to allow waiting for RPC transmission

On Wed, 2017-08-16 at 12:43 -0400, Chuck Lever wrote:
> > On Aug 16, 2017, at 12:19 PM, Trond Myklebust <trond.myklebust@prim
> > arydata.com> wrote:
> >
> > Sometimes, we only want to know that the RPC message is in the
> > process
> > of being transmitted. Add a function to allow that.
>
> The last time I tried these last two, they resulted in a significant
> performance regression. Is there anything I can do to help track
> that down?

If they are the root cause of a performance regression, then I'll just
drop them. I've been rebasing to ensure that I keep them at the end of
the queue precisely because I was unsure of their effect on overall
performance.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]


2017-08-16 20:03:49

by Chuck Lever III

Subject: Re: [PATCH v2 5/6] SUNRPC: Add a function to allow waiting for RPC transmission


> On Aug 16, 2017, at 4:02 PM, Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2017-08-16 at 12:43 -0400, Chuck Lever wrote:
>>> On Aug 16, 2017, at 12:19 PM, Trond Myklebust <trond.myklebust@prim
>>> arydata.com> wrote:
>>>
>>> Sometimes, we only want to know that the RPC message is in the
>>> process
>>> of being transmitted. Add a function to allow that.
>>
>> The last time I tried these last two, they resulted in a significant
>> performance regression. Is there anything I can do to help track
>> that down?
>
> If they are the root cause of a performance regression, then I'll just
> drop them. I've been rebasing to ensure that I keep them at the end of
> the queue precisely because I was unsure of their effect on overall
> performance.

My experience was applying the whole stack of the original 20
resulted in some performance degradations (fio). Popping these
two from that stack restored performance across the suite of
tests I was doing.


--
Chuck Lever