Hi Anna-
This series contains bug fixes and adds support for RPC-over-RDMA connection
keepalive. The keepalive patches are still waiting for internal
testing resources to confirm they trigger connection loss in the
right circumstances, but my own testing shows they are behaving as
expected and do not introduce instability.
Available in the "nfs-rdma-for-4.11" topic branch of this git repo:
git://git.linux-nfs.org/projects/cel/cel-2.6.git
Or for browsing:
http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.11
Changes since v2:
- Rebased on v4.10-rc7
- v4.10-rc bugfixes merged into this series
- Minor improvements to patch descriptions
- Field move that was in 12/12 is now done in the correct patch
Changes since v1:
- Rebased on v4.10-rc6
- Tested-by and additional clean-up in 1/7
- Patch description clarifications
- Renamed some constants and variables
---
Chuck Lever (12):
xprtrdma: Fix Read chunk padding
xprtrdma: Per-connection pad optimization
xprtrdma: Disable pad optimization by default
xprtrdma: Reduce required number of send SGEs
xprtrdma: Shrink send SGEs array
xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
xprtrdma: Handle stale connection rejection
xprtrdma: Refactor management of mw_list field
sunrpc: Allow xprt->ops->timer method to sleep
sunrpc: Enable calls to rpc_call_null_helper() from other modules
xprtrdma: Detect unreachable NFS/RDMA servers more reliably
sunrpc: Allow keepalive ping on a credit-full transport
fs/nfs/nfs4proc.c | 3 -
fs/nfsd/nfs4callback.c | 2 -
include/linux/sunrpc/clnt.h | 5 ++
include/linux/sunrpc/sched.h | 4 +
net/sunrpc/clnt.c | 28 +++++-----
net/sunrpc/xprt.c | 6 +-
net/sunrpc/xprtrdma/fmr_ops.c | 5 --
net/sunrpc/xprtrdma/frwr_ops.c | 11 +---
net/sunrpc/xprtrdma/rpc_rdma.c | 82 ++++++++++++++++++-----------
net/sunrpc/xprtrdma/transport.c | 76 +++++++++++++++++++++++++--
net/sunrpc/xprtrdma/verbs.c | 109 +++++++++++++++------------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 37 ++++++++++++-
net/sunrpc/xprtsock.c | 2 +
13 files changed, 234 insertions(+), 136 deletions(-)
--
Chuck Lever
We no longer need to accommodate an xdr_buf whose pages start at an
offset and cross extra page boundaries. If there are more partial or
whole pages to send than there are available SGEs, the marshaling
logic is now smart enough to use a Read chunk instead of failing.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/xprt_rdma.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 3d7e9c9..852dd0a 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -305,16 +305,19 @@ struct rpcrdma_mr_seg { /* chunk descriptors */
char *mr_offset; /* kva if no page, else offset */
};
-/* Reserve enough Send SGEs to send a maximum size inline request:
+/* The Send SGE array is provisioned to send a maximum size
+ * inline request:
* - RPC-over-RDMA header
* - xdr_buf head iovec
- * - RPCRDMA_MAX_INLINE bytes, possibly unaligned, in pages
+ * - RPCRDMA_MAX_INLINE bytes, in pages
* - xdr_buf tail iovec
+ *
+ * The actual number of array elements consumed by each RPC
+ * depends on the device's max_sge limit.
*/
enum {
RPCRDMA_MIN_SEND_SGES = 3,
- RPCRDMA_MAX_SEND_PAGES = PAGE_SIZE + RPCRDMA_MAX_INLINE - 1,
- RPCRDMA_MAX_PAGE_SGES = (RPCRDMA_MAX_SEND_PAGES >> PAGE_SHIFT) + 1,
+ RPCRDMA_MAX_PAGE_SGES = RPCRDMA_MAX_INLINE >> PAGE_SHIFT,
RPCRDMA_MAX_SEND_SGES = 1 + 1 + RPCRDMA_MAX_PAGE_SGES + 1,
};
The transport lock is needed to protect the xprt_adjust_cwnd() call
in xs_udp_timer, but it is not necessary for accessing the
rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate
the lock into UDP's xs_udp_timer method, where it is required.
The ->timer method has to take the transport lock if needed, but it
can now sleep safely, or even call back into the RPC scheduler.
This is more a clean-up than a fix, but the "issue" was introduced
by my transport switch patches back in 2005.
Fixes: 46c0ee8bc4ad ("RPC: separate xprt_timer implementations")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprt.c | 2 --
net/sunrpc/xprtsock.c | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 9a6be03..b530a28 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -897,13 +897,11 @@ static void xprt_timer(struct rpc_task *task)
return;
dprintk("RPC: %5u xprt_timer\n", task->tk_pid);
- spin_lock_bh(&xprt->transport_lock);
if (!req->rq_reply_bytes_recvd) {
if (xprt->ops->timer)
xprt->ops->timer(xprt, task);
} else
task->tk_status = 0;
- spin_unlock_bh(&xprt->transport_lock);
}
/**
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index af392d9..d9bb644 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1734,7 +1734,9 @@ static void xs_udp_set_buffer_size(struct rpc_xprt *xprt, size_t sndsize, size_t
*/
static void xs_udp_timer(struct rpc_xprt *xprt, struct rpc_task *task)
{
+ spin_lock_bh(&xprt->transport_lock);
xprt_adjust_cwnd(xprt, task, -ETIMEDOUT);
+ spin_unlock_bh(&xprt->transport_lock);
}
static unsigned short xs_get_random_port(void)
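Not part of this series, but to illustrate the new contract: a hypothetical
->timer method may now block, and takes the transport lock itself only where
it actually needs it:
	static void example_xprt_timer(struct rpc_xprt *xprt, struct rpc_task *task)
	{
		/* Safe now that xprt_timer() no longer holds
		 * xprt->transport_lock: this callback may sleep or
		 * call back into the RPC scheduler.
		 */
		msleep(20);

		/* Take the lock only around the work that needs it. */
		spin_lock_bh(&xprt->transport_lock);
		xprt_adjust_cwnd(xprt, task, -ETIMEDOUT);
		spin_unlock_bh(&xprt->transport_lock);
	}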
Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.
When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.
However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.
In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.
To address this, emit an RPC NULL ping when an RPC retransmit
timeout occurs.
The purpose of this ping is to drive traffic on the connection to
force the transport layer to disconnect it if it is no longer
viable. Some RDMA operations are fully offloaded to the HCA, and can
be successful even if the server O/S has crashed. Thus an operation
that requires the server to be responsive is used for the ping.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 69 +++++++++++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 7 ++++
2 files changed, 76 insertions(+)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index c717f54..3a5a805 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -484,6 +484,74 @@
dprintk("RPC: %s: %u\n", __func__, port);
}
+static void rpcrdma_keepalive_done(struct rpc_task *task, void *calldata)
+{
+ struct rpc_xprt *xprt = (struct rpc_xprt *)calldata;
+ struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+
+ if (task->tk_status) {
+ struct sockaddr *sap =
+ (struct sockaddr *)&r_xprt->rx_ep.rep_remote_addr;
+
+ pr_err("rpcrdma: keepalive to %pIS:%u failed (%d)\n",
+ sap, rpc_get_port(sap), task->tk_status);
+ xprt_disconnect_done(xprt);
+ }
+ clear_bit(RPCRDMA_IA_RSVD_CREDIT, &r_xprt->rx_ia.ri_flags);
+}
+
+static void rpcrdma_keepalive_release(void *calldata)
+{
+ struct rpc_xprt *xprt = (struct rpc_xprt *)calldata;
+
+ xprt_put(xprt);
+}
+
+static const struct rpc_call_ops rpcrdma_keepalive_call_ops = {
+ .rpc_call_done = rpcrdma_keepalive_done,
+ .rpc_release = rpcrdma_keepalive_release,
+};
+
+/**
+ * xprt_rdma_timer - invoked when an RPC times out
+ * @xprt: controlling RPC transport
+ * @task: RPC task that timed out
+ *
+ * Some RDMA transports do not have any form of connection
+ * keepalive. In some circumstances, unviable connections
+ * can continue to live for a long time.
+ *
+ * Send a NULL RPC to see if the server still responds. On
+ * a moribund connection, this should trigger either an RPC
+ * or transport layer timeout and kill the connection.
+ */
+static void
+xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task)
+{
+ struct rpcrdma_xprt *r_xprt =
+ container_of(xprt, struct rpcrdma_xprt, rx_xprt);
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+ struct rpcrdma_ep *ep = &r_xprt->rx_ep;
+ struct sockaddr *sap = (struct sockaddr *)&ep->rep_remote_addr;
+#endif
+ struct rpc_task *null_task;
+ void *data;
+
+ /* Ensure only one is sent at a time */
+ if (test_and_set_bit(RPCRDMA_IA_RSVD_CREDIT, &r_xprt->rx_ia.ri_flags))
+ return;
+
+ dprintk("RPC: %s: sending keepalive ping to %pIS:%u\n",
+ __func__, sap, rpc_get_port(sap));
+
+ data = xprt_get(xprt);
+ null_task = rpc_call_null_helper(task->tk_client, xprt, NULL,
+ RPC_TASK_SOFTPING | RPC_TASK_ASYNC,
+ &rpcrdma_keepalive_call_ops, data);
+ if (!IS_ERR(null_task))
+ rpc_put_task(null_task);
+}
+
static void
xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
{
@@ -776,6 +844,7 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
.alloc_slot = xprt_alloc_slot,
.release_request = xprt_release_rqst_cong, /* ditto */
.set_retrans_timeout = xprt_set_retrans_timeout_def, /* ditto */
+ .timer = xprt_rdma_timer,
.rpcbind = rpcb_getport_async, /* sunrpc/rpcb_clnt.c */
.set_port = xprt_rdma_set_port,
.connect = xprt_rdma_connect,
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 171a351..dd1340f 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -78,10 +78,17 @@ struct rpcrdma_ia {
bool ri_reminv_expected;
bool ri_implicit_roundup;
enum ib_mr_type ri_mrtype;
+ unsigned long ri_flags;
struct ib_qp_attr ri_qp_attr;
struct ib_qp_init_attr ri_qp_init_attr;
};
+/* ri_flags bits
+ */
+enum {
+ RPCRDMA_IA_RSVD_CREDIT = 0,
+};
+
/*
* RDMA Endpoint -- one per transport instance
*/
A server rejects a connection attempt with STALE_CONNECTION when a
client attempts to connect to a working remote service, but uses a
QPN and GUID that correspond to an old connection that was
abandoned. This might occur after a client crashes and restarts.
Fix rpcrdma_conn_upcall() to distinguish between a normal rejection
and rejection of stale connection parameters.
As an additional clean-up, remove the code that retries the
connection attempt with different ORD/IRD values. Code audit of
other ULP initiators shows no similar special case handling of
initiator_depth or responder_resources.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 66 ++++++++++++++-----------------------------
1 file changed, 21 insertions(+), 45 deletions(-)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 61d16c3..d1ee33f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -54,6 +54,7 @@
#include <linux/sunrpc/svc_rdma.h>
#include <asm/bitops.h>
#include <linux/module.h> /* try_module_get()/module_put() */
+#include <rdma/ib_cm.h>
#include "xprt_rdma.h"
@@ -279,7 +280,14 @@
connstate = -ENETDOWN;
goto connected;
case RDMA_CM_EVENT_REJECTED:
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+ pr_info("rpcrdma: connection to %pIS:%u on %s rejected: %s\n",
+ sap, rpc_get_port(sap), ia->ri_device->name,
+ rdma_reject_msg(id, event->status));
+#endif
connstate = -ECONNREFUSED;
+ if (event->status == IB_CM_REJ_STALE_CONN)
+ connstate = -EAGAIN;
goto connected;
case RDMA_CM_EVENT_DISCONNECTED:
connstate = -ECONNABORTED;
@@ -643,20 +651,21 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
int
rpcrdma_ep_connect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
{
+ struct rpcrdma_xprt *r_xprt = container_of(ia, struct rpcrdma_xprt,
+ rx_ia);
struct rdma_cm_id *id, *old;
+ struct sockaddr *sap;
+ unsigned int extras;
int rc = 0;
- int retry_count = 0;
if (ep->rep_connected != 0) {
- struct rpcrdma_xprt *xprt;
retry:
dprintk("RPC: %s: reconnecting...\n", __func__);
rpcrdma_ep_disconnect(ep, ia);
- xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
- id = rpcrdma_create_id(xprt, ia,
- (struct sockaddr *)&xprt->rx_data.addr);
+ sap = (struct sockaddr *)&r_xprt->rx_data.addr;
+ id = rpcrdma_create_id(r_xprt, ia, sap);
if (IS_ERR(id)) {
rc = -EHOSTUNREACH;
goto out;
@@ -711,51 +720,18 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
}
wait_event_interruptible(ep->rep_connect_wait, ep->rep_connected != 0);
-
- /*
- * Check state. A non-peer reject indicates no listener
- * (ECONNREFUSED), which may be a transient state. All
- * others indicate a transport condition which has already
- * undergone a best-effort.
- */
- if (ep->rep_connected == -ECONNREFUSED &&
- ++retry_count <= RDMA_CONNECT_RETRY_MAX) {
- dprintk("RPC: %s: non-peer_reject, retry\n", __func__);
- goto retry;
- }
if (ep->rep_connected <= 0) {
- /* Sometimes, the only way to reliably connect to remote
- * CMs is to use same nonzero values for ORD and IRD. */
- if (retry_count++ <= RDMA_CONNECT_RETRY_MAX + 1 &&
- (ep->rep_remote_cma.responder_resources == 0 ||
- ep->rep_remote_cma.initiator_depth !=
- ep->rep_remote_cma.responder_resources)) {
- if (ep->rep_remote_cma.responder_resources == 0)
- ep->rep_remote_cma.responder_resources = 1;
- ep->rep_remote_cma.initiator_depth =
- ep->rep_remote_cma.responder_resources;
+ if (ep->rep_connected == -EAGAIN)
goto retry;
- }
rc = ep->rep_connected;
- } else {
- struct rpcrdma_xprt *r_xprt;
- unsigned int extras;
-
- dprintk("RPC: %s: connected\n", __func__);
-
- r_xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
- extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
-
- if (extras) {
- rc = rpcrdma_ep_post_extra_recv(r_xprt, extras);
- if (rc) {
- pr_warn("%s: rpcrdma_ep_post_extra_recv: %i\n",
- __func__, rc);
- rc = 0;
- }
- }
+ goto out;
}
+ dprintk("RPC: %s: connected\n", __func__);
+ extras = r_xprt->rx_buf.rb_bc_srv_max_requests;
+ if (extras)
+ rpcrdma_ep_post_extra_recv(r_xprt, extras);
+
out:
if (rc)
ep->rep_connected = rc;
Clean up some duplicate code.
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 5 +----
net/sunrpc/xprtrdma/frwr_ops.c | 11 ++++-------
net/sunrpc/xprtrdma/rpc_rdma.c | 6 +++---
net/sunrpc/xprtrdma/verbs.c | 15 +++++----------
net/sunrpc/xprtrdma/xprt_rdma.h | 16 ++++++++++++++++
5 files changed, 29 insertions(+), 24 deletions(-)
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 1ebb09e..59e6402 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -310,10 +310,7 @@ enum {
struct rpcrdma_mw *mw;
while (!list_empty(&req->rl_registered)) {
- mw = list_first_entry(&req->rl_registered,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
-
+ mw = rpcrdma_pop_mw(&req->rl_registered);
if (sync)
fmr_op_recover_mr(mw);
else
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 47bed53..f81dd93 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -466,8 +466,8 @@
struct ib_send_wr *first, **prev, *last, *bad_wr;
struct rpcrdma_rep *rep = req->rl_reply;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct rpcrdma_mw *mw, *tmp;
struct rpcrdma_frmr *f;
+ struct rpcrdma_mw *mw;
int count, rc;
dprintk("RPC: %s: req %p\n", __func__, req);
@@ -534,10 +534,10 @@
* them to the free MW list.
*/
unmap:
- list_for_each_entry_safe(mw, tmp, &req->rl_registered, mw_list) {
+ while (!list_empty(&req->rl_registered)) {
+ mw = rpcrdma_pop_mw(&req->rl_registered);
dprintk("RPC: %s: DMA unmapping frmr %p\n",
__func__, &mw->frmr);
- list_del_init(&mw->mw_list);
ib_dma_unmap_sg(ia->ri_device,
mw->mw_sg, mw->mw_nents, mw->mw_dir);
rpcrdma_put_mw(r_xprt, mw);
@@ -571,10 +571,7 @@
struct rpcrdma_mw *mw;
while (!list_empty(&req->rl_registered)) {
- mw = list_first_entry(&req->rl_registered,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
-
+ mw = rpcrdma_pop_mw(&req->rl_registered);
if (sync)
frwr_op_recover_mr(mw);
else
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 72b3ca0..a044be2 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -322,7 +322,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
false, &mw);
if (n < 0)
return ERR_PTR(n);
- list_add(&mw->mw_list, &req->rl_registered);
+ rpcrdma_push_mw(mw, &req->rl_registered);
*iptr++ = xdr_one; /* item present */
@@ -390,7 +390,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
true, &mw);
if (n < 0)
return ERR_PTR(n);
- list_add(&mw->mw_list, &req->rl_registered);
+ rpcrdma_push_mw(mw, &req->rl_registered);
iptr = xdr_encode_rdma_segment(iptr, mw);
@@ -455,7 +455,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
true, &mw);
if (n < 0)
return ERR_PTR(n);
- list_add(&mw->mw_list, &req->rl_registered);
+ rpcrdma_push_mw(mw, &req->rl_registered);
iptr = xdr_encode_rdma_segment(iptr, mw);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index d1ee33f..81cd31a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -776,9 +776,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
spin_lock(&buf->rb_recovery_lock);
while (!list_empty(&buf->rb_stale_mrs)) {
- mw = list_first_entry(&buf->rb_stale_mrs,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
+ mw = rpcrdma_pop_mw(&buf->rb_stale_mrs);
spin_unlock(&buf->rb_recovery_lock);
dprintk("RPC: %s: recovering MR %p\n", __func__, mw);
@@ -796,7 +794,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
spin_lock(&buf->rb_recovery_lock);
- list_add(&mw->mw_list, &buf->rb_stale_mrs);
+ rpcrdma_push_mw(mw, &buf->rb_stale_mrs);
spin_unlock(&buf->rb_recovery_lock);
schedule_delayed_work(&buf->rb_recovery_worker, 0);
@@ -1072,11 +1070,8 @@ struct rpcrdma_mw *
struct rpcrdma_mw *mw = NULL;
spin_lock(&buf->rb_mwlock);
- if (!list_empty(&buf->rb_mws)) {
- mw = list_first_entry(&buf->rb_mws,
- struct rpcrdma_mw, mw_list);
- list_del_init(&mw->mw_list);
- }
+ if (!list_empty(&buf->rb_mws))
+ mw = rpcrdma_pop_mw(&buf->rb_mws);
spin_unlock(&buf->rb_mwlock);
if (!mw)
@@ -1099,7 +1094,7 @@ struct rpcrdma_mw *
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
spin_lock(&buf->rb_mwlock);
- list_add_tail(&mw->mw_list, &buf->rb_mws);
+ rpcrdma_push_mw(mw, &buf->rb_mws);
spin_unlock(&buf->rb_mwlock);
}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 852dd0a..171a351 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -354,6 +354,22 @@ struct rpcrdma_req {
return rqst->rq_xprtdata;
}
+static inline void
+rpcrdma_push_mw(struct rpcrdma_mw *mw, struct list_head *list)
+{
+ list_add_tail(&mw->mw_list, list);
+}
+
+static inline struct rpcrdma_mw *
+rpcrdma_pop_mw(struct list_head *list)
+{
+ struct rpcrdma_mw *mw;
+
+ mw = list_first_entry(list, struct rpcrdma_mw, mw_list);
+ list_del(&mw->mw_list);
+ return mw;
+}
+
/*
* struct rpcrdma_buffer -- holds list/queue of pre-registered memory for
* inline requests/replies, and client/server credits.
Sriharsha ([email protected]) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.
When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.
Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.
But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.
When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.
But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.
- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.
- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.
The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.
Reported-by: Sriharsha Basavapatna <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Sriharsha Basavapatna <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 14 ++++++++------
net/sunrpc/xprtrdma/transport.c | 4 ----
2 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d889883..72b3ca0 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -759,13 +759,13 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
iptr = headerp->rm_body.rm_chunks;
iptr = rpcrdma_encode_read_list(r_xprt, req, rqst, iptr, rtype);
if (IS_ERR(iptr))
- goto out_unmap;
+ goto out_err;
iptr = rpcrdma_encode_write_list(r_xprt, req, rqst, iptr, wtype);
if (IS_ERR(iptr))
- goto out_unmap;
+ goto out_err;
iptr = rpcrdma_encode_reply_chunk(r_xprt, req, rqst, iptr, wtype);
if (IS_ERR(iptr))
- goto out_unmap;
+ goto out_err;
hdrlen = (unsigned char *)iptr - (unsigned char *)headerp;
dprintk("RPC: %5u %s: %s/%s: hdrlen %zd rpclen %zd\n",
@@ -776,12 +776,14 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
if (!rpcrdma_prepare_send_sges(&r_xprt->rx_ia, req, hdrlen,
&rqst->rq_snd_buf, rtype)) {
iptr = ERR_PTR(-EIO);
- goto out_unmap;
+ goto out_err;
}
return 0;
-out_unmap:
- r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
+out_err:
+ pr_err("rpcrdma: rpcrdma_marshal_req failed, status %ld\n",
+ PTR_ERR(iptr));
+ r_xprt->rx_stats.failed_marshal_count++;
return PTR_ERR(iptr);
}
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 6990581..c717f54 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -709,10 +709,6 @@
return 0;
failed_marshal:
- dprintk("RPC: %s: rpcrdma_marshal_req failed, status %i\n",
- __func__, rc);
- if (rc == -EIO)
- r_xprt->rx_stats.failed_marshal_count++;
if (rc != -ENOTCONN)
return rc;
drop_connection:
To help detect unreachable servers, I'd like to emit an RPC ping
from rpcrdma.ko.
authnull_ops is not visible outside the sunrpc.ko module, so fold
the common case into rpc_call_null_helper, and export it so that
other kernel modules can invoke it.
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfs/nfs4proc.c | 3 +--
fs/nfsd/nfs4callback.c | 2 +-
include/linux/sunrpc/clnt.h | 5 +++++
include/linux/sunrpc/sched.h | 2 ++
net/sunrpc/clnt.c | 28 ++++++++++++++--------------
5 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 0a0eaec..0091e5a 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7640,8 +7640,7 @@ static int _nfs4_proc_exchange_id(struct nfs_client *clp, struct rpc_cred *cred,
if (xprt) {
calldata->xprt = xprt;
task_setup_data.rpc_xprt = xprt;
- task_setup_data.flags =
- RPC_TASK_SOFT|RPC_TASK_SOFTCONN|RPC_TASK_ASYNC;
+ task_setup_data.flags = RPC_TASK_SOFTPING | RPC_TASK_ASYNC;
calldata->args.verifier = &clp->cl_confirm;
} else {
calldata->args.verifier = &verifier;
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index eb78109..e1e2224 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1182,7 +1182,7 @@ static void nfsd4_process_cb_update(struct nfsd4_callback *cb)
}
cb->cb_msg.rpc_cred = clp->cl_cb_cred;
- rpc_call_async(clnt, &cb->cb_msg, RPC_TASK_SOFT | RPC_TASK_SOFTCONN,
+ rpc_call_async(clnt, &cb->cb_msg, RPC_TASK_SOFTPING,
cb->cb_ops ? &nfsd4_cb_ops : &nfsd4_cb_probe_ops, cb);
}
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 333ad11..f576127 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -173,6 +173,11 @@ int rpc_call_async(struct rpc_clnt *clnt,
void *calldata);
int rpc_call_sync(struct rpc_clnt *clnt,
const struct rpc_message *msg, int flags);
+struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt,
+ struct rpc_xprt *xprt,
+ struct rpc_cred *cred, int flags,
+ const struct rpc_call_ops *ops,
+ void *data);
struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred,
int flags);
int rpc_restart_call_prepare(struct rpc_task *);
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 7ba040c..13822e6 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -128,6 +128,8 @@ struct rpc_task_setup {
#define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
#define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
+#define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN)
+
#define RPC_IS_ASYNC(t) ((t)->tk_flags & RPC_TASK_ASYNC)
#define RPC_IS_SWAPPER(t) ((t)->tk_flags & RPC_TASK_SWAPPER)
#define RPC_DO_ROOTOVERRIDE(t) ((t)->tk_flags & RPC_TASK_ROOTCREDS)
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 1dc9f3b..642c93d 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2520,12 +2520,11 @@ static int rpc_ping(struct rpc_clnt *clnt)
};
int err;
msg.rpc_cred = authnull_ops.lookup_cred(NULL, NULL, 0);
- err = rpc_call_sync(clnt, &msg, RPC_TASK_SOFT | RPC_TASK_SOFTCONN);
+ err = rpc_call_sync(clnt, &msg, RPC_TASK_SOFTPING);
put_rpccred(msg.rpc_cred);
return err;
}
-static
struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt,
struct rpc_xprt *xprt, struct rpc_cred *cred, int flags,
const struct rpc_call_ops *ops, void *data)
@@ -2542,9 +2541,17 @@ struct rpc_task *rpc_call_null_helper(struct rpc_clnt *clnt,
.callback_data = data,
.flags = flags,
};
+ struct rpc_task *task;
- return rpc_run_task(&task_setup_data);
+ if (!cred)
+ msg.rpc_cred = authnull_ops.lookup_cred(NULL, NULL, 0);
+ task = rpc_run_task(&task_setup_data);
+ if (!cred)
+ put_rpccred(msg.rpc_cred);
+
+ return task;
}
+EXPORT_SYMBOL_GPL(rpc_call_null_helper);
struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred, int flags)
{
@@ -2591,7 +2598,6 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt,
void *dummy)
{
struct rpc_cb_add_xprt_calldata *data;
- struct rpc_cred *cred;
struct rpc_task *task;
data = kmalloc(sizeof(*data), GFP_NOFS);
@@ -2600,11 +2606,9 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt,
data->xps = xprt_switch_get(xps);
data->xprt = xprt_get(xprt);
- cred = authnull_ops.lookup_cred(NULL, NULL, 0);
- task = rpc_call_null_helper(clnt, xprt, cred,
- RPC_TASK_SOFT|RPC_TASK_SOFTCONN|RPC_TASK_ASYNC,
- &rpc_cb_add_xprt_call_ops, data);
- put_rpccred(cred);
+ task = rpc_call_null_helper(clnt, xprt, NULL,
+ RPC_TASK_SOFTPING | RPC_TASK_ASYNC,
+ &rpc_cb_add_xprt_call_ops, data);
if (IS_ERR(task))
return PTR_ERR(task);
rpc_put_task(task);
@@ -2635,7 +2639,6 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt,
struct rpc_xprt *xprt,
void *data)
{
- struct rpc_cred *cred;
struct rpc_task *task;
struct rpc_add_xprt_test *xtest = (struct rpc_add_xprt_test *)data;
int status = -EADDRINUSE;
@@ -2647,11 +2650,8 @@ int rpc_clnt_setup_test_and_add_xprt(struct rpc_clnt *clnt,
goto out_err;
/* Test the connection */
- cred = authnull_ops.lookup_cred(NULL, NULL, 0);
- task = rpc_call_null_helper(clnt, xprt, cred,
- RPC_TASK_SOFT | RPC_TASK_SOFTCONN,
+ task = rpc_call_null_helper(clnt, xprt, NULL, RPC_TASK_SOFTPING,
NULL, NULL);
- put_rpccred(cred);
if (IS_ERR(task)) {
status = PTR_ERR(task);
goto out_err;
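For reference, a minimal sketch of how another kernel module can now issue a
NULL ping with the default AUTH_NULL credential (this mirrors the synchronous
call above; the surrounding variables are assumed, and it is not itself part
of the patch):
	task = rpc_call_null_helper(clnt, xprt, NULL, RPC_TASK_SOFTPING,
				    NULL, NULL);
	if (IS_ERR(task))
		return PTR_ERR(task);
	status = task->tk_status;
	rpc_put_task(task);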
Pad optimization is changed by echoing into
/proc/sys/sunrpc/rdma_pad_optimize. This is a global setting,
affecting all RPC-over-RDMA connections to all servers.
The marshaling code picks up that value and uses it for decisions
about how to construct each RPC-over-RDMA frame. Having it change
suddenly in mid-operation can result in unexpected failures. And
some servers a client mounts might need chunk round-up, while
others don't.
Instead, copy the pad_optimize setting into each connection's
rpcrdma_ia when the transport is created, and use that copy, which
cannot change during the life of the connection.
This also removes a hack: rpcrdma_convert_iovs was using
the remote-invalidation-expected flag to predict when it could leave
out Write chunk padding. This is because the Linux server handles
implicit XDR padding on Write chunks correctly, and only Linux
servers can set the connection's remote-invalidation-expected flag.
It's more sensible to use the pad optimization setting instead.
Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 28 ++++++++++++++--------------
net/sunrpc/xprtrdma/verbs.c | 1 +
net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
3 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index a524d3c..c634f0f 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -186,9 +186,9 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
*/
static int
-rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
- enum rpcrdma_chunktype type, struct rpcrdma_mr_seg *seg,
- bool reminv_expected)
+rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
+ unsigned int pos, enum rpcrdma_chunktype type,
+ struct rpcrdma_mr_seg *seg)
{
int len, n, p, page_base;
struct page **ppages;
@@ -229,14 +229,15 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
/* When encoding a Read chunk, the tail iovec contains an
* XDR pad and may be omitted.
*/
- if (type == rpcrdma_readch && xprt_rdma_pad_optimize)
+ if (type == rpcrdma_readch && r_xprt->rx_ia.ri_implicit_roundup)
return n;
- /* When encoding the Write list, some servers need to see an extra
- * segment for odd-length Write chunks. The upper layer provides
- * space in the tail iovec for this purpose.
+ /* When encoding a Write chunk, some servers need to see an
+ * extra segment for non-XDR-aligned Write chunks. The upper
+ * layer provides space in the tail iovec that may be used
+ * for this purpose.
*/
- if (type == rpcrdma_writech && reminv_expected)
+ if (type == rpcrdma_writech && r_xprt->rx_ia.ri_implicit_roundup)
return n;
if (xdrbuf->tail[0].iov_len) {
@@ -291,7 +292,8 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
if (rtype == rpcrdma_areadch)
pos = 0;
seg = req->rl_segments;
- nsegs = rpcrdma_convert_iovs(&rqst->rq_snd_buf, pos, rtype, seg, false);
+ nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_snd_buf, pos,
+ rtype, seg);
if (nsegs < 0)
return ERR_PTR(nsegs);
@@ -353,10 +355,9 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
}
seg = req->rl_segments;
- nsegs = rpcrdma_convert_iovs(&rqst->rq_rcv_buf,
+ nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf,
rqst->rq_rcv_buf.head[0].iov_len,
- wtype, seg,
- r_xprt->rx_ia.ri_reminv_expected);
+ wtype, seg);
if (nsegs < 0)
return ERR_PTR(nsegs);
@@ -421,8 +422,7 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
}
seg = req->rl_segments;
- nsegs = rpcrdma_convert_iovs(&rqst->rq_rcv_buf, 0, wtype, seg,
- r_xprt->rx_ia.ri_reminv_expected);
+ nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf, 0, wtype, seg);
if (nsegs < 0)
return ERR_PTR(nsegs);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 11d0774..2a6a367 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -208,6 +208,7 @@
/* Default settings for RPC-over-RDMA Version One */
r_xprt->rx_ia.ri_reminv_expected = false;
+ r_xprt->rx_ia.ri_implicit_roundup = xprt_rdma_pad_optimize;
rsize = RPCRDMA_V1_DEF_INLINE_SIZE;
wsize = RPCRDMA_V1_DEF_INLINE_SIZE;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index e35efd4..c137154 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -75,6 +75,7 @@ struct rpcrdma_ia {
unsigned int ri_max_inline_write;
unsigned int ri_max_inline_read;
bool ri_reminv_expected;
+ bool ri_implicit_roundup;
enum ib_mr_type ri_mrtype;
struct ib_qp_attr ri_qp_attr;
struct ib_qp_init_attr ri_qp_init_attr;
Commit d5440e27d3e5 ("xprtrdma: Enable pad optimization") made the
Linux client omit XDR round-up padding in normal Read and Write
chunks so that the client doesn't have to register and invalidate
3-byte memory regions that contain no real data.
Unfortunately, my cheery 2014 assessment that this optimization "is
supported now by both Linux and Solaris servers" was premature.
We've found bugs in Solaris in this area since commit d5440e27d3e5
("xprtrdma: Enable pad optimization") was merged (SYMLINK is the
main offender).
So for maximum interoperability, I'm disabling this optimization
again. If a CM private message is exchanged when connecting, the
client recognizes that the server is Linux, and enables the
optimization for that connection.
Until now the Solaris server bugs did not impact common operations,
and were thus largely benign. Soon, less capable devices on Linux
NFS/RDMA clients will make use of Read chunks more often, and these
Solaris bugs will prevent interoperation in more cases.
Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/transport.c | 2 +-
net/sunrpc/xprtrdma/verbs.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 534c178..6990581 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -67,7 +67,7 @@
static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
static unsigned int xprt_rdma_inline_write_padding;
static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
- int xprt_rdma_pad_optimize = 1;
+ int xprt_rdma_pad_optimize = 0;
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 2a6a367..23f4da4 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -216,6 +216,7 @@
pmsg->cp_magic == rpcrdma_cmp_magic &&
pmsg->cp_version == RPCRDMA_CMP_VERSION) {
r_xprt->rx_ia.ri_reminv_expected = true;
+ r_xprt->rx_ia.ri_implicit_roundup = true;
rsize = rpcrdma_decode_buffer_size(pmsg->cp_send_size);
wsize = rpcrdma_decode_buffer_size(pmsg->cp_recv_size);
}
The MAX_SEND_SGES check introduced in commit 655fec6987be
("xprtrdma: Use gathered Send for large inline messages") fails
for devices that have a small max_sge.
Instead of checking for a large fixed maximum number of SGEs,
check for a minimum small number. RPC-over-RDMA will switch to
using a Read chunk if an xdr_buf has more pages than can fit in
the device's max_sge limit. This is considerably better than
failing altogether to mount the server.
This fix supports devices that have as few as three send SGEs
available.
Reported-by: Selvin Xavier <[email protected]>
Reported-by: Devesh Sharma <[email protected]>
Reported-by: Honggang Li <[email protected]>
Reported-by: Ram Amrani <[email protected]>
Fixes: 655fec6987be ("xprtrdma: Use gathered Send for large ...")
Tested-by: Honggang Li <[email protected]>
Tested-by: Ram Amrani <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Parav Pandit <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 26 +++++++++++++++++++++++---
net/sunrpc/xprtrdma/verbs.c | 13 +++++++------
net/sunrpc/xprtrdma/xprt_rdma.h | 2 ++
3 files changed, 32 insertions(+), 9 deletions(-)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c634f0f..d889883 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -125,14 +125,34 @@ void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *r_xprt)
/* The client can send a request inline as long as the RPCRDMA header
* plus the RPC call fit under the transport's inline limit. If the
* combined call message size exceeds that limit, the client must use
- * the read chunk list for this operation.
+ * a Read chunk for this operation.
+ *
+ * A Read chunk is also required if sending the RPC call inline would
+ * exceed this device's max_sge limit.
*/
static bool rpcrdma_args_inline(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst *rqst)
{
- struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+ struct xdr_buf *xdr = &rqst->rq_snd_buf;
+ unsigned int count, remaining, offset;
+
+ if (xdr->len > r_xprt->rx_ia.ri_max_inline_write)
+ return false;
- return rqst->rq_snd_buf.len <= ia->ri_max_inline_write;
+ if (xdr->page_len) {
+ remaining = xdr->page_len;
+ offset = xdr->page_base & ~PAGE_MASK;
+ count = 0;
+ while (remaining) {
+ remaining -= min_t(unsigned int,
+ PAGE_SIZE - offset, remaining);
+ offset = 0;
+ if (++count > r_xprt->rx_ia.ri_max_send_sges)
+ return false;
+ }
+ }
+
+ return true;
}
/* The client can't know how large the actual reply will be. Thus it
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 23f4da4..61d16c3 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -488,18 +488,19 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
*/
int
rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
- struct rpcrdma_create_data_internal *cdata)
+ struct rpcrdma_create_data_internal *cdata)
{
struct rpcrdma_connect_private *pmsg = &ep->rep_cm_private;
+ unsigned int max_qp_wr, max_sge;
struct ib_cq *sendcq, *recvcq;
- unsigned int max_qp_wr;
int rc;
- if (ia->ri_device->attrs.max_sge < RPCRDMA_MAX_SEND_SGES) {
- dprintk("RPC: %s: insufficient sge's available\n",
- __func__);
+ max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES);
+ if (max_sge < RPCRDMA_MIN_SEND_SGES) {
+ pr_warn("rpcrdma: HCA provides only %d send SGEs\n", max_sge);
return -ENOMEM;
}
+ ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;
if (ia->ri_device->attrs.max_qp_wr <= RPCRDMA_BACKWARD_WRS) {
dprintk("RPC: %s: insufficient wqe's available\n",
@@ -524,7 +525,7 @@ static void rpcrdma_destroy_id(struct rdma_cm_id *id)
ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
ep->rep_attr.cap.max_recv_wr += 1; /* drain cqe */
- ep->rep_attr.cap.max_send_sge = RPCRDMA_MAX_SEND_SGES;
+ ep->rep_attr.cap.max_send_sge = max_sge;
ep->rep_attr.cap.max_recv_sge = 1;
ep->rep_attr.cap.max_inline_data = 0;
ep->rep_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index c137154..3d7e9c9 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -74,6 +74,7 @@ struct rpcrdma_ia {
unsigned int ri_max_frmr_depth;
unsigned int ri_max_inline_write;
unsigned int ri_max_inline_read;
+ unsigned int ri_max_send_sges;
bool ri_reminv_expected;
bool ri_implicit_roundup;
enum ib_mr_type ri_mrtype;
@@ -311,6 +312,7 @@ struct rpcrdma_mr_seg { /* chunk descriptors */
* - xdr_buf tail iovec
*/
enum {
+ RPCRDMA_MIN_SEND_SGES = 3,
RPCRDMA_MAX_SEND_PAGES = PAGE_SIZE + RPCRDMA_MAX_INLINE - 1,
RPCRDMA_MAX_PAGE_SGES = (RPCRDMA_MAX_SEND_PAGES >> PAGE_SHIFT) + 1,
RPCRDMA_MAX_SEND_SGES = 1 + 1 + RPCRDMA_MAX_PAGE_SGES + 1,
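A worked example of the new page-SGE estimate, using made-up numbers (4KB
pages, a 9000-byte page list starting 1000 bytes into its first page). The
helper below is a stand-alone user-space sketch of the counting loop in
rpcrdma_args_inline(), not kernel code:
	#include <stdio.h>

	#define EXAMPLE_PAGE_SIZE 4096u

	static unsigned int count_page_sges(unsigned int page_len, unsigned int offset)
	{
		unsigned int remaining = page_len, count = 0;

		while (remaining) {
			unsigned int chunk = EXAMPLE_PAGE_SIZE - offset;

			if (chunk > remaining)
				chunk = remaining;
			remaining -= chunk;
			offset = 0;
			count++;
		}
		return count;
	}

	int main(void)
	{
		/* 9000 bytes starting at offset 1000 span 3 pages, so 3
		 * send SGEs are needed; a device advertising fewer forces
		 * the marshaling code to fall back to a Read chunk.
		 */
		printf("%u\n", count_page_sges(9000, 1000));
		return 0;
	}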
Allow RPC-over-RDMA to send NULL pings even when the transport has
hit its credit limit. One RPC-over-RDMA credit is reserved for
operations like keepalive.
For transports that convey NFSv4, it seems like lease renewal would
also be a candidate for using a priority transport slot. I'd like to
see a mechanism better than RPCRDMA_PRIORITY that can ensure only
one priority operation is in use at a time.
Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/sched.h | 2 ++
net/sunrpc/xprt.c | 4 ++++
net/sunrpc/xprtrdma/transport.c | 3 ++-
net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
4 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 13822e6..fcea158 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -127,6 +127,7 @@ struct rpc_task_setup {
#define RPC_TASK_TIMEOUT 0x1000 /* fail with ETIMEDOUT on timeout */
#define RPC_TASK_NOCONNECT 0x2000 /* return ENOTCONN if not connected */
#define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /* wait forever for a reply */
+#define RPC_TASK_NO_CONG 0x8000 /* skip congestion control */
#define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN)
@@ -137,6 +138,7 @@ struct rpc_task_setup {
#define RPC_IS_SOFT(t) ((t)->tk_flags & (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
#define RPC_IS_SOFTCONN(t) ((t)->tk_flags & RPC_TASK_SOFTCONN)
#define RPC_WAS_SENT(t) ((t)->tk_flags & RPC_TASK_SENT)
+#define RPC_SKIP_CONG(t) ((t)->tk_flags & RPC_TASK_NO_CONG)
#define RPC_TASK_RUNNING 0
#define RPC_TASK_QUEUED 1
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index b530a28..a477ee6 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -392,6 +392,10 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta
{
struct rpc_rqst *req = task->tk_rqstp;
+ if (RPC_SKIP_CONG(task)) {
+ req->rq_cong = 0;
+ return 1;
+ }
if (req->rq_cong)
return 1;
dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd = %lu\n",
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 3a5a805..073fecd 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void *calldata)
data = xprt_get(xprt);
null_task = rpc_call_null_helper(task->tk_client, xprt, NULL,
- RPC_TASK_SOFTPING | RPC_TASK_ASYNC,
+ RPC_TASK_SOFTPING | RPC_TASK_ASYNC |
+ RPC_TASK_NO_CONG,
&rpcrdma_keepalive_call_ops, data);
if (!IS_ERR(null_task))
rpc_put_task(null_task);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 81cd31a..d9b5fa7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -136,19 +136,20 @@
static void
rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
{
- struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf);
struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
+ __be32 *p = rep->rr_rdmabuf->rg_base;
u32 credits;
if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
return;
- credits = be32_to_cpu(rmsgp->rm_credit);
+ credits = be32_to_cpup(p + 2);
+ if (credits > buffer->rb_max_requests)
+ credits = buffer->rb_max_requests;
+ /* Reserve one credit for keepalive ping */
+ credits--;
if (credits == 0)
credits = 1; /* don't deadlock */
- else if (credits > buffer->rb_max_requests)
- credits = buffer->rb_max_requests;
-
atomic_set(&buffer->rb_credits, credits);
}
@@ -915,6 +916,8 @@ struct rpcrdma_rep *
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
int i, rc;
+ if (r_xprt->rx_data.max_requests < 2)
+ return -EINVAL;
buf->rb_max_requests = r_xprt->rx_data.max_requests;
buf->rb_bc_srv_max_requests = 0;
atomic_set(&buf->rb_credits, 1);
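To make the new accounting concrete, a stand-alone sketch with hypothetical
numbers; it mirrors the order of operations in
rpcrdma_update_granted_credits() but is not kernel code:
	static unsigned int example_granted_credits(unsigned int granted,
						    unsigned int max_requests)
	{
		unsigned int credits = granted;

		if (credits > max_requests)
			credits = max_requests;
		credits--;		/* reserve one for keepalive ping */
		if (credits == 0)
			credits = 1;	/* don't deadlock */
		return credits;
	}

	/* example_granted_credits(128, 32) == 31
	 * example_granted_credits(1, 32)   == 1
	 */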
On Wed, 2017-02-08 at 17:00 -0500, Chuck Lever wrote:
> The transport lock is needed to protect the xprt_adjust_cwnd() call
> in xs_udp_timer, but it is not necessary for accessing the
> rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate
> the lock into UDP's xs_udp_timer method, where it is required.
>
> The ->timer method has to take the transport lock if needed, but it
> can now sleep safely, or even call back into the RPC scheduler.
>
> This is more a clean-up than a fix, but the "issue" was introduced
> by my transport switch patches back in 2005.
>
> Fixes: 46c0ee8bc4ad ("RPC: separate xprt_timer implementations")
> Signed-off-by: Chuck Lever <[email protected]>

Thanks! Good cleanup...

Trond
> On Feb 8, 2017, at 7:05 PM, Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote:
>> Allow RPC-over-RDMA to send NULL pings even when the transport has
>> hit its credit limit. One RPC-over-RDMA credit is reserved for
>> operations like keepalive.
>>
>> For transports that convey NFSv4, it seems like lease renewal would
>> also be a candidate for using a priority transport slot. I'd like to
>> see a mechanism better than RPCRDMA_PRIORITY that can ensure only
>> one priority operation is in use at a time.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/sched.h | 2 ++
>> net/sunrpc/xprt.c | 4 ++++
>> net/sunrpc/xprtrdma/transport.c | 3 ++-
>> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
>> 4 files changed, 16 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/sched.h
>> b/include/linux/sunrpc/sched.h
>> index 13822e6..fcea158 100644
>> --- a/include/linux/sunrpc/sched.h
>> +++ b/include/linux/sunrpc/sched.h
>> @@ -127,6 +127,7 @@ struct rpc_task_setup {
>> #define RPC_TASK_TIMEOUT 0x1000 /* fail with
>> ETIMEDOUT on timeout */
>> #define RPC_TASK_NOCONNECT 0x2000 /* return
>> ENOTCONN if not connected */
>> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000 /*
>> wait forever for a reply */
>> +#define RPC_TASK_NO_CONG 0x8000 /* skip
>> congestion control */
>>
>> #define RPC_TASK_SOFTPING (RPC_TASK_SOFT | RPC_TASK_SOFTCONN)
>>
>> @@ -137,6 +138,7 @@ struct rpc_task_setup {
>> #define RPC_IS_SOFT(t) ((t)->tk_flags &
>> (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
>> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags &
>> RPC_TASK_SOFTCONN)
>> #define RPC_WAS_SENT(t) ((t)->tk_flags &
>> RPC_TASK_SENT)
>> +#define RPC_SKIP_CONG(t) ((t)->tk_flags & RPC_TASK_NO_CONG)
>>
>> #define RPC_TASK_RUNNING 0
>> #define RPC_TASK_QUEUED 1
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index b530a28..a477ee6 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct
>> rpc_xprt *xprt, struct rpc_task *ta
>> {
>> struct rpc_rqst *req = task->tk_rqstp;
>>
>> + if (RPC_SKIP_CONG(task)) {
>> + req->rq_cong = 0;
>> + return 1;
>> + }
>
> Why not just have the RDMA layer call xprt_reserve_xprt() (and
> xprt_release_xprt()) if this flag is set? It seems to me that you will
> need some kind of extra congestion control in the RDMA layer anyway
> since you only have one reserved credit for these privileged tasks (or
> did I miss where that is being gated?).
Thanks for the review.
See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not
terribly happy with.
So, I think you are suggesting replacing xprtrdma's
->reserve_xprt with something like:
int xprt_rdma_reserve_xprt(struct rpc_xprt *xprt, struct rpc_task *task)
{
	if (RPC_SKIP_CONG(task))
		return xprt_reserve_xprt(xprt, task);
	return xprt_reserve_xprt_cong(xprt, task);
}
and likewise for ->release_xprt ?
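Roughly, and untested, something like this for the release side:
	void xprt_rdma_release_xprt(struct rpc_xprt *xprt, struct rpc_task *task)
	{
		if (RPC_SKIP_CONG(task))
			xprt_release_xprt(xprt, task);
		else
			xprt_release_xprt_cong(xprt, task);
	}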
What I'd really like to do is have the RPC layer
prevent more than one RPC at a time from using the
extra credit, and somehow ensure that those RPCs
are going to be short-lived (SOFT | SOFTCONN,
maybe).
>> if (req->rq_cong)
>> return 1;
>> dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd =
>> %lu\n",
>> diff --git a/net/sunrpc/xprtrdma/transport.c
>> b/net/sunrpc/xprtrdma/transport.c
>> index 3a5a805..073fecd 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void
>> *calldata)
>>
>> data = xprt_get(xprt);
>> null_task = rpc_call_null_helper(task->tk_client, xprt,
>> NULL,
>> - RPC_TASK_SOFTPING |
>> RPC_TASK_ASYNC,
>> + RPC_TASK_SOFTPING |
>> RPC_TASK_ASYNC |
>> + RPC_TASK_NO_CONG,
>> &rpcrdma_keepalive_call_ops
>> , data);
>> if (!IS_ERR(null_task))
>> rpc_put_task(null_task);
>> diff --git a/net/sunrpc/xprtrdma/verbs.c
>> b/net/sunrpc/xprtrdma/verbs.c
>> index 81cd31a..d9b5fa7 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -136,19 +136,20 @@
>> static void
>> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
>> {
>> - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep->rr_rdmabuf);
>> struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
>> + __be32 *p = rep->rr_rdmabuf->rg_base;
>> u32 credits;
>>
>> if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
>> return;
>>
>> - credits = be32_to_cpu(rmsgp->rm_credit);
>> + credits = be32_to_cpup(p + 2);
>> + if (credits > buffer->rb_max_requests)
>> + credits = buffer->rb_max_requests;
>> + /* Reserve one credit for keepalive ping */
>> + credits--;
>> if (credits == 0)
>> credits = 1; /* don't deadlock */
>> - else if (credits > buffer->rb_max_requests)
>> - credits = buffer->rb_max_requests;
>> -
>> atomic_set(&buffer->rb_credits, credits);
>> }
>>
>> @@ -915,6 +916,8 @@ struct rpcrdma_rep *
>> struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> int i, rc;
>>
>> + if (r_xprt->rx_data.max_requests < 2)
>> + return -EINVAL;
>> buf->rb_max_requests = r_xprt->rx_data.max_requests;
>> buf->rb_bc_srv_max_requests = 0;
>> atomic_set(&buf->rb_credits, 1);
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs"
>> in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
> --
> Trond Myklebust
> Principal System Architect
> 4300 El Camino Real | Suite 100
> Los Altos, CA 94022
> W: 650-422-3800
> C: 801-921-4583
> http://www.primarydata.com
--
Chuck Lever
When pad optimization is disabled, rpcrdma_convert_iovs still
does not add explicit XDR round-up padding to a Read chunk.
Commit 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
incorrectly short-circuited the later test in rpcrdma_convert_iovs
for whether that round-up padding is needed.
However, if this is indeed a regular Read chunk (and not a
Position-Zero Read chunk), the tail iovec _always_ contains the
chunk's padding, and never anything else.
So it's easy to simply skip the tail when pad optimization is
enabled, and to add the tail in a subsequent Read chunk segment
when it is disabled.
Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c52e0f2..a524d3c 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -226,8 +226,10 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
if (len && n == RPCRDMA_MAX_SEGS)
goto out_overflow;
- /* When encoding the read list, the tail is always sent inline */
- if (type == rpcrdma_readch)
+ /* When encoding a Read chunk, the tail iovec contains an
+ * XDR pad and may be omitted.
+ */
+ if (type == rpcrdma_readch && xprt_rdma_pad_optimize)
return n;
/* When encoding the Write list, some servers need to see an extra
@@ -238,10 +240,6 @@ static bool rpcrdma_results_inline(struct rpcrdma_xprt *r_xprt,
return n;
if (xdrbuf->tail[0].iov_len) {
- /* the rpcrdma protocol allows us to omit any trailing
- * xdr pad bytes, saving the server an RDMA operation. */
- if (xdrbuf->tail[0].iov_len < 4 && xprt_rdma_pad_optimize)
- return n;
n = rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, n);
if (n == RPCRDMA_MAX_SEGS)
goto out_overflow;
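As background, the XDR round-up pad discussed above is just the
zero bytes that bring a chunk's length to the next 4-byte XDR
boundary. A minimal standalone illustration (not the kernel's
code):

/* Illustration only: number of round-up pad bytes for a chunk of
 * "len" bytes. XDR aligns data to 4-byte (quad) boundaries, so the
 * pad is always 0, 1, 2, or 3 zero bytes.
 */
static inline unsigned int xdr_roundup_pad(unsigned int len)
{
	return (4 - (len & 3)) & 3;
}

For example, a 5-byte chunk needs 3 pad bytes, while an 8-byte
chunk needs none.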
> On Feb 8, 2017, at 7:48 PM, Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2017-02-08 at 19:19 -0500, Chuck Lever wrote:
>>> On Feb 8, 2017, at 7:05 PM, Trond Myklebust <[email protected]
>>> m> wrote:
>>>
>>> On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote:
>>>> Allow RPC-over-RDMA to send NULL pings even when the transport
>>>> has
>>>> hit its credit limit. One RPC-over-RDMA credit is reserved for
>>>> operations like keepalive.
>>>>
>>>> For transports that convey NFSv4, it seems like lease renewal
>>>> would
>>>> also be a candidate for using a priority transport slot. I'd like
>>>> to
>>>> see a mechanism better than RPCRDMA_PRIORITY that can ensure only
>>>> one priority operation is in use at a time.
>>>>
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> ---
>>>> include/linux/sunrpc/sched.h | 2 ++
>>>> net/sunrpc/xprt.c | 4 ++++
>>>> net/sunrpc/xprtrdma/transport.c | 3 ++-
>>>> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
>>>> 4 files changed, 16 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/sunrpc/sched.h
>>>> b/include/linux/sunrpc/sched.h
>>>> index 13822e6..fcea158 100644
>>>> --- a/include/linux/sunrpc/sched.h
>>>> +++ b/include/linux/sunrpc/sched.h
>>>> @@ -127,6 +127,7 @@ struct rpc_task_setup {
>>>> #define RPC_TASK_TIMEOUT 0x1000 /* fail
>>>> with
>>>> ETIMEDOUT on timeout */
>>>> #define RPC_TASK_NOCONNECT 0x2000 /*
>>>> return
>>>> ENOTCONN if not connected */
>>>> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000
>>>> /*
>>>> wait forever for a reply */
>>>> +#define RPC_TASK_NO_CONG 0x8000 /* skip
>>>> congestion control */
>>>>
>>>> #define RPC_TASK_SOFTPING (RPC_TASK_SOFT |
>>>> RPC_TASK_SOFTCONN)
>>>>
>>>> @@ -137,6 +138,7 @@ struct rpc_task_setup {
>>>> #define RPC_IS_SOFT(t) ((t)->tk_flags &
>>>> (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
>>>> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags &
>>>> RPC_TASK_SOFTCONN)
>>>> #define RPC_WAS_SENT(t) ((t)->tk_flags &
>>>> RPC_TASK_SENT)
>>>> +#define RPC_SKIP_CONG(t) ((t)->tk_flags &
>>>> RPC_TASK_NO_CONG)
>>>>
>>>> #define RPC_TASK_RUNNING 0
>>>> #define RPC_TASK_QUEUED 1
>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>> index b530a28..a477ee6 100644
>>>> --- a/net/sunrpc/xprt.c
>>>> +++ b/net/sunrpc/xprt.c
>>>> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct
>>>> rpc_xprt *xprt, struct rpc_task *ta
>>>> {
>>>> struct rpc_rqst *req = task->tk_rqstp;
>>>>
>>>> + if (RPC_SKIP_CONG(task)) {
>>>> + req->rq_cong = 0;
>>>> + return 1;
>>>> + }
>>>
>>> Why not just have the RDMA layer call xprt_reserve_xprt() (and
>>> xprt_release_xprt()) if this flag is set? It seems to me that you
>>> will
>>> need some kind of extra congestion control in the RDMA layer anyway
>>> since you only have one reserved credit for these privileged tasks
>>> (or
>>> did I miss where that is being gated?).
>>
>> Thanks for the review.
>>
>> See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not
>> terribly happy with.
>>
>> So, I think you are suggesting replacing xprtrdma's
>> ->reserve_xprt with something like:
>>
>> int xprt_rdma_reserve_xprt(xprt, task)
>> {
>> if (RPC_SKIP_CONG(task))
>> return xprt_reserve_xprt(xprt, task);
>> return xprt_reserve_xprt_cong(xprt, task);
>> }
>>
>> and likewise for ->release_xprt ?
>
> Right.
>
>> What I'd really like to do is have the RPC layer
>> prevent more than one RPC at a time from using the
>> extra credit, and somehow ensure that those RPCs
>> are going to be short-lived (SOFT | SOFTCONN,
>> maybe).
>
> Credits are a transport layer thing, though. There is no equivalent in
> the non-RDMA world. TCP and UDP should normally both be fine with
> transmitting an extra RPC call.
xprtrdma maps credits to the xprt->cwnd, which UDP also uses.
Agree though, there probably isn't a need for temporarily
superseding the UDP connection window.
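For context, the credits-to-cwnd mapping in xprtrdma's reply
handler looks roughly like this (a condensed sketch of the
4.10-era code, with the surrounding context elided):

	spin_lock_bh(&xprt->transport_lock);
	xprt->cwnd = atomic_read(&r_xprt->rx_buf.rb_credits)
			<< RPC_CWNDSHIFT;
	spin_unlock_bh(&xprt->transport_lock);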
> Even timeouts are a transport layer issue; see the patches I put out
> this morning in order to reduce the TCP connection timeouts and put
> them more in line with the lease period. Something like that makes no
> sense in the UDP world (no connections), or even in AF_LOCAL (no
> routing), which is why I added the set_connection_timeout() callback.
I browsed those a couple times, wondering if connection-oriented
RPC-over-RDMA also needs a set_connection_timeout method. Still
studying.
>>>> if (req->rq_cong)
>>>> return 1;
>>>> dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd =
>>>> %lu\n",
>>>> diff --git a/net/sunrpc/xprtrdma/transport.c
>>>> b/net/sunrpc/xprtrdma/transport.c
>>>> index 3a5a805..073fecd 100644
>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>> @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void
>>>> *calldata)
>>>>
>>>> data = xprt_get(xprt);
>>>> null_task = rpc_call_null_helper(task->tk_client, xprt,
>>>> NULL,
>>>> - RPC_TASK_SOFTPING |
>>>> RPC_TASK_ASYNC,
>>>> + RPC_TASK_SOFTPING |
>>>> RPC_TASK_ASYNC |
>>>> + RPC_TASK_NO_CONG,
>>>> &rpcrdma_keepalive_call
>>>> _ops
>>>> , data);
>>>> if (!IS_ERR(null_task))
>>>> rpc_put_task(null_task);
>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c
>>>> b/net/sunrpc/xprtrdma/verbs.c
>>>> index 81cd31a..d9b5fa7 100644
>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>> @@ -136,19 +136,20 @@
>>>> static void
>>>> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
>>>> {
>>>> - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep-
>>>>> rr_rdmabuf);
>>>> struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
>>>> + __be32 *p = rep->rr_rdmabuf->rg_base;
>>>> u32 credits;
>>>>
>>>> if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
>>>> return;
>>>>
>>>> - credits = be32_to_cpu(rmsgp->rm_credit);
>>>> + credits = be32_to_cpup(p + 2);
>>>> + if (credits > buffer->rb_max_requests)
>>>> + credits = buffer->rb_max_requests;
>>>> + /* Reserve one credit for keepalive ping */
>>>> + credits--;
>>>> if (credits == 0)
>>>> credits = 1; /* don't deadlock */
>>>> - else if (credits > buffer->rb_max_requests)
>>>> - credits = buffer->rb_max_requests;
>>>> -
>>>> atomic_set(&buffer->rb_credits, credits);
>>>> }
>>>>
>>>> @@ -915,6 +916,8 @@ struct rpcrdma_rep *
>>>> struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>>> int i, rc;
>>>>
>>>> + if (r_xprt->rx_data.max_requests < 2)
>>>> + return -EINVAL;
>>>> buf->rb_max_requests = r_xprt->rx_data.max_requests;
>>>> buf->rb_bc_srv_max_requests = 0;
>>>> atomic_set(&buf->rb_credits, 1);
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-
>>>> nfs"
>>>> in
>>>> the body of a message to [email protected]
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.htm
>>>> l
>>>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>>
>>>
>>> Trond Myklebust
>>> Principal System Architect
>>> 4300 El Camino Real | Suite 100
>>> Los Altos, CA 94022
>>> W: 650-422-3800
>>> C: 801-921-4583
>>> http://www.primarydata.com
>>
>> --
>> Chuck Lever
>>
>>
>>
> --
>
>
>
>
>
>
>
> Trond Myklebust
> Principal System Architect
> 4300 El Camino Real | Suite 100
> Los Altos, CA 94022
> W: 650-422-3800
> C: 801-921-4583
> http://www.primarydata.com
>
>
>
>
--
Chuck Lever
> On Feb 9, 2017, at 10:37 AM, Chuck Lever <[email protected]> wrote:
>
>>
>> On Feb 8, 2017, at 7:48 PM, Trond Myklebust <[email protected]> wrote:
>>
>> On Wed, 2017-02-08 at 19:19 -0500, Chuck Lever wrote:
>>>> On Feb 8, 2017, at 7:05 PM, Trond Myklebust <[email protected]
>>>> m> wrote:
>>>>
>>>> On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote:
>>>>> Allow RPC-over-RDMA to send NULL pings even when the transport
>>>>> has
>>>>> hit its credit limit. One RPC-over-RDMA credit is reserved for
>>>>> operations like keepalive.
>>>>>
>>>>> For transports that convey NFSv4, it seems like lease renewal
>>>>> would
>>>>> also be a candidate for using a priority transport slot. I'd like
>>>>> to
>>>>> see a mechanism better than RPCRDMA_PRIORITY that can ensure only
>>>>> one priority operation is in use at a time.
>>>>>
>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>> ---
>>>>> include/linux/sunrpc/sched.h | 2 ++
>>>>> net/sunrpc/xprt.c | 4 ++++
>>>>> net/sunrpc/xprtrdma/transport.c | 3 ++-
>>>>> net/sunrpc/xprtrdma/verbs.c | 13 ++++++++-----
>>>>> 4 files changed, 16 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/sunrpc/sched.h
>>>>> b/include/linux/sunrpc/sched.h
>>>>> index 13822e6..fcea158 100644
>>>>> --- a/include/linux/sunrpc/sched.h
>>>>> +++ b/include/linux/sunrpc/sched.h
>>>>> @@ -127,6 +127,7 @@ struct rpc_task_setup {
>>>>> #define RPC_TASK_TIMEOUT 0x1000 /* fail
>>>>> with
>>>>> ETIMEDOUT on timeout */
>>>>> #define RPC_TASK_NOCONNECT 0x2000 /*
>>>>> return
>>>>> ENOTCONN if not connected */
>>>>> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000
>>>>> /*
>>>>> wait forever for a reply */
>>>>> +#define RPC_TASK_NO_CONG 0x8000 /* skip
>>>>> congestion control */
>>>>>
>>>>> #define RPC_TASK_SOFTPING (RPC_TASK_SOFT |
>>>>> RPC_TASK_SOFTCONN)
>>>>>
>>>>> @@ -137,6 +138,7 @@ struct rpc_task_setup {
>>>>> #define RPC_IS_SOFT(t) ((t)->tk_flags &
>>>>> (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
>>>>> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags &
>>>>> RPC_TASK_SOFTCONN)
>>>>> #define RPC_WAS_SENT(t) ((t)->tk_flags &
>>>>> RPC_TASK_SENT)
>>>>> +#define RPC_SKIP_CONG(t) ((t)->tk_flags &
>>>>> RPC_TASK_NO_CONG)
>>>>>
>>>>> #define RPC_TASK_RUNNING 0
>>>>> #define RPC_TASK_QUEUED 1
>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>> index b530a28..a477ee6 100644
>>>>> --- a/net/sunrpc/xprt.c
>>>>> +++ b/net/sunrpc/xprt.c
>>>>> @@ -392,6 +392,10 @@ static inline void xprt_release_write(struct
>>>>> rpc_xprt *xprt, struct rpc_task *ta
>>>>> {
>>>>> struct rpc_rqst *req = task->tk_rqstp;
>>>>>
>>>>> + if (RPC_SKIP_CONG(task)) {
>>>>> + req->rq_cong = 0;
>>>>> + return 1;
>>>>> + }
>>>>
>>>> Why not just have the RDMA layer call xprt_reserve_xprt() (and
>>>> xprt_release_xprt()) if this flag is set? It seems to me that you
>>>> will
>>>> need some kind of extra congestion control in the RDMA layer anyway
>>>> since you only have one reserved credit for these privileged tasks
>>>> (or
>>>> did I miss where that is being gated?).
>>>
>>> Thanks for the review.
>>>
>>> See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not
>>> terribly happy with.
>>>
>>> So, I think you are suggesting replacing xprtrdma's
>>> ->reserve_xprt with something like:
>>>
>>> int xprt_rdma_reserve_xprt(xprt, task)
>>> {
>>> if (RPC_SKIP_CONG(task))
>>> return xprt_reserve_xprt(xprt, task);
>>> return xprt_reserve_xprt_cong(xprt, task);
>>> }
>>>
>>> and likewise for ->release_xprt ?
>>
>> Right.
This seems to work fine for the normal cases.
I'm confused about how to construct xprt_rdma_release_xprt()
so it never releases a normal RPC task when a SKIP_CONG
task completes and the credit limit is still full.
If it should send a normal task using the reserved credit
and that task hangs too, we're in exactly the position
we wanted to avoid.
My original solution might have had a similar problem,
come to think of it.
>>> What I'd really like to do is have the RPC layer
>>> prevent more than one RPC at a time from using the
>>> extra credit, and somehow ensure that those RPCs
>>> are going to be short-lived (SOFT | SOFTCONN,
>>> maybe).
>>
>> Credits are a transport layer thing, though. There is no equivalent in
>> the non-RDMA world. TCP and UDP should normally both be fine with
>> transmitting an extra RPC call.
>
> xprtrdma maps credits to the xprt->cwnd, which UDP also uses.
> Agree though, there probably isn't a need for temporarily
> superseding the UDP connection window.
>
>
>> Even timeouts are a transport layer issue; see the patches I put out
>> this morning in order to reduce the TCP connection timeouts and put
>> them more in line with the lease period. Something like that makes no
>> sense in the UDP world (no connections), or even in AF_LOCAL (no
>> routing), which is why I added the set_connection_timeout() callback.
>
> I browsed those a couple times, wondering if connection-oriented
> RPC-over-RDMA also needs a set_connection_timeout method. Still
> studying.
>
>
>>>>> if (req->rq_cong)
>>>>> return 1;
>>>>> dprintk("RPC: %5u xprt_cwnd_limited cong = %lu cwnd =
>>>>> %lu\n",
>>>>> diff --git a/net/sunrpc/xprtrdma/transport.c
>>>>> b/net/sunrpc/xprtrdma/transport.c
>>>>> index 3a5a805..073fecd 100644
>>>>> --- a/net/sunrpc/xprtrdma/transport.c
>>>>> +++ b/net/sunrpc/xprtrdma/transport.c
>>>>> @@ -546,7 +546,8 @@ static void rpcrdma_keepalive_release(void
>>>>> *calldata)
>>>>>
>>>>> data = xprt_get(xprt);
>>>>> null_task = rpc_call_null_helper(task->tk_client, xprt,
>>>>> NULL,
>>>>> - RPC_TASK_SOFTPING |
>>>>> RPC_TASK_ASYNC,
>>>>> + RPC_TASK_SOFTPING |
>>>>> RPC_TASK_ASYNC |
>>>>> + RPC_TASK_NO_CONG,
>>>>> &rpcrdma_keepalive_call
>>>>> _ops
>>>>> , data);
>>>>> if (!IS_ERR(null_task))
>>>>> rpc_put_task(null_task);
>>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c
>>>>> b/net/sunrpc/xprtrdma/verbs.c
>>>>> index 81cd31a..d9b5fa7 100644
>>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>>> @@ -136,19 +136,20 @@
>>>>> static void
>>>>> rpcrdma_update_granted_credits(struct rpcrdma_rep *rep)
>>>>> {
>>>>> - struct rpcrdma_msg *rmsgp = rdmab_to_msg(rep-
>>>>>> rr_rdmabuf);
>>>>> struct rpcrdma_buffer *buffer = &rep->rr_rxprt->rx_buf;
>>>>> + __be32 *p = rep->rr_rdmabuf->rg_base;
>>>>> u32 credits;
>>>>>
>>>>> if (rep->rr_len < RPCRDMA_HDRLEN_ERR)
>>>>> return;
>>>>>
>>>>> - credits = be32_to_cpu(rmsgp->rm_credit);
>>>>> + credits = be32_to_cpup(p + 2);
>>>>> + if (credits > buffer->rb_max_requests)
>>>>> + credits = buffer->rb_max_requests;
>>>>> + /* Reserve one credit for keepalive ping */
>>>>> + credits--;
>>>>> if (credits == 0)
>>>>> credits = 1; /* don't deadlock */
>>>>> - else if (credits > buffer->rb_max_requests)
>>>>> - credits = buffer->rb_max_requests;
>>>>> -
>>>>> atomic_set(&buffer->rb_credits, credits);
>>>>> }
>>>>>
>>>>> @@ -915,6 +916,8 @@ struct rpcrdma_rep *
>>>>> struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>>>> int i, rc;
>>>>>
>>>>> + if (r_xprt->rx_data.max_requests < 2)
>>>>> + return -EINVAL;
>>>>> buf->rb_max_requests = r_xprt->rx_data.max_requests;
>>>>> buf->rb_bc_srv_max_requests = 0;
>>>>> atomic_set(&buf->rb_credits, 1);
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-
>>>>> nfs"
>>>>> in
>>>>> the body of a message to [email protected]
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.htm
>>>>> l
>>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Trond Myklebust
>>>> Principal System Architect
>>>> 4300 El Camino Real | Suite 100
>>>> Los Altos, CA 94022
>>>> W: 650-422-3800
>>>> C: 801-921-4583
>>>> http://www.primarydata.com
>>>
>>> --
>>> Chuck Lever
>>>
>>>
>>>
>> --
>>
>>
>>
>>
>>
>>
>>
>> Trond Myklebust
>> Principal System Architect
>> 4300 El Camino Real | Suite 100
>> Los Altos, CA 94022
>> W: 650-422-3800
>> C: 801-921-4583
>> http://www.primarydata.com
>>
>>
>>
>>
>
> --
> Chuck Lever
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Chuck Lever
> On Feb 9, 2017, at 3:13 PM, Trond Myklebust <[email protected]> wrote:
>
> On Thu, 2017-02-09 at 14:42 -0500, Chuck Lever wrote:
>>> On Feb 9, 2017, at 10:37 AM, Chuck Lever <[email protected]>
>>> wrote:
>>>
>>>>
>>>> On Feb 8, 2017, at 7:48 PM, Trond Myklebust <trondmy@primarydata.
>>>> com> wrote:
>>>>
>>>> On Wed, 2017-02-08 at 19:19 -0500, Chuck Lever wrote:
>>>>>> On Feb 8, 2017, at 7:05 PM, Trond Myklebust <trondmy@primaryd
>>>>>> ata.co
>>>>>> m> wrote:
>>>>>>
>>>>>> On Wed, 2017-02-08 at 17:01 -0500, Chuck Lever wrote:
>>>>>>> Allow RPC-over-RDMA to send NULL pings even when the
>>>>>>> transport
>>>>>>> has
>>>>>>> hit its credit limit. One RPC-over-RDMA credit is reserved
>>>>>>> for
>>>>>>> operations like keepalive.
>>>>>>>
>>>>>>> For transports that convey NFSv4, it seems like lease
>>>>>>> renewal
>>>>>>> would
>>>>>>> also be a candidate for using a priority transport slot.
>>>>>>> I'd like
>>>>>>> to
>>>>>>> see a mechanism better than RPCRDMA_PRIORITY that can
>>>>>>> ensure only
>>>>>>> one priority operation is in use at a time.
>>>>>>>
>>>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>>>> ---
>>>>>>> include/linux/sunrpc/sched.h    |    2 ++
>>>>>>> net/sunrpc/xprt.c               |    4 ++++
>>>>>>> net/sunrpc/xprtrdma/transport.c |    3 ++-
>>>>>>> net/sunrpc/xprtrdma/verbs.c     |   13 ++++++++-----
>>>>>>> 4 files changed, 16 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/sunrpc/sched.h
>>>>>>> b/include/linux/sunrpc/sched.h
>>>>>>> index 13822e6..fcea158 100644
>>>>>>> --- a/include/linux/sunrpc/sched.h
>>>>>>> +++ b/include/linux/sunrpc/sched.h
>>>>>>> @@ -127,6 +127,7 @@ struct rpc_task_setup {
>>>>>>> #define RPC_TASK_TIMEOUT 0x1000 /*
>>>>>>> fail
>>>>>>> with
>>>>>>> ETIMEDOUT on timeout */
>>>>>>> #define RPC_TASK_NOCONNECT 0x2000 /*
>>>>>>> return
>>>>>>> ENOTCONN if not connected */
>>>>>>> #define RPC_TASK_NO_RETRANS_TIMEOUT 0x4000
>>>>>>> /*
>>>>>>> wait forever for a reply */
>>>>>>> +#define RPC_TASK_NO_CONG 0x8000 /*
>>>>>>> skip
>>>>>>> congestion control */
>>>>>>>
>>>>>>> #define RPC_TASK_SOFTPING (RPC_TASK_SOFT |
>>>>>>> RPC_TASK_SOFTCONN)
>>>>>>>
>>>>>>> @@ -137,6 +138,7 @@ struct rpc_task_setup {
>>>>>>> #define RPC_IS_SOFT(t) ((t)->tk_flags &
>>>>>>> (RPC_TASK_SOFT|RPC_TASK_TIMEOUT))
>>>>>>> #define RPC_IS_SOFTCONN(t) ((t)->tk_flags &
>>>>>>> RPC_TASK_SOFTCONN)
>>>>>>> #define RPC_WAS_SENT(t) ((t)->tk_flags &
>>>>>>> RPC_TASK_SENT)
>>>>>>> +#define RPC_SKIP_CONG(t) ((t)->tk_flags &
>>>>>>> RPC_TASK_NO_CONG)
>>>>>>>
>>>>>>> #define RPC_TASK_RUNNING 0
>>>>>>> #define RPC_TASK_QUEUED 1
>>>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>>>> index b530a28..a477ee6 100644
>>>>>>> --- a/net/sunrpc/xprt.c
>>>>>>> +++ b/net/sunrpc/xprt.c
>>>>>>> @@ -392,6 +392,10 @@ static inline void
>>>>>>> xprt_release_write(struct
>>>>>>> rpc_xprt *xprt, struct rpc_task *ta
>>>>>>> {
>>>>>>> struct rpc_rqst *req = task->tk_rqstp;
>>>>>>>
>>>>>>> + if (RPC_SKIP_CONG(task)) {
>>>>>>> + req->rq_cong = 0;
>>>>>>> + return 1;
>>>>>>> + }
>>>>>>
>>>>>> Why not just have the RDMA layer call xprt_reserve_xprt()
>>>>>> (and
>>>>>> xprt_release_xprt()) if this flag is set? It seems to me that
>>>>>> you
>>>>>> will
>>>>>> need some kind of extra congestion control in the RDMA layer
>>>>>> anyway
>>>>>> since you only have one reserved credit for these privileged
>>>>>> tasks
>>>>>> (or
>>>>>> did I miss where that is being gated?).
>>>>>
>>>>> Thanks for the review.
>>>>>
>>>>> See RPCRDMA_IA_RSVD_CREDIT in 11/12. It's a hack I'm not
>>>>> terribly happy with.
>>>>>
>>>>> So, I think you are suggesting replacing xprtrdma's
>>>>> ->reserve_xprt with something like:
>>>>>
>>>>> int xprt_rdma_reserve_xprt(xprt, task)
>>>>> {
>>>>>      if (RPC_SKIP_CONG(task))
>>>>>           return xprt_reserve_xprt(xprt, task);
>>>>>      return xprt_reserve_xprt_cong(xprt, task);
>>>>> }
>>>>>
>>>>> and likewise for ->release_xprt ?
>>>>
>>>> Right.
>>
>> This seems to work fine for the normal cases.
>>
>> I'm confused about how to construct xprt_rdma_release_xprt()
>> so it never releases a normal RPC task when a SKIP_CONG
>> task completes and the credit limit is still full.
>>
>> If it should send a normal task using the reserved credit
>> and that task hangs too, we're in exactly the position
>> we wanted to avoid.
>>
>> My original solution might have had a similar problem,
>> come to think of it.
>>
>>
>
> That's true... You may need to set up a separate waitqueue that is
> reserved for SKIP_CONG tasks. Again, it makes sense to keep that in the
> RDMA code.
Understood.
At this late date, probably the best thing to do is drop the
keepalive patches for the v4.11 merge window. That would be
10/12, 11/12, and 12/12.
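For reference, a rough sketch of that separate-waitqueue idea
(every name below is invented for illustration; nothing like this
is in the posted series):

/* Sketch only: one reserved credit guarded by a private flag bit
 * and a wait queue kept inside xprtrdma, so NO_CONG tasks queue
 * among themselves and never hand the credit to a normal task.
 * rx_rsvd_queue, rx_flags and RPCRDMA_RSVD_IN_USE are hypothetical.
 */
static int rpcrdma_get_rsvd_credit(struct rpcrdma_xprt *r_xprt,
				   struct rpc_task *task)
{
	if (test_and_set_bit(RPCRDMA_RSVD_IN_USE, &r_xprt->rx_flags)) {
		rpc_sleep_on(&r_xprt->rx_rsvd_queue, task, NULL);
		return 0;
	}
	return 1;
}

static void rpcrdma_put_rsvd_credit(struct rpcrdma_xprt *r_xprt)
{
	clear_bit(RPCRDMA_RSVD_IN_USE, &r_xprt->rx_flags);
	rpc_wake_up_next(&r_xprt->rx_rsvd_queue);
}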
--
Chuck Lever