2021-10-04 17:14:14

by Chuck Lever III

Subject: [PATCH 0/5] Resend: tracepoint patches for NFSD

Hi Bruce-

I posted these on 9/22 and have heard no objections. Can you include
them in your v5.16 for-next branch?

---

Chuck Lever (5):
svcrdma: Split the svcrdma_wc_receive() tracepoint
svcrdma: Split the svcrdma_wc_send() tracepoint
svcrdma: Split svcrdma_wc_{read,write} tracepoints
SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
SUNRPC: Capture value of xdr_buf::page_base


 include/trace/events/rpcrdma.h          | 185 +++++++++++++++++++++++-
 include/trace/events/sunrpc.h           |  38 ++++-
 net/sunrpc/svc_xprt.c                   |   1 +
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   9 +-
 net/sunrpc/xprtrdma/svc_rdma_rw.c       |  30 +++-
 net/sunrpc/xprtrdma/svc_rdma_sendto.c   |  14 +-
 6 files changed, 258 insertions(+), 19 deletions(-)

--
Chuck Lever


2021-10-04 17:14:25

by Chuck Lever III

Subject: [PATCH 4/5] SUNRPC: Add trace event when alloc_pages_bulk() makes no progress

This is an operational low-memory situation that needs to be flagged.
The new tracepoint records the number of pages the thread tried to
allocate; the trace infrastructure itself supplies the timestamp and
identifies the nfsd thread that failed to allocate pages.

Signed-off-by: Chuck Lever <[email protected]>
---
 include/trace/events/sunrpc.h | 18 ++++++++++++++++++
 net/sunrpc/svc_xprt.c         |  1 +
 2 files changed, 19 insertions(+)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 2d04eb96d418..fb016308c185 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1859,6 +1859,24 @@ TRACE_EVENT(svc_wake_up,
 	TP_printk("pid=%d", __entry->pid)
 );
 
+TRACE_EVENT(svc_alloc_arg_err,
+	TP_PROTO(
+		unsigned int pages
+	),
+
+	TP_ARGS(pages),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, pages)
+	),
+
+	TP_fast_assign(
+		__entry->pages = pages;
+	),
+
+	TP_printk("pages=%u", __entry->pages)
+);
+
 TRACE_EVENT(svc_handle_xprt,
 	TP_PROTO(struct svc_xprt *xprt, int len),

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 6316bd2b8f37..1e99ba1b9d72 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -687,6 +687,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 			set_current_state(TASK_RUNNING);
 			return -EINTR;
 		}
+		trace_svc_alloc_arg_err(pages);
 		schedule_timeout(msecs_to_jiffies(500));
 	}
 	rqstp->rq_page_end = &rqstp->rq_pages[pages];


2021-10-04 17:14:25

by Chuck Lever III

Subject: [PATCH 2/5] svcrdma: Split the svcrdma_wc_send() tracepoint

There are currently three separate purposes being served by a single
tracepoint here. Split it up, one event per purpose (the event-class
sketch after this list shows the mechanism):

svcrdma_wc_send:
- status is always zero, so there's no value in recording it.
- vendor_err is meaningless unless status is not zero, so
there's no value in recording it.
- This tracepoint is needed only when developing modifications,
so it should be left disabled most of the time.

svcrdma_wc_send_flush:
- As above, needed only rarely, and not an error.

svcrdma_wc_send_err:
- This tracepoint can be left persistently enabled because
completion errors are run-time problems (except for IB_WC_WR_FLUSH_ERR,
which is reported by the _flush event above).
- Tracepoint name now ends in _err to reflect its purpose.
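
The new events are stamped out with the usual event-class pattern: one
DECLARE_EVENT_CLASS fixes the record layout and format string, and each
DEFINE_EVENT creates a named tracepoint that shares them. A minimal
sketch of the mechanism, using illustrative names rather than the
definitions this patch adds:

        /* One class describes the record layout and print format. */
        DECLARE_EVENT_CLASS(example_wc_class,
                TP_PROTO(const struct ib_wc *wc),
                TP_ARGS(wc),
                TP_STRUCT__entry(
                        __field(unsigned long, status)
                ),
                TP_fast_assign(
                        __entry->status = wc->status;
                ),
                TP_printk("status=%lu", __entry->status)
        );

        /* Each DEFINE_EVENT is a distinct, individually switchable
         * tracepoint that reuses the class above.
         */
        DEFINE_EVENT(example_wc_class, example_wc_flush,
                TP_PROTO(const struct ib_wc *wc),
                TP_ARGS(wc));
        DEFINE_EVENT(example_wc_class, example_wc_err,
                TP_PROTO(const struct ib_wc *wc),
                TP_ARGS(wc));

Because each name can be enabled on its own via tracefs, the noisy
success-path event stays off while svcrdma_wc_send_err remains enabled
in production.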

Signed-off-by: Chuck Lever <[email protected]>
---
 include/trace/events/rpcrdma.h        | 72 +++++++++++++++++++++++++++++++++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c | 14 +++++-
 2 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index 342d6d7b5cd9..1d7c12f65f87 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -60,6 +60,74 @@ DECLARE_EVENT_CLASS(rpcrdma_completion_class,
 		), \
 		TP_ARGS(wc, cid))
 
+DECLARE_EVENT_CLASS(rpcrdma_send_completion_class,
+	TP_PROTO(
+		const struct ib_wc *wc,
+		const struct rpc_rdma_cid *cid
+	),
+
+	TP_ARGS(wc, cid),
+
+	TP_STRUCT__entry(
+		__field(u32, cq_id)
+		__field(int, completion_id)
+	),
+
+	TP_fast_assign(
+		__entry->cq_id = cid->ci_queue_id;
+		__entry->completion_id = cid->ci_completion_id;
+	),
+
+	TP_printk("cq.id=%u cid=%d",
+		__entry->cq_id, __entry->completion_id
+	)
+);
+
+#define DEFINE_SEND_COMPLETION_EVENT(name) \
+	DEFINE_EVENT(rpcrdma_send_completion_class, name, \
+		TP_PROTO( \
+			const struct ib_wc *wc, \
+			const struct rpc_rdma_cid *cid \
+		), \
+		TP_ARGS(wc, cid))
+
+DECLARE_EVENT_CLASS(rpcrdma_send_flush_class,
+	TP_PROTO(
+		const struct ib_wc *wc,
+		const struct rpc_rdma_cid *cid
+	),
+
+	TP_ARGS(wc, cid),
+
+	TP_STRUCT__entry(
+		__field(u32, cq_id)
+		__field(int, completion_id)
+		__field(unsigned long, status)
+		__field(unsigned int, vendor_err)
+	),
+
+	TP_fast_assign(
+		__entry->cq_id = cid->ci_queue_id;
+		__entry->completion_id = cid->ci_completion_id;
+		__entry->status = wc->status;
+		__entry->vendor_err = wc->vendor_err;
+	),
+
+	TP_printk("cq.id=%u cid=%d status=%s (%lu/0x%x)",
+		__entry->cq_id, __entry->completion_id,
+		rdma_show_wc_status(__entry->status),
+		__entry->status, __entry->vendor_err
+	)
+);
+
+#define DEFINE_SEND_FLUSH_EVENT(name) \
+	DEFINE_EVENT(rpcrdma_send_flush_class, name, \
+		TP_PROTO( \
+			const struct ib_wc *wc, \
+			const struct rpc_rdma_cid *cid \
+		), \
+		TP_ARGS(wc, cid))
+
 DECLARE_EVENT_CLASS(rpcrdma_mr_completion_class,
 	TP_PROTO(
 		const struct ib_wc *wc,
@@ -1939,7 +2007,9 @@ TRACE_EVENT(svcrdma_post_send,
 	)
 );
 
-DEFINE_COMPLETION_EVENT(svcrdma_wc_send);
+DEFINE_SEND_COMPLETION_EVENT(svcrdma_wc_send);
+DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_flush);
+DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_err);
 
 TRACE_EVENT(svcrdma_post_recv,
 	TP_PROTO(
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 599021b2391d..22a871e6fe4d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -280,13 +280,21 @@ static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
 	struct svc_rdma_send_ctxt *ctxt =
 		container_of(cqe, struct svc_rdma_send_ctxt, sc_cqe);
 
-	trace_svcrdma_wc_send(wc, &ctxt->sc_cid);
-
 	svc_rdma_wake_send_waiters(rdma, 1);
 	complete(&ctxt->sc_done);
 
 	if (unlikely(wc->status != IB_WC_SUCCESS))
-		svc_xprt_deferred_close(&rdma->sc_xprt);
+		goto flushed;
+
+	trace_svcrdma_wc_send(wc, &ctxt->sc_cid);
+	return;
+
+flushed:
+	if (wc->status != IB_WC_WR_FLUSH_ERR)
+		trace_svcrdma_wc_send_err(wc, &ctxt->sc_cid);
+	else
+		trace_svcrdma_wc_send_flush(wc, &ctxt->sc_cid);
+	svc_xprt_deferred_close(&rdma->sc_xprt);
 }

/**


2021-10-04 17:14:26

by Chuck Lever III

Subject: [PATCH 3/5] svcrdma: Split svcrdma_wc_{read,write} tracepoints

As with wc_send, each of these tracepoints currently serves three
separate purposes. Split them up the same way. While here, record when
each chunk is posted so that the Read completion event can also report
total bytes moved and post-to-completion latency (see the sketch
below).
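
The timing side is the usual ktime pairing: stamp the chunk context
when the chunk is posted, then compute the delta when the completion
fires. In miniature (names follow the diff below):

        /* At post time, just before ib_post_send(): */
        cc->cc_posttime = ktime_get();

        /* In the svcrdma_wc_read TP_fast_assign(): microseconds
         * elapsed between posting and completion.
         */
        __entry->read_latency = ktime_us_delta(ktime_get(), posttime);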

Signed-off-by: Chuck Lever <[email protected]>
---
 include/trace/events/rpcrdma.h    | 38 +++++++++++++++++++++++++++++++++++--
 net/sunrpc/xprtrdma/svc_rdma_rw.c | 30 +++++++++++++++++++++++++----
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index 1d7c12f65f87..b5a1388e51a4 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -2099,8 +2099,42 @@ DEFINE_POST_CHUNK_EVENT(read);
 DEFINE_POST_CHUNK_EVENT(write);
 DEFINE_POST_CHUNK_EVENT(reply);
 
-DEFINE_COMPLETION_EVENT(svcrdma_wc_read);
-DEFINE_COMPLETION_EVENT(svcrdma_wc_write);
+TRACE_EVENT(svcrdma_wc_read,
+	TP_PROTO(
+		const struct ib_wc *wc,
+		const struct rpc_rdma_cid *cid,
+		unsigned int totalbytes,
+		const ktime_t posttime
+	),
+
+	TP_ARGS(wc, cid, totalbytes, posttime),
+
+	TP_STRUCT__entry(
+		__field(u32, cq_id)
+		__field(int, completion_id)
+		__field(s64, read_latency)
+		__field(unsigned int, totalbytes)
+	),
+
+	TP_fast_assign(
+		__entry->cq_id = cid->ci_queue_id;
+		__entry->completion_id = cid->ci_completion_id;
+		__entry->totalbytes = totalbytes;
+		__entry->read_latency = ktime_us_delta(ktime_get(), posttime);
+	),
+
+	TP_printk("cq.id=%u cid=%d totalbytes=%u latency-us=%lld",
+		__entry->cq_id, __entry->completion_id,
+		__entry->totalbytes, __entry->read_latency
+	)
+);
+
+DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_read_flush);
+DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_read_err);
+
+DEFINE_SEND_COMPLETION_EVENT(svcrdma_wc_write);
+DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_write_flush);
+DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_write_err);
 
 TRACE_EVENT(svcrdma_qp_error,
 	TP_PROTO(
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index e27433f08ca7..5f0155fdefc7 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -155,6 +155,7 @@ struct svc_rdma_chunk_ctxt {
 	struct ib_cqe		cc_cqe;
 	struct svcxprt_rdma	*cc_rdma;
 	struct list_head	cc_rwctxts;
+	ktime_t			cc_posttime;
 	int			cc_sqecount;
 	enum ib_wc_status	cc_status;
 	struct completion	cc_done;
@@ -267,7 +268,16 @@ static void svc_rdma_write_done(struct ib_cq *cq, struct ib_wc *wc)
 	struct svc_rdma_write_info *info =
 		container_of(cc, struct svc_rdma_write_info, wi_cc);
 
-	trace_svcrdma_wc_write(wc, &cc->cc_cid);
+	switch (wc->status) {
+	case IB_WC_SUCCESS:
+		trace_svcrdma_wc_write(wc, &cc->cc_cid);
+		break;
+	case IB_WC_WR_FLUSH_ERR:
+		trace_svcrdma_wc_write_flush(wc, &cc->cc_cid);
+		break;
+	default:
+		trace_svcrdma_wc_write_err(wc, &cc->cc_cid);
+	}
 
 	svc_rdma_wake_send_waiters(rdma, cc->cc_sqecount);

@@ -320,11 +330,22 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc)
 	struct ib_cqe *cqe = wc->wr_cqe;
 	struct svc_rdma_chunk_ctxt *cc =
 		container_of(cqe, struct svc_rdma_chunk_ctxt, cc_cqe);
-	struct svcxprt_rdma *rdma = cc->cc_rdma;
+	struct svc_rdma_read_info *info;
 
-	trace_svcrdma_wc_read(wc, &cc->cc_cid);
+	switch (wc->status) {
+	case IB_WC_SUCCESS:
+		info = container_of(cc, struct svc_rdma_read_info, ri_cc);
+		trace_svcrdma_wc_read(wc, &cc->cc_cid, info->ri_totalbytes,
+				      cc->cc_posttime);
+		break;
+	case IB_WC_WR_FLUSH_ERR:
+		trace_svcrdma_wc_read_flush(wc, &cc->cc_cid);
+		break;
+	default:
+		trace_svcrdma_wc_read_err(wc, &cc->cc_cid);
+	}
 
-	svc_rdma_wake_send_waiters(rdma, cc->cc_sqecount);
+	svc_rdma_wake_send_waiters(cc->cc_rdma, cc->cc_sqecount);
 	cc->cc_status = wc->status;
 	complete(&cc->cc_done);
 	return;
@@ -363,6 +384,7 @@ static int svc_rdma_post_chunk_ctxt(struct svc_rdma_chunk_ctxt *cc)
 	do {
 		if (atomic_sub_return(cc->cc_sqecount,
 				      &rdma->sc_sq_avail) > 0) {
+			cc->cc_posttime = ktime_get();
 			ret = ib_post_send(rdma->sc_qp, first_wr, &bad_wr);
 			if (ret)
 				break;


2021-10-04 17:14:28

by Chuck Lever III

Subject: [PATCH 5/5] SUNRPC: Capture value of xdr_buf::page_base

This value is usually zero, but it will be non-zero more often in the
future. Knowing its value can provide important diagnostic information.
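
For reference, page_base lives in struct xdr_buf next to the fields
these events already capture. An abridged sketch of the structure
(comments added; see include/linux/sunrpc/xdr.h for the real thing):

        struct xdr_buf {
                struct kvec     head[1];        /* RPC header + non-page data */
                struct kvec     tail[1];        /* appended after page data */
                struct page     **pages;        /* payload pages */
                unsigned int    page_base;      /* offset of data in first page */
                unsigned int    page_len;       /* length of page data */
                unsigned int    buflen;         /* total buffer capacity */
                unsigned int    len;            /* length of encoded message */
        };

In the updated format string, "page=%u(%u)" therefore reads as
page_len(page_base).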

Signed-off-by: Chuck Lever <[email protected]>
---
 include/trace/events/sunrpc.h | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index fb016308c185..9ea59959a2fe 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -62,6 +62,7 @@ DECLARE_EVENT_CLASS(rpc_xdr_buf_class,
 		__field(size_t, head_len)
 		__field(const void *, tail_base)
 		__field(size_t, tail_len)
+		__field(unsigned int, page_base)
 		__field(unsigned int, page_len)
 		__field(unsigned int, msg_len)
 	),
@@ -74,14 +75,17 @@ DECLARE_EVENT_CLASS(rpc_xdr_buf_class,
 		__entry->head_len = xdr->head[0].iov_len;
 		__entry->tail_base = xdr->tail[0].iov_base;
 		__entry->tail_len = xdr->tail[0].iov_len;
+		__entry->page_base = xdr->page_base;
 		__entry->page_len = xdr->page_len;
 		__entry->msg_len = xdr->len;
 	),
 
-	TP_printk("task:%u@%u head=[%p,%zu] page=%u tail=[%p,%zu] len=%u",
+	TP_printk("task:%u@%u head=[%p,%zu] page=%u(%u) tail=[%p,%zu] len=%u",
 		__entry->task_id, __entry->client_id,
-		__entry->head_base, __entry->head_len, __entry->page_len,
-		__entry->tail_base, __entry->tail_len, __entry->msg_len
+		__entry->head_base, __entry->head_len,
+		__entry->page_len, __entry->page_base,
+		__entry->tail_base, __entry->tail_len,
+		__entry->msg_len
 	)
 );

@@ -1496,6 +1500,7 @@ DECLARE_EVENT_CLASS(svc_xdr_buf_class,
 		__field(size_t, head_len)
 		__field(const void *, tail_base)
 		__field(size_t, tail_len)
+		__field(unsigned int, page_base)
 		__field(unsigned int, page_len)
 		__field(unsigned int, msg_len)
 	),
@@ -1506,14 +1511,17 @@ DECLARE_EVENT_CLASS(svc_xdr_buf_class,
 		__entry->head_len = xdr->head[0].iov_len;
 		__entry->tail_base = xdr->tail[0].iov_base;
 		__entry->tail_len = xdr->tail[0].iov_len;
+		__entry->page_base = xdr->page_base;
 		__entry->page_len = xdr->page_len;
 		__entry->msg_len = xdr->len;
 	),
 
-	TP_printk("xid=0x%08x head=[%p,%zu] page=%u tail=[%p,%zu] len=%u",
+	TP_printk("xid=0x%08x head=[%p,%zu] page=%u(%u) tail=[%p,%zu] len=%u",
 		__entry->xid,
-		__entry->head_base, __entry->head_len, __entry->page_len,
-		__entry->tail_base, __entry->tail_len, __entry->msg_len
+		__entry->head_base, __entry->head_len,
+		__entry->page_len, __entry->page_base,
+		__entry->tail_base, __entry->tail_len,
+		__entry->msg_len
 	)
 );



2021-10-04 23:50:44

by J. Bruce Fields

Subject: Re: [PATCH 0/5] Resend: tracepoint patches for NFSD

On Mon, Oct 04, 2021 at 10:16:02AM -0400, Chuck Lever wrote:
> I posted these on 9/22 and have heard no objections. Can you include
> them in your v5.16 for-next branch?

Looks good, applied.

--b.
