2018-03-13 15:43:43

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 00/15] NFSD trace point clean up

Hi Bruce-

Please consider this cluster of patches for v4.17. The patches clean
up a number of issues with NFSD and server-side sunrpc static trace
points.

I have additional svcrdma patches for v4.17, which I'm holding until
you have reviewed this cluster.

---

Chuck Lever (15):
sunrpc: Remove unneeded pointer dereference
svc: Simplify ->xpo_secure_port
sunrpc: Update show_svc_xprt_flags() to include recently added flags
sunrpc: Move trace_svc_xprt_dequeue()
sunrpc: Simplify do_enqueue tracing
sunrpc: Simplify trace_svc_recv
sunrpc: Save remote presentation address in svc_xprt for trace events
sunrpc: Re-purpose trace_svc_process
sunrpc: Report per-RPC execution stats
nfsd: Fix NFSD trace points
nfsd: Record request byte count, not count of vectors
nfsd: Add "nfsd_" to trace point names
nfsd: Add I/O trace points in the NFSv4 write path
nfsd: Add I/O trace points in the NFSv4 read proc
nfsd: Trace NFSv4 COMPOUND execution


fs/nfsd/nfs4layouts.c | 16 +--
fs/nfsd/nfs4proc.c | 25 +++-
fs/nfsd/nfs4xdr.c | 55 +++++++---
fs/nfsd/trace.h | 98 +++++++++++++++--
fs/nfsd/vfs.c | 85 +++++++--------
fs/nfsd/vfs.h | 8 +
include/linux/sunrpc/svc.h | 1
include/linux/sunrpc/svc_rdma.h | 1
include/linux/sunrpc/svc_xprt.h | 6 +
include/trace/events/sunrpc.h | 173 ++++++++++++++++--------------
net/sunrpc/svc.c | 9 +-
net/sunrpc/svc_xprt.c | 31 ++---
net/sunrpc/svcsock.c | 10 +-
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1
net/sunrpc/xprtrdma/svc_rdma_transport.c | 11 +-
15 files changed, 322 insertions(+), 208 deletions(-)

--
Chuck Lever


2018-03-13 15:43:52

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 02/15] svc: Simplify ->xpo_secure_port

Clean up: Instead of returning a value that is used to set or clear
a bit, just make ->xpo_secure_port mangle that bit, and return void.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_xprt.h | 2 +-
net/sunrpc/svc_xprt.c | 5 +----
net/sunrpc/svcsock.c | 7 +++++--
net/sunrpc/xprtrdma/svc_rdma_transport.c | 6 +++---
4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 1caf7bc..19475ac 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -25,7 +25,7 @@ struct svc_xprt_ops {
void (*xpo_release_rqst)(struct svc_rqst *);
void (*xpo_detach)(struct svc_xprt *);
void (*xpo_free)(struct svc_xprt *);
- int (*xpo_secure_port)(struct svc_rqst *);
+ void (*xpo_secure_port)(struct svc_rqst *rqstp);
void (*xpo_kill_temp_xprt)(struct svc_xprt *);
};

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 6dca0f5..4e3b4c5 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -844,10 +844,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)

clear_bit(XPT_OLD, &xprt->xpt_flags);

- if (xprt->xpt_ops->xpo_secure_port(rqstp))
- set_bit(RQ_SECURE, &rqstp->rq_flags);
- else
- clear_bit(RQ_SECURE, &rqstp->rq_flags);
+ xprt->xpt_ops->xpo_secure_port(rqstp);
rqstp->rq_chandle.defer = svc_defer;
rqstp->rq_xid = svc_getu32(&rqstp->rq_arg.head[0]);

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 943f2a7..9b67035 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -391,9 +391,12 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
release_sock(sock->sk);
}

-static int svc_sock_secure_port(struct svc_rqst *rqstp)
+static void svc_sock_secure_port(struct svc_rqst *rqstp)
{
- return svc_port_is_privileged(svc_addr(rqstp));
+ if (svc_port_is_privileged(svc_addr(rqstp)))
+ set_bit(RQ_SECURE, &rqstp->rq_flags);
+ else
+ clear_bit(RQ_SECURE, &rqstp->rq_flags);
}

/*
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 9ad12a2..b083ac8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -69,7 +69,7 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
static void svc_rdma_detach(struct svc_xprt *xprt);
static void svc_rdma_free(struct svc_xprt *xprt);
static int svc_rdma_has_wspace(struct svc_xprt *xprt);
-static int svc_rdma_secure_port(struct svc_rqst *);
+static void svc_rdma_secure_port(struct svc_rqst *);
static void svc_rdma_kill_temp_xprt(struct svc_xprt *);

static const struct svc_xprt_ops svc_rdma_ops = {
@@ -992,9 +992,9 @@ static int svc_rdma_has_wspace(struct svc_xprt *xprt)
return 1;
}

-static int svc_rdma_secure_port(struct svc_rqst *rqstp)
+static void svc_rdma_secure_port(struct svc_rqst *rqstp)
{
- return 1;
+ set_bit(RQ_SECURE, &rqstp->rq_flags);
}

static void svc_rdma_kill_temp_xprt(struct svc_xprt *xprt)


2018-03-13 15:43:47

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 01/15] sunrpc: Remove unneeded pointer dereference

Clean up: Noticed during code inspection that there is already a
local automatic variable "xprt" so dereferencing rqst->rq_xprt
again is unnecessary.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/svc_xprt.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index f9307bd..6dca0f5 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -492,7 +492,7 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
{
struct svc_xprt *xprt = rqstp->rq_xprt;

- rqstp->rq_xprt->xpt_ops->xpo_release_rqst(rqstp);
+ xprt->xpt_ops->xpo_release_rqst(rqstp);

kfree(rqstp->rq_deferred);
rqstp->rq_deferred = NULL;
@@ -889,7 +889,7 @@ int svc_send(struct svc_rqst *rqstp)
goto out;

/* release the receive skb before sending the reply */
- rqstp->rq_xprt->xpt_ops->xpo_release_rqst(rqstp);
+ xprt->xpt_ops->xpo_release_rqst(rqstp);

/* calculate over-all length */
xb = &rqstp->rq_res;


2018-03-13 15:43:57

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 03/15] sunrpc: Update show_svc_xprt_flags() to include recently added flags

XPT_KILL_TEMP was added by commit 546125d16142 ("sunrpc: don't call
sleeping functions from the notifier block callbacks"), and
XPT_CONG_CTRL was added by commit 362142b25843 ("sunrpc: flag
transports as having congestion control") .

Signed-off-by: Chuck Lever <[email protected]>
---
include/trace/events/sunrpc.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 970c91a..a231444 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -591,7 +591,9 @@
{ (1UL << XPT_OLD), "XPT_OLD"}, \
{ (1UL << XPT_LISTENER), "XPT_LISTENER"}, \
{ (1UL << XPT_CACHE_AUTH), "XPT_CACHE_AUTH"}, \
- { (1UL << XPT_LOCAL), "XPT_LOCAL"})
+ { (1UL << XPT_LOCAL), "XPT_LOCAL"}, \
+ { (1UL << XPT_KILL_TEMP), "XPT_KILL_TEMP"}, \
+ { (1UL << XPT_CONG_CTRL), "XPT_CONG_CTRL"})

TRACE_EVENT(svc_xprt_do_enqueue,
TP_PROTO(struct svc_xprt *xprt, struct svc_rqst *rqst),


2018-03-13 15:44:08

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 05/15] sunrpc: Simplify do_enqueue tracing

There are three cases where svc_xprt_do_enqueue() returns without
waking an nfsd thread:

1. There is no work to do

2. The transport is already busy

3. There are no available nfsd threads

Only 3. is truly interesting. Move the trace point so it records
that there was work to do and either an nfsd thread was awoken, or
a free one could not found.

As an additional clean up, remove a redundant comment and a couple
of dprintk call sites.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/svc_xprt.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 71f4718..5fe150c 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -382,25 +382,21 @@ void svc_xprt_do_enqueue(struct svc_xprt *xprt)
int cpu;

if (!svc_xprt_has_something_to_do(xprt))
- goto out;
+ return;

/* Mark transport as busy. It will remain in this state until
* the provider calls svc_xprt_received. We update XPT_BUSY
* atomically because it also guards against trying to enqueue
* the transport twice.
*/
- if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
- /* Don't enqueue transport while already enqueued */
- dprintk("svc: transport %p busy, not enqueued\n", xprt);
- goto out;
- }
+ if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
+ return;

cpu = get_cpu();
pool = svc_pool_for_cpu(xprt->xpt_server, cpu);

atomic_long_inc(&pool->sp_stats.packets);

- dprintk("svc: transport %p put into queue\n", xprt);
spin_lock_bh(&pool->sp_lock);
list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
pool->sp_stats.sockets_queued++;
@@ -420,7 +416,6 @@ void svc_xprt_do_enqueue(struct svc_xprt *xprt)
out_unlock:
rcu_read_unlock();
put_cpu();
-out:
trace_svc_xprt_do_enqueue(xprt, rqstp);
}
EXPORT_SYMBOL_GPL(svc_xprt_do_enqueue);


2018-03-13 15:44:02

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 04/15] sunrpc: Move trace_svc_xprt_dequeue()

Reduce the amount of noise generated by trace_svc_xprt_dequeue by
moving it to the end of svc_get_next_xprt. This generates exactly
one trace event when a ready xprt is found, rather than spurious
events when there is no work to do. The empty events contain no
information that can't be obtained simply by tracing function calls
to svc_xprt_dequeue.

A small additional benefit is simplification of the svc_xprt_event
trace class, which no longer has to handle the case when the @xprt
parameter is NULL.

Signed-off-by: Chuck Lever <[email protected]>
---
include/trace/events/sunrpc.h | 16 +++++-----------
net/sunrpc/svc_xprt.c | 5 +----
2 files changed, 6 insertions(+), 15 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index a231444..9bba307 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -634,24 +634,18 @@
TP_STRUCT__entry(
__field(struct svc_xprt *, xprt)
__field(unsigned long, flags)
- __dynamic_array(unsigned char, addr, xprt != NULL ?
- xprt->xpt_remotelen : 0)
+ __dynamic_array(unsigned char, addr, xprt->xpt_remotelen)
),

TP_fast_assign(
__entry->xprt = xprt;
- if (xprt) {
- memcpy(__get_dynamic_array(addr),
- &xprt->xpt_remote,
- xprt->xpt_remotelen);
- __entry->flags = xprt->xpt_flags;
- } else
- __entry->flags = 0;
+ __entry->flags = xprt->xpt_flags;
+ memcpy(__get_dynamic_array(addr),
+ &xprt->xpt_remote, xprt->xpt_remotelen);
),

TP_printk("xprt=0x%p addr=%pIScp flags=%s", __entry->xprt,
- __get_dynamic_array_len(addr) != 0 ?
- (struct sockaddr *)__get_dynamic_array(addr) : NULL,
+ (struct sockaddr *)__get_dynamic_array(addr),
show_svc_xprt_flags(__entry->flags))
);

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 4e3b4c5..71f4718 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -454,13 +454,9 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
struct svc_xprt, xpt_ready);
list_del_init(&xprt->xpt_ready);
svc_xprt_get(xprt);
-
- dprintk("svc: transport %p dequeued, inuse=%d\n",
- xprt, kref_read(&xprt->xpt_ref));
}
spin_unlock_bh(&pool->sp_lock);
out:
- trace_svc_xprt_dequeue(xprt);
return xprt;
}

@@ -734,6 +730,7 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
rqstp->rq_chandle.thread_wait = 5*HZ;
else
rqstp->rq_chandle.thread_wait = 1*HZ;
+ trace_svc_xprt_dequeue(rqstp->rq_xprt);
return rqstp->rq_xprt;
}



2018-03-13 15:44:18

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 07/15] sunrpc: Save remote presentation address in svc_xprt for trace events

TP_printk defines a format string that is passed to user space for
converting raw trace event records to something human-readable.

My user space's printf (Oracle Linux 7), however, does not have a
%pI format specifier. The result is that what is supposed to be an
IP address in the output of "trace-cmd report" is just a string that
says the field couldn't be displayed.

To fix this, adopt the same approach as the client: maintain a pre-
formated presentation address for occasions when %pI is not
available.

The location of the trace_svc_send trace point is adjusted so that
rqst->rq_xprt is not NULL when the trace event is recorded.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_xprt.h | 4 +
include/trace/events/sunrpc.h | 89 +++++++++++-------------------
net/sunrpc/svc_xprt.c | 3 +
net/sunrpc/svcsock.c | 1
net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 +
5 files changed, 43 insertions(+), 58 deletions(-)

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 19475ac..c3d7206 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -83,6 +83,7 @@ struct svc_xprt {
size_t xpt_locallen; /* length of address */
struct sockaddr_storage xpt_remote; /* remote peer's address */
size_t xpt_remotelen; /* length of address */
+ char xpt_remotebuf[INET6_ADDRSTRLEN + 10];
struct rpc_wait_queue xpt_bc_pending; /* backchannel wait queue */
struct list_head xpt_users; /* callbacks on free */

@@ -152,7 +153,10 @@ static inline void svc_xprt_set_remote(struct svc_xprt *xprt,
{
memcpy(&xprt->xpt_remote, sa, salen);
xprt->xpt_remotelen = salen;
+ snprintf(xprt->xpt_remotebuf, sizeof(xprt->xpt_remotebuf) - 1,
+ "%pISpc", sa);
}
+
static inline unsigned short svc_addr_port(const struct sockaddr *sa)
{
const struct sockaddr_in *sin = (const struct sockaddr_in *)sa;
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 5849bfb..1ec8c4c 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -493,20 +493,18 @@
__field(u32, xid)
__field(int, len)
__field(unsigned long, flags)
- __dynamic_array(unsigned char, addr, rqst->rq_addrlen)
+ __string(addr, rqst->rq_xprt->xpt_remotebuf)
),

TP_fast_assign(
__entry->xid = be32_to_cpu(rqst->rq_xid);
__entry->len = len;
__entry->flags = rqst->rq_flags;
- memcpy(__get_dynamic_array(addr),
- &rqst->rq_addr, rqst->rq_addrlen);
+ __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
),

- TP_printk("addr=%pIScp xid=0x%08x len=%d flags=%s",
- (struct sockaddr *)__get_dynamic_array(addr),
- __entry->xid, __entry->len,
+ TP_printk("addr=%s xid=0x%08x len=%d flags=%s",
+ __get_str(addr), __entry->xid, __entry->len,
show_rqstp_flags(__entry->flags))
);

@@ -519,20 +517,18 @@
TP_STRUCT__entry(
__field(u32, xid)
__field(unsigned long, flags)
- __dynamic_array(unsigned char, addr, rqst->rq_addrlen)
+ __string(addr, rqst->rq_xprt->xpt_remotebuf)
),

TP_fast_assign(
__entry->xid = be32_to_cpu(rqst->rq_xid);
__entry->flags = rqst->rq_flags;
- memcpy(__get_dynamic_array(addr),
- &rqst->rq_addr, rqst->rq_addrlen);
+ __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
),

- TP_printk("addr=%pIScp rq_xid=0x%08x flags=%s",
- (struct sockaddr *)__get_dynamic_array(addr),
- __entry->xid,
- show_rqstp_flags(__entry->flags))
+ TP_printk("addr=%s xid=0x%08x flags=%s",
+ __get_str(addr), __entry->xid,
+ show_rqstp_flags(__entry->flags))
);

DEFINE_EVENT(svc_rqst_event, svc_defer,
@@ -553,21 +549,19 @@
__field(u32, xid)
__field(int, status)
__field(unsigned long, flags)
- __dynamic_array(unsigned char, addr, rqst->rq_addrlen)
+ __string(addr, rqst->rq_xprt->xpt_remotebuf)
),

TP_fast_assign(
__entry->xid = be32_to_cpu(rqst->rq_xid);
__entry->status = status;
__entry->flags = rqst->rq_flags;
- memcpy(__get_dynamic_array(addr),
- &rqst->rq_addr, rqst->rq_addrlen);
+ __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
),

- TP_printk("addr=%pIScp rq_xid=0x%08x status=%d flags=%s",
- (struct sockaddr *)__get_dynamic_array(addr),
- __entry->xid,
- __entry->status, show_rqstp_flags(__entry->flags))
+ TP_printk("addr=%s xid=0x%08x status=%d flags=%s",
+ __get_str(addr), __entry->xid,
+ __entry->status, show_rqstp_flags(__entry->flags))
);

DEFINE_EVENT(svc_rqst_status, svc_process,
@@ -604,26 +598,19 @@
__field(struct svc_xprt *, xprt)
__field(int, pid)
__field(unsigned long, flags)
- __dynamic_array(unsigned char, addr, xprt != NULL ?
- xprt->xpt_remotelen : 0)
+ __string(addr, xprt->xpt_remotebuf)
),

TP_fast_assign(
__entry->xprt = xprt;
__entry->pid = rqst? rqst->rq_task->pid : 0;
- if (xprt) {
- memcpy(__get_dynamic_array(addr),
- &xprt->xpt_remote,
- xprt->xpt_remotelen);
- __entry->flags = xprt->xpt_flags;
- } else
- __entry->flags = 0;
+ __entry->flags = xprt->xpt_flags;
+ __assign_str(addr, xprt->xpt_remotebuf);
),

- TP_printk("xprt=0x%p addr=%pIScp pid=%d flags=%s", __entry->xprt,
- __get_dynamic_array_len(addr) != 0 ?
- (struct sockaddr *)__get_dynamic_array(addr) : NULL,
- __entry->pid, show_svc_xprt_flags(__entry->flags))
+ TP_printk("xprt=%p addr=%s pid=%d flags=%s",
+ __entry->xprt, __get_str(addr),
+ __entry->pid, show_svc_xprt_flags(__entry->flags))
);

DECLARE_EVENT_CLASS(svc_xprt_event,
@@ -634,19 +621,18 @@
TP_STRUCT__entry(
__field(struct svc_xprt *, xprt)
__field(unsigned long, flags)
- __dynamic_array(unsigned char, addr, xprt->xpt_remotelen)
+ __string(addr, xprt->xpt_remotebuf)
),

TP_fast_assign(
__entry->xprt = xprt;
__entry->flags = xprt->xpt_flags;
- memcpy(__get_dynamic_array(addr),
- &xprt->xpt_remote, xprt->xpt_remotelen);
+ __assign_str(addr, xprt->xpt_remotebuf);
),

- TP_printk("xprt=0x%p addr=%pIScp flags=%s", __entry->xprt,
- (struct sockaddr *)__get_dynamic_array(addr),
- show_svc_xprt_flags(__entry->flags))
+ TP_printk("xprt=%p addr=%s flags=%s",
+ __entry->xprt, __get_str(addr),
+ show_svc_xprt_flags(__entry->flags))
);

DEFINE_EVENT(svc_xprt_event, svc_xprt_dequeue,
@@ -682,25 +668,18 @@
__field(struct svc_xprt *, xprt)
__field(int, len)
__field(unsigned long, flags)
- __dynamic_array(unsigned char, addr, xprt != NULL ?
- xprt->xpt_remotelen : 0)
+ __string(addr, xprt->xpt_remotebuf)
),

TP_fast_assign(
__entry->xprt = xprt;
__entry->len = len;
- if (xprt) {
- memcpy(__get_dynamic_array(addr),
- &xprt->xpt_remote,
- xprt->xpt_remotelen);
- __entry->flags = xprt->xpt_flags;
- } else
- __entry->flags = 0;
+ __entry->flags = xprt->xpt_flags;
+ __assign_str(addr, xprt->xpt_remotebuf);
),

- TP_printk("xprt=0x%p addr=%pIScp len=%d flags=%s", __entry->xprt,
- __get_dynamic_array_len(addr) != 0 ?
- (struct sockaddr *)__get_dynamic_array(addr) : NULL,
+ TP_printk("xprt=%p addr=%s len=%d flags=%s",
+ __entry->xprt, __get_str(addr),
__entry->len, show_svc_xprt_flags(__entry->flags))
);

@@ -712,18 +691,16 @@

TP_STRUCT__entry(
__field(u32, xid)
- __dynamic_array(unsigned char, addr, dr->addrlen)
+ __string(addr, dr->xprt->xpt_remotebuf)
),

TP_fast_assign(
__entry->xid = be32_to_cpu(*(__be32 *)(dr->args +
(dr->xprt_hlen>>2)));
- memcpy(__get_dynamic_array(addr), &dr->addr, dr->addrlen);
+ __assign_str(addr, dr->xprt->xpt_remotebuf);
),

- TP_printk("addr=%pIScp xid=0x%08x",
- (struct sockaddr *)__get_dynamic_array(addr),
- __entry->xid)
+ TP_printk("addr=%s xid=0x%08x", __get_str(addr), __entry->xid)
);

DEFINE_EVENT(svc_deferred_event, svc_drop_deferred,
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 47384d0..f745754 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -173,6 +173,7 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
set_bit(XPT_BUSY, &xprt->xpt_flags);
rpc_init_wait_queue(&xprt->xpt_bc_pending, "xpt_bc_pending");
xprt->xpt_net = get_net(net);
+ strcpy(xprt->xpt_remotebuf, "uninitialized");
}
EXPORT_SYMBOL_GPL(svc_xprt_init);

@@ -894,12 +895,12 @@ int svc_send(struct svc_rqst *rqstp)
len = xprt->xpt_ops->xpo_sendto(rqstp);
mutex_unlock(&xprt->xpt_mutex);
rpc_wake_up(&xprt->xpt_bc_pending);
+ trace_svc_send(rqstp, len);
svc_xprt_release(rqstp);

if (len == -ECONNREFUSED || len == -ENOTCONN || len == -EAGAIN)
len = 0;
out:
- trace_svc_send(rqstp, len);
return len;
}

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 9b67035..4ca1d92 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1310,6 +1310,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
set_bit(XPT_CONG_CTRL, &svsk->sk_xprt.xpt_flags);
if (sk->sk_state == TCP_LISTEN) {
dprintk("setting up TCP socket for listening\n");
+ strcpy(svsk->sk_xprt.xpt_remotebuf, "listener");
set_bit(XPT_LISTENER, &svsk->sk_xprt.xpt_flags);
sk->sk_data_ready = svc_tcp_listen_data_ready;
set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b083ac8..ca211dc 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -401,8 +401,10 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
*/
set_bit(XPT_CONG_CTRL, &cma_xprt->sc_xprt.xpt_flags);

- if (listener)
+ if (listener) {
+ strcpy(cma_xprt->sc_xprt.xpt_remotebuf, "listener");
set_bit(XPT_LISTENER, &cma_xprt->sc_xprt.xpt_flags);
+ }

return cma_xprt;
}


2018-03-13 15:44:13

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 06/15] sunrpc: Simplify trace_svc_recv

There doesn't seem to be a lot of value in calling trace_svc_recv
in the failing case.

1. There are two very common cases: one is the transport is not
ready, and the other is shutdown. Neither is terribly interesting.

2. The trace record for the failing case contains nothing but
the status code.

Therefore the trace point call site in the error exit is removed.
Since the trace point is now recording a length instead of a
status, rename the status field and remove the case that records a
zero XID.

Signed-off-by: Chuck Lever <[email protected]>
---
include/trace/events/sunrpc.h | 14 +++++++-------
net/sunrpc/svc_xprt.c | 1 -
2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 9bba307..5849bfb 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -485,28 +485,28 @@
{ (1UL << RQ_BUSY), "RQ_BUSY"})

TRACE_EVENT(svc_recv,
- TP_PROTO(struct svc_rqst *rqst, int status),
+ TP_PROTO(struct svc_rqst *rqst, int len),

- TP_ARGS(rqst, status),
+ TP_ARGS(rqst, len),

TP_STRUCT__entry(
__field(u32, xid)
- __field(int, status)
+ __field(int, len)
__field(unsigned long, flags)
__dynamic_array(unsigned char, addr, rqst->rq_addrlen)
),

TP_fast_assign(
- __entry->xid = status > 0 ? be32_to_cpu(rqst->rq_xid) : 0;
- __entry->status = status;
+ __entry->xid = be32_to_cpu(rqst->rq_xid);
+ __entry->len = len;
__entry->flags = rqst->rq_flags;
memcpy(__get_dynamic_array(addr),
&rqst->rq_addr, rqst->rq_addrlen);
),

- TP_printk("addr=%pIScp xid=0x%08x status=%d flags=%s",
+ TP_printk("addr=%pIScp xid=0x%08x len=%d flags=%s",
(struct sockaddr *)__get_dynamic_array(addr),
- __entry->xid, __entry->status,
+ __entry->xid, __entry->len,
show_rqstp_flags(__entry->flags))
);

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 5fe150c..47384d0 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -848,7 +848,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
rqstp->rq_res.len = 0;
svc_xprt_release(rqstp);
out:
- trace_svc_recv(rqstp, err);
return err;
}
EXPORT_SYMBOL_GPL(svc_recv);


2018-03-13 15:44:24

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 08/15] sunrpc: Re-purpose trace_svc_process

Currently, trace_svc_process has two call sites:

1. Just after a call to svc_send. svc_send already invokes
trace_svc_send with the same arguments just before returning

2. Just before a call to svc_drop. svc_drop already invokes
trace_svc_drop with the same arguments just after it is called

Therefore trace_svc_process does not provide any additional
information not already provided by these other trace points.

However, it would be useful to record the incoming RPC procedure.
So reuse trace_svc_process for this purpose.

Signed-off-by: Chuck Lever <[email protected]>
---
include/trace/events/sunrpc.h | 30 ++++++++++++++++++++++++++----
net/sunrpc/svc.c | 9 +++------
2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 1ec8c4c..5a8157c 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -508,6 +508,32 @@
show_rqstp_flags(__entry->flags))
);

+TRACE_EVENT(svc_process,
+ TP_PROTO(const struct svc_rqst *rqst, const char *name),
+
+ TP_ARGS(rqst, name),
+
+ TP_STRUCT__entry(
+ __field(u32, xid)
+ __field(u32, vers)
+ __field(u32, proc)
+ __string(service, name)
+ __string(addr, rqst->rq_xprt->xpt_remotebuf)
+ ),
+
+ TP_fast_assign(
+ __entry->xid = be32_to_cpu(rqst->rq_xid);
+ __entry->vers = rqst->rq_vers;
+ __entry->proc = rqst->rq_proc;
+ __assign_str(service, name);
+ __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
+ ),
+
+ TP_printk("addr=%s xid=0x%08x service=%s vers=%u proc=%u",
+ __get_str(addr), __entry->xid,
+ __get_str(service), __entry->vers, __entry->proc)
+);
+
DECLARE_EVENT_CLASS(svc_rqst_event,

TP_PROTO(struct svc_rqst *rqst),
@@ -564,10 +590,6 @@
__entry->status, show_rqstp_flags(__entry->flags))
);

-DEFINE_EVENT(svc_rqst_status, svc_process,
- TP_PROTO(struct svc_rqst *rqst, int status),
- TP_ARGS(rqst, status));
-
DEFINE_EVENT(svc_rqst_status, svc_send,
TP_PROTO(struct svc_rqst *rqst, int status),
TP_ARGS(rqst, status));
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 387cc4a..f19987f 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1255,6 +1255,7 @@ static __printf(2,3) void svc_printk(struct svc_rqst *rqstp, const char *fmt, ..

/* Syntactic check complete */
serv->sv_stats->rpccnt++;
+ trace_svc_process(rqstp, progp->pg_name);

/* Build the reply header. */
statp = resv->iov_base +resv->iov_len;
@@ -1431,14 +1432,10 @@ static __printf(2,3) void svc_printk(struct svc_rqst *rqstp, const char *fmt, ..
}

/* Returns 1 for send, 0 for drop */
- if (likely(svc_process_common(rqstp, argv, resv))) {
- int ret = svc_send(rqstp);
+ if (likely(svc_process_common(rqstp, argv, resv)))
+ return svc_send(rqstp);

- trace_svc_process(rqstp, ret);
- return ret;
- }
out_drop:
- trace_svc_process(rqstp, 0);
svc_drop(rqstp);
return 0;
}


2018-03-13 15:44:34

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 10/15] nfsd: Fix NFSD trace points

nfsd-1915 [003] 77915.780959: write_opened:
[FAILED TO PARSE] xid=3286130958 fh=0 offset=154624 len=1
nfsd-1915 [003] 77915.780960: write_io_done:
[FAILED TO PARSE] xid=3286130958 fh=0 offset=154624 len=1
nfsd-1915 [003] 77915.780964: write_done:
[FAILED TO PARSE] xid=3286130958 fh=0 offset=154624 len=1

Byte swapping and knfsd_fh_hash() are not available in "trace-cmd
report", where the print format string is actually used. These
data transformations have to be done during the TP_fast_assign step.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/trace.h | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 8b2f1d9..3aa3f3b 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -18,19 +18,24 @@
int len),
TP_ARGS(rqstp, fhp, offset, len),
TP_STRUCT__entry(
- __field(__be32, xid)
- __field_struct(struct knfsd_fh, fh)
+ __field(u32, xid)
+ __field(u32, fh_hash)
__field(loff_t, offset)
__field(int, len)
),
TP_fast_assign(
- __entry->xid = rqstp->rq_xid,
- fh_copy_shallow(&__entry->fh, &fhp->fh_handle);
+ __entry->xid = be32_to_cpu(rqstp->rq_xid);
+ do {
+ struct knfsd_fh fh;
+
+ fh_copy_shallow(&fh, &fhp->fh_handle);
+ __entry->fh_hash = knfsd_fh_hash(&fh);
+ } while (0);
__entry->offset = offset;
__entry->len = len;
),
- TP_printk("xid=0x%x fh=0x%x offset=%lld len=%d",
- __be32_to_cpu(__entry->xid), knfsd_fh_hash(&__entry->fh),
+ TP_printk("xid=0x%08x fh_hash=0x%08x offset=%lld len=%d",
+ __entry->xid, __entry->fh_hash,
__entry->offset, __entry->len)
)



2018-03-13 15:44:50

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 13/15] nfsd: Add I/O trace points in the NFSv4 write path

NFSv4 write compound processing invokes nfsd_vfs_write directly. The
trace points currently in nfsd_write are not effective for NFSv4
writes.

Move the trace points into the shared nfsd_vfs_write() helper.

After the I/O, we also want to record any local I/O error that
might have occurred, and the total count of bytes that were actually
moved (rather than the requested number).

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4proc.c | 7 +++++--
fs/nfsd/trace.h | 33 +++++++++++++++++++++++++++++++++
fs/nfsd/vfs.c | 23 ++++++++++++-----------
3 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 10c7620..b93673e 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1001,6 +1001,9 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
if (write->wr_offset >= OFFSET_MAX)
return nfserr_inval;

+ cnt = write->wr_buflen;
+ trace_nfsd_write_start(rqstp, &cstate->current_fh,
+ write->wr_offset, cnt);
status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
stateid, WR_STATE, &filp, NULL);
if (status) {
@@ -1008,7 +1011,6 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
return status;
}

- cnt = write->wr_buflen;
write->wr_how_written = write->wr_stable_how;
gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));

@@ -1021,7 +1023,8 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
fput(filp);

write->wr_bytes_written = cnt;
-
+ trace_nfsd_write_done(rqstp, &cstate->current_fh,
+ write->wr_offset, cnt);
return status;
}

diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 30e0485..653e9ee 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -51,6 +51,39 @@
DEFINE_NFSD_IO_EVENT(write_io_done);
DEFINE_NFSD_IO_EVENT(write_done);

+DECLARE_EVENT_CLASS(nfsd_err_class,
+ TP_PROTO(struct svc_rqst *rqstp,
+ struct svc_fh *fhp,
+ loff_t offset,
+ int status),
+ TP_ARGS(rqstp, fhp, offset, status),
+ TP_STRUCT__entry(
+ __field(u32, xid)
+ __field(u32, fh_hash)
+ __field(loff_t, offset)
+ __field(int, status)
+ ),
+ TP_fast_assign(
+ __entry->xid = be32_to_cpu(rqstp->rq_xid);
+ __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
+ __entry->offset = offset;
+ __entry->status = status;
+ ),
+ TP_printk("xid=0x%08x fh_hash=0x%08x offset=%lld status=%d",
+ __entry->xid, __entry->fh_hash,
+ __entry->offset, __entry->status)
+)
+
+#define DEFINE_NFSD_ERR_EVENT(name) \
+DEFINE_EVENT(nfsd_err_class, nfsd_##name, \
+ TP_PROTO(struct svc_rqst *rqstp, \
+ struct svc_fh *fhp, \
+ loff_t offset, \
+ int len), \
+ TP_ARGS(rqstp, fhp, offset, len))
+
+DEFINE_NFSD_ERR_EVENT(write_err);
+
#include "state.h"

DECLARE_EVENT_CLASS(nfsd_stateid_class,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8570319..ee59a0b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -965,13 +965,15 @@ static int wait_for_concurrent_writes(struct file *file)
{
struct svc_export *exp;
struct iov_iter iter;
- __be32 err = 0;
+ __be32 nfserr;
int host_err;
int use_wgather;
loff_t pos = offset;
unsigned int pflags = current->flags;
rwf_t flags = 0;

+ trace_nfsd_write_opened(rqstp, fhp, offset, *cnt);
+
if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
/*
* We want less throttling in balance_dirty_pages()
@@ -994,22 +996,23 @@ static int wait_for_concurrent_writes(struct file *file)
host_err = vfs_iter_write(file, &iter, &pos, flags);
if (host_err < 0)
goto out_nfserr;
- *cnt = host_err;
- nfsdstats.io_write += host_err;
+ nfsdstats.io_write += *cnt;
fsnotify_modify(file);

if (stable && use_wgather)
host_err = wait_for_concurrent_writes(file);

out_nfserr:
- dprintk("nfsd: write complete host_err=%d\n", host_err);
- if (host_err >= 0)
- err = 0;
- else
- err = nfserrno(host_err);
+ if (host_err >= 0) {
+ trace_nfsd_write_io_done(rqstp, fhp, offset, *cnt);
+ nfserr = nfs_ok;
+ } else {
+ trace_nfsd_write_err(rqstp, fhp, offset, host_err);
+ nfserr = nfserrno(host_err);
+ }
if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
current_restore_flags(pflags, PF_LESS_THROTTLE);
- return err;
+ return nfserr;
}

/*
@@ -1067,9 +1070,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (err)
goto out;

- trace_nfsd_write_opened(rqstp, fhp, offset, *cnt);
err = nfsd_vfs_write(rqstp, fhp, file, offset, vec, vlen, cnt, stable);
- trace_nfsd_write_io_done(rqstp, fhp, offset, *cnt);
fput(file);
out:
trace_nfsd_write_done(rqstp, fhp, offset, *cnt);


2018-03-13 15:44:39

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 11/15] nfsd: Record request byte count, not count of vectors

Byte count is more helpful to know than vector count.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/trace.h | 15 +++++----------
fs/nfsd/vfs.c | 16 ++++++++--------
2 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 3aa3f3b..9b79506 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -15,26 +15,21 @@
TP_PROTO(struct svc_rqst *rqstp,
struct svc_fh *fhp,
loff_t offset,
- int len),
+ unsigned long len),
TP_ARGS(rqstp, fhp, offset, len),
TP_STRUCT__entry(
__field(u32, xid)
__field(u32, fh_hash)
__field(loff_t, offset)
- __field(int, len)
+ __field(unsigned long, len)
),
TP_fast_assign(
__entry->xid = be32_to_cpu(rqstp->rq_xid);
- do {
- struct knfsd_fh fh;
-
- fh_copy_shallow(&fh, &fhp->fh_handle);
- __entry->fh_hash = knfsd_fh_hash(&fh);
- } while (0);
+ __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
__entry->offset = offset;
__entry->len = len;
),
- TP_printk("xid=0x%08x fh_hash=0x%08x offset=%lld len=%d",
+ TP_printk("xid=0x%08x fh_hash=0x%08x offset=%lld len=%lu",
__entry->xid, __entry->fh_hash,
__entry->offset, __entry->len)
)
@@ -44,7 +39,7 @@
TP_PROTO(struct svc_rqst *rqstp, \
struct svc_fh *fhp, \
loff_t offset, \
- int len), \
+ unsigned long len), \
TP_ARGS(rqstp, fhp, offset, len))

DEFINE_NFSD_IO_EVENT(read_start);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index a3c9bfa..e2eab4c 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1024,27 +1024,27 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct raparms *ra;
__be32 err;

- trace_read_start(rqstp, fhp, offset, vlen);
+ trace_read_start(rqstp, fhp, offset, *count);
err = nfsd_open(rqstp, fhp, S_IFREG, NFSD_MAY_READ, &file);
if (err)
return err;

ra = nfsd_init_raparms(file);

- trace_read_opened(rqstp, fhp, offset, vlen);
+ trace_read_opened(rqstp, fhp, offset, *count);

if (file->f_op->splice_read && test_bit(RQ_SPLICE_OK, &rqstp->rq_flags))
err = nfsd_splice_read(rqstp, file, offset, count);
else
err = nfsd_readv(file, offset, vec, vlen, count);

- trace_read_io_done(rqstp, fhp, offset, vlen);
+ trace_read_io_done(rqstp, fhp, offset, *count);

if (ra)
nfsd_put_raparams(file, ra);
fput(file);

- trace_read_done(rqstp, fhp, offset, vlen);
+ trace_read_done(rqstp, fhp, offset, *count);

return err;
}
@@ -1061,18 +1061,18 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct file *file = NULL;
__be32 err = 0;

- trace_write_start(rqstp, fhp, offset, vlen);
+ trace_write_start(rqstp, fhp, offset, *cnt);

err = nfsd_open(rqstp, fhp, S_IFREG, NFSD_MAY_WRITE, &file);
if (err)
goto out;

- trace_write_opened(rqstp, fhp, offset, vlen);
+ trace_write_opened(rqstp, fhp, offset, *cnt);
err = nfsd_vfs_write(rqstp, fhp, file, offset, vec, vlen, cnt, stable);
- trace_write_io_done(rqstp, fhp, offset, vlen);
+ trace_write_io_done(rqstp, fhp, offset, *cnt);
fput(file);
out:
- trace_write_done(rqstp, fhp, offset, vlen);
+ trace_write_done(rqstp, fhp, offset, *cnt);
return err;
}



2018-03-13 15:44:29

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 09/15] sunrpc: Report per-RPC execution stats

Introduce a mechanism to report the server-side execution latency of
each RPC. The goal is to enable user space to filter the trace
record for latency outliers, build histograms, etc.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc.h | 1 +
include/linux/sunrpc/svc_rdma.h | 1 +
include/trace/events/sunrpc.h | 20 ++++++++++++++++++++
net/sunrpc/svc_xprt.c | 2 +-
net/sunrpc/svcsock.c | 2 ++
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 +
net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 +
7 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 786ae22..1939709 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -245,6 +245,7 @@ struct svc_rqst {
void * rq_xprt_ctxt; /* transport specific context ptr */
struct svc_deferred_req*rq_deferred; /* deferred request we are replaying */

+ ktime_t rq_rtime; /* receive time */
size_t rq_xprt_hlen; /* xprt header len */
struct xdr_buf rq_arg;
struct xdr_buf rq_res;
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 4b731b0..8045981 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -79,6 +79,7 @@ struct svc_rdma_op_ctxt {
struct list_head list;
struct xdr_buf arg;
struct ib_cqe cqe;
+ ktime_t rtime;
u32 byte_len;
struct svcxprt_rdma *xprt;
enum dma_data_direction direction;
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 5a8157c..50de512 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -705,6 +705,26 @@
__entry->len, show_svc_xprt_flags(__entry->flags))
);

+TRACE_EVENT(svc_stats_latency,
+ TP_PROTO(const struct svc_rqst *rqst, ktime_t end),
+
+ TP_ARGS(rqst, end),
+
+ TP_STRUCT__entry(
+ __field(u32, xid)
+ __field(unsigned long, execute)
+ __string(addr, rqst->rq_xprt->xpt_remotebuf)
+ ),
+
+ TP_fast_assign(
+ __entry->xid = be32_to_cpu(rqst->rq_xid);
+ __entry->execute = ktime_to_us(ktime_sub(end, rqst->rq_rtime));
+ __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
+ ),
+
+ TP_printk("addr=%s xid=0x%08x execute=%lu",
+ __get_str(addr), __entry->xid, __entry->execute)
+);

DECLARE_EVENT_CLASS(svc_deferred_event,
TP_PROTO(struct svc_deferred_req *dr),
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index f745754..2556e6e 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -782,7 +782,6 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
len = svc_deferred_recv(rqstp);
else
len = xprt->xpt_ops->xpo_recvfrom(rqstp);
- dprintk("svc: got len=%d\n", len);
rqstp->rq_reserved = serv->sv_max_mesg;
atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
}
@@ -893,6 +892,7 @@ int svc_send(struct svc_rqst *rqstp)
len = -ENOTCONN;
else
len = xprt->xpt_ops->xpo_sendto(rqstp);
+ trace_svc_stats_latency(rqstp, ktime_get_real());
mutex_unlock(&xprt->xpt_mutex);
rpc_wake_up(&xprt->xpt_bc_pending);
trace_svc_send(rqstp, len);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 4ca1d92..f616962 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -574,6 +574,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
/* Don't enable netstamp, sunrpc doesn't
need that much accuracy */
}
+ rqstp->rq_rtime = skb->tstamp;
svsk->sk_sk->sk_stamp = skb->tstamp;
set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be more data... */

@@ -1143,6 +1144,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
if (len < 0)
goto error;

+ rqstp->rq_rtime = ktime_get_real();
svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
if (serv->sv_stats)
serv->sv_stats->nettcpcnt++;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 19e9c6b..446b9d6 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -558,6 +558,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
goto out_readchunk;

complete:
+ rqstp->rq_rtime = ctxt->rtime;
svc_rdma_put_context(ctxt, 0);
dprintk("svcrdma: recvfrom: xprt=%p, rqstp=%p, rq_arg.len=%u\n",
rdma_xprt, rqstp, rqstp->rq_arg.len);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index ca211dc..08fa7de 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -316,6 +316,7 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
goto flushed;

/* All wc fields are now known to be valid */
+ ctxt->rtime = ktime_get_real();
ctxt->byte_len = wc->byte_len;
spin_lock(&xprt->sc_rq_dto_lock);
list_add_tail(&ctxt->list, &xprt->sc_rq_dto_q);


2018-03-13 15:45:01

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 15/15] nfsd: Trace NFSv4 COMPOUND execution

This helps record the identity and timing of the ops in each NFSv4
COMPOUND, replacing dprintk calls that did much the same thing.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4proc.c | 9 +++------
fs/nfsd/trace.h | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 39016b6..0df37e0 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1722,12 +1722,10 @@ static void svcxdr_init_encode(struct svc_rqst *rqstp,
goto encode_op;
}

+ trace_nfsd_compound(rqstp, args->opcnt);
while (!status && resp->opcnt < args->opcnt) {
op = &args->ops[resp->opcnt++];

- dprintk("nfsv4 compound op #%d/%d: %d (%s)\n",
- resp->opcnt, args->opcnt, op->opnum,
- nfsd4_op_name(op->opnum));
/*
* The XDR decode routines may have pre-set op->status;
* for example, if there is a miscellaneous XDR error
@@ -1801,9 +1799,8 @@ static void svcxdr_init_encode(struct svc_rqst *rqstp,
status = op->status;
}

- dprintk("nfsv4 compound op %p opcnt %d #%d: %d: status %d\n",
- args->ops, args->opcnt, resp->opcnt, op->opnum,
- be32_to_cpu(status));
+ trace_nfsd_compound_status(args->opcnt, resp->opcnt, status,
+ nfsd4_op_name(op->opnum));

nfsd4_cstate_clear_replay(cstate);
nfsd4_increment_op_stats(op->opnum);
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a8bbd9d..80933e4 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -11,6 +11,45 @@
#include <linux/tracepoint.h>
#include "nfsfh.h"

+TRACE_EVENT(nfsd_compound,
+ TP_PROTO(const struct svc_rqst *rqst,
+ u32 args_opcnt),
+ TP_ARGS(rqst, args_opcnt),
+ TP_STRUCT__entry(
+ __field(u32, xid)
+ __field(u32, args_opcnt)
+ ),
+ TP_fast_assign(
+ __entry->xid = be32_to_cpu(rqst->rq_xid);
+ __entry->args_opcnt = args_opcnt;
+ ),
+ TP_printk("xid=0x%08x opcnt=%u",
+ __entry->xid, __entry->args_opcnt)
+)
+
+TRACE_EVENT(nfsd_compound_status,
+ TP_PROTO(u32 args_opcnt,
+ u32 resp_opcnt,
+ __be32 status,
+ const char *name),
+ TP_ARGS(args_opcnt, resp_opcnt, status, name),
+ TP_STRUCT__entry(
+ __field(u32, args_opcnt)
+ __field(u32, resp_opcnt)
+ __field(int, status)
+ __string(name, name)
+ ),
+ TP_fast_assign(
+ __entry->args_opcnt = args_opcnt;
+ __entry->resp_opcnt = resp_opcnt;
+ __entry->status = be32_to_cpu(status);
+ __assign_str(name, name);
+ ),
+ TP_printk("op=%u/%u %s status=%d",
+ __entry->resp_opcnt, __entry->args_opcnt,
+ __get_str(name), __entry->status)
+)
+
DECLARE_EVENT_CLASS(nfsd_io_class,
TP_PROTO(struct svc_rqst *rqstp,
struct svc_fh *fhp,


2018-03-13 15:44:55

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 14/15] nfsd: Add I/O trace points in the NFSv4 read proc

NFSv4 read compound processing invokes nfsd_splice_read and
nfs_readv directly, so the trace points currently in nfsd_read are
not effective for NFSv4 reads.

Move and copy the trace points so that NFSv4 reads are captured.
Also, we want to record any local I/O error that occurs, and
the total count of bytes that were actually moved. And, also
whether splice or vectored read was used.

The svc_fh is not passed to the read helpers, so some code
duplication is necessary.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4proc.c | 5 +++++
fs/nfsd/nfs4xdr.c | 55 +++++++++++++++++++++++++++++++++++++---------------
fs/nfsd/trace.h | 4 +++-
fs/nfsd/vfs.c | 54 +++++++++++++++++++++++----------------------------
fs/nfsd/vfs.h | 8 ++++----
5 files changed, 75 insertions(+), 51 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index b93673e..39016b6 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -751,6 +751,9 @@ static __be32 nfsd4_do_lookupp(struct svc_rqst *rqstp, struct svc_fh *fh)
if (read->rd_offset >= OFFSET_MAX)
return nfserr_inval;

+ trace_nfsd_read_start(rqstp, &cstate->current_fh,
+ read->rd_offset, read->rd_length);
+
/*
* If we do a zero copy read, then a client will see read data
* that reflects the state of the file *after* performing the
@@ -783,6 +786,8 @@ static __be32 nfsd4_do_lookupp(struct svc_rqst *rqstp, struct svc_fh *fh)
{
if (u->read.rd_filp)
fput(u->read.rd_filp);
+ trace_nfsd_read_done(u->read.rd_rqstp, u->read.rd_fhp,
+ u->read.rd_offset, u->read.rd_length);
}

static __be32
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index e502fd1..f75cc7d 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -37,6 +37,7 @@
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/namei.h>
+#include <linux/fsnotify.h>
#include <linux/statfs.h>
#include <linux/utsname.h>
#include <linux/pagemap.h>
@@ -50,6 +51,7 @@
#include "cache.h"
#include "netns.h"
#include "pnfs.h"
+#include "trace.h"

#ifdef CONFIG_NFSD_V4_SECURITY_LABEL
#include <linux/security.h>
@@ -3416,28 +3418,27 @@ static __be32 nfsd4_encode_splice_read(
{
struct xdr_stream *xdr = &resp->xdr;
struct xdr_buf *buf = xdr->buf;
+ int host_err;
u32 eof;
long len;
int space_left;
- __be32 nfserr;
__be32 *p = xdr->p - 2;

/* Make sure there will be room for padding if needed */
if (xdr->end - xdr->p < 1)
return nfserr_resource;

+ trace_nfsd_read_splice(resp->rqstp, read->rd_fhp,
+ read->rd_offset, maxcount);
len = maxcount;
- nfserr = nfsd_splice_read(read->rd_rqstp, file,
+ host_err = nfsd_splice_read(read->rd_rqstp, file,
read->rd_offset, &maxcount);
- if (nfserr) {
- /*
- * nfsd_splice_actor may have already messed with the
- * page length; reset it so as not to confuse
- * xdr_truncate_encode:
- */
- buf->page_len = 0;
- return nfserr;
- }
+ if (host_err < 0)
+ goto err;
+ trace_nfsd_read_io_done(read->rd_rqstp, read->rd_fhp,
+ read->rd_offset, maxcount);
+ nfsdstats.io_read += maxcount;
+ fsnotify_access(file);

eof = nfsd_eof_on_read(len, maxcount, read->rd_offset,
d_inode(read->rd_fhp->fh_dentry)->i_size);
@@ -3470,6 +3471,17 @@ static __be32 nfsd4_encode_splice_read(
xdr->end = (__be32 *)((void *)xdr->end + space_left);

return 0;
+
+err:
+ /*
+ * nfsd_splice_actor may have already messed with the
+ * page length; reset it so as not to confuse
+ * xdr_truncate_encode:
+ */
+ buf->page_len = 0;
+ trace_nfsd_read_err(read->rd_rqstp, read->rd_fhp,
+ read->rd_offset, host_err);
+ return nfserrno(host_err);
}

static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
@@ -3477,12 +3489,12 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct file *file, unsigned long maxcount)
{
struct xdr_stream *xdr = &resp->xdr;
+ int host_err;
u32 eof;
int v;
int starting_len = xdr->buf->len - 8;
long len;
int thislen;
- __be32 nfserr;
__be32 tmp;
__be32 *p;
u32 zzz = 0;
@@ -3510,11 +3522,18 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
}
read->rd_vlen = v;

+ trace_nfsd_read_vector(resp->rqstp, read->rd_fhp,
+ read->rd_offset, maxcount);
len = maxcount;
- nfserr = nfsd_readv(file, read->rd_offset, resp->rqstp->rq_vec,
- read->rd_vlen, &maxcount);
- if (nfserr)
- return nfserr;
+ host_err = nfsd_readv(file, read->rd_offset, resp->rqstp->rq_vec,
+ read->rd_vlen, &maxcount);
+ if (host_err < 0)
+ goto err;
+ trace_nfsd_read_io_done(resp->rqstp, read->rd_fhp,
+ read->rd_offset, maxcount);
+ nfsdstats.io_read += maxcount;
+ fsnotify_access(file);
+
xdr_truncate_encode(xdr, starting_len + 8 + ((maxcount+3)&~3));

eof = nfsd_eof_on_read(len, maxcount, read->rd_offset,
@@ -3530,6 +3549,10 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
&zzz, pad);
return 0;

+err:
+ trace_nfsd_read_err(resp->rqstp, read->rd_fhp,
+ read->rd_offset, host_err);
+ return nfserrno(host_err);
}

static __be32
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 653e9ee..a8bbd9d 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -43,7 +43,8 @@
TP_ARGS(rqstp, fhp, offset, len))

DEFINE_NFSD_IO_EVENT(read_start);
-DEFINE_NFSD_IO_EVENT(read_opened);
+DEFINE_NFSD_IO_EVENT(read_splice);
+DEFINE_NFSD_IO_EVENT(read_vector);
DEFINE_NFSD_IO_EVENT(read_io_done);
DEFINE_NFSD_IO_EVENT(read_done);
DEFINE_NFSD_IO_EVENT(write_start);
@@ -82,6 +83,7 @@
int len), \
TP_ARGS(rqstp, fhp, offset, len))

+DEFINE_NFSD_ERR_EVENT(read_err);
DEFINE_NFSD_ERR_EVENT(write_err);

#include "state.h"
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ee59a0b..0fa54b5 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -881,19 +881,7 @@ static int nfsd_direct_splice_actor(struct pipe_inode_info *pipe,
return __splice_from_pipe(pipe, sd, nfsd_splice_actor);
}

-static __be32
-nfsd_finish_read(struct file *file, unsigned long *count, int host_err)
-{
- if (host_err >= 0) {
- nfsdstats.io_read += host_err;
- *count = host_err;
- fsnotify_access(file);
- return 0;
- } else
- return nfserrno(host_err);
-}
-
-__be32 nfsd_splice_read(struct svc_rqst *rqstp,
+int nfsd_splice_read(struct svc_rqst *rqstp,
struct file *file, loff_t offset, unsigned long *count)
{
struct splice_desc sd = {
@@ -902,23 +890,18 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp,
.pos = offset,
.u.data = rqstp,
};
- int host_err;

rqstp->rq_next_page = rqstp->rq_respages + 1;
- host_err = splice_direct_to_actor(file, &sd, nfsd_direct_splice_actor);
- return nfsd_finish_read(file, count, host_err);
+ return splice_direct_to_actor(file, &sd, nfsd_direct_splice_actor);
}

-__be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
- unsigned long *count)
+int nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
+ unsigned long *count)
{
struct iov_iter iter;
- int host_err;

iov_iter_kvec(&iter, READ | ITER_KVEC, vec, vlen, *count);
- host_err = vfs_iter_read(file, &iter, &offset, 0);
-
- return nfsd_finish_read(file, count, host_err);
+ return vfs_iter_read(file, &iter, &offset, 0);
}

/*
@@ -1025,6 +1008,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
{
struct file *file;
struct raparms *ra;
+ int host_err;
__be32 err;

trace_nfsd_read_start(rqstp, fhp, offset, *count);
@@ -1034,14 +1018,24 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,

ra = nfsd_init_raparms(file);

- trace_nfsd_read_opened(rqstp, fhp, offset, *count);
-
- if (file->f_op->splice_read && test_bit(RQ_SPLICE_OK, &rqstp->rq_flags))
- err = nfsd_splice_read(rqstp, file, offset, count);
- else
- err = nfsd_readv(file, offset, vec, vlen, count);
-
- trace_nfsd_read_io_done(rqstp, fhp, offset, *count);
+ if (file->f_op->splice_read &&
+ test_bit(RQ_SPLICE_OK, &rqstp->rq_flags)) {
+ trace_nfsd_read_splice(rqstp, fhp, offset, *count);
+ host_err = nfsd_splice_read(rqstp, file, offset, count);
+ } else {
+ trace_nfsd_read_vector(rqstp, fhp, offset, *count);
+ host_err = nfsd_readv(file, offset, vec, vlen, count);
+ }
+ if (host_err >= 0) {
+ trace_nfsd_read_io_done(rqstp, fhp, offset, host_err);
+ nfsdstats.io_read += host_err;
+ *count = host_err;
+ fsnotify_access(file);
+ err = nfs_ok;
+ } else {
+ trace_nfsd_read_err(rqstp, fhp, offset, host_err);
+ err = nfserrno(host_err);
+ }

if (ra)
nfsd_put_raparams(file, ra);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index be6d8e0..d9131c3 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -78,10 +78,10 @@ __be32 nfsd_commit(struct svc_rqst *, struct svc_fh *,
__be32 nfsd_open(struct svc_rqst *, struct svc_fh *, umode_t,
int, struct file **);
struct raparms;
-__be32 nfsd_splice_read(struct svc_rqst *,
- struct file *, loff_t, unsigned long *);
-__be32 nfsd_readv(struct file *, loff_t, struct kvec *, int,
- unsigned long *);
+int nfsd_splice_read(struct svc_rqst *rqstp, struct file *file,
+ loff_t offset, unsigned long *count);
+int nfsd_readv(struct file *file, loff_t offset, struct kvec *vec,
+ int vlen, unsigned long *count);
__be32 nfsd_read(struct svc_rqst *, struct svc_fh *,
loff_t, struct kvec *, int, unsigned long *);
__be32 nfsd_write(struct svc_rqst *, struct svc_fh *, loff_t,


2018-03-13 15:44:45

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH 12/15] nfsd: Add "nfsd_" to trace point names

Follow naming convention used in client and in sunrpc layers.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfs4layouts.c | 16 ++++++++--------
fs/nfsd/nfs4proc.c | 4 ++--
fs/nfsd/trace.h | 4 ++--
fs/nfsd/vfs.c | 16 ++++++++--------
4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index 7d88836..228faf0 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -165,7 +165,7 @@ void nfsd4_setup_layout_type(struct svc_export *exp)
struct nfs4_client *clp = ls->ls_stid.sc_client;
struct nfs4_file *fp = ls->ls_stid.sc_file;

- trace_layoutstate_free(&ls->ls_stid.sc_stateid);
+ trace_nfsd_layoutstate_free(&ls->ls_stid.sc_stateid);

spin_lock(&clp->cl_lock);
list_del_init(&ls->ls_perclnt);
@@ -264,7 +264,7 @@ void nfsd4_setup_layout_type(struct svc_export *exp)
list_add(&ls->ls_perfile, &fp->fi_lo_states);
spin_unlock(&fp->fi_lock);

- trace_layoutstate_alloc(&ls->ls_stid.sc_stateid);
+ trace_nfsd_layoutstate_alloc(&ls->ls_stid.sc_stateid);
return ls;
}

@@ -334,7 +334,7 @@ void nfsd4_setup_layout_type(struct svc_export *exp)
if (list_empty(&ls->ls_layouts))
goto out_unlock;

- trace_layout_recall(&ls->ls_stid.sc_stateid);
+ trace_nfsd_layout_recall(&ls->ls_stid.sc_stateid);

refcount_inc(&ls->ls_stid.sc_count);
nfsd4_run_cb(&ls->ls_recall);
@@ -507,7 +507,7 @@ void nfsd4_setup_layout_type(struct svc_export *exp)
false, lrp->lr_layout_type,
&ls);
if (nfserr) {
- trace_layout_return_lookup_fail(&lrp->lr_sid);
+ trace_nfsd_layout_return_lookup_fail(&lrp->lr_sid);
return nfserr;
}

@@ -523,7 +523,7 @@ void nfsd4_setup_layout_type(struct svc_export *exp)
nfs4_inc_and_copy_stateid(&lrp->lr_sid, &ls->ls_stid);
lrp->lrs_present = 1;
} else {
- trace_layoutstate_unhash(&ls->ls_stid.sc_stateid);
+ trace_nfsd_layoutstate_unhash(&ls->ls_stid.sc_stateid);
nfs4_unhash_stid(&ls->ls_stid);
lrp->lrs_present = 0;
}
@@ -694,7 +694,7 @@ void nfsd4_setup_layout_type(struct svc_export *exp)
/*
* Unknown error or non-responding client, we'll need to fence.
*/
- trace_layout_recall_fail(&ls->ls_stid.sc_stateid);
+ trace_nfsd_layout_recall_fail(&ls->ls_stid.sc_stateid);

ops = nfsd4_layout_ops[ls->ls_layout_type];
if (ops->fence_client)
@@ -703,7 +703,7 @@ void nfsd4_setup_layout_type(struct svc_export *exp)
nfsd4_cb_layout_fail(ls);
return -1;
case -NFS4ERR_NOMATCHING_LAYOUT:
- trace_layout_recall_done(&ls->ls_stid.sc_stateid);
+ trace_nfsd_layout_recall_done(&ls->ls_stid.sc_stateid);
task->tk_status = 0;
return 1;
}
@@ -716,7 +716,7 @@ void nfsd4_setup_layout_type(struct svc_export *exp)
container_of(cb, struct nfs4_layout_stateid, ls_recall);
LIST_HEAD(reaplist);

- trace_layout_recall_release(&ls->ls_stid.sc_stateid);
+ trace_nfsd_layout_recall_release(&ls->ls_stid.sc_stateid);

nfsd4_return_all_layouts(ls, &reaplist);
nfsd4_free_layouts(&reaplist);
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index a0bed2b..10c7620 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1412,7 +1412,7 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
nfserr = nfsd4_preprocess_layout_stateid(rqstp, cstate, &lgp->lg_sid,
true, lgp->lg_layout_type, &ls);
if (nfserr) {
- trace_layout_get_lookup_fail(&lgp->lg_sid);
+ trace_nfsd_layout_get_lookup_fail(&lgp->lg_sid);
goto out;
}

@@ -1481,7 +1481,7 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
false, lcp->lc_layout_type,
&ls);
if (nfserr) {
- trace_layout_commit_lookup_fail(&lcp->lc_sid);
+ trace_nfsd_layout_commit_lookup_fail(&lcp->lc_sid);
/* fixup error code as per RFC5661 */
if (nfserr == nfserr_bad_stateid)
nfserr = nfserr_badlayout;
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index 9b79506..30e0485 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -35,7 +35,7 @@
)

#define DEFINE_NFSD_IO_EVENT(name) \
-DEFINE_EVENT(nfsd_io_class, name, \
+DEFINE_EVENT(nfsd_io_class, nfsd_##name, \
TP_PROTO(struct svc_rqst *rqstp, \
struct svc_fh *fhp, \
loff_t offset, \
@@ -76,7 +76,7 @@
)

#define DEFINE_STATEID_EVENT(name) \
-DEFINE_EVENT(nfsd_stateid_class, name, \
+DEFINE_EVENT(nfsd_stateid_class, nfsd_##name, \
TP_PROTO(stateid_t *stp), \
TP_ARGS(stp))
DEFINE_STATEID_EVENT(layoutstate_alloc);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index e2eab4c..8570319 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1024,27 +1024,27 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct raparms *ra;
__be32 err;

- trace_read_start(rqstp, fhp, offset, *count);
+ trace_nfsd_read_start(rqstp, fhp, offset, *count);
err = nfsd_open(rqstp, fhp, S_IFREG, NFSD_MAY_READ, &file);
if (err)
return err;

ra = nfsd_init_raparms(file);

- trace_read_opened(rqstp, fhp, offset, *count);
+ trace_nfsd_read_opened(rqstp, fhp, offset, *count);

if (file->f_op->splice_read && test_bit(RQ_SPLICE_OK, &rqstp->rq_flags))
err = nfsd_splice_read(rqstp, file, offset, count);
else
err = nfsd_readv(file, offset, vec, vlen, count);

- trace_read_io_done(rqstp, fhp, offset, *count);
+ trace_nfsd_read_io_done(rqstp, fhp, offset, *count);

if (ra)
nfsd_put_raparams(file, ra);
fput(file);

- trace_read_done(rqstp, fhp, offset, *count);
+ trace_nfsd_read_done(rqstp, fhp, offset, *count);

return err;
}
@@ -1061,18 +1061,18 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct file *file = NULL;
__be32 err = 0;

- trace_write_start(rqstp, fhp, offset, *cnt);
+ trace_nfsd_write_start(rqstp, fhp, offset, *cnt);

err = nfsd_open(rqstp, fhp, S_IFREG, NFSD_MAY_WRITE, &file);
if (err)
goto out;

- trace_write_opened(rqstp, fhp, offset, *cnt);
+ trace_nfsd_write_opened(rqstp, fhp, offset, *cnt);
err = nfsd_vfs_write(rqstp, fhp, file, offset, vec, vlen, cnt, stable);
- trace_write_io_done(rqstp, fhp, offset, *cnt);
+ trace_nfsd_write_io_done(rqstp, fhp, offset, *cnt);
fput(file);
out:
- trace_write_done(rqstp, fhp, offset, *cnt);
+ trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
return err;
}



2018-03-14 20:35:07

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 07/15] sunrpc: Save remote presentation address in svc_xprt for trace events

On Tue, Mar 13, 2018 at 11:44:16AM -0400, Chuck Lever wrote:
> TP_printk defines a format string that is passed to user space for
> converting raw trace event records to something human-readable.
>
> My user space's printf (Oracle Linux 7), however, does not have a
> %pI format specifier. The result is that what is supposed to be an
> IP address in the output of "trace-cmd report" is just a string that
> says the field couldn't be displayed.
>
> To fix this, adopt the same approach as the client: maintain a pre-
> formated presentation address for occasions when %pI is not
> available.
>
> The location of the trace_svc_send trace point is adjusted so that
> rqst->rq_xprt is not NULL when the trace event is recorded.

So presumably the problem could also be fixed in userspace; why is it
better to do it here?

--b.

>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc_xprt.h | 4 +
> include/trace/events/sunrpc.h | 89 +++++++++++-------------------
> net/sunrpc/svc_xprt.c | 3 +
> net/sunrpc/svcsock.c | 1
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 +
> 5 files changed, 43 insertions(+), 58 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> index 19475ac..c3d7206 100644
> --- a/include/linux/sunrpc/svc_xprt.h
> +++ b/include/linux/sunrpc/svc_xprt.h
> @@ -83,6 +83,7 @@ struct svc_xprt {
> size_t xpt_locallen; /* length of address */
> struct sockaddr_storage xpt_remote; /* remote peer's address */
> size_t xpt_remotelen; /* length of address */
> + char xpt_remotebuf[INET6_ADDRSTRLEN + 10];
> struct rpc_wait_queue xpt_bc_pending; /* backchannel wait queue */
> struct list_head xpt_users; /* callbacks on free */
>
> @@ -152,7 +153,10 @@ static inline void svc_xprt_set_remote(struct svc_xprt *xprt,
> {
> memcpy(&xprt->xpt_remote, sa, salen);
> xprt->xpt_remotelen = salen;
> + snprintf(xprt->xpt_remotebuf, sizeof(xprt->xpt_remotebuf) - 1,
> + "%pISpc", sa);
> }
> +
> static inline unsigned short svc_addr_port(const struct sockaddr *sa)
> {
> const struct sockaddr_in *sin = (const struct sockaddr_in *)sa;
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index 5849bfb..1ec8c4c 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -493,20 +493,18 @@
> __field(u32, xid)
> __field(int, len)
> __field(unsigned long, flags)
> - __dynamic_array(unsigned char, addr, rqst->rq_addrlen)
> + __string(addr, rqst->rq_xprt->xpt_remotebuf)
> ),
>
> TP_fast_assign(
> __entry->xid = be32_to_cpu(rqst->rq_xid);
> __entry->len = len;
> __entry->flags = rqst->rq_flags;
> - memcpy(__get_dynamic_array(addr),
> - &rqst->rq_addr, rqst->rq_addrlen);
> + __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
> ),
>
> - TP_printk("addr=%pIScp xid=0x%08x len=%d flags=%s",
> - (struct sockaddr *)__get_dynamic_array(addr),
> - __entry->xid, __entry->len,
> + TP_printk("addr=%s xid=0x%08x len=%d flags=%s",
> + __get_str(addr), __entry->xid, __entry->len,
> show_rqstp_flags(__entry->flags))
> );
>
> @@ -519,20 +517,18 @@
> TP_STRUCT__entry(
> __field(u32, xid)
> __field(unsigned long, flags)
> - __dynamic_array(unsigned char, addr, rqst->rq_addrlen)
> + __string(addr, rqst->rq_xprt->xpt_remotebuf)
> ),
>
> TP_fast_assign(
> __entry->xid = be32_to_cpu(rqst->rq_xid);
> __entry->flags = rqst->rq_flags;
> - memcpy(__get_dynamic_array(addr),
> - &rqst->rq_addr, rqst->rq_addrlen);
> + __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
> ),
>
> - TP_printk("addr=%pIScp rq_xid=0x%08x flags=%s",
> - (struct sockaddr *)__get_dynamic_array(addr),
> - __entry->xid,
> - show_rqstp_flags(__entry->flags))
> + TP_printk("addr=%s xid=0x%08x flags=%s",
> + __get_str(addr), __entry->xid,
> + show_rqstp_flags(__entry->flags))
> );
>
> DEFINE_EVENT(svc_rqst_event, svc_defer,
> @@ -553,21 +549,19 @@
> __field(u32, xid)
> __field(int, status)
> __field(unsigned long, flags)
> - __dynamic_array(unsigned char, addr, rqst->rq_addrlen)
> + __string(addr, rqst->rq_xprt->xpt_remotebuf)
> ),
>
> TP_fast_assign(
> __entry->xid = be32_to_cpu(rqst->rq_xid);
> __entry->status = status;
> __entry->flags = rqst->rq_flags;
> - memcpy(__get_dynamic_array(addr),
> - &rqst->rq_addr, rqst->rq_addrlen);
> + __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
> ),
>
> - TP_printk("addr=%pIScp rq_xid=0x%08x status=%d flags=%s",
> - (struct sockaddr *)__get_dynamic_array(addr),
> - __entry->xid,
> - __entry->status, show_rqstp_flags(__entry->flags))
> + TP_printk("addr=%s xid=0x%08x status=%d flags=%s",
> + __get_str(addr), __entry->xid,
> + __entry->status, show_rqstp_flags(__entry->flags))
> );
>
> DEFINE_EVENT(svc_rqst_status, svc_process,
> @@ -604,26 +598,19 @@
> __field(struct svc_xprt *, xprt)
> __field(int, pid)
> __field(unsigned long, flags)
> - __dynamic_array(unsigned char, addr, xprt != NULL ?
> - xprt->xpt_remotelen : 0)
> + __string(addr, xprt->xpt_remotebuf)
> ),
>
> TP_fast_assign(
> __entry->xprt = xprt;
> __entry->pid = rqst? rqst->rq_task->pid : 0;
> - if (xprt) {
> - memcpy(__get_dynamic_array(addr),
> - &xprt->xpt_remote,
> - xprt->xpt_remotelen);
> - __entry->flags = xprt->xpt_flags;
> - } else
> - __entry->flags = 0;
> + __entry->flags = xprt->xpt_flags;
> + __assign_str(addr, xprt->xpt_remotebuf);
> ),
>
> - TP_printk("xprt=0x%p addr=%pIScp pid=%d flags=%s", __entry->xprt,
> - __get_dynamic_array_len(addr) != 0 ?
> - (struct sockaddr *)__get_dynamic_array(addr) : NULL,
> - __entry->pid, show_svc_xprt_flags(__entry->flags))
> + TP_printk("xprt=%p addr=%s pid=%d flags=%s",
> + __entry->xprt, __get_str(addr),
> + __entry->pid, show_svc_xprt_flags(__entry->flags))
> );
>
> DECLARE_EVENT_CLASS(svc_xprt_event,
> @@ -634,19 +621,18 @@
> TP_STRUCT__entry(
> __field(struct svc_xprt *, xprt)
> __field(unsigned long, flags)
> - __dynamic_array(unsigned char, addr, xprt->xpt_remotelen)
> + __string(addr, xprt->xpt_remotebuf)
> ),
>
> TP_fast_assign(
> __entry->xprt = xprt;
> __entry->flags = xprt->xpt_flags;
> - memcpy(__get_dynamic_array(addr),
> - &xprt->xpt_remote, xprt->xpt_remotelen);
> + __assign_str(addr, xprt->xpt_remotebuf);
> ),
>
> - TP_printk("xprt=0x%p addr=%pIScp flags=%s", __entry->xprt,
> - (struct sockaddr *)__get_dynamic_array(addr),
> - show_svc_xprt_flags(__entry->flags))
> + TP_printk("xprt=%p addr=%s flags=%s",
> + __entry->xprt, __get_str(addr),
> + show_svc_xprt_flags(__entry->flags))
> );
>
> DEFINE_EVENT(svc_xprt_event, svc_xprt_dequeue,
> @@ -682,25 +668,18 @@
> __field(struct svc_xprt *, xprt)
> __field(int, len)
> __field(unsigned long, flags)
> - __dynamic_array(unsigned char, addr, xprt != NULL ?
> - xprt->xpt_remotelen : 0)
> + __string(addr, xprt->xpt_remotebuf)
> ),
>
> TP_fast_assign(
> __entry->xprt = xprt;
> __entry->len = len;
> - if (xprt) {
> - memcpy(__get_dynamic_array(addr),
> - &xprt->xpt_remote,
> - xprt->xpt_remotelen);
> - __entry->flags = xprt->xpt_flags;
> - } else
> - __entry->flags = 0;
> + __entry->flags = xprt->xpt_flags;
> + __assign_str(addr, xprt->xpt_remotebuf);
> ),
>
> - TP_printk("xprt=0x%p addr=%pIScp len=%d flags=%s", __entry->xprt,
> - __get_dynamic_array_len(addr) != 0 ?
> - (struct sockaddr *)__get_dynamic_array(addr) : NULL,
> + TP_printk("xprt=%p addr=%s len=%d flags=%s",
> + __entry->xprt, __get_str(addr),
> __entry->len, show_svc_xprt_flags(__entry->flags))
> );
>
> @@ -712,18 +691,16 @@
>
> TP_STRUCT__entry(
> __field(u32, xid)
> - __dynamic_array(unsigned char, addr, dr->addrlen)
> + __string(addr, dr->xprt->xpt_remotebuf)
> ),
>
> TP_fast_assign(
> __entry->xid = be32_to_cpu(*(__be32 *)(dr->args +
> (dr->xprt_hlen>>2)));
> - memcpy(__get_dynamic_array(addr), &dr->addr, dr->addrlen);
> + __assign_str(addr, dr->xprt->xpt_remotebuf);
> ),
>
> - TP_printk("addr=%pIScp xid=0x%08x",
> - (struct sockaddr *)__get_dynamic_array(addr),
> - __entry->xid)
> + TP_printk("addr=%s xid=0x%08x", __get_str(addr), __entry->xid)
> );
>
> DEFINE_EVENT(svc_deferred_event, svc_drop_deferred,
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 47384d0..f745754 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -173,6 +173,7 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
> set_bit(XPT_BUSY, &xprt->xpt_flags);
> rpc_init_wait_queue(&xprt->xpt_bc_pending, "xpt_bc_pending");
> xprt->xpt_net = get_net(net);
> + strcpy(xprt->xpt_remotebuf, "uninitialized");
> }
> EXPORT_SYMBOL_GPL(svc_xprt_init);
>
> @@ -894,12 +895,12 @@ int svc_send(struct svc_rqst *rqstp)
> len = xprt->xpt_ops->xpo_sendto(rqstp);
> mutex_unlock(&xprt->xpt_mutex);
> rpc_wake_up(&xprt->xpt_bc_pending);
> + trace_svc_send(rqstp, len);
> svc_xprt_release(rqstp);
>
> if (len == -ECONNREFUSED || len == -ENOTCONN || len == -EAGAIN)
> len = 0;
> out:
> - trace_svc_send(rqstp, len);
> return len;
> }
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 9b67035..4ca1d92 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1310,6 +1310,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
> set_bit(XPT_CONG_CTRL, &svsk->sk_xprt.xpt_flags);
> if (sk->sk_state == TCP_LISTEN) {
> dprintk("setting up TCP socket for listening\n");
> + strcpy(svsk->sk_xprt.xpt_remotebuf, "listener");
> set_bit(XPT_LISTENER, &svsk->sk_xprt.xpt_flags);
> sk->sk_data_ready = svc_tcp_listen_data_ready;
> set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index b083ac8..ca211dc 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -401,8 +401,10 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
> */
> set_bit(XPT_CONG_CTRL, &cma_xprt->sc_xprt.xpt_flags);
>
> - if (listener)
> + if (listener) {
> + strcpy(cma_xprt->sc_xprt.xpt_remotebuf, "listener");
> set_bit(XPT_LISTENER, &cma_xprt->sc_xprt.xpt_flags);
> + }
>
> return cma_xprt;
> }

2018-03-14 20:50:43

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 09/15] sunrpc: Report per-RPC execution stats

On Tue, Mar 13, 2018 at 11:44:27AM -0400, Chuck Lever wrote:
> Introduce a mechanism to report the server-side execution latency of
> each RPC. The goal is to enable user space to filter the trace
> record for latency outliers, build histograms, etc.

Neato.

To be useful I'd think you'd want more information about each RPC.
(E.g. you'd like to know that the latency outliers all did reads). I
guess you could use the address and xid to cross-reference with
information collected from somewhere else?

What's our commitment to backwards compatibility? Trace points seem to
be the wild west compared to the rest of the kernel interface, but if
we want to encourage tooling on top of this then I guess we'll need to
be strict.

Looking at the tcp case, I think it's measuring from the time
tcp_recvfrom receives the last fragment making up an rpc request till
the last sendpage() of the reply returns. Did you consider other spots?
(E.g. why after the send instead of before?)

--b.

>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc.h | 1 +
> include/linux/sunrpc/svc_rdma.h | 1 +
> include/trace/events/sunrpc.h | 20 ++++++++++++++++++++
> net/sunrpc/svc_xprt.c | 2 +-
> net/sunrpc/svcsock.c | 2 ++
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 +
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 +
> 7 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 786ae22..1939709 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -245,6 +245,7 @@ struct svc_rqst {
> void * rq_xprt_ctxt; /* transport specific context ptr */
> struct svc_deferred_req*rq_deferred; /* deferred request we are replaying */
>
> + ktime_t rq_rtime; /* receive time */
> size_t rq_xprt_hlen; /* xprt header len */
> struct xdr_buf rq_arg;
> struct xdr_buf rq_res;
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 4b731b0..8045981 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -79,6 +79,7 @@ struct svc_rdma_op_ctxt {
> struct list_head list;
> struct xdr_buf arg;
> struct ib_cqe cqe;
> + ktime_t rtime;
> u32 byte_len;
> struct svcxprt_rdma *xprt;
> enum dma_data_direction direction;
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index 5a8157c..50de512 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -705,6 +705,26 @@
> __entry->len, show_svc_xprt_flags(__entry->flags))
> );
>
> +TRACE_EVENT(svc_stats_latency,
> + TP_PROTO(const struct svc_rqst *rqst, ktime_t end),
> +
> + TP_ARGS(rqst, end),
> +
> + TP_STRUCT__entry(
> + __field(u32, xid)
> + __field(unsigned long, execute)
> + __string(addr, rqst->rq_xprt->xpt_remotebuf)
> + ),
> +
> + TP_fast_assign(
> + __entry->xid = be32_to_cpu(rqst->rq_xid);
> + __entry->execute = ktime_to_us(ktime_sub(end, rqst->rq_rtime));
> + __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
> + ),
> +
> + TP_printk("addr=%s xid=0x%08x execute=%lu",
> + __get_str(addr), __entry->xid, __entry->execute)
> +);
>
> DECLARE_EVENT_CLASS(svc_deferred_event,
> TP_PROTO(struct svc_deferred_req *dr),
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index f745754..2556e6e 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -782,7 +782,6 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
> len = svc_deferred_recv(rqstp);
> else
> len = xprt->xpt_ops->xpo_recvfrom(rqstp);
> - dprintk("svc: got len=%d\n", len);
> rqstp->rq_reserved = serv->sv_max_mesg;
> atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> }
> @@ -893,6 +892,7 @@ int svc_send(struct svc_rqst *rqstp)
> len = -ENOTCONN;
> else
> len = xprt->xpt_ops->xpo_sendto(rqstp);
> + trace_svc_stats_latency(rqstp, ktime_get_real());
> mutex_unlock(&xprt->xpt_mutex);
> rpc_wake_up(&xprt->xpt_bc_pending);
> trace_svc_send(rqstp, len);
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 4ca1d92..f616962 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -574,6 +574,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
> /* Don't enable netstamp, sunrpc doesn't
> need that much accuracy */
> }
> + rqstp->rq_rtime = skb->tstamp;
> svsk->sk_sk->sk_stamp = skb->tstamp;
> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be more data... */
>
> @@ -1143,6 +1144,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> if (len < 0)
> goto error;
>
> + rqstp->rq_rtime = ktime_get_real();
> svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
> if (serv->sv_stats)
> serv->sv_stats->nettcpcnt++;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 19e9c6b..446b9d6 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -558,6 +558,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> goto out_readchunk;
>
> complete:
> + rqstp->rq_rtime = ctxt->rtime;
> svc_rdma_put_context(ctxt, 0);
> dprintk("svcrdma: recvfrom: xprt=%p, rqstp=%p, rq_arg.len=%u\n",
> rdma_xprt, rqstp, rqstp->rq_arg.len);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index ca211dc..08fa7de 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -316,6 +316,7 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
> goto flushed;
>
> /* All wc fields are now known to be valid */
> + ctxt->rtime = ktime_get_real();
> ctxt->byte_len = wc->byte_len;
> spin_lock(&xprt->sc_rq_dto_lock);
> list_add_tail(&ctxt->list, &xprt->sc_rq_dto_q);

2018-03-14 20:52:50

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 11/15] nfsd: Record request byte count, not count of vectors

On Tue, Mar 13, 2018 at 11:44:37AM -0400, Chuck Lever wrote:
> TP_fast_assign(
> __entry->xid = be32_to_cpu(rqstp->rq_xid);
> - do {
> - struct knfsd_fh fh;
> -
> - fh_copy_shallow(&fh, &fhp->fh_handle);
> - __entry->fh_hash = knfsd_fh_hash(&fh);
> - } while (0);
> + __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
> __entry->offset = offset;
> __entry->len = len;
> ),

What's that about? Why did someone originally think we needed to copy
before hashing?

--b.

2018-03-14 20:55:15

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 15/15] nfsd: Trace NFSv4 COMPOUND execution

On Tue, Mar 13, 2018 at 11:44:59AM -0400, Chuck Lever wrote:
> This helps record the identity and timing of the ops in each NFSv4
> COMPOUND, replacing dprintk calls that did much the same thing.

Would it be better to leave it to userspace to translate an operation
number to a name?

--b.

>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> fs/nfsd/nfs4proc.c | 9 +++------
> fs/nfsd/trace.h | 39 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 42 insertions(+), 6 deletions(-)
>
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 39016b6..0df37e0 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -1722,12 +1722,10 @@ static void svcxdr_init_encode(struct svc_rqst *rqstp,
> goto encode_op;
> }
>
> + trace_nfsd_compound(rqstp, args->opcnt);
> while (!status && resp->opcnt < args->opcnt) {
> op = &args->ops[resp->opcnt++];
>
> - dprintk("nfsv4 compound op #%d/%d: %d (%s)\n",
> - resp->opcnt, args->opcnt, op->opnum,
> - nfsd4_op_name(op->opnum));
> /*
> * The XDR decode routines may have pre-set op->status;
> * for example, if there is a miscellaneous XDR error
> @@ -1801,9 +1799,8 @@ static void svcxdr_init_encode(struct svc_rqst *rqstp,
> status = op->status;
> }
>
> - dprintk("nfsv4 compound op %p opcnt %d #%d: %d: status %d\n",
> - args->ops, args->opcnt, resp->opcnt, op->opnum,
> - be32_to_cpu(status));
> + trace_nfsd_compound_status(args->opcnt, resp->opcnt, status,
> + nfsd4_op_name(op->opnum));
>
> nfsd4_cstate_clear_replay(cstate);
> nfsd4_increment_op_stats(op->opnum);
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index a8bbd9d..80933e4 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -11,6 +11,45 @@
> #include <linux/tracepoint.h>
> #include "nfsfh.h"
>
> +TRACE_EVENT(nfsd_compound,
> + TP_PROTO(const struct svc_rqst *rqst,
> + u32 args_opcnt),
> + TP_ARGS(rqst, args_opcnt),
> + TP_STRUCT__entry(
> + __field(u32, xid)
> + __field(u32, args_opcnt)
> + ),
> + TP_fast_assign(
> + __entry->xid = be32_to_cpu(rqst->rq_xid);
> + __entry->args_opcnt = args_opcnt;
> + ),
> + TP_printk("xid=0x%08x opcnt=%u",
> + __entry->xid, __entry->args_opcnt)
> +)
> +
> +TRACE_EVENT(nfsd_compound_status,
> + TP_PROTO(u32 args_opcnt,
> + u32 resp_opcnt,
> + __be32 status,
> + const char *name),
> + TP_ARGS(args_opcnt, resp_opcnt, status, name),
> + TP_STRUCT__entry(
> + __field(u32, args_opcnt)
> + __field(u32, resp_opcnt)
> + __field(int, status)
> + __string(name, name)
> + ),
> + TP_fast_assign(
> + __entry->args_opcnt = args_opcnt;
> + __entry->resp_opcnt = resp_opcnt;
> + __entry->status = be32_to_cpu(status);
> + __assign_str(name, name);
> + ),
> + TP_printk("op=%u/%u %s status=%d",
> + __entry->resp_opcnt, __entry->args_opcnt,
> + __get_str(name), __entry->status)
> +)
> +
> DECLARE_EVENT_CLASS(nfsd_io_class,
> TP_PROTO(struct svc_rqst *rqstp,
> struct svc_fh *fhp,

2018-03-15 00:54:08

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 07/15] sunrpc: Save remote presentation address in svc_xprt for trace events



> On Mar 14, 2018, at 4:35 PM, J. Bruce Fields <[email protected]> =
wrote:
>=20
> On Tue, Mar 13, 2018 at 11:44:16AM -0400, Chuck Lever wrote:
>> TP_printk defines a format string that is passed to user space for
>> converting raw trace event records to something human-readable.
>>=20
>> My user space's printf (Oracle Linux 7), however, does not have a
>> %pI format specifier. The result is that what is supposed to be an
>> IP address in the output of "trace-cmd report" is just a string that
>> says the field couldn't be displayed.
>>=20
>> To fix this, adopt the same approach as the client: maintain a pre-
>> formated presentation address for occasions when %pI is not
>> available.
>>=20
>> The location of the trace_svc_send trace point is adjusted so that
>> rqst->rq_xprt is not NULL when the trace event is recorded.
>=20
> So presumably the problem could also be fixed in userspace; why is it
> better to do it here?

"User space" means the C library's printf() implementation. You'd have
to address this by adding %pI to every possible C library used by Linux
distributions.

Also, as mentioned above, we've already used this approach successfully
in several other similar places.


> --b.
>=20
>>=20
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/svc_xprt.h | 4 +
>> include/trace/events/sunrpc.h | 89 =
+++++++++++-------------------
>> net/sunrpc/svc_xprt.c | 3 +
>> net/sunrpc/svcsock.c | 1=20
>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 +
>> 5 files changed, 43 insertions(+), 58 deletions(-)
>>=20
>> diff --git a/include/linux/sunrpc/svc_xprt.h =
b/include/linux/sunrpc/svc_xprt.h
>> index 19475ac..c3d7206 100644
>> --- a/include/linux/sunrpc/svc_xprt.h
>> +++ b/include/linux/sunrpc/svc_xprt.h
>> @@ -83,6 +83,7 @@ struct svc_xprt {
>> size_t xpt_locallen; /* length of address */
>> struct sockaddr_storage xpt_remote; /* remote peer's address =
*/
>> size_t xpt_remotelen; /* length of address */
>> + char xpt_remotebuf[INET6_ADDRSTRLEN + 10];
>> struct rpc_wait_queue xpt_bc_pending; /* backchannel wait =
queue */
>> struct list_head xpt_users; /* callbacks on free */
>>=20
>> @@ -152,7 +153,10 @@ static inline void svc_xprt_set_remote(struct =
svc_xprt *xprt,
>> {
>> memcpy(&xprt->xpt_remote, sa, salen);
>> xprt->xpt_remotelen =3D salen;
>> + snprintf(xprt->xpt_remotebuf, sizeof(xprt->xpt_remotebuf) - 1,
>> + "%pISpc", sa);
>> }
>> +
>> static inline unsigned short svc_addr_port(const struct sockaddr *sa)
>> {
>> const struct sockaddr_in *sin =3D (const struct sockaddr_in =
*)sa;
>> diff --git a/include/trace/events/sunrpc.h =
b/include/trace/events/sunrpc.h
>> index 5849bfb..1ec8c4c 100644
>> --- a/include/trace/events/sunrpc.h
>> +++ b/include/trace/events/sunrpc.h
>> @@ -493,20 +493,18 @@
>> __field(u32, xid)
>> __field(int, len)
>> __field(unsigned long, flags)
>> - __dynamic_array(unsigned char, addr, rqst->rq_addrlen)
>> + __string(addr, rqst->rq_xprt->xpt_remotebuf)
>> ),
>>=20
>> TP_fast_assign(
>> __entry->xid =3D be32_to_cpu(rqst->rq_xid);
>> __entry->len =3D len;
>> __entry->flags =3D rqst->rq_flags;
>> - memcpy(__get_dynamic_array(addr),
>> - &rqst->rq_addr, rqst->rq_addrlen);
>> + __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
>> ),
>>=20
>> - TP_printk("addr=3D%pIScp xid=3D0x%08x len=3D%d flags=3D%s",
>> - (struct sockaddr *)__get_dynamic_array(addr),
>> - __entry->xid, __entry->len,
>> + TP_printk("addr=3D%s xid=3D0x%08x len=3D%d flags=3D%s",
>> + __get_str(addr), __entry->xid, __entry->len,
>> show_rqstp_flags(__entry->flags))
>> );
>>=20
>> @@ -519,20 +517,18 @@
>> TP_STRUCT__entry(
>> __field(u32, xid)
>> __field(unsigned long, flags)
>> - __dynamic_array(unsigned char, addr, rqst->rq_addrlen)
>> + __string(addr, rqst->rq_xprt->xpt_remotebuf)
>> ),
>>=20
>> TP_fast_assign(
>> __entry->xid =3D be32_to_cpu(rqst->rq_xid);
>> __entry->flags =3D rqst->rq_flags;
>> - memcpy(__get_dynamic_array(addr),
>> - &rqst->rq_addr, rqst->rq_addrlen);
>> + __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
>> ),
>>=20
>> - TP_printk("addr=3D%pIScp rq_xid=3D0x%08x flags=3D%s",
>> - (struct sockaddr *)__get_dynamic_array(addr),
>> - __entry->xid,
>> - show_rqstp_flags(__entry->flags))
>> + TP_printk("addr=3D%s xid=3D0x%08x flags=3D%s",
>> + __get_str(addr), __entry->xid,
>> + show_rqstp_flags(__entry->flags))
>> );
>>=20
>> DEFINE_EVENT(svc_rqst_event, svc_defer,
>> @@ -553,21 +549,19 @@
>> __field(u32, xid)
>> __field(int, status)
>> __field(unsigned long, flags)
>> - __dynamic_array(unsigned char, addr, rqst->rq_addrlen)
>> + __string(addr, rqst->rq_xprt->xpt_remotebuf)
>> ),
>>=20
>> TP_fast_assign(
>> __entry->xid =3D be32_to_cpu(rqst->rq_xid);
>> __entry->status =3D status;
>> __entry->flags =3D rqst->rq_flags;
>> - memcpy(__get_dynamic_array(addr),
>> - &rqst->rq_addr, rqst->rq_addrlen);
>> + __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
>> ),
>>=20
>> - TP_printk("addr=3D%pIScp rq_xid=3D0x%08x status=3D%d flags=3D%s",
>> - (struct sockaddr *)__get_dynamic_array(addr),
>> - __entry->xid,
>> - __entry->status, show_rqstp_flags(__entry->flags))
>> + TP_printk("addr=3D%s xid=3D0x%08x status=3D%d flags=3D%s",
>> + __get_str(addr), __entry->xid,
>> + __entry->status, show_rqstp_flags(__entry->flags))
>> );
>>=20
>> DEFINE_EVENT(svc_rqst_status, svc_process,
>> @@ -604,26 +598,19 @@
>> __field(struct svc_xprt *, xprt)
>> __field(int, pid)
>> __field(unsigned long, flags)
>> - __dynamic_array(unsigned char, addr, xprt !=3D NULL ?
>> - xprt->xpt_remotelen : 0)
>> + __string(addr, xprt->xpt_remotebuf)
>> ),
>>=20
>> TP_fast_assign(
>> __entry->xprt =3D xprt;
>> __entry->pid =3D rqst? rqst->rq_task->pid : 0;
>> - if (xprt) {
>> - memcpy(__get_dynamic_array(addr),
>> - &xprt->xpt_remote,
>> - xprt->xpt_remotelen);
>> - __entry->flags =3D xprt->xpt_flags;
>> - } else
>> - __entry->flags =3D 0;
>> + __entry->flags =3D xprt->xpt_flags;
>> + __assign_str(addr, xprt->xpt_remotebuf);
>> ),
>>=20
>> - TP_printk("xprt=3D0x%p addr=3D%pIScp pid=3D%d flags=3D%s", =
__entry->xprt,
>> - __get_dynamic_array_len(addr) !=3D 0 ?
>> - (struct sockaddr *)__get_dynamic_array(addr) : =
NULL,
>> - __entry->pid, show_svc_xprt_flags(__entry->flags))
>> + TP_printk("xprt=3D%p addr=3D%s pid=3D%d flags=3D%s",
>> + __entry->xprt, __get_str(addr),
>> + __entry->pid, =
show_svc_xprt_flags(__entry->flags))
>> );
>>=20
>> DECLARE_EVENT_CLASS(svc_xprt_event,
>> @@ -634,19 +621,18 @@
>> TP_STRUCT__entry(
>> __field(struct svc_xprt *, xprt)
>> __field(unsigned long, flags)
>> - __dynamic_array(unsigned char, addr, =
xprt->xpt_remotelen)
>> + __string(addr, xprt->xpt_remotebuf)
>> ),
>>=20
>> TP_fast_assign(
>> __entry->xprt =3D xprt;
>> __entry->flags =3D xprt->xpt_flags;
>> - memcpy(__get_dynamic_array(addr),
>> - &xprt->xpt_remote, xprt->xpt_remotelen);
>> + __assign_str(addr, xprt->xpt_remotebuf);
>> ),
>>=20
>> - TP_printk("xprt=3D0x%p addr=3D%pIScp flags=3D%s", __entry->xprt,
>> - (struct sockaddr *)__get_dynamic_array(addr),
>> - show_svc_xprt_flags(__entry->flags))
>> + TP_printk("xprt=3D%p addr=3D%s flags=3D%s",
>> + __entry->xprt, __get_str(addr),
>> + show_svc_xprt_flags(__entry->flags))
>> );
>>=20
>> DEFINE_EVENT(svc_xprt_event, svc_xprt_dequeue,
>> @@ -682,25 +668,18 @@
>> __field(struct svc_xprt *, xprt)
>> __field(int, len)
>> __field(unsigned long, flags)
>> - __dynamic_array(unsigned char, addr, xprt !=3D NULL ?
>> - xprt->xpt_remotelen : 0)
>> + __string(addr, xprt->xpt_remotebuf)
>> ),
>>=20
>> TP_fast_assign(
>> __entry->xprt =3D xprt;
>> __entry->len =3D len;
>> - if (xprt) {
>> - memcpy(__get_dynamic_array(addr),
>> - &xprt->xpt_remote,
>> - xprt->xpt_remotelen);
>> - __entry->flags =3D xprt->xpt_flags;
>> - } else
>> - __entry->flags =3D 0;
>> + __entry->flags =3D xprt->xpt_flags;
>> + __assign_str(addr, xprt->xpt_remotebuf);
>> ),
>>=20
>> - TP_printk("xprt=3D0x%p addr=3D%pIScp len=3D%d flags=3D%s", =
__entry->xprt,
>> - __get_dynamic_array_len(addr) !=3D 0 ?
>> - (struct sockaddr *)__get_dynamic_array(addr) : =
NULL,
>> + TP_printk("xprt=3D%p addr=3D%s len=3D%d flags=3D%s",
>> + __entry->xprt, __get_str(addr),
>> __entry->len, show_svc_xprt_flags(__entry->flags))
>> );
>>=20
>> @@ -712,18 +691,16 @@
>>=20
>> TP_STRUCT__entry(
>> __field(u32, xid)
>> - __dynamic_array(unsigned char, addr, dr->addrlen)
>> + __string(addr, dr->xprt->xpt_remotebuf)
>> ),
>>=20
>> TP_fast_assign(
>> __entry->xid =3D be32_to_cpu(*(__be32 *)(dr->args +
>> =
(dr->xprt_hlen>>2)));
>> - memcpy(__get_dynamic_array(addr), &dr->addr, =
dr->addrlen);
>> + __assign_str(addr, dr->xprt->xpt_remotebuf);
>> ),
>>=20
>> - TP_printk("addr=3D%pIScp xid=3D0x%08x",
>> - (struct sockaddr *)__get_dynamic_array(addr),
>> - __entry->xid)
>> + TP_printk("addr=3D%s xid=3D0x%08x", __get_str(addr), =
__entry->xid)
>> );
>>=20
>> DEFINE_EVENT(svc_deferred_event, svc_drop_deferred,
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index 47384d0..f745754 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -173,6 +173,7 @@ void svc_xprt_init(struct net *net, struct =
svc_xprt_class *xcl,
>> set_bit(XPT_BUSY, &xprt->xpt_flags);
>> rpc_init_wait_queue(&xprt->xpt_bc_pending, "xpt_bc_pending");
>> xprt->xpt_net =3D get_net(net);
>> + strcpy(xprt->xpt_remotebuf, "uninitialized");
>> }
>> EXPORT_SYMBOL_GPL(svc_xprt_init);
>>=20
>> @@ -894,12 +895,12 @@ int svc_send(struct svc_rqst *rqstp)
>> len =3D xprt->xpt_ops->xpo_sendto(rqstp);
>> mutex_unlock(&xprt->xpt_mutex);
>> rpc_wake_up(&xprt->xpt_bc_pending);
>> + trace_svc_send(rqstp, len);
>> svc_xprt_release(rqstp);
>>=20
>> if (len =3D=3D -ECONNREFUSED || len =3D=3D -ENOTCONN || len =3D=3D=
-EAGAIN)
>> len =3D 0;
>> out:
>> - trace_svc_send(rqstp, len);
>> return len;
>> }
>>=20
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index 9b67035..4ca1d92 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -1310,6 +1310,7 @@ static void svc_tcp_init(struct svc_sock *svsk, =
struct svc_serv *serv)
>> set_bit(XPT_CONG_CTRL, &svsk->sk_xprt.xpt_flags);
>> if (sk->sk_state =3D=3D TCP_LISTEN) {
>> dprintk("setting up TCP socket for listening\n");
>> + strcpy(svsk->sk_xprt.xpt_remotebuf, "listener");
>> set_bit(XPT_LISTENER, &svsk->sk_xprt.xpt_flags);
>> sk->sk_data_ready =3D svc_tcp_listen_data_ready;
>> set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c =
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index b083ac8..ca211dc 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -401,8 +401,10 @@ static struct svcxprt_rdma =
*rdma_create_xprt(struct svc_serv *serv,
>> */
>> set_bit(XPT_CONG_CTRL, &cma_xprt->sc_xprt.xpt_flags);
>>=20
>> - if (listener)
>> + if (listener) {
>> + strcpy(cma_xprt->sc_xprt.xpt_remotebuf, "listener");
>> set_bit(XPT_LISTENER, &cma_xprt->sc_xprt.xpt_flags);
>> + }
>>=20
>> return cma_xprt;
>> }

--
Chuck Lever




2018-03-15 01:11:53

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 09/15] sunrpc: Report per-RPC execution stats



> On Mar 14, 2018, at 4:50 PM, J. Bruce Fields <[email protected]> =
wrote:
>=20
> On Tue, Mar 13, 2018 at 11:44:27AM -0400, Chuck Lever wrote:
>> Introduce a mechanism to report the server-side execution latency of
>> each RPC. The goal is to enable user space to filter the trace
>> record for latency outliers, build histograms, etc.
>=20
> Neato.
>=20
> To be useful I'd think you'd want more information about each RPC.
> (E.g. you'd like to know that the latency outliers all did reads). I
> guess you could use the address and xid to cross-reference with
> information collected from somewhere else?

Yes. You can enable other trace points (like the nfsd_compound ones)
to see what class each operation is in.

And yes, I would like to have all the relevant information for each
RPC in a single trace record; I just haven't figured out a way to
extract it as nicely as I did it on the client (patch forthcoming).
On the client side, there is a table set up for each RPC program that
contains an RPC procedure name to procedure number mapping. I was not
able to find a similar convenience on the server side.


> What's our commitment to backwards compatibility? Trace points seem =
to
> be the wild west compared to the rest of the kernel interface, but if
> we want to encourage tooling on top of this then I guess we'll need to
> be strict.

That has been discussed elsewhere (LWN.net and more recently on
linux-rdma). The only compatibility issues are with trace points that
have user tools and infrastructure that depends on them, such as the
scheduler trace points used for latencyTOP. The NFS and sunrpc trace
points do not have this constraint, as they are processed currently
only by generic tools like trace-cmd. So we are free to innovate for
the time being.


> Looking at the tcp case, I think it's measuring from the time
> tcp_recvfrom receives the last fragment making up an rpc request till
> the last sendpage() of the reply returns. Did you consider other =
spots?
> (E.g. why after the send instead of before?)

Yes, I've considered other spots. I don't consider the spots I'm
proposing here to be written in stone. I welcome help for placing the
socket-based timestamp capture points.

Some sendto implementations are more complex than others. For instance,
RPC/RDMA can post RDMA Writes containing data content first, then in the
final step post the RDMA Send carrying the RPC Reply header. The RDMA
Write step can be considered server-side processing, and thus is part
of the latency. Or, if we ever decide to move the RDMA Write step into
the XDR layer, it will definitely be counted as processing latency.

One thing I would like to keep in the latency measurement is how long
this rqstp has waited to acquire the send mutex. But otherwise, I'm
open to other ideas about how to measure this latency.


> --b.
>=20
>>=20
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/svc.h | 1 +
>> include/linux/sunrpc/svc_rdma.h | 1 +
>> include/trace/events/sunrpc.h | 20 ++++++++++++++++++++
>> net/sunrpc/svc_xprt.c | 2 +-
>> net/sunrpc/svcsock.c | 2 ++
>> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 +
>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 +
>> 7 files changed, 27 insertions(+), 1 deletion(-)
>>=20
>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> index 786ae22..1939709 100644
>> --- a/include/linux/sunrpc/svc.h
>> +++ b/include/linux/sunrpc/svc.h
>> @@ -245,6 +245,7 @@ struct svc_rqst {
>> void * rq_xprt_ctxt; /* transport specific =
context ptr */
>> struct svc_deferred_req*rq_deferred; /* deferred request we =
are replaying */
>>=20
>> + ktime_t rq_rtime; /* receive time */
>> size_t rq_xprt_hlen; /* xprt header len */
>> struct xdr_buf rq_arg;
>> struct xdr_buf rq_res;
>> diff --git a/include/linux/sunrpc/svc_rdma.h =
b/include/linux/sunrpc/svc_rdma.h
>> index 4b731b0..8045981 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -79,6 +79,7 @@ struct svc_rdma_op_ctxt {
>> struct list_head list;
>> struct xdr_buf arg;
>> struct ib_cqe cqe;
>> + ktime_t rtime;
>> u32 byte_len;
>> struct svcxprt_rdma *xprt;
>> enum dma_data_direction direction;
>> diff --git a/include/trace/events/sunrpc.h =
b/include/trace/events/sunrpc.h
>> index 5a8157c..50de512 100644
>> --- a/include/trace/events/sunrpc.h
>> +++ b/include/trace/events/sunrpc.h
>> @@ -705,6 +705,26 @@
>> __entry->len, show_svc_xprt_flags(__entry->flags))
>> );
>>=20
>> +TRACE_EVENT(svc_stats_latency,
>> + TP_PROTO(const struct svc_rqst *rqst, ktime_t end),
>> +
>> + TP_ARGS(rqst, end),
>> +
>> + TP_STRUCT__entry(
>> + __field(u32, xid)
>> + __field(unsigned long, execute)
>> + __string(addr, rqst->rq_xprt->xpt_remotebuf)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->xid =3D be32_to_cpu(rqst->rq_xid);
>> + __entry->execute =3D ktime_to_us(ktime_sub(end, =
rqst->rq_rtime));
>> + __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
>> + ),
>> +
>> + TP_printk("addr=3D%s xid=3D0x%08x execute=3D%lu",
>> + __get_str(addr), __entry->xid, __entry->execute)
>> +);
>>=20
>> DECLARE_EVENT_CLASS(svc_deferred_event,
>> TP_PROTO(struct svc_deferred_req *dr),
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index f745754..2556e6e 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -782,7 +782,6 @@ static int svc_handle_xprt(struct svc_rqst =
*rqstp, struct svc_xprt *xprt)
>> len =3D svc_deferred_recv(rqstp);
>> else
>> len =3D xprt->xpt_ops->xpo_recvfrom(rqstp);
>> - dprintk("svc: got len=3D%d\n", len);
>> rqstp->rq_reserved =3D serv->sv_max_mesg;
>> atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>> }
>> @@ -893,6 +892,7 @@ int svc_send(struct svc_rqst *rqstp)
>> len =3D -ENOTCONN;
>> else
>> len =3D xprt->xpt_ops->xpo_sendto(rqstp);
>> + trace_svc_stats_latency(rqstp, ktime_get_real());
>> mutex_unlock(&xprt->xpt_mutex);
>> rpc_wake_up(&xprt->xpt_bc_pending);
>> trace_svc_send(rqstp, len);
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index 4ca1d92..f616962 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -574,6 +574,7 @@ static int svc_udp_recvfrom(struct svc_rqst =
*rqstp)
>> /* Don't enable netstamp, sunrpc doesn't
>> need that much accuracy */
>> }
>> + rqstp->rq_rtime =3D skb->tstamp;
>> svsk->sk_sk->sk_stamp =3D skb->tstamp;
>> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be =
more data... */
>>=20
>> @@ -1143,6 +1144,7 @@ static int svc_tcp_recvfrom(struct svc_rqst =
*rqstp)
>> if (len < 0)
>> goto error;
>>=20
>> + rqstp->rq_rtime =3D ktime_get_real();
>> svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
>> if (serv->sv_stats)
>> serv->sv_stats->nettcpcnt++;
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c =
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> index 19e9c6b..446b9d6 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> @@ -558,6 +558,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
>> goto out_readchunk;
>>=20
>> complete:
>> + rqstp->rq_rtime =3D ctxt->rtime;
>> svc_rdma_put_context(ctxt, 0);
>> dprintk("svcrdma: recvfrom: xprt=3D%p, rqstp=3D%p, =
rq_arg.len=3D%u\n",
>> rdma_xprt, rqstp, rqstp->rq_arg.len);
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c =
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index ca211dc..08fa7de 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -316,6 +316,7 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, =
struct ib_wc *wc)
>> goto flushed;
>>=20
>> /* All wc fields are now known to be valid */
>> + ctxt->rtime =3D ktime_get_real();
>> ctxt->byte_len =3D wc->byte_len;
>> spin_lock(&xprt->sc_rq_dto_lock);
>> list_add_tail(&ctxt->list, &xprt->sc_rq_dto_q);

--
Chuck Lever




2018-03-15 01:18:21

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 11/15] nfsd: Record request byte count, not count of vectors



> On Mar 14, 2018, at 4:52 PM, J. Bruce Fields <[email protected]> wrote:
>
> On Tue, Mar 13, 2018 at 11:44:37AM -0400, Chuck Lever wrote:
>> TP_fast_assign(
>> __entry->xid = be32_to_cpu(rqstp->rq_xid);
>> - do {
>> - struct knfsd_fh fh;
>> -
>> - fh_copy_shallow(&fh, &fhp->fh_handle);
>> - __entry->fh_hash = knfsd_fh_hash(&fh);
>> - } while (0);
>> + __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
>> __entry->offset = offset;
>> __entry->len = len;
>> ),
>
> What's that about? Why did someone originally think we needed to copy
> before hashing?

No idea. I can't find any reason why copying the FH would be necessary.
It's not as if the FH will change while the trace point is computing
the hash.

--
Chuck Lever




2018-03-15 01:38:35

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 15/15] nfsd: Trace NFSv4 COMPOUND execution



> On Mar 14, 2018, at 4:55 PM, J. Bruce Fields <[email protected]> =
wrote:
>=20
> On Tue, Mar 13, 2018 at 11:44:59AM -0400, Chuck Lever wrote:
>> This helps record the identity and timing of the ops in each NFSv4
>> COMPOUND, replacing dprintk calls that did much the same thing.
>=20
> Would it be better to leave it to userspace to translate an operation
> number to a name?

The tools that are used to look at these records are very generic,
like trace-cmd, which doesn't know from NFS. It's just so much nicer
to use that tool and see operation names that match wireshark displays
rather than op numbers.

However, if you'd like to additionally preserve the operation number
in the trace record, that is possible to add very easily.

Here is the difficulty with the latency event I add earlier in the
series: If we simply record the RPC procedure number there, for NFSv4
we'd see just a bunch of "proc=3D1". I added this here trace event
partially to record the latency of processing each operation in a
COMPOUND, and partially to see which operations are associated with
the latency of the whole RPC (ah, it's SEQUENCE, PUTFH, READ, GETATTR,
so this must be a READ request).


> --b.
>=20
>>=20
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> fs/nfsd/nfs4proc.c | 9 +++------
>> fs/nfsd/trace.h | 39 +++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 42 insertions(+), 6 deletions(-)
>>=20
>> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
>> index 39016b6..0df37e0 100644
>> --- a/fs/nfsd/nfs4proc.c
>> +++ b/fs/nfsd/nfs4proc.c
>> @@ -1722,12 +1722,10 @@ static void svcxdr_init_encode(struct =
svc_rqst *rqstp,
>> goto encode_op;
>> }
>>=20
>> + trace_nfsd_compound(rqstp, args->opcnt);
>> while (!status && resp->opcnt < args->opcnt) {
>> op =3D &args->ops[resp->opcnt++];
>>=20
>> - dprintk("nfsv4 compound op #%d/%d: %d (%s)\n",
>> - resp->opcnt, args->opcnt, op->opnum,
>> - nfsd4_op_name(op->opnum));
>> /*
>> * The XDR decode routines may have pre-set op->status;
>> * for example, if there is a miscellaneous XDR error
>> @@ -1801,9 +1799,8 @@ static void svcxdr_init_encode(struct svc_rqst =
*rqstp,
>> status =3D op->status;
>> }
>>=20
>> - dprintk("nfsv4 compound op %p opcnt %d #%d: %d: status =
%d\n",
>> - args->ops, args->opcnt, resp->opcnt, op->opnum,
>> - be32_to_cpu(status));
>> + trace_nfsd_compound_status(args->opcnt, resp->opcnt, =
status,
>> + nfsd4_op_name(op->opnum));

Instead of "op=3D%u/%u" here, maybe the XID might be more useful for
the event that reports each operation in the COMPOUND. On the other
hand, it's pretty obvious which nfsd_compound event these go with:
they will all come from the same nfsd process.


>> nfsd4_cstate_clear_replay(cstate);
>> nfsd4_increment_op_stats(op->opnum);
>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
>> index a8bbd9d..80933e4 100644
>> --- a/fs/nfsd/trace.h
>> +++ b/fs/nfsd/trace.h
>> @@ -11,6 +11,45 @@
>> #include <linux/tracepoint.h>
>> #include "nfsfh.h"
>>=20
>> +TRACE_EVENT(nfsd_compound,
>> + TP_PROTO(const struct svc_rqst *rqst,
>> + u32 args_opcnt),
>> + TP_ARGS(rqst, args_opcnt),
>> + TP_STRUCT__entry(
>> + __field(u32, xid)
>> + __field(u32, args_opcnt)
>> + ),
>> + TP_fast_assign(
>> + __entry->xid =3D be32_to_cpu(rqst->rq_xid);
>> + __entry->args_opcnt =3D args_opcnt;
>> + ),
>> + TP_printk("xid=3D0x%08x opcnt=3D%u",
>> + __entry->xid, __entry->args_opcnt)
>> +)
>> +
>> +TRACE_EVENT(nfsd_compound_status,
>> + TP_PROTO(u32 args_opcnt,
>> + u32 resp_opcnt,
>> + __be32 status,
>> + const char *name),
>> + TP_ARGS(args_opcnt, resp_opcnt, status, name),
>> + TP_STRUCT__entry(
>> + __field(u32, args_opcnt)
>> + __field(u32, resp_opcnt)
>> + __field(int, status)
>> + __string(name, name)
>> + ),
>> + TP_fast_assign(
>> + __entry->args_opcnt =3D args_opcnt;
>> + __entry->resp_opcnt =3D resp_opcnt;
>> + __entry->status =3D be32_to_cpu(status);
>> + __assign_str(name, name);
>> + ),
>> + TP_printk("op=3D%u/%u %s status=3D%d",
>> + __entry->resp_opcnt, __entry->args_opcnt,
>> + __get_str(name), __entry->status)
>> +)
>> +
>> DECLARE_EVENT_CLASS(nfsd_io_class,
>> TP_PROTO(struct svc_rqst *rqstp,
>> struct svc_fh *fhp,
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" =
in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




2018-03-15 01:43:29

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 07/15] sunrpc: Save remote presentation address in svc_xprt for trace events

On Wed, Mar 14, 2018 at 08:54:01PM -0400, Chuck Lever wrote:
> "User space" means the C library's printf() implementation. You'd have
> to address this by adding %pI to every possible C library used by Linux
> distributions.
>
> Also, as mentioned above, we've already used this approach successfully
> in several other similar places.

OK.--b.

2018-03-22 20:32:41

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 09/15] sunrpc: Report per-RPC execution stats



> On Mar 14, 2018, at 9:11 PM, Chuck Lever <[email protected]> =
wrote:
>=20
>=20
>=20
>> On Mar 14, 2018, at 4:50 PM, J. Bruce Fields <[email protected]> =
wrote:
>>=20
>> On Tue, Mar 13, 2018 at 11:44:27AM -0400, Chuck Lever wrote:
>>> Introduce a mechanism to report the server-side execution latency of
>>> each RPC. The goal is to enable user space to filter the trace
>>> record for latency outliers, build histograms, etc.
>>=20
>> Neato.
>>=20
>> To be useful I'd think you'd want more information about each RPC.
>> (E.g. you'd like to know that the latency outliers all did reads). I
>> guess you could use the address and xid to cross-reference with
>> information collected from somewhere else?
>=20
> Yes. You can enable other trace points (like the nfsd_compound ones)
> to see what class each operation is in.
>=20
> And yes, I would like to have all the relevant information for each
> RPC in a single trace record; I just haven't figured out a way to
> extract it as nicely as I did it on the client (patch forthcoming).
> On the client side, there is a table set up for each RPC program that
> contains an RPC procedure name to procedure number mapping. I was not
> able to find a similar convenience on the server side.
>=20
>=20
>> What's our commitment to backwards compatibility? Trace points seem =
to
>> be the wild west compared to the rest of the kernel interface, but if
>> we want to encourage tooling on top of this then I guess we'll need =
to
>> be strict.
>=20
> That has been discussed elsewhere (LWN.net and more recently on
> linux-rdma). The only compatibility issues are with trace points that
> have user tools and infrastructure that depends on them, such as the
> scheduler trace points used for latencyTOP. The NFS and sunrpc trace
> points do not have this constraint, as they are processed currently
> only by generic tools like trace-cmd. So we are free to innovate for
> the time being.
>=20
>=20
>> Looking at the tcp case, I think it's measuring from the time
>> tcp_recvfrom receives the last fragment making up an rpc request till
>> the last sendpage() of the reply returns. Did you consider other =
spots?
>> (E.g. why after the send instead of before?)
>=20
> Yes, I've considered other spots. I don't consider the spots I'm
> proposing here to be written in stone. I welcome help for placing the
> socket-based timestamp capture points.
>=20
> Some sendto implementations are more complex than others. For =
instance,
> RPC/RDMA can post RDMA Writes containing data content first, then in =
the
> final step post the RDMA Send carrying the RPC Reply header. The RDMA
> Write step can be considered server-side processing, and thus is part
> of the latency. Or, if we ever decide to move the RDMA Write step into
> the XDR layer, it will definitely be counted as processing latency.
>=20
> One thing I would like to keep in the latency measurement is how long
> this rqstp has waited to acquire the send mutex. But otherwise, I'm
> open to other ideas about how to measure this latency.

Hi Bruce-

How about measuring the same way for all transports:

- Capture a timestamp when xpo_recvfrom returns a positive value

- Fire the svc_stats_latency event just before invoking xpo_sendto

Would you be more comfortable with that arrangement?


>> --b.
>>=20
>>>=20
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> include/linux/sunrpc/svc.h | 1 +
>>> include/linux/sunrpc/svc_rdma.h | 1 +
>>> include/trace/events/sunrpc.h | 20 ++++++++++++++++++++
>>> net/sunrpc/svc_xprt.c | 2 +-
>>> net/sunrpc/svcsock.c | 2 ++
>>> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 +
>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 +
>>> 7 files changed, 27 insertions(+), 1 deletion(-)
>>>=20
>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>> index 786ae22..1939709 100644
>>> --- a/include/linux/sunrpc/svc.h
>>> +++ b/include/linux/sunrpc/svc.h
>>> @@ -245,6 +245,7 @@ struct svc_rqst {
>>> void * rq_xprt_ctxt; /* transport specific =
context ptr */
>>> struct svc_deferred_req*rq_deferred; /* deferred request we =
are replaying */
>>>=20
>>> + ktime_t rq_rtime; /* receive time */
>>> size_t rq_xprt_hlen; /* xprt header len */
>>> struct xdr_buf rq_arg;
>>> struct xdr_buf rq_res;
>>> diff --git a/include/linux/sunrpc/svc_rdma.h =
b/include/linux/sunrpc/svc_rdma.h
>>> index 4b731b0..8045981 100644
>>> --- a/include/linux/sunrpc/svc_rdma.h
>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>> @@ -79,6 +79,7 @@ struct svc_rdma_op_ctxt {
>>> struct list_head list;
>>> struct xdr_buf arg;
>>> struct ib_cqe cqe;
>>> + ktime_t rtime;
>>> u32 byte_len;
>>> struct svcxprt_rdma *xprt;
>>> enum dma_data_direction direction;
>>> diff --git a/include/trace/events/sunrpc.h =
b/include/trace/events/sunrpc.h
>>> index 5a8157c..50de512 100644
>>> --- a/include/trace/events/sunrpc.h
>>> +++ b/include/trace/events/sunrpc.h
>>> @@ -705,6 +705,26 @@
>>> __entry->len, show_svc_xprt_flags(__entry->flags))
>>> );
>>>=20
>>> +TRACE_EVENT(svc_stats_latency,
>>> + TP_PROTO(const struct svc_rqst *rqst, ktime_t end),
>>> +
>>> + TP_ARGS(rqst, end),
>>> +
>>> + TP_STRUCT__entry(
>>> + __field(u32, xid)
>>> + __field(unsigned long, execute)
>>> + __string(addr, rqst->rq_xprt->xpt_remotebuf)
>>> + ),
>>> +
>>> + TP_fast_assign(
>>> + __entry->xid =3D be32_to_cpu(rqst->rq_xid);
>>> + __entry->execute =3D ktime_to_us(ktime_sub(end, =
rqst->rq_rtime));
>>> + __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
>>> + ),
>>> +
>>> + TP_printk("addr=3D%s xid=3D0x%08x execute=3D%lu",
>>> + __get_str(addr), __entry->xid, __entry->execute)
>>> +);
>>>=20
>>> DECLARE_EVENT_CLASS(svc_deferred_event,
>>> TP_PROTO(struct svc_deferred_req *dr),
>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>> index f745754..2556e6e 100644
>>> --- a/net/sunrpc/svc_xprt.c
>>> +++ b/net/sunrpc/svc_xprt.c
>>> @@ -782,7 +782,6 @@ static int svc_handle_xprt(struct svc_rqst =
*rqstp, struct svc_xprt *xprt)
>>> len =3D svc_deferred_recv(rqstp);
>>> else
>>> len =3D xprt->xpt_ops->xpo_recvfrom(rqstp);
>>> - dprintk("svc: got len=3D%d\n", len);
>>> rqstp->rq_reserved =3D serv->sv_max_mesg;
>>> atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>>> }
>>> @@ -893,6 +892,7 @@ int svc_send(struct svc_rqst *rqstp)
>>> len =3D -ENOTCONN;
>>> else
>>> len =3D xprt->xpt_ops->xpo_sendto(rqstp);
>>> + trace_svc_stats_latency(rqstp, ktime_get_real());
>>> mutex_unlock(&xprt->xpt_mutex);
>>> rpc_wake_up(&xprt->xpt_bc_pending);
>>> trace_svc_send(rqstp, len);
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index 4ca1d92..f616962 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -574,6 +574,7 @@ static int svc_udp_recvfrom(struct svc_rqst =
*rqstp)
>>> /* Don't enable netstamp, sunrpc doesn't
>>> need that much accuracy */
>>> }
>>> + rqstp->rq_rtime =3D skb->tstamp;
>>> svsk->sk_sk->sk_stamp =3D skb->tstamp;
>>> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be =
more data... */
>>>=20
>>> @@ -1143,6 +1144,7 @@ static int svc_tcp_recvfrom(struct svc_rqst =
*rqstp)
>>> if (len < 0)
>>> goto error;
>>>=20
>>> + rqstp->rq_rtime =3D ktime_get_real();
>>> svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
>>> if (serv->sv_stats)
>>> serv->sv_stats->nettcpcnt++;
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c =
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> index 19e9c6b..446b9d6 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> @@ -558,6 +558,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
>>> goto out_readchunk;
>>>=20
>>> complete:
>>> + rqstp->rq_rtime =3D ctxt->rtime;
>>> svc_rdma_put_context(ctxt, 0);
>>> dprintk("svcrdma: recvfrom: xprt=3D%p, rqstp=3D%p, =
rq_arg.len=3D%u\n",
>>> rdma_xprt, rqstp, rqstp->rq_arg.len);
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c =
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> index ca211dc..08fa7de 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> @@ -316,6 +316,7 @@ static void svc_rdma_wc_receive(struct ib_cq =
*cq, struct ib_wc *wc)
>>> goto flushed;
>>>=20
>>> /* All wc fields are now known to be valid */
>>> + ctxt->rtime =3D ktime_get_real();
>>> ctxt->byte_len =3D wc->byte_len;
>>> spin_lock(&xprt->sc_rq_dto_lock);
>>> list_add_tail(&ctxt->list, &xprt->sc_rq_dto_q);
>=20
> --
> Chuck Lever
>=20
>=20
>=20
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" =
in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




2018-03-22 21:01:35

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 09/15] sunrpc: Report per-RPC execution stats

On Thu, Mar 22, 2018 at 04:32:36PM -0400, Chuck Lever wrote:
>
>
> > On Mar 14, 2018, at 9:11 PM, Chuck Lever <[email protected]> wrote:
> >
> >
> >
> >> On Mar 14, 2018, at 4:50 PM, J. Bruce Fields <[email protected]> wrote:
> >>
> >> On Tue, Mar 13, 2018 at 11:44:27AM -0400, Chuck Lever wrote:
> >>> Introduce a mechanism to report the server-side execution latency of
> >>> each RPC. The goal is to enable user space to filter the trace
> >>> record for latency outliers, build histograms, etc.
> >>
> >> Neato.
> >>
> >> To be useful I'd think you'd want more information about each RPC.
> >> (E.g. you'd like to know that the latency outliers all did reads). I
> >> guess you could use the address and xid to cross-reference with
> >> information collected from somewhere else?
> >
> > Yes. You can enable other trace points (like the nfsd_compound ones)
> > to see what class each operation is in.
> >
> > And yes, I would like to have all the relevant information for each
> > RPC in a single trace record; I just haven't figured out a way to
> > extract it as nicely as I did it on the client (patch forthcoming).
> > On the client side, there is a table set up for each RPC program that
> > contains an RPC procedure name to procedure number mapping. I was not
> > able to find a similar convenience on the server side.
> >
> >
> >> What's our commitment to backwards compatibility? Trace points seem to
> >> be the wild west compared to the rest of the kernel interface, but if
> >> we want to encourage tooling on top of this then I guess we'll need to
> >> be strict.
> >
> > That has been discussed elsewhere (LWN.net and more recently on
> > linux-rdma). The only compatibility issues are with trace points that
> > have user tools and infrastructure that depends on them, such as the
> > scheduler trace points used for latencyTOP. The NFS and sunrpc trace
> > points do not have this constraint, as they are processed currently
> > only by generic tools like trace-cmd. So we are free to innovate for
> > the time being.
> >
> >
> >> Looking at the tcp case, I think it's measuring from the time
> >> tcp_recvfrom receives the last fragment making up an rpc request till
> >> the last sendpage() of the reply returns. Did you consider other spots?
> >> (E.g. why after the send instead of before?)
> >
> > Yes, I've considered other spots. I don't consider the spots I'm
> > proposing here to be written in stone. I welcome help for placing the
> > socket-based timestamp capture points.
> >
> > Some sendto implementations are more complex than others. For instance,
> > RPC/RDMA can post RDMA Writes containing data content first, then in the
> > final step post the RDMA Send carrying the RPC Reply header. The RDMA
> > Write step can be considered server-side processing, and thus is part
> > of the latency. Or, if we ever decide to move the RDMA Write step into
> > the XDR layer, it will definitely be counted as processing latency.
> >
> > One thing I would like to keep in the latency measurement is how long
> > this rqstp has waited to acquire the send mutex. But otherwise, I'm
> > open to other ideas about how to measure this latency.
>
> Hi Bruce-
>
> How about measuring the same way for all transports:
>
> - Capture a timestamp when xpo_recvfrom returns a positive value
>
> - Fire the svc_stats_latency event just before invoking xpo_sendto
>
> Would you be more comfortable with that arrangement?

I honestly don't know. If I understand right: if we put that event just
before xpo_sendto, then in the case you describe above (RDMA Write step
moved between xdr layer and sendto), the latency number would change for
no really good reason. So that's a case for your original approach?

I don't know what's likely to contribute to sendto latency in the socket
case.

Really, I've never done latency tracing, and from what I've seen you may
have done more than any of us, so I trust your judgement here....

--b.

2018-03-22 21:18:26

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 09/15] sunrpc: Report per-RPC execution stats



> On Mar 22, 2018, at 5:01 PM, Bruce Fields <[email protected]> =
wrote:
>=20
> On Thu, Mar 22, 2018 at 04:32:36PM -0400, Chuck Lever wrote:
>>=20
>>=20
>>> On Mar 14, 2018, at 9:11 PM, Chuck Lever <[email protected]> =
wrote:
>>>=20
>>>=20
>>>=20
>>>> On Mar 14, 2018, at 4:50 PM, J. Bruce Fields <[email protected]> =
wrote:
>>>>=20
>>>> On Tue, Mar 13, 2018 at 11:44:27AM -0400, Chuck Lever wrote:
>>>>> Introduce a mechanism to report the server-side execution latency =
of
>>>>> each RPC. The goal is to enable user space to filter the trace
>>>>> record for latency outliers, build histograms, etc.
>>>>=20
>>>> Neato.
>>>>=20
>>>> To be useful I'd think you'd want more information about each RPC.
>>>> (E.g. you'd like to know that the latency outliers all did reads). =
I
>>>> guess you could use the address and xid to cross-reference with
>>>> information collected from somewhere else?
>>>=20
>>> Yes. You can enable other trace points (like the nfsd_compound ones)
>>> to see what class each operation is in.
>>>=20
>>> And yes, I would like to have all the relevant information for each
>>> RPC in a single trace record; I just haven't figured out a way to
>>> extract it as nicely as I did it on the client (patch forthcoming).
>>> On the client side, there is a table set up for each RPC program =
that
>>> contains an RPC procedure name to procedure number mapping. I was =
not
>>> able to find a similar convenience on the server side.
>>>=20
>>>=20
>>>> What's our commitment to backwards compatibility? Trace points =
seem to
>>>> be the wild west compared to the rest of the kernel interface, but =
if
>>>> we want to encourage tooling on top of this then I guess we'll need =
to
>>>> be strict.
>>>=20
>>> That has been discussed elsewhere (LWN.net and more recently on
>>> linux-rdma). The only compatibility issues are with trace points =
that
>>> have user tools and infrastructure that depends on them, such as the
>>> scheduler trace points used for latencyTOP. The NFS and sunrpc trace
>>> points do not have this constraint, as they are processed currently
>>> only by generic tools like trace-cmd. So we are free to innovate for
>>> the time being.
>>>=20
>>>=20
>>>> Looking at the tcp case, I think it's measuring from the time
>>>> tcp_recvfrom receives the last fragment making up an rpc request =
till
>>>> the last sendpage() of the reply returns. Did you consider other =
spots?
>>>> (E.g. why after the send instead of before?)
>>>=20
>>> Yes, I've considered other spots. I don't consider the spots I'm
>>> proposing here to be written in stone. I welcome help for placing =
the
>>> socket-based timestamp capture points.
>>>=20
>>> Some sendto implementations are more complex than others. For =
instance,
>>> RPC/RDMA can post RDMA Writes containing data content first, then in =
the
>>> final step post the RDMA Send carrying the RPC Reply header. The =
RDMA
>>> Write step can be considered server-side processing, and thus is =
part
>>> of the latency. Or, if we ever decide to move the RDMA Write step =
into
>>> the XDR layer, it will definitely be counted as processing latency.
>>>=20
>>> One thing I would like to keep in the latency measurement is how =
long
>>> this rqstp has waited to acquire the send mutex. But otherwise, I'm
>>> open to other ideas about how to measure this latency.
>>=20
>> Hi Bruce-
>>=20
>> How about measuring the same way for all transports:
>>=20
>> - Capture a timestamp when xpo_recvfrom returns a positive value
>>=20
>> - Fire the svc_stats_latency event just before invoking xpo_sendto
>>=20
>> Would you be more comfortable with that arrangement?
>=20
> I honestly don't know. If I understand right: if we put that event =
just
> before xpo_sendto, then in the case you describe above (RDMA Write =
step
> moved between xdr layer and sendto), the latency number would change =
for
> no really good reason. So that's a case for your original approach?
>=20
> I don't know what's likely to contribute to sendto latency in the =
socket
> case.

Me neither, so maybe it makes sense to exclude sendto for now until
we have a better sense of what needs to be measured.


> Really, I've never done latency tracing, and from what I've seen you =
may
> have done more than any of us, so I trust your judgement here....

Only by an hour or two.

What we leave out with this proposal is the cost of xpo_sendto, and the
cost of the svc_xprt_enqueue and subsequent context switch before the
server calls xpo_recvfrom.

If we want to get down-and-dirty with particular transport
mechanisms, it might be better to insert special trace points for
those.

The enqueue wake-up latency is actually pretty significant in some
cases. We could add a latency report to trace_svc_xprt_dequeue, for
example.

I can respin that patch series with these changes.


--
Chuck Lever