From: Jeff Layton <jlayton@primarydata.com>
To: bfields@fieldses.org
Cc: linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org,
    Tejun Heo, Al Viro, NeilBrown
Subject: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd
Date: Wed, 10 Dec 2014 14:07:44 -0500
Message-Id: <1418238480-18857-1-git-send-email-jlayton@primarydata.com>

This is the second revision of the workqueue-based nfsd. Changes from
the RFC version I posted earlier:

v2:
- found and fixed the problem causing the delays between work items. It
  was a bug in the design of the new code: I was queueing the
  svc_xprt's work to do everything, and that necessarily serialized the
  work. This version adds a work_struct to the svc_rqst as well, and
  that's where the bulk of the work is now done (see the sketch below).
- no longer tries to use the thread count as a max_active setting for
  the workqueue. The workqueue max_active settings are left at the
  default. This means that /proc/fs/nfsd/threads is basically a binary
  switch: either zero (not running) or non-zero (running). The actual
  value has no meaning.
- the handling of the fs_struct has been reworked. It's now allocated
  as part of the struct svc_rqst, so each svc_rqst has its own
  fs_struct.
- CONFIG_SUNRPC_SVC_WORKQUEUE has been removed. Since the code must be
  enabled at runtime anyway, we don't need a switch to disable it at
  build time.
- the new tracepoints have been reworked.

This new code is still a little slower (2-3%) than the older,
thread-based code in my testing, though my hope is that it may perform
better than the older code on a NUMA machine. I don't have one at the
moment to verify that, however.

For background, here's an excerpt from the original posting:

This patchset is a little skunkworks project that I've been poking at
for the last few weeks. Currently nfsd uses a dedicated thread pool to
handle RPCs, but that requires maintaining a rather large swath of
"fiddly" code to handle the threads and transports. This patchset
represents an alternative approach, which makes nfsd use workqueues to
do its bidding rather than a dedicated thread pool. When a transport
needs to do work, we simply queue it to the workqueue in softirq
context and let it service the transport.
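To make the v2 design concrete, here's a rough sketch of the shape of
the change. The names below (rq_work, rq_fs, svc_wq, and the idle-rqst
lookup) are mine for illustration and not necessarily what the actual
patches use:

	/*
	 * Each svc_rqst now embeds its own work item (and owns its own
	 * fs_struct), so two requests arriving on the same transport
	 * are processed in parallel instead of being serialized behind
	 * a single per-xprt work item.
	 */
	struct svc_rqst {
		/* ...existing fields... */
		struct work_struct	rq_work;	/* illustrative */
		struct fs_struct	*rq_fs;		/* illustrative */
	};

	static void svc_rqst_work_fn(struct work_struct *work)
	{
		struct svc_rqst *rqstp = container_of(work, struct svc_rqst,
						      rq_work);

		/* receive from rqstp->rq_xprt, then svc_process(rqstp) */
	}

	/*
	 * Called in softirq context when a transport becomes ready:
	 * find an idle svc_rqst and queue *its* work, rather than
	 * queueing the xprt itself (which is what serialized
	 * everything in the RFC version).
	 */
	static void svc_wq_enqueue_xprt(struct svc_xprt *xprt)
	{
		struct svc_rqst *rqstp = svc_rqst_find_idle(xprt);	/* illustrative */

		if (rqstp) {
			rqstp->rq_xprt = xprt;
			queue_work(svc_wq, &rqstp->rq_work);
		}
	}

The real code also has to handle the case where no idle rqst is
available, but that's the gist of why the per-xprt serialization goes
away.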
Performance:
------------
As far as numbers go, I ran a modified version of the fio
tiobench-example.fio test that just reduces the file size to 128M:

threads:
----------------------------------------------------------------------
Run status group 0 (all jobs):
  WRITE: io=13740KB, aggrb=228KB/s, minb=57KB/s, maxb=57KB/s, mint=60034msec, maxt=60049msec

Run status group 1 (all jobs):
  WRITE: io=14652KB, aggrb=244KB/s, minb=60KB/s, maxb=61KB/s, mint=60002msec, maxt=60028msec

Run status group 2 (all jobs):
   READ: io=524288KB, aggrb=49639KB/s, minb=12409KB/s, maxb=12433KB/s, mint=10542msec, maxt=10562msec

Run status group 3 (all jobs):
   READ: io=524288KB, aggrb=50670KB/s, minb=12667KB/s, maxb=12687KB/s, mint=10331msec, maxt=10347msec

workqueue:
----------------------------------------------------------------------
Run status group 0 (all jobs):
  WRITE: io=13632KB, aggrb=226KB/s, minb=56KB/s, maxb=56KB/s, mint=60010msec, maxt=60061msec

Run status group 1 (all jobs):
  WRITE: io=14328KB, aggrb=238KB/s, minb=59KB/s, maxb=59KB/s, mint=60023msec, maxt=60033msec

Run status group 2 (all jobs):
   READ: io=524288KB, aggrb=47779KB/s, minb=11944KB/s, maxb=11955KB/s, mint=10963msec, maxt=10973msec

Run status group 3 (all jobs):
   READ: io=524288KB, aggrb=48379KB/s, minb=12094KB/s, maxb=12114KB/s, mint=10819msec, maxt=10837msec

...a small performance decrease using workqueues (about 5% in the worst
case). It varies a little between runs, but the results over several
runs are fairly consistent in showing a small perf decrease.

In an attempt to ascertain where that comes from, I added some
tracepoints and wrote a perl script to scrape the results and figure
out where the latency comes from. The numbers here are from the runs
that produced the above results:

[jlayton@palma nfsd-wq]$ ./analyze.pl < threads.txt
-------------------------------------------------------------
rpcs=272437
queue=3.79952429455322
queued=10.0558330900775
receive=10.6196625266701
process=2325.8173229043
total=2350.2923428156

[jlayton@palma nfsd-wq]$ ./analyze.pl < workqueue.txt
-------------------------------------------------------------
rpcs=272329
queue=4.41979737859012
queued=8.124893049967
receive=11.3421082590608
process=2310.15945786277
total=2334.04625655039

Here's a legend; each number represents a delta between two tracepoints
for a particular xprt or rqst. I can provide the script if it's
helpful, but it's definitely hacked together:

rpcs:    total number of rpcs processed (just a count)

queue:   time it took to queue the xprt
         trace_svc_xprt_enqueued - trace_svc_xprt_enqueue (in usecs)

queued:  time between being queued and going "active"
         trace_svc_xprt_active - trace_svc_xprt_enqueued (in usecs)

receive: time between going "active" and completing the receive
         trace_svc_xprt_received - trace_svc_xprt_active (in usecs)

process: time between completing the receive and finishing processing
         trace_svc_process - trace_svc_xprt_received (in usecs)

total:   total time from enqueueing to finishing the processing
         trace_svc_process - trace_svc_xprt_enqueued (in usecs)

The interesting bit is that, according to these numbers, the workqueue
RPCs ran more quickly on average (though most of that difference is in
svc_process, for which I have zero explanation). So why are we getting
slower results?

One theory is that it's getting bitten by the fact that the workqueue
queueing/dequeueing code uses spinlocks that disable bottom halves.
Perhaps that adds up and causes more latency in softirq processing?
I'm not sure how best to nail that down...
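For reference, the tracepoints used above follow the standard kernel
TRACE_EVENT pattern. Here's a minimal sketch of the shape of one of
them (the usual trace-header boilerplate is omitted, and the actual
tracepoints in the series record more detail than a bare pointer):

	TRACE_EVENT(svc_xprt_active,
		TP_PROTO(struct svc_xprt *xprt),
		TP_ARGS(xprt),
		TP_STRUCT__entry(
			__field(struct svc_xprt *, xprt)
		),
		TP_fast_assign(
			__entry->xprt = xprt;
		),
		TP_printk("xprt=%p", __entry->xprt)
	);

The perl script just pairs up consecutive events for the same
xprt/rqst and subtracts their timestamps to produce the per-stage
deltas above.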
It's probably worth it to start considering this for v3.20, but I'd
like to get some time on a larger scale test rig to see how it does
first. We'll also need Al's ACK on the fs_struct stuff.

Thoughts or comments are welcome.

Jeff Layton (16):
  sunrpc: add a new svc_serv_ops struct and move sv_shutdown into it
  sunrpc: move sv_function into sv_ops
  sunrpc: move sv_module parm into sv_ops
  sunrpc: turn enqueueing a svc_xprt into a svc_serv operation
  sunrpc: abstract out svc_set_num_threads to sv_ops
  sunrpc: move pool_mode definitions into svc.h
  sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads refcounting
  sunrpc: set up workqueue function in svc_xprt
  sunrpc: set up svc_rqst work if it's defined
  sunrpc: add basic support for workqueue-based services
  nfsd: keep a reference to the fs_struct in svc_rqst
  nfsd: add support for workqueue based service processing
  sunrpc: keep a cache of svc_rqsts for each NUMA node
  sunrpc: add more tracepoints around svc_xprt handling
  sunrpc: print the svc_rqst pointer value in svc_process tracepoint
  sunrpc: add tracepoints around svc_sock handling

 fs/fs_struct.c                  |  60 ++++++--
 fs/lockd/svc.c                  |   7 +-
 fs/nfs/callback.c               |   6 +-
 fs/nfsd/nfssvc.c                | 100 +++++++++++--
 include/linux/fs_struct.h       |   4 +
 include/linux/sunrpc/svc.h      |  93 ++++++++++---
 include/linux/sunrpc/svc_xprt.h |   3 +
 include/linux/sunrpc/svcsock.h  |   1 +
 include/trace/events/sunrpc.h   |  88 ++++++++++--
 net/sunrpc/Makefile             |   2 +-
 net/sunrpc/svc.c                | 141 +++++++++++--------
 net/sunrpc/svc_wq.c             | 302 ++++++++++++++++++++++++++++++++++++++++
 net/sunrpc/svc_xprt.c           |  68 ++++++++-
 net/sunrpc/svcsock.c            |   6 +
 14 files changed, 758 insertions(+), 123 deletions(-)
 create mode 100644 net/sunrpc/svc_wq.c

-- 
2.1.0