2019-05-03 11:24:09

by Trond Myklebust

Subject: [RFC PATCH 0/5] bh-safe lock removal for SUNRPC

This patchset aims to remove the bh-safe locks on the client side.
At this time it should be seen as a toy/strawman effort in order to
help the community figure out whether or not there are setups out
there that are actually seeing performance bottlenecks resulting
from taking bh-safe locks inside other spinlocks.
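
To make the concern concrete, the sketch below illustrates the kind of nesting in
question. It is only an illustration with hypothetical structure names, not the
actual SUNRPC types or locks.

/*
 * Sketch of the nesting this series wants to eliminate. Because
 * inner_obj->lock can also be taken from softirq (bottom-half) context,
 * every process-context user must take it with spin_lock_bh(), which
 * disables softirq processing on the local CPU; doing that while already
 * holding another spinlock lengthens both critical sections.
 */
#include <linux/spinlock.h>

struct outer_obj { spinlock_t lock; };
struct inner_obj { spinlock_t lock; };	/* also taken from softirq context */

static void nested_locking_example(struct outer_obj *o, struct inner_obj *i)
{
	spin_lock(&o->lock);		/* ordinary spinlock */
	spin_lock_bh(&i->lock);		/* bh-safe lock nested inside it */
	/* ... work done under both locks ... */
	spin_unlock_bh(&i->lock);
	spin_unlock(&o->lock);
}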

Trond Myklebust (5):
SUNRPC: Replace the queue timer with a delayed work function
SUNRPC: Replace direct task wakeups from softirq context
SUNRPC: Remove the bh-safe lock requirement on xprt->transport_lock
SUNRPC: Remove the bh-safe lock requirement on the
rpc_wait_queue->lock
SUNRPC: Reduce the priority of the xprtiod queue

include/linux/sunrpc/sched.h | 3 +-
include/linux/sunrpc/xprtsock.h | 5 +
net/sunrpc/sched.c | 76 +++++++++-------
net/sunrpc/xprt.c | 61 ++++++-------
net/sunrpc/xprtrdma/rpc_rdma.c | 4 +-
net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 4 +-
net/sunrpc/xprtrdma/svc_rdma_transport.c | 8 +-
net/sunrpc/xprtsock.c | 101 +++++++++++++++++----
8 files changed, 168 insertions(+), 94 deletions(-)

--
2.21.0


2019-05-03 11:24:09

by Trond Myklebust

Subject: [RFC PATCH 1/5] SUNRPC: Replace the queue timer with a delayed work function

The queue timer function, which walks the RPC queue in order to locate
candidates for waking up, is one of the current constraints against
removing the bh-safe queue spin locks. Replace it with a delayed
work item, so that we can do the actual RPC task wake-ups from an
ordinary process context.
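
The conversion follows the usual timer-to-delayed-work pattern. The sketch below
is a simplified, hypothetical version of it (not the actual rpc_wait_queue code),
assuming an absolute jiffies deadline has to be translated into the relative delay
that delayed work expects.

#include <linux/workqueue.h>
#include <linux/jiffies.h>

struct my_queue {
	unsigned long expires;		/* absolute deadline, in jiffies */
	struct delayed_work dwork;	/* replaces the old struct timer_list */
};

static void my_queue_timeout_fn(struct work_struct *work)
{
	struct my_queue *q = container_of(work, struct my_queue, dwork.work);

	/* Runs in process context, so an ordinary spin_lock() suffices here. */
	(void)q;
}

static void my_queue_init(struct my_queue *q)
{
	q->expires = 0;
	INIT_DEFERRABLE_WORK(&q->dwork, my_queue_timeout_fn);
}

static void my_queue_arm(struct my_queue *q, unsigned long expires)
{
	unsigned long now = jiffies;

	/* delayed work takes a relative delay, not an absolute deadline */
	q->expires = expires;
	mod_delayed_work(system_wq, &q->dwork,
			 time_before_eq(expires, now) ? 0 : expires - now);
}

static void my_queue_destroy(struct my_queue *q)
{
	cancel_delayed_work_sync(&q->dwork);
}

INIT_DEFERRABLE_WORK() keeps the deferrable behaviour that the old
TIMER_DEFERRABLE timer had.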

Signed-off-by: Trond Myklebust <[email protected]>
---
include/linux/sunrpc/sched.h | 3 ++-
net/sunrpc/sched.c | 32 ++++++++++++++++++++------------
2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index d0e451868f02..7d8db5dcac04 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -183,8 +183,9 @@ struct rpc_task_setup {
#define RPC_NR_PRIORITY (1 + RPC_PRIORITY_PRIVILEGED - RPC_PRIORITY_LOW)

struct rpc_timer {
- struct timer_list timer;
struct list_head list;
+ unsigned long expires;
+ struct delayed_work dwork;
};

/*
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 1a12fb03e611..e7723c2c1b1c 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -45,7 +45,7 @@ static mempool_t *rpc_buffer_mempool __read_mostly;

static void rpc_async_schedule(struct work_struct *);
static void rpc_release_task(struct rpc_task *task);
-static void __rpc_queue_timer_fn(struct timer_list *t);
+static void __rpc_queue_timer_fn(struct work_struct *);

/*
* RPC tasks sit here while waiting for conditions to improve.
@@ -86,13 +86,19 @@ __rpc_disable_timer(struct rpc_wait_queue *queue, struct rpc_task *task)
task->tk_timeout = 0;
list_del(&task->u.tk_wait.timer_list);
if (list_empty(&queue->timer_list.list))
- del_timer(&queue->timer_list.timer);
+ cancel_delayed_work(&queue->timer_list.dwork);
}

static void
rpc_set_queue_timer(struct rpc_wait_queue *queue, unsigned long expires)
{
- timer_reduce(&queue->timer_list.timer, expires);
+ unsigned long now = jiffies;
+ queue->timer_list.expires = expires;
+ if (time_before_eq(expires, now))
+ expires = 0;
+ else
+ expires -= now;
+ mod_delayed_work(rpciod_workqueue, &queue->timer_list.dwork, expires);
}

/*
@@ -106,7 +112,8 @@ __rpc_add_timer(struct rpc_wait_queue *queue, struct rpc_task *task,
task->tk_pid, jiffies_to_msecs(timeout - jiffies));

task->tk_timeout = timeout;
- rpc_set_queue_timer(queue, timeout);
+ if (list_empty(&queue->timer_list.list) || time_before(timeout, queue->timer_list.expires))
+ rpc_set_queue_timer(queue, timeout);
list_add(&task->u.tk_wait.timer_list, &queue->timer_list.list);
}

@@ -249,9 +256,8 @@ static void __rpc_init_priority_wait_queue(struct rpc_wait_queue *queue, const c
queue->maxpriority = nr_queues - 1;
rpc_reset_waitqueue_priority(queue);
queue->qlen = 0;
- timer_setup(&queue->timer_list.timer,
- __rpc_queue_timer_fn,
- TIMER_DEFERRABLE);
+ queue->timer_list.expires = 0;
+ INIT_DEFERRABLE_WORK(&queue->timer_list.dwork, __rpc_queue_timer_fn);
INIT_LIST_HEAD(&queue->timer_list.list);
rpc_assign_waitqueue_name(queue, qname);
}
@@ -270,7 +276,7 @@ EXPORT_SYMBOL_GPL(rpc_init_wait_queue);

void rpc_destroy_wait_queue(struct rpc_wait_queue *queue)
{
- del_timer_sync(&queue->timer_list.timer);
+ cancel_delayed_work_sync(&queue->timer_list.dwork);
}
EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);

@@ -760,13 +766,15 @@ void rpc_wake_up_status(struct rpc_wait_queue *queue, int status)
}
EXPORT_SYMBOL_GPL(rpc_wake_up_status);

-static void __rpc_queue_timer_fn(struct timer_list *t)
+static void __rpc_queue_timer_fn(struct work_struct *work)
{
- struct rpc_wait_queue *queue = from_timer(queue, t, timer_list.timer);
+ struct rpc_wait_queue *queue = container_of(work,
+ struct rpc_wait_queue,
+ timer_list.dwork.work);
struct rpc_task *task, *n;
unsigned long expires, now, timeo;

- spin_lock(&queue->lock);
+ spin_lock_bh(&queue->lock);
expires = now = jiffies;
list_for_each_entry_safe(task, n, &queue->timer_list.list, u.tk_wait.timer_list) {
timeo = task->tk_timeout;
@@ -781,7 +789,7 @@ static void __rpc_queue_timer_fn(struct timer_list *t)
}
if (!list_empty(&queue->timer_list.list))
rpc_set_queue_timer(queue, expires);
- spin_unlock(&queue->lock);
+ spin_unlock_bh(&queue->lock);
}

static void __rpc_atrun(struct rpc_task *task)
--
2.21.0

2019-05-03 11:24:35

by Trond Myklebust

Subject: [RFC PATCH 2/5] SUNRPC: Replace direct task wakeups from softirq context

Replace the direct task wakeups from inside a softirq context with
wakeups from a process context.
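
The approach is the familiar one of recording an event with an atomic bit in
softirq context and handling it later from a workqueue. The sketch below uses
hypothetical names (not the actual sock_xprt fields) to show the shape of it.

#include <linux/workqueue.h>
#include <linux/bitops.h>
#include <linux/printk.h>

#define MY_WAKE_ERROR		0
#define MY_WAKE_DISCONNECT	1

struct my_transport {
	unsigned long state;
	struct work_struct error_worker;
};

/* Runs in process context; may sleep, take mutexes, etc. */
static void my_error_handle(struct work_struct *work)
{
	struct my_transport *t = container_of(work, struct my_transport,
					      error_worker);

	if (test_and_clear_bit(MY_WAKE_DISCONNECT, &t->state))
		pr_debug("handling deferred disconnect\n");
	if (test_and_clear_bit(MY_WAKE_ERROR, &t->state))
		pr_debug("handling deferred socket error\n");
}

static void my_transport_init(struct my_transport *t)
{
	t->state = 0;
	INIT_WORK(&t->error_worker, my_error_handle);
}

/* Called from the socket callback, i.e. softirq context: just record
 * the event and defer the real handling to the workqueue.
 */
static void my_defer_event(struct my_transport *t, unsigned int nr)
{
	set_bit(nr, &t->state);
	queue_work(system_wq, &t->error_worker);
}

The test_and_clear_bit() in the worker keeps the handling idempotent even if
the socket callback fires several times before the work item gets to run.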

Signed-off-by: Trond Myklebust <[email protected]>
---
include/linux/sunrpc/xprtsock.h | 5 +++
net/sunrpc/xprtsock.c | 78 ++++++++++++++++++++++++++++++---
2 files changed, 77 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index b81d0b3e0799..7638dbe7bc50 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -56,6 +56,7 @@ struct sock_xprt {
*/
unsigned long sock_state;
struct delayed_work connect_worker;
+ struct work_struct error_worker;
struct work_struct recv_worker;
struct mutex recv_mutex;
struct sockaddr_storage srcaddr;
@@ -84,6 +85,10 @@ struct sock_xprt {
#define XPRT_SOCK_CONNECTING 1U
#define XPRT_SOCK_DATA_READY (2)
#define XPRT_SOCK_UPD_TIMEOUT (3)
+#define XPRT_SOCK_WAKE_ERROR (4)
+#define XPRT_SOCK_WAKE_WRITE (5)
+#define XPRT_SOCK_WAKE_PENDING (6)
+#define XPRT_SOCK_WAKE_DISCONNECT (7)

#endif /* __KERNEL__ */

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index c69951ed2ebc..e0195b1a0c18 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1211,6 +1211,15 @@ static void xs_sock_reset_state_flags(struct rpc_xprt *xprt)
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);

clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state);
+ clear_bit(XPRT_SOCK_WAKE_ERROR, &transport->sock_state);
+ clear_bit(XPRT_SOCK_WAKE_WRITE, &transport->sock_state);
+ clear_bit(XPRT_SOCK_WAKE_DISCONNECT, &transport->sock_state);
+}
+
+static void xs_run_error_worker(struct sock_xprt *transport, unsigned int nr)
+{
+ set_bit(nr, &transport->sock_state);
+ queue_work(xprtiod_workqueue, &transport->error_worker);
}

static void xs_sock_reset_connection_flags(struct rpc_xprt *xprt)
@@ -1231,6 +1240,7 @@ static void xs_sock_reset_connection_flags(struct rpc_xprt *xprt)
*/
static void xs_error_report(struct sock *sk)
{
+ struct sock_xprt *transport;
struct rpc_xprt *xprt;
int err;

@@ -1238,13 +1248,14 @@ static void xs_error_report(struct sock *sk)
if (!(xprt = xprt_from_sock(sk)))
goto out;

+ transport = container_of(xprt, struct sock_xprt, xprt);
err = -sk->sk_err;
if (err == 0)
goto out;
dprintk("RPC: xs_error_report client %p, error=%d...\n",
xprt, -err);
trace_rpc_socket_error(xprt, sk->sk_socket, err);
- xprt_wake_pending_tasks(xprt, err);
+ xs_run_error_worker(transport, XPRT_SOCK_WAKE_ERROR);
out:
read_unlock_bh(&sk->sk_callback_lock);
}
@@ -1507,7 +1518,7 @@ static void xs_tcp_state_change(struct sock *sk)
xprt->stat.connect_count++;
xprt->stat.connect_time += (long)jiffies -
xprt->stat.connect_start;
- xprt_wake_pending_tasks(xprt, -EAGAIN);
+ xs_run_error_worker(transport, XPRT_SOCK_WAKE_PENDING);
}
spin_unlock(&xprt->transport_lock);
break;
@@ -1525,7 +1536,7 @@ static void xs_tcp_state_change(struct sock *sk)
/* The server initiated a shutdown of the socket */
xprt->connect_cookie++;
clear_bit(XPRT_CONNECTED, &xprt->state);
- xs_tcp_force_close(xprt);
+ xs_run_error_worker(transport, XPRT_SOCK_WAKE_DISCONNECT);
/* fall through */
case TCP_CLOSING:
/*
@@ -1547,7 +1558,7 @@ static void xs_tcp_state_change(struct sock *sk)
xprt_clear_connecting(xprt);
clear_bit(XPRT_CLOSING, &xprt->state);
/* Trigger the socket release */
- xs_tcp_force_close(xprt);
+ xs_run_error_worker(transport, XPRT_SOCK_WAKE_DISCONNECT);
}
out:
read_unlock_bh(&sk->sk_callback_lock);
@@ -1556,6 +1567,7 @@ static void xs_tcp_state_change(struct sock *sk)
static void xs_write_space(struct sock *sk)
{
struct socket_wq *wq;
+ struct sock_xprt *transport;
struct rpc_xprt *xprt;

if (!sk->sk_socket)
@@ -1564,13 +1576,14 @@ static void xs_write_space(struct sock *sk)

if (unlikely(!(xprt = xprt_from_sock(sk))))
return;
+ transport = container_of(xprt, struct sock_xprt, xprt);
rcu_read_lock();
wq = rcu_dereference(sk->sk_wq);
if (!wq || test_and_clear_bit(SOCKWQ_ASYNC_NOSPACE, &wq->flags) == 0)
goto out;

- if (xprt_write_space(xprt))
- sk->sk_write_pending--;
+ xs_run_error_worker(transport, XPRT_SOCK_WAKE_WRITE);
+ sk->sk_write_pending--;
out:
rcu_read_unlock();
}
@@ -2461,6 +2474,56 @@ static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task)
delay);
}

+static void xs_wake_disconnect(struct sock_xprt *transport)
+{
+ if (test_and_clear_bit(XPRT_SOCK_WAKE_DISCONNECT, &transport->sock_state))
+ xs_tcp_force_close(&transport->xprt);
+}
+
+static void xs_wake_write(struct sock_xprt *transport)
+{
+ if (test_and_clear_bit(XPRT_SOCK_WAKE_WRITE, &transport->sock_state))
+ xprt_write_space(&transport->xprt);
+}
+
+static void xs_wake_error(struct sock_xprt *transport)
+{
+ int sockerr;
+ int sockerr_len = sizeof(sockerr);
+
+ if (!test_bit(XPRT_SOCK_WAKE_ERROR, &transport->sock_state))
+ return;
+ mutex_lock(&transport->recv_mutex);
+ if (transport->sock == NULL)
+ goto out;
+ if (!test_and_clear_bit(XPRT_SOCK_WAKE_ERROR, &transport->sock_state))
+ goto out;
+ if (kernel_getsockopt(transport->sock, SOL_SOCKET, SO_ERROR,
+ (char *)&sockerr, &sockerr_len) != 0)
+ goto out;
+ if (sockerr < 0)
+ xprt_wake_pending_tasks(&transport->xprt, sockerr);
+out:
+ mutex_unlock(&transport->recv_mutex);
+}
+
+static void xs_wake_pending(struct sock_xprt *transport)
+{
+ if (test_and_clear_bit(XPRT_SOCK_WAKE_PENDING, &transport->sock_state))
+ xprt_wake_pending_tasks(&transport->xprt, -EAGAIN);
+}
+
+static void xs_error_handle(struct work_struct *work)
+{
+ struct sock_xprt *transport = container_of(work,
+ struct sock_xprt, error_worker);
+
+ xs_wake_disconnect(transport);
+ xs_wake_write(transport);
+ xs_wake_error(transport);
+ xs_wake_pending(transport);
+}
+
/**
* xs_local_print_stats - display AF_LOCAL socket-specifc stats
* @xprt: rpc_xprt struct containing statistics
@@ -2873,6 +2936,7 @@ static struct rpc_xprt *xs_setup_local(struct xprt_create *args)
xprt->timeout = &xs_local_default_timeout;

INIT_WORK(&transport->recv_worker, xs_stream_data_receive_workfn);
+ INIT_WORK(&transport->error_worker, xs_error_handle);
INIT_DELAYED_WORK(&transport->connect_worker, xs_dummy_setup_socket);

switch (sun->sun_family) {
@@ -2943,6 +3007,7 @@ static struct rpc_xprt *xs_setup_udp(struct xprt_create *args)
xprt->timeout = &xs_udp_default_timeout;

INIT_WORK(&transport->recv_worker, xs_udp_data_receive_workfn);
+ INIT_WORK(&transport->error_worker, xs_error_handle);
INIT_DELAYED_WORK(&transport->connect_worker, xs_udp_setup_socket);

switch (addr->sa_family) {
@@ -3024,6 +3089,7 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
(xprt->timeout->to_retries + 1);

INIT_WORK(&transport->recv_worker, xs_stream_data_receive_workfn);
+ INIT_WORK(&transport->error_worker, xs_error_handle);
INIT_DELAYED_WORK(&transport->connect_worker, xs_tcp_setup_socket);

switch (addr->sa_family) {
--
2.21.0

2019-05-06 18:22:53

by Chuck Lever III

Subject: Re: [RFC PATCH 0/5] bh-safe lock removal for SUNRPC

Hi Trond-

> On May 3, 2019, at 7:18 AM, Trond Myklebust <[email protected]> wrote:
>
> This patchset aims to remove the bh-safe locks on the client side.
> At this time it should be seen as a toy/strawman effort in order to
> help the community figure out whether or not there are setups out
> there that are actually seeing performance bottlenecks resulting
> from taking bh-safe locks inside other spinlocks.

What kernel does this patch set apply to? I've tried both v5.0 and
v5.1, but there appear to be some changes that I'm missing. The
first patch does not apply cleanly.


> Trond Myklebust (5):
> SUNRPC: Replace the queue timer with a delayed work function
> SUNRPC: Replace direct task wakeups from softirq context
> SUNRPC: Remove the bh-safe lock requirement on xprt->transport_lock
> SUNRPC: Remove the bh-safe lock requirement on the
> rpc_wait_queue->lock
> SUNRPC: Reduce the priority of the xprtiod queue
>
> include/linux/sunrpc/sched.h | 3 +-
> include/linux/sunrpc/xprtsock.h | 5 +
> net/sunrpc/sched.c | 76 +++++++++-------
> net/sunrpc/xprt.c | 61 ++++++-------
> net/sunrpc/xprtrdma/rpc_rdma.c | 4 +-
> net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 4 +-
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 8 +-
> net/sunrpc/xprtsock.c | 101 +++++++++++++++++----
> 8 files changed, 168 insertions(+), 94 deletions(-)
>
> --
> 2.21.0
>

--
Chuck Lever



2019-05-06 18:37:32

by Trond Myklebust

Subject: Re: [RFC PATCH 0/5] bh-safe lock removal for SUNRPC

On Mon, 2019-05-06 at 14:22 -0400, Chuck Lever wrote:
> Hi Trond-
>
> > On May 3, 2019, at 7:18 AM, Trond Myklebust <[email protected]>
> > wrote:
> >
> > This patchset aims to remove the bh-safe locks on the client side.
> > At this time it should be seen as a toy/strawman effort in order to
> > help the community figure out whether or not there are setups out
> > there that are actually seeing performance bottlenecks resulting
> > from taking bh-safe locks inside other spinlocks.
>
> What kernel does this patch set apply to? I've tried both v5.0 and
> v5.1, but there appear to be some changes that I'm missing. The
> first patch does not apply cleanly.
>

It should hopefully apply on top of Anna's linux-next branch.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2019-05-06 20:02:39

by Chuck Lever III

Subject: Re: [RFC PATCH 0/5] bh-safe lock removal for SUNRPC



> On May 6, 2019, at 2:37 PM, Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2019-05-06 at 14:22 -0400, Chuck Lever wrote:
>> Hi Trond-
>>
>>> On May 3, 2019, at 7:18 AM, Trond Myklebust <[email protected]>
>>> wrote:
>>>
>>> This patchset aims to remove the bh-safe locks on the client side.
>>> At this time it should be seen as a toy/strawman effort in order to
>>> help the community figure out whether or not there are setups out
>>> there that are actually seeing performance bottlenecks resulting
>>> from taking bh-safe locks inside other spinlocks.
>>
>> What kernel does this patch set apply to? I've tried both v5.0 and
>> v5.1, but there appear to be some changes that I'm missing. The
>> first patch does not apply cleanly.
>>
>
> It should hopefully apply on top of Anna's linux-next branch.

OK, you did mention that to me last week. Sorry for the noise.

--
Chuck Lever