2024-03-13 16:26:05

by Yan Zhai

Subject: [PATCH v3 net 0/3] Report RCU QS for busy network kthreads

This changeset fixes a common problem for busy networking kthreads.
These threads, e.g. NAPI threads, typically do the following (see the
sketch after this list):

* poll a batch of packets
* if there is more work, call cond_resched to allow scheduling
* continue to poll more packets when the rx queue is not empty
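
For illustration, the loop has roughly this shape (a minimal sketch, not
taken from any real driver; poll_rx_queue() is a hypothetical stand-in
for a per-driver poll function):

static int busy_poll_kthread(void *data)
{
	while (!kthread_should_stop()) {
		/* poll a batch of packets */
		int more = poll_rx_queue(data);

		/* if there is more work, let the scheduler run,
		 * then come back and repoll
		 */
		if (more)
			cond_resched();
	}
	return 0;
}

Under sustained load the rx queue never drains, so the thread never
sleeps and cond_resched() remains its only scheduling point.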

We observed this being a problem in production, since it can block RCU
tasks from making progress under heavy load. Investigation indicates
that just calling cond_resched is insufficient for RCU tasks to reach
quiescent states. So far this is known to affect NAPI threads,
napi_busy_loop, and the cpumap kthread.

By periodically reporting RCU QSes in these kthreads before
cond_resched, the blocked RCU waiters can make progress correctly.
Rather than reporting a QS for RCU tasks only, this code shares the
same concern noted in commit d28139c4e967 ("rcu: Apply RCU-bh QSes to
RCU-sched and RCU-preempt when safe"). So report a consolidated QS for
safety.

It is worth noting that, although this problem is reproducible in
napi_busy_loop, it only shows up when the polling interval is set as
high as 2ms, far larger than the 50us-100us recommended in the
documentation. So napi_busy_loop is left untouched.
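
(For reference, a 2ms interval corresponds to a busy polling value of
2000 microseconds; assuming the standard sysctl knob, one hypothetical
way to set it system-wide is:

sysctl -w net.core.busy_poll=2000

though the exact reproduction setup for napi_busy_loop is not spelled
out in this series.)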

V2: https://lore.kernel.org/bpf/[email protected]/
V1: https://lore.kernel.org/lkml/[email protected]/#t

changes since v2:
* created a helper in the rcu header to abstract the behavior
* additionally fixed the cpumap kthread

changes since v1:
* disable preemption first as Paul McKenney suggested

Yan Zhai (3):
  rcu: add a helper to report consolidated flavor QS
  net: report RCU QS on threaded NAPI repolling
  bpf: report RCU QS in cpumap kthread

 include/linux/rcupdate.h | 23 +++++++++++++++++++++++
 kernel/bpf/cpumap.c      |  3 +++
 net/core/dev.c           |  3 +++
 3 files changed, 29 insertions(+)

--
2.30.2




2024-03-13 16:26:40

by Yan Zhai

Subject: [PATCH v3 net 1/3] rcu: add a helper to report consolidated flavor QS

There are several scenarios in network processing that can run for
extended periods under heavy traffic. In such situations, RCU
synchronization might not observe the desired quiescent states for an
indefinitely long period. Create a helper to safely raise the desired
RCU quiescent states in such scenarios.

Reviewed-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Yan Zhai <[email protected]>
---
include/linux/rcupdate.h | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0746b1b0b663..e91ae38c33e3 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -247,6 +247,29 @@ do { \
 	cond_resched(); \
 } while (0)

+/**
+ * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states
+ *
+ * This helper is for network processing in non-RT kernels, where there could
+ * be busy polling threads that block RCU synchronization indefinitely. In
+ * such a context, simply calling cond_resched is insufficient, so give it a
+ * stronger push to eliminate potential blockage of all RCU types.
+ *
+ * NOTE: unless absolutely sure, this helper should in general be called
+ * outside of a bh lock section to avoid reporting a surprising QS to updaters,
+ * who could be expecting the RCU read critical section to end at local_bh_enable().
+ */
+#define rcu_softirq_qs_periodic(old_ts) \
+do { \
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
+	    time_after(jiffies, (old_ts) + HZ / 10)) { \
+		preempt_disable(); \
+		rcu_softirq_qs(); \
+		preempt_enable(); \
+		(old_ts) = jiffies; \
+	} \
+} while (0)
+
 /*
  * Infrastructure to implement the synchronize_() primitives in
  * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
--
2.30.2
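
For reference, a caller is expected to use the helper roughly as below
(a sketch only, mirroring the call sites added in the next two patches;
do_packet_work() is a hypothetical placeholder):

static int example_poll_kthread(void *data)
{
	unsigned long last_qs = jiffies;

	while (!kthread_should_stop()) {
		do_packet_work(data);
		/* reports a consolidated QS at most once per HZ / 10
		 * jiffies (i.e. every 100ms) and refreshes last_qs
		 */
		rcu_softirq_qs_periodic(last_qs);
		cond_resched();
	}
	return 0;
}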



2024-03-13 16:27:07

by Yan Zhai

Subject: [PATCH v3 net 2/3] net: report RCU QS on threaded NAPI repolling

NAPI threads can keep polling packets under load. Currently a thread
only calls cond_resched() before repolling, but that is not sufficient
to clear out the holdout of RCU tasks, which can prevent BPF tracing
programs from detaching for a long period. This can be reproduced
easily with the following setup:

ip netns add test1
ip netns add test2

ip -n test1 link add veth1 type veth peer name veth2 netns test2

ip -n test1 link set veth1 up
ip -n test1 link set lo up
ip -n test2 link set veth2 up
ip -n test2 link set lo up

ip -n test1 addr add 192.168.1.2/31 dev veth1
ip -n test1 addr add 1.1.1.1/32 dev lo
ip -n test2 addr add 192.168.1.3/31 dev veth2
ip -n test2 addr add 2.2.2.2/31 dev lo

ip -n test1 route add default via 192.168.1.3
ip -n test2 route add default via 192.168.1.2

for i in `seq 10 210`; do
    for j in `seq 10 210`; do
        ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
    done
done

ip netns exec test2 ethtool -K veth2 gro on
ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
ip netns exec test1 ethtool -K veth1 tso off

Then, with an iperf3 client/server running, a bpftrace script can trigger it:

ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'

Reporting RCU quiescent states periodically resolves the issue.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Suggested-by: Paul E. McKenney <[email protected]>
Reviewed-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Yan Zhai <[email protected]>

---
v2->v3: abstracted the work into an RCU helper
v1->v2: moved rcu_softirq_qs out of the bh critical section, and only
raise it after a second of repolling. Added a brief perf test result.

v2: https://lore.kernel.org/bpf/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/#t
---
net/core/dev.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 76e6438f4858..6b7fc42d4b3e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6708,6 +6708,8 @@ static int napi_threaded_poll(void *data)
 	void *have;
 
 	while (!napi_thread_wait(napi)) {
+		unsigned long last_qs = jiffies;
+
 		for (;;) {
 			bool repoll = false;
 
@@ -6732,6 +6734,7 @@ static int napi_threaded_poll(void *data)
 			if (!repoll)
 				break;
 
+			rcu_softirq_qs_periodic(last_qs);
 			cond_resched();
 		}
 	}
--
2.30.2



2024-03-13 16:27:18

by Yan Zhai

Subject: [PATCH v3 net 3/3] bpf: report RCU QS in cpumap kthread

Under heavy load, cpumap kernel threads can be busy polling packets
from redirect queues and block RCU tasks from reaching quiescent
states. It is insufficient to just call cond_resched() in such a
context. Periodically raising a consolidated RCU QS before
cond_resched fixes the problem.

Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Reviewed-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Yan Zhai <[email protected]>
---
kernel/bpf/cpumap.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index ef82ffc90cbe..8f1d390bcbde 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -262,6 +262,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 static int cpu_map_kthread_run(void *data)
 {
 	struct bpf_cpu_map_entry *rcpu = data;
+	unsigned long last_qs = jiffies;
 
 	complete(&rcpu->kthread_running);
 	set_current_state(TASK_INTERRUPTIBLE);
@@ -287,10 +288,12 @@ static int cpu_map_kthread_run(void *data)
 			if (__ptr_ring_empty(rcpu->queue)) {
 				schedule();
 				sched = 1;
+				last_qs = jiffies;
 			} else {
 				__set_current_state(TASK_RUNNING);
 			}
 		} else {
+			rcu_softirq_qs_periodic(last_qs);
 			sched = cond_resched();
 		}

--
2.30.2



2024-03-13 21:36:07

by Jesper Dangaard Brouer

Subject: Re: [PATCH v3 net 0/3] Report RCU QS for busy network kthreads



On 13/03/2024 17.25, Yan Zhai wrote:
> This changeset fixes a common problem for busy networking kthreads.
> These threads, e.g. NAPI threads, typically will do:
>
> * polling a batch of packets
> * if there is more work, call cond_resched to allow scheduling
> * continue to poll more packets when rx queue is not empty
>
> We observed this being a problem in production, since it can block RCU
> tasks from making progress under heavy load. Investigation indicates
> that just calling cond_resched is insufficient for RCU tasks to reach
> quiescent states. So far this is known to affect NAPI threads,
> napi_busy_loop, and the cpumap kthread.
>
> By reporting RCU QSes in these kthreads periodically before
> cond_resched, the blocked RCU waiters can make progress correctly.
> Rather than reporting a QS for RCU tasks only, this code shares the
> same concern noted in commit d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched
> and RCU-preempt when safe"). So report a consolidated QS for safety.
>
> It is worth noting that, although this problem is reproducible in
> napi_busy_loop, it only shows up when setting the polling interval to as
> high as 2ms, which is far larger than the recommended 50us-100us in the
> documentation. So napi_busy_loop is left untouched.
>
> V2: https://lore.kernel.org/bpf/[email protected]/
> V1: https://lore.kernel.org/lkml/[email protected]/#t
>
> changes since v2:
> * created a helper in rcu header to abstract the behavior
> * fixed cpumap kthread in addition
>
> changes since v1:
> * disable preemption first as Paul McKenney suggested
>
> Yan Zhai (3):
> rcu: add a helper to report consolidated flavor QS
> net: report RCU QS on threaded NAPI repolling
> bpf: report RCU QS in cpumap kthread
>
>  include/linux/rcupdate.h | 23 +++++++++++++++++++++++
>  kernel/bpf/cpumap.c      |  3 +++
>  net/core/dev.c           |  3 +++
>  3 files changed, 29 insertions(+)

Acked-by: Jesper Dangaard Brouer <[email protected]>



2024-03-13 21:54:16

by Toke Høiland-Jørgensen

Subject: Re: [PATCH v3 net 0/3] Report RCU QS for busy network kthreads

Yan Zhai <[email protected]> writes:

> This changeset fixes a common problem for busy networking kthreads.
> These threads, e.g. NAPI threads, typically will do:
>
> * polling a batch of packets
> * if there is more work, call cond_resched to allow scheduling
> * continue to poll more packets when rx queue is not empty
>
> We observed this being a problem in production, since it can block RCU
> tasks from making progress under heavy load. Investigation indicates
> that just calling cond_resched is insufficient for RCU tasks to reach
> quiescent states. So far this is known to affect NAPI threads,
> napi_busy_loop, and the cpumap kthread.
>
> By reporting RCU QSes in these kthreads periodically before
> cond_resched, the blocked RCU waiters can make progress correctly.
> Rather than reporting a QS for RCU tasks only, this code shares the
> same concern noted in commit d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched
> and RCU-preempt when safe"). So report a consolidated QS for safety.
>
> It is worth noting that, although this problem is reproducible in
> napi_busy_loop, it only shows up when setting the polling interval to as
> high as 2ms, which is far larger than the recommended 50us-100us in the
> documentation. So napi_busy_loop is left untouched.
>
> V2: https://lore.kernel.org/bpf/[email protected]/
> V1: https://lore.kernel.org/lkml/[email protected]/#t
>
> changes since v2:
> * created a helper in rcu header to abstract the behavior
> * fixed cpumap kthread in addition
>
> changes since v1:
> * disable preemption first as Paul McKenney suggested
>
> Yan Zhai (3):
> rcu: add a helper to report consolidated flavor QS
> net: report RCU QS on threaded NAPI repolling
> bpf: report RCU QS in cpumap kthread
>
>  include/linux/rcupdate.h | 23 +++++++++++++++++++++++
>  kernel/bpf/cpumap.c      |  3 +++
>  net/core/dev.c           |  3 +++
>  3 files changed, 29 insertions(+)

For the series:

Reviewed-by: Toke Høiland-Jørgensen <[email protected]>


2024-03-14 21:55:32

by Jakub Kicinski

Subject: Re: [PATCH v3 net 1/3] rcu: add a helper to report consolidated flavor QS

On Wed, 13 Mar 2024 09:25:49 -0700 Yan Zhai wrote:
> +/**
> + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states

> +#define rcu_softirq_qs_periodic(old_ts) \

scripts/kernel-doc says:

include/linux/rcupdate.h:271: warning: Function parameter or struct member 'old_ts' not described in 'rcu_softirq_qs_periodic'
--
pw-bot: cr

2024-03-15 14:54:37

by Yan Zhai

Subject: Re: [PATCH v3 net 1/3] rcu: add a helper to report consolidated flavor QS

On Thu, Mar 14, 2024 at 4:55 PM Jakub Kicinski <[email protected]> wrote:
>
> On Wed, 13 Mar 2024 09:25:49 -0700 Yan Zhai wrote:
> > +/**
> > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states
>
> > +#define rcu_softirq_qs_periodic(old_ts) \
>
> scripts/kernel-doc says:
>
TIL, thanks. Let me send v4 to amend the text.
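
Something along these lines for the amendment (just a sketch, not the
final v4 wording):

/**
 * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states
 * @old_ts: snapshot of jiffies taken when the last QS was reported;
 *          updated in place by the macro when it reports a new QS
 *
 * (rest of the comment unchanged)
 */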

Yan

> include/linux/rcupdate.h:271: warning: Function parameter or struct member 'old_ts' not described in 'rcu_softirq_qs_periodic'
> --
> pw-bot: cr