Subject: [PATCH v5 net-next 00/15] locking: Introduce nested-BH locking.

Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within a
local_bh_disable() section remains preemptible. As a result, high-priority
tasks (or threaded interrupts) will be blocked by long-running
lower-priority tasks (or threaded interrupts), which includes softirq
sections.

The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of
networking, which is the largest stakeholder here. Once this is done, the
per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.

Performance testing. Baseline is net-next as of commit 93bda33046e7a
("Merge branch 'net-constify-ctl_table-arguments-of-utility-functions'")
plus v6.10-rc1. A 10GbE link is used between two hosts. The command

   xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e

was invoked on the receiving side with an ixgbe. The sending side uses
pktgen_sample03_burst_single_flow.sh on i40e.

Baseline:
| eth1->?                9,018,604 rx/s                 0 err,drop/s
| receive total          9,018,604 pkt/s                0 drop/s               0 error/s
|   cpu:7                9,018,604 pkt/s                0 drop/s               0 error/s
| enqueue to cpu 3       9,018,602 pkt/s                0 drop/s            7.00 bulk-avg
|   cpu:7->3             9,018,602 pkt/s                0 drop/s            7.00 bulk-avg
| kthread total          9,018,606 pkt/s                0 drop/s         214,698 sched
|   cpu:3                9,018,606 pkt/s                0 drop/s         214,698 sched
| xdp_stats                      0 pass/s       9,018,606 drop/s               0 redir/s
|   cpu:3                        0 pass/s       9,018,606 drop/s               0 redir/s
| redirect_err                   0 error/s
| xdp_exception                  0 hit/s

perf top --sort cpu,symbol --no-children:
| 18.14% 007 [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
| 13.29% 007 [k] ixgbe_poll
| 12.66% 003 [k] cpu_map_kthread_run
| 7.23% 003 [k] page_frag_free
| 6.76% 007 [k] xdp_do_redirect
| 3.76% 007 [k] cpu_map_redirect
| 3.13% 007 [k] bq_flush_to_queue
| 2.51% 003 [k] xdp_return_frame
| 1.93% 007 [k] try_to_wake_up
| 1.78% 007 [k] _raw_spin_lock
| 1.74% 007 [k] cpu_map_enqueue
| 1.56% 003 [k] bpf_prog_57cd311f2e27366b_cpumap_drop

With this series applied:
| eth1->?               10,329,340 rx/s                 0 err,drop/s
| receive total         10,329,340 pkt/s                0 drop/s               0 error/s
|   cpu:6               10,329,340 pkt/s                0 drop/s               0 error/s
| enqueue to cpu 3      10,329,338 pkt/s                0 drop/s            8.00 bulk-avg
|   cpu:6->3            10,329,338 pkt/s                0 drop/s            8.00 bulk-avg
| kthread total         10,329,321 pkt/s                0 drop/s          96,297 sched
|   cpu:3               10,329,321 pkt/s                0 drop/s          96,297 sched
| xdp_stats                      0 pass/s      10,329,321 drop/s               0 redir/s
|   cpu:3                        0 pass/s      10,329,321 drop/s               0 redir/s
| redirect_err                   0 error/s
| xdp_exception                  0 hit/s

perf top --sort cpu,symbol --no-children:
| 20.90% 006 [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
| 12.62% 006 [k] ixgbe_poll
| 9.82% 003 [k] page_frag_free
| 8.73% 003 [k] cpu_map_bpf_prog_run_xdp
| 6.63% 006 [k] xdp_do_redirect
| 4.94% 003 [k] cpu_map_kthread_run
| 4.28% 006 [k] cpu_map_redirect
| 4.03% 006 [k] bq_flush_to_queue
| 3.01% 003 [k] xdp_return_frame
| 1.95% 006 [k] _raw_spin_lock
| 1.94% 003 [k] bpf_prog_57cd311f2e27366b_cpumap_drop

The difference between the two runs appears to be noise.

v4…v5 https://lore.kernel.org/all/[email protected]/:
- Remove the guard() notation as well as __free() within the patches.
Patch #1 and #2 add the guard definition for local_lock_nested_bh()
but it remains unused with the series.
The __free() notation for bpf_net_ctx_clear has been removed entirely.

- Collect Toke's Reviewed-by.

v3…v4 https://lore.kernel.org/all/[email protected]/:
- Removed bpf_clear_redirect_map(), moved the comment to the caller.
Suggested by Toke.

- The bpf_redirect_info structure is memset() each time it is assigned.
Suggested by Toke.

- The bpf_net_ctx_set() in __napi_busy_loop() has been moved from the
top of the function to the begin/end of the BH-disabled section. This has
been done to remain in sync with other call sites.
After adding the memset() I looked at the perf numbers in my test case
and did not notice an impact; the numbers are in the same range with and
without the change. Therefore I kept the numbers from the previous
posting.

- Collected Alexei's Acked-by.

v2…v3 https://lore.kernel.org/all/[email protected]/:
- The WARN checks for bpf_net_ctx_get() have been dropped, along with all
NULL checks around it. This means bpf_net_ctx_get_ri() assumes the
context has been set and will segfault if this is not the case.
Suggested by Alexei and Jesper. This should always work or always
segfault.

- It has been suggested by Toke to embed struct bpf_net_context into
task_struct instead of just a pointer to it. This would increase the size
of task_struct by 112 bytes instead of just eight, and Alexei didn't like
it due to the size impact with 1m threads. It is a pointer again.

v1…v2 https://lore.kernel.org/all/[email protected]/:
- Jakub complained about touching networking drivers to make the
additional locking work. Alexei complained about the additional
locking within the XDP/eBPF case.
This led to a change in how the per-CPU variables are accessed for the
XDP/eBPF case. On PREEMPT_RT the variables are now stored on the stack and
the pointer to the structure is saved in the task_struct while
keeping everything for !RT unchanged. This was proposed as an RFC in
v1: https://lore.kernel.org/all/[email protected]/

and then updated

v2: https://lore.kernel.org/all/[email protected]/
- Renamed the container struct from xdp_storage to bpf_net_context.
Suggested by Toke Høiland-Jørgensen.
- Use the container struct also on !PREEMPT_RT builds. Store the
pointer to the on-stack struct in a per-CPU variable. Suggested by
Toke Høiland-Jørgensen.

This reduces the initial queue from 24 to 15 patches.

- There were complaints about the scoped_guard(), which shifts the whole
block and makes it harder to review because the whole block gets removed
and added again. The usage has been replaced with local_lock_nested_bh()
and its unlock counterpart.

Sebastian



Subject: [PATCH v5 net-next 01/15] locking/local_lock: Introduce guard definition for local_lock.

Introduce lock guard definition for local_lock_t. There are no users
yet.
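
For readers unfamiliar with the guard notation, the following is a minimal
userspace sketch of the scope-based cleanup idea that such guards build on.
It assumes a compiler with the GCC/Clang cleanup attribute. The names
fake_lock, FAKE_GUARD and depth_inside_guard are hypothetical stand-ins and
are not part of this patch or of the kernel API.

```c
#include <assert.h>

/* Hypothetical stand-in for a lock; counts how often it is held. */
struct fake_lock {
	int depth;
};

static void fake_lock_acquire(struct fake_lock *l) { l->depth++; }
static void fake_lock_release(struct fake_lock *l) { l->depth--; }

/* Cleanup handler: the compiler passes a pointer to the guard variable. */
static void fake_guard_exit(struct fake_lock **lp)
{
	fake_lock_release(*lp);
}

/* Acquire the lock now, release it when the enclosing scope is left. */
#define FAKE_GUARD(lock)						\
	struct fake_lock *guard_var					\
		__attribute__((cleanup(fake_guard_exit), unused)) =	\
		(fake_lock_acquire(lock), (lock))

static int depth_inside_guard(struct fake_lock *l)
{
	FAKE_GUARD(l);

	/* The return value is read while the lock is still held. */
	return l->depth;
}
```

The lock is dropped on every exit path of the scope without an explicit
unlock call, which is the property the guard definitions provide.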

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/local_lock.h | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index e55010fa73296..82366a37f4474 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,4 +51,15 @@
#define local_unlock_irqrestore(lock, flags) \
__local_unlock_irqrestore(lock, flags)

+DEFINE_GUARD(local_lock, local_lock_t __percpu*,
+ local_lock(_T),
+ local_unlock(_T))
+DEFINE_GUARD(local_lock_irq, local_lock_t __percpu*,
+ local_lock_irq(_T),
+ local_unlock_irq(_T))
+DEFINE_LOCK_GUARD_1(local_lock_irqsave, local_lock_t __percpu,
+ local_lock_irqsave(_T->lock, _T->flags),
+ local_unlock_irqrestore(_T->lock, _T->flags),
+ unsigned long flags)
+
#endif
--
2.45.1


Subject: [PATCH v5 net-next 03/15] net: Use __napi_alloc_frag_align() instead of open coding it.

The else condition within __netdev_alloc_frag_align() is an open coded
__napi_alloc_frag_align().

Use __napi_alloc_frag_align() instead of open coding it.
Move fragsz assignment before page_frag_alloc_align() invocation because
__napi_alloc_frag_align() also contains this statement.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/core/skbuff.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c8ac79851cd67..656b298255c5f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -318,19 +318,15 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
{
void *data;

- fragsz = SKB_DATA_ALIGN(fragsz);
if (in_hardirq() || irqs_disabled()) {
struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);

+ fragsz = SKB_DATA_ALIGN(fragsz);
data = __page_frag_alloc_align(nc, fragsz, GFP_ATOMIC,
align_mask);
} else {
- struct napi_alloc_cache *nc;
-
local_bh_disable();
- nc = this_cpu_ptr(&napi_alloc_cache);
- data = __page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC,
- align_mask);
+ data = __napi_alloc_frag_align(fragsz, align_mask);
local_bh_enable();
}
return data;
--
2.45.1


Subject: [PATCH v5 net-next 11/15] lwt: Don't disable migration prior to invoking BPF.

There is no need to explicitly disable migration if bottom halves are
also disabled. Disabling BH implies disabling migration.

Remove migrate_disable() and rely solely on disabling BH to remain on
the same CPU.

Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/core/lwt_bpf.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 4a0797f0a154b..a94943681e5aa 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -40,10 +40,9 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
{
int ret;

- /* Migration disable and BH disable are needed to protect per-cpu
- * redirect_info between BPF prog and skb_do_redirect().
+ /* Disabling BH is needed to protect per-CPU bpf_redirect_info between
+ * BPF prog and skb_do_redirect().
*/
- migrate_disable();
local_bh_disable();
bpf_compute_data_pointers(skb);
ret = bpf_prog_run_save_cb(lwt->prog, skb);
@@ -78,7 +77,6 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
}

local_bh_enable();
- migrate_enable();

return ret;
}
--
2.45.1


Subject: [PATCH v5 net-next 12/15] seg6: Use nested-BH locking for seg6_bpf_srh_states.

The access to seg6_bpf_srh_states is protected by disabling preemption.
Based on the code, the entry point is input_action_end_bpf() and
every other function (the bpf helper functions bpf_lwt_seg6_*()), that
is accessing seg6_bpf_srh_states, should be called from within
input_action_end_bpf().

input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
the function and then disables preemption. This looks wrong because if
preemption needs to be disabled as part of the locking mechanism then
the variable shouldn't be accessed beforehand.

Looking at how it is used via test_lwt_seg6local.sh,
input_action_end_bpf() is always invoked from softirq context. If this
is always the case then the preempt_disable() statement is superfluous.
If this is not always invoked from softirq then disabling only
preemption is not sufficient.

Replace the preempt_disable() statement with nested-BH locking. This is
not an equivalent replacement as it assumes that the invocation of
input_action_end_bpf() always occurs in softirq context and thus the
preempt_disable() is superfluous.
Add a local_lock_t to the data structure and use local_lock_nested_bh() for
locking. Add lockdep_assert_held() to ensure the lock is held while the
per-CPU variable is referenced in the helper functions.

Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/net/seg6_local.h | 1 +
net/core/filter.c | 3 +++
net/ipv6/seg6_local.c | 22 ++++++++++++++--------
3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 3fab9dec2ec45..888c1ce6f5272 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -19,6 +19,7 @@ extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb);

struct seg6_bpf_srh_state {
+ local_lock_t bh_lock;
struct ipv6_sr_hdr *srh;
u16 hdrlen;
bool valid;
diff --git a/net/core/filter.c b/net/core/filter.c
index 7c46ecba3b01b..ba1a739a9bedc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6450,6 +6450,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
void *srh_tlvs, *srh_end, *ptr;
int srhoff = 0;

+ lockdep_assert_held(&srh_state->bh_lock);
if (srh == NULL)
return -EINVAL;

@@ -6506,6 +6507,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
int hdroff = 0;
int err;

+ lockdep_assert_held(&srh_state->bh_lock);
switch (action) {
case SEG6_LOCAL_ACTION_END_X:
if (!seg6_bpf_has_valid_srh(skb))
@@ -6582,6 +6584,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
int srhoff = 0;
int ret;

+ lockdep_assert_held(&srh_state->bh_lock);
if (unlikely(srh == NULL))
return -EINVAL;

diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 24e2b4b494cb0..c4828c6620f07 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1380,7 +1380,9 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
return err;
}

-DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states) = {
+ .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};

bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
{
@@ -1388,6 +1390,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
this_cpu_ptr(&seg6_bpf_srh_states);
struct ipv6_sr_hdr *srh = srh_state->srh;

+ lockdep_assert_held(&srh_state->bh_lock);
if (unlikely(srh == NULL))
return false;

@@ -1408,8 +1411,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
static int input_action_end_bpf(struct sk_buff *skb,
struct seg6_local_lwt *slwt)
{
- struct seg6_bpf_srh_state *srh_state =
- this_cpu_ptr(&seg6_bpf_srh_states);
+ struct seg6_bpf_srh_state *srh_state;
struct ipv6_sr_hdr *srh;
int ret;

@@ -1420,10 +1422,14 @@ static int input_action_end_bpf(struct sk_buff *skb,
}
advance_nextseg(srh, &ipv6_hdr(skb)->daddr);

- /* preempt_disable is needed to protect the per-CPU buffer srh_state,
- * which is also accessed by the bpf_lwt_seg6_* helpers
+ /* The access to the per-CPU buffer srh_state is protected by running
+ * always in softirq context (with disabled BH). On PREEMPT_RT the
+ * required locking is provided by the following local_lock_nested_bh()
+ * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
+ * bpf_prog_run_save_cb().
*/
- preempt_disable();
+ local_lock_nested_bh(&seg6_bpf_srh_states.bh_lock);
+ srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
srh_state->srh = srh;
srh_state->hdrlen = srh->hdrlen << 3;
srh_state->valid = true;
@@ -1446,15 +1452,15 @@ static int input_action_end_bpf(struct sk_buff *skb,

if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
goto drop;
+ local_unlock_nested_bh(&seg6_bpf_srh_states.bh_lock);

- preempt_enable();
if (ret != BPF_REDIRECT)
seg6_lookup_nexthop(skb, NULL, 0);

return dst_input(skb);

drop:
- preempt_enable();
+ local_unlock_nested_bh(&seg6_bpf_srh_states.bh_lock);
kfree_skb(skb);
return -EINVAL;
}
--
2.45.1


Subject: [PATCH v5 net-next 09/15] dev: Remove PREEMPT_RT ifdefs from backlog_lock.*().

The backlog_napi locking (previously RPS) relies on explicit locking if
either RPS or backlog NAPI is enabled. If both are disabled then locking
was achieved by disabling interrupts except on PREEMPT_RT. PREEMPT_RT
was excluded because the needed synchronisation was already provided by
local_bh_disable().

Since the introduction of backlog NAPI and making it mandatory for
PREEMPT_RT the ifdef within backlog_lock.*() is obsolete and can be
removed.

Remove the ifdefs in backlog_lock.*().

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/core/dev.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 85fe8138f3e4e..a66e4e744bbb4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -229,7 +229,7 @@ static inline void backlog_lock_irq_save(struct softnet_data *sd,
{
if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
spin_lock_irqsave(&sd->input_pkt_queue.lock, *flags);
- else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ else
local_irq_save(*flags);
}

@@ -237,7 +237,7 @@ static inline void backlog_lock_irq_disable(struct softnet_data *sd)
{
if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
spin_lock_irq(&sd->input_pkt_queue.lock);
- else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ else
local_irq_disable();
}

@@ -246,7 +246,7 @@ static inline void backlog_unlock_irq_restore(struct softnet_data *sd,
{
if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
spin_unlock_irqrestore(&sd->input_pkt_queue.lock, *flags);
- else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ else
local_irq_restore(*flags);
}

@@ -254,7 +254,7 @@ static inline void backlog_unlock_irq_enable(struct softnet_data *sd)
{
if (IS_ENABLED(CONFIG_RPS) || use_backlog_threads())
spin_unlock_irq(&sd->input_pkt_queue.lock);
- else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ else
local_irq_enable();
}

--
2.45.1


Subject: [PATCH v5 net-next 08/15] net: softnet_data: Make xmit.recursion per task.

Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
local_bh_disable() there is no guarantee that only one device is
transmitting at a time.
With preemption and multiple senders it is possible that the per-CPU
recursion counter gets incremented by different threads and exceeds
XMIT_RECURSION_LIMIT leading to a false positive recursion alert.

Instead of adding a lock to protect the per-CPU variable it is simpler
to make the counter per-task. Sending and receiving skbs happens always
in thread context anyway.

Having a lock to protect the per-CPU counter would block/serialize two
sending threads needlessly. It would also require a recursive lock to
ensure that the owner can increment the counter further.

Make the recursion counter a task_struct member on PREEMPT_RT.
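
The per-task idea can be modelled in userspace as a rough sketch. A
thread-local variable stands in for the task_struct member, so each thread
only ever sees its own nesting depth. The function xmit_nested() is a
hypothetical transmit path that re-enters itself and is not part of this
patch.

```c
#include <assert.h>
#include <stdbool.h>

#define XMIT_RECURSION_LIMIT	8

/* Stand-in for the task_struct member: one counter per thread. */
static _Thread_local unsigned char net_xmit_recursion;

static inline bool dev_xmit_recursion(void)
{
	return net_xmit_recursion > XMIT_RECURSION_LIMIT;
}

static inline void dev_xmit_recursion_inc(void)
{
	net_xmit_recursion++;
}

static inline void dev_xmit_recursion_dec(void)
{
	net_xmit_recursion--;
}

/*
 * Hypothetical transmit path which re-enters itself 'depth' more times,
 * as stacked devices would. Returns false once the limit is exceeded.
 */
static bool xmit_nested(int depth)
{
	bool ok = true;

	if (dev_xmit_recursion())
		return false;

	dev_xmit_recursion_inc();
	if (depth > 0)
		ok = xmit_nested(depth - 1);
	dev_xmit_recursion_dec();
	return ok;
}
```

Because no other thread can touch the counter, a preempted sender cannot
inflate the value seen by another sender and no lock is needed.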

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/netdevice.h | 11 +++++++++++
include/linux/sched.h | 4 +++-
net/core/dev.h | 20 ++++++++++++++++++++
3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d20c6c99eb887..b5ec072ec2430 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3223,7 +3223,9 @@ struct softnet_data {
#endif
/* written and read only by owning cpu: */
struct {
+#ifndef CONFIG_PREEMPT_RT
u16 recursion;
+#endif
u8 more;
#ifdef CONFIG_NET_EGRESS
u8 skip_txqueue;
@@ -3256,10 +3258,19 @@ struct softnet_data {

DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);

+#ifdef CONFIG_PREEMPT_RT
+static inline int dev_recursion_level(void)
+{
+ return current->net_xmit_recursion;
+}
+
+#else
+
static inline int dev_recursion_level(void)
{
return this_cpu_read(softnet_data.xmit.recursion);
}
+#endif

void __netif_schedule(struct Qdisc *q);
void netif_schedule_queue(struct netdev_queue *txq);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6d..a9b0ca72db55f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -975,7 +975,9 @@ struct task_struct {
/* delay due to memory thrashing */
unsigned in_thrashing:1;
#endif
-
+#ifdef CONFIG_PREEMPT_RT
+ u8 net_xmit_recursion;
+#endif
unsigned long atomic_flags; /* Flags requiring atomic access. */

struct restart_block restart_block;
diff --git a/net/core/dev.h b/net/core/dev.h
index b7b518bc2be55..2f96d63053ad0 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -150,6 +150,25 @@ struct napi_struct *napi_by_id(unsigned int napi_id);
void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu);

#define XMIT_RECURSION_LIMIT 8
+
+#ifdef CONFIG_PREEMPT_RT
+static inline bool dev_xmit_recursion(void)
+{
+ return unlikely(current->net_xmit_recursion > XMIT_RECURSION_LIMIT);
+}
+
+static inline void dev_xmit_recursion_inc(void)
+{
+ current->net_xmit_recursion++;
+}
+
+static inline void dev_xmit_recursion_dec(void)
+{
+ current->net_xmit_recursion--;
+}
+
+#else
+
static inline bool dev_xmit_recursion(void)
{
return unlikely(__this_cpu_read(softnet_data.xmit.recursion) >
@@ -165,5 +184,6 @@ static inline void dev_xmit_recursion_dec(void)
{
__this_cpu_dec(softnet_data.xmit.recursion);
}
+#endif

#endif
--
2.45.1


Subject: [PATCH v5 net-next 07/15] netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage.

brnf_frag_data_storage is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Cc: Florian Westphal <[email protected]>
Cc: Jozsef Kadlecsik <[email protected]>
Cc: Nikolay Aleksandrov <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
Cc: Roopa Prabhu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/bridge/br_netfilter_hooks.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index bf30c50b56895..3c9f6538990ea 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -137,6 +137,7 @@ static inline bool is_pppoe_ipv6(const struct sk_buff *skb,
#define NF_BRIDGE_MAX_MAC_HEADER_LENGTH (PPPOE_SES_HLEN + ETH_HLEN)

struct brnf_frag_data {
+ local_lock_t bh_lock;
char mac[NF_BRIDGE_MAX_MAC_HEADER_LENGTH];
u8 encap_size;
u8 size;
@@ -144,7 +145,9 @@ struct brnf_frag_data {
__be16 vlan_proto;
};

-static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage);
+static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage) = {
+ .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};

static void nf_bridge_info_free(struct sk_buff *skb)
{
@@ -850,6 +853,7 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
{
struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
unsigned int mtu, mtu_reserved;
+ int ret;

mtu_reserved = nf_bridge_mtu_reduction(skb);
mtu = skb->dev->mtu;
@@ -882,6 +886,7 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff

IPCB(skb)->frag_max_size = nf_bridge->frag_max_size;

+ local_lock_nested_bh(&brnf_frag_data_storage.bh_lock);
data = this_cpu_ptr(&brnf_frag_data_storage);

if (skb_vlan_tag_present(skb)) {
@@ -897,7 +902,9 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
skb_copy_from_linear_data_offset(skb, -data->size, data->mac,
data->size);

- return br_nf_ip_fragment(net, sk, skb, br_nf_push_frag_xmit);
+ ret = br_nf_ip_fragment(net, sk, skb, br_nf_push_frag_xmit);
+ local_unlock_nested_bh(&brnf_frag_data_storage.bh_lock);
+ return ret;
}
if (IS_ENABLED(CONFIG_NF_DEFRAG_IPV6) &&
skb->protocol == htons(ETH_P_IPV6)) {
@@ -909,6 +916,7 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff

IP6CB(skb)->frag_max_size = nf_bridge->frag_max_size;

+ local_lock_nested_bh(&brnf_frag_data_storage.bh_lock);
data = this_cpu_ptr(&brnf_frag_data_storage);
data->encap_size = nf_bridge_encap_header_len(skb);
data->size = ETH_HLEN + data->encap_size;
@@ -916,8 +924,12 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
skb_copy_from_linear_data_offset(skb, -data->size, data->mac,
data->size);

- if (v6ops)
- return v6ops->fragment(net, sk, skb, br_nf_push_frag_xmit);
+ if (v6ops) {
+ ret = v6ops->fragment(net, sk, skb, br_nf_push_frag_xmit);
+ local_unlock_nested_bh(&brnf_frag_data_storage.bh_lock);
+ return ret;
+ }
+ local_unlock_nested_bh(&brnf_frag_data_storage.bh_lock);

kfree_skb(skb);
return -EMSGSIZE;
--
2.45.1


Subject: [PATCH v5 net-next 13/15] net: Use nested-BH locking for bpf_scratchpad.

bpf_scratchpad is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.
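
The locking pattern can be sketched in userspace under loud assumptions: a
bool flag models the local_lock_t, a thread-local variable stands in for the
per-CPU variable, and assert() plays the role of lockdep. The names
scratchpad_model, scratch_lock(), scratch_unlock() and sum_inverted() are
hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCRATCH_WORDS	4

struct scratchpad_model {
	bool bh_lock_held;		/* models local_lock_t bh_lock */
	uint32_t diff[SCRATCH_WORDS];
};

/* Thread-local storage stands in for the per-CPU variable. */
static _Thread_local struct scratchpad_model sp_model;

static void scratch_lock(void)
{
	assert(!sp_model.bh_lock_held);	/* the lock must not be taken twice */
	sp_model.bh_lock_held = true;
}

static void scratch_unlock(void)
{
	assert(sp_model.bh_lock_held);
	sp_model.bh_lock_held = false;
}

/* Fill the scratch buffer and consume it while the lock is held. */
static uint32_t sum_inverted(const uint32_t *from, unsigned int n)
{
	uint32_t ret = 0;
	unsigned int i;

	scratch_lock();
	for (i = 0; i < n && i < SCRATCH_WORDS; i++)
		sp_model.diff[i] = ~from[i];
	for (i = 0; i < n && i < SCRATCH_WORDS; i++)
		ret += sp_model.diff[i];
	scratch_unlock();

	/* The result is copied out before unlocking, then returned. */
	return ret;
}
```

The point of the sketch is the shape of the change: the result is stored in
a local variable so that the unlock can happen before the return.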

Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/core/filter.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index ba1a739a9bedc..fbcfd563dccfd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1658,9 +1658,12 @@ struct bpf_scratchpad {
__be32 diff[MAX_BPF_STACK / sizeof(__be32)];
u8 buff[MAX_BPF_STACK];
};
+ local_lock_t bh_lock;
};

-static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp);
+static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp) = {
+ .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};

static inline int __bpf_try_make_writable(struct sk_buff *skb,
unsigned int write_len)
@@ -2016,6 +2019,7 @@ BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
struct bpf_scratchpad *sp = this_cpu_ptr(&bpf_sp);
u32 diff_size = from_size + to_size;
int i, j = 0;
+ __wsum ret;

/* This is quite flexible, some examples:
*
@@ -2029,12 +2033,15 @@ BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
diff_size > sizeof(sp->diff)))
return -EINVAL;

+ local_lock_nested_bh(&bpf_sp.bh_lock);
for (i = 0; i < from_size / sizeof(__be32); i++, j++)
sp->diff[j] = ~from[i];
for (i = 0; i < to_size / sizeof(__be32); i++, j++)
sp->diff[j] = to[i];

- return csum_partial(sp->diff, diff_size, seed);
+ ret = csum_partial(sp->diff, diff_size, seed);
+ local_unlock_nested_bh(&bpf_sp.bh_lock);
+ return ret;
}

static const struct bpf_func_proto bpf_csum_diff_proto = {
--
2.45.1


Subject: [PATCH v5 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

The XDP redirect process has two stages:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
packet and makes decisions. While doing that, the per-CPU variable
bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
it, bpf_net_ctx_clear() removes it again.
The bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
NET_TX_SOFTIRQ which does not. Therefore only the first invocation
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info.

The pointer to bpf_net_context is saved in the task's task_struct. Always
using the bpf_net_context approach has the advantage that there are
almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
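
The nesting rule can be illustrated with a small userspace model. It is a
sketch only: a thread-local pointer stands in for current->bpf_net_context,
the structures are reduced to two placeholder fields, and the names
net_ctx_set(), net_ctx_clear(), net_ctx_get_ri() and
nested_sections_share_context() are hypothetical simplifications of the
helpers added by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct redirect_info_model {
	unsigned int kern_flags;
	unsigned int tgt_index;
};

struct net_context_model {
	struct redirect_info_model ri;
};

/* Models the pointer stored in the task's task_struct. */
static _Thread_local struct net_context_model *cur_ctx;

static struct net_context_model *net_ctx_set(struct net_context_model *ctx)
{
	if (cur_ctx != NULL)
		return NULL;		/* nested call: keep the outer context */
	memset(&ctx->ri, 0, sizeof(ctx->ri));
	cur_ctx = ctx;
	return ctx;
}

static void net_ctx_clear(struct net_context_model *ctx)
{
	if (ctx)			/* only the call that set it clears it */
		cur_ctx = NULL;
}

static struct redirect_info_model *net_ctx_get_ri(void)
{
	return &cur_ctx->ri;
}

/* Returns 1 if a nested section transparently reuses the outer storage. */
static int nested_sections_share_context(void)
{
	struct net_context_model outer_storage, inner_storage;
	struct net_context_model *outer, *inner;
	int shared;

	outer = net_ctx_set(&outer_storage);
	net_ctx_get_ri()->kern_flags = 1;

	inner = net_ctx_set(&inner_storage);	/* returns NULL */
	shared = inner == NULL &&
		 net_ctx_get_ri() == &outer_storage.ri &&
		 net_ctx_get_ri()->kern_flags == 1;
	net_ctx_clear(inner);			/* no-op for the nested call */
	shared = shared && cur_ctx == &outer_storage;

	net_ctx_clear(outer);
	return shared && cur_ctx == NULL;
}
```

Passing the return value of the set operation to the clear operation is what
makes the nested clear a no-op, so the outer section keeps its context until
it finishes.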

Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Eduard Zingerman <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Toke Høiland-Jørgensen <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Acked-by: Alexei Starovoitov <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/filter.h | 43 ++++++++++++++++++++++++++++++++++-------
include/linux/sched.h | 3 +++
kernel/bpf/cpumap.c | 3 +++
kernel/bpf/devmap.c | 9 ++++++++-
kernel/fork.c | 1 +
net/bpf/test_run.c | 11 ++++++++++-
net/core/dev.c | 26 ++++++++++++++++++++++++-
net/core/filter.c | 44 ++++++++++++------------------------------
net/core/lwt_bpf.c | 3 +++
9 files changed, 101 insertions(+), 42 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index b02aea291b7e8..2ff1c394dcf0c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -744,7 +744,38 @@ struct bpf_redirect_info {
struct bpf_nh_params nh;
};

-DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+struct bpf_net_context {
+ struct bpf_redirect_info ri;
+};
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+ struct task_struct *tsk = current;
+
+ if (tsk->bpf_net_context != NULL)
+ return NULL;
+ memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));
+ tsk->bpf_net_context = bpf_net_ctx;
+ return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+ if (bpf_net_ctx)
+ current->bpf_net_context = NULL;
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+ return current->bpf_net_context;
+}
+
+static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
+{
+ struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+ return &bpf_net_ctx->ri;
+}

/* flags for bpf_redirect_info kern_flags */
#define BPF_RI_F_RF_NO_DIRECT BIT(0) /* no napi_direct on return_frame */
@@ -1018,25 +1049,23 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
const struct bpf_insn *patch, u32 len);
int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt);

-void bpf_clear_redirect_map(struct bpf_map *map);
-
static inline bool xdp_return_frame_no_direct(void)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();

return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
}

static inline void xdp_set_return_frame_no_direct(void)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();

ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
}

static inline void xdp_clear_return_frame_no_direct(void)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();

ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
}
@@ -1592,7 +1621,7 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
u64 flags, const u64 flag_mask,
void *lookup_elem(struct bpf_map *map, u32 key))
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;

/* Lower bits of the flags are used as return code on lookup failure */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9b0ca72db55f..dfa1843ab2916 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@ struct bio_list;
struct blk_plug;
struct bpf_local_storage;
struct bpf_run_ctx;
+struct bpf_net_context;
struct capture_control;
struct cfs_rq;
struct fs_struct;
@@ -1508,6 +1509,8 @@ struct task_struct {
/* Used for BPF run context */
struct bpf_run_ctx *bpf_ctx;
#endif
+ /* Used by BPF for per-TASK xdp storage */
+ struct bpf_net_context *bpf_net_context;

#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
unsigned long lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a8e34416e960f..66974bd027109 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -240,12 +240,14 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
int xdp_n, struct xdp_cpumap_stats *stats,
struct list_head *list)
{
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
int nframes;

if (!rcpu->prog)
return xdp_n;

rcu_read_lock_bh();
+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);

nframes = cpu_map_bpf_prog_run_xdp(rcpu, frames, xdp_n, stats);

@@ -255,6 +257,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
if (unlikely(!list_empty(list)))
cpu_map_bpf_prog_run_skb(rcpu, list, stats);

+ bpf_net_ctx_clear(bpf_net_ctx);
rcu_read_unlock_bh(); /* resched point, may call do_softirq() */

return nframes;
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 4e2cdbb5629f2..3d9d62c6525d4 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -196,7 +196,14 @@ static void dev_map_free(struct bpf_map *map)
list_del_rcu(&dtab->list);
spin_unlock(&dev_map_lock);

- bpf_clear_redirect_map(map);
+ /* bpf_redirect_info->map is assigned in __bpf_xdp_redirect_map()
+ * during the NAPI callback and cleared after the XDP redirect. There is
+ * no explicit RCU read section which protects bpf_redirect_info->map,
+ * but local_bh_disable() also marks the beginning of an RCU section.
+ * This makes the complete softirq callback RCU protected. Thus after
+ * the following synchronize_rcu() there is no remaining
+ * bpf_redirect_info->map == map assignment.
+ */
synchronize_rcu();

/* Make sure prior __dev_map_entry_free() have completed. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d83..f314bdd7e6108 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2355,6 +2355,7 @@ __latent_entropy struct task_struct *copy_process(
RCU_INIT_POINTER(p->bpf_storage, NULL);
p->bpf_ctx = NULL;
#endif
+ p->bpf_net_context = NULL;

/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index f6aad4ed2ab2f..600cc8e428c1a 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -283,9 +283,10 @@ static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
u32 repeat)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
int err = 0, act, ret, i, nframes = 0, batch_sz;
struct xdp_frame **frames = xdp->frames;
+ struct bpf_redirect_info *ri;
struct xdp_page_head *head;
struct xdp_frame *frm;
bool redirect = false;
@@ -295,6 +296,8 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
batch_sz = min_t(u32, repeat, xdp->batch_size);

local_bh_disable();
+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+ ri = bpf_net_ctx_get_ri();
xdp_set_return_frame_no_direct();

for (i = 0; i < batch_sz; i++) {
@@ -359,6 +362,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
}

xdp_clear_return_frame_no_direct();
+ bpf_net_ctx_clear(bpf_net_ctx);
local_bh_enable();
return err;
}
@@ -394,6 +398,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
u32 *retval, u32 *time, bool xdp)
{
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
struct bpf_prog_array_item item = {.prog = prog};
struct bpf_run_ctx *old_ctx;
struct bpf_cg_run_ctx run_ctx;
@@ -419,10 +424,14 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
do {
run_ctx.prog_item = &item;
local_bh_disable();
+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
if (xdp)
*retval = bpf_prog_run_xdp(prog, ctx);
else
*retval = bpf_prog_run(prog, ctx);
+
+ bpf_net_ctx_clear(bpf_net_ctx);
local_bh_enable();
} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2c3f86c8cd176..73965dff1b30f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4031,10 +4031,13 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
{
struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
int sch_ret;

if (!entry)
return skb;
+
+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
if (*pt_prev) {
*ret = deliver_skb(skb, *pt_prev, orig_dev);
*pt_prev = NULL;
@@ -4063,10 +4066,12 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
break;
}
*ret = NET_RX_SUCCESS;
+ bpf_net_ctx_clear(bpf_net_ctx);
return NULL;
case TC_ACT_SHOT:
kfree_skb_reason(skb, drop_reason);
*ret = NET_RX_DROP;
+ bpf_net_ctx_clear(bpf_net_ctx);
return NULL;
/* used by tc_run */
case TC_ACT_STOLEN:
@@ -4076,8 +4081,10 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
fallthrough;
case TC_ACT_CONSUMED:
*ret = NET_RX_SUCCESS;
+ bpf_net_ctx_clear(bpf_net_ctx);
return NULL;
}
+ bpf_net_ctx_clear(bpf_net_ctx);

return skb;
}
@@ -4087,11 +4094,14 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
{
struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
int sch_ret;

if (!entry)
return skb;

+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
* already set by the caller.
*/
@@ -4107,10 +4117,12 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
/* No need to push/pop skb's mac_header here on egress! */
skb_do_redirect(skb);
*ret = NET_XMIT_SUCCESS;
+ bpf_net_ctx_clear(bpf_net_ctx);
return NULL;
case TC_ACT_SHOT:
kfree_skb_reason(skb, drop_reason);
*ret = NET_XMIT_DROP;
+ bpf_net_ctx_clear(bpf_net_ctx);
return NULL;
/* used by tc_run */
case TC_ACT_STOLEN:
@@ -4120,8 +4132,10 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
fallthrough;
case TC_ACT_CONSUMED:
*ret = NET_XMIT_SUCCESS;
+ bpf_net_ctx_clear(bpf_net_ctx);
return NULL;
}
+ bpf_net_ctx_clear(bpf_net_ctx);

return skb;
}
@@ -6358,6 +6372,7 @@ static void __napi_busy_loop(unsigned int napi_id,
{
unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
int (*napi_poll)(struct napi_struct *napi, int budget);
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
void *have_poll_lock = NULL;
struct napi_struct *napi;

@@ -6376,6 +6391,7 @@ static void __napi_busy_loop(unsigned int napi_id,
int work = 0;

local_bh_disable();
+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
if (!napi_poll) {
unsigned long val = READ_ONCE(napi->state);

@@ -6406,6 +6422,7 @@ static void __napi_busy_loop(unsigned int napi_id,
__NET_ADD_STATS(dev_net(napi->dev),
LINUX_MIB_BUSYPOLLRXPACKETS, work);
skb_defer_free_flush(this_cpu_ptr(&softnet_data));
+ bpf_net_ctx_clear(bpf_net_ctx);
local_bh_enable();

if (!loop_end || loop_end(loop_end_arg, start_time))
@@ -6833,6 +6850,7 @@ static int napi_thread_wait(struct napi_struct *napi)

static void napi_threaded_poll_loop(struct napi_struct *napi)
{
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
struct softnet_data *sd;
unsigned long last_qs = jiffies;

@@ -6841,6 +6859,8 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
void *have;

local_bh_disable();
+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
sd = this_cpu_ptr(&softnet_data);
sd->in_napi_threaded_poll = true;

@@ -6856,6 +6876,7 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
net_rps_action_and_irq_enable(sd);
}
skb_defer_free_flush(sd);
+ bpf_net_ctx_clear(bpf_net_ctx);
local_bh_enable();

if (!repoll)
@@ -6881,10 +6902,12 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
struct softnet_data *sd = this_cpu_ptr(&softnet_data);
unsigned long time_limit = jiffies +
usecs_to_jiffies(READ_ONCE(net_hotdata.netdev_budget_usecs));
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
int budget = READ_ONCE(net_hotdata.netdev_budget);
LIST_HEAD(list);
LIST_HEAD(repoll);

+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
start:
sd->in_net_rx_action = true;
local_irq_disable();
@@ -6937,7 +6960,8 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
sd->in_net_rx_action = false;

net_rps_action_and_irq_enable(sd);
-end:;
+end:
+ bpf_net_ctx_clear(bpf_net_ctx);
}

struct netdev_adjacent {
diff --git a/net/core/filter.c b/net/core/filter.c
index fbcfd563dccfd..f40b8393dd58f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2478,9 +2478,6 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
.arg3_type = ARG_ANYTHING,
};

-DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
-EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
-
static struct net_device *skb_get_peer_dev(struct net_device *dev)
{
const struct net_device_ops *ops = dev->netdev_ops;
@@ -2493,7 +2490,7 @@ static struct net_device *skb_get_peer_dev(struct net_device *dev)

int skb_do_redirect(struct sk_buff *skb)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
struct net *net = dev_net(skb->dev);
struct net_device *dev;
u32 flags = ri->flags;
@@ -2526,7 +2523,7 @@ int skb_do_redirect(struct sk_buff *skb)

BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();

if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
return TC_ACT_SHOT;
@@ -2547,7 +2544,7 @@ static const struct bpf_func_proto bpf_redirect_proto = {

BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();

if (unlikely(flags))
return TC_ACT_SHOT;
@@ -2569,7 +2566,7 @@ static const struct bpf_func_proto bpf_redirect_peer_proto = {
BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
int, plen, u64, flags)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();

if (unlikely((plen && plen < sizeof(*params)) || flags))
return TC_ACT_SHOT;
@@ -4295,30 +4292,13 @@ void xdp_do_check_flushed(struct napi_struct *napi)
}
#endif

-void bpf_clear_redirect_map(struct bpf_map *map)
-{
- struct bpf_redirect_info *ri;
- int cpu;
-
- for_each_possible_cpu(cpu) {
- ri = per_cpu_ptr(&bpf_redirect_info, cpu);
- /* Avoid polluting remote cacheline due to writes if
- * not needed. Once we pass this test, we need the
- * cmpxchg() to make sure it hasn't been changed in
- * the meantime by remote CPU.
- */
- if (unlikely(READ_ONCE(ri->map) == map))
- cmpxchg(&ri->map, map, NULL);
- }
-}
-
DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);

u32 xdp_master_redirect(struct xdp_buff *xdp)
{
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
struct net_device *master, *slave;
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
@@ -4390,7 +4370,7 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
map = READ_ONCE(ri->map);

/* The map pointer is cleared when the map is being torn
- * down by bpf_clear_redirect_map()
+ * down by dev_map_free()
*/
if (unlikely(!map)) {
err = -ENOENT;
@@ -4435,7 +4415,7 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
enum bpf_map_type map_type = ri->map_type;

if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4449,7 +4429,7 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
enum bpf_map_type map_type = ri->map_type;

if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4466,7 +4446,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
enum bpf_map_type map_type, u32 map_id,
u32 flags)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
struct bpf_map *map;
int err;

@@ -4478,7 +4458,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
map = READ_ONCE(ri->map);

/* The map pointer is cleared when the map is being torn
- * down by bpf_clear_redirect_map()
+ * down by dev_map_free()
*/
if (unlikely(!map)) {
err = -ENOENT;
@@ -4520,7 +4500,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
enum bpf_map_type map_type = ri->map_type;
void *fwd = ri->tgt_value;
u32 map_id = ri->map_id;
@@ -4556,7 +4536,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,

BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();

if (unlikely(flags))
return XDP_ABORTED;
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..afb05f58b64c5 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,14 @@ static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
struct dst_entry *dst, bool can_redirect)
{
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
int ret;

/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
* BPF prog and skb_do_redirect().
*/
local_bh_disable();
+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
bpf_compute_data_pointers(skb);
ret = bpf_prog_run_save_cb(lwt->prog, skb);

@@ -76,6 +78,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
break;
}

+ bpf_net_ctx_clear(bpf_net_ctx);
local_bh_enable();

return ret;
--
2.45.1
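
The set/clear pairing that patch 14/15 adds around each BPF entry point can be modelled in userspace. The following is a minimal sketch, not the kernel code: a thread-local pointer stands in for current->bpf_net_context, and all *_model names and the two-field redirect struct are assumptions made for illustration. It shows why bpf_net_ctx_set() returning NULL for a nested caller makes the matching bpf_net_ctx_clear() a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins for the kernel types; field names are illustrative. */
struct redirect_info_model {
	unsigned int flags;
	unsigned int kern_flags;
};

struct net_ctx_model {
	struct redirect_info_model ri;
};

/* Models current->bpf_net_context: one slot per thread. */
static _Thread_local struct net_ctx_model *cur_net_ctx;

/* Install ctx unless an outer caller already did; NULL means "nested". */
static struct net_ctx_model *net_ctx_set(struct net_ctx_model *ctx)
{
	if (cur_net_ctx != NULL)
		return NULL;
	memset(&ctx->ri, 0, sizeof(ctx->ri));
	cur_net_ctx = ctx;
	return ctx;
}

/* Only the caller that actually installed the context removes it. */
static void net_ctx_clear(struct net_ctx_model *ctx)
{
	if (ctx)
		cur_net_ctx = NULL;
}

static struct redirect_info_model *net_ctx_get_ri(void)
{
	return &cur_net_ctx->ri;
}

/* A section that may run standalone or inside an outer section. */
static unsigned int nested_section(void)
{
	struct net_ctx_model inner, *ctx;
	unsigned int seen;

	ctx = net_ctx_set(&inner);           /* NULL when an outer ctx exists */
	net_ctx_get_ri()->kern_flags |= 1u;  /* always hits the active ctx */
	seen = net_ctx_get_ri()->kern_flags;
	net_ctx_clear(ctx);                  /* no-op when nested */
	return seen;
}
```

In this model, a nested section writes into the outer caller's context and leaves it installed on exit, which mirrors how the sch_handle_ingress() hunk behaves when it runs inside net_rx_action().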


Subject: [PATCH v5 net-next 15/15] net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT.

The flush lists, which are accessed from within the NAPI callback
(xdp_do_flush() for instance), are per-CPU. They are subject to the
same problem as struct bpf_redirect_info.

Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and
xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the
access.

Cc: "Björn Töpel" <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Eduard Zingerman <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Jonathan Lemon <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Maciej Fijalkowski <[email protected]>
Cc: Magnus Karlsson <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Toke Høiland-Jørgensen <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/filter.h | 32 ++++++++++++++++++++++++++++++++
kernel/bpf/cpumap.c | 19 +++----------------
kernel/bpf/devmap.c | 11 +++--------
net/xdp/xsk.c | 12 ++++--------
4 files changed, 42 insertions(+), 32 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2ff1c394dcf0c..d2b4260d9d0be 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -746,6 +746,9 @@ struct bpf_redirect_info {

struct bpf_net_context {
struct bpf_redirect_info ri;
+ struct list_head cpu_map_flush_list;
+ struct list_head dev_map_flush_list;
+ struct list_head xskmap_map_flush_list;
};

static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
@@ -755,6 +758,14 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp
if (tsk->bpf_net_context != NULL)
return NULL;
memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));
+
+ if (IS_ENABLED(CONFIG_BPF_SYSCALL)) {
+ INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+ INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+ }
+ if (IS_ENABLED(CONFIG_XDP_SOCKETS))
+ INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+
tsk->bpf_net_context = bpf_net_ctx;
return bpf_net_ctx;
}
@@ -777,6 +788,27 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
return &bpf_net_ctx->ri;
}

+static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
+{
+ struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+ return &bpf_net_ctx->cpu_map_flush_list;
+}
+
+static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
+{
+ struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+ return &bpf_net_ctx->dev_map_flush_list;
+}
+
+static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
+{
+ struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+ return &bpf_net_ctx->xskmap_map_flush_list;
+}
+
/* flags for bpf_redirect_info kern_flags */
#define BPF_RI_F_RF_NO_DIRECT BIT(0) /* no napi_direct on return_frame */

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 66974bd027109..068e994ed781a 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -79,8 +79,6 @@ struct bpf_cpu_map {
struct bpf_cpu_map_entry __rcu **cpu_map;
};

-static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list);
-
static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
{
u32 value_size = attr->value_size;
@@ -709,7 +707,7 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
*/
static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
{
- struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+ struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);

if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
@@ -761,7 +759,7 @@ int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,

void __cpu_map_flush(void)
{
- struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
+ struct list_head *flush_list = bpf_net_ctx_get_cpu_map_flush_list();
struct xdp_bulk_queue *bq, *tmp;

list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
@@ -775,20 +773,9 @@ void __cpu_map_flush(void)
#ifdef CONFIG_DEBUG_NET
bool cpu_map_check_flush(void)
{
- if (list_empty(this_cpu_ptr(&cpu_map_flush_list)))
+ if (list_empty(bpf_net_ctx_get_cpu_map_flush_list()))
return false;
__cpu_map_flush();
return true;
}
#endif
-
-static int __init cpu_map_init(void)
-{
- int cpu;
-
- for_each_possible_cpu(cpu)
- INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
- return 0;
-}
-
-subsys_initcall(cpu_map_init);
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 3d9d62c6525d4..c8267ed580840 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -83,7 +83,6 @@ struct bpf_dtab {
u32 n_buckets;
};

-static DEFINE_PER_CPU(struct list_head, dev_flush_list);
static DEFINE_SPINLOCK(dev_map_lock);
static LIST_HEAD(dev_map_list);

@@ -415,7 +414,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
*/
void __dev_flush(void)
{
- struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+ struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
struct xdp_dev_bulk_queue *bq, *tmp;

list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
@@ -429,7 +428,7 @@ void __dev_flush(void)
#ifdef CONFIG_DEBUG_NET
bool dev_check_flush(void)
{
- if (list_empty(this_cpu_ptr(&dev_flush_list)))
+ if (list_empty(bpf_net_ctx_get_dev_flush_list()))
return false;
__dev_flush();
return true;
@@ -460,7 +459,7 @@ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
struct net_device *dev_rx, struct bpf_prog *xdp_prog)
{
- struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
+ struct list_head *flush_list = bpf_net_ctx_get_dev_flush_list();
struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);

if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
@@ -1163,15 +1162,11 @@ static struct notifier_block dev_map_notifier = {

static int __init dev_map_init(void)
{
- int cpu;
-
/* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */
BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
offsetof(struct _bpf_dtab_netdev, dev));
register_netdevice_notifier(&dev_map_notifier);

- for_each_possible_cpu(cpu)
- INIT_LIST_HEAD(&per_cpu(dev_flush_list, cpu));
return 0;
}

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 727aa20be4bde..8b0b557408fc2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -35,8 +35,6 @@
#define TX_BATCH_SIZE 32
#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE)

-static DEFINE_PER_CPU(struct list_head, xskmap_flush_list);
-
void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
{
if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
@@ -375,7 +373,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)

int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
{
- struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+ struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
int err;

err = xsk_rcv(xs, xdp);
@@ -390,7 +388,7 @@ int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)

void __xsk_map_flush(void)
{
- struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+ struct list_head *flush_list = bpf_net_ctx_get_xskmap_flush_list();
struct xdp_sock *xs, *tmp;

list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
@@ -402,7 +400,7 @@ void __xsk_map_flush(void)
#ifdef CONFIG_DEBUG_NET
bool xsk_map_check_flush(void)
{
- if (list_empty(this_cpu_ptr(&xskmap_flush_list)))
+ if (list_empty(bpf_net_ctx_get_xskmap_flush_list()))
return false;
__xsk_map_flush();
return true;
@@ -1775,7 +1773,7 @@ static struct pernet_operations xsk_net_ops = {

static int __init xsk_init(void)
{
- int err, cpu;
+ int err;

err = proto_register(&xsk_proto, 0 /* no slab */);
if (err)
@@ -1793,8 +1791,6 @@ static int __init xsk_init(void)
if (err)
goto out_pernet;

- for_each_possible_cpu(cpu)
- INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
return 0;

out_pernet:
--
2.45.1
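
The effect of patch 15/15, where the flush list head lives in the context object and is initialised at set time rather than in an initcall, can be sketched in userspace as follows. This is a simplified model under stated assumptions: list_node replaces struct list_head, the *_model types are hypothetical, and the "link the queue on first enqueue" rule is a simplification of what bq_enqueue() does in cpumap.c and devmap.c.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, standing in for struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_node_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static int list_node_empty(const struct list_node *head)
{
	return head->next == head;
}

static void list_node_add_tail(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void list_node_del(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node->prev = NULL;
}

#define entry_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical bulk queue: counts pending frames instead of storing them. */
struct bulk_queue_model {
	int count;
	struct list_node flush_node;
};

/* The flush list is owned by the context, not by a per-CPU variable. */
struct net_ctx_model {
	struct list_node flush_list;
};

static void net_ctx_init(struct net_ctx_model *ctx)
{
	list_node_init(&ctx->flush_list);
}

/* Queue one frame; link the queue into the flush list on first use. */
static void bq_enqueue_model(struct net_ctx_model *ctx,
			     struct bulk_queue_model *bq)
{
	if (bq->count++ == 0)
		list_node_add_tail(&bq->flush_node, &ctx->flush_list);
}

/* Drain every linked queue; returns the number of frames flushed. */
static int flush_model(struct net_ctx_model *ctx)
{
	int flushed = 0;

	while (!list_node_empty(&ctx->flush_list)) {
		struct bulk_queue_model *bq =
			entry_of(ctx->flush_list.next,
				 struct bulk_queue_model, flush_node);

		flushed += bq->count;
		bq->count = 0;
		list_node_del(&bq->flush_node);
	}
	return flushed;
}
```

Because the list head is part of the context, two tasks that are preempted on the same CPU each see their own list in this model, which is the property the per-CPU lists could not provide on PREEMPT_RT.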


Subject: [PATCH v5 net-next 10/15] dev: Use nested-BH locking for softnet_data.process_queue.

softnet_data::process_queue is a per-CPU variable and relies on disabled
BH for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

softnet_data::input_queue_head can be updated locklessly. This is fine
because this value is only updated CPU-locally by the local backlog_napi
thread.

Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking
of process_queue. This change adds only lockdep coverage and does not
alter the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/netdevice.h | 1 +
net/core/dev.c | 12 +++++++++++-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b5ec072ec2430..f0ab89caf3cc2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3200,6 +3200,7 @@ static inline bool dev_has_header(const struct net_device *dev)
struct softnet_data {
struct list_head poll_list;
struct sk_buff_head process_queue;
+ local_lock_t process_queue_bh_lock;

/* stats */
unsigned int processed;
diff --git a/net/core/dev.c b/net/core/dev.c
index a66e4e744bbb4..2c3f86c8cd176 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -449,7 +449,9 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
* queue in the local softnet handler.
*/

-DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
+DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data) = {
+ .process_queue_bh_lock = INIT_LOCAL_LOCK(process_queue_bh_lock),
+};
EXPORT_PER_CPU_SYMBOL(softnet_data);

/* Page_pool has a lockless array/stack to alloc/recycle pages.
@@ -5934,6 +5936,7 @@ static void flush_backlog(struct work_struct *work)
}
backlog_unlock_irq_enable(sd);

+ local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
if (skb->dev->reg_state == NETREG_UNREGISTERING) {
__skb_unlink(skb, &sd->process_queue);
@@ -5941,6 +5944,7 @@ static void flush_backlog(struct work_struct *work)
rps_input_queue_head_incr(sd);
}
}
+ local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
local_bh_enable();
}

@@ -6062,7 +6066,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
while (again) {
struct sk_buff *skb;

+ local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
while ((skb = __skb_dequeue(&sd->process_queue))) {
+ local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
rcu_read_lock();
__netif_receive_skb(skb);
rcu_read_unlock();
@@ -6071,7 +6077,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
return work;
}

+ local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
}
+ local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);

backlog_lock_irq_disable(sd);
if (skb_queue_empty(&sd->input_pkt_queue)) {
@@ -6086,8 +6094,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
napi->state &= NAPIF_STATE_THREADED;
again = false;
} else {
+ local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
skb_queue_splice_tail_init(&sd->input_pkt_queue,
&sd->process_queue);
+ local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
}
backlog_unlock_irq_enable(sd);
}
--
2.45.1
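
The lock hand-over in the process_backlog() hunk of patch 10/15 (hold the lock for the dequeue, drop it while the packet is delivered, re-take it before the next dequeue) can be checked with a small userspace model. This is a sketch only: lock_model is a plain flag with assertions, not a local_lock_t, and the queue, deliver_model() and the integer "packets" are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A flag with assertions; models lock ownership, not real exclusion. */
struct lock_model {
	bool held;
};

static void lock_model_acquire(struct lock_model *l)
{
	assert(!l->held);
	l->held = true;
}

static void lock_model_release(struct lock_model *l)
{
	assert(l->held);
	l->held = false;
}

struct queue_model {
	struct lock_model lock;
	int items[8];
	size_t head, tail;
	int delivered_while_locked;   /* must stay 0 */
};

/* Dequeue requires the lock, like __skb_dequeue() on process_queue. */
static bool dequeue_model(struct queue_model *q, int *out)
{
	assert(q->lock.held);
	if (q->head == q->tail)
		return false;
	*out = q->items[q->head++];
	return true;
}

/* Delivery stands in for __netif_receive_skb() and runs unlocked. */
static void deliver_model(struct queue_model *q, int item, int *sum)
{
	if (q->lock.held)
		q->delivered_while_locked++;
	*sum += item;
}

static int process_queue_model(struct queue_model *q, int quota, int *sum)
{
	int work = 0, item;

	lock_model_acquire(&q->lock);
	while (dequeue_model(q, &item)) {
		lock_model_release(&q->lock);
		deliver_model(q, item, sum);
		if (++work >= quota)
			return work;          /* lock was already dropped */
		lock_model_acquire(&q->lock);
	}
	lock_model_release(&q->lock);
	return work;
}
```

Both exits leave the lock released in this model: the quota exit returns after the in-loop unlock, and the empty-queue exit goes through the unlock after the loop.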


Subject: [PATCH v5 net-next 06/15] net/ipv4: Use nested-BH locking for ipv4_tcp_sk.

ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a sock member (original ipv4_tcp_sk) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.

Cc: David Ahern <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/net/sock.h | 5 +++++
net/ipv4/tcp_ipv4.c | 15 +++++++++++----
2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 953c8dc4e259e..7d6784ebb26f5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -544,6 +544,11 @@ struct sock {
netns_tracker ns_tracker;
};

+struct sock_bh_locked {
+ struct sock *sock;
+ local_lock_t bh_lock;
+};
+
enum sk_pacing {
SK_PACING_NONE = 0,
SK_PACING_NEEDED = 1,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3613e08ca7949..58b21f5c333b2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -93,7 +93,9 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
struct inet_hashinfo tcp_hashinfo;
EXPORT_SYMBOL(tcp_hashinfo);

-static DEFINE_PER_CPU(struct sock *, ipv4_tcp_sk);
+static DEFINE_PER_CPU(struct sock_bh_locked, ipv4_tcp_sk) = {
+ .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};

static u32 tcp_v4_init_seq(const struct sk_buff *skb)
{
@@ -882,7 +884,9 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
- ctl_sk = this_cpu_read(ipv4_tcp_sk);
+ local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
+ ctl_sk = this_cpu_read(ipv4_tcp_sk.sock);
+
sock_net_set(ctl_sk, net);
if (sk) {
ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
@@ -907,6 +911,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
sock_net_set(ctl_sk, &init_net);
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
+ local_unlock_nested_bh(&ipv4_tcp_sk.bh_lock);
local_bh_enable();

#ifdef CONFIG_TCP_MD5SIG
@@ -1002,7 +1007,8 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
- ctl_sk = this_cpu_read(ipv4_tcp_sk);
+ local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
+ ctl_sk = this_cpu_read(ipv4_tcp_sk.sock);
sock_net_set(ctl_sk, net);
ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
inet_twsk(sk)->tw_mark : READ_ONCE(sk->sk_mark);
@@ -1017,6 +1023,7 @@ static void tcp_v4_send_ack(const struct sock *sk,

sock_net_set(ctl_sk, &init_net);
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
+ local_unlock_nested_bh(&ipv4_tcp_sk.bh_lock);
local_bh_enable();
}

@@ -3619,7 +3626,7 @@ void __init tcp_v4_init(void)

sk->sk_clockid = CLOCK_MONOTONIC;

- per_cpu(ipv4_tcp_sk, cpu) = sk;
+ per_cpu(ipv4_tcp_sk.sock, cpu) = sk;
}
if (register_pernet_subsys(&tcp_sk_ops))
panic("Failed to create the TCP control socket.\n");
--
2.45.1


Subject: [PATCH v5 net-next 05/15] net/tcp_sigpool: Use nested-BH locking for sigpool_scratch.

sigpool_scratch is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a pad member (original sigpool_scratch) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.

Cc: David Ahern <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/ipv4/tcp_sigpool.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_sigpool.c b/net/ipv4/tcp_sigpool.c
index 8512cb09ebc09..d8a4f192873a2 100644
--- a/net/ipv4/tcp_sigpool.c
+++ b/net/ipv4/tcp_sigpool.c
@@ -10,7 +10,14 @@
#include <net/tcp.h>

static size_t __scratch_size;
-static DEFINE_PER_CPU(void __rcu *, sigpool_scratch);
+struct sigpool_scratch {
+ local_lock_t bh_lock;
+ void __rcu *pad;
+};
+
+static DEFINE_PER_CPU(struct sigpool_scratch, sigpool_scratch) = {
+ .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};

struct sigpool_entry {
struct crypto_ahash *hash;
@@ -72,7 +79,7 @@ static int sigpool_reserve_scratch(size_t size)
break;
}

- old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+ old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu),
scratch, lockdep_is_held(&cpool_mutex));
if (!cpu_online(cpu) || !old_scratch) {
kfree(old_scratch);
@@ -93,7 +100,7 @@ static void sigpool_scratch_free(void)
int cpu;

for_each_possible_cpu(cpu)
- kfree(rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+ kfree(rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu),
NULL, lockdep_is_held(&cpool_mutex)));
__scratch_size = 0;
}
@@ -277,7 +284,8 @@ int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c) __cond_acquires(RC
/* Pairs with tcp_sigpool_reserve_scratch(), scratch area is
* valid (allocated) until tcp_sigpool_end().
*/
- c->scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch));
+ local_lock_nested_bh(&sigpool_scratch.bh_lock);
+ c->scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch.pad));
return 0;
}
EXPORT_SYMBOL_GPL(tcp_sigpool_start);
@@ -286,6 +294,7 @@ void tcp_sigpool_end(struct tcp_sigpool *c) __releases(RCU_BH)
{
struct crypto_ahash *hash = crypto_ahash_reqtfm(c->req);

+ local_unlock_nested_bh(&sigpool_scratch.bh_lock);
rcu_read_unlock_bh();
ahash_request_free(c->req);
crypto_free_ahash(hash);
--
2.45.1


2024-06-07 10:02:33

by Peter Zijlstra

Subject: Re: [PATCH v5 net-next 01/15] locking/local_lock: Introduce guard definition for local_lock.

On Fri, Jun 07, 2024 at 08:53:04AM +0200, Sebastian Andrzej Siewior wrote:
> Introduce lock guard definition for local_lock_t. There are no users
> yet.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>

Acked-by: Peter Zijlstra (Intel) <[email protected]>

> ---
> include/linux/local_lock.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
> index e55010fa73296..82366a37f4474 100644
> --- a/include/linux/local_lock.h
> +++ b/include/linux/local_lock.h
> @@ -51,4 +51,15 @@
> #define local_unlock_irqrestore(lock, flags) \
> __local_unlock_irqrestore(lock, flags)
>
> +DEFINE_GUARD(local_lock, local_lock_t __percpu*,
> + local_lock(_T),
> + local_unlock(_T))
> +DEFINE_GUARD(local_lock_irq, local_lock_t __percpu*,
> + local_lock_irq(_T),
> + local_unlock_irq(_T))
> +DEFINE_LOCK_GUARD_1(local_lock_irqsave, local_lock_t __percpu,
> + local_lock_irqsave(_T->lock, _T->flags),
> + local_unlock_irqrestore(_T->lock, _T->flags),
> + unsigned long flags)
> +
> #endif
> --
> 2.45.1
>
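
For readers unfamiliar with the guard helpers: DEFINE_GUARD() from include/linux/cleanup.h builds on the compiler's cleanup attribute, so the unlock runs automatically when the guard variable goes out of scope. Below is a minimal userspace sketch of that idea, not the kernel macros themselves. It assumes GCC or Clang, and demo_lock, demo_guard() and the helper functions are made-up names used only for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for a lock; it only tracks its own state. */
struct demo_lock {
	int held;
	int acquisitions;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

static void demo_lock_release(struct demo_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Cleanup handler: the compiler passes a pointer to the guard variable. */
static void demo_guard_exit(struct demo_lock **guard)
{
	demo_lock_release(*guard);
}

/* Take the lock now, drop it when the enclosing scope ends. */
#define demo_guard(lockp)						\
	struct demo_lock *demo_guard_var				\
		__attribute__((cleanup(demo_guard_exit), unused)) =	\
		(demo_lock_acquire(lockp), (lockp))

static int demo_critical_section(struct demo_lock *l, int early)
{
	demo_guard(l);

	if (early)
		return -1;	/* the lock is released on this path too */
	return l->held;		/* evaluated while the lock is still held */
}
```

Calling demo_critical_section() with either value of `early` leaves the lock released afterwards, which is the property that makes guards attractive for functions with several return paths.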

2024-06-07 11:51:49

by Jesper Dangaard Brouer

Subject: Re: [PATCH v5 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.


On 07/06/2024 08.53, Sebastian Andrzej Siewior wrote:
[...]
>
> Create a struct bpf_net_context which contains struct bpf_redirect_info.
> Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
> it, bpf_net_ctx_clear() removes it again.
> The bpf_net_ctx_set() may nest. For instance a function can be used from
> within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
> NET_TX_SOFTIRQ which does not. Therefore only the first invocations
> updates the pointer.
> Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> The pointer to bpf_net_context is saved task's task_struct. Using
> always the bpf_net_context approach has the advantage that there is
> almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
>
[...]
> ---
> include/linux/filter.h | 43 ++++++++++++++++++++++++++++++++++-------
> include/linux/sched.h | 3 +++
> kernel/bpf/cpumap.c | 3 +++
> kernel/bpf/devmap.c | 9 ++++++++-
> kernel/fork.c | 1 +
> net/bpf/test_run.c | 11 ++++++++++-
> net/core/dev.c | 26 ++++++++++++++++++++++++-
> net/core/filter.c | 44 ++++++++++++------------------------------
> net/core/lwt_bpf.c | 3 +++
> 9 files changed, 101 insertions(+), 42 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index b02aea291b7e8..2ff1c394dcf0c 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -744,7 +744,38 @@ struct bpf_redirect_info {
> struct bpf_nh_params nh;
> };
>
> -DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
> +struct bpf_net_context {
> + struct bpf_redirect_info ri;
> +};
> +
> +static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
> +{
> + struct task_struct *tsk = current;
> +
> + if (tsk->bpf_net_context != NULL)
> + return NULL;
> + memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));

It annoys me that we have to clear this memory every time.
(This is added in net_rx_action(), which *all* RX packets traverse.)

The feature and its memory are only (or at least primarily) used for XDP
and TC redirects, but we pay the cost of clearing even when these
features are not used.

The netstack does bulking in most of the cases where this is used, so in
our/your benchmarks this overhead doesn't show. But we need to be aware
that this is a "paper-cut" for single network packet processing.

Idea: We could postpone clearing until code calls bpf_net_ctx_get()?
See below.

> + tsk->bpf_net_context = bpf_net_ctx;
> + return bpf_net_ctx;
> +}
> +
> +static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
> +{
> + if (bpf_net_ctx)
> + current->bpf_net_context = NULL;
> +}
> +
> +static inline struct bpf_net_context *bpf_net_ctx_get(void)
> +{

> + return current->bpf_net_context;
> +}
> +
> +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
> +{
> + struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
> +

if (bpf_net_ctx->ri.kern_flags & BPF_RI_F_NEEDS_INIT) {
/* memset + init_list (introduced in patch 15) */
}

Maybe even postpone the init_list calls to the "get" helpers introduced
in patch 15.


> + return &bpf_net_ctx->ri;
> +}
>
[...]

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2c3f86c8cd176..73965dff1b30f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
[...]
> @@ -6881,10 +6902,12 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)

The function net_rx_action() is core to the network stack.

> struct softnet_data *sd = this_cpu_ptr(&softnet_data);
> unsigned long time_limit = jiffies +
> usecs_to_jiffies(READ_ONCE(net_hotdata.netdev_budget_usecs));
> + struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> int budget = READ_ONCE(net_hotdata.netdev_budget);
> LIST_HEAD(list);
> LIST_HEAD(repoll);
>
> + bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> start:
> sd->in_net_rx_action = true;
> local_irq_disable();
> @@ -6937,7 +6960,8 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> sd->in_net_rx_action = false;
>
> net_rps_action_and_irq_enable(sd);
> -end:;
> +end:
> + bpf_net_ctx_clear(bpf_net_ctx);
> }
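
The nesting rule from the commit message (only the first bpf_net_ctx_set() installs the pointer, and only that caller clears it) can be modelled in userspace roughly as follows. This is a sketch under assumptions: demo_current_ctx stands in for current->bpf_net_context, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct bpf_net_context. */
struct demo_ctx {
	int ri;		/* placeholder for struct bpf_redirect_info */
};

/* Plays the role of current->bpf_net_context. */
static struct demo_ctx *demo_current_ctx;

/* Install ctx unless an outer caller already did; NULL means "nested". */
static struct demo_ctx *demo_ctx_set(struct demo_ctx *ctx)
{
	if (demo_current_ctx != NULL)
		return NULL;
	memset(ctx, 0, sizeof(*ctx));
	demo_current_ctx = ctx;
	return ctx;
}

/* Only the caller that installed the context removes it again. */
static void demo_ctx_clear(struct demo_ctx *ctx)
{
	if (ctx)
		demo_current_ctx = NULL;
}

/* Inner user: may run standalone or from within demo_outer(). */
static int demo_inner(void)
{
	struct demo_ctx on_stack, *ctx;
	int nested;

	ctx = demo_ctx_set(&on_stack);
	nested = (ctx == NULL);
	demo_ctx_clear(ctx);	/* no-op when nested */
	return nested;
}

/* Outer user, playing the role of net_rx_action(). */
static int demo_outer(void)
{
	struct demo_ctx on_stack, *ctx;
	int inner_was_nested, still_set;

	ctx = demo_ctx_set(&on_stack);
	inner_was_nested = demo_inner();
	still_set = (demo_current_ctx == &on_stack);
	demo_ctx_clear(ctx);
	return inner_was_nested && still_set;
}
```

The point of returning NULL from the nested set is that the matching clear becomes a no-op, so the outer context survives until its owner is done.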


The memset can be further optimized: it currently clears 64 bytes, but
it only needs to clear 40 bytes, see pahole below.

Replace memset with something like:
memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));

This is an optimization because, with 64 bytes, this results in a
rep-stos (repeated string store operation), which on Intel touches
CPU flags (to be IRQ safe) and is slow. Clearing 40 bytes doesn't cause
the compiler to use this instruction, which is faster. Memset
benchmarked with [1].

[1]
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c

--Jesper

$ pahole -C bpf_redirect_info vmlinux
struct bpf_redirect_info {
u64 tgt_index; /* 0 8 */
void * tgt_value; /* 8 8 */
struct bpf_map * map; /* 16 8 */
u32 flags; /* 24 4 */
u32 kern_flags; /* 28 4 */
u32 map_id; /* 32 4 */
enum bpf_map_type map_type; /* 36 4 */
struct bpf_nh_params nh; /* 40 20 */

/* size: 64, cachelines: 1, members: 8 */
/* padding: 4 */
};



The full struct:

$ pahole -C bpf_net_context vmlinux
struct bpf_net_context {
struct bpf_redirect_info ri; /* 0 64 */

/* XXX last struct has 4 bytes of padding */

/* --- cacheline 1 boundary (64 bytes) --- */
struct list_head cpu_map_flush_list; /* 64 16 */
struct list_head dev_map_flush_list; /* 80 16 */
struct list_head xskmap_map_flush_list; /* 96 16 */

/* size: 112, cachelines: 2, members: 4 */
/* paddings: 1, sum paddings: 4 */
/* last cacheline: 48 bytes */
};
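
To make the 40-versus-64 byte argument concrete, here is a userspace sketch that mirrors the field order and sizes from the pahole output above and clears only the part in front of `nh`. It assumes an LP64 target (8-byte pointers); the demo_* types are stand-ins, not the kernel definitions, and demo_nh_params is merely sized to match the 20 bytes pahole reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in with the size pahole reports for bpf_nh_params (20 bytes). */
struct demo_nh_params {
	uint32_t nh_family;
	uint8_t  nh_addr[16];
};

/* Mirrors the field order and sizes of the pahole output (LP64 assumed). */
struct demo_redirect_info {
	uint64_t tgt_index;
	void    *tgt_value;
	void    *map;
	uint32_t flags;
	uint32_t kern_flags;
	uint32_t map_id;
	uint32_t map_type;
	struct demo_nh_params nh;
};

/* Compile-time check that the model matches the reported layout. */
_Static_assert(offsetof(struct demo_redirect_info, nh) == 40, "nh at 40");
_Static_assert(sizeof(struct demo_redirect_info) == 64, "size is 64");

/* Clear everything in front of nh, leaving nh and the padding alone. */
static void demo_ri_clear_head(struct demo_redirect_info *ri)
{
	memset(ri, 0, offsetof(struct demo_redirect_info, nh));
}
```

Whether the 40-byte variant avoids rep-stos depends on the compiler and its tuning, so the sketch only demonstrates which bytes are cleared, not which instructions are emitted.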



2024-06-07 13:55:26

by Thomas Gleixner

Subject: Re: [PATCH v5 net-next 01/15] locking/local_lock: Introduce guard definition for local_lock.

On Fri, Jun 07 2024 at 08:53, Sebastian Andrzej Siewior wrote:

> Introduce lock guard definition for local_lock_t. There are no users
> yet.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> include/linux/local_lock.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
> index e55010fa73296..82366a37f4474 100644
> --- a/include/linux/local_lock.h
> +++ b/include/linux/local_lock.h
> @@ -51,4 +51,15 @@
> #define local_unlock_irqrestore(lock, flags) \
> __local_unlock_irqrestore(lock, flags)
>
> +DEFINE_GUARD(local_lock, local_lock_t __percpu*,
> + local_lock(_T),
> + local_unlock(_T))
> +DEFINE_GUARD(local_lock_irq, local_lock_t __percpu*,
> + local_lock_irq(_T),
> + local_unlock_irq(_T))
> +DEFINE_LOCK_GUARD_1(local_lock_irqsave, local_lock_t __percpu,
> + local_lock_irqsave(_T->lock, _T->flags),
> + local_unlock_irqrestore(_T->lock, _T->flags),
> + unsigned long flags)
> +
> #endif

Reviewed-by: Thomas Gleixner <[email protected]>

Subject: Re: [PATCH v5 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

On 2024-06-07 13:51:25 [+0200], Jesper Dangaard Brouer wrote:
> The memset can be further optimized as it currently clears 64 bytes, but
> it only need to clear 40 bytes, see pahole below.
>
> Replace memset with something like:
> memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
>
> This is an optimization, because with 64 bytes this result in a rep-stos
> (repeated string store operation) that on Intel touch CPU-flags (to be
> IRQ safe) which is slow, while clearing 40 bytes doesn't cause compiler
> to use this instruction, which is faster. Memset benchmarked with [1]

I've been playing along with this and have to say that "rep stosq" is
roughly 3x slower vs "movq" for 64 bytes on all x86 I've been looking
at.
For gcc the stosq vs movq depends on the CPU settings. The generic setting
uses movq up to 40 bytes, skylake uses movq even for 64 bytes. clang…
This could be tuned via -mmemset-strategy=libcall:64:align,rep_8byte:-1:align

I folded this into the last two patches:

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d2b4260d9d0be..1588d208f1348 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -744,27 +744,40 @@ struct bpf_redirect_info {
struct bpf_nh_params nh;
};

+enum bpf_ctx_init_type {
+ bpf_ctx_ri_init,
+ bpf_ctx_cpu_map_init,
+ bpf_ctx_dev_map_init,
+ bpf_ctx_xsk_map_init,
+};
+
struct bpf_net_context {
struct bpf_redirect_info ri;
struct list_head cpu_map_flush_list;
struct list_head dev_map_flush_list;
struct list_head xskmap_map_flush_list;
+ unsigned int flags;
};

+static inline bool bpf_net_ctx_need_init(struct bpf_net_context *bpf_net_ctx,
+ enum bpf_ctx_init_type flag)
+{
+ return !(bpf_net_ctx->flags & (1 << flag));
+}
+
+static inline bool bpf_net_ctx_set_flag(struct bpf_net_context *bpf_net_ctx,
+ enum bpf_ctx_init_type flag)
+{
+ return bpf_net_ctx->flags |= 1 << flag;
+}
+
static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
{
struct task_struct *tsk = current;

if (tsk->bpf_net_context != NULL)
return NULL;
- memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));
-
- if (IS_ENABLED(CONFIG_BPF_SYSCALL)) {
- INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
- INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
- }
- if (IS_ENABLED(CONFIG_XDP_SOCKETS))
- INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+ bpf_net_ctx->flags = 0;

tsk->bpf_net_context = bpf_net_ctx;
return bpf_net_ctx;
@@ -785,6 +798,11 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
{
struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();

+ if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_ri_init)) {
+ memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
+ bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_ri_init);
+ }
+
return &bpf_net_ctx->ri;
}

@@ -792,6 +810,11 @@ static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
{
struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();

+ if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_cpu_map_init)) {
+ INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+ bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_cpu_map_init);
+ }
+
return &bpf_net_ctx->cpu_map_flush_list;
}

@@ -799,6 +822,11 @@ static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
{
struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();

+ if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_dev_map_init)) {
+ INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+ bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_dev_map_init);
+ }
+
return &bpf_net_ctx->dev_map_flush_list;
}

@@ -806,6 +834,11 @@ static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
{
struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();

+ if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_xsk_map_init)) {
+ INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+ bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_xsk_map_init);
+ }
+
return &bpf_net_ctx->xskmap_map_flush_list;
}


Sebastian
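
The delayed-initialisation scheme in the diff above boils down to one "already initialised" bit per member, reset cheaply in the set helper and checked in each get helper. A userspace model might look like the sketch below. All demo_* names are hypothetical, the `ri` array is just a placeholder for struct bpf_redirect_info, and init_count exists only so the behaviour can be observed.

```c
#include <assert.h>
#include <string.h>

/* Minimal doubly linked list head, modelled on struct list_head. */
struct demo_list {
	struct demo_list *next, *prev;
};

enum demo_init_type {
	DEMO_RI_INIT,
	DEMO_CPU_MAP_INIT,
};

struct demo_net_ctx {
	int ri[4];			/* placeholder for bpf_redirect_info */
	struct demo_list cpu_map_flush_list;
	unsigned int flags;		/* one "initialised" bit per member */
	unsigned int init_count;	/* instrumentation for this demo only */
};

static int demo_need_init(const struct demo_net_ctx *ctx,
			  enum demo_init_type t)
{
	return !(ctx->flags & (1u << t));
}

/* Cheap setup: only the bookkeeping is touched, the members stay stale. */
static void demo_ctx_set(struct demo_net_ctx *ctx)
{
	ctx->flags = 0;
	ctx->init_count = 0;
}

/* First call clears ri, later calls return it untouched. */
static int *demo_get_ri(struct demo_net_ctx *ctx)
{
	if (demo_need_init(ctx, DEMO_RI_INIT)) {
		memset(ctx->ri, 0, sizeof(ctx->ri));
		ctx->flags |= 1u << DEMO_RI_INIT;
		ctx->init_count++;
	}
	return ctx->ri;
}

/* First call makes the list head point at itself (empty list). */
static struct demo_list *demo_get_cpu_map_flush_list(struct demo_net_ctx *ctx)
{
	struct demo_list *head = &ctx->cpu_map_flush_list;

	if (demo_need_init(ctx, DEMO_CPU_MAP_INIT)) {
		head->next = head;
		head->prev = head;
		ctx->flags |= 1u << DEMO_CPU_MAP_INIT;
		ctx->init_count++;
	}
	return head;
}
```

A caller that never asks for `ri` or the list pays only for the single store to `flags`, which is the saving the review asked for on the plain RX path.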

2024-06-11 07:56:04

by Jesper Dangaard Brouer

Subject: Re: [PATCH v5 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.



On 10/06/2024 18.50, Sebastian Andrzej Siewior wrote:
> On 2024-06-07 13:51:25 [+0200], Jesper Dangaard Brouer wrote:
>> The memset can be further optimized as it currently clears 64 bytes, but
>> it only need to clear 40 bytes, see pahole below.
>>
>> Replace memset with something like:
>> memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
>>
>> This is an optimization, because with 64 bytes this result in a rep-stos
>> (repeated string store operation) that on Intel touch CPU-flags (to be
>> IRQ safe) which is slow, while clearing 40 bytes doesn't cause compiler
>> to use this instruction, which is faster. Memset benchmarked with [1]
>
> I've been playing along with this and have to say that "rep stosq" is
> roughly 3x slower vs "movq" for 64 bytes on all x86 I've been looking
> at.

Thanks for confirming "rep stos" is 3x slower for small sizes.


> For gcc the stosq vs movq depends on the CPU settings. The generic uses
> movq up to 40 bytes, skylake uses movq even for 64bytes. clang…
> This could be tuned via -mmemset-strategy=libcall:64:align,rep_8byte:-1:align
>

Cool, I didn't know of this tuning. Is this a compiler option?
Where do I change this setting? I would like to experiment with this
for our prod kernels.

My other finding is that this is primarily a kernel compile problem,
because for userspace the compiler chooses to use SSE instructions (e.g.
movaps xmmword ptr[rsp], xmm0). The kernel compiler options (-mno-sse
-mno-mmx -mno-sse2 -mno-3dnow -mno-avx) disable this, which apparently
changes the tipping point.


> I folded this into the last two patches:
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d2b4260d9d0be..1588d208f1348 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -744,27 +744,40 @@ struct bpf_redirect_info {
> struct bpf_nh_params nh;
> };
>
> +enum bpf_ctx_init_type {
> + bpf_ctx_ri_init,
> + bpf_ctx_cpu_map_init,
> + bpf_ctx_dev_map_init,
> + bpf_ctx_xsk_map_init,
> +};
> +
> struct bpf_net_context {
> struct bpf_redirect_info ri;
> struct list_head cpu_map_flush_list;
> struct list_head dev_map_flush_list;
> struct list_head xskmap_map_flush_list;
> + unsigned int flags;

Why have yet another flags variable, when we already have two flags
members in bpf_redirect_info?

> };
>
> +static inline bool bpf_net_ctx_need_init(struct bpf_net_context *bpf_net_ctx,
> + enum bpf_ctx_init_type flag)
> +{
> + return !(bpf_net_ctx->flags & (1 << flag));
> +}
> +
> +static inline bool bpf_net_ctx_set_flag(struct bpf_net_context *bpf_net_ctx,
> + enum bpf_ctx_init_type flag)
> +{
> + return bpf_net_ctx->flags |= 1 << flag;
> +}
> +
> static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
> {
> struct task_struct *tsk = current;
>
> if (tsk->bpf_net_context != NULL)
> return NULL;
> - memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));
> -
> - if (IS_ENABLED(CONFIG_BPF_SYSCALL)) {
> - INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
> - INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
> - }
> - if (IS_ENABLED(CONFIG_XDP_SOCKETS))
> - INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
> + bpf_net_ctx->flags = 0;
>
> tsk->bpf_net_context = bpf_net_ctx;
> return bpf_net_ctx;
> @@ -785,6 +798,11 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
> {
> struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>
> + if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_ri_init)) {
> + memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
> + bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_ri_init);
> + }
> +
> return &bpf_net_ctx->ri;
> }
>
> @@ -792,6 +810,11 @@ static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
> {
> struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>
> + if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_cpu_map_init)) {
> + INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
> + bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_cpu_map_init);
> + }
> +
> return &bpf_net_ctx->cpu_map_flush_list;
> }
>
> @@ -799,6 +822,11 @@ static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
> {
> struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>
> + if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_dev_map_init)) {
> + INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
> + bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_dev_map_init);
> + }
> +
> return &bpf_net_ctx->dev_map_flush_list;
> }
>
> @@ -806,6 +834,11 @@ static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
> {
> struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
>
> + if (bpf_net_ctx_need_init(bpf_net_ctx, bpf_ctx_xsk_map_init)) {
> + INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
> + bpf_net_ctx_set_flag(bpf_net_ctx, bpf_ctx_xsk_map_init);
> + }
> +
> return &bpf_net_ctx->xskmap_map_flush_list;
> }
>
>
> Sebastian

Subject: Re: [PATCH v5 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

On 2024-06-11 09:55:11 [+0200], Jesper Dangaard Brouer wrote:
> > For gcc the stosq vs movq depends on the CPU settings. The generic uses
> > movq up to 40 bytes, skylake uses movq even for 64bytes. clang…
> > This could be tuned via -mmemset-strategy=libcall:64:align,rep_8byte:-1:align
> >
>
> Cool I didn't know of this tuning. Is this a compiler option?
> Where do I change this setting, as I would like to experiment with this
> for our prod kernels.

This is what I play with right now, I'm not sure it is what I want… For
reference:

---->8-----
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d7122a1883e8..b35b7b21598de 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -775,6 +775,9 @@ config SCHED_OMIT_FRAME_POINTER

If in doubt, say "Y".

+config X86_OPT_MEMSET
+ bool "X86 memset playground"
+
menuconfig HYPERVISOR_GUEST
bool "Linux guest support"
help
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 801fd85c3ef69..bab37787fe5cd 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -151,6 +151,15 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

+ ifeq ($(CONFIG_X86_OPT_MEMSET),y)
+ #export X86_MEMSET_CFLAGS := -mmemset-strategy=libcall:64:align,rep_8byte:-1:align
+ export X86_MEMSET_CFLAGS := -mmemset-strategy=libcall:-1:align
+ else
+ export X86_MEMSET_CFLAGS :=
+ endif
+
+ KBUILD_CFLAGS += $(X86_MEMSET_CFLAGS)
+
# Align jump targets to 1 byte, not the default 16 bytes:
KBUILD_CFLAGS += $(call cc-option,-falign-jumps=1)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 215a1b202a918..d0c9a589885ef 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -121,6 +121,7 @@ KBUILD_CFLAGS_32 := $(filter-out -m64,$(KBUILD_CFLAGS))
KBUILD_CFLAGS_32 := $(filter-out -mcmodel=kernel,$(KBUILD_CFLAGS_32))
KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
KBUILD_CFLAGS_32 := $(filter-out -mfentry,$(KBUILD_CFLAGS_32))
+KBUILD_CFLAGS_32 := $(filter-out $(X86_MEMSET_CFLAGS),$(KBUILD_CFLAGS_32))
KBUILD_CFLAGS_32 := $(filter-out $(RANDSTRUCT_CFLAGS),$(KBUILD_CFLAGS_32))
KBUILD_CFLAGS_32 := $(filter-out $(GCC_PLUGINS_CFLAGS),$(KBUILD_CFLAGS_32))
KBUILD_CFLAGS_32 := $(filter-out $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS_32))


---->8-----

I dug this up in the gcc source code and initially played with it on the
command line. With the snippet the kernel compiles and boots, so…
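
For anyone wanting to repeat the command-line experiment outside the kernel tree, a small translation unit along these lines should be enough to compare the generated code. This is only a sketch: demo_buf64, demo_clear64() and demo_is_clear() are made-up names, and the commands in the comment assume an x86-64 gcc that accepts -mmemset-strategy. The -mno-sse and -mno-mmx flags are added to come closer to the kernel's build options mentioned in this thread.

```c
/*
 * Compare the code generated for a fixed-size memset, for example:
 *
 *   gcc -O2 -mno-sse -mno-mmx -S demo.c -o default.s
 *   gcc -O2 -mno-sse -mno-mmx \
 *       -mmemset-strategy=libcall:64:align,rep_8byte:-1:align \
 *       -S demo.c -o tuned.s
 *
 * and then diff the body of demo_clear64() in both files.
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_buf64 {
	unsigned char bytes[64];
};

/* noinline so the memset expansion is easy to find in the assembly. */
__attribute__((noinline))
void demo_clear64(struct demo_buf64 *b)
{
	memset(b, 0, sizeof(*b));
}

/* Returns 1 if every byte of the buffer is zero, 0 otherwise. */
int demo_is_clear(const struct demo_buf64 *b)
{
	size_t i;

	for (i = 0; i < sizeof(b->bytes); i++) {
		if (b->bytes[i])
			return 0;
	}
	return 1;
}
```

Which instruction sequence each invocation produces depends on the gcc version and tuning defaults, so no expected assembly is given here.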

> My other finding is, this primarily a kernel compile problem, because
> for userspace compiler chooses to use MMX instructions (e.g. movaps
> xmmword ptr[rsp], xmm0). The kernel compiler options (-mno-sse -mno-mmx
> -mno-sse2 -mno-3dnow -mno-avx) disables this, which aparently changes
> the tipping point.

sure.

>
> > I folded this into the last two patches:
> >
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index d2b4260d9d0be..1588d208f1348 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -744,27 +744,40 @@ struct bpf_redirect_info {
> > struct bpf_nh_params nh;
> > };
> > +enum bpf_ctx_init_type {
> > + bpf_ctx_ri_init,
> > + bpf_ctx_cpu_map_init,
> > + bpf_ctx_dev_map_init,
> > + bpf_ctx_xsk_map_init,
> > +};
> > +
> > struct bpf_net_context {
> > struct bpf_redirect_info ri;
> > struct list_head cpu_map_flush_list;
> > struct list_head dev_map_flush_list;
> > struct list_head xskmap_map_flush_list;
> > + unsigned int flags;
>
> Why have yet another flags variable, when we already have two flags in
> bpf_redirect_info ?

Ah, you want to fold this into the ri member, including the status for
the lists? Could try. It is split in order to delay the initialisation
of the lists, too. We would need to be careful not to overwrite the
flags if `ri' is initialized after the lists. That would be the case
with CONFIG_DEBUG_NET=y and not doing redirect (the empty list check
initializes that).

Sebastian

Subject: Re: [PATCH v5 net-next 14/15] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

On 2024-06-11 10:39:20 [+0200], To Jesper Dangaard Brouer wrote:
> On 2024-06-11 09:55:11 [+0200], Jesper Dangaard Brouer wrote:
> > > struct bpf_net_context {
> > > struct bpf_redirect_info ri;
> > > struct list_head cpu_map_flush_list;
> > > struct list_head dev_map_flush_list;
> > > struct list_head xskmap_map_flush_list;
> > > + unsigned int flags;
> >
> > Why have yet another flags variable, when we already have two flags in
> > bpf_redirect_info ?
>
> Ah you want to fold this into ri member including the status for the
> lists? Could try. It is splitted in order to delay the initialisation of
> the lists, too. We would need to be careful to not overwrite the
> flags if `ri' is initialized after the lists. That would be the case
> with CONFIG_DEBUG_NET=y and not doing redirect (the empty list check
> initializes that).

What about this:

------>8----------

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d2b4260d9d0be..c0349522de8fb 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -733,15 +733,22 @@ struct bpf_nh_params {
};
};

+/* flags for bpf_redirect_info kern_flags */
+#define BPF_RI_F_RF_NO_DIRECT BIT(0) /* no napi_direct on return_frame */
+#define BPF_RI_F_RI_INIT BIT(1)
+#define BPF_RI_F_CPU_MAP_INIT BIT(2)
+#define BPF_RI_F_DEV_MAP_INIT BIT(3)
+#define BPF_RI_F_XSK_MAP_INIT BIT(4)
+
struct bpf_redirect_info {
u64 tgt_index;
void *tgt_value;
struct bpf_map *map;
u32 flags;
- u32 kern_flags;
u32 map_id;
enum bpf_map_type map_type;
struct bpf_nh_params nh;
+ u32 kern_flags;
};

struct bpf_net_context {
@@ -757,14 +764,7 @@ static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bp

if (tsk->bpf_net_context != NULL)
return NULL;
- memset(&bpf_net_ctx->ri, 0, sizeof(bpf_net_ctx->ri));
-
- if (IS_ENABLED(CONFIG_BPF_SYSCALL)) {
- INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
- INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
- }
- if (IS_ENABLED(CONFIG_XDP_SOCKETS))
- INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+ bpf_net_ctx->ri.kern_flags = 0;

tsk->bpf_net_context = bpf_net_ctx;
return bpf_net_ctx;
@@ -785,6 +785,11 @@ static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
{
struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();

+ if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_RI_INIT)) {
+ memset(&bpf_net_ctx->ri, 0, offsetof(struct bpf_net_context, ri.nh));
+ bpf_net_ctx->ri.kern_flags |= BPF_RI_F_RI_INIT;
+ }
+
return &bpf_net_ctx->ri;
}

@@ -792,6 +797,11 @@ static inline struct list_head *bpf_net_ctx_get_cpu_map_flush_list(void)
{
struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();

+ if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_CPU_MAP_INIT)) {
+ INIT_LIST_HEAD(&bpf_net_ctx->cpu_map_flush_list);
+ bpf_net_ctx->ri.kern_flags |= BPF_RI_F_CPU_MAP_INIT;
+ }
+
return &bpf_net_ctx->cpu_map_flush_list;
}

@@ -799,6 +809,11 @@ static inline struct list_head *bpf_net_ctx_get_dev_flush_list(void)
{
struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();

+ if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_DEV_MAP_INIT)) {
+ INIT_LIST_HEAD(&bpf_net_ctx->dev_map_flush_list);
+ bpf_net_ctx->ri.kern_flags |= BPF_RI_F_DEV_MAP_INIT;
+ }
+
return &bpf_net_ctx->dev_map_flush_list;
}

@@ -806,12 +821,14 @@ static inline struct list_head *bpf_net_ctx_get_xskmap_flush_list(void)
{
struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();

+ if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_XSK_MAP_INIT)) {
+ INIT_LIST_HEAD(&bpf_net_ctx->xskmap_map_flush_list);
+ bpf_net_ctx->ri.kern_flags |= BPF_RI_F_XSK_MAP_INIT;
+ }
+
return &bpf_net_ctx->xskmap_map_flush_list;
}

-/* flags for bpf_redirect_info kern_flags */
-#define BPF_RI_F_RF_NO_DIRECT BIT(0) /* no napi_direct on return_frame */
-
/* Compute the linear packet data range [data, data_end) which
* will be accessed by various program types (cls_bpf, act_bpf,
* lwt, ...). Subsystems allowing direct data access must (!)

------>8----------

Moving kern_flags to the end excludes it from the memset(), so it can be
re-used for the delayed initialisation.

Sebastian
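
Why the position of kern_flags matters can be shown with a small userspace model: if the flags word sits behind `nh`, a partial memset that stops at `nh` cannot wipe an "initialised" bit that was set earlier for one of the lists. The sketch below uses hypothetical demo_* names and a simplified struct layout, and it initialises the list before `ri` to exercise the ordering discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_F_RI_INIT		(1u << 1)
#define DEMO_F_CPU_MAP_INIT	(1u << 2)

struct demo_list {
	struct demo_list *next, *prev;
};

/* Simplified stand-in: kern_flags sits after nh, outside the cleared range. */
struct demo_ri {
	uint64_t tgt_index;
	uint32_t flags;
	uint32_t map_id;
	uint8_t  nh[20];
	uint32_t kern_flags;
};

_Static_assert(offsetof(struct demo_ri, kern_flags) >=
	       offsetof(struct demo_ri, nh), "kern_flags must follow nh");

struct demo_ctx {
	struct demo_ri ri;
	struct demo_list cpu_map_flush_list;
};

/* Clears only the bytes in front of nh, so kern_flags survives. */
static struct demo_ri *demo_get_ri(struct demo_ctx *ctx)
{
	if (!(ctx->ri.kern_flags & DEMO_F_RI_INIT)) {
		memset(&ctx->ri, 0, offsetof(struct demo_ri, nh));
		ctx->ri.kern_flags |= DEMO_F_RI_INIT;
	}
	return &ctx->ri;
}

static struct demo_list *demo_get_cpu_map_flush_list(struct demo_ctx *ctx)
{
	struct demo_list *head = &ctx->cpu_map_flush_list;

	if (!(ctx->ri.kern_flags & DEMO_F_CPU_MAP_INIT)) {
		head->next = head;
		head->prev = head;
		ctx->ri.kern_flags |= DEMO_F_CPU_MAP_INIT;
	}
	return head;
}
```

With kern_flags in its old position (in front of `nh`), the memset in demo_get_ri() would have cleared the list bit again and the next list access would have re-initialised a possibly non-empty list.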