Hi,
Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
a local_bh_disable() section remains preemptible. As a result,
high-priority tasks (or threaded interrupts) can be blocked by
long-running lower-priority tasks (or threaded interrupts), which
includes softirq sections.
The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
The series introduces the infrastructure and converts large parts of
networking, which is the largest stakeholder here. Once this is done,
the per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
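To illustrate the conversion pattern used throughout the series, here is a
minimal sketch with made-up names (struct foo_scratch and foo_use() do not
exist in the tree), assuming the per-CPU data was previously protected by
local_bh_disable() alone:

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct foo_scratch {
	local_lock_t	bh_lock;
	void		*buf;
};

static DEFINE_PER_CPU(struct foo_scratch, foo_scratch) = {
	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
};

static void foo_use(void)
{
	struct foo_scratch *s;

	local_bh_disable();
	/* On !PREEMPT_RT this is only a lockdep annotation, on PREEMPT_RT
	 * it acquires the per-CPU lock which replaces the implicit
	 * protection of local_bh_disable().
	 */
	local_lock_nested_bh(&foo_scratch.bh_lock);
	s = this_cpu_ptr(&foo_scratch);
	/* ... use s->buf ... */
	local_unlock_nested_bh(&foo_scratch.bh_lock);
	local_bh_enable();
}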
Sebastian
sigpool_scratch is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.
Make a struct with a pad member (original sigpool_scratch) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.
Cc: David Ahern <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/ipv4/tcp_sigpool.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/tcp_sigpool.c b/net/ipv4/tcp_sigpool.c
index 55b310a722c7d..0046d1a0dd796 100644
--- a/net/ipv4/tcp_sigpool.c
+++ b/net/ipv4/tcp_sigpool.c
@@ -10,7 +10,14 @@
#include <net/tcp.h>
static size_t __scratch_size;
-static DEFINE_PER_CPU(void __rcu *, sigpool_scratch);
+struct sigpool_scratch {
+ local_lock_t bh_lock;
+ void __rcu *pad;
+};
+
+static DEFINE_PER_CPU(struct sigpool_scratch, sigpool_scratch) = {
+ .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
struct sigpool_entry {
struct crypto_ahash *hash;
@@ -72,7 +79,7 @@ static int sigpool_reserve_scratch(size_t size)
break;
}
- old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+ old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu),
scratch, lockdep_is_held(&cpool_mutex));
if (!cpu_online(cpu) || !old_scratch) {
kfree(old_scratch);
@@ -93,7 +100,7 @@ static void sigpool_scratch_free(void)
int cpu;
for_each_possible_cpu(cpu)
- kfree(rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
+ kfree(rcu_replace_pointer(per_cpu(sigpool_scratch.pad, cpu),
NULL, lockdep_is_held(&cpool_mutex)));
__scratch_size = 0;
}
@@ -278,7 +285,8 @@ int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c) __cond_acquires(RC
/* Pairs with tcp_sigpool_reserve_scratch(), scratch area is
* valid (allocated) until tcp_sigpool_end().
*/
- c->scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch));
+ local_lock_nested_bh(&sigpool_scratch.bh_lock);
+ c->scratch = rcu_dereference_bh(*this_cpu_ptr(&sigpool_scratch.pad));
return 0;
}
EXPORT_SYMBOL_GPL(tcp_sigpool_start);
@@ -287,6 +295,7 @@ void tcp_sigpool_end(struct tcp_sigpool *c) __releases(RCU_BH)
{
struct crypto_ahash *hash = crypto_ahash_reqtfm(c->req);
+ local_unlock_nested_bh(&sigpool_scratch.bh_lock);
rcu_read_unlock_bh();
ahash_request_free(c->req);
crypto_free_ahash(hash);
--
2.43.0
brnf_frag_data_storage is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.
Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.
Cc: Florian Westphal <[email protected]>
Cc: Jozsef Kadlecsik <[email protected]>
Cc: Nikolay Aleksandrov <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
Cc: Roopa Prabhu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/bridge/br_netfilter_hooks.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 6adcb45bca75d..c06f498c5f1b3 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -133,6 +133,7 @@ static inline bool is_pppoe_ipv6(const struct sk_buff *skb,
#define NF_BRIDGE_MAX_MAC_HEADER_LENGTH (PPPOE_SES_HLEN + ETH_HLEN)
struct brnf_frag_data {
+ local_lock_t bh_lock;
char mac[NF_BRIDGE_MAX_MAC_HEADER_LENGTH];
u8 encap_size;
u8 size;
@@ -140,7 +141,9 @@ struct brnf_frag_data {
__be16 vlan_proto;
};
-static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage);
+static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage) = {
+ .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
static void nf_bridge_info_free(struct sk_buff *skb)
{
@@ -768,6 +771,7 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
IPCB(skb)->frag_max_size = nf_bridge->frag_max_size;
+ guard(local_lock_nested_bh)(&brnf_frag_data_storage.bh_lock);
data = this_cpu_ptr(&brnf_frag_data_storage);
if (skb_vlan_tag_present(skb)) {
@@ -795,6 +799,7 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
IP6CB(skb)->frag_max_size = nf_bridge->frag_max_size;
+ guard(local_lock_nested_bh)(&brnf_frag_data_storage.bh_lock);
data = this_cpu_ptr(&brnf_frag_data_storage);
data->encap_size = nf_bridge_encap_header_len(skb);
data->size = ETH_HLEN + data->encap_size;
--
2.43.0
Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
local_bh_disable() there is no guarantee that only one device is
transmitting at a time.
With preemption and multiple senders it is possible that the per-CPU
recursion counter gets incremented by different threads and exceeds
XMIT_RECURSION_LIMIT, leading to a false-positive recursion alert. For
instance, two preempted senders on the same CPU, each at a legitimate
depth below the limit, can together push the shared counter past
XMIT_RECURSION_LIMIT.
Instead of adding a lock to protect the per-CPU variable it is simpler
to make the counter per-task. Sending and receiving skbs always happens
in thread context anyway.
Having a lock to protect the per-CPU counter would needlessly block/
serialize two sending threads. It would also require a recursive lock so
that the owner can increment the counter further.
Make the recursion counter a task_struct member on PREEMPT_RT.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/netdevice.h | 30 +++++++++++++++++++++++++++++-
include/linux/sched.h | 4 +++-
2 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2564e209465ea..06436695c3679 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3260,7 +3260,9 @@ struct softnet_data {
#endif
/* written and read only by owning cpu: */
struct {
+#ifndef CONFIG_PREEMPT_RT
u16 recursion;
+#endif
u8 more;
#ifdef CONFIG_NET_EGRESS
u8 skip_txqueue;
@@ -3308,12 +3310,36 @@ static inline void input_queue_tail_incr_save(struct softnet_data *sd,
DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
+#define XMIT_RECURSION_LIMIT 8
+#ifdef CONFIG_PREEMPT_RT
+
+static inline int dev_recursion_level(void)
+{
+ return current->net_xmit_recursion;
+}
+
+static inline bool dev_xmit_recursion(void)
+{
+ return unlikely(current->net_xmit_recursion > XMIT_RECURSION_LIMIT);
+}
+
+static inline void dev_xmit_recursion_inc(void)
+{
+ current->net_xmit_recursion++;
+}
+
+static inline void dev_xmit_recursion_dec(void)
+{
+ current->net_xmit_recursion--;
+}
+
+#else
+
static inline int dev_recursion_level(void)
{
return this_cpu_read(softnet_data.xmit.recursion);
}
-#define XMIT_RECURSION_LIMIT 8
static inline bool dev_xmit_recursion(void)
{
return unlikely(__this_cpu_read(softnet_data.xmit.recursion) >
@@ -3330,6 +3356,8 @@ static inline void dev_xmit_recursion_dec(void)
__this_cpu_dec(softnet_data.xmit.recursion);
}
+#endif
+
void __netif_schedule(struct Qdisc *q);
void netif_schedule_queue(struct netdev_queue *txq);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c316972485..26b1b7d674b5d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -964,7 +964,9 @@ struct task_struct {
/* delay due to memory thrashing */
unsigned in_thrashing:1;
#endif
-
+#ifdef CONFIG_PREEMPT_RT
+ u8 net_xmit_recursion;
+#endif
unsigned long atomic_flags; /* Flags requiring atomic access. */
struct restart_block restart_block;
--
2.43.0
Add local_lock_nested_bh() locking. It is based on local_lock_t and the
naming follows the preempt_disable_nested() example.
For !PREEMPT_RT + !LOCKDEP it is a per-CPU annotation for locking
assumptions based on local_bh_disable(). The macro is optimized away
during compilation.
For !PREEMPT_RT + LOCKDEP the local_lock_nested_bh() is reduced to
the usual lock-acquire plus lockdep_assert_in_softirq() - ensuring that
BH is disabled.
For PREEMPT_RT local_lock_nested_bh() acquires the specified per-CPU
lock. It does not disable CPU migration because it relies on
local_bh_disable() disabling CPU migration.
With LOCKDEP it performs the usual lockdep checks as with !PREEMPT_RT.
Due to include hell the softirq check has been moved to spinlock.c.
The intention is to use this locking in places where locking of a per-CPU
variable relies on BH being disabled. Instead of treating disabled
bottom halves as a big per-CPU lock, PREEMPT_RT can use this to reduce
the locking scope to what actually needs protecting.
A side effect is that it also documents the protection scope of the
per-CPU variables.
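For reference, a minimal usage sketch with a made-up per-CPU variable
(foo_pcpu is not part of this patch), using the guard() form provided by
DEFINE_LOCK_GUARD_1(); the caller is expected to run with BH already
disabled, the nested-BH lock merely narrows and documents the protection
scope:

#include <linux/cleanup.h>
#include <linux/local_lock.h>
#include <linux/percpu.h>

struct foo_pcpu {
	local_lock_t	bh_lock;
	unsigned int	counter;
};

static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
};

/* Invoked in softirq context or with BH disabled by the caller. */
static void foo_count(void)
{
	guard(local_lock_nested_bh)(&foo_pcpu.bh_lock);
	this_cpu_inc(foo_pcpu.counter);
}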
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/local_lock.h | 10 ++++++++++
include/linux/local_lock_internal.h | 31 +++++++++++++++++++++++++++++
include/linux/lockdep.h | 3 +++
kernel/locking/spinlock.c | 8 ++++++++
4 files changed, 52 insertions(+)
diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 706c4b65d9449..303caac7fbbbc 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -62,4 +62,14 @@ DEFINE_LOCK_GUARD_1(local_lock_irqsave, local_lock_t,
local_unlock_irqrestore(_T->lock, _T->flags),
unsigned long flags)
+#define local_lock_nested_bh(_lock) \
+ __local_lock_nested_bh(_lock)
+
+#define local_unlock_nested_bh(_lock) \
+ __local_unlock_nested_bh(_lock)
+
+DEFINE_LOCK_GUARD_1(local_lock_nested_bh, local_lock_t,
+ local_lock_nested_bh(_T->lock),
+ local_unlock_nested_bh(_T->lock))
+
#endif
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 975e33b793a77..8dd71fbbb6d2b 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -62,6 +62,17 @@ do { \
local_lock_debug_init(lock); \
} while (0)
+#define __spinlock_nested_bh_init(lock) \
+do { \
+ static struct lock_class_key __key; \
+ \
+ debug_check_no_locks_freed((void *)lock, sizeof(*lock));\
+ lockdep_init_map_type(&(lock)->dep_map, #lock, &__key, \
+ 0, LD_WAIT_CONFIG, LD_WAIT_INV, \
+ LD_LOCK_NORMAL); \
+ local_lock_debug_init(lock); \
+} while (0)
+
#define __local_lock(lock) \
do { \
preempt_disable(); \
@@ -98,6 +109,15 @@ do { \
local_irq_restore(flags); \
} while (0)
+#define __local_lock_nested_bh(lock) \
+ do { \
+ lockdep_assert_in_softirq(); \
+ local_lock_acquire(this_cpu_ptr(lock)); \
+ } while (0)
+
+#define __local_unlock_nested_bh(lock) \
+ local_lock_release(this_cpu_ptr(lock))
+
#else /* !CONFIG_PREEMPT_RT */
/*
@@ -138,4 +158,15 @@ typedef spinlock_t local_lock_t;
#define __local_unlock_irqrestore(lock, flags) __local_unlock(lock)
+#define __local_lock_nested_bh(lock) \
+do { \
+ lockdep_assert_in_softirq_func(); \
+ spin_lock(this_cpu_ptr(lock)); \
+} while (0)
+
+#define __local_unlock_nested_bh(lock) \
+do { \
+ spin_unlock(this_cpu_ptr((lock))); \
+} while (0)
+
#endif /* CONFIG_PREEMPT_RT */
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index dc2844b071c2c..881732fa0df8c 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -657,6 +657,8 @@ do { \
(!in_softirq() || in_irq() || in_nmi())); \
} while (0)
+extern void lockdep_assert_in_softirq_func(void);
+
#else
# define might_lock(lock) do { } while (0)
# define might_lock_read(lock) do { } while (0)
@@ -670,6 +672,7 @@ do { \
# define lockdep_assert_preemption_enabled() do { } while (0)
# define lockdep_assert_preemption_disabled() do { } while (0)
# define lockdep_assert_in_softirq() do { } while (0)
+# define lockdep_assert_in_softirq_func() do { } while (0)
#endif
#ifdef CONFIG_PROVE_RAW_LOCK_NESTING
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 8475a0794f8c5..438c6086d540e 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -413,3 +413,11 @@ notrace int in_lock_functions(unsigned long addr)
&& addr < (unsigned long)__lock_text_end;
}
EXPORT_SYMBOL(in_lock_functions);
+
+#if defined(CONFIG_PROVE_LOCKING) && defined(CONFIG_PREEMPT_RT)
+void notrace lockdep_assert_in_softirq_func(void)
+{
+ lockdep_assert_in_softirq();
+}
+EXPORT_SYMBOL(lockdep_assert_in_softirq_func);
+#endif
--
2.43.0
There is no additional locking in rps_lock_*() on PREEMPT_RT in the
!CONFIG_RPS case. The reasoning is that local_bh_disable() provides the
required synchronization (everywhere).
All interrupts are threaded and so the required CPU-local
synchronization is provided by the per-CPU lock in local_bh_disable(),
which is always taken for the force-threaded interrupts.
Needless to say, the softirq users (NAPI) have BH disabled while they
are invoked as softirq callbacks.
Without locking in local_bh_disable() on PREEMPT_RT the
softnet_data::input_pkt_queue data structure requires explicit locking
in the !CONFIG_RPS case.
Disabling interrupts for CPU-local locking is undesired because it
would impose restrictions within the locked sections, such as not being
able to free an skb.
Utilise the locking which is present in the CONFIG_RPS case for
serialisation on PREEMPT_RT even without CONFIG_RPS enabled.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/core/dev.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index c879246be48d8..09232080843ee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -220,34 +220,34 @@ static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
static inline void rps_lock_irqsave(struct softnet_data *sd,
unsigned long *flags)
{
- if (IS_ENABLED(CONFIG_RPS))
+ if (IS_ENABLED(CONFIG_RPS) || IS_ENABLED(CONFIG_PREEMPT_RT))
spin_lock_irqsave(&sd->input_pkt_queue.lock, *flags);
- else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ else
local_irq_save(*flags);
}
static inline void rps_lock_irq_disable(struct softnet_data *sd)
{
- if (IS_ENABLED(CONFIG_RPS))
+ if (IS_ENABLED(CONFIG_RPS) || IS_ENABLED(CONFIG_PREEMPT_RT))
spin_lock_irq(&sd->input_pkt_queue.lock);
- else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ else
local_irq_disable();
}
static inline void rps_unlock_irq_restore(struct softnet_data *sd,
unsigned long *flags)
{
- if (IS_ENABLED(CONFIG_RPS))
+ if (IS_ENABLED(CONFIG_RPS) || IS_ENABLED(CONFIG_PREEMPT_RT))
spin_unlock_irqrestore(&sd->input_pkt_queue.lock, *flags);
- else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ else
local_irq_restore(*flags);
}
static inline void rps_unlock_irq_enable(struct softnet_data *sd)
{
- if (IS_ENABLED(CONFIG_RPS))
+ if (IS_ENABLED(CONFIG_RPS) || IS_ENABLED(CONFIG_PREEMPT_RT))
spin_unlock_irq(&sd->input_pkt_queue.lock);
- else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ else
local_irq_enable();
}
--
2.43.0
bpf_scratchpad is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.
Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/core/filter.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index c8013f762524b..896aa3fa699f9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1652,9 +1652,12 @@ struct bpf_scratchpad {
__be32 diff[MAX_BPF_STACK / sizeof(__be32)];
u8 buff[MAX_BPF_STACK];
};
+ local_lock_t lock;
};
-static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp);
+static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp) = {
+ .lock = INIT_LOCAL_LOCK(lock),
+};
static inline int __bpf_try_make_writable(struct sk_buff *skb,
unsigned int write_len)
@@ -2023,6 +2026,7 @@ BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
diff_size > sizeof(sp->diff)))
return -EINVAL;
+ guard(local_lock_nested_bh)(&bpf_sp.lock);
for (i = 0; i < from_size / sizeof(__be32); i++, j++)
sp->diff[j] = ~from[i];
for (i = 0; i < to_size / sizeof(__be32); i++, j++)
--
2.43.0
The access to seg6_bpf_srh_states is protected by disabling preemption.
Based on the code, the entry point is input_action_end_bpf() and
every other function (the bpf helper functions bpf_lwt_seg6_*()), that
is accessing seg6_bpf_srh_states, should be called from within
input_action_end_bpf().
input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
the function and then disables preemption. This looks wrong because if
preemption needs to be disabled as part of the locking mechanism then
the variable shouldn't be accessed beforehand.
Looking at how it is used via test_lwt_seg6local.sh,
input_action_end_bpf() is always invoked from softirq context. If this
is always the case then the preempt_disable() statement is superfluous.
If it is not always invoked from softirq then disabling only preemption
is not sufficient.
Replace the preempt_disable() statement with nested-BH locking. This is
not an equivalent replacement as it assumes that the invocation of
input_action_end_bpf() always occurs in softirq context and thus the
preempt_disable() is superfluous.
Add a local_lock_t to the data structure and use local_lock_nested_bh()
in guard notation for locking. Add lockdep_assert_held() to ensure the
lock is held while the per-CPU variable is referenced in the helper
functions.
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/net/seg6_local.h | 1 +
net/core/filter.c | 3 ++
net/ipv6/seg6_local.c | 59 ++++++++++++++++++++++------------------
3 files changed, 36 insertions(+), 27 deletions(-)
diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 3fab9dec2ec45..0f22771359f4c 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -20,6 +20,7 @@ extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb);
struct seg6_bpf_srh_state {
struct ipv6_sr_hdr *srh;
+ local_lock_t bh_lock;
u16 hdrlen;
bool valid;
};
diff --git a/net/core/filter.c b/net/core/filter.c
index 1737884be52f8..c8013f762524b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6384,6 +6384,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
void *srh_tlvs, *srh_end, *ptr;
int srhoff = 0;
+ lockdep_assert_held(&srh_state->bh_lock);
if (srh == NULL)
return -EINVAL;
@@ -6440,6 +6441,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
int hdroff = 0;
int err;
+ lockdep_assert_held(&srh_state->bh_lock);
switch (action) {
case SEG6_LOCAL_ACTION_END_X:
if (!seg6_bpf_has_valid_srh(skb))
@@ -6516,6 +6518,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
int srhoff = 0;
int ret;
+ lockdep_assert_held(&srh_state->bh_lock);
if (unlikely(srh == NULL))
return -EINVAL;
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 24e2b4b494cb0..ed7278af321a2 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1380,7 +1380,9 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
return err;
}
-DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states) = {
+ .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
{
@@ -1388,6 +1390,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
this_cpu_ptr(&seg6_bpf_srh_states);
struct ipv6_sr_hdr *srh = srh_state->srh;
+ lockdep_assert_held(&srh_state->bh_lock);
if (unlikely(srh == NULL))
return false;
@@ -1408,8 +1411,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
static int input_action_end_bpf(struct sk_buff *skb,
struct seg6_local_lwt *slwt)
{
- struct seg6_bpf_srh_state *srh_state =
- this_cpu_ptr(&seg6_bpf_srh_states);
+ struct seg6_bpf_srh_state *srh_state;
struct ipv6_sr_hdr *srh;
int ret;
@@ -1420,41 +1422,44 @@ static int input_action_end_bpf(struct sk_buff *skb,
}
advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
- /* preempt_disable is needed to protect the per-CPU buffer srh_state,
- * which is also accessed by the bpf_lwt_seg6_* helpers
+ /* The access to the per-CPU buffer srh_state is protected by running
+ * always in softirq context (with disabled BH). On PREEMPT_RT the
+ * required locking is provided by the following local_lock_nested_bh()
+ * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
+ * bpf_prog_run_save_cb().
*/
- preempt_disable();
- srh_state->srh = srh;
- srh_state->hdrlen = srh->hdrlen << 3;
- srh_state->valid = true;
+ scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
+ srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
+ srh_state->srh = srh;
+ srh_state->hdrlen = srh->hdrlen << 3;
+ srh_state->valid = true;
- rcu_read_lock();
- bpf_compute_data_pointers(skb);
- ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
- rcu_read_unlock();
+ rcu_read_lock();
+ bpf_compute_data_pointers(skb);
+ ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
+ rcu_read_unlock();
- switch (ret) {
- case BPF_OK:
- case BPF_REDIRECT:
- break;
- case BPF_DROP:
- goto drop;
- default:
- pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
- goto drop;
+ switch (ret) {
+ case BPF_OK:
+ case BPF_REDIRECT:
+ break;
+ case BPF_DROP:
+ goto drop;
+ default:
+ pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
+ goto drop;
+ }
+
+ if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
+ goto drop;
}
- if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
- goto drop;
-
- preempt_enable();
if (ret != BPF_REDIRECT)
seg6_lookup_nexthop(skb, NULL, 0);
return dst_input(skb);
drop:
- preempt_enable();
kfree_skb(skb);
return -EINVAL;
}
--
2.43.0
There is no need to explicitly disable migration if bottom halves are
also disabled. Disabling BH implies disabling migration.
Remove migrate_disable() and rely solely on disabling BH to remain on
the same CPU.
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
net/core/lwt_bpf.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 4a0797f0a154b..a94943681e5aa 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -40,10 +40,9 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
{
int ret;
- /* Migration disable and BH disable are needed to protect per-cpu
- * redirect_info between BPF prog and skb_do_redirect().
+ /* Disabling BH is needed to protect per-CPU bpf_redirect_info between
+ * BPF prog and skb_do_redirect().
*/
- migrate_disable();
local_bh_disable();
bpf_compute_data_pointers(skb);
ret = bpf_prog_run_save_cb(lwt->prog, skb);
@@ -78,7 +77,6 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
}
local_bh_enable();
- migrate_enable();
return ret;
}
--
2.43.0
ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.
Make a struct with a sock member (original ipv4_tcp_sk) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.
Cc: David Ahern <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/net/sock.h | 5 +++++
net/ipv4/tcp_ipv4.c | 15 +++++++++++----
2 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 1d6931caf0c3c..cb407b7ae39f8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -547,6 +547,11 @@ struct sock {
struct hlist_node sk_bind2_node;
};
+struct sock_bh_locked {
+ struct sock *sock;
+ local_lock_t bh_lock;
+};
+
enum sk_pacing {
SK_PACING_NONE = 0,
SK_PACING_NEEDED = 1,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0c50c5a32b84a..e69dd19f39e02 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -92,7 +92,9 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
struct inet_hashinfo tcp_hashinfo;
EXPORT_SYMBOL(tcp_hashinfo);
-static DEFINE_PER_CPU(struct sock *, ipv4_tcp_sk);
+static DEFINE_PER_CPU(struct sock_bh_locked, ipv4_tcp_sk) = {
+ .bh_lock = INIT_LOCAL_LOCK(bh_lock),
+};
static u32 tcp_v4_init_seq(const struct sk_buff *skb)
{
@@ -878,7 +880,9 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
- ctl_sk = this_cpu_read(ipv4_tcp_sk);
+ local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
+ ctl_sk = this_cpu_read(ipv4_tcp_sk.sock);
+
sock_net_set(ctl_sk, net);
if (sk) {
ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
@@ -903,6 +907,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
sock_net_set(ctl_sk, &init_net);
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
+ local_unlock_nested_bh(&ipv4_tcp_sk.bh_lock);
local_bh_enable();
#ifdef CONFIG_TCP_MD5SIG
@@ -998,7 +1003,8 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
- ctl_sk = this_cpu_read(ipv4_tcp_sk);
+ local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
+ ctl_sk = this_cpu_read(ipv4_tcp_sk.sock);
sock_net_set(ctl_sk, net);
ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
inet_twsk(sk)->tw_mark : READ_ONCE(sk->sk_mark);
@@ -1013,6 +1019,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
sock_net_set(ctl_sk, &init_net);
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
+ local_unlock_nested_bh(&ipv4_tcp_sk.bh_lock);
local_bh_enable();
}
@@ -3604,7 +3611,7 @@ void __init tcp_v4_init(void)
*/
inet_sk(sk)->pmtudisc = IP_PMTUDISC_DO;
- per_cpu(ipv4_tcp_sk, cpu) = sk;
+ per_cpu(ipv4_tcp_sk.sock, cpu) = sk;
}
if (register_pernet_subsys(&tcp_sk_ops))
panic("Failed to create the TCP control socket.\n");
--
2.43.0
softnet_data::process_queue is a per-CPU variable and relies on disabled
BH for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.
Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking
of process_queue. This change adds only lockdep coverage and does not
alter the functional behaviour for !PREEMPT_RT.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/netdevice.h | 1 +
net/core/dev.c | 17 +++++++++++++++--
2 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 06436695c3679..88e27d3c39da2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3238,6 +3238,7 @@ static inline bool dev_has_header(const struct net_device *dev)
struct softnet_data {
struct list_head poll_list;
struct sk_buff_head process_queue;
+ local_lock_t process_queue_bh_lock;
/* stats */
unsigned int processed;
diff --git a/net/core/dev.c b/net/core/dev.c
index 09232080843ee..5a0f6da7b3ae5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -440,7 +440,9 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
* queue in the local softnet handler.
*/
-DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
+DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data) = {
+ .process_queue_bh_lock = INIT_LOCAL_LOCK(process_queue_bh_lock),
+};
EXPORT_PER_CPU_SYMBOL(softnet_data);
#ifdef CONFIG_LOCKDEP
@@ -5838,6 +5840,7 @@ static void flush_backlog(struct work_struct *work)
}
rps_unlock_irq_enable(sd);
+ local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
if (skb->dev->reg_state == NETREG_UNREGISTERING) {
__skb_unlink(skb, &sd->process_queue);
@@ -5845,6 +5848,7 @@ static void flush_backlog(struct work_struct *work)
input_queue_head_incr(sd);
}
}
+ local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
local_bh_enable();
}
@@ -5966,15 +5970,22 @@ static int process_backlog(struct napi_struct *napi, int quota)
while (again) {
struct sk_buff *skb;
+ local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
while ((skb = __skb_dequeue(&sd->process_queue))) {
+ local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
rcu_read_lock();
__netif_receive_skb(skb);
rcu_read_unlock();
+
+ local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
input_queue_head_incr(sd);
- if (++work >= quota)
+ if (++work >= quota) {
+ local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
return work;
+ }
}
+ local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
rps_lock_irq_disable(sd);
if (skb_queue_empty(&sd->input_pkt_queue)) {
@@ -5989,8 +6000,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
napi->state = 0;
again = false;
} else {
+ local_lock_nested_bh(&softnet_data.process_queue_bh_lock);
skb_queue_splice_tail_init(&sd->input_pkt_queue,
&sd->process_queue);
+ local_unlock_nested_bh(&softnet_data.process_queue_bh_lock);
}
rps_unlock_irq_enable(sd);
}
--
2.43.0
The XDP redirect process is two-staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
packet and makes decisions. While doing that, the per-CPU variable
bpf_redirect_info is used.
- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
and it may also access other per-CPU variables like xskmap_flush_list.
At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.
The per-CPU variables are only used in the NAPI callback, hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.
Introduce redirect_lock as a lock to be acquired when these per-CPU
variables are accessed. Usually the lock would be part of the per-CPU
variable which it protects, but since a few different per-CPU variables
need to be protected at the same time (and some of them depend on a
CONFIG setting), a new per-CPU data structure with the variable
bpf_run_lock is used for this.
The lock is a nested-BH lock, meaning that on non-PREEMPT_RT kernels it
simply results in a lockdep check ensuring that bottom halves are
disabled. On PREEMPT_RT kernels it will provide the needed
synchronisation once local_bh_disable() no longer acts as a per-CPU lock.
This patch introduces the bpf_run_lock.redirect_lock lock. It will be
used by drivers in the following patches.
A follow-up step could be to keep bpf_prog_run_xdp() and the
XDP_REDIRECT switch case (with xdp_do_redirect()) close together. That
would allow a single scoped_guard() to cover the two instances that
require locking instead of the whole switch case.
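A rough sketch of what such a follow-up could look like, using a
hypothetical driver helper (foo_run_xdp() is not part of this series);
only the program run and xdp_do_redirect() sit inside the scoped_guard(),
error handling and freeing stay outside the locked section:

static u32 foo_run_xdp(struct net_device *dev, struct bpf_prog *prog,
		       struct xdp_buff *xdp)
{
	u32 act;

	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
		act = bpf_prog_run_xdp(prog, xdp);
		if (act == XDP_REDIRECT &&
		    xdp_do_redirect(dev, xdp, prog) < 0)
			act = XDP_ABORTED;
	}

	switch (act) {
	case XDP_PASS:
	case XDP_TX:
	case XDP_REDIRECT:
		break;
	default:
		bpf_warn_invalid_xdp_action(dev, prog, act);
		fallthrough;
	case XDP_ABORTED:
		/* drop the buffer, update stats, ... outside the lock */
		break;
	}
	return act;
}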
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/bpf.h | 6 ++++++
net/core/filter.c | 5 +++++
2 files changed, 11 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cff5bb08820ec..6912b85209b12 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -291,6 +291,12 @@ struct bpf_map {
s64 __percpu *elem_count;
};
+struct bpf_run_lock {
+ local_lock_t redirect_lock;
+};
+
+DECLARE_PER_CPU(struct bpf_run_lock, bpf_run_lock);
+
static inline const char *btf_field_type_name(enum btf_field_type type)
{
switch (type) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 896aa3fa699f9..7c9653734fb60 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -89,6 +89,11 @@
static const struct bpf_func_proto *
bpf_sk_base_func_proto(enum bpf_func_id func_id);
+DEFINE_PER_CPU(struct bpf_run_lock, bpf_run_lock) = {
+ .redirect_lock = INIT_LOCAL_LOCK(redirect_lock),
+};
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_run_lock);
+
int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len)
{
if (in_compat_syscall()) {
--
2.43.0
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.
This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.
The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either,
because not all drivers which use bpf_prog_run_xdp() support
XDP_REDIRECT (and invoke xdp_do_redirect()).
Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Cong Wang <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jamal Hadi Salim <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Jiri Pirko <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Ronak Doshi <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: VMware PV-Drivers Reviewers <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
drivers/net/vmxnet3/vmxnet3_xdp.c | 1 +
kernel/bpf/cpumap.c | 2 ++
net/bpf/test_run.c | 11 ++++++++---
net/core/dev.c | 3 +++
net/core/filter.c | 1 +
net/core/lwt_bpf.c | 2 ++
net/sched/cls_api.c | 2 ++
7 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.c b/drivers/net/vmxnet3/vmxnet3_xdp.c
index 80ddaff759d47..18bce98fd2e31 100644
--- a/drivers/net/vmxnet3/vmxnet3_xdp.c
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.c
@@ -257,6 +257,7 @@ vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct xdp_buff *xdp,
u32 act;
rq->stats.xdp_packets++;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
page = virt_to_page(xdp->data_hard_start);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8a0bb80fe48a3..c26d49bb78679 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -144,6 +144,7 @@ static void cpu_map_bpf_prog_run_skb(struct bpf_cpu_map_entry *rcpu,
int err;
list_for_each_entry_safe(skb, tmp, listp, list) {
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_generic_xdp(skb, &xdp, rcpu->prog);
switch (act) {
case XDP_PASS:
@@ -182,6 +183,7 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
struct xdp_buff xdp;
int i, nframes = 0;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
xdp_set_return_frame_no_direct();
xdp.rxq = &rxq;
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c9fdcc5cdce10..db8f7eb35c6ca 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -293,6 +293,7 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
batch_sz = min_t(u32, repeat, xdp->batch_size);
local_bh_disable();
+ local_lock_nested_bh(&bpf_run_lock.redirect_lock);
xdp_set_return_frame_no_direct();
for (i = 0; i < batch_sz; i++) {
@@ -348,6 +349,9 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
}
out:
+ xdp_clear_return_frame_no_direct();
+ local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+
if (redirect)
xdp_do_flush();
if (nframes) {
@@ -356,7 +360,6 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
err = ret;
}
- xdp_clear_return_frame_no_direct();
local_bh_enable();
return err;
}
@@ -417,10 +420,12 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
do {
run_ctx.prog_item = &item;
local_bh_disable();
- if (xdp)
+ if (xdp) {
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
*retval = bpf_prog_run_xdp(prog, ctx);
- else
+ } else {
*retval = bpf_prog_run(prog, ctx);
+ }
local_bh_enable();
} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a0f6da7b3ae5..5ba7509e88752 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
*pt_prev = NULL;
}
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
qdisc_skb_cb(skb)->pkt_len = skb->len;
tcx_set_ingress(skb, true);
@@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
if (!entry)
return skb;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
* already set by the caller.
*/
@@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
u32 act;
int err;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
if (act != XDP_PASS) {
switch (act) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 7c9653734fb60..72a7812f933a1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
*/
void xdp_do_flush(void)
{
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
__dev_flush();
__cpu_map_flush();
__xsk_map_flush();
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..74b88e897a7e3 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
* BPF prog and skb_do_redirect().
*/
local_bh_disable();
+ local_lock_nested_bh(&bpf_run_lock.redirect_lock);
bpf_compute_data_pointers(skb);
ret = bpf_prog_run_save_cb(lwt->prog, skb);
@@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
break;
}
+ local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
local_bh_enable();
return ret;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1976bd1639863..da61b99bc558f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -23,6 +23,7 @@
#include <linux/jhash.h>
#include <linux/rculist.h>
#include <linux/rhashtable.h>
+#include <linux/bpf.h>
#include <net/net_namespace.h>
#include <net/sock.h>
#include <net/netlink.h>
@@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
fl = rcu_dereference_bh(qe->filter_chain);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
case TC_ACT_SHOT:
qdisc_qstats_drop(sch);
--
2.43.0
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.
This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.
The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either,
because not all drivers which use bpf_prog_run_xdp() support
XDP_REDIRECT (and invoke xdp_do_redirect()).
Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
Cc: Alexei Starovoitov <[email protected]>
Cc: Arthur Kiyanovski <[email protected]>
Cc: David Arinzon <[email protected]>
Cc: Igor Russkikh <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Michael Chan <[email protected]>
Cc: Noam Dagan <[email protected]>
Cc: Saeed Bishara <[email protected]>
Cc: Shay Agroskin <[email protected]>
Cc: Sunil Goutham <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
drivers/net/ethernet/amazon/ena/ena_netdev.c | 1 +
.../net/ethernet/aquantia/atlantic/aq_ring.c | 26 ++++++++++++-------
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 1 +
.../net/ethernet/cavium/thunder/nicvf_main.c | 3 ++-
drivers/net/ethernet/engleder/tsnep_main.c | 17 +++++++-----
5 files changed, 31 insertions(+), 17 deletions(-)
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index b5bca48148309..cf075bc5e2b13 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -385,6 +385,7 @@ static int ena_xdp_execute(struct ena_ring *rx_ring, struct xdp_buff *xdp)
if (!xdp_prog)
goto out;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
verdict = bpf_prog_run_xdp(xdp_prog, xdp);
switch (verdict) {
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 694daeaf3e615..5d33d478d5109 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -458,7 +458,24 @@ static struct sk_buff *aq_xdp_run_prog(struct aq_nic_s *aq_nic,
if (xdp_buff_has_frags(xdp) && !prog->aux->xdp_has_frags)
goto out_aborted;
+ local_lock_nested_bh(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
+ if (act == XDP_REDIRECT) {
+ if (xdp_do_redirect(aq_nic->ndev, xdp, prog) < 0) {
+ local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+ goto out_aborted;
+ }
+ local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+
+ xdp_do_flush();
+ u64_stats_update_begin(&rx_ring->stats.rx.syncp);
+ ++rx_ring->stats.rx.xdp_redirect;
+ u64_stats_update_end(&rx_ring->stats.rx.syncp);
+ aq_get_rxpages_xdp(buff, xdp);
+ } else {
+ local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
+ }
+
switch (act) {
case XDP_PASS:
skb = aq_xdp_build_skb(xdp, aq_nic->ndev, buff);
@@ -481,15 +498,6 @@ static struct sk_buff *aq_xdp_run_prog(struct aq_nic_s *aq_nic,
u64_stats_update_end(&rx_ring->stats.rx.syncp);
aq_get_rxpages_xdp(buff, xdp);
break;
- case XDP_REDIRECT:
- if (xdp_do_redirect(aq_nic->ndev, xdp, prog) < 0)
- goto out_aborted;
- xdp_do_flush();
- u64_stats_update_begin(&rx_ring->stats.rx.syncp);
- ++rx_ring->stats.rx.xdp_redirect;
- u64_stats_update_end(&rx_ring->stats.rx.syncp);
- aq_get_rxpages_xdp(buff, xdp);
- break;
default:
fallthrough;
case XDP_ABORTED:
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 96f5ca778c67d..c4d989da7fade 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -253,6 +253,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
/* BNXT_RX_PAGE_MODE(bp) when XDP enabled */
orig_data = xdp.data;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(xdp_prog, &xdp);
tx_avail = bnxt_tx_avail(bp, txr);
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index eff350e0bc2a8..8e1406101f71b 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -554,7 +554,8 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
xdp_prepare_buff(&xdp, hard_start, data - hard_start, len, false);
orig_data = xdp.data;
- action = bpf_prog_run_xdp(prog, &xdp);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+ action = bpf_prog_run_xdp(prog, &xdp);
len = xdp.data_end - xdp.data;
/* Check if XDP program has changed headers */
diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
index df40c720e7b23..acda3502d274f 100644
--- a/drivers/net/ethernet/engleder/tsnep_main.c
+++ b/drivers/net/ethernet/engleder/tsnep_main.c
@@ -1268,6 +1268,7 @@ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
length = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
switch (act) {
case XDP_PASS:
@@ -1309,14 +1310,16 @@ static bool tsnep_xdp_run_prog_zc(struct tsnep_rx *rx, struct bpf_prog *prog,
{
u32 act;
- act = bpf_prog_run_xdp(prog, xdp);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ act = bpf_prog_run_xdp(prog, xdp);
- /* XDP_REDIRECT is the main action for zero-copy */
- if (likely(act == XDP_REDIRECT)) {
- if (xdp_do_redirect(rx->adapter->netdev, xdp, prog) < 0)
- goto out_failure;
- *status |= TSNEP_XDP_REDIRECT;
- return true;
+ /* XDP_REDIRECT is the main action for zero-copy */
+ if (likely(act == XDP_REDIRECT)) {
+ if (xdp_do_redirect(rx->adapter->netdev, xdp, prog) < 0)
+ goto out_failure;
+ *status |= TSNEP_XDP_REDIRECT;
+ return true;
+ }
}
switch (act) {
--
2.43.0
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.
This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.
The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either,
because not all drivers which use bpf_prog_run_xdp() support
XDP_REDIRECT (and invoke xdp_do_redirect()).
Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
Cc: Alexei Starovoitov <[email protected]>
Cc: Clark Wang <[email protected]>
Cc: Claudiu Manoil <[email protected]>
Cc: Ioana Ciornei <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Madalin Bucur <[email protected]>
Cc: NXP Linux Team <[email protected]>
Cc: Shenwei Wang <[email protected]>
Cc: Vladimir Oltean <[email protected]>
Cc: Wei Fang <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
.../net/ethernet/freescale/dpaa/dpaa_eth.c | 1 +
.../net/ethernet/freescale/dpaa2/dpaa2-eth.c | 1 +
.../net/ethernet/freescale/dpaa2/dpaa2-xsk.c | 30 ++++++++++---------
drivers/net/ethernet/freescale/enetc/enetc.c | 1 +
drivers/net/ethernet/freescale/fec_main.c | 1 +
5 files changed, 20 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index dcbc598b11c6c..8adc766282fde 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2597,6 +2597,7 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
}
#endif
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
/* Update the length and the offset of the FD */
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 888509cf1f210..08be35a3e3de7 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -442,6 +442,7 @@ static u32 dpaa2_eth_run_xdp(struct dpaa2_eth_priv *priv,
xdp_prepare_buff(&xdp, vaddr + offset, XDP_PACKET_HEADROOM,
dpaa2_fd_get_len(fd), false);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
/* xdp.data pointer may have changed */
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
index 051748b997f3f..e3ae9de6b0a34 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
@@ -56,23 +56,25 @@ static u32 dpaa2_xsk_run_xdp(struct dpaa2_eth_priv *priv,
xdp_buff->rxq = &ch->xdp_rxq;
xsk_buff_dma_sync_for_cpu(xdp_buff, ch->xsk_pool);
- xdp_act = bpf_prog_run_xdp(xdp_prog, xdp_buff);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ xdp_act = bpf_prog_run_xdp(xdp_prog, xdp_buff);
- /* xdp.data pointer may have changed */
- dpaa2_fd_set_offset(fd, xdp_buff->data - vaddr);
- dpaa2_fd_set_len(fd, xdp_buff->data_end - xdp_buff->data);
+ /* xdp.data pointer may have changed */
+ dpaa2_fd_set_offset(fd, xdp_buff->data - vaddr);
+ dpaa2_fd_set_len(fd, xdp_buff->data_end - xdp_buff->data);
- if (likely(xdp_act == XDP_REDIRECT)) {
- err = xdp_do_redirect(priv->net_dev, xdp_buff, xdp_prog);
- if (unlikely(err)) {
- ch->stats.xdp_drop++;
- dpaa2_eth_recycle_buf(priv, ch, addr);
- } else {
- ch->buf_count--;
- ch->stats.xdp_redirect++;
+ if (likely(xdp_act == XDP_REDIRECT)) {
+ err = xdp_do_redirect(priv->net_dev, xdp_buff, xdp_prog);
+ if (unlikely(err)) {
+ ch->stats.xdp_drop++;
+ dpaa2_eth_recycle_buf(priv, ch, addr);
+ } else {
+ ch->buf_count--;
+ ch->stats.xdp_redirect++;
+ }
+
+ goto xdp_redir;
}
-
- goto xdp_redir;
}
switch (xdp_act) {
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index cffbf27c4656b..d516b28815af4 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -1578,6 +1578,7 @@ static int enetc_clean_rx_ring_xdp(struct enetc_bdr *rx_ring,
rx_byte_cnt += VLAN_HLEN;
rx_byte_cnt += xdp_get_buff_len(&xdp_buff);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
xdp_act = bpf_prog_run_xdp(prog, &xdp_buff);
switch (xdp_act) {
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index c3b7694a74851..335b1e307d468 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1587,6 +1587,7 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
int err;
u32 act;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
/* Due xdp_adjust_tail and xdp_adjust_head: DMA sync for_device cover
--
2.43.0
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.
This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.
The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either,
because not all drivers which use bpf_prog_run_xdp() support
XDP_REDIRECT (and invoke xdp_do_redirect()).
Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: AngeloGioacchino Del Regno <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Dimitris Michailidis <[email protected]>
Cc: Felix Fietkau <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Horatiu Vultur <[email protected]>
Cc: Jeroen de Borst <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: John Crispin <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Lorenzo Bianconi <[email protected]>
Cc: Mark Lee <[email protected]>
Cc: Matthias Brugger <[email protected]>
Cc: Praveen Kaligineedi <[email protected]>
Cc: Sean Wang <[email protected]>
Cc: Shailend Chand <[email protected]>
Cc: [email protected]
Cc: Wei Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
drivers/net/ethernet/fungible/funeth/funeth_rx.c | 1 +
drivers/net/ethernet/google/gve/gve_rx.c | 12 +++++++-----
drivers/net/ethernet/mediatek/mtk_eth_soc.c | 1 +
drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c | 1 +
drivers/net/ethernet/microsoft/mana/mana_bpf.c | 1 +
5 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/fungible/funeth/funeth_rx.c b/drivers/net/ethernet/fungible/funeth/funeth_rx.c
index 7e2584895de39..e7b1382545908 100644
--- a/drivers/net/ethernet/fungible/funeth/funeth_rx.c
+++ b/drivers/net/ethernet/fungible/funeth/funeth_rx.c
@@ -152,6 +152,7 @@ static void *fun_run_xdp(struct funeth_rxq *q, skb_frag_t *frags, void *buf_va,
xdp_prepare_buff(&xdp, buf_va, FUN_XDP_HEADROOM, skb_frag_size(frags) -
(FUN_RX_TAILROOM + FUN_XDP_HEADROOM), false);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
xdp_prog = READ_ONCE(q->xdp_prog);
act = bpf_prog_run_xdp(xdp_prog, &xdp);
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index 73655347902d2..504c8ef761a33 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -779,11 +779,13 @@ static void gve_rx(struct gve_rx_ring *rx, netdev_features_t feat,
page_info->page_offset, GVE_RX_PAD,
len, false);
old_data = xdp.data;
- xdp_act = bpf_prog_run_xdp(xprog, &xdp);
- if (xdp_act != XDP_PASS) {
- gve_xdp_done(priv, rx, &xdp, xprog, xdp_act);
- ctx->total_size += frag_size;
- goto finish_ok_pkt;
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ xdp_act = bpf_prog_run_xdp(xprog, &xdp);
+ if (xdp_act != XDP_PASS) {
+ gve_xdp_done(priv, rx, &xdp, xprog, xdp_act);
+ ctx->total_size += frag_size;
+ goto finish_ok_pkt;
+ }
}
page_info->pad += xdp.data - old_data;
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 3cf6589cfdacf..477a74ee18c0a 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1946,6 +1946,7 @@ static u32 mtk_xdp_run(struct mtk_eth *eth, struct mtk_rx_ring *ring,
if (!prog)
goto out;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
switch (act) {
case XDP_PASS:
diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c b/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c
index 9ee61db8690b4..026311af07f9e 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_xdp.c
@@ -84,6 +84,7 @@ int lan966x_xdp_run(struct lan966x_port *port, struct page *page, u32 data_len)
xdp_prepare_buff(&xdp, page_address(page),
IFH_LEN_BYTES + XDP_PACKET_HEADROOM,
data_len - IFH_LEN_BYTES, false);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(xdp_prog, &xdp);
switch (act) {
case XDP_PASS:
diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 23b1521c0df96..d465b1dd9fca0 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -93,6 +93,7 @@ u32 mana_run_xdp(struct net_device *ndev, struct mana_rxq *rxq,
xdp_init_buff(xdp, PAGE_SIZE, &rxq->xdp_rxq);
xdp_prepare_buff(xdp, buf_va, XDP_PACKET_HEADROOM, pkt_len, false);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
rx_stats = &rxq->stats;
--
2.43.0
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.
This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.
The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either
because not all drivers that use bpf_prog_run_xdp() support
XDP_REDIRECT (and thus invoke xdp_do_redirect()).
Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
Cc: Alexei Starovoitov <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Jesse Brandeburg <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Tony Nguyen <[email protected]>
Cc: [email protected] (open list:XDP
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 22 +++++++++--------
drivers/net/ethernet/intel/ice/ice_txrx.c | 1 +
drivers/net/ethernet/intel/ice/ice_xsk.c | 21 ++++++++--------
drivers/net/ethernet/intel/igb/igb_main.c | 1 +
drivers/net/ethernet/intel/igc/igc_main.c | 5 +++-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 1 +
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 24 ++++++++++---------
.../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 ++-
9 files changed, 46 insertions(+), 33 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index dd410b15000f7..76e069ae2183a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2326,6 +2326,7 @@ static int i40e_run_xdp(struct i40e_ring *rx_ring, struct xdp_buff *xdp, struct
prefetchw(xdp->data_hard_start); /* xdp_frame write */
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(xdp_prog, xdp);
switch (act) {
case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index e99fa854d17f1..2b0c0c1f3ddc8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -201,17 +201,19 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp,
struct i40e_ring *xdp_ring;
u32 act;
- act = bpf_prog_run_xdp(xdp_prog, xdp);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ act = bpf_prog_run_xdp(xdp_prog, xdp);
- if (likely(act == XDP_REDIRECT)) {
- err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
- if (!err)
- return I40E_XDP_REDIR;
- if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
- result = I40E_XDP_EXIT;
- else
- result = I40E_XDP_CONSUMED;
- goto out_failure;
+ if (likely(act == XDP_REDIRECT)) {
+ err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+ if (!err)
+ return I40E_XDP_REDIR;
+ if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+ result = I40E_XDP_EXIT;
+ else
+ result = I40E_XDP_CONSUMED;
+ goto out_failure;
+ }
}
switch (act) {
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 9e97ea8630686..5d4cfa3455b37 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -571,6 +571,7 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
if (!xdp_prog)
goto exit;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(xdp_prog, xdp);
switch (act) {
case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 99954508184f9..02f89c22d19e3 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -762,17 +762,18 @@ ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
int err, result = ICE_XDP_PASS;
u32 act;
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
act = bpf_prog_run_xdp(xdp_prog, xdp);
-
- if (likely(act == XDP_REDIRECT)) {
- err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
- if (!err)
- return ICE_XDP_REDIR;
- if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
- result = ICE_XDP_EXIT;
- else
- result = ICE_XDP_CONSUMED;
- goto out_failure;
+ if (likely(act == XDP_REDIRECT)) {
+ err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+ if (!err)
+ return ICE_XDP_REDIR;
+ if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+ result = ICE_XDP_EXIT;
+ else
+ result = ICE_XDP_CONSUMED;
+ goto out_failure;
+ }
}
switch (act) {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b2295caa2f0ab..e01be809d030e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8621,6 +8621,7 @@ static struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
prefetchw(xdp->data_hard_start); /* xdp_frame write */
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(xdp_prog, xdp);
switch (act) {
case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index e9bb403bbacf9..8321419b3a307 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -2485,7 +2485,10 @@ static int __igc_xdp_run_prog(struct igc_adapter *adapter,
struct bpf_prog *prog,
struct xdp_buff *xdp)
{
- u32 act = bpf_prog_run_xdp(prog, xdp);
+ u32 act;
+
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
+ act = bpf_prog_run_xdp(prog, xdp);
switch (act) {
case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 94bde2cad0f47..de564e8b83be2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2203,6 +2203,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
prefetchw(xdp->data_hard_start); /* xdp_frame write */
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(xdp_prog, xdp);
switch (act) {
case XDP_PASS:
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 59798bc33298f..b988f758aad49 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -104,18 +104,20 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
struct xdp_frame *xdpf;
u32 act;
- xdp_prog = READ_ONCE(rx_ring->xdp_prog);
- act = bpf_prog_run_xdp(xdp_prog, xdp);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+ act = bpf_prog_run_xdp(xdp_prog, xdp);
- if (likely(act == XDP_REDIRECT)) {
- err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
- if (!err)
- return IXGBE_XDP_REDIR;
- if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
- result = IXGBE_XDP_EXIT;
- else
- result = IXGBE_XDP_CONSUMED;
- goto out_failure;
+ if (likely(act == XDP_REDIRECT)) {
+ err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+ if (!err)
+ return IXGBE_XDP_REDIR;
+ if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+ result = IXGBE_XDP_EXIT;
+ else
+ result = IXGBE_XDP_CONSUMED;
+ goto out_failure;
+ }
}
switch (act) {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a44e4bd561421..1c58c08aa15ff 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1059,7 +1059,8 @@ static struct sk_buff *ixgbevf_run_xdp(struct ixgbevf_adapter *adapter,
if (!xdp_prog)
goto xdp_out;
- act = bpf_prog_run_xdp(xdp_prog, xdp);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+ act = bpf_prog_run_xdp(xdp_prog, xdp);
switch (act) {
case XDP_PASS:
break;
--
2.43.0
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.
This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.
The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either
because not all drivers that use bpf_prog_run_xdp() support
XDP_REDIRECT (and thus invoke xdp_do_redirect()).
Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Nikolay Aleksandrov <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Willem de Bruijn <[email protected]>
Cc: Xuan Zhuo <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
drivers/net/hyperv/netvsc_bpf.c | 1 +
drivers/net/netkit.c | 13 +++++++----
drivers/net/tun.c | 28 +++++++++++++----------
drivers/net/veth.c | 40 ++++++++++++++++++++-------------
drivers/net/virtio_net.c | 1 +
drivers/net/xen-netfront.c | 1 +
6 files changed, 52 insertions(+), 32 deletions(-)
diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c
index 4a9522689fa4f..55f8ca92ca199 100644
--- a/drivers/net/hyperv/netvsc_bpf.c
+++ b/drivers/net/hyperv/netvsc_bpf.c
@@ -58,6 +58,7 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan,
memcpy(xdp->data, data, len);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
switch (act) {
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 39171380ccf29..fbcf78477bda8 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -80,8 +80,15 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
skb->dev = peer;
entry = rcu_dereference(nk->active);
- if (entry)
- ret = netkit_run(entry, skb, ret);
+ if (entry) {
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ ret = netkit_run(entry, skb, ret);
+ if (ret == NETKIT_REDIRECT) {
+ dev_sw_netstats_tx_add(dev, 1, len);
+ skb_do_redirect(skb);
+ }
+ }
+ }
switch (ret) {
case NETKIT_NEXT:
case NETKIT_PASS:
@@ -95,8 +102,6 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
}
break;
case NETKIT_REDIRECT:
- dev_sw_netstats_tx_add(dev, 1, len);
- skb_do_redirect(skb);
break;
case NETKIT_DROP:
default:
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afa5497f7c35c..fe0d31f11e4b6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1708,16 +1708,18 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
xdp_init_buff(&xdp, buflen, &tfile->xdp_rxq);
xdp_prepare_buff(&xdp, buf, pad, len, false);
- act = bpf_prog_run_xdp(xdp_prog, &xdp);
- if (act == XDP_REDIRECT || act == XDP_TX) {
- get_page(alloc_frag->page);
- alloc_frag->offset += buflen;
- }
- err = tun_xdp_act(tun, xdp_prog, &xdp, act);
- if (err < 0) {
- if (act == XDP_REDIRECT || act == XDP_TX)
- put_page(alloc_frag->page);
- goto out;
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ act = bpf_prog_run_xdp(xdp_prog, &xdp);
+ if (act == XDP_REDIRECT || act == XDP_TX) {
+ get_page(alloc_frag->page);
+ alloc_frag->offset += buflen;
+ }
+ err = tun_xdp_act(tun, xdp_prog, &xdp, act);
+ if (err < 0) {
+ if (act == XDP_REDIRECT || act == XDP_TX)
+ put_page(alloc_frag->page);
+ goto out;
+ }
}
if (err == XDP_REDIRECT)
@@ -2460,8 +2462,10 @@ static int tun_xdp_one(struct tun_struct *tun,
xdp_init_buff(xdp, buflen, &tfile->xdp_rxq);
xdp_set_data_meta_invalid(xdp);
- act = bpf_prog_run_xdp(xdp_prog, xdp);
- ret = tun_xdp_act(tun, xdp_prog, xdp, act);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ act = bpf_prog_run_xdp(xdp_prog, xdp);
+ ret = tun_xdp_act(tun, xdp_prog, xdp, act);
+ }
if (ret < 0) {
put_page(virt_to_head_page(xdp->data));
return ret;
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 977861c46b1fe..c69e5ff9f8795 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -624,7 +624,18 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
xdp->rxq = &rq->xdp_rxq;
vxbuf.skb = NULL;
- act = bpf_prog_run_xdp(xdp_prog, xdp);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ act = bpf_prog_run_xdp(xdp_prog, xdp);
+ if (act == XDP_REDIRECT) {
+ orig_frame = *frame;
+ xdp->rxq->mem = frame->mem;
+ if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
+ frame = &orig_frame;
+ stats->xdp_drops++;
+ goto err_xdp;
+ }
+ }
+ }
switch (act) {
case XDP_PASS:
@@ -644,13 +655,6 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
rcu_read_unlock();
goto xdp_xmit;
case XDP_REDIRECT:
- orig_frame = *frame;
- xdp->rxq->mem = frame->mem;
- if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
- frame = &orig_frame;
- stats->rx_drops++;
- goto err_xdp;
- }
stats->xdp_redirect++;
rcu_read_unlock();
goto xdp_xmit;
@@ -857,7 +861,18 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
orig_data = xdp->data;
orig_data_end = xdp->data_end;
- act = bpf_prog_run_xdp(xdp_prog, xdp);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ act = bpf_prog_run_xdp(xdp_prog, xdp);
+ if (act == XDP_REDIRECT) {
+ veth_xdp_get(xdp);
+ consume_skb(skb);
+ xdp->rxq->mem = rq->xdp_mem;
+ if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
+ stats->rx_drops++;
+ goto err_xdp;
+ }
+ }
+ }
switch (act) {
case XDP_PASS:
@@ -875,13 +890,6 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
rcu_read_unlock();
goto xdp_xmit;
case XDP_REDIRECT:
- veth_xdp_get(xdp);
- consume_skb(skb);
- xdp->rxq->mem = rq->xdp_mem;
- if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
- stats->rx_drops++;
- goto err_xdp;
- }
stats->xdp_redirect++;
rcu_read_unlock();
goto xdp_xmit;
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d16f592c2061f..5e362c4604239 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1010,6 +1010,7 @@ static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
int err;
u32 act;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(xdp_prog, xdp);
u64_stats_inc(&stats->xdp_packets);
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index ad29f370034e4..e3daa8cdeb84e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -978,6 +978,7 @@ static u32 xennet_run_xdp(struct netfront_queue *queue, struct page *pdata,
xdp_prepare_buff(xdp, page_address(pdata), XDP_PACKET_HEADROOM,
len, false);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
switch (act) {
case XDP_TX:
--
2.43.0
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.
This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.
The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either
because not all drivers that use bpf_prog_run_xdp() support
XDP_REDIRECT (and thus invoke xdp_do_redirect()).
Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
Cc: Alexei Starovoitov <[email protected]>
Cc: Geetha sowjanya <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Marcin Wojtas <[email protected]>
Cc: Russell King <[email protected]>
Cc: Subbaraya Sundeep <[email protected]>
Cc: Sunil Goutham <[email protected]>
Cc: Thomas Petazzoni <[email protected]>
Cc: hariprasad <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
drivers/net/ethernet/marvell/mvneta.c | 2 ++
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 1 +
drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c | 1 +
3 files changed, 4 insertions(+)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 29aac327574d6..9c7aacd73b590 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2263,6 +2263,8 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
len = xdp->data_end - xdp->data_hard_start - pp->rx_offset_correction;
data_len = xdp->data_end - xdp->data;
+
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 93137606869e2..3a5524ffaba68 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3793,6 +3793,7 @@ mvpp2_run_xdp(struct mvpp2_port *port, struct bpf_prog *prog,
u32 ret, act;
len = xdp->data_end - xdp->data_hard_start - MVPP2_SKB_HEADROOM;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
index 4d519ea833b2c..e48e84d6159bc 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_txrx.c
@@ -1422,6 +1422,7 @@ static bool otx2_xdp_rcv_pkt_handler(struct otx2_nic *pfvf,
xdp_prepare_buff(&xdp, hard_start, data - hard_start,
cqe->sg.seg_size, false);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, &xdp);
switch (act) {
--
2.43.0
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.
This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.
The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either
because not all drivers that use bpf_prog_run_xdp() support
XDP_REDIRECT (and thus invoke xdp_do_redirect()).
Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
Cc: Alexandre Torgue <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Ariel Elior <[email protected]>
Cc: Ilias Apalodimas <[email protected]>
Cc: Jassi Brar <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Jose Abreu <[email protected]>
Cc: Manish Chopra <[email protected]>
Cc: Maxime Coquelin <[email protected]>
Cc: Ravi Gunasekaran <[email protected]>
Cc: Roger Quadros <[email protected]>
Cc: Siddharth Vadapalli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
drivers/net/ethernet/qlogic/qede/qede_fp.c | 1 +
drivers/net/ethernet/socionext/netsec.c | 1 +
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 1 +
drivers/net/ethernet/ti/cpsw_priv.c | 15 +++++++++------
4 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index cb1746bc0e0c5..ce5af094fb817 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1091,6 +1091,7 @@ static bool qede_rx_xdp(struct qede_dev *edev,
xdp_prepare_buff(&xdp, page_address(bd->data), *data_offset,
*len, false);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, &xdp);
/* Recalculate, as XDP might have changed the headers */
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 0891e9e49ecb5..47e314338f3f3 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -905,6 +905,7 @@ static u32 netsec_run_xdp(struct netsec_priv *priv, struct bpf_prog *prog,
int err;
u32 act;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 37e64283f9107..9e92affc8c22c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -4893,6 +4893,7 @@ static int __stmmac_xdp_run_prog(struct stmmac_priv *priv,
u32 act;
int res;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
switch (act) {
case XDP_PASS:
diff --git a/drivers/net/ethernet/ti/cpsw_priv.c b/drivers/net/ethernet/ti/cpsw_priv.c
index 764ed298b5708..f38c49f9fab35 100644
--- a/drivers/net/ethernet/ti/cpsw_priv.c
+++ b/drivers/net/ethernet/ti/cpsw_priv.c
@@ -1335,9 +1335,15 @@ int cpsw_run_xdp(struct cpsw_priv *priv, int ch, struct xdp_buff *xdp,
if (!prog)
return CPSW_XDP_PASS;
- act = bpf_prog_run_xdp(prog, xdp);
- /* XDP prog might have changed packet data and boundaries */
- *len = xdp->data_end - xdp->data;
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+ act = bpf_prog_run_xdp(prog, xdp);
+ /* XDP prog might have changed packet data and boundaries */
+ *len = xdp->data_end - xdp->data;
+ if (act == XDP_REDIRECT) {
+ if (xdp_do_redirect(ndev, xdp, prog))
+ goto drop;
+ }
+ }
switch (act) {
case XDP_PASS:
@@ -1352,9 +1358,6 @@ int cpsw_run_xdp(struct cpsw_priv *priv, int ch, struct xdp_buff *xdp,
xdp_return_frame_rx_napi(xdpf);
break;
case XDP_REDIRECT:
- if (xdp_do_redirect(ndev, xdp, prog))
- goto drop;
-
/* Have to flush here, per packet, instead of doing it in bulk
* at the end of the napi handler. The RX devices on this
* particular hardware is sharing a common queue, so the
--
2.43.0
The users of bpf_redirect_info should protect access to it by acquiring
the nested BH-lock bpf_run_lock.redirect_lock. This lock should be
acquired before the first usage (bpf_prog_run_xdp()) and dropped after
the last usage in that context (xdp_do_redirect()).
The current in-tree users have been audited and updated.
Add lockdep annotations to ensure that new users acquire the lock.
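As a rough sketch, a new skb-path user would be expected to follow the
same pattern (the surrounding code here is simplified for illustration
and not taken from an existing driver):

        scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
                /* Program run and redirect happen with the per-CPU
                 * redirect_lock held; otherwise the lockdep_assert_held()
                 * added to skb_do_redirect() below will complain.
                 */
                ret = bpf_prog_run(prog, skb);
                if (ret == TC_ACT_REDIRECT)
                        skb_do_redirect(skb);
        }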
Cc: Alexei Starovoitov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stanislav Fomichev <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/net/xdp.h | 1 +
net/core/filter.c | 11 +++++++++++
2 files changed, 12 insertions(+)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 349c36fb5fd8f..cdeab175abf18 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -493,6 +493,7 @@ static inline void xdp_clear_features_flag(struct net_device *dev)
static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
struct xdp_buff *xdp)
{
+ lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
/* Driver XDP hooks are invoked within a single NAPI poll cycle and thus
* under local_bh_disable(), which provides the needed RCU protection
* for accessing map entries.
diff --git a/net/core/filter.c b/net/core/filter.c
index 72a7812f933a1..a2f97503ed578 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2495,6 +2495,7 @@ int skb_do_redirect(struct sk_buff *skb)
struct net_device *dev;
u32 flags = ri->flags;
+ lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
dev = dev_get_by_index_rcu(net, ri->tgt_index);
ri->tgt_index = 0;
ri->flags = 0;
@@ -2525,6 +2526,8 @@ BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
{
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
return TC_ACT_SHOT;
@@ -2546,6 +2549,8 @@ BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
{
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
if (unlikely(flags))
return TC_ACT_SHOT;
@@ -2568,6 +2573,8 @@ BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
{
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
if (unlikely((plen && plen < sizeof(*params)) || flags))
return TC_ACT_SHOT;
@@ -4287,6 +4294,8 @@ u32 xdp_master_redirect(struct xdp_buff *xdp)
struct net_device *master, *slave;
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
+
master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
if (slave && slave != xdp->rxq->dev) {
@@ -4394,6 +4403,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
enum bpf_map_type map_type = ri->map_type;
+ lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
if (map_type == BPF_MAP_TYPE_XSKMAP)
return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
@@ -4408,6 +4418,7 @@ int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
enum bpf_map_type map_type = ri->map_type;
+ lockdep_assert_held(this_cpu_ptr(&bpf_run_lock.redirect_lock));
if (map_type == BPF_MAP_TYPE_XSKMAP)
return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
--
2.43.0
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data
structures require explicit locking.
This is a follow-up to the previous change which introduced
bpf_run_lock.redirect_lock and now uses it within drivers.
The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of the function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping it in
xdp_do_redirect() (without touching drivers) does not work either
because not all drivers that use bpf_prog_run_xdp() support
XDP_REDIRECT (and thus invoke xdp_do_redirect()).
Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/alloc of memory, …) would happen outside of the locked section.
Cc: Alexei Starovoitov <[email protected]>
Cc: Edward Cree <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Louis Peens <[email protected]>
Cc: Martin Habets <[email protected]>
Cc: Saeed Mahameed <[email protected]>
Cc: Tariq Toukan <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 1 +
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 1 +
drivers/net/ethernet/netronome/nfp/nfd3/dp.c | 3 ++-
drivers/net/ethernet/netronome/nfp/nfd3/xsk.c | 1 +
drivers/net/ethernet/netronome/nfp/nfdk/dp.c | 3 ++-
drivers/net/ethernet/sfc/rx.c | 1 +
drivers/net/ethernet/sfc/siena/rx.c | 1 +
7 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index a09b6e05337d9..c0a3ac3405bc5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -833,6 +833,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
mxbuf.ring = ring;
mxbuf.dev = dev;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
length = mxbuf.xdp.data_end - mxbuf.xdp.data;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 7decc81ed33a9..b4e3c6a5a6da6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -269,6 +269,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
u32 act;
int err;
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
switch (act) {
case XDP_PASS:
diff --git a/drivers/net/ethernet/netronome/nfp/nfd3/dp.c b/drivers/net/ethernet/netronome/nfp/nfd3/dp.c
index 17381bfc15d72..a041b55514aa3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfd3/dp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfd3/dp.c
@@ -1011,7 +1011,8 @@ static int nfp_nfd3_rx(struct nfp_net_rx_ring *rx_ring, int budget)
pkt_off - NFP_NET_RX_BUF_HEADROOM,
pkt_len, true);
- act = bpf_prog_run_xdp(xdp_prog, &xdp);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+ act = bpf_prog_run_xdp(xdp_prog, &xdp);
pkt_len = xdp.data_end - xdp.data;
pkt_off += xdp.data - orig_data;
diff --git a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
index 45be6954d5aae..38f2d4c2b5b7c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
+++ b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
@@ -216,6 +216,7 @@ nfp_nfd3_xsk_rx(struct nfp_net_rx_ring *rx_ring, int budget,
}
}
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
act = bpf_prog_run_xdp(xdp_prog, xrxbuf->xdp);
pkt_len = xrxbuf->xdp->data_end - xrxbuf->xdp->data;
diff --git a/drivers/net/ethernet/netronome/nfp/nfdk/dp.c b/drivers/net/ethernet/netronome/nfp/nfdk/dp.c
index 8d78c6faefa8a..af0a36c4fb018 100644
--- a/drivers/net/ethernet/netronome/nfp/nfdk/dp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfdk/dp.c
@@ -1130,7 +1130,8 @@ static int nfp_nfdk_rx(struct nfp_net_rx_ring *rx_ring, int budget)
pkt_off - NFP_NET_RX_BUF_HEADROOM,
pkt_len, true);
- act = bpf_prog_run_xdp(xdp_prog, &xdp);
+ scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
+ act = bpf_prog_run_xdp(xdp_prog, &xdp);
pkt_len = xdp.data_end - xdp.data;
pkt_off += xdp.data - orig_data;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index f77a2d3ef37ec..3712d29150af5 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -291,6 +291,7 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel,
xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
rx_buf->len, false);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
offset = (u8 *)xdp.data - *ehp;
diff --git a/drivers/net/ethernet/sfc/siena/rx.c b/drivers/net/ethernet/sfc/siena/rx.c
index 98d3c0743c0f5..6bfc4cd1c83e0 100644
--- a/drivers/net/ethernet/sfc/siena/rx.c
+++ b/drivers/net/ethernet/sfc/siena/rx.c
@@ -291,6 +291,7 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel,
xdp_prepare_buff(&xdp, *ehp - EFX_XDP_HEADROOM, EFX_XDP_HEADROOM,
rx_buf->len, false);
+ guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
offset = (u8 *)xdp.data - *ehp;
--
2.43.0
On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> The proposed way out is to introduce explicit per-CPU locks for
> resources which are protected by local_bh_disable() and use those only
> on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
As I said at LPC, complicating drivers with odd locking constructs
is a no go for me.
Hi Sebastian,
kernel test robot noticed the following build warnings:
[auto build test WARNING on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base: net-next/main
patch link: https://lore.kernel.org/r/20231215171020.687342-13-bigeasy%40linutronix.de
patch subject: [PATCH net-next 12/24] seg6: Use nested-BH locking for seg6_bpf_srh_states.
config: x86_64-randconfig-r131-20231216 (https://download.01.org/0day-ci/archive/20231216/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
sparse warnings: (new ones prefixed by >>)
>> net/ipv6/seg6_local.c:1431:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct local_lock_t [usertype] *l @@ got struct local_lock_t [noderef] __percpu * @@
net/ipv6/seg6_local.c:1431:9: sparse: expected struct local_lock_t [usertype] *l
net/ipv6/seg6_local.c:1431:9: sparse: got struct local_lock_t [noderef] __percpu *
vim +1431 net/ipv6/seg6_local.c
1410
1411 static int input_action_end_bpf(struct sk_buff *skb,
1412 struct seg6_local_lwt *slwt)
1413 {
1414 struct seg6_bpf_srh_state *srh_state;
1415 struct ipv6_sr_hdr *srh;
1416 int ret;
1417
1418 srh = get_and_validate_srh(skb);
1419 if (!srh) {
1420 kfree_skb(skb);
1421 return -EINVAL;
1422 }
1423 advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
1424
1425 /* The access to the per-CPU buffer srh_state is protected by running
1426 * always in softirq context (with disabled BH). On PREEMPT_RT the
1427 * required locking is provided by the following local_lock_nested_bh()
1428 * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
1429 * bpf_prog_run_save_cb().
1430 */
> 1431 scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
1432 srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
1433 srh_state->srh = srh;
1434 srh_state->hdrlen = srh->hdrlen << 3;
1435 srh_state->valid = true;
1436
1437 rcu_read_lock();
1438 bpf_compute_data_pointers(skb);
1439 ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
1440 rcu_read_unlock();
1441
1442 switch (ret) {
1443 case BPF_OK:
1444 case BPF_REDIRECT:
1445 break;
1446 case BPF_DROP:
1447 goto drop;
1448 default:
1449 pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
1450 goto drop;
1451 }
1452
1453 if (srh_state->srh && !seg6_bpf_has_valid_srh(skb))
1454 goto drop;
1455 }
1456
1457 if (ret != BPF_REDIRECT)
1458 seg6_lookup_nexthop(skb, NULL, 0);
1459
1460 return dst_input(skb);
1461
1462 drop:
1463 kfree_skb(skb);
1464 return -EINVAL;
1465 }
1466
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Sebastian,
kernel test robot noticed the following build warnings:
[auto build test WARNING on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base: net-next/main
patch link: https://lore.kernel.org/r/20231215171020.687342-16-bigeasy%40linutronix.de
patch subject: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
config: x86_64-randconfig-122-20231216 (https://download.01.org/0day-ci/archive/20231216/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
sparse warnings: (new ones prefixed by >>)
net/sched/cls_api.c:391:22: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted __be16 [usertype] protocol @@ got unsigned int [usertype] protocol @@
net/sched/cls_api.c:391:22: sparse: expected restricted __be16 [usertype] protocol
net/sched/cls_api.c:391:22: sparse: got unsigned int [usertype] protocol
net/sched/cls_api.c:1862:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
net/sched/cls_api.c:1862:16: sparse: struct tcf_proto *
net/sched/cls_api.c:1862:16: sparse: struct tcf_proto [noderef] __rcu *
net/sched/cls_api.c:1962:20: sparse: sparse: incompatible types in comparison expression (different address spaces):
net/sched/cls_api.c:1962:20: sparse: struct tcf_proto [noderef] __rcu *
net/sched/cls_api.c:1962:20: sparse: struct tcf_proto *
net/sched/cls_api.c:1924:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
net/sched/cls_api.c:1924:25: sparse: struct tcf_proto [noderef] __rcu *
net/sched/cls_api.c:1924:25: sparse: struct tcf_proto *
net/sched/cls_api.c:1944:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
net/sched/cls_api.c:1944:16: sparse: struct tcf_proto *
net/sched/cls_api.c:1944:16: sparse: struct tcf_proto [noderef] __rcu *
net/sched/cls_api.c:2010:25: sparse: sparse: restricted __be16 degrades to integer
net/sched/cls_api.c:2697:50: sparse: sparse: restricted __be16 degrades to integer
>> net/sched/cls_api.c:3933:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct local_lock_t [usertype] *l @@ got struct local_lock_t [noderef] __percpu * @@
net/sched/cls_api.c:3933:38: sparse: expected struct local_lock_t [usertype] *l
net/sched/cls_api.c:3933:38: sparse: got struct local_lock_t [noderef] __percpu *
net/sched/cls_api.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/umh.h, include/linux/kmod.h, ...):
include/linux/local_lock.h:71:1: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct local_lock_t [usertype] * @@
include/linux/local_lock.h:71:1: sparse: expected void const [noderef] __percpu *__vpp_verify
include/linux/local_lock.h:71:1: sparse: got struct local_lock_t [usertype] *
vim +3933 net/sched/cls_api.c
3921
3922 struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
3923 struct sk_buff **to_free, int *ret)
3924 {
3925 struct tcf_result cl_res;
3926 struct tcf_proto *fl;
3927
3928 if (!qe->info.block_index)
3929 return skb;
3930
3931 fl = rcu_dereference_bh(qe->filter_chain);
3932
> 3933 guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
3934 switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
3935 case TC_ACT_SHOT:
3936 qdisc_qstats_drop(sch);
3937 __qdisc_drop(skb, to_free);
3938 *ret = __NET_XMIT_BYPASS;
3939 return NULL;
3940 case TC_ACT_STOLEN:
3941 case TC_ACT_QUEUED:
3942 case TC_ACT_TRAP:
3943 __qdisc_drop(skb, to_free);
3944 *ret = __NET_XMIT_STOLEN;
3945 return NULL;
3946 case TC_ACT_REDIRECT:
3947 skb_do_redirect(skb);
3948 *ret = __NET_XMIT_STOLEN;
3949 return NULL;
3950 }
3951
3952 return skb;
3953 }
3954 EXPORT_SYMBOL(tcf_qevent_handle);
3955
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Sebastian,
kernel test robot noticed the following build errors:
[auto build test ERROR on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base: net-next/main
patch link: https://lore.kernel.org/r/20231215171020.687342-21-bigeasy%40linutronix.de
patch subject: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
config: arm-defconfig (https://download.01.org/0day-ci/archive/20231216/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
>> drivers/net/ethernet/intel/igb/igb_main.c:8620:3: error: cannot jump from this goto statement to its label
goto xdp_out;
^
drivers/net/ethernet/intel/igb/igb_main.c:8624:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
^
include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
CLASS(_name, __UNIQUE_ID(guard))
^
include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
^
include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
#define __PASTE(a,b) ___PASTE(a,b)
^
include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
#define ___PASTE(a,b) a##b
^
<scratch space>:52:1: note: expanded from here
__UNIQUE_ID_guard753
^
1 error generated.
vim +8620 drivers/net/ethernet/intel/igb/igb_main.c
b1bb2eb0a0deb0 Alexander Duyck 2017-02-06 8608
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8609 static struct sk_buff *igb_run_xdp(struct igb_adapter *adapter,
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8610 struct igb_ring *rx_ring,
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8611 struct xdp_buff *xdp)
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8612 {
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8613 int err, result = IGB_XDP_PASS;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8614 struct bpf_prog *xdp_prog;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8615 u32 act;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8616
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8617 xdp_prog = READ_ONCE(rx_ring->xdp_prog);
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8618
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8619 if (!xdp_prog)
9cbc948b5a20c9 Sven Auhagen 2020-09-02 @8620 goto xdp_out;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8621
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8622 prefetchw(xdp->data_hard_start); /* xdp_frame write */
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8623
d568b111738dbb Sebastian Andrzej Siewior 2023-12-15 8624 guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8625 act = bpf_prog_run_xdp(xdp_prog, xdp);
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8626 switch (act) {
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8627 case XDP_PASS:
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8628 break;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8629 case XDP_TX:
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8630 result = igb_xdp_xmit_back(adapter, xdp);
74431c40b9c5fa Magnus Karlsson 2021-05-10 8631 if (result == IGB_XDP_CONSUMED)
74431c40b9c5fa Magnus Karlsson 2021-05-10 8632 goto out_failure;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8633 break;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8634 case XDP_REDIRECT:
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8635 err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
74431c40b9c5fa Magnus Karlsson 2021-05-10 8636 if (err)
74431c40b9c5fa Magnus Karlsson 2021-05-10 8637 goto out_failure;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8638 result = IGB_XDP_REDIR;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8639 break;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8640 default:
c8064e5b4adac5 Paolo Abeni 2021-11-30 8641 bpf_warn_invalid_xdp_action(adapter->netdev, xdp_prog, act);
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8642 fallthrough;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8643 case XDP_ABORTED:
74431c40b9c5fa Magnus Karlsson 2021-05-10 8644 out_failure:
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8645 trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8646 fallthrough;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8647 case XDP_DROP:
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8648 result = IGB_XDP_CONSUMED;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8649 break;
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8650 }
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8651 xdp_out:
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8652 return ERR_PTR(-result);
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8653 }
9cbc948b5a20c9 Sven Auhagen 2020-09-02 8654
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
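guard() declares a variable with __attribute__((cleanup)), and the
compiler rejects a goto that jumps forward over that initialisation
within the same scope, which is what the error above reports. One
possible way around it, sketched here only as an illustration (the
actual fix may differ), is scoped_guard(), which keeps the cleanup
variable in a nested block that the early goto never crosses:

        if (!xdp_prog)
                goto xdp_out;

        prefetchw(xdp->data_hard_start); /* xdp_frame write */

        scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
                act = bpf_prog_run_xdp(xdp_prog, xdp);
                /* the switch (act) verdict handling, including the
                 * out_failure label, stays inside the locked scope
                 */
        }
xdp_out:
        return ERR_PTR(-result);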
Hi Sebastian,
kernel test robot noticed the following build errors:
[auto build test ERROR on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
base: net-next/main
patch link: https://lore.kernel.org/r/20231215171020.687342-17-bigeasy%40linutronix.de
patch subject: [PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.
config: x86_64-rhel-8.3-bpf (https://download.01.org/0day-ci/archive/20231217/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231217/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
>> drivers/net/hyperv/netvsc_bpf.c:53:3: error: cannot jump from this goto statement to its label
goto out;
^
drivers/net/hyperv/netvsc_bpf.c:61:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
^
include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
CLASS(_name, __UNIQUE_ID(guard))
^
include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
^
include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
#define __PASTE(a,b) ___PASTE(a,b)
^
include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
#define ___PASTE(a,b) a##b
^
<scratch space>:81:1: note: expanded from here
__UNIQUE_ID_guard635
^
drivers/net/hyperv/netvsc_bpf.c:46:3: error: cannot jump from this goto statement to its label
goto out;
^
drivers/net/hyperv/netvsc_bpf.c:61:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
^
include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
CLASS(_name, __UNIQUE_ID(guard))
^
include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
^
include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
#define __PASTE(a,b) ___PASTE(a,b)
^
include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
#define ___PASTE(a,b) a##b
^
<scratch space>:81:1: note: expanded from here
__UNIQUE_ID_guard635
^
drivers/net/hyperv/netvsc_bpf.c:41:3: error: cannot jump from this goto statement to its label
goto out;
^
drivers/net/hyperv/netvsc_bpf.c:61:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
^
include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
CLASS(_name, __UNIQUE_ID(guard))
^
include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
^
include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
#define __PASTE(a,b) ___PASTE(a,b)
^
include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
#define ___PASTE(a,b) a##b
^
<scratch space>:81:1: note: expanded from here
__UNIQUE_ID_guard635
^
3 errors generated.
vim +53 drivers/net/hyperv/netvsc_bpf.c
351e1581395fcc Haiyang Zhang 2020-01-23 23
351e1581395fcc Haiyang Zhang 2020-01-23 24 u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan,
351e1581395fcc Haiyang Zhang 2020-01-23 25 struct xdp_buff *xdp)
351e1581395fcc Haiyang Zhang 2020-01-23 26 {
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 27 struct netvsc_stats_rx *rx_stats = &nvchan->rx_stats;
351e1581395fcc Haiyang Zhang 2020-01-23 28 void *data = nvchan->rsc.data[0];
351e1581395fcc Haiyang Zhang 2020-01-23 29 u32 len = nvchan->rsc.len[0];
351e1581395fcc Haiyang Zhang 2020-01-23 30 struct page *page = NULL;
351e1581395fcc Haiyang Zhang 2020-01-23 31 struct bpf_prog *prog;
351e1581395fcc Haiyang Zhang 2020-01-23 32 u32 act = XDP_PASS;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 33 bool drop = true;
351e1581395fcc Haiyang Zhang 2020-01-23 34
351e1581395fcc Haiyang Zhang 2020-01-23 35 xdp->data_hard_start = NULL;
351e1581395fcc Haiyang Zhang 2020-01-23 36
351e1581395fcc Haiyang Zhang 2020-01-23 37 rcu_read_lock();
351e1581395fcc Haiyang Zhang 2020-01-23 38 prog = rcu_dereference(nvchan->bpf_prog);
351e1581395fcc Haiyang Zhang 2020-01-23 39
351e1581395fcc Haiyang Zhang 2020-01-23 40 if (!prog)
351e1581395fcc Haiyang Zhang 2020-01-23 41 goto out;
351e1581395fcc Haiyang Zhang 2020-01-23 42
505e3f00c3f364 Andrea Parri (Microsoft 2021-01-14 43) /* Ensure that the below memcpy() won't overflow the page buffer. */
505e3f00c3f364 Andrea Parri (Microsoft 2021-01-14 44) if (len > ndev->mtu + ETH_HLEN) {
505e3f00c3f364 Andrea Parri (Microsoft 2021-01-14 45) act = XDP_DROP;
505e3f00c3f364 Andrea Parri (Microsoft 2021-01-14 46) goto out;
505e3f00c3f364 Andrea Parri (Microsoft 2021-01-14 47) }
505e3f00c3f364 Andrea Parri (Microsoft 2021-01-14 48)
351e1581395fcc Haiyang Zhang 2020-01-23 49 /* allocate page buffer for data */
351e1581395fcc Haiyang Zhang 2020-01-23 50 page = alloc_page(GFP_ATOMIC);
351e1581395fcc Haiyang Zhang 2020-01-23 51 if (!page) {
351e1581395fcc Haiyang Zhang 2020-01-23 52 act = XDP_DROP;
351e1581395fcc Haiyang Zhang 2020-01-23 @53 goto out;
351e1581395fcc Haiyang Zhang 2020-01-23 54 }
351e1581395fcc Haiyang Zhang 2020-01-23 55
43b5169d8355cc Lorenzo Bianconi 2020-12-22 56 xdp_init_buff(xdp, PAGE_SIZE, &nvchan->xdp_rxq);
be9df4aff65f18 Lorenzo Bianconi 2020-12-22 57 xdp_prepare_buff(xdp, page_address(page), NETVSC_XDP_HDRM, len, false);
351e1581395fcc Haiyang Zhang 2020-01-23 58
351e1581395fcc Haiyang Zhang 2020-01-23 59 memcpy(xdp->data, data, len);
351e1581395fcc Haiyang Zhang 2020-01-23 60
31dbfc0f055c7d Sebastian Andrzej Siewior 2023-12-15 61 guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
351e1581395fcc Haiyang Zhang 2020-01-23 62 act = bpf_prog_run_xdp(prog, xdp);
351e1581395fcc Haiyang Zhang 2020-01-23 63
351e1581395fcc Haiyang Zhang 2020-01-23 64 switch (act) {
351e1581395fcc Haiyang Zhang 2020-01-23 65 case XDP_PASS:
351e1581395fcc Haiyang Zhang 2020-01-23 66 case XDP_TX:
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 67 drop = false;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 68 break;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 69
351e1581395fcc Haiyang Zhang 2020-01-23 70 case XDP_DROP:
351e1581395fcc Haiyang Zhang 2020-01-23 71 break;
351e1581395fcc Haiyang Zhang 2020-01-23 72
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 73 case XDP_REDIRECT:
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 74 if (!xdp_do_redirect(ndev, xdp, prog)) {
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 75 nvchan->xdp_flush = true;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 76 drop = false;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 77
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 78 u64_stats_update_begin(&rx_stats->syncp);
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 79
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 80 rx_stats->xdp_redirect++;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 81 rx_stats->packets++;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 82 rx_stats->bytes += nvchan->rsc.pktlen;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 83
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 84 u64_stats_update_end(&rx_stats->syncp);
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 85
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 86 break;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 87 } else {
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 88 u64_stats_update_begin(&rx_stats->syncp);
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 89 rx_stats->xdp_drop++;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 90 u64_stats_update_end(&rx_stats->syncp);
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 91 }
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 92
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 93 fallthrough;
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 94
351e1581395fcc Haiyang Zhang 2020-01-23 95 case XDP_ABORTED:
351e1581395fcc Haiyang Zhang 2020-01-23 96 trace_xdp_exception(ndev, prog, act);
351e1581395fcc Haiyang Zhang 2020-01-23 97 break;
351e1581395fcc Haiyang Zhang 2020-01-23 98
351e1581395fcc Haiyang Zhang 2020-01-23 99 default:
c8064e5b4adac5 Paolo Abeni 2021-11-30 100 bpf_warn_invalid_xdp_action(ndev, prog, act);
351e1581395fcc Haiyang Zhang 2020-01-23 101 }
351e1581395fcc Haiyang Zhang 2020-01-23 102
351e1581395fcc Haiyang Zhang 2020-01-23 103 out:
351e1581395fcc Haiyang Zhang 2020-01-23 104 rcu_read_unlock();
351e1581395fcc Haiyang Zhang 2020-01-23 105
1cb9d3b6185b2a Haiyang Zhang 2022-04-07 106 if (page && drop) {
351e1581395fcc Haiyang Zhang 2020-01-23 107 __free_page(page);
351e1581395fcc Haiyang Zhang 2020-01-23 108 xdp->data_hard_start = NULL;
351e1581395fcc Haiyang Zhang 2020-01-23 109 }
351e1581395fcc Haiyang Zhang 2020-01-23 110
351e1581395fcc Haiyang Zhang 2020-01-23 111 return act;
351e1581395fcc Haiyang Zhang 2020-01-23 112 }
351e1581395fcc Haiyang Zhang 2020-01-23 113
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
> The per-CPU variables used during bpf_prog_run_xdp() invocation and later
> during xdp_do_redirect() rely on disabled BH for their protection.
> Without locking in local_bh_disable() on PREEMPT_RT these data structures
> require explicit locking.
>
> This is a follow-up on the previous change which introduced
> bpf_run_lock.redirect_lock and uses it now within drivers.
>
> The simple way is to acquire the lock before bpf_prog_run_xdp() is invoked and
> hold it until the end of the function.
> This does not always work because some drivers (cpsw, atlantic) invoke
> xdp_do_flush() in the same context.
> Acquiring the lock in bpf_prog_run_xdp() and dropping it in
> xdp_do_redirect() (without touching drivers) does not work because not all
> drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and invoke
> xdp_do_redirect()).
>
> Ideally the minimal locking scope would be bpf_prog_run_xdp() +
> xdp_do_redirect() and everything else (error recovery, DMA unmapping, free/
> alloc of memory, …) would happen outside of the locked section.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Hi Sebastian,
I would like to make sure I understand correctly the difference in this patch
between ena and atlantic drivers.
In the atlantic driver the change you've made seems like the best change
in terms of making the critical section as small as possible.
You could have done exactly the same thing with ena, but you chose instead
to let ena release the lock at the end of the function, which in case of an XDP_TX
may make the critical section considerably longer than in the atlantic solution.
If I understand correctly (quote from your commit message "This does not
always work because some drivers (cpsw, atlantic) invoke xdp_do_flush()
in the same context"), in the case of atlantic you had to go for the more
code-altering change, because if you simply used guard() you would include
the xdp_do_flush() in the critical section, but in the case of ena xdp_do_flush()
is called after the function ends so guard is good enough.
Questions:
1. Did I understand correctly the difference in solution choice between atlantic
and ena?
2. As far as I can see the guard() solution looks good for ena except for (maybe?)
XDP_TX, where the critical section becomes a bit long. Can you please explain,
why you think it is still good enough for ena to use the guard() solution instead
of doing the more code-altering atlantic solution?
Thanks!
Arthur
On Fri, 2023-12-15 at 18:07 +0100, Sebastian Andrzej Siewior wrote:
> The access to seg6_bpf_srh_states is protected by disabling preemption.
> Based on the code, the entry point is input_action_end_bpf() and
> every other function (the bpf helper functions bpf_lwt_seg6_*()), that
> is accessing seg6_bpf_srh_states, should be called from within
> input_action_end_bpf().
>
> input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
> the function and then disables preemption. This looks wrong because if
> preemption needs to be disabled as part of the locking mechanism then
> the variable shouldn't be accessed beforehand.
>
> Looking at how it is used via test_lwt_seg6local.sh then
> input_action_end_bpf() is always invoked from softirq context. If this
> is always the case then the preempt_disable() statement is superfluous.
> If this is not always invoked from softirq then disabling only
> preemption is not sufficient.
>
> Replace the preempt_disable() statement with nested-BH locking. This is
> not an equivalent replacement as it assumes that the invocation of
> input_action_end_bpf() always occurs in softirq context and thus the
> preempt_disable() is superfluous.
> Add a local_lock_t to the data structure and use local_lock_nested_bh() in
> guard notation for locking. Add lockdep_assert_held() to ensure the lock
> is held while the per-CPU variable is referenced in the helper functions.
>
> Cc: Alexei Starovoitov <[email protected]>
> Cc: Andrii Nakryiko <[email protected]>
> Cc: David Ahern <[email protected]>
> Cc: Hao Luo <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: John Fastabend <[email protected]>
> Cc: KP Singh <[email protected]>
> Cc: Martin KaFai Lau <[email protected]>
> Cc: Song Liu <[email protected]>
> Cc: Stanislav Fomichev <[email protected]>
> Cc: Yonghong Song <[email protected]>
> Cc: [email protected]
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> include/net/seg6_local.h | 1 +
> net/core/filter.c | 3 ++
> net/ipv6/seg6_local.c | 59 ++++++++++++++++++++++------------------
> 3 files changed, 36 insertions(+), 27 deletions(-)
>
> diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
> index 3fab9dec2ec45..0f22771359f4c 100644
> --- a/include/net/seg6_local.h
> +++ b/include/net/seg6_local.h
> @@ -20,6 +20,7 @@ extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb);
>
> struct seg6_bpf_srh_state {
> struct ipv6_sr_hdr *srh;
> + local_lock_t bh_lock;
> u16 hdrlen;
> bool valid;
> };
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1737884be52f8..c8013f762524b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6384,6 +6384,7 @@ BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
> void *srh_tlvs, *srh_end, *ptr;
> int srhoff = 0;
>
> + lockdep_assert_held(&srh_state->bh_lock);
> if (srh == NULL)
> return -EINVAL;
>
> @@ -6440,6 +6441,7 @@ BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
> int hdroff = 0;
> int err;
>
> + lockdep_assert_held(&srh_state->bh_lock);
> switch (action) {
> case SEG6_LOCAL_ACTION_END_X:
> if (!seg6_bpf_has_valid_srh(skb))
> @@ -6516,6 +6518,7 @@ BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
> int srhoff = 0;
> int ret;
>
> + lockdep_assert_held(&srh_state->bh_lock);
> if (unlikely(srh == NULL))
> return -EINVAL;
>
> diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
> index 24e2b4b494cb0..ed7278af321a2 100644
> --- a/net/ipv6/seg6_local.c
> +++ b/net/ipv6/seg6_local.c
> @@ -1380,7 +1380,9 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
> return err;
> }
>
> -DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
> +DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states) = {
> + .bh_lock = INIT_LOCAL_LOCK(bh_lock),
> +};
>
> bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
> {
> @@ -1388,6 +1390,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
> this_cpu_ptr(&seg6_bpf_srh_states);
> struct ipv6_sr_hdr *srh = srh_state->srh;
>
> + lockdep_assert_held(&srh_state->bh_lock);
> if (unlikely(srh == NULL))
> return false;
>
> @@ -1408,8 +1411,7 @@ bool seg6_bpf_has_valid_srh(struct sk_buff *skb)
> static int input_action_end_bpf(struct sk_buff *skb,
> struct seg6_local_lwt *slwt)
> {
> - struct seg6_bpf_srh_state *srh_state =
> - this_cpu_ptr(&seg6_bpf_srh_states);
> + struct seg6_bpf_srh_state *srh_state;
> struct ipv6_sr_hdr *srh;
> int ret;
>
> @@ -1420,41 +1422,44 @@ static int input_action_end_bpf(struct sk_buff *skb,
> }
> advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
>
> - /* preempt_disable is needed to protect the per-CPU buffer srh_state,
> - * which is also accessed by the bpf_lwt_seg6_* helpers
> + /* The access to the per-CPU buffer srh_state is protected by running
> + * always in softirq context (with disabled BH). On PREEMPT_RT the
> + * required locking is provided by the following local_lock_nested_bh()
> + * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
> + * bpf_prog_run_save_cb().
> */
> - preempt_disable();
> - srh_state->srh = srh;
> - srh_state->hdrlen = srh->hdrlen << 3;
> - srh_state->valid = true;
> + scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
> + srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
> + srh_state->srh = srh;
> + srh_state->hdrlen = srh->hdrlen << 3;
> + srh_state->valid = true;
Here the 'scoped_guard' usage adds a lot of noise to the patch, due to
the added indentation. What about using directly
local_lock_nested_bh()/local_unlock_nested_bh() ?
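For reference, the shape being suggested would be roughly this (untested sketch,
field names taken from the hunk above; the elided middle is the existing BPF
invocation):

	local_lock_nested_bh(&seg6_bpf_srh_states.bh_lock);
	srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
	srh_state->srh = srh;
	srh_state->hdrlen = srh->hdrlen << 3;
	srh_state->valid = true;
	/* ... run the BPF program and post-process srh_state ... */
	local_unlock_nested_bh(&seg6_bpf_srh_states.bh_lock);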
Cheers,
Paolo
Hi Sebastian,
On 12/15/23 6:07 PM, Sebastian Andrzej Siewior wrote:
> The per-CPU variables used during bpf_prog_run_xdp() invocation and
> later during xdp_do_redirect() rely on disabled BH for their protection.
> Without locking in local_bh_disable() on PREEMPT_RT these data structures
> require explicit locking.
>
> This is a follow-up on the previous change which introduced
> bpf_run_lock.redirect_lock and uses it now within drivers.
>
> The simple way is to acquire the lock before bpf_prog_run_xdp() is
> invoked and hold it until the end of the function.
> This does not always work because some drivers (cpsw, atlantic) invoke
> xdp_do_flush() in the same context.
> Acquiring the lock in bpf_prog_run_xdp() and dropping it in
> xdp_do_redirect() (without touching drivers) does not work because not
> all drivers that use bpf_prog_run_xdp() support XDP_REDIRECT (and
> invoke xdp_do_redirect()).
>
> Ideally the minimal locking scope would be bpf_prog_run_xdp() +
> xdp_do_redirect() and everything else (error recovery, DMA unmapping,
> free/ alloc of memory, …) would happen outside of the locked section.
[...]
> drivers/net/hyperv/netvsc_bpf.c | 1 +
> drivers/net/netkit.c | 13 +++++++----
> drivers/net/tun.c | 28 +++++++++++++----------
> drivers/net/veth.c | 40 ++++++++++++++++++++-------------
> drivers/net/virtio_net.c | 1 +
> drivers/net/xen-netfront.c | 1 +
> 6 files changed, 52 insertions(+), 32 deletions(-)
[...]
Please exclude netkit from this set given it does not support XDP, but
instead only accepts tc BPF typed programs.
Thanks,
Daniel
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 39171380ccf29..fbcf78477bda8 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -80,8 +80,15 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
> netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
> skb->dev = peer;
> entry = rcu_dereference(nk->active);
> - if (entry)
> - ret = netkit_run(entry, skb, ret);
> + if (entry) {
> + scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
> + ret = netkit_run(entry, skb, ret);
> + if (ret == NETKIT_REDIRECT) {
> + dev_sw_netstats_tx_add(dev, 1, len);
> + skb_do_redirect(skb);
> + }
> + }
> + }
> switch (ret) {
> case NETKIT_NEXT:
> case NETKIT_PASS:
> @@ -95,8 +102,6 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
> }
> break;
> case NETKIT_REDIRECT:
> - dev_sw_netstats_tx_add(dev, 1, len);
> - skb_do_redirect(skb);
> break;
> case NETKIT_DROP:
> default:
On 2023-12-15 14:50:59 [-0800], Jakub Kicinski wrote:
> On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> > The proposed way out is to introduce explicit per-CPU locks for
> > resources which are protected by local_bh_disable() and use those only
> > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
>
> As I said at LPC, complicating drivers with odd locking constructs
> is a no go for me.
I misunderstood it then, as I assumed you wanted to ease the work while I
worked through every driver after (hopefully) understanding what is
possible/needed and what is not. We are talking here about 15++ drivers?
Now. The pattern is usually
| act = bpf_prog_run_xdp(xdp_prog, &xdp);
| switch (act) {
| case XDP_REDIRECT:
| ret = xdp_do_redirect(netdev, &xdp, xdp_prog);
| if (ret)
| goto XDP_ABORTED;
| xdp_redir++ or so;
so we might be able to turn this into something that covers both and
returns either XDP_REDIRECT or XDP_ABORTED. So this could be merged
into
| u32 bpf_prog_run_xdp_and_redirect(struct net_device *netdev, const struct
| bpf_prog *prog, struct xdp_buff *xdp)
| {
| u32 act;
| int ret;
|
| act = bpf_prog_run_xdp(prog, xdp);
| if (act == XDP_REDIRECT) {
| ret = xdp_do_redirect(netdev, xdp, prog);
| if (ret < 0)
| act = XDP_ABORTED;
| }
| return act;
| }
so the lock can be put inside the function and all drivers use this
function.
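The driver side would then collapse to something like this (hypothetical
sketch; the TX path and buffer freeing stay driver specific):

| 	act = bpf_prog_run_xdp_and_redirect(netdev, prog, &xdp);
| 	switch (act) {
| 	case XDP_PASS:
| 		break;
| 	case XDP_TX:
| 		/* driver specific TX path */
| 		break;
| 	case XDP_REDIRECT:
| 		/* already redirected inside the helper, only account it */
| 		break;
| 	default:
| 		bpf_warn_invalid_xdp_action(netdev, prog, act);
| 		fallthrough;
| 	case XDP_ABORTED:
| 		trace_xdp_exception(netdev, prog, act);
| 		fallthrough;
| 	case XDP_DROP:
| 		/* free/recycle the buffer */
| 		break;
| 	}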
From looking through drivers/net/ethernet/, this should work for most
drivers:
- amazon/ena
- aquantia/atlantic
- engleder/tsnep
- freescale/enetc
- freescale/fec
- intel/igb
- intel/igc
- marvell/mvneta
- marvell/mvpp2
- marvell/octeontx2
- mediatek/mtk
- mellanox/mlx5
- microchip/lan966x
- microsoft/mana
- netronome/nfp (two call paths with no XDP_REDIRECT support)
- sfc/rx
- sfc/siena (that offset pointer can be moved)
- socionext/netsec
- stmicro/stmmac
A few do something custom/additional between bpf_prog_run_xdp() and
xdp_do_redirect():
- broadcom/bnxt
calculates length, offset, data pointer. DMA unmaps + memory
allocations before redirect.
- freescale/dpaa2
- freescale/dpaa
sets xdp.data_hard_start + frame_sz, unmaps DMA.
- fungible/funeth
conditional redirect.
- google/gve
Allocates a new packet for redirect.
- intel/ixgbe
- intel/i40e
- intel/ice
Failure in the ZC case is different from XDP_ABORTED; it depends on the
error from xdp_do_redirect().
- mellanox/mlx4/
calculates page_offset.
- qlogic/qede
DMA unmap and buffer alloc.
- ti/cpsw_priv
recalculates length (pointer).
and a few more don't support XDP_REDIRECT:
- cavium/thunder
does not support XDP_REDIRECT, calculates length, offset.
- intel/ixgbevf
does not support XDP_REDIRECT
I don't understand why some drivers need to recalculate data_hard_start,
length and so on and others don't. This might only be needed for the
XDP_TX case, or not needed at all…
Also I'm not sure about the DMA unmaps and skb allocations. The new skb
allocation can probably be handled before running the bpf prog but then
in the XDP_PASS case it is a waste…
And the DMA unmaps: only a few seem to need it. Maybe it can be done
before running the BPF program. After all, the bpf program may look into the skb.
If that is a no-go, then the only thing that comes to mind is (as you
mentioned at LPC) to acquire the lock in bpf_prog_run_xdp() and drop it
in xdp_do_redirect(). This would require that every driver invokes
xdp_do_redirect() even if it does not support it (by setting netdev
to NULL or so).
Sebastian
On Sat, Dec 16, 2023 at 12:53:43PM +0800, kernel test robot wrote:
> Hi Sebastian,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on net-next/main]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
> base: net-next/main
> patch link: https://lore.kernel.org/r/20231215171020.687342-21-bigeasy%40linutronix.de
> patch subject: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
> config: arm-defconfig (https://download.01.org/0day-ci/archive/20231216/[email protected]/config)
> compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All errors (new ones prefixed by >>):
>
> >> drivers/net/ethernet/intel/igb/igb_main.c:8620:3: error: cannot jump from this goto statement to its label
> goto xdp_out;
> ^
> drivers/net/ethernet/intel/igb/igb_main.c:8624:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
> guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> ^
> include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
> CLASS(_name, __UNIQUE_ID(guard))
> ^
> include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
> #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
> ^
> include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
> #define __PASTE(a,b) ___PASTE(a,b)
> ^
> include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
> #define ___PASTE(a,b) a##b
> ^
> <scratch space>:52:1: note: expanded from here
> __UNIQUE_ID_guard753
> ^
> 1 error generated.
I initially thought that this may have been
https://github.com/ClangBuiltLinux/linux/issues/1886 but asm goto is not
involved here.
This error occurs because jumping over the initialization of a variable
declared with __attribute__((__cleanup__(...))) does not prevent the
clean up function from running as one may expect it to, but could
instead result in the clean up function getting run on uninitialized
memory. A contrived example (see the bottom of the "Output" tabs for the
execution output):
https://godbolt.org/z/9bvGboxvc
While there is a warning from GCC in that example, I don't see one in
the kernel's case. I see there is an open GCC issue around this problem:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91951
While it is possible that there may not actually be a problem with how
the kernel uses __attribute__((__cleanup__(...))) and gotos, I think
clang's behavior is reasonable given the potential footguns that this
construct has.
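To make the failure mode concrete, here is a minimal standalone sketch of my
own (not the godbolt example): with GCC the cleanup handler can still run at
the end of the scope even though the goto skipped the initialization, so
free() may be handed an uninitialized pointer; clang rejects the jump outright.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void freep(void *p)
{
	/* with skip != 0 this can see an uninitialized pointer */
	free(*(char **)p);
}

static int demo(int skip)
{
	if (skip)
		goto out;	/* jumps over the initialization below */

	char *buf __attribute__((cleanup(freep))) = strdup("hello");
	if (buf)
		puts(buf);
out:
	return 0;	/* GCC still emits the cleanup for "buf" at scope exit */
}

int main(void)
{
	return demo(1);
}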
Cheers,
Nathan
On Mon, 18 Dec 2023 18:23:31 +0100 Sebastian Andrzej Siewior wrote:
> On 2023-12-15 14:50:59 [-0800], Jakub Kicinski wrote:
> > On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> > > The proposed way out is to introduce explicit per-CPU locks for
> > > resources which are protected by local_bh_disable() and use those only
> > > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
> >
> > As I said at LPC, complicating drivers with odd locking constructs
> > is a no go for me.
>
> I misunderstood it then, as I assumed you wanted to ease the work while I
> worked through every driver after (hopefully) understanding what is
> possible/needed and what is not. We are talking here about 15++ drivers?
My main concern is that it takes the complexity of writing network
device drivers to a next level. It's already hard enough to implement
XDP correctly. "local lock" and "guard"? Too complicated :(
Or "unmaintainable" as in "too much maintainer's time will be spent
reviewing code that gets this wrong".
> Now. The pattern is usually
> | act = bpf_prog_run_xdp(xdp_prog, &xdp);
> | switch (act) {
> | case XDP_REDIRECT:
> | ret = xdp_do_redirect(netdev, &xdp, xdp_prog);
> | if (ret)
> | goto XDP_ABORTED;
> | xdp_redir++ or so;
>
> so we might be able to turn this into something that covers both and
> returns either XDP_REDIRECT or XDP_ABORTED. So this could be merged
> into
>
> | u32 bpf_prog_run_xdp_and_redirect(struct net_device *netdev, const struct
> | bpf_prog *prog, struct xdp_buff *xdp)
> | {
> | u32 act;
> | int ret;
> |
> | act = bpf_prog_run_xdp(prog, xdp);
> | if (act == XDP_REDIRECT) {
> | ret = xdp_do_redirect(netdev, xdp, prog);
> | if (ret < 0)
> | act = XDP_ABORTED;
> | }
> | return act;
> | }
If we could fold the DROP case into this -- even better!
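Something along these lines could fold the DROP/ABORTED handling in as well
(rough sketch extending the helper above, not from the series; freeing the
buffer stays in the driver):

| u32 bpf_prog_run_xdp_and_redirect(struct net_device *netdev,
| 				   const struct bpf_prog *prog,
| 				   struct xdp_buff *xdp)
| {
| 	u32 act;
|
| 	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
| 	act = bpf_prog_run_xdp(prog, xdp);
| 	switch (act) {
| 	case XDP_REDIRECT:
| 		if (!xdp_do_redirect(netdev, xdp, prog))
| 			break;
| 		act = XDP_ABORTED;
| 		fallthrough;
| 	case XDP_ABORTED:
| 		trace_xdp_exception(netdev, prog, act);
| 		act = XDP_DROP;
| 		break;
| 	case XDP_PASS:
| 	case XDP_TX:
| 	case XDP_DROP:
| 		break;
| 	default:
| 		bpf_warn_invalid_xdp_action(netdev, prog, act);
| 		act = XDP_DROP;
| 		break;
| 	}
| 	return act;
| }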
> so the lock can be put inside the function and all drivers use this
> function.
>
> From looking through drivers/net/ethernet/, this should work for most
> drivers:
> - amazon/ena
> - aquantia/atlantic
> - engleder/tsnep
> - freescale/enetc
> - freescale/fec
> - intel/igb
> - intel/igc
> - marvell/mvneta
> - marvell/mvpp2
> - marvell/octeontx2
> - mediatek/mtk
> - mellanox/mlx5
> - microchip/lan966x
> - microsoft/mana
> - netronome/nfp (two call paths with no XDP_REDIRECT support)
> - sfc/rx
> - sfc/siena (that offset pointer can be moved)
> - socionext/netsec
> - stmicro/stmmac
>
> A few do something custom/additional between bpf_prog_run_xdp() and
> xdp_do_redirect():
>
> - broadcom/bnxt
> calculates length, offset, data pointer. DMA unmaps + memory
> allocations before redirect.
Just looked at this one. The recalculation is probably for the PASS /
TX cases, REDIRECT / DROP shouldn't care. The DMA unmap looks like
a bug (hi, Michael!)
> - freescale/dpaa2
> - freescale/dpaa
> sets xdp.data_hard_start + frame_sz, unmaps DMA.
>
> - fungible/funeth
> conditional redirect.
>
> - google/gve
> Allocates a new packet for redirect.
>
> - intel/ixgbe
> - intel/i40e
> - intel/ice
> Failure in the ZC case is different from XDP_ABORTED; it depends on the
> error from xdp_do_redirect().
>
> - mellanox/mlx4/
> calculates page_offset.
>
> - qlogic/qede
> DMA unmap and buffer alloc.
>
> - ti/cpsw_priv
> recalculates length (pointer).
>
> and a few more don't support XDP_REDIRECT:
>
> - cavium/thunder
> does not support XDP_REDIRECT, calculates length, offset.
>
> - intel/ixgbevf
> does not support XDP_REDIRECT
>
> I don't understand why some drivers need to recalculate data_hard_start,
> length and so on and others don't. This might only be needed for the
> XDP_TX case, or not needed at all…
> Also I'm not sure about the DMA unmaps and skb allocations. The new skb
> allocation can probably be handled before running the bpf prog but then
> in the XDP_PASS case it is a waste…
> And the DMA unmaps: only a few seem to need it. Maybe it can be done
> before running the BPF program. After all, the bpf program may look into the skb.
>
>
> If that is a no-go, then the only thing that comes to mind is (as you
> mentioned at LPC) to acquire the lock in bpf_prog_run_xdp() and drop it
> in xdp_do_redirect(). This would require that every driver invokes
> xdp_do_redirect() even if it does not support it (by setting netdev
> to NULL or so).
To make progress on other parts of the stack we could also take
the local lock around all of napi->poll() for now..
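In net/core/dev.c that would be roughly the following (sketch only, reusing
the bpf_run_lock from this series):

| 	/* in __napi_poll(), instead of locking in each driver */
| 	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock)
| 		work = n->poll(n, weight);

Drivers would stay untouched, at the price of holding the lock for a whole
poll budget.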
On Mon, Dec 18, 2023 at 4:01 PM Nathan Chancellor <[email protected]> wrote:
>
> On Sat, Dec 16, 2023 at 12:53:43PM +0800, kernel test robot wrote:
> > Hi Sebastian,
> >
> > kernel test robot noticed the following build errors:
> >
> > [auto build test ERROR on net-next/main]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/locking-local_lock-Introduce-guard-definition-for-local_lock/20231216-011911
> > base: net-next/main
> > patch link: https://lore.kernel.org/r/20231215171020.687342-21-bigeasy%40linutronix.de
> > patch subject: [PATCH net-next 20/24] net: intel: Use nested-BH locking for XDP redirect.
> > config: arm-defconfig (https://download.01.org/0day-ci/archive/20231216/[email protected]/config)
> > compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/[email protected]/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <[email protected]>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
> >
> > All errors (new ones prefixed by >>):
> >
> > >> drivers/net/ethernet/intel/igb/igb_main.c:8620:3: error: cannot jump from this goto statement to its label
> > goto xdp_out;
> > ^
^ The problematic goto should be replaced with an early return. (and
perhaps a comment that you can't jump over __cleanup variable
initialization).
Otherwise the compiler cannot put the cleanup in the destination basic
block; it would have to split the edges and have all the happy paths
go to a synthesized basic block that runs the cleanup, then jumps to
the original destination.
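i.e. the shape of the fix would be something like this (generic sketch with
hypothetical names, not the actual igb code):

| 	prog = READ_ONCE(rx_ring->xdp_prog);	/* hypothetical */
| 	if (!prog)
| 		return XDP_PASS;	/* early return instead of "goto xdp_out" */
|
| 	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
| 	act = bpf_prog_run_xdp(prog, xdp);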
> > drivers/net/ethernet/intel/igb/igb_main.c:8624:2: note: jump bypasses initialization of variable with __attribute__((cleanup))
> > guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> > ^
> > include/linux/cleanup.h:142:15: note: expanded from macro 'guard'
> > CLASS(_name, __UNIQUE_ID(guard))
> > ^
> > include/linux/compiler.h:180:29: note: expanded from macro '__UNIQUE_ID'
> > #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
> > ^
> > include/linux/compiler_types.h:84:22: note: expanded from macro '__PASTE'
> > #define __PASTE(a,b) ___PASTE(a,b)
> > ^
> > include/linux/compiler_types.h:83:23: note: expanded from macro '___PASTE'
> > #define ___PASTE(a,b) a##b
> > ^
> > <scratch space>:52:1: note: expanded from here
> > __UNIQUE_ID_guard753
> > ^
> > 1 error generated.
>
> I initially thought that this may have been
> https://github.com/ClangBuiltLinux/linux/issues/1886 but asm goto is not
> involved here.
>
> This error occurs because jumping over the initialization of a variable
> declared with __attribute__((__cleanup__(...))) does not prevent the
> clean up function from running as one may expect it to, but could
> instead result in the clean up function getting run on uninitialized
> memory. A contrived example (see the bottom of the "Output" tabs for the
> execution output):
>
> https://godbolt.org/z/9bvGboxvc
>
> While there is a warning from GCC in that example, I don't see one in
> the kernel's case. I see there is an open GCC issue around this problem:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91951
>
> While it is possible that there may not actually be a problem with how
> the kernel uses __attribute__((__cleanup__(...))) and gotos, I think
> clang's behavior is reasonable given the potential footguns that this
> construct has.
>
> Cheers,
> Nathan
>
--
Thanks,
~Nick Desaulniers
On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior
<[email protected]> wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5a0f6da7b3ae5..5ba7509e88752 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> *pt_prev = NULL;
> }
>
> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> qdisc_skb_cb(skb)->pkt_len = skb->len;
> tcx_set_ingress(skb, true);
>
> @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> if (!entry)
> return skb;
>
> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
> * already set by the caller.
> */
> @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
> u32 act;
> int err;
>
> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
> if (act != XDP_PASS) {
> switch (act) {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7c9653734fb60..72a7812f933a1 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
> */
> void xdp_do_flush(void)
> {
> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> __dev_flush();
> __cpu_map_flush();
> __xsk_map_flush();
> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> index a94943681e5aa..74b88e897a7e3 100644
> --- a/net/core/lwt_bpf.c
> +++ b/net/core/lwt_bpf.c
> @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> * BPF prog and skb_do_redirect().
> */
> local_bh_disable();
> + local_lock_nested_bh(&bpf_run_lock.redirect_lock);
> bpf_compute_data_pointers(skb);
> ret = bpf_prog_run_save_cb(lwt->prog, skb);
>
> @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> break;
> }
>
> + local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
> local_bh_enable();
>
> return ret;
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 1976bd1639863..da61b99bc558f 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -23,6 +23,7 @@
> #include <linux/jhash.h>
> #include <linux/rculist.h>
> #include <linux/rhashtable.h>
> +#include <linux/bpf.h>
> #include <net/net_namespace.h>
> #include <net/sock.h>
> #include <net/netlink.h>
> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>
> fl = rcu_dereference_bh(qe->filter_chain);
>
> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
> case TC_ACT_SHOT:
> qdisc_qstats_drop(sch);
Here and in all other places this patch adds locks that
will kill performance of XDP, tcx and everything else in networking.
I'm surprised Jesper and other folks are not jumping in with nacks.
We measure performance in nanoseconds here.
Extra lock is no go.
Please find a different way without ruining performance.
On 2023-12-18 16:41:42 [-0800], Jakub Kicinski wrote:
>
> To make progress on other parts of the stack we could also take
> the local lock around all of napi->poll() for now..
Okay. Thank you. I will look into this next year again.
Sebastian
Alexei Starovoitov <[email protected]> writes:
> On Fri, Dec 15, 2023 at 9:10 AM Sebastian Andrzej Siewior
> <[email protected]> wrote:
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 5a0f6da7b3ae5..5ba7509e88752 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3993,6 +3993,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>> *pt_prev = NULL;
>> }
>>
>> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> qdisc_skb_cb(skb)->pkt_len = skb->len;
>> tcx_set_ingress(skb, true);
>>
>> @@ -4045,6 +4046,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>> if (!entry)
>> return skb;
>>
>> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> /* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
>> * already set by the caller.
>> */
>> @@ -5008,6 +5010,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
>> u32 act;
>> int err;
>>
>> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
>> if (act != XDP_PASS) {
>> switch (act) {
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 7c9653734fb60..72a7812f933a1 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -4241,6 +4241,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>> */
>> void xdp_do_flush(void)
>> {
>> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> __dev_flush();
>> __cpu_map_flush();
>> __xsk_map_flush();
>> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
>> index a94943681e5aa..74b88e897a7e3 100644
>> --- a/net/core/lwt_bpf.c
>> +++ b/net/core/lwt_bpf.c
>> @@ -44,6 +44,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>> * BPF prog and skb_do_redirect().
>> */
>> local_bh_disable();
>> + local_lock_nested_bh(&bpf_run_lock.redirect_lock);
>> bpf_compute_data_pointers(skb);
>> ret = bpf_prog_run_save_cb(lwt->prog, skb);
>>
>> @@ -76,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
>> break;
>> }
>>
>> + local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
>> local_bh_enable();
>>
>> return ret;
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 1976bd1639863..da61b99bc558f 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -23,6 +23,7 @@
>> #include <linux/jhash.h>
>> #include <linux/rculist.h>
>> #include <linux/rhashtable.h>
>> +#include <linux/bpf.h>
>> #include <net/net_namespace.h>
>> #include <net/sock.h>
>> #include <net/netlink.h>
>> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>>
>> fl = rcu_dereference_bh(qe->filter_chain);
>>
>> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>> case TC_ACT_SHOT:
>> qdisc_qstats_drop(sch);
>
> Here and in all other places this patch adds locks that
> will kill performance of XDP, tcx and everything else in networking.
>
> I'm surprised Jesper and other folks are not jumping in with nacks.
> We measure performance in nanoseconds here.
> Extra lock is no go.
> Please find a different way without ruining performance.
I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
believe there are people who are using XDP on PREEMPT_RT kernels and
still expect decent performance. And to achieve that it is absolutely
imperative that we can amortise expensive operations (such as locking)
over multiple packets.
I realise there's a fundamental trade-off between the amount of
amortisation and the latency hit that we take from holding locks for
longer, but tuning the batch size (while still keeping some amount of
batching) may be a way forward? I suppose Jakub's suggestion in the
other part of the thread, of putting the locks around napi->poll(), is a
step towards something like this.
-Toke
On 2023-12-18 09:33:39 [+0100], Paolo Abeni wrote:
> > --- a/net/ipv6/seg6_local.c
> > +++ b/net/ipv6/seg6_local.c
> > @@ -1420,41 +1422,44 @@ static int input_action_end_bpf(struct sk_buff *skb,
> > }
> > advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
> >
> > - /* preempt_disable is needed to protect the per-CPU buffer srh_state,
> > - * which is also accessed by the bpf_lwt_seg6_* helpers
> > + /* The access to the per-CPU buffer srh_state is protected by running
> > + * always in softirq context (with disabled BH). On PREEMPT_RT the
> > + * required locking is provided by the following local_lock_nested_bh()
> > + * statement. It is also accessed by the bpf_lwt_seg6_* helpers via
> > + * bpf_prog_run_save_cb().
> > */
> > - preempt_disable();
> > - srh_state->srh = srh;
> > - srh_state->hdrlen = srh->hdrlen << 3;
> > - srh_state->valid = true;
> > + scoped_guard(local_lock_nested_bh, &seg6_bpf_srh_states.bh_lock) {
> > + srh_state = this_cpu_ptr(&seg6_bpf_srh_states);
> > + srh_state->srh = srh;
> > + srh_state->hdrlen = srh->hdrlen << 3;
> > + srh_state->valid = true;
>
> Here the 'scoped_guard' usage adds a lot of noise to the patch, due to
> the added indentation. What about using directly
> local_lock_nested_bh()/local_unlock_nested_bh() ?
If this is preferred, sure.
> Cheers,
>
> Paolo
Sebastian
On 2023-12-18 09:52:05 [+0100], Daniel Borkmann wrote:
> Hi Sebastian,
Hi Daniel,
> Please exclude netkit from this set given it does not support XDP, but
> instead only accepts tc BPF typed programs.
okay, thank you.
> Thanks,
> Daniel
Sebastian
On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <[email protected]> writes:
>
> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
> >>
> >> fl = rcu_dereference_bh(qe->filter_chain);
> >>
> >> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
> >> switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
> >> case TC_ACT_SHOT:
> >> qdisc_qstats_drop(sch);
> >
> > Here and in all other places this patch adds locks that
> > will kill performance of XDP, tcx and everything else in networking.
> >
> > I'm surprised Jesper and other folks are not jumping in with nacks.
> > We measure performance in nanoseconds here.
> > Extra lock is no go.
> > Please find a different way without ruining performance.
>
> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
> believe there are people who are using XDP on PREEMPT_RT kernels and
> still expect decent performance. And to achieve that it is absolutely
> imperative that we can amortise expensive operations (such as locking)
> over multiple packets.
>
> I realise there's a fundamental trade-off between the amount of
> amortisation and the latency hit that we take from holding locks for
> longer, but tuning the batch size (while still keeping some amount of
> batching) may be a way forward? I suppose Jakub's suggestion in the
> other part of the thread, of putting the locks around napi->poll(), is a
> step towards something like this.
The RT requirements are usually different. Networking as in CAN might be
important but Ethernet could be used only for remote communication and so
be "not" important. People complained that they need to wait for Ethernet
to be done before the CAN packet can be injected into the stack.
With that expectation you would like to pause Ethernet immediately and
switch over to the CAN interrupt thread.
But if someone managed to setup XDP then it is likely to be important.
With RT traffic it is usually not the throughput that matters but the
latency. You are likely in the position to receive a packet, say every
1ms, and need to respond immediately. XDP would be used to inspect the
packet and either hand it over to the stack or process it.
I expected the lock operation (under RT) to always succeed and not
cause any delay because it should not be contended. It should only
block if something with higher priority preempted the current interrupt
thread _and_ also happen to use XDP on the same CPU. In that case (XDP
is needed) it would flush the current user out of the locked section
before the higher-prio thread could take over. Doing bulk and allowing
the low-priority thread to complete would delay the high-priority
thread. Maybe I am too pessimistic here and having two XDP programs on
one CPU is unlikely to happen.
Adding the lock on a per-NAPI basis would allow batching packets.
Acquiring the lock only if XDP is supported would not block the CAN
drivers since they don't support XDP. But it sounds like a hack.
Daniel said netkit doesn't need this locking because it is not
supporting this redirect and it made me think. Would it work to make
the redirect structures part of the bpf_prog-structure instead of
per-CPU? My understanding is that an eBPF program's data structures are
part of it and contain locking, allowing one eBPF program to preempt
another one.
Having the redirect structures part of the program would obsolete
locking. Do I miss anything?
> -Toke
Sebastian
On 2023-12-16 22:09:07 [+0000], Kiyanovski, Arthur wrote:
> Hi Sebastian,
Arthur,
> I would like to make sure I understand correctly the difference in this patch
> between ena and atlantic drivers.
>
> In the atlantic driver the change you've made seems like the best change
> in terms of making the critical section as small as possible.
>
> You could have done exactly the same thing with ena, but you chose instead
> to let ena release the lock at the end of the function, which in case of an XDP_TX
> may make the critical section considerably longer than in the atlantic solution.
>
> If I understand correctly (quote from your commit message "This does not
> always work because some drivers (cpsw, atlantic) invoke xdp_do_flush()
> in the same context"), in the case of atlantic you had to go for the more
> code-altering change, because if you simply used guard() you would include
> the xdp_do_flush() in the critical section, but in the case of ena xdp_do_flush()
> is called after the function ends so guard is good enough.
>
> Questions:
> 1. Did I understand correctly the difference in solution choice between atlantic
> and ena?
Yes. I could have moved the "XDP_REDIRECT" case right after
bpf_prog_run_xdp() and use "scope guard" to make it slim in the ena
driver. I just did it "this" way because it was simpler; I did not want to
spend unnecessary cycles on it, especially if I have to maintain it for a
few releases.
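Roughly like this for ena (hypothetical sketch, not the posted patch):

| 	scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
| 		verdict = bpf_prog_run_xdp(xdp_prog, xdp);
| 		if (verdict == XDP_REDIRECT &&
| 		    xdp_do_redirect(xdp->rxq->dev, xdp, xdp_prog))
| 			verdict = XDP_ABORTED;
| 	}
| 	/* XDP_TX / XDP_PASS / XDP_DROP handled outside the locked section */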
> 2. As far as I can see the guard() solution looks good for ena except for (maybe?)
> XDP_TX, where the critical section becomes a bit long. Can you please explain,
> why you think it is still good enough for ena to use the guard() solution instead
> of doing the more code-altering atlantic solution?
Well, it was simpler/quicker. If this approach had been accepted
and this long section turned out to be a problem, then it could have been
shortened afterwards. Maybe another function/method could be introduced since
this pattern fits ~90% of all drivers.
However, it looks like touching all drivers is not what we want, so
avoiding spending a lot of cycles on it in the first place wasn't that
bad. (Also, it took three iterations until I got all the details right.)
> Thanks!
> Arthur
>
Sebastian
Sebastian Andrzej Siewior <[email protected]> writes:
> On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <[email protected]> writes:
>>
>> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>> >>
>> >> fl = rcu_dereference_bh(qe->filter_chain);
>> >>
>> >> + guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> >> switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>> >> case TC_ACT_SHOT:
>> >> qdisc_qstats_drop(sch);
>> >
>> > Here and in all other places this patch adds locks that
>> > will kill performance of XDP, tcx and everything else in networking.
>> >
>> > I'm surprised Jesper and other folks are not jumping in with nacks.
>> > We measure performance in nanoseconds here.
>> > Extra lock is no go.
>> > Please find a different way without ruining performance.
>>
>> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
>> believe there are people who are using XDP on PREEMPT_RT kernels and
>> still expect decent performance. And to achieve that it is absolutely
>> imperative that we can amortise expensive operations (such as locking)
>> over multiple packets.
>>
>> I realise there's a fundamental trade-off between the amount of
>> amortisation and the latency hit that we take from holding locks for
>> longer, but tuning the batch size (while still keeping some amount of
>> batching) may be a way forward? I suppose Jakub's suggestion in the
>> other part of the thread, of putting the locks around napi->poll(), is a
>> step towards something like this.
>
> The RT requirements are usually different. Networking as in CAN might be
> important but Ethernet could be used only for remote communication and so
> be "not" important. People complained that they need to wait for Ethernet
> to be done before the CAN packet can be injected into the stack.
> With that expectation you would like to pause Ethernet immediately and
> switch over to the CAN interrupt thread.
>
> But if someone managed to setup XDP then it is likely to be important.
> With RT traffic it is usually not the throughput that matters but the
> latency. You are likely in the position to receive a packet, say every
> 1ms, and need to respond immediately. XDP would be used to inspect the
> packet and either hand it over to the stack or process it.
I am not contesting that latency is important, but it's a pretty
fundamental trade-off and we don't want to kill throughput entirely
either. Especially since this is global to the whole kernel; and there
are definitely people who want to use XDP on an RT kernel and still
achieve high PPS rates.
(Whether those people really strictly speaking need to be running an RT
kernel is maybe debatable, but it does happen).
> I expected the lock operation (under RT) to always succeed and not
> cause any delay because it should not be contended.
A lock does cause delay even when it's not contended. Bear in mind that
at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
packet (for 64-byte packets). So just the atomic op to figure out
whether there's any contention (around 10ns on the Intel processors I
usually test on) will blow a huge chunk of the total processing budget.
We can't actually do the full processing needed in those 64 nanoseconds
(not to mention the 6.4 nanoseconds we have available at 100Gbps), which
is why it's essential to amortise as much as we can over multiple
packets.
This is all back-of-the-envelope calculations, of course. Having some
actual numbers to look at would be great; I don't suppose you have a
setup where you can run xdp-bench and see how your patches affect the
throughput?
> It should only block if something with higher priority preempted the
> current interrupt thread _and_ also happens to use XDP on the same CPU.
> In that case (XDP is needed) it would flush the current user out of
> the locked section before the higher-prio thread could take over.
> Doing bulk processing and allowing the low-priority thread to complete would
> delay the high-priority thread. Maybe I am too pessimistic here and
> having two XDP programs on one CPU is unlikely to happen.
>
> Adding the lock on a per-NAPI basis would allow batching packets.
> Acquiring the lock only if XDP is supported would not block the CAN
> drivers since they don't support XDP. But it sounds like a hack.
I chatted with Jesper about this, and he had an idea not too far from
this: split up the XDP and regular stack processing in two stages, each
with their individual batching. So whereas right now we're doing
something like:
run_napi()
bh_disable()
for pkt in budget:
act = run_xdp(pkt)
if (act == XDP_PASS)
run_netstack(pkt) // this is the expensive bit
bh_enable()
We could instead do:
run_napi()
bh_disable()
for pkt in budget:
act = run_xdp(pkt)
if (act == XDP_PASS)
add_to_list(pkt, to_stack_list)
bh_enable()
// sched point
bh_disable()
for pkt in to_stack_list:
run_netstack(pkt)
bh_enable()
This would limit the batching that blocks everything to only the XDP
processing itself, which should limit the maximum time spent in the
blocking state significantly compared to what we have today. The caveat
being that rearranging things like this is potentially a pretty major
refactoring task that needs to touch all the drivers (even if some of
the logic can be moved into the core code in the process). So not really
sure if this approach is feasible, TBH.
> Daniel said netkit doesn't need this locking because it is not
> supporting this redirect and it made me think. Would it work to make
> the redirect structures part of the bpf_prog-structure instead of
> per-CPU? My understanding is that an eBPF program's data structures are
> part of it and contain locking, allowing one eBPF program to preempt
> another one.
> Having the redirect structures part of the program would obsolete
> locking. Do I miss anything?
This won't work, unfortunately: the same XDP program can be attached to
multiple interfaces simultaneously, and for hardware with multiple
receive queues (which is most of the hardware that supports XDP), it can
even run simultaneously on multiple CPUs on the same interface. This is
the reason why this is all being kept in per-CPU variables today.
-Toke
On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
> I am not contesting that latency is important, but it's a pretty
> fundamental trade-off and we don't want to kill throughput entirely
> either. Especially since this is global to the whole kernel; and there
> are definitely people who want to use XDP on an RT kernel and still
> achieve high PPS rates.
>
> (Whether those people really strictly speaking need to be running an RT
> kernel is maybe debatable, but it does happen).
>
> > I expected the lock operation (under RT) to always succeed and not
> > cause any delay because it should not be contended.
>
> A lock does cause delay even when it's not contended. Bear in mind that
> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
> packet (for 64-byte packets). So just the atomic op to figure out
> whether there's any contention (around 10ns on the Intel processors I
> usually test on) will blow a huge chunk of the total processing budget.
> We can't actually do the full processing needed in those 64 nanoseconds
> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
> is why it's essential to amortise as much as we can over multiple
> packets.
>
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?
A potentially stupid idea which I have been turning in my head is
how we could get away from having the driver handle details of NAPI
budgeting. It's a source of bugs and endless review comments.
All drivers end up maintaining a counter of "how many packets have
I processed" and comparing that against the budget. Would it be crazy
if we put that inside napi_struct? Add a "budget" member inside
napi_struct as well, and:
struct napi_struct {
..
// poll state
unsigned int budget;
unsigned int rx_used;
..
}
static inline bool napi_rx_has_budget(napi)
{
return napi->budget > napi->rx_used;
}
poll(napi) // no budget
{
while (napi_rx_has_budget(napi)) {
napi_gro_receive(napi, skb); /* does napi->rx_used++ */
// maybe add explicit napi_rx_count() if
// driver did something funny with the frame.
}
}
We can also create napi_tx_has_budget() so that people stop being
confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
so that people stop completing XDP in hard irq context (budget==0)...
And we can pass napi into napi_consume_skb(), instead of, presumably
inexplicably to a newcomer, passing in budget.
And napi_complete_done() can lose the work_done argument, too.
Oh, and I'm bringing it up here, because CONFIG_RT can throw
in "need_resched()" into the napi_rx_has_budget(), obviously.
On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
> This is all back-of-the-envelope calculations, of course. Having some
> actual numbers to look at would be great; I don't suppose you have a
> setup where you can run xdp-bench and see how your patches affect the
> throughput?
No but I probably could set it up.
> I chatted with Jesper about this, and he had an idea not too far from
> this: split up the XDP and regular stack processing in two stages, each
> with their individual batching. So whereas right now we're doing
> something like:
>
> run_napi()
> bh_disable()
> for pkt in budget:
> act = run_xdp(pkt)
> if (act == XDP_PASS)
> run_netstack(pkt) // this is the expensive bit
> bh_enable()
>
> We could instead do:
>
> run_napi()
> bh_disable()
> for pkt in budget:
> act = run_xdp(pkt)
> if (act == XDP_PASS)
> add_to_list(pkt, to_stack_list)
> bh_enable()
> // sched point
> bh_disable()
> for pkt in to_stack_list:
> run_netstack(pkt)
> bh_enable()
>
>
> This would limit the batching that blocks everything to only the XDP
> processing itself, which should limit the maximum time spent in the
> blocking state significantly compared to what we have today. The caveat
> being that rearranging things like this is potentially a pretty major
> refactoring task that needs to touch all the drivers (even if some of
> the logic can be moved into the core code in the process). So not really
> sure if this approach is feasible, TBH.
This does not work because bh_disable() does not disable scheduling.
Scheduling may happen. bh_disable() acquires a lock which is currently
the only synchronisation point between, say, two network drivers doing
NAPI. And this is what I want to get rid of.
Regarding the expensive bit as in XDP_PASS: this doesn't need locking as per
the proposal, just the REDIRECT piece.
> > Daniel said netkit doesn't need this locking because it is not
> > supporting this redirect and it made me think. Would it work to make
> > the redirect structures part of the bpf_prog-structure instead of
> > per-CPU? My understanding is that eBPF's programs data structures are
> > part of it and contain locking allowing one eBPF program preempt
> > another one.
> > Having the redirect structures part of the program would obsolete
> > locking. Do I miss anything?
>
> This won't work, unfortunately: the same XDP program can be attached to
> multiple interfaces simultaneously, and for hardware with multiple
> receive queues (which is most of the hardware that supports XDP), it can
> even run simultaneously on multiple CPUs on the same interface. This is
> the reason why this is all being kept in per-CPU variables today.
So I started hacking on this and noticed yesterday that you can
run multiple bpf programs. This is how I learned that it won't work.
My plan B is now to move it into task_struct.
> -Toke
Sebastian
On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> Oh, and I'm bringing it up here, because CONFIG_RT can throw
> in "need_resched()" into the napi_rx_has_budget(), obviously.
need_resched() does not work on PREEMPT_RT the way you think. This
context (the NAPI poll callback) is preemptible and (by default) runs at
SCHED_FIFO 50 (within a threaded IRQ) so a context switch can happen at
any time by a task with higher priority.
If threadA gets preempted and owns a lock that threadB, with higher
priority, wants, then threadA will get back on the CPU, inherit the priority
of threadB and continue to run until it releases the lock.
If this is the per-CPU BH lock (which I want to remove) then it will
continue until all softirqs complete.
Sebastian
Jakub Kicinski <[email protected]> writes:
> On Wed, 17 Jan 2024 17:37:29 +0100 Toke Høiland-Jørgensen wrote:
>> I am not contesting that latency is important, but it's a pretty
>> fundamental trade-off and we don't want to kill throughput entirely
>> either. Especially since this is global to the whole kernel; and there
>> are definitely people who want to use XDP on an RT kernel and still
>> achieve high PPS rates.
>>
>> (Whether those people really strictly speaking need to be running an RT
>> kernel is maybe debatable, but it does happen).
>>
>> > I expected the lock operation (under RT) to always succeed and not
>> > cause any delay because it should not be contended.
>>
>> A lock does cause delay even when it's not contended. Bear in mind that
>> at 10 Gbps line rate, we have a budget of 64 nanoseconds to process each
>> packet (for 64-byte packets). So just the atomic op to figure out
>> whether there's any contention (around 10ns on the Intel processors I
>> usually test on) will blow a huge chunk of the total processing budget.
>> We can't actually do the full processing needed in those 64 nanoseconds
>> (not to mention the 6.4 nanoseconds we have available at 100Gbps), which
>> is why it's essential to amortise as much as we can over multiple
>> packets.
>>
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> A potentially stupid idea which I have been turning in my head is
> how we could get away from having the driver handle details of NAPI
> budgeting. It's a source of bugs and endless review comments.
>
> All drivers end up maintaining a counter of "how many packets have
> I processed" and comparing that against the budget. Would it be crazy
> if we put that inside napi_struct? Add a "budget" member inside
> napi_struct as well, and:
>
> struct napi_struct {
> ...
> // poll state
> unsigned int budget;
> unsigned int rx_used;
> ...
> }
>
> static inline bool napi_rx_has_budget(napi)
> {
> return napi->budget > napi->rx_used;
> }
>
> poll(napi) // no budget
> {
> while (napi_rx_has_budget(napi)) {
> napi_gro_receive(napi, skb); /* does napi->rx_used++ */
> // maybe add explicit napi_rx_count() if
> // driver did something funny with the frame.
> }
> }
>
> We can also create napi_tx_has_budget() so that people stop being
> confused whether budget is for Tx or not. And napi_xdp_comp_has_budget()
> so that people stop completing XDP in hard irq context (budget==0)...
>
> And we can pass napi into napi_consume_skb(), instead of, presumably
> inexplicably to a newcomer, passing in budget.
> And napi_complete_done() can lose the work_done argument, too.
I do agree that conceptually it makes a lot of sense to encapsulate the
budget like this so drivers don't have to do all this state tracking
themselves. It does appear that drivers are doing different things with
the budget as it is today, though. For instance, the intel drivers seem
to divide the budget over all the enabled RX rings(?); so I'm wondering
if it'll be possible to unify drivers around a more opaque NAPI poll
API?
-Toke
Sebastian Andrzej Siewior <[email protected]> writes:
> On 2024-01-17 17:37:29 [+0100], Toke Høiland-Jørgensen wrote:
>> This is all back-of-the-envelope calculations, of course. Having some
>> actual numbers to look at would be great; I don't suppose you have a
>> setup where you can run xdp-bench and see how your patches affect the
>> throughput?
>
> No but I probably could set it up.
That would be great! Feel free to ping me if you need any pointers to
how we usually do the perf measurements :)
>> I chatted with Jesper about this, and he had an idea not too far from
>> this: split up the XDP and regular stack processing into two stages,
>> each with its own batching. So whereas right now we're doing
>> something like:
>>
>> run_napi()
>> bh_disable()
>> for pkt in budget:
>> act = run_xdp(pkt)
>> if (act == XDP_PASS)
>> run_netstack(pkt) // this is the expensive bit
>> bh_enable()
>>
>> We could instead do:
>>
>> run_napi()
>> bh_disable()
>> for pkt in budget:
>> act = run_xdp(pkt)
>> if (act == XDP_PASS)
>> add_to_list(pkt, to_stack_list)
>> bh_enable()
>> // sched point
>> bh_disable()
>> for pkt in to_stack_list:
>> run_netstack(pkt)
>> bh_enable()
>>
>>
>> This would limit the batching that blocks everything to only the XDP
>> processing itself, which should significantly reduce the maximum time
>> spent in the blocking state compared to what we have today. The caveat
>> being that rearranging things like this is potentially a pretty major
>> refactoring task that needs to touch all the drivers (even if some of
>> the logic can be moved into the core code in the process). So not really
>> sure if this approach is feasible, TBH.
>
> This does not work because bh_disable() does not disable scheduling on
> PREEMPT_RT; scheduling may still happen. bh_disable() acquires a lock
> which is currently the only synchronisation point between, say, two
> network drivers doing NAPI, and that is what I want to get rid of.
> Regarding the expensive bit, i.e. XDP_PASS: that doesn't need locking as
> per the proposal, only the REDIRECT piece does.
Right, well s/bh_disable()/lock()/; my main point was splitting up the
processing so that the XDP processing itself and the stack activation on
XDP_PASS are not interleaved. This will make it possible to hold the lock
around the whole XDP batch, not just individual packets, and so retain
the performance we gain from amortising expensive operations over
multiple packets.
-Toke
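For illustration, a rough sketch of that two-stage structure. Everything
here is made up for the example (xdp_stage_lock as a hypothetical per-CPU
local_lock_t, driver_rx_next(), driver_run_xdp()), and it glosses over
the fact that XDP normally runs on xdp_buff/xdp_frame before an skb
exists:

static void example_poll_two_stage(struct napi_struct *napi, int budget)
{
	LIST_HEAD(to_stack);
	struct sk_buff *skb, *tmp;
	int i;

	/* Stage 1: XDP only, batched under the explicit (per-CPU) lock
	 * that would replace the implicit BH lock on PREEMPT_RT.
	 */
	local_lock_nested_bh(&xdp_stage_lock);
	for (i = 0; i < budget; i++) {
		skb = driver_rx_next(napi);
		if (!skb)
			break;
		if (driver_run_xdp(napi, skb) == XDP_PASS)
			list_add_tail(&skb->list, &to_stack);
	}
	local_unlock_nested_bh(&xdp_stage_lock);

	/* possible scheduling point on PREEMPT_RT */

	/* Stage 2: feed the XDP_PASS batch to the regular stack. */
	list_for_each_entry_safe(skb, tmp, &to_stack, list) {
		skb_list_del_init(skb);
		napi_gro_receive(napi, skb);
	}
}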
On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
> I do agree that conceptually it makes a lot of sense to encapsulate the
> budget like this so drivers don't have to do all this state tracking
> themselves. It does appear that drivers are doing different things with
> the budget as it is today, though. For instance, the intel drivers seem
> to divide the budget over all the enabled RX rings(?); so I'm wondering
> if it'll be possible to unify drivers around a more opaque NAPI poll API?
We can come up with APIs which would cater to multi-queue cases.
The bigger question is what a sensible polling strategy for those would
be; just dividing the budget seems, hm, crude.
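For reference, the "crude" division looks roughly like this in the
drivers Toke mentioned (paraphrased, with placeholder names such as
num_rx_rings, example_clean_rx_ring() and example_enable_irqs()):

static int example_multi_ring_poll(struct napi_struct *napi, int budget)
{
	/* spread the budget evenly, at least one packet per ring */
	int budget_per_ring = max(budget / num_rx_rings, 1);
	bool clean_complete = true;
	int work_done = 0, i;

	for (i = 0; i < num_rx_rings; i++) {
		int cleaned = example_clean_rx_ring(i, budget_per_ring);

		work_done += cleaned;
		if (cleaned >= budget_per_ring)
			clean_complete = false;	/* ring may have more */
	}

	if (!clean_complete)
		return budget;			/* stay scheduled */

	if (napi_complete_done(napi, work_done))
		example_enable_irqs(napi);
	return min(work_done, budget - 1);
}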
On Thu, 18 Jan 2024 09:27:54 +0100 Sebastian Andrzej Siewior wrote:
> On 2024-01-17 18:04:47 [-0800], Jakub Kicinski wrote:
> > Oh, and I'm bringing it up here, because CONFIG_PREEMPT_RT can throw
> > a "need_resched()" check into napi_rx_has_budget(), obviously.
>
> need_resched() does not work on PREEMPT_RT the way you think. This
> context (the NAPI poll callback) is preemptible and (by default) runs at
> SCHED_FIFO 50 (within a threaded IRQ), so a higher-priority task can
> preempt it at any time.
> If threadA gets preempted while owning a lock that the higher-priority
> threadB wants, then threadA will get back on the CPU, inherit threadB's
> priority and continue to run until it releases the lock.
>
> If this is the per-CPU BH lock (which I want to remove) then it will
> continue until all softirqs complete.
So there's no way for a process to know on RT that someone with higher
prio is waiting for it to release its locks? :(
On 2024-01-18 08:38:12 [-0800], Jakub Kicinski wrote:
> > If this is the per-CPU BH lock (which I want to remove) then it will
> > continue until all softirqs complete.
>
> So there's no way for a process to know on RT that someone with higher
> prio is waiting for it to release its locks? :(
You could add a function to check if your current priority is inherited
from someone else and if so start dropping the locks you think are
responsible for it.
I made a PoC that appears to work for timer_list timers, which are handled
in one of the softirqs. This made me realise that I need it in more spots
and I am doing it for the wrong reasons…
Sebastian
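Very roughly, the check could look at whether the task's effective
priority is currently boosted above its normal one. This is a sketch
only: the helper names are made up and the PoC mentioned above is not
public.

static inline bool example_current_is_pi_boosted(void)
{
	/* ->prio is numerically lower than ->normal_prio while the task
	 * is PI-boosted by a higher-priority waiter
	 */
	return current->prio < current->normal_prio;
}

static void example_softirq_loop(void)
{
	while (example_more_work()) {
		example_do_one_item();

		if (example_current_is_pi_boosted()) {
			/* drop the lock(s) likely responsible, let the
			 * boosting waiter run, then re-acquire
			 */
			example_drop_and_reacquire_locks();
		}
	}
}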
Jakub Kicinski <[email protected]> writes:
> On Thu, 18 Jan 2024 12:51:18 +0100 Toke Høiland-Jørgensen wrote:
>> I do agree that conceptually it makes a lot of sense to encapsulate the
>> budget like this so drivers don't have to do all this state tracking
>> themselves. It does appear that drivers are doing different things with
>> the budget as it is today, though. For instance, the intel drivers seem
>> to divide the budget over all the enabled RX rings(?); so I'm wondering
>> if it'll be possible to unify drivers around a more opaque NAPI poll API?
>
> We can come up with APIs which would cater to multi-queue cases.
> The bigger question is what a sensible polling strategy for those would
> be; just dividing the budget seems, hm, crude.
Right, agreed, though I don't have a good answer for what else to do off
the top of my head...
-Toke