Hi,
During development of flower-route[1], which I
recently presented at FOSDEM[2], I noticed that
CPU usage, would increase the more rules I installed
into the hardware for IP forwarding offloading.
Since we use TC flower offload for the hottest
prefixes, and leave the long tail to the normal (non-TC)
Linux network stack for slow-path IP forwarding.
We therefore need both the hardware and software
datapath to perform well.
I found that skip_sw rules, are quite expensive
in the kernel datapath, since they must be evaluated
and matched upon, before the kernel checks the
skip_sw flag.
This patchset optimizes the case where all rules
are skip_sw, by implementing a TC bypass for these
cases, where TC is only used as a control plane
for the hardware path.
Changes from v1:
- Patch 1:
- Add Reviewed-By from Jiri Pirko
- Patch 2:
- Move code, to avoid forward declaration (Jiri).
- Patch 3
- Refactor to use a static key.
- Add performance data for trapping, or sending
a packet to a non-existent chain (as suggested by Marcelo).
[1] flower-route
https://github.com/fiberby-dk/flower-route
[2] FOSDEM talk
https://fosdem.org/2024/schedule/event/fosdem-2024-3337-flying-higher-hardware-offloading-with-bird/
Asbjørn Sloth Tønnesen (3):
net: sched: cls_api: add skip_sw counter
net: sched: cls_api: add filter counter
net: sched: make skip_sw actually skip software
include/net/pkt_cls.h | 9 +++++++++
include/net/sch_generic.h | 4 ++++
net/core/dev.c | 10 ++++++++++
net/sched/cls_api.c | 39 +++++++++++++++++++++++++++++++++++++++
4 files changed, 62 insertions(+)
--
2.43.0
TC filters come in 3 variants:
- no flag (try to process in hardware, but fallback to software))
- skip_hw (do not process filter by hardware)
- skip_sw (do not process filter by software)
However skip_sw is implemented so that the skip_sw
flag can first be checked, after it has been matched.
IMHO it's common when using skip_sw, to use it on all rules.
So if all filters in a block is skip_sw filters, then
we can bail early, we can thus avoid having to match
the filters, just to check for the skip_sw flag.
This patch adds a bypass, for when only TC skip_sw rules
are used. The bypass is guarded by a static key, to avoid
harming other workloads.
There are 3 ways that a packet from a skip_sw ruleset, can
end up in the kernel path. Although the send packets to a
non-existent chain way is only improved a few percents, then
I believe it's worth optimizing the trap and fall-though
use-cases.
+----------------------------+--------+--------+--------+
| Test description | Pre- | Post- | Rel. |
| | kpps | kpps | chg. |
+----------------------------+--------+--------+--------+
| basic forwarding + notrack | 3589.3 | 3587.9 | 1.00x |
| switch to eswitch mode | 3081.8 | 3094.7 | 1.00x |
| add ingress qdisc | 3042.9 | 3063.6 | 1.01x |
| tc forward in hw / skip_sw |37024.7 |37028.4 | 1.00x |
| tc forward in sw / skip_hw | 3245.0 | 3245.3 | 1.00x |
+----------------------------+--------+--------+--------+
| tests with only skip_sw rules below: |
+----------------------------+--------+--------+--------+
| 1 non-matching rule | 2694.7 | 3058.7 | 1.14x |
| 1 n-m rule, match trap | 2611.2 | 3323.1 | 1.27x |
| 1 n-m rule, goto non-chain | 2886.8 | 2945.9 | 1.02x |
| 5 non-matching rules | 1958.2 | 3061.3 | 1.56x |
| 5 n-m rules, match trap | 1911.9 | 3327.0 | 1.74x |
| 5 n-m rules, goto non-chain| 2883.1 | 2947.5 | 1.02x |
| 10 non-matching rules | 1466.3 | 3062.8 | 2.09x |
| 10 n-m rules, match trap | 1444.3 | 3317.9 | 2.30x |
| 10 n-m rules,goto non-chain| 2883.1 | 2939.5 | 1.02x |
| 25 non-matching rules | 838.5 | 3058.9 | 3.65x |
| 25 n-m rules, match trap | 824.5 | 3323.0 | 4.03x |
| 25 n-m rules,goto non-chain| 2875.8 | 2944.7 | 1.02x |
| 50 non-matching rules | 488.1 | 3054.7 | 6.26x |
| 50 n-m rules, match trap | 484.9 | 3318.5 | 6.84x |
| 50 n-m rules,goto non-chain| 2884.1 | 2939.7 | 1.02x |
+----------------------------+--------+--------+--------+
perf top (25 n-m skip_sw rules - pre patch):
20.39% [kernel] [k] __skb_flow_dissect
16.43% [kernel] [k] rhashtable_jhash2
10.58% [kernel] [k] fl_classify
10.23% [kernel] [k] fl_mask_lookup
4.79% [kernel] [k] memset_orig
2.58% [kernel] [k] tcf_classify
1.47% [kernel] [k] __x86_indirect_thunk_rax
1.42% [kernel] [k] __dev_queue_xmit
1.36% [kernel] [k] nft_do_chain
1.21% [kernel] [k] __rcu_read_lock
perf top (25 n-m skip_sw rules - post patch):
5.12% [kernel] [k] __dev_queue_xmit
4.77% [kernel] [k] nft_do_chain
3.65% [kernel] [k] dev_gro_receive
3.41% [kernel] [k] check_preemption_disabled
3.14% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
2.88% [kernel] [k] __netif_receive_skb_core.constprop.0
2.49% [kernel] [k] mlx5e_xmit
2.15% [kernel] [k] ip_forward
1.95% [kernel] [k] mlx5e_tc_restore_tunnel
1.92% [kernel] [k] vlan_gro_receive
Test setup:
DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
Data rate measured on switch (Extreme X690), and DUT connected as
a router on a stick, with pktgen and pktsink as VLANs.
Pktgen-dpdk was in range 36.6-37.7 Mpps across all tests.
Full test data at https://files.fiberby.net/ast/2024/tc_skip_sw/v2_tests/
Signed-off-by: Asbjørn Sloth Tønnesen <[email protected]>
---
include/net/pkt_cls.h | 9 +++++++++
include/net/sch_generic.h | 1 +
net/core/dev.c | 10 ++++++++++
net/sched/cls_api.c | 16 ++++++++++++++++
4 files changed, 36 insertions(+)
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a4ee43f493bb..41297bd38dff 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -74,6 +74,15 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
return block && block->index;
}
+#ifdef CONFIG_NET_CLS_ACT
+DECLARE_STATIC_KEY_FALSE(tcf_bypass_check_needed_key);
+
+static inline bool tcf_block_bypass_sw(struct tcf_block *block)
+{
+ return block && block->bypass_wanted;
+}
+#endif
+
static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
{
WARN_ON(tcf_block_shared(block));
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 7af0621db226..60b0fdf2b1ad 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -477,6 +477,7 @@ struct tcf_block {
struct flow_block flow_block;
struct list_head owner_list;
bool keep_dst;
+ bool bypass_wanted;
atomic_t filtercnt; /* Number of filters */
atomic_t skipswcnt; /* Number of skip_sw filters */
atomic_t offloadcnt; /* Number of oddloaded filters */
diff --git a/net/core/dev.c b/net/core/dev.c
index fe054cbd41e9..b7c583f98e82 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2057,6 +2057,11 @@ void net_dec_egress_queue(void)
EXPORT_SYMBOL_GPL(net_dec_egress_queue);
#endif
+#ifdef CONFIG_NET_CLS_ACT
+DEFINE_STATIC_KEY_FALSE(tcf_bypass_check_needed_key);
+EXPORT_SYMBOL(tcf_bypass_check_needed_key);
+#endif
+
DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
EXPORT_SYMBOL(netstamp_needed_key);
#ifdef CONFIG_JUMP_LABEL
@@ -3911,6 +3916,11 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
if (!miniq)
return ret;
+ if (static_branch_unlikely(&tcf_bypass_check_needed_key)) {
+ if (tcf_block_bypass_sw(miniq->block))
+ return ret;
+ }
+
tc_skb_cb(skb)->mru = 0;
tc_skb_cb(skb)->post_ct = false;
tcf_set_drop_reason(skb, *drop_reason);
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 304a46ab0e0b..acef09e21969 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -410,6 +410,21 @@ static void tcf_proto_get(struct tcf_proto *tp)
refcount_inc(&tp->refcnt);
}
+static inline void tcf_maintain_bypass(struct tcf_block *block)
+{
+ int filtercnt = atomic_read(&block->filtercnt);
+ int skipswcnt = atomic_read(&block->skipswcnt);
+ bool bypass_wanted = filtercnt > 0 && filtercnt == skipswcnt;
+
+ if (bypass_wanted != block->bypass_wanted) {
+ if (bypass_wanted)
+ static_branch_inc(&tcf_bypass_check_needed_key);
+ else
+ static_branch_dec(&tcf_bypass_check_needed_key);
+ block->bypass_wanted = bypass_wanted;
+ }
+}
+
static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted, bool add)
{
lockdep_assert_not_held(&block->cb_lock);
@@ -424,6 +439,7 @@ static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted,
*counted = false;
}
}
+ tcf_maintain_bypass(block);
up_write(&block->cb_lock);
}
--
2.43.0
Maintain a count of filters per block.
Counter updates are protected by cb_lock, which is
also used to protect the offload counters.
Signed-off-by: Asbjørn Sloth Tønnesen <[email protected]>
---
include/net/sch_generic.h | 2 ++
net/sched/cls_api.c | 19 +++++++++++++++++++
2 files changed, 21 insertions(+)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 46a63d1818a0..7af0621db226 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -427,6 +427,7 @@ struct tcf_proto {
*/
spinlock_t lock;
bool deleting;
+ bool counted;
refcount_t refcnt;
struct rcu_head rcu;
struct hlist_node destroy_ht_node;
@@ -476,6 +477,7 @@ struct tcf_block {
struct flow_block flow_block;
struct list_head owner_list;
bool keep_dst;
+ atomic_t filtercnt; /* Number of filters */
atomic_t skipswcnt; /* Number of skip_sw filters */
atomic_t offloadcnt; /* Number of oddloaded filters */
unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 397c3d29659c..304a46ab0e0b 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -410,12 +410,30 @@ static void tcf_proto_get(struct tcf_proto *tp)
refcount_inc(&tp->refcnt);
}
+static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted, bool add)
+{
+ lockdep_assert_not_held(&block->cb_lock);
+
+ down_write(&block->cb_lock);
+ if (*counted != add) {
+ if (add) {
+ atomic_inc(&block->filtercnt);
+ *counted = true;
+ } else {
+ atomic_dec(&block->filtercnt);
+ *counted = false;
+ }
+ }
+ up_write(&block->cb_lock);
+}
+
static void tcf_chain_put(struct tcf_chain *chain);
static void tcf_proto_destroy(struct tcf_proto *tp, bool rtnl_held,
bool sig_destroy, struct netlink_ext_ack *extack)
{
tp->ops->destroy(tp, rtnl_held, extack);
+ tcf_block_filter_cnt_update(tp->chain->block, &tp->counted, false);
if (sig_destroy)
tcf_proto_signal_destroyed(tp->chain, tp);
tcf_chain_put(tp->chain);
@@ -2364,6 +2382,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
flags, extack);
if (err == 0) {
+ tcf_block_filter_cnt_update(block, &tp->counted, true);
tfilter_notify(net, skb, n, tp, block, q, parent, fh,
RTM_NEWTFILTER, false, rtnl_held, extack);
tfilter_put(tp, fh);
--
2.43.0
Hi Asbj?rn,
kernel test robot noticed the following build errors:
[auto build test ERROR on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Asbj-rn-Sloth-T-nnesen/net-sched-cls_api-add-skip_sw-counter/20240305-225707
base: net-next/main
patch link: https://lore.kernel.org/r/20240305144404.569632-4-ast%40fiberby.net
patch subject: [PATCH net-next v2 3/3] net: sched: make skip_sw actually skip software
config: riscv-defconfig (https://download.01.org/0day-ci/archive/20240306/[email protected]/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 325f51237252e6dab8e4e1ea1fa7acbb4faee1cd)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240306/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
In file included from net/sched/cls_api.c:18:
In file included from include/linux/skbuff.h:17:
In file included from include/linux/bvec.h:10:
In file included from include/linux/highmem.h:8:
In file included from include/linux/cacheflush.h:5:
In file included from arch/riscv/include/asm/cacheflush.h:9:
In file included from include/linux/mm.h:2188:
include/linux/vmstat.h:522:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
522 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
In file included from net/sched/cls_api.c:27:
In file included from include/net/sock.h:46:
In file included from include/linux/netdevice.h:45:
In file included from include/uapi/linux/neighbour.h:6:
In file included from include/linux/netlink.h:9:
In file included from include/net/scm.h:9:
In file included from include/linux/security.h:35:
include/linux/bpf.h:736:48: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_arg_type') [-Wenum-enum-conversion]
736 | ARG_PTR_TO_MAP_VALUE_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_MAP_VALUE,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:737:43: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_arg_type') [-Wenum-enum-conversion]
737 | ARG_PTR_TO_MEM_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_MEM,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
include/linux/bpf.h:738:43: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_arg_type') [-Wenum-enum-conversion]
738 | ARG_PTR_TO_CTX_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_CTX,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
include/linux/bpf.h:739:45: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_arg_type') [-Wenum-enum-conversion]
739 | ARG_PTR_TO_SOCKET_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_SOCKET,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
include/linux/bpf.h:740:44: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_arg_type') [-Wenum-enum-conversion]
740 | ARG_PTR_TO_STACK_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_STACK,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~
include/linux/bpf.h:741:45: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_arg_type') [-Wenum-enum-conversion]
741 | ARG_PTR_TO_BTF_ID_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_BTF_ID,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
include/linux/bpf.h:745:38: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_arg_type') [-Wenum-enum-conversion]
745 | ARG_PTR_TO_UNINIT_MEM = MEM_UNINIT | ARG_PTR_TO_MEM,
| ~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
include/linux/bpf.h:747:45: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_arg_type') [-Wenum-enum-conversion]
747 | ARG_PTR_TO_FIXED_SIZE_MEM = MEM_FIXED_SIZE | ARG_PTR_TO_MEM,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
include/linux/bpf.h:770:48: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_return_type') [-Wenum-enum-conversion]
770 | RET_PTR_TO_MAP_VALUE_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_MAP_VALUE,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:771:45: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_return_type') [-Wenum-enum-conversion]
771 | RET_PTR_TO_SOCKET_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_SOCKET,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
include/linux/bpf.h:772:47: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_return_type') [-Wenum-enum-conversion]
772 | RET_PTR_TO_TCP_SOCK_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_TCP_SOCK,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:773:50: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_return_type') [-Wenum-enum-conversion]
773 | RET_PTR_TO_SOCK_COMMON_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_SOCK_COMMON,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:775:49: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_return_type') [-Wenum-enum-conversion]
775 | RET_PTR_TO_DYNPTR_MEM_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_MEM,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
include/linux/bpf.h:776:45: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_return_type') [-Wenum-enum-conversion]
776 | RET_PTR_TO_BTF_ID_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_BTF_ID,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
include/linux/bpf.h:777:43: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_return_type') [-Wenum-enum-conversion]
777 | RET_PTR_TO_BTF_ID_TRUSTED = PTR_TRUSTED | RET_PTR_TO_BTF_ID,
| ~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
include/linux/bpf.h:888:44: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_reg_type') [-Wenum-enum-conversion]
888 | PTR_TO_MAP_VALUE_OR_NULL = PTR_MAYBE_NULL | PTR_TO_MAP_VALUE,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~
include/linux/bpf.h:889:42: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_reg_type') [-Wenum-enum-conversion]
889 | PTR_TO_SOCKET_OR_NULL = PTR_MAYBE_NULL | PTR_TO_SOCKET,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~
include/linux/bpf.h:890:46: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_reg_type') [-Wenum-enum-conversion]
890 | PTR_TO_SOCK_COMMON_OR_NULL = PTR_MAYBE_NULL | PTR_TO_SOCK_COMMON,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:891:44: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_reg_type') [-Wenum-enum-conversion]
891 | PTR_TO_TCP_SOCK_OR_NULL = PTR_MAYBE_NULL | PTR_TO_TCP_SOCK,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~
include/linux/bpf.h:892:42: warning: bitwise operation between different enumeration types ('enum bpf_type_flag' and 'enum bpf_reg_type') [-Wenum-enum-conversion]
892 | PTR_TO_BTF_ID_OR_NULL = PTR_MAYBE_NULL | PTR_TO_BTF_ID,
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~
>> net/sched/cls_api.c:421:23: error: use of undeclared identifier 'tcf_bypass_check_needed_key'
421 | static_branch_inc(&tcf_bypass_check_needed_key);
| ^
net/sched/cls_api.c:423:23: error: use of undeclared identifier 'tcf_bypass_check_needed_key'
423 | static_branch_dec(&tcf_bypass_check_needed_key);
| ^
21 warnings and 2 errors generated.
vim +/tcf_bypass_check_needed_key +421 net/sched/cls_api.c
412
413 static inline void tcf_maintain_bypass(struct tcf_block *block)
414 {
415 int filtercnt = atomic_read(&block->filtercnt);
416 int skipswcnt = atomic_read(&block->skipswcnt);
417 bool bypass_wanted = filtercnt > 0 && filtercnt == skipswcnt;
418
419 if (bypass_wanted != block->bypass_wanted) {
420 if (bypass_wanted)
> 421 static_branch_inc(&tcf_bypass_check_needed_key);
422 else
423 static_branch_dec(&tcf_bypass_check_needed_key);
424 block->bypass_wanted = bypass_wanted;
425 }
426 }
427
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki