=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which is very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
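
To make the "only the first fragment carries the full 5-tuple" point
concrete, here is a minimal userspace sketch (not part of this patchset)
that reads the IPv4 fragmentation fields from a raw header. The struct
frag_info type and the ipv4_frag_info() helper are hypothetical names
made up for illustration; IP_MF and IP_OFFSET use the same values as the
kernel constants of the same name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IP_MF     0x2000 /* "more fragments" flag */
#define IP_OFFSET 0x1fff /* fragment offset, in 8-byte units */

/* Hypothetical result type, for illustration only */
struct frag_info {
	bool is_fragment;          /* MF set or offset non-zero */
	bool has_l4_header;        /* only true when the offset is zero */
	unsigned int offset_bytes; /* payload offset in the original packet */
};

/* Parse the fragmentation fields of a raw IPv4 header.
 * Returns 0 on success, -1 if the buffer is too short or not IPv4.
 */
static int ipv4_frag_info(const uint8_t *pkt, size_t len,
			  struct frag_info *out)
{
	uint16_t frag_off;

	if (len < 20 || (pkt[0] >> 4) != 4)
		return -1;

	/* Bytes 6-7 hold flags + fragment offset in network byte order */
	frag_off = (uint16_t)((pkt[6] << 8) | pkt[7]);

	out->offset_bytes = (frag_off & IP_OFFSET) * 8u;
	out->is_fragment = (frag_off & IP_MF) || out->offset_bytes;
	/* Ports live in the L4 header, which only sits at offset 0 */
	out->has_l4_header = out->offset_bytes == 0;
	return 0;
}
```

Any fragment for which has_l4_header is false carries no ports, so a
stateless filter can only match it on addresses and protocol.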
=== BPF related bits ===
However, when policy is enforced through BPF, the prog is run before the
kernel reassembles fragmented packets. This leaves BPF developers in an
awkward place: implement reassembly (possibly poorly) or use a stateless
method as described above.
Fortunately, the kernel has robust support for fragmented IP packets.
This patchset wraps the existing defragmentation facilities in kfuncs so
that BPF progs running on middleboxes can reassemble fragmented packets
before applying policy.
=== Patchset details ===
This patchset is (hopefully) relatively straightforward from a BPF perspective.
One thing I'd like to call out is the skb_copy()ing of the prog skb. I
did this to maintain the invariant that the ctx remains valid after prog
has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
consume the skb if the skb is a fragment.
Originally I did play around with teaching the verifier about kfuncs
that may consume the ctx and disallowing ctx accesses in ret != 0
branches. It worked ok, but it seemed too complex to modify the
surrounding assumptions about ctx validity.
[0]: https://datatracker.ietf.org/doc/html/rfc8900
===
Changes from v1:
* Add support for ipv6 defragmentation
Daniel Xu (8):
ip: frags: Return actual error codes from ip_check_defrag()
bpf: verifier: Support KF_CHANGES_PKT flag
bpf, net, frags: Add bpf_ip_check_defrag() kfunc
net: ipv6: Factor ipv6_frag_rcv() to take netns and user
bpf: net: ipv6: Add bpf_ipv6_frag_rcv() kfunc
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
Documentation/bpf/kfuncs.rst | 7 +
drivers/net/macvlan.c | 2 +-
include/linux/btf.h | 1 +
include/net/ip.h | 11 +
include/net/ipv6.h | 1 +
include/net/ipv6_frag.h | 1 +
include/net/transp_v6.h | 1 +
kernel/bpf/verifier.c | 8 +
net/ipv4/Makefile | 1 +
net/ipv4/ip_fragment.c | 15 +-
net/ipv4/ip_fragment_bpf.c | 98 ++++++
net/ipv6/Makefile | 1 +
net/ipv6/af_inet6.c | 4 +
net/ipv6/reassembly.c | 16 +-
net/ipv6/reassembly_bpf.c | 143 ++++++++
net/packet/af_packet.c | 2 +-
tools/testing/selftests/bpf/Makefile | 3 +-
.../selftests/bpf/generate_udp_fragments.py | 90 +++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 +++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 327 ++++++++++++++++++
.../selftests/bpf/progs/bpf_tracing_net.h | 1 +
.../selftests/bpf/progs/ip_check_defrag.c | 133 +++++++
24 files changed, 931 insertions(+), 21 deletions(-)
create mode 100644 net/ipv4/ip_fragment_bpf.c
create mode 100644 net/ipv6/reassembly_bpf.c
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
--
2.39.1
Once we wrap ip_check_defrag() in a kfunc, it may be useful for progs to
know the exact error condition ip_check_defrag() encountered.
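
For readers unfamiliar with the error-pointer convention, the following
is a userspace approximation of the kernel's ERR_PTR()/PTR_ERR()/IS_ERR()
helpers. It is a sketch for illustration only, not the kernel's
definitions; the real ones live in include/linux/err.h.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr falls in the top MAX_ERRNO values of the address space */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

Note that IS_ERR(NULL) is false, so a caller that only tests "!skb"
would treat an error pointer as a valid skb. This is why the existing
callers in macvlan and af_packet switch to IS_ERR() in this patch.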
Signed-off-by: Daniel Xu <[email protected]>
---
drivers/net/macvlan.c | 2 +-
net/ipv4/ip_fragment.c | 13 ++++++++-----
net/packet/af_packet.c | 2 +-
3 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 99a971929c8e..b8310e13d7e1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -456,7 +456,7 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
unsigned int hash;
skb = ip_check_defrag(dev_net(skb->dev), skb, IP_DEFRAG_MACVLAN);
- if (!skb)
+ if (IS_ERR(skb))
return RX_HANDLER_CONSUMED;
*pskb = skb;
eth = eth_hdr(skb);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 69c00ffdcf3e..959d2c4260ea 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -514,6 +514,7 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
struct iphdr iph;
int netoff;
u32 len;
+ int err;
if (skb->protocol != htons(ETH_P_IP))
return skb;
@@ -535,15 +536,17 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
if (skb) {
if (!pskb_may_pull(skb, netoff + iph.ihl * 4)) {
kfree_skb(skb);
- return NULL;
+ return ERR_PTR(-ENOMEM);
}
- if (pskb_trim_rcsum(skb, netoff + len)) {
+ err = pskb_trim_rcsum(skb, netoff + len);
+ if (err) {
kfree_skb(skb);
- return NULL;
+ return ERR_PTR(err);
}
memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
- if (ip_defrag(net, skb, user))
- return NULL;
+ err = ip_defrag(net, skb, user);
+ if (err)
+ return ERR_PTR(err);
skb_clear_hash(skb);
}
}
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d4e76e2ae153..1ef94828c8da 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1470,7 +1470,7 @@ static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev,
if (fanout_has_flag(f, PACKET_FANOUT_FLAG_DEFRAG)) {
skb = ip_check_defrag(net, skb, IP_DEFRAG_AF_PACKET);
- if (!skb)
+ if (IS_ERR(skb))
return 0;
}
switch (f->type) {
--
2.39.1
KF_CHANGES_PKT indicates that the kfunc call may change packet data.
This is analogous to bpf_helper_changes_pkt_data().
Signed-off-by: Daniel Xu <[email protected]>
---
Documentation/bpf/kfuncs.rst | 7 +++++++
include/linux/btf.h | 1 +
kernel/bpf/verifier.c | 8 ++++++++
3 files changed, 16 insertions(+)
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
index 226313747be5..16c387ee987f 100644
--- a/Documentation/bpf/kfuncs.rst
+++ b/Documentation/bpf/kfuncs.rst
@@ -260,6 +260,13 @@ encouraged to make their use-cases known as early as possible, and participate
in upstream discussions regarding whether to keep, change, deprecate, or remove
those kfuncs if and when such discussions occur.
+2.4.10 KF_CHANGES_PKT flag
+--------------------------
+
+The KF_CHANGES_PKT flag is used for kfuncs that may change packet data.
+After calls to such kfuncs, existing packet pointers will be invalidated
+and must be revalidated before the prog can access packet data.
+
2.5 Registering the kfuncs
--------------------------
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 49e0fe6d8274..ee3d6c3e6cc0 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -71,6 +71,7 @@
#define KF_SLEEPABLE (1 << 5) /* kfunc may sleep */
#define KF_DESTRUCTIVE (1 << 6) /* kfunc performs destructive actions */
#define KF_RCU (1 << 7) /* kfunc only takes rcu pointer arguments */
+#define KF_CHANGES_PKT (1 << 8) /* kfunc may change packet data */
/*
* Tag marking a kernel function as a kfunc. This is meant to minimize the
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5cb8b623f639..e58065498a35 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8681,6 +8681,11 @@ static bool is_kfunc_rcu(struct bpf_kfunc_call_arg_meta *meta)
return meta->kfunc_flags & KF_RCU;
}
+static bool is_kfunc_changes_pkt(struct bpf_kfunc_call_arg_meta *meta)
+{
+ return meta->kfunc_flags & KF_CHANGES_PKT;
+}
+
static bool is_kfunc_arg_kptr_get(struct bpf_kfunc_call_arg_meta *meta, int arg)
{
return arg == 0 && (meta->kfunc_flags & KF_KPTR_GET);
@@ -10083,6 +10088,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
mark_btf_func_reg_size(env, regno, t->size);
}
+ if (is_kfunc_changes_pkt(&meta))
+ clear_all_pkt_pointers(env);
+
return 0;
}
--
2.39.1
This kfunc is used to defragment IPv4 packets. The idea is that if you
see a fragmented packet, you call this kfunc. If the kfunc returns 0,
then the skb has been updated to contain the entire reassembled packet.
If the kfunc returns an error (most likely -EINPROGRESS), then it means
the skb is part of a yet-incomplete original packet. A reasonable
response to -EINPROGRESS is to drop the packet, as the ip defrag
infrastructure is already hanging onto the frag for future reassembly.
Care has been taken to ensure the prog skb remains valid no matter what
the underlying ip_check_defrag() call does. This is in contrast to
ip_defrag(), which may consume the skb if the skb is part of a
yet-incomplete original packet.
So far this kfunc is only callable from TC clsact progs.
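
The return-code handling suggested above can be sketched in plain C as
follows. This is a userspace illustration, not BPF code from this
series: handle_defrag_ret(), policy_fn and allow_all() are hypothetical
names, the TC_ACT_* values mirror <linux/pkt_cls.h>, and dropping on
errors other than -EINPROGRESS is an assumption (fail closed), not
something this patch mandates.

```c
#include <errno.h>

/* Same values as TC_ACT_OK / TC_ACT_SHOT in <linux/pkt_cls.h> */
#define TC_ACT_OK   0
#define TC_ACT_SHOT 2

/* Hypothetical policy hook: returns a TC_ACT_* verdict for a packet
 * that is known to be complete (reassembled or never fragmented).
 */
typedef int (*policy_fn)(void);

/* Trivial example policy that accepts everything */
static int allow_all(void)
{
	return TC_ACT_OK;
}

/* Map the kfunc return value to a verdict.
 *
 * ret == 0        -> packet is whole; policy can see the full 5-tuple
 * -EINPROGRESS    -> defrag infrastructure holds the frag; drop ours
 * any other error -> assumption: fail closed and drop
 */
static int handle_defrag_ret(int ret, policy_fn apply_policy)
{
	if (ret == 0)
		return apply_policy();

	return TC_ACT_SHOT;
}
```

In a real TC prog, ret would be the value returned by
bpf_ip_check_defrag(), and packet pointers would have to be reloaded
from the ctx before the policy code touches packet data.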
Signed-off-by: Daniel Xu <[email protected]>
---
include/net/ip.h | 11 +++++
net/ipv4/Makefile | 1 +
net/ipv4/ip_fragment.c | 2 +
net/ipv4/ip_fragment_bpf.c | 98 ++++++++++++++++++++++++++++++++++++++
4 files changed, 112 insertions(+)
create mode 100644 net/ipv4/ip_fragment_bpf.c
diff --git a/include/net/ip.h b/include/net/ip.h
index c3fffaa92d6e..f3796b1b5cac 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -680,6 +680,7 @@ enum ip_defrag_users {
IP_DEFRAG_VS_FWD,
IP_DEFRAG_AF_PACKET,
IP_DEFRAG_MACVLAN,
+ IP_DEFRAG_BPF,
};
/* Return true if the value of 'user' is between 'lower_bond'
@@ -693,6 +694,16 @@ static inline bool ip_defrag_user_in_between(u32 user,
}
int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);
+
+#ifdef CONFIG_DEBUG_INFO_BTF
+int register_ip_frag_bpf(void);
+#else
+static inline int register_ip_frag_bpf(void)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_INET
struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user);
#else
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 880277c9fd07..950efb166d37 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -65,6 +65,7 @@ obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_DEBUG_INFO_BTF) += ip_fragment_bpf.o
obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
xfrm4_output.o xfrm4_protocol.o
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 959d2c4260ea..e3fda5203f09 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -759,5 +759,7 @@ void __init ipfrag_init(void)
if (inet_frags_init(&ip4_frags))
panic("IP: failed to allocate ip4_frags cache\n");
ip4_frags_ctl_register();
+ if (register_ip_frag_bpf())
+ panic("IP: bpf: failed to register ip_frag_bpf\n");
register_pernet_subsys(&ip4_frags_ops);
}
diff --git a/net/ipv4/ip_fragment_bpf.c b/net/ipv4/ip_fragment_bpf.c
new file mode 100644
index 000000000000..a9e5908ed216
--- /dev/null
+++ b/net/ipv4/ip_fragment_bpf.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Unstable ipv4 fragmentation helpers for TC-BPF hook
+ *
+ * These are called from SCHED_CLS BPF programs. Note that it is allowed to
+ * break compatibility for these functions since the interface they are exposed
+ * through to BPF programs is explicitly unstable.
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/ip.h>
+#include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <net/ip.h>
+#include <net/sock.h>
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+ "Global functions as their definitions will be in ip_fragment BTF");
+
+/* bpf_ip_check_defrag - Defragment an ipv4 packet
+ *
+ * This helper takes an skb as input. If this skb successfully reassembles
+ * the original packet, the skb is updated to contain the original, reassembled
+ * packet.
+ *
+ * Otherwise (on error or incomplete reassembly), the input skb remains
+ * unmodified.
+ *
+ * Parameters:
+ * @ctx - Pointer to program context (skb)
+ * @netns - Child network namespace id. If value is a negative signed
+ * 32-bit integer, the netns of the device in the skb is used.
+ *
+ * Return:
+ * 0 on successful reassembly or non-fragmented packet. Negative value on
+ * error or incomplete reassembly.
+ */
+int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
+{
+ struct sk_buff *skb = (struct sk_buff *)ctx;
+ struct sk_buff *skb_cpy, *skb_out;
+ struct net *caller_net;
+ struct net *net;
+ int mac_len;
+ void *mac;
+
+ if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
+ return -EINVAL;
+
+ caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+ if ((s32)netns < 0) {
+ net = caller_net;
+ } else {
+ net = get_net_ns_by_id(caller_net, netns);
+ if (unlikely(!net))
+ return -EINVAL;
+ }
+
+ mac_len = skb->mac_len;
+ skb_cpy = skb_copy(skb, GFP_ATOMIC);
+ if (!skb_cpy)
+ return -ENOMEM;
+
+ skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
+ if (IS_ERR(skb_out))
+ return PTR_ERR(skb_out);
+
+ skb_morph(skb, skb_out);
+ kfree_skb(skb_out);
+
+ /* ip_check_defrag() does not maintain mac header, so push empty header
+ * in so prog sees the correct layout. The empty mac header will be
+ * later pulled from cls_bpf.
+ */
+ mac = skb_push(skb, mac_len);
+ memset(mac, 0, mac_len);
+ bpf_compute_data_pointers(skb);
+
+ return 0;
+}
+
+__diag_pop()
+
+BTF_SET8_START(ip_frag_kfunc_set)
+BTF_ID_FLAGS(func, bpf_ip_check_defrag, KF_CHANGES_PKT)
+BTF_SET8_END(ip_frag_kfunc_set)
+
+static const struct btf_kfunc_id_set ip_frag_bpf_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &ip_frag_kfunc_set,
+};
+
+int register_ip_frag_bpf(void)
+{
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
+ &ip_frag_bpf_kfunc_set);
+}
--
2.39.1
Factor _ipv6_frag_rcv() out of ipv6_frag_rcv() such that the former
takes a netns and user field.
We do this so that the BPF interface for ipv6 defrag can have the same
semantics as ipv4 defrag (see ip_check_defrag()).
Signed-off-by: Daniel Xu <[email protected]>
---
include/net/ipv6.h | 1 +
net/ipv6/reassembly.c | 16 +++++++++++-----
2 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 7332296eca44..9bbdf82ca6c0 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1238,6 +1238,7 @@ int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
extern const struct proto_ops inet6_stream_ops;
extern const struct proto_ops inet6_dgram_ops;
extern const struct proto_ops inet6_sockraw_ops;
+int _ipv6_frag_rcv(struct net *net, struct sk_buff *skb, u32 user);
struct group_source_req;
struct group_filter;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 5bc8a28e67f9..5100430eb982 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -81,13 +81,13 @@ static void ip6_frag_expire(struct timer_list *t)
}
static struct frag_queue *
-fq_find(struct net *net, __be32 id, const struct ipv6hdr *hdr, int iif)
+fq_find(struct net *net, __be32 id, const struct ipv6hdr *hdr, int iif, u32 user)
{
struct frag_v6_compare_key key = {
.id = id,
.saddr = hdr->saddr,
.daddr = hdr->daddr,
- .user = IP6_DEFRAG_LOCAL_DELIVER,
+ .user = user,
.iif = iif,
};
struct inet_frag_queue *q;
@@ -324,12 +324,11 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *skb,
return -1;
}
-static int ipv6_frag_rcv(struct sk_buff *skb)
+int _ipv6_frag_rcv(struct net *net, struct sk_buff *skb, u32 user)
{
struct frag_hdr *fhdr;
struct frag_queue *fq;
const struct ipv6hdr *hdr = ipv6_hdr(skb);
- struct net *net = dev_net(skb_dst(skb)->dev);
u8 nexthdr;
int iif;
@@ -377,7 +376,7 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
}
iif = skb->dev ? skb->dev->ifindex : 0;
- fq = fq_find(net, fhdr->identification, hdr, iif);
+ fq = fq_find(net, fhdr->identification, hdr, iif, user);
if (fq) {
u32 prob_offset = 0;
int ret;
@@ -410,6 +409,13 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return -1;
}
+static int ipv6_frag_rcv(struct sk_buff *skb)
+{
+ struct net *net = dev_net(skb_dst(skb)->dev);
+
+ return _ipv6_frag_rcv(net, skb, IP6_DEFRAG_LOCAL_DELIVER);
+}
+
static const struct inet6_protocol frag_protocol = {
.handler = ipv6_frag_rcv,
.flags = INET6_PROTO_NOPOLICY,
--
2.39.1
This helper is used to defragment IPv6 packets. Similar to the previous
bpf_ip_check_defrag() kfunc, this kfunc:
* Returns 0 on defrag + skb update success
* Returns < 0 on error
* Takes care to ensure ctx (skb) remains valid no matter what the
underlying call to _ipv6_frag_rcv() does
* Is only callable from TC clsact progs
Please see bpf_ip_check_defrag() commit for more details / suggestions.
Signed-off-by: Daniel Xu <[email protected]>
---
include/net/ipv6_frag.h | 1 +
include/net/transp_v6.h | 1 +
net/ipv6/Makefile | 1 +
net/ipv6/af_inet6.c | 4 ++
net/ipv6/reassembly_bpf.c | 143 ++++++++++++++++++++++++++++++++++++++
5 files changed, 150 insertions(+)
create mode 100644 net/ipv6/reassembly_bpf.c
diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 7321ffe3a108..cf4763cd3886 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -15,6 +15,7 @@ enum ip6_defrag_users {
__IP6_DEFRAG_CONNTRACK_OUT = IP6_DEFRAG_CONNTRACK_OUT + USHRT_MAX,
IP6_DEFRAG_CONNTRACK_BRIDGE_IN,
__IP6_DEFRAG_CONNTRACK_BRIDGE_IN = IP6_DEFRAG_CONNTRACK_BRIDGE_IN + USHRT_MAX,
+ IP6_DEFRAG_BPF,
};
/*
diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h
index d27b1caf3753..244123a74349 100644
--- a/include/net/transp_v6.h
+++ b/include/net/transp_v6.h
@@ -20,6 +20,7 @@ int ipv6_exthdrs_init(void);
void ipv6_exthdrs_exit(void);
int ipv6_frag_init(void);
void ipv6_frag_exit(void);
+int register_ipv6_reassembly_bpf(void);
/* transport protocols */
int pingv6_init(void);
diff --git a/net/ipv6/Makefile b/net/ipv6/Makefile
index 3036a45e8a1e..6e90ff1d20c0 100644
--- a/net/ipv6/Makefile
+++ b/net/ipv6/Makefile
@@ -26,6 +26,7 @@ ipv6-$(CONFIG_IPV6_SEG6_LWTUNNEL) += seg6_iptunnel.o seg6_local.o
ipv6-$(CONFIG_IPV6_SEG6_HMAC) += seg6_hmac.o
ipv6-$(CONFIG_IPV6_RPL_LWTUNNEL) += rpl_iptunnel.o
ipv6-$(CONFIG_IPV6_IOAM6_LWTUNNEL) += ioam6_iptunnel.o
+ipv6-$(CONFIG_DEBUG_INFO_BTF) += reassembly_bpf.o
obj-$(CONFIG_INET6_AH) += ah6.o
obj-$(CONFIG_INET6_ESP) += esp6.o
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 38689bedfce7..39663de75fbd 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1174,6 +1174,10 @@ static int __init inet6_init(void)
if (err)
goto ipv6_frag_fail;
+ err = register_ipv6_reassembly_bpf();
+ if (err)
+ goto ipv6_frag_fail;
+
/* Init v6 transport protocols. */
err = udpv6_init();
if (err)
diff --git a/net/ipv6/reassembly_bpf.c b/net/ipv6/reassembly_bpf.c
new file mode 100644
index 000000000000..c6c804d4f636
--- /dev/null
+++ b/net/ipv6/reassembly_bpf.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Unstable ipv6 fragmentation helpers for TC-BPF hook
+ *
+ * These are called from SCHED_CLS BPF programs. Note that it is allowed to
+ * break compatibility for these functions since the interface they are exposed
+ * through to BPF programs is explicitly unstable.
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <net/ipv6.h>
+#include <net/ipv6_frag.h>
+#include <net/ipv6_stubs.h>
+
+static int set_dst(struct sk_buff *skb, struct net *net)
+{
+ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+ struct dst_entry *dst;
+
+ struct flowi6 fl6 = {
+ .flowi6_flags = FLOWI_FLAG_ANYSRC,
+ .flowi6_mark = skb->mark,
+ .flowlabel = ip6_flowinfo(ip6h),
+ .flowi6_iif = skb->skb_iif,
+ .flowi6_proto = ip6h->nexthdr,
+ .daddr = ip6h->daddr,
+ .saddr = ip6h->saddr,
+ };
+
+ dst = ipv6_stub->ipv6_dst_lookup_flow(net, NULL, &fl6, NULL);
+ if (IS_ERR(dst))
+ return PTR_ERR(dst);
+
+ skb_dst_set(skb, dst);
+
+ return 0;
+}
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+ "Global functions as their definitions will be in reassembly BTF");
+
+/* bpf_ipv6_frag_rcv - Defragment an ipv6 packet
+ *
+ * This helper takes an skb as input. If this skb successfully reassembles
+ * the original packet, the skb is updated to contain the original, reassembled
+ * packet.
+ *
+ * Otherwise (on error or incomplete reassembly), the input skb remains
+ * unmodified.
+ *
+ * Parameters:
+ * @ctx - Pointer to program context (skb)
+ * @netns - Child network namespace id. If value is a negative signed
+ * 32-bit integer, the netns of the device in the skb is used.
+ *
+ * Return:
+ * 0 on successful reassembly or non-fragmented packet. Negative value on
+ * error or incomplete reassembly.
+ */
+int bpf_ipv6_frag_rcv(struct __sk_buff *ctx, u64 netns)
+{
+ struct sk_buff *skb = (struct sk_buff *)ctx;
+ struct sk_buff *skb_cpy;
+ struct net *caller_net;
+ unsigned int foff;
+ struct net *net;
+ int mac_len;
+ void *mac;
+ int err;
+
+ if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
+ return -EINVAL;
+
+ caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+ if ((s32)netns < 0) {
+ net = caller_net;
+ } else {
+ net = get_net_ns_by_id(caller_net, netns);
+ if (unlikely(!net))
+ return -EINVAL;
+ }
+
+ err = set_dst(skb, net);
+ if (err < 0)
+ return err;
+
+ mac_len = skb->mac_len;
+ skb_cpy = skb_copy(skb, GFP_ATOMIC);
+ if (!skb_cpy)
+ return -ENOMEM;
+
+ /* _ipv6_frag_rcv() expects skb->transport_header to be set to start of
+ * the frag header and nhoff to be set.
+ */
+ err = ipv6_find_hdr(skb_cpy, &foff, NEXTHDR_FRAGMENT, NULL, NULL);
+ if (err < 0)
+ return err;
+ skb_set_transport_header(skb_cpy, foff);
+ IP6CB(skb_cpy)->nhoff = offsetof(struct ipv6hdr, nexthdr);
+
+ /* inet6_protocol handlers return >0 on success, 0 on out of band
+ * consumption, <0 on error. We never expect to see 0 here.
+ */
+ err = _ipv6_frag_rcv(net, skb_cpy, IP6_DEFRAG_BPF);
+ if (err < 0)
+ return err;
+ else if (err == 0)
+ return -EINVAL;
+
+ skb_morph(skb, skb_cpy);
+ kfree_skb(skb_cpy);
+
+ /* _ipv6_frag_rcv() does not maintain mac header, so push empty header
+ * in so prog sees the correct layout. The empty mac header will be
+ * later pulled from cls_bpf.
+ */
+ skb->mac_len = mac_len;
+ mac = skb_push(skb, mac_len);
+ memset(mac, 0, mac_len);
+ bpf_compute_data_pointers(skb);
+
+ return 0;
+}
+
+__diag_pop()
+
+BTF_SET8_START(ipv6_reassembly_kfunc_set)
+BTF_ID_FLAGS(func, bpf_ipv6_frag_rcv, KF_CHANGES_PKT)
+BTF_SET8_END(ipv6_reassembly_kfunc_set)
+
+static const struct btf_kfunc_id_set ipv6_reassembly_bpf_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &ipv6_reassembly_kfunc_set,
+};
+
+int register_ipv6_reassembly_bpf(void)
+{
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
+ &ipv6_reassembly_bpf_kfunc_set);
+}
--
2.39.1
For connectionless protocols or raw sockets we do not want to actually
connect() to the server.
Signed-off-by: Daniel Xu <[email protected]>
---
tools/testing/selftests/bpf/network_helpers.c | 5 +++--
tools/testing/selftests/bpf/network_helpers.h | 1 +
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 01de33191226..24f5efebc7dd 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -301,8 +301,9 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
strlen(opts->cc) + 1))
goto error_close;
- if (connect_fd_to_addr(fd, &addr, addrlen, opts->must_fail))
- goto error_close;
+ if (!opts->noconnect)
+ if (connect_fd_to_addr(fd, &addr, addrlen, opts->must_fail))
+ goto error_close;
return fd;
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index f882c691b790..8be04cd76d8b 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -21,6 +21,7 @@ struct network_helper_opts {
const char *cc;
int timeout_ms;
bool must_fail;
+ bool noconnect;
};
/* ipv4 test vector */
--
2.39.1
Extend connect_to_fd_opts() to take optional type and protocol
parameters for the client socket. These parameters are useful when
opening a raw socket to send IP fragments.
Signed-off-by: Daniel Xu <[email protected]>
---
tools/testing/selftests/bpf/network_helpers.c | 21 +++++++++++++------
tools/testing/selftests/bpf/network_helpers.h | 2 ++
2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 24f5efebc7dd..4f9ba90b1b7e 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -270,14 +270,23 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
opts = &default_opts;
optlen = sizeof(type);
- if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
- log_err("getsockopt(SOL_TYPE)");
- return -1;
+
+ if (opts->type) {
+ type = opts->type;
+ } else {
+ if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
+ log_err("getsockopt(SOL_TYPE)");
+ return -1;
+ }
}
- if (getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen)) {
- log_err("getsockopt(SOL_PROTOCOL)");
- return -1;
+ if (opts->proto) {
+ protocol = opts->proto;
+ } else {
+ if (getsockopt(server_fd, SOL_SOCKET, SO_PROTOCOL, &protocol, &optlen)) {
+ log_err("getsockopt(SOL_PROTOCOL)");
+ return -1;
+ }
}
addrlen = sizeof(addr);
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 8be04cd76d8b..7119804ea79b 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -22,6 +22,8 @@ struct network_helper_opts {
int timeout_ms;
bool must_fail;
bool noconnect;
+ int type;
+ int proto;
};
/* ipv4 test vector */
--
2.39.1
These selftests cover two major scenarios: that BPF-based defragmentation
succeeds, and that packet pointers are invalidated after calls to the
kfunc. The logic is similar for both ipv4 and ipv6.
In the first scenario, we create a UDP client and UDP echo server. The
server side is fairly straightforward: we attach the prog and simply
echo back the message.
On the client side, we send fragmented packets to the server and expect
the reassembled message back.
Signed-off-by: Daniel Xu <[email protected]>
---
tools/testing/selftests/bpf/Makefile | 3 +-
.../selftests/bpf/generate_udp_fragments.py | 90 +++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 +++
.../bpf/prog_tests/ip_check_defrag.c | 327 ++++++++++++++++++
.../selftests/bpf/progs/bpf_tracing_net.h | 1 +
.../selftests/bpf/progs/ip_check_defrag.c | 133 +++++++
6 files changed, 610 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index b677dcd0b77a..979af1611139 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -558,7 +558,8 @@ TRUNNER_BPF_PROGS_DIR := progs
TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c \
network_helpers.c testing_helpers.c \
btf_helpers.c flow_dissector_load.h \
- cap_helpers.c test_loader.c xsk.c
+ cap_helpers.c test_loader.c xsk.c \
+ ip_check_defrag_frags.h
TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko \
$(OUTPUT)/liburandom_read.so \
$(OUTPUT)/xdp_synproxy \
diff --git a/tools/testing/selftests/bpf/generate_udp_fragments.py b/tools/testing/selftests/bpf/generate_udp_fragments.py
new file mode 100755
index 000000000000..2b8a1187991c
--- /dev/null
+++ b/tools/testing/selftests/bpf/generate_udp_fragments.py
@@ -0,0 +1,90 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+This script helps generate fragmented UDP packets.
+
+While it is technically possible to dynamically generate
+fragmented packets in C, it is much harder to read and write
+said code. `scapy` is relatively industry standard and really
+easy to read / write.
+
+So we choose to write this script that generates a valid C
+header. Rerun script and commit generated file after any
+modifications.
+"""
+
+import argparse
+import os
+
+from scapy.all import *
+
+
+# These constants must stay in sync with `ip_check_defrag.c`
+VETH1_ADDR = "172.16.1.200"
+VETH0_ADDR6 = "fc00::100"
+VETH1_ADDR6 = "fc00::200"
+CLIENT_PORT = 48878
+SERVER_PORT = 48879
+MAGIC_MESSAGE = "THIS IS THE ORIGINAL MESSAGE, PLEASE REASSEMBLE ME"
+
+
+def print_header(f):
+ f.write("// SPDX-License-Identifier: GPL-2.0\n")
+ f.write("/* DO NOT EDIT -- this file is generated */\n")
+ f.write("\n")
+ f.write("#ifndef _IP_CHECK_DEFRAG_FRAGS_H\n")
+ f.write("#define _IP_CHECK_DEFRAG_FRAGS_H\n")
+ f.write("\n")
+ f.write("#include <stdint.h>\n")
+ f.write("\n")
+
+
+def print_frags(f, frags, v6):
+ for idx, frag in enumerate(frags):
+ # 10 bytes per line to keep width in check
+ chunks = [frag[i : i + 10] for i in range(0, len(frag), 10)]
+ chunks_fmted = [", ".join([str(hex(b)) for b in chunk]) for chunk in chunks]
+ suffix = "6" if v6 else ""
+
+ f.write(f"static uint8_t frag{suffix}_{idx}[] = {{\n")
+ for chunk in chunks_fmted:
+ f.write(f"\t{chunk},\n")
+ f.write(f"}};\n")
+
+
+def print_trailer(f):
+ f.write("\n")
+ f.write("#endif /* _IP_CHECK_DEFRAG_FRAGS_H */\n")
+
+
+def main(f):
+ # srcip of 0 is filled in by IP_HDRINCL
+ sip = "0.0.0.0"
+ sip6 = VETH0_ADDR6
+ dip = VETH1_ADDR
+ dip6 = VETH1_ADDR6
+ sport = CLIENT_PORT
+ dport = SERVER_PORT
+ payload = MAGIC_MESSAGE.encode()
+
+ # Disable UDPv4 checksums to keep code simpler
+ pkt = IP(src=sip,dst=dip) / UDP(sport=sport,dport=dport,chksum=0) / Raw(load=payload)
+ # UDPv6 requires a checksum
+ # Also pin the ipv6 fragment header ID, otherwise it's a random value
+ pkt6 = IPv6(src=sip6,dst=dip6) / IPv6ExtHdrFragment(id=0xBEEF) / UDP(sport=sport,dport=dport) / Raw(load=payload)
+
+ frags = [f.build() for f in pkt.fragment(24)]
+ frags6 = [f.build() for f in fragment6(pkt6, 72)]
+
+ print_header(f)
+ print_frags(f, frags, False)
+ print_frags(f, frags6, True)
+ print_trailer(f)
+
+
+if __name__ == "__main__":
+ dir = os.path.dirname(os.path.realpath(__file__))
+ header = f"{dir}/ip_check_defrag_frags.h"
+ with open(header, "w") as f:
+ main(f)
diff --git a/tools/testing/selftests/bpf/ip_check_defrag_frags.h b/tools/testing/selftests/bpf/ip_check_defrag_frags.h
new file mode 100644
index 000000000000..70ab7e9fa22b
--- /dev/null
+++ b/tools/testing/selftests/bpf/ip_check_defrag_frags.h
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0
+/* DO NOT EDIT -- this file is generated */
+
+#ifndef _IP_CHECK_DEFRAG_FRAGS_H
+#define _IP_CHECK_DEFRAG_FRAGS_H
+
+#include <stdint.h>
+
+static uint8_t frag_0[] = {
+ 0x45, 0x0, 0x0, 0x2c, 0x0, 0x1, 0x20, 0x0, 0x40, 0x11,
+ 0xac, 0xe8, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8,
+ 0xbe, 0xee, 0xbe, 0xef, 0x0, 0x3a, 0x0, 0x0, 0x54, 0x48,
+ 0x49, 0x53, 0x20, 0x49, 0x53, 0x20, 0x54, 0x48, 0x45, 0x20,
+ 0x4f, 0x52, 0x49, 0x47,
+};
+static uint8_t frag_1[] = {
+ 0x45, 0x0, 0x0, 0x2c, 0x0, 0x1, 0x20, 0x3, 0x40, 0x11,
+ 0xac, 0xe5, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8,
+ 0x49, 0x4e, 0x41, 0x4c, 0x20, 0x4d, 0x45, 0x53, 0x53, 0x41,
+ 0x47, 0x45, 0x2c, 0x20, 0x50, 0x4c, 0x45, 0x41, 0x53, 0x45,
+ 0x20, 0x52, 0x45, 0x41,
+};
+static uint8_t frag_2[] = {
+ 0x45, 0x0, 0x0, 0x1e, 0x0, 0x1, 0x0, 0x6, 0x40, 0x11,
+ 0xcc, 0xf0, 0x0, 0x0, 0x0, 0x0, 0xac, 0x10, 0x1, 0xc8,
+ 0x53, 0x53, 0x45, 0x4d, 0x42, 0x4c, 0x45, 0x20, 0x4d, 0x45,
+};
+static uint8_t frag6_0[] = {
+ 0x60, 0x0, 0x0, 0x0, 0x0, 0x20, 0x2c, 0x40, 0xfc, 0x0,
+ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+ 0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0,
+ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0,
+ 0x11, 0x0, 0x0, 0x1, 0x0, 0x0, 0xbe, 0xef, 0xbe, 0xee,
+ 0xbe, 0xef, 0x0, 0x3a, 0xd0, 0xf8, 0x54, 0x48, 0x49, 0x53,
+ 0x20, 0x49, 0x53, 0x20, 0x54, 0x48, 0x45, 0x20, 0x4f, 0x52,
+ 0x49, 0x47,
+};
+static uint8_t frag6_1[] = {
+ 0x60, 0x0, 0x0, 0x0, 0x0, 0x20, 0x2c, 0x40, 0xfc, 0x0,
+ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+ 0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0,
+ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0,
+ 0x11, 0x0, 0x0, 0x19, 0x0, 0x0, 0xbe, 0xef, 0x49, 0x4e,
+ 0x41, 0x4c, 0x20, 0x4d, 0x45, 0x53, 0x53, 0x41, 0x47, 0x45,
+ 0x2c, 0x20, 0x50, 0x4c, 0x45, 0x41, 0x53, 0x45, 0x20, 0x52,
+ 0x45, 0x41,
+};
+static uint8_t frag6_2[] = {
+ 0x60, 0x0, 0x0, 0x0, 0x0, 0x12, 0x2c, 0x40, 0xfc, 0x0,
+ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+ 0x0, 0x0, 0x1, 0x0, 0xfc, 0x0, 0x0, 0x0, 0x0, 0x0,
+ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0,
+ 0x11, 0x0, 0x0, 0x30, 0x0, 0x0, 0xbe, 0xef, 0x53, 0x53,
+ 0x45, 0x4d, 0x42, 0x4c, 0x45, 0x20, 0x4d, 0x45,
+};
+
+#endif /* _IP_CHECK_DEFRAG_FRAGS_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c b/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
new file mode 100644
index 000000000000..c79c4096aab4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
@@ -0,0 +1,327 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <net/if.h>
+#include <network_helpers.h>
+#include "ip_check_defrag.skel.h"
+#include "ip_check_defrag_frags.h"
+
+/*
+ * This selftest spins up a client and an echo server, each in their own
+ * network namespace. The server will receive fragmented messages which
+ * the attached BPF prog should reassemble. We verify that reassembly
+ * occurred by checking the original (fragmented) message is received
+ * in whole.
+ *
+ * Topology:
+ * =========
+ * NS0 | NS1
+ * |
+ * client | server
+ * ---------- | ----------
+ * | veth0 | --------- | veth1 |
+ * ---------- peer ----------
+ * |
+ * | with bpf
+ */
+
+#define NS0 "defrag_ns0"
+#define NS1 "defrag_ns1"
+#define VETH0 "veth0"
+#define VETH1 "veth1"
+#define VETH0_ADDR "172.16.1.100"
+#define VETH0_ADDR6 "fc00::100"
+/* The following constants must stay in sync with `generate_udp_fragments.py` */
+#define VETH1_ADDR "172.16.1.200"
+#define VETH1_ADDR6 "fc00::200"
+#define CLIENT_PORT 48878
+#define SERVER_PORT 48879
+#define MAGIC_MESSAGE "THIS IS THE ORIGINAL MESSAGE, PLEASE REASSEMBLE ME"
+
+static char log_buf[1024 * 1024];
+
+static int setup_topology(bool ipv6)
+{
+ bool veth0_up;
+ bool veth1_up;
+ int i;
+
+ SYS(fail, "ip netns add " NS0);
+ SYS(fail, "ip netns add " NS1);
+ SYS(fail, "ip link add " VETH0 " netns " NS0 " type veth peer name " VETH1 " netns " NS1);
+ if (ipv6) {
+ SYS(fail, "ip -6 -net " NS0 " addr add " VETH0_ADDR6 "/64 dev " VETH0 " nodad");
+ SYS(fail, "ip -6 -net " NS1 " addr add " VETH1_ADDR6 "/64 dev " VETH1 " nodad");
+ } else {
+ SYS(fail, "ip -net " NS0 " addr add " VETH0_ADDR "/24 dev " VETH0);
+ SYS(fail, "ip -net " NS1 " addr add " VETH1_ADDR "/24 dev " VETH1);
+ }
+ SYS(fail, "ip -net " NS0 " link set dev " VETH0 " up");
+ SYS(fail, "ip -net " NS1 " link set dev " VETH1 " up");
+
+ /* Wait for up to 5s for links to come up */
+ for (i = 0; i < 50; ++i) {
+ veth0_up = !system("ip -net " NS0 " link show " VETH0 " | grep 'state UP'");
+ veth1_up = !system("ip -net " NS1 " link show " VETH1 " | grep 'state UP'");
+ if (veth0_up && veth1_up)
+ break;
+ usleep(100000);
+ }
+
+ if (!ASSERT_TRUE((veth0_up && veth1_up), "ifaces up"))
+ goto fail;
+
+ return 0;
+fail:
+ return -1;
+}
+
+static void cleanup_topology(void)
+{
+ SYS_NOFAIL("test -f /var/run/netns/" NS0 " && ip netns delete " NS0);
+ SYS_NOFAIL("test -f /var/run/netns/" NS1 " && ip netns delete " NS1);
+}
+
+static int attach(struct ip_check_defrag *skel)
+{
+ LIBBPF_OPTS(bpf_tc_hook, tc_hook,
+ .attach_point = BPF_TC_INGRESS);
+ LIBBPF_OPTS(bpf_tc_opts, tc_attach,
+ .prog_fd = bpf_program__fd(skel->progs.defrag));
+ struct nstoken *nstoken;
+ int err = -1;
+
+ nstoken = open_netns(NS1);
+
+ tc_hook.ifindex = if_nametoindex(VETH1);
+ if (!ASSERT_OK(bpf_tc_hook_create(&tc_hook), "bpf_tc_hook_create"))
+ goto out;
+
+ if (!ASSERT_OK(bpf_tc_attach(&tc_hook, &tc_attach), "bpf_tc_attach"))
+ goto out;
+
+ err = 0;
+out:
+ close_netns(nstoken);
+ return err;
+}
+
+static int send_frags(int client)
+{
+ struct sockaddr_storage saddr;
+ struct sockaddr *saddr_p;
+ socklen_t saddr_len;
+ int err;
+
+ saddr_p = (struct sockaddr *)&saddr;
+ err = make_sockaddr(AF_INET, VETH1_ADDR, SERVER_PORT, &saddr, &saddr_len);
+ if (!ASSERT_OK(err, "make_sockaddr"))
+ return -1;
+
+ err = sendto(client, frag_0, sizeof(frag_0), 0, saddr_p, saddr_len);
+ if (!ASSERT_GE(err, 0, "sendto frag_0"))
+ return -1;
+
+ err = sendto(client, frag_1, sizeof(frag_1), 0, saddr_p, saddr_len);
+ if (!ASSERT_GE(err, 0, "sendto frag_1"))
+ return -1;
+
+ err = sendto(client, frag_2, sizeof(frag_2), 0, saddr_p, saddr_len);
+ if (!ASSERT_GE(err, 0, "sendto frag_2"))
+ return -1;
+
+ return 0;
+}
+
+static int send_frags6(int client)
+{
+ struct sockaddr_storage saddr;
+ struct sockaddr *saddr_p;
+ socklen_t saddr_len;
+ int err;
+
+ saddr_p = (struct sockaddr *)&saddr;
+ /* Raw IPv6 sockets treat sin6_port as a protocol number, so leave it 0 */
+ err = make_sockaddr(AF_INET6, VETH1_ADDR6, 0, &saddr, &saddr_len);
+ if (!ASSERT_OK(err, "make_sockaddr"))
+ return -1;
+
+ err = sendto(client, frag6_0, sizeof(frag6_0), 0, saddr_p, saddr_len);
+ if (!ASSERT_GE(err, 0, "sendto frag6_0"))
+ return -1;
+
+ err = sendto(client, frag6_1, sizeof(frag6_1), 0, saddr_p, saddr_len);
+ if (!ASSERT_GE(err, 0, "sendto frag6_1"))
+ return -1;
+
+ err = sendto(client, frag6_2, sizeof(frag6_2), 0, saddr_p, saddr_len);
+ if (!ASSERT_GE(err, 0, "sendto frag6_2"))
+ return -1;
+
+ return 0;
+}
+
+void test_bpf_ip_check_defrag_ok(bool ipv6)
+{
+ struct network_helper_opts rx_opts = {
+ .timeout_ms = 1000,
+ .noconnect = true,
+ };
+ struct network_helper_opts tx_ops = {
+ .timeout_ms = 1000,
+ .type = SOCK_RAW,
+ .proto = IPPROTO_RAW,
+ .noconnect = true,
+ };
+ struct sockaddr_storage caddr;
+ struct ip_check_defrag *skel;
+ struct nstoken *nstoken;
+ int client_tx_fd = -1;
+ int client_rx_fd = -1;
+ socklen_t caddr_len = sizeof(caddr);
+ int srv_fd = -1;
+ char buf[1024];
+ int len, err;
+
+ skel = ip_check_defrag__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open"))
+ return;
+
+ if (!ASSERT_OK(setup_topology(ipv6), "setup_topology"))
+ goto out;
+
+ if (!ASSERT_OK(attach(skel), "attach"))
+ goto out;
+
+ /* Start server in ns1 */
+ nstoken = open_netns(NS1);
+ if (!ASSERT_OK_PTR(nstoken, "setns ns1"))
+ goto out;
+ srv_fd = start_server(ipv6 ? AF_INET6 : AF_INET, SOCK_DGRAM, NULL, SERVER_PORT, 0);
+ close_netns(nstoken);
+ if (!ASSERT_GE(srv_fd, 0, "start_server"))
+ goto out;
+
+ /* Open tx raw socket in ns0 */
+ nstoken = open_netns(NS0);
+ if (!ASSERT_OK_PTR(nstoken, "setns ns0"))
+ goto out;
+ client_tx_fd = connect_to_fd_opts(srv_fd, &tx_ops);
+ close_netns(nstoken);
+ if (!ASSERT_GE(client_tx_fd, 0, "connect_to_fd_opts"))
+ goto out;
+
+ /* Open rx socket in ns0 */
+ nstoken = open_netns(NS0);
+ if (!ASSERT_OK_PTR(nstoken, "setns ns0"))
+ goto out;
+ client_rx_fd = connect_to_fd_opts(srv_fd, &rx_opts);
+ close_netns(nstoken);
+ if (!ASSERT_GE(client_rx_fd, 0, "connect_to_fd_opts"))
+ goto out;
+
+ /* Bind rx socket to a predetermined port */
+ memset(&caddr, 0, sizeof(caddr));
+ nstoken = open_netns(NS0);
+ if (!ASSERT_OK_PTR(nstoken, "setns ns0"))
+ goto out;
+ if (ipv6) {
+ struct sockaddr_in6 *c = (struct sockaddr_in6 *)&caddr;
+
+ c->sin6_family = AF_INET6;
+ inet_pton(AF_INET6, VETH0_ADDR6, &c->sin6_addr);
+ c->sin6_port = htons(CLIENT_PORT);
+ err = bind(client_rx_fd, (struct sockaddr *)c, sizeof(*c));
+ } else {
+ struct sockaddr_in *c = (struct sockaddr_in *)&caddr;
+
+ c->sin_family = AF_INET;
+ inet_pton(AF_INET, VETH0_ADDR, &c->sin_addr);
+ c->sin_port = htons(CLIENT_PORT);
+ err = bind(client_rx_fd, (struct sockaddr *)c, sizeof(*c));
+ }
+ close_netns(nstoken);
+ if (!ASSERT_OK(err, "bind"))
+ goto out;
+
+ /* Send message in fragments */
+ if (ipv6) {
+ if (!ASSERT_OK(send_frags6(client_tx_fd), "send_frags6"))
+ goto out;
+ } else {
+ if (!ASSERT_OK(send_frags(client_tx_fd), "send_frags"))
+ goto out;
+ }
+
+ if (!ASSERT_EQ(skel->bss->frags_seen, 3, "frags_seen"))
+ goto out;
+
+ if (!ASSERT_FALSE(skel->data->is_final_frag, "is_final_frag"))
+ goto out;
+
+ /* Receive reassembled msg on server and echo back to client */
+ len = recvfrom(srv_fd, buf, sizeof(buf), 0, (struct sockaddr *)&caddr, &caddr_len);
+ if (!ASSERT_GE(len, 0, "server recvfrom"))
+ goto out;
+ len = sendto(srv_fd, buf, len, 0, (struct sockaddr *)&caddr, caddr_len);
+ if (!ASSERT_GE(len, 0, "server sendto"))
+ goto out;
+
+ /* Expect reassembled message to be echoed back */
+ len = recvfrom(client_rx_fd, buf, sizeof(buf), 0, NULL, NULL);
+ if (!ASSERT_EQ(len, sizeof(MAGIC_MESSAGE) - 1, "client short read"))
+ goto out;
+
+out:
+ if (client_rx_fd != -1)
+ close(client_rx_fd);
+ if (client_tx_fd != -1)
+ close(client_tx_fd);
+ if (srv_fd != -1)
+ close(srv_fd);
+ cleanup_topology();
+ ip_check_defrag__destroy(skel);
+}
+
+void test_bpf_ip_check_defrag_fail(void)
+{
+ const char *err_msg = "invalid mem access 'scalar'";
+ LIBBPF_OPTS(bpf_object_open_opts, opts,
+ .kernel_log_buf = log_buf,
+ .kernel_log_size = sizeof(log_buf),
+ .kernel_log_level = 1);
+ struct ip_check_defrag *skel;
+ struct bpf_program *prog;
+ int err;
+
+ skel = ip_check_defrag__open_opts(&opts);
+ if (!ASSERT_OK_PTR(skel, "ip_check_defrag__open_opts"))
+ return;
+
+ prog = bpf_object__find_program_by_name(skel->obj, "defrag_fail");
+ if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+ goto out;
+
+ bpf_program__set_autoload(prog, true);
+
+ err = ip_check_defrag__load(skel);
+ if (!ASSERT_ERR(err, "ip_check_defrag__load must fail"))
+ goto out;
+
+ if (!ASSERT_OK_PTR(strstr(log_buf, err_msg), "expected error message")) {
+ fprintf(stderr, "Expected: %s\n", err_msg);
+ fprintf(stderr, "Verifier: %s\n", log_buf);
+ }
+
+out:
+ ip_check_defrag__destroy(skel);
+}
+
+void test_bpf_ip_check_defrag(void)
+{
+ if (test__start_subtest("ok-v4"))
+ test_bpf_ip_check_defrag_ok(false);
+ if (test__start_subtest("ok-v6"))
+ test_bpf_ip_check_defrag_ok(true);
+ if (test__start_subtest("fail"))
+ test_bpf_ip_check_defrag_fail();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index cfed4df490f3..fde688b8af16 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -26,6 +26,7 @@
#define IPV6_AUTOFLOWLABEL 70
#define TC_ACT_UNSPEC (-1)
+#define TC_ACT_OK 0
#define TC_ACT_SHOT 2
#define SOL_TCP 6
diff --git a/tools/testing/selftests/bpf/progs/ip_check_defrag.c b/tools/testing/selftests/bpf/progs/ip_check_defrag.c
new file mode 100644
index 000000000000..5978fd2dd479
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/ip_check_defrag.c
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "bpf_tracing_net.h"
+
+#define BPF_F_CURRENT_NETNS (-1)
+#define ETH_P_IP 0x0800
+#define ETH_P_IPV6 0x86DD
+#define IP_DF 0x4000
+#define IP_MF 0x2000
+#define IP_OFFSET 0x1FFF
+#define NEXTHDR_FRAGMENT 44
+#define ctx_ptr(field) (void *)(long)(field)
+
+int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns) __ksym;
+int bpf_ipv6_frag_rcv(struct __sk_buff *ctx, u64 netns) __ksym;
+
+volatile int frags_seen = 0;
+volatile bool is_final_frag = true;
+
+static bool is_frag_v4(struct iphdr *iph)
+{
+ int offset;
+ int flags;
+
+ offset = bpf_ntohs(iph->frag_off);
+ flags = offset & ~IP_OFFSET;
+ offset &= IP_OFFSET;
+ offset <<= 3;
+
+ return (flags & IP_MF) || offset;
+}
+
+static bool is_frag_v6(struct ipv6hdr *ip6h)
+{
+ /* Simplifying assumption that there are no extension headers
+ * between fixed header and fragmentation header. This assumption
+ * is only valid in this test case. It saves us the hassle of
+ * searching all potential extension headers.
+ */
+ return ip6h->nexthdr == NEXTHDR_FRAGMENT;
+}
+
+static int defrag_v4(struct __sk_buff *skb)
+{
+ void *data_end = ctx_ptr(skb->data_end);
+ void *data = ctx_ptr(skb->data);
+ struct iphdr *iph;
+
+ iph = data + sizeof(struct ethhdr);
+ if (iph + 1 > data_end)
+ return TC_ACT_SHOT;
+
+ if (!is_frag_v4(iph))
+ return TC_ACT_OK;
+
+ frags_seen++;
+ if (bpf_ip_check_defrag(skb, BPF_F_CURRENT_NETNS))
+ return TC_ACT_SHOT;
+
+ data_end = ctx_ptr(skb->data_end);
+ data = ctx_ptr(skb->data);
+ iph = data + sizeof(struct ethhdr);
+ if (iph + 1 > data_end)
+ return TC_ACT_SHOT;
+ is_final_frag = is_frag_v4(iph);
+
+ return TC_ACT_OK;
+}
+
+static int defrag_v6(struct __sk_buff *skb)
+{
+ void *data_end = ctx_ptr(skb->data_end);
+ void *data = ctx_ptr(skb->data);
+ struct ipv6hdr *ip6h;
+
+ ip6h = data + sizeof(struct ethhdr);
+ if (ip6h + 1 > data_end)
+ return TC_ACT_SHOT;
+
+ if (!is_frag_v6(ip6h))
+ return TC_ACT_OK;
+
+ frags_seen++;
+ if (bpf_ipv6_frag_rcv(skb, BPF_F_CURRENT_NETNS))
+ return TC_ACT_SHOT;
+
+ data_end = ctx_ptr(skb->data_end);
+ data = ctx_ptr(skb->data);
+ ip6h = data + sizeof(struct ethhdr);
+ if (ip6h + 1 > data_end)
+ return TC_ACT_SHOT;
+ is_final_frag = is_frag_v6(ip6h);
+
+ return TC_ACT_OK;
+}
+
+SEC("tc")
+int defrag(struct __sk_buff *skb)
+{
+ switch (bpf_ntohs(skb->protocol)) {
+ case ETH_P_IP:
+ return defrag_v4(skb);
+ case ETH_P_IPV6:
+ return defrag_v6(skb);
+ default:
+ return TC_ACT_OK;
+ }
+}
+
+SEC("?tc")
+int defrag_fail(struct __sk_buff *skb)
+{
+ void *data_end = ctx_ptr(skb->data_end);
+ void *data = ctx_ptr(skb->data);
+ struct iphdr *iph;
+
+ if (skb->protocol != bpf_htons(ETH_P_IP))
+ return TC_ACT_OK;
+
+ iph = data + sizeof(struct ethhdr);
+ if (iph + 1 > data_end)
+ return TC_ACT_SHOT;
+
+ if (bpf_ip_check_defrag(skb, BPF_F_CURRENT_NETNS))
+ return TC_ACT_SHOT;
+
+ /* Boom. Must revalidate pkt ptrs */
+ return iph->ttl ? TC_ACT_OK : TC_ACT_SHOT;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.39.1
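For anyone checking the generated header by hand, here is a small sketch that decodes the three IPv4 fragments above with the same mask-and-shift logic as is_frag_v4(). The helper names (frag_info, reassemble) are made up for illustration, and the byte strings are transcribed from ip_check_defrag_frags.h with the payload bytes written as ASCII. Treat it as a reading aid, not as part of the patch.

```python
import struct

IP_MF = 0x2000      # "more fragments" flag
IP_OFFSET = 0x1FFF  # fragment offset, in 8-byte units

# IPv4 header (hex) + payload, transcribed from ip_check_defrag_frags.h.
frag_0 = (bytes.fromhex("4500002c000120004011ace800000000ac1001c8")
          + bytes.fromhex("beeebeef003a0000")  # UDP header
          + b"THIS IS THE ORIG")
frag_1 = (bytes.fromhex("4500002c000120034011ace500000000ac1001c8")
          + b"INAL MESSAGE, PLEASE REA")
frag_2 = (bytes.fromhex("4500001e000100064011ccf000000000ac1001c8")
          + b"SSEMBLE ME")


def frag_info(pkt):
    """Return (ip_id, byte_offset, more_frags, payload) for one IPv4 packet."""
    ihl = (pkt[0] & 0x0F) * 4
    total_len, ip_id, frag_off = struct.unpack_from("!HHH", pkt, 2)
    more_frags = bool(frag_off & IP_MF)
    offset = (frag_off & IP_OFFSET) << 3
    return ip_id, offset, more_frags, pkt[ihl:total_len]


def reassemble(pkts):
    """Stitch fragment payloads together by offset; no overlap handling."""
    out = bytearray()
    for _, offset, _, payload in sorted((frag_info(p) for p in pkts),
                                        key=lambda info: info[1]):
        if offset != len(out):
            raise ValueError("gap or overlap at offset %d" % offset)
        out += payload
    return bytes(out)
```

Decoded this way, the fragments sit at offsets 0, 24 and 48 with MF set on the first two only, and the stitched payload is the 8-byte UDP header followed by MAGIC_MESSAGE.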
On 27/02/2023 19:51, Daniel Xu wrote:
> However, when policy is enforced through BPF, the prog is run before the
> kernel reassembles fragmented packets. This leaves BPF developers in a
> awkward place: implement reassembly (possibly poorly) or use a stateless
> method as described above.
Just out of curiosity - what stops BPF progs using the middle ground of
stateful validation? I'm thinking of something like:
First-frag: run the usual checks on L4 headers etc, if we PASS then save
IPID and maybe expected next frag-offset into a map. But don't try to
stash the packet contents anywhere for later reassembly, just PASS it.
Subsequent frags: look up the IPID in the map. If we find it, validate
and update the frag-offset in the map; if this is the last fragment then
delete the map entry. If the frag-offset was bogus or the IPID wasn't
found in the map, DROP; otherwise PASS.
(If re-ordering is prevalent then use something more sophisticated than
just expected next frag-offset, but the principle is the same. And of
course you might want to put in timers for expiry etc.)
So this avoids the need to stash the packet data and modify/consume SKBs,
because you're not actually doing reassembly; the down-side is that the
BPF program can't so easily make decisions about the application-layer
contents of the fragmented datagram, but for the common case (we just
care about the 5-tuple) it's simple enough.
But I haven't actually tried it, so maybe there's some obvious reason why
it can't work this way.
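As a rough illustration of the scheme above, here is a minimal Python model (not BPF code). FragValidator, its check() method and the l4_ok argument are hypothetical names; the dict stands in for a BPF map, and keying on (src, dst, proto, IPID) rather than IPID alone is an assumption. It only handles in-order fragments and has no expiry timers.

```python
PASS, DROP = "PASS", "DROP"


class FragValidator:
    """Toy model: tracks the next expected offset per fragmented datagram."""

    def __init__(self):
        # Stand-in for a BPF map: key -> next expected byte offset.
        self.expected = {}

    def check(self, key, offset, length, more_frags, l4_ok=True):
        if offset == 0:
            # First fragment: run the usual L4 policy, then remember the flow.
            if not l4_ok:
                return DROP
            if more_frags:
                self.expected[key] = length
            return PASS
        nxt = self.expected.get(key)
        if nxt is None or offset != nxt:
            # Unknown IPID or bogus offset.
            return DROP
        if more_frags:
            self.expected[key] = offset + length
        else:
            # Last fragment: the datagram is done, forget it.
            del self.expected[key]
        return PASS
```

Fed the three IPv4 test fragments from the selftest (offsets 0, 24, 48), this passes all three and then drops a replay of the last one, since the map entry is gone by then.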
-ed
Hi Ed,
Thanks for giving this a look.
On Mon, Feb 27, 2023 at 08:38:41PM +0000, Edward Cree wrote:
> On 27/02/2023 19:51, Daniel Xu wrote:
> > However, when policy is enforced through BPF, the prog is run before the
> > kernel reassembles fragmented packets. This leaves BPF developers in a
> > awkward place: implement reassembly (possibly poorly) or use a stateless
> > method as described above.
>
> Just out of curiosity - what stops BPF progs using the middle ground of
> stateful validation? I'm thinking of something like:
> First-frag: run the usual checks on L4 headers etc, if we PASS then save
> IPID and maybe expected next frag-offset into a map. But don't try to
> stash the packet contents anywhere for later reassembly, just PASS it.
> Subsequent frags: look up the IPID in the map. If we find it, validate
> and update the frag-offset in the map; if this is the last fragment then
> delete the map entry. If the frag-offset was bogus or the IPID wasn't
> found in the map, DROP; otherwise PASS.
> (If re-ordering is prevalent then use something more sophisticated than
> just expected next frag-offset, but the principle is the same. And of
> course you might want to put in timers for expiry etc.)
> So this avoids the need to stash the packet data and modify/consume SKBs,
> because you're not actually doing reassembly; the down-side is that the
> BPF program can't so easily make decisions about the application-layer
> contents of the fragmented datagram, but for the common case (we just
> care about the 5-tuple) it's simple enough.
> But I haven't actually tried it, so maybe there's some obvious reason why
> it can't work this way.
I don't believe full L4 headers are required in the first fragment.
Sufficiently sneaky attackers can, I think, send a byte at a time to
subvert your proposed algorithm. Storing skb data seems inevitable here.
Someone can correct me if I'm wrong here.
Reordering like you mentioned is another attack vector. Perhaps there
are more sophisticated semi-stateful algorithms that can solve the
problem, but it leads me to my next point.
A semi-stateful method like the one you are proposing concerns me from a
reliability and correctness standpoint. Such a method can suffer from
impedance mismatches with the rest of the system. For example, whatever
map sizes you choose should probably be aligned with sysfs conntrack
values, otherwise you may get some very interesting and unexpected pkt
drops. I think Cilium had a talk about debugging a related conntrack
issue in the same vein a while ago. Furthermore, the debugging and
troubleshooting facilities will be different (counters, logs, etc.).
Unless someone has had lots of experience writing an ip stack from
the ground up, I suspect there are quite a few more unknown-unknowns
here. What I find valuable about this patch series is that we can
leverage the well understood and battle hardened kernel facilities. So
avoid all the correctness and security issues that the kernel has spent
20+ years fixing. And make it trivial for the next person that comes
along to do the right thing.
Hopefully this all makes sense.
Thanks,
Daniel
On 27/02/2023 22:04, Daniel Xu wrote:
> I don't believe full L4 headers are required in the first fragment.
> Sufficiently sneaky attackers can, I think, send a byte at a time to
> subvert your proposed algorithm. Storing skb data seems inevitable here.
> Someone can correct me if I'm wrong here.
My thinking was that legitimate traffic would never do this and thus if
your first fragment doesn't have enough data to make a determination
then you just DROP the packet.
> What I find valuable about this patch series is that we can
> leverage the well understood and battle hardened kernel facilities. So
> avoid all the correctness and security issues that the kernel has spent
> 20+ years fixing.
I can certainly see the argument here. I guess it's a question of are
you more worried about the DoS from tricking the validator into thinking
good fragments are bad (the reverse is irrelevant because if you can
trick a validator into thinking your bad fragment belongs to a previously
seen good packet, then you can equally trick a reassembler into stitching
your bad fragment into that packet), or are you more worried about the
DoS from tying lots of memory down in the reassembly cache.
Even with reordering handling, a data structure to record which ranges of
a packet have been seen takes much less memory than storing the complete
fragment bodies. (Just a simple bitmap of 8-byte blocks — the resolution
of iph->frag_off — reduces size by a factor of 64, not counting all the
overhead of a struct sk_buff for each fragment in the queue. Or you
could re-use the rbtree-based code from the reassembler, just with a
freshly allocated node containing only offset & length, instead of the
whole SKB.)
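To put rough numbers on the bitmap idea: a datagram is at most 65535 bytes, i.e. 8192 blocks of 8 bytes, so one bit per block needs 1 KiB per datagram in flight. Below is a minimal Python model, assuming byte offsets already decoded from frag_off. SeenBitmap and its methods are hypothetical names, not an existing kernel or BPF API, and the policy of rejecting any overlap is a simplification.

```python
MAX_DATAGRAM = 65535  # upper bound on datagram size, in bytes
BLOCK = 8             # frag_off counts 8-byte units
BITMAP_BYTES = (MAX_DATAGRAM + BLOCK - 1) // BLOCK // 8  # 8192 bits


class SeenBitmap:
    """Records which 8-byte blocks of one datagram have been seen."""

    def __init__(self):
        self.bits = 0             # bit i set == block i seen
        self.total_blocks = None  # known once the MF=0 fragment arrives

    def add(self, offset, length, more_frags):
        """Record one fragment; return False if it is inconsistent."""
        if length <= 0 or offset % BLOCK or offset + length > MAX_DATAGRAM:
            return False
        if more_frags and length % BLOCK:
            return False          # only the last fragment may be ragged
        first = offset // BLOCK
        end = first + (length + BLOCK - 1) // BLOCK
        mask = ((1 << (end - first)) - 1) << first
        if self.bits & mask:
            return False          # overlaps data already seen
        if not more_frags:
            if self.total_blocks is not None or self.bits >> end:
                return False      # second "last" fragment, or data past the end
            self.total_blocks = end
        elif self.total_blocks is not None and end > self.total_blocks:
            return False
        self.bits |= mask
        return True

    def complete(self):
        """True once every block up to the last fragment has been seen."""
        return (self.total_blocks is not None
                and self.bits == (1 << self.total_blocks) - 1)
```

Because the bitmap is order-independent, the last fragment can arrive first and complete() still only turns true once every block has been covered.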
And having a BPF helper effectively consume the skb is awkward, as you
noted; someone is likely to decide that skb_copy() is too slow, try to
add ctx invalidation, and thereby create a whole new swathe of potential
correctness and security issues.
Plus, imagine trying to support this in a hardware-offload XDP device.
They'd have to reimplement the entire frag cache, which is a much bigger
attack surface than just a frag validator, and they couldn't leverage
the battle-hardened kernel implementation.
> And make it trivial for the next person that comes
> along to do the right thing.
Fwiw the validator approach could *also* be a helper, it doesn't have to
be something the BPF developer writes for themselves.
But if after thinking about the possibility you still prefer your way, I
won't try to stop you — I just wanted to ensure it had been considered.
-ed
On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
> === Context ===
>
> In the context of a middlebox, fragmented packets are tricky to handle.
> The full 5-tuple of a packet is often only available in the first
> fragment which makes enforcing consistent policy difficult. There are
> really only two stateless options, neither of which are very nice:
>
> 1. Enforce policy on first fragment and accept all subsequent fragments.
> This works but may let in certain attacks or allow data exfiltration.
>
> 2. Enforce policy on first fragment and drop all subsequent fragments.
> This does not really work b/c some protocols may rely on
> fragmentation. For example, DNS may rely on oversized UDP packets for
> large responses.
>
> So stateful tracking is the only sane option. RFC 8900 [0] calls this
> out as well in section 6.3:
>
> Middleboxes [...] should process IP fragments in a manner that is
> consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
> must maintain state in order to achieve this goal.
>
> === BPF related bits ===
>
> However, when policy is enforced through BPF, the prog is run before the
> kernel reassembles fragmented packets. This leaves BPF developers in a
> awkward place: implement reassembly (possibly poorly) or use a stateless
> method as described above.
>
> Fortunately, the kernel has robust support for fragmented IP packets.
> This patchset wraps the existing defragmentation facilities in kfuncs so
> that BPF progs running on middleboxes can reassemble fragmented packets
> before applying policy.
>
> === Patchset details ===
>
> This patchset is (hopefully) relatively straightforward from BPF perspective.
> One thing I'd like to call out is the skb_copy()ing of the prog skb. I
> did this to maintain the invariant that the ctx remains valid after prog
> has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
> consume the skb if the skb is a fragment.
Instead of doing all that with extra skb copy can you hook bpf prog after
the networking stack already handled ip defrag?
What kind of middle box are you doing? Why does it have to run at TC layer?
On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <[email protected]> wrote:
>
> Hi Alexei,
>
> On Mon, Feb 27, 2023 at 03:03:38PM -0800, Alexei Starovoitov wrote:
> > On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
> > > === Context ===
> > >
> > > In the context of a middlebox, fragmented packets are tricky to handle.
> > > The full 5-tuple of a packet is often only available in the first
> > > fragment which makes enforcing consistent policy difficult. There are
> > > really only two stateless options, neither of which are very nice:
> > >
> > > 1. Enforce policy on first fragment and accept all subsequent fragments.
> > > This works but may let in certain attacks or allow data exfiltration.
> > >
> > > 2. Enforce policy on first fragment and drop all subsequent fragments.
> > > This does not really work b/c some protocols may rely on
> > > fragmentation. For example, DNS may rely on oversized UDP packets for
> > > large responses.
> > >
> > > So stateful tracking is the only sane option. RFC 8900 [0] calls this
> > > out as well in section 6.3:
> > >
> > > Middleboxes [...] should process IP fragments in a manner that is
> > > consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
> > > must maintain state in order to achieve this goal.
> > >
> > > === BPF related bits ===
> > >
> > > However, when policy is enforced through BPF, the prog is run before the
> > > kernel reassembles fragmented packets. This leaves BPF developers in a
> > > awkward place: implement reassembly (possibly poorly) or use a stateless
> > > method as described above.
> > >
> > > Fortunately, the kernel has robust support for fragmented IP packets.
> > > This patchset wraps the existing defragmentation facilities in kfuncs so
> > > that BPF progs running on middleboxes can reassemble fragmented packets
> > > before applying policy.
> > >
> > > === Patchset details ===
> > >
> > > This patchset is (hopefully) relatively straightforward from BPF perspective.
> > > One thing I'd like to call out is the skb_copy()ing of the prog skb. I
> > > did this to maintain the invariant that the ctx remains valid after prog
> > > has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
> > > consume the skb if the skb is a fragment.
> >
> > Instead of doing all that with extra skb copy can you hook bpf prog after
> > the networking stack already handled ip defrag?
> > What kind of middle box are you doing? Why does it have to run at TC layer?
>
> Unless I'm missing something, the only other relevant hooks would be
> socket hooks, right?
>
> Unfortunately I don't think my use case can do that. We are running the
> kernel as a router, so no sockets are involved.
Are you using bpf_fib_lookup, populating the kernel routing
table, and doing everything on your own, including neigh?
Have you considered redirecting the skb to another netdev that does ip defrag?
macvlan does that under some conditions. This can be generalized.
Recently Florian proposed to allow calling bpf progs from all existing
netfilter hooks.
Could you pretend to deliver locally and hook in at NF_INET_LOCAL_IN?
I feel it would be so much cleaner if the stack does ip_defrag normally.
The general issue of skb ownership between bpf prog and defrag logic
isn't really solved with skb_copy. It's still an issue.
Hi Daniel,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Daniel-Xu/ip-frags-Return-actual-error-codes-from-ip_check_defrag/20230228-035449
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/bce083a4293eefb048a700b5a6086e8d8c957700.1677526810.git.dxu%40dxuuu.xyz
patch subject: [PATCH bpf-next v2 5/8] bpf: net: ipv6: Add bpf_ipv6_frag_rcv() kfunc
config: i386-defconfig (https://download.01.org/0day-ci/archive/20230228/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/be4610312351d4a658435bd4649a3a830322396d
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Daniel-Xu/ip-frags-Return-actual-error-codes-from-ip_check_defrag/20230228-035449
git checkout be4610312351d4a658435bd4649a3a830322396d
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 olddefconfig
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
ld: net/ipv6/af_inet6.o: in function `inet6_init':
>> af_inet6.c:(.init.text+0x22a): undefined reference to `register_ipv6_reassembly_bpf'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
Hi Daniel,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Daniel-Xu/ip-frags-Return-actual-error-codes-from-ip_check_defrag/20230228-035449
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/bce083a4293eefb048a700b5a6086e8d8c957700.1677526810.git.dxu%40dxuuu.xyz
patch subject: [PATCH bpf-next v2 5/8] bpf: net: ipv6: Add bpf_ipv6_frag_rcv() kfunc
config: i386-debian-10.3 (https://download.01.org/0day-ci/archive/20230228/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/be4610312351d4a658435bd4649a3a830322396d
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Daniel-Xu/ip-frags-Return-actual-error-codes-from-ip_check_defrag/20230228-035449
git checkout be4610312351d4a658435bd4649a3a830322396d
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 olddefconfig
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
ld: net/ipv6/af_inet6.o: in function `inet6_init':
>> net/ipv6/af_inet6.c:1177: undefined reference to `register_ipv6_reassembly_bpf'
vim +1177 net/ipv6/af_inet6.c
1061
1062 static int __init inet6_init(void)
1063 {
1064 struct list_head *r;
1065 int err = 0;
1066
1067 sock_skb_cb_check_size(sizeof(struct inet6_skb_parm));
1068
1069 /* Register the socket-side information for inet6_create. */
1070 for (r = &inetsw6[0]; r < &inetsw6[SOCK_MAX]; ++r)
1071 INIT_LIST_HEAD(r);
1072
1073 raw_hashinfo_init(&raw_v6_hashinfo);
1074
1075 if (disable_ipv6_mod) {
1076 pr_info("Loaded, but administratively disabled, reboot required to enable\n");
1077 goto out;
1078 }
1079
1080 err = proto_register(&tcpv6_prot, 1);
1081 if (err)
1082 goto out;
1083
1084 err = proto_register(&udpv6_prot, 1);
1085 if (err)
1086 goto out_unregister_tcp_proto;
1087
1088 err = proto_register(&udplitev6_prot, 1);
1089 if (err)
1090 goto out_unregister_udp_proto;
1091
1092 err = proto_register(&rawv6_prot, 1);
1093 if (err)
1094 goto out_unregister_udplite_proto;
1095
1096 err = proto_register(&pingv6_prot, 1);
1097 if (err)
1098 goto out_unregister_raw_proto;
1099
1100 /* We MUST register RAW sockets before we create the ICMP6,
1101 * IGMP6, or NDISC control sockets.
1102 */
1103 err = rawv6_init();
1104 if (err)
1105 goto out_unregister_ping_proto;
1106
1107 /* Register the family here so that the init calls below will
1108 * be able to create sockets. (?? is this dangerous ??)
1109 */
1110 err = sock_register(&inet6_family_ops);
1111 if (err)
1112 goto out_sock_register_fail;
1113
1114 /*
1115 * ipngwg API draft makes clear that the correct semantics
1116 * for TCP and UDP is to consider one TCP and UDP instance
1117 * in a host available by both INET and INET6 APIs and
1118 * able to communicate via both network protocols.
1119 */
1120
1121 err = register_pernet_subsys(&inet6_net_ops);
1122 if (err)
1123 goto register_pernet_fail;
1124 err = ip6_mr_init();
1125 if (err)
1126 goto ipmr_fail;
1127 err = icmpv6_init();
1128 if (err)
1129 goto icmp_fail;
1130 err = ndisc_init();
1131 if (err)
1132 goto ndisc_fail;
1133 err = igmp6_init();
1134 if (err)
1135 goto igmp_fail;
1136
1137 err = ipv6_netfilter_init();
1138 if (err)
1139 goto netfilter_fail;
1140 /* Create /proc/foo6 entries. */
1141 #ifdef CONFIG_PROC_FS
1142 err = -ENOMEM;
1143 if (raw6_proc_init())
1144 goto proc_raw6_fail;
1145 if (udplite6_proc_init())
1146 goto proc_udplite6_fail;
1147 if (ipv6_misc_proc_init())
1148 goto proc_misc6_fail;
1149 if (if6_proc_init())
1150 goto proc_if6_fail;
1151 #endif
1152 err = ip6_route_init();
1153 if (err)
1154 goto ip6_route_fail;
1155 err = ndisc_late_init();
1156 if (err)
1157 goto ndisc_late_fail;
1158 err = ip6_flowlabel_init();
1159 if (err)
1160 goto ip6_flowlabel_fail;
1161 err = ipv6_anycast_init();
1162 if (err)
1163 goto ipv6_anycast_fail;
1164 err = addrconf_init();
1165 if (err)
1166 goto addrconf_fail;
1167
1168 /* Init v6 extension headers. */
1169 err = ipv6_exthdrs_init();
1170 if (err)
1171 goto ipv6_exthdrs_fail;
1172
1173 err = ipv6_frag_init();
1174 if (err)
1175 goto ipv6_frag_fail;
1176
> 1177 err = register_ipv6_reassembly_bpf();
1178 if (err)
1179 goto ipv6_frag_fail;
1180
1181 /* Init v6 transport protocols. */
1182 err = udpv6_init();
1183 if (err)
1184 goto udpv6_fail;
1185
1186 err = udplitev6_init();
1187 if (err)
1188 goto udplitev6_fail;
1189
1190 err = udpv6_offload_init();
1191 if (err)
1192 goto udpv6_offload_fail;
1193
1194 err = tcpv6_init();
1195 if (err)
1196 goto tcpv6_fail;
1197
1198 err = ipv6_packet_init();
1199 if (err)
1200 goto ipv6_packet_fail;
1201
1202 err = pingv6_init();
1203 if (err)
1204 goto pingv6_fail;
1205
1206 err = calipso_init();
1207 if (err)
1208 goto calipso_fail;
1209
1210 err = seg6_init();
1211 if (err)
1212 goto seg6_fail;
1213
1214 err = rpl_init();
1215 if (err)
1216 goto rpl_fail;
1217
1218 err = ioam6_init();
1219 if (err)
1220 goto ioam6_fail;
1221
1222 err = igmp6_late_init();
1223 if (err)
1224 goto igmp6_late_err;
1225
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
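An undefined reference like the one reported above usually means the call site was compiled for this config while the object that defines the function was not, i.e. the header guard and the Makefile condition disagree. The IPv4 patch in this series guards both the declaration and the object with the same symbol. Below is a minimal userspace sketch of that declaration/stub pattern. DEMO_CONFIG_REASSEMBLY_BPF and the demo_* names are hypothetical, and the actual cause and fix on the IPv6 side are not shown in this thread.

```c
#include <stdio.h>

/*
 * DEMO_CONFIG_REASSEMBLY_BPF is a hypothetical stand-in for a Kconfig
 * symbol. Leave it undefined to get the inline stub; define it only if the
 * object providing demo_register_reassembly_bpf() is also linked in,
 * otherwise the linker reports an undefined reference.
 */
#ifdef DEMO_CONFIG_REASSEMBLY_BPF
int demo_register_reassembly_bpf(void);	/* real definition lives elsewhere */
#else
static inline int demo_register_reassembly_bpf(void)
{
	return 0;	/* feature compiled out: registration is a no-op */
}
#endif

/* Mirrors the shape of the failing call site in inet6_init(). */
int demo_init(void)
{
	int err = demo_register_reassembly_bpf();

	if (err)
		fprintf(stderr, "registration failed: %d\n", err);
	return err;
}
```

With the macro undefined, the stub is used and demo_init() returns 0 without needing any other object file.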
On 2/28/23 5:56 AM, Alexei Starovoitov wrote:
> On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <[email protected]> wrote:
>> On Mon, Feb 27, 2023 at 03:03:38PM -0800, Alexei Starovoitov wrote:
>>> On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
>>>> === Context ===
>>>>
>>>> In the context of a middlebox, fragmented packets are tricky to handle.
>>>> The full 5-tuple of a packet is often only available in the first
>>>> fragment which makes enforcing consistent policy difficult. There are
>>>> really only two stateless options, neither of which are very nice:
>>>>
>>>> 1. Enforce policy on first fragment and accept all subsequent fragments.
>>>> This works but may let in certain attacks or allow data exfiltration.
>>>>
>>>> 2. Enforce policy on first fragment and drop all subsequent fragments.
>>>> This does not really work b/c some protocols may rely on
>>>> fragmentation. For example, DNS may rely on oversized UDP packets for
>>>> large responses.
>>>>
>>>> So stateful tracking is the only sane option. RFC 8900 [0] calls this
>>>> out as well in section 6.3:
>>>>
>>>> Middleboxes [...] should process IP fragments in a manner that is
>>>> consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
>>>> must maintain state in order to achieve this goal.
>>>>
>>>> === BPF related bits ===
>>>>
>>>> However, when policy is enforced through BPF, the prog is run before the
>>>> kernel reassembles fragmented packets. This leaves BPF developers in a
>>>> awkward place: implement reassembly (possibly poorly) or use a stateless
>>>> method as described above.
>>>>
>>>> Fortunately, the kernel has robust support for fragmented IP packets.
>>>> This patchset wraps the existing defragmentation facilities in kfuncs so
>>>> that BPF progs running on middleboxes can reassemble fragmented packets
>>>> before applying policy.
>>>>
>>>> === Patchset details ===
>>>>
>>>> This patchset is (hopefully) relatively straightforward from BPF perspective.
>>>> One thing I'd like to call out is the skb_copy()ing of the prog skb. I
>>>> did this to maintain the invariant that the ctx remains valid after prog
>>>> has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
>>>> consume the skb if the skb is a fragment.
>>>
>>> Instead of doing all that with extra skb copy can you hook bpf prog after
>>> the networking stack already handled ip defrag?
>>> What kind of middle box are you doing? Why does it have to run at TC layer?
>>
>> Unless I'm missing something, the only other relevant hooks would be
>> socket hooks, right?
>>
>> Unfortunately I don't think my use case can do that. We are running the
>> kernel as a router, so no sockets are involved.
>
> Are you using bpf_fib_lookup and populating kernel routing
> table and doing everything on your own including neigh ?
>
> Have you considered to skb redirect to another netdev that does ip defrag?
> Like macvlan does it under some conditions. This can be generalized.
>
> Recently Florian proposed to allow calling bpf progs from all existing
> netfilter hooks.
> You can pretend to local deliver and hook in NF_INET_LOCAL_IN ?
> I feel it would be so much cleaner if stack does ip_defrag normally.
> The general issue of skb ownership between bpf prog and defrag logic
> isn't really solved with skb_copy. It's still an issue.
I do like this series and we would also use it for the Cilium case, so +1 on the
tc BPF integration. Today in Cilium we do what Ed [0] hinted at in his earlier
mail: we extract information from the first fragment and store the metadata
in a BPF map, keyed on ipid, for subsequent fragments [1]. But limitations apply,
e.g. service load-balancing won't work. Redirecting to a different device
or moving higher up the stack is cumbersome, since we would then need to
recirculate back into the tc BPF layer, where all the business logic is located
and which handles the regular (non-fragmented) path, too. Wrt skb ownership, can
you elaborate on what the concrete issue is exactly? Is there anything that comes
to mind with this approach that could crash the kernel?
[0] https://lore.kernel.org/bpf/[email protected]/
[1] https://github.com/cilium/cilium/pull/10264
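To make the "store metadata from the first fragment" approach concrete, here is a rough userspace sketch. It is not Cilium's code: the key layout, the table size and all names (frag_key, frag_record, frag_lookup) are assumptions made for illustration, and a plain array stands in for the BPF map.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key: identifies one fragmented IPv4 datagram. */
struct frag_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t ipid;
	uint8_t  proto;
};

/* L4 ports learned from the first fragment. */
struct frag_ports {
	uint16_t sport;
	uint16_t dport;
};

#define FRAG_SLOTS 256	/* arbitrary size for the sketch */

struct frag_slot {
	bool used;
	struct frag_key key;
	struct frag_ports ports;
};

static struct frag_slot frag_table[FRAG_SLOTS];

static bool frag_key_eq(const struct frag_key *a, const struct frag_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->ipid == b->ipid && a->proto == b->proto;
}

static unsigned int frag_hash(const struct frag_key *k)
{
	uint32_t h = k->saddr ^ (k->daddr * 2654435761u) ^
		     ((uint32_t)k->ipid << 8) ^ k->proto;

	return h % FRAG_SLOTS;
}

/* Called for the first fragment (offset 0), where the ports are visible. */
void frag_record(const struct frag_key *k, struct frag_ports p)
{
	struct frag_slot *s = &frag_table[frag_hash(k)];

	s->used = true;	/* a colliding older entry is simply overwritten */
	s->key = *k;
	s->ports = p;
}

/* Called for later fragments; false means the first fragment was not seen. */
bool frag_lookup(const struct frag_key *k, struct frag_ports *out)
{
	const struct frag_slot *s = &frag_table[frag_hash(k)];

	if (!s->used || !frag_key_eq(&s->key, k))
		return false;
	*out = s->ports;
	return true;
}
```

In a scheme like this, a later fragment that arrives before the first one finds nothing in the table and cannot be classified.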
On 02/27, Daniel Xu wrote:
> This kfunc is used to defragment IPv4 packets. The idea is that if you
> see a fragmented packet, you call this kfunc. If the kfunc returns 0,
> then the skb has been updated to contain the entire reassembled packet.
> If the kfunc returns an error (most likely -EINPROGRESS), then it means
> the skb is part of a yet-incomplete original packet. A reasonable
> response to -EINPROGRESS is to drop the packet, as the ip defrag
> infrastructure is already hanging onto the frag for future reassembly.
> Care has been taken to ensure the prog skb remains valid no matter what
> the underlying ip_check_defrag() call does. This is in contrast to
> ip_defrag(), which may consume the skb if the skb is part of a
> yet-incomplete original packet.
> So far this kfunc is only callable from TC clsact progs.
> Signed-off-by: Daniel Xu <[email protected]>
> ---
> include/net/ip.h | 11 +++++
> net/ipv4/Makefile | 1 +
> net/ipv4/ip_fragment.c | 2 +
> net/ipv4/ip_fragment_bpf.c | 98 ++++++++++++++++++++++++++++++++++++++
> 4 files changed, 112 insertions(+)
> create mode 100644 net/ipv4/ip_fragment_bpf.c
> diff --git a/include/net/ip.h b/include/net/ip.h
> index c3fffaa92d6e..f3796b1b5cac 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -680,6 +680,7 @@ enum ip_defrag_users {
> IP_DEFRAG_VS_FWD,
> IP_DEFRAG_AF_PACKET,
> IP_DEFRAG_MACVLAN,
> + IP_DEFRAG_BPF,
> };
> /* Return true if the value of 'user' is between 'lower_bond'
> @@ -693,6 +694,16 @@ static inline bool ip_defrag_user_in_between(u32 user,
> }
> int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);
> +
> +#ifdef CONFIG_DEBUG_INFO_BTF
> +int register_ip_frag_bpf(void);
> +#else
> +static inline int register_ip_frag_bpf(void)
> +{
> + return 0;
> +}
> +#endif
> +
> #ifdef CONFIG_INET
> struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb,
> u32 user);
> #else
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index 880277c9fd07..950efb166d37 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -65,6 +65,7 @@ obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
> obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
> obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
> obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
> +obj-$(CONFIG_DEBUG_INFO_BTF) += ip_fragment_bpf.o
> obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
> xfrm4_output.o xfrm4_protocol.o
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index 959d2c4260ea..e3fda5203f09 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -759,5 +759,7 @@ void __init ipfrag_init(void)
> if (inet_frags_init(&ip4_frags))
> panic("IP: failed to allocate ip4_frags cache\n");
> ip4_frags_ctl_register();
> + if (register_ip_frag_bpf())
> + panic("IP: bpf: failed to register ip_frag_bpf\n");
> register_pernet_subsys(&ip4_frags_ops);
> }
> diff --git a/net/ipv4/ip_fragment_bpf.c b/net/ipv4/ip_fragment_bpf.c
> new file mode 100644
> index 000000000000..a9e5908ed216
> --- /dev/null
> +++ b/net/ipv4/ip_fragment_bpf.c
> @@ -0,0 +1,98 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Unstable ipv4 fragmentation helpers for TC-BPF hook
> + *
> + * These are called from SCHED_CLS BPF programs. Note that it is allowed to
> + * break compatibility for these functions since the interface they are exposed
> + * through to BPF programs is explicitly unstable.
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/btf_ids.h>
> +#include <linux/ip.h>
> +#include <linux/filter.h>
> +#include <linux/netdevice.h>
> +#include <net/ip.h>
> +#include <net/sock.h>
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> + "Global functions as their definitions will be in ip_fragment BTF");
> +
> +/* bpf_ip_check_defrag - Defragment an ipv4 packet
> + *
> + * This helper takes an skb as input. If this skb successfully reassembles
> + * the original packet, the skb is updated to contain the original, reassembled
> + * packet.
> + *
> + * Otherwise (on error or incomplete reassembly), the input skb remains
> + * unmodified.
> + *
> + * Parameters:
> + * @ctx - Pointer to program context (skb)
> + * @netns - Child network namespace id. If value is a negative signed
> + * 32-bit integer, the netns of the device in the skb is used.
> + *
> + * Return:
> + * 0 on successful reassembly or non-fragmented packet. Negative value on
> + * error or incomplete reassembly.
> + */
> +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
Needs a __bpf_kfunc tag as well?
> +{
> + struct sk_buff *skb = (struct sk_buff *)ctx;
> + struct sk_buff *skb_cpy, *skb_out;
> + struct net *caller_net;
> + struct net *net;
> + int mac_len;
> + void *mac;
> +
[..]
> + if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> + return -EINVAL;
Can you explain what it does? Is it checking for -1 explicitly? Not sure
it works :-/
Maybe we can spell out the cases explicitly?
if (unlikely(((s32)netns < 0 && netns != S32_MAX) || /* -1 */
             netns > U32_MAX)) /* higher 4 bytes */
	return -EINVAL;
> +
> + caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> + if ((s32)netns < 0) {
> + net = caller_net;
> + } else {
> + net = get_net_ns_by_id(caller_net, netns);
> + if (unlikely(!net))
> + return -EINVAL;
> + }
> +
> + mac_len = skb->mac_len;
> + skb_cpy = skb_copy(skb, GFP_ATOMIC);
> + if (!skb_cpy)
> + return -ENOMEM;
> +
> + skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
> + if (IS_ERR(skb_out))
> + return PTR_ERR(skb_out);
> +
> + skb_morph(skb, skb_out);
> + kfree_skb(skb_out);
> +
> + /* ip_check_defrag() does not maintain mac header, so push empty header
> + * in so prog sees the correct layout. The empty mac header will be
> + * later pulled from cls_bpf.
> + */
> + mac = skb_push(skb, mac_len);
> + memset(mac, 0, mac_len);
> + bpf_compute_data_pointers(skb);
> +
> + return 0;
> +}
> +
> +__diag_pop()
> +
> +BTF_SET8_START(ip_frag_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_ip_check_defrag, KF_CHANGES_PKT)
> +BTF_SET8_END(ip_frag_kfunc_set)
> +
> +static const struct btf_kfunc_id_set ip_frag_bpf_kfunc_set = {
> + .owner = THIS_MODULE,
> + .set = &ip_frag_kfunc_set,
> +};
> +
> +int register_ip_frag_bpf(void)
> +{
> + return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
> + &ip_frag_bpf_kfunc_set);
> +}
> --
> 2.39.1
Hi Stanislav,
On Tue, Feb 28, 2023 at 11:37:16AM -0800, Stanislav Fomichev wrote:
> On 02/27, Daniel Xu wrote:
> > [...]
> > +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
>
> Needs a __bpf_kfunc tag as well?
Ack.
> > +{
> > + struct sk_buff *skb = (struct sk_buff *)ctx;
> > + struct sk_buff *skb_cpy, *skb_out;
> > + struct net *caller_net;
> > + struct net *net;
> > + int mac_len;
> > + void *mac;
> > +
>
> [..]
>
> > + if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> > + return -EINVAL;
>
> Can you explain what it does? Is it checking for -1 explicitly? Not sure
> it works :-/
>
> Maybe we can spell out the cases explicitly?
> if (unlikely(
> ((s32)netns < 0 && netns != S32_MAX) || /* -1 */
> netns > U32_MAX /* higher 4 bytes */
> )
> return -EINVAL;
>
I copied this from net/core/filter.c:__bpf_skc_lookup:
if (unlikely(flags || !((s32)netns_id < 0 || netns_id <= S32_MAX)))
goto out;
The semantics are a bit odd, but I thought it'd be good to maintain
consistency. I believe the code correctly checks what the docs describe:
@netns - Child network namespace id. If value is a negative signed
32-bit integer, the netns of the device in the skb is used.
I can pull out the logic into a helper for v3.
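For anyone else puzzling over the expression, here is a small userspace restatement of the check, roughly what such a helper could look like. The names netns_id_ok() and netns_id_means_current() are hypothetical, and the narrowing cast assumes the usual two's-complement truncation that GCC and Clang implement.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace restatement of the kernel check quoted above:
 *   !((s32)netns < 0 || netns <= S32_MAX)  ->  -EINVAL
 * Returns true when the value would be accepted.
 */
bool netns_id_ok(uint64_t netns)
{
	return (int32_t)netns < 0 || netns <= INT32_MAX;
}

/* True when the caller's own netns is used instead of a lookup by id. */
bool netns_id_means_current(uint64_t netns)
{
	return (int32_t)netns < 0;
}
```

The odd part is visible in the corner cases: 0x100000000 is rejected (the low 32 bits are non-negative and the value exceeds S32_MAX), while 0x180000000 is accepted and treated as "current netns" because only the low 32 bits are inspected for the sign.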
[...]
Thanks,
Daniel
On Tue, Feb 28, 2023 at 2:00 PM Daniel Xu <[email protected]> wrote:
>
> Hi Stanislav,
>
> On Tue, Feb 28, 2023 at 11:37:16AM -0800, Stanislav Fomichev wrote:
> > On 02/27, Daniel Xu wrote:
> > > [...]
> > > +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
> >
> > Needs a __bpf_kfunc tag as well?
>
> Ack.
>
> > > +{
> > > + struct sk_buff *skb = (struct sk_buff *)ctx;
> > > + struct sk_buff *skb_cpy, *skb_out;
> > > + struct net *caller_net;
> > > + struct net *net;
> > > + int mac_len;
> > > + void *mac;
> > > +
> >
> > [..]
> >
> > > + if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> > > + return -EINVAL;
> >
> > Can you explain what it does? Is it checking for -1 explicitly? Not sure
> > it works :-/
> >
> > Maybe we can spell out the cases explicitly?
> > if (unlikely(
> > ((s32)netns < 0 && netns != S32_MAX) || /* -1 */
> > netns > U32_MAX /* higher 4 bytes */
> > )
> > return -EINVAL;
> >
>
> I copied this from net/core/filter.c:__bpf_skc_lookup:
>
> if (unlikely(flags || !((s32)netns_id < 0 || netns_id <= S32_MAX)))
> goto out;
>
> The semantics are a bit odd, but I thought it'd be good to maintain
> consistency. I believe the code correctly checks what the docs describe:
>
> @netns - Child network namespace id. If value is a negative signed
> 32-bit integer, the netns of the device in the skb is used.
>
> I can pull out the logic into a helper for v3.
>
> [...]
Ah, so this comes from commit f71c6143c203 ("bpf: Support sk lookup in
netns with id 0"), which explicitly treats everything <0 as
current_netns. Makes sense.
In that case agreed, let's keep it for consistency. Up to you whether
to pull it out into a helper or keep it as is.
>
>
> Thanks,
> Daniel
Hi Alexei,
On Mon, Feb 27, 2023 at 08:56:38PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <[email protected]> wrote:
> >
> > Hi Alexei,
> >
> > On Mon, Feb 27, 2023 at 03:03:38PM -0800, Alexei Starovoitov wrote:
> > > On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
> > > > [...]
> > > >
> > > > === Patchset details ===
> > > >
> > > > This patchset is (hopefully) relatively straightforward from BPF perspective.
> > > > One thing I'd like to call out is the skb_copy()ing of the prog skb. I
> > > > did this to maintain the invariant that the ctx remains valid after prog
> > > > has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
> > > > consume the skb if the skb is a fragment.
> > >
> > > Instead of doing all that with extra skb copy can you hook bpf prog after
> > > the networking stack already handled ip defrag?
> > > What kind of middle box are you doing? Why does it have to run at TC layer?
> >
> > Unless I'm missing something, the only other relevant hooks would be
> > socket hooks, right?
> >
> > Unfortunately I don't think my use case can do that. We are running the
> > kernel as a router, so no sockets are involved.
>
> Are you using bpf_fib_lookup and populating kernel routing
> table and doing everything on your own including neigh ?
We're not doing any routing in BPF yet. All the routing manipulation has
been done in iptables / netfilter so far. I'm not super familiar with the
routing side, but from what I understand there is some relatively
complicated stuff going on with BGP and ipsec tunnels at the moment. Not
sure if that answers your question.
> Have you considered to skb redirect to another netdev that does ip defrag?
> Like macvlan does it under some conditions. This can be generalized.
I had not considered that yet. Are you suggesting adding a new
passthrough netdev that defrags? I looked at the macvlan driver
and it looks like it defrags to handle some multicast corner case.
> Recently Florian proposed to allow calling bpf progs from all existing
> netfilter hooks.
> You can pretend to local deliver and hook in NF_INET_LOCAL_IN ?
Does that work for forwarding cases? I'm reading through [0] and it
seems to suggest that it'll only defrag for locally destined packets:
If the destination IP address is matches with
local NIC's IP address, the dst_input() function will brings the packets
into the ip_local_deliver(), which will defrag the packet and pass it
to the NF_IP_LOCAL_IN hook
Faking local delivery seems kinda ugly, though maybe I just don't know
of a clean way to do it.
[...]
[0]: https://kernelnewbies.org/Networking?action=AttachFile&do=get&target=hacking_the_wholism_of_linux_net.txt
Thanks,
Daniel
Hi Ed,
Had some trouble with email yesterday (forgot to renew domain
registration) and this reply might not have made it out. Apologies
if it's a repost.
On Mon, Feb 27, 2023 at 10:58:47PM +0000, Edward Cree wrote:
> On 27/02/2023 22:04, Daniel Xu wrote:
> > I don't believe full L4 headers are required in the first fragment.
> > Sufficiently sneaky attackers can, I think, send a byte at a time to
> > subvert your proposed algorithm. Storing skb data seems inevitable here.
> > Someone can correct me if I'm wrong here.
>
> My thinking was that legitimate traffic would never do this and thus if
> your first fragment doesn't have enough data to make a determination
> then you just DROP the packet.
Right, that would be practical. I had some discussion with coworkers and
the other option on the table is to drop all fragments. At least for us
in the cloud, fragments are heavily frowned upon (where are they not..)
anyways.
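As a rough illustration of the "drop if the first fragment is too short" rule, a check along these lines would do. The names, the DEMO_* constants and the per-protocol minimum lengths are assumptions for the sketch, not anything from this series.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_IP_MF      0x2000u	/* "more fragments" flag in frag_off */
#define DEMO_IP_OFFMASK 0x1fffu	/* fragment offset, in 8-byte units */

enum frag_verdict {
	FRAG_NOT_FRAGMENT,
	FRAG_FIRST_OK,
	FRAG_FIRST_TOO_SHORT,	/* candidate for DROP */
	FRAG_LATER,
};

/* Minimum L4 bytes needed to read the ports (and, for TCP, the flags). */
static unsigned int l4_min_len(uint8_t proto)
{
	switch (proto) {
	case 6:
		return 20;	/* TCP fixed header */
	case 17:
		return 8;	/* UDP header */
	default:
		return 0;	/* other protocols: nothing required here */
	}
}

/*
 * frag_off is the IPv4 frag_off field already converted to host byte order;
 * l4_len is the number of bytes after the IP header in this packet.
 */
enum frag_verdict classify_fragment(uint16_t frag_off, uint8_t proto,
				    unsigned int l4_len)
{
	bool more = frag_off & DEMO_IP_MF;
	unsigned int offset = frag_off & DEMO_IP_OFFMASK;

	if (!more && offset == 0)
		return FRAG_NOT_FRAGMENT;
	if (offset != 0)
		return FRAG_LATER;
	return l4_len >= l4_min_len(proto) ? FRAG_FIRST_OK
					   : FRAG_FIRST_TOO_SHORT;
}
```

A policy hook could then drop on FRAG_FIRST_TOO_SHORT, and the stricter "drop all fragments" option would drop on anything other than FRAG_NOT_FRAGMENT.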
> > What I find valuable about this patch series is that we can
> > leverage the well understood and battle hardened kernel facilities. So
> > avoid all the correctness and security issues that the kernel has spent
> > 20+ years fixing.
>
> I can certainly see the argument here. I guess it's a question of are
> you more worried about the DoS from tricking the validator into thinking
> good fragments are bad (the reverse is irrelevant because if you can
> trick a validator into thinking your bad fragment belongs to a previously
> seen good packet, then you can equally trick a reassembler into stitching
> your bad fragment into that packet), or are you more worried about the
> DoS from tying lots of memory down in the reassembly cache.
Equal balance of concerns on my side. Ideally no valid packets are
dropped and DoS is very hard to achieve.
> Even with reordering handling, a data structure to record which ranges of
> a packet have been seen takes much less memory than storing the complete
> fragment bodies. (Just a simple bitmap of 8-byte blocks — the resolution
> of iph->frag_off — reduces size by a factor of 64, not counting all the
> overhead of a struct sk_buff for each fragment in the queue. Or you
> could re-use the rbtree-based code from the reassembler, just with a
> freshly allocated node containing only offset & length, instead of the
> whole SKB.)
Yeah, now that you say that, it doesn't sound too bad on the space side. But
I do wonder -- how much code and complexity is that going to be? For
example I think ipv6 frags have a 60s reassembly timeout which adds more
stuff to consider. And probably even more I've already forgotten.
B/c at least on the kernel side, this series is 80% code for tests. And
the kfunc wrappers are not very invasive at all. Plus it's wrapping
infra that hasn't changed much for decades.
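For reference, a minimal userspace sketch of the bitmap idea quoted
above: one bit per 8-byte block, so a full 64 KiB datagram needs 1 KiB
of state. Everything here (struct frag_tracker, the frag_tracker_*
functions, rejecting any overlap outright) is a hypothetical
illustration, not code from this series or from the kernel reassembler.
It also leaves out timeouts and per-flow lookup, which is where much of
the complexity mentioned above would come from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FRAG_BLOCK_SIZE 8u     /* the fragment offset counts 8-byte blocks */
#define FRAG_MAX_BLOCKS 8192u  /* 13-bit offset field */

/* Hypothetical per-datagram state: which 8-byte blocks have been seen. */
struct frag_tracker {
	uint8_t  seen[FRAG_MAX_BLOCKS / 8]; /* 1 bit per block: 1 KiB */
	uint32_t total_blocks;              /* set by the last fragment; 0 = unknown */
	uint32_t seen_blocks;               /* number of bits set in seen[] */
	uint32_t max_end;                   /* highest block index + 1 seen so far */
};

static void frag_tracker_init(struct frag_tracker *t)
{
	memset(t, 0, sizeof(*t));
}

/*
 * Record one fragment. offset_blocks is the fragment offset in 8-byte
 * units, len the payload length in bytes, more_frags the MF flag.
 * Returns false for malformed or overlapping fragments (a policy choice
 * made for this sketch).
 */
static bool frag_tracker_add(struct frag_tracker *t, uint32_t offset_blocks,
			     uint32_t len, bool more_frags)
{
	uint32_t nblocks, end, i;

	assert(t != NULL);
	if (len == 0 || len > 65535u || offset_blocks >= FRAG_MAX_BLOCKS)
		return false;
	/* Only the last fragment may have a length that is not a multiple of 8. */
	if (more_frags && (len % FRAG_BLOCK_SIZE) != 0)
		return false;

	nblocks = (len + FRAG_BLOCK_SIZE - 1) / FRAG_BLOCK_SIZE;
	end = offset_blocks + nblocks;
	if (end > FRAG_MAX_BLOCKS)
		return false;

	if (more_frags) {
		/* Nothing may extend past a known end of datagram. */
		if (t->total_blocks != 0 && end > t->total_blocks)
			return false;
	} else if (t->total_blocks != 0 || end < t->max_end) {
		/* Second "last" fragment, or data already seen beyond it. */
		return false;
	}

	/* First pass: refuse overlaps without modifying any state. */
	for (i = offset_blocks; i < end; i++)
		if (t->seen[i / 8] & (1u << (i % 8)))
			return false;
	/* Second pass: mark the blocks as seen. */
	for (i = offset_blocks; i < end; i++)
		t->seen[i / 8] |= (uint8_t)(1u << (i % 8));

	t->seen_blocks += nblocks;
	if (end > t->max_end)
		t->max_end = end;
	if (!more_frags)
		t->total_blocks = end;
	return true;
}

/* True once the last fragment and every block before it have been seen. */
static bool frag_tracker_complete(const struct frag_tracker *t)
{
	return t->total_blocks != 0 && t->seen_blocks == t->total_blocks;
}
```

Reordering is handled because the last fragment may arrive first; the
tracker only reports completion once every block up to the recorded end
is covered.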
> And having a BPF helper effectively consume the skb is awkward, as you
> noted; someone is likely to decide that skb_copy() is too slow, try to
> add ctx invalidation, and thereby create a whole new swathe of potential
> correctness and security issues.
Yep. I did try that. While the verifier bits weren't too tricky, there
are a lot of infra concerns to solve:
* https://github.com/danobi/linux/commit/35a66af8d54cca647b0adfc7c1da7105d2603dde
* https://github.com/danobi/linux/commit/e8c86ea75e2ca8f0631632d54ef763381308711e
* https://github.com/danobi/linux/commit/972bcf769f41fbfa7f84ce00faf06b5b57bc6f7a
But FWIW, fragmented packets are kinda a corner case anyways. I don't
think it would be reasonable to expect high perf when fragments are in
play.
> Plus, imagine trying to support this in a hardware-offload XDP device.
> They'd have to reimplement the entire frag cache, which is a much bigger
> attack surface than just a frag validator, and they couldn't leverage
> the battle-hardened kernel implementation.
Hmm, well this helper is restricted to TC progs for now. I don't quite
see a path to enabling it for XDP, as there would have to be at a minimum
quite a few allocations to handle frags. So not sure XDP is a factor at
the moment.
>
> > And make it trivial for the next person that comes
> > along to do the right thing.
>
> Fwiw the validator approach could *also* be a helper, it doesn't have to
> be something the BPF developer writes for themselves.
>
> But if after thinking about the possibility you still prefer your way, I
> won't try to stop you — I just wanted to ensure it had been considered.
Thank you for the discussion. The thought had come to mind originally,
but I shied away after seeing some of the reassembly details. Would be
interested in hearing more from other folks.
Thanks,
Daniel
On Tue, Feb 28, 2023 at 3:17 PM Daniel Xu <[email protected]> wrote:
>
> > Have you considered to skb redirect to another netdev that does ip defrag?
> > Like macvlan does it under some conditions. This can be generalized.
>
> I had not considered that yet. Are you suggesting adding a new
> passthrough netdev thing that'll defrag? I looked at the macvlan driver
> and it looks like it defrags to handle some multicast corner case.
Something like that. A netdev that bpf prog can redirect to.
It will consume ip frags and eventually will produce reassembled skb.
The kernel ip_defrag logic has timeouts, counters, rhashtable
with thresholds, etc. All of them are per netns.
Just another ip_defrag_user will still share rhashtable
with its limits. The kernel can even do icmp_send().
ip_defrag is not a kfunc. It's a big block with plenty of kernel
wide side effects.
I really don't think we can alloc_skb, copy_skb, and ip_defrag it.
It messes with the stack too much.
It's also not clear to me when skb is reassembled and how bpf sees it.
"redirect into reassembling netdev" and attaching bpf prog to consume
that skb is much cleaner imo.
Maybe there are other ways to use ip_defrag, but certainly not as a
synchronous API helper.
Hi Alexei,
(cc netfilter maintainers)
On Mon, Mar 06, 2023 at 08:17:20PM -0800, Alexei Starovoitov wrote:
> On Tue, Feb 28, 2023 at 3:17 PM Daniel Xu <[email protected]> wrote:
> >
> > > Have you considered to skb redirect to another netdev that does ip defrag?
> > > Like macvlan does it under some conditions. This can be generalized.
> >
> > I had not considered that yet. Are you suggesting adding a new
> > passthrough netdev thing that'll defrag? I looked at the macvlan driver
> > and it looks like it defrags to handle some multicast corner case.
>
> Something like that. A netdev that bpf prog can redirect to.
> It will consume ip frags and eventually will produce reassembled skb.
>
> The kernel ip_defrag logic has timeouts, counters, rhashtable
> with thresholds, etc. All of them are per netns.
> Just another ip_defrag_user will still share rhashtable
> with its limits. The kernel can even do icmp_send().
> ip_defrag is not a kfunc. It's a big block with plenty of kernel
> wide side effects.
> I really don't think we can alloc_skb, copy_skb, and ip_defrag it.
> It messes with the stack too much.
> It's also not clear to me when skb is reassembled and how bpf sees it.
> "redirect into reassembling netdev" and attaching bpf prog to consume
> that skb is much cleaner imo.
> Maybe there are other ways to use ip_defrag, but certainly not as a
> synchronous API helper.
I was giving the virtual netdev idea some thought this morning and I
thought I'd give the netfilter approach a deeper look.
From my reading (I'll run some tests later) it looks like netfilter
will defrag all ipv4/ipv6 packets in any netns with conntrack enabled.
It appears to do so in NF_INET_PRE_ROUTING.
Unfortunately that does run after tc hooks. But fortunately with the
new BPF netfilter hooks I think we can make defrag work outside of BPF
kfuncs like you want. And the NF_IP_FORWARD hook works well for my
router use case.
One thing we would need though are (probably kfunc) wrappers around
nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() to ensure BPF progs
are not transitively depending on defrag support from other netfilter
modules.
The exact mechanism would probably need some thinking, as the above
functions kinda rely on module_init() and module_exit() semantics. We
cannot make the prog bump the refcnt every time it runs -- it would
overflow. And it would be nice to automatically free the refcnt when
prog is unloaded.
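To make the refcnt concern concrete, here is a small userspace model,
assuming the reference is taken once when the prog is attached and
dropped once when the attachment goes away. All names (fake_netns,
fake_link, fake_defrag_enable() and friends) are hypothetical stand-ins;
the real nf_defrag_ipv4_enable()/nf_defrag_ipv6_enable() take a struct
net and are not modelled in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the per-netns defrag state. */
struct fake_netns {
	unsigned int defrag_users;     /* how many attachments asked for defrag */
	bool         hooks_registered; /* true while defrag is active */
};

/* First user registers the (modelled) defrag hooks. */
static int fake_defrag_enable(struct fake_netns *ns)
{
	if (ns->defrag_users++ == 0)
		ns->hooks_registered = true;
	return 0;
}

/* Last user unregisters them again. */
static void fake_defrag_disable(struct fake_netns *ns)
{
	assert(ns->defrag_users > 0);
	if (--ns->defrag_users == 0)
		ns->hooks_registered = false;
}

/* Stand-in for one attachment of a prog. */
struct fake_link {
	struct fake_netns *ns;
	bool               holds_defrag;
};

/* Attach: the only place where a reference is taken. */
static int fake_link_attach(struct fake_link *l, struct fake_netns *ns,
			    bool want_defrag)
{
	int err;

	l->ns = ns;
	l->holds_defrag = false;
	if (want_defrag) {
		err = fake_defrag_enable(ns);
		if (err)
			return err;
		l->holds_defrag = true;
	}
	return 0;
}

/* Per-packet path: reads state only, never touches the refcount. */
static bool fake_link_run(const struct fake_link *l)
{
	return l->ns->hooks_registered;
}

/* Release: drops the reference automatically when the attachment dies. */
static void fake_link_release(struct fake_link *l)
{
	if (l->holds_defrag) {
		fake_defrag_disable(l->ns);
		l->holds_defrag = false;
	}
}
```

With this shape the count is bounded by the number of live attachments
rather than by the number of packets seen, so it cannot overflow from
traffic alone.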
Once the netfilter prog type series lands I can get that discussion
started. Unless Daniel feels strongly that we should continue with
the approach in this patchset, I am leaning towards dropping it in favor
of the netfilter approach.
Thanks,
Daniel
Daniel Xu <[email protected]> wrote:
> From my reading (I'll run some tests later) it looks like netfilter
> will defrag all ipv4/ipv6 packets in any netns with conntrack enabled.
> It appears to do so in NF_INET_PRE_ROUTING.
Yes, and output.
> One thing we would need though are (probably kfunc) wrappers around
> nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() to ensure BPF progs
> are not transitively depending on defrag support from other netfilter
> modules.
>
> The exact mechanism would probably need some thinking, as the above
> functions kinda rely on module_init() and module_exit() semantics. We
> cannot make the prog bump the refcnt every time it runs -- it would
> overflow. And it would be nice to automatically free the refcnt when
> prog is unloaded.
Probably add a flag attribute that is evaluated at BPF_LINK time, so
progs can say they need defrag enabled. Same could be used to request
conntrack enablement.
Will need some glue on netfilter side to handle DEFRAG=m, but we already
have plenty of those.
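A rough userspace sketch of that suggestion, under the assumption that
the flag is a bit in the link-create attributes and that the glue is a
pointer to an ops table which stays NULL while the defrag module is not
loaded. The flag name DEMO_LINK_F_DEFRAG, the demo_* types and the error
codes chosen are all hypothetical; none of this is taken from a kernel
header. A second bit for conntrack could be handled the same way.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LINK_F_DEFRAG (1u << 0) /* hypothetical "needs defrag" flag */

/* Glue table that the (modelled) defrag module would publish on load. */
struct demo_defrag_ops {
	int  (*enable)(void);
	void (*disable)(void);
};

/* NULL while the module is absent, which models DEFRAG=m and not loaded. */
static const struct demo_defrag_ops *demo_defrag_ops;

static unsigned int demo_defrag_users;

static int demo_enable(void)
{
	demo_defrag_users++;
	return 0;
}

static void demo_disable(void)
{
	assert(demo_defrag_users > 0);
	demo_defrag_users--;
}

/* What the module would register when it is loaded. */
static const struct demo_defrag_ops demo_module_ops = {
	demo_enable, demo_disable
};

struct demo_link {
	const struct demo_defrag_ops *defrag; /* non-NULL if a ref is held */
};

/* Flags are evaluated once, at link creation. */
static int demo_link_create(struct demo_link *link, uint32_t flags)
{
	assert(link != NULL);
	link->defrag = NULL;

	if (flags & ~DEMO_LINK_F_DEFRAG)
		return -EINVAL;         /* unknown flag bits */

	if (flags & DEMO_LINK_F_DEFRAG) {
		const struct demo_defrag_ops *ops = demo_defrag_ops;
		int err;

		if (ops == NULL)
			return -EOPNOTSUPP; /* defrag support not available */
		err = ops->enable();
		if (err)
			return err;
		link->defrag = ops;
	}
	return 0;
}

/* Undo whatever link creation enabled. */
static void demo_link_release(struct demo_link *link)
{
	if (link->defrag != NULL) {
		link->defrag->disable();
		link->defrag = NULL;
	}
}
```

Rejecting unknown bits up front keeps the flag space free for later
additions, and a prog that does not set the flag never depends on defrag
being present.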
On Tue, Mar 7, 2023 at 12:11 PM Florian Westphal <[email protected]> wrote:
>
> Daniel Xu <[email protected]> wrote:
> > From my reading (I'll run some tests later) it looks like netfilter
> > will defrag all ipv4/ipv6 packets in any netns with conntrack enabled.
> > It appears to do so in NF_INET_PRE_ROUTING.
>
> Yes, and output.
>
> > One thing we would need though are (probably kfunc) wrappers around
> > nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() to ensure BPF progs
> > are not transitively depending on defrag support from other netfilter
> > modules.
> >
> > The exact mechanism would probably need some thinking, as the above
> > functions kinda rely on module_init() and module_exit() semantics. We
> > cannot make the prog bump the refcnt every time it runs -- it would
> > overflow. And it would be nice to automatically free the refcnt when
> > prog is unloaded.
>
> Probably add a flag attribute that is evaluated at BPF_LINK time, so
> progs can say they need defrag enabled. Same could be used to request
> conntrack enablement.
>
> Will need some glue on netfilter side to handle DEFRAG=m, but we already
> have plenty of those.
All makes perfect sense to me.
It's cleaner than a special netdevice.
ipv4_conntrack_defrag() is pretty neat. I didn't know about it.
If we can reuse it as-is that would be ideal.
Conceptually it fits perfectly.
If we cannot reuse it (for whatever unlikely reason) I would
argue that TC hook should gain similar functionality.