From: Maciej Żenczykowski <[email protected]>
This function is used from:
bpf_skb_adjust_room
__bpf_skb_change_tail
__bpf_skb_change_head
but in the case of forwarding we're likely calling these functions
during receive processing on ingress and bpf_redirect()'ing at
a later point in time to egress on another interface, thus these
mtu checks are for the wrong device.
This is particularly problematic if we're receiving on an L3 1500 mtu
cellular interface, trying to add an L2 header and forwarding to
an L3 1500 mtu wifi/ethernet device. The mtu check prevents
us from adding the ethernet header prior to forwarding the packet.
After the packet has already been redirected, we'd need to add
an additional 2nd ebpf program on the target device's egress tc hook,
but then we'd also see non-redirected traffic and have no easy
way to tell apart normal egress with ethernet header packets
from forwarded ethernet headerless packets.
Signed-off-by: Maciej Żenczykowski <[email protected]>
---
net/core/filter.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index ec567d1e6fb9..1e119a47f9fe 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3159,8 +3159,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
static u32 __bpf_skb_max_len(const struct sk_buff *skb)
{
- return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
- SKB_MAX_ALLOC;
+ return SKB_MAX_ALLOC;
}
BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
--
2.26.1.301.g55bc3eb7cb9-goog
This is only a semi-serious patch.
But, I've spent a long time trying to come up with a solution that works,
and everything seems broken.
I'm hoping someone else has some ideas.
As is, forwarding doesn't work.
Here's an example scenario:
cell0 - 1500 l3 mtu, raw_ip, 0 l2 header
wlan0 - 1500 l3 mtu, ethernet, 14 l2 header
cell0 -> wlan0 forwarding
tc ingress hook on cell0:
map lookups, other stuff, eventually
skb_modifications to add ethernet header (via skb_change_head or
bpf_skb_adjust_room)
bpf_redirect(wlan0, egress)
This fails because adding the ethernet header goes above cell0's mtu+header_len,
even though it would be fine if we tested against wlan0's mtu+header_len.
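To make the arithmetic concrete, here is a minimal user-space sketch of the check, not kernel code. `struct model_dev`, `model_max_len()` and `model_fits()` are made-up names for illustration; only the `mtu + hard_header_len` formula is taken from `__bpf_skb_max_len()`, and the sketch assumes a non-GSO packet is accepted when its new length does not exceed that value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct net_device fields the check reads. */
struct model_dev {
	uint32_t mtu;             /* L3 mtu */
	uint16_t hard_header_len; /* L2 header length, 0 for raw IP */
};

/* Same formula as __bpf_skb_max_len() for a non-NULL skb->dev. */
static inline uint32_t model_max_len(const struct model_dev *dev)
{
	assert(dev != NULL);
	return dev->mtu + dev->hard_header_len;
}

/* Assumption: a non-GSO packet passes when new_len <= max_len. */
static inline bool model_fits(const struct model_dev *dev, uint32_t new_len)
{
	return new_len <= model_max_len(dev);
}
```

With cell0 modelled as `{ 1500, 0 }` and wlan0 as `{ 1500, 14 }`, a full-size 1500 byte packet plus a 14 byte ethernet header (1514) fails `model_fits()` against cell0 but passes against wlan0.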
Indeed the only solution that would perhaps work is to have 2 bpf programs
tc ingress hook on cell0: redirect to wlan0
tc egress hook on wlan0: actually add the header
but this requires doing the lookups twice: first to determine whether we
should redirect and where, and then to actually add the header. Additionally,
the packet we get on wlan0 might not have come from the redirect, and that's
hard to detect...
so you actually need to do:
tc ingress hook on cell0: redirect to dummy0, which has larger mtu
tc ingress hook on dummy0: add header, redirect to wlan0
this still requires a double set of bpf programs and lookups...
it's ugly.
Calling bpf_redirect() prior to skb_change_head() isn't enough, since it checks
skb->dev not tgt_index. Although I guess we could save the redirect device's
mtu in the redirect struct and test against that in preference to
testing against skb->dev...
but that's really a pointless test, because you can call bpf_redirect
multiple times, changing the device each time, ie...
bpf_redirect(dummy with large mtu)
skb_change_head()
bpf_redirect(wlan0)
so basically this would make the test worthless...
I considered simply removing the mtu check from these skb modifying functions...
it's not like it even does the right thing:
(a) device mtu is only an upper limit - we should really be testing
against path mtu
and that's probably only something the bpf code knows
(b) it ignores mtu entirely for gso packets: but gso max seg size
should be tested instead...
Or maybe add a bpf uapi visible flag to ignore the mtu check...
Or maybe simply pass in 16-bits of mtu via the currently unused flags field...
... etc ...
- Maciej
On Mon, 20 Apr 2020 16:14:27 -0700 Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <[email protected]>
>
> This function is used from:
> bpf_skb_adjust_room
> __bpf_skb_change_tail
> __bpf_skb_change_head
>
> but in the case of forwarding we're likely calling these functions
> during receive processing on ingress and bpf_redirect()'ing at
> a later point in time to egress on another interface, thus these
> mtu checks are for the wrong device.
Interesting. Without redirecting there should also be no reason
to do this check at ingress, right? So at ingress it's either
incorrect or unnecessary?
> > This function is used from:
> > bpf_skb_adjust_room
> > __bpf_skb_change_tail
> > __bpf_skb_change_head
> >
> > but in the case of forwarding we're likely calling these functions
> > during receive processing on ingress and bpf_redirect()'ing at
> > a later point in time to egress on another interface, thus these
> > mtu checks are for the wrong device.
>
> Interesting. Without redirecting there should also be no reason
> to do this check at ingress, right? So at ingress it's either
> incorrect or unnecessary?
Well, I guess there's technically a chance that you'd want to mutate
the packet somehow during ingress pre-receive processing (without
redirecting)...
But yeah, I can't really think of a case where that would be
increasing the size of the packet.
Usually you'd be decapsulating at ingress and encapsulating at egress,
or doing ingress rewrite & redirect to egress...
(Also, note that relying on a sequence where at ingress you first call
bpf_redirect(ifindex, EGRESS); then change the packet size, and then
return TC_ACT_REDIRECT; thus being able to use the redirect ifindex
for mtu checks in the packet mutation functions is potentially buggy,
since there's no guarantee you won't call bpf_redirect again to change
the ifindex, or even return from the bpf program without returning
TC_ACT_REDIRECT --- so while that could be *more* correct, it would
still have holes...)
On Tue, Apr 21, 2020 at 01:36:08PM -0700, Maciej Żenczykowski wrote:
> > > This function is used from:
> > > bpf_skb_adjust_room
> > > __bpf_skb_change_tail
> > > __bpf_skb_change_head
> > >
> > > but in the case of forwarding we're likely calling these functions
> > > during receive processing on ingress and bpf_redirect()'ing at
> > > a later point in time to egress on another interface, thus these
> > > mtu checks are for the wrong device.
> >
> > Interesting. Without redirecting there should also be no reason
> > to do this check at ingress, right? So at ingress it's either
> > incorrect or unnecessary?
>
> Well, I guess there's technically a chance that you'd want to mutate
> the packet somehow during ingress pre-receive processing (without
> redirecting)...
> But yeah, I can't really think of a case where that would be
> increasing the size of the packet.
>
> Usually you'd be decapsulating at ingress and encapsulating at egress,
> or doing ingress rewrite & redirect to egress...
>
> (Also, note that relying on a sequence where at ingress you first call
> bpf_redirect(ifindex, EGRESS); then change the packet size, and then
> return TC_ACT_REDIRECT; thus being able to use the redirect ifindex
> for mtu checks in the packet mutation functions is potentially buggy,
> since there's no guarantee you won't call bpf_redirect again to change
> the ifindex, or even return from the bpf program without returning
> TC_ACT_REDIRECT --- so while that could be *more* correct, it would
> still have holes...)
yeah. there is no good fix here, since target netdev is not known,
but dropping the check also doesn't seem right.
How about:
if (skb->dev) {
u32 header_len = skb->dev->hard_header_len;
if (!header_len)
header_len = ETH_HLEN;
return skb->dev->mtu + header_len;
} else {
return SKB_MAX_ALLOC;
}
The idea is that l3 devices won't have an l2 header, so here we assume
that one can be added sooner or later.
It's not pretty either, but it will solve your cell->wifi use case?
While keeping basic sanity for other cases.
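As a rough user-space model of the fallback proposed above (not the kernel function itself), the sketch below treats a zero `hard_header_len` as 14 bytes. `struct model_dev` and `model_max_len_fallback()` are hypothetical names, and `MODEL_SKB_MAX_ALLOC` is just a placeholder cap, not the kernel's exact SKB_MAX_ALLOC value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETH_HLEN      14u    /* same value as the kernel's ETH_HLEN */
#define MODEL_SKB_MAX_ALLOC 16384u /* placeholder cap, an assumption */

/* Hypothetical stand-in for the struct net_device fields used here. */
struct model_dev {
	uint32_t mtu;
	uint16_t hard_header_len;
};

/* Models the proposal: an L3 device (header length 0) is given
 * ETH_HLEN of headroom so an ethernet header can still be added.
 */
static inline uint32_t model_max_len_fallback(const struct model_dev *dev)
{
	uint32_t header_len;

	if (dev == NULL)
		return MODEL_SKB_MAX_ALLOC;

	header_len = dev->hard_header_len;
	if (header_len == 0)
		header_len = MODEL_ETH_HLEN;

	return dev->mtu + header_len;
}
```

In this model a raw-IP device with mtu 1500 yields 1514, the same limit as an ethernet device with mtu 1500, so the 14 byte header fits. It still tests against the input device, so it only helps when both devices have the same L3 mtu.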
From: Maciej Żenczykowski <[email protected]>
__bpf_skb_max_len(skb) is used from:
bpf_skb_adjust_room
__bpf_skb_change_tail
__bpf_skb_change_head
but in the case of forwarding we're likely calling these functions
during receive processing on ingress and bpf_redirect()'ing at
a later point in time to egress on another interface, thus these
mtu checks are for the wrong device (input instead of output).
This is particularly problematic if we're receiving on an L3 1500 mtu
cellular interface, trying to add an L2 header and forwarding to
an L3 1500 mtu wifi/ethernet device (which is thus L2 1514).
The mtu check prevents us from adding the 14 byte ethernet header prior
to forwarding the packet.
After the packet has already been redirected, we'd need to add
an additional 2nd ebpf program on the target device's egress tc hook,
but then we'd also see non-redirected traffic and have no easy
way to tell apart normal egress with ethernet header packets
from forwarded ethernet headerless packets.
Credits to Alexei Starovoitov for the suggestion on how to 'fix' this.
This probably should be further fixed up *somehow*, just
not at all clear how. This does at least make things work.
Cc: Alexei Starovoitov <[email protected]>
Signed-off-by: Maciej Żenczykowski <[email protected]>
---
net/core/filter.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 7d6ceaa54d21..811aba77e24b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3159,8 +3159,20 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
static u32 __bpf_skb_max_len(const struct sk_buff *skb)
{
- return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
- SKB_MAX_ALLOC;
+ if (skb->dev) {
+ unsigned short header_len = skb->dev->hard_header_len;
+
+ /* HACK: Treat 0 as ETH_HLEN to allow redirect while
+ * adding ethernet header from an L3 (tun, rawip, cellular)
+ * to an L2 device (tap, ethernet, wifi)
+ */
+ if (!header_len)
+ header_len = ETH_HLEN;
+
+ return skb->dev->mtu + header_len;
+ } else {
+ return SKB_MAX_ALLOC;
+ }
}
BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
--
2.26.2.526.g744177e7f7-goog
On Wed, 6 May 2020 16:32:59 -0700 Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <[email protected]>
>
> __bpf_skb_max_len(skb) is used from:
> bpf_skb_adjust_room
> __bpf_skb_change_tail
> __bpf_skb_change_head
>
> but in the case of forwarding we're likely calling these functions
> during receive processing on ingress and bpf_redirect()'ing at
> a later point in time to egress on another interface, thus these
> mtu checks are for the wrong device (input instead of output).
>
> This is particularly problematic if we're receiving on an L3 1500 mtu
> cellular interface, trying to add an L2 header and forwarding to
> an L3 1500 mtu wifi/ethernet device (which is thus L2 1514).
>
> The mtu check prevents us from adding the 14 byte ethernet header prior
> to forwarding the packet.
>
> After the packet has already been redirected, we'd need to add
> an additional 2nd ebpf program on the target device's egress tc hook,
> but then we'd also see non-redirected traffic and have no easy
> way to tell apart normal egress with ethernet header packets
> from forwarded ethernet headerless packets.
>
> Credits to Alexei Starovoitov for the suggestion on how to 'fix' this.
>
> This probably should be further fixed up *somehow*, just
> not at all clear how. This does at least make things work.
>
> Cc: Alexei Starovoitov <[email protected]>
> Signed-off-by: Maciej Żenczykowski <[email protected]>
> ---
> net/core/filter.c | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7d6ceaa54d21..811aba77e24b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3159,8 +3159,20 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
>
> static u32 __bpf_skb_max_len(const struct sk_buff *skb)
> {
> - return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
> - SKB_MAX_ALLOC;
> + if (skb->dev) {
> + unsigned short header_len = skb->dev->hard_header_len;
> +
> + /* HACK: Treat 0 as ETH_HLEN to allow redirect while
> + * adding ethernet header from an L3 (tun, rawip, cellular)
> + * to an L2 device (tap, ethernet, wifi)
> + */
> + if (!header_len)
> + header_len = ETH_HLEN;
> +
> + return skb->dev->mtu + header_len;
> + } else {
> + return SKB_MAX_ALLOC;
> + }
> }
>
> BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
I thought we have established that checking device MTU (m*T*u)
at ingress makes a very limited amount of sense, no?
Shooting from the hip here, but won't something like:
if (!skb->dev || skb->tc_at_ingress)
return SKB_MAX_ALLOC;
return skb->dev->mtu + skb->dev->hard_header_len;
Solve your problem?
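For comparison, a minimal user-space model of the ingress exemption suggested above might look like the following. `struct model_skb`, its `at_tc_ingress` flag and `MODEL_SKB_MAX_ALLOC` are illustrative assumptions standing in for the real skb state and constant; this is a sketch of the decision logic only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SKB_MAX_ALLOC 16384u /* placeholder cap, an assumption */

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct model_dev {
	uint32_t mtu;
	uint16_t hard_header_len;
};

struct model_skb {
	const struct model_dev *dev;
	bool at_tc_ingress; /* models skb->tc_at_ingress */
};

/* At ingress (or without a device) only the generic cap applies;
 * at egress the device limit is still enforced.
 */
static inline uint32_t model_max_len_ingress_exempt(const struct model_skb *skb)
{
	assert(skb != NULL);
	if (skb->dev == NULL || skb->at_tc_ingress)
		return MODEL_SKB_MAX_ALLOC;
	return skb->dev->mtu + skb->dev->hard_header_len;
}
```

In this model a packet seen at ingress on a raw-IP device gets the generic cap, while the same device at egress is still limited to its own mtu plus header length.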
> I thought we have established that checking device MTU (m*T*u)
> at ingress makes a very limited amount of sense, no?
>
> Shooting from the hip here, but won't something like:
>
> if (!skb->dev || skb->tc_at_ingress)
> return SKB_MAX_ALLOC;
> return skb->dev->mtu + skb->dev->hard_header_len;
>
> Solve your problem?
I believe that probably does indeed solve the ingress case of tc
ingress hook on cellular redirecting to wifi.
However, there's 2 possible uplinks - cellular (rawip, L3), and wifi
(ethernet, L2).
Thus, there's actually 4 things I'm trying to support:
- ipv6 ingress on cellular uplink (L3/rawip), translate to ipv4,
forward to wifi/ethernet <- need to add ethernet header
- ipv6 ingress on wifi uplink (L2/ether), translate to ipv4, forward
to wifi/ethernet <- trivial, no packet size change
- ipv4 egressing through tun (L3), translate to ipv6, forward to
cellular uplink <- trivial, no packet size change
- ipv4 egressing through tun (L3), translate to ipv6, forward to wifi
uplink <- need to add ethernet header [*]
I think your approach doesn't solve the reverse path (* up above):
ie. ipv4 packets hitting a tun device (owned by a clat daemon doing
ipv4<->ipv6 translation in userspace), being stolen by a tc egress
ebpf hook, mutated to ipv6 by ebpf and bpf_redirect'ed to egress
through a wifi ipv6-only uplink.
Though arguably in this case I could probably simply increase the tun
device mtu by another 14, while keeping ipv4 route mtus low...
(tun mtu already has to be 28 bytes lower than wifi mtu to allow
replacement of ipv4 with ipv6 header (20 bytes extra), with possibly
an ipv6 frag header (8 more bytes))
Any further thoughts?
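The header arithmetic behind the 28 byte figure can be written out as a small user-space sketch. The function names are hypothetical; the sizes are the standard minimum IPv4 header (20), the IPv6 header (40), the IPv6 fragment header (8) and the ethernet header (14).

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_IPV4_HDR_LEN      20u /* minimum IPv4 header, no options */
#define MODEL_IPV6_HDR_LEN      40u
#define MODEL_IPV6_FRAG_HDR_LEN 8u
#define MODEL_ETH_HLEN          14u

/* Largest IPv4 packet that still fits the uplink's L3 mtu after
 * translation to IPv6 with a fragment header added.
 */
static inline uint32_t model_clat_max_ipv4_len(uint32_t uplink_l3_mtu)
{
	uint32_t growth = (MODEL_IPV6_HDR_LEN - MODEL_IPV4_HDR_LEN) +
			  MODEL_IPV6_FRAG_HDR_LEN; /* 20 + 8 = 28 */

	assert(uplink_l3_mtu >= growth);
	return uplink_l3_mtu - growth;
}

/* Worst-case L2 frame length once the translated packet also
 * carries an ethernet header.
 */
static inline uint32_t model_clat_worst_case_l2_len(uint32_t ipv4_len)
{
	return ipv4_len + (MODEL_IPV6_HDR_LEN - MODEL_IPV4_HDR_LEN) +
	       MODEL_IPV6_FRAG_HDR_LEN + MODEL_ETH_HLEN;
}
```

For a 1500 mtu wifi uplink this gives a 1472 byte IPv4 limit, and a 1472 byte IPv4 packet can grow to a 1514 byte ethernet frame in the worst case.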
> > I thought we have established that checking device MTU (m*T*u)
> > at ingress makes a very limited amount of sense, no?
> >
> > Shooting from the hip here, but won't something like:
> >
> > if (!skb->dev || skb->tc_at_ingress)
> > return SKB_MAX_ALLOC;
> > return skb->dev->mtu + skb->dev->hard_header_len;
> >
> > Solve your problem?
>
> I believe that probably does indeed solve the ingress case of tc
> ingress hook on cellular redirecting to wifi.
>
> However, there's 2 possible uplinks - cellular (rawip, L3), and wifi
> (ethernet, L2).
> Thus, there's actually 4 things I'm trying to support:
>
> - ipv6 ingress on cellular uplink (L3/rawip), translate to ipv4,
> forward to wifi/ethernet <- need to add ethernet header
>
> - ipv6 ingress on wifi uplink (L2/ether), translate to ipv4, forward
> to wifi/ethernet <- trivial, no packet size change
>
> - ipv4 egressing through tun (L3), translate to ipv6, forward to
> cellular uplink <- trivial, no packet size change
>
> - ipv4 egressing through tun (L3), translate to ipv6, forward to wifi
> uplink <- need to add ethernet header [*]
>
> I think your approach doesn't solve the reverse path (* up above):
>
> ie. ipv4 packets hitting a tun device (owned by a clat daemon doing
> ipv4<->ipv6 translation in userspace), being stolen by a tc egress
> ebpf hook, mutated to ipv6 by ebpf and bpf_redirect'ed to egress
> through a wifi ipv6-only uplink.
>
> Though arguably in this case I could probably simply increase the tun
> device mtu by another 14, while keeping ipv4 route mtus low...
> (tun mtu already has to be 28 bytes lower than wifi mtu to allow
> replacement of ipv4 with ipv6 header (20 bytes extra), with possibly
> an ipv6 frag header (8 more bytes))
>
> Any further thoughts?
Thinking about this some more, that seems to solve the immediate need
(case 1 above), and I can work around case 4 with tun mtu bumps.
And maybe the real correct fix would be to simply pass in the desired path mtu
to these 3 functions via 16-bits of the flags argument.
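Purely as an illustration of the flags idea, and not an existing or proposed uapi, one way to carry a 16-bit mtu in a 64-bit flags word is sketched below. The bit position (the top 16 bits), the macro names and the "0 means use the fallback" convention are all assumptions made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: mtu in the top 16 bits of a 64-bit flags argument. */
#define MODEL_MTU_SHIFT 48
#define MODEL_MTU_MASK  ((uint64_t)0xffff << MODEL_MTU_SHIFT)

/* Store mtu in the flags word without disturbing the other bits. */
static inline uint64_t model_flags_with_mtu(uint64_t flags, uint16_t mtu)
{
	return (flags & ~MODEL_MTU_MASK) | ((uint64_t)mtu << MODEL_MTU_SHIFT);
}

/* Extract the mtu carried in the flags word. */
static inline uint16_t model_flags_mtu(uint64_t flags)
{
	return (uint16_t)((flags & MODEL_MTU_MASK) >> MODEL_MTU_SHIFT);
}

/* Assumed convention: 0 means "no mtu supplied", so use the fallback. */
static inline uint32_t model_effective_max_len(uint64_t flags, uint32_t fallback)
{
	uint16_t mtu = model_flags_mtu(flags);

	return mtu ? mtu : fallback;
}
```

One limitation of any 16-bit encoding is that it tops out at 65535, so very large limits could not be expressed this way.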
From: Maciej Żenczykowski <[email protected]>
__bpf_skb_max_len(skb) is used from:
bpf_skb_adjust_room
__bpf_skb_change_tail
__bpf_skb_change_head
but in the case of forwarding we're likely calling these functions
during receive processing on ingress and bpf_redirect()'ing at
a later point in time to egress on another interface, thus these
mtu checks are for the wrong device (input instead of output).
This is particularly problematic if we're receiving on an L3 1500 mtu
cellular interface, trying to add an L2 header and forwarding to
an L3 1500 mtu wifi/ethernet device (which is thus L2 1514).
The mtu check prevents us from adding the 14 byte ethernet header prior
to forwarding the packet.
After the packet has already been redirected, we'd need to add
an additional 2nd ebpf program on the target device's egress tc hook,
but then we'd also see non-redirected traffic and have no easy
way to tell apart normal egress with ethernet header packets
from forwarded ethernet headerless packets.
Cc: Alexei Starovoitov <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Signed-off-by: Maciej Żenczykowski <[email protected]>
---
net/core/filter.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 7d6ceaa54d21..5c8243930462 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3159,8 +3159,9 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
static u32 __bpf_skb_max_len(const struct sk_buff *skb)
{
- return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
- SKB_MAX_ALLOC;
+ if (skb_at_tc_ingress(skb) || !skb->dev)
+ return SKB_MAX_ALLOC;
+ return skb->dev->mtu + skb->dev->hard_header_len;
}
BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
--
2.26.2.526.g744177e7f7-goog
On 5/7/20 4:36 AM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <[email protected]>
>
> __bpf_skb_max_len(skb) is used from:
> bpf_skb_adjust_room
> __bpf_skb_change_tail
> __bpf_skb_change_head
>
> but in the case of forwarding we're likely calling these functions
> during receive processing on ingress and bpf_redirect()'ing at
> a later point in time to egress on another interface, thus these
> mtu checks are for the wrong device (input instead of output).
>
> This is particularly problematic if we're receiving on an L3 1500 mtu
> cellular interface, trying to add an L2 header and forwarding to
> an L3 1500 mtu wifi/ethernet device (which is thus L2 1514).
>
> The mtu check prevents us from adding the 14 byte ethernet header prior
> to forwarding the packet.
>
> After the packet has already been redirected, we'd need to add
> an additional 2nd ebpf program on the target device's egress tc hook,
> but then we'd also see non-redirected traffic and have no easy
> way to tell apart normal egress with ethernet header packets
> from forwarded ethernet headerless packets.
>
> Cc: Alexei Starovoitov <[email protected]>
> Cc: Jakub Kicinski <[email protected]>
> Signed-off-by: Maciej Żenczykowski <[email protected]>
> ---
> net/core/filter.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7d6ceaa54d21..5c8243930462 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3159,8 +3159,9 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
>
> static u32 __bpf_skb_max_len(const struct sk_buff *skb)
> {
> - return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
> - SKB_MAX_ALLOC;
> + if (skb_at_tc_ingress(skb) || !skb->dev)
> + return SKB_MAX_ALLOC;
> + return skb->dev->mtu + skb->dev->hard_header_len;
> }
But then why even have any MTU check in the first place? The above would basically
break the case of a one-legged load-balancer: the skb comes in at tc ingress, we
adjust its size, and we are allowed to do so up to SKB_MAX_ALLOC. Then we redirect
it via bpf out through the same device it came in on. I suppose we are the ones
responsible for asserting here that it doesn't exceed the MTU.
There are 3 ways an skb can exit the prog on tc ingress or egress: i) we redirect
to ingress, ii) we redirect to egress, or iii) we just return TC_ACT_OK. Case i)
is already asserted via ____dev_forward_skb() today. If we fix/relax
__bpf_skb_max_len(), we would also need to assert the other two locations, with
something hacked up like the below. On top of this it probably makes sense to
expose the current MTU, but that can be optional.
Thoughts?
Thanks,
Daniel
From 95464f75ed8d520b9ff068b72687a422465686cd Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <[email protected]>
Date: Thu, 7 May 2020 16:46:30 +0200
Subject: [PATCH] bpf: xxx
Signed-off-by: Daniel Borkmann <[email protected]>
---
include/linux/netdevice.h | 25 +++++++++++++++++++++++--
include/uapi/linux/bpf.h | 1 +
net/core/dev.c | 24 +++---------------------
net/core/filter.c | 22 +++++++++++++++++-----
4 files changed, 44 insertions(+), 28 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5a8d40f1ffe2..19770744d5b5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3787,8 +3787,29 @@ int xdp_umem_query(struct net_device *dev, u16 queue_id);
int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
-bool is_skb_forwardable(const struct net_device *dev,
- const struct sk_buff *skb);
+
+static __always_inline bool is_skb_size_ok(const struct net_device *dev,
+ const struct sk_buff *skb)
+{
+ static const u32 vlan_header_len = 4;
+
+ if (skb->len <= dev->mtu + dev->hard_header_len + vlan_header_len)
+ return true;
+
+ /* If TSO is enabled, we don't care about the length as the packet
+ * could be forwarded without being segmented before.
+ */
+ return skb_is_gso(skb);
+}
+
+static __always_inline bool is_skb_forwardable(const struct net_device *dev,
+ const struct sk_buff *skb)
+{
+ if (unlikely(!(dev->flags & IFF_UP)))
+ return false;
+
+ return is_skb_size_ok(dev, skb);
+}
static __always_inline int ____dev_forward_skb(struct net_device *dev,
struct sk_buff *skb)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b3643e27e264..0239e415a469 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3370,6 +3370,7 @@ struct __sk_buff {
__u32 gso_segs;
__bpf_md_ptr(struct bpf_sock *, sk);
__u32 gso_size;
+ __u32 mtu;
};
struct bpf_tunnel_key {
diff --git a/net/core/dev.c b/net/core/dev.c
index afff16849c26..b3bf738fc36f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2100,27 +2100,6 @@ static inline void net_timestamp_set(struct sk_buff *skb)
__net_timestamp(SKB); \
} \
-bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
-{
- unsigned int len;
-
- if (!(dev->flags & IFF_UP))
- return false;
-
- len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
- if (skb->len <= len)
- return true;
-
- /* if TSO is enabled, we don't care about the length as the packet
- * could be forwarded without being segmented before
- */
- if (skb_is_gso(skb))
- return true;
-
- return false;
-}
-EXPORT_SYMBOL_GPL(is_skb_forwardable);
-
int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
{
int ret = ____dev_forward_skb(dev, skb);
@@ -3786,8 +3765,11 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
case TC_ACT_OK:
case TC_ACT_RECLASSIFY:
skb->tc_index = TC_H_MIN(cl_res.classid);
+ if (unlikely(!is_skb_size_ok(dev, skb)))
+ goto drop;
break;
case TC_ACT_SHOT:
+drop:
mini_qdisc_qstats_cpu_drop(miniq);
*ret = NET_XMIT_DROP;
kfree_skb(skb);
diff --git a/net/core/filter.c b/net/core/filter.c
index dfaf5df13722..54db75bf15c5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2037,10 +2037,11 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
{
int ret;
- if (dev_xmit_recursion()) {
+ if (unlikely(!is_skb_forwardable(dev, skb)))
+ goto drop;
+ if (unlikely(dev_xmit_recursion())) {
net_crit_ratelimited("bpf: recursion limit reached on datapath, buggy bpf program?\n");
- kfree_skb(skb);
- return -ENETDOWN;
+ goto drop;
}
skb->dev = dev;
@@ -2051,6 +2052,10 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
dev_xmit_recursion_dec();
return ret;
+drop:
+ atomic_long_inc(&dev->rx_dropped);
+ kfree_skb(skb);
+ return -EIO;
}
static int __bpf_redirect_no_mac(struct sk_buff *skb, struct net_device *dev,
@@ -3148,8 +3153,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
static u32 __bpf_skb_max_len(const struct sk_buff *skb)
{
- return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len :
- SKB_MAX_ALLOC;
+ return SKB_MAX_ALLOC;
}
BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
@@ -7831,6 +7835,14 @@ static u32 tc_cls_act_convert_ctx_access(enum bpf_access_type type,
bpf_target_off(struct net_device, ifindex, 4,
target_size));
break;
+ case offsetof(struct __sk_buff, mtu):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, dev),
+ si->dst_reg, si->src_reg,
+ offsetof(struct sk_buff, dev));
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ bpf_target_off(struct net_device, mtu, 4,
+ target_size));
+ break;
default:
return bpf_convert_ctx_access(type, si, insn_buf, prog,
target_size);
--
2.21.0
(a) not clear why the max is SKB_MAX_ALLOC in the first place (this is
PAGE_SIZE << 2, ie. 16K on x86), while lo mtu is 64k
(b) hmm, if we're not redirecting, then exceeding the ingress device's
mtu doesn't seem to be a problem.
Indeed AFAIK this can already happen, some devices will round mtu up
when they configure the device mru buffers.
(ie. you configure L3 mtu 1500, they treat that as L2 1536 or 1532 [-4
fcs], simply because 3 * 512 is a nice amount of buffers, or they'll
accept not only 1514 L2, but also 1518 L2 or even 1522 L2 for VLAN and
Q-IN-Q -- even if the packets aren't actually VLAN'ed, so your non
VLAN mru might be 1504 or 1508)
Indeed my corp dell workstation with some standard 1 gigabit
motherboard nic has a standard default mtu of 1500, and I've seen it
receive L3 mtu 1520 packets (apparently due to misconfiguration in our
hardware [cisco? juniper?] ipv4->ipv6 translator which can take 1500
mtu ipv4 packets and convert them to 1520 mtu ipv6 packets without
fragmenting or generating icmp too big errors). While it's obviously
wrong, it does just work (the network paths themselves are also
obviously 1520 clean).
(c) If we are redirecting we'll eventually (after bpf program returns)
hit dev_queue_xmit(), and shouldn't that be what returns an error?
btw. is_skb_forwardable() actually tests
- device is up && (packet is gso || skb->len <= dev->mtu +
dev->hard_header_len + VLAN_HLEN);
which is also wrong, and in 2 ways: VLAN_HLEN makes no sense on
non-ethernet devices, and the __bpf_skb_max_len function doesn't account
for VLAN... (which possibly has implications if you try to redirect to a
vlan interface)
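The mismatch between the two limits can be seen in a small user-space model. The names here are hypothetical and the structs are stand-ins; the forwardability test follows the is_skb_forwardable() logic (device up, length within mtu + hard_header_len + 4 bytes of VLAN header, or GSO), and the second function is the plain `mtu + hard_header_len` formula.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_VLAN_HLEN 4u /* same value as the kernel's VLAN_HLEN */

/* Hypothetical stand-in for the struct net_device fields used here. */
struct model_dev {
	uint32_t mtu;
	uint16_t hard_header_len;
	bool up; /* models dev->flags & IFF_UP */
};

/* Models is_skb_forwardable(): GSO packets skip the length test. */
static inline bool model_is_forwardable(const struct model_dev *dev,
					uint32_t len, bool gso)
{
	assert(dev != NULL);
	if (!dev->up)
		return false;
	if (len <= dev->mtu + dev->hard_header_len + MODEL_VLAN_HLEN)
		return true;
	return gso;
}

/* The limit used by the bpf helpers: no VLAN allowance. */
static inline uint32_t model_bpf_max_len(const struct model_dev *dev)
{
	assert(dev != NULL);
	return dev->mtu + dev->hard_header_len;
}
```

For an ethernet device with mtu 1500 the forwarding test accepts up to 1518 bytes, while the bpf helper limit is 1514, so the two disagree by exactly the 4 byte VLAN allowance.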
---
I think having an mtu check is useful, but I think the mtu should be
selectable by the bpf program. Because it might not even be device
mtu at all, it might be path mtu which we should be testing against.
It should also be checked for gso frames, since the max post
segmentation size should be enforced.
---
I agree we should expose dev->mtu (and dev->hard_header_len and hatype)
I'll mull this over a bit more, but I'm not convinced this patch isn't ok as is.
There just is probably more we should do.
On 5/7/20 6:46 PM, Maciej Żenczykowski wrote:
> (a) not clear why the max is SKB_MAX_ALLOC in the first place (this is
> PAGE_SIZE << 2, ie. 16K on x86), while lo mtu is 64k
Agreed, tbh, it's not clear to me either atm. :) The SKB_MAX_ALLOC constant itself
should be replaced with something more appropriate. Also as a small side note,
the !skb->dev check should be removed since in tc ingress/egress the skb->dev
is never NULL. (See also tc_cls_act_convert_ctx_access() on struct __sk_buff,
ifindex conversion.)
> (b) hmm, if we're not redirecting, then exceeding the ingress device's
> mtu doesn't seem to be a problem.
>
> Indeed AFAIK this can already happen, some devices will round mtu up
> when they configure the device mru buffers.
> (ie. you configure L3 mtu 1500, they treat that as L2 1536 or 1532 [-4
> fcs], simply because 3 * 512 is a nice amount of buffers, or they'll
> accept not only 1514 L2, but also 1518 L2 or even 1522 L2 for VLAN and
> Q-IN-Q -- even if the packets aren't actually VLAN'ed, so your non
> VLAN mru might be 1504 or 1508)
>
> Indeed my corp dell workstation with some standard 1 gigabit
> motherboard nic has a standard default mtu of 1500, and I've seen it
> receive L3 mtu 1520 packets (apparently due to misconfiguration in our
> hardware [cisco? juniper?] ipv4->ipv6 translator which can take 1500
> mtu ipv4 packets and convert them to 1520 mtu ipv6 packets without
> fragmenting or generating icmp too big errors). While it's obviously
> wrong, it does just work (the network paths themselves are also
> obviously 1520 clean).
Right, agree on tc ingress side when skb goes further up the stack.
> (c) If we are redirecting we'll eventually (after bpf program returns)
> hit dev_queue_xmit(), and shouldn't that be what returns an error?
You mean whether the check should be inside __dev_queue_xmit() itself
instead? Maybe. From a cursory glance the MTU checks are spread in upper
layer protos. As mentioned, we would need to have coverage for BPF progs
attached on the tc ingress and egress (sch_handle_{ingress,egress}())
hook where for each case an skb can be just TC_ACT_OK'ed (up to stack or
down to driver), redirected via ____dev_forward_skb() or dev_queue_xmit().
The ____dev_forward_skb() has us covered and for TC_ACT_OK on tc ingress,
we'd only need a generic upper cap. So for the rest of the cases, we'd
need to have some sort of sanity check somewhere.
> btw. is_skb_forwardable() actually tests
> - device is up && (packet is gso || skb->len <= dev->mtu +
> dev->hard_header_len + VLAN_HLEN);
>
> which is also wrong, and in 2 ways: VLAN_HLEN makes no sense on
> non-ethernet devices, and the __bpf_skb_max_len function doesn't account
> for VLAN... (which possibly has implications if you try to redirect to a
> vlan interface)
Yeah, it probably would for QinQ which is probably why no one was running
into it so far. At least the skb_vlan_push() would first store the tag
via __vlan_hwaccel_put_tag() in the skb itself.
> I think having an mtu check is useful, but I think the mtu should be
> selectable by the bpf program. Because it might not even be device
> mtu at all, it might be path mtu which we should be testing against.
> It should also be checked for gso frames, since the max post
> segmentation size should be enforced.
>
> I agree we should expose dev->mtu (and dev->hard_header_len and hatype)
>
> I'll mull this over a bit more, but I'm not convinced this patch isn't ok as is.
> There just is probably more we should do.
Ok, makes sense. Thanks for looking into it!