Before this commit, dev_forward_skb() always cleared packet's
per-network-namespace info. Even if the packet doesn't cross
network namespaces.
The comment above dev_forward_skb() describes that this is done
because the receiving device may be in another network namespace.
However, this case can easily be tested for and therefore we can
scrub packet's per-network-namespace info only when receiving device
is indeed in another network namespace.
Therefore, this commit changes ____dev_forward_skb() to tell
skb_scrub_packet() that skb has crossed network-namespace only in case
transmitting device (skb->dev) network namespace is different then
receiving device (dev) network namespace.
An example of a netdev that use skb_forward_skb() is veth.
Thus, before this commit a packet transmitted from one veth peer to
another when both veth peers are on same network namespace will lose
it's skb->mark. The bug could easily be demonstrated by the following:
ip netns add test
ip netns exec test bash
ip link add veth-a type veth peer name veth-b
ip link set veth-a up
ip link set veth-b up
ip addr add dev veth-a 12.0.0.1/24
tc qdisc add dev veth-a root handle 1 prio
tc qdisc add dev veth-b ingress
tc filter add dev veth-a parent 1: u32 match u32 0 0 action skbedit mark 1337
tc filter add dev veth-b parent ffff: basic match 'meta(nf_mark eq 1337)' action simple "skb->mark 1337!"
dmesg -C
ping 12.0.0.2
dmesg
Before this change, the above will print nothing to dmesg.
After this change, "skb->mark 1337!" will be printed as necessary.
Signed-off-by: Liran Alon <[email protected]>
Reviewed-by: Yuval Shaia <[email protected]>
Signed-off-by: Yuval Shaia <[email protected]>
---
include/linux/netdevice.h | 2 +-
net/core/dev.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5eef6c8e2741..5908f1e31ee2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3371,7 +3371,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
return NET_RX_DROP;
}
- skb_scrub_packet(skb, true);
+ skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev)));
skb->priority = 0;
return 0;
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 2cedf520cb28..087787dd0a50 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1877,9 +1877,9 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
* start_xmit function of one device into the receive queue
* of another device.
*
- * The receiving device may be in another namespace, so
- * we have to clear all information in the skb that could
- * impact namespace isolation.
+ * The receiving device may be in another namespace.
+ * In that case, we have to clear all information in the
+ * skb that could impact namespace isolation.
*/
int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
{
--
1.9.1
On Tue, Mar 13, 2018 at 05:07:22PM +0200, Liran Alon wrote:
> Before this commit, dev_forward_skb() always cleared packet's
> per-network-namespace info. Even if the packet doesn't cross
> network namespaces.
>
> The comment above dev_forward_skb() describes that this is done
> because the receiving device may be in another network namespace.
> However, this case can easily be tested for and therefore we can
> scrub packet's per-network-namespace info only when receiving device
> is indeed in another network namespace.
>
> Therefore, this commit changes ____dev_forward_skb() to tell
> skb_scrub_packet() that skb has crossed network-namespace only in case
> transmitting device (skb->dev) network namespace is different then
> receiving device (dev) network namespace.
>
> An example of a netdev that use skb_forward_skb() is veth.
> Thus, before this commit a packet transmitted from one veth peer to
> another when both veth peers are on same network namespace will lose
> it's skb->mark. The bug could easily be demonstrated by the following:
>
> ip netns add test
> ip netns exec test bash
> ip link add veth-a type veth peer name veth-b
> ip link set veth-a up
> ip link set veth-b up
> ip addr add dev veth-a 12.0.0.1/24
> tc qdisc add dev veth-a root handle 1 prio
> tc qdisc add dev veth-b ingress
> tc filter add dev veth-a parent 1: u32 match u32 0 0 action skbedit mark 1337
> tc filter add dev veth-b parent ffff: basic match 'meta(nf_mark eq 1337)' action simple "skb->mark 1337!"
> dmesg -C
> ping 12.0.0.2
> dmesg
>
> Before this change, the above will print nothing to dmesg.
> After this change, "skb->mark 1337!" will be printed as necessary.
Hi Liran,
>
> Signed-off-by: Liran Alon <[email protected]>
> Reviewed-by: Yuval Shaia <[email protected]>
> Signed-off-by: Yuval Shaia <[email protected]>
I did not earned the credits for SOB, only r-b.
Yuval
> ---
> include/linux/netdevice.h | 2 +-
> net/core/dev.c | 6 +++---
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5eef6c8e2741..5908f1e31ee2 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3371,7 +3371,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
> return NET_RX_DROP;
> }
>
> - skb_scrub_packet(skb, true);
> + skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev)));
> skb->priority = 0;
> return 0;
> }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2cedf520cb28..087787dd0a50 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1877,9 +1877,9 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
> * start_xmit function of one device into the receive queue
> * of another device.
> *
> - * The receiving device may be in another namespace, so
> - * we have to clear all information in the skb that could
> - * impact namespace isolation.
> + * The receiving device may be in another namespace.
> + * In that case, we have to clear all information in the
> + * skb that could impact namespace isolation.
> */
> int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
> {
> --
> 1.9.1
>
On Tue, Mar 13, 2018 at 06:13:45PM +0200, Yuval Shaia wrote:
> On Tue, Mar 13, 2018 at 05:07:22PM +0200, Liran Alon wrote:
> > Before this commit, dev_forward_skb() always cleared packet's
> > per-network-namespace info. Even if the packet doesn't cross
> > network namespaces.
> >
> > The comment above dev_forward_skb() describes that this is done
> > because the receiving device may be in another network namespace.
> > However, this case can easily be tested for and therefore we can
> > scrub packet's per-network-namespace info only when receiving device
> > is indeed in another network namespace.
> >
> > Therefore, this commit changes ____dev_forward_skb() to tell
> > skb_scrub_packet() that skb has crossed network-namespace only in case
> > transmitting device (skb->dev) network namespace is different then
> > receiving device (dev) network namespace.
> >
> > An example of a netdev that use skb_forward_skb() is veth.
> > Thus, before this commit a packet transmitted from one veth peer to
> > another when both veth peers are on same network namespace will lose
> > it's skb->mark. The bug could easily be demonstrated by the following:
> >
> > ip netns add test
> > ip netns exec test bash
> > ip link add veth-a type veth peer name veth-b
> > ip link set veth-a up
> > ip link set veth-b up
> > ip addr add dev veth-a 12.0.0.1/24
> > tc qdisc add dev veth-a root handle 1 prio
> > tc qdisc add dev veth-b ingress
> > tc filter add dev veth-a parent 1: u32 match u32 0 0 action skbedit mark 1337
> > tc filter add dev veth-b parent ffff: basic match 'meta(nf_mark eq 1337)' action simple "skb->mark 1337!"
> > dmesg -C
> > ping 12.0.0.2
> > dmesg
> >
> > Before this change, the above will print nothing to dmesg.
> > After this change, "skb->mark 1337!" will be printed as necessary.
>
> Hi Liran,
>
> >
> > Signed-off-by: Liran Alon <[email protected]>
> > Reviewed-by: Yuval Shaia <[email protected]>
> > Signed-off-by: Yuval Shaia <[email protected]>
>
> I did not earned the credits for SOB, only r-b.
Had an offlist conversation with Liran,
Turns out that this SOB is ok.
Yuval
>
> Yuval
>
> > ---
> > include/linux/netdevice.h | 2 +-
> > net/core/dev.c | 6 +++---
> > 2 files changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 5eef6c8e2741..5908f1e31ee2 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3371,7 +3371,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
> > return NET_RX_DROP;
> > }
> >
> > - skb_scrub_packet(skb, true);
> > + skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev)));
> > skb->priority = 0;
> > return 0;
> > }
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 2cedf520cb28..087787dd0a50 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -1877,9 +1877,9 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
> > * start_xmit function of one device into the receive queue
> > * of another device.
> > *
> > - * The receiving device may be in another namespace, so
> > - * we have to clear all information in the skb that could
> > - * impact namespace isolation.
> > + * The receiving device may be in another namespace.
> > + * In that case, we have to clear all information in the
> > + * skb that could impact namespace isolation.
> > */
> > int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
> > {
> > --
> > 1.9.1
> >
Hi,
On Tue, 13 Mar 2018 17:07:22 +0200 Liran Alon <[email protected]> wrote:
> Before this commit, dev_forward_skb() always cleared packet's
> per-network-namespace info. Even if the packet doesn't cross
> network namespaces.
>
> The comment above dev_forward_skb() describes that this is done
> because the receiving device may be in another network namespace.
> However, this case can easily be tested for and therefore we can
> scrub packet's per-network-namespace info only when receiving device
> is indeed in another network namespace.
>
> Therefore, this commit changes ____dev_forward_skb() to tell
> skb_scrub_packet() that skb has crossed network-namespace only in case
> transmitting device (skb->dev) network namespace is different then
> receiving device (dev) network namespace.
Assuming the premise of this commit is correct, note it may not act as
intended for xnet situation in ipvlan_process_multicast, snip:
nskb->dev = ipvlan->dev;
if (tx_pkt)
ret = dev_forward_skb(ipvlan->dev, nskb);
else
ret = netif_rx(nskb);
as 'dev' gets already assigned to nskb prior dev_forward_skb (hence in
____dev_forward_skb both dev and skb->dev are the same).
Fortunately every ipvlan_multicast_enqueue call is preceded by a forced
scrub; It would be future-proof to not assign nskb->dev in the
dev_forward_skb case (assign it only in the netif_rx case).
Regarding the premise of this commit, this "reduces" the
ipvs/orphan/mark scrubbing in the following *non* xnet situations:
1. mac2vlan port xmit to other macvlan ports in Bridge Mode
2. similarly for ipvlan
3. veth xmit
4. l2tp_eth_dev_recv
5. bpf redirect/clone_redirect ingress actions
Regarding l2tp recv, this commit seems to align the srubbing behavior
with ip tunnels (full scrub only if crossing netns, see ip_tunnel_rcv).
Regarding veth xmit, it does makes sense to preserve the fields if not
crossing netns. This is also the case when one uses tc mirred.
Regarding bpf redirect, well, it depends on the expectations of each bpf
program.
I'd argue that preserving the fields (at least the mark field) in the
*non* xnet makes sense and provides more information and therefore more
capabilities; Alas this might change behavior already being relied on.
Maybe Daniel can comment on the matter.
Regards,
Shmulik
On 03/15/2018 10:21 AM, Shmulik Ladkani wrote:
> Hi,
>
> On Tue, 13 Mar 2018 17:07:22 +0200 Liran Alon <[email protected]> wrote:
>> Before this commit, dev_forward_skb() always cleared packet's
>> per-network-namespace info. Even if the packet doesn't cross
>> network namespaces.
>>
>> The comment above dev_forward_skb() describes that this is done
>> because the receiving device may be in another network namespace.
>> However, this case can easily be tested for and therefore we can
>> scrub packet's per-network-namespace info only when receiving device
>> is indeed in another network namespace.
>>
>> Therefore, this commit changes ____dev_forward_skb() to tell
>> skb_scrub_packet() that skb has crossed network-namespace only in case
>> transmitting device (skb->dev) network namespace is different then
>> receiving device (dev) network namespace.
>
> Assuming the premise of this commit is correct, note it may not act as
> intended for xnet situation in ipvlan_process_multicast, snip:
>
> nskb->dev = ipvlan->dev;
> if (tx_pkt)
> ret = dev_forward_skb(ipvlan->dev, nskb);
> else
> ret = netif_rx(nskb);
>
> as 'dev' gets already assigned to nskb prior dev_forward_skb (hence in
> ____dev_forward_skb both dev and skb->dev are the same).
> Fortunately every ipvlan_multicast_enqueue call is preceded by a forced
> scrub; It would be future-proof to not assign nskb->dev in the
> dev_forward_skb case (assign it only in the netif_rx case).
>
>
> Regarding the premise of this commit, this "reduces" the
> ipvs/orphan/mark scrubbing in the following *non* xnet situations:
>
> 1. mac2vlan port xmit to other macvlan ports in Bridge Mode
> 2. similarly for ipvlan
> 3. veth xmit
> 4. l2tp_eth_dev_recv
> 5. bpf redirect/clone_redirect ingress actions
>
> Regarding l2tp recv, this commit seems to align the srubbing behavior
> with ip tunnels (full scrub only if crossing netns, see ip_tunnel_rcv).
>
> Regarding veth xmit, it does makes sense to preserve the fields if not
> crossing netns. This is also the case when one uses tc mirred.
>
> Regarding bpf redirect, well, it depends on the expectations of each bpf
> program.
> I'd argue that preserving the fields (at least the mark field) in the
> *non* xnet makes sense and provides more information and therefore more
> capabilities; Alas this might change behavior already being relied on.
>
> Maybe Daniel can comment on the matter.
Overall I think it might be nice to not need scrubbing skb in such cases,
although my concern would be that this has potential to break existing
setups when they would expect mark being zero on other veth peer in any
case since it's the behavior for a long time already. The safer option
would be to have some sort of explicit opt-in e.g. on link creation to let
the skb->mark pass through unscrubbed. This would definitely be a useful
option e.g. when mark is set in the netns facing veth via clsact/egress
on xmit and when the container is unprivileged anyway.
Thanks,
Daniel
----- [email protected] wrote:
> Hi,
>
> On Tue, 13 Mar 2018 17:07:22 +0200 Liran Alon <[email protected]>
> wrote:
> > Before this commit, dev_forward_skb() always cleared packet's
> > per-network-namespace info. Even if the packet doesn't cross
> > network namespaces.
> >
> > The comment above dev_forward_skb() describes that this is done
> > because the receiving device may be in another network namespace.
> > However, this case can easily be tested for and therefore we can
> > scrub packet's per-network-namespace info only when receiving
> device
> > is indeed in another network namespace.
> >
> > Therefore, this commit changes ____dev_forward_skb() to tell
> > skb_scrub_packet() that skb has crossed network-namespace only in
> case
> > transmitting device (skb->dev) network namespace is different then
> > receiving device (dev) network namespace.
>
> Assuming the premise of this commit is correct, note it may not act
> as
> intended for xnet situation in ipvlan_process_multicast, snip:
>
> nskb->dev = ipvlan->dev;
> if (tx_pkt)
> ret = dev_forward_skb(ipvlan->dev, nskb);
> else
> ret = netif_rx(nskb);
>
> as 'dev' gets already assigned to nskb prior dev_forward_skb (hence
> in
> ____dev_forward_skb both dev and skb->dev are the same).
> Fortunately every ipvlan_multicast_enqueue call is preceded by a
> forced
> scrub; It would be future-proof to not assign nskb->dev in the
> dev_forward_skb case (assign it only in the netif_rx case).
I agree. Nice catch.
skb->dev will later get assigned in eth_type_trans() called from __dev_forward_skb().
I will do this small change in a separate patch before this patch.
(In v2 of this series)
>
>
> Regarding the premise of this commit, this "reduces" the
> ipvs/orphan/mark scrubbing in the following *non* xnet situations:
>
> 1. mac2vlan port xmit to other macvlan ports in Bridge Mode
> 2. similarly for ipvlan
> 3. veth xmit
> 4. l2tp_eth_dev_recv
> 5. bpf redirect/clone_redirect ingress actions
>
> Regarding l2tp recv, this commit seems to align the srubbing behavior
> with ip tunnels (full scrub only if crossing netns, see
> ip_tunnel_rcv).
>
> Regarding veth xmit, it does makes sense to preserve the fields if
> not
> crossing netns. This is also the case when one uses tc mirred.
>
> Regarding bpf redirect, well, it depends on the expectations of each
> bpf
> program.
> I'd argue that preserving the fields (at least the mark field) in the
> *non* xnet makes sense and provides more information and therefore
> more
> capabilities; Alas this might change behavior already being relied
> on.
I now noticed that a similar change was done in the past in
commit 8a83a00b0735 ("net: maintain namespace isolation
between vlan and real device"). Commit changed dev_forward_skb() from always
setting skb->mark=0 to do this only in case we cross netns.
However, a later commit: 59b9997baba5 ("Revert "net: maintain namespace
isolation between vlan and real device") seems to later reverted that change.
But I think that the regression issue mentioned in the revert isn't
related to the change proposed by this commit.
Please correct me if I'm missing something.
>
> Maybe Daniel can comment on the matter.
>
> Regards,
> Shmulik
----- [email protected] wrote:
> On 03/15/2018 10:21 AM, Shmulik Ladkani wrote:
> > Regarding the premise of this commit, this "reduces" the
> > ipvs/orphan/mark scrubbing in the following *non* xnet situations:
> >
> > 1. mac2vlan port xmit to other macvlan ports in Bridge Mode
> > 2. similarly for ipvlan
> > 3. veth xmit
> > 4. l2tp_eth_dev_recv
> > 5. bpf redirect/clone_redirect ingress actions
> >
> > Regarding l2tp recv, this commit seems to align the srubbing
> behavior
> > with ip tunnels (full scrub only if crossing netns, see
> ip_tunnel_rcv).
> >
> > Regarding veth xmit, it does makes sense to preserve the fields if
> not
> > crossing netns. This is also the case when one uses tc mirred.
> >
> > Regarding bpf redirect, well, it depends on the expectations of each
> bpf
> > program.
> > I'd argue that preserving the fields (at least the mark field) in
> the
> > *non* xnet makes sense and provides more information and therefore
> more
> > capabilities; Alas this might change behavior already being relied
> on.
> >
> > Maybe Daniel can comment on the matter.
>
> Overall I think it might be nice to not need scrubbing skb in such
> cases,
> although my concern would be that this has potential to break
> existing
> setups when they would expect mark being zero on other veth peer in
> any
> case since it's the behavior for a long time already. The safer
> option
> would be to have some sort of explicit opt-in e.g. on link creation to
> let
> the skb->mark pass through unscrubbed. This would definitely be a
> useful
> option e.g. when mark is set in the netns facing veth via
> clsact/egress
> on xmit and when the container is unprivileged anyway.
>
> Thanks,
> Daniel
I see your point in regards to backwards comparability.
However, not scrubbing skb when it cross netns via some kernel functions compared to
others is basically a bug which could easily break with a little bit of more refactoring.
Therefore, it seems a bit weird to me to from now on, we will force
every user on link creation to consider that once there was a bug leading
to this weird behavior on specific netdevs.
Thus, I suggest to maybe control this via a global /proc/sys/net file instead.
-Liran
Hi,
On Thu, 15 Mar 2018 12:56:13 +0100 Daniel Borkmann <[email protected]> wrote:
> On 03/15/2018 10:21 AM, Shmulik Ladkani wrote:
> >
> > Regarding veth xmit, it does makes sense to preserve the fields if not
> > crossing netns. This is also the case when one uses tc mirred.
> >
> > Regarding bpf redirect, well, it depends on the expectations of each bpf
> > program.
> > I'd argue that preserving the fields (at least the mark field) in the
> > *non* xnet makes sense and provides more information and therefore more
> > capabilities; Alas this might change behavior already being relied on.
> >
> > Maybe Daniel can comment on the matter.
>
> Overall I think it might be nice to not need scrubbing skb in such cases,
> although my concern would be that this has potential to break existing
> setups when they would expect mark being zero on other veth peer in any
> case since it's the behavior for a long time already. The safer option
> would be to have some sort of explicit opt-in e.g. on link creation to let
> the skb->mark pass through unscrubbed. This would definitely be a useful
> option e.g. when mark is set in the netns facing veth via clsact/egress
> on xmit and when the container is unprivileged anyway.
For the veth xmit case, an opt-in flag which disables mark scrubbing in
the *non* xnet veth-pair seems reasonable.
But what about bpf_redirect BPF_F_INGRESS, in setups not invovling
containers?
Currently bpf_redirect is implemented using dev_forward_skb which
*fully* scrubs the skb, even if the target device is on same netns as
skb->dev is.
One might use ebpf programs that perform BPF_F_INGRESS bpf_redirect, for
example for demuxing skbs arriving on some "master" device into various
"slave" devices using specialized critiria.
It would be beneficial to have the mark preserved when skb is injected
to the slave device's rx path (especially when it's on the same netns).
Liran's patch fixes this - but at the cost of changing existing behavior
for BPF_F_INGRESS users (formerly: fully scrubbed; post patch: scrubbed
only if xnet).
I wonder, do you know of implementations that actually RELY on the fact
that BPF_F_INGRESS actually clears the mark, in the *non* xnet case?
Regards,
Shmulik
Liran Alon <[email protected]> writes:
[...]
>> Overall I think it might be nice to not need scrubbing skb in such
>> cases,
>> although my concern would be that this has potential to break
>> existing
>> setups when they would expect mark being zero on other veth peer in
>> any
>> case since it's the behavior for a long time already. The safer
>> option
>> would be to have some sort of explicit opt-in e.g. on link creation to
>> let
>> the skb->mark pass through unscrubbed. This would definitely be a
>> useful
>> option e.g. when mark is set in the netns facing veth via
>> clsact/egress
>> on xmit and when the container is unprivileged anyway.
>>
>> Thanks,
>> Daniel
>
> I see your point in regards to backwards comparability.
> However, not scrubbing skb when it cross netns via some kernel functions compared to
> others is basically a bug which could easily break with a little bit of more refactoring.
> Therefore, it seems a bit weird to me to from now on, we will force
> every user on link creation to consider that once there was a bug leading
> to this weird behavior on specific netdevs.
> Thus, I suggest to maybe control this via a global /proc/sys/net file instead.
One valid use case could be preserving a source namespace nsid in
skb->mark when a packet crosses netns.
On 03/15/2018 03:35 PM, Roman Mashak wrote:
> Liran Alon <[email protected]> writes:
> [...]
>>> Overall I think it might be nice to not need scrubbing skb in such
>>> cases,
>>> although my concern would be that this has potential to break
>>> existing
>>> setups when they would expect mark being zero on other veth peer in
>>> any
>>> case since it's the behavior for a long time already. The safer
>>> option
>>> would be to have some sort of explicit opt-in e.g. on link creation to
>>> let
>>> the skb->mark pass through unscrubbed. This would definitely be a
>>> useful
>>> option e.g. when mark is set in the netns facing veth via
>>> clsact/egress
>>> on xmit and when the container is unprivileged anyway.
>>>
>>> Thanks,
>>> Daniel
>>
>> I see your point in regards to backwards comparability.
>> However, not scrubbing skb when it cross netns via some kernel functions compared to
>> others is basically a bug which could easily break with a little bit of more refactoring.
>> Therefore, it seems a bit weird to me to from now on, we will force
>> every user on link creation to consider that once there was a bug leading
>> to this weird behavior on specific netdevs.
Why bug specifically? It could well be that for some unpriv containers
it would be fine to do e.g. in cases where orchestrator sets up clsact/
egress on veth/ipvlan/etc in the container to set the mark and where app
cannot mess with this while for others you need to act out of host facing
veth; thus, explicit opt-in per dev could provide such more fine grained
control.
> One valid use case could be preserving a source namespace nsid in
> skb->mark when a packet crosses netns.
Right, was thinking about something similar.
----- [email protected] wrote:
> Liran Alon <[email protected]> writes:
>
>
> [...]
>
> >> Overall I think it might be nice to not need scrubbing skb in such
> >> cases,
> >> although my concern would be that this has potential to break
> >> existing
> >> setups when they would expect mark being zero on other veth peer
> in
> >> any
> >> case since it's the behavior for a long time already. The safer
> >> option
> >> would be to have some sort of explicit opt-in e.g. on link creation
> to
> >> let
> >> the skb->mark pass through unscrubbed. This would definitely be a
> >> useful
> >> option e.g. when mark is set in the netns facing veth via
> >> clsact/egress
> >> on xmit and when the container is unprivileged anyway.
> >>
> >> Thanks,
> >> Daniel
> >
> > I see your point in regards to backwards comparability.
> > However, not scrubbing skb when it cross netns via some kernel
> functions compared to
> > others is basically a bug which could easily break with a little bit
> of more refactoring.
> > Therefore, it seems a bit weird to me to from now on, we will force
> > every user on link creation to consider that once there was a bug
> leading
> > to this weird behavior on specific netdevs.
> > Thus, I suggest to maybe control this via a global /proc/sys/net
> file instead.
>
> One valid use case could be preserving a source namespace nsid in
> skb->mark when a packet crosses netns.
Before and after this commit, veth peers that crosses netns zero skb->mark.
Therefore, what you suggest is a new feature unrelated to the
issue fixed by this commit.
I still think that default behavior should be to zero skb->mark only when skb
cross netdevs in different netns. And for backwards comparability, we can
consider adding a /proc/sys/net/core file to let dev_forward_skb() to always
scrub packet regardless if it crosses netdevs or not.
----- [email protected] wrote:
> On 03/15/2018 03:35 PM, Roman Mashak wrote:
> > Liran Alon <[email protected]> writes:
> > [...]
> >>> Overall I think it might be nice to not need scrubbing skb in
> such
> >>> cases,
> >>> although my concern would be that this has potential to break
> >>> existing
> >>> setups when they would expect mark being zero on other veth peer
> in
> >>> any
> >>> case since it's the behavior for a long time already. The safer
> >>> option
> >>> would be to have some sort of explicit opt-in e.g. on link
> creation to
> >>> let
> >>> the skb->mark pass through unscrubbed. This would definitely be a
> >>> useful
> >>> option e.g. when mark is set in the netns facing veth via
> >>> clsact/egress
> >>> on xmit and when the container is unprivileged anyway.
> >>>
> >>> Thanks,
> >>> Daniel
> >>
> >> I see your point in regards to backwards comparability.
> >> However, not scrubbing skb when it cross netns via some kernel
> functions compared to
> >> others is basically a bug which could easily break with a little
> bit of more refactoring.
> >> Therefore, it seems a bit weird to me to from now on, we will
> force
> >> every user on link creation to consider that once there was a bug
> leading
> >> to this weird behavior on specific netdevs.
>
> Why bug specifically? It could well be that for some unpriv
> containers
> it would be fine to do e.g. in cases where orchestrator sets up
> clsact/
> egress on veth/ipvlan/etc in the container to set the mark and where
> app
> cannot mess with this while for others you need to act out of host
> facing
> veth; thus, explicit opt-in per dev could provide such more fine
> grained
> control.
>
> > One valid use case could be preserving a source namespace nsid in
> > skb->mark when a packet crosses netns.
>
> Right, was thinking about something similar.
I agree with all the above but this behavior was not supported both
before and after this commit. skb->mark is always zeroed when crossing netns.
This commit only changes behavior for skb crossing netdevs on *same* netns
via dev_forward_skb().
Therefore, I believe we should discuss here what we want default behavior to be
and how it should be controlled for backwards comparability.
Only after we should discuss about adding an extra feature of controlling skb scrub
per netdev or something similar.
On 03/15/2018 01:50 PM, Shmulik Ladkani wrote:
> On Thu, 15 Mar 2018 12:56:13 +0100 Daniel Borkmann <[email protected]> wrote:
>> On 03/15/2018 10:21 AM, Shmulik Ladkani wrote:
>>>
>>> Regarding veth xmit, it does makes sense to preserve the fields if not
>>> crossing netns. This is also the case when one uses tc mirred.
>>>
>>> Regarding bpf redirect, well, it depends on the expectations of each bpf
>>> program.
>>> I'd argue that preserving the fields (at least the mark field) in the
>>> *non* xnet makes sense and provides more information and therefore more
>>> capabilities; Alas this might change behavior already being relied on.
>>>
>>> Maybe Daniel can comment on the matter.
>>
>> Overall I think it might be nice to not need scrubbing skb in such cases,
>> although my concern would be that this has potential to break existing
>> setups when they would expect mark being zero on other veth peer in any
>> case since it's the behavior for a long time already. The safer option
>> would be to have some sort of explicit opt-in e.g. on link creation to let
>> the skb->mark pass through unscrubbed. This would definitely be a useful
>> option e.g. when mark is set in the netns facing veth via clsact/egress
>> on xmit and when the container is unprivileged anyway.
>
> For the veth xmit case, an opt-in flag which disables mark scrubbing in
> the *non* xnet veth-pair seems reasonable.
>
> But what about bpf_redirect BPF_F_INGRESS, in setups not invovling
> containers?
> Currently bpf_redirect is implemented using dev_forward_skb which
> *fully* scrubs the skb, even if the target device is on same netns as
> skb->dev is.
>
> One might use ebpf programs that perform BPF_F_INGRESS bpf_redirect, for
> example for demuxing skbs arriving on some "master" device into various
> "slave" devices using specialized critiria.
>
> It would be beneficial to have the mark preserved when skb is injected
> to the slave device's rx path (especially when it's on the same netns).
Right, I think also here the easiest would be to have a BPF_F_PRESERVE_MARK
flag to opt-in in general case (xnet/non-xnet) and where helper bails out
on unknown flag, but also for the redirect in the same netns I think it would
be useful to have a similar redirect mode as in ipvlan master where instead
of dev_forward_skb() you would set the skb->dev = dev and have a similar
notion of RX_HANDLER_ANOTHER. Was thinking about the latter more recently
but haven't gotten to implement it yet.
> Liran's patch fixes this - but at the cost of changing existing behavior
> for BPF_F_INGRESS users (formerly: fully scrubbed; post patch: scrubbed
> only if xnet).
>
> I wonder, do you know of implementations that actually RELY on the fact
> that BPF_F_INGRESS actually clears the mark, in the *non* xnet case?
Not that I'm aware of right now, but hard to tell what other people run
in the wild.
But lets presume for a sec you would _not_ scrub it, then how are users
supposed to make use of this? The feature/bug may not be critical enough
(well, otherwise it wouldn't have been like this for long time) for stable,
so to write an app relying on it the behavior will change from kernel A to
kernel B, where you need to end up having a full blown veth run-time test
in order to figure it out before you can use it, not really useful either.
Thanks,
Daniel
On Thu, 15 Mar 2018 16:13:39 +0100 Daniel Borkmann <[email protected]> wrote:
> On 03/15/2018 01:50 PM, Shmulik Ladkani wrote:
> >
> > It would be beneficial to have the mark preserved when skb is injected
> > to the slave device's rx path (especially when it's on the same netns).
>
> Right, I think also here the easiest would be to have a BPF_F_PRESERVE_MARK
> flag to opt-in in general case (xnet/non-xnet)
Sounds okay to me.
> But lets presume for a sec you would _not_ scrub it, then how are users
> supposed to make use of this? The feature/bug may not be critical enough
> (well, otherwise it wouldn't have been like this for long time) for stable,
> so to write an app relying on it the behavior will change from kernel A to
> kernel B, where you need to end up having a full blown veth run-time test
> in order to figure it out before you can use it, not really useful either.
Let's assume BPF_F_PRESERVE_MARK is a feature then, which is available only
in new kernels.
As said, this flag will not be honored by older kernels.
But your "run-time test" argument is true for every new flag-bit
introduced to bpf functions, for example:
BPF_F_SEQ_NUMBER was added after other skb_set_tunnel_key flags,
Same for BPF_F_INVALIDATE_HASH (skb_store_bytes), BPF_F_MARK_ENFORCE
(l4_csum_replace) and others.
With every flag addition, the flag mask validation in the corresponding
bpf function has been relaxed to support it.
Why is BPF_F_PRESERVE_MARK any different from any previous flag addition?
Thanks,
Shmulik
On Thu, 15 Mar 2018 08:01:03 -0700 (PDT) Liran Alon <[email protected]> wrote:
>
> I still think that default behavior should be to zero skb->mark only when skb
> cross netdevs in different netns.
But the previous default was scrub the mark in *both* xnet and non-xnet
situations.
Therefore, there might be users which RELY on this (strange) default
behavior in their same-netns-veth-pair setups.
Meaning, changing the default behavior might break their apps relying on
the former default behavior.
This is why the "disable mark scrubbing in non-xnet case" should be opt-in.
Regards,
Shmulik
----- [email protected] wrote:
> On Thu, 15 Mar 2018 08:01:03 -0700 (PDT) Liran Alon
> <[email protected]> wrote:
> >
> > I still think that default behavior should be to zero skb->mark only
> when skb
> > cross netdevs in different netns.
>
> But the previous default was scrub the mark in *both* xnet and
> non-xnet
> situations.
>
> Therefore, there might be users which RELY on this (strange) default
> behavior in their same-netns-veth-pair setups.
> Meaning, changing the default behavior might break their apps relying
> on
> the former default behavior.
>
> This is why the "disable mark scrubbing in non-xnet case" should be
> opt-in.
We think the same.
The only difference is that I think this for now should be controllable
by a global /proc/sys/net/core file instead of giving a flexible per-netdev control.
Because that is a larger change that could be done later.
>
> Regards,
> Shmulik
On Thu, 15 Mar 2018 09:35:51 -0700 (PDT) Liran Alon <[email protected]> wrote:
> ----- [email protected] wrote:
>
> > On Thu, 15 Mar 2018 08:01:03 -0700 (PDT) Liran Alon
> > <[email protected]> wrote:
> > >
> > > I still think that default behavior should be to zero skb->mark only
> > when skb
> > > cross netdevs in different netns.
> >
> > But the previous default was scrub the mark in *both* xnet and
> > non-xnet
> > situations.
> >
> > Therefore, there might be users which RELY on this (strange) default
> > behavior in their same-netns-veth-pair setups.
> > Meaning, changing the default behavior might break their apps relying
> > on
> > the former default behavior.
> >
> > This is why the "disable mark scrubbing in non-xnet case" should be
> > opt-in.
>
> We think the same.
> The only difference is that I think this for now should be controllable
> by a global /proc/sys/net/core file instead of giving a flexible per-netdev
> control.
> Because that is a larger change that could be done later.
A flags attribute to veth newlink is a very scoped change.
User controls this per veth creation.
This is way more neat than /proc/sys/net and provides the desired granular
control.
Also, scoping this to veth has the advantage of not affecting the many other
dev_forward_skb callers.
Regards,
Shmulik
----- [email protected] wrote:
> On Thu, 15 Mar 2018 09:35:51 -0700 (PDT) Liran Alon
> <[email protected]> wrote:
> > ----- [email protected] wrote:
> >
> > > On Thu, 15 Mar 2018 08:01:03 -0700 (PDT) Liran Alon
> > > <[email protected]> wrote:
> > > >
> > > > I still think that default behavior should be to zero skb->mark
> only
> > > when skb
> > > > cross netdevs in different netns.
> > >
> > > But the previous default was scrub the mark in *both* xnet and
> > > non-xnet
> > > situations.
> > >
> > > Therefore, there might be users which RELY on this (strange)
> default
> > > behavior in their same-netns-veth-pair setups.
> > > Meaning, changing the default behavior might break their apps
> relying
> > > on
> > > the former default behavior.
> > >
> > > This is why the "disable mark scrubbing in non-xnet case" should
> be
> > > opt-in.
> >
> > We think the same.
> > The only difference is that I think this for now should be
> controllable
> > by a global /proc/sys/net/core file instead of giving a flexible
> per-netdev
> > control.
> > Because that is a larger change that could be done later.
>
> A flags attribute to veth newlink is a very scoped change.
> User controls this per veth creation.
> This is way more neat than /proc/sys/net and provides the desired
> granular
> control.
>
> Also, scoping this to veth has the advantage of not affecting the many
> other
> dev_forward_skb callers.
Agreed. But isn't this an issue also for the
many others (& future) callers of dev_forward_skb()?
This seems problematic to me.
This will kinda leave a kernel interface with broken default behavior
for backwards comparability.
A flag to netdev or /proc/sys/net/core to "fix" default behavior
will avoid this.
>
> Regards,
> Shmulik
On 03/15/2018 04:54 PM, Shmulik Ladkani wrote:
> On Thu, 15 Mar 2018 16:13:39 +0100 Daniel Borkmann <[email protected]> wrote:
>> On 03/15/2018 01:50 PM, Shmulik Ladkani wrote:
>>>
>>> It would be beneficial to have the mark preserved when skb is injected
>>> to the slave device's rx path (especially when it's on the same netns).
>>
>> Right, I think also here the easiest would be to have a BPF_F_PRESERVE_MARK
>> flag to opt-in in general case (xnet/non-xnet)
>
> Sounds okay to me.
>
>> But lets presume for a sec you would _not_ scrub it, then how are users
>> supposed to make use of this? The feature/bug may not be critical enough
>> (well, otherwise it wouldn't have been like this for long time) for stable,
>> so to write an app relying on it the behavior will change from kernel A to
>> kernel B, where you need to end up having a full blown veth run-time test
>> in order to figure it out before you can use it, not really useful either.
>
> Let's assume BPF_F_PRESERVE_MARK is a feature then, which is available only
> in new kernels.
> As said, this flag will not be honored by older kernels.
>
> But your "run-time test" argument is true for every new flag-bit
> introduced to bpf functions, for example:
> BPF_F_SEQ_NUMBER was added after other skb_set_tunnel_key flags,
> Same for BPF_F_INVALIDATE_HASH (skb_store_bytes), BPF_F_MARK_ENFORCE
> (l4_csum_replace) and others.
>
> With every flag addition, the flag mask validation in the corresponding
> bpf function has been relaxed to support it.
>
> Why is BPF_F_PRESERVE_MARK any different from any previous flag addition?
Not really different than other BPF helpers, but you can simply figure it out
via BPF_PROG_TEST_RUN by calling bpf_redirect() helper on some fake ifindex
and check the return value whether it's shot or redirect. This is definitely
trivial to do and doesn't require any devs to set up compared to otherwise
full blown setup. E.g. test_verifier uses this for the test case array it
contains.
Cheers,
Daniel
From: Liran Alon <[email protected]>
Date: Tue, 13 Mar 2018 17:07:22 +0200
> Before this commit, dev_forward_skb() always cleared packet's
> per-network-namespace info. Even if the packet doesn't cross
> network namespaces.
There was a lot of discussion about this patch.
Particularly whether it could potentially break current
users or not.
If this is resolved and the patch should still be applied,
please repost and the folks involved in this dicussion should
add their ACKs.
Thanks.
On 20/03/18 16:47, David Miller wrote:
> From: Liran Alon <[email protected]>
> Date: Tue, 13 Mar 2018 17:07:22 +0200
>
>> Before this commit, dev_forward_skb() always cleared packet's
>> per-network-namespace info. Even if the packet doesn't cross
>> network namespaces.
>
> There was a lot of discussion about this patch.
>
> Particularly whether it could potentially break current
> users or not.
>
> If this is resolved and the patch should still be applied,
> please repost and the folks involved in this dicussion should
> add their ACKs.
>
> Thanks.
>
The problem is that I don't think we have reached an agreement.
I would be happy to here your opinion on the issue at hand here.
I personally don't understand why we should maintain
backwards-comparability to this behaviour. How would a user rely on the
fact that skb->mark is scrubbed when it is passed between 2 netdevs on
the same netns but only when it is passed between very specific netdev
type (one of them being veth-peers).
This behaviour seems to have been created by mistake.
This feature is not documented to user-mode and I don't see why it is
legit for the user to rely on it.
In addition, even if we do want to maintain backwards-comparability to
this behaviour, I think it is enough to have an opt-in flag in
/proc/sys/net/core/ that when set to 1 will activate the fix in
dev_forward_skb() provided by this patch. That would also be a very
simple change to the patch provided here.
Do you agree? Or do you think we should have a flag per netdev like
suggested in other replies to this thread?
Thanks,
-Liran
From: Liran Alon <[email protected]>
Date: Tue, 20 Mar 2018 17:34:38 +0200
> I personally don't understand why we should maintain
> backwards-comparability to this behaviour.
The reason is because not breaking things is a cornerstone of Linux
kernel development.
> This feature is not documented to user-mode and I don't see why it
> is legit for the user to rely on it.
Whether it is documented or not is irrelevant. A lot of our
interfaces and behaviors are not documented or poorly documented
at best.
> In addition, even if we do want to maintain backwards-comparability to
> this behaviour, I think it is enough to have an opt-in flag in
> /proc/sys/net/core/ that when set to 1 will activate the fix in
> dev_forward_skb() provided by this patch. That would also be a very
> simple change to the patch provided here.
Making it opt-in makes it more palatable, that's for sure.
On 20/03/18 18:00, David Miller wrote:
> From: Liran Alon <[email protected]>
> Date: Tue, 20 Mar 2018 17:34:38 +0200
>
>> I personally don't understand why we should maintain
>> backwards-comparability to this behaviour.
>
> The reason is because not breaking things is a cornerstone of Linux
> kernel development.
>
>> This feature is not documented to user-mode and I don't see why it
>> is legit for the user to rely on it.
>
> Whether it is documented or not is irrelevant. A lot of our
> interfaces and behaviors are not documented or poorly documented
> at best.
>
>> In addition, even if we do want to maintain backwards-comparability to
>> this behaviour, I think it is enough to have an opt-in flag in
>> /proc/sys/net/core/ that when set to 1 will activate the fix in
>> dev_forward_skb() provided by this patch. That would also be a very
>> simple change to the patch provided here.
>
> Making it opt-in makes it more palatable, that's for sure.
>
1. Do we want to make a flag for every bug that is user-space visible? I
think there is place for consideration on a per-case basis. I still
don't see how a user can utilize this behaviour. He is basically loosing
information (skb->mark) without this patch.
2. Having said that, I don't mind changing patch to maintain backwards
compatibility here. However, there was also a discussion here on where
the flag should sit. I think that a global /proc/sys/net/core/ flag
should be enough. Do you agree it's sufficient for now?
Thanks,
-Liran
Liran Alon <[email protected]> writes:
> ----- [email protected] wrote:
>
>> On Thu, 15 Mar 2018 09:35:51 -0700 (PDT) Liran Alon
>> <[email protected]> wrote:
>> > ----- [email protected] wrote:
>> >
>> > > On Thu, 15 Mar 2018 08:01:03 -0700 (PDT) Liran Alon
>> > > <[email protected]> wrote:
>> > > >
>> > > > I still think that default behavior should be to zero skb->mark
>> only
>> > > when skb
>> > > > cross netdevs in different netns.
>> > >
>> > > But the previous default was scrub the mark in *both* xnet and
>> > > non-xnet
>> > > situations.
>> > >
>> > > Therefore, there might be users which RELY on this (strange)
>> default
>> > > behavior in their same-netns-veth-pair setups.
>> > > Meaning, changing the default behavior might break their apps
>> relying
>> > > on
>> > > the former default behavior.
>> > >
>> > > This is why the "disable mark scrubbing in non-xnet case" should
>> be
>> > > opt-in.
>> >
>> > We think the same.
>> > The only difference is that I think this for now should be
>> controllable
>> > by a global /proc/sys/net/core file instead of giving a flexible
>> per-netdev
>> > control.
>> > Because that is a larger change that could be done later.
>>
>> A flags attribute to veth newlink is a very scoped change.
>> User controls this per veth creation.
>> This is way more neat than /proc/sys/net and provides the desired
>> granular
>> control.
>>
>> Also, scoping this to veth has the advantage of not affecting the many
>> other
>> dev_forward_skb callers.
>
> Agreed. But isn't this an issue also for the
> many others (& future) callers of dev_forward_skb()?
> This seems problematic to me.
>
> This will kinda leave a kernel interface with broken default behavior
> for backwards comparability.
>
> A flag to netdev or /proc/sys/net/core to "fix" default behavior
> will avoid this.
I don't believe the current behavior is a bug.
I looked through the history. Basically skb_scrub_packet
started out as the scrubbing needed for crossing network
namespaces.
Then tunnels which needed 90% of the functionality started
calling it, with the xnet flag added. Because the tunnels
needed to preserve their historic behavior.
Then dev_forward_skb started calling skb_scrub_packet.
A veth pair is supposed to give the same behavior as a cross-over
cable plugged into two local nics. A cross over cable won't
preserve things like the skb mark. So I don't see why anyone would
expect a veth pair to preserve the mark.
Right now I don't see the point of handling packets that don't cross
network namespace boundaries specially, other than to preserve backwards
compatibility.
Eric
From: Liran Alon <[email protected]>
Date: Tue, 20 Mar 2018 18:11:49 +0200
> 1. Do we want to make a flag for every bug that is user-space visible?
> I think there is place for consideration on a per-case basis. I still
> don't see how a user can utilize this behaviour. He is basically
> loosing information (skb->mark) without this patch.
And maybe people trying to work in this situation have added code to
get the mark set some other way, or to validate that it is in fact
zero after passing through, which we would break with this change.
If it's set to zero now, it's reasonable to expect it to be zero.
By changing it to non-zero, different rules and routes will match
and this for sure has potential to break things.
On 20/03/18 18:34, David Miller wrote:
> From: Liran Alon <[email protected]>
> Date: Tue, 20 Mar 2018 18:11:49 +0200
>
>> 1. Do we want to make a flag for every bug that is user-space visible?
>> I think there is place for consideration on a per-case basis. I still
>> don't see how a user can utilize this behaviour. He is basically
>> loosing information (skb->mark) without this patch.
>
> And maybe people trying to work in this situation have added code to
> get the mark set some other way, or to validate that it is in fact
> zero after passing through, which we would break with this change.
>
> If it's set to zero now, it's reasonable to expect it to be zero.
>
> By changing it to non-zero, different rules and routes will match
> and this for sure has potential to break things.
>
OK.
What is your opinion in regards if it's OK to put the flag enabling this
"fix" in /proc/sys/net/core? Do you think it's sufficient?
On 20/03/18 18:24, [email protected] wrote:
>
> I don't believe the current behavior is a bug.
>
> I looked through the history. Basically skb_scrub_packet
> started out as the scrubbing needed for crossing network
> namespaces.
>
> Then tunnels which needed 90% of the functionality started
> calling it, with the xnet flag added. Because the tunnels
> needed to preserve their historic behavior.
>
> Then dev_forward_skb started calling skb_scrub_packet.
>
> A veth pair is supposed to give the same behavior as a cross-over
> cable plugged into two local nics. A cross over cable won't
> preserve things like the skb mark. So I don't see why anyone would
> expect a veth pair to preserve the mark.
I disagree with this argument.
I think that a skb crossing netns is what simulates a real packet
crossing physical computers. Following your argument, why would
skb->mark should be preserved when crossing netdevs on same netns via
routing? But this does today preserve skb->mark.
Therefore, I do think that skb->mark should conceptually only be
scrubbed when crossing netns. Regardless of the netdev used to cross it.
>
> Right now I don't see the point of handling packets that don't cross
> network namespace boundaries specially, other than to preserve backwards
> compatibility.
>
> Eric
>
On 03/20/2018 09:44 AM, Liran Alon wrote:
>
>
> On 20/03/18 18:24, [email protected] wrote:
>>
>> I don't believe the current behavior is a bug.
>>
>> I looked through the history. Basically skb_scrub_packet
>> started out as the scrubbing needed for crossing network
>> namespaces.
>>
>> Then tunnels which needed 90% of the functionality started
>> calling it, with the xnet flag added. Because the tunnels
>> needed to preserve their historic behavior.
>>
>> Then dev_forward_skb started calling skb_scrub_packet.
>>
>> A veth pair is supposed to give the same behavior as a cross-over
>> cable plugged into two local nics. A cross over cable won't
>> preserve things like the skb mark. So I don't see why anyone would
>> expect a veth pair to preserve the mark.
>
> I disagree with this argument.
>
> I think that a skb crossing netns is what simulates a real packet crossing physical computers. Following your argument, why would skb->mark should be preserved
> when crossing netdevs on same netns via routing? But this does today preserve skb->mark.
>
> Therefore, I do think that skb->mark should conceptually only be scrubbed when crossing netns. Regardless of the netdev used to cross it.
It should be scrubbed in VETH as well. That is one way to make virtual routers. Possibly
the newer VRF features will give another better way to do it, but you should not break
things that used to work.
Now, if you want to add a new feature that allows one to configure the kernel (or VETH) for
a new behavior, then that might be something to consider.
>> Right now I don't see the point of handling packets that don't cross
>> network namespace boundaries specially, other than to preserve backwards
>> compatibility.
Well, backwards compat is a big deal all by itself!
Thanks,
Ben
>>
>> Eric
>>
>
>
--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
Ben Greear <[email protected]> writes:
> On 03/20/2018 09:44 AM, Liran Alon wrote:
>>
>>
>> On 20/03/18 18:24, [email protected] wrote:
>>>
>>> I don't believe the current behavior is a bug.
>>>
>>> I looked through the history. Basically skb_scrub_packet
>>> started out as the scrubbing needed for crossing network
>>> namespaces.
>>>
>>> Then tunnels which needed 90% of the functionality started
>>> calling it, with the xnet flag added. Because the tunnels
>>> needed to preserve their historic behavior.
>>>
>>> Then dev_forward_skb started calling skb_scrub_packet.
>>>
>>> A veth pair is supposed to give the same behavior as a cross-over
>>> cable plugged into two local nics. A cross over cable won't
>>> preserve things like the skb mark. So I don't see why anyone would
>>> expect a veth pair to preserve the mark.
>>
>> I disagree with this argument.
>>
>> I think that a skb crossing netns is what simulates a real packet
>> crossing physical computers. Following your argument, why would
>> skb->mark should be preserved when crossing netdevs on same netns via
>> routing? But this does today preserve skb->mark.
>>
>> Therefore, I do think that skb->mark should conceptually only be
>> scrubbed when crossing netns. Regardless of the netdev used to cross
>> it.
>
> It should be scrubbed in VETH as well. That is one way to make virtual routers. Possibly
> the newer VRF features will give another better way to do it, but you should not break
> things that used to work.
>
> Now, if you want to add a new feature that allows one to configure the kernel (or VETH) for
> a new behavior, then that might be something to consider.
>
>>> Right now I don't see the point of handling packets that don't cross
>>> network namespace boundaries specially, other than to preserve backwards
>>> compatibility.
>
> Well, backwards compat is a big deal all by itself!
Absolutely agreed.
Eric
On Tue, 20 Mar 2018 18:39:47 +0200, Liran Alon said:
> What is your opinion in regards if it's OK to put the flag enabling this
> "fix" in /proc/sys/net/core? Do you think it's sufficient?
Umm.. *which* /proc/sys/net/core? These could differ for things that
are in different namespaces. Or are you proposing one systemwide
global value (which also gets "interesting" if it's writable inside a
container and changes the behavior a different container sees...)
On 20/03/18 20:51, [email protected] wrote:
> On Tue, 20 Mar 2018 18:39:47 +0200, Liran Alon said:
>> What is your opinion in regards if it's OK to put the flag enabling this
>> "fix" in /proc/sys/net/core? Do you think it's sufficient?
>
> Umm.. *which* /proc/sys/net/core? These could differ for things that
> are in different namespaces. Or are you proposing one systemwide
> global value (which also gets "interesting" if it's writable inside a
> container and changes the behavior a different container sees...)
>
I'm indeed proposing an opt-in system-wide global value.
I think it is the simplest approach to fix the issue at
hand here while maintaining backwards-compatibility.
I'm open to suggestions to where that system-wide
global value should be.
It must be a system-wide global value if we are not going
with the per-netdev flag approach as this system-wide global flag
should control how a skb is travelled between different netns.
So it doesn't belong to any one single netns.