2023-09-30 19:05:42

by Henrik Lindström

[permalink] [raw]
Subject: macvtap performs IP defragmentation, causing MTU problems for virtual machines

Hi,

We are trying to receive fragmented multicast traffic on a qemu/kvm VM
using a macvtap NIC (e1000e model).
Even though the traffic is properly fragmented by the sender, it's
never received by the VM.

The reason seems to be that the macvlan driver performs IP
defragmentation for multicast traffic, causing e1001e to drop it for
being too big.
It works fine with a virtio NIC instead of e1000e, but that's not an
option for us.

I found this old thread describing why macvlan does this:
https://lore.kernel.org/netdev/[email protected]/
Interestingly, the problem described in that thread seems to be more
general than macvlans, and i can still reproduce it by simply having
multiple physical interfaces.
So it looks like macvlans are being special-cased right now, as a
workaround for a more general defragmentation problem?

A fix for our issue is to simply remove the `ip_check_defrag` call
from `macvlan_handle_frame`, but that regresses the other issue..
I'm not sure what a proper fix would look like, but it feels a bit
unexpected that macvtaps would perform IP defragmentation.

Thanks,
Henrik


2023-10-02 09:25:57

by Florian Westphal

[permalink] [raw]
Subject: Re: macvtap performs IP defragmentation, causing MTU problems for virtual machines

Henrik Lindstr?m <[email protected]> wrote:
> I found this old thread describing why macvlan does this:
> https://lore.kernel.org/netdev/[email protected]/
> Interestingly, the problem described in that thread seems to be more
> general than macvlans, and i can still reproduce it by simply having
> multiple physical interfaces.
> So it looks like macvlans are being special-cased right now, as a
> workaround for a more general defragmentation problem?

Looks like it, maybe Eric remembers details here.

AFAIU however this issue isn't specific to macvlan, looks like some people
insist that receiving a fragmented multicast packet on n devices means we
should pass n defragmented packets up to the stack (we don't; ip defrag
will discard "duplicates").

There is a vif identifier for l3mdev sake (that did not exist back then),
we could use that as a discriminator for mcast case.

Something like this (totally untested):

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -479,11 +479,29 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
return err;
}

+static int ip_defrag_vif(const struct sk_buff *skb, const struct net_device *dev)
+{
+ int vif = l3mdev_master_ifindex_rcu(dev);
+
+ if (vif)
+ return vif;
+
+ /* some folks insist that receiving a fragmented mcast dgram on n devices shall
+ * result in n defragmented packets.
+ */
+ if (skb->pkt_type == PACKET_BROADCAST || skb->pkt_type == PACKET_MULTICAST) {
+ if (dev)
+ vif = dev->ifindex;
+ }
+
+ return 0;
+}
+
/* Process an incoming IP datagram fragment. */
int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
{
struct net_device *dev = skb->dev ? : skb_dst(skb)->dev;
- int vif = l3mdev_master_ifindex_rcu(dev);
+ int vif = ip_defrag_vif(skb, dev);
struct ipq *qp;

__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);

... which should allow to remove the macvlan defrag step.

2023-10-02 22:25:01

by Henrik Lindström

[permalink] [raw]
Subject: Re: macvtap performs IP defragmentation, causing MTU problems for virtual machines

Had to change "return 0" to "return vif" but other than that your changes
seem to work, even with macvlan defragmentation removed.

I tested it by sending 8K fragmented multicast packets, with 5 macvlans on
the receiving side. I consistently received 6 copies of the packet (1 from the
real interface and 1 per macvlan). While doing this i had my VM running with
a macvtap, and it was receiving fragmented packets as expected.

Here are the changes i was testing with, first time sending a diff over mail
so hope it works :-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 02bd201bc7e5..5f638433cef9 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -462,10 +462,6 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
if (is_multicast_ether_addr(eth->h_dest)) {
unsigned int hash;

- skb = ip_check_defrag(dev_net(skb->dev), skb, IP_DEFRAG_MACVLAN);
- if (!skb)
- return RX_HANDLER_CONSUMED;
- *pskb = skb;
eth = eth_hdr(skb);
if (macvlan_forward_source(skb, port, eth->h_source)) {
kfree_skb(skb);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index a4941f53b523..30b822dfa222 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -479,11 +479,29 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
return err;
}

+static int ip_defrag_vif(const struct sk_buff *skb, const struct net_device *dev)
+{
+ int vif = l3mdev_master_ifindex_rcu(dev);
+
+ if (vif)
+ return vif;
+
+ /* some folks insist that receiving a fragmented mcast dgram on n devices shall
+ * result in n defragmented packets.
+ */
+ if (skb->pkt_type == PACKET_BROADCAST || skb->pkt_type == PACKET_MULTICAST) {
+ if (dev)
+ vif = dev->ifindex;
+ }
+
+ return vif;
+}
+
/* Process an incoming IP datagram fragment. */
int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
{
struct net_device *dev = skb->dev ? : skb_dst(skb)->dev;
- int vif = l3mdev_master_ifindex_rcu(dev);
+ int vif = ip_defrag_vif(skb, dev);
struct ipq *qp;

__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);



2023-10-04 08:01:18

by Florian Westphal

[permalink] [raw]
Subject: Re: macvtap performs IP defragmentation, causing MTU problems for virtual machines

Henrik Lindstr?m <[email protected]> wrote:
> Had to change "return 0" to "return vif" but other than that your changes
> seem to work, even with macvlan defragmentation removed.

Oh, right, that should have been "return vid" indeed.

> I tested it by sending 8K fragmented multicast packets, with 5 macvlans on
> the receiving side. I consistently received 6 copies of the packet (1 from the
> real interface and 1 per macvlan). While doing this i had my VM running with
> a macvtap, and it was receiving fragmented packets as expected.
>
> Here are the changes i was testing with, first time sending a diff over mail
> so hope it works :-)

Can you submit this formally, with proper changelog and Signed-off-by?
See scripts/checkpatch.pl in the kernel tree.

It might be a good idea to first mail the patch to yourself and see if
you can apply it (sometimes tabs get munged into spaces, long lines get
broken, etc).

You could also mention in changelog that this is ipv4 only because
ipv6 already considers the interface index during reassembly.

Thanks!

2023-10-05 17:29:12

by Henrik Lindström

[permalink] [raw]
Subject: Re: macvtap performs IP defragmentation, causing MTU problems for virtual machines

On onsdag 4 oktober 2023 10:00:37 CEST Florian Westphal wrote:
> Can you submit this formally, with proper changelog and Signed-off-by?
> See scripts/checkpatch.pl in the kernel tree.
Sure, i can give it a shot. How do i properly credit you if i submit your
patch with some small changes of my own?

> You could also mention in changelog that this is ipv4 only because
> ipv6 already considers the interface index during reassembly.
Interesting. I've been trying to understand the code and it seems like
ipv6 does defragmentation per-interface, while ipv4 does it "per-vrf"
(correct me if i'm wrong). Is there any reason for this difference?

I also did some more testing with the diff from my previous mail. It
looks like the problem remains for interfaces under vrfs. I think simply
doing the bcast/mcast check first fixes that though, something like this:
if (skb->pkt_type == PACKET_BROADCAST || skb->pkt_type == PACKET_MULTICAST) {
if (dev)
return dev->ifindex;
}

return l3mdev_master_ifindex_rcu(dev);
Does that look reasonable?
The idea being that bcast/mcast packets are always defragmented
per-interface, and unicast packets always "per-vrf".

Thanks,
Henrik


2023-10-06 06:06:58

by Florian Westphal

[permalink] [raw]
Subject: Re: macvtap performs IP defragmentation, causing MTU problems for virtual machines

Henrik Lindstr?m <[email protected]> wrote:
> On onsdag 4 oktober 2023 10:00:37 CEST Florian Westphal wrote:
> > Can you submit this formally, with proper changelog and Signed-off-by?
> > See scripts/checkpatch.pl in the kernel tree.
> Sure, i can give it a shot. How do i properly credit you if i submit your
> patch with some small changes of my own?

You can use:

"Suggested-by:" tag here.

> > You could also mention in changelog that this is ipv4 only because
> > ipv6 already considers the interface index during reassembly.
> Interesting. I've been trying to understand the code and it seems like
> ipv6 does defragmentation per-interface, while ipv4 does it "per-vrf"
> (correct me if i'm wrong). Is there any reason for this difference?

Only for linklocal and multicasts. Added in
264640fc2c5f4f913db5c73fa3eb1ead2c45e9d7 . Even mentions macvlan in the
changelog.

> The idea being that bcast/mcast packets are always defragmented
> per-interface, and unicast packets always "per-vrf".

LGTM, but please CC [email protected] once you submit the patch.