Hi!
We have a customer who reported an issue where the Linux VXLAN
implementation diverges from the RFC: when any of the (reserved)
flags other than the VNI one is set, the kernel just drops the packet.
According to the vxlan_rcv function in vxlan_core.c, this is done by choice:
if (unparsed.vx_flags || unparsed.vx_vni) {
	/* If there are any unprocessed flags remaining treat
	 * this as a malformed packet. This behavior diverges from
	 * VXLAN RFC (RFC7348) which stipulates that bits in reserved
	 * in reserved fields are to be ignored. The approach here
	 * maintains compatibility with previous stack code, and also
	 * is more robust and provides a little more security in
	 * adding extensions to VXLAN.
	 */
	goto drop;
}
Normally this is not an issue, as the same RFC also dictates that the sender
must set those reserved bits to zero. But naturally, some devices do not
follow that side of the contract either, like some Juniper switches of said
customer, which set the B-bit (as if it were VXLAN-GPE) in the VXLAN packet,
even though they have VXLAN-GPE explicitly disabled.
So, while I asked the customer to open a support ticket with their switch
vendor, as that one is breaking the RFC too, the kernel is just the simpler
thing to "fix", and on our side it is the only thing we can change at all.
Just changing the code so that it always conforms to the RFC (at least
in this regard) seems to be a no-go, as some setups would then suddenly see
extra (possibly malicious) traffic go through. So, to my actual question:
What would be the accepted way to add a switch that makes this RFC-conformant
in an opt-in way? A module parameter? A sysfs entry? Through netlink?
Depending on the answer, I'd like to prepare a patch implementing opt-in
RFC conformance w.r.t. ignoring the values of the reserved bits in the
VXLAN flags. That way, setups with such non-conforming HW in their network
path can opt in to that behavior as a workaround.
Thanks!
Thomas
On Fri, 12 Jan 2024 16:13:22 +0100 Thomas Lamprecht wrote:
> What would be the accepted way to add a switch that makes this RFC-conformant
> in an opt-in way? A module parameter? A sysfs entry? Through netlink?
Thru netlink. My intuition would be to try to add an "ignore
bits" mask, rather than an "RFC compliance knob", because RFCs
may have a shorter lifespan than the kernel's uAPI guarantees.
On Tue, Jan 16, 2024 at 08:23:57AM -0800, Jakub Kicinski wrote:
> On Fri, 12 Jan 2024 16:13:22 +0100 Thomas Lamprecht wrote:
> > What would be the accepted way to add a switch that makes this RFC-conformant
> > in an opt-in way? A module parameter? A sysfs entry? Through netlink?
>
> Thru netlink.
+1
> My intuition would be to try to add an "ignore bits" mask, rather than
> an "RFC compliance knob", because RFCs may have a shorter lifespan than
> the kernel's uAPI guarantees.
Newer Spectrum chips have a 64 bit mask that covers the entire VXLAN
header. If a bit is set in the mask and the corresponding bit in the
VXLAN header is not zero, the packet is dropped / trapped.
Another option, assuming the interface that receives the encapsulated
packets is known, is to clear the reserved bits in the VXLAN header
using pedit. This seems to work:
tc -n ns2 qdisc add dev veth1 clsact
tc -n ns2 filter add dev veth1 ingress pref 1 proto ip flower ip_proto udp \
	dst_port 4789 \
	action pedit munge offset 28 u8 set 0x08
Tested by setting the reserved bits on the other side and making sure
ping works:
tc -n ns1 qdisc add dev veth0 clsact
tc -n ns1 filter add dev veth0 egress pref 1 proto ip flower ip_proto udp \
	dst_port 4789 \
	action pedit munge offset 28 u8 set 0xff
The advantage is that no kernel changes are required, whereas the netlink
solution will have to be maintained forever, even after the other side
is fixed.
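
For reference, the numbers in the pedit commands above can be derived as follows. This is a small sketch under two assumptions: that the pedit offset is counted from the start of the IPv4 header, and that the header carries no IP options (20 bytes of IPv4 plus 8 bytes of UDP gives 28). The helper names are made up for illustration. 0x08 is the flags byte with only the I flag set.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    UDP_HDR_LEN = 8,
    VXLAN_FLAGS_BYTE_I = 0x08, /* only the I (VNI valid) flag set */
};

/*
 * Offset of the first VXLAN header byte (the flags byte), counted from
 * the start of the IPv4 header. ihl is the IPv4 header length in 32-bit
 * words, so ihl == 5 (no options) yields 20 + 8 = 28.
 */
size_t vxlan_flags_offset(uint8_t ihl)
{
    return (size_t)ihl * 4 + UDP_HDR_LEN;
}

/*
 * User-space equivalent of "munge offset 28 u8 set 0x08": overwrite the
 * flags byte so that only the I flag remains. Unlike the fixed offset in
 * the tc rule, this reads the IHL from the packet itself.
 */
void vxlan_clear_reserved_flag_bits(uint8_t *ipv4_pkt)
{
    size_t off = vxlan_flags_offset(ipv4_pkt[0] & 0x0f);

    ipv4_pkt[off] = VXLAN_FLAGS_BYTE_I;
}
```

Note that this only rewrites the first byte of the VXLAN header. The 24 reserved bits after the flags byte and the reserved byte after the VNI are left untouched, so a sender that sets those would need additional munge keys.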