2015-11-16 20:15:00

by Jason A. Donenfeld

Subject: Routing loops & TTL tracking with tunnel devices

Hi folks,

A few tunnel devices, like geneve or vxlan, are using
udp_tunnel_xmit_skb or related functions for transmitting packets,
and are doing the usual FIB lookup to get the dst entry. I see a lot
of code like this:

	if (rt->dst.dev == dev) {
		netdev_dbg(dev, "circular route to %pI4\n",
			   &dst->sin.sin_addr.s_addr);
		dev->stats.collisions++;
		goto rt_tx_error;
	}

This one is from vxlan, but there are other similar blocks elsewhere.
The basic idea is "am I about to send this packet to my own device?"

This is a bit crude. For starters, two interfaces could be pointed at
each other, bouncing the packet back and forth indefinitely, causing
the feared routing loop. Hopefully as more headers got tacked on,
allocations would eventually fail, and the queen would be saved.

But what about in devices for which self-routing might actually be
useful? For example, let's say that if an incoming skb is headed for
dst X, it gets encapsulated and sent to dst A, and for dst Y it gets
encapsulated and sent to dst B, and for dst Z it gets encapsulated and
sent to dst C. I can imagine situations in which setting A==Y and B==Z
might be useful to do multiple levels of encapsulation on one device,
so that skbs headed for dst X get sent to dst C, but with intermediate
transformations of dst A and dst B.
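
To make the aliasing concrete, here's a toy userspace model of that
resolution (illustration only; the rules table and names are made up,
not from any real driver):

/* Toy model: "dst X is encapsulated toward A, and A == Y", etc.,
 * encoded by reusing the same label for A/Y and B/Z. */
#include <stdio.h>
#include <string.h>

static const struct { const char *dst, *outer_dst; } rules[] = {
	{ "X", "Y" },	/* outer dst A, which equals Y */
	{ "Y", "Z" },	/* outer dst B, which equals Z */
	{ "Z", "C" },	/* outer dst C finally leaves the device */
};

int main(void)
{
	const char *dst = "X";
	int i, hops, matched;

	/* Each iteration is one trip through the same tunnel device. */
	for (hops = 0; hops < 10; hops++) {
		for (matched = i = 0; i < 3; i++) {
			if (!strcmp(dst, rules[i].dst)) {
				printf("encap: %s -> %s\n", dst,
				       rules[i].outer_dst);
				dst = rules[i].outer_dst;
				matched = 1;
				break;
			}
		}
		if (!matched)
			break;	/* dst no longer routes back to us */
	}
	printf("final outer dst: %s after %d encapsulations\n", dst, hops);
	return 0;
}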

This isn't merely theoretical. I'm working on a driver right now that
could benefit from this.

So, in implementing this, the question of avoiding routing loops comes
into play. The most straightforward way to do this is to use a TTL
value that's decremented on each pass. But we have a problem. A packet
sent to dst X that is encapsulated and sent to dst A will have a TTL
calculated for its journey to dst A. How do we preserve TTLs across
multiple traversals of the networking stack? We can't simply keep the
TTL of the packet as it comes in, because its tunnel destination might
require a different TTL. The best thing would be to have a "tunnel
TTL" value as part of skb->cb, except the cb gets overwritten when
traversing the networking stack. The next best thing I can think of is
some other member of sk_buff, but I don't see any that look good for
this.
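
For illustration, if such a field existed -- the skb->tunnel_ttl below
is purely hypothetical, struct sk_buff has no such member -- the xmit
path of a tunnel device could reduce to something like:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* Hypothetical member, initialized to a small limit when the
	 * skb first enters a tunnel device, surviving re-entry into
	 * the stack (which skb->cb does not). */
	if (skb->tunnel_ttl == 0) {
		dev->stats.tx_dropped++;
		kfree_skb(skb);
		return NETDEV_TX_OK;
	}
	skb->tunnel_ttl--;	/* spend one traversal */

	/* ... encapsulate and hand back to the stack ... */
	return NETDEV_TX_OK;
}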

So perhaps it would be worthwhile to add this to struct sk_buff? David
- are you interested in this if I submit a patch?

Or, alternatively, does a fast solution for this already exist that I
overlooked?

Thanks,
Jason


2015-11-16 20:37:28

by Sowmini Varadhan

Subject: Re: Routing loops & TTL tracking with tunnel devices

On (11/16/15 21:14), Jason A. Donenfeld wrote:
>
> But what about in devices for which self-routing might actually be
> useful? For example, let's say that if an incoming skb is headed for
> dst X, it gets encapsulated and sent to dst A, and for dst Y it gets
> encapsulated and sent to dst B, and for dst Z it gets encapsulated and
> sent to dst C. I can imagine situations in which setting A==Y and B==Z
> might be useful to do multiple levels of encapsulation on one device,
> so that skbs headed for dst X get sent to dst C, but with intermediate
> transformations of dst A and dst B.

I believe that what you are talking about is basically nested
encapsulation; see https://tools.ietf.org/html/rfc2473.

The tunnelling endpoint could track the number of encapsulations and keep
a limit on that? (Conceptually this may be the same thing as your ttl
proposal, except that "ttl" has other meanings in other contexts, so
it's a bit non-intuitive.)

--Sowmini

(fwiw, RFC 2473 proposes an ipv6 option to track nested encapsulation,
and that never took off because, among other reasons, it's hard to
offload such options to hardware. Anyway, you are not trying to carry
this around in the packet).

2015-11-16 20:55:30

by Jason A. Donenfeld

Subject: Re: Routing loops & TTL tracking with tunnel devices

Hi Sowmini,

Neat. Though, in my case, I'm not actually just prepending a header.
I'm doing some more substantial transformations of a packet. And this
needs to work with v4 too. So I'm not sure implementing a v6 spec will
help with things. I need to identify the right mechanism inside the
kernel to assist with this, like, say, a member in sk_buff.

Jason

2015-11-16 20:59:56

by Sowmini Varadhan

Subject: Re: Routing loops & TTL tracking with tunnel devices

> Neat. Though, in my case, I'm not actually just prepending a header.
> I'm doing some more substantial transformations of a packet. And this
> needs to work with v4 too. So I'm not sure implementing a v6 spec will

Understood; that spec was just referenced to indicate that there
are more issues (mtu reduction etc.) with nested encapsulation,
and this is actually applicable even without the recursion issue
(i.e. even if you don't have a tunnelling loop, and even if it
is not ipv6, there are some non-trivial problems here. Luckily,
nested encaps is somewhat uncommon).

--Sowmini

2015-11-16 22:25:07

by Hannes Frederic Sowa

Subject: Re: Routing loops & TTL tracking with tunnel devices

Hi Jason,

On Mon, Nov 16, 2015, at 21:14, Jason A. Donenfeld wrote:
> A few tunnel devices, like geneve or vxlan, are using
> udp_tunnel_xmit_skb or related functions for transmitting packets,
> and are doing the usual FIB lookup to get the dst entry. I see a lot
> of code like this:
>
> 	if (rt->dst.dev == dev) {
> 		netdev_dbg(dev, "circular route to %pI4\n",
> 			   &dst->sin.sin_addr.s_addr);
> 		dev->stats.collisions++;
> 		goto rt_tx_error;
> 	}
>
> This one is from vxlan, but there are other similar blocks elsewhere.
> The basic idea is "am I about to send this packet to my own device?"
>
> This is a bit crude. For starters, two interfaces could be pointed at
> each other, bouncing the packet back and forth indefinitely, causing
> the feared routing loop. Hopefully as more headers got tacked on,
> allocations would eventually fail, and the queen would be saved.
>
> But what about in devices for which self-routing might actually be
> useful? For example, let's say that if an incoming skb is headed for
> dst X, it gets encapsulated and sent to dst A, and for dst Y it gets
> encapsulated and sent to dst B, and for dst Z it gets encapsulated and
> sent to dst C. I can imagine situations in which setting A==Y and B==Z
> might be useful to do multiple levels of encapsulation on one device,
> so that skbs headed for dst X get sent to dst C, but with intermediate
> transformations of dst A and dst B.
>
> This isn't merely theoretical. I'm working on a driver right now that
> could benefit from this.
>
> So, in implementing this, the question of avoiding routing loops comes
> into play. The most straightforward way to do this is to use a TTL
> value that's decremented on each pass. But we have a problem. A packet
> sent to dst X that is encapsulated and sent to dst A will have a TTL
> calculated for its journey to dst A. How do we preserve TTLs across
> multiple traversals of the networking stack? We can't simply keep the
> TTL of the packet as it comes in, because its tunnel destination might
> require a different TTL. The best thing would be to have a "tunnel
> TTL" value as part of skb->cb, except the cb gets overwritten when
> traversing the networking stack. The next best thing I can think of is
> some other member of sk_buff, but I don't see any that look good for
> this.
>
> So perhaps it would be worthwhile to add this to struct sk_buff? David
> - are you interested in this if I submit a patch?
>
> Or, alternatively, does a fast solution for this already exist that I
> overlooked?

Have a look at __dev_queue_xmit and the per_cpu recursion limits
implemented there:

	if (__this_cpu_read(xmit_recursion) > RECURSION_LIMIT)
		goto recursion_alert;
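
For context, the surrounding logic in net/core/dev.c is roughly the
following (paraphrased from memory, not copied verbatim):

static DEFINE_PER_CPU(int, xmit_recursion);
#define RECURSION_LIMIT 10	/* value from memory */

	/* in __dev_queue_xmit(), on the path for devices without a qdisc: */
	if (__this_cpu_read(xmit_recursion) > RECURSION_LIMIT)
		goto recursion_alert;	/* logs "Dead loop on virtual device ..." */

	__this_cpu_inc(xmit_recursion);
	skb = dev_hard_start_xmit(skb, dev, txq, &rc);
	__this_cpu_dec(xmit_recursion);

Note it only guards transmissions that happen synchronously from
within this call chain.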

Bye,
Hannes

2015-11-16 22:29:06

by Eric Dumazet

Subject: Re: Routing loops & TTL tracking with tunnel devices

On Mon, 2015-11-16 at 21:55 +0100, Jason A. Donenfeld wrote:
> Hi Sowmini,
>
> Neat. Though, in my case, I'm not actually just prepending a header.
> I'm doing some more substantial transformations of a packet. And this
> needs to work with v4 too. So I'm not sure implementing a v6 spec will
> help with things. I need to identify the right mechanism inside the
> kernel to assist with this, like, say, a member in sk_buff.

There is very little chance we'll accept a new member in sk_buff, unless
proven needed.

Yes, it is very tempting, and dozens of programmers have already tried.

You'll have to demonstrate full understanding of the stack and why other
solutions do not work.


2015-11-17 02:41:44

by Jason A. Donenfeld

Subject: Re: Routing loops & TTL tracking with tunnel devices

Hi Eric,

On Mon, Nov 16, 2015 at 11:28 PM, Eric Dumazet <[email protected]> wrote:
> There is very little chance we'll accept a new member in sk_buff, unless
> proven needed.

I actually have no intention of doing this! I'm wondering if there
already is a member in sk_buff that moonlights as my desired ttl
counter, or if there's another mechanism for avoiding routing loops. I
want to work with what's already there, rather than meddling with the
innards of important and memory sensitive structures such as sk_buff.

2015-11-17 02:57:31

by Jason A. Donenfeld

Subject: Re: Routing loops & TTL tracking with tunnel devices

On Mon, Nov 16, 2015 at 11:25 PM, Hannes Frederic Sowa
<[email protected]> wrote:
> Have a look at __dev_queue_xmit and the per_cpu recursion limits
> implemented there:
>
> if (__this_cpu_read(xmit_recursion) >
> RECURSION_LIMIT)
> goto recursion_alert;

Ahh, thanks for pointing that out. So this works with virtual devices
with no queue. As of some recent changes, that now applies to what I'm
doing.

Unfortunately, I get a complete hard crash, with the blinking
keyboard. The only thing written to serial before it dies is:
[ 171.347446] Dead loop on virtual device wg0, fix it urgently!
This means it did hit that recursion condition, which is good.
I assume the recursion limit is just too high, and this has something
to do with me overflowing the stack. I'll test this hypothesis and see
if I can add a similar check inside my driver with a lower limit. If
this works, I'm satisfied.
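
The check I have in mind would be something along these lines (sketch
only; the names and the limit are placeholders):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static DEFINE_PER_CPU(unsigned int, wg_xmit_recursion);
#define WG_RECURSION_LIMIT 4	/* much lower than the core's limit */

static netdev_tx_t wg_xmit(struct sk_buff *skb, struct net_device *dev)
{
	if (unlikely(__this_cpu_read(wg_xmit_recursion) > WG_RECURSION_LIMIT)) {
		dev->stats.collisions++;
		kfree_skb(skb);
		return NETDEV_TX_OK;
	}

	__this_cpu_inc(wg_xmit_recursion);
	/* ... queue up, encapsulate, transmit ... */
	__this_cpu_dec(wg_xmit_recursion);
	return NETDEV_TX_OK;
}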

Thanks a lot for the pointer here.

2022-04-29 12:08:46

by Jason A. Donenfeld

Subject: Re: Routing loops & TTL tracking with tunnel devices

Hey Eric,

On Tue, Nov 17, 2015 at 03:41:35AM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 16, 2015 at 11:28 PM, Eric Dumazet <[email protected]> wrote:
> > There is very little chance we'll accept a new member in sk_buff, unless
> > proven needed.
>
> I actually have no intention of doing this! I'm wondering if there
> already is a member in sk_buff that moonlights as my desired ttl
> counter, or if there's another mechanism for avoiding routing loops. I
> want to work with what's already there, rather than meddling with the
> innards of important and memory sensitive structures such as sk_buff.

Well, 7 years later... Maybe you have a better idea now of what I was
working on then. :)

As an update on this issue, it's still quasi-problematic. To review: I
can't use the TTL value, because the outer packet must always get the
TTL of the route to the outer destination, not the inner packet's TTL
minus one. I can't rely on reaching MTU size, because people want this
to work with fragmentation (see [1] for my attempt to disallow
fragmentation for this issue, which resulted in hoots and hollers). I
can't use the per-cpu xmit_recursion variable, because I use threads.

What I can sort of use is taking advantage of what looks like a bug in
pskb expansion, such that it always allocates too much and pretty
quickly fails allocations after a few loops. Only powerpc64 and s390x
don't appear to have this bug. See [2] for an in-depth description of
this that I wrote to you a few months ago.

Anyway, it'd be nice if there were a free u8 somewhere in sk_buff that I
could use for tracking times through the stack. Other kernels have this
but afaict Linux still does not. I looked into trying to overload some
existing fields -- tstamp/skb_mstamp_ns or queue_mapping -- which I was
thinking might be totally unused on TX?
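
For instance, the sort of (ugly, unverified) overload I was imagining,
if queue_mapping really were dead on this path -- entirely
hypothetical, and it would break if anything on TX still consults the
field:

#include <linux/skbuff.h>

#define TUNNEL_HOP_LIMIT 4	/* made-up limit */

static bool tunnel_hops_exceeded(struct sk_buff *skb)
{
	/* Abuse skb->queue_mapping as a hop counter. */
	if (skb->queue_mapping >= TUNNEL_HOP_LIMIT)
		return true;
	skb->queue_mapping++;
	return false;
}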

Any ideas about this?

Thanks,
Jason

[1] https://lore.kernel.org/wireguard/CAHmME9rNnBiNvBstb7MPwK-7AmAN0sOfnhdR=eeLrowWcKxaaQ@mail.gmail.com/
[2] https://lore.kernel.org/netdev/CAHmME9pv1x6C4TNdL6648HydD8r+txpV4hTUXOBVkrapBXH4QQ@mail.gmail.com/

2022-04-30 12:20:13

by Jason A. Donenfeld

Subject: Re: Routing loops & TTL tracking with tunnel devices

Hi Eric,

On Fri, Apr 29, 2022 at 11:14 PM Eric Dumazet <[email protected]> wrote:
> On 4/29/22 14:07, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Fri, Apr 29, 2022 at 01:54:27PM -0700, Eric Dumazet wrote:
> > > > Anyway, it'd be nice if there were a free u8 somewhere in sk_buff that I
> > > > could use for tracking times through the stack. Other kernels have this
> > > > but afaict Linux still does not. I looked into trying to overload some
> > > > existing fields -- tstamp/skb_mstamp_ns or queue_mapping -- which I was
> > > > thinking might be totally unused on TX?
> > >
> > > If skbs are stored in some internal wireguard queue, can you not use
> > > skb->cb[], like many other layers do?
> >
> > This isn't for some internal wireguard queue. The packets get sent out
> > of udp_tunnel_xmit_skb(), so they leave wireguard's queues.
>
> OK, where is the queue then?
>
> dev_xmit_recursion() is supposed to catch long chains of virtual devices.

This is the long-chain-of-virtual-devices case indeed. But
dev_xmit_recursion() does not help here, since that's just a per-cpu
increment, assuming that the packet actually gets xmit'd in its
ndo_start_xmit function. But in reality, wireguard queues up the
packet, encrypts it in some worker later, and eventually transmits it
with udp_tunnel_xmit_skb(). All the while ndo_start_xmit() has long
returned. So no help from dev_xmit_recursion().
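
Schematically, the shape of the problem (not the real wireguard code):

#include <linux/netdevice.h>
#include <linux/ptr_ring.h>
#include <linux/workqueue.h>

static struct ptr_ring peer_queue;	/* schematic stand-ins */
static struct workqueue_struct *wg_wq;
static struct work_struct encrypt_work;

static netdev_tx_t wg_xmit(struct sk_buff *skb, struct net_device *dev)
{
	ptr_ring_produce(&peer_queue, skb);	/* just enqueue... */
	queue_work(wg_wq, &encrypt_work);	/* ...encrypt later, elsewhere */
	return NETDEV_TX_OK;	/* the recursion counter unwinds here */
}

static void wg_encrypt_worker(struct work_struct *work)
{
	/* Runs on some worker CPU, long after ndo_start_xmit returned.
	 * The eventual udp_tunnel_xmit_skb() re-enters the stack with
	 * the per-cpu recursion counter already back at zero. */
}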

Jason

2022-05-02 23:32:23

by Jason A. Donenfeld

Subject: Re: Routing loops & TTL tracking with tunnel devices

Hi Eric,

On Fri, Apr 29, 2022 at 01:54:27PM -0700, Eric Dumazet wrote:
> > Anyway, it'd be nice if there were a free u8 somewhere in sk_buff that I
> > could use for tracking times through the stack. Other kernels have this
> > but afaict Linux still does not. I looked into trying to overload some
> > existing fields -- tstamp/skb_mstamp_ns or queue_mapping -- which I was
> > thinking might be totally unused on TX?
>
>
> If skbs are stored in some internal wireguard queue, can you not use
> skb->cb[], like many other layers do?

This isn't for some internal wireguard queue. The packets get sent out
of udp_tunnel_xmit_skb(), so they leave wireguard's queues.

Jason

2022-05-02 23:47:27

by Jason A. Donenfeld

Subject: Re: Routing loops & TTL tracking with tunnel devices

Hi Eric,

On Sat, Apr 30, 2022 at 12:05 AM Eric Dumazet <[email protected]> wrote:
> I assume you add encap headers to the skb?

Yes; it's encapsulated in UDP, and under that some short header.
However, everything under that is encrypted. So,

> You could check if the wireguard header is there already, or if the
> number of headers is crazy.

so it's not quite possible to peer down further to see.

> You can also take a look at the CONFIG_SKB_EXTENSIONS infrastructure.

Blech, this involves some kind of per-packet allocation, right? I was
hoping there might be some 6 or 7 or 8 bit field in sk_buff that's not
used anywhere on the TX path that maybe I could overload for this
purpose...
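
For reference, the extension route would look something like the
following sketch. SKB_EXT_HOP_COUNT is hypothetical -- it would mean
adding a new ID to the enum in skbuff.h -- and skb_ext_add() is
exactly the per-packet allocation I'd rather avoid:

#include <linux/skbuff.h>

struct hop_count_ext {
	u8 hops;
};

static bool hop_limit_exceeded(struct sk_buff *skb)
{
	struct hop_count_ext *ext = skb_ext_find(skb, SKB_EXT_HOP_COUNT);

	if (!ext) {
		ext = skb_ext_add(skb, SKB_EXT_HOP_COUNT); /* allocates */
		if (!ext)
			return true;	/* treat failure as a loop */
		ext->hops = 0;
	}
	return ++ext->hops > 4;
}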

Jason

2022-05-02 23:59:04

by Eric Dumazet

Subject: Re: Routing loops & TTL tracking with tunnel devices


On 4/28/22 17:37, Jason A. Donenfeld wrote:
> Hey Eric,
>
> On Tue, Nov 17, 2015 at 03:41:35AM +0100, Jason A. Donenfeld wrote:
>> On Mon, Nov 16, 2015 at 11:28 PM, Eric Dumazet <[email protected]> wrote:
>>> There is very little chance we'll accept a new member in sk_buff, unless
>>> proven needed.
>> I actually have no intention of doing this! I'm wondering if there
>> already is a member in sk_buff that moonlights as my desired ttl
>> counter, or if there's another mechanism for avoiding routing loops. I
>> want to work with what's already there, rather than meddling with the
>> innards of important and memory sensitive structures such as sk_buff.
> Well, 7 years later... Maybe you have a better idea now of what I was
> working on then. :)
>
> As an update on this issue, it's still quasi-problematic. To review: I
> can't use the TTL value, because the outer packet must always get the
> TTL of the route to the outer destination, not the inner packet's TTL
> minus one. I can't rely on reaching MTU size, because people want this
> to work with fragmentation (see [1] for my attempt to disallow
> fragmentation for this issue, which resulted in hoots and hollers). I
> can't use the per-cpu xmit_recursion variable, because I use threads.
>
> What I can sort of use is taking advantage of what looks like a bug in
> pskb expansion, such that it always allocates too much and pretty
> quickly fails allocations after a few loops. Only powerpc64 and s390x
> don't appear to have this bug. See [2] for an in-depth description of
> this that I wrote to you a few months ago.

Hmm, I will take a look later I think. Thanks for the reminder.

>
> Anyway, it'd be nice if there were a free u8 somewhere in sk_buff that I
> could use for tracking times through the stack. Other kernels have this
> but afaict Linux still does not. I looked into trying to overload some
> existing fields -- tstamp/skb_mstamp_ns or queue_mapping -- which I was
> thinking might be totally unused on TX?

If skbs are stored in some internal wireguard queue, can you not use
skb->cb[], like many other layers do?

>
> Any ideas about this?
>
> Thanks,
> Jason
>
> [1] https://lore.kernel.org/wireguard/CAHmME9rNnBiNvBstb7MPwK-7AmAN0sOfnhdR=eeLrowWcKxaaQ@mail.gmail.com/
> [2] https://lore.kernel.org/netdev/CAHmME9pv1x6C4TNdL6648HydD8r+txpV4hTUXOBVkrapBXH4QQ@mail.gmail.com/

2022-05-03 00:18:09

by Eric Dumazet

Subject: Re: Routing loops & TTL tracking with tunnel devices

On Fri, Apr 29, 2022 at 2:54 PM Jason A. Donenfeld <[email protected]> wrote:
>
> Hi Eric,
>
> On Fri, Apr 29, 2022 at 11:14 PM Eric Dumazet <[email protected]> wrote:
> > On 4/29/22 14:07, Jason A. Donenfeld wrote:
> > > Hi Eric,
> > >
> > > On Fri, Apr 29, 2022 at 01:54:27PM -0700, Eric Dumazet wrote:
> > > > > Anyway, it'd be nice if there were a free u8 somewhere in sk_buff that I
> > > > > could use for tracking times through the stack. Other kernels have this
> > > > > but afaict Linux still does not. I looked into trying to overload some
> > > > > existing fields -- tstamp/skb_mstamp_ns or queue_mapping -- which I was
> > > > > thinking might be totally unused on TX?
> > > >
> > > > If skbs are stored in some internal wireguard queue, can you not use
> > > > skb->cb[], like many other layers do?
> > >
> > > This isn't for some internal wireguard queue. The packets get sent out
> > > of udp_tunnel_xmit_skb(), so they leave wireguard's queues.
> >
> > OK, where is the queue then?
> >
> > dev_xmit_recursion() is supposed to catch long chains of virtual devices.
>
> This is the long-chain-of-virtual-devices case indeed. But
> dev_xmit_recursion() does not help here, since that's just a per-cpu
> increment, assuming that the packet actually gets xmit'd in its
> ndo_start_xmit function. But in reality, wireguard queues up the
> packet, encrypts it in some worker later, and eventually transmits it
> with udp_tunnel_xmit_skb(). All the while ndo_start_xmit() has long
> returned. So no help from dev_xmit_recursion().
>

I assume you add encap headers to the skb?

You could check if the wireguard header is there already, or if the
number of headers is crazy.

net/core/flow_dissector.c might help.

You can also take a look at the CONFIG_SKB_EXTENSIONS infrastructure.

2022-05-03 01:01:52

by Eric Dumazet

Subject: Re: Routing loops & TTL tracking with tunnel devices

On Fri, Apr 29, 2022 at 3:09 PM Jason A. Donenfeld <[email protected]> wrote:
>
> Hi Eric,
>
> On Sat, Apr 30, 2022 at 12:05 AM Eric Dumazet <[email protected]> wrote:
> > I assume you add encap headers to the skb?
>
> Yes; it's encapsulated in UDP, and under that some short header.
> However, everything under that is encrypted. So,
>
> > You could check if the wireguard header is there already, or if the
> > number of headers is crazy.
>
> so it's not quite possible to peer down further to see.
>
> > You can also take a look at the CONFIG_SKB_EXTENSIONS infrastructure.
>
> Blech, this involves some kind of per-packet allocation, right? I was
> hoping there might be some 6 or 7 or 8 bit field in sk_buff that's not
> used anywhere on the TX path that maybe I could overload for this
> purpose...
>

Maybe reuse redirected / tc_skip_classify (used in drivers/net/ifb.c)
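
The pattern in drivers/net/ifb.c is along these lines (paraphrased
from memory, not an exact copy):

	/* ifb, when looping a redirected skb back out: mark it so it
	 * is not redirected or classified a second time. */
	skb->redirected = 0;
	skb->tc_skip_classify = 1;

The idea being that a one-bit marker like this, set on the first pass
through the device and checked on re-entry, survives the trip through
the stack in a way that skb->cb[] does not.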