LinuxLists.cc - [REGRESSION] connection timeout with routes to VRF

2022-06-11 11:44:48

Subject: [REGRESSION] connection timeout with routes to VRF

Hi,

TL;DR: We think we have found a regression in the handling of VRF route leaking
caused by "net: allow binding socket in a VRF when there's an unbound socket"
(3c82a21f4320).

We've been using VRF on 4.19 (from Debian buster) for serveral years and want to
update. After updating to 5.10 (from bullseye), we can no longer use routes
pointed at the VRF to reach services in the VRF

We use only one VRF, which contains some wireguard VPN interfaces.
Then we have routes outside the VRF to make the network connected to it
reachable from normal processes.

Our minimized test case looks like this:
ip rule add pref 32765 from all lookup local
ip rule del pref 0 from all lookup local
ip link add red type vrf table 1000
ip link set red up
ip route add vrf red unreachable default metric 8192
ip addr add dev red 172.16.0.1/24
ip route add 172.16.0.0/24 dev red
ip vrf exec red socat -dd TCP-LISTEN:1234,reuseaddr,fork SYSTEM:"echo connected" &
sleep 1
nc 172.16.0.1 1234 < /dev/null

While in our real setup, we connect to hosts reachable via th VRF's VPN
interfaces, the issue also shows up with a simple listening socket in the VRF.
In that case, the SYN-ACK from the remote host leads to a RST from the VRF, so
it seems that the outbound socket is not found.

A bisection with this leads to "net: allow binding socket in a VRF when there's
an unbound socket" (3c82a21f4320).

Before 3c82a21f4320, this connects fine:
2022/06/11 11:11:03 socat[326] N listening on AF=2 0.0.0.0:1234
nc 172.16.0.1 1234
2022/06/11 11:11:04 socat[326] N accepting connection from AF=2 172.16.0.1:44164 on AF=2 172.16.0.1:1234
...
connected

Since 3c82a21f4320, the connection does not complete:
2022/06/11 11:16:31 socat[324] N listening on AF=2 0.0.0.0:1234
+ nc 172.16.0.1 1234
(... times out)

We've retested with v5.19-rc1-262-g0885eacdc81f, and continue to get the
timeout.

The partial revert
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 98e1ec1a14f0..41e7f20d7e51 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -310,8 +310,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
        (((__sk)->sk_portpair == (__ports))                     &&      \
         ((__sk)->sk_addrpair == (__cookie))                    &&      \
-        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
-         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
+        (!(__sk)->sk_bound_dev_if      ||                              \
+          ((__sk)->sk_bound_dev_if == (__dif))                 ||      \
+          ((__sk)->sk_bound_dev_if == (__sdif)))               &&      \
         net_eq(sock_net(__sk), (__net)))
#else /* 32-bit arch */
#define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
@@ -321,8 +322,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
        (((__sk)->sk_portpair == (__ports))             &&              \
         ((__sk)->sk_daddr      == (__saddr))           &&              \
         ((__sk)->sk_rcv_saddr == (__daddr))           &&              \
-        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
-         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
+        (!(__sk)->sk_bound_dev_if      ||                              \
+          ((__sk)->sk_bound_dev_if == (__dif))         ||              \
+          ((__sk)->sk_bound_dev_if == (__sdif)))       &&              \
         net_eq(sock_net(__sk), (__net)))
#endif /* 64-bit arch */

restores the original behavior when applied on v5.18. This doesn't apply
directly on master, as the macro was replaced by an inline function in "inet:
add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()" (4915d50e300e).

I have to admit I don't quite understand 3c82a21f4320, so I'm not sure how to
proceed. What would be broken by the partial revert above? Are there better ways
to configure routing into the VRF than simply "ip route add 172.16.0.0/24 dev
red" that still work?

Thanks,
Jan

#regzbot introduced: 3c82a21f4320

2022-06-11 16:52:42

by David Ahern

[permalink] [raw]

Subject: Re: [REGRESSION] connection timeout with routes to VRF

On 6/11/22 5:14 AM, Jan Luebbe wrote:
> Hi,
>
> TL;DR: We think we have found a regression in the handling of VRF route leaking
> caused by "net: allow binding socket in a VRF when there's an unbound socket"
> (3c82a21f4320).

This is the 3rd report in the past few months about this commit.

...

>
> Our minimized test case looks like this:
> ip rule add pref 32765 from all lookup local
> ip rule del pref 0 from all lookup local
> ip link add red type vrf table 1000
> ip link set red up
> ip route add vrf red unreachable default metric 8192
> ip addr add dev red 172.16.0.1/24
> ip route add 172.16.0.0/24 dev red
> ip vrf exec red socat -dd TCP-LISTEN:1234,reuseaddr,fork SYSTEM:"echo connected" &
> sleep 1
> nc 172.16.0.1 1234 < /dev/null
>

...
Thanks for the detailed analysis and reproducer.

>
> The partial revert
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 98e1ec1a14f0..41e7f20d7e51 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -310,8 +310,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
> #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
>         (((__sk)->sk_portpair == (__ports))                     &&      \
>          ((__sk)->sk_addrpair == (__cookie))                    &&      \
> -        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
> -         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
> +        (!(__sk)->sk_bound_dev_if      ||                              \
> +          ((__sk)->sk_bound_dev_if == (__dif))                 ||      \
> +          ((__sk)->sk_bound_dev_if == (__sdif)))               &&      \
>          net_eq(sock_net(__sk), (__net)))
> #else /* 32-bit arch */
> #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
> @@ -321,8 +322,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
>         (((__sk)->sk_portpair == (__ports))             &&              \
>          ((__sk)->sk_daddr      == (__saddr))           &&              \
>          ((__sk)->sk_rcv_saddr == (__daddr))           &&              \
> -        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
> -         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
> +        (!(__sk)->sk_bound_dev_if      ||                              \
> +          ((__sk)->sk_bound_dev_if == (__dif))         ||              \
> +          ((__sk)->sk_bound_dev_if == (__sdif)))       &&              \
>          net_eq(sock_net(__sk), (__net)))
> #endif /* 64-bit arch */
>
> restores the original behavior when applied on v5.18. This doesn't apply
> directly on master, as the macro was replaced by an inline function in "inet:
> add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()" (4915d50e300e).
>
> I have to admit I don't quite understand 3c82a21f4320, so I'm not sure how to
> proceed. What would be broken by the partial revert above? Are there better ways
> to configure routing into the VRF than simply "ip route add 172.16.0.0/24 dev
> red" that still work?
>
> Thanks,
> Jan
>
> #regzbot introduced: 3c82a21f4320
>
>
>

Andy Roulin suggested the same fix to the same problem a few weeks back.
Let's do it along with a test case in fcnl-test.sh which covers all of
these vrf permutations.

2022-06-15 18:32:16

by Jan Lübbe

[permalink] [raw]

Subject: Re: [REGRESSION] connection timeout with routes to VRF

On Sat, 2022-06-11 at 10:44 -0600, David Ahern wrote:
> On 6/11/22 5:14 AM, Jan Luebbe wrote:
> > Hi,
> >
> > TL;DR: We think we have found a regression in the handling of VRF route
> > leaking
> > caused by "net: allow binding socket in a VRF when there's an unbound
> > socket"
> > (3c82a21f4320).
>
> This is the 3rd report in the past few months about this commit.
>
> ...

Hmm, I've not been able to find other reports. Could you point me to them?

> >
> > Our minimized test case looks like this:
> > ip rule add pref 32765 from all lookup local
> > ip rule del pref 0 from all lookup local
> > ip link add red type vrf table 1000
> > ip link set red up
> > ip route add vrf red unreachable default metric 8192
> > ip addr add dev red 172.16.0.1/24
> > ip route add 172.16.0.0/24 dev red
> > ip vrf exec red socat -dd TCP-LISTEN:1234,reuseaddr,fork SYSTEM:"echo
> > connected" &
> > sleep 1
> > nc 172.16.0.1 1234 < /dev/null
> >
>
> ...
> Thanks for the detailed analysis and reproducer.
>
> >
> > The partial revert
> > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > index 98e1ec1a14f0..41e7f20d7e51 100644
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -310,8 +310,9 @@ static inline struct sock *inet_lookup_listener(struct
> > net *net,
> > #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif,
> > __sdif) \
> >         (((__sk)->sk_portpair == (__ports))                     &&      \
> >          ((__sk)->sk_addrpair == (__cookie))                    &&      \
> > -        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
> > -         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
> > +        (!(__sk)->sk_bound_dev_if      ||                              \
> > +          ((__sk)->sk_bound_dev_if == (__dif))                 ||      \
> > +          ((__sk)->sk_bound_dev_if == (__sdif)))               &&      \
> >          net_eq(sock_net(__sk), (__net)))
> > #else /* 32-bit arch */
> > #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
> > @@ -321,8 +322,9 @@ static inline struct sock *inet_lookup_listener(struct
> > net *net,
> >         (((__sk)->sk_portpair == (__ports))             &&              \
> >          ((__sk)->sk_daddr      == (__saddr))           &&              \
> >          ((__sk)->sk_rcv_saddr == (__daddr))           &&              \
> > -        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
> > -         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
> > +        (!(__sk)->sk_bound_dev_if      ||                              \
> > +          ((__sk)->sk_bound_dev_if == (__dif))         ||              \
> > +          ((__sk)->sk_bound_dev_if == (__sdif)))       &&              \
> >          net_eq(sock_net(__sk), (__net)))
> > #endif /* 64-bit arch */
> >
> > restores the original behavior when applied on v5.18. This doesn't apply
> > directly on master, as the macro was replaced by an inline function in
> > "inet:
> > add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()" (4915d50e300e).
> >
> > I have to admit I don't quite understand 3c82a21f4320, so I'm not sure how
> > to proceed. What would be broken by the partial revert above? Are there
> > better ways to configure routing into the VRF than simply "ip route add
> > 172.16.0.0/24 dev red" that still work?
> >
> > Thanks,
> > Jan
> >
> > #regzbot introduced: 3c82a21f4320
> >
> >
> >
>
> Andy Roulin suggested the same fix to the same problem a few weeks back.
> Let's do it along with a test case in fcnl-test.sh which covers all of
> these vrf permutations.

Thanks! I'd be happy to test any patch in our real setup.

Regards,
Jan

2022-06-16 13:36:20

by David Ahern

[permalink] [raw]

Subject: Re: [REGRESSION] connection timeout with routes to VRF

On 6/15/22 11:47 AM, Jan Luebbe wrote:
> On Sat, 2022-06-11 at 10:44 -0600, David Ahern wrote:
>> On 6/11/22 5:14 AM, Jan Luebbe wrote:
>>> Hi,
>>>
>>> TL;DR: We think we have found a regression in the handling of VRF route
>>> leaking
>>> caused by "net: allow binding socket in a VRF when there's an unbound
>>> socket"
>>> (3c82a21f4320).
>>
>> This is the 3rd report in the past few months about this commit.
>>
>> ...
>
> Hmm, I've not been able to find other reports. Could you point me to them?

March of this year:
https://lore.kernel.org/netdev/[email protected]/

Andy sent me an email offlist in May; same topic.

Hence you are the 3rd in 3 months.

...

>>
>> Andy Roulin suggested the same fix to the same problem a few weeks back.
>> Let's do it along with a test case in fcnl-test.sh which covers all of
>> these vrf permutations.
>
> Thanks! I'd be happy to test any patch in our real setup.

as I said, Andy suggested the same fix as you. Feel free to submit as a
patch; would be best to get test cases added to the fcnal-test script.

2022-06-26 20:51:42

by Mike Manning

[permalink] [raw]

Subject: Re: [REGRESSION] connection timeout with routes to VRF

On 11/06/2022 17:44, David Ahern wrote:
> On 6/11/22 5:14 AM, Jan Luebbe wrote:
>> Hi,
>>
>> TL;DR: We think we have found a regression in the handling of VRF route leaking
>> caused by "net: allow binding socket in a VRF when there's an unbound socket"
>> (3c82a21f4320).
> This is the 3rd report in the past few months about this commit.
>
> ...
>
>> Our minimized test case looks like this:
>> ip rule add pref 32765 from all lookup local
>> ip rule del pref 0 from all lookup local
>> ip link add red type vrf table 1000
>> ip link set red up
>> ip route add vrf red unreachable default metric 8192
>> ip addr add dev red 172.16.0.1/24
>> ip route add 172.16.0.0/24 dev red
>> ip vrf exec red socat -dd TCP-LISTEN:1234,reuseaddr,fork SYSTEM:"echo connected" &
>> sleep 1
>> nc 172.16.0.1 1234 < /dev/null
>>
> ...
> Thanks for the detailed analysis and reproducer.
>
>> The partial revert
>> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
>> index 98e1ec1a14f0..41e7f20d7e51 100644
>> --- a/include/net/inet_hashtables.h
>> +++ b/include/net/inet_hashtables.h
>> @@ -310,8 +310,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
>> #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
>>         (((__sk)->sk_portpair == (__ports))                     &&      \
>>          ((__sk)->sk_addrpair == (__cookie))                    &&      \
>> -        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
>> -         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
>> +        (!(__sk)->sk_bound_dev_if      ||                              \
>> +          ((__sk)->sk_bound_dev_if == (__dif))                 ||      \
>> +          ((__sk)->sk_bound_dev_if == (__sdif)))               &&      \
>>          net_eq(sock_net(__sk), (__net)))
>> #else /* 32-bit arch */
>> #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
>> @@ -321,8 +322,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
>>         (((__sk)->sk_portpair == (__ports))             &&              \
>>          ((__sk)->sk_daddr      == (__saddr))           &&              \
>>          ((__sk)->sk_rcv_saddr == (__daddr))           &&              \
>> -        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
>> -         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
>> +        (!(__sk)->sk_bound_dev_if      ||                              \
>> +          ((__sk)->sk_bound_dev_if == (__dif))         ||              \
>> +          ((__sk)->sk_bound_dev_if == (__sdif)))       &&              \
>>          net_eq(sock_net(__sk), (__net)))
>> #endif /* 64-bit arch */
>>
>> restores the original behavior when applied on v5.18. This doesn't apply
>> directly on master, as the macro was replaced by an inline function in "inet:
>> add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()" (4915d50e300e).
>>
>> I have to admit I don't quite understand 3c82a21f4320, so I'm not sure how to
>> proceed. What would be broken by the partial revert above? Are there better ways
>> to configure routing into the VRF than simply "ip route add 172.16.0.0/24 dev
>> red" that still work?
>>
>> Thanks,
>> Jan
>>
>> #regzbot introduced: 3c82a21f4320
>>
>>
>>
> Andy Roulin suggested the same fix to the same problem a few weeks back.
> Let's do it along with a test case in fcnl-test.sh which covers all of
> these vrf permutations.
>
Reverting 3c82a21f4320 would remove isolation between the default and other VRFs
needed when no VRF route leaking has been configured between these: there may be
unintended leaking of packets arriving on a device enslaved to an l3mdev due to
the potential match on an unbound socket.

VRF route leaking requires routes to be present for both ingress and egress VRFs,
the testcase shown only has a route from default to red VRF. The implicit return
path from red to default VRF due to match on unbound socket is no longer present.

Match on unbound socket in all VRFs and not only in the default VRF should be
possible by setting this option (see
https://www.kernel.org/doc/Documentation/networking/vrf.txt):

sysctl net.ipv4.tcp_l3mdev_accept=1

However, for this to work a change similar to the following is needed (I have
shown the change to the macro for consistency with above, it is now an inline fn):

---
include/net/inet_hashtables.h | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -300,9 +300,8 @@
#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
(((__sk)->sk_portpair == (__ports)) && \
((__sk)->sk_addrpair == (__cookie)) && \
- (((__sk)->sk_bound_dev_if == (__dif)) || \
- ((__sk)->sk_bound_dev_if == (__sdif))) && \
- net_eq(sock_net(__sk), (__net)))
+ net_eq(sock_net(__sk), (__net)) && \
+ inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif), (__sdif)))
#else /* 32-bit arch */
#define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
const int __name __deprecated __attribute__((unused))
@@ -311,9 +310,8 @@
(((__sk)->sk_portpair == (__ports)) && \
((__sk)->sk_daddr == (__saddr)) && \
((__sk)->sk_rcv_saddr == (__daddr)) && \
- (((__sk)->sk_bound_dev_if == (__dif)) || \
- ((__sk)->sk_bound_dev_if == (__sdif))) && \
- net_eq(sock_net(__sk), (__net)))
+ net_eq(sock_net(__sk), (__net)) && \
+ inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif), (__sdif)))
#endif /* 64-bit arch */

/* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need

This is to get the testcase to pass, I will leave it to others to comment on
the testcase validity in terms of testing forwarding using commands on 1 device.

The series that 3c82a21f4320 is part of were introduced into the kernel in 2018
by the Vyatta team, who regularly run an extensive test suite for routing protocols
for VRF functionality incl. all combinations of route leaking between default and
other VRFs, so there is no known issue in this regard. I will attempt to reach out
to them so as to advise them of this thread.

Thanks
Mike

2022-07-06 19:26:38

by Jan Lübbe

[permalink] [raw]

Subject: Re: [REGRESSION] connection timeout with routes to VRF

On Sun, 2022-06-26 at 21:06 +0100, Mike Manning wrote:
...
> Andy Roulin suggested the same fix to the same problem a few weeks back.
> Let's do it along with a test case in fcnl-test.sh which covers all of
> these vrf permutations.
>
Reverting 3c82a21f4320 would remove isolation between the default and other VRFs
needed when no VRF route leaking has been configured between these: there may be
unintended leaking of packets arriving on a device enslaved to an l3mdev due to
the potential match on an unbound socket.

Thanks for the explanation.

VRF route leaking requires routes to be present for both ingress and egress
VRFs,
the testcase shown only has a route from default to red VRF. The implicit return
path from red to default VRF due to match on unbound socket is no longer
present.

If there is a better configuration that makes this work in the general case
without a change to the kernel, we'd be happy as well.

In our full setup, the outbound TCP connection (from the default VRF) gets a
local IP from the interface enslaved to the VRF. Before 3c82a21f4320, this would
simply work.

How would the return path route from the red VRF to the default VRF look in that
case?

Match on unbound socket in all VRFs and not only in the default VRF should be
possible by setting this option (see
https://www.kernel.org/doc/Documentation/networking/vrf.txt):

Do you mean unbound as in listening socket not bound to an IP with bind()? Or as
in a socket in the default VRF?

sysctl net.ipv4.tcp_l3mdev_accept=1

The sysctl docs sound like this should only apply to listening sockets. In this
case, we have an unconnected outbound socket.

However, for this to work a change similar to the following is needed (I have
shown the change to the macro for consistency with above, it is now an inline
fn):

I can also test on master and only used the macro form only because I wasn't
completely sure how to translate it to the inline function form.

---
include/net/inet_hashtables.h |   10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -300,9 +300,8 @@
#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif,
__sdif) \
        (((__sk)->sk_portpair == (__ports))                     &&      \
         ((__sk)->sk_addrpair == (__cookie))                    &&      \
-        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
-         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
-        net_eq(sock_net(__sk), (__net)))
+        net_eq(sock_net(__sk), (__net))                        &&      \
+        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
(__sdif)))
#else /* 32-bit arch */
#define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
        const int __name __deprecated __attribute__((unused))
@@ -311,9 +310,8 @@
        (((__sk)->sk_portpair == (__ports))             &&              \
         ((__sk)->sk_daddr      == (__saddr))           &&              \
         ((__sk)->sk_rcv_saddr == (__daddr))           &&              \
-        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
-         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
-        net_eq(sock_net(__sk), (__net)))
+        net_eq(sock_net(__sk), (__net))                &&              \
+        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
(__sdif)))
#endif /* 64-bit arch */

/* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need

I can confirm that this gets my testcase working with
net.ipv4.tcp_l3mdev_accept=1.

This is to get the testcase to pass, I will leave it to others to comment on
the testcase validity in terms of testing forwarding using commands on 1 device.

So a network-namespace-based testcase would be preferred? We used the simple
setup because it seemed easier to understand.

The series that 3c82a21f4320 is part of were introduced into the kernel in 2018
by the Vyatta team, who regularly run an extensive test suite for routing
protocols
for VRF functionality incl. all combinations of route leaking between default
and
other VRFs, so there is no known issue in this regard. I will attempt to reach
out
to them so as to advise them of this thread.

Are these testcases public? Perhaps I could use them find a better configuration
that handles our use-case.

Thanks,

Jan

2022-07-17 10:52:42

by Mike Manning

[permalink] [raw]

Subject: Re: [REGRESSION] connection timeout with routes to VRF

On 06/07/2022 19:49, Jan Luebbe wrote:
> On Sun, 2022-06-26 at 21:06 +0100, Mike Manning wrote:
> ...
>> Andy Roulin suggested the same fix to the same problem a few weeks back.
>> Let's do it along with a test case in fcnl-test.sh which covers all of
>> these vrf permutations.
>>
> Reverting 3c82a21f4320 would remove isolation between the default and other VRFs
> needed when no VRF route leaking has been configured between these: there may be
> unintended leaking of packets arriving on a device enslaved to an l3mdev due to
> the potential match on an unbound socket.
>
> Thanks for the explanation.
>
> VRF route leaking requires routes to be present for both ingress and egress
> VRFs,
> the testcase shown only has a route from default to red VRF. The implicit return
> path from red to default VRF due to match on unbound socket is no longer
> present.
>
>
> If there is a better configuration that makes this work in the general case
> without a change to the kernel, we'd be happy as well.
>
> In our full setup, the outbound TCP connection (from the default VRF) gets a
> local IP from the interface enslaved to the VRF. Before 3c82a21f4320, this would
> simply work.
>
> How would the return path route from the red VRF to the default VRF look in that
> case?

I am unaware of your topology, can you add a route in the red VRF table
(see 'ip route ls vrf red'), so 'ip route add vrf red <prefix> via
<next-hop>'.

The isolation between default and other VRFs necessary for forwarding
purposes means that running a local process in the default VRF to access
another VRF no longer works since the change made in 2018 that you
identified. So in your example, 'ip vrf exec red nc ...' will work, but
I assume that this is of no use to you.

> Match on unbound socket in all VRFs and not only in the default VRF should be
> possible by setting this option (see
> https://www.kernel.org/doc/Documentation/networking/vrf.txt):
>
>
> Do you mean unbound as in listening socket not bound to an IP with bind()? Or as
> in a socket in the default VRF?

Unbound meaning a socket in the default VRF, as opposed to a a socket
set into a VRF context by binding it to a VRF master interface using
SO_BINDTODEVICE. One must also be able to bind to an appropriate IP
address with bind() regardless of whether the socket is in the default
or another VRF, but that is not relevant here.

> sysctl net.ipv4.tcp_l3mdev_accept=1
>
>
> The sysctl docs sound like this should only apply to listening sockets. In this
> case, we have an unconnected outbound socket.

With this option disabled (by default), any stream socket in a VRF is
only selected for packets in that VRF, this is done in the input path
see e.g. tcp_v4_rcv() for IPv4.

> However, for this to work a change similar to the following is needed (I have
> shown the change to the macro for consistency with above, it is now an inline
> fn):
>
>
> I can also test on master and only used the macro form only because I wasn't
> completely sure how to translate it to the inline function form.
>
> ---
> include/net/inet_hashtables.h |   10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -300,9 +300,8 @@
> #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif,
> __sdif) \
>         (((__sk)->sk_portpair == (__ports))                     &&      \
>          ((__sk)->sk_addrpair == (__cookie))                    &&      \
> -        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
> -         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
> -        net_eq(sock_net(__sk), (__net)))
> +        net_eq(sock_net(__sk), (__net))                        &&      \
> +        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
> (__sdif)))
> #else /* 32-bit arch */
> #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
>         const int __name __deprecated __attribute__((unused))
> @@ -311,9 +310,8 @@
>         (((__sk)->sk_portpair == (__ports))             &&              \
>          ((__sk)->sk_daddr      == (__saddr))           &&              \
>          ((__sk)->sk_rcv_saddr == (__daddr))           &&              \
> -        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
> -         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
> -        net_eq(sock_net(__sk), (__net)))
> +        net_eq(sock_net(__sk), (__net))                &&              \
> +        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
> (__sdif)))
> #endif /* 64-bit arch */
>
> /* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need
>
> I can confirm that this gets my testcase working with
> net.ipv4.tcp_l3mdev_accept=1.

I can submit this change to kernel-net (modified for latest code) if
David is ok with this approach. It should not have a significant
performance impact (due to the additional kernel parameter check) for
most use-cases, as typically sdif = 0 for the unbound case. I am not in
a position to carry out any performance testing.

> This is to get the testcase to pass, I will leave it to others to comment on
> the testcase validity in terms of testing forwarding using commands on 1 device.
>
> So a network-namespace-based testcase would be preferred? We used the simple
> setup because it seemed easier to understand.
>
> The series that 3c82a21f4320 is part of were introduced into the kernel in 2018
> by the Vyatta team, who regularly run an extensive test suite for routing
> protocols
> for VRF functionality incl. all combinations of route leaking between default
> and
> other VRFs, so there is no known issue in this regard. I will attempt to reach
> out
> to them so as to advise them of this thread.
>
> Are these testcases public? Perhaps I could use them find a better configuration
> that handles our use-case.

The test automation to bring up network topologies is not public, but
the test cases would not be readily transferable for general use in any
case. I have advised the Vyatta team of this thread.

> Thanks,
>
> Jan
>

2022-07-17 19:54:27

by Jan Lübbe

[permalink] [raw]

Subject: Re: [REGRESSION] connection timeout with routes to VRF

On Sun, 2022-07-17 at 11:31 +0100, Mike Manning wrote:
> On 06/07/2022 19:49, Jan Luebbe wrote:
> > On Sun, 2022-06-26 at 21:06 +0100, Mike Manning wrote:
> > ...
> > > Andy Roulin suggested the same fix to the same problem a few weeks back.
> > > Let's do it along with a test case in fcnl-test.sh which covers all of
> > > these vrf permutations.
> > >
> > Reverting 3c82a21f4320 would remove isolation between the default and other VRFs
> > needed when no VRF route leaking has been configured between these: there may be
> > unintended leaking of packets arriving on a device enslaved to an l3mdev due to
> > the potential match on an unbound socket.
> >
> > Thanks for the explanation.
> >
> > VRF route leaking requires routes to be present for both ingress and egress
> > VRFs,
> > the testcase shown only has a route from default to red VRF. The implicit return
> > path from red to default VRF due to match on unbound socket is no longer
> > present.
> >
> >
> > If there is a better configuration that makes this work in the general case
> > without a change to the kernel, we'd be happy as well.
> >
> > In our full setup, the outbound TCP connection (from the default VRF) gets a
> > local IP from the interface enslaved to the VRF. Before 3c82a21f4320, this would
> > simply work.
> >
> > How would the return path route from the red VRF to the default VRF look in that
> > case?
>
> I am unaware of your topology, can you add a route in the red VRF table
> (see 'ip route ls vrf red'), so 'ip route add vrf red <prefix> via
> <next-hop>'.

With 4.19 (and my workaround), the outbound packets were simply assigned an IP
from the red VRF if the have to leave the local machine. The IPs in the default
VRF are public IPs, which would not be routed back to the same machine.

I'm not sure if it helps, but I'll try to explain our topology:
https://gist.github.com/jluebbe/001c7b9ba531ad04d7e5c0a58f400967

Were using VRF for the https://freifunk-bs.de/ network. In the diagram above, a
'freifunk' VRF is used to create an overlay network, consisting of Wireguard
tunnels with OSPF and BGP between the concentrators and exists.

Each of the these servers has the main VRF with the physical network interface
and externally routeable IPs, which is used to transport the Wireguard packets
and management access.

The 'freifunk' VRF encapsulates the overlay traffic, so that it is only routed
to other interfaces assigned to the VRF. Previously, we used policy routing for
this, but VRF has made the setup much easier to understand and debug. :)

> The isolation between default and other VRFs necessary for forwarding
> purposes means that running a local process in the default VRF to access
> another VRF no longer works since the change made in 2018 that you
> identified. So in your example, 'ip vrf exec red nc ...' will work, but
> I assume that this is of no use to you.

I don't actually use forwarding across VRFs via route leaking, but only for
locally originating connections. A simple example is that a 'ssh <IP accessible
via VRF>' just worked for non-root users, which also makes ssh's ProxyJump work
to hosts accessible via the VRF. Both 'ip vrf exec' and binding to the VRF
interface would require root.

For real services (such as bind), we already used 'ip vrf exec ...' and that was
(and is) working fine.

> > Match on unbound socket in all VRFs and not only in the default VRF should be
> > possible by setting this option (see
> > https://www.kernel.org/doc/Documentation/networking/vrf.txt):
> >
> >
> > Do you mean unbound as in listening socket not bound to an IP with bind()? Or as
> > in a socket in the default VRF?
>
> Unbound meaning a socket in the default VRF, as opposed to a a socket
> set into a VRF context by binding it to a VRF master interface using
> SO_BINDTODEVICE. One must also be able to bind to an appropriate IP
> address with bind() regardless of whether the socket is in the default
> or another VRF, but that is not relevant here.

OK.

> > sysctl net.ipv4.tcp_l3mdev_accept=1
> >
> >
> > The sysctl docs sound like this should only apply to listening sockets. In this
> > case, we have an unconnected outbound socket.
>
> With this option disabled (by default), any stream socket in a VRF is
> only selected for packets in that VRF, this is done in the input path
> see e.g. tcp_v4_rcv() for IPv4.
>

Yes, we've been using this option for a long time to make sshd accessible from
both the default VRF and the red VRF.

> > However, for this to work a change similar to the following is needed (I have
> > shown the change to the macro for consistency with above, it is now an inline
> > fn):
> >
> >
> > I can also test on master and only used the macro form only because I wasn't
> > completely sure how to translate it to the inline function form.
> >
> > ---
> > include/net/inet_hashtables.h |   10 ++++------
> > 1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -300,9 +300,8 @@
> > #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif,
> > __sdif) \
> >         (((__sk)->sk_portpair == (__ports))                     &&      \
> >          ((__sk)->sk_addrpair == (__cookie))                    &&      \
> > -        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
> > -         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
> > -        net_eq(sock_net(__sk), (__net)))
> > +        net_eq(sock_net(__sk), (__net))                        &&      \
> > +        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
> > (__sdif)))
> > #else /* 32-bit arch */
> > #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
> >         const int __name __deprecated __attribute__((unused))
> > @@ -311,9 +310,8 @@
> >         (((__sk)->sk_portpair == (__ports))             &&              \
> >          ((__sk)->sk_daddr      == (__saddr))           &&              \
> >          ((__sk)->sk_rcv_saddr == (__daddr))           &&              \
> > -        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
> > -         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
> > -        net_eq(sock_net(__sk), (__net)))
> > +        net_eq(sock_net(__sk), (__net))                &&              \
> > +        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
> > (__sdif)))
> > #endif /* 64-bit arch */
> >
> > /* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need
> >
> > I can confirm that this gets my testcase working with
> > net.ipv4.tcp_l3mdev_accept=1.
>
> I can submit this change to kernel-net (modified for latest code) if
> David is ok with this approach. It should not have a significant
> performance impact (due to the additional kernel parameter check) for
> most use-cases, as typically sdif = 0 for the unbound case. I am not in
> a position to carry out any performance testing.

I've tried your patch on our production network and notices an additional
complication I've missed so far: It only fixes v4.

As there is no tcp_l3mdev_accept for v6 as far as I can tell, I've updated my
workaround (see below) and we've been using it since last week without obvious
issues.

> > This is to get the testcase to pass, I will leave it to others to comment on
> > the testcase validity in terms of testing forwarding using commands on 1 device.
> >
> > So a network-namespace-based testcase would be preferred? We used the simple
> > setup because it seemed easier to understand.
> >
> > The series that 3c82a21f4320 is part of were introduced into the kernel in 2018
> > by the Vyatta team, who regularly run an extensive test suite for routing
> > protocols for VRF functionality incl. all combinations of route leaking between default
> > and other VRFs, so there is no known issue in this regard. I will attempt to reach
> > out to them so as to advise them of this thread.
> >
> > Are these testcases public? Perhaps I could use them find a better configuration
> > that handles our use-case.
>
> The test automation to bring up network topologies is not public, but
> the test cases would not be readily transferable for general use in any
> case. I have advised the Vyatta team of this thread.

Ah, thanks!

My current workaround on top of 5.18.11 (derived from a 3c82a21f4320 revert):

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 81b965953036..1944f8730b42 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -110,8 +110,9 @@ int inet6_hash(struct sock *sk);
((__sk)->sk_family == AF_INET6) && \
ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr)) && \
ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr)) && \
- (((__sk)->sk_bound_dev_if == (__dif)) || \
- ((__sk)->sk_bound_dev_if == (__sdif))) && \
+ (!(__sk)->sk_bound_dev_if || \
+ ((__sk)->sk_bound_dev_if == (__dif)) || \
+ ((__sk)->sk_bound_dev_if == (__sdif))) && \
net_eq(sock_net(__sk), (__net)))

#endif /* _INET6_HASHTABLES_H */
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 98e1ec1a14f0..41e7f20d7e51 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -310,8 +310,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
(((__sk)->sk_portpair == (__ports)) && \
((__sk)->sk_addrpair == (__cookie)) && \
- (((__sk)->sk_bound_dev_if == (__dif)) || \
- ((__sk)->sk_bound_dev_if == (__sdif))) && \
+ (!(__sk)->sk_bound_dev_if || \
+ ((__sk)->sk_bound_dev_if == (__dif)) || \
+ ((__sk)->sk_bound_dev_if == (__sdif))) && \
net_eq(sock_net(__sk), (__net)))
#else /* 32-bit arch */
#define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
@@ -321,8 +322,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
(((__sk)->sk_portpair == (__ports)) && \
((__sk)->sk_daddr == (__saddr)) && \
((__sk)->sk_rcv_saddr == (__daddr)) && \
- (((__sk)->sk_bound_dev_if == (__dif)) || \
- ((__sk)->sk_bound_dev_if == (__sdif))) && \
+ (!(__sk)->sk_bound_dev_if || \
+ ((__sk)->sk_bound_dev_if == (__dif)) || \
+ ((__sk)->sk_bound_dev_if == (__sdif))) && \
net_eq(sock_net(__sk), (__net)))
#endif /* 64-bit arch */

Thanks again,
Jan