LinuxLists.cc - [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

2020-09-18 18:27:57

Subject: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

Hi,

Here is an updated series of fixes for ipv4 and ipv6 which ensure
the route lookup is performed on the right routing table in VRF
configurations when sending TTL expired icmp errors (useful for
traceroute).

It includes tests for both ipv4 and ipv6.

These fixes specifically address the code paths involved in
sending TTL expired icmp errors. As detailed in the individual commit
messages, those fixes do not address similar issues related to network
namespaces and unreachable / fragmentation needed messages, which appear
to use different code paths.

Thanks,

Mathieu

Mathieu Desnoyers (2):
ipv4/icmp: l3mdev: Perform icmp error route lookup on source device
routing table (v2)
ipv6/icmp: l3mdev: Perform icmp error route lookup on source device
routing table (v2)

Michael Jeanson (1):
selftests: Add VRF icmp error route lookup test

net/ipv4/icmp.c | 23 +-
net/ipv6/icmp.c | 7 +-
net/ipv6/ip6_output.c | 2 -
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/vrf_icmp_error_route.sh | 475 ++++++++++++++++++
5 files changed, 502 insertions(+), 6 deletions(-)
create mode 100755 tools/testing/selftests/net/vrf_icmp_error_route.sh

--
2.17.1

2020-09-18 18:30:36

by Mathieu Desnoyers

Subject: [RFC PATCH v2 3/3] selftests: Add VRF icmp error route lookup test

From: Michael Jeanson <[email protected]>

The objective is to check that the incoming vrf routing table is selected
to send an ICMP error back to the source. We test two scenarios: when the
ttl of a packet reaches 1 while it is forwarded between different vrfs,
and when a packet that is bigger than the mtu of the second interface is
forwarded between different vrfs.

The first ttl test sends a ping with a ttl of 1 from h1 to h2 and parses the
output of the command to check that a ttl expired error is received.

The second ttl test runs traceroute from h1 to h2 and parses the output to
check for a hop on r1.

The mtu test sends a ping with a payload of 1450 from h1 to h2, through
r1 which has an interface with a mtu of 1400 and parses the output of the
command to check that a fragmentation needed error is received.
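
A minimal sketch of the output check these tests rely on, assuming the
ping implementation prints the same error strings the script greps for
("Time to live exceeded", "Frag needed", "Time exceeded: Hop limit" and
"Packet too big"). The function name classify_icmp_error and the sample
input line are illustrative and not part of the patch.

```shell
#!/bin/bash
# Hypothetical helper, not part of the patch: map ping output read on stdin
# to the ICMP error the selftest expects. The patterns are the ones the
# script passes to run_cmd_grep.
classify_icmp_error()
{
	local out
	out=$(cat)

	case "$out" in
	*"Time to live exceeded"*)    echo "ipv4-ttl" ;;
	*"Frag needed"*)              echo "ipv4-frag" ;;
	*"Time exceeded: Hop limit"*) echo "ipv6-ttl" ;;
	*"Packet too big"*)           echo "ipv6-frag" ;;
	*)                            echo "none"; return 1 ;;
	esac
}

# Illustrative input only; a real run would pipe in the output of
# "ip netns exec h1 ping -t1 -c1 -W2 172.16.2.2".
echo "From 172.16.1.253 icmp_seq=1 Time to live exceeded" | classify_icmp_error
```

The non-zero return on "none" mirrors how the script treats a missing
pattern as a test failure.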

Signed-off-by: Michael Jeanson <[email protected]>
Cc: David Ahern <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: [email protected]

---
Changes since v1:
- Formatting and indentation fixes
- Added '-t' to getopts
- Reworked verbose output of grep'd commands with a new function
- Expanded ip command names
- Added fragmentation tests
---
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/vrf_icmp_error_route.sh | 475 ++++++++++++++++++
2 files changed, 476 insertions(+)
create mode 100755 tools/testing/selftests/net/vrf_icmp_error_route.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 9491bbaa0831..a716fbf780b3 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -19,6 +19,7 @@ TEST_PROGS += txtimestamp.sh
TEST_PROGS += vrf-xfrm-tests.sh
TEST_PROGS += rxtimestamp.sh
TEST_PROGS += devlink_port_split.py
+TEST_PROGS += vrf_icmp_error_route.sh
TEST_PROGS_EXTENDED := in_netns.sh
TEST_GEN_FILES = socket nettest
TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
diff --git a/tools/testing/selftests/net/vrf_icmp_error_route.sh b/tools/testing/selftests/net/vrf_icmp_error_route.sh
new file mode 100755
index 000000000000..42c412bd79ab
--- /dev/null
+++ b/tools/testing/selftests/net/vrf_icmp_error_route.sh
@@ -0,0 +1,475 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2019 David Ahern <[email protected]>. All rights reserved.
+# Copyright (c) 2020 Michael Jeanson <[email protected]>. All rights reserved.
+#
+# blue red
+# .253 +----+ .253
+# +----| r1 |----+
+# | +----+ |
+# +----+ | | +----+
+# | h1 |--------------+ +--------------| h2 |
+# +----+ .1 | | .2 +----+
+# 172.16.1/24 | +----+ | 172.16.2/24
+# 2001:db8:16:1/64 +----| r2 |----+ 2001:db8:16:2/64
+# .254 +----+ .254
+#
+#
+# Route from h1 to h2 goes through r1, incoming vrf blue has a route to the
+# outgoing vrf red for the n2 network but red doesn't have a route back to n1.
+# Route from h2 to h1 goes through r2.
+#
+# The objective is to check that the incoming vrf routing table is selected
+# to send an ICMP error back to the source when the ttl of a packet reaches 1
+# while it is forwarded between different vrfs.
+#
+# The first test sends a ping with a ttl of 1 from h1 to h2 and parses the
+# output of the command to check that a ttl expired error is received.
+#
+# The second test runs traceroute from h1 to h2 and parses the output to check
+# for a hop on r1.
+#
+# Requires CONFIG_NET_VRF, CONFIG_VETH, CONFIG_BRIDGE and CONFIG_NET_NS.
+
+VERBOSE=0
+PAUSE_ON_FAIL=no
+
+H1_N1_IP=172.16.1.1
+R1_N1_IP=172.16.1.253
+R2_N1_IP=172.16.1.254
+
+H1_N1_IP6=2001:db8:16:1::1
+R1_N1_IP6=2001:db8:16:1::253
+R2_N1_IP6=2001:db8:16:1::254
+
+H2_N2=172.16.2.0/24
+H2_N2_6=2001:db8:16:2::/64
+
+H2_N2_IP=172.16.2.2
+R1_N2_IP=172.16.2.253
+R2_N2_IP=172.16.2.254
+
+H2_N2_IP6=2001:db8:16:2::2
+R1_N2_IP6=2001:db8:16:2::253
+R2_N2_IP6=2001:db8:16:2::254
+
+################################################################################
+# helpers
+
+log_section()
+{
+ echo
+ echo "###########################################################################"
+ echo "$*"
+ echo "###########################################################################"
+ echo
+}
+
+log_test()
+{
+ local rc=$1
+ local expected=$2
+ local msg="$3"
+
+ if [ "${rc}" -eq "${expected}" ]; then
+ printf "TEST: %-60s [ OK ]\n" "${msg}"
+ nsuccess=$((nsuccess+1))
+ else
+ ret=1
+ nfail=$((nfail+1))
+ printf "TEST: %-60s [FAIL]\n" "${msg}"
+ if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
+ echo
+ echo "hit enter to continue, 'q' to quit"
+ read -r a
+ [ "$a" = "q" ] && exit 1
+ fi
+ fi
+}
+
+run_cmd()
+{
+ local cmd="$*"
+ local out
+ local rc
+
+ if [ "$VERBOSE" = "1" ]; then
+ echo "COMMAND: $cmd"
+ fi
+
+ out=$(eval $cmd 2>&1)
+ rc=$?
+ if [ "$VERBOSE" = "1" ] && [ -n "$out" ]; then
+ echo "$out"
+ fi
+
+ [ "$VERBOSE" = "1" ] && echo
+
+ return $rc
+}
+
+run_cmd_grep()
+{
+ local grep_pattern="$1"
+ shift
+ local cmd="$*"
+ local out
+ local rc
+
+ if [ "$VERBOSE" = "1" ]; then
+ echo "COMMAND: $cmd"
+ fi
+
+ out=$(eval $cmd 2>&1)
+ if [ "$VERBOSE" = "1" ] && [ -n "$out" ]; then
+ echo "$out"
+ fi
+
+ echo "$out" | grep -q "$grep_pattern"
+ rc=$?
+
+ [ "$VERBOSE" = "1" ] && echo
+
+ return $rc
+}
+
+################################################################################
+# setup and teardown
+
+cleanup()
+{
+ local ns
+
+ setup=0
+
+ for ns in h1 h2 r1 r2; do
+ ip netns del $ns 2>/dev/null
+ done
+}
+
+setup_vrf()
+{
+ local ns=$1
+
+ ip -netns "${ns}" rule del pref 0
+ ip -netns "${ns}" rule add pref 32765 from all lookup local
+ ip -netns "${ns}" -6 rule del pref 0
+ ip -netns "${ns}" -6 rule add pref 32765 from all lookup local
+}
+
+create_vrf()
+{
+ local ns=$1
+ local vrf=$2
+ local table=$3
+
+ ip -netns "${ns}" link add "${vrf}" type vrf table "${table}"
+ ip -netns "${ns}" link set "${vrf}" up
+ ip -netns "${ns}" route add vrf "${vrf}" unreachable default metric 8192
+ ip -netns "${ns}" -6 route add vrf "${vrf}" unreachable default metric 8192
+
+ ip -netns "${ns}" addr add 127.0.0.1/8 dev "${vrf}"
+ ip -netns "${ns}" -6 addr add ::1 dev "${vrf}" nodad
+}
+
+setup()
+{
+ local ns
+
+ if [ "${setup}" -eq 1 ]; then
+ return 0
+ fi
+
+ # make sure we are starting with a clean slate
+ cleanup
+
+ setup=1
+
+ #
+ # create nodes as namespaces
+ #
+ for ns in h1 h2 r1 r2; do
+ ip netns add $ns
+ ip -netns $ns link set lo up
+
+ case "${ns}" in
+ h[12]) ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=0
+ ip netns exec $ns sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1
+ ;;
+ r[12]) ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1
+ ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1
+ esac
+ done
+
+ #
+ # create interconnects
+ #
+ ip -netns h1 link add eth0 type veth peer name r1h1
+ ip -netns h1 link set r1h1 netns r1 name eth0 up
+
+ ip -netns h1 link add eth1 type veth peer name r2h1
+ ip -netns h1 link set r2h1 netns r2 name eth0 up
+
+ ip -netns h2 link add eth0 type veth peer name r1h2
+ ip -netns h2 link set r1h2 netns r1 name eth1 up
+
+ ip -netns h2 link add eth1 type veth peer name r2h2
+ ip -netns h2 link set r2h2 netns r2 name eth1 up
+
+ #
+ # h1
+ #
+ ip -netns h1 link add br0 type bridge
+ ip -netns h1 link set br0 up
+ ip -netns h1 addr add dev br0 ${H1_N1_IP}/24
+ ip -netns h1 -6 addr add dev br0 ${H1_N1_IP6}/64 nodad
+ ip -netns h1 link set eth0 master br0 up
+ ip -netns h1 link set eth1 master br0 up
+
+ # h1 to h2 via r1
+ ip -netns h1 route add ${H2_N2} via ${R1_N1_IP} dev br0
+ ip -netns h1 -6 route add ${H2_N2_6} via "${R1_N1_IP6}" dev br0
+
+ #
+ # h2
+ #
+ ip -netns h2 link add br0 type bridge
+ ip -netns h2 link set br0 up
+ ip -netns h2 addr add dev br0 ${H2_N2_IP}/24
+ ip -netns h2 -6 addr add dev br0 ${H2_N2_IP6}/64 nodad
+ ip -netns h2 link set eth0 master br0 up
+ ip -netns h2 link set eth1 master br0 up
+
+ ip -netns h2 route add default via ${R2_N2_IP} dev br0
+ ip -netns h2 -6 route add default via ${R2_N2_IP6} dev br0
+
+ #
+ # r1
+ #
+ setup_vrf r1
+ create_vrf r1 blue 1101
+ create_vrf r1 red 1102
+ ip -netns r1 link set mtu 1400 dev eth1
+ ip -netns r1 link set eth0 vrf blue up
+ ip -netns r1 link set eth1 vrf red up
+ ip -netns r1 addr add dev eth0 ${R1_N1_IP}/24
+ ip -netns r1 -6 addr add dev eth0 ${R1_N1_IP6}/64 nodad
+ ip -netns r1 addr add dev eth1 ${R1_N2_IP}/24
+ ip -netns r1 -6 addr add dev eth1 ${R1_N2_IP6}/64 nodad
+
+ # Route leak from blue to red
+ ip -netns r1 route add vrf blue ${H2_N2} dev red
+ ip -netns r1 -6 route add vrf blue ${H2_N2_6} dev red
+
+ #
+ # r2
+ #
+ ip -netns r2 addr add dev eth0 ${R2_N1_IP}/24
+ ip -netns r2 -6 addr add dev eth0 ${R2_N1_IP6}/64 nodad
+ ip -netns r2 addr add dev eth1 ${R2_N2_IP}/24
+ ip -netns r2 -6 addr add dev eth1 ${R2_N2_IP6}/64 nodad
+
+ # Wait for ip config to settle
+ sleep 2
+}
+
+check_connectivity4()
+{
+ ip netns exec h1 ping -c1 -w1 ${H2_N2_IP} >/dev/null 2>&1
+}
+
+check_connectivity6()
+{
+ ip netns exec h1 "${ping6}" -c1 -w1 ${H2_N2_IP6} >/dev/null 2>&1
+}
+
+ipv4_traceroute()
+{
+ log_section "IPv4: VRF ICMP error route lookup traceroute"
+
+ if [ ! -x "$(command -v traceroute)" ]; then
+ echo "SKIP: Could not run IPV4 test without traceroute"
+ return
+ fi
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity4; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ run_cmd_grep "${R1_N1_IP}" ip netns exec h1 traceroute ${H2_N2_IP}
+ log_test $? 0 "Traceroute reports a hop on r1"
+}
+
+ipv6_traceroute()
+{
+ log_section "IPv6: VRF ICMP error route lookup traceroute"
+
+ if [ ! -x "$(command -v traceroute6)" ]; then
+ echo "SKIP: Could not run IPV6 test without traceroute6"
+ return
+ fi
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity6; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ run_cmd_grep "${R1_N1_IP6}" ip netns exec h1 traceroute6 ${H2_N2_IP6}
+ log_test $? 0 "Traceroute6 reports a hop on r1"
+}
+
+ipv4_ping_ttl()
+{
+ log_section "IPv4: VRF ICMP ttl error route lookup ping"
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity4; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ run_cmd_grep "Time to live exceeded" ip netns exec h1 ping -t1 -c1 -W2 ${H2_N2_IP}
+ log_test $? 0 "Ping received ICMP ttl exceeded"
+}
+
+ipv4_ping_frag()
+{
+ log_section "IPv4: VRF ICMP fragmentation error route lookup ping"
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity4; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ run_cmd_grep "Frag needed" ip netns exec h1 ping -s 1450 -Mdo -c1 -W2 ${H2_N2_IP}
+ log_test $? 0 "Ping received ICMP frag needed"
+}
+
+ipv6_ping_ttl()
+{
+ log_section "IPv6: VRF ICMP ttl error route lookup ping"
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity6; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ run_cmd_grep "Time exceeded: Hop limit" ip netns exec h1 "${ping6}" -t1 -c1 -W2 ${H2_N2_IP6}
+ log_test $? 0 "Ping received ICMP hop limit exceeded"
+}
+
+ipv6_ping_frag()
+{
+ log_section "IPv6: VRF ICMP fragmentation error route lookup ping"
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity6; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ run_cmd_grep "Packet too big" ip netns exec h1 "${ping6}" -s 1450 -Mdo -c1 -W2 ${H2_N2_IP6}
+ log_test $? 0 "Ping received ICMP frag needed"
+}
+
+################################################################################
+# usage
+
+usage()
+{
+ cat <<EOF
+usage: ${0##*/} OPTS
+
+ -4 IPv4 tests only
+ -6 IPv6 tests only
+ -p Pause on fail
+ -v verbose mode (show commands and output)
+EOF
+}
+
+################################################################################
+# main
+
+# Some systems don't have a ping6 binary anymore
+command -v ping6 > /dev/null 2>&1 && ping6=$(command -v ping6) || ping6=$(command -v ping)
+
+TESTS_IPV4="ipv4_ping_ttl ipv4_ping_frag ipv4_traceroute"
+TESTS_IPV6="ipv6_ping_ttl ipv6_ping_frag ipv6_traceroute"
+
+ret=0
+nsuccess=0
+nfail=0
+setup=0
+
+while getopts :46t:pvh o
+do
+ case $o in
+ 4) TESTS=ipv4;;
+ 6) TESTS=ipv6;;
+ t) TESTS=$OPTARG;;
+ p) PAUSE_ON_FAIL=yes;;
+ v) VERBOSE=1;;
+ h) usage; exit 0;;
+ *) usage; exit 1;;
+ esac
+done
+
+#
+# show user test config
+#
+if [ -z "$TESTS" ]; then
+ TESTS="$TESTS_IPV4 $TESTS_IPV6"
+elif [ "$TESTS" = "ipv4" ]; then
+ TESTS="$TESTS_IPV4"
+elif [ "$TESTS" = "ipv6" ]; then
+ TESTS="$TESTS_IPV6"
+fi
+
+for t in $TESTS
+do
+ case $t in
+ ipv4_ping_ttl|ping) ipv4_ping_ttl;;&
+ ipv4_ping_frag|ping) ipv4_ping_frag;;&
+ ipv4_traceroute|traceroute) ipv4_traceroute;;&
+
+ ipv6_ping_ttl|ping) ipv6_ping_ttl;;&
+ ipv6_ping_frag|ping) ipv6_ping_frag;;&
+ ipv6_traceroute|traceroute) ipv6_traceroute;;&
+
+ # setup namespaces and config, but do not run any tests
+ setup) setup; exit 0;;
+
+ help) echo "Test names: $TESTS"; exit 0;;
+ esac
+done
+
+cleanup
+
+printf "\nTests passed: %3d\n" ${nsuccess}
+printf "Tests failed: %3d\n" ${nfail}
+
+exit $ret
--
2.17.1

2020-09-21 18:38:29

by David Ahern

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:
> Hi,
>
> Here is an updated series of fixes for ipv4 and ipv6 which ensure
> the route lookup is performed on the right routing table in VRF
> configurations when sending TTL expired icmp errors (useful for
> traceroute).
>
> It includes tests for both ipv4 and ipv6.
>
> These fixes specifically address the code paths involved in
> sending TTL expired icmp errors. As detailed in the individual commit
> messages, those fixes do not address similar issues related to network
> namespaces and unreachable / fragmentation needed messages, which appear
> to use different code paths.
>

New selftests are failing:
TEST: Ping received ICMP frag needed [FAIL]

Both IPv4 and IPv6 versions are failing.

2020-09-21 18:51:41

by Mathieu Desnoyers

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

----- On Sep 21, 2020, at 2:36 PM, David Ahern [email protected] wrote:

> On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:
>> Hi,
>>
>> Here is an updated series of fixes for ipv4 and ipv6 which ensure
>> the route lookup is performed on the right routing table in VRF
>> configurations when sending TTL expired icmp errors (useful for
>> traceroute).
>>
>> It includes tests for both ipv4 and ipv6.
>>
>> These fixes specifically address the code paths involved in
>> sending TTL expired icmp errors. As detailed in the individual commit
>> messages, those fixes do not address similar issues related to network
>> namespaces and unreachable / fragmentation needed messages, which appear
>> to use different code paths.
>>
>
> New selftests are failing:
> TEST: Ping received ICMP frag needed [FAIL]
>
> Both IPv4 and IPv6 versions are failing.

Indeed, this situation is discussed in each patch commit message:

ipv4:

[ It has also been pointed out that a similar issue exists with
unreachable / fragmentation needed messages, which can be triggered by
changing the MTU of eth1 in r1 to 1400 and running:

ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2

Some investigation points to raw_icmp_error() and raw_err() as being
involved in this last scenario. The focus of this patch is TTL expired
ICMP messages, which go through icmp_route_lookup.
Investigation of failure modes related to raw_icmp_error() is beyond
this investigation's scope. ]

ipv6:

[ Testing shows that similar issues exist with ipv6 unreachable /
fragmentation needed messages. However, investigation of this
additional failure mode is beyond this investigation's scope. ]
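
To see why the ping quoted above is expected to trigger the error, here
is a quick size check. It assumes a 20-byte IPv4 header without options,
a 40-byte IPv6 header without extension headers, and the 8-byte ICMP echo
header. The helper name echo_packet_size is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: L3 size of an echo request for a given "ping -s"
# payload. Assumes no IPv4 options and no IPv6 extension headers.
echo_packet_size()
{
	local family=$1 payload=$2
	local icmp_hdr=8 ip_hdr

	case "$family" in
	4) ip_hdr=20 ;;
	6) ip_hdr=40 ;;
	*) return 1 ;;
	esac

	echo $((payload + icmp_hdr + ip_hdr))
}

mtu=1400	# MTU set on eth1 of r1 in the test
for family in 4 6; do
	size=$(echo_packet_size "$family" 1450)
	if [ "$size" -gt "$mtu" ]; then
		echo "IPv${family}: ${size} bytes > MTU ${mtu}, error expected"
	fi
done
```

Under these assumptions the IPv4 packet is 1478 bytes and the IPv6 packet
is 1498 bytes, both above the 1400-byte MTU. With -Mdo the IPv4 packet
carries DF, so r1 cannot fragment it and has to report the error instead.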

I do not have the time to investigate further unfortunately, so I
thought it best to post what I have.

Note that network namespaces also probably have the same problem,
but those are not covered by the test cases.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-09-21 19:13:37

by David Ahern

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

On 9/21/20 12:44 PM, Mathieu Desnoyers wrote:
> ----- On Sep 21, 2020, at 2:36 PM, David Ahern [email protected] wrote:
>
>> On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:
>>> Hi,
>>>
>>> Here is an updated series of fixes for ipv4 and ipv6 which ensure
>>> the route lookup is performed on the right routing table in VRF
>>> configurations when sending TTL expired icmp errors (useful for
>>> traceroute).
>>>
>>> It includes tests for both ipv4 and ipv6.
>>>
>>> These fixes specifically address the code paths involved in
>>> sending TTL expired icmp errors. As detailed in the individual commit
>>> messages, those fixes do not address similar issues related to network
>>> namespaces and unreachable / fragmentation needed messages, which appear
>>> to use different code paths.
>>>
>>
>> New selftests are failing:
>> TEST: Ping received ICMP frag needed [FAIL]
>>
>> Both IPv4 and IPv6 versions are failing.
>
> Indeed, this situation is discussed in each patch commit message:
>
> ipv4:
>
> [ It has also been pointed out that a similar issue exists with
> unreachable / fragmentation needed messages, which can be triggered by
> changing the MTU of eth1 in r1 to 1400 and running:
>
> ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2
>
> Some investigation points to raw_icmp_error() and raw_err() as being
> involved in this last scenario. The focus of this patch is TTL expired
> ICMP messages, which go through icmp_route_lookup.
> Investigation of failure modes related to raw_icmp_error() is beyond
> this investigation's scope. ]
>
> ipv6:
>
> [ Testing shows that similar issues exist with ipv6 unreachable /
> fragmentation needed messages. However, investigation of this
> additional failure mode is beyond this investigation's scope. ]
>
> I do not have the time to investigate further unfortunately, so I
> thought it best to post what I have.
>

the test setup is bad. You have r1 dropping the MTU in VRF red, but not
telling VRF red how to send back the ICMP. e.g., for IPv4 add:

ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue

do the same for v6.
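
A sketch of the suggested return routes for both address families. It
assumes the 2001:db8:16:1::/64 prefix from the script's topology comment
for the IPv6 side, and spells `ro` out as `route`. The function name and
the DRY_RUN switch are illustrative: with DRY_RUN=1 (the default here)
the commands are only printed, so the sketch can be read without root.

```shell
#!/bin/bash
# Hypothetical helper: leak the n1 prefixes into VRF red on r1 so that
# errors generated on the red side have a route back towards h1.
DRY_RUN=${DRY_RUN:-1}

add_return_routes()
{
	local n1=172.16.1.0/24
	local n1_6=2001:db8:16:1::/64
	local run=

	# In dry-run mode, prefix every command with echo instead of running it.
	[ "$DRY_RUN" = "1" ] && run=echo

	$run ip -netns r1 route add vrf red "$n1" dev blue
	$run ip -netns r1 -6 route add vrf red "$n1_6" dev blue
}

add_return_routes
```

These mirror the existing "Route leak from blue to red" lines in setup(),
with the VRF names and prefixes swapped.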

Also, I do not see a reason for r2; I suggest dropping it. What you are
testing is icmp crossing VRF with route leaking, so there should not be
a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and
the return via r2).

2020-09-21 19:37:18

by Mathieu Desnoyers

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

----- On Sep 21, 2020, at 3:11 PM, David Ahern [email protected] wrote:

> On 9/21/20 12:44 PM, Mathieu Desnoyers wrote:
>> ----- On Sep 21, 2020, at 2:36 PM, David Ahern [email protected] wrote:
>>
>>> On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:
>>>> Hi,
>>>>
>>>> Here is an updated series of fixes for ipv4 and ipv6 which ensure
>>>> the route lookup is performed on the right routing table in VRF
>>>> configurations when sending TTL expired icmp errors (useful for
>>>> traceroute).
>>>>
>>>> It includes tests for both ipv4 and ipv6.
>>>>
>>>> These fixes specifically address the code paths involved in
>>>> sending TTL expired icmp errors. As detailed in the individual commit
>>>> messages, those fixes do not address similar issues related to network
>>>> namespaces and unreachable / fragmentation needed messages, which appear
>>>> to use different code paths.
>>>>
>>>
>>> New selftests are failing:
>>> TEST: Ping received ICMP frag needed [FAIL]
>>>
>>> Both IPv4 and IPv6 versions are failing.
>>
>> Indeed, this situation is discussed in each patch commit message:
>>
>> ipv4:
>>
>> [ It has also been pointed out that a similar issue exists with
>> unreachable / fragmentation needed messages, which can be triggered by
>> changing the MTU of eth1 in r1 to 1400 and running:
>>
>> ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2
>>
>> Some investigation points to raw_icmp_error() and raw_err() as being
>> involved in this last scenario. The focus of this patch is TTL expired
>> ICMP messages, which go through icmp_route_lookup.
>> Investigation of failure modes related to raw_icmp_error() is beyond
>> this investigation's scope. ]
>>
>> ipv6:
>>
>> [ Testing shows that similar issues exist with ipv6 unreachable /
>> fragmentation needed messages. However, investigation of this
>> additional failure mode is beyond this investigation's scope. ]
>>
>> I do not have the time to investigate further unfortunately, so I
>> thought it best to post what I have.
>>
>
> the test setup is bad. You have r1 dropping the MTU in VRF red, but not
> telling VRF red how to send back the ICMP. e.g., for IPv4 add:
>
> ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue
>
> do the same for v6.
>
> Also, I do not see a reason for r2; I suggest dropping it. What you are
> testing is icmp crossing VRF with route leaking, so there should not be
> a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and
> the return via r2).

CCing Michael Jeanson, author of the selftests patch.

Thanks for your feedback,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-09-22 14:02:59

by Michael Jeanson

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

----- On 21 Sep, 2020, at 15:33, Mathieu Desnoyers [email protected] wrote:

> ----- On Sep 21, 2020, at 3:11 PM, David Ahern [email protected] wrote:
>
>> On 9/21/20 12:44 PM, Mathieu Desnoyers wrote:
>>> ----- On Sep 21, 2020, at 2:36 PM, David Ahern [email protected] wrote:
>>>
>>>> On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:
>>>>> Hi,
>>>>>
>>>>> Here is an updated series of fixes for ipv4 and ipv6 which ensure
>>>>> the route lookup is performed on the right routing table in VRF
>>>>> configurations when sending TTL expired icmp errors (useful for
>>>>> traceroute).
>>>>>
>>>>> It includes tests for both ipv4 and ipv6.
>>>>>
>>>>> These fixes specifically address the code paths involved in
>>>>> sending TTL expired icmp errors. As detailed in the individual commit
>>>>> messages, those fixes do not address similar issues related to network
>>>>> namespaces and unreachable / fragmentation needed messages, which appear
>>>>> to use different code paths.
>>>>>
>>>>
>>>> New selftests are failing:
>>>> TEST: Ping received ICMP frag needed [FAIL]
>>>>
>>>> Both IPv4 and IPv6 versions are failing.
>>>
>>> Indeed, this situation is discussed in each patch commit message:
>>>
>>> ipv4:
>>>
>>> [ It has also been pointed out that a similar issue exists with
>>> unreachable / fragmentation needed messages, which can be triggered by
>>> changing the MTU of eth1 in r1 to 1400 and running:
>>>
>>> ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2
>>>
>>> Some investigation points to raw_icmp_error() and raw_err() as being
>>> involved in this last scenario. The focus of this patch is TTL expired
>>> ICMP messages, which go through icmp_route_lookup.
>>> Investigation of failure modes related to raw_icmp_error() is beyond
>>> this investigation's scope. ]
>>>
>>> ipv6:
>>>
>>> [ Testing shows that similar issues exist with ipv6 unreachable /
>>> fragmentation needed messages. However, investigation of this
>>> additional failure mode is beyond this investigation's scope. ]
>>>
>>> I do not have the time to investigate further unfortunately, so I
>>> thought it best to post what I have.
>>>
>>
>> the test setup is bad. You have r1 dropping the MTU in VRF red, but not
>> telling VRF red how to send back the ICMP. e.g., for IPv4 add:
>>
>> ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue
>>
>> do the same for v6.
>>
>> Also, I do not see a reason for r2; I suggest dropping it. What you are
>> testing is icmp crossing VRF with route leaking, so there should not be
>> a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and
>> the return via r2).

The objective of the test was to replicate a client's environment where
packets cross from a VRF which has a route back to the source to one
which doesn't, while reaching a ttl of 0. If the route lookup for the
icmp error is done on the interface in the first VRF, the error can be
routed to the source. If it is done on the interface in the second VRF,
it cannot, and that is the current behaviour for icmp errors generated
while crossing between VRFs.

There may be a better test case that doesn't involve asymmetric routing
to test this but it's the only way I found to replicate this.

2020-09-23 02:46:19

by David Ahern

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

On 9/22/20 7:52 AM, Michael Jeanson wrote:
>>>
>>> the test setup is bad. You have r1 dropping the MTU in VRF red, but not
>>> telling VRF red how to send back the ICMP. e.g., for IPv4 add:
>>>
>>> ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue
>>>
>>> do the same for v6.
>>>
>>> Also, I do not see a reason for r2; I suggest dropping it. What you are
>>> testing is icmp crossing VRF with route leaking, so there should not be
>>> a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and
>>> the return via r2).
>
> The objective of the test was to replicate a client's environment where
> packets cross from a VRF which has a route back to the source to one
> which doesn't, while reaching a ttl of 0. If the route lookup for the
> icmp error is done on the interface in the first VRF, the error can be
> routed to the source. If it is done on the interface in the second VRF,
> it cannot, and that is the current behaviour for icmp errors generated
> while crossing between VRFs.
>
> There may be a better test case that doesn't involve asymmetric routing
> to test this but it's the only way I found to replicate this.
>

It should work without asymmetric routing; adding the return route to
the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It
should work for TTL as well.

Adding a second pass on the tests with the return through r2 is fine,
but add a first pass for the more typical case.

2020-09-23 16:06:29

by Michael Jeanson

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

On 2020-09-22 21 h 59, David Ahern wrote:
> On 9/22/20 7:52 AM, Michael Jeanson wrote:
>>>>
>>>> the test setup is bad. You have r1 dropping the MTU in VRF red, but not
>>>> telling VRF red how to send back the ICMP. e.g., for IPv4 add:
>>>>
>>>> ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue
>>>>
>>>> do the same for v6.
>>>>
>>>> Also, I do not see a reason for r2; I suggest dropping it. What you are
>>>> testing is icmp crossing VRF with route leaking, so there should not be
>>>> a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and
>>>> the return via r2).
>>
>> The objective of the test was to replicate a client's environment where
>> packets cross from a VRF which has a route back to the source to one
>> which doesn't, while reaching a ttl of 0. If the route lookup for the
>> icmp error is done on the interface in the first VRF, the error can be
>> routed to the source. If it is done on the interface in the second VRF,
>> it cannot, and that is the current behaviour for icmp errors generated
>> while crossing between VRFs.
>>
>> There may be a better test case that doesn't involve asymmetric routing
>> to test this but it's the only way I found to replicate this.
>>
>
> It should work without asymmetric routing; adding the return route to
> the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It
> should work for TTL as well.
>
> Adding a second pass on the tests with the return through r2 is fine,
> but add a first pass for the more typical case.

Hi,

Before writing new tests I just want to make sure we are trying to fix
the same issue. If I add a return route to the red VRF then we don't
need this patchset because whether the ICMP errors are routed using the
table from the source or destination interface they will reach the
source host.

The issue for which this patchset was sent only happens when the
destination interface's VRF doesn't have a route back to the source
host. I guess we might question if this is actually a bug or not.

So the question really is, when a packet is forwarded between VRFs
through route leaking and an icmp error is generated, which table should
be used for the route lookup? And does it depend on the type of icmp
error? (e.g. TTL=1 happens before forwarding, but fragmentation needed
happens afterwards, once the packet is on the destination interface)
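
The question can be made concrete with a toy model of r1's two tables as
the test configures them: blue holds the connected n1 prefix plus the
leaked n2 route, red holds only the connected n2 prefix. This is plain
shell bookkeeping, not a kernel lookup, and the function names are made
up for illustration.

```shell
#!/bin/bash
# Toy model: prefixes reachable from each VRF table on r1 in the test setup.
# blue: connected n1 + route leaked to red; red: connected n2 only.
vrf_routes()
{
	case "$1" in
	blue) echo "172.16.1.0/24 172.16.2.0/24" ;;
	red)  echo "172.16.2.0/24" ;;
	esac
}

# Succeeds if table $1 lists prefix $2 (exact match, no longest-prefix match).
table_has_route()
{
	local table=$1 prefix=$2 p

	for p in $(vrf_routes "$table"); do
		[ "$p" = "$prefix" ] && return 0
	done
	return 1
}

# An ICMP error for a packet sent by h1 must be routed towards n1.
for table in blue red; do
	if table_has_route "$table" 172.16.1.0/24; then
		echo "lookup in $table: error reaches h1"
	else
		echo "lookup in $table: error is dropped"
	fi
done
```

In this model only a lookup in blue, the table of the incoming interface,
finds a way back to h1, which is the case the patchset is about.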

Cheers,

Michael

2020-09-23 17:05:12

by Michael Jeanson

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

On 2020-09-23 12 h 04, Michael Jeanson wrote:
>> It should work without asymmetric routing; adding the return route to
>> the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It
>> should work for TTL as well.
>>
>> Adding a second pass on the tests with the return through r2 is fine,
>> but add a first pass for the more typical case.
>
> Hi,
>
> Before writing new tests I just want to make sure we are trying to fix
> the same issue. If I add a return route to the red VRF then we don't
> need this patchset because whether the ICMP errors are routed using the
> table from the source or destination interface they will reach the
> source host.
>
> The issue for which this patchset was sent only happens when the
> destination interface's VRF doesn't have a route back to the source
> host. I guess we might question if this is actually a bug or not.
>
> So the question really is, when a packet is forwarded between VRFs
> through route leaking and an icmp error is generated, which table should
> be used for the route lookup? And does it depend on the type of icmp
> error? (e.g. TTL=1 happens before forwarding, but fragmentation needed
> happens afterwards, once the packet is on the destination interface)

As a side note, I don't mind reworking the tests as you requested even
if the patchset as a whole ends up not being needed and if you think
they are still useful. I just wanted to make sure we understood each other.

Cheers,

Michael

2020-09-23 18:48:15

by David Ahern

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

On 9/23/20 11:03 AM, Michael Jeanson wrote:
> On 2020-09-23 12 h 04, Michael Jeanson wrote:
>>> It should work without asymmetric routing; adding the return route to
>>> the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It
>>> should work for TTL as well.
>>>
>>> Adding a second pass on the tests with the return through r2 is fine,
>>> but add a first pass for the more typical case.
>>
>> Hi,
>>
>> Before writing new tests I just want to make sure we are trying to fix
>> the same issue. If I add a return route to the red VRF then we don't
>> need this patchset because whether the ICMP errors are routed using the
>> table from the source or destination interface they will reach the
>> source host.
>>
>> The issue for which this patchset was sent only happens when the
>> destination interface's VRF doesn't have a route back to the source
>> host. I guess we might question if this is actually a bug or not.
>>
>> So the question really is, when a packet is forwarded between VRFs
>> through route leaking and an icmp error is generated, which table
>> should be used for the route lookup? And does it depend on the type of
>> icmp error? (e.g. TTL=1 happens before forwarding, but fragmentation
>> needed happens afterwards, once on the destination interface)
>
> As a side note, I don't mind reworking the tests as you requested even
> if the patchset as a whole ends up not being needed and if you think
> they are still useful. I just wanted to make sure we understood each other.
>

If you are leaking from VRF 1 to VRF 2 and you do not configure VRF 2
with how to send errors back to the source - MTU or TTL - then I will
argue that is a configuration problem, not a bug.

Now the TTL problem is interesting. You need the FIB lookup to know that
the packet is forwarded, and by the time of the ttl check in ip_forward
skb->dev points to the ingress VRF and dst points to the egress VRF. So
I think the change is warranted.

Let's do this for the tests:
Do one pass through all of the tests (TTL and MTU, v4 and v6) with
symmetric routing configured and make sure they all pass, i.e., keep all
of them. No sense losing the tests and the thoughts behind them.

Add a second pass with the asymmetric routing per the customer setup
since it motivated the investigation.

Rename the test to something like vrf_route_leaking.sh. It can be
expanded with more tests related to route leaking as they come up.
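
A minimal sketch of how the two passes could differ, assuming two VRFs
named blue (source side) and red (destination side) and illustrative
172.16.x.0/24 prefixes; none of these names or addresses come from this
thread. The function only prints the ip commands instead of running them,
so it needs neither root nor network namespaces:

```shell
#!/bin/sh
# Hypothetical helper: print the route-leaking commands for one test pass.
# VRF names (blue, red) and prefixes are assumptions made for illustration.
leak_routes() {
    mode="$1"   # "symmetric" or "asymmetric"

    # Forward leak: traffic entering via VRF blue reaches the red-side prefix.
    echo "ip route add vrf blue 172.16.2.0/24 dev red"

    # Return leak: only the symmetric pass gives VRF red a route back to
    # the source prefix. The asymmetric pass deliberately omits it.
    if [ "$mode" = "symmetric" ]; then
        echo "ip route add vrf red 172.16.1.0/24 dev blue"
    fi
}

echo "# symmetric pass"
leak_routes symmetric
echo "# asymmetric pass"
leak_routes asymmetric
```

In the asymmetric pass, an ICMP error looked up in the egress VRF's table
has no route back to the source, which is why only the TTL case is
expected to work there.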

2020-09-23 19:14:54

by Michael Jeanson

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

On 2020-09-23 14 h 46, David Ahern wrote:
> On 9/23/20 11:03 AM, Michael Jeanson wrote:
>> On 2020-09-23 12 h 04, Michael Jeanson wrote:
>>>> It should work without asymmetric routing; adding the return route to
>>>> the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It
>>>> should work for TTL as well.
>>>>
>>>> Adding a second pass on the tests with the return through r2 is fine,
>>>> but add a first pass for the more typical case.
>>>
>>> Hi,
>>>
>>> Before writing new tests I just want to make sure we are trying to fix
>>> the same issue. If I add a return route to the red VRF then we don't
>>> need this patchset because whether the ICMP errors are routed using the
>>> table from the source or destination interface they will reach the
>>> source host.
>>>
>>> The issue for which this patchset was sent only happens when the
>>> destination interface's VRF doesn't have a route back to the source
>>> host. I guess we might question if this is actually a bug or not.
>>>
>>> So the question really is, when a packet is forwarded between VRFs
>>> through route leaking and an icmp error is generated, which table
>>> should be used for the route lookup? And does it depend on the type of
>>> icmp error? (e.g. TTL=1 happens before forwarding, but fragmentation
>>> needed happens afterwards, once on the destination interface)
>>
>> As a side note, I don't mind reworking the tests as you requested even
>> if the patchset as a whole ends up not being needed and if you think
>> they are still useful. I just wanted to make sure we understood each other.
>>
>
> If you are leaking from VRF 1 to VRF 2 and you do not configure VRF 2
> with how to send errors back to the source - MTU or TTL - then I will
> argue that is a configuration problem, not a bug.
>
> Now the TTL problem is interesting. You need the FIB lookup to know that
> the packet is forwarded, and by the time of the ttl check in ip_forward
> skb->dev points to the ingress VRF and dst points to the egress VRF. So
> I think the change is warranted.
>
> Let's do this for the tests:
> Do one pass through all of the tests (TTL and MTU, v4 and v6) with
> symmetric routing configured and make sure they all pass, i.e., keep all
> of them. No sense losing the tests and the thoughts behind them.
>
> Add a second pass with the asymmetric routing per the customer setup
> since it motivated the investigation.
>
> Rename the test to something like vrf_route_leaking.sh. It can be
> expanded with more tests related to route leaking as they come up.
>

Just a final clarification, the asymmetric setup would have no return
route in VRF 2 and only test the TTL case since the others would fail?

2020-09-23 20:06:01

by David Ahern

Subject: Re: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes

On 9/23/20 1:12 PM, Michael Jeanson wrote:
>
> Just a final clarification, the asymmetric setup would have no return
> route in VRF 2 and only test the TTL case since the others would fail?

Correct. Add a statement about it representing a customer setup so it is
clear such a config is a one-off.