2020-08-11 19:51:11

by Mathieu Desnoyers

Subject: [PATCH 0/3] l3mdev icmp error route lookup fixes

Hi,

Here is a series of fixes for ipv4 and ipv6 which ensure the route
lookup is performed on the right routing table in VRF configurations.

It includes a test for both ipv4 and ipv6.

The series has been rebased on the net tree.

Thanks,

Mathieu

Mathieu Desnoyers (2):
ipv4/icmp: l3mdev: Perform icmp error route lookup on source device
routing table
ipv6/icmp: l3mdev: Perform icmp error route lookup on source device
routing table

Michael Jeanson (1):
selftests: Add VRF icmp error route lookup test

net/ipv4/icmp.c | 15 +-
net/ipv6/icmp.c | 15 +-
net/ipv6/ip6_output.c | 2 -
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/vrf_icmp_error_route.sh | 429 ++++++++++++++++++
5 files changed, 456 insertions(+), 6 deletions(-)
create mode 100755 tools/testing/selftests/net/vrf_icmp_error_route.sh

--
2.17.1


2020-08-11 19:51:21

by Mathieu Desnoyers

Subject: [PATCH 1/3] selftests: Add VRF icmp error route lookup test

From: Michael Jeanson <[email protected]>

The objective is to check that the incoming vrf routing table is selected
to send an ICMP error back to the source when the ttl of a packet reaches 1
while it is forwarded between different vrfs.

The first test sends a ping with a ttl of 1 from h1 to h2 and parses the
output of the command to check that a ttl expired error is received.

[This may be flaky, I'm open to suggestions of a more robust approach.]

The second test runs traceroute from h1 to h2 and parses the output to
check for a hop on r1.

Signed-off-by: Michael Jeanson <[email protected]>
Cc: David Ahern <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: [email protected]
---
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/vrf_icmp_error_route.sh | 429 ++++++++++++++++++
2 files changed, 430 insertions(+)
create mode 100755 tools/testing/selftests/net/vrf_icmp_error_route.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 9491bbaa0831..a716fbf780b3 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -19,6 +19,7 @@ TEST_PROGS += txtimestamp.sh
TEST_PROGS += vrf-xfrm-tests.sh
TEST_PROGS += rxtimestamp.sh
TEST_PROGS += devlink_port_split.py
+TEST_PROGS += vrf_icmp_error_route.sh
TEST_PROGS_EXTENDED := in_netns.sh
TEST_GEN_FILES = socket nettest
TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
diff --git a/tools/testing/selftests/net/vrf_icmp_error_route.sh b/tools/testing/selftests/net/vrf_icmp_error_route.sh
new file mode 100755
index 000000000000..0b15a886bf5b
--- /dev/null
+++ b/tools/testing/selftests/net/vrf_icmp_error_route.sh
@@ -0,0 +1,429 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2019 David Ahern <[email protected]>. All rights reserved.
+# Copyright (c) 2020 Michael Jeanson <[email protected]>. All rights reserved.
+#
+# blue red
+# .253 +----+ .253
+# +----| r1 |----+
+# | +----+ |
+# +----+ | | +----+
+# | h1 |--------------+ +--------------| h2 |
+# +----+ .1 | | .2 +----+
+# 172.16.1/24 | +----+ | 172.16.2/24
+# 2001:db8:16:1/64 +----| r2 |----+ 2001:db8:16:2/64
+# .254 +----+ .254
+#
+#
+# Route from h1 to h2 goes through r1, incoming vrf blue has a route to the
+# outgoing vrf red for the n2 network but red doesn't have a route back to n1.
+# Route from h2 to h1 goes through r2.
+#
+# The objective is to check that the incoming vrf routing table is selected
+# to send an ICMP error back to the source when the ttl of a packet reaches 1
+# while it is forwarded between different vrfs.
+#
+# The first test sends a ping with a ttl of 1 from h1 to h2 and parses the
+# output of the command to check that a ttl expired error is received.
+#
+# The second test runs traceroute from h1 to h2 and parses the output to check
+# for a hop on r1.
+#
+# Requires CONFIG_NET_VRF, CONFIG_VETH, CONFIG_BRIDGE and CONFIG_NET_NS.
+
+VERBOSE=0
+PAUSE_ON_FAIL=no
+
+H1_N1_IP=172.16.1.1
+R1_N1_IP=172.16.1.253
+R2_N1_IP=172.16.1.254
+
+H1_N1_IP6=2001:db8:16:1::1
+R1_N1_IP6=2001:db8:16:1::253
+R2_N1_IP6=2001:db8:16:1::254
+
+H2_N2=172.16.2.0/24
+H2_N2_6=2001:db8:16:2::/64
+
+H2_N2_IP=172.16.2.2
+R1_N2_IP=172.16.2.253
+R2_N2_IP=172.16.2.254
+
+H2_N2_IP6=2001:db8:16:2::2
+R1_N2_IP6=2001:db8:16:2::253
+R2_N2_IP6=2001:db8:16:2::254
+
+################################################################################
+# helpers
+
+log_section()
+{
+ echo
+ echo "###########################################################################"
+ echo "$*"
+ echo "###########################################################################"
+ echo
+}
+
+log_test()
+{
+ local rc=$1
+ local expected=$2
+ local msg="$3"
+
+ if [ "${rc}" -eq "${expected}" ]; then
+ printf "TEST: %-60s [ OK ]\n" "${msg}"
+ nsuccess=$((nsuccess+1))
+ else
+ ret=1
+ nfail=$((nfail+1))
+ printf "TEST: %-60s [FAIL]\n" "${msg}"
+ if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
+ echo
+ echo "hit enter to continue, 'q' to quit"
+ read -r a
+ [ "$a" = "q" ] && exit 1
+ fi
+ fi
+}
+
+run_cmd()
+{
+ local cmd="$*"
+ local out
+ local rc
+
+ if [ "$VERBOSE" = "1" ]; then
+ echo "COMMAND: $cmd"
+ fi
+
+ out=$(eval $cmd 2>&1)
+ rc=$?
+ if [ "$VERBOSE" = "1" ] && [ -n "$out" ]; then
+ echo "$out"
+ fi
+
+ [ "$VERBOSE" = "1" ] && echo
+
+ return $rc
+}
+
+################################################################################
+# setup and teardown
+
+cleanup()
+{
+ local ns
+
+ setup=0
+
+ for ns in h1 h2 r1 r2; do
+ ip netns del $ns 2>/dev/null
+ done
+}
+
+setup_vrf()
+{
+ local ns=$1
+
+ ip -netns "${ns}" ru del pref 0
+ ip -netns "${ns}" ru add pref 32765 from all lookup local
+ ip -netns "${ns}" -6 ru del pref 0
+ ip -netns "${ns}" -6 ru add pref 32765 from all lookup local
+}
+
+create_vrf()
+{
+ local ns=$1
+ local vrf=$2
+ local table=$3
+
+ ip -netns "${ns}" link add "${vrf}" type vrf table "${table}"
+ ip -netns "${ns}" link set "${vrf}" up
+ ip -netns "${ns}" route add vrf "${vrf}" unreachable default metric 8192
+ ip -netns "${ns}" -6 route add vrf "${vrf}" unreachable default metric 8192
+
+ ip -netns "${ns}" addr add 127.0.0.1/8 dev "${vrf}"
+ ip -netns "${ns}" -6 addr add ::1 dev "${vrf}" nodad
+}
+
+setup()
+{
+ local ns
+
+ if [ "${setup}" -eq 1 ]; then
+ return 0
+ fi
+
+ # make sure we are starting with a clean slate
+ cleanup
+
+ setup=1
+
+ #
+ # create nodes as namespaces
+ #
+ for ns in h1 h2 r1 r2; do
+ ip netns add $ns
+ ip -netns $ns li set lo up
+
+ case "${ns}" in
+ h[12]) ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=0
+ ip netns exec $ns sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1
+ ;;
+ r[12]) ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1
+ ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1
+ esac
+ done
+
+ #
+ # create interconnects
+ #
+ ip -netns h1 li add eth0 type veth peer name r1h1
+ ip -netns h1 li set r1h1 netns r1 name eth0 up
+
+ ip -netns h1 li add eth1 type veth peer name r2h1
+ ip -netns h1 li set r2h1 netns r2 name eth0 up
+
+ ip -netns h2 li add eth0 type veth peer name r1h2
+ ip -netns h2 li set r1h2 netns r1 name eth1 up
+
+ ip -netns h2 li add eth1 type veth peer name r2h2
+ ip -netns h2 li set r2h2 netns r2 name eth1 up
+
+ #
+ # h1
+ #
+ ip -netns h1 li add br0 type bridge
+ ip -netns h1 li set br0 up
+ ip -netns h1 addr add dev br0 ${H1_N1_IP}/24
+ ip -netns h1 -6 addr add dev br0 ${H1_N1_IP6}/64 nodad
+ ip -netns h1 li set eth0 master br0 up
+ ip -netns h1 li set eth1 master br0 up
+
+ # h1 to h2 via r1
+ ip -netns h1 ro add ${H2_N2} via ${R1_N1_IP} dev br0
+ ip -netns h1 -6 ro add ${H2_N2_6} via "${R1_N1_IP6}" dev br0
+
+ #
+ # h2
+ #
+ ip -netns h2 li add br0 type bridge
+ ip -netns h2 li set br0 up
+ ip -netns h2 addr add dev br0 ${H2_N2_IP}/24
+ ip -netns h2 -6 addr add dev br0 ${H2_N2_IP6}/64 nodad
+ ip -netns h2 li set eth0 master br0 up
+ ip -netns h2 li set eth1 master br0 up
+
+ ip -netns h2 ro add default via ${R2_N2_IP} dev br0
+ ip -netns h2 -6 ro add default via ${R2_N2_IP6} dev br0
+
+ #
+ # r1
+ #
+ setup_vrf r1
+ create_vrf r1 blue 1101
+ create_vrf r1 red 1102
+ ip -netns r1 li set eth0 vrf blue up
+ ip -netns r1 li set eth1 vrf red up
+ ip -netns r1 addr add dev eth0 ${R1_N1_IP}/24
+ ip -netns r1 -6 addr add dev eth0 ${R1_N1_IP6}/64 nodad
+ ip -netns r1 addr add dev eth1 ${R1_N2_IP}/24
+ ip -netns r1 -6 addr add dev eth1 ${R1_N2_IP6}/64 nodad
+
+ # Route leak from blue to red
+ ip -netns r1 route add vrf blue ${H2_N2} dev red
+ ip -netns r1 -6 route add vrf blue ${H2_N2_6} dev red
+
+ #
+ # r2
+ #
+ ip -netns r2 addr add dev eth0 ${R2_N1_IP}/24
+ ip -netns r2 -6 addr add dev eth0 ${R2_N1_IP6}/64 nodad
+ ip -netns r2 addr add dev eth1 ${R2_N2_IP}/24
+ ip -netns r2 -6 addr add dev eth1 ${R2_N2_IP6}/64 nodad
+
+ # Wait for ip config to settle
+ sleep 2
+}
+
+check_connectivity4()
+{
+ ip netns exec h1 ping -c1 -w1 ${H2_N2_IP} >/dev/null 2>&1
+}
+
+check_connectivity6()
+{
+ ip netns exec h1 "${ping6}" -c1 -w1 ${H2_N2_IP6} >/dev/null 2>&1
+}
+
+ipv4_traceroute()
+{
+ log_section "IPv4: VRF ICMP error route lookup traceroute"
+
+ if [ ! -x "$(command -v traceroute)" ]; then
+ echo "SKIP: Could not run IPV4 test without traceroute"
+ return
+ fi
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity4; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ if [ "$VERBOSE" = "1" ]; then
+ run_cmd ip netns exec h1 traceroute ${H2_N2_IP}
+ fi
+
+ ip netns exec h1 traceroute ${H2_N2_IP} | grep -q "${R1_N1_IP}"
+ log_test $? 0 "Traceroute reports a hop on r1"
+}
+
+ipv6_traceroute()
+{
+ log_section "IPv6: VRF ICMP error route lookup traceroute"
+
+ if [ ! -x "$(command -v traceroute6)" ]; then
+ echo "SKIP: Could not run IPV6 test without traceroute6"
+ return
+ fi
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity6; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ if [ "$VERBOSE" = "1" ]; then
+ run_cmd ip netns exec h1 traceroute6 ${H2_N2_IP6}
+ fi
+
+ ip netns exec h1 traceroute6 ${H2_N2_IP6} | grep -q "${R1_N1_IP6}"
+ log_test $? 0 "Traceroute6 reports a hop on r1"
+}
+
+ipv4_ping()
+{
+ log_section "IPv4: VRF ICMP error route lookup ping"
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity4; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ if [ "$VERBOSE" = "1" ]; then
+ echo "Command to check for ICMP ttl exceeded:"
+ run_cmd ip netns exec h1 ping -t1 -c1 -W2 ${H2_N2_IP}
+ fi
+
+ ip netns exec h1 ping -t1 -c1 -W2 ${H2_N2_IP} | grep -q "Time to live exceeded"
+ log_test $? 0 "Ping received ICMP ttl exceeded"
+}
+
+ipv6_ping()
+{
+ log_section "IPv6: VRF ICMP error route lookup ping"
+
+ setup
+
+ # verify connectivity
+ if ! check_connectivity6; then
+ echo "Error: Basic connectivity is broken"
+ ret=1
+ return
+ fi
+
+ if [ "$VERBOSE" = "1" ]; then
+ echo "Command to check for ICMP ttl exceeded:"
+ run_cmd ip netns exec h1 "${ping6}" -t1 -c1 -W2 ${H2_N2_IP6}
+ fi
+
+ ip netns exec h1 "${ping6}" -t1 -c1 -W2 ${H2_N2_IP6} | grep -q "Time exceeded: Hop limit"
+ log_test $? 0 "Ping received ICMP ttl exceeded"
+}
+################################################################################
+# usage
+
+usage()
+{
+ cat <<EOF
+usage: ${0##*/} OPTS
+
+ -4 IPv4 tests only
+ -6 IPv6 tests only
+ -p Pause on fail
+ -v verbose mode (show commands and output)
+EOF
+}
+
+################################################################################
+# main
+
+# Some systems don't have a ping6 binary anymore
+command -v ping6 > /dev/null 2>&1 && ping6=$(command -v ping6) || ping6=$(command -v ping)
+
+TESTS_IPV4="ipv4_ping ipv4_traceroute"
+TESTS_IPV6="ipv6_ping ipv6_traceroute"
+
+ret=0
+nsuccess=0
+nfail=0
+setup=0
+
+while getopts :46pvh o
+do
+ case $o in
+ 4) TESTS=ipv4;;
+ 6) TESTS=ipv6;;
+ p) PAUSE_ON_FAIL=yes;;
+ v) VERBOSE=1;;
+ h) usage; exit 0;;
+ *) usage; exit 1;;
+ esac
+done
+
+#
+# show user test config
+#
+if [ -z "$TESTS" ]; then
+ TESTS="$TESTS_IPV4 $TESTS_IPV6"
+elif [ "$TESTS" = "ipv4" ]; then
+ TESTS="$TESTS_IPV4"
+elif [ "$TESTS" = "ipv6" ]; then
+ TESTS="$TESTS_IPV6"
+fi
+
+for t in $TESTS
+do
+ case $t in
+ ipv4_ping|ping) ipv4_ping;;
+ ipv4_traceroute|traceroute) ipv4_traceroute;;
+
+ ipv6_ping|ping) ipv6_ping;;
+ ipv6_traceroute|traceroute) ipv6_traceroute;;
+
+ # setup namespaces and config, but do not run any tests
+ setup) setup; exit 0;;
+
+ help) echo "Test names: $TESTS"; exit 0;;
+ esac
+done
+
+cleanup
+
+printf "\nTests passed: %3d\n" ${nsuccess}
+printf "Tests failed: %3d\n" ${nfail}
+
+exit $ret
--
2.17.1

2020-08-11 19:51:25

by Mathieu Desnoyers

Subject: [PATCH 3/3] ipv6/icmp: l3mdev: Perform icmp error route lookup on source device routing table

As per RFC4443, the destination address field for ICMPv6 error messages
is copied from the source address field of the invoking packet.

In configurations with Virtual Routing and Forwarding tables, looking up
which routing table to use for sending ICMPv6 error messages is
currently done by using the destination net_device.

If the source and destination interfaces are within separate VRFs, or
one in the global routing table and the other in a VRF, looking up the
source address of the invoking packet in the destination interface's
routing table will fail if the destination interface's routing table
contains no route to the invoking packet's source address.

One observable effect of this issue is that traceroute6 does not work in
the following cases:

- Route leaking between global routing table and VRF
- Route leaking between VRFs

Preferably use the source device routing table when sending ICMPv6 error
messages. If no source device is set, fall-back on the destination
device routing table.

Link: https://tools.ietf.org/html/rfc4443
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: David Ahern <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: [email protected]
---
net/ipv6/icmp.c | 15 +++++++++++++--
net/ipv6/ip6_output.c | 2 --
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index a4e4912ad607..a971b58b0371 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -501,8 +501,19 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
if (__ipv6_addr_needs_scope_id(addr_type)) {
iif = icmp6_iif(skb);
} else {
- dst = skb_dst(skb);
- iif = l3mdev_master_ifindex(dst ? dst->dev : skb->dev);
+ struct net_device *route_lookup_dev = NULL;
+
+ /*
+ * The device used for looking up which routing table to use is
+ * preferably the source whenever it is set, which should
+ * ensure the icmp error can be sent to the source host, else
+ * fallback on the destination device.
+ */
+ if (skb->dev)
+ route_lookup_dev = skb->dev;
+ else if (skb_dst(skb))
+ route_lookup_dev = skb_dst(skb)->dev;
+ iif = l3mdev_master_ifindex(route_lookup_dev);
}

/*
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index c78e67d7747f..cd623068de53 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -468,8 +468,6 @@ int ip6_forward(struct sk_buff *skb)
* check and decrement ttl
*/
if (hdr->hop_limit <= 1) {
- /* Force OUTPUT device used as source address */
- skb->dev = dst->dev;
icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, 0);
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);

--
2.17.1

2020-08-11 19:52:28

by Mathieu Desnoyers

Subject: [PATCH 2/3] ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table

As per RFC792, ICMP errors should be sent to the source host.

However, in configurations with Virtual Routing and Forwarding tables,
looking up which routing table to use is currently done by using the
destination net_device.

commit 9d1a6c4ea43e ("net: icmp_route_lookup should use rt dev to
determine L3 domain") changes the interface passed to
l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
to skb_dst(skb_in)->dev. This effectively uses the destination device
rather than the source device for choosing which routing table should be
used to lookup where to send the ICMP error.

Therefore, if the source and destination interfaces are within separate
VRFs, or one in the global routing table and the other in a VRF, looking
up the source host in the destination interface's routing table will
fail if the destination interface's routing table contains no route to
the source host.

One observable effect of this issue is that traceroute does not work in
the following cases:

- Route leaking between global routing table and VRF
- Route leaking between VRFs

Preferably use the source device routing table when sending ICMP error
messages. If no source device is set, fall-back on the destination
device routing table.

Fixes: 9d1a6c4ea43e ("net: icmp_route_lookup should use rt dev to determine L3 domain")
Link: https://tools.ietf.org/html/rfc792
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: David Ahern <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: [email protected]
---
net/ipv4/icmp.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index cf36f955bfe6..1eb83d82ec68 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -465,6 +465,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
int type, int code,
struct icmp_bxm *param)
{
+ struct net_device *route_lookup_dev = NULL;
struct rtable *rt, *rt2;
struct flowi4 fl4_dec;
int err;
@@ -479,7 +480,17 @@ static struct rtable *icmp_route_lookup(struct net *net,
fl4->flowi4_proto = IPPROTO_ICMP;
fl4->fl4_icmp_type = type;
fl4->fl4_icmp_code = code;
- fl4->flowi4_oif = l3mdev_master_ifindex(skb_dst(skb_in)->dev);
+ /*
+ * The device used for looking up which routing table to use is
+ * preferably the source whenever it is set, which should ensure
+ * the icmp error can be sent to the source host, else fallback
+ * on the destination device.
+ */
+ if (skb_in->dev)
+ route_lookup_dev = skb_in->dev;
+ else if (skb_dst(skb_in))
+ route_lookup_dev = skb_dst(skb_in)->dev;
+ fl4->flowi4_oif = l3mdev_master_ifindex(route_lookup_dev);

security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
rt = ip_route_output_key_hash(net, fl4, skb_in);
@@ -503,7 +514,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
if (err)
goto relookup_failed;

- if (inet_addr_type_dev_table(net, skb_dst(skb_in)->dev,
+ if (inet_addr_type_dev_table(net, route_lookup_dev,
fl4_dec.saddr) == RTN_LOCAL) {
rt2 = __ip_route_output_key(net, &fl4_dec);
if (IS_ERR(rt2))
--
2.17.1

2020-08-12 21:44:29

by David Miller

Subject: Re: [PATCH 2/3] ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table

From: Mathieu Desnoyers <[email protected]>
Date: Tue, 11 Aug 2020 15:50:02 -0400

> @@ -465,6 +465,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
> int type, int code,
> struct icmp_bxm *param)
> {
> + struct net_device *route_lookup_dev = NULL;
> struct rtable *rt, *rt2;
> struct flowi4 fl4_dec;
> int err;
> @@ -479,7 +480,17 @@ static struct rtable *icmp_route_lookup(struct net *net,
> fl4->flowi4_proto = IPPROTO_ICMP;
> fl4->fl4_icmp_type = type;
> fl4->fl4_icmp_code = code;
> - fl4->flowi4_oif = l3mdev_master_ifindex(skb_dst(skb_in)->dev);
> + /*
> + * The device used for looking up which routing table to use is
> + * preferably the source whenever it is set, which should ensure
> + * the icmp error can be sent to the source host, else fallback
> + * on the destination device.
> + */
> + if (skb_in->dev)
> + route_lookup_dev = skb_in->dev;
> + else if (skb_dst(skb_in))
> + route_lookup_dev = skb_dst(skb_in)->dev;
> + fl4->flowi4_oif = l3mdev_master_ifindex(route_lookup_dev);

The caller of icmp_route_lookup() uses the opposite prioritization of
devices for determining the network namespace to use:

if (rt->dst.dev)
net = dev_net(rt->dst.dev);
else if (skb_in->dev)
net = dev_net(skb_in->dev);
else
goto out;

Do we have to reverse the ordering there too?

And when I read fallback in your commit message description, I
imagined that you would have a two tiered lookup scheme. First you
would be trying the skb_in->dev for a lookup (to accommodate the VRF
case), and if that failed you'd try again with skb_dst()->dev.

2020-08-13 13:14:25

by Mathieu Desnoyers

Subject: Re: [PATCH 2/3] ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table

----- On Aug 12, 2020, at 5:43 PM, David S. Miller [email protected] wrote:

> From: Mathieu Desnoyers <[email protected]>
> Date: Tue, 11 Aug 2020 15:50:02 -0400
>
>> @@ -465,6 +465,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
>> int type, int code,
>> struct icmp_bxm *param)
>> {
>> + struct net_device *route_lookup_dev = NULL;
>> struct rtable *rt, *rt2;
>> struct flowi4 fl4_dec;
>> int err;
>> @@ -479,7 +480,17 @@ static struct rtable *icmp_route_lookup(struct net *net,
>> fl4->flowi4_proto = IPPROTO_ICMP;
>> fl4->fl4_icmp_type = type;
>> fl4->fl4_icmp_code = code;
>> - fl4->flowi4_oif = l3mdev_master_ifindex(skb_dst(skb_in)->dev);
>> + /*
>> + * The device used for looking up which routing table to use is
>> + * preferably the source whenever it is set, which should ensure
>> + * the icmp error can be sent to the source host, else fallback
>> + * on the destination device.
>> + */
>> + if (skb_in->dev)
>> + route_lookup_dev = skb_in->dev;
>> + else if (skb_dst(skb_in))
>> + route_lookup_dev = skb_dst(skb_in)->dev;
>> + fl4->flowi4_oif = l3mdev_master_ifindex(route_lookup_dev);
>
> The caller of icmp_route_lookup() uses the opposite prioritization of
> devices for determining the network namespace to use:
>
> if (rt->dst.dev)
> net = dev_net(rt->dst.dev);
> else if (skb_in->dev)
> net = dev_net(skb_in->dev);
> else
> goto out;
>
> Do we have to reverse the ordering there too?

Looking at the history:

Originally dst.dev was used as network namespace for icmp errors:

dde1bc0e6f861 (Denis V. Lunev 2008-01-22 23:50:57 -0800 450) net = rt->u.dst.dev->nd_net;

commit dde1bc0e6f86183bc095d0774cd109f4edf66ea2
Author: Denis V. Lunev <[email protected]>
Date: Tue Jan 22 23:50:57 2008 -0800

[NETNS]: Add namespace for ICMP replying code.

All needed API is done, the namespace is available when required from
the device on the DST entry from the incoming packet. So, just replace
init_net with proper namespace.

I wonder what motivated the use of the DST entry here?

Note that this choice of DST network namespace applies to both __icmp_send and
icmp_unreach.

It has been followed by a few data structure layout changes:

c346dca10840a (YOSHIFUJI Hideaki 2008-03-25 21:47:49 +0900 430) net = dev_net(rt->u.dst.dev);
d8d1f30b95a63 (Changli Gao 2010-06-10 23:31:35 -0700 585) net = dev_net(rt->dst.dev);

It was then changed to fix a NULL pointer deref:

e2c693934194f (Hangbin Liu 2019-08-22 22:19:48 +0800 586)
e2c693934194f (Hangbin Liu 2019-08-22 22:19:48 +0800 587) if (rt->dst.dev)
e2c693934194f (Hangbin Liu 2019-08-22 22:19:48 +0800 588) net = dev_net(rt->dst.dev);
e2c693934194f (Hangbin Liu 2019-08-22 22:19:48 +0800 589) else if (skb_in->dev)
e2c693934194f (Hangbin Liu 2019-08-22 22:19:48 +0800 590) net = dev_net(skb_in->dev);
e2c693934194f (Hangbin Liu 2019-08-22 22:19:48 +0800 591) else
e2c693934194f (Hangbin Liu 2019-08-22 22:19:48 +0800 592) goto out;


> And when I read fallback in your commit message description, I
> imagined that you would have a two tiered lookup scheme. First you
> would be trying the skb_in->dev for a lookup (to accommodate the VRF
> case), and if that failed you'd try again with skb_dst()->dev.

The code I proposed basically does use the skb_in->dev (if non-null)
for looking up which VRF table to use, else use skb_dst(skb_in) (if non-null)
for looking up which VRF table to use, else route_lookup_dev is NULL, which
means use the master table.

Whether this should instead try to look up the source address with the skb_in->dev
table, and if that fails go to the next, is a good question. I think the context
I am missing in order to understand which approach is appropriate is which
scenario can cause skb_in->dev to be NULL, and which can cause skb_dst(skb_in)
to be NULL, and what the expected behavior is for icmp error route lookup in those
cases.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-08-13 23:02:57

by David Ahern

Subject: Re: [PATCH 2/3] ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table

On 8/11/20 1:50 PM, Mathieu Desnoyers wrote:
> As per RFC792, ICMP errors should be sent to the source host.
>
> However, in configurations with Virtual Routing and Forwarding tables,
> looking up which routing table to use is currently done by using the
> destination net_device.
>
> commit 9d1a6c4ea43e ("net: icmp_route_lookup should use rt dev to
> determine L3 domain") changes the interface passed to
> l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
> to skb_dst(skb_in)->dev. This effectively uses the destination device
> rather than the source device for choosing which routing table should be
> used to lookup where to send the ICMP error.
>
> Therefore, if the source and destination interfaces are within separate
> VRFs, or one in the global routing table and the other in a VRF, looking
> up the source host in the destination interface's routing table will
> fail if the destination interface's routing table contains no route to
> the source host.
>
> One observable effect of this issue is that traceroute does not work in
> the following cases:
>
> - Route leaking between global routing table and VRF
> - Route leaking between VRFs
>
> Preferably use the source device routing table when sending ICMP error
> messages. If no source device is set, fall-back on the destination
> device routing table.
>
> Fixes: 9d1a6c4ea43e ("net: icmp_route_lookup should use rt dev to determine L3 domain")
> Link: https://tools.ietf.org/html/rfc792
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> Cc: David Ahern <[email protected]>
> Cc: David S. Miller <[email protected]>
> Cc: [email protected]
> ---
> net/ipv4/icmp.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index cf36f955bfe6..1eb83d82ec68 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -465,6 +465,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
> int type, int code,
> struct icmp_bxm *param)
> {
> + struct net_device *route_lookup_dev = NULL;
> struct rtable *rt, *rt2;
> struct flowi4 fl4_dec;
> int err;
> @@ -479,7 +480,17 @@ static struct rtable *icmp_route_lookup(struct net *net,
> fl4->flowi4_proto = IPPROTO_ICMP;
> fl4->fl4_icmp_type = type;
> fl4->fl4_icmp_code = code;
> - fl4->flowi4_oif = l3mdev_master_ifindex(skb_dst(skb_in)->dev);
> + /*
> + * The device used for looking up which routing table to use is
> + * preferably the source whenever it is set, which should ensure
> + * the icmp error can be sent to the source host, else fallback
> + * on the destination device.
> + */
> + if (skb_in->dev)
> + route_lookup_dev = skb_in->dev;
> + else if (skb_dst(skb_in))
> + route_lookup_dev = skb_dst(skb_in)->dev;
> + fl4->flowi4_oif = l3mdev_master_ifindex(route_lookup_dev);
>
> security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
> rt = ip_route_output_key_hash(net, fl4, skb_in);
> @@ -503,7 +514,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
> if (err)
> goto relookup_failed;
>
> - if (inet_addr_type_dev_table(net, skb_dst(skb_in)->dev,
> + if (inet_addr_type_dev_table(net, route_lookup_dev,
> fl4_dec.saddr) == RTN_LOCAL) {
> rt2 = __ip_route_output_key(net, &fl4_dec);
> if (IS_ERR(rt2))
>

ICMPs can be generated in many locations:
1. forward path - I think the skb_in dev is always set.

2. ingress and upper layer protocols - dev is dropped prior to
transport layers, so, for example, UDP sending port unreachable calls
icmp_send with skb_in->dev set to NULL.

3. local packets and egress - e.g., link failures; here I believe skb
dev is set.

If in and out are in the same L3 domain, either device works. For VRF
route leaking with the forward path, in and out are in separate
domains, so yes, you want the ingress device.
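
One way to see this asymmetry on the selftest topology is to ask each VRF on r1 whether it can route back to h1. The following is a minimal sketch, not part of the series: vrf_can_reach is a hypothetical helper name, and it assumes the namespaces and VRFs created by the test script's setup function, plus an iproute2 that accepts the vrf keyword for ip route get.

```shell
# Hypothetical helper (not in the patch): returns 0 if a route lookup for
# <addr> succeeds in the routing table of <vrf> inside namespace <ns>.
vrf_can_reach()
{
	local ns="$1"
	local vrf="$2"
	local addr="$3"

	# "ip route get ... vrf NAME" performs the lookup in that VRF's table
	# and exits non-zero when the lookup fails.
	ip -netns "${ns}" route get "${addr}" vrf "${vrf}" >/dev/null 2>&1
}
```

On the topology from the test script, vrf_can_reach r1 blue 172.16.1.1 is expected to succeed, while vrf_can_reach r1 red 172.16.1.1 is expected to fail because red only carries the unreachable default for that destination. That is the case where picking the egress device's table cannot deliver the ICMP error.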

This change seems fine to me and I have not seen any issues with
existing selftests.

Reviewed-by: David Ahern <[email protected]>


But I did notice that unreachable / fragmentation needed messages are
NOT working with this change. You can see that by changing the MTU of
eth1 in r1 to 1400 and running:
ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2

You really should get that working as well with VRF route leaking.
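
The reproduction recipe above could be folded into the selftest along these lines. This is a sketch only: ipv4_mtu_check is a hypothetical function name, and matching on the string "Frag needed" assumes the iputils ping wording for ICMP fragmentation-needed errors; other ping implementations may print something different.

```shell
# Hypothetical test body (not in the patch). Assumes the namespaces created
# by the script's setup function already exist.
ipv4_mtu_check()
{
	local out

	# Shrink the MTU of r1's egress interface, which sits in vrf red.
	ip -netns r1 link set eth1 mtu 1400 || return 1

	# A 1450-byte payload with DF set (-Mdo) cannot be forwarded by r1,
	# so r1 should answer with an ICMP "fragmentation needed" error.
	out=$(ip netns exec h1 ping -s 1450 -Mdo -c1 -W2 172.16.2.2 2>&1)

	# Assumption: iputils ping reports this as "Frag needed and DF set".
	echo "$out" | grep -q "Frag needed"
}
```

The return value can be passed to log_test in the same way as the existing ping checks. The MTU change does not need to be undone by hand if the test is followed by cleanup, since that deletes the namespaces.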


2020-08-13 23:15:01

by David Ahern

Subject: Re: [PATCH 1/3] selftests: Add VRF icmp error route lookup test

On 8/11/20 1:50 PM, Mathieu Desnoyers wrote:
> +run_cmd()
> +{
> + local cmd="$*"
> + local out
> + local rc
> +
> + if [ "$VERBOSE" = "1" ]; then
> + echo "COMMAND: $cmd"
> + fi
> +
> + out=$(eval $cmd 2>&1)
> + rc=$?
> + if [ "$VERBOSE" = "1" ] && [ -n "$out" ]; then
> + echo "$out"
> + fi
> +
> + [ "$VERBOSE" = "1" ] && echo
> +
> + return $rc
> +}
> +

...

> +ipv6_ping()
> +{
> + log_section "IPv6: VRF ICMP error route lookup ping"
> +
> + setup
> +
> + # verify connectivity
> + if ! check_connectivity6; then
> + echo "Error: Basic connectivity is broken"
> + ret=1
> + return
> + fi
> +
> + if [ "$VERBOSE" = "1" ]; then
> + echo "Command to check for ICMP ttl exceeded:"
> + run_cmd ip netns exec h1 "${ping6}" -t1 -c1 -W2 ${H2_N2_IP6}
> + fi
> +
> + ip netns exec h1 "${ping6}" -t1 -c1 -W2 ${H2_N2_IP6} | grep -q "Time exceeded: Hop limit"

run_cmd runs the command and if VERBOSE is set to 1 shows the command to
the user. Something is off with this script and passing the -v arg -- I
do not get a command list. This applies to the whole script.

Since you need to check for output, I suggest modifying run_cmd to
search the output for the given string.
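
A minimal sketch of what such a helper might look like follows. The name run_cmd_grep and the argument order (pattern first, then the command) are assumptions made for illustration, not something taken from the patch.

```shell
# Hypothetical variant of run_cmd: runs the command once, shows it and its
# output when VERBOSE=1, and returns 0 only if the output matches the pattern.
run_cmd_grep()
{
	local grep_pattern="$1"
	shift
	local cmd="$*"
	local out
	local rc

	if [ "$VERBOSE" = "1" ]; then
		echo "COMMAND: $cmd"
	fi

	out=$(eval $cmd 2>&1)
	if [ "$VERBOSE" = "1" ] && [ -n "$out" ]; then
		echo "$out"
	fi

	# The return code reflects the pattern match, not the command status.
	if echo "$out" | grep -q "$grep_pattern"; then
		rc=0
	else
		rc=1
	fi

	[ "$VERBOSE" = "1" ] && echo

	return $rc
}
```

With a helper like this, each test would run its command a single time, for example run_cmd_grep "Time exceeded: Hop limit" ip netns exec h1 "${ping6}" -t1 -c1 -W2 ${H2_N2_IP6}, followed by log_test $? 0 "...", rather than once for verbose output and once more for the grep.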


> + log_test $? 0 "Ping received ICMP ttl exceeded"
> +}
> +################################################################################

missing newline between '}' and '####'

> +# usage
> +
> +usage()
> +{
> + cat <<EOF
> +usage: ${0##*/} OPTS
> +
> + -4 IPv4 tests only
> + -6 IPv6 tests only
> + -p Pause on fail
> + -v verbose mode (show commands and output)
> +EOF
> +}
> +
> +################################################################################
> +# main
> +
> +# Some systems don't have a ping6 binary anymore
> +command -v ping6 > /dev/null 2>&1 && ping6=$(command -v ping6) || ping6=$(command -v ping)
> +
> +TESTS_IPV4="ipv4_ping ipv4_traceroute"
> +TESTS_IPV6="ipv6_ping ipv6_traceroute"
> +
> +ret=0
> +nsuccess=0
> +nfail=0
> +setup=0
> +
> +while getopts :46pvh o
> +do
> + case $o in
> + 4) TESTS=ipv4;;
> + 6) TESTS=ipv6;;
> + p) PAUSE_ON_FAIL=yes;;
> + v) VERBOSE=1;;
> + h) usage; exit 0;;
> + *) usage; exit 1;;

indentation issues; not using tabs

> + esac
> +done
> +
> +#
> +# show user test config
> +#
> +if [ -z "$TESTS" ]; then
> + TESTS="$TESTS_IPV4 $TESTS_IPV6"
> +elif [ "$TESTS" = "ipv4" ]; then
> + TESTS="$TESTS_IPV4"
> +elif [ "$TESTS" = "ipv6" ]; then
> + TESTS="$TESTS_IPV6"
> +fi
> +
> +for t in $TESTS
> +do
> + case $t in
> + ipv4_ping|ping) ipv4_ping;;
> + ipv4_traceroute|traceroute) ipv4_traceroute;;
> +
> + ipv6_ping|ping) ipv6_ping;;
> + ipv6_traceroute|traceroute) ipv6_traceroute;;
> +
> + # setup namespaces and config, but do not run any tests
> + setup) setup; exit 0;;

you don't allow '-t setup' so you can remove this part
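
For comparison, if the intent were instead to keep the setup case reachable, the option parsing would need a -t option that takes a test name. The sketch below is an assumption about how that could look; parse_opts is a hypothetical wrapper and the -t flag is not part of the posted script.

```shell
# Hypothetical option parser (not in the patch) that adds "-t <test>".
parse_opts()
{
	local o

	# Reset OPTIND so the function can be called more than once.
	OPTIND=1
	while getopts :46t:pvh o "$@"
	do
		case $o in
		4) TESTS=ipv4;;
		6) TESTS=ipv6;;
		t) TESTS=$OPTARG;;
		p) PAUSE_ON_FAIL=yes;;
		v) VERBOSE=1;;
		*) return 1;;
		esac
	done
}
```

Called as parse_opts "$@", this leaves TESTS set to the requested name, so "-t setup" would reach the setup case in the test loop. Without such an option, removing the case as suggested is the simpler fix.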

> +
> + help) echo "Test names: $TESTS"; exit 0;;
> + esac
> +done
> +
> +cleanup
> +
> +printf "\nTests passed: %3d\n" ${nsuccess}
> +printf "Tests failed: %3d\n" ${nfail}
> +
> +exit $ret
>

2020-08-13 23:23:18

by David Ahern

Subject: Re: [PATCH 3/3] ipv6/icmp: l3mdev: Perform icmp error route lookup on source device routing table

On 8/11/20 1:50 PM, Mathieu Desnoyers wrote:
> As per RFC4443, the destination address field for ICMPv6 error messages
> is copied from the source address field of the invoking packet.
>
> In configurations with Virtual Routing and Forwarding tables, looking up
> which routing table to use for sending ICMPv6 error messages is
> currently done by using the destination net_device.
>
> If the source and destination interfaces are within separate VRFs, or
> one in the global routing table and the other in a VRF, looking up the
> source address of the invoking packet in the destination interface's
> routing table will fail if the destination interface's routing table
> contains no route to the invoking packet's source address.
>
> One observable effect of this issue is that traceroute6 does not work in
> the following cases:
>
> - Route leaking between global routing table and VRF
> - Route leaking between VRFs
>
> Preferably use the source device routing table when sending ICMPv6 error
> messages. If no source device is set, fall-back on the destination
> device routing table.
>
> Link: https://tools.ietf.org/html/rfc4443
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> Cc: David Ahern <[email protected]>
> Cc: David S. Miller <[email protected]>
> Cc: [email protected]
> ---
> net/ipv6/icmp.c | 15 +++++++++++++--
> net/ipv6/ip6_output.c | 2 --
> 2 files changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
> index a4e4912ad607..a971b58b0371 100644
> --- a/net/ipv6/icmp.c
> +++ b/net/ipv6/icmp.c
> @@ -501,8 +501,19 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
> if (__ipv6_addr_needs_scope_id(addr_type)) {
> iif = icmp6_iif(skb);
> } else {
> - dst = skb_dst(skb);
> - iif = l3mdev_master_ifindex(dst ? dst->dev : skb->dev);
> + struct net_device *route_lookup_dev = NULL;
> +
> + /*
> + * The device used for looking up which routing table to use is
> + * preferably the source whenever it is set, which should
> + * ensure the icmp error can be sent to the source host, else
> + * fallback on the destination device.
> + */
> + if (skb->dev)
> + route_lookup_dev = skb->dev;

At the top of icmp6_send() there is already a check that skb->dev is set.


> + else if (skb_dst(skb))
> + route_lookup_dev = skb_dst(skb)->dev;
> + iif = l3mdev_master_ifindex(route_lookup_dev);
> }
>
> /*
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index c78e67d7747f..cd623068de53 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -468,8 +468,6 @@ int ip6_forward(struct sk_buff *skb)
> * check and decrement ttl
> */
> if (hdr->hop_limit <= 1) {
> - /* Force OUTPUT device used as source address */
> - skb->dev = dst->dev;

I *think* this is ok. It is not clear to me why the forward path would
change skb->dev like that. The assignment goes back to the beginning of
the git history.

> icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, 0);
> __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
>
>

2020-08-14 15:39:41

by Michael Jeanson

Subject: Re: [PATCH 1/3] selftests: Add VRF icmp error route lookup test

----- On 13 Aug, 2020, at 19:13, David Ahern [email protected] wrote:

...

>> +ipv6_ping()
>> +{
>> + log_section "IPv6: VRF ICMP error route lookup ping"
>> +
>> + setup
>> +
>> + # verify connectivity
>> + if ! check_connectivity6; then
>> + echo "Error: Basic connectivity is broken"
>> + ret=1
>> + return
>> + fi
>> +
>> + if [ "$VERBOSE" = "1" ]; then
>> + echo "Command to check for ICMP ttl exceeded:"
>> + run_cmd ip netns exec h1 "${ping6}" -t1 -c1 -W2 ${H2_N2_IP6}
>> + fi
>> +
>> + ip netns exec h1 "${ping6}" -t1 -c1 -W2 ${H2_N2_IP6} | grep -q "Time exceeded: Hop limit"
>
> run_cmd runs the command and if VERBOSE is set to 1 shows the command to
> the user. Something is off with this script and passing the -v arg -- I
> do not get a command list. This applies to the whole script.

Hmm, I have no issues here with '-v'. Do you get no output at all from run_cmd?

>
> Since you need to check for output, I suggest modifying run_cmd to
> search the output for the given string.

I took this pattern (running the command twice in verbose mode, then
grepping its output) from icmp_redirect.sh. I'll see if I can come up with
something fancier in run_cmd.
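
For illustration only, here is a rough sketch of what a grep-aware variant
could look like. The helper name run_cmd_grep and its argument order are
assumptions, not something from the posted patch. It runs the command once,
prints the command and its output when VERBOSE is 1, and returns 0 only if
the output contains the given string.

```shell
# Hypothetical helper (name and argument order are assumptions): run a
# command once and succeed only if its output contains the given string.
VERBOSE=${VERBOSE:-0}

run_cmd_grep()
{
	local grep_pattern="$1"
	shift
	local cmd="$*"
	local out
	local rc

	if [ "$VERBOSE" = "1" ]; then
		echo "COMMAND: $cmd"
	fi

	# Capture stdout and stderr so both can be shown and searched.
	out=$(eval "$cmd" 2>&1)
	if [ "$VERBOSE" = "1" ] && [ -n "$out" ]; then
		echo "$out"
		echo
	fi

	# The return code reflects the string match, not the command status.
	echo "$out" | grep -q "$grep_pattern"
	rc=$?

	return $rc
}

# Possible use in ipv6_ping(), replacing the verbose run plus the grep:
#   run_cmd_grep "Time exceeded: Hop limit" \
#       ip netns exec h1 "${ping6}" -t1 -c1 -W2 ${H2_N2_IP6}
#   log_test $? 0 "Ping received ICMP ttl exceeded"
```

With something along these lines the command would only run once per test,
and '-v' would show exactly the output that was searched.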


>> + log_test $? 0 "Ping received ICMP ttl exceeded"
>> +}
>> +################################################################################
>
> missing newline between '}' and '####'

Ack

...

>> +while getopts :46pvh o
>> +do
>> + case $o in
>> + 4) TESTS=ipv4;;
>> + 6) TESTS=ipv6;;
>> + p) PAUSE_ON_FAIL=yes;;
>> + v) VERBOSE=1;;
>> + h) usage; exit 0;;
>> + *) usage; exit 1;;
>
> indentation issues; not using tabs
>

Ack

>> + esac
>> +done
>> +
>> +#
>> +# show user test config
>> +#
>> +if [ -z "$TESTS" ]; then
>> + TESTS="$TESTS_IPV4 $TESTS_IPV6"
>> +elif [ "$TESTS" = "ipv4" ]; then
>> + TESTS="$TESTS_IPV4"
>> +elif [ "$TESTS" = "ipv6" ]; then
>> + TESTS="$TESTS_IPV6"
>> +fi
>> +
>> +for t in $TESTS
>> +do
>> + case $t in
>> + ipv4_ping|ping) ipv4_ping;;
>> + ipv4_traceroute|traceroute) ipv4_traceroute;;
>> +
>> + ipv6_ping|ping) ipv6_ping;;
>> + ipv6_traceroute|traceroute) ipv6_traceroute;;
>> +
>> + # setup namespaces and config, but do not run any tests
>> + setup) setup; exit 0;;
>
> you don't allow '-t setup' so you can remove this part

I'll add the missing '-t' option to getopts.
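
As a sketch of that change (not the final patch), the option loop could
accept '-t <test>' as below. The parse_opts wrapper and the stub usage()
are assumptions added only to keep the example self-contained; in the
script the loop would stay at top level and use the existing usage().

```shell
# Stub standing in for the script's usage(), with the new -t line added.
usage()
{
	cat <<EOF
usage: ${0##*/} OPTS

	-4          IPv4 tests only
	-6          IPv6 tests only
	-t <test>   Test name(s) to run
	-p          Pause on fail
	-v          verbose mode (show commands and output)
EOF
}

# Hypothetical wrapper around the option loop; 't:' makes getopts
# expect an argument for -t and store it in OPTARG.
parse_opts()
{
	local o

	OPTIND=1
	while getopts :46t:pvh o
	do
		case $o in
			4) TESTS=ipv4;;
			6) TESTS=ipv6;;
			t) TESTS=$OPTARG;;
			p) PAUSE_ON_FAIL=yes;;
			v) VERBOSE=1;;
			h) usage; exit 0;;
			*) usage; exit 1;;
		esac
	done
}

# Example: parse_opts -t setup
# leaves TESTS=setup, so the "setup)" case in the test loop is reachable.
```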

>
>> +
>> + help) echo "Test names: $TESTS"; exit 0;;
>> + esac
>> +done
>> +
>> +cleanup
>> +
>> +printf "\nTests passed: %3d\n" ${nsuccess}
>> +printf "Tests failed: %3d\n" ${nfail}
>> +
>> +exit $ret