Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752963AbbKKWyi (ORCPT ); Wed, 11 Nov 2015 17:54:38 -0500 Received: from ja.ssi.bg ([178.16.129.10]:35137 "EHLO ja.ssi.bg" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1752587AbbKKWyh (ORCPT ); Wed, 11 Nov 2015 17:54:37 -0500 Date: Thu, 12 Nov 2015 00:54:23 +0200 (EET) From: Julian Anastasov To: Kouya Shimura cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, "David S. Miller" Subject: Re: ipv4: ip unreachable with SO_BINDTODEVICE socket In-Reply-To: <87egfxjd9o.fsf@pingu.sky.yk.fujitsu.co.jp> Message-ID: References: <87egfxjd9o.fsf@pingu.sky.yk.fujitsu.co.jp> User-Agent: Alpine 2.11 (LFD 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2881 Lines: 76 Hello, On Wed, 11 Nov 2015, Kouya Shimura wrote: > Hi > > When both server and client are on the same machine and each their > socket option is set to SO_BINDTODEVICE, sometimes a packet doesn't > reach to the server. > > The reproducible test program is attached. (modify "IF_ADDR=, IP_ADDR=, > PORT=" lines appropriately). Please try 'taskset -c 1 python test.py' > since per cpu data (rt_cache) affects results. Also 'tcpdump -i lo' > is helpful for testing. you can see "ICMP udp port unreachable". > > In this test program, a packet doesn't pass through the bound > interface but 'lo' interface. So, it might be granted that local > communication with SO_BINDTODEVICE socket fails. However, dnsmasq and > dhcp_release commands rely on it (Actually I've found this issue on > the OpenStack envirionment) and the test program works well on > linux-2.6.32 but doesn't work on linux-3.10.0 and 4.3.0. > > I'd like to know whether this is a kernel bug or the specification of > SO_BINDTODEVICE. My man page does not indicate that SO_BINDTODEVICE should be relaxed for traffic from loopback but for old kernels it was really working in this way due to caching of different orig_oif values and even for mcast/bcast because ip_mc_output() does not use "lo" for loopback. > The attached patch fixes this issue, but no confidence this is a > right modification. > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > index 85f184e..546cabe 100644 > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -2027,7 +2027,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res, > prth = raw_cpu_ptr(nh->nh_pcpu_rth_output); > } > rth = rcu_dereference(*prth); > - if (rt_cache_valid(rth)) { > + if (rt_cache_valid(rth) && rth->rt_iif == orig_oif) { > dst_hold(&rth->dst); > return rth; > } So, if the cache contains same orig_oif we will use it and if it is different we will replace the cache. While traffics use same orig_oif we will not have frequent rt_cache_route() calls that replace the cache. Patch looks ok to me but I'm not sure if we should worry for the unicast traffic. If we want frequent updates only for loopback then the check could be: if (rt_cache_valid(rth) && (!(flags & RTCF_LOCAL) || rth->rt_iif == orig_oif)) { Or the following, it should better cache mcast because mcast does not use/need the rt_iif check: if (rt_cache_valid(rth) && (type != RTN_LOCAL || rth->rt_iif == orig_oif)) { Regards -- Julian Anastasov -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/