Date: Fri, 6 Dec 2019 12:50:46 -0300
From: Thadeu Lima de Souza Cascardo
To: Eric Dumazet
Cc: netdev@vger.kernel.org, davem@davemloft.net, shuah@kernel.org,
    linux-kselftest@vger.kernel.org, linux-kernel@vger.kernel.org,
    posk@google.com
Subject: Re: [PATCH] selftests: net: ip_defrag: increase netdev_max_backlog
Message-ID: <20191206155046.GF5083@calabresa>
References: <20191204195321.406365-1-cascardo@canonical.com>
 <483097a3-92ec-aedd-60d9-ab7f58b9708d@gmail.com>
 <20191206121707.GC5083@calabresa>
 <20191206145010.GE5083@calabresa>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20191206145010.GE5083@calabresa>
User-Agent: Mutt/1.12.2 (2019-09-21)

On Fri, Dec 06, 2019 at 11:50:15AM -0300, Thadeu Lima de Souza Cascardo wrote:
> On Fri, Dec 06, 2019 at 05:41:01AM -0800, Eric Dumazet wrote:
> > On 12/6/19 4:17 AM, Thadeu Lima de Souza Cascardo wrote:
> > > On Wed, Dec 04, 2019 at 12:03:57PM -0800, Eric Dumazet wrote:
> > >> On 12/4/19 11:53 AM, Thadeu Lima de Souza Cascardo wrote:
> > >>> When using fragments of size 8 and a payload larger than 8000, the backlog
> > >>> might fill up and packets will be dropped, causing the test to fail. This
> > >>> happens often enough when conntrack is on during the IPv6 test.
> > >>>
> > >>> As the largest payload in the test is 10000, using a backlog of 1250 allows
> > >>> the test to run repeatedly without failure. At least 1000 runs were
> > >>> possible with no failures, while usually fewer than 50 runs were enough
> > >>> to show a failure.
> > >>>
> > >>> As netdev_max_backlog is not a per-netns setting, this sets the backlog
> > >>> back to 1000 during exit to avoid disturbing the following tests.
> > >>>
> > >> Hmmm... I would prefer not changing a global setting like that.
> > >> This is going to be flaky since we often run tests in parallel (using
> > >> different netns).
> > >>
> > >> What about adding a small delay after each sent packet?
> > >>
> > >> diff --git a/tools/testing/selftests/net/ip_defrag.c b/tools/testing/selftests/net/ip_defrag.c
> > >> index c0c9ecb891e1d78585e0db95fd8783be31bc563a..24d0723d2e7e9b94c3e365ee2ee30e9445deafa8 100644
> > >> --- a/tools/testing/selftests/net/ip_defrag.c
> > >> +++ b/tools/testing/selftests/net/ip_defrag.c
> > >> @@ -198,6 +198,7 @@ static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
> > >>  		error(1, 0, "send_fragment: %d vs %d", res, frag_len);
> > >>  
> > >>  	frag_counter++;
> > >> +	usleep(1000);
> > >>  }
> > >>  
> > >>  static void send_udp_frags(int fd_raw, struct sockaddr *addr,
> > >>
> > >
> > > That won't work, because the issue only shows up when we are using
> > > conntrack: the packet is reassembled on output, then fragmented again.
> > > When this happens, the fragmentation code transmits the fragments in a
> > > tight loop, which floods the backlog.
> >
> > Interesting!
> >
> > So it looks like the test is correct, and exposed a long-standing problem in this code.
> >
> > We should not adjust the test to some kernel-of-the-day constraints, and instead fix the kernel bug ;)
> >
> > Where is this tight loop exactly?
> >
> > If this is feeding/bursting ~1000 skbs via netif_rx() in a BH context, maybe we need to call a variant
> > that allows immediate processing instead of (ab)using the softnet backlog.
> >
> > Thanks!
> 
> This is the loopback interface, so its xmit calls netif_rx. I suppose we would
> have the same problem with veth, for example.
> 
> So net/ipv6/ip6_output.c:ip6_fragment has this loop:
> 
> 	for (;;) {
> 		/* Prepare header of the next frame,
> 		 * before previous one went down. */
> 		if (iter.frag)
> 			ip6_fraglist_prepare(skb, &iter);
> 
> 		skb->tstamp = tstamp;
> 		err = output(net, sk, skb);
> 		if (!err)
> 			IP6_INC_STATS(net, ip6_dst_idev(&rt->dst),
> 				      IPSTATS_MIB_FRAGCREATES);
> 
> 		if (err || !iter.frag)
> 			break;
> 
> 		skb = ip6_fraglist_next(&iter);
> 	}
> 
> Here output is ip6_finish_output2, which calls neigh_output, which ends up
> calling dev_queue_xmit.
> 
> In this case, ip6_fragment is probably being called from rawv6_send_hdrinc ->
> dst_output -> ip6_output -> ip6_finish_output -> __ip6_finish_output ->
> ip6_fragment.
> 
> dst_output at rawv6_send_hdrinc is called after the netfilter
> NF_INET_LOCAL_OUT hook. That hook gathers the fragments and only accepts the
> last, reassembled skb, which causes ip6_fragment to enter that loop.
> 
> So, basically, the easiest way to reproduce this is to run this test over
> loopback with netfilter doing the reassembly during conntrack. I see some BH
> locks here and there, but I think this is just filling up the backlog too fast
> to give softirq any chance to kick in.
> 
> I will see if I can reproduce this using routed veths.

Confirmed that the same happens when using veth:

  vethX (nsX) <-> veth1 (router) forwards through veth2 (router) <-> vethY (nsY)
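Roughly, that setup can be recreated with something like the sketch below; the
interface names match the diagram, but the addresses and the conntrack rule are
illustrative assumptions, not the exact commands used:

  ip netns add nsX; ip netns add router; ip netns add nsY
  ip link add vethX type veth peer name veth1
  ip link add vethY type veth peer name veth2
  ip link set vethX netns nsX;  ip link set veth1 netns router
  ip link set vethY netns nsY;  ip link set veth2 netns router
  ip -n nsX addr add 2001:db8:1::2/64 dev vethX
  ip -n router addr add 2001:db8:1::1/64 dev veth1
  ip -n router addr add 2001:db8:2::1/64 dev veth2
  ip -n nsY addr add 2001:db8:2::2/64 dev vethY
  for ns in nsX router nsY; do ip -n $ns link set lo up; done
  ip -n nsX link set vethX up;  ip -n nsY link set vethY up
  ip -n router link set veth1 up;  ip -n router link set veth2 up
  ip -n nsX route add 2001:db8:2::/64 via 2001:db8:1::1
  ip -n nsY route add 2001:db8:1::/64 via 2001:db8:2::1
  ip netns exec router sysctl -w net.ipv6.conf.all.forwarding=1
  # Any rule that uses conntrack loads nf_defrag_ipv6 in that netns, so the
  # router reassembles forwarded fragments and re-fragments them on output:
  ip netns exec router ip6tables -A FORWARD -m conntrack \
          --ctstate NEW,ESTABLISHED -j ACCEPT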
With such a setup, when I send those fragments from nsX to nsY, they get
through, until I set up that same conntrack rule on the router. Then,
increasing netdev_max_backlog allows those fragments to go through again.

That at least seems to be a plausible scenario that we would like to fix, as
you said, instead of only making a test pass.

Next Monday, I can test anything you come up with.

Thanks.
Cascardo.
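For reference, the simpler loopback reproduction described in the quoted
message boils down to something like the sketch below; the conntrack rule and
the test invocation are assumptions rather than commands taken from the thread
(ip_defrag is normally driven by ip_defrag.sh, which also sets up a dedicated
netns and fragment sysctls):

  # Any ip6tables rule using conntrack loads nf_defrag_ipv6, so locally
  # generated fragments are reassembled at NF_INET_LOCAL_OUT and then
  # re-fragmented by ip6_fragment in a tight loop over loopback:
  ip6tables -A OUTPUT -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
  ./ip_defrag -6        # IPv6 case of the selftest, sent over loopback
  # Backlog drops show up in the second (hex) column of /proc/net/softnet_stat;
  # raising netdev_max_backlog from its default of 1000 makes them disappear,
  # which is what the patch under discussion does:
  awk '{ print $2 }' /proc/net/softnet_stat
  sysctl -w net.core.netdev_max_backlog=1250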