Date: Sun, 8 Jan 2023 18:34:01 +0100
From: Andrea Mayer
To: Jonathan Maxwell
Cc: Paolo Abeni, davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
 yoshfuji@linux-ipv6.org, dsahern@kernel.org, netdev@vger.kernel.org,
 linux-kernel@vger.kernel.org, Stefano Salsano, Paolo Lungaroni,
 Ahmed Abdelsalam, Andrea Mayer
Subject: Re: [net-next] ipv6: fix routing cache overflow for raw sockets
Message-Id: <20230108183401.3fa2b3ac27bde441265b4fca@uniroma2.it>
References: <20221218234801.579114-1-jmaxwell37@gmail.com>
 <9f145202ca6a59b48d4430ed26a7ab0fe4c5dfaf.camel@redhat.com>
 <20221223212835.eb9d03f3f7db22360e34341d@uniroma2.it>
 <20230103170711.819921d40132494b4bfd6a0d@uniroma2.it>
 <20230107002656.b732de6750a063d07cdb8a5f@uniroma2.it>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, 8 Jan 2023 10:46:09 +1100
Jonathan Maxwell wrote:

> On Sat, Jan 7, 2023 at 10:27 AM Andrea Mayer wrote:
> >
> > Hi Jon,
> > please see after, thanks.
> >
> > >
> > > > Any chance you could test this patch based on the latest net-next
> > > > kernel and let me know the result?
> > > >
> > > > diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
> > > > index 88ff7bb2bb9b..632086b2f644 100644
> > > > --- a/include/net/dst_ops.h
> > > > +++ b/include/net/dst_ops.h
> > > > @@ -16,7 +16,7 @@ struct dst_ops {
> > > >         unsigned short          family;
> > > >         unsigned int            gc_thresh;
> > > >
> > > > -       int                     (*gc)(struct dst_ops *ops);
> > > > +       void                    (*gc)(struct dst_ops *ops);
> > > >         struct dst_entry *      (*check)(struct dst_entry *, __u32 cookie);
> > > >         unsigned int            (*default_advmss)(const struct dst_entry *);
> > > >         unsigned int            (*mtu)(const struct dst_entry *);
> > > > diff --git a/net/core/dst.c b/net/core/dst.c
> > > > index 6d2dd03dafa8..31c08a3386d3 100644
> > > > --- a/net/core/dst.c
> > > > +++ b/net/core/dst.c
> > > > @@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
> > > >
> > > >         if (ops->gc &&
> > > >             !(flags & DST_NOCOUNT) &&
> > > > -           dst_entries_get_fast(ops) > ops->gc_thresh) {
> > > > -               if (ops->gc(ops)) {
> > > > -                       pr_notice_ratelimited("Route cache is full:
> > > > consider increasing sysctl net.ipv6.route.max_size.\n");
> > > > -                       return NULL;
> > > > -               }
> > > > -       }
> > > > +           dst_entries_get_fast(ops) > ops->gc_thresh)
> > > > +               ops->gc(ops);
> > > >
> > > >         dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
> > > >         if (!dst)
> > > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > > > index e74e0361fd92..b643dda68d31 100644
> > > > --- a/net/ipv6/route.c
> > > > +++ b/net/ipv6/route.c
> > > > @@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct
> > > > dst_entry *);
> > > >  static void            ip6_dst_destroy(struct dst_entry *);
> > > >  static void            ip6_dst_ifdown(struct dst_entry *,
> > > >                                        struct net_device *dev, int how);
> > > > -static int             ip6_dst_gc(struct dst_ops *ops);
> > > > +static void            ip6_dst_gc(struct dst_ops *ops);
> > > >
> > > >  static int             ip6_pkt_discard(struct sk_buff *skb);
> > > >  static int             ip6_pkt_discard_out(struct net *net, struct
> > > > sock *sk, struct sk_buff *skb);
> > > > @@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct
> > > > net_device *dev,
> > > >         return dst;
> > > >  }
> > > >
> > > > -static int ip6_dst_gc(struct dst_ops *ops)
> > > > +static void ip6_dst_gc(struct dst_ops *ops)
> > > >  {
> > > >         struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
> > > >         int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
> > > > -       int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
> > > >         int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
> > > >         int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
> > > >         unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
> > > > @@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
> > > >         int entries;
> > > >
> > > >         entries = dst_entries_get_fast(ops);
> > > > -       if (entries > rt_max_size)
> > > > +       if (entries > ops->gc_thresh)
> > > >                 entries = dst_entries_get_slow(ops);
> > > >
> > > > -       if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
> > > > -           entries <= rt_max_size)
> > > > +       if (time_after(rt_last_gc + rt_min_interval, jiffies))
> > > >                 goto out;
> > > >
> > > >         fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
> > > > @@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
> > > >  out:
> > > >         val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
> > > >         atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
> > > > -       return entries > rt_max_size;
> > > >  }
> > > >
> > > >  static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
> > > > @@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
> > > >  #endif
> > > >
> > > >         net->ipv6.sysctl.flush_delay = 0;
> > > > -       net->ipv6.sysctl.ip6_rt_max_size = 4096;
> > > > +       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
> > > >         net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
> > > >         net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
> > > >         net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
> > > >
> > >
> > > Yes, I will apply this patch in the next days and check how it deals with the
> > > seg6 subsystem. I will keep you posted.
> > >
> >
> > I applied the patch* to the net-next (HEAD 6bd4755c7c49) and did some tests on
> > the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem to
> > work fine.
>
> Thanks Andrea much appreciated. It worked fine in my raw socket tests as well.

You're welcome. Good!

> I'll look at submitting it soon.

Please let me know (keep me in cc).

> >
> > (*) I had to slightly edit the patch because of the code formatting, e.g.
> > some incorrect line breaks, spaces, etc.
> >
>
> Sorry about that, I should have sent it from git to avoid that.
>

Don't worry.

> Regards
>
> Jon
>

Ciao,
Andrea

> > > >
> > > > On Sat, Dec 24, 2022 at 6:38 PM Jonathan Maxwell wrote:
> > > > >
> > > > > On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer wrote:
> > > > > >
> > > > > > Hi Jon,
> > > > > > please see below, thanks.
> > > > > >
> > > > > > On Wed, 21 Dec 2022 08:48:11 +1100
> > > > > > Jonathan Maxwell wrote:
> > > > > >
> > > > > > > On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni wrote:
> > > > > > > >
> > > > > > > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > > > > > > > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > > > > > > > > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > > > > > > > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > > > > > > > > these warnings:
> > > > > > > > >
> > > > > > > > > [1] 99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > >
> > > > > > > > If I read correctly, the maximum number of dst that the raw socket can
> > > > > > > > use this way is limited by the number of packets it allows via the
> > > > > > > > sndbuf limit, right?
> > > > > > > >
> > > > > > >
> > > > > > > Yes, but in my test sndbuf limit is never hit so it clones a route for
> > > > > > > every packet.
> > > > > > >
> > > > > > > e.g:
> > > > > > >
> > > > > > > output from C program sending 5000000 packets via a raw socket.
> > > > > > >
> > > > > > > ip raw: total num pkts 5000000
> > > > > > >
> > > > > > > # bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
> > > > > > > Attaching 1 probe...
> > > > > > >
> > > > > > > @count[a.out]: 5000009
> > > > > > >
> > > > > > > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > > > > > > ipvs, seg6?
> > > > > > > >
> > > > > > >
> > > > > > > Any call to ip6_pol_route(s) where no res.nh->fib_nh_gw_family is 0 can do it.
> > > > > > > But we have only seen this for raw sockets so far.
> > > > > > > >
> > > > > > In the SRv6 subsystem, the seg6_lookup_nexthop() is used by some
> > > > > > cross-connecting behaviors such as End.X and End.DX6 to forward traffic to a
> > > > > > specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a nexthop)
> > > > > > different from the one carried by the IPv6 header. For this purpose,
> > > > > > seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.
> > > > > >
> > > > > Hi Andrea,
> > > > >
> > > > > Thanks for pointing that datapath out. The more generic approach we are
> > > > > taking bringing Ipv6 closer to Ipv4 in this regard should fix all instances
> > > > > of this.
> > > > >
> > > > >
> > > > > > > > > [1] 99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > >
> > > > > > I can reproduce the same warning messages reported by you, by instantiating an
> > > > > > End.X behavior whose nexthop is handled by a route for which there is no "via".
> > > > > > In this configuration, the ip6_pol_route() (called by seg6_lookup_nexthop())
> > > > > > triggers ip6_rt_cache_alloc() because i) the FLOWI_FLAG_KNOWN_NH is present ii)
> > > > > > and the res.nh->fib_nh_gw_family is 0 (as already pointed out).
> > > > > >
> > > > >
> > > > > Nice, when I get back after the holiday break I'll submit the next patch. It
> > > > > would be great if you could test the new patch and let me know how it works in
> > > > > your tests at that juncture. I'll keep you posted.
> > > > >
> > > > > Regards
> > > > >
> > > > > Jon
> > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Jon
> > > > > >
> > > > > > Ciao,
> > > > > > Andrea
> > >
> > >
> > > --
> > > Andrea Mayer
> >
> >
> > --
> > Andrea Mayer
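
For readers following the thread, the behavioral change in the quoted
dst_alloc() hunk can be sketched in user-space C. This is only a toy model
under stated assumptions, not kernel code: struct toy_dst_ops and every
toy_* function are hypothetical names, the gc here frees nothing (the worst
case), and gc_thresh = 1024 is an assumed value chosen for illustration. Only
the 4096 limit and the "gc result gates allocation" versus "gc result is
ignored" distinction come from the patch above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's struct dst_ops (hypothetical, user space only). */
struct toy_dst_ops {
	int entries;   /* entries currently allocated */
	int gc_thresh; /* gc is attempted above this count (assumed value) */
	int max_size;  /* hard limit, consulted only by the pre-patch model */
	int gc_runs;   /* how many times the toy gc was invoked */
};

/* Pre-patch shape: gc tells its caller whether the cache is still over
 * max_size. Assumption: this toy gc frees nothing. */
static bool toy_gc_old(struct toy_dst_ops *ops)
{
	ops->gc_runs++;
	return ops->entries > ops->max_size;
}

/* Post-patch shape: gc returns nothing, so it cannot veto an allocation. */
static void toy_gc_new(struct toy_dst_ops *ops)
{
	ops->gc_runs++;
}

/* Pre-patch allocation: fails once gc reports the cache as full. */
static bool toy_alloc_old(struct toy_dst_ops *ops)
{
	if (ops->entries > ops->gc_thresh && toy_gc_old(ops))
		return false; /* the "Route cache is full" path */
	ops->entries++;
	return true;
}

/* Post-patch allocation: gc still runs above gc_thresh, then we allocate. */
static bool toy_alloc_new(struct toy_dst_ops *ops)
{
	if (ops->entries > ops->gc_thresh)
		toy_gc_new(ops);
	ops->entries++;
	return true;
}

/* Try `attempts` allocations and return how many of them failed. */
static int toy_count_failures(struct toy_dst_ops *ops, int attempts, bool patched)
{
	int failures = 0;

	assert(attempts >= 0);
	for (int i = 0; i < attempts; i++) {
		bool ok = patched ? toy_alloc_new(ops) : toy_alloc_old(ops);

		if (!ok)
			failures++;
	}
	return failures;
}
```

Calling toy_count_failures() with 5000 attempts on a fresh structure
(gc_thresh 1024, max_size 4096) reports 903 failures for the unpatched model
and none for the patched one. In the real kernel, gc does release entries, so
the numbers will differ.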