Received: by 2002:a05:6358:c692:b0:131:369:b2a3 with SMTP id fe18csp1247370rwb; Wed, 26 Jul 2023 09:33:12 -0700 (PDT) X-Google-Smtp-Source: APBJJlGbOQE0YkZQXSjGlFQ4WbjL19kPb8wMlnVIO6c0Nvfj7flZaWKCzefKb2MD9OI4yJKP7Qjc X-Received: by 2002:a17:902:dac3:b0:1b8:a6be:ead3 with SMTP id q3-20020a170902dac300b001b8a6beead3mr2702889plx.57.1690389192402; Wed, 26 Jul 2023 09:33:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1690389192; cv=none; d=google.com; s=arc-20160816; b=aUzzxDc1aowK1f/Gocjc8ujX3c7Yz+NYYLoSMfy7iHoFEgA5V3F9irKDJrWxRzVNXZ 8WP28/IBoit4kKWL2b0xpUASvtvQxIJACi8XsYV2TvMS3BjUKsyRqkVPP0pALEEmFW51 xG0ZccucH8eeExTksGcM5fwjk8wZJO9Uhhjz6vRZsEkwBMfkmfPZREt1YLX275PPkgmx 4H+3mmgt/ExA0QC5uEfxvq7+2OWv2McF5O2WHIj/7Wfs9bcpwzpXa7s+Fc5iIYyIWweM 9/DHM50s8TPmrLeBTdMiwfqFyeW34YQoMJ/C5CLnMcACFW5D0Rw4JjHYKlk1enfWjs7J CD/g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=K7viDi95/DtqS7YumhA/QRcfR2/DcATgamUSnc6vM2A=; fh=Ke5maScHwjN6J59cocsTKLu/qb0KK7yOTgvF55t1yFE=; b=qntJR3BcpRqIjJlmcjJIteJqYZZG68f7XTOCk5wi1/WkMrQ6qdLana+w7fW1DIPc3+ XX93/NutXOliAkcaCsBNq1JCzquEmnDpecOkLKncMQg5vMG1snfg0rHLkSXFn/dsTDND AdgYyGIPza/OCJfJGUybbJ66dZWNSzLNBpgQQvjfyxlLAXkHySfzwRd3U7l+gS0l+KP/ X2OWwGymLRh8gsAtZIN5Uyupr2wSXooqY9U8FQQy3LwvntG1JUyJrmeTNze87QAQwH/f uSWM9rv8FD+XYP2XxCd33HuAsK3XNW0rSen1HT3wPmDDes2uMMcfPzYgGaxZu+OP3GJT pXcQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cloudflare.com header.s=google header.b=sxj07bNc; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=cloudflare.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id p5-20020a170902e74500b001bb2d95f2a7si683053plf.267.2023.07.26.09.32.58; Wed, 26 Jul 2023 09:33:12 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@cloudflare.com header.s=google header.b=sxj07bNc; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=cloudflare.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229598AbjGZQK3 (ORCPT + 99 others); Wed, 26 Jul 2023 12:10:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36578 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229980AbjGZQK1 (ORCPT ); Wed, 26 Jul 2023 12:10:27 -0400 Received: from mail-qt1-x82c.google.com (mail-qt1-x82c.google.com [IPv6:2607:f8b0:4864:20::82c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 221DD1FCF for ; Wed, 26 Jul 2023 09:10:24 -0700 (PDT) Received: by mail-qt1-x82c.google.com with SMTP id d75a77b69052e-403e7472b28so49806801cf.2 for ; Wed, 26 Jul 2023 09:10:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudflare.com; s=google; t=1690387823; x=1690992623; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=K7viDi95/DtqS7YumhA/QRcfR2/DcATgamUSnc6vM2A=; b=sxj07bNcg5678Si4gdEEp+WW40Oh8OI/eqiNU57JICAoq/E0Moh1ql4Pa0rJCD5mXQ owdUb7eyP+YcKFu9dUnu4PnUQ1PmNYO06AVBi6/Nq1AJjmy0x9dpbxtZCwdMGx8tDzkR u9NWYslhaCMLBzgQIrcdD92wnxlUg+s93/lZY= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690387823; x=1690992623; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=K7viDi95/DtqS7YumhA/QRcfR2/DcATgamUSnc6vM2A=; b=UH/ZFtv3tWLg0DW4ysd0dsq5URbSjRJHspac2vlsddLKyROHUso2UyYCVX9eJK73Fp jKoZDH8GimFfduUa4nfrl0sLnqIlQ/c27JbSiVJNYLMEkEIwNXgKrW1ee1W1iXmwTDy7 T5wXX7i3Nt5TxVP4b2SXEpZJuhxFez+g3ViHEAIDGSJSmFnuiq/RmNZVcrDxoahArFqy YsfhN0rcvwROtKSuBvqO/YvuUJLl0OAia+CQBpDeK8XZjhcxUH8qESGDNv7U7QQOfyiW ahVLexYoU8o8x4Ry/IBOp+3REBy/g4yhNWYjztN1LCV8ShHAFUP0ows1L2mn4KrUeWRq Ug7w== X-Gm-Message-State: ABy/qLbgJ380ZgZZPSF/H7y95wVXgPH35GMnbh+urD7bW9nPJIyNxPb4 FpkKFxcw7Jbt489PWF+aI/4MLw== X-Received: by 2002:ac8:5c02:0:b0:403:a63d:9a2e with SMTP id i2-20020ac85c02000000b00403a63d9a2emr3018341qti.10.1690387823149; Wed, 26 Jul 2023 09:10:23 -0700 (PDT) Received: from debian.debian ([140.141.197.139]) by smtp.gmail.com with ESMTPSA id fb14-20020a05622a480e00b003f7fd3ce69fsm4902747qtb.59.2023.07.26.09.10.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 26 Jul 2023 09:10:22 -0700 (PDT) Date: Wed, 26 Jul 2023 09:10:20 -0700 From: Yan Zhai To: Dan Carpenter Cc: Yan Zhai , bpf@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Mykola Lysenko , Shuah Khan , linux-kernel@vger.kernel.org, netdev@vger.kernel.org, linux-kselftest@vger.kernel.org, kernel-team@cloudflare.com, Jordan Griege , Markus Elfring , Jakub Sitnicki Subject: Re: [PATCH v4 bpf 1/2] bpf: fix skb_do_redirect return values Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 26, 2023 at 06:01:00PM +0300, Dan Carpenter wrote: > On Wed, Jul 26, 2023 at 07:14:56AM -0700, Yan Zhai wrote: > > On Wed, Jul 26, 2023 at 04:39:08PM +0300, Dan Carpenter wrote: > > > I'm not positive I understand the code in ip_finish_output2(). I think > > > instead of looking for LWTUNNEL_XMIT_DONE it should instead look for > > > != LWTUNNEL_XMIT_CONTINUE. It's unfortunate that NET_XMIT_DROP and > > > LWTUNNEL_XMIT_CONTINUE are the both 0x1. Why don't we just change that > > > instead? > > > > > I considered about changing lwt side logic. But it would bring larger > > impact since there are multiple types of encaps on this hook, not just > > bpf redirect. Changing bpf return values is a minimum change on the > > other hand. In addition, returning value of NET_RX_DROP and > > NET_XMIT_CN are the same, so if we don't do something in bpf redirect, > > there is no way to distinguish them later: the former is considered as > > an error, while "CN" is considered as non-error. > > Uh, NET_RX/XMIT_DROP values are 1. NET_XMIT_CN is 2. > > I'm not an expert but I think what happens is that we treat NET_XMIT_CN > as success so that it takes a while for the resend to happen. > Eventually the TCP layer will detect it as a dropped packet. > My eyes slipped lines. CN is 2. But the fact RX return value can be returned on a TX path still makes me feel unclean. Odds are low that we will have new statuses in future, it is a risk. I'd hope to contain these values only inside BPF redirect code as they are the reason why such rx values can show up there. Meanwhile, your argument do make good sense to me that the same problem may occur for other stuff. It is true. In fact, I just re-examined BPF-REROUTE path, it has the exact same issue by directly sending dst_output value back. So I would propose to do two things: 1. still convert BPF redirect ingress code to contain the propagation of mixed return. Return only TX side value instead, which is also what majority of those local senders are expecting. (I was wrong about positive values returned to sendmsg below btw, they are not). 2. change LWTUNNEL_XMIT_CONTINUE and check for this at xmit hook. > > > > > Also there seems to be a leak in lwtunnel_xmit(). Should that return > > > LWTUNNEL_XMIT_CONTINUE or should it call kfree_skb() before returning? > > > > > > Something like the following? > > > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > > index 11652e464f5d..375790b672bc 100644 > > > --- a/include/linux/netdevice.h > > > +++ b/include/linux/netdevice.h > > > @@ -112,6 +112,9 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev); > > > #define NET_XMIT_CN 0x02 /* congestion notification */ > > > #define NET_XMIT_MASK 0x0f /* qdisc flags in net/sch_generic.h */ > > > > > > +#define LWTUNNEL_XMIT_DONE NET_XMIT_SUCCESS > > > +#define LWTUNNEL_XMIT_CONTINUE 0x3 > > > + > > > /* NET_XMIT_CN is special. It does not guarantee that this packet is lost. It > > > * indicates that the device will soon be dropping packets, or already drops > > > * some packets of the same priority; prompting us to send less aggressively. */ > > > diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h > > > index 6f15e6fa154e..8ab032ee04d0 100644 > > > --- a/include/net/lwtunnel.h > > > +++ b/include/net/lwtunnel.h > > > @@ -16,12 +16,6 @@ > > > #define LWTUNNEL_STATE_INPUT_REDIRECT BIT(1) > > > #define LWTUNNEL_STATE_XMIT_REDIRECT BIT(2) > > > > > > -enum { > > > - LWTUNNEL_XMIT_DONE, > > > - LWTUNNEL_XMIT_CONTINUE, > > > -}; > > > - > > > - > > > struct lwtunnel_state { > > > __u16 type; > > > __u16 flags; > > > diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c > > > index 711cd3b4347a..732415d1287d 100644 > > > --- a/net/core/lwtunnel.c > > > +++ b/net/core/lwtunnel.c > > > @@ -371,7 +371,7 @@ int lwtunnel_xmit(struct sk_buff *skb) > > > > > > if (lwtstate->type == LWTUNNEL_ENCAP_NONE || > > > lwtstate->type > LWTUNNEL_ENCAP_MAX) > > > - return 0; > > > + return LWTUNNEL_XMIT_CONTINUE; > > > > You are correct this path would leak skb. Return continue (or drop) > > would avoid the leak. Personally I'd prefer drop instead to signal the > > error setup. Since this is a separate issue, do you want to send a > > separate patch on this? Or I am happy to do it if you prefer. > > > > I don't know which makes sense so I'll leave that up to you. > This conversation is juicy, I think we discovered two potential new problem sites (the leak here and the reroute path) :) > > > > > > ret = -EOPNOTSUPP; > > > rcu_read_lock(); > > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > > > index 6e70839257f7..4be50a211b14 100644 > > > --- a/net/ipv4/ip_output.c > > > +++ b/net/ipv4/ip_output.c > > > @@ -216,7 +216,7 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s > > > if (lwtunnel_xmit_redirect(dst->lwtstate)) { > > > int res = lwtunnel_xmit(skb); > > > > > > - if (res < 0 || res == LWTUNNEL_XMIT_DONE) > > > + if (res != LWTUNNEL_XMIT_CONTINUE) > > > return res; > > > > Unfortunately we cannot return res directly here when res > 0. This is > > the final reason why I didn't patch here. Return values here can be > > propagated back to sendmsg syscall, so returning a positive value > > would break the syscall convention. > > The neigh_output() function is going to return NET_XMIT_DROP so this > already happens. Is that not what we want to happen? > My bad, those return values are processed at ip_send_skb etc, while I was staring only at ip_local_out and beneath with my sleepy eyes. > I guess my concern is that eventually people will eventually new > introduce bugs. Fixing incorrect error codes is something that I do > several times per week. :P > > regards, > dan carpenter > >