From: Eric Dumazet
Date: Thu, 25 May 2023 15:24:20 +0200
Subject: Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
To: Lorenz Bauer
Cc: "David S. Miller", Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
    Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
    Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo,
    Jiri Olsa, David Ahern, Willem de Bruijn, Joe Stringer, Joe Stringer,
    Martin KaFai Lau, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
    bpf@vger.kernel.org
References: <20230525081923.8596-1-lmb@isovalent.com>
In-Reply-To: <20230525081923.8596-1-lmb@isovalent.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, May 25, 2023 at 10:19 AM Lorenz Bauer wrote:
>
> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
> sockets. This means we can't use the helper to steer traffic to Envoy, which
> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
> TPROXY from our setup.
>
> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
> helpers don't execute SK_REUSEPORT programs. Instead, one of the
> reuseport sockets is selected by hash. This could cause dispatch to the
> "wrong" socket:
>
>     sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
>     bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>
> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
> helpers unfortunately. In the tc context, L2 headers are at the start
> of the skb, while SK_REUSEPORT expects L3 headers instead.
>
> Instead, we execute the SK_REUSEPORT program when the assigned socket
> is pulled out of the skb, further up the stack.
> This creates some
> trickiness with regards to refcounting as bpf_sk_assign will put both
> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
> freed. We can infer that the sk_assigned socket is RCU freed if the
> reuseport lookup succeeds, but convincing yourself of this fact isn't
> straight forward. Therefore we defensively check refcounting on the
> sk_assign sock even though it's probably not required in practice.
>
> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
> Fixes: cf7fbe6 ("bpf: Add socket assign support")
> Co-developed-by: Daniel Borkmann
> Signed-off-by: Daniel Borkmann
> Signed-off-by: Lorenz Bauer
> Cc: Joe Stringer
> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
> ---
>  include/net/inet6_hashtables.h | 36 +++++++++++++++++++++++++++++-----
>  include/net/inet_hashtables.h  | 27 +++++++++++++++++++++++--
>  include/net/sock.h             |  7 +++++--
>  include/uapi/linux/bpf.h       |  3 ---
>  net/core/filter.c              |  2 --
>  net/ipv4/inet_hashtables.c     | 15 +++++++-------
>  net/ipv4/udp.c                 | 23 +++++++++++++++++++---
>  net/ipv6/inet6_hashtables.c    | 19 +++++++++---------
>  net/ipv6/udp.c                 | 23 +++++++++++++++++++---
>  tools/include/uapi/linux/bpf.h |  3 ---
>  10 files changed, 119 insertions(+), 39 deletions(-)

> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e7391bf310a7..920131e4a65d 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -332,10 +332,10 @@ static inline int compute_score(struct sock *sk, struct net *net,
>  	return score;
>  }
>
> -static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
> -					    struct sk_buff *skb, int doff,
> -					    __be32 saddr, __be16 sport,
> -					    __be32 daddr, unsigned short hnum)
> +struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> +				   struct sk_buff *skb, int doff,
> +				   __be32 saddr, __be16 sport,
> +				   __be32 daddr, unsigned short hnum)
>  {
>  	struct sock *reuse_sk = NULL;
>  	u32 phash;
> @@ -346,6 +346,7 @@ static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
>  	}
>  	return reuse_sk;
>  }
> +EXPORT_SYMBOL_GPL(inet_lookup_reuseport);
>
>  /*
>   * Here are some nice properties to exploit here. The BSD API
> @@ -369,8 +370,8 @@ static struct sock *inet_lhash2_lookup(struct net *net,
>  	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
>  		score = compute_score(sk, net, hnum, daddr, dif, sdif);
>  		if (score > hiscore) {
> -			result = lookup_reuseport(net, sk, skb, doff,
> -						  saddr, sport, daddr, hnum);
> +			result = inet_lookup_reuseport(net, sk, skb, doff,
> +						       saddr, sport, daddr, hnum);
>  			if (result)
>  				return result;
>

Please split this into a series.

First, a patch renaming lookup_reuseport() to inet_lookup_reuseport()
and inet6_lookup_reuseport() (cleanup, no change in behavior).

This would ease review and future bug hunting quite a bit.
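
For readers following the thread, here is a rough user-space sketch of the
dispatch problem described in the quoted commit message: picking a reuseport
group member by hash alone can differ from what an attached selector program
would have chosen. This is a toy model, not kernel code. Every name in it
(toy_sock, toy_group, select_by_hash, select_with_prog, pin_to_last) is
hypothetical, and the modulo reduction is an assumption made for simplicity
rather than what the kernel does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names exist in the kernel. */
#define GROUP_SIZE 4

struct toy_sock {
	int id;
};

struct toy_group {
	struct toy_sock socks[GROUP_SIZE];
	/*
	 * Stand-in for an attached SK_REUSEPORT program. Returns an index
	 * into socks[], or a negative value to fall back to the hash.
	 */
	int (*prog)(uint32_t flow_hash);
};

/*
 * Hash-only selection, roughly what the commit message says the lookup
 * helpers do. Plain modulo is a simplification chosen for this sketch.
 */
struct toy_sock *select_by_hash(struct toy_group *g, uint32_t flow_hash)
{
	assert(g != NULL);
	return &g->socks[flow_hash % GROUP_SIZE];
}

/* Run the selector first; fall back to the hash if it declines. */
struct toy_sock *select_with_prog(struct toy_group *g, uint32_t flow_hash)
{
	assert(g != NULL);
	if (g->prog) {
		int idx = g->prog(flow_hash);

		if (idx >= 0 && idx < GROUP_SIZE)
			return &g->socks[idx];
	}
	return select_by_hash(g, flow_hash);
}

/* Hypothetical policy: always steer to the last group member. */
int pin_to_last(uint32_t flow_hash)
{
	(void)flow_hash;
	return GROUP_SIZE - 1;
}
```

With pin_to_last installed as the selector, select_with_prog() and
select_by_hash() return different members for the same flow hash. That gap
is the "wrong" socket case the commit message refers to when a socket found
by hash is handed to bpf_sk_assign without the program having run.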