Received: by 2002:a05:6358:9144:b0:117:f937:c515 with SMTP id r4csp1022220rwr; Wed, 26 Apr 2023 09:10:20 -0700 (PDT) X-Google-Smtp-Source: AKy350ZDJ979Lw+G+9u5qDuVhc7+5CZKOa/+lNFZhBu/3x24AUiH1SOvQwenvQmyzS6tJjvGSKZg X-Received: by 2002:a17:90a:9747:b0:23f:b369:c159 with SMTP id i7-20020a17090a974700b0023fb369c159mr19296339pjw.49.1682525420258; Wed, 26 Apr 2023 09:10:20 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1682525420; cv=none; d=google.com; s=arc-20160816; b=Dm2DNpPJwIjB9rKgYkbIy6hKPizhaElAGvrsPGd+00g3U6shWUvE89kJ96of5fGVyS FPu+eOCk8BQ23KAFz3OsPR8zfKox1kOszldbwtbFl+4N4ZEybTh7TyjlKBYWM3Ctyhfp 8HJPkbqatchUxKSIDvHxq8K7VyTPzvm7bWtgEcFJq9E5Aoow8s5WIO2u8NUFc200n30z uJp67dDiKbAxaSB4+/in5Rg7R8Z8b+CPNgCMSdTUUEfqzfcmaLBFzyUn3UZn0bjminhU yzlrzUFef08No7sA4ChZO54yq595sj5VnZSIjj3kIH1sZ5yQWx/3kPcNkp7RMIql1tJC 8QQQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=JAZt4SwBq0kzUhXCO374DApwwiCqj65AMLJ7jYfFhVY=; b=ZDBPt/Awa3MXEP0H5XouBAdhVVOmNb3SIZQf3rzcKs8GPZK9+ukfe62qp4wuC5lj6w Iljhu2mnD2BVaoTSPYT5nCB0jAR9VAUzo1gAMr9E9pSO9e1zHg9Vc8FYQH0rNnvnWfhD 1NhOOgoGzbNCAMc6e9nImdbvrTT7154UJ7xTVZxgXKYhvPkK0lhrl2WWX+eeDvLmC3Ud NSYaOFcJzKYZkvIqX2TeH1/Xq4Yv23l7b5smRAGOVPaAuVcsCCN7Y7Nqkfh6R0TguGtN 17W9zAlrFoIjWtnfN8sWpg2Lc5gHuFhu941l7HEdMQKEWg+jbpkJajvCmI4q39wcNkfW 0UDQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b=wONEkvpf; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id y18-20020a17090aa41200b002473699ecc6si16907134pjp.32.2023.04.26.09.10.07; Wed, 26 Apr 2023 09:10:20 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b=wONEkvpf; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241283AbjDZPv0 (ORCPT + 99 others); Wed, 26 Apr 2023 11:51:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55286 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S241245AbjDZPvM (ORCPT ); Wed, 26 Apr 2023 11:51:12 -0400 Received: from mail-pl1-x62f.google.com (mail-pl1-x62f.google.com [IPv6:2607:f8b0:4864:20::62f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1096E61A9 for ; Wed, 26 Apr 2023 08:51:10 -0700 (PDT) Received: by mail-pl1-x62f.google.com with SMTP id d9443c01a7336-1a6c5acf6ccso55647505ad.3 for ; Wed, 26 Apr 2023 08:51:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1682524269; x=1685116269; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=JAZt4SwBq0kzUhXCO374DApwwiCqj65AMLJ7jYfFhVY=; b=wONEkvpfq/eASdg9qtRBkCqKPWIGN7jF9+5WyNuWNLb/Jtu4Y/jEn/IxoxPGIXv9Q5 Ev3gX0JaGekf02cXAI9Sibu8PvQhanwkfz/r4trzy0QIwtNaVcyGLkEu5mKrgf7rHtfk ca3DM0w8EIe0xAXzP4z7TNHaDXKrQCBk4fkg6CHLtiNb3pA7w+Bt6m4X6/hK+WyatIgR LFv463ZxN9D6wwr9/ammqIwg/pkcWmkuEHiOKhI7km6/JPefaTB/eiw/Foiq3GOdVAcV R2nCBd4tyihhOPhTwbaeyKwuwaVEZ3Cxw3xx9qpE9jrlV5kvsg4rRVqSlrR6dPF9+/zX kSeA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1682524269; x=1685116269; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=JAZt4SwBq0kzUhXCO374DApwwiCqj65AMLJ7jYfFhVY=; b=Oqp7ecY+aobygXDoomkiVKpy+jRUVnkREpEVWnrXnbuVbLDtvY3ZsBnMqX5PVaA13v LYVe7dUiEeuqxcJ6aPiTinETCRdoixblY2bDNLZjJR+cd2lu7O262MBA8mULOF5CVRYD fuaFPdg3VYJHZ4yJ/DZroPpDNfeg65myv3Fjn2V6iZ6Tr9m2jmerqmlvSiOJculDBS/q PdJygepaW67P+VFYeHrWJhQthPn8RL356lHxr+MTUYz7NsHQBAIkOsw0PJU+6+FtA/Kp 5iYapWNVrzQ79f8X1lZk9h0R6uEVT52kUXXVNBo85LraZ4g3VbH4CmjzHIcD8UqD2H47 I4yg== X-Gm-Message-State: AAQBX9duFSdbwvZfWuw3CSkTBKz3oNSQ5B0QnJQ2cQW1IP7HkDTbLxz+ VpO0iDeQfptyESu90BqtjTVUirkEohAH8jCQe4b/7w== X-Received: by 2002:a17:902:d582:b0:1a9:20ea:f49b with SMTP id k2-20020a170902d58200b001a920eaf49bmr22387086plh.24.1682524269347; Wed, 26 Apr 2023 08:51:09 -0700 (PDT) MIME-Version: 1.0 References: <20230413133355.350571-1-aleksandr.mikhalitsyn@canonical.com> <20230413133355.350571-3-aleksandr.mikhalitsyn@canonical.com> <4e177291-0f94-ab71-a982-f3e9f1f64280@gmail.com> In-Reply-To: <4e177291-0f94-ab71-a982-f3e9f1f64280@gmail.com> From: Stanislav Fomichev Date: Wed, 26 Apr 2023 08:50:58 -0700 Message-ID: Subject: Re: handling unsupported optlen in cgroup bpf getsockopt: (was [PATCH net-next v4 2/4] net: socket: add sockopts blacklist for BPF cgroup hook) To: Kui-Feng Lee Cc: Martin KaFai Lau , Eric Dumazet , davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, daniel@iogearbox.net, Jakub Kicinski , Paolo Abeni , Leon Romanovsky , David Ahern , Arnd Bergmann , Kees Cook , Christian Brauner , Kuniyuki Iwashima , Lennart Poettering , linux-arch@vger.kernel.org, Aleksandr Mikhalitsyn , bpf Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 25, 2023 at 2:11=E2=80=AFPM Kui-Feng Lee = wrote: > > > > On 4/25/23 11:42, Stanislav Fomichev wrote: > > On Tue, Apr 25, 2023 at 10:59=E2=80=AFAM Kui-Feng Lee wrote: > >> > >> > >> > >> On 4/18/23 09:47, Stanislav Fomichev wrote: > >>> On 04/17, Martin KaFai Lau wrote: > >>>> On 4/14/23 6:55 PM, Stanislav Fomichev wrote: > >>>>> On 04/13, Stanislav Fomichev wrote: > >>>>>> On Thu, Apr 13, 2023 at 7:38=E2=80=AFAM Aleksandr Mikhalitsyn > >>>>>> wrote: > >>>>>>> > >>>>>>> On Thu, Apr 13, 2023 at 4:22=E2=80=AFPM Eric Dumazet wrote: > >>>>>>>> > >>>>>>>> On Thu, Apr 13, 2023 at 3:35=E2=80=AFPM Alexander Mikhalitsyn > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Chris= tian), > >>>>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts= which > >>>>>>>>> install FDs into the process fdtable. > >>>>>>>>> > >>>>>>>>> After some offlist discussion it was proposed to add a blacklis= t of > >>>>>>>> > >>>>>>>> We try to replace this word by either denylist or blocklist, eve= n in changelogs. > >>>>>>> > >>>>>>> Hi Eric, > >>>>>>> > >>>>>>> Oh, I'm sorry about that. :( Sure. > >>>>>>> > >>>>>>>> > >>>>>>>>> socket options those can cause troubles when BPF cgroup hook is= enabled. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Can we find the appropriate Fixes: tag to help stable teams ? > >>>>>>> > >>>>>>> Sure, I will add next time. > >>>>>>> > >>>>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt ho= oks") > >>>>>>> > >>>>>>> I think it's better to add Stanislav Fomichev to CC. > >>>>>> > >>>>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We alread= y > >>>>>> use it for tcp zerocopy, I'm assuming it should work in this case = as > >>>>>> well? > >>>>> > >>>>> Jakub reminded me of the other things I wanted to ask here bug forg= ot: > >>>>> > >>>>> - setsockopt is probably not needed, right? setsockopt hook trigger= s > >>>>> before the kernel and shouldn't leak anything > >>>>> - for getsockopt, instead of bypassing bpf completely, should we in= stead > >>>>> ignore the error from the bpf program? that would still preser= ve > >>>>> the observability aspect > >>>> > >>>> stealing this thread to discuss the optlen issue which may make sens= e to > >>>> bypass also. > >>>> > >>>> There has been issue with optlen. Other than this older post related= to > >>>> optlen > PAGE_SIZE: > >>>> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@lin= ux.dev/, > >>>> the recent one related to optlen that we have seen is > >>>> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen =3D=3D 0 an= d the kernel > >>>> put the expected optlen (> 0) and 'return 0;' to userspace. The user= space > >>>> intention is to learn the expected optlen. This makes 'ctx.optlen > > >>>> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returni= ng > >>>> -EFAULT to the userspace even the bpf prog has not changed anything. > >>> > >>> (ignoring -EFAULT issue) this seems like it needs to be > >>> > >>> if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { > >>> /* error */ > >>> } > >>> > >>> ? > >>> > >>>> Does it make sense to also bypass the bpf prog when 'ctx.optlen > > >>>> max_optlen' for now (and this can use a separate patch which as usua= l > >>>> requires a bpf selftests)? > >>> > >>> Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or someth= ing > >>> seems like the way to go. It caused too much trouble already :-( > >>> > >>> Should I prepare a patch or do you want to take a stab at it? > >>> > >>>> In the future, does it make sense to have a specific cgroup-bpf-prog= (a > >>>> specific attach type?) that only uses bpf_dynptr kfunc to access the= optval > >>>> such that it can enforce read-only for some optname and potentially = also > >>>> track if bpf-prog has written a new optval? The bpf-prog can only re= turn 1 > >>>> (OK) and only allows using bpf_set_retval() instead. Likely there is= still > >>>> holes but could be a seed of thought to continue polishing the idea. > >>> > >>> Ack, let's think about it. > >>> > >>> Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' ide= a > >>> as well? If we can have a sleepable hook that can copy_from_user/copy= _to_user, > >>> and we have a mostly working bpf_getsockopt (after your refactoring), > >>> I don't see why we need to continue the current scheme of triggering > >>> after the kernel? > >> > >> Since a sleepable hook would cause some restrictions, perhaps, we coul= d > >> introduce something like the promise pattern. In our case here, BPF > >> program call an async version of copy_from_user()/copy_to_user() to > >> return a promise. > > > > Having a promise might work. This is essentially what we already do > > with sockets/etc with acquire/release pattern. > > > > What are the sleepable restrictions you're hinting about? I feel like > AFAIK, a sleepable program can use only some types of maps; for example, > array, hash, ringbuf,... etc. Other types of maps are unavailable to > sleepable programs; for example, sockmap, sockhash. Sure, but it doesn't have to stay that way. (hypothetically) If we think that sleepable makes sense, we can try to expand the scope. > > with the sleepable bpf, we can also remove all the temporary buffer > > management / extra copies which sounds like a win to me. (we have this > > ugly heuristics with BPF_SOCKOPT_KERN_BUF_SIZE) The program can > > allocate temporary buffers if needed.. > Agree! > > > > > >>>>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call= that > >>>>> gets called whenever bpf returns an error to make sure protoco= ls have > >>>>> a chance to handle that condition (and free the fd) > >>>>> > >>>> > >>>>