From: KP Singh
Date: Fri, 1 Apr 2022 00:32:10 +0200
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
To: Hao Luo, Jann Horn
Cc: Alexei Starovoitov, Kumar Kartikeya Dwivedi, Yonghong Song, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Song Liu, bpf, LKML
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 30, 2022 at 8:26 PM Hao Luo wrote:
>
> On Wed, Mar 30, 2022 at 11:16 AM Alexei Starovoitov wrote:
> >
> > On Wed, Mar 30, 2022 at 11:06 AM Hao Luo wrote:
> > >
> > > On Tue, Mar 29, 2022 at 4:30 PM Alexei Starovoitov wrote:
> > > >
> > > > On Tue, Mar 29, 2022 at 10:43:42AM -0700, Hao Luo wrote:
> > > > > On Tue, Mar 29, 2022 at 2:37 AM Kumar Kartikeya Dwivedi wrote:
> > > > > >
> > > > > > On Mon, Mar 28, 2022 at 11:16:15PM IST, Hao Luo wrote:
> > > > > > > On Mon, Mar 28, 2022 at 10:39 AM Hao Luo wrote:
> > > > > > > >
> > > > > > > > Hi Yonghong,
> > > > > > > >
> > > > > > > > On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song wrote:
> > > > > > > > >
> > > > > > > > > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > > > > > > > > Some map types support mmap operation, which allows userspace to
> > > > > > > > > > communicate with BPF programs directly. Currently only arraymap
> > > > > > > > > > and ringbuf have mmap implemented.
> > > > > > > > > >
> > > > > > > > > > However, in some use cases, when multiple program instances can
> > > > > > > > > > run concurrently, global mmapable memory can cause race. In that
> > > > > > > > > > case, userspace needs to provide necessary synchronizations to
> > > > > > > > > > coordinate the usage of mapped global data. This can be a source
> > > > > > > > > > of bottleneck.
> > > > > > > > >
> > > > > > > > > I can see your use case here. Each calling process can get the
> > > > > > > > > corresponding bpf program task local storage data through
> > > > > > > > > mmap interface. As you mentioned, there is a tradeoff
> > > > > > > > > between more memory vs. non-global synchronization.
> > > > > > > > >
> > > > > > > > > I am thinking that another bpf_iter approach can retrieve
> > > > > > > > > the similar result. We could implement a bpf_iter
> > > > > > > > > for task local storage map, optionally it can provide
> > > > > > > > > a tid to retrieve the data for that particular tid.
> > > > > > > > > This way, user space needs an explicit syscall, but
> > > > > > > > > does not need to allocate more memory than necessary.
> > > > > > > > >
> > > > > > > > > WDYT?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Thanks for the suggestion. I have two thoughts about bpf_iter + tid and mmap:
> > > > > > > >
> > > > > > > > - mmap prevents the calling task from reading other task's value.
> > > > > > > > Using bpf_iter, one can pass other task's tid to get their values. I
> > > > > > > > assume there are two potential ways of passing tid to bpf_iter: one is
> > > > > > > > to use global data in bpf prog, the other is adding tid parameterized
> > > > > > > > iter_link. For the first, it's not easy for unpriv tasks to use. For
> > > > > > > > the second, we need to create one iter_link object for each interested
> > > > > > > > tid. It may not be easy to use either.
> > > > > > > >
> > > > > > > > - Regarding adding an explicit syscall. I thought about adding
> > > > > > > > write/read syscalls for task local storage maps, just like reading
> > > > > > > > values from iter_link. Writing or reading task local storage map
> > > > > > > > updates/reads the current task's value. I think this could achieve the
> > > > > > > > same effect as mmap.
> > > > > > > >
> > > > > > >
> > > > > > > Actually, my use case of using mmap on task local storage is to allow
> > > > > > > userspace to pass FDs into bpf prog. Some of the helpers I want to add
> > > > > > > need to take an FD as parameter and the bpf progs can run
> > > > > > > concurrently, thus using global data is racy. Mmapable task local
> > > > > > > storage is the best solution I can find for this purpose.
> > > > > > >
> > > > > > > Song also mentioned to me offline, that mmapable task local storage
> > > > > > > may be useful for his use case.
> > > > > > >
> > > > > > > I am actually open to other proposals.
> > > > > > >
> > > > > >
> > > > > > You could also use a syscall prog, and use bpf_prog_test_run to update local
> > > > > > storage for current. Data can be passed for that specific prog invocation using
> > > > > > ctx. You might have to enable bpf_task_storage helpers in it though, since they
> > > > > > are not allowed to be called right now.
> > > > > >
> > > > >
> > > > > The loading process needs CAP_BPF to load bpf_prog_test_run. I'm
> > > > > thinking of allowing any thread including unpriv ones to be able to
> > > > > pass data to the prog and update their own storage.
> > > >
> > > > If I understand the use case correctly all of this mmap-ing is only to
> > > > allow unpriv userspace to access a priv map via unpriv mmap() syscall.
> > > > But the map can be accessed as unpriv already.
> > > > Pin it with the world read creds and do map_lookup sys_bpf cmd on it.
> > >
> > > Right, but, if I understand correctly, with
> > > sysctl_unprivileged_bpf_disabled, unpriv tasks are not able to make
> > > use of __sys_bpf(). Is there anything I missed?
> >
> > That sysctl is a heavy hammer. Let's fix it instead.
> > map lookup/update/delete can be allowed for unpriv for certain map types.
> > There are permissions checks in corresponding lookup/update calls already.
>

(Adding Jann)

I wonder if we can tag a map as BPF_F_UNPRIVILEGED and allow writes
only to maps that are explicitly marked as writable by unprivileged
processes.

We will also have task local storage in LSM programs that we would
not want unprivileged processes to write to.

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_UNPRIVILEGED);
        __type(key, int);
        __type(value, struct fd_storage);
} task_fd_storage_map SEC(".maps");

- KP

> This sounds great. If we can allow basic map operations for some map
> types, it will change many use cases I'm looking at. Let me take a
> look and report back.
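
To make the opt-in idea above concrete, here is a rough userspace sketch of the decision such a flag might drive. It is not kernel code. BPF_F_UNPRIVILEGED is only a proposal in this thread, so the flag name, its bit value, the two structs and the helper below are all hypothetical stand-ins, and treating CAP_BPF as the privilege boundary is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed BPF_F_UNPRIVILEGED flag.
 * No such flag exists upstream; the bit value is arbitrary. */
#define SKETCH_F_UNPRIVILEGED (1U << 31)

/* Stand-in for the subset of map state the check needs. */
struct sketch_map {
	uint32_t map_flags;
};

/* Stand-in for the caller's credentials (assumed: CAP_BPF == privileged). */
struct sketch_caller {
	bool has_cap_bpf;
};

/* Privileged callers may always write. Unprivileged callers may write
 * only to maps that explicitly opted in with SKETCH_F_UNPRIVILEGED. */
static bool sketch_map_write_allowed(const struct sketch_map *map,
				     const struct sketch_caller *caller)
{
	if (caller->has_cap_bpf)
		return true;
	return (map->map_flags & SKETCH_F_UNPRIVILEGED) != 0;
}
```

With a check of this shape, a map like task_fd_storage_map would accept writes from any task, while task storage used by LSM programs, created without the flag, would stay writable only by privileged callers.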