From: Hao Luo
Date: Mon, 28 Mar 2022 10:46:15 -0700
Subject: Re: [PATCH RFC bpf-next 0/2] Mmapable task local storage.
References: <20220324234123.1608337-1-haoluo@google.com> <9cdf860d-8370-95b5-1688-af03265cc874@fb.com>
To: Yonghong Song
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, KP Singh,
    Martin KaFai Lau, Song Liu, bpf@vger.kernel.org,
    linux-kernel@vger.kernel.org

On Mon, Mar 28, 2022 at 10:39 AM Hao Luo wrote:
>
> Hi Yonghong,
>
> On Fri, Mar 25, 2022 at 12:16 PM Yonghong Song wrote:
> >
> > On 3/24/22 4:41 PM, Hao Luo wrote:
> > > Some map types support the mmap operation, which allows userspace
> > > to communicate with BPF programs directly. Currently only arraymap
> > > and ringbuf have mmap implemented.
> > >
> > > However, in some use cases, when multiple program instances can
> > > run concurrently, global mmapable memory can cause races. In that
> > > case, userspace needs to provide the necessary synchronization to
> > > coordinate the use of the mapped global data, which can become a
> > > bottleneck.
> >
> > I can see your use case here. Each calling process can get the
> > corresponding bpf program task local storage data through the
> > mmap interface. As you mentioned, there is a tradeoff between
> > more memory and non-global synchronization.
> >
> > I am thinking that another bpf_iter approach could achieve a
> > similar result. We could implement a bpf_iter for the task local
> > storage map; optionally it could take a tid to retrieve the data
> > for that particular tid. This way, user space needs an explicit
> > syscall, but does not need to allocate more memory than necessary.
> >
> > WDYT?
> >
>
> Thanks for the suggestion. I have two thoughts about bpf_iter + tid
> and mmap:
>
> - mmap prevents the calling task from reading other tasks' values.
> With bpf_iter, one can pass another task's tid to get its value. I
> assume there are two potential ways of passing a tid to bpf_iter: one
> is to use global data in the bpf prog, the other is adding a
> tid-parameterized iter_link. The first is not easy for unpriv tasks
> to use. The second requires creating one iter_link object for each
> tid of interest. Neither may be easy to use.
>
> - Regarding adding an explicit syscall: I thought about adding
> write/read syscalls for task local storage maps, just like reading
> values from an iter_link. Writing or reading a task local storage map
> would update/read the current task's value. I think this could
> achieve the same effect as mmap.
>

Actually, my use case for mmap on task local storage is to allow
userspace to pass FDs into bpf progs. Some of the helpers I want to
add need to take an FD as a parameter, and the bpf progs can run
concurrently, so using global data is racy. Mmapable task local
storage is the best solution I can find for this purpose. Song also
mentioned to me offline that mmapable task local storage may be useful
for his use case. I am open to other proposals.
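To make that concrete, here is a rough userspace-side sketch of the
flow I have in mind (illustration only, not the selftest from patch 2;
the value layout and pin path are hypothetical, and the exact mmap
offset/length semantics are assumptions):

/* Sketch: a task writes the FD it wants the bpf prog to use into its
 * own, private task local storage via mmap. No cross-task
 * synchronization is needed because every task maps its own value. */
#include <sys/mman.h>
#include <unistd.h>
#include <bpf/bpf.h>

struct params { int fd; };	/* hypothetical value layout */

static int pass_fd_to_prog(int fd_to_pass)
{
	int map_fd = bpf_obj_get("/sys/fs/bpf/params_map"); /* assumed pin path */
	struct params *p;
	int err;

	if (map_fd < 0)
		return -1;

	p = mmap(NULL, sizeof(*p), PROT_READ | PROT_WRITE, MAP_SHARED,
		 map_fd, 0);
	if (p == MAP_FAILED) {
		close(map_fd);
		return -1;
	}

	p->fd = fd_to_pass;	/* read later by the bpf prog */
	err = munmap(p, sizeof(*p));
	close(map_fd);
	return err;
}
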
> > >
> > >
> > > It would be great to have mmapable local storage in that case.
> > > This patch adds that.
> > >
> > > Mmap isn't a BPF syscall, so unpriv users can also use it to
> > > interact with maps.
> > >
> > > Currently the only way of allocating a mmapable map area is using
> > > vmalloc(), and it's only used at map allocation time. Vmalloc()
> > > may sleep, so it's not suitable for maps that may allocate memory
> > > in an atomic context, such as local storage. Local storage uses
> > > kmalloc() with GFP_ATOMIC, which doesn't sleep. This patch uses
> > > kmalloc() with GFP_ATOMIC for the mmapable map area as well.
> > >
> > > Allocating mmapable memory requires page alignment, so we have to
> > > deliberately allocate more memory than necessary to obtain an
> > > address at which sdata->data is aligned on a page boundary. The
> > > calculation of the mmapable allocation size and the actual
> > > allocation/deallocation are packaged in three functions:
> > >
> > > - bpf_map_mmapable_alloc_size()
> > > - bpf_map_mmapable_kzalloc()
> > > - bpf_map_mmapable_kfree()
> > >
> > > BPF local storage uses them to provide a generic mmap API:
> > >
> > > - bpf_local_storage_mmap()
> > >
> > > And task local storage adds the mmap callback:
> > >
> > > - task_storage_map_mmap()
> > >
> > > When an application calls mmap on a task local storage map, it
> > > gets its own local storage.
> > >
> > > Overall, mmapable local storage trades memory for flexibility and
> > > efficiency. It brings memory fragmentation but can make programs
> > > stateless, which is useful in some cases.
> > >
> > > Hao Luo (2):
> > >   bpf: Mmapable local storage.
> > >   selftests/bpf: Test mmapable task local storage.
> > >
> > >  include/linux/bpf.h                           |  4 +
> > >  include/linux/bpf_local_storage.h             |  5 +-
> > >  kernel/bpf/bpf_local_storage.c                | 73 +++++++++++++++++--
> > >  kernel/bpf/bpf_task_storage.c                 | 40 ++++++++++
> > >  kernel/bpf/syscall.c                          | 67 +++++++++++++++++
> > >  .../bpf/prog_tests/task_local_storage.c       | 38 ++++++++++
> > >  .../bpf/progs/task_local_storage_mmapable.c   | 38 ++++++++++
> > >  7 files changed, 257 insertions(+), 8 deletions(-)
> > >  create mode 100644 tools/testing/selftests/bpf/progs/task_local_storage_mmapable.c
> > >
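For reference, the gist of the alignment trick described in the cover
letter is simply to over-allocate with kmalloc() (GFP_ATOMIC) so that
a page-aligned block of the needed size is guaranteed to fit inside
the allocation. Below is a simplified sketch of that calculation; it
is not the actual code of bpf_map_mmapable_kzalloc(), and the real
helpers also handle freeing via bpf_map_mmapable_kfree().

#include <linux/mm.h>
#include <linux/slab.h>

/* Simplified sketch of the over-allocation described above. The
 * caller keeps the raw pointer for kfree(); the page-aligned pointer
 * is the region intended to back mmap (sdata->data in the cover
 * letter's terms). */
static void *mmapable_kzalloc_sketch(size_t data_size, void **page_aligned)
{
	size_t mmap_size  = round_up(data_size, PAGE_SIZE);
	size_t alloc_size = mmap_size + PAGE_SIZE - 1;
	void *raw = kzalloc(alloc_size, GFP_ATOMIC);

	if (!raw)
		return NULL;

	/* First page-aligned address inside the raw allocation. */
	*page_aligned = (void *)round_up((unsigned long)raw, PAGE_SIZE);
	return raw;
}

The memory cost is the slack between data_size and the page-rounded
size, plus the alignment padding; that slack is the fragmentation
trade-off mentioned in the cover letter.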