From: Hao Luo
Date: Thu, 21 Jul 2022 14:07:34 -0700
Subject: Re: [PATCH bpf-next v3 4/8] bpf: Introduce cgroup iter
In-Reply-To: <6935de9f-49da-7703-55ae-241a8b838c6c@fb.com>
References: <20220709000439.243271-1-yosryahmed@google.com>
 <20220709000439.243271-5-yosryahmed@google.com>
 <370cb480-a427-4d93-37d9-3c6acd73b967@fb.com>
 <2a26b45d-6fab-b2a2-786e-5cb4572219ea@fb.com>
 <3f3ffe0e-d2ac-c868-a1bf-cdf1b58fd666@fb.com>
 <6935de9f-49da-7703-55ae-241a8b838c6c@fb.com>
To: Yonghong Song
Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
 Martin KaFai Lau, Song Liu, Tejun Heo, Zefan Li, Johannes Weiner,
 Shuah Khan, Michal Hocko, KP Singh, Benjamin Tissoires, John Fastabend,
 Michal Koutný, Roman Gushchin,
 David Rientjes, Stanislav Fomichev, Greg Thelen, Shakeel Butt,
 linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
 bpf@vger.kernel.org, cgroups@vger.kernel.org

On Thu, Jul 21, 2022 at 11:16 AM Yonghong Song wrote:
>
>
>
> On 7/21/22 10:21 AM, Hao Luo wrote:
> > On Thu, Jul 21, 2022 at 9:15 AM Yonghong Song wrote:
> >>
> >>
> >>
> >> On 7/20/22 5:40 PM, Hao Luo wrote:
> >>> On Mon, Jul 11, 2022 at 8:45 PM Yonghong Song wrote:
> >>>>
> >>>> On 7/11/22 5:42 PM, Hao Luo wrote:
> >>> [...]
> >>>>>>>> +
> >>>>>>>> +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> >>>>>>>> +{
> >>>>>>>> +	struct cgroup_iter_priv *p = seq->private;
> >>>>>>>> +
> >>>>>>>> +	mutex_lock(&cgroup_mutex);
> >>>>>>>> +
> >>>>>>>> +	/* support only one session */
> >>>>>>>> +	if (*pos > 0)
> >>>>>>>> +		return NULL;
> >>>>>>>
> >>>>>>> This might be okay. But I want to check what the practical
> >>>>>>> upper limit for cgroups in a system is, and whether we may miss
> >>>>>>> some cgroups. If that happens, it will be a surprise to the user.
> >>>>>>>
> >>>>>
> >>>>> Ok. What's the max number of items supported in a single session?
> >>>>
> >>>> The max number of items (cgroups) in a single session is determined
> >>>> by kernel_buffer_size, which equals 8 * PAGE_SIZE. So it really
> >>>> depends on how much data the bpf program intends to send to user
> >>>> space. If each bpf program run intends to send 64B to user space,
> >>>> e.g., for cpu, memory, cpu pressure, mem pressure, io pressure,
> >>>> read rate, write rate, read/write rate, then each session can
> >>>> support 512 cgroups.
> >>>>
> >>>
> >>> Hi Yonghong,
> >>>
> >>> Sorry about the late reply. It's possible that the number of cgroups
> >>> can be large, 1000+, in our production environment. But that may not
> >>> be common. Would it be good to leave handling a large number of
> >>> cgroups as a follow-up for this patch? If it turns out to be a
> >>> problem, to alleviate it, we could:
> >>>
> >>> 1. tell users to write programs that skip certain uninteresting cgroups.
> >>> 2. support requesting a larger kernel_buffer_size for bpf_iter, maybe
> >>> as a new bpf_iter flag.
> >>
> >> Currently, if we intend to support multiple read()s for cgroup_iter,
> >> the following is a very inefficient approach:
> >>
> >> in the seq_file private data structure, remember the last cgroup
> >> visited and, for the second read() syscall, do the traversal again
> >> (but without calling the bpf program) until the last cgroup, and
> >> proceed from there. This is inefficient but probably works. However,
> >> if the last cgroup is gone from the hierarchy, the above approach
> >> won't work. One possibility is to remember the last two cgroups. If
> >> the last cgroup is gone, check the 'next' cgroup based on the one
> >> before the last cgroup. If both are gone, we return NULL.
> >>
> >
> > I suspect in reality, just remembering the last cgroup (or two
> > cgroups) may not be sufficient. First, I don't want to hold
> > cgroup_mutex across multiple sessions.
> > I assume it's also not safe to
> > release cgroup_mutex in the middle of walking the cgroup hierarchy.
> > Supporting multiple read() can be nasty for cgroup_iter.
>
> Right, holding cgroup_mutex across sessions is not a good idea
> and I didn't recommend it either.
>
> I am aware that remembering the last (one or two) cgroups is not
> 100% reliable. All other iters have similar issues w.r.t. iterating
> across multiple sessions. But the idea is to find a *reasonable*
> start for the second and later sessions. For example, for a
> task-related iter, the previous session's last tid can be remembered,
> and the next session starts with the next tid after the last tid,
> based on the idr. Some old processes might be gone, but we get a
> reasonable approximation. But cgroup is different: holding the cgroup
> pointer with an additional reference across sessions is not good, and
> holding the cgroup id requires traversing the cgroup hierarchy again,
> which is not efficient. Maybe other people have a better idea how to
> do this.
>

Makes sense.

> >
> >> But in any case, if there are additional cgroups not visited,
> >> in the second read(), we should not return NULL, which indicates
> >> we are done with all cgroups. We may return EOPNOTSUPP to indicate
> >> there are missing cgroups because this is not supported.
> >>
> >> Once users see EOPNOTSUPP, which indicates there are missing
> >> cgroups, they can do more filtering in the bpf program to avoid
> >> sending a large data volume to user space.
> >>
> >
> > Makes sense. Yonghong, one question to confirm: if the first read()
> > overflows, does the user still get partial data?
>
> Yes, the first read() and subsequent read()s will be okay
> until user space receives all 8KB of data, where 8KB is the
> kernel buffer size. For example, if the user provides a 1KB buffer
> size, the first 8 read() syscalls will return proper data
> to user space.
>
> After that, read() should return
> EOPNOTSUPP instead of returning 0, where '0' indicates
> no more data.
>

Sounds good. Will do that.

> >
> >
> > I'll change the return code to EOPNOTSUPP in v4 of this patchset.
> >
> >> To provide a way to truly visit *all* cgroups,
> >> we can either use bpf_iter link_create->flags
> >> to increase the buffer size, as you suggested above, so the
> >> user can try to allocate a larger kernel buffer, or implement a
> >> proper second read() traversal, which I don't have a good idea
> >> how to do efficiently.
> >
> > I will try the buffer size increase first. Looks more doable. Do you
> > mind putting this support as a follow-up?
>
> If we cannot finalize the solution to completely support
> arbitrary user output for cgroup_iter, we need to support
> EOPNOTSUPP so the user knows they should adjust the bpf program
> to send less data to user space through the seq_file. For example,
> they can put data into an mmap-ed array map. Please explain
> this limitation and how to work around it clearly in the commit
> message.
>

Acknowledged. I will put a comment in the code and also explain it in
the commit message. Thanks!

> So yes, supporting a buffer size increase through link_create
> flags, or a better way to restart iteration after 8KB of
> user data, can be a follow-up.
>
> >
> >>>
> >>> Hao
> >>>
> >>>>>
> >>> [...]
> >>>>>>> [...]
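To make the filtering workaround discussed above concrete ("do more
filtering in the bpf program"), here is a minimal sketch of a cgroup
iterator program that emits output only for cgroups it cares about, so
that less data lands in the fixed-size seq_file buffer. It assumes the
bpf_iter__cgroup context and the "iter/cgroup" section name introduced
by this patch series; the two-level depth filter and the fields printed
are arbitrary examples, not part of the patch.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/cgroup")
int dump_cgroup(struct bpf_iter__cgroup *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct cgroup *cgrp = ctx->cgroup;

	/* A NULL cgroup signals the end of iteration. */
	if (!cgrp)
		return 0;

	/* Example filter: skip cgroups deeper than two levels so fewer
	 * bytes are written into the kernel's seq_file buffer.
	 */
	if (cgrp->level > 2)
		return 0;

	/* Emit cgroup id and level, one line per visited cgroup. */
	BPF_SEQ_PRINTF(seq, "%llu %d\n", cgrp->kn->id, cgrp->level);
	return 0;
}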
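On the user-space side, a hedged sketch of the read() semantics being
agreed on in this thread: read() returning 0 means the walk completed,
while -1 with errno == EOPNOTSUPP after the kernel buffer is exhausted
means the output was truncated. The helper name and buffer size are
illustrative; obtaining iter_fd (by attaching the iter link and calling
bpf_iter_create()) is assumed to happen elsewhere.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Drain a cgroup_iter file descriptor, distinguishing "all cgroups
 * visited" (read() == 0) from "kernel buffer exhausted before the walk
 * finished" (read() == -1 && errno == EOPNOTSUPP).
 */
static int drain_cgroup_iter(int iter_fd)
{
	char buf[1024];
	ssize_t n;

	while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);	/* consume BPF program output */

	if (n == 0)
		return 0;			/* complete iteration */

	if (errno == EOPNOTSUPP)
		fprintf(stderr, "cgroup_iter output truncated; filter more in BPF or request a bigger buffer\n");
	else
		perror("read");

	return -1;
}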