References: <20220709000439.243271-1-yosryahmed@google.com>
            <20220709000439.243271-5-yosryahmed@google.com>
            <370cb480-a427-4d93-37d9-3c6acd73b967@fb.com>
            <2a26b45d-6fab-b2a2-786e-5cb4572219ea@fb.com>
            <3f3ffe0e-d2ac-c868-a1bf-cdf1b58fd666@fb.com>
In-Reply-To: <3f3ffe0e-d2ac-c868-a1bf-cdf1b58fd666@fb.com>
From: Hao Luo
Date: Thu, 21 Jul 2022 10:21:53 -0700
Subject: Re: [PATCH bpf-next v3 4/8] bpf: Introduce cgroup iter
To: Yonghong Song
Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
    Martin KaFai Lau, Song Liu, Tejun Heo, Zefan Li, Johannes Weiner,
    Shuah Khan, Michal Hocko, KP Singh, Benjamin Tissoires,
    John Fastabend, Michal Koutný, Roman Gushchin, David Rientjes,
    Stanislav Fomichev, Greg Thelen, Shakeel Butt,
    linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    bpf@vger.kernel.org, cgroups@vger.kernel.org
On Thu, Jul 21, 2022 at 9:15 AM Yonghong Song wrote:
>
> On 7/20/22 5:40 PM, Hao Luo wrote:
> > On Mon, Jul 11, 2022 at 8:45 PM Yonghong Song wrote:
> >>
> >> On 7/11/22 5:42 PM, Hao Luo wrote:
> > [...]
> >>>>>> +
> >>>>>> +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> >>>>>> +{
> >>>>>> +	struct cgroup_iter_priv *p = seq->private;
> >>>>>> +
> >>>>>> +	mutex_lock(&cgroup_mutex);
> >>>>>> +
> >>>>>> +	/* support only one session */
> >>>>>> +	if (*pos > 0)
> >>>>>> +		return NULL;
> >>>>>
> >>>>> This might be okay. But I want to check what the practical upper
> >>>>> limit for cgroups in a system is, and whether we may miss some
> >>>>> cgroups. If that happens, it will be a surprise to the user.
> >>>>>
> >>>
> >>> Ok. What's the max number of items supported in a single session?
> >>
> >> The max number of items (cgroups) in a single session is determined
> >> by the kernel buffer size, which equals 8 * PAGE_SIZE. So it really
> >> depends on how much data the bpf program intends to send to user
> >> space. If each bpf program run sends 64B to user space (e.g., for
> >> cpu, memory, cpu pressure, mem pressure, io pressure, read rate,
> >> write rate, and read/write rate), then each session can support
> >> 512 cgroups (8 * 4096 / 64, assuming 4KB pages).
> >>
> >
> > Hi Yonghong,
> >
> > Sorry about the late reply. It's possible for the number of cgroups
> > to be large, 1000+, in our production environment, but that may not
> > be common. Would it be reasonable to leave handling large numbers
> > of cgroups as a follow-up to this patch? If it turns out to be a
> > problem, to alleviate it, we could:
> >
> > 1. tell users to write their programs to skip uninteresting cgroups.
> > 2. support requesting a larger kernel_buffer_size for bpf_iter,
> >    maybe as a new bpf_iter flag.
>
> Currently, if we intend to support multiple read()s for cgroup_iter,
> the following is a very inefficient approach:
>
> In the seq_file private data structure, remember the last cgroup
> visited, and on the second read() syscall, do the traversal again
> (but without calling the bpf program) until the last cgroup, then
> proceed from there. This is inefficient but probably works. However,
> if the last cgroup is gone from the hierarchy, the above approach
> won't work. One possibility is to remember the last two cgroups: if
> the last cgroup is gone, check the 'next' cgroup based on the one
> before the last; if both are gone, we return NULL.
>

I suspect that in reality, just remembering the last cgroup (or the
last two) may not be sufficient. First, I don't want to hold
cgroup_mutex across multiple sessions, and I assume it's also not
safe to release cgroup_mutex in the middle of walking the cgroup
hierarchy. Supporting multiple read()s could be nasty for cgroup_iter.
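To make that concrete, here is a minimal sketch of the
resume-from-last-cgroup idea (the last_css field and the resume loop
are made up for illustration; they are not part of this patch), which
also shows why the approach is O(n) per resume:

/* Hypothetical sketch only -- last_css and the resume logic below
 * are illustrative, not from the patch.
 */
struct cgroup_iter_priv {
	struct cgroup_subsys_state *start_css;
	/* css where the previous session stopped, pinned with
	 * css_get() so the pointer stays valid between read() calls
	 */
	struct cgroup_subsys_state *last_css;
};

static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct cgroup_iter_priv *p = seq->private;
	struct cgroup_subsys_state *css = NULL;

	mutex_lock(&cgroup_mutex);
	if (*pos == 0)
		return p->start_css;

	/* Resume: re-walk the pre-order traversal (without running
	 * the bpf program) until we pass the css where the previous
	 * session stopped. This is O(n) work per resume, and it
	 * still fails if last_css was removed from the hierarchy
	 * in between the two read() calls.
	 */
	while ((css = css_next_descendant_pre(css, p->start_css))) {
		if (css == p->last_css)
			return css_next_descendant_pre(css, p->start_css);
	}
	return NULL;
}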
> But in any case, if there are additional cgroups not visited, in the
> second read() we should not return NULL, which would indicate that
> we are done with all cgroups. We may return EOPNOTSUPP to indicate
> that some cgroups are missing because multiple sessions are not
> supported.
>
> Once users see EOPNOTSUPP, which indicates there are missing cgroups,
> they can do more filtering in the bpf program to avoid sending a
> large data volume to user space.
>

Makes sense. Yonghong, one question to confirm: if the first read()
overflows, does the user still get the partial data? I'll change the
return code to EOPNOTSUPP in v4 of this patchset.

> To provide a way to truly visit *all* cgroups, we can either use
> bpf_iter link_create->flags to increase the buffer size, as you
> suggested above, so the user can try to allocate a larger kernel
> buffer, or implement a proper second-read() traversal, which I
> don't have a good idea how to do efficiently.

I will try the buffer size increase first; it looks more doable. Do
you mind if this support goes in as a follow-up?
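For reference, roughly what I imagine the userspace side of the
buffer-size knob could look like. This is purely a hypothetical
sketch: the buf_pages field and the cgroup_stats program name are
made up and nothing like this exists in the current patchset; only
the bpf_iter_attach_opts plumbing is existing libbpf API.

/* Hypothetical sketch -- buf_pages and cgroup_stats are made up.
 * Assumes 'skel' is a generated skeleton and 'cgroup_fd' is an open
 * cgroup directory fd.
 */
union bpf_iter_link_info linfo = {};
struct bpf_link *link;

linfo.cgroup.cgroup_fd = cgroup_fd;	/* iterate this cgroup's subtree */
linfo.cgroup.buf_pages = 64;		/* made up: request 64 pages instead of 8 */

DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts,
	.link_info = &linfo,
	.link_info_len = sizeof(linfo));

link = bpf_program__attach_iter(skel->progs.cgroup_stats, &opts);
if (!link)
	fprintf(stderr, "attach_iter failed: %d\n", -errno);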