From: Yosry Ahmed
Date: Wed, 6 Jul 2022 14:29:51 -0700
Subject: Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
To: Yonghong Song
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko, Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking, bpf, Cgroups
References: <20220610194435.2268290-1-yosryahmed@google.com> <20220610194435.2268290-9-yosryahmed@google.com> <00df1932-38fe-c6f8-49d0-3a44affb1268@fb.com> <6dc9d46b-f1df-fb1d-8efd-580b7a6a7a6e@fb.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 1, 2022 at 5:55 PM Yonghong Song wrote:
>
> On 6/29/22 1:04 AM, Yosry Ahmed wrote:
> > On Tue, Jun 28, 2022 at 11:48 PM Yonghong Song wrote:
> >>
> >> On 6/28/22 5:09 PM, Yosry Ahmed wrote:
> >>> On Tue, Jun 28, 2022 at 12:14 AM Yosry Ahmed wrote:
> >>>>
> >>>> On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed wrote:
> >>>>>
> >>>>> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song wrote:
> >>>>>>
> >>>>>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> >>>>>>> Add a selftest that tests the whole workflow for collecting,
> >>>>>>> aggregating (flushing), and displaying cgroup hierarchical stats.
> >>>>>>>
> >>>>>>> TL;DR:
> >>>>>>> - Whenever reclaim happens, vmscan_start and vmscan_end update
> >>>>>>>   per-cgroup percpu readings, and tell rstat which (cgroup, cpu)
> >>>>>>>   pairs have updates.
> >>>>>>> - When userspace tries to read the stats, vmscan_dump calls rstat
> >>>>>>>   to flush the stats, and outputs the stats in text format to
> >>>>>>>   userspace (similar to cgroupfs stats).
> >>>>>>> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that
> >>>>>>>   has updates; vmscan_flush aggregates cpu readings and propagates
> >>>>>>>   updates to parents.
> >>>>>>>
> >>>>>>> Detailed explanation:
> >>>>>>> - The test loads tracing bpf programs, vmscan_start and vmscan_end,
> >>>>>>>   to measure the latency of cgroup reclaim. Per-cgroup readings
> >>>>>>>   are stored in percpu maps for efficiency. When a cgroup reading
> >>>>>>>   is updated on a cpu, cgroup_rstat_updated(cgroup, cpu) is called
> >>>>>>>   to add the cgroup to the rstat updated tree on that cpu.
> >>>>>>>
> >>>>>>> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a
> >>>>>>>   file for each cgroup. Reading this file invokes the program,
> >>>>>>>   which calls cgroup_rstat_flush(cgroup) to ask rstat to propagate
> >>>>>>>   the updates for all cpus and cgroups that have updates in this
> >>>>>>>   cgroup's subtree.
> >>>>>>>   Afterwards, the stats are exposed to the user. vmscan_dump
> >>>>>>>   returns 1 to terminate iteration early, so that we only expose
> >>>>>>>   stats for one cgroup per read.
> >>>>>>>
> >>>>>>> - An ftrace program, vmscan_flush, is also loaded and attached to
> >>>>>>>   bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is
> >>>>>>>   invoked once for each (cgroup, cpu) pair that has updates.
> >>>>>>>   cgroups are popped from the rstat tree in a bottom-up fashion,
> >>>>>>>   so calls will always be made for cgroups that have updates
> >>>>>>>   before their parents. The program aggregates percpu readings
> >>>>>>>   into a total per-cgroup reading, and also propagates them to
> >>>>>>>   the parent cgroup. After rstat flushing is over, all cgroups
> >>>>>>>   will have correct updated hierarchical readings (including all
> >>>>>>>   cpus and all their descendants).
> >>>>>>>
> >>>>>>> Signed-off-by: Yosry Ahmed
> >>>>>>
> >>>>>> There is a selftest failure with test:
> >>>>>>
> >>>>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>>>>> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> >>>>>> actual 0 <= expected 0
> >>>>>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> >>>>>> 781874 != expected 382092
> >>>>>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> >>>>>> -1 != expected -2
> >>>>>> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> >>>>>> 781874 != expected 781873
> >>>>>> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> >>>>>> expected 781874
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
> >>>>>> cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
> >>>>>> #33 cgroup_hierarchical_stats:FAIL
> >>>>>
> >>>>> The test is passing on my setup. I am trying to figure out if there
> >>>>> is something outside the setup done by the test that can cause the
> >>>>> test to fail.
> >>>>
> >>>> I can't reproduce the failure on my machine. It seems like, for some
> >>>> reason, reclaim is not invoked in one of the test cgroups, which
> >>>> results in the expected stats not being there. I have a few
> >>>> suspicions as to what might cause this, but I am not sure.
> >>>>
> >>>> If you have the capacity, do you mind re-running the test with the
> >>>> attached diff1.patch? (and maybe diff2.patch if that fails; this
> >>>> will cause OOMs in the test cgroup, and you might see some
> >>>> process-killed warnings).
> >>>> Thanks!
> >>>
> >>> In addition to that, it looks like one of the cgroups has a "0" stat,
> >>> which shouldn't happen unless one of the map update/lookup operations
> >>> failed, which should log something using bpf_printk. I need to
> >>> reproduce the test failure to investigate this properly. Did you
> >>> observe this failure on your machine or in CI? Any instructions on
> >>> how to reproduce, or details of the system setup?
> >>
> >> I got "0" as well.
> >>
> >> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> >> actual 0 <= expected 0
> >> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> >> 676612 != expected 339142
> >> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> >> -1 != expected -2
> >> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> >> 676612 != expected 676611
> >> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> >> expected 676612
> >>
> >> I don't have a special config. I am running on a qemu vm, similar to
> >> the ci environment, but it may have a slightly different config.
> >>
> >> The CI for this patch set won't work since the sleepable kfunc support
> >> patch is not available. Once you have that patch, bpf CI should be
> >> able to compile the patch set and run the tests.
> >>
> >
> > I will include this patch in the next version anyway, but I am trying
> > to find out why this selftest is failing for you before I send it out.
> > I am trying to reproduce the problem but no luck so far.
>
> I debugged this a little bit and found that these two programs
>
>   SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
>   int BPF_PROG(vmscan_start, struct lruvec *lruvec, struct scan_control *sc)
>
> and
>
>   SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
>   int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
>
> are not triggered.

Thanks so much for doing this. I am still failing to reproduce the
problem, so this is very useful. I believe that if those programs are
not triggered at all, then we are not walking the memcg reclaim path,
which shouldn't happen since we are setting memory.high to a limit and
then allocating more memory, which should trigger memcg reclaim.

I am looking at the code now, and there are some conditions that will
cause memory.high to not invoke reclaim (at least synchronously). Did
you try diff2.patch attached in the previous email?
It changes the test to use memory.max instead of memory.high; this will
cause an OOM kill of the test child process, but it should be a stronger
guarantee that reclaim happens and we hit
mm_vmscan_memcg_reclaim_begin/end().

If diff2.patch above works, is it okay to keep it? Is it okay to have
some test processes OOM killed during testing?

> I do have CONFIG_MEMCG enabled in my config file:
> ...
> CONFIG_MEMCG=y
> CONFIG_MEMCG_SWAP=y
> CONFIG_MEMCG_KMEM=
> ...
>
> Maybe when cgroup_rstat_flush() is called, some code path won't trigger
> mm_vmscan_memcg_reclaim_begin/end()?

cgroup_rstat_flush() should be completely separate in this regard, and
should not affect the code path that triggers
mm_vmscan_memcg_reclaim_begin/end().

> >>>
> >>>>
> >>>>>>
> >>>>>> Also, an existing test failed:
> >>>>>>
> >>>>>> btf_dump_data:PASS:find type id 0 nsec
> >>>>>> btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
> >>>>>> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> >>>>>> expected/actual match: actual '(union bpf_iter_link_info){.map =
> >>>>>> (struct){.map_fd = (__u32)1,},.cgroup '
> >>>>>> test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
> >>>>>
> >>>>> Yeah, I see what happened there. bpf_iter_link_info was changed by
> >>>>> the patch that introduced cgroup_iter, and this specific union is
> >>>>> used by the test to test the "union with nested struct" btf dumping.
> >>>>> I will add a patch in the next version that updates the
> >>>>> btf_dump_data test accordingly. Thanks.
> >>>>>
> >>>>>> test_btf_dump_struct_data:PASS:unexpected return value dumping
> >>>>>> sk_buff 0 nsec
> >>>>>> btf_dump_data:PASS:verify prefix match 0 nsec
> >>>>>> btf_dump_data:PASS:find type id 0 nsec
> >>>>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >>>>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >>>>>> btf_dump_data:PASS:verify prefix match 0 nsec
> >>>>>> btf_dump_data:PASS:find type id 0 nsec
> >>>>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >>>>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >>>>>>
> >>>>>> #21/14 btf_dump/btf_dump: struct_data:FAIL
> >>>>>>
> >>>>>> please take a look.
> >>>>>>
> >>>>>>> ---
> >>>>>>>  .../prog_tests/cgroup_hierarchical_stats.c | 351 ++++++++++++++++++
> >>>>>>>  .../bpf/progs/cgroup_hierarchical_stats.c  | 234 ++++++++++++
> >>>>>>>  2 files changed, 585 insertions(+)
> >>>>>>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> >>>>>>>  create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> >>>>>>>
> >> [...]
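[Editor's note] The memory.high vs. memory.max question in the thread comes down to cgroup v2 semantics: memory.high throttles the task and usually triggers synchronous memcg reclaim when exceeded, while memory.max is a hard cap whose breach forces reclaim and, failing that, an OOM kill. The reclaim-triggering step boils down to command fragments like the following (requires root and cgroup2 mounted at /sys/fs/cgroup; the cgroup name and 50M size are illustrative, not the selftest's actual values):

```shell
mkdir /sys/fs/cgroup/test
echo $((50 * 1024 * 1024)) > /sys/fs/cgroup/test/memory.high  # soft cap: throttle + reclaim
# echo $((50 * 1024 * 1024)) > /sys/fs/cgroup/test/memory.max # hard cap: reclaim, then OOM kill
echo $$ > /sys/fs/cgroup/test/cgroup.procs
# ...then allocate past the limit from this shell to force memcg reclaim.
```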