From: Ivan Babrou
Date: Fri, 14 Jul 2023 17:30:16 -0700
Subject: Re: Expensive memory.stat + cpu.stat reads
To: Waiman Long
Cc: Shakeel Butt, cgroups@vger.kernel.org, Linux MM, kernel-team,
    Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
    Andrew Morton, linux-kernel
References: <20230706062045.xwmwns7cm4fxd7iu@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou wrote:
> > My understanding of mem-stat and cpu-stat is that they are independent
> > of each other. In theory, reading one shouldn't affect the performance
> > of reading the other. Since you are doing mem-stat and cpu-stat reading
> > repetitively in a loop, it is likely that all the data are in the cache
> > most of the time, resulting in very fast processing. If it happens that
> > the specific memory locations of the mem-stat and cpu-stat data are
> > such that reading one causes the other data to be flushed out of the
> > cache and re-read from memory, you could see a significant performance
> > regression.
> >
> > It is one of the possible causes, but I may be wrong.
>
> Do you think it's somewhat similar to how iterating over a matrix by
> rows is faster than by columns due to sequential vs random memory reads?
>
> * https://stackoverflow.com/q/9936132
> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> * https://en.wikipedia.org/wiki/Loop_interchange
>
> I've had a similar suspicion, and it would be good to confirm whether
> it's that or something else. I can probably collect perf counters for
> different runs, but I'm not sure which ones I'll need.
>
> In a similar vein, if we could come up with a tracepoint that would
> tell us the amount of work done (or any other relevant metric that
> would help) during rstat flushing, I can certainly collect that
> information as well for every reading combination.

Since cgroup_rstat_flush_locked appears in flamegraphs for both the fast
(discrete) and slow (combined) cases, I grabbed some stats for it:

* Slow:

completed: 19.43s [manual / mem-stat + cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
^C
00:12:55

     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |
       128 -> 255        : 191      |************                            |
       256 -> 511        : 590      |****************************************|
       512 -> 1023       : 186      |************                            |
      1024 -> 2047       : 2        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 504      |**********************************      |
     16384 -> 32767      : 514      |**********************************      |
     32768 -> 65535      : 3        |                                        |
     65536 -> 131071     : 1        |                                        |

avg = 8852 usecs, total: 17633268 usecs, count: 1992

* Fast:

completed: 0.95s [manual / mem-stat]
completed: 0.05s [manual / cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
^C
00:13:27

     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 499      |****************************************|
         8 -> 15         : 253      |********************                    |
        16 -> 31         : 191      |***************                         |
        32 -> 63         : 41       |***                                     |
        64 -> 127        : 12       |                                        |
       128 -> 255        : 2        |                                        |
       256 -> 511        : 2        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 34       |**                                      |
     16384 -> 32767      : 21       |*                                       |

avg = 857 usecs, total: 904762 usecs, count: 1055

There's a different number of calls into cgroup_rstat_flush_locked, and
the calls are much slower in the slow case. There are also two bands in
the slow case, with the 8ms..32ms band containing half of the calls.
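Going back to the earlier point about measuring the amount of work done
during rstat flushing: a bpftrace sketch along the lines below (untested,
and assuming neither function gets inlined on the running kernel) should
produce a histogram of how many memcg flushes a single
cgroup_rstat_flush_locked call ends up doing:

$ sudo bpftrace -e '
  kprobe:cgroup_rstat_flush_locked { @in[tid] = 1; @work[tid] = 0; }
  kprobe:mem_cgroup_css_rstat_flush /@in[tid]/ { @work[tid]++; }
  kretprobe:cgroup_rstat_flush_locked /@in[tid]/ {
      // one histogram entry per completed top-level flush
      @flush_work = hist(@work[tid]);
      delete(@in[tid]); delete(@work[tid]);
  }'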
For mem_cgroup_css_rstat_flush:

* Slow:

completed: 32.77s [manual / mem-stat + cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
^C
00:21:25

     usecs               : count     distribution
         0 -> 1          : 93078    |*                                       |
         2 -> 3          : 3397714  |****************************************|
         4 -> 7          : 1009440  |***********                             |
         8 -> 15         : 168013   |*                                       |
        16 -> 31         : 93       |                                        |

avg = 3 usecs, total: 17189289 usecs, count: 4668338

* Fast:

completed: 0.16s [manual / mem-stat]
completed: 0.04s [manual / cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
^C
00:21:57

     usecs               : count     distribution
         0 -> 1          : 1441     |***                                     |
         2 -> 3          : 18780    |****************************************|
         4 -> 7          : 4826     |**********                              |
         8 -> 15         : 732      |*                                       |
        16 -> 31         : 1        |                                        |

avg = 3 usecs, total: 89174 usecs, count: 25780

There's a 181x difference in the number of calls into
mem_cgroup_css_rstat_flush. Does this provide a clue? Perhaps
cgroup_rstat_cpu_pop_updated is yielding a ton more iterations for some
reason here?

* https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196

It's inlined, but I can place a probe into the loop:

 7      for_each_possible_cpu(cpu) {
 8              raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
                                                       cpu);
10              struct cgroup *pos = NULL;
                unsigned long flags;

                /*
                 * The _irqsave() is needed because cgroup_rstat_lock is
                 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
                 * this lock with the _irq() suffix only disables interrupts on
                 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
                 * interrupts on both configurations. The _irqsave() ensures
                 * that interrupts are always disabled and later restored.
                 */
                raw_spin_lock_irqsave(cpu_lock, flags);
                while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
                        struct cgroup_subsys_state *css;

                        cgroup_base_stat_flush(pos, cpu);
26                      bpf_rstat_flush(pos, cgroup_parent(pos), cpu);

28                      rcu_read_lock();
29                      list_for_each_entry_rcu(css, &pos->rstat_css_list,
                                                rstat_css_node)
31                              css->ss->css_rstat_flush(css, cpu);
32                      rcu_read_unlock();
                }
34              raw_spin_unlock_irqrestore(cpu_lock, flags);

I added probes on both line 26 and line 31 to catch the middle and
inner loops.

* Slow:

completed: 32.97s [manual / mem-stat + cpu-stat]

 Performance counter stats for '/tmp/derp':

         4,702,570      probe:cgroup_rstat_flush_locked_L26
         9,301,436      probe:cgroup_rstat_flush_locked_L31

* Fast:

completed: 0.17s [manual / mem-stat]
completed: 0.34s [manual / cpu-stat]

 Performance counter stats for '/tmp/derp':

            31,769      probe:cgroup_rstat_flush_locked_L26
            62,849      probe:cgroup_rstat_flush_locked_L31

It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a lot
more positions. I'm going to sign off for the week, but let me know if
I should place any more probes to nail this down.
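P.S. In case anyone wants to reproduce this: line probes like the ones
above can be created with something like the following (a sketch; it
requires kernel debuginfo, and the relative line numbers reported by
"perf probe -L" may differ between kernel versions):

$ sudo perf probe -L cgroup_rstat_flush_locked    # list probe-able lines
$ sudo perf probe 'cgroup_rstat_flush_locked_L26=cgroup_rstat_flush_locked:26'
$ sudo perf probe 'cgroup_rstat_flush_locked_L31=cgroup_rstat_flush_locked:31'
$ sudo perf stat -e probe:cgroup_rstat_flush_locked_L26 \
       -e probe:cgroup_rstat_flush_locked_L31 /tmp/derp
$ sudo perf probe -d 'cgroup_rstat_flush_locked_L*'    # clean up the probes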