From: Johannes Weiner
To: Andrew Morton, Tejun Heo
Cc: Michal Hocko, Roman Gushchin, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 5/7] cgroup: rstat: punt root-level optimization to individual controllers
Date: Tue, 2 Feb 2021 13:47:44 -0500
Message-Id: <20210202184746.119084-6-hannes@cmpxchg.org>
In-Reply-To: <20210202184746.119084-1-hannes@cmpxchg.org>
References: <20210202184746.119084-1-hannes@cmpxchg.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Current users of the rstat code can source root-level statistics from
the native counters of their respective subsystem, allowing them to
forego aggregation at the root level. This optimization is currently
implemented inside the generic rstat code, which doesn't track the root
cgroup and doesn't invoke the subsystem flush callbacks on it.

However, the memory controller cannot do this optimization, because
cgroup1 breaks out memory specifically for the local level, including
at the root level. In preparation for the memory controller switching
to rstat, move the optimization from rstat core to the controllers.

Afterwards, rstat will always track the root cgroup for changes and
invoke the subsystem callbacks on it; and it's up to the subsystem to
special-case and skip aggregation of the root cgroup if it can source
this information through other, cheaper means.

The extra cost of tracking the root cgroup is negligible: on stat
changes, we actually remove a branch that checks for the root. The
queueing for a flush touches only per-cpu data, and only the first
stat change since a flush requires a (per-cpu) lock.

Signed-off-by: Johannes Weiner
---
 block/blk-cgroup.c    | 14 +++++++---
 kernel/cgroup/rstat.c | 60 +++++++++++++++++++++++++------------------
 2 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 02ce2058c14b..76725e1cad7f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -766,6 +766,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	struct blkcg *blkcg = css_to_blkcg(css);
 	struct blkcg_gq *blkg;
 
+	/* Root-level stats are sourced from system-wide IO stats */
+	if (!cgroup_parent(css->cgroup))
+		return;
+
 	rcu_read_lock();
 
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -789,6 +793,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		u64_stats_update_end(&blkg->iostat.sync);
 
 		/* propagate global delta to parent */
+		/* XXX: could skip this if parent is root */
 		if (parent) {
 			u64_stats_update_begin(&parent->iostat.sync);
 			blkg_iostat_set(&delta, &blkg->iostat.cur);
@@ -803,10 +808,11 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 }
 
 /*
- * The rstat algorithms intentionally don't handle the root cgroup to avoid
- * incurring overhead when no cgroups are defined. For that reason,
- * cgroup_rstat_flush in blkcg_print_stat does not actually fill out the
- * iostat in the root cgroup's blkcg_gq.
+ * We source root cgroup stats from the system-wide stats to avoid
+ * tracking the same information twice and incurring overhead when no
+ * cgroups are defined. For that reason, cgroup_rstat_flush in
+ * blkcg_print_stat does not actually fill out the iostat in the root
+ * cgroup's blkcg_gq.
  *
  * However, we would like to re-use the printing code between the root and
  * non-root cgroups to the extent possible. For that reason, we simulate
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index faa767a870ba..6f50c199bf2a 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -25,13 +25,8 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 {
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
-	struct cgroup *parent;
 	unsigned long flags;
 
-	/* nothing to do for root */
-	if (!cgroup_parent(cgrp))
-		return;
-
 	/*
 	 * Speculative already-on-list test. This may race leading to
 	 * temporary inaccuracies, which is fine.
@@ -46,10 +41,10 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 	raw_spin_lock_irqsave(cpu_lock, flags);
 
 	/* put @cgrp and all ancestors on the corresponding updated lists */
-	for (parent = cgroup_parent(cgrp); parent;
-	     cgrp = parent, parent = cgroup_parent(cgrp)) {
+	while (true) {
 		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
-		struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu);
+		struct cgroup *parent = cgroup_parent(cgrp);
+		struct cgroup_rstat_cpu *prstatc;
 
 		/*
 		 * Both additions and removals are bottom-up. If a cgroup
@@ -58,8 +53,16 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 		if (rstatc->updated_next)
 			break;
 
+		if (!parent) {
+			rstatc->updated_next = cgrp;
+			break;
+		}
+
+		prstatc = cgroup_rstat_cpu(parent, cpu);
 		rstatc->updated_next = prstatc->updated_children;
 		prstatc->updated_children = cgrp;
+
+		cgrp = parent;
 	}
 
 	raw_spin_unlock_irqrestore(cpu_lock, flags);
@@ -113,23 +116,26 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	 */
 	if (rstatc->updated_next) {
 		struct cgroup *parent = cgroup_parent(pos);
-		struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu);
-		struct cgroup_rstat_cpu *nrstatc;
-		struct cgroup **nextp;
-
-		nextp = &prstatc->updated_children;
-		while (true) {
-			nrstatc = cgroup_rstat_cpu(*nextp, cpu);
-			if (*nextp == pos)
-				break;
-
-			WARN_ON_ONCE(*nextp == parent);
-			nextp = &nrstatc->updated_next;
+
+		if (parent) {
+			struct cgroup_rstat_cpu *prstatc;
+			struct cgroup **nextp;
+
+			prstatc = cgroup_rstat_cpu(parent, cpu);
+			nextp = &prstatc->updated_children;
+			while (true) {
+				struct cgroup_rstat_cpu *nrstatc;
+
+				nrstatc = cgroup_rstat_cpu(*nextp, cpu);
+				if (*nextp == pos)
+					break;
+				WARN_ON_ONCE(*nextp == parent);
+				nextp = &nrstatc->updated_next;
+			}
+			*nextp = rstatc->updated_next;
 		}
 
-		*nextp = rstatc->updated_next;
 		rstatc->updated_next = NULL;
-
 		return pos;
 	}
 
@@ -309,11 +315,15 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 {
-	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_base_stat cur, delta;
 	unsigned seq;
 
+	/* Root-level stats are sourced from system-wide CPU stats */
+	if (!parent)
+		return;
+
 	/* fetch the current per-cpu values */
 	do {
 		seq = __u64_stats_fetch_begin(&rstatc->bsync);
@@ -326,8 +336,8 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 	cgroup_base_stat_add(&cgrp->bstat, &delta);
 	cgroup_base_stat_add(&rstatc->last_bstat, &delta);
 
-	/* propagate global delta to parent */
-	if (parent) {
+	/* propagate global delta to parent (unless that's root) */
+	if (cgroup_parent(parent)) {
 		delta = cgrp->bstat;
 		cgroup_base_stat_sub(&delta, &cgrp->last_bstat);
 		cgroup_base_stat_add(&parent->bstat, &delta);
-- 
2.30.0
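
For readers following the list manipulation, here is a minimal userspace sketch of what the reworked loop in cgroup_rstat_updated() does once the root is tracked, and of a controller-side flush that skips the root on its own. It is not kernel code: struct node, node_init(), node_updated() and node_flush() are hypothetical names, one node stands in for a cgroup together with its rstat state for a single CPU, and locking and per-cpu handling are left out. It assumes, as the patch context suggests, that a child list is terminated by its owner and that a queued root points updated_next at itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model (hypothetical names): one node stands in for a cgroup plus
 * its rstat state for a single CPU. No locking, no per-cpu data.
 */
struct node {
	struct node *parent;           /* NULL for the root */
	struct node *updated_children; /* list head, terminated by self */
	struct node *updated_next;     /* NULL when not queued; otherwise the
					* next sibling, the parent as list
					* terminator, or self for the root */
	long pending;                  /* delta not yet aggregated */
	long total;                    /* aggregated value */
};

static void node_init(struct node *n, struct node *parent)
{
	n->parent = parent;
	n->updated_children = n;
	n->updated_next = NULL;
	n->pending = 0;
	n->total = 0;
}

/* Mirrors the patched loop: queue @n and every ancestor, root included. */
static void node_updated(struct node *n)
{
	assert(n);
	while (true) {
		struct node *parent = n->parent;

		/* Already queued, so all ancestors are queued too. */
		if (n->updated_next)
			break;

		/* The root has no parent list; a self-link marks it queued. */
		if (!parent) {
			n->updated_next = n;
			break;
		}

		/* Push @n onto the front of the parent's updated list. */
		n->updated_next = parent->updated_children;
		parent->updated_children = n;
		n = parent;
	}
}

/*
 * Controller-side flush: the subsystem, not the core, decides to skip
 * the root when it has a cheaper source for root-level numbers.
 * Returns true if the node was aggregated.
 */
static bool node_flush(struct node *n)
{
	if (!n->parent)
		return false;
	n->total += n->pending;
	n->pending = 0;
	return true;
}
```

With a tree root -> a -> {b, c}, calling node_updated(&b) queues b, a and root in one pass; a later node_updated(&c) stops at a because a is already queued. A controller that does need root-level aggregation, as the commit message describes for memcg on cgroup1, would simply omit the early return in its flush callback.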