From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, "Rafael J .
 Wysocki", Viresh Kumar, Vincent Guittot, Paul Turner, Quentin Perret,
 Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes,
 Steve Muckle, Suren Baghdasaryan
Subject: [PATCH v4 08/16] sched/core: uclamp: propagate parent clamps
Date: Tue, 28 Aug 2018 14:53:16 +0100
Message-Id: <20180828135324.21976-9-patrick.bellasi@arm.com>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180828135324.21976-1-patrick.bellasi@arm.com>
References: <20180828135324.21976-1-patrick.bellasi@arm.com>
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but are still (potentially) constrained by the resources assigned to
its parent. This requires properly propagating and aggregating parent
attributes down to its descendants.

Let's implement this mechanism by adding a new "effective" clamp value for
each task group. The effective clamp value is defined as the smaller of
the clamp value of a group and the effective clamp value of its parent.
This is also the clamp value actually used to clamp tasks in each task
group.

Since it can be useful for tasks in a cgroup to know exactly which
configuration is currently propagated/enforced, the effective clamp values
are exposed to user-space by means of a new pair of read-only attributes:
cpu.util.{min,max}.effective.

Signed-off-by: Patrick Bellasi
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Rafael J. Wysocki
Cc: Viresh Kumar
Cc: Suren Baghdasaryan
Cc: Todd Kjos
Cc: Joel Fernandes
Cc: Juri Lelli
Cc: Quentin Perret
Cc: Dietmar Eggemann
Cc: Morten Rasmussen
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes in v4:
 Message-ID: <20180816140731.GD2960@e110439-lin>
 - add ".effective" attributes to the default hierarchy
 Others:
 - small documentation fixes
 - rebased on v4.19-rc1

Changes in v3:
 Message-ID: <20180409222417.GK3126663@devbig577.frc2.facebook.com>
 - new patch in v3, to implement a suggestion from v1 review
---
 Documentation/admin-guide/cgroup-v2.rst |  25 +++++-
 include/linux/sched.h                   |   8 ++
 kernel/sched/core.c                     | 112 +++++++++++++++++++++++-
 3 files changed, 139 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 80ef7bdc517b..72272f58d304 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,22 +976,43 @@ All time durations are in microseconds.
 	A read-write single value file which exists on non-root cgroups.
 	The default is "0", i.e. no bandwidth boosting.
 
-	The minimum utilization in the range [0, 1023].
+	The requested minimum utilization in the range [0, 1023].
 
 	This interface allows reading and setting minimum utilization clamp
 	values similar to the sched_setattr(2). This minimum utilization
 	value is used to clamp the task specific minimum utilization clamp.
 
+  cpu.util.min.effective
+	A read-only single value file which exists on non-root cgroups and
+	reports minimum utilization clamp value currently enforced on a task
+	group.
+
+	The actual minimum utilization in the range [0, 1023].
+
+	This value can be lower than cpu.util.min in case a parent cgroup
+	is enforcing a more restrictive clamping on minimum utilization.
+
   cpu.util.max
 	A read-write single value file which exists on non-root cgroups.
 	The default is "1023". i.e. no bandwidth clamping
 
-	The maximum utilization in the range [0, 1023].
+	The requested maximum utilization in the range [0, 1023].
 
 	This interface allows reading and setting maximum utilization clamp
 	values similar to the sched_setattr(2). This maximum utilization
 	value is used to clamp the task specific maximum utilization clamp.
 
+  cpu.util.max.effective
+	A read-only single value file which exists on non-root cgroups and
+	reports maximum utilization clamp value currently enforced on a task
+	group.
+
+	The actual maximum utilization in the range [0, 1023].
+
+	This value can be lower than cpu.util.max in case a parent cgroup
+	is enforcing a more restrictive clamping on max utilization.
+
+
 Memory
 ------
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc39b67a366a..2da130d17e70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -591,6 +591,14 @@ struct sched_dl_entity {
 struct uclamp_se {
 	unsigned int value;
 	unsigned int group_id;
+	/*
+	 * Effective task (group) clamp value.
+	 * For task groups, this is the value (possibly) enforced by a parent
+	 * task group.
+	 */
+	struct {
+		unsigned int value;
+	} effective;
 };
 
 union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dcbf22abd0bf..b2d438b6484b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1254,6 +1254,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
 
 	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
 		uc_se = &tg->uclamp[clamp_id];
+		uc_se->effective.value =
+			parent->uclamp[clamp_id].effective.value;
 		uc_se->value = parent->uclamp[clamp_id].value;
 		uc_se->group_id = parent->uclamp[clamp_id].group_id;
 	}
@@ -1415,6 +1417,7 @@ static void __init init_uclamp(void)
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 		/* Init root TG's clamp group */
 		uc_se = &root_task_group.uclamp[clamp_id];
+		uc_se->effective.value = uclamp_none(clamp_id);
 		uc_se->value = uclamp_none(clamp_id);
 		uc_se->group_id = 0;
 #endif
@@ -7226,6 +7229,68 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * cpu_util_update_hier: propagate effective clamp down the hierarchy
+ * @css: the task group to update
+ * @clamp_id: the clamp index to update
+ * @value: the new task group clamp value
+ *
+ * The effective clamp for a TG is expected to track the most restrictive
+ * value between the TG's clamp value and its parent's effective clamp value.
+ * This method achieves that by:
+ * 1. updating the current TG effective value
+ * 2. walking all the descendant task groups that need an update
+ *
+ * A TG's effective clamp needs to be updated when its current value is not
+ * matching the TG's clamp value. In this case indeed either:
+ * a) the parent has got a more relaxed clamp value
+ *    thus potentially we can relax the effective value for this group
+ * b) the parent has got a more strict clamp value
+ *    thus potentially we have to restrict the effective value of this group
+ *
+ * Restriction and relaxation of current TG's effective clamp values need to
+ * be propagated down to all the descendants. When a subgroup is found which
+ * has already its effective clamp value matching its clamp value, then we can
+ * safely skip all its descendants which are guaranteed to be already in sync.
+ */
+static void cpu_util_update_hier(struct cgroup_subsys_state *css,
+				 int clamp_id, int value)
+{
+	struct cgroup_subsys_state *top_css = css;
+	struct uclamp_se *uc_se, *uc_parent;
+
+	css_for_each_descendant_pre(css, top_css) {
+		/*
+		 * The first visited task group is top_css, whose clamp value
+		 * is the one passed as parameter. For descendent task
+		 * groups we consider their current value.
+		 */
+		uc_se = &css_tg(css)->uclamp[clamp_id];
+		if (css != top_css)
+			value = uc_se->value;
+		/*
+		 * Skip the whole subtrees if the current effective clamp is
+		 * already matching the TG's clamp value.
+		 * In this case, all the subtrees already have this value, or a
+		 * more restrictive one, as effective clamp.
+		 */
+		uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+		if (uc_se->effective.value == value &&
+		    uc_parent->effective.value >= value) {
+			css = css_rightmost_descendant(css);
+			continue;
+		}
+
+		/* Propagate the most restrictive effective value */
+		if (uc_parent->effective.value < value)
+			value = uc_parent->effective.value;
+		if (uc_se->effective.value == value)
+			continue;
+
+		uc_se->effective.value = value;
+	}
+}
+
 static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
 				  struct cftype *cftype, u64 min_value)
 {
@@ -7245,6 +7310,9 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
 	if (tg->uclamp[UCLAMP_MAX].value < min_value)
 		goto out;
 
+	/* Update effective clamps to track the most restrictive value */
+	cpu_util_update_hier(css, UCLAMP_MIN, min_value);
+
 out:
 	rcu_read_unlock();
 
@@ -7270,6 +7338,9 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
 	if (tg->uclamp[UCLAMP_MIN].value > max_value)
 		goto out;
 
+	/* Update effective clamps to track the most restrictive value */
+	cpu_util_update_hier(css, UCLAMP_MAX, max_value);
+
 out:
 	rcu_read_unlock();
 
@@ -7277,14 +7348,17 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
 }
 
 static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
-				  enum uclamp_id clamp_id)
+				  enum uclamp_id clamp_id,
+				  bool effective)
 {
 	struct task_group *tg;
 	u64 util_clamp;
 
 	rcu_read_lock();
 	tg = css_tg(css);
-	util_clamp = tg->uclamp[clamp_id].value;
+	util_clamp = effective
+		? tg->uclamp[clamp_id].effective.value
+		: tg->uclamp[clamp_id].value;
 	rcu_read_unlock();
 
 	return util_clamp;
@@ -7293,13 +7367,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
 static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
 				 struct cftype *cft)
 {
-	return cpu_uclamp_read(css, UCLAMP_MIN);
+	return cpu_uclamp_read(css, UCLAMP_MIN, false);
 }
 
 static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
 				 struct cftype *cft)
 {
-	return cpu_uclamp_read(css, UCLAMP_MAX);
+	return cpu_uclamp_read(css, UCLAMP_MAX, false);
+}
+
+static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
+					   struct cftype *cft)
+{
+	return cpu_uclamp_read(css, UCLAMP_MIN, true);
+}
+
+static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
+					   struct cftype *cft)
+{
+	return cpu_uclamp_read(css, UCLAMP_MAX, true);
 }
 #endif /* CONFIG_UCLAMP_TASK_GROUP */
 
@@ -7647,11 +7733,19 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_util_min_read_u64,
 		.write_u64 = cpu_util_min_write_u64,
 	},
+	{
+		.name = "util.min.effective",
+		.read_u64 = cpu_util_min_effective_read_u64,
+	},
 	{
 		.name = "util.max",
 		.read_u64 = cpu_util_max_read_u64,
 		.write_u64 = cpu_util_max_write_u64,
 	},
+	{
+		.name = "util.max.effective",
+		.read_u64 = cpu_util_max_effective_read_u64,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7827,12 +7921,22 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_util_min_read_u64,
 		.write_u64 = cpu_util_min_write_u64,
 	},
+	{
+		.name = "util.min.effective",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_util_min_effective_read_u64,
+	},
 	{
 		.name = "util_max",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = cpu_util_max_read_u64,
 		.write_u64 = cpu_util_max_write_u64,
 	},
+	{
+		.name = "util.max.effective",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_util_max_effective_read_u64,
+	},
 #endif
 	{ }	/* terminate */
 };
-- 
2.18.0
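
For illustration only, and not part of the patch: the rule stated in the changelog (the effective clamp of a group is the smaller of its own clamp value and its parent's effective clamp value) can be sketched in plain user-space C. The struct tg_node type, the UCLAMP_RANGE_MAX constant and the tg_effective()/tg_update_all() helpers are hypothetical names chosen for this sketch, not kernel APIs. The sketch also assumes the groups are stored in pre-order (each parent before its children) and simply recomputes every node, whereas cpu_util_update_hier() skips subtrees that are already in sync.

```c
#include <assert.h>
#include <stddef.h>

#define UCLAMP_RANGE_MAX 1023u	/* upper bound of the [0, 1023] range */

/* Hypothetical, simplified stand-in for a task group's clamp state. */
struct tg_node {
	struct tg_node *parent;	/* NULL for the root group */
	unsigned int value;	/* requested clamp value */
	unsigned int effective;	/* clamp value actually enforced */
};

/* effective = min(requested value, parent's effective value) */
static unsigned int tg_effective(const struct tg_node *tg)
{
	if (!tg->parent)
		return tg->value;
	if (tg->parent->effective < tg->value)
		return tg->parent->effective;
	return tg->value;
}

/*
 * Refresh the effective value of nodes[0..n-1].
 * Assumption: the array is in pre-order, so a parent's effective value
 * is always up to date before any of its children is visited.
 */
static void tg_update_all(struct tg_node *nodes, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(nodes[i].value <= UCLAMP_RANGE_MAX);
		nodes[i].effective = tg_effective(&nodes[i]);
	}
}
```

With a chain root (1023) -> A (512) -> B (800), B's effective value comes out as 512, since A is more restrictive. Raising A's requested value to 900 and refreshing again relaxes B's effective value to its own requested 800, which matches cases a) and b) described in the cpu_util_update_hier() comment.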