From: Waiman Long <longman@redhat.com>
To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-doc@vger.kernel.org, kernel-team@fb.com, pjt@google.com,
    luto@amacapital.net, Mike Galbraith, torvalds@linux-foundation.org,
    Roman Gushchin, Juri Lelli, Waiman Long
Subject: [PATCH v8 2/6] cpuset: Add new v2 cpuset.sched.domain flag
Date: Thu, 17 May 2018 16:55:41 -0400
Message-Id: <1526590545-3350-3-git-send-email-longman@redhat.com>
In-Reply-To: <1526590545-3350-1-git-send-email-longman@redhat.com>
References: <1526590545-3350-1-git-send-email-longman@redhat.com>

A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
flag indicates that the CPUs in the current cpuset should be treated as
a separate scheduling domain. This new flag is owned by the parent and
will cause the CPUs in the cpuset to be removed from the effective CPUs
of its parent.

This is implemented internally by adding a new isolated_cpus mask that
holds the CPUs belonging to child scheduling domain cpusets so that:

	isolated_cpus | effective_cpus = cpus_allowed
	isolated_cpus & effective_cpus = 0

This new flag can only be turned on in a cpuset if its parent is either
the root cpuset or a scheduling domain itself with a non-empty CPU list.
The state of this flag cannot be changed if the cpuset has children.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt |  22 ++++
 kernel/cgroup/cpuset.c      | 237 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 256 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index cf7bac6..54d9e22 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1514,6 +1514,28 @@ Cpuset Interface Files
 	it is a subset of "cpuset.mems".  Its value will be affected
 	by memory nodes hotplug events.
 
+  cpuset.sched.domain
+	A read-write single value file which exists on non-root
+	cpuset-enabled cgroups.  It is a binary value flag that accepts
+	either "0" (off) or a non-zero value (on).  This flag is set
+	by the parent and is not delegatable.
+
+	If set, it indicates that the CPUs in the current cgroup will
+	be the root of a scheduling domain.  The root cgroup is always
+	a scheduling domain.  There are constraints on where this flag
+	can be set.  It can only be set in a cgroup if all the following
+	conditions are true.
+
+	1) The parent cgroup is also a scheduling domain with a non-empty
+	   CPU list.
+	2) The list of CPUs is exclusive, i.e. they are not shared by
+	   any of its siblings.
+	3) There are no child cgroups with cpuset enabled.
+
+	Setting this flag will take the CPUs away from the effective
+	CPUs of the parent cgroup.  Once it is set, this flag cannot be
+	cleared if there are any child cgroups with cpuset enabled.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 419b758..e1a1af0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -109,6 +109,9 @@ struct cpuset {
 	cpumask_var_t effective_cpus;
 	nodemask_t effective_mems;
 
+	/* Isolated CPUs for scheduling domain children */
+	cpumask_var_t isolated_cpus;
+
 	/*
 	 * This is old Memory Nodes tasks took on.
 	 *
@@ -134,6 +137,9 @@ struct cpuset {
 
 	/* for custom sched domain */
 	int relax_domain_level;
+
+	/* for isolated_cpus */
+	int isolation_count;
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -175,6 +181,7 @@ static inline bool task_has_mempolicy(struct task_struct *task)
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_SCHED_DOMAIN,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -203,6 +210,11 @@ static inline int is_sched_load_balance(const struct cpuset *cs)
 	return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 }
 
+static inline int is_sched_domain(const struct cpuset *cs)
+{
+	return test_bit(CS_SCHED_DOMAIN, &cs->flags);
+}
+
 static inline int is_memory_migrate(const struct cpuset *cs)
 {
 	return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
@@ -220,7 +232,7 @@ static inline int is_spread_slab(const struct cpuset *cs)
 
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
-		  (1 << CS_MEM_EXCLUSIVE)),
+		  (1 << CS_MEM_EXCLUSIVE) | (1 << CS_SCHED_DOMAIN)),
 };
 
 /**
@@ -902,7 +914,19 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
 
-		cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+		/*
+		 * If parent has isolated CPUs, include them in the list
+		 * of allowable CPUs.
+		 */
+		if (parent->isolation_count) {
+			cpumask_or(new_cpus, parent->effective_cpus,
+				   parent->isolated_cpus);
+			cpumask_and(new_cpus, new_cpus, cpu_online_mask);
+			cpumask_and(new_cpus, new_cpus, cp->cpus_allowed);
+		} else {
+			cpumask_and(new_cpus, cp->cpus_allowed,
+				    parent->effective_cpus);
+		}
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
@@ -948,6 +972,154 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 }
 
 /**
+ * update_isolated_cpumask - update the isolated_cpus mask of parent cpuset
+ * @cpuset:  The cpuset that requests CPU isolation
+ * @oldmask: The old isolated cpumask to be removed from the parent
+ * @newmask: The new isolated cpumask to be added to the parent
+ * Return: 0 if successful, an error code otherwise
+ *
+ * Changes to the isolated CPUs are not allowed if any of the CPUs changing
+ * state are in any of the child cpusets of the parent except the requesting
+ * child.
+ *
+ * If the sched_domain flag changes, either the oldmask (0=>1) or the
+ * newmask (1=>0) will be NULL.
+ *
+ * Called with cpuset_mutex held.
+ */
+static int update_isolated_cpumask(struct cpuset *cpuset,
+	struct cpumask *oldmask, struct cpumask *newmask)
+{
+	int retval;
+	int adding, deleting;
+	cpumask_var_t addmask, delmask;
+	struct cpuset *parent = parent_cs(cpuset);
+	struct cpuset *sibling;
+	struct cgroup_subsys_state *pos_css;
+	int old_count = parent->isolation_count;
+	bool dying = cpuset->css.flags & CSS_DYING;
+
+	/*
+	 * Parent must be a scheduling domain with non-empty cpus_allowed.
+	 */
+	if (!is_sched_domain(parent) || cpumask_empty(parent->cpus_allowed))
+		return -EINVAL;
+
+	/*
+	 * The oldmask, if present, must be a subset of parent's isolated
+	 * CPUs.
+	 */
+	if (oldmask && !cpumask_empty(oldmask) && (!parent->isolation_count ||
+	    !cpumask_subset(oldmask, parent->isolated_cpus))) {
+		WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+
+	/*
+	 * A sched_domain state change is not allowed if there are
+	 * online children and the cpuset is not dying.
+	 */
+	if (!dying && (!oldmask || !newmask) &&
+	    css_has_online_children(&cpuset->css))
+		return -EBUSY;
+
+	if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
+		return -ENOMEM;
+	if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
+		free_cpumask_var(addmask);
+		return -ENOMEM;
+	}
+
+	if (!old_count) {
+		if (!zalloc_cpumask_var(&parent->isolated_cpus, GFP_KERNEL)) {
+			retval = -ENOMEM;
+			goto out;
+		}
+		old_count = 1;
+	}
+
+	retval = -EBUSY;
+	adding = deleting = false;
+	if (newmask)
+		cpumask_copy(addmask, newmask);
+	if (oldmask)
+		deleting = cpumask_andnot(delmask, oldmask, addmask);
+	if (newmask)
+		adding = cpumask_andnot(addmask, newmask, delmask);
+
+	if (!adding && !deleting)
+		goto out_ok;
+
+	/*
+	 * The cpus to be added must be in the parent's effective_cpus mask
+	 * but not in the isolated_cpus mask.
+	 */
+	if (!cpumask_subset(addmask, parent->effective_cpus))
+		goto out;
+	if (parent->isolation_count &&
+	    cpumask_intersects(parent->isolated_cpus, addmask))
+		goto out;
+
+	/*
+	 * Check if any CPUs in addmask or delmask are in a sibling cpuset.
+	 * An empty sibling cpus_allowed means it is the same as parent's
+	 * effective_cpus. This checking is skipped if the cpuset is dying.
+	 */
+	if (dying)
+		goto updated_isolated_cpus;
+
+	cpuset_for_each_child(sibling, pos_css, parent) {
+		if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
+			continue;
+		if (cpumask_empty(sibling->cpus_allowed))
+			goto out;
+		if (adding &&
+		    cpumask_intersects(sibling->cpus_allowed, addmask))
+			goto out;
+		if (deleting &&
+		    cpumask_intersects(sibling->cpus_allowed, delmask))
+			goto out;
+	}
+
+	/*
+	 * Change the isolated CPU list.
+	 * Newly added isolated CPUs will be removed from effective_cpus
+	 * and newly deleted ones will be added back if they are online.
+	 */
+updated_isolated_cpus:
+	spin_lock_irq(&callback_lock);
+	if (adding)
+		cpumask_or(parent->isolated_cpus,
+			   parent->isolated_cpus, addmask);
+
+	if (deleting)
+		cpumask_andnot(parent->isolated_cpus,
+			       parent->isolated_cpus, delmask);
+
+	/*
+	 * New effective_cpus = (cpus_allowed & ~isolated_cpus) &
+	 *			cpu_online_mask
+	 */
+	cpumask_andnot(parent->effective_cpus, parent->cpus_allowed,
+		       parent->isolated_cpus);
+	cpumask_and(parent->effective_cpus, parent->effective_cpus,
+		    cpu_online_mask);
+
+	parent->isolation_count = cpumask_weight(parent->isolated_cpus);
+	spin_unlock_irq(&callback_lock);
+
+out_ok:
+	retval = 0;
+out:
+	free_cpumask_var(addmask);
+	free_cpumask_var(delmask);
+	if (old_count && !parent->isolation_count)
+		free_cpumask_var(parent->isolated_cpus);
+
+	return retval;
+}
+
+/**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  * @cs: the cpuset to consider
  * @trialcs: trial cpuset
@@ -988,6 +1160,13 @@
 	if (retval < 0)
 		return retval;
 
+	if (is_sched_domain(cs)) {
+		retval = update_isolated_cpumask(cs, cs->cpus_allowed,
+						 trialcs->cpus_allowed);
+		if (retval < 0)
+			return retval;
+	}
+
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
@@ -1316,6 +1495,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	struct cpuset *trialcs;
 	int balance_flag_changed;
 	int spread_flag_changed;
+	int domain_flag_changed;
 	int err;
 
 	trialcs = alloc_trial_cpuset(cs);
@@ -1327,6 +1507,18 @@
 	else
 		clear_bit(bit, &trialcs->flags);
 
+	/*
+	 * Turning on sched.domain flag (default hierarchy only) implies
+	 * an implicit cpu_exclusive. Turning off sched.domain will clear
+	 * the cpu_exclusive flag.
+	 */
+	if (bit == CS_SCHED_DOMAIN) {
+		if (turning_on)
+			set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+		else
+			clear_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+	}
+
 	err = validate_change(cs, trialcs);
 	if (err < 0)
 		goto out;
@@ -1337,11 +1529,26 @@
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
+	domain_flag_changed = (is_sched_domain(cs) != is_sched_domain(trialcs));
+
+	if (domain_flag_changed) {
+		err = turning_on
+		    ? update_isolated_cpumask(cs, NULL, cs->cpus_allowed)
+		    : update_isolated_cpumask(cs, cs->cpus_allowed, NULL);
+		if (err < 0)
+			goto out;
+		/*
+		 * At this point, the state has been changed.
+		 * So we can't back out with error anymore.
+		 */
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
 	spin_unlock_irq(&callback_lock);
 
-	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
+	if (!cpumask_empty(trialcs->cpus_allowed) &&
+	    (balance_flag_changed || domain_flag_changed))
 		rebuild_sched_domains_locked();
 
 	if (spread_flag_changed)
@@ -1596,6 +1803,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
 	FILE_SCHED_LOAD_BALANCE,
+	FILE_SCHED_DOMAIN,
 	FILE_SCHED_RELAX_DOMAIN_LEVEL,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
@@ -1629,6 +1837,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SCHED_LOAD_BALANCE:
 		retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val);
 		break;
+	case FILE_SCHED_DOMAIN:
+		retval = update_flag(CS_SCHED_DOMAIN, cs, val);
+		break;
 	case FILE_MEMORY_MIGRATE:
 		retval = update_flag(CS_MEMORY_MIGRATE, cs, val);
 		break;
@@ -1790,6 +2001,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 		return is_mem_hardwall(cs);
 	case FILE_SCHED_LOAD_BALANCE:
 		return is_sched_load_balance(cs);
+	case FILE_SCHED_DOMAIN:
+		return is_sched_domain(cs);
 	case FILE_MEMORY_MIGRATE:
 		return is_memory_migrate(cs);
 	case FILE_MEMORY_PRESSURE_ENABLED:
@@ -1966,6 +2179,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched.domain",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_SCHED_DOMAIN,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -2075,6 +2296,9 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
  * If the cpuset being removed has its flag 'sched_load_balance'
  * enabled, then simulate turning sched_load_balance off, which
  * will call rebuild_sched_domains_locked().
+ *
+ * If the cpuset has the 'sched_domain' flag enabled, simulate
+ * turning sched_domain off.
  */
 
 static void cpuset_css_offline(struct cgroup_subsys_state *css)
@@ -2083,6 +2307,13 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 
 	mutex_lock(&cpuset_mutex);
 
+	/*
+	 * Calling update_flag() may fail, so we have to call
+	 * update_isolated_cpumask directly to be sure.
+	 */
+	if (is_sched_domain(cs))
+		update_isolated_cpumask(cs, cs->cpus_allowed, NULL);
+
 	if (is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
-- 
1.8.3.1
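
[Editor's illustration, not part of the patch.] The invariant stated in the
commit message (isolated_cpus | effective_cpus == cpus_allowed, with the two
masks disjoint) can be sanity-checked outside the kernel. The sketch below is
a simplified userspace model: plain uint64_t bitmasks stand in for
cpumask_var_t, and the struct and function names (cpuset_model,
recompute_effective) are hypothetical stand-ins, but the recomputation step
mirrors the one done under callback_lock in update_isolated_cpumask().

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace model: each bit of a uint64_t stands in for one CPU. */
struct cpuset_model {
	uint64_t cpus_allowed;
	uint64_t effective_cpus;
	uint64_t isolated_cpus;
};

/*
 * Mirrors the recomputation done in update_isolated_cpumask():
 *   effective_cpus = (cpus_allowed & ~isolated_cpus) & cpu_online_mask
 */
static void recompute_effective(struct cpuset_model *cs, uint64_t online)
{
	cs->effective_cpus = (cs->cpus_allowed & ~cs->isolated_cpus) & online;
}

int main(void)
{
	const uint64_t online = 0xffULL;	/* CPUs 0-7 online */
	struct cpuset_model parent = {
		.cpus_allowed = 0x3fULL,	/* CPUs 0-5 */
	};

	/*
	 * A child sched-domain cpuset claims CPUs 4-5: turning on its
	 * sched.domain flag moves those CPUs into the parent's
	 * isolated_cpus, as update_isolated_cpumask(child, NULL,
	 * child->cpus_allowed) would.
	 */
	parent.isolated_cpus |= 0x30ULL;
	recompute_effective(&parent, online);

	/*
	 * The commit message's invariants (the first holds as long as
	 * the isolated CPUs are online):
	 *   isolated_cpus | effective_cpus == cpus_allowed
	 *   isolated_cpus & effective_cpus == 0
	 */
	assert((parent.isolated_cpus | parent.effective_cpus) ==
	       parent.cpus_allowed);
	assert((parent.isolated_cpus & parent.effective_cpus) == 0);

	printf("effective_cpus=%#llx isolated_cpus=%#llx\n",
	       (unsigned long long)parent.effective_cpus,
	       (unsigned long long)parent.isolated_cpus);
	return 0;
}

From userspace, the flag itself would simply be flipped by writing to the new
file, e.g. "echo 1 > <cgroup>/cpuset.sched.domain", subject to the three
constraints listed in the documentation hunk above.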