2021-07-14 13:56:25

by Frederic Weisbecker

Subject: [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset

The fact that "isolcpus=" behaviour can't be modified at runtime is an
eternal source of discussion and debate, pitting a useful feature against
a terrible interface.

I've long tried to figure out a proper way to control this at runtime
using cpusets. It isn't easy: a single boot-time cpumask is hard to map
to a hierarchy of cpusets that can even overlap.

The idea here is to map the boot-time isolation behaviour to any cpuset
directory whose cpumask is a subset of "isolcpus=". See the last patch
for the details.

Note this is still WIP and half-baked, but I figured it's important to
validate the interface early.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
cpuset/isolation

HEAD: 6d3dba1115b7ea464febf3763244c783e87c7baf

Thanks,
Frederic
---

Frederic Weisbecker (6):
pci: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
workqueue: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
net: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
sched/isolation: Split domain housekeeping mask from the rest
sched/isolation: Make HK_FLAG_DOMAIN mutable
cpuset: Add cpuset.isolation_mask file


drivers/pci/pci-driver.c | 21 ++++++--
include/linux/sched/isolation.h | 4 ++
kernel/cgroup/cpuset.c | 111 ++++++++++++++++++++++++++++++++++++++--
kernel/sched/isolation.c | 73 ++++++++++++++++++++++----
kernel/workqueue.c | 4 +-
net/core/net-sysfs.c | 6 +--
6 files changed, 196 insertions(+), 23 deletions(-)


2021-07-14 13:56:40

by Frederic Weisbecker

Subject: [RFC PATCH 4/6] sched/isolation: Split domain housekeeping mask from the rest

To prepare for exposing each feature of the housekeeping cpumask
through cpuset, move HK_FLAG_DOMAIN to its own cpumask. This will allow
the set passed through the "isolcpus=" kernel boot parameter to be
modified at runtime.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Nitesh Lal <[email protected]>
Cc: Nicolas Saenz <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Alex Belits <[email protected]>
---
kernel/sched/isolation.c | 54 +++++++++++++++++++++++++++++++++-------
1 file changed, 45 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 7f06eaf12818..c2bdf7e6dc39 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -12,6 +12,7 @@
DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
EXPORT_SYMBOL_GPL(housekeeping_overridden);
static cpumask_var_t housekeeping_mask;
+static cpumask_var_t hk_domain_mask;
static unsigned int housekeeping_flags;

bool housekeeping_enabled(enum hk_flags flags)
@@ -26,7 +27,7 @@ int housekeeping_any_cpu(enum hk_flags flags)

if (static_branch_unlikely(&housekeeping_overridden)) {
if (housekeeping_flags & flags) {
- cpu = sched_numa_find_closest(housekeeping_mask, smp_processor_id());
+ cpu = sched_numa_find_closest(housekeeping_cpumask(flags), smp_processor_id());
if (cpu < nr_cpu_ids)
return cpu;

@@ -39,9 +40,13 @@ EXPORT_SYMBOL_GPL(housekeeping_any_cpu);

const struct cpumask *housekeeping_cpumask(enum hk_flags flags)
{
- if (static_branch_unlikely(&housekeeping_overridden))
+ if (static_branch_unlikely(&housekeeping_overridden)) {
+ WARN_ON_ONCE((flags & HK_FLAG_DOMAIN) && (flags & ~HK_FLAG_DOMAIN));
+ if (housekeeping_flags & HK_FLAG_DOMAIN)
+ return hk_domain_mask;
if (housekeeping_flags & flags)
return housekeeping_mask;
+ }
return cpu_possible_mask;
}
EXPORT_SYMBOL_GPL(housekeeping_cpumask);
@@ -50,7 +55,7 @@ void housekeeping_affine(struct task_struct *t, enum hk_flags flags)
{
if (static_branch_unlikely(&housekeeping_overridden))
if (housekeeping_flags & flags)
- set_cpus_allowed_ptr(t, housekeeping_mask);
+ set_cpus_allowed_ptr(t, housekeeping_cpumask(flags));
}
EXPORT_SYMBOL_GPL(housekeeping_affine);

@@ -58,11 +63,13 @@ bool housekeeping_test_cpu(int cpu, enum hk_flags flags)
{
if (static_branch_unlikely(&housekeeping_overridden))
if (housekeeping_flags & flags)
- return cpumask_test_cpu(cpu, housekeeping_mask);
+ return cpumask_test_cpu(cpu, housekeeping_cpumask(flags));
return true;
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);

+
+
void __init housekeeping_init(void)
{
if (!housekeeping_flags)
@@ -91,28 +98,57 @@ static int __init housekeeping_setup(char *str, enum hk_flags flags)

alloc_bootmem_cpumask_var(&tmp);
if (!housekeeping_flags) {
- alloc_bootmem_cpumask_var(&housekeeping_mask);
- cpumask_andnot(housekeeping_mask,
- cpu_possible_mask, non_housekeeping_mask);
+ if (flags & ~HK_FLAG_DOMAIN) {
+ alloc_bootmem_cpumask_var(&housekeeping_mask);
+ cpumask_andnot(housekeeping_mask,
+ cpu_possible_mask, non_housekeeping_mask);
+ }
+
+ if (flags & HK_FLAG_DOMAIN) {
+ alloc_bootmem_cpumask_var(&hk_domain_mask);
+ cpumask_andnot(hk_domain_mask,
+ cpu_possible_mask, non_housekeeping_mask);
+ }

cpumask_andnot(tmp, cpu_present_mask, non_housekeeping_mask);
if (cpumask_empty(tmp)) {
pr_warn("Housekeeping: must include one present CPU, "
"using boot CPU:%d\n", smp_processor_id());
- __cpumask_set_cpu(smp_processor_id(), housekeeping_mask);
+ if (flags & ~HK_FLAG_DOMAIN)
+ __cpumask_set_cpu(smp_processor_id(), housekeeping_mask);
+ if (flags & HK_FLAG_DOMAIN)
+ __cpumask_set_cpu(smp_processor_id(), hk_domain_mask);
__cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
}
} else {
+ struct cpumask *prev;
+
cpumask_andnot(tmp, cpu_present_mask, non_housekeeping_mask);
if (cpumask_empty(tmp))
__cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
cpumask_andnot(tmp, cpu_possible_mask, non_housekeeping_mask);
- if (!cpumask_equal(tmp, housekeeping_mask)) {
+
+ if (housekeeping_flags == HK_FLAG_DOMAIN)
+ prev = hk_domain_mask;
+ else
+ prev = housekeeping_mask;
+
+ if (!cpumask_equal(tmp, prev)) {
pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
free_bootmem_cpumask_var(tmp);
free_bootmem_cpumask_var(non_housekeeping_mask);
return 0;
}
+
+ if ((housekeeping_flags & HK_FLAG_DOMAIN) && (flags & ~HK_FLAG_DOMAIN)) {
+ alloc_bootmem_cpumask_var(&housekeeping_mask);
+ cpumask_copy(housekeeping_mask, hk_domain_mask);
+ }
+
+ if ((housekeeping_flags & ~HK_FLAG_DOMAIN) && (flags & HK_FLAG_DOMAIN)) {
+ alloc_bootmem_cpumask_var(&hk_domain_mask);
+ cpumask_copy(hk_domain_mask, housekeeping_mask);
+ }
}
free_bootmem_cpumask_var(tmp);

--
2.25.1

2021-07-14 13:57:15

by Frederic Weisbecker

Subject: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

Add a new cpuset.isolation_mask file in order to be able to modify the
housekeeping cpumask for each individual isolation feature at runtime.
In the future this will include nohz_full, unbound timers,
unbound workqueues, unbound kthreads, managed irqs, etc...

Start with supporting domain exclusion and CPUs passed through
"isolcpus=".

The cpuset.isolation_mask defaults to 0. Setting it to 1 excludes the
given cpuset from the domains (its CPUs are attached to the NULL domain).
As long as a CPU is part of any cpuset with cpuset.isolation_mask set to
1, it remains isolated even if it overlaps with another cpuset whose
cpuset.isolation_mask is 0. The same applies to parent and child
directories.

If a cpuset's cpumask is a subset of "isolcpus=", it is automatically
mapped and its cpuset.isolation_mask is set to 1. The subset is then
cleared from the initial "isolcpus=" mask. The user is free to set
cpuset.isolation_mask back to 0 in order to revert the effect of
"isolcpus=".

Here is an example of use where CPU 7 was isolated at boot and gets
re-attached to the domains later through cpuset:

$ cat /proc/cmdline
isolcpus=7
$ cd /sys/fs/cgroup/cpuset
$ mkdir cpu7
$ cd cpu7
$ cat cpuset.cpus
0-7
$ cat cpuset.isolation_mask
0
$ ls /sys/kernel/debug/domains/cpu7 # empty because isolcpus=7
$ echo 7 > cpuset.cpus
$ cat cpuset.isolation_mask # isolcpus subset automatically mapped
1
$ echo 0 > cpuset.isolation_mask
$ ls /sys/kernel/debug/domains/cpu7/
domain0 domain1

CHECKME: Should we have individual cpuset.isolation.$feature files for
each isolation feature instead of a single mask file?

CHECKME: The scheduler is unhappy when _all_ CPUs are isolated

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Nitesh Lal <[email protected]>
Cc: Nicolas Saenz <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Alex Belits <[email protected]>
---
kernel/cgroup/cpuset.c | 111 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 107 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index adb5190c4429..ecb63be04408 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -82,6 +82,7 @@ struct cpuset {
struct cgroup_subsys_state css;

unsigned long flags; /* "unsigned long" so bitops work */
+ unsigned long isol_flags;

/*
* On default hierarchy:
@@ -258,6 +259,17 @@ static inline int is_spread_slab(const struct cpuset *cs)
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}

+/* bits in struct cpuset flags field */
+typedef enum {
+ CS_ISOL_DOMAIN,
+ CS_ISOL_MAX
+} isol_flagbits_t;
+
+static inline int is_isol_domain(const struct cpuset *cs)
+{
+ return test_bit(CS_ISOL_DOMAIN, &cs->isol_flags);
+}
+
static inline int is_partition_root(const struct cpuset *cs)
{
return cs->partition_root_state > 0;
@@ -269,6 +281,13 @@ static struct cpuset top_cpuset = {
.partition_root_state = PRS_ENABLED,
};

+/*
+ * CPUs passed through "isolcpus=" on boot, waiting to be mounted
+ * as soon as we meet a cpuset directory whose cpus_allowed is a
+ * subset of "isolcpus="
+ */
+static cpumask_var_t unmounted_isolcpus_mask;
+
/**
* cpuset_for_each_child - traverse online children of a cpuset
* @child_cs: loop cursor pointing to the current child
@@ -681,6 +700,39 @@ static inline int nr_cpusets(void)
return static_key_count(&cpusets_enabled_key.key) + 1;
}

+static int update_domain_housekeeping_mask(void)
+{
+ struct cpuset *cp; /* top-down scan of cpusets */
+ struct cgroup_subsys_state *pos_css;
+ cpumask_var_t domain_mask;
+
+ if (!zalloc_cpumask_var(&domain_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ cpumask_andnot(domain_mask, cpu_possible_mask, unmounted_isolcpus_mask);
+
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
+ if (is_isol_domain(cp))
+ cpumask_andnot(domain_mask, domain_mask, cp->cpus_allowed);
+
+ if (cpumask_subset(cp->cpus_allowed, unmounted_isolcpus_mask)) {
+ unsigned long flags;
+ cpumask_andnot(unmounted_isolcpus_mask, unmounted_isolcpus_mask,
+ cp->cpus_allowed);
+ spin_lock_irqsave(&callback_lock, flags);
+ cp->isol_flags |= BIT(CS_ISOL_DOMAIN);
+ spin_unlock_irqrestore(&callback_lock, flags);
+ }
+ }
+ rcu_read_unlock();
+
+ housekeeping_cpumask_set(domain_mask, HK_FLAG_DOMAIN);
+ free_cpumask_var(domain_mask);
+
+ return 0;
+}
+
/*
* generate_sched_domains()
*
@@ -741,6 +793,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
struct cpuset **csa; /* array of all cpuset ptrs */
int csn; /* how many cpuset ptrs in csa so far */
int i, j, k; /* indices for partition finding loops */
+ int err;
cpumask_var_t *doms; /* resulting partition; i.e. sched domains */
struct sched_domain_attr *dattr; /* attributes for custom domains */
int ndoms = 0; /* number of sched domains in result */
@@ -752,6 +805,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
dattr = NULL;
csa = NULL;

+ err = update_domain_housekeeping_mask();
+ if (err < 0)
+ pr_err("Can't update housekeeping cpumask\n");
+
/* Special case for the 99% of systems with one, full, sched domain */
if (root_load_balance && !top_cpuset.nr_subparts_cpus) {
ndoms = 1;
@@ -1449,7 +1506,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
* root as well.
*/
if (!cpumask_empty(cp->cpus_allowed) &&
- is_sched_load_balance(cp) &&
+ (is_sched_load_balance(cp) || is_isol_domain(cs)) &&
(!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
is_partition_root(cp)))
need_rebuild_sched_domains = true;
@@ -1935,6 +1992,30 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
return err;
}

+/*
+ * update_isol_flags - read a 0 or a 1 in a file and update associated isol flag
+ * mask: the new mask value to apply (see isol_flagbits_t)
+ * cs: the cpuset to update
+ *
+ * Call with cpuset_mutex held.
+ */
+static int update_isol_flags(struct cpuset *cs, u64 mask)
+{
+ unsigned long old_mask = cs->isol_flags;
+
+ if (mask & ~(BIT_ULL(CS_ISOL_MAX) - 1))
+ return -EINVAL;
+
+ spin_lock_irq(&callback_lock);
+ cs->isol_flags = (unsigned long)mask;
+ spin_unlock_irq(&callback_lock);
+
+ if (mask ^ old_mask)
+ rebuild_sched_domains_locked();
+
+ return 0;
+}
+
/*
* update_prstate - update partititon_root_state
* cs: the cpuset to update
@@ -2273,6 +2354,9 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+//CHECKME: should we have individual cpuset.isolation.$feature files
+//instead of a mask of features in a single file?
+ FILE_ISOLATION_MASK,
} cpuset_filetype_t;

static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
@@ -2314,6 +2398,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
case FILE_SPREAD_SLAB:
retval = update_flag(CS_SPREAD_SLAB, cs, val);
break;
+ case FILE_ISOLATION_MASK:
+ retval = update_isol_flags(cs, val);
+ break;
default:
retval = -EINVAL;
break;
@@ -2481,6 +2568,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
return is_spread_page(cs);
case FILE_SPREAD_SLAB:
return is_spread_slab(cs);
+ case FILE_ISOLATION_MASK:
+ return cs->isol_flags;
default:
BUG();
}
@@ -2658,6 +2747,13 @@ static struct cftype legacy_files[] = {
.private = FILE_MEMORY_PRESSURE_ENABLED,
},

+ {
+ .name = "isolation_mask",
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_ISOLATION_MASK,
+ },
+
{ } /* terminate */
};

@@ -2834,9 +2930,12 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
if (is_partition_root(cs))
update_prstate(cs, 0);

- if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
- is_sched_load_balance(cs))
- update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+ if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+ if (is_sched_load_balance(cs))
+ update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+ if (is_isol_domain(cs))
+ update_isol_flags(cs, cs->isol_flags & ~BIT(CS_ISOL_DOMAIN));
+ }

if (cs->use_parent_ecpus) {
struct cpuset *parent = parent_cs(cs);
@@ -2873,6 +2972,9 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
top_cpuset.mems_allowed = top_cpuset.effective_mems;
}

+ cpumask_andnot(unmounted_isolcpus_mask, cpu_possible_mask,
+ housekeeping_cpumask(HK_FLAG_DOMAIN));
+
spin_unlock_irq(&callback_lock);
percpu_up_write(&cpuset_rwsem);
}
@@ -2932,6 +3034,7 @@ int __init cpuset_init(void)
top_cpuset.relax_domain_level = -1;

BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
+ BUG_ON(!alloc_cpumask_var(&unmounted_isolcpus_mask, GFP_KERNEL));

return 0;
}
--
2.25.1

2021-07-14 13:58:02

by Frederic Weisbecker

Subject: [RFC PATCH 2/6] workqueue: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch

To prepare for exposing each feature of the housekeeping cpumask
through cpuset, prepare for HK_FLAG_DOMAIN to move to its own cpumask.
This will allow the set passed through the "isolcpus=" kernel boot
parameter to be modified at runtime.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Nitesh Lal <[email protected]>
Cc: Nicolas Saenz <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Alex Belits <[email protected]>
---
kernel/workqueue.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 50142fc08902..d29c5b61a333 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5938,13 +5938,13 @@ static void __init wq_numa_init(void)
void __init workqueue_init_early(void)
{
int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
- int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
int i, cpu;

BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));

BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
- cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(hk_flags));
+ cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_FLAG_WQ));
+ cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, housekeeping_cpumask(HK_FLAG_DOMAIN));

pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);

--
2.25.1

2021-07-14 16:35:59

by Marcelo Tosatti

Subject: Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> Add a new cpuset.isolation_mask file in order to be able to modify the
> housekeeping cpumask for each individual isolation feature at runtime.
> In the future this will include nohz_full, unbound timers,
> unbound workqueues, unbound kthreads, managed irqs, etc...
>
> Start with supporting domain exclusion and CPUs passed through
> "isolcpus=".

Would it be possible to just return -ENOTSUPPORTED for the features
whose support is not present?

> The cpuset.isolation_mask defaults to 0. Setting it to 1 will exclude
> the given cpuset from the domains (they will be attached to NULL domain).
> As long as a CPU is part of any cpuset with cpuset.isolation_mask set to
> 1, it will remain isolated even if it overlaps with another cpuset that
> has cpuset.isolation_mask set to 0. The same applies to parent and
> subdirectories.
>
> If a cpuset is a subset of "isolcpus=", it automatically maps it and
> cpuset.isolation_mask will be set to 1. This subset is then cleared from
> the initial "isolcpus=" mask. The user is then free to override
> cpuset.isolation_mask to 0 in order to revert the effect of "isolcpus=".
>
> Here is an example of use where CPU 7 was isolated at boot and
> gets re-attached to domains later from cpuset:
>
> $ cat /proc/cmdline
> isolcpus=7
> $ cd /sys/fs/cgroup/cpuset
> $ mkdir cpu7
> $ cd cpu7
> $ cat cpuset.cpus
> 0-7
> $ cat cpuset.isolation_mask
> 0
> $ ls /sys/kernel/debug/domains/cpu7 # empty because isolcpus=7
> $ echo 7 > cpuset.cpus
> $ cat cpuset.isolation_mask # isolcpus subset automatically mapped
> 1
> $ echo 0 > cpuset.isolation_mask
> $ ls /sys/kernel/debug/domains/cpu7/
> domain0 domain1
>
> CHECKME: Should we have individual cpuset.isolation.$feature files for
> each isolation feature instead of a single mask file?

Yes, I guess that is useful, for example due to the -ENOTSUPPORTED
comment above.


Guarantees on updates
=====================

Perhaps start with a document answering:

On return from a write to the cpumask file, what are the guarantees?

For example, for kthreads the guarantee is that any kernel thread created
from that point on starts with the new mask. Therefore userspace should
respect the order:

1) Change kthread mask.
2) Move threads.


Updates to interface
====================

Also, thinking about updates to the interface (which today is one cpumask
per isolation feature) might be useful. What can happen:

1) A new isolation feature is added, and its name is added to the
interface.

Userspace must support the new filename. If it is not there, then that's
an old kernel without support for it.

2) An isolation feature is removed, so a file will be gone. What should
the behaviour be there? Remove the file? (Userspace should probably
ignore the failure in that case?) (Then feature names should not be
reused, as that could confuse #1 above.)

Or maybe have a versioned scheme?


2021-07-14 16:56:20

by Peter Zijlstra

Subject: Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> Add a new cpuset.isolation_mask file in order to be able to modify the
> housekeeping cpumask for each individual isolation feature at runtime.
> In the future this will include nohz_full, unbound timers,
> unbound workqueues, unbound kthreads, managed irqs, etc...
>
> Start with supporting domain exclusion and CPUs passed through
> "isolcpus=".
>
> The cpuset.isolation_mask defaults to 0. Setting it to 1 will exclude
> the given cpuset from the domains (they will be attached to NULL domain).
> As long as a CPU is part of any cpuset with cpuset.isolation_mask set to
> 1, it will remain isolated even if it overlaps with another cpuset that
> has cpuset.isolation_mask set to 0. The same applies to parent and
> subdirectories.
>
> If a cpuset is a subset of "isolcpus=", it automatically maps it and
> cpuset.isolation_mask will be set to 1. This subset is then cleared from
> the initial "isolcpus=" mask. The user is then free to override
> cpuset.isolation_mask to 0 in order to revert the effect of "isolcpus=".
>
> Here is an example of use where CPU 7 was isolated at boot and
> gets re-attached to domains later from cpuset:
>
> $ cat /proc/cmdline
> isolcpus=7
> $ cd /sys/fs/cgroup/cpuset
> $ mkdir cpu7
> $ cd cpu7
> $ cat cpuset.cpus
> 0-7
> $ cat cpuset.isolation_mask
> 0
> $ ls /sys/kernel/debug/domains/cpu7 # empty because isolcpus=7
> $ echo 7 > cpuset.cpus
> $ cat cpuset.isolation_mask # isolcpus subset automatically mapped
> 1
> $ echo 0 > cpuset.isolation_mask
> $ ls /sys/kernel/debug/domains/cpu7/
> domain0 domain1
>

cpusets already has means to create partitions; why are you creating
something else?

2021-07-15 00:25:58

by Frederic Weisbecker

Subject: Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> > [...]
>
> cpusets already has means to create partitions; why are you creating
> something else?

I was about to answer that the semantics of isolcpus, which reference
a NULL domain, are different from SD_LOAD_BALANCE as implied by
cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
been removed.

How is cpuset.sched_load_balance implemented then? Commit
e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
ends up creating a NULL domain, but that's not what I get. For example if
I mount a single cpuset root (no other cpuset mountpoints):

$ mount -t cgroup none ./cpuset -o cpuset
$ cd cpuset
$ cat cpuset.cpus
0-7
$ cat cpuset.sched_load_balance
1
$ echo 0 > cpuset.sched_load_balance
$ ls /sys/kernel/debug/domains/cpu1/
domain0 domain1

I still get the domains on all CPUs...
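As an illustration of the isolation_mask semantics quoted at the top of
this message — a userspace Python model, not kernel code, with a made-up
helper name:

```python
# Model of the cpuset.isolation_mask rule from the quoted changelog:
# a CPU stays excluded from the scheduler domains as long as *any*
# cpuset containing it has isolation_mask set to 1, even if it also
# belongs to an overlapping cpuset with isolation_mask set to 0.

def isolated_cpus(cpusets):
    """cpusets: list of (set_of_cpus, isolation_mask) pairs."""
    isolated = set()
    for cpus, mask in cpusets:
        if mask == 1:
            isolated |= cpus
    return isolated

# CPU 7 is in the root cpuset (mask 0) and in an isolated child (mask 1):
sets = [({0, 1, 2, 3, 4, 5, 6, 7}, 0), ({7}, 1)]
print(sorted(isolated_cpus(sets)))  # [7] -> CPU 7 remains isolated
```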

2021-07-15 00:47:18

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

On 15/07/21 01:13, Frederic Weisbecker wrote:
> On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:
>>
>> cpusets already has means to create partitions; why are you creating
>> something else?
>
> I was about to answer that the semantics of isolcpus, which reference
> a NULL domain, are different from SD_LOAD_BALANCE implied by
> cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
> been removed.
>
> How is cpuset.sched_load_balance implemented then? Commit
> e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
> SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
> ends up creating a NULL domain, but that's not what I get. For example,
> if I mount a single cpuset root (no other cpuset mountpoints):
>
> $ mount -t cgroup none ./cpuset -o cpuset
> $ cd cpuset
> $ cat cpuset.cpus
> 0-7
> $ cat cpuset.sched_load_balance
> 1
> $ echo 0 > cpuset.sched_load_balance
> $ ls /sys/kernel/debug/domains/cpu1/
> domain0 domain1
>
> I still get the domains on all CPUs...

Huh. That's on v5.14-rc1 with an automounted cpuset:

$ cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-5
$ cat /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
1

$ ls /sys/kernel/debug/sched/domains/cpu*
/sys/kernel/debug/sched/domains/cpu0:
domain0 domain1

/sys/kernel/debug/sched/domains/cpu1:
domain0 domain1

/sys/kernel/debug/sched/domains/cpu2:
domain0 domain1

/sys/kernel/debug/sched/domains/cpu3:
domain0 domain1

/sys/kernel/debug/sched/domains/cpu4:
domain0 domain1

/sys/kernel/debug/sched/domains/cpu5:
domain0 domain1

$ echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
$ ls /sys/kernel/debug/sched/domains/cpu*
/sys/kernel/debug/sched/domains/cpu0:

/sys/kernel/debug/sched/domains/cpu1:

/sys/kernel/debug/sched/domains/cpu2:

/sys/kernel/debug/sched/domains/cpu3:

/sys/kernel/debug/sched/domains/cpu4:

/sys/kernel/debug/sched/domains/cpu5:


I also checked that you can keep cpuset.sched_load_balance=0 at the root
and create exclusive child cpusets with different values of
sched_load_balance, giving you some CPUs attached to the NULL domain and
some others with a sched_domain hierarchy that stays within the cpuset span.

2021-07-15 00:53:09

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

On Thu, Jul 15, 2021 at 12:44:08AM +0100, Valentin Schneider wrote:
> On 15/07/21 01:13, Frederic Weisbecker wrote:
> > On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:
> >>
> >> cpusets already has means to create partitions; why are you creating
> >> something else?
> >
> > I was about to answer that the semantics of isolcpus, which reference
> > a NULL domain, are different from SD_LOAD_BALANCE implied by
> > cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
> > been removed.
> >
> > How is cpuset.sched_load_balance implemented then? Commit
> > e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
> > SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
> > ends up creating a NULL domain, but that's not what I get. For example,
> > if I mount a single cpuset root (no other cpuset mountpoints):
> >
> > $ mount -t cgroup none ./cpuset -o cpuset
> > $ cd cpuset
> > $ cat cpuset.cpus
> > 0-7
> > $ cat cpuset.sched_load_balance
> > 1
> > $ echo 0 > cpuset.sched_load_balance
> > $ ls /sys/kernel/debug/domains/cpu1/
> > domain0 domain1
> >
> > I still get the domains on all CPUs...
>
> Huh. That's on v5.14-rc1 with an automounted cpuset:
>
> $ cat /sys/fs/cgroup/cpuset/cpuset.cpus
> 0-5
> $ cat /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
> 1
>
> $ ls /sys/kernel/debug/sched/domains/cpu*
> /sys/kernel/debug/sched/domains/cpu0:
> domain0 domain1
>
> /sys/kernel/debug/sched/domains/cpu1:
> domain0 domain1
>
> /sys/kernel/debug/sched/domains/cpu2:
> domain0 domain1
>
> /sys/kernel/debug/sched/domains/cpu3:
> domain0 domain1
>
> /sys/kernel/debug/sched/domains/cpu4:
> domain0 domain1
>
> /sys/kernel/debug/sched/domains/cpu5:
> domain0 domain1
>
> $ echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
> $ ls /sys/kernel/debug/sched/domains/cpu*
> /sys/kernel/debug/sched/domains/cpu0:
>
> /sys/kernel/debug/sched/domains/cpu1:
>
> /sys/kernel/debug/sched/domains/cpu2:
>
> /sys/kernel/debug/sched/domains/cpu3:
>
> /sys/kernel/debug/sched/domains/cpu4:
>
> /sys/kernel/debug/sched/domains/cpu5:
>
>
> I also checked that you can keep cpuset.sched_load_balance=0 at the root
> and create exclusive child cpusets with different values of
> sched_load_balance, giving you some CPUs attached to the NULL domain and
> some others with a sched_domain hierarchy that stays within the cpuset span.

OK, I must have done something wrong somewhere; I'll check further tomorrow.

Thanks!

2021-07-15 09:10:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

On Thu, Jul 15, 2021 at 01:13:38AM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:

> > cpusets already has means to create partitions; why are you creating
> > something else?
>
> I was about to answer that the semantics of isolcpus, which reference
> a NULL domain, are different from SD_LOAD_BALANCE implied by
> cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
> been removed.
>
> How is cpuset.sched_load_balance implemented then? Commit
> e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
> SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
> ends up creating a NULL domain, but that's not what I get. For example,
> if I mount a single cpuset root (no other cpuset mountpoints):

SD_LOAD_BALANCE was only for when you wanted to stop balancing inside a
domain tree. That no longer happens (and hasn't for a *long* time).
Cpusets simply create multiple domain trees (or an empty one if it's
just one CPU).

> $ mount -t cgroup none ./cpuset -o cpuset
> $ cd cpuset
> $ cat cpuset.cpus
> 0-7
> $ cat cpuset.sched_load_balance
> 1
> $ echo 0 > cpuset.sched_load_balance
> $ ls /sys/kernel/debug/domains/cpu1/
> domain0 domain1
>
> I still get the domains on all CPUs...

(note, that's the cgroup-v1 interface, the cgroup-v2 interface is
significantly different)

I'd suggest doing: echo 1 > /debug/sched/verbose. If I do the above, I
get:

[1290784.889705] CPU0 attaching NULL sched-domain.
[1290784.894830] CPU1 attaching NULL sched-domain.
[1290784.899947] CPU2 attaching NULL sched-domain.
[1290784.905056] CPU3 attaching NULL sched-domain.
[1290784.910153] CPU4 attaching NULL sched-domain.
[1290784.915252] CPU5 attaching NULL sched-domain.
[1290784.920338] CPU6 attaching NULL sched-domain.
[1290784.925439] CPU7 attaching NULL sched-domain.
[1290784.930535] CPU8 attaching NULL sched-domain.
[1290784.935660] CPU9 attaching NULL sched-domain.
[1290784.940911] CPU10 attaching NULL sched-domain.
[1290784.946117] CPU11 attaching NULL sched-domain.
[1290784.951317] CPU12 attaching NULL sched-domain.
[1290784.956507] CPU13 attaching NULL sched-domain.
[1290784.961688] CPU14 attaching NULL sched-domain.
[1290784.966876] CPU15 attaching NULL sched-domain.
[1290784.972047] CPU16 attaching NULL sched-domain.
[1290784.977218] CPU17 attaching NULL sched-domain.
[1290784.982383] CPU18 attaching NULL sched-domain.
[1290784.987552] CPU19 attaching NULL sched-domain.
[1290784.992724] CPU20 attaching NULL sched-domain.
[1290784.997893] CPU21 attaching NULL sched-domain.
[1290785.003063] CPU22 attaching NULL sched-domain.
[1290785.008230] CPU23 attaching NULL sched-domain.
[1290785.013400] CPU24 attaching NULL sched-domain.
[1290785.018568] CPU25 attaching NULL sched-domain.
[1290785.023736] CPU26 attaching NULL sched-domain.
[1290785.028905] CPU27 attaching NULL sched-domain.
[1290785.034074] CPU28 attaching NULL sched-domain.
[1290785.039241] CPU29 attaching NULL sched-domain.
[1290785.044409] CPU30 attaching NULL sched-domain.
[1290785.049579] CPU31 attaching NULL sched-domain.
[1290785.054816] CPU32 attaching NULL sched-domain.
[1290785.059986] CPU33 attaching NULL sched-domain.
[1290785.065154] CPU34 attaching NULL sched-domain.
[1290785.070323] CPU35 attaching NULL sched-domain.
[1290785.075492] CPU36 attaching NULL sched-domain.
[1290785.080662] CPU37 attaching NULL sched-domain.
[1290785.085832] CPU38 attaching NULL sched-domain.
[1290785.091001] CPU39 attaching NULL sched-domain.

Then when I do:

# mkdir /cgroup/A
# echo 0,20 > /cgroup/A/cpuset.cpus

I get:

[1291020.749036] CPU0 attaching sched-domain(s):
[1291020.754251] domain-0: span=0,20 level=SMT
[1291020.759061] groups: 0:{ span=0 }, 20:{ span=20 }
[1291020.765386] CPU20 attaching sched-domain(s):
[1291020.770399] domain-0: span=0,20 level=SMT
[1291020.775210] groups: 20:{ span=20 }, 0:{ span=0 }
[1291020.780831] root domain span: 0,20 (max cpu_capacity = 1024)

IOW, I've created a load-balance domain on just the first core of the
system.

# echo 0-1,20-21 > /cgroup/A/cpuset.cpus

Extends it to the first two cores:

[1291340.260699] CPU0 attaching NULL sched-domain.
[1291340.265820] CPU20 attaching NULL sched-domain.
[1291340.271403] CPU0 attaching sched-domain(s):
[1291340.276315] domain-0: span=0,20 level=SMT
[1291340.281122] groups: 0:{ span=0 }, 20:{ span=20 }
[1291340.286719] domain-1: span=0-1,20-21 level=MC
[1291340.292011] groups: 0:{ span=0,20 cap=2048 }, 1:{ span=1,21 cap=2048 }
[1291340.299855] CPU1 attaching sched-domain(s):
[1291340.304757] domain-0: span=1,21 level=SMT
[1291340.309564] groups: 1:{ span=1 }, 21:{ span=21 }
[1291340.315190] domain-1: span=0-1,20-21 level=MC
[1291340.320474] groups: 1:{ span=1,21 cap=2048 }, 0:{ span=0,20 cap=2048 }
[1291340.328307] CPU20 attaching sched-domain(s):
[1291340.333344] domain-0: span=0,20 level=SMT
[1291340.338136] groups: 20:{ span=20 }, 0:{ span=0 }
[1291340.343721] domain-1: span=0-1,20-21 level=MC
[1291340.348980] groups: 0:{ span=0,20 cap=2048 }, 1:{ span=1,21 cap=2048 }
[1291340.356783] CPU21 attaching sched-domain(s):
[1291340.361755] domain-0: span=1,21 level=SMT
[1291340.366534] groups: 21:{ span=21 }, 1:{ span=1 }
[1291340.372099] domain-1: span=0-1,20-21 level=MC
[1291340.377364] groups: 1:{ span=1,21 cap=2048 }, 0:{ span=0,20 cap=2048 }
[1291340.385216] root domain span: 0-1,20-21 (max cpu_capacity = 1024)
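Peter's walkthrough above can be condensed into a small model: each
exclusive, load-balanced cpuset yields its own domain tree, and a CPU
outside any balanced span — or alone in one — is attached to the NULL
domain. This is an illustrative sketch with invented names, not
scheduler code:

```python
def attach_domains(all_cpus, balanced_spans):
    """Map each CPU to its load-balance span, or None (NULL domain).

    balanced_spans: disjoint sets of CPUs that balance together.
    A span containing a single CPU also yields the NULL domain,
    matching Peter's "(or the empty one if it's just one CPU)".
    """
    domain = {cpu: None for cpu in all_cpus}
    for span in balanced_spans:
        if len(span) <= 1:
            continue  # a lone CPU has nobody to balance with
        for cpu in span:
            domain[cpu] = frozenset(span)
    return domain

# Root balancing disabled, child cpuset spanning CPUs 0 and 20 only:
doms = attach_domains([0, 1, 20, 21], [{0, 20}])
# CPUs 0 and 20 share a domain; CPUs 1 and 21 get the NULL domain.
```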

2021-07-16 18:03:40

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset

On 7/14/21 9:54 AM, Frederic Weisbecker wrote:
> The fact that "isolcpus=" behaviour can't be modified at runtime is an
> eternal source of discussion and debate opposing a useful feature against
> a terrible interface.
>
> I've long since tried to figure out a proper way to control this at
> runtime using cpusets, which isn't easy as a boot time single cpumask
> is difficult to map to a hierarchy of cpusets that can even overlap.

I have a cpuset patch that allows disabling of load balancing in a
cgroup-v2 setting:

https://lore.kernel.org/lkml/[email protected]/

The idea of cpuset partition is that there will be no overlap of cpus in
different partitions. So there will be no confusion whether a cpu is
load-balanced or not.
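The no-overlap invariant of cpuset partitions can be stated as a
one-line check (a userspace sketch, not the cpuset implementation):

```python
from itertools import combinations

def partitions_valid(partitions):
    """Partitions must be pairwise disjoint, so every CPU's
    load-balancing status is unambiguous."""
    return all(a.isdisjoint(b) for a, b in combinations(partitions, 2))

print(partitions_valid([{0, 1}, {2, 3}]))  # True: no CPU is shared
print(partitions_valid([{0, 1}, {1, 2}]))  # False: CPU 1 overlaps
```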

>
> The idea here is to map the boot-set isolation behaviour to any cpuset
> directory whose cpumask is a subset of "isolcpus=". I let you browse
> for details on the last patch.
>
> Note this is still WIP and half-baked, but I figured it's important to
> validate the interface early.

Using different cpumasks for different isolated properties is the easy
part. The hard part is to make different subsystems change their
behavior as the isolation masks change dynamically at run time.
Currently, they check the housekeeping cpumask only at boot time or when
certain events happen.
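The boot-time-only check that makes dynamic updates hard can be
illustrated with a toy model (all names here are invented; the real
subsystems are kernel C code):

```python
class Housekeeping:
    """Toy stand-in for the housekeeping cpumask."""
    def __init__(self, cpus):
        self.cpus = set(cpus)

    def update(self, cpus):
        # Runtime change, e.g. driven from a cpuset file write.
        self.cpus = set(cpus)

hk = Housekeeping({0, 1, 2, 3})

snapshot = set(hk.cpus)   # subsystem samples the mask once at "boot"
hk.update({0, 1})         # CPUs 2-3 are isolated at runtime

# The snapshot is now stale; only a fresh lookup sees the change.
assert snapshot == {0, 1, 2, 3}
assert hk.cpus == {0, 1}
```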

Cheers,
Longman


2021-07-19 13:18:03

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

On Thu, Jul 15, 2021 at 11:04:19AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 15, 2021 at 01:13:38AM +0200, Frederic Weisbecker wrote:
> > On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:
>
> > > cpusets already has means to create partitions; why are you creating
> > > something else?
> >
> > I was about to answer that the semantics of isolcpus, which reference
> > a NULL domain, are different from SD_LOAD_BALANCE implied by
> > cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
> > been removed.
> >
> > How is cpuset.sched_load_balance implemented then? Commit
> > e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
> > SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
> > ends up creating a NULL domain, but that's not what I get. For example,
> > if I mount a single cpuset root (no other cpuset mountpoints):
>
> SD_LOAD_BALANCE was only for when you wanted to stop balancing inside a
> domain tree. That no longer happens (and hasn't for a *long* time).
> Cpusets simply create multiple domain trees (or an empty one if it's
> just one CPU).

Ok.

>
> > $ mount -t cgroup none ./cpuset -o cpuset
> > $ cd cpuset
> > $ cat cpuset.cpus
> > 0-7
> > $ cat cpuset.sched_load_balance
> > 1
> > $ echo 0 > cpuset.sched_load_balance
> > $ ls /sys/kernel/debug/domains/cpu1/
> > domain0 domain1
> >
> > I still get the domains on all CPUs...
>
> (note, that's the cgroup-v1 interface, the cgroup-v2 interface is
> significantly different)
>
> I'd suggest doing: echo 1 > /debug/sched/verbose. If I do the above, I
> get:
>
> [1290784.889705] CPU0 attaching NULL sched-domain.
> [1290784.894830] CPU1 attaching NULL sched-domain.

Thanks!

Eventually I uninstalled cgmanager and things seem to work now. I have
no idea why and I'm not sure I'm willing to investigate further :o)

2021-07-19 13:30:19

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

On Wed, Jul 14, 2021 at 01:31:57PM -0300, Marcelo Tosatti wrote:
> On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> > Add a new cpuset.isolation_mask file in order to be able to modify the
> > housekeeping cpumask for each individual isolation feature at runtime.
> > In the future this will include nohz_full, unbound timers,
> > unbound workqueues, unbound kthreads, managed irqs, etc...
> >
> > Start with supporting domain exclusion and CPUs passed through
> > "isolcpus=".
>
> Is it possible to just return -ENOTSUPPORTED for the features
> whose support is not present?

Maybe, although that looks like a specialized error for corner cases.

> >
> > CHECKME: Should we have individual cpuset.isolation.$feature files for
> > each isolation feature instead of a single mask file?
>
> Yes, guess that is useful, for example due to the -ENOTSUPPORTED
> comment above.
>
>
> Guarantees on updates
> =====================
>
> Perhaps start with a document with:
>
> On return from the write to the cpumask file, what are the guarantees?
>
> For example, for kthreads the guarantee is that any kernel threads created
> from that point on should start with the new mask. Therefore userspace
> should respect the order:
>
> 1) Change kthread mask.
> 2) Move threads.
>

Yep.

> Updates to interface
> ====================
>
> Also, thinking about updates to the interface (which today are one
> cpumask per isolation feature) might be useful. What can happen:
>
> 1) New isolation feature is added, feature name added to the interface.
>
> Userspace must support the new filename. If it is not there, then that's an
> old kernel without support for it.
>
> 2) If an isolation feature is removed, a file will be gone. What should
> be the behaviour there? Remove the file? (userspace should probably
> ignore the failure in that case?) (then feature names should not be
> reused, as that can confuse #1 above).

Heh, yeah that's complicated. I guess we should use one flag per file as that
fits well within the current cpuset design. But we must carefully choose the new
files to make sure they have the least chance of becoming useless in the long term.

> Or maybe have a versioned scheme?

I suspect we should avoid that at all costs :-)

Thanks!

2021-07-19 13:58:37

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset

On Fri, Jul 16, 2021 at 02:02:50PM -0400, Waiman Long wrote:
> On 7/14/21 9:54 AM, Frederic Weisbecker wrote:
> > The fact that "isolcpus=" behaviour can't be modified at runtime is an
> > eternal source of discussion and debate opposing a useful feature against
> > a terrible interface.
> >
> > I've long since tried to figure out a proper way to control this at
> > runtime using cpusets, which isn't easy as a boot time single cpumask
> > is difficult to map to a hierarchy of cpusets that can even overlap.
>
> I have a cpuset patch that allows disabling of load balancing in a cgroup-v2
> setting:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> The idea of cpuset partition is that there will be no overlap of cpus in
> different partitions. So there will be no confusion whether a cpu is
> load-balanced or not.

Oh ok I missed that, time for me to check your patchset.

Thanks!

>
> >
> > The idea here is to map the boot-set isolation behaviour to any cpuset
> > directory whose cpumask is a subset of "isolcpus=". I let you browse
> > for details on the last patch.
> >
> > Note this is still WIP and half-baked, but I figured it's important to
> > validate the interface early.
>
> Using different cpumasks for different isolated properties is the easy part.
> The hard part is to make different subsystems change their behavior as
> the isolation masks change dynamically at run time. Currently, they check
> the housekeeping cpumask only at boot time or when certain events happen.
>
> Cheers,
> Longman
>
>

2021-07-19 18:30:59

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file

On Mon, Jul 19, 2021 at 03:26:49PM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 14, 2021 at 01:31:57PM -0300, Marcelo Tosatti wrote:
> > On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> > > Add a new cpuset.isolation_mask file in order to be able to modify the
> > > housekeeping cpumask for each individual isolation feature at runtime.
> > > In the future this will include nohz_full, unbound timers,
> > > unbound workqueues, unbound kthreads, managed irqs, etc...
> > >
> > > Start with supporting domain exclusion and CPUs passed through
> > > "isolcpus=".
> >
> > Is it possible to just return -ENOTSUPPORTED for the features
> > whose support is not present?
>
> Maybe, although that looks like a specialized error for corner cases.

Well, are you going to implement runtime enablement for all features,
including nohz_full, in the first patch set?

From my POV returning -ENOTSUPPORTED would allow for a gradual
implementation of the features.

> > > CHECKME: Should we have individual cpuset.isolation.$feature files for
> > > each isolation feature instead of a single mask file?
> >
> > Yes, guess that is useful, for example due to the -ENOTSUPPORTED
> > comment above.
> >
> >
> > Guarantees on updates
> > =====================
> >
> > Perhaps start with a document with:
> >
> > On return from the write to the cpumask file, what are the guarantees?
> >
> > For example, for kthreads the guarantee is that any kernel threads created
> > from that point on should start with the new mask. Therefore userspace
> > should respect the order:
> >
> > 1) Change kthread mask.
> > 2) Move threads.
> >
>
> Yep.
>
> > Updates to interface
> > ====================
> >
> > Also, thinking about updates to the interface (which today are one
> > cpumask per isolation feature) might be useful. What can happen:
> >
> > 1) New isolation feature is added, feature name added to the interface.
> >
> > Userspace must support the new filename. If it is not there, then that's an
> > old kernel without support for it.
> >
> > 2) If an isolation feature is removed, a file will be gone. What should
> > be the behaviour there? Remove the file? (userspace should probably
> > ignore the failure in that case?) (then feature names should not be
> > reused, as that can confuse #1 above).
>
> Heh, yeah that's complicated. I guess we should use one flag per file as that
> fits well within the current cpuset design. But we must carefully choose the new
> files to make sure they have the least chance of becoming useless in the long term.
>
> > Or maybe have a versioned scheme?
>
> I suspect we should avoid that at all costs :-)
>
> Thanks!

Makes sense.