2023-11-27 04:21:12

by Waiman Long

Subject: [PATCH-cgroup 0/2] cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated()

This patchset includes the CPUs in isolated cpuset partitions in the
cpu_is_isolated() check to reduce interference from vmstat or memcg
local stock flushing on those CPUs.

To reduce the cpu_is_isolated() call overhead, a seqcount is used to
protect read access to the isolated cpumask without taking any lock.
As a result, callback_lock, which is held around the seqcount write
sections, is changed to a raw_spinlock_t so that this also works on
PREEMPT_RT kernels.

Waiman Long (2):
cgroup/cpuset: Make callback_lock a raw_spinlock_t
cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check

include/linux/cpuset.h | 6 ++
include/linux/sched/isolation.h | 4 +-
kernel/cgroup/cpuset.c | 127 +++++++++++++++++++-------------
3 files changed, 85 insertions(+), 52 deletions(-)

--
2.39.3


2023-11-27 04:21:15

by Waiman Long

Subject: [PATCH-cgroup 1/2] cgroup/cpuset: Make callback_lock a raw_spinlock_t

All the callback_lock critical sections are pretty small and there
shouldn't be much contention on that lock. Make it a raw_spinlock_t to
avoid additional locking overhead on PREEMPT_RT kernels.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 102 ++++++++++++++++++++---------------------
1 file changed, 51 insertions(+), 51 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2a16df86c55c..e34bbb0e2f24 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -445,7 +445,7 @@ void cpuset_unlock(void)
mutex_unlock(&cpuset_mutex);
}

-static DEFINE_SPINLOCK(callback_lock);
+static DEFINE_RAW_SPINLOCK(callback_lock);

static struct workqueue_struct *cpuset_migrate_mm_wq;

@@ -1588,7 +1588,7 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
cpumask_subset(top_cpuset.effective_cpus, tmp->new_cpus))
return 0;

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
list_add(&cs->remote_sibling, &remote_children);
if (cs->use_parent_ecpus) {
@@ -1597,7 +1597,7 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
cs->use_parent_ecpus = false;
parent->child_ecpus_count--;
}
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
update_unbound_workqueue_cpumask(isolcpus_updated);

/*
@@ -1625,7 +1625,7 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
WARN_ON_ONCE(!is_remote_partition(cs));
WARN_ON_ONCE(!cpumask_subset(tmp->new_cpus, subpartitions_cpus));

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
list_del_init(&cs->remote_sibling);
isolcpus_updated = partition_xcpus_del(cs->partition_root_state,
NULL, tmp->new_cpus);
@@ -1633,7 +1633,7 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
if (!cs->prs_err)
cs->prs_err = PERR_INVCPUS;
reset_partition_data(cs);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
update_unbound_workqueue_cpumask(isolcpus_updated);

/*
@@ -1680,12 +1680,12 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *newmask,
cpumask_subset(top_cpuset.effective_cpus, tmp->addmask)))
goto invalidate;

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
if (adding)
isolcpus_updated += partition_xcpus_add(prs, NULL, tmp->addmask);
if (deleting)
isolcpus_updated += partition_xcpus_del(prs, NULL, tmp->delmask);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
update_unbound_workqueue_cpumask(isolcpus_updated);

/*
@@ -2034,7 +2034,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
* Newly added CPUs will be removed from effective_cpus and
* newly deleted ones will be added back to effective_cpus.
*/
- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
if (old_prs != new_prs) {
cs->partition_root_state = new_prs;
if (new_prs <= 0)
@@ -2055,7 +2055,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
parent->nr_subparts += subparts_delta;
WARN_ON_ONCE(parent->nr_subparts < 0);
}
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
update_unbound_workqueue_cpumask(isolcpus_updated);

if ((old_prs != new_prs) && (cmd == partcmd_update))
@@ -2134,11 +2134,11 @@ static void compute_partition_effective_cpumask(struct cpuset *cs,
/*
* Invalidate child partition
*/
- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
make_partition_invalid(child);
cs->nr_subparts--;
child->nr_subparts = 0;
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
notify_partition_change(child, old_prs);
continue;
}
@@ -2195,9 +2195,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
* The case when exclusive_cpus isn't set is handled later.
*/
if (!cpumask_empty(cp->exclusive_cpus) && (cp != cs)) {
- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
compute_effective_exclusive_cpumask(cp, NULL);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
}

old_prs = new_prs = cp->partition_root_state;
@@ -2295,7 +2295,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
new_prs = cp->partition_root_state;
}

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cpumask_copy(cp->effective_cpus, tmp->new_cpus);
cp->partition_root_state = new_prs;
/*
@@ -2307,7 +2307,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
cp->cpus_allowed, parent->effective_xcpus);
else if (new_prs < 0)
reset_partition_data(cp);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);

notify_partition_change(cp, old_prs);

@@ -2536,12 +2536,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
trialcs->effective_cpus, &tmp);
}

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
cpumask_copy(cs->effective_xcpus, trialcs->effective_xcpus);
if ((old_prs > 0) && !is_partition_valid(cs))
reset_partition_data(cs);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);

/* effective_cpus/effective_xcpus will be updated here */
update_cpumasks_hier(cs, &tmp, hier_flags);
@@ -2636,12 +2636,12 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
remote_partition_check(cs, trialcs->effective_xcpus,
trialcs->effective_cpus, &tmp);
}
- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cpumask_copy(cs->exclusive_cpus, trialcs->exclusive_cpus);
cpumask_copy(cs->effective_xcpus, trialcs->effective_xcpus);
if ((old_prs > 0) && !is_partition_valid(cs))
reset_partition_data(cs);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);

/*
* Call update_cpumasks_hier() to update effective_cpus/effective_xcpus
@@ -2841,9 +2841,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
continue;
rcu_read_unlock();

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cp->effective_mems = *new_mems;
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);

WARN_ON(!is_in_v2_mode() &&
!nodes_equal(cp->mems_allowed, cp->effective_mems));
@@ -2913,9 +2913,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,

check_insane_mems_config(&trialcs->mems_allowed);

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cs->mems_allowed = trialcs->mems_allowed;
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);

/* use trialcs->mems_allowed as a temp variable */
update_nodemasks_hier(cs, &trialcs->mems_allowed);
@@ -3006,9 +3006,9 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
|| (is_spread_page(cs) != is_spread_page(trialcs)));

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cs->flags = trialcs->flags;
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);

if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
rebuild_sched_domains_locked();
@@ -3052,10 +3052,10 @@ static int update_prstate(struct cpuset *cs, int new_prs)
* later if partition becomes invalid.
*/
if ((new_prs > 0) && cpumask_empty(cs->exclusive_cpus)) {
- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cpumask_and(cs->effective_xcpus,
cs->cpus_allowed, parent->effective_xcpus);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
}

err = update_partition_exclusive(cs, new_prs);
@@ -3112,14 +3112,14 @@ static int update_prstate(struct cpuset *cs, int new_prs)
update_partition_exclusive(cs, new_prs);
}

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cs->partition_root_state = new_prs;
WRITE_ONCE(cs->prs_err, err);
if (!is_partition_valid(cs))
reset_partition_data(cs);
else if (new_xcpus_state)
partition_xcpus_newstate(old_prs, new_prs, cs->effective_xcpus);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
update_unbound_workqueue_cpumask(new_xcpus_state);

/* Force update if switching back to member */
@@ -3650,7 +3650,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
cpuset_filetype_t type = seq_cft(sf)->private;
int ret = 0;

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);

switch (type) {
case FILE_CPULIST:
@@ -3681,7 +3681,7 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
ret = -EINVAL;
}

- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
return ret;
}

@@ -4042,7 +4042,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)

cpuset_inc();

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
if (is_in_v2_mode()) {
cpumask_copy(cs->effective_cpus, parent->effective_cpus);
cs->effective_mems = parent->effective_mems;
@@ -4062,7 +4062,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
!is_sched_load_balance(parent))
clear_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);

- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);

if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
goto out_unlock;
@@ -4089,12 +4089,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
}
rcu_read_unlock();

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cs->mems_allowed = parent->mems_allowed;
cs->effective_mems = parent->mems_allowed;
cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
out_unlock:
mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
@@ -4150,7 +4150,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
static void cpuset_bind(struct cgroup_subsys_state *root_css)
{
mutex_lock(&cpuset_mutex);
- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);

if (is_in_v2_mode()) {
cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
@@ -4162,7 +4162,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
top_cpuset.mems_allowed = top_cpuset.effective_mems;
}

- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
mutex_unlock(&cpuset_mutex);
}

@@ -4349,12 +4349,12 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
{
bool is_empty;

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cpumask_copy(cs->cpus_allowed, new_cpus);
cpumask_copy(cs->effective_cpus, new_cpus);
cs->mems_allowed = *new_mems;
cs->effective_mems = *new_mems;
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);

/*
* Don't call update_tasks_cpumask() if the cpuset becomes empty,
@@ -4391,10 +4391,10 @@ hotplug_update_tasks(struct cpuset *cs,
if (nodes_empty(*new_mems))
*new_mems = parent_cs(cs)->effective_mems;

- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
cpumask_copy(cs->effective_cpus, new_cpus);
cs->effective_mems = *new_mems;
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);

if (cpus_updated)
update_tasks_cpumask(cs, new_cpus);
@@ -4597,7 +4597,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)

/* For v1, synchronize cpus_allowed to cpu_active_mask */
if (cpus_updated) {
- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
if (!on_dfl)
cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
/*
@@ -4616,17 +4616,17 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
}
}
cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
/* we don't mess with cpumasks of tasks in top_cpuset */
}

/* synchronize mems_allowed to N_MEMORY */
if (mems_updated) {
- spin_lock_irq(&callback_lock);
+ raw_spin_lock_irq(&callback_lock);
if (!on_dfl)
top_cpuset.mems_allowed = new_mems;
top_cpuset.effective_mems = new_mems;
- spin_unlock_irq(&callback_lock);
+ raw_spin_unlock_irq(&callback_lock);
update_tasks_nodemask(&top_cpuset);
}

@@ -4726,7 +4726,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
unsigned long flags;
struct cpuset *cs;

- spin_lock_irqsave(&callback_lock, flags);
+ raw_spin_lock_irqsave(&callback_lock, flags);
rcu_read_lock();

cs = task_cs(tsk);
@@ -4750,7 +4750,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
}

rcu_read_unlock();
- spin_unlock_irqrestore(&callback_lock, flags);
+ raw_spin_unlock_irqrestore(&callback_lock, flags);
}

/**
@@ -4821,11 +4821,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
nodemask_t mask;
unsigned long flags;

- spin_lock_irqsave(&callback_lock, flags);
+ raw_spin_lock_irqsave(&callback_lock, flags);
rcu_read_lock();
guarantee_online_mems(task_cs(tsk), &mask);
rcu_read_unlock();
- spin_unlock_irqrestore(&callback_lock, flags);
+ raw_spin_unlock_irqrestore(&callback_lock, flags);

return mask;
}
@@ -4917,14 +4917,14 @@ bool cpuset_node_allowed(int node, gfp_t gfp_mask)
return true;

/* Not hardwall and node outside mems_allowed: scan up cpusets */
- spin_lock_irqsave(&callback_lock, flags);
+ raw_spin_lock_irqsave(&callback_lock, flags);

rcu_read_lock();
cs = nearest_hardwall_ancestor(task_cs(current));
allowed = node_isset(node, cs->mems_allowed);
rcu_read_unlock();

- spin_unlock_irqrestore(&callback_lock, flags);
+ raw_spin_unlock_irqrestore(&callback_lock, flags);
return allowed;
}

--
2.39.3

2023-11-27 04:21:20

by Waiman Long

Subject: [PATCH-cgroup 2/2] cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check

Currently, the cpu_is_isolated() function checks only the statically
isolated CPUs specified via the "isolcpus" and "nohz_full" kernel
command line options. This function is used by vmstat and memcg to
reduce interference with isolated CPUs by not doing stat flushing
or scheduling work on those CPUs.

Workloads running on isolated CPUs within isolated cpuset
partitions should receive the same treatment to reduce unnecessary
interference. This patch introduces a new cpuset_cpu_is_isolated()
function to be called by cpu_is_isolated() so that CPUs in dynamically
created isolated cpuset partitions are also included in the check.

To minimize the overhead of calling cpuset_cpu_is_isolated(), a
seqcount is used to protect read access to the isolated cpumask without
taking the cpuset_mutex or callback_lock.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/cpuset.h | 6 ++++++
include/linux/sched/isolation.h | 4 +++-
kernel/cgroup/cpuset.c | 25 +++++++++++++++++++++++++
3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d629094fac6e..875d12598bd2 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -77,6 +77,7 @@ extern void cpuset_lock(void);
extern void cpuset_unlock(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern bool cpuset_cpu_is_isolated(int cpu);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -207,6 +208,11 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
return false;
}

+static inline bool cpuset_cpu_is_isolated(int cpu)
+{
+ return false;
+}
+
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
return node_possible_map;
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index fe1a46f30d24..2b461129d1fa 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -2,6 +2,7 @@
#define _LINUX_SCHED_ISOLATION_H

#include <linux/cpumask.h>
+#include <linux/cpuset.h>
#include <linux/init.h>
#include <linux/tick.h>

@@ -67,7 +68,8 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
static inline bool cpu_is_isolated(int cpu)
{
return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
- !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
+ !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
+ cpuset_cpu_is_isolated(cpu);
}

#endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e34bbb0e2f24..4adb6d2209ca 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -208,8 +208,13 @@ static cpumask_var_t subpartitions_cpus;

/*
* Exclusive CPUs in isolated partitions
+ *
+ * The isolcpus_seq is used to protect read access to isolated_cpus without
+ * taking callback_lock or cpuset_mutex while write access requires taking
+ * both cpuset_mutex and callback_lock.
*/
static cpumask_var_t isolated_cpus;
+static seqcount_t isolcpus_seq = SEQCNT_ZERO(isolcpus_seq);

/* List of remote partition root children */
static struct list_head remote_children;
@@ -1435,10 +1440,12 @@ static void reset_partition_data(struct cpuset *cs)
static void partition_xcpus_newstate(int old_prs, int new_prs, struct cpumask *xcpus)
{
WARN_ON_ONCE(old_prs == new_prs);
+ write_seqcount_begin(&isolcpus_seq);
if (new_prs == PRS_ISOLATED)
cpumask_or(isolated_cpus, isolated_cpus, xcpus);
else
cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
+ write_seqcount_end(&isolcpus_seq);
}

/*
@@ -1518,6 +1525,24 @@ static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
WARN_ON_ONCE(ret < 0);
}

+/**
+ * cpuset_cpu_is_isolated - Check if the given CPU is isolated
+ * @cpu: the CPU number to be checked
+ * Return: true if CPU is used in an isolated partition, false otherwise
+ */
+bool cpuset_cpu_is_isolated(int cpu)
+{
+ unsigned int seq;
+ bool ret;
+
+ do {
+ seq = read_seqcount_begin(&isolcpus_seq);
+ ret = cpumask_test_cpu(cpu, isolated_cpus);
+ } while (read_seqcount_retry(&isolcpus_seq, seq));
+ return ret;
+}
+EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
+
/*
* compute_effective_exclusive_cpumask - compute effective exclusive CPUs
* @cs: cpuset
--
2.39.3

2023-11-28 16:56:26

by Tejun Heo

Subject: Re: [PATCH-cgroup 2/2] cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check

Hello,

On Sun, Nov 26, 2023 at 11:19:56PM -0500, Waiman Long wrote:
> +bool cpuset_cpu_is_isolated(int cpu)
> +{
> + unsigned int seq;
> + bool ret;
> +
> + do {
> + seq = read_seqcount_begin(&isolcpus_seq);
> + ret = cpumask_test_cpu(cpu, isolated_cpus);
> + } while (read_seqcount_retry(&isolcpus_seq, seq));
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);

We're testing a bit in a bitmask. I don't think we need to worry about value
integrity from RMW updates being broken up. ie. We can just test the bit
without seqlock and drop the first patch?
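
For illustration, the seqcount-free variant would then just be (a
minimal sketch, not tested):

bool cpuset_cpu_is_isolated(int cpu)
{
	/* single-bit test; setting/clearing one bit can't be seen half-done */
	return cpumask_test_cpu(cpu, isolated_cpus);
}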

Thanks.

--
tejun

2023-11-28 18:33:28

by Waiman Long

Subject: Re: [PATCH-cgroup 2/2] cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check


On 11/28/23 11:56, Tejun Heo wrote:
> Hello,
>
> On Sun, Nov 26, 2023 at 11:19:56PM -0500, Waiman Long wrote:
>> +bool cpuset_cpu_is_isolated(int cpu)
>> +{
>> + unsigned int seq;
>> + bool ret;
>> +
>> + do {
>> + seq = read_seqcount_begin(&isolcpus_seq);
>> + ret = cpumask_test_cpu(cpu, isolated_cpus);
>> + } while (read_seqcount_retry(&isolcpus_seq, seq));
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
> We're testing a bit in a bitmask. I don't think we need to worry about value
> integrity from RMW updates being broken up. ie. We can just test the bit
> without seqlock and drop the first patch?

My concern is that if we have an isolated partition with a set of
isolated CPUs (say 2-4), I don't want any addition or deletion of CPUs
in another isolated partition to affect the test result of the
pre-existing one. A stale result for the partition that is being
changed is fair game.

Depending on how the cpumask operators are implemented, we may not have
a guarantee that testing CPU 2, for instance, will always return true.
That is why I am adding a synchronization primitive to prevent racing.
My original plan was to take the callback_lock. However, that can be
somewhat costly if this API is used rather frequently, especially on
systems with a large number of CPUs. So I changed it to use a seqcount
for read protection, which has a much lower cost.

Regarding patch 1, which converts callback_lock to a raw_spinlock_t, I
can drop it if you have concerns about that change. I would just need
to surround the write_seqcount_begin()/write_seqcount_end() calls with
preempt_disable()/preempt_enable().
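
Roughly like this (just a sketch of that alternative, assuming the
seqcount stays and callback_lock remains a plain spinlock_t):

static void partition_xcpus_newstate(int old_prs, int new_prs,
				     struct cpumask *xcpus)
{
	WARN_ON_ONCE(old_prs == new_prs);
	/* keep the write section non-preemptible so readers can't spin on it */
	preempt_disable();
	write_seqcount_begin(&isolcpus_seq);
	if (new_prs == PRS_ISOLATED)
		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
	else
		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
	write_seqcount_end(&isolcpus_seq);
	preempt_enable();
}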

Cheers,
Longman

2023-11-28 22:12:52

by Tejun Heo

Subject: Re: [PATCH-cgroup 2/2] cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check

Hello,

On Tue, Nov 28, 2023 at 01:32:53PM -0500, Waiman Long wrote:
> On 11/28/23 11:56, Tejun Heo wrote:
> > Hello,
> >
> > On Sun, Nov 26, 2023 at 11:19:56PM -0500, Waiman Long wrote:
> > > +bool cpuset_cpu_is_isolated(int cpu)
> > > +{
> > > + unsigned int seq;
> > > + bool ret;
> > > +
> > > + do {
> > > + seq = read_seqcount_begin(&isolcpus_seq);
> > > + ret = cpumask_test_cpu(cpu, isolated_cpus);
> > > + } while (read_seqcount_retry(&isolcpus_seq, seq));
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
> > We're testing a bit in a bitmask. I don't think we need to worry about value
> > integrity from RMW updates being broken up. ie. We can just test the bit
> > without seqlock and drop the first patch?
>
> My concern is that if we have an isolated partition with a set of isolated
> CPUs (say 2-4), I don't want any addition or deletion of CPUs in another
> isolated partition to affect the test result of the pre-existing one. A
> stale result for the partition that is being changed is fair game.
>
> Depending on how the cpumask operators are implemented, we may not have a
> guarantee that testing CPU 2, for instance, will always return true. That is

Can you please elaborate this part a bit? I'm having a difficult time
imagining the sequence of operations where this would matter but that could
easily be me not being familiar with the details.

Thanks.

--
tejun

2023-11-29 16:01:26

by Waiman Long

Subject: Re: [PATCH-cgroup 2/2] cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check


On 11/28/23 17:12, Tejun Heo wrote:
> Hello,
>
> On Tue, Nov 28, 2023 at 01:32:53PM -0500, Waiman Long wrote:
>> On 11/28/23 11:56, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Sun, Nov 26, 2023 at 11:19:56PM -0500, Waiman Long wrote:
>>>> +bool cpuset_cpu_is_isolated(int cpu)
>>>> +{
>>>> + unsigned int seq;
>>>> + bool ret;
>>>> +
>>>> + do {
>>>> + seq = read_seqcount_begin(&isolcpus_seq);
>>>> + ret = cpumask_test_cpu(cpu, isolated_cpus);
>>>> + } while (read_seqcount_retry(&isolcpus_seq, seq));
>>>> + return ret;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
>>> We're testing a bit in a bitmask. I don't think we need to worry about value
>>> integrity from RMW updates being broken up. ie. We can just test the bit
>>> without seqlock and drop the first patch?
>> My concern is that if we have an isolated partition with a set of isolated
>> CPUs (say 2-4), I don't want any addition or deletion of CPUs in another
>> isolated partition to affect the test result of the pre-existing one. A
>> stale result for the partition that is being changed is fair game.
>>
>> Depending on how the cpumask operators are implemented, we may not have a
>> guarantee that testing CPU 2, for instance, will always return true. That is
> Can you please elaborate this part a bit? I'm having a difficult time
> imagining the sequence of operations where this would matter but that could
> easily be me not being familiar with the details.

I may be a bit paranoid about incorrect results due to racing, as I
have been burned before. Just testing a bit in the bitmask is probably
OK. I don't think it will be a problem for x86, but I am less certain
about other, more exotic architectures like arm64 or PPC, which I am
less familiar with. I added the seqcount for synchronization just for
peace of mind. I can take the seqcount out if you don't think it is
necessary.

I have also been thinking about an alternative helper that returns the
whole isolated cpumask. In both cases where cpu_is_isolated() is used,
we have to iterate over all the possible CPUs anyway, so it would be
more efficient to have the whole cpumask available. In that case, we
may want to keep the seqcount to avoid returning an intermediate
result. Anyway, this is just a thought for now; I am not planning to do
that at the moment.
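
Something along these lines, perhaps (a hypothetical helper, name and
all, just to show the idea):

/* Copy the current isolated cpumask so callers pay the read cost only once */
void cpuset_isolated_cpus(struct cpumask *mask)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&isolcpus_seq);
		cpumask_copy(mask, isolated_cpus);
	} while (read_seqcount_retry(&isolcpus_seq, seq));
}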

Cheers,
Longman

2023-12-01 17:07:22

by Tejun Heo

Subject: Re: [PATCH-cgroup 2/2] cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check

Hello,

On Wed, Nov 29, 2023 at 11:01:04AM -0500, Waiman Long wrote:
...
> > > Depending on how the cpumask operators are implemented, we may not have a
> > > guarantee that testing CPU 2, for instance, will always return true. That is
> > Can you please elaborate this part a bit? I'm having a difficult time
> > imagining the sequence of operations where this would matter but that could
> > easily be me not being familiar with the details.
>
> I may be a bit paranoid about incorrect results due to racing, as I have
> been burned before. Just testing a bit in the bitmask is probably OK. I don't

Setting and clearing a bit is as atomic as it gets, right?

> think it will be a problem for x86, but I am less certain about other, more
> exotic architectures like arm64 or PPC, which I am less familiar with. I added
> the seqcount for synchronization just for peace of mind. I can take the
> seqcount out if you don't think it is necessary.

I just can't think of a case where this would be broken. The data being read
and written is atomic. There's no way to break a bit operation into multiple
pieces. It is possible to write really bone-headed bitmask operations
(like, if you shift the bits into place or sth) to make the bits go through
unintended changes but that'd just be a flat-out broken implementation. Even
for a bitmask where write accesses are synchronized through a spinlock, we
should still be able to use test_bit() without holding the lock. This seems
like a pretty basic assumption.

Adding unnecessary synchronization confuses the readers. If we don't need
it, we shouldn't have it.

Thanks.

--
tejun

2023-12-05 22:16:51

by Waiman Long

Subject: Re: [PATCH-cgroup 2/2] cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check

On 12/1/23 12:06, Tejun Heo wrote:
> Hello,
>
> On Wed, Nov 29, 2023 at 11:01:04AM -0500, Waiman Long wrote:
> ...
>>>> Depending on how the cpumask operators are implemented, we may not have a
>>>> guarantee that testing CPU 2, for instance, will always return true. That is
>>> Can you please elaborate this part a bit? I'm having a difficult time
>>> imagining the sequence of operations where this would matter but that could
>>> easily be me not being familiar with the details.
>> I may be a bit paranoid about incorrect results due to racing, as I have
>> been burned before. Just testing a bit in the bitmask is probably OK. I don't
> Setting and clearing a bit is as atomic as it gets, right?
Yes, I think so.
>
>> think it will be a problem for x86, but I am less certain about other, more
>> exotic architectures like arm64 or PPC, which I am less familiar with. I added
>> the seqcount for synchronization just for peace of mind. I can take the
>> seqcount out if you don't think it is necessary.
> I just can't think of a case where this would be broken. The data being read
> and written is atomic. There's no way to break a bit operation into multiple
> pieces. It is possible to write really bone-headed bitmask operations
> (like, if you shift the bits into place or sth) to make the bits go through
> unintended changes but that'd just be a flat-out broken implementation. Even
> for a bitmask where write accesses are synchronized through a spinlock, we
> should still be able to use test_bit() without holding the lock. This seems
> like a pretty basic assumption.
>
> Adding unnecessary synchronization confuses the readers. If we don't need
> it, we shouldn't have it.

OK, I will send a simplified v2 patch.

Cheers,
Longman