__mm_cid_get() uses the __mm_cid_try_get() helper to atomically acquire a
bit in the mm cid mask. Now that we have an atomic find_and_set_bit(), we
can easily extend it to cpumasks and use it in the scheduler code.

__mm_cid_try_get() has an infinite loop, which may delay forward
progress of __mm_cid_get() when the mask is dense. cpumask_find_and_set()
doesn't poll the mask indefinitely: it returns as soon as nothing is
found after the first pass, allowing the caller to acquire the lock and
set use_cid_lock sooner, if needed.

cpumask_find_and_set() treats the cid mask as a volatile region of memory,
which it actually is in this case. So if the mask changes while the search
is in progress, KCSAN won't fire a warning on it.
Signed-off-by: Yury Norov <[email protected]>
---
include/linux/cpumask.h | 12 ++++++++++
kernel/sched/sched.h | 52 ++++++++++++-----------------------------
2 files changed, 27 insertions(+), 37 deletions(-)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cfb545841a2c..c2acced8be4e 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -271,6 +271,18 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p,
small_cpumask_bits, n + 1);
}
+/**
+ * cpumask_find_and_set - find the first unset cpu in a cpumask and
+ * set it atomically
+ * @srcp: the cpumask pointer
+ *
+ * Return: the index of the found and set CPU, or >= nr_cpu_ids if nothing is found.
+ */
+static inline unsigned int cpumask_find_and_set(volatile struct cpumask *srcp)
+{
+ return find_and_set_bit(cpumask_bits(srcp), small_cpumask_bits);
+}
+
/**
* for_each_cpu - iterate over every cpu in a mask
* @cpu: the (optionally unsigned) integer iterator
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..b2f095a9fc40 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3345,28 +3345,6 @@ static inline void mm_cid_put(struct mm_struct *mm)
__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
}
-static inline int __mm_cid_try_get(struct mm_struct *mm)
-{
- struct cpumask *cpumask;
- int cid;
-
- cpumask = mm_cidmask(mm);
- /*
- * Retry finding first zero bit if the mask is temporarily
- * filled. This only happens during concurrent remote-clear
- * which owns a cid without holding a rq lock.
- */
- for (;;) {
- cid = cpumask_first_zero(cpumask);
- if (cid < nr_cpu_ids)
- break;
- cpu_relax();
- }
- if (cpumask_test_and_set_cpu(cid, cpumask))
- return -1;
- return cid;
-}
-
/*
* Save a snapshot of the current runqueue time of this cpu
* with the per-cpu cid value, allowing to estimate how recently it was used.
@@ -3381,25 +3359,25 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
{
+ struct cpumask *cpumask = mm_cidmask(mm);
int cid;
- /*
- * All allocations (even those using the cid_lock) are lock-free. If
- * use_cid_lock is set, hold the cid_lock to perform cid allocation to
- * guarantee forward progress.
- */
+ /* All allocations (even those using the cid_lock) are lock-free. */
if (!READ_ONCE(use_cid_lock)) {
- cid = __mm_cid_try_get(mm);
- if (cid >= 0)
+ cid = cpumask_find_and_set(cpumask);
+ if (cid < nr_cpu_ids)
goto end;
- raw_spin_lock(&cid_lock);
- } else {
- raw_spin_lock(&cid_lock);
- cid = __mm_cid_try_get(mm);
- if (cid >= 0)
- goto unlock;
}
+ /*
+ * If use_cid_lock is set, hold the cid_lock to perform cid
+ * allocation to guarantee forward progress.
+ */
+ raw_spin_lock(&cid_lock);
+ cid = cpumask_find_and_set(cpumask);
+ if (cid < nr_cpu_ids)
+ goto unlock;
+
/*
* cid concurrently allocated. Retry while forcing following
* allocations to use the cid_lock to ensure forward progress.
@@ -3415,9 +3393,9 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
* all newcoming allocations observe the use_cid_lock flag set.
*/
do {
- cid = __mm_cid_try_get(mm);
+ cid = cpumask_find_and_set(cpumask);
cpu_relax();
- } while (cid < 0);
+ } while (cid >= nr_cpu_ids);
/*
* Allocate before clearing use_cid_lock. Only care about
* program order because this is for forward progress.
--
2.39.2
On Sat, Nov 18, 2023 at 07:50:35AM -0800, Yury Norov wrote:
> __mm_cid_get() uses a __mm_cid_try_get() helper to atomically acquire a
> bit in mm cid mask. Now that we have atomic find_and_set_bit(), we can
> easily extend it to cpumasks and use in the scheduler code.
>
> __mm_cid_try_get() has an infinite loop, which may delay forward
> progress of __mm_cid_get() when the mask is dense. The
> cpumask_find_and_set() doesn't poll the mask infinitely, and returns as
> soon as nothing has found after the first iteration, allowing to acquire
> the lock, and set use_cid_lock faster, if needed.
Mathieu, I forgot again, but the comment delete seems to suggest you did
this on purpose...
> cpumask_find_and_set() considers cid mask as a volatile region of memory,
> as it actually is in this case. So, if it's changed while search is in
> progress, KCSAN wouldn't fire warning on it.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> include/linux/cpumask.h | 12 ++++++++++
> kernel/sched/sched.h | 52 ++++++++++++-----------------------------
> 2 files changed, 27 insertions(+), 37 deletions(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index cfb545841a2c..c2acced8be4e 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -271,6 +271,18 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p,
> small_cpumask_bits, n + 1);
> }
>
> +/**
> + * cpumask_find_and_set - find the first unset cpu in a cpumask and
> + * set it atomically
> + * @srcp: the cpumask pointer
> + *
> + * Return: >= nr_cpu_ids if nothing is found.
> + */
> +static inline unsigned int cpumask_find_and_set(volatile struct cpumask *srcp)
> +{
> + return find_and_set_bit(cpumask_bits(srcp), small_cpumask_bits);
> +}
> +
> /**
> * for_each_cpu - iterate over every cpu in a mask
> * @cpu: the (optionally unsigned) integer iterator
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2e5a95486a42..b2f095a9fc40 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3345,28 +3345,6 @@ static inline void mm_cid_put(struct mm_struct *mm)
> __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
> }
>
> -static inline int __mm_cid_try_get(struct mm_struct *mm)
> -{
> - struct cpumask *cpumask;
> - int cid;
> -
> - cpumask = mm_cidmask(mm);
> - /*
> - * Retry finding first zero bit if the mask is temporarily
> - * filled. This only happens during concurrent remote-clear
> - * which owns a cid without holding a rq lock.
> - */
> - for (;;) {
> - cid = cpumask_first_zero(cpumask);
> - if (cid < nr_cpu_ids)
> - break;
> - cpu_relax();
> - }
> - if (cpumask_test_and_set_cpu(cid, cpumask))
> - return -1;
> - return cid;
> -}
> -
> /*
> * Save a snapshot of the current runqueue time of this cpu
> * with the per-cpu cid value, allowing to estimate how recently it was used.
> @@ -3381,25 +3359,25 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
>
> static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
> {
> + struct cpumask *cpumask = mm_cidmask(mm);
> int cid;
>
> - /*
> - * All allocations (even those using the cid_lock) are lock-free. If
> - * use_cid_lock is set, hold the cid_lock to perform cid allocation to
> - * guarantee forward progress.
> - */
> + /* All allocations (even those using the cid_lock) are lock-free. */
> if (!READ_ONCE(use_cid_lock)) {
> - cid = __mm_cid_try_get(mm);
> - if (cid >= 0)
> + cid = cpumask_find_and_set(cpumask);
> + if (cid < nr_cpu_ids)
> goto end;
> - raw_spin_lock(&cid_lock);
> - } else {
> - raw_spin_lock(&cid_lock);
> - cid = __mm_cid_try_get(mm);
> - if (cid >= 0)
> - goto unlock;
> }
>
> + /*
> + * If use_cid_lock is set, hold the cid_lock to perform cid
> + * allocation to guarantee forward progress.
> + */
> + raw_spin_lock(&cid_lock);
> + cid = cpumask_find_and_set(cpumask);
> + if (cid < nr_cpu_ids)
> + goto unlock;
> +
> /*
> * cid concurrently allocated. Retry while forcing following
> * allocations to use the cid_lock to ensure forward progress.
> @@ -3415,9 +3393,9 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
> * all newcoming allocations observe the use_cid_lock flag set.
> */
> do {
> - cid = __mm_cid_try_get(mm);
> + cid = cpumask_find_and_set(cpumask);
> cpu_relax();
> - } while (cid < 0);
> + } while (cid >= nr_cpu_ids);
> /*
> * Allocate before clearing use_cid_lock. Only care about
> * program order because this is for forward progress.
> --
> 2.39.2
>
On 2023-11-20 06:31, Peter Zijlstra wrote:
> On Sat, Nov 18, 2023 at 07:50:35AM -0800, Yury Norov wrote:
>> __mm_cid_get() uses a __mm_cid_try_get() helper to atomically acquire a
>> bit in mm cid mask. Now that we have atomic find_and_set_bit(), we can
>> easily extend it to cpumasks and use in the scheduler code.
>>
>> __mm_cid_try_get() has an infinite loop, which may delay forward
>> progress of __mm_cid_get() when the mask is dense. The
>> cpumask_find_and_set() doesn't poll the mask infinitely, and returns as
>> soon as nothing has found after the first iteration, allowing to acquire
>> the lock, and set use_cid_lock faster, if needed.
>
> Methieu, I forgot again, but the comment delete seems to suggest you did
> this on purpose...
See comments below.
>
>> cpumask_find_and_set() considers cid mask as a volatile region of memory,
>> as it actually is in this case. So, if it's changed while search is in
>> progress, KCSAN wouldn't fire warning on it.
>>
>> Signed-off-by: Yury Norov <[email protected]>
>> ---
>> include/linux/cpumask.h | 12 ++++++++++
>> kernel/sched/sched.h | 52 ++++++++++++-----------------------------
>> 2 files changed, 27 insertions(+), 37 deletions(-)
>>
>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>> index cfb545841a2c..c2acced8be4e 100644
>> --- a/include/linux/cpumask.h
>> +++ b/include/linux/cpumask.h
>> @@ -271,6 +271,18 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p,
>> small_cpumask_bits, n + 1);
>> }
>>
>> +/**
>> + * cpumask_find_and_set - find the first unset cpu in a cpumask and
>> + * set it atomically
>> + * @srcp: the cpumask pointer
>> + *
>> + * Return: >= nr_cpu_ids if nothing is found.
>> + */
>> +static inline unsigned int cpumask_find_and_set(volatile struct cpumask *srcp)
>> +{
>> + return find_and_set_bit(cpumask_bits(srcp), small_cpumask_bits);
>> +}
>> +
>> /**
>> * for_each_cpu - iterate over every cpu in a mask
>> * @cpu: the (optionally unsigned) integer iterator
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 2e5a95486a42..b2f095a9fc40 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -3345,28 +3345,6 @@ static inline void mm_cid_put(struct mm_struct *mm)
>> __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
>> }
>>
>> -static inline int __mm_cid_try_get(struct mm_struct *mm)
>> -{
>> - struct cpumask *cpumask;
>> - int cid;
>> -
>> - cpumask = mm_cidmask(mm);
>> - /*
>> - * Retry finding first zero bit if the mask is temporarily
>> - * filled. This only happens during concurrent remote-clear
>> - * which owns a cid without holding a rq lock.
>> - */
>> - for (;;) {
>> - cid = cpumask_first_zero(cpumask);
>> - if (cid < nr_cpu_ids)
>> - break;
>> - cpu_relax();
>> - }
>> - if (cpumask_test_and_set_cpu(cid, cpumask))
>> - return -1;
This was split into find / test_and_set on purpose, because follow-up
patches I have (implementing numa-aware mm_cid) include a scan which
needs to walk sets of two cpumasks in parallel (with "and" and
"andnot" operators).
Moreover, the "mask full" scenario only happens while a concurrent
remote-clear temporarily owns a cid without rq lock. See
sched_mm_cid_remote_clear():
/*
* The cid is unused, so it can be unset.
* Disable interrupts to keep the window of cid ownership without rq
* lock small.
*/
local_irq_save(flags);
if (try_cmpxchg(&pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
__mm_cid_put(mm, cid);
local_irq_restore(flags);
The proposed patch turns this scenario into something heavier (setting
use_cid_lock) rather than just retrying. I guess the question to ask
here is whether it is theoretically possible for __mm_cid_try_get() to
fail to make forward progress if we have a high rate of
sched_mm_cid_remote_clear. If we decide that this is indeed a possible
progress-failure scenario, then it makes sense to fall back to
use_cid_lock as soon as a full mask is encountered.
However, removing the __mm_cid_try_get() helper will make it harder to
integrate the follow-up numa-awareness patches I have on top.

I am not against using cpumask_find_and_set(), but can we keep the
__mm_cid_try_get() helper to facilitate integration of future work?
We just have to make it use cpumask_find_and_set(), which should be
easy.
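For reference, a minimal sketch of what that could look like, simply rebasing
the helper on the new cpumask_find_and_set() while preserving the existing
return convention (cid on success, -1 on failure). Note that, like the posted
patch, it no longer retries when the mask is temporarily full; a full mask
just returns -1:

	static inline int __mm_cid_try_get(struct mm_struct *mm)
	{
		/* Atomically claim the first free cid, if any. */
		unsigned int cid = cpumask_find_and_set(mm_cidmask(mm));

		return cid < nr_cpu_ids ? (int)cid : -1;
	}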
>> - return cid;
>> -}
>> -
>> /*
>> * Save a snapshot of the current runqueue time of this cpu
>> * with the per-cpu cid value, allowing to estimate how recently it was used.
>> @@ -3381,25 +3359,25 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
>>
>> static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
>> {
>> + struct cpumask *cpumask = mm_cidmask(mm);
>> int cid;
>>
>> - /*
>> - * All allocations (even those using the cid_lock) are lock-free. If
>> - * use_cid_lock is set, hold the cid_lock to perform cid allocation to
>> - * guarantee forward progress.
>> - */
>> + /* All allocations (even those using the cid_lock) are lock-free. */
>> if (!READ_ONCE(use_cid_lock)) {
>> - cid = __mm_cid_try_get(mm);
>> - if (cid >= 0)
>> + cid = cpumask_find_and_set(cpumask);
>> + if (cid < nr_cpu_ids)
>> goto end;
>> - raw_spin_lock(&cid_lock);
>> - } else {
>> - raw_spin_lock(&cid_lock);
>> - cid = __mm_cid_try_get(mm);
>> - if (cid >= 0)
>> - goto unlock;
>> }
>>
>> + /*
>> + * If use_cid_lock is set, hold the cid_lock to perform cid
>> + * allocation to guarantee forward progress.
>> + */
>> + raw_spin_lock(&cid_lock);
>> + cid = cpumask_find_and_set(cpumask);
>> + if (cid < nr_cpu_ids)
>> + goto unlock;
In the !use_cid_lock case, where we already failed a lookup above, this change
ends up doing another lookup attempt under the lock before setting
use_cid_lock and retrying until success. I am not sure what the motivation is
for changing the code flow here?
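For comparison, here is a sketch of what keeping the original flow would look
like with the new helper, i.e. the lock-held lookup only happens on the
use_cid_lock path. This is just the removed structure with
cpumask_find_and_set() substituted, shown for discussion rather than as a
proposal:

	if (!READ_ONCE(use_cid_lock)) {
		cid = cpumask_find_and_set(cpumask);
		if (cid < nr_cpu_ids)
			goto end;
		/* Lock-free attempt failed: fall through to forced-lock path. */
		raw_spin_lock(&cid_lock);
	} else {
		raw_spin_lock(&cid_lock);
		cid = cpumask_find_and_set(cpumask);
		if (cid < nr_cpu_ids)
			goto unlock;
	}
	/* cid concurrently allocated: set use_cid_lock and retry. */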
General comment about the rest of the series: please review code comments for
typos.
Thanks,
Mathieu
>> +
>> /*
>> * cid concurrently allocated. Retry while forcing following
>> * allocations to use the cid_lock to ensure forward progress.
>> @@ -3415,9 +3393,9 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
>> * all newcoming allocations observe the use_cid_lock flag set.
>> */
>> do {
>> - cid = __mm_cid_try_get(mm);
>> + cid = cpumask_find_and_set(cpumask);
>> cpu_relax();
>> - } while (cid < 0);
>> + } while (cid >= nr_cpu_ids);
>> /*
>> * Allocate before clearing use_cid_lock. Only care about
>> * program order because this is for forward progress.
>> --
>> 2.39.2
>>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Mon, Nov 20, 2023 at 11:17:32AM -0500, Mathieu Desnoyers wrote:
...
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 2e5a95486a42..b2f095a9fc40 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -3345,28 +3345,6 @@ static inline void mm_cid_put(struct mm_struct *mm)
> > > __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
> > > }
> > > -static inline int __mm_cid_try_get(struct mm_struct *mm)
> > > -{
> > > - struct cpumask *cpumask;
> > > - int cid;
> > > -
> > > - cpumask = mm_cidmask(mm);
> > > - /*
> > > - * Retry finding first zero bit if the mask is temporarily
> > > - * filled. This only happens during concurrent remote-clear
> > > - * which owns a cid without holding a rq lock.
> > > - */
> > > - for (;;) {
> > > - cid = cpumask_first_zero(cpumask);
> > > - if (cid < nr_cpu_ids)
> > > - break;
> > > - cpu_relax();
> > > - }
> > > - if (cpumask_test_and_set_cpu(cid, cpumask))
> > > - return -1;
>
> This was split in find / test_and_set on purpose because following
> patches I have (implementing numa-aware mm_cid) have a scan which
> needs to scan sets of two cpumasks in parallel (with "and" and
> and_not" operators).
>
> Moreover, the "mask full" scenario only happens while a concurrent
> remote-clear temporarily owns a cid without rq lock. See
> sched_mm_cid_remote_clear():
>
> /*
> * The cid is unused, so it can be unset.
> * Disable interrupts to keep the window of cid ownership without rq
> * lock small.
> */
> local_irq_save(flags);
> if (try_cmpxchg(&pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
> __mm_cid_put(mm, cid);
> local_irq_restore(flags);
>
> The proposed patch here turns this scenario into something heavier
> (setting the use_cid_lock) rather than just retrying. I guess the
> question to ask here is whether it is theoretically possible to cause
> __mm_cid_try_get() to fail to have forward progress if we have a high
> rate of sched_mm_cid_remote_clear. If we decide that this is indeed
> a possible progress-failure scenario, then it makes sense to fallback
> to use_cid_lock as soon as a full mask is encountered.
>
> However, removing the __mm_cid_try_get() helper will make it harder to
> integrate the following numa-awareness patches I have on top.
>
> I am not against using cpumask_find_and_set, but can we keep the
> __mm_cid_try_get() helper to facilitate integration of future work ?
> We just have to make it use cpumask_find_and_set, which should be
> easy.
Sure, I can. Can you point me to the work you mention here?
On 2023-11-21 08:31, Yury Norov wrote:
> On Mon, Nov 20, 2023 at 11:17:32AM -0500, Mathieu Desnoyers wrote:
>
[...]
>
> Sure, I can. Can you point me to the work you mention here?
It would have to be updated now, but here is the last version that was posted:
https://lore.kernel.org/lkml/[email protected]/
In particular, these patches:
2022-11-22 20:39 ` [PATCH 22/30] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit Mathieu Desnoyers
2022-11-22 20:39 ` [PATCH 23/30] cpumask: Implement cpumask_{first,next}_{not,}andnot Mathieu Desnoyers
2022-11-22 20:39 ` [PATCH 24/30] sched: NUMA-aware per-memory-map concurrency ID Mathieu Desnoyers
2022-11-22 20:39 ` [PATCH 25/30] rseq: Extend struct rseq with per-memory-map NUMA-aware Concurrency ID Mathieu Desnoyers
2022-11-22 20:39 ` [PATCH 26/30] selftests/rseq: x86: Implement rseq_load_u32_u32 Mathieu Desnoyers
2022-11-22 20:39 ` [PATCH 27/30] selftests/rseq: Implement mm_numa_cid accessors in headers Mathieu Desnoyers
2022-11-22 20:39 ` [PATCH 28/30] selftests/rseq: Implement numa node id vs mm_numa_cid invariant test Mathieu Desnoyers
2022-11-22 20:39 ` [PATCH 29/30] selftests/rseq: Implement mm_numa_cid tests Mathieu Desnoyers
2022-11-22 20:39 ` [PATCH 30/30] tracing/rseq: Add mm_numa_cid field to rseq_update Mathieu Desnoyers
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Tue, Nov 21, 2023 at 08:44:17AM -0500, Mathieu Desnoyers wrote:
> On 2023-11-21 08:31, Yury Norov wrote:
> > On Mon, Nov 20, 2023 at 11:17:32AM -0500, Mathieu Desnoyers wrote:
> >
> [...]
> >
> > Sure, I can. Can you point me to the work you mention here?
>
> It would have to be updated now, but here is the last version that was posted:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Especially those patches:
>
> 2022-11-22 20:39 ` [PATCH 22/30] lib: Implement find_{first,next,nth}_notandnot_bit, find_first_andnot_bit Mathieu Desnoyers
> 2022-11-22 20:39 ` [PATCH 23/30] cpumask: Implement cpumask_{first,next}_{not,}andnot Mathieu Desnoyers
> 2022-11-22 20:39 ` [PATCH 24/30] sched: NUMA-aware per-memory-map concurrency ID Mathieu Desnoyers
> 2022-11-22 20:39 ` [PATCH 25/30] rseq: Extend struct rseq with per-memory-map NUMA-aware Concurrency ID Mathieu Desnoyers
> 2022-11-22 20:39 ` [PATCH 26/30] selftests/rseq: x86: Implement rseq_load_u32_u32 Mathieu Desnoyers
> 2022-11-22 20:39 ` [PATCH 27/30] selftests/rseq: Implement mm_numa_cid accessors in headers Mathieu Desnoyers
> 2022-11-22 20:39 ` [PATCH 28/30] selftests/rseq: Implement numa node id vs mm_numa_cid invariant test Mathieu Desnoyers
> 2022-11-22 20:39 ` [PATCH 29/30] selftests/rseq: Implement mm_numa_cid tests Mathieu Desnoyers
> 2022-11-22 20:39 ` [PATCH 30/30] tracing/rseq: Add mm_numa_cid field to rseq_update Mathieu Desnoyers
OK, I'll take a look.
Thanks,
Yury