2023-04-14 16:32:13

by Thomas Gleixner

Subject: [patch 1/3] lib/percpu_counter: Fix CPU hotplug handling

Commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") tried to
address a race condition between percpu_counter_sum() and a concurrent CPU
hotplug operation.

The race window is between the point where an unplugged CPU removes itself
from cpu_online_mask and the hotplug state callback which folds the per
CPU counters of the now dead CPU into the global count.

percpu_counter_sum() used for_each_online_cpu() to accumulate the per CPU
local counts, so during the race window it failed to account for the not
yet folded back local count of the offlined CPU.
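
For illustration, the window looks roughly like this:

    CPU A                                CPU B (outgoing)

                                         clears its bit in cpu_online_mask
    percpu_counter_sum()
      for_each_online_cpu(cpu)
        -> CPU B is skipped, its not
           yet folded local count is
           missed
                                         CPUHP_PERCPU_CNT_DEAD callback runs
                                         and folds CPU B's local count into
                                         the global count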

The attempt to address this used the admittedly undocumented and
pointlessly public cpu_dying_mask by changing the loop iterator to take
both the cpu_online_mask and the cpu_dying_mask into account.

That works to some extent, but it is incorrect.

The cpu_dying_mask bits are sticky even after cpu_up()/cpu_down()
completes. That means that all offlined CPUs are always taken into
account. In the case of disabling SMT at boot time or runtime this results
in evaluating the counters of _all_ offlined SMT siblings forever. Depending
on system size, that's a massive amount of cache lines to be touched forever.

It might be argued that the cpu_dying_mask bit could be cleared when
cpu_down() completes, but that's not possible under all circumstances.

Especially with partial hotplug the bit must be sticky in order to keep the
initial user, i.e. the scheduler, correct. Partial hotplug, which allows
explicit state transitions, can also create a situation where the race
window gets recreated:

cpu_down(target = CPUHP_PERCPU_CNT_DEAD + 1)

brings a CPU down to one state before the per CPU counter folding
callback. As this does not reach the CPUHP_OFFLINE state, the bit stays set.
Now the next partial operation:

cpu_up(target = CPUHP_PERCPU_CNT_DEAD + 2)

has to clear the bit, and the race window is open again.

There are two ways to solve this:

1) Maintain a local CPU mask in the per CPU counter code which
gets the bit set when a CPU comes online and cleared in the
CPUHP_PERCPU_CNT_DEAD state after folding.

This adds more code and complexity (a rough sketch of what this
would look like follows after #2 below).

2) Move the folding hotplug state into the DYING callback section, which
runs on the outgoing CPU immediately after it has cleared its online bit.

There is no concurrency vs. percpu_counter_sum() on another CPU
because all still online CPUs are waiting in stop_machine() for the
outgoing CPU to complete its shutdown. The raw spinlock held around
the CPU mask iteration prevents an online CPU from reaching the stop
machine thread while iterating, which implicitly prevents the
outgoing CPU from clearing its online bit.

This is way simpler than #1 and makes the hotplug calls symmetric for
the price of a slightly longer wait time in stop_machine(), which is
not the end of the world as CPU unplug is already slow. The overall
time for a cpu_down() operation stays exactly the same.
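
For illustration only, a rough sketch of #1 (hypothetical names, not
part of this patch):

	static struct cpumask percpu_counter_cpus;

	static int percpu_counter_cpu_online(unsigned int cpu)
	{
		cpumask_set_cpu(cpu, &percpu_counter_cpus);
		return 0;
	}

	static int percpu_counter_cpu_dead(unsigned int cpu)
	{
		/* Fold @cpu's local counts back as today, then ... */
		cpumask_clear_cpu(cpu, &percpu_counter_cpus);
		return 0;
	}

	/* __percpu_counter_sum() would iterate percpu_counter_cpus
	 * instead of cpu_online_mask. */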

Implement #2 and plug the race completely.

percpu_counter_sum() is still inherently racy against a concurrent
percpu_counter_add_batch() fastpath unless externally serialized. That's
completely independent of CPU hotplug though.
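
For reference, a condensed sketch of that fastpath (simplified from
lib/percpu_counter.c):

	void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
	{
		s64 count;

		preempt_disable();
		count = __this_cpu_read(*fbc->counters) + amount;
		if (abs(count) >= batch) {
			unsigned long flags;

			raw_spin_lock_irqsave(&fbc->lock, flags);
			fbc->count += count;
			__this_cpu_sub(*fbc->counters, count - amount);
			raw_spin_unlock_irqrestore(&fbc->lock, flags);
		} else {
			/* Lockless per CPU update: a concurrent sum holding
			 * fbc->lock can still miss this in-flight amount.
			 */
			this_cpu_add(*fbc->counters, amount);
		}
		preempt_enable();
	}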

Fixes: 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race")
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Yury Norov <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Ye Bin <[email protected]>
Cc: [email protected]
---
include/linux/cpuhotplug.h | 2 -
lib/percpu_counter.c | 57 +++++++++++++++++++--------------------------
2 files changed, 26 insertions(+), 33 deletions(-)

--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -91,7 +91,6 @@ enum cpuhp_state {
CPUHP_PRINTK_DEAD,
CPUHP_MM_MEMCQ_DEAD,
CPUHP_XFS_DEAD,
- CPUHP_PERCPU_CNT_DEAD,
CPUHP_RADIX_DEAD,
CPUHP_PAGE_ALLOC,
CPUHP_NET_DEV_DEAD,
@@ -196,6 +195,7 @@ enum cpuhp_state {
CPUHP_AP_SMPCFD_DYING,
CPUHP_AP_X86_TBOOT_DYING,
CPUHP_AP_ARM_CACHE_B15_RAC_DYING,
+ CPUHP_AP_PERCPU_COUNTER_STARTING,
CPUHP_AP_ONLINE,
CPUHP_TEARDOWN_CPU,

--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -12,7 +12,7 @@

#ifdef CONFIG_HOTPLUG_CPU
static LIST_HEAD(percpu_counters);
-static DEFINE_SPINLOCK(percpu_counters_lock);
+static DEFINE_RAW_SPINLOCK(percpu_counters_lock);
#endif

#ifdef CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER
@@ -126,13 +126,8 @@ EXPORT_SYMBOL(percpu_counter_sync);
* Add up all the per-cpu counts, return the result. This is a more accurate
* but much slower version of percpu_counter_read_positive().
*
- * We use the cpu mask of (cpu_online_mask | cpu_dying_mask) to capture sums
- * from CPUs that are in the process of being taken offline. Dying cpus have
- * been removed from the online mask, but may not have had the hotplug dead
- * notifier called to fold the percpu count back into the global counter sum.
- * By including dying CPUs in the iteration mask, we avoid this race condition
- * so __percpu_counter_sum() just does the right thing when CPUs are being taken
- * offline.
+ * Note: This function is inherently racy against the lockless fastpath of
+ * percpu_counter_add_batch() unless externally serialized.
*/
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
@@ -142,10 +137,8 @@ s64 __percpu_counter_sum(struct percpu_c

raw_spin_lock_irqsave(&fbc->lock, flags);
ret = fbc->count;
- for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
- ret += *pcount;
- }
+ for_each_online_cpu(cpu)
+ ret += *per_cpu_ptr(fbc->counters, cpu);
raw_spin_unlock_irqrestore(&fbc->lock, flags);
return ret;
}
@@ -167,9 +160,9 @@ int __percpu_counter_init(struct percpu_

#ifdef CONFIG_HOTPLUG_CPU
INIT_LIST_HEAD(&fbc->list);
- spin_lock_irqsave(&percpu_counters_lock, flags);
+ raw_spin_lock_irqsave(&percpu_counters_lock, flags);
list_add(&fbc->list, &percpu_counters);
- spin_unlock_irqrestore(&percpu_counters_lock, flags);
+ raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
#endif
return 0;
}
@@ -185,9 +178,9 @@ void percpu_counter_destroy(struct percp
debug_percpu_counter_deactivate(fbc);

#ifdef CONFIG_HOTPLUG_CPU
- spin_lock_irqsave(&percpu_counters_lock, flags);
+ raw_spin_lock_irqsave(&percpu_counters_lock, flags);
list_del(&fbc->list);
- spin_unlock_irqrestore(&percpu_counters_lock, flags);
+ raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
#endif
free_percpu(fbc->counters);
fbc->counters = NULL;
@@ -197,22 +190,29 @@ EXPORT_SYMBOL(percpu_counter_destroy);
int percpu_counter_batch __read_mostly = 32;
EXPORT_SYMBOL(percpu_counter_batch);

-static int compute_batch_value(unsigned int cpu)
+static void compute_batch_value(int offs)
{
- int nr = num_online_cpus();
+ int nr = num_online_cpus() + offs;
+
+ percpu_counter_batch = max(32, nr * 2);
+}

- percpu_counter_batch = max(32, nr*2);
+static int percpu_counter_cpu_starting(unsigned int cpu)
+{
+ /* If invoked during hotplug @cpu is not yet marked online. */
+ compute_batch_value(cpu_online(cpu) ? 0 : 1);
return 0;
}

-static int percpu_counter_cpu_dead(unsigned int cpu)
+static int percpu_counter_cpu_dying(unsigned int cpu)
{
#ifdef CONFIG_HOTPLUG_CPU
struct percpu_counter *fbc;
+ unsigned long flags;

- compute_batch_value(cpu);
+ compute_batch_value(0);

- spin_lock_irq(&percpu_counters_lock);
+ raw_spin_lock_irqsave(&percpu_counters_lock, flags);
list_for_each_entry(fbc, &percpu_counters, list) {
s32 *pcount;

@@ -222,7 +222,7 @@ static int percpu_counter_cpu_dead(unsig
*pcount = 0;
raw_spin_unlock(&fbc->lock);
}
- spin_unlock_irq(&percpu_counters_lock);
+ raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
#endif
return 0;
}
@@ -256,15 +256,8 @@ EXPORT_SYMBOL(__percpu_counter_compare);

static int __init percpu_counter_startup(void)
{
- int ret;
-
- ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "lib/percpu_cnt:online",
- compute_batch_value, NULL);
- WARN_ON(ret < 0);
- ret = cpuhp_setup_state_nocalls(CPUHP_PERCPU_CNT_DEAD,
- "lib/percpu_cnt:dead", NULL,
- percpu_counter_cpu_dead);
- WARN_ON(ret < 0);
+ WARN_ON(cpuhp_setup_state(CPUHP_AP_PERCPU_COUNTER_STARTING, "lib/percpu_counter:starting",
+ percpu_counter_cpu_starting, percpu_counter_cpu_dying));
return 0;
}
module_init(percpu_counter_startup);


2023-04-15 06:34:13

by Dennis Zhou

Subject: Re: [patch 1/3] lib/percpu_counter: Fix CPU hotplug handling

Hello,

On Fri, Apr 14, 2023 at 06:30:43PM +0200, Thomas Gleixner wrote:

Thanks for this work. This is a much more complete solution.

Acked-by: Dennis Zhou <[email protected]>

Thanks,
Dennis

2023-04-17 02:20:44

by Dave Chinner

Subject: Re: [patch 1/3] lib/percpu_counter: Fix CPU hotplug handling

On Fri, Apr 14, 2023 at 06:30:43PM +0200, Thomas Gleixner wrote:
> @@ -197,22 +190,29 @@ EXPORT_SYMBOL(percpu_counter_destroy);
> int percpu_counter_batch __read_mostly = 32;
> EXPORT_SYMBOL(percpu_counter_batch);
>
> -static int compute_batch_value(unsigned int cpu)
> +static void compute_batch_value(int offs)
> {
> - int nr = num_online_cpus();
> + int nr = num_online_cpus() + offs;
> +
> + percpu_counter_batch = max(32, nr * 2);
> +}
>
> - percpu_counter_batch = max(32, nr*2);
> +static int percpu_counter_cpu_starting(unsigned int cpu)
> +{
> + /* If invoked during hotplug @cpu is not yet marked online. */
> + compute_batch_value(cpu_online(cpu) ? 0 : 1);
> return 0;
> }

So this changes the batch size based on whether the CPU is starting
or dying to try to get _compare() to fall into the slow path correctly?

How is this supposed to work with counters that have caller supplied
custom batch sizes? i.e. use percpu_counter_add_batch() and
__percpu_counter_compare() with their own batch sizes directly?
Do they now need to add their own cpu hotplug hooks to
screw around with their batch sizes as well?

-Dave.
--
Dave Chinner
[email protected]

2023-04-17 08:11:57

by Thomas Gleixner

Subject: Re: [patch 1/3] lib/percpu_counter: Fix CPU hotplug handling

On Mon, Apr 17 2023 at 12:09, Dave Chinner wrote:
> On Fri, Apr 14, 2023 at 06:30:43PM +0200, Thomas Gleixner wrote:
>> - percpu_counter_batch = max(32, nr*2);
>> +static int percpu_counter_cpu_starting(unsigned int cpu)
>> +{
>> + /* If invoked during hotplug @cpu is not yet marked online. */
>> + compute_batch_value(cpu_online(cpu) ? 0 : 1);
>> return 0;
>> }
>
> So this changes the batch size based on whether the CPU is starting
> or dying to try to get _compare() to fall into the slow path
> correctly?

Right. That's not new. The original code did the same.

> How is this supposed to work with counters that have caller supplied
> custom batch sizes? i.e. use percpu_counter_add_batch() and
> __percpu_counter_compare() with their own batch sizes directly?
> Do they now need to add their own cpu hotplug hooks to
> screw around with their batch sizes as well?

Now? Nothing has changed here. Just the point where the batch size
computation is called is different. The original code did it in the
dynamic online callback late on hotplug and in the dead (cleanup)
callback late on unplug.

The external batch sizes always have been independent of this.
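
For example (illustrative only, hypothetical names), a caller which
passes its own batch size everywhere never looks at the global value:

	#define MY_BATCH	128	/* caller supplied */

	percpu_counter_add_batch(&ctr, delta, MY_BATCH);
	if (__percpu_counter_compare(&ctr, limit, MY_BATCH) < 0)
		take_slow_path();	/* decision based on MY_BATCH */

	/* Only percpu_counter_add() and percpu_counter_compare(), which pass
	 * the global percpu_counter_batch, see the value recomputed in the
	 * hotplug callbacks.
	 */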

Thanks,

tglx