2023-04-06 01:57:39

by yebin

Subject: [PATCH v2 0/2] fix dying cpu compare race

From: Ye Bin <[email protected]>

This patch set solves a race between __percpu_counter_compare() and CPU offline.
Before commit 5825bea05265 ("xfs: __percpu_counter_compare() inode count debug
too expensive"), I hit the following issue when running a CPU online/offline test:
smpboot: CPU 1 is now offline
XFS: Assertion failed: percpu_counter_compare(&mp->m_ifree, 0) >= 0, file: fs/xfs/xfs_trans.c, line: 622
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:110!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 25512 Comm: fsstress Not tainted 5.10.0-04288-gcb31bdc8c65d #8
RIP: 0010:assfail+0x77/0x8b fs/xfs/xfs_message.c:110
RSP: 0018:ffff88810a5df5c0 EFLAGS: 00010293
RAX: ffff88810f3a8000 RBX: 0000000000000201 RCX: ffffffffaa8bd7c0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000000 R08: ffff88810f3a8000 R09: ffffed103edf71cd
R10: ffff8881f6fb8e67 R11: ffffed103edf71cc R12: ffffffffab0108c0
R13: ffffffffab010220 R14: ffffffffffffffff R15: 0000000000000000
FS: 00007f8536e16b80(0000) GS:ffff8881f6f80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005617e1115f44 CR3: 000000015873a005 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
xfs_trans_unreserve_and_mod_sb+0x833/0xca0 fs/xfs/xfs_trans.c:622
xlog_cil_commit+0x1169/0x29b0 fs/xfs/xfs_log_cil.c:1325
__xfs_trans_commit+0x2c0/0xe20 fs/xfs/xfs_trans.c:889
xfs_create_tmpfile+0x6a6/0x9a0 fs/xfs/xfs_inode.c:1320
xfs_rename_alloc_whiteout fs/xfs/xfs_inode.c:3193 [inline]
xfs_rename+0x58a/0x1e00 fs/xfs/xfs_inode.c:3245
xfs_vn_rename+0x28e/0x410 fs/xfs/xfs_iops.c:436
vfs_rename+0x10b5/0x1dd0 fs/namei.c:4329
do_renameat2+0xa19/0xb10 fs/namei.c:4474
__do_sys_renameat2 fs/namei.c:4512 [inline]
__se_sys_renameat2 fs/namei.c:4509 [inline]
__x64_sys_renameat2+0xe4/0x120 fs/namei.c:4509
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x61/0xc6
RIP: 0033:0x7f853623d91d

I can reproduce the above issue by injecting kernel latency to defeat the quick
judgment in __percpu_counter_compare().
In the quick-judgment path, the number of online CPUs may already have decreased
before percpu_counter_cpu_dead() is called when racing with CPU offline, which
leads to a miscalculation. For example:
Assumptions:
(1) batch = 32
(2) the accurate value is 2
(3) there are 4 CPUs
When CPU3 goes offline, the per-CPU deltas are:
cpu0  cpu1  cpu2  cpu3
  31    31    31    31
fbc->count = -122 -> percpu_counter_cpu_dead() has not been called yet.
Now check whether the percpu counter is greater than 0:
abs(count - rhs) = abs(-122 - 0) = 122
batch * num_online_cpus() = 32 * 3 = 96 -> the number of online CPUs is already 3
So the condition abs(count - rhs) > batch * num_online_cpus() is met, and because
count < 0 the comparison returns -1, even though the accurate value is 2.

Ye Bin (2):
cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
lib/percpu_counter: fix dying cpu compare race

include/linux/cpumask.h | 20 ++++++++++++++++----
kernel/cpu.c | 2 ++
lib/percpu_counter.c | 11 ++++++++++-
3 files changed, 28 insertions(+), 5 deletions(-)

--
2.31.1


2023-04-06 01:59:56

by yebin

Subject: [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count

From: Ye Bin <[email protected]>

Introduce a '__num_dying_cpus' atomic to cache the number of dying CPUs
in the core; num_dying_cpus() then simply returns the cached value.

Signed-off-by: Ye Bin <[email protected]>
---
include/linux/cpumask.h | 20 ++++++++++++++++----
kernel/cpu.c | 2 ++
2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 2a61ddcf8321..8127fd598f51 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -135,6 +135,8 @@ extern struct cpumask __cpu_dying_mask;
 
 extern atomic_t __num_online_cpus;
 
+extern atomic_t __num_dying_cpus;
+
 extern cpumask_t cpus_booted_once_mask;
 
 static __always_inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
@@ -1018,10 +1020,14 @@ set_cpu_active(unsigned int cpu, bool active)
 static __always_inline void
 set_cpu_dying(unsigned int cpu, bool dying)
 {
-	if (dying)
-		cpumask_set_cpu(cpu, &__cpu_dying_mask);
-	else
-		cpumask_clear_cpu(cpu, &__cpu_dying_mask);
+	if (dying) {
+		if (!cpumask_test_and_set_cpu(cpu, &__cpu_dying_mask))
+			atomic_inc(&__num_dying_cpus);
+	}
+	else {
+		if (cpumask_test_and_clear_cpu(cpu, &__cpu_dying_mask))
+			atomic_dec(&__num_dying_cpus);
+	}
 }
 
 /**
@@ -1073,6 +1079,11 @@ static __always_inline unsigned int num_online_cpus(void)
 {
 	return arch_atomic_read(&__num_online_cpus);
 }
+
+static __always_inline unsigned int num_dying_cpus(void)
+{
+	return arch_atomic_read(&__num_dying_cpus);
+}
 #define num_possible_cpus()	cpumask_weight(cpu_possible_mask)
 #define num_present_cpus()	cpumask_weight(cpu_present_mask)
 #define num_active_cpus()	cpumask_weight(cpu_active_mask)
@@ -1108,6 +1119,7 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 #define num_possible_cpus()	1U
 #define num_present_cpus()	1U
 #define num_active_cpus()	1U
+#define num_dying_cpus()	0U
 
 static __always_inline bool cpu_online(unsigned int cpu)
 {
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f4a2c5845bcb..1c96c04cb259 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2662,6 +2662,8 @@ EXPORT_SYMBOL(__cpu_dying_mask);
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);
 
+atomic_t __num_dying_cpus __read_mostly;
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(&__cpu_present_mask, src);
--
2.31.1

2023-04-06 01:59:59

by yebin

Subject: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race

From: Ye Bin <[email protected]>

In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
condition between a cpu dying and percpu_counter_sum() iterating online CPUs
was identified.
Actually, the same race condition exists between a dying CPU and
__percpu_counter_compare(), which uses num_online_cpus() for its fast-path
check. But num_online_cpus() is decremented before percpu_counter_cpu_dead()
is called, so the fast path may return an incorrect result.
To fix this, also count dying CPUs in the fast-path check in
__percpu_counter_compare().

Signed-off-by: Ye Bin <[email protected]>
---
lib/percpu_counter.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 5004463c4f9f..399840cb0012 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
+static __always_inline unsigned int num_count_cpus(void)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+	return (num_online_cpus() + num_dying_cpus());
+#else
+	return num_online_cpus();
+#endif
+}
+
 /*
  * Compare counter against given value.
  * Return 1 if greater, 0 if equal and -1 if less
@@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
 
 	count = percpu_counter_read(fbc);
 	/* Check to see if rough count will be sufficient for comparison */
-	if (abs(count - rhs) > (batch * num_online_cpus())) {
+	if (abs(count - rhs) > (batch * num_count_cpus())) {
 		if (count > rhs)
 			return 1;
 		else
--
2.31.1

2023-04-08 07:34:31

by Dennis Zhou

Subject: Re: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race

Hello,

On Thu, Apr 06, 2023 at 09:56:29AM +0800, Ye Bin wrote:
> From: Ye Bin <[email protected]>
>
> In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
> condition between a cpu dying and percpu_counter_sum() iterating online CPUs
> was identified.
> Actually, the same race condition exists between a dying CPU and
> __percpu_counter_compare(), which uses num_online_cpus() for its fast-path
> check. But num_online_cpus() is decremented before percpu_counter_cpu_dead()
> is called, so the fast path may return an incorrect result.
> To fix this, also count dying CPUs in the fast-path check in
> __percpu_counter_compare().
>

I've thought a lot about this since you sent v1. For the general
problem, you haven't addressed Dave's concerns from [1].

I agree you've found a valid race condition, but as Dave mentioned,
there's no synchronization in __percpu_counter_compare() and
consequently no guarantees about the accuracy of the value.

However, I might be missing something, but I do think the use case in
5825bea05265 ("xfs: __percpu_counter_compare() inode count debug too expensive")
is potentially valid. If the rhs is an expected lower bound or upper
bound (depending on if you're counting up or down, but not both) and the
count you're accounting has the same expectations as percpu_refcount
(you can only subtract what you've already added), then should the
percpu_counter_sum() ever be on the wrong side of rhs, that should be an
error and visible via percpu_counter_compare().

I need to think a little longer, but my initial thought is while you
close a race condition, the function itself is inherently vulnerable.

[1] https://lore.kernel.org/lkml/ZCu9LtdA+NMrfG9x@rh/

Thanks,
Dennis

> Signed-off-by: Ye Bin <[email protected]>
> ---
> lib/percpu_counter.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 5004463c4f9f..399840cb0012 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
>  	return 0;
>  }
> 
> +static __always_inline unsigned int num_count_cpus(void)
> +{
> +#ifdef CONFIG_HOTPLUG_CPU
> +	return (num_online_cpus() + num_dying_cpus());
> +#else
> +	return num_online_cpus();
> +#endif
> +}
> +
>  /*
>   * Compare counter against given value.
>   * Return 1 if greater, 0 if equal and -1 if less
> @@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
> 
>  	count = percpu_counter_read(fbc);
>  	/* Check to see if rough count will be sufficient for comparison */
> -	if (abs(count - rhs) > (batch * num_online_cpus())) {
> +	if (abs(count - rhs) > (batch * num_count_cpus())) {
>  		if (count > rhs)
>  			return 1;
>  		else
> --
> 2.31.1
>
>

2023-04-10 17:46:02

by Yury Norov

Subject: Re: [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count

On Thu, Apr 06, 2023 at 09:56:28AM +0800, Ye Bin wrote:
> From: Ye Bin <[email protected]>
>
> Introduce '__num_dying_cpus' variable to cache the number of dying CPUs
> in the core and just return the cached variable.
>
> Signed-off-by: Ye Bin <[email protected]>

It looks like you didn't address any comments for v1. Can you please
do that? Otherwise, NAK.

Thanks,
Yury

2023-04-10 20:15:12

by Thomas Gleixner

Subject: Re: [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count

On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> From: Ye Bin <[email protected]>
>
> Introduce '__num_dying_cpus' variable to cache the number of dying CPUs
> in the core and just return the cached variable.

Why?

That atomic counter is racy too if read and acted upon w/o having CPUs
read locked.

All it does is make the race window smaller vs. the cpumask_weight()-based
implementation. It's still racy and incorrect.

So no, this is not going to happen.

Thanks,

tglx


2023-04-10 20:54:14

by Thomas Gleixner

Subject: Re: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race

On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> From: Ye Bin <[email protected]>
>
> In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
> condition between a cpu dying and percpu_counter_sum() iterating online CPUs
> was identified.
> Actually, the same race condition exists between a dying CPU and
> __percpu_counter_compare(), which uses num_online_cpus() for its fast-path
> check. But num_online_cpus() is decremented before percpu_counter_cpu_dead()
> is called, so the fast path may return an incorrect result.
> To fix this, also count dying CPUs in the fast-path check in
> __percpu_counter_compare().

This is all bogus including the above commit.

All CPU masks including cpu_online_mask and cpu_dying_mask are only
valid when the context is:

- A CPU hotplug callback

- Any other code which holds cpu_hotplug_lock read or write locked.

cpu_online_mask is special in that regard. It is also protected against
modification in any preemption disabled region, but that's a pure
implementation detail.

cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
held. And even with that lock held any cpumask operation on it is silly.
The mask is a core detail:

commit e40f74c535b8 "cpumask: Introduce DYING mask"

Introduce a cpumask that indicates (for each CPU) what direction the
CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when
an up fails and we do a roll-back down, it will accurately reflect the
direction.

It does not tell anything to a user which is not aware of the actual
hotplug state machine state.


The real reason for this percpu counter issue is how percpu counter
hotplug callbacks are implemented: They are asymmetric and at the
completely wrong place.

The above 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") was
done via XFS and without the courtesy of CC'ing the people who care
about the CPU hotplug core. The lengthy analysis of this commit is all
shiny, but fundamentally wrong. See above.

I'm way too tired to come up with a proper fix for this mess, but I'm
going to stare at it tomorrow morning with brain awake.

Thanks,

tglx

2023-04-11 07:11:26

by Thomas Gleixner

Subject: Re: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race

On Mon, Apr 10 2023 at 22:53, Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
> held. And even with that lock held any cpumask operation on it is silly.
> The mask is a core detail:
>
> commit e40f74c535b8 "cpumask: Introduce DYING mask"
>
> Introduce a cpumask that indicates (for each CPU) what direction the
> CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when
> an up fails and we do a roll-back down, it will accurately reflect the
> direction.
>
> It does not tell anything to a user which is not aware of the actual
> hotplug state machine state.

Even if the mask is stable most of the time, it's a total disaster
performance-wise. The bits in cpu_dying_mask are sticky until the next
online operation.

So for a system which has SMT enabled in BIOS, but SMT is disabled on
the kernel command line or later via the sysfs knob, this means that the
loop in __percpu_counter_sum() will iterate over all shutdown SMT
siblings forever. IOW, it will touch nr_of_shutdown_cpus() cachelines
for absolutely zero reason.

The same applies for the proposed counter.

Thanks,

tglx