2024-03-13 20:56:38

by Mathieu Desnoyers

Subject: [RFC PATCH 0/2] Introduce serialized smp_call_function APIs

commit 944d5fe50f3f ("sched/membarrier: reduce the ability to hammer on sys_membarrier")
introduces a mutex over all membarrier operations to reduce its ability
to slow down the rest of the system.

This RFC series has two objectives:

1) Move this mutex to the smp_call_function APIs so other system calls
using smp_call_function IPIs are limited in the same way,

2) Restore scalability of MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ with
MEMBARRIER_CMD_FLAG_CPU, which targets specific CPUs with IPIs.
This may or may not be useful, and I would welcome benchmarks from
users of this feature to figure out if this is worth it.

This series applies on top of v6.8.

Thanks,

Mathieu

Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: [email protected]
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Peter Oskolkov <[email protected]>

Mathieu Desnoyers (2):
smp: Implement serialized smp_call_function APIs
sched/membarrier: Use serialized smp_call_function APIs

include/linux/smp.h | 40 ++++++++++++++
kernel/sched/membarrier.c | 24 +++------
kernel/smp.c | 106 +++++++++++++++++++++++++++++++++-----
3 files changed, 141 insertions(+), 29 deletions(-)

--
2.39.2


2024-03-13 20:56:43

by Mathieu Desnoyers

Subject: [RFC PATCH 2/2] sched/membarrier: Use serialized smp_call_function APIs

Use the serialized smp_call_function APIs to issue IPIs, thus limiting
the rate at which IPIs can be generated for each CPU.

Limiting the rate of IPIs at the smp_call_function level ensures that
various mechanisms cannot be combined to overwhelm a CPU with IPIs.

This allows removing the IPI serialization mutex introduced by commit
944d5fe50f3f ("sched/membarrier: reduce the ability to hammer on
sys_membarrier"), restoring the scalability of membarrier
MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ when targeting a specific CPU with
MEMBARRIER_CMD_FLAG_CPU. This is used in Google tcmalloc for cross-CPU
operations.
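
For reference, the userspace pattern in question looks roughly as follows
(a sketch only; rseq_fence_cpu() is a hypothetical helper name, not
tcmalloc's actual code, and the process is assumed to have registered with
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ at startup):

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fence the rseq critical sections currently running on @cpu. */
static int rseq_fence_cpu(int cpu)
{
	return syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ,
		       MEMBARRIER_CMD_FLAG_CPU, cpu);
}

Each such call IPIs a single CPU, which is why funneling every caller
through one global mutex hurts when many threads fence distinct CPUs
concurrently.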

[ I do not have numbers justifying the benefit of moving to a per-CPU
mutex for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ targeting a specific
CPU. Perhaps Google folks using this feature have benchmarks that can
provide those numbers? ]

Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: [email protected]
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Peter Oskolkov <[email protected]>
---
kernel/sched/membarrier.c | 24 +++++++-----------------
1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 4e715b9b278e..368afd35c1de 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -162,9 +162,6 @@
| MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK \
| MEMBARRIER_CMD_GET_REGISTRATIONS)

-static DEFINE_MUTEX(membarrier_ipi_mutex);
-#define SERIALIZE_IPI() guard(mutex)(&membarrier_ipi_mutex)
-
static void ipi_mb(void *info)
{
smp_mb(); /* IPIs should be serializing but paranoid. */
@@ -262,7 +259,6 @@ static int membarrier_global_expedited(void)
if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
return -ENOMEM;

- SERIALIZE_IPI();
cpus_read_lock();
rcu_read_lock();
for_each_online_cpu(cpu) {
@@ -295,9 +291,7 @@ static int membarrier_global_expedited(void)
}
rcu_read_unlock();

- preempt_disable();
- smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
- preempt_enable();
+ smp_call_function_many_serialize(tmpmask, ipi_mb, NULL, 1);

free_cpumask_var(tmpmask);
cpus_read_unlock();
@@ -351,7 +345,6 @@ static int membarrier_private_expedited(int flags, int cpu_id)
if (cpu_id < 0 && !zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
return -ENOMEM;

- SERIALIZE_IPI();
cpus_read_lock();

if (cpu_id >= 0) {
@@ -382,10 +375,10 @@ static int membarrier_private_expedited(int flags, int cpu_id)

if (cpu_id >= 0) {
/*
- * smp_call_function_single() will call ipi_func() if cpu_id
- * is the calling CPU.
+ * smp_call_function_single_serialize() will call
+ * ipi_func() if cpu_id is the calling CPU.
*/
- smp_call_function_single(cpu_id, ipi_func, NULL, 1);
+ smp_call_function_single_serialize(cpu_id, ipi_func, NULL, 1);
} else {
/*
* For regular membarrier, we can save a few cycles by
@@ -405,11 +398,9 @@ static int membarrier_private_expedited(int flags, int cpu_id)
* rseq critical section.
*/
if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
- preempt_disable();
- smp_call_function_many(tmpmask, ipi_func, NULL, true);
- preempt_enable();
+ smp_call_function_many_serialize(tmpmask, ipi_func, NULL, true);
} else {
- on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
+ on_each_cpu_mask_serialize(tmpmask, ipi_func, NULL, true);
}
}

@@ -465,7 +456,6 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
* between threads which are users of @mm has its membarrier state
* updated.
*/
- SERIALIZE_IPI();
cpus_read_lock();
rcu_read_lock();
for_each_online_cpu(cpu) {
@@ -478,7 +468,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
}
rcu_read_unlock();

- on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true);
+ on_each_cpu_mask_serialize(tmpmask, ipi_sync_rq_state, mm, true);

free_cpumask_var(tmpmask);
cpus_read_unlock();
--
2.39.2


2024-03-13 21:07:31

by Mathieu Desnoyers

Subject: [RFC PATCH 1/2] smp: Implement serialized smp_call_function APIs

Introduce serialized smp_call_function APIs to limit the number of
concurrent smp_call_function IPIs which can be sent to a given CPU to a
maximum of two: one broadcast and one specifically targeting the CPU.

- smp_call_function_single_serialize() holds a per-target-CPU
mutex while sending the IPI;
- smp_call_function_many_serialize(), on_each_cpu_cond_mask_serialize()
and on_each_cpu_mask_serialize() call
smp_call_function_single_serialize() when there is only one CPU in
the mask, and otherwise grab the serialize_ipi_broadcast_lock mutex
and use broadcast IPIs.

Motivations for this new API:

- Prevent hammering a given CPU with IPIs from sources other than
membarrier that rely on the smp_call_function APIs to send IPIs.
- Restore scaling of membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
targeting a specific CPU with MEMBARRIER_CMD_FLAG_CPU, which was
serialized behind a global mutex by commit 944d5fe50f3f
("sched/membarrier: reduce the ability to hammer on sys_membarrier").
This is used in Google tcmalloc for cross-CPU operations.

My testing on an AMD EPYC 9654 shows that 5 IPI source CPUs are needed
to overwhelm one target CPU without commit 944d5fe50f3f. Therefore,
permitting two CPUs to concurrently IPI a given target CPU should not
allow overwhelming it. [ This should be confirmed by testing on other
hardware. ]

* nr_src=1
Runs OK
top output: %Cpu0 : 35.0 us, 65.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

* nr_src=2
Runs OK
top output: %Cpu0 : 19.1 us, 80.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

* nr_src=3
Runs OK
top output: %Cpu0 : 4.9 us, 94.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st

* nr_src=4
Runs OK, but "si" (time spent servicing software interrupts) is getting high:
top output: %Cpu0 : 0.0 us, 20.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 80.0 si, 0.0 st

* nr_src=5
"si" is now 100%. watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [membarrier-hamm:5464]
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi,100.0 si, 0.0 st

With this reproducer:

#define _GNU_SOURCE
#include <linux/membarrier.h> /* Definition of MEMBARRIER_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>

#define NR_DEST 1
#define NR_SRC 5

static int
membarrier(int cmd, unsigned int flags, int cpu_id)
{
	return syscall(__NR_membarrier, cmd, flags, cpu_id);
}

void *ipi_dest(void *arg)
{
	int ret;
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	ret = sched_setaffinity(0, sizeof(mask), &mask);
	if (ret) {
		perror("sched_setaffinity");
		abort();
	}
	while (1) {
		asm volatile ("rep; nop" : : : "memory");
	}
}

void *ipi_src(void *arg)
{
	int ret;

	while (1) {
		/* Hammer CPU 0 */
		ret = membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ,
				 MEMBARRIER_CMD_FLAG_CPU, 0);
		if (ret) {
			perror("membarrier");
			abort();
		}
	}
}

static
void do_membarrier_hammer(void)
{
	int ret, i;
	pthread_t dest_tid[NR_DEST];
	pthread_t src_tid[NR_SRC];

	for (i = 0; i < NR_DEST; i++) {
		ret = pthread_create(&dest_tid[i], NULL,
				     ipi_dest, NULL);
		if (ret)
			abort();
	}
	for (i = 0; i < NR_SRC; i++) {
		ret = pthread_create(&src_tid[i], NULL,
				     ipi_src, NULL);
		if (ret)
			abort();
	}

	for (i = 0; i < NR_DEST; i++) {
		ret = pthread_join(dest_tid[i], NULL);
		if (ret)
			abort();
	}
	for (i = 0; i < NR_SRC; i++) {
		ret = pthread_join(src_tid[i], NULL);
		if (ret)
			abort();
	}
}

int main()
{
	int ret;

	ret = membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0);
	if (ret) {
		perror("membarrier");
		return -1;
	}
	do_membarrier_hammer();
	return 0;
}

Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Maged Michael <[email protected]>
Cc: [email protected]
Cc: Avi Kivity <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Peter Oskolkov <[email protected]>
---
include/linux/smp.h | 40 +++++++++++++++++
kernel/smp.c | 106 +++++++++++++++++++++++++++++++++++++++-----
2 files changed, 134 insertions(+), 12 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index e87520dc2959..cfc7df87fceb 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -160,6 +160,16 @@ void smp_call_function_many(const struct cpumask *mask,
int smp_call_function_any(const struct cpumask *mask,
smp_call_func_t func, void *info, int wait);

+int smp_call_function_single_serialize(int cpuid, smp_call_func_t func,
+ void *info, int wait);
+
+void smp_call_function_many_serialize(const struct cpumask *mask,
+ smp_call_func_t func,
+ void *info, bool wait);
+
+void on_each_cpu_cond_mask_serialize(smp_cond_func_t cond_func, smp_call_func_t func,
+ void *info, bool wait, const struct cpumask *mask);
+
void kick_all_cpus_sync(void);
void wake_up_all_idle_cpus(void);

@@ -215,6 +225,28 @@ smp_call_function_any(const struct cpumask *mask, smp_call_func_t func,
return smp_call_function_single(0, func, info, wait);
}

+static inline
+int smp_call_function_single_serialize(int cpuid, smp_call_func_t func,
+ void *info, int wait)
+{
+ return smp_call_function_single(cpuid, func, info, wait);
+}
+
+static inline
+void smp_call_function_many_serialize(const struct cpumask *mask,
+ smp_call_func_t func,
+ void *info, bool wait)
+{
+ return smp_call_function_many(mask, func, info, wait);
+}
+
+static inline
+void on_each_cpu_cond_mask_serialize(smp_cond_func_t cond_func, smp_call_func_t func,
+ void *info, bool wait, const struct cpumask *mask)
+{
+ return on_each_cpu_cond_mask(cond_func, func, info, wait, mask);
+}
+
static inline void kick_all_cpus_sync(void) { }
static inline void wake_up_all_idle_cpus(void) { }

@@ -288,6 +320,14 @@ void smp_setup_processor_id(void);
int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par,
bool phys);

+static inline
+void on_each_cpu_mask_serialize(const struct cpumask *mask,
+ smp_call_func_t func,
+ void *info, bool wait)
+{
+ on_each_cpu_cond_mask_serialize(NULL, func, info, wait, mask);
+}
+
/* SMP core functions */
int smpcfd_prepare_cpu(unsigned int cpu);
int smpcfd_dead_cpu(unsigned int cpu);
diff --git a/kernel/smp.c b/kernel/smp.c
index f085ebcdf9e7..4c58617a2c48 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -48,6 +48,10 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);

static DEFINE_PER_CPU(atomic_t, trigger_backtrace) = ATOMIC_INIT(1);

+static DEFINE_PER_CPU(struct mutex, serialize_ipi_lock);
+
+static DEFINE_MUTEX(serialize_ipi_broadcast_lock);
+
static void __flush_smp_call_function_queue(bool warn_cpu_offline);

int smpcfd_prepare_cpu(unsigned int cpu)
@@ -102,9 +106,10 @@ void __init call_function_init(void)
{
int i;

- for_each_possible_cpu(i)
+ for_each_possible_cpu(i) {
init_llist_head(&per_cpu(call_single_queue, i));
-
+ mutex_init(&per_cpu(serialize_ipi_lock, i));
+ }
smpcfd_prepare_cpu(smp_processor_id());
}

@@ -590,16 +595,9 @@ void flush_smp_call_function_queue(void)
local_irq_restore(flags);
}

-/*
- * smp_call_function_single - Run a function on a specific CPU
- * @func: The function to run. This must be fast and non-blocking.
- * @info: An arbitrary pointer to pass to the function.
- * @wait: If true, wait until function has completed on other CPUs.
- *
- * Returns 0 on success, else a negative status code.
- */
-int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
- int wait)
+static
+int _smp_call_function_single(int cpu, smp_call_func_t func, void *info,
+ int wait, bool serialize)
{
call_single_data_t *csd;
call_single_data_t csd_stack = {
@@ -608,6 +606,9 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
int this_cpu;
int err;

+ if (serialize)
+ mutex_lock(&per_cpu(serialize_ipi_lock, cpu));
+
/*
* prevent preemption and reschedule on another processor,
* as well as CPU removal
@@ -651,10 +652,38 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,

put_cpu();

+ if (serialize)
+ mutex_unlock(&per_cpu(serialize_ipi_lock, cpu));
+
return err;
}
+
+/*
+ * smp_call_function_single - Run a function on a specific CPU
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @wait: If true, wait until function has completed on other CPUs.
+ *
+ * Returns 0 on success, else a negative status code.
+ */
+int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
+ int wait)
+{
+ return _smp_call_function_single(cpu, func, info, wait, false);
+}
EXPORT_SYMBOL(smp_call_function_single);

+/*
+ * Run a function on a specific CPU. Serialize the IPI to that CPU with
+ * a mutex.
+ */
+int smp_call_function_single_serialize(int cpu, smp_call_func_t func, void *info,
+ int wait)
+{
+ return _smp_call_function_single(cpu, func, info, wait, true);
+}
+EXPORT_SYMBOL(smp_call_function_single_serialize);
+
/**
* smp_call_function_single_async() - Run an asynchronous function on a
* specific CPU.
@@ -880,6 +909,30 @@ void smp_call_function_many(const struct cpumask *mask,
}
EXPORT_SYMBOL(smp_call_function_many);

+/* Call with preemption *enabled*. */
+void smp_call_function_many_serialize(const struct cpumask *mask,
+ smp_call_func_t func, void *info, bool wait)
+{
+ unsigned int weight = cpumask_weight(mask);
+ int ret;
+
+ if (!weight)
+ return;
+
+ if (weight == 1) {
+ ret = smp_call_function_single_serialize(cpumask_first(mask), func, info, wait);
+ WARN_ON_ONCE(ret);
+ return;
+ }
+
+ mutex_lock(&serialize_ipi_broadcast_lock);
+ preempt_disable();
+ smp_call_function_many_cond(mask, func, info, wait * SCF_WAIT, NULL);
+ preempt_enable();
+ mutex_unlock(&serialize_ipi_broadcast_lock);
+}
+EXPORT_SYMBOL(smp_call_function_many_serialize);
+
/**
* smp_call_function(): Run a function on all other CPUs.
* @func: The function to run. This must be fast and non-blocking.
@@ -1025,6 +1078,35 @@ void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
}
EXPORT_SYMBOL(on_each_cpu_cond_mask);

+/* Call with preemption enabled. */
+void on_each_cpu_cond_mask_serialize(smp_cond_func_t cond_func, smp_call_func_t func,
+ void *info, bool wait, const struct cpumask *mask)
+{
+ int cpu, ret;
+ unsigned int scf_flags = SCF_RUN_LOCAL, weight = cpumask_weight(mask);
+
+ if (!weight)
+ return;
+
+ if (weight == 1) {
+ cpu = cpumask_first(mask);
+ if (cond_func && !cond_func(cpu, info))
+ return;
+ ret = smp_call_function_single_serialize(cpu, func, info, wait);
+ WARN_ON_ONCE(ret);
+ }
+
+ if (wait)
+ scf_flags |= SCF_WAIT;
+
+ mutex_lock(&serialize_ipi_broadcast_lock);
+ preempt_disable();
+ smp_call_function_many_cond(mask, func, info, scf_flags, cond_func);
+ preempt_enable();
+ mutex_unlock(&serialize_ipi_broadcast_lock);
+}
+EXPORT_SYMBOL(on_each_cpu_cond_mask_serialize);
+
static void do_nothing(void *unused)
{
}
--
2.39.2


2024-03-13 21:20:06

by Linus Torvalds

Subject: Re: [RFC PATCH 1/2] smp: Implement serialized smp_call_function APIs

On Wed, 13 Mar 2024 at 13:56, Mathieu Desnoyers
<[email protected]> wrote:
>
> Introduce serialized smp_call_function APIs to limit the number of
> concurrent smp_call_function IPIs which can be sent to a given CPU to a
> maximum of two: one broadcast and one specifically targeting the CPU.

So honestly, with only one user, I think the serialization code
should be solidly in that one user, not in kernel/smp.c.

Also, this kind of extra complexity does require numbers to argue for it.

Linus

2024-03-13 21:58:57

by Avi Kivity

Subject: Re: [RFC PATCH 0/2] Introduce serialized smp_call_function APIs

[resend in plain text for the list]

On Wed, 2024-03-13 at 16:56 -0400, Mathieu Desnoyers wrote:
> commit 944d5fe50f3f ("sched/membarrier: reduce the ability to hammer
> on sys_membarrier")
> introduces a mutex over all membarrier operations to reduce its ability
> to slow down the rest of the system.
>
> This RFC series has two objectives:
>
> 1) Move this mutex to the smp_call_function APIs so other system calls
>    using smp_call_function IPIs are limited in the same way,
>
> 2) Restore scalability of MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ with
>    MEMBARRIER_CMD_FLAG_CPU, which targets specific CPUs with IPIs.
>    This may or may not be useful, and I would welcome benchmarks from
>    users of this feature to figure out if this is worth it.
>

I see this doesn't restore scaling of MEMBARRIER_CMD_PRIVATE_EXPEDITED,
which I use (and wasn't aware was broken).

I don't have comments on the patches, but do have ideas on how to work
around the problem in Seastar. So this was a useful heads-up for me.

2024-03-13 22:06:28

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 0/2] Introduce serialized smp_call_function APIs

On 2024-03-13 17:14, Avi Kivity wrote:
> On Wed, 2024-03-13 at 16:56 -0400, Mathieu Desnoyers wrote:
>> commit 944d5fe50f3f ("sched/membarrier: reduce the ability to hammer
>> on sys_membarrier")
>> introduces a mutex over all membarrier operations to reduce its ability
>> to slow down the rest of the system.
>>
>> This RFC series has two objectives:
>>
>> 1) Move this mutex to the smp_call_function APIs so other system calls
>>    using smp_call_function IPIs are limited in the same way,
>>
>> 2) Restore scalability of MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ with
>>    MEMBARRIER_CMD_FLAG_CPU, which targets specific CPUs with IPIs.
>>    This may or may not be useful, and I would welcome benchmarks from
>>    users of this feature to figure out if this is worth it.
>>
>> This series applies on top of v6.8.
>>
>
>
> I see this doesn't restore scaling of MEMBARRIER_CMD_PRIVATE_EXPEDITED,
> which I use (and wasn't aware was broken).

It's mainly a mitigation for IPI Storming: CVE-2024-26602 disclosed
as part of [1].

>
> I don't have comments on the patches, but do have ideas on how to work
> around the problem in Seastar. So this was a useful heads-up for me.

Note that if you don't use membarrier private expedited too heavily,
you should not notice any difference. But nevertheless I would be
interested to hear about any regression on performance of real
workloads resulting from commit 944d5fe50f3f.

Thanks,

Mathieu

[1] https://www.vusec.net/projects/ghostrace/



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-03-13 22:32:44

by Avi Kivity

Subject: Re: [RFC PATCH 0/2] Introduce serialized smp_call_function APIs

On Wed, 2024-03-13 at 18:06 -0400, Mathieu Desnoyers wrote:
> On 2024-03-13 17:14, Avi Kivity wrote:
> > On Wed, 2024-03-13 at 16:56 -0400, Mathieu Desnoyers wrote:
> > > commit 944d5fe50f3f ("sched/membarrier: reduce the ability to hammer
> > > on sys_membarrier")
> > > introduces a mutex over all membarrier operations to reduce its
> > > ability to slow down the rest of the system.
> > >
> > > This RFC series has two objectives:
> > >
> > > 1) Move this mutex to the smp_call_function APIs so other system calls
> > >    using smp_call_function IPIs are limited in the same way,
> > >
> > > 2) Restore scalability of MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ with
> > >    MEMBARRIER_CMD_FLAG_CPU, which targets specific CPUs with IPIs.
> > >    This may or may not be useful, and I would welcome benchmarks from
> > >    users of this feature to figure out if this is worth it.
> > >
> > > This series applies on top of v6.8.
> > >
> >
> >
> > I see this doesn't restore scaling of MEMBARRIER_CMD_PRIVATE_EXPEDITED,
> > which I use (and wasn't aware was broken).
>
> It's mainly a mitigation for IPI Storming: CVE-2024-26602 disclosed
> as part of [1].

Very interesting.

> >
> > I don't have comments on the patches, but do have ideas on how to work
> > around the problem in Seastar. So this was a useful heads-up for me.
>
> Note that if you don't use membarrier private expedited too heavily,
> you should not notice any difference. But nevertheless I would be
> interested to hear about any regression on performance of real
> workloads resulting from commit 944d5fe50f3f.
>


In fact I did observe the original text of 944d5fe50f3f ("On some
systems, sys_membarrier can be very expensive, causing overall
slowdowns for everything") to be true [1]. So rather than causing
a regression, this commit made me fix a problem.

The smp_call_function_many_cond() overhead seen in [1] is very likely due
to sys_membarrier, and it is slow because it is running on a virtual
machine without posted interrupt virtualization. Usually we detect virtual
machines and call membarrier() less frequently, but on that instance
(AWS d3en) the detection failed and triggered that IPI storm.

My fix is simply to detect whether a concurrent membarrier is already
running and fall back to doing something else; I don't think it's
generally applicable.

[1] https://github.com/scylladb/scylladb/issues/17207
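
For the curious, the shape of that workaround is roughly the following
(a hedged sketch with hypothetical names; Seastar's actual fix differs,
see [1] for details, and registration with
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED at startup is assumed):

#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int membarrier_in_flight;

/* Cheaper fallback, e.g. wait for reader threads to pass a quiescent state. */
static void fallback_barrier(void)
{
}

static void cross_thread_fence(void)
{
	if (atomic_fetch_add(&membarrier_in_flight, 1) == 0) {
		/* No concurrent caller: issue the (now serialized) syscall. */
		syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0);
	} else {
		/* Another thread is already issuing IPIs: do something else. */
		fallback_barrier();
	}
	atomic_fetch_sub(&membarrier_in_flight, 1);
}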



2024-03-13 23:25:21

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 1/2] smp: Implement serialized smp_call_function APIs

On 2024-03-13 17:19, Linus Torvalds wrote:
> On Wed, 13 Mar 2024 at 13:56, Mathieu Desnoyers
> <[email protected]> wrote:
>>
>> Introduce serialized smp_call_function APIs to limit the number of
>> concurrent smp_call_function IPIs which can be sent to a given CPU to a
>> maximum of two: one broadcast and one specifically targeting the CPU.
>
> So honestly, with only one user, I think the serialization code
> should be solidly in that one user, not in kernel/smp.c.

Good point. Unless other users of this show up, I could move it into
membarrier.c, but only if anyone cares enough to come up with benchmarks
justifying the added complexity.
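
For illustration, keeping the serialization local to
kernel/sched/membarrier.c could look roughly like the following
(hypothetical sketch, not code from this series; the per-CPU mutexes
would need mutex_init() at init time, as patch 1 does for its per-CPU
locks):

static DEFINE_PER_CPU(struct mutex, membarrier_ipi_cpu_lock);
static DEFINE_MUTEX(membarrier_ipi_mutex);	/* broadcast path, as today */

/* Serialize single-CPU membarrier IPIs on a per-target-CPU mutex. */
static void membarrier_ipi_single(int cpu_id, smp_call_func_t ipi_func)
{
	struct mutex *lock = &per_cpu(membarrier_ipi_cpu_lock, cpu_id);

	mutex_lock(lock);
	smp_call_function_single(cpu_id, ipi_func, NULL, 1);
	mutex_unlock(lock);
}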

>
> Also, this kind of extra complexity does require numbers to argue for it.

Of course,

Thanks,

Mathieu

>
> Linus

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com