2009-01-28 16:38:31

by Steven Rostedt

Subject: Buggy IPI and MTRR code on low memory


While developing the RT git tree I came across this deadlock.

To avoid touching the memory allocator in smp_call_function_many I forced
the stack use case, the path that would be taken if data fails to
allocate.

Here's the current code in kernel/smp.c:

void smp_call_function_many(const struct cpumask *mask,
                            void (*func)(void *), void *info,
                            bool wait)
{
        struct call_function_data *data;
        [...]
        data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC);
        if (unlikely(!data)) {
                /* Slow path. */
                for_each_online_cpu(cpu) {
                        if (cpu == smp_processor_id())
                                continue;
                        if (cpumask_test_cpu(cpu, mask))
                                smp_call_function_single(cpu, func, info,
                                                         wait);
                }
                return;
        }
        [...]

int smp_call_function_single(int cpu, void (*func)(void *info), void *info,
                             int wait)
{
        struct call_single_data d;
        [...]
        if (!wait) {
                data = kmalloc(sizeof(*data), GFP_ATOMIC);
                if (data)
                        data->flags = CSD_FLAG_ALLOC;
        }
        if (!data) {
                data = &d;
                data->flags = CSD_FLAG_WAIT;
        }

Note that if data failed to allocate, we force the wait state.


This immediately caused a deadlock with the mtrr code:

arch/x86/kernel/cpu/mtrr/main.c:

static void set_mtrr(unsigned int reg, unsigned long base,
                     unsigned long size, mtrr_type type)
{
        struct set_mtrr_data data;
        [...]
        /* Start the ball rolling on other CPUs */
        if (smp_call_function(ipi_handler, &data, 0) != 0)
                panic("mtrr: timed out waiting for other CPUs\n");

        local_irq_save(flags);

        while (atomic_read(&data.count))
                cpu_relax();

        /* ok, reset count and toggle gate */
        atomic_set(&data.count, num_booting_cpus() - 1);
        smp_wmb();
        atomic_set(&data.gate, 1);

        [...]

static void ipi_handler(void *info)
/* [SUMMARY] Synchronisation handler. Executed by "other" CPUs.
   [RETURNS] Nothing.
 */
{
#ifdef CONFIG_SMP
        struct set_mtrr_data *data = info;
        unsigned long flags;

        local_irq_save(flags);

        atomic_dec(&data->count);
        while (!atomic_read(&data->gate))
                cpu_relax();


The problem is that if we use the stack, then we must wait for the
function to finish. But in the mtrr code, the called functions are waiting
for the caller to do something after the smp_call_function. Thus we
deadlock! This mtrr code seems to have been there for a while. At least
longer than the git history.
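
The same circular wait can be reproduced entirely in userspace. Below is a
minimal sketch, not from the original thread, with pthreads standing in for
CPUs and busy-waits for the IPI plumbing; the caller spins on handler_done
while the handler spins on gate, mirroring the forced-wait fallback against
set_mtrr():

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int handler_done; /* the caller spins on this (forced wait) */
static atomic_int gate;         /* the handler spins on this, like data->gate */

static void *ipi_handler(void *arg)
{
        (void)arg;
        while (!atomic_load(&gate))     /* waits for the caller ... */
                ;
        atomic_store(&handler_done, 1); /* ... before signalling completion */
        return NULL;
}

int main(void)
{
        pthread_t cpu1;
        long spins = 0;

        pthread_create(&cpu1, NULL, ipi_handler, NULL);

        /*
         * The forced-wait fallback: spin until the handler finishes.
         * The handler is spinning on the gate, which is only opened
         * after this loop, so neither side can make progress.
         */
        while (!atomic_load(&handler_done) && ++spins < 500000000L)
                ;
        if (!atomic_load(&handler_done)) {
                puts("deadlock: caller waits on handler, handler waits on gate");
                return 1;
        }
        atomic_store(&gate, 1);         /* what set_mtrr() would have done next */
        return pthread_join(cpu1, NULL);
}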

To get around this, I did the following hack. Now this may be good
enough to handle the case. I'm posting it for comments.

The patch creates another flag called CSD_FLAG_RELEASE. If we fail
to alloc the data and the wait bit is not set, we still use the stack
but we also set this flag instead of the wait flag. The receiving IPI
will copy the data locally, and if this flag is set, it will clear it. The
caller, after sending the IPI, will wait on this flag to be cleared.

The difference between this and the wait bit is that the release bit is
just a way to let the callee tell the caller that it copied the data and
is continuing. The data can be released with no worries. This prevents the
deadlock because the caller can continue without waiting for the functions
to be called.
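
A userspace sketch of that handshake, again not from the thread and with
pthreads standing in for CPUs and made-up names, shows why the caller no
longer has to wait for the function itself, only for the copy:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define CSD_FLAG_RELEASE 0x04

struct csd {                    /* stand-in for struct call_single_data */
        void (*func)(void *);
        void *info;
        atomic_int flags;
};

static struct csd *slot;        /* stand-in for the target cpu's queue */

static void *callee(void *arg)
{
        /* Copy what we need, then release the caller's stack ... */
        void (*func)(void *) = slot->func;
        void *info = slot->info;

        (void)arg;
        atomic_fetch_and(&slot->flags, ~CSD_FLAG_RELEASE);
        func(info);             /* ... and only then run the function */
        return NULL;
}

static void hello(void *info)
{
        printf("ran with %s\n", (char *)info);
}

int main(void)
{
        struct csd d = { hello, "on-stack data", CSD_FLAG_RELEASE };
        pthread_t cpu1;

        slot = &d;
        pthread_create(&cpu1, NULL, callee, NULL);

        /* Wait only for the copy, not for hello() to finish. */
        while (atomic_load(&d.flags) & CSD_FLAG_RELEASE)
                ;
        /* d could be reused or go out of scope here. */
        return pthread_join(cpu1, NULL);
}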

I tested this patch by forcing the data to be null:

data = NULL; // kmalloc(...);

Also, when forcing data to be NULL on the latest git tree, without
applying the patch, I hit a deadlock in testing of the NMI watchdog. This
means there may be other areas in the kernel that assume smp_call_function,
without the wait bit set, will never wait.

Signed-off-by: Steven Rostedt <[email protected]>

diff --git a/kernel/smp.c b/kernel/smp.c
index 5cfa0e5..e85881b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -18,6 +18,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
enum {
CSD_FLAG_WAIT = 0x01,
CSD_FLAG_ALLOC = 0x02,
+ CSD_FLAG_RELEASE = 0x04,
};

struct call_function_data {
@@ -168,21 +169,32 @@ void generic_smp_call_function_single_interrupt(void)

while (!list_empty(&list)) {
struct call_single_data *data;
+ struct call_single_data d;

data = list_entry(list.next, struct call_single_data,
list);
list_del(&data->list);

+ d = *data;
+ /* Make sure the data was read before released */
+ smp_mb();
+ data->flags &= ~CSD_FLAG_RELEASE;
+
/*
* 'data' can be invalid after this call if
* flags == 0 (when called through
* generic_exec_single(), so save them away before
* making the call.
*/
- data_flags = data->flags;
+ data_flags = d.flags;
+ smp_rmb();

- data->func(data->info);
+ d.func(d.info);

+ /*
+ * data still exists if either of these
+ * flags are true. We should not use d.
+ */
if (data_flags & CSD_FLAG_WAIT) {
smp_wmb();
data->flags &= ~CSD_FLAG_WAIT;
@@ -230,6 +242,20 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
data = kmalloc(sizeof(*data), GFP_ATOMIC);
if (data)
data->flags = CSD_FLAG_ALLOC;
+ else {
+ /*
+ * There exist callers that call
+ * functions that will wait on the caller
+ * to do something after calling this.
+ * This means we can not wait for the callee
+ * function to finish.
+ * Use the stack data but have the callee
+ * copy it and tell us we can continue
+ * before they call the function.
+ */
+ data = &d;
+ data->flags = CSD_FLAG_RELEASE;
+ }
}
if (!data) {
data = &d;
@@ -239,6 +265,9 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
data->func = func;
data->info = info;
generic_exec_single(cpu, data);
+
+ while (data->flags & CSD_FLAG_RELEASE)
+ cpu_relax();
} else {
err = -ENXIO; /* CPU not online */
}


2009-01-28 16:41:42

by Steven Rostedt

Subject: Re: Buggy IPI and MTRR code on low memory


Duh! I forgot to CC Jens.

-- Steve


On Wed, 28 Jan 2009, Steven Rostedt wrote:

> [...]

2009-01-28 16:46:52

by Peter Zijlstra

Subject: Re: Buggy IPI and MTRR code on low memory

On Wed, 2009-01-28 at 11:38 -0500, Steven Rostedt wrote:

> The problem is that if we use the stack, then we must wait for the
> function to finish. But in the mtrr code, the called functions are waiting
> for the caller to do something after the smp_call_function. Thus we
> deadlock!

You'd have to 'fix' the regular fallback paths to use your scheme as
well.

Below is a fix for the mtrr code to not rely on this.

---
Subject: x86: fix potential deadlock in set_mtrr()
From: Peter Zijlstra <[email protected]>
Date: Wed Jan 28 17:17:32 CET 2009

smp_call_function() can fall-back to waiting on completion in case of
low memory (GFP_ATOMIC). set_mtrr() relies on the async behaviour of !wait.

This would deadlock.

Fix this by providing per-cpu csd's and using __smp_call_function_single().

Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/x86/kernel/cpu/mtrr/main.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/mtrr/main.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mtrr/main.c
+++ linux-2.6/arch/x86/kernel/cpu/mtrr/main.c
@@ -38,6 +38,7 @@
#include <linux/cpu.h>
#include <linux/mutex.h>
#include <linux/sort.h>
+#include <linux/smp.h>

#include <asm/e820.h>
#include <asm/mtrr.h>
@@ -130,12 +131,12 @@ struct set_mtrr_data {
mtrr_type smp_type;
};

+#ifdef CONFIG_SMP
static void ipi_handler(void *info)
/* [SUMMARY] Synchronisation handler. Executed by "other" CPUs.
[RETURNS] Nothing.
*/
{
-#ifdef CONFIG_SMP
struct set_mtrr_data *data = info;
unsigned long flags;

@@ -158,9 +159,31 @@ static void ipi_handler(void *info)

atomic_dec(&data->count);
local_irq_restore(flags);
-#endif
}

+DEFINE_PER_CPU(struct call_single_data, mtrr_csd);
+
+static void set_mtrr_smp(struct set_mtrr_data *data)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct call_single_data *csd = &per_cpu(mtrr_csd, cpu);
+
+ if (cpu == smp_processor_id())
+ continue;
+
+ csd->func = ipi_handler;
+ csd->info = data;
+ __smp_call_function_single(cpu, csd);
+ }
+}
+#else
+static void set_mtrr_smp(struct set_mtrr_data *data)
+{
+}
+#endif
+
static inline int types_compatible(mtrr_type type1, mtrr_type type2) {
return type1 == MTRR_TYPE_UNCACHABLE ||
type2 == MTRR_TYPE_UNCACHABLE ||
@@ -222,12 +245,11 @@ static void set_mtrr(unsigned int reg, u
smp_wmb();
atomic_set(&data.gate,0);

- /* Start the ball rolling on other CPUs */
- if (smp_call_function(ipi_handler, &data, 0) != 0)
- panic("mtrr: timed out waiting for other CPUs\n");
-
local_irq_save(flags);

+ /* Start the ball rolling on other CPUs */
+ set_mtrr_smp(&data);
+
while(atomic_read(&data.count))
cpu_relax();


2009-01-28 16:56:21

by Steven Rostedt

Subject: Re: Buggy IPI and MTRR code on low memory


On Wed, 28 Jan 2009, Peter Zijlstra wrote:

> On Wed, 2009-01-28 at 11:38 -0500, Steven Rostedt wrote:
>
> > The problem is that if we use the stack, then we must wait for the
> > function to finish. But in the mtrr code, the called functions are waiting
> > for the caller to do something after the smp_call_function. Thus we
> > deadlock!
>
> You'd have to 'fix' the regular fallback paths to use your scheme as
> well.

Not sure what you mean by that. I booted just fine with the patch, and
forcing data to NULL. Although, I'm not saying that these should not be
fixed.

-- Steve

2009-01-28 17:01:28

by Peter Zijlstra

Subject: Re: Buggy IPI and MTRR code on low memory

On Wed, 2009-01-28 at 11:56 -0500, Steven Rostedt wrote:
> On Wed, 28 Jan 2009, Peter Zijlstra wrote:
>
> > On Wed, 2009-01-28 at 11:38 -0500, Steven Rostedt wrote:
> >
> > > The problem is that if we use the stack, then we must wait for the
> > > function to finish. But in the mtrr code, the called functions are waiting
> > > for the caller to do something after the smp_call_function. Thus we
> > > deadlock!
> >
> > You'd have to 'fix' the regular fallback paths to use your scheme as
> > well.
>
> Not sure what you mean by that. I booted just fine with the patch, and
> forcing data to NULL. Although, I'm not saying that these should not be
> fixed.

Yeah, I was being confused by:

> @@ -230,6 +242,20 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
> data = kmalloc(sizeof(*data), GFP_ATOMIC);
> if (data)
> data->flags = CSD_FLAG_ALLOC;
> + else {
> + /*
> + * There exist callers that call
> + * functions that will wait on the caller
> + * to do something after calling this.
> + * This means we can not wait for the callee
> + * function to finish.
> + * Use the stack data but have the callee
> + * copy it and tell us we can continue
> + * before they call the function.
> + */
> + data = &d;
> + data->flags = CSD_FLAG_RELEASE;
> + }
> }
> if (!data) {
> data = &d;

that second !data test in there, thinking there was a fallback path
still relying on CSD_FLAG_WAIT.

It appears all is well indeed.

2009-01-28 17:25:00

by Steven Rostedt

Subject: Re: Buggy IPI and MTRR code on low memory


On Wed, 28 Jan 2009, Peter Zijlstra wrote:
> ---
> Subject: x86: fix potential deadlock in set_mtrr()
> From: Peter Zijlstra <[email protected]>
> Date: Wed Jan 28 17:17:32 CET 2009
>
> smp_call_function() can fall-back to waiting on completion in case of
> low memory (GFP_ATOMIC). set_mtrr() relies on the async behaviour of !wait.
>
> This would deadlock.
>
> Fix this by providing per-cpu csd's and using __smp_call_function_single().

I applied the patch but it still locked up in the testing NMI
watchdog code. Not saying your patch is at fault, because we never made it
to your code. This is probably not an issue, since if we have low memory
at bootup, we have much bigger problems to deal with.

I'll have to skip the NMI test and see if it locks up any place else.

-- Steve

2009-01-28 18:21:04

by Peter Zijlstra

Subject: Re: Buggy IPI and MTRR code on low memory

On Wed, 2009-01-28 at 12:24 -0500, Steven Rostedt wrote:
> On Wed, 28 Jan 2009, Peter Zijlstra wrote:
> > ---
> > Subject: x86: fix potential deadlock in set_mtrr()
> > From: Peter Zijlstra <[email protected]>
> > Date: Wed Jan 28 17:17:32 CET 2009
> >
> > smp_call_function() can fall-back to waiting on completion in case of
> > low memory (GFP_ATOMIC). set_mtrr() relies on the async behaviour of !wait.
> >
> > This would deadlock.
> >
> > Fix this by providing per-cpu csd's and using __smp_call_function_single().
>
> I applied the patch but it still locked up in the testing NMI
> watchdog code. Not saying your patch is at fault, because we never made it
> to your code. This is probably not an issue, since if we have low memory
> at bootup, we have much bigger problems to deal with.
>
> I'll have to skip the NMI test and see if it locks up any place else.

Right, the NMI code does exactly the same; in that case your patch might
well be the best way forward.

2009-01-28 18:22:21

by Arjan van de Ven

Subject: Re: Buggy IPI and MTRR code on low memory

On Wed, 28 Jan 2009 12:24:48 -0500 (EST)
Steven Rostedt <[email protected]> wrote:


> > ---
> I'll have to skip the NMI test and see if it locks up any place else.

if you get an NMI during MTRR changing you are more or less screwed.
Really. (the cpu is uncachable etc)

the MTRR code should disable NMIs as much as possible.

and for sure in -RT, the MTRR change section should not be preemptable
at all.

--
Arjan van de Ven
Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-01-28 18:34:25

by Steven Rostedt

Subject: Re: Buggy IPI and MTRR code on low memory


On Wed, 28 Jan 2009, Arjan van de Ven wrote:

> On Wed, 28 Jan 2009 12:24:48 -0500 (EST)
> Steven Rostedt <[email protected]> wrote:
>
> > > ---
> > I'll have to skip the NMI test and see if it locks up any place else.
>
> if you get an NMI during MTRR changing you are more or less screwed.
> Really. (the cpu is uncachable etc)

Heh, the NMI is just another place that has the IPI problem. A different
issue to the MTRR. Well, same bug, different place.

>
> the MTRR code should disable NMIs as much as possible.
>
> and for sure in -RT, the MTRR change section should not be preemptable
> at all.

Yeah, that is why I found this. The current code calls kfree and such from
the IPI if the data is allocated (not to mention the kmalloc in
smp_call_function), both of which can sleep in RT. I was forcing the
data=NULL case in RT to keep it from entering the allocation code paths.
But this is where I found the deadlock with MTRR (and NMI).

-- Steve

2009-01-28 18:52:58

by Steven Rostedt

Subject: Re: Buggy IPI and MTRR code on low memory


On Wed, 28 Jan 2009, Peter Zijlstra wrote:

> On Wed, 2009-01-28 at 12:24 -0500, Steven Rostedt wrote:
> > On Wed, 28 Jan 2009, Peter Zijlstra wrote:
> > > ---
> > > Subject: x86: fix potential deadlock in set_mtrr()
> > > From: Peter Zijlstra <[email protected]>
> > > Date: Wed Jan 28 17:17:32 CET 2009
> > >
> > > smp_call_function() can fall-back to waiting on completion in case of
> > > low memory (GFP_ATOMIC). set_mtrr() relies on the async behaviour of !wait.
> > >
> > > This would deadlock.
> > >
> > > Fix this by providing per-cpu csd's and using __smp_call_function_single().
> >
> > I applied the patch but it still locked up in the testing NMI
> > watchdog code. Not saying your patch is at fault, because we never made it
> > to your code. This is probably not an issue, since if we have low memory
> > at bootup, we have much bigger problems to deal with.
> >
> > I'll have to skip the NMI test and see if it locks up any place else.
>
> Right, the NMI code does exactly the same, in that case your patch might
> well be the best way forward.

Is there any reason to force the user to have to deal with a wait? We can
add this code to handle the failed alloc case, and I can even make it a
bit better by only doing the copy d = *data; when the RELEASE flag is set.
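
In handler terms, that optimization might look something like the sketch
below (a pared-down userspace stand-in, not the actual kernel/smp.c code):
the RELEASE path copies the fields it needs and releases the caller before
the call, while the other paths call straight through the existing data:

#include <stdatomic.h>
#include <stddef.h>

enum {
        CSD_FLAG_WAIT = 0x01,
        CSD_FLAG_ALLOC = 0x02,
        CSD_FLAG_RELEASE = 0x04,
};

struct call_single_data {       /* pared-down stand-in */
        void (*func)(void *);
        void *info;
        atomic_int flags;
};

void handle_one(struct call_single_data *data)
{
        int flags = atomic_load(&data->flags);

        if (flags & CSD_FLAG_RELEASE) {
                /*
                 * A caller is spinning on the flag: copy the fields we
                 * need first, then let the caller continue.
                 */
                void (*func)(void *) = data->func;
                void *info = data->info;

                atomic_fetch_and(&data->flags, ~CSD_FLAG_RELEASE);
                func(info);     /* data may be gone; use the copies */
                return;
        }

        data->func(data->info); /* alloc'd or wait case: no copy needed */
        /* WAIT clearing / ALLOC kfree would follow here, as before */
}

static void noop(void *info)
{
        (void)info;
}

int main(void)
{
        struct call_single_data d = { noop, NULL, CSD_FLAG_RELEASE };

        handle_one(&d);
        return 0;
}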

-- Steve

2009-01-28 21:13:23

by Andrew Morton

Subject: Re: Buggy IPI and MTRR code on low memory

On Wed, 28 Jan 2009 11:38:14 -0500 (EST)
Steven Rostedt <[email protected]> wrote:

>
> While developing the RT git tree I came across this deadlock.
>
> To avoid touching the memory allocator in smp_call_function_many I forced
> the stack use case, the path that would be taken if data fails to
> allocate.
>
> Here's the current code in kernel/smp.c:
>
> void smp_call_function_many(const struct cpumask *mask,
> void (*func)(void *), void *info,
> bool wait)
> {
> struct call_function_data *data;
> [...]
> data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC);
> if (unlikely(!data)) {
> /* Slow path. */
> for_each_online_cpu(cpu) {
> if (cpu == smp_processor_id())
> continue;
> if (cpumask_test_cpu(cpu, mask))
> smp_call_function_single(cpu, func, info,
> wait);
> }
> return;
> }
> [...]
>
> int smp_call_function_single(int cpu, void (*func) (void *info), void
> *info,
> int wait)
> {
> struct call_single_data d;
> [...]
> if (!wait) {
> data = kmalloc(sizeof(*data), GFP_ATOMIC);
> if (data)
> data->flags = CSD_FLAG_ALLOC;
> }
> if (!data) {
> data = &d;
> data->flags = CSD_FLAG_WAIT;
> }
>
> Note that if data failed to allocate, we force the wait state.
>
>
> This immediately caused a deadlock with the mtrr code:
>
> arch/x86/kernel/cpu/mtrr/main.c:
>
> static void set_mtrr(unsigned int reg, unsigned long base,
> unsigned long size, mtrr_type type)
> {
> struct set_mtrr_data data;
> [...]
> /* Start the ball rolling on other CPUs */
> if (smp_call_function(ipi_handler, &data, 0) != 0)
> panic("mtrr: timed out waiting for other CPUs\n");
>
> local_irq_save(flags);
>
> while(atomic_read(&data.count))
> cpu_relax();
>
> /* ok, reset count and toggle gate */
> atomic_set(&data.count, num_booting_cpus() - 1);
> smp_wmb();
> atomic_set(&data.gate,1);
>
> [...]
>
> static void ipi_handler(void *info)
> /* [SUMMARY] Synchronisation handler. Executed by "other" CPUs.
> [RETURNS] Nothing.
> */
> {
> #ifdef CONFIG_SMP
> struct set_mtrr_data *data = info;
> unsigned long flags;
>
> local_irq_save(flags);
>
> atomic_dec(&data->count);
> while(!atomic_read(&data->gate))
> cpu_relax();
>
>
> The problem is that if we use the stack, then we must wait for the
> function to finish. But in the mtrr code, the called functions are waiting
> for the caller to do something after the smp_call_function. Thus we
> deadlock! This mtrr code seems to have been there for a while. At least
> longer than the git history.

My initial reaction is that the mtrr code is being stupid, but I guess
that strengthening the smp_call_function() stuff is good, and we _do_
have this "wait=0" contract.

> To get around this, I did the following hack. Now this may be good
> enough to handle the case. I'm posting it for comments.
>
> The patch creates another flag called CSD_FLAG_RELEASE. If we fail
> to alloc the data and the wait bit is not set, we still use the stack
> but we also set this flag instead of the wait flag. The receiving IPI
> will copy the data locally, and if this flag is set, it will clear it. The
> caller, after sending the IPI, will wait on this flag to be cleared.
>
> The difference between this and the wait bit is that the release bit is
> just a way to let the callee tell the caller that it copied the data and
> is continuing. The data can be released with no worries. This prevents the
> deadlock because the caller can continue without waiting for the functions
> to be called.
>
> I tested this patch by forcing the data to be null:
>
> data = NULL; // kmalloc(...);
>
> Also, when forcing data to be NULL on the latest git tree, without
> applying the patch, I hit a deadlock in testing of the NMI watchdog. This
> means there may be other areas in the kernel that assume smp_call_function,
> without the wait bit set, will never wait.

Concern 1: do all architectures actually call
generic_smp_call_function_single_interrupt()? I don't think they
_have_ to at present, and if they don't, we now have inconsistent
behaviour between architectures.

Concern 2: not all architectures set CONFIG_USE_GENERIC_SMP_HELPERS=y.
Those which do not set CONFIG_USE_GENERIC_SMP_HELPERS might need to
have similar changes made so that the behaviour remains consistent
across architectures.

Thought: do we need to do the kmalloc at all? Perhaps we can instead
use a statically allocated per-cpu call_single_data local to
kernel/smp.c? It would need a spinlock or something to protect it...

2009-01-28 21:14:50

by Andrew Morton

Subject: Re: Buggy IPI and MTRR code on low memory

On Wed, 28 Jan 2009 13:12:02 -0800
Andrew Morton <[email protected]> wrote:

> Thought: do we need to do the kmalloc at all? Perhaps we can instead
> use a statically allocated per-cpu call_single_data local to
> kernel/smp.c? It would need a spinlock or something to protect it...

(not a spinlock - get_cpu_var/put_cpu_var will suffice)

2009-01-28 21:23:50

by Steven Rostedt

Subject: Re: Buggy IPI and MTRR code on low memory


On Wed, 28 Jan 2009, Andrew Morton wrote:

> On Wed, 28 Jan 2009 13:12:02 -0800
> Andrew Morton <[email protected]> wrote:
>
> > Thought: do we need to do the kmalloc at all? Perhaps we can instead
> > use a statically allocated per-cpu call_single_data local to
> > kernel/smp.c? It would need a spinlock or something to protect it...
>
> (not a spinlock - get_cpu_var/put_cpu_var will suffice)

Is that enough?

The called CPU may not process the data until after smp_call_function returns.
What happens if two smp_call_functions are executed one after the other?
The second one may corrupt the data if the IPI function has not executed
yet.

We may still need that "RELEASE" flag for that case.
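
The lost update is easy to see in a userspace sketch (illustrative names;
the "IPI" here only runs after both calls have been queued into the same
slot):

#include <pthread.h>
#include <stdio.h>

static struct {
        const char *msg;        /* one shared slot per target cpu */
} slot;

static void *target_cpu(void *arg)
{
        (void)arg;
        printf("target saw: %s\n", slot.msg);   /* only the second survives */
        return NULL;
}

int main(void)
{
        pthread_t cpu1;

        slot.msg = "first call";        /* fire-and-forget #1 */
        slot.msg = "second call";       /* fire-and-forget #2 clobbers #1 */

        pthread_create(&cpu1, NULL, target_cpu, NULL);
        return pthread_join(cpu1, NULL);
}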

-- Steve

2009-01-28 22:08:36

by Andrew Morton

Subject: Re: Buggy IPI and MTRR code on low memory

On Wed, 28 Jan 2009 16:23:32 -0500 (EST)
Steven Rostedt <[email protected]> wrote:

>
> On Wed, 28 Jan 2009, Andrew Morton wrote:
>
> > On Wed, 28 Jan 2009 13:12:02 -0800
> > Andrew Morton <[email protected]> wrote:
> >
> > > Thought: do we need to do the kmalloc at all? Perhaps we can instead
> > > use a statically allocated per-cpu call_single_data local to
> > > kernel/smp.c? It would need a spinlock or something to protect it...
> >
> > (not a spinlock - get_cpu_var/put_cpu_var will suffice)
>
> Is that enough?
>
> The called CPU may not process the data until after smp_call_function returns.
> What happens if two smp_call_functions are executed one after the other?
> The second one may corrupt the data if the IPI function has not executed
> yet.
>
> We may still need that "RELEASE" flag for that case.
>

Good point.

Forget I said that - smp_call_function_single() is calling a function
on a *different* CPU, so data which is local to the calling CPU is of
course in the wrong place.

So if we're going to use per-cpu data then we'd need to protect it with
a lock. We could (should?) have a separate lock for each destination
CPU.

We could make smp_call_function_single() block until the IPI handler
has consumed the call_single_data, in which case we might as well put
the call_single_data, onto the caller's stack, as you've done.

Or we could take the per-cpu spinlock in smp_call_function_single(),
and release it in the IPI handler, after the call_single_data has been
consumed, which is a bit more efficient. But I have a suspicion that
this is AB/BA deadlockable.

<tries to think of any scenarios>

<fails>

So we have

smp_call_function_single(int cpu)
{
        spin_lock(per_cpu(cpu, locks));
        per_cpu(cpu, call_single_data) = <stuff>
        send_ipi(cpu);
        return;
}

ipi_handler(...)
{
        int cpu = smp_processor_id();
        call_single_data csd = per_cpu(cpu, call_single_data);

        spin_unlock(per_cpu(cpu, locks));
        use(csd);
}

does that work?

Dunno if it's any better than what you have now. It does however
remove the unpleasant "try kmalloc and if that failed, try something
else" mess.

2009-01-28 22:48:14

by Steven Rostedt

Subject: Re: Buggy IPI and MTRR code on low memory


On Wed, 28 Jan 2009, Andrew Morton wrote:
>
> So if we're going to use per-cpu data then we'd need to protect it with
> a lock. We could (should?) have a separate lock for each destination
> CPU.
>
> We could make smp_call_function_single() block until the IPI handler
> has consumed the call_single_data, in which case we might as well put
> the call_single_data, onto the caller's stack, as you've done.
>
> Or we could take the per-cpu spinlock in smp_call_function_single(),
> and release it in the IPI handler, after the call_single_data has been
> consumed, which is a bit more efficient. But I have a suspicion that
> this is AB/BA deadlockable.
>
> <tries to think of any scenarios>
>
> <fails>
>
> So we have
>
> smp_call_function_single(int cpu)
> {
> spin_lock(per_cpu(cpu, locks));
> per_cpu(cpu, call_single_data) = <stuff>
> send_ipi(cpu);
> return;
> }
>
> ipi_handler(...)
> {
> int cpu = smp_processor_id();
> call_single_data csd = per_cpu(cpu, call_single_data);
>
> spin_unlock(per_cpu(cpu, locks));
> use(csd);
> }
>
> does that work?

I don't think so. With ticket spinlocks and such, that just looks like it
is destined to crash. Also, spinlocks disable preemption, and we would need
to enable it again, otherwise we have a dangling preempt_disable.

>
> Dunno if it's any better than what you have now. It does however
> remove the unpleasant "try kmalloc and if that failed, try something
> else" mess.

We could have a percpu for single data and still use the lock. Keep the
method of the IPI handler signaling to the caller that it has copied the
data to its own stack. Then the caller could release the lock.

We can keep this only for the non-wait case. Most users set the wait bit,
so it will only slow down the non-waiters.

Of course this will only work with the single cpu function callers. I
think the all cpu function callers still need to alloc.

-- Steve

2009-01-28 23:21:34

by Andrew Morton

Subject: Re: Buggy IPI and MTRR code on low memory

On Wed, 28 Jan 2009 17:47:55 -0500 (EST)
Steven Rostedt <[email protected]> wrote:

>
> On Wed, 28 Jan 2009, Andrew Morton wrote:
> >
> > So if we're going to use per-cpu data then we'd need to protect it with
> > a lock. We could (should?) have a separate lock for each destination
> > CPU.
> >
> > We could make smp_call_function_single() block until the IPI handler
> > has consumed the call_single_data, in which case we might as well put
> > the call_single_data, onto the caller's stack, as you've done.
> >
> > Or we could take the per-cpu spinlock in smp_call_function_single(),
> > and release it in the IPI handler, after the call_single_data has been
> > consumed, which is a bit more efficient. But I have a suspicion that
> > this is AB/BA deadlockable.
> >
> > <tries to think of any scenarios>
> >
> > <fails>
> >
> > So we have
> >
> > smp_call_function_single(int cpu)
> > {
> > spin_lock(per_cpu(cpu, locks));
> > per_cpu(cpu, call_single_data) = <stuff>
> > send_ipi(cpu);
> > return;
> > }
> >
> > ipi_handler(...)
> > {
> > int cpu = smp_processor_id();
> > call_single_data csd = per_cpu(cpu, call_single_data);
> >
> > spin_unlock(per_cpu(cpu, locks));
> > use(csd);
> > }
> >
> > does that work?
>
> I don't think so. With ticket spinlocks and such, that just looks like it
> is destined to crash. Also, spinlocks disable preemption, and we would need
> to enable it again, otherwise we have a dangling preempt_disable.

Well that sucks.

I think the model of taking a lock, passing the data across to another
thread of control and having that thread of control release the lock
once the data has been consumed is a good one. It's faster and in this
case it means that we can now (maybe) perform (wait=0) IPI calls with local
interrupts disabled, which wasn't the case before.

It's a shame that irrelevant-other-stuff prevents us from implementing
such a thing. Of course, we can still do it, via some ad-hoc locking
thing using raw_spin_lock/atomic_t's/bitops/cpu_relax/etc.

> >
> > Dunno if it's any better than what you have now. It does however
> > remove the unpleasant "try kmalloc and if that failed, try something
> > else" mess.
>
> We could have a percpu for single data and still use the lock. Keep the
> method of the IPI handler signaling to the caller that it has copied the
> data to its own stack. Then the caller could release the lock.

Yeah, spose so. That separates the signalling from the locking, rather
than using the lock as the signal. But it's an inferior solution
because it a) makes the caller hang around until the consumer has
executed and it b) requires that callers still have local interrupts
enabled.

Now, maybe we can never implement b) at all - maybe we'll still require
that callers have local interrupts enabled. Because there might be
arch_send_call_function_single_ipi() which has to loop around until it
sees that the target CPU has accepted the interrupt, dunno.

> We can keep this only for the non-wait case. Most users set the wait bit,
> so it will only slow down the non-waiters.

It would be good to avoid separate sync and async code paths as much
as possible, of course.

> Of course this will only work with the single cpu function callers. I
> think the all cpu function callers still need to alloc.

I haven't really thought about the smp_call_function_many()
common case.

2009-01-28 23:26:16

by Rusty Russell

Subject: Re: Buggy IPI and MTRR code on low memory

On Thursday 29 January 2009 03:08:14 Steven Rostedt wrote:
>
> While developing the RT git tree I came across this deadlock.
>
> To avoid touching the memory allocator in smp_call_function_many I forced
> the stack use case, the path that would be taken if data fails to
> allocate.
>
> Here's the current code in kernel/smp.c:

Interesting. I simplified smp_call_function_ma{sk,ny}, and introduced this bug (see 54b11e6d57a10aa9d0009efd93873e17bffd5d30).

We used to wait on OOM, yes, but we didn't do them one at a time.

We could restore that quiesce code, or call a function on all online cpus using on-stack data, and have them atomic_dec a counter when they're done (I'm not sure why we didn't do this in the first place: Nick?)
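
A sketch of that counter variant, in userspace with pthreads standing in
for the online cpus and all names illustrative: each target runs the
function from the shared on-stack data and drops a reference, and the
caller only holds its stack frame until the count reaches zero:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCPUS 4

struct call_data {
        void (*func)(void *);
        void *info;
        atomic_int refs;        /* one reference per target cpu */
};

static struct call_data *pending;

static void *other_cpu(void *arg)
{
        struct call_data *data = pending;

        (void)arg;
        data->func(data->info);
        atomic_fetch_sub(&data->refs, 1);       /* done with caller's stack */
        return NULL;
}

static void work(void *info)
{
        printf("ran %s\n", (char *)info);
}

int main(void)
{
        struct call_data data = { work, "on-stack info", NCPUS };
        pthread_t cpus[NCPUS];
        int i;

        pending = &data;
        for (i = 0; i < NCPUS; i++)
                pthread_create(&cpus[i], NULL, other_cpu, NULL);

        while (atomic_load(&data.refs))         /* wait for every ref to drop */
                ;
        /* data could now go out of scope; nobody touches it any more */
        for (i = 0; i < NCPUS; i++)
                pthread_join(cpus[i], NULL);
        return 0;
}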

> The problem is that if we use the stack, then we must wait for the
> function to finish. But in the mtrr code, the called functions are waiting
> for the caller to do something after the smp_call_function. Thus we
> deadlock! This mtrr code seems to have been there for a while. At least
> longer than the git history.

I don't see how that *ever* worked then, even with the quiesce stuff.

> The patch creates another flag called CSD_FLAG_RELEASE. If we fail
> to alloc the data and the wait bit is not set, we still use the stack
> but we also set this flag instead of the wait flag. The receiving IPI
> will copy the data locally, and if this flag is set, it will clear it. The
> caller, after sending the IPI, will wait on this flag to be cleared.

Doesn't this break with more than one cpu? I think a refcnt is needed for the general case...

Rusty.

2009-01-28 23:41:19

by Steven Rostedt

Subject: Re: Buggy IPI and MTRR code on low memory


On Thu, 29 Jan 2009, Rusty Russell wrote:

> On Thursday 29 January 2009 03:08:14 Steven Rostedt wrote:
> >
> > While developing the RT git tree I came across this deadlock.
> >
> > To avoid touching the memory allocator in smp_call_function_many I forced
> > the stack use case, the path that would be taken if data fails to
> > allocate.
> >
> > Here's the current code in kernel/smp.c:
>
> Interesting. I simplified smp_call_function_ma{sk,ny}, and introduced
> this bug (see 54b11e6d57a10aa9d0009efd93873e17bffd5d30).

No, I think the bug existed before you simplified it. In fact, I'm
currently working on 2.6.28 where I hit the bug. I actually backported
part of that commit to implement this.

>
> We used to wait on OOM, yes, but we didn't do them one at a time.
>
> We could restore that quiesce code, or call a function on all online
> cpus using on-stack data, and have them atomic_dec a counter when
> they're done (I'm not sure why we didn't do this in the first place:
> Nick?)
>
> > The problem is that if we use the stack, then we must wait for the
> > function to finish. But in the mtrr code, the called functions are waiting
> > for the caller to do something after the smp_call_function. Thus we
> > deadlock! This mtrr code seems to have been there for a while. At least
> > longer than the git history.
>
> I don't see how that *ever* worked then, even with the quiesce stuff.

It didn't ;-)

>
> > The patch creates another flag called CSD_FLAG_RELEASE. If we fail
> > to alloc the data and the wait bit is not set, we still use the stack
> > but we also set this flag instead of the wait flag. The receiving IPI
> > will copy the data locally, and if this flag is set, it will clear it. The
> > caller, after sending the IPI, will wait on this flag to be cleared.
>
> Doesn't this break with more than one cpu? I think a refcnt is needed
> for the general case...

I only use it on a single cpu call. This was what I backported from your
patch. I still have the first alloc (ifdef'd out for RT), but the fallback
goes to single CPU mode, one at a time.

void smp_call_function_many(const struct cpumask *mask,
                            void (*func)(void *), void *info,
                            bool wait)
{
        struct call_function_data *data;
        [...]
        data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC);
        if (unlikely(!data)) {
                /* Slow path. */
                for_each_online_cpu(cpu) {
                        if (cpu == smp_processor_id())
                                continue;
                        if (cpumask_test_cpu(cpu, mask))
                                smp_call_function_single(cpu, func, info, wait);
                }
                return;
        }


I only needed to handle the single cpu case. Yes, this is slow because it
would need to wait for each individual CPU before going to the next. But
at least it is a workaround.

-- Steve

2009-01-29 00:30:15

by Steven Rostedt

Subject: Re: Buggy IPI and MTRR code on low memory


On Wed, 28 Jan 2009, Andrew Morton wrote:
> >
> > I don't think so. With ticket spinlocks and such, that just looks like it
> > is destine to crash. Also, spinlocks disable preemption, and we would need
> > to enable it again otherwise we have a dangling preempt_disable.
>
> Well that sucks.
>
> I think the model of taking a lock, passing the data across to another
> thread of control and having that thread of control release the lock
> once the data has been consumed is a good one. It's faster and in this
> case it means that we can now (maybe) perform (wait=0) IPI calls with local
> interrupts disabled, which wasn't the case before.

I think that model is nice too. Just not for spinlocks ;-)

>
> It's a shame that irrelevant-other-stuff prevents us from implementing
> such a thing. Of course, we can still do it, via some ad-hoc locking
> thing using raw_spin_lock/atomic_t's/bitops/cpu_relax/etc.

It should not be too hackish. I don't even think we need raw or atomic.

cpu caller:

again:
        spin_lock(&data_lock);
        if (per_cpu(data, cpu).lock) {
                spin_unlock(&data_lock);
                cpu_relax();
                goto again;
        }

        per_cpu(data, cpu).lock = 1;
        spin_unlock(&data_lock);

        call ipi functions


cpu callee:

        func();

        smp_wmb();
        per_cpu(data, cpu).lock = 0;


Here we can only send an ipi to call a function if the data lock is cleared.
If it is set, we wait for the ipi function to clear it and then we set it.

Thus we synchronize the callers, but the callers do not need to wait on
the callees. And since the callers can only send an ipi to a cpu that is
finished, we do not need to worry about corrupted data.

>
> > Of course this will only work with the single cpu function callers. I
> > think the all cpu function callers still need to alloc.
>
> I haven't really thought about the smp_call_function_many()
> common case.

We can keep the original code that still does the malloc for the many
cpus, but fall back to the single version, which does not use malloc, if
the malloc fails.


current code for smp_call_function_many():

        data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC);
        if (unlikely(!data)) {
                /* Slow path. */
                for_each_online_cpu(cpu) {
                        if (cpu == smp_processor_id())
                                continue;
                        if (cpumask_test_cpu(cpu, mask))
                                smp_call_function_single(cpu, func, info, wait);
                }
                return;
        }

I think this would work.

/me goes and tries it out

-- Steve

2009-01-29 00:52:28

by Steven Rostedt

Subject: [PATCH] use per cpu data for single cpu ipi calls


The smp_call_function can be passed a wait parameter telling it to
wait for all the functions running on other CPUs to complete before
returning, or to return without waiting. Unfortunately, this is
currently just a suggestion and not manditory. That is, the
smp_call_function can decide not to return and wait instead.

The reason for this is because it uses kmalloc to allocate storage
to send to the called CPU and that CPU will free it when it is done.
But if we fail to allocate the storage, the stack is used instead.
This means we must wait for the called CPU to finish before
continuing.

Unfortunatly, some callers do not abide by this hint and act as if
the non-wait option is mandatory. The MTRR code for instance will
deadlock if the smp_call_function is set to wait. This is because
the smp_call_function will wait for the other CPUs to finish their
called functions, but those functions are waiting on the caller to
continue.

This patch changes the generic smp_call_function code to use per cpu
variables instead of allocating for a single CPU call. The
smp_call_function_many will fall back to the smp_call_function_single
if it fails its alloc. The smp_call_function_single is modified
to not force the wait state.

Since we now are using a single data per cpu we must synchronize the
callers to prevent a second caller modifying the data before the
first called IPI functions complete. To do so, I added a flag to
the call_single_data called CSD_FLAG_LOCK. When the single CPU is
called (which can be called when a many call fails an alloc), we
set the LOCK bit on this per cpu data. When the callee finishes
it clears the LOCK bit.

The caller must wait till the LOCK bit is cleared before setting
it. When it is cleared, there is no IPI function using it.
A spinlock is used to synchronize the setting of the bit between
callers. Since only one callee can be called at a time, and it
is the only thing to clear it, the IPI does not need to use
any locking.

Signed-off-by: Steven Rostedt <[email protected]>
---
diff --git a/kernel/smp.c b/kernel/smp.c
index 5cfa0e5..aba3813 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -18,6 +18,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
enum {
CSD_FLAG_WAIT = 0x01,
CSD_FLAG_ALLOC = 0x02,
+ CSD_FLAG_LOCK = 0x04,
};

struct call_function_data {
@@ -186,6 +187,9 @@ void generic_smp_call_function_single_interrupt(void)
if (data_flags & CSD_FLAG_WAIT) {
smp_wmb();
data->flags &= ~CSD_FLAG_WAIT;
+ } else if (data_flags & CSD_FLAG_LOCK) {
+ smp_wmb();
+ data->flags &= ~CSD_FLAG_LOCK;
} else if (data_flags & CSD_FLAG_ALLOC)
kfree(data);
}
@@ -196,6 +200,9 @@ void generic_smp_call_function_single_interrupt(void)
}
}

+static DEFINE_PER_CPU(struct call_single_data, csd_data);
+static DEFINE_SPINLOCK(csd_data_lock);
+
/*
* smp_call_function_single - Run a function on a specific CPU
* @func: The function to run. This must be fast and non-blocking.
@@ -224,14 +231,35 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
func(info);
local_irq_restore(flags);
} else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
- struct call_single_data *data = NULL;
+ struct call_single_data *data;

if (!wait) {
- data = kmalloc(sizeof(*data), GFP_ATOMIC);
- if (data)
- data->flags = CSD_FLAG_ALLOC;
- }
- if (!data) {
+ data = &per_cpu(csd_data, cpu);
+ /*
+ * We are calling a function on a single CPU
+ * and we are not going to wait for it to finish.
+ * We use a per cpu data to pass the information
+ * to that CPU, but since all callers of this
+ * code will use the same data, we must
+ * synchronize the callers to prevent a new caller
+ * from corrupting the data before the callee
+ * can access it.
+ *
+ * The CSD_FLAG_LOCK is used to let us know when
+ * the IPI handler is done with the data.
+ * The first caller will set it, and the callee
+ * will clear it. The next caller must wait for
+ * it to clear before we set it again. This
+ * will make sure the callee is done with the
+ * data before a new caller will use it.
+ * We use spinlocks to manage the callers.
+ */
+ spin_lock(&csd_data_lock);
+ while (data->flags & CSD_FLAG_LOCK)
+ cpu_relax();
+ data->flags = CSD_FLAG_LOCK;
+ spin_unlock(&csd_data_lock);
+ } else {
data = &d;
data->flags = CSD_FLAG_WAIT;
}

2009-01-29 01:31:53

by Andrew Morton

Subject: Re: [PATCH] use per cpu data for single cpu ipi calls

On Wed, 28 Jan 2009 19:52:16 -0500 (EST)
Steven Rostedt <[email protected]> wrote:

>
> The smp_call_function can be passed a wait parameter telling it to
> wait for all the functions running on other CPUs to complete before
> returning, or to return without waiting. Unfortunately, this is
> currently just a suggestion and not manditory. That is, the

"mandatory"

> smp_call_function can decide not to return and wait instead.
>
> The reason for this is because it uses kmalloc to allocate storage
> to send to the called CPU and that CPU will free it when it is done.
> But if we fail to allocate the storage, the stack is used instead.
> This means we must wait for the called CPU to finish before
> continuing.
>
> > Unfortunatly, some callers do not abide by this hint and act as if

"Unfortunately".

> the non-wait option is mandatory. The MTRR code for instance will
> deadlock if the smp_call_function is set to wait. This is because
> the smp_call_function will wait for the other CPUs to finish their
> called functions, but those functions are waiting on the caller to
> continue.
>
> This patch changes the generic smp_call_function code to use per cpu
> variables instead of allocating for a single CPU call. The
> smp_call_function_many will fall back to the smp_call_function_single
> if it fails its alloc. The smp_call_function_single is modified
> to not force the wait state.
>
> Since we now are using a single data per cpu we must synchronize the
> callers to prevent a second caller modifying the data before the
> first called IPI functions complete. To do so, I added a flag to
> the call_single_data called CSD_FLAG_LOCK. When the single CPU is
> called (which can be called when a many call fails an alloc), we
> set the LOCK bit on this per cpu data. When the callee finishes
> it clears the LOCK bit.
>
> The caller must wait till the LOCK bit is cleared before setting
> it. When it is cleared, there is no IPI function using it.
> A spinlock is used to synchronize the setting of the bit between
> callers. Since only one callee can be called at a time, and it
> is the only thing to clear it, the IPI does not need to use
> any locking.
>
> Signed-off-by: Steven Rostedt <[email protected]>
> ---
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 5cfa0e5..aba3813 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -18,6 +18,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
> enum {
> CSD_FLAG_WAIT = 0x01,
> CSD_FLAG_ALLOC = 0x02,
> + CSD_FLAG_LOCK = 0x04,
> };
>
> struct call_function_data {
> @@ -186,6 +187,9 @@ void generic_smp_call_function_single_interrupt(void)
> if (data_flags & CSD_FLAG_WAIT) {
> smp_wmb();
> data->flags &= ~CSD_FLAG_WAIT;
> + } else if (data_flags & CSD_FLAG_LOCK) {
> + smp_wmb();
> + data->flags &= ~CSD_FLAG_LOCK;
> } else if (data_flags & CSD_FLAG_ALLOC)
> kfree(data);
> }
> @@ -196,6 +200,9 @@ void generic_smp_call_function_single_interrupt(void)
> }
> }
>
> +static DEFINE_PER_CPU(struct call_single_data, csd_data);
> +static DEFINE_SPINLOCK(csd_data_lock);
> +
> /*
> * smp_call_function_single - Run a function on a specific CPU
> * @func: The function to run. This must be fast and non-blocking.
> @@ -224,14 +231,35 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
> func(info);
> local_irq_restore(flags);
> } else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
> - struct call_single_data *data = NULL;
> + struct call_single_data *data;
>
> if (!wait) {
> - data = kmalloc(sizeof(*data), GFP_ATOMIC);
> - if (data)
> - data->flags = CSD_FLAG_ALLOC;
> - }
> - if (!data) {
> + data = &per_cpu(csd_data, cpu);
> + /*
> + * We are calling a function on a single CPU
> + * and we are not going to wait for it to finish.
> + * We use a per cpu data to pass the information
> + * to that CPU, but since all callers of this
> + * code will use the same data, we must
> + * synchronize the callers to prevent a new caller
> + * from corrupting the data before the callee
> + * can access it.
> + *
> + * The CSD_FLAG_LOCK is used to let us know when
> + * the IPI handler is done with the data.
> + * The first caller will set it, and the callee
> + * will clear it. The next caller must wait for
> + * it to clear before we set it again. This
> + * will make sure the callee is done with the
> + * data before a new caller will use it.
> + * We use spinlocks to manage the callers.
> + */
> + spin_lock(&csd_data_lock);
> + while (data->flags & CSD_FLAG_LOCK)
> + cpu_relax();
> + data->flags = CSD_FLAG_LOCK;
> + spin_unlock(&csd_data_lock);
> + } else {
> data = &d;
> data->flags = CSD_FLAG_WAIT;
> }

Well that looks nice.

Can we make the spinlock a per-cpu thing as well? Or is that
over-optimising? We'd need to initialise all those spinlocks at
runtime.

In generic_smp_call_function_single_interrupt(), did you consider
releasing the "lock" _before_ calling the callback function? That
would reduce latencies a bit and allow more concurrency. Maybe that's
over-optimising too.

Can generic_smp_call_function_single_interrupt() ever see
CSD_FLAG_ALLOC set now? If not, that kfree can go away.

And where do we now stand with the architectures which _don't_ use the
kernel/smp.c code? If someone writes code in generic kernel which
relies upon the new capabilities, it will go bad on those
architectures. Makes davem sad.

2009-01-29 01:56:18

by Steven Rostedt

Subject: Re: [PATCH] use per cpu data for single cpu ipi calls


On Wed, 28 Jan 2009, Andrew Morton wrote:

> On Wed, 28 Jan 2009 19:52:16 -0500 (EST)
> Steven Rostedt <[email protected]> wrote:
>
> >
> > The smp_call_function can be passed a wait parameter telling it to
> > wait for all the functions running on other CPUs to complete before
> > returning, or to return without waiting. Unfortunately, this is
> > currently just a suggestion and not manditory. That is, the
>
> "mandatory"

Some day I'll learn to use spell check.

> >
> > Unfortunatly, some callers do not abide by this hint and act as if
>
> "Unfortunately".

Unfortunately, not today.

>
> Well that looks nice.

Thank you.

>
> Can we make the spinlock a per-cpu thing as well? Or is that
> over-optimising? We'd need to initialise all those spinlocks at
> runtime.

I thought about it and thought it was over optimizing (US spelling).
But I could be wrong.

>
> In generic_smp_call_function_single_interrupt(), did you consider
> releasing the "lock" _before_ calling the callback function? That
> would reduce latencies a bit and allow more concurrency. Maybe that's
> over-optimising too.

I thought about it too, but I wanted to avoid the copying of the data
fields. Currently the callee does:

        data = list_entry(list.next, struct call_single_data,
                          list);
        list_del(&data->list);

        /*
         * 'data' can be invalid after this call if
         * flags == 0 (when called through
         * generic_exec_single(), so save them away before
         * making the call.
         */
        data_flags = data->flags;

        data->func(data->info);

I would need to have a stack item like I did in my first release, and copy
it.

        d = *data;
        smp_wmb();
        data->flags &= ~CSD_FLAG_LOCK;

        d.func(d.info);

I figured that the contention would only happen with a second caller. The
first caller does not have to wait. And I doubt that there would be many
instances of two callers contending. It depends on how often
smp_call_function is called.

Probably not enough contention to worry about.

>
> Can generic_smp_call_function_single_interrupt() ever see
> CSD_FLAG_ALLOC set now? If not, that kfree can go away.

Probably not, but I kept it just in case. I see there's a way a caller
can pass in their own data through

generic_exec_single()

A user could potentially set the ALLOC flag. But since the enum is defined
in this file, I find that highly unlikely. But looking at some of the code
in the kernel, I would not be surprised if someone does.

>
> And where do we now stand with the architectures which _don't_ use the
> kernel/smp.c code? If someone writes code in generic kernel which
> relies upon the new capabilities, it will go bad on those
> architectures. Makes davem sad.

I guess someone should flag the arch list.

-- Steve

2009-01-29 08:49:36

by Peter Zijlstra

Subject: Re: [PATCH] use per cpu data for single cpu ipi calls

On Wed, 2009-01-28 at 17:30 -0800, Andrew Morton wrote:
> On Wed, 28 Jan 2009 19:52:16 -0500 (EST)
> Steven Rostedt <[email protected]> wrote:
>
> >
> > The smp_call_function can be passed a wait parameter telling it to
> > wait for all the functions running on other CPUs to complete before
> > returning, or to return without waiting. Unfortunately, this is
> > currently just a suggestion and not manditory. That is, the
>
> "mandatory"
>
> > smp_call_function can decide not to return and wait instead.
> >
> > The reason for this is because it uses kmalloc to allocate storage
> > to send to the called CPU and that CPU will free it when it is done.
> > But if we fail to allocate the storage, the stack is used instead.
> > This means we must wait for the called CPU to finish before
> > continuing.
> >
> > Unfortunatly, some callers do not abide by this hint and act as if
>
> "Unfortunately".
>
> > the non-wait option is mandatory. The MTRR code for instance will
> > deadlock if the smp_call_function is set to wait. This is because
> > the smp_call_function will wait for the other CPUs to finish their
> > called functions, but those functions are waiting on the caller to
> > continue.
> >
> > This patch changes the generic smp_call_function code to use per cpu
> > variables instead of allocating for a single CPU call. The
> > smp_call_function_many will fall back to the smp_call_function_single
> > if it fails its alloc. The smp_call_function_single is modified
> > to not force the wait state.
> >
> > Since we now are using a single data per cpu we must synchronize the
> > callers to prevent a second caller modifying the data before the
> > first called IPI functions complete. To do so, I added a flag to
> > the call_single_data called CSD_FLAG_LOCK. When the single CPU is
> > called (which can be called when a many call fails an alloc), we
> > set the LOCK bit on this per cpu data. When the callee finishes
> > it clears the LOCK bit.
> >
> > The caller must wait till the LOCK bit is cleared before setting
> > it. When it is cleared, there is no IPI function using it.
> > A spinlock is used to synchronize the setting of the bit between
> > callers. Since only one callee can be called at a time, and it
> > is the only thing to clear it, the IPI does not need to use
> > any locking.
> >
> > Signed-off-by: Steven Rostedt <[email protected]>
> > ---
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index 5cfa0e5..aba3813 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -18,6 +18,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
> > enum {
> > CSD_FLAG_WAIT = 0x01,
> > CSD_FLAG_ALLOC = 0x02,
> > + CSD_FLAG_LOCK = 0x04,
> > };
> >
> > struct call_function_data {
> > @@ -186,6 +187,9 @@ void generic_smp_call_function_single_interrupt(void)
> > if (data_flags & CSD_FLAG_WAIT) {
> > smp_wmb();
> > data->flags &= ~CSD_FLAG_WAIT;
> > + } else if (data_flags & CSD_FLAG_LOCK) {
> > + smp_wmb();
> > + data->flags &= ~CSD_FLAG_LOCK;
> > } else if (data_flags & CSD_FLAG_ALLOC)
> > kfree(data);
> > }
> > @@ -196,6 +200,9 @@ void generic_smp_call_function_single_interrupt(void)
> > }
> > }
> >
> > +static DEFINE_PER_CPU(struct call_single_data, csd_data);
> > +static DEFINE_SPINLOCK(csd_data_lock);
> > +
> > /*
> > * smp_call_function_single - Run a function on a specific CPU
> > * @func: The function to run. This must be fast and non-blocking.
> > @@ -224,14 +231,35 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
> > func(info);
> > local_irq_restore(flags);
> > } else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
> > - struct call_single_data *data = NULL;
> > + struct call_single_data *data;
> >
> > if (!wait) {
> > - data = kmalloc(sizeof(*data), GFP_ATOMIC);
> > - if (data)
> > - data->flags = CSD_FLAG_ALLOC;
> > - }

I would advise against removing this, it would destroy a lot of the
properties of smp_call_function_single() that it was designed to have.

That kmalloc allows you to send multiple requests and batch them.
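
(For illustration -- a hypothetical caller, not from this thread: with a
kmalloc()ed csd per request, several async IPIs can be in flight at once,
each csd living until the callee kfree()s it; a single shared csd would
make the second call spin until the first completes. Here process_req,
reqs, nr_reqs and cpu are all assumed names.)

	int i;

	/* hypothetical: fire-and-forget a batch of async requests */
	for (i = 0; i < nr_reqs; i++)
		smp_call_function_single(cpu, process_req, &reqs[i], 0);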

> > - if (!data) {
> > + data = &per_cpu(csd_data, cpu);
> > + /*
> > + * We are calling a function on a single CPU
> > + * and we are not going to wait for it to finish.
> > + * We use a per cpu data to pass the information
> > + * to that CPU, but since all callers of this
> > + * code will use the same data, we must
> > + * synchronize the callers to prevent a new caller
> > + * from corrupting the data before the callee
> > + * can access it.
> > + *
> > + * The CSD_FLAG_LOCK is used to let us know when
> > + * the IPI handler is done with the data.
> > + * The first caller will set it, and the callee
> > + * will clear it. The next caller must wait for
> > + * it to clear before we set it again. This
> > + * will make sure the callee is done with the
> > + * data before a new caller will use it.
> > + * We use spinlocks to manage the callers.
> > + */
> > + spin_lock(&csd_data_lock);
> > + while (data->flags & CSD_FLAG_LOCK)
> > + cpu_relax();
> > + data->flags = CSD_FLAG_LOCK;
> > + spin_unlock(&csd_data_lock);
> > + } else {
> > data = &d;
> > data->flags = CSD_FLAG_WAIT;
> > }
>
> Well that looks nice.
>
> Can we make the spinlock a per-cpu thing as well? Or is that
> over-optimising? We'd need to initialise all those spinlocks at
> runtime.

I think we should, it's easy enough, and

static DEFINE_PER_CPU(spinlock_t, csd_lock) =
__SPIN_LOCK_UNLOCKED(csd_lock);

might just work.
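
If the static initializer didn't fly, the runtime variant Andrew mentions
would look something like this (a sketch with a hypothetical init hook,
not posted in the thread):

	static DEFINE_PER_CPU(spinlock_t, csd_lock);

	static int __init csd_lock_init(void)
	{
		int cpu;

		/* initialise every possible CPU's lock at boot */
		for_each_possible_cpu(cpu)
			spin_lock_init(&per_cpu(csd_lock, cpu));
		return 0;
	}
	early_initcall(csd_lock_init);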

> In generic_smp_call_function_single_interrupt(), did you consider
> releasing the "lock" _before_ calling the callback function? That
> would reduce latencies a bit, allow more concurrency. Maybe that's
> over-optimising too.

You'd have to copy the func and info fields to do that, but yeah, that'd
work.

> Can generic_smp_call_function_single_interrupt() ever see
> CSD_FLAG_ALLOC set now? If not, that kfree can go away.

Like said above, removing that kmalloc will hurt people.

2009-01-29 11:14:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] use per cpu data for single cpu ipi calls


* Peter Zijlstra <[email protected]> wrote:

> > Can generic_smp_call_function_single_interrupt() ever see
> > CSD_FLAG_ALLOC set now? If not, that kfree can go away.
>
> Like said above, removing that kmalloc will hurt people.

the whole promise of generic-IPI seems to dwindle and we'll get back to
roughly where we started out from.

Ingo

2009-01-29 11:42:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] use per cpu data for single cpu ipi calls

On Thu, 2009-01-29 at 12:13 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > > Can generic_smp_call_function_single_interrupt() ever see
> > > CSD_FLAG_ALLOC set now? If not, that kfree can go away.
> >
> > Like said above, removing that kmalloc will hurt people.
>
> the whole promise of generic-IPI seems to dwindle and we'll get back to
> roughly where we started out from.

I'd not go that far, the per-cpu csd is a nice fallback, but removing
that kmalloc is just silly.

And I see no reason why arch specific could do any better (aside from
sparc which has fancy ipis).

2009-01-29 13:43:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] use per cpu data for single cpu ipi calls


* Peter Zijlstra <[email protected]> wrote:

> On Thu, 2009-01-29 at 12:13 +0100, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > > Can generic_smp_call_function_single_interrupt() ever see
> > > > CSD_FLAG_ALLOC set now? If not, that kfree can go away.
> > >
> > > Like said above, removing that kmalloc will hurt people.
> >
> > the whole promise of generic-IPI seems to dwindle and we'll get back to
> > roughly where we started out from.
>
> I'd not go that far, the per-cpu csd is a nice fallback, but removing
> that kmalloc is just silly.
>
> And I see no reason why arch specific could do any better (aside from
> sparc which has fancy ipis).

well, if the hw can embed at least 64 bits of payload in the IPI itself,
we could get rid of a lot of complications. But since we cannot (on x86
and most other architectures), we've got to demultiplex from generic data,
which brings up serialization issues and adds overhead.
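
(Purely hypothetical, to make that concrete: a handler for an IPI that
carried a pointer-sized payload would need no shared queue, no flags and
no locking at all:)

	/* hypothetical handler for an IPI carrying a 64-bit payload */
	void ipi_payload_handler(u64 payload)
	{
		struct call_single_data *csd =
			(struct call_single_data *)(unsigned long)payload;

		csd->func(csd->info);
	}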

Ingo

2009-01-29 14:07:28

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] use per cpu data for single cpu ipi calls


On Thu, 29 Jan 2009, Peter Zijlstra wrote:
> On Thu, 2009-01-29 at 12:13 +0100, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > > Can generic_smp_call_function_single_interrupt() ever see
> > > > CSD_FLAG_ALLOC set now? If not, that kfree can go away.
> > >
> > > Like said above, removing that kmalloc will hurt people.
> >
> > the whole promise of generic-IPI seems to dwindle and we'll get back to
> > roughly where we started out from.
>
> I'd not go that far, the per-cpu csd is a nice fallback, but removing
> that kmalloc is just silly.

I'll add the kmalloc back. But the rest should be fine.

-- Steve

2009-01-29 15:08:18

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH -v2] use per cpu data for single cpu ipi calls


The smp_call_function can be passed a wait parameter telling it to
wait for all the functions running on other CPUs to complete before
returning, or to return without waiting. Unfortunately, this is
currently just a suggestion and not mandatory. That is, the
smp_call_function can decide not to return and wait instead.

The reason for this is because it uses kmalloc to allocate storage
to send to the called CPU and that CPU will free it when it is done.
But if we fail to allocate the storage, the stack is used instead.
This means we must wait for the called CPU to finish before
continuing.

Unfortunately, some callers do not abide by this hint and act as if
the non-wait option is mandatory. The MTRR code for instance will
deadlock if the smp_call_function is set to wait. This is because
the smp_call_function will wait for the other CPUs to finish their
called functions, but those functions are waiting on the caller to
continue.

This patch changes the generic smp_call_function code to use per cpu
variables if the allocation of the data fails for a single CPU call. The
smp_call_function_many will fall back to the smp_call_function_single
if it fails its alloc. The smp_call_function_single is modified
to not force the wait state.

Since we now are using a single data per cpu we must synchronize the
callers to prevent a second caller modifying the data before the
first called IPI functions complete. To do so, I added a flag to
the call_single_data called CSD_FLAG_LOCK. When the single CPU is
called (which can be called when a many call fails an alloc), we
set the LOCK bit on this per cpu data. When the caller finishes
it clears the LOCK bit.

The caller must wait till the LOCK bit is cleared before setting
it. When it is cleared, there is no IPI function using it.
A spinlock is used to synchronize the setting of the bit between
callers. Since only one callee can be called at a time, and it
is the only thing to clear it, the IPI does not need to use
any locking.

[
changes for v2:

-- kept kmalloc and only use per cpu if kmalloc fails.
(Requested by Peter Zijlstra)

-- added per cpu spinlocks
(Requested by Andrew Morton and Peter Zijlstra)
]

Signed-off-by: Steven Rostedt <[email protected]>
---
diff --git a/kernel/smp.c b/kernel/smp.c
index 5cfa0e5..9bce851 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -18,6 +18,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
enum {
CSD_FLAG_WAIT = 0x01,
CSD_FLAG_ALLOC = 0x02,
+ CSD_FLAG_LOCK = 0x04,
};

struct call_function_data {
@@ -186,6 +187,9 @@ void generic_smp_call_function_single_interrupt(void)
if (data_flags & CSD_FLAG_WAIT) {
smp_wmb();
data->flags &= ~CSD_FLAG_WAIT;
+ } else if (data_flags & CSD_FLAG_LOCK) {
+ smp_wmb();
+ data->flags &= ~CSD_FLAG_LOCK;
} else if (data_flags & CSD_FLAG_ALLOC)
kfree(data);
}
@@ -196,6 +200,10 @@ void generic_smp_call_function_single_interrupt(void)
}
}

+static DEFINE_PER_CPU(struct call_single_data, csd_data);
+static DEFINE_PER_CPU(spinlock_t, csd_data_lock) =
+ __SPIN_LOCK_UNLOCKED(csd_lock);
+
/*
* smp_call_function_single - Run a function on a specific CPU
* @func: The function to run. This must be fast and non-blocking.
@@ -224,14 +232,41 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
func(info);
local_irq_restore(flags);
} else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
- struct call_single_data *data = NULL;
+ struct call_single_data *data;

if (!wait) {
+ /*
+ * We are calling a function on a single CPU
+ * and we are not going to wait for it to finish.
+ * We first try to allocate the data, but if we
+ * fail, we fall back to use a per cpu data to pass
+ * the information to that CPU. Since all callers
+ * of this code will use the same data, we must
+ * synchronize the callers to prevent a new caller
+ * from corrupting the data before the callee
+ * can access it.
+ *
+ * The CSD_FLAG_LOCK is used to let us know when
+ * the IPI handler is done with the data.
+ * The first caller will set it, and the callee
+ * will clear it. The next caller must wait for
+ * it to clear before we set it again. This
+ * will make sure the callee is done with the
+ * data before a new caller will use it.
+ * We use spinlocks to manage the callers.
+ */
data = kmalloc(sizeof(*data), GFP_ATOMIC);
if (data)
data->flags = CSD_FLAG_ALLOC;
- }
- if (!data) {
+ else {
+ data = &per_cpu(csd_data, cpu);
+ spin_lock(&per_cpu(csd_data_lock, cpu));
+ while (data->flags & CSD_FLAG_LOCK)
+ cpu_relax();
+ data->flags = CSD_FLAG_LOCK;
+ spin_unlock(&per_cpu(csd_data_lock, cpu));
+ }
+ } else {
data = &d;
data->flags = CSD_FLAG_WAIT;
}

2009-01-29 15:34:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls

On Thu, 2009-01-29 at 10:08 -0500, Steven Rostedt wrote:
> The smp_call_function can be passed a wait parameter telling it to
> wait for all the functions running on other CPUs to complete before
> returning, or to return without waiting. Unfortunately, this is
> currently just a suggestion and not mandatory. That is, the
> smp_call_function can decide not to return and wait instead.
>
> The reason for this is because it uses kmalloc to allocate storage
> to send to the called CPU and that CPU will free it when it is done.
> But if we fail to allocate the storage, the stack is used instead.
> This means we must wait for the called CPU to finish before
> continuing.
>
> Unfortunately, some callers do not abide by this hint and act as if
> the non-wait option is mandatory. The MTRR code for instance will
> deadlock if the smp_call_function is set to wait. This is because
> the smp_call_function will wait for the other CPUs to finish their
> called functions, but those functions are waiting on the caller to
> continue.
>
> This patch changes the generic smp_call_function code to use per cpu
> variables if the allocation of the data fails for a single CPU call. The
> smp_call_function_many will fall back to the smp_call_function_single
> if it fails its alloc. The smp_call_function_single is modified
> to not force the wait state.
>
> Since we now are using a single data per cpu we must synchronize the
> callers to prevent a second caller modifying the data before the
> first called IPI functions complete. To do so, I added a flag to
> the call_single_data called CSD_FLAG_LOCK. When the single CPU is
> called (which can be called when a many call fails an alloc), we
> set the LOCK bit on this per cpu data. When the caller finishes
> it clears the LOCK bit.
>
> The caller must wait till the LOCK bit is cleared before setting
> it. When it is cleared, there is no IPI function using it.
> A spinlock is used to synchronize the setting of the bit between
> callers. Since only one callee can be called at a time, and it
> is the only thing to clear it, the IPI does not need to use
> any locking.
>
> [
> changes for v2:
>
> -- kept kmalloc and only use per cpu if kmalloc fails.
> (Requested by Peter Zijlstra)
>
> -- added per cpu spinlocks
> (Requested by Andrew Morton and Peter Zijlstra)
> ]
>
> Signed-off-by: Steven Rostedt <[email protected]>

Looks nice, thanks!

Acked-by: Peter Zijlstra <[email protected]>

> ---
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 5cfa0e5..9bce851 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -18,6 +18,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
> enum {
> CSD_FLAG_WAIT = 0x01,
> CSD_FLAG_ALLOC = 0x02,
> + CSD_FLAG_LOCK = 0x04,
> };
>
> struct call_function_data {
> @@ -186,6 +187,9 @@ void generic_smp_call_function_single_interrupt(void)
> if (data_flags & CSD_FLAG_WAIT) {
> smp_wmb();
> data->flags &= ~CSD_FLAG_WAIT;
> + } else if (data_flags & CSD_FLAG_LOCK) {
> + smp_wmb();
> + data->flags &= ~CSD_FLAG_LOCK;
> } else if (data_flags & CSD_FLAG_ALLOC)
> kfree(data);
> }
> @@ -196,6 +200,10 @@ void generic_smp_call_function_single_interrupt(void)
> }
> }
>
> +static DEFINE_PER_CPU(struct call_single_data, csd_data);
> +static DEFINE_PER_CPU(spinlock_t, csd_data_lock) =
> + __SPIN_LOCK_UNLOCKED(csd_lock);
> +
> /*
> * smp_call_function_single - Run a function on a specific CPU
> * @func: The function to run. This must be fast and non-blocking.
> @@ -224,14 +232,41 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
> func(info);
> local_irq_restore(flags);
> } else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
> - struct call_single_data *data = NULL;
> + struct call_single_data *data;
>
> if (!wait) {
> + /*
> + * We are calling a function on a single CPU
> + * and we are not going to wait for it to finish.
> + * We first try to allocate the data, but if we
> + * fail, we fall back to use a per cpu data to pass
> + * the information to that CPU. Since all callers
> + * of this code will use the same data, we must
> + * synchronize the callers to prevent a new caller
> + * from corrupting the data before the callee
> + * can access it.
> + *
> + * The CSD_FLAG_LOCK is used to let us know when
> + * the IPI handler is done with the data.
> + * The first caller will set it, and the callee
> + * will clear it. The next caller must wait for
> + * it to clear before we set it again. This
> + * will make sure the callee is done with the
> + * data before a new caller will use it.
> + * We use spinlocks to manage the callers.
> + */
> data = kmalloc(sizeof(*data), GFP_ATOMIC);
> if (data)
> data->flags = CSD_FLAG_ALLOC;
> - }
> - if (!data) {
> + else {
> + data = &per_cpu(csd_data, cpu);
> + spin_lock(&per_cpu(csd_data_lock, cpu));
> + while (data->flags & CSD_FLAG_LOCK)
> + cpu_relax();
> + data->flags = CSD_FLAG_LOCK;
> + spin_unlock(&per_cpu(csd_data_lock, cpu));
> + }
> + } else {
> data = &d;
> data->flags = CSD_FLAG_WAIT;
> }
>

2009-01-29 16:17:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls


* Peter Zijlstra <[email protected]> wrote:

> On Thu, 2009-01-29 at 10:08 -0500, Steven Rostedt wrote:
> > The smp_call_function can be passed a wait parameter telling it to
> > wait for all the functions running on other CPUs to complete before
> > returning, or to return without waiting. Unfortunately, this is
> > currently just a suggestion and not mandatory. That is, the
> > smp_call_function can decide not to return and wait instead.
> >
> > The reason for this is because it uses kmalloc to allocate storage
> > to send to the called CPU and that CPU will free it when it is done.
> > But if we fail to allocate the storage, the stack is used instead.
> > This means we must wait for the called CPU to finish before
> > continuing.
> >
> > Unfortunately, some callers do not abide by this hint and act as if
> > the non-wait option is mandatory. The MTRR code for instance will
> > deadlock if the smp_call_function is set to wait. This is because
> > the smp_call_function will wait for the other CPUs to finish their
> > called functions, but those functions are waiting on the caller to
> > continue.
> >
> > This patch changes the generic smp_call_function code to use per cpu
> > variables if the allocation of the data fails for a single CPU call. The
> > smp_call_function_many will fall back to the smp_call_function_single
> > if it fails its alloc. The smp_call_function_single is modified
> > to not force the wait state.
> >
> > Since we now are using a single data per cpu we must synchronize the
> > callers to prevent a second caller modifying the data before the
> > first called IPI functions complete. To do so, I added a flag to
> > the call_single_data called CSD_FLAG_LOCK. When the single CPU is
> > called (which can be called when a many call fails an alloc), we
> > set the LOCK bit on this per cpu data. When the caller finishes
> > it clears the LOCK bit.
> >
> > The caller must wait till the LOCK bit is cleared before setting
> > it. When it is cleared, there is no IPI function using it.
> > A spinlock is used to synchronize the setting of the bit between
> > callers. Since only one callee can be called at a time, and it
> > is the only thing to clear it, the IPI does not need to use
> > any locking.
> >
> > [
> > changes for v2:
> >
> > -- kept kmalloc and only use per cpu if kmalloc fails.
> > (Requested by Peter Zijlstra)
> >
> > -- added per cpu spinlocks
> > (Requested by Andrew Morton and Peter Zijlstra)
> > ]
> >
> > Signed-off-by: Steven Rostedt <[email protected]>
>
> Looks nice, thanks!
>
> Acked-by: Peter Zijlstra <[email protected]>

started testing it in tip/core/urgent, thanks guys!

Ingo

2009-01-29 17:23:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls



On Thu, 29 Jan 2009, Steven Rostedt wrote:
>
> The caller must wait till the LOCK bit is cleared before setting
> it. When it is cleared, there is no IPI function using it.
> A spinlock is used to synchronize the setting of the bit between
> callers. Since only one callee can be called at a time, and it
> is the only thing to clear it, the IPI does not need to use
> any locking.

That spinlock cannot be right. It is provably wrong for so many reasons..

Think about it. We're talking about a per-CPU lock, which already makes no
sense: we're only locking against our own CPU, and we've already disabled
preemption for totally unrelated reasons.

And the only way locking can make sense against our own CPU is if we lock
against interrupts - but the lock isn't actually irq-safe, so if you are
trying to lock against interrupts, you are (a) doing it wrong (you should
disable interrupts, not use a spinlock) and (b) causing a deadlock if it
ever happens.

In other words: no way in hell does that make any sense what-so-ever.

Just remove it. As it is, it can only hurt. There's no possible point to
it.

Now, if people actually do send IPI's from interrupts (and it does happen
- cross-cpu wakeups etc), then that implies that you do want to then make
the whole thing irq-safe. But the *spinlock* is pointless.

But to be irq-safe, you now need to protect the _whole_ region against
interrupts, because otherwise an incoming interrupt will hit while the
CSD_FLAG_LOCK bit is set, perhaps _before_ the actual IPI has been sent,
and now nothing will ever clear it. So the interrupt handler that tries to
send an IPI will now spin forever.

So NAK on this patch. I think the approach is correct, but the
implementation is buggy.

Linus

2009-01-29 17:44:47

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls


On Thu, 29 Jan 2009, Linus Torvalds wrote:

>
>
> On Thu, 29 Jan 2009, Steven Rostedt wrote:
> >
> > The caller must wait till the LOCK bit is cleared before setting
> > it. When it is cleared, there is no IPI function using it.
> > A spinlock is used to synchronize the setting of the bit between
> > callers. Since only one callee can be called at a time, and it
> > is the only thing to clear it, the IPI does not need to use
> > any locking.
>
> That spinlock cannot be right. It is provably wrong for so many reasons..
>
> Think about it. We're talking about a per-CPU lock, which already makes no
> sense: we're only locking against our own CPU, and we've already disabled
> preemption for totally unrelated reasons.

Actually, we are locking against the destination CPU. Or perhaps I'm
misunderstanding you. The lock is protecting the destination data. The
'cpu' variable is the CPU we call to, not the current CPU.

>
> And the only way locking can make sense against our own CPU is if we lock
> against interrupts - but the lock isn't actually irq-safe, so if you are
> trying to lock against interrupts, you are (a) doing it wrong (you should
> disable interrupts, not use a spinlock) and (b) causing a deadlock if it
> ever happens.

I did not think an interrupt handler was allowed to do a
smp_call_function. If it is, then you are correct and this is very buggy.

Both smp_call_function and smp_call_function_single have a:

WARN_ON(irqs_disabled());

and comments saying that interrupts must be enabled and that these can
not be called from interrupt or softirq handlers.

The spin lock is to protect this scenario:

task on CPU 0 sends IPI to CPU 2
task on CPU 1 sends IPI to CPU 2

The first task to get past that spin lock (which is for CPU 2) will send
the IPI. The second task will have to keep spinning until the LOCK bit is
cleared.

In other words, that spin lock is just a fancy:

again:
flags = per_cpu(csd_data, cpu).flags & ~CSD_FLAG_LOCK;
if (cmpxchg(&per_cpu(csd_data, cpu).flags, flags,
flags | CSD_FLAG_LOCK) != flags) {
cpu_relax();
goto again;
}


>
> In other words: no way in hell does that make any sense what-so-ever.
>
> Just remove it. As it is, it can only hurt. There's no possible point to
> it.
>
> Now, if people actually do send IPI's from interrupts (and it does happen
> - cross-cpu wakeups etc), then that implies that you do want to then make
> the whole thing irq-safe. But the *spinlock* is pointless.

But these IPIs would not use the smp_call_function routines.

>
> But to be irq-safe, you now need to protect the _whole_ region against
> interrupts, because otherwise an incoming interrupt will hit while the
> CSD_FLAG_LOCK bit is set, perhaps _before_ the actual IPI has been sent,
> and now nothing will ever clear it. So the interrupt handler that tries to
> send an IPI will now spin forever.

Again, this code is not allowed to run from an interrupt handler.

>
> So NAK on this patch. I think the approach is correct, but the
> implementation is buggy.

-- Steve

2009-01-29 17:47:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls

On Thu, 2009-01-29 at 09:21 -0800, Linus Torvalds wrote:
> On Thu, 29 Jan 2009, Steven Rostedt wrote:
> >
> > The caller must wait till the LOCK bit is cleared before setting
> > it. When it is cleared, there is no IPI function using it.
> > A spinlock is used to synchronize the setting of the bit between
> > callers. Since only one callee can be called at a time, and it
> > is the only thing to clear it, the IPI does not need to use
> > any locking.
>
> That spinlock cannot be right. It is provably wrong for so many reasons..
>
> Think about it. We're talking about a per-CPU lock, which already makes no
> sense: we're only locking against our own CPU, and we've already disabled
> preemption for totally unrelated reasons.
>
> And the only way locking can make sense against our own CPU is if we lock
> against interrupts - but the lock isn't actually irq-safe, so if you are
> trying to lock against interrupts, you are (a) doing it wrong (you should
> disable interrupts, not use a spinlock) and (b) causing a deadlock if it
> ever happens.


> + else {
> + data = &per_cpu(csd_data, cpu);
> + spin_lock(&per_cpu(csd_data_lock, cpu));
> + while (data->flags & CSD_FLAG_LOCK)
> + cpu_relax();
> + data->flags = CSD_FLAG_LOCK;
> + spin_unlock(&per_cpu(csd_data_lock, cpu));
> + }

I think your argument would hold if he did:

data = &__get_cpu_var(csd_data);

But now he's actually grabbing the remote cpu's csd, and thus needs
atomicity around that remote csd -- which two cpus could contend for.

2009-01-29 17:51:03

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls


On Thu, 29 Jan 2009, Steven Rostedt wrote:
>
> >
> > But to be irq-safe, you now need to protect the _whole_ region against
> > interrupts, because otherwise an incoming interrupt will hit while the
> > CSD_FLAG_LOCK bit is set, perhaps _before_ the actual IPI has been sent,
> > and now nothing will ever clear it. So the interrupt handler that tries to
> > send an IPI will now spin forever.
>
> Again, this code is not allowed to run from an interrupt handler.
>
> >
> > So NAK on this patch. I think the approach is correct, but the
> > implementation is buggy.

My whole assumption here is that smp_call_function_single cannot be
called from an interrupt handler. If it can, then perhaps I should just
change the spinlock to the cmpxchg code, and add a:

WARN_ON(cpu == smp_processor_id());

-- Steve

2009-01-29 17:56:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls

On Thu, 2009-01-29 at 18:47 +0100, Peter Zijlstra wrote:
> On Thu, 2009-01-29 at 09:21 -0800, Linus Torvalds wrote:
> > On Thu, 29 Jan 2009, Steven Rostedt wrote:
> > >
> > > The caller must wait till the LOCK bit is cleared before setting
> > > it. When it is cleared, there is no IPI function using it.
> > > A spinlock is used to synchronize the setting of the bit between
> > > callers. Since only one callee can be called at a time, and it
> > > is the only thing to clear it, the IPI does not need to use
> > > any locking.
> >
> > That spinlock cannot be right. It is provably wrong for so many reasons..
> >
> > Think about it. We're talking about a per-CPU lock, which already makes no
> > sense: we're only locking against our own CPU, and we've already disabled
> > preemption for totally unrelated reasons.
> >
> > And the only way locking can make sense against our own CPU is if we lock
> > against interrupts - but the lock isn't actually irq-safe, so if you are
> > trying to lock against interrupts, you are (a) doing it wrong (you should
> > disable interrupts, not use a spinlock) and (b) causing a deadlock if it
> > ever happens.
>
>
> > + else {
> > + data = &per_cpu(csd_data, cpu);
> > + spin_lock(&per_cpu(csd_data_lock, cpu));
> > + while (data->flags & CSD_FLAG_LOCK)
> > + cpu_relax();
> > + data->flags = CSD_FLAG_LOCK;
> > + spin_unlock(&per_cpu(csd_data_lock, cpu));
> > + }
>
> I think your argument would hold if he did:
>
> data = &__get_cpu_var(csd_data);
>
> But now he's actually grabbing the remote cpu's csd, and thus needs
> atomicity around that remote csd -- which two cpus could contend for.

So the below should do

---
kernel/smp.c | 6 +-----
1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 9bce851..9eead6c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -201,8 +201,6 @@ void generic_smp_call_function_single_interrupt(void)
}

static DEFINE_PER_CPU(struct call_single_data, csd_data);
-static DEFINE_PER_CPU(spinlock_t, csd_data_lock) =
- __SPIN_LOCK_UNLOCKED(csd_lock);

/*
* smp_call_function_single - Run a function on a specific CPU
@@ -259,12 +257,10 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
if (data)
data->flags = CSD_FLAG_ALLOC;
else {
- data = &per_cpu(csd_data, cpu);
- spin_lock(&per_cpu(csd_data_lock, cpu));
+ data = &per_cpu(csd_data, me);
while (data->flags & CSD_FLAG_LOCK)
cpu_relax();
data->flags = CSD_FLAG_LOCK;
- spin_unlock(&per_cpu(csd_data_lock, cpu));
}
} else {
data = &d;

2009-01-29 18:08:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls


On Thu, 29 Jan 2009, Peter Zijlstra wrote:
> >
> >
> > > + else {
> > > + data = &per_cpu(csd_data, cpu);
> > > + spin_lock(&per_cpu(csd_data_lock, cpu));
> > > + while (data->flags & CSD_FLAG_LOCK)
> > > + cpu_relax();
> > > + data->flags = CSD_FLAG_LOCK;
> > > + spin_unlock(&per_cpu(csd_data_lock, cpu));
> > > + }
> >
> > I think your argument would hold if he did:
> >
> > data = &__get_cpu_var(csd_data);
> >
> > But now he's actually grabbing the remote cpu's csd, and thus needs
> > atomicity around that remote csd -- which two cpus could contend for.
>
> So the below should do
>
> ---
> kernel/smp.c | 6 +-----
> 1 files changed, 1 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 9bce851..9eead6c 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -201,8 +201,6 @@ void generic_smp_call_function_single_interrupt(void)
> }
>
> static DEFINE_PER_CPU(struct call_single_data, csd_data);
> -static DEFINE_PER_CPU(spinlock_t, csd_data_lock) =
> - __SPIN_LOCK_UNLOCKED(csd_lock);
>
> /*
> * smp_call_function_single - Run a function on a specific CPU
> @@ -259,12 +257,10 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
> if (data)
> data->flags = CSD_FLAG_ALLOC;
> else {
> - data = &per_cpu(csd_data, cpu);
> - spin_lock(&per_cpu(csd_data_lock, cpu));
> + data = &per_cpu(csd_data, me);
> while (data->flags & CSD_FLAG_LOCK)
> cpu_relax();
> data->flags = CSD_FLAG_LOCK;
> - spin_unlock(&per_cpu(csd_data_lock, cpu));
> }
> } else {
> data = &d;

Ah, OK.

So if we just use our own CPU data, we can throw away the spinlocks and
just do the test ourselves.

That's even better.

Ingo, can you apply Peter's patch on top of mine?

-- Steve

2009-01-29 18:09:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls



On Thu, 29 Jan 2009, Steven Rostedt wrote:
>
> Actually, we are locking against the destination CPU.

Oh.

THAT'S JUST INCOMPETENT.

What the *fuck* is the point of having per-CPU data, and then using it for
the wrong CPU?

Stop doing that idiocy. Put the per-cpu data on the senders side, and stop
the idiocy. You are going to get cross-CPU cacheline bouncing anyway,
there's no way to avoid it, but as long as you do it on the wrong CPU's
local data, you're missing the whole POINT of having per-cpu data in the
first place.

But yeah, that explains the locking. One stupid design mistake leads to
another.

And are you really sure it cannot be called from within interrupts? I'm
finding a lot of callers of smp_call_function_single(), and while I
couldn't find any that look like interrupts, I also couldn't find any
indication that it never happens.


Linus

2009-01-29 18:12:07

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls


On Thu, 29 Jan 2009, Linus Torvalds wrote:

>
>
> On Thu, 29 Jan 2009, Steven Rostedt wrote:
> >
> > Actually, we are locking against the destination CPU.
>
> Oh.
>
> THAT'S JUST INCOMPETENT.

Or lack of sleep ;-)

>
> What the *fuck* is the point of having per-CPU data, and then using it for
> the wrong CPU?
>
> Stop doing that idiocy. Put the per-cpu data on the senders side, and stop
> the idiocy. You are going to get cross-CPU cacheline bouncing anyway,
> there's no way to avoid it, but as long as you do it on the wrong CPU's
> local data, you're missing the whole POINT of having per-cpu data in the
> first place.
>
> But yeah, that explains the locking. One stupid design mistake leads to
> another.
>
> And are you really sure it cannot be called from within interrupts? I'm
> finding a lot of callers of smp_call_function_single(), and while I
> couldn't find any that look like interrupts, I also couldn't find any
> indication that it never happens.

No, the solution that Peter gave on top of mine looks like something we
can all be happy with.

-- Steve

2009-01-29 18:23:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls

On Thu, 2009-01-29 at 10:08 -0800, Linus Torvalds wrote:

> And are you really sure it cannot be called from within interrupts? I'm
> finding a lot of callers of smp_call_function_single(), and while I
> couldn't find any that look like interrupts, I also couldn't find any
> indication that it never happens.

Calls with wait=1 are deadlockable when done from within irqs or irq
disabled sections because two cpus could cross ipi each other and stay
waiting.

That leaves us with the !wait case, which wasn't safe because that
kmalloc could fail, and the fallback was wait, which yields the same
deadlock.
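
(Spelled out with hypothetical callers:

	CPU0, irqs off:	smp_call_function_single(1, f, NULL, 1);
	CPU1, irqs off:	smp_call_function_single(0, g, NULL, 1);

each CPU spins waiting for its csd to be processed, but neither can take
the other's IPI because its own interrupts are off.)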

That leaves us with 6 cases,

root@laptop:/usr/src/linux-2.6# git grep "smp_call_function_single(.*0)"
arch/blackfin/kernel/kgdb.c: smp_call_function_single(cpu, kgdb_passive_cpu_callback, NULL, 0);
arch/ia64/kernel/smpboot.c: if (smp_call_function_single(master, sync_master, NULL, 0) < 0) {
arch/ia64/kvm/kvm-ia64.c: smp_call_function_single(ipi_pcpu, vcpu_kick_intr, vcpu, 0);
arch/x86/kvm/x86.c: smp_call_function_single(ipi_pcpu, vcpu_kick_intr, vcpu, 0);
arch/x86/oprofile/nmi_int.c: smp_call_function_single(cpu, nmi_cpu_start, NULL, 0);
arch/x86/pci/amd_bus.c: smp_call_function_single(cpu, enable_pci_io_ecs, NULL, 0);

The kgdb one looks to be buggy, as it calls smp_call_function_single()
from under irqs disabled.

Such users should use __smp_call_function_single() with a pre-allocated
csd.
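
Roughly like this (a sketch with a static csd, assuming the kgdb callback
above; the csd must not be reused until the callee has run):

	static struct call_single_data kgdb_csd = {
		.func	= kgdb_passive_cpu_callback,
		.info	= NULL,
	};

	/* no allocation involved, and no wait fallback */
	__smp_call_function_single(cpu, &kgdb_csd);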

Didn't check the others.


2009-01-29 18:31:39

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls


On Thu, 29 Jan 2009, Peter Zijlstra wrote:

> On Thu, 2009-01-29 at 10:08 -0800, Linus Torvalds wrote:
>
> > And are you really sure it cannot be called from within interrupts? I'm
> > finding a lot of callers of smp_call_function_single(), and while I
> > couldn't find any that look like interrupts, I also couldn't find any
> > indication that it never happens.
>
> Calls with wait=1 are deadlockable when done from within irqs or irq
> disabled sections because two cpus could cross ipi each other and stay
> waiting.
>
> That leaves us with the !wait case, which wasn't safe because that
> kmalloc could fail, and the fallback was wait, which yields the same
> deadlock.
>
> That leaves us with 6 cases,
>
> root@laptop:/usr/src/linux-2.6# git grep "smp_call_function_single(.*0)"
> arch/blackfin/kernel/kgdb.c: smp_call_function_single(cpu, kgdb_passive_cpu_callback, NULL, 0);
> arch/ia64/kernel/smpboot.c: if (smp_call_function_single(master, sync_master, NULL, 0) < 0) {
> arch/ia64/kvm/kvm-ia64.c: smp_call_function_single(ipi_pcpu, vcpu_kick_intr, vcpu, 0);
> arch/x86/kvm/x86.c: smp_call_function_single(ipi_pcpu, vcpu_kick_intr, vcpu, 0);
> arch/x86/oprofile/nmi_int.c: smp_call_function_single(cpu, nmi_cpu_start, NULL, 0);
> arch/x86/pci/amd_bus.c: smp_call_function_single(cpu, enable_pci_io_ecs, NULL, 0);
>
> The kgdb one looks to be buggy, as it calls smp_call_function_single()
> from under irqs disabled.
>
> Such users should use __smp_call_single_function() with a pre-allocated
> csd.
>
> Didn't check the others.

Regardless of this, I like your change better than my original. We should
do that anyway.

-- Steve

2009-01-29 18:41:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls



On Thu, 29 Jan 2009, Peter Zijlstra wrote:
>
> Calls with wait=1 are deadlockable when done from within irqs or irq
> disabled sections because two cpus could cross ipi each other and stay
> waiting.

Sure. But this isn't the "wait=1" case. In the waiting case, we don't need
to allocate anything at all, we can just use the stack.

So the only case that matters for interrupts is indeed "wait=0", but
that's very much also the case we're talking about.

> That leaves us with the !wait case, which wasn't safe because that
> kmalloc could fail, and the fallback was wait, which yields the same
> deadlock.

Yes. I agree that the _concept_ of the patch is good. No argument there.

I'm just worried about the details.

> That leaves us with 6 cases,
>
> root@laptop:/usr/src/linux-2.6# git grep "smp_call_function_single(.*0)"

This misses a _lot_ of cases. What about this one:

kernel/relay.c: err = smp_call_function_single(i,

where you're not seeing if it's a non-waiting one or not (it's a waiting
one, but my point is that your grep is missing cases).

Doing

git grep -w smp_call_function_single | grep -v '1);'

is probably more likely to find the cases to look for.

But you're right, I'm still not seeing anything that looks _likely_ to be
a problem. But I worry that it's something somebody may want to do.

Linus

2009-01-29 18:45:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls

On Thu, 2009-01-29 at 10:39 -0800, Linus Torvalds wrote:

> But you're right, I'm still not seeing anything that looks _likely_ to be
> a problem. But I worry that it's something somebody may want to do.

Well, the kgdb case I looked at did exactly that, and that is a problem.

As for people wanting to do it, we have __smp_call_function_single() to
do just that, all you have to do is provide your own csd.


2009-01-29 18:50:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls


* Linus Torvalds <[email protected]> wrote:

> And are you really sure it cannot be called from within interrupts? I'm
> finding a lot of callers of smp_call_function_single(), and while I
> couldn't find any that look like interrupts, I also couldn't find any
> indication that it never happens.

smp_call_function_single() must not be called from irqs-off code - and
that implicitly includes most IRQ handlers as well. (most of which run
with hardirqs off)

We check for that via:

/* Can deadlock when called with interrupts disabled */
WARN_ON(irqs_disabled());

[ although it should probably be extended with '|| in_interrupt()' as well,
to cover the subset of IRQ handlers running with irqs enabled - which is
small but nonzero. It would also cover softirqs. ]
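
i.e. something like:

	/* Can deadlock when called with interrupts disabled */
	WARN_ON(irqs_disabled() || in_interrupt());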

There are 3 reasons why sending IPIs from an irqs off critical section is
a bad idea currently:

1) the implementation is deadlockable as you noted. That is self-inflicted
stupidity and fixable.

2) the _concept_ of two CPUs sending IPIs to each other at once and
expecting them to be handled and then waiting for them is a deadlock.
We could allow async IPIs from irqs-off sections, but isn't that too
narrow a special-case?

3) the third reason is nasty: the act of sending an IPI on x86 hardware
needs to happen with interrupts enabled. In the lowlevel IPI sending
code we first poll the hardware to see whether it's ready to accept new
IPIs: via apic_wait_icr_idle() (sketched below).

There've been incidents in the past where the ICR-ready (Interrupt
Command Register - ready) bit would deadlock the box in certain
situations (usually with wackier APIC implementations and if there's
too many IPIs 'in flight' and the only way to make progress is to allow
IPIs to happen locally - i.e. to have irqs on), if the IPI was sent
from an irqs-off section.
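
The poll mentioned in 3) is essentially this (simplified from the x86
apic code of that era):

	void apic_wait_icr_idle(void)
	{
		/* wait until the ICR can accept a new command */
		while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
			cpu_relax();
	}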

So i'm not sure we could relax things here without expecting too much of
the hardware.

What i like least about kernel/smp.c is the kmalloc(GFP_ATOMIC) - and we
had to add that in a late -rc for an essential bugfix.

IMO that kmalloc() has "layering violation" written all over it. It turns
the pure act of 'sending messages' into a conceptually heavy and wide
facility. Is the buffering of async messages really such a big deal in
practice? Does anyone have any numbers?

Btw., there _is_ a relatively atomic, safe and instantaneous way of
sending IPIs on x86: sending NMIs. But that has its own host of issues
both on the hardware and on the software side ...

Ingo

2009-01-30 01:12:19

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls

On Friday 30 January 2009 01:38:01 Steven Rostedt wrote:
>
> The smp_call_function can be passed a wait parameter telling it to
> wait for all the functions running on other CPUs to complete before
> returning, or to return without waiting. Unfortunately, this is
> currently just a suggestion and not mandatory. That is, the
> smp_call_function can decide not to return and wait instead.

I think the file could probably use some comment love and the comments
made more pithy (the correct comment in the "kmalloc failed" path usually
involves cussing), but we can't have everything.

Not that it matters, but since I was the last one to touch this file:
Acked-by: Rusty Russell <[email protected]>

Thanks!
Rusty.

2009-01-30 01:56:10

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls

On Friday 30 January 2009 04:38:23 Linus Torvalds wrote:
>
> On Thu, 29 Jan 2009, Steven Rostedt wrote:
> >
> > Actually, we are locking against the destination CPU.
>
> Oh.
>
> THAT'S JUST INCOMPETENT.
>
> What the *fuck* is the point of having per-CPU data, and then using it for
> the wrong CPU?

Maybe we made it too easy to declare and use per-cpu data?

Could we use an nr_cpu_ids array? Yes, it needs alloc at boot somewhere, but
it avoids collateral damage to other percpu data (and please don't suggest DEFINE_PER_CPU_SHARED_ALIGNED: that's just antisocial).
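
Something like this, as a rough sketch of that idea (names hypothetical):

	static struct call_single_data *csd_array;

	static int __init csd_array_init(void)
	{
		/* one slot per possible CPU, allocated once at boot */
		csd_array = kcalloc(nr_cpu_ids, sizeof(*csd_array),
				    GFP_KERNEL);
		BUG_ON(!csd_array);
		return 0;
	}
	early_initcall(csd_array_init);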

Thanks,
Rusty.

2009-01-30 11:25:19

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH -v2] use per cpu data for single cpu ipi calls

Hi,

A little late into this discussion, I was nastily ill the past few days.

Anyway, v2 with the suggested patches looks fine to me. That is, use
local cpu storage (and thus lose the lock) and retain the kmalloc()
for the !wait case. That should still scale fine, as far as I can tell.
The block layer usage, for which this was originally concocted, doesn't
really care that much, since it always provides its own csd storage and
thus uses the __ variant of the helper.

Peter, can you post a final complete patch for review and acks?

--
Jens Axboe

2009-01-30 12:32:42

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH -v3] use per cpu data for single cpu ipi calls

On Fri, 2009-01-30 at 12:23 +0100, Jens Axboe wrote:

> Peter, can you post a final complete patch for review and acks?

Sure,

---
Author: Steven Rostedt <[email protected]>
Date: Thu Jan 29 10:08:01 2009 -0500

generic-ipi: use per cpu data for single cpu ipi calls

The smp_call_function can be passed a wait parameter telling it to
wait for all the functions running on other CPUs to complete before
returning, or to return without waiting. Unfortunately, this is
currently just a suggestion and not mandatory. That is, the
smp_call_function can decide not to return and wait instead.

The reason for this is because it uses kmalloc to allocate storage
to send to the called CPU and that CPU will free it when it is done.
But if we fail to allocate the storage, the stack is used instead.
This means we must wait for the called CPU to finish before
continuing.

Unfortunately, some callers do not abide by this hint and act as if
the non-wait option is mandatory. The MTRR code for instance will
deadlock if the smp_call_function is set to wait. This is because
the smp_call_function will wait for the other CPUs to finish their
called functions, but those functions are waiting on the caller to
continue.

This patch changes the generic smp_call_function code to use per cpu
variables if the allocation of the data fails for a single CPU call. The
smp_call_function_many will fall back to the smp_call_function_single
if it fails its alloc. The smp_call_function_single is modified
to not force the wait state.

Since we now are using a single data per cpu we must synchronize the
callers to prevent a second caller modifying the data before the
first called IPI functions complete. To do so, I added a flag to
the call_single_data called CSD_FLAG_LOCK. When the single CPU is
called (which can be called when a many call fails an alloc), we
set the LOCK bit on this per cpu data. When the caller finishes
it clears the LOCK bit.

The caller must wait till the LOCK bit is cleared before setting
it. When it is cleared, there is no IPI function using it.

Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/smp.c | 37 ++++++++++++++++++++++++++++++++++---
1 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 5cfa0e5..9eead6c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -18,6 +18,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
enum {
CSD_FLAG_WAIT = 0x01,
CSD_FLAG_ALLOC = 0x02,
+ CSD_FLAG_LOCK = 0x04,
};

struct call_function_data {
@@ -186,6 +187,9 @@ void generic_smp_call_function_single_interrupt(void)
if (data_flags & CSD_FLAG_WAIT) {
smp_wmb();
data->flags &= ~CSD_FLAG_WAIT;
+ } else if (data_flags & CSD_FLAG_LOCK) {
+ smp_wmb();
+ data->flags &= ~CSD_FLAG_LOCK;
} else if (data_flags & CSD_FLAG_ALLOC)
kfree(data);
}
@@ -196,6 +200,8 @@ void generic_smp_call_function_single_interrupt(void)
}
}

+static DEFINE_PER_CPU(struct call_single_data, csd_data);
+
/*
* smp_call_function_single - Run a function on a specific CPU
* @func: The function to run. This must be fast and non-blocking.
@@ -224,14 +230,39 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
func(info);
local_irq_restore(flags);
} else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
- struct call_single_data *data = NULL;
+ struct call_single_data *data;

if (!wait) {
+ /*
+ * We are calling a function on a single CPU
+ * and we are not going to wait for it to finish.
+ * We first try to allocate the data, but if we
+ * fail, we fall back to use a per cpu data to pass
+ * the information to that CPU. Since all callers
+ * of this code will use the same data, we must
+ * synchronize the callers to prevent a new caller
+ * from corrupting the data before the callee
+ * can access it.
+ *
+ * The CSD_FLAG_LOCK is used to let us know when
+ * the IPI handler is done with the data.
+ * The first caller will set it, and the callee
+ * will clear it. The next caller must wait for
+ * it to clear before we set it again. This
+ * will make sure the callee is done with the
+ * data before a new caller will use it.
+ * We use spinlocks to manage the callers.
+ */
data = kmalloc(sizeof(*data), GFP_ATOMIC);
if (data)
data->flags = CSD_FLAG_ALLOC;
- }
- if (!data) {
+ else {
+ data = &per_cpu(csd_data, me);
+ while (data->flags & CSD_FLAG_LOCK)
+ cpu_relax();
+ data->flags = CSD_FLAG_LOCK;
+ }
+ } else {
data = &d;
data->flags = CSD_FLAG_WAIT;
}

2009-01-30 12:40:56

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH -v3] use per cpu data for single cpu ipi calls

On Fri, Jan 30 2009, Peter Zijlstra wrote:
> On Fri, 2009-01-30 at 12:23 +0100, Jens Axboe wrote:
>
> > Peter, can you post a final complete patch for review and acks?
>
> Sure,

Looks good, you can add my acked-by to that. One small comment:

> if (!wait) {
> + /*
> + * We are calling a function on a single CPU
> + * and we are not going to wait for it to finish.
> + * We first try to allocate the data, but if we
> + * fail, we fall back to use a per cpu data to pass
> + * the information to that CPU. Since all callers
> + * of this code will use the same data, we must
> + * synchronize the callers to prevent a new caller
> + * from corrupting the data before the callee
> + * can access it.
> + *
> + * The CSD_FLAG_LOCK is used to let us know when
> + * the IPI handler is done with the data.
> + * The first caller will set it, and the callee
> + * will clear it. The next caller must wait for
> + * it to clear before we set it again. This
> + * will make sure the callee is done with the
> + * data before a new caller will use it.
> + * We use spinlocks to manage the callers.
> + */

That last sentence appears stale now, since there's no locking involved
for the per-cpu csd.

--
Jens Axboe

2009-01-30 12:49:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v3] use per cpu data for single cpu ipi calls

On Fri, 2009-01-30 at 13:38 +0100, Jens Axboe wrote:

> > if (!wait) {
> > + /*
> > + * We are calling a function on a single CPU
> > + * and we are not going to wait for it to finish.
> > + * We first try to allocate the data, but if we
> > + * fail, we fall back to use a per cpu data to pass
> > + * the information to that CPU. Since all callers
> > + * of this code will use the same data, we must
> > + * synchronize the callers to prevent a new caller
> > + * from corrupting the data before the callee
> > + * can access it.
> > + *
> > + * The CSD_FLAG_LOCK is used to let us know when
> > + * the IPI handler is done with the data.
> > + * The first caller will set it, and the callee
> > + * will clear it. The next caller must wait for
> > + * it to clear before we set it again. This
> > + * will make sure the callee is done with the
> > + * data before a new caller will use it.
> > + * We use spinlocks to manage the callers.
> > + */
>
> That last sentence appears stale now, since there's no locking involved
> for the per-cpu csd.

Gah, I edited that, and generated the patch using

# git diff --stat -p 4f4b6c1a94a8735bbdc030a2911cf395495645b6.. kernel/smp.c | xclip

on Ingo's tip/master. However it seems such a diff does not include
uncommitted changes, even though a regular git diff does:

# git diff kernel/smp.c
diff --git a/kernel/smp.c b/kernel/smp.c
index 9eead6c..bbedbb7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -251,7 +251,6 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
* it to clear before we set it again. This
* will make sure the callee is done with the
* data before a new caller will use it.
- * We use spinlocks to manage the callers.
*/
data = kmalloc(sizeof(*data), GFP_ATOMIC);
if (data)

What is the recommended way to do such a diff?

2009-01-30 12:57:38

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH -v3] use per cpu data for single cpu ipi calls

On Fri, Jan 30 2009, Peter Zijlstra wrote:
> On Fri, 2009-01-30 at 13:38 +0100, Jens Axboe wrote:
>
> > > if (!wait) {
> > > + /*
> > > + * We are calling a function on a single CPU
> > > + * and we are not going to wait for it to finish.
> > > + * We first try to allocate the data, but if we
> > > + * fail, we fall back to use a per cpu data to pass
> > > + * the information to that CPU. Since all callers
> > > + * of this code will use the same data, we must
> > > + * synchronize the callers to prevent a new caller
> > > + * from corrupting the data before the callee
> > > + * can access it.
> > > + *
> > > + * The CSD_FLAG_LOCK is used to let us know when
> > > + * the IPI handler is done with the data.
> > > + * The first caller will set it, and the callee
> > > + * will clear it. The next caller must wait for
> > > + * it to clear before we set it again. This
> > > + * will make sure the callee is done with the
> > > + * data before a new caller will use it.
> > > + * We use spinlocks to manage the callers.
> > > + */
> >
> > That last sentence appears stale now, since there's no locking involved
> > for the per-cpu csd.
>
> Gah, I edited that, and generated the patch using
>
> # git diff --stat -p 4f4b6c1a94a8735bbdc030a2911cf395495645b6.. kernel/smp.c | xclip
>
> on Ingo's tip/master. However it seems such a diff does not include
> non-commited changes, even though a regular git diff does:
>
> # git diff kernel/smp.c
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 9eead6c..bbedbb7 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -251,7 +251,6 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
> * it to clear before we set it again. This
> * will make sure the callee is done with the
> * data before a new caller will use it.
> - * We use spinlocks to manage the callers.
> */
> data = kmalloc(sizeof(*data), GFP_ATOMIC);
> if (data)
>
> What is the recommended way to do such a diff?

Drop the .. and it should work. Don't ask me why :-)

--
Jens Axboe

2009-01-30 12:58:31

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH -v3] use per cpu data for single cpu ipi calls

On Fri, Jan 30 2009, Jens Axboe wrote:
> Drop the .. and it should work. Don't ask me why :-)

On second thought, probably because XXX.. will mean from XXX to HEAD,
and HEAD doesn't include the edited part since it isn't committed.
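
So to pick up uncommitted edits, diff the commit against the working tree
directly:

# git diff 4f4b6c1a94a8735bbdc030a2911cf395495645b6 kernel/smp.c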

--
Jens Axboe

2009-01-30 13:00:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v3] use per cpu data for single cpu ipi calls

On Fri, 2009-01-30 at 13:56 +0100, Jens Axboe wrote:
> On Fri, Jan 30 2009, Jens Axboe wrote:
> > Drop the .. and it should work. Don't ask me why :-)
>
> On second thought, probably because XXX.. will mean from XXX to HEAD,
> and HEAD doesn't include the edited part since it isn't committed.

Ah, I guess that makes some sense. Let me do a v4 :-)

2009-01-30 13:03:16

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH -v4] use per cpu data for single cpu ipi calls

Author: Steven Rostedt <[email protected]>
Date: Thu Jan 29 10:08:01 2009 -0500

generic-ipi: use per cpu data for single cpu ipi calls

The smp_call_function can be passed a wait parameter telling it to
wait for all the functions running on other CPUs to complete before
returning, or to return without waiting. Unfortunately, this is
currently just a suggestion and not mandatory. That is, the
smp_call_function can decide not to return and wait instead.

The reason for this is because it uses kmalloc to allocate storage
to send to the called CPU and that CPU will free it when it is done.
But if we fail to allocate the storage, the stack is used instead.
This means we must wait for the called CPU to finish before
continuing.

Unfortunately, some callers do not abide by this hint and act as if
the non-wait option is mandatory. The MTRR code for instance will
deadlock if the smp_call_function is set to wait. This is because
the smp_call_function will wait for the other CPUs to finish their
called functions, but those functions are waiting on the caller to
continue.

This patch changes the generic smp_call_function code to use per cpu
variables if the allocation of the data fails for a single CPU call. The
smp_call_function_many will fall back to the smp_call_function_single
if it fails its alloc. The smp_call_function_single is modified
to not force the wait state.

Since we now are using a single data per cpu we must synchronize the
callers to prevent a second caller modifying the data before the
first called IPI functions complete. To do so, I added a flag to
the call_single_data called CSD_FLAG_LOCK. When the single CPU is
called (which can be called when a many call fails an alloc), we
set the LOCK bit on this per cpu data. When the caller finishes
it clears the LOCK bit.

The caller must wait till the LOCK bit is cleared before setting
it. When it is cleared, there is no IPI function using it.

Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Jens Axboe <[email protected]>
---
kernel/smp.c | 36 +++++++++++++++++++++++++++++++++---
1 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 5cfa0e5..bbedbb7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -18,6 +18,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
 enum {
         CSD_FLAG_WAIT = 0x01,
         CSD_FLAG_ALLOC = 0x02,
+        CSD_FLAG_LOCK = 0x04,
 };
 
 struct call_function_data {
@@ -186,6 +187,9 @@ void generic_smp_call_function_single_interrupt(void)
                 if (data_flags & CSD_FLAG_WAIT) {
                         smp_wmb();
                         data->flags &= ~CSD_FLAG_WAIT;
+                } else if (data_flags & CSD_FLAG_LOCK) {
+                        smp_wmb();
+                        data->flags &= ~CSD_FLAG_LOCK;
                 } else if (data_flags & CSD_FLAG_ALLOC)
                         kfree(data);
         }
@@ -196,6 +200,8 @@ void generic_smp_call_function_single_interrupt(void)
         }
 }
 
+static DEFINE_PER_CPU(struct call_single_data, csd_data);
+
 /*
  * smp_call_function_single - Run a function on a specific CPU
  * @func: The function to run. This must be fast and non-blocking.
@@ -224,14 +230,38 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
                 func(info);
                 local_irq_restore(flags);
         } else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
-                struct call_single_data *data = NULL;
+                struct call_single_data *data;
 
                 if (!wait) {
+                        /*
+                         * We are calling a function on a single CPU
+                         * and we are not going to wait for it to finish.
+                         * We first try to allocate the data, but if we
+                         * fail, we fall back to use a per cpu data to pass
+                         * the information to that CPU. Since all callers
+                         * of this code will use the same data, we must
+                         * synchronize the callers to prevent a new caller
+                         * from corrupting the data before the callee
+                         * can access it.
+                         *
+                         * The CSD_FLAG_LOCK is used to let us know when
+                         * the IPI handler is done with the data.
+                         * The first caller will set it, and the callee
+                         * will clear it. The next caller must wait for
+                         * it to clear before we set it again. This
+                         * will make sure the callee is done with the
+                         * data before a new caller will use it.
+                         */
                         data = kmalloc(sizeof(*data), GFP_ATOMIC);
                         if (data)
                                 data->flags = CSD_FLAG_ALLOC;
-                }
-                if (!data) {
+                        else {
+                                data = &per_cpu(csd_data, me);
+                                while (data->flags & CSD_FLAG_LOCK)
+                                        cpu_relax();
+                                data->flags = CSD_FLAG_LOCK;
+                        }
+                } else {
                         data = &d;
                         data->flags = CSD_FLAG_WAIT;
                 }
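
For illustration, here is a minimal user-space sketch of the
CSD_FLAG_LOCK handshake the patch introduces. It is not kernel code:
C11 atomics and a pthread stand in for smp_wmb(), cpu_relax() and the
real IPI, the LOCK bit doubles as the "work pending" signal (the
kernel kicks the handler with an actual IPI instead), and every name
except CSD_FLAG_LOCK and call_single_data is invented.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define CSD_FLAG_LOCK 0x04

struct call_single_data {
        atomic_uint flags;
        void (*func)(void *);
        void *info;
};

/* stands in for the per cpu csd_data slot */
static struct call_single_data csd;

/* callee side: roughly what the single-call interrupt handler does
 * for the CSD_FLAG_LOCK case */
static void *ipi_handler(void *arg)
{
        struct call_single_data *data = arg;

        /* wait until a caller has published work */
        while (!(atomic_load(&data->flags) & CSD_FLAG_LOCK))
                ;
        data->func(data->info);
        /* release the slot; the next caller may now reuse it */
        atomic_fetch_and(&data->flags, ~CSD_FLAG_LOCK);
        return NULL;
}

/* caller side: the kmalloc-failure fallback path */
static void queue_call(struct call_single_data *data,
                       void (*func)(void *), void *info)
{
        /* spin until the previous user of the slot is done with it */
        while (atomic_load(&data->flags) & CSD_FLAG_LOCK)
                ;
        data->func = func;
        data->info = info;
        /* setting the LOCK bit claims the slot and publishes the work */
        atomic_fetch_or(&data->flags, CSD_FLAG_LOCK);
}

static void hello(void *info)
{
        printf("IPI function ran: %s\n", (char *)info);
}

int main(void)
{
        pthread_t callee;

        pthread_create(&callee, NULL, ipi_handler, &csd);
        queue_call(&csd, hello, "per cpu csd");
        pthread_join(&callee, NULL);
        return 0;
}

Compile with something like cc -pthread csd_sketch.c; the point is
only that a second queue_call() cannot corrupt the slot while the
handler still owns it.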

2009-01-30 14:51:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH -v4] use per cpu data for single cpu ipi calls


* Peter Zijlstra <[email protected]> wrote:

> Author: Steven Rostedt <[email protected]>
> Date: Thu Jan 29 10:08:01 2009 -0500
>
> generic-ipi: use per cpu data for single cpu ipi calls
>
> The smp_call_function can be passed a wait parameter telling it to
> wait for all the functions running on other CPUs to complete before
> returning, or to return without waiting. Unfortunately, this is
> currently just a suggestion and not mandatory. That is,
> smp_call_function can decide to wait even when told not to.
>
> The reason is that it uses kmalloc to allocate storage
> to send to the called CPU and that CPU will free it when it is done.
> But if we fail to allocate the storage, the stack is used instead.
> This means we must wait for the called CPU to finish before
> continuing.
>
> Unfortunately, some callers do not abide by this hint and act as if
> the non-wait option were mandatory. The MTRR code for instance will
> deadlock if the smp_call_function is set to wait. This is because
> the smp_call_function will wait for the other CPUs to finish their
> called functions, but those functions are waiting on the caller to
> continue.
>
> This patch changes the generic smp_call_function code to use per cpu
> variables if the allocation of the data fails for a single CPU call. The
> smp_call_function_many will fall back to the smp_call_function_single
> if it fails its alloc. The smp_call_function_single is modified
> to not force the wait state.
>
> Since we are now using a single data structure per cpu, we must
> synchronize the callers to prevent a second caller from modifying the
> data before the first caller's IPI function completes. To do so, I
> added a flag to the call_single_data called CSD_FLAG_LOCK. When the
> single-CPU path is used (which also happens when a many call fails
> its alloc), we set the LOCK bit on this per cpu data. When the callee
> finishes running the function, it clears the LOCK bit.
>
> A caller must wait until the LOCK bit is cleared before setting it
> again; once it is cleared, no IPI function is still using the data.
>
> Signed-off-by: Steven Rostedt <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Acked-by: Jens Axboe <[email protected]>
> ---
> kernel/smp.c | 36 +++++++++++++++++++++++++++++++++---
> 1 files changed, 33 insertions(+), 3 deletions(-)

ok, i've updated the commit in tip/core/urgent with the commit line
removal and the updated description. Find it below - any objections?

Ingo

--------------------->
From 05257ef396a72daef9ef57e01c8dc0d55351c5c7 Mon Sep 17 00:00:00 2001
From: Steven Rostedt <[email protected]>
Date: Thu, 29 Jan 2009 10:08:01 -0500
Subject: [PATCH] generic-ipi: use per cpu data for single cpu ipi calls

The smp_call_function can be passed a wait parameter telling it to
wait for all the functions running on other CPUs to complete before
returning, or to return without waiting. Unfortunately, this is
currently just a suggestion and not mandatory. That is,
smp_call_function can decide to wait even when told not to.

The reason is that it uses kmalloc to allocate storage
to send to the called CPU and that CPU will free it when it is done.
But if we fail to allocate the storage, the stack is used instead.
This means we must wait for the called CPU to finish before
continuing.

Unfortunately, some callers do not abide by this hint and act as if
the non-wait option were mandatory. The MTRR code for instance will
deadlock if the smp_call_function is set to wait. This is because
the smp_call_function will wait for the other CPUs to finish their
called functions, but those functions are waiting on the caller to
continue.

This patch changes the generic smp_call_function code to use per cpu
variables if the allocation of the data fails for a single CPU call. The
smp_call_function_many will fall back to the smp_call_function_single
if it fails its alloc. The smp_call_function_single is modified
to not force the wait state.

Since we are now using a single data structure per cpu, we must
synchronize the callers to prevent a second caller from modifying the
data before the first caller's IPI function completes. To do so, I
added a flag to the call_single_data called CSD_FLAG_LOCK. When the
single-CPU path is used (which also happens when a many call fails
its alloc), we set the LOCK bit on this per cpu data. When the callee
finishes running the function, it clears the LOCK bit.

A caller must wait until the LOCK bit is cleared before setting it
again; once it is cleared, no IPI function is still using the data.

Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/smp.c | 36 +++++++++++++++++++++++++++++++++---
1 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 5cfa0e5..bbedbb7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -18,6 +18,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
 enum {
         CSD_FLAG_WAIT = 0x01,
         CSD_FLAG_ALLOC = 0x02,
+        CSD_FLAG_LOCK = 0x04,
 };
 
 struct call_function_data {
@@ -186,6 +187,9 @@ void generic_smp_call_function_single_interrupt(void)
                 if (data_flags & CSD_FLAG_WAIT) {
                         smp_wmb();
                         data->flags &= ~CSD_FLAG_WAIT;
+                } else if (data_flags & CSD_FLAG_LOCK) {
+                        smp_wmb();
+                        data->flags &= ~CSD_FLAG_LOCK;
                 } else if (data_flags & CSD_FLAG_ALLOC)
                         kfree(data);
         }
@@ -196,6 +200,8 @@ void generic_smp_call_function_single_interrupt(void)
         }
 }
 
+static DEFINE_PER_CPU(struct call_single_data, csd_data);
+
 /*
  * smp_call_function_single - Run a function on a specific CPU
  * @func: The function to run. This must be fast and non-blocking.
@@ -224,14 +230,38 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
                 func(info);
                 local_irq_restore(flags);
         } else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
-                struct call_single_data *data = NULL;
+                struct call_single_data *data;
 
                 if (!wait) {
+                        /*
+                         * We are calling a function on a single CPU
+                         * and we are not going to wait for it to finish.
+                         * We first try to allocate the data, but if we
+                         * fail, we fall back to use a per cpu data to pass
+                         * the information to that CPU. Since all callers
+                         * of this code will use the same data, we must
+                         * synchronize the callers to prevent a new caller
+                         * from corrupting the data before the callee
+                         * can access it.
+                         *
+                         * The CSD_FLAG_LOCK is used to let us know when
+                         * the IPI handler is done with the data.
+                         * The first caller will set it, and the callee
+                         * will clear it. The next caller must wait for
+                         * it to clear before we set it again. This
+                         * will make sure the callee is done with the
+                         * data before a new caller will use it.
+                         */
                         data = kmalloc(sizeof(*data), GFP_ATOMIC);
                         if (data)
                                 data->flags = CSD_FLAG_ALLOC;
-                }
-                if (!data) {
+                        else {
+                                data = &per_cpu(csd_data, me);
+                                while (data->flags & CSD_FLAG_LOCK)
+                                        cpu_relax();
+                                data->flags = CSD_FLAG_LOCK;
+                        }
+                } else {
                         data = &d;
                         data->flags = CSD_FLAG_WAIT;
                 }

2009-01-30 16:05:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v3] use per cpu data for single cpu ipi calls



On Fri, 30 Jan 2009, Peter Zijlstra wrote:
>
> Signed-off-by: Steven Rostedt <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>

Acked-by: Linus Torvalds <[email protected]>

My only question is whether we might even drop the kmalloc() some day:
I suspect that the CSD_FLAG_LOCK is essentially never a contention point,
and the cost (and occasional synchronization) of kmalloc() quite possibly
overwhelms any theoretical scaling ability.

If another CPU hasn't even received its IPI before the same CPU sends the
next one, I'm not sure we _want_ to send one, in fact.

But that's a secondary issue, and isn't a correctness thing, just a "do we
really need three different allocations?" musing..

Linus

2009-01-30 16:16:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v3] use per cpu data for single cpu ipi calls

On Fri, 2009-01-30 at 08:04 -0800, Linus Torvalds wrote:

> My only question is whetherr we might even drop the kmalloc() some day:
> I suspect that the CSD_FLAG_LOCK is essentially never a contention point,
> and the cost (and occasional synchronization) of kmalloc() quite possibly
> overwhelms any theoretical scaling ability.

IIRC the recent SL*B numbers posted showed that a kmalloc could be as
cheap as ~100 cycles or something. IPIs are sadly still a bit more
expensive.

> If another CPU hasn't even received its IPI before the same CPU sends the
> next one, I'm not sure we _want_ to send one, in fact.

I think the intent was to re-route IO-completion interrupts to whatever
cpu/node issued the IO, with the idea that that cpu/node still has the
page hot in cache, and that transferring the completion is cheaper than
bouncing the page.

Since that would be relaying hardware interrupts, there's not much you
can do about the rate; that's up to the firmware on the $$$ scsi thing.

But Jens already said that that path was using the __ variant and
providing its own csds, so the kmalloc isn't needed there and it might
all be moot.

> But that's a secondary issue, and isn't a correctness thing, just a "do we
> really need three different allocations?" musing..

Nick, Jens, I was under the presumption that the kmalloc was needed for
something other than avoiding this deadlock; do you happen to remember
what?

2009-01-31 08:46:28

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH -v3] use per cpu data for single cpu ipi calls

On Fri, Jan 30 2009, Peter Zijlstra wrote:
> > If another CPU hasn't even received its IPI before the same CPU sends the
> > next one, I'm not sure we _want_ to send one, in fact.
>
> I think the intent was to re-route IO-completion interrupts to whatever
> cpu/node issued the IO with the idea that that cpu/node has the page
> hottest etc. and transferring the completion is cheaper than bouncing
> the page.

Correct

> Since that would be relaying hardware interrupts, there's nothing much
> you can do about the rate, or something, that's up to the firmware on
> $$$ scsi thing.
>
> But Jens already said that that path was using the __ variant and
> providing its own csds, the kmalloc isn't needed there, so it might all
> be moot.

In fact the block layer already attempts to do what Linus describes.
We queue the events for the target cpu, and then do:

        local_irq_save(flags);
        list = &__get_cpu_var(blk_cpu_done);
        list_add_tail(&rq->csd.list, list);

        /*
         * Raise the softirq only if our entry is the first one, i.e.
         * the list was empty and no completion pass is pending yet.
         */
        if (list->next == &rq->csd.list)
                raise_softirq_irqoff(BLOCK_SOFTIRQ);

thus only triggering a new softirq interrupt if the preceding one
hasn't run already. So this is done for the block layer
trigger_softirq() part, but could be provided by the lower layer
instead.

> > But that's a secondary issue, and isn't a correctness thing, just a "do we
> > really need three different allocations?" musing..
>
> Nick, Jens, I was under the presumption that the kmalloc was needed for
> something other than failing to deadlock, happen to remember what?

As far as I remember, it was just the way to allocate memory for the
non-wait case. The per-cpu single csd will limit you to a single pending
entry on the cpu queue; with kmalloc you can have more (like the block
layer will do) and get a nice batching effect for IPI-busy workloads
instead of a 1:1 mapping between work and IPIs fired.

--
Jens Axboe
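
To round off the thread, a minimal user-space C sketch of the batching
Jens describes: queue completions on a list and signal only on the
empty-to-non-empty transition, so one softirq services many
completions. All names are invented, and the per cpu handling, irq
disabling and FIFO ordering of the real blk_cpu_done list are omitted
for brevity.

#include <stddef.h>
#include <stdio.h>

struct request {
        struct request *next;
        int id;
};

/* stands in for the per cpu blk_cpu_done list */
static struct request *done_list;
static int softirqs_raised;

static void raise_softirq(void)
{
        softirqs_raised++;
}

/* completion side: queue the request, signal only if the list was empty */
static void queue_completion(struct request *rq)
{
        int was_empty = (done_list == NULL);

        rq->next = done_list;           /* LIFO here for brevity */
        done_list = rq;
        if (was_empty)
                raise_softirq();        /* one signal covers later entries */
}

/* softirq side: drain everything that accumulated since the signal */
static void run_softirq(void)
{
        while (done_list != NULL) {
                struct request *rq = done_list;

                done_list = rq->next;
                printf("completing request %d\n", rq->id);
        }
}

int main(void)
{
        struct request a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };

        queue_completion(&a);
        queue_completion(&b);   /* no new signal: one is already pending */
        queue_completion(&c);
        run_softirq();
        printf("%d softirq(s) for 3 completions\n", softirqs_raised);
        return 0;
}

Run as-is, it should report a single softirq for all three
completions, which is the batching effect Jens is after.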