2014-01-22 05:53:24

by Paul Mackerras

Subject: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

This arises out of a report from a tester that offlining a CPU never
finished on a system they were testing. This was on a POWER8 running
a 3.10.x kernel, but the issue is still present in mainline AFAICS.

What I found when I looked at the system was this:

* There was a ppc64_cpu process stuck inside cpu_hotplug_begin(),
called from _cpu_down(), from cpu_down(). This process was holding
the cpu_add_remove_lock mutex, since cpu_down() calls
cpu_maps_update_begin() before calling _cpu_down(). It was stuck
there because cpu_hotplug.refcount == 1.

* There was a mdadm process trying to acquire the cpu_add_remove_lock
mutex inside register_cpu_notifier(), called from
raid5_alloc_percpu() in drivers/md/raid5.c. That process had
previously called get_online_cpus, which is why cpu_hotplug.refcount
was 1.

Result: deadlock.

Thus it seems that the following code is not safe:

get_online_cpus();
register_cpu_notifier(&...);
put_online_cpus();

There are a few different places that do that sort of thing; besides
drivers/md/raid5.c, there are instances in arch/x86/kernel/cpu,
arch/x86/oprofile, drivers/cpufreq/acpi-cpufreq.c,
drivers/oprofile/nmi_timer_int.c and kernel/trace/ring_buffer.c.

My question is this: is it reasonable to call register_cpu_notifier
inside a get/put_online_cpus block? If so, the deadlock needs to be
fixed; if not, the callers need to be fixed, and the restriction
should be documented.

Regards,
Paul.


2014-01-22 08:36:08

by Srivatsa S. Bhat

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

Hi Paul,

On 01/22/2014 11:22 AM, Paul Mackerras wrote:
> This arises out of a report from a tester that offlining a CPU never
> finished on a system they were testing. This was on a POWER8 running
> a 3.10.x kernel, but the issue is still present in mainline AFAICS.
>
> What I found when I looked at the system was this:
>
> * There was a ppc64_cpu process stuck inside cpu_hotplug_begin(),
> called from _cpu_down(), from cpu_down(). This process was holding
> the cpu_add_remove_lock mutex, since cpu_down() calls
> cpu_maps_update_begin() before calling _cpu_down(). It was stuck
> there because cpu_hotplug.refcount == 1.
>
> * There was a mdadm process trying to acquire the cpu_add_remove_lock
> mutex inside register_cpu_notifier(), called from
> raid5_alloc_percpu() in drivers/md/raid5.c. That process had
> previously called get_online_cpus, which is why cpu_hotplug.refcount
> was 1.
>
> Result: deadlock.
>
> Thus it seems that the following code is not safe:
>
> get_online_cpus();
> register_cpu_notifier(&...);
> put_online_cpus();
>

Yes, this is a known problem, and I had proposed an elaborate solution
some time ago: https://lkml.org/lkml/2012/3/1/39
But that won't work for all cases, so that solution is a no-go.

If we forget the CPU_POST_DEAD stage for a moment, we can just replace the
calls to cpu_maps_update_begin/done() with get/put_online_cpus() in both
register_cpu_notifier() as well as unregister_cpu_notifier(). After all,
the callback registration code needs to synchronize only with the actual
hotplug operations, and not the update of cpu-maps. So they don't really
need to acquire the cpu_add_remove_lock.

However, CPU_POST_DEAD notifications run with the hotplug lock dropped.
So we can't simply replace cpu_add_remove_lock with the hotplug lock in the
registration routines, because notifier invocations and notifier registration
need to be synchronized.

Hmm...

> There are a few different places that do that sort of thing; besides
> drivers/md/raid5.c, there are instances in arch/x86/kernel/cpu,
> arch/x86/oprofile, drivers/cpufreq/acpi-cpufreq.c,
> drivers/oprofile/nmi_timer_int.c and kernel/trace/ring_buffer.c.
>
> My question is this: is it reasonable to call register_cpu_notifier
> inside a get/put_online_cpus block?

Ideally, we would want that to work, because there is no other race-free
way of registering a notifier.

> If so, the deadlock needs to be
> fixed; if not, the callers need to be fixed, and the restriction
> should be documented.

Fixing the callers is a last resort. I'm thinking of ways to fix the
deadlock itself, and allow the callers to call register_cpu_notifier
within a get/put_online_cpus() block...

Regards,
Srivatsa S. Bhat

2014-01-22 09:21:30

by Srivatsa S. Bhat

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

On 01/22/2014 02:00 PM, Srivatsa S. Bhat wrote:
> Hi Paul,
>
> On 01/22/2014 11:22 AM, Paul Mackerras wrote:
>> This arises out of a report from a tester that offlining a CPU never
>> finished on a system they were testing. This was on a POWER8 running
>> a 3.10.x kernel, but the issue is still present in mainline AFAICS.
>>
>> What I found when I looked at the system was this:
>>
>> * There was a ppc64_cpu process stuck inside cpu_hotplug_begin(),
>> called from _cpu_down(), from cpu_down(). This process was holding
>> the cpu_add_remove_lock mutex, since cpu_down() calls
>> cpu_maps_update_begin() before calling _cpu_down(). It was stuck
>> there because cpu_hotplug.refcount == 1.
>>
>> * There was a mdadm process trying to acquire the cpu_add_remove_lock
>> mutex inside register_cpu_notifier(), called from
>> raid5_alloc_percpu() in drivers/md/raid5.c. That process had
>> previously called get_online_cpus, which is why cpu_hotplug.refcount
>> was 1.
>>
>> Result: deadlock.
>>
>> Thus it seems that the following code is not safe:
>>
>> get_online_cpus();
>> register_cpu_notifier(&...);
>> put_online_cpus();
>>
>
> Yes, this is a known problem, and I had proposed an elaborate solution
> some time ago: https://lkml.org/lkml/2012/3/1/39
> But that won't work for all cases, so that solution is a no-go.
>

Wait a min, that _will_ actually work for all cases because I have provided
an option to invoke _any_ arbitrary function as the "setup" routine.

So, taking the example of raid5 that you mentioned below, instead of doing
this:

static int raid5_alloc_percpu()
{
	...
	get_online_cpus();
	for_each_present_cpu(cpu) {
		...
	}
	...
	register_cpu_notifier();
	put_online_cpus();
}

We can do this:

void func()
{
	for_each_present_cpu(cpu) {
		...
	}
	...
}

static int raid5_alloc_percpu()
{
	...
	register_allcpu_notifier(..., true, func);
}


The other, simpler alternative fix is to use cpu_hotplug_disable/enable()
in place of get/put_online_cpus() around the callback registration code.
Something like this:

cpu_hotplug_disable();
register_cpu_notifier();
cpu_hotplug_enable();

But the problem with this is that a hotplug operation that tries to run
concurrently with this will get a -EBUSY, which is kinda undesirable.
Also, this will only synchronize with hotplug operations initiated via
calls to cpu_up/down() (such as those that are initiated by writing to the
online file in sysfs). It won't synchronize with the hotplug operations
invoked by disable/enable_nonboot_cpus(), which by-pass cpu_up/down() and
directly call _cpu_up/down() by ignoring the cpu_hotplug_disabled flag.

The latter is a more controlled environment though, since it's mostly used
by the suspend/hibernate code, in a state where the entire userspace is
frozen. So it might not be that bad.

Regards,
Srivatsa S. Bhat

> If we forget the CPU_POST_DEAD stage for a moment, we can just replace the
> calls to cpu_maps_update_begin/done() with get/put_online_cpus() in both
> register_cpu_notifier() as well as unregister_cpu_notifier(). After all,
> the callback registration code needs to synchronize only with the actual
> hotplug operations, and not the update of cpu-maps. So they don't really
> need to acquire the cpu_add_remove_lock.
>
> However, CPU_POST_DEAD notifications run with the hotplug lock dropped.
> So we can't simply replace cpu_add_remove_lock with hotplug lock in the
> registration routines, because notifier invocations and notifier registration
> needs to be synchronized.
>
> Hmm...
>
>> There are a few different places that do that sort of thing; besides
>> drivers/md/raid5.c, there are instances in arch/x86/kernel/cpu,
>> arch/x86/oprofile, drivers/cpufreq/acpi-cpufreq.c,
>> drivers/oprofile/nmi_timer_int.c and kernel/trace/ring_buffer.c.
>>
>> My question is this: is it reasonable to call register_cpu_notifier
>> inside a get/put_online_cpus block?
>
> Ideally, we would want that to work. Because there is no other race-free
> way of registering a notifier.
>
>> If so, the deadlock needs to be
>> fixed; if not, the callers need to be fixed, and the restriction
>> should be documented.
>
> Fixing the callers is a last resort. I'm thinking of ways to fix the
> deadlock itself, and allow the callers to call register_cpu_notifier
> within a get/put_online_cpus() block...
>

2014-01-22 19:18:38

by Oleg Nesterov

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

On 01/22, Srivatsa S. Bhat wrote:
>
> Wait a min, that _will_ actually work for all cases because I have provided
> an option to invoke _any_ arbitrary function as the "setup" routine.

And probably the generic solution makes sense. I am not sure I actually
understand the semantics of register_allcpu_notifier(), but the problem
it tries to solve looks clear/valid.

But as for a quick fix for raid5_alloc_percpu(), can't it simply call
register_cpu_notifier() before get_online_cpus/for_each_present_cpu ?

This probably means that raid456_cpu_notify() should be modified, because
it obviously can be called before get_online_cpus(). Hmm, it already
has safe_put_page(), so this looks really simple? Something like below,
although of course I can easily miss something.

Oleg.


--- x/drivers/md/raid5.c
+++ x/drivers/md/raid5.c
@@ -5542,6 +5542,24 @@ static void free_conf(struct r5conf *con
 	kfree(conf);
 }
 
+static int alloc_xxx(struct r5conf *conf, struct raid5_percpu *percpu)
+{
+	if (conf->level == 6 && !percpu->spare_page)
+		percpu->spare_page = alloc_page(GFP_KERNEL);
+	if (!percpu->scribble)
+		percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
+
+	if (!percpu->scribble || (conf->level == 6 && !percpu->spare_page)) {
+		safe_put_page(percpu->spare_page);
+		kfree(percpu->scribble);
+		pr_err("%s: failed memory allocation for cpu%ld\n",
+		       __func__, cpu);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 static int raid456_cpu_notify(struct notifier_block *nfb, unsigned long action,
 			      void *hcpu)
@@ -5553,19 +5571,8 @@ static int raid456_cpu_notify(struct not
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (conf->level == 6 && !percpu->spare_page)
-			percpu->spare_page = alloc_page(GFP_KERNEL);
-		if (!percpu->scribble)
-			percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
-
-		if (!percpu->scribble ||
-		    (conf->level == 6 && !percpu->spare_page)) {
-			safe_put_page(percpu->spare_page);
-			kfree(percpu->scribble);
-			pr_err("%s: failed memory allocation for cpu%ld\n",
-			       __func__, cpu);
+		if (alloc_xxx(conf, percpu))
 			return notifier_from_errno(-ENOMEM);
-		}
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
@@ -5585,39 +5592,27 @@ static int raid5_alloc_percpu(struct r5c
 {
 	unsigned long cpu;
 	struct page *spare_page;
-	struct raid5_percpu __percpu *allcpus;
 	void *scribble;
-	int err;
+	int err = 0;
 
-	allcpus = alloc_percpu(struct raid5_percpu);
-	if (!allcpus)
+	conf->percpu = alloc_percpu(struct raid5_percpu);
+	if (!conf->percpu)
 		return -ENOMEM;
-	conf->percpu = allcpus;
 
-	get_online_cpus();
-	err = 0;
-	for_each_present_cpu(cpu) {
-		if (conf->level == 6) {
-			spare_page = alloc_page(GFP_KERNEL);
-			if (!spare_page) {
-				err = -ENOMEM;
-				break;
-			}
-			per_cpu_ptr(conf->percpu, cpu)->spare_page = spare_page;
-		}
-		scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
-		if (!scribble) {
-			err = -ENOMEM;
-			break;
-		}
-		per_cpu_ptr(conf->percpu, cpu)->scribble = scribble;
-	}
 #ifdef CONFIG_HOTPLUG_CPU
 	conf->cpu_notify.notifier_call = raid456_cpu_notify;
 	conf->cpu_notify.priority = 0;
-	if (err == 0)
-		err = register_cpu_notifier(&conf->cpu_notify);
+	err = register_cpu_notifier(&conf->cpu_notify);
+	if (err)
+		return err;
 #endif
+
+	get_online_cpus();
+	for_each_present_cpu(cpu) {
+		err = alloc_xxx(conf, per_cpu_ptr(conf->percpu, cpu));
+		if (err)
+			break;
+	}
 	put_online_cpus();
 
 	return err;

2014-01-22 20:03:12

by Srivatsa S. Bhat

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

On 01/23/2014 12:48 AM, Oleg Nesterov wrote:
> On 01/22, Srivatsa S. Bhat wrote:
>>
>> Wait a min, that _will_ actually work for all cases because I have provided
>> an option to invoke _any_ arbitrary function as the "setup" routine.
>
> And probably the generic solution makes sense. I am not sure I actually
> understand the semantics of register_allcpu_notifier(), but the problem
> it tries to solve looks clear/valid.
>

Thank you. But I was wondering whether its usage is a bit unintuitive/
convoluted. So I was deciding between going with that solution and the
one below, where the call-sites are expected to do:

cpu_maps_update_begin();
for_each_online_cpu(cpu) {
	...
}
__register_cpu_notifier(); //use the __reg() variant, which doesn't take locks
cpu_maps_update_done();

Of course, that requires exporting the functions cpu_maps_update_begin/done(),
but this latter form of callback registration might look more natural.

diff --git a/kernel/cpu.c b/kernel/cpu.c
index deff2e6..37373c1 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -166,6 +166,11 @@ int __ref register_cpu_notifier(struct notifier_block *nb)
 	return ret;
 }
 
+int __ref __register_cpu_notifier(struct notifier_block *nb)
+{
+	return raw_notifier_chain_register(&cpu_chain, nb);
+}
+
 static int __cpu_notify(unsigned long val, void *v, int nr_to_call,
 			int *nr_calls)
 {
@@ -189,6 +194,7 @@ static void cpu_notify_nofail(unsigned long val, void *v)
 	BUG_ON(cpu_notify(val, v));
 }
 EXPORT_SYMBOL(register_cpu_notifier);
+EXPORT_SYMBOL(__register_cpu_notifier);
 
 void __ref unregister_cpu_notifier(struct notifier_block *nb)
 {
@@ -198,6 +204,12 @@ void __ref unregister_cpu_notifier(struct notifier_block *nb)
 	raw_notifier_chain_unregister(&cpu_chain, nb);
 }
 EXPORT_SYMBOL(unregister_cpu_notifier);
 
+void __ref __unregister_cpu_notifier(struct notifier_block *nb)
+{
+	raw_notifier_chain_unregister(&cpu_chain, nb);
+}
+EXPORT_SYMBOL(__unregister_cpu_notifier);
+
 /**
  * clear_tasks_mm_cpumask - Safely clear tasks' mm_cpumask for a CPU
  * @cpu: a CPU id


> But as for a quick fix for raid5_alloc_percpu(), can't it simply call
> register_cpu_notifier() before get_online_cpus/for_each_present_cpu ?
>
> This probably means that raid456_cpu_notify() should be modified because
> it obviously can be called before get_online_cpus(). Hmm, it already
> has safe_put_page(), so this looks really simple? Something like below,
> although of course I can miss easily something.

Yes, your solution for raid5 does look good; it is also simple and
elegant. But for some of the other call-sites, we might have to use one
of the solutions mentioned above.

Regards,
Srivatsa S. Bhat

>
>
> --- x/drivers/md/raid5.c
> +++ x/drivers/md/raid5.c
> @@ -5542,6 +5542,24 @@ static void free_conf(struct r5conf *con
> kfree(conf);
> }
>
> +static int alloc_xxx(struct r5conf *conf, struct raid5_percpu *percpu)
> +{
> + if (conf->level == 6 && !percpu->spare_page)
> + percpu->spare_page = alloc_page(GFP_KERNEL);
> + if (!percpu->scribble)
> + percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
> +
> + if (!percpu->scribble || (conf->level == 6 && !percpu->spare_page)) {
> + safe_put_page(percpu->spare_page);
> + kfree(percpu->scribble);
> + pr_err("%s: failed memory allocation for cpu%ld\n",
> + __func__, cpu);
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> #ifdef CONFIG_HOTPLUG_CPU
> static int raid456_cpu_notify(struct notifier_block *nfb, unsigned long action,
> void *hcpu)
> @@ -5553,19 +5571,8 @@ static int raid456_cpu_notify(struct not
> switch (action) {
> case CPU_UP_PREPARE:
> case CPU_UP_PREPARE_FROZEN:
> - if (conf->level == 6 && !percpu->spare_page)
> - percpu->spare_page = alloc_page(GFP_KERNEL);
> - if (!percpu->scribble)
> - percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
> -
> - if (!percpu->scribble ||
> - (conf->level == 6 && !percpu->spare_page)) {
> - safe_put_page(percpu->spare_page);
> - kfree(percpu->scribble);
> - pr_err("%s: failed memory allocation for cpu%ld\n",
> - __func__, cpu);
> + if (alloc_xxx(conf, percpu))
> return notifier_from_errno(-ENOMEM);
> - }
> break;
> case CPU_DEAD:
> case CPU_DEAD_FROZEN:
> @@ -5585,39 +5592,27 @@ static int raid5_alloc_percpu(struct r5c
> {
> unsigned long cpu;
> struct page *spare_page;
> - struct raid5_percpu __percpu *allcpus;
> void *scribble;
> - int err;
> + int err = 0;
>
> - allcpus = alloc_percpu(struct raid5_percpu);
> - if (!allcpus)
> + conf->percpu = alloc_percpu(struct raid5_percpu);
> + if (!conf->percpu)
> return -ENOMEM;
> - conf->percpu = allcpus;
>
> - get_online_cpus();
> - err = 0;
> - for_each_present_cpu(cpu) {
> - if (conf->level == 6) {
> - spare_page = alloc_page(GFP_KERNEL);
> - if (!spare_page) {
> - err = -ENOMEM;
> - break;
> - }
> - per_cpu_ptr(conf->percpu, cpu)->spare_page = spare_page;
> - }
> - scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
> - if (!scribble) {
> - err = -ENOMEM;
> - break;
> - }
> - per_cpu_ptr(conf->percpu, cpu)->scribble = scribble;
> - }
> #ifdef CONFIG_HOTPLUG_CPU
> conf->cpu_notify.notifier_call = raid456_cpu_notify;
> conf->cpu_notify.priority = 0;
> - if (err == 0)
> - err = register_cpu_notifier(&conf->cpu_notify);
> + err = register_cpu_notifier(&conf->cpu_notify);
> + if (err)
> + return err;
> #endif
> +
> + get_online_cpus();
> + for_each_present_cpu(cpu) {
> + err = alloc_xxx(conf, per_cpu_ptr(conf->percpu, cpu));
> + if (err)
> + break;
> + }
> put_online_cpus();
>
> return err;
>

2014-01-23 04:45:38

by Rusty Russell

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

"Srivatsa S. Bhat" <[email protected]> writes:
> On 01/22/2014 02:00 PM, Srivatsa S. Bhat wrote:
>> Hi Paul,

I found an old patch for register_allcpu_notifier(), but the "bool
replay_history" should be eliminated (always true): it's too weird.

Then we should get rid of register_cpu_notifier, or at least hide it.

Thanks,
Rusty.

2014-01-23 05:41:43

by Srivatsa S. Bhat

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

On 01/23/2014 07:59 AM, Rusty Russell wrote:
> "Srivatsa S. Bhat" <[email protected]> writes:
>> On 01/22/2014 02:00 PM, Srivatsa S. Bhat wrote:
>>> Hi Paul,
>
> I find an old patch for register_allcpu_notifier(), but the "bool
> replay_history" should be eliminated (always true): it's too weird.
>

Sorry, I didn't get this part. Why do you say that replay_history
will always be true?

replay_history will be set to true whenever the caller wants to
get notified of CPU_UP_PREPARE and CPU_ONLINE notifications for the
already online CPUs, or wants to run a custom setup-routine of its
own. And it will be false whenever the caller simply wants to just
register the callback.

Note that passing NULL for the setup-routine isn't, by itself, enough
to make a decision. NULL + replay_history == True will invoke the normal
CPU_UP_PREPARE/CPU_ONLINE notifiers for the already online CPUs before
registering the callback. NULL + replay_history == False will just
register the callback and do nothing else.

> Then we should get rid of register_cpu_notifier, or at least hide it.
>

Why? Isn't it easier to use (since you don't have to pass 2 additional
parameters)? I see register_allcpu_notifier (or whatever better name
we can give it) as an API for special cases where there is something
more to be done than just registering the callback. And register_cpu_notifier
will continue to be the API for the regular case, when the caller wants
to just register the callback. This latter case is the majority in the
kernel. So I don't think eliminating the regular API would be a good idea.


By the way, I'm still tempted to try out the simpler-looking alternative
idea of exporting cpu_maps_update_begin() and cpu_maps_update_done()
and then mandating that the callers do:

cpu_maps_update_begin();
for_each_online_cpu(cpu) {
	...
}

__register_cpu_notifier(); // this doesn't take the add_remove_lock
cpu_maps_update_done();


I'm working on a patchset that does this and performs a tree-wide
conversion. Please let me know if you have any objections to exporting
cpu_maps_update_begin/done() in this manner.

I thought I'd give this solution a try first, before going to the much
fancier register_allcpu_notifier() method.

Regards,
Srivatsa S. Bhat

2014-01-23 17:03:13

by Oleg Nesterov

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

On 01/23, Srivatsa S. Bhat wrote:
>
> On 01/23/2014 12:48 AM, Oleg Nesterov wrote:
> > On 01/22, Srivatsa S. Bhat wrote:
> >>
> >> Wait a min, that _will_ actually work for all cases because I have provided
> >> an option to invoke _any_ arbitrary function as the "setup" routine.
> >
> > And probably the generic solution makes sense. I am not sure I actually
> > understand the semantics of register_allcpu_notifier(), but the problem
> > it tries to solve looks clear/valid.
> >
>
> Thank you. But I was wondering whether its usage is a bit unintuitive/
> convoluted. So I was contemplating between going with that solution or the
> below one, where the call-sites are expected to do:
>
> cpu_maps_update_begin();
> for_each_online_cpu(cpu) {
> ...
> }
> __register_cpu_notifier(); //use the __reg() variant, which doesn't take locks
> cpu_maps_update_done();
>
> Of course, that requires exporting the functions cpu_maps_update_begin/done(),
> but this latter form of callback registration might look more natural.

Yes, I thought about this too ;)

> But for some of the other call-sites, we might have to use one
> of the solutions mentioned above.

Yes, yes, sure, I agree.

I suggested this change only for discussion, for the case where we need
an "urgent" fix without changes outside of drivers/md/. The generic
solution is better.

Oleg.

2014-01-23 23:39:25

by Rusty Russell

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

"Srivatsa S. Bhat" <[email protected]> writes:
> On 01/23/2014 07:59 AM, Rusty Russell wrote:
>> "Srivatsa S. Bhat" <[email protected]> writes:
>>> On 01/22/2014 02:00 PM, Srivatsa S. Bhat wrote:
>>>> Hi Paul,
>>
>> I find an old patch for register_allcpu_notifier(), but the "bool
>> replay_history" should be eliminated (always true): it's too weird.
>>
>
> Sorry, I didn't get this part. Why do you say that replay_history
> will always be true?

OK, let me start again and try to explain myself properly:

register_cpu_notifier is a bad API. It's hard to get right because:
1) You need to loop over online (or present) cpus once before you call
it.
2) You have to beware of the race between the loop and registration, but
much example code runs at boot time where it doesn't matter, so a
random author is likely to copy that and inherit the race.
3) You have two paths doing the same thing: the loop, which is run on
every machine (cpu hotplug or not), and the notifier callback, which
is run far more rarely.

What we actually *want* is a routine which will reliably call for every
current and future CPU, and then there are very few places which should
use the current register_cpu_notifier().

ie. halfway between register_cpu_notifier() (too racy) and
register_allcpu_notifier() (too simplified).

Let's call it register_cpu_callback / unregister_cpu_callback?

> By the way, I'm still tempted to try out the simpler-looking alternative
> idea of exporting cpu_maps_update_begin() and cpu_maps_update_done()
> and then mandating that the callers do:
>
> cpu_maps_update_begin();
> for_each_online_cpu(cpu) {
> ...
> }
>
> __register_cpu_notifier(); // this doesn't take the add_remove_lock
> cpu_maps_update_done();

Sure, fix this one for -stable. But let's create an idiom we can be
proud of for the longer term.

Thanks,
Rusty.

2014-01-28 14:37:37

by Srivatsa S. Bhat

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

On 01/23/2014 10:32 PM, Oleg Nesterov wrote:
> On 01/23, Srivatsa S. Bhat wrote:
>>
>> On 01/23/2014 12:48 AM, Oleg Nesterov wrote:
>>> On 01/22, Srivatsa S. Bhat wrote:
>>>>
>>>> Wait a min, that _will_ actually work for all cases because I have provided
>>>> an option to invoke _any_ arbitrary function as the "setup" routine.
>>>
>>> And probably the generic solution makes sense. I am not sure I actually
>>> understand the semantics of register_allcpu_notifier(), but the problem
>>> it tries to solve looks clear/valid.
>>>
>>
>> Thank you. But I was wondering whether its usage is a bit unintuitive/
>> convoluted. So I was contemplating between going with that solution or the
>> below one, where the call-sites are expected to do:
>>
>> cpu_maps_update_begin();
>> for_each_online_cpu(cpu) {
>> ...
>> }
>> __register_cpu_notifier(); //use the __reg() variant, which doesn't take locks
>> cpu_maps_update_done();
>>
>> Of course, that requires exporting the functions cpu_maps_update_begin/done(),
>> but this latter form of callback registration might look more natural.
>
> Yes, I thought about this too ;)
>
>> But for some of the other call-sites, we might have to use one
>> of the solutions mentioned above.
>
> Yes, yes, sure, I agree.
>
> I suggested this change only for discussion, for the case we need
> an "urgent" fix without changes outside of drivers/md/. The generic
> solution is better.
>

Ok :) But your fix for drivers/md/ also makes the code look much neater.
So I'll include your patch in my series and convert the rest of the
call-sites using the generic solution.

Thank you!

Regards,
Srivatsa S. Bhat

2014-01-28 14:41:49

by Srivatsa S. Bhat

Subject: Re: Deadlock between cpu_hotplug_begin and cpu_add_remove_lock

On 01/24/2014 04:31 AM, Rusty Russell wrote:
> "Srivatsa S. Bhat" <[email protected]> writes:
>> On 01/23/2014 07:59 AM, Rusty Russell wrote:
>>> "Srivatsa S. Bhat" <[email protected]> writes:
>>>> On 01/22/2014 02:00 PM, Srivatsa S. Bhat wrote:
>>>>> Hi Paul,
>>>
>>> I find an old patch for register_allcpu_notifier(), but the "bool
>>> replay_history" should be eliminated (always true): it's too weird.
>>>
>>
>> Sorry, I didn't get this part. Why do you say that replay_history
>> will always be true?
>
> OK, let me start again and try to explain myself properly:
>
> register_cpu_notifier is a bad API. It's hard to get right because:
> 1) You need to loop over online (or present) cpus once before you call
> it.
> 2) You have to beware the race between the loop and registration, but
> much example code happens at boot time where it doesn't matter,
> so random author is likely to copy that and have a race.
> 3) You have two paths doing the same thing: the loop which is run on
> every machine (cpu hotplug or not), and the notifier callback which
> is run far less rarely.
>
> What we actually *want* is a routine which will reliably call for every
> current and future CPU, and then there are very few places which should
> use the current register_cpu_notifier().
>
> ie. halfway between register_cpu_notifier() (too racy) and
> register_allcpu_notifier() (too simplified).
>
> Let's call it register_cpu_callback / unregister_cpu_callback?
>

Thanks a lot for the detailed and profound explanation! It makes perfect
sense to me now.

>> By the way, I'm still tempted to try out the simpler-looking alternative
>> idea of exporting cpu_maps_update_begin() and cpu_maps_update_done()
>> and then mandating that the callers do:
>>
>> cpu_maps_update_begin();
>> for_each_online_cpu(cpu) {
>> ...
>> }
>>
>> __register_cpu_notifier(); // this doesn't take the add_remove_lock
>> cpu_maps_update_done();
>
> Sure, fix this one for -stable. But let's create an idiom we can be
> proud of for the longer term.
>

Ok, that sounds good, will work on that.

Thank you very much!

Regards,
Srivatsa S. Bhat