LinuxLists.cc - [patch 3/4] stop_machine: implement stop_machine_from_offline

2011-06-22 22:31:15

Subject: [patch 3/4] stop_machine: implement stop_machine_from_offline_cpu()

From: Tejun Heo <[email protected]>

Currently, mtrr wants stop_machine functionality while a CPU is being
brought up. As stop_machine() requires the calling CPU to be online,
mtrr implements its own stop_machine using stop_one_cpu() on each
online CPU. This doesn't only unnecessarily duplicate complex logic
but also introduces a possibility of deadlock when it races against
the generic stop_machine().

This patch implements stop_machine_from_offline_cpu() to serve such
use cases. Its functionality is basically the same as stop_machine();
however, it should be called from a CPU which isn't online and doesn't
depend on working scheduling on the calling CPU.

This is achieved by using busy loops for synchronization and
open-coding stop_cpus queuing and waiting with direct invocation of
fn() for local CPU inbetween.

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
include/linux/stop_machine.h | 14 ++++++++-
kernel/stop_machine.c | 62 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 73 insertions(+), 3 deletions(-)

Index: linux-2.6-tip/kernel/stop_machine.c
===================================================================
--- linux-2.6-tip.orig/kernel/stop_machine.c
+++ linux-2.6-tip/kernel/stop_machine.c
@@ -439,8 +439,15 @@ static int stop_machine_cpu_stop(void *d
struct stop_machine_data *smdata = data;
enum stopmachine_state curstate = STOPMACHINE_NONE;
int cpu = smp_processor_id(), err = 0;
+ unsigned long flags;
bool is_active;

+ /*
+ * When called from stop_machine_from_offline_cpu(), irq might
+ * already be disabled. Save the state and restore it on exit.
+ */
+ local_save_flags(flags);
+
if (!smdata->active_cpus)
is_active = cpu == cpumask_first(cpu_online_mask);
else
@@ -468,7 +475,7 @@ static int stop_machine_cpu_stop(void *d
}
} while (curstate != STOPMACHINE_EXIT);

- local_irq_enable();
+ local_irq_restore(flags);
return err;
}

@@ -495,4 +502,57 @@ int stop_machine(int (*fn)(void *), void
}
EXPORT_SYMBOL_GPL(stop_machine);

+/**
+ * stop_machine_from_offline_cpu - stop_machine() from offline CPU
+ * @fn: the function to run
+ * @data: the data ptr for the @fn()
+ * @cpus: the cpus to run the @fn() on (NULL = any online cpu)
+ *
+ * This is identical to stop_machine() but can be called from a CPU which
+ * isn't online. The local CPU is in the process of hotplug (so no other
+ * CPU hotplug can start) and not marked online and doesn't have enough
+ * context to sleep.
+ *
+ * This function provides stop_machine() functionality for such state by
+ * using busy-wait for synchronization and executing @fn directly for local
+ * CPU.
+ *
+ * CONTEXT:
+ * Local CPU is offline. Temporarily stops all online CPUs.
+ *
+ * RETURNS:
+ * 0 if all executions of @fn returned 0, any non zero return value if any
+ * returned non zero.
+ */
+int stop_machine_from_offline_cpu(int (*fn)(void *), void *data,
+ const struct cpumask *cpus)
+{
+ struct stop_machine_data smdata = { .fn = fn, .data = data,
+ .active_cpus = cpus };
+ struct cpu_stop_done done;
+ int ret;
+
+ /* Local CPU must be offline and CPU hotplug in progress. */
+ BUG_ON(cpu_online(raw_smp_processor_id()));
+ smdata.num_threads = num_online_cpus() + 1; /* +1 for local */
+
+ /* No proper task established and can't sleep - busy wait for lock. */
+ while (!mutex_trylock(&stop_cpus_mutex))
+ cpu_relax();
+
+ /* Schedule work on other CPUs and execute directly for local CPU */
+ set_state(&smdata, STOPMACHINE_PREPARE);
+ cpu_stop_init_done(&done, num_online_cpus());
+ queue_stop_cpus_work(cpu_online_mask, stop_machine_cpu_stop, &smdata,
+ &done);
+ ret = stop_machine_cpu_stop(&smdata);
+
+ /* Busy wait for completion. */
+ while (!completion_done(&done.completion))
+ cpu_relax();
+
+ mutex_unlock(&stop_cpus_mutex);
+ return ret ?: done.ret;
+}
+
#endif /* CONFIG_STOP_MACHINE */
Index: linux-2.6-tip/include/linux/stop_machine.h
===================================================================
--- linux-2.6-tip.orig/include/linux/stop_machine.h
+++ linux-2.6-tip/include/linux/stop_machine.h
@@ -126,15 +126,19 @@ int stop_machine(int (*fn)(void *), void
*/
int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus);

+int stop_machine_from_offline_cpu(int (*fn)(void *), void *data,
+ const struct cpumask *cpus);
+
#else /* CONFIG_STOP_MACHINE && CONFIG_SMP */

static inline int __stop_machine(int (*fn)(void *), void *data,
const struct cpumask *cpus)
{
+ unsigned long flags;
int ret;
- local_irq_disable();
+ local_irq_save(flags);
ret = fn(data);
- local_irq_enable();
+ local_irq_restore(flags);
return ret;
}

@@ -144,5 +148,11 @@ static inline int stop_machine(int (*fn)
return __stop_machine(fn, data, cpus);
}

+static inline int stop_machine_from_offline_cpu(int (*fn)(void *), void *data,
+ const struct cpumask *cpus)
+{
+ return __stop_machine(fn, data, cpus);
+}
+
#endif /* CONFIG_STOP_MACHINE && CONFIG_SMP */
#endif /* _LINUX_STOP_MACHINE */

2011-06-23 09:26:17

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch 3/4] stop_machine: implement stop_machine_from_offline_cpu()

On Wed, 2011-06-22 at 15:20 -0700, Suresh Siddha wrote:
> +int stop_machine_from_offline_cpu(int (*fn)(void *), void *data,
> + const struct cpumask *cpus)
> +{
> + struct stop_machine_data smdata = { .fn = fn, .data = data,
> + .active_cpus = cpus };
> + struct cpu_stop_done done;
> + int ret;
> +
> + /* Local CPU must be offline and CPU hotplug in progress. */
> + BUG_ON(cpu_online(raw_smp_processor_id()));
> + smdata.num_threads = num_online_cpus() + 1; /* +1 for local */
> +
> + /* No proper task established and can't sleep - busy wait for lock. */
> + while (!mutex_trylock(&stop_cpus_mutex))
> + cpu_relax();
> +
> + /* Schedule work on other CPUs and execute directly for local CPU */
> + set_state(&smdata, STOPMACHINE_PREPARE);
> + cpu_stop_init_done(&done, num_online_cpus());
> + queue_stop_cpus_work(cpu_online_mask, stop_machine_cpu_stop, &smdata,
> + &done);
> + ret = stop_machine_cpu_stop(&smdata);
> +
> + /* Busy wait for completion. */
> + while (!completion_done(&done.completion))
> + cpu_relax();
> +
> + mutex_unlock(&stop_cpus_mutex);
> + return ret ?: done.ret;
> +}

Damn thats ugly, I sure hope you're going to make those hardware folks
pay for this :-)

In commit d0af9eed5aa91b6b7b5049cae69e5ea956fd85c3 you mention that its
specific to HT, wouldn't it make sense to limit the stop-machine use in
the next patch to the sibling mask instead of the whole machine?

2011-06-23 09:28:51

by Tejun Heo

[permalink] [raw]

Subject: Re: [patch 3/4] stop_machine: implement stop_machine_from_offline_cpu()

Hello, Peter.

On Thu, Jun 23, 2011 at 11:25:19AM +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-22 at 15:20 -0700, Suresh Siddha wrote:
> > +int stop_machine_from_offline_cpu(int (*fn)(void *), void *data,
> > + const struct cpumask *cpus)
> > +{
> > + struct stop_machine_data smdata = { .fn = fn, .data = data,
> > + .active_cpus = cpus };
> > + struct cpu_stop_done done;
> > + int ret;
> > +
> > + /* Local CPU must be offline and CPU hotplug in progress. */
> > + BUG_ON(cpu_online(raw_smp_processor_id()));
> > + smdata.num_threads = num_online_cpus() + 1; /* +1 for local */
> > +
> > + /* No proper task established and can't sleep - busy wait for lock. */
> > + while (!mutex_trylock(&stop_cpus_mutex))
> > + cpu_relax();
> > +
> > + /* Schedule work on other CPUs and execute directly for local CPU */
> > + set_state(&smdata, STOPMACHINE_PREPARE);
> > + cpu_stop_init_done(&done, num_online_cpus());
> > + queue_stop_cpus_work(cpu_online_mask, stop_machine_cpu_stop, &smdata,
> > + &done);
> > + ret = stop_machine_cpu_stop(&smdata);
> > +
> > + /* Busy wait for completion. */
> > + while (!completion_done(&done.completion))
> > + cpu_relax();
> > +
> > + mutex_unlock(&stop_cpus_mutex);
> > + return ret ?: done.ret;
> > +}
>
> Damn thats ugly, I sure hope you're going to make those hardware folks
> pay for this :-)

Oh, I agree it's fugly. It's trying to orchestrate stop_machine from
a CPU which doesn't have proper scheduler/task set up. At least it's
contained in a single relatively short function.

Thanks.

--
tejun

2011-06-23 09:32:36

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch 3/4] stop_machine: implement stop_machine_from_offline_cpu()

On Thu, 2011-06-23 at 11:28 +0200, Tejun Heo wrote:
> Hello, Peter.
>
> On Thu, Jun 23, 2011 at 11:25:19AM +0200, Peter Zijlstra wrote:
> > On Wed, 2011-06-22 at 15:20 -0700, Suresh Siddha wrote:
> > > +int stop_machine_from_offline_cpu(int (*fn)(void *), void *data,
> > > + const struct cpumask *cpus)
> > > +{
> > > + struct stop_machine_data smdata = { .fn = fn, .data = data,
> > > + .active_cpus = cpus };
> > > + struct cpu_stop_done done;
> > > + int ret;
> > > +
> > > + /* Local CPU must be offline and CPU hotplug in progress. */
> > > + BUG_ON(cpu_online(raw_smp_processor_id()));
> > > + smdata.num_threads = num_online_cpus() + 1; /* +1 for local */
> > > +
> > > + /* No proper task established and can't sleep - busy wait for lock. */
> > > + while (!mutex_trylock(&stop_cpus_mutex))
> > > + cpu_relax();
> > > +
> > > + /* Schedule work on other CPUs and execute directly for local CPU */
> > > + set_state(&smdata, STOPMACHINE_PREPARE);
> > > + cpu_stop_init_done(&done, num_online_cpus());
> > > + queue_stop_cpus_work(cpu_online_mask, stop_machine_cpu_stop, &smdata,
> > > + &done);
> > > + ret = stop_machine_cpu_stop(&smdata);
> > > +
> > > + /* Busy wait for completion. */
> > > + while (!completion_done(&done.completion))
> > > + cpu_relax();
> > > +
> > > + mutex_unlock(&stop_cpus_mutex);
> > > + return ret ?: done.ret;
> > > +}
> >
> > Damn thats ugly, I sure hope you're going to make those hardware folks
> > pay for this :-)
>
> Oh, I agree it's fugly. It's trying to orchestrate stop_machine from
> a CPU which doesn't have proper scheduler/task set up. At least it's
> contained in a single relatively short function.

Yeah, no complaints on that grounds, its just sad we need to actually do
it, but given its a hardware constraint we're working with we don't have
much choice.

Just wanted to make sure Suresh (and possibly others) are going to tell
the hardware folks that such constraints are not cool.

2011-06-23 18:19:14

by Suresh Siddha

[permalink] [raw]

Subject: Re: [patch 3/4] stop_machine: implement stop_machine_from_offline_cpu()

On Thu, 2011-06-23 at 02:25 -0700, Peter Zijlstra wrote:
> On Wed, 2011-06-22 at 15:20 -0700, Suresh Siddha wrote:
> > +int stop_machine_from_offline_cpu(int (*fn)(void *), void *data,
> > + const struct cpumask *cpus)
> > +{
> > + struct stop_machine_data smdata = { .fn = fn, .data = data,
> > + .active_cpus = cpus };
> > + struct cpu_stop_done done;
> > + int ret;
> > +
> > + /* Local CPU must be offline and CPU hotplug in progress. */
> > + BUG_ON(cpu_online(raw_smp_processor_id()));
> > + smdata.num_threads = num_online_cpus() + 1; /* +1 for local */
> > +
> > + /* No proper task established and can't sleep - busy wait for lock. */
> > + while (!mutex_trylock(&stop_cpus_mutex))
> > + cpu_relax();
> > +
> > + /* Schedule work on other CPUs and execute directly for local CPU */
> > + set_state(&smdata, STOPMACHINE_PREPARE);
> > + cpu_stop_init_done(&done, num_online_cpus());
> > + queue_stop_cpus_work(cpu_online_mask, stop_machine_cpu_stop, &smdata,
> > + &done);
> > + ret = stop_machine_cpu_stop(&smdata);
> > +
> > + /* Busy wait for completion. */
> > + while (!completion_done(&done.completion))
> > + cpu_relax();
> > +
> > + mutex_unlock(&stop_cpus_mutex);
> > + return ret ?: done.ret;
> > +}
>
> Damn thats ugly, I sure hope you're going to make those hardware folks
> pay for this :-)

yeah, with more cores and sockets, this is becoming a pain. We will
bring this up with HW folks.

>
> In commit d0af9eed5aa91b6b7b5049cae69e5ea956fd85c3 you mention that its
> specific to HT, wouldn't it make sense to limit the stop-machine use in
> the next patch to the sibling mask instead of the whole machine?

That specific issue was seen in the context of HT. But the SDM
guidelines (pre date HT and) are applicable for SMP too.

thanks,
suresh

2011-06-24 07:46:33

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch 3/4] stop_machine: implement stop_machine_from_offline_cpu()

On Thu, 2011-06-23 at 11:19 -0700, Suresh Siddha wrote:
> > In commit d0af9eed5aa91b6b7b5049cae69e5ea956fd85c3 you mention that its
> > specific to HT, wouldn't it make sense to limit the stop-machine use in
> > the next patch to the sibling mask instead of the whole machine?
>
> That specific issue was seen in the context of HT. But the SDM
> guidelines (pre date HT and) are applicable for SMP too.

Sure, but we managed to ignore those long enough, could we not continue
to violate them and keep to the minimum that is working in practice?

>From what I understand the explosion is WSM+/VMX on HT only because the
siblings share state, or do we have proof that it yields problems
between cores as well?

2011-06-24 17:56:01

by Suresh Siddha

[permalink] [raw]

Subject: Re: [patch 3/4] stop_machine: implement stop_machine_from_offline_cpu()

On Fri, 2011-06-24 at 00:45 -0700, Peter Zijlstra wrote:
> On Thu, 2011-06-23 at 11:19 -0700, Suresh Siddha wrote:
> > > In commit d0af9eed5aa91b6b7b5049cae69e5ea956fd85c3 you mention that its
> > > specific to HT, wouldn't it make sense to limit the stop-machine use in
> > > the next patch to the sibling mask instead of the whole machine?
> >
> > That specific issue was seen in the context of HT. But the SDM
> > guidelines (pre date HT and) are applicable for SMP too.
>
> Sure, but we managed to ignore those long enough, could we not continue
> to violate them and keep to the minimum that is working in practice?

No we didn't violate all of them.

There are two paths where we do the rendezvous. One is during cpu online
(either boot, hotplug, suspend/resume etc) where we init the MTRR
registers and another is when MTRR's change (through /proc/mtrr
interface) during runtime.

Before the commit 'd0af9eed5aa91b6b7b5049cae69e5ea956fd85c3' we were
indeed doing rendezvous of all the cpu's when we were dynamically
changing MTRR's.

And even in older kernels prior to 2.6.11 or so, we were doing
rendezvous on all occasions. And during cpu hotplug code revamp, we
broke the MTRR rendezvous for online paths which led to the WSM issue we
fixed in the commit 'd0af9eed5aa91b6b7b5049cae69e5ea956fd85c3'.

> From what I understand the explosion is WSM+/VMX on HT only because the
> siblings share state, or do we have proof that it yields problems
> between cores as well?

During dynamic MTRR register changes, I believe rendezvous of all the
cpu's is more critical, otherwise, there can be multiple cpu's accessing
the same memory location with different memory attributes and that can
potentially lead to memory corruption, machine-check etc.

During cpu online, I think we can be less aggressive (like perhaps
rendezvous only some of the HT/core siblings etc).

This is what we are planning to bring up with the cpu folks and see what
we can cut down and what is essential.

thanks,
suresh