LinuxLists.cc - [PATCH] stopmachine: add stopmachine

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

p.s.
Before applying this patch, you need to apply Rusty's patches:

> From: Rusty Russell <[email protected]>
> To: [email protected]
> Subject: [PATCH 0/3] stop_machine enhancements and simplifications
> Date: Tue, 8 Jul 2008 17:50:55 +1000

Thanks,
H.Seto

2008-07-14 10:45:35

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

On Monday 14 July 2008 17:52:46 Hidetoshi Seto wrote:
> If stop_machine() invoked while one of onlined cpu is locked up
> by some reason, stop_machine cannot finish its work because the
> locked cpu cannot stop. This means all other healthy cpus
> will be blocked infinitely by one dead cpu.

Hello Hidetoshi!

I took your version and modified it slightly. What do you think?

(BTW, the "warning: passing argument 1 of 'kthread_create' from incompatible
pointer type" is caused by a kernel without the typesafe patches: this will
be fixed soon, do I've removed your fix for that).

Thanks,
Rusty.

===
stopmachine: add stopmachine_timeout

This patch is based on Hidetoshi Seto's patch to make stop_machine()
more robust in the face of wedged CPUs.

This version does not fail unless the stuck CPU is one we wanted to do
something on, and does not fail all future stop_machines.

Signed-off-by: Rusty Russell <[email protected]>
---
kernel/stop_machine.c | 52 +++++++++++++++++++++++++++++++++++++++++++++---
kernel/sysctl.c | 15 ++++++++++++++
2 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -35,10 +35,14 @@ struct stop_machine_data {
};

/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
-static unsigned int num_threads;
+static unsigned num_threads;
static atomic_t thread_ack;
+static cpumask_t prepared_cpus;
static struct completion finished;
static DEFINE_MUTEX(lock);
+
+static unsigned long limit;
+unsigned long stopmachine_timeout = 5; /* secs, arbitrary */

static void set_state(enum stopmachine_state newstate)
{
@@ -66,6 +70,10 @@ static int stop_cpu(struct stop_machine_
{
enum stopmachine_state curstate = STOPMACHINE_NONE;
int uninitialized_var(ret);
+
+ /* If we've been shoved off the normal CPU, abort. */
+ if (cpu_test_and_set(smp_processor_id(), prepared_cpus))
+ do_exit(0);

/* Simple state machine */
do {
@@ -100,6 +108,44 @@ static int chill(void *unused)
return 0;
}

+static bool fixup_timeout(struct task_struct **threads, const cpumask_t *cpus)
+{
+ unsigned int i;
+ bool stagger_onwards = true;
+
+ printk(KERN_CRIT "stopmachine: Failed to stop machine in time(%lds).\n",
+ stopmachine_timeout);
+
+ for_each_online_cpu(i) {
+ if (!cpu_isset(i, prepared_cpus) && i != smp_processor_id()) {
+ bool ignore;
+
+ /* If we wanted to run on a particular CPU, and that's
+ * the one which is stuck, it's a real failure. */
+ ignore = !cpus || !cpu_isset(i, *cpus);
+ printk(KERN_CRIT "stopmachine: cpu#%d seems to be "
+ "stuck, %s.\n",
+ i, ignore ? "ignoring" : "FAILING");
+ /* Unbind thread: it will exit when it sees
+ * that prepared_cpus bit set. */
+ set_cpus_allowed(threads[i], cpu_online_map);
+
+ if (!ignore)
+ stagger_onwards = false;
+
+ /* Pretend this one doesn't exist. */
+ num_threads--;
+ }
+ }
+
+ if (stagger_onwards) {
+ /* Force progress. */
+ set_state(state + 1);
+ }
+
+ return stagger_onwards;
+}
+
int __stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
{
int i, err;
@@ -121,6 +167,7 @@ int __stop_machine(int (*fn)(void *), vo
mutex_lock(&lock);
init_completion(&finished);
num_threads = num_online_cpus();
+ limit = jiffies + msecs_to_jiffies(stopmachine_timeout * MSEC_PER_SEC);
set_state(STOPMACHINE_PREPARE);

for_each_online_cpu(i) {
@@ -152,9 +199,23 @@ int __stop_machine(int (*fn)(void *), vo

/* We've created all the threads. Wake them all: hold this CPU so one
* doesn't hit this CPU until we're ready. */
+ cpus_clear(prepared_cpus);
get_cpu();
for_each_online_cpu(i)
wake_up_process(threads[i]);
+
+ /* Wait all others come to life */
+ while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
+ if (time_is_before_jiffies(limit)) {
+ if (!fixup_timeout(threads, cpus)) {
+ /* Tell them all to exit. */
+ set_state(STOPMACHINE_EXIT);
+ active.fnret = -EIO;
+ }
+ break;
+ }
+ cpu_relax();
+ }

/* This will release the thread on our CPU. */
put_cpu();
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -144,6 +144,10 @@ extern int no_unaligned_warning;

#ifdef CONFIG_RT_MUTEXES
extern int max_lock_depth;
+#endif
+
+#ifdef CONFIG_STOP_MACHINE
+extern unsigned long stopmachine_timeout;
#endif

#ifdef CONFIG_PROC_SYSCTL
@@ -811,6 +815,17 @@ static struct ctl_table kern_table[] = {
.procname = "keys",
.mode = 0555,
.child = key_sysctls,
+ },
+#endif
+#ifdef CONFIG_STOP_MACHINE
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "stopmachine_timeout",
+ .data = &stopmachine_timeout,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = &proc_doulongvec_minmax,
+ .strategy = &sysctl_intvec,
},
#endif
/*

2008-07-14 11:51:37

by Christian Borntraeger

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

Am Montag, 14. Juli 2008 schrieb Hidetoshi Seto:
> + /* Wait all others come to life */
> + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
> + if (time_is_before_jiffies(limit))
> + goto timeout;
> + cpu_relax();
> + }
> +

Hmm. I think this could become interesting on virtual machines. The hypervisor
might be to busy to schedule a specific cpu at certain load scenarios. This
would cause a failure even if the cpu is not really locked up. We had similar
problems with the soft lockup daemon on s390.

It would be good to not-use wall-clock time, but really used cpu time instead.
Unfortunately I have no idea, if that is possible in a generic way.
Heiko, any ideas?

2008-07-14 12:34:56

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

On Monday 14 July 2008 21:51:25 Christian Borntraeger wrote:
> Am Montag, 14. Juli 2008 schrieb Hidetoshi Seto:
> > + /* Wait all others come to life */
> > + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
> > + if (time_is_before_jiffies(limit))
> > + goto timeout;
> > + cpu_relax();
> > + }
> > +
>
> Hmm. I think this could become interesting on virtual machines. The
> hypervisor might be to busy to schedule a specific cpu at certain load
> scenarios. This would cause a failure even if the cpu is not really locked
> up. We had similar problems with the soft lockup daemon on s390.

5 seconds is a fairly long time. If all else fails we could have a config
option to simply disable this code.

> It would be good to not-use wall-clock time, but really used cpu time
> instead. Unfortunately I have no idea, if that is possible in a generic
> way. Heiko, any ideas?

Ah, cpu time comes up again. Perhaps we should actually dig that up again;
Zach and Jeremy CC'd.

Thanks,
Rusty.

2008-07-14 18:56:39

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

Rusty Russell wrote:
> On Monday 14 July 2008 21:51:25 Christian Borntraeger wrote:
>
>> Am Montag, 14. Juli 2008 schrieb Hidetoshi Seto:
>>
>>> + /* Wait all others come to life */
>>> + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
>>> + if (time_is_before_jiffies(limit))
>>> + goto timeout;
>>> + cpu_relax();
>>> + }
>>> +
>>>
>> Hmm. I think this could become interesting on virtual machines. The
>> hypervisor might be to busy to schedule a specific cpu at certain load
>> scenarios. This would cause a failure even if the cpu is not really locked
>> up. We had similar problems with the soft lockup daemon on s390.
>>
>
> 5 seconds is a fairly long time. If all else fails we could have a config
> option to simply disable this code.
>
>
>> It would be good to not-use wall-clock time, but really used cpu time
>> instead. Unfortunately I have no idea, if that is possible in a generic
>> way. Heiko, any ideas?
>>
>
> Ah, cpu time comes up again. Perhaps we should actually dig that up again;
> Zach and Jeremy CC'd.

Hm, yeah. But in this case, it's tricky. CPU time is an inherently
per-cpu quantity. If cpu A is waiting for cpu B, and wants to do the
timeout in cpu-seconds, then it has to be in *B*s cpu-seconds (and if A
is waiting on B,C,D,E,F... it needs to measure separate timeouts with
separate timebases for each other CPU). It also means that if B is
unresponsive but also not consuming any time (blocked in IO,
administratively paused, etc), then the timeout will never trigger.

So I think monotonic wallclock time actually makes the most sense here.

The other issue is whether cpu_relax() is the right thing to put in the
busywait. We don't hook it in pvops, so it's just an x86 "pause"
instruction, so from the hypervisor's perspective it just looks like a
spinning CPU. We could either hook cpu_relax() into a hypervisor yield,
or come up with a heavier-weight cpu_snooze() (cpu_relax() is often used
in loops which are expected to have a short duration, where doing a
hypercall+yield would be overkill).

J

2008-07-14 21:21:54

by Heiko Carstens

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

On Mon, Jul 14, 2008 at 11:56:18AM -0700, Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> > On Monday 14 July 2008 21:51:25 Christian Borntraeger wrote:
> >> Am Montag, 14. Juli 2008 schrieb Hidetoshi Seto:
> >>
> >>> + /* Wait all others come to life */
> >>> + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
> >>> + if (time_is_before_jiffies(limit))
> >>> + goto timeout;
> >>> + cpu_relax();
> >>> + }
> >>> +
> >>>
> >> Hmm. I think this could become interesting on virtual machines. The
> >> hypervisor might be to busy to schedule a specific cpu at certain load
> >> scenarios. This would cause a failure even if the cpu is not really locked
> >> up. We had similar problems with the soft lockup daemon on s390.
> > 5 seconds is a fairly long time. If all else fails we could have a config
> > option to simply disable this code.

Hmm.. probably a stupid question: but what could happen that a real cpu
(not virtual) becomes unresponsive so that it won't schedule a MAX_RT_PRIO-1
prioritized task for 5 seconds?

> >> It would be good to not-use wall-clock time, but really used cpu time
> >> instead. Unfortunately I have no idea, if that is possible in a generic
> >> way. Heiko, any ideas?
> >
> > Ah, cpu time comes up again. Perhaps we should actually dig that up again;
> > Zach and Jeremy CC'd.
>
> Hm, yeah. But in this case, it's tricky. CPU time is an inherently
> per-cpu quantity. If cpu A is waiting for cpu B, and wants to do the
> timeout in cpu-seconds, then it has to be in *B*s cpu-seconds (and if A
> is waiting on B,C,D,E,F... it needs to measure separate timeouts with
> separate timebases for each other CPU). It also means that if B is
> unresponsive but also not consuming any time (blocked in IO,
> administratively paused, etc), then the timeout will never trigger.
>
> So I think monotonic wallclock time actually makes the most sense here.

This is asking for trouble... a config option to disable this would be
nice. But as I don't know which problem this patch originally addresses
it might be that this is needed anyway. So lets see why we need it first.

> The other issue is whether cpu_relax() is the right thing to put in the
> busywait. We don't hook it in pvops, so it's just an x86 "pause"
> instruction, so from the hypervisor's perspective it just looks like a
> spinning CPU. We could either hook cpu_relax() into a hypervisor yield,
> or come up with a heavier-weight cpu_snooze() (cpu_relax() is often used
> in loops which are expected to have a short duration, where doing a
> hypercall+yield would be overkill).

cpu_relax() translates to a hypervisor yield on s390. Probably makes sense
if other architectures would do the same.

2008-07-15 01:12:22

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

Hi Rusty,

Rusty Russell wrote:
> I took your version and modified it slightly. What do you think?

I have some comments, inlined.

> (BTW, the "warning: passing argument 1 of 'kthread_create' from incompatible
> pointer type" is caused by a kernel without the typesafe patches: this will
> be fixed soon, do I've removed your fix for that).
>
> Thanks,
> Rusty.
>
> ===
> stopmachine: add stopmachine_timeout
>
> This patch is based on Hidetoshi Seto's patch to make stop_machine()
> more robust in the face of wedged CPUs.
>
> This version does not fail unless the stuck CPU is one we wanted to do
> something on, and does not fail all future stop_machines.

We must protect next stop_machine from threads spawned by previous one.
I like your idea that if we did not want to do something on the stuck CPU
then treat the CPU as stopped. However we need to be careful that the
stuck CPU can restart unexpectedly.

> Signed-off-by: Rusty Russell <[email protected]>
> ---
> kernel/stop_machine.c | 52 +++++++++++++++++++++++++++++++++++++++++++++---
> kernel/sysctl.c | 15 ++++++++++++++
> 2 files changed, 63 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -35,10 +35,14 @@ struct stop_machine_data {
> };
>
> /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
> -static unsigned int num_threads;
> +static unsigned num_threads;
> static atomic_t thread_ack;
> +static cpumask_t prepared_cpus;
> static struct completion finished;
> static DEFINE_MUTEX(lock);
> +
> +static unsigned long limit;
> +unsigned long stopmachine_timeout = 5; /* secs, arbitrary */
>
> static void set_state(enum stopmachine_state newstate)
> {
> @@ -66,6 +70,10 @@ static int stop_cpu(struct stop_machine_
> {
> enum stopmachine_state curstate = STOPMACHINE_NONE;
> int uninitialized_var(ret);
> +
> + /* If we've been shoved off the normal CPU, abort. */
> + if (cpu_test_and_set(smp_processor_id(), prepared_cpus))
> + do_exit(0);
>
> /* Simple state machine */
> do {
> @@ -100,6 +108,44 @@ static int chill(void *unused)
> return 0;
> }
>
> +static bool fixup_timeout(struct task_struct **threads, const cpumask_t *cpus)
> +{
> + unsigned int i;
> + bool stagger_onwards = true;
> +
> + printk(KERN_CRIT "stopmachine: Failed to stop machine in time(%lds).\n",
> + stopmachine_timeout);
> +
> + for_each_online_cpu(i) {
> + if (!cpu_isset(i, prepared_cpus) && i != smp_processor_id()) {
> + bool ignore;
> +

Note that here is a window that a not-prepared frozen cpu can be thawed and
become be prepared.

> + /* If we wanted to run on a particular CPU, and that's
> + * the one which is stuck, it's a real failure. */
> + ignore = !cpus || !cpu_isset(i, *cpus);
> + printk(KERN_CRIT "stopmachine: cpu#%d seems to be "
> + "stuck, %s.\n",
> + i, ignore ? "ignoring" : "FAILING");
> + /* Unbind thread: it will exit when it sees
> + * that prepared_cpus bit set. */
> + set_cpus_allowed(threads[i], cpu_online_map);

Unbinded threads still can wake up on a cpu where they originally targeted.

> + if (!ignore)
> + stagger_onwards = false;
> +
> + /* Pretend this one doesn't exist. */
> + num_threads--;
> + }
> + }
> +
> + if (stagger_onwards) {
> + /* Force progress. */
> + set_state(state + 1);

So num_threads have decremented, but it can happen that there are more threads.

> + }
> +
> + return stagger_onwards;
> +}
> +
> int __stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
> {
> int i, err;
> @@ -121,6 +167,7 @@ int __stop_machine(int (*fn)(void *), vo
> mutex_lock(&lock);
> init_completion(&finished);
> num_threads = num_online_cpus();
> + limit = jiffies + msecs_to_jiffies(stopmachine_timeout * MSEC_PER_SEC);
> set_state(STOPMACHINE_PREPARE);
>
> for_each_online_cpu(i) {
> @@ -152,9 +199,23 @@ int __stop_machine(int (*fn)(void *), vo
>
> /* We've created all the threads. Wake them all: hold this CPU so one
> * doesn't hit this CPU until we're ready. */
> + cpus_clear(prepared_cpus);
> get_cpu();
> for_each_online_cpu(i)
> wake_up_process(threads[i]);
> +
> + /* Wait all others come to life */
> + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
> + if (time_is_before_jiffies(limit)) {
> + if (!fixup_timeout(threads, cpus)) {
> + /* Tell them all to exit. */
> + set_state(STOPMACHINE_EXIT);
> + active.fnret = -EIO;
> + }
> + break;
> + }
> + cpu_relax();
> + }
>
> /* This will release the thread on our CPU. */
> put_cpu();
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -144,6 +144,10 @@ extern int no_unaligned_warning;
>
> #ifdef CONFIG_RT_MUTEXES
> extern int max_lock_depth;
> +#endif
> +
> +#ifdef CONFIG_STOP_MACHINE
> +extern unsigned long stopmachine_timeout;
> #endif
>
> #ifdef CONFIG_PROC_SYSCTL
> @@ -811,6 +815,17 @@ static struct ctl_table kern_table[] = {
> .procname = "keys",
> .mode = 0555,
> .child = key_sysctls,
> + },
> +#endif
> +#ifdef CONFIG_STOP_MACHINE
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "stopmachine_timeout",
> + .data = &stopmachine_timeout,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0644,
> + .proc_handler = &proc_doulongvec_minmax,
> + .strategy = &sysctl_intvec,
> },
> #endif
> /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-07-15 02:12:58

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

On Tuesday 15 July 2008 07:20:26 Heiko Carstens wrote:
> On Mon, Jul 14, 2008 at 11:56:18AM -0700, Jeremy Fitzhardinge wrote:
> > Rusty Russell wrote:
> > > On Monday 14 July 2008 21:51:25 Christian Borntraeger wrote:
> > >> Am Montag, 14. Juli 2008 schrieb Hidetoshi Seto:
> > >>> + /* Wait all others come to life */
> > >>> + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
> > >>> + if (time_is_before_jiffies(limit))
> > >>> + goto timeout;
> > >>> + cpu_relax();
> > >>> + }
> > >>> +
> > >>
> > >> Hmm. I think this could become interesting on virtual machines. The
> > >> hypervisor might be to busy to schedule a specific cpu at certain load
> > >> scenarios. This would cause a failure even if the cpu is not really
> > >> locked up. We had similar problems with the soft lockup daemon on
> > >> s390.
> > >
> > > 5 seconds is a fairly long time. If all else fails we could have a
> > > config option to simply disable this code.
>
> Hmm.. probably a stupid question: but what could happen that a real cpu
> (not virtual) becomes unresponsive so that it won't schedule a
> MAX_RT_PRIO-1 prioritized task for 5 seconds?

Yes. That's exactly what we're trying to detect. Currently the entire
machine will wedge. With this patch we can often limp along.

Hidetoshi's original problem was a client whose machine had one CPU die, then
got wedged as the emergency backup tried to load a module.

Along these lines, I found VMWare's relaxed co-scheduling interesting, BTW:
http://communities.vmware.com/docs/DOC-4960

> cpu_relax() translates to a hypervisor yield on s390. Probably makes sense
> if other architectures would do the same.

Yes, I think so too. Actually, doing a random yield-to-other-VCPU on
cpu_relax is arguable the right semantic (in Linux it's used for spinning,
almost exclusively to wait for other cpus).

Cheers,
Rusty.

2008-07-15 02:25:24

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

Heiko Carstens wrote:
> On Mon, Jul 14, 2008 at 11:56:18AM -0700, Jeremy Fitzhardinge wrote:
>> Rusty Russell wrote:
>>> On Monday 14 July 2008 21:51:25 Christian Borntraeger wrote:
>>>> Am Montag, 14. Juli 2008 schrieb Hidetoshi Seto:
>>>>
>>>>> + /* Wait all others come to life */
>>>>> + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
>>>>> + if (time_is_before_jiffies(limit))
>>>>> + goto timeout;
>>>>> + cpu_relax();
>>>>> + }
>>>>> +
>>>>>
>>>> Hmm. I think this could become interesting on virtual machines. The
>>>> hypervisor might be to busy to schedule a specific cpu at certain load
>>>> scenarios. This would cause a failure even if the cpu is not really locked
>>>> up. We had similar problems with the soft lockup daemon on s390.
>>> 5 seconds is a fairly long time. If all else fails we could have a config
>>> option to simply disable this code.
>
> Hmm.. probably a stupid question: but what could happen that a real cpu
> (not virtual) becomes unresponsive so that it won't schedule a MAX_RT_PRIO-1
> prioritized task for 5 seconds?
I have a workload where MAX_PRIO RT thread runs and never yeilds. That's what
my cpu isolation patches/tree addresses. Stopmachine is the only (that I know
of) thing that really brakes in that case. btw In case you're wondering yes
we've discussed workqueue threads starvation and stuff in the other threads.
So yet it can happen.

>>>> It would be good to not-use wall-clock time, but really used cpu time
>>>> instead. Unfortunately I have no idea, if that is possible in a generic
>>>> way. Heiko, any ideas?
>>> Ah, cpu time comes up again. Perhaps we should actually dig that up again;
>>> Zach and Jeremy CC'd.
>> Hm, yeah. But in this case, it's tricky. CPU time is an inherently
>> per-cpu quantity. If cpu A is waiting for cpu B, and wants to do the
>> timeout in cpu-seconds, then it has to be in *B*s cpu-seconds (and if A
>> is waiting on B,C,D,E,F... it needs to measure separate timeouts with
>> separate timebases for each other CPU). It also means that if B is
>> unresponsive but also not consuming any time (blocked in IO,
>> administratively paused, etc), then the timeout will never trigger.
>>
>> So I think monotonic wallclock time actually makes the most sense here.
>
> This is asking for trouble... a config option to disable this would be
> nice. But as I don't know which problem this patch originally addresses
> it might be that this is needed anyway. So lets see why we need it first.
How about this. We'll make this a sysctl, as Rusty already did, and set the
default to 0 which means "never timeout". That way crazy people like me who
care about this scenario can enable this feature.

btw Rusty, I just had this "why didn't I think of that" moments. This is
actually another way of handling my workload. I mean it certainly does not fix
the root case of the problems and we still need other things that we talked
about (non-blocking module delete, lock-free module insertion, etc) but at
least in the mean time it avoids wedging the machines for good.
btw I'd like that timeout in milliseconds. I think 5 seconds is way tooooo
long :).

Max

2008-07-15 02:25:53

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

Heiko Carstens wrote:
> Hmm.. probably a stupid question: but what could happen that a real cpu
> (not virtual) becomes unresponsive so that it won't schedule a MAX_RT_PRIO-1
> prioritized task for 5 seconds?

The original problem (once I heard and easily reproduced) was there was an
another MAX_RT_PRIO-1 task and the task was spinning in itself by a bug.
(Now this would not be a problem since RLIMIT_RTTIME will work for it, but
I cannot deny that there are some situations which cannot set the limit.)

However there would be more possible problem in the world, ex. assume that
a routine work with interrupt (and also preemption) disabled have an issue
of scalability so it takes long time on huge machine then stop_machine will
stop whole system such long time. You can assume a driver's bug. Now the
stop_machine is good tool to escalate a partial problem to global suddenly.

>> So I think monotonic wallclock time actually makes the most sense here.
>
> This is asking for trouble... a config option to disable this would be
> nice. But as I don't know which problem this patch originally addresses
> it might be that this is needed anyway. So lets see why we need it first.

I'm not good at VM etc., but I think user doesn't care who holds a cpu,
whether other guest or actual buggy software or space alien or so.
The important thing here is return control to user if timeout.

Thanks,
H.Seto

2008-07-15 02:40:59

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

Hidetoshi Seto wrote:
> Heiko Carstens wrote:
>> Hmm.. probably a stupid question: but what could happen that a real cpu
>> (not virtual) becomes unresponsive so that it won't schedule a MAX_RT_PRIO-1
>> prioritized task for 5 seconds?
>
> The original problem (once I heard and easily reproduced) was there was an
> another MAX_RT_PRIO-1 task and the task was spinning in itself by a bug.
> (Now this would not be a problem since RLIMIT_RTTIME will work for it, but
> I cannot deny that there are some situations which cannot set the limit.)
Yep. As I described in the prev email in my case it's a legitimate thing. Some
of the CPU cores are running wireless basestation schedulers and the deadlines
are way too tight for them to sleep (it's "cpu as a dedicated engine" kind of
thing, they are properly isolated and stuff).

In this case actually RT limit is the first thing that I disable :).
I'd rather have stop_machine fail and tell the user that something is wrong.
In which case they can simply stop the basestation app that is running when
convinient. ie It give control back to the user rather than wedging the box or
killing the app.

Max

2008-07-15 06:10:32

by Heiko Carstens

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

> >> So I think monotonic wallclock time actually makes the most sense here.
> >
> > This is asking for trouble... a config option to disable this would be
> > nice. But as I don't know which problem this patch originally addresses
> > it might be that this is needed anyway. So lets see why we need it first.
> How about this. We'll make this a sysctl, as Rusty already did, and set the
> default to 0 which means "never timeout". That way crazy people like me who
> care about this scenario can enable this feature.

Yes, sounds like that should work for everyone and no additional config
option as well. Sounds good to me.

2008-07-15 07:50:30

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

On Tuesday 15 July 2008 11:11:34 Hidetoshi Seto wrote:
> Hi Rusty,

Hi Hidetoshi,

> However we need to be careful that the stuck CPU can restart unexpectedly.

OK, if you are worried about that race, I think we can still fix it...

> > + for_each_online_cpu(i) {
> > + if (!cpu_isset(i, prepared_cpus) && i != smp_processor_id()) {
> > + bool ignore;
> > +
>
> Note that here is a window that a not-prepared frozen cpu can be thawed and
> become be prepared.
>
> > + /* If we wanted to run on a particular CPU, and that's
> > + * the one which is stuck, it's a real failure. */
> > + ignore = !cpus || !cpu_isset(i, *cpus);
> > + printk(KERN_CRIT "stopmachine: cpu#%d seems to be "
> > + "stuck, %s.\n",
> > + i, ignore ? "ignoring" : "FAILING");
> > + /* Unbind thread: it will exit when it sees
> > + * that prepared_cpus bit set. */
> > + set_cpus_allowed(threads[i], cpu_online_map);
>
> Unbinded threads still can wake up on a cpu where they originally targeted.

What if we use:
if (i != smp_processor_id() && !cpu_test_and_set(i, prepared_cpus)) {

instead of cpu_isset? That means that if a CPU restarts during that window,
either the thread will exit (because we set the bit here), or this will
detect it.

Hmm, there's still the vague possibility that the thread doesn't schedule
until we start a new stop_machine (and clear prepared_cpus). We could simply
loop in the main thread if any threads are alive, before freeing them (inside
the lock). A counter and notifier is the other way, but it seems like
overkill for a very unlikely event.

Thanks for the analysis!
Rusty.

2008-07-15 08:12:06

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

On Tuesday 15 July 2008 12:24:54 Max Krasnyansky wrote:
> Heiko Carstens wrote:
> > On Mon, Jul 14, 2008 at 11:56:18AM -0700, Jeremy Fitzhardinge wrote:
> > This is asking for trouble... a config option to disable this would be
> > nice. But as I don't know which problem this patch originally addresses
> > it might be that this is needed anyway. So lets see why we need it first.
>
> How about this. We'll make this a sysctl, as Rusty already did, and set the
> default to 0 which means "never timeout". That way crazy people like me who
> care about this scenario can enable this feature.

Indeed, this was my thought too. s390 can initialize it to zero somewhere in
their boot code.

> btw Rusty, I just had this "why didn't I think of that" moments. This is
> actually another way of handling my workload. I mean it certainly does not
> fix the root case of the problems and we still need other things that we
> talked about (non-blocking module delete, lock-free module insertion, etc)
> but at least in the mean time it avoids wedging the machines for good.
> btw I'd like that timeout in milliseconds. I think 5 seconds is way tooooo
> long :).

We can make it ms, sure. 200ms should be plenty of time: worst I ever saw was
150ms, and that was some weird Power box doing crazy stuff. I wouldn't be
surprised if you'd never see 1ms on your hardware.

The ipi idea would handle your case a little more nicely, too, but that's
probably not going to hit this merge window.

Cheers,
Rusty.

2008-07-15 08:39:32

by Heiko Carstens

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

On Tue, Jul 15, 2008 at 06:09:59PM +1000, Rusty Russell wrote:
> On Tuesday 15 July 2008 12:24:54 Max Krasnyansky wrote:
> > Heiko Carstens wrote:
> > > On Mon, Jul 14, 2008 at 11:56:18AM -0700, Jeremy Fitzhardinge wrote:
> > > This is asking for trouble... a config option to disable this would be
> > > nice. But as I don't know which problem this patch originally addresses
> > > it might be that this is needed anyway. So lets see why we need it first.
> >
> > How about this. We'll make this a sysctl, as Rusty already did, and set the
> > default to 0 which means "never timeout". That way crazy people like me who
> > care about this scenario can enable this feature.
>
> Indeed, this was my thought too. s390 can initialize it to zero somewhere in
> their boot code.

Shouldn't we default to zero and let whowever wants something different
configure that via sysctl.conf?
Whatever default value > 0 you pick is likely wrong for whatever specific
usecase.

2008-07-15 08:51:45

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

Rusty Russell wrote:
> On Tuesday 15 July 2008 12:24:54 Max Krasnyansky wrote:
>> Heiko Carstens wrote:
>>> On Mon, Jul 14, 2008 at 11:56:18AM -0700, Jeremy Fitzhardinge wrote:
>>> This is asking for trouble... a config option to disable this would be
>>> nice. But as I don't know which problem this patch originally addresses
>>> it might be that this is needed anyway. So lets see why we need it first.
>> How about this. We'll make this a sysctl, as Rusty already did, and set the
>> default to 0 which means "never timeout". That way crazy people like me who
>> care about this scenario can enable this feature.
>
> Indeed, this was my thought too. s390 can initialize it to zero somewhere in
> their boot code.
>
>> btw Rusty, I just had this "why didn't I think of that" moments. This is
>> actually another way of handling my workload. I mean it certainly does not
>> fix the root case of the problems and we still need other things that we
>> talked about (non-blocking module delete, lock-free module insertion, etc)
>> but at least in the mean time it avoids wedging the machines for good.
>> btw I'd like that timeout in milliseconds. I think 5 seconds is way tooooo
>> long :).
>
> We can make it ms, sure. 200ms should be plenty of time: worst I ever saw was
> 150ms, and that was some weird Power box doing crazy stuff. I wouldn't be
> surprised if you'd never see 1ms on your hardware.
Sounds good.

> The ipi idea would handle your case a little more nicely, too, but that's
> probably not going to hit this merge window.
Which reminds me that I wanted to submit a bunch of kthread and workqueue
related things in this window :).

Max

2008-07-16 04:06:28

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

Hi Rusty,

Rusty Russell wrote:
> On Tuesday 15 July 2008 11:11:34 Hidetoshi Seto wrote:
>> However we need to be careful that the stuck CPU can restart unexpectedly.
>
> OK, if you are worried about that race, I think we can still fix it...

After having a relaxing day, once I said:
"I like your idea that if we did not want to do something on the stuck CPU
then treat the CPU as stopped."
but now I noticed that the stuck CPU can harm what we want to do if it is
not real stuck... ex. busy loop in a subsystem, and we want to touch the
core of the subsystem exclusively.
So "force progress" is not safe, on some rare case. I'd like to make this
timeout feature as a safe-net, therefore we should return error without
taking a risk even it would be small, I think.

> Hmm, there's still the vague possibility that the thread doesn't schedule
> until we start a new stop_machine (and clear prepared_cpus). We could simply
> loop in the main thread if any threads are alive, before freeing them (inside
> the lock). A counter and notifier is the other way, but it seems like
> overkill for a very unlikely event.

I suppose my current implementation, returning control to user immediately,
is better than looping in main thread. In my implementation, num_threads is
initialized to num_online_cpus() by main thread, and decremented 1 by 1
each child thread. If time out happen, main thread will return without
waiting completion but set state STOPMACHINE_EXIT. Then child threads are now
detached from usual procedure, so they exit soon without do any work.

At the beginning of new stop_machine, we can check the num_threads to know
whether there are remaining child threads. If there are, something is wrong
since the system cannot run MAX_PRIO RT thread, not binded to typical cpu now.
So we can return error in such case, assuming that the new stop_machine will
fail in same way.

Anyway, I also think we can better thing here, but we don't need to do all
at once. Making steps by incremental patches would be nice, I think.

Thanks,
H.Seto

2008-07-16 04:29:20

[permalink] [raw]

Subject: [PATCH] stopmachine: add stopmachine_timeout v2

Thank you for useful feedbacks!
Here is the updated version.
Could you put this on top of your patches, Rusty?

Thanks,
H.Seto

If stop_machine() invoked while one of onlined cpu is locked up
by some reason, stop_machine cannot finish its work because the
locked cpu cannot stop. This means all other healthy cpus
will be blocked infinitely by one dead cpu.

This patch allows stop_machine to return -EBUSY with some printk
messages if any of stop_machine's threads cannot start running on
its target cpu.

v2:
- remove fix for warning since it will be fixed upcoming typesafe
patches
- make stopmachine_timeout from secs to msecs, and set default to
200 msec (since v1's arbitrary 5 sec is too long)
- allow disabling timeout by setting the stopmachine_timeout to 0

Signed-off-by: Hidetoshi Seto <[email protected]>
---
kernel/stop_machine.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sysctl.c | 15 +++++++++++++
2 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 5b72c2b..2968b8a 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -35,15 +35,18 @@ struct stop_machine_data {
};

/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
-static unsigned int num_threads;
+static atomic_t num_threads;
static atomic_t thread_ack;
+static cpumask_t prepared_cpus;
static struct completion finished;
static DEFINE_MUTEX(lock);

+unsigned long stopmachine_timeout = 200; /* msecs, arbitrary */
+
static void set_state(enum stopmachine_state newstate)
{
/* Reset ack counter. */
- atomic_set(&thread_ack, num_threads);
+ atomic_set(&thread_ack, atomic_read(&num_threads));
smp_wmb();
state = newstate;
}
@@ -67,6 +70,8 @@ static int stop_cpu(struct stop_machine_data *smdata)
enum stopmachine_state curstate = STOPMACHINE_NONE;
int uninitialized_var(ret);

+ cpu_set(smp_processor_id(), prepared_cpus);
+
/* Simple state machine */
do {
/* Chill out and ensure we re-read stopmachine_state. */
@@ -90,6 +95,7 @@ static int stop_cpu(struct stop_machine_data *smdata)
}
} while (curstate != STOPMACHINE_EXIT);

+ atomic_dec(&num_threads);
local_irq_enable();
do_exit(0);
}
@@ -105,6 +111,15 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
int i, err;
struct stop_machine_data active, idle;
struct task_struct **threads;
+ unsigned long limit;
+
+ if (atomic_read(&num_threads)) {
+ /*
+ * previous stop_machine was timeout, and still there are some
+ * unfinished thread (dangling stucked CPU?).
+ */
+ return -EBUSY;
+ }

active.fn = fn;
active.data = data;
@@ -120,7 +135,7 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
/* Set up initial state. */
mutex_lock(&lock);
init_completion(&finished);
- num_threads = num_online_cpus();
+ atomic_set(&num_threads, num_online_cpus());
set_state(STOPMACHINE_PREPARE);

for_each_online_cpu(i) {
@@ -152,10 +167,21 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)

/* We've created all the threads. Wake them all: hold this CPU so one
* doesn't hit this CPU until we're ready. */
+ cpus_clear(prepared_cpus);
get_cpu();
for_each_online_cpu(i)
wake_up_process(threads[i]);

+ /* Wait all others come to life */
+ if (stopmachine_timeout) {
+ limit = jiffies + msecs_to_jiffies(stopmachine_timeout);
+ while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
+ if (time_is_before_jiffies(limit))
+ goto timeout;
+ cpu_relax();
+ }
+ }
+
/* This will release the thread on our CPU. */
put_cpu();
wait_for_completion(&finished);
@@ -169,10 +195,32 @@ kill_threads:
for_each_online_cpu(i)
if (threads[i])
kthread_stop(threads[i]);
+ atomic_set(&num_threads, 0);
mutex_unlock(&lock);

kfree(threads);
return err;
+
+timeout:
+ printk(KERN_CRIT "stopmachine: Failed to stop machine in time(%lds).\n",
+ stopmachine_timeout);
+ for_each_online_cpu(i) {
+ if (!cpu_isset(i, prepared_cpus) && i != smp_processor_id())
+ printk(KERN_CRIT "stopmachine: cpu#%d seems to be "
+ "stuck.\n", i);
+ /* Unbind threads */
+ set_cpus_allowed(threads[i], cpu_online_map);
+ }
+
+ /* Let threads go exit */
+ set_state(STOPMACHINE_EXIT);
+
+ put_cpu();
+ /* no wait for completion */
+ mutex_unlock(&lock);
+ kfree(threads);
+
+ return -EBUSY; /* canceled */
}

int stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2911665..3c7ca98 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -146,6 +146,10 @@ extern int no_unaligned_warning;
extern int max_lock_depth;
#endif

+#ifdef CONFIG_STOP_MACHINE
+extern unsigned long stopmachine_timeout;
+#endif
+
#ifdef CONFIG_PROC_SYSCTL
static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
@@ -813,6 +817,17 @@ static struct ctl_table kern_table[] = {
.child = key_sysctls,
},
#endif
+#ifdef CONFIG_STOP_MACHINE
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "stopmachine_timeout",
+ .data = &stopmachine_timeout,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = &proc_doulongvec_minmax,
+ .strategy = &sysctl_intvec,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
--

2008-07-16 06:23:39

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout v2

Hidetoshi Seto wrote:
> Thank you for useful feedbacks!
> Here is the updated version.
> Could you put this on top of your patches, Rusty?
>
> Thanks,
> H.Seto
>
>
> If stop_machine() invoked while one of onlined cpu is locked up
> by some reason, stop_machine cannot finish its work because the
> locked cpu cannot stop. This means all other healthy cpus
> will be blocked infinitely by one dead cpu.
>
> This patch allows stop_machine to return -EBUSY with some printk
> messages if any of stop_machine's threads cannot start running on
> its target cpu.
>
> v2:
> - remove fix for warning since it will be fixed upcoming typesafe
> patches
> - make stopmachine_timeout from secs to msecs, and set default to
> 200 msec (since v1's arbitrary 5 sec is too long)
> - allow disabling timeout by setting the stopmachine_timeout to 0
>

I'd set the default to zero. I beleive that's what Heiko suggested too.

Max

2008-07-16 06:37:51

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout v2

Max Krasnyansky wrote:
> I'd set the default to zero. I beleive that's what Heiko suggested too.

Oh, yes, you are right. I missed to catch the suggestion.
I'll post fixed version soon. Wait a minutes...

Thanks,
H.Seto

2008-07-16 06:54:24

[permalink] [raw]

Subject: [PATCH] stopmachine: add stopmachine_timeout v3

If stop_machine() invoked while one of onlined cpu is locked up
by some reason, stop_machine cannot finish its work because the
locked cpu cannot stop. This means all other healthy cpus
will be blocked infinitely by one dead cpu.

This patch allows stop_machine to return -EBUSY with some printk
messages if any of stop_machine's threads cannot start running on
its target cpu in time. You can enable this timeout via sysctl.

v3:
- set stopmachine_timeout default to 0 (= never timeout)

v2:
- remove fix for warning since it will be fixed upcoming typesafe
patches
- make stopmachine_timeout from secs to msecs
- allow disabling timeout by setting the stopmachine_timeout to 0

Signed-off-by: Hidetoshi Seto <[email protected]>
---
kernel/stop_machine.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sysctl.c | 15 +++++++++++++
2 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 5b72c2b..77b7944 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -35,15 +35,18 @@ struct stop_machine_data {
};

/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
-static unsigned int num_threads;
+static atomic_t num_threads;
static atomic_t thread_ack;
+static cpumask_t prepared_cpus;
static struct completion finished;
static DEFINE_MUTEX(lock);

+unsigned long stopmachine_timeout = 0; /* msecs, 0 = "never timeout" */
+
static void set_state(enum stopmachine_state newstate)
{
/* Reset ack counter. */
- atomic_set(&thread_ack, num_threads);
+ atomic_set(&thread_ack, atomic_read(&num_threads));
smp_wmb();
state = newstate;
}
@@ -67,6 +70,8 @@ static int stop_cpu(struct stop_machine_data *smdata)
enum stopmachine_state curstate = STOPMACHINE_NONE;
int uninitialized_var(ret);

+ cpu_set(smp_processor_id(), prepared_cpus);
+
/* Simple state machine */
do {
/* Chill out and ensure we re-read stopmachine_state. */
@@ -90,6 +95,7 @@ static int stop_cpu(struct stop_machine_data *smdata)
}
} while (curstate != STOPMACHINE_EXIT);

+ atomic_dec(&num_threads);
local_irq_enable();
do_exit(0);
}
@@ -105,6 +111,15 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
int i, err;
struct stop_machine_data active, idle;
struct task_struct **threads;
+ unsigned long limit;
+
+ if (atomic_read(&num_threads)) {
+ /*
+ * previous stop_machine was timeout, and still there are some
+ * unfinished thread (dangling stucked CPU?).
+ */
+ return -EBUSY;
+ }

active.fn = fn;
active.data = data;
@@ -120,7 +135,7 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
/* Set up initial state. */
mutex_lock(&lock);
init_completion(&finished);
- num_threads = num_online_cpus();
+ atomic_set(&num_threads, num_online_cpus());
set_state(STOPMACHINE_PREPARE);

for_each_online_cpu(i) {
@@ -152,10 +167,21 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)

/* We've created all the threads. Wake them all: hold this CPU so one
* doesn't hit this CPU until we're ready. */
+ cpus_clear(prepared_cpus);
get_cpu();
for_each_online_cpu(i)
wake_up_process(threads[i]);

+ /* Wait all others come to life */
+ if (stopmachine_timeout) {
+ limit = jiffies + msecs_to_jiffies(stopmachine_timeout);
+ while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
+ if (time_is_before_jiffies(limit))
+ goto timeout;
+ cpu_relax();
+ }
+ }
+
/* This will release the thread on our CPU. */
put_cpu();
wait_for_completion(&finished);
@@ -169,10 +195,32 @@ kill_threads:
for_each_online_cpu(i)
if (threads[i])
kthread_stop(threads[i]);
+ atomic_set(&num_threads, 0);
mutex_unlock(&lock);

kfree(threads);
return err;
+
+timeout:
+ printk(KERN_CRIT "stopmachine: Failed to stop machine in time(%lds).\n",
+ stopmachine_timeout);
+ for_each_online_cpu(i) {
+ if (!cpu_isset(i, prepared_cpus) && i != smp_processor_id())
+ printk(KERN_CRIT "stopmachine: cpu#%d seems to be "
+ "stuck.\n", i);
+ /* Unbind threads */
+ set_cpus_allowed(threads[i], cpu_online_map);
+ }
+
+ /* Let threads go exit */
+ set_state(STOPMACHINE_EXIT);
+
+ put_cpu();
+ /* no wait for completion */
+ mutex_unlock(&lock);
+ kfree(threads);
+
+ return -EBUSY; /* canceled */
}

int stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2911665..3c7ca98 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -146,6 +146,10 @@ extern int no_unaligned_warning;
extern int max_lock_depth;
#endif

+#ifdef CONFIG_STOP_MACHINE
+extern unsigned long stopmachine_timeout;
+#endif
+
#ifdef CONFIG_PROC_SYSCTL
static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
@@ -813,6 +817,17 @@ static struct ctl_table kern_table[] = {
.child = key_sysctls,
},
#endif
+#ifdef CONFIG_STOP_MACHINE
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "stopmachine_timeout",
+ .data = &stopmachine_timeout,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = &proc_doulongvec_minmax,
+ .strategy = &sysctl_intvec,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
--

2008-07-16 07:33:23

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout v3

On Wed, 2008-07-16 at 15:51 +0900, Hidetoshi Seto wrote:
> If stop_machine() invoked while one of onlined cpu is locked up
> by some reason, stop_machine cannot finish its work because the
> locked cpu cannot stop. This means all other healthy cpus
> will be blocked infinitely by one dead cpu.
>
> This patch allows stop_machine to return -EBUSY with some printk
> messages if any of stop_machine's threads cannot start running on
> its target cpu in time. You can enable this timeout via sysctl.
>
> v3:
> - set stopmachine_timeout default to 0 (= never timeout)
>
> v2:
> - remove fix for warning since it will be fixed upcoming typesafe
> patches
> - make stopmachine_timeout from secs to msecs
> - allow disabling timeout by setting the stopmachine_timeout to 0
>
> Signed-off-by: Hidetoshi Seto <[email protected]>

I really don't like this, it means the system is really screwed up and
doesn't deserve to continue.

> ---
> kernel/stop_machine.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++--
> kernel/sysctl.c | 15 +++++++++++++
> 2 files changed, 66 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index 5b72c2b..77b7944 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -35,15 +35,18 @@ struct stop_machine_data {
> };
>
> /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
> -static unsigned int num_threads;
> +static atomic_t num_threads;
> static atomic_t thread_ack;
> +static cpumask_t prepared_cpus;
> static struct completion finished;
> static DEFINE_MUTEX(lock);
>
> +unsigned long stopmachine_timeout = 0; /* msecs, 0 = "never timeout" */
> +
> static void set_state(enum stopmachine_state newstate)
> {
> /* Reset ack counter. */
> - atomic_set(&thread_ack, num_threads);
> + atomic_set(&thread_ack, atomic_read(&num_threads));
> smp_wmb();
> state = newstate;
> }
> @@ -67,6 +70,8 @@ static int stop_cpu(struct stop_machine_data *smdata)
> enum stopmachine_state curstate = STOPMACHINE_NONE;
> int uninitialized_var(ret);
>
> + cpu_set(smp_processor_id(), prepared_cpus);
> +
> /* Simple state machine */
> do {
> /* Chill out and ensure we re-read stopmachine_state. */
> @@ -90,6 +95,7 @@ static int stop_cpu(struct stop_machine_data *smdata)
> }
> } while (curstate != STOPMACHINE_EXIT);
>
> + atomic_dec(&num_threads);
> local_irq_enable();
> do_exit(0);
> }
> @@ -105,6 +111,15 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
> int i, err;
> struct stop_machine_data active, idle;
> struct task_struct **threads;
> + unsigned long limit;
> +
> + if (atomic_read(&num_threads)) {
> + /*
> + * previous stop_machine was timeout, and still there are some
> + * unfinished thread (dangling stucked CPU?).
> + */
> + return -EBUSY;
> + }
>
> active.fn = fn;
> active.data = data;
> @@ -120,7 +135,7 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
> /* Set up initial state. */
> mutex_lock(&lock);
> init_completion(&finished);
> - num_threads = num_online_cpus();
> + atomic_set(&num_threads, num_online_cpus());
> set_state(STOPMACHINE_PREPARE);
>
> for_each_online_cpu(i) {
> @@ -152,10 +167,21 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
>
> /* We've created all the threads. Wake them all: hold this CPU so one
> * doesn't hit this CPU until we're ready. */
> + cpus_clear(prepared_cpus);
> get_cpu();
> for_each_online_cpu(i)
> wake_up_process(threads[i]);
>
> + /* Wait all others come to life */
> + if (stopmachine_timeout) {
> + limit = jiffies + msecs_to_jiffies(stopmachine_timeout);
> + while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
> + if (time_is_before_jiffies(limit))
> + goto timeout;
> + cpu_relax();
> + }
> + }
> +
> /* This will release the thread on our CPU. */
> put_cpu();
> wait_for_completion(&finished);
> @@ -169,10 +195,32 @@ kill_threads:
> for_each_online_cpu(i)
> if (threads[i])
> kthread_stop(threads[i]);
> + atomic_set(&num_threads, 0);
> mutex_unlock(&lock);
>
> kfree(threads);
> return err;
> +
> +timeout:
> + printk(KERN_CRIT "stopmachine: Failed to stop machine in time(%lds).\n",
> + stopmachine_timeout);
> + for_each_online_cpu(i) {
> + if (!cpu_isset(i, prepared_cpus) && i != smp_processor_id())
> + printk(KERN_CRIT "stopmachine: cpu#%d seems to be "
> + "stuck.\n", i);
> + /* Unbind threads */
> + set_cpus_allowed(threads[i], cpu_online_map);
> + }
> +
> + /* Let threads go exit */
> + set_state(STOPMACHINE_EXIT);
> +
> + put_cpu();
> + /* no wait for completion */
> + mutex_unlock(&lock);
> + kfree(threads);
> +
> + return -EBUSY; /* canceled */
> }
>
> int stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 2911665..3c7ca98 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -146,6 +146,10 @@ extern int no_unaligned_warning;
> extern int max_lock_depth;
> #endif
>
> +#ifdef CONFIG_STOP_MACHINE
> +extern unsigned long stopmachine_timeout;
> +#endif
> +
> #ifdef CONFIG_PROC_SYSCTL
> static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp,
> void __user *buffer, size_t *lenp, loff_t *ppos);
> @@ -813,6 +817,17 @@ static struct ctl_table kern_table[] = {
> .child = key_sysctls,
> },
> #endif
> +#ifdef CONFIG_STOP_MACHINE
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "stopmachine_timeout",
> + .data = &stopmachine_timeout,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0644,
> + .proc_handler = &proc_doulongvec_minmax,
> + .strategy = &sysctl_intvec,
> + },
> +#endif
> /*
> * NOTE: do not add new entries to this table unless you have read
> * Documentation/sysctl/ctl_unnumbered.txt

2008-07-16 08:16:06

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout v3

Peter Zijlstra wrote:
> I really don't like this, it means the system is really screwed up and
> doesn't deserve to continue.

It can be said that after timeout we just back to previous state, where
machine already limp(=partially screwed up), but have some degree of
performance. We might be able to do some recovery, such as killing
process, restart or reset of subsystem and so on. Even if a CPU get
stuck, it might be possible to continue its service with remaining
CPUs, ex. assume there are 1024 CPUs total.
(I wish if we were able to force-reset such unstable CPU in future...)

I agree that there are much amount of situation where this feature is
not acceptable. But there would be others.

Thanks,
H.Seto

2008-07-16 09:17:00

by Christian Borntraeger

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

Am Dienstag, 15. Juli 2008 schrieb Rusty Russell:
> > btw Rusty, I just had this "why didn't I think of that" moments. This is
> > actually another way of handling my workload. I mean it certainly does not
> > fix the root case of the problems and we still need other things that we
> > talked about (non-blocking module delete, lock-free module insertion, etc)
> > but at least in the mean time it avoids wedging the machines for good.
> > btw I'd like that timeout in milliseconds. I think 5 seconds is way tooooo
> > long :).
>
> We can make it ms, sure. 200ms should be plenty of time: worst I ever saw
was
> 150ms, and that was some weird Power box doing crazy stuff. I wouldn't be
> surprised if you'd never see 1ms on your hardware.

I disagree that 5 seconds is to long :-). I even think having it default to 0
is the safest option for virtualized environments. What if the host is paging
like hell and the vcpu cannot run due to a missing page? In that case 200ms
can be an incredible short amount of time. If the timeout triggers,
stop_machine_run fails, but everything would work fine - it just takes
longer.

2008-07-16 10:11:40

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout v2

Hidetoshi Seto wrote:
> +#ifdef CONFIG_STOP_MACHINE
> +extern unsigned long stopmachine_timeout;
> +#endif
>

No externs in C files. Put it in an appropriate header.

I'll do a proper review soon.

J

2008-07-17 03:41:03

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout v2

Jeremy Fitzhardinge wrote:
> Hidetoshi Seto wrote:
>> +#ifdef CONFIG_STOP_MACHINE
>> +extern unsigned long stopmachine_timeout;
>> +#endif
>
> No externs in C files. Put it in an appropriate header.

sysctl.c already has many externs... but I can fix at least
the above.

> I'll do a proper review soon.

Is it better to postpone v4 until your comment comes?

Thanks,
H.Seto

2008-07-17 05:37:38

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout v2

Hidetoshi Seto wrote:
> sysctl.c already has many externs... but I can fix at least
> the above.
>

Yeah, but it's an ugly pattern we'd rather not encourage.

>> I'll do a proper review soon.
>>
>
> Is it better to postpone v4 until your comment comes?
>

No.

J

2008-07-17 06:15:06

[permalink] [raw]

Subject: [PATCH] stopmachine: add stopmachine_timeout v4

If stop_machine() invoked while one of onlined cpu is locked up
by some reason, stop_machine cannot finish its work because the
locked cpu cannot stop. This means all other healthy cpus
will be blocked infinitely by one dead cpu.

This patch allows stop_machine to return -EBUSY with some printk
messages if any of stop_machine's threads cannot start running on
its target cpu in time. You can enable this timeout via sysctl.

v4:
- move extern into linux/stop_machine.h and add include of the
header to kernel/sysctl.c. Now stopmachine_timeout is available
only on smp.

v3:
- set stopmachine_timeout default to 0 (= never timeout)

v2:
- remove fix for warning since it will be fixed upcoming typesafe
patches
- make stopmachine_timeout from secs to msecs
- allow disabling timeout by setting the stopmachine_timeout to 0

Signed-off-by: Hidetoshi Seto <[email protected]>
---
include/linux/stop_machine.h | 3 ++
kernel/stop_machine.c | 54 +++++++++++++++++++++++++++++++++++++++--
kernel/sysctl.c | 12 +++++++++
3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 0a7815c..4c934f7 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -13,6 +13,9 @@
/* Deprecated, but useful for transition. */
#define ALL_CPUS CPU_MASK_ALL_PTR

+/* for sysctl entry */
+extern unsigned long stopmachine_timeout;
+
/**
* stop_machine_run: freeze the machine on all CPUs and run this function
* @fn: the function to run
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 5b72c2b..9059b9e 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -35,15 +35,18 @@ struct stop_machine_data {
};

/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
-static unsigned int num_threads;
+static atomic_t num_threads;
static atomic_t thread_ack;
+static cpumask_t prepared_cpus;
static struct completion finished;
static DEFINE_MUTEX(lock);

+unsigned long stopmachine_timeout; /* msecs, default is 0 = "never timeout" */
+
static void set_state(enum stopmachine_state newstate)
{
/* Reset ack counter. */
- atomic_set(&thread_ack, num_threads);
+ atomic_set(&thread_ack, atomic_read(&num_threads));
smp_wmb();
state = newstate;
}
@@ -67,6 +70,8 @@ static int stop_cpu(struct stop_machine_data *smdata)
enum stopmachine_state curstate = STOPMACHINE_NONE;
int uninitialized_var(ret);

+ cpu_set(smp_processor_id(), prepared_cpus);
+
/* Simple state machine */
do {
/* Chill out and ensure we re-read stopmachine_state. */
@@ -90,6 +95,7 @@ static int stop_cpu(struct stop_machine_data *smdata)
}
} while (curstate != STOPMACHINE_EXIT);

+ atomic_dec(&num_threads);
local_irq_enable();
do_exit(0);
}
@@ -105,6 +111,15 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
int i, err;
struct stop_machine_data active, idle;
struct task_struct **threads;
+ unsigned long limit;
+
+ if (atomic_read(&num_threads)) {
+ /*
+ * previous stop_machine was timeout, and still there are some
+ * unfinished thread (dangling stucked CPU?).
+ */
+ return -EBUSY;
+ }

active.fn = fn;
active.data = data;
@@ -120,7 +135,7 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
/* Set up initial state. */
mutex_lock(&lock);
init_completion(&finished);
- num_threads = num_online_cpus();
+ atomic_set(&num_threads, num_online_cpus());
set_state(STOPMACHINE_PREPARE);

for_each_online_cpu(i) {
@@ -152,10 +167,21 @@ int __stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)

/* We've created all the threads. Wake them all: hold this CPU so one
* doesn't hit this CPU until we're ready. */
+ cpus_clear(prepared_cpus);
get_cpu();
for_each_online_cpu(i)
wake_up_process(threads[i]);

+ /* Wait all others come to life */
+ if (stopmachine_timeout) {
+ limit = jiffies + msecs_to_jiffies(stopmachine_timeout);
+ while (cpus_weight(prepared_cpus) != num_online_cpus() - 1) {
+ if (time_is_before_jiffies(limit))
+ goto timeout;
+ cpu_relax();
+ }
+ }
+
/* This will release the thread on our CPU. */
put_cpu();
wait_for_completion(&finished);
@@ -169,10 +195,32 @@ kill_threads:
for_each_online_cpu(i)
if (threads[i])
kthread_stop(threads[i]);
+ atomic_set(&num_threads, 0);
mutex_unlock(&lock);

kfree(threads);
return err;
+
+timeout:
+ printk(KERN_CRIT "stopmachine: Failed to stop machine in time(%lds).\n",
+ stopmachine_timeout);
+ for_each_online_cpu(i) {
+ if (!cpu_isset(i, prepared_cpus) && i != smp_processor_id())
+ printk(KERN_CRIT "stopmachine: cpu#%d seems to be "
+ "stuck.\n", i);
+ /* Unbind threads */
+ set_cpus_allowed(threads[i], cpu_online_map);
+ }
+
+ /* Let threads go exit */
+ set_state(STOPMACHINE_EXIT);
+
+ put_cpu();
+ /* no wait for completion */
+ mutex_unlock(&lock);
+ kfree(threads);
+
+ return -EBUSY; /* canceled */
}

int stop_machine_run(int (*fn)(void *), void *data, const cpumask_t *cpus)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2911665..d9e9900 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -46,6 +46,7 @@
#include <linux/nfs_fs.h>
#include <linux/acpi.h>
#include <linux/reboot.h>
+#include <linux/stop_machine.h>

#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -813,6 +814,17 @@ static struct ctl_table kern_table[] = {
.child = key_sysctls,
},
#endif
+#if defined(CONFIG_STOP_MACHINE) && defined(CONFIG_SMP)
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "stopmachine_timeout",
+ .data = &stopmachine_timeout,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = &proc_doulongvec_minmax,
+ .strategy = &sysctl_intvec,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
--

2008-07-17 07:10:11

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout v4

Hidetoshi Seto wrote:
> If stop_machine() invoked while one of onlined cpu is locked up
> by some reason, stop_machine cannot finish its work because the
> locked cpu cannot stop. This means all other healthy cpus
> will be blocked infinitely by one dead cpu.
>
> This patch allows stop_machine to return -EBUSY with some printk
> messages if any of stop_machine's threads cannot start running on
> its target cpu in time. You can enable this timeout via sysctl.
>
> v4:
> - move extern into linux/stop_machine.h and add include of the
> header to kernel/sysctl.c. Now stopmachine_timeout is available
> only on smp.
>
> v3:
> - set stopmachine_timeout default to 0 (= never timeout)
>
> v2:
> - remove fix for warning since it will be fixed upcoming typesafe
> patches
> - make stopmachine_timeout from secs to msecs
> - allow disabling timeout by setting the stopmachine_timeout to 0
>
> Signed-off-by: Hidetoshi Seto <[email protected]>
> ---
> include/linux/stop_machine.h | 3 ++
> kernel/stop_machine.c | 54 +++++++++++++++++++++++++++++++++++++++--
> kernel/sysctl.c | 12 +++++++++
> 3 files changed, 66 insertions(+), 3 deletions(-)

Looks good to me.
Ack.

Max

2008-07-18 04:18:20

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout v2

On Thursday 17 July 2008 13:40:14 Hidetoshi Seto wrote:
> Jeremy Fitzhardinge wrote:
> > Hidetoshi Seto wrote:
> >> +#ifdef CONFIG_STOP_MACHINE
> >> +extern unsigned long stopmachine_timeout;
> >> +#endif
> >
> > No externs in C files. Put it in an appropriate header.
>
> sysctl.c already has many externs... but I can fix at least
> the above.

I already patched this; checkpatch.pl warned about it :)

> > I'll do a proper review soon.
>
> Is it better to postpone v4 until your comment comes?

I think the only other real change is to make the value in ms, not seconds,
and allow a 0 value to mean "don't check".

Cheers,
Rusty.

2008-07-20 09:46:31

[permalink] [raw]

Subject: Re: [PATCH] stopmachine: add stopmachine_timeout

On Wednesday 16 July 2008 14:05:31 Hidetoshi Seto wrote:
> Hi Rusty,
>
> Rusty Russell wrote:
> > On Tuesday 15 July 2008 11:11:34 Hidetoshi Seto wrote:
> >> However we need to be careful that the stuck CPU can restart
> >> unexpectedly.
> >
> > OK, if you are worried about that race, I think we can still fix it...
>
> After having a relaxing day, once I said:
> "I like your idea that if we did not want to do something on the stuck CPU
> then treat the CPU as stopped."
> but now I noticed that the stuck CPU can harm what we want to do if it is
> not real stuck... ex. busy loop in a subsystem, and we want to touch the
> core of the subsystem exclusively.

No. You aim for perfection, but there is no "right" answer other than "don't
get your system into this mess". Whatever we do is going to be an educated
guess. And guessing that there'll be no race is a very good guess indeed.

The scenario we are addressing is a stuck CPU and module load. If we fail
stop machine, module load fails.

That is why we should continue if we can. It is also why the default timeout
cannot be 0. You can't turn this on once you notice there's a problem: it's
too late.

If we don't want to handle this case, let's not apply any patch at all.
Rusty.

2008-07-22 03:29:41