2008-08-16 16:34:31

by Jeremy Fitzhardinge

Subject: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting

There are various places in the kernel which wish to wait for a
condition to come true while in a non-blocking context. Existing
examples of this are stop_machine() and smp_call_function_mask().
(No doubt there are other instances of this pattern in the tree.)

Thus far, the only way to achieve this is by spinning with a
cpu_relax() loop. This is fine if the condition becomes true very
quickly, but it is not ideal:

- There's little opportunity to put the CPUs into a low-power state.
cpu_relax() may do this to some extent, but if the wait is
relatively long, then we can probably do better.

- In a virtual environment, spinning virtual CPUs just waste CPU
resources, and may steal CPU time from vCPUs which need it to make
progress. The trigger API allows the vCPUs to give up their CPU
entirely. The s390 people observed a problem with stop_machine
taking a very long time (seconds) when there are more vcpus than
available cpus.

The trigger API is simple:

To initialize a trigger, you can either do it statically with:

DEFINE_TRIGGER(trigger);

or dynamically with

trigger_init(&trigger);

Then to use it, the wait side does:

trigger_reset(&trigger);

while(!condition)
trigger_wait(&trigger);

trigger_finish(&trigger);

and when the condition is set true:

condition = true;
trigger_kick(&trigger);
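
For instance, waiting for a flag set by another cpu might look like this
(the flag and trigger names here are made up for the example):

static bool work_done;
static DEFINE_TRIGGER(work_trigger);

/* waiter: preemption disabled */
trigger_reset(&work_trigger);
while (!work_done)
        trigger_wait(&work_trigger);
trigger_finish(&work_trigger);

/* on another cpu, possibly from interrupt context */
work_done = true;
trigger_kick(&work_trigger);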

Some points to note:

- the wait side of the trigger must have preemption disabled (but
interrupts may be enabled)

- the kick side may be any context

- the trigger is "sticky", so that if it's kicked before entering the
wait, the wait terminates immediately

- you must check the condition between the trigger_reset() and
trigger_wait(), and between calls to trigger_wait()

- trigger_wait() implicitly resets the trigger

- trigger_kick() is a write barrier

- trigger_wait() is a read barrier

- the implementation may disable interrupts between trigger_reset()
and trigger_wait(); if interrupts were enabled at that point, they
will be enabled during the wait, but disabled again by the time
trigger_wait() returns. trigger_finish() will restore the original
interrupt state.

The initial generic implementation is just a simple polling
cpu_relax() loop. Architectures may set CONFIG_ARCH_HAS_TRIGGER to
define more optimal architecture-specific implementations.
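
For illustration only (this isn't part of the patch, and
hv_block_vcpu()/hv_unblock_vcpus() are invented placeholders for whatever
block/wake primitive a platform actually provides), an arch-specific
implementation for a virtualized platform might look roughly like:

typedef struct {
        atomic_t kicked;
} trigger_t;

#define DEFINE_TRIGGER(n) trigger_t n = { .kicked = ATOMIC_INIT(0) }

static inline void __raw_trigger_reset(trigger_t *t)
{
        atomic_set(&t->kicked, 0);
        smp_mb();
}

static inline void __raw_trigger_wait(trigger_t *t)
{
        /* consume any pending kick; only block if none has arrived.
           hv_block_vcpu() is assumed to return immediately if a poke
           was delivered since we last woke up. */
        if (!atomic_xchg(&t->kicked, 0))
                hv_block_vcpu();
        smp_rmb();              /* trigger_wait() is a read barrier */
}

static inline void __raw_trigger_kick(trigger_t *t)
{
        smp_wmb();              /* trigger_kick() is a write barrier */
        atomic_set(&t->kicked, 1);
        hv_unblock_vcpus();     /* poke any vcpus blocked in hv_block_vcpu() */
}

(__raw_trigger_init() and __raw_trigger_finish() would be trivial.)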

[ I haven't given much thought to how this might make use of lockdep yet. ]

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/linux/smp.h | 2
include/linux/trigger.h | 145 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/smp.c | 18 ++---
kernel/stop_machine.c | 39 +++++++-----
4 files changed, 178 insertions(+), 26 deletions(-)

===================================================================
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -9,6 +9,7 @@
#include <linux/errno.h>
#include <linux/list.h>
#include <linux/cpumask.h>
+#include <linux/trigger.h>

extern void cpu_idle(void);

@@ -17,6 +18,7 @@
void (*func) (void *info);
void *info;
unsigned int flags;
+ trigger_t trigger;
};

#ifdef CONFIG_SMP
===================================================================
--- /dev/null
+++ b/include/linux/trigger.h
@@ -0,0 +1,145 @@
+#ifndef _LINUX_TRIGGER_H
+#define _LINUX_TRIGGER_H
+/*
+ * Triggers - a general purpose synchronization primitive
+ *
+ * A trigger is a primitive used for waiting for a condition to come
+ * true. Sample usage:
+ *
+ * trigger_t trig; // already initialized
+ *
+ * // wait for condition
+ * trigger_reset(&trig);
+ * while(!condition)
+ * trigger_wait(&trig);
+ * trigger_finish(&trig);
+ *
+ * // set condition
+ * make_condition_true();
+ * trigger_kick(&trig);
+ *
+ * This would be used when:
+ * - we can't block (otherwise there are other primitives we could use), and
+ * - we expect it will take a while for the condition to come true
+ *
+ * The wait-side functions must be called with preemption disabled.
+ *
+ * The simplest implementation would just be a no-op, with
+ * trigger_wait() simply being a compiler barrier. But a more
+ * sophisticated implementation could put the cpu into a low power
+ * state, or if virtualized, yield the cpu altogether.
+ *
+ * 2008 Jeremy Fitzhardinge <[email protected]>
+ */
+
+#ifdef CONFIG_ARCH_HAS_TRIGGER
+#include <asm/trigger.h>
+#else /* !CONFIG_ARCH_HAS_TRIGGER */
+#include <asm/processor.h>
+#include <asm/system.h>
+
+typedef struct {} trigger_t;
+
+#define DEFINE_TRIGGER(n) trigger_t n = {}
+
+static inline void __raw_trigger_init(trigger_t *t)
+{
+}
+
+static inline void __raw_trigger_reset(trigger_t *t)
+{
+}
+
+static inline void __raw_trigger_wait(trigger_t *t)
+{
+ /* cpu_relax() is a read barrier */
+ cpu_relax();
+}
+
+static inline void __raw_trigger_kick(trigger_t *t)
+{
+ /* make sure any memory writes are done before going on */
+ smp_wmb();
+}
+
+static inline void __raw_trigger_finish(trigger_t *t)
+{
+}
+#endif /* CONFIG_ARCH_HAS_TRIGGER */
+
+/**
+ * trigger_init - Initialize a trigger for use
+ * @t - trigger to be initialized
+ */
+static inline void trigger_init(trigger_t *t)
+{
+ __raw_trigger_init(t);
+}
+
+/**
+ * trigger_reset - reset a trigger
+ * @t - trigger to be reset
+ *
+ * This resets the trigger state, allowing a trigger_wait to block.
+ * Note that preemption must be disabled between trigger_reset() and
+ * trigger_finish().
+ *
+ * The use of these functions is:
+ * trigger_reset(&t);
+ * while(!condition)
+ * trigger_wait(&t);
+ * trigger_finish(&t);
+ *
+ * and where the condition is set:
+ * condition = true;
+ * trigger_kick(&t);
+ */
+static inline void trigger_reset(trigger_t *t)
+{
+ __raw_trigger_reset(t);
+}
+
+/**
+ * trigger_wait - wait for a trigger to be kicked
+ * @t - trigger to be waited on
+ *
+ * This blocks until the trigger has been kicked. If the trigger has
+ * already been kicked, it will return immediately. It may also
+ * return without the trigger having been kicked at all; the caller
+ * must test the condition before calling trigger_wait() again.
+ *
+ * trigger_wait() acts as a read barrier.
+ *
+ * On return, the trigger will always have been reset.
+ */
+static inline void trigger_wait(trigger_t *t)
+{
+ __raw_trigger_wait(t);
+}
+
+/**
+ * trigger_kick - kick a trigger
+ * @t - trigger to kick
+ *
+ * This causes anyone waiting in trigger_wait to continue. It may be
+ * called in any context.
+ *
+ * trigger_kick() acts as a write barrier.
+ */
+static inline void trigger_kick(trigger_t *t)
+{
+ __raw_trigger_kick(t);
+}
+
+/**
+ * trigger_finish - clean up trigger
+ * @t - trigger to be cleaned up
+ *
+ * This cleans up any implementation trigger state once we've finished
+ * with it.
+ */
+static inline void trigger_finish(trigger_t *t)
+{
+ __raw_trigger_finish(t);
+}
+#endif /* _LINUX_TRIGGER_H */
===================================================================
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -50,15 +50,10 @@
static void csd_flag_wait(struct call_single_data *data)
{
/* Wait for response */
- do {
- /*
- * We need to see the flags store in the IPI handler
- */
- smp_mb();
- if (!(data->flags & CSD_FLAG_WAIT))
- break;
- cpu_relax();
- } while (1);
+ trigger_reset(&data->trigger);
+ while(data->flags & CSD_FLAG_WAIT)
+ trigger_wait(&data->trigger);
+ trigger_finish(&data->trigger);
}

/*
@@ -70,6 +65,8 @@
struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
int wait = data->flags & CSD_FLAG_WAIT, ipi;
unsigned long flags;
+
+ trigger_init(&data->trigger);

spin_lock_irqsave(&dst->lock, flags);
ipi = list_empty(&dst->list);
@@ -135,6 +132,7 @@
*/
smp_wmb();
data->csd.flags &= ~CSD_FLAG_WAIT;
+ trigger_kick(&data->csd.trigger);
}
if (data->csd.flags & CSD_FLAG_ALLOC)
call_rcu(&data->rcu_head, rcu_free_call_data);
@@ -185,6 +183,7 @@
if (data_flags & CSD_FLAG_WAIT) {
smp_wmb();
data->flags &= ~CSD_FLAG_WAIT;
+ trigger_kick(&data->trigger);
} else if (data_flags & CSD_FLAG_ALLOC)
kfree(data);
}
@@ -357,6 +356,7 @@
}

spin_lock_init(&data->lock);
+ trigger_init(&data->csd.trigger);
data->csd.func = func;
data->csd.info = info;
data->refs = num_cpus;
===================================================================
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -9,6 +9,7 @@
#include <linux/stop_machine.h>
#include <linux/syscalls.h>
#include <linux/interrupt.h>
+#include <linux/trigger.h>

#include <asm/atomic.h>
#include <asm/uaccess.h>
@@ -35,6 +36,7 @@
};

/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+static DEFINE_TRIGGER(trigger);
static unsigned int num_threads;
static atomic_t thread_ack;
static struct completion finished;
@@ -46,6 +48,7 @@
atomic_set(&thread_ack, num_threads);
smp_wmb();
state = newstate;
+ trigger_kick(&trigger);
}

/* Last one to ack a state moves to the next state. */
@@ -70,24 +73,26 @@
/* Simple state machine */
do {
/* Chill out and ensure we re-read stopmachine_state. */
- cpu_relax();
- if (state != curstate) {
- curstate = state;
- switch (curstate) {
- case STOPMACHINE_DISABLE_IRQ:
- local_irq_disable();
- hard_irq_disable();
- break;
- case STOPMACHINE_RUN:
- /* |= allows error detection if functions on
- * multiple CPUs. */
- smdata->fnret |= smdata->fn(smdata->data);
- break;
- default:
- break;
- }
- ack_state();
+ trigger_reset(&trigger);
+ while (state == curstate)
+ trigger_wait(&trigger);
+ trigger_finish(&trigger);
+
+ curstate = state;
+ switch (curstate) {
+ case STOPMACHINE_DISABLE_IRQ:
+ local_irq_disable();
+ hard_irq_disable();
+ break;
+ case STOPMACHINE_RUN:
+ /* |= allows error detection if functions on
+ * multiple CPUs. */
+ smdata->fnret |= smdata->fn(smdata->data);
+ break;
+ default:
+ break;
}
+ ack_state();
} while (curstate != STOPMACHINE_EXIT);

local_irq_enable();


2008-08-16 17:47:53

by Arjan van de Ven

Subject: Re: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting

On Sat, 16 Aug 2008 09:34:13 -0700
Jeremy Fitzhardinge <[email protected]> wrote:

>
> The trigger API is simple:
>
> To initialize a trigger, you can either do it statically with:

it looks a lot like a spinning completion ;-)

can we make it more simple by just having spinning versions of the wait
for completion functions?

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-08-17 23:02:09

by Randy Dunlap

Subject: Re: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting

On Sat, 16 Aug 2008 09:34:13 -0700 Jeremy Fitzhardinge wrote:

> --- /dev/null
> +++ b/include/linux/trigger.h
> @@ -0,0 +1,145 @@
> +#ifndef _LINUX_TRIGGER_H
> +#define _LINUX_TRIGGER_H
> +
> +
> +
> +/**
> + * trigger_init - Initialize a trigger for use
> + * @t - trigger to be initialized

kernel-doc format for function parameters is like:

* @t: trigger to be initialized

so please change all occurrences to be like that. Thanks.


> + */
> +static inline void trigger_init(trigger_t *t)
> +{
> + __raw_trigger_init(t);
> +}
> +
> +/**
> + * trigger_reset - reset a trigger
> + * @t - trigger to be reset
> + *
> + * This resets the trigger state, allowing a trigger_wait to block.
> + * Note that preemption must be disabled between trigger_reset() and
> + * trigger_finish().
> + *
> + * The use of these functions is:
> + * trigger_reset(&t);
> + * while(!condition)
> + * trigger_wait(&t);
> + * trigger_finish(&t);
> + *
> + * and where the condition is set:
> + * condition = true;
> + * trigger_kick(&t);
> + */
> +static inline void trigger_reset(trigger_t *t)
> +{
> + __raw_trigger_reset(t);
> +}
> +
> +/**
> + * trigger_wait - wait for a trigger to be kicked
> + * @t - trigger to be waited on
> + *
> + * This blocks until the trigger has been kicked. If the trigger has
> + * already been kicked, it will return immediately. It may also
> + * return without the trigger having been kicked at all; the caller
> + * must test the condition before calling trigger_wait() again.
> + *
> + * trigger_wait() acts as a read barrier.
> + *
> + * On return, the trigger will always have been reset.
> + */
> +static inline void trigger_wait(trigger_t *t)
> +{
> + __raw_trigger_wait(t);
> +}
> +
> +/**
> + * trigger_kick - kick a trigger
> + * @t - trigger to kick
> + *
> + * This causes anyone waiting in trigger_wait to continue. It may be
> + * called in any context.
> + *
> + * trigger_kick() acts as a write barrier.
> + */
> +static inline void trigger_kick(trigger_t *t)
> +{
> + __raw_trigger_kick(t);
> +}
> +
> +/**
> + * trigger_finish - clean up trigger
> + * @t - trigger to be cleaned up
> + *
> + * This cleans up any implementation trigger state once we've finished
> + * with it.
> + */
> +static inline void trigger_finish(trigger_t *t)
> +{
> + __raw_trigger_finish(t);
> +}
> +#endif /* _LINUX_TRIGGER_H */


---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/

2008-08-20 06:23:24

by Andrew Morton

Subject: Re: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting

On Sat, 16 Aug 2008 09:34:13 -0700 Jeremy Fitzhardinge <[email protected]> wrote:

> There are various places in the kernel which wish to wait for a
> condition to come true while in a non-blocking context. Existing
> examples of this are stop_machine() and smp_call_function_mask().
> (No doubt there are other instances of this pattern in the tree.)
>
> Thus far, the only way to achieve this is by spinning with a
> cpu_relax() loop. This is fine if the condition becomes true very
> quickly, but it is not ideal:
>
> - There's little opportunity to put the CPUs into a low-power state.
> cpu_relax() may do this to some extent, but if the wait is
> relatively long, then we can probably do better.

If this change saves a significant amount of power then we should fix
the offending callsites.

> - In a virtual environment, spinning virtual CPUs just waste CPU
> resources, and may steal CPU time from vCPUs which need it to make
> progress. The trigger API allows the vCPUs to give up their CPU
> entirely. The s390 people observed a problem with stop_machine
> taking a very long time (seconds) when there are more vcpus than
> available cpus.

If this change saves a significant amount of virtual-cpu-time then we
should fix the offending callsites.


Tell me I'm wrong...

2008-08-20 18:42:39

by Jeremy Fitzhardinge

Subject: Re: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting

Andrew Morton wrote:
> On Sat, 16 Aug 2008 09:34:13 -0700 Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>> There are various places in the kernel which wish to wait for a
>> condition to come true while in a non-blocking context. Existing
>> examples of this are stop_machine() and smp_call_function_mask().
>> (No doubt there are other instances of this pattern in the tree.)
>>
>> Thus far, the only way to achieve this is by spinning with a
>> cpu_relax() loop. This is fine if the condition becomes true very
>> quickly, but it is not ideal:
>>
>> - There's little opportunity to put the CPUs into a low-power state.
>> cpu_relax() may do this to some extent, but if the wait is
>> relatively long, then we can probably do better.
>>
>
> If this change saves a significant amount of power then we should fix
> the offending callsites.
>

Fix them how? In general we're talking about contexts where we can't
block, and where the wait time is limited by some property of the
platform, such as IPI time or interrupt latency (though doing a
cross-cpu call of a long-running function would be something we could fix).

>> - In a virtual environment, spinning virtual CPUs just waste CPU
>> resources, and may steal CPU time from vCPUs which need it to make
>> progress. The trigger API allows the vCPUs to give up their CPU
>> entirely. The s390 people observed a problem with stop_machine
>> taking a very long time (seconds) when there are more vcpus than
>> available cpus.
>>
>
> If this change saves a significant amount of virtual-cpu-time then we
> should fix the offending callsites.
>

This case isn't particularly about saving vcpu time, but making timely
progress. stop_machine() gets all the cpus into a spinloop, where they
spin waiting for an event to tell them to go to their next state-machine
state. By definition this can't be a blocking operation (since the
whole point is that they're high priority threads that prevent anything
else from running). But in the virtual case, the fact that they're all
spinning means that the underlying hypervisor has no idea who's just
spinning, and who's trying to do some work needed to make overall
progress, so the whole thing gets bogged down.

Now perhaps we could solve stop_machine by modifying the scheduler in
some way, where you can block the run queue so that you sit in the idle
loop even though there's runnable processes waiting. But even then,
stop_machine requires that interrupts be disabled, which means that we're
pretty much limited to spinning.

So my proposal is to add a non-scheduler-blocking operation which is
semantically equivalent to spinning, but gives the underlying platform
more information about what's going on.

Arjan suggested that since this is more or less equivalent to a
completion, we should just implement "spinpletions" - a spinning
completion. This should be more familiar to kernel programmers, and
should be just as useful as triggers.
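
One way to sketch it (the names are invented here, purely a strawman) is
just a condition flag bundled with a trigger:

struct spinpletion {
        bool done;
        trigger_t trigger;
};

static inline void init_spinpletion(struct spinpletion *s)
{
        s->done = false;
        trigger_init(&s->trigger);
}

/* wait side: preemption disabled, may not sleep */
static inline void spin_for_completion(struct spinpletion *s)
{
        trigger_reset(&s->trigger);
        while (!s->done)
                trigger_wait(&s->trigger);
        trigger_finish(&s->trigger);
}

/* completion side: any context */
static inline void spinplete(struct spinpletion *s)
{
        s->done = true;
        trigger_kick(&s->trigger);
}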

I've run out of time to work on this now, but Rusty has hinted he'll
pick up the baton...

(I'd also like to hear from other architecture folks, particularly s390,
to make sure this is going to be useful to them too.)

J

2008-08-20 19:29:57

by Andrew Morton

Subject: Re: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting

On Wed, 20 Aug 2008 11:42:27 -0700
Jeremy Fitzhardinge <[email protected]> wrote:

> Andrew Morton wrote:
> > On Sat, 16 Aug 2008 09:34:13 -0700 Jeremy Fitzhardinge <[email protected]> wrote:
> >
> >
> >> There are various places in the kernel which wish to wait for a
> >> condition to come true while in a non-blocking context. Existing
> >> examples of this are stop_machine() and smp_call_function_mask().
> >> (No doubt there are other instances of this pattern in the tree.)
> >>
> >> Thus far, the only way to achieve this is by spinning with a
> >> cpu_relax() loop. This is fine if the condition becomes true very
> >> quickly, but it is not ideal:
> >>
> >> - There's little opportunity to put the CPUs into a low-power state.
> >> cpu_relax() may do this to some extent, but if the wait is
> >> relatively long, then we can probably do better.
> >>
> >
> > If this change saves a significant amount of power then we should fix
> > the offending callsites.
> >
>
> Fix them how? In general we're talking about contexts where we can't
> block, and where the wait time is limited by some property of the
> platform, such as IPI time or interrupt latency (though doing a
> cross-cpu call of a long-running function would be something we could fix).

ah, OK, I'd failed to note that you had identified two specific culprits.

Are either of these operations executed frequently enough for there to
be significant energy savings here?

> >> - In a virtual environment, spinning virtual CPUs just waste CPU
> >> resources, and may steal CPU time from vCPUs which need it to make
> >> progress. The trigger API allows the vCPUs to give up their CPU
> >> entirely. The s390 people observed a problem with stop_machine
> >> taking a very long time (seconds) when there are more vcpus than
> >> available cpus.
> >>
> >
> > If this change saves a significant amount of virtual-cpu-time then we
> > should fix the offending callsites.
> >
>
> This case isn't particularly about saving vcpu time, but making timely
> progress. stop_machine() gets all the cpus into a spinloop, where they
> spin waiting for an event to tell them to go to their next state-machine
> state. By definition this can't be a blocking operation (since the
> whole point is that they're high priority threads that prevent anything
> else from running). But in the virtual case, the fact that they're all
> spinning means that the underlying hypervisor has no idea who's just
> spinning, and who's trying to do some work needed to make overall
> progress, so the whole thing gets bogged down.

hm. I'm surprised that stop_machine() is executed frequently enough
for you to care. What's causing it?

> Now perhaps we could solve stop_machine by modifying the scheduler in
> some way, where you can block the run queue so that you sit in the idle
> loop even though there's runnable processes waiting. But even then,
> stop_machine requires that interrupts be disabled, which means that we're
> pretty much limited to spinning.

If stop_machine() is the _only_ problematic callsite and we reasonably
expect that no new ones will pop up then sure, a
stop_machine()-specific fix might be appropriate.

Otherwise, sure, we'd need to look at something more general.

> So my proposal is to add a non-scheduler-blocking operation which is
> semantically equivalent to spinning, but gives the underlying platform
> more information about what's going on.
>
> Arjan suggested that since this is more or less equivalent to a
> completion, we should just implement "spinpletions" - a spinning
> completion. This should be more familiar to kernel programmers, and
> should be just as useful as triggers.
>
> I've run out of time to work on this now, but Rusty has hinted he'll
> pick up the baton...
>
> (I'd also like to hear from other architecture folks, particularly s390,
> to make sure this is going to be useful to them too.)
>
> J

2008-08-20 20:14:27

by Jeremy Fitzhardinge

Subject: Re: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting

Andrew Morton wrote:
> On Wed, 20 Aug 2008 11:42:27 -0700
> Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>> Andrew Morton wrote:
>>
>>> On Sat, 16 Aug 2008 09:34:13 -0700 Jeremy Fitzhardinge <[email protected]> wrote:
>>>
>>>
>>>
>>>> There are various places in the kernel which wish to wait for a
>>>> condition to come true while in a non-blocking context. Existing
>>>> examples of this are stop_machine() and smp_call_function_mask().
>>>> (No doubt there are other instances of this pattern in the tree.)
>>>>
>>>> Thus far, the only way to achieve this is by spinning with a
>>>> cpu_relax() loop. This is fine if the condition becomes true very
>>>> quickly, but it is not ideal:
>>>>
>>>> - There's little opportunity to put the CPUs into a low-power state.
>>>> cpu_relax() may do this to some extent, but if the wait is
>>>> relatively long, then we can probably do better.
>>>>
>>>>
>>> If this change saves a significant amount of power then we should fix
>>> the offending callsites.
>>>
>>>
>> Fix them how? In general we're talking about contexts where we can't
>> block, and where the wait time is limited by some property of the
>> platform, such as IPI time or interrupt latency (though doing a
>> cross-cpu call of a long-running function would be something we could fix).
>>
>
> ah, OK, I'd failed to note that you had identified two specific culprits.
>
> Are either of these operations executed frequently enough for there to
> be significant energy savings here?
>

The energy savings are more gravy, and not really my focus. Arjan tells
me that monitor/mwait are unusably slow in current implementations
anyway. My interest is in the virtual machine case, where bad
interactions with the vcpu scheduler can cause things to spin for 30
milliseconds or more (sometimes much more) in cases that would take only
microseconds running native.

The s390 people have reported similar things, so this is definitely not
Xen or x86 specific.

>>>> - In a virtual environment, spinning virtual CPUs just waste CPU
>>>> resources, and may steal CPU time from vCPUs which need it to make
>>>> progress. The trigger API allows the vCPUs to give up their CPU
>>>> entirely. The s390 people observed a problem with stop_machine
>>>> taking a very long time (seconds) when there are more vcpus than
>>>> available cpus.
>>>>
>>>>
>>> If this change saves a significant amount of virtual-cpu-time then we
>>> should fix the offending callsites.
>>>
>>>
>> This case isn't particularly about saving vcpu time, but making timely
>> progress. stop_machine() gets all the cpus into a spinloop, where they
>> spin waiting for an event to tell them to go to their next state-machine
>> state. By definition this can't be a blocking operation (since the
>> whole point is that they're high priority threads that prevent anything
>> else from running). But in the virtual case, the fact that they're all
>> spinning means that the underlying hypervisor has no idea who's just
>> spinning, and who's trying to do some work needed to make overall
>> progress, so the whole thing gets bogged down.
>>
>
> hm. I'm surprised that stop_machine() is executed frequently enough
> for you to care. What's causing it?
>

The big user is module load/unload, which have been observed to take
multiple seconds in stop_machine with some pathological overload
conditions. It's a pretty major hiccup if you hit it. (It's not
something that you'd deliberately set up except for testing, but it means
that something which might otherwise be a brief transient overload could
turn into a very brittle state with wildly varying performance
characteristics.)

Also Xen suspend/migrate uses stop_machine, and that's actually fairly
latency-sensitive. A live migrate can tolerate only a few tens of milliseconds of
downtime for the virtual machine, so having stop_machine() with
latencies of a similar or longer scale is noticeable.

>> Now perhaps we could solve stop_machine by modifying the scheduler in
>> some way, where you can block the run queue so that you sit in the idle
>> loop even though there's runnable processes waiting. But even then,
>> stop_machine requires that interrupts be disabled, which means that we're
>> pretty much limited to spinning.
>>
>
> If stop_machine() is the _only_ problematic callsite and we reasonably
> expect that no new ones will pop up then sure, a
> stop_machine()-specific fix might be appropriate.
>
> Otherwise, sure, we'd need to look at something more general.
>

Well smp_call_function() does a spin wait, waiting for the other cpu(s)
to finish running the function. If it's a long-running function, then
that spinning could be arbitrarily long - not that it's a good idea to
call something long-running in interrupt context like that, but you
could see it as a quality of implementation issue.

And again, in a virtual environment, all that spinning competes with
cpus trying to do real work, so even a "short" spin could be arbitrarily
long if it's preventing the event it is waiting for from occurring.

I'm pretty sure there are other places in the kernel which can make use
of a more general facility. There are ~300 non-arch uses of cpu_relax()
in ~100 files, which are all (roughly) waiting for something to become
true. Some are polling on hardware state, and some are waiting for
states set by uncooperative subsystems, but I'd be surprised if a
significant number couldn't be converted to use a higher-level
trigger/spinpletion mechanism.
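
The conversion would typically be mechanical, something like this
(illustrative only; "flag" stands for whatever the existing loop polls,
assuming a trigger_t has been set up alongside it and a trigger_kick()
added wherever it's set):

/* before: open-coded poll */
while (!flag)
        cpu_relax();

/* after: same semantics, but the platform can see the wait */
trigger_reset(&trig);
while (!flag)
        trigger_wait(&trig);
trigger_finish(&trig);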

And the fact that there are so many existing instances in the kernel
suggests that new ones will appear, and they could be encouraged to use
a high-level mechanism from the outset.

J

2008-08-25 04:00:32

by Rusty Russell

Subject: Re: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting

On Thursday 21 August 2008 05:25:46 Andrew Morton wrote:
> If stop_machine() is the _only_ problematic callsite and we reasonably
> expect that no new ones will pop up then sure, a
> stop_machine()-specific fix might be appropriate.
>
> Otherwise, sure, we'd need to look at something more general.

The reason I'm interested is that we have 331 cpu_relax() calls outside the
arch trees. That seems like a lot, and I wonder if what they really want is
a spinpletion.

Cheers,
Rusty.

2008-08-28 12:27:42

by Christian Borntraeger

Subject: Re: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting

Sorry for answering late, but here are some comments.

On Saturday, 16 August 2008, Jeremy Fitzhardinge wrote:
[...]
>
> - In a virtual environment, spinning virtual CPUs just waste CPU
> resources, and may steal CPU time from vCPUs which need it to make
> progress. The trigger API allows the vCPUs to give up their CPU
> entirely. The s390 people observed a problem with stop_machine
> taking a very long time (seconds) when there are more vcpus than
> available cpus.

Yes, we have seen some real contention if the number of vcpus is much higher
than the number of real cpus.

>
> The trigger API is simple:
>
> To initialize a trigger, you can either do it statically with:
>
> DEFINE_TRIGGER(trigger);
>
> or dynamically with
>
> trigger_init(&trigger);
>
> Then to use it, the wait side does:
>
> trigger_reset(&trigger);
>
> while(!condition)
> trigger_wait(&trigger);
>
> trigger_finish(&trigger);
>
> and when the condition is set true:
>
> condition = true;
> trigger_kick(&trigger);

Hmm, I am currently trying to build something with sigp_stop and sigp_start on
s390; I'm not sure yet whether this interface allows me to do that. At the moment
something is wrong in my prototype.