I'm seeing a workqueue-related deadlock. This is on an ia64
box running SLES10, but it looks like the same problem should
be possible in current upstream on any architecture.
Here are the two tasks involved:
events/4:
schedule
__down
__lock_cpu_hotplug
lock_cpu_hotplug
flush_workqueue
kblockd_flush
blk_sync_queue
cfq_shutdown_timer_wq
cfq_exit_queue
elevator_exit
blk_cleanup_queue
scsi_free_queue
scsi_device_dev_release_usercontext
run_workqueue
loadkeys:
schedule
flush_cpu_workqueue
flush_workqueue
flush_scheduled_work
release_dev
tty_release
loadkeys is holding the cpu_hotplug lock (acquired in flush_workqueue())
and waiting in flush_cpu_workqueue() until the cpu_workqueue drains.
But events/4 is responsible for draining it, and it is blocked waiting
to acquire the cpu_hotplug lock.
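In other words, schematically:

	loadkeys                        events/4
	--------                        --------
	flush_scheduled_work()
	  lock_cpu_hotplug()  <-- acquired
	  waits for events/4
	  to drain its queue            runs the scsi release work, which does:
	                                  blk_sync_queue()
	                                  kblockd_flush()
	                                  flush_workqueue()
	                                    lock_cpu_hotplug()  <-- blocks
	                                        (held by loadkeys)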
In current upstream, the cpu_hotplug lock has been replaced with
workqueue_mutex, but it looks to me like the same deadlock is still
possible.
Is there some rule that workqueue functions shouldn't try to
flush a workqueue? Or should flush_workqueue() be smarter by
releasing the workqueue_mutex once in a while?
On Wed, Dec 06, 2006 at 05:26:14PM -0700, Bjorn Helgaas wrote:
> loadkeys is holding the cpu_hotplug lock (acquired in flush_workqueue())
> and waiting in flush_cpu_workqueue() until the cpu_workqueue drains.
>
> But events/4 is responsible for draining it, and it is blocked waiting
> to acquire the cpu_hotplug lock.
>
> In current upstream, the cpu_hotplug lock has been replaced with
> workqueue_mutex, but it looks to me like the same deadlock is still
> possible.
Yes, I think so too.
> Is there some rule that workqueue functions shouldn't try to
> flush a workqueue?
In general, a workqueue function wanting to flush a workqueue seems weird
to me. But in this case, I think the requirement is to block until all
queued work is complete, which is what flush_workqueue is supposed to
do. Hence I don't see any way to avoid it ..
> Or should flush_workqueue() be smarter by
> releasing the workqueue_mutex once in a while?
IMHO, overhauling lock_cpu_hotplug() to support scenarios like this is a
better approach:
- Make it a rw-sem
- Make it a per-cpu mutex, which could be either:
http://lkml.org/lkml/2006/11/30/110 - Ingo's suggestion
http://lkml.org/lkml/2006/10/26/65 - Gautham's work based on RCU
In Ingo's suggestion, I really don't know if the task_struct
modifications are a good thing (to support the recursion requirement).
Gautham's patches avoid modifying task_struct and are fast, but they can
starve writers (who want to bring a CPU down/up).
--
Regards,
vatsa
On Thu, Dec 07, 2006 at 11:47:01AM +0530, Srivatsa Vaddagiri wrote:
> - Make it a rw-sem
I think rw-sems were also shown to hit deadlocks (a recursive read-lock
attempt deadlocks when a writer arrives between the two read attempts by the
same thread), so only the per-cpu mutex suggestion below seems to make
sense. Schematically, the rw-sem deadlock is:
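	task A                          task B
	------                          ------
	down_read(&hotplug_sem)
	                                down_write(&hotplug_sem)
	                                   (blocks, and queues ahead
	                                    of any new readers)
	down_read(&hotplug_sem)
	   (blocks behind the queued writer:
	    A waits on B, B waits on A)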
> - Make it a per-cpu mutex, which could be either:
>
> http://lkml.org/lkml/2006/11/30/110 - Ingo's suggestion
> http://lkml.org/lkml/2006/10/26/65 - Gautham's work based on RCU
>
> In Ingo's suggestion, I really don't know if the task_struct
> modifications are a good thing (to support the recursion requirement).
> Gautham's patches avoid modifying task_struct and are fast, but they can
> starve writers (who want to bring a CPU down/up).
--
Regards,
vatsa
On Fri, 8 Dec 2006 08:23:01 +0530
Srivatsa Vaddagiri <[email protected]> wrote:
> On Thu, Dec 07, 2006 at 11:37:00AM -0800, Andrew Morton wrote:
> > -static void flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
> > +/*
> > + * If cpu == -1 it's a single-threaded workqueue and the caller does not hold
> > + * workqueue_mutex
> > + */
> > +static void flush_cpu_workqueue(struct cpu_workqueue_struct *cwq, int cpu)
>
> Let's say @cpu = 4
>
> > {
> > if (cwq->thread == current) {
> > /*
> > * Probably keventd trying to flush its own queue. So simply run
> > * it by hand rather than deadlocking.
> > */
> > + if (cpu != -1)
> > + mutex_unlock(&workqueue_mutex);
>
> Let's say we release the workqueue mutex here (events/4 is trying to
> flush its own workqueue). Immediately another CPU takes this mutex
> (in CPU_DOWN_PREPARE) and brings down CPU4. In CPU_DEAD handling we now wait
> on the events/4 thread to exit (cleanup_workqueue_thread).
>
> Couldn't this wait deadlock on:
>
> > run_workqueue(cwq);
>
> > + if (cpu != -1)
> > + mutex_lock(&workqueue_mutex);
>
> events/4 thread itself wanting the same mutex above?
>
Could do, not sure. I'm planning on converting all the locking around here
to preempt_disable() though.
On Wed, 6 Dec 2006 17:26:14 -0700
Bjorn Helgaas <[email protected]> wrote:
> I'm seeing a workqueue-related deadlock. This is on an ia64
> box running SLES10, but it looks like the same problem should
> be possible in current upstream on any architecture.
>
> Here are the two tasks involved:
>
> events/4:
> schedule
> __down
> __lock_cpu_hotplug
> lock_cpu_hotplug
> flush_workqueue
> kblockd_flush
> blk_sync_queue
> cfq_shutdown_timer_wq
> cfq_exit_queue
> elevator_exit
> blk_cleanup_queue
> scsi_free_queue
> scsi_device_dev_release_usercontext
> run_workqueue
>
> loadkeys:
> schedule
> flush_cpu_workqueue
> flush_workqueue
> flush_scheduled_work
> release_dev
> tty_release
This will go away if/when I get the proposed new flush_work(struct
work_struct *) implemented. We can then convert blk_sync_queue() to do
flush_work(&q->unplug_work);
which will only block if blk_unplug_work() is actually executing on this
queue, and which will return as soon as blk_unplug_work() has finished.
(And a similar change in release_dev()).
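For kblockd that conversion would look something like this (a sketch
only -- it assumes the two-argument flush_work() prototype from the
draft patch below):

	void blk_sync_queue(struct request_queue *q)
	{
		del_timer_sync(&q->unplug_timer);
		/* wait only for this queue's unplug work, rather than
		 * for everything queued on kblockd */
		flush_work(kblockd_workqueue, &q->unplug_work);
	}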
It doesn't solve the fundamental problem though. But I'm not sure what
that is. If it is "flush_scheduled_work() waits on things which the caller
isn't interested in" then it will fix the fundamental problem.
Needs more work:
diff -puN kernel/workqueue.c~implement-flush_work kernel/workqueue.c
--- a/kernel/workqueue.c~implement-flush_work
+++ a/kernel/workqueue.c
@@ -53,6 +53,7 @@ struct cpu_workqueue_struct {
struct workqueue_struct *wq;
struct task_struct *thread;
+ struct work_struct *current_work;
int run_depth; /* Detect run_workqueue() recursion depth */
} ____cacheline_aligned;
@@ -243,6 +244,7 @@ static void run_workqueue(struct cpu_wor
work_func_t f = work->func;
list_del_init(cwq->worklist.next);
+ cwq->current_work = work;
spin_unlock_irqrestore(&cwq->lock, flags);
BUG_ON(get_wq_data(work) != cwq);
@@ -251,6 +253,7 @@ static void run_workqueue(struct cpu_wor
f(work);
spin_lock_irqsave(&cwq->lock, flags);
+ cwq->current_work = NULL;
cwq->remove_sequence++;
wake_up(&cwq->work_done);
}
@@ -330,6 +333,70 @@ static void flush_cpu_workqueue(struct c
}
}
+static void wait_on_work(struct cpu_workqueue_struct *cwq,
+ struct work_struct *work, int cpu)
+{
+ DEFINE_WAIT(wait);
+
+ spin_lock_irq(&cwq->lock);
+ while (cwq->current_work == work) {
+ prepare_to_wait(&cwq->work_done, &wait, TASK_UNINTERRUPTIBLE);
+ spin_unlock_irq(&cwq->lock);
+ mutex_unlock(&workqueue_mutex);
+ schedule();
+ mutex_lock(&workqueue_mutex);
+ if (!cpu_online(cpu)) /* oops, CPU got unplugged */
+ goto bail;
+ spin_lock_irq(&cwq->lock);
+ }
+ spin_unlock_irq(&cwq->lock);
+bail:
+ finish_wait(&cwq->work_done, &wait);
+}
+
+static void flush_one_work(struct cpu_workqueue_struct *cwq,
+ struct work_struct *work, int cpu)
+{
+ spin_lock_irq(&cwq->lock);
+ if (test_and_clear_bit(WORK_STRUCT_PENDING, &work->management)) {
+ list_del_init(&work->entry);
+ spin_unlock_irq(&cwq->lock);
+ return;
+ }
+ spin_unlock_irq(&cwq->lock);
+
+ /* It's running, or it has completed */
+
+ if (cwq->thread == current) {
+ /* This stinks */
+ /*
+ * Probably keventd trying to flush its own queue. So simply run
+ * it by hand rather than deadlocking.
+ */
+ run_workqueue(cwq);
+ } else {
+ wait_on_work(cwq, work, cpu);
+ }
+}
+
+void flush_work(struct workqueue_struct *wq, struct work_struct *work)
+{
+ might_sleep();
+
+ mutex_lock(&workqueue_mutex);
+ if (is_single_threaded(wq)) {
+ /* Always use first cpu's area. */
+ flush_one_work(per_cpu_ptr(wq->cpu_wq, singlethread_cpu), work,
+ singlethread_cpu);
+ } else {
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ flush_one_work(per_cpu_ptr(wq->cpu_wq, cpu), work, cpu);
+ }
+ mutex_unlock(&workqueue_mutex);
+}
+
/**
* flush_workqueue - ensure that any scheduled work has run to completion.
* @wq: workqueue to flush
_
On Thu, Dec 07, 2006 at 08:54:07PM -0800, Andrew Morton wrote:
> Could do, not sure.
AFAICS it will deadlock for sure.
> I'm planning on converting all the locking around here
> to preempt_disable() though.
Will look forward to that patch. It's hard to dance around w/o a
lock_cpu_hotplug() .. :)
--
Regards,
vatsa
On Thu, Dec 07, 2006 at 11:37:00AM -0800, Andrew Morton wrote:
> -static void flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
> +/*
> + * If cpu == -1 it's a single-threaded workqueue and the caller does not hold
> + * workqueue_mutex
> + */
> +static void flush_cpu_workqueue(struct cpu_workqueue_struct *cwq, int cpu)
Let's say @cpu = 4
> {
> if (cwq->thread == current) {
> /*
> * Probably keventd trying to flush its own queue. So simply run
> * it by hand rather than deadlocking.
> */
> + if (cpu != -1)
> + mutex_unlock(&workqueue_mutex);
Let's say we release the workqueue mutex here (events/4 is trying to
flush its own workqueue). Immediately another CPU takes this mutex
(in CPU_DOWN_PREPARE) and brings down CPU4. In CPU_DEAD handling we now wait
on the events/4 thread to exit (cleanup_workqueue_thread).
Couldn't this wait deadlock on:
> run_workqueue(cwq);
> + if (cpu != -1)
> + mutex_lock(&workqueue_mutex);
events/4 thread itself wanting the same mutex above?
What am I missing?
--
Regards,
vatsa
On Thu, 7 Dec 2006 10:51:48 -0800
Andrew Morton <[email protected]> wrote:
> + if (!cpu_online(cpu)) /* oops, CPU got unplugged */
> + goto bail;
hm, actually we can pull the same trick with flush_scheduled_work().
Should fix quite a few things...
From: Andrew Morton <[email protected]>
We have a class of deadlocks where the flush_scheduled_work() caller can get
stuck waiting for a work item to complete, where that work item wants to take
workqueue_mutex for some reason.
Fix this by not holding workqueue_mutex when waiting for a workqueue to flush.
The patch assumes that the per-cpu workqueue won't get freed up while there's
a task waiting on cpu_workqueue_struct.work_done. If that can happen,
run_workqueue() would crash anyway.
Cc: Bjorn Helgaas <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Srivatsa Vaddagiri <[email protected]>
Cc: Gautham shenoy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
kernel/workqueue.c | 22 +++++++++++++++++++---
1 files changed, 19 insertions(+), 3 deletions(-)
diff -puN kernel/workqueue.c~workqueue-dont-hold-workqueue_mutex-in-flush_scheduled_work kernel/workqueue.c
--- a/kernel/workqueue.c~workqueue-dont-hold-workqueue_mutex-in-flush_scheduled_work
+++ a/kernel/workqueue.c
@@ -325,14 +325,22 @@ static int worker_thread(void *__cwq)
return 0;
}
-static void flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
+/*
+ * If cpu == -1 it's a single-threaded workqueue and the caller does not hold
+ * workqueue_mutex
+ */
+static void flush_cpu_workqueue(struct cpu_workqueue_struct *cwq, int cpu)
{
if (cwq->thread == current) {
/*
* Probably keventd trying to flush its own queue. So simply run
* it by hand rather than deadlocking.
*/
+ if (cpu != -1)
+ mutex_unlock(&workqueue_mutex);
run_workqueue(cwq);
+ if (cpu != -1)
+ mutex_lock(&workqueue_mutex);
} else {
DEFINE_WAIT(wait);
long sequence_needed;
@@ -344,7 +352,14 @@ static void flush_cpu_workqueue(struct c
prepare_to_wait(&cwq->work_done, &wait,
TASK_UNINTERRUPTIBLE);
spin_unlock_irq(&cwq->lock);
+ if (cpu != -1)
+ mutex_unlock(&workqueue_mutex);
schedule();
+ if (cpu != -1) {
+ mutex_lock(&workqueue_mutex);
+ if (!cpu_online(cpu))
+ return; /* oops, CPU unplugged */
+ }
spin_lock_irq(&cwq->lock);
}
finish_wait(&cwq->work_done, &wait);
@@ -373,13 +388,14 @@ void fastcall flush_workqueue(struct wor
if (is_single_threaded(wq)) {
/* Always use first cpu's area. */
- flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, singlethread_cpu));
+ flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, singlethread_cpu),
+ -1);
} else {
int cpu;
mutex_lock(&workqueue_mutex);
for_each_online_cpu(cpu)
- flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+ flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu), cpu);
mutex_unlock(&workqueue_mutex);
}
}
_
* Andrew Morton <[email protected]> wrote:
> > > + if (cpu != -1)
> > > + mutex_lock(&workqueue_mutex);
> >
> > events/4 thread itself wanting the same mutex above?
>
> Could do, not sure. I'm planning on converting all the locking around
> here to preempt_disable() though.
please at least use an owner-recursive per-CPU lock, not a naked
preempt_disable()! The concurrency rules for data structures changed via
preempt_disable() are quite hard to sort out after the fact.
(preempt_disable() is too opaque, it doesn't attach a data structure to
a critical section, like normal locks do.)
Ingo
On Sat, 9 Dec 2006 11:26:52 +0100
Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > > > + if (cpu != -1)
> > > > + mutex_lock(&workqueue_mutex);
> > >
> > > events/4 thread itself wanting the same mutex above?
> >
> > Could do, not sure. I'm planning on converting all the locking around
> > here to preempt_disable() though.
>
> please at least use an owner-recursive per-CPU lock,
a wot?
> not a naked
> preempt_disable()! The concurrency rules for data structures changed via
> preempt_disable() are quite hard to sort out after the fact.
> (preempt_disable() is too opaque,
preempt_disable() is the preferred way of holding off cpu hotplug.
> it doesn't attach a data structure to
> a critical section, like normal locks do.)
the data structure is the CPU, and its per-cpu data. And cpu_online_map.
* Andrew Morton <[email protected]> wrote:
> > > Could do, not sure. I'm planning on converting all the locking
> > > around here to preempt_disable() though.
> >
> > please at least use an owner-recursive per-CPU lock,
>
> a wot?
something like the pseudocode further below - when applied to a data
structure it has semantics and scalability close to that of
preempt_disable(), but it is still preemptible and the lock is specific.
> > not a naked preempt_disable()! The concurrency rules for data
> > structures changed via preempt_disable() are quite hard to sort out
> > after the fact. (preempt_disable() is too opaque,
>
> preempt_disable() is the preferred way of holding off cpu hotplug.
well, preempt_disable() is the scheduler's internal mechanism to keep
tasks from being preempted. It is fast but it also has non-nice
side-effects:
1) nothing tells us what the connection between preempt-disable and data
structure is. If we let preempt_disable() spread then we'll end up
with a situation like the BKL: all preempt_disable() sections become
one big blob of code with hard-to-define specifications, and if we
take out code from that blob stuff mysteriously breaks. If on the
other hand we use a specific lock, that is visible in the source
already (and can be programmatically attached to the data structure
too, like i did it in the PI-list debugging code, where the
connection between the lock and the list is explicit and the
debugging code checks that the lock is held when the data structure
is accessed.)
2) it disables 'all preemption', not 'CPU hotplug down on this CPU' - so
it's a cannon to shoot at sparrows.
> > it doesn't attach a data structure to a critical section, like normal
> > locks do.)
>
> the data structure is the CPU, and its per-cpu data. And cpu_online_map.
The abstract data structure of CPU hotplug is /not/ "the CPU" but:
a CPU's ability to run tasks, expressed via cpu_online_map, the
runqueues, the per-CPU systems kthreads and all the other hotplug
data structures.
"CPU hotplug code" is the implementation of that data structure, used
via the 'bring CPU down' and 'CPU up' methods of the data structure.
preempt_disable() on the other hand is a scheduler-internal method that
keeps a context from being forcibly rescheduled. As such it is a very
broad superset of some of CPU-hotplug's requirements (because if a CPU
has a task that cannot be rescheduled, then obviously the CPU's ability
to run tasks is frozen), but IMO totally unsuitable for use as a
real hotplug API. The two things are really quite fundamentally
different.
the 'natural' locking method of the CPU hotplug data structures is:
"keep this CPU from being brought down". The pseudocode below expresses
this and only this.
struct task_struct {
...
int hotplug_depth;
struct mutex *hotplug_lock;
}
...
DEFINE_PER_CPU(struct mutex, hotplug_lock);
void cpu_hotplug_lock(void)
{
int cpu = raw_smp_processor_id();
/*
* Interrupts/softirqs are hotplug-safe:
*/
if (in_interrupt())
return;
if (current->hotplug_depth++)
return;
current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
mutex_lock(current->hotplug_lock);
}
void cpu_hotplug_unlock(void)
{
int cpu;
if (in_interrupt())
return;
if (DEBUG_LOCKS_WARN_ON(!current->hotplug_depth))
return;
if (--current->hotplug_depth)
return;
mutex_unlock(current->hotplug_lock);
current->hotplug_lock = NULL;
}
...
void do_exit(void)
{
...
DEBUG_LOCKS_WARN_ON(current->hotplug_depth);
...
}
...
copy_process(void)
{
...
p->hotplug_depth = 0;
p->hotplug_lock = NULL;
...
}
Ingo
On Sun, 10 Dec 2006 09:26:16 +0100
Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > > > Could do, not sure. I'm planning on converting all the locking
> > > > around here to preempt_disable() though.
> > >
> > > please at least use an owner-recursive per-CPU lock,
> >
> > a wot?
>
> something like the pseudocode further below - when applied to a data
> structure it has semantics and scalability close to that of
> preempt_disable(), but it is still preemptible and the lock is specific.
>
> > > not a naked preempt_disable()! The concurrency rules for data
> > > structures changed via preempt_disable() are quite hard to sort out
> > > after the fact. (preempt_disable() is too opaque,
> >
> > preempt_disable() is the preferred way of holding off cpu hotplug.
>
> well, preempt_disable() is the scheduler's internal mechanism to keep
> tasks from being preempted. It is fast but it also has non-nice
> side-effects:
>
> 1) nothing tells us what the connection between preempt-disable and data
> structure is. If we let preempt_disable() spread then we'll end up
> with a situation like the BKL: all preempt_disable() sections become
> one big blob of code with hard-to-define specifications, and if we
> take out code from that blob stuff mysteriously breaks.
Well we can add some suitably-named wrapper around preempt_disable() to make
it obvious why we're calling it. But I haven't noticed any such problem with
existing usages.
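Something as thin as this would do (a sketch; the names are invented
here):

	/* Document the intent: hold off CPU hot-unplug on this CPU */
	static inline void cpu_hotplug_disable_local(void)
	{
		preempt_disable();
	}

	static inline void cpu_hotplug_enable_local(void)
	{
		preempt_enable();
	}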
> void cpu_hotplug_lock(void)
> {
> int cpu = raw_smp_processor_id();
> /*
> * Interrupts/softirqs are hotplug-safe:
> */
> if (in_interrupt())
> return;
> if (current->hotplug_depth++)
> return;
> current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
> mutex_lock(current->hotplug_lock);
> }
That's functionally equivalent to what we have now, and it isn't working
too well.
* Andrew Morton <[email protected]> wrote:
> > > > not a naked preempt_disable()! The concurrency rules for data
> > > > structures changed via preempt_disable() are quite hard to sort out
> > > > after the fact. (preempt_disable() is too opaque,
> > >
> > > preempt_disable() is the preferred way of holding off cpu hotplug.
> >
> > well, preempt_disable() is the scheduler's internal mechanism to keep
> > tasks from being preempted. It is fast but it also has non-nice
> > side-effects:
> >
> > 1) nothing tells us what the connection between preempt-disable and data
> > structure is. If we let preempt_disable() spread then we'll end up
> > with a situation like the BKL: all preempt_disable() sections become
> > one big blob of code with hard-to-define specifications, and if we
> > take out code from that blob stuff mysteriously breaks.
>
> Well we can add some suitably-named wrapper around preempt_disable()
> to make it obvious why we're calling it. But I haven't noticed any
> such problem with existing usages.
ok, as long as there's a separate API for it (which just maps to
disable_preempt(), or whatever), i'm fine with it. The complications
with preempt_disable()/preempt_enable() and local_irq_disable()/enable()
come when someone tries to implement something like a fully preemptible
kernel :-)
it was quite some work to sort the irqs on/off + per-cpu assumptions out
in the slab allocator and in the page allocator:
$ diffstat patches/rt-slab.patch
mm/slab.c | 460 +++++++++++++++++++++++++++++++++++++++++------------------------
1 file changed, 296 insertions(+), 164 deletions(-)
$ diffstat patches/rt-page_alloc.patch
mm/page_alloc.c | 125 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 90 insertions(+), 35 deletions(-)
> > void cpu_hotplug_lock(void)
> > {
> > int cpu = raw_smp_processor_id();
> > /*
> > * Interrupts/softirqs are hotplug-safe:
> > */
> > if (in_interrupt())
> > return;
> > if (current->hotplug_depth++)
> > return;
> > current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
> > mutex_lock(current->hotplug_lock);
> > }
>
> That's functionally equivalent to what we have now, and it isn't
> working too well.
hm, i thought the main reason for not using cpu_hotplug_lock() in a
widespread manner was not related to its functionality but to its
scalability - but i could be wrong. The one above is scalable and we
could use it as /the/ method to control CPU hotplug. All the flux i
remember around cpu_hotplug_lock() use from the fork path and from
other scheduler hotpaths related to its scalability bottleneck, not to
its locking efficiency.
Ingo
On Sun, 10 Dec 2006 12:49:43 +0100
Ingo Molnar <[email protected]> wrote:
> > > void cpu_hotplug_lock(void)
This is actually not cpu-hotplug safe ;)
> > > {
> > > int cpu = raw_smp_processor_id();
> > > /*
> > > * Interrupts/softirqs are hotplug-safe:
> > > */
> > > if (in_interrupt())
> > > return;
> > > if (current->hotplug_depth++)
> > > return;
<preempt, cpu hot-unplug, resume on different CPU>
> > > current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
<use-after-free>
> > > mutex_lock(current->hotplug_lock);
And it sleeps, so we can't use preempt_disable().
> > > }
It's worth noting that this very common sequence:
preempt_disable();
cpu = smp_processor_id();
...
preempt_enable();
also provides cpu-hotunplug protection against scenarios such as the above.
> > That's functionally equivalent to what we have now, and it isn't
> > working too well.
>
> hm, i thought the main reason for not using cpu_hotplug_lock() in a
> widespread manner was not related to its functionality but to its
> scalability - but i could be wrong.
It hasn't been noticed yet.
I suspect a large part of the reason for that is that we only really care
about hot-unplug when this CPU reaches across to some other CPU's data. Often
_all_ other CPUs' data. And that's a super-inefficient thing, so it's rare.
Most of the problems we've had are due to borkage in cpufreq. And that's
simply cruddy code - it's not due to the complexity of CPU hotplug per se.
> The one above is scalable and we
> could use it as /the/ method to control CPU hotplug. All the flux i
> remember around cpu_hotplug_lock() use from the fork path and from
> other scheduler hotpaths related to its scalability bottleneck, not to
> its locking efficiency.
One quite different way of addressing all of this is to stop using
stop_machine_run() for hotplug synchronisation and switch to the swsusp
freezer infrastructure: all kernel threads and user processes need to stop
and park themselves in a known state before we allow the CPU to be removed.
lock_cpu_hotplug() becomes a no-op.
Dunno if it'll work - I only just thought of it. It sure would simplify
things.
* Andrew Morton <[email protected]> wrote:
> This is actually not cpu-hotplug safe ;)
>
> > > > {
> > > > int cpu = raw_smp_processor_id();
> > > > /*
> > > > * Interrupts/softirqs are hotplug-safe:
> > > > */
> > > > if (in_interrupt())
> > > > return;
> > > > if (current->hotplug_depth++)
> > > > return;
>
> <preempt, cpu hot-unplug, resume on different CPU>
>
> > > > current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
>
> <use-after-free>
>
> > > > mutex_lock(current->hotplug_lock);
>
> And it sleeps, so we can't use preempt_disable().
i explained it in the other mail - this is the 'read' side. The 'write'
side (code actually wanting to /do/ a CPU hotplug state transition) has
to take /all/ these locks before it can take a CPU down.
so this is still a global CPU hotplug lock, but made scalable.
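schematically, the write side would be something like (ignoring
lock-nesting annotations):

	static void cpu_hotplug_lock_all(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			mutex_lock(&per_cpu(hotplug_lock, cpu));
	}

	static void cpu_hotplug_unlock_all(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			mutex_unlock(&per_cpu(hotplug_lock, cpu));
	}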
Ingo
On Sun, 10 Dec 2006 13:19:15 +0100
Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > This is actually not cpu-hotplug safe ;)
> >
> > > > > {
> > > > > int cpu = raw_smp_processor_id();
> > > > > /*
> > > > > * Interrupts/softirqs are hotplug-safe:
> > > > > */
> > > > > if (in_interrupt())
> > > > > return;
> > > > > if (current->hotplug_depth++)
> > > > > return;
> >
> > <preempt, cpu hot-unplug, resume on different CPU>
> >
> > > > > current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
> >
> > <use-after-free>
> >
> > > > > mutex_lock(current->hotplug_lock);
> >
> > And it sleeps, so we can't use preempt_disable().
>
> i explained it in the other mail - this is the 'read' side. The 'write'
> side (code actually wanting to /do/ a CPU hotplug state transition) has
> to take /all/ these locks before it can take a CPU down.
Doesn't matter - the race is still there.
Well, not really, because we don't free the percpu data of offlined CPUs,
but we'd like to.
And it's easily fixable by using a statically-allocated array. That would
make life easier for code which wants to take this lock early in boot too.
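E.g. (sketch):

	/*
	 * Storage that never vanishes with the CPU; a mutex_init()
	 * loop early in boot covers the early callers as well.
	 */
	static struct mutex hotplug_lock[NR_CPUS];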
> so this is still a global CPU hotplug lock, but made scalable.
Scalability is not the problem. At present, at least.
On Sunday, 10 December 2006 13:16, Andrew Morton wrote:
> On Sun, 10 Dec 2006 12:49:43 +0100
> Ingo Molnar <[email protected]> wrote:
>
> > > > void cpu_hotplug_lock(void)
>
> This is actually not cpu-hotplug safe ;)
>
> > > > {
> > > > int cpu = raw_smp_processor_id();
> > > > /*
> > > > * Interrupts/softirqs are hotplug-safe:
> > > > */
> > > > if (in_interrupt())
> > > > return;
> > > > if (current->hotplug_depth++)
> > > > return;
>
> <preempt, cpu hot-unplug, resume on different CPU>
>
> > > > current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
>
> <use-after-free>
>
> > > > mutex_lock(current->hotplug_lock);
>
> And it sleeps, so we can't use preempt_disable().
>
> > > > }
>
> It's worth noting that this very common sequence:
>
> preempt_disable();
> cpu = smp_processor_id();
> ...
> preempt_enable();
>
> also provides cpu-hotunplug protection against scenarios such as the above.
>
> > > That's functionally equivalent to what we have now, and it isn't
> > > working too well.
> >
> > hm, i thought the main reason for not using cpu_hotplug_lock() in a
> > widespread manner was not related to its functionality but to its
> > scalability - but i could be wrong.
>
> It hasn't been noticed yet.
>
> I suspect a large part of the reason for that is that we only really care
> about hot-unplug when this CPU reaches across to some other CPU's data. Often
> _all_ other CPUs' data. And that's a super-inefficient thing, so it's rare.
>
> Most of the problems we've had are due to borkage in cpufreq. And that's
> simply cruddy code - it's not due to the complexity of CPU hotplug per se.
>
> > The one above is scalable and we
> > could use it as /the/ method to control CPU hotplug. All the flux i
> > remember around cpu_hotplug_lock() use from the fork path and from
> > other scheduler hotpaths related to its scalability bottleneck, not to
> > its locking efficiency.
>
> One quite different way of addressing all of this is to stop using
> stop_machine_run() for hotplug synchronisation and switch to the swsusp
> freezer infrastructure: all kernel threads and user processes need to stop
> and park themselves in a known state before we allow the CPU to be removed.
> lock_cpu_hotplug() becomes a no-op.
>
> Dunno if it'll work - I only just thought of it. It sure would simplify
> things.
Hm, currently we're using the CPU hotplug to disable the nonboot CPUs before
the freezer is called. ;-)
However, we're now trying to make the freezer SMP-safe, so that we can disable
the nonboot CPUs after devices have been suspended (we want to do this for
some ACPI-related reasons). In fact we're almost there, I'm only waiting for
the confirmation from Pavel to post the patches.
When this is done, we won't need the CPU hotplug that much and I think the CPU
hotplug code will be able to do something like:
freeze_processes
suspend_cpufreq (or even suspend_all_devices)
remove_the_CPU_in_question
resume_cpufreq
thaw_processes
Greetings,
Rafael
--
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King
On Sun, Dec 10, 2006 at 09:26:16AM +0100, Ingo Molnar wrote:
> something like the pseudocode further below - when applied to a data
> structure it has semantics and scalability close to that of
> preempt_disable(), but it is still preemptible and the lock is specific.
Ingo,
The pseudocode you have provided can still fail to avoid
the deadlock reported by Bjorn Helgaas earlier in this thread:
http://lkml.org/lkml/2006/12/6/352
Thread1 -> flush_workqueue -> mutex_lock(cpu4's hotplug_lock)

Thread2 (keventd) -> run_workqueue -> some_work_fn ->
	flush_workqueue -> mutex_lock(cpu4's hotplug_lock)

Both deadlock with each other.
All this mess could easily be avoided if we implemented a reference-count
based cpu_hotplug_lock(), as suggested by Arjan and Linus earlier and
implemented by Gautham here:
http://lkml.org/lkml/2006/10/26/65
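Schematically, the read side of the refcount idea is (an illustration
only, not the code in Gautham's patch):

	static DEFINE_SPINLOCK(hotplug_count_lock);
	static int hotplug_readers;
	static DECLARE_WAIT_QUEUE_HEAD(hotplug_readers_done);

	void lock_cpu_hotplug(void)
	{
		spin_lock(&hotplug_count_lock);
		/* recursion just nests -- a reader never blocks itself */
		hotplug_readers++;
		spin_unlock(&hotplug_count_lock);
	}

	void unlock_cpu_hotplug(void)
	{
		spin_lock(&hotplug_count_lock);
		if (!--hotplug_readers)
			wake_up(&hotplug_readers_done);
		spin_unlock(&hotplug_count_lock);
	}

	static int no_hotplug_readers(void)
	{
		int ret;

		spin_lock(&hotplug_count_lock);
		ret = !hotplug_readers;
		spin_unlock(&hotplug_count_lock);
		return ret;
	}

	/*
	 * The writer (cpu_up()/cpu_down()) waits for readers to drain.
	 * Since new readers are never blocked, this is also where the
	 * writer starvation mentioned earlier comes from.
	 */
	static void wait_for_hotplug_readers(void)
	{
		wait_event(hotplug_readers_done, no_hotplug_readers());
	}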
--
Regards,
vatsa
On Sun, Dec 10, 2006 at 04:16:00AM -0800, Andrew Morton wrote:
> One quite different way of addressing all of this is to stop using
> stop_machine_run() for hotplug synchronisation and switch to the swsusp
> freezer infrastructure: all kernel threads and user processes need to stop
> and park themselves in a known state before we allow the CPU to be removed.
> lock_cpu_hotplug() becomes a no-op.
Well ... you still need to provide some mechanism for stable access to
cpu_online_map in blocking functions (ex: do_event_scan_all_cpus).
Freezing tasks and resuming them after hot-unplug is definitely not one of
them (when they resume, cpu_online_map would have changed under their feet).
> Dunno if it'll work - I only just thought of it. It sure would simplify
> things.
--
Regards,
vatsa
> On Mon, 11 Dec 2006 11:15:45 +0530 Srivatsa Vaddagiri <[email protected]> wrote:
> On Sun, Dec 10, 2006 at 04:16:00AM -0800, Andrew Morton wrote:
> > One quite different way of addressing all of this is to stop using
> > stop_machine_run() for hotplug synchronisation and switch to the swsusp
> > freezer infrastructure: all kernel threads and user processes need to stop
> > and park themselves in a known state before we allow the CPU to be removed.
> > lock_cpu_hotplug() becomes a no-op.
>
> Well ... you still need to provide some mechanism for stable access to
> cpu_online_map in blocking functions (ex: do_event_scan_all_cpus).
> Freezing tasks and resuming them after hot-unplug is definitely not one of
> them (when they resume, cpu_online_map would have changed under their feet).
Problems will only occur if a process is holding some sort of per-cpu state
across a call to try_to_freeze(). Surely nobody does that.
* Andrew Morton <[email protected]> wrote:
> > > > > > {
> > > > > > int cpu = raw_smp_processor_id();
> > > > > > /*
> > > > > > * Interrupts/softirqs are hotplug-safe:
> > > > > > */
> > > > > > if (in_interrupt())
> > > > > > return;
> > > > > > if (current->hotplug_depth++)
> > > > > > return;
> > >
> > > <preempt, cpu hot-unplug, resume on different CPU>
> > >
> > > > > > current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
> > >
> > > <use-after-free>
> > >
> > > > > > mutex_lock(current->hotplug_lock);
> > >
> > > And it sleeps, so we can't use preempt_disable().
> >
> > i explained it in the other mail - this is the 'read' side. The 'write'
> > side (code actually wanting to /do/ a CPU hotplug state transition) has
> > to take /all/ these locks before it can take a CPU down.
>
> Doesn't matter - the race is still there.
>
> Well, not really, because we don't free the percpu data of offlined
> CPUs, but we'd like to.
>
> And it's easily fixable by using a statically-allocated array. That
> would make life easier for code which wants to take this lock early in
> boot too.
yeah.
but i think your freeze_processes() idea is even better - it would unify
the suspend and the CPU-hotplug infrastructure even more. (and kprobes
wants to use freeze_processes() too)
Ingo
On Sun, Dec 10, 2006 at 03:18:38PM +0100, Rafael J. Wysocki wrote:
> On Sunday, 10 December 2006 13:16, Andrew Morton wrote:
> > On Sun, 10 Dec 2006 12:49:43 +0100
>
> Hm, currently we're using the CPU hotplug to disable the nonboot CPUs before
> the freezer is called. ;-)
>
> However, we're now trying to make the freezer SMP-safe, so that we can disable
> the nonboot CPUs after devices have been suspended (we want to do this for
> some ACPI-related reasons). In fact we're almost there, I'm only waiting for
> the confirmation from Pavel to post the patches.
>
> When this is done, we won't need the CPU hotplug that much and I think the CPU
> hotplug code will be able to do something like:
>
> freeze_processes
> suspend_cpufreq (or even suspend_all_devices)
> remove_the_CPU_in_question
> resume_cpufreq
> thaw_processes
Have you thought about how much time this might take on
a machine with, say, 128 CPUs, of which I want to dynamically
reconfigure 64 to make a new partition? Assume
10,000 tasks are running in the system.
Thanks
Dipankar
On Monday, 11 December 2006 07:52, Dipankar Sarma wrote:
> On Sun, Dec 10, 2006 at 03:18:38PM +0100, Rafael J. Wysocki wrote:
> > On Sunday, 10 December 2006 13:16, Andrew Morton wrote:
> > > On Sun, 10 Dec 2006 12:49:43 +0100
> >
> > Hm, currently we're using the CPU hotplug to disable the nonboot CPUs before
> > the freezer is called. ;-)
> >
> > However, we're now trying to make the freezer SMP-safe, so that we can disable
> > the nonboot CPUs after devices have been suspended (we want to do this for
> > some ACPI-related reasons). In fact we're almost there, I'm only waiting for
> > the confirmation from Pavel to post the patches.
> >
> > When this is done, we won't need the CPU hotplug that much and I think the CPU
> > hotplug code will be able to do something like:
> >
> > freeze_processes
> > suspend_cpufreq (or even suspend_all_devices)
> > remove_the_CPU_in_question
> > resume_cpufreq
> > thaw_processes
>
> Have you thought about how much time this might take on
> a machine with, say, 128 CPUs, of which I want to dynamically
> reconfigure 64 to make a new partition? Assume
> 10,000 tasks are running in the system.
Well, of course it doesn't make sense to freeze processes and thaw them for
each CPU again and again. Still, the overall idea is to freeze processes and
suspend devices, then do something with as many CPUs as necessary, etc.
Greetings,
Rafael
--
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King