2014-12-03 20:40:47

by Alex Thorlton

Subject: [BUG] Possible locking issues in stop_machine code on 6k core machine

Hey guys,

While working to get our newly upgraded 6k core machine online, we've
discovered a few possible locking issues in the stop_machine code that
we're trying to get sorted out. (We think) the problems we're seeing
stem from possible interaction between stop_cpus and stop_one_cpu. The
issue presents as a deadlock, and seems to only show itself
intermittently.

After quite a bit of debugging we think we've narrowed the issue down to
the fact that stop_one_cpu does not respect many of the locks that are
taken in the stop_cpus code path. For reference the stop_cpus code path
takes the stop_cpus_mutex, then stop_cpus_lock, and then takes each
cpu's stopper->lock. stop_one_cpu seems to rely solely on the
stopper->lock.
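
For reference, the two entry points we're talking about are declared
roughly as follows (paraphrased from include/linux/stop_machine.h; a
sketch rather than a verbatim copy):

    typedef int (*cpu_stop_fn_t)(void *arg);

    /* Run fn(arg) on one target cpu's stopper thread and wait for it. */
    int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);

    /* Run fn(arg) on every cpu in cpumask and wait for all of them. */
    int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);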

What appears to be happening to cause our deadlock is this: stop_cpus works
its way down to queue_stop_cpus_work, which tells each cpu's stopper
task to wake up, take its lock, and do its work. As the loop that does
this progresses, the lowest numbered cpus complete their work, and are
allowed to go on about their business. The problem occurs when one of
these lower numbered cpus calls stop_one_cpu, targeting one of the
higher numbered cpus, which the stop_cpus loop has not yet reached. If
this happens, that higher numbered cpu's completion variable will get
stomped on, and the wait_for_completion in the stop_cpus code path will
never return.

A quick example: CPU 0 calls stop_cpus, which will hit all 6,000 cores.
CPU 50 completes its stopper work, and at some point in the near future
calls stop_one_cpu on CPU 5000. This clobbers CPU 5000's pointer to the
cpu_stop_done struct set up in queue_stop_cpus_work, meaning that, once
CPU 5000 completes its work, it won't be able to decrement the nr_todo
for the correct cpu_stop_done struct, and CPU 0's wait_for_completion
will never return.

Again, much of this is semi-educated guesswork, put together based on
information gathered from examining lots of debug output, in an attempt
to spot the problem. We're fairly certain that we've pinned down our
issue, but we'd like to ask those who are more knowledgeable of these
code paths to weigh in with their opinions here.

We'd really appreciate any help that anyone can offer. Thanks!

- Alex


2014-12-03 23:34:13

by Thomas Gleixner

Subject: Re: [BUG] Possible locking issues in stop_machine code on 6k core machine

Alex,

first of all, could you please format your mail so it is actually
readable and can be replied to? This looks like copy and paste from an
office app with about 5 gazillion spaces at the end of each line. I
have no idea how you tricked mutt to do that.

On Wed, 3 Dec 2014, Alex Thorlton wrote:

> While working to get our newly upgraded 6k core machine online,
> we've discovered a few possible locking issues in the stop_machine
> code that we're trying to get sorted out. (We think) the problems
> we're seeing stem from possible interaction between stop_cpus and
> stop_one_cpu. The issue presents as a deadlock, and seems to only
> show itself intermittently.

> After quite a bit of debugging we think we've narrowed the issue
> down to the fact that stop_one_cpu does not respect many of the
> locks that are taken in the stop_cpus code path. For reference the
> stop_cpus code path takes the stop_cpus_mutex, then stop_cpus_lock,
> and then takes each cpu's stopper->lock. stop_one_cpu seems to rely
> solely on the stopper->lock.

And what's actually wrong with that?

stop_cpus()
    mutex_lock(&stop_cpus_mutex);
    __stop_cpus()
        cpu_stop_init_done();               <-- works on a stack variable
        queue_stop_cpus_work()
            for_each_cpu()
                per_cpu(stop_cpus_work, cpu)
                ^^^^ is a static per cpu work struct protected by
                     stop_cpus_mutex

            /* Now we have to queue that work for real */
            lg_global_lock(&stop_cpus_lock);
            for_each_cpu(cpu, cpumask)
                cpu_stop_queue_work(cpu ...)
                    lock(per_cpu(stopper_lock, cpu));
                    list_add_tail(work, per_cpu(worklist, cpu));
                    unlock(per_cpu(stopper_lock, cpu));
            lg_global_unlock(&stop_cpus_lock);

stop_one_cpu()
    cpu_stop_init_done(&done, 1);           <-- works on a stack variable
    cpu_stop_queue_work(cpu, &work);        <-- work is on stack
    wait_for_completion(&done.completion);
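
For reference, the two objects involved look roughly like this (a
paraphrased sketch, not a verbatim copy of the kernel sources):

    struct cpu_stop_done {
        atomic_t            nr_todo;     /* number of works left to run */
        bool                executed;    /* did any work actually run? */
        int                 ret;         /* collected return value */
        struct completion   completion;  /* completed when nr_todo hits 0 */
    };

    struct cpu_stop_work {
        struct list_head        list;    /* linked on cpu_stopper->works */
        cpu_stop_fn_t           fn;
        void                    *arg;
        struct cpu_stop_done    *done;   /* points to the waiter's done */
    };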

So stop_cpus() operates on statically allocated per cpu cpu_stop_work. The
work->done of each per cpu work points to cpu_stop_done which is on
the stack of the caller. Nothing can stomp on the per_cpu work as long
as stop_cpus_mutex is held. Nothing can stomp on the done struct which
is on the stack of the caller and only referenced by the cpu work.

stop_one_cpu() has both the work and the done on the caller's stack. So
it needs to take neither the stop_cpus_mutex nor the stop_cpus_lock
lglock. All it requires is the per cpu stopper->lock to
add the work to the per cpu list of the target cpu.
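
The queueing path itself is nothing more than this (sketched from
memory, details may differ slightly from the actual source):

    static void cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
    {
        struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
        unsigned long flags;

        spin_lock_irqsave(&stopper->lock, flags);
        if (stopper->enabled) {
            /* Only the list linkage of @work is written here. */
            list_add_tail(&work->list, &stopper->works);
            wake_up_process(per_cpu(cpu_stopper_task, cpu));
        } else {
            /* Stopper not running: tell the waiter right away. */
            cpu_stop_signal_done(work->done, false);
        }
        spin_unlock_irqrestore(&stopper->lock, flags);
    }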

> What appears to be happening to cause our deadlock is this: stop_cpus
> works its way down to queue_stop_cpus_work, which tells each cpu's
> stopper task to wake up, take its lock, and do its work. As the
> loop that does this progresses, the lowest numbered cpus complete
> their work, and are allowed to go on about their business. The
> problem occurs when one of these lower numbered cpus calls
> stop_one_cpu, targeting one of the higher numbered cpus, which the
> stop_cpus loop has not yet reached. If this happens, that higher
> numbered cpu's completion variable will get stomped on, and the
> wait_for_completion in the stop_cpus code path will never return.

That's complete nonsense. It CANNOT stomp on the completion variable
of a higher numbered cpu, because there is no such variable.

Lets look at a single CPU:

The per cpu struct cpu_stopper has work A (from
stop_cpus) queued.

A looks like this:

A->fn = stop_cpus(fn);
A->arg = stop_cpus(arg);
A->done = struct cpu_stop_done on the stack of the stop_cpus() caller

Now B gets queued:

B looks like this:

B->fn = stop_one_cpu(fn);
B->arg = stop_one_cpu(arg);
B->done = struct cpu_stop_done on the stack of the stop_one_cpu() caller

The only connection between A and B is that they are enqueued on
the same list, i.e. the per cpu struct cpu_stopper::works list.

So now explain how the enqueueing of B stomps on A->done. There is no
way that the list_add_tail(B) can modify A->done.
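
The execution side is just as simple: the stopper thread dequeues one
work at a time, runs work->fn(work->arg) and then signals only that
work's own done (again a sketch from memory, not verbatim):

    static void cpu_stop_signal_done(struct cpu_stop_done *done, bool executed)
    {
        if (done) {
            if (executed)
                done->executed = true;
            /* Each waiter owns its own done. Completing B's done cannot
             * touch A's nr_todo or A's completion. */
            if (atomic_dec_and_test(&done->nr_todo))
                complete(&done->completion);
        }
    }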

Again. These are the protection scopes:

stop_cpus_mutex protects stop_cpus_work

    It is held across the stop_cpus() call until the
    wait_for_completion() returns. So nothing can stomp on
    stop_cpus_work and the done pointer of any cpu until
    stop_cpus_mutex is released.

stop_cpus_lock serializes multi_cpu_stop invocations

    You have not provided any indication that multi_cpu_stop is
    involved, so that's irrelevant.

stopper->lock protects the per cpu work list

> Again, much of this is semi-educated guesswork, put together based

I'd say semi-educated guesswork is a euphemism.

> on information gathered from examining lots of debug output, in an
> attempt to spot the problem.

Of course you completely failed to provide the actual call chains and
callback functions involved. Instead of that you provide your
interpretation of the problem, which seems to be a bit off.

> We're fairly certain that we've pinned down our issue,

I'm very certain that your interpretation of lots of debug output is
pretty much in line with your creative description of the problem.

> but we'd like to ask those who are more knowledgeable of these code
> paths to weigh in with their opinions here.

Done.

> We'd really appreciate any help that anyone can offer. Thanks!

If you would provide real data instead of half-baked assumptions
we might actually be able to help you.

There might be some hidden issue in the stomp machine crap, but surely
not the one you described.

Thanks,

tglx

2014-12-04 15:58:51

by Alex Thorlton

Subject: Re: [BUG] Possible locking issues in stop_machine code on 6k core machine

On Thu, Dec 04, 2014 at 12:34:04AM +0100, Thomas Gleixner wrote:
> first of all, could you please format your mail so it is actually
> readable and can be replied to?

My fault there - stupid mistake.

> If you would provide real data instead of half-baked assumptions
> we might actually be able to help you.
>
> There might be some hidden issue in the stomp machine crap, but surely
> not the one you described.

I'll return with some actual data so that others can get a better
picture of what we're seeing. I'll also see if I can come up with a
way to reproduce the issue on a smaller configuration; it might not be
possible, but attempting to do so might provide some more insight into
our problem.

Thanks for your detailed description of the code path! I definitely
understand that our assumption can't be the actual problem here. I'll
dig back into the issue and try to get some better information.

Thanks again!

- Alex