2015-08-31 18:04:37

by Alex Thorlton

Subject: [BUG] Boot hangs at clocksource_done_booting on large configs

Hey guys,

We've been hitting issues with machines with 4096 or more logical cores
hanging up at clocksource_done_booting, seemingly forever. The problem
occurs somewhat intermittently, though it hits more consistently as you
increase the core count. We've observed this problem on kernels
starting at 3.10, all the way up through 4.2.

We initially thought there was some issue with the stop_machine code,
but Thomas G. cleared that up for us with a thorough explanation of the
stop_machine mechanism. After ruling out issues in stop_machine, we
started investigating possible hardware related scaling issues.

I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
out any scaling issues related to multiple concurrent reads to our
RTC's MMR. We start getting hung task timeouts during
clocksource_done_booting:

8<----
calling clocksource_done_booting+0x0/0x42 @ 1
INFO: task swapper/0:1 blocked for more than 480 seconds.
Not tainted 4.2.0-rc1-nortc+ #384
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swapper/0 D ffff881efa933b48 0 1 0 0x00000000
ffff881efa933b48 ffff881ef45e0000 ffffb5de7a130000 000000fdfa933b38
ffff881efa934000 7fffffffffffffff ffff881efa933ca0 ffffb5de7a130000
0000000000000000 ffff881efa933b68 ffffffff8158eae7 ffff881efa933b68
Call Trace:
[<ffffffff8158eae7>] schedule+0x37/0x80
[<ffffffff81591707>] schedule_timeout+0x1e7/0x270
[<ffffffff812ee9b9>] ? _find_next_bit.part.0+0x19/0x70
[<ffffffff8158fdb0>] wait_for_completion+0xa0/0x100
[<ffffffff8108d490>] ? wake_up_q+0x70/0x70
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff810fa2a0>] __stop_cpus+0x50/0x70
[<ffffffff810b8f7d>] ? console_unlock+0x2dd/0x4a0
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff810fa525>] stop_cpus+0x35/0x50
[<ffffffff810d18f0>] ? tk_setup_internals.constprop.10+0x140/0x140
[<ffffffff810fa652>] __stop_machine+0xb2/0xe0
[<ffffffff810d18f0>] ? tk_setup_internals.constprop.10+0x140/0x140
[<ffffffff810d18f0>] ? tk_setup_internals.constprop.10+0x140/0x140
[<ffffffff810fa6ae>] stop_machine+0x2e/0x50
[<ffffffff810d1fad>] timekeeping_notify+0x2d/0x50
[<ffffffff810d3d4b>] __clocksource_select+0xcb/0x120
[<ffffffff81b77cfa>] ? boot_override_clock+0x4d/0x4d
[<ffffffff81b77d2c>] clocksource_done_booting+0x32/0x42
[<ffffffff810003a9>] do_one_initcall+0x169/0x1d0
[<ffffffff81b5613c>] kernel_init_freeable+0x1bb/0x248
[<ffffffff81582560>] ? rest_init+0x80/0x80
[<ffffffff8158256e>] kernel_init+0xe/0xe0
[<ffffffff81592c5f>] ret_from_fork+0x3f/0x70
[<ffffffff81582560>] ? rest_init+0x80/0x80
---->8

After this I NMId the machine to get backtraces for the other CPUs.
Most are stuck around here:

8<----
CPU: 4867 PID: 19730 Comm: migration/4867 Tainted: G D 4.2.0-rc1-nortc+ #384
Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
task: ffffad5e798cc500 ti: ffffad5e798d4000 task.ti: ffffad5e798d4000
RIP: 0010:[<ffffffff8105e879>] [<ffffffff8105e879>] uv_handle_nmi+0x19/0x790
RSP: 0000:ffffc90020156da8 EFLAGS: 00010092
RAX: 000002fe65b95384 RBX: 000000000000002d RCX: ffffc90020162a98
RDX: 000002fe65b95384 RSI: ffffc90020156ef8 RDI: 0000000000000001
RBP: ffffc90020156e18 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000020 R12: 000002fe65b95384
R13: 0000000000000000 R14: ffffffff81a38880 R15: ffffc90020156ef8
FS: 0000000000000000(0000) GS:ffffc90020151000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000da80 CR3: 0000000001a0b000 CR4: 00000000001406e0
Stack:
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 000000000000002d 000002fe65b95384 0000000000000000
Call Trace:
<NMI>
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff81007869>] nmi_handle+0x79/0x100
[<ffffffff81007873>] ? nmi_handle+0x83/0x100
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff81007c6c>] unknown_nmi_error+0x1c/0x90
[<ffffffff81007e48>] default_do_nmi+0xd8/0xf0
[<ffffffff81007ee3>] do_nmi+0x83/0xb0
[<ffffffff815948bf>] end_repeat_nmi+0x1e/0x2e
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff810486c6>] ? read_hpet+0x16/0x20
[<ffffffff810486c6>] ? read_hpet+0x16/0x20
[<ffffffff810486c6>] ? read_hpet+0x16/0x20
<<EOE>>
<IRQ>
[<ffffffff810d1b97>] ktime_get+0x37/0xb0
[<ffffffff810d819b>] clockevents_program_event+0x4b/0x130
[<ffffffff810d8737>] tick_handle_periodic+0x57/0x70
[<ffffffff8103b46c>] local_apic_timer_interrupt+0x3c/0x70
[<ffffffff815952c1>] smp_apic_timer_interrupt+0x41/0x60
[<ffffffff815936ab>] apic_timer_interrupt+0x6b/0x70
<EOI>
[<ffffffff81097b41>] ? update_blocked_averages+0x381/0x860
[<ffffffff810f9f98>] ? multi_cpu_stop+0x98/0xf0
[<ffffffff8100272a>] ? __switch_to+0x26a/0x590
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff810fa1c8>] cpu_stopper_thread+0x88/0x110
[<ffffffff8158e4e8>] ? __schedule+0x338/0x900
[<ffffffff81085575>] smpboot_thread_fn+0x135/0x190
[<ffffffff81085440>] ? sort_range+0x30/0x30
[<ffffffff810820cb>] kthread+0xdb/0x100
[<ffffffff81081ff0>] ? kthread_create_on_node+0x180/0x180
[<ffffffff81592c5f>] ret_from_fork+0x3f/0x70
[<ffffffff81081ff0>] ? kthread_create_on_node+0x180/0x180
Code: 31 c0 e8 a0 c2 52 00 e9 39 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f7 53 48 83 ec 48 <48> 8b 1d 00 f2 fa 7e 65 44 8b 2d 70 b8 fa 7e 9c 58 0f 1f 44 00
---->8

I can provide full logs if desired.

Anyway, based on my limited understanding of what's going on here, I
tried disabling interrupts a bit further out in the cpu_stopper_thread
code, to try to avoid that apic_timer_interrupt during the clocksource
change. change_clocksource already disables interrupts, but, from the
backtraces, it appears that we're taking an interrupt after the stopper
thread wakes up, but before change_clocksource actually gets called.

I used this patch to try to avoid taking an apic_timer_interrupt at any
point during the clocksource change:

8<---
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index fd643d8..8502521 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -417,8 +417,11 @@ static void cpu_stopper_thread(unsigned int cpu)
{
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
struct cpu_stop_work *work;
+ unsigned long flags;
int ret;

+ local_irq_save(flags);
+
repeat:
work = NULL;
spin_lock_irq(&stopper->lock);
@@ -452,6 +455,8 @@ repeat:
cpu_stop_signal_done(done, true);
goto repeat;
}
+
+ local_irq_restore(flags);
}

extern void sched_set_stop_task(int cpu, struct task_struct *stop);
--->8

This was a bit of a shot in the dark, but it has worked on every kernel
and every machine we've tried it on so far.

I'm hoping to get some input from the experts in this area, first of
all, on whether the problem I'm seeing is actually what I think it is,
and, if so, if I've solved it in the correct way.

Any input is greatly appreciated!

- Alex


2015-08-31 18:25:11

by Alex Thorlton

Subject: Re: [BUG] Boot hangs at clocksource_done_booting on large configs

On Mon, Aug 31, 2015 at 01:04:33PM -0500, Alex Thorlton wrote:
> I can provide full logs if desired.

Full log is here:

http://oss.sgi.com/projects/athorlton/harp50-6144-nortc

I'd use wget to download it. Probably don't want to try to open the 15M
file in your browser :)

Also, I've filed a kernel BugZilla for the same issue, in case anyone
would rather use BugZilla:

https://bugzilla.kernel.org/show_bug.cgi?id=103821

- Alex

2015-08-31 20:32:58

by Peter Zijlstra

Subject: Re: [BUG] Boot hangs at clocksource_done_booting on large configs

On Mon, Aug 31, 2015 at 01:04:33PM -0500, Alex Thorlton wrote:
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index fd643d8..8502521 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -417,8 +417,11 @@ static void cpu_stopper_thread(unsigned int cpu)
> {
> struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
> struct cpu_stop_work *work;
> + unsigned long flags;
> int ret;
>
> + local_irq_save(flags);
> +
> repeat:
> work = NULL;
> spin_lock_irq(&stopper->lock);
> @@ -452,6 +455,8 @@ repeat:
> cpu_stop_signal_done(done, true);
> goto repeat;
> }
> +
> + local_irq_restore(flags);
> }
>

So I should probably just go sleep and not say anything.. _but_
*confused*.

That local_irq_save() will disable IRQs over:

work = NULL;

But that is _all_. The spin_unlock_irq() will re-enable IRQs, after
which things will run as usual.

That local_irq_restore() is a total NOP, IRQs are guaranteed enabled at
the local_irq_save() (otherwise lockdep would've complained about
spin_unlock_irq() unconditionally enabling them) and by the time we get
to the restore that same unlock_irq will have enabled them already.

2015-08-31 20:55:49

by Alex Thorlton

Subject: Re: [BUG] Boot hangs at clocksource_done_booting on large configs

On Mon, Aug 31, 2015 at 10:32:50PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 31, 2015 at 01:04:33PM -0500, Alex Thorlton wrote:
> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index fd643d8..8502521 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -417,8 +417,11 @@ static void cpu_stopper_thread(unsigned int cpu)
> > {
> > struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
> > struct cpu_stop_work *work;
> > + unsigned long flags;
> > int ret;
> >
> > + local_irq_save(flags);
> > +
> > repeat:
> > work = NULL;
> > spin_lock_irq(&stopper->lock);
> > @@ -452,6 +455,8 @@ repeat:
> > cpu_stop_signal_done(done, true);
> > goto repeat;
> > }
> > +
> > + local_irq_restore(flags);
> > }
> >
>
> So I should probably just go sleep and not say anything.. _but_
> *confused*.
>
> That local_irq_save() will disable IRQs over:
>
> work = NULL;
>
> But that is _all_. The spin_unlock_irq() will re-enable IRQs, after
> which things will run as usual.
>
> That local_irq_restore() is a total NOP, IRQs are guaranteed enabled at
> the local_irq_save() (otherwise lockdep would've complained about
> spin_unlock_irq() unconditionally enabling them) and by the time we get
> to the restore that same unlock_irq will have enabled them already.

Ahh, right. Well that code is worthless then :)

Either way though, I guess that means that slight change just fudged the
timing enough in that area to avoid the lockup we're seeing. Ignoring
my useless code change, does anything else jump out at you as
interesting, here?

2015-08-31 21:12:47

by Thomas Gleixner

Subject: Re: [BUG] Boot hangs at clocksource_done_booting on large configs

On Mon, 31 Aug 2015, Alex Thorlton wrote:
> I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
> out any scaling issues related to multiple concurrent reads to our
> RTC's MMR.

And to rule out scaling issues you replaced the RTC MMR with HPET. Not
a very good choice:

HPET does not scale either. It's uncached memory mapped I/O. See
below.

> I'm hoping to get some input from the experts in this area, first of
> all, on whether the problem I'm seeing is actually what I think it is,
> and, if so, if I've solved it in the correct way.

I fear both the analysis and the solution are wrong.

Up to the point where the actual clocksource change happens there is
no reason why timer interrupts should not happen. And the code which
actually changes the clocksource is definitely called with interrupts
disabled. When that function returns the new clocksource is fully
functional and interrupts can happen again.

Now looking at your backtraces. Most CPUs are in the migration thread
and a few (3073,3078,3079,3082) are in the idle task.

From the trace artifacts (? read_hpet) it looks like the clock source
change has been done and the cpus are on the way back from stop
machine.

But they are obviously held off by something. And that something looks
like the timekeeper sequence lock. Too bad, that we don't have a
backtrace for CPU0 in the log.

I really wonder how a machine that large works with HPET as
clocksource at all. read_hpet() is uncached memory mapped I/O which
takes thousands of CPU cycles. Last time I looked it was around
1us. Let's take that number to do some math.

If all CPUs do that access at the same time, then it takes NCPUS
microseconds to complete if the memory mapped I/O scheduling is
completely fair, which I doubt. So with 4k CPUs that's a whopping
4.096ms and it gets worse if you go larger. That's more than a tick
with HZ=250.

I'm quite sure that you are staring at the HPET scalability bottleneck
and not at some actual kernel bug.

Your patch shifts some timing around so the issue does not happen, but
that's certainly not a solution.

Thanks,

tglx