2018-06-11 14:00:00

by Jeremy Cline

Subject: Regression: x86/tsc: Fix mark_tsc_unstable()

Hi folks,

A few Fedora users have reported[0] a regression starting in v4.16.8
where the boot will hang ~1/3 of the time with the following RCU stall
warning:

INFO: rcu_sched detected stalls on CPUs/tasks:
o1-...!: (0 ticks this GP) idle=688/0/0 softirq=171/171 fqs=0
o(detected by 0, t=60002 jiffies, g=-142, c=-143, q=9)
Sending NMI from CPU 0 to CPU 1:
NMI backtrace for cpu 1 skipped: idling at acpi_processor_ffh_cstate_enter+0x65/0xb0
rcu_sched kthread starved for 60002 jiffies! g18446744073709551474 c1844674407370955143 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 -> cpu=1
RCU grace-period kthread stack dump:
rcu_sched I 0 9 2 0x80000000
Call Trace:
? __schedule+0x234/0x850
schedule+0x28/0x80
schedule_timeout+0x166/0x380
? __next_timer_interrupt+0xc0/0xc0
rcu_gp_kthread+0x368/0x830
? rcu_process_callbacks+0x4f0/0x4f0
kthread+0x112/0x130
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x35/0x40
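
For readers decoding the numbers in the warning: the grace-period counter appears once as a signed value (g=-142) and once as an unsigned 64-bit value (g18446744073709551474). The sketch below shows that the two agree, and converts the stall time from jiffies to seconds. The helper names are hypothetical, and HZ=1000 is an assumption that should be checked against the kernel config.

```python
# Sketch: reinterpret the signed counters from the stall warning as
# unsigned 64-bit values, and convert the stall time to seconds.
MASK64 = (1 << 64) - 1


def as_u64(value: int) -> int:
    """Return value reinterpreted as an unsigned 64-bit integer."""
    return value & MASK64


def jiffies_to_seconds(jiffies: int, hz: int) -> float:
    """Convert a jiffies count to seconds for a given tick rate."""
    return jiffies / hz


HZ = 1000  # assumption: CONFIG_HZ=1000, not confirmed in the report

print(as_u64(-142))                   # 18446744073709551474
print(as_u64(-143))                   # 18446744073709551473
print(jiffies_to_seconds(60002, HZ))  # 60.002
```

The unsigned form of c=-143 has 20 digits, so the 19-digit "c" value in the log above looks like it lost a digit somewhere along the way.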

A user has bisected the problem to the v4.16 commit 1ab4ca7c59d4
("x86/tsc: Fix mark_tsc_unstable()"). According to the reporter,
explicitly setting "tsc=" on the kernel command line causes the boot to
always succeed. All the users have Thinkpad T500s or T400s (Core 2 Duos).

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1579925


Thanks,
Jeremy


2018-06-11 14:39:44

by Peter Zijlstra

Subject: Re: Regression: x86/tsc: Fix mark_tsc_unstable()

On Mon, Jun 11, 2018 at 04:17:42PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 11, 2018 at 01:59:15PM +0000, Jeremy Cline wrote:
> > A user has bisected the problem to the v4.16 commit 1ab4ca7c59d4
> > ("x86/tsc: Fix mark_tsc_unstable()"). According to the reporter,
> > explicitly setting "tsc=" on the kernel command line causes the boot to
> > always succeed. All the users have Thinkpad T500s or T400s (Core 2 Duos)
>
> Weird. So Core2 typically triggers mark_tsc_unstable() in either
> intel_idle or processor_idle. ISTR testing that when I did the patches.
>
> When I make that mark_tsc_unstable() in the idle drivers unconditional
> and boot my ivb with that, it doesn't want to fail. I've booted the
> machine 5 consecutive times without issue.
>
> Let me try and checkout -stable, maybe something's up with that.

Nope -stable seems to be working as well on the IVB (with modification).
I just dug up my T500 and that's actually still running the test kernel.
Let me try and build the -stable kernel for that.



2018-06-11 15:32:46

by Peter Zijlstra

Subject: Re: Regression: x86/tsc: Fix mark_tsc_unstable()

On Mon, Jun 11, 2018 at 04:38:01PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 11, 2018 at 04:17:42PM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 11, 2018 at 01:59:15PM +0000, Jeremy Cline wrote:
> > > A user has bisected the problem to the v4.16 commit 1ab4ca7c59d4
> > > ("x86/tsc: Fix mark_tsc_unstable()"). According to the reporter,
> > > explicitly setting "tsc=" on the kernel command line causes the boot to
> > > always succeed. All the users have Thinkpad T500s or T400s (Core 2 Duos)
> >
> > Weird. So Core2 typically triggers mark_tsc_unstable() in either
> > intel_idle or processor_idle. ISTR testing that when I did the patches.
> >
> > When I make that mark_tsc_unstable() in the idle drivers unconditional
> > and boot my ivb with that, it doesn't want to fail. I've booted the
> > machine 5 consecutive times without issue.
> >
> > Let me try and checkout -stable, maybe something's up with that.
>
> Nope -stable seems to be working as well on the IVB (with modification).
> I just dug up my T500 and that's actually still running the test kernel.
> Let me try and build the -stable kernel for that.

4.16.8 works without issue on my T500 with a debian/ubuntu like distro
config.

2018-06-11 16:15:51

by Peter Zijlstra

Subject: Re: Regression: x86/tsc: Fix mark_tsc_unstable()

On Mon, Jun 11, 2018 at 01:59:15PM +0000, Jeremy Cline wrote:
> A user has bisected the problem to the v4.16 commit 1ab4ca7c59d4
> ("x86/tsc: Fix mark_tsc_unstable()"). According to the reporter,
> explicitly setting "tsc=" on the kernel command line causes the boot to
> always succeed. All the users have Thinkpad T500s or T400s (Core 2 Duos)

Weird. So Core2 typically triggers mark_tsc_unstable() in either
intel_idle or processor_idle. ISTR testing that when I did the patches.

When I make that mark_tsc_unstable() in the idle drivers unconditional
and boot my ivb with that, it doesn't want to fail. I've booted the
machine 5 consecutive times without issue.

Let me try and checkout -stable, maybe something's up with that.



2018-06-11 17:52:23

by Diego Viola

Subject: Re: Regression: x86/tsc: Fix mark_tsc_unstable()

On Mon, Jun 11, 2018 at 10:59 AM, Jeremy Cline wrote:
> Hi folks,
>
> A few Fedora users have reported[0] a regression starting in v4.16.8
> where the boot will hang ~1/3 of the time with the following RCU stall
> warning:
>
> INFO: rcu_sched detected stalls on CPUs/tasks:
> o1-...!: (0 ticks this GP) idle=688/0/0 softirq=171/171 fqs=0
> o(detected by 0, t=60002 jiffies, g=-142, c=-143, q=9)
> Sending NMI from CPU 0 to CPU 1:
> NMI backtrace for cpu 1 skipped: idling at
> acpi_processor_ffh_cstate_enter+0x65/0xb0
> rcu_sched kthread starved for 60002 jiffies! g18446744073709551474
> c1844674407370955143 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 -> cpu=1
> RCU grace-period kthread stack dump:
> rcu_sched I 0 9 2 0x80000000
> Call Trace:
> ? __schedule+0x234/0x850
> schedule+0x28/0x80
> schedule_timeout+0x166/0x380
> ? __next_timer_interrupt+0xc0/0xc0
> rcu_gp_kthread+0x368/0x830
> ? rcu_process_callbacks+0x4f0/0x4f0
> kthread+0x112/0x130
> ? kthread_create_worker_on_cpu+0x70/0x70
> ret_from_fork+0x35/0x40
>
> A user has bisected the problem to the v4.16 commit 1ab4ca7c59d4
> ("x86/tsc: Fix mark_tsc_unstable()"). According to the reporter,
> explicitly setting "tsc=" on the kernel command line causes the boot to
> always succeed. All the users have Thinkpad T500s or T400s (Core 2 Duos)
>
> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1579925
>
>
> Thanks,
> Jeremy

Everything works fine here with 4.16.8+ on my desktop with E5500 CPU.

[diego@dualcore ~]$ uname -a
Linux dualcore 4.16.13-2-ARCH #1 SMP PREEMPT Fri Jun 1 18:46:11 UTC 2018 x86_64 GNU/Linux
[diego@dualcore ~]$

2018-06-11 17:58:52

by Jeremy Cline

Subject: Re: Regression: x86/tsc: Fix mark_tsc_unstable()

On 06/11/2018 11:30 AM, Peter Zijlstra wrote:
> On Mon, Jun 11, 2018 at 04:38:01PM +0200, Peter Zijlstra wrote:
>> On Mon, Jun 11, 2018 at 04:17:42PM +0200, Peter Zijlstra wrote:
>>> On Mon, Jun 11, 2018 at 01:59:15PM +0000, Jeremy Cline wrote:
>>>> A user has bisected the problem to the v4.16 commit 1ab4ca7c59d4
>>>> ("x86/tsc: Fix mark_tsc_unstable()"). According to the reporter,
>>>> explicitly setting "tsc=" on the kernel command line causes the boot to
>>>> always succeed. All the users have Thinkpad T500s or T400s (Core 2 Duos)
>>>
>>> Weird. So Core2 typically triggers mark_tsc_unstable() in either
>>> intel_idle or processor_idle. ISTR testing that when I did the patches.
>>>
>>> When I make that mark_tsc_unstable() in the idle drivers unconditional
>>> and boot my ivb with that, it doesn't want to fail. I've booted the
>>> machine 5 consecutive times without issue.
>>>
>>> Let me try and checkout -stable, maybe something's up with that.
>>
>> Nope -stable seems to be working as well on the IVB (with modification).
>> I just dug up my T500 and that's actually still running the test kernel.
>> Let me try and build the -stable kernel for that.
>
> 4.16.8 works without issue on my T500 with a debian/ubuntu like distro
> config.
>

Adding mmarget (who bisected the problem) to the CC.

It might well be something Fedora-specific, then. I just noticed mmarget
commented over the weekend noting that they couldn't reproduce the
problem without using the initramfs generated during the RPM install of
the kernel. mmarget's theory was that it's a race condition that doesn't
occur when the initramfs takes long enough to unpack, but I don't know
enough about the early boot process *or* how Fedora's generating the
initramfs for RPM installs vs "make install" yet to know how likely that
is. I'm going to have to do some research.

Thanks for looking into this so quickly and also sorry if this turns out
to be a Fedora problem :(

Thanks,
Jeremy

2018-06-11 18:12:48

by Jeremy Cline

Subject: Re: Regression: x86/tsc: Fix mark_tsc_unstable()

On 06/11/2018 01:56 PM, Jeremy Cline wrote:
> On 06/11/2018 11:30 AM, Peter Zijlstra wrote:
>> On Mon, Jun 11, 2018 at 04:38:01PM +0200, Peter Zijlstra wrote:
>>> On Mon, Jun 11, 2018 at 04:17:42PM +0200, Peter Zijlstra wrote:
>>>> On Mon, Jun 11, 2018 at 01:59:15PM +0000, Jeremy Cline wrote:
>>>>> A user has bisected the problem to the v4.16 commit 1ab4ca7c59d4
>>>>> ("x86/tsc: Fix mark_tsc_unstable()"). According to the reporter,
>>>>> explicitly setting "tsc=" on the kernel command line causes the boot to
>>>>> always succeed. All the users have Thinkpad T500s or T400s (Core 2 Duos)
>>>>
>>>> Weird. So Core2 typically triggers mark_tsc_unstable() in either
>>>> intel_idle or processor_idle. ISTR testing that when I did the patches.
>>>>
>>>> When I make that mark_tsc_unstable() in the idle drivers unconditional
>>>> and boot my ivb with that, it doesn't want to fail. I've booted the
>>>> machine 5 consecutive times without issue.
>>>>
>>>> Let me try and checkout -stable, maybe something's up with that.
>>>
>>> Nope -stable seems to be working as well on the IVB (with modification).
>>> I just dug up my T500 and that's actually still running the test kernel.
>>> Let me try and build the -stable kernel for that.
>>
>> 4.16.8 works without issue on my T500 with a debian/ubuntu like distro
>> config.
>>
>
> Adding mmarget (who bisected the problem) to the CC.
>
> It might well be something Fedora-specific, then. I just noticed mmarget
> commented over the weekend noting that they couldn't reproduce the
> problem without using the initramfs generated during the RPM install of
> the kernel. mmarget's theory was that it's a race condition that doesn't
> occur when the initramfs takes long enough to unpack, but I don't know
> enough about the early boot process *or* how Fedora's generating the
> initramfs for RPM installs vs "make install" yet to know how likely that
> is. I'm going to have to do some research.
>
> Thanks for looking into this so quickly and also sorry if this turns out
> to be a Fedora problem :(

Attached is the Fedora configuration for 4.16.8, as well, in case you'd
like to test it with that.

Thanks,
Jeremy


Attachments:
config (191.77 kB)

2018-06-11 20:02:10

by Diego Viola

Subject: Re: Regression: x86/tsc: Fix mark_tsc_unstable()

On Mon, Jun 11, 2018 at 3:11 PM, Jeremy Cline wrote:
> On 06/11/2018 01:56 PM, Jeremy Cline wrote:
>> On 06/11/2018 11:30 AM, Peter Zijlstra wrote:
>>> On Mon, Jun 11, 2018 at 04:38:01PM +0200, Peter Zijlstra wrote:
>>>> On Mon, Jun 11, 2018 at 04:17:42PM +0200, Peter Zijlstra wrote:
>>>>> On Mon, Jun 11, 2018 at 01:59:15PM +0000, Jeremy Cline wrote:
>>>>>> A user has bisected the problem to the v4.16 commit 1ab4ca7c59d4
>>>>>> ("x86/tsc: Fix mark_tsc_unstable()"). According to the reporter,
>>>>>> explicitly setting "tsc=" on the kernel command line causes the boot to
>>>>>> always succeed. All the users have Thinkpad T500s or T400s (Core 2 Duos)
>>>>>
>>>>> Weird. So Core2 typically triggers mark_tsc_unstable() in either
>>>>> intel_idle or processor_idle. ISTR testing that when I did the patches.
>>>>>
>>>>> When I make that mark_tsc_unstable() in the idle drivers unconditional
>>>>> and boot my ivb with that, it doesn't want to fail. I've booted the
>>>>> machine 5 consecutive times without issue.
>>>>>
>>>>> Let me try and checkout -stable, maybe something's up with that.
>>>>
>>>> Nope -stable seems to be working as well on the IVB (with modification).
>>>> I just dug up my T500 and that's actually still running the test kernel.
>>>> Let me try and build the -stable kernel for that.
>>>
>>> 4.16.8 works without issue on my T500 with a debian/ubuntu like distro
>>> config.
>>>
>>
>> Adding mmarget (who bisected the problem) to the CC.
>>
>> It might well be something Fedora-specific, then. I just noticed mmarget
>> commented over the weekend noting that they couldn't reproduce the
>> problem without using the initramfs generated during the RPM install of
>> the kernel. mmarget's theory was that it's a race condition that doesn't
>> occur when the initramfs takes long enough to unpack, but I don't know
>> enough about the early boot process *or* how Fedora's generating the
>> initramfs for RPM installs vs "make install" yet to know how likely that
>> is. I'm going to have to do some research.
>>
>> Thanks for looking into this so quickly and also sorry if this turns out
>> to be a Fedora problem :(
>
> Attached is the Fedora configuration for 4.16.8, as well, in case you'd
> like to test it with that.
>
> Thanks,
> Jeremy

Hi Jeremy,

I've compiled 4.16.8 with your config and booted my machine about 10
times with this kernel, and I'm unable to reproduce the issue.
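
As a rough guide to how much a run of clean boots says, the sketch below computes the chance of seeing no hang at all in N boots. It assumes the machine is affected at the reported ~1/3 rate and that boots are independent; neither is established for this hardware, and the function name is hypothetical.

```python
def prob_all_boots_succeed(hang_rate: float, boots: int) -> float:
    """Probability that `boots` independent boots all succeed."""
    return (1.0 - hang_rate) ** boots


HANG_RATE = 1 / 3  # the "~1/3 of the time" figure from the report

for boots in (5, 10):
    p = prob_all_boots_succeed(HANG_RATE, boots)
    print(f"{boots} clean boots by chance: {p:.3f}")
```

Under those assumptions, 5 clean boots would happen by chance about 13% of the time and 10 clean boots about 2% of the time, so a 10-boot run is fairly strong evidence that a given setup does not hit the hang.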

Maybe it's an issue with the Fedora initramfs?

Diego

2018-06-11 21:42:23

by Jeremy Cline

Subject: Re: Regression: x86/tsc: Fix mark_tsc_unstable()



On 06/11/2018 03:23 PM, Diego Viola wrote:
> On Mon, Jun 11, 2018 at 3:11 PM, Jeremy Cline wrote:
>> On 06/11/2018 01:56 PM, Jeremy Cline wrote:
>>> On 06/11/2018 11:30 AM, Peter Zijlstra wrote:
>>>> On Mon, Jun 11, 2018 at 04:38:01PM +0200, Peter Zijlstra wrote:
>>>>> On Mon, Jun 11, 2018 at 04:17:42PM +0200, Peter Zijlstra wrote:
>>>>>> On Mon, Jun 11, 2018 at 01:59:15PM +0000, Jeremy Cline wrote:
>>>>>>> A user has bisected the problem to the v4.16 commit 1ab4ca7c59d4
>>>>>>> ("x86/tsc: Fix mark_tsc_unstable()"). According to the reporter,
>>>>>>> explicitly setting "tsc=" on the kernel command line causes the boot to
>>>>>>> always succeed. All the users have Thinkpad T500s or T400s (Core 2 Duos)
>>>>>>
>>>>>> Weird. So Core2 typically triggers mark_tsc_unstable() in either
>>>>>> intel_idle or processor_idle. ISTR testing that when I did the patches.
>>>>>>
>>>>>> When I make that mark_tsc_unstable() in the idle drivers unconditional
>>>>>> and boot my ivb with that, it doesn't want to fail. I've booted the
>>>>>> machine 5 consecutive times without issue.
>>>>>>
>>>>>> Let me try and checkout -stable, maybe something's up with that.
>>>>>
>>>>> Nope -stable seems to be working as well on the IVB (with modification).
>>>>> I just dug up my T500 and that's actually still running the test kernel.
>>>>> Let me try and build the -stable kernel for that.
>>>>
>>>> 4.16.8 works without issue on my T500 with a debian/ubuntu like distro
>>>> config.
>>>>
>>>
>>> Adding mmarget (who bisected the problem) to the CC.
>>>
>>> It might well be something Fedora-specific, then. I just noticed mmarget
>>> commented over the weekend noting that they couldn't reproduce the
>>> problem without using the initramfs generated during the RPM install of
>>> the kernel. mmarget's theory was that it's a race condition that doesn't
>>> occur when the initramfs takes long enough to unpack, but I don't know
>>> enough about the early boot process *or* how Fedora's generating the
>>> initramfs for RPM installs vs "make install" yet to know how likely that
>>> is. I'm going to have to do some research.
>>>
>>> Thanks for looking into this so quickly and also sorry if this turns out
>>> to be a Fedora problem :(
>>
>> Attached is the Fedora configuration for 4.16.8, as well, in case you'd
>> like to test it with that.
>>
>> Thanks,
>> Jeremy
>
> Hi Jeremy,
>
> I've compiled 4.16.8 with your config and booted my machine about 10
> times with this kernel, and I'm unable to reproduce the issue.

Thanks for confirming.

>
> Maybe it's an issue with the Fedora initramfs?

Indeed, I'll dig into what exactly is different about the RPM-created
initramfs and the one created with "make install" to see if we can
narrow this down some more.
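
One way such a comparison could start is sketched below: given a file listing from each initramfs (for example, as printed by a tool such as dracut's lsinitrd, which is an assumption about the workflow), report the paths present in only one of them. The function name and the sample paths are hypothetical placeholders, not findings.

```python
def diff_listings(rpm_listing, make_listing):
    """Return (only_in_rpm, only_in_make) as sorted lists of paths.

    Each argument is an iterable of path strings, one per file in the
    corresponding initramfs. Blank lines are ignored.
    """
    rpm_paths = {line.strip() for line in rpm_listing if line.strip()}
    make_paths = {line.strip() for line in make_listing if line.strip()}
    return sorted(rpm_paths - make_paths), sorted(make_paths - rpm_paths)


# Hypothetical sample listings, for illustration only.
rpm_sample = [
    "usr/bin/sh",
    "usr/lib/modules/example/a.ko",
    "etc/example.conf",
]
make_sample = [
    "usr/bin/sh",
    "usr/lib/modules/example/a.ko",
    "usr/lib/modules/example/b.ko",
]

only_rpm, only_make = diff_listings(rpm_sample, make_sample)
print("only in RPM initramfs:", only_rpm)
print("only in make-install initramfs:", only_make)
```

A path-level diff only shows which files differ in presence; differences in file contents, compression, or image size would need a separate check.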

Thanks,
Jeremy