On Mon, Apr 20, 2020 at 1:52 AM Harald Arnesen <[email protected]> wrote:
>
> Neither rc1 nor rc2 will boot on my laptop. The attached picture is all
> I have been able to capture.
I know you saw the reply about this probably being fixed by
https://lore.kernel.org/lkml/[email protected]/
but it would be lovely if you could actually verify that that series
of four patches does indeed fix it for you.
Your oops is on that divide instruction:
freq_scale = div64_u64(acnt, mcnt);
and while we had a check for mcnt not being zero earlier, we did
mcnt *= arch_max_freq_ratio;
after that check. I could see it becoming zero either due to an
overflow, or due to arch_max_freq_ratio being 0.
I think the first commit in that series is supposed to fix that
arch_max_freq_ratio being 0 case, but it still feels like the code
that does the divide is checking for zero in the wrong place...
Linus
On Tue, Apr 21, 2020 at 12:03:10PM -0700, Linus Torvalds wrote:
> On Mon, Apr 20, 2020 at 1:52 AM Harald Arnesen <[email protected]> wrote:
> >
> > Neither rc1 nor rc2 will boot on my laptop. The attached picture is all
> > I have been able to capture.
>
> I know you saw the reply about this probably being fixed by
>
> https://lore.kernel.org/lkml/[email protected]/
>
> but it would be lovely if you could actually verify that that series
> of four patches does indeed fix it for you.
(not seeing the original report in the archives or my list copy)
I'm assuming it's some sort of dodgy virt setup, actual real proper
hardware should never get here like that.
> Your oops is on that divide instruction:
>
> freq_scale = div64_u64(acnt, mcnt);
>
> and while we had a check for mcnt not being zero earlier, we did
>
> mcnt *= arch_max_freq_ratio;
>
> after that check. I could see it becoming zero either due to an
> overflow, or due to arch_max_freq_ratio being 0.
Right, so that's not supposed to happen, as you say, we should not
enable this code if the ratio is 0, and we should not overflow mcnt due
to reading that reg once per tick.
But yeha, virt, anything can happen :/
> I think the first commit in that series is supposed to fix that
> arch_max_freq_ratio being 0 case, but it still feels like the code
> that does the divide is checking for zero in the wrong place...
Yeah, we can certainly modify that. As is, real actual hardware should
never even hit that case either. So we might as well move that check and
then also make it disable all this frequency scaling stuff if we ever do
hit it.
On Tue, Apr 21, 2020 at 2:23 PM Peter Zijlstra <[email protected]> wrote:
>
> (not seeing the original report in the archives or my list copy)
Hmm. It was cc'd to lkml, but maybe the attached 3.2MB picture ended
up making it too big to actually make it to the list...
The picture was just the bottom of the oops - I decoded the Code:
section just to verify that yeah, it's the div instruction in
arch_scale_freq_tick, from kernel_init -> native_smp_prepapre_cpus
(the rest of the oops has scrolled off the page).
> I'm assuming it's some sort of dodgy virt setup, actual real proper
> hardware should never get here like that.
It actually looks like a native boot on a Lenovo T510i laptop to me.
Also, see
https://lore.kernel.org/lkml/[email protected]/
where Dave Kleikamp seems to say that it happens on his Lenovo T410.
So I do think it's real, and _not_ virtualization. Please don't
dismiss it as some vmware artifact. Looks more like it's triggered by
some Lenovo Thinkpad BIOS setup issue.
Linus
On Tue, Apr 21, 2020 at 02:30:06PM -0700, Linus Torvalds wrote:
> Also, see
>
> https://lore.kernel.org/lkml/[email protected]/
>
> where Dave Kleikamp seems to say that it happens on his Lenovo T410.
>
> So I do think it's real, and _not_ virtualization. Please don't
> dismiss it as some vmware artifact. Looks more like it's triggered by
> some Lenovo Thinkpad BIOS setup issue.
Oh, right (I actually saw that email when I picked up the patches, but
clearly instantly forgot about it again). If the BIOS gets the tables
funny we hit the same paths.
Anyway, it would be good if the reporter can confirm, I've also queued
those patches for /urgent, so they should show up soonish.
Linus, Peter:
the panic seen by Harald Arnesen (and Dave Kleikamp) is unrelated to
virtualization, and happens on all machines that have a CPU with less than 4
physical cores. It's a very serious (and very stupid) bug in the original
version of my code and is fixed by patch 2/4 "x86, sched: Account for CPUs
with less than 4 cores in freq. invariance" in the series
https://lore.kernel.org/lkml/[email protected]
Harald, Dave:
for peace of mind, can you please share the output of
turbostat --interval 1 sleep 0
from your machines? This should print a long list of power management-related
information, including your CPU model and the content of MSR_TURBO_RATIO_LIMIT.
In all likelihood that MSR says that the "4 cores turbo frequency" of your CPU
is zero, and the buggy code divides by that value.
Harald:
I'll echo Linus' request of testing that the patch series linked above fixes
the problem on your machine. Since you're testing -rc kernels and bisecting
bugs I assume you're comfortable with patching and compiling kernels, but if
that is not the case I am more than happy to assist by providing either an RPM
or a DEB package, depending on the distribution you're running. Let me know.
More comments below.
On Tue, 2020-04-21 at 23:23 +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2020 at 12:03:10PM -0700, Linus Torvalds wrote:
> > On Mon, Apr 20, 2020 at 1:52 AM Harald Arnesen <[email protected]> wrote:
> > >
> > > Neither rc1 nor rc2 will boot on my laptop. The attached picture is all
> > > I have been able to capture.
> >
> > I know you saw the reply about this probably being fixed by
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > but it would be lovely if you could actually verify that that series
> > of four patches does indeed fix it for you.
>
> (not seeing the original report in the archives or my list copy)
>
> I'm assuming it's some sort of dodgy virt setup, actual real proper
> hardware should never get here like that.
I think Peter is mentioning virtualization because in another patch of my
bugfix series I address a separate issue, indeed arising in hypervisors:
patch 1/4 "x86, sched: Bail out of frequency invariance if base frequency is
unknown".
The thing is that my original code does two divisions (just grep for "div" in
the patch 1567c3e3467c "x86, sched: Add support for frequency invariance").
One of them divides by zero when num_cpus < 4, the other divides by zero on
some hypervisors (basically).
Here are the incriminated statements (file arch/x86/kernel/smpboot.c):
(1) arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
base_freq);
/* ... stuff ... */
mcnt *= arch_max_freq_ratio;
(2) freq_scale = div64_u64(acnt, mcnt);
Hypervisors:
If base_freq is zero, division (1) fails. This is addressed by patch 1/4 in the
bugfix series linked above. base_freq is read from MSR_PLATFORM_INFO (no BIOS
involved, it's straight from the CPU) and is the max CPU clock frequency
without counting turbo. The Intel SDM specifies this MSR's content, none of the
machines I tested said "my base clock frequency is zero", but I didn't think
of hypervisors.
Small machines:
If turbo_freq is zero, division (2) fails. This is addressed by patch 2/4
"x86, sched: Account for CPUs with less than 4 cores in freq. invariance".
I read turbo_freq from MSR_TURBO_RATIO_LIMIT (again: no BIOS involved), at a
specific offset where it should contain the clock frequency sustainable by 4
cores simultaneously. If the machines has less than for cores, this data is
rightfully zero and thus the panic that Harald Arnesen and Dave Kleikamp
observed. It's an embarrassing mistake I made, I tested on several machines
including a consumer-level Atom Airmont (Dell Wyse thin client) but all had at
least 4 cores so I didn't see it.
For the records: there is a report for the "small machines" div-by-zero bug
from a 24 core Atom P-Series (new product line from 2020), but the fix works
on that machine, see
https://lore.kernel.org/lkml/[email protected]/
>
> > Your oops is on that divide instruction:
> >
> > freq_scale = div64_u64(acnt, mcnt);
> >
> > and while we had a check for mcnt not being zero earlier, we did
> >
> > mcnt *= arch_max_freq_ratio;
> >
> > after that check. I could see it becoming zero either due to an
> > overflow, or due to arch_max_freq_ratio being 0.
>
> Right, so that's not supposed to happen, as you say, we should not
> enable this code if the ratio is 0, and we should not overflow mcnt due
> to reading that reg once per tick.
>
> But yeha, virt, anything can happen :/
>
> > I think the first commit in that series is supposed to fix that
> > arch_max_freq_ratio being 0 case, but it still feels like the code
> > that does the divide is checking for zero in the wrong place...
>
> Yeah, we can certainly modify that. As is, real actual hardware should
> never even hit that case either. So we might as well move that check and
> then also make it disable all this frequency scaling stuff if we ever do
> hit it.
I'm going to send the following patch. There are a number of reasons why mcnt
overflowing and landing exactly at zero is unlikely (we read MPERF once every
tick, and arch_max_freq_ratio is a small number) but it's still a real problem
(eg. tickless kernels) and the zero check is 4 lines above where it should be,
it just needs to move down.
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8c89e4d9ad28..fb71395cbcad 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -2055,14 +2055,14 @@ void arch_scale_freq_tick(void)
acnt = aperf - this_cpu_read(arch_prev_aperf);
mcnt = mperf - this_cpu_read(arch_prev_mperf);
- if (!mcnt)
- return;
this_cpu_write(arch_prev_aperf, aperf);
this_cpu_write(arch_prev_mperf, mperf);
acnt <<= 2*SCHED_CAPACITY_SHIFT;
mcnt *= arch_max_freq_ratio;
+ if (!mcnt)
+ return;
freq_scale = div64_u64(acnt, mcnt);
Giovanni
Linus Torvalds [21.04.2020 21:03]:
> On Mon, Apr 20, 2020 at 1:52 AM Harald Arnesen <[email protected]> wrote:
>>
>> Neither rc1 nor rc2 will boot on my laptop. The attached picture is all
>> I have been able to capture.
>
> I know you saw the reply about this probably being fixed by
>
> https://lore.kernel.org/lkml/[email protected]/
>
> but it would be lovely if you could actually verify that that series
> of four patches does indeed fix it for you.
>
> Your oops is on that divide instruction:
>
> freq_scale = div64_u64(acnt, mcnt);
>
> and while we had a check for mcnt not being zero earlier, we did
>
> mcnt *= arch_max_freq_ratio;
>
> after that check. I could see it becoming zero either due to an
> overflow, or due to arch_max_freq_ratio being 0.
>
> I think the first commit in that series is supposed to fix that
> arch_max_freq_ratio being 0 case, but it still feels like the code
> that does the divide is checking for zero in the wrong place...
>
> Linus
I will try the patches today, and report back.
--
Hilsen Harald
Giovanni Gherdovich [22.04.2020 11:02]:
> Harald, Dave:
>
> for peace of mind, can you please share the output of
>
> turbostat --interval 1 sleep 0
Output from turbostat is attached.
> Harald:
>
> I'll echo Linus' request of testing that the patch series linked above fixes
> the problem on your machine. Since you're testing -rc kernels and bisecting
> bugs I assume you're comfortable with patching and compiling kernels, but if
> that is not the case I am more than happy to assist by providing either an RPM
> or a DEB package, depending on the distribution you're running. Let me know.
Will try patching first, if I'm not successful, you may compile a DEB
package for me.
--
Hilsen Harald
Harald Arnesen [22.04.2020 11:37]:
> Giovanni Gherdovich [22.04.2020 11:02]:
>>
>> Harald:
>>
>> I'll echo Linus' request of testing that the patch series linked above fixes
>> the problem on your machine. Since you're testing -rc kernels and bisecting
>> bugs I assume you're comfortable with patching and compiling kernels, but if
>> that is not the case I am more than happy to assist by providing either an RPM
>> or a DEB package, depending on the distribution you're running. Let me know.
> Will try patching first, if I'm not successful, you may compile a DEB
> package for me.
I can confirm that my Thinkpad T510i boots normally with the four
patches added.
Thanks!
--
Hilsen Harald
On 4/22/20 4:02 AM, Giovanni Gherdovich wrote:
> Linus, Peter:
>
> the panic seen by Harald Arnesen (and Dave Kleikamp) is unrelated to
> virtualization, and happens on all machines that have a CPU with less than 4
> physical cores. It's a very serious (and very stupid) bug in the original
> version of my code and is fixed by patch 2/4 "x86, sched: Account for CPUs
> with less than 4 cores in freq. invariance" in the series
>
> https://lore.kernel.org/lkml/[email protected]
>
> Harald, Dave:
>
> for peace of mind, can you please share the output of
>
> turbostat --interval 1 sleep 0
This is a Lenovo T410.
turbostat version 20.03.20 - Len Brown <[email protected]>
CPUID(0): GenuineIntel 0xb CPUID levels; 0x80000008 xlevels; family:model:stepping 0x6:25:5 (6:37:5)
CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM HT TM
CPUID(6): APERF, TURBO, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, No-EPB
cpu0: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MWAIT PREFETCH TURBO)
CPUID(7): No-SGX
cpu0: MSR_MISC_PWR_MGMT: 0x00400000 (ENable-EIST_Coordination DISable-EPB DISable-OOB)
cpu0: MSR_PLATFORM_INFO: 0x90020011200
9 * 133.3 = 1200.0 MHz max efficiency frequency
18 * 133.3 = 2399.9 MHz base frequency
cpu0: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu0: MSR_TURBO_RATIO_LIMIT: 0x00001416
20 * 133.3 = 2666.6 MHz max turbo 2 active cores
22 * 133.3 = 2933.3 MHz max turbo 1 active cores
cpu0: MSR_PKG_CST_CONFIG_CONTROL: 0x00000403 (UNlocked, pkg-cstate-limit=3 (pc6))
current_driver: intel_idle
current_governor_ro: menu
cpu0: POLL: CPUIDLE CORE POLL IDLE
cpu0: C1: MWAIT 0x00
cpu0: C1E: MWAIT 0x01
cpu0: C3: MWAIT 0x10
cpu0: C6: MWAIT 0x20
cpu0: cpufreq driver: acpi-cpufreq
cpu0: cpufreq governor: ondemand
cpufreq boost: 1
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690a00 (105 C)
0.001871 sec
Node Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL% C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp Pkg%pc3 Pkg%pc6
- - - 395 26.14 1493 2430 12 0 0 1 1 10 0 0.00 1.50 0.59 73.54 0.00 28.45 45.41 0.00 63 0.00 0.00
-1 0 0 239 16.27 1473 2397 3 0 0 1 0 2 0 0.00 5.92 0.00 79.10 0.00 51.50 32.23 0.00 62 0.00 0.00
-1 0 1 784 50.12 1574 2385 3 0 0 0 1 2 0 0.00 0.00 2.37 47.20 0.00 17.26
-1 2 2 171 10.80 1598 2373 2 0 0 0 0 2 0 0.00 0.00 0.00 90.00 0.00 30.33 58.88 0.00 63
-1 2 3 358 27.42 1319 2377 4 0 0 0 0 4 0 0.00 0.00 0.00 72.18 0.00 14.64