2014-01-03 17:30:32

by Dirk Brandewie

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

Hi All,

Sorry for being late to the party but I just got back from vacation.

There is something deeply wrong here. We should have never gotten to
intel_pstate_init_cpu(). The VM had to have returned value from the read
of the max pstate at driver init time and 0 when the CPU was being brought up.

intel_pstate_msrs_not_valid() was added to solve this issue early on
if I remember correctly it was Josh that reported it then. Is there
a definative way to detect whether we are running in a VM?

Can some one tell me how the nested environment differs in regards to
reading MSRs?

TIA
--Dirk

On 12/30/2013 06:07 PM, Rafael J. Wysocki wrote:
>>> ---
>>> drivers/cpufreq/intel_pstate.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> Index: linux-pm/drivers/cpufreq/intel_pstate.c
>>> ===================================================================
>>> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
>>> +++ linux-pm/drivers/cpufreq/intel_pstate.c
>>> @@ -614,6 +614,11 @@ static int intel_pstate_init_cpu(unsigne
>>> cpu = all_cpu_data[cpunum];
>>>
>>> intel_pstate_get_cpu_pstates(cpu);
>>> + if (!cpu->pstate.current_pstate) {
>>> + all_cpu_data[cpunum] = NULL;
>>> + kfree(cpu);
>>> + return -ENODATA;
>>> + }
>>>
>>> cpu->cpu = cpunum;
>>>
>>>
>>
>>
>> Thanks Rafel, I can confirm this patch helps.
>
> Awesome, thanks!
>
> Below is an official version with a changelog. I'll queue it up as a fix
> for 3.13.
>
> Thanks,
> Rafael
>
>
> ---
> From: Rafael J. Wysocki <[email protected]>
> Subject: intel_pstate: Fail initialization if P-state information is missing
>
> If pstate.current_pstate is 0 after the initial
> intel_pstate_get_cpu_pstates(), this means that we were unable to
> obtain any useful P-state information and there is no reason to
> continue, so free memory and return an error in that case.
>
> This fixes the following divide error occuring in a nested KVM
> guest:
>
> Intel P-state driver initializing.
> Intel pstate controlling: cpu 0
> cpufreq: __cpufreq_add_dev: ->get() failed
> divide error: 0000 [#1] SMP
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-0.rc4.git5.1.fc21.x86_64 #1
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> task: ffff88001ea20000 ti: ffff88001e9bc000 task.ti: ffff88001e9bc000
> RIP: 0010:[<ffffffff815c551d>] [<ffffffff815c551d>] intel_pstate_timer_func+0x11d/0x2b0
> RSP: 0000:ffff88001ee03e18 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff88001a454348 RCX: 0000000000006100
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffff88001ee03e38 R08: 0000000000000000 R09: 0000000000000000
> R10: ffff88001ea20000 R11: 0000000000000000 R12: 00000c0a1ea20000
> R13: 1ea200001ea20000 R14: ffffffff815c5400 R15: ffff88001a454348
> FS: 0000000000000000(0000) GS:ffff88001ee00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000000006f0
> Stack:
> fffffffb1a454390 ffffffff821a4500 ffff88001a454390 0000000000000100
> ffff88001ee03ea8 ffffffff81083e9a ffffffff81083e15 ffffffff82d5ed40
> ffffffff8258cc60 0000000000000000 ffffffff81ac39de 0000000000000000
> Call Trace:
> <IRQ>
> [<ffffffff81083e9a>] call_timer_fn+0x8a/0x310
> [<ffffffff81083e15>] ? call_timer_fn+0x5/0x310
> [<ffffffff815c5400>] ? pid_param_set+0x130/0x130
> [<ffffffff81084354>] run_timer_softirq+0x234/0x380
> [<ffffffff8107aee4>] __do_softirq+0x104/0x430
> [<ffffffff8107b5fd>] irq_exit+0xcd/0xe0
> [<ffffffff81770645>] smp_apic_timer_interrupt+0x45/0x60
> [<ffffffff8176efb2>] apic_timer_interrupt+0x72/0x80
> <EOI>
> [<ffffffff810e15cd>] ? vprintk_emit+0x1dd/0x5e0
> [<ffffffff81757719>] printk+0x67/0x69
> [<ffffffff815c1493>] __cpufreq_add_dev.isra.13+0x883/0x8d0
> [<ffffffff815c14f0>] cpufreq_add_dev+0x10/0x20
> [<ffffffff814a14d1>] subsys_interface_register+0xb1/0xf0
> [<ffffffff815bf5cf>] cpufreq_register_driver+0x9f/0x210
> [<ffffffff81fb19af>] intel_pstate_init+0x27d/0x3be
> [<ffffffff81761e3e>] ? mutex_unlock+0xe/0x10
> [<ffffffff81fb1732>] ? cpufreq_gov_dbs_init+0x12/0x12
> [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0
> [<ffffffff8109dbf5>] ? parse_args+0x225/0x3f0
> [<ffffffff81f64193>] kernel_init_freeable+0x1fc/0x287
> [<ffffffff81f638d0>] ? do_early_param+0x88/0x88
> [<ffffffff8174b530>] ? rest_init+0x150/0x150
> [<ffffffff8174b53e>] kernel_init+0xe/0x130
> [<ffffffff8176e27c>] ret_from_fork+0x7c/0xb0
> [<ffffffff8174b530>] ? rest_init+0x150/0x150
> Code: c1 e0 05 48 63 bc 03 10 01 00 00 48 63 83 d0 00 00 00 48 63 d6 48 c1 e2 08 c1 e1 08 4c 63 c2 48 c1 e0 08 48 98 48 c1 e0 08 48 99 <49> f7 f8 48 98 48 0f af f8 48 c1 ff 08 29 f9 89 ca c1 fa 1f 89
> RIP [<ffffffff815c551d>] intel_pstate_timer_func+0x11d/0x2b0
> RSP <ffff88001ee03e18>
> ---[ end trace f166110ed22cc37a ]---
> Kernel panic - not syncing: Fatal exception in interrupt
>
> Reported-and-tested-by: Kashyap Chamarthy <[email protected]>
> Cc: Josh Boyer <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> drivers/cpufreq/intel_pstate.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -614,6 +614,11 @@ static int intel_pstate_init_cpu(unsigne
> cpu = all_cpu_data[cpunum];
>
> intel_pstate_get_cpu_pstates(cpu);
> + if (!cpu->pstate.current_pstate) {
> + all_cpu_data[cpunum] = NULL;
> + kfree(cpu);
> + return -ENODATA;
> + }
>
> cpu->cpu = cpunum;
>
>


2014-01-03 18:04:45

by Gleb Natapov

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On Fri, Jan 03, 2014 at 09:30:28AM -0800, Dirk Brandewie wrote:
> Hi All,
>
> Sorry for being late to the party but I just got back from vacation.
>
> There is something deeply wrong here. We should have never gotten to
> intel_pstate_init_cpu(). The VM had to have returned value from the read
> of the max pstate at driver init time and 0 when the CPU was being brought up.
>
> intel_pstate_msrs_not_valid() was added to solve this issue early on
> if I remember correctly it was Josh that reported it then. Is there
> a definative way to detect whether we are running in a VM?
>
Checking for VM is a wrong thing to do here. KVM should behave like it
does not support p-state.

> Can some one tell me how the nested environment differs in regards to
> reading MSRs?
>
It shouldn't differ, but there may be bug somewhere in nested emulation.
We shouldn't try and hind the bug by doing more checks in Linux but
rather fixing KVM bug that causes Linux to behave incorrectly.

> TIA
> --Dirk
>
> On 12/30/2013 06:07 PM, Rafael J. Wysocki wrote:
> >>>---
> >>> drivers/cpufreq/intel_pstate.c | 5 +++++
> >>> 1 file changed, 5 insertions(+)
> >>>
> >>>Index: linux-pm/drivers/cpufreq/intel_pstate.c
> >>>===================================================================
> >>>--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> >>>+++ linux-pm/drivers/cpufreq/intel_pstate.c
> >>>@@ -614,6 +614,11 @@ static int intel_pstate_init_cpu(unsigne
> >>> cpu = all_cpu_data[cpunum];
> >>>
> >>> intel_pstate_get_cpu_pstates(cpu);
> >>>+ if (!cpu->pstate.current_pstate) {
> >>>+ all_cpu_data[cpunum] = NULL;
> >>>+ kfree(cpu);
> >>>+ return -ENODATA;
> >>>+ }
> >>>
> >>> cpu->cpu = cpunum;
> >>>
> >>>
> >>
> >>
> >>Thanks Rafel, I can confirm this patch helps.
> >
> >Awesome, thanks!
> >
> >Below is an official version with a changelog. I'll queue it up as a fix
> >for 3.13.
> >
> >Thanks,
> >Rafael
> >
> >
> >---
> >From: Rafael J. Wysocki <[email protected]>
> >Subject: intel_pstate: Fail initialization if P-state information is missing
> >
> >If pstate.current_pstate is 0 after the initial
> >intel_pstate_get_cpu_pstates(), this means that we were unable to
> >obtain any useful P-state information and there is no reason to
> >continue, so free memory and return an error in that case.
> >
> >This fixes the following divide error occuring in a nested KVM
> >guest:
> >
> >Intel P-state driver initializing.
> >Intel pstate controlling: cpu 0
> >cpufreq: __cpufreq_add_dev: ->get() failed
> >divide error: 0000 [#1] SMP
> >Modules linked in:
> >CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-0.rc4.git5.1.fc21.x86_64 #1
> >Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> >task: ffff88001ea20000 ti: ffff88001e9bc000 task.ti: ffff88001e9bc000
> >RIP: 0010:[<ffffffff815c551d>] [<ffffffff815c551d>] intel_pstate_timer_func+0x11d/0x2b0
> >RSP: 0000:ffff88001ee03e18 EFLAGS: 00010246
> >RAX: 0000000000000000 RBX: ffff88001a454348 RCX: 0000000000006100
> >RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> >RBP: ffff88001ee03e38 R08: 0000000000000000 R09: 0000000000000000
> >R10: ffff88001ea20000 R11: 0000000000000000 R12: 00000c0a1ea20000
> >R13: 1ea200001ea20000 R14: ffffffff815c5400 R15: ffff88001a454348
> >FS: 0000000000000000(0000) GS:ffff88001ee00000(0000) knlGS:0000000000000000
> >CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000000006f0
> >Stack:
> > fffffffb1a454390 ffffffff821a4500 ffff88001a454390 0000000000000100
> > ffff88001ee03ea8 ffffffff81083e9a ffffffff81083e15 ffffffff82d5ed40
> > ffffffff8258cc60 0000000000000000 ffffffff81ac39de 0000000000000000
> >Call Trace:
> > <IRQ>
> > [<ffffffff81083e9a>] call_timer_fn+0x8a/0x310
> > [<ffffffff81083e15>] ? call_timer_fn+0x5/0x310
> > [<ffffffff815c5400>] ? pid_param_set+0x130/0x130
> > [<ffffffff81084354>] run_timer_softirq+0x234/0x380
> > [<ffffffff8107aee4>] __do_softirq+0x104/0x430
> > [<ffffffff8107b5fd>] irq_exit+0xcd/0xe0
> > [<ffffffff81770645>] smp_apic_timer_interrupt+0x45/0x60
> > [<ffffffff8176efb2>] apic_timer_interrupt+0x72/0x80
> > <EOI>
> > [<ffffffff810e15cd>] ? vprintk_emit+0x1dd/0x5e0
> > [<ffffffff81757719>] printk+0x67/0x69
> > [<ffffffff815c1493>] __cpufreq_add_dev.isra.13+0x883/0x8d0
> > [<ffffffff815c14f0>] cpufreq_add_dev+0x10/0x20
> > [<ffffffff814a14d1>] subsys_interface_register+0xb1/0xf0
> > [<ffffffff815bf5cf>] cpufreq_register_driver+0x9f/0x210
> > [<ffffffff81fb19af>] intel_pstate_init+0x27d/0x3be
> > [<ffffffff81761e3e>] ? mutex_unlock+0xe/0x10
> > [<ffffffff81fb1732>] ? cpufreq_gov_dbs_init+0x12/0x12
> > [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0
> > [<ffffffff8109dbf5>] ? parse_args+0x225/0x3f0
> > [<ffffffff81f64193>] kernel_init_freeable+0x1fc/0x287
> > [<ffffffff81f638d0>] ? do_early_param+0x88/0x88
> > [<ffffffff8174b530>] ? rest_init+0x150/0x150
> > [<ffffffff8174b53e>] kernel_init+0xe/0x130
> > [<ffffffff8176e27c>] ret_from_fork+0x7c/0xb0
> > [<ffffffff8174b530>] ? rest_init+0x150/0x150
> >Code: c1 e0 05 48 63 bc 03 10 01 00 00 48 63 83 d0 00 00 00 48 63 d6 48 c1 e2 08 c1 e1 08 4c 63 c2 48 c1 e0 08 48 98 48 c1 e0 08 48 99 <49> f7 f8 48 98 48 0f af f8 48 c1 ff 08 29 f9 89 ca c1 fa 1f 89
> >RIP [<ffffffff815c551d>] intel_pstate_timer_func+0x11d/0x2b0
> > RSP <ffff88001ee03e18>
> >---[ end trace f166110ed22cc37a ]---
> >Kernel panic - not syncing: Fatal exception in interrupt
> >
> >Reported-and-tested-by: Kashyap Chamarthy <[email protected]>
> >Cc: Josh Boyer <[email protected]>
> >Signed-off-by: Rafael J. Wysocki <[email protected]>
> >---
> > drivers/cpufreq/intel_pstate.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> >Index: linux-pm/drivers/cpufreq/intel_pstate.c
> >===================================================================
> >--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> >+++ linux-pm/drivers/cpufreq/intel_pstate.c
> >@@ -614,6 +614,11 @@ static int intel_pstate_init_cpu(unsigne
> > cpu = all_cpu_data[cpunum];
> >
> > intel_pstate_get_cpu_pstates(cpu);
> >+ if (!cpu->pstate.current_pstate) {
> >+ all_cpu_data[cpunum] = NULL;
> >+ kfree(cpu);
> >+ return -ENODATA;
> >+ }
> >
> > cpu->cpu = cpunum;
> >
> >
>

--
Gleb.

2014-01-03 20:00:01

by Dirk Brandewie

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On 01/03/2014 10:04 AM, Gleb Natapov wrote:
> On Fri, Jan 03, 2014 at 09:30:28AM -0800, Dirk Brandewie wrote:
>> Hi All,
>>
>> Sorry for being late to the party but I just got back from vacation.
>>
>> There is something deeply wrong here. We should have never gotten to
>> intel_pstate_init_cpu(). The VM had to have returned value from the read
>> of the max pstate at driver init time and 0 when the CPU was being brought up.
>>
>> intel_pstate_msrs_not_valid() was added to solve this issue early on
>> if I remember correctly it was Josh that reported it then. Is there
>> a definative way to detect whether we are running in a VM?
>>
> Checking for VM is a wrong thing to do here. KVM should behave like it
> does not support p-state.
>
>> Can some one tell me how the nested environment differs in regards to
>> reading MSRs?
>>
> It shouldn't differ, but there may be bug somewhere in nested emulation.
> We shouldn't try and hind the bug by doing more checks in Linux but
> rather fixing KVM bug that causes Linux to behave incorrectly.

Based on the unhandled MSR messages in the host dmesg the following patch
should make sure the correct values are returned for PLAT_FORM_INFO, APERF
and MPERF. intel_pstate and turbostat are the only users of these registers.

Could you try the folloinw patch minus Rafael's patch please.
Compile tested only.

TIA
--Dirk

commit 5594b89bee7f83200c1a70bf95d50ac35e4fe3f8
Author: Dirk Brandewie <[email protected]>
Date: Fri Jan 3 11:44:15 2014 -0800

x86/kvm: Handle MSR_PLATFORM_INFO, MSR_IA32_MPERF and MSR_IA32_APERF MSRs

Handle MSRs correctly when read from a nested KVM

Signed-off-by: Dirk Brandewie <[email protected]>
---
arch/x86/kvm/x86.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5d004da..390ef27 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2336,6 +2336,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64
*pdata)

switch (msr) {
case MSR_IA32_PLATFORM_ID:
+ case MSR_IA32_MPERF:
+ case MSR_IA32_APERF:
+ case MSR_PLATFORM_INFO:
case MSR_IA32_EBL_CR_POWERON:
case MSR_IA32_DEBUGCTLMSR:
case MSR_IA32_LASTBRANCHFROMIP:

2014-01-03 22:06:29

by Paolo Bonzini

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

Il 03/01/2014 21:00, Dirk Brandewie ha scritto:
> + case MSR_IA32_MPERF:
> + case MSR_IA32_APERF:

These should never be accessed. A KVM VM will always have
CPUID[06H].ECX = 0, and the Intel manual says that the MSRs are only
present if CPUID returns that value with bit 0 set.

I think the actual bug is that intel_pstate_init does not check the
feature bits in CPUID (either manually or via x86_match_cpu).

Paolo

2014-01-03 22:32:56

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On Friday, January 03, 2014 08:04:36 PM Gleb Natapov wrote:
> On Fri, Jan 03, 2014 at 09:30:28AM -0800, Dirk Brandewie wrote:
> > Hi All,
> >
> > Sorry for being late to the party but I just got back from vacation.
> >
> > There is something deeply wrong here. We should have never gotten to
> > intel_pstate_init_cpu(). The VM had to have returned value from the read
> > of the max pstate at driver init time and 0 when the CPU was being brought up.
> >
> > intel_pstate_msrs_not_valid() was added to solve this issue early on
> > if I remember correctly it was Josh that reported it then. Is there
> > a definative way to detect whether we are running in a VM?
> >
> Checking for VM is a wrong thing to do here. KVM should behave like it
> does not support p-state.
>
> > Can some one tell me how the nested environment differs in regards to
> > reading MSRs?
> >
> It shouldn't differ, but there may be bug somewhere in nested emulation.
> We shouldn't try and hind the bug by doing more checks in Linux but
> rather fixing KVM bug that causes Linux to behave incorrectly.

Well, fixing the KVM bug is surely welcome.

That said, adding checks to ensure that your assumptions are valid is rarely
wrong, especially if they are done once per kernel boot. And the kernel only
should panic if it cannot continue to run, which isn't the case here.

Thanks,
Rafael

2014-01-04 08:36:11

by Paolo Bonzini

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

Il 03/01/2014 23:46, Rafael J. Wysocki ha scritto:
> Well, fixing the KVM bug is surely welcome.
>
> That said, adding checks to ensure that your assumptions are valid is rarely
> wrong, especially if they are done once per kernel boot. And the kernel only
> should panic if it cannot continue to run, which isn't the case here.

I agree, but I suspect even your check is already late. Your patch is
welcome but perhaps it should have a WARN_ON too.

Paolo

2014-01-04 14:24:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On Saturday, January 04, 2014 09:35:58 AM Paolo Bonzini wrote:
> Il 03/01/2014 23:46, Rafael J. Wysocki ha scritto:
> > Well, fixing the KVM bug is surely welcome.
> >
> > That said, adding checks to ensure that your assumptions are valid is rarely
> > wrong, especially if they are done once per kernel boot. And the kernel only
> > should panic if it cannot continue to run, which isn't the case here.
>
> I agree, but I suspect even your check is already late.

Well, it's just a sanity check and it makes the problem go away for the reporter.

> Your patch is welcome but perhaps it should have a WARN_ON too.

It has been pulled in already, so the WARN_ON() can only be added via a separate
patch now. Would you like to prepare that patch?

Rafael

2014-01-04 17:39:16

by Paolo Bonzini

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

Il 04/01/2014 15:38, Rafael J. Wysocki ha scritto:
> Well, it's just a sanity check and it makes the problem go away for the reporter.
>
>> > Your patch is welcome but perhaps it should have a WARN_ON too.
> It has been pulled in already, so the WARN_ON() can only be added via a separate
> patch now. Would you like to prepare that patch?

Yes, I'll add it together with the CPUID check. I'll send the patch so
that it can get into 3.14.

Paolo

2014-01-04 17:48:25

by Gleb Natapov

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On Sat, Jan 04, 2014 at 06:38:59PM +0100, Paolo Bonzini wrote:
> Il 04/01/2014 15:38, Rafael J. Wysocki ha scritto:
> > Well, it's just a sanity check and it makes the problem go away for the reporter.
> >
> >> > Your patch is welcome but perhaps it should have a WARN_ON too.
> > It has been pulled in already, so the WARN_ON() can only be added via a separate
> > patch now. Would you like to prepare that patch?
>
> Yes, I'll add it together with the CPUID check. I'll send the patch so
> that it can get into 3.14.
>
CPUID check, while correct, will sweep the problem under the rug. Current
Linux logic should detect non working pstate in KVM. We should look into
why this is not happening for nested.

--
Gleb.

2014-01-04 21:24:51

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On Saturday, January 04, 2014 07:48:13 PM Gleb Natapov wrote:
> On Sat, Jan 04, 2014 at 06:38:59PM +0100, Paolo Bonzini wrote:
> > Il 04/01/2014 15:38, Rafael J. Wysocki ha scritto:
> > > Well, it's just a sanity check and it makes the problem go away for the reporter.
> > >
> > >> > Your patch is welcome but perhaps it should have a WARN_ON too.
> > > It has been pulled in already, so the WARN_ON() can only be added via a separate
> > > patch now. Would you like to prepare that patch?
> >
> > Yes, I'll add it together with the CPUID check. I'll send the patch so
> > that it can get into 3.14.
> >
> CPUID check, while correct, will sweep the problem under the rug. Current
> Linux logic should detect non working pstate in KVM. We should look into
> why this is not happening for nested.

I agree. It's better not to use CPUID for that in my opinion.

Thanks!

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-01-06 11:20:38

by Paolo Bonzini

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

Il 04/01/2014 22:38, Rafael J. Wysocki ha scritto:
> On Saturday, January 04, 2014 07:48:13 PM Gleb Natapov wrote:
>> On Sat, Jan 04, 2014 at 06:38:59PM +0100, Paolo Bonzini wrote:
>>> Il 04/01/2014 15:38, Rafael J. Wysocki ha scritto:
>>>> Well, it's just a sanity check and it makes the problem go away for the reporter.
>>>>
>>>>>> Your patch is welcome but perhaps it should have a WARN_ON too.
>>>> It has been pulled in already, so the WARN_ON() can only be added via a separate
>>>> patch now. Would you like to prepare that patch?
>>>
>>> Yes, I'll add it together with the CPUID check. I'll send the patch so
>>> that it can get into 3.14.
>>>
>> CPUID check, while correct, will sweep the problem under the rug. Current
>> Linux logic should detect non working pstate in KVM. We should look into
>> why this is not happening for nested.
>
> I agree. It's better not to use CPUID for that in my opinion.

Among hypervisors, RHEL5's Xen is probably one of the oldest in actual
use with new hardware and new kernels, and the CPUID bit has been fixed
in 2011. Older versions wouldn't run new kernels due to other CPUID
bits not being cleared properly in VMs.

Is there real hardware that has the CPUID bit set and non-working
pstate? If there's no such real hardware, CPUID is what the SDM says
you should use to detect presence of the APERF/MPERF msrs.

Having extra safety checks is fine on top of what the SDM says, but IMO
they should be WARN_ONs. Otherwise you are sweeping bugs under the rug
just as much.

Paolo

2014-01-06 11:23:35

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On Monday, January 06, 2014 12:20:32 PM Paolo Bonzini wrote:
> Il 04/01/2014 22:38, Rafael J. Wysocki ha scritto:
> > On Saturday, January 04, 2014 07:48:13 PM Gleb Natapov wrote:
> >> On Sat, Jan 04, 2014 at 06:38:59PM +0100, Paolo Bonzini wrote:
> >>> Il 04/01/2014 15:38, Rafael J. Wysocki ha scritto:
> >>>> Well, it's just a sanity check and it makes the problem go away for the reporter.
> >>>>
> >>>>>> Your patch is welcome but perhaps it should have a WARN_ON too.
> >>>> It has been pulled in already, so the WARN_ON() can only be added via a separate
> >>>> patch now. Would you like to prepare that patch?
> >>>
> >>> Yes, I'll add it together with the CPUID check. I'll send the patch so
> >>> that it can get into 3.14.
> >>>
> >> CPUID check, while correct, will sweep the problem under the rug. Current
> >> Linux logic should detect non working pstate in KVM. We should look into
> >> why this is not happening for nested.
> >
> > I agree. It's better not to use CPUID for that in my opinion.
>
> Among hypervisors, RHEL5's Xen is probably one of the oldest in actual
> use with new hardware and new kernels, and the CPUID bit has been fixed
> in 2011. Older versions wouldn't run new kernels due to other CPUID
> bits not being cleared properly in VMs.
>
> Is there real hardware that has the CPUID bit set and non-working
> pstate? If there's no such real hardware, CPUID is what the SDM says
> you should use to detect presence of the APERF/MPERF msrs.

OK

> Having extra safety checks is fine on top of what the SDM says, but IMO
> they should be WARN_ONs. Otherwise you are sweeping bugs under the rug
> just as much.

As I said I'm not against adding WARN_ON()s there. :-)

Thanks!

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-01-06 17:18:43

by Dirk Brandewie

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On 01/03/2014 02:06 PM, Paolo Bonzini wrote:
> Il 03/01/2014 21:00, Dirk Brandewie ha scritto:
>> + case MSR_IA32_MPERF:
>> + case MSR_IA32_APERF:
>
OK I will spin the patch to only add MSR_PLATFORM_INFO.

> These should never be accessed. A KVM VM will always have
> CPUID[06H].ECX = 0, and the Intel manual says that the MSRs are only
> present if CPUID returns that value with bit 0 set.
>
> I think the actual bug is that intel_pstate_init does not check the
> feature bits in CPUID (either manually or via x86_match_cpu).

I will add the feature check.

What are the differences between the first and the nested KVM's?
At load time intel_pstate checks that APERF and MPERF are incrementing
and that PLATFORM_INFO has some value. Somehow these checks pass
in the nested environment and we fall over when the CPU is being added
by cpufreq.

--Dirk

2014-01-06 18:40:39

by Dirk Brandewie

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On 01/06/2014 03:37 AM, Rafael J. Wysocki wrote:
> On Monday, January 06, 2014 12:20:32 PM Paolo Bonzini wrote:
>> Il 04/01/2014 22:38, Rafael J. Wysocki ha scritto:
>>> On Saturday, January 04, 2014 07:48:13 PM Gleb Natapov wrote:
>>>> On Sat, Jan 04, 2014 at 06:38:59PM +0100, Paolo Bonzini wrote:
>>>>> Il 04/01/2014 15:38, Rafael J. Wysocki ha scritto:
>>>>>> Well, it's just a sanity check and it makes the problem go away for the reporter.
>>>>>>
>>>>>>>> Your patch is welcome but perhaps it should have a WARN_ON too.
>>>>>> It has been pulled in already, so the WARN_ON() can only be added via a separate
>>>>>> patch now. Would you like to prepare that patch?
>>>>>
>>>>> Yes, I'll add it together with the CPUID check. I'll send the patch so
>>>>> that it can get into 3.14.
>>>>>
>>>> CPUID check, while correct, will sweep the problem under the rug. Current
>>>> Linux logic should detect non working pstate in KVM. We should look into
>>>> why this is not happening for nested.
>>>
>>> I agree. It's better not to use CPUID for that in my opinion.
>>
>> Among hypervisors, RHEL5's Xen is probably one of the oldest in actual
>> use with new hardware and new kernels, and the CPUID bit has been fixed
>> in 2011. Older versions wouldn't run new kernels due to other CPUID
>> bits not being cleared properly in VMs.
>>
>> Is there real hardware that has the CPUID bit set and non-working
>> pstate? If there's no such real hardware, CPUID is what the SDM says
>> you should use to detect presence of the APERF/MPERF msrs.
>
> OK
>
>> Having extra safety checks is fine on top of what the SDM says, but IMO
>> they should be WARN_ONs. Otherwise you are sweeping bugs under the rug
>> just as much.
>
> As I said I'm not against adding WARN_ON()s there. :-)
>

The patch below adds a feature check for APERF/MPERF. With this patch
you should NOT see "Intel P-state driver initializing." in dmesg for KVM.

commit 4279f36818bd3ac42f077de114b17eb27d81d482
Author: Dirk Brandewie <[email protected]>
Date: Mon Jan 6 10:19:38 2014 -0800

intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.


KVM environments do not support APERF/MPERF MSRs. intel_pstate cannot
operate without these registers.

The previous validity checks in intel_pstate_msrs_not_valid() are
insufficent in nested KVMs.

Fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=1046317

Signed-off-by: Dirk Brandewie <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 0f63f5d..fe91dad 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -619,7 +619,8 @@ static void intel_pstate_timer_func(unsigned long __data)
}

#define ICPU(model, policy) \
- { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&policy }
+ { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF,\
+ (unsigned long)&policy }

static const struct x86_cpu_id intel_pstate_cpu_ids[] = {
ICPU(0x2a, core_params),

2014-01-07 16:11:25

by Gleb Natapov

[permalink] [raw]
Subject: Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

On Mon, Jan 06, 2014 at 09:18:44AM -0800, Dirk Brandewie wrote:
> On 01/03/2014 02:06 PM, Paolo Bonzini wrote:
> >Il 03/01/2014 21:00, Dirk Brandewie ha scritto:
> >>+ case MSR_IA32_MPERF:
> >>+ case MSR_IA32_APERF:
> >
> OK I will spin the patch to only add MSR_PLATFORM_INFO.
>
> >These should never be accessed. A KVM VM will always have
> >CPUID[06H].ECX = 0, and the Intel manual says that the MSRs are only
> >present if CPUID returns that value with bit 0 set.
> >
> >I think the actual bug is that intel_pstate_init does not check the
> >feature bits in CPUID (either manually or via x86_match_cpu).
>
> I will add the feature check.
>
> What are the differences between the first and the nested KVM's?
There shouldn't be any. There is a bug in nested emulation probably.

> At load time intel_pstate checks that APERF and MPERF are incrementing
> and that PLATFORM_INFO has some value. Somehow these checks pass
> in the nested environment and we fall over when the CPU is being added
> by cpufreq.
>
KVM does not emulate either of those and inject #GP if one is accessed. Linux
catches those #GPs and fixs up rdmsr to return zero. It would be interesting to
see ftrace for nested kvm run.

--
Gleb.