===
Could be the following an acceptable fix?
===
kernel might hung in pvclock_clocksource_read() due to
uninitialized memory might contain odd version value in
following cycle:
do {
version = __pvclock_read_cycles(src, &ret, &flags);
} while ((src->version & 1) || version != src->version);
if secondary kvmclock is accessed before it's registered with kvm.
Clear garbage in pvclock shared memory area right after it's
allocated to avoid this issue.
Ref: https://bugzilla.kernel.org/show_bug.cgi?id=59521
Signed-off-by: Igor Mammedov <[email protected]>
---
arch/x86/kernel/kvmclock.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d2c3812..3dd37eb 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -242,6 +242,7 @@ void __init kvmclock_init(void)
if (!mem)
return;
hv_clock = __va(mem);
+ memset(hv_clock, 0, size);
if (kvm_register_clock("boot clock")) {
hv_clock = NULL;
--
1.7.1
On Mon, Jun 10, 2013 at 06:31:11PM +0200, Igor Mammedov wrote:
> ===
> Could be the following an acceptable fix?
> ===
Read of kvmclock should return proper value from hypervisor: system
timestamp + tsc delta.
Should find the offender site and have it register MSR_KVM_SYSTEM_TIME
before reading the area.
> kernel might hung in pvclock_clocksource_read() due to
> uninitialized memory might contain odd version value in
> following cycle:
>
> do {
> version = __pvclock_read_cycles(src, &ret, &flags);
> } while ((src->version & 1) || version != src->version);
>
> if secondary kvmclock is accessed before it's registered with kvm.
>
> Clear garbage in pvclock shared memory area right after it's
> allocated to avoid this issue.
>
> Ref: https://bugzilla.kernel.org/show_bug.cgi?id=59521
> Signed-off-by: Igor Mammedov <[email protected]>
> ---
> arch/x86/kernel/kvmclock.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index d2c3812..3dd37eb 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -242,6 +242,7 @@ void __init kvmclock_init(void)
> if (!mem)
> return;
> hv_clock = __va(mem);
> + memset(hv_clock, 0, size);
>
> if (kvm_register_clock("boot clock")) {
> hv_clock = NULL;
> --
> 1.7.1
Due to unintialized kvmclock read KVM guest is hanging on SMP boot stage.
If unintialized memory contains fatal garbage then hang reproduction is 100%.
Unintialized memory is allocated by memblock_alloc. So the garbage values
depend on many many things.
See the detailed description of the bug and possible ways to fix it
in the kernel bug tracker.
"Bug 59521 - KVM linux guest reads uninitialized pvclock values before
executing rdmsr MSR_KVM_WALL_CLOCK"
I decided to fix it simply returning 0ULL from kvm_clock_read when
kvm clocksource is not initialized yet.
The same as kernel bootstrap CPU doesn on boot stage when kernel
clocksources are not initialized yet.
Signed-off-by: Eugene Batalov <[email protected]>
---
I dont' use kernel percpu variables because for each SMP CPU
their contents are copied from the bootstrap CPU. And I don't
think that fixing the value for each SMP CPU is a good style.
If you know a better approach to store the is_pv_clock_ready flags
I am ready to use it.
The patch applies cleanly to
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
I've tested the changes with Ubuntu 13.04 "raring" userspace and
Ubuntu-3.8.0.19-30 kernel tag.
arch/x86/kernel/kvmclock.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5bedbdd..a6e0af4 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -43,6 +43,9 @@ early_param("no-kvmclock", parse_no_kvmclock);
static struct pvclock_vsyscall_time_info *hv_clock;
static struct pvclock_wall_clock wall_clock;
+/* For each cpu store here a flag which tells whether pvclock is initialized */
+static int __cacheline_aligned_in_smp is_pv_clock_ready[NR_CPUS] = {};
+
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
@@ -84,8 +87,11 @@ static cycle_t kvm_clock_read(void)
preempt_disable_notrace();
cpu = smp_processor_id();
- src = &hv_clock[cpu].pvti;
- ret = pvclock_clocksource_read(src);
+ if (is_pv_clock_ready[cpu]) {
+ src = &hv_clock[cpu].pvti;
+ ret = pvclock_clocksource_read(src);
+ } else
+ ret = 0ULL;
preempt_enable_notrace();
return ret;
}
@@ -168,6 +174,9 @@ int kvm_register_clock(char *txt)
printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
cpu, high, low, txt);
+ if (!ret)
+ is_pv_clock_ready[cpu] = 1;
+
return ret;
}
--
1.7.9.5
On Sat, Jun 15, 2013 at 10:01:45PM +0400, Eugene Batalov wrote:
> Due to unintialized kvmclock read KVM guest is hanging on SMP boot stage.
> If unintialized memory contains fatal garbage then hang reproduction is 100%.
> Unintialized memory is allocated by memblock_alloc. So the garbage values
> depend on many many things.
>
> See the detailed description of the bug and possible ways to fix it
> in the kernel bug tracker.
> "Bug 59521 - KVM linux guest reads uninitialized pvclock values before
> executing rdmsr MSR_KVM_WALL_CLOCK"
>
> I decided to fix it simply returning 0ULL from kvm_clock_read when
> kvm clocksource is not initialized yet.
> The same as kernel bootstrap CPU doesn on boot stage when kernel
> clocksources are not initialized yet.
>
> Signed-off-by: Eugene Batalov <[email protected]>
> ---
> I dont' use kernel percpu variables because for each SMP CPU
> their contents are copied from the bootstrap CPU. And I don't
> think that fixing the value for each SMP CPU is a good style.
> If you know a better approach to store the is_pv_clock_ready flags
> I am ready to use it.
>
> The patch applies cleanly to
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> I've tested the changes with Ubuntu 13.04 "raring" userspace and
> Ubuntu-3.8.0.19-30 kernel tag.
>
> arch/x86/kernel/kvmclock.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 5bedbdd..a6e0af4 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -43,6 +43,9 @@ early_param("no-kvmclock", parse_no_kvmclock);
> static struct pvclock_vsyscall_time_info *hv_clock;
> static struct pvclock_wall_clock wall_clock;
>
> +/* For each cpu store here a flag which tells whether pvclock is initialized */
> +static int __cacheline_aligned_in_smp is_pv_clock_ready[NR_CPUS] = {};
> +
> /*
> * The wallclock is the time of day when we booted. Since then, some time may
> * have elapsed since the hypervisor wrote the data. So we try to account for
> @@ -84,8 +87,11 @@ static cycle_t kvm_clock_read(void)
>
> preempt_disable_notrace();
> cpu = smp_processor_id();
> - src = &hv_clock[cpu].pvti;
> - ret = pvclock_clocksource_read(src);
> + if (is_pv_clock_ready[cpu]) {
> + src = &hv_clock[cpu].pvti;
> + ret = pvclock_clocksource_read(src);
> + } else
> + ret = 0ULL;
> preempt_enable_notrace();
> return ret;
> }
> @@ -168,6 +174,9 @@ int kvm_register_clock(char *txt)
> printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
> cpu, high, low, txt);
>
> + if (!ret)
> + is_pv_clock_ready[cpu] = 1;
> +
> return ret;
> }
The path can be really hot, so better avoid it if possible. The patch
to zero the memblock_alloc'd area from Igor brings the behaviour back
to before regression: return 0 until kvmclock is initialized. On top of
your analysis in the BZ, it now seems the right (and safer) thing to do.
Il 19/06/2013 00:21, Marcelo Tosatti ha scritto:
> On Sat, Jun 15, 2013 at 10:01:45PM +0400, Eugene Batalov wrote:
>> Due to unintialized kvmclock read KVM guest is hanging on SMP boot stage.
>> If unintialized memory contains fatal garbage then hang reproduction is 100%.
>> Unintialized memory is allocated by memblock_alloc. So the garbage values
>> depend on many many things.
>>
>> See the detailed description of the bug and possible ways to fix it
>> in the kernel bug tracker.
>> "Bug 59521 - KVM linux guest reads uninitialized pvclock values before
>> executing rdmsr MSR_KVM_WALL_CLOCK"
>>
>> I decided to fix it simply returning 0ULL from kvm_clock_read when
>> kvm clocksource is not initialized yet.
>> The same as kernel bootstrap CPU doesn on boot stage when kernel
>> clocksources are not initialized yet.
>>
>> Signed-off-by: Eugene Batalov <[email protected]>
>> ---
>> I dont' use kernel percpu variables because for each SMP CPU
>> their contents are copied from the bootstrap CPU. And I don't
>> think that fixing the value for each SMP CPU is a good style.
>> If you know a better approach to store the is_pv_clock_ready flags
>> I am ready to use it.
>>
>> The patch applies cleanly to
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>
>> I've tested the changes with Ubuntu 13.04 "raring" userspace and
>> Ubuntu-3.8.0.19-30 kernel tag.
>>
>> arch/x86/kernel/kvmclock.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>> index 5bedbdd..a6e0af4 100644
>> --- a/arch/x86/kernel/kvmclock.c
>> +++ b/arch/x86/kernel/kvmclock.c
>> @@ -43,6 +43,9 @@ early_param("no-kvmclock", parse_no_kvmclock);
>> static struct pvclock_vsyscall_time_info *hv_clock;
>> static struct pvclock_wall_clock wall_clock;
>>
>> +/* For each cpu store here a flag which tells whether pvclock is initialized */
>> +static int __cacheline_aligned_in_smp is_pv_clock_ready[NR_CPUS] = {};
>> +
>> /*
>> * The wallclock is the time of day when we booted. Since then, some time may
>> * have elapsed since the hypervisor wrote the data. So we try to account for
>> @@ -84,8 +87,11 @@ static cycle_t kvm_clock_read(void)
>>
>> preempt_disable_notrace();
>> cpu = smp_processor_id();
>> - src = &hv_clock[cpu].pvti;
>> - ret = pvclock_clocksource_read(src);
>> + if (is_pv_clock_ready[cpu]) {
>> + src = &hv_clock[cpu].pvti;
>> + ret = pvclock_clocksource_read(src);
>> + } else
>> + ret = 0ULL;
>> preempt_enable_notrace();
>> return ret;
>> }
>> @@ -168,6 +174,9 @@ int kvm_register_clock(char *txt)
>> printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
>> cpu, high, low, txt);
>>
>> + if (!ret)
>> + is_pv_clock_ready[cpu] = 1;
>> +
>> return ret;
>> }
>
> The path can be really hot, so better avoid it if possible. The patch
> to zero the memblock_alloc'd area from Igor brings the behaviour back
> to before regression: return 0 until kvmclock is initialized. On top of
> your analysis in the BZ, it now seems the right (and safer) thing to do.
Yeah, definitely safer for master and stable. I'm applying that to master.
But if we can invert cpu_init with early_clock_percpu_init, it would be
nicer to do that for 3.11.
Paolo
Il 19/06/2013 15:20, Batalov Eugene ha scritto:
>
> I've missed this detail. It looks like Igor's patch doesn't bring
> secondary cpus kvm_clocksource behavior back to one before the regression,
> Before the regression per_cpu variables are used to allocate
> kvm_pv_clock areas.
> To to usage of percpu variables bootstrap cpu kvm_clock area contents
> were copied to smp secondary cpus kvm_clock areas when they were started.
> Bootstrap cpu kvm_clock area was not zeroed at this time.
> So kvm_pv_clock for secondary cpus never returned "zero" clock before
> the regression.
>
> During the analysis of the bug I introduced idea to return zero before
> kvm clocksource is initialized for secondary cpus
> just like bootstrap cpu does on kernel boot. You can read that in BZ.
Yes, this is why I prefer to invert the two function calls. But Igor's
patch fixes the hang (trivially because version is even) and is more
appropriate for -rc6.
Paolo
Il 20/06/2013 10:30, Igor Mammedov ha scritto:
> On Wed, 19 Jun 2013 15:29:31 +0200
> Paolo Bonzini <[email protected]> wrote:
>
>> Il 19/06/2013 15:20, Batalov Eugene ha scritto:
>>>
>>> I've missed this detail. It looks like Igor's patch doesn't bring
>>> secondary cpus kvm_clocksource behavior back to one before the regression,
>>> Before the regression per_cpu variables are used to allocate
>>> kvm_pv_clock areas.
>>> To to usage of percpu variables bootstrap cpu kvm_clock area contents
>>> were copied to smp secondary cpus kvm_clock areas when they were started.
>>> Bootstrap cpu kvm_clock area was not zeroed at this time.
>>> So kvm_pv_clock for secondary cpus never returned "zero" clock before
>>> the regression.
>>>
>>> During the analysis of the bug I introduced idea to return zero before
>>> kvm clocksource is initialized for secondary cpus
>>> just like bootstrap cpu does on kernel boot. You can read that in BZ.
>>
>> Yes, this is why I prefer to invert the two function calls. But Igor's
>> patch fixes the hang (trivially because version is even) and is more
>> appropriate for -rc6.
>
> I'll post this swap shortly, but zeroing out hv_clock at init time,
> would be still needed to provide sane values there if ftrace enabled
> at that time.
Fine! Please mention it (with --verbose flag) in the commit message.
Paolo
On Wed, 19 Jun 2013 15:29:31 +0200
Paolo Bonzini <[email protected]> wrote:
> Il 19/06/2013 15:20, Batalov Eugene ha scritto:
> >
> > I've missed this detail. It looks like Igor's patch doesn't bring
> > secondary cpus kvm_clocksource behavior back to one before the regression,
> > Before the regression per_cpu variables are used to allocate
> > kvm_pv_clock areas.
> > To to usage of percpu variables bootstrap cpu kvm_clock area contents
> > were copied to smp secondary cpus kvm_clock areas when they were started.
> > Bootstrap cpu kvm_clock area was not zeroed at this time.
> > So kvm_pv_clock for secondary cpus never returned "zero" clock before
> > the regression.
> >
> > During the analysis of the bug I introduced idea to return zero before
> > kvm clocksource is initialized for secondary cpus
> > just like bootstrap cpu does on kernel boot. You can read that in BZ.
>
> Yes, this is why I prefer to invert the two function calls. But Igor's
> patch fixes the hang (trivially because version is even) and is more
> appropriate for -rc6.
I'll post this swap shortly, but zeroing out hv_clock at init time,
would be still needed to provide sane values there if ftrace enabled
at that time.
>
> Paolo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/