2018-07-06 16:27:11

by Thomas Gleixner

Subject: [patch 0/7] x86/kvmclock: Remove memblock dependency and further cleanups

To allow early utilization of kvmclock it is required to remove the
memblock dependency. memblock is currently used to allocate the per
cpu data for kvmclock.

The first patch replaces the memblock allocation with a static array sized
64 bytes * NR_CPUS and was posted by Pavel. That patch allocates everything
statically, which is a waste when kvmclock is not used.

The rest of the series cleans up the code and converts it to per cpu
variables but does not put the kvmclock data into the per cpu area as that
has an issue vs. mapping the boot cpu data into the VDSO (leaks arbitrary
data, unless page sized).

The per cpu data consists of pointers to the actual data. For the boot cpu
a page sized array is statically allocated which can be mapped into the
VDSO. That array is used for initializing the first 64 CPU pointers. If
there are more CPUs the pvclock data is allocated during CPU bringup.

So this still will have some overhead when kvmclock is not in use, but
bringing it down to zero would be a massive trainwreck and even more
indirections.
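
For illustration, the scheme boils down to something like the sketch below
(simplified, against arch/x86/kernel/kvmclock.c; names like hv_clock_boot and
kvmclock_setup_percpu are illustrative, see the actual patches for details):

#define HVC_BOOT_ARRAY_SIZE \
        (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))

/* Page sized and page aligned so it can be mapped into the VDSO */
static struct pvclock_vsyscall_time_info
        hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);

/* Per cpu pointer to this CPU's pvclock data */
static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);

static int kvmclock_setup_percpu(unsigned int cpu)
{
        struct pvclock_vsyscall_time_info **p = per_cpu_ptr(&hv_clock_per_cpu, cpu);

        /*
         * The first 64 CPUs point into the static boot page, which is the
         * one that can be mapped into the VDSO. Everything above that gets
         * its pvclock data allocated when the CPU is brought up.
         */
        if (cpu < HVC_BOOT_ARRAY_SIZE)
                *p = &hv_clock_boot[cpu];
        else if (!*p)
                *p = kzalloc(sizeof(**p), GFP_KERNEL);

        return *p ? 0 : -ENOMEM;
}

kvmclock_setup_percpu() would then be invoked from the CPU hotplug bringup
path.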

Thanks,

tglx

8<--------------
a/arch/x86/include/asm/kvm_guest.h | 7
arch/x86/include/asm/kvm_para.h | 1
arch/x86/kernel/kvm.c | 14 -
arch/x86/kernel/kvmclock.c | 262 ++++++++++++++-----------------------
arch/x86/kernel/setup.c | 4
5 files changed, 105 insertions(+), 183 deletions(-)


2018-07-06 17:48:32

by Paolo Bonzini

Subject: Re: [patch 0/7] x86/kvmclock: Remove memblock dependency and further cleanups

On 06/07/2018 18:13, Thomas Gleixner wrote:
> To allow early utilization of kvmclock it is required to remove the
> memblock dependency. memblock is currently used to allocate the per
> cpu data for kvmclock.
>
> The first patch replaces the memblock allocation with a static array sized
> 64 bytes * NR_CPUS and was posted by Pavel. That patch allocates everything
> statically, which is a waste when kvmclock is not used.
>
> The rest of the series cleans up the code and converts it to per cpu
> variables but does not put the kvmclock data into the per cpu area as that
> has an issue vs. mapping the boot cpu data into the VDSO (leaks arbitrary
> data, unless page sized).
>
> The per cpu data consists of pointers to the actual data. For the boot cpu
> a page sized array is statically allocated which can be mapped into the
> VDSO. That array is used for initializing the first 64 CPU pointers. If
> there are more CPUs the pvclock data is allocated during CPU bringup.
>
> So this still will have some overhead when kvmclock is not in use, but
> bringing it down to zero would be a massive trainwreck and even more
> indirections.
>
> Thanks,
>
> tglx
>
> 8<--------------
> a/arch/x86/include/asm/kvm_guest.h | 7
> arch/x86/include/asm/kvm_para.h | 1
> arch/x86/kernel/kvm.c | 14 -
> arch/x86/kernel/kvmclock.c | 262 ++++++++++++++-----------------------
> arch/x86/kernel/setup.c | 4
> 5 files changed, 105 insertions(+), 183 deletions(-)
>
>
>
>

Thanks, this is really nice. With the small changes from my review,

Acked-by: Paolo Bonzini <[email protected]>

Paolo

2018-07-06 23:52:34

by Brijesh Singh

Subject: Re: [patch 0/7] x86/kvmclock: Remove memblock dependency and further cleanups


Adding Tom and Boris


On 7/6/18 12:47 PM, Paolo Bonzini wrote:
> On 06/07/2018 18:13, Thomas Gleixner wrote:
>> To allow early utilization of kvmclock it is required to remove the
>> memblock dependency. memblock is currently used to allocate the per
>> cpu data for kvmclock.
>>
>> The first patch replaces the memblock allocation with a static array sized
>> 64 bytes * NR_CPUS and was posted by Pavel. That patch allocates everything
>> statically, which is a waste when kvmclock is not used.
>>
>> The rest of the series cleans up the code and converts it to per cpu
>> variables but does not put the kvmclock data into the per cpu area as that
>> has an issue vs. mapping the boot cpu data into the VDSO (leaks arbitrary
>> data, unless page sized).
>>
>> The per cpu data consists of pointers to the actual data. For the boot cpu
>> a page sized array is statically allocated which can be mapped into the
>> VDSO. That array is used for initializing the first 64 CPU pointers. If
>> there are more CPUs the pvclock data is allocated during CPU bringup.
>>
>> So this still will have some overhead when kvmclock is not in use, but
>> bringing it down to zero would be a massive trainwreck and even more
>> indirections.
>>
>> Thanks,
>>
>> tglx
>>
>> 8<--------------
>> a/arch/x86/include/asm/kvm_guest.h | 7
>> arch/x86/include/asm/kvm_para.h | 1
>> arch/x86/kernel/kvm.c | 14 -
>> arch/x86/kernel/kvmclock.c | 262 ++++++++++++++-----------------------
>> arch/x86/kernel/setup.c | 4
>> 5 files changed, 105 insertions(+), 183 deletions(-)
>>
>>
>>
>>
> Thanks, this is really nice. With the small changes from my review,
>
> Acked-by: Paolo Bonzini <[email protected]>

Hi Paolo and Thomas,


This series breaks SEV guest support. The physical address of both
wall_clock and hv_clock is shared with the hypervisor for updates. In the
case of SEV these addresses must be mapped as decrypted (i.e. C=0) so that
both the guest and the HV can access the data correctly. The following
patch should map the pages as decrypted.


diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 890e9e5..640c796 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -251,6 +251,20 @@ static void kvm_shutdown(void)
        native_machine_shutdown();
 }
 
+static void sev_map_clocks_decrypted(void)
+{
+       if (!sev_active())
+               return;
+
+       /*
+        * wall_clock and hv_clock addresses are shared with hypervisor.
+        * When SEV is enabled, any addresses shared with hypervisor must be
+        * mapped decrypted.
+        */
+       early_set_memory_decrypted((unsigned long) wall_clock, WALL_CLOCK_SIZE);
+       early_set_memory_decrypted((unsigned long) hv_clock, HV_CLOCK_SIZE);
+}
+
 void __init kvmclock_init(void)
 {
        struct pvclock_vcpu_time_info *vcpu_time;
@@ -269,6 +283,8 @@ void __init kvmclock_init(void)
        wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
        hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
 
+       sev_map_clocks_decrypted();
+
        if (kvm_register_clock("primary cpu clock")) {
                hv_clock = NULL;
                wall_clock = NULL;


But this patch triggers the below kernel crash.
early_set_memory_decrypted() uses kernel_physical_mapping_init() to
split the large pages and clear the C-bit. It seems this function still
has a dependency on memblock.

[    0.000000] Hypervisor detected: KVM
[    0.000000] Kernel panic - not syncing: alloc_low_pages: ran out of memory
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc3-sev #19
[    0.000000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[    0.000000] Call Trace:
[    0.000000]  ? dump_stack+0x5c/0x80
[    0.000000]  ? panic+0xe7/0x247
[    0.000000]  ? alloc_low_pages+0x130/0x130
[    0.000000]  ? kernel_physical_mapping_init+0xe0/0x204
[    0.000000]  ? early_set_memory_enc_dec+0x10f/0x160
[    0.000000]  ? 0xffffffffb1000000
[    0.000000]  ? kvmclock_init+0x83/0x20a
[    0.000000]  ? setup_arch+0x42c/0xce6
[    0.000000]  ? start_kernel+0x67/0x531
[    0.000000]  ? load_ucode_bsp+0x76/0x12e
[    0.000000]  ? secondary_startup_64+0xa5/0xb0
[    0.000000] ---[ end Kernel panic - not syncing: alloc_low_pages: ran out of memory ]---

- Brijesh


2018-07-09 09:24:09

by Peter Zijlstra

Subject: [patch 8/7] x86/kvmclock: Avoid TSC recalibration


If the host gives us a TSC rate, assume it is good and don't try and
recalibrate things against virtual timer hardware.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -140,7 +140,16 @@ static inline void kvm_sched_clock_init(
  */
 static unsigned long kvm_get_tsc_khz(void)
 {
-       return pvclock_tsc_khz(this_cpu_pvti());
+       unsigned long tsc_khz = pvclock_tsc_khz(this_cpu_pvti());
+
+       /*
+        * TSC frequency is reported by the host; calibration against (virtual)
+        * HPET/PM-timer in a guest is dodgy and pointless since the host already
+        * did it for us where required.
+        */
+       setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+       return tsc_khz;
 }
 
 static void kvm_get_preset_lpj(void)

2018-07-12 03:13:42

by Pavel Tatashin

Subject: Re: [patch 0/7] x86/kvmclock: Remove memblock dependency and further cleanups

> So this still will have some overhead when kvmclock is not in use, but
> bringing it down to zero would be a massive trainwreck and even more
> indirections.

Hi Thomas,

In my opinion, having the kvmclock page in __initdata for the boot cpu and
setting it up in init_hypervisor_platform(), then switching to memblock
allocated memory in x86_init.hyper.guest_late_init() for all CPUs, would not
be too bad, and might even use fewer lines of code. In addition, it won't
have any overhead when kvm is not used.
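
Roughly something like this (untested, just to illustrate the idea; the
names are made up and the wall clock handling is left out):

/*
 * kvmclock_init(), called from init_hypervisor_platform(), points hv_clock
 * at this page early in boot; the page is discarded after init.
 */
static u8 hv_clock_init_page[PAGE_SIZE] __initdata __aligned(PAGE_SIZE);

/* Called from the x86_init.hyper.guest_late_init() hook */
static void __init kvmclock_switch_to_memblock(void)
{
        unsigned long size = PAGE_ALIGN(num_possible_cpus() * sizeof(*hv_clock));
        phys_addr_t mem;

        /* memblock is up at this point, move away from the __initdata page */
        mem = memblock_alloc(size, PAGE_SIZE);
        if (!mem)
                return;

        hv_clock = __va(mem);
        /* Re-register the boot CPU clock so the hypervisor updates the new page */
        kvm_register_clock("primary cpu clock, late");
}

Secondary CPUs register their clock area when they are brought up, so only
the boot CPU needs to be re-registered here.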

Pavel

2018-07-13 22:52:58

by Thomas Gleixner

Subject: Re: [patch 0/7] x86/kvmclock: Remove memblock dependency and further cleanups

On Wed, 11 Jul 2018, Pavel Tatashin wrote:

> > So this still will have some overhead when kvmclock is not in use, but
> > bringing it down to zero would be a massive trainwreck and even more
> > indirections.
>
> Hi Thomas,
>
> In my opinion, having the kvmclock page in __initdata for the boot cpu and
> setting it up in init_hypervisor_platform(), then switching to memblock
> allocated memory in x86_init.hyper.guest_late_init() for all CPUs, would not
> be too bad, and might even use fewer lines of code. In addition, it won't
> have any overhead when kvm is not used.

Why memblock? This can be switched when the allocator is up and
running. And you can use the per cpu allocator for that.
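
I.e. something along these lines once the per cpu allocator works (untested,
illustrative only, using the per cpu pointer introduced by this series, named
hv_clock_per_cpu in the earlier sketch):

static void __init kvmclock_switch_to_percpu(void)
{
        struct pvclock_vsyscall_time_info __percpu *p;
        int cpu;

        p = alloc_percpu(struct pvclock_vsyscall_time_info);
        if (!p)
                return;

        for_each_possible_cpu(cpu) {
                /* The boot CPU keeps the static, VDSO mappable page */
                if (cpu == 0)
                        continue;
                per_cpu(hv_clock_per_cpu, cpu) = per_cpu_ptr(p, cpu);
        }
}

The secondary CPUs register their clock MSR when they are brought up, so
repointing them before bringup is sufficient.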

I don't have cycles at the moment to look at that, so feel free to pick up
the series and enhance it.

Thanks,

tglx

2018-07-14 00:22:20

by Pavel Tatashin

Subject: Re: [patch 0/7] x86/kvmclock: Remove memblock dependency and further cleanups



On 07/13/2018 06:51 PM, Thomas Gleixner wrote:
> On Wed, 11 Jul 2018, Pavel Tatashin wrote:
>
>>> So this still will have some overhead when kvmclock is not in use, but
>>> bringing it down to zero would be a massive trainwreck and even more
>>> indirections.
>>
>> Hi Thomas,
>>
>> In my opinion, having the kvmclock page in __initdata for the boot cpu and
>> setting it up in init_hypervisor_platform(), then switching to memblock
>> allocated memory in x86_init.hyper.guest_late_init() for all CPUs, would not
>> be too bad, and might even use fewer lines of code. In addition, it won't
>> have any overhead when kvm is not used.
>
> Why memblock? This can be switched when the allocator is up and
> running. And you can use the per cpu allocator for that.
>
> I don't have cycles at the moment to look at that, so feel free to pick up
> the series and enhance it.

OK, I will add your series to my series and address all the comments that were raised by the reviewers.

Pavel