From: Vitaly Kuznetsov
To: Lenny Szubowicz
Cc: pbonzini@redhat.com, seanjc@google.com, wanpengli@tencent.com,
        jmattson@google.com, joro@8bytes.org, tglx@linutronix.de,
        mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com,
        kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] x86/kvmclock: Stop kvmclocks for hibernate restore
In-Reply-To: <5048babd-a40b-5a95-9dee-ab13367de6cb@redhat.com>
References: <20210311132847.224406-1-lszubowi@redhat.com>
        <87sg4t7vqy.fsf@vitty.brq.redhat.com>
        <5048babd-a40b-5a95-9dee-ab13367de6cb@redhat.com>
Date: Fri, 26 Mar 2021 11:57:01 +0100
Message-ID: <87mtuqchdu.fsf@vitty.brq.redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Lenny Szubowicz writes:

> On 3/17/21 9:30 AM, Vitaly Kuznetsov wrote:
>> Lenny Szubowicz writes:
>>
>>> Turn off host updates to the registered kvmclock memory
>>> locations when transitioning to a hibernated kernel in
>>> resume_target_kernel().
>>>
>>> This is accomplished for secondary vcpus by disabling host
>>> clock updates for that vcpu when it is put offline. For the
>>> primary vcpu, it's accomplished by using the existing callback
>>> from save_processor_state() to kvm_save_sched_clock_state().
>>>
>>> The registered kvmclock memory locations may differ between
>>> the currently running kernel and the hibernated kernel, which
>>> is being restored and resumed. Kernel memory corruption is thus
>>> possible if the host clock updates are allowed to run while the
>>> hibernated kernel is relocated to its original physical memory
>>> locations.
>>>
>>> This is similar to the problem solved for kexec by
>>> commit 1e977aa12dd4 ("x86: KVM guest: disable clock before rebooting.")
>>>
>>> Commit 95a3d4454bb1 ("x86/kvmclock: Switch kvmclock data to a
>>> PER_CPU variable") innocently increased the exposure for this
>>> problem by dynamically allocating the physical pages that are
>>> used for host clock updates when the vcpu count exceeds 64.
>>> This increases the likelihood that the registered kvmclock
>>> locations will differ for vcpus above 64.
>>>
>>> Reported-by: Xiaoyi Chen
>>> Tested-by: Mohamed Aboubakr
>>> Signed-off-by: Lenny Szubowicz
>>> ---
>>>  arch/x86/kernel/kvmclock.c | 34 ++++++++++++++++++++++++++++++++--
>>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>>> index aa593743acf6..291ffca41afb 100644
>>> --- a/arch/x86/kernel/kvmclock.c
>>> +++ b/arch/x86/kernel/kvmclock.c
>>> @@ -187,8 +187,18 @@ static void kvm_register_clock(char *txt)
>>>  	pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
>>>  }
>>>
>>> +/*
>>> + * Turn off host clock updates to the registered memory location when the
>>> + * cpu clock context is saved via save_processor_state(). Enables correct
>>> + * handling of the primary cpu clock when transitioning to a hibernated
>>> + * kernel in resume_target_kernel(), where the old and new registered
>>> + * memory locations may differ.
>>> + */
>>>  static void kvm_save_sched_clock_state(void)
>>>  {
>>> +	native_write_msr(msr_kvm_system_time, 0, 0);
>>> +	kvm_disable_steal_time();
>>> +	pr_info("kvm-clock: cpu %d, clock stopped", smp_processor_id());
>>>  }
>>
>> Nitpick: should we rename kvm_save_sched_clock_state() to something more
>> generic, like kvm_disable_host_clock_updates(), to indicate that what it
>> does is not only sched clock related?
>
> I see your rationale. But if I rename kvm_save_sched_clock_state(),
> then shouldn't I also rename kvm_restore_sched_clock_state()?
> The names appear to reflect the callback that invokes them, from
> save_processor_state()/restore_processor_state(), rather than what these
> functions need to do.
>
>     x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
>     x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
>
> For V2 of my patch, I kept these names as they were. But if you have a strong
> desire for a different name, then I think both routines should be renamed
> similarly, since they are meant to be a complementary pair.
>
>>
>>>
>>>  static void kvm_restore_sched_clock_state(void)
>>> @@ -311,9 +321,23 @@ static int kvmclock_setup_percpu(unsigned int cpu)
>>>  	return p ? 0 : -ENOMEM;
>>>  }
>>>
>>> +/*
>>> + * Turn off host clock updates to the registered memory location when a
>>> + * cpu is placed offline. Enables correct handling of secondary cpu clocks
>>> + * when transitioning to a hibernated kernel in resume_target_kernel(),
>>> + * where the old and new registered memory locations may differ.
>>> + */
>>> +static int kvmclock_cpu_offline(unsigned int cpu)
>>> +{
>>> +	native_write_msr(msr_kvm_system_time, 0, 0);
>>> +	pr_info("kvm-clock: cpu %d, clock stopped", cpu);
>>
>> I'd say this pr_info() is superfluous: on a system with hundreds of
>> vCPUs users will get flooded with 'clock stopped' messages which don't
>> actually mean much: in case native_write_msr() fails the error gets
>> reported in dmesg anyway. I'd suggest we drop this and the pr_info() in
>> kvm_save_sched_clock_state().
>
> Agreed. I was essentially using it as a pr_debug(). Gone in V2.
>
>>> +	return 0;
>>
>> Why don't we disable steal time accounting here? MSR_KVM_STEAL_TIME is
>> also per-cpu. Can we merge kvm_save_sched_clock_state() with
>> kvmclock_cpu_offline() maybe?
>
> kvm_cpu_down_prepare() in arch/x86/kernel/kvm.c already calls
> kvm_disable_steal_time() when a vcpu is placed offline.
> So there is no need to do that in kvmclock_cpu_offline().
>
> In the case of the hibernation resume code path, resume_target_kernel()
> in kernel/power/hibernate.c, the secondary cpus are placed offline,
> but the primary is not. Instead, we are going to be switching contexts
> of the primary cpu from the boot kernel to the kernel that was restored
> from the hibernation image.
>
> This is where save_processor_state()/restore_processor_state() and
> kvm_save_sched_clock_state()/kvm_restore_sched_clock_state() come into play
> to stop the kvmclock of the boot kernel's primary cpu and restart
> the kvmclock of the restored hibernated kernel's primary cpu.
>
> And in this case, no one is calling kvm_disable_steal_time(),
> so kvm_save_sched_clock_state() is doing it. (This is very similar
> to the reason why kvm_crash_shutdown() in kvmclock.c needs to call
> kvm_disable_steal_time().)
>
> However, I'm now wondering if kvm_restore_sched_clock_state()
> needs to add a call to the equivalent of kvm_register_steal_time(),
> because otherwise no one will do that for the primary vcpu
> on resume from hibernation.

In case this is true, steal time accounting is not our only problem:
kvm_guest_cpu_init(), which is called from the smp_prepare_boot_cpu() hook,
also sets up Async PF and PV EOI, and both of these features establish a
shared guest-host memory region; if this doesn't happen upon resume from
hibernation, we're in trouble. The smp_prepare_boot_cpu() hook is called
very early from start_kernel(), but what happens when we switch to the
context of the hibernated kernel? I'm going to set up an environment and
check what's going on.

>
>>> +}
>>> +
>>>  void __init kvmclock_init(void)
>>>  {
>>>  	u8 flags;
>>> +	int cpuhp_prepare;
>>>
>>>  	if (!kvm_para_available() || !kvmclock)
>>>  		return;
>>> @@ -325,8 +349,14 @@
>>>  		return;
>>>  	}
>>>
>>> -	if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
>>> -			      kvmclock_setup_percpu, NULL) < 0) {
>>> +	cpuhp_prepare = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN,
>>> +					  "kvmclock:setup_percpu",
>>> +					  kvmclock_setup_percpu, NULL);
>>> +	if (cpuhp_prepare < 0)
>>> +		return;
>>> +	if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "kvmclock:cpu_offline",
>>> +			      NULL, kvmclock_cpu_offline) < 0) {
>>> +		cpuhp_remove_state(cpuhp_prepare);
>>
>> In case we fail to set up the kvmclock_cpu_offline() callback we get broken
>> hibernation, but without the kvmclock_setup_percpu() call we get a broken
>> guest (as kvmclock may be the only reliable clocksource), so I'm not
>> exactly sure we're acting in the best way with cpuhp_remove_state()
>> here. I don't feel strongly though; I think it can stay, but in that case
>> I'd add a pr_warn() at least.
>
> Something is seriously broken if either of these cpuhp_setup_state()
> calls fail. I first considered using the "down" callback of the
> CPUHP_BP_PREPARE_DYN state, as in:
>
>     if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
>                           kvmclock_setup_percpu, kvmclock_cpu_offline) < 0) {
>
> This would have minimized the change, but I wasn't comfortable with how late
> it would be called. Other examples in the kernel, including kvm.c, use
> CPUHP_AP_ONLINE_DYN for cpu online/offline events.
>
> But I do agree that a failure of either cpuhp_setup_state() should not
> be silent. So in V2 I added a pr_err().
>
> Thank you for your review comments.
>
> -Lenny.
>
>>
>>>  		return;
>>>  	}
>>
--
Vitaly
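
A note on the open question above about re-registering per-vcpu state on
resume: purely as an illustration, and not the posted patch (nor necessarily
what V2 does), the re-registration Lenny describes could take roughly the
shape below. It assumes kvm_register_clock() (already present in kvmclock.c,
as seen in the diff context) is reused, and that a helper with the effect of
the static kvm_register_steal_time() in arch/x86/kernel/kvm.c is made
callable from kvmclock.c; that helper is hypothetical here.

/*
 * Illustrative sketch only, not the posted patch: one possible shape for
 * re-registering the primary vcpu's shared areas on resume from hibernation.
 * Assumes kvm_register_clock() (static in kvmclock.c) is reused and that a
 * helper equivalent to kvm_register_steal_time() from arch/x86/kernel/kvm.c
 * is made callable here (hypothetical).
 */
static void kvm_restore_sched_clock_state(void)
{
	/* Point the kvmclock MSR back at this cpu's pvclock area. */
	kvm_register_clock("primary cpu clock, resume from hibernation");

	/* Re-establish steal time accounting for the primary vcpu. */
	kvm_register_steal_time();
}

This would only cover kvmclock and steal time; as noted in the reply above,
Async PF and PV EOI registration would still need separate handling.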