LinuxLists.cc - [PATCH v4 0/2] xen: fix HVM kexec kernel panic

2022-03-02 22:37:33

Subject: [PATCH v4 0/2] xen: fix HVM kexec kernel panic

This is the v4 of the patch to fix xen kexec kernel panic issue when the
kexec is triggered on VCPU >= 32.

PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
... ...
[ 0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
... ...
[ 0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX: 0000000000000000
[ 0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
[ 0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
[ 0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
[ 0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
[ 0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
[ 0.000000] FS: 0000000000000000(0000) GS:ffffffffab588000(0000) knlGS:0000000000000000
[ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
[ 0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.000000] Call Trace:
[ 0.000000] <TASK>
[ 0.000000] ? xen_clocksource_read+0x24/0x40
[ 0.000000] ? xen_init_time_common+0x5/0x49
[ 0.000000] ? xen_hvm_init_time_ops+0x23/0x45
[ 0.000000] ? xen_hvm_guest_init+0x221/0x25c
[ 0.000000] ? 0xffffffffa9600000
[ 0.000000] ? setup_arch+0x440/0xbd6
[ 0.000000] ? start_kernel+0x6c/0x695
[ 0.000000] ? secondary_startup_64_no_verify+0xd5/0xdb
[ 0.000000] </TASK>

Changed since v1:
- Add commit message to explain why xen_hvm_init_time_ops() is delayed
for any vcpus. (Suggested by Boris Ostrovsky)
- Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
- Delay for all VCPUs. (Suggested by Boris Ostrovsky)
- Add commit message that why PVM is not supported by this patch
- Test if kexec/kdump works with mainline xen (HVM and PVM)
Changed since v3:
- Re-use v2 but move the login into xen_hvm_init_time_ops() (Suggested
by Boris Ostrovsky)

I have tested with HVM VM on both old xen and mainline xen.

About the mainline xen, the 'soft_reset' works after I reset d->creation_reset
as suggested by Jan Beulich.

https://lore.kernel.org/all/[email protected]/

Thank you very much!

Dongli Zhang

2022-03-02 22:46:42

by Dongli Zhang

[permalink] [raw]

Subject: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). In addition, with commit
38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), kdump
kernel in Xen HVM guest may panic at very early stage when accessing
&__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
-> init_hypervisor_platform()
-> x86_init.hyper.init_platform = xen_hvm_guest_init()
-> xen_hvm_init_time_ops()
-> xen_clocksource_read()
-> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.

However, when Xen HVM guest panic on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.

This patch calls xen_hvm_init_time_ops() again later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered when the boot vcpu is >= 32.

This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:

"taskset -c 33 echo c > /proc/sysrq-trigger"

The bugfix for PVM is not implemented due to the lack of testing
environment.

Cc: Joe Jin <[email protected]>
Signed-off-by: Dongli Zhang <[email protected]>
---
Changed since v1:
- Add commit message to explain why xen_hvm_init_time_ops() is delayed
for any vcpus. (Suggested by Boris Ostrovsky)
- Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
- Delay for all VCPUs. (Suggested by Boris Ostrovsky)
- Add commit message that why PVM is not supported by this patch
- Test if kexec/kdump works with mainline xen (HVM and PVM)
Changed since v3:
- Re-use v2 but move the login into xen_hvm_init_time_ops() (Suggested
by Boris Ostrovsky)

arch/x86/xen/smp_hvm.c | 6 ++++++
arch/x86/xen/time.c | 25 ++++++++++++++++++++++++-
2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..b70afdff419c 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,12 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
*/
xen_vcpu_setup(0);

+ /*
+ * Called again in case the kernel boots on vcpu >= MAX_VIRT_CPUS.
+ * Refer to comments in xen_hvm_init_time_ops().
+ */
+ xen_hvm_init_time_ops();
+
/*
* The alternative logic (which patches the unlock/lock) runs before
* the smp bootup up code is activated. Hence we need to set this up
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 55b3407358a9..dcf292cc859e 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -558,16 +558,36 @@ static void xen_hvm_setup_cpu_clockevents(void)

void __init xen_hvm_init_time_ops(void)
{
+ static bool hvm_time_initialized;
+
+ if (hvm_time_initialized)
+ return;
+
/*
* vector callback is needed otherwise we cannot receive interrupts
* on cpu > 0 and at this point we don't know how many cpus are
* available.
*/
if (!xen_have_vector_callback)
- return;
+ goto exit;

if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
+ goto exit;
+ }
+
+ /*
+ * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
+ * The __this_cpu_read(xen_vcpu) is still NULL when Xen HVM guest
+ * boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec), To access
+ * __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
+ *
+ * The xen_hvm_init_time_ops() should be called again later after
+ * __this_cpu_read(xen_vcpu) is available.
+ */
+ if (!__this_cpu_read(xen_vcpu)) {
+ pr_info("Delay xen_init_time_common() as kernel is running on vcpu=%d\n",
+ xen_vcpu_nr(0));
return;
}

@@ -577,6 +597,9 @@ void __init xen_hvm_init_time_ops(void)
x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;

x86_platform.set_wallclock = xen_set_wallclock;
+
+exit:
+ hvm_time_initialized = true;
}
#endif

--
2.17.1

2022-03-03 00:55:43

by Boris Ostrovsky

[permalink] [raw]

Subject: Re: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

On 3/2/22 11:40 AM, Dongli Zhang wrote:
> void __init xen_hvm_init_time_ops(void)
> {
> + static bool hvm_time_initialized;
> +
> + if (hvm_time_initialized)
> + return;
> +
> /*
> * vector callback is needed otherwise we cannot receive interrupts
> * on cpu > 0 and at this point we don't know how many cpus are
> * available.
> */
> if (!xen_have_vector_callback)
> - return;
> + goto exit;

Why not just return? Do we expect the value of xen_have_vector_callback to change?

-boris

>
> if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
> pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
> + goto exit;
> + }
> +
> + /*
> + * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
> + * The __this_cpu_read(xen_vcpu) is still NULL when Xen HVM guest
> + * boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec), To access
> + * __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
> + *
> + * The xen_hvm_init_time_ops() should be called again later after
> + * __this_cpu_read(xen_vcpu) is available.
> + */
> + if (!__this_cpu_read(xen_vcpu)) {
> + pr_info("Delay xen_init_time_common() as kernel is running on vcpu=%d\n",
> + xen_vcpu_nr(0));
> return;
> }
>
> @@ -577,6 +597,9 @@ void __init xen_hvm_init_time_ops(void)
> x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
>
> x86_platform.set_wallclock = xen_set_wallclock;
> +
> +exit:
> + hvm_time_initialized = true;
> }
> #endif
>

2022-03-03 01:02:48

by Dongli Zhang

[permalink] [raw]

Subject: Re: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

Hi Boris,

On 3/2/22 4:20 PM, Boris Ostrovsky wrote:
>
> On 3/2/22 11:40 AM, Dongli Zhang wrote:
>> void __init xen_hvm_init_time_ops(void)
>> {
>> +    static bool hvm_time_initialized;
>> +
>> +    if (hvm_time_initialized)
>> +        return;
>> +
>>       /*
>>        * vector callback is needed otherwise we cannot receive interrupts
>>        * on cpu > 0 and at this point we don't know how many cpus are
>>        * available.
>>        */
>>       if (!xen_have_vector_callback)
>> -        return;
>> +        goto exit;
>
>
> Why not just return? Do we expect the value of xen_have_vector_callback to change?

I just want to keep above sync with ....

>
>
> -boris
>
>
>>       if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
>>           pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
>> +        goto exit;
>> +    }

... here.

That is, I want the main logic of xen_hvm_init_time_ops() to run for at most
once. Both of above two if statements will "go to exit".

Thank you very much!

Dongli Zhang

>> +
>> +    /*
>> +     * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
>> +     * The __this_cpu_read(xen_vcpu) is still NULL when Xen HVM guest
>> +     * boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec), To access
>> +     * __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
>> +     *
>> +     * The xen_hvm_init_time_ops() should be called again later after
>> +     * __this_cpu_read(xen_vcpu) is available.
>> +     */
>> +    if (!__this_cpu_read(xen_vcpu)) {
>> +        pr_info("Delay xen_init_time_common() as kernel is running on
>> vcpu=%d\n",
>> +            xen_vcpu_nr(0));
>>           return;
>>       }
>> @@ -577,6 +597,9 @@ void __init xen_hvm_init_time_ops(void)
>>       x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
>>       x86_platform.set_wallclock = xen_set_wallclock;
>> +
>> +exit:
>> +    hvm_time_initialized = true;
>> }
>> #endif
>>

2022-03-03 02:12:05

by Boris Ostrovsky

[permalink] [raw]

Subject: Re: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

On 3/2/22 7:31 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 3/2/22 4:20 PM, Boris Ostrovsky wrote:
>> On 3/2/22 11:40 AM, Dongli Zhang wrote:
>>> void __init xen_hvm_init_time_ops(void)
>>> {
>>> +    static bool hvm_time_initialized;
>>> +
>>> +    if (hvm_time_initialized)
>>> +        return;
>>> +
>>>       /*
>>>        * vector callback is needed otherwise we cannot receive interrupts
>>>        * on cpu > 0 and at this point we don't know how many cpus are
>>>        * available.
>>>        */
>>>       if (!xen_have_vector_callback)
>>> -        return;
>>> +        goto exit;
>>
>> Why not just return? Do we expect the value of xen_have_vector_callback to change?
> I just want to keep above sync with ....
>
>>
>> -boris
>>
>>
>>>       if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
>>>           pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
>>> +        goto exit;
>>> +    }
> ... here.
>
> That is, I want the main logic of xen_hvm_init_time_ops() to run for at most
> once. Both of above two if statements will "go to exit".

I didn't notice this actually.

I think both of them should return early, there is no reason to set hvm_time_initialized to true when, in fact, we have not initialized anything. And to avoid printing the warning twice we can just replace it with pr_info_once().

I can fix it up when committing so no need to resend. So unless you disagree

Reviewed-by: Boris Ostrovsky <[email protected]>

> Thank you very much!
>
> Dongli Zhang
>
>>> +
>>> +    /*
>>> +     * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
>>> +     * The __this_cpu_read(xen_vcpu) is still NULL when Xen HVM guest
>>> +     * boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec), To access
>>> +     * __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
>>> +     *
>>> +     * The xen_hvm_init_time_ops() should be called again later after
>>> +     * __this_cpu_read(xen_vcpu) is available.
>>> +     */
>>> +    if (!__this_cpu_read(xen_vcpu)) {
>>> +        pr_info("Delay xen_init_time_common() as kernel is running on
>>> vcpu=%d\n",
>>> +            xen_vcpu_nr(0));
>>>           return;
>>>       }
>>> @@ -577,6 +597,9 @@ void __init xen_hvm_init_time_ops(void)
>>>       x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
>>>       x86_platform.set_wallclock = xen_set_wallclock;
>>> +
>>> +exit:
>>> +    hvm_time_initialized = true;
>>> }
>>> #endif
>>>

2022-03-03 03:09:44

by Dongli Zhang

[permalink] [raw]

Subject: Re: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

Hi Boris,

On 3/2/22 6:11 PM, Boris Ostrovsky wrote:
>
> On 3/2/22 7:31 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 3/2/22 4:20 PM, Boris Ostrovsky wrote:
>>> On 3/2/22 11:40 AM, Dongli Zhang wrote:
>>>>    void __init xen_hvm_init_time_ops(void)
>>>>    {
>>>> +    static bool hvm_time_initialized;
>>>> +
>>>> +    if (hvm_time_initialized)
>>>> +        return;
>>>> +
>>>>        /*
>>>>         * vector callback is needed otherwise we cannot receive interrupts
>>>>         * on cpu > 0 and at this point we don't know how many cpus are
>>>>         * available.
>>>>         */
>>>>        if (!xen_have_vector_callback)
>>>> -        return;
>>>> +        goto exit;
>>>
>>> Why not just return? Do we expect the value of xen_have_vector_callback to
>>> change?
>> I just want to keep above sync with ....
>>
>>>
>>> -boris
>>>
>>>
>>>>          if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
>>>>            pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
>>>> +        goto exit;
>>>> +    }
>> ... here.
>>
>> That is, I want the main logic of xen_hvm_init_time_ops() to run for at most
>> once. Both of above two if statements will "go to exit".
>
>
> I didn't notice this actually.
>
>
> I think both of them should return early, there is no reason to set
> hvm_time_initialized to true when, in fact, we have not initialized anything.
> And to avoid printing the warning twice we can just replace it with pr_info_once().
>
>
> I can fix it up when committing so no need to resend. So unless you disagree

Thank you very much for fixing it during committing.

Dongli Zhang

2022-03-11 22:24:23

by Boris Ostrovsky

[permalink] [raw]

Subject: Re: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

On 3/2/22 11:40 AM, Dongli Zhang wrote:
> The sched_clock() can be used very early since commit 857baa87b642
> ("sched/clock: Enable sched clock early"). In addition, with commit
> 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), kdump
> kernel in Xen HVM guest may panic at very early stage when accessing
> &__this_cpu_read(xen_vcpu)->time as in below:
>
> setup_arch()
> -> init_hypervisor_platform()
> -> x86_init.hyper.init_platform = xen_hvm_guest_init()
> -> xen_hvm_init_time_ops()
> -> xen_clocksource_read()
> -> src = &__this_cpu_read(xen_vcpu)->time;
>
> This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
> embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
> used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
>
> However, when Xen HVM guest panic on vcpu >= 32, since
> xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
> vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
>
> This patch calls xen_hvm_init_time_ops() again later in
> xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
> registered when the boot vcpu is >= 32.
>
> This issue can be reproduced on purpose via below command at the guest
> side when kdump/kexec is enabled:
>
> "taskset -c 33 echo c > /proc/sysrq-trigger"
>
> The bugfix for PVM is not implemented due to the lack of testing
> environment.
>
> Cc: Joe Jin <[email protected]>
> Signed-off-by: Dongli Zhang <[email protected]>

Reviewed-by: Boris Ostrovsky <[email protected]>

Applied to for-linus-5.18 (with return path adjusted)