LinuxLists.cc - [PATCH v3 0/1] xen: fix HVM kexec kernel panic

2022-02-24 22:47:00

Subject: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

This is the v3 of the patch to fix xen kexec kernel panic issue when the
kexec is triggered on VCPU >= 32.

PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
[ 0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
[ 0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
... ...
[ 0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX: 0000000000000000
[ 0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
[ 0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
[ 0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
[ 0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
[ 0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
[ 0.000000] FS: 0000000000000000(0000) GS:ffffffffab588000(0000) knlGS:0000000000000000
[ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
[ 0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.000000] Call Trace:
[ 0.000000] <TASK>
[ 0.000000] ? xen_clocksource_read+0x24/0x40
[ 0.000000] ? xen_init_time_common+0x5/0x49
[ 0.000000] ? xen_hvm_init_time_ops+0x23/0x45
[ 0.000000] ? xen_hvm_guest_init+0x221/0x25c
[ 0.000000] ? 0xffffffffa9600000
[ 0.000000] ? setup_arch+0x440/0xbd6
[ 0.000000] ? start_kernel+0x6c/0x695
[ 0.000000] ? secondary_startup_64_no_verify+0xd5/0xdb
[ 0.000000] </TASK>

Changed since v1:
- Add commit message to explain why xen_hvm_init_time_ops() is delayed
for any vcpus. (Suggested by Boris Ostrovsky)
- Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
- Delay for all VCPUs. (Suggested by Boris Ostrovsky)
- Add commit message that why PVM is not supported by this patch
- Test if kexec/kdump works with mainline xen (HVM and PVM)

I have delayed the xen_hvm_init_time_ops() for all VCPUs. Unfortunately,
now I am able to reproduce the clock backward as shown below on some old
versions of xen. I am not able to reproduce on most recent mainline xen.

[ 0.359687] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 [0] 24 25 26 27 28 29 30 31
[ 0.359694] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 [0] 40 41 42 43 44 45 46 47
[ 0.359701] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 [0] 56 57 58 59 60 61 62 63

... clock backward after the clocksource is switched from native to xen...

[ 0.000004] Fallback order for Node 0: 0
[ 0.002967] Built 1 zonelists, mobility grouping on. Total pages: 3527744
[ 0.007129] Policy zone: Normal
[ 0.008937] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty root=UUID=2a5975ab-a059-4697-9aee-7a53ddfeea21 ro text console=ttyS0,115200n8 console=tty1 earlyprintk=tty S0,115200n8 loglevel=7 no_timer_check reboot=s32 splash crashkernel=512M-:192M vt.handoff=1
[ 0.023880] Unknown kernel command line parameters "text splash BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty", will be passed to user space.
[ 0.032647] printk: log_buf_len individual max cpu contribution: 4096 bytes
[ 0.036828] printk: log_buf_len total cpu_extra contributions: 258048 bytes
[ 0.041049] printk: log_buf_len min size: 262144 bytes
[ 0.044481] printk: log_buf_len: 524288 bytes

Since now I am able to reproduce the clock backward on old xen version,
please let me know if I should re-use the v2 of this patch, as it has been
running well in our env well for very long time.

https://lore.kernel.org/all/[email protected]/

BTW, I have tested that 'soft_reset' does not work with mainline xen, even
when I directly trigger kexec with below commands.

# kexec -l /boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
--initrd=/boot/initrd.img-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
--reuse-cmdline
# kexec -e

Thank you very much!

Dongli Zhang

2022-02-25 05:22:17

by Dongli Zhang

[permalink] [raw]

Subject: [PATCH v3 1/1] xen: delay xen_hvm_init_time_ops() to xen_hvm_smp_prepare_boot_cpu()

The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). In addition, with commit
38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), kdump
kernel in Xen HVM guest may panic at very early stage when accessing
&__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
-> init_hypervisor_platform()
-> x86_init.hyper.init_platform = xen_hvm_guest_init()
-> xen_hvm_init_time_ops()
-> xen_clocksource_read()
-> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.

However, when Xen HVM guest panic on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.

This patch always delays xen_hvm_init_time_ops() to later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered.

This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:

"taskset -c 33 echo c > /proc/sysrq-trigger"

Unfortunately, the 'soft_reset' (kexec) does not work with mainline xen
version so that I can test this patch only with HVM guest on old xen
hypervisor where 'soft_reset' is working. The bugfix for PVM is not
implemented due to the lack of testing environment.

Cc: Joe Jin <[email protected]>
Signed-off-by: Dongli Zhang <[email protected]>
---
Changed since v1:
- Add commit message to explain why xen_hvm_init_time_ops() is delayed
for any vcpus. (Suggested by Boris Ostrovsky)
- Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
- Delay for all VCPUs. (Suggested by Boris Ostrovsky)
- Add commit message that why PVM is not supported by this patch
- Test if kexec/kdump works with mainline xen (HVM and PVM)

arch/x86/xen/enlighten_hvm.c | 1 -
arch/x86/xen/smp_hvm.c | 17 +++++++++++++++++
2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 517a9d8d8f94..53f306ec1d3b 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -216,7 +216,6 @@ static void __init xen_hvm_guest_init(void)
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
xen_unplug_emulated_devices();
x86_init.irqs.intr_init = xen_init_IRQ;
- xen_hvm_init_time_ops();
xen_hvm_init_mmu_ops();

#ifdef CONFIG_KEXEC_CORE
diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..9a5efc1a1633 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,23 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
*/
xen_vcpu_setup(0);

+ /*
+ * xen_hvm_init_time_ops() used to be called at very early stage
+ * by xen_hvm_guest_init(). While only MAX_VIRT_CPUS 'vcpu_info'
+ * are embedded inside 'shared_info', the VM would use them until
+ * xen_vcpu_setup() is used to allocate/relocate them at arbitrary
+ * address.
+ *
+ * However, when Xen HVM guest boots on vcpu >= MAX_VIRT_CPUS
+ * (e.g., kexec kernel), per_cpu(xen_vcpu, cpu) is NULL at early
+ * stage. To access per_cpu(xen_vcpu, cpu) via
+ * xen_clocksource_read() would panic the kernel.
+ *
+ * Therefore we always delay xen_hvm_init_time_ops() to
+ * xen_hvm_smp_prepare_boot_cpu() to avoid the panic.
+ */
+ xen_hvm_init_time_ops();
+
/*
* The alternative logic (which patches the unlock/lock) runs before
* the smp bootup up code is activated. Hence we need to set this up
--
2.17.1

2022-02-26 01:35:43

by Dongli Zhang

[permalink] [raw]

Subject: Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

Hi Boris,

On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>
> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>> kexec is triggered on VCPU >= 32.
>>
>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>> ... ...
>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>> 0000000000000000
>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
>> [    0.000000] FS: 0000000000000000(0000) GS:ffffffffab588000(0000)
>> knlGS:0000000000000000
>> [    0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [    0.000000] Call Trace:
>> [    0.000000] <TASK>
>> [    0.000000] ? xen_clocksource_read+0x24/0x40
>
>
> This is done to set xen_sched_clock_offset which I think will not be used for a
> while, until sched_clock is called (and the other two uses are for suspend/resume)
>
>
> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
> after all vcpu areas are properly set? Or are there other uses of
> xen_clocksource_read() before ?
>

I have tested that below patch will panic kdump kernel.

diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..6a0c99941ae1 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,8 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
*/
xen_vcpu_setup(0);

+ xen_init_sched_clock_offset();
+
/*
* The alternative logic (which patches the unlock/lock) runs before
* the smp bootup up code is activated. Hence we need to set this up
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index d9c945ee1100..8a2eafa0c215 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -520,9 +520,14 @@ static void __init xen_time_init(void)
pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
}

-static void __init xen_init_time_common(void)
+void xen_init_sched_clock_offset(void)
{
xen_sched_clock_offset = xen_clocksource_read();
+}
+
+static void __init xen_init_time_common(void)
+{
+ xen_sched_clock_offset = 0;
static_call_update(pv_steal_clock, xen_steal_clock);
paravirt_set_sched_clock(xen_sched_clock);

diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index fd0fec6e92f4..9f7656214dfb 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -69,6 +69,7 @@ void xen_teardown_timer(int cpu);
void xen_setup_cpu_clockevents(void);
void xen_save_time_memory_area(void);
void xen_restore_time_memory_area(void);
+void xen_init_sched_clock_offset(void);
void xen_init_time_ops(void);
void xen_hvm_init_time_ops(void);

Unfortunately, I am not able to obtain the panic callstack from kdump time this
time. I have only below.

PANIC: early exception 0x0e IP 10:ffffffffa6c679b6 error 0 cr2 0x20
PANIC: early exception 0x0e IP 10:ffffffffa6c679b6 error 0 cr2 0x20

The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). Any printk should use sched_clock()
to obtain the timestamp.

vprintk_store()
-> local_clock()
-> sched_clock()
-> paravirt_sched_clock()
-> xen_sched_clock()
-> xen_clocksource_read()

AFAIR, we started to encounter the issue since commit 857baa87b642
("sched/clock: Enable sched clock early"). kdump used to work well before that
commit.

Thank you very much!

Dongli Zhang

2022-02-26 02:12:31

by Boris Ostrovsky

[permalink] [raw]

Subject: Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

On 2/24/22 4:50 PM, Dongli Zhang wrote:
> This is the v3 of the patch to fix xen kexec kernel panic issue when the
> kexec is triggered on VCPU >= 32.
>
> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
> [ 0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
> [ 0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
> ... ...
> [ 0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX: 0000000000000000
> [ 0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
> [ 0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
> [ 0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
> [ 0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
> [ 0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
> [ 0.000000] FS: 0000000000000000(0000) GS:ffffffffab588000(0000) knlGS:0000000000000000
> [ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
> [ 0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 0.000000] Call Trace:
> [ 0.000000] <TASK>
> [ 0.000000] ? xen_clocksource_read+0x24/0x40

This is done to set xen_sched_clock_offset which I think will not be used for a while, until sched_clock is called (and the other two uses are for suspend/resume)

Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until after all vcpu areas are properly set? Or are there other uses of xen_clocksource_read() before ?

-boris

> [ 0.000000] ? xen_init_time_common+0x5/0x49
> [ 0.000000] ? xen_hvm_init_time_ops+0x23/0x45
> [ 0.000000] ? xen_hvm_guest_init+0x221/0x25c
> [ 0.000000] ? 0xffffffffa9600000
> [ 0.000000] ? setup_arch+0x440/0xbd6
> [ 0.000000] ? start_kernel+0x6c/0x695
> [ 0.000000] ? secondary_startup_64_no_verify+0xd5/0xdb
> [ 0.000000] </TASK>
>
>
> Changed since v1:
> - Add commit message to explain why xen_hvm_init_time_ops() is delayed
> for any vcpus. (Suggested by Boris Ostrovsky)
> - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
> code in xen_hvm_guest_init(). (suggested by Juergen Gross)
> Changed since v2:
> - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
> - Add commit message that why PVM is not supported by this patch
> - Test if kexec/kdump works with mainline xen (HVM and PVM)
>
>
> I have delayed the xen_hvm_init_time_ops() for all VCPUs. Unfortunately,
> now I am able to reproduce the clock backward as shown below on some old
> versions of xen. I am not able to reproduce on most recent mainline xen.
>
> [ 0.359687] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 [0] 24 25 26 27 28 29 30 31
> [ 0.359694] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 [0] 40 41 42 43 44 45 46 47
> [ 0.359701] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 [0] 56 57 58 59 60 61 62 63
>
> ... clock backward after the clocksource is switched from native to xen...
>
> [ 0.000004] Fallback order for Node 0: 0
> [ 0.002967] Built 1 zonelists, mobility grouping on. Total pages: 3527744
> [ 0.007129] Policy zone: Normal
> [ 0.008937] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty root=UUID=2a5975ab-a059-4697-9aee-7a53ddfeea21 ro text console=ttyS0,115200n8 console=tty1 earlyprintk=tty S0,115200n8 loglevel=7 no_timer_check reboot=s32 splash crashkernel=512M-:192M vt.handoff=1
> [ 0.023880] Unknown kernel command line parameters "text splash BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty", will be passed to user space.
> [ 0.032647] printk: log_buf_len individual max cpu contribution: 4096 bytes
> [ 0.036828] printk: log_buf_len total cpu_extra contributions: 258048 bytes
> [ 0.041049] printk: log_buf_len min size: 262144 bytes
> [ 0.044481] printk: log_buf_len: 524288 bytes
>
>
> Since now I am able to reproduce the clock backward on old xen version,
> please let me know if I should re-use the v2 of this patch, as it has been
> running well in our env well for very long time.
>
> https://lore.kernel.org/all/[email protected]/
>
>
> BTW, I have tested that 'soft_reset' does not work with mainline xen, even
> when I directly trigger kexec with below commands.
>
> # kexec -l /boot/vmlinuz-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
> --initrd=/boot/initrd.img-5.17.0-rc4xen-00054-gf71077a4d84b-dirty \
> --reuse-cmdline
> # kexec -e
>
>
> Thank you very much!
>
> Dongli Zhang
>
>

2022-02-28 20:47:07

by Boris Ostrovsky

[permalink] [raw]

Subject: Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

On 2/25/22 8:17 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>
>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>> kexec is triggered on VCPU >= 32.
>>>
>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>> ... ...
>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>> 0000000000000000
>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX: 0000000000000002
>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI: 0000000000000020
>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09: 0000000000000001
>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12: 0000000040000004
>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15: 0000000000000000
>>> [    0.000000] FS: 0000000000000000(0000) GS:ffffffffab588000(0000)
>>> knlGS:0000000000000000
>>> [    0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4: 00000000000406a0
>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [    0.000000] Call Trace:
>>> [    0.000000] <TASK>
>>> [    0.000000] ? xen_clocksource_read+0x24/0x40
>>
>>
>> This is done to set xen_sched_clock_offset which I think will not be used for a
>> while, until sched_clock is called (and the other two uses are for suspend/resume)
>>
>>
>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>> after all vcpu areas are properly set? Or are there other uses of
>> xen_clocksource_read() before ?
>>
>
> I have tested that below patch will panic kdump kernel.
>

Oh well, so much for that then. Yes, sched_clock() is at least called from printk path.

I guess we will have to go with v2 then, we don't want to start seeing time going back, even if only with older hypervisors. The only thing I might ask is that you roll the logic inside xen_hvm_init_time_ops(). Something like

xen_hvm_init_time_ops()
{
/*
* Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
* later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
*/
if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
return;

...

-boris

2022-03-01 05:45:12

by Dongli Zhang

[permalink] [raw]

Subject: Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

Hi Boris,

On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
>
>
> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>
>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>> kexec is triggered on VCPU >= 32.
>>>>
>>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>> ... ...
>>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>> 0000000000000000
>>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX:
>>>> 0000000000000002
>>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI:
>>>> 0000000000000020
>>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09:
>>>> 0000000000000001
>>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12:
>>>> 0000000040000004
>>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15:
>>>> 0000000000000000
>>>> [    0.000000] FS: 0000000000000000(0000) GS:ffffffffab588000(0000)
>>>> knlGS:0000000000000000
>>>> [    0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4:
>>>> 00000000000406a0
>>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>> 0000000000000400
>>>> [    0.000000] Call Trace:
>>>> [    0.000000] <TASK>
>>>> [    0.000000] ? xen_clocksource_read+0x24/0x40
>>>
>>>
>>> This is done to set xen_sched_clock_offset which I think will not be used for a
>>> while, until sched_clock is called (and the other two uses are for
>>> suspend/resume)
>>>
>>>
>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>> after all vcpu areas are properly set? Or are there other uses of
>>> xen_clocksource_read() before ?
>>>
>>
>> I have tested that below patch will panic kdump kernel.
>>
>
>
>
> Oh well, so much for that then. Yes, sched_clock() is at least called from
> printk path.
>
>
> I guess we will have to go with v2 then, we don't want to start seeing time
> going back, even if only with older hypervisors. The only thing I might ask is
> that you roll the logic inside xen_hvm_init_time_ops(). Something like
>
>
> xen_hvm_init_time_ops()
> {
>     /*
>      * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>      * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>      */
>     if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>         return;
>

I think you meant __this_cpu_read(xen_vcpu).

I will call xen_hvm_init_time_ops() at both places, and move the logic into
xen_hvm_init_time_ops().

Thank you very much!

Dongli Zhang

2022-03-01 06:16:11

by Dongli Zhang

[permalink] [raw]

Subject: Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

Hi Boris,

On 2/28/22 5:18 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
>>
>>
>> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>>> Hi Boris,
>>>
>>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>>
>>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>>> kexec is triggered on VCPU >= 32.
>>>>>
>>>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>>> ... ...
>>>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>>> 0000000000000000
>>>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX:
>>>>> 0000000000000002
>>>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI:
>>>>> 0000000000000020
>>>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09:
>>>>> 0000000000000001
>>>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12:
>>>>> 0000000040000004
>>>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15:
>>>>> 0000000000000000
>>>>> [    0.000000] FS: 0000000000000000(0000) GS:ffffffffab588000(0000)
>>>>> knlGS:0000000000000000
>>>>> [    0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4:
>>>>> 00000000000406a0
>>>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>> 0000000000000000
>>>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>> 0000000000000400
>>>>> [    0.000000] Call Trace:
>>>>> [    0.000000] <TASK>
>>>>> [    0.000000] ? xen_clocksource_read+0x24/0x40
>>>>
>>>>
>>>> This is done to set xen_sched_clock_offset which I think will not be used for a
>>>> while, until sched_clock is called (and the other two uses are for
>>>> suspend/resume)
>>>>
>>>>
>>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>>> after all vcpu areas are properly set? Or are there other uses of
>>>> xen_clocksource_read() before ?
>>>>
>>>
>>> I have tested that below patch will panic kdump kernel.
>>>
>>
>>
>>
>> Oh well, so much for that then. Yes, sched_clock() is at least called from
>> printk path.
>>
>>
>> I guess we will have to go with v2 then, we don't want to start seeing time
>> going back, even if only with older hypervisors. The only thing I might ask is
>> that you roll the logic inside xen_hvm_init_time_ops(). Something like
>>
>>
>> xen_hvm_init_time_ops()
>> {
>>     /*
>>      * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>>      * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>>      */
>>     if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>>         return;
>>
>
> I think you meant __this_cpu_read(xen_vcpu).
>
> I will call xen_hvm_init_time_ops() at both places, and move the logic into
> xen_hvm_init_time_ops().
>
> Thank you very much!
>
> Dongli Zhang
>

How about we do not move the logic into xen_hvm_init_time_ops()?

Suppose we move the logic into xen_hvm_init_time_ops() line 573, the line line
570 might be printed twice.

559 void __init xen_hvm_init_time_ops(void)
560 {
561 /*
562 * vector callback is needed otherwise we cannot receive interrupts
563 * on cpu > 0 and at this point we don't know how many cpus are
564 * available.
565 */
566 if (!xen_have_vector_callback)
567 return;
568
569 if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
570 pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
571 return;
572 }
573
574 xen_init_time_common();
575
576 x86_init.timers.setup_percpu_clockev = xen_time_init;
577 x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
578
579 x86_platform.set_wallclock = xen_set_wallclock;
580 }

I feel the code looks better if we keep the logic at caller side. Would you mind
letting me know your feedback?

Thank you very much!

Dongli Zhang

2022-03-01 20:18:30

by Boris Ostrovsky

[permalink] [raw]

Subject: Re: [PATCH v3 0/1] xen: fix HVM kexec kernel panic

On 2/28/22 11:56 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 2/28/22 5:18 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 2/28/22 12:45 PM, Boris Ostrovsky wrote:
>>>
>>> On 2/25/22 8:17 PM, Dongli Zhang wrote:
>>>> Hi Boris,
>>>>
>>>> On 2/25/22 2:39 PM, Boris Ostrovsky wrote:
>>>>> On 2/24/22 4:50 PM, Dongli Zhang wrote:
>>>>>> This is the v3 of the patch to fix xen kexec kernel panic issue when the
>>>>>> kexec is triggered on VCPU >= 32.
>>>>>>
>>>>>> PANIC: early exception 0x0e IP 10:ffffffffa96679b6 error 0 cr2 0x20
>>>>>> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
>>>>>> 5.17.0-rc4xen-00054-gf71077a4d84b-dirty #1
>>>>>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 12/15/2020
>>>>>> [    0.000000] RIP: 0010:pvclock_clocksource_read+0x6/0xb0
>>>>>> ... ...
>>>>>> [    0.000000] RSP: 0000:ffffffffaae03e10 EFLAGS: 00010082 ORIG_RAX:
>>>>>> 0000000000000000
>>>>>> [    0.000000] RAX: 0000000000000000 RBX: 0000000000010000 RCX:
>>>>>> 0000000000000002
>>>>>> [    0.000000] RDX: 0000000000000003 RSI: ffffffffaac37515 RDI:
>>>>>> 0000000000000020
>>>>>> [    0.000000] RBP: 0000000000011000 R08: 0000000000000000 R09:
>>>>>> 0000000000000001
>>>>>> [    0.000000] R10: ffffffffaae03df8 R11: ffffffffaae03c68 R12:
>>>>>> 0000000040000004
>>>>>> [    0.000000] R13: ffffffffaae03e50 R14: 0000000000000000 R15:
>>>>>> 0000000000000000
>>>>>> [    0.000000] FS: 0000000000000000(0000) GS:ffffffffab588000(0000)
>>>>>> knlGS:0000000000000000
>>>>>> [    0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [    0.000000] CR2: 0000000000000020 CR3: 00000000ea410000 CR4:
>>>>>> 00000000000406a0
>>>>>> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>>> 0000000000000000
>>>>>> [    0.000000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>>> 0000000000000400
>>>>>> [    0.000000] Call Trace:
>>>>>> [    0.000000] <TASK>
>>>>>> [    0.000000] ? xen_clocksource_read+0x24/0x40
>>>>>
>>>>> This is done to set xen_sched_clock_offset which I think will not be used for a
>>>>> while, until sched_clock is called (and the other two uses are for
>>>>> suspend/resume)
>>>>>
>>>>>
>>>>> Can we simply defer 'xen_sched_clock_offset = xen_clocksource_read();' until
>>>>> after all vcpu areas are properly set? Or are there other uses of
>>>>> xen_clocksource_read() before ?
>>>>>
>>>> I have tested that below patch will panic kdump kernel.
>>>>
>>>
>>>
>>> Oh well, so much for that then. Yes, sched_clock() is at least called from
>>> printk path.
>>>
>>>
>>> I guess we will have to go with v2 then, we don't want to start seeing time
>>> going back, even if only with older hypervisors. The only thing I might ask is
>>> that you roll the logic inside xen_hvm_init_time_ops(). Something like
>>>
>>>
>>> xen_hvm_init_time_ops()
>>> {
>>>     /*
>>>      * Wait until per_cpu(xen_vcpu, 0) is initialized which may happen
>>>      * later (e.g. when kdump kernel runs on >=MAX_VIRT_CPUS vcpu)
>>>      */
>>>     if (__this_cpu_read(xen_vcpu_nr(0)) == NULL)
>>>         return;
>>>
>> I think you meant __this_cpu_read(xen_vcpu).
>>
>> I will call xen_hvm_init_time_ops() at both places, and move the logic into
>> xen_hvm_init_time_ops().
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>
> How about we do not move the logic into xen_hvm_init_time_ops()?
>
> Suppose we move the logic into xen_hvm_init_time_ops() line 573, the line line
> 570 might be printed twice.

You would need to make sure the routine is executed only once so something like a local static variable would be needed.

>
>
> 559 void __init xen_hvm_init_time_ops(void)
> 560 {
> 561 /*
> 562 * vector callback is needed otherwise we cannot receive interrupts
> 563 * on cpu > 0 and at this point we don't know how many cpus are
> 564 * available.
> 565 */
> 566 if (!xen_have_vector_callback)
> 567 return;
> 568
> 569 if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
> 570 pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
> 571 return;
> 572 }
> 573
> 574 xen_init_time_common();
> 575
> 576 x86_init.timers.setup_percpu_clockev = xen_time_init;
> 577 x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
> 578
> 579 x86_platform.set_wallclock = xen_set_wallclock;
> 580 }
>
> I feel the code looks better if we keep the logic at caller side. Would you mind
> letting me know your feedback?

My preference is to keep logic concentrated in one place whenever possible.

-boris