2023-06-21 16:16:25

by Arnd Bergmann

Subject: Re: qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.

On Wed, Jun 21, 2023, at 16:16, Naresh Kamboju wrote:
> Hi Team,
>
> FYI,
> qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.
>
> docker.io/linaro/tuxrun-qemu:v8.0.0 qemu-system-x86_64

Thanks for the report. I've added the x86 and kernel lists as well
as Peter Zijlstra to Cc.

> Running tests.......
> ftrace_regression01 1 TPASS: Finished running the test
> <4>[ 49.874424] int3: 0000 [#1] PREEMPT SMP PTI
> <4>[ 49.874634] int3: 0000 [#2] PREEMPT SMP PTI
> <4>[ 49.874656] int3: 0000 [#3] PREEMPT SMP PTI
> <4>[ 49.874669] int3: 0000 [#4] PREEMPT SMP PTI
> <4>[ 49.874680] int3: 0000 [#5] PREEMPT SMP PTI
> <4>[ 49.874691] int3: 0000 [#6] PREEMPT SMP PTI
> <4>[ 49.874703] int3: 0000 [#7] PREEMPT SMP PTI
> <4>[ 49.874714] int3: 0000 [#8] PREEMPT SMP PTI
> <4>[ 49.874725] int3: 0000 [#9] PREEMPT SMP PTI

At first this looked a bit like it was hitting on multiple CPUs,
but I also see that the instance only has two CPUs, so apparently
this instance is hitting the same bug recursively from the
exception handler.

> <4>[ 49.874736] int3: 0000 [#10] PREEMPT SMP PTI
> <4>[ 49.874750] int3: 0000 [#11] PREEMPT SMP PTI
> <4>[ 49.874761] int3: 0000 [#12] PREEMPT SMP PTI
> <4>[ 49.874772] int3: 0000 [#13] PREEMPT SMP PTI
> <4>[ 49.874783] int3: 0000 [#14] PREEMPT SMP PTI
> <4>[ 49.874828] int3: 0000 [#15] PREEMPT SMP PTI
> <4>[ 49.874894] int3: 0000 [#16] PREEMPT SMP PTI
> <4>[ 49.874907] int3: 0000 [#17] PREEMPT SMP PTI
> <4>[ 49.874918] int3: 0000 [#18] PREEMPT SMP PTI
> <4>[ 49.874930] int3: 0000 [#19] PREEMPT SMP PTI
> <4>[ 49.874941] int3: 0000 [#20] PREEMPT SMP PTI
> <4>[ 49.874953] int3: 0000 [#21] PREEMPT SMP PTI
> <4>[ 49.874964] int3: 0000 [#22] PREEMPT SMP PTI
> <4>[ 49.874976] int3: 0000 [#23] PREEMPT SMP PTI
> <4>[ 49.874987] int3: 0000 [#24] PREEMPT SMP PTI
> <4>[ 49.874999] int3: 0000 [#25] PREEMPT SMP PTI
> <4>[ 49.875010] int3: 0000 [#26] PREEMPT SMP PTI
> <4>[ 49.875021] int3: 0000 [#27] PREEMPT SMP PTI
> <4>[ 49.875032] int3: 0000 [#28] PREEMPT SMP PTI
> <4>[ 49.875043] int3: 0000 [#29] PREEMPT SMP PTI
> <4>[ 49.875054] int3: 0000 [#30] PREEMPT SMP PTI
> <4>[ 49.875282] ------------[ cut here ]------------
> <4>[ 49.875687] ------------[ cut here ]------------
> <2>[ 49.875904] kernel BUG at kernel/entry/common.c:454!
> <4>[ 49.876652] invalid opcode: 0000 [#31] PREEMPT SMP PTI
> <4>[ 49.876845] CPU: 1 PID: 317 Comm: ftrace_regressi Not tainted
> 6.4.0-rc7-next-20230621 #1
> <4>[ 49.876994] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> <4>[ 49.877119] RIP: 0010:irqentry_nmi_enter+0x68/0x70
> <4>[ 49.877756] Code: 48 8b 0d 7b 3a 6a 73 c7 81 88 08 00 00 00 00
> 00 00 85 c0 0f 95 c3 e8 27 02 00 00 e8 22 93 fe fe 89 d8 5b 5d c3 cc
> cc cc cc cc <0f> 0b 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90
> 90 90 90
> <4>[ 49.877929] RSP: 0000:fffffe3e79ab6ee8 EFLAGS: 00000046
> <4>[ 49.878425] RAX: 0000000000000000 RBX: fffffe3e79ab6f58 RCX:
> 00000000fe00ffdb
> <4>[ 49.878444] RDX: 00000000ffffa445 RSI: 0000000000000000 RDI:
> fffffe3e79ab6f58
> <4>[ 49.878460] RBP: fffffe3e79ab6ef0 R08: 0000000000000000 R09:
> 0000000000000000
> <4>[ 49.878475] R10: 0000000000000000 R11: 0000000000000000 R12:
> ffffa44502ab5a00
> <4>[ 49.878490] R13: 0000000000000000 R14: 0000000000000000 R15:
> ffffab17000ccfd8
> <4>[ 49.878540] FS: 00007f91069b4740(0000)
> GS:ffffa4457bd00000(0000) knlGS:0000000000000000
> <4>[ 49.878579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[ 49.878596] CR2: ffffab17000ccfd8 CR3: 0000000102be8000 CR4:
> 00000000000006e0
> <4>[ 49.878695] Call Trace:
> <4>[ 49.878892] <#DF>
> <4>[ 49.879037] ? __die_body+0x6c/0xc0
> <4>[ 49.879123] ? die+0xae/0xe0
> <4>[ 49.879151] ? do_trap+0x8d/0x160
> <4>[ 49.879178] ? irqentry_nmi_enter+0x68/0x70
> <4>[ 49.879199] ? handle_invalid_op+0x7f/0xd0
> <4>[ 49.879213] ? irqentry_nmi_enter+0x68/0x70
> <4>[ 49.879237] ? exc_invalid_op+0x36/0x50
> <4>[ 49.879259] ? asm_exc_invalid_op+0x1f/0x30
> <4>[ 49.879316] ? irqentry_nmi_enter+0x68/0x70
> <4>[ 49.879344] exc_double_fault+0x107/0x1b0
> <4>[ 49.879416] asm_exc_double_fault+0x23/0x30
> <4>[ 49.879479] RIP: 0010:0xffffffffc00ac0ac

I don't know much about x86 exception handling, but my guess is
that this is where the stack overflows, so this backtrace
is not all that useful. Looking at the full log from your link,
I see the recursion going through asm_exc_int3:

<4>[ 49.886694] ? exc_int3+0x62/0x80
<4>[ 49.886714] ? asm_exc_int3+0x3e/0x50
<4>[ 49.886759] ? preempt_count_sub+0x5/0x80
<4>[ 49.886783] ? preempt_count_sub+0x5/0x80
<4>[ 49.886805] ? irq_work_queue+0x40/0x80
<4>[ 49.886826] ? defer_console_output+0x49/0x80
<4>[ 49.886840] ? vprintk+0x42/0x60
<4>[ 49.886857] ? _printk+0x5d/0x80
<4>[ 49.886891] ? die+0x9c/0xe0
<4>[ 49.886922] ? exc_int3+0x62/0x80
<4>[ 49.886942] ? asm_exc_int3+0x3e/0x50
<4>[ 49.886966] ? __pfx_tick_sched_timer+0x10/0x10
<4>[ 49.886980] ? __pfx_read_tsc+0x10/0x10
<4>[ 49.887010] ? preempt_count_sub+0x5/0x80
<4>[ 49.887035] ? preempt_count_sub+0x5/0x80
<4>[ 49.887057] ? __hrtimer_run_queues+0xee/0x330
<4>[ 49.887070] ? _raw_spin_unlock_irqrestore+0x28/0x50
<4>[ 49.887085] ? __hrtimer_run_queues+0xee/0x330
<4>[ 49.887136] ? hrtimer_interrupt+0xf6/0x390
<4>[ 49.887167] ? __sysvec_apic_timer_interrupt+0x64/0x1a0
<4>[ 49.887189] ? sysvec_apic_timer_interrupt+0x7a/0x90
<4>[ 49.887206] </IRQ>
<4>[ 49.887219] <TASK>
<4>[ 49.887231] ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
<4>[ 49.887252] ? sched_rt_period_timer+0x4/0x390
<4>[ 49.887272] ? __pfx_do_sync_core+0x10/0x10
<4>[ 49.887305] ? insn_get_displacement+0x9/0x160
<4>[ 49.887329] ? insn_get_displacement+0x9/0x160
<4>[ 49.887344] ? insn_get_immediate+0xd2/0x270
<4>[ 49.887365] ? insn_decode+0x113/0x150
<4>[ 49.887384] ? text_poke_loc_init+0xea/0x220
<4>[ 49.887429] ? sched_rt_period_timer+0x4/0x390
<4>[ 49.887451] ? text_poke_queue+0x89/0xa0
<4>[ 49.887477] ? ftrace_replace_code+0x149/0x1f0
<4>[ 49.887508] ? ftrace_modify_all_code+0x71/0x140
<4>[ 49.887533] ? arch_ftrace_update_code+0xd/0x20
<4>[ 49.887550] ? ftrace_shutdown+0xf5/0x220
<4>[ 49.887577] ? unregister_ftrace_function+0x2e/0x150
<4>[ 49.887608] ? stack_trace_sysctl+0x82/0xb0
<4>[ 49.887636] ? proc_sys_call_handler+0x18b/0x280
<4>[ 49.887675] ? proc_sys_write+0x17/0x20
<4>[ 49.887692] ? vfs_write+0x324/0x3f0
<4>[ 49.887742] ? ksys_write+0x75/0xe0
<4>[ 49.887771] ? __x64_sys_write+0x1f/0x30
<4>[ 49.887787] ? do_syscall_64+0x48/0xa0
<4>[ 49.887802] ? sysvec_apic_timer_interrupt+0x4d/0x90
<4>[ 49.887822] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[ 49.887878] </TASK>

This looks like a timer interrupt happening inside of
text_poke_loc_init(), i.e. while the kernel is modifying itself,
presumably adding (or removing) an int3 instruction that is later
hit inside of the timer function.
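
From the description in arch/x86/kernel/alternative.c, the patching
works in three phases, roughly like this (my simplified sketch, not
the actual kernel code; sync_cores() stands in for the real
text_poke_sync()/do_sync_core() IPI):

#include <stddef.h>
#include <string.h>

static void sync_cores(void)
{
        /* stand-in for text_poke_sync(): IPI all CPUs to serialize */
}

/* simplified sketch of the per-site patching in text_poke_bp_batch() */
static void poke_insn(unsigned char *addr, const unsigned char *insn,
                      size_t len)
{
        addr[0] = 0xcc;         /* 1) first byte -> int3, so a racing CPU
                                      traps instead of decoding a torn
                                      instruction */
        sync_cores();

        memcpy(addr + 1, insn + 1, len - 1);    /* 2) write the tail */
        sync_cores();

        addr[0] = insn[0];      /* 3) replace int3 with the first byte
                                      of the new instruction */
        sync_cores();
}

An int3 that fires inside that window is supposed to be caught and
emulated by poke_int3_handler().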

While our previous theory was that this was most likely
a qemu bug in dealing with self-modifying code, this might
actually hint at a problem in the kernel after all.

I think Peter understands this function best; he probably
sees more here than I do.

> Link:
> https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230621/testrun/17699662/suite/log-parser-test/tests/
>
> https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230621/testrun/17699662/suite/log-parser-test/test/check-kernel-panic/log
>
> - Naresh

Arnd


2023-06-21 16:31:49

by Peter Zijlstra

Subject: Re: qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.

On Wed, Jun 21, 2023 at 05:31:15PM +0200, Arnd Bergmann wrote:

> I don't know much about x86 exception handling, but my guess is
> that this is where the stack overflows, so this backtrace
> is not all that useful. Looking at the full log from your link,
> I see the recursion going through asm_exc_int3:
>
> <4>[ 49.886694] ? exc_int3+0x62/0x80
> <4>[ 49.886714] ? asm_exc_int3+0x3e/0x50
> <4>[ 49.886759] ? preempt_count_sub+0x5/0x80
> <4>[ 49.886783] ? preempt_count_sub+0x5/0x80
> <4>[ 49.886805] ? irq_work_queue+0x40/0x80
> <4>[ 49.886826] ? defer_console_output+0x49/0x80
> <4>[ 49.886840] ? vprintk+0x42/0x60
> <4>[ 49.886857] ? _printk+0x5d/0x80
> <4>[ 49.886891] ? die+0x9c/0xe0
> <4>[ 49.886922] ? exc_int3+0x62/0x80
> <4>[ 49.886942] ? asm_exc_int3+0x3e/0x50

So this is exc_int3(), but it is past poke_int3_handler() which is the
counter-part of text_poke_bp_batch() -- die() seems to suggest we're
even past do_int3().

The fact that it continued past poke_int3_handler() suggests that it is
a 'stray' int3, not covered by an active text modification site.
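
For reference, the #BP path is roughly (much simplified sketch, not
the literal traps.c code):

struct pt_regs;                                 /* opaque here */
extern int poke_int3_handler(struct pt_regs *regs);
extern int do_int3(struct pt_regs *regs);       /* kprobes etc. */
extern void die(const char *str, struct pt_regs *regs, long err);

void exc_int3_sketch(struct pt_regs *regs)
{
        /* int3s belonging to a live text_poke_bp_batch() site are
           consumed here and never reach the oops path */
        if (poke_int3_handler(regs))
                return;

        /* a 'stray' int3: give kprobes etc. a chance, otherwise we
           die() -- which is the splat above */
        if (!do_int3(regs))
                die("int3", regs, 0);
}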

> <4>[ 49.886966] ? __pfx_tick_sched_timer+0x10/0x10
> <4>[ 49.886980] ? __pfx_read_tsc+0x10/0x10
> <4>[ 49.887010] ? preempt_count_sub+0x5/0x80
> <4>[ 49.887035] ? preempt_count_sub+0x5/0x80
> <4>[ 49.887057] ? __hrtimer_run_queues+0xee/0x330
> <4>[ 49.887070] ? _raw_spin_unlock_irqrestore+0x28/0x50
> <4>[ 49.887085] ? __hrtimer_run_queues+0xee/0x330
> <4>[ 49.887136] ? hrtimer_interrupt+0xf6/0x390
> <4>[ 49.887167] ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> <4>[ 49.887189] ? sysvec_apic_timer_interrupt+0x7a/0x90
> <4>[ 49.887206] </IRQ>
> <4>[ 49.887219] <TASK>
> <4>[ 49.887231] ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
> <4>[ 49.887252] ? sched_rt_period_timer+0x4/0x390
> <4>[ 49.887272] ? __pfx_do_sync_core+0x10/0x10
> <4>[ 49.887305] ? insn_get_displacement+0x9/0x160
> <4>[ 49.887329] ? insn_get_displacement+0x9/0x160
> <4>[ 49.887344] ? insn_get_immediate+0xd2/0x270
> <4>[ 49.887365] ? insn_decode+0x113/0x150
> <4>[ 49.887384] ? text_poke_loc_init+0xea/0x220
> <4>[ 49.887429] ? sched_rt_period_timer+0x4/0x390
> <4>[ 49.887451] ? text_poke_queue+0x89/0xa0

This is the text_poke_queue() stage; it is still collecting sites to
poke but hasn't actually started poking yet. The actual poking happens
in text_poke_bp_batch(), which isn't in the call chain afaict.
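
Roughly (made-up helpers standing in for the tp_vec handling; the
real signature differs):

#include <stddef.h>

extern int  batch_is_full(void);
extern void flush_batch(void);          /* -> text_poke_bp_batch() */
extern void record_site(void *addr, const void *opcode, size_t len);

/* simplified sketch of text_poke_queue() */
void text_poke_queue_sketch(void *addr, const void *opcode, size_t len)
{
        if (batch_is_full())
                flush_batch();
        record_site(addr, opcode, len); /* text_poke_loc_init(), which
                                           is where the insn_decode()
                                           in the trace above happens */
}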

> <4>[ 49.887477] ? ftrace_replace_code+0x149/0x1f0
> <4>[ 49.887508] ? ftrace_modify_all_code+0x71/0x140
> <4>[ 49.887533] ? arch_ftrace_update_code+0xd/0x20
> <4>[ 49.887550] ? ftrace_shutdown+0xf5/0x220
> <4>[ 49.887577] ? unregister_ftrace_function+0x2e/0x150
> <4>[ 49.887608] ? stack_trace_sysctl+0x82/0xb0
> <4>[ 49.887636] ? proc_sys_call_handler+0x18b/0x280
> <4>[ 49.887675] ? proc_sys_write+0x17/0x20
> <4>[ 49.887692] ? vfs_write+0x324/0x3f0
> <4>[ 49.887742] ? ksys_write+0x75/0xe0
> <4>[ 49.887771] ? __x64_sys_write+0x1f/0x30
> <4>[ 49.887787] ? do_syscall_64+0x48/0xa0
> <4>[ 49.887802] ? sysvec_apic_timer_interrupt+0x4d/0x90
> <4>[ 49.887822] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[ 49.887878] </TASK>

2023-07-04 07:52:16

by Richard W.M. Jones

Subject: Re: qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.

On Wed, Jun 21, 2023 at 06:06:55PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 21, 2023 at 05:31:15PM +0200, Arnd Bergmann wrote:
>
> > I don't know much about x86 exception handling, but my guess is
> > that this is where the stack overflows, so this backtrace
> > is not all that useful. Looking at the full log from your link,
> > I see the recursion going through asm_exc_int3:
> >
> > <4>[ 49.886694] ? exc_int3+0x62/0x80
> > <4>[ 49.886714] ? asm_exc_int3+0x3e/0x50
> > <4>[ 49.886759] ? preempt_count_sub+0x5/0x80
> > <4>[ 49.886783] ? preempt_count_sub+0x5/0x80
> > <4>[ 49.886805] ? irq_work_queue+0x40/0x80
> > <4>[ 49.886826] ? defer_console_output+0x49/0x80
> > <4>[ 49.886840] ? vprintk+0x42/0x60
> > <4>[ 49.886857] ? _printk+0x5d/0x80
> > <4>[ 49.886891] ? die+0x9c/0xe0
> > <4>[ 49.886922] ? exc_int3+0x62/0x80
> > <4>[ 49.886942] ? asm_exc_int3+0x3e/0x50
>
> So this is exc_int3(), but it is past poke_int3_handler() which is the
> counter-part of text_poke_bp_batch() -- die() seems to suggest we're
> even past do_int3().
>
> The fact that it continued past poke_int3_handler() suggests that it is
> a 'stray' int3, not covered by an active text modification site.
>
> > <4>[ 49.886966] ? __pfx_tick_sched_timer+0x10/0x10
> > <4>[ 49.886980] ? __pfx_read_tsc+0x10/0x10
> > <4>[ 49.887010] ? preempt_count_sub+0x5/0x80
> > <4>[ 49.887035] ? preempt_count_sub+0x5/0x80
> > <4>[ 49.887057] ? __hrtimer_run_queues+0xee/0x330
> > <4>[ 49.887070] ? _raw_spin_unlock_irqrestore+0x28/0x50
> > <4>[ 49.887085] ? __hrtimer_run_queues+0xee/0x330
> > <4>[ 49.887136] ? hrtimer_interrupt+0xf6/0x390
> > <4>[ 49.887167] ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> > <4>[ 49.887189] ? sysvec_apic_timer_interrupt+0x7a/0x90
> > <4>[ 49.887206] </IRQ>
> > <4>[ 49.887219] <TASK>
> > <4>[ 49.887231] ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
> > <4>[ 49.887252] ? sched_rt_period_timer+0x4/0x390
> > <4>[ 49.887272] ? __pfx_do_sync_core+0x10/0x10
> > <4>[ 49.887305] ? insn_get_displacement+0x9/0x160
> > <4>[ 49.887329] ? insn_get_displacement+0x9/0x160
> > <4>[ 49.887344] ? insn_get_immediate+0xd2/0x270
> > <4>[ 49.887365] ? insn_decode+0x113/0x150
> > <4>[ 49.887384] ? text_poke_loc_init+0xea/0x220
> > <4>[ 49.887429] ? sched_rt_period_timer+0x4/0x390
> > <4>[ 49.887451] ? text_poke_queue+0x89/0xa0
>
> This is the text_poke_queue() stage; it is still collecting sites to
> poke but hasn't actually started poking yet. The actual poking happens
> in text_poke_bp_batch(), which isn't in the call chain afaict.
>
> > <4>[ 49.887477] ? ftrace_replace_code+0x149/0x1f0
> > <4>[ 49.887508] ? ftrace_modify_all_code+0x71/0x140
> > <4>[ 49.887533] ? arch_ftrace_update_code+0xd/0x20
> > <4>[ 49.887550] ? ftrace_shutdown+0xf5/0x220
> > <4>[ 49.887577] ? unregister_ftrace_function+0x2e/0x150
> > <4>[ 49.887608] ? stack_trace_sysctl+0x82/0xb0
> > <4>[ 49.887636] ? proc_sys_call_handler+0x18b/0x280
> > <4>[ 49.887675] ? proc_sys_write+0x17/0x20
> > <4>[ 49.887692] ? vfs_write+0x324/0x3f0
> > <4>[ 49.887742] ? ksys_write+0x75/0xe0
> > <4>[ 49.887771] ? __x64_sys_write+0x1f/0x30
> > <4>[ 49.887787] ? do_syscall_64+0x48/0xa0
> > <4>[ 49.887802] ? sysvec_apic_timer_interrupt+0x4d/0x90
> > <4>[ 49.887822] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > <4>[ 49.887878] </TASK>

Did anything happen with this bug later? I see nothing on any public
list, nor in any public bug tracker.

We have been having the same sort of problem
(https://bugzilla.redhat.com/show_bug.cgi?id=2216496). It's another
of those bugs that requires hundreds or thousands of boot iterations
before you can see it. There is a test in comment 27 but it requires
guestfish and some hacking to work. I'll try to come up with a
cleaner test later.

We see stack traces like:

[ 3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 3.082266] clocksource: Switched to clocksource acpi_pm
[ 3.090201] NET: Registered PF_INET protocol family
[ 3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
[ 3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.4.0-10173-ga901a3568fd2 #8
[ 3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
[ 3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
[ 3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
[ 3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
[ 3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
[ 3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
[ 3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
[ 3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
[ 3.093098] FS: 0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
[ 3.093098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
[ 3.093098] PKRU: 55555554
[ 3.093098] Call Trace:
[ 3.093098] <TASK>
[ 3.093098] ? die+0x31/0x80
[ 3.093098] ? exc_int3+0x10e/0x120
[ 3.093098] ? asm_exc_int3+0x39/0x40
[ 3.093098] ? __mod_timer+0x1c3/0x370
[ 3.093098] ? __mod_timer+0x1c3/0x370
[ 3.093098] queue_delayed_work_on+0x23/0x30
[ 3.093098] neigh_table_init+0x1bb/0x2e0
[ 3.093098] arp_init+0x12/0x50
[ 3.093098] inet_init+0x15b/0x2f0
[ 3.093098] ? __pfx_inet_init+0x10/0x10
[ 3.093098] do_one_initcall+0x58/0x230
[ 3.093098] kernel_init_freeable+0x199/0x2d0
[ 3.093098] ? __pfx_kernel_init+0x10/0x10
[ 3.093098] kernel_init+0x15/0x1b0
[ 3.093098] ret_from_fork+0x2c/0x50
[ 3.093098] </TASK>
[ 3.093098] Modules linked in:
[ 3.093098] ---[ end trace 0000000000000000 ]---
[ 3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
[ 3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
[ 3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
[ 3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
[ 3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
[ 3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
[ 3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
[ 3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
[ 3.093098] FS: 0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
[ 3.093098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
[ 3.093098] PKRU: 55555554
[ 3.093098] Kernel panic - not syncing: Fatal exception in interrupt

There are many variations, but the common pattern seems to be
<something in the clock or timer code> -> int3 exception

It only happens under qemu TCG (software emulation).

It goes away if we recompile qemu without MTTCG support.

It only happens with -smp enabled; we are using qemu -smp 4.

We are using qemu-system-x86_64 full system emulation on an x86_64
host (i.e. forcing KVM off).

It happens with the latest upstream kernel and qemu, compiled from
source.
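
For concreteness, the guest is started along these lines (illustrative
only, not the exact command; the full reproducer is in the BZ):

qemu-system-x86_64 -machine q35 -accel tcg,thread=multi -smp 4 \
    -m 2048 -kernel ./vmlinuz -append "console=ttyS0" -nographic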

I'd love to have a better way to debug this or collect more
diagnostics if you have any suggestions.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


2023-07-04 13:38:24

by Richard W.M. Jones

Subject: Re: qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.

On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
> We have been having the same sort of problem
> (https://bugzilla.redhat.com/show_bug.cgi?id=2216496). It's another
> of those bugs that requires hundreds or thousands of boot iterations
> before you can see it. There is a test in comment 27 but it requires
> guestfish and some hacking to work. I'll try to come up with a
> cleaner test later.

FWIW here's a better test. It only uses qemu-system-x86_64 & a
vmlinuz file of your choice, and is very fast. It usually hits the
bug in seconds.

https://github.com/rwmjones/bootbootboot/tree/bz2216496

NB: You will need to change the definition of VMLINUX in
config-bz2216496-qemu.h (and generally read the code before running,
but hopefully it should just work apart from choosing a kernel image).

> I'd love to have a better way to debug this or collect more
> diagnostics if you have any suggestions.

Still open to any suggestions about better ways to debug this, or
anything you'd like me to try out.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


2023-07-05 16:41:13

by Richard W.M. Jones

Subject: Re: qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.

On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
> We have been having the same sort of problem
> (https://bugzilla.redhat.com/show_bug.cgi?id=2216496). It's another
> of those bugs that requires hundreds or thousands of boot iterations
> before you can see it. There is a test in comment 27 but it requires
> guestfish and some hacking to work. I'll try to come up with a
> cleaner test later.
>
> We see stack traces like:
>
> [ 3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [ 3.082266] clocksource: Switched to clocksource acpi_pm
> [ 3.090201] NET: Registered PF_INET protocol family
> [ 3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
> [ 3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.4.0-10173-ga901a3568fd2 #8
> [ 3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> [ 3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> [ 3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> [ 3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> [ 3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> [ 3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> [ 3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> [ 3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> [ 3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> [ 3.093098] FS: 0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> [ 3.093098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> [ 3.093098] PKRU: 55555554
> [ 3.093098] Call Trace:
> [ 3.093098] <TASK>
> [ 3.093098] ? die+0x31/0x80
> [ 3.093098] ? exc_int3+0x10e/0x120
> [ 3.093098] ? asm_exc_int3+0x39/0x40
> [ 3.093098] ? __mod_timer+0x1c3/0x370
> [ 3.093098] ? __mod_timer+0x1c3/0x370
> [ 3.093098] queue_delayed_work_on+0x23/0x30
> [ 3.093098] neigh_table_init+0x1bb/0x2e0
> [ 3.093098] arp_init+0x12/0x50
> [ 3.093098] inet_init+0x15b/0x2f0
> [ 3.093098] ? __pfx_inet_init+0x10/0x10
> [ 3.093098] do_one_initcall+0x58/0x230
> [ 3.093098] kernel_init_freeable+0x199/0x2d0
> [ 3.093098] ? __pfx_kernel_init+0x10/0x10
> [ 3.093098] kernel_init+0x15/0x1b0
> [ 3.093098] ret_from_fork+0x2c/0x50
> [ 3.093098] </TASK>
> [ 3.093098] Modules linked in:
> [ 3.093098] ---[ end trace 0000000000000000 ]---
> [ 3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> [ 3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> [ 3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> [ 3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> [ 3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> [ 3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> [ 3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> [ 3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> [ 3.093098] FS: 0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> [ 3.093098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> [ 3.093098] PKRU: 55555554
> [ 3.093098] Kernel panic - not syncing: Fatal exception in interrupt
>
> There are many variations, but the common pattern seems to be
> <something in the clock or timer code> -> int3 exception
>
> It only happens under qemu TCG (software emulation).
>
> It goes away if we recompile qemu without MTTCG support.
>
> It only happens with -smp enabled; we are using qemu -smp 4.
>
> We are using qemu-system-x86_64 full system emulation on an x86_64
> host (i.e. forcing KVM off).
>
> It happens with the latest upstream kernel and qemu, compiled from
> source.

I got a bit further on this one.

The bug happens in the code that updates the static branch used for at
least these two keys:

static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);

There are probably others (it seems to be a generic problem with how
static branches are handled by TCG), but I only see the bug for those
two.
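
For context, such a key gates code in the standard way; enabling it
rewrites a NOP into a JMP at runtime through the jump label machinery
(usage sketch below; migrate_timers_sketch() is a made-up stand-in
for the real user):

DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);

static void sketch(void)
{
        /* compiled as a 5-byte NOP while the key is false */
        if (static_branch_unlikely(&timers_migration_enabled))
                migrate_timers_sketch();
}

/* static_branch_enable(&timers_migration_enabled) then patches the
 * NOP into a JMP, via the function described next. */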

When the static branch is updated we end up calling
arch/x86/kernel/alternative.c:text_poke_bp_batch(). It's best to read
the description of that function to see where int3 is used:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/alternative.c#n2086

I modified the qemu TCG int3 helper so that it dumps the code at %rip
when the interrupt fires, and I can actually see the changes made by
the above function happen: first the int3 is written, then the tail
of the nop, then the int3 is overwritten with the first byte of the
nop.
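
The instrumentation was roughly along these lines (a sketch, not the
exact patch; cpu_memory_rw_debug() is the stock API for reading guest
memory):

#include "qemu/osdep.h"   /* inside the QEMU tree; cpu_memory_rw_debug()
                             comes from the usual CPU headers */

/* dump the guest code bytes at RIP when the exception is raised */
static void dump_code_at_rip(CPUState *cs, uint64_t rip)
{
    uint8_t buf[16];
    size_t i;

    if (cpu_memory_rw_debug(cs, rip, buf, sizeof(buf), false) == 0) {
        fprintf(stderr, "code at rip:");
        for (i = 0; i < sizeof(buf); i++) {
            fprintf(stderr, " %02x", buf[i]);
        }
        fprintf(stderr, "\n");
    }
}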

Unfortunately the int3 still fires after the code has been completely
rewritten to its final (ie nop) value.

This seems to indicate to me that neither the self-write to the kernel
text segment, nor sync_core (implemented by an "iret to self" trick),
invalidates the qemu TCG translation block containing the old int3
helper call. Thus we (qemu) never "see" the new nop; we keep emulating
the int3, and that causes the kernel to crash.

I added print statements inside tb_invalidate_phys_page() and this
function seems never to be called at all. It's my understanding that
at least the kernel writing to its text segment ought to cause
tb_invalidate_phys_page() to be called, but I'm not super-familiar
with this qemu code.

Richard Henderson - do you have any suggestions?

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages. http://libguestfs.org


2023-07-05 21:58:27

by Richard W.M. Jones

Subject: Re: qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.

On Wed, Jul 05, 2023 at 05:28:30PM +0100, Richard W.M. Jones wrote:
> On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
> > We have been having the same sort of problem
> > (https://bugzilla.redhat.com/show_bug.cgi?id=2216496). It's another
> > of those bugs that requires hundreds or thousands of boot iterations
> > before you can see it. There is a test in comment 27 but it requires
> > guestfish and some hacking to work. I'll try to come up with a
> > cleaner test later.
> >
> > We see stack traces like:
> >
> > [ 3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [ 3.082266] clocksource: Switched to clocksource acpi_pm
> > [ 3.090201] NET: Registered PF_INET protocol family
> > [ 3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
> > [ 3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.4.0-10173-ga901a3568fd2 #8
> > [ 3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> > [ 3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> > [ 3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> > [ 3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> > [ 3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> > [ 3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> > [ 3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> > [ 3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> > [ 3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> > [ 3.093098] FS: 0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> > [ 3.093098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> > [ 3.093098] PKRU: 55555554
> > [ 3.093098] Call Trace:
> > [ 3.093098] <TASK>
> > [ 3.093098] ? die+0x31/0x80
> > [ 3.093098] ? exc_int3+0x10e/0x120
> > [ 3.093098] ? asm_exc_int3+0x39/0x40
> > [ 3.093098] ? __mod_timer+0x1c3/0x370
> > [ 3.093098] ? __mod_timer+0x1c3/0x370
> > [ 3.093098] queue_delayed_work_on+0x23/0x30
> > [ 3.093098] neigh_table_init+0x1bb/0x2e0
> > [ 3.093098] arp_init+0x12/0x50
> > [ 3.093098] inet_init+0x15b/0x2f0
> > [ 3.093098] ? __pfx_inet_init+0x10/0x10
> > [ 3.093098] do_one_initcall+0x58/0x230
> > [ 3.093098] kernel_init_freeable+0x199/0x2d0
> > [ 3.093098] ? __pfx_kernel_init+0x10/0x10
> > [ 3.093098] kernel_init+0x15/0x1b0
> > [ 3.093098] ret_from_fork+0x2c/0x50
> > [ 3.093098] </TASK>
> > [ 3.093098] Modules linked in:
> > [ 3.093098] ---[ end trace 0000000000000000 ]---
> > [ 3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> > [ 3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> > [ 3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> > [ 3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> > [ 3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> > [ 3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> > [ 3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> > [ 3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> > [ 3.093098] FS: 0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> > [ 3.093098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> > [ 3.093098] PKRU: 55555554
> > [ 3.093098] Kernel panic - not syncing: Fatal exception in interrupt
> >
> > There are many variations, but the common pattern seems to be
> > <something in the clock or timer code> -> int3 exception
> >
> > It only happens under qemu TCG (software emulation).
> >
> > It goes away if we recompile qemu without MTTCG support.
> >
> > It only happens with -smp enabled; we are using qemu -smp 4.
> >
> > We are using qemu-system-x86_64 full system emulation on an x86_64
> > host (i.e. forcing KVM off).
> >
> > It happens with the latest upstream kernel and qemu, compiled from
> > source.
>
> I got a bit further on this one.
>
> The bug happens in the code that updates the static branch used for at
> least these two keys:
>
> static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
> DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);
>
> There are probably others (it seems to be a generic problem with how
> static branches are handled by TCG), but I only see the bug for those
> two.
>
> When the static branch is updated we end up calling
> arch/x86/kernel/alternative.c:text_poke_bp_batch(). It's best to read
> the description of that function to see where int3 is used:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/alternative.c#n2086
>
> I modified the qemu TCG int3 helper so that it dumps the code at
> %rip when the interrupt fires, and I can actually see the changes
> made by the above function happen: first the int3 is written, then
> the tail of the nop, then the int3 is overwritten with the first
> byte of the nop.
>
> Unfortunately the int3 still fires after the code has been completely
> rewritten to its final (ie nop) value.
>
> This seems to indicate to me that neither the self-write to the kernel
> text segment, nor sync_core (implemented by an "iret to self" trick),
> invalidates the qemu TCG translation block containing the old int3
> helper call. Thus we (qemu) never "see" the new nop; we keep emulating
> the int3, and that causes the kernel to crash.

The following paragraph is wrong:

> I added print statements inside tb_invalidate_phys_page() and this
> function seems never to be called at all. It's my understanding that
> at least the kernel writing to its text segment ought to cause
> tb_invalidate_phys_page() to be called, but I'm not super-familiar
> with this qemu code.

tb_invalidate_phys_range_fast() *is* called, and we end up calling
tb_invalidate_phys_page_range__locked ->
tb_phys_invalidate__locked ->
do_tb_phys_invalidate

Nevertheless the old TB (containing the call to the int3 helper) is
still called after the code has been replaced with a NOP.

Of course there are 4 MTTCG threads so maybe another thread is in the
middle of executing the same TB when it gets invalidated.
tb_invalidate_phys_page_range__locked goes to some effort to check if
the current TB is being invalidated and restart the TB, but as far as
I can see the test can only work for the current core, and won't
restart the TB on other cores.

Is there any way to get the current TranslationBlock* from a helper?
It would be useful for additional debugging, but I couldn't see how
to do it.

The next thing I'll look at is the kernel sync_core() function and
whether TCG does the right thing, whatever that would be.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit


2023-07-06 07:10:45

by Richard Henderson

Subject: Re: qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.

On 7/5/23 22:50, Richard W.M. Jones wrote:
> tb_invalidate_phys_range_fast() *is* called, and we end up calling
> tb_invalidate_phys_page_range__locked ->
> tb_phys_invalidate__locked ->
> do_tb_phys_invalidate
>
> Nevertheless the old TB (containing the call to the int3 helper) is
> still called after the code has been replaced with a NOP.
>
> Of course there are 4 MTTCG threads so maybe another thread is in the
> middle of executing the same TB when it gets invalidated.

Yes.

> tb_invalidate_phys_page_range__locked goes to some effort to check if
> the current TB is being invalidated and restart the TB, but as far as
> I can see the test can only work for the current core, and won't
> restart the TB on other cores.

Yes.

The assumption with any of these sorts of races is that it is "as if" the other thread has
already passed the location of the write within that block. But by the time this thread
has finished do_tb_phys_invalidate, any other thread cannot execute the same block *again*.

There's a race here, and now that I think about it, there's been mail about it in the past:

https://lore.kernel.org/qemu-devel/[email protected]/

We take care of the same race for user-only in translator_access, by ensuring that each
translated page is read-only *before* performing the read for translation. But for system
mode we grab the page locks *after* the reads. Which means there's a race.

The email above describes the race pretty clearly, with a new TB being generated before
the write is even complete.

It'll be non-trivial fixing this, because not only do we need to grab the lock earlier,
there are ordering issues for a TB that spans two pages, in that one must grab the two
locks in the correct order lest we deadlock.
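
The ordering part at least has a standard shape: take the two page
locks in a fixed global order, e.g. lowest address first (sketch, not
a patch; IIRC qemu's existing page_lock_pair() already orders the
locks we take today, the missing piece is taking them before the
reads):

/* lock a TB's two pages in address order so that two threads locking
   the same pair cannot deadlock */
static void lock_page_pair_sketch(PageDesc *p1, PageDesc *p2)
{
    if (p2 == NULL || p1 == p2) {
        page_lock(p1);
    } else if (p1 < p2) {
        page_lock(p1);
        page_lock(p2);
    } else {
        page_lock(p2);
        page_lock(p1);
    }
}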


r~

2023-07-06 10:51:39

by Richard W.M. Jones

Subject: Re: qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.

On Thu, Jul 06, 2023 at 07:30:50AM +0100, Richard Henderson wrote:
> On 7/5/23 22:50, Richard W.M. Jones wrote:
> >tb_invalidate_phys_range_fast() *is* called, and we end up calling
> > tb_invalidate_phys_page_range__locked ->
> > tb_phys_invalidate__locked ->
> > do_tb_phys_invalidate
> >
> >Nevertheless the old TB (containing the call to the int3 helper) is
> >still called after the code has been replaced with a NOP.
> >
> >Of course there are 4 MTTCG threads so maybe another thread is in the
> >middle of executing the same TB when it gets invalidated.
>
> Yes.
>
> >tb_invalidate_phys_page_range__locked goes to some effort to check if
> >the current TB is being invalidated and restart the TB, but as far as
> >I can see the test can only work for the current core, and won't
> >restart the TB on other cores.
>
> Yes.
>
> The assumption with any of these sorts of races is that it is "as
> if" the other thread has already passed the location of the write
> within that block. But by the time this thread has finished
> do_tb_phys_invalidate, any other thread cannot execute the same
> block *again*.
>
> There's a race here, and now that I think about it, there's been mail about it in the past:
>
> https://lore.kernel.org/qemu-devel/[email protected]/
>
> We take care of the same race for user-only in translator_access, by
> ensuring that each translated page is read-only *before* performing
> the read for translation. But for system mode we grab the page
> locks *after* the reads. Which means there's a race.
>
> The email above describes the race pretty clearly, with a new TB
> being generated before the write is even complete.
>
> It'll be non-trivial fixing this, because not only do we need to
> grab the lock earlier, there are ordering issues for a TB that spans
> two pages, in that one must grab the two locks in the correct order
> lest we deadlock.

Yes I can see how this is hard to fix. Even if we just lock the page
containing the first instruction (which we know) before doing
translation, we still have a problem when entering tb_link_page()
where we would need to only lock the second page, which might cause
ordering issues.

How about a new per-page lock, which would be grabbed by
do_tb_phys_invalidate() and tb_gen_code(), just on the first
instruction? It would mean, I think, that no page can be having TBs
invalidated and generated at the same time.

Or something like scanning the bytes as they are being translated,
generating a secure-ish checksum, then rechecking it after translation
and discarding the TB if the code changed.
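
In code, something like (fragment only, hypothetical names):

/* translate-then-verify: checksum the guest bytes as they are read
   during translation, re-read and compare afterwards, and drop the
   TB if the code changed under us */
retry:
    tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
    if (!tb_code_bytes_unchanged(cpu, tb)) {    /* made-up helper */
        tb_phys_invalidate(tb, -1);
        goto retry;
    }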

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW


2023-08-08 20:23:56

by John Stultz

Subject: Re: qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.

On Thu, Jul 6, 2023 at 3:28 AM Richard W.M. Jones <[email protected]> wrote:
>
> On Thu, Jul 06, 2023 at 07:30:50AM +0100, Richard Henderson wrote:
> > On 7/5/23 22:50, Richard W.M. Jones wrote:
> > >tb_invalidate_phys_range_fast() *is* called, and we end up calling
> > > tb_invalidate_phys_page_range__locked ->
> > > tb_phys_invalidate__locked ->
> > > do_tb_phys_invalidate
> > >
> > >Nevertheless the old TB (containing the call to the int3 helper) is
> > >still called after the code has been replaced with a NOP.
> > >
> > >Of course there are 4 MTTCG threads so maybe another thread is in the
> > >middle of executing the same TB when it gets invalidated.
> >
> > Yes.
> >
> > >tb_invalidate_phys_page_range__locked goes to some effort to check if
> > >the current TB is being invalidated and restart the TB, but as far as
> > >I can see the test can only work for the current core, and won't
> > >restart the TB on other cores.
> >
> > Yes.
> >
> > The assumption with any of these sorts of races is that it is "as
> > if" the other thread has already passed the location of the write
> > within that block. But by the time this thread has finished
> > do_tb_phys_invalidate, any other thread cannot execute the same
> > block *again*.
> >
> > There's a race here, and now that I think about it, there's been mail about it in the past:
> >
> > https://lore.kernel.org/qemu-devel/[email protected]/
> >
> > We take care of the same race for user-only in translator_access, by
> > ensuring that each translated page is read-only *before* performing
> > the read for translation. But for system mode we grab the page
> > locks *after* the reads. Which means there's a race.
> >
> > The email above describes the race pretty clearly, with a new TB
> > being generated before the write is even complete.
> >
> > It'll be non-trivial fixing this, because not only do we need to
> > grab the lock earlier, there are ordering issues for a TB that spans
> > two pages, in that one must grab the two locks in the correct order
> > lest we deadlock.
>
> Yes I can see how this is hard to fix. Even if we just lock the page
> containing the first instruction (which we know) before doing
> translation, we still have a problem when entering tb_link_page()
> where we would need to only lock the second page, which might cause
> ordering issues.
>
> How about a new per-page lock, which would be grabbed by
> do_tb_phys_invalidate() and tb_gen_code(), just on the first
> instruction? It would mean, I think, that no page can be having TBs
> invalidated and generated at the same time.
>
> Or something like scanning the bytes as they are being translated,
> generating a secure-ish checksum, then rechecking it after translation
> and discarding the TB if the code changed.

Hey all,
So I've occasionally tripped over something similar in my stress
testing of the proxy-execution patch series.

This is only triggered by booting w/ qemu (7.2.2) without kvm. It's
rare but seems to trip in bursts.

The errors usually look something like:

[ 21.264283] clocksource: Switched to clocksource hpet
[ 21.273213] hrtimer: interrupt took 2400410 ns
[ 21.369331] VFS: Disk quotas dquot_6.6.0
[ 21.375425] int3: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
[ 21.375453] CPU: 12 PID: 0 Comm: swapper/12 Not tainted
6.4.0-rc5-00032-g8d3f70560882-dirty #1712
[ 21.375453] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 21.375453] RIP: 0010:hrtimer_start_range_ns+0x1ab/0x3d0
[ 21.375453] Code: 0f 85 70 ff ff ff 44 8b 7c 24 18 4c 63 65 08 4c
89 f3 e8 c8 86 0c 01 89 c0 41 83 e7 02 48 03 1c c5 40 0c ac b1 48 89
1c 24 0f <1f> 44 00 00 45 85 ff 0f 84 e3 01 00 00 4d 8d 7c 24 01 49 c1
e4 07
[ 21.375453] RSP: 0018:ffffbf4b000f7e60 EFLAGS: 00000086
[ 21.375453] RAX: 000000000000000c RBX: ffff984ff85239c0 RCX: 00000004ef14ffc0
[ 21.375453] RDX: 7fffffffffffffff RSI: ffffffffb1973381 RDI: ffffffffb1976d48
[ 21.375453] RBP: ffff984ff8523a40 R08: 0000000000000001 R09: 0000000000000000
[ 21.375453] R10: 00000000e8003900 R11: 0000000005673366 R12: 0000000000000000
[ 21.375453] R13: ffff984ff85242f0 R14: 00000000000239c0 R15: 0000000000000002
[ 21.375453] FS: 0000000000000000(0000) GS:ffff984ff8500000(0000)
knlGS:0000000000000000
[ 21.375453] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 21.375453] CR2: 0000000000000000 CR3: 0000000045e62000 CR4: 00000000000006e0
[ 21.375453] Call Trace:
[ 21.375453] <TASK>
[ 21.375453] ? die+0x2d/0x80
[ 21.375453] ? exc_int3+0xf3/0x100
[ 21.375453] ? asm_exc_int3+0x35/0x40
[ 21.375453] ? hrtimer_start_range_ns+0x1ab/0x3d0
[ 21.375453] ? hrtimer_start_range_ns+0x1ab/0x3d0
[ 21.375453] tick_nohz_restart+0x72/0x90
[ 21.375453] tick_nohz_idle_exit+0xc7/0xf0
[ 21.375453] do_idle+0x160/0x220
[ 21.375453] cpu_startup_entry+0x14/0x20
[ 21.375453] start_secondary+0xf5/0x100
[ 21.375453] secondary_startup_64_no_verify+0x10b/0x10b
[ 21.375453] </TASK>


Curious if anyone has found a workaround?

thanks
-john