Dear RT folks!
I'm pleased to announce the v6.6-rc4-rt7 patch set.
Changes since v6.6-rc4-rt6:
- Lockdep complained about SRCU locking if printk was used from NMI.
- The i915 driver could create a locking related splat during suspend/
resume. Reported by John B. Wyatt IV.
- A new version of printk. This is an update to the current
development version. Large patch series provided by John Ogness.
Known issues
None
The delta patch against v6.6-rc4-rt6 is not appended below due to its
size. It and can be found here:
https://cdn.kernel.org/pub/linux/kernel/projects/rt/6.6/incr/patch-6.6-rc4-rt6-rt7.patch.xz
and as web based diff:
https://git.kernel.org/rt/linux-rt-devel/d/v6.6-rc4-rt7/v6.6-rc4-rt6
You can get this release via the git tree at:
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git v6.6-rc4-rt7
The RT patch against v6.6-rc4 can be found here:
https://cdn.kernel.org/pub/linux/kernel/projects/rt/6.6/older/patch-6.6-rc4-rt7.patch.xz
The split quilt queue is available at:
https://cdn.kernel.org/pub/linux/kernel/projects/rt/6.6/older/patches-6.6-rc4-rt7.tar.xz
Sebastian
Hello Sebastian,
The following happened once [1]. In pl011_console_write() from:
commit da042bbb7a3f ("printk: Update the printk series.")
&uap->port seems to be released, but various paths don't seem
to acquire the lock,
Regards,
Pierre
[1]
[...] =====================================
[...] WARNING: bad unlock balance detected!
[...] 6.6.0-rc4-rt7-00098-g400db4a2dd90 #1157 Not tainted
[...] -------------------------------------
[...] pr/legacy/18 is trying to release lock (&port_lock_key) at:
[...] pl011_console_write ([path]/drivers/tty/serial/amba-pl011.c:2366)
[...] but there are no more locks to release!
[...]
[...] other info that might help us debug this:
[...] 2 locks held by pr/legacy/18:
[...] #0: ffff8000846bc0b8 (console_lock){+.+.}-{0:0}, at: nbcon_legacy_kthread_func ([path]/kernel/printk/printk.c:3052 [path]/kernel/printk/printk.c:3364)
[...] #1: ffff8000846bc030 (console_srcu){....}-{0:0}, at: console_flush_all ([path]/./include/linux/srcu.h:230 [path]/kernel/printk/printk.c:286 [path]/kernel/printk/printk.c:2971)
[...]
[...] stack backtrace:
[...] CPU: 1 PID: 18 Comm: pr/legacy Not tainted 6.6.0-rc4-rt7-00098-g400db4a2dd90 #1157
[...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Sep 5 2023
[...] Call trace:
[...] dump_backtrace ([path]/arch/arm64/kernel/stacktrace.c:235)
[...] show_stack ([path]/arch/arm64/kernel/stacktrace.c:242)
[...] dump_stack_lvl ([path]/lib/dump_stack.c:107)
[...] dump_stack ([path]/lib/dump_stack.c:114)
[...] print_unlock_imbalance_bug ([path]/kernel/locking/lockdep.c:5201)
[...] lock_release ([path]/kernel/locking/lockdep.c:5438 [path]/kernel/locking/lockdep.c:5780)
[...] rt_spin_unlock ([path]/kernel/locking/spinlock_rt.c:81)
[...] pl011_console_write ([path]/drivers/tty/serial/amba-pl011.c:2366)
[...] console_flush_all ([path]/kernel/printk/printk.c:2910 [path]/kernel/printk/printk.c:2992)
[...] nbcon_legacy_kthread_func ([path]/kernel/printk/printk.c:3053 [path]/kernel/printk/printk.c:3364)
[...] kthread ([path]/kernel/kthread.c:388)
[...] ret_from_fork ([path]/arch/arm64/kernel/entry.S:854)
On 10/4/23 18:06, Sebastian Andrzej Siewior wrote:
> Dear RT folks!
>
> I'm pleased to announce the v6.6-rc4-rt7 patch set.
>
> Changes since v6.6-rc4-rt6:
>
> - Lockdep complained about SRCU locking if printk was used from NMI.
>
> - The i915 driver could create a locking related splat during suspend/
> resume. Reported by John B. Wyatt IV.
>
> - A new version of printk. This is an update to the current
> development version. Large patch series provided by John Ogness.
>
> Known issues
> None
>
> The delta patch against v6.6-rc4-rt6 is not appended below due to its
> size. It and can be found here:
>
> https://cdn.kernel.org/pub/linux/kernel/projects/rt/6.6/incr/patch-6.6-rc4-rt6-rt7.patch.xz
>
> and as web based diff:
>
> https://git.kernel.org/rt/linux-rt-devel/d/v6.6-rc4-rt7/v6.6-rc4-rt6
>
> You can get this release via the git tree at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git v6.6-rc4-rt7
>
> The RT patch against v6.6-rc4 can be found here:
>
> https://cdn.kernel.org/pub/linux/kernel/projects/rt/6.6/older/patch-6.6-rc4-rt7.patch.xz
>
> The split quilt queue is available at:
>
> https://cdn.kernel.org/pub/linux/kernel/projects/rt/6.6/older/patches-6.6-rc4-rt7.tar.xz
>
> Sebastian
>
On 2023-10-05 15:38:06 [+0200], Pierre Gondois wrote:
> Hello Sebastian,
Hi Pierre,
> The following happened once [1]. In pl011_console_write() from:
> commit da042bbb7a3f ("printk: Update the printk series.")
> &uap->port seems to be released, but various paths don't seem
> to acquire the lock,
Thank you for the report.
The following cures it:
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 96da994e5af06..c6c2d3e46a8ec 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -2335,7 +2335,7 @@ pl011_console_write(struct console *co, const char *s, unsigned int count)
if (uap->port.sysrq || oops_in_progress)
locked = uart_port_trylock_irqsave(&uap->port, &flags);
else
- uart_port_trylock_irqsave(&uap->port, &flags);
+ uart_port_lock_irqsave(&uap->port, &flags);
/*
* First save the CR then disable the interrupts
> Regards,
> Pierre
Sebastian
Hello,
I am seeing various splats (see below) while running a PREEMPT_RT
kernel on an arm64 Juno board. The branch used is based on v6.6-rc4-rt8.
The following command is run, and a splat appear around ~1/2 minutes later:
rtla timerlat hist
but if waiting long enough, a splat can appear without any command run.
(The issue is not related to the rtla tool, this is just to share the
method to trigger the issue.)
The issue seems to be related to this patchset:
https://lore.kernel.org/all/[email protected]/
but as I was unable to really diagnose the issue, it might aswell be
something else.
It seems the memory/registers get corrupted, cf the recurring error
messages:
- Undefined instruction
- Unable to handle kernel paging request at virtual address
- Mem abort info
Splats seem to happen while taking IRQs while going out of idle or
when handling a syscall. More splat variations could be generated,
but 5 should be enough I believe.
When running a non-PREEMPT_RT kernel, splats don't appear, so the issue
might be related to the way locks are handled in PREEMPT_RT. I don't
deeply understand the relation between rcu/irq/tracing so far, if someone
has an idea of what could happen, this would be helpful :)
Regards,
Pierre
Splats:
[splat-1]
[...] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT_RT SMP
[...] Modules linked in:
[...] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.6.0-rc4-rt8-00102-g97b0e2d47443 #1193
[...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct 4 2023
[...] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[...] pc : trace_irq_noise (./include/trace/events/osnoise.h:63)
[...] lr : trace_irqexit_callback (kernel/trace/trace_osnoise.c:756 kernel/trace/trace_osnoise.c:952 kernel/trace/trace_osnoise.c:974)
[...] sp : ffff800084efbee0
[...] x29: ffff800084efbee0 x28: ffff000800828000 x27: 0000000000000001
[...] x26: 0000000000000000 x25: ffff000805764898 x24: ffff800083cee000
[...] x23: ffff800083c76ab0 x22: ffff8000821a415c x21: 0000000000000438
[...] x20: 0000000000000002 x19: ffff00097ef09a18 x18: 071c71c71c71c71c
[...] x17: ffff800084d74000 x16: ffff8000843b4bf0 x15: 0000000000000000
[...] x14: 0000000000000000 x13: 0000000000000006 x12: 00000004c2bb1498
[...] x11: 0000000000000015 x10: 0000000002800000 x9 : 0000000000082b49
[...] x8 : 0000005f37319994 x7 : ffff80008018e1a8 x6 : 0000000000000000
[...] x5 : 0000000000000001 x4 : 0000000000000000 x3 : 0000000000000438
[...] x2 : 0000005f37319944 x1 : ffff8000821a415c x0 : 0000000000000002
[...] Call trace:
[...] trace_irq_noise (./include/trace/events/osnoise.h:63)
[...] __traceiter_irq_handler_exit (./include/trace/events/irq.h:83)
[...] trace_irq_handler_exit (./arch/arm64/include/asm/preempt.h:62 ./include/trace/events/irq.h:83)
[...] handle_percpu_devid_irq (kernel/irq/chip.c:955)
[...] generic_handle_domain_irq (kernel/irq/irqdesc.c:? kernel/irq/irqdesc.c:728)
[...] gic_handle_irq (./arch/arm64/include/asm/io.h:75 ./include/asm-generic/io.h:335 drivers/irqchip/irq-gic.c:344)
[...] call_on_irq_stack (arch/arm64/kernel/entry.S:888)
[...] el1_interrupt (arch/arm64/kernel/entry-common.c:? arch/arm64/kernel/entry-common.c:502 arch/arm64/kernel/entry-common.c:517)
[...] el1h_64_irq_handler (arch/arm64/kernel/entry-common.c:523)
[...] el1h_64_irq (arch/arm64/kernel/entry.S:587)
[...] cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
[...] cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
[...] cpuidle_idle_call (kernel/sched/idle.c:134 kernel/sched/idle.c:215)
[...] do_idle (kernel/sched/idle.c:?)
[...] cpu_startup_entry (kernel/sched/idle.c:379)
[...] secondary_start_kernel (./arch/arm64/include/asm/atomic_ll_sc.h:95 ./arch/arm64/include/asm/atomic.h:28 ./include/linux/atomic/atomic-arch-fallback.h:533 ./include/linux/atomic/atomic-arch-fallback.h:981 ./include/linux/atomic/atomic-instrumented.h:436 ./include/linux/sched/mm.h:36 arch/arm64/kernel/smp.c:206)
[...] __secondary_switched (arch/arm64/kernel/head.S:682)
[...] Code: a8c47bfd d50323bf d65f03c0 d503201f (d503233f)
All code
========
0: a8c47bfd ldp x29, x30, [sp], #64
4: d50323bf autiasp
8: d65f03c0 ret
c: d503201f nop
10:* d503233f paciasp <-- trapping instruction
Code starting with the faulting instruction
===========================================
0: d503233f paciasp
[...] ---[ end trace 0000000000000000 ]---
[...] Kernel panic - not syncing: Oops - Undefined instruction: Fatal exception in interrupt
[...] SMP: stopping secondary CPUs
[...] Kernel Offset: disabled
[...] CPU features: 0x8000020c,2c020000,0000421b
[...] Memory Limit: none
[splat-2]
[...] Unable to handle kernel paging request at virtual address 001c71c71c71d434
[...] Mem abort info:
[...] ESR = 0x0000000096000044
[...] EC = 0x25: DABT (current EL), IL = 32 bits
[...] SET = 0, FnV = 0
[...] EA = 0, S1PTW = 0
[...] FSC = 0x04: level 0 translation fault
[...] Data abort info:
[...] ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
[...] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[...] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[...] [001c71c71c71d434] address between user and kernel address ranges
[...] Internal error: Oops: 0000000096000044 [#1] PREEMPT_RT SMP
[...] Modules linked in:
[...] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.6.0-rc4-rt8-00102-g97b0e2d47443 #1193
[...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct 4 2023
[...] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[...] pc : ttwu_do_activate (kernel/sched/core.c:3855)
[...] lr : ttwu_do_activate (kernel/sched/sched.h:1363 kernel/sched/sched.h:1507 kernel/sched/core.c:3846)
[...] sp : ffff80008000bc80
[...] x29: ffff80008000bc80 x28: 0000000000000001 x27: ffff8000801e3b88
[...] x26: ffff8008fb265000 x25: ffff800083c79bc0 x24: ffff800083cee000
[...] x23: 0000000000000009 x22: ffff00097eedebd8 x21: ffff000800840000
[...] x20: ffff80008000bcd8 x19: ffff00097eedebc0 x18: 071c71c71c71c71c
[...] x17: ffff800084d74000 x16: ffff8000843b4bf0 x15: 00000000a0987f40
[...] x14: 000000006a667d56 x13: 0000000098508a0a x12: 000000001c2fd96a
[...] x11: 00000000000f4240 x10: 00000000000d1be7 x9 : 000000010001b9a6
[...] x8 : 00000000000645ab x7 : ffff800080149d58 x6 : 0000000000000000
[...] x5 : 0000000000000001 x4 : 0000000000000001 x3 : 0000000000000000
[...] x2 : d123e8dc2efa2772 x1 : ffff00097eedebd8 x0 : 0000000000000001
[...] Call trace:
[...] ttwu_do_activate (kernel/sched/core.c:3855)
[...] try_to_wake_up (kernel/sched/sched.h:1622 kernel/sched/sched.h:1705 kernel/sched/core.c:4072 kernel/sched/core.c:4388)
[...] wake_up_process (kernel/sched/core.c:4520)
[...] invoke_rcu_core_kthread (./arch/arm64/include/asm/alternative-macros.h:250 ./arch/arm64/include/asm/irqflags.h:27 ./arch/arm64/include/asm/irqflags.h:121 kernel/rcu/tree.c:2443)
[...] rcu_sched_clock_irq (kernel/rcu/tree.c:2242)
[...] update_process_times (./arch/arm64/include/asm/preempt.h:13 kernel/time/timer.c:2080)
[...] tick_sched_timer (kernel/time/tick-sched.c:255 kernel/time/tick-sched.c:1492)
[...] __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752)
[...] hrtimer_interrupt (kernel/time/hrtimer.c:605 kernel/time/hrtimer.c:1817)
[...] arch_timer_handler_phys (drivers/clocksource/arm_arch_timer.c:? drivers/clocksource/arm_arch_timer.c:692)
[...] handle_percpu_devid_irq (kernel/irq/chip.c:942)
[...] generic_handle_domain_irq (kernel/irq/irqdesc.c:? kernel/irq/irqdesc.c:728)
[...] gic_handle_irq (./arch/arm64/include/asm/io.h:75 ./include/asm-generic/io.h:335 drivers/irqchip/irq-gic.c:344)
[...] call_on_irq_stack (arch/arm64/kernel/entry.S:888)
[...] el1_interrupt (arch/arm64/kernel/entry-common.c:? arch/arm64/kernel/entry-common.c:502 arch/arm64/kernel/entry-common.c:517)
[...] el1h_64_irq_handler (arch/arm64/kernel/entry-common.c:523)
[...] el1h_64_irq (arch/arm64/kernel/entry.S:587)
[...] cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
[...] cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
[...] cpuidle_idle_call (kernel/sched/idle.c:134 kernel/sched/idle.c:215)
[...] do_idle (kernel/sched/idle.c:?)
[...] cpu_startup_entry (kernel/sched/idle.c:379)
[...] secondary_start_kernel (./arch/arm64/include/asm/atomic_ll_sc.h:95 ./arch/arm64/include/asm/atomic.h:28 ./include/linux/atomic/atomic-arch-fallback.h:533 ./include/linux/atomic/atomic-arch-fallback.h:981 ./include/linux/atomic/atomic-instrumented.h:436 ./include/linux/sched/mm.h:36 arch/arm64/kernel/smp.c:206)
[...] __secondary_switched (arch/arm64/kernel/head.S:682)
[...] Code: f001dd89 f9068668 f9418129 d341fd08 (f9068e68)
All code
========
0: f001dd89 adrp x9, 0x3bb3000
4: f9068668 str x8, [x19, #3336]
8: f9418129 ldr x9, [x9, #768]
c: d341fd08 lsr x8, x8, #1
10:* f9068e68 str x8, [x19, #3352] <-- trapping instruction
Code starting with the faulting instruction
===========================================
0: f9068e68 str x8, [x19, #3352]
[...] ---[ end trace 0000000000000000 ]---
[...] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[...] SMP: stopping secondary CPUs
[...] Kernel Offset: disabled
[...] CPU features: 0x8000020c,2c020000,0000421b
[...] Memory Limit: none
[splat-3]
[...] Mem abort info:
[...] ESR = 0x0000000096000045
[...] EC = 0x25: DABT (current EL), IL = 32 bits
[...] SET = 0, FnV = 0
[...] EA = 0, S1PTW = 0
[...] FSC = 0x05: level 1 translation fault
[...] Data abort info:
[...] ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000
[...] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[...] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[...] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000ed344000
[...] [ffff800886adb8b8] pgd=10000009fffff003, p4d=10000009fffff003, pud=0000000000000000
[...] Internal error: Oops: 0000000096000045 [#1] PREEMPT_RT SMP
[...] Modules linked in:
[...] CPU: 1 PID: 264 Comm: rtla-static Not tainted 6.6.0-rc4-rt8-00102-g97b0e2d47443 #1193
[...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct 4 2023
[...] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[...] pc : rcu_is_watching (kernel/rcu/tree.c:695)
[...] lr : trace_irq_disable (./include/trace/events/preemptirq.h:36)
[...] sp : ffff800086adb8d0
[...] x29: ffff800086adb8e0 x28: ffff0008062a4ec0 x27: 0000000000000030
[...] x26: ffff000800d95560 x25: 0000000000000010 x24: ffff000800442468
[...] x23: 0000000000000000 x22: 0000000000000000 x21: ffff800083c719f0
[...] x20: ffff0008062a4ec0 x19: ffff8000815300c8 x18: 0000000000000000
[...] x17: ffff800084d74000 x16: ffff8000843b4bf0 x15: 00000000ffd5937b
[...] x14: 00000000be9552e3 x13: 0000000045cfb2f8 x12: ffff800086ad8000
[...] x11: 128b911a6b1c6000 x10: ffff800083cece88 x9 : 000000000000003f
[...] x8 : 000000000000001f x7 : ffff8000807e8190 x6 : 0000000000000000
[...] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff800086adb960
[...] x2 : ffff0008062a4ec0 x1 : ffff800081529334 x0 : ffff8000815300c8
[...] Call trace:
[...] rcu_is_watching (kernel/rcu/tree.c:695)
[...] trace_hardirqs_off (kernel/trace/trace_preemptirq.c:94)
[...] _raw_spin_lock_irqsave (./arch/arm64/include/asm/current.h:19 ./arch/arm64/include/asm/preempt.h:47 ./include/linux/spinlock_api_smp.h:109 kernel/locking/spinlock.c:162)
[...] rt_spin_lock (kernel/locking/rtmutex.c:1859 kernel/locking/spinlock_rt.c:43 kernel/locking/spinlock_rt.c:49 kernel/locking/spinlock_rt.c:57)
[...] lockref_get (lib/lockref.c:51)
[...] traverse_mounts (fs/namei.c:1402 fs/namei.c:1443)
[...] step_into (fs/namei.c:1547 fs/namei.c:1839)
[...] walk_component (fs/namei.c:?)
[...] link_path_walk (fs/namei.c:2330)
[...] path_lookupat (fs/namei.c:2481)
[...] filename_lookup (fs/namei.c:2511)
[...] vfs_statx (fs/stat.c:240)
[...] vfs_fstatat (fs/stat.c:295)
[...] __arm64_sys_newfstatat (fs/stat.c:460 fs/stat.c:453 fs/stat.c:453)
[...] invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:51)
[...] el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:144)
[...] do_el0_svc (arch/arm64/kernel/syscall.c:156)
[...] el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:679)
[...] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?)
[...] el0t_64_sync (arch/arm64/kernel/entry.S:591)
[...] Code: 54fffb61 17ffffe4 d503201f d503233f (a9be7bfd)
All code
========
0: 54fffb61 b.ne 0xffffffffffffff6c // b.any
4: 17ffffe4 b 0xffffffffffffff94
8: d503201f nop
c: d503233f paciasp
10:* a9be7bfd stp x29, x30, [sp, #-32]! <-- trapping instruction
Code starting with the faulting instruction
===========================================
0: a9be7bfd stp x29, x30, [sp, #-32]!
[...] ---[ end trace 0000000000000000 ]---
[splat-4]
[...] Mem abort info:
[...] ESR = 0x0000000096000045
[...] EC = 0x25: DABT (current EL), IL = 32 bits
[...] SET = 0, FnV = 0
[...] EA = 0, S1PTW = 0
[...] FSC = 0x05: level 1 translation fault
[...] Data abort info:
[...] ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000
[...] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[...] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[...] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000ed344000
[...] [ffff800886adb8b8] pgd=10000009fffff003, p4d=10000009fffff003, pud=0000000000000000
[...] Internal error: Oops: 0000000096000045 [#1] PREEMPT_RT SMP
[...] Modules linked in:
[...] CPU: 1 PID: 264 Comm: rtla-static Not tainted 6.6.0-rc4-rt8-00102-g97b0e2d47443 #1193
[...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct 4 2023
[...] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[...] pc : rcu_is_watching (kernel/rcu/tree.c:695)
[...] lr : trace_irq_disable (./include/trace/events/preemptirq.h:36)
[...] sp : ffff800086adb8d0
[...] x29: ffff800086adb8e0 x28: ffff0008062a4ec0 x27: 0000000000000030
[...] x26: ffff000800d95560 x25: 0000000000000010 x24: ffff000800442468
[...] x23: 0000000000000000 x22: 0000000000000000 x21: ffff800083c719f0
[...] x20: ffff0008062a4ec0 x19: ffff8000815300c8 x18: 0000000000000000
[...] x17: ffff800084d74000 x16: ffff8000843b4bf0 x15: 00000000ffd5937b
[...] x14: 00000000be9552e3 x13: 0000000045cfb2f8 x12: ffff800086ad8000
[...] x11: 128b911a6b1c6000 x10: ffff800083cece88 x9 : 000000000000003f
[...] x8 : 000000000000001f x7 : ffff8000807e8190 x6 : 0000000000000000
[...] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff800086adb960
[...] x2 : ffff0008062a4ec0 x1 : ffff800081529334 x0 : ffff8000815300c8
[...] Call trace:
[...] rcu_is_watching (kernel/rcu/tree.c:695)
[...] trace_hardirqs_off (kernel/trace/trace_preemptirq.c:94)
[...] _raw_spin_lock_irqsave (./arch/arm64/include/asm/current.h:19 ./arch/arm64/include/asm/preempt.h:47 ./include/linux/spinlock_api_smp.h:109 kernel/locking/spinlock.c:162)
[...] rt_spin_lock (kernel/locking/rtmutex.c:1859 kernel/locking/spinlock_rt.c:43 kernel/locking/spinlock_rt.c:49 kernel/locking/spinlock_rt.c:57)
[...] lockref_get (lib/lockref.c:51)
[...] traverse_mounts (fs/namei.c:1402 fs/namei.c:1443)
[...] step_into (fs/namei.c:1547 fs/namei.c:1839)
[...] walk_component (fs/namei.c:?)
[...] link_path_walk (fs/namei.c:2330)
[...] path_lookupat (fs/namei.c:2481)
[...] filename_lookup (fs/namei.c:2511)
[...] vfs_statx (fs/stat.c:240)
[...] vfs_fstatat (fs/stat.c:295)
[...] __arm64_sys_newfstatat (fs/stat.c:460 fs/stat.c:453 fs/stat.c:453)
[...] invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:51)
[...] el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:144)
[...] do_el0_svc (arch/arm64/kernel/syscall.c:156)
[...] el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:679)
[...] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?)
[...] el0t_64_sync (arch/arm64/kernel/entry.S:591)
[...] Code: 54fffb61 17ffffe4 d503201f d503233f (a9be7bfd)
All code
========
0: 54fffb61 b.ne 0xffffffffffffff6c // b.any
4: 17ffffe4 b 0xffffffffffffff94
8: d503201f nop
c: d503233f paciasp
10:* a9be7bfd stp x29, x30, [sp, #-32]! <-- trapping instruction
Code starting with the faulting instruction
===========================================
0: a9be7bfd stp x29, x30, [sp, #-32]!
[...] ---[ end trace 0000000000000000 ]---
[splat-5]
[...] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT_RT SMP
[...] Modules linked in:
[...] CPU: 2 PID: 40 Comm: rcuc/2 Not tainted 6.6.0-rc4-rt8-00102-g97b0e2d47443 #1194
[...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct 4 2023
[...] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[...] pc : trace_pelt_irq_tp (./include/trace/events/sched.h:?)
[...] lr : irqtime_account_irq (kernel/sched/cputime.c:64)
[...] sp : ffff8000851d3ce0
[...] x29: ffff8000851d3ce0 x28: 0000000000000020 x27: ffff800083ce4e80
[...] x26: ffff800083d46180 x25: 000000000000000a x24: 0000000000000000
[...] x23: 0000000000000007 x22: 0000000000000000 x21: ffff00097eeebf50
[...] x20: 0000000000002a08 x19: ffff00080092b480 x18: ffff8000850fd038
[...] x17: ffff800084e05000 x16: ffff800084445bf0 x15: 0000000008a87beb
[...] x14: 000000003bb0a251 x13: 0000000000000006 x12: 0000000934346b33
[...] x11: 0000000100000000 x10: 0000000000000001 x9 : 0000000014443054
[...] x8 : ffff00097eeebeb0 x7 : ffff80008012d608 x6 : 0000000000000000
[...] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff8000851d3d80
[...] x2 : ffff00080092b480 x1 : 0000000000000000 x0 : 000000b814aa1780
[...] Call trace:
[...] trace_pelt_irq_tp (./include/trace/events/sched.h:?)
[...] __do_softirq (./include/linux/vtime.h:? kernel/softirq.c:593)
[...] __local_bh_enable_ip (kernel/softirq.c:?)
[...] local_bh_enable (./include/linux/bottom_half.h:34)
[...] rcu_cpu_kthread (kernel/rcu/tree.c:2493)
[...] smpboot_thread_fn (kernel/smpboot.c:?)
[...] kthread (kernel/kthread.c:389)
[...] ret_from_fork (arch/arm64/kernel/entry.S:854)
[...] Code: 17ffffc2 d4210000 17ffffe4 d503201f (819e3608)
All code
========
0: 17ffffc2 b 0xffffffffffffff08
4: d4210000 brk #0x800
8: 17ffffe4 b 0xffffffffffffff98
c: d503201f nop
10:* 819e3608 .inst 0x819e3608 ; undefined <-- trapping instruction
Code starting with the faulting instruction
===========================================
0: 819e3608 .inst 0x819e3608 ; undefined
[...] ---[ end trace 0000000000000000 ]---
[...] Kernel panic - not syncing: Oops - Undefined instruction: Fatal exception in interrupt
[...] SMP: stopping secondary CPUs
[...] Kernel Offset: disabled
[...] CPU features: 0x8000020c,2c020000,0000421b
[...] Memory Limit: none
On 2023-10-10 11:23:59 [+0200], Pierre Gondois wrote:
> Hello,
Hi,
> The issue seems to be related to this patchset:
> https://lore.kernel.org/all/[email protected]/
> but as I was unable to really diagnose the issue, it might aswell be
> something else.
Is it easy to figure when this faulty behaviour was introduced?
> It seems the memory/registers get corrupted, cf the recurring error
> messages:
> - Undefined instruction
> - Unable to handle kernel paging request at virtual address
> - Mem abort info
> Splats seem to happen while taking IRQs while going out of idle or
> when handling a syscall. More splat variations could be generated,
> but 5 should be enough I believe.
>
> When running a non-PREEMPT_RT kernel, splats don't appear, so the issue
> might be related to the way locks are handled in PREEMPT_RT. I don't
> deeply understand the relation between rcu/irq/tracing so far, if someone
> has an idea of what could happen, this would be helpful :)
I've been looking at these splats. Do you have auth-pointer or shadow
stack enabled?
> Regards,
> Pierre
>
> Splats:
> [splat-1]
> [...] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT_RT SMP
…
> All code
> ========
> 0: a8c47bfd ldp x29, x30, [sp], #64
> 4: d50323bf autiasp
> 8: d65f03c0 ret
> c: d503201f nop
> 10:* d503233f paciasp <-- trapping instruction
paciasp is the undefined instruction. I don't see paciasp to be patched
out if not supported by the CPU so it is a NOP if not supported. What
could go wrong here?
> [splat-2]
> [...] Unable to handle kernel paging request at virtual address 001c71c71c71d434
…
> [...] pc : ttwu_do_activate (kernel/sched/core.c:3855)
> [...] lr : ttwu_do_activate (kernel/sched/sched.h:1363 kernel/sched/sched.h:1507 kernel/sched/core.c:3846)
…
> [...] x20: ffff80008000bcd8 x19: ffff00097eedebc0 x18: 071c71c71c71c71c
> [...] x8 : 00000000000645ab x7 : ffff800080149d58 x6 : 0000000000000000
…
> All code
> ========
> 0: f001dd89 adrp x9, 0x3bb3000
> 4: f9068668 str x8, [x19, #3336]
> 8: f9418129 ldr x9, [x9, #768]
> c: d341fd08 lsr x8, x8, #1
> 10:* f9068e68 str x8, [x19, #3352] <-- trapping instruction
So this reads as
store x8 to x19 + #3352, 0xffff00097eedebc0 + 3352
store 0x00000000000645ab to 0xffff00097eedf8d8
But the kernel complains about 001c71c71c71d434 which is not exactly
what I computed. But it is familiar to x18. Looking at those two
0x07 1c71c71c71 c71c
0x00 1c71c71c71 d434
the pattern in the middle is the same. And 0xd434 - 0xc71c = 0xd18 which
is 3352.
x18 is the shadow stack (?) and contains the same value in splat-1 and
is zero in splat-3 and splat-4.
> [splat-3]
> [...] Mem abort info:
> [...] ESR = 0x0000000096000045
> [...] EC = 0x25: DABT (current EL), IL = 32 bits
> [...] SET = 0, FnV = 0
> [...] EA = 0, S1PTW = 0
> [...] FSC = 0x05: level 1 translation fault
> [...] Data abort info:
> [...] ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000
> [...] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
> [...] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [...] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000ed344000
> [...] [ffff800886adb8b8] pgd=10000009fffff003, p4d=10000009fffff003, pud=0000000000000000
> [...] Internal error: Oops: 0000000096000045 [#1] PREEMPT_RT SMP
> [...] Modules linked in:
> [...] CPU: 1 PID: 264 Comm: rtla-static Not tainted 6.6.0-rc4-rt8-00102-g97b0e2d47443 #1193
> [...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct 4 2023
> [...] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [...] pc : rcu_is_watching (kernel/rcu/tree.c:695)
> [...] lr : trace_irq_disable (./include/trace/events/preemptirq.h:36)
> [...] sp : ffff800086adb8d0
> [...] x29: ffff800086adb8e0 x28: ffff0008062a4ec0 x27: 0000000000000030
> All code
> ========
> 0: 54fffb61 b.ne 0xffffffffffffff6c // b.any
> 4: 17ffffe4 b 0xffffffffffffff94
> 8: d503201f nop
> c: d503233f paciasp
> 10:* a9be7bfd stp x29, x30, [sp, #-32]! <-- trapping instruction
This looks like a stack entry. So this looks sane and if SP is correct
then nothing should go wrong. The fault says "translation fault" so my
guess would be that SP is not correct and we have to page tables backing
ffff800086adb8d0.
> [splat-4]
> [...] Mem abort info:
> [...] ESR = 0x0000000096000045
> [...] EC = 0x25: DABT (current EL), IL = 32 bits
> [...] SET = 0, FnV = 0
> [...] EA = 0, S1PTW = 0
> [...] FSC = 0x05: level 1 translation fault
> [...] Data abort info:
> [...] ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000
> [...] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
> [...] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
…
> [...] lr : trace_irq_disable (./include/trace/events/preemptirq.h:36)
> [...] sp : ffff800086adb8d0
> [...] x29: ffff800086adb8e0 x28: ffff0008062a4ec0 x27: 0000000000000030
…
> c: d503233f paciasp
> 10:* a9be7bfd stp x29, x30, [sp, #-32]! <-- trapping instruction
This seems to be same as splat-3 including register.
> [splat-5]
> [...] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT_RT SMP
> [...] Modules linked in:
> [...] CPU: 2 PID: 40 Comm: rcuc/2 Not tainted 6.6.0-rc4-rt8-00102-g97b0e2d47443 #1194
> [...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct 4 2023
> [...] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [...] pc : trace_pelt_irq_tp (./include/trace/events/sched.h:?)
> [...] lr : irqtime_account_irq (kernel/sched/cputime.c:64)
> [...] sp : ffff8000851d3ce0
> [...] x29: ffff8000851d3ce0 x28: 0000000000000020 x27: ffff800083ce4e80
> [...] x26: ffff800083d46180 x25: 000000000000000a x24: 0000000000000000
> [...] x23: 0000000000000007 x22: 0000000000000000 x21: ffff00097eeebf50
> [...] x20: 0000000000002a08 x19: ffff00080092b480 x18: ffff8000850fd038
> [...] x17: ffff800084e05000 x16: ffff800084445bf0 x15: 0000000008a87beb
> [...] x14: 000000003bb0a251 x13: 0000000000000006 x12: 0000000934346b33
> [...] x11: 0000000100000000 x10: 0000000000000001 x9 : 0000000014443054
> [...] x8 : ffff00097eeebeb0 x7 : ffff80008012d608 x6 : 0000000000000000
> [...] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff8000851d3d80
> [...] x2 : ffff00080092b480 x1 : 0000000000000000 x0 : 000000b814aa1780
> [...] Call trace:
> [...] trace_pelt_irq_tp (./include/trace/events/sched.h:?)
> [...] __do_softirq (./include/linux/vtime.h:? kernel/softirq.c:593)
> [...] __local_bh_enable_ip (kernel/softirq.c:?)
> [...] local_bh_enable (./include/linux/bottom_half.h:34)
> [...] rcu_cpu_kthread (kernel/rcu/tree.c:2493)
> [...] smpboot_thread_fn (kernel/smpboot.c:?)
> [...] kthread (kernel/kthread.c:389)
> [...] ret_from_fork (arch/arm64/kernel/entry.S:854)
> [...] Code: 17ffffc2 d4210000 17ffffe4 d503201f (819e3608)
> All code
> ========
> 0: 17ffffc2 b 0xffffffffffffff08
> 4: d4210000 brk #0x800
> 8: 17ffffe4 b 0xffffffffffffff98
> c: d503201f nop
> 10:* 819e3608 .inst 0x819e3608 ; undefined <-- trapping instruction
Knowing what PC is could help to figure out if this is really
trace_pelt_irq_tp. The brk opcode could be a warning since it jumps back
afterwards. But the trapping code a different/ wrong page that is mapped
here. Or it jumped too far.
I don't know why trace_pelt_irq_tp is visible here. This should be just
a nop which is patched at runtime to some underscroll function ;) I
might be missing something.
Sebastian