2023-10-17 08:04:51

by Naresh Kamboju

Subject: selftests: ftrace: Internal error: Oops: sve_save_state

The following kernel crash was noticed while running selftests: ftrace:
ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.

This is not an easy-to-reproduce issue, and it is not seen on mainline or
next. We are investigating this report.

Reported-by: Linux Kernel Functional Testing <[email protected]>
Reported-by: Naresh Kamboju <[email protected]>


Test log:
----------
kselftest: Running tests in ftrace
TAP version 13
1..1
# timeout set to 0
# selftests: ftrace: ftracetest-ktap
# TAP version 13
# 1..129
# ok 1 Basic trace file check
# ok 2 Basic test for tracers
# ok 3 Basic trace clock test
# ok 4 Basic event tracing check
# ok 5 Change the ringbuffer size
# ok 6 Snapshot and tracing setting
# ok 7 Snapshot and tracing_cpumask
# ok 8 trace_pipe and trace_marker
[ 471.689140]
[ 471.689264] **********************************************************
[ 471.689422] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 471.689574] ** **
[ 471.689716] ** trace_printk() being used. Allocating extra memory. **
[ 471.689878] ** **
[ 471.690031] ** This means that this is a DEBUG kernel and it is **
[ 471.690183] ** unsafe for production use. **
[ 471.690335] ** **
[ 471.690487] ** If you see this message and you are not debugging **
[ 471.690728] ** the kernel, report this immediately to your vendor! **
[ 471.690881] ** **
[ 471.691033] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 471.691185] **********************************************************
[ 543.243648] hrtimer: interrupt took 11937170 ns
[ 764.987161] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
[ 764.992518] Mem abort info:
[ 764.995330] ESR = 0x0000000096000044
[ 764.998562] EC = 0x25: DABT (current EL), IL = 32 bits
[ 765.002434] SET = 0, FnV = 0
[ 765.005361] EA = 0, S1PTW = 0
[ 765.008327] FSC = 0x04: level 0 translation fault
[ 765.012011] Data abort info:
[ 765.014858] ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
[ 765.018797] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[ 765.022562] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 765.026438] user pgtable: 4k pages, 48-bit VAs, pgdp=00000008848bd000
[ 765.030782] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 765.037045] Internal error: Oops: 0000000096000044 [#1] PREEMPT SMP
[ 765.038392] Modules linked in: ftrace_direct pl111_drm arm_spe_pmu
drm_dma_helper crct10dif_ce panel_simple drm_kms_helper fuse drm
dm_mod ip_tables x_tables [last unloaded: ftrace_direct]
[ 765.044892] CPU: 3 PID: 808 Comm: rmmod Not tainted 6.5.8-rc2 #1
[ 765.046192] Hardware name: FVP Base RevC (DT)
[ 765.047264] pstate: 234020c9 (nzCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 765.048693] pc : sve_save_state+0x4/0xf0
[ 765.049820] lr : fpsimd_save+0x1b8/0x218
[ 765.050933] sp : ffff800080a83ac0
[ 765.051871] x29: ffff800080a83ac0 x28: ffff000805257058 x27: 0000000000000001
[ 765.054108] x26: 0000000000000000 x25: ffffd7c64d332980 x24: 0000000000000000
[ 765.056341] x23: 0000000000000001 x22: ffff284232103000 x21: 0000000000000040
[ 765.058575] x20: ffff00087f7470b0 x19: ffffd7c64d6440b0 x18: 0000000000000000
[ 765.060811] x17: 0000000000000000 x16: ffff800080018000 x15: 0000000000000000
[ 765.063041] x14: 0000000000000000 x13: 0000000000000000 x12: 0000380a873b560e
[ 765.065277] x11: ffffd7c64e0ae390 x10: ffff800080a83b10 x9 : ffffd7c64b5b7710
[ 765.067516] x8 : ffff800080a839b8 x7 : 000000000000001e x6 : ffff00080000c200
[ 765.069752] x5 : ffffd7c64b78cc30 x4 : 0000000000000000 x3 : 0000000000000000
[ 765.071983] x2 : 0000000000000001 x1 : ffff000805257820 x0 : 0000000000000880
[ 765.074221] Call trace:
[ 765.075045] sve_save_state+0x4/0xf0
[ 765.076138] fpsimd_thread_switch+0x2c/0xe8
[ 765.077305] __switch_to+0x20/0x158
[ 765.078384] __schedule+0x2cc/0xb38
[ 765.079464] preempt_schedule_irq+0x44/0xa8
[ 765.080633] el1_interrupt+0x4c/0x68
[ 765.081691] el1h_64_irq_handler+0x18/0x28
[ 765.082829] el1h_64_irq+0x64/0x68
[ 765.083874] ftrace_return_to_handler+0x98/0x158
[ 765.085090] return_to_handler+0x20/0x48
[ 765.086205] do_sve_acc+0x64/0x128
[ 765.087272] el0_sve_acc+0x3c/0xa0
[ 765.088356] el0t_64_sync_handler+0x114/0x130
[ 765.089524] el0t_64_sync+0x190/0x198
[ 765.090712] Code: d51b4408 d65f03c0 d503201f d503245f (e5bb5800)
[ 765.092024] ---[ end trace 0000000000000000 ]---
[ 765.904294] pstore: backend (efi_pstore) writing error (-5)
[ 765.905531] note: rmmod[808] exited with irqs disabled

Links:
test log link:
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.5.y/build/v6.5.7-191-gbe4ec7370801/testrun/20672437/suite/log-parser-test/test/check-kernel-oops/log

Details of tests:
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2WrHwIIFdZpJWV5smWZVr2yJxl1

Build link:
- https://storage.tuxsuite.com/public/linaro/lkft/builds/2WrHvExYOOOZVoxlISTdAD1gMzO/


--
Linaro LKFT
https://lkft.linaro.org


2023-10-17 12:24:37

by Mark Brown

Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

On Tue, Oct 17, 2023 at 01:34:18PM +0530, Naresh Kamboju wrote:

> Following kernel crash noticed while running selftests: ftrace:
> ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.

> This is not an easy to reproduce issue and not seen on mainline and next.
> We are investigating this report.

To confirm, have you seen this on other stable branches as well, or is
this only v6.5? For how long have you been seeing this?

> [ 764.987161] Unable to handle kernel NULL pointer dereference at
> virtual address 0000000000000000

> [ 765.074221] Call trace:
> [ 765.075045] sve_save_state+0x4/0xf0
> [ 765.076138] fpsimd_thread_switch+0x2c/0xe8
> [ 765.077305] __switch_to+0x20/0x158
> [ 765.078384] __schedule+0x2cc/0xb38
> [ 765.079464] preempt_schedule_irq+0x44/0xa8
> [ 765.080633] el1_interrupt+0x4c/0x68
> [ 765.081691] el1h_64_irq_handler+0x18/0x28
> [ 765.082829] el1h_64_irq+0x64/0x68
> [ 765.083874] ftrace_return_to_handler+0x98/0x158
> [ 765.085090] return_to_handler+0x20/0x48
> [ 765.086205] do_sve_acc+0x64/0x128
> [ 765.087272] el0_sve_acc+0x3c/0xa0
> [ 765.088356] el0t_64_sync_handler+0x114/0x130
> [ 765.089524] el0t_64_sync+0x190/0x198

So something managed to get flagged as having SVE state without having
the backing storage allocated. We *were* preempted in the SVE access
handler, which does the allocation, but I can't see the path that would
trigger that, since we allocate the state before setting TIF_SVE. It's
possible the compiler did something funky; a decode of the backtrace
might help show that.



2023-10-17 13:27:37

by Steven Rostedt

Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

On Tue, 17 Oct 2023 13:34:18 +0530
Naresh Kamboju <[email protected]> wrote:

> Following kernel crash noticed while running selftests: ftrace:
> ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.
>
> This is not an easy to reproduce issue and not seen on mainline and next.
> We are investigating this report.
>
> Reported-by: Linux Kernel Functional Testing <[email protected]>
> Reported-by: Naresh Kamboju <[email protected]>
>
>
> Test log:
> ----------
> kselftest: Running tests in ftrace
> TAP version 13
> 1..1
> # timeout set to 0
> # selftests: ftrace: ftracetest-ktap
> # TAP version 13
> # 1..129
> # ok 1 Basic trace file check
> # ok 2 Basic test for tracers
> # ok 3 Basic trace clock test
> # ok 4 Basic event tracing check
> # ok 5 Change the ringbuffer size
> # ok 6 Snapshot and tracing setting
> # ok 7 Snapshot and tracing_cpumask
> # ok 8 trace_pipe and trace_marker

The sad part about the KTAP format is that it doesn't show you which test
crashed. I had to run it to see what was next. That was:

ok 9 Test ftrace direct functions against tracers

Adding Mark Rutland and Florent, as they worked on the direct trampolines
on Arm 64.

-- Steve



2023-10-17 13:42:40

by Mark Rutland

Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

On Tue, Oct 17, 2023 at 01:22:14PM +0100, Mark Brown wrote:
> On Tue, Oct 17, 2023 at 01:34:18PM +0530, Naresh Kamboju wrote:
>
> > Following kernel crash noticed while running selftests: ftrace:
> > ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.
>
> > This is not an easy to reproduce issue and not seen on mainline and next.
> > We are investigating this report.
>
> To confirm have you seen this on other stables as well or is this only
> v6.5? For how long have you been seeing this?
>
> > [ 764.987161] Unable to handle kernel NULL pointer dereference at
> > virtual address 0000000000000000
>
> > [ 765.074221] Call trace:
> > [ 765.075045] sve_save_state+0x4/0xf0
> > [ 765.076138] fpsimd_thread_switch+0x2c/0xe8
> > [ 765.077305] __switch_to+0x20/0x158
> > [ 765.078384] __schedule+0x2cc/0xb38
> > [ 765.079464] preempt_schedule_irq+0x44/0xa8
> > [ 765.080633] el1_interrupt+0x4c/0x68
> > [ 765.081691] el1h_64_irq_handler+0x18/0x28
> > [ 765.082829] el1h_64_irq+0x64/0x68
> > [ 765.083874] ftrace_return_to_handler+0x98/0x158
> > [ 765.085090] return_to_handler+0x20/0x48
> > [ 765.086205] do_sve_acc+0x64/0x128
> > [ 765.087272] el0_sve_acc+0x3c/0xa0
> > [ 765.088356] el0t_64_sync_handler+0x114/0x130
> > [ 765.089524] el0t_64_sync+0x190/0x198
>
> So something managed to get flagged as having SVE state without having
> the backing storage allocated. We *were* preempted in the SVE access
> handler which does the allocation but I can't see the path that would
> trigger that since we allocate the state before setting TIF_SVE. It's
> possible the compiler did something funky, a decode of the backtrace
> might help show that?

Having a vmlinux would be *really* helpful...

I tried generating fpsimd.o using the same config and the kernel.org crosstool
GCC 13.2.0, code dump below. Assuming the code generation is the same as for
Naresh, do_sve_acc+0x64 is at 0x191c, and is just after the call to sve_alloc().

So IIUC what's happening here is that sve_alloc() has been called, its entry
has been traced, its body has been run, and in the process of tracing its
return an IRQ has preempted the task and caused a reschedule.

So unless sve_alloc() failed, at the instant the IRQ was taken:

* `task->thread.sve_state` should be non-NULL
* `task->thread_info.flags & TIF_SVE` should be 0

... so if `task->thread.sve_state` becomes NULL, I wonder if we end up
accidentally blatting it as part of the context switch? I can't immediately
see how.

Mark.

00000000000018b8 <do_sve_acc>:
18b8: d503201f nop
18bc: d503201f nop
18c0: d503233f paciasp
18c4: a9be7bfd stp x29, x30, [sp, #-32]!
18c8: 910003fd mov x29, sp
18cc: 1400000a b 18f4 <do_sve_acc+0x3c>
18d0: d503201f nop
18d4: f9408022 ldr x2, [x1, #256]
18d8: d2800003 mov x3, #0x0 // #0
18dc: 52800080 mov w0, #0x4 // #4
18e0: 52800021 mov w1, #0x1 // #1
18e4: 94000000 bl 0 <force_signal_inject>
18e8: a8c27bfd ldp x29, x30, [sp], #32
18ec: d50323bf autiasp
18f0: d65f03c0 ret
18f4: 90000000 adrp x0, 0 <system_cpucaps>
18f8: f9400000 ldr x0, [x0]
18fc: b6f7fec0 tbz x0, #62, 18d4 <do_sve_acc+0x1c>
1900: f9000bf3 str x19, [sp, #16]
1904: d5384113 mrs x19, sp_el0
1908: f9400260 ldr x0, [x19]
190c: 37b005e0 tbnz w0, #22, 19c8 <do_sve_acc+0x110>
1910: aa1303e0 mov x0, x19
1914: 52800021 mov w1, #0x1 // #1
1918: 94000000 bl 1140 <sve_alloc>
191c: f946be60 ldr x0, [x19, #3448]
1920: b4000480 cbz x0, 19b0 <do_sve_acc+0xf8>
1924: 97fffb59 bl 688 <get_cpu_fpsimd_context>
1928: 14000016 b 1980 <do_sve_acc+0xc8>
192c: d2a01000 mov x0, #0x800000 // #8388608
1930: f8e03260 ldsetal x0, x0, [x19]
1934: 36b80040 tbz w0, #23, 193c <do_sve_acc+0x84>
1938: d4210000 brk #0x800
193c: d5384113 mrs x19, sp_el0
1940: f9400260 ldr x0, [x19]
1944: 371802c0 tbnz w0, #3, 199c <do_sve_acc+0xe4>
1948: b94d8a73 ldr w19, [x19, #3464]
194c: 53047e73 lsr w19, w19, #4
1950: 51000673 sub w19, w19, #0x1
1954: aa1303e0 mov x0, x19
1958: 94000000 bl 0 <sve_set_vq>
195c: aa1303e1 mov x1, x19
1960: 52800020 mov w0, #0x1 // #1
1964: 94000000 bl 0 <sve_flush_live>
1968: 97fffbb4 bl 838 <fpsimd_bind_task_to_cpu>
196c: 97fffb61 bl 6f0 <put_cpu_fpsimd_context>
1970: f9400bf3 ldr x19, [sp, #16]
1974: a8c27bfd ldp x29, x30, [sp], #32
1978: d50323bf autiasp
197c: d65f03c0 ret
1980: f9800271 prfm pstl1strm, [x19]
1984: c85f7e60 ldxr x0, [x19]
1988: b2690001 orr x1, x0, #0x800000
198c: c802fe61 stlxr w2, x1, [x19]
1990: 35ffffa2 cbnz w2, 1984 <do_sve_acc+0xcc>
1994: d5033bbf dmb ish
1998: 17ffffe7 b 1934 <do_sve_acc+0x7c>
199c: aa1303e0 mov x0, x19
19a0: 97fffaf2 bl 568 <fpsimd_to_sve>
19a4: 52800040 mov w0, #0x2 // #2
19a8: b90d7260 str w0, [x19, #3440]
19ac: 17fffff0 b 196c <do_sve_acc+0xb4>
19b0: 52800120 mov w0, #0x9 // #9
19b4: 94000000 bl 0 <force_sig>
19b8: f9400bf3 ldr x19, [sp, #16]
19bc: a8c27bfd ldp x29, x30, [sp], #32
19c0: d50323bf autiasp
19c4: d65f03c0 ret
19c8: d4210000 brk #0x800
19cc: f9400bf3 ldr x19, [sp, #16]
19d0: 17ffffc1 b 18d4 <do_sve_acc+0x1c>
19d4: d503201f nop
19d8: d503201f nop
19dc: d503201f nop

2023-10-17 14:03:52

by Mark Brown

Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

On Tue, Oct 17, 2023 at 02:42:01PM +0100, Mark Rutland wrote:

> So unless sve_alloc() failed, at the instant the IRQ was taken:

> * `task->thread.sve_state` should be non-NULL
> * `task->thread_info.flags & TIF_SVE` should be 0

> ... so if `task->thread.sve_state` becomes NULL, I wonder if we end up
> accidentally blatting it as part of the context switch? I can't immediately
> see how.

We're possibly missing an fpsimd_bind_task_to_cpu() somewhere, since all
the hilarity with KVM means that we don't use the task_struct to save
state, though the task that's taking the SVE trap shouldn't be impacted
there if it didn't set TIF_SVE yet. There *is* a window where we have
TIF_SVE set but didn't yet do the rebind, but that should be in a
preempt-disabled section.



2023-10-19 15:37:26

by Naresh Kamboju

Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

On Tue, 17 Oct 2023 at 17:52, Mark Brown <[email protected]> wrote:
>
> On Tue, Oct 17, 2023 at 01:34:18PM +0530, Naresh Kamboju wrote:
>
> > Following kernel crash noticed while running selftests: ftrace:
> > ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.
>
> > This is not an easy to reproduce issue and not seen on mainline and next.
> > We are investigating this report.
>
> To confirm have you seen this on other stables as well or is this only
> v6.5? For how long have you been seeing this?

This has only been seen on 6.5.8-rc2, and only once. I have checked
mainline, next, and the other stable branches, and this crash is not seen
anywhere else.

However, I will keep checking other branches, as well as next and mainline.

>
> > [ 764.987161] Unable to handle kernel NULL pointer dereference at
> > virtual address 0000000000000000
>
> > [ 765.074221] Call trace:
> > [ 765.075045] sve_save_state+0x4/0xf0
> > [ 765.076138] fpsimd_thread_switch+0x2c/0xe8
> > [ 765.077305] __switch_to+0x20/0x158
> > [ 765.078384] __schedule+0x2cc/0xb38
> > [ 765.079464] preempt_schedule_irq+0x44/0xa8
> > [ 765.080633] el1_interrupt+0x4c/0x68
> > [ 765.081691] el1h_64_irq_handler+0x18/0x28
> > [ 765.082829] el1h_64_irq+0x64/0x68
> > [ 765.083874] ftrace_return_to_handler+0x98/0x158
> > [ 765.085090] return_to_handler+0x20/0x48
> > [ 765.086205] do_sve_acc+0x64/0x128
> > [ 765.087272] el0_sve_acc+0x3c/0xa0
> > [ 765.088356] el0t_64_sync_handler+0x114/0x130
> > [ 765.089524] el0t_64_sync+0x190/0x198
>
> So something managed to get flagged as having SVE state without having
> the backing storage allocated. We *were* preempted in the SVE access
> handler which does the allocation but I can't see the path that would
> trigger that since we allocate the state before setting TIF_SVE. It's
> possible the compiler did something funky, a decode of the backtrace
> might help show that?

We did not upload vmlinux and System.map for this specific build.
However, I have requested that these files be uploaded for upcoming
builds.

- Naresh

2023-10-19 17:26:14

by Mark Brown

Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

On Thu, Oct 19, 2023 at 09:07:02PM +0530, Naresh Kamboju wrote:
> On Tue, 17 Oct 2023 at 17:52, Mark Brown <[email protected]> wrote:
> > On Tue, Oct 17, 2023 at 01:34:18PM +0530, Naresh Kamboju wrote:

> > > Following kernel crash noticed while running selftests: ftrace:
> > > ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.

> > > This is not an easy to reproduce issue and not seen on mainline and next.
> > > We are investigating this report.

> > To confirm have you seen this on other stables as well or is this only
> > v6.5? For how long have you been seeing this?

> This is only seen on 6.5.8-rc2 and seen only once.
> I have checked on mainline / next and other stable branches and this crash
> is not seen anywhere else.

> However, I will keep checking them on other branches and next and mainline.

I see. If the bug has never been reproduced, I don't think we can draw
any conclusions about which branches are affected; I would be a bit
surprised by something that affects v6.5 only.



2023-12-21 00:07:18

by Daniel Díaz

Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

Hello!

On Thu, 19 Oct 2023 at 11:26, Mark Brown <[email protected]> wrote:
> On Thu, Oct 19, 2023 at 09:07:02PM +0530, Naresh Kamboju wrote:
> > On Tue, 17 Oct 2023 at 17:52, Mark Brown <[email protected]> wrote:
[...]
> > > To confirm have you seen this on other stables as well or is this only
> > > v6.5? For how long have you been seeing this?
>
> > This is only seen on 6.5.8-rc2 and seen only once.
> > I have checked on mainline / next and other stable branches and this crash
> > is not seen anywhere else.
>
> > However, I will keep checking them on other branches and next and mainline.
>
> I see. If the bug has never been reproduced I don't think we can draw
> any conclusion about which branches are affected, I would be a bit
> surprised by something that affects v6.5 only.

We have been seeing this problem in other instances, specifically on
the following kernels:
* 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1
* 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63
* 6.3.10, 6.3.11
* 6.4.7
* 6.5.2, 6.5.10-rc2

The most recent case is on the current 5.15 RC. The decoded stack trace is here:
-----8<-----
<4>[ 29.297166] ------------[ cut here ]------------
<4>[ 29.298039] WARNING: CPU: 1 PID: 220 at
arch/arm64/kernel/fpsimd.c:950 do_sve_acc
(/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))
<4>[ 29.300418] Modules linked in: fuse drm dm_mod ip_tables x_tables
<4>[ 29.302720] CPU: 1 PID: 220 Comm: systemd-udevd Not tainted
5.15.145-rc1 #1
<4>[ 29.303601] Hardware name: linux,dummy-virt (DT)
<4>[ 29.304804] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT
-SSBS BTYPE=--)
<4>[ 29.305615] pc : do_sve_acc
(/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))
<4>[ 29.306499] lr : do_sve_acc
(/builds/linux/arch/arm64/include/asm/jump_label.h:38
/builds/linux/arch/arm64/include/asm/lse.h:24
/builds/linux/arch/arm64/include/asm/atomic.h:86
/builds/linux/include/linux/atomic/atomic-long.h:335
/builds/linux/include/asm-generic/bitops/atomic.h:42
/builds/linux/include/asm-generic/bitops/instrumented-atomic.h:71
/builds/linux/include/linux/thread_info.h:108
/builds/linux/arch/arm64/kernel/fpsimd.c:949)
<4>[ 29.307466] sp : ffff8000083b3e60
<4>[ 29.308447] x29: ffff8000083b3e60 x28: ffff0000c2f42000 x27:
0000000000000000
<4>[ 29.310500] x26: 0000000000000000 x25: 0000000000000000 x24:
0000000000000000
<4>[ 29.312034] x23: 0000000080001000 x22: 0000ffffb3c8a454 x21:
00000000ffffffff
<4>[ 29.313338] x20: 0000000064000000 x19: ffff0000c2f42000 x18:
0000000000000000
<4>[ 29.314636] x17: 0000000000000000 x16: 0000000000000000 x15:
0000000000000000
<4>[ 29.316090] x14: 0000000000000000 x13: 0000000000000000 x12:
0000000000000000
<4>[ 29.317096] x11: 0000000000000000 x10: 0000000000000000 x9 :
ffffb047a36178f4
<4>[ 29.318131] x8 : ffff0000c2c4d888 x7 : 0000000000000000 x6 :
0000000000000200
<4>[ 29.319171] x5 : ffffb047a5c6f260 x4 : 0000000000000200 x3 :
0000000000000001
<4>[ 29.320180] x2 : 0000000000000000 x1 : ffff4fb95a329000 x0 :
0000000000800800
<4>[ 29.321599] Call trace:
<4>[ 29.321996] do_sve_acc
(/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))
<4>[ 29.322674] el0_sve_acc
(/builds/linux/arch/arm64/include/asm/daifflags.h:28
/builds/linux/arch/arm64/kernel/entry-common.c:129
/builds/linux/arch/arm64/kernel/entry-common.c:138
/builds/linux/arch/arm64/kernel/entry-common.c:529)
<4>[ 29.323273] el0t_64_sync_handler
(/builds/linux/arch/arm64/kernel/entry-common.c:639)
<4>[ 29.323951] el0t_64_sync (/builds/linux/arch/arm64/kernel/entry.S:584)
<4>[ 29.324598] ---[ end trace b31d3b95f436e1b5 ]---
----->8-----

Complete log (and reproducer) here:
https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2ZoYpOR79EjlFiAuFHeOWN2rwC3

Artifacts (including vmlinux and System.map) can be found here:
https://storage.tuxsuite.com/public/linaro/lkft/builds/2ZoYoIewSKk26jISEo4QlNaOvVR/

Greetings!

Daniel Díaz
[email protected]

2023-12-21 00:34:56

by Masami Hiramatsu

Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

Hi Daniel,

Do you know which ftrace selftest testcase script was running when
this crash happened? I think it depends on the selftest version, but
I don't know which one you run.

Thank you,

On Wed, 20 Dec 2023 18:06:53 -0600
Daniel Díaz <[email protected]> wrote:

> [...]


--
Masami Hiramatsu (Google) <[email protected]>

2023-12-21 01:06:44

by Mark Brown

Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

On Wed, Dec 20, 2023 at 06:06:53PM -0600, Daniel Díaz wrote:

> We have been seeing this problem in other instances, specifically on
> the following kernels:
> * 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1
> * 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63
> * 6.3.10, 6.3.11
> * 6.4.7
> * 6.5.2, 6.5.10-rc2

This is a huge range of kernels with some substantial reworkings of
the FP code, and I do note that v5.15 appears to have backported only
one change there (an incidental one related to ESR handling). This
makes me think this is likely to be something that's been sitting there
for a very long time and is unrelated to those versions and any changes
that went into them. I see you're still testing back to v4.19, which
suggests an issue introduced between v5.10 and v5.15; my change
cccb78ce89c45a4 ("arm64/sve: Rework SVE access trap to convert state in
registers") does jump out there, though I don't immediately see what the
issue would be.

Looking at the list of versions you've posted, the earliest is from the
very end of June, with others in July; was there something that changed
in your test environment at broadly that time? I see that the
logs you and Naresh posted are both using a Debian 12/Bookworm based
root filesystem, and that was released a couple of weeks before this
started appearing. Bookworm introduced glibc usage of SVE, which makes
SVE usage much more common. Is this perhaps tied to you upgrading your
root filesystems to Bookworm, or were you tracking testing before then?

> Most recent case is for the current 5.15 RC. Decoded stack trace is here:
> -----8<-----
> <4>[ 29.297166] ------------[ cut here ]------------
> <4>[ 29.298039] WARNING: CPU: 1 PID: 220 at
> arch/arm64/kernel/fpsimd.c:950 do_sve_acc
> (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))

That's an assert that we shouldn't take an SVE trap when SVE is
already enabled for the thread. The backtrace Naresh originally
supplied was a NULL pointer dereference while attempting to save SVE
state (indicating that we think we're trying to save SVE state but don't
have any storage allocated for it) during a thread switch. It's very
plausible that the two are the same underlying issue, but it's also not
100% a given. Can you double-check exactly how similar the various
issues you are seeing are, please?

I have coincidentally been chasing some other stuff in the past week or
two which might potentially be different manifestations of the same
underlying issue with current code, broadly in the area of the register
state and task state getting out of sync.


Attachments:
(No filename) (2.60 kB)
signature.asc (499.00 B)
Download all attachments

2024-01-30 23:14:08

by Daniel Díaz

[permalink] [raw]
Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

Hello, Masami-san!

Sorry for the late reply -- winter break, expired artifacts, life, etc.

On Wed, 20 Dec 2023 at 18:34, Masami Hiramatsu <[email protected]> wrote:
> Hi Daniel,
> Would you know what ftrace selftest testcase script is running when
> this crash happened? I think it depends on the selftest version but
> I don't know which one you run.

If I remember correctly, that actually happened during LTP testing. I
found a new instance on 6.1 today, on ltp-controllers, so there were
no kselftests involved.

I'll share more details in a bit.

Greetings!

Daniel Díaz
[email protected]

2024-01-31 00:15:26

by Daniel Díaz

[permalink] [raw]
Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

Hello!

On 20/12/23 7:06 p. m., Mark Brown wrote:
> On Wed, Dec 20, 2023 at 06:06:53PM -0600, Daniel Díaz wrote:
>
>> We have been seeing this problem in other instances, specifically on
>> the following kernels:
>> * 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1
>> * 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63
>> * 6.3.10, 6.3.11
>> * 6.4.7
>> * 6.5.2, 6.5.10-rc2
[...]
> Looking at the list of versions you've posted the earliest is from the
> very end of June with others in July, was there something that changed
> in your test environment in broadly that time?

Yes! Important to note here is that I only started collecting that data around that time, in the second half of 2023. It's possible that the same problem was there before but I didn't register it.


> I see that the
> logs you and Naresh posted are both using a Debian 12/Bookworm based
> root filesystem and that was released a couple of weeks before this
> started appearing, Bookworm introduced glibc usage of SVE which makes
> usage much more common. Is this perhaps tied to you upgrading your root
> filesystems to Bookworm or were you tracking testing before then?

I don't have an easy way to swap the Debian version right now, I'll have to ask around, but that might be a good experiment.

This message appears so sporadically that it would be hard to determine with 100% certainty whether the changes we make really fix the issue. We can try to figure out just how sporadic the error is, though, and go from there, but the numbers look slim with this happening only once every several weeks.


>> Most recent case is for the current 5.15 RC. Decoded stack trace is here:
>> -----8<-----
>> <4>[ 29.297166] ------------[ cut here ]------------
>> <4>[ 29.298039] WARNING: CPU: 1 PID: 220 at
>> arch/arm64/kernel/fpsimd.c:950 do_sve_acc
>> (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))
>
> That's an assert that we shouldn't take an SVE trap when SVE is
> already enabled for the thread. The backtrace Naresh originally
> supplied was a NULL pointer dereference while attempting to save SVE
> state (indicating that we think we're trying to save SVE state but
> don't have any storage allocated for it) during a thread switch. It's
> very plausible that the two are the same underlying issue, but it's
> also not 100% a given. Can you double-check exactly how similar the
> various issues you are seeing are, please?

I'm not sure how similar the test cases are, but I'm inclined to think its occurrence is not tied specifically to ftrace or to any one test case. It looks like these appear on FVP and QEMU-arm64, on several LTS branches, and while running different sets of tests. On today's episode, we had it happening in the middle of ltp-controllers' cgroup_fj_stress:

-----8<-----
<4>[ 619.431962] ------------[ cut here ]------------
<4>[ 619.432530] WARNING: CPU: 0 PID: 10382 at arch/arm64/kernel/fpsimd.c:1403 do_sve_acc (arch/arm64/kernel/fpsimd.c:1402)
<4>[ 619.433889] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 fuse drm dm_mod ip_tables x_tables
<4>[ 619.436649] CPU: 0 PID: 10382 Comm: cgroup_fj_stres Not tainted 6.1.76-rc2 #1
<4>[ 619.437435] Hardware name: linux,dummy-virt (DT)
<4>[ 619.438319] pstate: 42400009 (nZcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
<4>[ 619.439680] pc : do_sve_acc (arch/arm64/kernel/fpsimd.c:1402)
<4>[ 619.440003] lr : do_sve_acc (arch/arm64/kernel/fpsimd.c:227 arch/arm64/kernel/fpsimd.c:253 arch/arm64/kernel/fpsimd.c:1400)
<4>[ 619.440572] sp : ffff80000f3e3e40
<4>[ 619.441022] x29: ffff80000f3e3e40 x28: ffff0000c4e42100 x27: 0000000000000000
<4>[ 619.441860] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
<4>[ 619.442789] x23: 0000000080001000 x22: 0000ffffaa47a454 x21: 0000000000000001
<4>[ 619.443893] x20: ffffb6cd7e13c0b0 x19: ffff0000c4e42100 x18: 0000000000000000
<4>[ 619.445033] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
<4>[ 619.446017] x14: 0000000000000000 x13: ffff80000f3e0000 x12: ffff80000f3e4000
<4>[ 619.446991] x11: 25fdfcd46891cd00 x10: 0000000000000200 x9 : 0000000000000000
<4>[ 619.447614] x8 : 0000000000800000 x7 : 0000000000000000 x6 : 000000000000003f
<4>[ 619.448285] x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000000000
<4>[ 619.448963] x2 : 0000000000000008 x1 : ffffb6cd7c019eac x0 : ffffb6cd7c019eac
<4>[ 619.449852] Call trace:
<4>[ 619.450219] do_sve_acc (arch/arm64/kernel/fpsimd.c:1402)
<4>[ 619.450983] el0_sve_acc (arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:142 arch/arm64/kernel/entry-common.c:550)
<4>[ 619.451394] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:0)
<4>[ 619.451849] el0t_64_sync (arch/arm64/kernel/entry.S:585)
<4>[ 619.452333] ---[ end trace 0000000000000000 ]---
----->8-----

The full log, reproducers, artifacts, can be found here:

https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2bgeMqo59p7kgw5dKw0m5uNegzy

Greetings!

Daniel Díaz
[email protected]


2024-01-31 12:43:00

by Mark Brown

[permalink] [raw]
Subject: Re: selftests: ftrace: Internal error: Oops: sve_save_state

On Tue, Jan 30, 2024 at 06:15:11PM -0600, Daniel Díaz wrote:

> > That's an assert that we shouldn't take an SVE trap when SVE is
> > already enabled for the thread. The backtrace Naresh originally
> > supplied was a NULL pointer dereference attempting to save SVE state

...

> > given. Can you double check exactly how similar the various issues you
> > are seeing are please?

> I'm not sure of how similar the test cases are, but I'm inclined to
> think its occurrence is not related specifically to ftrace or one test
> case. It looks like these appear on FVP and Qemu-arm64, on several LTS

That question is not about the test that is running; it is about the
error that the kernel reports. There were two different but potentially
related kernel errors in the reports that you and Naresh sent. I'm
asking how similar the other cases actually are: are they more
instances of these two issues, or are there possibly other issues?

