The following kernel crash was noticed on Linux stable-rc 6.5.3-rc1 on qemu-arm64 while
running the LTP sched test cases.
This is not always reproducible.
Has anyone noticed LTP cfs_bandwidth01 causing a kernel crash on any
devices or qemu-* targets?
I still need to check for similar crashes on other Linux trees and branches.
Boot log and test log:
---------------------
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x000f0510]
[ 0.000000] Linux version 6.5.3-rc1 (tuxmake@tuxmake) (Debian clang version 18.0.0 (++20230910112057+710b5a12324e-1~exp1~20230910112229.889), Debian LLD 18.0.0) #1 SMP PREEMPT @1694441978
[ 0.000000] KASLR enabled
[ 0.000000] random: crng init done
[ 0.000000] Machine model: linux,dummy-virt
...
running LTP sched tests
...
cfs_bandwidth01.c:129: TPASS: Workers exited
cfs_bandwidth01.c:117: TPASS: Scheduled bandwidth constrained workers
cfs_bandwidth01.c:54: TINFO: Set 'level2/cpu.max' = '5000 10000'
<1>[ 74.455327] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038
<1>[ 74.456395] Mem abort info:
<1>[ 74.456639] ESR = 0x0000000097880004
<1>[ 74.458273] EC = 0x25: DABT (current EL), IL = 32 bits
<1>[ 74.458859] SET = 0, FnV = 0
<1>[ 74.459495] EA = 0, S1PTW = 0
<1>[ 74.460171] FSC = 0x04: level 0 translation fault
<1>[ 74.460799] Data abort info:
<1>[ 74.461388] Access size = 4 byte(s)
<1>[ 74.462068] SSE = 0, SRT = 8
<1>[ 74.462713] SF = 0, AR = 0
<1>[ 74.463257] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
<1>[ 74.463996] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
<1>[ 74.465120] user pgtable: 4k pages, 48-bit VAs, pgdp=00000001029d6000
<1>[ 74.465818] [0000000000000038] pgd=0000000000000000, p4d=0000000000000000
<0>[ 74.468416] Internal error: Oops: 0000000097880004 [#1] PREEMPT SMP
<4>[ 74.469489] Modules linked in: fuse drm dm_mod ip_tables x_tables
<4>[ 74.470964] CPU: 0 PID: 435 Comm: cfs_bandwidth01 Not tainted 6.5.3-rc1 #1
<4>[ 74.471789] Hardware name: linux,dummy-virt (DT)
<4>[ 74.473045] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
<4>[ 74.473785] pc : set_next_entity+0xc0/0x1f8
<4>[ 74.475461] lr : pick_next_task_fair+0x204/0x3b8
<4>[ 74.476989] sp : ffff8000807eb870
<4>[ 74.477346] x29: ffff8000807eb870 x28: ffff0000c4e3b750 x27: ffffcb93e8e19008
<4>[ 74.478392] x26: ffff0000c4e3b0c0 x25: ffffcb93e8ab4828 x24: ffff0000c0354a00
<4>[ 74.479263] x23: ffff8000807eb900 x22: 0000000000000000 x21: ffff0000ff5b1300
<4>[ 74.480401] x20: ffff0000ff5b1300 x19: 0000000000000000 x18: 0000000000000000
<4>[ 74.481417] x17: 000000000000ba7e x16: 0000000000000606 x15: 000000000117d17a
<4>[ 74.482733] x14: 0000000000000000 x13: 0000000f0f4bc800 x12: 00000000000002b0
<4>[ 74.484181] x11: 0000000f0f4bc800 x10: 0000000cf6ad6bd1 x9 : ffffcb93e6af8e4c
<4>[ 74.485229] x8 : 0000000000000000 x7 : ffffcb93e8a3ccac x6 : 0000000000000003
<4>[ 74.486131] x5 : 000000008040002b x4 : 0000ffffbef0c000 x3 : ffff0000ff5b1200
<4>[ 74.487012] x2 : ffff0000c39efc00 x1 : 0000000000000000 x0 : ffff0000ff5b1300
<4>[ 74.488236] Call trace:
<4>[ 74.488608] set_next_entity+0xc0/0x1f8
<4>[ 74.489280] pick_next_task_fair+0x204/0x3b8
<4>[ 74.489987] __schedule+0x1e0/0x9c8
<4>[ 74.490903] schedule+0x134/0x1b8
<4>[ 74.491632] schedule_preempt_disabled+0x90/0x108
<4>[ 74.492392] rwsem_down_write_slowpath+0x288/0x6f0
<4>[ 74.493056] down_write+0x48/0xb0
<4>[ 74.493606] unlink_anon_vmas+0x148/0x1b0
<4>[ 74.494222] free_pgtables+0x10c/0x200
<4>[ 74.494800] exit_mmap+0x174/0x3c0
<4>[ 74.495177] __mmput+0x48/0x150
<4>[ 74.495761] mmput+0x34/0x70
<4>[ 74.496058] exit_mm+0xbc/0x148
<4>[ 74.497651] do_exit+0x22c/0x910
<4>[ 74.498212] do_group_exit+0xa4/0xb0
<4>[ 74.498870] __arm64_sys_exit_group+0x24/0x30
<4>[ 74.499484] invoke_syscall+0x4c/0x120
<4>[ 74.499834] el0_svc_common+0xd0/0x110
<4>[ 74.500196] do_el0_svc+0x3c/0xb8
<4>[ 74.500475] el0_svc+0x30/0x90
<4>[ 74.500746] el0t_64_sync_handler+0x84/0x100
<4>[ 74.501309] el0t_64_sync+0x190/0x198
<0>[ 74.502156] Code: f900293f f9403908 b5ffff48 17ffffde (b9403a68)
<4>[ 74.503735] ---[ end trace 0000000000000000 ]---
<6>[ 74.504727] note: cfs_bandwidth01[435] exited with irqs disabled
Links:
-----
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2VFpDOMEgzroNyiP9SSlxRxHsMH
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.5.y/build/v6.5.2-740-g7bfd1316ceae/testrun/19901770/suite/log-parser-test/tests/
- https://storage.tuxsuite.com/public/linaro/lkft/builds/2VFpB1ieNZSp5zh0joVGtoMn7RG/
Steps to reproduce:
----------------
# To install tuxrun to your home directory at ~/.local/bin:
# pip3 install -U --user tuxrun==0.49.2
#
# Or install a deb/rpm depending on the running distribution
# See https://tuxmake.org/install-deb/ or
# https://tuxmake.org/install-rpm/
#
# See https://tuxrun.org/ for complete documentation.
#
tuxrun --runtime podman --device qemu-arm64 --boot-args rw \
  --kernel https://storage.tuxsuite.com/public/linaro/lkft/builds/2VFpB1ieNZSp5zh0joVGtoMn7RG/Image.gz \
  --modules https://storage.tuxsuite.com/public/linaro/lkft/builds/2VFpB1ieNZSp5zh0joVGtoMn7RG/modules.tar.xz \
  --rootfs https://storage.tuxboot.com/debian/bookworm/arm64/rootfs.ext4.xz \
  --parameters SKIPFILE=skipfile-lkft.yaml --parameters SHARD_NUMBER=4 \
  --parameters SHARD_INDEX=2 \
  --image docker.io/linaro/tuxrun-dispatcher:v0.49.2 --tests ltp-sched \
  --timeouts boot=30 ltp-sched=30 \
  --overlay https://storage.tuxboot.com/overlays/debian/bookworm/arm64/ltp/20230516/ltp.tar.xz
--
Linaro LKFT
https://lkft.linaro.org
Hi!
> Following kernel crash noticed on Linux stable-rc 6.5.3-rc1 on qemu-arm64 while
> running LTP sched tests cases.
>
> This is not always reproducible.
What the test does is create three levels of cgroups, set CPU
quotas for them, run busy-loop processes in the groups, and change the
quotas while the busy processes are running.
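As a side note on the quota values: the test log line
"Set 'level2/cpu.max' = '5000 10000'" uses the cgroup v2 cpu.max format of
"quota period" in microseconds, where a quota of "max" means no limit.
A minimal Python sketch of how such a value maps to a CPU share follows.
The helper name cpu_max_fraction is made up for illustration and is not
part of LTP or the kernel.

```python
from typing import Optional


def cpu_max_fraction(value: str) -> Optional[float]:
    """Return the CPU share allowed by a cgroup v2 cpu.max value.

    The value is expected in the two-field form "<quota> <period>"
    (both in microseconds). A quota of "max" means the group is not
    bandwidth limited, which is reported here as None.
    """
    quota, period = value.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


if __name__ == "__main__":
    # Value taken from the cfs_bandwidth01 log line in the report.
    share = cpu_max_fraction("5000 10000")
    print(f"level2 may use {share:.0%} of one CPU per period")
```

With the value from the log, level2 is limited to half of one CPU in each
10 ms period, so its busy-loop workers are throttled repeatedly.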
The test is a regression test for quite a few commits:
commit 39f23ce07b9355d05a64ae303ce20d1c4b92b957
Author: Vincent Guittot <[email protected]>
Date: Wed May 13 15:55:28 2020 +0200
sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
commit b34cb07dde7c2346dec73d053ce926aeaa087303
Author: Phil Auld <[email protected]>
Date: Tue May 12 09:52:22 2020 -0400
sched/fair: Fix enqueue_task_fair() warning some more
commit fe61468b2cbc2b7ce5f8d3bf32ae5001d4c434e9
Author: Vincent Guittot <[email protected]>
Date: Fri Mar 6 14:52:57 2020 +0100
sched/fair: Fix enqueue_task_fair warning
commit 5ab297bab984310267734dfbcc8104566658ebef
Author: Vincent Guittot <[email protected]>
Date: Fri Mar 6 09:42:08 2020 +0100
sched/fair: Fix reordering of enqueue/dequeue_task_fair()
commit 6d4d22468dae3d8757af9f8b81b848a76ef4409d
Author: Vincent Guittot <[email protected]>
Date: Mon Feb 24 09:52:14 2020 +0000
sched/fair: Reorder enqueue/dequeue_task_fair path
commit fdaba61ef8a268d4136d0a113d153f7a89eb9984
Author: Rik van Riel <[email protected]>
Date: Mon Jun 21 19:43:30 2021 +0200
sched/fair: Ensure that the CFS parent is added after unthrottling
Unless this is random corruption, we should look more closely at the
scheduler changes.
--
Cyril Hrubis
[email protected]
Hello!
>Following kernel crash noticed on Linux stable-rc 6.5.3-rc1 on qemu-arm64 while
>running LTP sched tests cases.
>
>This is not always reproducible.
I also encountered this problem on Linux 5.10 in an arm64 environment.
The log output is as follows:
[ 2893.003795] ==================================================================
[ 2893.003822] BUG: KASAN: null-ptr-deref in pick_next_task_fair+0x130/0x4e0
[ 2893.003880] Read of size 8 at addr 0000000000000080 by task ksoftirqd/0/12
[ 2893.003901]
[ 2893.003914] CPU: 0 PID: 12 Comm: ksoftirqd/0 Tainted: P O 5.10.59-rt52#1
[ 2893.003959] Call trace:
[ 2893.003968] dump_backtrace+0x0/0x2e8
[ 2893.004009] show_stack+0x18/0x28
[ 2893.004032] dump_stack+0x104/0x174
[ 2893.004067] kasan_report+0x1d0/0x258
[ 2893.004098] __asan_load8+0x94/0xd0
[ 2893.004126] pick_next_task_fair+0x130/0x4e0
[ 2893.004164] __schedule+0x220/0xbd0
[ 2893.004192] schedule+0xec/0x1a0
[ 2893.004216] smpboot_thread_fn+0x124/0x548
[ 2893.004246] kthread+0x24c/0x278
[ 2893.004277] ret_from_fork+0x10/0x34
[ 2893.004306] ==================================================================
[ 2893.004325] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080
[ 2893.152267] Mem abort info:
[ 2893.152639] ESR = 0x96000004
[ 2893.153045] EC = 0x25: DABT (current EL), IL = 32 bits
[ 2893.153739] SET = 0, FnV = 0
[ 2893.154143] EA = 0, S1PTW = 0
[ 2893.154560] Data abort info:
[ 2893.154940] ISV = 0, ISS = 0x00000004
[ 2893.155443] CM = 0, WnR = 0
[ 2893.155838] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000188edb000
The source code where the problem occurs corresponds to:
se = pick_next_entity(cfs_rq, curr);
cfs_rq = group_cfs_rq(se); //se is NULL!
pick_next_entity() returns NULL here, so a NULL pointer dereference occurs when members of se are accessed afterwards.
However, it is not clear under what circumstances pick_next_entity() returns NULL.
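For comparison, the first report's oops is consistent with the same kind of
failure. The word in parentheses on its "Code:" line is the faulting
instruction, and it can be decoded by hand. The following Python sketch
(the helper names are hypothetical, and it only handles the one A64 encoding
needed here, LDR Wt, [Xn, #imm] with an unsigned offset) suggests that
b9403a68 is a 32-bit load from x19 + 0x38, which matches the fault address
0x38 with x19 = 0 in the register dump.

```python
import re
from typing import Tuple


def faulting_word(code_line: str) -> int:
    """Return the instruction word that the oops 'Code:' line puts in parentheses."""
    match = re.search(r"\(([0-9a-f]{8})\)", code_line)
    if match is None:
        raise ValueError("no parenthesised instruction word found")
    return int(match.group(1), 16)


def decode_ldr_w_imm(word: int) -> Tuple[int, int, int]:
    """Decode A64 'LDR Wt, [Xn, #imm]' (unsigned offset form).

    Returns (rt, rn, byte_offset). Any other encoding is rejected.
    """
    # Bits 31..22 identify the 32-bit LDR (immediate, unsigned offset).
    if (word & 0xFFC00000) != 0xB9400000:
        raise ValueError("not a 32-bit LDR (immediate, unsigned offset)")
    imm12 = (word >> 10) & 0xFFF   # offset, scaled by the 4-byte access size
    rn = (word >> 5) & 0x1F        # base register
    rt = word & 0x1F               # destination register
    return rt, rn, imm12 * 4


if __name__ == "__main__":
    # "Code:" line copied from the 6.5.3-rc1 oops in the first report.
    line = "Code: f900293f f9403908 b5ffff48 17ffffde (b9403a68)"
    rt, rn, offset = decode_ldr_w_imm(faulting_word(line))
    print(f"ldr w{rt}, [x{rn}, #{offset:#x}]")
```

This also agrees with "Access size = 4 byte(s)" and "SRT = 8" in the data
abort info, so set_next_entity() appears to have been called with a NULL
entity pointer and faulted on a 4-byte member at offset 0x38.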
In addition, in my environment, the following sequence often reproduces the problem:
stress-ng -c 8 --cpu-load 100 --sched fifo --sched-prio 1 --cpu-method pi -t 900 &
runltp -s cfs_bandwidth01
I hope this helps to solve the problem.
Thanks.