2018-04-18 13:40:17

by Fengguang Wu

[permalink] [raw]
Subject: [cfs_trace_lock_tcd] BUG: unable to handle kernel NULL pointer dereference at 00000050

Hello,

FYI this happens in mainline kernel 4.17.0-rc1.
It looks like a new regression.

It occurs in 5 out of 5 boots.

[ 6.524361] ledtrig-cpu: registered to indicate activity on CPUs
[ 6.527658] NET: Registered protocol family 4
[ 6.528191] comedi: version 0.7.76 - http://www.comedi.org
[ 6.528851] LNetError: 1:0:(module.c:546:libcfs_init()) misc_register: error -16
[ 7.220272] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 7.586283] BUG: unable to handle kernel NULL pointer dereference at 00000050
[ 7.586962] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
[ 7.587002] Oops: 0000 [#1] PREEMPT
[ 7.587002] CPU: 0 PID: 1 Comm: swapper Not tainted 4.17.0-rc1 #1
[ 7.587002] EIP: cfs_trace_lock_tcd+0xb/0xa0:
cfs_trace_lock_tcd at drivers/staging/lustre/lnet/libcfs/linux/linux-tracefile.c:149
[ 7.587002] EFLAGS: 00210246 CPU: 0
[ 7.587002] EAX: 00000000 EBX: 00000000 ECX: 81fcb588 EDX: 00000000
[ 7.587002] ESI: 00001800 EDI: 8f5d1e08 EBP: 8f5d1d7c ESP: 8f5d1d70
[ 7.587002] DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
[ 7.587002] CR0: 80050033 CR2: 00000050 CR3: 022f0000 CR4: 000006b0
[ 7.587002] Call Trace:
[ 7.587002] libcfs_debug_vmsg2+0x8f/0x82f:
libcfs_debug_vmsg2 at drivers/staging/lustre/lnet/libcfs/tracefile.c:317
[ 7.587002] ? trace_irq_enable_rcuidle+0x25/0x62:
static_key_false at include/linux/jump_label.h:206
(inlined by) trace_irq_enable_rcuidle at include/trace/events/preemptirq.h:40
[ 7.587002] ? slob_free+0x249/0x251:
slob_free at mm/slob.c:421
[ 7.587002] libcfs_debug_msg+0x19/0x1b:
libcfs_debug_msg at drivers/staging/lustre/lnet/libcfs/tracefile.c:287
[ 7.587002] ksocknal_startup+0xe77/0x12b2:
ksocknal_startup at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2845 (discriminator 3)
[ 7.587002] ? lock_release+0x135/0x1ec:
lock_release at kernel/locking/lockdep.c:3942
[ 7.587002] ? _raw_spin_unlock+0x3c/0x4b:
__raw_spin_unlock at include/linux/spinlock_api_smp.h:152
(inlined by) _raw_spin_unlock at kernel/locking/spinlock.c:176
[ 7.587002] lnet_startup_lndni+0x4cd/0x9ec:
lnet_startup_lndni at drivers/staging/lustre/lnet/lnet/api-ni.c:1304
[ 7.587002] LNetNIInit+0x880/0xa00:
lnet_startup_lndnis at drivers/staging/lustre/lnet/lnet/api-ni.c:1385
(inlined by) LNetNIInit at drivers/staging/lustre/lnet/lnet/api-ni.c:1543
[ 7.587002] ? read_seqcount_retry+0x1b/0x22:
read_seqcount_retry at include/linux/seqlock.h:222

read_seqcount_retry+0x1b/0x22:
read_seqcount_retry at include/linux/seqlock.h:222

read_seqcount_retry+0x1b/0x22:
read_seqcount_retry at include/linux/seqlock.h:222
[ 7.587002] srpc_startup+0x84/0x381:
srpc_startup at drivers/staging/lustre/lnet/selftest/rpc.c:1613
[ 7.587002] lnet_selftest_init+0x2c4/0x5d9:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:134
[ 7.587002] ? lnet_selftest_exit+0x8d/0x8d:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90
[ 7.587002] do_one_initcall+0x76/0x1d7:
__read_once_size at include/linux/compiler.h:188
(inlined by) arch_atomic_read at arch/x86/include/asm/atomic.h:31
(inlined by) atomic_read at include/asm-generic/atomic-instrumented.h:22
(inlined by) static_key_count at include/linux/jump_label.h:194
(inlined by) static_key_false at include/linux/jump_label.h:206
(inlined by) trace_initcall_finish at include/trace/events/initcall.h:44
(inlined by) do_one_initcall at init/main.c:884
[ 7.587002] ? do_early_param+0x78/0x78:
repair_env_string at init/main.c:250
[ 7.587002] ? rcu_read_unlock_sched_notrace+0x1e/0x2d:
rcu_read_unlock_sched_notrace at include/linux/rcupdate.h:778

rcu_read_unlock_sched_notrace+0x1e/0x2d:
rcu_read_unlock_sched_notrace at include/linux/rcupdate.h:778

rcu_read_unlock_sched_notrace+0x1e/0x2d:
rcu_read_unlock_sched_notrace at include/linux/rcupdate.h:778

rcu_read_unlock_sched_notrace+0x1e/0x2d:
rcu_read_unlock_sched_notrace at include/linux/rcupdate.h:778

rcu_read_unlock_sched_notrace+0x1e/0x2d:
rcu_read_unlock_sched_notrace at include/linux/rcupdate.h:778

rcu_read_unlock_sched_notrace+0x1e/0x2d:
rcu_read_unlock_sched_notrace at include/linux/rcupdate.h:778

rcu_read_unlock_sched_notrace+0x1e/0x2d:
rcu_read_unlock_sched_notrace at include/linux/rcupdate.h:778
[ 7.587002] ? trace_initcall_level+0x3e/0x64:
trace_initcall_level at include/trace/events/initcall.h:10 (discriminator 18)
[ 7.587002] kernel_init_freeable+0xfc/0x177:
do_initcall_level at init/main.c:951
(inlined by) do_initcalls at init/main.c:959
(inlined by) do_basic_setup at init/main.c:977
(inlined by) kernel_init_freeable at init/main.c:1127
[ 7.587002] ? rest_init+0xbe/0xbe:
kernel_init at init/main.c:1050
[ 7.587002] kernel_init+0xd/0xd0:
kernel_init at init/main.c:1055
[ 7.587002] ret_from_fork+0x2e/0x38:
ret_from_fork at arch/x86/entry/entry_32.S:311
[ 7.587002] Code: 16 94 ff 8b 15 68 ab 03 82 f7 c2 00 00 0f 00 b8 02 00 00 00 75 08 31 c0 80 e6 ff 0f 95 c0 5d c3 55 89 e5 57 56 53 e8 99 16 94 ff <8b> 78 50 89 d6 6a 00 31 d2 89 c3 b8 50 f0 1b 82 66 83 ff 02 0f
[ 7.587002] EIP: cfs_trace_lock_tcd+0xb/0xa0:
cfs_trace_lock_tcd at drivers/staging/lustre/lnet/libcfs/linux/linux-tracefile.c:149 SS:ESP: 0068:8f5d1d70
[ 7.587002] CR2: 0000000000000050
[ 7.587002] ---[ end trace 4fbdcb515c61525e ]---
[ 7.587002] Kernel panic - not syncing: Fatal exception

Attached the full dmesg, kconfig and reproduce scripts.

Thanks,
Fengguang


Attachments:
(No filename) (5.84 kB)
dmesg-vm-lkp-nhm-dp1-yocto-i386-7:20180416204457:i386-randconfig-s1-201814:4.17.0-rc1:1 (37.57 kB)
.config (99.51 kB)
job-script (4.04 kB)
reproduce-vm-lkp-nhm-dp1-yocto-i386-7:20180416204457:i386-randconfig-s1-201814:4.17.0-rc1:1 (1.93 kB)
Download all attachments

2018-04-18 13:42:47

by Fengguang Wu

[permalink] [raw]
Subject: [cfs_trace_lock_tcd] BUG: KASAN: null-ptr-deref in cfs_trace_lock_tcd+0x25/0xeb

FYI, it's the same regression, however reported by KASAN.

[ 19.940511] vhci_hcd: dump_port_status_diff:142: +R27
[ 19.940884] vhci_hcd: dump_port_status_diff:150:
[ 19.941126] vhci_hcd: vhci_hub_control:604: bye
[ 19.941703] usb usb2: vhci_bus_suspend
[ 20.967090] ==================================================================
[ 20.968082] BUG: KASAN: null-ptr-deref in cfs_trace_lock_tcd+0x25/0xeb:
cfs_trace_lock_tcd at drivers/staging/lustre/lnet/libcfs/linux/linux-tracefile.c:149
[ 20.968856] Read of size 2 at addr 00000000000000a4 by task swapper/1
[ 20.969604]
[ 20.969817] CPU: 0 PID: 1 Comm: swapper Tainted: G T 4.17.0-rc1 #1
[ 20.970046] Call Trace:
[ 20.970046] dump_stack+0x2e/0x3e:
__dump_stack at lib/dump_stack.c:77
(inlined by) dump_stack at lib/dump_stack.c:123
[ 20.970046] kasan_report+0x2f3/0x360:
kasan_report_error at mm/kasan/report.c:352
(inlined by) kasan_report at mm/kasan/report.c:412
[ 20.970046] __asan_load2+0x31/0xe4
[ 20.970046] cfs_trace_lock_tcd+0x25/0xeb:
cfs_trace_lock_tcd at drivers/staging/lustre/lnet/libcfs/linux/linux-tracefile.c:149
[ 20.970046] libcfs_debug_vmsg2+0x18d/0xd26:
libcfs_debug_vmsg2 at drivers/staging/lustre/lnet/libcfs/tracefile.c:317
[ 20.970046] ? cfs_trace_get_tage+0x957/0x957:
libcfs_debug_vmsg2 at drivers/staging/lustre/lnet/libcfs/tracefile.c:293
[ 20.970046] ? mark_lock+0xe9/0xa16:
mark_lock at kernel/locking/lockdep.c:3121
[ 20.970046] ? fput+0x111/0x11f:
queue_delayed_work at include/linux/workqueue.h:526
(inlined by) schedule_delayed_work at include/linux/workqueue.h:627
(inlined by) fput at fs/file_table.c:280
[ 20.970046] ? lnet_ipif_enumerate+0x541/0x588:
lnet_ipif_enumerate at drivers/staging/lustre/lnet/lnet/lib-socket.c:238
[ 20.970046] ? kasan_slab_free+0x15/0x1e:
kasan_slab_free at mm/kasan/kasan.c:528
[ 20.970046] ? kfree+0x215/0x2cb:
slab_free_hook at mm/slub.c:1388
(inlined by) slab_free_freelist_hook at mm/slub.c:1415
(inlined by) slab_free at mm/slub.c:2988
(inlined by) kfree at mm/slub.c:3944
[ 20.970046] libcfs_debug_msg+0x9b/0xc5:
libcfs_debug_msg at drivers/staging/lustre/lnet/libcfs/tracefile.c:278
[ 20.970046] ? libcfs_debug_vmsg2+0xd26/0xd26:
libcfs_debug_msg at drivers/staging/lustre/lnet/libcfs/tracefile.c:278
[ 20.970046] ksocknal_startup+0x341/0x1228:
ksocknal_enumerate_interfaces at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2634
(inlined by) ksocknal_startup at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2822
[ 20.970046] ? __asan_loadN+0x1d/0x26
[ 20.970046] ? pvclock_clocksource_read+0x162/0x296:
pvclock_clocksource_read at arch/x86/kernel/pvclock.c:89
[ 20.970046] ? ksocknal_base_startup+0xbae/0xbae:
ksocknal_startup at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2796
[ 20.970046] ? pvclock_read_flags+0x5d/0x5d:
pvclock_clocksource_read at arch/x86/kernel/pvclock.c:79
[ 20.970046] ? kasan_check_read+0x1f/0x28:
kasan_check_read at mm/kasan/kasan.c:272
[ 20.970046] ? kvm_clock_read+0x24/0x3a:
kvm_clock_read at arch/x86/kernel/kvmclock.c:89
[ 20.970046] ? kasan_check_write+0x22/0x2b:
kasan_check_write at mm/kasan/kasan.c:278
[ 20.970046] ? sched_clock_local+0xc8/0xfa
[ 20.970046] ? reacquire_held_locks+0x1cd/0x1cd:
lock_release at kernel/locking/lockdep.c:3929
[ 20.970046] ? cfs_percpt_lock+0xdf/0x1ad:
spin_lock at include/linux/spinlock.h:310
(inlined by) cfs_percpt_lock at drivers/staging/lustre/lnet/libcfs/libcfs_lock.c:111
[ 20.970046] ? do_raw_spin_unlock+0x100/0x10d:
arch_spin_unlock at include/linux/spinlock_up.h:48
(inlined by) do_raw_spin_unlock at kernel/locking/spinlock_debug.c:135
[ 20.970046] ? _raw_spin_unlock+0x37/0x4e:
__raw_spin_unlock at include/linux/spinlock_api_smp.h:151
(inlined by) _raw_spin_unlock at kernel/locking/spinlock.c:176
[ 20.970046] ? cfs_percpt_unlock+0x66/0x13a:
spin_unlock at include/linux/spinlock.h:350
(inlined by) cfs_percpt_unlock at drivers/staging/lustre/lnet/libcfs/libcfs_lock.c:140
[ 20.970046] lnet_startup_lndni+0x62a/0xf84:
lnet_startup_lndni at drivers/staging/lustre/lnet/lnet/api-ni.c:1306
[ 20.970046] ? lnet_shutdown_lndni+0xcc/0xcc:
lnet_startup_lndni at drivers/staging/lustre/lnet/lnet/api-ni.c:1210
[ 20.970046] ? __schedule+0xbff/0xc7f:
context_switch at kernel/sched/core.c:2851
(inlined by) __schedule at kernel/sched/core.c:3490
[ 20.970046] ? firmware_map_remove+0x265/0x265:
__schedule at kernel/sched/core.c:3401
[ 20.970046] ? kasan_check_read+0x1f/0x28:
kasan_check_read at mm/kasan/kasan.c:272
[ 20.970046] LNetNIInit+0x5d4/0xbf8:
lnet_startup_lndnis at drivers/staging/lustre/lnet/lnet/api-ni.c:1385
(inlined by) LNetNIInit at drivers/staging/lustre/lnet/lnet/api-ni.c:1543
[ 20.970046] ? lnet_lib_exit+0x147/0x147:
LNetNIInit at drivers/staging/lustre/lnet/lnet/api-ni.c:1497
[ 20.970046] ? del_timer+0x6d/0xf8:
del_timer at kernel/time/timer.c:1200
[ 20.970046] ? add_timer_on+0x32b/0x32b:
del_timer at kernel/time/timer.c:1193
[ 20.970046] ? _raw_spin_unlock_irqrestore+0x74/0x9e:
__raw_spin_unlock_irqrestore at include/linux/spinlock_api_smp.h:160
(inlined by) _raw_spin_unlock_irqrestore at kernel/locking/spinlock.c:184
[ 20.970046] ? trace_hardirqs_on_caller+0x2c8/0x33d:
trace_hardirqs_on_caller at kernel/locking/lockdep.c:2884
[ 20.970046] ? schedule_timeout+0x82b/0x8a7:
schedule_timeout at kernel/time/timer.c:1807
[ 20.970046] ? kasan_kmalloc+0x1c5/0x1e1:
save_stack at mm/kasan/kasan.c:454
(inlined by) set_track at mm/kasan/kasan.c:460
(inlined by) kasan_kmalloc at mm/kasan/kasan.c:553
[ 20.970046] ? console_conditional_schedule+0x44/0x44:
schedule_timeout at kernel/time/timer.c:1763
[ 20.970046] ? do_one_initcall+0x158/0x3ac:
do_one_initcall at init/main.c:883
[ 20.970046] ? kernel_init_freeable+0x2b8/0x434:
do_initcall_level at init/main.c:951
(inlined by) do_initcalls at init/main.c:959
(inlined by) do_basic_setup at init/main.c:977
(inlined by) kernel_init_freeable at init/main.c:1127
[ 20.970046] ? kernel_init+0x13/0x1ae:
kernel_init at init/main.c:1053
[ 20.970046] ? ret_from_fork+0x24/0x30:
ret_from_fork at arch/x86/entry/entry_64.S:418
[ 20.970046] ? __check_object_size+0x8d7/0xa08:
__check_object_size at mm/usercopy.c:262
[ 20.970046] ? usercopy_abort+0xb7/0xb7:
__check_object_size at mm/usercopy.c:250
[ 20.970046] ? __lock_is_held+0x39/0xbb:
__lock_is_held at kernel/locking/lockdep.c:3737 (discriminator 1)
[ 20.970046] ? mark_held_locks+0x2c/0xb2:
mark_held_locks at kernel/locking/lockdep.c:2795 (discriminator 1)
[ 20.970046] ? debug_check_no_locks_freed+0x237/0x25f:
debug_check_no_locks_freed at kernel/locking/lockdep.c:4421 (discriminator 2)
[ 20.970046] ? collect_expired_timers+0x23b/0x23b:
process_timeout at kernel/time/timer.c:1727
[ 20.970046] ? lockdep_init_map+0x10/0x19:
lockdep_init_map at kernel/locking/lockdep.c:3239
[ 20.970046] srpc_startup+0x136/0x352:
srpc_startup at drivers/staging/lustre/lnet/selftest/rpc.c:1613
[ 20.970046] ? srpc_shutdown+0x1f0/0x1f0:
srpc_startup at drivers/staging/lustre/lnet/selftest/rpc.c:1599
[ 20.970046] ? kvmalloc_node+0x6b/0xb6:
__kmalloc_node at include/linux/slab.h:383
(inlined by) kmalloc_node at include/linux/slab.h:554
(inlined by) kvmalloc_node at mm/util.c:421
[ 20.970046] lnet_selftest_init+0x33a/0x4f5:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:134
[ 20.970046] ? lnet_selftest_exit+0x13e/0x13e:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90
[ 20.970046] ? lustre_insert_debugfs+0x150/0x150:
libcfs_init at drivers/staging/lustre/lnet/libcfs/module.c:531
[ 20.970046] ? lnet_init+0x1df/0x1df:
ksocklnd_init at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2885
[ 20.970046] ? lnet_selftest_exit+0x13e/0x13e:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90
[ 20.970046] do_one_initcall+0x158/0x3ac:
do_one_initcall at init/main.c:883
[ 20.970046] ? start_kernel+0xbf5/0xbf5:
do_one_initcall at init/main.c:874
[ 20.970046] kernel_init_freeable+0x2b8/0x434:
do_initcall_level at init/main.c:951
(inlined by) do_initcalls at init/main.c:959
(inlined by) do_basic_setup at init/main.c:977
(inlined by) kernel_init_freeable at init/main.c:1127
[ 20.970046] ? rest_init+0x1d2/0x1d2:
kernel_init at init/main.c:1050
[ 20.970046] kernel_init+0x13/0x1ae:
kernel_init at init/main.c:1053
[ 20.970046] ? rest_init+0x1d2/0x1d2:
kernel_init at init/main.c:1050
[ 20.970046] ret_from_fork+0x24/0x30:
ret_from_fork at arch/x86/entry/entry_64.S:418
[ 20.970046] ==================================================================
[ 20.970046] Disabling lock debugging due to kernel taint
[ 21.007543] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a4
[ 21.009103] PGD 0 P4D 0
[ 21.009432] Oops: 0000 [#1] DEBUG_PAGEALLOC KASAN PTI
[ 21.010063] CPU: 0 PID: 1 Comm: swapper Tainted: G B T 4.17.0-rc1 #1
[ 21.011023] RIP: 0010:cfs_trace_lock_tcd+0x25/0xeb:
cfs_trace_lock_tcd at drivers/staging/lustre/lnet/libcfs/linux/linux-tracefile.c:149
[ 21.011650] RSP: 0000:ffff880014c87380 EFLAGS: 00010202
[ 21.012299] RAX: 0000000000000202 RBX: 0000000000000000 RCX: ffffffff8119215d
[ 21.013197] RDX: dffffc0000000000 RSI: 0000000000000003 RDI: 0000000000000202
[ 21.014074] RBP: ffff880014c87398 R08: 0000000000000001 R09: 0000000000000001
[ 21.014940] R10: ffffed0002990e5a R11: 0000000000000001 R12: 00000000000000a4
[ 21.015818] R13: 0000000000000000 R14: ffff88001402d680 R15: ffff880014c87970
[ 21.016634] FS: 0000000000000000(0000) GS:ffffffff842e9000(0000) knlGS:0000000000000000
[ 21.016892] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 21.016892] CR2: 00000000000000a4 CR3: 0000000004226000 CR4: 00000000000006f0
[ 21.016892] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 21.016892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 21.016892] Call Trace:
[ 21.016892] libcfs_debug_vmsg2+0x18d/0xd26:
libcfs_debug_vmsg2 at drivers/staging/lustre/lnet/libcfs/tracefile.c:317
[ 21.016892] ? cfs_trace_get_tage+0x957/0x957:
libcfs_debug_vmsg2 at drivers/staging/lustre/lnet/libcfs/tracefile.c:293
[ 21.016892] ? mark_lock+0xe9/0xa16:
mark_lock at kernel/locking/lockdep.c:3121
[ 21.016892] ? fput+0x111/0x11f:
queue_delayed_work at include/linux/workqueue.h:526
(inlined by) schedule_delayed_work at include/linux/workqueue.h:627
(inlined by) fput at fs/file_table.c:280
[ 21.016892] ? lnet_ipif_enumerate+0x541/0x588:
lnet_ipif_enumerate at drivers/staging/lustre/lnet/lnet/lib-socket.c:238
[ 21.016892] ? kasan_slab_free+0x15/0x1e:
kasan_slab_free at mm/kasan/kasan.c:528
[ 21.016892] ? kfree+0x215/0x2cb:
slab_free_hook at mm/slub.c:1388
(inlined by) slab_free_freelist_hook at mm/slub.c:1415
(inlined by) slab_free at mm/slub.c:2988
(inlined by) kfree at mm/slub.c:3944
[ 21.016892] libcfs_debug_msg+0x9b/0xc5:
libcfs_debug_msg at drivers/staging/lustre/lnet/libcfs/tracefile.c:278
[ 21.016892] ? libcfs_debug_vmsg2+0xd26/0xd26:
libcfs_debug_msg at drivers/staging/lustre/lnet/libcfs/tracefile.c:278
[ 21.016892] ksocknal_startup+0x341/0x1228:
ksocknal_enumerate_interfaces at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2634
(inlined by) ksocknal_startup at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2822
[ 21.016892] ? __asan_loadN+0x1d/0x26
[ 21.016892] ? pvclock_clocksource_read+0x162/0x296:
pvclock_clocksource_read at arch/x86/kernel/pvclock.c:89
[ 21.016892] ? ksocknal_base_startup+0xbae/0xbae:
ksocknal_startup at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2796
[ 21.016892] ? pvclock_read_flags+0x5d/0x5d:
pvclock_clocksource_read at arch/x86/kernel/pvclock.c:79
[ 21.016892] ? kasan_check_read+0x1f/0x28:
kasan_check_read at mm/kasan/kasan.c:272
[ 21.016892] ? kvm_clock_read+0x24/0x3a:
kvm_clock_read at arch/x86/kernel/kvmclock.c:89
[ 21.016892] ? kasan_check_write+0x22/0x2b:
kasan_check_write at mm/kasan/kasan.c:278
[ 21.016892] ? sched_clock_local+0xc8/0xfa
[ 21.016892] ? reacquire_held_locks+0x1cd/0x1cd:
lock_release at kernel/locking/lockdep.c:3929
[ 21.016892] ? cfs_percpt_lock+0xdf/0x1ad:
spin_lock at include/linux/spinlock.h:310
(inlined by) cfs_percpt_lock at drivers/staging/lustre/lnet/libcfs/libcfs_lock.c:111
[ 21.016892] ? do_raw_spin_unlock+0x100/0x10d:
arch_spin_unlock at include/linux/spinlock_up.h:48
(inlined by) do_raw_spin_unlock at kernel/locking/spinlock_debug.c:135
[ 21.016892] ? _raw_spin_unlock+0x37/0x4e:
__raw_spin_unlock at include/linux/spinlock_api_smp.h:151
(inlined by) _raw_spin_unlock at kernel/locking/spinlock.c:176
[ 21.016892] ? cfs_percpt_unlock+0x66/0x13a:
spin_unlock at include/linux/spinlock.h:350
(inlined by) cfs_percpt_unlock at drivers/staging/lustre/lnet/libcfs/libcfs_lock.c:140
[ 21.016892] lnet_startup_lndni+0x62a/0xf84:
lnet_startup_lndni at drivers/staging/lustre/lnet/lnet/api-ni.c:1306
[ 21.016892] ? lnet_shutdown_lndni+0xcc/0xcc:
lnet_startup_lndni at drivers/staging/lustre/lnet/lnet/api-ni.c:1210
[ 21.016892] ? __schedule+0xbff/0xc7f:
context_switch at kernel/sched/core.c:2851
(inlined by) __schedule at kernel/sched/core.c:3490
[ 21.016892] ? firmware_map_remove+0x265/0x265:
__schedule at kernel/sched/core.c:3401
[ 21.016892] ? kasan_check_read+0x1f/0x28:
kasan_check_read at mm/kasan/kasan.c:272
[ 21.016892] LNetNIInit+0x5d4/0xbf8:
lnet_startup_lndnis at drivers/staging/lustre/lnet/lnet/api-ni.c:1385
(inlined by) LNetNIInit at drivers/staging/lustre/lnet/lnet/api-ni.c:1543
[ 21.016892] ? lnet_lib_exit+0x147/0x147:
LNetNIInit at drivers/staging/lustre/lnet/lnet/api-ni.c:1497
[ 21.016892] ? del_timer+0x6d/0xf8:
del_timer at kernel/time/timer.c:1200
[ 21.016892] ? add_timer_on+0x32b/0x32b:
del_timer at kernel/time/timer.c:1193
[ 21.016892] ? _raw_spin_unlock_irqrestore+0x74/0x9e:
__raw_spin_unlock_irqrestore at include/linux/spinlock_api_smp.h:160
(inlined by) _raw_spin_unlock_irqrestore at kernel/locking/spinlock.c:184
[ 21.016892] ? trace_hardirqs_on_caller+0x2c8/0x33d:
trace_hardirqs_on_caller at kernel/locking/lockdep.c:2884
[ 21.016892] ? schedule_timeout+0x82b/0x8a7:
schedule_timeout at kernel/time/timer.c:1807
[ 21.016892] ? kasan_kmalloc+0x1c5/0x1e1:
save_stack at mm/kasan/kasan.c:454
(inlined by) set_track at mm/kasan/kasan.c:460
(inlined by) kasan_kmalloc at mm/kasan/kasan.c:553
[ 21.016892] ? console_conditional_schedule+0x44/0x44:
schedule_timeout at kernel/time/timer.c:1763
[ 21.016892] ? do_one_initcall+0x158/0x3ac:
do_one_initcall at init/main.c:883
[ 21.016892] ? kernel_init_freeable+0x2b8/0x434:
do_initcall_level at init/main.c:951
(inlined by) do_initcalls at init/main.c:959
(inlined by) do_basic_setup at init/main.c:977
(inlined by) kernel_init_freeable at init/main.c:1127
[ 21.016892] ? kernel_init+0x13/0x1ae:
kernel_init at init/main.c:1053
[ 21.016892] ? ret_from_fork+0x24/0x30:
ret_from_fork at arch/x86/entry/entry_64.S:418
[ 21.016892] ? __check_object_size+0x8d7/0xa08:
__check_object_size at mm/usercopy.c:262
[ 21.016892] ? usercopy_abort+0xb7/0xb7:
__check_object_size at mm/usercopy.c:250
[ 21.016892] ? __lock_is_held+0x39/0xbb:
__lock_is_held at kernel/locking/lockdep.c:3737 (discriminator 1)
[ 21.016892] ? mark_held_locks+0x2c/0xb2:
mark_held_locks at kernel/locking/lockdep.c:2795 (discriminator 1)
[ 21.016892] ? debug_check_no_locks_freed+0x237/0x25f:
debug_check_no_locks_freed at kernel/locking/lockdep.c:4421 (discriminator 2)
[ 21.016892] ? collect_expired_timers+0x23b/0x23b:
process_timeout at kernel/time/timer.c:1727
[ 21.016892] ? lockdep_init_map+0x10/0x19:
lockdep_init_map at kernel/locking/lockdep.c:3239
[ 21.016892] srpc_startup+0x136/0x352:
srpc_startup at drivers/staging/lustre/lnet/selftest/rpc.c:1613
[ 21.016892] ? srpc_shutdown+0x1f0/0x1f0:
srpc_startup at drivers/staging/lustre/lnet/selftest/rpc.c:1599
[ 21.016892] ? kvmalloc_node+0x6b/0xb6:
__kmalloc_node at include/linux/slab.h:383
(inlined by) kmalloc_node at include/linux/slab.h:554
(inlined by) kvmalloc_node at mm/util.c:421
[ 21.016892] lnet_selftest_init+0x33a/0x4f5:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:134
[ 21.016892] ? lnet_selftest_exit+0x13e/0x13e:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90
[ 21.016892] ? lustre_insert_debugfs+0x150/0x150:
libcfs_init at drivers/staging/lustre/lnet/libcfs/module.c:531
[ 21.016892] ? lnet_init+0x1df/0x1df:
ksocklnd_init at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2885
[ 21.016892] ? lnet_selftest_exit+0x13e/0x13e:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90
[ 21.016892] do_one_initcall+0x158/0x3ac:
do_one_initcall at init/main.c:883
[ 21.016892] ? start_kernel+0xbf5/0xbf5:
do_one_initcall at init/main.c:874
[ 21.016892] kernel_init_freeable+0x2b8/0x434:
do_initcall_level at init/main.c:951
(inlined by) do_initcalls at init/main.c:959
(inlined by) do_basic_setup at init/main.c:977
(inlined by) kernel_init_freeable at init/main.c:1127
[ 21.016892] ? rest_init+0x1d2/0x1d2:
kernel_init at init/main.c:1050
[ 21.016892] kernel_init+0x13/0x1ae:
kernel_init at init/main.c:1053
[ 21.016892] ? rest_init+0x1d2/0x1d2:
kernel_init at init/main.c:1050
[ 21.016892] ret_from_fork+0x24/0x30:
ret_from_fork at arch/x86/entry/entry_64.S:418
[ 21.016892] Code: d4 05 31 c0 5d c3 55 48 ff 05 2b bf d4 05 48 89 e5 41 55 41 54 4c 8d a7 a4 00 00 00 53 48 89 fb 41 89 f5 4c 89 e7 e8 16 a9 18 ff <66> 83 bb a4 00 00 00 02 76 24 48 c7 c6 20 37 75 85 48 c7 c7 40
[ 21.016892] RIP: cfs_trace_lock_tcd+0x25/0xeb:
cfs_trace_lock_tcd at drivers/staging/lustre/lnet/libcfs/linux/linux-tracefile.c:149 RSP: ffff880014c87380
[ 21.016892] CR2: 00000000000000a4
[ 21.016892] ---[ end trace be7ae81fe247d9ae ]---
[ 21.016892] Kernel panic - not syncing: Fatal exception

Attached the full dmesg, kconfig and reproduce scripts.

Thanks,
Fengguang


Attachments:
(No filename) (19.11 kB)
dmesg-vm-lkp-hsw01-openwrt-ia32-7:20180417120909:x86_64-randconfig-ws0-03101549:4.17.0-rc1:1 (70.19 kB)
.config (124.65 kB)
job-script (4.02 kB)
reproduce-vm-lkp-hsw01-openwrt-ia32-7:20180417120909:x86_64-randconfig-ws0-03101549:4.17.0-rc1:1 (1.87 kB)
Download all attachments

2018-04-18 14:00:56

by James Simmons

[permalink] [raw]
Subject: Re: [cfs_trace_lock_tcd] BUG: unable to handle kernel NULL pointer dereference at 00000050


> Hello,
>
> FYI this happens in mainline kernel 4.17.0-rc1.
> It looks like a new regression.
>
> [ 7.587002] lnet_selftest_init+0x2c4/0x5d9:
> lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:134
> [ 7.587002] ? lnet_selftest_exit+0x8d/0x8d:
> lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90

Are you running lnet selftest ? Is this a UMP setup? The reason I ask is
that their is a SMP handling bug in lnet selftest. If you look at the
mailing list I pushed a SMP patch series. Can you try that series and
tell me if it works for you. Thanks

2018-04-18 14:14:46

by Fengguang Wu

[permalink] [raw]
Subject: Re: [cfs_trace_lock_tcd] BUG: unable to handle kernel NULL pointer dereference at 00000050

Hi James,

On Wed, Apr 18, 2018 at 02:59:15PM +0100, James Simmons wrote:
>
>> Hello,
>>
>> FYI this happens in mainline kernel 4.17.0-rc1.
>> It looks like a new regression.
>>
>> [ 7.587002] lnet_selftest_init+0x2c4/0x5d9:
>> lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:134
>> [ 7.587002] ? lnet_selftest_exit+0x8d/0x8d:
>> lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90
>
>Are you running lnet selftest ?

Perhaps yes -- it's randconfig boot test and the .config does include
CONFIG_LNET_SELFTEST:

CONFIG_LNET=y
CONFIG_LNET_MAX_PAYLOAD=1048576
==> CONFIG_LNET_SELFTEST=y
CONFIG_LNET_XPRT_IB=y

>Is this a UMP setup?

Yes, .config has:

# CONFIG_SMP is not set

>The reason I ask is that their is a SMP handling bug in lnet
>selftest. If you look at the mailing list I pushed a SMP patch
>series. Can you try that series and tell me if it works for you.

So it looks your fixup patch is not for this case? Anyway the
reproduce-* script attached in the previous email should be fairly
straightforward to try out for reproducing the bug.

Thanks,
Fengguang

2018-04-18 14:17:32

by Fengguang Wu

[permalink] [raw]
Subject: [lnet_res_container_setup] BUG: unable to handle kernel paging request at 08000664

Hi James,

Here the .config has SMP set.

FYI this happens in mainline kernel 4.17.0-rc1.
It looks like a new regression.

It occurs in 4 out of 5 boots.

[ 6.872036] usbhid: USB HID core driver
[ 6.872383] NET: Registered protocol family 4
[ 6.872889] no options.
[ 6.874165] LNet: HW nodes: 1, HW CPU cores: 1, npartitions: 1
[ 6.874688] LNetError: 1:0:(module.c:546:libcfs_init()) misc_register: error -16
[ 7.893425] BUG: unable to handle kernel paging request at 08000664
[ 7.893989] *pde = 00000000
[ 7.894251] Oops: 0000 [#1] SMP
[ 7.894528] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G S 4.17.0-rc1 #198
[ 7.895196] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 7.895924] EIP: cfs_cpt_spread_node+0x5c/0x115:
bitmap_weight at include/linux/bitmap.h:360
(inlined by) __nodes_weight at include/linux/nodemask.h:240
(inlined by) cfs_cpt_spread_node at drivers/staging/lustre/lnet/libcfs/linux/linux-cpu.c:498
[ 7.896320] EFLAGS: 00210202 CPU: 0
[ 7.896625] EAX: d6c31320 EBX: 00000001 ECX: 00000001 EDX: 00000000
[ 7.896678] ESI: 08000664 EDI: 00000000 EBP: c7ea3e88 ESP: c7ea3e78
[ 7.896678] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 7.896678] CR0: 80050033 CR2: 08000664 CR3: 05380000 CR4: 00000690
[ 7.896678] Call Trace:
[ 7.896678] lnet_res_container_setup+0x9a/0x210:
lnet_res_container_setup at drivers/staging/lustre/lnet/lnet/api-ni.c:426
[ 7.896678] LNetNIInit+0x701/0xdef:
lnet_prepare at drivers/staging/lustre/lnet/lnet/api-ni.c:577
(inlined by) LNetNIInit at drivers/staging/lustre/lnet/lnet/api-ni.c:1517
[ 7.896678] ? ktime_get_real_seconds+0xe3/0xf9:
ktime_get_real_seconds at kernel/time/timekeeping.c:938
[ 7.896678] srpc_startup+0x8b/0x413:
srpc_startup at drivers/staging/lustre/lnet/selftest/rpc.c:1613
[ 7.896678] lnet_selftest_init+0x3de/0x78c:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:134
[ 7.896678] do_one_initcall+0x1a2/0x410:
do_one_initcall at init/main.c:883
[ 7.896678] ? lnet_selftest_exit+0xa8/0xa8:
lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90
[ 7.896678] ? parse_args+0x94/0x3b6:
parse_args at kernel/params.c:180 (discriminator 8)
[ 7.896678] kernel_init_freeable+0x3b3/0x488:
do_initcall_level at init/main.c:951
(inlined by) do_initcalls at init/main.c:959
(inlined by) do_basic_setup at init/main.c:977
(inlined by) kernel_init_freeable at init/main.c:1127
[ 7.896678] ? rest_init+0x11b/0x11b:
kernel_init at init/main.c:1050
[ 7.896678] kernel_init+0xd/0x15f:
kernel_init at init/main.c:1055
[ 7.896678] ret_from_fork+0x2e/0x40:
ret_from_fork at arch/x86/entry/entry_32.S:311
[ 7.896678] Code: c4 74 11 8b 50 04 8b 70 18 89 55 f0 8d 4a 01 89 48 04 eb 17 6b d2 0c 03 50 0c 89 d1 8b 42 08 8b 72 04 89 45 f0 8d 50 01 89 51 08 <8b> 06 6a 00 83 e0 01 e8 5c 82 d7 fe 89 c3 31 c0 85 db 0f 9e c0
[ 7.896678] EIP: cfs_cpt_spread_node+0x5c/0x115:
bitmap_weight at include/linux/bitmap.h:360
(inlined by) __nodes_weight at include/linux/nodemask.h:240
(inlined by) cfs_cpt_spread_node at drivers/staging/lustre/lnet/libcfs/linux/linux-cpu.c:498 SS:ESP: 0068:c7ea3e78
[ 7.896678] CR2: 0000000008000664
[ 7.896678] ---[ end trace 5246465f75f74727 ]---
[ 7.896678] Kernel panic - not syncing: Fatal exception

Attached the full dmesg, kconfig and reproduce scripts.

Thanks,
Fengguang


Attachments:
(No filename) (3.56 kB)
dmesg-vm-lkp-nhm-dp2-yocto-ia32-6:20180418050635:i386-randconfig-c0-04180358:4.17.0-rc1:198 (57.93 kB)
.config (129.99 kB)
job-script (4.16 kB)
reproduce-vm-lkp-nhm-dp2-yocto-ia32-6:20180418050635:i386-randconfig-c0-04180358:4.17.0-rc1:198 (1.94 kB)
Download all attachments

2018-04-18 18:38:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [cfs_trace_lock_tcd] BUG: KASAN: null-ptr-deref in cfs_trace_lock_tcd+0x25/0xeb

Ugh, that lustre code is disgusting.

I thought we were getting rid of it.

Anyway, I started looking at why the stack trace is such an incredible
mess, with lots of stale entries.

The reason (well, _one_ reason) seems to be "ksocknal_startup". It has
a 500-byte stack frame for some incomprehensible reason. I assume due
to excessive inlining, because the function itself doesn't seem to be
that bad.

Similarly, LNetNIInit has a 300-byte stack frame. So it gets pretty deep.

I'm getting the feeling that KASAN is making things worse because
probably it's disabling all the sane stack frame stuff (ie no merging
of stack slot entries, perhaps?).

Without KASAN (but also without a lot of other things, so I might be
blaming KASAN incorrectly), the stack usage of ksocknal_startup() is
just under 100 bytes, so if it is KASAN, it's really a big difference.

Anyway, apart from the excessive elements, the report seems fine, but
I'm adding Neil Brown to the cc, since he's the one that has been
making most of the lustre/lnet changes this merge window.

Also adding Andrey to check about the oddly large stack usage.

Not including the whole email with the attachements - Neil, it's on
lkml and lustre-devel if you hadn't seen it.

Linus

2018-04-18 22:41:40

by NeilBrown

[permalink] [raw]
Subject: Re: [cfs_trace_lock_tcd] BUG: KASAN: null-ptr-deref in cfs_trace_lock_tcd+0x25/0xeb

On Wed, Apr 18 2018, Linus Torvalds wrote:

> Ugh, that lustre code is disgusting.
>
> I thought we were getting rid of it.

Lots of people seem to get value out of it. So we're trying to polish
the code to make it less disgusting. This is just a little fall-out.
The smoking gun is

[ 6.528851] LNetError: 1:0:(module.c:546:libcfs_init()) misc_register: error -16

lustre registers a misc char device with the same number as USERIO.
If they both try to register, one fails.
Until recently, lustre could only be built as a module so when lustre
failed to register the char dev, the module-load fails.
Now it can be built monolithic (makes my testing easier) and the failure
mode is different. The module that tried to register the chardev rewinds
some initialization, and a subsequent module assumes that init was done,
and explodes.

There are patches in Greg's inbox to change lustre to use a dynamically
allocated minor. And it is on my todo list to get lustre to do less
initialization at module-init time (where, in a monolithic build, it is
hard to give up if some previous module failed), and more at mount time.

So this is a known bug (maybe a new manifestation) and a fix has been
posted. There is certainly room for lots more cleanup and that is
slowly happening. I'll make a note to look into the large stack frames
you observed.

Previous report of bug was
Subject: [staging] 184ecc5ceb: BUG:unable_to_handle_kernel
Message-ID: <20180319091931.gt6ijdw7ahkbtvrq@inn>


Thanks,
NeilBrown

>
> Anyway, I started looking at why the stack trace is such an incredible
> mess, with lots of stale entries.
>
> The reason (well, _one_ reason) seems to be "ksocknal_startup". It has
> a 500-byte stack frame for some incomprehensible reason. I assume due
> to excessive inlining, because the function itself doesn't seem to be
> that bad.
>
> Similarly, LNetNIInit has a 300-byte stack frame. So it gets pretty deep.
>
> I'm getting the feeling that KASAN is making things worse because
> probably it's disabling all the sane stack frame stuff (ie no merging
> of stack slot entries, perhaps?).
>
> Without KASAN (but also without a lot of other things, so I might be
> blaming KASAN incorrectly), the stack usage of ksocknal_startup() is
> just under 100 bytes, so if it is KASAN, it's really a big difference.
>
> Anyway, apart from the excessive elements, the report seems fine, but
> I'm adding Neil Brown to the cc, since he's the one that has been
> making most of the lustre/lnet changes this merge window.
>
> Also adding Andrey to check about the oddly large stack usage.
>
> Not including the whole email with the attachements - Neil, it's on
> lkml and lustre-devel if you hadn't seen it.
>
> Linus


Attachments:
signature.asc (847.00 B)

2018-04-19 13:35:58

by Andrey Ryabinin

[permalink] [raw]
Subject: Re: [cfs_trace_lock_tcd] BUG: KASAN: null-ptr-deref in cfs_trace_lock_tcd+0x25/0xeb



On 04/18/2018 09:37 PM, Linus Torvalds wrote:
> Ugh, that lustre code is disgusting.
>
> I thought we were getting rid of it.
>
> Anyway, I started looking at why the stack trace is such an incredible
> mess, with lots of stale entries.
>
> The reason (well, _one_ reason) seems to be "ksocknal_startup". It has
> a 500-byte stack frame for some incomprehensible reason. I assume due
> to excessive inlining, because the function itself doesn't seem to be
> that bad.
>
> Similarly, LNetNIInit has a 300-byte stack frame. So it gets pretty deep.
>
> I'm getting the feeling that KASAN is making things worse because
> probably it's disabling all the sane stack frame stuff (ie no merging
> of stack slot entries, perhaps?).
>

AFAIR no merging of stack slots policy enabled only if -fsanitize-address-use-after-scope
is on (which is CONFIG_KASAN_EXTRA). This feature does cause sometimes significant stack bloat,
but hasn't been proven to be very useful, so I wouldn't mind disabling it completely.

So far I know only about a single BUG - https://lkml.kernel.org/r/<[email protected]>
it has found.
There are also a lot of other


> Without KASAN (but also without a lot of other things, so I might be
> blaming KASAN incorrectly), the stack usage of ksocknal_startup() is
> just under 100 bytes, so if it is KASAN, it's really a big difference.
>

Yes, it's because of KASAN:
CONFIG_KASAN=n
socklnd.c:2795:1:ksocknal_startup 144 static

CONFIG_KASAN=y
CONFIG_KASAN_OUTLINE=y
CONFIG_KASAN_EXTRA=n
socklnd.c:2795:1:ksocknal_startup 552 static

CONFIG_KASAN=y
CONFIG_KASAN_OUTLINE=y
CONFIG_KASAN_EXTRA=y
socklnd.c:2795:1:ksocknal_startup 624 static

It's expected that KASAN may cause sometimes significant stack usage growth.
This is needed to catch out-of-bounds accesses to stack data.
When compiler can't proof that access to stack variable is valid (e.g. reference to
stack variable passed to some external function), it will create redzones around such
stack variable.

E.g. ksocknal_enumerate_interfaces() which is called only from ksocknal_startup(), thus probably
inlined into ksocknal_startup() does this:


for (i = j = 0; i < n; i++) {
int up;
__u32 ip;
__u32 mask;

if (!strcmp(names[i], "lo")) /* skip the loopback IF */
continue;

rc = lnet_ipif_query(names[i], &up, &ip, &mask);


With KASAN stack might look something like this:
[32-byte left redzone of the stack frame] [up (4 bytes)] [28-bytes redzone][ip (4 bytes)] [28-bytes redzone][mask (4 bytes)] [28-bytes redzone][32-byte right redzone of the stack frame]

GCC always use 32-bytes redzones. AFAIK clang is more smart about this, it has adaptive redzone policy - smaller redzones for small variables, and bigger for big.

In this particular case, the best way to reduce stack usage is to refactor the code.
1) Drop 'int *up' argument from lnet_ipif_query(). When interface is down lnet_ipif_query() sets up to zero and doesn't return error.
But all callers treat up == 0 as error. So instead, lnet_ipif_query() should simply return error code, and 'up' won't be needed.
This will simplify the code, and should drop the stack usage with KASAN and without KASAN.

2) Instead of using local ip, mask variables, pass pointers '&net->ksnn_interfaces[j].ksni_ipaddr', '&net->ksnn_interfaces[j].ksni_netmask'.
As in 1) this should alst drop the stack usage both with KASAN and without KASAN





2018-04-19 13:56:19

by Andrey Ryabinin

[permalink] [raw]
Subject: Re: [cfs_trace_lock_tcd] BUG: KASAN: null-ptr-deref in cfs_trace_lock_tcd+0x25/0xeb



On 04/19/2018 04:35 PM, Andrey Ryabinin wrote:
>
>
> On 04/18/2018 09:37 PM, Linus Torvalds wrote:
>> Ugh, that lustre code is disgusting.
>>
>> I thought we were getting rid of it.
>>
>> Anyway, I started looking at why the stack trace is such an incredible
>> mess, with lots of stale entries.
>>
>> The reason (well, _one_ reason) seems to be "ksocknal_startup". It has
>> a 500-byte stack frame for some incomprehensible reason. I assume due
>> to excessive inlining, because the function itself doesn't seem to be
>> that bad.
>>
>> Similarly, LNetNIInit has a 300-byte stack frame. So it gets pretty deep.
>>
>> I'm getting the feeling that KASAN is making things worse because
>> probably it's disabling all the sane stack frame stuff (ie no merging
>> of stack slot entries, perhaps?).
>>
>
> AFAIR no merging of stack slots policy enabled only if -fsanitize-address-use-after-scope
> is on (which is CONFIG_KASAN_EXTRA). This feature does cause sometimes significant stack bloat,
> but hasn't been proven to be very useful, so I wouldn't mind disabling it completely.
>
> So far I know only about a single BUG - https://lkml.kernel.org/r/<[email protected]>
> it has found.

Actually, there is one more - https://syzkaller.appspot.com/bug?id=6a929b72a32ca0b1a6985126fa1bc77c03c12304
so two bugs.

> There are also a lot of other

I didn't finish this sentence:

There are also a lot of other reports about use-after-scope, but seem all of them are false positives
caused by STRUCTLEAK plugin.