LinuxLists.cc - [syzbot] KASAN: use-after-free Read in remove_wait

2021-11-16 09:23:35

Subject: [syzbot] KASAN: use-after-free Read in remove_wait_queue (3)

Hello,

syzbot found the following issue on:

HEAD commit: cc0356d6a02e Merge tag 'x86_core_for_v5.16_rc1' of git://g..
git tree: net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=102e7ce6b00000
kernel config: https://syzkaller.appspot.com/x/.config?x=a5d447cdc3ae81d9
dashboard link: https://syzkaller.appspot.com/bug?extid=cdb5dd11c97cc532efad
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]

==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4885
Read of size 8 at addr ffff888032563540 by task syz-executor.2/8869

CPU: 1 PID: 8869 Comm: syz-executor.2 Not tainted 5.15.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x6c/0x309 mm/kasan/report.c:256
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
__lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4885
lock_acquire kernel/locking/lockdep.c:5625 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
remove_wait_queue+0x1d/0x180 kernel/sched/wait.c:55
ep_remove_wait_queue+0x88/0x1a0 fs/eventpoll.c:545
ep_unregister_pollwait fs/eventpoll.c:561 [inline]
ep_free+0x18a/0x390 fs/eventpoll.c:756
ep_eventpoll_release+0x41/0x60 fs/eventpoll.c:788
__fput+0x286/0x9f0 fs/file_table.c:280
task_work_run+0xdd/0x1a0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0xbaa/0x2a20 kernel/exit.c:826
do_group_exit+0x125/0x310 kernel/exit.c:923
get_signal+0x47d/0x21d0 kernel/signal.c:2855
arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f8b46801ae9
Code: Unable to access opcode bytes at RIP 0x7f8b46801abf.
RSP: 002b:00007f8b43d77218 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: 0000000000000000 RBX: 00007f8b46914f68 RCX: 00007f8b46801ae9
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f8b46914f68
RBP: 00007f8b46914f60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8b46914f6c
R13: 00007fff0432647f R14: 00007f8b43d77300 R15: 0000000000022000
</TASK>

Allocated by task 8869:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
____kasan_kmalloc mm/kasan/common.c:513 [inline]
____kasan_kmalloc mm/kasan/common.c:472 [inline]
__kasan_kmalloc+0xa4/0xd0 mm/kasan/common.c:522
kmalloc include/linux/slab.h:591 [inline]
psi_trigger_create.part.0+0x15e/0x7f0 kernel/sched/psi.c:1141
cgroup_pressure_write+0x15d/0x6b0 kernel/cgroup/cgroup.c:3622
cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3829
kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
call_write_iter include/linux/fs.h:2161 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x12d/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 8869:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1700 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1726
slab_free mm/slub.c:3492 [inline]
kfree+0xf3/0x550 mm/slub.c:4552
cgroup_pressure_write+0x18d/0x6b0 kernel/cgroup/cgroup.c:3628
cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3829
kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
call_write_iter include/linux/fs.h:2161 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x12d/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Last potentially related work creation:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_record_aux_stack+0xe9/0x110 mm/kasan/generic.c:348
insert_work+0x48/0x370 kernel/workqueue.c:1353
__queue_work+0x5ca/0xee0 kernel/workqueue.c:1519
queue_work_on+0xee/0x110 kernel/workqueue.c:1546
queue_work include/linux/workqueue.h:501 [inline]
call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
rx_queue_add_kobject net/core/net-sysfs.c:1069 [inline]
net_rx_queue_update_kobjects+0xf8/0x500 net/core/net-sysfs.c:1109
register_queue_kobjects net/core/net-sysfs.c:1766 [inline]
netdev_register_kobject+0x275/0x430 net/core/net-sysfs.c:2014
register_netdevice+0xd31/0x1500 net/core/dev.c:10330
__ip_tunnel_create+0x398/0x5c0 net/ipv4/ip_tunnel.c:267
ip_tunnel_init_net+0x2e4/0x9d0 net/ipv4/ip_tunnel.c:1070
ops_init+0xaf/0x470 net/core/net_namespace.c:140
setup_net+0x40f/0xa30 net/core/net_namespace.c:326
copy_net_ns+0x319/0x760 net/core/net_namespace.c:470
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
ksys_unshare+0x445/0x920 kernel/fork.c:3076
__do_sys_unshare kernel/fork.c:3150 [inline]
__se_sys_unshare kernel/fork.c:3148 [inline]
__x64_sys_unshare+0x2d/0x40 kernel/fork.c:3148
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Second to last potentially related work creation:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_record_aux_stack+0xe9/0x110 mm/kasan/generic.c:348
insert_work+0x48/0x370 kernel/workqueue.c:1353
__queue_work+0x5ca/0xee0 kernel/workqueue.c:1519
queue_work_on+0xee/0x110 kernel/workqueue.c:1546
queue_work include/linux/workqueue.h:501 [inline]
call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
rx_queue_add_kobject net/core/net-sysfs.c:1069 [inline]
net_rx_queue_update_kobjects+0xf8/0x500 net/core/net-sysfs.c:1109
register_queue_kobjects net/core/net-sysfs.c:1766 [inline]
netdev_register_kobject+0x275/0x430 net/core/net-sysfs.c:2014
register_netdevice+0xd31/0x1500 net/core/dev.c:10330
__ip_tunnel_create+0x398/0x5c0 net/ipv4/ip_tunnel.c:267
ip_tunnel_init_net+0x2e4/0x9d0 net/ipv4/ip_tunnel.c:1070
vti_init_net+0x2a/0x370 net/ipv4/ip_vti.c:504
ops_init+0xaf/0x470 net/core/net_namespace.c:140
setup_net+0x40f/0xa30 net/core/net_namespace.c:326
copy_net_ns+0x319/0x760 net/core/net_namespace.c:470
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
ksys_unshare+0x445/0x920 kernel/fork.c:3076
__do_sys_unshare kernel/fork.c:3150 [inline]
__se_sys_unshare kernel/fork.c:3148 [inline]
__x64_sys_unshare+0x2d/0x40 kernel/fork.c:3148
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff888032563500
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 64 bytes inside of
192-byte region [ffff888032563500, ffff8880325635c0)
The buggy address belongs to the page:
page:ffffea0000c958c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x32563
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 ffffea0001ef2780 0000000600000002 ffff888010c41a00
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112cc0(GFP_USER|__GFP_NOWARN|__GFP_NORETRY), pid 3310, ts 611411618313, free_ts 611397869806
prep_new_page mm/page_alloc.c:2426 [inline]
get_page_from_freelist+0xa72/0x2f80 mm/page_alloc.c:4155
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5381
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1770 [inline]
allocate_slab mm/slub.c:1907 [inline]
new_slab+0x319/0x490 mm/slub.c:1970
___slab_alloc+0x921/0xfe0 mm/slub.c:3001
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3088
slab_alloc_node mm/slub.c:3179 [inline]
slab_alloc mm/slub.c:3221 [inline]
__kmalloc_track_caller+0x2ef/0x310 mm/slub.c:4916
__do_krealloc mm/slab_common.c:1208 [inline]
krealloc+0x87/0xf0 mm/slab_common.c:1241
push_jmp_history kernel/bpf/verifier.c:2278 [inline]
is_state_visited kernel/bpf/verifier.c:10967 [inline]
do_check kernel/bpf/verifier.c:11107 [inline]
do_check_common+0x3521/0xcf50 kernel/bpf/verifier.c:13374
do_check_main kernel/bpf/verifier.c:13437 [inline]
bpf_check+0x87ca/0xbc90 kernel/bpf/verifier.c:14004
bpf_prog_load+0xf4c/0x21e0 kernel/bpf/syscall.c:2329
__sys_bpf+0x67e/0x5f00 kernel/bpf/syscall.c:4618
__do_sys_bpf kernel/bpf/syscall.c:4722 [inline]
__se_sys_bpf kernel/bpf/syscall.c:4720 [inline]
__x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:4720
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1340 [inline]
free_pcp_prepare+0x326/0x810 mm/page_alloc.c:1391
free_unref_page_prepare mm/page_alloc.c:3317 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3396
__vunmap+0x781/0xb70 mm/vmalloc.c:2621
__vfree+0x3c/0xd0 mm/vmalloc.c:2669
vfree+0x5a/0x90 mm/vmalloc.c:2700
bpf_jit_free+0xbb/0x1c0
bpf_prog_free_deferred+0x5c1/0x790 kernel/bpf/core.c:2292
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2297
worker_thread+0x658/0x11f0 kernel/workqueue.c:2444
kthread+0x405/0x4f0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

Memory state around the buggy address:
ffff888032563400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888032563480: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888032563500: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888032563580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff888032563600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

2021-12-10 22:42:32

by syzbot

[permalink] [raw]

Subject: Re: [syzbot] KASAN: use-after-free Read in remove_wait_queue (3)

syzbot has found a reproducer for the following issue on:

HEAD commit: e5d75fc20b92 sh_eth: Use dev_err_probe() helper
git tree: net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1540cdceb00000
kernel config: https://syzkaller.appspot.com/x/.config?x=24fd48984584829b
dashboard link: https://syzkaller.appspot.com/bug?extid=cdb5dd11c97cc532efad
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15de00bab00000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15ad646db00000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]

==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897
Read of size 8 at addr ffff888015be3740 by task syz-executor161/3598

CPU: 1 PID: 3598 Comm: syz-executor161 Not tainted 5.16.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
__lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897
lock_acquire kernel/locking/lockdep.c:5637 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
remove_wait_queue+0x1d/0x180 kernel/sched/wait.c:55
ep_remove_wait_queue+0x88/0x1a0 fs/eventpoll.c:545
ep_unregister_pollwait fs/eventpoll.c:561 [inline]
ep_remove+0x106/0x9c0 fs/eventpoll.c:690
eventpoll_release_file+0xe1/0x130 fs/eventpoll.c:923
eventpoll_release include/linux/eventpoll.h:53 [inline]
__fput+0x87b/0x9f0 fs/file_table.c:271
task_work_run+0xdd/0x1a0 kernel/task_work.c:164
tracehook_notify_resume include/linux/tracehook.h:189 [inline]
exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f3167c0def3
Code: c7 c2 c0 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8
RSP: 002b:00007ffddef2e488 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007f3167c0def3
RDX: 000000000000002f RSI: 0000000020001340 RDI: 0000000000000004
RBP: 0000000000000000 R08: 0000000000000014 R09: 00007ffddef2e4b0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffddef2e4ac
R13: 00007ffddef2e4c0 R14: 00007ffddef2e500 R15: 0000000000000000
</TASK>

Allocated by task 3598:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
____kasan_kmalloc mm/kasan/common.c:513 [inline]
____kasan_kmalloc mm/kasan/common.c:472 [inline]
__kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:522
kmalloc include/linux/slab.h:590 [inline]
psi_trigger_create.part.0+0x15e/0x7f0 kernel/sched/psi.c:1141
cgroup_pressure_write+0x15d/0x6b0 kernel/cgroup/cgroup.c:3645
cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852
kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x12d/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 3598:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1723 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
slab_free mm/slub.c:3513 [inline]
kfree+0xf6/0x560 mm/slub.c:4561
cgroup_pressure_write+0x18d/0x6b0 kernel/cgroup/cgroup.c:3651
cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852
kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
call_write_iter include/linux/fs.h:2162 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x12d/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff888015be3700
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 64 bytes inside of
192-byte region [ffff888015be3700, ffff888015be37c0)
The buggy address belongs to the page:
page:ffffea000056f8c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x15be3
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 0000000000000000 dead000000000001 ffff888010c41a00
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 1983850449, free_ts 0
prep_new_page mm/page_alloc.c:2418 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2036
alloc_pages+0x29f/0x300 mm/mempolicy.c:2186
alloc_slab_page mm/slub.c:1793 [inline]
allocate_slab mm/slub.c:1930 [inline]
new_slab+0x32d/0x4a0 mm/slub.c:1993
___slab_alloc+0x918/0xfe0 mm/slub.c:3022
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
slab_alloc_node mm/slub.c:3200 [inline]
slab_alloc mm/slub.c:3242 [inline]
kmem_cache_alloc_trace+0x289/0x2c0 mm/slub.c:3259
kmalloc include/linux/slab.h:590 [inline]
kzalloc include/linux/slab.h:724 [inline]
call_usermodehelper_setup+0x97/0x340 kernel/umh.c:365
kobject_uevent_env+0xf73/0x1650 lib/kobject_uevent.c:614
version_sysfs_builtin kernel/params.c:878 [inline]
param_sysfs_init+0x146/0x43b kernel/params.c:969
do_one_initcall+0x103/0x650 init/main.c:1297
do_initcall_level init/main.c:1370 [inline]
do_initcalls init/main.c:1386 [inline]
do_basic_setup init/main.c:1405 [inline]
kernel_init_freeable+0x6b1/0x73a init/main.c:1610
kernel_init+0x1a/0x1d0 init/main.c:1499
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
page_owner free stack trace missing

Memory state around the buggy address:
ffff888015be3600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888015be3680: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
>ffff888015be3700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888015be3780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff888015be3800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

2021-12-11 03:00:41

by Eric Biggers

[permalink] [raw]

Subject: Re: [syzbot] KASAN: use-after-free Read in remove_wait_queue (3)

On Sat, Dec 11, 2021 at 09:56:20AM +0800, Hillf Danton wrote:
> On Fri, 10 Dec 2021 14:42:22 -0800
> > syzbot has found a reproducer for the following issue on:
> >
> > HEAD commit: e5d75fc20b92 sh_eth: Use dev_err_probe() helper
> > git tree: net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1540cdceb00000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=24fd48984584829b
> > dashboard link: https://syzkaller.appspot.com/bug?extid=cdb5dd11c97cc532efad
> > compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15de00bab00000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15ad646db00000
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: [email protected]
> >
> > ==================================================================
> > BUG: KASAN: use-after-free in __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897
> > Read of size 8 at addr ffff888015be3740 by task syz-executor161/3598
> >
> > CPU: 1 PID: 3598 Comm: syz-executor161 Not tainted 5.16.0-rc4-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > Call Trace:
> > <TASK>
> > __dump_stack lib/dump_stack.c:88 [inline]
> > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> > print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
> > __kasan_report mm/kasan/report.c:433 [inline]
> > kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
> > __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897
> > lock_acquire kernel/locking/lockdep.c:5637 [inline]
> > lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
> > __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
> > _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
> > remove_wait_queue+0x1d/0x180 kernel/sched/wait.c:55
> > ep_remove_wait_queue+0x88/0x1a0 fs/eventpoll.c:545
> > ep_unregister_pollwait fs/eventpoll.c:561 [inline]
> > ep_remove+0x106/0x9c0 fs/eventpoll.c:690
> > eventpoll_release_file+0xe1/0x130 fs/eventpoll.c:923
> > eventpoll_release include/linux/eventpoll.h:53 [inline]
> > __fput+0x87b/0x9f0 fs/file_table.c:271
> > task_work_run+0xdd/0x1a0 kernel/task_work.c:164
> > tracehook_notify_resume include/linux/tracehook.h:189 [inline]
> > exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
> > exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
> > __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
> > syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
> > do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
> > entry_SYSCALL_64_after_hwframe+0x44/0xae
> > RIP: 0033:0x7f3167c0def3
> > Code: c7 c2 c0 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8
> > RSP: 002b:00007ffddef2e488 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> > RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007f3167c0def3
> > RDX: 000000000000002f RSI: 0000000020001340 RDI: 0000000000000004
> > RBP: 0000000000000000 R08: 0000000000000014 R09: 00007ffddef2e4b0
> > R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffddef2e4ac
> > R13: 00007ffddef2e4c0 R14: 00007ffddef2e500 R15: 0000000000000000
> > </TASK>
> >
> > Allocated by task 3598:
> > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
> > kasan_set_track mm/kasan/common.c:46 [inline]
> > set_alloc_info mm/kasan/common.c:434 [inline]
> > ____kasan_kmalloc mm/kasan/common.c:513 [inline]
> > ____kasan_kmalloc mm/kasan/common.c:472 [inline]
> > __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:522
> > kmalloc include/linux/slab.h:590 [inline]
> > psi_trigger_create.part.0+0x15e/0x7f0 kernel/sched/psi.c:1141
> > cgroup_pressure_write+0x15d/0x6b0 kernel/cgroup/cgroup.c:3645
> > cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852
> > kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
> > call_write_iter include/linux/fs.h:2162 [inline]
> > new_sync_write+0x429/0x660 fs/read_write.c:503
> > vfs_write+0x7cd/0xae0 fs/read_write.c:590
> > ksys_write+0x12d/0x250 fs/read_write.c:643
> > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > entry_SYSCALL_64_after_hwframe+0x44/0xae
> >
> > Freed by task 3598:
> > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
> > kasan_set_track+0x21/0x30 mm/kasan/common.c:46
> > kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
> > ____kasan_slab_free mm/kasan/common.c:366 [inline]
> > ____kasan_slab_free mm/kasan/common.c:328 [inline]
> > __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
> > kasan_slab_free include/linux/kasan.h:235 [inline]
> > slab_free_hook mm/slub.c:1723 [inline]
> > slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
> > slab_free mm/slub.c:3513 [inline]
> > kfree+0xf6/0x560 mm/slub.c:4561
> > cgroup_pressure_write+0x18d/0x6b0 kernel/cgroup/cgroup.c:3651
> > cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852
> > kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
> > call_write_iter include/linux/fs.h:2162 [inline]
> > new_sync_write+0x429/0x660 fs/read_write.c:503
> > vfs_write+0x7cd/0xae0 fs/read_write.c:590
> > ksys_write+0x12d/0x250 fs/read_write.c:643
> > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > entry_SYSCALL_64_after_hwframe+0x44/0xae
> >
> > The buggy address belongs to the object at ffff888015be3700
> > which belongs to the cache kmalloc-192 of size 192
> > The buggy address is located 64 bytes inside of
> > 192-byte region [ffff888015be3700, ffff888015be37c0)
> > The buggy address belongs to the page:
> > page:ffffea000056f8c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x15be3
> > flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
> > raw: 00fff00000000200 0000000000000000 dead000000000001 ffff888010c41a00
> > raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
> > page dumped because: kasan: bad access detected
> > page_owner tracks the page as allocated
> > page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 1983850449, free_ts 0
> > prep_new_page mm/page_alloc.c:2418 [inline]
> > get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
> > __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
> > alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2036
> > alloc_pages+0x29f/0x300 mm/mempolicy.c:2186
> > alloc_slab_page mm/slub.c:1793 [inline]
> > allocate_slab mm/slub.c:1930 [inline]
> > new_slab+0x32d/0x4a0 mm/slub.c:1993
> > ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
> > __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
> > slab_alloc_node mm/slub.c:3200 [inline]
> > slab_alloc mm/slub.c:3242 [inline]
> > kmem_cache_alloc_trace+0x289/0x2c0 mm/slub.c:3259
> > kmalloc include/linux/slab.h:590 [inline]
> > kzalloc include/linux/slab.h:724 [inline]
> > call_usermodehelper_setup+0x97/0x340 kernel/umh.c:365
> > kobject_uevent_env+0xf73/0x1650 lib/kobject_uevent.c:614
> > version_sysfs_builtin kernel/params.c:878 [inline]
> > param_sysfs_init+0x146/0x43b kernel/params.c:969
> > do_one_initcall+0x103/0x650 init/main.c:1297
> > do_initcall_level init/main.c:1370 [inline]
> > do_initcalls init/main.c:1386 [inline]
> > do_basic_setup init/main.c:1405 [inline]
> > kernel_init_freeable+0x6b1/0x73a init/main.c:1610
> > kernel_init+0x1a/0x1d0 init/main.c:1499
> > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
> > page_owner free stack trace missing
> >
> > Memory state around the buggy address:
> > ffff888015be3600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > ffff888015be3680: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
> > >ffff888015be3700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > ^
> > ffff888015be3780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> > ffff888015be3800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > ==================================================================
>
> Hey Eric
>
> Let us know if this uaf adds another call site for what you added [1].
>
> Hillf
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> +++ x/kernel/sched/psi.c
> @@ -1193,7 +1193,7 @@ static void psi_trigger_destroy(struct k
> * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> * from under a polling process.
> */
> - wake_up_interruptible(&t->event_wait);
> + wake_up_pollfree(&t->event_wait);
>
> mutex_lock(&group->trigger_lock);

[added linux-mm and all maintainers for kernel/sched/psi.c]

Well, it is the same sort of issue, but POLLFREE is *not* enough here. POLLFREE
only works if the lifetime of waitqueue is tied to the polling task, as blocking
polls don't handle it -- only non-blocking polls do.

The kernel/sched/psi.c use case is just totally broken, since the lifetime of
its waitqueue is totally arbitrary; the open file descriptor can be written to
at any time by any process, which causes the waitqueue to be freed. So it will
cause a use-after-free even for regular blocking poll().

To fix this, I think the psi trigger stuff will need to be refactored to have
just one waitqueue per open file. We need to be removing uses of POLLFREE, not
adding new ones. (See Linus' comments on POLLFREE here:
https://lore.kernel.org/lkml/CAHk-=wgvt7PH+AU_29H95tJQZ9FnhS8vVmymbhpZ6NZ7yaAigw@mail.gmail.com/)

Here are some repros:

#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>
int main()
{
int fd = open("/proc/pressure/cpu", O_RDWR);
int epfd = epoll_create(1);
const char trigger[] = "some 100000 1000000";
struct epoll_event event = { .events = EPOLLIN };

write(fd, trigger, sizeof(trigger));
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
write(fd, trigger, sizeof(trigger));
}

#include <fcntl.h>
#include <sys/poll.h>
#include <unistd.h>
int main()
{
int fd = open("/proc/pressure/cpu", O_RDWR);
const char trigger[] = "some 100000 1000000";

if (fork()) {
struct pollfd pfd = { .fd = fd, .events = POLLIN };

for (;;)
poll(&pfd, 1, -1);
} else {
for (;;)
write(fd, trigger, sizeof(trigger));
}
}

2022-01-05 15:21:13

by Eric Biggers

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

[changed subject line to hopefully get people to stop ignoring this]

Please see my message below where I explained the problem in detail. Any
response from the maintainers of kernel/sched/psi.c? There are a lot of you:

$ ./scripts/get_maintainer.pl kernel/sched/psi.c
Johannes Weiner <[email protected]> (maintainer:PRESSURE STALL INFORMATION (PSI))
Ingo Molnar <[email protected]> (maintainer:SCHEDULER)
Peter Zijlstra <[email protected]> (maintainer:SCHEDULER)
Juri Lelli <[email protected]> (maintainer:SCHEDULER)
Vincent Guittot <[email protected]> (maintainer:SCHEDULER)
Dietmar Eggemann <[email protected]> (reviewer:SCHEDULER)
Steven Rostedt <[email protected]> (reviewer:SCHEDULER)
Ben Segall <[email protected]> (reviewer:SCHEDULER)
Mel Gorman <[email protected]> (reviewer:SCHEDULER)
Daniel Bristot de Oliveira <[email protected]> (reviewer:SCHEDULER)
[email protected] (open list:SCHEDULER)

On Fri, Dec 10, 2021 at 07:00:26PM -0800, Eric Biggers wrote:
> On Sat, Dec 11, 2021 at 09:56:20AM +0800, Hillf Danton wrote:
> > On Fri, 10 Dec 2021 14:42:22 -0800
> > > syzbot has found a reproducer for the following issue on:
> > >
> > > HEAD commit: e5d75fc20b92 sh_eth: Use dev_err_probe() helper
> > > git tree: net-next
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=1540cdceb00000
> > > kernel config: https://syzkaller.appspot.com/x/.config?x=24fd48984584829b
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=cdb5dd11c97cc532efad
> > > compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15de00bab00000
> > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15ad646db00000
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: [email protected]
> > >
> > > ==================================================================
> > > BUG: KASAN: use-after-free in __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897
> > > Read of size 8 at addr ffff888015be3740 by task syz-executor161/3598
> > >
> > > CPU: 1 PID: 3598 Comm: syz-executor161 Not tainted 5.16.0-rc4-syzkaller #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > > Call Trace:
> > > <TASK>
> > > __dump_stack lib/dump_stack.c:88 [inline]
> > > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> > > print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
> > > __kasan_report mm/kasan/report.c:433 [inline]
> > > kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
> > > __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897
> > > lock_acquire kernel/locking/lockdep.c:5637 [inline]
> > > lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
> > > __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
> > > _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
> > > remove_wait_queue+0x1d/0x180 kernel/sched/wait.c:55
> > > ep_remove_wait_queue+0x88/0x1a0 fs/eventpoll.c:545
> > > ep_unregister_pollwait fs/eventpoll.c:561 [inline]
> > > ep_remove+0x106/0x9c0 fs/eventpoll.c:690
> > > eventpoll_release_file+0xe1/0x130 fs/eventpoll.c:923
> > > eventpoll_release include/linux/eventpoll.h:53 [inline]
> > > __fput+0x87b/0x9f0 fs/file_table.c:271
> > > task_work_run+0xdd/0x1a0 kernel/task_work.c:164
> > > tracehook_notify_resume include/linux/tracehook.h:189 [inline]
> > > exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
> > > exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
> > > __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
> > > syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
> > > do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
> > > entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > RIP: 0033:0x7f3167c0def3
> > > Code: c7 c2 c0 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8
> > > RSP: 002b:00007ffddef2e488 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> > > RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007f3167c0def3
> > > RDX: 000000000000002f RSI: 0000000020001340 RDI: 0000000000000004
> > > RBP: 0000000000000000 R08: 0000000000000014 R09: 00007ffddef2e4b0
> > > R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffddef2e4ac
> > > R13: 00007ffddef2e4c0 R14: 00007ffddef2e500 R15: 0000000000000000
> > > </TASK>
> > >
> > > Allocated by task 3598:
> > > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
> > > kasan_set_track mm/kasan/common.c:46 [inline]
> > > set_alloc_info mm/kasan/common.c:434 [inline]
> > > ____kasan_kmalloc mm/kasan/common.c:513 [inline]
> > > ____kasan_kmalloc mm/kasan/common.c:472 [inline]
> > > __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:522
> > > kmalloc include/linux/slab.h:590 [inline]
> > > psi_trigger_create.part.0+0x15e/0x7f0 kernel/sched/psi.c:1141
> > > cgroup_pressure_write+0x15d/0x6b0 kernel/cgroup/cgroup.c:3645
> > > cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852
> > > kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
> > > call_write_iter include/linux/fs.h:2162 [inline]
> > > new_sync_write+0x429/0x660 fs/read_write.c:503
> > > vfs_write+0x7cd/0xae0 fs/read_write.c:590
> > > ksys_write+0x12d/0x250 fs/read_write.c:643
> > > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > > entry_SYSCALL_64_after_hwframe+0x44/0xae
> > >
> > > Freed by task 3598:
> > > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
> > > kasan_set_track+0x21/0x30 mm/kasan/common.c:46
> > > kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
> > > ____kasan_slab_free mm/kasan/common.c:366 [inline]
> > > ____kasan_slab_free mm/kasan/common.c:328 [inline]
> > > __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
> > > kasan_slab_free include/linux/kasan.h:235 [inline]
> > > slab_free_hook mm/slub.c:1723 [inline]
> > > slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
> > > slab_free mm/slub.c:3513 [inline]
> > > kfree+0xf6/0x560 mm/slub.c:4561
> > > cgroup_pressure_write+0x18d/0x6b0 kernel/cgroup/cgroup.c:3651
> > > cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852
> > > kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
> > > call_write_iter include/linux/fs.h:2162 [inline]
> > > new_sync_write+0x429/0x660 fs/read_write.c:503
> > > vfs_write+0x7cd/0xae0 fs/read_write.c:590
> > > ksys_write+0x12d/0x250 fs/read_write.c:643
> > > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > > entry_SYSCALL_64_after_hwframe+0x44/0xae
> > >
> > > The buggy address belongs to the object at ffff888015be3700
> > > which belongs to the cache kmalloc-192 of size 192
> > > The buggy address is located 64 bytes inside of
> > > 192-byte region [ffff888015be3700, ffff888015be37c0)
> > > The buggy address belongs to the page:
> > > page:ffffea000056f8c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x15be3
> > > flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
> > > raw: 00fff00000000200 0000000000000000 dead000000000001 ffff888010c41a00
> > > raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
> > > page dumped because: kasan: bad access detected
> > > page_owner tracks the page as allocated
> > > page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 1983850449, free_ts 0
> > > prep_new_page mm/page_alloc.c:2418 [inline]
> > > get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
> > > __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
> > > alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2036
> > > alloc_pages+0x29f/0x300 mm/mempolicy.c:2186
> > > alloc_slab_page mm/slub.c:1793 [inline]
> > > allocate_slab mm/slub.c:1930 [inline]
> > > new_slab+0x32d/0x4a0 mm/slub.c:1993
> > > ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
> > > __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
> > > slab_alloc_node mm/slub.c:3200 [inline]
> > > slab_alloc mm/slub.c:3242 [inline]
> > > kmem_cache_alloc_trace+0x289/0x2c0 mm/slub.c:3259
> > > kmalloc include/linux/slab.h:590 [inline]
> > > kzalloc include/linux/slab.h:724 [inline]
> > > call_usermodehelper_setup+0x97/0x340 kernel/umh.c:365
> > > kobject_uevent_env+0xf73/0x1650 lib/kobject_uevent.c:614
> > > version_sysfs_builtin kernel/params.c:878 [inline]
> > > param_sysfs_init+0x146/0x43b kernel/params.c:969
> > > do_one_initcall+0x103/0x650 init/main.c:1297
> > > do_initcall_level init/main.c:1370 [inline]
> > > do_initcalls init/main.c:1386 [inline]
> > > do_basic_setup init/main.c:1405 [inline]
> > > kernel_init_freeable+0x6b1/0x73a init/main.c:1610
> > > kernel_init+0x1a/0x1d0 init/main.c:1499
> > > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
> > > page_owner free stack trace missing
> > >
> > > Memory state around the buggy address:
> > > ffff888015be3600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > ffff888015be3680: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
> > > >ffff888015be3700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > ^
> > > ffff888015be3780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> > > ffff888015be3800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > ==================================================================
> >
> > Hey Eric
> >
> > Let us know if this uaf adds another call site for what you added [1].
> >
> > Hillf
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> >
> > +++ x/kernel/sched/psi.c
> > @@ -1193,7 +1193,7 @@ static void psi_trigger_destroy(struct k
> > * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > * from under a polling process.
> > */
> > - wake_up_interruptible(&t->event_wait);
> > + wake_up_pollfree(&t->event_wait);
> >
> > mutex_lock(&group->trigger_lock);
>
> [added linux-mm and all maintainers for kernel/sched/psi.c]
>
> Well, it is the same sort of issue, but POLLFREE is *not* enough here. POLLFREE
> only works if the lifetime of waitqueue is tied to the polling task, as blocking
> polls don't handle it -- only non-blocking polls do.
>
> The kernel/sched/psi.c use case is just totally broken, since the lifetime of
> its waitqueue is totally arbitrary; the open file descriptor can be written to
> at any time by any process, which causes the waitqueue to be freed. So it will
> cause a use-after-free even for regular blocking poll().
>
> To fix this, I think the psi trigger stuff will need to be refactored to have
> just one waitqueue per open file. We need to be removing uses of POLLFREE, not
> adding new ones. (See Linus' comments on POLLFREE here:
> https://lore.kernel.org/lkml/CAHk-=wgvt7PH+AU_29H95tJQZ9FnhS8vVmymbhpZ6NZ7yaAigw@mail.gmail.com/)
>
> Here are some repros:
>
> #include <fcntl.h>
> #include <sys/epoll.h>
> #include <unistd.h>
> int main()
> {
> int fd = open("/proc/pressure/cpu", O_RDWR);
> int epfd = epoll_create(1);
> const char trigger[] = "some 100000 1000000";
> struct epoll_event event = { .events = EPOLLIN };
>
> write(fd, trigger, sizeof(trigger));
> epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
> write(fd, trigger, sizeof(trigger));
> }
>
>
> #include <fcntl.h>
> #include <sys/poll.h>
> #include <unistd.h>
> int main()
> {
> int fd = open("/proc/pressure/cpu", O_RDWR);
> const char trigger[] = "some 100000 1000000";
>
> if (fork()) {
> struct pollfd pfd = { .fd = fd, .events = POLLIN };
>
> for (;;)
> poll(&pfd, 1, -1);
> } else {
> for (;;)
> write(fd, trigger, sizeof(trigger));
> }
> }

2022-01-05 18:51:02

by Linus Torvalds

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Wed, Jan 5, 2022 at 7:21 AM Eric Biggers <[email protected]> wrote:
>
> [changed subject line to hopefully get people to stop ignoring this]
>
> Please see my message below where I explained the problem in detail. Any
> response from the maintainers of kernel/sched/psi.c? There are a lot of you:

Ok, this one is clearly a kernel/sched/psi.c bug, since the lifetime
isn't even maintained by the fiel reference.

I think the proper thing to do is to move the whole "get kref to
trigger pointer" in the open/close code, and keep the ref around that
way.

The natural thing to do would be to look up the trigger at open time,
save the pointer in seq->private, and release it at close time.

Sadly, right now the code actually uses that 'seq->private' as an
indirect rcu-pointer to the trigger data, instead of as the trigger
data itself. And that seems very much on purpose and inherent to that
'psi_write()' model, where it changes the trigger pointer very much on
purpose.

So I agree 100% - the PSI code is fundamentally broken. psi_write()
seems to be literally _designed_ to do the wrong thing.

I don't know who - if anybody - uses this. My preference would be to
just disable the completely broken poll support.

Another alternative is to just make 'psi_write()' return -EBUSY if
there are existing poll waiters (ie t->event_wait not being empty. At
least then the open file would keep the kref to the trigger.

That would require that 'psi_trigger_replace()' serialize with the
waitqueue lock (easy), but all callers would also have to check the
return value of it

The cgroup code does

psi_trigger_replace(&of->priv, NULL);

in the release function, but I guess that might work since at release
time there shouldn't be any pending polls anyway.

But it would also mean that anybody who can open the file for reading
(so that they can poll it) would be able to keep people from changing
it.

But yes, I think that unless we get some reply from the PSI
maintainers, we will have to just disable polling entirely.

I hope there are no users that would break, but considering that the
current code is clearly too broken to live, this may be one of those
"we have no choice" cases.

Linus

2022-01-05 19:07:47

by Linus Torvalds

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Wed, Jan 5, 2022 at 10:50 AM Linus Torvalds
<[email protected]> wrote:
>
> That would require that 'psi_trigger_replace()' serialize with the
> waitqueue lock (easy)

I take the "easy" back. The other side of that serialization would
require that the poll() side also re-lookup the trigger pointer under
that same lock.

And you can't do that with the waitqueue lock, because 'poll_wait()'
does the add_wait_queue() internally, and that will take the waitqueue
lock. So you can't take and hold the waitqueue lock in the caller in
poll, it would just deadlock.

And not holding the lock over the call would mean that you'd have a
small race between adding a new poll waiter, and checking that the
trigger is still the same one.

We could use another lock - the code in kernel/sched/psi.c already does

mutex_lock(&seq->lock);
psi_trigger_replace(&seq->private, new);
mutex_unlock(&seq->lock);

and could use that same lock around the poll sequence too.

But the cgroup_pressure_write() code doesn't even do that, and
concurrent writes aren't serialized at all (much less concurrent
poll() calls).

Side note: it looks like concurrent writes in the
cgroup_pressure_write() is literally broken. Because
psi_trigger_replace() is *not* handling concurrency, and does that

struct psi_trigger *old = *trigger_ptr;
....
if (old)
kref_put(&old->refcount, psi_trigger_destroy);

assuming that the caller holds some lock that makes '*trigger_ptr' a
stable thing.

Again, kernel/sched/psi.c itself does that already, but the cgroup
code doesn't seem to.

So the bugs in this area go deeper than "just" poll(). The whole
psi_trigger_replace() thing is literally broken even ignoring the
poll() interactions.

Whoever came up with that stupid "replace existing trigger with a
write()" model should feel bad. It's garbage, and it's actively buggy
in multiple ways.

Linus

2022-01-05 19:19:01

by Linus Torvalds

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Wed, Jan 5, 2022 at 11:07 AM Linus Torvalds
<[email protected]> wrote:
>
> Whoever came up with that stupid "replace existing trigger with a
> write()" model should feel bad. It's garbage, and it's actively buggy
> in multiple ways.

What are the users? Can we make the rule for -EBUSY simply be that you
can _install_ a trigger, but you can't replace an existing one (except
with NULL, when you close it).

That would fix the poll() lifetime issue, and would make the
psi_trigger_replace() races fairly easy to fix - just use

if (cmpxchg(trigger_ptr, NULL, new) != NULL) {
... free 'new', return -EBUSY ..

to install the new one, instead of

rcu_assign_pointer(*trigger_ptr, new);

or something like that. No locking necessary.

But I assume people actually end up re-writing triggers, because
people are perverse and have taken advantage of this completely broken
API.

Linus

2022-01-06 22:05:52

by Tejun Heo

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

Happy new year,

On Wed, Jan 05, 2022 at 11:13:30AM -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2022 at 11:07 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > Whoever came up with that stupid "replace existing trigger with a
> > write()" model should feel bad. It's garbage, and it's actively buggy
> > in multiple ways.
>
> What are the users? Can we make the rule for -EBUSY simply be that you
> can _install_ a trigger, but you can't replace an existing one (except
> with NULL, when you close it).

I don't have enough context here and Johannes seems offline today. Let's
wait for him to chime in.

> That would fix the poll() lifetime issue, and would make the
> psi_trigger_replace() races fairly easy to fix - just use
>
> if (cmpxchg(trigger_ptr, NULL, new) != NULL) {
> ... free 'new', return -EBUSY ..
>
> to install the new one, instead of
>
> rcu_assign_pointer(*trigger_ptr, new);
>
> or something like that. No locking necessary.
>
> But I assume people actually end up re-writing triggers, because
> people are perverse and have taken advantage of this completely broken
> API.

IIRC, the rationale for the shared trigger at the time was around the
complexities of preventing it from devolving into O(N) trigger checks on
every pressure update. If the overriding behavior is something that can be
changed, I'd prefer going for per-opener triggers even if that involves
adding complexities (maybe a rbtree w/ prev/next links for faster sweeps?).

Thanks.

--
tejun

2022-01-06 23:00:00

by Linus Torvalds

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Thu, Jan 6, 2022 at 2:05 PM Tejun Heo <[email protected]> wrote:
>
> On Wed, Jan 05, 2022 at 11:13:30AM -0800, Linus Torvalds wrote:
> >
> > What are the users? Can we make the rule for -EBUSY simply be that you
> > can _install_ a trigger, but you can't replace an existing one (except
> > with NULL, when you close it).
>
> I don't have enough context here and Johannes seems offline today. Let's
> wait for him to chime in.

Context here:

https://lore.kernel.org/all/YdW3WfHURBXRmn%[email protected]/

> IIRC, the rationale for the shared trigger at the time was around the
> complexities of preventing it from devolving into O(N) trigger checks on
> every pressure update. If the overriding behavior is something that can be
> changed, I'd prefer going for per-opener triggers even if that involves
> adding complexities (maybe a rbtree w/ prev/next links for faster sweeps?).

So here's a COMPLETELY UNTESTED patch to try to fix the lifetime and locking.

The locking was completely broken, in that psi_trigger_replace()
expected that the caller would hold some exclusive lock so that it
would release the correct previous trigger. The cgroup code doesn't
seem to have any such exclusion.

This (UNTESTED!) patch fixes that breakage by just using a cmpxchg loop.

And the lifetime was completely broken (and that's Eric's email)
because psi_trigger_replace() would drop the refcount to the old
trigger - assuming it got the right one - even though the old trigger
could still have active waiters on the waitqueue due to poll() or
select().

This (UNTESTED!) patch fixes _that_ breakage by making
psi_trigger_replace() instead just put the previous trigger on the
"stale_trigger" linked list, and never release it at all.

It now gets released by "psi_trigger_release()" instead, which walks
the list at file release time. Doing "psi_trigger_replace(.., NULL)"
is not valid any more.

And because the reference cannot go away, we now can throw away all
the incorrect temporary kref_get/put games from psi_trigger_poll(),
which didn't actually fix the race at all, only limited it to the poll
waitqueue.

That also means we can remove the "synchronize_rcu()" from
psi_trigger_destroy(), since that was trying to hide all the problems
with the "take rcu lock and then do kref_get()" thing not having
locking. The locking still doesn't exist, but since we don't release
the old one when replacing it, the issue is moot.

NOTE NOTE NOTE! Not only is this patch entirely untested, there are
optimizations you could do if there was some sane synchronization
between psi_trigger_poll() and psi_trigger_replace(). I put comments
about it in the code, but right now the code just assumes that
replacing a trigger is fairly rare (and since it requires write
permissions, it's not something random users can do).

I'm not proud of this patch, but I think it might fix the fundamental
bugs in the code for now.

It's not lovely, it has room for improvement, and I wish we didn't
need this kind of thing, but it looks superficially sane as a fix to
me.

Comments?

And once again: this is UNTESTED. I've compiled-tested it, it looks
kind of sane to me, but honestly, I don't know the code very well.

Also, I'm not super-happy with how that 'psi_disabled' static branch
works. If somebody switches it off after it has been on, that will
also disable the freeing code, so now you'll be leaking memory.

I couldn't find it in myself to care.

Eric - you have the test-case, and the eagle-eyes that found this
problem in the first place. As such, your opinion and comments count
more than most...

Linus

Attachments:

patch.diff (4.96 kB)

2022-01-07 00:14:08

by Eric Biggers

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Thu, Jan 06, 2022 at 02:59:36PM -0800, Linus Torvalds wrote:
>
> So here's a COMPLETELY UNTESTED patch to try to fix the lifetime and locking.
>
> The locking was completely broken, in that psi_trigger_replace()
> expected that the caller would hold some exclusive lock so that it
> would release the correct previous trigger. The cgroup code doesn't
> seem to have any such exclusion.
>
> This (UNTESTED!) patch fixes that breakage by just using a cmpxchg loop.
>
> And the lifetime was completely broken (and that's Eric's email)
> because psi_trigger_replace() would drop the refcount to the old
> trigger - assuming it got the right one - even though the old trigger
> could still have active waiters on the waitqueue due to poll() or
> select().
>
> This (UNTESTED!) patch fixes _that_ breakage by making
> psi_trigger_replace() instead just put the previous trigger on the
> "stale_trigger" linked list, and never release it at all.
>
> It now gets released by "psi_trigger_release()" instead, which walks
> the list at file release time. Doing "psi_trigger_replace(.., NULL)"
> is not valid any more.
>
> And because the reference cannot go away, we now can throw away all
> the incorrect temporary kref_get/put games from psi_trigger_poll(),
> which didn't actually fix the race at all, only limited it to the poll
> waitqueue.
>
> That also means we can remove the "synchronize_rcu()" from
> psi_trigger_destroy(), since that was trying to hide all the problems
> with the "take rcu lock and then do kref_get()" thing not having
> locking. The locking still doesn't exist, but since we don't release
> the old one when replacing it, the issue is moot.
>
> NOTE NOTE NOTE! Not only is this patch entirely untested, there are
> optimizations you could do if there was some sane synchronization
> between psi_trigger_poll() and psi_trigger_replace(). I put comments
> about it in the code, but right now the code just assumes that
> replacing a trigger is fairly rare (and since it requires write
> permissions, it's not something random users can do).
>
> I'm not proud of this patch, but I think it might fix the fundamental
> bugs in the code for now.
>
> It's not lovely, it has room for improvement, and I wish we didn't
> need this kind of thing, but it looks superficially sane as a fix to
> me.
>
> Comments?
>
> And once again: this is UNTESTED. I've compiled-tested it, it looks
> kind of sane to me, but honestly, I don't know the code very well.
>
> Also, I'm not super-happy with how that 'psi_disabled' static branch
> works. If somebody switches it off after it has been on, that will
> also disable the freeing code, so now you'll be leaking memory.
>
> I couldn't find it in myself to care.

I had to make the following changes to Linus's patch:

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 10430f75f21a..7d5afa89db44 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1255,7 +1255,7 @@ void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new)
struct psi_trigger *old = *trigger_ptr;

new->stale_trigger = old;
- if (try_cmpxchg(trigger_ptr, old, new))
+ if (try_cmpxchg(trigger_ptr, &old, new))
break;
}

@@ -1270,7 +1270,7 @@ void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new)
/* No locking needed for final release */
void psi_trigger_release(void **trigger_ptr)
{
- struct psi_trigger *trigger;
+ struct psi_trigger *trigger = *trigger_ptr;

if (static_branch_likely(&psi_disabled))
return;

After that, the two reproducers I gave in
https://lore.kernel.org/r/[email protected] (the ones at the end
of my mail, not the syzbot-generated ones which I didn't try) no longer crash
the kernel.

This is one way to fix the use-after-free, but the fact that it allows anyone
who can write to a /proc/pressure/* file to cause the kernel to allocate an
unbounded number of 'struct psi_trigger' structs is still really broken.

I think we really need an answer to Linus' question:

> What are the users? Can we make the rule for -EBUSY simply be that you
> can _install_ a trigger, but you can't replace an existing one (except
> with NULL, when you close it).

... since that would be a much better fix. The example in
Documentation/accounting/psi.rst only does a single write; that case wouldn't be
broken if we made multiple writes not work.

- Eric

2022-01-07 03:03:21

by Linus Torvalds

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Thu, Jan 6, 2022 at 4:14 PM Eric Biggers <[email protected]> wrote:
>
> I had to make the following changes to Linus's patch:

Ack. Thanks.

> This is one way to fix the use-after-free, but the fact that it allows anyone
> who can write to a /proc/pressure/* file to cause the kernel to allocate an
> unbounded number of 'struct psi_trigger' structs is still really broken.

Yeah, I agree. Very non-optimal - that patch really was trying to just
keep the status quo, and fixing the immediate problems.

Modifying that patch to only allow a previous NULL value in
psi_trigger_replace() would be fairly simple - it would basically just
get rid of the "stale_trigger" list (and the loops it creates).

You'd still want the psi_trigger_release() model to separate that
whole "release" from "new trigger".

But that does require that nobody ever does more than a single write
to one file.

Debian code search finds those "/proc/pressure/xyz" files mentioned at
least by systemd and the chromium chrome browser sources. Whether they
actually write triggers to them, I can't say.

Maybe we just need to try.

Linus

2022-01-10 13:46:14

by Johannes Weiner

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Wed, Jan 05, 2022 at 11:13:30AM -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2022 at 11:07 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > Whoever came up with that stupid "replace existing trigger with a
> > write()" model should feel bad. It's garbage, and it's actively buggy
> > in multiple ways.
>
> What are the users? Can we make the rule for -EBUSY simply be that you
> can _install_ a trigger, but you can't replace an existing one (except
> with NULL, when you close it).

Apologies for the delay, I'm traveling right now.

The primary user of the poll interface is still Android userspace OOM
killing. I'm CCing Suren who is the most familiar with this usecase.

Suren, the way the refcounting is written right now assumes that
poll_wait() is the actual blocking wait. That's not true, it just
queues the waiter and saves &t->event_wait, and the *caller* of
psi_trigger_poll() continues to interact with it afterwards.

If at all possible, I would also prefer the simplicity of one trigger
setup per fd; if you need a new trigger, close the fd and open again.

Can you please take a look if that is workable from the Android side?

(I'm going to follow up on the static branch issue Linus pointed out,
later this week when I'm back home. I also think we should add Suren
as additional psi maintainer since the polling code is a good chunk of
the codebase and he shouldn't miss threads like these.)

> That would fix the poll() lifetime issue, and would make the
> psi_trigger_replace() races fairly easy to fix - just use
>
> if (cmpxchg(trigger_ptr, NULL, new) != NULL) {
> ... free 'new', return -EBUSY ..
>
> to install the new one, instead of
>
> rcu_assign_pointer(*trigger_ptr, new);
>
> or something like that. No locking necessary.
>
> But I assume people actually end up re-writing triggers, because
> people are perverse and have taken advantage of this completely broken
> API.
>
> Linus

2022-01-10 17:25:15

by Suren Baghdasaryan

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Mon, Jan 10, 2022 at 5:45 AM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Jan 05, 2022 at 11:13:30AM -0800, Linus Torvalds wrote:
> > On Wed, Jan 5, 2022 at 11:07 AM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > Whoever came up with that stupid "replace existing trigger with a
> > > write()" model should feel bad. It's garbage, and it's actively buggy
> > > in multiple ways.
> >
> > What are the users? Can we make the rule for -EBUSY simply be that you
> > can _install_ a trigger, but you can't replace an existing one (except
> > with NULL, when you close it).
>
> Apologies for the delay, I'm traveling right now.
>
> The primary user of the poll interface is still Android userspace OOM
> killing. I'm CCing Suren who is the most familiar with this usecase.
>
> Suren, the way the refcounting is written right now assumes that
> poll_wait() is the actual blocking wait. That's not true, it just
> queues the waiter and saves &t->event_wait, and the *caller* of
> psi_trigger_poll() continues to interact with it afterwards.

Thanks for adding me, Johannes. I see where I made a mistake.
Terribly sorry for the trouble this caused. I do feel bad.

>
> If at all possible, I would also prefer the simplicity of one trigger
> setup per fd; if you need a new trigger, close the fd and open again.
>
> Can you please take a look if that is workable from the Android side?

Yes, one trigger per fd would work fine for Android. That's how we
intended to use it.
I'm still catching up on this email thread. Once I digest it, will try
to fix this with one-trigger-per-fd approach.

About the issue of serializing concurrent writes for
cgroup_pressure_write() similar to how psi_write() does. Doesn't
of->mutex inside kernfs_fop_write_iter() serialize the writes to the
same file: https://elixir.bootlin.com/linux/latest/source/fs/kernfs/file.c#L287
?

>
> (I'm going to follow up on the static branch issue Linus pointed out,
> later this week when I'm back home. I also think we should add Suren
> as additional psi maintainer since the polling code is a good chunk of
> the codebase and he shouldn't miss threads like these.)

That would help me not to miss these emails and respond promptly.
Thanks,
Suren.

>
> > That would fix the poll() lifetime issue, and would make the
> > psi_trigger_replace() races fairly easy to fix - just use
> >
> > if (cmpxchg(trigger_ptr, NULL, new) != NULL) {
> > ... free 'new', return -EBUSY ..
> >
> > to install the new one, instead of
> >
> > rcu_assign_pointer(*trigger_ptr, new);
> >
> > or something like that. No locking necessary.
> >
> > But I assume people actually end up re-writing triggers, because
> > people are perverse and have taken advantage of this completely broken
> > API.
> >
> > Linus

2022-01-10 17:42:37

by Linus Torvalds

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Mon, Jan 10, 2022 at 9:25 AM Suren Baghdasaryan <[email protected]> wrote:
>
> About the issue of serializing concurrent writes for
> cgroup_pressure_write() similar to how psi_write() does. Doesn't
> of->mutex inside kernfs_fop_write_iter() serialize the writes to the
> same file?

Ahh, yes, it looks like that does solve the serialization issue.
Sorry, I missed that because I'm not actually all that familiar with
the kernfs 'of' code.

So the only issue is the trigger lifetime one, and if a single trigger
is sufficient and returning -EBUSY for trying to replace an existing
one is good, then I think that's the proper fix.

I'm very busy with the merge window (and some upcoming travel and
family events), so I'm hoping somebody will write and test such a
patch. Please?

Linus

2022-01-10 18:19:16

by Suren Baghdasaryan

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Mon, Jan 10, 2022 at 9:42 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Jan 10, 2022 at 9:25 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > About the issue of serializing concurrent writes for
> > cgroup_pressure_write() similar to how psi_write() does. Doesn't
> > of->mutex inside kernfs_fop_write_iter() serialize the writes to the
> > same file?
>
> Ahh, yes, it looks like that does solve the serialization issue.
> Sorry, I missed that because I'm not actually all that familiar with
> the kernfs 'of' code.
>
> So the only issue is the trigger lifetime one, and if a single trigger
> is sufficient and returning -EBUSY for trying to replace an existing
> one is good, then I think that's the proper fix.
>
> I'm very busy with the merge window (and some upcoming travel and
> family events), so I'm hoping somebody will write and test such a
> patch. Please?

Yes, definitely. I'm on it. Will try posting it later today or
tomorrow morning if testing reveals something unexpected.
Thanks!

>
> Linus

2022-01-11 03:02:40

by Suren Baghdasaryan

[permalink] [raw]

Subject: Re: psi_trigger_poll() is completely broken

On Mon, Jan 10, 2022 at 10:19 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Mon, Jan 10, 2022 at 9:42 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Mon, Jan 10, 2022 at 9:25 AM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > About the issue of serializing concurrent writes for
> > > cgroup_pressure_write() similar to how psi_write() does. Doesn't
> > > of->mutex inside kernfs_fop_write_iter() serialize the writes to the
> > > same file?
> >
> > Ahh, yes, it looks like that does solve the serialization issue.
> > Sorry, I missed that because I'm not actually all that familiar with
> > the kernfs 'of' code.
> >
> > So the only issue is the trigger lifetime one, and if a single trigger
> > is sufficient and returning -EBUSY for trying to replace an existing
> > one is good, then I think that's the proper fix.
> >
> > I'm very busy with the merge window (and some upcoming travel and
> > family events), so I'm hoping somebody will write and test such a
> > patch. Please?
>
> Yes, definitely. I'm on it. Will try posting it later today or
> tomorrow morning if testing reveals something unexpected.

My first attempt to fix this issue is posted at:
https://lore.kernel.org/all/[email protected]/
Couple notes:
- I don't think we need psi_trigger::refcount anymore, therefore it's removed.
- synchronize_rcu is kept to ensure we do not free group->poll_task
while psi_schedule_poll_work is using it.
- Documentation needed minimal changes because it did not clearly
specify how trigger overwrite should work. Now it does.

I ran as many test cases as I could find/create. I'll work on adding
some kselftests for psi triggers to test different usage patterns.
Thanks,
Suren.

> Thanks!
>
> >
> > Linus