Hi,
We found the following use-after-free with userspace code polling on
a pressure file in a non-root cgroup using epoll.
[ 57.183661] BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x76/0x130
[ 57.186662] Write of size 4 at addr ffff888114976428 by task a.out/2426
[ 57.190551] CPU: 0 PID: 2426 Comm: a.out Not tainted 6.2.0-rc2+ #16
[ 57.193384] Hardware name: Amazon EC2 c5a.large/, BIOS 1.0 10/16/2017
[ 57.196272] Call Trace:
[ 57.197565] <TASK>
[ 57.198714] dump_stack_lvl+0x8f/0xc0
[ 57.200494] print_report+0x16c/0x4e0
[ 57.202084] ? _raw_spin_lock_irqsave+0x76/0x130
[ 57.204077] kasan_report+0xc3/0xf0
[ 57.205587] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 57.207760] ? _raw_spin_lock_irqsave+0x76/0x130
[ 57.209685] kasan_check_range+0x2d2/0x310
[ 57.211477] _raw_spin_lock_irqsave+0x76/0x130
[ 57.213355] remove_wait_queue+0x25/0x130
[ 57.215102] ep_free+0x12d/0x220
[ 57.216506] ep_eventpoll_release+0x3c/0x40
[ 57.218254] __fput+0x32b/0x700
[ 57.221486] task_work_run+0x1db/0x230
[ 57.224885] exit_to_user_mode_prepare+0xfd/0x100
[ 57.228662] syscall_exit_to_user_mode+0x20/0x40
[ 57.232360] do_syscall_64+0x52/0x90
[ 57.235691] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 57.239572] RIP: 0033:0x7fadf96e1c44
[ 57.242865] Code: 00 00 b8 ff ff ff ff eb 9c b8 ff ff ff ff eb 95 e8 01 e2 01 00 90 8b 05 2a ac 2c 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 e4 a0
[ 57.255244] RSP: 002b:00007ffd1d1b7b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 57.261714] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fadf96e1c44
[ 57.266293] RDX: 0000000000000000 RSI: 00007ffd1d1b7b60 RDI: 0000000000000004
[ 57.270979] RBP: 00007ffd1d1b7bf0 R08: 00000000004007e0 R09: 00007fadf9a0f240
[ 57.275856] R10: 00000000000006ba R11: 0000000000000246 R12: 00000000004005e0
[ 57.280478] R13: 00007ffd1d1b7cd0 R14: 0000000000000000 R15: 0000000000000000
[ 57.285059] </TASK>
[ 57.290402] Allocated by task 2426:
[ 57.293705] kasan_set_track+0x3d/0x60
[ 57.297102] __kasan_kmalloc+0x85/0x90
[ 57.300491] psi_trigger_create+0x155/0x850
[ 57.304040] pressure_write+0x200/0x510
[ 57.307508] cgroup_file_write+0x1de/0x3e0
[ 57.310949] kernfs_fop_write_iter+0x27d/0x380
[ 57.314601] vfs_write+0x7d7/0xaa0
[ 57.317891] ksys_write+0xd7/0x1a0
[ 57.321152] do_syscall_64+0x43/0x90
[ 57.324496] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 57.330887] Freed by task 2429:
[ 57.334053] kasan_set_track+0x3d/0x60
[ 57.337448] kasan_save_free_info+0x27/0x40
[ 57.340977] ____kasan_slab_free+0x11d/0x170
[ 57.344570] slab_free_freelist_hook+0x87/0x150
[ 57.348236] __kmem_cache_free+0xcb/0x180
[ 57.351710] psi_trigger_destroy+0x458/0x550
[ 57.355274] cgroup_file_release+0x96/0x110
[ 57.358779] kernfs_drain_open_files+0x238/0x420
[ 57.362519] kernfs_drain+0x191/0x2a0
[ 57.365901] __kernfs_remove+0x3a6/0x600
[ 57.369363] kernfs_remove_by_name_ns+0xc2/0x120
[ 57.373073] cgroup_addrm_files+0x90f/0xcf0
[ 57.376610] cgroup_destroy_locked+0x48a/0x730
[ 57.380260] cgroup_rmdir+0x2b/0x130
[ 57.383650] kernfs_iop_rmdir+0x17a/0x230
[ 57.387201] vfs_rmdir+0x196/0x410
[ 57.390442] do_rmdir+0x1c7/0x3f0
[ 57.393651] __x64_sys_rmdir+0x45/0x50
[ 57.397000] do_syscall_64+0x43/0x90
[ 57.400340] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 57.406689] The buggy address belongs to the object at ffff888114976400
which belongs to the cache kmalloc-128 of size 128
[ 57.414907] The buggy address is located 40 bytes inside of
128-byte region [ffff888114976400, ffff888114976480)
[ 57.425474] The buggy address belongs to the physical page:
[ 57.429541] page:000000008c5ecb31 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x114976
[ 57.436725] flags: 0x2fffff80000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
[ 57.441235] raw: 002fffff80000200 ffff8881000418c0 dead000000000100 dead000000000122
[ 57.447793] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 57.454274] page dumped because: kasan: bad access detected
[ 57.460990] Memory state around the buggy address:
[ 57.464744] ffff888114976300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 57.471126] ffff888114976380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 57.477447] >ffff888114976400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 57.483833] ^
[ 57.487541] ffff888114976480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 57.493976] ffff888114976500: 00 00 00 00 00 00 00 00 00 00 00 00 00 01 fc fc
Here is the simple repro.
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>
int main(void)
{
const char trigger_str[] = "some 100000 1000000";
int fd, epfd;
struct epoll_event event;
struct epoll_event events[1];
fd = open("/cgroup2/test/cpu.pressure", O_RDWR);
write(fd, trigger_str, sizeof(trigger_str));
epfd = epoll_create(1);
event.events = EPOLLPRI | EPOLLET;
event.data.fd = fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
epoll_wait(epfd, events, 1, -1); /* returns after rmdir */
sleep(5);
close(epfd);
close(fd);
return 0;
}
# mkdir /cgroup2
# mount -t cgroup2 none /cgroup2
# mkdir /cgroup2/test
# ./a.out &
# rmdir /cgroup2/test
Looks like calling wake_up_pollfree() in psi_trigger_destroy() can properly
clear the queue and then avoid this use-after-free, but POLLFREE wasn't
considered enough there for the past similar issue[1]. While
wake_up_pollfree() could *also* be called in psi_trigger_destroy(), it may
be awkward and there can be more appropriate solution. It would be great if
experts could have a look.
[1] https://lkml.org/lkml/2021/11/16/264
Regards,
Munehisa
On Fri, Jan 6, 2023 at 2:49 PM Munehisa Kamata <[email protected]> wrote:
>
> Hi,
>
> We found the following use-after-free with userspace code polling on
> a pressure file in a non-root cgroup using epoll.
>
> [ 57.183661] BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x76/0x130
> [ 57.186662] Write of size 4 at addr ffff888114976428 by task a.out/2426
>
> [ 57.190551] CPU: 0 PID: 2426 Comm: a.out Not tainted 6.2.0-rc2+ #16
> [ 57.193384] Hardware name: Amazon EC2 c5a.large/, BIOS 1.0 10/16/2017
> [ 57.196272] Call Trace:
> [ 57.197565] <TASK>
> [ 57.198714] dump_stack_lvl+0x8f/0xc0
> [ 57.200494] print_report+0x16c/0x4e0
> [ 57.202084] ? _raw_spin_lock_irqsave+0x76/0x130
> [ 57.204077] kasan_report+0xc3/0xf0
> [ 57.205587] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [ 57.207760] ? _raw_spin_lock_irqsave+0x76/0x130
> [ 57.209685] kasan_check_range+0x2d2/0x310
> [ 57.211477] _raw_spin_lock_irqsave+0x76/0x130
> [ 57.213355] remove_wait_queue+0x25/0x130
> [ 57.215102] ep_free+0x12d/0x220
> [ 57.216506] ep_eventpoll_release+0x3c/0x40
> [ 57.218254] __fput+0x32b/0x700
> [ 57.221486] task_work_run+0x1db/0x230
> [ 57.224885] exit_to_user_mode_prepare+0xfd/0x100
> [ 57.228662] syscall_exit_to_user_mode+0x20/0x40
> [ 57.232360] do_syscall_64+0x52/0x90
> [ 57.235691] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [ 57.239572] RIP: 0033:0x7fadf96e1c44
> [ 57.242865] Code: 00 00 b8 ff ff ff ff eb 9c b8 ff ff ff ff eb 95 e8 01 e2 01 00 90 8b 05 2a ac 2c 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 e4 a0
> [ 57.255244] RSP: 002b:00007ffd1d1b7b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> [ 57.261714] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fadf96e1c44
> [ 57.266293] RDX: 0000000000000000 RSI: 00007ffd1d1b7b60 RDI: 0000000000000004
> [ 57.270979] RBP: 00007ffd1d1b7bf0 R08: 00000000004007e0 R09: 00007fadf9a0f240
> [ 57.275856] R10: 00000000000006ba R11: 0000000000000246 R12: 00000000004005e0
> [ 57.280478] R13: 00007ffd1d1b7cd0 R14: 0000000000000000 R15: 0000000000000000
> [ 57.285059] </TASK>
>
> [ 57.290402] Allocated by task 2426:
> [ 57.293705] kasan_set_track+0x3d/0x60
> [ 57.297102] __kasan_kmalloc+0x85/0x90
> [ 57.300491] psi_trigger_create+0x155/0x850
> [ 57.304040] pressure_write+0x200/0x510
> [ 57.307508] cgroup_file_write+0x1de/0x3e0
> [ 57.310949] kernfs_fop_write_iter+0x27d/0x380
> [ 57.314601] vfs_write+0x7d7/0xaa0
> [ 57.317891] ksys_write+0xd7/0x1a0
> [ 57.321152] do_syscall_64+0x43/0x90
> [ 57.324496] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> [ 57.330887] Freed by task 2429:
> [ 57.334053] kasan_set_track+0x3d/0x60
> [ 57.337448] kasan_save_free_info+0x27/0x40
> [ 57.340977] ____kasan_slab_free+0x11d/0x170
> [ 57.344570] slab_free_freelist_hook+0x87/0x150
> [ 57.348236] __kmem_cache_free+0xcb/0x180
> [ 57.351710] psi_trigger_destroy+0x458/0x550
> [ 57.355274] cgroup_file_release+0x96/0x110
> [ 57.358779] kernfs_drain_open_files+0x238/0x420
> [ 57.362519] kernfs_drain+0x191/0x2a0
> [ 57.365901] __kernfs_remove+0x3a6/0x600
> [ 57.369363] kernfs_remove_by_name_ns+0xc2/0x120
> [ 57.373073] cgroup_addrm_files+0x90f/0xcf0
> [ 57.376610] cgroup_destroy_locked+0x48a/0x730
> [ 57.380260] cgroup_rmdir+0x2b/0x130
> [ 57.383650] kernfs_iop_rmdir+0x17a/0x230
> [ 57.387201] vfs_rmdir+0x196/0x410
> [ 57.390442] do_rmdir+0x1c7/0x3f0
> [ 57.393651] __x64_sys_rmdir+0x45/0x50
> [ 57.397000] do_syscall_64+0x43/0x90
> [ 57.400340] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> [ 57.406689] The buggy address belongs to the object at ffff888114976400
> which belongs to the cache kmalloc-128 of size 128
> [ 57.414907] The buggy address is located 40 bytes inside of
> 128-byte region [ffff888114976400, ffff888114976480)
>
> [ 57.425474] The buggy address belongs to the physical page:
> [ 57.429541] page:000000008c5ecb31 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x114976
> [ 57.436725] flags: 0x2fffff80000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
> [ 57.441235] raw: 002fffff80000200 ffff8881000418c0 dead000000000100 dead000000000122
> [ 57.447793] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
> [ 57.454274] page dumped because: kasan: bad access detected
>
> [ 57.460990] Memory state around the buggy address:
> [ 57.464744] ffff888114976300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 57.471126] ffff888114976380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [ 57.477447] >ffff888114976400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 57.483833] ^
> [ 57.487541] ffff888114976480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [ 57.493976] ffff888114976500: 00 00 00 00 00 00 00 00 00 00 00 00 00 01 fc fc
>
> Here is the simple repro.
>
> #include <fcntl.h>
> #include <sys/epoll.h>
> #include <unistd.h>
>
> int main(void)
> {
> const char trigger_str[] = "some 100000 1000000";
> int fd, epfd;
> struct epoll_event event;
> struct epoll_event events[1];
>
> fd = open("/cgroup2/test/cpu.pressure", O_RDWR);
> write(fd, trigger_str, sizeof(trigger_str));
> epfd = epoll_create(1);
> event.events = EPOLLPRI | EPOLLET;
> event.data.fd = fd;
> epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
> epoll_wait(epfd, events, 1, -1); /* returns after rmdir */
> sleep(5);
> close(epfd);
> close(fd);
>
> return 0;
> }
>
> # mkdir /cgroup2
> # mount -t cgroup2 none /cgroup2
> # mkdir /cgroup2/test
> # ./a.out &
> # rmdir /cgroup2/test
>
> Looks like calling wake_up_pollfree() in psi_trigger_destroy() can properly
> clear the queue and then avoid this use-after-free, but POLLFREE wasn't
> considered enough there for the past similar issue[1]. While
> wake_up_pollfree() could *also* be called in psi_trigger_destroy(), it may
> be awkward and there can be more appropriate solution. It would be great if
> experts could have a look.
Thanks for the report, Munehisa. I'll look into this over the weekend
or on Monday.
Suren.
>
> [1] https://lkml.org/lkml/2021/11/16/264
>
>
> Regards,
> Munehisa
Hi Hillf,
On Sat, 2023-01-07 08:07:02 +0000, Hillf Danton <[email protected]> wrote:
> On Fri, Jan 6, 2023 at 2:49 PM Munehisa Kamata <[email protected]> wrote:
> > Hi,
> >
> > We found the following use-after-free with userspace code polling on
> > a pressure file in a non-root cgroup using epoll.
> >
> > [ 57.183661] BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x76/0x130
> > [ 57.186662] Write of size 4 at addr ffff888114976428 by task a.out/2426
> >
> > [ 57.190551] CPU: 0 PID: 2426 Comm: a.out Not tainted 6.2.0-rc2+ #16
> > [ 57.193384] Hardware name: Amazon EC2 c5a.large/, BIOS 1.0 10/16/2017
> > [ 57.196272] Call Trace:
> > [ 57.197565] <TASK>
> > [ 57.198714] dump_stack_lvl+0x8f/0xc0
> > [ 57.200494] print_report+0x16c/0x4e0
> > [ 57.202084] ? _raw_spin_lock_irqsave+0x76/0x130
> > [ 57.204077] kasan_report+0xc3/0xf0
> > [ 57.205587] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [ 57.207760] ? _raw_spin_lock_irqsave+0x76/0x130
> > [ 57.209685] kasan_check_range+0x2d2/0x310
> > [ 57.211477] _raw_spin_lock_irqsave+0x76/0x130
> > [ 57.213355] remove_wait_queue+0x25/0x130
> > [ 57.215102] ep_free+0x12d/0x220
> > [ 57.216506] ep_eventpoll_release+0x3c/0x40
> > [ 57.218254] __fput+0x32b/0x700
> > [ 57.221486] task_work_run+0x1db/0x230
> > [ 57.224885] exit_to_user_mode_prepare+0xfd/0x100
> > [ 57.228662] syscall_exit_to_user_mode+0x20/0x40
> > [ 57.232360] do_syscall_64+0x52/0x90
> > [ 57.235691] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [ 57.239572] RIP: 0033:0x7fadf96e1c44
> > [ 57.242865] Code: 00 00 b8 ff ff ff ff eb 9c b8 ff ff ff ff eb 95 e8 01 e2 01 00 90 8b 05 2a ac 2c 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 e4 a0
> > [ 57.255244] RSP: 002b:00007ffd1d1b7b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> > [ 57.261714] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fadf96e1c44
> > [ 57.266293] RDX: 0000000000000000 RSI: 00007ffd1d1b7b60 RDI: 0000000000000004
> > [ 57.270979] RBP: 00007ffd1d1b7bf0 R08: 00000000004007e0 R09: 00007fadf9a0f240
> > [ 57.275856] R10: 00000000000006ba R11: 0000000000000246 R12: 00000000004005e0
> > [ 57.280478] R13: 00007ffd1d1b7cd0 R14: 0000000000000000 R15: 0000000000000000
> > [ 57.285059] </TASK>
> >
> > [ 57.290402] Allocated by task 2426:
> > [ 57.293705] kasan_set_track+0x3d/0x60
> > [ 57.297102] __kasan_kmalloc+0x85/0x90
> > [ 57.300491] psi_trigger_create+0x155/0x850
> > [ 57.304040] pressure_write+0x200/0x510
> > [ 57.307508] cgroup_file_write+0x1de/0x3e0
> > [ 57.310949] kernfs_fop_write_iter+0x27d/0x380
> > [ 57.314601] vfs_write+0x7d7/0xaa0
> > [ 57.317891] ksys_write+0xd7/0x1a0
> > [ 57.321152] do_syscall_64+0x43/0x90
> > [ 57.324496] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > [ 57.330887] Freed by task 2429:
> > [ 57.334053] kasan_set_track+0x3d/0x60
> > [ 57.337448] kasan_save_free_info+0x27/0x40
> > [ 57.340977] ____kasan_slab_free+0x11d/0x170
> > [ 57.344570] slab_free_freelist_hook+0x87/0x150
> > [ 57.348236] __kmem_cache_free+0xcb/0x180
> > [ 57.351710] psi_trigger_destroy+0x458/0x550
> > [ 57.355274] cgroup_file_release+0x96/0x110
> > [ 57.358779] kernfs_drain_open_files+0x238/0x420
> > [ 57.362519] kernfs_drain+0x191/0x2a0
> > [ 57.365901] __kernfs_remove+0x3a6/0x600
> > [ 57.369363] kernfs_remove_by_name_ns+0xc2/0x120
> > [ 57.373073] cgroup_addrm_files+0x90f/0xcf0
> > [ 57.376610] cgroup_destroy_locked+0x48a/0x730
> > [ 57.380260] cgroup_rmdir+0x2b/0x130
> > [ 57.383650] kernfs_iop_rmdir+0x17a/0x230
> > [ 57.387201] vfs_rmdir+0x196/0x410
> > [ 57.390442] do_rmdir+0x1c7/0x3f0
> > [ 57.393651] __x64_sys_rmdir+0x45/0x50
> > [ 57.397000] do_syscall_64+0x43/0x90
> > [ 57.400340] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > [ 57.406689] The buggy address belongs to the object at ffff888114976400
> > which belongs to the cache kmalloc-128 of size 128
> > [ 57.414907] The buggy address is located 40 bytes inside of
> > 128-byte region [ffff888114976400, ffff888114976480)
> >
> > [ 57.425474] The buggy address belongs to the physical page:
> > [ 57.429541] page:000000008c5ecb31 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x114976
> > [ 57.436725] flags: 0x2fffff80000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
> > [ 57.441235] raw: 002fffff80000200 ffff8881000418c0 dead000000000100 dead000000000122
> > [ 57.447793] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
> > [ 57.454274] page dumped because: kasan: bad access detected
> >
> > [ 57.460990] Memory state around the buggy address:
> > [ 57.464744] ffff888114976300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > [ 57.471126] ffff888114976380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> > [ 57.477447] >ffff888114976400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > [ 57.483833] ^
> > [ 57.487541] ffff888114976480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> > [ 57.493976] ffff888114976500: 00 00 00 00 00 00 00 00 00 00 00 00 00 01 fc fc
> >
> > Here is the simple repro.
> >
> > #include <fcntl.h>
> > #include <sys/epoll.h>
> > #include <unistd.h>
> >
> > int main(void)
> > {
> > const char trigger_str[] = "some 100000 1000000";
> > int fd, epfd;
> > struct epoll_event event;
> > struct epoll_event events[1];
> >
> > fd = open("/cgroup2/test/cpu.pressure", O_RDWR);
> > write(fd, trigger_str, sizeof(trigger_str));
> > epfd = epoll_create(1);
> > event.events = EPOLLPRI | EPOLLET;
> > event.data.fd = fd;
> > epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
> > epoll_wait(epfd, events, 1, -1); /* returns after rmdir */
> > sleep(5);
> > close(epfd);
> > close(fd);
> >
> > return 0;
> > }
> >
> > # mkdir /cgroup2
> > # mount -t cgroup2 none /cgroup2
> > # mkdir /cgroup2/test
> > # ./a.out &
> > # rmdir /cgroup2/test
> >
> > Looks like calling wake_up_pollfree() in psi_trigger_destroy() can properly
> > clear the queue and then avoid this use-after-free, but POLLFREE wasn't
> > considered enough there for the past similar issue[1]. While
> > wake_up_pollfree() could *also* be called in psi_trigger_destroy(), it may
> > be awkward and there can be more appropriate solution. It would be great if
> > experts could have a look.
>
> Thanks for your report.
>
> In the wakeup pattern below,
>
> cpu0 cpu2
> --- ---
> wake_up(&t->event_wait);
> HZ later
> kfree(t);
> HZ *3 later
> the last sleeper gets on CPU
> remove_wait_queue
>
> waker should wait for every sleeper to go off waitqueue.
>
> See if the diff below could survive your repro.
That patch survived the repro in my original post, however, the waker
(rmdir) was getting stuck until a file descriptor of the epoll instance or
the pressure file got closed. So, if the following modified repro runs
with the patch, the waker never returns (unless the sleeper gets killed)
while holding cgroup_mutex. This doesn't seem to be what you expected to
see with the patch, does it? Even wake_up_all() does not appear to empty
the queue, but wake_up_pollfree() does.
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>
int main(void)
{
const char trigger_str[] = "some 100000 1000000";
int fd, epfd;
struct epoll_event event;
struct epoll_event events[1];
fd = open("/cgroup2/test/cpu.pressure", O_RDWR);
write(fd, trigger_str, sizeof(trigger_str));
epfd = epoll_create(1);
event.events = EPOLLPRI | EPOLLET;
event.data.fd = fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
epoll_wait(epfd, events, 1, -1);
pause();
return 0;
}
# mkdir /cgroup2
# mount -t cgroup2 none /cgroup2
# mkdir /cgroup2/test
# ./a.out &
# rmdir /cgroup2/test
# klilall a.out
Regards,
Munehisa
> Hillf
>
> --- mainline/kernel/sched/psi.c
> +++ y/kernel/sched/psi.c
> @@ -1346,7 +1346,7 @@ void psi_trigger_destroy(struct psi_trig
> * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> * from under a polling process.
> */
> - wake_up_interruptible(&t->event_wait);
> + wake_up_all(&t->event_wait);
>
> mutex_lock(&group->trigger_lock);
>
> @@ -1394,6 +1394,10 @@ void psi_trigger_destroy(struct psi_trig
> kthread_stop(task_to_destroy);
> atomic_set(&group->poll_scheduled, 0);
> }
> +
> + while (wq_has_sleeper(&t->event_wait))
> + schedule_timeout_idle(HZ);
> +
> kfree(t);
> }
>
> --
>
>
On Sun, Jan 8, 2023 at 3:49 PM Hillf Danton <[email protected]> wrote:
>
> On 8 Jan 2023 14:25:48 -0800 PM Munehisa Kamata <[email protected]> wrote:
> >
> > That patch survived the repro in my original post, however, the waker
> > (rmdir) was getting stuck until a file descriptor of the epoll instance or
> > the pressure file got closed. So, if the following modified repro runs
> > with the patch, the waker never returns (unless the sleeper gets killed)
> > while holding cgroup_mutex. This doesn't seem to be what you expected to
> > see with the patch, does it? Even wake_up_all() does not appear to empty
> > the queue, but wake_up_pollfree() does.
>
> Thanks for your testing. And the debugging completes.
>
> Mind sending a patch with wake_up_pollfree() folded?
I finally had some time to look into this issue. I don't think
delaying destruction in psi_trigger_destroy() because there are still
users of the trigger as Hillf suggested is a good way to go. Before
[1] correct trigger destruction was handled using a
psi_trigger.refcount. For some reason I thought it's not needed
anymore when we placed one-trigger-per-file restriction in that patch,
so I removed it. Obviously that was a wrong move, so I think the
cleanest way would be to bring back the refcounting. That way the last
user of the trigger (either psi_trigger_poll() or psi_fop_release())
will free the trigger.
I'll check once more to make sure I did not miss anything and if there
are no objections, will post a fix.
[1] https://lore.kernel.org/lkml/[email protected]/
Thanks,
Suren.
>
> Hillf
On Mon, Jan 9, 2023 at 5:33 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Sun, Jan 8, 2023 at 3:49 PM Hillf Danton <[email protected]> wrote:
> >
> > On 8 Jan 2023 14:25:48 -0800 PM Munehisa Kamata <[email protected]> wrote:
> > >
> > > That patch survived the repro in my original post, however, the waker
> > > (rmdir) was getting stuck until a file descriptor of the epoll instance or
> > > the pressure file got closed. So, if the following modified repro runs
> > > with the patch, the waker never returns (unless the sleeper gets killed)
> > > while holding cgroup_mutex. This doesn't seem to be what you expected to
> > > see with the patch, does it? Even wake_up_all() does not appear to empty
> > > the queue, but wake_up_pollfree() does.
> >
> > Thanks for your testing. And the debugging completes.
> >
> > Mind sending a patch with wake_up_pollfree() folded?
>
> I finally had some time to look into this issue. I don't think
> delaying destruction in psi_trigger_destroy() because there are still
> users of the trigger as Hillf suggested is a good way to go. Before
> [1] correct trigger destruction was handled using a
> psi_trigger.refcount. For some reason I thought it's not needed
> anymore when we placed one-trigger-per-file restriction in that patch,
> so I removed it. Obviously that was a wrong move, so I think the
> cleanest way would be to bring back the refcounting. That way the last
> user of the trigger (either psi_trigger_poll() or psi_fop_release())
> will free the trigger.
> I'll check once more to make sure I did not miss anything and if there
> are no objections, will post a fix.
Uh, I recalled now why refcounting was not helpful here. I'm making
the same mistake of thinking that poll_wait() blocks until the call to
wake_up() which is not the case. Let me think if there is anything
better than wake_up_pollfree() for this case.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Thanks,
> Suren.
>
> >
> > Hillf
On Mon, Jan 9, 2023 at 7:06 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Mon, Jan 9, 2023 at 5:33 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Sun, Jan 8, 2023 at 3:49 PM Hillf Danton <[email protected]> wrote:
> > >
> > > On 8 Jan 2023 14:25:48 -0800 PM Munehisa Kamata <[email protected]> wrote:
> > > >
> > > > That patch survived the repro in my original post, however, the waker
> > > > (rmdir) was getting stuck until a file descriptor of the epoll instance or
> > > > the pressure file got closed. So, if the following modified repro runs
> > > > with the patch, the waker never returns (unless the sleeper gets killed)
> > > > while holding cgroup_mutex. This doesn't seem to be what you expected to
> > > > see with the patch, does it? Even wake_up_all() does not appear to empty
> > > > the queue, but wake_up_pollfree() does.
> > >
> > > Thanks for your testing. And the debugging completes.
> > >
> > > Mind sending a patch with wake_up_pollfree() folded?
> >
> > I finally had some time to look into this issue. I don't think
> > delaying destruction in psi_trigger_destroy() because there are still
> > users of the trigger as Hillf suggested is a good way to go. Before
> > [1] correct trigger destruction was handled using a
> > psi_trigger.refcount. For some reason I thought it's not needed
> > anymore when we placed one-trigger-per-file restriction in that patch,
> > so I removed it. Obviously that was a wrong move, so I think the
> > cleanest way would be to bring back the refcounting. That way the last
> > user of the trigger (either psi_trigger_poll() or psi_fop_release())
> > will free the trigger.
> > I'll check once more to make sure I did not miss anything and if there
> > are no objections, will post a fix.
>
> Uh, I recalled now why refcounting was not helpful here. I'm making
> the same mistake of thinking that poll_wait() blocks until the call to
> wake_up() which is not the case. Let me think if there is anything
> better than wake_up_pollfree() for this case.
Hi Munehisa,
Sorry for the delay. I was trying to reproduce the issue but even
after adding a delay before ep_remove_wait_queue() it did not happen.
One thing about wake_up_pollfree() solution that does not seem right
to me is this comment at
https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253:
`In the very rare cases where a ->poll() implementation uses a
waitqueue whose lifetime is tied to a task rather than to the 'struct
file' being polled, this function must be called before the waitqueue
is freed...`
In our case we free the waitqueue from cgroup_pressure_release(),
which is the handler for `release` operation on cgroup psi files. The
other place calling psi_trigger_destroy() is psi_fop_release(), which
is also tied to the lifetime to the psi files. Therefore the lifetime
of the trigger's waitqueue is tied to the lifetime of the files and
IIUC, we should not be required to use wake_up_pollfree().
Could you please post your .config file? I might be missing some
configuration which prevents the issue from happening on my side.
Thanks,
Suren.
>
>
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> >
> > Thanks,
> > Suren.
> >
> > >
> > > Hillf
On Thu, 2023-01-12 22:01:24 +0000, Suren Baghdasaryan <[email protected]> wrote:
>
> On Mon, Jan 9, 2023 at 7:06 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Mon, Jan 9, 2023 at 5:33 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Sun, Jan 8, 2023 at 3:49 PM Hillf Danton <[email protected]> wrote:
> > > >
> > > > On 8 Jan 2023 14:25:48 -0800 PM Munehisa Kamata <[email protected]> wrote:
> > > > >
> > > > > That patch survived the repro in my original post, however, the waker
> > > > > (rmdir) was getting stuck until a file descriptor of the epoll instance or
> > > > > the pressure file got closed. So, if the following modified repro runs
> > > > > with the patch, the waker never returns (unless the sleeper gets killed)
> > > > > while holding cgroup_mutex. This doesn't seem to be what you expected to
> > > > > see with the patch, does it? Even wake_up_all() does not appear to empty
> > > > > the queue, but wake_up_pollfree() does.
> > > >
> > > > Thanks for your testing. And the debugging completes.
> > > >
> > > > Mind sending a patch with wake_up_pollfree() folded?
> > >
> > > I finally had some time to look into this issue. I don't think
> > > delaying destruction in psi_trigger_destroy() because there are still
> > > users of the trigger as Hillf suggested is a good way to go. Before
> > > [1] correct trigger destruction was handled using a
> > > psi_trigger.refcount. For some reason I thought it's not needed
> > > anymore when we placed one-trigger-per-file restriction in that patch,
> > > so I removed it. Obviously that was a wrong move, so I think the
> > > cleanest way would be to bring back the refcounting. That way the last
> > > user of the trigger (either psi_trigger_poll() or psi_fop_release())
> > > will free the trigger.
> > > I'll check once more to make sure I did not miss anything and if there
> > > are no objections, will post a fix.
> >
> > Uh, I recalled now why refcounting was not helpful here. I'm making
> > the same mistake of thinking that poll_wait() blocks until the call to
> > wake_up() which is not the case. Let me think if there is anything
> > better than wake_up_pollfree() for this case.
>
> Hi Munehisa,
> Sorry for the delay. I was trying to reproduce the issue but even
> after adding a delay before ep_remove_wait_queue() it did not happen.
Hi Suren,
Thank you for your help here.
Just in case, do you have KASAN enabled in your config? If not, this may
just silently corrupt a certain memory location and not immediately
followed by obvious messages or noticeable event like oops.
> One thing about wake_up_pollfree() solution that does not seem right
> to me is this comment at
> https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253:
>
> `In the very rare cases where a ->poll() implementation uses a
> waitqueue whose lifetime is tied to a task rather than to the 'struct
> file' being polled, this function must be called before the waitqueue
> is freed...`
>
> In our case we free the waitqueue from cgroup_pressure_release(),
> which is the handler for `release` operation on cgroup psi files. The
> other place calling psi_trigger_destroy() is psi_fop_release(), which
> is also tied to the lifetime to the psi files. Therefore the lifetime
> of the trigger's waitqueue is tied to the lifetime of the files and
> IIUC, we should not be required to use wake_up_pollfree().
> Could you please post your .config file? I might be missing some
> configuration which prevents the issue from happening on my side.
Sure, here is my config.
https://gist.github.com/kamatam9/a078bdd9f695e7a0767b061c60e48d50
I confirmed that it's reliably reproducible with v6.2-rc3 as shown below.
https://gist.github.com/kamatam9/096a79cf59d8ed8785c4267e917b8675
Regards,
Munehisa
> Thanks,
> Suren.
>
> >
> >
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Thanks,
> > > Suren.
> > >
> > > >
> > > > Hillf
>
>
On Thu, Jan 12, 2023 at 6:26 PM Munehisa Kamata <[email protected]> wrote:
>
> On Thu, 2023-01-12 22:01:24 +0000, Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Mon, Jan 9, 2023 at 7:06 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Mon, Jan 9, 2023 at 5:33 PM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Sun, Jan 8, 2023 at 3:49 PM Hillf Danton <[email protected]> wrote:
> > > > >
> > > > > On 8 Jan 2023 14:25:48 -0800 PM Munehisa Kamata <[email protected]> wrote:
> > > > > >
> > > > > > That patch survived the repro in my original post, however, the waker
> > > > > > (rmdir) was getting stuck until a file descriptor of the epoll instance or
> > > > > > the pressure file got closed. So, if the following modified repro runs
> > > > > > with the patch, the waker never returns (unless the sleeper gets killed)
> > > > > > while holding cgroup_mutex. This doesn't seem to be what you expected to
> > > > > > see with the patch, does it? Even wake_up_all() does not appear to empty
> > > > > > the queue, but wake_up_pollfree() does.
> > > > >
> > > > > Thanks for your testing. And the debugging completes.
> > > > >
> > > > > Mind sending a patch with wake_up_pollfree() folded?
> > > >
> > > > I finally had some time to look into this issue. I don't think
> > > > delaying destruction in psi_trigger_destroy() because there are still
> > > > users of the trigger as Hillf suggested is a good way to go. Before
> > > > [1] correct trigger destruction was handled using a
> > > > psi_trigger.refcount. For some reason I thought it's not needed
> > > > anymore when we placed one-trigger-per-file restriction in that patch,
> > > > so I removed it. Obviously that was a wrong move, so I think the
> > > > cleanest way would be to bring back the refcounting. That way the last
> > > > user of the trigger (either psi_trigger_poll() or psi_fop_release())
> > > > will free the trigger.
> > > > I'll check once more to make sure I did not miss anything and if there
> > > > are no objections, will post a fix.
> > >
> > > Uh, I recalled now why refcounting was not helpful here. I'm making
> > > the same mistake of thinking that poll_wait() blocks until the call to
> > > wake_up() which is not the case. Let me think if there is anything
> > > better than wake_up_pollfree() for this case.
> >
> > Hi Munehisa,
> > Sorry for the delay. I was trying to reproduce the issue but even
> > after adding a delay before ep_remove_wait_queue() it did not happen.
>
> Hi Suren,
>
> Thank you for your help here.
>
> Just in case, do you have KASAN enabled in your config? If not, this may
> just silently corrupt a certain memory location and not immediately
> followed by obvious messages or noticeable event like oops.
Yes, KASAN was enabled in my build.
>
> > One thing about wake_up_pollfree() solution that does not seem right
> > to me is this comment at
> > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253:
> >
> > `In the very rare cases where a ->poll() implementation uses a
> > waitqueue whose lifetime is tied to a task rather than to the 'struct
> > file' being polled, this function must be called before the waitqueue
> > is freed...`
> >
> > In our case we free the waitqueue from cgroup_pressure_release(),
> > which is the handler for `release` operation on cgroup psi files. The
> > other place calling psi_trigger_destroy() is psi_fop_release(), which
> > is also tied to the lifetime to the psi files. Therefore the lifetime
> > of the trigger's waitqueue is tied to the lifetime of the files and
> > IIUC, we should not be required to use wake_up_pollfree().
> > Could you please post your .config file? I might be missing some
> > configuration which prevents the issue from happening on my side.
>
> Sure, here is my config.
>
> https://gist.github.com/kamatam9/a078bdd9f695e7a0767b061c60e48d50
>
> I confirmed that it's reliably reproducible with v6.2-rc3 as shown below.
>
> https://gist.github.com/kamatam9/096a79cf59d8ed8785c4267e917b8675
Thanks! I'll try to figure out the difference.
Suren.
>
>
> Regards,
> Munehisa
>
>
> > Thanks,
> > Suren.
> >
> > >
> > >
> > > >
> > > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > >
> > > > Thanks,
> > > > Suren.
> > > >
> > > > >
> > > > > Hillf
> >
> >
On Fri, Jan 13, 2023 at 9:52 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Jan 12, 2023 at 6:26 PM Munehisa Kamata <[email protected]> wrote:
> >
> > On Thu, 2023-01-12 22:01:24 +0000, Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Mon, Jan 9, 2023 at 7:06 PM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 5:33 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > >
> > > > > On Sun, Jan 8, 2023 at 3:49 PM Hillf Danton <[email protected]> wrote:
> > > > > >
> > > > > > On 8 Jan 2023 14:25:48 -0800 PM Munehisa Kamata <[email protected]> wrote:
> > > > > > >
> > > > > > > That patch survived the repro in my original post, however, the waker
> > > > > > > (rmdir) was getting stuck until a file descriptor of the epoll instance or
> > > > > > > the pressure file got closed. So, if the following modified repro runs
> > > > > > > with the patch, the waker never returns (unless the sleeper gets killed)
> > > > > > > while holding cgroup_mutex. This doesn't seem to be what you expected to
> > > > > > > see with the patch, does it? Even wake_up_all() does not appear to empty
> > > > > > > the queue, but wake_up_pollfree() does.
> > > > > >
> > > > > > Thanks for your testing. And the debugging completes.
> > > > > >
> > > > > > Mind sending a patch with wake_up_pollfree() folded?
> > > > >
> > > > > I finally had some time to look into this issue. I don't think
> > > > > delaying destruction in psi_trigger_destroy() because there are still
> > > > > users of the trigger as Hillf suggested is a good way to go. Before
> > > > > [1] correct trigger destruction was handled using a
> > > > > psi_trigger.refcount. For some reason I thought it's not needed
> > > > > anymore when we placed one-trigger-per-file restriction in that patch,
> > > > > so I removed it. Obviously that was a wrong move, so I think the
> > > > > cleanest way would be to bring back the refcounting. That way the last
> > > > > user of the trigger (either psi_trigger_poll() or psi_fop_release())
> > > > > will free the trigger.
> > > > > I'll check once more to make sure I did not miss anything and if there
> > > > > are no objections, will post a fix.
> > > >
> > > > Uh, I recalled now why refcounting was not helpful here. I'm making
> > > > the same mistake of thinking that poll_wait() blocks until the call to
> > > > wake_up() which is not the case. Let me think if there is anything
> > > > better than wake_up_pollfree() for this case.
> > >
> > > Hi Munehisa,
> > > Sorry for the delay. I was trying to reproduce the issue but even
> > > after adding a delay before ep_remove_wait_queue() it did not happen.
> >
> > Hi Suren,
> >
> > Thank you for your help here.
> >
> > Just in case, do you have KASAN enabled in your config? If not, this may
> > just silently corrupt a certain memory location and not immediately
> > followed by obvious messages or noticeable event like oops.
>
> Yes, KASAN was enabled in my build.
>
> >
> > > One thing about wake_up_pollfree() solution that does not seem right
> > > to me is this comment at
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253:
> > >
> > > `In the very rare cases where a ->poll() implementation uses a
> > > waitqueue whose lifetime is tied to a task rather than to the 'struct
> > > file' being polled, this function must be called before the waitqueue
> > > is freed...`
> > >
> > > In our case we free the waitqueue from cgroup_pressure_release(),
> > > which is the handler for `release` operation on cgroup psi files. The
> > > other place calling psi_trigger_destroy() is psi_fop_release(), which
> > > is also tied to the lifetime to the psi files. Therefore the lifetime
> > > of the trigger's waitqueue is tied to the lifetime of the files and
> > > IIUC, we should not be required to use wake_up_pollfree().
> > > Could you please post your .config file? I might be missing some
> > > configuration which prevents the issue from happening on my side.
> >
> > Sure, here is my config.
> >
> > https://gist.github.com/kamatam9/a078bdd9f695e7a0767b061c60e48d50
> >
> > I confirmed that it's reliably reproducible with v6.2-rc3 as shown below.
> >
> > https://gist.github.com/kamatam9/096a79cf59d8ed8785c4267e917b8675
>
> Thanks! I'll try to figure out the difference.
Sorry for the slow progress on this issue. I'm multiplexing between
several tasks ATM but I did not forget about this one.
Even though I still can't get the kasan UAF report, I clearly see the
wrong order when tracing these functions and forcing the delay before
ep_remove_wait_queue(). I don't think that should be happening, so
something in the release process I think needs fixing. Will update
once I figure out the root cause, hopefully sometime this week.
> Suren.
>
> >
> >
> > Regards,
> > Munehisa
> >
> >
> > > Thanks,
> > > Suren.
> > >
> > > >
> > > >
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > > >
> > > > > Thanks,
> > > > > Suren.
> > > > >
> > > > > >
> > > > > > Hillf
> > >
> > >
On Wed, Jan 18, 2023 at 7:06 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Fri, Jan 13, 2023 at 9:52 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Thu, Jan 12, 2023 at 6:26 PM Munehisa Kamata <[email protected]> wrote:
> > >
> > > On Thu, 2023-01-12 22:01:24 +0000, Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 7:06 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > >
> > > > > On Mon, Jan 9, 2023 at 5:33 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > > >
> > > > > > On Sun, Jan 8, 2023 at 3:49 PM Hillf Danton <[email protected]> wrote:
> > > > > > >
> > > > > > > On 8 Jan 2023 14:25:48 -0800 PM Munehisa Kamata <[email protected]> wrote:
> > > > > > > >
> > > > > > > > That patch survived the repro in my original post, however, the waker
> > > > > > > > (rmdir) was getting stuck until a file descriptor of the epoll instance or
> > > > > > > > the pressure file got closed. So, if the following modified repro runs
> > > > > > > > with the patch, the waker never returns (unless the sleeper gets killed)
> > > > > > > > while holding cgroup_mutex. This doesn't seem to be what you expected to
> > > > > > > > see with the patch, does it? Even wake_up_all() does not appear to empty
> > > > > > > > the queue, but wake_up_pollfree() does.
> > > > > > >
> > > > > > > Thanks for your testing. And the debugging completes.
> > > > > > >
> > > > > > > Mind sending a patch with wake_up_pollfree() folded?
> > > > > >
> > > > > > I finally had some time to look into this issue. I don't think
> > > > > > delaying destruction in psi_trigger_destroy() because there are still
> > > > > > users of the trigger as Hillf suggested is a good way to go. Before
> > > > > > [1] correct trigger destruction was handled using a
> > > > > > psi_trigger.refcount. For some reason I thought it's not needed
> > > > > > anymore when we placed one-trigger-per-file restriction in that patch,
> > > > > > so I removed it. Obviously that was a wrong move, so I think the
> > > > > > cleanest way would be to bring back the refcounting. That way the last
> > > > > > user of the trigger (either psi_trigger_poll() or psi_fop_release())
> > > > > > will free the trigger.
> > > > > > I'll check once more to make sure I did not miss anything and if there
> > > > > > are no objections, will post a fix.
> > > > >
> > > > > Uh, I recalled now why refcounting was not helpful here. I'm making
> > > > > the same mistake of thinking that poll_wait() blocks until the call to
> > > > > wake_up() which is not the case. Let me think if there is anything
> > > > > better than wake_up_pollfree() for this case.
> > > >
> > > > Hi Munehisa,
> > > > Sorry for the delay. I was trying to reproduce the issue but even
> > > > after adding a delay before ep_remove_wait_queue() it did not happen.
> > >
> > > Hi Suren,
> > >
> > > Thank you for your help here.
> > >
> > > Just in case, do you have KASAN enabled in your config? If not, this may
> > > just silently corrupt a certain memory location and not immediately
> > > followed by obvious messages or noticeable event like oops.
> >
> > Yes, KASAN was enabled in my build.
> >
> > >
> > > > One thing about wake_up_pollfree() solution that does not seem right
> > > > to me is this comment at
> > > > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253:
> > > >
> > > > `In the very rare cases where a ->poll() implementation uses a
> > > > waitqueue whose lifetime is tied to a task rather than to the 'struct
> > > > file' being polled, this function must be called before the waitqueue
> > > > is freed...`
> > > >
> > > > In our case we free the waitqueue from cgroup_pressure_release(),
> > > > which is the handler for `release` operation on cgroup psi files. The
> > > > other place calling psi_trigger_destroy() is psi_fop_release(), which
> > > > is also tied to the lifetime to the psi files. Therefore the lifetime
> > > > of the trigger's waitqueue is tied to the lifetime of the files and
> > > > IIUC, we should not be required to use wake_up_pollfree().
> > > > Could you please post your .config file? I might be missing some
> > > > configuration which prevents the issue from happening on my side.
> > >
> > > Sure, here is my config.
> > >
> > > https://gist.github.com/kamatam9/a078bdd9f695e7a0767b061c60e48d50
> > >
> > > I confirmed that it's reliably reproducible with v6.2-rc3 as shown below.
> > >
> > > https://gist.github.com/kamatam9/096a79cf59d8ed8785c4267e917b8675
> >
> > Thanks! I'll try to figure out the difference.
>
> Sorry for the slow progress on this issue. I'm multiplexing between
> several tasks ATM but I did not forget about this one.
> Even though I still can't get the kasan UAF report, I clearly see the
> wrong order when tracing these functions and forcing the delay before
> ep_remove_wait_queue(). I don't think that should be happening, so
> something in the release process I think needs fixing. Will update
> once I figure out the root cause, hopefully sometime this week.
Hi Folks,
I spent some more time digging into the details and this is what's
happening. When we call rmdir to delete the cgroup with the pressure
file being epoll'ed, roughly the following call chain happens in the
context of the shell process:
do_rmdir
cgroup_rmdir
kernfs_drain_open_files
cgroup_file_release
cgroup_pressure_release
psi_trigger_destroy
Later on in the context of our reproducer, the last fput() is called
causing wait queue removal:
fput
ep_eventpoll_release
ep_free
ep_remove_wait_queue
remove_wait_queue
By this time psi_trigger_destroy() already destroyed the trigger's
waitqueue head and we hit UAF.
I think the conceptual problem here (or maybe that's by design?) is
that cgroup_file_release() is not really tied to the file's real
lifetime (when the last fput() is issued). Otherwise fput() would call
eventpoll_release() before f_op->release() and the order would be fine
(we would remove the wait queue first in eventpoll_release() and then
f_op->release() would cause trigger's destruction).
Considering these findings, I think we can use the wake_up_pollfree()
without contradicting the comment at
https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253
because indeed, cgroup_file_release() and therefore
psi_trigger_destroy() are not tied to the file's lifetime.
I'm CC'ing Tejun to check if this makes sense to him and
cgroup_file_release() is working as expected in this case.
Munehisha, if Tejun confirms this is all valid, could you please post
a patch replacing wake_up_interruptible() with wake_up_pollfree()? We
don't need to worry about wake_up_all() because we have a limitation
of one trigger per file descriptor:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L1419,
so there can be only one waiter.
Thanks,
Suren.
>
>
> > Suren.
> >
> > >
> > >
> > > Regards,
> > > Munehisa
> > >
> > >
> > > > Thanks,
> > > > Suren.
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > > > >
> > > > > > Thanks,
> > > > > > Suren.
> > > > > >
> > > > > > >
> > > > > > > Hillf
> > > >
> > > >
On Thu, Jan 19, 2023 at 01:01:42PM -0800, Suren Baghdasaryan wrote:
> I spent some more time digging into the details and this is what's
> happening. When we call rmdir to delete the cgroup with the pressure
> file being epoll'ed, roughly the following call chain happens in the
> context of the shell process:
>
> do_rmdir
> cgroup_rmdir
> kernfs_drain_open_files
> cgroup_file_release
> cgroup_pressure_release
> psi_trigger_destroy
>
> Later on in the context of our reproducer, the last fput() is called
> causing wait queue removal:
>
> fput
> ep_eventpoll_release
> ep_free
> ep_remove_wait_queue
> remove_wait_queue
>
> By this time psi_trigger_destroy() already destroyed the trigger's
> waitqueue head and we hit UAF.
> I think the conceptual problem here (or maybe that's by design?) is
> that cgroup_file_release() is not really tied to the file's real
> lifetime (when the last fput() is issued). Otherwise fput() would call
> eventpoll_release() before f_op->release() and the order would be fine
> (we would remove the wait queue first in eventpoll_release() and then
> f_op->release() would cause trigger's destruction).
> Considering these findings, I think we can use the wake_up_pollfree()
> without contradicting the comment at
> https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253
> because indeed, cgroup_file_release() and therefore
> psi_trigger_destroy() are not tied to the file's lifetime.
>
> I'm CC'ing Tejun to check if this makes sense to him and
> cgroup_file_release() is working as expected in this case.
>
> Munehisha, if Tejun confirms this is all valid, could you please post
> a patch replacing wake_up_interruptible() with wake_up_pollfree()? We
> don't need to worry about wake_up_all() because we have a limitation
> of one trigger per file descriptor:
Solid analysis!
Indeed, wake_up_pollfree() should fix it.
On Thu, Jan 19, 2023 at 5:31 PM Hillf Danton <[email protected]> wrote:
>
> On Thu, 19 Jan 2023 13:01:42 -0800 Suren Baghdasaryan <[email protected]> wrote:
> >
> > Hi Folks,
> > I spent some more time digging into the details and this is what's
> > happening. When we call rmdir to delete the cgroup with the pressure
> > file being epoll'ed, roughly the following call chain happens in the
> > context of the shell process:
> >
> > do_rmdir
> > cgroup_rmdir
> > kernfs_drain_open_files
> > cgroup_file_release
> > cgroup_pressure_release
> > psi_trigger_destroy
> >
> > Later on in the context of our reproducer, the last fput() is called
> > causing wait queue removal:
> >
> > fput
> > ep_eventpoll_release
> > ep_free
> > ep_remove_wait_queue
> > remove_wait_queue
> >
> > By this time psi_trigger_destroy() already destroyed the trigger's
> > waitqueue head and we hit UAF.
> > I think the conceptual problem here (or maybe that's by design?) is
> > that cgroup_file_release() is not really tied to the file's real
> > lifetime (when the last fput() is issued). Otherwise fput() would call
> > eventpoll_release() before f_op->release() and the order would be fine
> > (we would remove the wait queue first in eventpoll_release() and then
> > f_op->release() would cause trigger's destruction).
>
> eventpoll_release
> eventpoll_release_file
> ep_remove
> ep_unregister_pollwait
> ep_remove_wait_queue
>
Yes but fput() calls eventpoll_release() *before* f_op->release(), so
waitqueue_head would be removed before trigger destruction.
> Different roads run into the same Roma city.
You butchered the phrase :)
>
> > Considering these findings, I think we can use the wake_up_pollfree()
> > without contradicting the comment at
> > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253
> > because indeed, cgroup_file_release() and therefore
> > psi_trigger_destroy() are not tied to the file's lifetime.
> >
> > I'm CC'ing Tejun to check if this makes sense to him and
> > cgroup_file_release() is working as expected in this case.
> >
> > Munehisha, if Tejun confirms this is all valid, could you please post
> > a patch replacing wake_up_interruptible() with wake_up_pollfree()? We
> > don't need to worry about wake_up_all() because we have a limitation
> > of one trigger per file descriptor:
> > https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L1419,
> > so there can be only one waiter.
> > Thanks,
> > Suren.
>
On Thu, 2023-01-19 21:01:42 +0000, Suren Baghdasaryan <[email protected]> wrote:
>
> Munehisha, if Tejun confirms this is all valid, could you please post
> a patch replacing wake_up_interruptible() with wake_up_pollfree()? We
> don't need to worry about wake_up_all() because we have a limitation
> of one trigger per file descriptor:
> https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L1419,
> so there can be only one waiter.
Yes, will do (and will CC stable).
Regards,
Munehisa
On Fri, 2023-01-20 02:46:13 +0000, Munehisa Kamata <[email protected]> wrote:
>
> On Fri, 2023-01-20 01:37:11 +0000, Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Thu, Jan 19, 2023 at 5:31 PM Hillf Danton <[email protected]> wrote:
> > >
> > > On Thu, 19 Jan 2023 13:01:42 -0800 Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > Hi Folks,
> > > > I spent some more time digging into the details and this is what's
> > > > happening. When we call rmdir to delete the cgroup with the pressure
> > > > file being epoll'ed, roughly the following call chain happens in the
> > > > context of the shell process:
> > > >
> > > > do_rmdir
> > > > cgroup_rmdir
> > > > kernfs_drain_open_files
> > > > cgroup_file_release
> > > > cgroup_pressure_release
> > > > psi_trigger_destroy
> > > >
> > > > Later on in the context of our reproducer, the last fput() is called
> > > > causing wait queue removal:
> > > >
> > > > fput
> > > > ep_eventpoll_release
> > > > ep_free
> > > > ep_remove_wait_queue
> > > > remove_wait_queue
> > > >
> > > > By this time psi_trigger_destroy() already destroyed the trigger's
> > > > waitqueue head and we hit UAF.
> > > > I think the conceptual problem here (or maybe that's by design?) is
> > > > that cgroup_file_release() is not really tied to the file's real
> > > > lifetime (when the last fput() is issued). Otherwise fput() would call
> > > > eventpoll_release() before f_op->release() and the order would be fine
> > > > (we would remove the wait queue first in eventpoll_release() and then
> > > > f_op->release() would cause trigger's destruction).
> > >
> > > eventpoll_release
> > > eventpoll_release_file
> > > ep_remove
> > > ep_unregister_pollwait
> > > ep_remove_wait_queue
> > >
> >
> > Yes but fput() calls eventpoll_release() *before* f_op->release(), so
> > waitqueue_head would be removed before trigger destruction.
>
> But pwq->whead is still pointing the freed head, then we just hit the same
> issue earlier?
Ah nevermind, that was just a hypothetical case if cgroup_file_release()
was tied to file's lifetime and assuming trigger destruction that frees
the queue and clears pwq->whead would happen later in f_op->release();
there is no such an implementation today.
Sorry for noise.
> > > Different roads run into the same Roma city.
> >
> > You butchered the phrase :)
> >
> > >
> > > > Considering these findings, I think we can use the wake_up_pollfree()
> > > > without contradicting the comment at
> > > > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253
> > > > because indeed, cgroup_file_release() and therefore
> > > > psi_trigger_destroy() are not tied to the file's lifetime.
> > > >
> > > > I'm CC'ing Tejun to check if this makes sense to him and
> > > > cgroup_file_release() is working as expected in this case.
> > > >
> > > > Munehisha, if Tejun confirms this is all valid, could you please post
> > > > a patch replacing wake_up_interruptible() with wake_up_pollfree()? We
> > > > don't need to worry about wake_up_all() because we have a limitation
> > > > of one trigger per file descriptor:
> > > > https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L1419,
> > > > so there can be only one waiter.
> > > > Thanks,
> > > > Suren.
> > >
> >
> >
>
>
On Fri, 2023-01-20 01:37:11 +0000, Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Jan 19, 2023 at 5:31 PM Hillf Danton <[email protected]> wrote:
> >
> > On Thu, 19 Jan 2023 13:01:42 -0800 Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > Hi Folks,
> > > I spent some more time digging into the details and this is what's
> > > happening. When we call rmdir to delete the cgroup with the pressure
> > > file being epoll'ed, roughly the following call chain happens in the
> > > context of the shell process:
> > >
> > > do_rmdir
> > > cgroup_rmdir
> > > kernfs_drain_open_files
> > > cgroup_file_release
> > > cgroup_pressure_release
> > > psi_trigger_destroy
> > >
> > > Later on in the context of our reproducer, the last fput() is called
> > > causing wait queue removal:
> > >
> > > fput
> > > ep_eventpoll_release
> > > ep_free
> > > ep_remove_wait_queue
> > > remove_wait_queue
> > >
> > > By this time psi_trigger_destroy() already destroyed the trigger's
> > > waitqueue head and we hit UAF.
> > > I think the conceptual problem here (or maybe that's by design?) is
> > > that cgroup_file_release() is not really tied to the file's real
> > > lifetime (when the last fput() is issued). Otherwise fput() would call
> > > eventpoll_release() before f_op->release() and the order would be fine
> > > (we would remove the wait queue first in eventpoll_release() and then
> > > f_op->release() would cause trigger's destruction).
> >
> > eventpoll_release
> > eventpoll_release_file
> > ep_remove
> > ep_unregister_pollwait
> > ep_remove_wait_queue
> >
>
> Yes but fput() calls eventpoll_release() *before* f_op->release(), so
> waitqueue_head would be removed before trigger destruction.
But pwq->whead is still pointing the freed head, then we just hit the same
issue earlier?
> > Different roads run into the same Roma city.
>
> You butchered the phrase :)
>
> >
> > > Considering these findings, I think we can use the wake_up_pollfree()
> > > without contradicting the comment at
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253
> > > because indeed, cgroup_file_release() and therefore
> > > psi_trigger_destroy() are not tied to the file's lifetime.
> > >
> > > I'm CC'ing Tejun to check if this makes sense to him and
> > > cgroup_file_release() is working as expected in this case.
> > >
> > > Munehisha, if Tejun confirms this is all valid, could you please post
> > > a patch replacing wake_up_interruptible() with wake_up_pollfree()? We
> > > don't need to worry about wake_up_all() because we have a limitation
> > > of one trigger per file descriptor:
> > > https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L1419,
> > > so there can be only one waiter.
> > > Thanks,
> > > Suren.
> >
>
>
On Fri, Jan 20, 2023 at 1:00 AM Hillf Danton <[email protected]> wrote:
>
> On Thu, 19 Jan 2023 17:37:11 -0800 Suren Baghdasaryan <[email protected]> wrote:
> > On Thu, Jan 19, 2023 at 5:31 PM Hillf Danton <[email protected]> wrote:
> > > On Thu, 19 Jan 2023 13:01:42 -0800 Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > Hi Folks,
> > > > I spent some more time digging into the details and this is what's
> > > > happening. When we call rmdir to delete the cgroup with the pressure
> > > > file being epoll'ed, roughly the following call chain happens in the
> > > > context of the shell process:
> > > >
> > > > do_rmdir
> > > > cgroup_rmdir
> > > > kernfs_drain_open_files
> > > > cgroup_file_release
> > > > cgroup_pressure_release
> > > > psi_trigger_destroy
> > > >
> > > > Later on in the context of our reproducer, the last fput() is called
> > > > causing wait queue removal:
> > > >
> > > > fput
> > > > ep_eventpoll_release
> > > > ep_free
> > > > ep_remove_wait_queue
> > > > remove_wait_queue
> > > >
> > > > By this time psi_trigger_destroy() already destroyed the trigger's
> > > > waitqueue head and we hit UAF.
> > > > I think the conceptual problem here (or maybe that's by design?) is
> > > > that cgroup_file_release() is not really tied to the file's real
> > > > lifetime (when the last fput() is issued). Otherwise fput() would call
> > > > eventpoll_release() before f_op->release() and the order would be fine
> > > > (we would remove the wait queue first in eventpoll_release() and then
> > > > f_op->release() would cause trigger's destruction).
> > >
> > > eventpoll_release
> > > eventpoll_release_file
> > > ep_remove
> > > ep_unregister_pollwait
> > > ep_remove_wait_queue
> > >
> >
> > Yes but fput() calls eventpoll_release() *before* f_op->release(), so
> > waitqueue_head would be removed before trigger destruction.
>
> Then check if file is polled before destroying trigger.
>
> +++ b/kernel/sched/psi.c
> @@ -1529,6 +1529,7 @@ static int psi_fop_release(struct inode
> {
> struct seq_file *seq = file->private_data;
>
> + eventpoll_release_file(file);
Be careful here and see the comment in
https://elixir.bootlin.com/linux/latest/source/fs/eventpoll.c#L912.
eventpoll_release_file() assumes that the last fput() was called and
nobody other than ep_free() will race with us. So, this will not be
that simple. Besides if we really need to fix the order here, the fix
should be somewhere at the level of cgroup_file_release() or even
kernfs to work for other similar situations.
> psi_trigger_destroy(seq->private);
> return single_release(inode, file);
> }
>
On Fri, Jan 20, 2023 at 9:18 PM Hillf Danton <[email protected]> wrote:
>
> On Fri, 20 Jan 2023 08:28:25 -0800 Suren Baghdasaryan <[email protected]>
> > On Fri, Jan 20, 2023 at 1:00 AM Hillf Danton <[email protected]> wrote:
> > > +++ b/kernel/sched/psi.c
> > > @@ -1529,6 +1529,7 @@ static int psi_fop_release(struct inode
> > > {
> > > struct seq_file *seq = file->private_data;
> > >
> > > + eventpoll_release_file(file);
> >
> > Be careful here and see the comment in
> > https://elixir.bootlin.com/linux/latest/source/fs/eventpoll.c#L912.
> > eventpoll_release_file() assumes that the last fput() was called and
> > nobody other than ep_free() will race with us. So, this will not be
> > that simple.
>
> The epmutex serializes eventpoll_release_file() and ep_free(). And this
> is in psi_fop_release(), so no chance is likely left for another release.
>
> > Besides if we really need to fix the order here, the fix
> > should be somewhere at the level of cgroup_file_release() or even
> > kernfs to work for other similar situations.
>
> Good point but cgroup and kernfs have no idea of psi trigger.
Yes, that's why I think if we really need to fix the order here and do
it properly, it won't be straightforward. IMHO wake_up_pollfree() is
an appropriate and simple fix for this.
>
> The bonus of the uaf is check polled file upon release in scenarios like
> the psi trigger.
>
If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed without clearing the queue and reference, then it can
result in use-after-free as below. Use wake_up_pollfree() instead to
ensure that the queue and reference are cleared before freeing.
BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
Write of size 4 at addr ffff88810e625328 by task a.out/4404
CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
Call Trace:
<TASK>
dump_stack_lvl+0x73/0xa0
print_report+0x16c/0x4e0
? _printk+0x59/0x80
? __virt_addr_valid+0xb8/0x130
? _raw_spin_lock_irqsave+0x60/0xc0
kasan_report+0xc3/0xf0
? _raw_spin_lock_irqsave+0x60/0xc0
kasan_check_range+0x2d2/0x310
_raw_spin_lock_irqsave+0x60/0xc0
remove_wait_queue+0x1a/0xa0
ep_free+0x12c/0x170
ep_eventpoll_release+0x26/0x30
__fput+0x202/0x400
task_work_run+0x11d/0x170
do_exit+0x495/0x1130
? update_cfs_rq_load_avg+0x2c2/0x2e0
do_group_exit+0x100/0x100
get_signal+0xd67/0xde0
? finish_task_switch+0x15f/0x3a0
arch_do_signal_or_restart+0x2a/0x2b0
exit_to_user_mode_prepare+0x94/0x100
syscall_exit_to_user_mode+0x20/0x40
do_syscall_64+0x52/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f8e392bfb91
Code: Unable to access opcode bytes at 0x7f8e392bfb67.
RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Allocated by task 4404:
kasan_set_track+0x3d/0x60
__kasan_kmalloc+0x85/0x90
psi_trigger_create+0x113/0x3e0
pressure_write+0x146/0x2e0
cgroup_file_write+0x11c/0x250
kernfs_fop_write_iter+0x186/0x220
vfs_write+0x3d8/0x5c0
ksys_write+0x90/0x110
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 4407:
kasan_set_track+0x3d/0x60
kasan_save_free_info+0x27/0x40
____kasan_slab_free+0x11d/0x170
slab_free_freelist_hook+0x87/0x150
__kmem_cache_free+0xcb/0x180
psi_trigger_destroy+0x2e8/0x310
cgroup_file_release+0x4f/0xb0
kernfs_drain_open_files+0x165/0x1f0
kernfs_drain+0x162/0x1a0
__kernfs_remove+0x1fb/0x310
kernfs_remove_by_name_ns+0x95/0xe0
cgroup_addrm_files+0x67f/0x700
cgroup_destroy_locked+0x283/0x3c0
cgroup_rmdir+0x29/0x100
kernfs_iop_rmdir+0xd1/0x140
vfs_rmdir+0xfe/0x240
do_rmdir+0x13d/0x280
__x64_sys_rmdir+0x2c/0x30
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Link: https://lore.kernel.org/lkml/[email protected]/
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Cc: [email protected]
Signed-off-by: Munehisa Kamata <[email protected]>
Signed-off-by: Mengchi Cheng <[email protected]>
---
kernel/sched/psi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8ac8b81bfee6..6e66c15f6450 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
group = t->group;
/*
- * Wakeup waiters to stop polling. Can happen if cgroup is deleted
- * from under a polling process.
+ * Wakeup waiters to stop polling and clear the queue to prevent it from
+ * being accessed later. Can happen if cgroup is deleted from under a
+ * polling process otherwise.
*/
- wake_up_interruptible(&t->event_wait);
+ wake_up_pollfree(&t->event_wait);
mutex_lock(&group->trigger_lock);
--
2.38.1
On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8ac8b81bfee6..6e66c15f6450 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
>
> group = t->group;
> /*
> - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> - * from under a polling process.
> + * Wakeup waiters to stop polling and clear the queue to prevent it from
> + * being accessed later. Can happen if cgroup is deleted from under a
> + * polling process otherwise.
> */
> - wake_up_interruptible(&t->event_wait);
> + wake_up_pollfree(&t->event_wait);
>
> mutex_lock(&group->trigger_lock);
wake_up_pollfree() should only be used in extremely rare cases. Why can't the
lifetime of the waitqueue be fixed instead?
- Eric
On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <[email protected]> wrote:
>
> On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 8ac8b81bfee6..6e66c15f6450 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> >
> > group = t->group;
> > /*
> > - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > - * from under a polling process.
> > + * Wakeup waiters to stop polling and clear the queue to prevent it from
> > + * being accessed later. Can happen if cgroup is deleted from under a
> > + * polling process otherwise.
> > */
> > - wake_up_interruptible(&t->event_wait);
> > + wake_up_pollfree(&t->event_wait);
> >
> > mutex_lock(&group->trigger_lock);
>
> wake_up_pollfree() should only be used in extremely rare cases. Why can't the
> lifetime of the waitqueue be fixed instead?
waitqueue lifetime in this case is linked to cgroup_file_release(),
which seems appropriate to me here. Unfortunately
cgroup_file_release() is not directly linked to the file's lifetime.
For more details see:
https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
.
So, if we want to fix the lifetime of the waitqueue, we would have to
tie cgroup_file_release() to the fput() somehow. IOW, the fix would
have to be done at the cgroups or higher (kernfs?) layer.
Thanks,
Suren.
>
> - Eric
On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <[email protected]> wrote:
> >
> > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > --- a/kernel/sched/psi.c
> > > +++ b/kernel/sched/psi.c
> > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > >
> > > group = t->group;
> > > /*
> > > - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > - * from under a polling process.
> > > + * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > + * being accessed later. Can happen if cgroup is deleted from under a
> > > + * polling process otherwise.
> > > */
> > > - wake_up_interruptible(&t->event_wait);
> > > + wake_up_pollfree(&t->event_wait);
> > >
> > > mutex_lock(&group->trigger_lock);
> >
> > wake_up_pollfree() should only be used in extremely rare cases. Why can't the
> > lifetime of the waitqueue be fixed instead?
>
> waitqueue lifetime in this case is linked to cgroup_file_release(),
> which seems appropriate to me here. Unfortunately
> cgroup_file_release() is not directly linked to the file's lifetime.
> For more details see:
> https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> .
> So, if we want to fix the lifetime of the waitqueue, we would have to
> tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> have to be done at the cgroups or higher (kernfs?) layer.
Hi Eric,
Do you still object to using wake_up_pollfree() for this case?
Changing higher levels to make cgroup_file_release() be tied to fput()
would be ideal but I think that would be a big change for this one
case. If you agree I'll Ack this patch.
Thanks,
Suren.
> Thanks,
> Suren.
>
> >
> > - Eric
On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:
> On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <[email protected]> wrote:
> > >
> > > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > > --- a/kernel/sched/psi.c
> > > > +++ b/kernel/sched/psi.c
> > > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > >
> > > > group = t->group;
> > > > /*
> > > > - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > > - * from under a polling process.
> > > > + * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > > + * being accessed later. Can happen if cgroup is deleted from under a
> > > > + * polling process otherwise.
> > > > */
> > > > - wake_up_interruptible(&t->event_wait);
> > > > + wake_up_pollfree(&t->event_wait);
> > > >
> > > > mutex_lock(&group->trigger_lock);
> > >
> > > wake_up_pollfree() should only be used in extremely rare cases. Why can't the
> > > lifetime of the waitqueue be fixed instead?
> >
> > waitqueue lifetime in this case is linked to cgroup_file_release(),
> > which seems appropriate to me here. Unfortunately
> > cgroup_file_release() is not directly linked to the file's lifetime.
> > For more details see:
> > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> > .
> > So, if we want to fix the lifetime of the waitqueue, we would have to
> > tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> > have to be done at the cgroups or higher (kernfs?) layer.
>
> Hi Eric,
> Do you still object to using wake_up_pollfree() for this case?
> Changing higher levels to make cgroup_file_release() be tied to fput()
> would be ideal but I think that would be a big change for this one
> case. If you agree I'll Ack this patch.
> Thanks,
> Suren.
>
I haven't read the code closely in this case. I'm just letting you know that
wake_up_pollfree() is very much a last-resort option for when the waitqueue
lifetime can't be fixed. So if you want to use wake_up_pollfree(), you need to
explain why no other fix is possible. For example maybe the UAPI depends on the
waitqueue having a nonstandard lifetime.
- Eric
On Thu, Feb 9, 2023 at 10:46 AM Eric Biggers <[email protected]> wrote:
>
> On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <[email protected]> wrote:
> > > >
> > > > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > > > --- a/kernel/sched/psi.c
> > > > > +++ b/kernel/sched/psi.c
> > > > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > > >
> > > > > group = t->group;
> > > > > /*
> > > > > - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > > > - * from under a polling process.
> > > > > + * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > > > + * being accessed later. Can happen if cgroup is deleted from under a
> > > > > + * polling process otherwise.
> > > > > */
> > > > > - wake_up_interruptible(&t->event_wait);
> > > > > + wake_up_pollfree(&t->event_wait);
> > > > >
> > > > > mutex_lock(&group->trigger_lock);
> > > >
> > > > wake_up_pollfree() should only be used in extremely rare cases. Why can't the
> > > > lifetime of the waitqueue be fixed instead?
> > >
> > > waitqueue lifetime in this case is linked to cgroup_file_release(),
> > > which seems appropriate to me here. Unfortunately
> > > cgroup_file_release() is not directly linked to the file's lifetime.
> > > For more details see:
> > > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> > > .
> > > So, if we want to fix the lifetime of the waitqueue, we would have to
> > > tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> > > have to be done at the cgroups or higher (kernfs?) layer.
> >
> > Hi Eric,
> > Do you still object to using wake_up_pollfree() for this case?
> > Changing higher levels to make cgroup_file_release() be tied to fput()
> > would be ideal but I think that would be a big change for this one
> > case. If you agree I'll Ack this patch.
> > Thanks,
> > Suren.
> >
>
> I haven't read the code closely in this case. I'm just letting you know that
> wake_up_pollfree() is very much a last-resort option for when the waitqueue
> lifetime can't be fixed.
Got it. Thanks for the warning.
I think it can be fixed but the right fix would require a sizable
higher level refactoring which might be more justifiable if we have
more such cases in the future.
> So if you want to use wake_up_pollfree(), you need to
> explain why no other fix is possible. For example maybe the UAPI depends on the
> waitqueue having a nonstandard lifetime.
I think the changelog should explain that the waitqueue lifetime in
cases of non-root cgroups is tied to cgroup_file_release() callback,
which in turn is not tied to file's lifetime. That's the reason for
waitqueue and the file having different lifecycles. Would that suffice
as the justification?
Again, I'm not saying that no other fix is possible, but that the
right fix would be much more complex.
Thanks,
Suren.
>
> - Eric
On Thu, Feb 9, 2023 at 11:13 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Feb 9, 2023 at 10:46 AM Eric Biggers <[email protected]> wrote:
> >
> > On Thu, Feb 09, 2023 at 09:09:03AM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Feb 2, 2023 at 1:11 PM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Wed, Feb 1, 2023 at 8:56 PM Eric Biggers <[email protected]> wrote:
> > > > >
> > > > > On Wed, Feb 01, 2023 at 07:00:23PM -0800, Munehisa Kamata wrote:
> > > > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > > > index 8ac8b81bfee6..6e66c15f6450 100644
> > > > > > --- a/kernel/sched/psi.c
> > > > > > +++ b/kernel/sched/psi.c
> > > > > > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > > > >
> > > > > > group = t->group;
> > > > > > /*
> > > > > > - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > > > > > - * from under a polling process.
> > > > > > + * Wakeup waiters to stop polling and clear the queue to prevent it from
> > > > > > + * being accessed later. Can happen if cgroup is deleted from under a
> > > > > > + * polling process otherwise.
> > > > > > */
> > > > > > - wake_up_interruptible(&t->event_wait);
> > > > > > + wake_up_pollfree(&t->event_wait);
> > > > > >
> > > > > > mutex_lock(&group->trigger_lock);
> > > > >
> > > > > wake_up_pollfree() should only be used in extremely rare cases. Why can't the
> > > > > lifetime of the waitqueue be fixed instead?
> > > >
> > > > waitqueue lifetime in this case is linked to cgroup_file_release(),
> > > > which seems appropriate to me here. Unfortunately
> > > > cgroup_file_release() is not directly linked to the file's lifetime.
> > > > For more details see:
> > > > https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
> > > > .
> > > > So, if we want to fix the lifetime of the waitqueue, we would have to
> > > > tie cgroup_file_release() to the fput() somehow. IOW, the fix would
> > > > have to be done at the cgroups or higher (kernfs?) layer.
> > >
> > > Hi Eric,
> > > Do you still object to using wake_up_pollfree() for this case?
> > > Changing higher levels to make cgroup_file_release() be tied to fput()
> > > would be ideal but I think that would be a big change for this one
> > > case. If you agree I'll Ack this patch.
> > > Thanks,
> > > Suren.
> > >
> >
> > I haven't read the code closely in this case. I'm just letting you know that
> > wake_up_pollfree() is very much a last-resort option for when the waitqueue
> > lifetime can't be fixed.
>
> Got it. Thanks for the warning.
> I think it can be fixed but the right fix would require a sizable
> higher level refactoring which might be more justifiable if we have
> more such cases in the future.
>
> > So if you want to use wake_up_pollfree(), you need to
> > explain why no other fix is possible. For example maybe the UAPI depends on the
> > waitqueue having a nonstandard lifetime.
>
> I think the changelog should explain that the waitqueue lifetime in
> cases of non-root cgroups is tied to cgroup_file_release() callback,
> which in turn is not tied to file's lifetime. That's the reason for
> waitqueue and the file having different lifecycles. Would that suffice
> as the justification?
Ok, in the absence of objections, I would suggest resending this patch
with the changelog including details about waitqueue lifetime and
reasons wake_up_pollfree() is required here.
Munehisa, feel free to reuse
https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@mail.gmail.com/#t
if you find it useful.
Thanks,
Suren.
> Again, I'm not saying that no other fix is possible, but that the
> right fix would be much more complex.
> Thanks,
> Suren.
>
> >
> > - Eric
If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed without clearing the queue and reference in the
following path.
do_rmdir
cgroup_rmdir
kernfs_drain_open_files
cgroup_file_release
cgroup_pressure_release
psi_trigger_destroy
However, the polling thread can keep having the last reference to the
pressure file that is tied to the freed waitqueue until explicit close or
exit later.
fput
ep_eventpoll_release
ep_free
ep_remove_wait_queue
remove_wait_queue
Then, the thread accesses to the already-freed waitqueue when dropping the
reference and results in use-after-free as pasted below.
The fundamental problem here is that the lifetime of the waitqueue is not
tied to the file's real lifetime as shown above. Using wake_up_pollfree()
here might be less than ideal, but it also is not fully contradicting the
comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the
waitqueue's lifetime is not tied to file's one and can be considered as
another special case. While this would be fixable by somehow making
cgroup_file_release() be tied to the fput(), it would require sizable
refactoring at cgroups or higher layer which might be more justifiable if
we identify more cases like this.
BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
Write of size 4 at addr ffff88810e625328 by task a.out/4404
CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
Call Trace:
<TASK>
dump_stack_lvl+0x73/0xa0
print_report+0x16c/0x4e0
? _printk+0x59/0x80
? __virt_addr_valid+0xb8/0x130
? _raw_spin_lock_irqsave+0x60/0xc0
kasan_report+0xc3/0xf0
? _raw_spin_lock_irqsave+0x60/0xc0
kasan_check_range+0x2d2/0x310
_raw_spin_lock_irqsave+0x60/0xc0
remove_wait_queue+0x1a/0xa0
ep_free+0x12c/0x170
ep_eventpoll_release+0x26/0x30
__fput+0x202/0x400
task_work_run+0x11d/0x170
do_exit+0x495/0x1130
? update_cfs_rq_load_avg+0x2c2/0x2e0
do_group_exit+0x100/0x100
get_signal+0xd67/0xde0
? finish_task_switch+0x15f/0x3a0
arch_do_signal_or_restart+0x2a/0x2b0
exit_to_user_mode_prepare+0x94/0x100
syscall_exit_to_user_mode+0x20/0x40
do_syscall_64+0x52/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f8e392bfb91
Code: Unable to access opcode bytes at 0x7f8e392bfb67.
RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Allocated by task 4404:
kasan_set_track+0x3d/0x60
__kasan_kmalloc+0x85/0x90
psi_trigger_create+0x113/0x3e0
pressure_write+0x146/0x2e0
cgroup_file_write+0x11c/0x250
kernfs_fop_write_iter+0x186/0x220
vfs_write+0x3d8/0x5c0
ksys_write+0x90/0x110
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 4407:
kasan_set_track+0x3d/0x60
kasan_save_free_info+0x27/0x40
____kasan_slab_free+0x11d/0x170
slab_free_freelist_hook+0x87/0x150
__kmem_cache_free+0xcb/0x180
psi_trigger_destroy+0x2e8/0x310
cgroup_file_release+0x4f/0xb0
kernfs_drain_open_files+0x165/0x1f0
kernfs_drain+0x162/0x1a0
__kernfs_remove+0x1fb/0x310
kernfs_remove_by_name_ns+0x95/0xe0
cgroup_addrm_files+0x67f/0x700
cgroup_destroy_locked+0x283/0x3c0
cgroup_rmdir+0x29/0x100
kernfs_iop_rmdir+0xd1/0x140
vfs_rmdir+0xfe/0x240
do_rmdir+0x13d/0x280
__x64_sys_rmdir+0x2c/0x30
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
v2: updated commit message
Link: https://lore.kernel.org/lkml/[email protected]/
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Cc: [email protected]
Signed-off-by: Munehisa Kamata <[email protected]>
Signed-off-by: Mengchi Cheng <[email protected]>
---
kernel/sched/psi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8ac8b81bfee6..6e66c15f6450 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
group = t->group;
/*
- * Wakeup waiters to stop polling. Can happen if cgroup is deleted
- * from under a polling process.
+ * Wakeup waiters to stop polling and clear the queue to prevent it from
+ * being accessed later. Can happen if cgroup is deleted from under a
+ * polling process otherwise.
*/
- wake_up_interruptible(&t->event_wait);
+ wake_up_pollfree(&t->event_wait);
mutex_lock(&group->trigger_lock);
--
2.38.1
Thanks!
Overall LGTM, just a couple of nits (simplifications):
On Mon, Feb 13, 2023 at 11:04 PM Munehisa Kamata <[email protected]> wrote:
>
> If a non-root cgroup gets removed when there is a thread that registered
> trigger and is polling on a pressure file within the cgroup, the polling
> waitqueue gets freed without clearing the queue and reference in the
> following path.
Let's remove "without clearing the queue and reference" in the above
sentence. The next section explains why this is problematic, therefore
mentioning that here is unnecessary IMHO.
>
> do_rmdir
> cgroup_rmdir
> kernfs_drain_open_files
> cgroup_file_release
> cgroup_pressure_release
> psi_trigger_destroy
>
> However, the polling thread can keep having the last reference to the
> pressure file that is tied to the freed waitqueue until explicit close or
> exit later.
Suggest replacing: However, the polling thread still has a reference
to the pressure file it is polling and will access the freed waitqueue
when file is closed or upon exit:
>
> fput
> ep_eventpoll_release
> ep_free
> ep_remove_wait_queue
> remove_wait_queue
>
> Then, the thread accesses to the already-freed waitqueue when dropping the
> reference and results in use-after-free as pasted below.
Suggest replacing: This results is use-after-free as pasted below.
>
> The fundamental problem here is that the lifetime of the waitqueue is not
> tied to the file's real lifetime as shown above.
The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real
lifetime.
> Using wake_up_pollfree()
> here might be less than ideal, but it also is not fully contradicting the
> comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the
> waitqueue's lifetime is not tied to file's one and can be considered as
> another special case. While this would be fixable by somehow making
> cgroup_file_release() be tied to the fput(), it would require sizable
> refactoring at cgroups or higher layer which might be more justifiable if
> we identify more cases like this.
>
> BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
> Write of size 4 at addr ffff88810e625328 by task a.out/4404
>
> CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
> Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
> Call Trace:
> <TASK>
> dump_stack_lvl+0x73/0xa0
> print_report+0x16c/0x4e0
> ? _printk+0x59/0x80
> ? __virt_addr_valid+0xb8/0x130
> ? _raw_spin_lock_irqsave+0x60/0xc0
> kasan_report+0xc3/0xf0
> ? _raw_spin_lock_irqsave+0x60/0xc0
> kasan_check_range+0x2d2/0x310
> _raw_spin_lock_irqsave+0x60/0xc0
> remove_wait_queue+0x1a/0xa0
> ep_free+0x12c/0x170
> ep_eventpoll_release+0x26/0x30
> __fput+0x202/0x400
> task_work_run+0x11d/0x170
> do_exit+0x495/0x1130
> ? update_cfs_rq_load_avg+0x2c2/0x2e0
> do_group_exit+0x100/0x100
> get_signal+0xd67/0xde0
> ? finish_task_switch+0x15f/0x3a0
> arch_do_signal_or_restart+0x2a/0x2b0
> exit_to_user_mode_prepare+0x94/0x100
> syscall_exit_to_user_mode+0x20/0x40
> do_syscall_64+0x52/0x90
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f8e392bfb91
> Code: Unable to access opcode bytes at 0x7f8e392bfb67.
> RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
> RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
> RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
> RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
> R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
> R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
> </TASK>
>
> Allocated by task 4404:
> kasan_set_track+0x3d/0x60
> __kasan_kmalloc+0x85/0x90
> psi_trigger_create+0x113/0x3e0
> pressure_write+0x146/0x2e0
> cgroup_file_write+0x11c/0x250
> kernfs_fop_write_iter+0x186/0x220
> vfs_write+0x3d8/0x5c0
> ksys_write+0x90/0x110
> do_syscall_64+0x43/0x90
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> Freed by task 4407:
> kasan_set_track+0x3d/0x60
> kasan_save_free_info+0x27/0x40
> ____kasan_slab_free+0x11d/0x170
> slab_free_freelist_hook+0x87/0x150
> __kmem_cache_free+0xcb/0x180
> psi_trigger_destroy+0x2e8/0x310
> cgroup_file_release+0x4f/0xb0
> kernfs_drain_open_files+0x165/0x1f0
> kernfs_drain+0x162/0x1a0
> __kernfs_remove+0x1fb/0x310
> kernfs_remove_by_name_ns+0x95/0xe0
> cgroup_addrm_files+0x67f/0x700
> cgroup_destroy_locked+0x283/0x3c0
> cgroup_rmdir+0x29/0x100
> kernfs_iop_rmdir+0xd1/0x140
> vfs_rmdir+0xfe/0x240
> do_rmdir+0x13d/0x280
> __x64_sys_rmdir+0x2c/0x30
> do_syscall_64+0x43/0x90
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> v2: updated commit message
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Fixes: 0e94682b73bf ("psi: introduce psi monitor")
> Cc: [email protected]
> Signed-off-by: Munehisa Kamata <[email protected]>
> Signed-off-by: Mengchi Cheng <[email protected]>
> ---
> kernel/sched/psi.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8ac8b81bfee6..6e66c15f6450 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
>
> group = t->group;
> /*
> - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> - * from under a polling process.
> + * Wakeup waiters to stop polling and clear the queue to prevent it from
> + * being accessed later. Can happen if cgroup is deleted from under a
> + * polling process otherwise.
This "otherwise" at the end seems extra. Was there a continuation of
this comment which was removed without removing this "otherwise" ?
> */
> - wake_up_interruptible(&t->event_wait);
> + wake_up_pollfree(&t->event_wait);
>
> mutex_lock(&group->trigger_lock);
>
> --
> 2.38.1
>
If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed in the following path.
do_rmdir
cgroup_rmdir
kernfs_drain_open_files
cgroup_file_release
cgroup_pressure_release
psi_trigger_destroy
However, the polling thread still has a reference to the pressure file and
will access the freed waitqueue when the file is closed or upon exit.
fput
ep_eventpoll_release
ep_free
ep_remove_wait_queue
remove_wait_queue
This results in use-after-free as pasted below.
The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real lifetime.
Using wake_up_pollfree() here might be less than ideal, but it also is not
fully contradicting the comment at commit 42288cb44c4b ("wait: add
wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
one and can be considered as another special case. While this would be
fixable by somehow making cgroup_file_release() be tied to the fput(), it
would require sizable refactoring at cgroups or higher layer which might be
more justifiable if we identify more cases like this.
BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
Write of size 4 at addr ffff88810e625328 by task a.out/4404
CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
Call Trace:
<TASK>
dump_stack_lvl+0x73/0xa0
print_report+0x16c/0x4e0
? _printk+0x59/0x80
? __virt_addr_valid+0xb8/0x130
? _raw_spin_lock_irqsave+0x60/0xc0
kasan_report+0xc3/0xf0
? _raw_spin_lock_irqsave+0x60/0xc0
kasan_check_range+0x2d2/0x310
_raw_spin_lock_irqsave+0x60/0xc0
remove_wait_queue+0x1a/0xa0
ep_free+0x12c/0x170
ep_eventpoll_release+0x26/0x30
__fput+0x202/0x400
task_work_run+0x11d/0x170
do_exit+0x495/0x1130
? update_cfs_rq_load_avg+0x2c2/0x2e0
do_group_exit+0x100/0x100
get_signal+0xd67/0xde0
? finish_task_switch+0x15f/0x3a0
arch_do_signal_or_restart+0x2a/0x2b0
exit_to_user_mode_prepare+0x94/0x100
syscall_exit_to_user_mode+0x20/0x40
do_syscall_64+0x52/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f8e392bfb91
Code: Unable to access opcode bytes at 0x7f8e392bfb67.
RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Allocated by task 4404:
kasan_set_track+0x3d/0x60
__kasan_kmalloc+0x85/0x90
psi_trigger_create+0x113/0x3e0
pressure_write+0x146/0x2e0
cgroup_file_write+0x11c/0x250
kernfs_fop_write_iter+0x186/0x220
vfs_write+0x3d8/0x5c0
ksys_write+0x90/0x110
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 4407:
kasan_set_track+0x3d/0x60
kasan_save_free_info+0x27/0x40
____kasan_slab_free+0x11d/0x170
slab_free_freelist_hook+0x87/0x150
__kmem_cache_free+0xcb/0x180
psi_trigger_destroy+0x2e8/0x310
cgroup_file_release+0x4f/0xb0
kernfs_drain_open_files+0x165/0x1f0
kernfs_drain+0x162/0x1a0
__kernfs_remove+0x1fb/0x310
kernfs_remove_by_name_ns+0x95/0xe0
cgroup_addrm_files+0x67f/0x700
cgroup_destroy_locked+0x283/0x3c0
cgroup_rmdir+0x29/0x100
kernfs_iop_rmdir+0xd1/0x140
vfs_rmdir+0xfe/0x240
do_rmdir+0x13d/0x280
__x64_sys_rmdir+0x2c/0x30
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
v3: updated commit message and the comment in the code
v2: updated commit message
Link: https://lore.kernel.org/lkml/[email protected]/
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Cc: [email protected]
Signed-off-by: Munehisa Kamata <[email protected]>
Signed-off-by: Mengchi Cheng <[email protected]>
---
kernel/sched/psi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8ac8b81bfee6..02e011cabe91 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
group = t->group;
/*
- * Wakeup waiters to stop polling. Can happen if cgroup is deleted
- * from under a polling process.
+ * Wakeup waiters to stop polling and clear the queue to prevent it from
+ * being accessed later. Can happen if cgroup is deleted from under a
+ * polling process.
*/
- wake_up_interruptible(&t->event_wait);
+ wake_up_pollfree(&t->event_wait);
mutex_lock(&group->trigger_lock);
--
2.38.1
On Tue, Feb 14, 2023 at 10:13 AM Munehisa Kamata <[email protected]> wrote:
>
> If a non-root cgroup gets removed when there is a thread that registered
> trigger and is polling on a pressure file within the cgroup, the polling
> waitqueue gets freed in the following path.
>
> do_rmdir
> cgroup_rmdir
> kernfs_drain_open_files
> cgroup_file_release
> cgroup_pressure_release
> psi_trigger_destroy
>
> However, the polling thread still has a reference to the pressure file and
> will access the freed waitqueue when the file is closed or upon exit.
>
> fput
> ep_eventpoll_release
> ep_free
> ep_remove_wait_queue
> remove_wait_queue
>
> This results in use-after-free as pasted below.
>
> The fundamental problem here is that cgroup_file_release() (and
> consequently waitqueue's lifetime) is not tied to the file's real lifetime.
> Using wake_up_pollfree() here might be less than ideal, but it also is not
> fully contradicting the comment at commit 42288cb44c4b ("wait: add
> wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
> one and can be considered as another special case. While this would be
> fixable by somehow making cgroup_file_release() be tied to the fput(), it
> would require sizable refactoring at cgroups or higher layer which might be
> more justifiable if we identify more cases like this.
>
> BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
> Write of size 4 at addr ffff88810e625328 by task a.out/4404
>
> CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
> Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
> Call Trace:
> <TASK>
> dump_stack_lvl+0x73/0xa0
> print_report+0x16c/0x4e0
> ? _printk+0x59/0x80
> ? __virt_addr_valid+0xb8/0x130
> ? _raw_spin_lock_irqsave+0x60/0xc0
> kasan_report+0xc3/0xf0
> ? _raw_spin_lock_irqsave+0x60/0xc0
> kasan_check_range+0x2d2/0x310
> _raw_spin_lock_irqsave+0x60/0xc0
> remove_wait_queue+0x1a/0xa0
> ep_free+0x12c/0x170
> ep_eventpoll_release+0x26/0x30
> __fput+0x202/0x400
> task_work_run+0x11d/0x170
> do_exit+0x495/0x1130
> ? update_cfs_rq_load_avg+0x2c2/0x2e0
> do_group_exit+0x100/0x100
> get_signal+0xd67/0xde0
> ? finish_task_switch+0x15f/0x3a0
> arch_do_signal_or_restart+0x2a/0x2b0
> exit_to_user_mode_prepare+0x94/0x100
> syscall_exit_to_user_mode+0x20/0x40
> do_syscall_64+0x52/0x90
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f8e392bfb91
> Code: Unable to access opcode bytes at 0x7f8e392bfb67.
> RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
> RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
> RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
> RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
> R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
> R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
> </TASK>
>
> Allocated by task 4404:
> kasan_set_track+0x3d/0x60
> __kasan_kmalloc+0x85/0x90
> psi_trigger_create+0x113/0x3e0
> pressure_write+0x146/0x2e0
> cgroup_file_write+0x11c/0x250
> kernfs_fop_write_iter+0x186/0x220
> vfs_write+0x3d8/0x5c0
> ksys_write+0x90/0x110
> do_syscall_64+0x43/0x90
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> Freed by task 4407:
> kasan_set_track+0x3d/0x60
> kasan_save_free_info+0x27/0x40
> ____kasan_slab_free+0x11d/0x170
> slab_free_freelist_hook+0x87/0x150
> __kmem_cache_free+0xcb/0x180
> psi_trigger_destroy+0x2e8/0x310
> cgroup_file_release+0x4f/0xb0
> kernfs_drain_open_files+0x165/0x1f0
> kernfs_drain+0x162/0x1a0
> __kernfs_remove+0x1fb/0x310
> kernfs_remove_by_name_ns+0x95/0xe0
> cgroup_addrm_files+0x67f/0x700
> cgroup_destroy_locked+0x283/0x3c0
> cgroup_rmdir+0x29/0x100
> kernfs_iop_rmdir+0xd1/0x140
> vfs_rmdir+0xfe/0x240
> do_rmdir+0x13d/0x280
> __x64_sys_rmdir+0x2c/0x30
> do_syscall_64+0x43/0x90
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> v3: updated commit message and the comment in the code
> v2: updated commit message
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Fixes: 0e94682b73bf ("psi: introduce psi monitor")
> Cc: [email protected]
> Signed-off-by: Munehisa Kamata <[email protected]>
> Signed-off-by: Mengchi Cheng <[email protected]>
Acked-by: Suren Baghdasaryan <[email protected]>
> ---
> kernel/sched/psi.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8ac8b81bfee6..02e011cabe91 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
>
> group = t->group;
> /*
> - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> - * from under a polling process.
> + * Wakeup waiters to stop polling and clear the queue to prevent it from
> + * being accessed later. Can happen if cgroup is deleted from under a
> + * polling process.
> */
> - wake_up_interruptible(&t->event_wait);
> + wake_up_pollfree(&t->event_wait);
>
> mutex_lock(&group->trigger_lock);
>
> --
> 2.38.1
>
On Tue, Feb 14, 2023 at 10:28 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Tue, Feb 14, 2023 at 10:13 AM Munehisa Kamata <[email protected]> wrote:
> >
> > If a non-root cgroup gets removed when there is a thread that registered
> > trigger and is polling on a pressure file within the cgroup, the polling
> > waitqueue gets freed in the following path.
> >
> > do_rmdir
> > cgroup_rmdir
> > kernfs_drain_open_files
> > cgroup_file_release
> > cgroup_pressure_release
> > psi_trigger_destroy
> >
> > However, the polling thread still has a reference to the pressure file and
> > will access the freed waitqueue when the file is closed or upon exit.
> >
> > fput
> > ep_eventpoll_release
> > ep_free
> > ep_remove_wait_queue
> > remove_wait_queue
> >
> > This results in use-after-free as pasted below.
> >
> > The fundamental problem here is that cgroup_file_release() (and
> > consequently waitqueue's lifetime) is not tied to the file's real lifetime.
> > Using wake_up_pollfree() here might be less than ideal, but it also is not
> > fully contradicting the comment at commit 42288cb44c4b ("wait: add
> > wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
> > one and can be considered as another special case. While this would be
> > fixable by somehow making cgroup_file_release() be tied to the fput(), it
> > would require sizable refactoring at cgroups or higher layer which might be
> > more justifiable if we identify more cases like this.
> >
> > BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
> > Write of size 4 at addr ffff88810e625328 by task a.out/4404
> >
> > CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
> > Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
> > Call Trace:
> > <TASK>
> > dump_stack_lvl+0x73/0xa0
> > print_report+0x16c/0x4e0
> > ? _printk+0x59/0x80
> > ? __virt_addr_valid+0xb8/0x130
> > ? _raw_spin_lock_irqsave+0x60/0xc0
> > kasan_report+0xc3/0xf0
> > ? _raw_spin_lock_irqsave+0x60/0xc0
> > kasan_check_range+0x2d2/0x310
> > _raw_spin_lock_irqsave+0x60/0xc0
> > remove_wait_queue+0x1a/0xa0
> > ep_free+0x12c/0x170
> > ep_eventpoll_release+0x26/0x30
> > __fput+0x202/0x400
> > task_work_run+0x11d/0x170
> > do_exit+0x495/0x1130
> > ? update_cfs_rq_load_avg+0x2c2/0x2e0
> > do_group_exit+0x100/0x100
> > get_signal+0xd67/0xde0
> > ? finish_task_switch+0x15f/0x3a0
> > arch_do_signal_or_restart+0x2a/0x2b0
> > exit_to_user_mode_prepare+0x94/0x100
> > syscall_exit_to_user_mode+0x20/0x40
> > do_syscall_64+0x52/0x90
> > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > RIP: 0033:0x7f8e392bfb91
> > Code: Unable to access opcode bytes at 0x7f8e392bfb67.
> > RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
> > RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
> > RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
> > RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
> > R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
> > R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
> > </TASK>
> >
> > Allocated by task 4404:
> > kasan_set_track+0x3d/0x60
> > __kasan_kmalloc+0x85/0x90
> > psi_trigger_create+0x113/0x3e0
> > pressure_write+0x146/0x2e0
> > cgroup_file_write+0x11c/0x250
> > kernfs_fop_write_iter+0x186/0x220
> > vfs_write+0x3d8/0x5c0
> > ksys_write+0x90/0x110
> > do_syscall_64+0x43/0x90
> > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > Freed by task 4407:
> > kasan_set_track+0x3d/0x60
> > kasan_save_free_info+0x27/0x40
> > ____kasan_slab_free+0x11d/0x170
> > slab_free_freelist_hook+0x87/0x150
> > __kmem_cache_free+0xcb/0x180
> > psi_trigger_destroy+0x2e8/0x310
> > cgroup_file_release+0x4f/0xb0
> > kernfs_drain_open_files+0x165/0x1f0
> > kernfs_drain+0x162/0x1a0
> > __kernfs_remove+0x1fb/0x310
> > kernfs_remove_by_name_ns+0x95/0xe0
> > cgroup_addrm_files+0x67f/0x700
> > cgroup_destroy_locked+0x283/0x3c0
> > cgroup_rmdir+0x29/0x100
> > kernfs_iop_rmdir+0xd1/0x140
> > vfs_rmdir+0xfe/0x240
> > do_rmdir+0x13d/0x280
> > __x64_sys_rmdir+0x2c/0x30
> > do_syscall_64+0x43/0x90
> > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > v3: updated commit message and the comment in the code
> > v2: updated commit message
> >
> > Link: https://lore.kernel.org/lkml/[email protected]/
> > Fixes: 0e94682b73bf ("psi: introduce psi monitor")
> > Cc: [email protected]
> > Signed-off-by: Munehisa Kamata <[email protected]>
> > Signed-off-by: Mengchi Cheng <[email protected]>
>
Acked-by: Suren Baghdasaryan <[email protected]>
CC'ing PeterZ directly for inclusion into his tree.
>
> > ---
> > kernel/sched/psi.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 8ac8b81bfee6..02e011cabe91 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> >
> > group = t->group;
> > /*
> > - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > - * from under a polling process.
> > + * Wakeup waiters to stop polling and clear the queue to prevent it from
> > + * being accessed later. Can happen if cgroup is deleted from under a
> > + * polling process.
> > */
> > - wake_up_interruptible(&t->event_wait);
> > + wake_up_pollfree(&t->event_wait);
> >
> > mutex_lock(&group->trigger_lock);
> >
> > --
> > 2.38.1
> >
On Tue, 2023-02-14 17:10:58 +0000, Suren Baghdasaryan <[email protected]> wrote:
>
> Thanks!
> Overall LGTM, just a couple of nits (simplifications):
>
> On Mon, Feb 13, 2023 at 11:04 PM Munehisa Kamata <[email protected]> wrote:
> >
> > If a non-root cgroup gets removed when there is a thread that registered
> > trigger and is polling on a pressure file within the cgroup, the polling
> > waitqueue gets freed without clearing the queue and reference in the
> > following path.
>
> Let's remove "without clearing the queue and reference" in the above
> sentence. The next section explains why this is problematic, therefore
> mentioning that here is unnecessary IMHO.
Applied in v3.
> >
> > do_rmdir
> > cgroup_rmdir
> > kernfs_drain_open_files
> > cgroup_file_release
> > cgroup_pressure_release
> > psi_trigger_destroy
> >
> > However, the polling thread can keep having the last reference to the
> > pressure file that is tied to the freed waitqueue until explicit close or
> > exit later.
>
> Suggest replacing: However, the polling thread still has a reference
> to the pressure file it is polling and will access the freed waitqueue
> when file is closed or upon exit:
Applied in v3.
> >
> > fput
> > ep_eventpoll_release
> > ep_free
> > ep_remove_wait_queue
> > remove_wait_queue
> >
> > Then, the thread accesses to the already-freed waitqueue when dropping the
> > reference and results in use-after-free as pasted below.
>
> Suggest replacing: This results is use-after-free as pasted below.
Applied in v3.
> >
> > The fundamental problem here is that the lifetime of the waitqueue is not
> > tied to the file's real lifetime as shown above.
>
> The fundamental problem here is that cgroup_file_release() (and
> consequently waitqueue's lifetime) is not tied to the file's real
> lifetime.
Applied in v3 as well.
> > Using wake_up_pollfree()
> > here might be less than ideal, but it also is not fully contradicting the
> > comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()") since the
> > waitqueue's lifetime is not tied to file's one and can be considered as
> > another special case. While this would be fixable by somehow making
> > cgroup_file_release() be tied to the fput(), it would require sizable
> > refactoring at cgroups or higher layer which might be more justifiable if
> > we identify more cases like this.
> >
> > BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
> > Write of size 4 at addr ffff88810e625328 by task a.out/4404
> >
> > CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
> > Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
> > Call Trace:
> > <TASK>
> > dump_stack_lvl+0x73/0xa0
> > print_report+0x16c/0x4e0
> > ? _printk+0x59/0x80
> > ? __virt_addr_valid+0xb8/0x130
> > ? _raw_spin_lock_irqsave+0x60/0xc0
> > kasan_report+0xc3/0xf0
> > ? _raw_spin_lock_irqsave+0x60/0xc0
> > kasan_check_range+0x2d2/0x310
> > _raw_spin_lock_irqsave+0x60/0xc0
> > remove_wait_queue+0x1a/0xa0
> > ep_free+0x12c/0x170
> > ep_eventpoll_release+0x26/0x30
> > __fput+0x202/0x400
> > task_work_run+0x11d/0x170
> > do_exit+0x495/0x1130
> > ? update_cfs_rq_load_avg+0x2c2/0x2e0
> > do_group_exit+0x100/0x100
> > get_signal+0xd67/0xde0
> > ? finish_task_switch+0x15f/0x3a0
> > arch_do_signal_or_restart+0x2a/0x2b0
> > exit_to_user_mode_prepare+0x94/0x100
> > syscall_exit_to_user_mode+0x20/0x40
> > do_syscall_64+0x52/0x90
> > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > RIP: 0033:0x7f8e392bfb91
> > Code: Unable to access opcode bytes at 0x7f8e392bfb67.
> > RSP: 002b:00007fff261e08d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
> > RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f8e392bfb91
> > RDX: 0000000000000001 RSI: 00007fff261e08e8 RDI: 0000000000000004
> > RBP: 00007fff261e0920 R08: 0000000000400780 R09: 00007f8e3960f240
> > R10: 00000000000003df R11: 0000000000000246 R12: 00000000004005a0
> > R13: 00007fff261e0a00 R14: 0000000000000000 R15: 0000000000000000
> > </TASK>
> >
> > Allocated by task 4404:
> > kasan_set_track+0x3d/0x60
> > __kasan_kmalloc+0x85/0x90
> > psi_trigger_create+0x113/0x3e0
> > pressure_write+0x146/0x2e0
> > cgroup_file_write+0x11c/0x250
> > kernfs_fop_write_iter+0x186/0x220
> > vfs_write+0x3d8/0x5c0
> > ksys_write+0x90/0x110
> > do_syscall_64+0x43/0x90
> > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > Freed by task 4407:
> > kasan_set_track+0x3d/0x60
> > kasan_save_free_info+0x27/0x40
> > ____kasan_slab_free+0x11d/0x170
> > slab_free_freelist_hook+0x87/0x150
> > __kmem_cache_free+0xcb/0x180
> > psi_trigger_destroy+0x2e8/0x310
> > cgroup_file_release+0x4f/0xb0
> > kernfs_drain_open_files+0x165/0x1f0
> > kernfs_drain+0x162/0x1a0
> > __kernfs_remove+0x1fb/0x310
> > kernfs_remove_by_name_ns+0x95/0xe0
> > cgroup_addrm_files+0x67f/0x700
> > cgroup_destroy_locked+0x283/0x3c0
> > cgroup_rmdir+0x29/0x100
> > kernfs_iop_rmdir+0xd1/0x140
> > vfs_rmdir+0xfe/0x240
> > do_rmdir+0x13d/0x280
> > __x64_sys_rmdir+0x2c/0x30
> > do_syscall_64+0x43/0x90
> > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > v2: updated commit message
> >
> > Link: https://lore.kernel.org/lkml/[email protected]/
> > Fixes: 0e94682b73bf ("psi: introduce psi monitor")
> > Cc: [email protected]
> > Signed-off-by: Munehisa Kamata <[email protected]>
> > Signed-off-by: Mengchi Cheng <[email protected]>
> > ---
> > kernel/sched/psi.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 8ac8b81bfee6..6e66c15f6450 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -1343,10 +1343,11 @@ void psi_trigger_destroy(struct psi_trigger *t)
> >
> > group = t->group;
> > /*
> > - * Wakeup waiters to stop polling. Can happen if cgroup is deleted
> > - * from under a polling process.
> > + * Wakeup waiters to stop polling and clear the queue to prevent it from
> > + * being accessed later. Can happen if cgroup is deleted from under a
> > + * polling process otherwise.
>
> This "otherwise" at the end seems extra. Was there a continuation of
> this comment which was removed without removing this "otherwise" ?
Removed it in v3. The subject of the original sentense was a bit unclear to
me and I think I tried to have a simplified version of "[Otherwise an
unexpected free of the queue] can happen if ... " there.
Thanks for all the suggestions.
Regards,
Munehisa
> > */
> > - wake_up_interruptible(&t->event_wait);
> > + wake_up_pollfree(&t->event_wait);
> >
> > mutex_lock(&group->trigger_lock);
> >
> > --
> > 2.38.1
> >
>
On Tue, Feb 14, 2023 at 10:13:35AM -0800, Munehisa Kamata wrote:
> Using wake_up_pollfree() here might be less than ideal, but it also is not
> fully contradicting the comment at commit 42288cb44c4b ("wait: add
> wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
> one and can be considered as another special case.
If that comment is outdated now, then please update the comment too. We can do
better than "not fully contradicting".
- Eric
On Tue, Feb 14, 2023 at 10:55 AM Eric Biggers <[email protected]> wrote:
>
> On Tue, Feb 14, 2023 at 10:13:35AM -0800, Munehisa Kamata wrote:
> > Using wake_up_pollfree() here might be less than ideal, but it also is not
> > fully contradicting the comment at commit 42288cb44c4b ("wait: add
> > wake_up_pollfree()") since the waitqueue's lifetime is not tied to file's
> > one and can be considered as another special case.
>
> If that comment is outdated now, then please update the comment too. We can do
> better than "not fully contradicting".
Agree. I don't see it contradicting that comment.
>
> - Eric