2023-03-24 00:55:51

by syzbot

[permalink] [raw]
Subject: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

Hello,

syzbot found the following issue on:

HEAD commit: fff5a5e7f528 Merge tag 'for-linus' of git://git.armlinux.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11884731c80000
kernel config: https://syzkaller.appspot.com/x/.config?x=aaa4b45720ca0519
dashboard link: https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15d1497ac80000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11eed636c80000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/33a184f98b9d/disk-fff5a5e7.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/3d75f967571e/vmlinux-fff5a5e7.xz
kernel image: https://storage.googleapis.com/syzbot-assets/4eeb8edbdc7e/bzImage-fff5a5e7.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]

==================================================================
BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901

CPU: 0 PID: 14901 Comm: syz-executor690 Not tainted 6.3.0-rc3-syzkaller-00026-gfff5a5e7f528 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
print_report mm/kasan/report.c:430 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:536
mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
rtnetlink_rcv_msg+0x996/0xd50 net/core/rtnetlink.c:6165
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
____sys_sendmsg+0x334/0x900 net/socket.c:2501
___sys_sendmsg+0x110/0x1b0 net/socket.c:2555
__sys_sendmmsg+0x18f/0x460 net/socket.c:2641
__do_sys_sendmmsg net/socket.c:2670 [inline]
__se_sys_sendmmsg net/socket.c:2667 [inline]
__x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f4f11222579
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 31 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f4f11184208 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00007f4f112ab2a8 RCX: 00007f4f11222579
RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
RBP: 00007f4f112ab2a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4f112ab2ac
R13: 00007fffc8e5214f R14: 00007f4f11184300 R15: 0000000000022000
</TASK>

Allocated by task 14898:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
____kasan_kmalloc mm/kasan/common.c:374 [inline]
____kasan_kmalloc mm/kasan/common.c:333 [inline]
__kasan_kmalloc+0xa2/0xb0 mm/kasan/common.c:383
kasan_kmalloc include/linux/kasan.h:196 [inline]
__do_kmalloc_node mm/slab_common.c:967 [inline]
__kmalloc_node+0x61/0x1a0 mm/slab_common.c:974
kmalloc_node include/linux/slab.h:610 [inline]
kzalloc_node include/linux/slab.h:731 [inline]
qdisc_alloc+0xb0/0xb30 net/sched/sch_generic.c:938
qdisc_create+0xce/0x1040 net/sched/sch_api.c:1244
tc_modify_qdisc+0x488/0x1a40 net/sched/sch_api.c:1680
rtnetlink_rcv_msg+0x43d/0xd50 net/core/rtnetlink.c:6174
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
____sys_sendmsg+0x334/0x900 net/socket.c:2501
___sys_sendmsg+0x110/0x1b0 net/socket.c:2555
__sys_sendmmsg+0x18f/0x460 net/socket.c:2641
__do_sys_sendmmsg net/socket.c:2670 [inline]
__se_sys_sendmmsg net/socket.c:2667 [inline]
__x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 21:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2e/0x40 mm/kasan/generic.c:521
____kasan_slab_free mm/kasan/common.c:236 [inline]
____kasan_slab_free+0x160/0x1c0 mm/kasan/common.c:200
kasan_slab_free include/linux/kasan.h:162 [inline]
slab_free_hook mm/slub.c:1781 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1807
slab_free mm/slub.c:3787 [inline]
__kmem_cache_free+0xaf/0x2d0 mm/slub.c:3800
rcu_do_batch kernel/rcu/tree.c:2112 [inline]
rcu_core+0x814/0x1960 kernel/rcu/tree.c:2372
__do_softirq+0x1d4/0x905 kernel/softirq.c:571

Last potentially related work creation:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xbc/0xd0 mm/kasan/generic.c:491
__call_rcu_common.constprop.0+0x99/0x7e0 kernel/rcu/tree.c:2622
qdisc_put_unlocked+0x73/0x90 net/sched/sch_generic.c:1097
tcf_block_release+0x86/0x90 net/sched/cls_api.c:1362
tc_new_tfilter+0xa35/0x2290 net/sched/cls_api.c:2331
rtnetlink_rcv_msg+0x996/0xd50 net/core/rtnetlink.c:6165
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
____sys_sendmsg+0x334/0x900 net/socket.c:2501
___sys_sendmsg+0x110/0x1b0 net/socket.c:2555
__sys_sendmmsg+0x18f/0x460 net/socket.c:2641
__do_sys_sendmmsg net/socket.c:2670 [inline]
__se_sys_sendmmsg net/socket.c:2667 [inline]
__x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

The buggy address belongs to the object at ffff888045b31000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 776 bytes inside of
freed 1024-byte region [ffff888045b31000, ffff888045b31400)

The buggy address belongs to the physical page:
page:ffffea000116cc00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888045b37000 pfn:0x45b30
head:ffffea000116cc00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000010200 ffff888012441dc0 0000000000000000 dead000000000001
raw: ffff888045b37000 000000008010000d 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 14184, tgid 14168 (syz-executor690), ts 420322459064, free_ts 15493742100
prep_new_page mm/page_alloc.c:2552 [inline]
get_page_from_freelist+0x1190/0x2e20 mm/page_alloc.c:4325
__alloc_pages+0x1cb/0x4a0 mm/page_alloc.c:5591
alloc_pages+0x1aa/0x270 mm/mempolicy.c:2283
alloc_slab_page mm/slub.c:1851 [inline]
allocate_slab+0x25f/0x390 mm/slub.c:1998
new_slab mm/slub.c:2051 [inline]
___slab_alloc+0xa91/0x1400 mm/slub.c:3193
__slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3292
__slab_alloc_node mm/slub.c:3345 [inline]
slab_alloc_node mm/slub.c:3442 [inline]
__kmem_cache_alloc_node+0x136/0x320 mm/slub.c:3491
kmalloc_trace+0x26/0xe0 mm/slab_common.c:1061
kmalloc include/linux/slab.h:580 [inline]
kmalloc_array include/linux/slab.h:635 [inline]
kcalloc include/linux/slab.h:667 [inline]
fl_change+0x1cf/0x4ac0 net/sched/cls_flower.c:2175
tc_new_tfilter+0x97c/0x2290 net/sched/cls_api.c:2310
rtnetlink_rcv_msg+0x996/0xd50 net/core/rtnetlink.c:6165
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
____sys_sendmsg+0x334/0x900 net/socket.c:2501
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1453 [inline]
free_pcp_prepare+0x5d5/0xa50 mm/page_alloc.c:1503
free_unref_page_prepare mm/page_alloc.c:3387 [inline]
free_unref_page+0x1d/0x490 mm/page_alloc.c:3482
free_contig_range+0xb5/0x180 mm/page_alloc.c:9531
destroy_args+0x6c4/0x920 mm/debug_vm_pgtable.c:1023
debug_vm_pgtable+0x242a/0x4640 mm/debug_vm_pgtable.c:1403
do_one_initcall+0x102/0x540 init/main.c:1310
do_initcall_level init/main.c:1383 [inline]
do_initcalls init/main.c:1399 [inline]
do_basic_setup init/main.c:1418 [inline]
kernel_init_freeable+0x696/0xc00 init/main.c:1638
kernel_init+0x1e/0x2c0 init/main.c:1526
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

Memory state around the buggy address:
ffff888045b31200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888045b31280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888045b31300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888045b31380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888045b31400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches


2023-03-29 01:55:23

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

Seth, does this looks related to commit 267463823adb ("net: sch:
eliminate unnecessary RCU waits in mini_qdisc_pair_swap()")
by any chance?

On Thu, 23 Mar 2023 17:52:40 -0700 syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: fff5a5e7f528 Merge tag 'for-linus' of git://git.armlinux.o..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=11884731c80000
> kernel config: https://syzkaller.appspot.com/x/.config?x=aaa4b45720ca0519
> dashboard link: https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b
> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15d1497ac80000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11eed636c80000
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/33a184f98b9d/disk-fff5a5e7.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/3d75f967571e/vmlinux-fff5a5e7.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/4eeb8edbdc7e/bzImage-fff5a5e7.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: [email protected]
>
> ==================================================================
> BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
>
> CPU: 0 PID: 14901 Comm: syz-executor690 Not tainted 6.3.0-rc3-syzkaller-00026-gfff5a5e7f528 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
> print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
> print_report mm/kasan/report.c:430 [inline]
> kasan_report+0x11c/0x130 mm/kasan/report.c:536
> mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
> tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
> tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
> tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
> tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> rtnetlink_rcv_msg+0x996/0xd50 net/core/rtnetlink.c:6165
> netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
> netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
> netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
> netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
> sock_sendmsg_nosec net/socket.c:724 [inline]
> sock_sendmsg+0xde/0x190 net/socket.c:747
> ____sys_sendmsg+0x334/0x900 net/socket.c:2501
> ___sys_sendmsg+0x110/0x1b0 net/socket.c:2555
> __sys_sendmmsg+0x18f/0x460 net/socket.c:2641
> __do_sys_sendmmsg net/socket.c:2670 [inline]
> __se_sys_sendmmsg net/socket.c:2667 [inline]
> __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2667
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f4f11222579
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 31 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f4f11184208 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
> RAX: ffffffffffffffda RBX: 00007f4f112ab2a8 RCX: 00007f4f11222579
> RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
> RBP: 00007f4f112ab2a0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4f112ab2ac
> R13: 00007fffc8e5214f R14: 00007f4f11184300 R15: 0000000000022000
> </TASK>
>
> Allocated by task 14898:
> kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
> kasan_set_track+0x25/0x30 mm/kasan/common.c:52
> ____kasan_kmalloc mm/kasan/common.c:374 [inline]
> ____kasan_kmalloc mm/kasan/common.c:333 [inline]
> __kasan_kmalloc+0xa2/0xb0 mm/kasan/common.c:383
> kasan_kmalloc include/linux/kasan.h:196 [inline]
> __do_kmalloc_node mm/slab_common.c:967 [inline]
> __kmalloc_node+0x61/0x1a0 mm/slab_common.c:974
> kmalloc_node include/linux/slab.h:610 [inline]
> kzalloc_node include/linux/slab.h:731 [inline]
> qdisc_alloc+0xb0/0xb30 net/sched/sch_generic.c:938
> qdisc_create+0xce/0x1040 net/sched/sch_api.c:1244
> tc_modify_qdisc+0x488/0x1a40 net/sched/sch_api.c:1680
> rtnetlink_rcv_msg+0x43d/0xd50 net/core/rtnetlink.c:6174
> netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
> netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
> netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
> netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
> sock_sendmsg_nosec net/socket.c:724 [inline]
> sock_sendmsg+0xde/0x190 net/socket.c:747
> ____sys_sendmsg+0x334/0x900 net/socket.c:2501
> ___sys_sendmsg+0x110/0x1b0 net/socket.c:2555
> __sys_sendmmsg+0x18f/0x460 net/socket.c:2641
> __do_sys_sendmmsg net/socket.c:2670 [inline]
> __se_sys_sendmmsg net/socket.c:2667 [inline]
> __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2667
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> Freed by task 21:
> kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
> kasan_set_track+0x25/0x30 mm/kasan/common.c:52
> kasan_save_free_info+0x2e/0x40 mm/kasan/generic.c:521
> ____kasan_slab_free mm/kasan/common.c:236 [inline]
> ____kasan_slab_free+0x160/0x1c0 mm/kasan/common.c:200
> kasan_slab_free include/linux/kasan.h:162 [inline]
> slab_free_hook mm/slub.c:1781 [inline]
> slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1807
> slab_free mm/slub.c:3787 [inline]
> __kmem_cache_free+0xaf/0x2d0 mm/slub.c:3800
> rcu_do_batch kernel/rcu/tree.c:2112 [inline]
> rcu_core+0x814/0x1960 kernel/rcu/tree.c:2372
> __do_softirq+0x1d4/0x905 kernel/softirq.c:571
>
> Last potentially related work creation:
> kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
> __kasan_record_aux_stack+0xbc/0xd0 mm/kasan/generic.c:491
> __call_rcu_common.constprop.0+0x99/0x7e0 kernel/rcu/tree.c:2622
> qdisc_put_unlocked+0x73/0x90 net/sched/sch_generic.c:1097
> tcf_block_release+0x86/0x90 net/sched/cls_api.c:1362
> tc_new_tfilter+0xa35/0x2290 net/sched/cls_api.c:2331
> rtnetlink_rcv_msg+0x996/0xd50 net/core/rtnetlink.c:6165
> netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
> netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
> netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
> netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
> sock_sendmsg_nosec net/socket.c:724 [inline]
> sock_sendmsg+0xde/0x190 net/socket.c:747
> ____sys_sendmsg+0x334/0x900 net/socket.c:2501
> ___sys_sendmsg+0x110/0x1b0 net/socket.c:2555
> __sys_sendmmsg+0x18f/0x460 net/socket.c:2641
> __do_sys_sendmmsg net/socket.c:2670 [inline]
> __se_sys_sendmmsg net/socket.c:2667 [inline]
> __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2667
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> The buggy address belongs to the object at ffff888045b31000
> which belongs to the cache kmalloc-1k of size 1024
> The buggy address is located 776 bytes inside of
> freed 1024-byte region [ffff888045b31000, ffff888045b31400)
>
> The buggy address belongs to the physical page:
> page:ffffea000116cc00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888045b37000 pfn:0x45b30
> head:ffffea000116cc00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> anon flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
> raw: 00fff00000010200 ffff888012441dc0 0000000000000000 dead000000000001
> raw: ffff888045b37000 000000008010000d 00000001ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
> page_owner tracks the page as allocated
> page last allocated via order 3, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 14184, tgid 14168 (syz-executor690), ts 420322459064, free_ts 15493742100
> prep_new_page mm/page_alloc.c:2552 [inline]
> get_page_from_freelist+0x1190/0x2e20 mm/page_alloc.c:4325
> __alloc_pages+0x1cb/0x4a0 mm/page_alloc.c:5591
> alloc_pages+0x1aa/0x270 mm/mempolicy.c:2283
> alloc_slab_page mm/slub.c:1851 [inline]
> allocate_slab+0x25f/0x390 mm/slub.c:1998
> new_slab mm/slub.c:2051 [inline]
> ___slab_alloc+0xa91/0x1400 mm/slub.c:3193
> __slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3292
> __slab_alloc_node mm/slub.c:3345 [inline]
> slab_alloc_node mm/slub.c:3442 [inline]
> __kmem_cache_alloc_node+0x136/0x320 mm/slub.c:3491
> kmalloc_trace+0x26/0xe0 mm/slab_common.c:1061
> kmalloc include/linux/slab.h:580 [inline]
> kmalloc_array include/linux/slab.h:635 [inline]
> kcalloc include/linux/slab.h:667 [inline]
> fl_change+0x1cf/0x4ac0 net/sched/cls_flower.c:2175
> tc_new_tfilter+0x97c/0x2290 net/sched/cls_api.c:2310
> rtnetlink_rcv_msg+0x996/0xd50 net/core/rtnetlink.c:6165
> netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
> netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
> netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
> netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
> sock_sendmsg_nosec net/socket.c:724 [inline]
> sock_sendmsg+0xde/0x190 net/socket.c:747
> ____sys_sendmsg+0x334/0x900 net/socket.c:2501
> page last free stack trace:
> reset_page_owner include/linux/page_owner.h:24 [inline]
> free_pages_prepare mm/page_alloc.c:1453 [inline]
> free_pcp_prepare+0x5d5/0xa50 mm/page_alloc.c:1503
> free_unref_page_prepare mm/page_alloc.c:3387 [inline]
> free_unref_page+0x1d/0x490 mm/page_alloc.c:3482
> free_contig_range+0xb5/0x180 mm/page_alloc.c:9531
> destroy_args+0x6c4/0x920 mm/debug_vm_pgtable.c:1023
> debug_vm_pgtable+0x242a/0x4640 mm/debug_vm_pgtable.c:1403
> do_one_initcall+0x102/0x540 init/main.c:1310
> do_initcall_level init/main.c:1383 [inline]
> do_initcalls init/main.c:1399 [inline]
> do_basic_setup init/main.c:1418 [inline]
> kernel_init_freeable+0x696/0xc00 init/main.c:1638
> kernel_init+0x1e/0x2c0 init/main.c:1526
> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
>
> Memory state around the buggy address:
> ffff888045b31200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff888045b31280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff888045b31300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ^
> ffff888045b31380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff888045b31400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at [email protected].
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches

2023-03-29 19:10:44

by Pedro Tammela

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

On 29/03/2023 00:37, Seth Forshee wrote:
> On Tue, Mar 28, 2023 at 06:47:33PM -0700, Jakub Kicinski wrote:
>> Seth, does this looks related to commit 267463823adb ("net: sch:
>> eliminate unnecessary RCU waits in mini_qdisc_pair_swap()")
>> by any chance?
>
> I don't see how it could be. The memory being written is part of the
> qdisc private memory, and tc_new_tfilter() takes a reference to the
> qdisc. If that memory has been freed doesn't it mean that something has
> done an unbalanced qdisc_put()?
>

Reverting Seth's patches (85c0c3eb9a66 and 267463823adb) leads to these
traces with the reproducer:
[ 52.704956][ C0] ------------[ cut here ]------------
[ 52.705568][ C0] ODEBUG: free active (active state 1) object:0
[ 52.706542][ C0] WARNING: CPU: 0 PID: 0 at lib/debugobjects.c0
[ 52.707283][ C0] Modules linked in:
[ 52.707602][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.0
[ 52.708304][ C0] Hardware name: QEMU Standard PC (i440FX + PI4
[ 52.709032][ C0] RIP: 0010:debug_print_object+0x196/0x290
[ 52.709509][ C0] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85e
[ 52.711011][ C0] RSP: 0018:ffffc90000007cd0 EFLAGS: 00010282
[ 52.711510][ C0] RAX: 0000000000000000 RBX: 0000000000000003 0
[ 52.712125][ C0] RDX: ffffffff8c495800 RSI: ffffffff814b96d7 1
[ 52.712748][ C0] RBP: 0000000000000001 R08: 0000000000000001 0
[ 52.713370][ C0] R10: 0000000000000000 R11: 203a47554245444f 0
[ 52.713983][ C0] R13: ffffffff8aa6e960 R14: 0000000000000000 8
[ 52.714609][ C0] FS: 0000000000000000(0000) GS:ffff8881f5a000
[ 52.715356][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008003
[ 52.715863][ C0] CR2: 000055914686f020 CR3: 000000011e856000 0
[ 52.716486][ C0] Call Trace:
[ 52.716742][ C0] <IRQ>
[ 52.716969][ C0] debug_check_no_obj_freed+0x302/0x420
[ 52.717423][ C0] slab_free_freelist_hook+0xec/0x1c0
[ 52.717848][ C0] ? rcu_core+0x818/0x1930
[ 52.718204][ C0] __kmem_cache_free+0xaf/0x2e0
[ 52.718590][ C0] rcu_core+0x818/0x1930
[ 52.718938][ C0] ? rcu_report_dead+0x610/0x610
[ 52.719328][ C0] __do_softirq+0x1d4/0x8ef
[ 52.719689][ C0] __irq_exit_rcu+0x11d/0x190
[ 52.720062][ C0] irq_exit_rcu+0x9/0x20
[ 52.720402][ C0] sysvec_apic_timer_interrupt+0x97/0xc0
[ 52.720842][ C0] </IRQ>
[ 52.721070][ C0] <TASK>
[ 52.721300][ C0] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 52.721779][ C0] RIP: 0010:default_idle+0xf/0x20
[ 52.722172][ C0] Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c5
[ 52.723631][ C0] RSP: 0018:ffffffff8c407e30 EFLAGS: 00000202
[ 52.724096][ C0] RAX: 000000000007897f RBX: 0000000000000000 6
[ 52.724702][ C0] RDX: 0000000000000000 RSI: 0000000000000001 0
[ 52.725335][ C0] RBP: ffffffff8c495800 R08: 0000000000000001 b
[ 52.725957][ C0] R10: ffffed103eb46d95 R11: 0000000000000000 0
[ 52.726550][ C0] R13: 0000000000000000 R14: ffffffff8e7834d0 0
[ 52.727162][ C0] ? ct_kernel_exit+0x1d6/0x240
[ 52.727542][ C0] default_idle_call+0x67/0xa0
[ 52.727912][ C0] do_idle+0x31e/0x3e0
[ 52.728241][ C0] ? arch_cpu_idle_exit+0x30/0x30
[ 52.728635][ C0] cpu_startup_entry+0x18/0x20
[ 52.729006][ C0] rest_init+0x16d/0x2b0
[ 52.729338][ C0] ? regulator_has_full_constraints+0x9/0x20
[ 52.729815][ C0] ? trace_init_perf_perm_irq_work_exit+0x20/00
[ 52.730309][ C0] arch_call_rest_init+0x13/0x30
[ 52.730703][ C0] start_kernel+0x352/0x4c0
[ 52.731087][ C0] secondary_startup_64_no_verify+0xce/0xdb
[ 52.731611][ C0] </TASK>
[ 52.731870][ C0] Kernel panic - not syncing: kernel: panic_on.
[ 52.732445][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.0
[ 52.733140][ C0] Hardware name: QEMU Standard PC (i440FX + PI4
[ 52.733867][ C0] Call Trace:
[ 52.734143][ C0] <IRQ>
[ 52.734380][ C0] dump_stack_lvl+0xd9/0x150
[ 52.734769][ C0] panic+0x684/0x730
[ 52.735082][ C0] ? panic_smp_self_stop+0x90/0x90
[ 52.735322][ C0] ? show_trace_log_lvl+0x285/0x390
[ 52.735322][ C0] ? debug_print_object+0x196/0x290
[ 52.735322][ C0] check_panic_on_warn+0xb1/0xc0
[ 52.735322][ C0] __warn+0xf2/0x390
[ 52.735322][ C0] ? debug_print_object+0x196/0x290
[ 52.735322][ C0] report_bug+0x2dd/0x500
[ 52.735322][ C0] handle_bug+0x3c/0x70
[ 52.735322][ C0] exc_invalid_op+0x18/0x50
[ 52.735322][ C0] asm_exc_invalid_op+0x1a/0x20
[ 52.735322][ C0] RIP: 0010:debug_print_object+0x196/0x290
[ 52.735322][ C0] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85e
[ 52.735322][ C0] RSP: 0018:ffffc90000007cd0 EFLAGS: 00010282
[ 52.735322][ C0] RAX: 0000000000000000 RBX: 0000000000000003 0
[ 52.735322][ C0] RDX: ffffffff8c495800 RSI: ffffffff814b96d7 1
[ 52.735322][ C0] RBP: 0000000000000001 R08: 0000000000000001 0
[ 52.735322][ C0] R10: 0000000000000000 R11: 203a47554245444f 0
[ 52.735322][ C0] R13: ffffffff8aa6e960 R14: 0000000000000000 8
[ 52.735322][ C0] ? __warn_printk+0x187/0x310
[ 52.735322][ C0] debug_check_no_obj_freed+0x302/0x420
[ 52.735322][ C0] slab_free_freelist_hook+0xec/0x1c0
[ 52.735322][ C0] ? rcu_core+0x818/0x1930
[ 52.735322][ C0] __kmem_cache_free+0xaf/0x2e0
[ 52.735322][ C0] rcu_core+0x818/0x1930
[ 52.735322][ C0] ? rcu_report_dead+0x610/0x610
[ 52.735322][ C0] __do_softirq+0x1d4/0x8ef
[ 52.735322][ C0] __irq_exit_rcu+0x11d/0x190
[ 52.735322][ C0] irq_exit_rcu+0x9/0x20
[ 52.735322][ C0] sysvec_apic_timer_interrupt+0x97/0xc0
[ 52.735322][ C0] </IRQ>
[ 52.735322][ C0] <TASK>
[ 52.735322][ C0] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 52.735322][ C0] RIP: 0010:default_idle+0xf/0x20
[ 52.735322][ C0] Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c5
[ 52.735322][ C0] RSP: 0018:ffffffff8c407e30 EFLAGS: 00000202
[ 52.735322][ C0] RAX: 000000000007897f RBX: 0000000000000000 6
[ 52.735322][ C0] RDX: 0000000000000000 RSI: 0000000000000001 0
[ 52.735322][ C0] RBP: ffffffff8c495800 R08: 0000000000000001 b
[ 52.735322][ C0] R10: ffffed103eb46d95 R11: 0000000000000000 0
[ 52.735322][ C0] R13: 0000000000000000 R14: ffffffff8e7834d0 0
[ 52.735322][ C0] ? ct_kernel_exit+0x1d6/0x240
[ 52.735322][ C0] default_idle_call+0x67/0xa0
[ 52.735322][ C0] do_idle+0x31e/0x3e0
[ 52.735322][ C0] ? arch_cpu_idle_exit+0x30/0x30
[ 52.735322][ C0] cpu_startup_entry+0x18/0x20
[ 52.735322][ C0] rest_init+0x16d/0x2b0
[ 52.735322][ C0] ? regulator_has_full_constraints+0x9/0x20
[ 52.735322][ C0] ? trace_init_perf_perm_irq_work_exit+0x20/00
[ 52.735322][ C0] arch_call_rest_init+0x13/0x30
[ 52.735322][ C0] start_kernel+0x352/0x4c0
[ 52.735322][ C0] secondary_startup_64_no_verify+0xce/0xdb
[ 52.735322][ C0] </TASK>
[ 52.735322][ C0] Kernel Offset: disabled
[ 52.735322][ C0] Rebooting in 86400 seconds..


2023-04-03 16:00:06

by Jamal Hadi Salim

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

To provide more update:
Happens on single processor before Seth's patches; and only on
multi-processor after Seth's patches.
Theory is: there is a logic bug in the miniqdisc rcu visibility. Feels
like the freeing of the structure is done without rcu involvement.
Jiri/Cong maybe you can take a look since youve been dabbling in
miniqdisc? The reproducer worked for me and Pedro 100% of the time.

cheers,
jamal

On Wed, Mar 29, 2023 at 3:07 PM Pedro Tammela <[email protected]> wrote:
>
> On 29/03/2023 00:37, Seth Forshee wrote:
> > On Tue, Mar 28, 2023 at 06:47:33PM -0700, Jakub Kicinski wrote:
> >> Seth, does this looks related to commit 267463823adb ("net: sch:
> >> eliminate unnecessary RCU waits in mini_qdisc_pair_swap()")
> >> by any chance?
> >
> > I don't see how it could be. The memory being written is part of the
> > qdisc private memory, and tc_new_tfilter() takes a reference to the
> > qdisc. If that memory has been freed doesn't it mean that something has
> > done an unbalanced qdisc_put()?
> >
>
> Reverting Seth's patches (85c0c3eb9a66 and 267463823adb) leads to these
> traces with the reproducer:
> [ 52.704956][ C0] ------------[ cut here ]------------
> [ 52.705568][ C0] ODEBUG: free active (active state 1) object:0
> [ 52.706542][ C0] WARNING: CPU: 0 PID: 0 at lib/debugobjects.c0
> [ 52.707283][ C0] Modules linked in:
> [ 52.707602][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.0
> [ 52.708304][ C0] Hardware name: QEMU Standard PC (i440FX + PI4
> [ 52.709032][ C0] RIP: 0010:debug_print_object+0x196/0x290
> [ 52.709509][ C0] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85e
> [ 52.711011][ C0] RSP: 0018:ffffc90000007cd0 EFLAGS: 00010282
> [ 52.711510][ C0] RAX: 0000000000000000 RBX: 0000000000000003 0
> [ 52.712125][ C0] RDX: ffffffff8c495800 RSI: ffffffff814b96d7 1
> [ 52.712748][ C0] RBP: 0000000000000001 R08: 0000000000000001 0
> [ 52.713370][ C0] R10: 0000000000000000 R11: 203a47554245444f 0
> [ 52.713983][ C0] R13: ffffffff8aa6e960 R14: 0000000000000000 8
> [ 52.714609][ C0] FS: 0000000000000000(0000) GS:ffff8881f5a000
> [ 52.715356][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008003
> [ 52.715863][ C0] CR2: 000055914686f020 CR3: 000000011e856000 0
> [ 52.716486][ C0] Call Trace:
> [ 52.716742][ C0] <IRQ>
> [ 52.716969][ C0] debug_check_no_obj_freed+0x302/0x420
> [ 52.717423][ C0] slab_free_freelist_hook+0xec/0x1c0
> [ 52.717848][ C0] ? rcu_core+0x818/0x1930
> [ 52.718204][ C0] __kmem_cache_free+0xaf/0x2e0
> [ 52.718590][ C0] rcu_core+0x818/0x1930
> [ 52.718938][ C0] ? rcu_report_dead+0x610/0x610
> [ 52.719328][ C0] __do_softirq+0x1d4/0x8ef
> [ 52.719689][ C0] __irq_exit_rcu+0x11d/0x190
> [ 52.720062][ C0] irq_exit_rcu+0x9/0x20
> [ 52.720402][ C0] sysvec_apic_timer_interrupt+0x97/0xc0
> [ 52.720842][ C0] </IRQ>
> [ 52.721070][ C0] <TASK>
> [ 52.721300][ C0] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 52.721779][ C0] RIP: 0010:default_idle+0xf/0x20
> [ 52.722172][ C0] Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c5
> [ 52.723631][ C0] RSP: 0018:ffffffff8c407e30 EFLAGS: 00000202
> [ 52.724096][ C0] RAX: 000000000007897f RBX: 0000000000000000 6
> [ 52.724702][ C0] RDX: 0000000000000000 RSI: 0000000000000001 0
> [ 52.725335][ C0] RBP: ffffffff8c495800 R08: 0000000000000001 b
> [ 52.725957][ C0] R10: ffffed103eb46d95 R11: 0000000000000000 0
> [ 52.726550][ C0] R13: 0000000000000000 R14: ffffffff8e7834d0 0
> [ 52.727162][ C0] ? ct_kernel_exit+0x1d6/0x240
> [ 52.727542][ C0] default_idle_call+0x67/0xa0
> [ 52.727912][ C0] do_idle+0x31e/0x3e0
> [ 52.728241][ C0] ? arch_cpu_idle_exit+0x30/0x30
> [ 52.728635][ C0] cpu_startup_entry+0x18/0x20
> [ 52.729006][ C0] rest_init+0x16d/0x2b0
> [ 52.729338][ C0] ? regulator_has_full_constraints+0x9/0x20
> [ 52.729815][ C0] ? trace_init_perf_perm_irq_work_exit+0x20/00
> [ 52.730309][ C0] arch_call_rest_init+0x13/0x30
> [ 52.730703][ C0] start_kernel+0x352/0x4c0
> [ 52.731087][ C0] secondary_startup_64_no_verify+0xce/0xdb
> [ 52.731611][ C0] </TASK>
> [ 52.731870][ C0] Kernel panic - not syncing: kernel: panic_on.
> [ 52.732445][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.0
> [ 52.733140][ C0] Hardware name: QEMU Standard PC (i440FX + PI4
> [ 52.733867][ C0] Call Trace:
> [ 52.734143][ C0] <IRQ>
> [ 52.734380][ C0] dump_stack_lvl+0xd9/0x150
> [ 52.734769][ C0] panic+0x684/0x730
> [ 52.735082][ C0] ? panic_smp_self_stop+0x90/0x90
> [ 52.735322][ C0] ? show_trace_log_lvl+0x285/0x390
> [ 52.735322][ C0] ? debug_print_object+0x196/0x290
> [ 52.735322][ C0] check_panic_on_warn+0xb1/0xc0
> [ 52.735322][ C0] __warn+0xf2/0x390
> [ 52.735322][ C0] ? debug_print_object+0x196/0x290
> [ 52.735322][ C0] report_bug+0x2dd/0x500
> [ 52.735322][ C0] handle_bug+0x3c/0x70
> [ 52.735322][ C0] exc_invalid_op+0x18/0x50
> [ 52.735322][ C0] asm_exc_invalid_op+0x1a/0x20
> [ 52.735322][ C0] RIP: 0010:debug_print_object+0x196/0x290
> [ 52.735322][ C0] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85e
> [ 52.735322][ C0] RSP: 0018:ffffc90000007cd0 EFLAGS: 00010282
> [ 52.735322][ C0] RAX: 0000000000000000 RBX: 0000000000000003 0
> [ 52.735322][ C0] RDX: ffffffff8c495800 RSI: ffffffff814b96d7 1
> [ 52.735322][ C0] RBP: 0000000000000001 R08: 0000000000000001 0
> [ 52.735322][ C0] R10: 0000000000000000 R11: 203a47554245444f 0
> [ 52.735322][ C0] R13: ffffffff8aa6e960 R14: 0000000000000000 8
> [ 52.735322][ C0] ? __warn_printk+0x187/0x310
> [ 52.735322][ C0] debug_check_no_obj_freed+0x302/0x420
> [ 52.735322][ C0] slab_free_freelist_hook+0xec/0x1c0
> [ 52.735322][ C0] ? rcu_core+0x818/0x1930
> [ 52.735322][ C0] __kmem_cache_free+0xaf/0x2e0
> [ 52.735322][ C0] rcu_core+0x818/0x1930
> [ 52.735322][ C0] ? rcu_report_dead+0x610/0x610
> [ 52.735322][ C0] __do_softirq+0x1d4/0x8ef
> [ 52.735322][ C0] __irq_exit_rcu+0x11d/0x190
> [ 52.735322][ C0] irq_exit_rcu+0x9/0x20
> [ 52.735322][ C0] sysvec_apic_timer_interrupt+0x97/0xc0
> [ 52.735322][ C0] </IRQ>
> [ 52.735322][ C0] <TASK>
> [ 52.735322][ C0] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 52.735322][ C0] RIP: 0010:default_idle+0xf/0x20
> [ 52.735322][ C0] Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c5
> [ 52.735322][ C0] RSP: 0018:ffffffff8c407e30 EFLAGS: 00000202
> [ 52.735322][ C0] RAX: 000000000007897f RBX: 0000000000000000 6
> [ 52.735322][ C0] RDX: 0000000000000000 RSI: 0000000000000001 0
> [ 52.735322][ C0] RBP: ffffffff8c495800 R08: 0000000000000001 b
> [ 52.735322][ C0] R10: ffffed103eb46d95 R11: 0000000000000000 0
> [ 52.735322][ C0] R13: 0000000000000000 R14: ffffffff8e7834d0 0
> [ 52.735322][ C0] ? ct_kernel_exit+0x1d6/0x240
> [ 52.735322][ C0] default_idle_call+0x67/0xa0
> [ 52.735322][ C0] do_idle+0x31e/0x3e0
> [ 52.735322][ C0] ? arch_cpu_idle_exit+0x30/0x30
> [ 52.735322][ C0] cpu_startup_entry+0x18/0x20
> [ 52.735322][ C0] rest_init+0x16d/0x2b0
> [ 52.735322][ C0] ? regulator_has_full_constraints+0x9/0x20
> [ 52.735322][ C0] ? trace_init_perf_perm_irq_work_exit+0x20/00
> [ 52.735322][ C0] arch_call_rest_init+0x13/0x30
> [ 52.735322][ C0] start_kernel+0x352/0x4c0
> [ 52.735322][ C0] secondary_startup_64_no_verify+0xce/0xdb
> [ 52.735322][ C0] </TASK>
> [ 52.735322][ C0] Kernel Offset: disabled
> [ 52.735322][ C0] Rebooting in 86400 seconds..
>
>

2023-04-17 23:11:40

by Peilin Ye

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

Hi all,

On Mon, Apr 03, 2023 at 11:58:44AM -0400, Jamal Hadi Salim wrote:
> To provide more update:
> Happens on single processor before Seth's patches; and only on
> multi-processor after Seth's patches.
> Theory is: there is a logic bug in the miniqdisc rcu visibility. Feels
> like the freeing of the structure is done without rcu involvement.
> Jiri/Cong maybe you can take a look since youve been dabbling in
> miniqdisc? The reproducer worked for me and Pedro 100% of the time.

I also reproduced this UAF using the syzkaller reproducer in the report
(the C reproducer did not work for me for unknown reasons). I will look
into this.

Thanks,
Peilin Ye

2023-04-18 03:13:33

by syzbot

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

Hello,

syzbot has tested the proposed patch but the reproducer is still triggering an issue:
KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

==================================================================
BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
Write of size 8 at addr ffff88806348a308 by task syz-executor.0/8529

CPU: 1 PID: 8529 Comm: syz-executor.0 Not tainted 6.3.0-rc3-syzkaller-00026-gfff5a5e7f528-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
print_report mm/kasan/report.c:430 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:536
mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
tc_new_tfilter+0x1d77/0x2200 net/sched/cls_api.c:2268
rtnetlink_rcv_msg+0x996/0xd50 net/core/rtnetlink.c:6165
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
____sys_sendmsg+0x334/0x900 net/socket.c:2501
___sys_sendmsg+0x110/0x1b0 net/socket.c:2555
__sys_sendmmsg+0x18f/0x460 net/socket.c:2641
__do_sys_sendmmsg net/socket.c:2670 [inline]
__se_sys_sendmmsg net/socket.c:2667 [inline]
__x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7ffaaee8c0f9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffaafc5e168 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00007ffaaefac120 RCX: 00007ffaaee8c0f9
RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
RBP: 00007ffaaeee7b39 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe97997fbf R14: 00007ffaafc5e300 R15: 0000000000022000
</TASK>

Allocated by task 8524:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
____kasan_kmalloc mm/kasan/common.c:374 [inline]
____kasan_kmalloc mm/kasan/common.c:333 [inline]
__kasan_kmalloc+0xa2/0xb0 mm/kasan/common.c:383
kasan_kmalloc include/linux/kasan.h:196 [inline]
__do_kmalloc_node mm/slab_common.c:967 [inline]
__kmalloc_node+0x61/0x1a0 mm/slab_common.c:974
kmalloc_node include/linux/slab.h:610 [inline]
kzalloc_node include/linux/slab.h:731 [inline]
qdisc_alloc+0xb0/0xb30 net/sched/sch_generic.c:938
qdisc_create+0xce/0x1040 net/sched/sch_api.c:1244
tc_modify_qdisc+0x488/0x1a40 net/sched/sch_api.c:1680
rtnetlink_rcv_msg+0x43d/0xd50 net/core/rtnetlink.c:6174
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
____sys_sendmsg+0x334/0x900 net/socket.c:2501
___sys_sendmsg+0x110/0x1b0 net/socket.c:2555
__sys_sendmmsg+0x18f/0x460 net/socket.c:2641
__do_sys_sendmmsg net/socket.c:2670 [inline]
__se_sys_sendmmsg net/socket.c:2667 [inline]
__x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 21:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2e/0x40 mm/kasan/generic.c:521
____kasan_slab_free mm/kasan/common.c:236 [inline]
____kasan_slab_free+0x160/0x1c0 mm/kasan/common.c:200
kasan_slab_free include/linux/kasan.h:162 [inline]
slab_free_hook mm/slub.c:1781 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1807
slab_free mm/slub.c:3787 [inline]
__kmem_cache_free+0xaf/0x2d0 mm/slub.c:3800
rcu_do_batch kernel/rcu/tree.c:2112 [inline]
rcu_core+0x814/0x1960 kernel/rcu/tree.c:2372
__do_softirq+0x1d4/0x905 kernel/softirq.c:571

Last potentially related work creation:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xbc/0xd0 mm/kasan/generic.c:491
__call_rcu_common.constprop.0+0x99/0x7e0 kernel/rcu/tree.c:2622
qdisc_put_unlocked+0x73/0x90 net/sched/sch_generic.c:1097
tcf_block_release+0x86/0x90 net/sched/cls_api.c:1362
tc_new_tfilter+0xa2b/0x2200 net/sched/cls_api.c:2333
rtnetlink_rcv_msg+0x996/0xd50 net/core/rtnetlink.c:6165
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
____sys_sendmsg+0x334/0x900 net/socket.c:2501
___sys_sendmsg+0x110/0x1b0 net/socket.c:2555
__sys_sendmmsg+0x18f/0x460 net/socket.c:2641
__do_sys_sendmmsg net/socket.c:2670 [inline]
__se_sys_sendmmsg net/socket.c:2667 [inline]
__x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2667
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Second to last potentially related work creation:
kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xbc/0xd0 mm/kasan/generic.c:491
__call_rcu_common.constprop.0+0x99/0x7e0 kernel/rcu/tree.c:2622
rhashtable_rehash_table lib/rhashtable.c:348 [inline]
rht_deferred_worker+0xb24/0x1ce0 lib/rhashtable.c:432
process_one_work+0x991/0x15c0 kernel/workqueue.c:2390
worker_thread+0x669/0x1090 kernel/workqueue.c:2537
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

The buggy address belongs to the object at ffff88806348a000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 776 bytes inside of
freed 1024-byte region [ffff88806348a000, ffff88806348a400)

The buggy address belongs to the physical page:
page:ffffea00018d2200 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x63488
head:ffffea00018d2200 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000010200 ffff888012441dc0 0000000000000000 dead000000000001
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 8117, tgid 8111 (syz-executor.3), ts 126042830210, free_ts 15956815362
prep_new_page mm/page_alloc.c:2552 [inline]
get_page_from_freelist+0x1190/0x2e20 mm/page_alloc.c:4325
__alloc_pages+0x1cb/0x4a0 mm/page_alloc.c:5591
alloc_pages+0x1aa/0x270 mm/mempolicy.c:2283
alloc_slab_page mm/slub.c:1851 [inline]
allocate_slab+0x25f/0x390 mm/slub.c:1998
new_slab mm/slub.c:2051 [inline]
___slab_alloc+0xa91/0x1400 mm/slub.c:3193
__slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3292
__slab_alloc_node mm/slub.c:3345 [inline]
slab_alloc_node mm/slub.c:3442 [inline]
__kmem_cache_alloc_node+0x136/0x320 mm/slub.c:3491
kmalloc_trace+0x26/0xe0 mm/slab_common.c:1061
kmalloc include/linux/slab.h:580 [inline]
kzalloc include/linux/slab.h:720 [inline]
fl_init+0x45/0x2c0 net/sched/cls_flower.c:351
tcf_proto_create net/sched/cls_api.c:398 [inline]
tc_new_tfilter+0xecf/0x2200 net/sched/cls_api.c:2260
rtnetlink_rcv_msg+0x996/0xd50 net/core/rtnetlink.c:6165
netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2574
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1942
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
____sys_sendmsg+0x334/0x900 net/socket.c:2501
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1453 [inline]
free_pcp_prepare+0x5d5/0xa50 mm/page_alloc.c:1503
free_unref_page_prepare mm/page_alloc.c:3387 [inline]
free_unref_page+0x1d/0x490 mm/page_alloc.c:3482
free_contig_range+0xb5/0x180 mm/page_alloc.c:9531
destroy_args+0x6c4/0x920 mm/debug_vm_pgtable.c:1023
debug_vm_pgtable+0x242a/0x4640 mm/debug_vm_pgtable.c:1403
do_one_initcall+0x102/0x540 init/main.c:1310
do_initcall_level init/main.c:1383 [inline]
do_initcalls init/main.c:1399 [inline]
do_basic_setup init/main.c:1418 [inline]
kernel_init_freeable+0x696/0xc00 init/main.c:1638
kernel_init+0x1e/0x2c0 init/main.c:1526
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

Memory state around the buggy address:
ffff88806348a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88806348a280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88806348a300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88806348a380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88806348a400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================


Tested on:

commit: fff5a5e7 Merge tag 'for-linus' of git://git.armlinux.o..
git tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
console output: https://syzkaller.appspot.com/x/log.txt?x=10cf71c0280000
kernel config: https://syzkaller.appspot.com/x/.config?x=ea09b0836073ee4
dashboard link: https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
patch: https://syzkaller.appspot.com/x/patch.diff?x=16fa5f23c80000

2023-04-18 09:48:05

by syzbot

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: [email protected]

Tested on:

commit: fff5a5e7 Merge tag 'for-linus' of git://git.armlinux.o..
git tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
console output: https://syzkaller.appspot.com/x/log.txt?x=1658d89fc80000
kernel config: https://syzkaller.appspot.com/x/.config?x=ea09b0836073ee4
dashboard link: https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
patch: https://syzkaller.appspot.com/x/patch.diff?x=10d7cd5bc80000

Note: testing is done by a robot and is best-effort only.

2023-04-26 23:43:00

by Peilin Ye

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

+Cc: Vlad Buslov, Hillf Danton

Hi all,

On Mon, Apr 17, 2023 at 04:00:11PM -0700, Peilin Ye wrote:
> I also reproduced this UAF using the syzkaller reproducer in the report
> (the C reproducer did not work for me for unknown reasons). I will look
> into this.

Currently, multiple ingress (clsact) Qdiscs can access the per-netdev
*miniq_ingress (*miniq_egress) pointer concurrently. This is
unfortunately true in two senses:

1. We allow adding ingress (clsact) Qdiscs under parents other than
TC_H_INGRESS (TC_H_CLSACT):

$ ip link add ifb0 numtxqueues 8 type ifb
$ echo clsact > /proc/sys/net/core/default_qdisc
$ tc qdisc add dev ifb0 handle 1: root mq
$ tc qdisc show dev ifb0
qdisc mq 1: root
qdisc clsact 0: parent 1:8
qdisc clsact 0: parent 1:7
qdisc clsact 0: parent 1:6
qdisc clsact 0: parent 1:5
qdisc clsact 0: parent 1:4
qdisc clsact 0: parent 1:3
qdisc clsact 0: parent 1:2
qdisc clsact 0: parent 1:1

This is obviously racy and should be prohibited. I've started working
on patches to fix this. The syz repro for this UAF adds ingress Qdiscs
under TC_H_ROOT, by the way.

2. After introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER requests
[1], it is possible that, when replacing ingress (clsact) Qdiscs, the
old one can access *miniq_{in,e}gress concurrently with the new one. For
example, the syz repro does something like the following:

Thread 1 creates sch_ingress Qdisc A (containing mini Qdisc a1 and a2),
then adds a cls_flower filter X to Qdisc A.

Thread 2 creates sch_ingress Qdisc B (containing mini Qdisc b1 and b2)
to replace Qdisc A, then adds a cls_flower filter Y to Qdisc B.

Device has 8 TXQs.

Thread 1 A's refcnt Thread 2
RTM_NEWQDISC (A, locked)
qdisc_create(A) 1
qdisc_graft(A) 9

RTM_NEWTFILTER (X, lockless)
__tcf_qdisc_find(A) 10
tcf_chain0_head_change(A)
! mini_qdisc_pair_swap(A)
| RTM_NEWQDISC (B, locked)
| 2 qdisc_graft(B)
| 1 notify_and_destroy(A)
|
| RTM_NEWTFILTER (Y, lockless)
| tcf_chain0_head_change(B)
| ! mini_qdisc_pair_swap(B)
tcf_block_release(A) 0 |
qdisc_destroy(A) |
tcf_chain0_head_change_cb_del(A) |
! mini_qdisc_pair_swap(A) |
| |
... ...

As we can see there're interleaving mini_qdisc_pair_swap() calls between
Qdisc A and B, causing all kinds of troubles, including the UAF (thread
2 writing to mini Qdisc a1's rcu_state after Qdisc A has already been
freed) reported by syzbot.

To fix this, I'm cooking a patch that, when replacing ingress (clsact)
Qdiscs, in qdisc_graft():

I. We should make sure there's no on-the-fly lockless filter requests
for the old Qdisc, and return -EBUSY if there's any (or can/should
we wait in RTM_NEWQDISC handler?)

II. We should destory the old Qdisc before publishing the new one
(i.e. setting it to dev_ingress_queue(dev)->qdisc_sleeping, so
that subsequent filter requests can see it), because
{ingress,clsact}_destroy() also call mini_qdisc_pair_swap(), which
sets *miniq_{in,e}gress to NULL

Future Qdiscs that support RTNL-lockless cls_ops, if any, won't need
this fix, as long as their ->chain_head_change() don't access
out-of-Qdisc-scope data, like pointers in struct net_device.

Do you think this is the right way to go? Thanks!

[1] Thanks Hillf Danton for the hint:
https://syzkaller.appspot.com/text?tag=Patch&x=10d7cd5bc80000

Thanks,
Peilin Ye

2023-04-27 03:01:47

by Pedro Tammela

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

On 26/04/2023 20:42, Peilin Ye wrote:
> +Cc: Vlad Buslov, Hillf Danton
>
> Hi all,
>
> On Mon, Apr 17, 2023 at 04:00:11PM -0700, Peilin Ye wrote:
>> I also reproduced this UAF using the syzkaller reproducer in the report
>> (the C reproducer did not work for me for unknown reasons). I will look
>> into this.
>
> Currently, multiple ingress (clsact) Qdiscs can access the per-netdev
> *miniq_ingress (*miniq_egress) pointer concurrently. This is
> unfortunately true in two senses:
>
> 1. We allow adding ingress (clsact) Qdiscs under parents other than
> TC_H_INGRESS (TC_H_CLSACT):
>
> $ ip link add ifb0 numtxqueues 8 type ifb
> $ echo clsact > /proc/sys/net/core/default_qdisc
> $ tc qdisc add dev ifb0 handle 1: root mq
> $ tc qdisc show dev ifb0
> qdisc mq 1: root
> qdisc clsact 0: parent 1:8
> qdisc clsact 0: parent 1:7
> qdisc clsact 0: parent 1:6
> qdisc clsact 0: parent 1:5
> qdisc clsact 0: parent 1:4
> qdisc clsact 0: parent 1:3
> qdisc clsact 0: parent 1:2
> qdisc clsact 0: parent 1:1
>
> This is obviously racy and should be prohibited. I've started working
> on patches to fix this. The syz repro for this UAF adds ingress Qdiscs
> under TC_H_ROOT, by the way.
>
> 2. After introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER requests
> [1], it is possible that, when replacing ingress (clsact) Qdiscs, the
> old one can access *miniq_{in,e}gress concurrently with the new one. For
> example, the syz repro does something like the following:
>
> Thread 1 creates sch_ingress Qdisc A (containing mini Qdisc a1 and a2),
> then adds a cls_flower filter X to Qdisc A.
>
> Thread 2 creates sch_ingress Qdisc B (containing mini Qdisc b1 and b2)
> to replace Qdisc A, then adds a cls_flower filter Y to Qdisc B.
>
> Device has 8 TXQs.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> ! mini_qdisc_pair_swap(A)
> | RTM_NEWQDISC (B, locked)
> | 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> | RTM_NEWTFILTER (Y, lockless)
> | tcf_chain0_head_change(B)
> | ! mini_qdisc_pair_swap(B)
> tcf_block_release(A) 0 |
> qdisc_destroy(A) |
> tcf_chain0_head_change_cb_del(A) |
> ! mini_qdisc_pair_swap(A) |
> | |
> ... ...
>
> As we can see there're interleaving mini_qdisc_pair_swap() calls between
> Qdisc A and B, causing all kinds of troubles, including the UAF (thread
> 2 writing to mini Qdisc a1's rcu_state after Qdisc A has already been
> freed) reported by syzbot.

Thanks for the analysis. It makes total sense.
After going through the call chains, please correct me if my ELI5 is wrong:

'clsact_init()' is called for B when dev has miniq_ingress set to 'A',
therefore copying a pointer to the miniq_qdisc with lifetime bound to
'A' in a miniq_qdisc_pair with lifetime bound to 'B' therefore raising
an UAF after A is destroyed and B is manipulated.

>
> To fix this, I'm cooking a patch that, when replacing ingress (clsact)
> Qdiscs, in qdisc_graft():
>
> I. We should make sure there's no on-the-fly lockless filter requests
> for the old Qdisc, and return -EBUSY if there's any (or can/should
> we wait in RTM_NEWQDISC handler?
Makes sense.

>
> II. We should destory the old Qdisc before publishing the new one
> (i.e. setting it to dev_ingress_queue(dev)->qdisc_sleeping, so
> that subsequent filter requests can see it), because
> {ingress,clsact}_destroy() also call mini_qdisc_pair_swap(), which
> sets *miniq_{in,e}gress to NULL >
> Future Qdiscs that support RTNL-lockless cls_ops, if any, won't need
> this fix, as long as their ->chain_head_change() don't access
> out-of-Qdisc-scope data, like pointers in struct net_device.

Probably worth a comment somewhere in the code

>
> Do you think this is the right way to go? Thanks!
>
> [1] Thanks Hillf Danton for the hint:
> https://syzkaller.appspot.com/text?tag=Patch&x=10d7cd5bc80000
>
> Thanks,
> Peilin Ye
>

2023-04-27 12:44:24

by Vlad Buslov

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

Hi Peilin,

On Wed 26 Apr 2023 at 16:42, Peilin Ye <[email protected]> wrote:
> +Cc: Vlad Buslov, Hillf Danton
>
> Hi all,
>
> On Mon, Apr 17, 2023 at 04:00:11PM -0700, Peilin Ye wrote:
>> I also reproduced this UAF using the syzkaller reproducer in the report
>> (the C reproducer did not work for me for unknown reasons). I will look
>> into this.
>
> Currently, multiple ingress (clsact) Qdiscs can access the per-netdev
> *miniq_ingress (*miniq_egress) pointer concurrently. This is
> unfortunately true in two senses:
>
> 1. We allow adding ingress (clsact) Qdiscs under parents other than
> TC_H_INGRESS (TC_H_CLSACT):
>
> $ ip link add ifb0 numtxqueues 8 type ifb
> $ echo clsact > /proc/sys/net/core/default_qdisc
> $ tc qdisc add dev ifb0 handle 1: root mq
> $ tc qdisc show dev ifb0
> qdisc mq 1: root
> qdisc clsact 0: parent 1:8
> qdisc clsact 0: parent 1:7
> qdisc clsact 0: parent 1:6
> qdisc clsact 0: parent 1:5
> qdisc clsact 0: parent 1:4
> qdisc clsact 0: parent 1:3
> qdisc clsact 0: parent 1:2
> qdisc clsact 0: parent 1:1
>
> This is obviously racy and should be prohibited. I've started working
> on patches to fix this. The syz repro for this UAF adds ingress Qdiscs
> under TC_H_ROOT, by the way.


Hmm, didn't realize it was the case.

>
> 2. After introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER requests
> [1], it is possible that, when replacing ingress (clsact) Qdiscs, the
> old one can access *miniq_{in,e}gress concurrently with the new one. For
> example, the syz repro does something like the following:
>
> Thread 1 creates sch_ingress Qdisc A (containing mini Qdisc a1 and a2),
> then adds a cls_flower filter X to Qdisc A.
>
> Thread 2 creates sch_ingress Qdisc B (containing mini Qdisc b1 and b2)
> to replace Qdisc A, then adds a cls_flower filter Y to Qdisc B.
>
> Device has 8 TXQs.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> ! mini_qdisc_pair_swap(A)
> | RTM_NEWQDISC (B, locked)
> | 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> | RTM_NEWTFILTER (Y, lockless)
> | tcf_chain0_head_change(B)
> | ! mini_qdisc_pair_swap(B)
> tcf_block_release(A) 0 |
> qdisc_destroy(A) |
> tcf_chain0_head_change_cb_del(A) |
> ! mini_qdisc_pair_swap(A) |
> | |
> ... ...
>
> As we can see there're interleaving mini_qdisc_pair_swap() calls between
> Qdisc A and B, causing all kinds of troubles, including the UAF (thread
> 2 writing to mini Qdisc a1's rcu_state after Qdisc A has already been
> freed) reported by syzbot.

Great analysis! However, it is still not quite clear to me how threads 1
and 2 access each other RCU state when q->miniqp is a private memory of
the Qdisc, so 1 should only see A->miniqp and 2 only B->miniqp. And both
miniqps should be protected from deallocation by reference that lockless
RTM_NEWTFILTER obtains.

>
> To fix this, I'm cooking a patch that, when replacing ingress (clsact)
> Qdiscs, in qdisc_graft():
>
> I. We should make sure there's no on-the-fly lockless filter requests
> for the old Qdisc, and return -EBUSY if there's any (or can/should
> we wait in RTM_NEWQDISC handler?)
>
> II. We should destory the old Qdisc before publishing the new one
> (i.e. setting it to dev_ingress_queue(dev)->qdisc_sleeping, so
> that subsequent filter requests can see it), because
> {ingress,clsact}_destroy() also call mini_qdisc_pair_swap(), which
> sets *miniq_{in,e}gress to NULL

Another approach would be to somehow detect concurrent Qdisc replace and
return -EAGAIN from tcf_chain_tp_insert() before calling
tcf_chain0_head_change(). This would leverage existing cls_api
functionality that automatically retries after releasing all references
to chain/tp and obtaining them again instead of messing with qdisc api.
However, since I still didn't fully grasp the issue it is hard for me to
reason whether such approach would be possible to implement in this
case.

>
> Future Qdiscs that support RTNL-lockless cls_ops, if any, won't need
> this fix, as long as their ->chain_head_change() don't access
> out-of-Qdisc-scope data, like pointers in struct net_device.
>
> Do you think this is the right way to go? Thanks!
>
> [1] Thanks Hillf Danton for the hint:
> https://syzkaller.appspot.com/text?tag=Patch&x=10d7cd5bc80000
>
> Thanks,
> Peilin Ye

2023-04-27 17:38:16

by Peilin Ye

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

Hi Pedro, Vlad,

On Thu, Apr 27, 2023 at 03:26:03PM +0300, Vlad Buslov wrote:
> On Wed 26 Apr 2023 at 16:42, Peilin Ye <[email protected]> wrote:
> > As we can see there're interleaving mini_qdisc_pair_swap() calls between
> > Qdisc A and B, causing all kinds of troubles, including the UAF (thread
> > 2 writing to mini Qdisc a1's rcu_state after Qdisc A has already been
> > freed) reported by syzbot.
>
> Great analysis! However, it is still not quite clear to me how threads 1
> and 2 access each other RCU state when q->miniqp is a private memory of
> the Qdisc, so 1 should only see A->miniqp and 2 only B->miniqp. And both
> miniqps should be protected from deallocation by reference that lockless
> RTM_NEWTFILTER obtains.

Thanks for taking a look!

To elaborate, p_miniq is a pointer of pointer of struct mini_Qdisc,
initialized in ingress_init() to point to eth0->miniq_ingress, which
isn't private to A or B.

In other words, both A->miniqp->p_miniq and B->miniqp->p_miniq point to
eth0->miniq_ingress.

For your reference, roughly speaking, mini_qdisc_pair_swap() does this:

miniq_old = dev->miniq_ingress;

if (destroying) {
dev->miniq_ingress = NULL;
} else {
rcu_wait();
dev->miniq_ingress = miniq_new;
}

if (miniq_old)
miniq_old->rcu_state = ...

On Wed 26 Apr 2023 at 16:42, Peilin Ye <[email protected]> wrote:
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> ! mini_qdisc_pair_swap(A)

1. A adds its first filter,
miniq_old (eth0->miniq_ingress) is NULL,
RCU wait starts,
RCU wait ends,
change eth0->miniq_ingress to A's mini Qdisc.

> | RTM_NEWQDISC (B, locked)
> | 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> | RTM_NEWTFILTER (Y, lockless)
> | tcf_chain0_head_change(B)
> | ! mini_qdisc_pair_swap(B)

2. B adds its first filter,
miniq_old (eth0->miniq_ingress) is A's mini Qdisc,
RCU wait starts,

> tcf_block_release(A) 0 |
> qdisc_destroy(A) |
> tcf_chain0_head_change_cb_del(A) |
> ! mini_qdisc_pair_swap(A) |

3. A destroys itself,
miniq_old (eth0->miniq_ingress) is A's mini Qdisc,
(destroying, so no RCU wait)
change eth0->miniq_ingress to NULL,
update miniq_old, or A's mini Qdisc's RCU state,
A is freed.

2. RCU wait ends,
change eth0->miniq_ingress to B's mini Qdisc,
use-after-free: update miniq_old, or A's mini Qdisc's RCU state.

I hope this helps. Sorry I didn't go into details; this UAF isn't the
only thing that is unacceptable here:

Consider B. We add a filter Y to B, expecting ingress packets on eth0
to go through Y. Then all of a sudden, A sets eth0->miniq_ingress to
NULL during its destruction, so packets will not find Y at all on
datapath (sch_handle_ingress()). New filter becomes invisible - this is
already buggy enough :-/

So I think B's first call to mini_qdisc_pair_swap() should happen after
A's last call (in ingress_destroy()), which is what I am trying to
achieve here.

Thanks,
Peilin Ye

2023-04-28 12:45:44

by Vlad Buslov

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

On Thu 27 Apr 2023 at 10:35, Peilin Ye <[email protected]> wrote:
> Hi Pedro, Vlad,
>
> On Thu, Apr 27, 2023 at 03:26:03PM +0300, Vlad Buslov wrote:
>> On Wed 26 Apr 2023 at 16:42, Peilin Ye <[email protected]> wrote:
>> > As we can see there're interleaving mini_qdisc_pair_swap() calls between
>> > Qdisc A and B, causing all kinds of troubles, including the UAF (thread
>> > 2 writing to mini Qdisc a1's rcu_state after Qdisc A has already been
>> > freed) reported by syzbot.
>>
>> Great analysis! However, it is still not quite clear to me how threads 1
>> and 2 access each other RCU state when q->miniqp is a private memory of
>> the Qdisc, so 1 should only see A->miniqp and 2 only B->miniqp. And both
>> miniqps should be protected from deallocation by reference that lockless
>> RTM_NEWTFILTER obtains.
>
> Thanks for taking a look!
>
> To elaborate, p_miniq is a pointer of pointer of struct mini_Qdisc,
> initialized in ingress_init() to point to eth0->miniq_ingress, which
> isn't private to A or B.
>
> In other words, both A->miniqp->p_miniq and B->miniqp->p_miniq point to
> eth0->miniq_ingress.
>
> For your reference, roughly speaking, mini_qdisc_pair_swap() does this:
>
> miniq_old = dev->miniq_ingress;
>
> if (destroying) {
> dev->miniq_ingress = NULL;
> } else {
> rcu_wait();
> dev->miniq_ingress = miniq_new;
> }
>
> if (miniq_old)
> miniq_old->rcu_state = ...
>
> On Wed 26 Apr 2023 at 16:42, Peilin Ye <[email protected]> wrote:
>> Thread 1 A's refcnt Thread 2
>> RTM_NEWQDISC (A, locked)
>> qdisc_create(A) 1
>> qdisc_graft(A) 9
>>
>> RTM_NEWTFILTER (X, lockless)
>> __tcf_qdisc_find(A) 10
>> tcf_chain0_head_change(A)
>> ! mini_qdisc_pair_swap(A)
>
> 1. A adds its first filter,
> miniq_old (eth0->miniq_ingress) is NULL,
> RCU wait starts,
> RCU wait ends,
> change eth0->miniq_ingress to A's mini Qdisc.
>
>> | RTM_NEWQDISC (B, locked)
>> | 2 qdisc_graft(B)
>> | 1 notify_and_destroy(A)
>> |
>> | RTM_NEWTFILTER (Y, lockless)
>> | tcf_chain0_head_change(B)
>> | ! mini_qdisc_pair_swap(B)
>
> 2. B adds its first filter,
> miniq_old (eth0->miniq_ingress) is A's mini Qdisc,
> RCU wait starts,
>
>> tcf_block_release(A) 0 |
>> qdisc_destroy(A) |
>> tcf_chain0_head_change_cb_del(A) |
>> ! mini_qdisc_pair_swap(A) |
>
> 3. A destroys itself,
> miniq_old (eth0->miniq_ingress) is A's mini Qdisc,
> (destroying, so no RCU wait)
> change eth0->miniq_ingress to NULL,
> update miniq_old, or A's mini Qdisc's RCU state,
> A is freed.
>
> 2. RCU wait ends,
> change eth0->miniq_ingress to B's mini Qdisc,
> use-after-free: update miniq_old, or A's mini Qdisc's RCU state.

Thanks for the clarification.

>
> I hope this helps. Sorry I didn't go into details; this UAF isn't the
> only thing that is unacceptable here:
>
> Consider B. We add a filter Y to B, expecting ingress packets on eth0
> to go through Y. Then all of a sudden, A sets eth0->miniq_ingress to
> NULL during its destruction, so packets will not find Y at all on
> datapath (sch_handle_ingress()). New filter becomes invisible - this is
> already buggy enough :-/
>
> So I think B's first call to mini_qdisc_pair_swap() should happen after
> A's last call (in ingress_destroy()), which is what I am trying to
> achieve here.

Makes sense to me.

2023-05-06 00:18:44

by Peilin Ye

[permalink] [raw]
Subject: [PATCH net 0/6] net/sched: Fixes for sch_ingress and sch_clsact

Hi all,

These are fixes for ingress and clsact Qdiscs:

[1,2/6]: ingress and clsact Qdiscs should only be created under ffff:fff1
[3/6]: Under ffff:fff1, only create ingress and clsact Qdiscs (for now,
at least)
[4/6]: After creating ingress and clsact Qdiscs under ffff:fff1, do not
graft them again to anywhere else (e.g. as the inner Qdisc of a
TBF Qdisc)
[5/6]: Prepare for [6/6], do not reuse that for-loop in qdisc_graft()
for ingress and clsact Qdiscs
[6/6]: Fix use-after-free [a] in mini_qdisc_pair_swap()

Please review (especially [6/6]), thanks! Other tasks, including:

- prohibiting sch_ingress, sch_clsact etc. in default_qdisc
- more cleanups for qdisc_graft()

Will be done in separate patches.

[a] https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b

Thanks,
Peilin Ye (6):
net/sched: sch_ingress: Only create under TC_H_INGRESS
net/sched: sch_clsact: Only create under TC_H_CLSACT
net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact)
Qdiscs
net/sched: Prohibit regrafting ingress or clsact Qdiscs
net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
net/sched: qdisc_destroy() old ingress and clsact Qdiscs before
grafting

include/net/sch_generic.h | 8 ++++++
net/sched/sch_api.c | 54 +++++++++++++++++++++++++++++----------
net/sched/sch_generic.c | 14 +++++++---
net/sched/sch_ingress.c | 10 ++++++--
4 files changed, 67 insertions(+), 19 deletions(-)

--
2.20.1

2023-05-06 00:32:42

by Peilin Ye

[permalink] [raw]
Subject: [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs

Grafting ingress and clsact Qdiscs does not need a for-loop in
qdisc_graft(). Refactor it. No functional changes intended.

Signed-off-by: Peilin Ye <[email protected]>
---
net/sched/sch_api.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 49b9c1bbfdd9..f72a581666a2 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1073,12 +1073,12 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,

if (parent == NULL) {
unsigned int i, num_q, ingress;
+ struct netdev_queue *dev_queue;

ingress = 0;
num_q = dev->num_tx_queues;
if ((q && q->flags & TCQ_F_INGRESS) ||
(new && new->flags & TCQ_F_INGRESS)) {
- num_q = 1;
ingress = 1;
if (!dev_ingress_queue(dev)) {
NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
@@ -1094,18 +1094,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
if (new && new->ops->attach && !ingress)
goto skip;

- for (i = 0; i < num_q; i++) {
- struct netdev_queue *dev_queue = dev_ingress_queue(dev);
-
- if (!ingress)
+ if (!ingress) {
+ for (i = 0; i < num_q; i++) {
dev_queue = netdev_get_tx_queue(dev, i);
+ old = dev_graft_qdisc(dev_queue, new);

- old = dev_graft_qdisc(dev_queue, new);
- if (new && i > 0)
- qdisc_refcount_inc(new);
-
- if (!ingress)
+ if (new && i > 0)
+ qdisc_refcount_inc(new);
qdisc_put(old);
+ }
+ } else {
+ dev_queue = dev_ingress_queue(dev);
+ old = dev_graft_qdisc(dev_queue, new);
}

skip:
--
2.20.1

2023-05-06 00:34:39

by Peilin Ye

[permalink] [raw]
Subject: [PATCH net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT

clsact Qdiscs are only supposed to be created under TC_H_CLSACT (which
equals TC_H_INGRESS). Return -EOPNOTSUPP if 'parent' is not
TC_H_CLSACT.

Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Signed-off-by: Peilin Ye <[email protected]>
---
net/sched/sch_ingress.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 3d71f7a3b4ad..13218a1fe4a5 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -222,6 +222,9 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
struct net_device *dev = qdisc_dev(sch);
int err;

+ if (sch->parent != TC_H_CLSACT)
+ return -EOPNOTSUPP;
+
net_inc_ingress_queue();
net_inc_egress_queue();

--
2.20.1

2023-05-06 00:39:01

by Peilin Ye

[permalink] [raw]
Subject: [PATCH net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs

Currently it is possible to add e.g. an HTB Qdisc under ffff:fff1
(TC_H_INGRESS, TC_H_CLSACT):

$ ip link add name ifb0 type ifb
$ tc qdisc add dev ifb0 parent ffff:fff1 htb
$ tc qdisc add dev ifb0 clsact
Error: Exclusivity flag on, cannot modify.
$ drgn
...
>>> ifb0 = netdev_get_by_name(prog, "ifb0")
>>> qdisc = ifb0.ingress_queue.qdisc_sleeping
>>> print(qdisc.ops.id.string_().decode())
htb
>>> qdisc.flags.value_() # TCQ_F_INGRESS
2

Only allow ingress and clsact Qdiscs under ffff:fff1. Return -EINVAL
for everything else. Make TCQ_F_INGRESS a static flag of ingress and
clsact Qdiscs.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Signed-off-by: Peilin Ye <[email protected]>
---
net/sched/sch_api.c | 7 ++++++-
net/sched/sch_ingress.c | 4 ++--
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index fdb8f429333d..383195955b7d 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1252,7 +1252,12 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
sch->parent = parent;

if (handle == TC_H_INGRESS) {
- sch->flags |= TCQ_F_INGRESS;
+ if (!(sch->flags & TCQ_F_INGRESS)) {
+ NL_SET_ERR_MSG(extack,
+ "Specified parent ID is reserved for ingress and clsact Qdiscs");
+ err = -EINVAL;
+ goto err_out3;
+ }
handle = TC_H_MAKE(TC_H_INGRESS, 0);
} else {
if (handle == 0) {
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 13218a1fe4a5..caea51e0d4e9 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -137,7 +137,7 @@ static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
.cl_ops = &ingress_class_ops,
.id = "ingress",
.priv_size = sizeof(struct ingress_sched_data),
- .static_flags = TCQ_F_CPUSTATS,
+ .static_flags = TCQ_F_INGRESS | TCQ_F_CPUSTATS,
.init = ingress_init,
.destroy = ingress_destroy,
.dump = ingress_dump,
@@ -275,7 +275,7 @@ static struct Qdisc_ops clsact_qdisc_ops __read_mostly = {
.cl_ops = &clsact_class_ops,
.id = "clsact",
.priv_size = sizeof(struct clsact_sched_data),
- .static_flags = TCQ_F_CPUSTATS,
+ .static_flags = TCQ_F_INGRESS | TCQ_F_CPUSTATS,
.init = clsact_init,
.destroy = clsact_destroy,
.dump = ingress_dump,
--
2.20.1

2023-05-06 00:57:14

by Peilin Ye

[permalink] [raw]
Subject: [PATCH net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS

ingress Qdiscs are only supposed to be created under TC_H_INGRESS.
Similar to mq_init(), return -EOPNOTSUPP if 'parent' is not
TC_H_INGRESS.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: [email protected]
Link: https://lore.kernel.org/netdev/[email protected]
Signed-off-by: Peilin Ye <[email protected]>
---
net/sched/sch_ingress.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 84838128b9c5..3d71f7a3b4ad 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -80,6 +80,9 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
struct net_device *dev = qdisc_dev(sch);
int err;

+ if (sch->parent != TC_H_INGRESS)
+ return -EOPNOTSUPP;
+
net_inc_ingress_queue();

mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
--
2.20.1

2023-05-06 00:57:15

by Peilin Ye

[permalink] [raw]
Subject: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
clsact Qdiscs and miniq_egress.

Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
access the same miniq_{in,e}gress pointer(s) concurrently with the new
Qdisc, causing race conditions [1] including a use-after-free in
mini_qdisc_pair_swap() reported by syzbot:

BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
print_report mm/kasan/report.c:430 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:536
mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
finished.

To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
qdisc_destroy() for "old" before grafting "new".

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-lockless filter requests. Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:

Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
adds a flower filter X to A.

Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
b2) to replace A, then adds a flower filter Y to B.

Thread 1 A's refcnt Thread 2
RTM_NEWQDISC (A, RTNL-locked)
qdisc_create(A) 1
qdisc_graft(A) 9

RTM_NEWTFILTER (X, RTNL-lockless)
__tcf_qdisc_find(A) 10
tcf_chain0_head_change(A)
mini_qdisc_pair_swap(A) (1st)
|
| RTM_NEWQDISC (B, RTNL-locked)
RCU 2 qdisc_graft(B)
| 1 notify_and_destroy(A)
|
tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
qdisc_destroy(A) tcf_chain0_head_change(B)
tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
mini_qdisc_pair_swap(A) (3rd) |
... ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().

This is just one of the possible consequences of concurrently accessing
net_device::miniq_{in,e}gress pointers. The point is clear, however:
B's first call to mini_qdisc_pair_swap() should take place after A's
last call, in qdisc_destroy().

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: [email protected]
Link: https://lore.kernel.org/netdev/[email protected]
Cc: Hillf Danton <[email protected]>
Signed-off-by: Peilin Ye <[email protected]>
---
include/net/sch_generic.h | 8 ++++++++
net/sched/sch_api.c | 26 +++++++++++++++++++++-----
net/sched/sch_generic.c | 14 +++++++++++---
3 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
refcount_inc(&qdisc->refcnt);
}

+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+ if (qdisc->flags & TCQ_F_BUILTIN)
+ return true;
+ return refcount_dec_if_one(&qdisc->refcnt);
+}
+
/* Intended to be used by unlocked users, when concurrent qdisc release is
* possible.
*/
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
struct Qdisc *qdisc);
void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
void qdisc_put(struct Qdisc *qdisc);
void qdisc_put_unlocked(struct Qdisc *qdisc);
void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..a2d07bc8ded6 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
if ((q && q->flags & TCQ_F_INGRESS) ||
(new && new->flags & TCQ_F_INGRESS)) {
ingress = 1;
- if (!dev_ingress_queue(dev)) {
+ dev_queue = dev_ingress_queue(dev);
+ if (!dev_queue) {
NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
return -ENOENT;
}
+
+ /* This is the counterpart of that qdisc_refcount_inc_nz() call in
+ * __tcf_qdisc_find() for RTNL-lockless filter requests.
+ */
+ if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
+ NL_SET_ERR_MSG(extack,
+ "Current ingress or clsact Qdisc has ongoing filter request(s)");
+ return -EBUSY;
+ }
}

if (dev->flags & IFF_UP)
@@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
qdisc_put(old);
}
} else {
- dev_queue = dev_ingress_queue(dev);
- old = dev_graft_qdisc(dev_queue, new);
+ old = dev_graft_qdisc(dev_queue, NULL);
+
+ /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
+ * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+ * pointer(s) in mini_qdisc_pair_swap().
+ */
+ qdisc_notify(net, skb, n, classid, old, new, extack);
+ qdisc_destroy(old);
+
+ dev_graft_qdisc(dev_queue, new);
}

skip:
@@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,

if (new && new->ops->attach)
new->ops->attach(new);
- } else {
- notify_and_destroy(net, skb, n, classid, old, new, extack);
}

if (dev->flags & IFF_UP)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
qdisc_free(q);
}

-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
{
const struct Qdisc_ops *ops = qdisc->ops;

@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
call_rcu(&qdisc->rcu, qdisc_free_cb);
}

+void qdisc_destroy(struct Qdisc *qdisc)
+{
+ if (qdisc->flags & TCQ_F_BUILTIN)
+ return;
+
+ __qdisc_destroy(qdisc);
+}
+
void qdisc_put(struct Qdisc *qdisc)
{
if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
!refcount_dec_and_test(&qdisc->refcnt))
return;

- qdisc_destroy(qdisc);
+ __qdisc_destroy(qdisc);
}
EXPORT_SYMBOL(qdisc_put);

@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
!refcount_dec_and_rtnl_lock(&qdisc->refcnt))
return;

- qdisc_destroy(qdisc);
+ __qdisc_destroy(qdisc);
rtnl_unlock();
}
EXPORT_SYMBOL(qdisc_put_unlocked);
--
2.20.1

2023-05-06 00:58:14

by Peilin Ye

[permalink] [raw]
Subject: [PATCH net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs

Currently, after creating an ingress (or clsact) Qdisc and grafting it
under TC_H_INGRESS (TC_H_CLSACT), it is possible to graft it again under
e.g. a TBF Qdisc:

$ ip link add ifb0 type ifb
$ tc qdisc add dev ifb0 handle 1: root tbf rate 20kbit buffer 1600 limit 3000
$ tc qdisc add dev ifb0 clsact
$ tc qdisc link dev ifb0 handle ffff: parent 1:1
$ tc qdisc show dev ifb0
qdisc tbf 1: root refcnt 2 rate 20Kbit burst 1600b lat 560.0ms
qdisc clsact ffff: parent ffff:fff1 refcnt 2
^^^^^^^^

clsact's refcount has increased: it is now grafted under both
TC_H_CLSACT and 1:1.

ingress and clsact Qdiscs should only be used under TC_H_INGRESS
(TC_H_CLSACT). Prohibit regrafting them.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Signed-off-by: Peilin Ye <[email protected]>
---
net/sched/sch_api.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 383195955b7d..49b9c1bbfdd9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1596,6 +1596,11 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
NL_SET_ERR_MSG(extack, "Invalid qdisc name");
return -EINVAL;
}
+ if (q->flags & TCQ_F_INGRESS) {
+ NL_SET_ERR_MSG(extack,
+ "Cannot regraft ingress or clsact Qdiscs");
+ return -EINVAL;
+ }
if (q == p ||
(p && check_loop(q, p, 0))) {
NL_SET_ERR_MSG(extack, "Qdisc parent/child loop detected");
--
2.20.1

2023-05-08 11:53:23

by Jamal Hadi Salim

[permalink] [raw]
Subject: Re: [PATCH net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs

On Fri, May 5, 2023 at 8:14 PM Peilin Ye <[email protected]> wrote:
>
> Currently, after creating an ingress (or clsact) Qdisc and grafting it
> under TC_H_INGRESS (TC_H_CLSACT), it is possible to graft it again under
> e.g. a TBF Qdisc:
>
> $ ip link add ifb0 type ifb
> $ tc qdisc add dev ifb0 handle 1: root tbf rate 20kbit buffer 1600 limit 3000
> $ tc qdisc add dev ifb0 clsact
> $ tc qdisc link dev ifb0 handle ffff: parent 1:1
> $ tc qdisc show dev ifb0
> qdisc tbf 1: root refcnt 2 rate 20Kbit burst 1600b lat 560.0ms
> qdisc clsact ffff: parent ffff:fff1 refcnt 2
> ^^^^^^^^
>
> clsact's refcount has increased: it is now grafted under both
> TC_H_CLSACT and 1:1.
>
> ingress and clsact Qdiscs should only be used under TC_H_INGRESS
> (TC_H_CLSACT). Prohibit regrafting them.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Signed-off-by: Peilin Ye <[email protected]>

Reviewed-by: Jamal Hadi Salim <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>

cheers,
jamal

> ---
> net/sched/sch_api.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 383195955b7d..49b9c1bbfdd9 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1596,6 +1596,11 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
> NL_SET_ERR_MSG(extack, "Invalid qdisc name");
> return -EINVAL;
> }
> + if (q->flags & TCQ_F_INGRESS) {
> + NL_SET_ERR_MSG(extack,
> + "Cannot regraft ingress or clsact Qdiscs");
> + return -EINVAL;
> + }
> if (q == p ||
> (p && check_loop(q, p, 0))) {
> NL_SET_ERR_MSG(extack, "Qdisc parent/child loop detected");
> --
> 2.20.1
>

2023-05-08 11:53:23

by Jamal Hadi Salim

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Fri, May 5, 2023 at 8:16 PM Peilin Ye <[email protected]> wrote:
>
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
> clsact Qdiscs and miniq_egress.
>
> Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
> requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
> access the same miniq_{in,e}gress pointer(s) concurrently with the new
> Qdisc, causing race conditions [1] including a use-after-free in
> mini_qdisc_pair_swap() reported by syzbot:
>
> BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> ...
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
> print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
> print_report mm/kasan/report.c:430 [inline]
> kasan_report+0x11c/0x130 mm/kasan/report.c:536
> mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
> tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
> tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
> tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
> tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> ...
>
> The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
> after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
> finished.
>
> To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
> clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
> qdisc_destroy() for "old" before grafting "new".
>
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-lockless filter requests. Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
>
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
>
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, RTNL-locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, RTNL-lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> mini_qdisc_pair_swap(A) (1st)
> |
> | RTM_NEWQDISC (B, RTNL-locked)
> RCU 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> qdisc_destroy(A) tcf_chain0_head_change(B)
> tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> mini_qdisc_pair_swap(A) (3rd) |
> ... ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers. The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().
>
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: [email protected]
> Link: https://lore.kernel.org/netdev/[email protected]
> Cc: Hillf Danton <[email protected]>
> Signed-off-by: Peilin Ye <[email protected]>



Thanks for the excellent analysis Peilin and for chasing this to the
end. I have no doubt it was a lot of fun! ;->

Reviewed-by: Jamal Hadi Salim <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>

cheers,
jamal

> ---
> include/net/sch_generic.h | 8 ++++++++
> net/sched/sch_api.c | 26 +++++++++++++++++++++-----
> net/sched/sch_generic.c | 14 +++++++++++---
> 3 files changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
> refcount_inc(&qdisc->refcnt);
> }
>
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> + if (qdisc->flags & TCQ_F_BUILTIN)
> + return true;
> + return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
> /* Intended to be used by unlocked users, when concurrent qdisc release is
> * possible.
> */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
> struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
> struct Qdisc *qdisc);
> void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
> void qdisc_put(struct Qdisc *qdisc);
> void qdisc_put_unlocked(struct Qdisc *qdisc);
> void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..a2d07bc8ded6 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> if ((q && q->flags & TCQ_F_INGRESS) ||
> (new && new->flags & TCQ_F_INGRESS)) {
> ingress = 1;
> - if (!dev_ingress_queue(dev)) {
> + dev_queue = dev_ingress_queue(dev);
> + if (!dev_queue) {
> NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
> return -ENOENT;
> }
> +
> + /* This is the counterpart of that qdisc_refcount_inc_nz() call in
> + * __tcf_qdisc_find() for RTNL-lockless filter requests.
> + */
> + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> + NL_SET_ERR_MSG(extack,
> + "Current ingress or clsact Qdisc has ongoing filter request(s)");
> + return -EBUSY;
> + }
> }
>
> if (dev->flags & IFF_UP)
> @@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> qdisc_put(old);
> }
> } else {
> - dev_queue = dev_ingress_queue(dev);
> - old = dev_graft_qdisc(dev_queue, new);
> + old = dev_graft_qdisc(dev_queue, NULL);
> +
> + /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> + * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> + * pointer(s) in mini_qdisc_pair_swap().
> + */
> + qdisc_notify(net, skb, n, classid, old, new, extack);
> + qdisc_destroy(old);
> +
> + dev_graft_qdisc(dev_queue, new);
> }
>
> skip:
> @@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
> if (new && new->ops->attach)
> new->ops->attach(new);
> - } else {
> - notify_and_destroy(net, skb, n, classid, old, new, extack);
> }
>
> if (dev->flags & IFF_UP)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
> qdisc_free(q);
> }
>
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
> {
> const struct Qdisc_ops *ops = qdisc->ops;
>
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
> call_rcu(&qdisc->rcu, qdisc_free_cb);
> }
>
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> + if (qdisc->flags & TCQ_F_BUILTIN)
> + return;
> +
> + __qdisc_destroy(qdisc);
> +}
> +
> void qdisc_put(struct Qdisc *qdisc)
> {
> if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
> !refcount_dec_and_test(&qdisc->refcnt))
> return;
>
> - qdisc_destroy(qdisc);
> + __qdisc_destroy(qdisc);
> }
> EXPORT_SYMBOL(qdisc_put);
>
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
> !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
> return;
>
> - qdisc_destroy(qdisc);
> + __qdisc_destroy(qdisc);
> rtnl_unlock();
> }
> EXPORT_SYMBOL(qdisc_put_unlocked);
> --
> 2.20.1
>

2023-05-08 11:55:44

by Jamal Hadi Salim

[permalink] [raw]
Subject: Re: [PATCH net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT

On Fri, May 5, 2023 at 8:13 PM Peilin Ye <[email protected]> wrote:
>
> clsact Qdiscs are only supposed to be created under TC_H_CLSACT (which
> equals TC_H_INGRESS). Return -EOPNOTSUPP if 'parent' is not
> TC_H_CLSACT.
>
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Signed-off-by: Peilin Ye <[email protected]>

Reviewed-by: Jamal Hadi Salim <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>

cheers,
jamal

> ---
> net/sched/sch_ingress.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 3d71f7a3b4ad..13218a1fe4a5 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -222,6 +222,9 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
> struct net_device *dev = qdisc_dev(sch);
> int err;
>
> + if (sch->parent != TC_H_CLSACT)
> + return -EOPNOTSUPP;
> +
> net_inc_ingress_queue();
> net_inc_egress_queue();
>
> --
> 2.20.1
>

2023-05-08 11:56:43

by Jamal Hadi Salim

[permalink] [raw]
Subject: Re: [PATCH net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs

On Fri, May 5, 2023 at 8:14 PM Peilin Ye <[email protected]> wrote:
>
> Currently it is possible to add e.g. an HTB Qdisc under ffff:fff1
> (TC_H_INGRESS, TC_H_CLSACT):
>
> $ ip link add name ifb0 type ifb
> $ tc qdisc add dev ifb0 parent ffff:fff1 htb
> $ tc qdisc add dev ifb0 clsact
> Error: Exclusivity flag on, cannot modify.
> $ drgn
> ...
> >>> ifb0 = netdev_get_by_name(prog, "ifb0")
> >>> qdisc = ifb0.ingress_queue.qdisc_sleeping
> >>> print(qdisc.ops.id.string_().decode())
> htb
> >>> qdisc.flags.value_() # TCQ_F_INGRESS
> 2
>
> Only allow ingress and clsact Qdiscs under ffff:fff1. Return -EINVAL
> for everything else. Make TCQ_F_INGRESS a static flag of ingress and
> clsact Qdiscs.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Signed-off-by: Peilin Ye <[email protected]>

Reviewed-by: Jamal Hadi Salim <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>


cheers,
jamal

> ---
> net/sched/sch_api.c | 7 ++++++-
> net/sched/sch_ingress.c | 4 ++--
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index fdb8f429333d..383195955b7d 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1252,7 +1252,12 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
> sch->parent = parent;
>
> if (handle == TC_H_INGRESS) {
> - sch->flags |= TCQ_F_INGRESS;
> + if (!(sch->flags & TCQ_F_INGRESS)) {
> + NL_SET_ERR_MSG(extack,
> + "Specified parent ID is reserved for ingress and clsact Qdiscs");
> + err = -EINVAL;
> + goto err_out3;
> + }
> handle = TC_H_MAKE(TC_H_INGRESS, 0);
> } else {
> if (handle == 0) {
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 13218a1fe4a5..caea51e0d4e9 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -137,7 +137,7 @@ static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
> .cl_ops = &ingress_class_ops,
> .id = "ingress",
> .priv_size = sizeof(struct ingress_sched_data),
> - .static_flags = TCQ_F_CPUSTATS,
> + .static_flags = TCQ_F_INGRESS | TCQ_F_CPUSTATS,
> .init = ingress_init,
> .destroy = ingress_destroy,
> .dump = ingress_dump,
> @@ -275,7 +275,7 @@ static struct Qdisc_ops clsact_qdisc_ops __read_mostly = {
> .cl_ops = &clsact_class_ops,
> .id = "clsact",
> .priv_size = sizeof(struct clsact_sched_data),
> - .static_flags = TCQ_F_CPUSTATS,
> + .static_flags = TCQ_F_INGRESS | TCQ_F_CPUSTATS,
> .init = clsact_init,
> .destroy = clsact_destroy,
> .dump = ingress_dump,
> --
> 2.20.1
>

2023-05-08 12:02:22

by Jamal Hadi Salim

[permalink] [raw]
Subject: Re: [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs

On Fri, May 5, 2023 at 8:15 PM Peilin Ye <[email protected]> wrote:
>
> Grafting ingress and clsact Qdiscs does not need a for-loop in
> qdisc_graft(). Refactor it. No functional changes intended.
>
> Signed-off-by: Peilin Ye <[email protected]>

Fixed John's email address.

This one i am not so sure; num_q = 1 implies it will run on the for
loop only once. I am not sure it improves readability either. Anyways
for the effort you put into it i am tossing a coin and saying:
Reviewed-by: Jamal Hadi Salim <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>

cheers,
jamal


> net/sched/sch_api.c | 20 ++++++++++----------
> 1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 49b9c1bbfdd9..f72a581666a2 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1073,12 +1073,12 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
> if (parent == NULL) {
> unsigned int i, num_q, ingress;
> + struct netdev_queue *dev_queue;
>
> ingress = 0;
> num_q = dev->num_tx_queues;
> if ((q && q->flags & TCQ_F_INGRESS) ||
> (new && new->flags & TCQ_F_INGRESS)) {
> - num_q = 1;
> ingress = 1;
> if (!dev_ingress_queue(dev)) {
> NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
> @@ -1094,18 +1094,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> if (new && new->ops->attach && !ingress)
> goto skip;
>
> - for (i = 0; i < num_q; i++) {
> - struct netdev_queue *dev_queue = dev_ingress_queue(dev);
> -
> - if (!ingress)
> + if (!ingress) {
> + for (i = 0; i < num_q; i++) {
> dev_queue = netdev_get_tx_queue(dev, i);
> + old = dev_graft_qdisc(dev_queue, new);
>
> - old = dev_graft_qdisc(dev_queue, new);
> - if (new && i > 0)
> - qdisc_refcount_inc(new);
> -
> - if (!ingress)
> + if (new && i > 0)
> + qdisc_refcount_inc(new);
> qdisc_put(old);
> + }
> + } else {
> + dev_queue = dev_ingress_queue(dev);
> + old = dev_graft_qdisc(dev_queue, new);
> }
>
> skip:
> --
> 2.20.1
>

2023-05-08 12:06:38

by Jamal Hadi Salim

[permalink] [raw]
Subject: Re: [PATCH net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS

On Fri, May 5, 2023 at 8:12 PM Peilin Ye <[email protected]> wrote:
>
> ingress Qdiscs are only supposed to be created under TC_H_INGRESS.
> Similar to mq_init(), return -EOPNOTSUPP if 'parent' is not
> TC_H_INGRESS.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: [email protected]
> Link: https://lore.kernel.org/netdev/[email protected]
> Signed-off-by: Peilin Ye <[email protected]>
Reviewed-by: Jamal Hadi Salim <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>

cheers,
jamal

> ---
> net/sched/sch_ingress.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 84838128b9c5..3d71f7a3b4ad 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -80,6 +80,9 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
> struct net_device *dev = qdisc_dev(sch);
> int err;
>
> + if (sch->parent != TC_H_INGRESS)
> + return -EOPNOTSUPP;
> +
> net_inc_ingress_queue();
>
> mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> --
> 2.20.1
>

2023-05-08 14:24:05

by Pedro Tammela

[permalink] [raw]
Subject: Re: [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs

On 05/05/2023 21:15, Peilin Ye wrote:
> Grafting ingress and clsact Qdiscs does not need a for-loop in
> qdisc_graft(). Refactor it. No functional changes intended.
>
> Signed-off-by: Peilin Ye <[email protected]>

Just a FYI:
If you decide to keep this refactoring, it will need to be back ported
together with the subsequent fix.
I would personally leave it to net-next.

Thanks for chasing this!

Tested-by: Pedro Tammela <[email protected]>

> ---
> net/sched/sch_api.c | 20 ++++++++++----------
> 1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 49b9c1bbfdd9..f72a581666a2 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1073,12 +1073,12 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
> if (parent == NULL) {
> unsigned int i, num_q, ingress;
> + struct netdev_queue *dev_queue;
>
> ingress = 0;
> num_q = dev->num_tx_queues;
> if ((q && q->flags & TCQ_F_INGRESS) ||
> (new && new->flags & TCQ_F_INGRESS)) {
> - num_q = 1;
> ingress = 1;
> if (!dev_ingress_queue(dev)) {
> NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
> @@ -1094,18 +1094,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> if (new && new->ops->attach && !ingress)
> goto skip;
>
> - for (i = 0; i < num_q; i++) {
> - struct netdev_queue *dev_queue = dev_ingress_queue(dev);
> -
> - if (!ingress)
> + if (!ingress) {
> + for (i = 0; i < num_q; i++) {
> dev_queue = netdev_get_tx_queue(dev, i);
> + old = dev_graft_qdisc(dev_queue, new);
>
> - old = dev_graft_qdisc(dev_queue, new);
> - if (new && i > 0)
> - qdisc_refcount_inc(new);
> -
> - if (!ingress)
> + if (new && i > 0)
> + qdisc_refcount_inc(new);
> qdisc_put(old);
> + }
> + } else {
> + dev_queue = dev_ingress_queue(dev);
> + old = dev_graft_qdisc(dev_queue, new);
> }
>
> skip:

2023-05-08 14:43:39

by Pedro Tammela

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On 05/05/2023 21:16, Peilin Ye wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
> clsact Qdiscs and miniq_egress.
>
> Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
> requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
> access the same miniq_{in,e}gress pointer(s) concurrently with the new
> Qdisc, causing race conditions [1] including a use-after-free in
> mini_qdisc_pair_swap() reported by syzbot:
>
> BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> ...
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
> print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
> print_report mm/kasan/report.c:430 [inline]
> kasan_report+0x11c/0x130 mm/kasan/report.c:536
> mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
> tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
> tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
> tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
> tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> ...
>
> The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
> after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
> finished.
>
> To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
> clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
> qdisc_destroy() for "old" before grafting "new".
>
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-lockless filter requests. Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
>
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
>
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, RTNL-locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, RTNL-lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> mini_qdisc_pair_swap(A) (1st)
> |
> | RTM_NEWQDISC (B, RTNL-locked)
> RCU 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> qdisc_destroy(A) tcf_chain0_head_change(B)
> tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> mini_qdisc_pair_swap(A) (3rd) |
> ... ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers. The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().
>
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: [email protected]
> Link: https://lore.kernel.org/netdev/[email protected]
> Cc: Hillf Danton <[email protected]>
> Signed-off-by: Peilin Ye <[email protected]>

Thanks for chasing this!

Tested-by: Pedro Tammela <[email protected]>

> ---
> include/net/sch_generic.h | 8 ++++++++
> net/sched/sch_api.c | 26 +++++++++++++++++++++-----
> net/sched/sch_generic.c | 14 +++++++++++---
> 3 files changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
> refcount_inc(&qdisc->refcnt);
> }
>
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> + if (qdisc->flags & TCQ_F_BUILTIN)
> + return true;
> + return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
> /* Intended to be used by unlocked users, when concurrent qdisc release is
> * possible.
> */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
> struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
> struct Qdisc *qdisc);
> void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
> void qdisc_put(struct Qdisc *qdisc);
> void qdisc_put_unlocked(struct Qdisc *qdisc);
> void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..a2d07bc8ded6 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> if ((q && q->flags & TCQ_F_INGRESS) ||
> (new && new->flags & TCQ_F_INGRESS)) {
> ingress = 1;
> - if (!dev_ingress_queue(dev)) {
> + dev_queue = dev_ingress_queue(dev);
> + if (!dev_queue) {
> NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
> return -ENOENT;
> }
> +
> + /* This is the counterpart of that qdisc_refcount_inc_nz() call in
> + * __tcf_qdisc_find() for RTNL-lockless filter requests.
> + */
> + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> + NL_SET_ERR_MSG(extack,
> + "Current ingress or clsact Qdisc has ongoing filter request(s)");
> + return -EBUSY;
> + }
> }
>
> if (dev->flags & IFF_UP)
> @@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> qdisc_put(old);
> }
> } else {
> - dev_queue = dev_ingress_queue(dev);
> - old = dev_graft_qdisc(dev_queue, new);
> + old = dev_graft_qdisc(dev_queue, NULL);
> +
> + /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> + * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> + * pointer(s) in mini_qdisc_pair_swap().
> + */
> + qdisc_notify(net, skb, n, classid, old, new, extack);
> + qdisc_destroy(old);
> +
> + dev_graft_qdisc(dev_queue, new);
> }
>
> skip:
> @@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
> if (new && new->ops->attach)
> new->ops->attach(new);
> - } else {
> - notify_and_destroy(net, skb, n, classid, old, new, extack);
> }
>
> if (dev->flags & IFF_UP)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
> qdisc_free(q);
> }
>
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
> {
> const struct Qdisc_ops *ops = qdisc->ops;
>
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
> call_rcu(&qdisc->rcu, qdisc_free_cb);
> }
>
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> + if (qdisc->flags & TCQ_F_BUILTIN)
> + return;
> +
> + __qdisc_destroy(qdisc);
> +}
> +
> void qdisc_put(struct Qdisc *qdisc)
> {
> if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
> !refcount_dec_and_test(&qdisc->refcnt))
> return;
>
> - qdisc_destroy(qdisc);
> + __qdisc_destroy(qdisc);
> }
> EXPORT_SYMBOL(qdisc_put);
>
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
> !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
> return;
>
> - qdisc_destroy(qdisc);
> + __qdisc_destroy(qdisc);
> rtnl_unlock();
> }
> EXPORT_SYMBOL(qdisc_put_unlocked);

2023-05-08 22:03:29

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Mon, May 08, 2023 at 07:32:01AM -0400, Jamal Hadi Salim wrote:
> Thanks for the excellent analysis Peilin and for chasing this to the
> end. I have no doubt it was a lot of fun! ;->

Of course ;-)

> Reviewed-by: Jamal Hadi Salim <[email protected]>
> Acked-by: Jamal Hadi Salim <[email protected]>

Thanks for reviewing this!

Peilin Ye

2023-05-08 22:09:09

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Mon, May 08, 2023 at 11:12:24AM -0300, Pedro Tammela wrote:
> Thanks for chasing this!
>
> Tested-by: Pedro Tammela <[email protected]>

Thanks for testing, Pedro!

Thanks,
Peilin Ye

2023-05-08 22:44:21

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs

On Mon, May 08, 2023 at 07:29:26AM -0400, Jamal Hadi Salim wrote:
> On Fri, May 5, 2023 at 8:15 PM Peilin Ye <[email protected]> wrote:
> >
> > Grafting ingress and clsact Qdiscs does not need a for-loop in
> > qdisc_graft(). Refactor it. No functional changes intended.
>
> This one i am not so sure; num_q = 1 implies it will run on the for
> loop only once. I am not sure it improves readability either. Anyways
> for the effort you put into it i am tossing a coin and saying:
> Reviewed-by: Jamal Hadi Salim <[email protected]>
> Acked-by: Jamal Hadi Salim <[email protected]>

Yeah, it doesn't make much difference itself. I'm just afraid that,
without [5/6], [6/6] would look like:

for (i = 0; i < num_q; i++) {
if (!ingress) {
dev_queue = netdev_get_tx_queue(dev, i);
old = dev_graft_qdisc(dev_queue, new);
else {
old = dev_graft_qdisc(dev_queue, NULL);
}

if (new && i > 0)
qdisc_refcount_inc(new);

if (!ingress) {
qdisc_put(old);
} else {
/* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
* unprotected concurrent accesses to net_device::miniq_{in,e}gress
* pointer(s) in mini_qdisc_pair_swap().
*/
qdisc_notify(net, skb, n, classid, old, new, extack);
qdisc_destroy(old);
}

if (ingress)
dev_graft_qdisc(dev_queue, new);
}

The "!ingress" path doesn't share a single line with "ingress", which
looks a bit weird to me. Personally I'd like to keep [5/6].

Thanks,
Peilin Ye

2023-05-09 01:37:10

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, RTNL-locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, RTNL-lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> mini_qdisc_pair_swap(A) (1st)
> |
> | RTM_NEWQDISC (B, RTNL-locked)
> RCU 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> qdisc_destroy(A) tcf_chain0_head_change(B)
> tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> mini_qdisc_pair_swap(A) (3rd) |
> ... ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers. The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().

Great analysis, thanks for squashing this bug.

Have you considered creating a fix more localized to the miniq
implementation? It seems that having per-device miniq pointers is
incompatible with using reference counted objects. So miniq is
a more natural place to solve the problem. Otherwise workarounds
in the core keep piling up (here qdisc_graft()).

Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
If active qdisc is neither a1 nor a2 we should leave the dev state
alone.

2023-05-10 20:18:48

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Mon, May 08, 2023 at 06:33:24PM -0700, Jakub Kicinski wrote:
> Great analysis, thanks for squashing this bug.

Thanks, happy to help!

> Have you considered creating a fix more localized to the miniq
> implementation? It seems that having per-device miniq pointers is
> incompatible with using reference counted objects. So miniq is
> a more natural place to solve the problem. Otherwise workarounds
> in the core keep piling up (here qdisc_graft()).
>
> Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
> If active qdisc is neither a1 nor a2 we should leave the dev state
> alone.

Yes, I have tried fixing this in mini_qdisc_pair_swap(), but I am afraid
it is hard:

(3rd) is called from ->destroy(), so currently it uses RCU_INIT_POINTER()
to set dev->miniq_ingress to NULL. It will need a logic like:

I am A. Set dev->miniq_ingress to NULL, if and only if it is a1 or a2,
and do it atomically.

We need more than a cmpxchg() to implement this "set NULL iff a1 or a2".
Additionally:

On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, RTNL-locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, RTNL-lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> mini_qdisc_pair_swap(A) (1st)
> |
> | RTM_NEWQDISC (B, RTNL-locked)
> RCU 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> qdisc_destroy(A) tcf_chain0_head_change(B)
> tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> mini_qdisc_pair_swap(A) (3rd) |
> ... ...

Looking at the code, I think there is no guarantee that (1st) cannot
happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
handlers get preempted?

If (1st) happens later than (2nd), we will need to make (1st) no-op, by
detecting that we are the "old" Qdisc. I am not sure there is any
(clean) way to do it. I even thought about:

(1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
(2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping". If false, it
means we are the "old" Qdisc (have been replaced), and should do
nothing.

However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
egress one, so I can't container_of() during step (1) ...

Eventually I created [5,6/6]. It is a workaround indeed, in the sense
that it changes sch_api.c to avoid a mini Qdisc issue. However I think it
makes the code correct in a relatively understandable way, without slowing
down mini_qdisc_pair_swap() or sch_handle_*gress().

Thanks,
Peilin Ye


2023-05-10 23:55:12

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> > Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> > adds a flower filter X to A.
> >
> > Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> > b2) to replace A, then adds a flower filter Y to B.
> >
> > Thread 1 A's refcnt Thread 2
> > RTM_NEWQDISC (A, RTNL-locked)
> > qdisc_create(A) 1
> > qdisc_graft(A) 9
> >
> > RTM_NEWTFILTER (X, RTNL-lockless)
> > __tcf_qdisc_find(A) 10
> > tcf_chain0_head_change(A)
> > mini_qdisc_pair_swap(A) (1st)
> > |
> > | RTM_NEWQDISC (B, RTNL-locked)
> > RCU 2 qdisc_graft(B)
> > | 1 notify_and_destroy(A)
> > |
> > tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> > qdisc_destroy(A) tcf_chain0_head_change(B)
> > tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> > mini_qdisc_pair_swap(A) (3rd) |
> > ... ...
>
> Looking at the code, I think there is no guarantee that (1st) cannot
> happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
> handlers get preempted?

Right, we need qdisc_graft(B) to update the appropriate dev pointer
to point to b1. With that the ordering should not matter. Probably
using the ->attach() callback?

> If (1st) happens later than (2nd), we will need to make (1st) no-op, by
> detecting that we are the "old" Qdisc. I am not sure there is any
> (clean) way to do it. I even thought about:
>
> (1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
> (2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping". If false, it
> means we are the "old" Qdisc (have been replaced), and should do
> nothing.
>
> However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
> egress one, so I can't container_of() during step (1) ...

And we can't be using multiple pieces of information to make
the decision since AFAIU mini_qdisc_pair_swap() can race with
qdisc_graft().

My thinking was to make sure that dev->miniq_* pointers always point
to one of the miniqs of the currently attached qdisc. Right now, on
a quick look, those pointers are not initialized during initial graft,
only when first filter is added, and may be cleared when filters are
removed. But I don't think that's strictly required, miniq with no
filters should be fine.

> Eventually I created [5,6/6]. It is a workaround indeed, in the sense
> that it changes sch_api.c to avoid a mini Qdisc issue. However I think it
> makes the code correct in a relatively understandable way,

What's your benchmark for being understandable?

> without slowing down mini_qdisc_pair_swap() or sch_handle_*gress().


2023-05-11 20:58:22

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> My thinking was to make sure that dev->miniq_* pointers always point
> to one of the miniqs of the currently attached qdisc. Right now, on
> a quick look, those pointers are not initialized during initial graft,
> only when first filter is added, and may be cleared when filters are
> removed. But I don't think that's strictly required, miniq with no
> filters should be fine.

Ah, I see, thanks for explaining, I didn't think of that. Looking at
sch_handle_ingress() BTW, currently mini Qdisc stats aren't being updated
(mini_qdisc_bstats_cpu_update()) if there's no filters, is this intended?
Should I keep this behavior by:

diff --git a/net/core/dev.c b/net/core/dev.c
index 735096d42c1d..9016481377e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5116,7 +5116,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
* that are not configured with an ingress qdisc will bail
* out here.
*/
- if (!miniq)
+ if (!miniq || !miniq->filter_list)
return skb;

if (*pt_prev) {

On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> > On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> > > Thread 1 A's refcnt Thread 2
> > > RTM_NEWQDISC (A, RTNL-locked)
> > > qdisc_create(A) 1
> > > qdisc_graft(A) 9
> > >
> > > RTM_NEWTFILTER (X, RTNL-lockless)
> > > __tcf_qdisc_find(A) 10
> > > tcf_chain0_head_change(A)
> > > mini_qdisc_pair_swap(A) (1st)
> > > |
> > > | RTM_NEWQDISC (B, RTNL-locked)
> > > RCU 2 qdisc_graft(B)
> > > | 1 notify_and_destroy(A)
> > > |
> > > tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> > > qdisc_destroy(A) tcf_chain0_head_change(B)
> > > tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> > > mini_qdisc_pair_swap(A) (3rd) |
> > > ... ...
> >
> > Looking at the code, I think there is no guarantee that (1st) cannot
> > happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
> > handlers get preempted?
>
> Right, we need qdisc_graft(B) to update the appropriate dev pointer
> to point to b1. With that the ordering should not matter. Probably
> using the ->attach() callback?

->attach() is later than dev_graft_qdisc(). We should get ready for
concurrent filter requests (i.e. have dev pointer pointing to b1) before
grafting (publishing) B. Also currently qdisc_graft() doesn't call
->attach() for ingress and clsact Qdiscs.

But I see your point, thanks for the suggestion! I'll try ->init() and
create v2.

Thanks,
Peilin Ye


2023-05-11 23:33:31

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Thu, 11 May 2023 13:40:10 -0700 Peilin Ye wrote:
> On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> > My thinking was to make sure that dev->miniq_* pointers always point
> > to one of the miniqs of the currently attached qdisc. Right now, on
> > a quick look, those pointers are not initialized during initial graft,
> > only when first filter is added, and may be cleared when filters are
> > removed. But I don't think that's strictly required, miniq with no
> > filters should be fine.
>
> Ah, I see, thanks for explaining, I didn't think of that. Looking at
> sch_handle_ingress() BTW, currently mini Qdisc stats aren't being updated
> (mini_qdisc_bstats_cpu_update()) if there's no filters, is this intended?
> Should I keep this behavior by:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 735096d42c1d..9016481377e0 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5116,7 +5116,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> * that are not configured with an ingress qdisc will bail
> * out here.
> */
> - if (!miniq)
> + if (!miniq || !miniq->filter_list)
> return skb;
>
> if (*pt_prev) {

Good question, maybe Jiri or Daniel can answer?

> On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> > On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> > > Looking at the code, I think there is no guarantee that (1st) cannot
> > > happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
> > > handlers get preempted?
> >
> > Right, we need qdisc_graft(B) to update the appropriate dev pointer
> > to point to b1. With that the ordering should not matter. Probably
> > using the ->attach() callback?
>
> ->attach() is later than dev_graft_qdisc(). We should get ready for
> concurrent filter requests (i.e. have dev pointer pointing to b1) before
> grafting (publishing) B.

I thought even for "unlocked" filter operations the start of it is
under the lock, but the lock gets dropped after qdisc/block are found.
I could be misremembering, I haven't looked at the code.

> Also currently qdisc_graft() doesn't call
> ->attach() for ingress and clsact Qdiscs.
>
> But I see your point, thanks for the suggestion! I'll try ->init() and
> create v2.

->init() may be too early, aren't there any error points which could
prevent the Qdisc from binding after ->init() was called?

2023-05-12 00:14:00

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > But I see your point, thanks for the suggestion! I'll try ->init() and
> > create v2.
>
> ->init() may be too early, aren't there any error points which could
> prevent the Qdisc from binding after ->init() was called?

You're right, it's in qdisc_create(), argh...

On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > > > Looking at the code, I think there is no guarantee that (1st) cannot
> > > > happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
> > > > handlers get preempted?
> > >
> > > Right, we need qdisc_graft(B) to update the appropriate dev pointer
> > > to point to b1. With that the ordering should not matter. Probably
> > > using the ->attach() callback?
> >
> > ->attach() is later than dev_graft_qdisc(). We should get ready for
> > concurrent filter requests (i.e. have dev pointer pointing to b1) before
> > grafting (publishing) B.
>
> I thought even for "unlocked" filter operations the start of it is
> under the lock, but the lock gets dropped after qdisc/block are found.
> I could be misremembering, I haven't looked at the code.

No, f.e. RTM_NEWTFILTER is registered as RTNL_FLAG_DOIT_UNLOCKED, so
tc_new_tfilter() starts and calls __tcf_qdisc_find() without RTNL mutex,
at least in latest code.

Thinking,
Peilin Ye


2023-05-12 00:14:43

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Thu, May 11, 2023 at 04:46:33PM -0700, Peilin Ye wrote:
> On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > > But I see your point, thanks for the suggestion! I'll try ->init() and
> > > create v2.
> >
> > ->init() may be too early, aren't there any error points which could
> > prevent the Qdisc from binding after ->init() was called?
>
> You're right, it's in qdisc_create(), argh...

->destroy() is called for all error points between ->init() and
dev_graft_qdisc(). I'll try handling it in ->destroy().

Thanks,
Peilin Ye


2023-05-16 00:00:22

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
> > > ->init() may be too early, aren't there any error points which could
> > > prevent the Qdisc from binding after ->init() was called?
> >
> > You're right, it's in qdisc_create(), argh...
>
> ->destroy() is called for all error points between ->init() and
> dev_graft_qdisc(). I'll try handling it in ->destroy().

Sorry for any confusion: there is no point at all undoing "setting dev
pointer to b1" in ->destroy() because datapath has already been affected.

To summarize, grafting B mustn't fail after setting dev pointer to b1, so
->init() is too early, because e.g. if user requested [1] to create a rate
estimator, gen_new_estimator() could fail after ->init() in
qdisc_create().

On the other hand, ->attach() is too late because it's later than
dev_graft_qdisc(), so concurrent filter requests might see uninitialized
dev pointer in theory.

Please suggest; is adding another callback (or calling ->attach()) right
before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
fix?

[1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact

Thanks,
Peilin Ye


2023-05-16 19:37:03

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Mon, 15 May 2023 15:45:15 -0700 Peilin Ye wrote:
> On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
> > > You're right, it's in qdisc_create(), argh...
> >
> > ->destroy() is called for all error points between ->init() and
> > dev_graft_qdisc(). I'll try handling it in ->destroy().
>
> Sorry for any confusion: there is no point at all undoing "setting dev
> pointer to b1" in ->destroy() because datapath has already been affected.
>
> To summarize, grafting B mustn't fail after setting dev pointer to b1, so
> ->init() is too early, because e.g. if user requested [1] to create a rate
> estimator, gen_new_estimator() could fail after ->init() in
> qdisc_create().
>
> On the other hand, ->attach() is too late because it's later than
> dev_graft_qdisc(), so concurrent filter requests might see uninitialized
> dev pointer in theory.
>
> Please suggest; is adding another callback (or calling ->attach()) right
> before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
> fix?
>
> [1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact

Vlad, could you please clarify how you expect the unlocked filter
operations to work when the qdisc has global state?

2023-05-16 20:04:07

by Vlad Buslov

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Tue 16 May 2023 at 12:22, Jakub Kicinski <[email protected]> wrote:
> On Mon, 15 May 2023 15:45:15 -0700 Peilin Ye wrote:
>> On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
>> > > You're right, it's in qdisc_create(), argh...
>> >
>> > ->destroy() is called for all error points between ->init() and
>> > dev_graft_qdisc(). I'll try handling it in ->destroy().
>>
>> Sorry for any confusion: there is no point at all undoing "setting dev
>> pointer to b1" in ->destroy() because datapath has already been affected.
>>
>> To summarize, grafting B mustn't fail after setting dev pointer to b1, so
>> ->init() is too early, because e.g. if user requested [1] to create a rate
>> estimator, gen_new_estimator() could fail after ->init() in
>> qdisc_create().
>>
>> On the other hand, ->attach() is too late because it's later than
>> dev_graft_qdisc(), so concurrent filter requests might see uninitialized
>> dev pointer in theory.
>>
>> Please suggest; is adding another callback (or calling ->attach()) right
>> before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
>> fix?
>>
>> [1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact
>
> Vlad, could you please clarify how you expect the unlocked filter
> operations to work when the qdisc has global state?

Jakub, I didn't account for per-net_device pointer usage by miniqp code
hence this bug. I didn't comment on the fix because I was away from my
PC last week but Peilin's approach seems reasonable to me. When Peilin
brought up the issue initially I also tried to come up with some trick
to contain the changes to miniqp code instead of changing core but
couldn't think of anything workable due to the limitations already
discussed in this thread. I'm open to explore alternative approaches to
solving this issue, if that is what you suggest.


2023-05-16 22:09:59

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Tue, 16 May 2023 22:35:51 +0300 Vlad Buslov wrote:
> > Vlad, could you please clarify how you expect the unlocked filter
> > operations to work when the qdisc has global state?
>
> Jakub, I didn't account for per-net_device pointer usage by miniqp code
> hence this bug. I didn't comment on the fix because I was away from my
> PC last week but Peilin's approach seems reasonable to me. When Peilin
> brought up the issue initially I also tried to come up with some trick
> to contain the changes to miniqp code instead of changing core but
> couldn't think of anything workable due to the limitations already
> discussed in this thread. I'm open to explore alternative approaches to
> solving this issue, if that is what you suggest.

Given Peilin's investigation I think fix without changing core may
indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
is elevated will be appreciated by the users, do we already have
similar behavior in other parts of TC?

2023-05-16 23:04:06

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Tue, May 16, 2023 at 02:50:10PM -0700, Jakub Kicinski wrote:
> > > Vlad, could you please clarify how you expect the unlocked filter
> > > operations to work when the qdisc has global state?
> >
> > Jakub, I didn't account for per-net_device pointer usage by miniqp code
> > hence this bug. I didn't comment on the fix because I was away from my
> > PC last week but Peilin's approach seems reasonable to me. When Peilin
> > brought up the issue initially I also tried to come up with some trick
> > to contain the changes to miniqp code instead of changing core but
> > couldn't think of anything workable due to the limitations already
> > discussed in this thread. I'm open to explore alternative approaches to
> > solving this issue, if that is what you suggest.
>
> Given Peilin's investigation I think fix without changing core may
> indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
> is elevated will be appreciated by the users, do we already have
> similar behavior in other parts of TC?

Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY:

net/sched/cls_u32.c:

if (ht->refcnt == 1) {
u32_destroy_hnode(tp, ht, extack);
} else {
NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");
return -EBUSY;
}

Thanks,
Peilin Ye


2023-05-17 01:16:23

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
> > Given Peilin's investigation I think fix without changing core may
> > indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
> > is elevated will be appreciated by the users, do we already have
> > similar behavior in other parts of TC?
>
> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY

I meant -EBUSY due to a race (another operation being in flight).
I think that's different.

2023-05-17 09:22:01

by Vlad Buslov

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Tue 16 May 2023 at 17:39, Jakub Kicinski <[email protected]> wrote:
> On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
>> > Given Peilin's investigation I think fix without changing core may
>> > indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
>> > is elevated will be appreciated by the users, do we already have
>> > similar behavior in other parts of TC?
>>
>> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY
>
> I meant -EBUSY due to a race (another operation being in flight).
> I think that's different.

I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
functionality instead of returning error to the user would be a better
approach? Currently the function is replayed when qdisc_create() returns
EAGAIN. It should be trivial to do the same for qdisc_graft() result.


2023-05-17 18:51:25

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Wed, 17 May 2023 11:49:10 +0300 Vlad Buslov wrote:
> On Tue 16 May 2023 at 17:39, Jakub Kicinski <[email protected]> wrote:
> > On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
> >>
> >> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY
> >
> > I meant -EBUSY due to a race (another operation being in flight).
> > I think that's different.
>
> I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
> functionality instead of returning error to the user would be a better
> approach? Currently the function is replayed when qdisc_create() returns
> EAGAIN. It should be trivial to do the same for qdisc_graft() result.

Sounds better than returning -EBUSY to the user and expecting them
to retry, yes.

2023-05-17 18:51:27

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> } else {
> - dev_queue = dev_ingress_queue(dev);
> - old = dev_graft_qdisc(dev_queue, new);
> + old = dev_graft_qdisc(dev_queue, NULL);
> +
> + /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> + * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> + * pointer(s) in mini_qdisc_pair_swap().
> + */
> + qdisc_notify(net, skb, n, classid, old, new, extack);
> + qdisc_destroy(old);
> +
> + dev_graft_qdisc(dev_queue, new);

BTW can't @old be NULL here?

2023-05-17 21:49:50

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Wed, May 17, 2023 at 11:48:25AM -0700, Jakub Kicinski wrote:
> > } else {
> > - dev_queue = dev_ingress_queue(dev);
> > - old = dev_graft_qdisc(dev_queue, new);
> > + old = dev_graft_qdisc(dev_queue, NULL);
> > +
> > + /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> > + * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> > + * pointer(s) in mini_qdisc_pair_swap().
> > + */
> > + qdisc_notify(net, skb, n, classid, old, new, extack);
> > + qdisc_destroy(old);
> > +
> > + dev_graft_qdisc(dev_queue, new);
>
> BTW can't @old be NULL here?

ingress_queue->qdisc_sleeping is initialized to &noop_qdisc (placeholder)
in dev_ingress_queue_create(), and dev_graft_qdisc() also grafts
&noop_qdisc to represent "there's no Qdisc":

/* ... and graft new one */
if (qdisc == NULL)
qdisc = &noop_qdisc;
dev_queue->qdisc_sleeping = qdisc;

So @old can't be NULL here.

Thanks,
Peilin Ye


2023-05-17 22:27:11

by Peilin Ye

[permalink] [raw]
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Wed, May 17, 2023 at 11:48:05AM -0700, Jakub Kicinski wrote:
> On Wed, 17 May 2023 11:49:10 +0300 Vlad Buslov wrote:
> > I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
> > functionality instead of returning error to the user would be a better
> > approach? Currently the function is replayed when qdisc_create() returns
> > EAGAIN. It should be trivial to do the same for qdisc_graft() result.
>
> Sounds better than returning -EBUSY to the user and expecting them
> to retry, yes.

Thanks for the suggestion, Vlad! I'll try this in tc_modify_qdisc() and
tc_get_qdisc() (for Qdisc deletion) in v2.

Thanks,
Peilin Ye


2023-05-24 14:43:40

by Pedro Tammela

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

#syz test: git://gitlab.com/tammela/net.git peilin-patches

Double checking with syzbot

2023-05-24 15:10:11

by syzbot

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

Hello,

syzbot tried to test the proposed patch but the build/boot failed:

failed to checkout kernel repo git://gitlab.com/tammela/net.git/peilin-patches: failed to run ["git" "fetch" "--force" "71d757925c19d8f23c660d1e07af98f28b9c6977" "peilin-patches"]: exit status 128
fatal: read error: Connection reset by peer



Tested on:

commit: [unknown
git tree: git://gitlab.com/tammela/net.git peilin-patches
dashboard link: https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b
compiler:

Note: no patches were applied.

2023-05-24 15:18:49

by Pedro Tammela

[permalink] [raw]

2023-05-24 15:53:06

by syzbot

[permalink] [raw]
Subject: Re: [syzbot] [net?] KASAN: slab-use-after-free Write in mini_qdisc_pair_swap

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: [email protected]

Tested on:

commit: 6078d01d net/sched: qdisc_destroy() old ingress and cl..
git tree: https://gitlab.com/tammela/net.git peilin-patches
console output: https://syzkaller.appspot.com/x/log.txt?x=111cf24d280000
kernel config: https://syzkaller.appspot.com/x/.config?x=b22b5699e8595bcd
dashboard link: https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

Note: no patches were applied.
Note: testing is done by a robot and is best-effort only.