2023-10-27 09:09:06

by Hao Sun

[permalink] [raw]
Subject: bpf: incorrect passing infinate loop causing rcu detected stall during bpf_prog_run()

Hi,

The following C repro contains a bpf program that can cause rcu
stall/soft lockup during running in bpf_prog_run(). Seems the verifier
incorrectly passed the program with an infinite loop.

C repro: https://pastebin.com/raw/ymzAxjeU
Verifier's log: https://pastebin.com/raw/thZDTFJc

rcu stall:

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 7-....: (10500 ticks this GP) idle=c144/1/0x4000000000000000
softirq=6017/6017 fqs=4579
rcu: hardirqs softirqs csw/system
rcu: number: 0 212 0
rcu: cputime: 0 0 52479 ==> 52480(ms)
rcu: (t=10501 jiffies g=8277 q=132 ncpus=8)
CPU: 7 PID: 8633 Comm: bpf-test Not tainted
6.6.0-rc5-01400-g7c2f6c9fb91f-dirty #25
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:___bpf_prog_run+0x4cbf/0x9720 kernel/bpf/core.c:2099
Code: 80 3c 38 00 0f 85 c0 44 00 00 41 0f b6 44 24 01 4c 8b 2b c0 e8
04 0f b6 c0 48 8d 5c c5 00 48 89 d8 48 c1 e8 03 42 80 3c 38 00 <0f> 85
8d 44 00 00 4c 3b 2b 0f 86 16 0f 00 00 49 8d 7c 24 02 48 89
RSP: 0018:ffffc90006eb7a58 EFLAGS: 00000246
RAX: 1ffff92000dd6f70 RBX: ffffc90006eb7b80 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 00000000000014ff RDI: ffffc900052160d1
RBP: ffffc90006eb7b48 R08: ffffc90005216144 R09: fffffbfff228f9d0
R10: ffffffff9147ce87 R11: 0000000000088001 R12: ffffc900052160d0
R13: 000000000000001f R14: 0000000000000000 R15: dffffc0000000000
FS: 00007fcc4dc606c0(0000) GS:ffff88832db80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcc4dcb6000 CR3: 0000000157724000 CR4: 00000000000006e0
Call Trace:
<IRQ>
</IRQ>
<TASK>
__bpf_prog_run32+0x8d/0xd0 kernel/bpf/core.c:2264
bpf_dispatcher_nop_func include/linux/bpf.h:1192 [inline]
__bpf_prog_run include/linux/filter.h:651 [inline]
bpf_prog_run include/linux/filter.h:658 [inline]
__bpf_prog_test_run_raw_tp+0xc5/0x2c0 net/bpf/test_run.c:712
bpf_prog_test_run_raw_tp+0x304/0x560 net/bpf/test_run.c:752
bpf_prog_put kernel/bpf/syscall.c:2165 [inline]
bpf_prog_test_run kernel/bpf/syscall.c:4042 [inline]
__sys_bpf+0xf98/0x4380 kernel/bpf/syscall.c:5401
__do_sys_bpf kernel/bpf/syscall.c:5487 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5485 [inline]
__x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:5485
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x564499114a87
Code: 81 f9 01 02 00 00 72 03 49 8b 12 48 89 54 24 28 89 44 24 1c 48
8d 74 24 10 b8 41 01 00 00 bf 0a 00 00 00 ba 50 00 00 00 0f 05 <48> 3d
01 f0 ff ff 0f 83 8a 00 00 00 8b 7c 24 3c 49 8b 88 18 08 00
RSP: 002b:00007fcc4dc594d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007fcc4dc59860 RCX: 0000564499114a87
RDX: 0000000000000050 RSI: 00007fcc4dc594e0 RDI: 000000000000000a
RBP: 00007fcc48002d00 R08: 00005644993d4e10 R09: 00005644993d5428
R10: 00005644993d5018 R11: 0000000000000246 R12: 00007fcc4dc5e458
R13: 00007fcc4dc5bb40 R14: 00007fcc4fd1cbb0 R15: 00007fcc4dc5bc28
</TASK>
watchdog: BUG: soft lockup - CPU#7 stuck for 246s! [bpf-test:8633]
Modules linked in:
irq event stamp: 64410
hardirqs last enabled at (64409): [<ffffffff8960140a>]
asm_sysvec_apic_timer_interrupt+0x1a/0x20
arch/x86/include/asm/idtentry.h:645
hardirqs last disabled at (64410): [<ffffffff89455f6f>]
sysvec_apic_timer_interrupt+0xf/0xc0 arch/x86/kernel/apic/apic.c:1074
softirqs last enabled at (64378): [<ffffffff8145aa47>] invoke_softirq
kernel/softirq.c:427 [inline]
softirqs last enabled at (64378): [<ffffffff8145aa47>] __irq_exit_rcu
kernel/softirq.c:632 [inline]
softirqs last enabled at (64378): [<ffffffff8145aa47>]
irq_exit_rcu+0xb7/0x120 kernel/softirq.c:644
softirqs last disabled at (64373): [<ffffffff8145aa47>] invoke_softirq
kernel/softirq.c:427 [inline]
softirqs last disabled at (64373): [<ffffffff8145aa47>] __irq_exit_rcu
kernel/softirq.c:632 [inline]
softirqs last disabled at (64373): [<ffffffff8145aa47>]
irq_exit_rcu+0xb7/0x120 kernel/softirq.c:644
CPU: 7 PID: 8633 Comm: bpf-test Not tainted
6.6.0-rc5-01400-g7c2f6c9fb91f-dirty #25
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:kcov_trace_bpf_prog_regs+0x68/0xc0 kernel/kcov.c:248
Code: 83 f8 04 75 5c 4c 8b a2 e8 15 00 00 8b 92 e4 15 00 00 49 8b 04
24 48 c1 e2 03 48 8d 34 80 48 8d 4c 36 0b 48 c1 e1 03 48 39 ca <72> 35
48 83 c0 01 49 89 04 24 48 c1 e6 04 49 89 7c 0c b0 31 db 49
RSP: 0018:ffffc90006eb7a38 EFLAGS: 00000297
RAX: 0000000000000433 RBX: ffffc90006eb7b48 RCX: 0000000000015048
RDX: 0000000000015000 RSI: 00000000000014ff RDI: 000000000000001c
RBP: ffffc90006eb7b48 R08: ffffc90005216144 R09: fffffbfff228f9d0
R10: ffffffff9147ce87 R11: 0000000000088001 R12: ffffc90006c39000
R13: ffff1102f32b8600 R14: 0000000000000000 R15: dffffc0000000000
FS: 00007fcc4dc606c0(0000) GS:ffff88832db80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcc4dcb6000 CR3: 0000000157724000 CR4: 00000000000006e0
Call Trace:
<IRQ>
</IRQ>
<TASK>
___bpf_prog_run+0x53d9/0x9720 kernel/bpf/core.c:2100
__bpf_prog_run32+0x8d/0xd0 kernel/bpf/core.c:2264
bpf_dispatcher_nop_func include/linux/bpf.h:1192 [inline]
__bpf_prog_run include/linux/filter.h:651 [inline]
bpf_prog_run include/linux/filter.h:658 [inline]
__bpf_prog_test_run_raw_tp+0xc5/0x2c0 net/bpf/test_run.c:712
bpf_prog_test_run_raw_tp+0x304/0x560 net/bpf/test_run.c:752
bpf_prog_put kernel/bpf/syscall.c:2165 [inline]
bpf_prog_test_run kernel/bpf/syscall.c:4042 [inline]
__sys_bpf+0xf98/0x4380 kernel/bpf/syscall.c:5401
__do_sys_bpf kernel/bpf/syscall.c:5487 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5485 [inline]
__x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:5485
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x564499114a87
Code: 81 f9 01 02 00 00 72 03 49 8b 12 48 89 54 24 28 89 44 24 1c 48
8d 74 24 10 b8 41 01 00 00 bf 0a 00 00 00 ba 50 00 00 00 0f 05 <48> 3d
01 f0 ff ff 0f 83 8a 00 00 00 8b 7c 24 3c 49 8b 88 18 08 00
RSP: 002b:00007fcc4dc594d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007fcc4dc59860 RCX: 0000564499114a87
RDX: 0000000000000050 RSI: 00007fcc4dc594e0 RDI: 000000000000000a
RBP: 00007fcc48002d00 R08: 00005644993d4e10 R09: 00005644993d5428
R10: 00005644993d5018 R11: 0000000000000246 R12: 00007fcc4dc5e458
R13: 00007fcc4dc5bb40 R14: 00007fcc4fd1cbb0 R15: 00007fcc4dc5bc28
</TASK>


2023-10-29 01:10:12

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: bpf: incorrect passing infinate loop causing rcu detected stall during bpf_prog_run()

On Fri, Oct 27, 2023 at 2:09 AM Hao Sun <[email protected]> wrote:
>
> Hi,
>
> The following C repro contains a bpf program that can cause rcu
> stall/soft lockup during running in bpf_prog_run(). Seems the verifier
> incorrectly passed the program with an infinite loop.
>
> C repro: https://pastebin.com/raw/ymzAxjeU

Thanks for the report.
Did you debug what exactly caused this bug?
Are you planning to work on the fix?

> Verifier's log: https://pastebin.com/raw/thZDTFJc

log is trimmed.

2023-10-30 10:30:43

by Hao Sun

[permalink] [raw]
Subject: Re: bpf: incorrect passing infinate loop causing rcu detected stall during bpf_prog_run()

On Sun, Oct 29, 2023 at 2:35 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Fri, Oct 27, 2023 at 2:09 AM Hao Sun <[email protected]> wrote:
> >
> > Hi,
> >
> > The following C repro contains a bpf program that can cause rcu
> > stall/soft lockup during running in bpf_prog_run(). Seems the verifier
> > incorrectly passed the program with an infinite loop.
> >
> > C repro: https://pastebin.com/raw/ymzAxjeU
>
> Thanks for the report.
> Did you debug what exactly caused this bug?
> Are you planning to work on the fix?

This bug is really hard to debug. Here is a simplified view of
the original program:

loop:
0: r4 = r8
1: r1 = 0x1f
2: r8 -= -8
3: if r1 > r7 goto pc+1
4: r7 <<= r1 ; LSH r7 by 31
5: r5 = r0
6: r5 *= 2
7: if r5 < r0 goto pc+1
8: r8 s>>= 6
9: w7 &= w7 ; r7 = 0 after the first iter
10: r8 -= r7
11: r8 -= -1
12: if r4 >= 0x9 goto loop
13: exit

At runtime, r7 is updated to 0 through #4 and #9 at the first iteration,
so the following iteration will not take #3 to #4, so #3 can be ignored
after the first iteration. r0 is init by get_current_task, and r5 is always
smaller than r0 at runtime, so #7 to #8 will never run. So, the update
to r8 is only #2 and #11, which together add 9 to r8. Since r4 is set
to r8 at the start of each iteration, so it's an infinite loop at runtime.

Based on the log, the verifier keeps tracking #7 to #8 and to #9, and
at some point, the verifier prunes states and path from #7 to #9, so
it stops checking. The log is huge and hard to follow, the issue is likely
in pruning logic, but I don't have much knowledge about that part.

>
> > Verifier's log: https://pastebin.com/raw/thZDTFJc
>
> log is trimmed.

Full log: https://pastebin.com/raw/cTC8wmDH

2023-10-30 18:16:57

by Eduard Zingerman

[permalink] [raw]
Subject: Re: bpf: incorrect passing infinate loop causing rcu detected stall during bpf_prog_run()

On Mon, 2023-10-30 at 11:29 +0100, Hao Sun wrote:
> On Sun, Oct 29, 2023 at 2:35 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Fri, Oct 27, 2023 at 2:09 AM Hao Sun <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > The following C repro contains a bpf program that can cause rcu
> > > stall/soft lockup during running in bpf_prog_run(). Seems the verifier
> > > incorrectly passed the program with an infinite loop.
> > >
> > > C repro: https://pastebin.com/raw/ymzAxjeU
> >
> > Thanks for the report.
> > Did you debug what exactly caused this bug?
> > Are you planning to work on the fix?
>
> This bug is really hard to debug. Here is a simplified view of
> the original program:
>
> loop:
> 0: r4 = r8
> 1: r1 = 0x1f
> 2: r8 -= -8
> 3: if r1 > r7 goto pc+1
> 4: r7 <<= r1 ; LSH r7 by 31
> 5: r5 = r0
> 6: r5 *= 2
> 7: if r5 < r0 goto pc+1
> 8: r8 s>>= 6
> 9: w7 &= w7 ; r7 = 0 after the first iter
> 10: r8 -= r7
> 11: r8 -= -1
> 12: if r4 >= 0x9 goto loop
> 13: exit
>
> At runtime, r7 is updated to 0 through #4 and #9 at the first iteration,
> so the following iteration will not take #3 to #4, so #3 can be ignored
> after the first iteration. r0 is init by get_current_task, and r5 is always
> smaller than r0 at runtime, so #7 to #8 will never run. So, the update
> to r8 is only #2 and #11, which together add 9 to r8. Since r4 is set
> to r8 at the start of each iteration, so it's an infinite loop at runtime.
>
> Based on the log, the verifier keeps tracking #7 to #8 and to #9, and
> at some point, the verifier prunes states and path from #7 to #9, so
> it stops checking. The log is huge and hard to follow, the issue is likely
> in pruning logic, but I don't have much knowledge about that part.

I can take a look at this issue but closer to the end of the week (Thu/Fri).

>
> >
> > > Verifier's log: https://pastebin.com/raw/thZDTFJc
> >
> > log is trimmed.
>
> Full log: https://pastebin.com/raw/cTC8wmDH
>