LinuxLists.cc - Getting empty callchain from perf_callchain

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, May 16, 2019 at 11:51:55PM +0000, Song Liu wrote:
> Hi,
>
> We found a failure with selftests/bpf/tests_prog in test_stacktrace_map (on bpf/master
> branch).
>
> After digging into the code, we found that perf_callchain_kernel() is giving empty
> callchain for tracepoint sched/sched_switch. And it seems related to commit
>
> d15d356887e770c5f2dcf963b52c7cb510c9e42d
> ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
>
> Before this commit, perf_callchain_kernel() returns callchain with regs->ip. With
> this commit, regs->ip is not sent for !perf_hw_regs(regs) case.

So while I think the below is indeed right; we should store regs->ip
regardless of the unwind path chosen, I still think there's something
fishy if this results in just the 1 entry.

The sched/sched_switch event really should have a non-trivial stack.

Let me see if I can reproduce with just perf.

2019-05-17 08:43:51

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 11:51:55PM +0000, Song Liu wrote:
> > Hi,
> >
> > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map (on bpf/master
> > branch).
> >
> > After digging into the code, we found that perf_callchain_kernel() is giving empty
> > callchain for tracepoint sched/sched_switch. And it seems related to commit
> >
> > d15d356887e770c5f2dcf963b52c7cb510c9e42d
> > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
> >
> > Before this commit, perf_callchain_kernel() returns callchain with regs->ip. With
> > this commit, regs->ip is not sent for !perf_hw_regs(regs) case.
>
> So while I think the below is indeed right; we should store regs->ip
> regardless of the unwind path chosen, I still think there's something
> fishy if this results in just the 1 entry.
>
> The sched/sched_switch event really should have a non-trivial stack.
>
> Let me see if I can reproduce with just perf.

$ perf record -g -e "sched:sched_switch" -- make clean
$ perf report -D

12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0xffffffff81c29562 period: 1 addr: 0
... FP chain: nr:10
..... 0: ffffffffffffff80
..... 1: ffffffff81c29562
..... 2: ffffffff81c29933
..... 3: ffffffff8111f688
..... 4: ffffffff81120b9d
..... 5: ffffffff81120ce5
..... 6: ffffffff8100254a
..... 7: ffffffff81e0007d
..... 8: fffffffffffffe00
..... 9: 00007f9b6cd9682a
... thread: sh:7236
...... dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux

IOW, it seems to 'work'.

2019-05-17 08:44:52

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
> > On Thu, May 16, 2019 at 11:51:55PM +0000, Song Liu wrote:
> > > Hi,
> > >
> > > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map (on bpf/master
> > > branch).
> > >
> > > After digging into the code, we found that perf_callchain_kernel() is giving empty
> > > callchain for tracepoint sched/sched_switch. And it seems related to commit
> > >
> > > d15d356887e770c5f2dcf963b52c7cb510c9e42d
> > > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
> > >
> > > Before this commit, perf_callchain_kernel() returns callchain with regs->ip. With
> > > this commit, regs->ip is not sent for !perf_hw_regs(regs) case.
> >
> > So while I think the below is indeed right; we should store regs->ip
> > regardless of the unwind path chosen, I still think there's something
> > fishy if this results in just the 1 entry.
> >
> > The sched/sched_switch event really should have a non-trivial stack.
> >
> > Let me see if I can reproduce with just perf.
>
> $ perf record -g -e "sched:sched_switch" -- make clean
> $ perf report -D
>
> 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0xffffffff81c29562 period: 1 addr: 0
> ... FP chain: nr:10
> ..... 0: ffffffffffffff80
> ..... 1: ffffffff81c29562
> ..... 2: ffffffff81c29933
> ..... 3: ffffffff8111f688
> ..... 4: ffffffff81120b9d
> ..... 5: ffffffff81120ce5
> ..... 6: ffffffff8100254a
> ..... 7: ffffffff81e0007d
> ..... 8: fffffffffffffe00
> ..... 9: 00007f9b6cd9682a
> ... thread: sh:7236
> ...... dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
>
>
> IOW, it seems to 'work'.
>

Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
some other bfp functions) is now broken, or, strating an unwind
directly inside a bpf program will end up strangely. It have following
kernel message:

WARNING: kernel stack frame pointer at 0000000070cad07c in
test_progs:1242 has bad value 00000000ffd4497e

And in the debug message:

[ 160.460287] 000000006e117175: ffffffffaa23a0e8
(get_perf_callchain+0x148/0x280)
[ 160.460287] 0000000002e8715f: 0000000000c6bba0 (0xc6bba0)
[ 160.460288] 00000000b3d07758: ffff9ce3f9790000 (0xffff9ce3f9790000)
[ 160.460289] 0000000055c97836: ffff9ce3f9790000 (0xffff9ce3f9790000)
[ 160.460289] 000000007cbb140a: 000000010000007f (0x10000007f)
[ 160.460290] 000000007dc62ac9: 0000000000000000 ...
[ 160.460290] 000000006b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000)
[ 160.460291] 00000000f23d1826: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
[ 160.460292] 00000000f9a16017: 000000000000007f (0x7f)
[ 160.460292] 00000000a8e62d44: 0000000000000000 ...
[ 160.460293] 00000000cbc83c97: ffffb89d00d8d000 (0xffffb89d00d8d000)
[ 160.460293] 0000000092842522: 000000000000007f (0x7f)
[ 160.460294] 00000000dafbec89: ffffb89d00c6bc50 (0xffffb89d00c6bc50)
[ 160.460296] 000000000777e4cf: ffffffffaa225d97 (bpf_get_stackid+0x77/0x470)
[ 160.460296] 000000009942ea16: 0000000000000000 ...
[ 160.460297] 00000000a08006b1: 0000000000000001 (0x1)
[ 160.460298] 000000009f03b438: ffffb89d00c6bc30 (0xffffb89d00c6bc30)
[ 160.460299] 000000006caf8b32: ffffffffaa074fe8 (__do_page_fault+0x58/0x90)
[ 160.460300] 000000003a13d702: 0000000000000000 ...
[ 160.460300] 00000000e2e2496d: ffff9ce300000000 (0xffff9ce300000000)
[ 160.460301] 000000008ee6b7c2: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
[ 160.460301] 00000000a8cf6d55: 0000000000000000 ...
[ 160.460302] 0000000059078076: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
[ 160.460303] 00000000c6aac739: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
[ 160.460303] 00000000a39aff92: ffffb89d00c6bc60 (0xffffb89d00c6bc60)
[ 160.460305] 0000000097498bf2: ffffffffaa1f4791 (bpf_get_stackid_tp+0x11/0x20)
[ 160.460306] 000000006992de1e: ffffb89d00c6bc78 (0xffffb89d00c6bc78)
[ 160.460307] 00000000dacd0ce5: ffffffffc0405676 (0xffffffffc0405676)
[ 160.460307] 00000000a81f2714: 0000000000000000 ...

# Note here is the invalid frame pointer
[ 160.460308] 0000000070cad07c: ffffb89d00a1d000 (0xffffb89d00a1d000)
[ 160.460308] 00000000f8f194e4: 0000000000000001 (0x1)
[ 160.460309] 000000002134f42a: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
[ 160.460310] 00000000f9345889: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
[ 160.460310] 000000008ad22a42: 0000000000000000 ...
[ 160.460311] 0000000073808173: ffffb89d00c6bce0 (0xffffb89d00c6bce0)
[ 160.460312] 00000000c9effff4: ffffffffaa1f48d1 (trace_call_bpf+0x81/0x100)
[ 160.460313] 00000000c5d8ebd1: ffffb89d00c6bcc0 (0xffffb89d00c6bcc0)
[ 160.460315] 00000000bce0b072: ffffffffab651be0
(event_sched_migrate_task+0xa0/0xa0)
[ 160.460316] 00000000355cf319: 0000000000000000 ...
[ 160.460316] 000000003b67f2ad: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
[ 160.460316] 000000009a77e20b: ffff9ce3fba25000 (0xffff9ce3fba25000)
[ 160.460317] 0000000032cf7376: 0000000000000001 (0x1)
[ 160.460317] 000000000051db74: ffffb89d00c6bd20 (0xffffb89d00c6bd20)
[ 160.460318] 0000000040eb3bf7: ffffffffaa22be81
(perf_trace_run_bpf_submit+0x41/0xb0)

Simply store the IP still won't really fix the problem, it just passed
the test. Just had a try to have bpf functions set the
X86_EFLAGS_FIXED for the flags and always dump the bp, it bypassed
this specified problem.

Using frame pointer unwinder for testing this, and seems ORC is fine with this.

--
Best Regards,
Kairui Song

2019-05-17 09:43:52

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> some other bfp functions) is now broken, or, strating an unwind
> directly inside a bpf program will end up strangely. It have following
> kernel message:

Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
follow.

2019-05-17 09:59:21

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 17, 2019 at 4:15 PM Kairui Song <[email protected]> wrote:
>
> On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 16, 2019 at 11:51:55PM +0000, Song Liu wrote:
> > > > Hi,
> > > >
> > > > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map (on bpf/master
> > > > branch).
> > > >
> > > > After digging into the code, we found that perf_callchain_kernel() is giving empty
> > > > callchain for tracepoint sched/sched_switch. And it seems related to commit
> > > >
> > > > d15d356887e770c5f2dcf963b52c7cb510c9e42d
> > > > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
> > > >
> > > > Before this commit, perf_callchain_kernel() returns callchain with regs->ip. With
> > > > this commit, regs->ip is not sent for !perf_hw_regs(regs) case.
> > >
> > > So while I think the below is indeed right; we should store regs->ip
> > > regardless of the unwind path chosen, I still think there's something
> > > fishy if this results in just the 1 entry.
> > >
> > > The sched/sched_switch event really should have a non-trivial stack.
> > >
> > > Let me see if I can reproduce with just perf.
> >
> > $ perf record -g -e "sched:sched_switch" -- make clean
> > $ perf report -D
> >
> > 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0xffffffff81c29562 period: 1 addr: 0
> > ... FP chain: nr:10
> > ..... 0: ffffffffffffff80
> > ..... 1: ffffffff81c29562
> > ..... 2: ffffffff81c29933
> > ..... 3: ffffffff8111f688
> > ..... 4: ffffffff81120b9d
> > ..... 5: ffffffff81120ce5
> > ..... 6: ffffffff8100254a
> > ..... 7: ffffffff81e0007d
> > ..... 8: fffffffffffffe00
> > ..... 9: 00007f9b6cd9682a
> > ... thread: sh:7236
> > ...... dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
> >
> >
> > IOW, it seems to 'work'.
> >
>
> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> some other bfp functions) is now broken, or, strating an unwind
> directly inside a bpf program will end up strangely. It have following
> kernel message:
>
> WARNING: kernel stack frame pointer at 0000000070cad07c in
> test_progs:1242 has bad value 00000000ffd4497e
>
> And in the debug message:
>
> [ 160.460287] 000000006e117175: ffffffffaa23a0e8
> (get_perf_callchain+0x148/0x280)
> [ 160.460287] 0000000002e8715f: 0000000000c6bba0 (0xc6bba0)
> [ 160.460288] 00000000b3d07758: ffff9ce3f9790000 (0xffff9ce3f9790000)
> [ 160.460289] 0000000055c97836: ffff9ce3f9790000 (0xffff9ce3f9790000)
> [ 160.460289] 000000007cbb140a: 000000010000007f (0x10000007f)
> [ 160.460290] 000000007dc62ac9: 0000000000000000 ...
> [ 160.460290] 000000006b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000)
> [ 160.460291] 00000000f23d1826: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460292] 00000000f9a16017: 000000000000007f (0x7f)
> [ 160.460292] 00000000a8e62d44: 0000000000000000 ...
> [ 160.460293] 00000000cbc83c97: ffffb89d00d8d000 (0xffffb89d00d8d000)
> [ 160.460293] 0000000092842522: 000000000000007f (0x7f)
> [ 160.460294] 00000000dafbec89: ffffb89d00c6bc50 (0xffffb89d00c6bc50)
> [ 160.460296] 000000000777e4cf: ffffffffaa225d97 (bpf_get_stackid+0x77/0x470)
> [ 160.460296] 000000009942ea16: 0000000000000000 ...
> [ 160.460297] 00000000a08006b1: 0000000000000001 (0x1)
> [ 160.460298] 000000009f03b438: ffffb89d00c6bc30 (0xffffb89d00c6bc30)
> [ 160.460299] 000000006caf8b32: ffffffffaa074fe8 (__do_page_fault+0x58/0x90)
> [ 160.460300] 000000003a13d702: 0000000000000000 ...
> [ 160.460300] 00000000e2e2496d: ffff9ce300000000 (0xffff9ce300000000)
> [ 160.460301] 000000008ee6b7c2: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460301] 00000000a8cf6d55: 0000000000000000 ...
> [ 160.460302] 0000000059078076: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460303] 00000000c6aac739: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
> [ 160.460303] 00000000a39aff92: ffffb89d00c6bc60 (0xffffb89d00c6bc60)
> [ 160.460305] 0000000097498bf2: ffffffffaa1f4791 (bpf_get_stackid_tp+0x11/0x20)
> [ 160.460306] 000000006992de1e: ffffb89d00c6bc78 (0xffffb89d00c6bc78)
> [ 160.460307] 00000000dacd0ce5: ffffffffc0405676 (0xffffffffc0405676)
> [ 160.460307] 00000000a81f2714: 0000000000000000 ...
>
> # Note here is the invalid frame pointer
> [ 160.460308] 0000000070cad07c: ffffb89d00a1d000 (0xffffb89d00a1d000)
> [ 160.460308] 00000000f8f194e4: 0000000000000001 (0x1)
> [ 160.460309] 000000002134f42a: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460310] 00000000f9345889: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
> [ 160.460310] 000000008ad22a42: 0000000000000000 ...
> [ 160.460311] 0000000073808173: ffffb89d00c6bce0 (0xffffb89d00c6bce0)
> [ 160.460312] 00000000c9effff4: ffffffffaa1f48d1 (trace_call_bpf+0x81/0x100)
> [ 160.460313] 00000000c5d8ebd1: ffffb89d00c6bcc0 (0xffffb89d00c6bcc0)
> [ 160.460315] 00000000bce0b072: ffffffffab651be0
> (event_sched_migrate_task+0xa0/0xa0)
> [ 160.460316] 00000000355cf319: 0000000000000000 ...
> [ 160.460316] 000000003b67f2ad: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460316] 000000009a77e20b: ffff9ce3fba25000 (0xffff9ce3fba25000)
> [ 160.460317] 0000000032cf7376: 0000000000000001 (0x1)
> [ 160.460317] 000000000051db74: ffffb89d00c6bd20 (0xffffb89d00c6bd20)
> [ 160.460318] 0000000040eb3bf7: ffffffffaa22be81
> (perf_trace_run_bpf_submit+0x41/0xb0)
>
> Simply store the IP still won't really fix the problem, it just passed
> the test. Just had a try to have bpf functions set the
> X86_EFLAGS_FIXED for the flags and always dump the bp, it bypassed
> this specified problem.
>
> Using frame pointer unwinder for testing this, and seems ORC is fine with this.

Correction: ORC won't print an error, but it also give empty callstack.

--
Best Regards,
Kairui Song

2019-05-17 17:18:46

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

> On May 17, 2019, at 1:32 AM, Kairui Song <[email protected]> wrote:
>
> On Fri, May 17, 2019 at 4:15 PM Kairui Song <[email protected]> wrote:
>>
>> On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
>>>> On Thu, May 16, 2019 at 11:51:55PM +0000, Song Liu wrote:
>>>>> Hi,
>>>>>
>>>>> We found a failure with selftests/bpf/tests_prog in test_stacktrace_map (on bpf/master
>>>>> branch).
>>>>>
>>>>> After digging into the code, we found that perf_callchain_kernel() is giving empty
>>>>> callchain for tracepoint sched/sched_switch. And it seems related to commit
>>>>>
>>>>> d15d356887e770c5f2dcf963b52c7cb510c9e42d
>>>>> ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
>>>>>
>>>>> Before this commit, perf_callchain_kernel() returns callchain with regs->ip. With
>>>>> this commit, regs->ip is not sent for !perf_hw_regs(regs) case.
>>>>
>>>> So while I think the below is indeed right; we should store regs->ip
>>>> regardless of the unwind path chosen, I still think there's something
>>>> fishy if this results in just the 1 entry.
>>>>
>>>> The sched/sched_switch event really should have a non-trivial stack.
>>>>
>>>> Let me see if I can reproduce with just perf.
>>>
>>> $ perf record -g -e "sched:sched_switch" -- make clean
>>> $ perf report -D
>>>
>>> 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0xffffffff81c29562 period: 1 addr: 0
>>> ... FP chain: nr:10
>>> ..... 0: ffffffffffffff80
>>> ..... 1: ffffffff81c29562
>>> ..... 2: ffffffff81c29933
>>> ..... 3: ffffffff8111f688
>>> ..... 4: ffffffff81120b9d
>>> ..... 5: ffffffff81120ce5
>>> ..... 6: ffffffff8100254a
>>> ..... 7: ffffffff81e0007d
>>> ..... 8: fffffffffffffe00
>>> ..... 9: 00007f9b6cd9682a
>>> ... thread: sh:7236
>>> ...... dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
>>>
>>>
>>> IOW, it seems to 'work'.
>>>
>>
>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>> some other bfp functions) is now broken, or, strating an unwind
>> directly inside a bpf program will end up strangely. It have following
>> kernel message:
>>
>> WARNING: kernel stack frame pointer at 0000000070cad07c in
>> test_progs:1242 has bad value 00000000ffd4497e
>>
>> And in the debug message:
>>
>> [ 160.460287] 000000006e117175: ffffffffaa23a0e8
>> (get_perf_callchain+0x148/0x280)
>> [ 160.460287] 0000000002e8715f: 0000000000c6bba0 (0xc6bba0)
>> [ 160.460288] 00000000b3d07758: ffff9ce3f9790000 (0xffff9ce3f9790000)
>> [ 160.460289] 0000000055c97836: ffff9ce3f9790000 (0xffff9ce3f9790000)
>> [ 160.460289] 000000007cbb140a: 000000010000007f (0x10000007f)
>> [ 160.460290] 000000007dc62ac9: 0000000000000000 ...
>> [ 160.460290] 000000006b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000)
>> [ 160.460291] 00000000f23d1826: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460292] 00000000f9a16017: 000000000000007f (0x7f)
>> [ 160.460292] 00000000a8e62d44: 0000000000000000 ...
>> [ 160.460293] 00000000cbc83c97: ffffb89d00d8d000 (0xffffb89d00d8d000)
>> [ 160.460293] 0000000092842522: 000000000000007f (0x7f)
>> [ 160.460294] 00000000dafbec89: ffffb89d00c6bc50 (0xffffb89d00c6bc50)
>> [ 160.460296] 000000000777e4cf: ffffffffaa225d97 (bpf_get_stackid+0x77/0x470)
>> [ 160.460296] 000000009942ea16: 0000000000000000 ...
>> [ 160.460297] 00000000a08006b1: 0000000000000001 (0x1)
>> [ 160.460298] 000000009f03b438: ffffb89d00c6bc30 (0xffffb89d00c6bc30)
>> [ 160.460299] 000000006caf8b32: ffffffffaa074fe8 (__do_page_fault+0x58/0x90)
>> [ 160.460300] 000000003a13d702: 0000000000000000 ...
>> [ 160.460300] 00000000e2e2496d: ffff9ce300000000 (0xffff9ce300000000)
>> [ 160.460301] 000000008ee6b7c2: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460301] 00000000a8cf6d55: 0000000000000000 ...
>> [ 160.460302] 0000000059078076: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460303] 00000000c6aac739: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
>> [ 160.460303] 00000000a39aff92: ffffb89d00c6bc60 (0xffffb89d00c6bc60)
>> [ 160.460305] 0000000097498bf2: ffffffffaa1f4791 (bpf_get_stackid_tp+0x11/0x20)
>> [ 160.460306] 000000006992de1e: ffffb89d00c6bc78 (0xffffb89d00c6bc78)
>> [ 160.460307] 00000000dacd0ce5: ffffffffc0405676 (0xffffffffc0405676)
>> [ 160.460307] 00000000a81f2714: 0000000000000000 ...
>>
>> # Note here is the invalid frame pointer
>> [ 160.460308] 0000000070cad07c: ffffb89d00a1d000 (0xffffb89d00a1d000)
>> [ 160.460308] 00000000f8f194e4: 0000000000000001 (0x1)
>> [ 160.460309] 000000002134f42a: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460310] 00000000f9345889: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
>> [ 160.460310] 000000008ad22a42: 0000000000000000 ...
>> [ 160.460311] 0000000073808173: ffffb89d00c6bce0 (0xffffb89d00c6bce0)
>> [ 160.460312] 00000000c9effff4: ffffffffaa1f48d1 (trace_call_bpf+0x81/0x100)
>> [ 160.460313] 00000000c5d8ebd1: ffffb89d00c6bcc0 (0xffffb89d00c6bcc0)
>> [ 160.460315] 00000000bce0b072: ffffffffab651be0
>> (event_sched_migrate_task+0xa0/0xa0)
>> [ 160.460316] 00000000355cf319: 0000000000000000 ...
>> [ 160.460316] 000000003b67f2ad: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460316] 000000009a77e20b: ffff9ce3fba25000 (0xffff9ce3fba25000)
>> [ 160.460317] 0000000032cf7376: 0000000000000001 (0x1)
>> [ 160.460317] 000000000051db74: ffffb89d00c6bd20 (0xffffb89d00c6bd20)
>> [ 160.460318] 0000000040eb3bf7: ffffffffaa22be81
>> (perf_trace_run_bpf_submit+0x41/0xb0)
>>
>> Simply store the IP still won't really fix the problem, it just passed
>> the test. Just had a try to have bpf functions set the
>> X86_EFLAGS_FIXED for the flags and always dump the bp, it bypassed
>> this specified problem.
>>
>> Using frame pointer unwinder for testing this, and seems ORC is fine with this.
>
> Correction: ORC won't print an error, but it also give empty callstack.

Yes, we are using ORC here, and getting empty callstack. bpf_get_stackid()
takes empty callstack as error, so we need up getting nothing.

Does it make sense to add regs->ip back?

Thanks,
Song

2019-05-17 18:04:27

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

> On May 17, 2019, at 1:10 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
>> On Thu, May 16, 2019 at 11:51:55PM +0000, Song Liu wrote:
>>> Hi,
>>>
>>> We found a failure with selftests/bpf/tests_prog in test_stacktrace_map (on bpf/master
>>> branch).
>>>
>>> After digging into the code, we found that perf_callchain_kernel() is giving empty
>>> callchain for tracepoint sched/sched_switch. And it seems related to commit
>>>
>>> d15d356887e770c5f2dcf963b52c7cb510c9e42d
>>> ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
>>>
>>> Before this commit, perf_callchain_kernel() returns callchain with regs->ip. With
>>> this commit, regs->ip is not sent for !perf_hw_regs(regs) case.
>>
>> So while I think the below is indeed right; we should store regs->ip
>> regardless of the unwind path chosen, I still think there's something
>> fishy if this results in just the 1 entry.
>>
>> The sched/sched_switch event really should have a non-trivial stack.
>>
>> Let me see if I can reproduce with just perf.
>
> $ perf record -g -e "sched:sched_switch" -- make clean
> $ perf report -D
>
> 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0xffffffff81c29562 period: 1 addr: 0
> ... FP chain: nr:10
> ..... 0: ffffffffffffff80
> ..... 1: ffffffff81c29562
> ..... 2: ffffffff81c29933
> ..... 3: ffffffff8111f688
> ..... 4: ffffffff81120b9d
> ..... 5: ffffffff81120ce5
> ..... 6: ffffffff8100254a
> ..... 7: ffffffff81e0007d
> ..... 8: fffffffffffffe00
> ..... 9: 00007f9b6cd9682a
> ... thread: sh:7236
> ...... dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
>

Hmm... I also get this FP chain even with CONFIG_UNWINDER_ORC=y.
I guess it is actually from ORC?

Thanks,
Song

2019-05-17 20:04:59

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

+Alexei, Daniel, and bpf

> On May 17, 2019, at 2:10 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>> some other bfp functions) is now broken, or, strating an unwind
>> directly inside a bpf program will end up strangely. It have following
>> kernel message:
>
> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
> follow.

I guess we need something like the following? (we should be able to
optimize the PER_CPU stuff).

Thanks,
Song

diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c
index f92d6ad5e080..c525149028a7 100644
--- i/kernel/trace/bpf_trace.c
+++ w/kernel/trace/bpf_trace.c
@@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
.arg5_type = ARG_CONST_SIZE_OR_ZERO,
};

+static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs);
BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map,
u64, flags)
{
- struct pt_regs *regs = *(struct pt_regs **)tp_buff;
+ struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs);

+ perf_fetch_caller_regs(regs);
/*
* Same comment as in bpf_perf_event_output_tp(), only that this time
* the other helper's function body cannot be inlined due to being
@@ -719,10 +721,13 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
.arg3_type = ARG_ANYTHING,
};

+static DEFINE_PER_CPU(struct pt_regs, bpf_stack_tp_regs);
BPF_CALL_4(bpf_get_stack_tp, void *, tp_buff, void *, buf, u32, size,
u64, flags)
{
- struct pt_regs *regs = *(struct pt_regs **)tp_buff;
+ struct pt_regs *regs = this_cpu_ptr(&bpf_stack_tp_regs);
+
+ perf_fetch_caller_regs(regs);

return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
(unsigned long) size, flags, 0);

2019-05-17 22:12:44

by Alexei Starovoitov

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On 5/17/19 11:40 AM, Song Liu wrote:
> +Alexei, Daniel, and bpf
>
>> On May 17, 2019, at 2:10 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>> some other bfp functions) is now broken, or, strating an unwind
>>> directly inside a bpf program will end up strangely. It have following
>>> kernel message:
>>
>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>> follow.
>
> I guess we need something like the following? (we should be able to
> optimize the PER_CPU stuff).
>
> Thanks,
> Song
>
>
> diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c
> index f92d6ad5e080..c525149028a7 100644
> --- i/kernel/trace/bpf_trace.c
> +++ w/kernel/trace/bpf_trace.c
> @@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
> .arg5_type = ARG_CONST_SIZE_OR_ZERO,
> };
>
> +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs);
> BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map,
> u64, flags)
> {
> - struct pt_regs *regs = *(struct pt_regs **)tp_buff;
> + struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs);
>
> + perf_fetch_caller_regs(regs);

No. pt_regs is already passed in. It's the first argument.
If we call perf_fetch_caller_regs() again the stack trace will be wrong.
bpf prog should not see itself, interpreter or all the frames in between.

2019-05-18 00:33:40

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

> On May 17, 2019, at 2:06 PM, Alexei Starovoitov <[email protected]> wrote:
>
> On 5/17/19 11:40 AM, Song Liu wrote:
>> +Alexei, Daniel, and bpf
>>
>>> On May 17, 2019, at 2:10 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>>> some other bfp functions) is now broken, or, strating an unwind
>>>> directly inside a bpf program will end up strangely. It have following
>>>> kernel message:
>>>
>>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>>> follow.
>>
>> I guess we need something like the following? (we should be able to
>> optimize the PER_CPU stuff).
>>
>> Thanks,
>> Song
>>
>>
>> diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c
>> index f92d6ad5e080..c525149028a7 100644
>> --- i/kernel/trace/bpf_trace.c
>> +++ w/kernel/trace/bpf_trace.c
>> @@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
>> .arg5_type = ARG_CONST_SIZE_OR_ZERO,
>> };
>>
>> +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs);
>> BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map,
>> u64, flags)
>> {
>> - struct pt_regs *regs = *(struct pt_regs **)tp_buff;
>> + struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs);
>>
>> + perf_fetch_caller_regs(regs);
>
> No. pt_regs is already passed in. It's the first argument.
> If we call perf_fetch_caller_regs() again the stack trace will be wrong.
> bpf prog should not see itself, interpreter or all the frames in between.

Thanks Alexei! I get it now.

In bpf_get_stackid_tp(), the pt_regs is get by dereferencing the first field
of tp_buff:

struct pt_regs *regs = *(struct pt_regs **)tp_buff;

tp_buff points to something like

struct sched_switch_args {
unsigned long long pad;
char prev_comm[16];
int prev_pid;
int prev_prio;
long long prev_state;
char next_comm[16];
int next_pid;
int next_prio;
};

where the first field "pad" is a pointer to pt_regs.

@Kairui, I think you confirmed that current code will give empty call trace
with ORC unwinder? If that's the case, can we add regs->ip back? (as in the
first email of this thread.

Thanks,
Song

2019-05-19 18:09:07

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Sat, May 18, 2019 at 5:48 AM Song Liu <[email protected]> wrote:
>
>
>
> > On May 17, 2019, at 2:06 PM, Alexei Starovoitov <[email protected]> wrote:
> >
> > On 5/17/19 11:40 AM, Song Liu wrote:
> >> +Alexei, Daniel, and bpf
> >>
> >>> On May 17, 2019, at 2:10 AM, Peter Zijlstra <[email protected]> wrote:
> >>>
> >>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
> >>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> >>>> some other bfp functions) is now broken, or, strating an unwind
> >>>> directly inside a bpf program will end up strangely. It have following
> >>>> kernel message:
> >>>
> >>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
> >>> follow.
> >>
> >> I guess we need something like the following? (we should be able to
> >> optimize the PER_CPU stuff).
> >>
> >> Thanks,
> >> Song
> >>
> >>
> >> diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c
> >> index f92d6ad5e080..c525149028a7 100644
> >> --- i/kernel/trace/bpf_trace.c
> >> +++ w/kernel/trace/bpf_trace.c
> >> @@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
> >> .arg5_type = ARG_CONST_SIZE_OR_ZERO,
> >> };
> >>
> >> +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs);
> >> BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map,
> >> u64, flags)
> >> {
> >> - struct pt_regs *regs = *(struct pt_regs **)tp_buff;
> >> + struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs);
> >>
> >> + perf_fetch_caller_regs(regs);
> >
> > No. pt_regs is already passed in. It's the first argument.
> > If we call perf_fetch_caller_regs() again the stack trace will be wrong.
> > bpf prog should not see itself, interpreter or all the frames in between.
>
> Thanks Alexei! I get it now.
>
> In bpf_get_stackid_tp(), the pt_regs is get by dereferencing the first field
> of tp_buff:
>
> struct pt_regs *regs = *(struct pt_regs **)tp_buff;
>
> tp_buff points to something like
>
> struct sched_switch_args {
> unsigned long long pad;
> char prev_comm[16];
> int prev_pid;
> int prev_prio;
> long long prev_state;
> char next_comm[16];
> int next_pid;
> int next_prio;
> };
>
> where the first field "pad" is a pointer to pt_regs.
>
> @Kairui, I think you confirmed that current code will give empty call trace
> with ORC unwinder? If that's the case, can we add regs->ip back? (as in the
> first email of this thread.
>
> Thanks,
> Song
>

Hi thanks for the suggestion, yes we can add it should be good an idea
to always have IP when stack trace is not available.
But stack trace is actually still broken, it will always give only one
level of stacktrace (the IP).

--
Best Regards,
Kairui Song

2019-05-19 21:33:09

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
> > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> > some other bfp functions) is now broken, or, strating an unwind
> > directly inside a bpf program will end up strangely. It have following
> > kernel message:
>
> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
> follow.

bpf_get_stackid_tp will just use the regs passed to it from the trace
point. And then it will eventually call perf_get_callchain to get the
call chain.
With a tracepoint we have the fake regs, so unwinder will start from
where it is called, and use the fake regs as the indicator of the
target frame it want, and keep unwinding until reached the actually
callsite.

But if the stack trace is started withing a bpf func call then it's broken...

If the unwinder could trace back through the bpf func call then there
will be no such problem.

For frame pointer unwinder, set the indicator flag (X86_EFLAGS_FIXED)
before bpf call, and ensure bp is also dumped could fix it (so it will
start using the regs for bpf calls, like before the commit
d15d356887e7). But for ORC I don't see a clear way to fix the problem.
First though is maybe dump some callee's regs for ORC (IP, BP, SP, DI,
DX, R10, R13, else?) in the trace point handler, then use the flag to
indicate ORC to do one more unwind (because can't get caller's regs,
so get callee's regs instaed) before actually giving output?

I had a try, for framepointer unwinder, mark the indicator flag before
calling bpf functions, and dump bp as well in the trace point. Then
with frame pointer, it works, test passed:

diff --git a/arch/x86/include/asm/perf_event.h
b/arch/x86/include/asm/perf_event.h
index 1392d5e6e8d6..6f1192e9776b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct
pt_regs *regs);

#include <asm/stacktrace.h>

+#ifdef CONFIG_FRAME_POINTER
+static inline unsigned long caller_frame_pointer(void)
+{
+ return (unsigned long)__builtin_frame_address(1);
+}
+#else
+static inline unsigned long caller_frame_pointer(void)
+{
+ return 0;
+}
+#endif
+
/*
* We abuse bit 3 from flags to pass exact information, see perf_misc_flags
* and the comment with PERF_EFLAGS_EXACT.
*/
#define perf_arch_fetch_caller_regs(regs, __ip) { \
(regs)->ip = (__ip); \
+ (regs)->bp = caller_frame_pointer(); \
(regs)->sp = (unsigned long)__builtin_frame_address(0); \
(regs)->cs = __KERNEL_CS; \
regs->flags = 0; \
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3b96c2..ca7b95ee74f0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data,
int size, int rctx,
struct task_struct *task)
{
if (bpf_prog_array_valid(call)) {
+ regs->flags |= X86_EFLAGS_FIXED;
*(struct pt_regs **)raw_data = regs;
if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
@@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event,
int ret = 0;

ctx.regs = perf_arch_bpf_user_pt_regs(regs);
+ ctx.regs->flags |= X86_EFLAGS_FIXED;
+
preempt_disable();
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
goto out;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..e1fa656677dc 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64
flags, void *meta, u64 meta_size,
};

perf_fetch_caller_regs(regs);
+ regs->flags |= X86_EFLAGS_FIXED;
+
perf_sample_data_init(sd, 0, 0);
sd->raw = &raw;

@@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct
bpf_raw_tracepoint_args *, args,
struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

perf_fetch_caller_regs(regs);
+ regs->flags |= X86_EFLAGS_FIXED;
+
return ____bpf_perf_event_output(regs, map, flags, data, size);
}

@@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct
bpf_raw_tracepoint_args *, args,
struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

perf_fetch_caller_regs(regs);
+ regs->flags |= X86_EFLAGS_FIXED;
+
/* similar to bpf_perf_event_output_tp, but pt_regs fetched
differently */
return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
flags, 0, 0);
@@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_tp, struct
bpf_raw_tracepoint_args *, args,
struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

perf_fetch_caller_regs(regs);
+ regs->flags |= X86_EFLAGS_FIXED;
+
return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
(unsigned long) size, flags, 0);
}

And *_raw_tp functions will fetch the regs by themselves so a bit
trouble some...

----------

And another approach is to make unwinder direct unwinding works when
called by bpf (if possible and reasonable). I also had a nasty hacky
experiment (posted below) to just force frame pointer unwinder's
get_stack_info pass for bpf, then problem is fixed without any other
workaround:

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 753b8cfe8b8a..c0cfdf25f5ed 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -166,7 +166,8 @@ int get_stack_info(unsigned long *stack, struct
task_struct *task,
if (in_entry_stack(stack, info))
goto recursion_check;

- goto unknown;
+ goto recursion_check;

recursion_check:
/*

Don't know how to do it the right way, or is it even possible for all
unwinders yet...

--
Best Regards,
Kairui Song

2019-05-20 18:20:16

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

> On May 19, 2019, at 11:06 AM, Kairui Song <[email protected]> wrote:
>
> On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra <[email protected]> wrote:
>>
>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>> some other bfp functions) is now broken, or, strating an unwind
>>> directly inside a bpf program will end up strangely. It have following
>>> kernel message:
>>
>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>> follow.
>
> bpf_get_stackid_tp will just use the regs passed to it from the trace
> point. And then it will eventually call perf_get_callchain to get the
> call chain.
> With a tracepoint we have the fake regs, so unwinder will start from
> where it is called, and use the fake regs as the indicator of the
> target frame it want, and keep unwinding until reached the actually
> callsite.
>
> But if the stack trace is started withing a bpf func call then it's broken...
>
> If the unwinder could trace back through the bpf func call then there
> will be no such problem.
>
> For frame pointer unwinder, set the indicator flag (X86_EFLAGS_FIXED)
> before bpf call, and ensure bp is also dumped could fix it (so it will
> start using the regs for bpf calls, like before the commit
> d15d356887e7). But for ORC I don't see a clear way to fix the problem.
> First though is maybe dump some callee's regs for ORC (IP, BP, SP, DI,
> DX, R10, R13, else?) in the trace point handler, then use the flag to
> indicate ORC to do one more unwind (because can't get caller's regs,
> so get callee's regs instaed) before actually giving output?
>
> I had a try, for framepointer unwinder, mark the indicator flag before
> calling bpf functions, and dump bp as well in the trace point. Then
> with frame pointer, it works, test passed:
>
> diff --git a/arch/x86/include/asm/perf_event.h
> b/arch/x86/include/asm/perf_event.h
> index 1392d5e6e8d6..6f1192e9776b 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct
> pt_regs *regs);
>
> #include <asm/stacktrace.h>
>
> +#ifdef CONFIG_FRAME_POINTER
> +static inline unsigned long caller_frame_pointer(void)
> +{
> + return (unsigned long)__builtin_frame_address(1);
> +}
> +#else
> +static inline unsigned long caller_frame_pointer(void)
> +{
> + return 0;
> +}
> +#endif
> +
> /*
> * We abuse bit 3 from flags to pass exact information, see perf_misc_flags
> * and the comment with PERF_EFLAGS_EXACT.
> */
> #define perf_arch_fetch_caller_regs(regs, __ip) { \
> (regs)->ip = (__ip); \
> + (regs)->bp = caller_frame_pointer(); \
> (regs)->sp = (unsigned long)__builtin_frame_address(0); \
> (regs)->cs = __KERNEL_CS; \
> regs->flags = 0; \
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index abbd4b3b96c2..ca7b95ee74f0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data,
> int size, int rctx,
> struct task_struct *task)
> {
> if (bpf_prog_array_valid(call)) {
> + regs->flags |= X86_EFLAGS_FIXED;
> *(struct pt_regs **)raw_data = regs;
> if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
> perf_swevent_put_recursion_context(rctx);
> @@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event,
> int ret = 0;
>
> ctx.regs = perf_arch_bpf_user_pt_regs(regs);
> + ctx.regs->flags |= X86_EFLAGS_FIXED;
> +
> preempt_disable();
> if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
> goto out;
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index f92d6ad5e080..e1fa656677dc 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64
> flags, void *meta, u64 meta_size,
> };
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> perf_sample_data_init(sd, 0, 0);
> sd->raw = &raw;
>
> @@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct
> bpf_raw_tracepoint_args *, args,
> struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> return ____bpf_perf_event_output(regs, map, flags, data, size);
> }
>
> @@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct
> bpf_raw_tracepoint_args *, args,
> struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> /* similar to bpf_perf_event_output_tp, but pt_regs fetched
> differently */
> return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
> flags, 0, 0);
> @@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_tp, struct
> bpf_raw_tracepoint_args *, args,
> struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
> (unsigned long) size, flags, 0);
> }
>
> And *_raw_tp functions will fetch the regs by themselves so a bit
> trouble some...
>
> ----------
>
> And another approach is to make unwinder direct unwinding works when
> called by bpf (if possible and reasonable). I also had a nasty hacky
> experiment (posted below) to just force frame pointer unwinder's
> get_stack_info pass for bpf, then problem is fixed without any other
> workaround:
>
> diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
> index 753b8cfe8b8a..c0cfdf25f5ed 100644
> --- a/arch/x86/kernel/dumpstack_64.c
> +++ b/arch/x86/kernel/dumpstack_64.c
> @@ -166,7 +166,8 @@ int get_stack_info(unsigned long *stack, struct
> task_struct *task,
> if (in_entry_stack(stack, info))
> goto recursion_check;
>
> - goto unknown;
> + goto recursion_check;
>
> recursion_check:
> /*
>
> Don't know how to do it the right way, or is it even possible for all
> unwinders yet...
>
> --
> Best Regards,
> Kairui Song

2019-05-20 18:20:28

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

Sorry for previous empty email.. Clicked send by accident.

> On May 19, 2019, at 11:06 AM, Kairui Song <[email protected]> wrote:
>
> On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra <[email protected]> wrote:
>>
>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>> some other bfp functions) is now broken, or, strating an unwind
>>> directly inside a bpf program will end up strangely. It have following
>>> kernel message:
>>
>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>> follow.
>
> bpf_get_stackid_tp will just use the regs passed to it from the trace
> point. And then it will eventually call perf_get_callchain to get the
> call chain.
> With a tracepoint we have the fake regs, so unwinder will start from
> where it is called, and use the fake regs as the indicator of the
> target frame it want, and keep unwinding until reached the actually
> callsite.
>
> But if the stack trace is started withing a bpf func call then it's broken...
>
> If the unwinder could trace back through the bpf func call then there
> will be no such problem.
>
> For frame pointer unwinder, set the indicator flag (X86_EFLAGS_FIXED)
> before bpf call, and ensure bp is also dumped could fix it (so it will
> start using the regs for bpf calls, like before the commit
> d15d356887e7). But for ORC I don't see a clear way to fix the problem.
> First though is maybe dump some callee's regs for ORC (IP, BP, SP, DI,
> DX, R10, R13, else?) in the trace point handler, then use the flag to
> indicate ORC to do one more unwind (because can't get caller's regs,
> so get callee's regs instaed) before actually giving output?
>
> I had a try, for framepointer unwinder, mark the indicator flag before
> calling bpf functions, and dump bp as well in the trace point. Then
> with frame pointer, it works, test passed:
>
> diff --git a/arch/x86/include/asm/perf_event.h
> b/arch/x86/include/asm/perf_event.h
> index 1392d5e6e8d6..6f1192e9776b 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct
> pt_regs *regs);
>
> #include <asm/stacktrace.h>
>
> +#ifdef CONFIG_FRAME_POINTER
> +static inline unsigned long caller_frame_pointer(void)
> +{
> + return (unsigned long)__builtin_frame_address(1);
> +}
> +#else
> +static inline unsigned long caller_frame_pointer(void)
> +{
> + return 0;
> +}
> +#endif
> +
> /*
> * We abuse bit 3 from flags to pass exact information, see perf_misc_flags
> * and the comment with PERF_EFLAGS_EXACT.
> */
> #define perf_arch_fetch_caller_regs(regs, __ip) { \
> (regs)->ip = (__ip); \
> + (regs)->bp = caller_frame_pointer(); \
> (regs)->sp = (unsigned long)__builtin_frame_address(0); \
> (regs)->cs = __KERNEL_CS; \
> regs->flags = 0; \
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index abbd4b3b96c2..ca7b95ee74f0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data,
> int size, int rctx,
> struct task_struct *task)
> {
> if (bpf_prog_array_valid(call)) {
> + regs->flags |= X86_EFLAGS_FIXED;
> *(struct pt_regs **)raw_data = regs;
> if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
> perf_swevent_put_recursion_context(rctx);
> @@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event,
> int ret = 0;
>
> ctx.regs = perf_arch_bpf_user_pt_regs(regs);
> + ctx.regs->flags |= X86_EFLAGS_FIXED;
> +
> preempt_disable();
> if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
> goto out;
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index f92d6ad5e080..e1fa656677dc 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64
> flags, void *meta, u64 meta_size,
> };
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> perf_sample_data_init(sd, 0, 0);
> sd->raw = &raw;
>
> @@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct
> bpf_raw_tracepoint_args *, args,
> struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> return ____bpf_perf_event_output(regs, map, flags, data, size);
> }
>
> @@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct
> bpf_raw_tracepoint_args *, args,
> struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> /* similar to bpf_perf_event_output_tp, but pt_regs fetched
> differently */
> return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
> flags, 0, 0);
> @@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_tp, struct
> bpf_raw_tracepoint_args *, args,
> struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
> (unsigned long) size, flags, 0);
> }
>

I think we really cannot do something above, as it leaks x86 specific
code into kernel/events and kernel/trace.

> And *_raw_tp functions will fetch the regs by themselves so a bit
> trouble some...
>
> ----------
>
> And another approach is to make unwinder direct unwinding works when
> called by bpf (if possible and reasonable). I also had a nasty hacky
> experiment (posted below) to just force frame pointer unwinder's
> get_stack_info pass for bpf, then problem is fixed without any other
> workaround:
>
> diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
> index 753b8cfe8b8a..c0cfdf25f5ed 100644
> --- a/arch/x86/kernel/dumpstack_64.c
> +++ b/arch/x86/kernel/dumpstack_64.c
> @@ -166,7 +166,8 @@ int get_stack_info(unsigned long *stack, struct
> task_struct *task,
> if (in_entry_stack(stack, info))
> goto recursion_check;
>
> - goto unknown;
> + goto recursion_check;
>
> recursion_check:
> /*

I think this one doesn't work for ORC either?

Thanks,
Song

> Don't know how to do it the right way, or is it even possible for all
> unwinders yet...
>
> --
> Best Regards,
> Kairui Song

2019-05-20 18:20:36

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

> On May 19, 2019, at 11:07 AM, Kairui Song <[email protected]> wrote:
>
> On Sat, May 18, 2019 at 5:48 AM Song Liu <[email protected]> wrote:
>>
>>
>>
>>> On May 17, 2019, at 2:06 PM, Alexei Starovoitov <[email protected]> wrote:
>>>
>>> On 5/17/19 11:40 AM, Song Liu wrote:
>>>> +Alexei, Daniel, and bpf
>>>>
>>>>> On May 17, 2019, at 2:10 AM, Peter Zijlstra <[email protected]> wrote:
>>>>>
>>>>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>>>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>>>>> some other bfp functions) is now broken, or, strating an unwind
>>>>>> directly inside a bpf program will end up strangely. It have following
>>>>>> kernel message:
>>>>>
>>>>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>>>>> follow.
>>>>
>>>> I guess we need something like the following? (we should be able to
>>>> optimize the PER_CPU stuff).
>>>>
>>>> Thanks,
>>>> Song
>>>>
>>>>
>>>> diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c
>>>> index f92d6ad5e080..c525149028a7 100644
>>>> --- i/kernel/trace/bpf_trace.c
>>>> +++ w/kernel/trace/bpf_trace.c
>>>> @@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
>>>> .arg5_type = ARG_CONST_SIZE_OR_ZERO,
>>>> };
>>>>
>>>> +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs);
>>>> BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map,
>>>> u64, flags)
>>>> {
>>>> - struct pt_regs *regs = *(struct pt_regs **)tp_buff;
>>>> + struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs);
>>>>
>>>> + perf_fetch_caller_regs(regs);
>>>
>>> No. pt_regs is already passed in. It's the first argument.
>>> If we call perf_fetch_caller_regs() again the stack trace will be wrong.
>>> bpf prog should not see itself, interpreter or all the frames in between.
>>
>> Thanks Alexei! I get it now.
>>
>> In bpf_get_stackid_tp(), the pt_regs is get by dereferencing the first field
>> of tp_buff:
>>
>> struct pt_regs *regs = *(struct pt_regs **)tp_buff;
>>
>> tp_buff points to something like
>>
>> struct sched_switch_args {
>> unsigned long long pad;
>> char prev_comm[16];
>> int prev_pid;
>> int prev_prio;
>> long long prev_state;
>> char next_comm[16];
>> int next_pid;
>> int next_prio;
>> };
>>
>> where the first field "pad" is a pointer to pt_regs.
>>
>> @Kairui, I think you confirmed that current code will give empty call trace
>> with ORC unwinder? If that's the case, can we add regs->ip back? (as in the
>> first email of this thread.
>>
>> Thanks,
>> Song
>>
>
> Hi thanks for the suggestion, yes we can add it should be good an idea
> to always have IP when stack trace is not available.
> But stack trace is actually still broken, it will always give only one
> level of stacktrace (the IP).

I think this is still the best fix/workaround here? And only one level
of stack trace should be OK for tracepoint?

Thanks,
Song

>
> --
> Best Regards,
> Kairui Song

2019-05-22 13:52:48

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Mon, May 20, 2019 at 05:22:12PM +0000, Song Liu wrote:
> I think this is still the best fix/workaround here? And only one level
> of stack trace should be OK for tracepoint?

No. That's still completely broken.

2019-05-22 14:06:45

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Mon, May 20, 2019 at 02:06:54AM +0800, Kairui Song wrote:
> On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
> > > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> > > some other bfp functions) is now broken, or, strating an unwind
> > > directly inside a bpf program will end up strangely. It have following
> > > kernel message:
> >
> > Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
> > follow.
>
> bpf_get_stackid_tp will just use the regs passed to it from the trace
> point. And then it will eventually call perf_get_callchain to get the
> call chain.
> With a tracepoint we have the fake regs, so unwinder will start from
> where it is called, and use the fake regs as the indicator of the
> target frame it want, and keep unwinding until reached the actually
> callsite.
>
> But if the stack trace is started withing a bpf func call then it's broken...

I'm confused.. how is this broken? Surely we should eventually find the
original stack frame and be good again, right?

> If the unwinder could trace back through the bpf func call then there
> will be no such problem.

Why couldn't it trace back through the bpf stuff? And how can we fix
that?

2019-05-22 14:57:08

by Alexei Starovoitov

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On 5/22/19 7:02 AM, Peter Zijlstra wrote:
>
>> If the unwinder could trace back through the bpf func call then there
>> will be no such problem.
>
> Why couldn't it trace back through the bpf stuff? And how can we fix
> that?

Most of the time there is no 'tracing through bpf stuff'.
bpf infra is preserving 'pt_regs*' that was collected before
any bpf things start executing.
The reason is that bpf can be executed via interpreter and
directly when JITed.
In both cases collected stack traces should be the same or
it's confusing the users and they cannot compensate for such
difference.

The only exception is raw_tracepoint, since it's the most
minimalistic way of calling bpf and kernel side doesn't do
anything before calling into bpf.
Only in such case bpf side has to call perf_fetch_caller_regs().
See bpf_get_stackid_raw_tp().
But this test case is actually working!
It's covered by prog_tests/stacktrace_map_raw_tp.c and
it passes.
The one that is broken is prog_tests/stacktrace_map.c
There we attach bpf to standard tracepoint where
kernel suppose to collect pt_regs before calling into bpf.
And that's what bpf_get_stackid_tp() is doing.
It passes pt_regs (that was collected before any bpf)
into bpf_get_stackid() which calls get_perf_callchain().
Same thing with kprobes, uprobes.

2019-05-22 17:47:36

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Wed, May 22, 2019 at 02:49:07PM +0000, Alexei Starovoitov wrote:
> The one that is broken is prog_tests/stacktrace_map.c
> There we attach bpf to standard tracepoint where
> kernel suppose to collect pt_regs before calling into bpf.
> And that's what bpf_get_stackid_tp() is doing.
> It passes pt_regs (that was collected before any bpf)
> into bpf_get_stackid() which calls get_perf_callchain().
> Same thing with kprobes, uprobes.

Is it trying to unwind through ___bpf_prog_run()?

If so, that would at least explain why ORC isn't working. Objtool
currently ignores that function because it can't follow the jump table.

--
Josh

2019-05-22 18:09:58

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
> On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 16, 2019 at 11:51:55PM +0000, Song Liu wrote:
> > > > Hi,
> > > >
> > > > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map (on bpf/master
> > > > branch).
> > > >
> > > > After digging into the code, we found that perf_callchain_kernel() is giving empty
> > > > callchain for tracepoint sched/sched_switch. And it seems related to commit
> > > >
> > > > d15d356887e770c5f2dcf963b52c7cb510c9e42d
> > > > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
> > > >
> > > > Before this commit, perf_callchain_kernel() returns callchain with regs->ip. With
> > > > this commit, regs->ip is not sent for !perf_hw_regs(regs) case.
> > >
> > > So while I think the below is indeed right; we should store regs->ip
> > > regardless of the unwind path chosen, I still think there's something
> > > fishy if this results in just the 1 entry.
> > >
> > > The sched/sched_switch event really should have a non-trivial stack.
> > >
> > > Let me see if I can reproduce with just perf.
> >
> > $ perf record -g -e "sched:sched_switch" -- make clean
> > $ perf report -D
> >
> > 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0xffffffff81c29562 period: 1 addr: 0
> > ... FP chain: nr:10
> > ..... 0: ffffffffffffff80
> > ..... 1: ffffffff81c29562
> > ..... 2: ffffffff81c29933
> > ..... 3: ffffffff8111f688
> > ..... 4: ffffffff81120b9d
> > ..... 5: ffffffff81120ce5
> > ..... 6: ffffffff8100254a
> > ..... 7: ffffffff81e0007d
> > ..... 8: fffffffffffffe00
> > ..... 9: 00007f9b6cd9682a
> > ... thread: sh:7236
> > ...... dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
> >
> >
> > IOW, it seems to 'work'.
> >
>
> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> some other bfp functions) is now broken, or, strating an unwind
> directly inside a bpf program will end up strangely. It have following
> kernel message:
>
> WARNING: kernel stack frame pointer at 0000000070cad07c in
> test_progs:1242 has bad value 00000000ffd4497e
>
> And in the debug message:
>
> [ 160.460287] 000000006e117175: ffffffffaa23a0e8
> (get_perf_callchain+0x148/0x280)
> [ 160.460287] 0000000002e8715f: 0000000000c6bba0 (0xc6bba0)
> [ 160.460288] 00000000b3d07758: ffff9ce3f9790000 (0xffff9ce3f9790000)
> [ 160.460289] 0000000055c97836: ffff9ce3f9790000 (0xffff9ce3f9790000)
> [ 160.460289] 000000007cbb140a: 000000010000007f (0x10000007f)
> [ 160.460290] 000000007dc62ac9: 0000000000000000 ...
> [ 160.460290] 000000006b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000)
> [ 160.460291] 00000000f23d1826: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460292] 00000000f9a16017: 000000000000007f (0x7f)
> [ 160.460292] 00000000a8e62d44: 0000000000000000 ...
> [ 160.460293] 00000000cbc83c97: ffffb89d00d8d000 (0xffffb89d00d8d000)
> [ 160.460293] 0000000092842522: 000000000000007f (0x7f)
> [ 160.460294] 00000000dafbec89: ffffb89d00c6bc50 (0xffffb89d00c6bc50)
> [ 160.460296] 000000000777e4cf: ffffffffaa225d97 (bpf_get_stackid+0x77/0x470)
> [ 160.460296] 000000009942ea16: 0000000000000000 ...
> [ 160.460297] 00000000a08006b1: 0000000000000001 (0x1)
> [ 160.460298] 000000009f03b438: ffffb89d00c6bc30 (0xffffb89d00c6bc30)
> [ 160.460299] 000000006caf8b32: ffffffffaa074fe8 (__do_page_fault+0x58/0x90)
> [ 160.460300] 000000003a13d702: 0000000000000000 ...
> [ 160.460300] 00000000e2e2496d: ffff9ce300000000 (0xffff9ce300000000)
> [ 160.460301] 000000008ee6b7c2: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460301] 00000000a8cf6d55: 0000000000000000 ...
> [ 160.460302] 0000000059078076: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460303] 00000000c6aac739: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
> [ 160.460303] 00000000a39aff92: ffffb89d00c6bc60 (0xffffb89d00c6bc60)
> [ 160.460305] 0000000097498bf2: ffffffffaa1f4791 (bpf_get_stackid_tp+0x11/0x20)
> [ 160.460306] 000000006992de1e: ffffb89d00c6bc78 (0xffffb89d00c6bc78)
> [ 160.460307] 00000000dacd0ce5: ffffffffc0405676 (0xffffffffc0405676)
> [ 160.460307] 00000000a81f2714: 0000000000000000 ...
>
> # Note here is the invalid frame pointer
> [ 160.460308] 0000000070cad07c: ffffb89d00a1d000 (0xffffb89d00a1d000)
> [ 160.460308] 00000000f8f194e4: 0000000000000001 (0x1)
> [ 160.460309] 000000002134f42a: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460310] 00000000f9345889: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
> [ 160.460310] 000000008ad22a42: 0000000000000000 ...
> [ 160.460311] 0000000073808173: ffffb89d00c6bce0 (0xffffb89d00c6bce0)
> [ 160.460312] 00000000c9effff4: ffffffffaa1f48d1 (trace_call_bpf+0x81/0x100)
> [ 160.460313] 00000000c5d8ebd1: ffffb89d00c6bcc0 (0xffffb89d00c6bcc0)
> [ 160.460315] 00000000bce0b072: ffffffffab651be0
> (event_sched_migrate_task+0xa0/0xa0)
> [ 160.460316] 00000000355cf319: 0000000000000000 ...
> [ 160.460316] 000000003b67f2ad: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
> [ 160.460316] 000000009a77e20b: ffff9ce3fba25000 (0xffff9ce3fba25000)
> [ 160.460317] 0000000032cf7376: 0000000000000001 (0x1)
> [ 160.460317] 000000000051db74: ffffb89d00c6bd20 (0xffffb89d00c6bd20)
> [ 160.460318] 0000000040eb3bf7: ffffffffaa22be81
> (perf_trace_run_bpf_submit+0x41/0xb0)

Is there an easy way to recreate this?

--
Josh

2019-05-22 21:58:20

by Alexei Starovoitov

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On 5/22/19 11:07 AM, Josh Poimboeuf wrote:
> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>> On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
>>>> On Thu, May 16, 2019 at 11:51:55PM +0000, Song Liu wrote:
>>>>> Hi,
>>>>>
>>>>> We found a failure with selftests/bpf/tests_prog in test_stacktrace_map (on bpf/master
>>>>> branch).
>>>>>
>>>>> After digging into the code, we found that perf_callchain_kernel() is giving empty
>>>>> callchain for tracepoint sched/sched_switch. And it seems related to commit
>>>>>
>>>>> d15d356887e770c5f2dcf963b52c7cb510c9e42d
>>>>> ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
>>>>>
>>>>> Before this commit, perf_callchain_kernel() returns callchain with regs->ip. With
>>>>> this commit, regs->ip is not sent for !perf_hw_regs(regs) case.
>>>>
>>>> So while I think the below is indeed right; we should store regs->ip
>>>> regardless of the unwind path chosen, I still think there's something
>>>> fishy if this results in just the 1 entry.
>>>>
>>>> The sched/sched_switch event really should have a non-trivial stack.
>>>>
>>>> Let me see if I can reproduce with just perf.
>>>
>>> $ perf record -g -e "sched:sched_switch" -- make clean
>>> $ perf report -D
>>>
>>> 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0xffffffff81c29562 period: 1 addr: 0
>>> ... FP chain: nr:10
>>> ..... 0: ffffffffffffff80
>>> ..... 1: ffffffff81c29562
>>> ..... 2: ffffffff81c29933
>>> ..... 3: ffffffff8111f688
>>> ..... 4: ffffffff81120b9d
>>> ..... 5: ffffffff81120ce5
>>> ..... 6: ffffffff8100254a
>>> ..... 7: ffffffff81e0007d
>>> ..... 8: fffffffffffffe00
>>> ..... 9: 00007f9b6cd9682a
>>> ... thread: sh:7236
>>> ...... dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
>>>
>>>
>>> IOW, it seems to 'work'.
>>>
>>
>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>> some other bfp functions) is now broken, or, strating an unwind
>> directly inside a bpf program will end up strangely. It have following
>> kernel message:
>>
>> WARNING: kernel stack frame pointer at 0000000070cad07c in
>> test_progs:1242 has bad value 00000000ffd4497e
>>
>> And in the debug message:
>>
>> [ 160.460287] 000000006e117175: ffffffffaa23a0e8
>> (get_perf_callchain+0x148/0x280)
>> [ 160.460287] 0000000002e8715f: 0000000000c6bba0 (0xc6bba0)
>> [ 160.460288] 00000000b3d07758: ffff9ce3f9790000 (0xffff9ce3f9790000)
>> [ 160.460289] 0000000055c97836: ffff9ce3f9790000 (0xffff9ce3f9790000)
>> [ 160.460289] 000000007cbb140a: 000000010000007f (0x10000007f)
>> [ 160.460290] 000000007dc62ac9: 0000000000000000 ...
>> [ 160.460290] 000000006b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000)
>> [ 160.460291] 00000000f23d1826: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460292] 00000000f9a16017: 000000000000007f (0x7f)
>> [ 160.460292] 00000000a8e62d44: 0000000000000000 ...
>> [ 160.460293] 00000000cbc83c97: ffffb89d00d8d000 (0xffffb89d00d8d000)
>> [ 160.460293] 0000000092842522: 000000000000007f (0x7f)
>> [ 160.460294] 00000000dafbec89: ffffb89d00c6bc50 (0xffffb89d00c6bc50)
>> [ 160.460296] 000000000777e4cf: ffffffffaa225d97 (bpf_get_stackid+0x77/0x470)
>> [ 160.460296] 000000009942ea16: 0000000000000000 ...
>> [ 160.460297] 00000000a08006b1: 0000000000000001 (0x1)
>> [ 160.460298] 000000009f03b438: ffffb89d00c6bc30 (0xffffb89d00c6bc30)
>> [ 160.460299] 000000006caf8b32: ffffffffaa074fe8 (__do_page_fault+0x58/0x90)
>> [ 160.460300] 000000003a13d702: 0000000000000000 ...
>> [ 160.460300] 00000000e2e2496d: ffff9ce300000000 (0xffff9ce300000000)
>> [ 160.460301] 000000008ee6b7c2: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460301] 00000000a8cf6d55: 0000000000000000 ...
>> [ 160.460302] 0000000059078076: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460303] 00000000c6aac739: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
>> [ 160.460303] 00000000a39aff92: ffffb89d00c6bc60 (0xffffb89d00c6bc60)
>> [ 160.460305] 0000000097498bf2: ffffffffaa1f4791 (bpf_get_stackid_tp+0x11/0x20)
>> [ 160.460306] 000000006992de1e: ffffb89d00c6bc78 (0xffffb89d00c6bc78)
>> [ 160.460307] 00000000dacd0ce5: ffffffffc0405676 (0xffffffffc0405676)
>> [ 160.460307] 00000000a81f2714: 0000000000000000 ...
>>
>> # Note here is the invalid frame pointer
>> [ 160.460308] 0000000070cad07c: ffffb89d00a1d000 (0xffffb89d00a1d000)
>> [ 160.460308] 00000000f8f194e4: 0000000000000001 (0x1)
>> [ 160.460309] 000000002134f42a: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460310] 00000000f9345889: ffff9ce3f1e18eb0 (0xffff9ce3f1e18eb0)
>> [ 160.460310] 000000008ad22a42: 0000000000000000 ...
>> [ 160.460311] 0000000073808173: ffffb89d00c6bce0 (0xffffb89d00c6bce0)
>> [ 160.460312] 00000000c9effff4: ffffffffaa1f48d1 (trace_call_bpf+0x81/0x100)
>> [ 160.460313] 00000000c5d8ebd1: ffffb89d00c6bcc0 (0xffffb89d00c6bcc0)
>> [ 160.460315] 00000000bce0b072: ffffffffab651be0
>> (event_sched_migrate_task+0xa0/0xa0)
>> [ 160.460316] 00000000355cf319: 0000000000000000 ...
>> [ 160.460316] 000000003b67f2ad: ffffd89cffc3ae80 (0xffffd89cffc3ae80)
>> [ 160.460316] 000000009a77e20b: ffff9ce3fba25000 (0xffff9ce3fba25000)
>> [ 160.460317] 0000000032cf7376: 0000000000000001 (0x1)
>> [ 160.460317] 000000000051db74: ffffb89d00c6bd20 (0xffffb89d00c6bd20)
>> [ 160.460318] 0000000040eb3bf7: ffffffffaa22be81
>> (perf_trace_run_bpf_submit+0x41/0xb0)
>
> Is there an easy way to recreate this?
>

The failure I care about can be reproduced with:

cd tools/testing/selftests/bpf
make
./test_progs
test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err
-1 errno 2

2019-05-22 23:48:09

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> On Wed, May 22, 2019 at 02:49:07PM +0000, Alexei Starovoitov wrote:
> > The one that is broken is prog_tests/stacktrace_map.c
> > There we attach bpf to standard tracepoint where
> > kernel suppose to collect pt_regs before calling into bpf.
> > And that's what bpf_get_stackid_tp() is doing.
> > It passes pt_regs (that was collected before any bpf)
> > into bpf_get_stackid() which calls get_perf_callchain().
> > Same thing with kprobes, uprobes.
>
> Is it trying to unwind through ___bpf_prog_run()?
>
> If so, that would at least explain why ORC isn't working. Objtool
> currently ignores that function because it can't follow the jump table.

Here's a tentative fix (for ORC, at least). I'll need to make sure this
doesn't break anything else.

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 242a643af82f..1d9a7cc4b836 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
BUG_ON(1);
return 0;
}
-STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */

#define PROG_NAME(stack_size) __bpf_prog_run##stack_size
#define DEFINE_BPF_PROG_RUN(stack_size) \
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 172f99195726..2567027fce95 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
if (text_rela->type == R_X86_64_PC32)
table_offset += 4;

- /*
- * Make sure the .rodata address isn't associated with a
- * symbol. gcc jump tables are anonymous data.
- */
- if (find_symbol_containing(rodata_sec, table_offset))
- continue;
-
rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
if (rodata_rela) {
/*

2019-05-23 06:49:44

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf <[email protected]> wrote:
>
> On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> > On Wed, May 22, 2019 at 02:49:07PM +0000, Alexei Starovoitov wrote:
> > > The one that is broken is prog_tests/stacktrace_map.c
> > > There we attach bpf to standard tracepoint where
> > > kernel suppose to collect pt_regs before calling into bpf.
> > > And that's what bpf_get_stackid_tp() is doing.
> > > It passes pt_regs (that was collected before any bpf)
> > > into bpf_get_stackid() which calls get_perf_callchain().
> > > Same thing with kprobes, uprobes.
> >
> > Is it trying to unwind through ___bpf_prog_run()?
> >
> > If so, that would at least explain why ORC isn't working. Objtool
> > currently ignores that function because it can't follow the jump table.
>
> Here's a tentative fix (for ORC, at least). I'll need to make sure this
> doesn't break anything else.
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 242a643af82f..1d9a7cc4b836 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
> BUG_ON(1);
> return 0;
> }
> -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
>
> #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
> #define DEFINE_BPF_PROG_RUN(stack_size) \
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 172f99195726..2567027fce95 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
> if (text_rela->type == R_X86_64_PC32)
> table_offset += 4;
>
> - /*
> - * Make sure the .rodata address isn't associated with a
> - * symbol. gcc jump tables are anonymous data.
> - */
> - if (find_symbol_containing(rodata_sec, table_offset))
> - continue;
> -
> rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
> if (rodata_rela) {
> /*

Hi Josh, this still won't fix the problem.

Problem is not (or not only) with ___bpf_prog_run, what actually went
wrong is with the JITed bpf code.

For frame pointer unwinder, it seems the JITed bpf code will have a
shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
unshift it properly then it will work.

I tried below code, and problem is fixed (only for frame pointer
unwinder though). Need to find a better way to detect and do any
similar trick for bpf part, if this is a feasible way to fix it:

diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 9b9fd4826e7a..2c0fa2aaa7e4 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
}

/* Move to the next frame if it's safe: */
- if (!update_stack_state(state, next_bp))
- goto bad_address;
+ if (!update_stack_state(state, next_bp)) {
+ // Try again with shifted BP
+ state->bp += 5; // see AUX_STACK_SPACE
+ next_bp = (unsigned long
*)READ_ONCE_TASK_STACK(state->task, *state->bp);
+ // Clean and refetch stack info, it's marked as error outed
+ state->stack_mask = 0;
+ get_stack_info(next_bp, state->task,
&state->stack_info, &state->stack_mask);
+ if (!update_stack_state(state, next_bp)) {
+ goto bad_address;
+ }
+ }

return true;

For ORC unwinder, I think the unwinder can't find any info about the
JITed part. Maybe if can let it just skip the JITed part and go to
kernel context, then should be good enough.

--
Best Regards,
Kairui Song

2019-05-23 08:31:38

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

> On May 22, 2019, at 11:48 PM, Kairui Song <[email protected]> wrote:
>
> On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf <[email protected]> wrote:
>>
>> On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
>>> On Wed, May 22, 2019 at 02:49:07PM +0000, Alexei Starovoitov wrote:
>>>> The one that is broken is prog_tests/stacktrace_map.c
>>>> There we attach bpf to standard tracepoint where
>>>> kernel suppose to collect pt_regs before calling into bpf.
>>>> And that's what bpf_get_stackid_tp() is doing.
>>>> It passes pt_regs (that was collected before any bpf)
>>>> into bpf_get_stackid() which calls get_perf_callchain().
>>>> Same thing with kprobes, uprobes.
>>>
>>> Is it trying to unwind through ___bpf_prog_run()?
>>>
>>> If so, that would at least explain why ORC isn't working. Objtool
>>> currently ignores that function because it can't follow the jump table.
>>
>> Here's a tentative fix (for ORC, at least). I'll need to make sure this
>> doesn't break anything else.
>>
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 242a643af82f..1d9a7cc4b836 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
>> BUG_ON(1);
>> return 0;
>> }
>> -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
>>
>> #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
>> #define DEFINE_BPF_PROG_RUN(stack_size) \
>> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
>> index 172f99195726..2567027fce95 100644
>> --- a/tools/objtool/check.c
>> +++ b/tools/objtool/check.c
>> @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
>> if (text_rela->type == R_X86_64_PC32)
>> table_offset += 4;
>>
>> - /*
>> - * Make sure the .rodata address isn't associated with a
>> - * symbol. gcc jump tables are anonymous data.
>> - */
>> - if (find_symbol_containing(rodata_sec, table_offset))
>> - continue;
>> -
>> rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
>> if (rodata_rela) {
>> /*
>
> Hi Josh, this still won't fix the problem.
>
> Problem is not (or not only) with ___bpf_prog_run, what actually went
> wrong is with the JITed bpf code.
>
> For frame pointer unwinder, it seems the JITed bpf code will have a
> shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
> unshift it properly then it will work.
>
> I tried below code, and problem is fixed (only for frame pointer
> unwinder though). Need to find a better way to detect and do any
> similar trick for bpf part, if this is a feasible way to fix it:
>
> diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> --- a/arch/x86/kernel/unwind_frame.c
> +++ b/arch/x86/kernel/unwind_frame.c
> @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> }
>
> /* Move to the next frame if it's safe: */
> - if (!update_stack_state(state, next_bp))
> - goto bad_address;
> + if (!update_stack_state(state, next_bp)) {
> + // Try again with shifted BP
> + state->bp += 5; // see AUX_STACK_SPACE
> + next_bp = (unsigned long
> *)READ_ONCE_TASK_STACK(state->task, *state->bp);
> + // Clean and refetch stack info, it's marked as error outed
> + state->stack_mask = 0;
> + get_stack_info(next_bp, state->task,
> &state->stack_info, &state->stack_mask);
> + if (!update_stack_state(state, next_bp)) {
> + goto bad_address;
> + }
> + }
>
> return true;
>
> For ORC unwinder, I think the unwinder can't find any info about the
> JITed part. Maybe if can let it just skip the JITed part and go to
> kernel context, then should be good enough.

In this case (tracepoint), the callchain bpf_get_stackid() fetches is the
callchain at the tracepoint. So we don't need the JITed part.

BPF program passes the regs at the tracepoint to perf_callchain_kernel().
However, perf_callchain_kernel() only uses regs->sp for !perf_hw_regs()
case. This is probably expected, as passing regs in doesn't really help.

There are multiple cases in unwind_orc.c:__unwind_start(), which I don't
understand very well.

Does the above make sense? Did I mis-understand anything?

@Alexei, do you remember some rough time/version that ORC unwinder works
well for tracepoints? Maybe we can dig into that version to see the
difference.

Thanks,
Song

>
>
>
>
>
> --
> Best Regards,
> Kairui Song

2019-05-23 09:13:45

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, May 23, 2019 at 4:28 PM Song Liu <[email protected]> wrote:
>
> > On May 22, 2019, at 11:48 PM, Kairui Song <[email protected]> wrote:
> >
> > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf <[email protected]> wrote:
> >>
> >> On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> >>> On Wed, May 22, 2019 at 02:49:07PM +0000, Alexei Starovoitov wrote:
> >>>> The one that is broken is prog_tests/stacktrace_map.c
> >>>> There we attach bpf to standard tracepoint where
> >>>> kernel suppose to collect pt_regs before calling into bpf.
> >>>> And that's what bpf_get_stackid_tp() is doing.
> >>>> It passes pt_regs (that was collected before any bpf)
> >>>> into bpf_get_stackid() which calls get_perf_callchain().
> >>>> Same thing with kprobes, uprobes.
> >>>
> >>> Is it trying to unwind through ___bpf_prog_run()?
> >>>
> >>> If so, that would at least explain why ORC isn't working. Objtool
> >>> currently ignores that function because it can't follow the jump table.
> >>
> >> Here's a tentative fix (for ORC, at least). I'll need to make sure this
> >> doesn't break anything else.
> >>
> >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> >> index 242a643af82f..1d9a7cc4b836 100644
> >> --- a/kernel/bpf/core.c
> >> +++ b/kernel/bpf/core.c
> >> @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
> >> BUG_ON(1);
> >> return 0;
> >> }
> >> -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
> >>
> >> #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
> >> #define DEFINE_BPF_PROG_RUN(stack_size) \
> >> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> >> index 172f99195726..2567027fce95 100644
> >> --- a/tools/objtool/check.c
> >> +++ b/tools/objtool/check.c
> >> @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
> >> if (text_rela->type == R_X86_64_PC32)
> >> table_offset += 4;
> >>
> >> - /*
> >> - * Make sure the .rodata address isn't associated with a
> >> - * symbol. gcc jump tables are anonymous data.
> >> - */
> >> - if (find_symbol_containing(rodata_sec, table_offset))
> >> - continue;
> >> -
> >> rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
> >> if (rodata_rela) {
> >> /*
> >
> > Hi Josh, this still won't fix the problem.
> >
> > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > wrong is with the JITed bpf code.
> >
> > For frame pointer unwinder, it seems the JITed bpf code will have a
> > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
> > unshift it properly then it will work.
> >
> > I tried below code, and problem is fixed (only for frame pointer
> > unwinder though). Need to find a better way to detect and do any
> > similar trick for bpf part, if this is a feasible way to fix it:
> >
> > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> > index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> > --- a/arch/x86/kernel/unwind_frame.c
> > +++ b/arch/x86/kernel/unwind_frame.c
> > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> > }
> >
> > /* Move to the next frame if it's safe: */
> > - if (!update_stack_state(state, next_bp))
> > - goto bad_address;
> > + if (!update_stack_state(state, next_bp)) {
> > + // Try again with shifted BP
> > + state->bp += 5; // see AUX_STACK_SPACE
> > + next_bp = (unsigned long
> > *)READ_ONCE_TASK_STACK(state->task, *state->bp);
> > + // Clean and refetch stack info, it's marked as error outed
> > + state->stack_mask = 0;
> > + get_stack_info(next_bp, state->task,
> > &state->stack_info, &state->stack_mask);
> > + if (!update_stack_state(state, next_bp)) {
> > + goto bad_address;
> > + }
> > + }
> >
> > return true;
> >
> > For ORC unwinder, I think the unwinder can't find any info about the
> > JITed part. Maybe if can let it just skip the JITed part and go to
> > kernel context, then should be good enough.
>
> In this case (tracepoint), the callchain bpf_get_stackid() fetches is the
> callchain at the tracepoint. So we don't need the JITed part.
>

We don't want the JITed part indeed, but the unwinder is now being
called to start directly within the BPF JITed program, so it have to
traceback above the JITed part and ignoring the JITed part.

> BPF program passes the regs at the tracepoint to perf_callchain_kernel().
> However, perf_callchain_kernel() only uses regs->sp for !perf_hw_regs()
> case. This is probably expected, as passing regs in doesn't really help.
>

ORC is not working with a partial dumped regs, and without frame
pointer, can't get a valid BP value in the tracepoint. So it only used
the sp as indicator of target frame. Unwinder was supposed to be
always able to unwind back correctly, so it should also be able to
reach the target frame, then start giving stack trace output.
Unfortunately this is not true when called through the JITed part...

> There are multiple cases in unwind_orc.c:__unwind_start(), which I don't
> understand very well.
>
> Does the above make sense? Did I mis-understand anything?

Yes, and thanks for the reply. See my other comments.

>
> @Alexei, do you remember some rough time/version that ORC unwinder works
> well for tracepoints? Maybe we can dig into that version to see the
> difference.
>

Is there really such a version/time? ORC was not working with every
use case of tracepoint before
d15d356887e770c5f2dcf963b52c7cb510c9e42d, as it require a fuller set
of regs from the tracepoint, however the handler called in the
tracepoint can't get the caller's required regs.

But before the commit d15d356887e770c5f2dcf963b52c7cb510c9e42d, it can
at least give one level of trace back (IP value), so I guess it just
passed the test.

--
Best Regards,
Kairui Song

2019-05-23 13:35:25

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, May 23, 2019 at 02:48:11PM +0800, Kairui Song wrote:
> On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf <[email protected]> wrote:
> >
> > On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> > > On Wed, May 22, 2019 at 02:49:07PM +0000, Alexei Starovoitov wrote:
> > > > The one that is broken is prog_tests/stacktrace_map.c
> > > > There we attach bpf to standard tracepoint where
> > > > kernel suppose to collect pt_regs before calling into bpf.
> > > > And that's what bpf_get_stackid_tp() is doing.
> > > > It passes pt_regs (that was collected before any bpf)
> > > > into bpf_get_stackid() which calls get_perf_callchain().
> > > > Same thing with kprobes, uprobes.
> > >
> > > Is it trying to unwind through ___bpf_prog_run()?
> > >
> > > If so, that would at least explain why ORC isn't working. Objtool
> > > currently ignores that function because it can't follow the jump table.
> >
> > Here's a tentative fix (for ORC, at least). I'll need to make sure this
> > doesn't break anything else.
> >
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 242a643af82f..1d9a7cc4b836 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
> > BUG_ON(1);
> > return 0;
> > }
> > -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
> >
> > #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
> > #define DEFINE_BPF_PROG_RUN(stack_size) \
> > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > index 172f99195726..2567027fce95 100644
> > --- a/tools/objtool/check.c
> > +++ b/tools/objtool/check.c
> > @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
> > if (text_rela->type == R_X86_64_PC32)
> > table_offset += 4;
> >
> > - /*
> > - * Make sure the .rodata address isn't associated with a
> > - * symbol. gcc jump tables are anonymous data.
> > - */
> > - if (find_symbol_containing(rodata_sec, table_offset))
> > - continue;
> > -
> > rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
> > if (rodata_rela) {
> > /*
>
> Hi Josh, this still won't fix the problem.
>
> Problem is not (or not only) with ___bpf_prog_run, what actually went
> wrong is with the JITed bpf code.

There seem to be a bunch of issues. My patch at least fixes the failing
selftest reported by Alexei for ORC.

How can I recreate your issue?

> For frame pointer unwinder, it seems the JITed bpf code will have a
> shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
> unshift it properly then it will work.

Yeah, that looks like a frame pointer bug in emit_prologue().

> I tried below code, and problem is fixed (only for frame pointer
> unwinder though). Need to find a better way to detect and do any
> similar trick for bpf part, if this is a feasible way to fix it:
>
> diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> --- a/arch/x86/kernel/unwind_frame.c
> +++ b/arch/x86/kernel/unwind_frame.c
> @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> }
>
> /* Move to the next frame if it's safe: */
> - if (!update_stack_state(state, next_bp))
> - goto bad_address;
> + if (!update_stack_state(state, next_bp)) {
> + // Try again with shifted BP
> + state->bp += 5; // see AUX_STACK_SPACE
> + next_bp = (unsigned long
> *)READ_ONCE_TASK_STACK(state->task, *state->bp);
> + // Clean and refetch stack info, it's marked as error outed
> + state->stack_mask = 0;
> + get_stack_info(next_bp, state->task,
> &state->stack_info, &state->stack_mask);
> + if (!update_stack_state(state, next_bp)) {
> + goto bad_address;
> + }
> + }
>
> return true;

Nack.

> For ORC unwinder, I think the unwinder can't find any info about the
> JITed part. Maybe if can let it just skip the JITed part and go to
> kernel context, then should be good enough.

If it's starting from a fake pt_regs then that's going to be a
challenge.

Will the JIT code always have the same stack layout? If so then we
could hard code that knowledge in ORC. Or even better, create a generic
interface for ORC to query the creator of the generated code about the
stack layout.

--
Josh

2019-05-23 14:53:12

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, May 23, 2019 at 9:32 PM Josh Poimboeuf <[email protected]> wrote:
>
> On Thu, May 23, 2019 at 02:48:11PM +0800, Kairui Song wrote:
> > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf <[email protected]> wrote:
> > >
> > > On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> > > > On Wed, May 22, 2019 at 02:49:07PM +0000, Alexei Starovoitov wrote:
> > > > > The one that is broken is prog_tests/stacktrace_map.c
> > > > > There we attach bpf to standard tracepoint where
> > > > > kernel suppose to collect pt_regs before calling into bpf.
> > > > > And that's what bpf_get_stackid_tp() is doing.
> > > > > It passes pt_regs (that was collected before any bpf)
> > > > > into bpf_get_stackid() which calls get_perf_callchain().
> > > > > Same thing with kprobes, uprobes.
> > > >
> > > > Is it trying to unwind through ___bpf_prog_run()?
> > > >
> > > > If so, that would at least explain why ORC isn't working. Objtool
> > > > currently ignores that function because it can't follow the jump table.
> > >
> > > Here's a tentative fix (for ORC, at least). I'll need to make sure this
> > > doesn't break anything else.
> > >
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index 242a643af82f..1d9a7cc4b836 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
> > > BUG_ON(1);
> > > return 0;
> > > }
> > > -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */
> > >
> > > #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
> > > #define DEFINE_BPF_PROG_RUN(stack_size) \
> > > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > > index 172f99195726..2567027fce95 100644
> > > --- a/tools/objtool/check.c
> > > +++ b/tools/objtool/check.c
> > > @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
> > > if (text_rela->type == R_X86_64_PC32)
> > > table_offset += 4;
> > >
> > > - /*
> > > - * Make sure the .rodata address isn't associated with a
> > > - * symbol. gcc jump tables are anonymous data.
> > > - */
> > > - if (find_symbol_containing(rodata_sec, table_offset))
> > > - continue;
> > > -
> > > rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
> > > if (rodata_rela) {
> > > /*
> >
> > Hi Josh, this still won't fix the problem.
> >
> > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > wrong is with the JITed bpf code.
>
> There seem to be a bunch of issues. My patch at least fixes the failing
> selftest reported by Alexei for ORC.
>
> How can I recreate your issue?

Hmm, I used bcc's example to attach bpf to trace point, and with that
fix stack trace is still invalid.

CMD I used with bcc:
python3 ./tools/stackcount.py t:sched:sched_fork

And I just had another try applying your patch, self test is also failing.

I'm applying on my local master branch, a few days older than
upstream, I can update and try again, am I missing anything?

>
> > For frame pointer unwinder, it seems the JITed bpf code will have a
> > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
> > unshift it properly then it will work.
>
> Yeah, that looks like a frame pointer bug in emit_prologue().
>
> > I tried below code, and problem is fixed (only for frame pointer
> > unwinder though). Need to find a better way to detect and do any
> > similar trick for bpf part, if this is a feasible way to fix it:
> >
> > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> > index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> > --- a/arch/x86/kernel/unwind_frame.c
> > +++ b/arch/x86/kernel/unwind_frame.c
> > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> > }
> >
> > /* Move to the next frame if it's safe: */
> > - if (!update_stack_state(state, next_bp))
> > - goto bad_address;
> > + if (!update_stack_state(state, next_bp)) {
> > + // Try again with shifted BP
> > + state->bp += 5; // see AUX_STACK_SPACE
> > + next_bp = (unsigned long
> > *)READ_ONCE_TASK_STACK(state->task, *state->bp);
> > + // Clean and refetch stack info, it's marked as error outed
> > + state->stack_mask = 0;
> > + get_stack_info(next_bp, state->task,
> > &state->stack_info, &state->stack_mask);
> > + if (!update_stack_state(state, next_bp)) {
> > + goto bad_address;
> > + }
> > + }
> >
> > return true;
>
> Nack.
>
> > For ORC unwinder, I think the unwinder can't find any info about the
> > JITed part. Maybe if can let it just skip the JITed part and go to
> > kernel context, then should be good enough.
>
> If it's starting from a fake pt_regs then that's going to be a
> challenge.
>
> Will the JIT code always have the same stack layout? If so then we
> could hard code that knowledge in ORC. Or even better, create a generic
> interface for ORC to query the creator of the generated code about the
> stack layout.

I think yes.

Not sure why we have the BP shift yet, if the prolog code could be
tweaked to work with frame pointer unwinder it will be good to have.
But still not for ORC.

Will it be a good idea to have a region reserved for the JITed code?
Currently it shares the region with "module mapping space". If let it
have a separate region, when the unwinder meet any code in that region
it will know it's JITed code and then can do something special about
it.

This should make it much easier for both frame pointer and ORC unwinder to work.

--
Best Regards,
Kairui Song

2019-05-23 15:26:04

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > Hi Josh, this still won't fix the problem.
> > >
> > > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > > wrong is with the JITed bpf code.
> >
> > There seem to be a bunch of issues. My patch at least fixes the failing
> > selftest reported by Alexei for ORC.
> >
> > How can I recreate your issue?
>
> Hmm, I used bcc's example to attach bpf to trace point, and with that
> fix stack trace is still invalid.
>
> CMD I used with bcc:
> python3 ./tools/stackcount.py t:sched:sched_fork

I've had problems in the past getting bcc to build, so I was hoping it
was reproducible with a standalone selftest.

> And I just had another try applying your patch, self test is also failing.

Is it the same selftest reported by Alexei?

test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2

> I'm applying on my local master branch, a few days older than
> upstream, I can update and try again, am I missing anything?

The above patch had some issues, so with some configs you might see an
objtool warning for ___bpf_prog_run(), in which case the patch doesn't
fix the test_stacktrace_map selftest.

Here's the latest version which should fix it in all cases (based on
tip/master):

https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix

> > > For frame pointer unwinder, it seems the JITed bpf code will have a
> > > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
> > > unshift it properly then it will work.
> >
> > Yeah, that looks like a frame pointer bug in emit_prologue().
> >
> > > I tried below code, and problem is fixed (only for frame pointer
> > > unwinder though). Need to find a better way to detect and do any
> > > similar trick for bpf part, if this is a feasible way to fix it:
> > >
> > > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> > > index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> > > --- a/arch/x86/kernel/unwind_frame.c
> > > +++ b/arch/x86/kernel/unwind_frame.c
> > > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> > > }
> > >
> > > /* Move to the next frame if it's safe: */
> > > - if (!update_stack_state(state, next_bp))
> > > - goto bad_address;
> > > + if (!update_stack_state(state, next_bp)) {
> > > + // Try again with shifted BP
> > > + state->bp += 5; // see AUX_STACK_SPACE
> > > + next_bp = (unsigned long
> > > *)READ_ONCE_TASK_STACK(state->task, *state->bp);
> > > + // Clean and refetch stack info, it's marked as error outed
> > > + state->stack_mask = 0;
> > > + get_stack_info(next_bp, state->task,
> > > &state->stack_info, &state->stack_mask);
> > > + if (!update_stack_state(state, next_bp)) {
> > > + goto bad_address;
> > > + }
> > > + }
> > >
> > > return true;
> >
> > Nack.
> >
> > > For ORC unwinder, I think the unwinder can't find any info about the
> > > JITed part. Maybe if can let it just skip the JITed part and go to
> > > kernel context, then should be good enough.
> >
> > If it's starting from a fake pt_regs then that's going to be a
> > challenge.
> >
> > Will the JIT code always have the same stack layout? If so then we
> > could hard code that knowledge in ORC. Or even better, create a generic
> > interface for ORC to query the creator of the generated code about the
> > stack layout.
>
> I think yes.
>
> Not sure why we have the BP shift yet, if the prolog code could be
> tweaked to work with frame pointer unwinder it will be good to have.
> But still not for ORC.
>
> Will it be a good idea to have a region reserved for the JITed code?
> Currently it shares the region with "module mapping space". If let it
> have a separate region, when the unwinder meet any code in that region
> it will know it's JITed code and then can do something special about
> it.
>
> This should make it much easier for both frame pointer and ORC unwinder to work.

There's no need to put special cases in the FP unwinder when we can
instead just fix the frame pointer usage in the JIT code.

For ORC, I'm thinking we may be able to just require that all generated
code (BPF and others) always use frame pointers. Then when ORC doesn't
recognize a code address, it could try using the frame pointer as a
fallback.

Once I get a reproducer I can do the patches for all that.

--
Josh

2019-05-23 16:43:50

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf <[email protected]> wrote:
>
> On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > Hi Josh, this still won't fix the problem.
> > > >
> > > > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > > > wrong is with the JITed bpf code.
> > >
> > > There seem to be a bunch of issues. My patch at least fixes the failing
> > > selftest reported by Alexei for ORC.
> > >
> > > How can I recreate your issue?
> >
> > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > fix stack trace is still invalid.
> >
> > CMD I used with bcc:
> > python3 ./tools/stackcount.py t:sched:sched_fork
>
> I've had problems in the past getting bcc to build, so I was hoping it
> was reproducible with a standalone selftest.
>
> > And I just had another try applying your patch, self test is also failing.
>
> Is it the same selftest reported by Alexei?
>
> test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
>
> > I'm applying on my local master branch, a few days older than
> > upstream, I can update and try again, am I missing anything?
>
> The above patch had some issues, so with some configs you might see an
> objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> fix the test_stacktrace_map selftest.
>
> Here's the latest version which should fix it in all cases (based on
> tip/master):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix

Hmm, I still get the failure:
test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap
err -1 errno 2

And I didn't see how this will fix the issue. As long as ORC need to
unwind through the JITed code it will fail. And that will happen
before reaching ___bpf_prog_run.

>
> > > > For frame pointer unwinder, it seems the JITed bpf code will have a
> > > > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can
> > > > unshift it properly then it will work.
> > >
> > > Yeah, that looks like a frame pointer bug in emit_prologue().
> > >
> > > > I tried below code, and problem is fixed (only for frame pointer
> > > > unwinder though). Need to find a better way to detect and do any
> > > > similar trick for bpf part, if this is a feasible way to fix it:
> > > >
> > > > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> > > > index 9b9fd4826e7a..2c0fa2aaa7e4 100644
> > > > --- a/arch/x86/kernel/unwind_frame.c
> > > > +++ b/arch/x86/kernel/unwind_frame.c
> > > > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
> > > > }
> > > >
> > > > /* Move to the next frame if it's safe: */
> > > > - if (!update_stack_state(state, next_bp))
> > > > - goto bad_address;
> > > > + if (!update_stack_state(state, next_bp)) {
> > > > + // Try again with shifted BP
> > > > + state->bp += 5; // see AUX_STACK_SPACE
> > > > + next_bp = (unsigned long
> > > > *)READ_ONCE_TASK_STACK(state->task, *state->bp);
> > > > + // Clean and refetch stack info, it's marked as error outed
> > > > + state->stack_mask = 0;
> > > > + get_stack_info(next_bp, state->task,
> > > > &state->stack_info, &state->stack_mask);
> > > > + if (!update_stack_state(state, next_bp)) {
> > > > + goto bad_address;
> > > > + }
> > > > + }
> > > >
> > > > return true;
> > >
> > > Nack.
> > >
> > > > For ORC unwinder, I think the unwinder can't find any info about the
> > > > JITed part. Maybe if can let it just skip the JITed part and go to
> > > > kernel context, then should be good enough.
> > >
> > > If it's starting from a fake pt_regs then that's going to be a
> > > challenge.
> > >
> > > Will the JIT code always have the same stack layout? If so then we
> > > could hard code that knowledge in ORC. Or even better, create a generic
> > > interface for ORC to query the creator of the generated code about the
> > > stack layout.
> >
> > I think yes.
> >
> > Not sure why we have the BP shift yet, if the prolog code could be
> > tweaked to work with frame pointer unwinder it will be good to have.
> > But still not for ORC.
> >
> > Will it be a good idea to have a region reserved for the JITed code?
> > Currently it shares the region with "module mapping space". If let it
> > have a separate region, when the unwinder meet any code in that region
> > it will know it's JITed code and then can do something special about
> > it.
> >
> > This should make it much easier for both frame pointer and ORC unwinder to work.
>
> There's no need to put special cases in the FP unwinder when we can
> instead just fix the frame pointer usage in the JIT code.
>
> For ORC, I'm thinking we may be able to just require that all generated
> code (BPF and others) always use frame pointers. Then when ORC doesn't
> recognize a code address, it could try using the frame pointer as a
> fallback.

Right, this sounds the right way to fix it, I believe this can fix
everything well.

>
> Once I get a reproducer I can do the patches for all that.
>
> --
> Josh
--
Best Regards,
Kairui Song

2019-05-23 17:29:52

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
> On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf <[email protected]> wrote:
> >
> > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > > Hi Josh, this still won't fix the problem.
> > > > >
> > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > > > > wrong is with the JITed bpf code.
> > > >
> > > > There seem to be a bunch of issues. My patch at least fixes the failing
> > > > selftest reported by Alexei for ORC.
> > > >
> > > > How can I recreate your issue?
> > >
> > > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > > fix stack trace is still invalid.
> > >
> > > CMD I used with bcc:
> > > python3 ./tools/stackcount.py t:sched:sched_fork
> >
> > I've had problems in the past getting bcc to build, so I was hoping it
> > was reproducible with a standalone selftest.
> >
> > > And I just had another try applying your patch, self test is also failing.
> >
> > Is it the same selftest reported by Alexei?
> >
> > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> >
> > > I'm applying on my local master branch, a few days older than
> > > upstream, I can update and try again, am I missing anything?
> >
> > The above patch had some issues, so with some configs you might see an
> > objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> > fix the test_stacktrace_map selftest.
> >
> > Here's the latest version which should fix it in all cases (based on
> > tip/master):
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
>
> Hmm, I still get the failure:
> test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap
> err -1 errno 2
>
> And I didn't see how this will fix the issue. As long as ORC need to
> unwind through the JITed code it will fail. And that will happen
> before reaching ___bpf_prog_run.

Ok, I was able to recreate by doing

echo 1 > /proc/sys/net/core/bpf_jit_enable

first. I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.

--
Josh

2019-05-24 02:23:36

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
> > On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf <[email protected]> wrote:
> > >
> > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > > > Hi Josh, this still won't fix the problem.
> > > > > >
> > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > > > > > wrong is with the JITed bpf code.
> > > > >
> > > > > There seem to be a bunch of issues. My patch at least fixes the failing
> > > > > selftest reported by Alexei for ORC.
> > > > >
> > > > > How can I recreate your issue?
> > > >
> > > > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > > > fix stack trace is still invalid.
> > > >
> > > > CMD I used with bcc:
> > > > python3 ./tools/stackcount.py t:sched:sched_fork
> > >
> > > I've had problems in the past getting bcc to build, so I was hoping it
> > > was reproducible with a standalone selftest.
> > >
> > > > And I just had another try applying your patch, self test is also failing.
> > >
> > > Is it the same selftest reported by Alexei?
> > >
> > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> > >
> > > > I'm applying on my local master branch, a few days older than
> > > > upstream, I can update and try again, am I missing anything?
> > >
> > > The above patch had some issues, so with some configs you might see an
> > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> > > fix the test_stacktrace_map selftest.
> > >
> > > Here's the latest version which should fix it in all cases (based on
> > > tip/master):
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
> >
> > Hmm, I still get the failure:
> > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap
> > err -1 errno 2
> >
> > And I didn't see how this will fix the issue. As long as ORC need to
> > unwind through the JITed code it will fail. And that will happen
> > before reaching ___bpf_prog_run.
>
> Ok, I was able to recreate by doing
>
> echo 1 > /proc/sys/net/core/bpf_jit_enable
>
> first. I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.
>

Yes, with JIT off it will be fixed. I can confirm that.

--
Best Regards,
Kairui Song

2019-05-24 08:55:44

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, May 23, 2019 at 10:24:13AM -0500, Josh Poimboeuf wrote:

> Here's the latest version which should fix it in all cases (based on
> tip/master):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix

That patch suffers an inconsitency, the comment states:

'if they have "jump_table" in the name'

while the actual code implements:

'if the name starts with "jump_table"'

Other than that, I suppose that works just fine ;-)

> There's no need to put special cases in the FP unwinder when we can
> instead just fix the frame pointer usage in the JIT code.
>
> For ORC, I'm thinking we may be able to just require that all generated
> code (BPF and others) always use frame pointers. Then when ORC doesn't
> recognize a code address, it could try using the frame pointer as a
> fallback.

Yes, this seems like a sensible approach. We'd also have to audit the
ftrace and kprobe trampolines, IIRC they only do framepointer setup for
CONFIG_FRAME_POINTER currently, which should be easy to fix (after the
patches I have to fix the FP generation in the first place:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/wip
)

2019-05-24 13:08:08

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 24, 2019 at 10:53:19AM +0200, Peter Zijlstra wrote:
> On Thu, May 23, 2019 at 10:24:13AM -0500, Josh Poimboeuf wrote:
>
> > Here's the latest version which should fix it in all cases (based on
> > tip/master):
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
>
> That patch suffers an inconsitency, the comment states:
>
> 'if they have "jump_table" in the name'
>
> while the actual code implements:
>
> 'if the name starts with "jump_table"'
>
> Other than that, I suppose that works just fine ;-)

The thing is, gcc converts a static local variable named "jump_table" to
an ELF symbol with a numbered suffix, something like "jump_table.12345".
But yeah I should at least clarify that in the comment.

--
Josh

2019-05-24 23:24:44

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 24, 2019 at 10:20:52AM +0800, Kairui Song wrote:
> On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf <[email protected]> wrote:
> >
> > On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
> > > On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf <[email protected]> wrote:
> > > >
> > > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > > > > Hi Josh, this still won't fix the problem.
> > > > > > >
> > > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > > > > > > wrong is with the JITed bpf code.
> > > > > >
> > > > > > There seem to be a bunch of issues. My patch at least fixes the failing
> > > > > > selftest reported by Alexei for ORC.
> > > > > >
> > > > > > How can I recreate your issue?
> > > > >
> > > > > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > > > > fix stack trace is still invalid.
> > > > >
> > > > > CMD I used with bcc:
> > > > > python3 ./tools/stackcount.py t:sched:sched_fork
> > > >
> > > > I've had problems in the past getting bcc to build, so I was hoping it
> > > > was reproducible with a standalone selftest.
> > > >
> > > > > And I just had another try applying your patch, self test is also failing.
> > > >
> > > > Is it the same selftest reported by Alexei?
> > > >
> > > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> > > >
> > > > > I'm applying on my local master branch, a few days older than
> > > > > upstream, I can update and try again, am I missing anything?
> > > >
> > > > The above patch had some issues, so with some configs you might see an
> > > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> > > > fix the test_stacktrace_map selftest.
> > > >
> > > > Here's the latest version which should fix it in all cases (based on
> > > > tip/master):
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
> > >
> > > Hmm, I still get the failure:
> > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap
> > > err -1 errno 2
> > >
> > > And I didn't see how this will fix the issue. As long as ORC need to
> > > unwind through the JITed code it will fail. And that will happen
> > > before reaching ___bpf_prog_run.
> >
> > Ok, I was able to recreate by doing
> >
> > echo 1 > /proc/sys/net/core/bpf_jit_enable
> >
> > first. I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.
> >
>
> Yes, with JIT off it will be fixed. I can confirm that.

Here's a tentative BPF fix for the JIT frame pointer issue. It was a
bit harder than I expected. Encoding r12 as a base register requires a
SIB byte, so I had to add support for encoding that. I also simplified
the prologue to resemble a GCC prologue, which decreases the prologue
size quite a bit.

Next week I can work on the corresponding ORC change. Then I can clean
all the patches up and submit them properly.

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index afabf597c855..c9b4503558c9 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -104,9 +104,8 @@ static int bpf_size_to_x86_bytes(int bpf_size)
/*
* The following table maps BPF registers to x86-64 registers.
*
- * x86-64 register R12 is unused, since if used as base address
- * register in load/store instructions, it always needs an
- * extra byte of encoding and is callee saved.
+ * RBP isn't used; it needs to be preserved to allow the unwinder to move
+ * through generated code stacks.
*
* Also x86-64 register R9 is unused. x86-64 register R10 is
* used for blinding (if enabled).
@@ -122,7 +121,7 @@ static const int reg2hex[] = {
[BPF_REG_7] = 5, /* R13 callee saved */
[BPF_REG_8] = 6, /* R14 callee saved */
[BPF_REG_9] = 7, /* R15 callee saved */
- [BPF_REG_FP] = 5, /* RBP readonly */
+ [BPF_REG_FP] = 4, /* R12 readonly */
[BPF_REG_AX] = 2, /* R10 temp register */
[AUX_REG] = 3, /* R11 temp register */
};
@@ -139,6 +138,7 @@ static bool is_ereg(u32 reg)
BIT(BPF_REG_7) |
BIT(BPF_REG_8) |
BIT(BPF_REG_9) |
+ BIT(BPF_REG_FP) |
BIT(BPF_REG_AX));
}

@@ -147,6 +147,11 @@ static bool is_axreg(u32 reg)
return reg == BPF_REG_0;
}

+static bool is_sib_reg(u32 reg)
+{
+ return reg == BPF_REG_FP;
+}
+
/* Add modifiers if 'reg' maps to x86-64 registers R8..R15 */
static u8 add_1mod(u8 byte, u32 reg)
{
@@ -190,15 +195,13 @@ struct jit_context {
#define BPF_MAX_INSN_SIZE 128
#define BPF_INSN_SAFETY 64

-#define AUX_STACK_SPACE 40 /* Space for RBX, R13, R14, R15, tailcnt */
-
-#define PROLOGUE_SIZE 37
+#define PROLOGUE_SIZE 25

/*
* Emit x86-64 prologue code for BPF program and check its size.
* bpf_tail_call helper will skip it while jumping into another program
*/
-static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
+static void emit_prologue(u8 **pprog, u32 stack_depth)
{
u8 *prog = *pprog;
int cnt = 0;
@@ -206,40 +209,67 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
/* push rbp */
EMIT1(0x55);

- /* mov rbp,rsp */
+ /* mov rbp, rsp */
EMIT3(0x48, 0x89, 0xE5);

- /* sub rsp, rounded_stack_depth + AUX_STACK_SPACE */
- EMIT3_off32(0x48, 0x81, 0xEC,
- round_up(stack_depth, 8) + AUX_STACK_SPACE);
+ /* push r15 */
+ EMIT2(0x41, 0x57);
+ /* push r14 */
+ EMIT2(0x41, 0x56);
+ /* push r13 */
+ EMIT2(0x41, 0x55);
+ /* push r12 */
+ EMIT2(0x41, 0x54);
+ /* push rbx */
+ EMIT1(0x53);

- /* sub rbp, AUX_STACK_SPACE */
- EMIT4(0x48, 0x83, 0xED, AUX_STACK_SPACE);
+ /*
+ * Push the tail call counter (tail_call_cnt) for eBPF tail calls.
+ * Initialized to zero.
+ *
+ * push $0
+ */
+ EMIT2(0x6a, 0x00);

- /* mov qword ptr [rbp+0],rbx */
- EMIT4(0x48, 0x89, 0x5D, 0);
- /* mov qword ptr [rbp+8],r13 */
- EMIT4(0x4C, 0x89, 0x6D, 8);
- /* mov qword ptr [rbp+16],r14 */
- EMIT4(0x4C, 0x89, 0x75, 16);
- /* mov qword ptr [rbp+24],r15 */
- EMIT4(0x4C, 0x89, 0x7D, 24);
+ /*
+ * R12 is used for the BPF program's FP register. It points to the end
+ * of the program's stack area.
+ *
+ * mov r12, rsp
+ */
+ EMIT3(0x49, 0x89, 0xE4);

- if (!ebpf_from_cbpf) {
- /*
- * Clear the tail call counter (tail_call_cnt): for eBPF tail
- * calls we need to reset the counter to 0. It's done in two
- * instructions, resetting RAX register to 0, and moving it
- * to the counter location.
- */
+ /* sub rsp, rounded_stack_depth */
+ EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));

- /* xor eax, eax */
- EMIT2(0x31, 0xc0);
- /* mov qword ptr [rbp+32], rax */
- EMIT4(0x48, 0x89, 0x45, 32);
+ BUILD_BUG_ON(cnt != PROLOGUE_SIZE);

- BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
- }
+ *pprog = prog;
+}
+
+static void emit_epilogue(u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+
+ /* lea rsp, [rbp-0x28] */
+ EMIT4(0x48, 0x8D, 0x65, 0xD8);
+
+ /* pop rbx */
+ EMIT1(0x5B);
+ /* pop r12 */
+ EMIT2(0x41, 0x5C);
+ /* pop r13 */
+ EMIT2(0x41, 0x5D);
+ /* pop r14 */
+ EMIT2(0x41, 0x5E);
+ /* pop r15 */
+ EMIT2(0x41, 0x5F);
+ /* pop rbp */
+ EMIT1(0x5D);
+
+ /* ret */
+ EMIT1(0xC3);

*pprog = prog;
}
@@ -277,7 +307,7 @@ static void emit_bpf_tail_call(u8 **pprog)
EMIT2(0x89, 0xD2); /* mov edx, edx */
EMIT3(0x39, 0x56, /* cmp dword ptr [rsi + 16], edx */
offsetof(struct bpf_array, map.max_entries));
-#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
+#define OFFSET1 (35 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
EMIT2(X86_JBE, OFFSET1); /* jbe out */
label1 = cnt;

@@ -285,13 +315,13 @@ static void emit_bpf_tail_call(u8 **pprog)
* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
* goto out;
*/
- EMIT2_off32(0x8B, 0x85, 36); /* mov eax, dword ptr [rbp + 36] */
+ EMIT3(0x8B, 0x45, 0xD4); /* mov eax, dword ptr [rbp - 44] */
EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */
-#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
+#define OFFSET2 (27 + RETPOLINE_RAX_BPF_JIT_SIZE)
EMIT2(X86_JA, OFFSET2); /* ja out */
label2 = cnt;
EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */
- EMIT2_off32(0x89, 0x85, 36); /* mov dword ptr [rbp + 36], eax */
+ EMIT3(0x89, 0x45, 0xD4); /* mov dword ptr [rbp - 44], eax */

/* prog = array->ptrs[index]; */
EMIT4_off32(0x48, 0x8B, 0x84, 0xD6, /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
@@ -419,8 +449,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
int proglen = 0;
u8 *prog = temp;

- emit_prologue(&prog, bpf_prog->aux->stack_depth,
- bpf_prog_was_classic(bpf_prog));
+ emit_prologue(&prog, bpf_prog->aux->stack_depth);

for (i = 0; i < insn_cnt; i++, insn++) {
const s32 imm32 = insn->imm;
@@ -767,10 +796,19 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
case BPF_ST | BPF_MEM | BPF_DW:
EMIT2(add_1mod(0x48, dst_reg), 0xC7);

-st: if (is_imm8(insn->off))
- EMIT2(add_1reg(0x40, dst_reg), insn->off);
+st:
+ if (is_imm8(insn->off))
+ EMIT1(add_1reg(0x40, dst_reg));
else
- EMIT1_off32(add_1reg(0x80, dst_reg), insn->off);
+ EMIT1(add_1reg(0x80, dst_reg));
+
+ if (is_sib_reg(dst_reg))
+ EMIT1(add_1reg(0x20, dst_reg));
+
+ if (is_imm8(insn->off))
+ EMIT1(insn->off);
+ else
+ EMIT(insn->off, 4);

EMIT(imm32, bpf_size_to_x86_bytes(BPF_SIZE(insn->code)));
break;
@@ -799,11 +837,19 @@ st: if (is_imm8(insn->off))
goto stx;
case BPF_STX | BPF_MEM | BPF_DW:
EMIT2(add_2mod(0x48, dst_reg, src_reg), 0x89);
-stx: if (is_imm8(insn->off))
- EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
+stx:
+ if (is_imm8(insn->off))
+ EMIT1(add_2reg(0x40, dst_reg, src_reg));
+ else
+ EMIT1(add_2reg(0x80, dst_reg, src_reg));
+
+ if (is_sib_reg(dst_reg))
+ EMIT1(add_1reg(0x20, dst_reg));
+
+ if (is_imm8(insn->off))
+ EMIT1(insn->off);
else
- EMIT1_off32(add_2reg(0x80, dst_reg, src_reg),
- insn->off);
+ EMIT(insn->off, 4);
break;

/* LDX: dst_reg = *(u8*)(src_reg + off) */
@@ -825,16 +871,24 @@ stx: if (is_imm8(insn->off))
case BPF_LDX | BPF_MEM | BPF_DW:
/* Emit 'mov rax, qword ptr [rax+0x14]' */
EMIT2(add_2mod(0x48, src_reg, dst_reg), 0x8B);
-ldx: /*
+ldx:
+ /*
* If insn->off == 0 we can save one extra byte, but
* special case of x86 R13 which always needs an offset
* is not worth the hassle
*/
if (is_imm8(insn->off))
- EMIT2(add_2reg(0x40, src_reg, dst_reg), insn->off);
+ EMIT1(add_2reg(0x40, src_reg, dst_reg));
else
- EMIT1_off32(add_2reg(0x80, src_reg, dst_reg),
- insn->off);
+ EMIT1(add_2reg(0x80, src_reg, dst_reg));
+
+ if (is_sib_reg(src_reg))
+ EMIT1(add_1reg(0x20, src_reg));
+
+ if (is_imm8(insn->off))
+ EMIT1(insn->off);
+ else
+ EMIT(insn->off, 4);
break;

/* STX XADD: lock *(u32*)(dst_reg + off) += src_reg */
@@ -847,11 +901,19 @@ stx: if (is_imm8(insn->off))
goto xadd;
case BPF_STX | BPF_XADD | BPF_DW:
EMIT3(0xF0, add_2mod(0x48, dst_reg, src_reg), 0x01);
-xadd: if (is_imm8(insn->off))
- EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
+xadd:
+ if (is_imm8(insn->off))
+ EMIT1(add_2reg(0x40, dst_reg, src_reg));
else
- EMIT1_off32(add_2reg(0x80, dst_reg, src_reg),
- insn->off);
+ EMIT1(add_2reg(0x80, dst_reg, src_reg));
+
+ if (is_sib_reg(dst_reg))
+ EMIT1(add_1reg(0x20, dst_reg));
+
+ if (is_imm8(insn->off))
+ EMIT1(insn->off);
+ else
+ EMIT(insn->off, 4);
break;

/* call */
@@ -1040,19 +1102,8 @@ xadd: if (is_imm8(insn->off))
seen_exit = true;
/* Update cleanup_addr */
ctx->cleanup_addr = proglen;
- /* mov rbx, qword ptr [rbp+0] */
- EMIT4(0x48, 0x8B, 0x5D, 0);
- /* mov r13, qword ptr [rbp+8] */
- EMIT4(0x4C, 0x8B, 0x6D, 8);
- /* mov r14, qword ptr [rbp+16] */
- EMIT4(0x4C, 0x8B, 0x75, 16);
- /* mov r15, qword ptr [rbp+24] */
- EMIT4(0x4C, 0x8B, 0x7D, 24);
-
- /* add rbp, AUX_STACK_SPACE */
- EMIT4(0x48, 0x83, 0xC5, AUX_STACK_SPACE);
- EMIT1(0xC9); /* leave */
- EMIT1(0xC3); /* ret */
+
+ emit_epilogue(&prog);
break;

default:

2019-05-27 11:59:48

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Sat, May 25, 2019 at 7:23 AM Josh Poimboeuf <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 10:20:52AM +0800, Kairui Song wrote:
> > On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf <[email protected]> wrote:
> > >
> > > On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
> > > > On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf <[email protected]> wrote:
> > > > >
> > > > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > > > > > Hi Josh, this still won't fix the problem.
> > > > > > > >
> > > > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > > > > > > > wrong is with the JITed bpf code.
> > > > > > >
> > > > > > > There seem to be a bunch of issues. My patch at least fixes the failing
> > > > > > > selftest reported by Alexei for ORC.
> > > > > > >
> > > > > > > How can I recreate your issue?
> > > > > >
> > > > > > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > > > > > fix stack trace is still invalid.
> > > > > >
> > > > > > CMD I used with bcc:
> > > > > > python3 ./tools/stackcount.py t:sched:sched_fork
> > > > >
> > > > > I've had problems in the past getting bcc to build, so I was hoping it
> > > > > was reproducible with a standalone selftest.
> > > > >
> > > > > > And I just had another try applying your patch, self test is also failing.
> > > > >
> > > > > Is it the same selftest reported by Alexei?
> > > > >
> > > > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> > > > >
> > > > > > I'm applying on my local master branch, a few days older than
> > > > > > upstream, I can update and try again, am I missing anything?
> > > > >
> > > > > The above patch had some issues, so with some configs you might see an
> > > > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> > > > > fix the test_stacktrace_map selftest.
> > > > >
> > > > > Here's the latest version which should fix it in all cases (based on
> > > > > tip/master):
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
> > > >
> > > > Hmm, I still get the failure:
> > > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap
> > > > err -1 errno 2
> > > >
> > > > And I didn't see how this will fix the issue. As long as ORC need to
> > > > unwind through the JITed code it will fail. And that will happen
> > > > before reaching ___bpf_prog_run.
> > >
> > > Ok, I was able to recreate by doing
> > >
> > > echo 1 > /proc/sys/net/core/bpf_jit_enable
> > >
> > > first. I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.
> > >
> >
> > Yes, with JIT off it will be fixed. I can confirm that.
>
> Here's a tentative BPF fix for the JIT frame pointer issue. It was a
> bit harder than I expected. Encoding r12 as a base register requires a
> SIB byte, so I had to add support for encoding that. I also simplified
> the prologue to resemble a GCC prologue, which decreases the prologue
> size quite a bit.
>
> Next week I can work on the corresponding ORC change. Then I can clean
> all the patches up and submit them properly.
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index afabf597c855..c9b4503558c9 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -104,9 +104,8 @@ static int bpf_size_to_x86_bytes(int bpf_size)
> /*
> * The following table maps BPF registers to x86-64 registers.
> *
> - * x86-64 register R12 is unused, since if used as base address
> - * register in load/store instructions, it always needs an
> - * extra byte of encoding and is callee saved.
> + * RBP isn't used; it needs to be preserved to allow the unwinder to move
> + * through generated code stacks.
> *
> * Also x86-64 register R9 is unused. x86-64 register R10 is
> * used for blinding (if enabled).
> @@ -122,7 +121,7 @@ static const int reg2hex[] = {
> [BPF_REG_7] = 5, /* R13 callee saved */
> [BPF_REG_8] = 6, /* R14 callee saved */
> [BPF_REG_9] = 7, /* R15 callee saved */
> - [BPF_REG_FP] = 5, /* RBP readonly */
> + [BPF_REG_FP] = 4, /* R12 readonly */
> [BPF_REG_AX] = 2, /* R10 temp register */
> [AUX_REG] = 3, /* R11 temp register */
> };
> @@ -139,6 +138,7 @@ static bool is_ereg(u32 reg)
> BIT(BPF_REG_7) |
> BIT(BPF_REG_8) |
> BIT(BPF_REG_9) |
> + BIT(BPF_REG_FP) |
> BIT(BPF_REG_AX));
> }
>
> @@ -147,6 +147,11 @@ static bool is_axreg(u32 reg)
> return reg == BPF_REG_0;
> }
>
> +static bool is_sib_reg(u32 reg)
> +{
> + return reg == BPF_REG_FP;
> +}
> +
> /* Add modifiers if 'reg' maps to x86-64 registers R8..R15 */
> static u8 add_1mod(u8 byte, u32 reg)
> {
> @@ -190,15 +195,13 @@ struct jit_context {
> #define BPF_MAX_INSN_SIZE 128
> #define BPF_INSN_SAFETY 64
>
> -#define AUX_STACK_SPACE 40 /* Space for RBX, R13, R14, R15, tailcnt */
> -
> -#define PROLOGUE_SIZE 37
> +#define PROLOGUE_SIZE 25
>
> /*
> * Emit x86-64 prologue code for BPF program and check its size.
> * bpf_tail_call helper will skip it while jumping into another program
> */
> -static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> +static void emit_prologue(u8 **pprog, u32 stack_depth)
> {
> u8 *prog = *pprog;
> int cnt = 0;
> @@ -206,40 +209,67 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> /* push rbp */
> EMIT1(0x55);
>
> - /* mov rbp,rsp */
> + /* mov rbp, rsp */
> EMIT3(0x48, 0x89, 0xE5);
>
> - /* sub rsp, rounded_stack_depth + AUX_STACK_SPACE */
> - EMIT3_off32(0x48, 0x81, 0xEC,
> - round_up(stack_depth, 8) + AUX_STACK_SPACE);
> + /* push r15 */
> + EMIT2(0x41, 0x57);
> + /* push r14 */
> + EMIT2(0x41, 0x56);
> + /* push r13 */
> + EMIT2(0x41, 0x55);
> + /* push r12 */
> + EMIT2(0x41, 0x54);
> + /* push rbx */
> + EMIT1(0x53);
>
> - /* sub rbp, AUX_STACK_SPACE */
> - EMIT4(0x48, 0x83, 0xED, AUX_STACK_SPACE);
> + /*
> + * Push the tail call counter (tail_call_cnt) for eBPF tail calls.
> + * Initialized to zero.
> + *
> + * push $0
> + */
> + EMIT2(0x6a, 0x00);
>
> - /* mov qword ptr [rbp+0],rbx */
> - EMIT4(0x48, 0x89, 0x5D, 0);
> - /* mov qword ptr [rbp+8],r13 */
> - EMIT4(0x4C, 0x89, 0x6D, 8);
> - /* mov qword ptr [rbp+16],r14 */
> - EMIT4(0x4C, 0x89, 0x75, 16);
> - /* mov qword ptr [rbp+24],r15 */
> - EMIT4(0x4C, 0x89, 0x7D, 24);
> + /*
> + * R12 is used for the BPF program's FP register. It points to the end
> + * of the program's stack area.
> + *
> + * mov r12, rsp
> + */
> + EMIT3(0x49, 0x89, 0xE4);
>
> - if (!ebpf_from_cbpf) {
> - /*
> - * Clear the tail call counter (tail_call_cnt): for eBPF tail
> - * calls we need to reset the counter to 0. It's done in two
> - * instructions, resetting RAX register to 0, and moving it
> - * to the counter location.
> - */
> + /* sub rsp, rounded_stack_depth */
> + EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>
> - /* xor eax, eax */
> - EMIT2(0x31, 0xc0);
> - /* mov qword ptr [rbp+32], rax */
> - EMIT4(0x48, 0x89, 0x45, 32);
> + BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
>
> - BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
> - }
> + *pprog = prog;
> +}
> +
> +static void emit_epilogue(u8 **pprog)
> +{
> + u8 *prog = *pprog;
> + int cnt = 0;
> +
> + /* lea rsp, [rbp-0x28] */
> + EMIT4(0x48, 0x8D, 0x65, 0xD8);
> +
> + /* pop rbx */
> + EMIT1(0x5B);
> + /* pop r12 */
> + EMIT2(0x41, 0x5C);
> + /* pop r13 */
> + EMIT2(0x41, 0x5D);
> + /* pop r14 */
> + EMIT2(0x41, 0x5E);
> + /* pop r15 */
> + EMIT2(0x41, 0x5F);
> + /* pop rbp */
> + EMIT1(0x5D);
> +
> + /* ret */
> + EMIT1(0xC3);
>
> *pprog = prog;
> }
> @@ -277,7 +307,7 @@ static void emit_bpf_tail_call(u8 **pprog)
> EMIT2(0x89, 0xD2); /* mov edx, edx */
> EMIT3(0x39, 0x56, /* cmp dword ptr [rsi + 16], edx */
> offsetof(struct bpf_array, map.max_entries));
> -#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
> +#define OFFSET1 (35 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
> EMIT2(X86_JBE, OFFSET1); /* jbe out */
> label1 = cnt;
>
> @@ -285,13 +315,13 @@ static void emit_bpf_tail_call(u8 **pprog)
> * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
> * goto out;
> */
> - EMIT2_off32(0x8B, 0x85, 36); /* mov eax, dword ptr [rbp + 36] */
> + EMIT3(0x8B, 0x45, 0xD4); /* mov eax, dword ptr [rbp - 44] */
> EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */
> -#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
> +#define OFFSET2 (27 + RETPOLINE_RAX_BPF_JIT_SIZE)
> EMIT2(X86_JA, OFFSET2); /* ja out */
> label2 = cnt;
> EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */
> - EMIT2_off32(0x89, 0x85, 36); /* mov dword ptr [rbp + 36], eax */
> + EMIT3(0x89, 0x45, 0xD4); /* mov dword ptr [rbp - 44], eax */
>
> /* prog = array->ptrs[index]; */
> EMIT4_off32(0x48, 0x8B, 0x84, 0xD6, /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
> @@ -419,8 +449,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
> int proglen = 0;
> u8 *prog = temp;
>
> - emit_prologue(&prog, bpf_prog->aux->stack_depth,
> - bpf_prog_was_classic(bpf_prog));
> + emit_prologue(&prog, bpf_prog->aux->stack_depth);
>
> for (i = 0; i < insn_cnt; i++, insn++) {
> const s32 imm32 = insn->imm;
> @@ -767,10 +796,19 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
> case BPF_ST | BPF_MEM | BPF_DW:
> EMIT2(add_1mod(0x48, dst_reg), 0xC7);
>
> -st: if (is_imm8(insn->off))
> - EMIT2(add_1reg(0x40, dst_reg), insn->off);
> +st:
> + if (is_imm8(insn->off))
> + EMIT1(add_1reg(0x40, dst_reg));
> else
> - EMIT1_off32(add_1reg(0x80, dst_reg), insn->off);
> + EMIT1(add_1reg(0x80, dst_reg));
> +
> + if (is_sib_reg(dst_reg))
> + EMIT1(add_1reg(0x20, dst_reg));
> +
> + if (is_imm8(insn->off))
> + EMIT1(insn->off);
> + else
> + EMIT(insn->off, 4);
>
> EMIT(imm32, bpf_size_to_x86_bytes(BPF_SIZE(insn->code)));
> break;
> @@ -799,11 +837,19 @@ st: if (is_imm8(insn->off))
> goto stx;
> case BPF_STX | BPF_MEM | BPF_DW:
> EMIT2(add_2mod(0x48, dst_reg, src_reg), 0x89);
> -stx: if (is_imm8(insn->off))
> - EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
> +stx:
> + if (is_imm8(insn->off))
> + EMIT1(add_2reg(0x40, dst_reg, src_reg));
> + else
> + EMIT1(add_2reg(0x80, dst_reg, src_reg));
> +
> + if (is_sib_reg(dst_reg))
> + EMIT1(add_1reg(0x20, dst_reg));
> +
> + if (is_imm8(insn->off))
> + EMIT1(insn->off);
> else
> - EMIT1_off32(add_2reg(0x80, dst_reg, src_reg),
> - insn->off);
> + EMIT(insn->off, 4);
> break;
>
> /* LDX: dst_reg = *(u8*)(src_reg + off) */
> @@ -825,16 +871,24 @@ stx: if (is_imm8(insn->off))
> case BPF_LDX | BPF_MEM | BPF_DW:
> /* Emit 'mov rax, qword ptr [rax+0x14]' */
> EMIT2(add_2mod(0x48, src_reg, dst_reg), 0x8B);
> -ldx: /*
> +ldx:
> + /*
> * If insn->off == 0 we can save one extra byte, but
> * special case of x86 R13 which always needs an offset
> * is not worth the hassle
> */
> if (is_imm8(insn->off))
> - EMIT2(add_2reg(0x40, src_reg, dst_reg), insn->off);
> + EMIT1(add_2reg(0x40, src_reg, dst_reg));
> else
> - EMIT1_off32(add_2reg(0x80, src_reg, dst_reg),
> - insn->off);
> + EMIT1(add_2reg(0x80, src_reg, dst_reg));
> +
> + if (is_sib_reg(src_reg))
> + EMIT1(add_1reg(0x20, src_reg));
> +
> + if (is_imm8(insn->off))
> + EMIT1(insn->off);
> + else
> + EMIT(insn->off, 4);
> break;
>
> /* STX XADD: lock *(u32*)(dst_reg + off) += src_reg */
> @@ -847,11 +901,19 @@ stx: if (is_imm8(insn->off))
> goto xadd;
> case BPF_STX | BPF_XADD | BPF_DW:
> EMIT3(0xF0, add_2mod(0x48, dst_reg, src_reg), 0x01);
> -xadd: if (is_imm8(insn->off))
> - EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
> +xadd:
> + if (is_imm8(insn->off))
> + EMIT1(add_2reg(0x40, dst_reg, src_reg));
> else
> - EMIT1_off32(add_2reg(0x80, dst_reg, src_reg),
> - insn->off);
> + EMIT1(add_2reg(0x80, dst_reg, src_reg));
> +
> + if (is_sib_reg(dst_reg))
> + EMIT1(add_1reg(0x20, dst_reg));
> +
> + if (is_imm8(insn->off))
> + EMIT1(insn->off);
> + else
> + EMIT(insn->off, 4);
> break;
>
> /* call */
> @@ -1040,19 +1102,8 @@ xadd: if (is_imm8(insn->off))
> seen_exit = true;
> /* Update cleanup_addr */
> ctx->cleanup_addr = proglen;
> - /* mov rbx, qword ptr [rbp+0] */
> - EMIT4(0x48, 0x8B, 0x5D, 0);
> - /* mov r13, qword ptr [rbp+8] */
> - EMIT4(0x4C, 0x8B, 0x6D, 8);
> - /* mov r14, qword ptr [rbp+16] */
> - EMIT4(0x4C, 0x8B, 0x75, 16);
> - /* mov r15, qword ptr [rbp+24] */
> - EMIT4(0x4C, 0x8B, 0x7D, 24);
> -
> - /* add rbp, AUX_STACK_SPACE */
> - EMIT4(0x48, 0x83, 0xC5, AUX_STACK_SPACE);
> - EMIT1(0xC9); /* leave */
> - EMIT1(0xC3); /* ret */
> +
> + emit_epilogue(&prog);
> break;
>
> default:

Thanks! This looks good to me and passed the self test and bcc test
(with frame pointer unwinder, and JIT enabled):
With bcc's tools/stackcount.py I got the valid stack trace, and the
self test says:
test_stacktrace_map:PASS:compare_map_keys stackid_hmap vs. stackmap 0 nsec

--
Best Regards,
Kairui Song

2019-06-06 16:09:44

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

> On May 27, 2019, at 4:57 AM, Kairui Song <[email protected]> wrote:
>
> On Sat, May 25, 2019 at 7:23 AM Josh Poimboeuf <[email protected]> wrote:
>>
>> On Fri, May 24, 2019 at 10:20:52AM +0800, Kairui Song wrote:
>>> On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf <[email protected]> wrote:
>>>>
>>>> On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
>>>>> On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf <[email protected]> wrote:
>>>>>>
>>>>>> On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
>>>>>>>>> Hi Josh, this still won't fix the problem.
>>>>>>>>>
>>>>>>>>> Problem is not (or not only) with ___bpf_prog_run, what actually went
>>>>>>>>> wrong is with the JITed bpf code.
>>>>>>>>
>>>>>>>> There seem to be a bunch of issues. My patch at least fixes the failing
>>>>>>>> selftest reported by Alexei for ORC.
>>>>>>>>
>>>>>>>> How can I recreate your issue?
>>>>>>>
>>>>>>> Hmm, I used bcc's example to attach bpf to trace point, and with that
>>>>>>> fix stack trace is still invalid.
>>>>>>>
>>>>>>> CMD I used with bcc:
>>>>>>> python3 ./tools/stackcount.py t:sched:sched_fork
>>>>>>
>>>>>> I've had problems in the past getting bcc to build, so I was hoping it
>>>>>> was reproducible with a standalone selftest.
>>>>>>
>>>>>>> And I just had another try applying your patch, self test is also failing.
>>>>>>
>>>>>> Is it the same selftest reported by Alexei?
>>>>>>
>>>>>> test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
>>>>>>
>>>>>>> I'm applying on my local master branch, a few days older than
>>>>>>> upstream, I can update and try again, am I missing anything?
>>>>>>
>>>>>> The above patch had some issues, so with some configs you might see an
>>>>>> objtool warning for ___bpf_prog_run(), in which case the patch doesn't
>>>>>> fix the test_stacktrace_map selftest.
>>>>>>
>>>>>> Here's the latest version which should fix it in all cases (based on
>>>>>> tip/master):
>>>>>>
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
>>>>>
>>>>> Hmm, I still get the failure:
>>>>> test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap
>>>>> err -1 errno 2
>>>>>
>>>>> And I didn't see how this will fix the issue. As long as ORC need to
>>>>> unwind through the JITed code it will fail. And that will happen
>>>>> before reaching ___bpf_prog_run.
>>>>
>>>> Ok, I was able to recreate by doing
>>>>
>>>> echo 1 > /proc/sys/net/core/bpf_jit_enable
>>>>
>>>> first. I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.
>>>>
>>>
>>> Yes, with JIT off it will be fixed. I can confirm that.
>>
>> Here's a tentative BPF fix for the JIT frame pointer issue. It was a
>> bit harder than I expected. Encoding r12 as a base register requires a
>> SIB byte, so I had to add support for encoding that. I also simplified
>> the prologue to resemble a GCC prologue, which decreases the prologue
>> size quite a bit.
>>
>> Next week I can work on the corresponding ORC change. Then I can clean
>> all the patches up and submit them properly.

Hi Josh,

Have you got luck fixing the ORC side?

Thanks,
Song

>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index afabf597c855..c9b4503558c9 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -104,9 +104,8 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>> /*
>> * The following table maps BPF registers to x86-64 registers.
>> *
>> - * x86-64 register R12 is unused, since if used as base address
>> - * register in load/store instructions, it always needs an
>> - * extra byte of encoding and is callee saved.
>> + * RBP isn't used; it needs to be preserved to allow the unwinder to move
>> + * through generated code stacks.
>> *
>> * Also x86-64 register R9 is unused. x86-64 register R10 is
>> * used for blinding (if enabled).
>> @@ -122,7 +121,7 @@ static const int reg2hex[] = {
>> [BPF_REG_7] = 5, /* R13 callee saved */
>> [BPF_REG_8] = 6, /* R14 callee saved */
>> [BPF_REG_9] = 7, /* R15 callee saved */
>> - [BPF_REG_FP] = 5, /* RBP readonly */
>> + [BPF_REG_FP] = 4, /* R12 readonly */
>> [BPF_REG_AX] = 2, /* R10 temp register */
>> [AUX_REG] = 3, /* R11 temp register */
>> };
>> @@ -139,6 +138,7 @@ static bool is_ereg(u32 reg)
>> BIT(BPF_REG_7) |
>> BIT(BPF_REG_8) |
>> BIT(BPF_REG_9) |
>> + BIT(BPF_REG_FP) |
>> BIT(BPF_REG_AX));
>> }
>>
>> @@ -147,6 +147,11 @@ static bool is_axreg(u32 reg)
>> return reg == BPF_REG_0;
>> }
>>
>> +static bool is_sib_reg(u32 reg)
>> +{
>> + return reg == BPF_REG_FP;
>> +}
>> +
>> /* Add modifiers if 'reg' maps to x86-64 registers R8..R15 */
>> static u8 add_1mod(u8 byte, u32 reg)
>> {
>> @@ -190,15 +195,13 @@ struct jit_context {
>> #define BPF_MAX_INSN_SIZE 128
>> #define BPF_INSN_SAFETY 64
>>
>> -#define AUX_STACK_SPACE 40 /* Space for RBX, R13, R14, R15, tailcnt */
>> -
>> -#define PROLOGUE_SIZE 37
>> +#define PROLOGUE_SIZE 25
>>
>> /*
>> * Emit x86-64 prologue code for BPF program and check its size.
>> * bpf_tail_call helper will skip it while jumping into another program
>> */
>> -static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
>> +static void emit_prologue(u8 **pprog, u32 stack_depth)
>> {
>> u8 *prog = *pprog;
>> int cnt = 0;
>> @@ -206,40 +209,67 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
>> /* push rbp */
>> EMIT1(0x55);
>>
>> - /* mov rbp,rsp */
>> + /* mov rbp, rsp */
>> EMIT3(0x48, 0x89, 0xE5);
>>
>> - /* sub rsp, rounded_stack_depth + AUX_STACK_SPACE */
>> - EMIT3_off32(0x48, 0x81, 0xEC,
>> - round_up(stack_depth, 8) + AUX_STACK_SPACE);
>> + /* push r15 */
>> + EMIT2(0x41, 0x57);
>> + /* push r14 */
>> + EMIT2(0x41, 0x56);
>> + /* push r13 */
>> + EMIT2(0x41, 0x55);
>> + /* push r12 */
>> + EMIT2(0x41, 0x54);
>> + /* push rbx */
>> + EMIT1(0x53);
>>
>> - /* sub rbp, AUX_STACK_SPACE */
>> - EMIT4(0x48, 0x83, 0xED, AUX_STACK_SPACE);
>> + /*
>> + * Push the tail call counter (tail_call_cnt) for eBPF tail calls.
>> + * Initialized to zero.
>> + *
>> + * push $0
>> + */
>> + EMIT2(0x6a, 0x00);
>>
>> - /* mov qword ptr [rbp+0],rbx */
>> - EMIT4(0x48, 0x89, 0x5D, 0);
>> - /* mov qword ptr [rbp+8],r13 */
>> - EMIT4(0x4C, 0x89, 0x6D, 8);
>> - /* mov qword ptr [rbp+16],r14 */
>> - EMIT4(0x4C, 0x89, 0x75, 16);
>> - /* mov qword ptr [rbp+24],r15 */
>> - EMIT4(0x4C, 0x89, 0x7D, 24);
>> + /*
>> + * R12 is used for the BPF program's FP register. It points to the end
>> + * of the program's stack area.
>> + *
>> + * mov r12, rsp
>> + */
>> + EMIT3(0x49, 0x89, 0xE4);
>>
>> - if (!ebpf_from_cbpf) {
>> - /*
>> - * Clear the tail call counter (tail_call_cnt): for eBPF tail
>> - * calls we need to reset the counter to 0. It's done in two
>> - * instructions, resetting RAX register to 0, and moving it
>> - * to the counter location.
>> - */
>> + /* sub rsp, rounded_stack_depth */
>> + EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>>
>> - /* xor eax, eax */
>> - EMIT2(0x31, 0xc0);
>> - /* mov qword ptr [rbp+32], rax */
>> - EMIT4(0x48, 0x89, 0x45, 32);
>> + BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
>>
>> - BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
>> - }
>> + *pprog = prog;
>> +}
>> +
>> +static void emit_epilogue(u8 **pprog)
>> +{
>> + u8 *prog = *pprog;
>> + int cnt = 0;
>> +
>> + /* lea rsp, [rbp-0x28] */
>> + EMIT4(0x48, 0x8D, 0x65, 0xD8);
>> +
>> + /* pop rbx */
>> + EMIT1(0x5B);
>> + /* pop r12 */
>> + EMIT2(0x41, 0x5C);
>> + /* pop r13 */
>> + EMIT2(0x41, 0x5D);
>> + /* pop r14 */
>> + EMIT2(0x41, 0x5E);
>> + /* pop r15 */
>> + EMIT2(0x41, 0x5F);
>> + /* pop rbp */
>> + EMIT1(0x5D);
>> +
>> + /* ret */
>> + EMIT1(0xC3);
>>
>> *pprog = prog;
>> }
>> @@ -277,7 +307,7 @@ static void emit_bpf_tail_call(u8 **pprog)
>> EMIT2(0x89, 0xD2); /* mov edx, edx */
>> EMIT3(0x39, 0x56, /* cmp dword ptr [rsi + 16], edx */
>> offsetof(struct bpf_array, map.max_entries));
>> -#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
>> +#define OFFSET1 (35 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
>> EMIT2(X86_JBE, OFFSET1); /* jbe out */
>> label1 = cnt;
>>
>> @@ -285,13 +315,13 @@ static void emit_bpf_tail_call(u8 **pprog)
>> * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
>> * goto out;
>> */
>> - EMIT2_off32(0x8B, 0x85, 36); /* mov eax, dword ptr [rbp + 36] */
>> + EMIT3(0x8B, 0x45, 0xD4); /* mov eax, dword ptr [rbp - 44] */
>> EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */
>> -#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
>> +#define OFFSET2 (27 + RETPOLINE_RAX_BPF_JIT_SIZE)
>> EMIT2(X86_JA, OFFSET2); /* ja out */
>> label2 = cnt;
>> EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */
>> - EMIT2_off32(0x89, 0x85, 36); /* mov dword ptr [rbp + 36], eax */
>> + EMIT3(0x89, 0x45, 0xD4); /* mov dword ptr [rbp - 44], eax */
>>
>> /* prog = array->ptrs[index]; */
>> EMIT4_off32(0x48, 0x8B, 0x84, 0xD6, /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
>> @@ -419,8 +449,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>> int proglen = 0;
>> u8 *prog = temp;
>>
>> - emit_prologue(&prog, bpf_prog->aux->stack_depth,
>> - bpf_prog_was_classic(bpf_prog));
>> + emit_prologue(&prog, bpf_prog->aux->stack_depth);
>>
>> for (i = 0; i < insn_cnt; i++, insn++) {
>> const s32 imm32 = insn->imm;
>> @@ -767,10 +796,19 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>> case BPF_ST | BPF_MEM | BPF_DW:
>> EMIT2(add_1mod(0x48, dst_reg), 0xC7);
>>
>> -st: if (is_imm8(insn->off))
>> - EMIT2(add_1reg(0x40, dst_reg), insn->off);
>> +st:
>> + if (is_imm8(insn->off))
>> + EMIT1(add_1reg(0x40, dst_reg));
>> else
>> - EMIT1_off32(add_1reg(0x80, dst_reg), insn->off);
>> + EMIT1(add_1reg(0x80, dst_reg));
>> +
>> + if (is_sib_reg(dst_reg))
>> + EMIT1(add_1reg(0x20, dst_reg));
>> +
>> + if (is_imm8(insn->off))
>> + EMIT1(insn->off);
>> + else
>> + EMIT(insn->off, 4);
>>
>> EMIT(imm32, bpf_size_to_x86_bytes(BPF_SIZE(insn->code)));
>> break;
>> @@ -799,11 +837,19 @@ st: if (is_imm8(insn->off))
>> goto stx;
>> case BPF_STX | BPF_MEM | BPF_DW:
>> EMIT2(add_2mod(0x48, dst_reg, src_reg), 0x89);
>> -stx: if (is_imm8(insn->off))
>> - EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
>> +stx:
>> + if (is_imm8(insn->off))
>> + EMIT1(add_2reg(0x40, dst_reg, src_reg));
>> + else
>> + EMIT1(add_2reg(0x80, dst_reg, src_reg));
>> +
>> + if (is_sib_reg(dst_reg))
>> + EMIT1(add_1reg(0x20, dst_reg));
>> +
>> + if (is_imm8(insn->off))
>> + EMIT1(insn->off);
>> else
>> - EMIT1_off32(add_2reg(0x80, dst_reg, src_reg),
>> - insn->off);
>> + EMIT(insn->off, 4);
>> break;
>>
>> /* LDX: dst_reg = *(u8*)(src_reg + off) */
>> @@ -825,16 +871,24 @@ stx: if (is_imm8(insn->off))
>> case BPF_LDX | BPF_MEM | BPF_DW:
>> /* Emit 'mov rax, qword ptr [rax+0x14]' */
>> EMIT2(add_2mod(0x48, src_reg, dst_reg), 0x8B);
>> -ldx: /*
>> +ldx:
>> + /*
>> * If insn->off == 0 we can save one extra byte, but
>> * special case of x86 R13 which always needs an offset
>> * is not worth the hassle
>> */
>> if (is_imm8(insn->off))
>> - EMIT2(add_2reg(0x40, src_reg, dst_reg), insn->off);
>> + EMIT1(add_2reg(0x40, src_reg, dst_reg));
>> else
>> - EMIT1_off32(add_2reg(0x80, src_reg, dst_reg),
>> - insn->off);
>> + EMIT1(add_2reg(0x80, src_reg, dst_reg));
>> +
>> + if (is_sib_reg(src_reg))
>> + EMIT1(add_1reg(0x20, src_reg));
>> +
>> + if (is_imm8(insn->off))
>> + EMIT1(insn->off);
>> + else
>> + EMIT(insn->off, 4);
>> break;
>>
>> /* STX XADD: lock *(u32*)(dst_reg + off) += src_reg */
>> @@ -847,11 +901,19 @@ stx: if (is_imm8(insn->off))
>> goto xadd;
>> case BPF_STX | BPF_XADD | BPF_DW:
>> EMIT3(0xF0, add_2mod(0x48, dst_reg, src_reg), 0x01);
>> -xadd: if (is_imm8(insn->off))
>> - EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
>> +xadd:
>> + if (is_imm8(insn->off))
>> + EMIT1(add_2reg(0x40, dst_reg, src_reg));
>> else
>> - EMIT1_off32(add_2reg(0x80, dst_reg, src_reg),
>> - insn->off);
>> + EMIT1(add_2reg(0x80, dst_reg, src_reg));
>> +
>> + if (is_sib_reg(dst_reg))
>> + EMIT1(add_1reg(0x20, dst_reg));
>> +
>> + if (is_imm8(insn->off))
>> + EMIT1(insn->off);
>> + else
>> + EMIT(insn->off, 4);
>> break;
>>
>> /* call */
>> @@ -1040,19 +1102,8 @@ xadd: if (is_imm8(insn->off))
>> seen_exit = true;
>> /* Update cleanup_addr */
>> ctx->cleanup_addr = proglen;
>> - /* mov rbx, qword ptr [rbp+0] */
>> - EMIT4(0x48, 0x8B, 0x5D, 0);
>> - /* mov r13, qword ptr [rbp+8] */
>> - EMIT4(0x4C, 0x8B, 0x6D, 8);
>> - /* mov r14, qword ptr [rbp+16] */
>> - EMIT4(0x4C, 0x8B, 0x75, 16);
>> - /* mov r15, qword ptr [rbp+24] */
>> - EMIT4(0x4C, 0x8B, 0x7D, 24);
>> -
>> - /* add rbp, AUX_STACK_SPACE */
>> - EMIT4(0x48, 0x83, 0xC5, AUX_STACK_SPACE);
>> - EMIT1(0xC9); /* leave */
>> - EMIT1(0xC3); /* ret */
>> +
>> + emit_epilogue(&prog);
>> break;
>>
>> default:
>
> Thanks! This looks good to me and passed the self test and bcc test
> (with frame pointer unwinder, and JIT enabled):
> With bcc's tools/stackcount.py I got the valid stack trace, and the
> self test says:
> test_stacktrace_map:PASS:compare_map_keys stackid_hmap vs. stackmap 0 nsec
>
> --
> Best Regards,
> Kairui Song

2019-06-07 00:51:49

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, Jun 06, 2019 at 04:04:48PM +0000, Song Liu wrote:
> >> Here's a tentative BPF fix for the JIT frame pointer issue. It was a
> >> bit harder than I expected. Encoding r12 as a base register requires a
> >> SIB byte, so I had to add support for encoding that. I also simplified
> >> the prologue to resemble a GCC prologue, which decreases the prologue
> >> size quite a bit.
> >>
> >> Next week I can work on the corresponding ORC change. Then I can clean
> >> all the patches up and submit them properly.
>
> Hi Josh,
>
> Have you got luck fixing the ORC side?

Sorry, I've been traveling this week and I haven't had a chance to work
on it. It's at the top of my TODO list for next week.

--
Josh

2019-06-12 01:23:49

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Thu, Jun 06, 2019 at 04:04:48PM +0000, Song Liu wrote:
> Hi Josh,
>
> Have you got luck fixing the ORC side?

Here's the ORC fix. It's needed in addition to the bpf frame pointer
fix (the previous patch). I'll clean the patches up and post them soon.

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 33b66b5c5aec..a2337460b1a3 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -128,6 +128,16 @@ static struct orc_entry null_orc_entry = {
.type = ORC_TYPE_CALL
};

+/* Fake frame pointer entry -- used as a fallback for generated code */
+static struct orc_entry orc_fp_entry = {
+ .type = ORC_TYPE_CALL,
+ .sp_reg = ORC_REG_BP,
+ .sp_offset = 16,
+ .bp_reg = ORC_REG_PREV_SP,
+ .bp_offset = -16,
+ .end = 0,
+};
+
static struct orc_entry *orc_find(unsigned long ip)
{
static struct orc_entry *orc;
@@ -392,8 +402,16 @@ bool unwind_next_frame(struct unwind_state *state)
* calls and calls to noreturn functions.
*/
orc = orc_find(state->signal ? state->ip : state->ip - 1);
- if (!orc)
- goto err;
+ if (!orc) {
+ /*
+ * As a fallback, try to assume rbp is the frame pointer. This
+ * is useful for generated code like BPF. Since this is a
+ * guess, the rest of the unwind is no longer considered
+ * reliable.
+ */
+ orc = &orc_fp_entry;
+ state->error = true;
+ }

/* End-of-stack check for kernel threads: */
if (orc->sp_reg == ORC_REG_UNDEFINED) {

2019-06-12 07:42:47

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Fri, May 24, 2019 at 10:53:19AM +0200, Peter Zijlstra wrote:
> > For ORC, I'm thinking we may be able to just require that all generated
> > code (BPF and others) always use frame pointers. Then when ORC doesn't
> > recognize a code address, it could try using the frame pointer as a
> > fallback.
>
> Yes, this seems like a sensible approach. We'd also have to audit the
> ftrace and kprobe trampolines, IIRC they only do framepointer setup for
> CONFIG_FRAME_POINTER currently, which should be easy to fix (after the
> patches I have to fix the FP generation in the first place:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/wip

Right now, ftrace has a special hook in the ORC unwinder
(orc_ftrace_find). It would be great if we could get rid of that in
favor of the "always use frame pointers" approach. I'll hold off on
doing the kpatch/kprobe trampoline conversions in my patches since it
would conflict with yours.

Though, hm, because of pt_regs I guess ORC would need to be able to
decode an encoded frame pointer? I was hoping we could leave those
encoded frame pointers behind in CONFIG_FRAME_POINTER-land forever...

Here are my latest BPF unwinder patches in case anybody wants a sneak
peek:

https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/log/?h=bpf-orc-fix

--
Josh

2019-06-12 09:16:46

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Tue, Jun 11, 2019 at 10:05:01PM -0500, Josh Poimboeuf wrote:
> On Fri, May 24, 2019 at 10:53:19AM +0200, Peter Zijlstra wrote:
> > > For ORC, I'm thinking we may be able to just require that all generated
> > > code (BPF and others) always use frame pointers. Then when ORC doesn't
> > > recognize a code address, it could try using the frame pointer as a
> > > fallback.
> >
> > Yes, this seems like a sensible approach. We'd also have to audit the
> > ftrace and kprobe trampolines, IIRC they only do framepointer setup for
> > CONFIG_FRAME_POINTER currently, which should be easy to fix (after the
> > patches I have to fix the FP generation in the first place:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/wip
>
> Right now, ftrace has a special hook in the ORC unwinder
> (orc_ftrace_find). It would be great if we could get rid of that in
> favor of the "always use frame pointers" approach. I'll hold off on
> doing the kpatch/kprobe trampoline conversions in my patches since it
> would conflict with yours.
>
> Though, hm, because of pt_regs I guess ORC would need to be able to
> decode an encoded frame pointer? I was hoping we could leave those
> encoded frame pointers behind in CONFIG_FRAME_POINTER-land forever...

Ah, I see.. could a similar approach work for the kprobe trampolines
perhaps?

> Here are my latest BPF unwinder patches in case anybody wants a sneak
> peek:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/log/?h=bpf-orc-fix

On a quick read-through, that looks good to me. A minor nit:

/* mov dst_reg, %r11 */
EMIT_mov(dst_reg, AUX_REG);

The disparity between %r11 and AUX_REG is jarring. I understand the
whole bpf register mapping thing, but it is just weird when reading
this.

Other than that, the same note as before, the 32bit JIT still seems
buggered, but I'm not sure you (or anybody else) cares enough about that
to fix it though. It seems to use ebp as its own frame pointer, which
completely defeats an unwinder.

2019-06-12 17:58:51

by Steven Rostedt

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Tue, 11 Jun 2019 22:05:01 -0500
Josh Poimboeuf <[email protected]> wrote:

> Right now, ftrace has a special hook in the ORC unwinder
> (orc_ftrace_find). It would be great if we could get rid of that in
> favor of the "always use frame pointers" approach. I'll hold off on
> doing the kpatch/kprobe trampoline conversions in my patches since it
> would conflict with yours.

Basically, IIUC, what you are saying is that the ftrace trampoline
should always store the %sp in %rb even when CONFIG_FRAME_POINTER is not
enabled? And this can allow you to remove the ftrace specific code from
the orc unwinder?

-- Steve

>
> Though, hm, because of pt_regs I guess ORC would need to be able to
> decode an encoded frame pointer? I was hoping we could leave those
> encoded frame pointers behind in CONFIG_FRAME_POINTER-land forever...

2019-06-12 18:02:20

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Wed, Jun 12, 2019 at 09:10:23AM -0400, Steven Rostedt wrote:
> On Tue, 11 Jun 2019 22:05:01 -0500
> Josh Poimboeuf <[email protected]> wrote:
>
> > Right now, ftrace has a special hook in the ORC unwinder
> > (orc_ftrace_find). It would be great if we could get rid of that in
> > favor of the "always use frame pointers" approach. I'll hold off on
> > doing the kpatch/kprobe trampoline conversions in my patches since it
> > would conflict with yours.
>
> Basically, IIUC, what you are saying is that the ftrace trampoline
> should always store the %sp in %rb even when CONFIG_FRAME_POINTER is not
> enabled? And this can allow you to remove the ftrace specific code from
> the orc unwinder?

Basically, yes. Though the frame pointer encoding which Peter is adding
to the ftrace/kprobes trampolines might complicate things a bit.

--
Josh

2019-06-12 18:04:06

[permalink] [raw]

Subject: Re: Getting empty callchain from perf_callchain_kernel()

On Wed, Jun 12, 2019 at 10:54:23AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 11, 2019 at 10:05:01PM -0500, Josh Poimboeuf wrote:
> > On Fri, May 24, 2019 at 10:53:19AM +0200, Peter Zijlstra wrote:
> > > > For ORC, I'm thinking we may be able to just require that all generated
> > > > code (BPF and others) always use frame pointers. Then when ORC doesn't
> > > > recognize a code address, it could try using the frame pointer as a
> > > > fallback.
> > >
> > > Yes, this seems like a sensible approach. We'd also have to audit the
> > > ftrace and kprobe trampolines, IIRC they only do framepointer setup for
> > > CONFIG_FRAME_POINTER currently, which should be easy to fix (after the
> > > patches I have to fix the FP generation in the first place:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/wip
> >
> > Right now, ftrace has a special hook in the ORC unwinder
> > (orc_ftrace_find). It would be great if we could get rid of that in
> > favor of the "always use frame pointers" approach. I'll hold off on
> > doing the kpatch/kprobe trampoline conversions in my patches since it
> > would conflict with yours.
> >
> > Though, hm, because of pt_regs I guess ORC would need to be able to
> > decode an encoded frame pointer? I was hoping we could leave those
> > encoded frame pointers behind in CONFIG_FRAME_POINTER-land forever...
>
> Ah, I see.. could a similar approach work for the kprobe trampolines
> perhaps?

If you mean requiring that kprobes trampolines always use frame
pointers, I think it should work.

> > Here are my latest BPF unwinder patches in case anybody wants a sneak
> > peek:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/log/?h=bpf-orc-fix
>
> On a quick read-through, that looks good to me. A minor nit:
>
> /* mov dst_reg, %r11 */
> EMIT_mov(dst_reg, AUX_REG);
>
> The disparity between %r11 and AUX_REG is jarring. I understand the
> whole bpf register mapping thing, but it is just weird when reading
> this.

True, but there are several cases where the r11 is hard-coded in the
instruction encoding itself, like:

/* mov imm32, %r11 */
EMIT3_off32(0x49, 0xC7, 0xC3, imm32);

If the code were more decoupled, like if it had helpers where you could
always just pass AUX_REG, and the code never had to know what the value
of AUX_REG is, then using "AUX_REG" in the comments would make sense.

But since there are inconsistencies, with hard-coded register mapping
knowledge in many places, I find it easier to follow what's going on
when the specific register name is always shown in the comments.

> Other than that, the same note as before, the 32bit JIT still seems
> buggered, but I'm not sure you (or anybody else) cares enough about that
> to fix it though. It seems to use ebp as its own frame pointer, which
> completely defeats an unwinder.

I'm still trying to decide if I care about 32-bit. It does indeed use
ebp everywhere. But I'm not sure if I want to poke the beehive... Also
factoring into the equation is the fact that I'll be on PTO next week
:-) If I have time in the next couple days then I may take a look.

--
Josh

2019-06-13 20:27:46