2018-10-20 22:50:10

by Milian Wolff

[permalink] [raw]
Subject: Broken dwarf unwinding - wrong stack pointer register value?

Hey all,

I'm on the quest to figure out why perf regularly fails to unwind (some)
samples. I am seeing very strange behavior, where an apparently wrong stack
pointer value is read from the register - see below for more information and
the end of this (long) mail for my open questions. Any help would be greatly
appreciated.

I am currently using this trivial C++ code to reproduce the issue:

```
#include <cmath>
#include <complex>
#include <iostream>
#include <random>

using namespace std;

int main()
{
uniform_real_distribution<double> uniform(-1E5, 1E5);
default_random_engine engine;
double s = 0;
for (int i = 0; i < 10000000; ++i) {
s += norm(complex<double>(uniform(engine), uniform(engine)));
}
cout << s << '\n';
return 0;
}
```

I compile it with `g++ -O2 -g` and then record it with `perf record --call-
graph dwarf`. Using perf script, I then see e.g.:

```
$ perf script -v --no-inline --time 90229.12668,90229.127158 --ns
...
# first frame (working unwinding from __hypot_finite):
unwind: reg 16, val 7faf7dca2696
unwind: reg 7, val 7ffc80811ca0
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811d08 val 56382b0fc129, offset 104
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 184
unwind: find_proc_info dso /usr/lib/libc-2.28.so
unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 376
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: __hypot_finite:ip = 0x7faf7dca2696 (0x29696)
unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8)
unwind: main:ip = 0x56382b0fc128 (0x1128)
unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
# second frame (unrelated)
unwind: reg 16, val 56382b0fc114
unwind: reg 7, val 7ffc80811d10
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 72
unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 264
unwind: main:ip = 0x56382b0fc114 (0x1114)
unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
# third frame (broken unwinding from __hypot_finite)
unwind: reg 16, val 7faf7dca2688
unwind: reg 7, val 7ffc80811ca0
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32
unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
unwind: '':ip = 0xffffffffffffffff (0x0)
cpp-inlining 24617 90229.126685606: 711026 cycles:uppp:
7faf7dca2696 __hypot_finite+0x36 (/usr/lib/libm-2.28.so)
7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)
7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)

cpp-inlining 24617 90229.126921551: 714657 cycles:uppp:
56382b0fc114 main+0x74 (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)
7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)

cpp-inlining 24617 90229.127157818: 719976 cycles:uppp:
7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so)
ffffffffffffffff [unknown] ([unknown])
...
```

Now I'm trying to figure out why one __hypot_finite sample works but the other
one breaks for no apparent reason. One thing I noticed so far is the following
discrepancy in memory accesses, i.e. for the working sample we get:

```
unwind: reg 16, val 7faf7dca2696
unwind: reg 7, val 7ffc80811ca0
unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72
...
```

But the broken sample only produces:

```
unwind: reg 16, val 7faf7dca2688
unwind: reg 7, val 7ffc80811ca0
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32
```

The user stack of the working sample starts with (in hex):

0000000000000000000000009c830441000000000000000000000000000000000000000000000000254a980000000000301d8180fc7f0000c0c10f2b3856000030

The user stack of the broken sample starts with (in hex):

0000000000000000000000002f64fd4000000000000000000000000000000000000000000000000073bb970000000000301d8180fc7f0000c0c10f2b3856000030

This is a 64bit machine, so word width is 8. So both user stacks start with
`24 * 2 = 48` words of zeros. Hence offset 32 cannot possibly produce a
meaningful value...

This offset is calculcated from LIBUNWIND__ARCH_REG_SP, cf. unwind-libunwind-
local.c. So is the stack pointer address in the register wrong? If I hackishly
offset the value for the stack pointer by 40, i.e.:

```
diff --git a/tools/perf/util/unwind-libunwind-local.c b/tools/perf/util/
unwind-libunwind-local.c
index 79f521a552cf..a766ddaaa4dd 100644
--- a/tools/perf/util/unwind-libunwind-local.c
+++ b/tools/perf/util/unwind-libunwind-local.c
@@ -502,6 +502,7 @@ static int access_mem(unw_addr_space_t __maybe_unused as,
if (ret)
return ret;

+ start -= 40;
end = start + stack->size;

/* Check overflow. */
```

Then I can successfully unwind the broken sample:

```
$ perf script -v --no-inline --time 90229.127156,90229.127158 --ns
...
unwind: reg 16, val 7faf7dca2688
unwind: reg 7, val 7ffc80811ca0
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811cc0 val 7faf7dc88af9, offset 72
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811ce0 val 56382b0fc129, offset 104
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: access_mem addr 0x7ffc80811d30 val 7faf7dabf223, offset 184
unwind: find_proc_info dso /usr/lib/libc-2.28.so
unwind: access_mem addr 0x7ffc80811df0 val 56382b0fc1ee, offset 376
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8)
unwind: main:ip = 0x56382b0fc128 (0x1128)
unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
cpp-inlining 24617 90229.127157818: 719976 cycles:uppp:
7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so)
7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)
7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)

```

So, what now? Here are my open questions:

Is this just working now by chance, or is this the real reason? I.e. is the
register value for the stack pointer incorrectly recorded?

Can this be fixed somehow during record time?

Can we detect this scenario at analysis time and correct the stack pointer
address automatically somehow? Should the first frame always try to read from
offset 72 maybe?

Any help would be greatly appreciated, many thanks

--
Milian Wolff
[email protected]
http://milianw.de


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part.

2018-10-21 20:59:52

by Milian Wolff

[permalink] [raw]
Subject: Re: Broken dwarf unwinding - wrong stack pointer register value?

On Sonntag, 21. Oktober 2018 00:39:51 CEST Milian Wolff wrote:
> Hey all,
>
> I'm on the quest to figure out why perf regularly fails to unwind (some)
> samples. I am seeing very strange behavior, where an apparently wrong stack
> pointer value is read from the register - see below for more information and
> the end of this (long) mail for my open questions. Any help would be
> greatly appreciated.
>
> I am currently using this trivial C++ code to reproduce the issue:
>
> ```
> #include <cmath>
> #include <complex>
> #include <iostream>
> #include <random>
>
> using namespace std;
>
> int main()
> {
> uniform_real_distribution<double> uniform(-1E5, 1E5);
> default_random_engine engine;
> double s = 0;
> for (int i = 0; i < 10000000; ++i) {
> s += norm(complex<double>(uniform(engine), uniform(engine)));
> }
> cout << s << '\n';
> return 0;
> }
> ```
>
> I compile it with `g++ -O2 -g` and then record it with `perf record --call-
> graph dwarf`. Using perf script, I then see e.g.:

With my patch to regularly flush the perf script output buffer, we can now
easily find all broken backtraces and the corresponding debug output via:

$ perf script --ns -v |& awk -v RS='' '/\[unknown\]/ {print "\n"$0}'

I've pasted the output to the above command from my machine here:
https://paste.kde.org/pmyxwkk1k

This contains 139 samples with broken unwinding, out of 2350 samples in total,
so about 6% of all samples are broken.

In many cases, the first accessed memory is 0 because a too-low offset into
the stack is computed from the SP value, similar to the scenario I described
in my initial mail. In other cases we read garbadge addresses such as

unwind: access_mem addr 0x7ffc80811cf0 val 408195dfbda90580, offset 24

In all cases except for the the two samples at the very start and end of this
log, the last offset encountered in access_mem is lower than 72. Remember what
I wrote in the initial mail - if I manually hack the access_mem function to
use 72 for one of the broken samples, it made unwinding magically work
again...

With this addition of data - can anyone sched some light on what's potentially
going on here? How can we improve this situation?

Thanks
--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-10-22 10:37:44

by Milian Wolff

[permalink] [raw]
Subject: Re: Broken dwarf unwinding - wrong stack pointer register value?

On Sonntag, 21. Oktober 2018 00:39:51 CEST Milian Wolff wrote:
> Hey all,
>
> I'm on the quest to figure out why perf regularly fails to unwind (some)
> samples. I am seeing very strange behavior, where an apparently wrong stack
> pointer value is read from the register - see below for more information and
> the end of this (long) mail for my open questions. Any help would be
> greatly appreciated.
>
> I am currently using this trivial C++ code to reproduce the issue:
>
> ```
> #include <cmath>
> #include <complex>
> #include <iostream>
> #include <random>
>
> using namespace std;
>
> int main()
> {
> uniform_real_distribution<double> uniform(-1E5, 1E5);
> default_random_engine engine;
> double s = 0;
> for (int i = 0; i < 10000000; ++i) {
> s += norm(complex<double>(uniform(engine), uniform(engine)));
> }
> cout << s << '\n';
> return 0;
> }
> ```
>
> I compile it with `g++ -O2 -g` and then record it with `perf record --call-
> graph dwarf`. Using perf script, I then see e.g.:
>
> ```
> $ perf script -v --no-inline --time 90229.12668,90229.127158 --ns
> ...
> # first frame (working unwinding from __hypot_finite):
> unwind: reg 16, val 7faf7dca2696
> unwind: reg 7, val 7ffc80811ca0
> unwind: find_proc_info dso /usr/lib/libm-2.28.so
> unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72
> unwind: find_proc_info dso /usr/lib/libm-2.28.so
> unwind: access_mem addr 0x7ffc80811d08 val 56382b0fc129, offset 104
> unwind: find_proc_info dso
> /home/milian/projects/kdab/rnd/hotspot/build/tests/
> test-clients/cpp-inlining/cpp-inlining
> unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 184
> unwind: find_proc_info dso /usr/lib/libc-2.28.so
> unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 376
> unwind: find_proc_info dso
> /home/milian/projects/kdab/rnd/hotspot/build/tests/
> test-clients/cpp-inlining/cpp-inlining
> unwind: __hypot_finite:ip = 0x7faf7dca2696 (0x29696)
> unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8)
> unwind: main:ip = 0x56382b0fc128 (0x1128)
> unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
> unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
> # second frame (unrelated)
> unwind: reg 16, val 56382b0fc114
> unwind: reg 7, val 7ffc80811d10
> unwind: find_proc_info dso
> /home/milian/projects/kdab/rnd/hotspot/build/tests/
> test-clients/cpp-inlining/cpp-inlining
> unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 72
> unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 264
> unwind: main:ip = 0x56382b0fc114 (0x1114)
> unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
> unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
> # third frame (broken unwinding from __hypot_finite)
> unwind: reg 16, val 7faf7dca2688
> unwind: reg 7, val 7ffc80811ca0
> unwind: find_proc_info dso /usr/lib/libm-2.28.so
> unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32
> unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
> unwind: '':ip = 0xffffffffffffffff (0x0)
> cpp-inlining 24617 90229.126685606: 711026 cycles:uppp:
> 7faf7dca2696 __hypot_finite+0x36 (/usr/lib/libm-2.28.so)
> 7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
> 56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
> build/tests/test-clients/cpp-inlining/cpp-inlining)
> 7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
> 56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
> build/tests/test-clients/cpp-inlining/cpp-inlining)
>
> cpp-inlining 24617 90229.126921551: 714657 cycles:uppp:
> 56382b0fc114 main+0x74 (/home/milian/projects/kdab/rnd/hotspot/
> build/tests/test-clients/cpp-inlining/cpp-inlining)
> 7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
> 56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
> build/tests/test-clients/cpp-inlining/cpp-inlining)
>
> cpp-inlining 24617 90229.127157818: 719976 cycles:uppp:
> 7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so)
> ffffffffffffffff [unknown] ([unknown])
> ...
> ```
>
> Now I'm trying to figure out why one __hypot_finite sample works but the
> other one breaks for no apparent reason.

I've now collected some more background information, which is quite helpful I
believe for the analysis of this issue:

Note how the broken sample has the IP pointing at __hypot_finite+0x28:

unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)

When we run my reproducer code in GDB, we can see that obtaining a backtrace
from that address works just fine there:

```
$ gdb ./cpp-inlining
GNU gdb (GDB) 8.2
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./cpp-inlining...done.
(gdb) break main
Breakpoint 1 at 0x10a0: file /home/milian/projects/kdab/rnd/hotspot/tests/
test-clients/cpp-inlining/main.cpp, line 34.
(gdb) run
Starting program: /home/milian/projects/kdab/rnd/hotspot/build/tests/test-
clients/cpp-inlining/cpp-inlining

Breakpoint 1, main () at /home/milian/projects/kdab/rnd/hotspot/tests/test-
clients/cpp-inlining/main.cpp:34
34 {
(gdb) disassemble __hypot_finite
Dump of assembler code for function __hypot_finite:
0x00007ffff7c96660 <+0>: endbr64
0x00007ffff7c96664 <+4>: push %r13
0x00007ffff7c96666 <+6>: push %r12
0x00007ffff7c96668 <+8>: push %rbp
0x00007ffff7c96669 <+9>: movq %xmm1,%rbp
0x00007ffff7c9666e <+14>: push %rbx
0x00007ffff7c9666f <+15>: movq %xmm0,%rbx
0x00007ffff7c96674 <+20>: shr $0x20,%rbp
0x00007ffff7c96678 <+24>: shr $0x20,%rbx
0x00007ffff7c9667c <+28>: and $0x7fffffff,%ebp
0x00007ffff7c96682 <+34>: and $0x7fffffff,%ebx
0x00007ffff7c96688 <+40>: sub $0x28,%rsp
0x00007ffff7c9668c <+44>: cmp %ebp,%ebx
0x00007ffff7c9668e <+46>: jge 0x7ffff7c966a2 <__hypot_finite+66>
0x00007ffff7c96690 <+48>: mov %ebx,%eax
0x00007ffff7c96692 <+50>: movapd %xmm0,%xmm2
0x00007ffff7c96696 <+54>: movapd %xmm1,%xmm0
0x00007ffff7c9669a <+58>: mov %ebp,%ebx
0x00007ffff7c9669c <+60>: movapd %xmm2,%xmm1
0x00007ffff7c966a0 <+64>: mov %eax,%ebp
0x00007ffff7c966a2 <+66>: movq %xmm0,%r13
0x00007ffff7c966a7 <+71>: mov %rbx,%rax
0x00007ffff7c966aa <+74>: movq %xmm1,%r12
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) break *0x00007ffff7c96688
Breakpoint 2 at 0x7ffff7c96688
(gdb) cont
Continuing.

Breakpoint 2, 0x00007ffff7c96688 in __hypot_finite () from /usr/lib/libm.so.6
(gdb) bt
#0 0x00007ffff7c96688 in __hypot_finite () from /usr/lib/libm.so.6
#1 0x00007ffff7c7caf9 in hypotf64 () from /usr/lib/libm.so.6
#2 0x0000555555555129 in std::__complex_abs (__z=<optimized out>) at /usr/
include/c++/8.2.1/complex:1362
#3 std::abs<double> (__z=...) at /usr/include/c++/8.2.1/complex:618
#4 std::_Norm_helper<true>::_S_do_it<double> (__z=...) at /usr/include/c++/
8.2.1/complex:675
#5 std::norm<double> (__z=...) at /usr/include/c++/8.2.1/complex:685
#6 main () at /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-
inlining/main.cpp:39
```

But note how __hypot_finite+0x28 actually points at an instruction that
modifies the stack pointer by 40, just like I did manually to "fix" unwinding
for this sample (see my initial email):

0x00007ffff7c96688 <+40>: sub $0x28,%rsp

So we are still in the function prologue, and can't properly unwind from there
apparently! Any suggestions on how this should be handled? Can we figure out
the size of the function prologue somehow and manually fix the stack pointer
then?

Cheers
--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-10-22 13:05:08

by Milian Wolff

[permalink] [raw]
Subject: Re: Broken dwarf unwinding - wrong stack pointer register value?

On Montag, 22. Oktober 2018 12:35:39 CEST Milian Wolff wrote:
> On Sonntag, 21. Oktober 2018 00:39:51 CEST Milian Wolff wrote:
> > Hey all,
> >
> > I'm on the quest to figure out why perf regularly fails to unwind (some)
> > samples. I am seeing very strange behavior, where an apparently wrong
> > stack
> > pointer value is read from the register - see below for more information
> > and the end of this (long) mail for my open questions. Any help would be
> > greatly appreciated.
> >
> > I am currently using this trivial C++ code to reproduce the issue:
> >
> > ```
> > #include <cmath>
> > #include <complex>
> > #include <iostream>
> > #include <random>
> >
> > using namespace std;
> >
> > int main()
> > {
> >
> > uniform_real_distribution<double> uniform(-1E5, 1E5);
> > default_random_engine engine;
> > double s = 0;
> > for (int i = 0; i < 10000000; ++i) {
> >
> > s += norm(complex<double>(uniform(engine), uniform(engine)));
> >
> > }
> > cout << s << '\n';
> > return 0;
> >
> > }
> > ```
> >
> > I compile it with `g++ -O2 -g` and then record it with `perf record
> > --call-
> > graph dwarf`. Using perf script, I then see e.g.:
> >
> > ```
> > $ perf script -v --no-inline --time 90229.12668,90229.127158 --ns
> > ...
> > # first frame (working unwinding from __hypot_finite):
> > unwind: reg 16, val 7faf7dca2696
> > unwind: reg 7, val 7ffc80811ca0
> > unwind: find_proc_info dso /usr/lib/libm-2.28.so
> > unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72
> > unwind: find_proc_info dso /usr/lib/libm-2.28.so
> > unwind: access_mem addr 0x7ffc80811d08 val 56382b0fc129, offset 104
> > unwind: find_proc_info dso
> > /home/milian/projects/kdab/rnd/hotspot/build/tests/
> > test-clients/cpp-inlining/cpp-inlining
> > unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 184
> > unwind: find_proc_info dso /usr/lib/libc-2.28.so
> > unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 376
> > unwind: find_proc_info dso
> > /home/milian/projects/kdab/rnd/hotspot/build/tests/
> > test-clients/cpp-inlining/cpp-inlining
> > unwind: __hypot_finite:ip = 0x7faf7dca2696 (0x29696)
> > unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8)
> > unwind: main:ip = 0x56382b0fc128 (0x1128)
> > unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
> > unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
> > # second frame (unrelated)
> > unwind: reg 16, val 56382b0fc114
> > unwind: reg 7, val 7ffc80811d10
> > unwind: find_proc_info dso
> > /home/milian/projects/kdab/rnd/hotspot/build/tests/
> > test-clients/cpp-inlining/cpp-inlining
> > unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 72
> > unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 264
> > unwind: main:ip = 0x56382b0fc114 (0x1114)
> > unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
> > unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
> > # third frame (broken unwinding from __hypot_finite)
> > unwind: reg 16, val 7faf7dca2688
> > unwind: reg 7, val 7ffc80811ca0
> > unwind: find_proc_info dso /usr/lib/libm-2.28.so
> > unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32
> > unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
> > unwind: '':ip = 0xffffffffffffffff (0x0)
> >
> > cpp-inlining 24617 90229.126685606: 711026 cycles:uppp:
> > 7faf7dca2696 __hypot_finite+0x36 (/usr/lib/libm-2.28.so)
> > 7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
> > 56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
> >
> > build/tests/test-clients/cpp-inlining/cpp-inlining)
> >
> > 7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
> > 56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
> >
> > build/tests/test-clients/cpp-inlining/cpp-inlining)
> >
> > cpp-inlining 24617 90229.126921551: 714657 cycles:uppp:
> > 56382b0fc114 main+0x74 (/home/milian/projects/kdab/rnd/hotspot/
> >
> > build/tests/test-clients/cpp-inlining/cpp-inlining)
> >
> > 7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
> > 56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
> >
> > build/tests/test-clients/cpp-inlining/cpp-inlining)
> >
> > cpp-inlining 24617 90229.127157818: 719976 cycles:uppp:
> > 7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so)
> >
> > ffffffffffffffff [unknown] ([unknown])
> >
> > ...
> > ```
> >
> > Now I'm trying to figure out why one __hypot_finite sample works but the
> > other one breaks for no apparent reason.
>
> I've now collected some more background information, which is quite helpful
> I believe for the analysis of this issue:
>
> Note how the broken sample has the IP pointing at __hypot_finite+0x28:
>
> unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
>
> When we run my reproducer code in GDB, we can see that obtaining a backtrace
> from that address works just fine there:
>
> ```
> $ gdb ./cpp-inlining
> GNU gdb (GDB) 8.2
> Copyright (C) 2018 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html> This is free software: you are free to
> change and redistribute it. There is NO WARRANTY, to the extent permitted
> by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-pc-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
>
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from ./cpp-inlining...done.
> (gdb) break main
> Breakpoint 1 at 0x10a0: file /home/milian/projects/kdab/rnd/hotspot/tests/
> test-clients/cpp-inlining/main.cpp, line 34.
> (gdb) run
> Starting program: /home/milian/projects/kdab/rnd/hotspot/build/tests/test-
> clients/cpp-inlining/cpp-inlining
>
> Breakpoint 1, main () at /home/milian/projects/kdab/rnd/hotspot/tests/test-
> clients/cpp-inlining/main.cpp:34
> 34 {
> (gdb) disassemble __hypot_finite
> Dump of assembler code for function __hypot_finite:
> 0x00007ffff7c96660 <+0>: endbr64
> 0x00007ffff7c96664 <+4>: push %r13
> 0x00007ffff7c96666 <+6>: push %r12
> 0x00007ffff7c96668 <+8>: push %rbp
> 0x00007ffff7c96669 <+9>: movq %xmm1,%rbp
> 0x00007ffff7c9666e <+14>: push %rbx
> 0x00007ffff7c9666f <+15>: movq %xmm0,%rbx
> 0x00007ffff7c96674 <+20>: shr $0x20,%rbp
> 0x00007ffff7c96678 <+24>: shr $0x20,%rbx
> 0x00007ffff7c9667c <+28>: and $0x7fffffff,%ebp
> 0x00007ffff7c96682 <+34>: and $0x7fffffff,%ebx
> 0x00007ffff7c96688 <+40>: sub $0x28,%rsp
> 0x00007ffff7c9668c <+44>: cmp %ebp,%ebx
> 0x00007ffff7c9668e <+46>: jge 0x7ffff7c966a2 <__hypot_finite+66>
> 0x00007ffff7c96690 <+48>: mov %ebx,%eax
> 0x00007ffff7c96692 <+50>: movapd %xmm0,%xmm2
> 0x00007ffff7c96696 <+54>: movapd %xmm1,%xmm0
> 0x00007ffff7c9669a <+58>: mov %ebp,%ebx
> 0x00007ffff7c9669c <+60>: movapd %xmm2,%xmm1
> 0x00007ffff7c966a0 <+64>: mov %eax,%ebp
> 0x00007ffff7c966a2 <+66>: movq %xmm0,%r13
> 0x00007ffff7c966a7 <+71>: mov %rbx,%rax
> 0x00007ffff7c966aa <+74>: movq %xmm1,%r12
> --Type <RET> for more, q to quit, c to continue without paging--q
> Quit
> (gdb) break *0x00007ffff7c96688
> Breakpoint 2 at 0x7ffff7c96688
> (gdb) cont
> Continuing.
>
> Breakpoint 2, 0x00007ffff7c96688 in __hypot_finite () from
> /usr/lib/libm.so.6 (gdb) bt
> #0 0x00007ffff7c96688 in __hypot_finite () from /usr/lib/libm.so.6
> #1 0x00007ffff7c7caf9 in hypotf64 () from /usr/lib/libm.so.6
> #2 0x0000555555555129 in std::__complex_abs (__z=<optimized out>) at /usr/
> include/c++/8.2.1/complex:1362
> #3 std::abs<double> (__z=...) at /usr/include/c++/8.2.1/complex:618
> #4 std::_Norm_helper<true>::_S_do_it<double> (__z=...) at /usr/include/c++/
> 8.2.1/complex:675
> #5 std::norm<double> (__z=...) at /usr/include/c++/8.2.1/complex:685
> #6 main () at
> /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-
> inlining/main.cpp:39
> ```
>
> But note how __hypot_finite+0x28 actually points at an instruction that
> modifies the stack pointer by 40, just like I did manually to "fix"
> unwinding for this sample (see my initial email):
>
> 0x00007ffff7c96688 <+40>: sub $0x28,%rsp
>
> So we are still in the function prologue, and can't properly unwind from
> there apparently! Any suggestions on how this should be handled? Can we
> figure out the size of the function prologue somehow and manually fix the
> stack pointer then?

After more digging, it turns out that I've apparently chased a red herring.
I'm running archlinux which isn't shipping debug symbols for libm. Without
those, we are apparently out of luck to unwind from the function prologue.

I still wonder whether we could catch this somehow and handle it ourselves. I
wonder what GDB does to unwind here, despite the lack of debug information in
libm...

Cheers

--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-10-22 14:22:17

by Andi Kleen

[permalink] [raw]
Subject: Re: Broken dwarf unwinding - wrong stack pointer register value?

Milian Wolff <[email protected]> writes:
>
> After more digging, it turns out that I've apparently chased a red herring.
> I'm running archlinux which isn't shipping debug symbols for libm.

64bit executables normally have unwind information even when stripped.
Unless someone forcefully stripped those too.

You can checkout with objdump --sections.

-Andi

2018-10-22 21:01:41

by Milian Wolff

[permalink] [raw]
Subject: Re: Broken dwarf unwinding - wrong stack pointer register value?

On Montag, 22. Oktober 2018 15:58:17 CEST Andi Kleen wrote:
> Milian Wolff <[email protected]> writes:
> > After more digging, it turns out that I've apparently chased a red
> > herring.
> > I'm running archlinux which isn't shipping debug symbols for libm.
>
> 64bit executables normally have unwind information even when stripped.
> Unless someone forcefully stripped those too.
>
> You can checkout with objdump --sections.

Right, we do have .eh_frame and .eh_frame_hdr according to readelf:

```
$ readelf --sections /usr/lib/libm.so.6
There are 26 section headers, starting at offset 0x183120:

Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .note.gnu.build-i NOTE 0000000000000270 00000270
0000000000000024 0000000000000000 A 0 0 4
[ 2] .note.ABI-tag NOTE 0000000000000294 00000294
0000000000000020 0000000000000000 A 0 0 4
[ 3] .note.gnu.propert NOTE 00000000000002b8 000002b8
0000000000000020 0000000000000000 A 0 0 8
[ 4] .gnu.hash GNU_HASH 00000000000002d8 000002d8
00000000000024d0 0000000000000000 A 5 0 8
[ 5] .dynsym DYNSYM 00000000000027a8 000027a8
00000000000066c0 0000000000000018 A 6 1 8
[ 6] .dynstr STRTAB 0000000000008e68 00008e68
0000000000002352 0000000000000000 A 0 0 1
[ 7] .gnu.version VERSYM 000000000000b1ba 0000b1ba
0000000000000890 0000000000000002 A 5 0 2
[ 8] .gnu.version_d VERDEF 000000000000ba50 0000ba50
000000000000017c 0000000000000000 A 6 11 8
[ 9] .gnu.version_r VERNEED 000000000000bbd0 0000bbd0
0000000000000060 0000000000000000 A 6 2 8
[10] .rela.dyn RELA 000000000000bc30 0000bc30
0000000000000480 0000000000000018 A 5 0 8
[11] .init PROGBITS 000000000000d000 0000d000
000000000000001b 0000000000000000 AX 0 0 4
[12] .text PROGBITS 000000000000d020 0000d020
00000000000a063b 0000000000000000 AX 0 0 16
[13] .fini PROGBITS 00000000000ad65c 000ad65c
000000000000000d 0000000000000000 AX 0 0 4
[14] .rodata PROGBITS 00000000000ae000 000ae000
00000000000c76a4 0000000000000000 A 0 0 32
[15] .eh_frame_hdr PROGBITS 00000000001756a4 001756a4
0000000000001c34 0000000000000000 A 0 0 4
[16] .eh_frame PROGBITS 00000000001772d8 001772d8
00000000000093f0 0000000000000000 A 0 0 8
[17] .hash HASH 00000000001806c8 001806c8
000000000000210c 0000000000000004 A 5 0 8
[18] .init_array INIT_ARRAY 0000000000183c80 00182c80
0000000000000008 0000000000000008 WA 0 0 8
[19] .fini_array FINI_ARRAY 0000000000183c88 00182c88
0000000000000008 0000000000000008 WA 0 0 8
[20] .dynamic DYNAMIC 0000000000183c90 00182c90
00000000000001f0 0000000000000010 WA 6 0 8
[21] .got PROGBITS 0000000000183e80 00182e80
0000000000000180 0000000000000008 WA 0 0 8
[22] .data PROGBITS 0000000000184000 00183000
000000000000000c 0000000000000000 WA 0 0 8
[23] .bss NOBITS 000000000018400c 0018300c
000000000000000c 0000000000000000 WA 0 0 4
[24] .comment PROGBITS 0000000000000000 0018300c
000000000000001a 0000000000000001 MS 0 0 1
[25] .shstrtab STRTAB 0000000000000000 00183026
00000000000000fa 0000000000000000 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
l (large), p (processor specific)
```

But should that be enough information to be able to unwind from a function
prologue? I mean, it obviously seems to work when we unwind from the function
body. But how would I know whether it should work from the prologue too?

Reading e.g. https://www.airs.com/blog/archives/460, I can find:

> There should be exactly one FDE covering each instruction which may be being
executed when an exception occurs. By default an exception can only occur
during a function call or a throw. When using the -fnon-call-exceptions gcc
option, an exception can also occur on most memory references and floating
point operations. When using -fasynchronous-unwind-tables, the FDE will cover
every instruction, to permit unwinding from a signal handler.

So what if my libm wasn't compiled with -fasynchronous-unwind-tables? We
probably cannot throw an exception in the function prologue, so potentially
that range is simply not mapped? But this is just a shot in the dark, I have
no clue how to get more information about what contents are actually stored in
the .eh_frame section. I would love to find out though! Does anyone know a
tool to sched some light into this section?

I found http://www.bitlackeys.org/#eh_frame which at least shows me that
__hypot_finite is mentioned in the .eh_frame section:

$ nm -aD /usr/lib/libm.so.6 |& grep hypot_finite
0000000000029660 T __hypot_finite
$ ./eh_frame /usr/lib/libm.so.6 |& grep 29660
Function size: 878 Function Addr: 29660

Thanks

--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-10-23 04:14:44

by Andi Kleen

[permalink] [raw]
Subject: Re: Broken dwarf unwinding - wrong stack pointer register value?

> So what if my libm wasn't compiled with -fasynchronous-unwind-tables? We

It's default (64bit since always and 32bit now too) Unless someone disabled it.

However libm might be partially written in assembler and hand written assembler
often has problems with unwind tables because the programmer has to get them
correct explicitely.

-Andi

2018-10-23 10:50:59

by Milian Wolff

[permalink] [raw]
Subject: Re: Broken dwarf unwinding - wrong stack pointer register value?

On Dienstag, 23. Oktober 2018 06:03:56 CEST Andi Kleen wrote:
> > So what if my libm wasn't compiled with -fasynchronous-unwind-tables? We
>
> It's default (64bit since always and 32bit now too) Unless someone disabled
> it.

Excellent, good to know. Since [1] doesn't explicitly disable it, I would
assume the information should be available.

[1]: https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?
h=packages/glibc

> However libm might be partially written in assembler and hand written
> assembler often has problems with unwind tables because the programmer has
> to get them correct explicitely.

Yes, that could be the case. I'm unsure about the glibc build system and what
actually gets compiled, but I found a potential source at [2]:

[2]: https://github.com/bminor/glibc/blob/
43b1048ab9418e902aac8c834a7a9a88c501620a/sysdeps/ieee754/dbl-64/e_hypot.c

I believe this is what is used on my system, since I can spot calls to
__issignaling@@GLIBC_2.18 etc. in the disassembly of __hypot_finite ([3]),
which matches the sources referenced in [2].

[3]: https://paste.kde.org/poywa7y2z

If [2] is used, then it's not hand written assembler but code compiled by the
compiler. So unwinding should work, even from the prologue?

I have since also figured out how to dump the .eh_frame contents in a human
readable format via readelf. Remember, __hypot_finite on my system is at
offset 0x29660 of libm, so I think the following are the corresponding
.eh_frame contents:

```
$ readelf --debug-dump=frames /usr/lib/libm.so.6 |& less
...
00002b60 000000000000004c 00002b64 FDE cie=00000000
pc=0000000000029660..00000000000299ce
DW_CFA_advance_loc: 6 to 0000000000029666
DW_CFA_def_cfa_offset: 16
DW_CFA_offset: r13 (r13) at cfa-16
DW_CFA_advance_loc: 2 to 0000000000029668
DW_CFA_def_cfa_offset: 24
DW_CFA_offset: r12 (r12) at cfa-24
DW_CFA_advance_loc: 1 to 0000000000029669
DW_CFA_def_cfa_offset: 32
DW_CFA_offset: r6 (rbp) at cfa-32
DW_CFA_advance_loc: 6 to 000000000002966f
DW_CFA_def_cfa_offset: 40
DW_CFA_offset: r3 (rbx) at cfa-40
DW_CFA_advance_loc: 29 to 000000000002968c
DW_CFA_def_cfa_offset: 80
DW_CFA_advance_loc2: 291 to 00000000000297af
DW_CFA_remember_state
DW_CFA_def_cfa_offset: 40
DW_CFA_advance_loc: 5 to 00000000000297b4
DW_CFA_def_cfa_offset: 32
DW_CFA_advance_loc: 1 to 00000000000297b5
DW_CFA_def_cfa_offset: 24
DW_CFA_advance_loc: 2 to 00000000000297b7
DW_CFA_def_cfa_offset: 16
DW_CFA_advance_loc: 2 to 00000000000297b9
DW_CFA_def_cfa_offset: 8
DW_CFA_advance_loc: 7 to 00000000000297c0
DW_CFA_restore_state
DW_CFA_advance_loc1: 88 to 0000000000029818
DW_CFA_remember_state
DW_CFA_def_cfa_offset: 40
DW_CFA_advance_loc: 1 to 0000000000029819
DW_CFA_def_cfa_offset: 32
DW_CFA_advance_loc: 1 to 000000000002981a
DW_CFA_def_cfa_offset: 24
DW_CFA_advance_loc: 2 to 000000000002981c
DW_CFA_def_cfa_offset: 16
DW_CFA_advance_loc: 2 to 000000000002981e
DW_CFA_def_cfa_offset: 8
DW_CFA_advance_loc: 18 to 0000000000029830
DW_CFA_restore_state
DW_CFA_nop
```

I notice that this does not touch the rsp register at all, even though it's
mutated by the code, leading to the issue. See again this paste for the
disassembly at [3], and note that the broken sample frame points at

0x0000000000029688 <+40>: sub $0x28,%rsp

Can someone at least confirm whether unwinding from a function prologue via
.eh_frame (but without .debug_frame) should actually be possible?

Thanks
--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-10-24 14:49:24

by Andi Kleen

[permalink] [raw]
Subject: Re: Broken dwarf unwinding - wrong stack pointer register value?

>
> Can someone at least confirm whether unwinding from a function prologue via
> .eh_frame (but without .debug_frame) should actually be possible?

Yes it should be possible. Asynchronous unwind tables should work
from any instruction.

-Andi

2018-10-30 23:05:30

by Milian Wolff

[permalink] [raw]
Subject: Re: Broken dwarf unwinding - wrong stack pointer register value?

On Mittwoch, 24. Oktober 2018 16:48:18 CET Andi Kleen wrote:
> > Can someone at least confirm whether unwinding from a function prologue
> > via
> > .eh_frame (but without .debug_frame) should actually be possible?
>
> Yes it should be possible. Asynchronous unwind tables should work
> from any instruction.

I'm still trying to figure out what's going on here... I have remembered, that
libunwind has a quite helpful debug mode, which I further extended to generate
a more detailed report of the failed unwinding attempt. For one of the broken
samples, I see:

```
$ UNW_DEBUG_LEVEL=15 perf script -v
...
cpp-inlining 30227 18486.357148: 788066 cycles:ppp:
>_Ux86_64_init_remote: (cursor=0x7fffafa55c10)
unwind: reg 16, val 7f91345d77b4
unwind: reg 7, val 7ffd1e276f18
>_Ux86_64_step: (cursor=0x7fffafa55c10, ip=0x00007f91345d77b4,
cfa=0x00007ffd1e276f18)
unwind: find_proc_info dso /usr/lib/libm-2.28.so
>_Ux86_64_dwarf_search_unwind_table: ip=0x7f91345d77b4,
start_ip=0xffffffffffeb3fbc
>_Ux86_64_dwarf_search_unwind_table: e->fde_offset = 4794, segbase =
7f91347236a4, debug_frame_base = 0, fde_addr = 7f9134727e38
>_Ux86_64_dwarf_extract_proc_info_from_fde: FDE @ 0x7f9134727e38
>_Ux86_64_dwarf_extract_proc_info_from_fde: looking for CIE at
address 7f91347252d8
>parse_cie: CIE parsed OK, augmentation = "zR", handler=0x0
>_Ux86_64_dwarf_extract_proc_info_from_fde: FDE covers IP
0x7f91345d7660-0x7f91345d79ce, LSDA=0x0
>_Ux86_64_fetch_frame: fetch frame ip=0x7f91345d77b4 cfa=0x7ffd1e276f18
format=0
>run_cfi_program: CFA_def_cfa r7+0x8
>run_cfi_program: CFA_offset r16 at cfa+0xfffffffffffffff8
>run_cfi_program: CFA_advance_loc to 0x7f91345d7666
>run_cfi_program: CFA_def_cfa_offset 0x10
>run_cfi_program: CFA_offset r13 at cfa+0xfffffffffffffff0
>run_cfi_program: CFA_advance_loc to 0x7f91345d7668
>run_cfi_program: CFA_def_cfa_offset 0x18
>run_cfi_program: CFA_offset r12 at cfa+0xffffffffffffffe8
>run_cfi_program: CFA_advance_loc to 0x7f91345d7669
>run_cfi_program: CFA_def_cfa_offset 0x20
>run_cfi_program: CFA_offset r6 at cfa+0xffffffffffffffe0
>run_cfi_program: CFA_advance_loc to 0x7f91345d766f
>run_cfi_program: CFA_def_cfa_offset 0x28
>run_cfi_program: CFA_offset r3 at cfa+0xffffffffffffffd8
>run_cfi_program: CFA_advance_loc to 0x7f91345d768c
>run_cfi_program: CFA_def_cfa_offset 0x50
>run_cfi_program: CFA_advance_loc2 to 0x7f91345d77af
>run_cfi_program: CFA_remember_state
>run_cfi_program: CFA_def_cfa_offset 0x28
>run_cfi_program: CFA_advance_loc to 0x7f91345d77b4
>run_cfi_program: CFA_def_cfa_offset 0x20
>run_cfi_program: CFA_advance_loc to 0x7f91345d77b5
>_Ux86_64_cache_frame: cache frame ip=0x7f91345d77b4 cfa=0x7ffd1e276f18
format=0
>_Ux86_64_reuse_frame: reuse frame ip=0x7f91345d77b4 cfa=0x7ffd1e276f18
format=0 addr=0x0 offset=+0
>apply_reg_state: CFA_REG_COLUMN[17] = 0x7,
DWARF_CFA_OFF_COLUMN[18] = 0x20
>apply_reg_state: CFA = 0x7ffd1e276f18 (pop) + 0x20 =
0x7ffd1e276f38
>apply_reg_state: 0: DWARF_WHERE_SAME = (0, 2)
>apply_reg_state: 1: DWARF_WHERE_SAME = (1, 2)
>apply_reg_state: 2: DWARF_WHERE_SAME = (2, 2)
>apply_reg_state: 3: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -40 =
(7ffd1e276f10, 0)
>apply_reg_state: 4: DWARF_WHERE_SAME = (4, 2)
>apply_reg_state: 5: DWARF_WHERE_SAME = (5, 2)
>apply_reg_state: 6: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -32 =
(7ffd1e276f18, 0)
>apply_reg_state: 7: DWARF_WHERE_SAME = (7, 2)
>apply_reg_state: 8: DWARF_WHERE_SAME = (8, 2)
>apply_reg_state: 9: DWARF_WHERE_SAME = (9, 2)
>apply_reg_state: 10: DWARF_WHERE_SAME = (a, 2)
>apply_reg_state: 11: DWARF_WHERE_SAME = (b, 2)
>apply_reg_state: 12: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -24 =
(7ffd1e276f20, 0)
>apply_reg_state: 13: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -16 =
(7ffd1e276f28, 0)
>apply_reg_state: 14: DWARF_WHERE_SAME = (e, 2)
>apply_reg_state: 15: DWARF_WHERE_SAME = (f, 2)
>apply_reg_state: 16: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -8 =
(7ffd1e276f30, 0)
unwind: access_mem addr 0x7ffd1e276f30 val c0d885722245b5e4, offset 24
>apply_reg_state: ret_addr_column: 16, ip: c0d885722245b5e4
>_Ux86_64_step: dwarf_step returned 1
>_Ux86_64_step: returning 1
>_Ux86_64_step: (cursor=0x7fffafa55c10, ip=0xc0d885722245b5e4,
cfa=0x00007ffd1e276f38)
>_Ux86_64_step: dwarf_step returned -22
>_Ux86_64_step: returning -22
unwind: __hypot_finite:ip = 0x7f91345d77b4 (0x297b4)
unwind: '':ip = 0xc0d885722245b5e3 (0x0)

7f91345d77b4 __hypot_finite+0x154 (/usr/lib/libm-2.28.so)
c0d885722245b5e3 [unknown] ([unknown])
```


Now, I also tried the following:

```
$ perf probe -x /usr/lib/libm-2.28.so -a __hypot_finite+0x154
$ perf record -F 1000 --call-graph dwarf -e probe_libm:__hypot_finite ./cpp-
inlining
```

And all of the samples unwind correctly! This makes me believe that it's not
the .eh_frame information which is wrong - otherwise unwinding would always
fail from these locations, esp. when using the custom probe trace point. But
since this is not happening, what else could it be? I only see two
possibilities: the register values or the stack memory stored in in the sample
by perf.

The register values is unlikely, since I now understand how the .eh_frame
contents get analyzed. For __hypot_finite+0x154, we will always end up asking
for the address at SP+24. access_mem thus will always look at the address at
offset 24, independent of the actual value of SP.

So, what remains is that the stack dump is somehow wrong, i.e. its contents
are moved by some offset. Note how I can "fix" the unwinding for such broken
samples by manually applying some offset in access_mem. By looking at other
samples where unwinding works from __hypot_finite, I could figure out that the
correct address to be read for unwnding should be 7f91345bdaf8, e.g.:

```
7f91345d76ed __hypot_finite+0x8d (/usr/lib/libm-2.28.so)
7f91345bdaf8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
5620579cb128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)
```

This address indeed occurs in the user stack dump (starting at 0xe0 in the raw
event data) for the broken sample, cf.:


```
. 00e0: 00 20 00 00 00 00 00 00 c0 b1 9c 57 20 56 00 00 . .........W V..
. 00f0: 70 70 27 1e fd 7f 00 00 f9 da 5b 34 91 7f 00 00 pp'.......[4....
. 0100: e4 b5 45 22 72 85 d8 c0 c0 1d 16 84 43 30 bb c0 ..E"r.......C0..
.
```

We can find `7f91345bdaf8+1 = 7f91345bdaf9" at offset 16 (search for "f9 da 5b
34 91 7f"). Using that address makes unwinding work for this sample. What
could be the reason for this shift?

Thanks

--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.73 kB)

2018-11-01 22:09:21

by Milian Wolff

[permalink] [raw]
Subject: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Dienstag, 30. Oktober 2018 23:34:35 CET Milian Wolff wrote:
> On Mittwoch, 24. Oktober 2018 16:48:18 CET Andi Kleen wrote:
> > > Can someone at least confirm whether unwinding from a function prologue
> > > via
> > > .eh_frame (but without .debug_frame) should actually be possible?
> >
> > Yes it should be possible. Asynchronous unwind tables should work
> > from any instruction.

<snip>

> We can find `7f91345bdaf8+1 = 7f91345bdaf9" at offset 16 (search for "f9 da
> 5b 34 91 7f"). Using that address makes unwinding work for this sample.
> What could be the reason for this shift?

I believe I have found the culprit: PEBS seems to be at fault here - i.e. the
RIP/RSP and the ustack dump of the sample simply don't fit together.

Check this out:

```
$ for i in $(seq 10); do perf record -q -e "cycles:" --call-graph dwarf ./cpp-
inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
[unknown\]"; done
0
0
0
0
0
0
0
0
0
0

$ for i in $(seq 10); do perf record -q -e "cycles:p" --call-graph dwarf ./
cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
[unknown\]"; done
0
0
0
0
0
0
0
0
0
0

$ for i in $(seq 10); do perf record -q -e "cycles:pp" --call-graph dwarf ./
cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
[unknown\]"; done
37
39
35
28
40
39
29
37
31
26

$ for i in $(seq 10); do perf record -q -e "cycles:ppp" --call-graph dwarf ./
cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
[unknown\]"; done
79
70
76
77
70
90
64
78
86
74
```

Note how precise levels 0 and 1 do not produce any samples where unwinding
fails. But precise level 2 produces some, and precise level 3 increases the
amount (by ca. ~2x).

I can reproduce this pattern on two separate Intel CPUs and kernel versions
currently:

Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts

Could someone else try this? What about AMD and IBS - is it also affected?
What about newer/different Intel CPUs?

Better yet, can someone come up with a fix for this on Intel with maximum
precise level?

Thanks

--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-11-02 11:29:09

by Jiri Olsa

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Thu, Nov 01, 2018 at 11:08:18PM +0100, Milian Wolff wrote:
> On Dienstag, 30. Oktober 2018 23:34:35 CET Milian Wolff wrote:
> > On Mittwoch, 24. Oktober 2018 16:48:18 CET Andi Kleen wrote:
> > > > Can someone at least confirm whether unwinding from a function prologue
> > > > via
> > > > .eh_frame (but without .debug_frame) should actually be possible?
> > >
> > > Yes it should be possible. Asynchronous unwind tables should work
> > > from any instruction.
>
> <snip>
>
> > We can find `7f91345bdaf8+1 = 7f91345bdaf9" at offset 16 (search for "f9 da
> > 5b 34 91 7f"). Using that address makes unwinding work for this sample.
> > What could be the reason for this shift?
>
> I believe I have found the culprit: PEBS seems to be at fault here - i.e. the
> RIP/RSP and the ustack dump of the sample simply don't fit together.
>
> Check this out:
>
> ```
> $ for i in $(seq 10); do perf record -q -e "cycles:" --call-graph dwarf ./cpp-
> inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
> [unknown\]"; done
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
>
> $ for i in $(seq 10); do perf record -q -e "cycles:p" --call-graph dwarf ./
> cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
> [unknown\]"; done
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
>
> $ for i in $(seq 10); do perf record -q -e "cycles:pp" --call-graph dwarf ./
> cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
> [unknown\]"; done
> 37
> 39
> 35
> 28
> 40
> 39
> 29
> 37
> 31
> 26
>
> $ for i in $(seq 10); do perf record -q -e "cycles:ppp" --call-graph dwarf ./
> cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
> [unknown\]"; done
> 79
> 70
> 76
> 77
> 70
> 90
> 64
> 78
> 86
> 74
> ```
>
> Note how precise levels 0 and 1 do not produce any samples where unwinding
> fails. But precise level 2 produces some, and precise level 3 increases the
> amount (by ca. ~2x).
>
> I can reproduce this pattern on two separate Intel CPUs and kernel versions
> currently:
>
> Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
> Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts
>
> Could someone else try this? What about AMD and IBS - is it also affected?
> What about newer/different Intel CPUs?

I tried on intel and can't actualy see that.. how do the failed samples
look like? like is there the stack dump attached, what's in the regs?

could you please paste the 'perf report -D' output for some of the
failed samples?

thanks,
jirka


2018-11-02 17:57:44

by Milian Wolff

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Freitag, 2. November 2018 12:26:35 CET Jiri Olsa wrote:
> On Thu, Nov 01, 2018 at 11:08:18PM +0100, Milian Wolff wrote:
> > On Dienstag, 30. Oktober 2018 23:34:35 CET Milian Wolff wrote:
> > > On Mittwoch, 24. Oktober 2018 16:48:18 CET Andi Kleen wrote:
> > > > > Can someone at least confirm whether unwinding from a function
> > > > > prologue
> > > > > via
> > > > > .eh_frame (but without .debug_frame) should actually be possible?
> > > >
> > > > Yes it should be possible. Asynchronous unwind tables should work
> > > > from any instruction.
> >
> > <snip>
> >
> > > We can find `7f91345bdaf8+1 = 7f91345bdaf9" at offset 16 (search for "f9
> > > da
> > > 5b 34 91 7f"). Using that address makes unwinding work for this sample.
> > > What could be the reason for this shift?
> >
> > I believe I have found the culprit: PEBS seems to be at fault here - i.e.
> > the RIP/RSP and the ustack dump of the sample simply don't fit together.
> >
> > Check this out:
> >
> > ```
> > $ for i in $(seq 10); do perf record -q -e "cycles:" --call-graph dwarf
> > ./cpp- inlining > /dev/null; perf script | pcre2grep -c -M
> > "hypot_finite.*\n.*\ [unknown\]"; done
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> >
> > $ for i in $(seq 10); do perf record -q -e "cycles:p" --call-graph dwarf
> > ./
> > cpp-inlining > /dev/null; perf script | pcre2grep -c -M
> > "hypot_finite.*\n.*\ [unknown\]"; done
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> >
> > $ for i in $(seq 10); do perf record -q -e "cycles:pp" --call-graph dwarf
> > ./ cpp-inlining > /dev/null; perf script | pcre2grep -c -M
> > "hypot_finite.*\n.*\ [unknown\]"; done
> > 37
> > 39
> > 35
> > 28
> > 40
> > 39
> > 29
> > 37
> > 31
> > 26
> >
> > $ for i in $(seq 10); do perf record -q -e "cycles:ppp" --call-graph dwarf
> > ./ cpp-inlining > /dev/null; perf script | pcre2grep -c -M
> > "hypot_finite.*\n.*\ [unknown\]"; done
> > 79
> > 70
> > 76
> > 77
> > 70
> > 90
> > 64
> > 78
> > 86
> > 74
> > ```
> >
> > Note how precise levels 0 and 1 do not produce any samples where unwinding
> > fails. But precise level 2 produces some, and precise level 3 increases
> > the
> > amount (by ca. ~2x).
> >
> > I can reproduce this pattern on two separate Intel CPUs and kernel
> > versions
> > currently:
> >
> > Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
> > Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts
> >
> > Could someone else try this? What about AMD and IBS - is it also affected?
> > What about newer/different Intel CPUs?
>
> I tried on intel and can't actualy see that.. how do the failed samples
> look like? like is there the stack dump attached, what's in the regs?
>
> could you please paste the 'perf report -D' output for some of the
> failed samples?

See here for one case: https://paste.kde.org/prryvdilq

What Intel CPU did you use? What microcode version? Which kernel version?

Generally, isn't what I'm seeing actually a neccessary evil of the ustack
based unwinding in perf? I mean, the general procedure is as follows if I'm
not mistaken:

- PMU triggers interrupt and PEBS stores RIP etc.
- code continous to execute, possibly changing the stack
- PMU interrupt is handled, and a perf sample is generated
- register values are used from "past" status stored in PEBS
- but ustack dump is based on the "current" status

From this latter discrepancy, it must directly follow that *sometimes* the
ustack dump represents a state that cannot be unwound from, no?

Cheers
--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-11-05 20:52:02

by Jiri Olsa

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Fri, Nov 02, 2018 at 06:56:50PM +0100, Milian Wolff wrote:

SNIP

> > >
> > > Note how precise levels 0 and 1 do not produce any samples where unwinding
> > > fails. But precise level 2 produces some, and precise level 3 increases
> > > the
> > > amount (by ca. ~2x).
> > >
> > > I can reproduce this pattern on two separate Intel CPUs and kernel
> > > versions
> > > currently:
> > >
> > > Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
> > > Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts
> > >
> > > Could someone else try this? What about AMD and IBS - is it also affected?
> > > What about newer/different Intel CPUs?
> >
> > I tried on intel and can't actualy see that.. how do the failed samples
> > look like? like is there the stack dump attached, what's in the regs?
> >
> > could you please paste the 'perf report -D' output for some of the
> > failed samples?
>
> See here for one case: https://paste.kde.org/prryvdilq

we should really print some helpfull debug output
for this.. like to show some markers where the stack
data starts

>
> What Intel CPU did you use? What microcode version? Which kernel version?
>
> Generally, isn't what I'm seeing actually a neccessary evil of the ustack
> based unwinding in perf? I mean, the general procedure is as follows if I'm
> not mistaken:
>
> - PMU triggers interrupt and PEBS stores RIP etc.
> - code continous to execute, possibly changing the stack

I dont think the code continues to execute.. the stack is ok

the problem I saw in past is that the copy from user is not
100% and sometimes you might not get full stack data you
asked for

have you tried with libdw unwinder? if one of the unwinder
shows more callchains, we need to fix the other one ;-)

jirka

2018-11-05 22:55:39

by Milian Wolff

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Montag, 5. November 2018 21:51:19 CET Jiri Olsa wrote:
> On Fri, Nov 02, 2018 at 06:56:50PM +0100, Milian Wolff wrote:
>
> SNIP
>
> > > > Note how precise levels 0 and 1 do not produce any samples where
> > > > unwinding
> > > > fails. But precise level 2 produces some, and precise level 3
> > > > increases
> > > > the
> > > > amount (by ca. ~2x).
> > > >
> > > > I can reproduce this pattern on two separate Intel CPUs and kernel
> > > > versions
> > > > currently:
> > > >
> > > > Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
> > > > Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts
> > > >
> > > > Could someone else try this? What about AMD and IBS - is it also
> > > > affected?
> > > > What about newer/different Intel CPUs?
> > >
> > > I tried on intel and can't actualy see that.. how do the failed samples
> > > look like? like is there the stack dump attached, what's in the regs?
> > >
> > > could you please paste the 'perf report -D' output for some of the
> > > failed samples?
> >
> > See here for one case: https://paste.kde.org/prryvdilq
>
> we should really print some helpfull debug output
> for this.. like to show some markers where the stack
> data starts

Further down below, the offset for the ustack start is given (0xe0). But yes,
that would be welcome.

> > What Intel CPU did you use? What microcode version? Which kernel version?
> >
> > Generally, isn't what I'm seeing actually a neccessary evil of the ustack
> > based unwinding in perf? I mean, the general procedure is as follows if
> > I'm
> > not mistaken:
> >
> > - PMU triggers interrupt and PEBS stores RIP etc.
> > - code continous to execute, possibly changing the stack
>
> I dont think the code continues to execute.. the stack is ok

Are you sure about this? I mean, isn't that the whole reason why we need PEBS?
Generally, if you are sure about this, can you point me to some documentation
on this to allow me to understand it better?

Also, how do you explain the scenario I am seeing, with `cycles:` and
`cycles:p` not suffering from this issue, but `cycles:pp` and `cycles:ppp`
leading to broken samples? It _has_ to be PEBS then, no? What else could
explain this?

> the problem I saw in past is that the copy from user is not
> 100% and sometimes you might not get full stack data you
> asked for

But that would indicate missing data at the end of the ustack dump? In our
case, the "problematic" data is always at the start.

Also note the apparent shift in the ustack copy which - in one case - directly
correlatates with the code being executed, i.e. from objdump in libm I see:

0x0000000000029688 <+40>: sub $0x28,%rsp
(https://paste.kde.org/poywa7y2z)

The address of the expected parent frame is 7ffff7c7caf8 (hypotf32x+0x18).
This can be found at offset 80 in the ustack dump (cf. https://paste.kde.org/
prryvdilq - ("f9 ca c7 f7 ff 7f" is found at 0x130, minus 0xe0 yields 0x50 or
80).

From the libunwind (or libdw) debug output in perf, we see that the unwinder
tries to access offset 32 (cf. https://paste.kde.org/prryvdilq#line-610),
which is ofset by 48 from the desired value of 80. This offset is *veroy*
close to the value of 40 we see in the libm disassembly for __hypot_function
("$0x28,%rsp"). Is this really just a coincidence?

> have you tried with libdw unwinder? if one of the unwinder
> shows more callchains, we need to fix the other one ;-)

Yes, I've looked at both unwinders. Both try to access the same values, and
both break due to seemingly wrong data being read from the stack. And if you
look at my other patches, you may have seen that I've regularly fixed the
libdw unwinder to bring it closer to libunwind.

Thanks
--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-11-06 00:11:16

by Andi Kleen

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

> > > - PMU triggers interrupt and PEBS stores RIP etc.
> > > - code continous to execute, possibly changing the stack
> >
> > I dont think the code continues to execute.. the stack is ok
>
> Are you sure about this? I mean, isn't that the whole reason why we need PEBS?
> Generally, if you are sure about this, can you point me to some documentation
> on this to allow me to understand it better?

Milian is right.

There is a execution window from PEBS capturing registers to actually triggering
the PMU, and if there is stack manipulation in that window
the PEBS state might be out of sync with the real stack.

The right RIP/RSP to use for the stack unwinding is always the data
in the PMI's exception frame on the stack.

Probably would need to modify perf to report those too in addition
to the PEBS registers.

Of course it would still mean that the stack unwinding may not exactly
match the sample RIP, but at least it should be consistent.

-Andi

2018-11-06 08:40:51

by Jiri Olsa

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Mon, Nov 05, 2018 at 04:10:37PM -0800, Andi Kleen wrote:
> > > > - PMU triggers interrupt and PEBS stores RIP etc.
> > > > - code continous to execute, possibly changing the stack
> > >
> > > I dont think the code continues to execute.. the stack is ok
> >
> > Are you sure about this? I mean, isn't that the whole reason why we need PEBS?
> > Generally, if you are sure about this, can you point me to some documentation
> > on this to allow me to understand it better?
>
> Milian is right.
>
> There is a execution window from PEBS capturing registers to actually triggering
> the PMU, and if there is stack manipulation in that window
> the PEBS state might be out of sync with the real stack.

hum, is this about having 'large pebs' or there's this window
if there's also only single pebs record allowed? which should
be case for dwarf unwind

> The right RIP/RSP to use for the stack unwinding is always the data
> in the PMI's exception frame on the stack.
>
> Probably would need to modify perf to report those too in addition
> to the PEBS registers.

ok, should not be that hard

>
> Of course it would still mean that the stack unwinding may not exactly
> match the sample RIP, but at least it should be consistent.

jirka

2018-11-06 17:27:37

by Andi Kleen

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

> hum, is this about having 'large pebs' or there's this window
> if there's also only single pebs record allowed? which should
> be case for dwarf unwind

With large PEBS today there is never any stack unwind because
stack unwinding can be only done from a PMI.

The window happens even with single PEBS. It's related
to the CPU being pipelined, so complex events may not
happen fully atomically.

-Andi


2018-11-06 20:05:04

by Milian Wolff

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Dienstag, 6. November 2018 09:39:57 CET Jiri Olsa wrote:
> On Mon, Nov 05, 2018 at 04:10:37PM -0800, Andi Kleen wrote:
> > > > > - PMU triggers interrupt and PEBS stores RIP etc.
> > > > > - code continous to execute, possibly changing the stack
> > > >
> > > > I dont think the code continues to execute.. the stack is ok
> > >
> > > Are you sure about this? I mean, isn't that the whole reason why we need
> > > PEBS? Generally, if you are sure about this, can you point me to some
> > > documentation on this to allow me to understand it better?
> >
> > Milian is right.
> >
> > There is a execution window from PEBS capturing registers to actually
> > triggering the PMU, and if there is stack manipulation in that window
> > the PEBS state might be out of sync with the real stack.
>
> hum, is this about having 'large pebs' or there's this window
> if there's also only single pebs record allowed? which should
> be case for dwarf unwind
>
> > The right RIP/RSP to use for the stack unwinding is always the data
> > in the PMI's exception frame on the stack.
> >
> > Probably would need to modify perf to report those too in addition
> > to the PEBS registers.
>
> ok, should not be that hard

Where would I look for the source to change here? So far, I only concentrated
on the userspace side of perf in tools/perf.

Thanks

--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-11-06 20:24:59

by Andi Kleen

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

> Where would I look for the source to change here? So far, I only concentrated
> on the userspace side of perf in tools/perf.

Kind of similar to

a405bad5ad20 perf/x86: Add Haswell specific transaction flag reporting
fdfbbd07e91f perf: Add generic transaction flags

Report the original (not overwritten) regs->ip and regs->sp

-Andi

2018-11-07 22:42:30

by Milian Wolff

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Dienstag, 6. November 2018 21:24:11 CET Andi Kleen wrote:
> > Where would I look for the source to change here? So far, I only
> > concentrated on the userspace side of perf in tools/perf.
>
> Kind of similar to
>
> a405bad5ad20 perf/x86: Add Haswell specific transaction flag reporting
> fdfbbd07e91f perf: Add generic transaction flags
>
> Report the original (not overwritten) regs->ip and regs->sp

Thanks a lot Andi! With your help, I have managed to find the exact issue for
my scenario. Turns out, it really is "just" the instruction pointer that is
wrong. I.e. originally we have IP = 0x7feda32ca68c, but with PEBS we correct
that to IP = 7feda32ca688. The SP register value stays the same according to
my printk output. Using the original IP value, we can unwind correctly since
we point to the correct place in the .eh_frame section. The PEBS IP points to
a different position in the .eh_frame section, which is "too early".

That brings up some questions:

- I noticed `perf record --intr-regs`, but the values recorded in the
perf.data file are always the same. I.e. comparing uregs and iregs, I always
see the same values printed by `perf script`. This smells like a bug to me,
but so far I haven't figured out why this happens...

- Independently, when I add a custom printk manually in `arch/x86/events/
intel/ds.c` at the end of `setup_pebs_sample_data`, then I'm never seeing any
differences between SP in iregs/pebs/regs. Shouldn't it also be recorded via
PEBS? Or is it just chance that I'm never seeing any difference in
setup_pebs_sample_data between iregs->sp and regs->sp?

- Generally, how do we want to handle this bug? If `--intr-regs` would
actually record a different IP than stored in uregs in the perf.data file,
then we could use that as a fallback for unwinding, when it fails the first
time. Or should we always unwind from that IP? How do we mark the "actual"
frame/IP then, if that differs?

Thanks

--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-11-08 12:44:20

by Milian Wolff

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Mittwoch, 7. November 2018 23:41:31 CET Milian Wolff wrote:
> On Dienstag, 6. November 2018 21:24:11 CET Andi Kleen wrote:
> > > Where would I look for the source to change here? So far, I only
> > > concentrated on the userspace side of perf in tools/perf.
> >
> > Kind of similar to
> >
> > a405bad5ad20 perf/x86: Add Haswell specific transaction flag reporting
> > fdfbbd07e91f perf: Add generic transaction flags
> >
> > Report the original (not overwritten) regs->ip and regs->sp
>
> Thanks a lot Andi! With your help, I have managed to find the exact issue
> for my scenario. Turns out, it really is "just" the instruction pointer
> that is wrong. I.e. originally we have IP = 0x7feda32ca68c, but with PEBS
> we correct that to IP = 7feda32ca688. The SP register value stays the same
> according to my printk output. Using the original IP value, we can unwind
> correctly since we point to the correct place in the .eh_frame section. The
> PEBS IP points to a different position in the .eh_frame section, which is
> "too early".
>
> That brings up some questions:
>
> - I noticed `perf record --intr-regs`, but the values recorded in the
> perf.data file are always the same. I.e. comparing uregs and iregs, I always
> see the same values printed by `perf script`. This smells like a bug to me,
> but so far I haven't figured out why this happens...

The reason seems to be that perf_event_output only takes one set of registers,
which then gets handed down into perf_prepare_sample where it gets sampled.
Thus if sample type has both PERF_SAMPLE_REGS_USER and PERF_SAMPLE_REGS_INTR
set, then by design both will store the same values for user space samples.

Can we change this, such that perf_event_output also takes a second set of
registers (iregs) that get sampled for PERF_SAMPLE_REGS_INTR? I'm very new to
real kernel development, what kind of ABI/API stability guarantees exist for
something like "perf_event_output"?

> - Independently, when I add a custom printk manually in `arch/x86/events/
> intel/ds.c` at the end of `setup_pebs_sample_data`, then I'm never seeing
> any differences between SP in iregs/pebs/regs. Shouldn't it also be
> recorded via PEBS? Or is it just chance that I'm never seeing any
> difference in setup_pebs_sample_data between iregs->sp and regs->sp?

The reason here seems to be that the registers stored in "pebs" are
essentially the same as iregs for the setup for `perf record --call-graph
dwarf`. The difference is the availability of `pebs->real_ip` which gets used
on my system to fixup the IP. SP stays untouched and is thus only truly valid
for the untouched IP (which is discarded currently - see above).

> - Generally, how do we want to handle this bug? If `--intr-regs` would
> actually record a different IP than stored in uregs in the perf.data file,
> then we could use that as a fallback for unwinding, when it fails the first
> time. Or should we always unwind from that IP? How do we mark the "actual"
> frame/IP then, if that differs?
>
> Thanks


--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-11-09 00:55:05

by Andi Kleen

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

> - Independently, when I add a custom printk manually in `arch/x86/events/
> intel/ds.c` at the end of `setup_pebs_sample_data`, then I'm never seeing any
> differences between SP in iregs/pebs/regs. Shouldn't it also be recorded via
> PEBS? Or is it just chance that I'm never seeing any difference in
> setup_pebs_sample_data between iregs->sp and regs->sp?

It will depend on the workload.

>
> - Generally, how do we want to handle this bug? If `--intr-regs` would
> actually record a different IP than stored in uregs in the perf.data file,
> then we could use that as a fallback for unwinding, when it fails the first
> time. Or should we always unwind from that IP? How do we mark the "actual"
> frame/IP then, if that differs?

I don't think the regs include IP currently? Would need to add it.
Then could report everything from the original stack in intr-regs,
and this would address that problem.

Yes it might be useful to add some indication to the sample that
there is a potential divergence between the backtrace and the IP.
Would also need to report that in perf report then.

-Andi

2018-11-09 00:56:17

by Andi Kleen

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

> Can we change this, such that perf_event_output also takes a second set of
> registers (iregs) that get sampled for PERF_SAMPLE_REGS_INTR? I'm very new to
> real kernel development, what kind of ABI/API stability guarantees exist for
> something like "perf_event_output"?

Yes you can change it.

That's no API/ABI stability guarantee for kernel internal functions,
as long as you change all callers in tree.

-Andi

2018-11-10 21:45:26

by Travis Downs

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Mon, Nov 5, 2018 at 7:11 PM Andi Kleen <[email protected]> wrote:
> Milian is right.
>
> There is a execution window from PEBS capturing registers to actually triggering
> the PMU, and if there is stack manipulation in that window
> the PEBS state might be out of sync with the real stack.

This explains some weird results I was always getting especially when
functions were small, including
failed unwindings when using dwarf unwinder.

I guess this problem doesn't occur for LBR unwinding since the LBR
records are captured at the same
moment in time as the PEBS record, so reflect the correct branch
sequence. Of course, LBR doesn't
always let you unwind fully, right?

>
> The right RIP/RSP to use for the stack unwinding is always the data
> in the PMI's exception frame on the stack.
>
> Probably would need to modify perf to report those too in addition
> to the PEBS registers.
>
> Of course it would still mean that the stack unwinding may not exactly
> match the sample RIP, but at least it should be consistent.

What would this fix mean for perf report when you use cycles:pp and
cycles:ppp (or any PEBS based events)? The unwinding should generally
work, but the IP at the top of that stack (from the PMI) will
generally be different than that recorded by PEBS. The tree view and
overhead calculations will be based on the captured stacks, I guess -
but when I annotate, will the values I see correspond to the PEBS IPs
or the PMI IPs?

If someone is using cycles:pp or :ppp they probably care about
instruction-level accuracy, so it would be a shame to throw it away.

2018-11-11 01:08:10

by Andi Kleen

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Sat, Nov 10, 2018 at 04:42:48PM -0500, Travis Downs wrote:
> I guess this problem doesn't occur for LBR unwinding since the LBR
> records are captured at the same
> moment in time as the PEBS record, so reflect the correct branch
> sequence.

Actually it happens with LBRs too, but it always gives the backtrace
consistently at the PMI trigger point.

> What would this fix mean for perf report when you use cycles:pp and
> cycles:ppp (or any PEBS based events)? The unwinding should generally
> work, but the IP at the top of that stack (from the PMI) will
> generally be different than that recorded by PEBS. The tree view and
> overhead calculations will be based on the captured stacks, I guess -
> but when I annotate, will the values I see correspond to the PEBS IPs
> or the PMI IPs?

Based on PEBS IPs.

It would be a good idea to add a check to perf report
that the two IPs are different, and if they differ
add some indicator to the sample. This could be a new sort key,
although that would waste some space on the screen, or something
else.

It wouldn't be cover all cases, for example if you have recursion
on the same function it might report the same IP even though
it's a different instance, but I presume that should be rare
enough to not be a problem.

-Andi

2018-11-11 02:56:07

by Travis Downs

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Sat, Nov 10, 2018 at 8:07 PM Andi Kleen <[email protected]> wrote:
>
> On Sat, Nov 10, 2018 at 04:42:48PM -0500, Travis Downs wrote:
> > I guess this problem doesn't occur for LBR unwinding since the LBR
> > records are captured at the same
> > moment in time as the PEBS record, so reflect the correct branch
> > sequence.
>
> Actually it happens with LBRs too, but it always gives the backtrace
> consistently at the PMI trigger point.


That's weird - so the LBR records are from the PMI point, but the rest
of the PEBS record comes from the PEBS trigger point? Or the LBR isn't
part of PEBS at all?

>
> > What would this fix mean for perf report when you use cycles:pp and
> > cycles:ppp (or any PEBS based events)? The unwinding should generally
> > work, but the IP at the top of that stack (from the PMI) will
> > generally be different than that recorded by PEBS. The tree view and
> > overhead calculations will be based on the captured stacks, I guess -
> > but when I annotate, will the values I see correspond to the PEBS IPs
> > or the PMI IPs?
>
> Based on PEBS IPs.
>
> It would be a good idea to add a check to perf report
> that the two IPs are different, and if they differ
> add some indicator to the sample. This could be a new sort key,
> although that would waste some space on the screen, or something
> else.


In the case that PEBS events are used, the IP will differ essentially
100% of the time, right? That is, there will always be *some* skid.

>
>
> It wouldn't be cover all cases, for example if you have recursion
> on the same function it might report the same IP even though
> it's a different instance, but I presume that should be rare
> enough to not be a problem.
>

Well the main problem I see is that "IP inconsistency" will be the
usual case, and it will be hard to resent in a reasonable way in the
report. For example, the backtrace-based displays/reports may indicate
that 80% of your samples are in function X, but based on the PEBS IP
records, only 50% may fall in that function, so you'll always have a
weird thing where when you are investigating within the stack-display
you might see 1234 samples in a function, but when you annotate only
789 samples are accounted for, or whatever.

I don't think this is 100% solvable, it's mostly an issue of
displaying it reasonably and managing expectations.

If the LBR record came from PEBS (as I had thought, but perhaps you
are indicating otherwise above), I could imagine a hybrid mode where
LBR is used to go back some number of calls and then dwarf or FP or
whatever unwinding takes over, because the further down the stack you
do the more likely the PEBS trigger point and PMI point are likely to
have a consistent stack.

2018-11-12 03:27:22

by Andi Kleen

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Sat, Nov 10, 2018 at 09:50:05PM -0500, Travis Downs wrote:
> On Sat, Nov 10, 2018 at 8:07 PM Andi Kleen <[email protected]> wrote:
>
> On Sat, Nov 10, 2018 at 04:42:48PM -0500, Travis Downs wrote:
> > I guess this problem doesn't occur for LBR unwinding since the LBR
> > records are captured at the same
> > moment in time as the PEBS record, so reflect the correct branch
> > sequence.
>
> Actually it happens with LBRs too, but it always gives the backtrace
> consistently at the PMI trigger point.
>
> That's weird - so the LBR records are from the PMI point, but the rest of
> the PEBS record comes from the PEBS trigger point? Or the LBR isn't part
> of PEBS at all?

LBR is not part of PEBS, but is collected separately in the PMI handler.

> > overhead calculations will be based on the captured stacks, I guess -
> > but when I annotate, will the values I see correspond to the PEBS IPs
> > or the PMI IPs?
>
> Based on PEBS IPs.
>
> It would be a good idea to add a check to perf report
> that the two IPs are different, and if they differ
> add some indicator to the sample. This could be a new sort key,
> although that would waste some space on the screen, or something
> else.
>
> In the case that PEBS events are used, the IP will differ essentially 100%
> of the time, right? That is, there will always be *some* skid.

I wouldn't say that. It depends on what the CPU is doing and the IPC
of the code.

Also the backtrace inconsistency can only happen if the sample races with
function return. If you don't then the backtrace will point
to the correct function, even though the unwind IP is different.

For example in the common case where you profile a long loop it
is unlikely to happen.


> indicating otherwise above), I could imagine a hybrid mode where LBR is
> used to go back some number of calls and then dwarf or FP or whatever
> unwinding takes over, because the further down the stack you do the more
> likely the PEBS trigger point and PMI point are likely to have a
> consistent stack.

Could collect numbers how often it happens, but it would surprise
me if anything complicated is worth it. I would just do the minimum fixes
to address the unwinder errors, and perhaps add the "unwind ip differs"
indication.

-Andi

2018-11-14 13:21:25

by Milian Wolff

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Montag, 12. November 2018 04:26:37 CET Andi Kleen wrote:
> On Sat, Nov 10, 2018 at 09:50:05PM -0500, Travis Downs wrote:
> > On Sat, Nov 10, 2018 at 8:07 PM Andi Kleen <[email protected]> wrote:
> > On Sat, Nov 10, 2018 at 04:42:48PM -0500, Travis Downs wrote:
> > > I guess this problem doesn't occur for LBR unwinding since the LBR
> > > records are captured at the same
> > > moment in time as the PEBS record, so reflect the correct branch
> > > sequence.
> >
> > Actually it happens with LBRs too, but it always gives the backtrace
> > consistently at the PMI trigger point.
> >
> > That's weird - so the LBR records are from the PMI point, but the rest
> > of
> > the PEBS record comes from the PEBS trigger point? Or the LBR isn't
> > part
> > of PEBS at all?
>
> LBR is not part of PEBS, but is collected separately in the PMI handler.
>
> > > overhead calculations will be based on the captured stacks, I guess
> > > -
> > > but when I annotate, will the values I see correspond to the PEBS
> > > IPs
> > > or the PMI IPs?
> >
> > Based on PEBS IPs.
> >
> > It would be a good idea to add a check to perf report
> > that the two IPs are different, and if they differ
> > add some indicator to the sample. This could be a new sort key,
> > although that would waste some space on the screen, or something
> > else.
> >
> > In the case that PEBS events are used, the IP will differ essentially
> > 100%
> > of the time, right? That is, there will always be *some* skid.
>
> I wouldn't say that. It depends on what the CPU is doing and the IPC
> of the code.
>
> Also the backtrace inconsistency can only happen if the sample races with
> function return. If you don't then the backtrace will point
> to the correct function, even though the unwind IP is different.
>
> For example in the common case where you profile a long loop it
> is unlikely to happen.
>
> > indicating otherwise above), I could imagine a hybrid mode where LBR is
> > used to go back some number of calls and then dwarf or FP or whatever
> > unwinding takes over, because the further down the stack you do the
> > more
> > likely the PEBS trigger point and PMI point are likely to have a
> > consistent stack.
>
> Could collect numbers how often it happens, but it would surprise
> me if anything complicated is worth it. I would just do the minimum fixes
> to address the unwinder errors, and perhaps add the "unwind ip differs"
> indication.

I now have a preliminary WIP patch up and running (see attached), which works
for my usecase and improves perf noticeably. All traces of "unknown" frames
are eradicated, i.e. unwinding now works for 100% of the samples!

There are some remaining open questions on my side:

1) Do we really want to change the API of perf_event_overflow_* and
perf_event_output_* and adapt all its users? To me, it seems as if only PEBS
and IBS would want to pass distinct register sets for iregs and uregs. All
other users of the API would continue to pass the same set. Changing the
central API produces a lot of churn for no good reason. Does anyone see an
alternative to this?

The only alternative idea I have right now is to temporarily change the
sample_type in __intel_pmu_pebs_event before we call perf_event_output /
perf_event_overflow. I.e. unset PERF_SAMPLE_REGS_INTR, then sample the regs
manually from iregs before calling perf_event_{overflow,output}, then set
PERF_SAPMLE_REGS_INTR again. Or we could introduce a custom flag similar to
__PERF_SAMPLE_CALLCHAIN_EARLY here...

2) How do we want to do ?the "unwind ip differs" indication? as Andi puts it?
I.e. on the perf report/script side, how should we display the stacks?
Something like the following annotation maybe?


```
cpp-inlining 2605 [-01] 57.870061: 701199 cycles:pppu:
7fc1042797b4 __hypot_finite+0x154 (/usr/lib/libm-2.28.so)
7fc10425faf8 hypotf32x+0x18 (/usr/lib/libm-2.28.so) (unwind ip
differs)
5622c7452128 main+0x88 (/tmp/cpp-inlining)
7fc104096222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
5622c74521ed _start+0x2d (/tmp/cpp-inlining)
```

3) I suggest we always keep the first frame and sample IP from the user regs,
i.e. the accurate PEBS/IBS IP. Then we add the following frames from unwinding
the ustack with the iregs. But what do we do with the first iregs IP? If we
add it, then we could see the same frame with slightly different IP, like in
the following, which is undesired I believe:


```
cpp-inlining 2605 [-01] 57.870061: 701199 cycles:pppu:
7fc1042797b4 __hypot_finite+0x154 (/usr/lib/libm-2.28.so)
7fc1042797b5 __hypot_finite+0x155 (/usr/lib/libm-2.28.so)
7fc10425faf8 hypotf32x+0x18 (/usr/lib/libm-2.28.so) (unwind ip
differs)
5622c7452128 main+0x88 (/tmp/cpp-inlining)
7fc104096222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
5622c74521ed _start+0x2d (/tmp/cpp-inlining)
```

But always skipping the IP is also sometimes wrong, like in this case:

```
cpp-inlining 2605 [-01] 57.862313: 694984 cycles:pppu:
7fc1042797b9 __hypot_finite+0x159 (/usr/lib/libm-2.28.so)
5622c7452128 main+0x88 (/tmp/cpp-inlining)
7fc104096222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
5622c74521ed _start+0x2d (/tmp/cpp-inlining)
```

Here, we are missing the hypotf32x call inbetween __hypot_finite and main.

Do we want to introduce some heuristic on how handle these scenarios? I.e. if
uregs->ip and iregs->ip point to the same function symbol, then skip the frame
for iregs->ip, otherwise add it?

Thanks
--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
0001-WIP-perf-make-it-possible-to-collect-both-iregs-and-.patch (14.05 kB)
0002-WIP-perf-unwind-use-iregs-for-unwinding.patch (2.12 kB)
smime.p7s (3.74 kB)
Download all attachments

2018-11-15 02:07:01

by Travis Downs

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Wed, Nov 14, 2018 at 8:20 AM Milian Wolff <[email protected]> wrote:

> 3) I suggest we always keep the first frame and sample IP from the user regs,
> i.e. the accurate PEBS/IBS IP. Then we add the following frames from unwinding
> the ustack with the iregs.

Does this mean that the displayed unwind will sometimes be
"impossible" to have actually be generated from a consistent execution
of the user program?

For example, the top frame (from PEBS) and second frame (from iregs)
may be inconsistent in that the latter function never calls the first.
At this point it would be good to have an indication at the top frame
is from a different source than the rest of the frames, lest the user
pull is hair out trying to determine how function X seems to call
function Y despite that not being the case in the source.

2018-11-15 02:17:15

by Travis Downs

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Sun, Nov 11, 2018 at 10:26 PM Andi Kleen <[email protected]> wrote:
> On Sat, Nov 10, 2018 at 09:50:05PM -0500, Travis Downs wrote:

> LBR is not part of PEBS, but is collected separately in the PMI handler.

Thanks for clearing this up - so you can ignore any earlier
suggestions on my part of trying to use LBR to fix the unwinding
inconsistency.

> > In the case that PEBS events are used, the IP will differ essentially 100%
> > of the time, right? That is, there will always be *some* skid.
>
> I wouldn't say that. It depends on what the CPU is doing and the IPC
> of the code.

Well other than say a long-latency cache miss, it is my experience
that the skid is generally never zero. That is, the PEBS and ireg IP
will usually differ. This is mostly moot though: what is important is
how often the ireg skip results in a different call chain (i.e., a
return occurred between the PEBS point and the interrupt), as you have
pointed out.

>
> Also the backtrace inconsistency can only happen if the sample races with
> function return. If you don't then the backtrace will point
> to the correct function, even though the unwind IP is different.
>
> For example in the common case where you profile a long loop it
> is unlikely to happen.

Agreed.


> Could collect numbers how often it happens, but it would surprise
> me if anything complicated is worth it. I would just do the minimum fixes
> to address the unwinder errors, and perhaps add the "unwind ip differs"
> indication.

As above, I think the most important UX problem is not when the IP
differs, but when the top frame of the IP unwind is different than the
function in which the PEBS sample occurred. I think the case where the
skid ends up with both in the same function doesn't pose any
presentation difficulties [1]. When they are different though, it
seems tough to present a consistent picture.

[1] Strictly speaking, this the "IPs are in the same function" is not
sufficient. Imagine a scenario where you have T->B->A (T calls B calls
A) and the PEBS sample happens in A, and then A and B return, and now
C then A are called (T->C->A) and the PMI happens. Now the PEBS IP and
the ireg IP are in the same function, but the stacks are still
inconsistent. It is probably fine to paper this over and show the user
the T->C->A stack, as this stack is somehow accurate (it really
happened), but the user might be confused when he looks at the
annotation for A, and sees code being executed (having PEBS samples)
that he knows can never execute when C calls A (for example) since the
annotations are based on the hidden T->B->A execution...

Bleh.

2018-11-15 09:11:16

by Milian Wolff

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

On Donnerstag, 15. November 2018 03:05:32 CET Travis Downs wrote:
> On Wed, Nov 14, 2018 at 8:20 AM Milian Wolff <[email protected]> wrote:
> > 3) I suggest we always keep the first frame and sample IP from the user
> > regs, i.e. the accurate PEBS/IBS IP. Then we add the following frames
> > from unwinding the ustack with the iregs.
>
> Does this mean that the displayed unwind will sometimes be
> "impossible" to have actually be generated from a consistent execution
> of the user program?

Yes, that is exactly what I'm saying.

> For example, the top frame (from PEBS) and second frame (from iregs)
> may be inconsistent in that the latter function never calls the first.
> At this point it would be good to have an indication at the top frame
> is from a different source than the rest of the frames, lest the user
> pull is hair out trying to determine how function X seems to call
> function Y despite that not being the case in the source.

I agree. I personally like your suggested approach - only add an indication
when the IP differs so much that it points to a different function. What do
others say to this?

Cheers

--
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


Attachments:
smime.p7s (3.74 kB)

2018-11-15 19:03:24

by Andi Kleen

[permalink] [raw]
Subject: Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]

Milian Wolff <[email protected]> writes:
>
> I agree. I personally like your suggested approach - only add an indication
> when the IP differs so much that it points to a different function. What do
> others say to this?

Yes that's fine.

-Andi