LinuxLists.cc - [RFC PATCH] x86/dumpstack: Fix unwind failure due to off-by-one-frame

2022-01-27 08:14:41

Subject: [RFC PATCH] x86/dumpstack: Fix unwind failure due to off-by-one-frame

(emphasis on the "RFC", not the "PATCH"...)

I've hit a bug where __dump_stack() ends up printing a stack trace that
consists purely of guesses (all printed frames start with "? ").

Debugging the issue, I found that show_trace_log_lvl() is looking at a
stack that looks like this:

function stored value pointer in show_trace_log_lvl()
====================================================================
show_stack saved RIP
show_stack saved RBP <-- stack
show_trace_log_lvl saved RIP <-- unwind_get_return_address_ptr(...)
show_trace_log_lvl ...
show_trace_log_lvl ...

show_trace_log_lvl() then iterates up the stack with its `stack`
variable; but because `unwind_get_return_address_ptr(&state)` is below the
starting point, the two never compile equal, and so `reliable` is never
set to 1.

Poking around a bit, I see two issues.

The first issue is that __unwind_start() tries to figure out whether
`first_frame` is inside the current frame before even having looked up
the ORC entry that determines where the current frame ends.
That can't work and results in being off-by-one-frame in some cases no
matter how we twist the comparison between `state->sp` and `first_frame`.

The second issue is that show_trace_log_lvl() asks __unwind_start() to
stop when it finds the frame containing `stack`, but then tries
comparing `unwind_get_return_address_ptr(&state)` (which has to be below
`stack`, since it is part of the lower frame) with `stack`.
That can't work if __unwind_start() is working properly - we'll have to
unwind up another frame.

This patch is an attempt to fix that, but I guess there might still be
issues with it in the interaction with show_regs_if_on_stack() in
show_trace_log_lvl(), or something like that?

Another option might be to rework even more how ORC stack walking works,
and always compute the location of the next frame in __unwind_start()
and unwind_next(), such that it becomes possible to query for the top
of the current frame?

Or a completely different approach, do more special-casing of different
unwinding scenarios in __unwind_start(), such that unwinding a remote
task doesn't go through the skip-ahead loop, and unwinding the current
task from a starting point is always guaranteed to skip the given frame
and stop at the following one? Or something along those lines?
That would also make it more obviously correct what happens if a
function specifies its own frame as the starting point wrt to changes to
that frame's contents before the call to unwind_next()... now that I'm
typing this out, I think that might be the best option?

[no signoff because I think this patch might be a terrible idea...]
---
arch/x86/kernel/dumpstack.c | 10 ++++++++
arch/x86/kernel/unwind_orc.c | 46 ++++++++++++++++++++++++++++++++----
2 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 53de044e5654..2bffe8132e7c 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -195,6 +195,16 @@ static void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
printk("%sCall Trace:\n", log_lvl);

unwind_start(&state, task, regs, stack);
+
+ /* We asked unwind_start() to give us the unwind state for the frame
+ * that contains @stack. But unwind_get_return_address_ptr() gives the
+ * pointer to the saved RIP at the top of the previous frame.
+ * If we want to be able to compare unwind_get_return_address_ptr()
+ * with stack positions starting at @stack, we have to skip to the next
+ * frame.
+ */
+ unwind_next_frame(&state);
+
stack = stack ? : get_stack_pointer(task, regs);
regs = unwind_get_entry_regs(&state, &partial);

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 2de3c8c5eba9..611ffa4d3701 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -706,10 +706,48 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
}

/* Otherwise, skip ahead to the user-specified starting frame: */
- while (!unwind_done(state) &&
- (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
- state->sp < (unsigned long)first_frame))
- unwind_next_frame(state);
+ while (!unwind_done(state)) {
+ struct unwind_state lookahead_state;
+ /* If we're not even on the right stack yet, skip. */
+ if (!on_stack(&state->stack_info, first_frame, sizeof(long))) {
+ unwind_next_frame(state);
+ continue;
+ }
+
+ /*
+ * What the user has given us is just a pointer somewhere *into*
+ * a frame - for example, if the caller used
+ * __builtin_frame_address(0), that function will have a saved
+ * RBP below the saved RIP (yes, even when compiling without
+ * frame pointers), and first_frame will point to that saved
+ * RBP. If we're unwinding a remote task, we might be dealing
+ * with a first_frame pointer to the bottom of a stack frame.
+ *
+ * So if we want to stop at the frame *containing* first_frame,
+ * we have to first look up the next ORC entry and check whether
+ * first_frame is within the current frame.
+ */
+ lookahead_state = *state;
+ unwind_next_frame(&lookahead_state);
+
+ /*
+ * It might be that the current frame is the last frame on this
+ * stack, and the lookahead_state points to an entirely
+ * different stack. We already decided above that we want to
+ * stop somewhere on the current stack, so stop here.
+ */
+ if (lookahead_state.stack_info.type != state->stack_info.type)
+ break;
+
+ /*
+ * Stop if the user-specified address is below where the next
+ * frame starts (meaning it is in the current frame).
+ */
+ if ((unsigned long)first_frame < lookahead_state.sp)
+ break;
+
+ *state = lookahead_state;
+ }

return;

base-commit: 0280e3c58f92b2fe0e8fbbdf8d386449168de4a8
prerequisite-patch-id: c01c9b02a02ca0c6996a5fb02e68e137444a0a53
--
2.35.0.rc0.227.g00780c9af4-goog

2022-02-01 20:57:25

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [RFC PATCH] x86/dumpstack: Fix unwind failure due to off-by-one-frame

On Thu, Jan 27, 2022 at 01:55:55AM +0100, Jann Horn wrote:
> (emphasis on the "RFC", not the "PATCH"...)
>
> I've hit a bug where __dump_stack() ends up printing a stack trace that
> consists purely of guesses (all printed frames start with "? ").
>
> Debugging the issue, I found that show_trace_log_lvl() is looking at a
> stack that looks like this:
>
> function stored value pointer in show_trace_log_lvl()
> ====================================================================
> show_stack saved RIP
> show_stack saved RBP <-- stack
> show_trace_log_lvl saved RIP <-- unwind_get_return_address_ptr(...)
> show_trace_log_lvl ...
> show_trace_log_lvl ...
>
> show_trace_log_lvl() then iterates up the stack with its `stack`
> variable; but because `unwind_get_return_address_ptr(&state)` is below the
> starting point, the two never compile equal, and so `reliable` is never
> set to 1.

Thanks for reporting! If I understand correctly, this only happens
when show_stack() has an 8-byte stack size.

> Poking around a bit, I see two issues.
>
> The first issue is that __unwind_start() tries to figure out whether
> `first_frame` is inside the current frame before even having looked up
> the ORC entry that determines where the current frame ends.
> That can't work and results in being off-by-one-frame in some cases no
> matter how we twist the comparison between `state->sp` and `first_frame`.

> The second issue is that show_trace_log_lvl() asks __unwind_start() to
> stop when it finds the frame containing `stack`, but then tries
> comparing `unwind_get_return_address_ptr(&state)` (which has to be below
> `stack`, since it is part of the lower frame) with `stack`.
> That can't work if __unwind_start() is working properly - we'll have to
> unwind up another frame.
>
> This patch is an attempt to fix that, but I guess there might still be
> issues with it in the interaction with show_regs_if_on_stack() in
> show_trace_log_lvl(), or something like that?
>
> Another option might be to rework even more how ORC stack walking works,
> and always compute the location of the next frame in __unwind_start()
> and unwind_next(), such that it becomes possible to query for the top
> of the current frame?
>
> Or a completely different approach, do more special-casing of different
> unwinding scenarios in __unwind_start(), such that unwinding a remote
> task doesn't go through the skip-ahead loop, and unwinding the current
> task from a starting point is always guaranteed to skip the given frame
> and stop at the following one? Or something along those lines?
>
> That would also make it more obviously correct what happens if a
> function specifies its own frame as the starting point wrt to changes to
> that frame's contents before the call to unwind_next()... now that I'm
> typing this out, I think that might be the best option?

If I understand correctly, this last proposal is what the current
__unwind_start() code already attempts to do (but obviously fails in the
above off-by-one case). It tries to start at the first frame it finds
*beyond* the given 'first_frame' pointer, rather than the frame
including it. That makes the logic simpler, since you don't have to
find the size of the frame.

So I think this bug could be fixed by reverting commit f1d9a2abff66
("x86/unwind/orc: Don't skip the first frame for inactive tasks").

Can you confirm?

If that fixes it, we may need to do a little more special-casing like
you suggest to get the expected results for the different use cases.

--
Josh

2022-02-02 09:36:30

by Jann Horn

[permalink] [raw]

Subject: Re: [RFC PATCH] x86/dumpstack: Fix unwind failure due to off-by-one-frame

On Tue, Feb 1, 2022 at 1:30 AM Josh Poimboeuf <[email protected]> wrote:
> On Thu, Jan 27, 2022 at 01:55:55AM +0100, Jann Horn wrote:
> > (emphasis on the "RFC", not the "PATCH"...)
> >
> > I've hit a bug where __dump_stack() ends up printing a stack trace that
> > consists purely of guesses (all printed frames start with "? ").
> >
> > Debugging the issue, I found that show_trace_log_lvl() is looking at a
> > stack that looks like this:
> >
> > function stored value pointer in show_trace_log_lvl()
> > ====================================================================
> > show_stack saved RIP
> > show_stack saved RBP <-- stack
> > show_trace_log_lvl saved RIP <-- unwind_get_return_address_ptr(...)
> > show_trace_log_lvl ...
> > show_trace_log_lvl ...
> >
> > show_trace_log_lvl() then iterates up the stack with its `stack`
> > variable; but because `unwind_get_return_address_ptr(&state)` is below the
> > starting point, the two never compile equal, and so `reliable` is never
> > set to 1.
>
> Thanks for reporting! If I understand correctly, this only happens
> when show_stack() has an 8-byte stack size.

Yes, I think so. (Well, 16 bytes if you count the saved RIP at the top
as part of the frame.)

I just realized that this probably happened to me because I was
compiling with -fno-optimize-sibling-calls (to make stack traces more
readable, because sibling call optimization effectively makes stack
frames randomly disappear, which I find very frustrating).

> > Poking around a bit, I see two issues.
> >
> > The first issue is that __unwind_start() tries to figure out whether
> > `first_frame` is inside the current frame before even having looked up
> > the ORC entry that determines where the current frame ends.
> > That can't work and results in being off-by-one-frame in some cases no
> > matter how we twist the comparison between `state->sp` and `first_frame`.
>
> > The second issue is that show_trace_log_lvl() asks __unwind_start() to
> > stop when it finds the frame containing `stack`, but then tries
> > comparing `unwind_get_return_address_ptr(&state)` (which has to be below
> > `stack`, since it is part of the lower frame) with `stack`.
> > That can't work if __unwind_start() is working properly - we'll have to
> > unwind up another frame.
> >
> > This patch is an attempt to fix that, but I guess there might still be
> > issues with it in the interaction with show_regs_if_on_stack() in
> > show_trace_log_lvl(), or something like that?
> >
> > Another option might be to rework even more how ORC stack walking works,
> > and always compute the location of the next frame in __unwind_start()
> > and unwind_next(), such that it becomes possible to query for the top
> > of the current frame?
> >
> > Or a completely different approach, do more special-casing of different
> > unwinding scenarios in __unwind_start(), such that unwinding a remote
> > task doesn't go through the skip-ahead loop, and unwinding the current
> > task from a starting point is always guaranteed to skip the given frame
> > and stop at the following one? Or something along those lines?
> >
> > That would also make it more obviously correct what happens if a
> > function specifies its own frame as the starting point wrt to changes to
> > that frame's contents before the call to unwind_next()... now that I'm
> > typing this out, I think that might be the best option?
>
> If I understand correctly, this last proposal is what the current
> __unwind_start() code already attempts to do (but obviously fails in the
> above off-by-one case). It tries to start at the first frame it finds
> *beyond* the given 'first_frame' pointer, rather than the frame
> including it. That makes the logic simpler, since you don't have to
> find the size of the frame.

Ahh, okay. I missed that those were the intended semantics...

> So I think this bug could be fixed by reverting commit f1d9a2abff66
> ("x86/unwind/orc: Don't skip the first frame for inactive tasks").
>
> Can you confirm?

Yes.

When compiling with "gcc (Debian 11.2.0-13) 11.2.0" and
"-fno-optimize-sibling-calls", "echo l > /proc/sysrq-trigger" prints
an all-guesses trace for the current CPU:

[ 99.465299][ T598] sysrq: Show backtrace of all active CPUs
[ 99.466130][ T598] NMI backtrace for cpu 0
[ 99.466533][ T598] CPU: 0 PID: 598 Comm: bash Not tainted
5.17.0-rc1-00082-g81c3649a14a2-dirty #944
[ 99.467602][ T598] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.15.0-1 04/01/2014
[ 99.468511][ T598] Call Trace:
[ 99.468927][ T598] <TASK>
[ 99.469268][ T598] ? dump_stack_lvl+0x45/0x59
[ 99.469807][ T598] ? dump_stack+0xc/0xd
[ 99.470284][ T598] ? nmi_cpu_backtrace.cold+0xa4/0xa9
[ 99.470760][ T598] ? lapic_can_unplug_cpu+0x80/0x80
[ 99.471330][ T598] ? nmi_trigger_cpumask_backtrace+0x10e/0x140
[ 99.471889][ T598] ? arch_trigger_cpumask_backtrace+0x15/0x20
[ 99.472427][ T598] ? sysrq_handle_showallcpus+0x13/0x20
[ 99.472970][ T598] ? __handle_sysrq.cold+0x11c/0x37b
[ 99.473450][ T598] ? write_sysrq_trigger+0x3f/0x50
[ 99.473963][ T598] ? proc_reg_write+0x1b3/0x270
[ 99.474408][ T598] ? vfs_write+0x1c7/0x920
[ 99.474920][ T598] ? ksys_write+0xf9/0x1d0
[ 99.475507][ T598] ? __ia32_sys_read+0xb0/0xb0
[ 99.476467][ T598] ? lock_is_held_type+0xd7/0x130
[ 99.477071][ T598] ? syscall_enter_from_user_mode+0x1d/0x50
[ 99.477708][ T598] ? __x64_sys_write+0x6e/0xb0
[ 99.478216][ T598] ? syscall_enter_from_user_mode+0x1d/0x50
[ 99.479137][ T598] ? do_syscall_64+0x43/0x90
[ 99.479576][ T598] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 99.480172][ T598] </TASK>
[ 99.480649][ T598] Sending NMI from CPU 0 to CPUs 1-3:

With f1d9a2abff66 reverted:

[ 92.114861][ T598] sysrq: Show backtrace of all active CPUs
[ 92.115989][ T598] NMI backtrace for cpu 2
[ 92.116448][ T598] CPU: 2 PID: 598 Comm: bash Not tainted
5.17.0-rc1-00083-gc0df0cbee2c5-dirty #945
[ 92.117493][ T598] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.15.0-1 04/01/2014
[ 92.118290][ T598] Call Trace:
[ 92.118598][ T598] <TASK>
[ 92.118860][ T598] dump_stack_lvl+0x45/0x59
[ 92.119363][ T598] dump_stack+0xc/0xd
[ 92.119727][ T598] nmi_cpu_backtrace.cold+0xa4/0xa9
[ 92.120183][ T598] ? lapic_can_unplug_cpu+0x80/0x80
[ 92.120655][ T598] nmi_trigger_cpumask_backtrace+0x10e/0x140
[ 92.121328][ T598] arch_trigger_cpumask_backtrace+0x15/0x20
[ 92.121856][ T598] sysrq_handle_showallcpus+0x13/0x20
[ 92.122467][ T598] __handle_sysrq.cold+0x11c/0x37b
[ 92.122924][ T598] write_sysrq_trigger+0x3f/0x50
[ 92.123357][ T598] proc_reg_write+0x1b3/0x270
[ 92.123773][ T598] vfs_write+0x1c7/0x920
[ 92.124151][ T598] ksys_write+0xf9/0x1d0
[ 92.124523][ T598] ? __ia32_sys_read+0xb0/0xb0
[ 92.124998][ T598] ? lock_is_held_type+0xd7/0x130
[ 92.125512][ T598] ? syscall_enter_from_user_mode+0x1d/0x50
[ 92.127120][ T598] __x64_sys_write+0x6e/0xb0
[ 92.127537][ T598] ? syscall_enter_from_user_mode+0x1d/0x50
[ 92.128051][ T598] do_syscall_64+0x43/0x90
[ 92.128444][ T598] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 92.129046][ T598] RIP: 0033:0x7d8d29a7b504
[ 92.129442][ T598] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb
b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01
00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55
48 89 f5 53
[ 92.131109][ T598] RSP: 002b:00007ffd74467c58 EFLAGS: 00000246
ORIG_RAX: 0000000000000001
[ 92.131913][ T598] RAX: ffffffffffffffda RBX: 0000000000000002
RCX: 00007d8d29a7b504
[ 92.132682][ T598] RDX: 0000000000000002 RSI: 00007d8d2a6dcfb0
RDI: 0000000000000001
[ 92.133460][ T598] RBP: 00007d8d2a6dcfb0 R08: 000000000000000a
R09: 00007d8d29b4cca0
[ 92.134219][ T598] R10: 000000000000000a R11: 0000000000000246
R12: 00007d8d29b4d760
[ 92.134904][ T598] R13: 0000000000000002 R14: 00007d8d29b48760
R15: 0000000000000002
[ 92.135591][ T598] </TASK>
[ 92.135961][ T598] Sending NMI from CPU 2 to CPUs 0-1,3: