2018-11-06 07:20:22

by Zhaoyang Huang

[permalink] [raw]
Subject: [PATCH] arch/arm64 : fix error in dump_backtrace

From: Zhaoyang Huang <[email protected]>

In some cases, the instruction of "bl foo1" will be the last one of the
foo2[1], which will cause the lr be the first instruction of the adjacent
foo3[2]. Hence, the backtrace will show the weird result as bellow[3].
The patch will fix it by miner 4 of the lr when dump_backtrace

[1]
0xffffff80081e6b04 <handle_mm_fault+3996>: adrp x0, 0xffffff8008ca8000
0xffffff80081e6b08 <handle_mm_fault+4000>: add x0, x0, #0x5a8
0xffffff80081e6b0c <handle_mm_fault+4004>: bl 0xffffff80081b0ca0 <panic>
0xffffff80081e6b10 <access_remote_vm>: stp x29, x30, [sp,#-64]!
0xffffff80081e6b14 <access_remote_vm+4>: mov x29, sp

[2]
crash_arm64> rd ffffffc02eec3bd0 2
ffffffc02eec3bd0: ffffffc02eec3cb0 ffffff80081e6b10

[3]
wrong:
[<ffffff80081b0d90>] panic+0xf0/0x24c
[<ffffff80081e6b10>] access_remote_vm+0x0/0x5c
[<ffffff800809d7b0>] do_page_fault+0x290/0x3b8
[<ffffff8008081570>] do_mem_abort+0x64/0xdc

correct:
[ffffffc02eec3bd0] panic at ffffff80081b0da4
[ffffffc02eec3cb0] handle_mm_fault at ffffff80081e6b0c
[ffffffc02eec3d80] do_page_fault at ffffff800809d7ac
[ffffffc02eec3df0] do_mem_abort at ffffff800808156c

Signed-off-by: Zhaoyang Huang <[email protected]>
---
arch/arm64/kernel/traps.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index d399d45..7a097cc 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -113,7 +113,7 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)

if (tsk == current) {
frame.fp = (unsigned long)__builtin_frame_address(0);
- frame.pc = (unsigned long)dump_backtrace;
+ frame.pc = (unsigned long)dump_backtrace + 4;
} else {
/*
* task blocked in __switch_to
@@ -130,7 +130,7 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
do {
/* skip until specified stack frame */
if (!skip) {
- dump_backtrace_entry(frame.pc);
+ dump_backtrace_entry(frame.pc - 4);
} else if (frame.fp == regs->regs[29]) {
skip = 0;
/*
--
1.9.1



2018-11-06 08:40:14

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] arch/arm64 : fix error in dump_backtrace

On Tue, Nov 06, 2018 at 03:19:35PM +0800, Zhaoyang Huang wrote:
> From: Zhaoyang Huang <[email protected]>
>
> In some cases, the instruction of "bl foo1" will be the last one of the
> foo2[1], which will cause the lr be the first instruction of the adjacent
> foo3[2]. Hence, the backtrace will show the weird result as bellow[3].
> The patch will fix it by miner 4 of the lr when dump_backtrace

This has come up in the past (and a similar patch has been applied, then
reverted).

In general, we don't know that a function call was made via BL, and therefore
cannot know that LR - 4 is the address of the caller. The caller could set up
the LR as it likes, then B or BR to the callee, and depending on how the basic
blocks get laid out in memory, LR - 4 might point at something completely
different.

More ideally, the compiler wouldn't end a function with a BL. When does that
happen, and is there some way we could arrange for that to not happen? e.g.
somehow pad a NOP after the BL.

Thanks,
Mark.

>
> [1]
> 0xffffff80081e6b04 <handle_mm_fault+3996>: adrp x0, 0xffffff8008ca8000
> 0xffffff80081e6b08 <handle_mm_fault+4000>: add x0, x0, #0x5a8
> 0xffffff80081e6b0c <handle_mm_fault+4004>: bl 0xffffff80081b0ca0 <panic>
> 0xffffff80081e6b10 <access_remote_vm>: stp x29, x30, [sp,#-64]!
> 0xffffff80081e6b14 <access_remote_vm+4>: mov x29, sp
>
> [2]
> crash_arm64> rd ffffffc02eec3bd0 2
> ffffffc02eec3bd0: ffffffc02eec3cb0 ffffff80081e6b10
>
> [3]
> wrong:
> [<ffffff80081b0d90>] panic+0xf0/0x24c
> [<ffffff80081e6b10>] access_remote_vm+0x0/0x5c
> [<ffffff800809d7b0>] do_page_fault+0x290/0x3b8
> [<ffffff8008081570>] do_mem_abort+0x64/0xdc
>
> correct:
> [ffffffc02eec3bd0] panic at ffffff80081b0da4
> [ffffffc02eec3cb0] handle_mm_fault at ffffff80081e6b0c
> [ffffffc02eec3d80] do_page_fault at ffffff800809d7ac
> [ffffffc02eec3df0] do_mem_abort at ffffff800808156c
>
> Signed-off-by: Zhaoyang Huang <[email protected]>
> ---
> arch/arm64/kernel/traps.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index d399d45..7a097cc 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -113,7 +113,7 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>
> if (tsk == current) {
> frame.fp = (unsigned long)__builtin_frame_address(0);
> - frame.pc = (unsigned long)dump_backtrace;
> + frame.pc = (unsigned long)dump_backtrace + 4;
> } else {
> /*
> * task blocked in __switch_to
> @@ -130,7 +130,7 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
> do {
> /* skip until specified stack frame */
> if (!skip) {
> - dump_backtrace_entry(frame.pc);
> + dump_backtrace_entry(frame.pc - 4);
> } else if (frame.fp == regs->regs[29]) {
> skip = 0;
> /*
> --
> 1.9.1
>

2018-11-06 08:58:34

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH] arch/arm64 : fix error in dump_backtrace

On Tue, Nov 06, 2018 at 08:39:01AM +0000, Mark Rutland wrote:
> On Tue, Nov 06, 2018 at 03:19:35PM +0800, Zhaoyang Huang wrote:
> > From: Zhaoyang Huang <[email protected]>
> >
> > In some cases, the instruction of "bl foo1" will be the last one of the
> > foo2[1], which will cause the lr be the first instruction of the adjacent
> > foo3[2]. Hence, the backtrace will show the weird result as bellow[3].
> > The patch will fix it by miner 4 of the lr when dump_backtrace
>
> This has come up in the past (and a similar patch has been applied, then
> reverted).
>
> In general, we don't know that a function call was made via BL, and therefore
> cannot know that LR - 4 is the address of the caller. The caller could set up
> the LR as it likes, then B or BR to the callee, and depending on how the basic
> blocks get laid out in memory, LR - 4 might point at something completely
> different.
>
> More ideally, the compiler wouldn't end a function with a BL. When does that
> happen, and is there some way we could arrange for that to not happen? e.g.
> somehow pad a NOP after the BL.

It's a consequence of having __noreturn isn't it? __noreturn frees the
compiler from the burden of having to produce a valid return stack... so
it doesn't and unwinding becomes hard.


Daniel.


> > [1]
> > 0xffffff80081e6b04 <handle_mm_fault+3996>: adrp x0, 0xffffff8008ca8000
> > 0xffffff80081e6b08 <handle_mm_fault+4000>: add x0, x0, #0x5a8
> > 0xffffff80081e6b0c <handle_mm_fault+4004>: bl 0xffffff80081b0ca0 <panic>
> > 0xffffff80081e6b10 <access_remote_vm>: stp x29, x30, [sp,#-64]!
> > 0xffffff80081e6b14 <access_remote_vm+4>: mov x29, sp
> >
> > [2]
> > crash_arm64> rd ffffffc02eec3bd0 2
> > ffffffc02eec3bd0: ffffffc02eec3cb0 ffffff80081e6b10
> >
> > [3]
> > wrong:
> > [<ffffff80081b0d90>] panic+0xf0/0x24c
> > [<ffffff80081e6b10>] access_remote_vm+0x0/0x5c
> > [<ffffff800809d7b0>] do_page_fault+0x290/0x3b8
> > [<ffffff8008081570>] do_mem_abort+0x64/0xdc
> >
> > correct:
> > [ffffffc02eec3bd0] panic at ffffff80081b0da4
> > [ffffffc02eec3cb0] handle_mm_fault at ffffff80081e6b0c
> > [ffffffc02eec3d80] do_page_fault at ffffff800809d7ac
> > [ffffffc02eec3df0] do_mem_abort at ffffff800808156c
> >
> > Signed-off-by: Zhaoyang Huang <[email protected]>
> > ---
> > arch/arm64/kernel/traps.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > index d399d45..7a097cc 100644
> > --- a/arch/arm64/kernel/traps.c
> > +++ b/arch/arm64/kernel/traps.c
> > @@ -113,7 +113,7 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
> >
> > if (tsk == current) {
> > frame.fp = (unsigned long)__builtin_frame_address(0);
> > - frame.pc = (unsigned long)dump_backtrace;
> > + frame.pc = (unsigned long)dump_backtrace + 4;
> > } else {
> > /*
> > * task blocked in __switch_to
> > @@ -130,7 +130,7 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
> > do {
> > /* skip until specified stack frame */
> > if (!skip) {
> > - dump_backtrace_entry(frame.pc);
> > + dump_backtrace_entry(frame.pc - 4);
> > } else if (frame.fp == regs->regs[29]) {
> > skip = 0;
> > /*
> > --
> > 1.9.1
> >

2018-11-06 11:02:29

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] arch/arm64 : fix error in dump_backtrace

On Tue, Nov 06, 2018 at 08:57:51AM +0000, Daniel Thompson wrote:
> On Tue, Nov 06, 2018 at 08:39:01AM +0000, Mark Rutland wrote:
> > On Tue, Nov 06, 2018 at 03:19:35PM +0800, Zhaoyang Huang wrote:
> > > From: Zhaoyang Huang <[email protected]>
> > >
> > > In some cases, the instruction of "bl foo1" will be the last one of the
> > > foo2[1], which will cause the lr be the first instruction of the adjacent
> > > foo3[2]. Hence, the backtrace will show the weird result as bellow[3].
> > > The patch will fix it by miner 4 of the lr when dump_backtrace
> >
> > This has come up in the past (and a similar patch has been applied, then
> > reverted).
> >
> > In general, we don't know that a function call was made via BL, and therefore
> > cannot know that LR - 4 is the address of the caller. The caller could set up
> > the LR as it likes, then B or BR to the callee, and depending on how the basic
> > blocks get laid out in memory, LR - 4 might point at something completely
> > different.
> >
> > More ideally, the compiler wouldn't end a function with a BL. When does that
> > happen, and is there some way we could arrange for that to not happen? e.g.
> > somehow pad a NOP after the BL.
>
> It's a consequence of having __noreturn isn't it? __noreturn frees the
> compiler from the burden of having to produce a valid return stack... so
> it doesn't and unwinding becomes hard.

In that case, the compiler could equally just use B rather than BL,
which this patch doesn't solve.

The documentation for the GCC noreturn attribute [1] says:

| In order to preserve backtraces, GCC will never turn calls to noreturn
| functions into tail calls.

... so clearly it's not intended to mess up backtracing.

IIUC we mostly use noreturn to prevent warings about uninitialised
variables and such after a call to a noreturn function. I think
optimization is a secondary concern.

We could ask the GCC folk if they can ensure that a noreturn function
call leave thes LR pointing into the caller, e.g. by padding with a NOP:

BL <noreturn function>
NOP

That seems cheap enough, and would keep backtraces reliable.

Thanks,
Mark.

[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-noreturn-function-attribute

2018-11-06 11:35:06

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH] arch/arm64 : fix error in dump_backtrace

On Tue, Nov 06, 2018 at 11:00:19AM +0000, Mark Rutland wrote:
> On Tue, Nov 06, 2018 at 08:57:51AM +0000, Daniel Thompson wrote:
> > On Tue, Nov 06, 2018 at 08:39:01AM +0000, Mark Rutland wrote:
> > > On Tue, Nov 06, 2018 at 03:19:35PM +0800, Zhaoyang Huang wrote:
> > > > From: Zhaoyang Huang <[email protected]>
> > > >
> > > > In some cases, the instruction of "bl foo1" will be the last one of the
> > > > foo2[1], which will cause the lr be the first instruction of the adjacent
> > > > foo3[2]. Hence, the backtrace will show the weird result as bellow[3].
> > > > The patch will fix it by miner 4 of the lr when dump_backtrace
> > >
> > > This has come up in the past (and a similar patch has been applied, then
> > > reverted).
> > >
> > > In general, we don't know that a function call was made via BL, and therefore
> > > cannot know that LR - 4 is the address of the caller. The caller could set up
> > > the LR as it likes, then B or BR to the callee, and depending on how the basic
> > > blocks get laid out in memory, LR - 4 might point at something completely
> > > different.
> > >
> > > More ideally, the compiler wouldn't end a function with a BL. When does that
> > > happen, and is there some way we could arrange for that to not happen? e.g.
> > > somehow pad a NOP after the BL.
> >
> > It's a consequence of having __noreturn isn't it? __noreturn frees the
> > compiler from the burden of having to produce a valid return stack... so
> > it doesn't and unwinding becomes hard.
>
> In that case, the compiler could equally just use B rather than BL,
> which this patch doesn't solve.
>
> The documentation for the GCC noreturn attribute [1] says:
>
> | In order to preserve backtraces, GCC will never turn calls to noreturn
> | functions into tail calls.
>
> ... so clearly it's not intended to mess up backtracing.

Which is a bit odd, since every call to a noreturn function is a tail-
call by definition, and other tail-calls are typically optimised to a B
(thus interfering with backtracing in all except the noreturn case).

Avoiding this would require a change to the compiler, and since there's
no obvious correct answer anyway, I guess we shouldn't hold our breath.

> IIUC we mostly use noreturn to prevent warings about uninitialised
> variables and such after a call to a noreturn function. I think
> optimization is a secondary concern.

Agreed.

> We could ask the GCC folk if they can ensure that a noreturn function
> call leave thes LR pointing into the caller, e.g. by padding with a NOP:
>
> BL<noreturn function>
> NOP
>
> That seems cheap enough, and would keep backtraces reliable.

-fpatchable-function-entry=1,1 does almost the right thing, by
inserting 1 NOP at the start of each function, and putting the function
entry point after that (1) NOP.

This works on gcc-8.1. I haven't experimented with earlier compilers.

It could be considered an abuse, since we'd not quite be using the
option for its intended purpose. It also causes every function
entry point to be misaligned due to the NOP insertion, which we may or
may not care about.

Cheers
---Dave
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

2018-11-06 12:18:02

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH] arch/arm64 : fix error in dump_backtrace

On Tue, Nov 06, 2018 at 11:00:19AM +0000, Mark Rutland wrote:
> On Tue, Nov 06, 2018 at 08:57:51AM +0000, Daniel Thompson wrote:
> > On Tue, Nov 06, 2018 at 08:39:01AM +0000, Mark Rutland wrote:
> > > On Tue, Nov 06, 2018 at 03:19:35PM +0800, Zhaoyang Huang wrote:
> > > > From: Zhaoyang Huang <[email protected]>
> > > >
> > > > In some cases, the instruction of "bl foo1" will be the last one of the
> > > > foo2[1], which will cause the lr be the first instruction of the adjacent
> > > > foo3[2]. Hence, the backtrace will show the weird result as bellow[3].
> > > > The patch will fix it by miner 4 of the lr when dump_backtrace
> > >
> > > This has come up in the past (and a similar patch has been applied, then
> > > reverted).
> > >
> > > In general, we don't know that a function call was made via BL, and therefore
> > > cannot know that LR - 4 is the address of the caller. The caller could set up
> > > the LR as it likes, then B or BR to the callee, and depending on how the basic
> > > blocks get laid out in memory, LR - 4 might point at something completely
> > > different.
> > >
> > > More ideally, the compiler wouldn't end a function with a BL. When does that
> > > happen, and is there some way we could arrange for that to not happen? e.g.
> > > somehow pad a NOP after the BL.
> >
> > It's a consequence of having __noreturn isn't it? __noreturn frees the
> > compiler from the burden of having to produce a valid return stack... so
> > it doesn't and unwinding becomes hard.
>
> In that case, the compiler could equally just use B rather than BL,
> which this patch doesn't solve.
>
> The documentation for the GCC noreturn attribute [1] says:
>
> | In order to preserve backtraces, GCC will never turn calls to noreturn
> | functions into tail calls.
>
> ... so clearly it's not intended to mess up backtracing.

I guess that explains why the compiler chooses BL over B (since B would
be a tail call).


> IIUC we mostly use noreturn to prevent warings about uninitialised
> variables and such after a call to a noreturn function. I think
> optimization is a secondary concern.
>
> We could ask the GCC folk if they can ensure that a noreturn function
> call leave thes LR pointing into the caller, e.g. by padding with a NOP:
>
> BL <noreturn function>
> NOP
>
> That seems cheap enough, and would keep backtraces reliable.

It might be worth discussing. One related question though... is there
any other case there the symbol name other than __noreturn where the
symbol can change between LR and LR-1?

If not can't we just switch over to %pB which is designed for this case
(and in case you ask: no... I didn't know about %pB until this morning
when I started trying to implement the heuristic it uses by hand and
then discovering it by accident).

Something like:

From 70a2c9ffc962a650ff894d40b775f8310190bf86 Mon Sep 17 00:00:00 2001
From: Daniel Thompson <[email protected]>
Date: Tue, 6 Nov 2018 11:59:56 +0000
Subject: [PATCH] arm64: Use %pB to print backtrace symbols

Currently arm64 uses %pS to print backtrace symbols but %pB (which takes
into consideration the effect of compiler optimisations which may occur
when tail-calls are used and marked with the noreturn GCC attribute)
is a better alternative.

Signed-off-by: Daniel Thompson <[email protected]>
---
arch/arm64/kernel/traps.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 5f4d9acb32f5..ea571d3a1373 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -62,7 +62,7 @@ int show_unhandled_signals = 0;

static void dump_backtrace_entry(unsigned long where)
{
- printk(" %pS\n", (void *)where);
+ printk(" %pB\n", (void *)where);
}

static void __dump_instr(const char *lvl, struct pt_regs *regs)
--
2.19.1


2018-11-06 12:32:03

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] arch/arm64 : fix error in dump_backtrace

On Tue, Nov 06, 2018 at 11:32:50AM +0000, Dave P Martin wrote:
> On Tue, Nov 06, 2018 at 11:00:19AM +0000, Mark Rutland wrote:
> > On Tue, Nov 06, 2018 at 08:57:51AM +0000, Daniel Thompson wrote:
> > > On Tue, Nov 06, 2018 at 08:39:01AM +0000, Mark Rutland wrote:
> > > > On Tue, Nov 06, 2018 at 03:19:35PM +0800, Zhaoyang Huang wrote:
> > > > > From: Zhaoyang Huang <[email protected]>
> > > > >
> > > > > In some cases, the instruction of "bl foo1" will be the last one of the
> > > > > foo2[1], which will cause the lr be the first instruction of the adjacent
> > > > > foo3[2]. Hence, the backtrace will show the weird result as bellow[3].
> > > > > The patch will fix it by miner 4 of the lr when dump_backtrace
> > > >
> > > > This has come up in the past (and a similar patch has been applied, then
> > > > reverted).
> > > >
> > > > In general, we don't know that a function call was made via BL, and therefore
> > > > cannot know that LR - 4 is the address of the caller. The caller could set up
> > > > the LR as it likes, then B or BR to the callee, and depending on how the basic
> > > > blocks get laid out in memory, LR - 4 might point at something completely
> > > > different.
> > > >
> > > > More ideally, the compiler wouldn't end a function with a BL. When does that
> > > > happen, and is there some way we could arrange for that to not happen? e.g.
> > > > somehow pad a NOP after the BL.
> > >
> > > It's a consequence of having __noreturn isn't it? __noreturn frees the
> > > compiler from the burden of having to produce a valid return stack... so
> > > it doesn't and unwinding becomes hard.
> >
> > In that case, the compiler could equally just use B rather than BL,
> > which this patch doesn't solve.
> >
> > The documentation for the GCC noreturn attribute [1] says:
> >
> > | In order to preserve backtraces, GCC will never turn calls to noreturn
> > | functions into tail calls.
> >
> > ... so clearly it's not intended to mess up backtracing.
>
> Which is a bit odd, since every call to a noreturn function is a tail-
> call by definition, and other tail-calls are typically optimised to a B
> (thus interfering with backtracing in all except the noreturn case).
>
> Avoiding this would require a change to the compiler, and since there's
> no obvious correct answer anyway, I guess we shouldn't hold our breath.
>
> > IIUC we mostly use noreturn to prevent warings about uninitialised
> > variables and such after a call to a noreturn function. I think
> > optimization is a secondary concern.
>
> Agreed.
>
> > We could ask the GCC folk if they can ensure that a noreturn function
> > call leave thes LR pointing into the caller, e.g. by padding with a NOP:
> >
> > BL <noreturn function>
> > NOP
> >
> > That seems cheap enough, and would keep backtraces reliable.
>
> -fpatchable-function-entry=1,1 does almost the right thing, by
> inserting 1 NOP at the start of each function, and putting the function
> entry point after that (1) NOP.

Neat hack, but unfortunately insufficient in general since:

* The next function may be notrace or asm, and hence will not have a
preceding NOP.

* There may not be a next function (e.g. for the final instruction in
.text), and the LR value may point at some data symbol.

* This relies on the preceding NOP being accounted as part of the
previous function, which feels like a bug given we should have the
function size somewhere.

Generally, I think that trying to bodge around the exiting behaviour is
going to cause just as many problems as it solves, and worse, makes it
harder to consistently analyse a backtrace.

IMO, we shouldn't change the kernel here.

Thanks,
Mark.

2018-11-06 14:26:11

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH] arch/arm64 : fix error in dump_backtrace

On Tue, Nov 06, 2018 at 12:29:33PM +0000, Mark Rutland wrote:
> On Tue, Nov 06, 2018 at 11:32:50AM +0000, Dave P Martin wrote:
> > On Tue, Nov 06, 2018 at 11:00:19AM +0000, Mark Rutland wrote:
> > > On Tue, Nov 06, 2018 at 08:57:51AM +0000, Daniel Thompson wrote:
> > > > On Tue, Nov 06, 2018 at 08:39:01AM +0000, Mark Rutland wrote:
> > > > > On Tue, Nov 06, 2018 at 03:19:35PM +0800, Zhaoyang Huang wrote:
> > > > > > From: Zhaoyang Huang <[email protected]>
> > > > > >
> > > > > > In some cases, the instruction of "bl foo1" will be the last one of the
> > > > > > foo2[1], which will cause the lr be the first instruction of the adjacent
> > > > > > foo3[2]. Hence, the backtrace will show the weird result as bellow[3].
> > > > > > The patch will fix it by miner 4 of the lr when dump_backtrace
> > > > >
> > > > > This has come up in the past (and a similar patch has been applied, then
> > > > > reverted).
> > > > >
> > > > > In general, we don't know that a function call was made via BL, and therefore
> > > > > cannot know that LR - 4 is the address of the caller. The caller could set up
> > > > > the LR as it likes, then B or BR to the callee, and depending on how the basic
> > > > > blocks get laid out in memory, LR - 4 might point at something completely
> > > > > different.
> > > > >
> > > > > More ideally, the compiler wouldn't end a function with a BL. When does that
> > > > > happen, and is there some way we could arrange for that to not happen? e.g.
> > > > > somehow pad a NOP after the BL.
> > > >
> > > > It's a consequence of having __noreturn isn't it? __noreturn frees the
> > > > compiler from the burden of having to produce a valid return stack... so
> > > > it doesn't and unwinding becomes hard.
> > >
> > > In that case, the compiler could equally just use B rather than BL,
> > > which this patch doesn't solve.
> > >
> > > The documentation for the GCC noreturn attribute [1] says:
> > >
> > > | In order to preserve backtraces, GCC will never turn calls to noreturn
> > > | functions into tail calls.
> > >
> > > ... so clearly it's not intended to mess up backtracing.
> >
> > Which is a bit odd, since every call to a noreturn function is a tail-
> > call by definition, and other tail-calls are typically optimised to a B
> > (thus interfering with backtracing in all except the noreturn case).
> >
> > Avoiding this would require a change to the compiler, and since there's
> > no obvious correct answer anyway, I guess we shouldn't hold our breath.
> >
> > > IIUC we mostly use noreturn to prevent warings about uninitialised
> > > variables and such after a call to a noreturn function. I think
> > > optimization is a secondary concern.
> >
> > Agreed.
> >
> > > We could ask the GCC folk if they can ensure that a noreturn function
> > > call leave thes LR pointing into the caller, e.g. by padding with a NOP:
> > >
> > > BL <noreturn function>
> > > NOP
> > >
> > > That seems cheap enough, and would keep backtraces reliable.
> >
> > -fpatchable-function-entry=1,1 does almost the right thing, by
> > inserting 1 NOP at the start of each function, and putting the function
> > entry point after that (1) NOP.
>
> Neat hack, but unfortunately insufficient in general since:
>
> * The next function may be notrace or asm, and hence will not have a
> preceding NOP.

Usually not, but yes, it would be true for some cases.

> * There may not be a next function (e.g. for the final instruction in
> .text), and the LR value may point at some data symbol.

Ditto.

> * This relies on the preceding NOP being accounted as part of the
> previous function, which feels like a bug given we should have the
> function size somewhere.

I assumed that the function sizes (in the ELF symbol senses) are simply
ignored and the kernel presumes that each function ends where the next
function starts.

If so, the NOP would be accounted with the preceding function (if any).
I may be misremembering how the kallsyms determines symbol sizes though.

> Generally, I think that trying to bodge around the exiting behaviour is
> going to cause just as many problems as it solves, and worse, makes it
> harder to consistently analyse a backtrace.
>
> IMO, we shouldn't change the kernel here.

Agreed. -fpatchable-function-entry is an interesting hack here, but
doesn't solve the whole problem and partly works by accident
(unfortunately).

Cheers
---Dave