2019-01-10 23:53:21

by Joe Lawrence

[permalink] [raw]
Subject: ppc64le reliable stack unwinder and scheduled tasks

Hi all,

tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
about?

I am looking at a bug in which ~10% of livepatch tests on RHEL-7 and
RHEL-8 distro kernels, the ppc64le reliable stack unwinder consistently
reports an unreliable stack for a given task. This condition can
eventually resolve itself, but sometimes this state remains wedged for
hours and I can live-debug the system with crash-utility.

I have tried reproducing with upstream 4.20 kernel, but haven't seen
this exact condition yet. More on upstream in a bit.

That said, I have a question about the ppc64 stack frame layout with
regard to scheduled tasks. In each sticky "unreliable" stack trace
instance that I've looked at, the task's stack has an interesting
frame at the top:


Example 1 (RHEL-7)
==================

crash> struct task_struct.thread c00000022fd015c0 | grep ksp
ksp = 0xc0000000288af9c0

crash> rd 0xc0000000288af9c0 -e 0xc0000000288b0000

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[0]:

c0000000288af9c0: c0000000288afb90 0000000000dd0000 ...(............
c0000000288af9d0: c000000000002a94 c000000001c60a00 .*..............

crash> sym c000000000002a94
c000000000002a94 (T) hardware_interrupt_common+0x114

c0000000288af9e0: c000000001c60a80 0000000000000000 ................
c0000000288af9f0: c0000000288afbc0 0000000000dd0000 ...(............
c0000000288afa00: c0000000014322e0 c000000001c60a00 ."C.............
c0000000288afa10: c0000002303ae380 c0000002303ae380 ..:0......:0....
c0000000288afa20: 7265677368657265 0000000000002200 erehsger."......

Uh-oh...

/* Mark stacktraces with exception frames as unreliable. */
stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER

c0000000288afa30: c0000000001240ac c0000000288afcb0 .@.........(....
c0000000288afa40: c0000000013e4d00 0000000000000000 .M>.............
c0000000288afa50: 0000000000000001 0000000000000001 ................
c0000000288afa60: c0000000014322e0 0000000000000804 ."C.............
c0000000288afa70: c0000000288ac080 c00000022fd015c0 ...(......./....
c0000000288afa80: c0000000288afae0 00000000ffffffff ...(............
c0000000288afa90: c0000000288afae0 c000000007b80900 ...(............
c0000000288afaa0: c000000000e90a00 c000000000e90a00 ................
c0000000288afab0: c0000000001240ac c000000000e90a00 .@..............
c0000000288afac0: c000000000e90a00 c000000004790580 ..........y.....
c0000000288afad0: 0000000000000001 c000000000e90a00 ................
c0000000288afae0: 0000000000000000 0000000000000004 ................
c0000000288afaf0: c00000022fd01ad0 c000000000e73be0 .../.....;......
c0000000288afb00: c00000023900f450 c000000001488190 P..9......H.....
c0000000288afb10: 0000000000ad0000 c00000023900ef40 [email protected]....
c0000000288afb20: c0000000288ac000 c000000000e73be0 ...(.....;......
c0000000288afb30: c00000000001b514 c00000022fd01628 ........(../....
c0000000288afb40: c0000000288afbb0 c000000000008800 ...(............
c0000000288afb50: c000000000162880 0000000000000000 .(..............
c0000000288afb60: 00000000240f3022 0000000000000004 "0.$............
c0000000288afb70: c000000000e90a00 c00000022fd01a20 ........ ../....
c0000000288afb80: c0000000288afbf0 c0000000014322e0 ...(....."C.....

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[1]:

c0000000288afb90: c0000000288afbf0 c000000001960a00 ...(............
c0000000288afba0: c00000000001b514 0000000000000004 ................

crash> sym c00000000001b514
c00000000001b514 (T) __switch_to+0x264

c0000000288afbb0: c000000000e90a00 0000000000000000 ................
c0000000288afbc0: c0000000288ac000 c0000000014322e0 ...(....."C.....
c0000000288afbd0: c000000000e90a00 c000000001960a00 ................
c0000000288afbe0: c00000022fd015c0 c00000023900ef40 .../[email protected]....

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[2]:

c0000000288afbf0: c0000000288afcd0 c000000000003300 ...(.....3......
c0000000288afc00: c000000000a83918 c0000000013e4d00 .9.......M>.....

crash> sym c000000000a83918
c000000000a83918 (t) __schedule+0x428

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[3]:

c0000000288afcd0: c0000000288afd80 0000000000003300 ...(.....3......
c0000000288afce0: c0000000001240ac 0000000000000000 .@..............

crash> sym c0000000001240ac
c0000000001240ac (t) rescuer_thread+0x3ac

[ ... and so on as we would normally expect ... ]


Example 2 - (RHEL-7)
====================

crash> struct task_struct.thread c00000004a660a00 | grep ksp
ksp = 0xc0000005e85439c0,

crash> rd 0xc0000005e8543b90 -e 0xc0000005e8544000

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[0]:
c0000005e85439c0: c0000005e8543b90 0000000000000000 .;T.............
c0000005e85439d0: c000000000002a94 0000000000000000 .*..............

crash> sym c000000000002a94
c000000000002a94 (T) hardware_interrupt_common+0x114

c0000005e8543a20: 7265677368657265 c0000000013e4d00 erehsger.M>.....

/* Mark stacktraces with exception frames as unreliable. */
stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[1]:
c0000005e8543b90: c0000005e8543bf0 c000000001960a00 .;T.............
c0000005e8543ba0: c00000000001b514 0000000000000004 ................

crash> sym c00000000001b514
c00000000001b514 (T) __switch_to+0x264

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[2]:
c0000005e8543bf0: c0000005e8543cd0 c000000000003300 .<T......3......
c0000005e8543c00: c000000000a83918 c0000000013e4d00 .9.......M>.....

crash> sym c000000000a83918
c000000000a83918 (t) __schedule+0x428

[ ... and so on as we would normally expect ... ]


Example 3 (upstream 4.20)
=========================

This is not a repro of the sticky "unreliable" stack trace, but just the
stack of a workqueue that fired and called msleep.


crash> struct task_struct.thread c0000002257f0e00 | grep ksp
ksp = 0xc00000022715f9b0,

crash> rd 0xc00000022715f9b0 100

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[0]:

c00000022715f9b0: c00000022715fb90 c00000023f4e8280 ...'......N?....
c00000022715f9c0: c00000022715f9e0 0000000000000003 ...'............

crash> sym c00000022715f9e0
c00000022715f9e0 (B) _end+0x22594f9e0

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[1]:

c00000022715fb90: c00000022715fbf0 c00000000158634c ...'....LcX.....
c00000022715fba0: c00000000001ebfc c000000001318a60 ........`.1.....

crash> sym c00000000001ebfc
c00000000001ebfc (T) __switch_to+0x2dc

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[2]:

c00000022715fbf0: c00000022715fcd0 c000000000008800 ...'............
c00000022715fc00: c000000000abcf40 c0000000013abf00 @.........:.....

crash> sym c000000000abcf40
c000000000abcf40 (t) __schedule+0x3b0

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sp[3]:

c00000022715fcd0: c00000022715fd00 2c50fff000000000 ...'..........P,
c00000022715fce0: c000000000abd628 c00000023bff57e8 (........W.;....

crash> sym c000000000abd628
c000000000abd628 (T) schedule+0x48

[ ... etc ... ]


save_stack_trace_tsk_reliable
=============================

arch/powerpc/kernel/stacktrace.c :: save_stack_trace_tsk_reliable() does
take into account the first stackframe, but only to verify that the link
register is indeed pointing at kernel code address.

Can someone explain what __switch_to is doing with the stack and whether
in such circumstances, the reliable stack unwinder should be skipping
the first frame when verifying the frame contents like
STACK_FRAME_MARKER, etc.

I may be on the wrong path in debugging this, but figuring out this
sp[0] frame state would be helpful.

Thanks,

-- Joe


2019-01-11 00:42:32

by Nicolai Stange

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

Hi Joe,

Joe Lawrence <[email protected]> writes:

> tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
> about?

If I'm reading the code in _switch() correctly, the first frame is
completely uninitialized except for the pointer back to the caller's
stack frame.

For completeness: _switch() saves the return address, i.e. the link
register into its parent's stack frame, as is mandated by the ABI and
consistent with your findings below: it's always the second stack frame
where the return address into __switch_to() is kept.

<snip>

>
>
> Example 1 (RHEL-7)
> ==================
>
> crash> struct task_struct.thread c00000022fd015c0 | grep ksp
> ksp = 0xc0000000288af9c0
>
> crash> rd 0xc0000000288af9c0 -e 0xc0000000288b0000
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[0]:
>
> c0000000288af9c0: c0000000288afb90 0000000000dd0000 ...(............
> c0000000288af9d0: c000000000002a94 c000000001c60a00 .*..............
>
> crash> sym c000000000002a94
> c000000000002a94 (T) hardware_interrupt_common+0x114

So that c000000000002a94 certainly wasn't stored by _switch(). I think
what might have happened is that the switching frame aliased with some
prior interrupt frame as setup by hardware_interrupt_common().

The interrupt and switching frames seem to share a common layout as far
as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are
concerned.

That address into hardware_interrupt_common() could have been written by
the do_IRQ() called from there.


> c0000000288af9e0: c000000001c60a80 0000000000000000 ................
> c0000000288af9f0: c0000000288afbc0 0000000000dd0000 ...(............
> c0000000288afa00: c0000000014322e0 c000000001c60a00 ."C.............
> c0000000288afa10: c0000002303ae380 c0000002303ae380 ..:0......:0....
> c0000000288afa20: 7265677368657265 0000000000002200 erehsger."......
>
> Uh-oh...
>
> /* Mark stacktraces with exception frames as unreliable. */
> stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER


Aliasing of the switching stack frame with some prior interrupt stack
frame would explain why that STACK_FRAME_REGS_MARKER is still found on
the stack, i.e. it's a leftover.

For testing, could you try whether clearing the word at STACK_FRAME_MARKER
from _switch() helps?

I.e. something like (completely untested):

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 435927f549c4..b747d0647ec4 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -596,6 +596,10 @@ _GLOBAL(_switch)
SAVE_8GPRS(14, r1)
SAVE_10GPRS(22, r1)
std r0,_NIP(r1) /* Return to switch caller */
+
+ li r23,0
+ std r23,96(r1) /* 96 == STACK_FRAME_MARKER * sizeof(long) */
+
mfcr r23
std r23,_CCR(r1)
std r1,KSP(r3) /* Set old stack pointer */


<snap>

>
> save_stack_trace_tsk_reliable
> =============================
>
> arch/powerpc/kernel/stacktrace.c :: save_stack_trace_tsk_reliable() does
> take into account the first stackframe, but only to verify that the link
> register is indeed pointing at kernel code address.

It's actually the other way around:

if (!firstframe && !__kernel_text_address(ip))
return 1;


So the address gets sanitized only if it's _not_ coming from the first
frame.


Thanks,

Nicolai

--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)

2019-01-11 03:43:56

by Joe Lawrence

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
> Hi Joe,
>
> Joe Lawrence <[email protected]> writes:
>
> > tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
> > about?
>
> If I'm reading the code in _switch() correctly, the first frame is
> completely uninitialized except for the pointer back to the caller's
> stack frame.
>
> For completeness: _switch() saves the return address, i.e. the link
> register into its parent's stack frame, as is mandated by the ABI and
> consistent with your findings below: it's always the second stack frame
> where the return address into __switch_to() is kept.
>

Hi Nicolai,

Good, that makes a lot of sense. I couldn't find any reference
explaining the contents of frame 0, only unwinding code here and there
(as in crash-utility) that stepped over it.

> <snip>
>
> >
> >
> > Example 1 (RHEL-7)
> > ==================
> >
> > crash> struct task_struct.thread c00000022fd015c0 | grep ksp
> > ksp = 0xc0000000288af9c0
> >
> > crash> rd 0xc0000000288af9c0 -e 0xc0000000288b0000
> >
> > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >
> > sp[0]:
> >
> > c0000000288af9c0: c0000000288afb90 0000000000dd0000 ...(............
> > c0000000288af9d0: c000000000002a94 c000000001c60a00 .*..............
> >
> > crash> sym c000000000002a94
> > c000000000002a94 (T) hardware_interrupt_common+0x114
>
> So that c000000000002a94 certainly wasn't stored by _switch(). I think
> what might have happened is that the switching frame aliased with some
> prior interrupt frame as setup by hardware_interrupt_common().
>
> The interrupt and switching frames seem to share a common layout as far
> as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are
> concerned.
>
> That address into hardware_interrupt_common() could have been written by
> the do_IRQ() called from there.
>

That was my initial theory, but then when I saw an ordinary scheduled
task with a similarly strange frame 0, I thought that _switch() might
have been doing something clever (or not). But according your earlier
explanation, it would line up that these values may be inherited from
do_IRQ() or the like.

>
> > c0000000288af9e0: c000000001c60a80 0000000000000000 ................
> > c0000000288af9f0: c0000000288afbc0 0000000000dd0000 ...(............
> > c0000000288afa00: c0000000014322e0 c000000001c60a00 ."C.............
> > c0000000288afa10: c0000002303ae380 c0000002303ae380 ..:0......:0....
> > c0000000288afa20: 7265677368657265 0000000000002200 erehsger."......
> >
> > Uh-oh...
> >
> > /* Mark stacktraces with exception frames as unreliable. */
> > stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER
>
>
> Aliasing of the switching stack frame with some prior interrupt stack
> frame would explain why that STACK_FRAME_REGS_MARKER is still found on
> the stack, i.e. it's a leftover.
>
> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
> from _switch() helps?
>
> I.e. something like (completely untested):

I'll kick off some builds tonight and try to get tests lined up
tomorrow. Unfortunately these take a bit of time to run setup, schedule
and complete, so perhaps by next week.

>
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 435927f549c4..b747d0647ec4 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
> SAVE_8GPRS(14, r1)
> SAVE_10GPRS(22, r1)
> std r0,_NIP(r1) /* Return to switch caller */
> +
> + li r23,0
> + std r23,96(r1) /* 96 == STACK_FRAME_MARKER * sizeof(long) */
> +
> mfcr r23
> std r23,_CCR(r1)
> std r1,KSP(r3) /* Set old stack pointer */
>
>

This may be sufficient to avoid the condition, but if the contents of
frame 0 are truely uninitialized (not to be trusted), should the
unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
aside from the LR and other stack size geometry sanity checks?

> <snap>
>
> >
> > save_stack_trace_tsk_reliable
> > =============================
> >
> > arch/powerpc/kernel/stacktrace.c :: save_stack_trace_tsk_reliable() does
> > take into account the first stackframe, but only to verify that the link
> > register is indeed pointing at kernel code address.
>
> It's actually the other way around:
>
> if (!firstframe && !__kernel_text_address(ip))
> return 1;
>
>
> So the address gets sanitized only if it's _not_ coming from the first
> frame.

Yup, that's right, I had it backwards.

Thanks!

-- Joe

2019-01-11 09:03:12

by Nicolai Stange

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

Joe Lawrence <[email protected]> writes:

> On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
>> Hi Joe,
>>
>> Joe Lawrence <[email protected]> writes:
>>
>> > tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
>> > about?
>>
>> If I'm reading the code in _switch() correctly, the first frame is
>> completely uninitialized except for the pointer back to the caller's
>> stack frame.
>>
>> For completeness: _switch() saves the return address, i.e. the link
>> register into its parent's stack frame, as is mandated by the ABI and
>> consistent with your findings below: it's always the second stack frame
>> where the return address into __switch_to() is kept.
>>
>
> Hi Nicolai,
>
> Good, that makes a lot of sense. I couldn't find any reference
> explaining the contents of frame 0, only unwinding code here and there
> (as in crash-utility) that stepped over it.

FWIW, I learned about general stack frame usage on ppc from part 4 of
the introductionary series starting at [1]: it's a good reading and I
can definitely recommend it.

Summary:
- Callers of other functions always allocate a stack frame and only
set the pointer to the previous stack frame (that's the
'stdu r1, -STACK_FRAME_OVERHEAD(r1)' insn).
- Callees save their stuff into the stack frame allocated by the caller
if needed. Where "if needed" == callee in turn calls another function.

The insignificance of frame 0's contents follows from this ABI: the
caller might not have called any callee yet, the callee might be a leaf
and so on.

Finally, as I understand it, the only purpose of _switch() creating a
standard stack frame at the bottom of scheduled out tasks is that the
higher ones can be found (for e.g. the backtracing): otherwise
there would be a pt_regs at the bottom of the stack. But I might be
wrong here.



>> <snip>
>>
>> >
>> >
>> > Example 1 (RHEL-7)
>> > ==================
>> >
>> > crash> struct task_struct.thread c00000022fd015c0 | grep ksp
>> > ksp = 0xc0000000288af9c0
>> >
>> > crash> rd 0xc0000000288af9c0 -e 0xc0000000288b0000
>> >
>> > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> >
>> > sp[0]:
>> >
>> > c0000000288af9c0: c0000000288afb90 0000000000dd0000 ...(............
>> > c0000000288af9d0: c000000000002a94 c000000001c60a00 .*..............
>> >
>> > crash> sym c000000000002a94
>> > c000000000002a94 (T) hardware_interrupt_common+0x114
>>
>> So that c000000000002a94 certainly wasn't stored by _switch(). I think
>> what might have happened is that the switching frame aliased with some
>> prior interrupt frame as setup by hardware_interrupt_common().
>>
>> The interrupt and switching frames seem to share a common layout as far
>> as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are
>> concerned.
>>
>> That address into hardware_interrupt_common() could have been written by
>> the do_IRQ() called from there.
>>
>
> That was my initial theory, but then when I saw an ordinary scheduled
> task with a similarly strange frame 0, I thought that _switch() might
> have been doing something clever (or not). But according your earlier
> explanation, it would line up that these values may be inherited from
> do_IRQ() or the like.
>
>>
>> > c0000000288af9e0: c000000001c60a80 0000000000000000 ................
>> > c0000000288af9f0: c0000000288afbc0 0000000000dd0000 ...(............
>> > c0000000288afa00: c0000000014322e0 c000000001c60a00 ."C.............
>> > c0000000288afa10: c0000002303ae380 c0000002303ae380 ..:0......:0....
>> > c0000000288afa20: 7265677368657265 0000000000002200 erehsger."......
>> >
>> > Uh-oh...
>> >
>> > /* Mark stacktraces with exception frames as unreliable. */
>> > stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER
>>
>>
>> Aliasing of the switching stack frame with some prior interrupt stack
>> frame would explain why that STACK_FRAME_REGS_MARKER is still found on
>> the stack, i.e. it's a leftover.
>>
>> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
>> from _switch() helps?
>>
>> I.e. something like (completely untested):
>
> I'll kick off some builds tonight and try to get tests lined up
> tomorrow. Unfortunately these take a bit of time to run setup, schedule
> and complete, so perhaps by next week.

Ok, that's probably still a good test for confirmation, but the solution
you proposed below is much better.


>>
>> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
>> index 435927f549c4..b747d0647ec4 100644
>> --- a/arch/powerpc/kernel/entry_64.S
>> +++ b/arch/powerpc/kernel/entry_64.S
>> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
>> SAVE_8GPRS(14, r1)
>> SAVE_10GPRS(22, r1)
>> std r0,_NIP(r1) /* Return to switch caller */
>> +
>> + li r23,0
>> + std r23,96(r1) /* 96 == STACK_FRAME_MARKER * sizeof(long) */
>> +
>> mfcr r23
>> std r23,_CCR(r1)
>> std r1,KSP(r3) /* Set old stack pointer */
>>
>>
>
> This may be sufficient to avoid the condition, but if the contents of
> frame 0 are truely uninitialized (not to be trusted), should the
> unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
> aside from the LR and other stack size geometry sanity checks?

That's a very good point: we'll only ever be examining scheduled out tasks
and current (which at that time is running klp_try_complete_transition()).

current won't be in an interrupt/exception when it's walking its own
stack. And scheduled out tasks can't have an exception/interrupt frame
as their frame 0, correct?

Thus, AFAICS, whenever klp_try_complete_transition() finds a
STACK_FRAMES_REGS_MARKER in frame 0, it is known to be garbage, as you
said.

Thanks,

Nicolai

[1] https://www.ibm.com/developerworks/linux/library/l-powasm1/index.html

--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)

2019-01-12 01:10:38

by Balbir Singh

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On Thu, Jan 10, 2019 at 04:14:00PM -0500, Joe Lawrence wrote:
> Hi all,
>
> tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
> about?
>
> I am looking at a bug in which ~10% of livepatch tests on RHEL-7 and
> RHEL-8 distro kernels, the ppc64le reliable stack unwinder consistently
> reports an unreliable stack for a given task. This condition can
> eventually resolve itself, but sometimes this state remains wedged for
> hours and I can live-debug the system with crash-utility.
>

In summary, you think the reliable stack tracer is being conservative?
xmon (built in debugger) also allows for live-debugging and might
be more easier to get some of this done.

> I have tried reproducing with upstream 4.20 kernel, but haven't seen
> this exact condition yet. More on upstream in a bit.
>
> That said, I have a question about the ppc64 stack frame layout with
> regard to scheduled tasks. In each sticky "unreliable" stack trace
> instance that I've looked at, the task's stack has an interesting
> frame at the top:
>

Could you please define interesting frame on top a bit more? Usually
the topmost return address is in LR

>
> Example 1 (RHEL-7)
> ==================
>
> crash> struct task_struct.thread c00000022fd015c0 | grep ksp
> ksp = 0xc0000000288af9c0
>
> crash> rd 0xc0000000288af9c0 -e 0xc0000000288b0000
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[0]:
>
> c0000000288af9c0: c0000000288afb90 0000000000dd0000 ...(............
> c0000000288af9d0: c000000000002a94 c000000001c60a00 .*..............
>
> crash> sym c000000000002a94
> c000000000002a94 (T) hardware_interrupt_common+0x114

What does bt look like in this case? what does r1 point to? I would
look at xmon and see what the "t" (stack trace) looks like, I think
it's a more familiar tool.

>
> c0000000288af9e0: c000000001c60a80 0000000000000000 ................
> c0000000288af9f0: c0000000288afbc0 0000000000dd0000 ...(............
> c0000000288afa00: c0000000014322e0 c000000001c60a00 ."C.............
> c0000000288afa10: c0000002303ae380 c0000002303ae380 ..:0......:0....
> c0000000288afa20: 7265677368657265 0000000000002200 erehsger."......
>
> Uh-oh...
>
> /* Mark stacktraces with exception frames as unreliable. */
> stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER
>
> c0000000288afa30: c0000000001240ac c0000000288afcb0 .@.........(....
> c0000000288afa40: c0000000013e4d00 0000000000000000 .M>.............
> c0000000288afa50: 0000000000000001 0000000000000001 ................
> c0000000288afa60: c0000000014322e0 0000000000000804 ."C.............
> c0000000288afa70: c0000000288ac080 c00000022fd015c0 ...(......./....
> c0000000288afa80: c0000000288afae0 00000000ffffffff ...(............
> c0000000288afa90: c0000000288afae0 c000000007b80900 ...(............
> c0000000288afaa0: c000000000e90a00 c000000000e90a00 ................
> c0000000288afab0: c0000000001240ac c000000000e90a00 .@..............
> c0000000288afac0: c000000000e90a00 c000000004790580 ..........y.....
> c0000000288afad0: 0000000000000001 c000000000e90a00 ................
> c0000000288afae0: 0000000000000000 0000000000000004 ................
> c0000000288afaf0: c00000022fd01ad0 c000000000e73be0 .../.....;......
> c0000000288afb00: c00000023900f450 c000000001488190 P..9......H.....
> c0000000288afb10: 0000000000ad0000 c00000023900ef40 [email protected]....
> c0000000288afb20: c0000000288ac000 c000000000e73be0 ...(.....;......
> c0000000288afb30: c00000000001b514 c00000022fd01628 ........(../....
> c0000000288afb40: c0000000288afbb0 c000000000008800 ...(............
> c0000000288afb50: c000000000162880 0000000000000000 .(..............
> c0000000288afb60: 00000000240f3022 0000000000000004 "0.$............
> c0000000288afb70: c000000000e90a00 c00000022fd01a20 ........ ../....
> c0000000288afb80: c0000000288afbf0 c0000000014322e0 ...(....."C.....
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[1]:
>
> c0000000288afb90: c0000000288afbf0 c000000001960a00 ...(............
> c0000000288afba0: c00000000001b514 0000000000000004 ................
>
> crash> sym c00000000001b514
> c00000000001b514 (T) __switch_to+0x264
>
> c0000000288afbb0: c000000000e90a00 0000000000000000 ................
> c0000000288afbc0: c0000000288ac000 c0000000014322e0 ...(....."C.....
> c0000000288afbd0: c000000000e90a00 c000000001960a00 ................
> c0000000288afbe0: c00000022fd015c0 c00000023900ef40 .../[email protected]....
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[2]:
>
> c0000000288afbf0: c0000000288afcd0 c000000000003300 ...(.....3......
> c0000000288afc00: c000000000a83918 c0000000013e4d00 .9.......M>.....
>
> crash> sym c000000000a83918
> c000000000a83918 (t) __schedule+0x428
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[3]:
>
> c0000000288afcd0: c0000000288afd80 0000000000003300 ...(.....3......
> c0000000288afce0: c0000000001240ac 0000000000000000 .@..............
>
> crash> sym c0000000001240ac
> c0000000001240ac (t) rescuer_thread+0x3ac
>
> [ ... and so on as we would normally expect ... ]
>
>
> Example 2 - (RHEL-7)
> ====================
>
> crash> struct task_struct.thread c00000004a660a00 | grep ksp
> ksp = 0xc0000005e85439c0,
>
> crash> rd 0xc0000005e8543b90 -e 0xc0000005e8544000
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[0]:
> c0000005e85439c0: c0000005e8543b90 0000000000000000 .;T.............
> c0000005e85439d0: c000000000002a94 0000000000000000 .*..............
>
> crash> sym c000000000002a94
> c000000000002a94 (T) hardware_interrupt_common+0x114
>
> c0000005e8543a20: 7265677368657265 c0000000013e4d00 erehsger.M>.....
>
> /* Mark stacktraces with exception frames as unreliable. */
> stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[1]:
> c0000005e8543b90: c0000005e8543bf0 c000000001960a00 .;T.............
> c0000005e8543ba0: c00000000001b514 0000000000000004 ................
>
> crash> sym c00000000001b514
> c00000000001b514 (T) __switch_to+0x264
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[2]:
> c0000005e8543bf0: c0000005e8543cd0 c000000000003300 .<T......3......
> c0000005e8543c00: c000000000a83918 c0000000013e4d00 .9.......M>.....
>
> crash> sym c000000000a83918
> c000000000a83918 (t) __schedule+0x428
>
> [ ... and so on as we would normally expect ... ]
>
>
> Example 3 (upstream 4.20)
> =========================
>
> This is not a repro of the sticky "unreliable" stack trace, but just the
> stack of a workqueue that fired and called msleep.
>
>
> crash> struct task_struct.thread c0000002257f0e00 | grep ksp
> ksp = 0xc00000022715f9b0,
>
> crash> rd 0xc00000022715f9b0 100
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[0]:
>
> c00000022715f9b0: c00000022715fb90 c00000023f4e8280 ...'......N?....
> c00000022715f9c0: c00000022715f9e0 0000000000000003 ...'............
>
> crash> sym c00000022715f9e0
> c00000022715f9e0 (B) _end+0x22594f9e0

That is weird, can you compare this to the output of "bt" or "t"
from xmon?

>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[1]:
>
> c00000022715fb90: c00000022715fbf0 c00000000158634c ...'....LcX.....
> c00000022715fba0: c00000000001ebfc c000000001318a60 ........`.1.....
>
> crash> sym c00000000001ebfc
> c00000000001ebfc (T) __switch_to+0x2dc
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[2]:
>
> c00000022715fbf0: c00000022715fcd0 c000000000008800 ...'............
> c00000022715fc00: c000000000abcf40 c0000000013abf00 @.........:.....
>
> crash> sym c000000000abcf40
> c000000000abcf40 (t) __schedule+0x3b0
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[3]:
>
> c00000022715fcd0: c00000022715fd00 2c50fff000000000 ...'..........P,
> c00000022715fce0: c000000000abd628 c00000023bff57e8 (........W.;....
>
> crash> sym c000000000abd628
> c000000000abd628 (T) schedule+0x48
>
> [ ... etc ... ]
>
>
> save_stack_trace_tsk_reliable
> =============================
>
> arch/powerpc/kernel/stacktrace.c :: save_stack_trace_tsk_reliable() does
> take into account the first stackframe, but only to verify that the link
> register is indeed pointing at kernel code address.
>
> Can someone explain what __switch_to is doing with the stack and whether
> in such circumstances, the reliable stack unwinder should be skipping
> the first frame when verifying the frame contents like STACK_FRAME_MARKER,
> etc.
>
> I may be on the wrong path in debugging this, but figuring out this
> sp[0] frame state would be helpful.
>

I would compare the output of xmon across the unreliable stack frames with
the contents of what the stack unwinder has.

I suspect the patch is stuck trying to transition to enabled state, it'll
be interesting to see if we are really stuck

Balbir Singh.

2019-01-12 09:09:57

by Segher Boessenkool

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On Sat, Jan 12, 2019 at 12:09:14PM +1100, Balbir Singh wrote:
> Could you please define interesting frame on top a bit more? Usually
> the topmost return address is in LR

There is no reliable way (other than DWARF unwind info) to find out where
the value of LR at function entry currently lives (if anywhere). It may or
may not be still available in LR, it may or may not be saved to the return
stack slot. It can also live in some GPR, or in some other stack slot.

(The same is true for all other registers).

The only thing the ABI guarantees you is that you can find all stack frames
via the back chain. If you want more you can use some heuristics and do
some heroics (like GDB does), but this is not fully reliable. Using DWARF
unwind info is, but that requires big tables.


Segher

2019-01-13 12:40:26

by Balbir Singh

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On Sat, Jan 12, 2019 at 02:45:41AM -0600, Segher Boessenkool wrote:
> On Sat, Jan 12, 2019 at 12:09:14PM +1100, Balbir Singh wrote:
> > Could you please define interesting frame on top a bit more? Usually
> > the topmost return address is in LR
>
> There is no reliable way (other than DWARF unwind info) to find out where
> the value of LR at function entry currently lives (if anywhere). It may or
> may not be still available in LR, it may or may not be saved to the return
> stack slot. It can also live in some GPR, or in some other stack slot.
>
> (The same is true for all other registers).
>
> The only thing the ABI guarantees you is that you can find all stack frames
> via the back chain. If you want more you can use some heuristics and do
> some heroics (like GDB does), but this is not fully reliable. Using DWARF
> unwind info is, but that requires big tables.
>

Thanks, so are you suggesting that a reliable stack is not possible on
ppc64le? Even with the restricted scope of the kernel?

Balbir Singh.


2019-01-13 13:07:48

by Torsten Duwe

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On Sun, 13 Jan 2019 23:33:56 +1100
Balbir Singh <[email protected]> wrote:

> On Sat, Jan 12, 2019 at 02:45:41AM -0600, Segher Boessenkool wrote:
> > On Sat, Jan 12, 2019 at 12:09:14PM +1100, Balbir Singh wrote:
> > > Could you please define interesting frame on top a bit more?
> > > Usually the topmost return address is in LR
> >
> > There is no reliable way (other than DWARF unwind info) to find out
> > where the value of LR at function entry currently lives (if
> > anywhere). It may or may not be still available in LR, it may or
> > may not be saved to the return stack slot. It can also live in
> > some GPR, or in some other stack slot.
> >
> > (The same is true for all other registers).
> >
> > The only thing the ABI guarantees you is that you can find all
> > stack frames via the back chain. If you want more you can use some
> > heuristics and do some heroics (like GDB does), but this is not
> > fully reliable. Using DWARF unwind info is, but that requires big
> > tables.
> >
>
> Thanks, so are you suggesting that a reliable stack is not possible on
> ppc64le? Even with the restricted scope of the kernel?

The LR value location is _always_ hard to determine for the topmost
frame. This is not a problem for voluntarily sleeping tasks, because the
topmost function will always be well known. It is a problem for tasks preempted
by an interrupt, or those handling an exception, that's why these need
to report "unreliable".

Note that this is a very general problem, across _all_ RISC-like
architectures. It should thus be handled as generically as possible.

Torsten


2019-01-14 04:11:12

by Joe Lawrence

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On Fri, Jan 11, 2019 at 08:51:54AM +0100, Nicolai Stange wrote:
> Joe Lawrence <[email protected]> writes:
>
> > On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:

[ ... snip ... ]

> >> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
> >> from _switch() helps?
> >>
> >> I.e. something like (completely untested):
> >
> > I'll kick off some builds tonight and try to get tests lined up
> > tomorrow. Unfortunately these take a bit of time to run setup, schedule
> > and complete, so perhaps by next week.
>
> Ok, that's probably still a good test for confirmation, but the solution
> you proposed below is much better.

Good news, 40 tests (RHEL-7 backport) with clearing out the
STACK_FRAME_MARKER were all successful. Previously I would see stalled
livepatch transitions due to the unreliable report in ~10% of test
cases.

> >> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> >> index 435927f549c4..b747d0647ec4 100644
> >> --- a/arch/powerpc/kernel/entry_64.S
> >> +++ b/arch/powerpc/kernel/entry_64.S
> >> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
> >> SAVE_8GPRS(14, r1)
> >> SAVE_10GPRS(22, r1)
> >> std r0,_NIP(r1) /* Return to switch caller */
> >> +
> >> + li r23,0
> >> + std r23,96(r1) /* 96 == STACK_FRAME_MARKER * sizeof(long) */
> >> +
> >> mfcr r23
> >> std r23,_CCR(r1)
> >> std r1,KSP(r3) /* Set old stack pointer */
> >>
> >>
> >
> > This may be sufficient to avoid the condition, but if the contents of
> > frame 0 are truely uninitialized (not to be trusted), should the
> > unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
> > aside from the LR and other stack size geometry sanity checks?
>
> That's a very good point: we'll only ever be examining scheduled out tasks
> and current (which at that time is running klp_try_complete_transition()).
>
> current won't be in an interrupt/exception when it's walking its own
> stack. And scheduled out tasks can't have an exception/interrupt frame
> as their frame 0, correct?
>
> Thus, AFAICS, whenever klp_try_complete_transition() finds a
> STACK_FRAMES_REGS_MARKER in frame 0, it is known to be garbage, as you
> said.

I gave this about 20 tests and they were also all successful, what do
you think? (The log could probably use some trimming and cleanup.) I
think either solution would fix the issue.

-->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--

From edd3fb9e8a16d5a7ccc98759e9397f386f0e82ca Mon Sep 17 00:00:00 2001
From: Joe Lawrence <[email protected]>
Date: Fri, 11 Jan 2019 10:25:26 -0500
Subject: [PATCH] powerpc/livepatch: relax reliable stack tracer exception
check

The ppc64le reliable stack tracer iterates over a given task's stack,
verifying a set of conditions to determine whether the trace is
'reliable'. These checks include the presence of any exception frames.

We should be careful when inspecting the bottom-most stack frame (the
first to be unwound), particularly for scheduled-out tasks. As Nicolai
Stange explains, "If I'm reading the code in _switch() correctly, the
first frame is completely uninitialized except for the pointer back to
the caller's stack frame." If a previous do_IRQ() invocation, for
example, has left a residual exception-marker on the first frame, the
stack tracer would incorrectly report this task's trace as unreliable.

save_stack_trace_tsk_reliable() already skips the first frame when
verifying the saved Link Register. It should do the same when looking
for the STACK_FRAME_REGS_MARKER. The task it is unwinding will be
either 'current', in which case the tracer will have never been called
from an exception path, or the task will be inactive and its first
frame's contents will be uninitialized (aside from backchain pointer).

Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
Signed-off-by: Joe Lawrence <[email protected]>
---
arch/powerpc/kernel/stacktrace.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index e2c50b55138f..0793e75c45a6 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -84,6 +84,12 @@ save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
EXPORT_SYMBOL_GPL(save_stack_trace_regs);

#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
+/*
+ * This function returns an error if it detects any unreliable features of the
+ * stack. Otherwise it guarantees that the stack trace is reliable.
+ *
+ * If the task is not 'current', the caller *must* ensure the task is inactive.
+ */
int
save_stack_trace_tsk_reliable(struct task_struct *tsk,
struct stack_trace *trace)
@@ -143,7 +149,7 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
return 1;

/* Mark stacktraces with exception frames as unreliable. */
- if (sp <= stack_end - STACK_INT_FRAME_SIZE &&
+ if (!firstframe && sp <= stack_end - STACK_INT_FRAME_SIZE &&
stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER) {
return 1;
}
--
2.20.1



2019-01-14 07:23:51

by Nicolai Stange

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

Joe Lawrence <[email protected]> writes:

> On Fri, Jan 11, 2019 at 08:51:54AM +0100, Nicolai Stange wrote:
>> Joe Lawrence <[email protected]> writes:
>>
>> > On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
>
> [ ... snip ... ]
>
>> >> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
>> >> from _switch() helps?
>> >>
>> >> I.e. something like (completely untested):
>> >
>> > I'll kick off some builds tonight and try to get tests lined up
>> > tomorrow. Unfortunately these take a bit of time to run setup, schedule
>> > and complete, so perhaps by next week.
>>
>> Ok, that's probably still a good test for confirmation, but the solution
>> you proposed below is much better.
>
> Good news, 40 tests (RHEL-7 backport) with clearing out the
> STACK_FRAME_MARKER were all successful. Previously I would see stalled
> livepatch transitions due to the unreliable report in ~10% of test
> cases.
>
>> >> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
>> >> index 435927f549c4..b747d0647ec4 100644
>> >> --- a/arch/powerpc/kernel/entry_64.S
>> >> +++ b/arch/powerpc/kernel/entry_64.S
>> >> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
>> >> SAVE_8GPRS(14, r1)
>> >> SAVE_10GPRS(22, r1)
>> >> std r0,_NIP(r1) /* Return to switch caller */
>> >> +
>> >> + li r23,0
>> >> + std r23,96(r1) /* 96 == STACK_FRAME_MARKER * sizeof(long) */
>> >> +
>> >> mfcr r23
>> >> std r23,_CCR(r1)
>> >> std r1,KSP(r3) /* Set old stack pointer */
>> >>
>> >>
>> >
>> > This may be sufficient to avoid the condition, but if the contents of
>> > frame 0 are truely uninitialized (not to be trusted), should the
>> > unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
>> > aside from the LR and other stack size geometry sanity checks?
>>
>> That's a very good point: we'll only ever be examining scheduled out tasks
>> and current (which at that time is running klp_try_complete_transition()).
>>
>> current won't be in an interrupt/exception when it's walking its own
>> stack. And scheduled out tasks can't have an exception/interrupt frame
>> as their frame 0, correct?
>>
>> Thus, AFAICS, whenever klp_try_complete_transition() finds a
>> STACK_FRAMES_REGS_MARKER in frame 0, it is known to be garbage, as you
>> said.
>
> I gave this about 20 tests and they were also all successful, what do
> you think? (The log could probably use some trimming and cleanup.) I
> think either solution would fix the issue.
>
> -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--
>
> From edd3fb9e8a16d5a7ccc98759e9397f386f0e82ca Mon Sep 17 00:00:00 2001
> From: Joe Lawrence <[email protected]>
> Date: Fri, 11 Jan 2019 10:25:26 -0500
> Subject: [PATCH] powerpc/livepatch: relax reliable stack tracer exception
> check
>
> The ppc64le reliable stack tracer iterates over a given task's stack,
> verifying a set of conditions to determine whether the trace is
> 'reliable'. These checks include the presence of any exception frames.
>
> We should be careful when inspecting the bottom-most stack frame (the
> first to be unwound), particularly for scheduled-out tasks. As Nicolai
> Stange explains, "If I'm reading the code in _switch() correctly, the
> first frame is completely uninitialized except for the pointer back to
> the caller's stack frame." If a previous do_IRQ() invocation, for
> example, has left a residual exception-marker on the first frame, the
> stack tracer would incorrectly report this task's trace as unreliable.
>

FWIW, it's not been do_IRQ() who wrote the exception marker, but it's
caller hardware_interrupt_common(), more specifically the
EXCEPTION_PROLOG_COMMON_3 part therof.


> save_stack_trace_tsk_reliable() already skips the first frame when
> verifying the saved Link Register. It should do the same when looking
> for the STACK_FRAME_REGS_MARKER. The task it is unwinding will be
> either 'current', in which case the tracer will have never been called
> from an exception path, or the task will be inactive and its first
> frame's contents will be uninitialized (aside from backchain pointer).

I thought about this a little more and can't see anything that would
prevent higher, i.e. non-_switch() frames to also alias with prior
exception frames. That STACK_FRAME_REGS_MARKER is written to a stack
frame's "parameter area" and most functions probably don't initialize
this either. So, AFAICS, higher stack frames could potentially be
affected by the very same problem.

I think the best solution would be to clear the STACK_FRAME_REGS_MARKER
upon exception return. I have a patch ready for that and will post it
after it has passed some basic testing -- hopefully later this day.

That being said, I still think that your patch should also get applied
in some form -- looking at unitialized memory is just not a good thing
to do.


>
> Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
> Signed-off-by: Joe Lawrence <[email protected]>
> ---
> arch/powerpc/kernel/stacktrace.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
> index e2c50b55138f..0793e75c45a6 100644
> --- a/arch/powerpc/kernel/stacktrace.c
> +++ b/arch/powerpc/kernel/stacktrace.c
> @@ -84,6 +84,12 @@ save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
> EXPORT_SYMBOL_GPL(save_stack_trace_regs);
>
> #ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
> +/*
> + * This function returns an error if it detects any unreliable features of the
> + * stack. Otherwise it guarantees that the stack trace is reliable.
> + *
> + * If the task is not 'current', the caller *must* ensure the task is inactive.
> + */
> int
> save_stack_trace_tsk_reliable(struct task_struct *tsk,
> struct stack_trace *trace)
> @@ -143,7 +149,7 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
> return 1;
>
> /* Mark stacktraces with exception frames as unreliable. */
> - if (sp <= stack_end - STACK_INT_FRAME_SIZE &&
> + if (!firstframe && sp <= stack_end - STACK_INT_FRAME_SIZE &&
> stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER) {
> return 1;
> }

I would perhaps not limit this to the STACK_FRAME_REGS_MARKER, but also
not emit the ip obtained from the first frame into the resulting trace.

I.e., how about moving all the sp/newsp handling to the beginning of the
loop and doing an 'if (firstframe) continue;' right after that?

Thanks,

Nicolai


--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)

2019-01-14 16:49:01

by Joe Lawrence

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On Mon, Jan 14, 2019 at 08:21:40AM +0100, Nicolai Stange wrote:
> Joe Lawrence <[email protected]> writes:
>
> > We should be careful when inspecting the bottom-most stack frame (the
> > first to be unwound), particularly for scheduled-out tasks. As Nicolai
> > Stange explains, "If I'm reading the code in _switch() correctly, the
> > first frame is completely uninitialized except for the pointer back to
> > the caller's stack frame." If a previous do_IRQ() invocation, for
> > example, has left a residual exception-marker on the first frame, the
> > stack tracer would incorrectly report this task's trace as unreliable.
> >
>
> FWIW, it's not been do_IRQ() who wrote the exception marker, but it's
> caller hardware_interrupt_common(), more specifically the
> EXCEPTION_PROLOG_COMMON_3 part therof.
>

Hi Nicolai,

Yeah, I was sloppy with the description there. :)

> I thought about this a little more and can't see anything that would
> prevent higher, i.e. non-_switch() frames to also alias with prior
> exception frames. That STACK_FRAME_REGS_MARKER is written to a stack
> frame's "parameter area" and most functions probably don't initialize
> this either. So, AFAICS, higher stack frames could potentially be
> affected by the very same problem.

Hmm, I suppose a callee could leave that stack-word untouched and then
make subsquent calls, which would be confusing for the unwinder.

> I think the best solution would be to clear the STACK_FRAME_REGS_MARKER
> upon exception return. I have a patch ready for that and will post it
> after it has passed some basic testing -- hopefully later this day.
>

I agree that this seems like the simplest way to clean up the exception
stack frame state.

> That being said, I still think that your patch should also get applied
> in some form -- looking at unitialized memory is just not a good thing
> to do.
>
> [ ... snip ...]

> I would perhaps not limit this to the STACK_FRAME_REGS_MARKER, but also
> not emit the ip obtained from the first frame into the resulting trace.
>
> I.e., how about moving all the sp/newsp handling to the beginning of the
> loop and doing an 'if (firstframe) continue;' right after that?

Good point, there is a bunch of ip and trace entries bookkeeping that
shouldn't apply in this case.

I gave the following some very light testing (5.0.0-rc2 + Petr's atomic
patches as to include and run the selftests) ... if you want to take a
bigger hammer to refactor some of the sp/newsp code (perhaps it could be
incorporated into the for() loop itself), feel free to go for it. You
could add something like this as a 2nd patch to the previously mentioned
STACK_FRAME_REGS_MARKER cleanup fix.

Thanks,

-- Joe

-->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--

From b87f9e81cf59a6e7e2309400e1b417562414cd5c Mon Sep 17 00:00:00 2001
From: Joe Lawrence <[email protected]>
Date: Sun, 13 Jan 2019 21:02:01 -0500
Subject: [PATCH] powerpc/livepatch: relax reliable stack tracer checks for
first-frame

The bottom-most stack frame (the first to be unwound) may be largely
uninitialized, for the "Power Architecture 64-Bit ELF V2 ABI" only
requires its backchain pointer to be set.

The reliable stack tracer should be careful when verifying this frame:
skip checks on STACK_FRAME_LR_SAVE and STACK_FRAME_MARKER offsets that
may contain uninitialized residual data.

Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
Suggested-by: Nicolai Stange <[email protected]>
Signed-off-by: Joe Lawrence <[email protected]>
---
arch/powerpc/kernel/stacktrace.c | 33 +++++++++++++++++++++-----------
1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index e2c50b55138f..46096687a5a8 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -84,6 +84,12 @@ save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
EXPORT_SYMBOL_GPL(save_stack_trace_regs);

#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
+/*
+ * This function returns an error if it detects any unreliable features of the
+ * stack. Otherwise it guarantees that the stack trace is reliable.
+ *
+ * If the task is not 'current', the caller *must* ensure the task is inactive.
+ */
int
save_stack_trace_tsk_reliable(struct task_struct *tsk,
struct stack_trace *trace)
@@ -142,12 +148,6 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
if (sp & 0xF)
return 1;

- /* Mark stacktraces with exception frames as unreliable. */
- if (sp <= stack_end - STACK_INT_FRAME_SIZE &&
- stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER) {
- return 1;
- }
-
newsp = stack[0];
/* Stack grows downwards; unwinder may only go up. */
if (newsp <= sp)
@@ -158,11 +158,21 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
return 1; /* invalid backlink, too far up. */
}

+ /* We can only trust the bottom frame's backlink, the rest
+ * of the frame may be uninitialized, continue to the next. */
+ if (firstframe--)
+ goto next;
+
+ /* Mark stacktraces with exception frames as unreliable. */
+ if (sp <= stack_end - STACK_INT_FRAME_SIZE &&
+ stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER) {
+ return 1;
+ }
+
/* Examine the saved LR: it must point into kernel code. */
ip = stack[STACK_FRAME_LR_SAVE];
- if (!firstframe && !__kernel_text_address(ip))
+ if (!__kernel_text_address(ip))
return 1;
- firstframe = 0;

/*
* FIXME: IMHO these tests do not belong in
@@ -183,12 +193,13 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
else
trace->skip--;

- if (newsp == stack_end)
- break;
-
if (trace->nr_entries >= trace->max_entries)
return -E2BIG;

+next:
+ if (newsp == stack_end)
+ break;
+
sp = newsp;
}
return 0;
--
2.20.1



2019-01-14 17:12:20

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On Mon, Jan 14, 2019 at 11:46:59AM -0500, Joe Lawrence wrote:
> @@ -158,11 +158,21 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
> return 1; /* invalid backlink, too far up. */
> }
>
> + /* We can only trust the bottom frame's backlink, the rest
> + * of the frame may be uninitialized, continue to the next. */
> + if (firstframe--)
> + goto next;

Won't this decrement firstframe on every iteration, so when firstframe
is 0, it will decrement it to -1, causing it to 'goto next' on all
future iterations?

--
Josh

2019-01-14 17:56:14

by Joe Lawrence

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On 1/14/19 12:09 PM, Josh Poimboeuf wrote:
> On Mon, Jan 14, 2019 at 11:46:59AM -0500, Joe Lawrence wrote:
>> @@ -158,11 +158,21 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
>> return 1; /* invalid backlink, too far up. */
>> }
>>
>> + /* We can only trust the bottom frame's backlink, the rest
>> + * of the frame may be uninitialized, continue to the next. */
>> + if (firstframe--)
>> + goto next;
>
> Won't this decrement firstframe on every iteration, so when firstframe
> is 0, it will decrement it to -1, causing it to 'goto next' on all
> future iterations?
>

Argg, yes, that should be:

if (!firstframe) {
firstframe = 0;
goto next;
}

Apologies for the monday-morning crap-patch.

/runsoff to find some more caffeine

-- Joe

2019-01-17 15:00:04

by Segher Boessenkool

[permalink] [raw]
Subject: Re: ppc64le reliable stack unwinder and scheduled tasks

On Sun, Jan 13, 2019 at 11:33:56PM +1100, Balbir Singh wrote:
> On Sat, Jan 12, 2019 at 02:45:41AM -0600, Segher Boessenkool wrote:
> > On Sat, Jan 12, 2019 at 12:09:14PM +1100, Balbir Singh wrote:
> > > Could you please define interesting frame on top a bit more? Usually
> > > the topmost return address is in LR
> >
> > There is no reliable way (other than DWARF unwind info) to find out where
> > the value of LR at function entry currently lives (if anywhere). It may or
> > may not be still available in LR, it may or may not be saved to the return
> > stack slot. It can also live in some GPR, or in some other stack slot.
> >
> > (The same is true for all other registers).
> >
> > The only thing the ABI guarantees you is that you can find all stack frames
> > via the back chain. If you want more you can use some heuristics and do
> > some heroics (like GDB does), but this is not fully reliable. Using DWARF
> > unwind info is, but that requires big tables.
> >
>
> Thanks, so are you suggesting that a reliable stack is not possible on
> ppc64le? Even with the restricted scope of the kernel?

It depends on what you mean with "reliable stack unwinder". You can unwind
the stack reliably on Power, but you want more, you want to know where some
state local to functions is kept on the stack.


Segher