2023-08-19 13:10:25

by Paul E. McKenney

Subject: [BUG] missing return thunk: __ret+0x5/0x7e-__ret+0x0/0x7e: e9 f6 ff ff ff

Hello!

I hit the splat at the end of this message in recent mainline, and it has
appeared some time since v6.5-rc1. Should I be worried?

Reproducer on a two-socket hyperthreaded 20-core-per-socket x86 system:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5m --torture refscale --kcsan --kconfig "CONFIG_NR_CPUS=40" --kmake-args "CC=clang" --bootargs "refscale.scale_type=typesafe_seqlock refscale.nreaders=40 refscale.loops=10000 refscale.holdoff=20 torture.disable_onoff_at_boot refscale.verbose_batched=5 torture.verbose_sleep_frequency=8 torture.verbose_sleep_duration=5"

This is from overnight testing that hit this only in the KCSAN runs.
The KASAN and non-debug runs had no trouble.

This commit added the warning long ago:

65cdf0d623be ("x86/alternative: Report missing return thunk details")

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

[    0.281208] ------------[ cut here ]------------
[    0.281484] missing return thunk: __ret+0x5/0x7e-__ret+0x0/0x7e: e9 f6 ff ff ff
[    0.281514] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:753 apply_returns+0x2fc/0x450
[    0.283482] Modules linked in:
[    0.284489] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0-rc6-00047-g21575bdc67ed #34195
[    0.285483] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[    0.286482] RIP: 0010:apply_returns+0x2fc/0x450
[    0.287124] Code: ff ff 0f 0b e9 a9 fd ff ff c6 05 a1 0a 65 02 01 48 c7 c7 8b e3 2b b9 4c 89 ee 48 89 da b9 05 00 00 00 4d 89 e8 e8 04 f4 06 00 <0f> 0b e9 9a fe ff ff 85 db 0f 84 15 ff ff ff 48 c7 c7 4b e3 2b b9
[    0.287483] RSP: 0000:ffffffffb9603e00 EFLAGS: 00010246
[    0.288482] RAX: 22c53364d8918300 RBX: ffffffffb8b0e600 RCX: 0000000000000002
[    0.289482] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    0.290482] RBP: ffffffffb9603ee0 R08: 0000000080000003 R09: 0000000000000000
[    0.291481] R10: 0001ffffffffffff R11: ffffffffb9623800 R12: ffffffffb9603e18
[    0.292481] R13: ffffffffb8b0e605 R14: ffffffffba150a70 R15: ffffffffba150a68
[    0.293482] FS:  0000000000000000(0000) GS:ffff97305ec00000(0000) knlGS:0000000000000000
[    0.294481] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.295481] CR2: ffff973055601000 CR3: 0000000013a44000 CR4: 00000000000006f0
[    0.296483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.297482] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.298482] Call Trace:
[    0.298859]  <TASK>
[    0.299185]  ? __warn+0x12c/0x330
[    0.299484]  ? apply_returns+0x2fc/0x450
[    0.300484]  ? report_bug+0x12a/0x1c0
[    0.301079]  ? handle_bug+0x3d/0x80
[    0.301483]  ? exc_invalid_op+0x1a/0x50
[    0.302041]  ? asm_exc_invalid_op+0x1a/0x20
[    0.302483]  ? __ret+0x5/0x7e
[    0.302903]  ? zen_untrain_ret+0x1/0x1
[    0.303487]  ? apply_returns+0x2fc/0x450
[    0.304003]  ? __ret+0x5/0x7e
[    0.304482]  ? __ret+0x14/0x7e
[    0.304869]  ? __ret+0xa/0x7e
[    0.305484]  ? unregister_die_notifier+0x4e/0x60
[    0.306063]  alternative_instructions+0x52/0x120
[    0.306489]  arch_cpu_finalize_init+0x2c/0x50
[    0.307068]  start_kernel+0x480/0x590
[    0.307485]  x86_64_start_reservations+0x24/0x30
[    0.308482]  x86_64_start_kernel+0xab/0xb0
[    0.309068]  secondary_startup_64_no_verify+0x17a/0x17b
[    0.309490]  </TASK>
[    0.309808] irq event stamp: 128439
[    0.310481] hardirqs last  enabled at (128457): [<ffffffffb7368401>] __up_console_sem+0x91/0xc0
[    0.311481] hardirqs last disabled at (128474): [<ffffffffb73683e6>] __up_console_sem+0x76/0xc0
[    0.312482] softirqs last  enabled at (128490): [<ffffffffb72cf624>] __irq_exit_rcu+0x64/0xd0
[    0.313481] softirqs last disabled at (128501): [<ffffffffb72cf624>] __irq_exit_rcu+0x64/0xd0
[    0.314481] ---[ end trace 0000000000000000 ]---
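
For readers decoding the warning line: assuming it follows the form
<site>-<destination>: <bytes>, the five bytes e9 f6 ff ff ff are a
jmp rel32 whose displacement 0xfffffff6 is -10, taken relative to the
end of the instruction. A minimal sketch in C that reproduces the
arithmetic (the function names are illustrative, not kernel APIs, and
__ret is simply treated as offset 0):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the target of an x86-64 "jmp rel32" (opcode 0xe9).
 * The 32-bit displacement is stored little-endian and is relative
 * to the address of the next instruction (addr + 5).
 * Returns 0 on success, -1 if the first byte is not 0xe9.
 */
int jmp_rel32_target(uint64_t addr, const uint8_t insn[5], uint64_t *target)
{
	uint32_t raw;
	int64_t disp;

	if (insn[0] != 0xe9)
		return -1;

	raw = (uint32_t)insn[1] |
	      ((uint32_t)insn[2] << 8) |
	      ((uint32_t)insn[3] << 16) |
	      ((uint32_t)insn[4] << 24);

	/* Sign-extend the displacement by hand instead of relying on casts. */
	disp = (raw & 0x80000000u) ? (int64_t)raw - 0x100000000LL : (int64_t)raw;

	/* Unsigned wraparound gives the right result for negative disp. */
	*target = addr + 5 + (uint64_t)disp;
	return 0;
}

/* Worked example with the bytes from the warning; offsets are relative to __ret. */
void check_splat_jump(void)
{
	const uint8_t insn[5] = { 0xe9, 0xf6, 0xff, 0xff, 0xff };
	uint64_t target;

	assert(jmp_rel32_target(0x5, insn, &target) == 0);
	assert(target == 0x0);	/* the jump at __ret+0x5 lands on __ret+0x0 */
}
```

With the site at __ret+0x5, the next instruction starts at __ret+0xa,
and -10 from there is __ret+0x0, which matches the destination printed
in the splat.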


2023-08-20 02:39:33

by Paul E. McKenney

Subject: Re: [BUG] missing return thunk: __ret+0x5/0x7e-__ret+0x0/0x7e: e9 f6 ff ff ff

On Thu, Aug 17, 2023 at 11:49:52PM +0300, Nikolay Borisov wrote:
>
>
> On 16.08.23 г. 20:54 ч., Paul E. McKenney wrote:
> > Hello!
> >
> > I hit the splat at the end of this message in recent mainline, and has
> > appeared some time since v6.5-rc1. Should I be worried?
> >
> > Reproducer on a two-socket hyperthreaded 20-core-per-socket x86 system:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5m --torture refscale --kcsan --kconfig "CONFIG_NR_CPUS=40" --kmake-args "CC=clang" --bootargs "refscale.scale_type=typesafe_seqlock refscale.nreaders=40 refscale.loops=10000 refscale.holdoff=20 torture.disable_onoff_at_boot refscale.verbose_batched=5 torture.verbose_sleep_frequency=8 torture.verbose_sleep_duration=5"
> >
> > This is from overnight testing that hit this only in the KCSAN runs.
> > The KASAN and non-debug runs had no trouble.
> >
> > This commit added the warning long ago:
> >
> > 65cdf0d623be ("x86/alternative: Report missing return thunk details")
> >
> > Thoughts?
> >
> > Thanx, Paul
>
> Likely fixed by the following commit in tip/urgent:
> 4ae68b26c3ab5a82aa271e6e9fc9b1a06e1d6b40 [tip: x86/urgent] objtool/x86: Fix
> SRSO mess

Thank you! Given the "urgent", I am guessing that this is going
upstream soon?

Thanx, Paul

2023-08-20 18:15:07

by Paul E. McKenney

Subject: Re: [BUG] missing return thunk: __ret+0x5/0x7e-__ret+0x0/0x7e: e9 f6 ff ff ff

On Wed, Aug 16, 2023 at 08:17:20PM +0200, Borislav Petkov wrote:
> Hey Paul,
>
> On Wed, Aug 16, 2023 at 10:54:09AM -0700, Paul E. McKenney wrote:
> > I hit the splat at the end of this message in recent mainline, and has
> > appeared some time since v6.5-rc1. Should I be worried?
>
> does it go away if you try the latest tip:x86/urgent branch?

That is plausible, given that bisection has narrowed things down to
somewhere between v6.5-rc5 and v6.5-rc6. And it is quite conveniently
currently on a bad commit. Sometimes you get lucky. ;-)

So pulling in those commits from -tip, currently headed by this one:

d80c3c9de067 ("x86/srso: Explain the untraining sequences a bit more")

Then merging them with the current bad commit gets me a successful
run. Thank you!!!

Thanx, Paul

2023-08-25 08:42:13

by Josh Poimboeuf

Subject: Re: [BUG] missing return thunk: __ret+0x5/0x7e-__ret+0x0/0x7e: e9 f6 ff ff ff

On Fri, Aug 25, 2023 at 08:33:02AM +0200, Borislav Petkov wrote:
> On Thu, Aug 24, 2023 at 03:52:56PM +0200, Greg KH wrote:
> > On Wed, Aug 16, 2023 at 08:17:20PM +0200, Borislav Petkov wrote:
> > > Hey Paul,
> > >
> > > On Wed, Aug 16, 2023 at 10:54:09AM -0700, Paul E. McKenney wrote:
> > > > I hit the splat at the end of this message in recent mainline, and has
> > > > appeared some time since v6.5-rc1. Should I be worried?
> > >
> > > does it go away if you try the latest tip:x86/urgent branch?
> >
> > Note, this problem is showing up in the 6.1.y branch right now, due to
> > one objtool patch not being able to be backported there easily (i.e. I
> > tried and gave up.)
> >
> > 4ae68b26c3ab ("objtool/x86: Fix SRSO mess") being the commit that I
> > can't seem to get to work properly, my attempt can be seen here:
> > https://lore.kernel.org/r/2023082212-pregnant-lizard-80e0@gregkh
> >
> > Just a heads up as this will start to affect users of the next 6.1.y
> > release, and probably older releases, as they are taking portions of the
> > "fixes for fixes" but not the above mentioned one.
>
> Hmm, Peter and I are away, I guess maybe Josh might have an idea how and
> what to backport to 6.1 to get this sorted out...
>
> CCed.

Sure, will take a look tomorrow.

--
Josh

2023-08-26 07:52:26

by Josh Poimboeuf

Subject: Re: [BUG] missing return thunk: __ret+0x5/0x7e-__ret+0x0/0x7e: e9 f6 ff ff ff

On Thu, Aug 24, 2023 at 03:52:56PM +0200, Greg KH wrote:
> On Wed, Aug 16, 2023 at 08:17:20PM +0200, Borislav Petkov wrote:
> > Hey Paul,
> >
> > On Wed, Aug 16, 2023 at 10:54:09AM -0700, Paul E. McKenney wrote:
> > > I hit the splat at the end of this message in recent mainline, and has
> > > appeared some time since v6.5-rc1. Should I be worried?
> >
> > does it go away if you try the latest tip:x86/urgent branch?
>
> Note, this problem is showing up in the 6.1.y branch right now, due to
> one objtool patch not being able to be backported there easily (i.e. I
> tried and gave up.)
>
> 4ae68b26c3ab ("objtool/x86: Fix SRSO mess") being the commit that I
> can't seem to get to work properly, my attempt can be seen here:
> https://lore.kernel.org/r/2023082212-pregnant-lizard-80e0@gregkh

> --- a/tools/objtool/arch/x86/decode.c
> +++ b/tools/objtool/arch/x86/decode.c
> @@ -796,8 +796,11 @@ bool arch_is_retpoline(struct symbol *sy
>
> bool arch_is_rethunk(struct symbol *sym)
> {
> - return !strcmp(sym->name, "__x86_return_thunk") ||
> - !strcmp(sym->name, "srso_untrain_ret") ||
> - !strcmp(sym->name, "srso_safe_ret") ||
> - !strcmp(sym->name, "retbleed_return_thunk");
> + return !strcmp(sym->name, "__x86_return_thunk");
> +}
> +
> +bool arch_is_embedded_insn(struct symbol *sym)
> +{
> + return !strcmp(sym->name, "retbleed_return_thunk") ||
> + !strcmp(sym->name, "srso_safe_ret");

This wouldn't work with the current 6.1.y branch, I assume you had some
other patches applied before this. e.g., the patch renaming __ret to
retbleed_return_thunk.

> }
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -418,7 +418,7 @@ static int decode_instructions(struct ob
> }
>
> list_for_each_entry(func, &sec->symbol_list, list) {
> - if (func->type != STT_FUNC || func->alias != func)
> + if (func->embedded_insn || func->alias != func)
> continue;

This hunk looks like a bug. This might be the source of your problems.
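
One way to see what the rewritten condition changes, offered as a
hedged sketch rather than a statement of the reasoning behind the
comment above: the '-' line skips every symbol that is not a function,
while the '+' line no longer looks at the symbol type at all. The
types and helper names below are hypothetical stand-ins, not objtool
code; only the two conditions are taken from the quoted hunk.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the values match the ELF STT_OBJECT and STT_FUNC. */
enum sketch_type { SKETCH_STT_OBJECT = 1, SKETCH_STT_FUNC = 2 };

struct func_sketch {
	enum sketch_type type;
	bool embedded_insn;
	bool is_alias;		/* stands in for func->alias != func */
};

/* Condition on the '-' line: skip anything that is not a canonical function. */
bool skip_before(const struct func_sketch *f)
{
	return f->type != SKETCH_STT_FUNC || f->is_alias;
}

/* Condition on the '+' line of the attempted backport. */
bool skip_after(const struct func_sketch *f)
{
	return f->embedded_insn || f->is_alias;
}
```

Under these assumptions a data object is skipped by skip_before() but
not by skip_after(), and an embedded-instruction function is handled
the opposite way round, so the loop body would run on a different set
of symbols than before.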

The below patch seems to work on stock 6.1.47. Or if you have other
SRSO patches pending, point me to them and I can look at porting this
one to fit.

---8<---

From: Peter Zijlstra <[email protected]>
Subject: [PATCH] objtool/x86: Fix SRSO mess

Objtool --rethunk does two things:

 - it collects all (tail) call's of __x86_return_thunk and places them
   into .return_sites. These are typically compiler generated, but
   RET also emits this same.

 - it fudges the validation of the __x86_return_thunk symbol; because
   this symbol is inside another instruction, it can't actually find
   the instruction pointed to by the symbol offset and gets upset.

Because these two things pertained to the same symbol, there was no
pressing need to separate these two separate things.

However, alas, along comes SRSO and more crazy things to deal with
appeared.

The SRSO patch itself added the following symbol names to identify as
rethunk:

'srso_untrain_ret', 'srso_safe_ret' and '__ret'

Where '__ret' is the old retbleed return thunk, 'srso_safe_ret' is a
new similarly embedded return thunk, and 'srso_untrain_ret' is
completely unrelated to anything the above does (and was only included
because of that INT3 vs UD2 issue fixed previous).

Clear things up by adding a second category for the embedded instruction
thing.

Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
---
 tools/objtool/arch/x86/decode.c      | 11 +++++++----
 tools/objtool/check.c                | 22 +++++++++++++++++++++-
 tools/objtool/include/objtool/arch.h |  1 +
 tools/objtool/include/objtool/elf.h  |  1 +
 4 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index a60c5efe34b3..4cf730e3ac1d 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -796,8 +796,11 @@ bool arch_is_retpoline(struct symbol *sym)

 bool arch_is_rethunk(struct symbol *sym)
 {
-	return !strcmp(sym->name, "__x86_return_thunk") ||
-	       !strcmp(sym->name, "srso_untrain_ret") ||
-	       !strcmp(sym->name, "srso_safe_ret") ||
-	       !strcmp(sym->name, "__ret");
+	return !strcmp(sym->name, "__x86_return_thunk");
+}
+
+bool arch_is_embedded_insn(struct symbol *sym)
+{
+	return !strcmp(sym->name, "__ret") ||
+	       !strcmp(sym->name, "srso_safe_ret");
 }
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index c2c350933a23..a88ad299fc31 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1164,16 +1164,33 @@ static int add_ignore_alternatives(struct objtool_file *file)
 	return 0;
 }

+/*
+ * Symbols that replace INSN_CALL_DYNAMIC, every (tail) call to such a symbol
+ * will be added to the .retpoline_sites section.
+ */
 __weak bool arch_is_retpoline(struct symbol *sym)
 {
 	return false;
 }

+/*
+ * Symbols that replace INSN_RETURN, every (tail) call to such a symbol
+ * will be added to the .return_sites section.
+ */
 __weak bool arch_is_rethunk(struct symbol *sym)
 {
 	return false;
 }

+/*
+ * Symbols that are embedded inside other instructions, because sometimes crazy
+ * code exists. These are mostly ignored for validation purposes.
+ */
+__weak bool arch_is_embedded_insn(struct symbol *sym)
+{
+	return false;
+}
+
 #define NEGATIVE_RELOC ((void *)-1L)

 static struct reloc *insn_reloc(struct objtool_file *file, struct instruction *insn)
@@ -1437,7 +1454,7 @@ static int add_jump_destinations(struct objtool_file *file)
 			 * middle of another instruction. Objtool only
 			 * knows about the outer instruction.
 			 */
-			if (sym && sym->return_thunk) {
+			if (sym && sym->embedded_insn) {
 				add_return_call(file, insn, false);
 				continue;
 			}
@@ -2327,6 +2344,9 @@ static int classify_symbols(struct objtool_file *file)
 			if (arch_is_rethunk(func))
 				func->return_thunk = true;

+			if (arch_is_embedded_insn(func))
+				func->embedded_insn = true;
+
 			if (!strcmp(func->name, "__fentry__"))
 				func->fentry = true;

diff --git a/tools/objtool/include/objtool/arch.h b/tools/objtool/include/objtool/arch.h
index beb2f3aa94ff..861c0c60ac81 100644
--- a/tools/objtool/include/objtool/arch.h
+++ b/tools/objtool/include/objtool/arch.h
@@ -90,6 +90,7 @@ int arch_decode_hint_reg(u8 sp_reg, int *base);

 bool arch_is_retpoline(struct symbol *sym);
 bool arch_is_rethunk(struct symbol *sym);
+bool arch_is_embedded_insn(struct symbol *sym);

 int arch_rewrite_retpolines(struct objtool_file *file);

diff --git a/tools/objtool/include/objtool/elf.h b/tools/objtool/include/objtool/elf.h
index 16f4067b82ae..5d4a841fbd31 100644
--- a/tools/objtool/include/objtool/elf.h
+++ b/tools/objtool/include/objtool/elf.h
@@ -60,6 +60,7 @@ struct symbol {
 	u8 return_thunk : 1;
 	u8 fentry : 1;
 	u8 profiling_func : 1;
+	u8 embedded_insn : 1;
 	struct list_head pv_target;
 };

--
2.41.0
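
The split that the patch introduces can be sketched outside objtool as
follows. This is a simplified illustration, not objtool code: struct
sym_sketch and the sketch_* helpers are hypothetical names, and only
the symbol names and the two flags are taken from the patch above.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for objtool's struct symbol. */
struct sym_sketch {
	const char *name;
	bool return_thunk;	/* (tail) calls to it go into .return_sites */
	bool embedded_insn;	/* symbol sits inside another instruction */
};

/* After the patch, only the real return thunk counts as a rethunk. */
bool sketch_is_rethunk(const char *name)
{
	return !strcmp(name, "__x86_return_thunk");
}

/* Symbols embedded in another instruction get their own category. */
bool sketch_is_embedded_insn(const char *name)
{
	return !strcmp(name, "__ret") ||
	       !strcmp(name, "srso_safe_ret");
}

/* Mirrors the two independent checks added to classify_symbols(). */
void sketch_classify(struct sym_sketch *sym)
{
	sym->return_thunk = sketch_is_rethunk(sym->name);
	sym->embedded_insn = sketch_is_embedded_insn(sym->name);
}
```

In this sketch "srso_untrain_ret" ends up in neither category, which
is consistent with the commit message calling it unrelated to both
jobs that --rethunk performs.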


2023-08-26 14:55:48

by Greg Kroah-Hartman

Subject: Re: [BUG] missing return thunk: __ret+0x5/0x7e-__ret+0x0/0x7e: e9 f6 ff ff ff

On Fri, Aug 25, 2023 at 04:26:57PM -0700, Josh Poimboeuf wrote:
> On Thu, Aug 24, 2023 at 03:52:56PM +0200, Greg KH wrote:
> > On Wed, Aug 16, 2023 at 08:17:20PM +0200, Borislav Petkov wrote:
> > > Hey Paul,
> > >
> > > On Wed, Aug 16, 2023 at 10:54:09AM -0700, Paul E. McKenney wrote:
> > > > I hit the splat at the end of this message in recent mainline, and has
> > > > appeared some time since v6.5-rc1. Should I be worried?
> > >
> > > does it go away if you try the latest tip:x86/urgent branch?
> >
> > Note, this problem is showing up in the 6.1.y branch right now, due to
> > one objtool patch not being able to be backported there easily (i.e. I
> > tried and gave up.)
> >
> > 4ae68b26c3ab ("objtool/x86: Fix SRSO mess") being the commit that I
> > can't seem to get to work properly, my attempt can be seen here:
> > https://lore.kernel.org/r/2023082212-pregnant-lizard-80e0@gregkh
>
> > --- a/tools/objtool/arch/x86/decode.c
> > +++ b/tools/objtool/arch/x86/decode.c
> > @@ -796,8 +796,11 @@ bool arch_is_retpoline(struct symbol *sy
> >
> > bool arch_is_rethunk(struct symbol *sym)
> > {
> > - return !strcmp(sym->name, "__x86_return_thunk") ||
> > - !strcmp(sym->name, "srso_untrain_ret") ||
> > - !strcmp(sym->name, "srso_safe_ret") ||
> > - !strcmp(sym->name, "retbleed_return_thunk");
> > + return !strcmp(sym->name, "__x86_return_thunk");
> > +}
> > +
> > +bool arch_is_embedded_insn(struct symbol *sym)
> > +{
> > + return !strcmp(sym->name, "retbleed_return_thunk") ||
> > + !strcmp(sym->name, "srso_safe_ret");
>
> This wouldn't work with the current 6.1.y branch, I assume you had some
> other patches applied before this. e.g., the patch renaming __ret to
> retbleed_return_thunk.

Yes, I did.

> > }
> > --- a/tools/objtool/check.c
> > +++ b/tools/objtool/check.c
> > @@ -418,7 +418,7 @@ static int decode_instructions(struct ob
> > }
> >
> > list_for_each_entry(func, &sec->symbol_list, list) {
> > - if (func->type != STT_FUNC || func->alias != func)
> > + if (func->embedded_insn || func->alias != func)
> > continue;
>
> This hunk looks like a bug. This might be the source of your problems.

Ah, I guessed wrong on that change, my fault :(

> The below patch seems to work on stock 6.1.47. Or if you have other
> SRSO patches pending, point me to them and I can look at porting this
> one to fit.

I got this to apply on top of the latest series (-rc) and it passes
test-builds here. I'll do a release now without it and then queue this
up, along with some other fixes for reported problems in previous
releases, and release it so that the CI systems can go at it.

Many thanks for this!

greg k-h