2020-11-06 06:06:58

by Shinichiro Kawasaki

Subject: WARNING: can't access registers at asm_common_interrupt

Greetings,

I repeatedly observe "WARNING: can't access registers at asm_common_interrupt+0x1e/0x40"
on my kernel test system. The warning is printed by unwind_next_frame() in
"arch/x86/kernel/unwind_orc.c". Syzbot has already reported it [1]. A similar
warning was reported and discussed [2], but I believe the cause has not yet
been clarified.

The warning was observed with v5.10-rc2 and older tags. I bisected and found
that commit 044d0d6de9f5 ("lockdep: Only trace IRQ edges") in v5.9-rc3
triggered the warning. Reverting that commit from v5.10-rc2 made the warning
disappear. May I ask the experts on CC to comment on how this commit could
relate to the warning?

The conditions to reproduce the warning are rather unique (blktests, dm-linear,
and ZNS device emulation with QEMU). If any action is suggested for further
analysis, I'm willing to take it on my test system.

I hope this report helps.

[1] https://lkml.org/lkml/2020/9/6/231
[2] https://lkml.org/lkml/2020/9/8/1538

--
Best Regards,
Shin'ichiro Kawasaki


2020-11-06 18:11:26

by Josh Poimboeuf

Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Fri, Nov 06, 2020 at 06:04:15AM +0000, Shinichiro Kawasaki wrote:
> Greetings,
>
> I repeatedly observe "WARNING: can't access registers at asm_common_interrupt+0x1e/0x40"
> on my kernel test system. The warning is printed by unwind_next_frame() in
> "arch/x86/kernel/unwind_orc.c". Syzbot has already reported it [1]. A similar
> warning was reported and discussed [2], but I believe the cause has not yet
> been clarified.
>
> The warning was observed with v5.10-rc2 and older tags. I bisected and found
> that commit 044d0d6de9f5 ("lockdep: Only trace IRQ edges") in v5.9-rc3
> triggered the warning. Reverting that commit from v5.10-rc2 made the warning
> disappear. May I ask the experts on CC to comment on how this commit could
> relate to the warning?
>
> The conditions to reproduce the warning are rather unique (blktests, dm-linear,
> and ZNS device emulation with QEMU). If any action is suggested for further
> analysis, I'm willing to take it on my test system.

Hi,

Thanks for reporting this issue. This might be a different issue from
[2].

Can you send me the arch/x86/entry/entry_64.o file from your build?

--
Josh

2020-11-09 09:13:19

by Shinichiro Kawasaki

Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Nov 06, 2020 / 12:06, Josh Poimboeuf wrote:
> On Fri, Nov 06, 2020 at 06:04:15AM +0000, Shinichiro Kawasaki wrote:
> > Greetings,
> >
> > I repeatedly observe "WARNING: can't access registers at asm_common_interrupt+0x1e/0x40"
> > on my kernel test system. The warning is printed by unwind_next_frame() in
> > "arch/x86/kernel/unwind_orc.c". Syzbot has already reported it [1]. A similar
> > warning was reported and discussed [2], but I believe the cause has not yet
> > been clarified.
> >
> > The warning was observed with v5.10-rc2 and older tags. I bisected and found
> > that commit 044d0d6de9f5 ("lockdep: Only trace IRQ edges") in v5.9-rc3
> > triggered the warning. Reverting that commit from v5.10-rc2 made the warning
> > disappear. May I ask the experts on CC to comment on how this commit could
> > relate to the warning?
> >
> > The conditions to reproduce the warning are rather unique (blktests, dm-linear,
> > and ZNS device emulation with QEMU). If any action is suggested for further
> > analysis, I'm willing to take it on my test system.
>
> Hi,
>
> Thanks for reporting this issue. This might be a different issue from
> [2].
>
> Can you send me the arch/x86/entry/entry_64.o file from your build?

Hi Josh, thank you for your response. In a separate e-mail, I have sent
entry_64.o to your address only, since I hesitated to send the 76 KB attachment
to the whole LKML. In case it does not reach you, please let me know.

--
Best Regards,
Shin'ichiro Kawasaki

2020-11-10 03:22:06

by Josh Poimboeuf

Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Mon, Nov 09, 2020 at 09:10:38AM +0000, Shinichiro Kawasaki wrote:
> On Nov 06, 2020 / 12:06, Josh Poimboeuf wrote:
> > On Fri, Nov 06, 2020 at 06:04:15AM +0000, Shinichiro Kawasaki wrote:
> > > Greetings,
> > >
> > > I repeatedly observe "WARNING: can't access registers at asm_common_interrupt+0x1e/0x40"
> > > on my kernel test system. The warning is printed by unwind_next_frame() in
> > > "arch/x86/kernel/unwind_orc.c". Syzbot has already reported it [1]. A similar
> > > warning was reported and discussed [2], but I believe the cause has not yet
> > > been clarified.
> > >
> > > The warning was observed with v5.10-rc2 and older tags. I bisected and found
> > > that commit 044d0d6de9f5 ("lockdep: Only trace IRQ edges") in v5.9-rc3
> > > triggered the warning. Reverting that commit from v5.10-rc2 made the warning
> > > disappear. May I ask the experts on CC to comment on how this commit could
> > > relate to the warning?
> > >
> > > The conditions to reproduce the warning are rather unique (blktests, dm-linear,
> > > and ZNS device emulation with QEMU). If any action is suggested for further
> > > analysis, I'm willing to take it on my test system.
> >
> > Hi,
> >
> > Thanks for reporting this issue. This might be a different issue from
> > [2].
> >
> > Can you send me the arch/x86/entry/entry_64.o file from your build?
>
> > Hi Josh, thank you for your response. In a separate e-mail, I have sent
> > entry_64.o to your address only, since I hesitated to send the 76 KB attachment
> > to the whole LKML. In case it does not reach you, please let me know.

Got it, thanks. Unfortunately I'm still confused.

Can you test with the following patch, and boot with 'unwind_debug'?
That should hopefully dump a lot of useful data along with the warning.

From: Josh Poimboeuf <[email protected]>
Subject: [PATCH] x86/unwind/orc: Add 'unwind_debug' cmdline option

Sometimes the one-line ORC unwinder warnings aren't very helpful. Add a
new 'unwind_debug' cmdline option which will dump the full stack
contents of the current task when an error condition is encountered.

Signed-off-by: Josh Poimboeuf <[email protected]>
Reviewed-by: Miroslav Benes <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 6 +++
arch/x86/kernel/unwind_orc.c | 48 ++++++++++++++++++-
2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 526d65d8573a..4bed92c51723 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5512,6 +5512,12 @@
 	unknown_nmi_panic
 			[X86] Cause panic on unknown NMI.
 
+	unwind_debug	[X86-64]
+			Enable unwinder debug output. This can be
+			useful for debugging certain unwinder error
+			conditions, including corrupt stacks and
+			bad/missing unwinder metadata.
+
 	usbcore.authorized_default=
 			[USB] Default USB device authorization:
 			(default -1 = authorized except for wireless USB,
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 73f800100066..44bae03f9bfc 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -13,8 +13,13 @@
 
 #define orc_warn_current(args...)					\
 ({									\
-	if (state->task == current)					\
+	static bool dumped_before;					\
+	if (state->task == current) {					\
 		orc_warn(args);						\
+		if (unwind_debug && !dumped_before)			\
+			unwind_dump(state);				\
+		dumped_before = true;					\
+	}								\
 })
 
 extern int __start_orc_unwind_ip[];
@@ -23,8 +28,49 @@ extern struct orc_entry __start_orc_unwind[];
 extern struct orc_entry __stop_orc_unwind[];
 
 static bool orc_init __ro_after_init;
+static bool unwind_debug __ro_after_init;
 static unsigned int lookup_num_blocks __ro_after_init;
 
+static int __init unwind_debug_cmdline(char *str)
+{
+	unwind_debug = true;
+
+	return 0;
+}
+early_param("unwind_debug", unwind_debug_cmdline);
+
+static void unwind_dump(struct unwind_state *state)
+{
+	static bool dumped_before;
+	unsigned long word, *sp;
+	struct stack_info stack_info = {0};
+	unsigned long visit_mask = 0;
+
+	if (dumped_before)
+		return;
+
+	dumped_before = true;
+
+	printk_deferred("unwind stack type:%d next_sp:%p mask:0x%lx graph_idx:%d\n",
+			state->stack_info.type, state->stack_info.next_sp,
+			state->stack_mask, state->graph_idx);
+
+	for (sp = __builtin_frame_address(0); sp;
+	     sp = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
+		if (get_stack_info(sp, state->task, &stack_info, &visit_mask))
+			break;
+
+		for (; sp < stack_info.end; sp++) {
+
+			word = READ_ONCE_NOCHECK(*sp);
+
+			printk_deferred("%0*lx: %0*lx (%pB)\n", BITS_PER_LONG/4,
+					(unsigned long)sp, BITS_PER_LONG/4,
+					word, (void *)word);
+		}
+	}
+}
+
 static inline unsigned long orc_ip(const int *ip)
 {
 	return (unsigned long)ip + *ip;
--
2.25.4
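[Editorial aside] The once-per-boot dump guard the patch wires into orc_warn_current() can be sketched as a user-space C analogue. This is purely illustrative, not kernel code: should_dump() is a hypothetical name, and printk_deferred plus the static-local-in-a-macro trick are replaced with ordinary statics and a return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space analogue of the guard in the patch: only the
 * first warning that fires while unwind_debug is enabled produces a
 * full dump; every later warning prints just the one-line message. */
static bool unwind_debug;   /* would be set by the unwind_debug boot parameter */
static bool dumped_before;  /* mirrors the static local in the macro */

static bool should_dump(void)
{
	bool dump = unwind_debug && !dumped_before;

	/* As in the posted patch, the flag is set unconditionally once a
	 * warning fires for the current task. */
	dumped_before = true;
	return dump;
}
```

With unwind_debug set, the first call returns true and every later call returns false, which matches the "dump at most once" behavior the patch aims for.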

2020-11-10 09:23:45

by Shinichiro Kawasaki

Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Nov 09, 2020 / 21:19, Josh Poimboeuf wrote:
> On Mon, Nov 09, 2020 at 09:10:38AM +0000, Shinichiro Kawasaki wrote:
> > On Nov 06, 2020 / 12:06, Josh Poimboeuf wrote:
> > > On Fri, Nov 06, 2020 at 06:04:15AM +0000, Shinichiro Kawasaki wrote:
> > > > Greetings,
> > > >
> > > > I repeatedly observe "WARNING: can't access registers at asm_common_interrupt+0x1e/0x40"
> > > > on my kernel test system. The warning is printed by unwind_next_frame() in
> > > > "arch/x86/kernel/unwind_orc.c". Syzbot has already reported it [1]. A similar
> > > > warning was reported and discussed [2], but I believe the cause has not yet
> > > > been clarified.
> > > >
> > > > The warning was observed with v5.10-rc2 and older tags. I bisected and found
> > > > that commit 044d0d6de9f5 ("lockdep: Only trace IRQ edges") in v5.9-rc3
> > > > triggered the warning. Reverting that commit from v5.10-rc2 made the warning
> > > > disappear. May I ask the experts on CC to comment on how this commit could
> > > > relate to the warning?
> > > >
> > > > The conditions to reproduce the warning are rather unique (blktests, dm-linear,
> > > > and ZNS device emulation with QEMU). If any action is suggested for further
> > > > analysis, I'm willing to take it on my test system.
> > >
> > > Hi,
> > >
> > > Thanks for reporting this issue. This might be a different issue from
> > > [2].
> > >
> > > Can you send me the arch/x86/entry/entry_64.o file from your build?
> >
> > Hi Josh, thank you for your response. In a separate e-mail, I have sent
> > entry_64.o to your address only, since I hesitated to send the 76 KB attachment
> > to the whole LKML. In case it does not reach you, please let me know.
>
> Got it, thanks. Unfortunately I'm still confused.
>
> Can you test with the following patch, and boot with 'unwind_debug'?
> That should hopefully dump a lot of useful data along with the warning.

Thank you for the patch. With the patch applied and 'unwind_debug' on the
kernel command line, I reproduced the warning. Here are the kernel messages
that were printed (the smpboot-related messages are the usual output of
blktests test case block/008).

...
[ 113.022680] smpboot: CPU 1 is now offline
[ 113.030546] WARNING: can't access registers at asm_common_interrupt+0x1e/0x40
[ 113.030549] unwind stack type:0 next_sp:0000000000000000 mask:0x6 graph_idx:0
[ 113.030554] ffff8881e87097b0: 1ffff1103d0e1302 (0x1ffff1103d0e1302)
[ 113.030558] ffff8881e87097b8: ffffffff9712a434 (unwind_next_frame+0x15e4/0x1fc0)
[ 113.030560] ffff8881e87097c0: ffffffff99200c5e (asm_common_interrupt+0x1e/0x40)
[ 113.030562] ffff8881e87097c8: 0000000000000000 (0x0)
[ 113.030563] ffff8881e87097d0: ffff8881e87098bd (0xffff8881e87098bd)
[ 113.030564] ffff8881e87097d8: ffff8881e87098d8 (0xffff8881e87098d8)
[ 113.030565] ffff8881e87097e0: ffff8881e87098c0 (0xffff8881e87098c0)
[ 113.030570] ffff8881e87097e8: ffffffff9ac85575 (__start_orc_unwind+0x65f81/0x37a91c)
[ 113.030572] ffff8881e87097f0: ffffffff9ac85575 (__start_orc_unwind+0x65f81/0x37a91c)
[ 113.030573] ffff8881e87097f8: ffff8881e8709938 (0xffff8881e8709938)
[ 113.030574] ffff8881e8709800: 0000000000000000 (0x0)
[ 113.030597] ffff8881e8709808: ffff88810ebd31c0 (0xffff88810ebd31c0)
[ 113.030598] ffff8881e8709810: 0000000041b58ab3 (0x41b58ab3)
[ 113.030602] ffff8881e8709818: ffffffff99d7a850 (.LC2+0x3be/0x44d)
[ 113.030604] ffff8881e8709820: ffffffff97128e50 (deref_stack_reg+0x160/0x160)
[ 113.030605] ffff8881e8709828: 0000000000000000 (0x0)
[ 113.030606] ffff8881e8709830: 0000000000000000 (0x0)
[ 113.030610] ffff8881e8709838: ffffffff972343bf (kernel_text_address.part.0+0x1f/0xc0)
[ 113.030624] ffff8881e8709840: ffffffffc046fd90 (nvme_irq+0x10/0x20 [nvme])
[ 113.030625] ffff8881e8709848: ffff8881e87098d0 (0xffff8881e87098d0)
[ 113.030628] ffff8881e8709850: ffffffff973994c0 (create_prof_cpu_mask+0x20/0x20)
[ 113.030629] ffff8881e8709858: ffff8881e8709908 (0xffff8881e8709908)
[ 113.030630] ffff8881e8709860: ffff8881e8709938 (0xffff8881e8709938)
[ 113.030631] ffff8881e8709868: 0000000000000000 (0x0)
[ 113.030632] ffff8881e8709870: ffff88810ebd31c0 (0xffff88810ebd31c0)
[ 113.030633] ffff8881e8709878: 0000000000000006 (0x6)
[ 113.030636] ffff8881e8709880: ffffffff9709925c (arch_stack_walk+0x6c/0xb0)
[ 113.030637] ffff8881e8709888: 0000000000000000 (0x0)
[ 113.030638] ffff8881e8709890: ffff88814c428000 (0xffff88814c428000)
[ 113.030640] ffff8881e8709898: ffff88814c430000 (0xffff88814c430000)
[ 113.030641] ffff8881e87098a0: 0000000000000000 (0x0)
[ 113.030642] ffff8881e87098a8: 0000000000000006 (0x6)
[ 113.030643] ffff8881e87098b0: ffff88810ebd31c0 (0xffff88810ebd31c0)
[ 113.030644] ffff8881e87098b8: 0000010000000000 (0x10000000000)
[ 113.030645] ffff8881e87098c0: 0000000000000000 (0x0)
[ 113.030646] ffff8881e87098c8: 0000000000000000 (0x0)
[ 113.030648] ffff8881e87098d0: ffffffff99200c5e (asm_common_interrupt+0x1e/0x40)
[ 113.030649] ffff8881e87098d8: ffff88814c42fed0 (0xffff88814c42fed0)
[ 113.030650] ffff8881e87098e0: 0000000000000000 (0x0)
[ 113.030651] ffff8881e87098e8: ffffed103d0e1323 (0xffffed103d0e1323)
[ 113.030652] ffff8881e87098f0: 0000000000000800 (0x800)
[ 113.030653] ffff8881e87098f8: ffff88810b48a780 (0xffff88810b48a780)
[ 113.030654] ffff8881e8709900: ffff8881e8709c48 (0xffff8881e8709c48)
[ 113.030658] ffff8881e8709908: ffffffff97862595 (kmem_cache_free+0xf5/0x590)
[ 113.030660] ffff8881e8709910: ffffffff973996a1 (stack_trace_save+0x81/0xa0)
[ 113.030661] ffff8881e8709918: 0000000041b58ab3 (0x41b58ab3)
[ 113.030663] ffff8881e8709920: ffffffff99d89ef4 (.LC4+0x20f/0x14aa1)
[ 113.030665] ffff8881e8709928: ffffffff97399620 (stack_trace_consume_entry+0x160/0x160)
[ 113.030666] ffff8881e8709930: ffff8881098e9800 (0xffff8881098e9800)
[ 113.030667] ffff8881e8709938: ffff8881e8709988 (0xffff8881e8709988)
[ 113.030668] ffff8881e8709940: 0000000000000040 (0x40)
[ 113.030669] ffff8881e8709948: 0000000000000015 (0x15)
[ 113.030670] ffff8881e8709950: ffff88811858d348 (0xffff88811858d348)
[ 113.030672] ffff8881e8709958: ffffffff97862595 (kmem_cache_free+0xf5/0x590)
[ 113.030673] ffff8881e8709960: 0000000000000800 (0x800)
[ 113.030674] ffff8881e8709968: ffff88810b48a780 (0xffff88810b48a780)
[ 113.030675] ffff8881e8709970: ffff8881e8709c48 (0xffff8881e8709c48)
[ 113.030676] ffff8881e8709978: ffff88811858c008 (0xffff88811858c008)
[ 113.030678] ffff8881e8709980: ffffffff978649db (kasan_save_stack+0x1b/0x40)
[ 113.030680] ffff8881e8709988: ffffffff978649db (kasan_save_stack+0x1b/0x40)
[ 113.030681] ffff8881e8709990: ffffffff97864a1c (kasan_set_track+0x1c/0x30)
[ 113.030684] ffff8881e8709998: ffffffff97866f5b (kasan_set_free_info+0x1b/0x30)
[ 113.030685] ffff8881e87099a0: ffffffff97864980 (__kasan_slab_free+0x110/0x150)
[ 113.030687] ffff8881e87099a8: ffffffff9785b4da (slab_free_freelist_hook+0x5a/0x170)
[ 113.030689] ffff8881e87099b0: ffffffff97862595 (kmem_cache_free+0xf5/0x590)
[ 113.030692] ffff8881e87099b8: ffffffff988f4da0 (dec_pending+0x1f0/0x900)
[ 113.030694] ffff8881e87099c0: ffffffff988fad15 (clone_endio+0x1e5/0x940)
[ 113.030697] ffff8881e87099c8: ffffffff97eac8f6 (blk_update_request+0x716/0xe20)
[ 113.030699] ffff8881e87099d0: ffffffff97ed523b (blk_mq_end_request+0x4b/0x480)
[ 113.030701] ffff8881e87099d8: ffffffffc046e863 (nvme_process_cq+0x563/0xa40 [nvme])
[ 113.030703] ffff8881e87099e0: ffffffffc046fd90 (nvme_irq+0x10/0x20 [nvme])
[ 113.030706] ffff8881e87099e8: ffffffff9733ef52 (__handle_irq_event_percpu+0x252/0x620)
[ 113.030707] ffff8881e87099f0: ffffffff9733f51f (handle_irq_event+0xef/0x240)
[ 113.030709] ffff8881e87099f8: ffffffff9734bff6 (handle_edge_irq+0x1f6/0xb60)
[ 113.030712] ffff8881e8709a00: ffffffff992010b2 (asm_call_irq_on_stack+0x12/0x20)
[ 113.030714] ffff8881e8709a08: ffffffff9918f05b (common_interrupt+0xeb/0x190)
[ 113.030716] ffff8881e8709a10: ffffffff99200c5e (asm_common_interrupt+0x1e/0x40)
[ 113.030717] ffff8881e8709a18: ffffffff97395f70 (exit_to_user_mode_prepare+0xc0/0x1d0)
[ 113.030719] ffff8881e8709a20: ffffffff99200c48 (asm_common_interrupt+0x8/0x40)
[ 113.030721] ffff8881e8709a28: ffffffff99200c5e (asm_common_interrupt+0x1e/0x40)
[ 113.030723] ffff8881e8709a30: ffffffff991a91e0 (schedule+0xd0/0x270)
[ 113.030724] ffff8881e8709a38: ffffffff97395fb7 (exit_to_user_mode_prepare+0x107/0x1d0)
[ 113.030725] ffff8881e8709a40: 1ffff1103d0e134e (0x1ffff1103d0e134e)
[ 113.030727] ffff8881e8709a48: ffff8881e8733dc8 (0xffff8881e8733dc8)
[ 113.030728] ffff8881e8709a50: ffff8881e8733d40 (0xffff8881e8733d40)
[ 113.030729] ffff8881e8709a58: 00000019d79785f8 (0x19d79785f8)
[ 113.030730] ffff8881e8709a60: 0000000000000005 (0x5)
[ 113.030731] ffff8881e8709a68: 0000000000000000 (0x0)
[ 113.030732] ffff8881e8709a70: ffff88810b579980 (0xffff88810b579980)
[ 113.030733] ffff8881e8709a78: ffff88810b579800 (0xffff88810b579800)
[ 113.030736] ffff8881e8709a80: ffffffff97286cda (update_load_avg+0x1fa/0x1ad0)
[ 113.030737] ffff8881e8709a88: ffff88810b57a000 (0xffff88810b57a000)
[ 113.030738] ffff8881e8709a90: ffff88810b579800 (0xffff88810b579800)
[ 113.030739] ffff8881e8709a98: 0000000000000400 (0x400)
[ 113.030741] ffff8881e8709aa0: ffffffff9729be38 (update_cfs_group+0x118/0x290)
[ 113.030743] ffff8881e8709aa8: ffffffff973146a0 (lock_downgrade+0x6a0/0x6a0)
[ 113.030744] ffff8881e8709ab0: 0000000000000000 (0x0)
[ 113.030745] ffff8881e8709ab8: ffff88810b579850 (0xffff88810b579850)
[ 113.030746] ffff8881e8709ac0: ffff8881e8733d68 (0xffff8881e8733d68)
[ 113.030747] ffff8881e8709ac8: ffff8881e8733d40 (0xffff8881e8733d40)
[ 113.030748] ffff8881e8709ad0: 0000000000000001 (0x1)
[ 113.030749] ffff8881e8709ad8: ffff88810b579810 (0xffff88810b579810)
[ 113.030751] ffff8881e8709ae0: ffffffff972a14b2 (enqueue_entity+0x402/0x2ba0)
[ 113.030752] ffff8881e8709ae8: ffff8881e8733d50 (0xffff8881e8733d50)
[ 113.030753] ffff8881e8709af0: ffff88810b579800 (0xffff88810b579800)
[ 113.030754] ffff8881e8709af8: ffff8881e8733d80 (0xffff8881e8733d80)
[ 113.030755] ffff8881e8709b00: ffff888116873268 (0xffff888116873268)
[ 113.030757] ffff8881e8709b08: ffff8881e8734790 (0xffff8881e8734790)
[ 113.030758] ffff8881e8709b10: ffff88810ebd3268 (0xffff88810ebd3268)
[ 113.030759] ffff8881e8709b18: ffff8881e8720e80 (0xffff8881e8720e80)
[ 113.030760] ffff8881e8709b20: ffff88810b579800 (0xffff88810b579800)
[ 113.030761] ffff8881e8709b28: ffff88810b579950 (0xffff88810b579950)
[ 113.030762] ffff8881e8709b30: 0000000000000009 (0x9)
[ 113.030763] ffff8881e8709b38: ffff8881e8733c80 (0xffff8881e8733c80)
[ 113.030764] ffff8881e8709b40: ffff88810ebd31c0 (0xffff88810ebd31c0)
[ 113.030765] ffff8881e8709b48: 0000000000000002 (0x2)
[ 113.030768] ffff8881e8709b50: ffffffff97263aef (resched_curr+0x17f/0x1e0)
[ 113.030769] ffff8881e8709b58: ffff8881098e9000 (0xffff8881098e9000)
[ 113.030770] ffff8881e8709b60: ffff8881098e9038 (0xffff8881098e9038)
[ 113.030771] ffff8881e8709b68: 000000000056e125 (0x56e125)
[ 113.030772] ffff8881e8709b70: 0000000000000002 (0x2)
[ 113.030773] ffff8881e8709b78: 1ffff1103d0e1374 (0x1ffff1103d0e1374)
[ 113.030774] ffff8881e8709b80: ffff888116873e20 (0xffff888116873e20)
[ 113.030775] ffff8881e8709b88: ffff88811858c000 (0xffff88811858c000)
[ 113.030777] ffff8881e8709b90: ffffffff97864a1c (kasan_set_track+0x1c/0x30)
[ 113.030778] ffff8881e8709b98: 1ffff110230b1800 (0x1ffff110230b1800)
[ 113.030780] ffff8881e8709ba0: ffffffff97866f5b (kasan_set_free_info+0x1b/0x30)
[ 113.030781] ffff8881e8709ba8: 0000000000000001 (0x1)
[ 113.030782] ffff8881e8709bb0: ffffffff97864980 (__kasan_slab_free+0x110/0x150)
[ 113.030783] ffff8881e8709bb8: ffff88810b48a780 (0xffff88810b48a780)
[ 113.030785] ffff8881e8709bc0: ffff88811858c000 (0xffff88811858c000)
[ 113.030786] ffff8881e8709bc8: 3b9c1e2c25148a56 (0x3b9c1e2c25148a56)
[ 113.030787] ffff8881e8709bd0: ffff88811858c000 (0xffff88811858c000)
[ 113.030788] ffff8881e8709bd8: ffffffff9785b4da (slab_free_freelist_hook+0x5a/0x170)
[ 113.030789] ffff8881e8709be0: ffff8881e8709c50 (0xffff8881e8709c50)
[ 113.030790] ffff8881e8709be8: ffff88819858c000 (0xffff88819858c000)
[ 113.030792] ffff8881e8709bf0: ffff88810b48a780 (0xffff88810b48a780)
[ 113.030793] ffff8881e8709bf8: ffffea0004616300 (0xffffea0004616300)
[ 113.030794] ffff8881e8709c00: ffff88811858c000 (0xffff88811858c000)
[ 113.030795] ffff8881e8709c08: 0000000000000000 (0x0)
[ 113.030796] ffff8881e8709c10: ffff8881059402a0 (0xffff8881059402a0)
[ 113.030797] ffff8881e8709c18: ffffffff97862595 (kmem_cache_free+0xf5/0x590)
[ 113.030799] ffff8881e8709c20: ffff888114d87038 (0xffff888114d87038)
[ 113.030799] ffff8881e8709c28: 0000000000000000 (0x0)
[ 113.030801] ffff8881e8709c30: ffffffff988f4da0 (dec_pending+0x1f0/0x900)
[ 113.030802] ffff8881e8709c38: ffff8881059403b8 (0xffff8881059403b8)
[ 113.030803] ffff8881e8709c40: ffff88811858c0a8 (0xffff88811858c0a8)
[ 113.030804] ffff8881e8709c48: 0000000000000000 (0x0)
[ 113.030805] ffff8881e8709c50: 0000000000000000 (0x0)
[ 113.030806] ffff8881e8709c58: ffff88811858c000 (0xffff88811858c000)
[ 113.030807] ffff8881e8709c60: ffff88814c427ab8 (0xffff88814c427ab8)
[ 113.030808] ffff8881e8709c68: ffff888105940000 (0xffff888105940000)
[ 113.030809] ffff8881e8709c70: 0000000000000000 (0x0)
[ 113.030810] ffff8881e8709c78: ffff88814c427ac8 (0xffff88814c427ac8)
[ 113.030811] ffff8881e8709c80: ffff8881059402a0 (0xffff8881059402a0)
[ 113.030813] ffff8881e8709c88: ffffffff988f4da0 (dec_pending+0x1f0/0x900)
[ 113.030814] ffff8881e8709c90: ffff888105940000 (0xffff888105940000)
[ 113.030815] ffff8881e8709c98: 00000000fffd21a8 (0xfffd21a8)
[ 113.030816] ffff8881e8709ca0: 0000000400000000 (0x400000000)
[ 113.030817] ffff8881e8709ca8: ffff88814c427ac0 (0xffff88814c427ac0)
[ 113.030818] ffff8881e8709cb0: 1ffff1103d0e139c (0x1ffff1103d0e139c)
[ 113.030819] ffff8881e8709cb8: 1ffff1103d0e13a0 (0x1ffff1103d0e13a0)
[ 113.030820] ffff8881e8709cc0: ffff88811858c0a8 (0xffff88811858c0a8)
[ 113.030822] ffff8881e8709cc8: ffff88814c427ab8 (0xffff88814c427ab8)
[ 113.030823] ffff8881e8709cd0: ffff88811858c000 (0xffff88811858c000)
[ 113.030824] ffff8881e8709cd8: 0000000000000000 (0x0)
[ 113.030825] ffff8881e8709ce0: ffff88811858c088 (0xffff88811858c088)
[ 113.030826] ffff8881e8709ce8: ffffffff988fad15 (clone_endio+0x1e5/0x940)
[ 113.030827] ffff8881e8709cf0: ffff888102a26c38 (0xffff888102a26c38)
[ 113.030828] ffff8881e8709cf8: ffff88811858c0b0 (0xffff88811858c0b0)
[ 113.030829] ffff8881e8709d00: 0000000041b58ab3 (0x41b58ab3)
[ 113.030831] ffff8881e8709d08: ffffffff99e07e8f (.LC0+0x31f92/0x36703)
[ 113.030833] ffff8881e8709d10: ffffffff988fab30 (disable_discard+0xd0/0xd0)
[ 113.030834] ffff8881e8709d18: ffff88811858c0a8 (0xffff88811858c0a8)
[ 113.030835] ffff8881e8709d20: ffff88811858c000 (0xffff88811858c000)
[ 113.030836] ffff8881e8709d28: ffff88811858c0e0 (0xffff88811858c0e0)
[ 113.030837] ffff8881e8709d30: ffff88811858c0b0 (0xffff88811858c0b0)
[ 113.030838] ffff8881e8709d38: ffff88811a753300 (0xffff88811a753300)
[ 113.030840] ffff8881e8709d40: dffffc0000000000 (0xdffffc0000000000)
[ 113.030841] ffff8881e8709d48: ffff88811858c0a8 (0xffff88811858c0a8)
[ 113.030842] ffff8881e8709d50: ffff88811a75331c (0xffff88811a75331c)
[ 113.030843] ffff8881e8709d58: 0000000000001000 (0x1000)
[ 113.030844] ffff8881e8709d60: 0000000000001000 (0x1000)
[ 113.030845] ffff8881e8709d68: ffff88811a753300 (0xffff88811a753300)
[ 113.030846] ffff8881e8709d70: ffffffff97eac8f6 (blk_update_request+0x716/0xe20)
[ 113.030847] ffff8881e8709d78: ffffffff00000000 (0xffffffff00000000)
[ 113.030848] ffff8881e8709d80: 0000000000000000 (0x0)
[ 113.030849] ffff8881e8709d88: ffffed10234ea667 (0xffffed10234ea667)
[ 113.030851] ffff8881e8709d90: ffff88811858c0d0 (0xffff88811858c0d0)
[ 113.030852] ffff8881e8709d98: ffffed10234ea663 (0xffffed10234ea663)
[ 113.030853] ffff8881e8709da0: ffff88811a753318 (0xffff88811a753318)
[ 113.030854] ffff8881e8709da8: ffff88811a753338 (0xffff88811a753338)
[ 113.030855] ffff8881e8709db0: ffff888111788800 (0xffff888111788800)
[ 113.030856] ffff8881e8709db8: dffffc0000000000 (0xdffffc0000000000)
[ 113.030857] ffff8881e8709dc0: ffff88811a753300 (0xffff88811a753300)
[ 113.030858] ffff8881e8709dc8: 0000000000000000 (0x0)
[ 113.030859] ffff8881e8709dd0: ffff88811a753300 (0xffff88811a753300)
[ 113.030860] ffff8881e8709dd8: ffff888116c828c0 (0xffff888116c828c0)
[ 113.030861] ffff8881e8709de0: ffffed1020316ea8 (0xffffed1020316ea8)
[ 113.030863] ffff8881e8709de8: ffffffff97ed523b (blk_mq_end_request+0x4b/0x480)
[ 113.030864] ffff8881e8709df0: dffffc0000000000 (0xdffffc0000000000)
[ 113.030865] ffff8881e8709df8: dffffc0000000000 (0xdffffc0000000000)
[ 113.030866] ffff8881e8709e00: ffff8881018b7480 (0xffff8881018b7480)
[ 113.030867] ffff8881e8709e08: 0000000000000000 (0x0)
[ 113.030868] ffff8881e8709e10: ffff88811a753300 (0xffff88811a753300)
[ 113.030869] ffff8881e8709e18: ffff888116c828c0 (0xffff888116c828c0)
[ 113.030871] ffff8881e8709e20: ffffffffc046e863 (nvme_process_cq+0x563/0xa40 [nvme])
[ 113.030872] ffff8881e8709e28: ffffed1020316ead (0xffffed1020316ead)
[ 113.030873] ffff8881e8709e30: ffff888116c828cc (0xffff888116c828cc)
[ 113.030874] ffff8881e8709e38: 0000000197314fd1 (0x197314fd1)
[ 113.030875] ffff8881e8709e40: ffff8881018b756c (0xffff8881018b756c)
[ 113.030876] ffff8881e8709e48: ffff8881018b7560 (0xffff8881018b7560)
[ 113.030878] ffff8881e8709e50: ffff8881018b756a (0xffff8881018b756a)
[ 113.030879] ffff8881e8709e58: ffff88810e3c98c0 (0xffff88810e3c98c0)
[ 113.030880] ffff8881e8709e60: ffff8881018b7540 (0xffff8881018b7540)
[ 113.030881] ffff8881e8709e68: ffff8881018b7568 (0xffff8881018b7568)
[ 113.030883] ffff8881e8709e70: ffffffff9733f50a (handle_irq_event+0xda/0x240)
[ 113.030884] ffff8881e8709e78: dffffc0000000000 (0xdffffc0000000000)
[ 113.030886] ffff8881e8709e80: ffffffffc046fd80 (nvme_suspend+0x330/0x330 [nvme])
[ 113.030887] ffff8881e8709e88: ffff8881e8709f48 (0xffff8881e8709f48)
[ 113.030888] ffff8881e8709e90: ffff88810e3c9800 (0xffff88810e3c9800)
[ 113.030889] ffff8881e8709e98: ffff8881152e7f00 (0xffff8881152e7f00)
[ 113.030890] ffff8881e8709ea0: 000000000000001c (0x1c)
[ 113.030892] ffff8881e8709ea8: ffffffffc046fd90 (nvme_irq+0x10/0x20 [nvme])
[ 113.030894] ffff8881e8709eb0: ffffffff9733ef52 (__handle_irq_event_percpu+0x252/0x620)
[ 113.030895] ffff8881e8709eb8: 0000000000000002 (0x2)
[ 113.030896] ffff8881e8709ec0: 0000000000000000 (0x0)
[ 113.030897] ffff8881e8709ec8: ffffed1021c7930f (0xffffed1021c7930f)
[ 113.030898] ffff8881e8709ed0: ffff88810ebd3eb8 (0xffff88810ebd3eb8)
[ 113.030899] ffff8881e8709ed8: ffff88810ebd31c0 (0xffff88810ebd31c0)
[ 113.030900] ffff8881e8709ee0: ffff88810e3c9878 (0xffff88810e3c9878)
[ 113.030901] ffff8881e8709ee8: ffff88810e3c9800 (0xffff88810e3c9800)
[ 113.030902] ffff8881e8709ef0: 1ffff1103d0e13e5 (0x1ffff1103d0e13e5)
[ 113.030903] ffff8881e8709ef8: ffff88810e3c9800 (0xffff88810e3c9800)
[ 113.030904] ffff8881e8709f00: ffff88810e3c9838 (0xffff88810e3c9838)
[ 113.030906] ffff8881e8709f08: ffff88810e3c98a8 (0xffff88810e3c98a8)
[ 113.030907] ffff8881e8709f10: ffff88810e3c9800 (0xffff88810e3c9800)
[ 113.030908] ffff8881e8709f18: ffffffff9733f51f (handle_irq_event+0xef/0x240)
[ 113.030909] ffff8881e8709f20: 0000000000000002 (0x2)
[ 113.030910] ffff8881e8709f28: 0000000041b58ab3 (0x41b58ab3)
[ 113.030912] ffff8881e8709f30: ffffffff99d88063 (.LC0+0x10d5/0x2d57)
[ 113.030914] ffff8881e8709f38: ffffffff9733f430 (handle_irq_event_percpu+0x110/0x110)
[ 113.030915] ffff8881e8709f40: 0000000000000000 (0x0)
[ 113.030916] ffff8881e8709f48: 0000000000000000 (0x0)
[ 113.030917] ffff8881e8709f50: ffff88810e3c98a8 (0xffff88810e3c98a8)
[ 113.030918] ffff8881e8709f58: ffff88810e3c987c (0xffff88810e3c987c)
[ 113.030919] ffff8881e8709f60: 0000000000000000 (0x0)
[ 113.030920] ffff8881e8709f68: ffffed1021c7930e (0xffffed1021c7930e)
[ 113.030921] ffff8881e8709f70: ffff88810e3c9838 (0xffff88810e3c9838)
[ 113.030922] ffff8881e8709f78: ffff88810e3c987c (0xffff88810e3c987c)
[ 113.030923] ffff8881e8709f80: ffffed1021c7930f (0xffffed1021c7930f)
[ 113.030925] ffff8881e8709f88: dffffc0000000000 (0xdffffc0000000000)
[ 113.030926] ffff8881e8709f90: ffffffff9734bff6 (handle_edge_irq+0x1f6/0xb60)
[ 113.030927] ffff8881e8709f98: ffff88810e3c98a8 (0xffff88810e3c98a8)
[ 113.030928] ffff8881e8709fa0: ffff88810e3c9870 (0xffff88810e3c9870)
[ 113.030929] ffff8881e8709fa8: ffff88810e3c9840 (0xffff88810e3c9840)
[ 113.030930] ffff8881e8709fb0: ffff88810e3c9828 (0xffff88810e3c9828)
[ 113.030931] ffff8881e8709fb8: 0000000000000000 (0x0)
[ 113.030932] ffff8881e8709fc0: 0000000000000023 (0x23)
[ 113.030933] ffff8881e8709fc8: ffff88814c42fe30 (0xffff88814c42fe30)
[ 113.030934] ffff8881e8709fd0: ffff88810e3c9800 (0xffff88810e3c9800)
[ 113.030935] ffff8881e8709fd8: 0000000000000000 (0x0)
[ 113.030936] ffff8881e8709fe0: 0000000000000000 (0x0)
[ 113.030937] ffff8881e8709fe8: 0000000000000023 (0x23)
[ 113.030939] ffff8881e8709ff0: ffffffff992010b2 (asm_call_irq_on_stack+0x12/0x20)
[ 113.030940] ffff8881e8709ff8: ffff88814c42fe30 (0xffff88814c42fe30)
[ 113.030942] ffff88814c42fe30: ffff88814c42fe78 (0xffff88814c42fe78)
[ 113.030943] ffff88814c42fe38: ffffffff9918f05b (common_interrupt+0xeb/0x190)
[ 113.030944] ffff88814c42fe40: 0000000000000000 (0x0)
[ 113.030945] ffff88814c42fe48: 0000000000000000 (0x0)
[ 113.030946] ffff88814c42fe50: 0000000000000000 (0x0)
[ 113.030947] ffff88814c42fe58: 0000000000000000 (0x0)
[ 113.030948] ffff88814c42fe60: 0000000000000000 (0x0)
[ 113.030949] ffff88814c42fe68: 0000000000000000 (0x0)
[ 113.030950] ffff88814c42fe70: ffffffff99200c5e (asm_common_interrupt+0x1e/0x40)
[ 113.030951] ffff88814c42fe78: 0000000000000000 (0x0)
[ 113.030952] ffff88814c42fe80: 0000000000000000 (0x0)
[ 113.030953] ffff88814c42fe88: 0000000000000000 (0x0)
[ 113.030954] ffff88814c42fe90: ffff88810ebd31c0 (0xffff88810ebd31c0)
[ 113.030955] ffff88814c42fe98: ffff88814c42ff58 (0xffff88814c42ff58)
[ 113.030956] ffff88814c42fea0: 0000000000000000 (0x0)
[ 113.030957] ffff88814c42fea8: 0000000000000001 (0x1)
[ 113.030958] ffff88814c42feb0: ffffed1021d7a638 (0xffffed1021d7a638)
[ 113.030959] ffff88814c42feb8: ffff88810ebd31c7 (0xffff88810ebd31c7)
[ 113.030960] ffff88814c42fec0: 0000000000000000 (0x0)
[ 113.030961] ffff88814c42fec8: 0000000000400140 (0x400140)
[ 113.030963] ffff88814c42fed0: ffffffff991a91f4 (schedule+0xe4/0x270)
[ 113.030963] ffff88814c42fed8: 0000000000000000 (0x0)
[ 113.030964] ffff88814c42fee0: 0000000000000008 (0x8)
[ 113.030965] ffff88814c42fee8: ffff88810ebd31c0 (0xffff88810ebd31c0)
[ 113.030967] ffff88814c42fef0: ffffffffffffffff (0xffffffffffffffff)
[ 113.030968] ffff88814c42fef8: ffffffff97395f70 (exit_to_user_mode_prepare+0xc0/0x1d0)
[ 113.030969] ffff88814c42ff00: 0000000000000010 (0x10)
[ 113.030970] ffff88814c42ff08: 0000000000000246 (0x246)
[ 113.030971] ffff88814c42ff10: ffff88814c42ff28 (0xffff88814c42ff28)
[ 113.030972] ffff88814c42ff18: 0000000000000018 (0x18)
[ 113.030973] ffff88814c42ff20: 0000000000000000 (0x0)
[ 113.030974] ffff88814c42ff28: 0000000000000246 (0x246)
[ 113.030975] ffff88814c42ff30: 0000000000000000 (0x0)
[ 113.030976] ffff88814c42ff38: 0000000000000000 (0x0)
[ 113.030978] ffff88814c42ff40: ffffffff99200c48 (asm_common_interrupt+0x8/0x40)
[ 113.030980] ffff88814c42ff48: ffffffff99191c95 (irqentry_exit_to_user_mode+0x5/0x40)
[ 113.030981] ffff88814c42ff50: ffffffff99200c5e (asm_common_interrupt+0x1e/0x40)
[ 113.030982] ffff88814c42ff58: 0000000000aaa828 (0xaaa828)
[ 113.030983] ffff88814c42ff60: 0000000000001000 (0x1000)
[ 113.030984] ffff88814c42ff68: 0000000000000000 (0x0)
[ 113.030985] ffff88814c42ff70: 00007f02266d1460 (0x7f02266d1460)
[ 113.030986] ffff88814c42ff78: 0000000000aaa800 (0xaaa800)
[ 113.030987] ffff88814c42ff80: 0000000000aaa800 (0xaaa800)
[ 113.030988] ffff88814c42ff88: 0000000000000293 (0x293)
[ 113.030989] ffff88814c42ff90: 0000000000b59000 (0xb59000)
[ 113.030990] ffff88814c42ff98: 00007ffdf02c4090 (0x7ffdf02c4090)
[ 113.030991] ffff88814c42ffa0: 0000000000000000 (0x0)
[ 113.030992] ffff88814c42ffa8: 0000000000001000 (0x1000)
[ 113.030994] ffff88814c42ffb0: 00007f028a34124f (0x7f028a34124f)
[ 113.030995] ffff88814c42ffb8: 0000000000001000 (0x1000)
[ 113.030995] ffff88814c42ffc0: 0000000000abd000 (0xabd000)
[ 113.030996] ffff88814c42ffc8: 0000000000000007 (0x7)
[ 113.030998] ffff88814c42ffd0: ffffffffffffffff (0xffffffffffffffff)
[ 113.030999] ffff88814c42ffd8: 00007f028a34124f (0x7f028a34124f)
[ 113.031000] ffff88814c42ffe0: 0000000000000033 (0x33)
[ 113.031000] ffff88814c42ffe8: 0000000000000293 (0x293)
[ 113.031002] ffff88814c42fff0: 00007ffdf01f96d0 (0x7ffdf01f96d0)
[ 113.031002] ffff88814c42fff8: 000000000000002b (0x2b)
[ 115.105359] smpboot: CPU 3 is now offline
[ 115.359319] smpboot: CPU 2 is now offline
...

--
Best Regards,
Shin'ichiro Kawasaki

2020-11-11 17:08:11

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Fri, Nov 06, 2020 at 06:04:15AM +0000, Shinichiro Kawasaki wrote:
> Greetings,
>
> I observe "WARNING: can't access registers at asm_common_interrupt+0x1e/0x40"
> in my kernel test system repeatedly, which is printed by unwind_next_frame() in
> "arch/x86/kernel/unwind_orc.c". Syzbot already reported that [1]. Similar
> warning was reported and discussed [2], but I suppose the cause is not yet
> clarified.
>
> The warning was observed with v5.10-rc2 and older tags. I bisected and found
> that the commit 044d0d6de9f5 ("lockdep: Only trace IRQ edges") in v5.9-rc3
> triggered the warning. Reverting that from 5.10-rc2, the warning disappeared.
> May I ask comment by expertise on CC how this commit can relate to the warning?
>
> The test condition to reproduce the warning is rather unique (blktests,
> dm-linear and ZNS device emulation by QEMU). If any action is suggested for
> further analysis, I'm willing to take it with my test system.
>
> Wish this report helps.
>
> [1] https://lkml.org/lkml/2020/9/6/231
> [2] https://lkml.org/lkml/2020/9/8/1538

Shin'ichiro,

Thanks for all the data. It looks like the ORC unwinder is getting
confused by paravirt patching (with runtime-patched pushf/pop changing
the stack layout).

<user interrupt>
exit_to_user_mode_prepare()
exit_to_user_mode_loop()
local_irq_disable_exit_to_user()
local_irq_disable()
raw_irqs_disabled()
arch_irqs_disabled()
arch_local_save_flags()
pushfq
<another interrupt>

Objtool doesn't know about the pushf/pop paravirt patch, so ORC gets
confused by the changed stack layout.
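In user-space terms, the failure mode reads like this (a toy model, not kernel code; the function names and the example values are invented for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* ORC metadata for an address says: the return address lives `slots`
 * 8-byte words above the current stack pointer. */
static uint64_t unwind_retaddr(const uint64_t *sp, size_t slots)
{
	return sp[slots];
}

/* After the patched-in PUSHF, %rsp points at the saved RFLAGS, so ORC
 * data generated for the unpatched CALL site (slots == 0) reads the
 * flags word instead of the return address. */
static int stale_orc_misses_retaddr(void)
{
	uint64_t stack[2] = { 0x246, 0xffffffff99200c5eULL }; /* RFLAGS, retaddr */

	return unwind_retaddr(stack, 1) == 0xffffffff99200c5eULL &&
	       unwind_retaddr(stack, 0) != 0xffffffff99200c5eULL;
}
```

The metadata offset is correct for one byte sequence and stale for the other, which is exactly what unwind_next_frame() warns about.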

I'm thinking we either need to teach objtool how to deal with
save_fl/restore_fl patches, or we need to just get rid of those nasty
patches somehow. Peter, any thoughts?

It looks like 044d0d6de9f5 ("lockdep: Only trace IRQ edges") is making
the problem more likely, by adding the irqs_disabled() check for every
local_irq_disable().

Also - Peter, Nicholas - is that irqs_disabled() check really necessary
in local_irq_disable()? Presumably irqs would typically be enabled
before calling it?

--
Josh

2020-11-11 17:51:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 11:05:36AM -0600, Josh Poimboeuf wrote:
> On Fri, Nov 06, 2020 at 06:04:15AM +0000, Shinichiro Kawasaki wrote:
> > [...]
>
> Shin'ichiro,
>
> Thanks for all the data. It looks like the ORC unwinder is getting
> confused by paravirt patching (with runtime-patched pushf/pop changing
> the stack layout).
>
> <user interrupt>
> exit_to_user_mode_prepare()
> exit_to_user_mode_loop()
> local_irq_disable_exit_to_user()
> local_irq_disable()
> raw_irqs_disabled()
> arch_irqs_disabled()
> arch_local_save_flags()
> pushfq
> <another interrupt>

This is PARAVIRT_XXL only, which is a Xen special. My preference, as
always, is to kill it... Sadly the Xen people have a different opinion.

> Objtool doesn't know about the pushf/pop paravirt patch, so ORC gets
> confused by the changed stack layout.
>
> I'm thinking we either need to teach objtool how to deal with
> save_fl/restore_fl patches, or we need to just get rid of those nasty
> patches somehow. Peter, any thoughts?

Don't use Xen? ;-)

So with PARAVIRT_XXL the compiler will emit something like:

"CALL *pvops.save_fl"

Which we then overwrite at runtime with "pushf; pop %[re]ax" and a few
NOPs.

Now, objtool understands alternatives and ensures they have the same
stack layout, but it has no chance in hell of understanding this, simply
because paravirt_patch.c is magic.
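As a sketch, the boot-time rewrite amounts to a memcpy over the call site (the opcode bytes are real x86 encodings; the function name, SITE_LEN, and the site layout are made up for illustration):

```c
#include <string.h>

#define SITE_LEN 7	/* size of the "CALL *pv_ops.irq.save_fl" site */

/* Overwrite the call site with the native sequence and NOP-pad the rest. */
static void patch_native_save_fl(unsigned char *site)
{
	static const unsigned char native[] = { 0x9c, 0x58 }; /* pushf; pop %rax */

	memcpy(site, native, sizeof(native));
	memset(site + sizeof(native), 0x90, SITE_LEN - sizeof(native)); /* nops */
}

static int patched_ok(void)
{
	/* ff 14 25 xx xx xx xx = CALL *mem, 7 bytes */
	unsigned char site[SITE_LEN] = { 0xff, 0x14, 0x25, 0, 0, 0, 0 };

	patch_native_save_fl(site);
	return site[0] == 0x9c && site[1] == 0x58 && site[6] == 0x90;
}
```

Objtool only ever sees the pre-patch CALL form, so the ORC data it emits describes a stack layout the patched bytes no longer produce.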

I don't have any immediate clever ideas, but let me ponder it a wee bit.

....

Something really disgusting we could do is recognise the indirect call
offset and emit an extra ORC entry for RIP+1. So the cases are:

CALL *pv_ops.save_fl -- 7 bytes IIRC
CALL $imm; -- 5 bytes
PUSHF; POP %[RE]AX -- 2 bytes

so the RIP+1 (the POP insn) will only ever exist in this case. The
indirect and direct call cases would never land on that IP.

....


> It looks like 044d0d6de9f5 ("lockdep: Only trace IRQ edges") is making
> the problem more likely, by adding the irqs_disabled() check for every
> local_irq_disable().
>
> Also - Peter, Nicholas - is that irqs_disabled() check really necessary
> in local_irq_disable()? Presumably irqs would typically be enabled
> before calling it?

Yeah, so it's all a giant can of worms; also see:

https://lkml.kernel.org/r/[email protected]

The basic idea is to only trace edges, ie. when the hardware state
actually changes. Sadly this means doing a pushf/pop before the cli.
Ideally CLI would store the old IF in CF or something like that, but
alas.
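A minimal user-space model of the edge-only idea (all names invented; this is not the lockdep code):

```c
#include <stdbool.h>

/* Model state: whether IRQs are (modeled as) hardware-disabled, and how
 * many times the tracing hook fired. */
static bool hw_irqs_off;
static int trace_off_events;

static void model_local_irq_disable(void)
{
	/* This state check is the pushf/pop: read the current hardware
	 * state and trace only when it actually changes. */
	if (!hw_irqs_off)
		trace_off_events++;	/* trace_hardirqs_off(), edge only */
	hw_irqs_off = true;		/* the cli itself */
}

static int edges_only(void)
{
	hw_irqs_off = false;
	trace_off_events = 0;
	model_local_irq_disable();	/* edge: traced */
	model_local_irq_disable();	/* already off: not traced */
	return trace_off_events;
}
```

The cost of this scheme is exactly the extra arch_local_save_flags() (pushfq) before every cli, which is the window where the unwinder got confused.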

2020-11-11 18:15:51

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 06:47:36PM +0100, Peter Zijlstra wrote:
> This is PARAVIRT_XXL only, which is a Xen special. My preference, as
> always, is to kill it... Sadly the Xen people have a different opinion.

That would be soooo nice... then we could get rid of paravirt patching
altogether and replace it with static calls.

> > Objtool doesn't know about the pushf/pop paravirt patch, so ORC gets
> > confused by the changed stack layout.
> >
> > I'm thinking we either need to teach objtool how to deal with
> > save_fl/restore_fl patches, or we need to just get rid of those nasty
> > patches somehow. Peter, any thoughts?
>
> Don't use Xen? ;-)
>
> So with PARAVIRT_XXL the compiler will emit something like:
>
> "CALL *pvops.save_fl"
>
> Which we then overwrite at runtime with "pushf; pop %[re]ax" and a few
> NOPs.
>
> Now, objtool understands alternatives, and ensures they have the same
> stack layout, it has no chance in hell of understanding this, simply
> because paravirt_patch.c is magic.
>
> I don't have any immediate clever ideas, but let me ponder it a wee bit.
>
> ....
>
> Something really disgusting we could do is recognise the indirect call
> offset and emit an extra ORC entry for RIP+1. So the cases are:
>
> CALL *pv_ops.save_fl -- 7 bytes IIRC
> CALL $imm; -- 5 bytes
> PUSHF; POP %[RE]AX -- 2 bytes
>
> so the RIP+1 (the POP insn) will only ever exist in this case. The
> indirect and direct call cases would never land on that IP.

I had a similar idea, and a bit of deja vu - we may have talked about
this before. At least I know we talked about doing something similar
for alternatives which muck with the stack.

> > It looks like 044d0d6de9f5 ("lockdep: Only trace IRQ edges") is making
> > the problem more likely, by adding the irqs_disabled() check for every
> > local_irq_disable().
> >
> > Also - Peter, Nicholas - is that irqs_disabled() check really necessary
> in local_irq_disable()? Presumably irqs would typically be enabled
> > before calling it?
>
> Yeah, so it's all a giant can of worms; also see:
>
> https://lkml.kernel.org/r/[email protected]
>
> The basic idea is to only trace edges, ie. when the hardware state
> actually changes. Sadly this means doing a pushf/pop before the cli.
> Ideally CLI would store the old IF in CF or something like that, but
> alas.

Right, that makes sense for save/restore, but is the disabled check
really needed for local_irq_disable()? Wouldn't that always be an edge?

And anyway I don't see a similar check for local_irq_enable().

--
Josh

2020-11-11 19:46:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 06:46:37PM +0000, Andrew Cooper wrote:

> Well...
>
> static_calls are a newer, and more generic, form of pvops. Most of the
> magic is to do with inlining small fragments, but static calls can do
> that now too, IIRC?

If you're referring to this glorious hack:

https://lkml.kernel.org/r/[email protected]

that only 'works' because it's a single instruction. That is,
static_call can only poke single instructions. They cannot replace a
call with "PUSHF; POP" / "PUSH; POPF" for example. They also cannot do
NOP padding for 'short' sequences.

Paravirt, like alternatives, are special in that they only happen once,
before SMP bringup.

> >> Something really disgusting we could do is recognise the indirect call
> >> offset and emit an extra ORC entry for RIP+1. So the cases are:
> >>
> >> CALL *pv_ops.save_fl -- 7 bytes IIRC
> >> CALL $imm; -- 5 bytes
> >> PUSHF; POP %[RE]AX -- 2 bytes
> >>
> >> so the RIP+1 (the POP insn) will only ever exist in this case. The
> >> indirect and direct call cases would never land on that IP.
> > I had a similar idea, and a bit of deja vu - we may have talked about
> > this before. At least I know we talked about doing something similar
> > for alternatives which muck with the stack.

Vague memories... luckily we managed to get alternatives to a state
where they match, which is much saner.

> The main complexity with pvops is that the
>
>     CALL *pv_ops.save_fl
>
> form needs to be usable from extremely early in the day (pre general
> patching), hence the use of function pointers and some non-standard ABIs.

The performance reasons mentioned below are a large part of the
non-standard ABI (e.g. CALLEE_SAVE).

> For performance reasons, the end result of this pvop wants to be `pushf;
> pop %[re]ax` in then native case, and `call xen_pv_save_fl` in the Xen
> case, but this doesn't mean that the compiled instruction needs to be a
> function pointer to begin with.

Not sure emitting the native code would be feasible.. also
cpu_usergs_sysret64 is 6 bytes.

> Would objtool have an easier time coping if this were implemented in
> terms of a static call?

I doubt it, the big problem is that there is no visibility into the
actual alternative text. Runtime patching fragments into static call
would have the exact same problem.

Something that _might_ maybe work is trying to morph the immediate
fragments into an alternative. That is, instead of this:

static inline notrace unsigned long arch_local_save_flags(void)
{
	return PVOP_CALLEE0(unsigned long, irq.save_fl);
}

Write it something like:

static inline notrace unsigned long arch_local_save_flags(void)
{
	PVOP_CALL_ARGS;
	PVOP_TEST_NULL(irq.save_fl);
	asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
					"PUSHF; POP _ASM_AX",
					X86_FEATURE_NATIVE)
			    : CLBR_RET_REG, ASM_CALL_CONSTRAINT
			    : paravirt_type(irq.save_fl.func),
			      paravirt_clobber(PVOP_CALLEE_CLOBBERS)
			    : "memory", "cc");
	return __eax;
}

And then we have to teach objtool how to deal with conflicting
alternatives...

That would remove most (all, if we can figure out a form that deals with
the spinlock fragments) of paravirt_patch.c

Hmm?

2020-11-11 20:01:21

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 08:42:06PM +0100, Peter Zijlstra wrote:
> > Would objtool have an easier time coping if this were implemented in
> > terms of a static call?
>
> I doubt it, the big problem is that there is no visibility into the
> actual alternative text. Runtime patching fragments into static call
> would have the exact same problem.
>
> Something that _might_ maybe work is trying to morph the immediate
> fragments into an alternative. That is, instead of this:
>
> [...]
>
> Write it something like:
>
> [...]
>
> And then we have to teach objtool how to deal with conflicting
> alternatives...
>
> That would remove most (all, if we can figure out a form that deals with
> the spinlock fragments) of paravirt_patch.c
>
> Hmm?

I was going to suggest something similar. Though I would try to take it
further and replace paravirt_patch_default() with static calls.

Either way it doesn't make objtool's job much easier. But it would be
nice to consolidate runtime patching mechanisms and get rid of
.parainstructions.

--
Josh

2020-11-11 20:12:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 01:59:00PM -0600, Josh Poimboeuf wrote:
> On Wed, Nov 11, 2020 at 08:42:06PM +0100, Peter Zijlstra wrote:
> > > Would objtool have an easier time coping if this were implemented in
> > > terms of a static call?
> >
> > I doubt it, the big problem is that there is no visibility into the
> > actual alternative text. Runtime patching fragments into static call
> > would have the exact same problem.
> >
> > Something that _might_ maybe work is trying to morph the immediate
> > fragments into an alternative. That is, instead of this:
> >
> > [...]
> >
> > Write it something like:
> >
> > [...]
> >
> > And then we have to teach objtool how to deal with conflicting
> > alternatives...
> >
> > That would remove most (all, if we can figure out a form that deals with
> > the spinlock fragments) of paravirt_patch.c
> >
> > Hmm?
>
> I was going to suggest something similar. Though I would try to take it
> further and replace paravirt_patch_default() with static calls.

Possible, we just need to be _really_ careful to not allow changing
those static_call()s. So maybe we need DEFINE_STATIC_CALL_RO() which
does a __ro_after_init on the whole thing.

> Either way it doesn't make objtool's job much easier. But it would be
> nice to consolidate runtime patching mechanisms and get rid of
> .parainstructions.

I think the above (combining alternative and paravirt/static_call) does
make objtool's job easier, since then we at least have the actual
alternative instructions available to inspect, or am I mis-understanding
things?

2020-11-11 20:18:19

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 09:07:30PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 11, 2020 at 01:59:00PM -0600, Josh Poimboeuf wrote:
> > On Wed, Nov 11, 2020 at 08:42:06PM +0100, Peter Zijlstra wrote:
> > > > Would objtool have an easier time coping if this were implemented in
> > > > terms of a static call?
> > >
> > > I doubt it, the big problem is that there is no visibility into the
> > > actual alternative text. Runtime patching fragments into static call
> > > would have the exact same problem.
> > >
> > > Something that _might_ maybe work is trying to morph the immediate
> > > fragments into an alternative. That is, instead of this:
> > >
> > > [...]
> > >
> > > Write it something like:
> > >
> > > [...]
> > >
> > > And then we have to teach objtool how to deal with conflicting
> > > alternatives...
> > >
> > > That would remove most (all, if we can figure out a form that deals with
> > > the spinlock fragments) of paravirt_patch.c
> > >
> > > Hmm?
> >
> > I was going to suggest something similar. Though I would try to take it
> > further and replace paravirt_patch_default() with static calls.
>
> Possible, we just need to be _really_ careful to not allow changing
> those static_call()s. So maybe we need DEFINE_STATIC_CALL_RO() which
> does a __ro_after_init on the whole thing.

But what if you want to live migrate to another hypervisor ;-)

> > Either way it doesn't make objtool's job much easier. But it would be
> > nice to consolidate runtime patching mechanisms and get rid of
> > .parainstructions.
>
> I think the above (combining alternative and paravirt/static_call) does
> make objtool's job easier, since then we at least have the actual
> alternative instructions available to inspect, or am I mis-understanding
> things?

Right, it makes objtool's job a _little_ easier, since it already knows
how to read alternatives. But it still has to learn to deal with the
conflicting stack layouts.

--
Josh

2020-11-11 20:41:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 02:15:06PM -0600, Josh Poimboeuf wrote:
> On Wed, Nov 11, 2020 at 09:07:30PM +0100, Peter Zijlstra wrote:

> > Possible, we just need to be _really_ careful to not allow changing
> > those static_call()s. So maybe we need DEFINE_STATIC_CALL_RO() which
> > does a __ro_after_init on the whole thing.
>
> But what if you want to live migrate to another hypervisor ;-)

Then you get to keep the pieces ;-)

> > > Either way it doesn't make objtool's job much easier. But it would be
> > > nice to consolidate runtime patching mechanisms and get rid of
> > > .parainstructions.
> >
> > I think the above (combining alternative and paravirt/static_call) does
> > make objtool's job easier, since then we at least have the actual
> > alternative instructions available to inspect, or am I mis-understanding
> > things?
>
> Right, it makes objtool's job a _little_ easier, since it already knows
> how to read alternatives. But it still has to learn to deal with the
> conflicting stack layouts.

Right, but at least now it can see the instructions. Which is a lot
better than: `add a magic +1 ORC entry when you see an indirect call to
$magic`.

Anyway, __orc_find(addr) looks for the ORC entry with the highest IP <=
@addr. So if we have:

	alt0			alt1

	0x00 CALL *foo		0x00 PUSHF
	0x07 insn		0x01 POP %rax
				0x02 .nops 5
				0x07 insn

we have ORC entries for alt1.0x00 and alt1.0x01. Then if we hit insn,
it'll find the alt1.0x01 entry, but that had better be the same as the
state at 0x00.

This means that for every alt, we have to unwind using the CFI of every
other alt and match for every instruction. Which should be doable I
think.
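The lookup rule can be sketched as a plain binary search (struct layout and names here are illustrative, not the kernel's actual data structures):

```c
#include <stddef.h>

struct orc_entry {
	unsigned long ip;
	int sp_offset;
};

/* Return the entry with the highest ip <= addr; table is sorted by ip.
 * Returns NULL when addr precedes the whole table. */
static const struct orc_entry *orc_find(const struct orc_entry *table,
					size_t n, unsigned long addr)
{
	const struct orc_entry *found = NULL;
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (table[mid].ip <= addr) {
			found = &table[mid];
			lo = mid + 1;
		} else {
			hi = mid;
		}
	}
	return found;
}

/* Peter's example: entries at alt1 offsets 0x00 and 0x01; a hit on the
 * following insn at 0x07 resolves to the 0x01 entry. */
static unsigned long lookup_for_insn(void)
{
	static const struct orc_entry table[] = { { 0x00, 8 }, { 0x01, 16 } };

	return orc_find(table, 2, 0x07)->ip;
}
```

This is why the state recorded at the last entry before an instruction must match whichever alternative actually ran.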

Certainly more complicated than outright disallowing CFI changes inside
alt groups though :/

2020-11-11 20:42:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 08:25:36PM +0000, Andrew Cooper wrote:

> > Right, it makes objtool's job a _little_ easier, since it already knows
> > how to read alternatives. But it still has to learn to deal with the
> > conflicting stack layouts.
>
> I suppose the needed abstraction is "these blocks will start and end
> with the same stack layout", while allowing the internals to diverge.

It's a little more complicated than that due to the fact that there is
only a single ORC table.

So something like:

	alt0		alt1
	0x00 push	0x00 nop
	0x01 nop	0x01 push

is impossible to correctly unwind.
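The constraint can be stated mechanically: with one shared table, the alternatives must agree on the stack state at every instruction boundary (a sketch; the offset values are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

/* sp[i] = stack offset the unwinder must assume at instruction boundary i.
 * A single shared ORC table requires both alternatives to agree at every
 * boundary, since the table is indexed by IP alone. */
static bool orc_compatible(const int *a, const int *b, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (a[i] != b[i])
			return false;
	return true;
}

static int peters_example_conflicts(void)
{
	/* alt0: push; nop  -> offsets 8, 16, 16
	 * alt1: nop;  push -> offsets 8,  8, 16 */
	int alt0[] = { 8, 16, 16 };
	int alt1[] = { 8,  8, 16 };

	return !orc_compatible(alt0, alt1, 3);
}
```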

2020-11-11 20:46:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 12:13:28PM -0600, Josh Poimboeuf wrote:
> On Wed, Nov 11, 2020 at 06:47:36PM +0100, Peter Zijlstra wrote:

> > Yeah, so it's all a giant can of worms; also see:
> >
> > https://lkml.kernel.org/r/[email protected]
> >
> > The basic idea is to only trace edges, ie. when the hardware state
> > actually changes. Sadly this means doing a pushf/pop before the cli.
> > Ideally CLI would store the old IF in CF or something like that, but
> > alas.
>
> Right, that makes sense for save/restore, but is the disabled check
> really needed for local_irq_disable()? Wouldn't that always be an edge?

IIRC there is code that does local_irq_disable() even though IRQs are
already disabled. This is 'harmless'.

> And anyway I don't see a similar check for local_irq_enable().

I know there is code that does local_irq_enable() with IRQs already
enabled; I'm not exactly sure why this is different. I'll have to put it
on the todo list :/

2020-11-13 17:37:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 11, 2020 at 12:25 PM Andrew Cooper
<[email protected]> wrote:
>
> On 11/11/2020 20:15, Josh Poimboeuf wrote:
> > On Wed, Nov 11, 2020 at 09:07:30PM +0100, Peter Zijlstra wrote:
> >> On Wed, Nov 11, 2020 at 01:59:00PM -0600, Josh Poimboeuf wrote:
> >>> On Wed, Nov 11, 2020 at 08:42:06PM +0100, Peter Zijlstra wrote:
> >>>>> Would objtool have an easier time coping if this were implemented in
> >>>>> terms of a static call?
> >>>> I doubt it, the big problem is that there is no visibility into the
> >>>> actual alternative text. Runtime patching fragments into static call
> >>>> would have the exact same problem.
> >>>>
> >>>> Something that _might_ maybe work is trying to morph the immediate
> >>>> fragments into an alternative. That is, instead of this:
> >>>>
> >>>> [...]
> >>>>
> >>>> Write it something like:
> >>>>
> >>>> [...]
> >>>>
> >>>> And then we have to teach objtool how to deal with conflicting
> >>>> alternatives...
> >>>>
> >>>> That would remove most (all, if we can figure out a form that deals with
> >>>> the spinlock fragments) of paravirt_patch.c
> >>>>
> >>>> Hmm?
> >>> I was going to suggest something similar. Though I would try to take it
> >>> further and replace paravirt_patch_default() with static calls.
> >> Possible, we just need to be _really_ careful to not allow changing
> >> those static_call()s. So maybe we need DEFINE_STATIC_CALL_RO() which
> >> does a __ro_after_init on the whole thing.
> > But what if you want to live migrate to another hypervisor ;-)
>
> The same as what happens currently. The user gets to keep all the
> resulting pieces ;)
>
> >>> Either way it doesn't make objtool's job much easier. But it would be
> >>> nice to consolidate runtime patching mechanisms and get rid of
> >>> .parainstructions.
> >> I think the above (combining alternative and paravirt/static_call) does
> >> make objtool's job easier, since then we at least have the actual
> >> alternative instructions available to inspect, or am I mis-understanding
> >> things?
> > Right, it makes objtool's job a _little_ easier, since it already knows
> > how to read alternatives. But it still has to learn to deal with the
> > conflicting stack layouts.
>
> I suppose the needed abstraction is "these blocks will start and end
> with the same stack layout", while allowing the internals to diverge.
>

How much of this stuff is actually useful anymore? I'm wondering if
we can move most or all of this crud to
cpu_feature_enabled(X86_FEATURE_XENPV) and its asm equivalent. The
full list, annotated, appears to be:

const unsigned char irq_irq_disable[1];

This is CLI or CALL, right?

const unsigned char irq_irq_enable[1];

STI or CALL.

const unsigned char irq_save_fl[2];

PUSHF; POP %r/eax. I *think* I read the paravirt mess correctly and
this also turns into CALL.

const unsigned char mmu_read_cr2[3];
const unsigned char mmu_read_cr3[3];
const unsigned char mmu_write_cr3[3];

The write CR3 is so slow that I can't imagine us caring. Reading CR3
should already be fairly optimized because it's slow on old non-PV
hypervisors, too. Reading CR2 is rare and lives in asm. These also
appear to just switch between MOV and CALL, anyway.

const unsigned char irq_restore_fl[2];

Ugh, this one sucks. IMO it should be, for native and PV:

if (flags & X86_EFLAGS_IF) {
	local_irq_enable(); /* or raw? */
} else {
	if (some debugging option) {
		WARN_ON_ONCE(save_fl() & X86_EFLAGS_IF);
	}
}

POPF is slooooow.

const unsigned char cpu_wbinvd[2];

This is hilariously slow no matter what. static_call() or even just a
plain old indirect call should be fine.

const unsigned char cpu_usergs_sysret64[6];

This is in the asm and we shouldn't be doing it at all for Xen PV.
IOW we should just drop this patch site entirely. I can possibly find
some time to get rid of it, and maybe someone from Xen land can help.
I bet that we can gain a lot of perf on Xen PV by cleaning this up,
and I bet it will simplify everything.

const unsigned char cpu_swapgs[3];

This is SWAPGS or nop, unless I've missed some subtlety.

const unsigned char mov64[3];

This is some PTE magic, and I haven't deciphered it yet.

So I think there is at most one of these that wants anything more
complicated than a plain ALTERNATIVE. Any volunteers to make it so?
Juergen, if you do all of them except USERGS_SYSRET64, I hereby
volunteer to do that one.

BTW, if y'all want to live migrate between Xen PV and anything else,
you are nuts.

2020-11-14 09:20:19

by Juergen Gross

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On 13.11.20 18:34, Andy Lutomirski wrote:
> On Wed, Nov 11, 2020 at 12:25 PM Andrew Cooper
> <[email protected]> wrote:
>>
>> On 11/11/2020 20:15, Josh Poimboeuf wrote:
>>> On Wed, Nov 11, 2020 at 09:07:30PM +0100, Peter Zijlstra wrote:
>>>> On Wed, Nov 11, 2020 at 01:59:00PM -0600, Josh Poimboeuf wrote:
>>>>> On Wed, Nov 11, 2020 at 08:42:06PM +0100, Peter Zijlstra wrote:
>>>>>>> Would objtool have an easier time coping if this were implemented in
>>>>>>> terms of a static call?
>>>>>> I doubt it, the big problem is that there is no visibility into the
>>>>>> actual alternative text. Runtime patching fragments into static call
>>>>>> would have the exact same problem.
>>>>>>
>>>>>> Something that _might_ maybe work is trying to morph the immediate
>>>>>> fragments into an alternative. That is, instead of this:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>> Write it something like:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>> And then we have to teach objtool how to deal with conflicting
>>>>>> alternatives...
>>>>>>
>>>>>> That would remove most (all, if we can figure out a form that deals with
>>>>>> the spinlock fragments) of paravirt_patch.c
>>>>>>
>>>>>> Hmm?
>>>>> I was going to suggest something similar. Though I would try to take it
>>>>> further and replace paravirt_patch_default() with static calls.
>>>> Possible, we just need to be _really_ careful to not allow changing
>>>> those static_call()s. So maybe we need DEFINE_STATIC_CALL_RO() which
>>>> does a __ro_after_init on the whole thing.
>>> But what if you want to live migrate to another hypervisor ;-)
>>
>> The same as what happens currently. The user gets to keep all the
>> resulting pieces ;)
>>
>>>>> Either way it doesn't make objtool's job much easier. But it would be
>>>>> nice to consolidate runtime patching mechanisms and get rid of
>>>>> .parainstructions.
>>>> I think the above (combining alternative and paravirt/static_call) does
>>>> make objtool's job easier, since then we at least have the actual
>>>> alternative instructions available to inspect, or am I mis-understanding
>>>> things?
>>> Right, it makes objtool's job a _little_ easier, since it already knows
>>> how to read alternatives. But it still has to learn to deal with the
>>> conflicting stack layouts.
>>
>> I suppose the needed abstraction is "these blocks will start and end
>> with the same stack layout", while allowing the internals to diverge.
>>
>
> How much of this stuff is actually useful anymore? I'm wondering if
> we can move most or all of this crud to
> cpu_feature_enabled(X86_FEATURE_XENPV) and its asm equivalent. The
> full list, annotated, appears to be:
>
> const unsigned char irq_irq_disable[1];
>
> This is CLI or CALL, right?

Yes.

>
> const unsigned char irq_irq_enable[1];
>
> STI or CALL.

Yes.

>
> const unsigned char irq_save_fl[2];
>
> PUSHF; POP %r/eax. I *think* I read the paravirt mess correctly and
> this also turns into CALL.

It does.

>
> const unsigned char mmu_read_cr2[3];
> const unsigned char mmu_read_cr3[3];
> const unsigned char mmu_write_cr3[3];
>
> The write CR3 is so slow that I can't imagine us caring. Reading CR3
> should already be fairly optimized because it's slow on old non-PV
> hypervisors, too. Reading CR2 is rare and lives in asm. These also
> appear to just switch between MOV and CALL, anyway.

Correct.

>
> const unsigned char irq_restore_fl[2];
>
> Ugh, this one sucks. IMO it should be, for native and PV:
>
> if (flags & X86_EFLAGS_IF) {
> local_irq_enable(); /* or raw? */
> } else {
> if (some debugging option) {
> WARN_ON_ONCE(save_fl() & X86_EFLAGS_IF);
> }
> }

Seems sensible.

>
> POPF is slooooow.
>
> const unsigned char cpu_wbinvd[2];
>
> This is hilariously slow no matter what. static_call() or even just a
> plain old indirect call should be fine.

I'd go with the static_call().

>
> const unsigned char cpu_usergs_sysret64[6];
>
> This is in the asm and we shouldn't be doing it at all for Xen PV.
> IOW we should just drop this patch site entirely. I can possibly find
> some time to get rid of it, and maybe someone from Xen land can help.
> I bet that we can gain a lot of perf on Xen PV by cleaning this up,
> and I bet it will simplify everything.
>
> const unsigned char cpu_swapgs[3];
>
> This is SWAPGS or nop, unless I've missed some subtlety.
>
> const unsigned char mov64[3];
>
> This is some PTE magic, and I haven't deciphered it yet.

Either a mov or a call.

>
> So I think there is at most one of these that wants anything more
> complicated than a plain ALTERNATIVE. Any volunteers to make it so?
> Juergen, if you do all of them except USERGS_SYSRET64, I hereby
> volunteer to do that one.

Why is a plain alternative (either swapgs; sysretq or a jmp xen_sysret64
depending on X86_FEATURE_XENPV) not an option?

It's not as if this code would run before alternative patching.

>
> BTW, if y'all want to live migrate between Xen PV and anything else,
> you are nuts.
>

That's not an option. Xen PV is a property of the guest, not of the hypervisor.


Juergen



2020-11-14 18:13:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Sat, Nov 14, 2020 at 1:16 AM Jürgen Groß <[email protected]> wrote:
>
> On 13.11.20 18:34, Andy Lutomirski wrote:
> > On Wed, Nov 11, 2020 at 12:25 PM Andrew Cooper
> > <[email protected]> wrote:
> >
> > So I think there is at most one of these that wants anything more
> > complicated than a plain ALTERNATIVE. Any volunteers to make it so?
> > Juergen, if you do all of them except USERGS_SYSRET64, I hereby
> > volunteer to do that one.
>
> Why is a plain alternative (either swapgs; sysretq or a jmp xen_sysret64
> depending on X86_FEATURE_XENPV) not an option?
>
> It's not as if this code would run before alternative patching.

ALTERNATIVE would "work" in the sense that it would function and be
just about as nonsensical as the current code. Fundamentally, Xen
PV's sysret feature is not a drop-in replacement for SYSRET64, and
pretending that it is seems unlikely to work well. I suspect that the
current code is some combination of exceedingly slow, non-functional,
and incorrect in subtle ways.

We should just have a separate Xen PV exit path the same way we have a
separate entry path in recent kernels. *This* is what I'm
volunteering to do.

--Andy

2020-11-15 06:43:13

by Juergen Gross

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On 14.11.20 19:10, Andy Lutomirski wrote:
> On Sat, Nov 14, 2020 at 1:16 AM Jürgen Groß <[email protected]> wrote:
>>
>> On 13.11.20 18:34, Andy Lutomirski wrote:
>>> On Wed, Nov 11, 2020 at 12:25 PM Andrew Cooper
>>> <[email protected]> wrote:
>>>
>>> So I think there is at most one of these that wants anything more
>>> complicated than a plain ALTERNATIVE. Any volunteers to make it so?
>>> Juergen, if you do all of them except USERGS_SYSRET64, I hereby
>>> volunteer to do that one.
>>
>> Why is a plain alternative (either swapgs; sysretq or a jmp xen_sysret64
>> depending on X86_FEATURE_XENPV) not an option?
>>
>> It's not as if this code would run before alternative patching.
>
> ALTERNATIVE would "work" in the sense that it would function and be
> just about as nonsensical as the current code. Fundamentally, Xen
> PV's sysret feature is not a drop-in replacement for SYSRET64, and
> pretending that it is seems unlikely to work well. I suspect that the
> current code is some combination of exceedingly slow, non-functional,
> and incorrect in subtle ways.
>
> We should just have a separate Xen PV exit path the same way we have a
> separate entry path in recent kernels. *This* is what I'm
> volunteering to do.

I don't think there is much work needed. Xen PV basically does a return
to user mode via IRET. I think it might work just to use
swapgs_restore_regs_and_return_to_usermode() unconditionally for Xen PV.


Juergen



2020-11-15 16:37:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt


> On Nov 14, 2020, at 10:33 PM, Jürgen Groß <[email protected]> wrote:
>
> On 14.11.20 19:10, Andy Lutomirski wrote:
>>> On Sat, Nov 14, 2020 at 1:16 AM Jürgen Groß <[email protected]> wrote:
>>>
>>> On 13.11.20 18:34, Andy Lutomirski wrote:
>>>> On Wed, Nov 11, 2020 at 12:25 PM Andrew Cooper
>>>> <[email protected]> wrote:
>>>>
>>>> So I think there is at most one of these that wants anything more
>>>> complicated than a plain ALTERNATIVE. Any volunteers to make it so?
>>>> Juergen, if you do all of them except USERGS_SYSRET64, I hereby
>>>> volunteer to do that one.
>>>
>>> Why is a plain alternative (either swapgs; sysretq or a jmp xen_sysret64
>>> depending on X86_FEATURE_XENPV) not an option?
>>>
>>> It's not as if this code would run before alternative patching.
>> ALTERNATIVE would "work" in the sense that it would function and be
>> just about as nonsensical as the current code. Fundamentally, Xen
>> PV's sysret feature is not a drop-in replacement for SYSRET64, and
>> pretending that it is seems unlikely to work well. I suspect that the
>> current code is some combination of exceedingly slow, non-functional,
>> and incorrect in subtle ways.
>> We should just have a separate Xen PV exit path the same way we have a
>> separate entry path in recent kernels. *This* is what I'm
>> volunteering to do.
>
> I don't think there is much work needed. Xen PV does basically a return
> to user mode via IRET. I think it might work just to use
> swapgs_restore_regs_and_return_to_usermode() unconditionally for Xen PV.
>

I’m quite confident that will work, but I was hoping to get it to work quickly too :)

>
> Juergen

2020-11-15 16:38:43

by Juergen Gross

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On 15.11.20 17:05, Andy Lutomirski wrote:
>
>> On Nov 14, 2020, at 10:33 PM, Jürgen Groß <[email protected]> wrote:
>>
>> On 14.11.20 19:10, Andy Lutomirski wrote:
>>>> On Sat, Nov 14, 2020 at 1:16 AM Jürgen Groß <[email protected]> wrote:
>>>>
>>>> On 13.11.20 18:34, Andy Lutomirski wrote:
>>>>> On Wed, Nov 11, 2020 at 12:25 PM Andrew Cooper
>>>>> <[email protected]> wrote:
>>>>>
>>>>> So I think there is at most one of these that wants anything more
>>>>> complicated than a plain ALTERNATIVE. Any volunteers to make it so?
>>>>> Juergen, if you do all of them except USERGS_SYSRET64, I hereby
>>>>> volunteer to do that one.
>>>>
>>>> Why is a plain alternative (either swapgs; sysretq or a jmp xen_sysret64
>>>> depending on X86_FEATURE_XENPV) not an option?
>>>>
>>>> It's not as if this code would run before alternative patching.
>>> ALTERNATIVE would "work" in the sense that it would function and be
>>> just about as nonsensical as the current code. Fundamentally, Xen
>>> PV's sysret feature is not a drop-in replacement for SYSRET64, and
>>> pretending that it is seems unlikely to work well. I suspect that the
>>> current code is some combination of exceedingly slow, non-functional,
>>> and incorrect in subtle ways.
>>> We should just have a separate Xen PV exit path the same way we have a
>>> separate entry path in recent kernels. *This* is what I'm
>>> volunteering to do.
>>
>> I don't think there is much work needed. Xen PV does basically a return
>> to user mode via IRET. I think it might work just to use
>> swapgs_restore_regs_and_return_to_usermode() unconditionally for Xen PV.
>>
>
> I’m quite confident that will work, but I was hoping to get it to work quickly too :)

The PV interface requires using the iret hypercall, because there
is no sysret equivalent.


Juergen



2020-11-16 12:01:38

by Juergen Gross

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On 13.11.20 18:34, Andy Lutomirski wrote:
> On Wed, Nov 11, 2020 at 12:25 PM Andrew Cooper
> <[email protected]> wrote:
>>
>> On 11/11/2020 20:15, Josh Poimboeuf wrote:
>>> On Wed, Nov 11, 2020 at 09:07:30PM +0100, Peter Zijlstra wrote:
>>>> On Wed, Nov 11, 2020 at 01:59:00PM -0600, Josh Poimboeuf wrote:
>>>>> On Wed, Nov 11, 2020 at 08:42:06PM +0100, Peter Zijlstra wrote:
>>>>>>> Would objtool have an easier time coping if this were implemented in
>>>>>>> terms of a static call?
>>>>>> I doubt it, the big problem is that there is no visibility into the
>>>>>> actual alternative text. Runtime patching fragments into static call
>>>>>> would have the exact same problem.
>>>>>>
>>>>>> Something that _might_ maybe work is trying to morph the immediate
>>>>>> fragments into an alternative. That is, instead of this:
>>>>>>
>>>>>> static inline notrace unsigned long arch_local_save_flags(void)
>>>>>> {
>>>>>> return PVOP_CALLEE0(unsigned long, irq.save_fl);
>>>>>> }
>>>>>>
>>>>>> Write it something like:
>>>>>>
>>>>>> static inline notrace unsigned long arch_local_save_flags(void)
>>>>>> {
>>>>>> PVOP_CALL_ARGS;
>>>>>> PVOP_TEST_NULL(irq.save_fl);
>>>>>> asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
>>>>>> "PUSHF; POP _ASM_AX",
>>>>>> X86_FEATURE_NATIVE)

I am wondering whether we really want a new feature (basically "not
XENPV"). We could use ~X86_FEATURE_XENPV and teach apply_alternatives()
to understand negated features (yes, this limits the number of features
to 32767, but I don't think this is a real problem for quite some time).

Thoughts?


Juergen



2020-11-16 13:09:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Mon, Nov 16, 2020 at 12:56:32PM +0100, Jürgen Groß wrote:
> > > > > > > static inline notrace unsigned long arch_local_save_flags(void)
> > > > > > > {
> > > > > > > PVOP_CALL_ARGS;
> > > > > > > PVOP_TEST_NULL(irq.save_fl);
> > > > > > > asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
> > > > > > > "PUSHF; POP _ASM_AX",
> > > > > > > X86_FEATURE_NATIVE)
>
> I am wondering whether we really want a new feature (basically "not
> XENPV"). We could use ~X86_FEATURE_XENPV and teach apply_alternatives()
> to understand negated features (yes, this limits the number of features
> to 32767, but I don't think this is a real problem for quite some time).
>
> Thoughts?

I went with the simple thing for now... If we ever want/need another
negative alternative I suppose we can do as you suggest...

I was still poking at objtool to actually dtrt though..

---
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index dad350d42ecf..cc88f358bac5 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -237,6 +237,7 @@
#define X86_FEATURE_VMCALL ( 8*32+18) /* "" Hypervisor supports the VMCALL instruction */
#define X86_FEATURE_VMW_VMMCALL ( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
#define X86_FEATURE_SEV_ES ( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
+#define X86_FEATURE_NOT_XENPV ( 8*32+21) /* "" inverse of X86_FEATURE_XENPV */

/* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
#define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d25cc6830e89..e2a9d3e6b7ad 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -645,22 +645,56 @@ bool __raw_callee_save___native_vcpu_is_preempted(long cpu);
#ifdef CONFIG_PARAVIRT_XXL
static inline notrace unsigned long arch_local_save_flags(void)
{
- return PVOP_CALLEE0(unsigned long, irq.save_fl);
+ PVOP_CALL_ARGS;
+ PVOP_TEST_NULL(irq.save_fl);
+ asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
+ "pushf; pop %%" _ASM_AX ";",
+ X86_FEATURE_NOT_XENPV)
+ : PVOP_CALLEE_CLOBBERS, ASM_CALL_CONSTRAINT
+ : paravirt_type(irq.save_fl.func),
+ paravirt_clobber(CLBR_RET_REG)
+ : "memory", "cc");
+ return __eax;
}

static inline notrace void arch_local_irq_restore(unsigned long f)
{
- PVOP_VCALLEE1(irq.restore_fl, f);
+ PVOP_CALL_ARGS;
+ PVOP_TEST_NULL(irq.restore_fl);
+ asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
+ "push %%" _ASM_ARG1 "; popf;",
+ X86_FEATURE_NOT_XENPV)
+ : PVOP_VCALLEE_CLOBBERS, ASM_CALL_CONSTRAINT
+ : paravirt_type(irq.restore_fl.func),
+ paravirt_clobber(CLBR_RET_REG),
+ PVOP_CALL_ARG1(f)
+ : "memory", "cc");
}

static inline notrace void arch_local_irq_disable(void)
{
- PVOP_VCALLEE0(irq.irq_disable);
+ PVOP_CALL_ARGS;
+ PVOP_TEST_NULL(irq.irq_disable);
+ asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
+ "cli;",
+ X86_FEATURE_NOT_XENPV)
+ : PVOP_VCALLEE_CLOBBERS, ASM_CALL_CONSTRAINT
+ : paravirt_type(irq.irq_disable.func),
+ paravirt_clobber(CLBR_RET_REG)
+ : "memory", "cc");
}

static inline notrace void arch_local_irq_enable(void)
{
- PVOP_VCALLEE0(irq.irq_enable);
+ PVOP_CALL_ARGS;
+ PVOP_TEST_NULL(irq.irq_enable);
+ asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
+ "sti;",
+ X86_FEATURE_NOT_XENPV)
+ : PVOP_VCALLEE_CLOBBERS, ASM_CALL_CONSTRAINT
+ : paravirt_type(irq.irq_enable.func),
+ paravirt_clobber(CLBR_RET_REG)
+ : "memory", "cc");
}

static inline notrace unsigned long arch_local_irq_save(void)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 2400ad62f330..5bd8f5503652 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -723,6 +723,19 @@ void __init alternative_instructions(void)
* patching.
*/

+ if (!boot_cpu_has(X86_FEATURE_XENPV))
+ setup_force_cpu_cap(X86_FEATURE_NOT_XENPV);
+
+ /*
+ * First patch paravirt functions, such that we overwrite the indirect
+ * call with the direct call.
+ */
+ apply_paravirt(__parainstructions, __parainstructions_end);
+
+ /*
+ * Then patch alternatives, such that those paravirt calls that are in
+ * alternatives can be overwritten by their immediate fragments.
+ */
apply_alternatives(__alt_instructions, __alt_instructions_end);

#ifdef CONFIG_SMP
@@ -741,8 +754,6 @@ void __init alternative_instructions(void)
}
#endif

- apply_paravirt(__parainstructions, __parainstructions_end);
-
restart_nmi();
alternatives_patched = 1;
}
diff --git a/arch/x86/kernel/paravirt_patch.c b/arch/x86/kernel/paravirt_patch.c
index ace6e334cb39..867498db53ad 100644
--- a/arch/x86/kernel/paravirt_patch.c
+++ b/arch/x86/kernel/paravirt_patch.c
@@ -33,13 +33,9 @@ struct patch_xxl {
};

static const struct patch_xxl patch_data_xxl = {
- .irq_irq_disable = { 0xfa }, // cli
- .irq_irq_enable = { 0xfb }, // sti
- .irq_save_fl = { 0x9c, 0x58 }, // pushf; pop %[re]ax
.mmu_read_cr2 = { 0x0f, 0x20, 0xd0 }, // mov %cr2, %[re]ax
.mmu_read_cr3 = { 0x0f, 0x20, 0xd8 }, // mov %cr3, %[re]ax
.mmu_write_cr3 = { 0x0f, 0x22, 0xdf }, // mov %rdi, %cr3
- .irq_restore_fl = { 0x57, 0x9d }, // push %rdi; popfq
.cpu_wbinvd = { 0x0f, 0x09 }, // wbinvd
.cpu_usergs_sysret64 = { 0x0f, 0x01, 0xf8,
0x48, 0x0f, 0x07 }, // swapgs; sysretq
@@ -76,11 +72,6 @@ unsigned int native_patch(u8 type, void *insn_buff, unsigned long addr,
switch (type) {

#ifdef CONFIG_PARAVIRT_XXL
- PATCH_CASE(irq, restore_fl, xxl, insn_buff, len);
- PATCH_CASE(irq, save_fl, xxl, insn_buff, len);
- PATCH_CASE(irq, irq_enable, xxl, insn_buff, len);
- PATCH_CASE(irq, irq_disable, xxl, insn_buff, len);
-
PATCH_CASE(mmu, read_cr2, xxl, insn_buff, len);
PATCH_CASE(mmu, read_cr3, xxl, insn_buff, len);
PATCH_CASE(mmu, write_cr3, xxl, insn_buff, len);



2020-11-18 06:52:00

by Juergen Gross

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On 16.11.20 14:04, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 12:56:32PM +0100, Jürgen Groß wrote:
>>>>>>>> static inline notrace unsigned long arch_local_save_flags(void)
>>>>>>>> {
>>>>>>>> PVOP_CALL_ARGS;
>>>>>>>> PVOP_TEST_NULL(irq.save_fl);
>>>>>>>> asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
>>>>>>>> "PUSHF; POP _ASM_AX",
>>>>>>>> X86_FEATURE_NATIVE)
>>
>> I am wondering whether we really want a new feature (basically "not
>> XENPV"). We could use ~X86_FEATURE_XENPV and teach apply_alternatives()
>> to understand negated features (yes, this limits the number of features
>> to 32767, but I don't think this is a real problem for quite some time).
>>
>> Thoughts?
>
> I went with the simple thing for now... If we ever want/need another
> negative alternative I suppose we can do as you suggest...
>
> I was still poking at objtool to actually dtrt though..

I'd like to include this part in my series (with a different solution
for the restore_fl part, as suggested by Andy before).

Peter, are you fine with me taking your patch and adding your SoB?


Juergen

>
> ---
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index dad350d42ecf..cc88f358bac5 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -237,6 +237,7 @@
> #define X86_FEATURE_VMCALL ( 8*32+18) /* "" Hypervisor supports the VMCALL instruction */
> #define X86_FEATURE_VMW_VMMCALL ( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
> #define X86_FEATURE_SEV_ES ( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
> +#define X86_FEATURE_NOT_XENPV ( 8*32+21) /* "" inverse of X86_FEATURE_XENPV */
>
> /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
> #define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index d25cc6830e89..e2a9d3e6b7ad 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -645,22 +645,56 @@ bool __raw_callee_save___native_vcpu_is_preempted(long cpu);
> #ifdef CONFIG_PARAVIRT_XXL
> static inline notrace unsigned long arch_local_save_flags(void)
> {
> - return PVOP_CALLEE0(unsigned long, irq.save_fl);
> + PVOP_CALL_ARGS;
> + PVOP_TEST_NULL(irq.save_fl);
> + asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
> + "pushf; pop %%" _ASM_AX ";",
> + X86_FEATURE_NOT_XENPV)
> + : PVOP_CALLEE_CLOBBERS, ASM_CALL_CONSTRAINT
> + : paravirt_type(irq.save_fl.func),
> + paravirt_clobber(CLBR_RET_REG)
> + : "memory", "cc");
> + return __eax;
> }
>
> static inline notrace void arch_local_irq_restore(unsigned long f)
> {
> - PVOP_VCALLEE1(irq.restore_fl, f);
> + PVOP_CALL_ARGS;
> + PVOP_TEST_NULL(irq.restore_fl);
> + asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
> + "push %%" _ASM_ARG1 "; popf;",
> + X86_FEATURE_NOT_XENPV)
> + : PVOP_VCALLEE_CLOBBERS, ASM_CALL_CONSTRAINT
> + : paravirt_type(irq.restore_fl.func),
> + paravirt_clobber(CLBR_RET_REG),
> + PVOP_CALL_ARG1(f)
> + : "memory", "cc");
> }
>
> static inline notrace void arch_local_irq_disable(void)
> {
> - PVOP_VCALLEE0(irq.irq_disable);
> + PVOP_CALL_ARGS;
> + PVOP_TEST_NULL(irq.irq_disable);
> + asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
> + "cli;",
> + X86_FEATURE_NOT_XENPV)
> + : PVOP_VCALLEE_CLOBBERS, ASM_CALL_CONSTRAINT
> + : paravirt_type(irq.irq_disable.func),
> + paravirt_clobber(CLBR_RET_REG)
> + : "memory", "cc");
> }
>
> static inline notrace void arch_local_irq_enable(void)
> {
> - PVOP_VCALLEE0(irq.irq_enable);
> + PVOP_CALL_ARGS;
> + PVOP_TEST_NULL(irq.irq_enable);
> + asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
> + "sti;",
> + X86_FEATURE_NOT_XENPV)
> + : PVOP_VCALLEE_CLOBBERS, ASM_CALL_CONSTRAINT
> + : paravirt_type(irq.irq_enable.func),
> + paravirt_clobber(CLBR_RET_REG)
> + : "memory", "cc");
> }
>
> static inline notrace unsigned long arch_local_irq_save(void)
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 2400ad62f330..5bd8f5503652 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -723,6 +723,19 @@ void __init alternative_instructions(void)
> * patching.
> */
>
> + if (!boot_cpu_has(X86_FEATURE_XENPV))
> + setup_force_cpu_cap(X86_FEATURE_NOT_XENPV);
> +
> + /*
> + * First patch paravirt functions, such that we overwrite the indirect
> + * call with the direct call.
> + */
> + apply_paravirt(__parainstructions, __parainstructions_end);
> +
> + /*
> + * Then patch alternatives, such that those paravirt calls that are in
> + * alternatives can be overwritten by their immediate fragments.
> + */
> apply_alternatives(__alt_instructions, __alt_instructions_end);
>
> #ifdef CONFIG_SMP
> @@ -741,8 +754,6 @@ void __init alternative_instructions(void)
> }
> #endif
>
> - apply_paravirt(__parainstructions, __parainstructions_end);
> -
> restart_nmi();
> alternatives_patched = 1;
> }
> diff --git a/arch/x86/kernel/paravirt_patch.c b/arch/x86/kernel/paravirt_patch.c
> index ace6e334cb39..867498db53ad 100644
> --- a/arch/x86/kernel/paravirt_patch.c
> +++ b/arch/x86/kernel/paravirt_patch.c
> @@ -33,13 +33,9 @@ struct patch_xxl {
> };
>
> static const struct patch_xxl patch_data_xxl = {
> - .irq_irq_disable = { 0xfa }, // cli
> - .irq_irq_enable = { 0xfb }, // sti
> - .irq_save_fl = { 0x9c, 0x58 }, // pushf; pop %[re]ax
> .mmu_read_cr2 = { 0x0f, 0x20, 0xd0 }, // mov %cr2, %[re]ax
> .mmu_read_cr3 = { 0x0f, 0x20, 0xd8 }, // mov %cr3, %[re]ax
> .mmu_write_cr3 = { 0x0f, 0x22, 0xdf }, // mov %rdi, %cr3
> - .irq_restore_fl = { 0x57, 0x9d }, // push %rdi; popfq
> .cpu_wbinvd = { 0x0f, 0x09 }, // wbinvd
> .cpu_usergs_sysret64 = { 0x0f, 0x01, 0xf8,
> 0x48, 0x0f, 0x07 }, // swapgs; sysretq
> @@ -76,11 +72,6 @@ unsigned int native_patch(u8 type, void *insn_buff, unsigned long addr,
> switch (type) {
>
> #ifdef CONFIG_PARAVIRT_XXL
> - PATCH_CASE(irq, restore_fl, xxl, insn_buff, len);
> - PATCH_CASE(irq, save_fl, xxl, insn_buff, len);
> - PATCH_CASE(irq, irq_enable, xxl, insn_buff, len);
> - PATCH_CASE(irq, irq_disable, xxl, insn_buff, len);
> -
> PATCH_CASE(mmu, read_cr2, xxl, insn_buff, len);
> PATCH_CASE(mmu, read_cr3, xxl, insn_buff, len);
> PATCH_CASE(mmu, write_cr3, xxl, insn_buff, len);
>



2020-11-18 08:26:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Wed, Nov 18, 2020 at 07:47:36AM +0100, Jürgen Groß wrote:
> On 16.11.20 14:04, Peter Zijlstra wrote:
> > On Mon, Nov 16, 2020 at 12:56:32PM +0100, Jürgen Groß wrote:
> > > > > > > > > static inline notrace unsigned long arch_local_save_flags(void)
> > > > > > > > > {
> > > > > > > > > PVOP_CALL_ARGS;
> > > > > > > > > PVOP_TEST_NULL(irq.save_fl);
> > > > > > > > > asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
> > > > > > > > > "PUSHF; POP _ASM_AX",
> > > > > > > > > X86_FEATURE_NATIVE)
> > >
> > > I am wondering whether we really want a new feature (basically "not
> > > XENPV"). We could use ~X86_FEATURE_XENPV and teach apply_alternatives()
> > > to understand negated features (yes, this limits the number of features
> > > to 32767, but I don't think this is a real problem for quite some time).
> > >
> > > Thoughts?
> >
> > I went with the simple thing for now... If we ever want/need another
> > negative alternative I suppose we can do as you suggest...
> >
> > I was still poking at objtool to actually dtrt though..
>
> I'd like to include this part in my series (with a different solution
> for the restore_fl part, as suggested by Andy before).
>
> Peter, are you fine with me taking your patch and adding your SoB?

Sure, note that I only compile-tested it, as it was my test-bed for
playing with objtool (which I still haven't managed to get back to and
finish :/).

So if it actually _works_ feel free to take it, otherwise you can have
whatever bits and pieces remain after you've butchered it to do as you
want.

2020-11-19 11:54:48

by Shinichiro Kawasaki

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Nov 18, 2020 / 09:22, Peter Zijlstra wrote:
> On Wed, Nov 18, 2020 at 07:47:36AM +0100, Jürgen Groß wrote:
> > On 16.11.20 14:04, Peter Zijlstra wrote:
> > > On Mon, Nov 16, 2020 at 12:56:32PM +0100, Jürgen Groß wrote:
> > > > > > > > > > static inline notrace unsigned long arch_local_save_flags(void)
> > > > > > > > > > {
> > > > > > > > > > PVOP_CALL_ARGS;
> > > > > > > > > > PVOP_TEST_NULL(irq.save_fl);
> > > > > > > > > > asm_inline volatile(ALTERNATIVE(paravirt_alt(PARAVIRT_CALL),
> > > > > > > > > > "PUSHF; POP _ASM_AX",
> > > > > > > > > > X86_FEATURE_NATIVE)
> > > >
> > > > I am wondering whether we really want a new feature (basically "not
> > > > XENPV"). We could use ~X86_FEATURE_XENPV and teach apply_alternatives()
> > > > to understand negated features (yes, this limits the number of features
> > > > to 32767, but I don't think this is a real problem for quite some time).
> > > >
> > > > Thoughts?
> > >
> > > I went with the simple thing for now... If we ever want/need another
> > > negative alternative I suppose we can do as you suggest...
> > >
> > > I was still poking at objtool to actually dtrt though..
> >
> > I'd like to include this part in my series (with a different solution
> > for the restore_fl part, as suggested by Andy before).
> >
> > Peter, are you fine with me taking your patch and adding your SoB?
>
> Sure, note that I only compile tested it, as it was my test-bed for
> playing with objtool (which I still haven't managed to get back to and
> finish :/).
>
> So if it actually _works_ feel free to take it, otherwise you can have
> whatever bits and pieces remain after you've butchered it to do as you
> want.

All, thank you for taking action.

I tried Peter's patch in my test system and the result looks good. The warning
is still observed with 5.10-rc4. With the patch on top of 5.10-rc4, the warning
disappeared.

Of note, when I built the kernel with the patch, objtool reported many
warnings, as follows:

arch/x86/events/intel/core.o: warning: objtool: .altinstr_replacement+0x0: alternative modifies stack
arch/x86/events/core.o: warning: objtool: .altinstr_replacement+0x0: alternative modifies stack

--
Best Regards,
Shin'ichiro Kawasaki

2020-11-19 12:04:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Thu, Nov 19, 2020 at 11:51:00AM +0000, Shinichiro Kawasaki wrote:
> I tried Peter's patch in my test system and result looks good. The warning is
> still observed with 5.10-rc4. With the patch on top of 5.10-rc4, the warning
> disappeared.

The patch on its own is not sufficient to fix things; there needs to be
an accompanying objtool patch to generate the correct unwind
information.

This patch only ensures objtool has enough information to be able to
dtrt, but as it stands it shouldn't even compile; objtool should complain
about an alternative trying to modify the stack and bail.

2020-11-19 12:32:38

by Juergen Gross

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On 19.11.20 13:01, Peter Zijlstra wrote:
> On Thu, Nov 19, 2020 at 11:51:00AM +0000, Shinichiro Kawasaki wrote:
>> I tried Peter's patch in my test system and result looks good. The warning is
>> still observed with 5.10-rc4. With the patch on top of 5.10-rc4, the warning
>> disappeared.
>
> The patch on its own is not sufficient to fix things; there needs to be
> an accompanying objtool patch to generate the correct unwind
> information.
>
> This patch only ensures objtool has enough information to be able to
> dtrt, but as it stands it shouldn't even compile, it should complain
> about an alternative trying to modify the stack and bail.
>

It complains, but doesn't bail.

I'm nearly ready with my series replacing all the custom-code paravirt
patches with ALTERNATIVEs. Quite a lot of paravirt macro cruft turned out
to be unused, so there is a rather large cleanup patch included. :-)


Juergen



2020-11-19 12:51:32

by Shinichiro Kawasaki

[permalink] [raw]
Subject: Re: WARNING: can't access registers at asm_common_interrupt

On Nov 19, 2020 / 13:01, Peter Zijlstra wrote:
> On Thu, Nov 19, 2020 at 11:51:00AM +0000, Shinichiro Kawasaki wrote:
> > I tried Peter's patch in my test system and result looks good. The warning is
> > still observed with 5.10-rc4. With the patch on top of 5.10-rc4, the warning
> > disappeared.
>
> The patch on its own is not sufficient to fix things; there needs to be
> an accompanying objtool patch to generate the correct unwind
> information.
>
> This patch only ensures objtool has enough information to be able to
> dtrt, but as it stands it shouldn't even compile, it should complain
> about an alternative trying to modify the stack and bail.

Thanks for the clarification. It was too early to try it out... When it is
ready to try and test, please let me know.

--
Best Regards,
Shin'ichiro Kawasaki