2023-01-08 04:26:16

by Joan Bruguera Micó

[permalink] [raw]
Subject: Wake-up from suspend to RAM broken under `retbleed=stuff`

Wake-up from suspend to RAM seems broken under `retbleed=stuff`
(the recently introduced call depth tracking mitigation, see:
https://lore.kernel.org/lkml/[email protected]/t/)
I can replicate it on both real hardware and QEMU (with and without KVM).

It can replicate it by booting a fairly standard mainline kernel
on QEMU with `init=/bin/bash` and then suspending to RAM with:
echo "deep" > /sys/power/mem_sleep
echo "mem" > /sys/power/state
Then executing `system_wakeup` on the QEMU monitor causes the crash.

Some tracing with QEMU shows that some of the instrumentation
(`INCREMENT_CALL_DEPTH` / `sarq $5, %gs:__x86_call_depth`, ...)
seems to be done before %gs has been set up, causing a fault.

The crash happens shortly after the call to `restore_processor_state`
from `wakeup_64.S`, on the `sarq $5, %gs:__x86_call_depth` instruction.
Probably needs to be excluded?

And I can also see some other suspicious instances of `sarq $5, ...`
before the one that causes the crash, which also look suspicious.

QEMU log before the crash:

...
0xffffffff9486dc89: 4c 8b 68 10 movq 0x10(%rax), %r13
0xffffffff9486dc8d: 4c 8b 70 08 movq 8(%rax), %r14
0xffffffff9486dc91: 4c 8b 38 movq (%rax), %r15
0xffffffff9486dc94: 31 c0 xorl %eax, %eax
0xffffffff9486dc96: 48 83 c4 08 addq $8, %rsp
# (This is the 'jmp restore_processor_state' on wakeup_64.S)
0xffffffff9486dc9a: e9 51 e5 c2 00 jmp 0xffffffff9549c1f0
0xffffffff9549c1f0: 66 0f 1f 00 nopw (%rax)
0xffffffff9549c1f4: 55 pushq %rbp
0xffffffff9549c1f5: 48 89 e5 movq %rsp, %rbp
0xffffffff9549c1f8: 41 57 pushq %r15
0xffffffff9549c1fa: 41 56 pushq %r14
0xffffffff9549c1fc: 41 55 pushq %r13
0xffffffff9549c1fe: 41 54 pushq %r12
0xffffffff9549c200: 53 pushq %rbx
0xffffffff9549c201: 48 83 ec 20 subq $0x20, %rsp
0xffffffff9549c205: 80 3d 10 93 c4 01 00 cmpb $0, 0x1c49310(%rip)
0xffffffff9549c20c: 0f 85 8b 02 00 00 jne 0xffffffff9549c49d
0xffffffff9549c49d: 48 8b 15 24 90 c4 01 movq 0x1c49024(%rip), %rdx
0xffffffff9549c4a4: bf a0 01 00 00 movl $0x1a0, %edi
0xffffffff9549c4a9: 89 d6 movl %edx, %esi
0xffffffff9549c4ab: 48 c1 ea 20 shrq $0x20, %rdx
0xffffffff9549c4af: e8 32 9c 3e ff callq 0xffffffff948860e6
0xffffffff948860e6: 65 48 c1 3c 25 90 29 03 sarq $5, %gs:0x32990
0xffffffff948860ee: 00 05

RAX=0000000000000000 RBX=ffff98fbc1295d00 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000001 RDI=00000000000001a0 RBP=ffffaace80013ca0 RSP=ffffaace80013c50
R8 =0000000000000004 R9 =0000000021bf694e R10=00000000aaaaaaab R11=0000000000000005
R12=0000000000000000 R13=0000000000000000 R14=0000000000000004 R15=ffff98fbc4ae2560
RIP=ffffffff948860e6 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0010 0000000000000000 ffffffff 00af9b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT= ffff98fbfec0b000 0000007f
IDT= ffffffff96604000 000001ff
CR0=80050033 CR2=000055b505d6fd48 CR3=0000000001bfa000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=0000000000000000 CCD=0000000000000000 CCO=SARQ
EFER=0000000000000d01
check_exception old: 0xffffffff new 0xe
0: v=0e e=0000 i=0 cpl=0 IP=0010:ffffffff948860e6 pc=ffffffff948860e6 SP=0018:ffffaace80013c50 CR2=0000000000032990
RAX=0000000000000000 RBX=ffff98fbc1295d00 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000001 RDI=00000000000001a0 RBP=ffffaace80013ca0 RSP=ffffaace80013c50
R8 =0000000000000004 R9 =0000000021bf694e R10=00000000aaaaaaab R11=0000000000000005
R12=0000000000000000 R13=0000000000000000 R14=0000000000000004 R15=ffff98fbc4ae2560
RIP=ffffffff948860e6 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0010 0000000000000000 ffffffff 00af9b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT= ffff98fbfec0b000 0000007f
IDT= ffffffff96604000 000001ff
CR0=80050033 CR2=0000000000032990 CR3=0000000001bfa000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=0000000000000000 CCD=0000000000000000 CCO=SARQ
EFER=0000000000000d01
check_exception old: 0xe new 0xd
1: v=08 e=0000 i=0 cpl=0 IP=0010:ffffffff948860e6 pc=ffffffff948860e6 SP=0018:ffffaace80013c50 env->regs[R_EAX]=0000000000000000
RAX=0000000000000000 RBX=ffff98fbc1295d00 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000001 RDI=00000000000001a0 RBP=ffffaace80013ca0 RSP=ffffaace80013c50
R8 =0000000000000004 R9 =0000000021bf694e R10=00000000aaaaaaab R11=0000000000000005
R12=0000000000000000 R13=0000000000000000 R14=0000000000000004 R15=ffff98fbc4ae2560
RIP=ffffffff948860e6 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0010 0000000000000000 ffffffff 00af9b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT= ffff98fbfec0b000 0000007f
IDT= ffffffff96604000 000001ff
CR0=80050033 CR2=0000000000032990 CR3=0000000001bfa000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=0000000000000000 CCD=0000000000000000 CCO=SARQ
EFER=0000000000000d01
check_exception old: 0x8 new 0xd
Triple fault


2023-01-08 11:47:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

On Sun, Jan 08, 2023 at 03:07:48AM +0000, Joan Bruguera wrote:
> Wake-up from suspend to RAM seems broken under `retbleed=stuff`
> (the recently introduced call depth tracking mitigation, see:
> https://lore.kernel.org/lkml/[email protected]/t/)
> I can replicate it on both real hardware and QEMU (with and without KVM).
>
> It can replicate it by booting a fairly standard mainline kernel

Which version exactly?

> on QEMU with `init=/bin/bash` and then suspending to RAM with:
> echo "deep" > /sys/power/mem_sleep
> echo "mem" > /sys/power/state
> Then executing `system_wakeup` on the QEMU monitor causes the crash.

You probably need to share .config too because my tailored .config built with
latest tip/master works fine here.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-08 17:21:38

by Joan Bruguera Micó

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

Hi, sorry that my mail didn't have proper reproduction instructions.

Something important I missed is that on non-affected hardware / QEMU,
the kernel parameter must be `retbleed=stuff,force`, as otherwise
`retbleed=stuff` is a no-op.
In any case, lscpu (or /sys/devices/system/cpu/vulnerabilities/retbleed)
should contain "Mitigation: Stuffing".

That's most likely the missing piece as otherwise, all combinations I've
tested exhibit this behavior, so I assumed that the problem always
reproduces regardless of the config.

More details follow in case it still doesn't reproduce:

The real HW where I found the problem is an HP 250 G6 (Intel i5-7200U),
with a desktop environment (Arch Linux, systemd, Sway, etc.), with both:
* A recent mainline kernel: commit 0a71553536d270e988580a3daa9fc87535908221
* The latest x86-tip: commit 9c4fb147c3492fd4be1b89c22a4c333308f6f44a
The config I use is the following one, which is based on the
Arch Linux kernel config (and I leave any new options at their defaults):
https://aur.archlinux.org/cgit/aur.git/plain/config?h=linux-mainline&id=20ffc62e08f6b0d48a088bccb6e0c3606b88083a
I need to either disable the BPF LSM or apply this patch to boot it:
https://lore.kernel.org/lkml/[email protected]/
I suspend with `systemctl suspend`. Waking up hangs with `retbleed=stuff`
and works without it.

For verification I booted Fedora-Cloud-Base-Rawhide-20230107.n.0.x86_64.qcow2
(from https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Cloud):
qemu-system-x86_64 -nodefaults -enable-kvm -smp 1 -m 1024 -vga std \
-hda Fedora-Cloud-Base-Rawhide-20230107.n.0.x86_64.qcow2 \
-serial stdio -monitor tcp:127.0.0.1:55555,server,nowait
I appended `retbleed=stuff,force init=/bin/sh` to the kernel command line,
suspended through sysfs and then wrote `system_wakeup` on the QEMU monitor.

For debugging/tracing, I booted the Arch kernel with a busybox initrd:
qemu-system-x86_64 -nodefaults -no-reboot -initrd initramfs.gz \
-kernel /boot/vmlinuz-linux-x86tip \
-append "console=ttyS0 loglevel=4 retbleed=stuff,force" \
-serial stdio -monitor tcp:127.0.0.1:55555,server,nowait
Suspended through sysfs, enabled singlestep and logs in the QEMU monitor,
then did a system_wakeup.

Regards,
- Joan

2023-01-08 21:57:27

by Borislav Petkov

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

On Sun, Jan 08, 2023 at 05:11:59PM +0000, Joan Bruguera wrote:
> Arch Linux kernel config (and I leave any new options at their defaults):
> https://aur.archlinux.org/cgit/aur.git/plain/config?h=linux-mainline&id=20ffc62e08f6b0d48a088bccb6e0c3606b88083a

That .config works here.

> I need to either disable the BPF LSM or apply this patch to boot it:
> https://lore.kernel.org/lkml/[email protected]/
> I suspend with `systemctl suspend`. Waking up hangs with `retbleed=stuff`
> and works without it.

I booted another, more modern guest which has systemd to see whether a different
userspace could be the issue but it works too.

> For verification I booted Fedora-Cloud-Base-Rawhide-20230107.n.0.x86_64.qcow2
> (from https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Cloud):
> qemu-system-x86_64 -nodefaults -enable-kvm -smp 1 -m 1024 -vga std \
> -hda Fedora-Cloud-Base-Rawhide-20230107.n.0.x86_64.qcow2 \
> -serial stdio -monitor tcp:127.0.0.1:55555,server,nowait
> I appended `retbleed=stuff,force init=/bin/sh` to the kernel command line,
> suspended through sysfs and then wrote `system_wakeup` on the QEMU monitor.

I guess I'll try that next.

And maybe try that on an Intel machine - I'm using an AMD one but the ",force"
thing enables the same stuffing on AMD too even if it is not the fitting
mitigation for it.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-08 22:31:25

by Joan Bruguera Micó

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

I can also reproduce it on my Ryzen 5700G (both real and QEMU+KVM).
So it doesn't appear to be Intel or AMD specific.

Perhaps your BIOS/UEFI keeps the value of %gs or it points somewhere
valid after waking from suspend to RAM, and the ones I have don't?
That's a way I can imagine it not crashing on some HW.
If that's the case, you should still be able to reproduce it on QEMU.

Or it's also possible that the QEMU issue is different from the one on
real HW (though unlikely, I think)

2023-01-09 05:11:26

by Joan Bruguera Micó

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

This fixes wakeup for me on both QEMU and real HW
(just a proof of concept, don't merge)

diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
index ffea98f9064b..8704bcc0ce32 100644
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -7,6 +7,7 @@
#include <linux/memory.h>
#include <linux/moduleloader.h>
#include <linux/static_call.h>
+#include <linux/suspend.h>

#include <asm/alternative.h>
#include <asm/asm-offsets.h>
@@ -150,6 +151,10 @@ static bool skip_addr(void *dest)
if (dest >= (void *)hypercall_page &&
dest < (void*)hypercall_page + PAGE_SIZE)
return true;
+#endif
+#ifdef CONFIG_PM_SLEEP
+ if (dest == restore_processor_state)
+ return true;
#endif
return false;
}
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 236447ee9beb..e667894936f7 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -281,6 +281,9 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
/* Needed by apm.c */
void notrace restore_processor_state(void)
{
+ /* Restore GS before calling anything to avoid crash on call depth accounting */
+ native_wrmsrl(MSR_GS_BASE, saved_context.kernelmode_gs_base);
+
__restore_processor_state(&saved_context);
}
#ifdef CONFIG_X86_32

2023-01-11 11:27:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

On Mon, Jan 09, 2023 at 04:05:31AM +0000, Joan Bruguera wrote:
> This fixes wakeup for me on both QEMU and real HW
> (just a proof of concept, don't merge)
>
> diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
> index ffea98f9064b..8704bcc0ce32 100644
> --- a/arch/x86/kernel/callthunks.c
> +++ b/arch/x86/kernel/callthunks.c
> @@ -7,6 +7,7 @@
> #include <linux/memory.h>
> #include <linux/moduleloader.h>
> #include <linux/static_call.h>
> +#include <linux/suspend.h>
>
> #include <asm/alternative.h>
> #include <asm/asm-offsets.h>
> @@ -150,6 +151,10 @@ static bool skip_addr(void *dest)
> if (dest >= (void *)hypercall_page &&
> dest < (void*)hypercall_page + PAGE_SIZE)
> return true;
> +#endif
> +#ifdef CONFIG_PM_SLEEP
> + if (dest == restore_processor_state)
> + return true;
> #endif
> return false;
> }
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index 236447ee9beb..e667894936f7 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -281,6 +281,9 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
> /* Needed by apm.c */
> void notrace restore_processor_state(void)
> {
> + /* Restore GS before calling anything to avoid crash on call depth accounting */
> + native_wrmsrl(MSR_GS_BASE, saved_context.kernelmode_gs_base);
> +
> __restore_processor_state(&saved_context);
> }

Yeah, I can see why, but I'm not really comfortable with this. TBH, I
don't see how the whole resume code is correct to begin with. At the
very least it needs a heavy dose of noinstr.

Rafael, what cr3 is active when we call restore_processor_state()?

Specifically, the problem is that I don't feel comfortable doing any
sort of weird code until all the CR and segment registers have been
restored, however, write_cr*() are paravirt functions that result in
CALL, which then gives us a bit of a checken and egg problem.

I'm also wondering how well retbleed=stuff works on Xen, if at all. If
we can ignore Xen, things are a little earier perhaps.

2023-01-11 11:58:17

by Jan Beulich

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

On 11.01.2023 12:39, Andrew Cooper wrote:
> The bigger issue with stuff accounting is that nothing AFAICT accounts
> for the fact that any hypercall potentially empties the RSB in otherwise
> synchronous program flow.

But that's not just at hypercall boundaries, but effectively anywhere
(i.e. whenever the hypervisor decides to de-schedule the vCPU)?

Jan

2023-01-11 12:55:15

by Juergen Gross

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

On 11.01.23 12:20, Peter Zijlstra wrote:
> On Mon, Jan 09, 2023 at 04:05:31AM +0000, Joan Bruguera wrote:
>> This fixes wakeup for me on both QEMU and real HW
>> (just a proof of concept, don't merge)
>>
>> diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
>> index ffea98f9064b..8704bcc0ce32 100644
>> --- a/arch/x86/kernel/callthunks.c
>> +++ b/arch/x86/kernel/callthunks.c
>> @@ -7,6 +7,7 @@
>> #include <linux/memory.h>
>> #include <linux/moduleloader.h>
>> #include <linux/static_call.h>
>> +#include <linux/suspend.h>
>>
>> #include <asm/alternative.h>
>> #include <asm/asm-offsets.h>
>> @@ -150,6 +151,10 @@ static bool skip_addr(void *dest)
>> if (dest >= (void *)hypercall_page &&
>> dest < (void*)hypercall_page + PAGE_SIZE)
>> return true;
>> +#endif
>> +#ifdef CONFIG_PM_SLEEP
>> + if (dest == restore_processor_state)
>> + return true;
>> #endif
>> return false;
>> }
>> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
>> index 236447ee9beb..e667894936f7 100644
>> --- a/arch/x86/power/cpu.c
>> +++ b/arch/x86/power/cpu.c
>> @@ -281,6 +281,9 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
>> /* Needed by apm.c */
>> void notrace restore_processor_state(void)
>> {
>> + /* Restore GS before calling anything to avoid crash on call depth accounting */
>> + native_wrmsrl(MSR_GS_BASE, saved_context.kernelmode_gs_base);
>> +
>> __restore_processor_state(&saved_context);
>> }
>
> Yeah, I can see why, but I'm not really comfortable with this. TBH, I
> don't see how the whole resume code is correct to begin with. At the
> very least it needs a heavy dose of noinstr.
>
> Rafael, what cr3 is active when we call restore_processor_state()?
>
> Specifically, the problem is that I don't feel comfortable doing any
> sort of weird code until all the CR and segment registers have been
> restored, however, write_cr*() are paravirt functions that result in
> CALL, which then gives us a bit of a checken and egg problem.

Under Xen the hypervisor should do all needed low level stuff when
resuming. Even dom0 is just a guest, so I don't see a problem with
PV guests. CR and segment registers should be in the vcpu structure(s)
in Xen and they should be restored by Xen when activating the vcpus
after suspend. There shouldn't be any need to do this again in the
kernel.

I have verified that the suspend path when running as Xen PV guest
is NOT calling restore_processor_state().

> I'm also wondering how well retbleed=stuff works on Xen, if at all. If
> we can ignore Xen, things are a little earier perhaps.

In restore_processor_state() you should be able to ignore Xen, unless
there are other code paths calling it (kexec can be ignored, too, as
that isn't supported in Xen PV, and suspend to disk shouldn't work
either).


Juergen


Attachments:
OpenPGP_0xB0DE9DD628BF132F.asc (3.08 kB)
OpenPGP public key
OpenPGP_signature (505.00 B)
OpenPGP digital signature
Download all attachments

2023-01-11 13:14:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

On Wed, Jan 11, 2023 at 12:52:41PM +0100, Juergen Gross wrote:
> I have verified that the suspend path when running as Xen PV guest
> is NOT calling restore_processor_state().

Oh excellent, that makes things a lot simpler, thanks!

2023-01-11 18:08:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Wake-up from suspend to RAM broken under `retbleed=stuff`

On Wed, Jan 11, 2023 at 12:20 PM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Jan 09, 2023 at 04:05:31AM +0000, Joan Bruguera wrote:
> > This fixes wakeup for me on both QEMU and real HW
> > (just a proof of concept, don't merge)
> >
> > diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
> > index ffea98f9064b..8704bcc0ce32 100644
> > --- a/arch/x86/kernel/callthunks.c
> > +++ b/arch/x86/kernel/callthunks.c
> > @@ -7,6 +7,7 @@
> > #include <linux/memory.h>
> > #include <linux/moduleloader.h>
> > #include <linux/static_call.h>
> > +#include <linux/suspend.h>
> >
> > #include <asm/alternative.h>
> > #include <asm/asm-offsets.h>
> > @@ -150,6 +151,10 @@ static bool skip_addr(void *dest)
> > if (dest >= (void *)hypercall_page &&
> > dest < (void*)hypercall_page + PAGE_SIZE)
> > return true;
> > +#endif
> > +#ifdef CONFIG_PM_SLEEP
> > + if (dest == restore_processor_state)
> > + return true;
> > #endif
> > return false;
> > }
> > diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> > index 236447ee9beb..e667894936f7 100644
> > --- a/arch/x86/power/cpu.c
> > +++ b/arch/x86/power/cpu.c
> > @@ -281,6 +281,9 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
> > /* Needed by apm.c */
> > void notrace restore_processor_state(void)
> > {
> > + /* Restore GS before calling anything to avoid crash on call depth accounting */
> > + native_wrmsrl(MSR_GS_BASE, saved_context.kernelmode_gs_base);
> > +
> > __restore_processor_state(&saved_context);
> > }
>
> Yeah, I can see why, but I'm not really comfortable with this. TBH, I
> don't see how the whole resume code is correct to begin with. At the
> very least it needs a heavy dose of noinstr.
>
> Rafael, what cr3 is active when we call restore_processor_state()?

On resume from suspend-to-RAM, the one that was saved by
__save_processor_state() AFAICS.

On resume from hibernation, it looks like this is the one that was
used by the restore kernel.

> Specifically, the problem is that I don't feel comfortable doing any
> sort of weird code until all the CR and segment registers have been
> restored, however, write_cr*() are paravirt functions that result in
> CALL, which then gives us a bit of a checken and egg problem.
>
> I'm also wondering how well retbleed=stuff works on Xen, if at all. If
> we can ignore Xen, things are a little earier perhaps.