2021-12-13 04:22:06

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH 0/3] x86/entry: Fix 3 suspicious bugs

From: Lai Jiangshan <[email protected]>

The problems described in patch1/patch2 can only happen when the system
owner is really paranoid.

For patch3, I hardly believe #VC can hit in the code range returining
to user without implementing more SEV features.

Lai Jiangshan (3):
X86/db: Change __this_cpu_read() to this_cpu_read() in
hw_breakpoint_active()
x86/hw_breakpoint: Add stack_canary to hw_breakpoints denylist
x86/sev: The code for returning to user space is also in syscall gap

arch/x86/entry/entry_64.S | 2 ++
arch/x86/entry/entry_64_compat.S | 2 ++
arch/x86/include/asm/debugreg.h | 2 +-
arch/x86/include/asm/proto.h | 4 ++++
arch/x86/include/asm/ptrace.h | 4 ++++
arch/x86/kernel/hw_breakpoint.c | 8 ++++++++
6 files changed, 21 insertions(+), 1 deletion(-)

--
2.19.1.6.gb485710b



2021-12-13 04:22:17

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH 1/3] X86/db: Change __this_cpu_read() to this_cpu_read() in hw_breakpoint_active()

From: Lai Jiangshan <[email protected]>

__this_cpu_read() can not be instrumented except its own debugging code
when CONFIG_DEBUG_PREEMPT. The debugging code will call
__this_cpu_preempt_check(). __this_cpu_preempt_check() itself is also
noinstr, so __this_cpu_read() can be used in noinstr.

But these is one exception when exc_debug_kernel() calls local_db_save()
which calls hw_breakpoint_active() which calls __this_cpu_read(). If
the data accessed by __this_cpu_preempt_check() is also watched by
hw_breakpoints, it would cause recursive #DB.

this_cpu_read() in X86 is also non instrumentable, and it doesn't access
to any extra data except the percpu cpu_dr7, and cpu_dr7 is disallowed
to be watched in arch_build_bp_info(). So this_cpu_read() is safe to
be used when hw_breakpoints is still active, and __this_cpu_read() here
should be changed to this_cpu_read().

This problem can only happen when the system owner uses a kernel with
CONFIG_DEBUG_PREEMPT enabled and deliberately use hw_breakpoints on
the data that __this_cpu_preempt_check() accesses. Sot it is just a
problem with no significance.

One might suggest that, all the data accessed by noinstr functions
should be marked in denylist for hw_breakpoints. That would complexify
the noinstrment framework and add hurdles to anyone that who want to
add a new noinstr function. All we need is to suppress #DB in the IST
interrupt entry path until safe place where #DB is disabled in hardware
or #DB handler can handle well even it hits data accessed by noinstr
function. Changing __this_cpu_read() to this_cpu_read() is fit for it.

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/include/asm/debugreg.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index cfdf307ddc01..20189ce41578 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -87,7 +87,7 @@ static inline void hw_breakpoint_disable(void)

static __always_inline bool hw_breakpoint_active(void)
{
- return __this_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;
+ return this_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;
}

extern void hw_breakpoint_restore(void);
--
2.19.1.6.gb485710b


2021-12-13 04:22:28

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH 2/3] x86/hw_breakpoint: Add stack_canary to hw_breakpoints denylist

From: Lai Jiangshan <[email protected]>

When stack-protector is enabled, entry functions may access
to the stack_canary.

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/kernel/hw_breakpoint.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index 668a4a6533d9..b2b64afdf9c0 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -315,6 +315,14 @@ static inline bool within_cpu_entry(unsigned long addr, unsigned long end)
if (within_area(addr, end, (unsigned long)&per_cpu(cpu_dr7, cpu),
sizeof(cpu_dr7)))
return true;
+
+ /*
+ * When stack-protector is enabled, entry functions may access
+ * to the stack_canary.
+ */
+ if (within_area(addr, end, (unsigned long)&per_cpu(fixed_percpu_data, cpu),
+ sizeof(struct fixed_percpu_data)))
+ return true;
}

return false;
--
2.19.1.6.gb485710b


2021-12-13 04:22:40

by Lai Jiangshan

[permalink] [raw]
Subject: [PATCH 3/3] x86/sev: The code for returning to user space is also in syscall gap

From: Lai Jiangshan <[email protected]>

When returning to user space, the %rsp is user controlled value.

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 2 ++
arch/x86/entry/entry_64_compat.S | 2 ++
arch/x86/include/asm/proto.h | 4 ++++
arch/x86/include/asm/ptrace.h | 4 ++++
4 files changed, 12 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e23319ad3f42..44dadea935f7 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -213,8 +213,10 @@ syscall_return_via_sysret:

popq %rdi
popq %rsp
+SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
swapgs
sysretq
+SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
SYM_CODE_END(entry_SYSCALL_64)

/*
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 0051cf5c792d..98afdf92f360 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -293,6 +293,7 @@ sysret32_from_system_call:
* code. We zero R8-R10 to avoid info leaks.
*/
movq RSP-ORIG_RAX(%rsp), %rsp
+SYM_INNER_LABEL(entry_SYSRETL_compat_unsafe_stack, SYM_L_GLOBAL)

/*
* The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
@@ -310,6 +311,7 @@ sysret32_from_system_call:
xorl %r10d, %r10d
swapgs
sysretl
+SYM_INNER_LABEL(entry_SYSRETL_compat_end, SYM_L_GLOBAL)
SYM_CODE_END(entry_SYSCALL_compat)

/*
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index feed36d44d04..f042cfc9938f 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -13,6 +13,8 @@ void syscall_init(void);
#ifdef CONFIG_X86_64
void entry_SYSCALL_64(void);
void entry_SYSCALL_64_safe_stack(void);
+void entry_SYSRETQ_unsafe_stack(void);
+void entry_SYSRETQ_end(void);
long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2);
#endif

@@ -28,6 +30,8 @@ void entry_SYSENTER_compat(void);
void __end_entry_SYSENTER_compat(void);
void entry_SYSCALL_compat(void);
void entry_SYSCALL_compat_safe_stack(void);
+void entry_SYSRETL_compat_unsafe_stack(void);
+void entry_SYSRETL_compat_end(void);
void entry_INT80_compat(void);
#ifdef CONFIG_XEN_PV
void xen_entry_INT80_compat(void);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 703663175a5a..b3d2ba13cee2 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -186,9 +186,13 @@ static __always_inline bool ip_within_syscall_gap(struct pt_regs *regs)
bool ret = (regs->ip >= (unsigned long)entry_SYSCALL_64 &&
regs->ip < (unsigned long)entry_SYSCALL_64_safe_stack);

+ ret = ret || (regs->ip >= (unsigned long)entry_SYSRETQ_unsafe_stack &&
+ regs->ip < (unsigned long)entry_SYSRETQ_end);
#ifdef CONFIG_IA32_EMULATION
ret = ret || (regs->ip >= (unsigned long)entry_SYSCALL_compat &&
regs->ip < (unsigned long)entry_SYSCALL_compat_safe_stack);
+ ret = ret || (regs->ip >= (unsigned long)entry_SYSRETL_compat_unsafe_stack &&
+ regs->ip < (unsigned long)entry_SYSRETL_compat_end);
#endif

return ret;
--
2.19.1.6.gb485710b


2021-12-13 19:09:27

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/3] X86/db: Change __this_cpu_read() to this_cpu_read() in hw_breakpoint_active()

On Mon, Dec 13, 2021 at 12:22:13PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>

Lai, what you're touching is complex stuff. If your commit messages
are hard to parse then that makes reviewing those patches not a fun
experience. Not in the least.

So please try to restrain yourself and write proper English. Run your
commit message through a spellchecker at least so that there are no
funky words.

Also, read Documentation/process/submitting-patches.rst for hints on
how to write it.

The structure and the explanation is in the right direction but please
try to formulate them as understandable as possible.

> __this_cpu_read() can not be instrumented except its own debugging code
> when CONFIG_DEBUG_PREEMPT. The debugging code will call
> __this_cpu_preempt_check(). __this_cpu_preempt_check() itself is also
> noinstr, so __this_cpu_read() can be used in noinstr.
>
> But these is one exception when exc_debug_kernel() calls local_db_save()
> which calls hw_breakpoint_active() which calls __this_cpu_read(). If
> the data accessed by __this_cpu_preempt_check() is also watched by
> hw_breakpoints, it would cause recursive #DB.

Up until here is good.

> this_cpu_read() in X86 is also non instrumentable, and it doesn't access

"x86" not "X86" or any other way.

Also, read this: Documentation/process/maintainer-tip.rst

as it has more hints about commit message structure etc.

> to any extra data except the percpu cpu_dr7, and cpu_dr7 is disallowed
> to be watched in arch_build_bp_info(). So this_cpu_read() is safe to
> be used when hw_breakpoints is still active, and __this_cpu_read() here
> should be changed to this_cpu_read().
>
> This problem can only happen when the system owner uses a kernel with
> CONFIG_DEBUG_PREEMPT enabled and deliberately use hw_breakpoints on
> the data that __this_cpu_preempt_check() accesses. Sot it is just a
> problem with no significance.
>
> One might suggest that, all the data accessed by noinstr functions
> should be marked in denylist for hw_breakpoints. That would complexify

should be marked in denylist for hw_breakpoints. That would complexify
Unknown word [denylist] in commit message, suggestions:
['deny list', 'deny-list', 'dentistry']

should be marked in denylist for hw_breakpoints. That would complexify
Unknown word [complexify] in commit message, suggestions:
['complexity', 'complexion']

> the noinstrment framework and add hurdles to anyone that who want to

the noinstrment framework and add hurdles to anyone that who want to
Unknown word [noinstrment] in commit message, suggestions:
['instrument']

So you need to restrain yourself and stop inventing new English words.

> add a new noinstr function. All we need is to suppress #DB in the IST

Who is "we"?

> interrupt entry path until safe place where #DB is disabled in hardware
> or #DB handler can handle well even it hits data accessed by noinstr
> function. Changing __this_cpu_read() to this_cpu_read() is fit for it.

You don't need to write *what* your patch is doing - that is clear from
the diff.

I'll let Peter comment on what should actually be used.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-12-13 19:46:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/3] X86/db: Change __this_cpu_read() to this_cpu_read() in hw_breakpoint_active()

On Mon, Dec 13, 2021 at 12:22:13PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> __this_cpu_read() can not be instrumented except its own debugging code
> when CONFIG_DEBUG_PREEMPT. The debugging code will call
> __this_cpu_preempt_check(). __this_cpu_preempt_check() itself is also
> noinstr, so __this_cpu_read() can be used in noinstr.
>
> But these is one exception when exc_debug_kernel() calls local_db_save()
> which calls hw_breakpoint_active() which calls __this_cpu_read(). If
> the data accessed by __this_cpu_preempt_check() is also watched by
> hw_breakpoints, it would cause recursive #DB.
>
> this_cpu_read() in X86 is also non instrumentable, and it doesn't access
> to any extra data except the percpu cpu_dr7, and cpu_dr7 is disallowed
> to be watched in arch_build_bp_info(). So this_cpu_read() is safe to
> be used when hw_breakpoints is still active, and __this_cpu_read() here
> should be changed to this_cpu_read().
>
> This problem can only happen when the system owner uses a kernel with
> CONFIG_DEBUG_PREEMPT enabled and deliberately use hw_breakpoints on
> the data that __this_cpu_preempt_check() accesses. Sot it is just a
> problem with no significance.
>
> One might suggest that, all the data accessed by noinstr functions
> should be marked in denylist for hw_breakpoints. That would complexify
> the noinstrment framework and add hurdles to anyone that who want to
> add a new noinstr function. All we need is to suppress #DB in the IST
> interrupt entry path until safe place where #DB is disabled in hardware
> or #DB handler can handle well even it hits data accessed by noinstr
> function. Changing __this_cpu_read() to this_cpu_read() is fit for it.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/include/asm/debugreg.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
> index cfdf307ddc01..20189ce41578 100644
> --- a/arch/x86/include/asm/debugreg.h
> +++ b/arch/x86/include/asm/debugreg.h
> @@ -87,7 +87,7 @@ static inline void hw_breakpoint_disable(void)
>
> static __always_inline bool hw_breakpoint_active(void)
> {
> - return __this_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;
> + return this_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;

I don't really follow the argument for why this_cpu_read(); why not
raw_cpu_read() instead, which is what __this_cpu_read() is based on.
Also, this really needs a comment.

Alternatively, we should remove noinstr from check_preemption_disabled()
and fix up all the fallout, but that seems like far more work than it's
worth.

/*
* Must not hit a breakpoint in check_preempt_disabled()
*/
return raw_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;

> }
>
> extern void hw_breakpoint_restore(void);
> --
> 2.19.1.6.gb485710b
>

2021-12-13 19:57:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/3] x86/hw_breakpoint: Add stack_canary to hw_breakpoints denylist

On Mon, Dec 13, 2021 at 12:22:14PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> When stack-protector is enabled, entry functions may access
> to the stack_canary.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/kernel/hw_breakpoint.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
> index 668a4a6533d9..b2b64afdf9c0 100644
> --- a/arch/x86/kernel/hw_breakpoint.c
> +++ b/arch/x86/kernel/hw_breakpoint.c
> @@ -315,6 +315,14 @@ static inline bool within_cpu_entry(unsigned long addr, unsigned long end)
> if (within_area(addr, end, (unsigned long)&per_cpu(cpu_dr7, cpu),
> sizeof(cpu_dr7)))
> return true;
> +
> + /*
> + * When stack-protector is enabled, entry functions may access
> + * to the stack_canary.
> + */
> + if (within_area(addr, end, (unsigned long)&per_cpu(fixed_percpu_data, cpu),
> + sizeof(struct fixed_percpu_data)))
> + return true;

Isn't fixed_percpu_data x86_64 only?

> }
>
> return false;

2021-12-14 02:51:36

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 1/3] X86/db: Change __this_cpu_read() to this_cpu_read() in hw_breakpoint_active()

Hello

On Tue, Dec 14, 2021 at 3:09 AM Borislav Petkov <[email protected]> wrote:

> So please try to restrain yourself and write proper English. Run your
> commit message through a spellchecker at least so that there are no
> funky words.

The commit message was checked via VIM spellchecker. It did highlight
denylist, noinstr, noinstrument, complexify, and a lot more.

There are too many false-negative results from VIM spellchecker, and
I searched denylist, complexify via google and they are used by some
other places so I kept them.

I'm sorry for not searching in the kernel tree to find a proper
word for noinstrument, not searching the web for better words for
denylist, complexify.

I will change a spellchecker and improve my English.


>
> > to any extra data except the percpu cpu_dr7, and cpu_dr7 is disallowed
> > to be watched in arch_build_bp_info(). So this_cpu_read() is safe to
> > be used when hw_breakpoints is still active, and __this_cpu_read() here
> > should be changed to this_cpu_read().
> >
> > This problem can only happen when the system owner uses a kernel with
> > CONFIG_DEBUG_PREEMPT enabled and deliberately use hw_breakpoints on
> > the data that __this_cpu_preempt_check() accesses. Sot it is just a
> > problem with no significance.
> >
> > One might suggest that, all the data accessed by noinstr functions
> > should be marked in denylist for hw_breakpoints. That would complexify
>
> should be marked in denylist for hw_breakpoints. That would complexify
> Unknown word [denylist] in commit message, suggestions:
> ['deny list', 'deny-list', 'dentistry']
>
> should be marked in denylist for hw_breakpoints. That would complexify
> Unknown word [complexify] in commit message, suggestions:
> ['complexity', 'complexion']
>
> > the noinstrment framework and add hurdles to anyone that who want to
>
> the noinstrment framework and add hurdles to anyone that who want to
> Unknown word [noinstrment] in commit message, suggestions:
> ['instrument']
>
> So you need to restrain yourself and stop inventing new English words.
>
> > add a new noinstr function. All we need is to suppress #DB in the IST
>
> Who is "we"?
>
> > interrupt entry path until safe place where #DB is disabled in hardware
> > or #DB handler can handle well even it hits data accessed by noinstr
> > function. Changing __this_cpu_read() to this_cpu_read() is fit for it.
>
> You don't need to write *what* your patch is doing - that is clear from
> the diff.

What I wanted to say in this paragraph is that why I chose this way to fix
it since there are several ways/policies to fix it.

"Changing __this_cpu_read() to this_cpu_read() is fit for" this policy.

I don't think it can be seen in the diff.

> I don't really follow the argument for why this_cpu_read(); why not
> raw_cpu_read() instead, which is what __this_cpu_read() is based on.
> Also, this really needs a comment.

Yes, raw_cpu_read() is better.

Some other places in noinstr function use this_cpu_read(), so I did
not search if there is a better alternative. I just reviewed the
definition of this_cpu_read() and concluded that it can be used.

> /*
> * Must not hit a breakpoint in check_preempt_disabled()
> */
> return raw_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;

Although, this comment is describing raw_cpu_read() obviously, I often
can't get which code is a comment in other places referring to due
to later changes with new code added and removed.

Can I duplicate the code in the comments?
Use raw_cpu_read() instead of __this_cpu_read() to avoid hitting
a breakpoint in check_preempt_disabled().

Thanks
Lai

2021-12-14 09:33:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/3] X86/db: Change __this_cpu_read() to this_cpu_read() in hw_breakpoint_active()

On Tue, Dec 14, 2021 at 10:51:23AM +0800, Lai Jiangshan wrote:
> The commit message was checked via VIM spellchecker. It did highlight
> denylist, noinstr, noinstrument, complexify, and a lot more.
>
> There are too many false-negative results from VIM spellchecker, and

I don't know how your vim is configured but my vim spellchecker
highlights only those words which I mentioned.

> I searched denylist, complexify via google and they are used by some
> other places so I kept them.

And this is where the problem is: you can't just search for words on the
net and whether someone used them or created them and then assume that
the reader would know them and understand what you mean.

Writing commit messages is not an exercise in creative writing. Rather,
your commit messages must be *maximally* *understandable* to the
reviewer so that she/he doesn't have to

a) decipher your commit message
b) decipher your diff

because that's twice the work.

So the goal is to explain why you're doing a change in the clearest way
possible - not do fancy.

Also, even if you use known words, there's this other problem with
formulation: a sentence full of only correct words doesn't make it
understandable to others. So try to stick to simple, even boring
formulations - you can be sure the reader would know what you mean.

> I'm sorry for not searching in the kernel tree to find a proper
> word for noinstrument, not searching the web for better words for
> denylist, complexify.
>
> I will change a spellchecker and improve my English.

Thanks for the effort!

> What I wanted to say in this paragraph is that why I chose this way to fix
> it since there are several ways/policies to fix it.
>
> "Changing __this_cpu_read() to this_cpu_read() is fit for" this policy.
>
> I don't think it can be seen in the diff.

What "policy"? Fit for what?

This is what I mean: the formulation sounds weird and it makes me wonder
what you're *actually* trying to say?

> > I don't really follow the argument for why this_cpu_read();

See, Peter has a hard time understanding your reasoning either.

> > /*
> > * Must not hit a breakpoint in check_preempt_disabled()
> > */
> > return raw_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;
>
> Although, this comment is describing raw_cpu_read() obviously, I often
> can't get which code is a comment in other places referring to due
> to later changes with new code added and removed.

Each comment must belong to the code it comments - otherwise it needs to
be fixed/removed.

In this particular case, it could be over hw_breakpoint_active() too.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-12-14 21:51:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 3/3] x86/sev: The code for returning to user space is also in syscall gap

+ Tom and leaving the whole mail un-trimmed for him.

On Mon, Dec 13, 2021 at 12:22:15PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <[email protected]>
>
> When returning to user space, the %rsp is user controlled value.

And?

I'd expect to see here some text analyzing the couple of instructions
between those new labels you've added and whether a #VC can happen
there.

> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 2 ++
> arch/x86/entry/entry_64_compat.S | 2 ++
> arch/x86/include/asm/proto.h | 4 ++++
> arch/x86/include/asm/ptrace.h | 4 ++++
> 4 files changed, 12 insertions(+)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index e23319ad3f42..44dadea935f7 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -213,8 +213,10 @@ syscall_return_via_sysret:
>
> popq %rdi
> popq %rsp
> +SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
> swapgs
> sysretq
> +SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
> SYM_CODE_END(entry_SYSCALL_64)
>
> /*
> diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
> index 0051cf5c792d..98afdf92f360 100644
> --- a/arch/x86/entry/entry_64_compat.S
> +++ b/arch/x86/entry/entry_64_compat.S
> @@ -293,6 +293,7 @@ sysret32_from_system_call:
> * code. We zero R8-R10 to avoid info leaks.
> */
> movq RSP-ORIG_RAX(%rsp), %rsp
> +SYM_INNER_LABEL(entry_SYSRETL_compat_unsafe_stack, SYM_L_GLOBAL)
>
> /*
> * The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
> @@ -310,6 +311,7 @@ sysret32_from_system_call:
> xorl %r10d, %r10d
> swapgs
> sysretl
> +SYM_INNER_LABEL(entry_SYSRETL_compat_end, SYM_L_GLOBAL)
> SYM_CODE_END(entry_SYSCALL_compat)
>
> /*
> diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
> index feed36d44d04..f042cfc9938f 100644
> --- a/arch/x86/include/asm/proto.h
> +++ b/arch/x86/include/asm/proto.h
> @@ -13,6 +13,8 @@ void syscall_init(void);
> #ifdef CONFIG_X86_64
> void entry_SYSCALL_64(void);
> void entry_SYSCALL_64_safe_stack(void);
> +void entry_SYSRETQ_unsafe_stack(void);
> +void entry_SYSRETQ_end(void);
> long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2);
> #endif
>
> @@ -28,6 +30,8 @@ void entry_SYSENTER_compat(void);
> void __end_entry_SYSENTER_compat(void);
> void entry_SYSCALL_compat(void);
> void entry_SYSCALL_compat_safe_stack(void);
> +void entry_SYSRETL_compat_unsafe_stack(void);
> +void entry_SYSRETL_compat_end(void);
> void entry_INT80_compat(void);
> #ifdef CONFIG_XEN_PV
> void xen_entry_INT80_compat(void);
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index 703663175a5a..b3d2ba13cee2 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -186,9 +186,13 @@ static __always_inline bool ip_within_syscall_gap(struct pt_regs *regs)
> bool ret = (regs->ip >= (unsigned long)entry_SYSCALL_64 &&
> regs->ip < (unsigned long)entry_SYSCALL_64_safe_stack);
>
> + ret = ret || (regs->ip >= (unsigned long)entry_SYSRETQ_unsafe_stack &&
> + regs->ip < (unsigned long)entry_SYSRETQ_end);
> #ifdef CONFIG_IA32_EMULATION
> ret = ret || (regs->ip >= (unsigned long)entry_SYSCALL_compat &&
> regs->ip < (unsigned long)entry_SYSCALL_compat_safe_stack);
> + ret = ret || (regs->ip >= (unsigned long)entry_SYSRETL_compat_unsafe_stack &&
> + regs->ip < (unsigned long)entry_SYSRETL_compat_end);
> #endif
>
> return ret;
> --
> 2.19.1.6.gb485710b
>

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-12-17 10:06:02

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 3/3] x86/sev: The code for returning to user space is also in syscall gap

On Tue, Dec 14, 2021 at 10:51:27PM +0100, Borislav Petkov wrote:
> I'd expect to see here some text analyzing the couple of instructions
> between those new labels you've added and whether a #VC can happen
> there.

While it is true that the exit path can't trigger #VC right now, it will
be able with SNP.

So this change makes sense, either now or as part of the SNP-guest
patch-set.

Regards,

Joerg

2021-12-17 10:30:24

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 3/3] x86/sev: The code for returning to user space is also in syscall gap

On Fri, Dec 17, 2021 at 11:05:57AM +0100, Joerg Roedel wrote:
> While it is true that the exit path can't trigger #VC right now, it will
> be able with SNP.

How?

I audited the handful instructions in there and didn't find anything
that would cause a #VC...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-12-17 11:00:44

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 3/3] x86/sev: The code for returning to user space is also in syscall gap

On Fri, Dec 17, 2021 at 11:30:10AM +0100, Borislav Petkov wrote:
> I audited the handful instructions in there and didn't find anything
> that would cause a #VC...

If the hypervisor decides to mess with the code-page for this path
while a CPU is executing it. This will cause a #VC on that CPU and that
could hit in the syscall return path.

Regards,

--
J?rg R?del
[email protected]

SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 N?rnberg
Germany

(HRB 36809, AG N?rnberg)
Gesch?ftsf?hrer: Ivo Totev


2022-01-20 06:27:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 3/3] x86/sev: The code for returning to user space is also in syscall gap

On Fri, Dec 17, 2021 at 12:00:34PM +0100, Joerg Roedel wrote:
> On Fri, Dec 17, 2021 at 11:30:10AM +0100, Borislav Petkov wrote:
> > I audited the handful instructions in there and didn't find anything
> > that would cause a #VC...
>
> If the hypervisor decides to mess with the code-page for this path
> while a CPU is executing it. This will cause a #VC on that CPU and that
> could hit in the syscall return path.

So I added a CPUID on that return path:

@@ -213,8 +213,11 @@ syscall_return_via_sysret:

popq %rdi
popq %rsp
+ cpuid


It results in the splat below. I.e., we're on the VC2 stack. We've
landed there because:

* If entered from kernel-mode the return stack is validated first, and if it is
* not safe to use (e.g. because it points to the entry stack) the #VC handler
* will switch to a fall-back stack (VC2) and call a special handler function.

and what puts us there is, I think:

vc_switch_off_ist:
if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
info.type > STACK_TYPE_EXCEPTION_LAST)
sp = __this_cpu_ist_top_va(VC2);

but I need to stare at this more later to figure it all out properly.

[ 1.372783] Kernel panic - not syncing: Can't handle #VC exception from unsupported context: sp: 0xfffffe0000019f58, prev_sp: 0x7ffc79fd7e78, VC2 stack [0xfffffe0000018000:0xfffffe000001a000]
[ 1.374828] CPU: 0 PID: 1 Comm: init Not tainted 5.16.0+ #6
[ 1.375586] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 1.376553] Call Trace:
[ 1.377030] <#VC2>
[ 1.377462] dump_stack_lvl+0x48/0x5e
[ 1.378038] panic+0xfa/0x2c6
[ 1.378570] kernel_exc_vmm_communication+0x10e/0x160
[ 1.379275] asm_exc_vmm_communication+0x30/0x60
[ 1.379934] RIP: 0010:syscall_return_via_sysret+0x28/0x2a
[ 1.380669] Code: 00 00 41 5f 41 5e 41 5d 41 5c 5d 5b 5e 41 5a 41 59 41 58 58 5e 5a 5e 48 89 e7 65 48 8b 24 25 04 60 00 00 ff 77 28 ff 37 5f 5c <0f> a2 0f 01 f8 48 0f 07 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00
[ 1.384240] RSP: 0018:00007ffc79fd7e78 EFLAGS: 00010046
[ 1.384977] RAX: 00005555e4b80000 RBX: 0000000000000001 RCX: 00007fc2d978ac17
[ 1.385894] RDX: 0000000000000054 RSI: 00007fc2d9792e09 RDI: 0000000000000000
[ 1.386816] RBP: 00007fc2d97724e0 R08: 00007ffc79fd9fe7 R09: 00007fc2d979ae88
[ 1.387734] R10: 000000000000001c R11: 0000000000000246 R12: 00005555e448e040
[ 1.388647] R13: 000000000000000b R14: 0000000000000000 R15: 00007ffc79fd8119
[ 1.389559] </#VC2>
[ 1.391521] Kernel Offset: 0x7e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1.393015] ---[ end Kernel panic - not syncing: Can't handle #VC exception from unsupported context: sp: 0xfffffe0000019f58, prev_sp: 0x7ffc79fd7e78, VC2 stack [0xfffffe0000018000:0xfffffe000001a000] ]---


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-01-20 14:35:21

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 3/3] x86/sev: The code for returning to user space is also in syscall gap



On 2022/1/18 18:32, Borislav Petkov wrote:
> On Fri, Dec 17, 2021 at 12:00:34PM +0100, Joerg Roedel wrote:
>> On Fri, Dec 17, 2021 at 11:30:10AM +0100, Borislav Petkov wrote:
>>> I audited the handful instructions in there and didn't find anything
>>> that would cause a #VC...
>>
>> If the hypervisor decides to mess with the code-page for this path
>> while a CPU is executing it. This will cause a #VC on that CPU and that
>> could hit in the syscall return path.
>
> So I added a CPUID on that return path:
>
> @@ -213,8 +213,11 @@ syscall_return_via_sysret:
>
> popq %rdi
> popq %rsp
> + cpuid
>
>
> It results in the splat below. I.e., we're on the VC2 stack. We've
> landed there because:
>
> * If entered from kernel-mode the return stack is validated first, and if it is
> * not safe to use (e.g. because it points to the entry stack) the #VC handler
> * will switch to a fall-back stack (VC2) and call a special handler function.
>

Hello

Thanks for testing.

The log shows that the %rsp is 0x7ffc79fd7e78 before the #VC, which means the
userspace might not be malicious since it might not tamper the %rsp.

If the userspace is not malicious, there is nothing wrong with it when #VC is
on this gap.

If the userspace is malicious and misleads vc_switch_off_ist(), it would harm
the system.

For example, (I haven't test it, I am just imaging it,) if user %rsp were set
to be the kernel #NMI stack, #VC would keep running on #NMI stack, its stack
would be corrupted when a #NMI is delivered since #NMI is not masked. It would
be more dangerous if the hypervisor connives with the userspace.

I think ip_within_syscall_gap() was designed for avoid using userspace %rsp
albeit it misses the path returning to user space.

Thanks
Lai.

> and what puts us there is, I think:
>
> vc_switch_off_ist:
> if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
> info.type > STACK_TYPE_EXCEPTION_LAST)
> sp = __this_cpu_ist_top_va(VC2);
>
> but I need to stare at this more later to figure it all out properly.
>
> [ 1.372783] Kernel panic - not syncing: Can't handle #VC exception from unsupported context: sp: 0xfffffe0000019f58, prev_sp: 0x7ffc79fd7e78, VC2 stack [0xfffffe0000018000:0xfffffe000001a000]
> [ 1.374828] CPU: 0 PID: 1 Comm: init Not tainted 5.16.0+ #6
> [ 1.375586] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [ 1.376553] Call Trace:
> [ 1.377030] <#VC2>
> [ 1.377462] dump_stack_lvl+0x48/0x5e
> [ 1.378038] panic+0xfa/0x2c6
> [ 1.378570] kernel_exc_vmm_communication+0x10e/0x160
> [ 1.379275] asm_exc_vmm_communication+0x30/0x60
> [ 1.379934] RIP: 0010:syscall_return_via_sysret+0x28/0x2a
> [ 1.380669] Code: 00 00 41 5f 41 5e 41 5d 41 5c 5d 5b 5e 41 5a 41 59 41 58 58 5e 5a 5e 48 89 e7 65 48 8b 24 25 04 60 00 00 ff 77 28 ff 37 5f 5c <0f> a2 0f 01 f8 48 0f 07 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00
> [ 1.384240] RSP: 0018:00007ffc79fd7e78 EFLAGS: 00010046
> [ 1.384977] RAX: 00005555e4b80000 RBX: 0000000000000001 RCX: 00007fc2d978ac17
> [ 1.385894] RDX: 0000000000000054 RSI: 00007fc2d9792e09 RDI: 0000000000000000
> [ 1.386816] RBP: 00007fc2d97724e0 R08: 00007ffc79fd9fe7 R09: 00007fc2d979ae88
> [ 1.387734] R10: 000000000000001c R11: 0000000000000246 R12: 00005555e448e040
> [ 1.388647] R13: 000000000000000b R14: 0000000000000000 R15: 00007ffc79fd8119
> [ 1.389559] </#VC2>
> [ 1.391521] Kernel Offset: 0x7e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 1.393015] ---[ end Kernel panic - not syncing: Can't handle #VC exception from unsupported context: sp: 0xfffffe0000019f58, prev_sp: 0x7ffc79fd7e78, VC2 stack [0xfffffe0000018000:0xfffffe000001a000] ]---
>
>

2022-04-12 20:11:32

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH 3/3] x86/sev: The code for returning to user space is also in syscall gap

On Wed, Dec 15, 2021 at 5:51 AM Borislav Petkov <[email protected]> wrote:
>
> + Tom and leaving the whole mail un-trimmed for him.
>
> On Mon, Dec 13, 2021 at 12:22:15PM +0800, Lai Jiangshan wrote:
> > From: Lai Jiangshan <[email protected]>
> >
> > When returning to user space, the %rsp is user controlled value.
>
> And?
>
> I'd expect to see here some text analyzing the couple of instructions
> between those new labels you've added and whether a #VC can happen
> there.


Hello, Borislav

I resent the patch with an updated changelog mainly copied from
Joerg's reply.

https://lore.kernel.org/lkml/[email protected]/


The other two patches in this patchset are omitted since they
make less sense which only harms only when the system owner
is deliberately doing stupid things.

thanks
Lai