2022-10-28 14:25:10

by Kirill A. Shutemov

Subject: [PATCH 0/2] x86/tdx: Enforce no #VE on private memory accesses

As described in 9a22bf6debbf ("x86/traps: Add #VE support for TDX
guest"), the kernel relies on "no #VE on access to private memory" to
keep the guest secure from attacks against the syscall gap or NMI entry
code.

The SEPT_VE_DISABLE TD attribute controls the TDX module's behaviour on
EPT violations.

The attribute must be set to avoid #VE. Refuse to boot the guest if it
is not.

Kirill A. Shutemov (1):
x86/tdx: Do not allow #VE due to EPT violation on the private memory

Kuppuswamy Sathyanarayanan (1):
x86/tdx: Extract GET_INFO call from get_cc_mask()

arch/x86/coco/tdx/tdx.c | 74 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 69 insertions(+), 5 deletions(-)

--
2.38.0



2022-10-28 15:07:37

by Kirill A. Shutemov

Subject: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions such as using specific instructions or accessing
a specific MSR.

A notable reason for #VE is an access to specific guest physical
addresses. It requires special security considerations as it is not
fully under the control of the guest kernel: the VMM can remove a page
from the EPT page table and trigger a #VE on access.

The primary use-case for #VE on a memory access is MMIO: the VMM removes
a page from the EPT to trigger an exception in the guest, which allows
the guest to emulate MMIO with hypercalls.

MMIO only happens on shared memory. All conventional kernel memory is
private. This includes everything from kernel stacks to kernel text.

Handling exceptions on arbitrary accesses to kernel memory is
essentially impossible as handling #VE may require access to memory
that also triggers the exception.

The TDX module provides a mechanism to disable #VE delivery on access to
private memory. If the SEPT_VE_DISABLE TD attribute is set, a private
EPT violation will not be reflected into the guest as a #VE, but will
trigger an exit to the VMM.

Make sure the attribute is set by the VMM. Panic otherwise.

There's a small window during boot, before the check, where the kernel
has an early #VE handler. But that handler only covers port I/O and
panics as soon as it sees any other #VE reason.

SEPT_VE_DISABLE makes a SEPT violation unrecoverable, and terminating
the TD is the only option.

The kernel has no legitimate use-cases for #VE on private memory. It is
either a guest kernel bug (like an access to unaccepted memory) or a
malicious/buggy VMM that removes a guest page that is still in use.

In both cases terminating the TD is the right thing to do.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest")
Cc: [email protected] # v5.19
---
arch/x86/coco/tdx/tdx.c | 49 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 343d60853b71..a376a0c3fddc 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -34,6 +34,9 @@
#define VE_GET_PORT_NUM(e) ((e) >> 16)
#define VE_IS_IO_STRING(e) ((e) & BIT(4))

+/* TD Attributes */
+#define ATTR_SEPT_VE_DISABLE BIT(28)
+
/* Caches GPA width from TDG.VP.INFO TDCALL */
static unsigned int gpa_width __ro_after_init;

@@ -770,6 +773,52 @@ void __init tdx_early_init(void)
*/
tdx_parse_tdinfo();

+ /*
+ * Do not allow #VE due to EPT violation on the private memory
+ *
+ * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+ * specific guest actions such as using specific instructions or
+ * accessing a specific MSR.
+ *
+ * Notable reason for #VE is access to specific guest physical
+ * addresses. It requires special security considerations as it is not
+ * fully in control of the guest kernel. VMM can remove a page from EPT
+ * page table and trigger #VE on access.
+ *
+ * The primary use-case for #VE on a memory access is MMIO: VMM removes
+ * page from EPT to trigger exception in the guest which allows guest to
+ * emulate MMIO with hypercalls.
+ *
+ * MMIO only happens on shared memory. All conventional kernel memory is
+ * private. This includes everything from kernel stacks to kernel text.
+ *
+ * Handling exceptions on arbitrary accesses to kernel memory is
+ * essentially impossible as handling #VE may require access to memory
+ * that also triggers the exception.
+ *
+ * TDX module provides mechanism to disable #VE delivery on access to
+ * private memory. If SEPT_VE_DISABLE TD attribute is set, private EPT
+ * violation will not be reflected to the guest as #VE, but will trigger
+ * exit to VMM.
+ *
+ * Make sure the attribute is set by VMM. Panic otherwise.
+ *
+ * There's small window during the boot before the check where kernel has
+ * early #VE handler. But the handler is only for port I/O and panic as
+ * soon as it sees any other #VE reason.
+ *
+ * SEPT_VE_DISABLE makes SEPT violation unrecoverable and terminating
+ * the TD is the only option.
+ *
+ * Kernel has no legitimate use-cases for #VE on private memory. It is
+ * either a guest kernel bug (like access of unaccepted memory) or
+ * malicious/buggy VMM that removes guest page that is still in use.
+ *
+ * In both cases terminating TD is the right thing to do.
+ */
+ if (!(td_attr & ATTR_SEPT_VE_DISABLE))
+ panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");
+
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

cc_set_vendor(CC_VENDOR_INTEL);
--
2.38.0


2022-10-28 15:07:43

by Kirill A. Shutemov

Subject: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

From: Kuppuswamy Sathyanarayanan <[email protected]>

TDCALL [TDG.VP.INFO] is used to get details like gpa_width,
TD attributes, etc.

So far only gpa_width was needed to enumerate the shared bit, but
upcoming changes will need TD attributes too.

Extract the GET_INFO call out of get_cc_mask() into a separate helper
and save the attributes.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 928dcf7a20d9..343d60853b71 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -34,6 +34,12 @@
#define VE_GET_PORT_NUM(e) ((e) >> 16)
#define VE_IS_IO_STRING(e) ((e) & BIT(4))

+/* Caches GPA width from TDG.VP.INFO TDCALL */
+static unsigned int gpa_width __ro_after_init;
+
+/* Caches TD Attributes from TDG.VP.INFO TDCALL */
+static u64 td_attr __ro_after_init;
+
/*
* Wrapper for standard use of __tdx_hypercall with no output aside from
* return code.
@@ -98,17 +104,16 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
}

-static u64 get_cc_mask(void)
+static void tdx_parse_tdinfo(void)
{
struct tdx_module_output out;
- unsigned int gpa_width;

/*
* TDINFO TDX module call is used to get the TD execution environment
* information like GPA width, number of available vcpus, debug mode
- * information, etc. More details about the ABI can be found in TDX
- * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
- * [TDG.VP.INFO].
+ * information, TD attributes etc. More details about the ABI can be
+ * found in TDX Guest-Host-Communication Interface (GHCI), section
+ * 2.4.2 TDCALL [TDG.VP.INFO].
*
* The GPA width that comes out of this call is critical. TDX guests
* can not meaningfully run without it.
@@ -116,7 +121,11 @@ static u64 get_cc_mask(void)
tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);

gpa_width = out.rcx & GENMASK(5, 0);
+ td_attr = out.rdx;
+}

+static u64 get_cc_mask(void)
+{
/*
* The highest bit of a guest physical address is the "sharing" bit.
* Set it for shared pages and clear it for private pages.
@@ -755,6 +764,12 @@ void __init tdx_early_init(void)
if (memcmp(TDX_IDENT, sig, sizeof(sig)))
return;

+ /*
+ * Initializes gpa_width and td_attr. Must be called before
+ * get_cc_mask() or attribute checks.
+ */
+ tdx_parse_tdinfo();
+
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

cc_set_vendor(CC_VENDOR_INTEL);
--
2.38.0


2022-10-28 16:07:11

by Dave Hansen

Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

On 10/28/22 07:12, Kirill A. Shutemov wrote:
> arch/x86/coco/tdx/tdx.c | 49 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 49 insertions(+)

The patch is good, but I'm not crazy about the changelog or the big ol'
comment.

Really, this would do:

/*
* The kernel can not handle #VE's when accessing normal kernel
* memory. Ensure that no #VE will be delivered for accesses to
* TD-private memory. Only VMM-shared memory (MMIO) will #VE.
*/
if (!(td_attr & ATTR_SEPT_VE_DISABLE))
panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");

I'll probably trim both of them down. If I chop out something that's
critical, let me know, otherwise let's follow up and stick all of those
details in Documentation.

2022-10-28 16:08:30

by Dave Hansen

Subject: Re: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

On 10/28/22 07:12, Kirill A. Shutemov wrote:
> + * information, TD attributes etc. More details about the ABI can be
> + * found in TDX Guest-Host-Communication Interface (GHCI), section
> + * 2.4.2 TDCALL [TDG.VP.INFO].

Folks, I thought we agreed long ago to stop putting section numbers in
these comments because they're not stable. Am I remembering this wrong?

2022-10-28 23:58:40

by Dave Hansen

Subject: Re: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

I looked at this a bit more closely. The code is just bonkers now. It
can't go in like this.

tdx_parse_tdinfo() stashes two global variables. Then, about three
lines later in the function, it calls get_cc_mask() which just returns
one of those variables, modulo a little masking.

Ditto for the td_attr. It's stashed in a global variable and then read
one time.

There is *ZERO* reason to store 'td_attr'. There's also zero reason to
have 'gpa_width' as a global variable. It's only used *ONE* *TIME* from
the scope of *ONE* *FUNCTION*.

Even the comment is bonkers:

/*
* Initializes gpa_width and td_attr. Must be called before
* get_cc_mask() or attribute checks.
*/
tdx_parse_tdinfo();

Comments are great. But comments that are only there because the code
is obtuse are not. I changed it to:

tdx_parse_tdinfo(&cc_mask);

It doesn't even need a comment now. Why? Because it's obvious from the
naming and calling convention that it initializes cc_mask and what the
ordering dependency is. Plus, even *if* I missed the call, or screwed
up the order, the compiler would tell me that I'm a dolt.

The whole global variable thing actually makes the code objectively
worse in almost every possible way.

Can you please take a look through this and make sure I didn't botch
anything:

> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve

The end result is about 50 lines less than what was there before. Most
of it is comment removal but the code is simpler too.

Acks and Tested-by's would be appreciated.

2022-10-29 00:02:45

by Kirill A. Shutemov

Subject: Re: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

On Fri, Oct 28, 2022 at 04:27:10PM -0700, Dave Hansen wrote:
> I looked at this a bit more closely. The code is just bonkers now. It
> can't go in like this.
>
> tdx_parse_tdinfo() stashes two global variables. Then, about three
> lines later in the function, it calls get_cc_mask() which just returns
> one of those variables, modulo a little masking.
>
> Ditto for the td_attr. It's stashed in a global variable and then read
> one time.
>
> There is *ZERO* reason to store 'td_attr'. There's also zero reason to
> have 'gpa_width' as a global variable. It's only used *ONE* *TIME* from
> the scope of *ONE* *FUNCTION*.
>
> Even the comment is bonkers:
>
> /*
> * Initializes gpa_width and td_attr. Must be called before
> * get_cc_mask() or attribute checks.
> */
> tdx_parse_tdinfo();
>
> Comments are great. But comments that are only there because the code
> is obtuse are not. I changed it to:
>
> tdx_parse_tdinfo(&cc_mask);
>
> It doesn't even need a comment now. Why? Because it's obvious from the
> naming and calling convention that it initializes cc_mask and what the
> ordering dependency is. Plus, even *if* I missed the call, or screwed
> up the order, the compiler would tell me that I'm a dolt.
>
> The whole global variable thing actually makes the code objectively
> worse in almost every possible way.

I agree. Sorry about that.

We have more code in our tree that wants to check attributes. And it
runs after initialization, so the code structure derived from there.

But, yes, I should have reworked it before sending upstream.

> Can you please take a look through this and make sure I didn't botch
> anything:
>
> > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve
>
> The end result is about 50 lines less than what was there before. Most
> of it is comment removal but the code is simpler too.
>
> Acks and Tested-by's would be appreciated.

Looks good and works fine:

Acked-and-Tested-by: Kirill A. Shutemov <[email protected]>

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-10-31 04:25:38

by Guorui Yu

Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

The core of this vulnerability is not directly related to
ATTR_SEPT_VE_DISABLE, but to the MMIO processing logic in #VE.

We have encountered similar problems on SEV-ES; here are their fixes in
the kernel [1] and OVMF [2].

Instead of enforcing ATTR_SEPT_VE_DISABLE in the TDX guest kernel, I
think the fix should also include the necessary checks on the MMIO path
of the #VE routine.

static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
unsigned long *reg, val, vaddr;
char buffer[MAX_INSN_SIZE];
struct insn insn = {};
enum mmio_type mmio;
int size, extend_size;
u8 extend_val = 0;

// Some additional security checks on ve->gpa should be introduced here.

/* Only in-kernel MMIO is supported */
if (WARN_ON_ONCE(user_mode(regs)))
return -EFAULT;

// ...
}

If we don't fix the problem at the point where we found it, but rely on
complicated composite logic and long comments in the kernel, I'm
confident we'll fall back into the same pit in the near future :).


[1]
https://github.com/torvalds/linux/blob/1a2dcbdde82e3a5f1db9b2f4c48aa1aeba534fb2/arch/x86/kernel/sev.c#L503
[2] OVMF:
https://github.com/tianocore/edk2/blob/db2c22633f3c761975d8f469a0e195d8b79e1287/OvmfPkg/Library/VmgExitLib/VmgExitVcHandler.c#L670

2022-10-31 04:57:02

by Kirill A. Shutemov

Subject: Re: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

On Sat, Oct 29, 2022 at 02:59:51AM +0300, Kirill A. Shutemov wrote:
> > Can you please take a look through this and make sure I didn't botch
> > anything:
> >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve
> >
> > The end result is about 50 lines less than what was there before. Most
> > of it is comment removal but the code is simpler too.
> >
> > Acks and Tested-by's would be appreciated.

One thing that I must bring up: it seems there's no way to get the panic
message to the user. I tried to convince myself that it is a qemu
misconfiguration on my part or some race, but no: it is just too early
for earlyprintk.

We only get earlyprintk working after parse_early_options() which happens
well after tdx_early_init().

Moving the panic() to after earlyprintk is working is not a good idea as
it exposes the kernel more: by that time we already have the full #VE
handler.

We can move it earlier, into the decompressor, which has a different
earlyprintk implementation. Not sure if it's worth it. What do you
think?

"tdx: Guest detected" printed from the same tdx_early_init() is visible
if boot goes far enough to initialize a console (early or not), as
printk adds the message to the buffer, but no such luck for panic().

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-10-31 05:31:31

by Kirill A. Shutemov

Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

On Mon, Oct 31, 2022 at 12:07:45PM +0800, Guorui Yu wrote:
> The core of this vulnerability is not directly related to
> ATTR_SEPT_VE_DISABLE, but to the MMIO processing logic in #VE.
>
> We have encountered similar problems on SEV-ES; here are their fixes
> in the kernel [1] and OVMF [2].
>
> Instead of enforcing ATTR_SEPT_VE_DISABLE in the TDX guest kernel, I
> think the fix should also include the necessary checks on the MMIO
> path of the #VE routine.

Missing SEPT_VE_DISABLE exposes us to more security problems than a
confused handle_mmio(). A rightly timed rogue #VE can be used to
escalate privileges and more. Just adding a check there would solve only
some potential attacks.

> static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> {
> unsigned long *reg, val, vaddr;
> char buffer[MAX_INSN_SIZE];
> struct insn insn = {};
> enum mmio_type mmio;
> int size, extend_size;
> u8 extend_val = 0;
>
> // Some additional security checks on ve->gpa should be introduced here.
>
> /* Only in-kernel MMIO is supported */
> if (WARN_ON_ONCE(user_mode(regs)))
> return -EFAULT;
>
> // ...
> }
>
> If we don't fix the problem at the point where we found it, but rely
> on complicated composite logic and long comments in the kernel, I'm
> confident we'll fall back into the same pit in the near future :).

The plan is to add the check there along with relaxing SEPT_VE_DISABLE
for debug TDs. It is required to debug the guest kernel effectively.
Otherwise an access to unaccepted memory would terminate the TD with
zero info on why.

But it is not an urgent fix. It can be submitted separately.

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-10-31 14:45:56

by Dave Hansen

Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

On 10/30/22 21:07, Guorui Yu wrote:
> We have encountered similar problems on SEV-ES; here are their fixes
> in the kernel [1] and OVMF [2].

SEV-ES and TDX are very different beasts in this area.

> Instead of enforcing ATTR_SEPT_VE_DISABLE in the TDX guest kernel, I
> think the fix should also include the necessary checks on the MMIO
> path of the #VE routine.

Instead?

Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
instruction. We call those kinds of exceptions "paranoid entry" points.
They need special handling like the NMI or #MC handlers.

I'd be happy to look at a patch that does the MMIO path check *and*
turns the #VE handler into a robust entry point.

Bonus points if you can do ~5 lines of C like the approach in this thread.

2022-10-31 17:30:20

by Dave Hansen

Subject: Re: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

On 10/30/22 21:12, Kirill A. Shutemov wrote:
> On Sat, Oct 29, 2022 at 02:59:51AM +0300, Kirill A. Shutemov wrote:
>>> Can you please take a look through this and make sure I didn't botch
>>> anything:
>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve
>>>
>>> The end result is about 50 lines less than what was there before. Most
>>> of it is comment removal but the code is simpler too.
>>>
>>> Acks and Tested-by's would be appreciated.
>
> One thing that I must bring up: it seems there's no way to get the
> panic message to the user. I tried to convince myself that it is a
> qemu misconfiguration on my part or some race, but no: it is just too
> early for earlyprintk.
>
> We only get earlyprintk working after parse_early_options() which happens
> well after tdx_early_init().
>
> Moving the panic() to after earlyprintk is working is not a good idea
> as it exposes the kernel more: by that time we already have the full
> #VE handler.

How about we soften the panic() to a pr_err() if it's a debug guest?

The first thing a user is going to do if they get an early boot failure
is flip the debug switch and try it again. That gets us safe,
well-defined behavior when we need security and also lets us figure out
what went wrong.

Also, did anyone ever actually implement that TDX earlyprintk simple
console thing? A TDCALL up to the host with some characters in a
register or two is as dirt simple of a console as you can get. It would
be very easy to improve the user experience here if there were a:

tdx_puts("uh oh");

interface. It's a shame if it didn't get done by now. I asked for it
years ago.

And, yeah, I know it wouldn't help us in this precise situation because
earlyprintk doesn't work yet. But, it *would* be one of those really,
really early bitbanging-style consoles that _could_ be in use very, very
early if the printk() infrastructure could take advantage of it.

> We can move it earlier, into the decompressor, which has a different
> earlyprintk implementation. Not sure if it's worth it. What do you
> think?

There's the puts()/printf() gunk that's really early like in
validate_cpu(). Is that what you were thinking of?


2022-10-31 20:09:11

by Dave Hansen

Subject: Re: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

On 10/31/22 12:27, Andi Kleen wrote:
>> Moving the panic() to after earlyprintk is working is not a good idea
>> as it exposes the kernel more: by that time we already have the full
>> #VE handler.
>
> It should be fine to move since there is no user land at this point (the
> attack requires user land)

Maybe I'm misunderstanding the exposure. A normal MMIO #VE goes
something like this:

1. %rax points to some MMIO
2. Kernel executes: mov (%rax),%rbx, trying to read MMIO
3. #VE handler is triggered
4. Handler emulates the 'mov' with instruction decoding
5. Handler asks the VMM what the value of %rax should be
6. Handler puts VMM value in %rax
7. Return from #VE

I think the attack scenario subverts a normal MMIO to the following
(changes from the normal flow are marked with *):

*1. %rax points to some private kernel memory, VMM removes
Secure-EPT entry for that memory.
2. Kernel executes: mov (%rax),%rbx as part of normal kernel
execution, not an MMIO read.
3. #VE handler is triggered, assuming a MMIO read
4. Handler emulates the 'mov' with instruction decoding
5. Handler asks the VMM what the value of %rax should be
*6. Handler puts (malicious) VMM value in %rax
7. Return from #VE
*8. Now the guest kernel is running with an attacker-controlled
%rax

This effectively gives the attacker the ability to override the contents
of a memory read.

Am I misunderstanding the attack scenario? I don't see guest userspace
needing to be involved at all.



2022-10-31 20:10:43

by Andi Kleen

Subject: Re: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

> Acks and Tested-by's would be appreciated.
>
> One thing that I must bring up: it seems there's no way to get the
> panic message to the user. I tried to convince myself that it is a
> qemu misconfiguration on my part or some race, but no: it is just too
> early for earlyprintk.
>
> We only get earlyprintk working after parse_early_options() which happens
> well after tdx_early_init().
>
> Moving the panic() to after earlyprintk is working is not a good idea
> as it exposes the kernel more: by that time we already have the full
> #VE handler.


It should be fine to move since there is no user land at this point (the
attack requires user land).


>
> We can move it earlier, into the decompressor, which has a different
> earlyprintk implementation. Not sure if it's worth it. What do you
> think?

That would make uncompressed kernels unsafe.

-Andi


2022-10-31 20:21:09

by Kirill A. Shutemov

Subject: Re: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

On Mon, Oct 31, 2022 at 09:42:15AM -0700, Dave Hansen wrote:
> On 10/30/22 21:12, Kirill A. Shutemov wrote:
> > On Sat, Oct 29, 2022 at 02:59:51AM +0300, Kirill A. Shutemov wrote:
> >>> Can you please take a look through this and make sure I didn't botch
> >>> anything:
> >>>
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve
> >>>
> >>> The end result is about 50 lines less than what was there before. Most
> >>> of it is comment removal but the code is simpler too.
> >>>
> >>> Acks and Tested-by's would be appreciated.
> >
> > One thing that I must bring up: it seems there's no way to get the
> > panic message to the user. I tried to convince myself that it is a
> > qemu misconfiguration on my part or some race, but no: it is just
> > too early for earlyprintk.
> >
> > We only get earlyprintk working after parse_early_options() which happens
> > well after tdx_early_init().
> >
> > Moving the panic() to after earlyprintk is working is not a good
> > idea as it exposes the kernel more: by that time we already have
> > the full #VE handler.
>
> How about we soften the panic() to a pr_err() if it's a debug guest?

The plan is to have a pr_warn() plus the check in handle_mmio(), as I
mentioned before. But pr_err() also works.

> The first thing a user is going to do if they get an early boot failure
> is flip the debug switch and try it again. That gets us safe,
> well-defined behavior when we need security and also lets us figure out
> what went wrong.
>
> Also, did anyone ever actually implement that TDX earlyprintk simple
> console thing? A TDCALL up to the host with some characters in a
> register or two is as dirt simple of a console as you can get. It would
> be very easy to improve the user experience here if there were a:
>
> tdx_puts("uh oh");
>
> interface. It's a shame if it didn't get done by now. I asked for it
> years ago.

There's nothing like this, unfortunately.

There's a ReportFatalError TDVMCALL that is intended for this task, but
it only takes an error code as input, which is useless here. Nobody will
decode it.

> And, yeah, I know it wouldn't help us in this precise situation because
> earlyprintk doesn't work yet. But, it *would* be one of those really,
> really early bitbanging-style consoles that _could_ be in use very, very
> early if the printk() infrastructure could take advantage of it.
>
> > We can move it earlier, into the decompressor, which has a different
> > earlyprintk implementation. Not sure if it's worth it. What do you
> > think?
>
> There's the puts()/printf() gunk that's really early like in
> validate_cpu(). Is that what you were thinking of?

More like error() in arch/x86/boot/compressed/kaslr.c.

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-10-31 23:09:18

by Kirill A. Shutemov

Subject: Re: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

On Mon, Oct 31, 2022 at 12:44:15PM -0700, Dave Hansen wrote:
> On 10/31/22 12:27, Andi Kleen wrote:
> >> Moving the panic() to after earlyprintk is working is not a good
> >> idea as it exposes the kernel more: by that time we already have
> >> the full #VE handler.
> >
> > It should be fine to move since there is no user land at this point (the
> > attack requires user land)
>
> Maybe I'm misunderstanding the exposure. A normal MMIO #VE goes
> something like this:
>
> 1. %rax points to some MMIO
> 2. Kernel executes: mov (%rax),%rbx, trying to read MMIO
> 3. #VE handler is triggered
> 4. Handler emulates the 'mov' with instruction decoding
> 5. Handler asks the VMM what the value of %rax should be
> 6. Handler puts VMM value in %rax
> 7. Return from #VE
>
> I think the attack scenario subverts a normal MMIO to the following
> (changes from the normal flow are marked with *):
>
> *1. %rax points to some private kernel memory, VMM removes
> Secure-EPT entry for that memory.
> 2. Kernel executes: mov (%rax),%rbx as part of normal kernel
> execution, not an MMIO read.
> 3. #VE handler is triggered, assuming a MMIO read
> 4. Handler emulates the 'mov' with instruction decoding
> 5. Handler asks the VMM what the value of %rax should be
> *6. Handler puts (malicious) VMM value in %rax
> 7. Return from #VE
> *8. Now the guest kernel is running with an attacker-controlled
> %rax
>
> This effectively gives the attacker the ability to override the contents
> of a memory read.
>
> Am I misunderstanding the attack scenario? I don't see guest userspace
> needing to be involved at all.

Looks correct to me.

I think Andi refers to the attack against the syscall gap that is also
addressed by the patch.

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-11-04 22:41:10

by Erdem Aktas

Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov
<[email protected]> wrote:

> + *
> + * Kernel has no legitimate use-cases for #VE on private memory. It is
> + * either a guest kernel bug (like access of unaccepted memory) or
> + * malicious/buggy VMM that removes guest page that is still in use.
> + *

I think this statement is too strong and I have a few concerns about
this approach. I understand that there is an issue with handling #VEs on
private pages, but it seems like we are just hiding the problem with
this approach instead of fixing it (I do not have any fix in mind).
First, there is a feature of injecting #VE to handle unaccepted pages at
runtime and accept them on-demand; now the statement is saying this was
an unnecessary feature (why is it there at all then?) as there is no
legitimate use case.

I wonder if this will limit how we can implement lazy TDACCEPT. There
are multiple ideas floating around now.
https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
has 3 implementation choices, of which the "Accept a block of memory on
the first use" option is implemented. Actually, it says "Accept a block
of memory on the first use" but it is implemented as "Accept a block of
memory on the first allocation". The comments in this code also raise
concerns about the performance.

As of now, we do not know which one of those ideas will provide
acceptable performance for booting large VMs. If the performance
overhead is high, we can always implement lazy TDACCEPT so that the
first time a guest accesses unaccepted memory, the #VE handler does the
TDACCEPT.

I am not trying to solve the lazy TDACCEPT problem here; all I am trying
to say is that there might be legitimate use cases for #VE on private
memory, and this patch limits any future improvements we might need to
make to the lazy TDACCEPT implementation.



> + * In both cases terminating TD is the right thing to do.
> + */
> + if (!(td_attr & ATTR_SEPT_VE_DISABLE))
> + panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");
> +
> setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>
> cc_set_vendor(CC_VENDOR_INTEL);
> --
> 2.38.0
>

2022-11-04 23:13:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

On 11/4/22 15:36, Erdem Aktas wrote:
> On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov
> <[email protected]> wrote:
>> + *
>> + * Kernel has no legitimate use-cases for #VE on private memory. It is
>> + * either a guest kernel bug (like access of unaccepted memory) or
>> + * malicious/buggy VMM that removes guest page that is still in use.
>> + *
>
> I think this statement is too strong and I have a few concerns about
> this approach. I understand that there is an issue with handling #VEs
> on private pages, but it seems like we are just hiding the problem
> with this approach instead of fixing it (I do not have any fix in
> mind). First, there is a feature of injecting #VE to handle unaccepted
> pages at runtime and accept them on-demand; now the statement is
> saying this was an unnecessary feature (why is it there at all then?)
> as there is no legitimate use case.

We're doing on-demand page acceptance. We just don't need a #VE to
drive it. Why is it in the TDX module then? Inertia? Because it got
too far along in the process before anyone asked me or some of the other
x86 kernel folks to look at it hard.

> I wonder if this will limit how we can implement the lazy TDACCEPT.
> There are multiple ideas floating now.
> https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
> has 3 implementation choices where "Accept a block of memory on the
> first use." option is implemented. Actually it says "Accept a block
> of memory on the first use." but it is implemented as "Accept a block
> of memory on the first allocation". The comments in this code also
> raise concerns about the performance.
>
> As of now, we do not know which one of those ideas will provide an
> acceptable performance for booting large-memory VMs. If the performance
> overhead is high, we can always implement lazy TDACCEPT so that the
> first time a guest accesses unaccepted memory, the #VE handler does the TDACCEPT.

Could you please elaborate a bit on what you think the distinction is
between:

* Accept on first use
and
* Accept on allocation

Surely, for the vast majority of memory, it's allocated and then used
pretty quickly. As in, most allocations are __GFP_ZERO so they're
allocated and "used" before they even leave the allocator. So, in
practice, they're *VERY* close to equivalent.

Where do you see them diverging? Why does it matter?

> I am not trying to solve the lazy TDACCEPT problem here, but all I am
> trying to say is that there might be legitimate use cases for #VE on
> private memory, and this patch limits any future improvements we might
> need to make to a lazy TDACCEPT implementation.

The kernel can't take exceptions on arbitrary memory accesses. I have
*ZERO* idea how to handle page acceptance on an access to a per-cpu
variable referenced in syscall entry, or the NMI stack when we've
interrupted kernel code with a user GSBASE value.

So, we either find *ALL* the kernel memory that needs to be pre-accepted
at allocation time (like kernel stacks) or we just say that all
allocated memory needs to be accepted before we let it be allocated.

One of those is really easy. The other means a boatload of code audits.
I know. I had to do that kind of exercise to get KPTI to work. I
don't want to do it again. It was worth it for KPTI when the world was
on fire. TDX isn't that important IMNHO. There's an easier way.

2022-11-07 05:57:31

by Guorui Yu

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory



On 2022/10/31 22:22, Dave Hansen wrote:
> On 10/30/22 21:07, Guorui Yu wrote:
>> We have encountered similar problems on SEV-ES, here are their fixes
>> on Kernel [1] and OVMF[2].
>
> SEV-ES and TDX are very different beasts in this area.
>
>> Instead of enforcing the ATTR_SEPT_VE_DISABLE in TDX guest kernel, I
>> think the fix should also include necessary check on the MMIO path of
>> the #VE routine.
>
> Instead?
Yes, "Instead of" should be "Additionally,".

>
> Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
> instruction. We call those kinds of exceptions "paranoid entry" points.
> They need special handling like the NMI or #MC handlers.
>
> I'd be happy to look at a patch that does the MMIO path check *and*
> turns the #VE handler into a robust entry point.
>
> Bonus points if you can do ~5 lines of C like the approach in this thread.

Yes, there is a fix to satisfy your requirement and get the bonus points :)

Please refer to
https://github.com/intel/tdx/commit/f045b0d52a5f7d8bf66cd4410307d05a90523f10


	case EXIT_REASON_EPT_VIOLATION:
+		if (!(ve->gpa & tdx_shared_mask())) {
+			panic("#VE due to access to unaccepted memory. "
+			      "GPA: %#llx\n", ve->gpa);
+		}
+
		/* original from Kirill and Kuppuswamy */

It's already there, but it just didn't get into the main branch.


Sincerely,
Guorui

2022-11-07 13:44:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

On 11/6/22 21:10, Guorui Yu wrote:
>> Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
>> instruction.  We call those kinds of exceptions "paranoid entry" points.
>>   They need special handling like the NMI or #MC handlers.
>>
>> I'd be happy to look at a patch that does the MMIO path check *and*
>> turns the #VE handler into a robust entry point.
>>
>> Bonus points if you can do ~5 lines of C like the approach in this
>> thread.
>
> Yes, there is a fix to satisfy your requirement and get the bonus points :)
>
> Please refer to
> https://github.com/intel/tdx/commit/f045b0d52a5f7d8bf66cd4410307d05a90523f10
>
> 	case EXIT_REASON_EPT_VIOLATION:
> +		if (!(ve->gpa & tdx_shared_mask())) {
> +			panic("#VE due to access to unaccepted memory. "
> +			      "GPA: %#llx\n", ve->gpa);
> +		}
> +
> 		/* original from Kirill and Kuppuswamy */
>
> It's already there, but it just didn't get into the main branch.

Could you explain how that prevents the #VE from occurring in the
"syscall gap" or in a place where the kernel is running with the user
GSBASE value?

It doesn't as far as I can tell. You need the SEPT_VE_DISABLE check for
that.

2022-11-07 13:59:36

by Guorui Yu

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory



On 2022/11/7 21:31, Dave Hansen wrote:
> On 11/6/22 21:10, Guorui Yu wrote:
>>> Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
>>> instruction.  We call those kinds of exceptions "paranoid entry" points.
>>>   They need special handling like the NMI or #MC handlers.
>>>
>>> I'd be happy to look at a patch that does the MMIO path check *and*
>>> turns the #VE handler into a robust entry point.
>>>
>>> Bonus points if you can do ~5 lines of C like the approach in this
>>> thread.
>>
>> Yes, there is a fix to satisfy your requirement and get the bonus points :)
>>
>> Please refer to
>> https://github.com/intel/tdx/commit/f045b0d52a5f7d8bf66cd4410307d05a90523f10
>>
>> 	case EXIT_REASON_EPT_VIOLATION:
>> +		if (!(ve->gpa & tdx_shared_mask())) {
>> +			panic("#VE due to access to unaccepted memory. "
>> +			      "GPA: %#llx\n", ve->gpa);
>> +		}
>> +
>> 		/* original from Kirill and Kuppuswamy */
>>
>> It's already there, but it just didn't get into the main branch.
>
> Could you explain how that prevents the #VE from occurring in the
> "syscall gap" or in a place where the kernel is running with the user
> GSBASE value?
>
Thank you for explaining the "paranoid entry" points with these
examples; now I understand why SEPT_VE_DISABLE is necessary for a TD.

> It doesn't as far as I can tell. You need the SEPT_VE_DISABLE check for
> that.

2022-11-07 23:06:05

by Erdem Aktas

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

On Fri, Nov 4, 2022 at 3:50 PM Dave Hansen <[email protected]> wrote:
>
> On 11/4/22 15:36, Erdem Aktas wrote:
> > On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov
> > <[email protected]> wrote:
> >> + *
> >> + * Kernel has no legitimate use-cases for #VE on private memory. It is
> >> + * either a guest kernel bug (like access of unaccepted memory) or
> >> + * malicious/buggy VMM that removes guest page that is still in use.
> >> + *
> >
> > I think this statement is too strong and I have a few concerns about
> > this approach. I understand that there is an issue of handling #VEs on
> > private pages, but it seems like we are just hiding the problem with
> > this approach instead of fixing it (I do not have any fix in mind).
> > First, there is a feature of injecting #VE to handle unaccepted pages
> > at runtime and accept them on-demand; now the statement is saying this
> > was an unnecessary feature (why is it there at all then?) as there is
> > no legitimate use case.
>
> We're doing on-demand page acceptance. We just don't need a #VE to
> drive it. Why is it in the TDX module then? Inertia? Because it got
> too far along in the process before anyone asked me or some of the other
> x86 kernel folks to look at it hard.
>
> > I wonder if this will limit how we can implement the lazy TDACCEPT.
> > There are multiple ideas floating now.
> > https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
> > has 3 implementation choices where "Accept a block of memory on the
> > first use." option is implemented. Actually it says "Accept a block
> > of memory on the first use." but it is implemented as "Accept a block
> > of memory on the first allocation". The comments in this code also
> > raise concerns about the performance.
> >
> > As of now, we do not know which one of those ideas will provide an
> > acceptable performance for booting large-memory VMs. If the performance
> > overhead is high, we can always implement lazy TDACCEPT so that the
> > first time a guest accesses unaccepted memory, the #VE handler does the TDACCEPT.
>
> Could you please elaborate a bit on what you think the distinction is
> between:
>
> * Accept on first use
> and
> * Accept on allocation
>
> Surely, for the vast majority of memory, it's allocated and then used
> pretty quickly. As in, most allocations are __GFP_ZERO so they're
> allocated and "used" before they even leave the allocator. So, in
> practice, they're *VERY* close to equivalent.
>
> Where do you see them diverging? Why does it matter?
>

For a VM with a very large memory size, let's say close to 800G of
memory, it might take a really long time to finish the initialization.
If all allocations are __GFP_ZERO, then I agree it would not matter,
but -- I need to run some benchmarks to validate -- what I remember is
that this was not what we were observing. Let me run a few tests to
provide more input on this; meanwhile, if you have already run some
benchmarks, that would be great.

What I see in the code is that the "accept_page" function will zero
all the unaccepted pages even if the __GFP_ZERO flag is not set, and
if __GFP_ZERO is set, we will zero all those pages again. I see a lot
of concerning comments like "Page acceptance can be very slow.".

What I mean by "Accept on allocation" is leaving the memory
allocation as it is and using the #VE handler to accept pages the
first time they are accessed.

Let me come back with some numbers on this, which might take some time
to collect.

> > I am not trying to solve the lazy TDACCEPT problem here, but all I am
> > trying to say is that there might be legitimate use cases for #VE on
> > private memory, and this patch limits any future improvements we might
> > need to make to a lazy TDACCEPT implementation.
>
> The kernel can't take exceptions on arbitrary memory accesses. I have
> *ZERO* idea how to handle page acceptance on an access to a per-cpu
> variable referenced in syscall entry, or the NMI stack when we've
> interrupted kernel code with a user GSBASE value.
>
> So, we either find *ALL* the kernel memory that needs to be pre-accepted
> at allocation time (like kernel stacks) or we just say that all
> allocated memory needs to be accepted before we let it be allocated.
>
> One of those is really easy. The other means a boatload of code audits.
> I know. I had to do that kind of exercise to get KPTI to work. I
> don't want to do it again. It was worth it for KPTI when the world was
> on fire. TDX isn't that important IMNHO. There's an easier way.

2022-11-07 23:56:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

On 11/7/22 14:53, Erdem Aktas wrote:
> On Fri, Nov 4, 2022 at 3:50 PM Dave Hansen <[email protected]> wrote:
>> Could you please elaborate a bit on what you think the distinction is
>> between:
>>
>> * Accept on first use
>> and
>> * Accept on allocation
>>
>> Surely, for the vast majority of memory, it's allocated and then used
>> pretty quickly. As in, most allocations are __GFP_ZERO so they're
>> allocated and "used" before they even leave the allocator. So, in
>> practice, they're *VERY* close to equivalent.
>>
>> Where do you see them diverging? Why does it matter?
>
> For a VM with a very large memory size, let's say close to 800G of
> memory, it might take a really long time to finish the initialization.
> If all allocations are __GFP_ZERO, then I agree it would not matter
> but -- I need to run some benchmarks to validate -- what I remember
> was, that was not what we were observing. Let me run a few tests to
> provide more input on this but meanwhile if you have already run some
> benchmarks, that would be great.
>
> What I see in the code is that the "accept_page" function will zero
> all the unaccepted pages even if the __GFP_ZERO flag is not set and if
> __GFP_ZERO is set, we will again zero all those pages. I see a lot of
> concerning comments like "Page acceptance can be very slow.".

I'm not following you at all here. Yeah, page acceptance is very slow.
But the slowest part is probably the cache coherency dance that the
TDX module has to do, flushing and zeroing all the memory to
initialize the new integrity metadata. Second to that is the cost of
the TDCALL. Third is the cost of the #VE.

Here's what Kirill is proposing, in some pseudocode:

alloc_page(order=0, __GFP_ZERO) {
	TD.accept(size=4M) {
		// TDX Module clflushes/zeroes 4M of memory
	}
	memset(4k);
	// leave 1023 accepted 4k pages in the allocator
}

To accept 4M of memory, you do one TDCALL. You do zero #VE's. Using
the #VE handler, you do:

alloc_page(order=0, __GFP_ZERO) {
	memset(4k) {
		-> #VE handler
		TD.accept(size=4k); // flush/zero 4k
	}
	// only 4k was accepted
}
... Take 1023 more #VE's later on for each 4k page

You do 1024 #VE's and 1024 TDCALLs. So, let's summarize. To do 4M
worth of 4k pages, here's how the two approaches break down if
__GFP_ZERO is in play:

            #VE        Accept-in-allocator
#VE's:      1024       0
TDCALLS:    1024       1
clflushes:  4k x 1024  4k x 1024
memset()s:  4k x 1024  4k x 1024

The *ONLY* downside of accept-at-allocate as implemented is that it does
4M at a time, so the TDCALL is long compared to a 4k one. But, this is
a classic bandwidth versus latency compromise. In this case, we choose
bandwidth.

*Both* cases need to memset() the same amount of memory. Both cases
only memset() 4k at a time.

The *ONLY* way the #VE approach is better is if you allocate 4k and then
never touch the rest of the 4M page. That might happen, maybe *ONE*
time per zone. But the rest of the time, the amortization of the TDCALL
cost is going to win.

I'll be shocked if any benchmarking turns up another result.