2022-01-24 19:24:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO

In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.

To emulate an instruction an emulator needs two things:

- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.

- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.

Both of them are not available to VMM in TDX environment:

- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.

- Memory is encrypted a TD-private key. The CPU disallows software
other than the TDX module and TDs from making memory accesses using
the private key.

In TDX the MMIO regions are instead configured to trigger a #VE
exception in the guest. The guest #VE handler then emulates the MMIO
instruction inside the guest and converts it into a controlled hypercall
to the host.

MMIO addresses can be used with any CPU instruction that accesses
memory. This patch, however, covers only MMIO accesses done via io.h
helpers, such as 'readl()' or 'writeq()'.

readX()/writeX() helpers limit the range of instructions which can trigger
MMIO. It makes MMIO instruction emulation feasible. Raw access to MMIO
region allows compiler to generate whatever instruction it wants.
Supporting all possible instructions is a task of a different scope

MMIO access with anything other than helpers from io.h may result in
MMIO_DECODE_FAILED and an oops.

AMD SEV has the same limitations to MMIO handling.

=== Potential alternative approaches ===

== Paravirtualizing all MMIO ==

An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.

Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.

However, any paravirtual approach would be patching approximately
120k call sites. With a conservative overhead estimation of 5 bytes per
call site (CALL instruction), it leads to bloating code by 600k.

Many drivers will never be used in the TDX environment and the bloat
cannot be justified.

== Patching TDX drivers ==

Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests. Right now, that's
limited only to virtio and some x86-specific drivers.

All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch. This will be implemented in the
future, removing the bulk of MMIO #VEs.

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/tdx.c | 114 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 114 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f213c67b4ecc..8e630eeb765d 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,8 @@
#include <linux/cpufeature.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>

/* TDX module Call Leaf IDs */
#define TDX_GET_VEINFO 3
@@ -149,6 +151,112 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
return true;
}

+static bool tdx_mmio(int size, bool write, unsigned long addr,
+ unsigned long *val)
+{
+ struct tdx_hypercall_output out;
+ u64 err;
+
+ err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
+ addr, *val, &out);
+ if (err)
+ return true;
+
+ *val = out.r11;
+ return false;
+}
+
+static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+ return tdx_mmio(size, false, addr, val);
+}
+
+static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
+{
+ return tdx_mmio(size, true, addr, val);
+}
+
+static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+ char buffer[MAX_INSN_SIZE];
+ unsigned long *reg, val = 0;
+ struct insn insn = {};
+ enum mmio_type mmio;
+ int size;
+ bool err;
+
+ if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+ return -EFAULT;
+
+ if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+ return -EFAULT;
+
+ mmio = insn_decode_mmio(&insn, &size);
+ if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+ return -EFAULT;
+
+ if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+ reg = insn_get_modrm_reg_ptr(&insn, regs);
+ if (!reg)
+ return -EFAULT;
+ }
+
+ switch (mmio) {
+ case MMIO_WRITE:
+ memcpy(&val, reg, size);
+ err = tdx_mmio_write(size, ve->gpa, &val);
+ break;
+ case MMIO_WRITE_IMM:
+ val = insn.immediate.value;
+ err = tdx_mmio_write(size, ve->gpa, &val);
+ break;
+ case MMIO_READ:
+ err = tdx_mmio_read(size, ve->gpa, &val);
+ if (err)
+ break;
+ /* Zero-extend for 32-bit operation */
+ if (size == 4)
+ *reg = 0;
+ memcpy(reg, &val, size);
+ break;
+ case MMIO_READ_ZERO_EXTEND:
+ err = tdx_mmio_read(size, ve->gpa, &val);
+ if (err)
+ break;
+
+ /* Zero extend based on operand size */
+ memset(reg, 0, insn.opnd_bytes);
+ memcpy(reg, &val, size);
+ break;
+ case MMIO_READ_SIGN_EXTEND: {
+ u8 sign_byte = 0, msb = 7;
+
+ err = tdx_mmio_read(size, ve->gpa, &val);
+ if (err)
+ break;
+
+ if (size > 1)
+ msb = 15;
+
+ if (val & BIT(msb))
+ sign_byte = -1;
+
+ /* Sign extend based on operand size */
+ memset(reg, sign_byte, insn.opnd_bytes);
+ memcpy(reg, &val, size);
+ break;
+ }
+ case MMIO_MOVS:
+ case MMIO_DECODE_FAILED:
+ return -EFAULT;
+ }
+
+ if (err)
+ return -EFAULT;
+
+ return insn.length;
+}
+
bool tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -219,6 +327,12 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
case EXIT_REASON_CPUID:
ret = tdx_handle_cpuid(regs);
break;
+ case EXIT_REASON_EPT_VIOLATION:
+ ve->instr_len = tdx_handle_mmio(regs, ve);
+ ret = ve->instr_len > 0;
+ if (!ret)
+ pr_warn_once("MMIO failed\n");
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
break;
--
2.34.1


2022-01-24 20:33:01

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO

On Mon, Jan 24, 2022 at 06:01:54PM +0300, Kirill A. Shutemov wrote:
> +static bool tdx_mmio(int size, bool write, unsigned long addr,
> + unsigned long *val)
> +{
> + struct tdx_hypercall_output out;
> + u64 err;
> +
> + err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> + addr, *val, &out);
> + if (err)
> + return true;
> +
> + *val = out.r11;
> + return false;
> +}
> +
> +static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> + return tdx_mmio(size, false, addr, val);
> +}
> +
> +static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
> +{
> + return tdx_mmio(size, true, addr, val);
> +}
> +
> +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> + char buffer[MAX_INSN_SIZE];
> + unsigned long *reg, val = 0;
> + struct insn insn = {};
> + enum mmio_type mmio;
> + int size;
> + bool err;
> +
> + if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> + return -EFAULT;
> +
> + if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> + return -EFAULT;
> +
> + mmio = insn_decode_mmio(&insn, &size);
> + if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> + return -EFAULT;
> +
> + if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> + reg = insn_get_modrm_reg_ptr(&insn, regs);
> + if (!reg)
> + return -EFAULT;
> + }
> +
> + switch (mmio) {
> + case MMIO_WRITE:
> + memcpy(&val, reg, size);
> + err = tdx_mmio_write(size, ve->gpa, &val);
> + break;

The return code conventions are still all mismatched and confusing:

- Most tdx_handle_*() handlers return bool (success == true)

- tdx_handle_mmio() returns int (success > 0)

- tdx_mmio*() helpers return bool (success == false)

I still don't see any benefit in arbitrarily mixing three different
return conventions, none of which matches the typical kernel style for
returning errors, unless the goal is to confuse the reader and invite
bugs.

There is precedent in traps.c for some handle_*() functions to return
bool (success == true), so if the goal is to align with that
semi-convention, that's ok. But at the very least, please do it
consistently:

- change tdx_mmio*() to return true on success;

- change tdx_handle_mmio() to return bool, with 'len' passed as an
argument.

Or, even better, just change them all to return 0 on success like 99+%
of error-returning kernel functions.

--
Josh

2022-01-24 23:14:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO

On Mon, Jan 24, 2022 at 11:30:08AM -0800, Josh Poimboeuf wrote:
> On Mon, Jan 24, 2022 at 06:01:54PM +0300, Kirill A. Shutemov wrote:
> > +static bool tdx_mmio(int size, bool write, unsigned long addr,
> > + unsigned long *val)
> > +{
> > + struct tdx_hypercall_output out;
> > + u64 err;
> > +
> > + err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> > + addr, *val, &out);
> > + if (err)
> > + return true;
> > +
> > + *val = out.r11;
> > + return false;
> > +}
> > +
> > +static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
> > +{
> > + return tdx_mmio(size, false, addr, val);
> > +}
> > +
> > +static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
> > +{
> > + return tdx_mmio(size, true, addr, val);
> > +}
> > +
> > +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > + char buffer[MAX_INSN_SIZE];
> > + unsigned long *reg, val = 0;
> > + struct insn insn = {};
> > + enum mmio_type mmio;
> > + int size;
> > + bool err;
> > +
> > + if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> > + return -EFAULT;
> > +
> > + if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> > + return -EFAULT;
> > +
> > + mmio = insn_decode_mmio(&insn, &size);
> > + if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> > + return -EFAULT;
> > +
> > + if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> > + reg = insn_get_modrm_reg_ptr(&insn, regs);
> > + if (!reg)
> > + return -EFAULT;
> > + }
> > +
> > + switch (mmio) {
> > + case MMIO_WRITE:
> > + memcpy(&val, reg, size);
> > + err = tdx_mmio_write(size, ve->gpa, &val);
> > + break;
>
> The return code conventions are still all mismatched and confusing:
>
> - Most tdx_handle_*() handlers return bool (success == true)
>
> - tdx_handle_mmio() returns int (success > 0)

Right, all tdx_handle_* are consistent: success > 0.

> - tdx_mmio*() helpers return bool (success == false)

And what is wrong with that? Why do you mix functions that called in
different contexts and expect them to have matching semantics?

> I still don't see any benefit in arbitrarily mixing three different
> return conventions, none of which matches the typical kernel style for
> returning errors, unless the goal is to confuse the reader and invite
> bugs.

Okay, we have an disagreement here.

I picked a way to communicate function result as I see best fits the
situation. It is a judgement call.

I will adjust code if maintainers see it differently from me. But until
then I don't see anything wrong here.

> There is precedent in traps.c for some handle_*() functions to return
> bool (success == true), so if the goal is to align with that
> semi-convention, that's ok. But at the very least, please do it
> consistently:
>
> - change tdx_mmio*() to return true on success;
>
> - change tdx_handle_mmio() to return bool, with 'len' passed as an
> argument.

Hard no.

Returning a value via passed argument is the last resort for cases when
more than one value has to be returned. In this case the function is
perfectly capable to communicate result via single return value.

I don't see a reason to complicate the code to satisfy some "typical
kernel style".

> Or, even better, just change them all to return 0 on success like 99+%
> of error-returning kernel functions.

Citation needed. 99+% looks like an overstatement to me.

--
Kirill A. Shutemov

2022-01-24 23:38:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO

> +static bool tdx_mmio(int size, bool write, unsigned long addr,
> + unsigned long *val)
> +{
> + struct tdx_hypercall_output out;
> + u64 err;
> +
> + err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> + addr, *val, &out);
> + if (err)
> + return true;
> +
> + *val = out.r11;
> + return false;
> +}
> +
> +static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> + return tdx_mmio(size, false, addr, val);
> +}
> +
> +static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
> +{
> + return tdx_mmio(size, true, addr, val);
> +}
> +
> +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
...
> + bool err;

I'll agree with Josh on one point: "bool err" _is_ weird.

Things tend to either return int with 0 for success or bool with true
for success.

The tdx_handle*() ones seem OK to me. It's pretty normal to have a
literal "handler" return true if things were handled.

I'd probably just make tdx_mmio() return an int. It seems to only able
to return -EFAULT anyway, so changing the return from bool->int and doing:

- return false;
+ return -EFAULT;

isn't exactly a heavy lift.

2022-01-24 23:50:02

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO

On Tue, Jan 25, 2022 at 01:08:21AM +0300, Kirill A. Shutemov wrote:
> > The return code conventions are still all mismatched and confusing:
> >
> > - Most tdx_handle_*() handlers return bool (success == true)
> >
> > - tdx_handle_mmio() returns int (success > 0)
>
> Right, all tdx_handle_* are consistent: success > 0.

Non-zero success is not the same as above-zero success. The behavior is
not interchangeable.

> > - tdx_mmio*() helpers return bool (success == false)
>
> And what is wrong with that? Why do you mix functions that called in
> different contexts and expect them to have matching semantics?

Why would you expect the reader of the code to go investigate the weird
return semantics of every called function?

And "success == false" is just plain confusing, I haven't seen that one.

> > I still don't see any benefit in arbitrarily mixing three different
> > return conventions, none of which matches the typical kernel style for
> > returning errors, unless the goal is to confuse the reader and invite
> > bugs.
>
> Okay, we have an disagreement here.
>
> I picked a way to communicate function result as I see best fits the
> situation. It is a judgement call.
>
> I will adjust code if maintainers see it differently from me. But until
> then I don't see anything wrong here.
>
> > There is precedent in traps.c for some handle_*() functions to return
> > bool (success == true), so if the goal is to align with that
> > semi-convention, that's ok. But at the very least, please do it
> > consistently:
> >
> > - change tdx_mmio*() to return true on success;
> >
> > - change tdx_handle_mmio() to return bool, with 'len' passed as an
> > argument.
>
> Hard no.
>
> Returning a value via passed argument is the last resort for cases when
> more than one value has to be returned. In this case the function is
> perfectly capable to communicate result via single return value.
>
> I don't see a reason to complicate the code to satisfy some "typical
> kernel style".

It's a convention for a reason.

> > Or, even better, just change them all to return 0 on success like 99+%
> > of error-returning kernel functions.
>
> Citation needed. 99+% looks like an overstatement to me.

From Documentation/process/coding-style.rst:

16) Function return values and names
------------------------------------

Functions can return values of many different kinds, and one of the
most common is a value indicating whether the function succeeded or
failed. Such a value can be represented as an error-code integer
(-Exxx = failure, 0 = success) or a ``succeeded`` boolean (0 = failure,
non-zero = success).

Mixing up these two sorts of representations is a fertile source of
difficult-to-find bugs. If the C language included a strong distinction
between integers and booleans then the compiler would find these mistakes
for us... but it doesn't. To help prevent such bugs, always follow this
convention::

If the name of a function is an action or an imperative command,
the function should return an error-code integer. If the name
is a predicate, the function should return a "succeeded" boolean.

For example, ``add work`` is a command, and the add_work() function returns 0
for success or -EBUSY for failure. In the same way, ``PCI device present`` is
a predicate, and the pci_dev_present() function returns 1 if it succeeds in
finding a matching device or 0 if it doesn't.

--
Josh

2022-01-24 23:58:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2.1 08/29] x86/tdx: Handle in-kernel MMIO

In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.

To emulate an instruction an emulator needs two things:

- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.

- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.

Both of them are not available to VMM in TDX environment:

- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.

- Memory is encrypted a TD-private key. The CPU disallows software
other than the TDX module and TDs from making memory accesses using
the private key.

In TDX the MMIO regions are instead configured to trigger a #VE
exception in the guest. The guest #VE handler then emulates the MMIO
instruction inside the guest and converts it into a controlled hypercall
to the host.

MMIO addresses can be used with any CPU instruction that accesses
memory. This patch, however, covers only MMIO accesses done via io.h
helpers, such as 'readl()' or 'writeq()'.

readX()/writeX() helpers limit the range of instructions which can trigger
MMIO. It makes MMIO instruction emulation feasible. Raw access to MMIO
region allows compiler to generate whatever instruction it wants.
Supporting all possible instructions is a task of a different scope

MMIO access with anything other than helpers from io.h may result in
MMIO_DECODE_FAILED and an oops.

AMD SEV has the same limitations to MMIO handling.

=== Potential alternative approaches ===

== Paravirtualizing all MMIO ==

An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.

Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.

However, any paravirtual approach would be patching approximately
120k call sites. With a conservative overhead estimation of 5 bytes per
call site (CALL instruction), it leads to bloating code by 600k.

Many drivers will never be used in the TDX environment and the bloat
cannot be justified.

== Patching TDX drivers ==

Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests. Right now, that's
limited only to virtio and some x86-specific drivers.

All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch. This will be implemented in the
future, removing the bulk of MMIO #VEs.

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/tdx.c | 113 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 113 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f213c67b4ecc..c5367e331bf6 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,8 @@
#include <linux/cpufeature.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>

/* TDX module Call Leaf IDs */
#define TDX_GET_VEINFO 3
@@ -149,6 +151,111 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
return true;
}

+static int tdx_mmio(int size, bool write, unsigned long addr,
+ unsigned long *val)
+{
+ struct tdx_hypercall_output out;
+ u64 err;
+
+ err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
+ addr, *val, &out);
+ if (err)
+ return -EFAULT;
+
+ *val = out.r11;
+ return 0;
+}
+
+static int tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+ return tdx_mmio(size, false, addr, val);
+}
+
+static int tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
+{
+ return tdx_mmio(size, true, addr, val);
+}
+
+static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+ char buffer[MAX_INSN_SIZE];
+ unsigned long *reg, val = 0;
+ struct insn insn = {};
+ enum mmio_type mmio;
+ int size, err;
+
+ if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+ return -EFAULT;
+
+ if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+ return -EFAULT;
+
+ mmio = insn_decode_mmio(&insn, &size);
+ if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+ return -EFAULT;
+
+ if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+ reg = insn_get_modrm_reg_ptr(&insn, regs);
+ if (!reg)
+ return -EFAULT;
+ }
+
+ switch (mmio) {
+ case MMIO_WRITE:
+ memcpy(&val, reg, size);
+ err = tdx_mmio_write(size, ve->gpa, &val);
+ break;
+ case MMIO_WRITE_IMM:
+ val = insn.immediate.value;
+ err = tdx_mmio_write(size, ve->gpa, &val);
+ break;
+ case MMIO_READ:
+ err = tdx_mmio_read(size, ve->gpa, &val);
+ if (err)
+ break;
+ /* Zero-extend for 32-bit operation */
+ if (size == 4)
+ *reg = 0;
+ memcpy(reg, &val, size);
+ break;
+ case MMIO_READ_ZERO_EXTEND:
+ err = tdx_mmio_read(size, ve->gpa, &val);
+ if (err)
+ break;
+
+ /* Zero extend based on operand size */
+ memset(reg, 0, insn.opnd_bytes);
+ memcpy(reg, &val, size);
+ break;
+ case MMIO_READ_SIGN_EXTEND: {
+ u8 sign_byte = 0, msb = 7;
+
+ err = tdx_mmio_read(size, ve->gpa, &val);
+ if (err)
+ break;
+
+ if (size > 1)
+ msb = 15;
+
+ if (val & BIT(msb))
+ sign_byte = -1;
+
+ /* Sign extend based on operand size */
+ memset(reg, sign_byte, insn.opnd_bytes);
+ memcpy(reg, &val, size);
+ break;
+ }
+ case MMIO_MOVS:
+ case MMIO_DECODE_FAILED:
+ return -EFAULT;
+ }
+
+ if (err)
+ return err;
+
+ return insn.length;
+}
+
bool tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
@@ -219,6 +326,12 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
case EXIT_REASON_CPUID:
ret = tdx_handle_cpuid(regs);
break;
+ case EXIT_REASON_EPT_VIOLATION:
+ ve->instr_len = tdx_handle_mmio(regs, ve);
+ ret = ve->instr_len > 0;
+ if (!ret)
+ pr_warn_once("MMIO failed\n");
+ break;
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
break;
--
2.34.1

2022-02-02 18:17:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
>
> +static bool tdx_mmio(int size, bool write, unsigned long addr,
> + unsigned long *val)
> +{
> + struct tdx_hypercall_output out;
> + u64 err;
> +
> + err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> + addr, *val, &out);

What's the purpose of storing *val as an argument for reads?

> + if (err)
> + return true;
> +
> + *val = out.r11;
> + return false;

Why is this writing back unconditionally for writes?

> +
> bool tdx_get_ve_info(struct ve_info *ve)
> {
> struct tdx_module_output out;
> @@ -219,6 +327,12 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> case EXIT_REASON_CPUID:
> ret = tdx_handle_cpuid(regs);
> break;
> + case EXIT_REASON_EPT_VIOLATION:
> + ve->instr_len = tdx_handle_mmio(regs, ve);
> + ret = ve->instr_len > 0;

I agree with Josh here. This is just wrong. Why returning the instr_len
as an error/success indicator? That's just a horrible idea simply
because the "error value" which is <= 0 is converted to a boolean return
value.

So what's wrong with doing the obvious here

case EXIT_REASON_EPT_VIOLATION:
return tdx_handle_mmio(regs, ve);

and have the handler function set ve->instr_length?

Also instead of having this not really helpful tdx_mmio() helper just
implement read and write seperately:

static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
{
struct tdx_hypercall_output out;

if (_tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, EPT_READ,
addr, 0, &out)
return false;

*val = out.r11;
return true;
}

static bool tdx_mmio_write(int size, unsigned long addr, unsigned long val)
{
return !!_tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE,
addr, val, NULL);
}

The return value is consistent with all the other handling functions
here, they return a boolean True for success. Which makes the main
handler consistent with the rest.

static bool tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
char buffer[MAX_INSN_SIZE];
unsigned long *reg, val;
struct insn insn = {};
int size, extend_size;
enum mmio_type mmio;
u8 extend_val = 0;
bool ret;

if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
return false;

if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
return false;

mmio = insn_decode_mmio(&insn, &size);
if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
return false;

if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
reg = insn_get_modrm_reg_ptr(&insn, regs);
if (!reg)
return false;
}

ve->instr_length = insn.length;

switch (mmio) {
case MMIO_WRITE:
memcpy(&val, reg, size);
return tdx_mmio_write(size, ve->gpa, val);
case MMIO_WRITE_IMM:
val = insn.immediate.value;
return tdx_mmio_write(size, ve->gpa, val);
case MMIO_READ:
case MMIO_READ_ZERO_EXTEND:
case MMIO_READ_SIGN_EXTEND:
break;
case MMIO_MOVS:
case MMIO_DECODE_FAILED:
return false;
}

/* Handle reads */
if (!tdx_mmio_read(size, ve->gpa, &val))
return false;

switch (mmio) {
case MMIO_READ:
/* Zero-extend for 32-bit operation */
extend_size = size == 4 ? sizeof(*reg) : 0;
break;
case MMIO_READ_ZERO_EXTEND:
/* Zero extend based on operand size */
extend_size = insn.opnd_bytes;
break;
case MMIO_READ_SIGN_EXTEND:
/* Sign extend based on operand size */
extend_size = insn.opnd_bytes;
if (size == 1 && val & BIT(7))
extend_val = 0xFF;
else if (size > 1 && val & BIT(15))
extend_val = 0xFF;
break;
default:
BUG();
}

if (extend_size)
memset(reg, extend_val, extend_size);
memcpy(reg, &val, size);
return true;
}

Hmm?

Thanks,

tglx

2022-02-03 12:06:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHv2.1 08/29] x86/tdx: Handle in-kernel MMIO

On Tue, Jan 25, 2022 at 02:04:32AM +0300, Kirill A. Shutemov wrote:
> MMIO addresses can be used with any CPU instruction that accesses
> memory. This patch, however, covers only MMIO accesses done via io.h

Just like the last time:

s/This patch, however, covers only/Address only/

Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.

Also, do

$ git grep 'This patch' Documentation/process

for more details.

> helpers, such as 'readl()' or 'writeq()'.
>
> readX()/writeX() helpers limit the range of instructions which can trigger
> MMIO. It makes MMIO instruction emulation feasible. Raw access to MMIO

"Raw access to a MMIO region allows the compiler to ..."

> region allows compiler to generate whatever instruction it wants.
> Supporting all possible instructions is a task of a different scope
^
. Fullstop


...

> @@ -149,6 +151,111 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
> return true;
> }
>
> +static int tdx_mmio(int size, bool write, unsigned long addr,
> + unsigned long *val)

You don't need to break that line.

Rest LGTM.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette