2022-05-18 03:41:16

by Dave Hansen

Subject: Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page

On 5/17/22 13:17, Kirill A. Shutemov wrote:
>>> Given that we had to adjust IP in handle_mmio() anyway, do you still think
>>> "ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.
>> Something is wrong about it.
>>
>> You could call it 've->instr_bytes_to_handle' or something. Then it
>> makes actual logical sense when you handle it to zero it out. I just
>> want it to be more explicit when the upper levels need to do something.
>>
>> Does ve->instr_len==0 both when the TDX module isn't providing
>> instruction sizes *and* when no handling is necessary? That seems like
>> an unfortunate logical multiplexing of 0.
> For EPT violation, ve->instr_len has *something* (not zero) that doesn't
> match the actual instruction size. I dug out that it is filled with data
> from VMREAD(0x440C), but I don't know what the ultimate origin of the
> data is.

The SDM has a breakdown:

27.2.5 Information for VM Exits Due to Instruction Execution

I didn't realize it came from VMREAD. I guess I assumed it came from
some TDX module magic. Silly me.
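
For the record, 0x440C is the VMCS encoding of the "VM-exit instruction
length" field. The kernel spells the encodings out in
arch/x86/include/asm/vmx.h; an excerpt (not the full enum):

/* Excerpt of the VMCS field encodings, arch/x86/include/asm/vmx.h */
enum vmcs_field {
        VM_EXIT_INSTRUCTION_LEN         = 0x0000440c,   /* the VMREAD(0x440C) above */
        VMX_INSTRUCTION_INFO            = 0x0000440e,
};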

The SDM makes it sound like we should be more judicious about using
've->instr_len' though. "All VM exits other than those listed in the
above items leave this field undefined." Looking over
virt_exception_kernel(), we've got five cases from CPU instructions that
cause unconditional VMEXITs:

case EXIT_REASON_HLT:
case EXIT_REASON_MSR_READ:
case EXIT_REASON_MSR_WRITE:
case EXIT_REASON_CPUID:
case EXIT_REASON_IO_INSTRUCTION:

and should have that field filled out, plus one that doesn't:

case EXIT_REASON_EPT_VIOLATION:

It seems awfully fragile to me to have the hardware be providing the
'instr_len' in those cases, but not in one other one. The data in there
is garbage for EXIT_REASON_EPT_VIOLATION. The reason we don't consume
garbage is that all the paths leading out of handle_mmio() that return
true also set 've->instr_len'. But that logic is entirely opaque.

It's also borderline criminal to have six functions that look identical
(in that switch statement), but one of them has different behavior for
've->instr_len'.

I'd probably do it like this:

static int handle_halt(struct ve_info *ve)
{
/*
* Since non safe halt is mainly used in CPU offlining
* and the guest will always stay in the halt state, don't
* call the STI instruction (set do_sti as false).
*/
const bool irq_disabled = irqs_disabled();
const bool do_sti = false;

if (__halt(irq_disabled, do_sti))
return -EIO;

/*
* VM-exit instruction length is defined for HLT. See:
* "Information for VM Exits Due to Instruction Execution"
* in the SDM.
*/
return ve->instr_len;
}

Any >=0 return means the exception was handled and it tells the caller
how much to advance RIP.

Then handle_mmio() can say:

/*
* VM-exit instruction length is not provided for the EPT
* violations that MMIO causes. Use the insn_decode() length:
*/
return insn.length;

See? Now everybody that goes and writes a new #VE exception helper has
a chance of actually getting this right. As it stands, if someone adds
one more of these, they'll probably get random behavior. This way, they
actually have to choose. They _might_ even go looking at the SDM.
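
For completeness, the top-level dispatcher would then look something like
this (untested sketch, the variable name is just illustrative):

bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
{
        int insn_len;

        if (user_mode(regs))
                insn_len = virt_exception_user(regs, ve);
        else
                insn_len = virt_exception_kernel(regs, ve);

        /* A negative return value means the #VE was not handled */
        if (insn_len < 0)
                return false;

        /* Skip past whatever the handler said needs skipping */
        regs->ip += insn_len;
        return true;
}

That keeps the "how far do we move RIP" decision in the one place that
actually knows the answer.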


2022-05-18 04:11:44

by Sean Christopherson

Subject: Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page

On Tue, May 17, 2022, Dave Hansen wrote:
> On 5/17/22 13:17, Kirill A. Shutemov wrote:
> >>> Given that we had to adjust IP in handle_mmio() anyway, do you still think
> >>> "ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.
> >> Something is wrong about it.
> >>
> >> You could call it 've->instr_bytes_to_handle' or something. Then it
> >> makes actual logical sense when you handle it to zero it out. I just
> >> want it to be more explicit when the upper levels need to do something.
> >>
> >> Does ve->instr_len==0 both when the TDX module isn't providing
> >> instruction sizes *and* when no handling is necessary? That seems like
> >> an unfortunate logical multiplexing of 0.
> > For EPT violation, ve->instr_len has *something* (not zero) that doesn't
> > match the actual instruction size. I dug out that it is filled with data
> > from VMREAD(0x440C), but I don't know what the ultimate origin of the
> > data is.
>
> The SDM has a breakdown:
>
> 27.2.5 Information for VM Exits Due to Instruction Execution
>
> I didn't realize it came from VMREAD. I guess I assumed it came from
> some TDX module magic. Silly me.
>
> The SDM makes it sound like we should be more judicious about using
> 've->instr_len' though. "All VM exits other than those listed in the
> above items leave this field undefined." Looking over
> virt_exception_kernel(), we've got five cases from CPU instructions that
> cause unconditional VMEXITs:

None of the below exit unconditionally.

> case EXIT_REASON_HLT:
> case EXIT_REASON_MSR_READ:
> case EXIT_REASON_MSR_WRITE:
> case EXIT_REASON_CPUID:
> case EXIT_REASON_IO_INSTRUCTION:
>
> and should have that field filled out, plus one that doesn't:
>
> case EXIT_REASON_EPT_VIOLATION:

I/O fills the length. IN, INS, OUT, and OUTS are all listed. It's not just
unconditional exits that provide the instruction length. The instruction length
is provided if the instruction is the direct cause of the exit; whether or not
the instruction exits unconditionally doesn't matter.

From the SDM:

  For fault-like VM exits due to attempts to execute one of the following
  instructions that cause VM exits unconditionally or based on the settings of
  VM-execution controls.

> Then handle_mmio() can say:
>
> /*
> * VM-exit instruction length is not provided for the EPT
> * violations that MMIO causes. Use the insn_decode() length:

This is inaccurate. The instruction length _is_ provided on EPT Violation VM-Exits
(it's also provided by all Intel CPUs on EPT Misconfigs even though the SDM doesn't
say so).

The instruction length is wrong in the TDX case because there is no EPT Violation
VM-Exit. The EPT Violation is morphed to a #VE by the CPU, and the instruction
length isn't one of the fields that's saved into the #VE info struct by the CPU.
When the TDX Module gets control on the TDCALL, VMCS.INSTRUCTION_LENGTH will hold
the length of the TDCALL, not the instruction that caused the #VE, i.e. the TDX
Module can't provide the correct length.

For all other #VE cases in TDX, the #VE is injected by software (TDX module) after
the instruction-based VM-Exit. Before injecting the #VE, the TDX module grabs the
length from the VMCS and manually records it in the #VE info struct.
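
To put that in terms of what the guest sees, the #VE info struct is roughly
the following (annotations are mine, not the header's):

/* Filled from the TDG.VP.VEINFO.GET output in tdx_get_ve_info() */
struct ve_info {
        u64 exit_reason;
        u64 exit_qual;
        u64 gla;                /* Guest Linear (virtual) Address */
        u64 gpa;                /* Guest Physical Address */
        u32 instr_len;          /* only meaningful when the TDX Module injected the #VE */
        u32 instr_info;
};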


2022-05-20 15:53:58

by Kirill A. Shutemov

Subject: Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page

On Tue, May 17, 2022 at 03:16:42PM -0700, Dave Hansen wrote:
> See? Now everybody that goes and writes a new #VE exception helper has
> a chance of actually getting this right. As it stands, if someone adds
> one more of these, they'll probably get random behavior. This way, they
> actually have to choose. They _might_ even go looking at the SDM.

Okay. See below. Does it match what you had in mind?

If it is okay, I will do proper patches.

BTW, I found a bug in tdx_early_handle_ve(). It didn't update RIP.
I don't know how it happened. Maybe it got lost on the way upstream.

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 9955b5a89df8..d2635ac52d9b 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -123,7 +123,7 @@ static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
}

-static bool handle_halt(void)
+static int handle_halt(struct ve_info *ve)
{
/*
* Since non safe halt is mainly used in CPU offlining
@@ -134,9 +134,9 @@ static bool handle_halt(void)
const bool do_sti = false;

if (__halt(irq_disabled, do_sti))
- return false;
+ return -EIO;

- return true;
+ return ve->instr_len;
}

void __cpuidle tdx_safe_halt(void)
@@ -156,7 +156,7 @@ void __cpuidle tdx_safe_halt(void)
WARN_ONCE(1, "HLT instruction emulation failed\n");
}

-static bool read_msr(struct pt_regs *regs)
+static int read_msr(struct pt_regs *regs, struct ve_info *ve)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
@@ -170,14 +170,14 @@ static bool read_msr(struct pt_regs *regs)
* (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
*/
if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
- return false;
+ return -EIO;

regs->ax = lower_32_bits(args.r11);
regs->dx = upper_32_bits(args.r11);
- return true;
+ return ve->instr_len;
}

-static bool write_msr(struct pt_regs *regs)
+static int write_msr(struct pt_regs *regs, struct ve_info *ve)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
@@ -191,10 +191,13 @@ static bool write_msr(struct pt_regs *regs)
* can be found in TDX Guest-Host-Communication Interface
* (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
*/
- return !__tdx_hypercall(&args, 0);
+ if (__tdx_hypercall(&args, 0))
+ return -EIO;
+
+ return ve->instr_len;
}

-static bool handle_cpuid(struct pt_regs *regs)
+static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
@@ -212,7 +215,7 @@ static bool handle_cpuid(struct pt_regs *regs)
*/
if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
regs->ax = regs->bx = regs->cx = regs->dx = 0;
- return true;
+ return ve->instr_len;
}

/*
@@ -221,7 +224,7 @@ static bool handle_cpuid(struct pt_regs *regs)
* (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
*/
if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
- return false;
+ return -EIO;

/*
* As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
@@ -233,7 +236,7 @@ static bool handle_cpuid(struct pt_regs *regs)
regs->cx = args.r14;
regs->dx = args.r15;

- return true;
+ return ve->instr_len;
}

static bool mmio_read(int size, unsigned long addr, unsigned long *val)
@@ -259,7 +262,7 @@ static bool mmio_write(int size, unsigned long addr, unsigned long val)
EPT_WRITE, addr, val);
}

-static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
char buffer[MAX_INSN_SIZE];
unsigned long *reg, val;
@@ -270,7 +273,7 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)

/* Only in-kernel MMIO is supported */
if (WARN_ON_ONCE(user_mode(regs)))
- return false;
+ return -EFAULT;

/*
* load_unaligned_zeropad() relies on exception fixups in case of the
@@ -287,37 +290,37 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
*/
if (fixup_exception(regs, X86_TRAP_VE, 0, ve->gla)) {
/* regs->ip is adjusted by fixup_exception() */
- ve->instr_len = 0;
-
- return true;
+ return 0;
}

if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
- return false;
+ return -EFAULT;

if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
- return false;
+ return -EINVAL;

mmio = insn_decode_mmio(&insn, &size);
if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
- return false;
+ return -EINVAL;

if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
reg = insn_get_modrm_reg_ptr(&insn, regs);
if (!reg)
- return false;
+ return -EINVAL;
}

- ve->instr_len = insn.length;
-
/* Handle writes first */
switch (mmio) {
case MMIO_WRITE:
memcpy(&val, reg, size);
- return mmio_write(size, ve->gpa, val);
+ if (!mmio_write(size, ve->gpa, val))
+ return -EIO;
+ return insn.length;
case MMIO_WRITE_IMM:
val = insn.immediate.value;
- return mmio_write(size, ve->gpa, val);
+ if (!mmio_write(size, ve->gpa, val))
+ return -EIO;
+ return insn.length;
case MMIO_READ:
case MMIO_READ_ZERO_EXTEND:
case MMIO_READ_SIGN_EXTEND:
@@ -330,15 +333,15 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
* decoded or handled properly. It was likely not using io.h
* helpers or accessed MMIO accidentally.
*/
- return false;
+ return -EINVAL;
default:
WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
- return false;
+ return -EINVAL;
}

/* Handle reads */
if (!mmio_read(size, ve->gpa, &val))
- return false;
+ return -EIO;

switch (mmio) {
case MMIO_READ:
@@ -360,13 +363,13 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
default:
/* All other cases has to be covered with the first switch() */
WARN_ON_ONCE(1);
- return false;
+ return -EINVAL;
}

if (extend_size)
memset(reg, extend_val, extend_size);
memcpy(reg, &val, size);
- return true;
+ return insn.length;
}

static bool handle_in(struct pt_regs *regs, int size, int port)
@@ -417,13 +420,14 @@ static bool handle_out(struct pt_regs *regs, int size, int port)
*
* Return True on success or False on failure.
*/
-static bool handle_io(struct pt_regs *regs, u32 exit_qual)
+static int handle_io(struct pt_regs *regs, struct ve_info *ve)
{
+ u32 exit_qual = ve->exit_qual;
int size, port;
- bool in;
+ bool in, ret;

if (VE_IS_IO_STRING(exit_qual))
- return false;
+ return -EIO;

in = VE_IS_IO_IN(exit_qual);
size = VE_GET_IO_SIZE(exit_qual);
@@ -431,9 +435,13 @@ static bool handle_io(struct pt_regs *regs, u32 exit_qual)


if (in)
- return handle_in(regs, size, port);
+ ret = handle_in(regs, size, port);
else
- return handle_out(regs, size, port);
+ ret = handle_out(regs, size, port);
+ if (!ret)
+ return -EIO;
+
+ return ve->instr_len;
}

/*
@@ -443,13 +451,19 @@ static bool handle_io(struct pt_regs *regs, u32 exit_qual)
__init bool tdx_early_handle_ve(struct pt_regs *regs)
{
struct ve_info ve;
+ int ret;

tdx_get_ve_info(&ve);

if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
return false;

- return handle_io(regs, ve.exit_qual);
+ ret = handle_io(regs, &ve);
+ if (ret < 0)
+ return false;
+
+ regs->ip += ret;
+ return true;
}

void tdx_get_ve_info(struct ve_info *ve)
@@ -483,53 +497,55 @@ void tdx_get_ve_info(struct ve_info *ve)
}

/* Handle the user initiated #VE */
-static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
{
switch (ve->exit_reason) {
case EXIT_REASON_CPUID:
- return handle_cpuid(regs);
+ return handle_cpuid(regs, ve);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
- return false;
+ return -EIO;
}
}

/* Handle the kernel #VE */
-static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
+static int virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
{
switch (ve->exit_reason) {
case EXIT_REASON_HLT:
- return handle_halt();
+ return handle_halt(ve);
case EXIT_REASON_MSR_READ:
- return read_msr(regs);
+ return read_msr(regs, ve);
case EXIT_REASON_MSR_WRITE:
- return write_msr(regs);
+ return write_msr(regs, ve);
case EXIT_REASON_CPUID:
- return handle_cpuid(regs);
+ return handle_cpuid(regs, ve);
case EXIT_REASON_EPT_VIOLATION:
return handle_mmio(regs, ve);
case EXIT_REASON_IO_INSTRUCTION:
- return handle_io(regs, ve->exit_qual);
+ return handle_io(regs, ve);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
- return false;
+ return -EIO;
}
}

bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
{
- bool ret;
+ int ret;

if (user_mode(regs))
ret = virt_exception_user(regs, ve);
else
ret = virt_exception_kernel(regs, ve);

+ if (ret < 0)
+ return false;
+
/* After successful #VE handling, move the IP */
- if (ret)
- regs->ip += ve->instr_len;
+ regs->ip += ret;

- return ret;
+ return true;
}

static bool tdx_tlb_flush_required(bool private)
--
Kirill A. Shutemov

2022-05-21 03:04:03

by Dave Hansen

Subject: Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page

On 5/19/22 11:07, Kirill A. Shutemov wrote:
> On Tue, May 17, 2022 at 03:16:42PM -0700, Dave Hansen wrote:
>> See? Now everybody that goes and writes a new #VE exception helper has
>> a chance of actually getting this right. As it stands, if someone adds
>> one more of these, they'll probably get random behavior. This way, they
>> actually have to choose. They _might_ even go looking at the SDM.
>
> Okay. See below. Does it match what you had in mind?

Looks close.

> BTW, I found a bug in tdx_early_handle_ve(). It didn't update RIP.
> I don't know how it happened. Maybe it got lost on the way upstream.

Huh, so refactoring things instead of depending on magic hidden behavior
helps find bugs? Interesting. ;)

> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 9955b5a89df8..d2635ac52d9b 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -123,7 +123,7 @@ static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
> return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
> }
>
> -static bool handle_halt(void)
> +static int handle_halt(struct ve_info *ve)
> {
> /*
> * Since non safe halt is mainly used in CPU offlining
> @@ -134,9 +134,9 @@ static bool handle_halt(void)
> const bool do_sti = false;
>
> if (__halt(irq_disabled, do_sti))
> - return false;
> + return -EIO;
>
> - return true;
> + return ve->instr_len;
> }

Ideally each of these would include a comment about why we can get the
instruction length from ve_info. That "why" is currently a bit weak,
but it's something like:

/*
* In TDX guests, HLT is configured to cause exits. Assume that
* the TDX module has provided the "VM-exit instruction length".
*/

It would be nice to have some central discussion of this too to explain
that the TDX documentation is currently lacking here, but we don't need
to repeat that part in a comment 6 different times.
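
Maybe something like this near the top of the #VE handling code (rough
wording, just capturing Sean's explanation from earlier in the thread):

/*
 * ve->instr_len is only valid for #VEs that the TDX Module injects itself
 * after an instruction-based VM exit (HLT, MSR access, CPUID, port I/O).
 * The module copies the "VM-exit instruction length" out of the VMCS
 * before injecting the #VE.
 *
 * EPT-violation #VEs are different: the CPU morphs the EPT violation into
 * a #VE directly and never records an instruction length, so ve->instr_len
 * is garbage there. handle_mmio() has to decode the instruction itself and
 * return the decoded length instead.
 */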

...
> /* Handle the kernel #VE */
> -static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> +static int virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> {

/*
* Handle kernel #VEs. On success, returns the number of
* bytes RIP should be incremented (>=0) or -errno on error.
*/

> switch (ve->exit_reason) {
> case EXIT_REASON_HLT:
> - return handle_halt();
> + return handle_halt(ve);
> case EXIT_REASON_MSR_READ:
> - return read_msr(regs);
> + return read_msr(regs, ve);
> case EXIT_REASON_MSR_WRITE:
> - return write_msr(regs);
> + return write_msr(regs, ve);
> case EXIT_REASON_CPUID:
> - return handle_cpuid(regs);
> + return handle_cpuid(regs, ve);
> case EXIT_REASON_EPT_VIOLATION:
> return handle_mmio(regs, ve);
> case EXIT_REASON_IO_INSTRUCTION:
> - return handle_io(regs, ve->exit_qual);
> + return handle_io(regs, ve);
> default:
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> - return false;
> + return -EIO;
> }
> }
>
> bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> {
> - bool ret;
> + int ret;

'ret' is usually used for return values of *this* function.

Let's give it a better name, please.

> if (user_mode(regs))
> ret = virt_exception_user(regs, ve);
> else
> ret = virt_exception_kernel(regs, ve);
>
> + if (ret < 0)
> + return false;
> +
> /* After successful #VE handling, move the IP */
> - if (ret)
> - regs->ip += ve->instr_len;
> + regs->ip += ret;
>
> - return ret;
> + return true;
> }
>
> static bool tdx_tlb_flush_required(bool private)