2023-11-21 21:22:28

by Michael Kelley

Subject: [PATCH v2 0/8] x86/coco: Mark CoCo VM pages not present when changing encrypted state

From: Michael Kelley <[email protected]>

In a CoCo VM when a page transitions from encrypted to decrypted, or vice
versa, attributes in the PTE must be updated *and* the hypervisor must
be notified of the change. Because there are two separate steps, there's
a window where the settings are inconsistent. Normally the code that
initiates the transition (via set_memory_decrypted() or
set_memory_encrypted()) ensures that the memory is not being accessed
during a transition, so the window of inconsistency is not a problem.
However, the load_unaligned_zeropad() function can read arbitrary memory
pages at arbitrary times, which could read a transitioning page during
the window. In such a case, CoCo VM specific exceptions are taken
(depending on the CoCo architecture in use). Current code in those
exception handlers recovers and does "fixup" on the result returned by
load_unaligned_zeropad(). Unfortunately, this exception handling can't
work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode)
if the exceptions are routed to the paravisor. The paravisor can't
do load_unaligned_zeropad() fixup, so the exceptions would need to
be forwarded from the paravisor to the Linux guest, but there are
no architectural specs for how to do that.

Fortunately, there's a simpler way to solve the problem by changing
the core transition code in __set_memory_enc_pgtable() to do the
following:

1. Remove aliasing mappings
2. Flush the data cache if needed
3. Remove the PRESENT bit from the PTEs of all transitioning pages
4. Set/clear the encryption attribute as appropriate
5. Flush the TLB so the changed encryption attribute isn't visible
6. Notify the hypervisor of the encryption status change
7. Add back the PRESENT bit, making the changed attribute visible

With this approach, load_unaligned_zeropad() just takes its normal
page-fault-based fixup path if it touches a page that is transitioning.
As a result, load_unaligned_zeropad() and CoCo VM page transitioning
are completely decoupled. CoCo VM page transitions can proceed
without needing to handle architecture-specific exceptions and fix
things up. This decoupling reduces the complexity due to separate
TDX and SEV-SNP fixup paths, and gives more freedom to revise and
introduce new capabilities in future versions of the TDX and SEV-SNP
architectures. Paravisor scenarios work properly without needing
to forward exceptions.
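
For reference, here is a condensed sketch of the resulting sequence in
__set_memory_enc_pgtable() (cpa setup and error handling omitted; the
helpers match the code added later in this series, but this is an
illustration rather than the literal patch):

        vm_unmap_aliases();                      /* 1. remove aliasing mappings */
        if (x86_platform.guest.enc_cache_flush_required())
                cpa_flush(&cpa, 1);              /* 2. flush data cache if needed */
        set_memory_np(addr, numpages);           /* 3. clear PRESENT; flushes the TLB */
        __change_page_attr_set_clr(&cpa, 1);     /* 4. set/clear the encryption attribute */
        cpa_flush(&cpa, 0);                      /* 5. flush the TLB */
        if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
                return -EIO;                     /* 6. notify the hypervisor */
        return set_memory_p(&addr, numpages);    /* 7. restore PRESENT; no TLB flush */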

Patch 1 handles the implications of the Step 6 hypervisor callbacks
needing to do virt-to-phys translations on pages that are temporarily
marked not present.

Patch 2 ensures that Step 7 doesn't generate a TLB flush. It is a
performance optimization only and is not necessary for correctness.

Patches 3 and 4 handle the case where SEV-SNP does PVALIDATE in
Step 6, which requires a valid virtual address. But because the
PRESENT bit has been cleared in the PTE for the direct map virtual
address, the PVALIDATE fails. These patches construct a temporary
virtual address to be used by PVALIDATE. This code is SEV-SNP only
because the TDX and Hyper-V paravisor flavors operate on physical
addresses.

Patches 5 and 6 are the core change that marks the transitioning
pages as not present. Patch 6 is optional since retaining both
the "prepare" and "finish" callbacks doesn't hurt anything and
there might be an argument for retaining both for future
flexibility. However, Patch 6 *does* eliminate about 75 lines of
code and comments.

Patch 7 is a somewhat tangential cleanup that removes an unnecessary
wrapper function in the path for doing a transition.

Patch 8 adds comments describing the implications of errors when
doing a transition. These implications are discussed in the email
thread for the RFC patch [1] and in a patch proposed by Rick
Edgecombe [2][3].

With this change, the #VE and #VC exception handlers should no longer
be triggered for load_unaligned_zeropad() accesses, and the existing
code in those handlers to do the "fixup" shouldn't be needed. But I
have not removed that code in this patch set. Kirill Shutemov wants
to keep the code for TDX #VE, so the code for #VC on the SEV-SNP
side has also been kept.

This patch set is based on the linux-next-20231117 code tree.

Changes in v2:
* Added Patches 3 and 4 to deal with the failure on SEV-SNP
[Tom Lendacky]
* Split the main change into two separate patches (Patch 5 and
Patch 6) to improve reviewability and to offer the option of
retaining both hypervisor callbacks.
* Patch 5 moves set_memory_p() out of an #ifdef CONFIG_X86_64
so that the code builds correctly for 32-bit, even though it
is never executed for 32-bit [reported by kernel test robot]

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/

Michael Kelley (8):
  x86/coco: Use slow_virt_to_phys() in page transition hypervisor
    callbacks
  x86/mm: Don't do a TLB flush if changing a PTE that isn't marked
    present
  x86/mm: Remove "static" from vmap_pages_range()
  x86/sev: Enable PVALIDATE for PFNs without a valid virtual address
  x86/mm: Mark CoCo VM pages not present while changing encrypted state
  x86/mm: Merge CoCo prepare and finish hypervisor callbacks
  x86/mm: Remove unnecessary call layer for __set_memory_enc_pgtable()
  x86/mm: Add comments about errors in
    set_memory_decrypted()/encrypted()

arch/x86/boot/compressed/sev.c | 2 +-
arch/x86/coco/tdx/tdx.c | 66 +----------------
arch/x86/hyperv/ivm.c | 15 ++--
arch/x86/include/asm/sev.h | 6 +-
arch/x86/include/asm/x86_init.h | 4 --
arch/x86/kernel/sev-shared.c | 57 ++++++++++++---
arch/x86/kernel/sev.c | 43 ++++++-----
arch/x86/kernel/x86_init.c | 4 --
arch/x86/mm/mem_encrypt_amd.c | 23 +-----
arch/x86/mm/pat/set_memory.c | 122 ++++++++++++++++++++++----------
include/linux/vmalloc.h | 2 +
mm/vmalloc.c | 2 +-
12 files changed, 171 insertions(+), 175 deletions(-)

--
2.25.1


2023-11-21 21:23:08

by Michael Kelley

Subject: [PATCH v2 2/8] x86/mm: Don't do a TLB flush if changing a PTE that isn't marked present

From: Michael Kelley <[email protected]>

The core function __change_page_attr() currently sets up a TLB flush if
a PTE is changed. But if the old value of the PTE doesn't include the
PRESENT flag, the PTE won't be in the TLB, so a flush isn't needed.

Avoid an unnecessary TLB flush by conditioning the flush on the old
PTE value including PRESENT. This change improves the performance of
functions like set_memory_p() by avoiding the flush if the memory range
was previously all not present.
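
As an illustration (a sketch, not code from this patch), the calling
pattern that benefits is one that brackets its work with
set_memory_np()/set_memory_p(), as the later patches in this series do:

        set_memory_np(addr, numpages);   /* clears PRESENT and flushes the TLB */
        /* ... change other attributes, notify the hypervisor, etc. ... */
        set_memory_p(&addr, numpages);   /* old PTEs were not present, so no flush */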

Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 8e19796e7ce5..d7ef8d312a47 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1636,7 +1636,10 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
*/
if (pte_val(old_pte) != pte_val(new_pte)) {
set_pte_atomic(kpte, new_pte);
- cpa->flags |= CPA_FLUSHTLB;
+
+ /* If old_pte isn't present, it's not in the TLB */
+ if (pte_present(old_pte))
+ cpa->flags |= CPA_FLUSHTLB;
}
cpa->numpages = 1;
return 0;
--
2.25.1

2023-11-21 21:23:10

by Michael Kelley

Subject: [PATCH v2 7/8] x86/mm: Remove unnecessary call layer for __set_memory_enc_pgtable()

From: Michael Kelley <[email protected]>

__set_memory_enc_pgtable() is only called from __set_memory_enc_dec()
after doing a simple validation check. Prior to commit 812b0597fb40,
__set_memory_enc_dec() did more complex checking, but now the code
can be simplified by collapsing the two functions.

No functional change.

Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index c13178f37b13..7365c86a7ff0 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2131,15 +2131,18 @@ int set_memory_global(unsigned long addr, int numpages)
}

/*
- * __set_memory_enc_pgtable() is used for the hypervisors that get
+ * __set_memory_enc_dec() is used for the hypervisors that get
* informed about "encryption" status via page tables.
*/
-static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
+static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
pgprot_t empty = __pgprot(0);
struct cpa_data cpa;
int ret;

+ if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ return 0;
+
/* Should not be working on unaligned addresses */
if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
addr &= PAGE_MASK;
@@ -2200,14 +2203,6 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
return set_memory_p(&addr, numpages);
}

-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
-{
- if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
- return __set_memory_enc_pgtable(addr, numpages, enc);
-
- return 0;
-}
-
int set_memory_encrypted(unsigned long addr, int numpages)
{
return __set_memory_enc_dec(addr, numpages, true);
--
2.25.1

2023-11-21 21:23:12

by Michael Kelley

Subject: [PATCH v2 3/8] x86/mm: Remove "static" from vmap_pages_range()

From: Michael Kelley <[email protected]>

The mm subsystem currently provides no mechanism to map memory pages
to a specified virtual address range. A virtual address range can be
allocated using get_vm_area(), but the only function available for
mapping memory pages to a caller-specified address in that range is
ioremap_page_range(), which is inappropriate for system memory.

Fix this by allowing vmap_pages_range() to be used by callers outside
of vmalloc.c.
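
For illustration only (not part of this patch), a caller could pair the
two steps like this, assuming "page" is the single struct page to map
and omitting error checks:

        struct vm_struct *area = get_vm_area(PAGE_SIZE, 0);
        unsigned long va = (unsigned long)area->addr;
        int ret;

        ret = vmap_pages_range(va, va + PAGE_SIZE, PAGE_KERNEL, &page, PAGE_SHIFT);
        if (!ret) {
                /* ... access the page contents through va ... */
                vunmap_range(va, va + PAGE_SIZE);
        }
        free_vm_area(area);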

Signed-off-by: Michael Kelley <[email protected]>
---
include/linux/vmalloc.h | 2 ++
mm/vmalloc.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..ee12f5226a45 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -233,6 +233,8 @@ static inline bool is_vm_area_hugepages(const void *addr)

#ifdef CONFIG_MMU
void vunmap_range(unsigned long addr, unsigned long end);
+int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
+ struct page **pages, unsigned int page_shift);
static inline void set_vm_flush_reset_perms(void *addr)
{
struct vm_struct *vm = find_vm_area(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..b2a72bd317c6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -625,7 +625,7 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
* RETURNS:
* 0 on success, -errno on failure.
*/
-static int vmap_pages_range(unsigned long addr, unsigned long end,
+int vmap_pages_range(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
int err;
--
2.25.1

2023-11-21 21:24:09

by Michael Kelley

Subject: [PATCH v2 4/8] x86/sev: Enable PVALIDATE for PFNs without a valid virtual address

From: Michael Kelley <[email protected]>

For SEV-SNP, the PVALIDATE instruction requires a valid virtual
address, which it translates to the PFN that it operates on. Per
the spec, it translates the virtual address as if it were doing a
single byte read.

In transitioning a page between encrypted and decrypted, the direct
map virtual address of the page may be temporarily marked invalid
(i.e., PRESENT is cleared in the PTE) to prevent interference from
load_unaligned_zeropad(). In such a case, the PVALIDATE that is
required for the encrypted<->decrypted transition fails due to
an invalid virtual address.

Fix this by providing a temporary virtual address that is mapped to the
target PFN just before executing PVALIDATE. Have PVALIDATE use this
temp virtual address instead of the direct map virtual address. Unmap
the temp virtual address after PVALIDATE completes.

The temp virtual address must be aligned on a 2 MB boundary
to meet PVALIDATE requirements for operating on 2 MB large pages,
though the temp mapping need only be a 4K mapping. Also, the temp
virtual address must be preceded by a 4K invalid page so it can't
be accessed by load_unaligned_zeropad().

This mechanism is used only for pages transitioning between encrypted
and decrypted. When PVALIDATE is done for initial page acceptance,
a temp virtual address is not provided, and PVALIDATE uses the
direct map virtual address.
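
As a sketch of the address arithmetic (mirroring the sev.c hunks below),
reserve 2 MB plus one extra page of vmalloc space, then round up past the
first page so the temp address is 2 MB aligned and the page before it is
guaranteed to be unmapped:

        area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0);
        temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE);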

Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/boot/compressed/sev.c | 2 +-
arch/x86/kernel/sev-shared.c | 57 +++++++++++++++++++++++++++-------
arch/x86/kernel/sev.c | 32 ++++++++++++-------
3 files changed, 67 insertions(+), 24 deletions(-)

diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 454acd7a2daf..4d4a3fc0b725 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -224,7 +224,7 @@ static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc,
if (vmgexit_psc(boot_ghcb, desc))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);

- pvalidate_pages(desc);
+ pvalidate_pages(desc, 0);

return pa;
}
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index ccb0915e84e1..fc45fdcf3892 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -1071,35 +1071,70 @@ static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
}
}

-static void pvalidate_pages(struct snp_psc_desc *desc)
+#ifdef __BOOT_COMPRESSED
+static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
+ unsigned long pfn, bool validate, int *rc2)
+{
+ return 0;
+}
+#else
+static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
+ unsigned long pfn, bool validate, int *rc2)
+{
+ int rc;
+ struct page *page = pfn_to_page(pfn);
+
+ *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
+ PAGE_KERNEL, &page, PAGE_SHIFT);
+ rc = pvalidate(vaddr, size, validate);
+ vunmap_range(vaddr, vaddr + PAGE_SIZE);
+
+ return rc;
+}
+#endif
+
+static void pvalidate_pages(struct snp_psc_desc *desc, unsigned long vaddr)
{
struct psc_entry *e;
- unsigned long vaddr;
+ unsigned long pfn;
unsigned int size;
unsigned int i;
bool validate;
- int rc;
+ int rc, rc2 = 0;

for (i = 0; i <= desc->hdr.end_entry; i++) {
e = &desc->entries[i];

- vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
- size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+ size = e->pagesize;
validate = e->operation == SNP_PAGE_STATE_PRIVATE;
+ pfn = e->gfn;

- rc = pvalidate(vaddr, size, validate);
- if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
- unsigned long vaddr_end = vaddr + PMD_SIZE;
+ if (vaddr) {
+ rc = pvalidate_pfn(vaddr, size, pfn, validate, &rc2);
+ } else {
+ vaddr = (unsigned long)pfn_to_kaddr(pfn);
+ rc = pvalidate(vaddr, size, validate);
+ }

- for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
- rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+ unsigned long last_pfn = pfn + PTRS_PER_PMD - 1;
+
+ for (; pfn <= last_pfn; pfn++) {
+ if (vaddr) {
+ rc = pvalidate_pfn(vaddr, RMP_PG_SIZE_4K,
+ pfn, validate, &rc2);
+ } else {
+ vaddr = (unsigned long)pfn_to_kaddr(pfn);
+ rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ }
if (rc)
break;
}
}

if (rc) {
- WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc);
+ WARN(1, "Failed to validate address 0x%lx ret %d ret2 %d",
+ vaddr, rc, rc2);
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
}
}
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 7eac92c07a58..08b2e2a0d67d 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -790,7 +790,7 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
}

static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
- unsigned long vaddr_end, int op)
+ unsigned long vaddr_end, int op, unsigned long temp_vaddr)
{
struct ghcb_state state;
bool use_large_entry;
@@ -842,7 +842,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long

/* Page validation must be rescinded before changing to shared */
if (op == SNP_PAGE_STATE_SHARED)
- pvalidate_pages(data);
+ pvalidate_pages(data, temp_vaddr);

local_irq_save(flags);

@@ -862,12 +862,13 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long

/* Page validation must be performed after changing to private */
if (op == SNP_PAGE_STATE_PRIVATE)
- pvalidate_pages(data);
+ pvalidate_pages(data, temp_vaddr);

return vaddr;
}

-static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
+static void set_pages_state(unsigned long vaddr, unsigned long npages,
+ int op, unsigned long temp_vaddr)
{
struct snp_psc_desc desc;
unsigned long vaddr_end;
@@ -880,23 +881,30 @@ static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
vaddr_end = vaddr + (npages << PAGE_SHIFT);

while (vaddr < vaddr_end)
- vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op);
+ vaddr = __set_pages_state(&desc, vaddr, vaddr_end,
+ op, temp_vaddr);
}

void snp_set_memory_shared(unsigned long vaddr, unsigned long npages)
{
- if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
- return;
+ struct vm_struct *area;
+ unsigned long temp_vaddr;

- set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED);
+ area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0);
+ temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE);
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED, temp_vaddr);
+ free_vm_area(area);
}

void snp_set_memory_private(unsigned long vaddr, unsigned long npages)
{
- if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
- return;
+ struct vm_struct *area;
+ unsigned long temp_vaddr;

- set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+ area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0);
+ temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE);
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE, temp_vaddr);
+ free_vm_area(area);
}

void snp_accept_memory(phys_addr_t start, phys_addr_t end)
@@ -909,7 +917,7 @@ void snp_accept_memory(phys_addr_t start, phys_addr_t end)
vaddr = (unsigned long)__va(start);
npages = (end - start) >> PAGE_SHIFT;

- set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE, 0);
}

static int snp_set_vmsa(void *va, bool vmsa)
--
2.25.1

2023-11-21 21:24:10

by Michael Kelley

Subject: [PATCH v2 5/8] x86/mm: Mark CoCo VM pages not present while changing encrypted state

From: Michael Kelley <[email protected]>

In a CoCo VM when a page transitions from encrypted to decrypted, or vice
versa, attributes in the PTE must be updated *and* the hypervisor must
be notified of the change. Because there are two separate steps, there's
a window where the settings are inconsistent. Normally the code that
initiates the transition (via set_memory_decrypted() or
set_memory_encrypted()) ensures that the memory is not being accessed
during a transition, so the window of inconsistency is not a problem.
However, the load_unaligned_zeropad() function can read arbitrary memory
pages at arbitrary times, which could access a transitioning page during
the window. In such a case, CoCo VM specific exceptions are taken
(depending on the CoCo architecture in use). Current code in those
exception handlers recovers and does "fixup" on the result returned by
load_unaligned_zeropad(). Unfortunately, this exception handling can't
work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode)
if the exceptions are routed to the paravisor. The paravisor
can't do the load_unaligned_zeropad() fixup, so the exceptions would
need to be forwarded from the paravisor to the Linux guest, but there
are no architectural specs for how to do that.

Fortunately, there's a simpler way to solve the problem by changing
the core transition code in __set_memory_enc_pgtable() to do the
following:

1. Remove aliasing mappings
2. Flush the data cache if needed
3. Remove the PRESENT bit from the PTEs of all transitioning pages
4. Notify the hypervisor of the impending encryption status change
5. Set/clear the encryption attribute as appropriate
6. Flush the TLB so the changed encryption attribute isn't visible
7. Notify the hypervisor after the encryption status change
8. Add back the PRESENT bit, making the changed attribute visible

With this approach, load_unaligned_zeropad() just takes its normal
page-fault-based fixup path if it touches a page that is transitioning.
As a result, load_unaligned_zeropad() and CoCo VM page transitioning
are completely decoupled. CoCo VM page transitions can proceed
without needing to handle architecture-specific exceptions and fix
things up. This decoupling reduces the complexity due to separate
TDX and SEV-SNP fixup paths, and gives more freedom to revise and
introduce new capabilities in future versions of the TDX and SEV-SNP
architectures. Paravisor scenarios work properly without needing
to forward exceptions.

Because Step 3 always does a TLB flush, the separate TLB flush
callback is no longer required and is removed.

Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 20 --------------
arch/x86/hyperv/ivm.c | 6 -----
arch/x86/include/asm/x86_init.h | 2 --
arch/x86/kernel/x86_init.c | 2 --
arch/x86/mm/mem_encrypt_amd.c | 6 -----
arch/x86/mm/pat/set_memory.c | 48 ++++++++++++++++++++++++---------
6 files changed, 35 insertions(+), 49 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1b5d17a9f70d..39ead21bcba6 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -697,24 +697,6 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
return true;
}

-static bool tdx_tlb_flush_required(bool private)
-{
- /*
- * TDX guest is responsible for flushing TLB on private->shared
- * transition. VMM is responsible for flushing on shared->private.
- *
- * The VMM _can't_ flush private addresses as it can't generate PAs
- * with the guest's HKID. Shared memory isn't subject to integrity
- * checking, i.e. the VMM doesn't need to flush for its own protection.
- *
- * There's no need to flush when converting from shared to private,
- * as flushing is the VMM's responsibility in this case, e.g. it must
- * flush to avoid integrity failures in the face of a buggy or
- * malicious guest.
- */
- return !private;
-}
-
static bool tdx_cache_flush_required(void)
{
/*
@@ -876,9 +858,7 @@ void __init tdx_early_init(void)
*/
x86_platform.guest.enc_status_change_prepare = tdx_enc_status_change_prepare;
x86_platform.guest.enc_status_change_finish = tdx_enc_status_change_finish;
-
x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
- x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;

/*
* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 8ba18635e338..4005c573e00c 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -550,11 +550,6 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
return result;
}

-static bool hv_vtom_tlb_flush_required(bool private)
-{
- return true;
-}
-
static bool hv_vtom_cache_flush_required(void)
{
return false;
@@ -614,7 +609,6 @@ void __init hv_vtom_init(void)

x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
- x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;

/* Set WB as the default cache mode. */
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index c878616a18b8..5b3a9a214815 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -146,13 +146,11 @@ struct x86_init_acpi {
*
* @enc_status_change_prepare Notify HV before the encryption status of a range is changed
* @enc_status_change_finish Notify HV after the encryption status of a range is changed
- * @enc_tlb_flush_required Returns true if a TLB flush is needed before changing page encryption status
* @enc_cache_flush_required Returns true if a cache flush is needed before changing page encryption status
*/
struct x86_guest {
bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
- bool (*enc_tlb_flush_required)(bool enc);
bool (*enc_cache_flush_required)(void);
};

diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a37ebd3b4773..1c0d23a2b6cf 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -133,7 +133,6 @@ static void default_nmi_init(void) { };

static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
-static bool enc_tlb_flush_required_noop(bool enc) { return false; }
static bool enc_cache_flush_required_noop(void) { return false; }
static bool is_private_mmio_noop(u64 addr) {return false; }

@@ -156,7 +155,6 @@ struct x86_platform_ops x86_platform __ro_after_init = {
.guest = {
.enc_status_change_prepare = enc_status_change_prepare_noop,
.enc_status_change_finish = enc_status_change_finish_noop,
- .enc_tlb_flush_required = enc_tlb_flush_required_noop,
.enc_cache_flush_required = enc_cache_flush_required_noop,
},
};
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index a68f2dda0948..652cc61b89b6 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -242,11 +242,6 @@ static unsigned long pg_level_to_pfn(int level, pte_t *kpte, pgprot_t *ret_prot)
return pfn;
}

-static bool amd_enc_tlb_flush_required(bool enc)
-{
- return true;
-}
-
static bool amd_enc_cache_flush_required(void)
{
return !cpu_feature_enabled(X86_FEATURE_SME_COHERENT);
@@ -464,7 +459,6 @@ void __init sme_early_init(void)

x86_platform.guest.enc_status_change_prepare = amd_enc_status_change_prepare;
x86_platform.guest.enc_status_change_finish = amd_enc_status_change_finish;
- x86_platform.guest.enc_tlb_flush_required = amd_enc_tlb_flush_required;
x86_platform.guest.enc_cache_flush_required = amd_enc_cache_flush_required;

/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index d7ef8d312a47..b125035608d5 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2019,6 +2019,11 @@ int set_memory_wb(unsigned long addr, int numpages)
}
EXPORT_SYMBOL(set_memory_wb);

+static int set_memory_p(unsigned long *addr, int numpages)
+{
+ return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
+}
+
/* Prevent speculative access to a page by marking it not-present */
#ifdef CONFIG_X86_64
int set_mce_nospec(unsigned long pfn)
@@ -2049,11 +2054,6 @@ int set_mce_nospec(unsigned long pfn)
return rc;
}

-static int set_memory_p(unsigned long *addr, int numpages)
-{
- return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
-}
-
/* Restore full speculative operation to the pfn. */
int clear_mce_nospec(unsigned long pfn)
{
@@ -2144,6 +2144,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
addr &= PAGE_MASK;

+ /*
+ * The caller must ensure that the memory being transitioned between
+ * encrypted and decrypted is not being accessed. But if
+ * load_unaligned_zeropad() touches the "next" page, it may generate a
+ * read access the caller has no control over. To ensure such accesses
+ * cause a normal page fault for the load_unaligned_zeropad() handler,
+ * mark the pages not present until the transition is complete. We
+ * don't want a #VE or #VC fault due to a mismatch in the memory
+ * encryption status, since paravisor configurations can't cleanly do
+ * the load_unaligned_zeropad() handling in the paravisor.
+ *
+ * set_memory_np() flushes the TLB.
+ */
+ ret = set_memory_np(addr, numpages);
+ if (ret)
+ return ret;
+
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
cpa.numpages = numpages;
@@ -2156,14 +2173,16 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
vm_unmap_aliases();

/* Flush the caches as needed before changing the encryption attribute. */
- if (x86_platform.guest.enc_tlb_flush_required(enc))
- cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
+ if (x86_platform.guest.enc_cache_flush_required())
+ cpa_flush(&cpa, 1);

/* Notify hypervisor that we are about to set/clr encryption attribute. */
if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
return -EIO;

ret = __change_page_attr_set_clr(&cpa, 1);
+ if (ret)
+ return ret;

/*
* After changing the encryption attribute, we need to flush TLBs again
@@ -2174,13 +2193,16 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
*/
cpa_flush(&cpa, 0);

- /* Notify hypervisor that we have successfully set/clr encryption attribute. */
- if (!ret) {
- if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
- ret = -EIO;
- }
+ /* Notify hypervisor that we have successfully set/clr encryption attr. */
+ if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
+ return -EIO;

- return ret;
+ /*
+ * Now that the hypervisor is sync'ed with the page table changes
+ * made here, add back _PAGE_PRESENT. set_memory_p() does not flush
+ * the TLB.
+ */
+ return set_memory_p(&addr, numpages);
}

static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
--
2.25.1

2023-11-21 21:24:14

by Michael Kelley

Subject: [PATCH v2 8/8] x86/mm: Add comments about errors in set_memory_decrypted()/encrypted()

From: Michael Kelley <[email protected]>

The functions set_memory_decrypted()/encrypted() may leave the input
memory range in an inconsistent state if an error occurs. Add comments
describing the situation and what callers must be aware of. Also add
comments in __set_memory_enc_dec() with more details on the issues and
why further investment in error handling is not likely to be useful.

No functional change.

Suggested-by: Rick Edgecombe <[email protected]>
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 7365c86a7ff0..f519e5ca543b 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2133,6 +2133,24 @@ int set_memory_global(unsigned long addr, int numpages)
/*
* __set_memory_enc_dec() is used for the hypervisors that get
* informed about "encryption" status via page tables.
+ *
+ * If an error occurs in making the transition between encrypted and
+ * decrypted, the transitioned memory is left in an indeterminate state.
+ * The encryption status in the guest page tables may not match the
+ * hypervisor's view of the encryption status, making the memory unusable.
+ * If the memory consists of multiple pages, different pages may be in
+ * different indeterminate states.
+ *
+ * It is difficult to recover from errors such that we can ensure
+ * consistency between the page tables and hypervisor view of the encryption
+ * state. It may not be possible to back out of changes, particularly if the
+ * failure occurs in communicating with the hypervisor. Given this limitation,
+ * further work on the error handling is not likely to meaningfully improve
+ * the reliability or usability of the system.
+ *
+ * Any errors are likely to soon render the VM inoperable, but we return
+ * an error rather than panicking so that the caller can decide how best
+ * to shut down cleanly.
*/
static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
@@ -2203,6 +2221,14 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
return set_memory_p(&addr, numpages);
}

+/*
+ * If set_memory_encrypted()/decrypted() returns an error, the input memory
+ * range is left in an indeterminate state. The encryption status of pages
+ * may be inconsistent, so the memory is unusable. The caller should not try
+ * to do further operations on the memory, or return it to the free list.
+ * The memory must be leaked, and the caller should take steps to shut
+ * down the system as cleanly as possible, as something is seriously wrong.
+ */
int set_memory_encrypted(unsigned long addr, int numpages)
{
return __set_memory_enc_dec(addr, numpages, true);
--
2.25.1

2023-11-22 06:29:10

by Christoph Hellwig

Subject: Re: [PATCH v2 3/8] x86/mm: Remove "static" from vmap_pages_range()

On Tue, Nov 21, 2023 at 01:20:11PM -0800, [email protected] wrote:
> From: Michael Kelley <[email protected]>
>
> The mm subsystem currently provides no mechanism to map memory pages
> to a specified virtual address range. A virtual address range can be
> allocated using get_vm_area(), but the only function available for
> mapping memory pages to a caller-specified address in that range is
> ioremap_page_range(), which is inappropriate for system memory.
>
> Fix this by allowing vmap_pages_range() to be used by callers outside
> of vmalloc.c.

I really do not want to expose vmap_pages_range. Please try to come up
with a good way to encapsulate your map at a certain address primitive
and implement it in vmalloc.c.

2023-11-23 00:30:04

by Michael Kelley

Subject: RE: [PATCH v2 3/8] x86/mm: Remove "static" from vmap_pages_range()

From: Christoph Hellwig <[email protected]> Sent: Tuesday, November 21, 2023 10:26 PM
>
> On Tue, Nov 21, 2023 at 01:20:11PM -0800, [email protected] wrote:
> > From: Michael Kelley <[email protected]>
> >
> > The mm subsystem currently provides no mechanism to map memory pages
> > to a specified virtual address range. A virtual address range can be
> > allocated using get_vm_area(), but the only function available for
> > mapping memory pages to a caller-specified address in that range is
> > ioremap_page_range(), which is inappropriate for system memory.
> >
> > Fix this by allowing vmap_pages_range() to be used by callers outside
> > of vmalloc.c.
>
> I really do not want to expose vmap_pages_range. Please try to come up
> with a good way to encapsulate your map at a certain address primitive
> and implement it in vmalloc.c.

To clarify, is your concern narrowly about vmap_pages_range()
specifically? Or is your concern more generally about having two separate
steps as applied to system memory: 1) allocate the virtual address
space and 2) do the mapping? The two separate steps are already
available for MMIO space. Doing the equivalent for system memory
should be straightforward.

Conversely, combining the two steps into a single new vmap() variant
would be a little messy, but can probably be made to work. This
combined approach will be less efficient since my use case does a single
allocation of virtual address space and repeatedly maps/unmaps the
same page in that space. I would need to take some measurements
to see if the inefficiency actually matters.
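
(For reference, a minimal sketch of the existing two-step pattern for
MMIO space that I'm referring to; details elided:)

        struct vm_struct *area = get_vm_area(size, VM_IOREMAP);  /* 1. reserve VA space */
        unsigned long va = (unsigned long)area->addr;

        ret = ioremap_page_range(va, va + size, phys_addr, PAGE_KERNEL_IO);  /* 2. map */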

Michael

2023-11-23 07:32:47

by Christoph Hellwig

Subject: Re: [PATCH v2 3/8] x86/mm: Remove "static" from vmap_pages_range()

On Thu, Nov 23, 2023 at 12:24:49AM +0000, Michael Kelley wrote:
> > I really do not want to expose vmap_pages_range. Please try to come up
> > with a good way to encapsulate your map at a certain address primitive
> > and implement it in vmalloc.c.
>
> To clarify, is your concern narrowly about vmap_pages_range()
> specifically?

The prime concern is that it took a lot of effort to make
vmap_pages_range static and remove all the abuses. I absolutely
object to undoing that.

2023-11-24 10:09:15

by Kirill A. Shutemov

Subject: Re: [PATCH v2 0/8] x86/coco: Mark CoCo VM pages not present when changing encrypted state

On Tue, Nov 21, 2023 at 01:20:08PM -0800, [email protected] wrote:
> From: Michael Kelley <[email protected]>
>
> In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor must
> be notified of the change.

Strictly speaking, it is not true for TDX. Conversion to shared can be
implicit: setting the shared bit and touching the page will do the
conversion. MapGPA is optional.

> Because there are two separate steps, there's
> a window where the settings are inconsistent. Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, the load_unaligned_zeropad() function can read arbitrary memory
> pages at arbitrary times, which could read a transitioning page during
> the window. In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use). Current code in those
> exception handlers recovers and does "fixup" on the result returned by
> load_unaligned_zeropad(). Unfortunately, this exception handling can't
> work in paravisor scenarios (TDX Paritioning and SEV-SNP in vTOM mode)
> if the exceptions are routed to the paravisor. The paravisor can't
> do load_unaligned_zeropad() fixup, so the exceptions would need to
> be forwarded from the paravisor to the Linux guest, but there are
> no architectural specs for how to do that.

Hm. Can't we inject #PF (or #GP) into L2 if the #VE/#VC handler in L1 sees
a cross-page access to shared memory with no fixup entry for the page in
L1? It would give L2 a chance to handle the situation in a transparent way.

Maybe I miss something, I donno.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-11-27 01:06:45

by Michael Kelley

Subject: RE: [PATCH v2 3/8] x86/mm: Remove "static" from vmap_pages_range()

From: Christoph Hellwig <[email protected]> Sent: Wednesday, November 22, 2023 11:32 PM
>
> On Thu, Nov 23, 2023 at 12:24:49AM +0000, Michael Kelley wrote:
> > > I really do not want to expose vmap_pages_range. Please try to come up
> > > with a good way to encapsulate your map at a certain address primitive
> > > and implement it in vmalloc.c.
> >
> > To clarify, is your concern narrowly about vmap_pages_range()
> > specifically?
>
> The prime concern is that it took a lot of effort to make
> vmap_pages_range static and remove all the abuses. I absolutely
> object to undoing that.

OK, so I assume that means a new variant of vmap_pages_range(),
such as one that always sets the page_shift parameter to PAGE_SHIFT,
is also disallowed because of the same potential for abuse.

So the only way to map a system memory page to a vmalloc
vaddr is via vmap() or some vmap() variant, which always
creates a new vmalloc area via get_vm_area(). I've done the
perf measurements, and that approach won't work for this use
case. Independent of the alignment requirements, the churn in
creating and removing a lot of vmalloc areas has too much perf
impact. The use case needs to create a single vmalloc area, and
then repeatedly map/unmap a page in that existing area.

I'll have to handle the top-level problem in this patch set in
a completely different way.

Michael

2023-11-27 21:39:29

by Edgecombe, Rick P

Subject: Re: [PATCH v2 4/8] x86/sev: Enable PVALIDATE for PFNs without a valid virtual address

On Tue, 2023-11-21 at 13:20 -0800, [email protected] wrote:
> +static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
> +                        unsigned long pfn, bool validate, int *rc2)
> +{
> +       int rc;
> +       struct page *page = pfn_to_page(pfn);
> +
> +       *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
> +                       PAGE_KERNEL, &page, PAGE_SHIFT);

Can't this fail and then the pvalidate below would encounter trouble?

Sort of separately, if those vmalloc objections can't be worked
through, did you consider doing something like text_poke() does (create
the temporary mapping in a temporary MM) for pvalidate purposes? I
don't know enough about what kind of special exceptions might popup
during that operation though, might be playing with fire...

> +       rc = pvalidate(vaddr, size, validate);
> +       vunmap_range(vaddr, vaddr + PAGE_SIZE);
> +
> +       return rc;
> +}

2023-11-27 22:23:58

by Edgecombe, Rick P

Subject: Re: [PATCH v2 2/8] x86/mm: Don't do a TLB flush if changing a PTE that isn't marked present

On Tue, 2023-11-21 at 13:20 -0800, [email protected] wrote:
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1636,7 +1636,10 @@ static int __change_page_attr(struct cpa_data
> *cpa, int primary)
>                  */
>                 if (pte_val(old_pte) != pte_val(new_pte)) {
>                         set_pte_atomic(kpte, new_pte);
> -                       cpa->flags |= CPA_FLUSHTLB;
> +
> +                       /* If old_pte isn't present, it's not in the
> TLB */
> +                       if (pte_present(old_pte))
> +                               cpa->flags |= CPA_FLUSHTLB;
>                 }
>                 cpa->numpages = 1;
>                 return 0;
>

Makes sense to me. The PMD case can be handled similarly in
__should_split_large_page().

I also think it should be more robust in regards to the cache flushing
changes.

If callers did:
set_memory_np()
set_memory_uc()
set_memory_p()

Then the cache flush would be missed. I don't think anyone is, but we
shouldn't introduce hidden things like that. Maybe fix it like this:

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index f519e5ca543b..28ff53a4447a 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1856,11 +1856,6 @@ static int change_page_attr_set_clr(unsigned long *addr, int numpages,
 
         ret = __change_page_attr_set_clr(&cpa, 1);
 
-        /*
-         * Check whether we really changed something:
-         */
-        if (!(cpa.flags & CPA_FLUSHTLB))
-                goto out;
 
         /*
          * No need to flush, when we did not set any of the caching
@@ -1868,6 +1863,12 @@ static int change_page_attr_set_clr(unsigned long *addr, int numpages,
          */
         cache = !!pgprot2cachemode(mask_set);
 
+        /*
+         * Check whether we really changed something:
+         */
+        if (!(cpa.flags & CPA_FLUSHTLB) && !cache)
+                goto out;
+
         /*
          * On error; flush everything to be sure.
          */

Hmm, might want to maintain the "On error; flush everything to be sure"
logic in the NP->P case as well.

2023-11-28 17:34:57

by Michael Kelley

Subject: RE: [PATCH v2 2/8] x86/mm: Don't do a TLB flush if changing a PTE that isn't marked present

From: Edgecombe, Rick P <[email protected]> Sent: Monday, November 27, 2023 2:21 PM
>
> On Tue, 2023-11-21 at 13:20 -0800, [email protected] wrote:
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -1636,7 +1636,10 @@ static int __change_page_attr(struct cpa_data
> > *cpa, int primary)
> >                  */
> >                 if (pte_val(old_pte) != pte_val(new_pte)) {
> >                         set_pte_atomic(kpte, new_pte);
> > -                       cpa->flags |= CPA_FLUSHTLB;
> > +
> > +                       /* If old_pte isn't present, it's not in the TLB */
> > +                       if (pte_present(old_pte))
> > +                               cpa->flags |= CPA_FLUSHTLB;
> >                 }
> >                 cpa->numpages = 1;
> >                 return 0;
> >
>
> Makes sense to me. The PMD case can be handled similarly in
> __should_split_large_page().

OK, I'll look at that case.

>
> I also think it should be more robust in regards to the cache flushing
> changes.
>
> If callers did:
> set_memory_np()
> set_memory_uc()
> set_memory_p()
>
> Then the cache flush would be missed. I don't think anyone is, but we
> shouldn't introduce hidden things like that. Maybe fix it like this:
>
> diff --git a/arch/x86/mm/pat/set_memory.c
> b/arch/x86/mm/pat/set_memory.c
> index f519e5ca543b..28ff53a4447a 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1856,11 +1856,6 @@ static int change_page_attr_set_clr(unsigned
> long *addr, int numpages,
>
> ret = __change_page_attr_set_clr(&cpa, 1);
>
> - /*
> - * Check whether we really changed something:
> - */
> - if (!(cpa.flags & CPA_FLUSHTLB))
> - goto out;
>
> /*
> * No need to flush, when we did not set any of the caching
> @@ -1868,6 +1863,12 @@ static int change_page_attr_set_clr(unsigned
> long *addr, int numpages,
> */
> cache = !!pgprot2cachemode(mask_set);
>
> + /*
> + * Check whether we really changed something:
> + */
> + if (!(cpa.flags & CPA_FLUSHTLB) && !cache)
> + goto out;
> +
> /*
> * On error; flush everything to be sure.
> */
>
> Hmm, might want to maintain the "On error; flush everything to be sure"
> logic in the NP->P case as well.

OK, I see your point. I had not realized that CPA_FLUSHTLB really
has a meaning beyond just indicating that the TLB needs to be
flushed. It really means "something has changed" in a PTE. I'll
incorporate your suggestion.

Michael

2023-11-28 18:08:44

by Michael Kelley

Subject: RE: [PATCH v2 4/8] x86/sev: Enable PVALIDATE for PFNs without a valid virtual address

From: Edgecombe, Rick P <[email protected]> Sent: Monday, November 27, 2023 1:39 PM
>
> On Tue, 2023-11-21 at 13:20 -0800, [email protected] wrote:
> > +static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
> > +                        unsigned long pfn, bool validate, int *rc2)
> > +{
> > +       int rc;
> > +       struct page *page = pfn_to_page(pfn);
> > +
> > +       *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
> > +                       PAGE_KERNEL, &page, PAGE_SHIFT);
>
> Can't this fail and then the pvalidate below would encounter trouble?

Yes. My intent was to just let the PVALIDATE fail because of operating
on a vaddr that's invalid. But that would be worth a comment.

>
> Sort of separately, if those vmalloc objections can't be worked
> through, did you consider doing something like text_poke() does (create
> the temporary mapping in a temporary MM) for pvalidate purposes? I
> don't know enough about what kind of special exceptions might popup
> during that operation though, might be playing with fire...

Interesting idea. But from a quick glance at the text_poke() code,
such an approach seems somewhat complex, and I suspect it will have
the same perf issues (or worse) as creating a new vmalloc area for
each PVALIDATE invocation.

At this point, the complexity of creating the temp mapping for
PVALIDATE seems excessive. On balance it seems simpler to
revert to an approach where the use of set_memory_np() and
set_memory_p() is conditional. It would be necessary when #VC
and #VE exceptions are directed to a paravisor. (This assumes the
paravisor interface in the hypervisor callbacks does the natural thing
of working with physical addresses, so there's no need for a temp
mapping.)
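
Roughly, the conditional approach would be shaped like the following
sketch (the "paravisor_handles_exceptions" predicate is hypothetical,
used only to show the shape; it is not an existing flag or API):

        /* Hypothetical sketch only */
        if (paravisor_handles_exceptions) {
                ret = set_memory_np(addr, numpages);
                if (ret)
                        return ret;
        }

        /* ... existing attribute change and hypervisor notification ... */

        if (paravisor_handles_exceptions)
                ret = set_memory_p(&addr, numpages);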

Optionally, the set_memory_np()/set_memory_p() approach could
be used in other cases where the hypervisor callbacks work with
physical addresses. But it can't be used with cases where the hypervisor
callbacks need valid virtual addresses.

So on net, set_memory_np()/set_memory_p() would be used in
the Hyper-V cases of TDX and SEV-SNP with a paravisor. It could
optionally be used with TDX with no paravisor, but my sense is
that Kirill wants to keep TDX "as is" and let the exception handlers
do the load_unaligned_zeropad() fixup.

It could not be used with SEV-SNP with no paravisor. Additional fixes
may be needed on the SEV-SNP side to properly fixup
load_unaligned_zeropad() accesses to a page that's in transition
between encrypted and decrypted.

I'll work on a patch series that takes the conditional approach.

Michael

>
> > +       rc = pvalidate(vaddr, size, validate);
> > +       vunmap_range(vaddr, vaddr + PAGE_SIZE);
> > +
> > +       return rc;
> > +}

2023-11-28 18:59:46

by Edgecombe, Rick P

Subject: Re: [PATCH v2 4/8] x86/sev: Enable PVALIDATE for PFNs without a valid virtual address

On Tue, 2023-11-28 at 18:08 +0000, Michael Kelley wrote:
> >
> > Sort of separately, if those vmalloc objections can't be worked
> > through, did you consider doing something like text_poke() does
> > (create
> > the temporary mapping in a temporary MM) for pvalidate purposes? I
> > don't know enough about what kind of special exceptions might popup
> > during that operation though, might be playing with fire...
>
> Interesting idea.  But from a quick glance at the text_poke() code,
> such an approach seems somewhat complex, and I suspect it will have
> the same perf issues (or worse) as creating a new vmalloc area for
> each PVALIDATE invocation.

Using new vmalloc areas will eventually result in a kernel shootdown,
but usually has no flushes. text_poke will always result in a local-
only flush. So at least whatever slowdown there is would only affect
the calling thread.

As for complexity, I think it might be simple to implement actually.
What kind of special exceptions could come out of pvalidate, I'm not so
sure. But the kernel terminates the VM on failure anyway, so maybe it's
not an issue?

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 73be3931e4f0..a13293564eeb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1905,6 +1905,16 @@ void *text_poke(void *addr, const void *opcode, size_t len)
         return __text_poke(text_poke_memcpy, addr, opcode, len);
 }
 
+static void text_poke_pvalidate(void *dst, const void *src, size_t len)
+{
+        pvalidate(dst, len, true); // if fail, terminate
+}
+
+void *pvalidated_poke(void *addr)
+{
+        return __text_poke(text_poke_pvalidate, addr, NULL, PAGE_SIZE);
+}
+
 /**
  * text_poke_kgdb - Update instructions on a live kernel by kgdb
  * @addr: address to modify



>
> At this point, the complexity of creating the temp mapping for
> PVALIDATE is seeming excessive.  On balance it seems simpler to
> revert to an approach where the use of set_memory_np() and
> set_memory_p() is conditional.  It would be necessary when #VC
> and #VE exceptions are directed to a paravisor.  (This assumes the
> paravisor interface in the hypervisor callbacks does the natural
> thing
> of working with physical addresses, so there's no need for a temp
> mapping.)
>
> Optionally, the set_memory_np()/set_memory_p() approach could
> be used in other cases where the hypervisor callbacks work with
> physical addresses.  But it can't be used with cases where the
> hypervisor
> callbacks need valid virtual addresses.
>
> So on net, set_memory_np()/set_memory_p() would be used in
> the Hyper-V cases of TDX and SEV-SNP with a paravisor.   It could
> optionally be used with TDX with no paravisor, but my sense is
> that Kirill wants to keep TDX "as is" and let the exception handlers
> do the load_unaligned_zeropad() fixup.
>
> It could not be used with SEV-SNP with no paravisor.   Additional
> fixes
> may be needed on the SEV-SNP side to properly fixup
> load_unaligned_zeropad() accesses to a page that's in transition
> between encrypted and decrypted.
>

Yea, I don't know about this paravisor/exception stuff.

2023-11-28 19:12:53

by Michael Kelley

Subject: RE: [PATCH v2 0/8] x86/coco: Mark CoCo VM pages not present when changing encrypted state

From: [email protected] <[email protected]> Sent: Friday, November 24, 2023 2:06 AM
>
> On Tue, Nov 21, 2023 at 01:20:08PM -0800, [email protected] wrote:
> > From: Michael Kelley <[email protected]>
> >
> > In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> > versa, attributes in the PTE must be updated *and* the hypervisor must
> > be notified of the change.
>
> Strictly speaking it is not true for TDX. Conversion to shared can be
> implicit: set shared bit and touch the page will do the conversion. MapGPA
> is optional.

Interesting. Given that, is there a reason to use the explicit
hypervisor callbacks for private->shared transitions in
__set_memory_enc_pgtable()? It probably doesn't have direct relevance
to this patch series, but I'm just trying to understand the tradeoffs of
the implicit vs. explicit approach. And am I correct that
shared->private transitions must use the explicit approach?

>
> > Because there are two separate steps, there's
> > a window where the settings are inconsistent. Normally the code that
> > initiates the transition (via set_memory_decrypted() or
> > set_memory_encrypted()) ensures that the memory is not being accessed
> > during a transition, so the window of inconsistency is not a problem.
> > However, the load_unaligned_zeropad() function can read arbitrary memory
> > pages at arbitrary times, which could read a transitioning page during
> > the window. In such a case, CoCo VM specific exceptions are taken
> > (depending on the CoCo architecture in use). Current code in those
> > exception handlers recovers and does "fixup" on the result returned by
> > load_unaligned_zeropad(). Unfortunately, this exception handling can't
> > work in paravisor scenarios (TDX Paritioning and SEV-SNP in vTOM mode)
> > if the exceptions are routed to the paravisor. The paravisor can't
> > do load_unaligned_zeropad() fixup, so the exceptions would need to
> > be forwarded from the paravisor to the Linux guest, but there are
> > no architectural specs for how to do that.
>
> Hm. Can't we inject #PF (or #GP) into L2 if #VE/#VC handler in L1 sees
> cross-page access to shared memory while no fixup entry for the page in
> L1. It would give L2 chance to handle the situation in a transparent way.
>
> Maybe I miss something, I donno.

I'm recounting what the Hyper-V paravisor folks say without knowing all the
details. :-( But it seems like any kind of forwarding scheme needs to be a
well-defined contract that would work for both TDX and SEV-SNP. The
paravisor in L1 might or might not be Linux-based, so the contract must be OS
independent. And the L2 guest might or might not be Linux, so there's
potential for some other kind of error to be confused with a Linux
load_unaligned_zeropad() reference.

Maybe all that could be sorted out, but I come back to believing that it's
better now and in the long run to just avoid all this complexity by decoupling
private-shared page transitions and Linux load_unaligned_zeropad().
Unfortunately that decoupling hasn't been as simple as I first envisioned
because of SEV-SNP PVALIDATE needing a virtual address. But doing the
decoupling only in the paravisor case still seems like the simpler approach.

Michael

>
> --
> Kiryl Shutsemau / Kirill A. Shutemov

2023-11-29 15:11:45

by Kirill A. Shutemov

Subject: Re: [PATCH v2 0/8] x86/coco: Mark CoCo VM pages not present when changing encrypted state

On Tue, Nov 28, 2023 at 07:12:33PM +0000, Michael Kelley wrote:
> From: [email protected] <[email protected]> Sent: Friday, November 24, 2023 2:06 AM
> >
> > On Tue, Nov 21, 2023 at 01:20:08PM -0800, [email protected] wrote:
> > > From: Michael Kelley <[email protected]>
> > >
> > > In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > be notified of the change.
> >
> > Strictly speaking it is not true for TDX. Conversion to shared can be
> > implicit: set shared bit and touch the page will do the conversion. MapGPA
> > is optional.
>
> Interesting. Given that, is there a reason to use the explicit
> hypervisor callbacks in for private->shared transitions in
> __set_mem_enc_pgtable()? It probably doesn't have direct relevance
> to this patch series, but I'm just trying to understand the tradeoffs of
> the implicit vs. explicit approach. And am I correct that
> shared->private transitions must use the explicit approach?

It must be explicit in the sense that the memory has to be accepted
before use. MapGPA() is still optional.

I don't like these implicit tricks. I spent a lot of time debugging an
issue that was obscured by these semantics.

But I think it is going to stay :/

> > > Because there are two separate steps, there's
> > > a window where the settings are inconsistent. Normally the code that
> > > initiates the transition (via set_memory_decrypted() or
> > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > during a transition, so the window of inconsistency is not a problem.
> > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > pages at arbitrary times, which could read a transitioning page during
> > > the window. In such a case, CoCo VM specific exceptions are taken
> > > (depending on the CoCo architecture in use). Current code in those
> > > exception handlers recovers and does "fixup" on the result returned by
> > > load_unaligned_zeropad(). Unfortunately, this exception handling can't
> > > work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode)
> > > if the exceptions are routed to the paravisor. The paravisor can't
> > > do load_unaligned_zeropad() fixup, so the exceptions would need to
> > > be forwarded from the paravisor to the Linux guest, but there are
> > > no architectural specs for how to do that.
> >
> > Hm. Can't we inject #PF (or #GP) into L2 if the #VE/#VC handler in L1 sees a
> > cross-page access to shared memory while there is no fixup entry for the page
> > in L1? It would give L2 a chance to handle the situation in a transparent way.
> >
> > Maybe I'm missing something, I dunno.
>
> I'm recounting what the Hyper-V paravisor folks say without knowing all the
> details. :-( But it seems like any kind of forwarding scheme needs to be a
> well-defined contract that would work for both TDX and SEV-SNP. The
> paravisor in L1 might or might not be Linux-based, so the contract must be OS
> independent. And the L2 guest might or might not be Linux, so there's
> potential for some other kind of error to be confused with a Linux
> load_unaligned_zeropad() reference.

Okay, fair enough. I have a hard time reasoning about whether it is okay for
an L2 that is not Linux.


--
Kiryl Shutsemau / Kirill A. Shutemov

2023-12-12 18:35:47

by Michael Kelley

[permalink] [raw]
Subject: RE: [PATCH v2 4/8] x86/sev: Enable PVALIDATE for PFNs without a valid virtual address

From: Edgecombe, Rick P <[email protected]> Sent: Tuesday, November 28, 2023 10:59 AM
>
> On Tue, 2023-11-28 at 18:08 +0000, Michael Kelley wrote:
> > >
> > > Sort of separately, if those vmalloc objections can't be worked
> > > through, did you consider doing something like text_poke() does
> > > (create
> > > the temporary mapping in a temporary MM) for pvalidate purposes? I
> > > don't know enough about what kind of special exceptions might popup
> > > during that operation though, might be playing with fire...
> >
> > Interesting idea.  But from a quick glance at the text_poke() code,
> > such an approach seems somewhat complex, and I suspect it will have
> > the same perf issues (or worse) as creating a new vmalloc area for
> > each PVALIDATE invocation.
>
> Using new vmalloc areas will eventually result in a kernel-wide TLB
> shootdown, but usually involves no flushes. text_poke() will always result
> in a local-only flush. So at least whatever slowdown there is would only
> affect the calling thread.
>
> As for complexity, I think it might be simple to implement actually.
> What kind of special exceptions could come out of pvalidate, I'm not so
> sure. But the kernel terminates the VM on failure anyway, so maybe it's
> not an issue?

Sorry for the delay in getting back to this topic.

OK, I see now what you are suggesting. For each page that needs to
be PVALIDATE'd, use __text_poke() to create the temp mapping and
run PVALIDATE. However, there are some problems. __text_poke()
runs vmalloc_to_page() for addresses that aren't core kernel text,
and vmalloc_to_page() will fail if the PTE "present" bit has been
cleared. That could be easily addressed by changing it to use
slow_virt_to_phys(). But PVALIDATE also needs to be able to return
the PVALIDATE_FAIL_SIZEMISMATCH error code, which is tested for
in pvalidate_pages() and does not terminate the VM. __text_poke()
doesn't have the machinery to return such an error code, and
that's harder to fix.
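
To make the error-return problem concrete, here's roughly the kind of plumbing
that would be needed. Everything below is my own sketch, not existing code:
the context struct, the reuse of the "src" argument to carry it, and the
assumption of 4K pages with validate == true.

/*
 * Sketch only -- nothing here exists in the tree.  __text_poke()-style
 * callbacks return void, so the PVALIDATE result (which may legitimately be
 * PVALIDATE_FAIL_SIZEMISMATCH) has to be smuggled back through a context
 * structure passed in via the "src" argument.
 */
struct pvalidate_poke_ctx {
	int rc;				/* PVALIDATE return code */
};

static void text_poke_pvalidate(void *dst, const void *src, size_t len)
{
	struct pvalidate_poke_ctx *ctx = (struct pvalidate_poke_ctx *)src;

	ctx->rc = pvalidate((unsigned long)dst, RMP_PG_SIZE_4K, true);
}

static int pvalidated_poke(void *addr, struct pvalidate_poke_ctx *ctx)
{
	__text_poke(text_poke_pvalidate, addr, ctx, PAGE_SIZE);
	return ctx->rc;
}

And even with something like that, __text_poke() would still need the
vmalloc_to_page() -> slow_virt_to_phys() change mentioned above.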

There's also the conceptual issue. The PVALIDATE use case isn't
working with a text area, so that case would be abusing "text_poke"
a bit. I could imagine __text_poke() having code to verify that it's
working on a text area, even if that code isn't there now.

To get a sense of performance, I hacked the equivalent of text_poke()
to work with PVALIDATE. The biggest case of transitioning pages from
encrypted to decrypted is the swiotlb area, which is 1 Gbyte for a
VM with 16 Gbytes or more of memory. In a Hyper-V CoCo VM, current
code takes about 270 milliseconds to transition that 1 Gbyte swiotlb area.
With my initial approach using vmap_pages_range(), that 270 ms went to
319 ms, which is fairly negligible in the overall VM boot time. Using the
text_poke() approach increased the time to 368 ms, which is bigger but
still probably not a show-stopper. It's definitely faster than creating a
new vmalloc area for each page that needs to be PVALIDATE'd, which
adds about 6 seconds to the boot time.

All-in-all, I'm back to my Plan B, which is to mark the pages "not
present" only in configurations where the hypervisor callbacks operate
on physical addresses instead of virtual addresses. Since SEV-SNP
needs virtual addresses, it will need to handle exceptions generated
by load_unaligned_zeropad() and do the appropriate fixup.
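
To sketch what that conditional approach looks like (names here are made up
for illustration: CC_ATTR_PARAVISOR doesn't exist as an attribute today, and
set_memory_p() is assumed to be made available as this series proposes):

/*
 * Rough sketch of "Plan B": only clear and restore the PRESENT bit around
 * the private<->shared flip when exceptions go to a paravisor and the
 * hypervisor callbacks take physical addresses.  CC_ATTR_PARAVISOR is an
 * invented name standing in for that check.
 */
static int enc_transition(unsigned long addr, int numpages, bool enc)
{
	bool mark_np = cc_platform_has(CC_ATTR_PARAVISOR);
	int ret = 0;

	if (mark_np) {
		ret = set_memory_np(addr, numpages);
		if (ret)
			return ret;
	}

	/* ... flip the encryption attribute, flush TLB, notify hypervisor ... */

	if (mark_np)
		ret = set_memory_p(addr, numpages);

	return ret;
}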

I'll try to get my "Plan B" patch set posted in the next few days.

Michael

>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 73be3931e4f0..a13293564eeb 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1905,6 +1905,16 @@ void *text_poke(void *addr, const void *opcode, size_t len)
>  	return __text_poke(text_poke_memcpy, addr, opcode, len);
>  }
>  
> +static void text_poke_pvalidate(void *dst, const void *src, size_t len)
> +{
> +	pvalidate(dst, len, true); // if fail, terminate
> +}
> +
> +void *pvalidated_poke(void *addr)
> +{
> +	return __text_poke(text_poke_pvalidate, addr, NULL, PAGE_SIZE);
> +}
> +
>  /**
>   * text_poke_kgdb - Update instructions on a live kernel by kgdb
>   * @addr: address to modify
>
>
>
> >
> > At this point, the complexity of creating the temp mapping for
> > PVALIDATE is seeming excessive.  On balance it seems simpler to
> > revert to an approach where the use of set_memory_np() and
> > set_memory_p() is conditional.  It would be necessary when #VC
> > and #VE exceptions are directed to a paravisor.  (This assumes the
> > paravisor interface in the hypervisor callbacks does the natural thing
> > of working with physical addresses, so there's no need for a temp
> > mapping.)
> >
> > Optionally, the set_memory_np()/set_memory_p() approach could
> > be used in other cases where the hypervisor callbacks work with
> > physical addresses.  But it can't be used with cases where the
> > hypervisor callbacks need valid virtual addresses.
> >
> > So on net, set_memory_np()/set_memory_p() would be used in
> > the Hyper-V cases of TDX and SEV-SNP with a paravisor.   It could
> > optionally be used with TDX with no paravisor, but my sense is
> > that Kirill wants to keep TDX "as is" and let the exception handlers
> > do the load_unaligned_zeropad() fixup.
> >
> > It could not be used with SEV-SNP with no paravisor.   Additional fixes
> > may be needed on the SEV-SNP side to properly fix up
> > load_unaligned_zeropad() accesses to a page that's in transition
> > between encrypted and decrypted.
> >
>
> Yea, I don't know about this paravisor/exception stuff.