2020-09-07 17:30:12

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

From: Tom Lendacky <[email protected]>

Calling down to EFI runtime services can result in the firmware performing
VMGEXIT calls. The firmware is likely to use the GHCB of the OS (e.g., for
setting EFI variables), so each GHCB in the system needs to be identity
mapped in the EFI page tables, as unencrypted, to avoid page faults.

Signed-off-by: Tom Lendacky <[email protected]>
[ [email protected]: Moved GHCB mapping loop to sev-es.c ]
Signed-off-by: Joerg Roedel <[email protected]>
---
arch/x86/boot/compressed/sev-es.c | 1 +
arch/x86/include/asm/sev-es.h | 2 ++
arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++
arch/x86/platform/efi/efi_64.c | 10 ++++++++++
4 files changed, 43 insertions(+)

diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index 45702b866c33..0a9a248ca33d 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -12,6 +12,7 @@
*/
#include "misc.h"

+#include <asm/pgtable_types.h>
#include <asm/sev-es.h>
#include <asm/trapnr.h>
#include <asm/trap_pf.h>
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index e919f09ae33c..cf1d957c7091 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
if (static_branch_unlikely(&sev_es_enable_key))
__sev_es_nmi_complete();
}
+extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
static inline void sev_es_nmi_complete(void) { }
+static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
#endif

#endif
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 9ab3a4dfecd8..4e2b7e4d9b87 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
return 0;
}

+/*
+ * This is needed by the OVMF UEFI firmware which will use whatever it finds in
+ * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
+ * runtime GHCBs used by the kernel are also mapped in the EFI page-table.
+ */
+int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
+{
+ struct sev_es_runtime_data *data;
+ unsigned long address, pflags;
+ int cpu;
+ u64 pfn;
+
+ if (!sev_es_active())
+ return 0;
+
+ pflags = _PAGE_NX | _PAGE_RW;
+
+ for_each_possible_cpu(cpu) {
+ data = per_cpu(runtime_data, cpu);
+
+ address = __pa(&data->ghcb_page);
+ pfn = address >> PAGE_SHIFT;
+
+ if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
+ return 1;
+ }
+
+ return 0;
+}
+
static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
{
struct pt_regs *regs = ctxt->regs;
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 6af4da1149ba..8f5759df7776 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -47,6 +47,7 @@
#include <asm/realmode.h>
#include <asm/time.h>
#include <asm/pgalloc.h>
+#include <asm/sev-es.h>

/*
* We allocate runtime services regions top-down, starting from -4G, i.e.
@@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
return 1;
}

+ /*
+ * When SEV-ES is active, the GHCB as set by the kernel will be used
+ * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
+ */
+ if (sev_es_efi_map_ghcbs(pgd)) {
+ pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
+ return 1;
+ }
+
/*
* When making calls to the firmware everything needs to be 1:1
* mapped and addressable with 32-bit pointers. Map the kernel
--
2.28.0


2020-09-08 17:51:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

+ Ard so that he can ack the efi bits.

On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
> From: Tom Lendacky <[email protected]>
>
> Calling down to EFI runtime services can result in the firmware performing
> VMGEXIT calls. The firmware is likely to use the GHCB of the OS (e.g., for
> setting EFI variables), so each GHCB in the system needs to be identity
> mapped in the EFI page tables, as unencrypted, to avoid page faults.
>
> Signed-off-by: Tom Lendacky <[email protected]>
> [ [email protected]: Moved GHCB mapping loop to sev-es.c ]
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> arch/x86/boot/compressed/sev-es.c | 1 +
> arch/x86/include/asm/sev-es.h | 2 ++
> arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++
> arch/x86/platform/efi/efi_64.c | 10 ++++++++++
> 4 files changed, 43 insertions(+)
>
> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
> index 45702b866c33..0a9a248ca33d 100644
> --- a/arch/x86/boot/compressed/sev-es.c
> +++ b/arch/x86/boot/compressed/sev-es.c
> @@ -12,6 +12,7 @@
> */
> #include "misc.h"
>
> +#include <asm/pgtable_types.h>
> #include <asm/sev-es.h>
> #include <asm/trapnr.h>
> #include <asm/trap_pf.h>
> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
> index e919f09ae33c..cf1d957c7091 100644
> --- a/arch/x86/include/asm/sev-es.h
> +++ b/arch/x86/include/asm/sev-es.h
> @@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
> if (static_branch_unlikely(&sev_es_enable_key))
> __sev_es_nmi_complete();
> }
> +extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
> #else
> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> static inline void sev_es_ist_exit(void) { }
> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
> static inline void sev_es_nmi_complete(void) { }
> +static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
> #endif
>
> #endif
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 9ab3a4dfecd8..4e2b7e4d9b87 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
> return 0;
> }
>
> +/*
> + * This is needed by the OVMF UEFI firmware which will use whatever it finds in
> + * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
> + * runtime GHCBs used by the kernel are also mapped in the EFI page-table.
> + */
> +int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
> +{
> + struct sev_es_runtime_data *data;
> + unsigned long address, pflags;
> + int cpu;
> + u64 pfn;
> +
> + if (!sev_es_active())
> + return 0;
> +
> + pflags = _PAGE_NX | _PAGE_RW;
> +
> + for_each_possible_cpu(cpu) {
> + data = per_cpu(runtime_data, cpu);
> +
> + address = __pa(&data->ghcb_page);
> + pfn = address >> PAGE_SHIFT;
> +
> + if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
> {
> struct pt_regs *regs = ctxt->regs;
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index 6af4da1149ba..8f5759df7776 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -47,6 +47,7 @@
> #include <asm/realmode.h>
> #include <asm/time.h>
> #include <asm/pgalloc.h>
> +#include <asm/sev-es.h>
>
> /*
> * We allocate runtime services regions top-down, starting from -4G, i.e.
> @@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
> return 1;
> }
>
> + /*
> + * When SEV-ES is active, the GHCB as set by the kernel will be used
> + * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
> + */
> + if (sev_es_efi_map_ghcbs(pgd)) {
> + pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
> + return 1;
> + }
> +
> /*
> * When making calls to the firmware everything needs to be 1:1
> * mapped and addressable with 32-bit pointers. Map the kernel
> --
> 2.28.0
>

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-09-09 08:29:16

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

(adding Laszlo and Brijesh)

On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <[email protected]> wrote:
>
> + Ard so that he can ack the efi bits.
>
> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
> > From: Tom Lendacky <[email protected]>
> >
> > Calling down to EFI runtime services can result in the firmware performing
> > VMGEXIT calls. The firmware is likely to use the GHCB of the OS (e.g., for
> > setting EFI variables), so each GHCB in the system needs to be identity
> > mapped in the EFI page tables, as unencrypted, to avoid page faults.
> >
> > Signed-off-by: Tom Lendacky <[email protected]>
> > [ [email protected]: Moved GHCB mapping loop to sev-es.c ]
> > Signed-off-by: Joerg Roedel <[email protected]>


This looks like it is papering over a more fundamental issue: any
memory region that the firmware itself needs to access during the
execution of runtime services needs to be described in the UEFI memory
map, with the appropriate annotations so that the OS knows it should
include these in the EFI runtime page tables. So why has this been
omitted in this case?



> > ---
> > arch/x86/boot/compressed/sev-es.c | 1 +
> > arch/x86/include/asm/sev-es.h | 2 ++
> > arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++
> > arch/x86/platform/efi/efi_64.c | 10 ++++++++++
> > 4 files changed, 43 insertions(+)
> >
> > diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
> > index 45702b866c33..0a9a248ca33d 100644
> > --- a/arch/x86/boot/compressed/sev-es.c
> > +++ b/arch/x86/boot/compressed/sev-es.c
> > @@ -12,6 +12,7 @@
> > */
> > #include "misc.h"
> >
> > +#include <asm/pgtable_types.h>
> > #include <asm/sev-es.h>
> > #include <asm/trapnr.h>
> > #include <asm/trap_pf.h>
> > diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
> > index e919f09ae33c..cf1d957c7091 100644
> > --- a/arch/x86/include/asm/sev-es.h
> > +++ b/arch/x86/include/asm/sev-es.h
> > @@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
> > if (static_branch_unlikely(&sev_es_enable_key))
> > __sev_es_nmi_complete();
> > }
> > +extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
> > #else
> > static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> > static inline void sev_es_ist_exit(void) { }
> > static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
> > static inline void sev_es_nmi_complete(void) { }
> > +static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
> > #endif
> >
> > #endif
> > diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> > index 9ab3a4dfecd8..4e2b7e4d9b87 100644
> > --- a/arch/x86/kernel/sev-es.c
> > +++ b/arch/x86/kernel/sev-es.c
> > @@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
> > return 0;
> > }
> >
> > +/*
> > + * This is needed by the OVMF UEFI firmware which will use whatever it finds in
> > + * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
> > + * runtime GHCBs used by the kernel are also mapped in the EFI page-table.
> > + */
> > +int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
> > +{
> > + struct sev_es_runtime_data *data;
> > + unsigned long address, pflags;
> > + int cpu;
> > + u64 pfn;
> > +
> > + if (!sev_es_active())
> > + return 0;
> > +
> > + pflags = _PAGE_NX | _PAGE_RW;
> > +
> > + for_each_possible_cpu(cpu) {
> > + data = per_cpu(runtime_data, cpu);
> > +
> > + address = __pa(&data->ghcb_page);
> > + pfn = address >> PAGE_SHIFT;
> > +
> > + if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
> > + return 1;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
> > {
> > struct pt_regs *regs = ctxt->regs;
> > diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> > index 6af4da1149ba..8f5759df7776 100644
> > --- a/arch/x86/platform/efi/efi_64.c
> > +++ b/arch/x86/platform/efi/efi_64.c
> > @@ -47,6 +47,7 @@
> > #include <asm/realmode.h>
> > #include <asm/time.h>
> > #include <asm/pgalloc.h>
> > +#include <asm/sev-es.h>
> >
> > /*
> > * We allocate runtime services regions top-down, starting from -4G, i.e.
> > @@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
> > return 1;
> > }
> >
> > + /*
> > + * When SEV-ES is active, the GHCB as set by the kernel will be used
> > + * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
> > + */
> > + if (sev_es_efi_map_ghcbs(pgd)) {
> > + pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
> > + return 1;
> > + }
> > +
> > /*
> > * When making calls to the firmware everything needs to be 1:1
> > * mapped and addressable with 32-bit pointers. Map the kernel
> > --
> > 2.28.0
> >
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2020-09-09 13:10:34

by Laszlo Ersek

[permalink] [raw]
Subject: Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

On 09/09/20 10:27, Ard Biesheuvel wrote:
> (adding Laszlo and Brijesh)
>
> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <[email protected]> wrote:
>>
>> + Ard so that he can ack the efi bits.
>>
>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
>>> From: Tom Lendacky <[email protected]>
>>>
>>> Calling down to EFI runtime services can result in the firmware
>>> performing VMGEXIT calls. The firmware is likely to use the GHCB of
>>> the OS (e.g., for setting EFI variables),

I've had to stare at this for a while.

Because, normally a VMGEXIT is supposed to occur like this:

- guest does something privileged
- resultant non-automatic exit (NAE) injects a #VC exception
- exception handler figures out what that privileged thing was
- exception handler submits request to hypervisor via GHCB contents plus
VMGEXIT instruction

Point being, the agent that "owns" the exception handler is supposed to
pre-allocate or otherwise provide the GHCB too, for information passing.

So... what is the particular NAE that occurs during the execution of
UEFI runtime services (at OS runtime)?

And assuming it occurs, I'm unsure if the exception handler (IDT) at
that point is owned (temporarily?) by the firmware.

- If the #VC handler comes from the firmware, then I don't know why it
would use the OS's GHCB.

- If the #VC handler comes from the OS, then I don't understand why the
commit message says "firmware performing VMGEXIT", given that in this
case it would be the OS's #VC handler executing VMGEXIT.

So, I think the above commit message implies a VMGEXIT *without* a NAE /
#VC context. (Because, I fail to interpret the commit message in a NAE /
#VC context in any way; see above.)

OK, so let's see where the firmware performs a VMGEXIT *outside* of an
exception handler, *while* at OS runtime. There seems to be one, in file
"OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlashDxe.c":

> VOID
> QemuFlashPtrWrite (
> IN volatile UINT8 *Ptr,
> IN UINT8 Value
> )
> {
> if (MemEncryptSevEsIsEnabled ()) {
> MSR_SEV_ES_GHCB_REGISTER Msr;
> GHCB *Ghcb;
>
> Msr.GhcbPhysicalAddress = AsmReadMsr64 (MSR_SEV_ES_GHCB);
> Ghcb = Msr.Ghcb;
>
> //
> // Writing to flash is emulated by the hypervisor through the use of write
> // protection. This won't work for an SEV-ES guest because the write won't
> // be recognized as a true MMIO write, which would result in the required
> // #VC exception. Instead, use the the VMGEXIT MMIO write support directly
> // to perform the update.
> //
> VmgInit (Ghcb);
> Ghcb->SharedBuffer[0] = Value;
> Ghcb->SaveArea.SwScratch = (UINT64) (UINTN) Ghcb->SharedBuffer;
> VmgExit (Ghcb, SVM_EXIT_MMIO_WRITE, (UINT64) (UINTN) Ptr, 1);
> VmgDone (Ghcb);
> } else {
> *Ptr = Value;
> }
> }

This function *does* run at OS runtime (as a part of non-volatile UEFI
variable writes).

And note that, wherever MSR_SEV_ES_GHCB points to at the moment, is used
as GHCB.

If the guest kernel allocates its own GHCB and writes the allocation
address to MSR_SEV_ES_GHCB, then indeed the firmware will use the GHCB
of the OS.

I reviewed edk2 commit 437eb3f7a8db
("OvmfPkg/QemuFlashFvbServicesRuntimeDxe: Bypass flash detection with
SEV-ES", 2020-08-17), but I admit I never thought of the guest OS
changing MSR_SEV_ES_GHCB. I'm sorry about that.

As long as this driver is running before OS runtime (i.e., during the
DXE and BDS phases), MSR_SEV_ES_GHCB is supposed to carry the value we
set in "OvmfPkg/PlatformPei/AmdSev.c":

> STATIC
> VOID
> AmdSevEsInitialize (
> VOID
> )
> {
> VOID *GhcbBase;
> PHYSICAL_ADDRESS GhcbBasePa;
> UINTN GhcbPageCount, PageCount;
> RETURN_STATUS PcdStatus, DecryptStatus;
> IA32_DESCRIPTOR Gdtr;
> VOID *Gdt;
>
> if (!MemEncryptSevEsIsEnabled ()) {
> return;
> }
>
> PcdStatus = PcdSetBoolS (PcdSevEsIsEnabled, TRUE);
> ASSERT_RETURN_ERROR (PcdStatus);
>
> //
> // Allocate GHCB and per-CPU variable pages.
> // Since the pages must survive across the UEFI to OS transition
> // make them reserved.
> //
> GhcbPageCount = mMaxCpuCount * 2;
> GhcbBase = AllocateReservedPages (GhcbPageCount);
> ASSERT (GhcbBase != NULL);
>
> GhcbBasePa = (PHYSICAL_ADDRESS)(UINTN) GhcbBase;
>
> //
> // Each vCPU gets two consecutive pages, the first is the GHCB and the
> // second is the per-CPU variable page. Loop through the allocation and
> // only clear the encryption mask for the GHCB pages.
> //
> for (PageCount = 0; PageCount < GhcbPageCount; PageCount += 2) {
> DecryptStatus = MemEncryptSevClearPageEncMask (
> 0,
> GhcbBasePa + EFI_PAGES_TO_SIZE (PageCount),
> 1,
> TRUE
> );
> ASSERT_RETURN_ERROR (DecryptStatus);
> }
>
> ZeroMem (GhcbBase, EFI_PAGES_TO_SIZE (GhcbPageCount));
>
> PcdStatus = PcdSet64S (PcdGhcbBase, GhcbBasePa);
> ASSERT_RETURN_ERROR (PcdStatus);
> PcdStatus = PcdSet64S (PcdGhcbSize, EFI_PAGES_TO_SIZE (GhcbPageCount));
> ASSERT_RETURN_ERROR (PcdStatus);
>
> DEBUG ((DEBUG_INFO,
> "SEV-ES is enabled, %lu GHCB pages allocated starting at 0x%p\n",
> (UINT64)GhcbPageCount, GhcbBase));
>
> AsmWriteMsr64 (MSR_SEV_ES_GHCB, GhcbBasePa);

So what is the *actual* problem at OS runtime:

- Is it that MSR_SEV_ES_GHCB still points at this PEI-phase *reserved*
memory allocation (and so when QemuFlashPtrWrite() tries to access it
during OS runtime, it doesn't have a runtime mapping for it)?

- Or is it that the OS actively changes MSR_SEV_ES_GHCB, pointing to a
memory area that the OS owns -- and *that* area is what
QemuFlashPtrWrite() cannot access at OS runtime?

The first problem statement does *not* seem to apply, given -- again --
that the commit message says, "firmware is likely to use the GHCB of the
OS".

So I think the second problem statement must apply.

(I think the "reserved allocation" above is "reserved" only because we
want to keep the OS out of it around the ExitBootServices() transition.)

Back to the email:

On 09/09/20 10:27, Ard Biesheuvel wrote:
> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <[email protected]> wrote:
>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
>>> so each GHCB in the system needs to be identity
>>> mapped in the EFI page tables, as unencrypted, to avoid page faults.

Not sure I agree about this, but at least it seems to confirm my
understanding -- apparently the idea is, for the OS, to satisfy
QemuFlashPtrWrite() in the firmware, by putting the "expected" mapping
-- for wherever MSR_SEV_ES_GHCB is going to point to -- in place.

>>>
>>> Signed-off-by: Tom Lendacky <[email protected]>
>>> [ [email protected]: Moved GHCB mapping loop to sev-es.c ]
>>> Signed-off-by: Joerg Roedel <[email protected]>
>
>
> This looks like it is papering over a more fundamental issue: any
> memory region that the firmware itself needs to access during the
> execution of runtime services needs to be described in the UEFI memory
> map, with the appropriate annotations so that the OS knows it should
> include these in the EFI runtime page tables. So why has this been
> omitted in this case?

So yeah, the issue seems to be that the QemuFlashFvbServicesRuntimeDxe
driver does not *own* the GHCB that it attempts to use at OS runtime. It
doesn't know where MSR_SEV_ES_GHCB is going to point.

Is QemuFlashFvbServicesRuntimeDxe permitted to change MSR_SEV_ES_GHCB
*temporarily* at OS runtime?

Because, in that case:

- QemuFlashFvbServicesRuntimeDxe should allocate a Runtime Services Data
block for GHCB when it starts up (if SEV-ES is active),

- QemuFlashFvbServicesRuntimeDxe should register a SetVirtualAddressMap
handler, and use EfiConvertPointer() from UefiRuntimeLib to convert
the "runtime GHCB" address to virtual address, in that handler,

- QemuFlashPtrWrite() should call EfiAtRuntime() from UefiRuntimeLib,
and if the latter returns TRUE, then (a) use the runtime-converted
address for populating the GHCB, and (b) temporarily swap
MSR_SEV_ES_GHCB with the address of the self-allocated GHCB. (The MSR
needs a *physical* address, so QemuFlashFvbServicesRuntimeDxe would
have to remember / retain the original (physical) allocation address
too.)

If QemuFlashFvbServicesRuntimeDxe is not permitted to change
MSR_SEV_ES_GHCB even temporarily (at OS runtime), then I think the
approach proposed in this (guest kernel) patch is valid.

Let me skim the code below...

>
>
>
>>> ---
>>> arch/x86/boot/compressed/sev-es.c | 1 +
>>> arch/x86/include/asm/sev-es.h | 2 ++
>>> arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++
>>> arch/x86/platform/efi/efi_64.c | 10 ++++++++++
>>> 4 files changed, 43 insertions(+)
>>>
>>> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
>>> index 45702b866c33..0a9a248ca33d 100644
>>> --- a/arch/x86/boot/compressed/sev-es.c
>>> +++ b/arch/x86/boot/compressed/sev-es.c
>>> @@ -12,6 +12,7 @@
>>> */
>>> #include "misc.h"
>>>
>>> +#include <asm/pgtable_types.h>
>>> #include <asm/sev-es.h>
>>> #include <asm/trapnr.h>
>>> #include <asm/trap_pf.h>
>>> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
>>> index e919f09ae33c..cf1d957c7091 100644
>>> --- a/arch/x86/include/asm/sev-es.h
>>> +++ b/arch/x86/include/asm/sev-es.h
>>> @@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
>>> if (static_branch_unlikely(&sev_es_enable_key))
>>> __sev_es_nmi_complete();
>>> }
>>> +extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
>>> #else
>>> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
>>> static inline void sev_es_ist_exit(void) { }
>>> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
>>> static inline void sev_es_nmi_complete(void) { }
>>> +static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
>>> #endif
>>>
>>> #endif
>>> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
>>> index 9ab3a4dfecd8..4e2b7e4d9b87 100644
>>> --- a/arch/x86/kernel/sev-es.c
>>> +++ b/arch/x86/kernel/sev-es.c
>>> @@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
>>> return 0;
>>> }
>>>
>>> +/*
>>> + * This is needed by the OVMF UEFI firmware which will use whatever it finds in
>>> + * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
>>> + * runtime GHCBs used by the kernel are also mapped in the EFI page-table.

Yup, this pretty much confirms my suspicion that QemuFlashPtrWrite() is
at the center of this.

(BTW, I don't think that the runtime services data allocation, in
QemuFlashFvbServicesRuntimeDxe, for OS runtime GHCB purposes, would have
to be "per CPU". Refer to "Table 35. Rules for Reentry Into Runtime
Services" in the UEFI spec -- if one processor is executing
SetVariable(), then no other processor must enter SetVariable(). And so
we'll have *at most* one VCPU in QemuFlashPtrWrite(), at any time.)

>>> + */
>>> +int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
>>> +{
>>> + struct sev_es_runtime_data *data;
>>> + unsigned long address, pflags;
>>> + int cpu;
>>> + u64 pfn;
>>> +
>>> + if (!sev_es_active())
>>> + return 0;
>>> +
>>> + pflags = _PAGE_NX | _PAGE_RW;
>>> +
>>> + for_each_possible_cpu(cpu) {
>>> + data = per_cpu(runtime_data, cpu);
>>> +
>>> + address = __pa(&data->ghcb_page);
>>> + pfn = address >> PAGE_SHIFT;
>>> +
>>> + if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
>>> + return 1;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
>>> {
>>> struct pt_regs *regs = ctxt->regs;
>>> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
>>> index 6af4da1149ba..8f5759df7776 100644
>>> --- a/arch/x86/platform/efi/efi_64.c
>>> +++ b/arch/x86/platform/efi/efi_64.c
>>> @@ -47,6 +47,7 @@
>>> #include <asm/realmode.h>
>>> #include <asm/time.h>
>>> #include <asm/pgalloc.h>
>>> +#include <asm/sev-es.h>
>>>
>>> /*
>>> * We allocate runtime services regions top-down, starting from -4G, i.e.
>>> @@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
>>> return 1;
>>> }
>>>
>>> + /*
>>> + * When SEV-ES is active, the GHCB as set by the kernel will be used
>>> + * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
>>> + */
>>> + if (sev_es_efi_map_ghcbs(pgd)) {
>>> + pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
>>> + return 1;
>>> + }
>>> +
>>> /*
>>> * When making calls to the firmware everything needs to be 1:1
>>> * mapped and addressable with 32-bit pointers. Map the kernel

Good point!

And it even makes me wonder if the QemuFlashFvbServicesRuntimeDxe
approach, with the runtime services data type memory allocation, is
feasible at all. Namely, a page's encryption status, under SEV, is
controlled through the PTE.

And for this particular UEFI runtime area, it would *not* suffice for
the OS to just virt-map it. The OS would also have to *decrypt* the area
(mark the PTE as "plaintext").

In other words, it would be an "unprecedented" PTE for the OS to set up:
the PTE would not only map the GVA to GPA, but also mark the area as
"plaintext".

Otherwise -- if the OS covers *just* the virt-mapping --,
QemuFlashFvbServicesRuntimeDxe would populate its own "runtime GHCB"
area just fine, but the actual data hitting the host RAM would be
encrypted. And so the hypervisor could not interpret the GHCB.

*If* QemuFlashFvbServicesRuntimeDxe should not change the kernel-owned
PTE at runtime, even temporarily, for marking the GHCB as "plaintext",
then the problem is indeed only solvable in the guest kernel, in my
opinion.

There simply isn't an "architected annotation" for telling the kernel,
"virt-map this runtime services data type memory range, *and* mark it as
plaintext at the same time".

This would be necessary, as both actions affect the exact same PTE, and
the firmware is not really allowed to touch the PTE at runtime. But we
don't have such a hint.


To summarize: for QemuFlashFvbServicesRuntimeDxe to allocate UEFI
Runtime Services Data type memory, for its own runtime GHCB, two
permissions are necessary (together), at OS runtime:

- QemuFlashFvbServicesRuntimeDxe must be allowed to swap MSR_SEV_ES_GHCB
temporarily (before executing VMGEXIT),

- QemuFlashFvbServicesRuntimeDxe must be allowed to change the OS-owned
PTE temporarily (for remapping the GHCB as plaintext, before writing
to it).

Thanks
Laszlo

2020-09-09 16:53:00

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

On 9/9/20 7:44 AM, Laszlo Ersek wrote:
> On 09/09/20 10:27, Ard Biesheuvel wrote:
>> (adding Laszlo and Brijesh)
>>
>> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <[email protected]> wrote:
>>>
>>> + Ard so that he can ack the efi bits.
>>>
>>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
>>>> From: Tom Lendacky <[email protected]>
>>>>
>>>> Calling down to EFI runtime services can result in the firmware
>>>> performing VMGEXIT calls. The firmware is likely to use the GHCB of
>>>> the OS (e.g., for setting EFI variables),
>
> I've had to stare at this for a while.
>
> Because, normally a VMGEXIT is supposed to occur like this:
>
> - guest does something privileged
> - resultant non-automatic exit (NAE) injects a #VC exception
> - exception handler figures out what that privileged thing was
> - exception handler submits request to hypervisor via GHCB contents plus
> VMGEXIT instruction
>
> Point being, the agent that "owns" the exception handler is supposed to
> pre-allocate or otherwise provide the GHCB too, for information passing.
>
> So... what is the particular NAE that occurs during the execution of
> UEFI runtime services (at OS runtime)?
>
> And assuming it occurs, I'm unsure if the exception handler (IDT) at
> that point is owned (temporarily?) by the firmware.
>
> - If the #VC handler comes from the firmware, then I don't know why it
> would use the OS's GHCB.
>
> - If the #VC handler comes from the OS, then I don't understand why the
> commit message says "firmware performing VMGEXIT", given that in this
> case it would be the OS's #VC handler executing VMGEXIT.
>
> So, I think the above commit message implies a VMGEXIT *without* a NAE /
> #VC context. (Because, I fail to interpret the commit message in a NAE /
> #VC context in any way; see above.)

Correct.

>
> OK, so let's see where the firmware performs a VMGEXIT *outside* of an
> exception handler, *while* at OS runtime. There seems to be one, in file
> "OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlashDxe.c":

Again, correct. Basically this is what is invoked when setting UEFI variables.

>
>> VOID
>> QemuFlashPtrWrite (
>> IN volatile UINT8 *Ptr,
>> IN UINT8 Value
>> )
>> {
>> if (MemEncryptSevEsIsEnabled ()) {
>> MSR_SEV_ES_GHCB_REGISTER Msr;
>> GHCB *Ghcb;
>>
>> Msr.GhcbPhysicalAddress = AsmReadMsr64 (MSR_SEV_ES_GHCB);
>> Ghcb = Msr.Ghcb;
>>
>> //
>> // Writing to flash is emulated by the hypervisor through the use of write
>> // protection. This won't work for an SEV-ES guest because the write won't
>> // be recognized as a true MMIO write, which would result in the required
>> // #VC exception. Instead, use the the VMGEXIT MMIO write support directly
>> // to perform the update.
>> //
>> VmgInit (Ghcb);
>> Ghcb->SharedBuffer[0] = Value;
>> Ghcb->SaveArea.SwScratch = (UINT64) (UINTN) Ghcb->SharedBuffer;
>> VmgExit (Ghcb, SVM_EXIT_MMIO_WRITE, (UINT64) (UINTN) Ptr, 1);
>> VmgDone (Ghcb);
>> } else {
>> *Ptr = Value;
>> }
>> }
>
> This function *does* run at OS runtime (as a part of non-volatile UEFI
> variable writes).
>
> And note that, wherever MSR_SEV_ES_GHCB points to at the moment, is used
> as GHCB.
>
> If the guest kernel allocates its own GHCB and writes the allocation
> address to MSR_SEV_ES_GHCB, then indeed the firmware will use the GHCB
> of the OS.
>
> I reviewed edk2 commit 437eb3f7a8db
> ("OvmfPkg/QemuFlashFvbServicesRuntimeDxe: Bypass flash detection with
> SEV-ES", 2020-08-17), but I admit I never thought of the guest OS
> changing MSR_SEV_ES_GHCB. I'm sorry about that.
>
> As long as this driver is running before OS runtime (i.e., during the
> DXE and BDS phases), MSR_SEV_ES_GHCB is supposed to carry the value we
> set in "OvmfPkg/PlatformPei/AmdSev.c":
>
>> STATIC
>> VOID
>> AmdSevEsInitialize (
>> VOID
>> )
>> {
>> VOID *GhcbBase;
>> PHYSICAL_ADDRESS GhcbBasePa;
>> UINTN GhcbPageCount, PageCount;
>> RETURN_STATUS PcdStatus, DecryptStatus;
>> IA32_DESCRIPTOR Gdtr;
>> VOID *Gdt;
>>
>> if (!MemEncryptSevEsIsEnabled ()) {
>> return;
>> }
>>
>> PcdStatus = PcdSetBoolS (PcdSevEsIsEnabled, TRUE);
>> ASSERT_RETURN_ERROR (PcdStatus);
>>
>> //
>> // Allocate GHCB and per-CPU variable pages.
>> // Since the pages must survive across the UEFI to OS transition
>> // make them reserved.
>> //
>> GhcbPageCount = mMaxCpuCount * 2;
>> GhcbBase = AllocateReservedPages (GhcbPageCount);
>> ASSERT (GhcbBase != NULL);
>>
>> GhcbBasePa = (PHYSICAL_ADDRESS)(UINTN) GhcbBase;
>>
>> //
>> // Each vCPU gets two consecutive pages, the first is the GHCB and the
>> // second is the per-CPU variable page. Loop through the allocation and
>> // only clear the encryption mask for the GHCB pages.
>> //
>> for (PageCount = 0; PageCount < GhcbPageCount; PageCount += 2) {
>> DecryptStatus = MemEncryptSevClearPageEncMask (
>> 0,
>> GhcbBasePa + EFI_PAGES_TO_SIZE (PageCount),
>> 1,
>> TRUE
>> );
>> ASSERT_RETURN_ERROR (DecryptStatus);
>> }
>>
>> ZeroMem (GhcbBase, EFI_PAGES_TO_SIZE (GhcbPageCount));
>>
>> PcdStatus = PcdSet64S (PcdGhcbBase, GhcbBasePa);
>> ASSERT_RETURN_ERROR (PcdStatus);
>> PcdStatus = PcdSet64S (PcdGhcbSize, EFI_PAGES_TO_SIZE (GhcbPageCount));
>> ASSERT_RETURN_ERROR (PcdStatus);
>>
>> DEBUG ((DEBUG_INFO,
>> "SEV-ES is enabled, %lu GHCB pages allocated starting at 0x%p\n",
>> (UINT64)GhcbPageCount, GhcbBase));
>>
>> AsmWriteMsr64 (MSR_SEV_ES_GHCB, GhcbBasePa);
>
> So what is the *actual* problem at OS runtime:
>
> - Is it that MSR_SEV_ES_GHCB still points at this PEI-phase *reserved*
> memory allocation (and so when QemuFlashPtrWrite() tries to access it
> during OS runtime, it doesn't have a runtime mapping for it)?

At this point the GHCB MSR points to the OS GHCB, which isn't mapped by
the page tables supplied by the OS and used by UEFI.

>
> - Or is it that the OS actively changes MSR_SEV_ES_GHCB, pointing to a
> memory area that the OS owns -- and *that* area is what
> QemuFlashPtrWrite() cannot access at OS runtime?

Correct.

>
> The first problem statement does *not* seem to apply, given -- again --
> that the commit message says, "firmware is likely to use the GHCB of the
> OS".
>
> So I think the second problem statement must apply.
>
> (I think the "reserved allocation" above is "reserved" only because we
> want to keep the OS out of it around the ExitBootServices() transition.)
>
> Back to the email:
>
> On 09/09/20 10:27, Ard Biesheuvel wrote:
>> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <[email protected]> wrote:
>>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
>>>> so each GHCB in the system needs to be identity
>>>> mapped in the EFI page tables, as unencrypted, to avoid page faults.
>
> Not sure I agree about this, but at least it seems to confirm my
> understanding -- apparently the idea is, for the OS, to satisfy
> QemuFlashPtrWrite() in the firmware, by putting the "expected" mapping
> -- for wherever MSR_SEV_ES_GHCB is going to point to -- in place.
>
>>>>
>>>> Signed-off-by: Tom Lendacky <[email protected]>
>>>> [ [email protected]: Moved GHCB mapping loop to sev-es.c ]
>>>> Signed-off-by: Joerg Roedel <[email protected]>
>>
>>
>> This looks like it is papering over a more fundamental issue: any
>> memory region that the firmware itself needs to access during the
>> execution of runtime services needs to be described in the UEFI memory
>> map, with the appropriate annotations so that the OS knows it should
>> include these in the EFI runtime page tables. So why has this been
>> omitted in this case?
>
> So yeah, the issue seems to be that the QemuFlashFvbServicesRuntimeDxe
> driver does not *own* the GHCB that it attempts to use at OS runtime. It
> doesn't know where MSR_SEV_ES_GHCB is going to point.
>
> Is QemuFlashFvbServicesRuntimeDxe permitted to change MSR_SEV_ES_GHCB
> *temporarily* at OS runtime?
>
> Because, in that case:
>
> - QemuFlashFvbServicesRuntimeDxe should allocate a Runtime Services Data
> block for GHCB when it starts up (if SEV-ES is active),
>
> - QemuFlashFvbServicesRuntimeDxe should register a SetVirtualAddressMap
> handler, and use EfiConvertPointer() from UefiRuntimeLib to convert
> the "runtime GHCB" address to virtual address, in that handler,
>
> - QemuFlashPtrWrite() should call EfiAtRuntime() from UefiRuntimeLib,
> and if the latter returns TRUE, then (a) use the runtime-converted
> address for populating the GHCB, and (b) temporarily swap
> MSR_SEV_ES_GHCB with the address of the self-allocated GHCB. (The MSR
> needs a *physical* address, so QemuFlashFvbServicesRuntimeDxe would
> have to remember / retain the original (physical) allocation address
> too.)
>
> If QemuFlashFvbServicesRuntimeDxe is not permitted to change
> MSR_SEV_ES_GHCB even temporarily (at OS runtime), then I think the
> approach proposed in this (guest kernel) patch is valid.
>
> Let me skim the code below...
>
>>
>>
>>
>>>> ---
>>>> arch/x86/boot/compressed/sev-es.c | 1 +
>>>> arch/x86/include/asm/sev-es.h | 2 ++
>>>> arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++
>>>> arch/x86/platform/efi/efi_64.c | 10 ++++++++++
>>>> 4 files changed, 43 insertions(+)
>>>>
>>>> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
>>>> index 45702b866c33..0a9a248ca33d 100644
>>>> --- a/arch/x86/boot/compressed/sev-es.c
>>>> +++ b/arch/x86/boot/compressed/sev-es.c
>>>> @@ -12,6 +12,7 @@
>>>> */
>>>> #include "misc.h"
>>>>
>>>> +#include <asm/pgtable_types.h>
>>>> #include <asm/sev-es.h>
>>>> #include <asm/trapnr.h>
>>>> #include <asm/trap_pf.h>
>>>> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
>>>> index e919f09ae33c..cf1d957c7091 100644
>>>> --- a/arch/x86/include/asm/sev-es.h
>>>> +++ b/arch/x86/include/asm/sev-es.h
>>>> @@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
>>>> if (static_branch_unlikely(&sev_es_enable_key))
>>>> __sev_es_nmi_complete();
>>>> }
>>>> +extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
>>>> #else
>>>> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
>>>> static inline void sev_es_ist_exit(void) { }
>>>> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
>>>> static inline void sev_es_nmi_complete(void) { }
>>>> +static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
>>>> #endif
>>>>
>>>> #endif
>>>> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
>>>> index 9ab3a4dfecd8..4e2b7e4d9b87 100644
>>>> --- a/arch/x86/kernel/sev-es.c
>>>> +++ b/arch/x86/kernel/sev-es.c
>>>> @@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
>>>> return 0;
>>>> }
>>>>
>>>> +/*
>>>> + * This is needed by the OVMF UEFI firmware which will use whatever it finds in
>>>> + * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
>>>> + * runtime GHCBs used by the kernel are also mapped in the EFI page-table.
>
> Yup, this pretty much confirms my suspicion that QemuFlashPtrWrite() is
> at the center of this.
>
> (BTW, I don't think that the runtime services data allocation, in
> QemuFlashFvbServicesRuntimeDxe, for OS runtime GHCB purposes, would have
> to be "per CPU". Refer to "Table 35. Rules for Reentry Into Runtime
> Services" in the UEFI spec -- if one processor is executing
> SetVariable(), then no other processor must enter SetVariable(). And so
> we'll have *at most* one VCPU in QemuFlashPtrWrite(), at any time.)
>
>>>> + */
>>>> +int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
>>>> +{
>>>> + struct sev_es_runtime_data *data;
>>>> + unsigned long address, pflags;
>>>> + int cpu;
>>>> + u64 pfn;
>>>> +
>>>> + if (!sev_es_active())
>>>> + return 0;
>>>> +
>>>> + pflags = _PAGE_NX | _PAGE_RW;
>>>> +
>>>> + for_each_possible_cpu(cpu) {
>>>> + data = per_cpu(runtime_data, cpu);
>>>> +
>>>> + address = __pa(&data->ghcb_page);
>>>> + pfn = address >> PAGE_SHIFT;
>>>> +
>>>> + if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
>>>> + return 1;
>>>> + }
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
>>>> {
>>>> struct pt_regs *regs = ctxt->regs;
>>>> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
>>>> index 6af4da1149ba..8f5759df7776 100644
>>>> --- a/arch/x86/platform/efi/efi_64.c
>>>> +++ b/arch/x86/platform/efi/efi_64.c
>>>> @@ -47,6 +47,7 @@
>>>> #include <asm/realmode.h>
>>>> #include <asm/time.h>
>>>> #include <asm/pgalloc.h>
>>>> +#include <asm/sev-es.h>
>>>>
>>>> /*
>>>> * We allocate runtime services regions top-down, starting from -4G, i.e.
>>>> @@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
>>>> return 1;
>>>> }
>>>>
>>>> + /*
>>>> + * When SEV-ES is active, the GHCB as set by the kernel will be used
>>>> + * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
>>>> + */
>>>> + if (sev_es_efi_map_ghcbs(pgd)) {
>>>> + pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
>>>> + return 1;
>>>> + }
>>>> +
>>>> /*
>>>> * When making calls to the firmware everything needs to be 1:1
>>>> * mapped and addressable with 32-bit pointers. Map the kernel
>
> Good point!
>
> And it even makes me wonder if the QemuFlashFvbServicesRuntimeDxe
> approach, with the runtime services data type memory allocation, is
> feasible at all. Namely, a page's encryption status, under SEV, is
> controlled through the PTE.
>
> And for this particular UEFI runtime area, it would *not* suffice for
> the OS to just virt-map it. The OS would also have to *decrypt* the area
> (mark the PTE as "plaintext").
>
> In other words, it would be an "unprecedented" PTE for the OS to set up:
> the PTE would not only map the GVA to GPA, but also mark the area as
> "plaintext".
>
> Otherwise -- if the OS covers *just* the virt-mapping --,
> QemuFlashFvbServicesRuntimeDxe would populate its own "runtime GHCB"
> area just fine, but the actual data hitting the host RAM would be
> encrypted. And so the hypervisor could not interpret the GHCB.
>
> *If* QemuFlashFvbServicesRuntimeDxe should not change the kernel-owned
> PTE at runtime, even temporarily, for marking the GHCB as "plaintext",
> then the problem is indeed only solvable in the guest kernel, in my
> opinion.
>
> There simply isn't an "architected annotation" for telling the kernel,
> "virt-map this runtime services data type memory range, *and* mark it as
> plaintext at the same time".
>
> This would be necessary, as both actions affect the exact same PTE, and
> the firmware is not really allowed to touch the PTE at runtime. But we
> don't have such a hint.
>
>
> To summarize: for QemuFlashFvbServicesRuntimeDxe to allocate UEFI
> Runtime Services Data type memory, for its own runtime GHCB, two
> permissions are necessary (together), at OS runtime:
>
> - QemuFlashFvbServicesRuntimeDxe must be allowed to swap MSR_SEV_ES_GHCB
> temporarily (before executing VMGEXIT),
>
> - QemuFlashFvbServicesRuntimeDxe must be allowed to change the OS-owned
> PTE temporarily (for remapping the GHCB as plaintext, before writing
> to it).
>

Amazing summarization Laszlo!

Thanks,
Tom

> Thanks
> Laszlo
>

2020-09-09 17:07:56

by Laszlo Ersek

[permalink] [raw]
Subject: Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

On 09/09/20 14:44, Laszlo Ersek wrote:

> To summarize: for QemuFlashFvbServicesRuntimeDxe to allocate UEFI
> Runtime Services Data type memory, for its own runtime GHCB, two
> permissions are necessary (together), at OS runtime:
>
> - QemuFlashFvbServicesRuntimeDxe must be allowed to swap MSR_SEV_ES_GHCB
> temporarily (before executing VMGEXIT),
>
> - QemuFlashFvbServicesRuntimeDxe must be allowed to change the OS-owned
> PTE temporarily (for remapping the GHCB as plaintext, before writing
> to it).

Condition#2 gets worse:

If the firmware-allocated runtime GHCB happens to be virt-mapped by the
OS using a 2MB page (covering other UEFI runtime data areas, perhaps?),
then simply flipping the encryption bit in
QemuFlashFvbServicesRuntimeDxe would mark more runtime memory than just
the GHCB as "plaintext". (2MB-4KB specifically.)

This could result in:
- firmware read accesses outside of the GHCB returning garbage
- firmware write accesses ouside of the GHCB leaking information to the
hypervisor, and reading back later as garbage

In order to prevent those symptoms, the firmware would have to split the
2MB page to 4KB pages, and decrypt just the one (GHCB) page.

But page splitting requires additional memory (for the finer granularity
page table), and fw memory cannot be allocated at OS runtime. So that
extra memory too would have to be pre-allocated by the firmware. Nasty.

I'd recommend sticking with this kernel patch.

Thanks,
Laszlo

Subject: [tip: x86/seves] x86/efi: Add GHCB mappings when SEV-ES is active

The following commit has been merged into the x86/seves branch of tip:

Commit-ID: 35dcb1ebaf43230637abd2428b9d1fe6c915a78b
Gitweb: https://git.kernel.org/tip/35dcb1ebaf43230637abd2428b9d1fe6c915a78b
Author: Tom Lendacky <[email protected]>
AuthorDate: Mon, 07 Sep 2020 15:16:12 +02:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Wed, 09 Sep 2020 18:03:48 +02:00

x86/efi: Add GHCB mappings when SEV-ES is active

Calling down to EFI runtime services can result in the firmware
performing VMGEXIT calls. The firmware is likely to use the GHCB of the
OS (e.g., for setting EFI variables), so each GHCB in the system needs
to be identity-mapped in the EFI page tables, as unencrypted, to avoid
page faults.

Signed-off-by: Tom Lendacky <[email protected]>
[ [email protected]: Moved GHCB mapping loop to sev-es.c ]
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/boot/compressed/sev-es.c | 1 +-
arch/x86/include/asm/sev-es.h | 2 ++-
arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++-
arch/x86/platform/efi/efi_64.c | 10 ++++++++++-
4 files changed, 43 insertions(+)

diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index 5f15e58..2a6c7c3 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -12,6 +12,7 @@
*/
#include "misc.h"

+#include <asm/pgtable_types.h>
#include <asm/sev-es.h>
#include <asm/trapnr.h>
#include <asm/trap_pf.h>
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index e919f09..cf1d957 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
if (static_branch_unlikely(&sev_es_enable_key))
__sev_es_nmi_complete();
}
+extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
static inline void sev_es_nmi_complete(void) { }
+static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
#endif

#endif
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index b6518e9..8cac9f8 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
return 0;
}

+/*
+ * This is needed by the OVMF UEFI firmware which will use whatever it finds in
+ * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
+ * runtime GHCBs used by the kernel are also mapped in the EFI page-table.
+ */
+int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
+{
+ struct sev_es_runtime_data *data;
+ unsigned long address, pflags;
+ int cpu;
+ u64 pfn;
+
+ if (!sev_es_active())
+ return 0;
+
+ pflags = _PAGE_NX | _PAGE_RW;
+
+ for_each_possible_cpu(cpu) {
+ data = per_cpu(runtime_data, cpu);
+
+ address = __pa(&data->ghcb_page);
+ pfn = address >> PAGE_SHIFT;
+
+ if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
+ return 1;
+ }
+
+ return 0;
+}
+
static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
{
struct pt_regs *regs = ctxt->regs;
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 6af4da1..8f5759d 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -47,6 +47,7 @@
#include <asm/realmode.h>
#include <asm/time.h>
#include <asm/pgalloc.h>
+#include <asm/sev-es.h>

/*
* We allocate runtime services regions top-down, starting from -4G, i.e.
@@ -230,6 +231,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
}

/*
+ * When SEV-ES is active, the GHCB as set by the kernel will be used
+ * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
+ */
+ if (sev_es_efi_map_ghcbs(pgd)) {
+ pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
+ return 1;
+ }
+
+ /*
* When making calls to the firmware everything needs to be 1:1
* mapped and addressable with 32-bit pointers. Map the kernel
* text and allocate a new stack because we can't rely on the

Subject: [tip: x86/seves] x86/efi: Add GHCB mappings when SEV-ES is active

The following commit has been merged into the x86/seves branch of tip:

Commit-ID: 39336f4ffb2478ad384075cf4ba7ef2e5db2bbd7
Gitweb: https://git.kernel.org/tip/39336f4ffb2478ad384075cf4ba7ef2e5db2bbd7
Author: Tom Lendacky <[email protected]>
AuthorDate: Mon, 07 Sep 2020 15:16:12 +02:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Thu, 10 Sep 2020 21:48:50 +02:00

x86/efi: Add GHCB mappings when SEV-ES is active

Calling down to EFI runtime services can result in the firmware
performing VMGEXIT calls. The firmware is likely to use the GHCB of the
OS (e.g., for setting EFI variables), so each GHCB in the system needs
to be identity-mapped in the EFI page tables, as unencrypted, to avoid
page faults.

Signed-off-by: Tom Lendacky <[email protected]>
[ [email protected]: Moved GHCB mapping loop to sev-es.c ]
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/boot/compressed/sev-es.c | 1 +-
arch/x86/include/asm/sev-es.h | 2 ++-
arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++-
arch/x86/platform/efi/efi_64.c | 10 ++++++++++-
4 files changed, 43 insertions(+)

diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index 5f15e58..2a6c7c3 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -12,6 +12,7 @@
*/
#include "misc.h"

+#include <asm/pgtable_types.h>
#include <asm/sev-es.h>
#include <asm/trapnr.h>
#include <asm/trap_pf.h>
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index e919f09..cf1d957 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
if (static_branch_unlikely(&sev_es_enable_key))
__sev_es_nmi_complete();
}
+extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
static inline void sev_es_nmi_complete(void) { }
+static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
#endif

#endif
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index b6518e9..8cac9f8 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
return 0;
}

+/*
+ * This is needed by the OVMF UEFI firmware which will use whatever it finds in
+ * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
+ * runtime GHCBs used by the kernel are also mapped in the EFI page-table.
+ */
+int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
+{
+ struct sev_es_runtime_data *data;
+ unsigned long address, pflags;
+ int cpu;
+ u64 pfn;
+
+ if (!sev_es_active())
+ return 0;
+
+ pflags = _PAGE_NX | _PAGE_RW;
+
+ for_each_possible_cpu(cpu) {
+ data = per_cpu(runtime_data, cpu);
+
+ address = __pa(&data->ghcb_page);
+ pfn = address >> PAGE_SHIFT;
+
+ if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
+ return 1;
+ }
+
+ return 0;
+}
+
static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
{
struct pt_regs *regs = ctxt->regs;
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 6af4da1..8f5759d 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -47,6 +47,7 @@
#include <asm/realmode.h>
#include <asm/time.h>
#include <asm/pgalloc.h>
+#include <asm/sev-es.h>

/*
* We allocate runtime services regions top-down, starting from -4G, i.e.
@@ -230,6 +231,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
}

/*
+ * When SEV-ES is active, the GHCB as set by the kernel will be used
+ * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
+ */
+ if (sev_es_efi_map_ghcbs(pgd)) {
+ pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
+ return 1;
+ }
+
+ /*
* When making calls to the firmware everything needs to be 1:1
* mapped and addressable with 32-bit pointers. Map the kernel
* text and allocate a new stack because we can't rely on the

2020-09-10 21:52:16

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

On Wed, 9 Sep 2020 at 16:49, Tom Lendacky <[email protected]> wrote:
>
> On 9/9/20 7:44 AM, Laszlo Ersek wrote:
> > On 09/09/20 10:27, Ard Biesheuvel wrote:
> >> (adding Laszlo and Brijesh)
> >>
> >> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <[email protected]> wrote:
> >>>
> >>> + Ard so that he can ack the efi bits.
> >>>
> >>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
> >>>> From: Tom Lendacky <[email protected]>
> >>>>
> >>>> Calling down to EFI runtime services can result in the firmware
> >>>> performing VMGEXIT calls. The firmware is likely to use the GHCB of
> >>>> the OS (e.g., for setting EFI variables),
> >
> > I've had to stare at this for a while.
> >
> > Because, normally a VMGEXIT is supposed to occur like this:
> >
> > - guest does something privileged
> > - resultant non-automatic exit (NAE) injects a #VC exception
> > - exception handler figures out what that privileged thing was
> > - exception handler submits request to hypervisor via GHCB contents plus
> > VMGEXIT instruction
> >
> > Point being, the agent that "owns" the exception handler is supposed to
> > pre-allocate or otherwise provide the GHCB too, for information passing.
> >
> > So... what is the particular NAE that occurs during the execution of
> > UEFI runtime services (at OS runtime)?
> >
> > And assuming it occurs, I'm unsure if the exception handler (IDT) at
> > that point is owned (temporarily?) by the firmware.
> >
> > - If the #VC handler comes from the firmware, then I don't know why it
> > would use the OS's GHCB.
> >
> > - If the #VC handler comes from the OS, then I don't understand why the
> > commit message says "firmware performing VMGEXIT", given that in this
> > case it would be the OS's #VC handler executing VMGEXIT.
> >
> > So, I think the above commit message implies a VMGEXIT *without* a NAE /
> > #VC context. (Because, I fail to interpret the commit message in a NAE /
> > #VC context in any way; see above.)
>
> Correct.
>
> >
> > OK, so let's see where the firmware performs a VMGEXIT *outside* of an
> > exception handler, *while* at OS runtime. There seems to be one, in file
> > "OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlashDxe.c":
>
> Again, correct. Basically this is what is invoked when setting UEFI variables.
>
> >
> >> VOID
> >> QemuFlashPtrWrite (
> >> IN volatile UINT8 *Ptr,
> >> IN UINT8 Value
> >> )
> >> {
> >> if (MemEncryptSevEsIsEnabled ()) {
> >> MSR_SEV_ES_GHCB_REGISTER Msr;
> >> GHCB *Ghcb;
> >>
> >> Msr.GhcbPhysicalAddress = AsmReadMsr64 (MSR_SEV_ES_GHCB);
> >> Ghcb = Msr.Ghcb;
> >>
> >> //
> >> // Writing to flash is emulated by the hypervisor through the use of write
> >> // protection. This won't work for an SEV-ES guest because the write won't
> >> // be recognized as a true MMIO write, which would result in the required
> >> // #VC exception. Instead, use the the VMGEXIT MMIO write support directly
> >> // to perform the update.
> >> //
> >> VmgInit (Ghcb);
> >> Ghcb->SharedBuffer[0] = Value;
> >> Ghcb->SaveArea.SwScratch = (UINT64) (UINTN) Ghcb->SharedBuffer;
> >> VmgExit (Ghcb, SVM_EXIT_MMIO_WRITE, (UINT64) (UINTN) Ptr, 1);
> >> VmgDone (Ghcb);
> >> } else {
> >> *Ptr = Value;
> >> }
> >> }
> >
> > This function *does* run at OS runtime (as a part of non-volatile UEFI
> > variable writes).
> >
> > And note that, wherever MSR_SEV_ES_GHCB points to at the moment, is used
> > as GHCB.
> >
> > If the guest kernel allocates its own GHCB and writes the allocation
> > address to MSR_SEV_ES_GHCB, then indeed the firmware will use the GHCB
> > of the OS.
> >
> > I reviewed edk2 commit 437eb3f7a8db
> > ("OvmfPkg/QemuFlashFvbServicesRuntimeDxe: Bypass flash detection with
> > SEV-ES", 2020-08-17), but I admit I never thought of the guest OS
> > changing MSR_SEV_ES_GHCB. I'm sorry about that.
> >
> > As long as this driver is running before OS runtime (i.e., during the
> > DXE and BDS phases), MSR_SEV_ES_GHCB is supposed to carry the value we
> > set in "OvmfPkg/PlatformPei/AmdSev.c":
> >
> >> STATIC
> >> VOID
> >> AmdSevEsInitialize (
> >> VOID
> >> )
> >> {
> >> VOID *GhcbBase;
> >> PHYSICAL_ADDRESS GhcbBasePa;
> >> UINTN GhcbPageCount, PageCount;
> >> RETURN_STATUS PcdStatus, DecryptStatus;
> >> IA32_DESCRIPTOR Gdtr;
> >> VOID *Gdt;
> >>
> >> if (!MemEncryptSevEsIsEnabled ()) {
> >> return;
> >> }
> >>
> >> PcdStatus = PcdSetBoolS (PcdSevEsIsEnabled, TRUE);
> >> ASSERT_RETURN_ERROR (PcdStatus);
> >>
> >> //
> >> // Allocate GHCB and per-CPU variable pages.
> >> // Since the pages must survive across the UEFI to OS transition
> >> // make them reserved.
> >> //
> >> GhcbPageCount = mMaxCpuCount * 2;
> >> GhcbBase = AllocateReservedPages (GhcbPageCount);
> >> ASSERT (GhcbBase != NULL);
> >>
> >> GhcbBasePa = (PHYSICAL_ADDRESS)(UINTN) GhcbBase;
> >>
> >> //
> >> // Each vCPU gets two consecutive pages, the first is the GHCB and the
> >> // second is the per-CPU variable page. Loop through the allocation and
> >> // only clear the encryption mask for the GHCB pages.
> >> //
> >> for (PageCount = 0; PageCount < GhcbPageCount; PageCount += 2) {
> >> DecryptStatus = MemEncryptSevClearPageEncMask (
> >> 0,
> >> GhcbBasePa + EFI_PAGES_TO_SIZE (PageCount),
> >> 1,
> >> TRUE
> >> );
> >> ASSERT_RETURN_ERROR (DecryptStatus);
> >> }
> >>
> >> ZeroMem (GhcbBase, EFI_PAGES_TO_SIZE (GhcbPageCount));
> >>
> >> PcdStatus = PcdSet64S (PcdGhcbBase, GhcbBasePa);
> >> ASSERT_RETURN_ERROR (PcdStatus);
> >> PcdStatus = PcdSet64S (PcdGhcbSize, EFI_PAGES_TO_SIZE (GhcbPageCount));
> >> ASSERT_RETURN_ERROR (PcdStatus);
> >>
> >> DEBUG ((DEBUG_INFO,
> >> "SEV-ES is enabled, %lu GHCB pages allocated starting at 0x%p\n",
> >> (UINT64)GhcbPageCount, GhcbBase));
> >>
> >> AsmWriteMsr64 (MSR_SEV_ES_GHCB, GhcbBasePa);
> >
> > So what is the *actual* problem at OS runtime:
> >
> > - Is it that MSR_SEV_ES_GHCB still points at this PEI-phase *reserved*
> > memory allocation (and so when QemuFlashPtrWrite() tries to access it
> > during OS runtime, it doesn't have a runtime mapping for it)?
>
> At this point the GHCB MSR points to the OS GHCB, which isn't mapped by
> the page tables supplied by the OS and used by UEFI.
>
> >
> > - Or is it that the OS actively changes MSR_SEV_ES_GHCB, pointing to a
> > memory area that the OS owns -- and *that* area is what
> > QemuFlashPtrWrite() cannot access at OS runtime?
>
> Correct.
>
> >
> > The first problem statement does *not* seem to apply, given -- again --
> > that the commit message says, "firmware is likely to use the GHCB of the
> > OS".
> >
> > So I think the second problem statement must apply.
> >
> > (I think the "reserved allocation" above is "reserved" only because we
> > want to keep the OS out of it around the ExitBootServices() transition.)
> >
> > Back to the email:
> >
> > On 09/09/20 10:27, Ard Biesheuvel wrote:
> >> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <[email protected]> wrote:
> >>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
> >>>> so each GHCB in the system needs to be identity
> >>>> mapped in the EFI page tables, as unencrypted, to avoid page faults.
> >
> > Not sure I agree about this, but at least it seems to confirm my
> > understanding -- apparently the idea is, for the OS, to satisfy
> > QemuFlashPtrWrite() in the firmware, by putting the "expected" mapping
> > -- for wherever MSR_SEV_ES_GHCB is going to point to -- in place.
> >
> >>>>
> >>>> Signed-off-by: Tom Lendacky <[email protected]>
> >>>> [ [email protected]: Moved GHCB mapping loop to sev-es.c ]
> >>>> Signed-off-by: Joerg Roedel <[email protected]>
> >>
> >>
> >> This looks like it is papering over a more fundamental issue: any
> >> memory region that the firmware itself needs to access during the
> >> execution of runtime services needs to be described in the UEFI memory
> >> map, with the appropriate annotations so that the OS knows it should
> >> include these in the EFI runtime page tables. So why has this been
> >> omitted in this case?
> >
> > So yeah, the issue seems to be that the QemuFlashFvbServicesRuntimeDxe
> > driver does not *own* the GHCB that it attempts to use at OS runtime. It
> > doesn't know where MSR_SEV_ES_GHCB is going to point.
> >
> > Is QemuFlashFvbServicesRuntimeDxe permitted to change MSR_SEV_ES_GHCB
> > *temporarily* at OS runtime?
> >
> > Because, in that case:
> >
> > - QemuFlashFvbServicesRuntimeDxe should allocate a Runtime Services Data
> > block for GHCB when it starts up (if SEV-ES is active),
> >
> > - QemuFlashFvbServicesRuntimeDxe should register a SetVirtualAddressMap
> > handler, and use EfiConvertPointer() from UefiRuntimeLib to convert
> > the "runtime GHCB" address to virtual address, in that handler,
> >
> > - QemuFlashPtrWrite() should call EfiAtRuntime() from UefiRuntimeLib,
> > and if the latter returns TRUE, then (a) use the runtime-converted
> > address for populating the GHCB, and (b) temporarily swap
> > MSR_SEV_ES_GHCB with the address of the self-allocated GHCB. (The MSR
> > needs a *physical* address, so QemuFlashFvbServicesRuntimeDxe would
> > have to remember / retain the original (physical) allocation address
> > too.)
> >
> > If QemuFlashFvbServicesRuntimeDxe is not permitted to change
> > MSR_SEV_ES_GHCB even temporarily (at OS runtime), then I think the
> > approach proposed in this (guest kernel) patch is valid.
> >
> > Let me skim the code below...
> >
> >>
> >>
> >>
> >>>> ---
> >>>> arch/x86/boot/compressed/sev-es.c | 1 +
> >>>> arch/x86/include/asm/sev-es.h | 2 ++
> >>>> arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++
> >>>> arch/x86/platform/efi/efi_64.c | 10 ++++++++++
> >>>> 4 files changed, 43 insertions(+)
> >>>>
> >>>> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
> >>>> index 45702b866c33..0a9a248ca33d 100644
> >>>> --- a/arch/x86/boot/compressed/sev-es.c
> >>>> +++ b/arch/x86/boot/compressed/sev-es.c
> >>>> @@ -12,6 +12,7 @@
> >>>> */
> >>>> #include "misc.h"
> >>>>
> >>>> +#include <asm/pgtable_types.h>
> >>>> #include <asm/sev-es.h>
> >>>> #include <asm/trapnr.h>
> >>>> #include <asm/trap_pf.h>
> >>>> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
> >>>> index e919f09ae33c..cf1d957c7091 100644
> >>>> --- a/arch/x86/include/asm/sev-es.h
> >>>> +++ b/arch/x86/include/asm/sev-es.h
> >>>> @@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
> >>>> if (static_branch_unlikely(&sev_es_enable_key))
> >>>> __sev_es_nmi_complete();
> >>>> }
> >>>> +extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
> >>>> #else
> >>>> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> >>>> static inline void sev_es_ist_exit(void) { }
> >>>> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
> >>>> static inline void sev_es_nmi_complete(void) { }
> >>>> +static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
> >>>> #endif
> >>>>
> >>>> #endif
> >>>> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> >>>> index 9ab3a4dfecd8..4e2b7e4d9b87 100644
> >>>> --- a/arch/x86/kernel/sev-es.c
> >>>> +++ b/arch/x86/kernel/sev-es.c
> >>>> @@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
> >>>> return 0;
> >>>> }
> >>>>
> >>>> +/*
> >>>> + * This is needed by the OVMF UEFI firmware which will use whatever it finds in
> >>>> + * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
> >>>> + * runtime GHCBs used by the kernel are also mapped in the EFI page-table.
> >
> > Yup, this pretty much confirms my suspicion that QemuFlashPtrWrite() is
> > at the center of this.
> >
> > (BTW, I don't think that the runtime services data allocation, in
> > QemuFlashFvbServicesRuntimeDxe, for OS runtime GHCB purposes, would have
> > to be "per CPU". Refer to "Table 35. Rules for Reentry Into Runtime
> > Services" in the UEFI spec -- if one processor is executing
> > SetVariable(), then no other processor must enter SetVariable(). And so
> > we'll have *at most* one VCPU in QemuFlashPtrWrite(), at any time.)
> >
> >>>> + */
> >>>> +int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
> >>>> +{
> >>>> + struct sev_es_runtime_data *data;
> >>>> + unsigned long address, pflags;
> >>>> + int cpu;
> >>>> + u64 pfn;
> >>>> +
> >>>> + if (!sev_es_active())
> >>>> + return 0;
> >>>> +
> >>>> + pflags = _PAGE_NX | _PAGE_RW;
> >>>> +
> >>>> + for_each_possible_cpu(cpu) {
> >>>> + data = per_cpu(runtime_data, cpu);
> >>>> +
> >>>> + address = __pa(&data->ghcb_page);
> >>>> + pfn = address >> PAGE_SHIFT;
> >>>> +
> >>>> + if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
> >>>> + return 1;
> >>>> + }
> >>>> +
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
> >>>> {
> >>>> struct pt_regs *regs = ctxt->regs;
> >>>> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> >>>> index 6af4da1149ba..8f5759df7776 100644
> >>>> --- a/arch/x86/platform/efi/efi_64.c
> >>>> +++ b/arch/x86/platform/efi/efi_64.c
> >>>> @@ -47,6 +47,7 @@
> >>>> #include <asm/realmode.h>
> >>>> #include <asm/time.h>
> >>>> #include <asm/pgalloc.h>
> >>>> +#include <asm/sev-es.h>
> >>>>
> >>>> /*
> >>>> * We allocate runtime services regions top-down, starting from -4G, i.e.
> >>>> @@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
> >>>> return 1;
> >>>> }
> >>>>
> >>>> + /*
> >>>> + * When SEV-ES is active, the GHCB as set by the kernel will be used
> >>>> + * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
> >>>> + */
> >>>> + if (sev_es_efi_map_ghcbs(pgd)) {
> >>>> + pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
> >>>> + return 1;
> >>>> + }
> >>>> +
> >>>> /*
> >>>> * When making calls to the firmware everything needs to be 1:1
> >>>> * mapped and addressable with 32-bit pointers. Map the kernel
> >
> > Good point!
> >
> > And it even makes me wonder if the QemuFlashFvbServicesRuntimeDxe
> > approach, with the runtime services data type memory allocation, is
> > feasible at all. Namely, a page's encryption status, under SEV, is
> > controlled through the PTE.
> >
> > And for this particular UEFI runtime area, it would *not* suffice for
> > the OS to just virt-map it. The OS would also have to *decrypt* the area
> > (mark the PTE as "plaintext").
> >
> > In other words, it would be an "unprecedented" PTE for the OS to set up:
> > the PTE would not only map the GVA to GPA, but also mark the area as
> > "plaintext".
> >
> > Otherwise -- if the OS covers *just* the virt-mapping --,
> > QemuFlashFvbServicesRuntimeDxe would populate its own "runtime GHCB"
> > area just fine, but the actual data hitting the host RAM would be
> > encrypted. And so the hypervisor could not interpret the GHCB.
> >
> > *If* QemuFlashFvbServicesRuntimeDxe should not change the kernel-owned
> > PTE at runtime, even temporarily, for marking the GHCB as "plaintext",
> > then the problem is indeed only solvable in the guest kernel, in my
> > opinion.
> >
> > There simply isn't an "architected annotation" for telling the kernel,
> > "virt-map this runtime services data type memory range, *and* mark it as
> > plaintext at the same time".
> >
> > This would be necessary, as both actions affect the exact same PTE, and
> > the firmware is not really allowed to touch the PTE at runtime. But we
> > don't have such a hint.
> >
> >
> > To summarize: for QemuFlashFvbServicesRuntimeDxe to allocate UEFI
> > Runtime Services Data type memory, for its own runtime GHCB, two
> > permissions are necessary (together), at OS runtime:
> >
> > - QemuFlashFvbServicesRuntimeDxe must be allowed to swap MSR_SEV_ES_GHCB
> > temporarily (before executing VMGEXIT),
> >
> > - QemuFlashFvbServicesRuntimeDxe must be allowed to change the OS-owned
> > PTE temporarily (for remapping the GHCB as plaintext, before writing
> > to it).
> >
>
> Amazing summarization Laszlo!
>


Agreed. And thanks a lot for taking the time to do the analysis.

Based on the above,

Acked-by: Ard Biesheuvel <[email protected]>

Thanks,