Subject: [PATCH v5 00/16] Add TDX Guest Support (shared-mm support)

Hi All,

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. Since the VMM is an untrusted entity,
TDX does not allow it to access guest private memory. Any memory that
is required for communication with the VMM must be shared explicitly.
This series adds support for securely sharing guest memory with the
VMM when the guest requires it.

Originally, TDX shared every ioremap()ed page automatically. But it was
found that this ends up sharing a lot of memory that is supposed to be
private, for example ACPI tables. Also, since only a few drivers are
expected to be used in a TDX guest, it is safer to mark shared mappings
explicitly (for virtio this is only needed in two places). This has the
advantage of automatically preventing other drivers from doing MMIO,
which can happen in some cases even with the device filter. There is
still a command line option to override this behavior, which allows all
drivers to be used.
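
As an example, a driver hardened for confidential guests opts in to a
shared MMIO mapping explicitly (a minimal sketch; the device, BAR and
error handling are illustrative and not taken from any patch in this
series):

	/* Sketch: explicit opt-in to a host-shared MMIO mapping */
	void __iomem *regs;

	regs = ioremap_host_shared(pci_resource_start(pdev, 0),
				   pci_resource_len(pdev, 0));
	if (!regs)
		return -ENOMEM;

All other mappings keep using plain ioremap() and stay private.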

This series is a continuation of the patch series titled "Add TDX Guest
Support (Initial support)", "Add TDX Guest Support (#VE handler support)"
and "Add TDX Guest Support (boot support)", which added initial support,
#VE handler support and boot fixes for TDX guests. You can find the
related patch sets at the following links:

[set 1, v9] - https://lore.kernel.org/lkml/20211008234009.1211215-1-sathyanarayanan.kuppuswamy@linux.intel.com/
[set 2, v7] - https://lore.kernel.org/lkml/20211005204136.1812078-1-sathyanarayanan.kuppuswamy@linux.intel.com/
[set 3, v7] - https://lore.kernel.org/lkml/20211005230550.1819406-1-sathyanarayanan.kuppuswamy@linux.intel.com/

Also, please note that this series alone is not fully functional. You
need to apply all three of the above patch series to get a fully
functional TDX guest.

You can find TDX-related documents at the following link:

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

Also, the ioremap-related changes in the mips, parisc, alpha and sparc
architectures are only compile tested, so help from users of these
architectures is needed to make sure that they do not break any
functionality.

In this patch series, the following patches are in the PCI domain and
are meant for the PCI domain reviewers:

PCI: Consolidate pci_iomap_range(), pci_iomap_wc_range()
PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
PCI: Mark MSI data shared

The patch titled "asm/io.h: Add ioremap_host_shared fallback" adds
generic and arch-specific ioremap_host_shared headers and is meant
to be reviewed by [email protected],
[email protected], [email protected],
[email protected] and [email protected].

Similarly, the patch titled "virtio: Use shared mappings for virtio
PCI devices" adds ioremap_host_shared() support for the virtio drivers
and is meant to be reviewed by the virtio driver maintainers.

I have CCed this patch series to all the related domain maintainers
and open lists. If you prefer to get only the patches specific to your
domain, please let me know and I will fix this in the next submission.

Changes since v4:
* Since the patch titled "x86/tdx: Get TD execution environment
information via TDINFO" is required only by this patch set, it has
been moved here.
* The rest of the change log is included per patch.

Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)
* Added a new API to selectively share io-remapped memory
(using ioremap_shared()).
* Added a new wrapper (pci_iomap_shared_range()) for the PCI I/O
remap shared mapping use case.

Changes since v2:
* Rebased on top of v5.14-rc1.
* No functional changes.

Andi Kleen (6):
PCI: Consolidate pci_iomap_range(), pci_iomap_wc_range()
asm/io.h: Add ioremap_host_shared fallback
PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()
PCI: Mark MSI data shared
virtio: Use shared mappings for virtio PCI devices
x86/tdx: Implement ioremap_host_shared for x86

Isaku Yamahata (1):
x86/tdx: ioapic: Add shared bit for IOAPIC base address

Kirill A. Shutemov (7):
x86/mm: Move force_dma_unencrypted() to common code
x86/tdx: Get TD execution environment information via TDINFO
x86/tdx: Exclude Shared bit from physical_mask
x86/tdx: Make pages shared in ioremap()
x86/tdx: Add helper to do MapGPA hypercall
x86/tdx: Make DMA pages shared
x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (2):
x86/tdx: Enable shared memory confidential guest flags for TDX guest
x86/tdx: Add cmdline option to force use of ioremap_host_shared

.../admin-guide/kernel-parameters.rst | 1 +
.../admin-guide/kernel-parameters.txt | 12 ++
Documentation/driver-api/device-io.rst | 7 +
arch/alpha/include/asm/io.h | 2 +
arch/mips/include/asm/io.h | 2 +
arch/parisc/include/asm/io.h | 2 +
arch/sparc/include/asm/io_64.h | 2 +
arch/x86/Kconfig | 9 +-
arch/x86/include/asm/io.h | 6 +
arch/x86/include/asm/mem_encrypt_common.h | 21 +++
arch/x86/include/asm/pgtable.h | 5 +
arch/x86/include/asm/tdx.h | 22 +++
arch/x86/kernel/apic/io_apic.c | 18 ++-
arch/x86/kernel/cc_platform.c | 3 +
arch/x86/kernel/tdx.c | 109 +++++++++++++++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/ioremap.c | 64 +++++++--
arch/x86/mm/mem_encrypt.c | 8 +-
arch/x86/mm/mem_encrypt_common.c | 40 ++++++
arch/x86/mm/pat/set_memory.c | 45 +++++-
drivers/pci/msi.c | 2 +-
drivers/virtio/virtio_pci_modern_dev.c | 2 +-
include/asm-generic/io.h | 5 +
include/asm-generic/pci_iomap.h | 6 +
include/linux/cc_platform.h | 13 ++
lib/pci_iomap.c | 131 +++++++++++++-----
26 files changed, 475 insertions(+), 64 deletions(-)
create mode 100644 arch/x86/include/asm/mem_encrypt_common.h
create mode 100644 arch/x86/mm/mem_encrypt_common.c

--
2.25.1


Subject: [PATCH v5 01/16] x86/mm: Move force_dma_unencrypted() to common code

From: "Kirill A. Shutemov" <[email protected]>

Intel TDX doesn't allow the VMM to access guest private memory. Any
memory that is required for communication with the VMM must be shared
explicitly by setting the shared bit in the page table entry. After
setting the shared bit, the conversion must be completed with the
MapGPA hypercall. Details about the MapGPA hypercall can be found in
[1], sec 3.2.

The call informs VMM about the conversion between private/shared
mappings. The shared memory is similar to unencrypted memory in AMD
SME/SEV terminology but the underlying process of sharing/un-sharing
the memory is different for Intel TDX guest platform.

SEV assumes that I/O devices can only do DMA to "decrypted" physical
addresses without the C-bit set. In order for the CPU to interact with
this memory, the CPU needs a decrypted mapping. To add this support,
the AMD SME code forces force_dma_unencrypted() to return true on
platforms that support the AMD SEV feature. It is used by the DMA
memory allocation API to trigger set_memory_decrypted() on such
platforms.

TDX is similar. So, to communicate with I/O devices, the related pages
need to be marked as shared. As mentioned above, shared memory in the
TDX architecture is similar to decrypted memory in AMD SME/SEV. So,
similar to AMD SEV, force_dma_unencrypted() has to be forced to return
true. This support is added in other patches in this series.

So, move force_dma_unencrypted() out of the AMD-specific code and call
the AMD-specific initialization function (amd_force_dma_unencrypted())
from it. force_dma_unencrypted() will be modified by later patches to
include Intel TDX guest platform-specific initialization.

Also, introduce a new config option, X86_MEM_ENCRYPT_COMMON, that has
to be selected by all x86 memory encryption features. It will be
selected by both the AMD SEV and Intel TDX guest config options.

This is preparation for TDX changes in the DMA code and introduces no
functional changes.

[1] - https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Removed use of we/you in the commit log.

Changes since v3:
* None

Changes since v1:
* Removed sev_active(), sme_active() checks in force_dma_unencrypted().

arch/x86/Kconfig | 8 ++++++--
arch/x86/include/asm/mem_encrypt_common.h | 18 ++++++++++++++++++
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/mem_encrypt.c | 3 ++-
arch/x86/mm/mem_encrypt_common.c | 17 +++++++++++++++++
5 files changed, 45 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/include/asm/mem_encrypt_common.h
create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af49ad084919..37b27412f52e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1519,16 +1519,20 @@ config X86_CPA_STATISTICS
helps to determine the effectiveness of preserving large and huge
page mappings when mapping protections are changed.

+config X86_MEM_ENCRYPT_COMMON
+ select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+ select DYNAMIC_PHYSICAL_MASK
+ def_bool n
+
config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DMA_COHERENT_POOL
- select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
- select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
select ARCH_HAS_CC_PLATFORM
+ select X86_MEM_ENCRYPT_COMMON
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
new file mode 100644
index 000000000000..697bc40a4e3d
--- /dev/null
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_MEM_ENCRYPT_COMMON_H
+#define _ASM_X86_MEM_ENCRYPT_COMMON_H
+
+#include <linux/mem_encrypt.h>
+#include <linux/device.h>
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+bool amd_force_dma_unencrypted(struct device *dev);
+#else /* CONFIG_AMD_MEM_ENCRYPT */
+static inline bool amd_force_dma_unencrypted(struct device *dev)
+{
+ return false;
+}
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o

+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON) += mem_encrypt_common.o
+
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 23d54b810f08..5d7fbed73949 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -31,6 +31,7 @@
#include <asm/processor-flags.h>
#include <asm/msr.h>
#include <asm/cmdline.h>
+#include <asm/mem_encrypt_common.h>

#include "mm_internal.h"

@@ -362,7 +363,7 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size)
}

/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
+bool amd_force_dma_unencrypted(struct device *dev)
{
/*
* For SEV, all DMA must be to unencrypted addresses.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..f063c885b0a5
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Memory Encryption Support Common Code
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Kuppuswamy Sathyanarayanan <[email protected]>
+ */
+
+#include <asm/mem_encrypt_common.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+ return amd_force_dma_unencrypted(dev);
+}
--
2.25.1

Subject: [PATCH v5 02/16] x86/tdx: Get TD execution environment information via TDINFO

From: "Kirill A. Shutemov" <[email protected]>

Per the Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 2.4.2, TDCALL[TDINFO]
provides basic TD execution environment information that is not
provided by CPUID.
Call TDINFO during early boot so that the information can be used in
subsequent system initialization.

The call provides info on which bit in the PFN is used to indicate
that a page is shared with the host, and on the attributes of the TD,
such as debug mode.

Information about the number of CPUs need not be saved because it has
no users so far.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/kernel/tdx.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 79af9e78b300..bb237cf291e6 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -14,6 +14,7 @@
#include <linux/sched/signal.h> /* force_sig_fault() */

/* TDX Module call Leaf IDs */
+#define TDX_GET_INFO 1
#define TDX_GET_VEINFO 3

#define VE_IS_IO_OUT(exit_qual) (((exit_qual) & 8) ? 0 : 1)
@@ -21,6 +22,11 @@
#define VE_GET_PORT_NUM(exit_qual) ((exit_qual) >> 16)
#define VE_IS_IO_STRING(exit_qual) ((exit_qual) & 16 ? 1 : 0)

+static struct {
+ unsigned int gpa_width;
+ unsigned long attributes;
+} td_info __ro_after_init;
+
bool is_tdx_guest(void)
{
static int tdx_guest = -1;
@@ -65,6 +71,31 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
return out->r10;
}

+static void tdx_get_info(void)
+{
+ struct tdx_module_output out;
+ u64 ret;
+
+ /*
+ * TDINFO TDX Module call is used to get the TD
+ * execution environment information like GPA
+ * width, number of available vcpus, debug mode
+ * information, etc. More details about the ABI
+ * can be found in TDX Guest-Host-Communication
+ * Interface (GHCI), sec 2.4.2 TDCALL [TDG.VP.INFO].
+ */
+ ret = __tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
+
+ /*
+ * A non-zero return means a buggy TDX module (which is
+ * fatal). So raise a BUG().
+ */
+ BUG_ON(ret);
+
+ td_info.gpa_width = out.rcx & GENMASK(5, 0);
+ td_info.attributes = out.rdx;
+}
+
static __cpuidle void _tdx_halt(const bool irq_disabled, const bool do_sti)
{
u64 ret;
@@ -466,6 +497,8 @@ void __init tdx_early_init(void)

setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

+ tdx_get_info();
+
pv_ops.irq.safe_halt = tdx_safe_halt;
pv_ops.irq.halt = tdx_halt;

--
2.25.1

Subject: [PATCH v5 08/16] x86/tdx: ioapic: Add shared bit for IOAPIC base address

From: Isaku Yamahata <[email protected]>

The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host. This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings, such as those used by virtio, will be
marked as shared. However, the IOAPIC code does not use ioremap() and
instead uses the fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code. Ensure
that it marks IOAPIC pages as "shared". This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Rebased on top of Tom Lendacky's CC guest
changes (https://www.spinics.net/lists/linux-tip-commits/msg58716.html)

Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)

arch/x86/kernel/apic/io_apic.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index c1bb384935b0..eefb260d7759 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -49,6 +49,7 @@
#include <linux/slab.h>
#include <linux/memblock.h>
#include <linux/msi.h>
+#include <linux/cc_platform.h>

#include <asm/irqdomain.h>
#include <asm/io.h>
@@ -65,6 +66,7 @@
#include <asm/irq_remapping.h>
#include <asm/hw_irq.h>
#include <asm/apic.h>
+#include <asm/tdx.h>

#define for_each_ioapic(idx) \
for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
@@ -2677,6 +2679,18 @@ static struct resource * __init ioapic_setup_resources(void)
return res;
}

+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+ phys_addr_t phys)
+{
+ pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+ /* Set TDX guest shared bit in pgprot flags */
+ if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
+ flags = pgprot_cc_guest(flags);
+
+ __set_fixmap(idx, phys, flags);
+}
+
void __init io_apic_init_mappings(void)
{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2709,7 +2723,7 @@ void __init io_apic_init_mappings(void)
__func__, PAGE_SIZE, PAGE_SIZE);
ioapic_phys = __pa(ioapic_phys);
}
- set_fixmap_nocache(idx, ioapic_phys);
+ io_apic_set_fixmap_nocache(idx, ioapic_phys);
apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
ioapic_phys);
@@ -2838,7 +2852,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
ioapics[idx].mp_config.apicaddr = address;

- set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+ io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
if (bad_ioapic_register(idx)) {
clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
return -ENODEV;
--
2.25.1

Subject: [PATCH v5 03/16] x86/tdx: Exclude Shared bit from physical_mask

From: "Kirill A. Shutemov" <[email protected]>

Just like MKTME, TDX reassigns bits of the physical address for
metadata. MKTME used several bits for an encryption KeyID. TDX
uses a single bit in guests to communicate whether a physical page
should be protected by TDX as private memory (bit set to 0) or
unprotected and shared with the VMM (bit set to 1).

Add a helper, tdx_shared_mask(), to generate the mask. The processor
enumerates its physical address width to include the shared bit, which
means it gets included in __PHYSICAL_MASK by default.

Remove the shared mask from 'physical_mask' since any bits in
tdx_shared_mask() are not used for physical addresses in page table
entries.

Also, note that the shared mapping configuration cannot be combined
for the AMD SME and Intel TDX guest platforms in a common function.
SME has to do it very early, in __startup_64(), as it sets the bit on
all memory except what is used for communication. TDX can postpone it,
as it doesn't need any shared mappings in very early boot.
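
As a worked example (the GPA width here is purely illustrative, not
taken from any particular platform):

	/* Assume TDINFO reported a hypothetical gpa_width of 52 */
	tdx_shared_mask();			/* == 1ULL << 51 */
	physical_mask &= ~tdx_shared_mask();	/* clears bit 51 from __PHYSICAL_MASK */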

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Renamed tdg_shared_mask() to tdx_shared_mask().

Changes since v3:
* None

Changes since v1:
* Fixed format issues in commit log.

arch/x86/Kconfig | 1 +
arch/x86/include/asm/tdx.h | 4 ++++
arch/x86/kernel/tdx.c | 9 +++++++++
3 files changed, 14 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 37b27412f52e..e99c669e633a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -871,6 +871,7 @@ config INTEL_TDX_GUEST
depends on SECURITY
depends on X86_X2APIC
select ARCH_HAS_CC_PLATFORM
+ select X86_MEM_ENCRYPT_COMMON
help
Provide support for running in a trusted domain on Intel processors
equipped with Trusted Domain Extensions. TDX is a Intel technology
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index eb5e9dbe1861..b8f758dbbea9 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -76,6 +76,8 @@ bool tdx_handle_virtualization_exception(struct pt_regs *regs,

bool tdx_early_handle_ve(struct pt_regs *regs);

+extern phys_addr_t tdx_shared_mask(void);
+
/*
* To support I/O port access in decompressor or early kernel init
* code, since #VE exception handler cannot be used, use paravirt
@@ -141,6 +143,8 @@ static inline void tdx_early_init(void) { };

static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }

+static inline phys_addr_t tdx_shared_mask(void) { return 0; }
+
#endif /* CONFIG_INTEL_TDX_GUEST */

#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index bb237cf291e6..8a37ab0c6cbf 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -71,6 +71,12 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
return out->r10;
}

+/* The highest bit of a guest physical address is the "sharing" bit */
+phys_addr_t tdx_shared_mask(void)
+{
+ return 1ULL << (td_info.gpa_width - 1);
+}
+
static void tdx_get_info(void)
{
struct tdx_module_output out;
@@ -94,6 +100,9 @@ static void tdx_get_info(void)

td_info.gpa_width = out.rcx & GENMASK(5, 0);
td_info.attributes = out.rdx;
+
+ /* Exclude Shared bit from the __PHYSICAL_MASK */
+ physical_mask &= ~tdx_shared_mask();
}

static __cpuidle void _tdx_halt(const bool irq_disabled, const bool do_sti)
--
2.25.1

Subject: [PATCH v5 09/16] x86/tdx: Enable shared memory confidential guest flags for TDX guest

In a TDX guest, since the memory is private to the guest, it needs
some extra configuration before any data is shared with the VMM. AMD
SEV also implements similar features, hence the code can be shared.
Currently, memory sharing related code in the kernel is protected by
the CC_ATTR_GUEST_MEM_ENCRYPT and CC_ATTR_GUEST_SHARED_MAPPING_INIT
attributes. So enable them for TDX guests as well.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Rebased on top of Tom Lendacky's CC guest
changes (https://www.spinics.net/lists/linux-tip-commits/msg58716.html)

arch/x86/kernel/cc_platform.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index d13188e8eb2c..deac0a7d7d37 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -20,6 +20,9 @@ static bool intel_cc_platform_has(enum cc_attr attr)
switch (attr) {
case CC_ATTR_GUEST_TDX:
case CC_ATTR_GUEST_UNROLL_STRING_IO:
+ case CC_ATTR_GUEST_MEM_ENCRYPT:
+ case CC_ATTR_GUEST_SHARED_MAPPING_INIT:
+ case CC_ATTR_MEM_ENCRYPT:
return is_tdx_guest();
default:
return false;
--
2.25.1

Subject: [PATCH v5 05/16] x86/tdx: Add helper to do MapGPA hypercall

From: "Kirill A. Shutemov" <[email protected]>

The MapGPA hypercall is used by TDX guests to request that the VMM
convert the existing mapping of a given GPA address range between
private and shared.

tdx_hcall_gpa_intent() is the wrapper used for making the MapGPA
hypercall.
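
A later patch in this series uses the wrapper roughly as follows when
converting memory to shared (a sketch; vaddr and numpages are
placeholders):

	/* Ask the VMM to remap a range of pages as shared */
	if (tdx_hcall_gpa_intent(__pa(vaddr), numpages, TDX_MAP_SHARED))
		return -EIO;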

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Added required comments in tdx_hcall_gpa_intent().

Changes since v3:
* None

Changes since v1:
* Modified tdx_hcall_gpa_intent() to use _tdx_hypercall() instead of
tdx_hypercall().

arch/x86/include/asm/tdx.h | 18 ++++++++++++++++++
arch/x86/kernel/tdx.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b8f758dbbea9..a931c317e37d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -56,6 +56,15 @@ struct ve_info {
u32 instr_info;
};

+/*
+ * Page mapping type enum. This is a software construct, not
+ * part of any hardware or VMM ABI.
+ */
+enum tdx_map_type {
+ TDX_MAP_PRIVATE,
+ TDX_MAP_SHARED,
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST

bool is_tdx_guest(void);
@@ -78,6 +87,9 @@ bool tdx_early_handle_ve(struct pt_regs *regs);

extern phys_addr_t tdx_shared_mask(void);

+extern int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type);
+
/*
* To support I/O port access in decompressor or early kernel init
* code, since #VE exception handler cannot be used, use paravirt
@@ -145,6 +157,12 @@ static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }

static inline phys_addr_t tdx_shared_mask(void) { return 0; }

+static inline int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_INTEL_TDX_GUEST */

#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 8a37ab0c6cbf..c3e4cc5d631b 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -17,6 +17,9 @@
#define TDX_GET_INFO 1
#define TDX_GET_VEINFO 3

+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA 0x10001
+
#define VE_IS_IO_OUT(exit_qual) (((exit_qual) & 8) ? 0 : 1)
#define VE_GET_IO_SIZE(exit_qual) (((exit_qual) & 7) + 1)
#define VE_GET_PORT_NUM(exit_qual) ((exit_qual) >> 16)
@@ -105,6 +108,33 @@ static void tdx_get_info(void)
physical_mask &= ~tdx_shared_mask();
}

+/*
+ * Inform the VMM of the guest's intent for this physical page:
+ * shared with the VMM or private to the guest. The VMM is
+ * expected to change its mapping of the page in response.
+ *
+ * Note: shared->private conversions require further guest
+ * action to accept the page.
+ */
+int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)
+{
+ u64 ret;
+
+ if (map_type == TDX_MAP_SHARED)
+ gpa |= tdx_shared_mask();
+
+ /*
+ * Notify VMM about page mapping conversion. More info
+ * about ABI can be found in TDX Guest-Host-Communication
+ * Interface (GHCI), sec 3.2.
+ */
+ ret = _tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0,
+ NULL);
+
+ return ret ? -EIO : 0;
+}
+
static __cpuidle void _tdx_halt(const bool irq_disabled, const bool do_sti)
{
u64 ret;
--
2.25.1

Subject: [PATCH v5 04/16] x86/tdx: Make pages shared in ioremap()

From: "Kirill A. Shutemov" <[email protected]>

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure from AMD SEV code.

Note that the DMA code doesn't use ioremap() to convert memory to
shared, as DMA buffers are backed by normal memory. The DMA code makes
buffers shared with set_memory_decrypted().
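
In other words, the two conversion paths look roughly like this (a
sketch; the variables are placeholders):

	/* MMIO: the mapping is created shared at ioremap() time */
	regs = ioremap(phys_addr, size);

	/* DMA: normal memory is instead converted in place */
	set_memory_decrypted((unsigned long)vaddr, numpages);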

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Renamed "protected_guest" to "cc_guest".
* Replaced use of prot_guest_has() with cc_guest_has()
* Modified the patch to adapt to latest version cc_guest_has()
changes.

Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)

Changes since v1:
* Fixed format issues in commit log.

arch/x86/include/asm/pgtable.h | 4 ++++
arch/x86/mm/ioremap.c | 8 ++++++--
include/linux/cc_platform.h | 13 +++++++++++++
3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..ecefccbdf2e3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,10 @@
#define pgprot_encrypted(prot) __pgprot(__sme_set(pgprot_val(prot)))
#define pgprot_decrypted(prot) __pgprot(__sme_clr(pgprot_val(prot)))

+/* Make the page accessible to the VMM for confidential guests */
+#define pgprot_cc_guest(prot) __pgprot(pgprot_val(prot) | \
+ tdx_shared_mask())
+
#ifndef __ASSEMBLY__
#include <asm/x86_init.h>
#include <asm/pkru.h>
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 026031b3b782..83daa3f8f39c 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -17,6 +17,7 @@
#include <linux/cc_platform.h>
#include <linux/efi.h>
#include <linux/pgtable.h>
+#include <linux/cc_platform.h>

#include <asm/set_memory.h>
#include <asm/e820/api.h>
@@ -26,6 +27,7 @@
#include <asm/pgalloc.h>
#include <asm/memtype.h>
#include <asm/setup.h>
+#include <asm/tdx.h>

#include "physaddr.h"

@@ -87,8 +89,8 @@ static unsigned int __ioremap_check_ram(struct resource *res)
}

/*
- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in TDX case) because there the whole memory is already encrypted.
*/
static unsigned int __ioremap_check_encrypted(struct resource *res)
{
@@ -246,6 +248,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
+ else if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
+ prot = pgprot_cc_guest(prot);

switch (pcm) {
case _PAGE_CACHE_MODE_UC:
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 7728574d7783..edb1d7a2f6af 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -81,6 +81,19 @@ enum cc_attr {
* Examples include TDX Guest.
*/
CC_ATTR_GUEST_UNROLL_STRING_IO,
+
+ /**
+ * @CC_ATTR_GUEST_SHARED_MAPPING_INIT: IO Remapped memory is marked
+ * as shared.
+ *
+ * The platform/OS is running as a guest/virtual machine and
+ * initializes all IO remapped memory as shared.
+ *
+ * Examples include TDX Guest (SEV marks all pages as shared by default
+ * so this feature cannot be enabled for it).
+ */
+ CC_ATTR_GUEST_SHARED_MAPPING_INIT,
+
};

#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
--
2.25.1

Subject: [PATCH v5 13/16] PCI: Mark MSI data shared

From: Andi Kleen <[email protected]>

In a TDX guest, the MSI area must be shared with the host, so use
a shared mapping.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Replaced ioremap_shared() with ioremap_host_shared()

drivers/pci/msi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 0099a00af361..198ef6e6ca4f 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -596,7 +596,7 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
table_offset &= PCI_MSIX_TABLE_OFFSET;
phys_addr = pci_resource_start(dev, bir) + table_offset;

- return ioremap(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
+ return ioremap_host_shared(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
}

static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
--
2.25.1

Subject: [PATCH v5 15/16] x86/tdx: Implement ioremap_host_shared for x86

From: Andi Kleen <[email protected]>

Implement ioremap_host_shared for x86. In TDX, most memory is
encrypted, but some memory that is used to communicate with the host
must be declared shared with special page table attributes and a
special hypercall. Previously, all ioremap()ed memory was declared
shared, but this leads to various BIOS tables and other private state
being shared, which is a security risk.

Replace the unconditional ioremap sharing with an explicit
ioremap_host_shared() that enables sharing.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Rebased on top of Tom Lendacky's CC guest
changes (https://www.spinics.net/lists/linux-tip-commits/msg58716.html)

arch/x86/include/asm/io.h | 4 ++++
arch/x86/mm/ioremap.c | 41 ++++++++++++++++++++++++++++++---------
2 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 67e0c4a0a0f4..521b239c013f 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -380,6 +380,10 @@ extern void __iomem *ioremap_wc(resource_size_t offset, unsigned long size);
extern void __iomem *ioremap_wt(resource_size_t offset, unsigned long size);
#define ioremap_wt ioremap_wt

+extern void __iomem *ioremap_host_shared(resource_size_t offset,
+ unsigned long size);
+#define ioremap_host_shared ioremap_host_shared
+
extern bool is_early_ioremap_ptep(pte_t *ptep);

#define IO_SPACE_LIMIT 0xffff
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 83daa3f8f39c..a83a69045f61 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -178,7 +178,8 @@ static void __ioremap_check_mem(resource_size_t addr, unsigned long size,
*/
static void __iomem *
__ioremap_caller(resource_size_t phys_addr, unsigned long size,
- enum page_cache_mode pcm, void *caller, bool encrypted)
+ enum page_cache_mode pcm, void *caller, bool encrypted,
+ bool shared)
{
unsigned long offset, vaddr;
resource_size_t last_addr;
@@ -248,7 +249,7 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
- else if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
+ else if (shared)
prot = pgprot_cc_guest(prot);

switch (pcm) {
@@ -340,7 +341,8 @@ void __iomem *ioremap(resource_size_t phys_addr, unsigned long size)
enum page_cache_mode pcm = _PAGE_CACHE_MODE_UC_MINUS;

return __ioremap_caller(phys_addr, size, pcm,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false,
+ false);
}
EXPORT_SYMBOL(ioremap);

@@ -373,7 +375,8 @@ void __iomem *ioremap_uc(resource_size_t phys_addr, unsigned long size)
enum page_cache_mode pcm = _PAGE_CACHE_MODE_UC;

return __ioremap_caller(phys_addr, size, pcm,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false,
+ false);
}
EXPORT_SYMBOL_GPL(ioremap_uc);

@@ -390,10 +393,29 @@ EXPORT_SYMBOL_GPL(ioremap_uc);
void __iomem *ioremap_wc(resource_size_t phys_addr, unsigned long size)
{
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WC,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false,
+ false);
}
EXPORT_SYMBOL(ioremap_wc);

+/**
+ * ioremap_host_shared - map memory into CPU space shared with host
+ * @phys_addr: bus address of the memory
+ * @size: size of the resource to map
+ *
+ * This version of ioremap ensures that the memory is marked shared
+ * with the host. This is useful for confidential guests.
+ *
+ * Must be freed with iounmap.
+ */
+void __iomem *ioremap_host_shared(resource_size_t phys_addr, unsigned long size)
+{
+ return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_UC,
+ __builtin_return_address(0), false,
+ cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT));
+}
+EXPORT_SYMBOL(ioremap_host_shared);
+
/**
* ioremap_wt - map memory into CPU space write through
* @phys_addr: bus address of the memory
@@ -407,21 +429,22 @@ EXPORT_SYMBOL(ioremap_wc);
void __iomem *ioremap_wt(resource_size_t phys_addr, unsigned long size)
{
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WT,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false,
+ false);
}
EXPORT_SYMBOL(ioremap_wt);

void __iomem *ioremap_encrypted(resource_size_t phys_addr, unsigned long size)
{
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB,
- __builtin_return_address(0), true);
+ __builtin_return_address(0), true, false);
}
EXPORT_SYMBOL(ioremap_encrypted);

void __iomem *ioremap_cache(resource_size_t phys_addr, unsigned long size)
{
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false, false);
}
EXPORT_SYMBOL(ioremap_cache);

@@ -430,7 +453,7 @@ void __iomem *ioremap_prot(resource_size_t phys_addr, unsigned long size,
{
return __ioremap_caller(phys_addr, size,
pgprot2cachemode(__pgprot(prot_val)),
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false, false);
}
EXPORT_SYMBOL(ioremap_prot);

--
2.25.1

Subject: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

Add a command line option to force all the enabled drivers to use
shared memory mappings. This will be useful when enabling new drivers
in a confidential guest without first making all the required changes
to use shared mappings in them.

Note that this might also allow other, not explicitly enabled, drivers
to interact with the host, which could cause security risks.
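
For example, the option is simply appended to the guest kernel command
line (the other parameters here are illustrative):

	console=ttyS0 root=/dev/vda1 ioremap_force_shared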

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
.../admin-guide/kernel-parameters.rst | 1 +
.../admin-guide/kernel-parameters.txt | 12 ++++++++++++
arch/x86/include/asm/io.h | 2 ++
arch/x86/mm/ioremap.c | 19 ++++++++++++++++++-
4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index 01ba293a2d70..02e6aae1ad68 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -102,6 +102,7 @@ parameter is applicable::
ARM ARM architecture is enabled.
ARM64 ARM64 architecture is enabled.
AX25 Appropriate AX.25 support is enabled.
+ CCG Confidential Computing guest is enabled.
CLK Common clock infrastructure is enabled.
CMA Contiguous Memory Area support is enabled.
DRM Direct Rendering Management support is enabled.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 91ba391f9b32..0af19cb1a28c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2076,6 +2076,18 @@
1 - Bypass the IOMMU for DMA.
unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.

+ ioremap_force_shared= [X86_64, CCG]
+ Force the kernel to use shared memory mappings for all
+ drivers, even those that do not use ioremap_host_shared
+ or pci_iomap_host_shared to opt in to sharing with the
+ host. This is mainly used by a confidential guest when
+ enabling new drivers without proper shared memory
+ changes. Note that this option might also allow other,
+ not explicitly enabled, drivers to interact with the
+ host in a confidential guest, which could cause security
+ risks. It will also cause BIOS data structures to be
+ shared with the host, which might open security holes.
+
io7= [HW] IO7 for Marvel-based Alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 521b239c013f..98836c2833e4 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -423,6 +423,8 @@ static inline bool phys_mem_access_encrypted(unsigned long phys_addr,
}
#endif

+extern bool ioremap_force_shared;
+
/**
* iosubmit_cmds512 - copy data to single MMIO location, in 512-bit units
* @dst: destination, in MMIO space (must be 512-bit aligned)
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a83a69045f61..d0d2bf5116bc 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -28,6 +28,7 @@
#include <asm/memtype.h>
#include <asm/setup.h>
#include <asm/tdx.h>
+#include <asm/cmdline.h>

#include "physaddr.h"

@@ -162,6 +163,17 @@ static void __ioremap_check_mem(resource_size_t addr, unsigned long size,
__ioremap_check_other(addr, desc);
}

+/*
+ * Normally only drivers that are hardened for use in confidential guests
+ * force shared mappings. But if device filtering is disabled other
+ * devices can be loaded, and these need shared mappings too. This
+ * variable is set to true if these filters are disabled.
+ *
+ * Note this has some side effects, e.g. various BIOS tables
+ * get shared too which is risky.
+ */
+bool ioremap_force_shared;
+
/*
* Remap an arbitrary physical address space into the kernel virtual
* address space. It transparently creates kernel huge I/O mapping when
@@ -249,7 +261,7 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
- else if (shared)
+ else if (shared || ioremap_force_shared)
prot = pgprot_cc_guest(prot);

switch (pcm) {
@@ -847,6 +859,11 @@ void __init early_ioremap_init(void)
WARN_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
#endif

+ /* Parse cmdline params for ioremap_force_shared */
+ if (cmdline_find_option_bool(boot_command_line,
+ "ioremap_force_shared"))
+ ioremap_force_shared = 1;
+
early_ioremap_setup();

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
--
2.25.1

Subject: [PATCH v5 11/16] asm/io.h: Add ioremap_host_shared fallback

From: Andi Kleen <[email protected]>

This function is for declaring memory that should be shared with the
hypervisor in a confidential guest. If the architecture doesn't
implement it, it falls back to plain ioremap().
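
With the fallback in place, driver code can call it unconditionally (a
minimal sketch; 'res' is a placeholder struct resource):

	/* Shared with the host on TDX; plain ioremap() elsewhere */
	regs = ioremap_host_shared(res->start, resource_size(res));
	if (!regs)
		return -ENOMEM;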

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Renamed ioremap_shared to ioremap_host_shared
* Added documentation for ioremap_host_shared().

Documentation/driver-api/device-io.rst | 7 +++++++
arch/alpha/include/asm/io.h | 2 ++
arch/mips/include/asm/io.h | 2 ++
arch/parisc/include/asm/io.h | 2 ++
arch/sparc/include/asm/io_64.h | 2 ++
include/asm-generic/io.h | 5 +++++
6 files changed, 20 insertions(+)

diff --git a/Documentation/driver-api/device-io.rst b/Documentation/driver-api/device-io.rst
index e9f04b1815d1..9f77a036fc2f 100644
--- a/Documentation/driver-api/device-io.rst
+++ b/Documentation/driver-api/device-io.rst
@@ -429,6 +429,13 @@ of the linear kernel memory area to a regular pointer.

Portable drivers should avoid the use of ioremap_cache().

+ioremap_host_shared()
+---------------------
+
+ioremap_host_shared() maps I/O memory so that it can be shared with the host
+in a confidential guest platform. It is mainly used in platforms like
+Intel Trust Domain Extensions (TDX).
+
Architecture example
--------------------

diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h
index 0fab5ac90775..81952ef50667 100644
--- a/arch/alpha/include/asm/io.h
+++ b/arch/alpha/include/asm/io.h
@@ -283,6 +283,8 @@ static inline void __iomem *ioremap(unsigned long port, unsigned long size)
}

#define ioremap_wc ioremap
+/* Share memory with host in confidential guest platforms */
+#define ioremap_host_shared ioremap
#define ioremap_uc ioremap

static inline void iounmap(volatile void __iomem *addr)
diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h
index 6f5c86d2bab4..83f638fb48c5 100644
--- a/arch/mips/include/asm/io.h
+++ b/arch/mips/include/asm/io.h
@@ -179,6 +179,8 @@ void iounmap(const volatile void __iomem *addr);
#define ioremap(offset, size) \
ioremap_prot((offset), (size), _CACHE_UNCACHED)
#define ioremap_uc ioremap
+/* Share memory with host in confidential guest platforms */
+#define ioremap_host_shared ioremap

/*
* ioremap_cache - map bus memory into CPU space
diff --git a/arch/parisc/include/asm/io.h b/arch/parisc/include/asm/io.h
index 0b5259102319..ef516ee06238 100644
--- a/arch/parisc/include/asm/io.h
+++ b/arch/parisc/include/asm/io.h
@@ -129,6 +129,8 @@ static inline void gsc_writeq(unsigned long long val, unsigned long addr)
*/
void __iomem *ioremap(unsigned long offset, unsigned long size);
#define ioremap_wc ioremap
+/* Share memory with host in confidential guest platforms */
+#define ioremap_host_shared ioremap
#define ioremap_uc ioremap

extern void iounmap(const volatile void __iomem *addr);
diff --git a/arch/sparc/include/asm/io_64.h b/arch/sparc/include/asm/io_64.h
index 5ffa820dcd4d..5b73b877f832 100644
--- a/arch/sparc/include/asm/io_64.h
+++ b/arch/sparc/include/asm/io_64.h
@@ -409,6 +409,8 @@ static inline void __iomem *ioremap(unsigned long offset, unsigned long size)
#define ioremap_uc(X,Y) ioremap((X),(Y))
#define ioremap_wc(X,Y) ioremap((X),(Y))
#define ioremap_wt(X,Y) ioremap((X),(Y))
+/* Share memory with host in confidential guest platforms */
+#define ioremap_host_shared(X, Y) ioremap((X), (Y))
static inline void __iomem *ioremap_np(unsigned long offset, unsigned long size)
{
return NULL;
diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index e93375c710b9..26b48fe23769 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -982,6 +982,11 @@ static inline void __iomem *ioremap(phys_addr_t addr, size_t size)
#define ioremap_wt ioremap
#endif

+/* Share memory with host in confidential guest platforms */
+#ifndef ioremap_host_shared
+#define ioremap_host_shared ioremap
+#endif
+
/*
* ioremap_uc is special in that we do require an explicit architecture
* implementation. In general you do not want to use this function in a
--
2.25.1

Subject: [PATCH v5 07/16] x86/kvm: Use bounce buffers for TD guest

From: "Kirill A. Shutemov" <[email protected]>

Intel TDX doesn't allow the VMM to directly access guest private
memory. Any memory that is required for communication with the VMM
must be shared explicitly. The same rule applies for any DMA to and
from a TDX guest. All DMA pages have to be marked as shared pages. A
generic way to achieve this without any changes to device drivers is
to use the SWIOTLB framework.

This method of handling it is similar to AMD SEV's, so extend this
support to TDX guests as well. Also, since there is some code common
to AMD SEV and TDX guests in mem_encrypt_init(), move it to
mem_encrypt_common.c and call the AMD-specific init function from it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Replaced prot_guest_has() with cc_guest_has().

Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)

Changes since v1:
* Removed sme_me_mask check for amd_mem_encrypt_init() in mem_encrypt_init().

arch/x86/include/asm/mem_encrypt_common.h | 3 +++
arch/x86/kernel/tdx.c | 2 ++
arch/x86/mm/mem_encrypt.c | 5 +----
arch/x86/mm/mem_encrypt_common.c | 14 ++++++++++++++
4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
index 697bc40a4e3d..bc90e565bce4 100644
--- a/arch/x86/include/asm/mem_encrypt_common.h
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -8,11 +8,14 @@

#ifdef CONFIG_AMD_MEM_ENCRYPT
bool amd_force_dma_unencrypted(struct device *dev);
+void __init amd_mem_encrypt_init(void);
#else /* CONFIG_AMD_MEM_ENCRYPT */
static inline bool amd_force_dma_unencrypted(struct device *dev)
{
return false;
}
+
+static inline void amd_mem_encrypt_init(void) {}
#endif /* CONFIG_AMD_MEM_ENCRYPT */

#endif
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 433f366ca25c..ce8e3019b812 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -12,6 +12,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <linux/sched/signal.h> /* force_sig_fault() */
+#include <linux/swiotlb.h>

/* TDX Module call Leaf IDs */
#define TDX_GET_INFO 1
@@ -577,6 +578,7 @@ void __init tdx_early_init(void)
pv_ops.irq.halt = tdx_halt;

legacy_pic = &null_legacy_pic;
+ swiotlb_force = SWIOTLB_FORCE;

cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
NULL, tdx_cpu_offline_prepare);
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 5d7fbed73949..8385bc4565e9 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -438,14 +438,11 @@ static void print_mem_encrypt_feature_info(void)
}

/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
+void __init amd_mem_encrypt_init(void)
{
if (!sme_me_mask)
return;

- /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
- swiotlb_update_mem_attributes();
-
/*
* With SEV, we need to unroll the rep string I/O instructions,
* but SEV-ES supports them through the #VC handler.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 119a9056efbb..6fe44c6cb753 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
#include <asm/mem_encrypt_common.h>
#include <linux/dma-mapping.h>
#include <linux/cc_platform.h>
+#include <linux/swiotlb.h>

/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
bool force_dma_unencrypted(struct device *dev)
@@ -24,3 +25,16 @@ bool force_dma_unencrypted(struct device *dev)

return false;
}
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+ /*
+ * For TDX guest or SEV/SME, call into SWIOTLB to update
+ * the SWIOTLB DMA buffers
+ */
+ if (sme_me_mask || cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+ swiotlb_update_mem_attributes();
+
+ amd_mem_encrypt_init();
+}
--
2.25.1

Subject: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

From: Andi Kleen <[email protected]>

For confidential VM guests like TDX, the host is untrusted, and hence
the devices emulated by the host, or any data coming from the host,
cannot be trusted. So the drivers that interact with the outside world
have to be hardened by sharing memory with the host only on an
as-needed basis, along with proper hardening fixes.

For the PCI driver case, add the pci_iomap_host_shared() and
pci_iomap_host_shared_range() APIs to share memory with the host.
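
Usage mirrors the existing pci_iomap() calls (a sketch; the BAR number
is illustrative):

	/* Map BAR 0 shared with the host; maxlen 0 maps the whole BAR */
	void __iomem *base = pci_iomap_host_shared(pdev, 0, 0);

	if (!base)
		return -ENOMEM;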

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v4:
* Replaced "_shared" with "_host_shared" in pci_iomap* APIs
* Fixed commit log as per review comments.

include/asm-generic/pci_iomap.h | 6 +++++
lib/pci_iomap.c | 47 +++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+)

diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
index df636c6d8e6c..a4a83c8ab3cf 100644
--- a/include/asm-generic/pci_iomap.h
+++ b/include/asm-generic/pci_iomap.h
@@ -18,6 +18,12 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
unsigned long offset,
unsigned long maxlen);
+extern void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar,
+ unsigned long max);
+extern void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar,
+ unsigned long offset,
+ unsigned long maxlen);
+
/* Create a virtual mapping cookie for a port on a given PCI device.
* Do not call this directly, it exists to make it easier for architectures
* to override */
diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
index 57bd92f599ee..2816dc8715da 100644
--- a/lib/pci_iomap.c
+++ b/lib/pci_iomap.c
@@ -25,6 +25,11 @@ static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size)
return ioremap_wc(addr, size);
}

+static void __iomem *map_ioremap_host_shared(phys_addr_t addr, size_t size)
+{
+ return ioremap_host_shared(addr, size);
+}
+
static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
int bar,
unsigned long offset,
@@ -106,6 +111,48 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
}
EXPORT_SYMBOL_GPL(pci_iomap_wc_range);

+/**
+ * pci_iomap_host_shared_range - create a virtual shared mapping cookie
+ * for a PCI BAR
+ * @dev: PCI device that owns the BAR
+ * @bar: BAR number
+ * @offset: map memory at the given offset in BAR
+ * @maxlen: max length of the memory to map
+ *
+ * Remap a PCI device's resource as shared in a confidential guest.
+ * For more details see pci_iomap_range's documentation.
+ *
+ * @maxlen specifies the maximum length to map. To get access to
+ * the complete BAR from offset to the end, pass %0 here.
+ */
+void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar,
+ unsigned long offset,
+ unsigned long maxlen)
+{
+ return pci_iomap_range_map(dev, bar, offset, maxlen,
+ map_ioremap_host_shared, true);
+}
+EXPORT_SYMBOL_GPL(pci_iomap_host_shared_range);
+
+/**
+ * pci_iomap_host_shared - create a virtual shared mapping cookie for a PCI BAR
+ * @dev: PCI device that owns the BAR
+ * @bar: BAR number
+ * @maxlen: length of the memory to map
+ *
+ * See pci_iomap for details. This function creates a shared mapping
+ * with the host for confidential guests.
+ *
+ * @maxlen specifies the maximum length to map. To get access to the
+ * complete BAR without checking for its length first, pass %0 here.
+ */
+void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar,
+ unsigned long maxlen)
+{
+ return pci_iomap_host_shared_range(dev, bar, 0, maxlen);
+}
+EXPORT_SYMBOL_GPL(pci_iomap_host_shared);
+
/**
* pci_iomap - create a virtual mapping cookie for a PCI BAR
* @dev: PCI device that owns the BAR
--
2.25.1

Subject: [PATCH v5 10/16] PCI: Consolidate pci_iomap_range(), pci_iomap_wc_range()

From: Andi Kleen <[email protected]>

pci_iomap_range() and pci_iomap_wc_range() are currently duplicated
code, except that the _wc variant does not support IO ports. So,
implement them using a common helper, pci_iomap_range_map().

Also add wrappers for the maps because some architectures implement
ioremap and friends with macros.

This will allow adding more variants without excessive code
duplication. No functional changes are intended.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v4:
* Rebased on top of Tom Lendacky's CC guest
changes (https://www.spinics.net/lists/linux-tip-commits/msg58716.html)
* Fixed commit log as per Bjorns comments.
* Added "support_io" argument to pci_iomap_range_map() to support
__pci_ioport_map() only in pci_iomap_range().

lib/pci_iomap.c | 86 ++++++++++++++++++++++++++++---------------------
1 file changed, 49 insertions(+), 37 deletions(-)

diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
index 2d3eb1cb73b8..57bd92f599ee 100644
--- a/lib/pci_iomap.c
+++ b/lib/pci_iomap.c
@@ -10,6 +10,51 @@
#include <linux/export.h>

#ifdef CONFIG_PCI
+
+/*
+ * Callback wrappers because some architectures define ioremap et.al.
+ * as macros.
+ */
+static void __iomem *map_ioremap(phys_addr_t addr, size_t size)
+{
+ return ioremap(addr, size);
+}
+
+static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size)
+{
+ return ioremap_wc(addr, size);
+}
+
+static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
+ int bar,
+ unsigned long offset,
+ unsigned long maxlen,
+ void __iomem *(*mapm)(phys_addr_t,
+ size_t),
+ bool support_io)
+{
+ resource_size_t start = pci_resource_start(dev, bar);
+ resource_size_t len = pci_resource_len(dev, bar);
+ unsigned long flags = pci_resource_flags(dev, bar);
+
+ if (len <= offset || !start)
+ return NULL;
+ len -= offset;
+ start += offset;
+ if (maxlen && len > maxlen)
+ len = maxlen;
+ if (flags & IORESOURCE_IO) {
+ if (support_io)
+ return __pci_ioport_map(dev, start, len);
+
+ return NULL;
+ }
+ if (flags & IORESOURCE_MEM)
+ return mapm(start, len);
+ /* What? */
+ return NULL;
+}
+
/**
* pci_iomap_range - create a virtual mapping cookie for a PCI BAR
* @dev: PCI device that owns the BAR
@@ -30,22 +75,8 @@ void __iomem *pci_iomap_range(struct pci_dev *dev,
unsigned long offset,
unsigned long maxlen)
{
- resource_size_t start = pci_resource_start(dev, bar);
- resource_size_t len = pci_resource_len(dev, bar);
- unsigned long flags = pci_resource_flags(dev, bar);
-
- if (len <= offset || !start)
- return NULL;
- len -= offset;
- start += offset;
- if (maxlen && len > maxlen)
- len = maxlen;
- if (flags & IORESOURCE_IO)
- return __pci_ioport_map(dev, start, len);
- if (flags & IORESOURCE_MEM)
- return ioremap(start, len);
- /* What? */
- return NULL;
+ return pci_iomap_range_map(dev, bar, offset, maxlen,
+ map_ioremap, true);
}
EXPORT_SYMBOL(pci_iomap_range);

@@ -70,27 +101,8 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
unsigned long offset,
unsigned long maxlen)
{
- resource_size_t start = pci_resource_start(dev, bar);
- resource_size_t len = pci_resource_len(dev, bar);
- unsigned long flags = pci_resource_flags(dev, bar);
-
-
- if (flags & IORESOURCE_IO)
- return NULL;
-
- if (len <= offset || !start)
- return NULL;
-
- len -= offset;
- start += offset;
- if (maxlen && len > maxlen)
- len = maxlen;
-
- if (flags & IORESOURCE_MEM)
- return ioremap_wc(start, len);
-
- /* What? */
- return NULL;
+ return pci_iomap_range_map(dev, bar, offset, maxlen,
+ map_ioremap_wc, false);
}
EXPORT_SYMBOL_GPL(pci_iomap_wc_range);

--
2.25.1

Subject: [PATCH v5 14/16] virtio: Use shared mappings for virtio PCI devices

From: Andi Kleen <[email protected]>

In a TDX guest, the PCI device mappings of virtio must be shared
with the host, so use explicit shared mappings.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Replaced pci_iomap_shared_range() with pci_iomap_host_shared_range().

drivers/virtio/virtio_pci_modern_dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_pci_modern_dev.c b/drivers/virtio/virtio_pci_modern_dev.c
index e11ed748e661..f29bf45a4642 100644
--- a/drivers/virtio/virtio_pci_modern_dev.c
+++ b/drivers/virtio/virtio_pci_modern_dev.c
@@ -83,7 +83,7 @@ vp_modern_map_capability(struct virtio_pci_modern_device *mdev, int off,
return NULL;
}

- p = pci_iomap_range(dev, bar, offset, length);
+ p = pci_iomap_host_shared_range(dev, bar, offset, length);
if (!p)
dev_err(&dev->dev,
"virtio_pci: unable to map virtio %u@%u on bar %i\n",
--
2.25.1

Subject: [PATCH v5 06/16] x86/tdx: Make DMA pages shared

From: "Kirill A. Shutemov" <[email protected]>

Just like MKTME, TDX reassigns bits of the physical address for
metadata. MKTME used several bits for an encryption KeyID. TDX
uses a single bit in guests to communicate whether a physical page
should be protected by TDX as private memory (bit set to 0) or
unprotected and shared with the VMM (bit set to 1).

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit
accordingly, followed by the relevant TDX hypercall.

Also, do TDX_ACCEPT_PAGE on every 4k page after mapping the GPA range
when converting memory to private. The 4k page size limit is due to a
current TDX spec restriction. If the GPA (range) was already mapped as
an active, private page, the host VMM may remove the private page from
the TD by following the "Removing TD Private Pages" sequence in the
Intel TDX-module specification [1] to safely block the mapping(s),
flush the TLB and cache, and remove the mapping(s).

BUG() if TDX_ACCEPT_PAGE fails (except in the "previously accepted
page" case), as the guest is completely hosed if it can't access
memory.

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf

Tested-by: Kai Huang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v4:
* Renamed tdg_accept_page() to tdx_accept_page().
* Added required comments to tdx_accept_page().
* Replaced prot_guest_has() with cc_guest_has().

Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)
* Fixed TDX_PAGE_ALREADY_ACCEPTED error code as per latest
spec update.

Changes since v1:
* Removed "we" or "I" usages in comment section.
* Replaced is_tdx_guest() checks with prot_guest_has() checks.

arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/tdx.c | 45 ++++++++++++++++++++++++++++----
arch/x86/mm/mem_encrypt_common.c | 11 +++++++-
arch/x86/mm/pat/set_memory.c | 45 +++++++++++++++++++++++++++-----
4 files changed, 89 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ecefccbdf2e3..2de4d6e34b84 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -24,6 +24,7 @@
/* Make the page accessible by VMM for confidential guests */
#define pgprot_cc_guest(prot) __pgprot(pgprot_val(prot) | \
tdx_shared_mask())
+#define pgprot_cc_shared_mask() __pgprot(tdx_shared_mask())

#ifndef __ASSEMBLY__
#include <asm/x86_init.h>
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index c3e4cc5d631b..433f366ca25c 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -16,10 +16,16 @@
/* TDX Module call Leaf IDs */
#define TDX_GET_INFO 1
#define TDX_GET_VEINFO 3
+#define TDX_ACCEPT_PAGE 6

/* TDX hypercall Leaf IDs */
#define TDVMCALL_MAP_GPA 0x10001

+/* TDX Module call error codes */
+#define TDX_PAGE_ALREADY_ACCEPTED 0x00000b0a00000000
+#define TDCALL_RETURN_CODE_MASK 0xFFFFFFFF00000000
+#define TDCALL_RETURN_CODE(a) ((a) & TDCALL_RETURN_CODE_MASK)
+
#define VE_IS_IO_OUT(exit_qual) (((exit_qual) & 8) ? 0 : 1)
#define VE_GET_IO_SIZE(exit_qual) (((exit_qual) & 7) + 1)
#define VE_GET_PORT_NUM(exit_qual) ((exit_qual) >> 16)
@@ -108,18 +114,35 @@ static void tdx_get_info(void)
physical_mask &= ~tdx_shared_mask();
}

+static void tdx_accept_page(phys_addr_t gpa)
+{
+ u64 ret;
+
+ /*
+ * Pass the page physical address and size (0-4KB) to the
+ * TDX module to accept the pending, private page. More info
+ * about ABI can be found in TDX Guest-Host-Communication
+ * Interface (GHCI), sec 2.4.7.
+ */
+ ret = __tdx_module_call(TDX_ACCEPT_PAGE, gpa, 0, 0, 0, NULL);
+
+ /*
+ * A non-zero return value means a buggy TDX module (which is
+ * fatal for the TDX guest), so BUG() here.
+ */
+ BUG_ON(ret && TDCALL_RETURN_CODE(ret) != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
/*
* Inform the VMM of the guest's intent for this physical page:
* shared with the VMM or private to the guest. The VMM is
* expected to change its mapping of the page in response.
- *
- * Note: shared->private conversions require further guest
- * action to accept the page.
*/
int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
enum tdx_map_type map_type)
{
- u64 ret;
+ u64 ret = 0;
+ int i;

if (map_type == TDX_MAP_SHARED)
gpa |= tdx_shared_mask();
@@ -131,8 +154,20 @@ int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
*/
ret = _tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0,
NULL);
+ if (ret)
+ ret = -EIO;
+
+ if (ret || map_type == TDX_MAP_SHARED)
+ return ret;
+
+ /*
+ * For shared->private conversion, accept the page using
+ * TDX_ACCEPT_PAGE TDX module call.
+ */
+ for (i = 0; i < numpages; i++)
+ tdx_accept_page(gpa + i * PAGE_SIZE);

- return ret ? -EIO : 0;
+ return 0;
}

static __cpuidle void _tdx_halt(const bool irq_disabled, const bool do_sti)
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index f063c885b0a5..119a9056efbb 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -9,9 +9,18 @@

#include <asm/mem_encrypt_common.h>
#include <linux/dma-mapping.h>
+#include <linux/cc_platform.h>

/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
bool force_dma_unencrypted(struct device *dev)
{
- return amd_force_dma_unencrypted(dev);
+ if (cc_platform_has(CC_ATTR_GUEST_TDX) &&
+ cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+ return true;
+
+ if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) ||
+ cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+ return amd_force_dma_unencrypted(dev);
+
+ return false;
}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 527957586f3c..6c531d5cb5fd 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -30,6 +30,7 @@
#include <asm/proto.h>
#include <asm/memtype.h>
#include <asm/set_memory.h>
+#include <asm/tdx.h>

#include "../mm_internal.h"

@@ -1981,8 +1982,10 @@ int set_memory_global(unsigned long addr, int numpages)
__pgprot(_PAGE_GLOBAL), 0);
}

-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
{
+ pgprot_t mem_protected_bits, mem_plain_bits;
+ enum tdx_map_type map_type;
struct cpa_data cpa;
int ret;

@@ -1997,8 +2000,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
cpa.numpages = numpages;
- cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
- cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+ if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT)) {
+ mem_protected_bits = __pgprot(0);
+ mem_plain_bits = pgprot_cc_shared_mask();
+ } else {
+ mem_protected_bits = __pgprot(_PAGE_ENC);
+ mem_plain_bits = __pgprot(0);
+ }
+
+ if (protect) {
+ cpa.mask_set = mem_protected_bits;
+ cpa.mask_clr = mem_plain_bits;
+ map_type = TDX_MAP_PRIVATE;
+ } else {
+ cpa.mask_set = mem_plain_bits;
+ cpa.mask_clr = mem_protected_bits;
+ map_type = TDX_MAP_SHARED;
+ }
+
cpa.pgd = init_mm.pgd;

/* Must avoid aliasing mappings in the highmem code */
@@ -2006,9 +2026,17 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
vm_unmap_aliases();

/*
- * Before changing the encryption attribute, we need to flush caches.
+ * Before changing the encryption attribute, flush caches.
+ *
+ * For TDX, guest is responsible for flushing caches on private->shared
+ * transition. VMM is responsible for flushing on shared->private.
*/
- cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
+ if (map_type == TDX_MAP_SHARED)
+ cpa_flush(&cpa, 1);
+ } else {
+ cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ }

ret = __change_page_attr_set_clr(&cpa, 1);

@@ -2021,18 +2049,21 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
*/
cpa_flush(&cpa, 0);

+ if (!ret && cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
+ ret = tdx_hcall_gpa_intent(__pa(addr), numpages, map_type);
+
return ret;
}

int set_memory_encrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, true);
+ return __set_memory_protect(addr, numpages, true);
}
EXPORT_SYMBOL_GPL(set_memory_encrypted);

int set_memory_decrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, false);
+ return __set_memory_protect(addr, numpages, false);
}
EXPORT_SYMBOL_GPL(set_memory_decrypted);

--
2.25.1
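
As a usage sketch (hedged: the buffer and error handling here are
illustrative, not part of the patch), a caller converting a page for
VMM communication now flows through the reworked path like this:

        unsigned long vaddr = __get_free_page(GFP_KERNEL);
        int ret;

        if (!vaddr)
                return -ENOMEM;

        /* private -> shared: clears protection, sets the Shared bit
         * on TDX and issues the MapGPA hypercall
         */
        ret = set_memory_decrypted(vaddr, 1);
        if (ret) {
                free_page(vaddr);
                return ret;
        }

        /* ... communicate through the now host-visible page ... */

        /* shared -> private: MapGPA, then TDX_ACCEPT_PAGE per 4k page */
        ret = set_memory_encrypted(vaddr, 1);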

2021-10-09 01:50:45

by Randy Dunlap

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

On 10/8/21 5:37 PM, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 91ba391f9b32..0af19cb1a28c 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2076,6 +2076,18 @@
> 1 - Bypass the IOMMU for DMA.
> unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
>
> + ioremap_force_shared= [X86_64, CCG]
> + Force the kernel to use shared memory mappings which do
> + not use ioremap_host_shared/pcimap_host_shared to opt-in
> + to shared mappings with the host. This feature is mainly
> + used by a confidential guest when enabling new drivers
> + without proper shared memory related changes. Please note
> + that this option might also allow other non explicitly
> + enabled drivers to interact with the host in confidential
> + guest, which could cause other security risks. This option
> + will also cause BIOS data structures to be shared with the
> + host, which might open security holes.

Hi,
This cmdline option text should have a little bit more info. Just as an
example/template:

acpi_apic_instance= [ACPI, IOAPIC]
Format: <int>
2: use 2nd APIC table, if available
1,0: use 1st APIC table
default: 0

So what is expected after the "=" sign?...

thanks.
--
~Randy

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared



On 10/8/21 6:45 PM, Randy Dunlap wrote:
> Hi,
> This cmdline option text should have a little bit more info. Just as an
> example/template:
>
>     acpi_apic_instance=    [ACPI, IOAPIC]
>             Format: <int>
>             2: use 2nd APIC table, if available
>             1,0: use 1st APIC table
>             default: 0
>
> So what is expected after the "=" sign?...

It does not take any arguments. I will remove the = sign in the next
version.
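
For reference, a hedged sketch of how the reworked entry could read
(illustrative wording only, not the posted v6 text):

        ioremap_force_shared    [X86_64, CCG]
                        The option takes no arguments.
                        Force the kernel to use host-shared memory
                        mappings even for drivers that do not opt in
                        via ioremap_host_shared()/
                        pci_iomap_host_shared(). Mainly for enabling
                        new drivers in a confidential guest before
                        they have proper shared-memory changes.
                        Weakens guest security, including sharing
                        BIOS data structures with the host.
                        default: off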

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-10-09 09:54:47

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: Andi Kleen <[email protected]>
>
> For Confidential VM guests like TDX, the host is untrusted and hence
> the devices emulated by the host or any data coming from the host
> cannot be trusted. So the drivers that interact with the outside world
> have to be hardened by sharing memory with host on need basis
> with proper hardening fixes.
>
> For the PCI driver case, to share the memory with the host add
> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
>
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>

So I proposed to make all pci mappings shared, eliminating the need
to patch drivers.

To which Andi replied
One problem with removing the ioremap opt-in is that
it's still possible for drivers to get at devices without going through probe.

To which Greg replied:
https://lore.kernel.org/all/[email protected]/
If there are in-kernel PCI drivers that do not do this, they need to be
fixed today.

Can you guys resolve the differences here?

And once they are resolved, mention this in the commit log so
I don't get to re-read the series just to find out nothing
changed in this respect?

I frankly do not believe we are anywhere near being able to harden
an arbitrary kernel config against attack.
How about creating a defconfig that makes sense for TDX then?
Anyone deviating from that better know what they are doing,
this API tweaking is just putting policy into the kernel ...

> ---
> Changes since v4:
> * Replaced "_shared" with "_host_shared" in pci_iomap* APIs
> * Fixed commit log as per review comments.
>
> include/asm-generic/pci_iomap.h | 6 +++++
> lib/pci_iomap.c | 47 +++++++++++++++++++++++++++++++++
> 2 files changed, 53 insertions(+)
>
> diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
> index df636c6d8e6c..a4a83c8ab3cf 100644
> --- a/include/asm-generic/pci_iomap.h
> +++ b/include/asm-generic/pci_iomap.h
> @@ -18,6 +18,12 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
> extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
> unsigned long offset,
> unsigned long maxlen);
> +extern void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar,
> + unsigned long max);
> +extern void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar,
> + unsigned long offset,
> + unsigned long maxlen);
> +
> /* Create a virtual mapping cookie for a port on a given PCI device.
> * Do not call this directly, it exists to make it easier for architectures
> * to override */
> diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
> index 57bd92f599ee..2816dc8715da 100644
> --- a/lib/pci_iomap.c
> +++ b/lib/pci_iomap.c
> @@ -25,6 +25,11 @@ static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size)
> return ioremap_wc(addr, size);
> }
>
> +static void __iomem *map_ioremap_host_shared(phys_addr_t addr, size_t size)
> +{
> + return ioremap_host_shared(addr, size);
> +}
> +
> static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
> int bar,
> unsigned long offset,
> @@ -106,6 +111,48 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
> }
> EXPORT_SYMBOL_GPL(pci_iomap_wc_range);
>
> +/**
> + * pci_iomap_host_shared_range - create a virtual shared mapping cookie
> + * for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @offset: map memory at the given offset in BAR
> + * @maxlen: max length of the memory to map
> + *
> + * Remap a pci device's resources shared in a confidential guest.
> + * For more details see pci_iomap_range's documentation.

So how does a driver author know when to use this function, and when to
use the regular pci_iomap_range? Drivers have no idea whether they are
used in a confidential guest, and which ranges are shared, it's a TDX
thing ...

This documentation should really address it.

> + *
> + * @maxlen specifies the maximum length to map. To get access to
> + * the complete BAR from offset to the end, pass %0 here.
> + */
> +void __iomem *pci_iomap_host_shared_range(struct pci_dev *dev, int bar,
> + unsigned long offset,
> + unsigned long maxlen)
> +{
> + return pci_iomap_range_map(dev, bar, offset, maxlen,
> + map_ioremap_host_shared, true);
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_host_shared_range);
> +
> +/**
> + * pci_iomap_host_shared - create a virtual shared mapping cookie for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @maxlen: length of the memory to map
> + *
> + * See pci_iomap for details. This function creates a shared mapping
> + * with the host for confidential hosts.
> + *
> + * @maxlen specifies the maximum length to map. To get access to the
> + * complete BAR without checking for its length first, pass %0 here.
> + */
> +void __iomem *pci_iomap_host_shared(struct pci_dev *dev, int bar,
> + unsigned long maxlen)
> +{
> + return pci_iomap_host_shared_range(dev, bar, 0, maxlen);
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_host_shared);
> +
> /**
> * pci_iomap - create a virtual mapping cookie for a PCI BAR
> * @dev: PCI device that owns the BAR
> --
> 2.25.1

2021-10-09 11:06:37

by Michael S. Tsirkin

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

On Fri, Oct 08, 2021 at 05:37:11PM -0700, Kuppuswamy Sathyanarayanan wrote:
> + ioremap_force_shared= [X86_64, CCG]
> + Force the kernel to use shared memory mappings which do
> + not use ioremap_host_shared/pcimap_host_shared to opt-in
> + to shared mappings with the host. This feature is mainly
> + used by a confidential guest when enabling new drivers
> + without proper shared memory related changes. Please note
> + that this option might also allow other non explicitly
> + enabled drivers to interact with the host in confidential
> + guest, which could cause other security risks. This option
> + will also cause BIOS data structures to be shared with the
> + host, which might open security holes.
> +
> io7= [HW] IO7 for Marvel-based Alpha systems
> See comment before marvel_specify_io7 in
> arch/alpha/kernel/core_marvel.c.

The connection is quite unfortunate IMHO.
Can't there be an option
that unbreaks drivers *without* opening up security holes by
making BIOS shared?

--
MST

2021-10-09 20:41:12

by Dan Williams

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <[email protected]> wrote:
>
> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > From: Andi Kleen <[email protected]>
> >
> > For Confidential VM guests like TDX, the host is untrusted and hence
> > the devices emulated by the host or any data coming from the host
> > cannot be trusted. So the drivers that interact with the outside world
> > have to be hardened by sharing memory with host on need basis
> > with proper hardening fixes.
> >
> > For the PCI driver case, to share the memory with the host add
> > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> > Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
>
> So I proposed to make all pci mappings shared, eliminating the need
> to patch drivers.
>
> To which Andi replied
> One problem with removing the ioremap opt-in is that
> it's still possible for drivers to get at devices without going through probe.
>
> To which Greg replied:
> https://lore.kernel.org/all/[email protected]/
> If there are in-kernel PCI drivers that do not do this, they need to be
> fixed today.
>
> Can you guys resolve the differences here?

I agree with you and Greg here. If a driver is accessing hardware
resources outside of the bind lifetime of one of the devices it
supports, and in a way that neither modprobe-policy nor
device-authorization -policy infrastructure can block, that sounds
like a bug report. Fix those drivers instead of sprinkling
ioremap_shared in select places and with unclear rules about when a
driver is allowed to do "shared" mappings. Let the new
device-authorization mechanism (with policy in userspace) be the
central place where all of these driver "trust" issues are managed.

> And once they are resolved, mention this in the commit log so
> I don't get to re-read the series just to find out nothing
> changed in this respect?
>
> I frankly do not believe we are anywhere near being able to harden
> an arbitrary kernel config against attack.
> How about creating a defconfig that makes sense for TDX then?
> Anyone deviating from that better know what they are doing,
> this API tweaking is just putting policy into the kernel ...

Right, userspace authorization policy and select driver fixups seems
to be the answer to the raised concerns.

2021-10-11 00:00:30

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


On 10/9/2021 1:39 PM, Dan Williams wrote:
> On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <[email protected]> wrote:
>> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
>>> From: Andi Kleen <[email protected]>
>>>
>>> For Confidential VM guests like TDX, the host is untrusted and hence
>>> the devices emulated by the host or any data coming from the host
>>> cannot be trusted. So the drivers that interact with the outside world
>>> have to be hardened by sharing memory with host on need basis
>>> with proper hardening fixes.
>>>
>>> For the PCI driver case, to share the memory with the host add
>>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
>>>
>>> Signed-off-by: Andi Kleen <[email protected]>
>>> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
>> So I proposed to make all pci mappings shared, eliminating the need
>> to patch drivers.
>>
>> To which Andi replied
>> One problem with removing the ioremap opt-in is that
>> it's still possible for drivers to get at devices without going through probe.
>>
>> To which Greg replied:
>> https://lore.kernel.org/all/[email protected]/
>> If there are in-kernel PCI drivers that do not do this, they need to be
>> fixed today.
>>
>> Can you guys resolve the differences here?
> I agree with you and Greg here. If a driver is accessing hardware
> resources outside of the bind lifetime of one of the devices it
> supports, and in a way that neither modrobe-policy nor
> device-authorization -policy infrastructure can block, that sounds
> like a bug report.

The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
others) in init functions that also register drivers (thanks Elena for
the number).

Some are probably old drivers that could be fixed, but there are quite
a few legitimate cases. For example, for platform or ISA drivers that's
the only way they can be implemented, because they often have no other
enumeration mechanism. For PCI drivers it's rarer, but it can still
happen. One example that comes to mind here is the x86 Intel uncore
drivers, which support a mix of MSR, ioremap and PCI config space
accesses all from the same driver. This particular example can (and
should be) fixed in other ways, but similar things also happen in other
drivers, and they're not all broken. Even the broken ones are usually
for some crufty old device that has very few users, so they're likely
untestable in practice.

My point is just that the ecosystem of devices that Linux supports is
messy enough that there are legitimate exceptions from the "First IO
only in probe call only" rule.

And we can't just fix them all. Even if we could it would be hard to
maintain.

Using a "firewall model" hooking into a few strategic points like we're
proposing here is much saner for everyone.

Now we can argue about the details. Right now what we're proposing has
some redundancies: it has both a device model filter and a low level
filter for ioremap (this patch and some others). The low level filter
is for catching issues that don't clearly fit into the
"enumeration<->probe" model. You could call that redundant, but I would
call it defense in depth, or better safe than sorry. In theory it would
be enough to have the low level opt-in only, but that would have the
drawback that if something gets enumerated after all you would have all
kinds of weird device driver failures and in some cases even killed
guests. So I think it makes sense to have both.
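
To make the shape of the low level filter concrete, here is a hedged
sketch (the helper is illustrative, not the posted code;
cc_platform_has(), pgprot_cc_guest() and the ioremap_force_shared
option all appear elsewhere in this series):

        /* Sketch of the low level ioremap "firewall": protections are
         * chosen at map time, so a driver that never opted in cannot
         * create a host-shared mapping unless the command line
         * override is active.
         */
        static pgprot_t ioremap_cc_prot(pgprot_t prot, bool host_shared)
        {
                if (!cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
                        return prot;            /* not a CC guest */

                if (host_shared || ioremap_force_shared)
                        return pgprot_cc_guest(prot);   /* Shared bit */

                return prot;                    /* stays guest-private */
        }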


> Fix those drivers instead of sprinkling
> ioremap_shared in select places and with unclear rules about when a
> driver is allowed to do "shared" mappings.

Only add it when the driver has been audited and hardened.

But I agree we need a documented process for this. I will work on
some documentation for a proposal. Essentially I think it should be
some variant of what Elena has outlined in her talk at the Linux
Security Summit:

https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf

That is: extra auditing/scrutiny at review time, supported by static
code analysis that points to the interaction points, plus explicit
fuzzing of the code.

However, short term it's only three virtio drivers, so this is not an
urgent problem.

> Let the new
> device-authorization mechanism (with policy in userspace)


Default policy in user space just seems to be a bad idea here. Who
should know if a driver is hardened other than the kernel? Maintaining
the list somewhere else just doesn't make sense to me.

Also there is the more practical problem that some devices are needed
for booting. For example in TDX we can't print something to the console
with this mechanism, so you would never get any output before the
initrd. Just seems like a nightmare for debugging anything. There really
needs to be an authorization mechanism that works reasonably early.

I can see a point of having user space overrides though, but we need to
have a sane kernel default that works early.

-Andi

2021-10-11 00:53:22

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


> To which Andi replied
> One problem with removing the ioremap opt-in is that
> it's still possible for drivers to get at devices without going through probe.
>
> To which Greg replied:
> https://lore.kernel.org/all/[email protected]/
> If there are in-kernel PCI drivers that do not do this, they need to be
> fixed today.
>
> Can you guys resolve the differences here?


I addressed this in my other mail, but we may need more discussion.


>
> And once they are resolved, mention this in the commit log so
> I don't get to re-read the series just to find out nothing
> changed in this respect?
>
> I frankly do not believe we are anywhere near being able to harden
> an arbitrary kernel config against attack.

Why not? Device filter and the opt-ins together are a fairly strong
mechanism.

And it's not that they're a lot of code or super complicated either.

You're essentially objecting to a single line change in your subsystem here.


> How about creating a defconfig that makes sense for TDX then?

TDX can be used in many different ways, I don't think a defconfig is
practical.

In theory you could do some Kconfig dependency (at the pain point of
having separate kernel binaries), but why not just do it at run time
if you maintain the list anyway? That's much easier and saner for
everyone. In the past we usually ended up with runtime mechanisms for
similar things anyway.

Also it turns out that the filter mechanisms are needed for some arch
drivers which are not even configurable, so alone it's probably not
enough.


> Anyone deviating from that better know what they are doing,
> this API tweaking is just putting policy into the kernel ...

Hardening drivers is kernel policy. It cannot be done anywhere else.


-Andi

2021-10-11 04:07:42

by Andi Kleen

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared


> The connection is quite unfortunate IMHO.
> Can't there be an option
> that unbreaks drivers *without* opening up security holes by
> making BIOS shared?

That would require new low level APIs that distinguish both cases, and a
tree sweep.


-Andi

2021-10-11 12:27:28

by Christoph Hellwig

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

Just as last time: This does not make any sense. ioremap is shared
by definition.

2021-10-11 16:23:20

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote:
>
> > To which Andi replied
> > One problem with removing the ioremap opt-in is that
> > it's still possible for drivers to get at devices without going through probe.
> >
> > To which Greg replied:
> > https://lore.kernel.org/all/[email protected]/
> > If there are in-kernel PCI drivers that do not do this, they need to be
> > fixed today.
> >
> > Can you guys resolve the differences here?
>
>
> I addressed this in my other mail, but we may need more discussion.

Hopefully Greg will reply to that one.

>
> >
> > And once they are resolved, mention this in the commit log so
> > I don't get to re-read the series just to find out nothing
> > changed in this respect?
> >
> > I frankly do not believe we are anywhere near being able to harden
> > an arbitrary kernel config against attack.
>
> Why not? Device filter and the opt-ins together are a fairly strong
> mechanism.

Because it does not end with I/O operations; that's a trivial example.
Module unloading is famous for being racy: I just re-read that part of
the virtio drivers and sure enough we have bugs there, and this is
after they have presumably been audited, so a TDX guest is better off
just disabling hot-unplug completely, and hotplug isn't far behind.
Malicious filesystems can exploit many Linux systems unless
you take pains to limit what is mounted and how.
Networking devices tend to get into the default namespaces and can
do more or less whatever CAP_NET_ADMIN can.
Etc.
I am not saying this makes the effort worthless; I am saying userspace
had better know very well what it's doing, and the kernel had better be
configured in a very specific way.

> And it's not that they're a lot of code or super complicated either.
>
> You're essentially objecting to a single line change in your subsystem here.

Well I commented on the API patch, not the virtio patch.
If it's a way for a driver to say "I am hardened
and audited" then I guess it should at least say so. It has nothing
to do with host or sharing, that's an implementation detail,
and it obscures the actual limitations of the approach,
in that eventually in an ideal world all drivers would be secure
and use this API.

Yes, if that's the API that PCI gains then virtio will use it.


> > How about creating a defconfig that makes sense for TDX then?
>
> TDX can be used in many different ways, I don't think a defconfig is
> practical.
>
> In theory you could do some Kconfig dependency (at the pain point of having
> separate kernel binaries), but why not just do it at run time then if you
> maintain the list anyways. That's much easier and saner for everyone. In the
> past we usually always ended up with runtime mechanism for similar things
> anyways.
>
> Also it turns out that the filter mechanisms are needed for some arch
> drivers which are not even configurable, so alone it's probably not enough,


I guess they aren't really needed though right, or you won't try to
filter them? So make them configurable?

>
> > Anyone deviating from that better know what they are doing,
> > this API tweaking is just putting policy into the kernel ...
>
> Hardening drivers is kernel policy. It cannot be done anywhere else.
>
>
> -Andi

To clarify, the policy is which drivers to load into the kernel.

--
MST

2021-10-11 16:24:17

by Michael S. Tsirkin

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

On Sun, Oct 10, 2021 at 07:39:55PM -0700, Andi Kleen wrote:
>
> > The connection is quite unfortunate IMHO.
> > Can't there be an option
> > that unbreaks drivers *without* opening up security holes by
> > making BIOS shared?
>
> That would require new low level APIs that distinguish both cases, and a
> tree sweep.
>
>
> -Andi

Presumably BIOS code is in arch/x86 and drivers/acpi, right?
Up to 200 calls, the majority of which are likely private ...

I don't have better ideas but the current setup will just
result in people making their guests vulnerable whenever they
want to allow device pass-through.

--
MST

2021-10-11 17:25:20

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


On 10/11/2021 12:58 AM, Christoph Hellwig wrote:
> Just as last time: This does not make any sense. ioremap is shared
> by definition.

It's not necessarily shared with the host for confidential computing:
for example BIOS mappings definitely should not be shared, but they're
using ioremap today.

But if you have a better term please propose something. I tried to
clarify it with "shared_host", but I don't know a better term.


-Andi


2021-10-11 17:34:47

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


> Because it does not end with I/O operations, that's a trivial example.
> module unloading is famous for being racy: I just re-read that part of
> virtio drivers and sure enough we have bugs there, this is after
> they have presumably been audited, so a TDX guest is better off
> just disabling hot-unplug completely, and hotplug isn't far behind.

These all shouldn't matter for a confidential guest. The only way it
can be attacked is through IO; everything else is protected by
hardware.


Also it would all require doing something at the guest level, which we
assume is not malicious.


> Malicious filesystems can exploit many linux systems unless
> you take pains to limit what is mounted and how.

That's expected to be handled by authenticated dm-crypt and similar.
Hardening at this level has been done for many years.


> Networking devices tend to get into the default namespaces and can
> do more or less whatever CAP_NET_ADMIN can.
> Etc.


Networking should already be hardened; otherwise you would have much
worse problems today.



> change in your subsystem here.
> Well I commented on the API patch, not the virtio patch.
> If it's a way for a driver to say "I am hardened
> and audited" then I guess it should at least say so.


This is handled by the central allow list. We intentionally didn't want
each driver to declare itself, but have a central list where changes
will get more scrutiny than random driver code.
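
As a hedged sketch of what such a central list could look like (the
three names are a guess at the virtio drivers mentioned elsewhere in
this thread, and the helper is illustrative, not the posted filter
code):

        static const char * const cc_allowed_drivers[] = {
                "virtio_net", "virtio_blk", "virtio_console",
        };

        /* Consulted once at device authorization time, instead of
         * per-driver self-declaration flags.
         */
        static bool cc_driver_allowed(const struct device_driver *drv)
        {
                int i;

                for (i = 0; i < ARRAY_SIZE(cc_allowed_drivers); i++)
                        if (!strcmp(drv->name, cc_allowed_drivers[i]))
                                return true;
                return false;
        }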

But then there are the additional opt-ins for the low level firewall.
These are in the API. I don't see how it could be done at the driver
level, unless you want to pass in a struct device everywhere?

>>> How about creating a defconfig that makes sense for TDX then?
>> TDX can be used in many different ways, I don't think a defconfig is
>> practical.
>>
>> In theory you could do some Kconfig dependency (at the pain point of having
>> separate kernel binariees), but why not just do it at run time then if you
>> maintain the list anyways. That's much easier and saner for everyone. In the
>> past we usually always ended up with runtime mechanism for similar things
>> anyways.
>>
>> Also it turns out that the filter mechanisms are needed for some arch
>> drivers which are not even configurable, so alone it's probably not enough,
>
> I guess they aren't really needed though right, or you won't try to
> filter them?

We're addressing most of them with the device filter for platform
drivers. But since we cannot stop them doing ioremap IO in their init
code they also need the low level firewall.

Some others that cannot be addressed have explicit disables.


> So make them configurable?

Why not just fix the runtime? It's much saner for everyone. Proposing to
do things at build time sounds like we're in Linux 0.99 days.

-Andi

2021-10-11 17:39:14

by Andi Kleen

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared


> Presumably bios code is in arch/x86 and drivers/acpi, right?
> Up to 200 calls the majority of which is likely private ...

Yes.

> I don't have better ideas but the current setup will just
> result in people making their guests vulnerable whenever they
> want to allow device pass-through.


Yes, that's true. For current TDX our target is virtual devices only.
But if pass-through usage becomes really widespread we may need to
revisit.


-Andi

2021-10-11 18:25:28

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Mon, Oct 11, 2021 at 10:32:23AM -0700, Andi Kleen wrote:
>
> > Because it does not end with I/O operations, that's a trivial example.
> > module unloading is famous for being racy: I just re-read that part of
> > virtio drivers and sure enough we have bugs there, this is after
> > they have presumably been audited, so a TDX guest is better off
> > just disabling hot-unplug completely, and hotplug isn't far behind.
>
> These all shouldn't matter for a confidential guest. The only way it can be
> attacked is through IO, everything else is protected by hardware.
>
>
> Also it would all require doing something at the guest level, which we
> assume is not malicious.
>
>
> > Malicious filesystems can exploit many linux systems unless
> > you take pains to limit what is mounted and how.
>
> That's expected to be handled by authenticated dm-crypt and similar.
> Hardening at this level has been done for many years.

It's possible to do it like this, sure. But that's not the
only configuration, userspace needs to be smart about setting things up.
Which is my point really.

>
> > Networking devices tend to get into the default namespaces and can
> > do more or less whatever CAP_NET_ADMIN can.
> > Etc.
>
>
> Networking should be already hardened, otherwise you would have much worse
> problems today.

Same thing. NFS is pretty common; you are saying don't do it then. Fair
enough, but again, arbitrary configs just aren't going to be secure.

>
>
> > change in your subsystem here.
> > Well I commented on the API patch, not the virtio patch.
> > If it's a way for a driver to say "I am hardened
> > and audited" then I guess it should at least say so.
>
>
> This is handled by the central allow list. We intentionally didn't want each
> driver to declare itself, but have a central list where changes will get
> more scrutiny than random driver code.

Makes sense. Additionally, distros can tweak that to their heart's
content, selecting the functionality/security balance that makes sense
for them.

> But then there are the additional opt-ins for the low level firewall. These
> are in the API. I don't see how it could be done at the driver level, unless
> you want to pass in a struct device everywhere?

I am just saying don't do it then. Don't build drivers that the distro
does not want to support into the kernel. And don't load them when they
are modules.

> > > > How about creating a defconfig that makes sense for TDX then?
> > > TDX can be used in many different ways, I don't think a defconfig is
> > > practical.
> > >
> > > In theory you could do some Kconfig dependency (at the pain point of having
> > > separate kernel binaries), but why not just do it at run time then if you
> > > maintain the list anyways. That's much easier and saner for everyone. In the
> > > past we usually always ended up with runtime mechanism for similar things
> > > anyways.
> > >
> > > Also it turns out that the filter mechanisms are needed for some arch
> > > drivers which are not even configurable, so alone it's probably not enough,
> >
> > I guess they aren't really needed though right, or you won't try to
> > filter them?
>
> We're addressing most of them with the device filter for platform drivers.
> But since we cannot stop them doing ioremap IO in their init code they also
> need the low level firewall.
>
> Some others that cannot be addressed have explicit disables.
>
>
> > So make them configurable?
>
> Why not just fix the runtime? It's much saner for everyone. Proposing to do
> things at build time sounds like we're in Linux 0.99 days.
>
> -Andi

Um. Tweaking driver code is not just build time, it's development time.
At least with Kconfig you don't need to patch your kernel.

--
MST

2021-10-11 18:31:30

by Michael S. Tsirkin

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

On Mon, Oct 11, 2021 at 10:35:18AM -0700, Andi Kleen wrote:
>
> > Presumably bios code is in arch/x86 and drivers/acpi, right?
> > Up to 200 calls the majority of which is likely private ...
>
> Yes.
>
> > I don't have better ideas but the current setup will just
> > result in people making their guests vulnerable whenever they
> > want to allow device pass-through.
>
>
> Yes, that's true. For current TDX our target is virtual devices only. But if
> pass-through usage becomes really widespread we may need to revisit.
>
>
> -Andi

I mean ... it's already widespread. If we support it with TDX
it will be used with TDX. If we don't then I guess it won't;
exposing this kind of limitation in a userspace-visible way isn't
great though. I guess it boils down to the fact that
ioremap_host_shared is just not a great interface: users simply
have no idea whether a given driver uses ioremap.

--
MST

2021-10-11 19:13:37

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Mon, Oct 11, 2021 at 10:23:00AM -0700, Andi Kleen wrote:
>
> On 10/11/2021 12:58 AM, Christoph Hellwig wrote:
> > Just as last time: This does not make any sense. ioremap is shared
> > by definition.
>
> It's not necessarily shared with the host for confidential computing: for
> example BIOS mappings definitely should not be shared, but they're using
> ioremap today.

That just needs to be fixed.

> But if you have a better term please propose something. I tried to clarify
> it with "shared_host", but I don't know a better term.
>
>
> -Andi
>


The reason we have trouble is that it's not clear what the API means
outside the realm of TDX.
If we really, truly want an API that says "ioremap, and it's a hardened
driver" then I guess ioremap_hardened_driver is what you want.

--
MST

2021-10-12 05:39:59

by Christoph Hellwig

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote:
> The reason we have trouble is that it's not clear what the API means
> outside the realm of TDX.
> If we really, truly want an API that says "ioremap and it's a hardened
> driver" then I guess ioremap_hardened_driver is what you want.

Yes. And why would we ioremap the BIOS anyway? It is not I/O memory
in any of the senses we generally use ioremap for.

2021-10-12 17:45:23

by Dan Williams

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Sun, Oct 10, 2021 at 3:11 PM Andi Kleen <[email protected]> wrote:
>
>
> On 10/9/2021 1:39 PM, Dan Williams wrote:
> > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <[email protected]> wrote:
> >> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> >>> From: Andi Kleen <[email protected]>
> >>>
> >>> For Confidential VM guests like TDX, the host is untrusted and hence
> >>> the devices emulated by the host or any data coming from the host
> >>> cannot be trusted. So the drivers that interact with the outside world
> >>> have to be hardened by sharing memory with host on need basis
> >>> with proper hardening fixes.
> >>>
> >>> For the PCI driver case, to share the memory with the host add
> >>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> >>>
> >>> Signed-off-by: Andi Kleen <[email protected]>
> >>> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> >> So I proposed to make all pci mappings shared, eliminating the need
> >> to patch drivers.
> >>
> >> To which Andi replied
> >> One problem with removing the ioremap opt-in is that
> >> it's still possible for drivers to get at devices without going through probe.
> >>
> >> To which Greg replied:
> >> https://lore.kernel.org/all/[email protected]/
> >> If there are in-kernel PCI drivers that do not do this, they need to be
> >> fixed today.
> >>
> >> Can you guys resolve the differences here?
> > I agree with you and Greg here. If a driver is accessing hardware
> > resources outside of the bind lifetime of one of the devices it
> > supports, and in a way that neither modprobe-policy nor
> > device-authorization -policy infrastructure can block, that sounds
> > like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for
> the number)
>
> Some are probably old drivers that could be fixed, but it's quite a few
> legitimate cases. For example for platform or ISA drivers that's the
> only way they can be implemented because they often have no other
> enumeration mechanism. For PCI drivers it's rarer, but also still can
> happen. One example that comes to mind here is the x86 Intel uncore
> drivers, which support a mix of MSR, ioremap and PCI config space
> accesses all from the same driver. This particular example can (and
> should be) fixed in other ways, but similar things also happen in other
> drivers, and they're not all broken. Even for the broken ones they're
> usually for some crufty old devices that has very few users, so it's
> likely untestable in practice.
>
> My point is just that the ecosystem of devices that Linux supports is
> messy enough that there are legitimate exceptions from the "First IO
> only in probe call only" rule.
>
> And we can't just fix them all. Even if we could it would be hard to
> maintain.
>
> Using a "firewall model" hooking into a few strategic points like we're
> proposing here is much saner for everyone.
>
> Now we can argue about the details. Right now what we're proposing has
> some redundancies: it has both a device model filter and low level
> filter for ioremap (this patch and some others). The low level filter is
> for catching issues that don't clearly fit into the
> "enumeration<->probe" model. You could call that redundant, but I would
> call it defense in depth or better safe than sorry. In theory it would
> be enough to have the low level opt-in only, but that would have the
> drawback that if something gets enumerated after all you would have all
> kinds of weird device driver failures and in some cases even killed
> guests. So I think it makes sense to have both.

The "better safe-than-sorry" argument is hard to build consensus
around. The spectre mitigations ran into similar problems where the
community rightly wanted to see the details and instrument the
problematic paths rather than blanket sprinkle lfence "just to be
safe". In this case the rules about when a driver is suitably
"hardened" are vague and the overlapping policy engines are confusing.

I'd rather see more concerted efforts on focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened. I.e. instead of jumping to the assertion that these
init-path vulnerabilities are too big to fix, dig to the next level to
provide more evidence that per-driver opt-in is the only viable option.

For example, how many of these problematic paths are built in to the
average kernel config? A strawman might be to add a sprinkling of
error exits in the module_init() of the problematic drivers that only
fail if the module is built-in, and let modprobe policy handle the
rest.
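
A hedged sketch of that strawman (the driver name is hypothetical;
cc_platform_has() is from this series):

        /* Hypothetical legacy driver: refuse the init-time IO path
         * when built in on a confidential guest, where modprobe
         * policy cannot intervene.
         */
        static int __init crufty_isa_init(void)
        {
        #ifndef MODULE
                if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
                        return -ENODEV;
        #endif
                /* ... legacy probe IO and driver registration ... */
                return 0;
        }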

>
>
> > Fix those drivers instead of sprinkling
> > ioremap_shared in select places and with unclear rules about when a
> > driver is allowed to do "shared" mappings.
>
> Only add it when the driver has been audited and hardened.
>
> But I agree we need on a documented process for this. I will work on
> some documentation for a proposal. But essentially I think it should be
> some variant of what Elena has outlined in her talk at Security Summit.
>
> https://static.sched.com/hosted_files/lssna2021/b6/LSS-HardeningLinuxGuestForCCC.pdf
>
> That is using extra auditing/scrutiny at review time, supported with
> some static code analysis that points to the interaction points, and
> code needs to be fuzzed explicitly.
>
> However, short term it's only three virtio drivers, so this is not an
> urgent problem.
>
> > Let the new
> > device-authorization mechanism (with policy in userspace)
>
>
> Default policy in user space just seems to be a bad idea here. Who
> should know if a driver is hardened other than the kernel? Maintaining
> the list somewhere else just doesn't make sense to me.

I do not understand how the maintenance burden correlates with where
the policy is driven vs. where the list is maintained. Even if I agreed
with the contention that out-of-tree userspace would have a hard time
tracking the "hardened" driver list, there is still an in-tree
userspace path to explore. E.g. perf maintains lists of things tightly
coupled to the kernel; this authorized device list seems to be in the
same category of data.

> Also there is the more practical problem that some devices are needed
> for booting. For example in TDX we can't print something to the console
> with this mechanism, so you would never get any output before the
> initrd. Just seems like a nightmare for debugging anything. There really
> needs to be an authorization mechanism that works reasonably early.
>
> I can see a point of having user space overrides though, but we need to
> have a sane kernel default that works early.

Right, as I suggested [1], just enough early authorization to
bootstrap/debug the initramfs, and then that can authorize the
remainder.

[1]: https://lore.kernel.org/all/CAPcyv4im4Tsj1SnxSWe=cAHBP1mQ=zgO-D81n2BpD+_HkpitbQ@mail.gmail.com/

2021-10-12 17:58:25

by Andi Kleen

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared


> I mean ... it's already widespread.


I meant widespread usage with confidential guests.

> If we support it with TDX
> it will be used with TDX.

It has some security trade-offs. The main reason to use TDX is
security. Also, when people accept the VT-d trade-offs they might be
OK with the BIOS trade-offs too.

-Andi

2021-10-12 18:49:25

by Elena Reshetova

Subject: RE: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for
> the number)

To provide more numbers on this: from what I can see so far from a
smatch-based analysis, we have 409 __init style functions (.probe &
builtin/module_platform_driver_probe excluded) for 5.15 with
allyesconfig. The number of distinct individual IO reads (MSRs
included) is much higher than 2.4k, in the range of 30k, because quite
often there is more than a single IO read in the same source function.
The full list of accesses and possible call paths is too huge to look
at manually, but here is the list of the 409 functions if anyone wants
to take a look:

['doc200x_ident_chip',
'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
'late_iommu_features_init', 'detect_ivrs',
'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
'intel_cleanup_irq_remapping', 'detect_intel_iommu',
'parse_ioapics_under_ir', 'si_domain_init', 'ubi_init',
'fb_console_init', 'xenbus_probe_backend_init',
'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
'balloon_init', 'intel_iommu_init', 'intel_rng_mod_init',
'check_tylersburg_isoch', 'dmar_table_init',
'enable_drhd_fault_handling', 'init_acpi_pm_clocksource',
'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
'sun5i_setup_clockevent', 'ubi_gluebi_init', 'ubiblock_init',
'hv_init_tsc_clocksource', 'hv_init_clocksource', 'mt7621_clk_init',
'samsung_clk_register_mux', 'samsung_clk_register_gate',
'samsung_clk_register_fixed_rate', 'clk_boston_setup',
'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
'meson6_timer_init', 'atcpit100_timer_init',
'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', 'skx_init',
'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
'pnd2_init', 'edac_init', 'adummy_init', 'mtd_stresstest_init',
'bxt_idle_state_table_update', 'sklh_idle_state_table_update',
'skx_idle_state_table_update',
'acpi_gpio_handle_deferred_request_irqs', 'smc_findirq', 'ltpc_probe',
'com90io_probe', 'com90xx_probe', 'pcnet32_init_module',
'it87_gpio_init', 'f7188x_find', 'it8712f_wdt_find', 'f71808e_find',
'it87_wdt_init', 'f71882fg_find', 'it87_find', 'f71805f_find',
'parport_pc_init', 'asic3_irq_probe', 'sch311x_detect',
'amd_gpio_init', 'dvb_init', 'dvb_register', 'em28xx_alsa_register',
'em28xx_dvb_register', 'em28xx_rc_register', 'em28xx_video_register',
'blackbird_init', 'bttv_check_chipset', 'ivtvfb_callback_init',
'init_control', 'con_init', 'cr_pll_init',
'clk_disable_unused_subtree', 'fmi_init', 'cadet_init', 'pcm20_init',
'airo_init_module', 'bnx2i_mod_init', 'bnx2fc_mod_init',
'timer_of_irq_exit', 'init', 'kempld_init', 'ivtvfb_init',
'brcmf_core_init', 'comedi_test_init', 'tlan_eisa_probe',
'timer_probe', 'of_clk_init', '__reserved_mem_init_node',
'of_irq_init', 'mace_init', 'vortex_eisa_init', 'reset_chip',
'atp_init', 'atp_probe1', 'smc_probe', 'osi_setup', 'led_init',
'el3_init_module', 'clk_sp810_of_setup', 'ltpc_probe_dma',
'com90io_found', 'check_mirror', 'arcrimi_found', 'com90xx_found',
'intel_soc_thermal_init', 'thermal_register_governors',
'thermal_unregister_governors', 'therm_lvt_init', 'tcc_cooling_init',
'powerclamp_probe', 'intel_init', 'qcom_geni_serial_earlycon_setup',
'kgdboc_early_init', 'lpuart_console_setup', 'speakup_init',
'early_console_setup', 'init_port', 'early_serial8250_setup',
'linflex_console_setup', 'pl010_console_setup', 'register_earlycon',
'of_setup_earlycon', 'slgt_init', 'moxa_init',
'parport_pc_init_superio', 'parport_pc_find_ports', 'mousedev_init',
'ses_init', 'riocm_init', 'efi_rci2_sysfs_init', 'blogic_probe',
'blogic_init', 'blogic_init_mm_probeinfo',
'blogic_init_probeinfo_list', 'blogic_checkadapter',
'blogic_rdconfig', 'blogic_inquiry', 'adpt_init',
'clk_unprepare_unused_subtree', 'aspeed_socinfo_init',
'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
'sh73a0_cpg_clocks_init', 'cpg_div6_register',
'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
'cpg_sd_clk_register', 'r7s9210_update_clk_table',
'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
'of_fixed_mmio_clk_setup', 'xdbc_map_pci_mmio', 'xdbc_find_dbgp',
'xdbc_bios_handoff', 'xdbc_early_setup', 'ehci_setup',
'early_xdbc_parse_parameter', 'find_cap', '__find_dbgp',
'nvidia_set_debug_port', 'detect_set_debug_port',
'early_ehci_bios_handoff', 'early_dbgp_init', 'dbgp_init',
'ulpi_init', 'hidg_init', 'xdbc_init', 'brcmstb_usb_pinmap_probe',
'dell_init', 'eisa_init_device', 'mlxcpld_led_probe', 'nas_gpio_init',
'asic3_mfd_probe', 'asic3_probe', 'watchdog_init', 'ssb_modinit',
'pt_init', 'thinkpad_acpi_module_init', 'kbd_init', 'joydev_init',
'evdev_init', 'evbug_init', 'input_leds_init', 'mk712_init',
'l4_add_card', 'ns558_init', 'apanel_init', 'ct82c710_detect',
'i8042_check_aux', 'i8042_check_mux', 'i8042_probe', 'i8042_init',
'i8042_aux_test_irq', 'ocrdma_init_module', 'input_apanel_init',
'cs5535_mfgpt_init', 'geodewdt_probe', 'duramar2150_c2port_init',
'init_ohci1394_dma_on_all_controllers', 'init_ohci1394_controller',
'rionet_init', 'nonstatic_sysfs_init', 'init_pcmcia_bus',
'devlink_class_init', 'switchtec_ntb_init', 'mport_init',
'drivetemp_init', 'omap_vout_probe', 'probe_opti_vlb',
'probe_chip_type', 'legacy_check_special_cases',
'qdi65_identify_port', 'probe_qdi_vlb', 'comedi_init', 'hv_acpi_init',
'pcistub_init_devices_late', 'bcma_host_soc_register',
'bcma_bus_early_register', 'vga_arb_device_init',
'vga_arb_select_default_device', 'zf_init',
'watchdog_deferred_registration', 'wb_smsc_wdt_init',
'w83977f_wdt_init', 'ali_find_watchdog', 'pc87413_init',
'alim7101_wdt_init', 'at91_wdt_init', 'sc1200wdt_probe',
'asr_get_base_address', 'dmi_walk_early', 'dmi_sysfs_init',
'dell_smbios_init', 'acer_wmi_init', 'get_thinkpad_model_data',
'dmi_scan_machine', 'pci_assign_unassigned_resources',
'cpcihp_generic_init', 'pnpacpi_init', 'acpi_early_processor_osc',
'acpi_processor_check_duplicates', 'acpi_early_processor_set_pdc',
'acpi_ec_dsdt_probe', 'cros_ec_lpc_init', 'tpacpi_acpi_handle_locate',
'ks_pcie_init_id', 'ks_pcie_host_init', 'pci_apply_final_quirks',
'intel_uncore_init', 'qedr_init_module', 'isapnp_peek',
'isapnp_isolate', 'init_ipmi_si', 'isapnp_build_device_list',
'pnpacpi_add_device', 'erst_init', 'intel_idle_acpi_cst_extract',
'xen_acpi_processor_init', 'acpi_scan_init', 's3_wmi_probe',
'intel_opregion_present', 'extlog_init', 'intel_pstate_init',
'via_rng_mod_init', 'amd_rng_mod_init', 'ccp_init', 'init_nsc',
'init_atmel', 'intel_rng_hw_init', 'intel_init_hw_struct',
'tlclk_init', 'mwave_init', 'applicom_init', 'hdaps_init',
'tink_board_init', 'ibm_rtl_init', 'samsung_sabi_init',
'samsung_init', 'samsung_backlight_init', 'samsung_rfkill_init_swsmi',
'samsung_lid_handling_init', 'samsung_leds_init', 'samsung_sabi_diag',
'samsung_sabi_infos', 'isst_if_mbox_init', 'pmc_atom_init',
'abituguru_detect', 'hwmon_pci_quirks', 'applesmc_init',
'abituguru3_detect', 'w83627ehf_probe', 'dme1737_isa_detect',
'smsc47m1_probe', 'pcc_cpufreq_init', 'cpufreq_p4_init',
'centrino_init', 'acpi_cpufreq_init', 'pcc_cpufreq_probe',
'intel_pstate_msrs_not_valid',
'intel_pstate_platform_pwr_mgmt_exists', 'acpi_cpufreq_boost_init',
'amd_freq_sensitivity_init', 'gic_fixup_resource', 'do_floppy_init',
'get_fdc_version', 'pf_init', 'pg_init', 'pd_init', 'pcd_init',
'rio_basic_attach']

2021-10-12 18:49:48

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


> The "better safe-than-sorry" argument is hard to build consensus
> around. The spectre mitigations ran into similar problems where the
> community rightly wanted to see the details and instrument the
> problematic paths rather than blanket sprinkle lfence "just to be
> safe".

But that was due to performance problems in hot paths. None of that
applies here.

> In this case the rules about when a driver is suitably
> "hardened" are vague and the overlapping policy engines are confusing.

What is confusing exactly?

To me it seems both very straightforward and simple (but then I'm biased)

The policy is:

- Have an allow list at driver registration.

- Have an additional opt-in for MMIO mappings (and potentially config
space, but that's not currently there) to cover init calls completely.
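
To make this concrete, here is a minimal sketch of the two layers (the
table contents and the exact helper calls are illustrative, not the
actual patch code):

/* Layer 1: device allow-list consulted when drivers are registered. */
static const struct pci_device_id authorized_ids[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) }, /* virtio-net */
	{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042) }, /* virtio-blk */
	{ PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1043) }, /* virtio-console */
	{ }
};

static bool device_is_authorized(struct pci_dev *pdev)
{
	return pci_match_id(authorized_ids, pdev) != NULL;
}

/*
 * Layer 2: even an authorized, audited driver only shares MMIO with
 * the host through an explicit opt-in, e.g.:
 *
 *	regs = pci_iomap_host_shared(pdev, 0, 0);
 */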

>
> I'd rather see more concerted efforts on focused/limited core changes
> rather than leaf driver changes until there is a clearer definition of
> hardened.

A hardened driver is a driver that:

- Had a similar security (not API) oriented review of its IO operations
(mainly MMIO access, but also PCI config space) as a non-privileged user
interface (like an ioctl). That review should be focused on memory safety.

- Had some fuzzing on these IO interfaces using soon-to-be-released tools.

Right now it's only three virtio drivers (console, net, block).

Really, it's no different from what we do for every new unprivileged user
interface.


> I.e. instead of jumping to the assertion that fixing up
> these init-path vulnerabilities is too big to fix, dig to the next
> level to provide more evidence that per-driver opt-in is the only
> viable option.
>
> For example, how many of these problematic paths are built-in to the
> average kernel config?

I don't think arguments from "the average kernel config" (if such a
thing even exists) are useful. That would be just hand waving.


> A strawman might be to add a sprinkling of error
> exits in the module_init() of the problematic drivers, and only fail
> if the module is built-in, and let modprobe policy handle the rest.
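
For illustration only, one such init-path error exit might look like this
sketch (CONFIG_FOO, foo_driver and the CC_ATTR_GUEST_DEVICE_FILTER
attribute name are made up; cc_platform_has() stands in for the
protected-guest query helper):

static int __init foo_init(void)
{
	/* Fail early when built in on a confidential guest; leave the
	 * modular case to modprobe policy in user space. */
	if (IS_BUILTIN(CONFIG_FOO) &&
	    cc_platform_has(CC_ATTR_GUEST_DEVICE_FILTER))
		return -EPERM;

	return pci_register_driver(&foo_driver);
}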


That would already be hundreds of changes. I have no idea how such a
thing could be maintained or sustained either.

Really, I don't even see how these alternatives can be considered. Tree
sweeps should always be a last resort. They're a pain for everyone. But
here they're casually thrown around as alternatives to straightforward
one- or two-line changes.




>
>> Default policy in user space just seems to be a bad idea here. Who
>> should know if a driver is hardened other than the kernel? Maintaining
>> the list somewhere else just doesn't make sense to me.
> I do not understand the maintenance burden correlation of where the
> policy is driven vs where the list is maintained?

All the hardening and auditing happens in the kernel tree. So it seems
the natural place to store the result is in the kernel tree.

But there's no single package for initrd, so you would need custom
configurations for all the supported distros.

Also we're really arguing about a list that currently has three entries.


> Even if I agreed
> with the contention that out-of-tree userspace would have a hard time
> tracking the "hardened" driver list there is still an in-tree
> userspace path to explore. E.g. perf maintains lists of things tightly
> coupled to the kernel, this authorized device list seems to be in the
> same category of data.

You mean the event list? perf is in the kernel tree, so it's maintained
together with the kernel.

But we don't have a kernel initrd.



>
>> Also there is the more practical problem that some devices are needed
>> for booting. For example in TDX we can't print something to the console
>> with this mechanism, so you would never get any output before the
>> initrd. Just seems like a nightmare for debugging anything. There really
>> needs to be an authorization mechanism that works reasonably early.
>>
>> I can see a point of having user space overrides though, but we need to
>> have a sane kernel default that works early.
> Right, as I suggested [1], just enough early authorization to
> bootstrap/debug initramfs and then that can authorize the remainder.

But how do you debug the kernel then? Making early boot undebuggable
just seems like bad policy to me.

And if you fix it for the console, why not add the two more entries for
virtio net and block too?


-Andi

2021-10-12 18:49:55

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


On 10/11/2021 10:31 PM, Christoph Hellwig wrote:
> On Mon, Oct 11, 2021 at 03:09:09PM -0400, Michael S. Tsirkin wrote:
>> The reason we have trouble is that it's not clear what the API means
>> outside the realm of TDX.
>> If we really, truly want an API that says "ioremap and it's a hardened
>> driver" then I guess ioremap_hardened_driver is what you want.
> Yes. And why would we ioremap the BIOS anyway? It is not I/O memory
> in any of the senses we generally use ioremap for.

I/O memory is anything outside the kernel memory map.

-Andi


2021-10-12 18:50:45

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


On 10/12/2021 11:36 AM, Reshetova, Elena wrote:
>> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
>> others) in init functions that also register drivers (thanks Elena for
>> the number)
> To provide more numbers on this. What I can see so far from a smatch-based
> analysis, we have 409 __init style functions (.probe &
> builtin/module_platform_driver_probe excluded) for 5.15 with allyesconfig.
> The number of distinct individual IO reads (MSRs included) is much higher
> than 2.4k and in the range of 30k, because quite often there is more than a
> single IO read in the same source function. The full list of accesses and
> the possible call paths is too huge to look at manually, but here is the
> list of the 409 functions if anyone wants to take a look:


Thanks Elena.


I suspect the true number is even higher because that doesn't include IO
inside calls to other modules and indirect pointers, correct?


-Andi

2021-10-12 18:59:06

by Elena Reshetova

Subject: RE: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


> I suspect the true number is even higher because that doesn't include IO
> inside calls to other modules and indirect pointers, correct?

Actually everything should be included. Smatch has a cross-function DB and
I am using it to get the call chains; it also follows function pointers.
Also, since I am starting from a list of individual read IOs, every single
base read IO in drivers/* should be covered as far as I can see. But if a
driver uses some unusual IO wrappers then the actual number might be higher.

2021-10-12 19:15:34

by Dan Williams

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
<[email protected]> wrote:
>
>
> > I suspect the true number is even higher because that doesn't include IO
> > inside calls to other modules and indirect pointers, correct?
>
> Actually everything should be included. Smatch has a cross-function DB and
> I am using it to get the call chains; it also follows function pointers.
> Also, since I am starting from a list of individual read IOs, every single
> base read IO in drivers/* should be covered as far as I can see. But if a
> driver uses some unusual IO wrappers then the actual number might be higher.

Why analyze individual IO calls? I thought the goal here was to
disable entire classes of ioremap() users?

2021-10-12 19:52:27

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


On 10/12/2021 12:13 PM, Dan Williams wrote:
> On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
> <[email protected]> wrote:
>>
>>> I suspect the true number is even higher because that doesn't include IO
>>> inside calls to other modules and indirect pointers, correct?
>> Actually everything should be included. Smatch has a cross-function DB and
>> I am using it to get the call chains; it also follows function pointers.
>> Also, since I am starting from a list of individual read IOs, every single
>> base read IO in drivers/* should be covered as far as I can see. But if a
>> driver uses some unusual IO wrappers then the actual number might be higher.
> Why analyze individual IO calls? I thought the goal here was to
> disable entire classes of ioremap() users?

This is everything that would need to be moved somewhere else if we
didn't disable entire classes of ioremap users.

-Andi

2021-10-12 21:01:29

by Michael S. Tsirkin

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

On Tue, Oct 12, 2021 at 10:55:20AM -0700, Andi Kleen wrote:
>
> > I mean ... it's already widespread.
>
>
> I meant widespread usage with confidential guests.
>
> > If we support it with TDX
> > it will be used with TDX.
>
> It has some security trade-offs. The main reason to use TDX is security.
> Also when people take the VT-d trade-offs they might be OK with the BIOS
> trade-offs too.
>
> -Andi

Interesting. VT-d trade-offs ... what are they?
Allowing the hypervisor to write into the BIOS looks like it will
trivially lead to code execution, won't it?

--
MST

2021-10-12 21:14:41

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > others) in init functions that also register drivers (thanks Elena for
> > the number)
>
> To provide more numbers on this. What I can see so far from a smatch-based
> analysis, we have 409 __init style functions (.probe & builtin/module_
> _platform_driver_probe excluded) for 5.15 with allyesconfig.

I don't think we care about allyesconfig at all though.
Just don't do that.
How about allmodconfig? This is closer to what distros actually do.

--
MST

2021-10-12 21:16:23

by Dan Williams

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Tue, Oct 12, 2021 at 11:35 AM Andi Kleen <[email protected]> wrote:
>
>
> > The "better safe-than-sorry" argument is hard to build consensus
> > around. The spectre mitigations ran into similar problems where the
> > community rightly wanted to see the details and instrument the
> > problematic paths rather than blanket sprinkle lfence "just to be
> > safe".
>
> But that was due to performance problems in hot paths. None of that
> applies here.

It applies because a new API that individual driver authors must adopt is
being proposed, and that's an ongoing maintenance burden that might be
mitigated by hiding that implementation detail from leaf drivers.

>
> > In this case the rules about when a driver is suitably
> > "hardened" are vague and the overlapping policy engines are confusing.
>
> What is confusing exactly?

Multiple places to go to enable functionality. The device-filter
firewall policy can collide with the ioremap access control policy.

> To me it seems both very straightforward and simple (but then I'm biased)

You seem to be having a difficult time iterating this proposal toward
consensus. I don't think the base principles are being contested as
much as the semantics, scope, and need for the proposed API that is in
the purview of all leaf driver developers.

> The policy is:
>
> - Have an allow list at driver registration.
>
> - Have an additional opt-in for MMIO mappings (and potentially config
> space, but that's not currently there) to cover init calls completely.

The proliferation of policy engines and driver special casing is the
issue. Especially in this case where the virtio use case being
opted-in is *already* in a path that has been authorized by the
device-filter policy engine. I.e. why special case the ioremap() in
virtio to be additionally authorized when the device has already been
authorized to probe? Put another way, the easiest driver API change to
merge would be no additional changes in leaf drivers.

>
> >
> > I'd rather see more concerted efforts on focused/limited core changes
> > rather than leaf driver changes until there is a clearer definition of
> > hardened.
>
> A hardened driver is a driver that:
>
> - Had a similar security (not API) oriented review of its IO operations
> (mainly MMIO access, but also PCI config space) as a non-privileged user
> interface (like an ioctl). That review should be focused on memory safety.
>
> - Had some fuzzing on these IO interfaces using soon-to-be-released tools.

What is the intersection of ioremap() users that are outside of the
proposed probe authorization regime AND want confidential computing
support?

> Right now it's only three virtio drivers (console, net, block)
>
> Really it's no different than what we do for every new unprivileged user
> interface.
>
>
> > I.e. instead of jumping to the assertion that fixing up
> > these init-path vulnerabilities is too big to fix, dig to the next
> > level to provide more evidence that per-driver opt-in is the only
> > viable option.
> >
> > For example, how many of these problematic paths are built-in to the
> > average kernel config?
>
> I don't think arguments from "the average kernel config" (if such a
> thing even exists) are useful. That would be just hand waving.

I'm trying to bridge to your contention that this enabling cannot
rely on custom kernel configs and must offer protection on the same
kernel image that might ship in the host, but let's set this aside and
focus on when and where leaf drivers need to adopt a new API.

> > A strawman might be to add a sprinkling of error
> > exits in the module_init() of the problematic drivers, and only fail
> > if the module is built-in, and let modprobe policy handle the rest.
>
>
> That would already be hundreds of changes. I have no idea how such a
> thing could be maintained or sustained either.
>
> Really, I don't even see how these alternatives can be considered. Tree
> sweeps should always be a last resort. They're a pain for everyone. But
> here they're casually thrown around as alternatives to straightforward
> one- or two-line changes.

If it looked straightforward, I'm not sure we would be having this
discussion. I think it's reasonable to ask whether this is a per-driver
opt-in responsibility that must be added in addition to probe
authorization.

> >> Default policy in user space just seems to be a bad idea here. Who
> >> should know if a driver is hardened other than the kernel? Maintaining
> >> the list somewhere else just doesn't make sense to me.
> > I do not understand the maintenance burden correlation of where the
> > policy is driven vs where the list is maintained?
>
> All the hardening and auditing happens in the kernel tree. So it seems
> the natural place to store the result is in the kernel tree.
>
> But there's no single package for initrd, so you would need custom
> configurations for all the supported distros.
>
> Also we're really arguing about a list that currently has three entries.
>
>
> > Even if I agreed
> > with the contention that out-of-tree userspace would have a hard time
> > tracking the "hardened" driver list there is still an in-tree
> > userspace path to explore. E.g. perf maintains lists of things tightly
> > coupled to the kernel, this authorized device list seems to be in the
> > same category of data.
>
> You mean the event list? perf is in the kernel tree, so it's maintained
> together with the kernel.
>
> But we don't have a kernel initrd.

I'm proposing that this list is either tiny and slow-moving enough for
initrd builders to track manually, or it's a data file that ships in
distro kernel packages that initrd builders can pull in.

> >> Also there is the more practical problem that some devices are needed
> >> for booting. For example in TDX we can't print something to the console
> >> with this mechanism, so you would never get any output before the
> >> initrd. Just seems like a nightmare for debugging anything. There really
> >> needs to be an authorization mechanism that works reasonably early.
> >>
> >> I can see a point of having user space overrides though, but we need to
> >> have a sane kernel default that works early.
> > Right, as I suggested [1], just enough early authorization to
> > bootstrap/debug initramfs and then that can authorize the remainder.
>
> But how do you debug the kernel then? Making early boot undebuggable
> just seems like bad policy to me.

I am not proposing making the early undebuggable.

>
> And if you fix it for the console, why not add the two more entries for
> virtio net and block too?

Again, because there seems to be a struggle to reach consensus on what
criteria constitute being added to this list. In order to move this
series forward I'm trying to find common ground.

2021-10-12 21:22:58

by Andi Kleen

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared


> Interesting. VT-d trade-offs ... what are they?

The connection to the device is not encrypted and also not authenticated.

This is different from even talking to the (untrusted) host through
shared memory, where you at least still have a common key.

> Allowing the hypervisor to write into the BIOS looks like it will
> trivially lead to code execution, won't it?

This is not about BIOS code executing. While the guest firmware runs, it
is protected, of course. This is about BIOS structures like ACPI tables
that are mapped by Linux. While AML can run byte code, it normally cannot
write to arbitrary memory.

The risk is more that all the Linux code dealing with this hasn't been
hardened to deal with malicious input.

-Andi

2021-10-12 21:23:35

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote:
> Especially in this case where the virtio use case being
> opted-in is *already* in a path that has been authorized by the
> device-filter policy engine.

That's a good point. Andi, how about setting a per-device flag
if its ID has been allowed and then making pci_iomap create
a shared mapping transparently?

--
MST

2021-10-12 21:26:20

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


On 10/12/2021 2:18 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 12, 2021 at 02:14:44PM -0700, Dan Williams wrote:
>> Especially in this case where the virtio use case being
>> opted-in is *already* in a path that has been authorized by the
>> device-filter policy engine.
> That's a good point. Andi, how about setting a per-device flag
> if its ID has been allowed and then making pci_iomap create
> a shared mapping transparently?

Yes for pci_iomap we could do that.

If someone uses raw ioremap without a device it won't work, but I don't
think that's the case for virtio at least.

I suppose we could solve that problem if it actually happens.
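
Roughly, a transparent variant could look like this sketch (the
'authorized' flag and the private fallback helper are illustrative, not
existing kernel code):

void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
			      unsigned long offset, unsigned long maxlen)
{
	/* 'authorized' would be set by the device filter at enumeration
	 * time for allow-listed IDs. */
	if (dev->authorized)
		return pci_iomap_host_shared_range(dev, bar, offset, maxlen);

	return pci_iomap_private_range(dev, bar, offset, maxlen);
}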

-Andi

2021-10-12 21:30:03

by Andi Kleen

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()


>> But that was due to performance problems in hot paths. None of that
>> applies here.
> It applies because a new API that individual driver authors is being
> proposed and that's an ongoing maintenance burden that might be
> mitigated by hiding that implementation detail from leaf drivers.

Right now we're only talking about two places to change, and neither of
those is actually in an individual driver; they are in the virtio generic
code and in the MSI code.

While there might be drivers in the future that do it directly, that will
always be the exception; normal drivers don't have to deal with this.
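
For example, the virtio side would essentially be a one-line substitution
in the generic modern-device mapping code (sketched here against 5.15's
vp_modern_map_capability(); the exact hunk in the posted patch may
differ):

-	p = pci_iomap_range(dev, bar, offset, length);
+	p = pci_iomap_host_shared_range(dev, bar, offset, length);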



>> To me it seems both very straightforward and simple (but then I'm biased)
> You seem to be having a difficult time iterating this proposal toward
> consensus. I don't think the base principles are being contested as
> much as the semantics, scope, and need for the proposed API that is in
> the purview of all leaf driver developers.
Right now no leaf drivers are changed at all.
>
>>> I'd rather see more concerted efforts on focused/limited core changes
>>> rather than leaf driver changes until there is a clearer definition of
>>> hardened.
>> A hardened driver is a driver that:
>>
>> - Had a similar security (not API) oriented review of its IO operations
>> (mainly MMIO access, but also PCI config space) as a non-privileged user
>> interface (like an ioctl). That review should be focused on memory safety.
>>
>> - Had some fuzzing on these IO interfaces using soon-to-be-released tools.
> What is the intersection of ioremap() users that are outside of the
> proposed probe authorization regime AND want confidential computing
> support?


Right now it's zero, I believe.

That is, there is other low-level code that sets memory shared, but it
uses other mechanisms, not ioremap.

>>>> Also there is the more practical problem that some devices are needed
>>>> for booting. For example in TDX we can't print something to the console
>>>> with this mechanism, so you would never get any output before the
>>>> initrd. Just seems like a nightmare for debugging anything. There really
>>>> needs to be an authorization mechanism that works reasonably early.
>>>>
>>>> I can see a point of having user space overrides though, but we need to
>>>> have a sane kernel default that works early.
>>> Right, as I suggested [1], just enough early authorization to
>>> bootstrap/debug initramfs and then that can authorize the remainder.
>> But how do you debug the kernel then? Making early boot undebuggable
>> just seems like bad policy to me.
> I am not proposing making the early undebuggable.


That's the implication of moving the policy into the initrd.


If only the initrd can authorize, then it won't be possible to authorize
before the initrd, and thus the early console won't work.

-Andi


2021-10-12 21:32:04

by Michael S. Tsirkin

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

On Tue, Oct 12, 2021 at 02:18:01PM -0700, Andi Kleen wrote:
>
> > Interesting. VT-d trade-offs ... what are they?
>
> The connection to the device is not encrypted and also not authenticated.
>
> This is different from even talking to the (untrusted) host through
> shared memory, where you at least still have a common key.

Well, it's different sure enough, but how is talking to the host less
secure? Cold boot attacks and such?

> > Allowing the hypervisor to write into the BIOS looks like it will
> > trivially lead to code execution, won't it?
>
> This is not about BIOS code executing. While the guest firmware runs, it
> is protected, of course. This is about BIOS structures like ACPI tables
> that are mapped by Linux. While AML can run byte code, it normally cannot
> write to arbitrary memory.

I thought you basically create an OperationRegion of SystemMemory type,
and off you go. Maybe the OSPM in Linux is clever and protects
some memory, I wouldn't know.

> The risk is more that all the Linux code dealing with this hasn't been
> hardened to deal with malicious input.
>
> -Andi


--
MST

2021-10-12 22:03:07

by Dan Williams

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Tue, Oct 12, 2021 at 2:28 PM Andi Kleen <[email protected]> wrote:
[..]
> >> But how do you debug the kernel then? Making early boot undebuggable
> >> just seems like bad policy to me.
> > I am not proposing making the early undebuggable.
>
>
> That's the implication of moving the policy into initrd.
>
>
> If only initrd can authorize then it won't be possible to authorize
> before initrd, thus the early console won't work.

Again, the proposal is that the allow-list is limited to just enough
devices to start up and debug the initramfs and no more. Everything
else can be dynamic, and this allows for a powerful custom override
interface without needing to debate additional ABI like command line
overrides, and minimizes future changes to this kernel-internal
allow-list.

2021-10-14 06:33:58

by Elena Reshetova

Subject: RE: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

> On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > others) in init functions that also register drivers (thanks Elena for
> > > the number)
> >
> > To provide more numbers on this. What I can see so far from a smatch-based
> > analysis, we have 409 __init style functions (.probe &
> > builtin/module_platform_driver_probe excluded) for 5.15 with allyesconfig.
>
> I don't think we care about allyesconfig at all though.
> Just don't do that.
> How about allmodconfig? This is closer to what distros actually do.

It does not really make any difference for the content of drivers/*: it
gives 408 __init style functions doing IO (.probe &
builtin/module_platform_driver_probe excluded) for 5.15 with allmodconfig:

['doc200x_ident_chip',
'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
'late_iommu_features_init', 'detect_ivrs',
'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
'intel_cleanup_irq_remapping', 'detect_intel_iommu',
'parse_ioapics_under_ir', 'si_domain_init', 'ubi_init',
'fb_console_init', 'xenbus_probe_backend_init',
'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
'balloon_init', 'intel_iommu_init', 'intel_rng_mod_init',
'check_tylersburg_isoch', 'dmar_table_init',
'enable_drhd_fault_handling', 'init_acpi_pm_clocksource',
'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
'sun5i_setup_clockevent', 'ubi_gluebi_init', 'ubiblock_init',
'hv_init_tsc_clocksource', 'hv_init_clocksource', 'mt7621_clk_init',
'samsung_clk_register_mux', 'samsung_clk_register_gate',
'samsung_clk_register_fixed_rate', 'clk_boston_setup',
'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
'meson6_timer_init', 'atcpit100_timer_init',
'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init', 'skx_init',
'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
'pnd2_init', 'edac_init', 'adummy_init', 'mtd_stresstest_init',
'bxt_idle_state_table_update', 'sklh_idle_state_table_update',
'skx_idle_state_table_update',
'acpi_gpio_handle_deferred_request_irqs', 'smc_findirq', 'ltpc_probe',
'com90io_probe', 'com90xx_probe', 'pcnet32_init_module',
'it87_gpio_init', 'f7188x_find', 'it8712f_wdt_find', 'f71808e_find',
'it87_wdt_init', 'f71882fg_find', 'it87_find', 'f71805f_find',
'parport_pc_init', 'asic3_irq_probe', 'sch311x_detect',
'amd_gpio_init', 'dvb_init', 'dvb_register', 'em28xx_alsa_register',
'em28xx_dvb_register', 'em28xx_rc_register', 'em28xx_video_register',
'blackbird_init', 'bttv_check_chipset', 'ivtvfb_callback_init',
'init_control', 'con_init', 'cr_pll_init',
'clk_disable_unused_subtree', 'fmi_init', 'cadet_init', 'pcm20_init',
'airo_init_module', 'bnx2i_mod_init', 'bnx2fc_mod_init',
'timer_of_irq_exit', 'init', 'kempld_init', 'ivtvfb_init',
'brcmf_core_init', 'comedi_test_init', 'tlan_eisa_probe',
'timer_probe', 'of_clk_init', '__reserved_mem_init_node',
'of_irq_init', 'mace_init', 'vortex_eisa_init', 'reset_chip',
'atp_init', 'atp_probe1', 'smc_probe', 'osi_setup', 'led_init',
'el3_init_module', 'clk_sp810_of_setup', 'ltpc_probe_dma',
'com90io_found', 'check_mirror', 'arcrimi_found', 'com90xx_found',
'intel_soc_thermal_init', 'thermal_register_governors',
'thermal_unregister_governors', 'therm_lvt_init', 'tcc_cooling_init',
'powerclamp_probe', 'intel_init', 'qcom_geni_serial_earlycon_setup',
'kgdboc_early_init', 'lpuart_console_setup', 'speakup_init',
'early_console_setup', 'init_port', 'early_serial8250_setup',
'linflex_console_setup', 'register_earlycon', 'of_setup_earlycon',
'slgt_init', 'moxa_init', 'parport_pc_init_superio',
'parport_pc_find_ports', 'mousedev_init', 'ses_init', 'riocm_init',
'efi_rci2_sysfs_init', 'blogic_probe', 'blogic_init',
'blogic_init_mm_probeinfo', 'blogic_init_probeinfo_list',
'blogic_checkadapter', 'blogic_rdconfig', 'blogic_inquiry',
'adpt_init', 'clk_unprepare_unused_subtree', 'aspeed_socinfo_init',
'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
'sh73a0_cpg_clocks_init', 'cpg_div6_register',
'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
'cpg_sd_clk_register', 'r7s9210_update_clk_table',
'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
'of_fixed_mmio_clk_setup', 'xdbc_map_pci_mmio', 'xdbc_find_dbgp',
'xdbc_bios_handoff', 'xdbc_early_setup', 'ehci_setup',
'early_xdbc_parse_parameter', 'find_cap', '__find_dbgp',
'nvidia_set_debug_port', 'detect_set_debug_port',
'early_ehci_bios_handoff', 'early_dbgp_init', 'dbgp_init',
'ulpi_init', 'hidg_init', 'xdbc_init', 'brcmstb_usb_pinmap_probe',
'dell_init', 'eisa_init_device', 'mlxcpld_led_probe', 'nas_gpio_init',
'asic3_mfd_probe', 'asic3_probe', 'watchdog_init', 'ssb_modinit',
'pt_init', 'thinkpad_acpi_module_init', 'kbd_init', 'joydev_init',
'evdev_init', 'evbug_init', 'input_leds_init', 'mk712_init',
'l4_add_card', 'ns558_init', 'apanel_init', 'ct82c710_detect',
'i8042_check_aux', 'i8042_check_mux', 'i8042_probe', 'i8042_init',
'i8042_aux_test_irq', 'ocrdma_init_module', 'input_apanel_init',
'cs5535_mfgpt_init', 'geodewdt_probe', 'duramar2150_c2port_init',
'init_ohci1394_dma_on_all_controllers', 'init_ohci1394_controller',
'rionet_init', 'nonstatic_sysfs_init', 'init_pcmcia_bus',
'devlink_class_init', 'switchtec_ntb_init', 'mport_init',
'drivetemp_init', 'omap_vout_probe', 'probe_opti_vlb',
'probe_chip_type', 'legacy_check_special_cases',
'qdi65_identify_port', 'probe_qdi_vlb', 'comedi_init', 'hv_acpi_init',
'pcistub_init_devices_late', 'bcma_host_soc_register',
'bcma_bus_early_register', 'vga_arb_device_init',
'vga_arb_select_default_device', 'zf_init',
'watchdog_deferred_registration', 'wb_smsc_wdt_init',
'w83977f_wdt_init', 'ali_find_watchdog', 'pc87413_init',
'alim7101_wdt_init', 'at91_wdt_init', 'sc1200wdt_probe',
'asr_get_base_address', 'dmi_walk_early', 'dmi_sysfs_init',
'dell_smbios_init', 'acer_wmi_init', 'get_thinkpad_model_data',
'dmi_scan_machine', 'pci_assign_unassigned_resources',
'cpcihp_generic_init', 'pnpacpi_init', 'acpi_early_processor_osc',
'acpi_processor_check_duplicates', 'acpi_early_processor_set_pdc',
'acpi_ec_dsdt_probe', 'cros_ec_lpc_init', 'tpacpi_acpi_handle_locate',
'ks_pcie_init_id', 'ks_pcie_host_init', 'pci_apply_final_quirks',
'intel_uncore_init', 'qedr_init_module', 'isapnp_peek',
'isapnp_isolate', 'init_ipmi_si', 'isapnp_build_device_list',
'pnpacpi_add_device', 'erst_init', 'intel_idle_acpi_cst_extract',
'xen_acpi_processor_init', 'acpi_scan_init', 's3_wmi_probe',
'intel_opregion_present', 'extlog_init', 'intel_pstate_init',
'via_rng_mod_init', 'amd_rng_mod_init', 'ccp_init', 'init_nsc',
'init_atmel', 'intel_rng_hw_init', 'intel_init_hw_struct',
'tlclk_init', 'mwave_init', 'applicom_init', 'hdaps_init',
'tink_board_init', 'ibm_rtl_init', 'samsung_sabi_init',
'samsung_init', 'samsung_backlight_init', 'samsung_rfkill_init_swsmi',
'samsung_lid_handling_init', 'samsung_leds_init', 'samsung_sabi_diag',
'samsung_sabi_infos', 'isst_if_mbox_init', 'pmc_atom_init',
'abituguru_detect', 'hwmon_pci_quirks', 'applesmc_init',
'abituguru3_detect', 'w83627ehf_probe', 'dme1737_isa_detect',
'smsc47m1_probe', 'pcc_cpufreq_init', 'cpufreq_p4_init',
'centrino_init', 'acpi_cpufreq_init', 'pcc_cpufreq_probe',
'intel_pstate_msrs_not_valid',
'intel_pstate_platform_pwr_mgmt_exists', 'acpi_cpufreq_boost_init',
'amd_freq_sensitivity_init', 'gic_fixup_resource', 'do_floppy_init',
'get_fdc_version', 'pf_init', 'pg_init', 'pd_init', 'pcd_init',
'rio_basic_attach']

2021-10-14 06:59:31

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > others) in init functions that also register drivers (thanks Elena for
> > > > the number)
> > >
> > > To provide more numbers on this. What I can see so far from a smatch-based
> > > analysis, we have 409 __init style functions (.probe &
> > > builtin/module_platform_driver_probe excluded) for 5.15 with allyesconfig.
> >
> > I don't think we care about allyesconfig at all though.
> > Just don't do that.
> > How about allmodconfig? This is closer to what distros actually do.
>
> It does not really make any difference for the content of drivers/*: it
> gives 408 __init style functions doing IO (.probe &
> builtin/module_platform_driver_probe excluded) for 5.15 with allmodconfig:
>
> ['doc200x_ident_chip',
> 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',

Um. ARM? Which architecture is this build for?


> [...]

2021-10-14 07:30:03

by Elena Reshetova

Subject: RE: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

> On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > the number)
> > > >
> > > > To provide more numbers on this. What I can see so far from a smatch-based
> > > > analysis, we have 409 __init style functions (.probe &
> > > > builtin/module_platform_driver_probe excluded) for 5.15 with allyesconfig.
> > >
> > > I don't think we care about allyesconfig at all though.
> > > Just don't do that.
> > > How about allmodconfig? This is closer to what distros actually do.
> >
> > It does not really make any difference for the content of drivers/*: it
> > gives 408 __init style functions doing IO (.probe &
> > builtin/module_platform_driver_probe excluded) for 5.15 with allmodconfig:
> >
> > ['doc200x_ident_chip',
> > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
>
> Um. ARM? Which architecture is this build for?

The list of smatch IO findings is built for x86, but the smatch cross-function
database covers all archs, so when queried for all potential function callers,
it also shows non-x86 call chains.

Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':

Detected low-level IO from arm_v7s_do_selftests in fun
__iommu_queue_command_sync

drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
{15002074744551330002}
'check_host_input' read from the host using function 'readl' to a
member of the structure 'iommu->cmd_buf_head';

__iommu_queue_command_sync()
iommu_completion_wait()
amd_iommu_domain_flush_complete()
iommu_v1_map_page()
arm_v7s_do_selftests()

So, the results can be further filtered if you want a specific arch.

2021-10-14 09:28:49

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote:
> > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > > the number)
> > > > >
> > > > > To provide more numbers on this. What I can see so far from a smatch-based
> > > > > analysis, we have 409 __init style functions (.probe &
> > > > > builtin/module_platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > >
> > > > I don't think we care about allyesconfig at all though.
> > > > Just don't do that.
> > > > How about allmodconfig? This is closer to what distros actually do.
> > >
> > > It does not really make any difference for the content of drivers/*: it
> > > gives 408 __init style functions doing IO (.probe &
> > > builtin/module_platform_driver_probe excluded) for 5.15 with allmodconfig:
> > >
> > > ['doc200x_ident_chip',
> > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> >
> > Um. ARM? Which architecture is this build for?
>
> The list of smatch IO findings is built for x86, but the smatch cross-function
> database covers all archs, so when queried for all potential function callers,
> it also shows non-x86 call chains.
>
> Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':
>
> Detected low-level IO from arm_v7s_do_selftests in fun
> __iommu_queue_command_sync
>
> drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> {15002074744551330002}
> 'check_host_input' read from the host using function 'readl' to a
> member of the structure 'iommu->cmd_buf_head';
>
> __iommu_queue_command_sync()
> iommu_completion_wait()
> amd_iommu_domain_flush_complete()
> iommu_v1_map_page()
> arm_v7s_do_selftests()
>
> So, the results can be further filtered if you want a specific arch.

So what is the number just for x86? Could you tell?

--
MST

2021-10-14 14:52:26

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote:
> > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > > the number)
> > > > >
> > > > > To provide more numbers on this. What I can see so far from a smatch-based
> > > > > analysis, we have 409 __init style functions (.probe &
> > > > > builtin/module_platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > >
> > > > I don't think we care about allyesconfig at all though.
> > > > Just don't do that.
> > > > How about allmodconfig? This is closer to what distros actually do.
> > >
> > > It does not really make any difference for the content of drivers/*: it
> > > gives 408 __init style functions doing IO (.probe &
> > > builtin/module_platform_driver_probe excluded) for 5.15 with allmodconfig:
> > >
> > > ['doc200x_ident_chip',
> > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> >
> > Um. ARM? Which architecture is this build for?
>
> The list of smatch IO findings is built for x86, but the smatch cross-function
> database covers all archs, so when queried for all potential function callers,
> it also shows non-x86 call chains.
>
> Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':
>
> Detected low-level IO from arm_v7s_do_selftests in fun
> __iommu_queue_command_sync
>
> drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> {15002074744551330002}
> 'check_host_input' read from the host using function 'readl' to a
> member of the structure 'iommu->cmd_buf_head';
>
> __iommu_queue_command_sync()
> iommu_completion_wait()
> amd_iommu_domain_flush_complete()
> iommu_v1_map_page()
> arm_v7s_do_selftests()
>
> So, the results can be further filtered if you want a specific arch.

Even better would be a typical distro build.

--
MST

2021-10-14 15:33:18

by Elena Reshetova

Subject: RE: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

> On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote:
> > > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > > > the number)
> > > > > >
> > > > > > To provide more numbers on this. What I can see so far from a
> > > > > > smatch-based analysis, we have 409 __init style functions (.probe &
> > > > > > builtin/module_platform_driver_probe excluded) for 5.15 with
> > > > > > allyesconfig.
> > > > >
> > > > > I don't think we care about allyesconfig at all though.
> > > > > Just don't do that.
> > > > > How about allmodconfig? This is closer to what distros actually do.
> > > >
> > > > It does not really make any difference for the content of drivers/*: it
> > > > gives 408 __init style functions doing IO (.probe &
> > > > builtin/module_platform_driver_probe excluded) for 5.15 with allmodconfig:
> > > >
> > > > ['doc200x_ident_chip',
> > > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> > >
> > > Um. ARM? Which architecture is this build for?
> >
> > The list of smatch IO findings is built for x86, but the smatch cross-function
> > database covers all archs, so when queried for all potential function callers,
> > it also shows non-x86 call chains.
> >
> > Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':
> >
> > Detected low-level IO from arm_v7s_do_selftests in fun
> > __iommu_queue_command_sync
> >
> > drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> > {15002074744551330002}
> > 'check_host_input' read from the host using function 'readl' to a
> > member of the structure 'iommu->cmd_buf_head';
> >
> > __iommu_queue_command_sync()
> > iommu_completion_wait()
> > amd_iommu_domain_flush_complete()
> > iommu_v1_map_page()
> > arm_v7s_do_selftests()
> >
> > So, the results can be further filtered if you want a specified arch.
>
> So what is it just for x86? Could you tell?

I can probably figure out how to do additional filtering here, but does
it really matter for the case that is being discussed here? Andi's point was
that there are quite a few existing places in the kernel where low-level IO
happens before the probe stage. So I brought these numbers here.
What do you plan to do with the pure x86 results?

And here are the full results for allyesconfig, if anyone is interested (just got permission to create
the repository today):
https://github.com/intel/ccc-linux-guest-hardening/tree/master/audit/sample_output/5.15-rc1
We will be pushing to this repo all the scripts and fuzzing setups we use as part of
our Linux guest hardening effort for confidential cloud computing, but it is going to take
some time to get all the approvals collected.

Best Regards,
Elena.

2021-10-15 12:13:41

by Andi Kleen

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared


> I thought you basically create an OperationRegion of SystemMemory type,
> and off you go. Maybe the OSPM in Linux is clever and protects
> some memory, I wouldn't know.


I investigated this now, and it looks like acpi is using
ioremap_cache(). We can hook into that and force non sharing. It's
probably safe to assume that this is not used on real IO devices.

I think there are still some other BIOS mappings that use just plain
ioremap() though.
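
Roughly, as a sketch only (based on the current x86 ioremap_cache()
definition; the actual hook point may end up elsewhere):

	/*
	 * Sketch: keep ACPI table mappings guest-private by having
	 * ioremap_cache() request an encrypted (i.e. non-shared)
	 * mapping, on the assumption that ioremap_cache() is never
	 * used for real IO devices.
	 */
	void __iomem *ioremap_cache(resource_size_t phys_addr, unsigned long size)
	{
		return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB,
					__builtin_return_address(0),
					true /* encrypted/private */);
	}
	EXPORT_SYMBOL(ioremap_cache);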


-Andi

2021-10-15 13:07:56

by Michael S. Tsirkin

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

On Thu, Oct 14, 2021 at 10:50:59PM -0700, Andi Kleen wrote:
>
> > I thought you basically create an OperationRegion of SystemMemory type,
> > and off you go. Maybe the OSPM in Linux is clever and protects
> > some memory, I wouldn't know.
>
>
> I investigated this now, and it looks like acpi is using ioremap_cache(). We
> can hook into that and force non sharing. It's probably safe to assume that
> this is not used on real IO devices.
>
> I think there are still some other BIOS mappings that use just plain
> ioremap() though.
>
>
> -Andi

Hmm don't you mean the reverse? If you make ioremap shared then OS is
protected from malicious ACPI? If you don't make it shared then
malicious ACPI can poke at arbitrary OS memory. Looks like making
ioremap non shared by default is actually less safe than shared.
Interesting.

For BIOS I suspect there's no way around it, it needs to be
audited since it's executable.

--
MST

2021-10-16 07:00:20

by Andi Kleen

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

cutting down the insane cc list.

On 10/14/2021 11:57 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 14, 2021 at 10:50:59PM -0700, Andi Kleen wrote:
>>> I thought you basically create an OperationRegion of SystemMemory type,
>>> and off you go. Maybe the OSPM in Linux is clever and protects
>>> some memory, I wouldn't know.
>>
>> I investigated this now, and it looks like acpi is using ioremap_cache(). We
>> can hook into that and force non sharing. It's probably safe to assume that
>> this is not used on real IO devices.
>>
>> I think there are still some other BIOS mappings that use just plain
>> ioremap() though.
>>
>>
>> -Andi
> Hmm don't you mean the reverse? If you make ioremap shared then OS is
> protected from malicious ACPI?


Nope

> If you don't make it shared then
> malicious ACPI can poke at arbitrary OS memory.


When it's private it's protected and when it's shared it can be attacked


>
> For BIOS I suspect there's no way around it, it needs to be
> audited since it's executable.


The guest BIOS is attested and trusted. The original ACPI tables provided
by the BIOS are attested and trusted too.

It's just that if we map the ACPI tables temporarily shared, an evil
hypervisor could modify them during that time window.

-Andi

2021-10-18 03:48:07

by Thomas Gleixner

Subject: RE: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

Elena,

On Thu, Oct 14 2021 at 06:32, Elena Reshetova wrote:
>> On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> It does not make any difference really for the content of the /drivers/*:
> gives 408 __init style functions doing IO (.probe & builtin/module_
>> > _platform_driver_probe excluded) for 5.15 with allmodconfig:
>
> ['doc200x_ident_chip',
> 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> 'ubi_gluebi_init', 'ubiblock_init'
> 'ubi_init', 'mtd_stresstest_init',

All of this is MTD and can just be disabled wholesale.

Aside from that, most of these depend on either platform devices or device
tree enumerations which are never available on X86.

> 'probe_acpi_namespace_devices',

> 'amd_iommu_init_pci', 'state_next',
> 'init_dmars', 'iommu_init_pci', 'early_amd_iommu_init',
> 'late_iommu_features_init', 'detect_ivrs',
> 'intel_prepare_irq_remapping', 'intel_enable_irq_remapping',
> 'intel_cleanup_irq_remapping', 'detect_intel_iommu',
> 'parse_ioapics_under_ir', 'si_domain_init',
> 'intel_iommu_init', 'dmar_table_init',
> 'enable_drhd_fault_handling',
> 'check_tylersburg_isoch',

None of this is reachable because the initial detection which is ACPI
table based will fail for TDX. If not, it's a guest firmware problem.

> 'fb_console_init', 'xenbus_probe_backend_init',
> 'xenbus_probe_frontend_init', 'setup_vcpu_hotplug_event',
> 'balloon_init',

XEN, that's relevant because magically the TDX guest will assume that it
is a XEN instance?

> 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
> 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
> 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
> 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
> 'sun5i_setup_clockevent',
> 'mt7621_clk_init',
> 'samsung_clk_register_mux', 'samsung_clk_register_gate',
> 'samsung_clk_register_fixed_rate', 'clk_boston_setup',
> 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
> 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
> 'meson6_timer_init', 'atcpit100_timer_init',
> 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init',
> 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
> 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
> 'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
> 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
> 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
> 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
> 'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
> 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
> 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
> 'sh73a0_cpg_clocks_init', 'cpg_div6_register',
> 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
> 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
> 'cpg_sd_clk_register', 'r7s9210_update_clk_table',
> 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
> 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
> 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
> 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
> 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
> 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
> 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
> 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
> 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
> 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
> 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
> 'of_fixed_mmio_clk_setup',
> 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',

ARM based drivers are initialized on x86 in which way?

> 'hv_init_tsc_clocksource', 'hv_init_clocksource',

HyperV. See XEN

> 'skx_init',
> 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
> 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
> 'pnd2_init', 'edac_init', 'adummy_init',

EDAC already has hypervisor checks

> 'init_acpi_pm_clocksource',

Requires ACPI table entry or command line override

> 'intel_rng_mod_init',

Has an old style PCI table which is searched via pci_get_device(). Could
do with a cleanup which converts it to proper PCI probing.

<SNIP>

So I stop here, because it would be way simpler to have the file names,
but so far I could identify all of it off the top of my head.

So what are you trying to tell me? That you found tons of ioremaps in
__init functions which are completely irrelevant.

Please stop making arguments based on completely nonsensical data. It
took me less than 5 minutes to eliminate more than 50% of that list and
I'm pretty sure that I could have eliminated the bulk of the rest as
well.

The fact that a large part of this is ARM only, the fact that nobody
bothered to look at how e.g. IOMMU detection works and whether those
ioremaps actually can't be reached is hilarious.

So of these 400 instances, at least 30% are ARM specific, and those
cannot be reached on ARM willy-nilly either because they are either
device tree or ACPI enumerated.

Claiming that it is so much work to analyze 400 functions at least to the point:

- whether they are relevant for x86 and therefore potentially TDX at
all

- whether they have some form of enumeration or detection which makes
the ioremaps unreachable when the trusted BIOS is implemented
correctly

I just can laugh at that, really:

Two of my engineers have done an inventory of hundreds of cpu hotplug
notifier instances in a couple of days some years ago. Ditto for a
couple of hundred seqcount and a couple of hundred tasklet usage
sites.

Sure, but it makes for more security handwaving and a nice presentation to
tell people how much insecure code there is based on half-thought-out
static analysis. To do a proper static analysis of this, you really
have to do a proper brain based analysis first of:

1) Which code is relevant for x86

2) What are the mechanisms which are used across the X86 relevant
driver space to make these ioremap/MSR accesses actually reachable.

And of course this will not be complete, but this eliminates the vast
majority of your list. And looking at the remaining ones is not rocket
science either.

I can't take that seriously at all. Come back when you have a properly
compiled list of drivers which:

1) Can even be built for X86

2) Do ioremap/MSR based poking unconditionally.

3) Cannot be easily guarded off at the subsystem level

It's not going to be a huge list.

Then we can talk about facts and talk about the work required to fix
them or blacklist them in some way.

Thanks,

tglx

2021-10-18 03:48:23

by Thomas Gleixner

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

Andi,

On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote:
> On 10/9/2021 1:39 PM, Dan Williams wrote:
>> I agree with you and Greg here. If a driver is accessing hardware
>> resources outside of the bind lifetime of one of the devices it
>> supports, and in a way that neither modprobe-policy nor
>> device-authorization-policy infrastructure can block, that sounds
>> like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for
> the number)

These numbers are completely useless simply because they are based on
nonsensical criteria. See:

https://lore.kernel.org/r/87r1cj2uad.ffs@tglx

> My point is just that the ecosystem of devices that Linux supports is
> messy enough that there are legitimate exceptions from the "First IO
> only in probe call only" rule.

Your point is based on your outright refusal to actually do a proper
analysis and your outright refusal to help fixing the real problems.

All you have provided so far is handwaving based on a completely useless
analysis.

Sure, your goal is to get this TDX problem solved, but it's not going to
be solved by:

1) Providing a nonsensical analysis

2) Using #1 as an argument to hack some half-baked interfaces into the
kernel which allow you to tick off your checkbox and then leave the
resulting mess for others to clean up.

Try again when you have factual data to back up your claims and factual
arguments which prove that the problem can't be fixed otherwise.

I might be repeating myself, but kernel development works this way:

1) Hack your private POC - Yay!

2) Sit down and think hard about the problems you identified in step
#1. Do a thorough analysis.

3) Come up with a sensible integration plan.

4) Do the necessary grunt work of cleanups all over the place

5) Add sensible infrastructure which is understandable for the bulk
of kernel/driver developers

6) Let your feature fall in place

and not in the way you are insisting on:

1) Hack your private POC - Yay!

2) Define that this is the only way to do it and try to shove it down
the throat of everyone.

3) Getting told that this is not the way it works

4) Insist on it forever and blame the grumpy maintainers who are just
not understanding the great value of your approach.

5) Go back to #2

You should know that already, but I have no problem to give that lecture
to you over and over again. I probably should create a form letter.

And no, you can bitch about me as much as you want. These are not my
personal rules and personal pet peeves. These are rules Linus cares
about very much and aside of that they just reflect common sense.

The kernel is a common good and not the dumping ground for your personal
brain waste.

The kernel does not serve Intel. Quite the contrary, Intel depends on
the kernel to work nicely with its hardware. Ergo, Intel should have
a vested interest to serve the kernel and take responsibility for it
as a whole. And so should you as an Intel employee.

Just dumping your next half-baked workaround does not cut it, especially
not when it is not backed up by sensible arguments.

Please try again, but not before you have something substantial to back
up your claims.

Thanks,

Thomas

2021-10-18 03:48:26

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Thu, Oct 14, 2021 at 12:33:49PM +0000, Reshetova, Elena wrote:
> > On Thu, Oct 14, 2021 at 07:27:42AM +0000, Reshetova, Elena wrote:
> > > > On Thu, Oct 14, 2021 at 06:32:32AM +0000, Reshetova, Elena wrote:
> > > > > > On Tue, Oct 12, 2021 at 06:36:16PM +0000, Reshetova, Elena wrote:
> > > > > > > > The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> > > > > > > > others) in init functions that also register drivers (thanks Elena for
> > > > > > > > the number)
> > > > > > >
> > > > > > > To provide more numbers on this. What I can see so far from a smatch-
> > based
> > > > > > > analysis, we have 409 __init style functions (.probe & builtin/module_
> > > > > > > _platform_driver_probe excluded) for 5.15 with allyesconfig.
> > > > > >
> > > > > > I don't think we care about allyesconfig at all though.
> > > > > > Just don't do that.
> > > > > > How about allmodconfig? This is closer to what distros actually do.
> > > > >
> > > > > It does not make any difference really for the content of the /drivers/*:
> > > > > gives 408 __init style functions doing IO (.probe & builtin/module_
> > > > > > > _platform_driver_probe excluded) for 5.15 with allmodconfig:
> > > > >
> > > > > ['doc200x_ident_chip',
> > > > > 'doc_probe', 'doc2001_init', 'mtd_speedtest_init',
> > > > > 'mtd_nandbiterrs_init', 'mtd_oobtest_init', 'mtd_pagetest_init',
> > > > > 'tort_init', 'mtd_subpagetest_init', 'fixup_pmc551',
> > > > > 'doc_set_driver_info', 'init_amd76xrom', 'init_l440gx',
> > > > > 'init_sc520cdp', 'init_ichxrom', 'init_ck804xrom', 'init_esb2rom',
> > > > > 'probe_acpi_namespace_devices', 'amd_iommu_init_pci', 'state_next',
> > > > > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
> > > >
> > > > Um. ARM? Which architecture is this build for?
> > >
> > > The list of smatch IO findings is built for x86, but the smatch cross function
> > > database covers all archs, so when queried for all potential function callers,
> > > it would show non x86 arch call chains also.
> > >
> > > Here is the original x86 finding and call chain for the 'arm_v7s_do_selftests':
> > >
> > > Detected low-level IO from arm_v7s_do_selftests in fun
> > > __iommu_queue_command_sync
> > >
> > > drivers/iommu/amd/iommu.c:1025 __iommu_queue_command_sync() error:
> > > {15002074744551330002}
> > > 'check_host_input' read from the host using function 'readl' to a
> > > member of the structure 'iommu->cmd_buf_head';
> > >
> > > __iommu_queue_command_sync()
> > > iommu_completion_wait()
> > > amd_iommu_domain_flush_complete()
> > > iommu_v1_map_page()
> > > arm_v7s_do_selftests()
> > >
> > > So, the results can be further filtered if you want a specified arch.
> >
> > So what is it just for x86? Could you tell?
>
> I can probably figure out how to do additional filtering here, but does
> it really matter for the case that is being discussed here? Andi's point was
> that there are quite a few existing places in the kernel where low-level IO
> happens before the probe stage. So I brought these numbers here.
> What do you plan to do with the pure x86 results?

If the list is short - just suggest securing that ;)


> And here are the full results for allyesconfig, if anyone is interested (just got permission to create
> the repository today):
> https://github.com/intel/ccc-linux-guest-hardening/tree/master/audit/sample_output/5.15-rc1
> We will be pushing to this repo all the scripts and fuzzing setups we use as part of
> our Linux guest hardening effort for confidential cloud computing, but it is going to take
> some time to get all the approvals collected.
>
> Best Regards,
> Elena.

2021-10-18 03:48:28

by Thomas Gleixner

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Mon, Oct 18 2021 at 02:55, Thomas Gleixner wrote:
> On Sun, Oct 10 2021 at 15:11, Andi Kleen wrote:
>> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
>> others) in init functions that also register drivers (thanks Elena for
>> the number)
>
> These numbers are completely useless simply because they are based on
> nonsensical criteria. See:
>
> https://lore.kernel.org/r/87r1cj2uad.ffs@tglx
>
>> My point is just that the ecosystem of devices that Linux supports is
>> messy enough that there are legitimate exceptions from the "First IO
>> only in probe call only" rule.
>
> Your point is based on your outright refusal to actually do a proper
> analysis and your outright refusal to help fixing the real problems.
>
> All you have provided so far is handwaving based on a completely useless
> analysis.
>
> Sure, your goal is to get this TDX problem solved, but it's not going to
> be solved by:
>
> 1) Providing a nonsensical analysis
>
> 2) Using #1 as an argument to hack some half-baked interfaces into the
> kernel which allow you to tick off your checkbox and then leave the
> resulting mess for others to clean up.
>
> Try again when you have factual data to back up your claims and factual
> arguments which prove that the problem can't be fixed otherwise.
>
> I might be repeating myself, but kernel development works this way:
>
> 1) Hack your private POC - Yay!
>
> 2) Sit down and think hard about the problems you identified in step
> #1. Do a thorough analysis.
>
> 3) Come up with a sensible integration plan.
>
> 4) Do the necessary grunt work of cleanups all over the place
>
> 5) Add sensible infrastructure which is understandable for the bulk
> of kernel/driver developers
>
> 6) Let your feature fall in place
>
> and not in the way you are insisting on:
>
> 1) Hack your private POC - Yay!
>
> 2) Define that this is the only way to do it and try to shove it down
> the throat of everyone.
>
> 3) Getting told that this is not the way it works
>
> 4) Insist on it forever and blame the grumpy maintainers who are just
> not understanding the great value of your approach.
>
> 5) Go back to #2
>
> You should know that already, but I have no problem to give that lecture
> to you over and over again. I probably should create a form letter.
>
> And no, you can bitch about me as much as you want. These are not my
> personal rules and personal pet peeves. These are rules Linus cares
> about very much and aside of that they just reflect common sense.
>
> The kernel is a common good and not the dumping ground for your personal
> brain waste.
>
> The kernel does not serve Intel. Quite the contrary, Intel depends on
> the kernel to work nicely with its hardware. Ergo, Intel should have
> a vested interest to serve the kernel and take responsibility for it
> as a whole. And so should you as an Intel employee.
>
> Just dumping your next half-baked workaround does not cut it, especially
> not when it is not backed up by sensible arguments.
>
> Please try again, but not before you have something substantial to back
> up your claims.

That said, I can't resist the urge to say a few words to the responsible
senior and management people at Intel in this context:

I surely know that a lot of Intel people claim that their lack of
progress is _only_ because Thomas is hard to work with and Thomas wants
unreasonable changes to their code, which I could perceive as an abuse of
myself for the purpose of self-deception. TBH, I don't give a damn.

Let me ask a few questions instead:

- Is it unreasonable to expect that arguments are based on facts
and proper analysis?

- Is it unreasonable to expect a proper integration of a new feature?

- Does it take unreasonable effort to do a proper design?

- Is it unreasonable to ask that the necessary cleanups are done
upfront?

If any of the responsible people at Intel think so, then they should
speak up now and tell me in public and to my face what's so
unreasonable about that.

Thanks,

Thomas

2021-10-18 03:49:19

by Michael S. Tsirkin

Subject: Re: [PATCH v5 16/16] x86/tdx: Add cmdline option to force use of ioremap_host_shared

On Fri, Oct 15, 2021 at 06:34:17AM -0700, Andi Kleen wrote:
> cutting down the insane cc list.
>
> On 10/14/2021 11:57 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 14, 2021 at 10:50:59PM -0700, Andi Kleen wrote:
> > > > I thought you basically create an OperationRegion of SystemMemory type,
> > > > and off you go. Maybe the OSPM in Linux is clever and protects
> > > > some memory, I wouldn't know.
> > >
> > > I investigated this now, and it looks like acpi is using ioremap_cache(). We
> > > can hook into that and force non sharing. It's probably safe to assume that
> > > this is not used on real IO devices.
> > >
> > > I think there are still some other BIOS mappings that use just plain
> > > ioremap() though.
> > >
> > >
> > > -Andi
> > Hmm don't you mean the reverse? If you make ioremap shared then OS is
> > protected from malicious ACPI?
>
>
> Nope
>
> > If you don't make it shared then
> > malicious ACPI can poke at arbitrary OS memory.
>
>
> When it's private it's protected and when it's shared it can be attacked
>
>
> >
> > For BIOS I suspect there's no way around it, it needs to be
> > audited since it's executable.
>
>
> The guest BIOS is attested and trusted. The original ACPI tables by the BIOS
> are attested and trusted too.
>
> It's just that if we map the ACPI tables temporarily shared, an evil
> hypervisor could modify them during that time window.
>
> -Andi

I thought some more about it.

Fundamentally, ACPI has these types of OperationRegions:
SystemIO | SystemMemory | PCI_Config | EmbeddedControl | SMBus |
SystemCMOS | PciBarTarget | IPMI | GeneralPurposeIO | GenericSerialBus | PCC

Now, SystemMemory can be used to talk to either BIOS (should be
encrypted) or hypervisor (should not be encrypted).

I think it's not a great idea to commit to either, or teach users
to hack around it with command line flags. Instead
there should be a new SystemMemoryUnencrypted API for interface with
the hypervisor. Can you guys propose this at the ACPI spec?
If not but at least we are in agreement I guess I can try to do it,
have a bit of experience with the ACPI spec.

And I assume PciBarTarget should be unencrypted so it can work.
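
To make the idea concrete, a sketch of the OSPM side (SystemMemoryUnencrypted
and region_is_unencrypted() are hypothetical; ioremap_host_shared() is the
API proposed by this series):

	/*
	 * Sketch only: with a hypothetical SystemMemoryUnencrypted
	 * OperationRegion type, the OSPM could keep plain SystemMemory
	 * private (attested tables) and share only the new region type
	 * with the hypervisor.
	 */
	void __iomem *acpi_os_map_iomem(acpi_physical_address phys, acpi_size size)
	{
		if (region_is_unencrypted(phys, size))	/* hypothetical lookup */
			return ioremap_host_shared(phys, size);

		return ioremap_cache(phys, size);	/* stays guest-private */
	}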

--
MST

2021-10-18 07:06:04

by Elena Reshetova

Subject: RE: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

Thank you Thomas for your review; please see my responses/comments inline.

> > 'ostm_init_clksrc', 'ftm_clockevent_init', 'ftm_clocksource_init',
> > 'kona_timer_init', 'mtk_gpt_init', 'samsung_clockevent_init',
> > 'samsung_clocksource_init', 'sysctr_timer_init', 'mxs_timer_init',
> > 'sun4i_timer_init', 'at91sam926x_pit_dt_init', 'owl_timer_init',
> > 'sun5i_setup_clockevent',
> > 'mt7621_clk_init',
> > 'samsung_clk_register_mux', 'samsung_clk_register_gate',
> > 'samsung_clk_register_fixed_rate', 'clk_boston_setup',
> > 'gemini_cc_init', 'aspeed_ast2400_cc', 'aspeed_ast2500_cc',
> > 'sun6i_rtc_clk_init', 'phy_init', 'ingenic_ost_register_clock',
> > 'meson6_timer_init', 'atcpit100_timer_init',
> > 'npcm7xx_clocksource_init', 'clksrc_dbx500_prcmu_init',
> > 'rcar_sysc_pd_setup', 'r8a779a0_sysc_pd_setup', 'renesas_soc_init',
> > 'rcar_rst_init', 'rmobile_setup_pm_domain', 'mcp_write_pairing_set',
> > 'a72_b53_rac_enable_all', 'mcp_a72_b53_set',
> > 'brcmstb_soc_device_early_init', 'imx8mq_soc_revision',
> > 'imx8mm_soc_uid', 'imx8mm_soc_revision', 'qe_init',
> > 'exynos5x_clk_init', 'exynos5250_clk_init', 'exynos4_get_xom',
> > 'create_one_cmux', 'create_one_pll', 'p2041_init_periph',
> > 'p4080_init_periph', 'p5020_init_periph', 'p5040_init_periph',
> > 'r9a06g032_clocks_probe', 'r8a73a4_cpg_clocks_init',
> > 'sh73a0_cpg_clocks_init', 'cpg_div6_register',
> > 'r8a7740_cpg_clocks_init', 'cpg_mssr_register_mod_clk',
> > 'cpg_mssr_register_core_clk', 'rcar_gen3_cpg_clk_register',
> > 'cpg_sd_clk_register', 'r7s9210_update_clk_table',
> > 'rz_cpg_read_mode_pins', 'rz_cpg_clocks_init',
> > 'rcar_r8a779a0_cpg_clk_register', 'rcar_gen2_cpg_clk_register',
> > 'sun8i_a33_ccu_setup', 'sun8i_a23_ccu_setup', 'sun5i_ccu_init',
> > 'suniv_f1c100s_ccu_setup', 'sun6i_a31_ccu_setup',
> > 'sun8i_v3_v3s_ccu_init', 'sun50i_h616_ccu_setup',
> > 'sunxi_h3_h5_ccu_init', 'sun4i_ccu_init', 'kona_ccu_init',
> > 'ns2_genpll_scr_clk_init', 'ns2_genpll_sw_clk_init',
> > 'ns2_lcpll_ddr_clk_init', 'ns2_lcpll_ports_clk_init',
> > 'nsp_genpll_clk_init', 'nsp_lcpll0_clk_init',
> > 'cygnus_genpll_clk_init', 'cygnus_lcpll0_clk_init',
> > 'cygnus_mipipll_clk_init', 'cygnus_audiopll_clk_init',
> > 'of_fixed_mmio_clk_setup',
> > 'arm_v7s_do_selftests', 'arm_lpae_run_tests', 'init_iommu_one',
>
> ARM based drivers are initialized on x86 in which way?

As I already explained to Michael, the reason why ARM code is included
in this list is that the smatch cross-function database contains
all code paths, so when querying for the possible callers you get everything.
I will filter this list to x86 and provide results, since this seems to be important.
The reason why I don't see this as important is that the threat model we are
trying to protect against here (malicious VMM/host) is not TDX specific, and
I see no reason why in the near or more distant future the ARM arch wouldn't
also have a confidential cloud computing solution and need to do exactly the
same thing.

>
> > 'hv_init_tsc_clocksource', 'hv_init_clocksource',
>
> HyperV. See XEN
>
> > 'skx_init',
> > 'i10nm_init', 'sbridge_init', 'i82975x_init', 'i3000_init',
> > 'x38_init', 'ie31200_init', 'i3200_init', 'amd64_edac_init',
> > 'pnd2_init', 'edac_init', 'adummy_init',
>
> EDAC already has hypervisor checks
>
> > 'init_acpi_pm_clocksource',
>
> Requires ACPI table entry or command line override
>
> > 'intel_rng_mod_init',
>
> Has an old style PCI table which is searched via pci_get_device(). Could
> do with a cleanup which converts it to proper PCI probing.
>
> <SNIP>
>
> So I stop here, because it would be way simpler to have the file names,
> but so far I could identify all of it off the top of my head.
>
> So what are you trying to tell me? That you found tons of ioremaps in
> __init functions which are completely irrelevant.
>
> Please stop making arguments based on completely nonsensical data. It
> took me less than 5 minutes to eliminate more than 50% of that list and
> I'm pretty sure that I could have eliminated the bulk of the rest as
> well.
>
> The fact that a large part of this is ARM only, the fact that nobody
> bothered to look at how e.g. IOMMU detection works and whether those
> ioremaps actually can't be reached is hilarious.
>
> So of these 400 instances, at least 30% are ARM specific, and those
> cannot be reached on ARM willy-nilly either because they are either
> device tree or ACPI enumerated.
>
> Claiming that it is so much work to analyze 400 functions at least to the point:

Please bear in mind that the 400-function list is not complete by any means.
Many drivers define driver-specific wrappers for low-level IO and then use
these wrappers throughout their code. For the drivers we have audited (like virtio)
we have manually read the code of each driver, identified these wrapper
functions (like virtio_cread*) and then added them to the static analyzer
to track all the invocations of these functions. How do you propose to
repeat this exercise for the whole Linux kernel driver/module codebase?

>
> - whether they are relevant for x86 and therefore potentially TDX at
> all
>
> - whether they have some form of enumeration or detection which makes
> the ioremaps unreachable when the trusted BIOS is implemented
> correctly
>
> I just can laugh at that, really:
>
> Two of my engineers have done an inventory of hundreds of cpu hotplug
> notifier instances in a couple of days some years ago. Ditto for a
> couple of hundred seqcount and a couple of hundred tasklet usage
> sites.
>
> Sure, but it makes for more security handwaving and a nice presentation to
> tell people how much insecure code there is based on half-thought-out
> static analysis. To do a proper static analysis of this, you really
> have to do a proper brain based analysis first of:
>
> 1) Which code is relevant for x86
>
> 2) What are the mechanisms which are used across the X86 relevant
> driver space to make these ioremap/MSR accesses actually reachable.
>
> And of course this will not be complete, but this eliminates the vast
> majority of your list. And looking at the remaining ones is not rocket
> science either.
>
> I can't take that seriously at all. Come back when you have a properly
> compiled list of drivers which:
>
> 1) Can even be built for X86
>
> 2) Do ioremap/MSR based poking unconditionally.
>
> 3) Cannot be easily guarded off at the subsystem level

I see two main problems with this approach (in addition to the above fact that
we need to spend *a lot of time* building the complete list of such functions first):

1. It is very error-prone since it would involve a lot of manual code
audit done by humans. And here each mistake is a potential new CVE for the kernel
in the scope of the confidential computing threat model.

2. It would require a lot of expertise from people who want to run
a secure protected guest kernel to make sure that their particular setup is secure.
Essentially they would need to completely repeat the hardening exercise from scratch
and verify all the involved code paths to make sure that, for their build, certain code is
indeed disabled, guarded at the subsystem level, not reachable because of cpuid bits, etc. etc.
Not everyone has the resources to do such an analysis (I would say few do), so we will end up
with a lot of insecure production kernels, because time-to-market pressure would win over
security if doing it means so much work.

Speaking in security terms, what you propose is to start from day one analyzing the whole
existing and vast attack surface, fix all the security issues in it manually one by one, and then
somewhere around 20 years from now be done with it (or maybe never, because there is always
new code coming in, and some of it would introduce new problems).

What we are proposing is first to try to minimize the attack surface as much as possible with a simple
and well-understood set of controls, and then spend a realistic amount of time securing this minimized
surface. Again, this is not a TDX-specific attack surface, but generic to any guest kernel that wants to be
secure under the confidential cloud computing threat model. So, it is not us who are pushing something into the
kernel for the sake of TDX; we just seem to be the first ones so far who care about the whole picture
and not just about providing the HW means to run a protected guest. And given that most of the drivers
have never been written with this confidential cloud computing threat model in mind, it is going to
take time to fix all of them. This really should be a community effort and a long-running task.
Take a look at this paper for example: https://arxiv.org/pdf/2109.10660.pdf They tried to
fuzz just a small set of 22 drivers from this attack surface and found 50 security-relevant bugs.
And fuzzing is of course no ultimate security testing technique. So, I don't see how we can end up
with a secure setup without drastically reducing the scope of security hardening first.
And then, as time goes on and more people look into this attack surface, we can (hopefully)
gradually open it up.

Best Regards,
Elena.

2021-10-18 12:10:11

by Greg Kroah-Hartman

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Sun, Oct 10, 2021 at 03:11:23PM -0700, Andi Kleen wrote:
>
> On 10/9/2021 1:39 PM, Dan Williams wrote:
> > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin <[email protected]> wrote:
> > > On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > > > From: Andi Kleen <[email protected]>
> > > >
> > > > For Confidential VM guests like TDX, the host is untrusted and hence
> > > > the devices emulated by the host or any data coming from the host
> > > > cannot be trusted. So the drivers that interact with the outside world
> > > > have to be hardened by sharing memory with host on need basis
> > > > with proper hardening fixes.
> > > >
> > > > For the PCI driver case, to share the memory with the host add
> > > > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> > > >
> > > > Signed-off-by: Andi Kleen <[email protected]>
> > > > Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> > > So I proposed to make all pci mappings shared, eliminating the need
> > > to patch drivers.
> > >
> > > To which Andi replied
> > > One problem with removing the ioremap opt-in is that
> > > it's still possible for drivers to get at devices without going through probe.
> > >
> > > To which Greg replied:
> > > https://lore.kernel.org/all/[email protected]/
> > > If there are in-kernel PCI drivers that do not do this, they need to be
> > > fixed today.
> > >
> > > Can you guys resolve the differences here?
> > I agree with you and Greg here. If a driver is accessing hardware
> > resources outside of the bind lifetime of one of the devices it
> > supports, and in a way that neither modprobe-policy nor
> > device-authorization-policy infrastructure can block, that sounds
> > like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for the
> number)
>
> Some are probably old drivers that could be fixed, but there are quite a few
> legitimate cases. For example for platform or ISA drivers that's the only
> way they can be implemented because they often have no other enumeration
> mechanism. For PCI drivers it's rarer, but also still can happen. One
> example that comes to mind here is the x86 Intel uncore drivers, which
> support a mix of MSR, ioremap and PCI config space accesses all from the
> same driver. This particular example can (and should be) fixed in other
> ways, but similar things also happen in other drivers, and they're not all
> broken. Even for the broken ones they're usually for some crufty old devices
> that have very few users, so it's likely untestable in practice.
>
> My point is just that the ecosystem of devices that Linux supports is messy
> enough that there are legitimate exceptions from the "First IO only in probe
> call only" rule.

No, there should not be for PCI drivers. If there is, that is a bug
that you can, and should, fix.

> And we can't just fix them all. Even if we could it would be hard to
> maintain.

Not true at all, you can fix them, and write a simple coccinelle rule to
prevent them from ever coming back in.
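
For illustration, the conversion being asked for here is moving the first
IO from module init into probe; a minimal sketch with hypothetical FOO_*
device names:

	#include <linux/module.h>
	#include <linux/pci.h>
	#include <linux/io.h>

	#define FOO_ID_REG	0x0		/* hypothetical device register */
	#define FOO_MAGIC	0xf00dcafe	/* hypothetical ID value */

	/*
	 * The problematic pattern flagged above is an __init function
	 * that pokes MMIO by hand and then registers a driver. The fix
	 * is to do the first IO only here, in probe, once the PCI core
	 * has bound the driver to a real, authorized device.
	 */
	static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		void __iomem *regs;
		int ret;

		ret = pcim_enable_device(pdev);
		if (ret)
			return ret;

		regs = pcim_iomap(pdev, 0, 0);
		if (!regs)
			return -ENOMEM;

		if (readl(regs + FOO_ID_REG) != FOO_MAGIC)
			return -ENODEV;

		return 0;
	}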

> Using a "firewall model" hooking into a few strategic points like we're
> proposing here is much saner for everyone.

No, it is not. It is "easier" for you because you all do not want to fix
up all of the drivers; you want to add additional code complexity on top
of the current mess that we have, and then you can claim that you have
"hardened" the drivers you care about.

Despite no one ever explaining exactly what "hardened" means to me.

Again, fix the existing drivers, you have the whole source, if this is
something that you all care about, it should not be hard to do.

Stop making excuses.

greg k-h

2021-10-18 12:18:04

by Greg Kroah-Hartman

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Tue, Oct 12, 2021 at 11:35:04AM -0700, Andi Kleen wrote:
> > I'd rather see more concerted efforts focused/limited core changes
> > rather than leaf driver changes until there is a clearer definition of
> > hardened.
>
> A hardened driver is a driver that

Ah, you do define this, thank you!

> - Had similar security (not API) oriented review of its IO operations
> (mainly MMIO access, but also PCI config space) as a non-privileged user
> interface (like an ioctl). That review should be focused on memory safety.

Where is this review done? Where is it documented? Who is responsible
for keeping it up to date with every code change to the driver, and to
the code that the driver calls and the code that calls the driver?

> - Had some fuzzing on these IO interfaces using to-be-released tools.

"some"? What tools? What is the input, and where is that defined? How
much fuzzing do you claim is "good enough"?

> Right now it's only three virtio drivers (console, net, block)

Where was this work done and published? And why only 3?

> Really it's no different than what we do for every new unprivileged user
> interface.

Really? I have seen loads of new drivers from Intel submitted in the
past months that would fail any of the above things just based on
obvious code reviews that I end up having to do...

If you want to start a "hardened driver" effort, there's a lot of real
work that needs to be done and documented here, and it needs to be explained
why it cannot just be done for the whole kernel...

greg k-h

2021-10-18 12:18:20

by Greg Kroah-Hartman

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Mon, Oct 11, 2021 at 07:59:17AM -0400, Michael S. Tsirkin wrote:
> On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote:
> >
> > > To which Andi replied
> > > One problem with removing the ioremap opt-in is that
> > > it's still possible for drivers to get at devices without going through probe.
> > >
> > > To which Greg replied:
> > > https://lore.kernel.org/all/[email protected]/
> > > If there are in-kernel PCI drivers that do not do this, they need to be
> > > fixed today.
> > >
> > > Can you guys resolve the differences here?
> >
> >
> > I addressed this in my other mail, but we may need more discussion.
>
> Hopefully Greg will reply to that one.

Note, when wanting Greg to reply, someone should at the very least cc:
him.

{sigh}

greg k-h

2021-10-18 13:19:29

by Michael S. Tsirkin

Subject: Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

On Mon, Oct 18, 2021 at 02:15:47PM +0200, Greg KH wrote:
> On Mon, Oct 11, 2021 at 07:59:17AM -0400, Michael S. Tsirkin wrote:
> > On Sun, Oct 10, 2021 at 03:22:39PM -0700, Andi Kleen wrote:
> > >
> > > > To which Andi replied
> > > > One problem with removing the ioremap opt-in is that
> > > > it's still possible for drivers to get at devices without going through probe.
> > > >
> > > > To which Greg replied:
> > > > https://lore.kernel.org/all/[email protected]/
> > > > If there are in-kernel PCI drivers that do not do this, they need to be
> > > > fixed today.
> > > >
> > > > Can you guys resolve the differences here?
> > >
> > >
> > > I addressed this in my other mail, but we may need more discussion.
> >
> > Hopefully Greg will reply to that one.
>
> Note, when wanting Greg to reply, someone should at the very least cc:
> him.

"that one" being "Andi's other mail". Which I don't remember what it was,
by now. Sorry.

> {sigh}
>
> greg k-h

2021-10-20 16:06:16

by Tom Lendacky

Subject: Re: [PATCH v5 04/16] x86/tdx: Make pages shared in ioremap()

On 10/8/21 7:36 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> All ioremap()ed pages that are not backed by normal memory (NONE or
> RESERVED) have to be mapped as shared.
>
> Reuse the infrastructure from AMD SEV code.
>
> Note that the DMA code doesn't use ioremap() to convert memory to shared, as
> DMA buffers are backed by normal memory. The DMA code makes buffers shared
> with set_memory_decrypted().
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since v4:
> * Renamed "protected_guest" to "cc_guest".
> * Replaced use of prot_guest_has() with cc_guest_has()
> * Modified the patch to adapt to latest version cc_guest_has()
> changes.
>
> Changes since v3:
> * Rebased on top of Tom Lendacky's protected guest
> changes (https://lore.kernel.org/patchwork/cover/1468760/)
>
> Changes since v1:
> * Fixed format issues in commit log.
>
> arch/x86/include/asm/pgtable.h | 4 ++++
> arch/x86/mm/ioremap.c | 8 ++++++--
> include/linux/cc_platform.h | 13 +++++++++++++
> 3 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 448cd01eb3ec..ecefccbdf2e3 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -21,6 +21,10 @@
> #define pgprot_encrypted(prot) __pgprot(__sme_set(pgprot_val(prot)))
> #define pgprot_decrypted(prot) __pgprot(__sme_clr(pgprot_val(prot)))
>
> +/* Make the page accessible by VMM for confidential guests */
> +#define pgprot_cc_guest(prot) __pgprot(pgprot_val(prot) | \
> + tdx_shared_mask())

This is only for TDX guests, so maybe use a name that is less generic.

Alternatively, you could create pgprot_private()/pgprot_shared() functions
that check for SME/SEV or TDX and do the proper thing.

Then you can redefine pgprot_encrypted()/pgprot_decrypted() to the new
functions?
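
Something along these lines, as an untested sketch (reusing tdx_shared_mask()
and CC_ATTR_GUEST_SHARED_MAPPING_INIT from this series):

	/* Pick the right bits for either SME/SEV (C-bit) or TDX (shared bit) */
	static inline pgprot_t pgprot_private(pgprot_t prot)
	{
		if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
			return prot;		/* TDX: shared bit clear */

		return __pgprot(__sme_set(pgprot_val(prot)));
	}

	static inline pgprot_t pgprot_shared(pgprot_t prot)
	{
		if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
			return __pgprot(pgprot_val(prot) | tdx_shared_mask());

		return __pgprot(__sme_clr(pgprot_val(prot)));
	}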

> +
> #ifndef __ASSEMBLY__
> #include <asm/x86_init.h>
> #include <asm/pkru.h>
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 026031b3b782..83daa3f8f39c 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -17,6 +17,7 @@
> #include <linux/cc_platform.h>
> #include <linux/efi.h>
> #include <linux/pgtable.h>
> +#include <linux/cc_platform.h>
>
> #include <asm/set_memory.h>
> #include <asm/e820/api.h>
> @@ -26,6 +27,7 @@
> #include <asm/pgalloc.h>
> #include <asm/memtype.h>
> #include <asm/setup.h>
> +#include <asm/tdx.h>
>
> #include "physaddr.h"
>
> @@ -87,8 +89,8 @@ static unsigned int __ioremap_check_ram(struct resource *res)
> }
>
> /*
> - * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
> - * there the whole memory is already encrypted.
> + * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
> + * private in TDX case) because there the whole memory is already encrypted.
> */
> static unsigned int __ioremap_check_encrypted(struct resource *res)
> {
> @@ -246,6 +248,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
> prot = PAGE_KERNEL_IO;
> if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
> prot = pgprot_encrypted(prot);
> + else if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
> + prot = pgprot_cc_guest(prot);

And if you do the new functions this could be:

if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_private(prot);
else
prot = pgprot_shared(prot);

Thanks,
Tom

>
> switch (pcm) {
> case _PAGE_CACHE_MODE_UC:
> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> index 7728574d7783..edb1d7a2f6af 100644
> --- a/include/linux/cc_platform.h
> +++ b/include/linux/cc_platform.h
> @@ -81,6 +81,19 @@ enum cc_attr {
> * Examples include TDX Guest.
> */
> CC_ATTR_GUEST_UNROLL_STRING_IO,
> +
> + /**
> + * @CC_ATTR_GUEST_SHARED_MAPPING_INIT: IO Remapped memory is marked
> + * as shared.
> + *
> + * The platform/OS is running as a guest/virtual machine and
> + * initializes all IO remapped memory as shared.
> + *
> + * Examples include TDX Guest (SEV marks all pages as shared by default
> + * so this feature cannot be enabled for it).
> + */
> + CC_ATTR_GUEST_SHARED_MAPPING_INIT,
> +
> };
>
> #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
>

2021-10-20 16:14:33

by Tom Lendacky

Subject: Re: [PATCH v5 01/16] x86/mm: Move force_dma_unencrypted() to common code

On 10/8/21 7:36 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Intel TDX doesn't allow VMM to access guest private memory. Any memory
> that is required for communication with VMM must be shared explicitly
> by setting the bit in page table entry. After setting the shared bit,
> the conversion must be completed with MapGPA hypercall. Details about
> MapGPA hypercall can be found in [1], sec 3.2.
>
> The call informs VMM about the conversion between private/shared
> mappings. The shared memory is similar to unencrypted memory in AMD
> SME/SEV terminology but the underlying process of sharing/un-sharing
> the memory is different for Intel TDX guest platform.
>
> SEV assumes that I/O devices can only do DMA to "decrypted" physical
> addresses without the C-bit set. In order for the CPU to interact with
> this memory, the CPU needs a decrypted mapping. To add this support,
> AMD SME code forces force_dma_unencrypted() to return true for
> platforms that support AMD SEV feature. It will be used for DMA memory
> allocation API to trigger set_memory_decrypted() for platforms that
> support AMD SEV feature.
>
> TDX is similar. So, to communicate with I/O devices, related pages need
> to be marked as shared. As mentioned above, shared memory in TDX
> architecture is similar to decrypted memory in AMD SME/SEV. So similar
> to AMD SEV, force_dma_unencrypted() has to forced to return true. This
> support is added in other patches in this series.
>
> So move force_dma_unencrypted() out of AMD specific code and call AMD
> specific (amd_force_dma_unencrypted()) initialization function from it.
> force_dma_unencrypted() will be modified by later patches to include
> Intel TDX guest platform specific initialization.
>
> Also, introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
> selected by all x86 memory encryption features. This will be selected
> by both AMD SEV and Intel TDX guest config options.
>
> This is preparation for TDX changes in DMA code and it has no
> functional change.

Can force_dma_unencrypted() be moved to arch/x86/kernel/cc_platform.c,
instead of creating a new file? It might fit better with patch #6.

Thanks,
Tom

>
> [1] - https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since v4:
> * Removed used we/you from commit log.
>
> Change since v3:
> * None
>
> Changes since v1:
> * Removed sev_active(), sme_active() checks in force_dma_unencrypted().
>
> arch/x86/Kconfig | 8 ++++++--
> arch/x86/include/asm/mem_encrypt_common.h | 18 ++++++++++++++++++
> arch/x86/mm/Makefile | 2 ++
> arch/x86/mm/mem_encrypt.c | 3 ++-
> arch/x86/mm/mem_encrypt_common.c | 17 +++++++++++++++++
> 5 files changed, 45 insertions(+), 3 deletions(-)
> create mode 100644 arch/x86/include/asm/mem_encrypt_common.h
> create mode 100644 arch/x86/mm/mem_encrypt_common.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index af49ad084919..37b27412f52e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1519,16 +1519,20 @@ config X86_CPA_STATISTICS
> helps to determine the effectiveness of preserving large and huge
> page mappings when mapping protections are changed.
>
> +config X86_MEM_ENCRYPT_COMMON
> + select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> + select DYNAMIC_PHYSICAL_MASK
> + def_bool n
> +
> config AMD_MEM_ENCRYPT
> bool "AMD Secure Memory Encryption (SME) support"
> depends on X86_64 && CPU_SUP_AMD
> select DMA_COHERENT_POOL
> - select DYNAMIC_PHYSICAL_MASK
> select ARCH_USE_MEMREMAP_PROT
> - select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> select INSTRUCTION_DECODER
> select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
> select ARCH_HAS_CC_PLATFORM
> + select X86_MEM_ENCRYPT_COMMON
> help
> Say yes to enable support for the encryption of system memory.
> This requires an AMD processor that supports Secure Memory
> diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
> new file mode 100644
> index 000000000000..697bc40a4e3d
> --- /dev/null
> +++ b/arch/x86/include/asm/mem_encrypt_common.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2020 Intel Corporation */
> +#ifndef _ASM_X86_MEM_ENCRYPT_COMMON_H
> +#define _ASM_X86_MEM_ENCRYPT_COMMON_H
> +
> +#include <linux/mem_encrypt.h>
> +#include <linux/device.h>
> +
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +bool amd_force_dma_unencrypted(struct device *dev);
> +#else /* CONFIG_AMD_MEM_ENCRYPT */
> +static inline bool amd_force_dma_unencrypted(struct device *dev)
> +{
> + return false;
> +}
> +#endif /* CONFIG_AMD_MEM_ENCRYPT */
> +
> +#endif
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 5864219221ca..b31cb52bf1bd 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
> obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
> obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
>
> +obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON) += mem_encrypt_common.o
> +
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 23d54b810f08..5d7fbed73949 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -31,6 +31,7 @@
> #include <asm/processor-flags.h>
> #include <asm/msr.h>
> #include <asm/cmdline.h>
> +#include <asm/mem_encrypt_common.h>
>
> #include "mm_internal.h"
>
> @@ -362,7 +363,7 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size)
> }
>
> /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
> -bool force_dma_unencrypted(struct device *dev)
> +bool amd_force_dma_unencrypted(struct device *dev)
> {
> /*
> * For SEV, all DMA must be to unencrypted addresses.
> diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
> new file mode 100644
> index 000000000000..f063c885b0a5
> --- /dev/null
> +++ b/arch/x86/mm/mem_encrypt_common.c
> @@ -0,0 +1,17 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Memory Encryption Support Common Code
> + *
> + * Copyright (C) 2021 Intel Corporation
> + *
> + * Author: Kuppuswamy Sathyanarayanan <[email protected]>
> + */
> +
> +#include <asm/mem_encrypt_common.h>
> +#include <linux/dma-mapping.h>
> +
> +/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
> +bool force_dma_unencrypted(struct device *dev)
> +{
> + return amd_force_dma_unencrypted(dev);
> +}
>

2021-10-20 16:34:59

by Tom Lendacky

Subject: Re: [PATCH v5 06/16] x86/tdx: Make DMA pages shared

On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Just like MKTME, TDX reassigns bits of the physical address for
> metadata. MKTME used several bits for an encryption KeyID. TDX
> uses a single bit in guests to communicate whether a physical page
> should be protected by TDX as private memory (bit set to 0) or
> unprotected and shared with the VMM (bit set to 1).
>
> __set_memory_enc_dec() is now aware of TDX and sets the Shared bit
> accordingly, followed by the relevant TDX hypercall.
>
> Also, do TDX_ACCEPT_PAGE on every 4k page after mapping the GPA range
> when converting memory to private. The 4k page size limit is due
> to a current TDX spec restriction. Also, if the GPA (range) was
> already mapped as an active, private page, the host VMM may remove
> the private page from the TD by following the “Removing TD Private
> Pages” sequence in the Intel TDX-module specification [1] to safely
> block the mapping(s), flush the TLB and cache, and remove the
> mapping(s).
>
> BUG() if TDX_ACCEPT_PAGE fails (except the "previously accepted page"
> case), as the guest is completely hosed if it can't access memory.
>
> [1] https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
>
> Tested-by: Kai Huang <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>

...

> diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
> index f063c885b0a5..119a9056efbb 100644
> --- a/arch/x86/mm/mem_encrypt_common.c
> +++ b/arch/x86/mm/mem_encrypt_common.c
> @@ -9,9 +9,18 @@
>
> #include <asm/mem_encrypt_common.h>
> #include <linux/dma-mapping.h>
> +#include <linux/cc_platform.h>
>
> /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
> bool force_dma_unencrypted(struct device *dev)
> {
> - return amd_force_dma_unencrypted(dev);
> + if (cc_platform_has(CC_ATTR_GUEST_TDX) &&
> + cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> + return true;
> +
> + if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) ||
> + cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
> + return amd_force_dma_unencrypted(dev);
> +
> + return false;

Assuming the original force_dma_unencrypted() function was moved here or
to cc_platform.c, then you shouldn't need any changes. Both SEV and TDX
require that true be returned if CC_ATTR_GUEST_MEM_ENCRYPT returns true. And
TDX should never return true for CC_ATTR_HOST_MEM_ENCRYPT.
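
In code, that suggestion reduces to something like this sketch (assuming
TDX reports CC_ATTR_GUEST_MEM_ENCRYPT and never CC_ATTR_HOST_MEM_ENCRYPT):

	/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
	bool force_dma_unencrypted(struct device *dev)
	{
		/* SEV and TDX guests: all DMA must use shared/decrypted pages */
		if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
			return true;

		/* SME host case keeps the existing per-device logic */
		if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
			return amd_force_dma_unencrypted(dev);

		return false;
	}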

> }
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 527957586f3c..6c531d5cb5fd 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -30,6 +30,7 @@
> #include <asm/proto.h>
> #include <asm/memtype.h>
> #include <asm/set_memory.h>
> +#include <asm/tdx.h>
>
> #include "../mm_internal.h"
>
> @@ -1981,8 +1982,10 @@ int set_memory_global(unsigned long addr, int numpages)
> __pgprot(_PAGE_GLOBAL), 0);
> }
>
> -static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> +static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
> {
> + pgprot_t mem_protected_bits, mem_plain_bits;
> + enum tdx_map_type map_type;
> struct cpa_data cpa;
> int ret;
>
> @@ -1997,8 +2000,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> memset(&cpa, 0, sizeof(cpa));
> cpa.vaddr = &addr;
> cpa.numpages = numpages;
> - cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
> - cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
> +
> + if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT)) {
> + mem_protected_bits = __pgprot(0);
> + mem_plain_bits = pgprot_cc_shared_mask();

How about having generic versions for both shared and private that return
the proper value for SEV or TDX. Then this remains looking similar to as
it does now, just replacing the __pgprot() calls with the appropriate
pgprot_cc_{shared,private}_mask().
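
One possible shape for those helpers, as a sketch that reuses
tdx_shared_mask() and the CC_ATTR_GUEST_SHARED_MAPPING_INIT attribute
from this series:

static pgprot_t pgprot_cc_shared_mask(void)
{
	/* For TDX the plain/shared mapping carries the Shared bit. */
	if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
		return __pgprot(tdx_shared_mask());

	return __pgprot(0);
}

static pgprot_t pgprot_cc_private_mask(void)
{
	/* For SME/SEV the protected/private mapping carries _PAGE_ENC. */
	if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
		return __pgprot(0);

	return __pgprot(_PAGE_ENC);
}

__set_memory_protect() would then keep its current structure, just
assigning cpa.mask_set/cpa.mask_clr from these two helpers.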

Thanks,
Tom

> + } else {
> + mem_protected_bits = __pgprot(_PAGE_ENC);
> + mem_plain_bits = __pgprot(0);
> + }
> +
> + if (protect) {
> + cpa.mask_set = mem_protected_bits;
> + cpa.mask_clr = mem_plain_bits;
> + map_type = TDX_MAP_PRIVATE;
> + } else {
> + cpa.mask_set = mem_plain_bits;
> + cpa.mask_clr = mem_protected_bits;
> + map_type = TDX_MAP_SHARED;
> + }
> +
> cpa.pgd = init_mm.pgd;
>
> /* Must avoid aliasing mappings in the highmem code */
> @@ -2006,9 +2026,17 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> vm_unmap_aliases();
>
> /*
> - * Before changing the encryption attribute, we need to flush caches.
> + * Before changing the encryption attribute, flush caches.
> + *
> + * For TDX, guest is responsible for flushing caches on private->shared
> + * transition. VMM is responsible for flushing on shared->private.
> */
> - cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> + if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
> + if (map_type == TDX_MAP_SHARED)
> + cpa_flush(&cpa, 1);
> + } else {
> + cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> + }
>
> ret = __change_page_attr_set_clr(&cpa, 1);
>
> @@ -2021,18 +2049,21 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> */
> cpa_flush(&cpa, 0);
>
> + if (!ret && cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
> + ret = tdx_hcall_gpa_intent(__pa(addr), numpages, map_type);
> +
> return ret;
> }
>
> int set_memory_encrypted(unsigned long addr, int numpages)
> {
> - return __set_memory_enc_dec(addr, numpages, true);
> + return __set_memory_protect(addr, numpages, true);
> }
> EXPORT_SYMBOL_GPL(set_memory_encrypted);
>
> int set_memory_decrypted(unsigned long addr, int numpages)
> {
> - return __set_memory_enc_dec(addr, numpages, false);
> + return __set_memory_protect(addr, numpages, false);
> }
> EXPORT_SYMBOL_GPL(set_memory_decrypted);
>
>

2021-10-20 16:41:00

by Tom Lendacky

Subject: Re: [PATCH v5 07/16] x86/kvm: Use bounce buffers for TD guest

On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Intel TDX doesn't allow VMM to directly access guest private memory.
> Any memory that is required for communication with VMM must be shared
> explicitly. The same rule applies for any DMA to and from TDX guest.
> All DMA pages have to be marked as shared pages. A generic way to achieve
> this without any changes to device drivers is to use the SWIOTLB
> framework.
>
> This method of handling is similar to AMD SEV. So extend this support
> for TDX guest as well. Also, since there is some common code between
> AMD SEV and TDX guests in mem_encrypt_init(), move it to
> mem_encrypt_common.c and call the AMD-specific init function from it.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since v4:
> * Replaced prot_guest_has() with cc_guest_has().
>
> Changes since v3:
> * Rebased on top of Tom Lendacky's protected guest
> changes (https://lore.kernel.org/patchwork/cover/1468760/)
>
> Changes since v1:
> * Removed sme_me_mask check for amd_mem_encrypt_init() in mem_encrypt_init().
>
> arch/x86/include/asm/mem_encrypt_common.h | 3 +++
> arch/x86/kernel/tdx.c | 2 ++
> arch/x86/mm/mem_encrypt.c | 5 +----
> arch/x86/mm/mem_encrypt_common.c | 14 ++++++++++++++
> 4 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
> index 697bc40a4e3d..bc90e565bce4 100644
> --- a/arch/x86/include/asm/mem_encrypt_common.h
> +++ b/arch/x86/include/asm/mem_encrypt_common.h
> @@ -8,11 +8,14 @@
>
> #ifdef CONFIG_AMD_MEM_ENCRYPT
> bool amd_force_dma_unencrypted(struct device *dev);
> +void __init amd_mem_encrypt_init(void);
> #else /* CONFIG_AMD_MEM_ENCRYPT */
> static inline bool amd_force_dma_unencrypted(struct device *dev)
> {
> return false;
> }
> +
> +static inline void amd_mem_encrypt_init(void) {}
> #endif /* CONFIG_AMD_MEM_ENCRYPT */
>
> #endif
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 433f366ca25c..ce8e3019b812 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -12,6 +12,7 @@
> #include <asm/insn.h>
> #include <asm/insn-eval.h>
> #include <linux/sched/signal.h> /* force_sig_fault() */
> +#include <linux/swiotlb.h>
>
> /* TDX Module call Leaf IDs */
> #define TDX_GET_INFO 1
> @@ -577,6 +578,7 @@ void __init tdx_early_init(void)
> pv_ops.irq.halt = tdx_halt;
>
> legacy_pic = &null_legacy_pic;
> + swiotlb_force = SWIOTLB_FORCE;
>
> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
> NULL, tdx_cpu_offline_prepare);
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 5d7fbed73949..8385bc4565e9 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -438,14 +438,11 @@ static void print_mem_encrypt_feature_info(void)
> }
>
> /* Architecture __weak replacement functions */
> -void __init mem_encrypt_init(void)
> +void __init amd_mem_encrypt_init(void)
> {
> if (!sme_me_mask)
> return;
>
> - /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
> - swiotlb_update_mem_attributes();
> -
> /*
> * With SEV, we need to unroll the rep string I/O instructions,
> * but SEV-ES supports them through the #VC handler.
> diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
> index 119a9056efbb..6fe44c6cb753 100644
> --- a/arch/x86/mm/mem_encrypt_common.c
> +++ b/arch/x86/mm/mem_encrypt_common.c
> @@ -10,6 +10,7 @@
> #include <asm/mem_encrypt_common.h>
> #include <linux/dma-mapping.h>
> #include <linux/cc_platform.h>
> +#include <linux/swiotlb.h>
>
> /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
> bool force_dma_unencrypted(struct device *dev)
> @@ -24,3 +25,16 @@ bool force_dma_unencrypted(struct device *dev)
>
> return false;
> }
> +
> +/* Architecture __weak replacement functions */
> +void __init mem_encrypt_init(void)
> +{
> + /*
> + * For TDX guest or SEV/SME, call into SWIOTLB to update
> + * the SWIOTLB DMA buffers
> + */
> + if (sme_me_mask || cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))

Can't you just make this:

if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))

SEV will return true if sme_me_mask is not zero and TDX should only return
true if it is TDX guest, right?
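
With that, the common function reduces to this sketch:

/* Sketch: common mem_encrypt_init() with the simplified check. */
void __init mem_encrypt_init(void)
{
	/*
	 * CC_ATTR_MEM_ENCRYPT is true for SME/SEV (sme_me_mask != 0) and
	 * for TDX guests, so a single check covers every SWIOTLB user.
	 */
	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
		swiotlb_update_mem_attributes();

	amd_mem_encrypt_init();
}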

Thanks,
Tom

> + swiotlb_update_mem_attributes();
> +
> + amd_mem_encrypt_init();
> +}
>

Subject: Re: [PATCH v5 04/16] x86/tdx: Make pages shared in ioremap()



On 10/20/21 9:03 AM, Tom Lendacky wrote:
> On 10/8/21 7:36 PM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> All ioremap()ed pages that are not backed by normal memory (NONE or
>> RESERVED) have to be mapped as shared.
>>
>> Reuse the infrastructure from AMD SEV code.
>>
>> Note that the DMA code doesn't use ioremap() to convert memory to shared,
>> as DMA buffers are backed by normal memory. The DMA code makes buffers
>> shared with set_memory_decrypted().
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> Reviewed-by: Andi Kleen <[email protected]>
>> Reviewed-by: Tony Luck <[email protected]>
>> Signed-off-by: Kuppuswamy Sathyanarayanan
>> <[email protected]>
>> ---
>>
>> Changes since v4:
>>   * Renamed "protected_guest" to "cc_guest".
>>   * Replaced use of prot_guest_has() with cc_guest_has()
>>   * Modified the patch to adapt to latest version cc_guest_has()
>>     changes.
>>
>> Changes since v3:
>>   * Rebased on top of Tom Lendacky's protected guest
>>     changes (https://lore.kernel.org/patchwork/cover/1468760/)
>>
>> Changes since v1:
>>   * Fixed format issues in commit log.
>>
>>   arch/x86/include/asm/pgtable.h |  4 ++++
>>   arch/x86/mm/ioremap.c          |  8 ++++++--
>>   include/linux/cc_platform.h    | 13 +++++++++++++
>>   3 files changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/pgtable.h
>> b/arch/x86/include/asm/pgtable.h
>> index 448cd01eb3ec..ecefccbdf2e3 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -21,6 +21,10 @@
>>   #define pgprot_encrypted(prot)    __pgprot(__sme_set(pgprot_val(prot)))
>>   #define pgprot_decrypted(prot)    __pgprot(__sme_clr(pgprot_val(prot)))
>> +/* Make the page accessible by VMM for confidential guests */
>> +#define pgprot_cc_guest(prot) __pgprot(pgprot_val(prot) |    \
>> +                          tdx_shared_mask())
>
> So this is only for TDX guests, so maybe a name that is less generic.
>
> Alternatively, you could create pgprot_private()/pgprot_shared()
> functions that check for SME/SEV or TDX and do the proper thing.
>
> Then you can redefine pgprot_encrypted()/pgprot_decrypted() to the new
> functions?

Makes sense. It will be a better abstraction. I will make this change in
next version.

>
>> +
>>   #ifndef __ASSEMBLY__
>>   #include <asm/x86_init.h>
>>   #include <asm/pkru.h>
>> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
>> index 026031b3b782..83daa3f8f39c 100644
>> --- a/arch/x86/mm/ioremap.c
>> +++ b/arch/x86/mm/ioremap.c
>> @@ -17,6 +17,7 @@
>>   #include <linux/cc_platform.h>
>>   #include <linux/efi.h>
>>   #include <linux/pgtable.h>
>> +#include <linux/cc_platform.h>
>>   #include <asm/set_memory.h>
>>   #include <asm/e820/api.h>
>> @@ -26,6 +27,7 @@
>>   #include <asm/pgalloc.h>
>>   #include <asm/memtype.h>
>>   #include <asm/setup.h>
>> +#include <asm/tdx.h>
>>   #include "physaddr.h"
>> @@ -87,8 +89,8 @@ static unsigned int __ioremap_check_ram(struct
>> resource *res)
>>   }
>>   /*
>> - * In a SEV guest, NONE and RESERVED should not be mapped encrypted
>> because
>> - * there the whole memory is already encrypted.
>> + * In a SEV or TDX guest, NONE and RESERVED should not be mapped
>> encrypted (or
>> + * private in TDX case) because there the whole memory is already
>> encrypted.
>>    */
>>   static unsigned int __ioremap_check_encrypted(struct resource *res)
>>   {
>> @@ -246,6 +248,8 @@ __ioremap_caller(resource_size_t phys_addr,
>> unsigned long size,
>>       prot = PAGE_KERNEL_IO;
>>       if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
>>           prot = pgprot_encrypted(prot);
>> +    else if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
>> +        prot = pgprot_cc_guest(prot);
>
> And if you do the new functions this could be:
>
>     if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
>         prot = pgprot_private(prot);
>     else
>         prot = pgprot_shared(prot);

Yes. I will make this change in next version.
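
A sketch of those helpers, assuming tdx_shared_mask() (from patch 03)
returns 0 outside TDX guests, and relying on __sme_set()/__sme_clr()
being no-ops without SME/SEV:

#define pgprot_private(prot)	\
	__pgprot(__sme_set(pgprot_val(prot) & ~tdx_shared_mask()))
#define pgprot_shared(prot)	\
	__pgprot(__sme_clr(pgprot_val(prot)) | tdx_shared_mask())

With these, a single definition covers SEV, TDX and bare metal, and
pgprot_encrypted()/pgprot_decrypted() could simply map onto them.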

>
> Thanks,
> Tom
>
>>       switch (pcm) {
>>       case _PAGE_CACHE_MODE_UC:
>> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
>> index 7728574d7783..edb1d7a2f6af 100644
>> --- a/include/linux/cc_platform.h
>> +++ b/include/linux/cc_platform.h
>> @@ -81,6 +81,19 @@ enum cc_attr {
>>        * Examples include TDX Guest.
>>        */
>>       CC_ATTR_GUEST_UNROLL_STRING_IO,
>> +
>> +    /**
>> +     * @CC_ATTR_GUEST_SHARED_MAPPING_INIT: IO Remapped memory is marked
>> +     *                       as shared.
>> +     *
>> +     * The platform/OS is running as a guest/virtual machine and
>> +     * initializes all IO remapped memory as shared.
>> +     *
>> +     * Examples include TDX Guest (SEV marks all pages as shared by
>> default
>> +     * so this feature cannot be enabled for it).
>> +     */
>> +    CC_ATTR_GUEST_SHARED_MAPPING_INIT,
>> +
>>   };
>>   #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
>>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v5 01/16] x86/mm: Move force_dma_unencrypted() to common code



On 10/20/21 9:11 AM, Tom Lendacky wrote:
>> Intel TDX doesn't allow VMM to access guest private memory. Any memory
>> that is required for communication with VMM must be shared explicitly
>> by setting the bit in page table entry. After setting the shared bit,
>> the conversion must be completed with MapGPA hypercall. Details about
>> MapGPA hypercall can be found in [1], sec 3.2.
>>
>> The call informs VMM about the conversion between private/shared
>> mappings. The shared memory is similar to unencrypted memory in AMD
>> SME/SEV terminology but the underlying process of sharing/un-sharing
>> the memory is different for Intel TDX guest platform.
>>
>> SEV assumes that I/O devices can only do DMA to "decrypted" physical
>> addresses without the C-bit set. In order for the CPU to interact with
>> this memory, the CPU needs a decrypted mapping. To add this support,
>> AMD SME code forces force_dma_unencrypted() to return true for
>> platforms that support AMD SEV feature. It will be used for DMA memory
>> allocation API to trigger set_memory_decrypted() for platforms that
>> support AMD SEV feature.
>>
>> TDX is similar. So, to communicate with I/O devices, related pages need
>> to be marked as shared. As mentioned above, shared memory in TDX
>> architecture is similar to decrypted memory in AMD SME/SEV. So similar
>> to AMD SEV, force_dma_unencrypted() has to be forced to return true. This
>> support is added in other patches in this series.
>>
>> So move force_dma_unencrypted() out of AMD specific code and call AMD
>> specific (amd_force_dma_unencrypted()) initialization function from it.
>> force_dma_unencrypted() will be modified by later patches to include
>> Intel TDX guest platform specific initialization.
>>
>> Also, introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
>> selected by all x86 memory encryption features. This will be selected
>> by both AMD SEV and Intel TDX guest config options.
>>
>> This is preparation for TDX changes in DMA code and it has no
>> functional change.
>
> Can force_dma_unencrypted() be moved to arch/x86/kernel/cc_platform.c,
> instead of creating a new file? It might fit better with patch #6.

Please check the final version of mem_encrypt_common.c

https://github.com/intel/tdx/blob/guest/arch/x86/mm/mem_encrypt_common.c

I am not sure whether it is alright to move mem_encrypt_init() and
arch_has_restricted_virtio_memory_access() to cc_platform.c

If this is fine, I can get rid of mem_encrypt_common.c

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v5 06/16] x86/tdx: Make DMA pages shared



On 10/20/21 9:33 AM, Tom Lendacky wrote:
> On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> Just like MKTME, TDX reassigns bits of the physical address for
>> metadata.  MKTME used several bits for an encryption KeyID. TDX
>> uses a single bit in guests to communicate whether a physical page
>> should be protected by TDX as private memory (bit set to 0) or
>> unprotected and shared with the VMM (bit set to 1).
>>
>> __set_memory_enc_dec() is now aware of TDX and sets the Shared bit
>> accordingly, following up with the relevant TDX hypercall.
>>
>> Also, do TDX_ACCEPT_PAGE on every 4k page after mapping the GPA range
>> when converting memory to private. The 4k page size limit is due
>> to a current TDX spec restriction. Also, if the GPA (range) was
>> already mapped as an active, private page, the host VMM may remove
>> the private page from the TD by following the “Removing TD Private
>> Pages” sequence in the Intel TDX-module specification [1] to safely
>> block the mapping(s), flush the TLB and cache, and remove the
>> mapping(s).
>>
>> BUG() if TDX_ACCEPT_PAGE fails (except the "previously accepted page"
>> case), as the guest is completely hosed if it can't access memory.
>>
>> [1]
>> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
>>
>>
>> Tested-by: Kai Huang <[email protected]>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Reviewed-by: Andi Kleen <[email protected]>
>> Reviewed-by: Tony Luck <[email protected]>
>> Signed-off-by: Kuppuswamy Sathyanarayanan
>> <[email protected]>
>
> ...
>
>> diff --git a/arch/x86/mm/mem_encrypt_common.c
>> b/arch/x86/mm/mem_encrypt_common.c
>> index f063c885b0a5..119a9056efbb 100644
>> --- a/arch/x86/mm/mem_encrypt_common.c
>> +++ b/arch/x86/mm/mem_encrypt_common.c
>> @@ -9,9 +9,18 @@
>>   #include <asm/mem_encrypt_common.h>
>>   #include <linux/dma-mapping.h>
>> +#include <linux/cc_platform.h>
>>   /* Override for DMA direct allocation check -
>> ARCH_HAS_FORCE_DMA_UNENCRYPTED */
>>   bool force_dma_unencrypted(struct device *dev)
>>   {
>> -    return amd_force_dma_unencrypted(dev);
>> +    if (cc_platform_has(CC_ATTR_GUEST_TDX) &&
>> +        cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>> +        return true;
>> +
>> +    if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) ||
>> +        cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
>> +        return amd_force_dma_unencrypted(dev);
>> +
>> +    return false;
>
> Assuming the original force_dma_unencrypted() function was moved here or
> cc_platform.c, then you shouldn't need any changes. Both SEV and TDX
> require true be returned if CC_ATTR_GUEST_MEM_ENCRYPT returns true. And
> then TDX should never return true for CC_ATTR_HOST_MEM_ENCRYPT.


For the non-TDX case, with CC_ATTR_HOST_MEM_ENCRYPT, we should still call
amd_force_dma_unencrypted(), right?

>
>>   }
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index 527957586f3c..6c531d5cb5fd 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -30,6 +30,7 @@
>>   #include <asm/proto.h>
>>   #include <asm/memtype.h>
>>   #include <asm/set_memory.h>
>> +#include <asm/tdx.h>
>>   #include "../mm_internal.h"
>> @@ -1981,8 +1982,10 @@ int set_memory_global(unsigned long addr, int
>> numpages)
>>                       __pgprot(_PAGE_GLOBAL), 0);
>>   }
>> -static int __set_memory_enc_dec(unsigned long addr, int numpages,
>> bool enc)
>> +static int __set_memory_protect(unsigned long addr, int numpages,
>> bool protect)
>>   {
>> +    pgprot_t mem_protected_bits, mem_plain_bits;
>> +    enum tdx_map_type map_type;
>>       struct cpa_data cpa;
>>       int ret;
>> @@ -1997,8 +2000,25 @@ static int __set_memory_enc_dec(unsigned long
>> addr, int numpages, bool enc)
>>       memset(&cpa, 0, sizeof(cpa));
>>       cpa.vaddr = &addr;
>>       cpa.numpages = numpages;
>> -    cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
>> -    cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
>> +
>> +    if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT)) {
>> +        mem_protected_bits = __pgprot(0);
>> +        mem_plain_bits = pgprot_cc_shared_mask();
>
> How about having generic versions for both shared and private that
> return the proper value for SEV or TDX. Then this remains looking
> similar to as it does now, just replacing the __pgprot() calls with the
> appropriate pgprot_cc_{shared,private}_mask().

Makes sense.

>
> Thanks,
> Tom
>
>> +    } else {
>> +        mem_protected_bits = __pgprot(_PAGE_ENC);
>> +        mem_plain_bits = __pgprot(0);
>> +    }
>> +
>> +    if (protect) {
>> +        cpa.mask_set = mem_protected_bits;
>> +        cpa.mask_clr = mem_plain_bits;
>> +        map_type = TDX_MAP_PRIVATE;
>> +    } else {
>> +        cpa.mask_set = mem_plain_bits;
>> +        cpa.mask_clr = mem_protected_bits;
>> +        map_type = TDX_MAP_SHARED;
>> +    }
>> +
>>       cpa.pgd = init_mm.pgd;
>>       /* Must avoid aliasing mappings in the highmem code */
>> @@ -2006,9 +2026,17 @@ static int __set_memory_enc_dec(unsigned long
>> addr, int numpages, bool enc)
>>       vm_unmap_aliases();
>>       /*
>> -     * Before changing the encryption attribute, we need to flush
>> caches.
>> +     * Before changing the encryption attribute, flush caches.
>> +     *
>> +     * For TDX, guest is responsible for flushing caches on
>> private->shared
>> +     * transition. VMM is responsible for flushing on shared->private.
>>        */
>> -    cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
>> +    if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
>> +        if (map_type == TDX_MAP_SHARED)
>> +            cpa_flush(&cpa, 1);
>> +    } else {
>> +        cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
>> +    }
>>       ret = __change_page_attr_set_clr(&cpa, 1);
>> @@ -2021,18 +2049,21 @@ static int __set_memory_enc_dec(unsigned long
>> addr, int numpages, bool enc)
>>        */
>>       cpa_flush(&cpa, 0);
>> +    if (!ret && cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
>> +        ret = tdx_hcall_gpa_intent(__pa(addr), numpages, map_type);
>> +
>>       return ret;
>>   }
>>   int set_memory_encrypted(unsigned long addr, int numpages)
>>   {
>> -    return __set_memory_enc_dec(addr, numpages, true);
>> +    return __set_memory_protect(addr, numpages, true);
>>   }
>>   EXPORT_SYMBOL_GPL(set_memory_encrypted);
>>   int set_memory_decrypted(unsigned long addr, int numpages)
>>   {
>> -    return __set_memory_enc_dec(addr, numpages, false);
>> +    return __set_memory_protect(addr, numpages, false);
>>   }
>>   EXPORT_SYMBOL_GPL(set_memory_decrypted);
>>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v5 07/16] x86/kvm: Use bounce buffers for TD guest



On 10/20/21 9:39 AM, Tom Lendacky wrote:
> On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> Intel TDX doesn't allow VMM to directly access guest private memory.
>> Any memory that is required for communication with VMM must be shared
>> explicitly. The same rule applies for any DMA to and from TDX guest.
>> All DMA pages have to be marked as shared pages. A generic way to achieve
>> this without any changes to device drivers is to use the SWIOTLB
>> framework.
>>
>> This method of handling is similar to AMD SEV. So extend this support
>> for TDX guest as well. Also, since there is some common code between
>> AMD SEV and TDX guests in mem_encrypt_init(), move it to
>> mem_encrypt_common.c and call the AMD-specific init function from it.
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> Reviewed-by: Andi Kleen <[email protected]>
>> Reviewed-by: Tony Luck <[email protected]>
>> Signed-off-by: Kuppuswamy Sathyanarayanan
>> <[email protected]>
>> ---
>>
>> Changes since v4:
>>   * Replaced prot_guest_has() with cc_guest_has().
>>
>> Changes since v3:
>>   * Rebased on top of Tom Lendacky's protected guest
>>     changes (https://lore.kernel.org/patchwork/cover/1468760/)
>>
>> Changes since v1:
>>   * Removed sme_me_mask check for amd_mem_encrypt_init() in
>> mem_encrypt_init().
>>
>>   arch/x86/include/asm/mem_encrypt_common.h |  3 +++
>>   arch/x86/kernel/tdx.c                     |  2 ++
>>   arch/x86/mm/mem_encrypt.c                 |  5 +----
>>   arch/x86/mm/mem_encrypt_common.c          | 14 ++++++++++++++
>>   4 files changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mem_encrypt_common.h
>> b/arch/x86/include/asm/mem_encrypt_common.h
>> index 697bc40a4e3d..bc90e565bce4 100644
>> --- a/arch/x86/include/asm/mem_encrypt_common.h
>> +++ b/arch/x86/include/asm/mem_encrypt_common.h
>> @@ -8,11 +8,14 @@
>>   #ifdef CONFIG_AMD_MEM_ENCRYPT
>>   bool amd_force_dma_unencrypted(struct device *dev);
>> +void __init amd_mem_encrypt_init(void);
>>   #else /* CONFIG_AMD_MEM_ENCRYPT */
>>   static inline bool amd_force_dma_unencrypted(struct device *dev)
>>   {
>>       return false;
>>   }
>> +
>> +static inline void amd_mem_encrypt_init(void) {}
>>   #endif /* CONFIG_AMD_MEM_ENCRYPT */
>>   #endif
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 433f366ca25c..ce8e3019b812 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -12,6 +12,7 @@
>>   #include <asm/insn.h>
>>   #include <asm/insn-eval.h>
>>   #include <linux/sched/signal.h> /* force_sig_fault() */
>> +#include <linux/swiotlb.h>
>>   /* TDX Module call Leaf IDs */
>>   #define TDX_GET_INFO            1
>> @@ -577,6 +578,7 @@ void __init tdx_early_init(void)
>>       pv_ops.irq.halt = tdx_halt;
>>       legacy_pic = &null_legacy_pic;
>> +    swiotlb_force = SWIOTLB_FORCE;
>>       cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
>>                 NULL, tdx_cpu_offline_prepare);
>> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
>> index 5d7fbed73949..8385bc4565e9 100644
>> --- a/arch/x86/mm/mem_encrypt.c
>> +++ b/arch/x86/mm/mem_encrypt.c
>> @@ -438,14 +438,11 @@ static void print_mem_encrypt_feature_info(void)
>>   }
>>   /* Architecture __weak replacement functions */
>> -void __init mem_encrypt_init(void)
>> +void __init amd_mem_encrypt_init(void)
>>   {
>>       if (!sme_me_mask)
>>           return;
>> -    /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
>> -    swiotlb_update_mem_attributes();
>> -
>>       /*
>>        * With SEV, we need to unroll the rep string I/O instructions,
>>        * but SEV-ES supports them through the #VC handler.
>> diff --git a/arch/x86/mm/mem_encrypt_common.c
>> b/arch/x86/mm/mem_encrypt_common.c
>> index 119a9056efbb..6fe44c6cb753 100644
>> --- a/arch/x86/mm/mem_encrypt_common.c
>> +++ b/arch/x86/mm/mem_encrypt_common.c
>> @@ -10,6 +10,7 @@
>>   #include <asm/mem_encrypt_common.h>
>>   #include <linux/dma-mapping.h>
>>   #include <linux/cc_platform.h>
>> +#include <linux/swiotlb.h>
>>   /* Override for DMA direct allocation check -
>> ARCH_HAS_FORCE_DMA_UNENCRYPTED */
>>   bool force_dma_unencrypted(struct device *dev)
>> @@ -24,3 +25,16 @@ bool force_dma_unencrypted(struct device *dev)
>>       return false;
>>   }
>> +
>> +/* Architecture __weak replacement functions */
>> +void __init mem_encrypt_init(void)
>> +{
>> +    /*
>> +     * For TDX guest or SEV/SME, call into SWIOTLB to update
>> +     * the SWIOTLB DMA buffers
>> +     */
>> +    if (sme_me_mask || cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>
> Can't you just make this:
>
>     if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>
> SEV will return true if sme_me_mask is not zero and TDX should only
> return true if it is TDX guest, right?

Yes. It can be simplified.

But where shall we leave this function, cc_platform.c or here?

>
> Thanks,
> Tom
>
>> +        swiotlb_update_mem_attributes();
>> +
>> +    amd_mem_encrypt_init();
>> +}
>>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-10-20 17:24:59

by Tom Lendacky

Subject: Re: [PATCH v5 06/16] x86/tdx: Make DMA pages shared

On 10/20/21 11:45 AM, Sathyanarayanan Kuppuswamy wrote:
> On 10/20/21 9:33 AM, Tom Lendacky wrote:
>> On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:

...

>>>   bool force_dma_unencrypted(struct device *dev)
>>>   {
>>> -    return amd_force_dma_unencrypted(dev);
>>> +    if (cc_platform_has(CC_ATTR_GUEST_TDX) &&
>>> +        cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>>> +        return true;
>>> +
>>> +    if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) ||
>>> +        cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
>>> +        return amd_force_dma_unencrypted(dev);
>>> +
>>> +    return false;
>>
>> Assuming the original force_dma_unencrypted() function was moved here or
>> cc_platform.c, then you shouldn't need any changes. Both SEV and TDX
>> require true be returned if CC_ATTR_GUEST_MEM_ENCRYPT returns true. And
>> then TDX should never return true for CC_ATTR_HOST_MEM_ENCRYPT.
>
>
> For the non-TDX case, with CC_ATTR_HOST_MEM_ENCRYPT, we should still call
> amd_force_dma_unencrypted(), right?

What I'm saying is that you wouldn't have amd_force_dma_unencrypted(). I
think the whole force_dma_unencrypted() can exist as-is in a different
file, whether that's cc_platform.c or mem_encrypt_common.c.

It will return true for an SEV or TDX guest, true for an SME host based on
the DMA mask or else false. That should work just fine for TDX.
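
In other words, something like this sketch, where the SME branch is
carried over from the current implementation:

bool force_dma_unencrypted(struct device *dev)
{
	/* SEV and TDX guests: DMA always goes through unencrypted/shared pages. */
	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
		return true;

	/*
	 * SME host: unencrypted DMA is needed only when the device cannot
	 * address memory with the encryption bit set.
	 */
	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
						dev->bus_dma_limit);

		if (dma_dev_mask <= dma_enc_mask)
			return true;
	}

	return false;
}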

Thanks,
Tom

>
>>
>>>   }
>>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>>> index 527957586f3c..6c531d5cb5fd 100644
>>> --- a/arch/x86/mm/pat/set_memory.c
>>> +++ b/arch/x86/mm/pat/set_memory.c
>>> @@ -30,6 +30,7 @@
>>>   #include <asm/proto.h>
>>>   #include <asm/memtype.h>
>>>   #include <asm/set_memory.h>
>>> +#include <asm/tdx.h>
>>>   #include "../mm_internal.h"
>>> @@ -1981,8 +1982,10 @@ int set_memory_global(unsigned long addr, int
>>> numpages)
>>>                       __pgprot(_PAGE_GLOBAL), 0);
>>>   }
>>> -static int __set_memory_enc_dec(unsigned long addr, int numpages, bool
>>> enc)
>>> +static int __set_memory_protect(unsigned long addr, int numpages, bool
>>> protect)
>>>   {
>>> +    pgprot_t mem_protected_bits, mem_plain_bits;
>>> +    enum tdx_map_type map_type;
>>>       struct cpa_data cpa;
>>>       int ret;
>>> @@ -1997,8 +2000,25 @@ static int __set_memory_enc_dec(unsigned long
>>> addr, int numpages, bool enc)
>>>       memset(&cpa, 0, sizeof(cpa));
>>>       cpa.vaddr = &addr;
>>>       cpa.numpages = numpages;
>>> -    cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
>>> -    cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
>>> +
>>> +    if (cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT)) {
>>> +        mem_protected_bits = __pgprot(0);
>>> +        mem_plain_bits = pgprot_cc_shared_mask();
>>
>> How about having generic versions for both shared and private that
>> return the proper value for SEV or TDX. Then this remains looking
>> similar to as it does now, just replacing the __pgprot() calls with the
>> appropriate pgprot_cc_{shared,private}_mask().
>
> Makes sense.
>
>>
>> Thanks,
>> Tom
>>
>>> +    } else {
>>> +        mem_protected_bits = __pgprot(_PAGE_ENC);
>>> +        mem_plain_bits = __pgprot(0);
>>> +    }
>>> +
>>> +    if (protect) {
>>> +        cpa.mask_set = mem_protected_bits;
>>> +        cpa.mask_clr = mem_plain_bits;
>>> +        map_type = TDX_MAP_PRIVATE;
>>> +    } else {
>>> +        cpa.mask_set = mem_plain_bits;
>>> +        cpa.mask_clr = mem_protected_bits;
>>> +        map_type = TDX_MAP_SHARED;
>>> +    }
>>> +
>>>       cpa.pgd = init_mm.pgd;
>>>       /* Must avoid aliasing mappings in the highmem code */
>>> @@ -2006,9 +2026,17 @@ static int __set_memory_enc_dec(unsigned long
>>> addr, int numpages, bool enc)
>>>       vm_unmap_aliases();
>>>       /*
>>> -     * Before changing the encryption attribute, we need to flush caches.
>>> +     * Before changing the encryption attribute, flush caches.
>>> +     *
>>> +     * For TDX, guest is responsible for flushing caches on
>>> private->shared
>>> +     * transition. VMM is responsible for flushing on shared->private.
>>>        */
>>> -    cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
>>> +    if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
>>> +        if (map_type == TDX_MAP_SHARED)
>>> +            cpa_flush(&cpa, 1);
>>> +    } else {
>>> +        cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
>>> +    }
>>>       ret = __change_page_attr_set_clr(&cpa, 1);
>>> @@ -2021,18 +2049,21 @@ static int __set_memory_enc_dec(unsigned long
>>> addr, int numpages, bool enc)
>>>        */
>>>       cpa_flush(&cpa, 0);
>>> +    if (!ret && cc_platform_has(CC_ATTR_GUEST_SHARED_MAPPING_INIT))
>>> +        ret = tdx_hcall_gpa_intent(__pa(addr), numpages, map_type);
>>> +
>>>       return ret;
>>>   }
>>>   int set_memory_encrypted(unsigned long addr, int numpages)
>>>   {
>>> -    return __set_memory_enc_dec(addr, numpages, true);
>>> +    return __set_memory_protect(addr, numpages, true);
>>>   }
>>>   EXPORT_SYMBOL_GPL(set_memory_encrypted);
>>>   int set_memory_decrypted(unsigned long addr, int numpages)
>>>   {
>>> -    return __set_memory_enc_dec(addr, numpages, false);
>>> +    return __set_memory_protect(addr, numpages, false);
>>>   }
>>>   EXPORT_SYMBOL_GPL(set_memory_decrypted);
>>>
>

2021-10-20 17:28:04

by Tom Lendacky

Subject: Re: [PATCH v5 07/16] x86/kvm: Use bounce buffers for TD guest

On 10/20/21 11:50 AM, Sathyanarayanan Kuppuswamy wrote:
>
>
> On 10/20/21 9:39 AM, Tom Lendacky wrote:
>> On 10/8/21 7:37 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" <[email protected]>
>>>
>>> Intel TDX doesn't allow VMM to directly access guest private memory.
>>> Any memory that is required for communication with VMM must be shared
>>> explicitly. The same rule applies for any DMA to and from TDX guest.
>>> All DMA pages have to be marked as shared pages. A generic way to achieve
>>> this without any changes to device drivers is to use the SWIOTLB
>>> framework.
>>>
>>> This method of handling is similar to AMD SEV. So extend this support
>>> for TDX guest as well. Also, since there is some common code between
>>> AMD SEV and TDX guests in mem_encrypt_init(), move it to
>>> mem_encrypt_common.c and call the AMD-specific init function from it.
>>>
>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> Reviewed-by: Andi Kleen <[email protected]>
>>> Reviewed-by: Tony Luck <[email protected]>
>>> Signed-off-by: Kuppuswamy Sathyanarayanan
>>> <[email protected]>
>>> ---
>>>
>>> Changes since v4:
>>>   * Replaced prot_guest_has() with cc_guest_has().
>>>
>>> Changes since v3:
>>>   * Rebased on top of Tom Lendacky's protected guest
>>>     changes
>>> (https://lore.kernel.org/patchwork/cover/1468760/)
>>>
>>>
>>> Changes since v1:
>>>   * Removed sme_me_mask check for amd_mem_encrypt_init() in
>>> mem_encrypt_init().
>>>
>>>   arch/x86/include/asm/mem_encrypt_common.h |  3 +++
>>>   arch/x86/kernel/tdx.c                     |  2 ++
>>>   arch/x86/mm/mem_encrypt.c                 |  5 +----
>>>   arch/x86/mm/mem_encrypt_common.c          | 14 ++++++++++++++
>>>   4 files changed, 20 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/mem_encrypt_common.h
>>> b/arch/x86/include/asm/mem_encrypt_common.h
>>> index 697bc40a4e3d..bc90e565bce4 100644
>>> --- a/arch/x86/include/asm/mem_encrypt_common.h
>>> +++ b/arch/x86/include/asm/mem_encrypt_common.h
>>> @@ -8,11 +8,14 @@
>>>   #ifdef CONFIG_AMD_MEM_ENCRYPT
>>>   bool amd_force_dma_unencrypted(struct device *dev);
>>> +void __init amd_mem_encrypt_init(void);
>>>   #else /* CONFIG_AMD_MEM_ENCRYPT */
>>>   static inline bool amd_force_dma_unencrypted(struct device *dev)
>>>   {
>>>       return false;
>>>   }
>>> +
>>> +static inline void amd_mem_encrypt_init(void) {}
>>>   #endif /* CONFIG_AMD_MEM_ENCRYPT */
>>>   #endif
>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>> index 433f366ca25c..ce8e3019b812 100644
>>> --- a/arch/x86/kernel/tdx.c
>>> +++ b/arch/x86/kernel/tdx.c
>>> @@ -12,6 +12,7 @@
>>>   #include <asm/insn.h>
>>>   #include <asm/insn-eval.h>
>>>   #include <linux/sched/signal.h> /* force_sig_fault() */
>>> +#include <linux/swiotlb.h>
>>>   /* TDX Module call Leaf IDs */
>>>   #define TDX_GET_INFO            1
>>> @@ -577,6 +578,7 @@ void __init tdx_early_init(void)
>>>       pv_ops.irq.halt = tdx_halt;
>>>       legacy_pic = &null_legacy_pic;
>>> +    swiotlb_force = SWIOTLB_FORCE;
>>>       cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
>>>                 NULL, tdx_cpu_offline_prepare);
>>> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
>>> index 5d7fbed73949..8385bc4565e9 100644
>>> --- a/arch/x86/mm/mem_encrypt.c
>>> +++ b/arch/x86/mm/mem_encrypt.c
>>> @@ -438,14 +438,11 @@ static void print_mem_encrypt_feature_info(void)
>>>   }
>>>   /* Architecture __weak replacement functions */
>>> -void __init mem_encrypt_init(void)
>>> +void __init amd_mem_encrypt_init(void)
>>>   {
>>>       if (!sme_me_mask)
>>>           return;
>>> -    /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
>>> -    swiotlb_update_mem_attributes();
>>> -
>>>       /*
>>>        * With SEV, we need to unroll the rep string I/O instructions,
>>>        * but SEV-ES supports them through the #VC handler.
>>> diff --git a/arch/x86/mm/mem_encrypt_common.c
>>> b/arch/x86/mm/mem_encrypt_common.c
>>> index 119a9056efbb..6fe44c6cb753 100644
>>> --- a/arch/x86/mm/mem_encrypt_common.c
>>> +++ b/arch/x86/mm/mem_encrypt_common.c
>>> @@ -10,6 +10,7 @@
>>>   #include <asm/mem_encrypt_common.h>
>>>   #include <linux/dma-mapping.h>
>>>   #include <linux/cc_platform.h>
>>> +#include <linux/swiotlb.h>
>>>   /* Override for DMA direct allocation check -
>>> ARCH_HAS_FORCE_DMA_UNENCRYPTED */
>>>   bool force_dma_unencrypted(struct device *dev)
>>> @@ -24,3 +25,16 @@ bool force_dma_unencrypted(struct device *dev)
>>>       return false;
>>>   }
>>> +
>>> +/* Architecture __weak replacement functions */
>>> +void __init mem_encrypt_init(void)
>>> +{
>>> +    /*
>>> +     * For TDX guest or SEV/SME, call into SWIOTLB to update
>>> +     * the SWIOTLB DMA buffers
>>> +     */
>>> +    if (sme_me_mask || cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>>
>> Can't you just make this:
>>
>>      if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>>
>> SEV will return true if sme_me_mask is not zero and TDX should only
>> return true if it is TDX guest, right?
>
> Yes. It can be simplified.
>
> But where shall we leave this function, cc_platform.c or here?

Either one works... all depends on how the maintainers feel about
creating/using mem_encrypt_common.c or using cc_platform.c.

Thanks,
Tom

>
>>
>> Thanks,
>> Tom
>>
>>> +        swiotlb_update_mem_attributes();
>>> +
>>> +    amd_mem_encrypt_init();
>>> +}
>>>
>

Subject: Re: [PATCH v5 06/16] x86/tdx: Make DMA pages shared



On 10/20/21 10:22 AM, Tom Lendacky wrote:
>>
>> For the non-TDX case, with CC_ATTR_HOST_MEM_ENCRYPT, we should still call
>> amd_force_dma_unencrypted(), right?
>
> What I'm saying is that you wouldn't have amd_force_dma_unencrypted(). I
> think the whole force_dma_unencrypted() can exist as-is in a different
> file, whether that's cc_platform.c or mem_encrypt_common.c.
>
> It will return true for an SEV or TDX guest, true for an SME host based
> on the DMA mask or else false. That should work just fine for TDX.

Got it. Thanks for clarifying it.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-11-06 00:18:18

by Sean Christopherson

Subject: Re: [PATCH v5 03/16] x86/tdx: Exclude Shared bit from physical_mask

On Fri, Oct 08, 2021, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Just like MKTME, TDX reassigns bits of the physical address for
> metadata. MKTME used several bits for an encryption KeyID. TDX
> uses a single bit in guests to communicate whether a physical page
> should be protected by TDX as private memory (bit set to 0) or
> unprotected and shared with the VMM (bit set to 1).
>
> Add a helper, tdx_shared_mask(), to generate the mask. The processor
> enumerates its physical address width to include the shared bit, which
> means it gets included in __PHYSICAL_MASK by default.

This is incorrect. The shared bit _may_ be a legal PA bit, but AIUI it's not a
hard requirement.

> Remove the shared mask from 'physical_mask' since any bits in
> tdx_shared_mask() are not used for physical addresses in page table
> entries.

...

> @@ -94,6 +100,9 @@ static void tdx_get_info(void)
>
> td_info.gpa_width = out.rcx & GENMASK(5, 0);
> td_info.attributes = out.rdx;
> +
> + /* Exclude Shared bit from the __PHYSICAL_MASK */
> + physical_mask &= ~tdx_shared_mask();

This is insufficient, though it's not really the fault of this patch; the specs
themselves botch this whole thing.

The TDX Module spec explicitly states that GPAs above GPAW are considered reserved.

10.11.1. GPAW-Related EPT Violations
GPA bits higher than the SHARED bit are considered reserved and must be 0.
Address translation with any of the reserved bits set to 1 cause a #PF with
PFEC (Page Fault Error Code) RSVD bit set.

But this is contradicted by the architectural extensions spec, which states that
a GPA that satisfies MAXPA >= GPA > GPAW "can" cause an EPT violation, not #PF.
Note, this section also appears to have a bug, as it states that GPA bit 47 is
both the SHARED bit and reserved. I assume that blurb is intended to clarify
that bit 47 _would_ be reserved if it weren't the SHARED bit, but because it's
the shared bit it's ok to access.

1.4.2
Guest Physical Address Translation
If the CPU's maximum physical-address width (MAXPA) is 52 and the guest physical
address width is configured to be 48, accesses with GPA bits 51:48 not all being
0 can cause an EPT-violation, where such EPT-violations are not mutated to #VE,
even if the “EPT-violations #VE” execution control is 1.

If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED bit
is configured to be in bit position 47, GPA bit 47 would be reserved, and GPA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
46:MAXPA in any paging structure can cause a reserved bit page fault on access.

The Module spec also calls out that the effective GPA is not to be confused with
MAXPA, which combined with the above blurb about MAXPA < GPAW, suggests that MAXPA
is enumerated separately by design so that the guest doesn't incorrectly think
46:MAXPA are usable. But that is problematic for the case where MAXPA > GPAW.

The effective GPA width (in bits) for this TD (do not confuse with MAXPA).
SHARED bit is at GPA bit GPAW-1.

I can't find the exact reference, but the TDX module always passes through host's
MAXPHYADDR. As it pertains to this patch, just doing

physical_mask &= ~tdx_shared_mask()

means that a guest running with GPAW=0 and MAXPHYADDR>48 will have a discontiguous
physical_mask, and could access "reserved" memory. If the VMM defines legal memory
with bits [MAXPHYADDR:48]!=0, explosions may ensue. That's arguably a VMM bug, but
given that the VMM is untrusted I think the guest should be paranoid when handling
the SHARED bit. I also don't know that the kernel will play nice with a discontiguous
mask.

Specs aside, unless Intel makes a hardware change to treat GPAW as guest.MAXPHYADDR,
or the TDX Module emulates on EPT violations to inject #PF(RSVD) when appropriate,
this mess isn't going to be truly fixed from the guest perspective.

So, IMO all bits >= GPAW should be cleared, and the kernel should warn and/or
refuse to boot if the host has defined legal memory in that range.

FWIW, from a VMM perspective, I'm pretty sure the only sane approach is to force
GPAW=1, a.k.a. SHARED bit == 51, if host.MAXPHYADDR>=49. But on the guest side,
I think we should be paranoid.

2021-11-08 17:15:13

by Kirill A. Shutemov

Subject: Re: [PATCH v5 03/16] x86/tdx: Exclude Shared bit from physical_mask

On Fri, Nov 05, 2021 at 10:11:48PM +0000, Sean Christopherson wrote:
> On Fri, Oct 08, 2021, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Just like MKTME, TDX reassigns bits of the physical address for
> > metadata. MKTME used several bits for an encryption KeyID. TDX
> > uses a single bit in guests to communicate whether a physical page
> > should be protected by TDX as private memory (bit set to 0) or
> > unprotected and shared with the VMM (bit set to 1).
> >
> > Add a helper, tdx_shared_mask(), to generate the mask. The processor
> > enumerates its physical address width to include the shared bit, which
> > means it gets included in __PHYSICAL_MASK by default.
>
> This is incorrect. The shared bit _may_ be a legal PA bit, but AIUI it's not a
> hard requirement.

Good point, will fix.

> > Remove the shared mask from 'physical_mask' since any bits in
> > tdx_shared_mask() are not used for physical addresses in page table
> > entries.
>
> ...
>
> > @@ -94,6 +100,9 @@ static void tdx_get_info(void)
> >
> > td_info.gpa_width = out.rcx & GENMASK(5, 0);
> > td_info.attributes = out.rdx;
> > +
> > + /* Exclude Shared bit from the __PHYSICAL_MASK */
> > + physical_mask &= ~tdx_shared_mask();
>
> This is insufficient, though it's not really the fault of this patch; the specs
> themselves botch this whole thing.
>
> The TDX Module spec explicitly states that GPAs above GPAW are considered reserved.
>
> 10.11.1. GPAW-Related EPT Violations
> GPA bits higher than the SHARED bit are considered reserved and must be 0.
> Address translation with any of the reserved bits set to 1 cause a #PF with
> PFEC (Page Fault Error Code) RSVD bit set.
>
> But this is contradicted by the architectural extensions spec, which states that
> a GPA that satisfies MAXPA >= GPA > GPAW "can" cause an EPT violation, not #PF.
> Note, this section also appears to have a bug, as it states that GPA bit 47 is
> both the SHARED bit and reserved. I assume that blurb is intended to clarify
> that bit 47 _would_ be reserved if it weren't the SHARED bit, but because it's
> the shared bit it's ok to access.
>
> 1.4.2
> Guest Physical Address Translation
> If the CPU's maximum physical-address width (MAXPA) is 52 and the guest physical
> address width is configured to be 48, accesses with GPA bits 51:48 not all being
> 0 can cause an EPT-violation, where such EPT-violations are not mutated to #VE,
> even if the “EPT-violations #VE” execution control is 1.
>
> If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED bit
> is configured to be in bit position 47, GPA bit 47 would be reserved, and GPA
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
> 46:MAXPA in any paging structure can cause a reserved bit page fault on access.
>
> The Module spec also calls out that the effective GPA is not to be confused with
> MAXPA, which combined with the above blurb about MAXPA < GPAW, suggests that MAXPA
> is enumerated separately by design so that the guest doesn't incorrectly think
> 46:MAXPA are usable. But that is problematic for the case where MAXPA > GPAW.
>
> The effective GPA width (in bits) for this TD (do not confuse with MAXPA).
> SHARED bit is at GPA bit GPAW-1.
>
> I can't find the exact reference, but the TDX module always passes through host's
> MAXPHYADDR. As it pertains to this patch, just doing
>
> physical_mask &= ~tdx_shared_mask()
>
> means that a guest running with GPAW=0 and MAXPHYADDR>48 will have a discontiguous
> physical_mask, and could access "reserved" memory. If the VMM defines legal memory
> with bits [MAXPHYADDR:48]!=0, explosions may ensue. That's arguably a VMM bug, but
> given that the VMM is untrusted I think the guest should be paranoid when handling
> the SHARED bit. I also don't know that the kernel will play nice with a discontiguous
> mask.

I expect it to be buggy.

> Specs aside, unless Intel makes a hardware change to treat GPAW as guest.MAXPHYADDR,
> or the TDX Module emulates on EPT violations to inject #PF(RSVD) when appropriate,
> this mess isn't going to be truly fixed from the guest perspective.
>
> So, IMO all bits >= GPAW should be cleared, and the kernel should warn and/or
> refuse to boot if the host has defined legal memory in that range.

Right. But only >= GPAW-1, as the shared bit is the MSB within GPAW:

physical_mask &= GENMASK_ULL(td_info.gpa_width - 2, 0);

'2' here smells bad, but well...
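
A worked example, assuming gpa_width = 48 and host MAXPHYADDR = 52:

  tdx_shared_mask()         = 1ULL << 47            (the SHARED bit)
  physical_mask (before)    = 0x000fffffffffffff    (bits 51:0)
  &= ~tdx_shared_mask()    -> 0x000f7fffffffffff    (discontiguous)
  &= GENMASK_ULL(46, 0)    -> 0x00007fffffffffff    (bits 46:0)

Masking with GENMASK_ULL(gpa_width - 2, 0) clears the SHARED bit and
everything above it in one go, keeping __PHYSICAL_MASK contiguous.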

Given that physical_mask is now contiguous, we can truncate anything from
e820 that cannot be addressed with the adjusted __PHYSICAL_MASK:

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index bc0657f0deed..16d57a8769e8 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -833,6 +833,9 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, enum e820_type
 	unsigned long last_pfn = 0;
 	unsigned long max_arch_pfn = MAX_ARCH_PFN;
 
+	if (max_arch_pfn > PHYS_PFN(__PHYSICAL_MASK + 1))
+		max_arch_pfn = PHYS_PFN(__PHYSICAL_MASK + 1);
+
 	for (i = 0; i < e820_table->nr_entries; i++) {
 		struct e820_entry *entry = &e820_table->entries[i];
 		unsigned long start_pfn;

Does it look reasonable?

> FWIW, from a VMM perspective, I'm pretty sure the only sane approach is to force
> GPAW=1, a.k.a. SHARED bit == 51, if host.MAXPHYADDR>=49. But on the guest side,
> I think we should be paranoid.

--
Kirill A. Shutemov