Subject: [PATCH v4 00/15] Add TDX Guest Support (shared-mm support)

Hi All,

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. Since the VMM is an untrusted entity,
TDX does not allow it to access guest private memory. Any memory that is
required for communication with the VMM must be shared explicitly. This
series adds support for securely sharing guest memory with the VMM when
the guest requires it.

Originally TDX shared every ioremap()ed region automatically. But it was
found that this ends up sharing a lot of memory that is supposed to be
private, for example ACPI tables. Also, since only a few drivers are
expected to be used, it is safer to mark their mappings explicitly (for
virtio this only needs changes in two places). This has the advantage of
automatically preventing other drivers from doing MMIO, which can happen
in some cases even with the device filter. There is still a command line
option to override this behavior, which allows all drivers to be used.
The sketch below illustrates how a driver opts in.
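
For illustration, here is how a hardened driver opts in; the device and
BAR layout are hypothetical, only pci_iomap_shared_range() comes from
this series (a minimal sketch, not a driver added by these patches):

static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        void __iomem *regs;

        if (pci_enable_device(pdev))
                return -ENODEV;

        /* Explicitly request a host-visible (shared) mapping of BAR 0 */
        regs = pci_iomap_shared_range(pdev, 0, 0, 0);
        if (!regs)
                return -ENOMEM;

        writel(0x1, regs);      /* hypothetical device enable register */
        return 0;
}

All other drivers keep using plain ioremap()/pci_iomap() and therefore
keep their MMIO private unless the override option is used.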

This series is the continuation of the patch series titled "Add TDX Guest
Support (Initial support)", "Add TDX Guest Support (#VE handler support)"
and "Add TDX Guest Support (boot fixes)" which added initial support,
#VE handler support and boot fixes for TDX guests. You can find the
related patchsets in the following links.

[set 1, v5] - https://lore.kernel.org/patchwork/project/lkml/list/?series=510805
[set 2, v4] - https://lore.kernel.org/patchwork/project/lkml/list/?series=510814
[set 3, v4] - https://lore.kernel.org/patchwork/project/lkml/list/?series=510816

Also, please note that this series alone is not fully functional. You
need to apply all three patch series above to get a fully functional
TDX guest.

You can find TDX related documents in the following link.

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

Also, the ioremap related changes for the mips, parisc, alpha and sparc
architectures are only compile tested, so help from users of these
architectures is needed to make sure they do not break any functionality.

In this patch series, the following patches are in the PCI domain and
are meant for the PCI domain reviewers.

pci: Consolidate pci_iomap* and pci_iomap*wc
pci: Add pci_iomap_shared{,_range}
pci: Mark MSI data shared

The patch titled "asm/io.h: Add ioremap_shared fallback" adds generic
and arch specific ioremap_shared headers and is meant to be reviewed
by [email protected], [email protected],
[email protected], [email protected],
[email protected].

Similarly, the patch titled "virtio: Use shared mappings for virtio
PCI devices" adds ioremap_shared() support for virtio drivers
and is meant to be reviewed by the virtio driver maintainers.

I have CCed this patch series to all the related domain maintainers
and open lists. If you prefer to get only the patches specific to your
domain, please let me know. I will fix this in the next submission.

Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)
* Added new API to selectively share ioremapped memory
(using ioremap_shared())
* Added new wrapper (pci_iomap_shared_range()) for PCI IO
remap shared mappings use case.

Changes since v2:
* Rebased on top of v5.14-rc1.
* No functional changes.

Andi Kleen (6):
pci: Consolidate pci_iomap* and pci_iomap*wc
asm/io.h: Add ioremap_shared fallback
pci: Add pci_iomap_shared{,_range}
pci: Mark MSI data shared
virtio: Use shared mappings for virtio PCI devices
x86/tdx: Implement ioremap_shared for x86

Isaku Yamahata (1):
x86/tdx: ioapic: Add shared bit for IOAPIC base address

Kirill A. Shutemov (6):
x86/mm: Move force_dma_unencrypted() to common code
x86/tdx: Exclude Shared bit from physical_mask
x86/tdx: Make pages shared in ioremap()
x86/tdx: Add helper to do MapGPA hypercall
x86/tdx: Make DMA pages shared
x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (2):
x86/tdx: Enable shared memory protected guest flags for TDX guest
x86/tdx: Add cmdline option to force use of ioremap_shared

.../admin-guide/kernel-parameters.rst | 1 +
.../admin-guide/kernel-parameters.txt | 12 ++
arch/alpha/include/asm/io.h | 1 +
arch/mips/include/asm/io.h | 1 +
arch/parisc/include/asm/io.h | 1 +
arch/sparc/include/asm/io_64.h | 1 +
arch/x86/Kconfig | 9 +-
arch/x86/include/asm/io.h | 5 +
arch/x86/include/asm/mem_encrypt_common.h | 20 +++
arch/x86/include/asm/pgtable.h | 5 +
arch/x86/include/asm/tdx.h | 22 +++
arch/x86/kernel/apic/io_apic.c | 18 ++-
arch/x86/kernel/tdx.c | 64 +++++++++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/ioremap.c | 64 +++++++--
arch/x86/mm/mem_encrypt.c | 8 +-
arch/x86/mm/mem_encrypt_common.c | 38 ++++++
arch/x86/mm/pat/set_memory.c | 45 ++++++-
drivers/pci/msi.c | 2 +-
drivers/virtio/virtio_pci_modern_dev.c | 2 +-
include/asm-generic/io.h | 4 +
include/asm-generic/pci_iomap.h | 6 +
include/linux/protected_guest.h | 1 +
lib/pci_iomap.c | 125 +++++++++++++-----
24 files changed, 393 insertions(+), 64 deletions(-)
create mode 100644 arch/x86/include/asm/mem_encrypt_common.h
create mode 100644 arch/x86/mm/mem_encrypt_common.c

--
2.25.1


Subject: [PATCH v4 01/15] x86/mm: Move force_dma_unencrypted() to common code

From: "Kirill A. Shutemov" <[email protected]>

Intel TDX doesn't allow the VMM to access guest private memory.
Any memory that is required for communication with the VMM must
be shared explicitly by setting the shared bit in the page table
entry. After setting the shared bit, the conversion must be
completed with the MapGPA hypercall. You can find details about
the MapGPA hypercall in [1], sec 3.2.

The call informs the VMM about the conversion between
private/shared mappings. The shared memory is similar to
unencrypted memory in AMD SME/SEV terminology, but the
underlying process of sharing/un-sharing the memory is
different for the Intel TDX guest platform.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set. In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To add this support, the AMD SME code forces force_dma_unencrypted()
to return true on platforms that support the AMD SEV feature. The
DMA memory allocation API uses it to trigger set_memory_decrypted()
on such platforms.

TDX is similar. To communicate with I/O devices, the related
pages need to be marked as shared. As mentioned above, shared
memory in the TDX architecture is similar to decrypted memory in
AMD SME/SEV. So, similar to AMD SEV, force_dma_unencrypted() has
to be forced to return true. This support is added in other patches
in this series.

So move force_dma_unencrypted() out of the AMD specific code and call
the AMD specific initialization function (amd_force_dma_unencrypted())
from it. force_dma_unencrypted() will be modified by later patches
to include Intel TDX guest platform specific initialization.

Also, introduce a new config option, X86_MEM_ENCRYPT_COMMON, that has
to be selected by all x86 memory encryption features. It will be
selected by both the AMD SEV and Intel TDX guest config options.

This is preparation for the TDX changes in the DMA code and has no
functional change.

[1] - https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
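
To show where the hook ends up being used, the DMA direct allocator
roughly does the following (a simplified sketch of the dma-direct path,
not the exact upstream code):

static void *dma_direct_alloc_sketch(struct device *dev, size_t size,
                                     dma_addr_t *dma_handle, gfp_t gfp)
{
        unsigned int order = get_order(size);
        struct page *page = alloc_pages(gfp, order);

        if (!page)
                return NULL;
        /* For SEV (and later TDX) this forces the buffer to be shared */
        if (force_dma_unencrypted(dev))
                set_memory_decrypted((unsigned long)page_address(page),
                                     1 << order);
        *dma_handle = phys_to_dma(dev, page_to_phys(page));
        return page_address(page);
}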

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Change since v3:
* None

Changes since v1:
* Removed sev_active(), sme_active() checks in force_dma_unencrypted().

arch/x86/Kconfig | 8 ++++++--
arch/x86/include/asm/mem_encrypt_common.h | 18 ++++++++++++++++++
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/mem_encrypt.c | 3 ++-
arch/x86/mm/mem_encrypt_common.c | 17 +++++++++++++++++
5 files changed, 45 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/include/asm/mem_encrypt_common.h
create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b500f2afacce..d66a8a2f3c97 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1524,16 +1524,20 @@ config X86_CPA_STATISTICS
helps to determine the effectiveness of preserving large and huge
page mappings when mapping protections are changed.

+config X86_MEM_ENCRYPT_COMMON
+ select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+ select DYNAMIC_PHYSICAL_MASK
+ def_bool n
+
config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DMA_COHERENT_POOL
- select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
- select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select INSTRUCTION_DECODER
select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
select ARCH_HAS_PROTECTED_GUEST
+ select X86_MEM_ENCRYPT_COMMON
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
new file mode 100644
index 000000000000..697bc40a4e3d
--- /dev/null
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_MEM_ENCRYPT_COMMON_H
+#define _ASM_X86_MEM_ENCRYPT_COMMON_H
+
+#include <linux/mem_encrypt.h>
+#include <linux/device.h>
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+bool amd_force_dma_unencrypted(struct device *dev);
+#else /* CONFIG_AMD_MEM_ENCRYPT */
+static inline bool amd_force_dma_unencrypted(struct device *dev)
+{
+ return false;
+}
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o

+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON) += mem_encrypt_common.o
+
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 7d3b2c6f5f88..1f7a72ce9d66 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -31,6 +31,7 @@
#include <asm/processor-flags.h>
#include <asm/msr.h>
#include <asm/cmdline.h>
+#include <asm/mem_encrypt_common.h>

#include "mm_internal.h"

@@ -415,7 +416,7 @@ bool amd_prot_guest_has(unsigned int attr)
EXPORT_SYMBOL(amd_prot_guest_has);

/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
+bool amd_force_dma_unencrypted(struct device *dev)
{
/*
* For SEV, all DMA must be to unencrypted addresses.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..f063c885b0a5
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Memory Encryption Support Common Code
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Kuppuswamy Sathyanarayanan <[email protected]>
+ */
+
+#include <asm/mem_encrypt_common.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+ return amd_force_dma_unencrypted(dev);
+}
--
2.25.1

Subject: [PATCH v4 03/15] x86/tdx: Make pages shared in ioremap()

From: "Kirill A. Shutemov" <[email protected]>

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure from AMD SEV code.

Note that the DMA code doesn't use ioremap() to convert memory to
shared, as DMA buffers are backed by normal memory. The DMA code makes
buffers shared with set_memory_decrypted().
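
To illustrate the two paths (sketch; the MMIO address is hypothetical):

        /* MMIO: not normal RAM, so ioremap() now applies the shared bit */
        void __iomem *mmio = ioremap(0xfebd0000, PAGE_SIZE);

        /* DMA buffer: normal RAM, converted with set_memory_decrypted() */
        unsigned long buf = __get_free_page(GFP_KERNEL);

        set_memory_decrypted(buf, 1);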

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)

Changes since v1:
* Fixed format issues in commit log.

arch/x86/include/asm/pgtable.h | 4 ++++
arch/x86/mm/ioremap.c | 8 ++++++--
include/linux/protected_guest.h | 1 +
3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..2d4d518651d2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,10 @@
#define pgprot_encrypted(prot) __pgprot(__sme_set(pgprot_val(prot)))
#define pgprot_decrypted(prot) __pgprot(__sme_clr(pgprot_val(prot)))

+/* Make the page accessible by VMM for protected guests */
+#define pgprot_protected_guest(prot) __pgprot(pgprot_val(prot) | \
+ tdg_shared_mask())
+
#ifndef __ASSEMBLY__
#include <asm/x86_init.h>
#include <asm/pkru.h>
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 60ade7dd71bd..69a60f240124 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -17,6 +17,7 @@
#include <linux/mem_encrypt.h>
#include <linux/efi.h>
#include <linux/pgtable.h>
+#include <linux/protected_guest.h>

#include <asm/set_memory.h>
#include <asm/e820/api.h>
@@ -26,6 +27,7 @@
#include <asm/pgalloc.h>
#include <asm/memtype.h>
#include <asm/setup.h>
+#include <asm/tdx.h>

#include "physaddr.h"

@@ -87,8 +89,8 @@ static unsigned int __ioremap_check_ram(struct resource *res)
}

/*
- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in TDX case) because there the whole memory is already encrypted.
*/
static unsigned int __ioremap_check_encrypted(struct resource *res)
{
@@ -246,6 +248,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
+ else if (prot_guest_has(PATTR_GUEST_SHARED_MAPPING_INIT))
+ prot = pgprot_protected_guest(prot);

switch (pcm) {
case _PAGE_CACHE_MODE_UC:
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
index 1a350c3fcfe2..14255f801100 100644
--- a/include/linux/protected_guest.h
+++ b/include/linux/protected_guest.h
@@ -17,6 +17,7 @@
#define PATTR_GUEST_MEM_ENCRYPT 2 /* Guest encrypted memory */
#define PATTR_GUEST_PROT_STATE 3 /* Guest encrypted state */
#define PATTR_GUEST_UNROLL_STRING_IO 4 /* Unrolled string IO */
+#define PATTR_GUEST_SHARED_MAPPING_INIT 5 /* Late shared mapping init*/

/* 0x800 - 0x8ff reserved for AMD */
#define PATTR_SME 0x800
--
2.25.1

Subject: [PATCH v4 06/15] x86/kvm: Use bounce buffers for TD guest

From: "Kirill A. Shutemov" <[email protected]>

Intel TDX doesn't allow the VMM to directly access guest private
memory. Any memory that is required for communication with the
VMM must be shared explicitly. The same rule applies to any
DMA to and from a TDX guest. All DMA pages have to be marked as
shared pages. A generic way to achieve this without any changes
to device drivers is to use the SWIOTLB framework.

This method of handling is similar to AMD SEV. So extend this
support to TDX guests as well. Also, since there is some common
code between AMD SEV and TDX guests in mem_encrypt_init(), move it
to mem_encrypt_common.c and call the AMD specific init function from
it.
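
Conceptually, with SWIOTLB forced, streaming DMA from an unmodified
driver bounces through the already-shared SWIOTLB pool (a simplified
sketch of the flow, not the exact SWIOTLB code):

        /* Driver code stays unchanged */
        dma_addr_t dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        /*
         * Inside the DMA layer, with swiotlb_force == SWIOTLB_FORCE:
         *  - the buffer contents are copied into a bounce buffer that
         *    lives in memory already converted to shared
         *  - the returned DMA address points at the bounce buffer, so
         *    the device (and VMM) only ever touch shared pages
         */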

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)

Changes since v1:
* Removed sme_me_mask check for amd_mem_encrypt_init() in mem_encrypt_init().

arch/x86/include/asm/mem_encrypt_common.h | 2 ++
arch/x86/kernel/tdx.c | 3 +++
arch/x86/mm/mem_encrypt.c | 5 +----
arch/x86/mm/mem_encrypt_common.c | 14 ++++++++++++++
4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
index 697bc40a4e3d..48d98a3d64fd 100644
--- a/arch/x86/include/asm/mem_encrypt_common.h
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -8,11 +8,13 @@

#ifdef CONFIG_AMD_MEM_ENCRYPT
bool amd_force_dma_unencrypted(struct device *dev);
+void __init amd_mem_encrypt_init(void);
#else /* CONFIG_AMD_MEM_ENCRYPT */
static inline bool amd_force_dma_unencrypted(struct device *dev)
{
return false;
}
+static inline void amd_mem_encrypt_init(void) {}
#endif /* CONFIG_AMD_MEM_ENCRYPT */

#endif
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b91740a485d6..01b758496e84 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -13,6 +13,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <linux/sched/signal.h> /* force_sig_fault() */
+#include <linux/swiotlb.h>

/* TDX Module call Leaf IDs */
#define TDINFO 1
@@ -517,6 +518,8 @@ void __init tdx_early_init(void)

legacy_pic = &null_legacy_pic;

+ swiotlb_force = SWIOTLB_FORCE;
+
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
NULL, tdg_cpu_offline_prepare);

diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 1f7a72ce9d66..cab68d8cc5b0 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -491,14 +491,11 @@ static void print_mem_encrypt_feature_info(void)
}

/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
+void __init amd_mem_encrypt_init(void)
{
if (!sme_me_mask)
return;

- /* Call into SWIOTLB to update the SWIOTLB DMA buffers */
- swiotlb_update_mem_attributes();
-
/*
* With SEV, we need to unroll the rep string I/O instructions,
* but SEV-ES supports them through the #VC handler.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index fdaf09b4a658..2ba19476dc26 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
#include <asm/mem_encrypt_common.h>
#include <linux/dma-mapping.h>
#include <linux/protected_guest.h>
+#include <linux/swiotlb.h>

/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
bool force_dma_unencrypted(struct device *dev)
@@ -22,3 +23,16 @@ bool force_dma_unencrypted(struct device *dev)

return false;
}
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+ /*
+ * For TDX guest or SEV/SME, call into SWIOTLB to update
+ * the SWIOTLB DMA buffers
+ */
+ if (sme_me_mask || prot_guest_has(PATTR_GUEST_MEM_ENCRYPT))
+ swiotlb_update_mem_attributes();
+
+ amd_mem_encrypt_init();
+}
--
2.25.1

Subject: [PATCH v4 05/15] x86/tdx: Make DMA pages shared

From: "Kirill A. Shutemov" <[email protected]>

Just like MKTME, TDX reassigns bits of the physical address for
metadata. MKTME used several bits for an encryption KeyID. TDX
uses a single bit in guests to communicate whether a physical page
should be protected by TDX as private memory (bit set to 0) or
unprotected and shared with the VMM (bit set to 1).

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit
accordingly, followed by the relevant TDX hypercall.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range
when converting memory to private. The 4k page size limit is due
to a current TDX spec restriction. Also, if the GPA (range) was
already mapped as an active, private page, the host VMM may remove
the private page from the TD by following the “Removing TD Private
Pages” sequence in the Intel TDX-module specification [1] to safely
block the mapping(s), flush the TLB and cache, and remove the
mapping(s).

BUG() if TDACCEPTPAGE fails (except for the "previously accepted page"
case), as the guest is completely hosed if it can't access memory.

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
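
From a driver's point of view, converting a page back and forth now
looks like this (sketch):

        unsigned long vaddr = __get_free_page(GFP_KERNEL);

        /* private -> shared: clear protection in the PTEs, then MapGPA */
        set_memory_decrypted(vaddr, 1);

        /* shared -> private: set protection, MapGPA, then TDACCEPTPAGE
         * on each 4k page of the range */
        set_memory_encrypted(vaddr, 1);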

Tested-by: Kai Huang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)
* Fixed TDX_PAGE_ALREADY_ACCEPTED error code as per latest
spec update.

Changes since v1:
* Removed "we" or "I" usages in comment section.
* Replaced is_tdx_guest() checks with prot_guest_has() checks.

arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/tdx.c | 35 +++++++++++++++++++++----
arch/x86/mm/mem_encrypt_common.c | 9 ++++++-
arch/x86/mm/pat/set_memory.c | 45 +++++++++++++++++++++++++++-----
4 files changed, 77 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2d4d518651d2..f341bf6b8b93 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -24,6 +24,7 @@
/* Make the page accessible by VMM for protected guests */
#define pgprot_protected_guest(prot) __pgprot(pgprot_val(prot) | \
tdg_shared_mask())
+#define pgprot_pg_shared_mask() __pgprot(tdg_shared_mask())

#ifndef __ASSEMBLY__
#include <asm/x86_init.h>
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ccbad600a422..b91740a485d6 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -17,10 +17,16 @@
/* TDX Module call Leaf IDs */
#define TDINFO 1
#define TDGETVEINFO 3
+#define TDACCEPTPAGE 6

/* TDX hypercall Leaf IDs */
#define TDVMCALL_MAP_GPA 0x10001

+/* TDX Module call error codes */
+#define TDX_PAGE_ALREADY_ACCEPTED 0x00000b0a00000000
+#define TDCALL_RETURN_CODE_MASK 0xFFFFFFFF00000000
+#define TDCALL_RETURN_CODE(a) ((a) & TDCALL_RETURN_CODE_MASK)
+
#define VE_IS_IO_OUT(exit_qual) (((exit_qual) & 8) ? 0 : 1)
#define VE_GET_IO_SIZE(exit_qual) (((exit_qual) & 7) + 1)
#define VE_GET_PORT_NUM(exit_qual) ((exit_qual) >> 16)
@@ -100,26 +106,45 @@ static void tdg_get_info(void)
physical_mask &= ~tdg_shared_mask();
}

+static void tdg_accept_page(phys_addr_t gpa)
+{
+ u64 ret;
+
+ ret = __tdx_module_call(TDACCEPTPAGE, gpa, 0, 0, 0, NULL);
+
+ BUG_ON(ret && TDCALL_RETURN_CODE(ret) != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
/*
* Inform the VMM of the guest's intent for this physical page:
* shared with the VMM or private to the guest. The VMM is
* expected to change its mapping of the page in response.
- *
- * Note: shared->private conversions require further guest
- * action to accept the page.
*/
int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
enum tdx_map_type map_type)
{
- u64 ret;
+ u64 ret = 0;
+ int i;

if (map_type == TDX_MAP_SHARED)
gpa |= tdg_shared_mask();

ret = _tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0,
NULL);
+ if (ret)
+ ret = -EIO;

- return ret ? -EIO : 0;
+ if (ret || map_type == TDX_MAP_SHARED)
+ return ret;
+
+ /*
+ * For shared->private conversion, accept the page using TDACCEPTPAGE
+ * TDX module call.
+ */
+ for (i = 0; i < numpages; i++)
+ tdg_accept_page(gpa + i * PAGE_SIZE);
+
+ return 0;
}

static __cpuidle void tdg_halt(void)
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index f063c885b0a5..fdaf09b4a658 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -9,9 +9,16 @@

#include <asm/mem_encrypt_common.h>
#include <linux/dma-mapping.h>
+#include <linux/protected_guest.h>

/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
bool force_dma_unencrypted(struct device *dev)
{
- return amd_force_dma_unencrypted(dev);
+ if (prot_guest_has(PATTR_SEV) || prot_guest_has(PATTR_SME))
+ return amd_force_dma_unencrypted(dev);
+
+ if (prot_guest_has(PATTR_GUEST_MEM_ENCRYPT))
+ return true;
+
+ return false;
}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index ad8a5c586a35..52c93923b947 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -29,6 +29,7 @@
#include <asm/proto.h>
#include <asm/memtype.h>
#include <asm/set_memory.h>
+#include <asm/tdx.h>

#include "../mm_internal.h"

@@ -1980,9 +1981,11 @@ int set_memory_global(unsigned long addr, int numpages)
__pgprot(_PAGE_GLOBAL), 0);
}

-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
{
+ pgprot_t mem_protected_bits, mem_plain_bits;
struct cpa_data cpa;
+ enum tdx_map_type map_type;
int ret;

/* Nothing to do if memory encryption is not active */
@@ -1996,8 +1999,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
cpa.numpages = numpages;
- cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
- cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+ if (prot_guest_has(PATTR_GUEST_SHARED_MAPPING_INIT)) {
+ mem_protected_bits = __pgprot(0);
+ mem_plain_bits = pgprot_pg_shared_mask();
+ } else {
+ mem_protected_bits = __pgprot(_PAGE_ENC);
+ mem_plain_bits = __pgprot(0);
+ }
+
+ if (protect) {
+ cpa.mask_set = mem_protected_bits;
+ cpa.mask_clr = mem_plain_bits;
+ map_type = TDX_MAP_PRIVATE;
+ } else {
+ cpa.mask_set = mem_plain_bits;
+ cpa.mask_clr = mem_protected_bits;
+ map_type = TDX_MAP_SHARED;
+ }
+
cpa.pgd = init_mm.pgd;

/* Must avoid aliasing mappings in the highmem code */
@@ -2005,9 +2025,17 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
vm_unmap_aliases();

/*
- * Before changing the encryption attribute, we need to flush caches.
+ * Before changing the encryption attribute, flush caches.
+ *
+ * For TDX, guest is responsible for flushing caches on private->shared
+ * transition. VMM is responsible for flushing on shared->private.
*/
- cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ if (prot_guest_has(PATTR_GUEST_TDX)) {
+ if (map_type == TDX_MAP_SHARED)
+ cpa_flush(&cpa, 1);
+ } else {
+ cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+ }

ret = __change_page_attr_set_clr(&cpa, 1);

@@ -2020,18 +2048,21 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
*/
cpa_flush(&cpa, 0);

+ if (!ret && prot_guest_has(PATTR_GUEST_SHARED_MAPPING_INIT))
+ ret = tdx_hcall_gpa_intent(__pa(addr), numpages, map_type);
+
return ret;
}

int set_memory_encrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, true);
+ return __set_memory_protect(addr, numpages, true);
}
EXPORT_SYMBOL_GPL(set_memory_encrypted);

int set_memory_decrypted(unsigned long addr, int numpages)
{
- return __set_memory_enc_dec(addr, numpages, false);
+ return __set_memory_protect(addr, numpages, false);
}
EXPORT_SYMBOL_GPL(set_memory_decrypted);

--
2.25.1

Subject: [PATCH v4 07/15] x86/tdx: ioapic: Add shared bit for IOAPIC base address

From: Isaku Yamahata <[email protected]>

The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host. This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings, such as those used by virtio, will be
marked as shared. However, the IOAPIC code does not use ioremap() and
instead uses the fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code. Ensure
that it marks IOAPIC pages as "shared". This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v3:
* Rebased on top of Tom Lendacky's protected guest
changes (https://lore.kernel.org/patchwork/cover/1468760/)

arch/x86/kernel/apic/io_apic.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index d5c691a3208b..5154efe8c4f7 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -49,6 +49,7 @@
#include <linux/slab.h>
#include <linux/memblock.h>
#include <linux/msi.h>
+#include <linux/protected_guest.h>

#include <asm/irqdomain.h>
#include <asm/io.h>
@@ -65,6 +66,7 @@
#include <asm/irq_remapping.h>
#include <asm/hw_irq.h>
#include <asm/apic.h>
+#include <asm/tdx.h>

#define for_each_ioapic(idx) \
for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
@@ -2675,6 +2677,18 @@ static struct resource * __init ioapic_setup_resources(void)
return res;
}

+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+ phys_addr_t phys)
+{
+ pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+ /* Set TDX guest shared bit in pgprot flags */
+ if (prot_guest_has(PATTR_GUEST_SHARED_MAPPING_INIT))
+ flags = pgprot_protected_guest(flags);
+
+ __set_fixmap(idx, phys, flags);
+}
+
void __init io_apic_init_mappings(void)
{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2707,7 +2721,7 @@ void __init io_apic_init_mappings(void)
__func__, PAGE_SIZE, PAGE_SIZE);
ioapic_phys = __pa(ioapic_phys);
}
- set_fixmap_nocache(idx, ioapic_phys);
+ io_apic_set_fixmap_nocache(idx, ioapic_phys);
apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
ioapic_phys);
@@ -2836,7 +2850,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
ioapics[idx].mp_config.apicaddr = address;

- set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+ io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
if (bad_ioapic_register(idx)) {
clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
return -ENODEV;
--
2.25.1

Subject: [PATCH v4 08/15] x86/tdx: Enable shared memory protected guest flags for TDX guest

In a TDX guest, since the memory is private to the guest, it needs
some extra configuration before sharing any data with the VMM. AMD SEV
also implements similar features, so the code can be shared. Currently,
the memory sharing related code in the kernel is protected by the
PATTR_GUEST_MEM_ENCRYPT and PATTR_GUEST_SHARED_MAPPING_INIT flags.
So enable them for the TDX guest as well.
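
With this change, generic code that keys off these attributes now also
covers TDX guests, e.g. (sketch of the call sites added elsewhere in
this series):

        if (prot_guest_has(PATTR_GUEST_MEM_ENCRYPT))
                swiotlb_update_mem_attributes();

        if (prot_guest_has(PATTR_GUEST_SHARED_MAPPING_INIT))
                prot = pgprot_protected_guest(prot);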

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/kernel/tdx.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 01b758496e84..1cf2443edb90 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -78,6 +78,8 @@ bool tdx_prot_guest_has(unsigned long flag)
switch (flag) {
case PATTR_GUEST_TDX:
case PATTR_GUEST_UNROLL_STRING_IO:
+ case PATTR_GUEST_MEM_ENCRYPT:
+ case PATTR_GUEST_SHARED_MAPPING_INIT:
return cpu_feature_enabled(X86_FEATURE_TDX_GUEST);
}

--
2.25.1

Subject: [PATCH v4 04/15] x86/tdx: Add helper to do MapGPA hypercall

From: "Kirill A. Shutemov" <[email protected]>

The MapGPA hypercall is used by TDX guests to request that the VMM
convert the existing mapping of a given GPA address range between
private and shared.

tdx_hcall_gpa_intent() is the wrapper used for making the MapGPA
hypercall.
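
For example, requesting that one page be treated as shared by the VMM
(sketch; callers are also expected to update the page table attributes,
as a later patch does from the set_memory_*() path):

        int ret = tdx_hcall_gpa_intent(__pa(vaddr), 1, TDX_MAP_SHARED);

        if (ret)
                pr_err("MapGPA failed\n");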

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v3:
* None

Changes since v1:
* Modified tdx_hcall_gpa_intent() to use _tdx_hypercall() instead of
tdx_hypercall().

arch/x86/include/asm/tdx.h | 18 ++++++++++++++++++
arch/x86/kernel/tdx.c | 25 +++++++++++++++++++++++++
2 files changed, 43 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1e2a1c6a1898..665c8cf57d5b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -56,6 +56,15 @@ struct ve_info {
u32 instr_info;
};

+/*
+ * Page mapping type enum. This is software construct not
+ * part of any hardware or VMM ABI.
+ */
+enum tdx_map_type {
+ TDX_MAP_PRIVATE,
+ TDX_MAP_SHARED,
+};
+
#ifdef CONFIG_INTEL_TDX_GUEST

void __init tdx_early_init(void);
@@ -79,6 +88,9 @@ bool tdg_early_handle_ve(struct pt_regs *regs);

extern phys_addr_t tdg_shared_mask(void);

+extern int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type);
+
/*
* To support I/O port access in decompressor or early kernel init
* code, since #VE exception handler cannot be used, use paravirt
@@ -149,6 +161,12 @@ static inline bool tdg_early_handle_ve(struct pt_regs *regs) { return false; }

static inline phys_addr_t tdg_shared_mask(void) { return 0; }

+static inline int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_INTEL_TDX_GUEST */

#ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index d316fe33f52f..ccbad600a422 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -18,6 +18,9 @@
#define TDINFO 1
#define TDGETVEINFO 3

+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA 0x10001
+
#define VE_IS_IO_OUT(exit_qual) (((exit_qual) & 8) ? 0 : 1)
#define VE_GET_IO_SIZE(exit_qual) (((exit_qual) & 7) + 1)
#define VE_GET_PORT_NUM(exit_qual) ((exit_qual) >> 16)
@@ -97,6 +100,28 @@ static void tdg_get_info(void)
physical_mask &= ~tdg_shared_mask();
}

+/*
+ * Inform the VMM of the guest's intent for this physical page:
+ * shared with the VMM or private to the guest. The VMM is
+ * expected to change its mapping of the page in response.
+ *
+ * Note: shared->private conversions require further guest
+ * action to accept the page.
+ */
+int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+ enum tdx_map_type map_type)
+{
+ u64 ret;
+
+ if (map_type == TDX_MAP_SHARED)
+ gpa |= tdg_shared_mask();
+
+ ret = _tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0,
+ NULL);
+
+ return ret ? -EIO : 0;
+}
+
static __cpuidle void tdg_halt(void)
{
u64 ret;
--
2.25.1

Subject: [PATCH v4 09/15] pci: Consolidate pci_iomap* and pci_iomap*wc

From: Andi Kleen <[email protected]>

pci_iomap* and pci_iomap*wc are currently duplicated code, except
that the _wc variant does not support IO ports. Replace them
with a common helper and a callback for the mapping. Wrappers are
used for the mapping functions because some architectures implement
ioremap() and friends as macros.

This will allow adding more variants without excessive code
duplication. This patch should have no behavior change.
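
As an example of the intended extensibility, a new mapping flavor now
only needs a wrapper plus one call into the common helper in
lib/pci_iomap.c. The _np variant below is purely illustrative and not
added by this series:

static void __iomem *map_ioremap_np(phys_addr_t addr, size_t size)
{
        return ioremap_np(addr, size);
}

void __iomem *pci_iomap_np_range(struct pci_dev *dev, int bar,
                                 unsigned long offset, unsigned long maxlen)
{
        return pci_iomap_range_map(dev, bar, offset, maxlen, map_ioremap_np);
}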

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
lib/pci_iomap.c | 81 +++++++++++++++++++++++++++----------------------
1 file changed, 44 insertions(+), 37 deletions(-)

diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
index 2d3eb1cb73b8..6251c3f651c6 100644
--- a/lib/pci_iomap.c
+++ b/lib/pci_iomap.c
@@ -10,6 +10,46 @@
#include <linux/export.h>

#ifdef CONFIG_PCI
+
+/*
+ * Callback wrappers because some architectures define ioremap et.al.
+ * as macros.
+ */
+static void __iomem *map_ioremap(phys_addr_t addr, size_t size)
+{
+ return ioremap(addr, size);
+}
+
+static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size)
+{
+ return ioremap_wc(addr, size);
+}
+
+static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
+ int bar,
+ unsigned long offset,
+ unsigned long maxlen,
+ void __iomem *(*mapm)(phys_addr_t,
+ size_t))
+{
+ resource_size_t start = pci_resource_start(dev, bar);
+ resource_size_t len = pci_resource_len(dev, bar);
+ unsigned long flags = pci_resource_flags(dev, bar);
+
+ if (len <= offset || !start)
+ return NULL;
+ len -= offset;
+ start += offset;
+ if (maxlen && len > maxlen)
+ len = maxlen;
+ if (flags & IORESOURCE_IO)
+ return __pci_ioport_map(dev, start, len);
+ if (flags & IORESOURCE_MEM)
+ return mapm(start, len);
+ /* What? */
+ return NULL;
+}
+
/**
* pci_iomap_range - create a virtual mapping cookie for a PCI BAR
* @dev: PCI device that owns the BAR
@@ -30,22 +70,8 @@ void __iomem *pci_iomap_range(struct pci_dev *dev,
unsigned long offset,
unsigned long maxlen)
{
- resource_size_t start = pci_resource_start(dev, bar);
- resource_size_t len = pci_resource_len(dev, bar);
- unsigned long flags = pci_resource_flags(dev, bar);
-
- if (len <= offset || !start)
- return NULL;
- len -= offset;
- start += offset;
- if (maxlen && len > maxlen)
- len = maxlen;
- if (flags & IORESOURCE_IO)
- return __pci_ioport_map(dev, start, len);
- if (flags & IORESOURCE_MEM)
- return ioremap(start, len);
- /* What? */
- return NULL;
+ return pci_iomap_range_map(dev, bar, offset, maxlen,
+ map_ioremap);
}
EXPORT_SYMBOL(pci_iomap_range);

@@ -70,27 +96,8 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
unsigned long offset,
unsigned long maxlen)
{
- resource_size_t start = pci_resource_start(dev, bar);
- resource_size_t len = pci_resource_len(dev, bar);
- unsigned long flags = pci_resource_flags(dev, bar);
-
-
- if (flags & IORESOURCE_IO)
- return NULL;
-
- if (len <= offset || !start)
- return NULL;
-
- len -= offset;
- start += offset;
- if (maxlen && len > maxlen)
- len = maxlen;
-
- if (flags & IORESOURCE_MEM)
- return ioremap_wc(start, len);
-
- /* What? */
- return NULL;
+ return pci_iomap_range_map(dev, bar, offset, maxlen,
+ map_ioremap_wc);
}
EXPORT_SYMBOL_GPL(pci_iomap_wc_range);

--
2.25.1

Subject: [PATCH v4 10/15] asm/io.h: Add ioremap_shared fallback

From: Andi Kleen <[email protected]>

This function is for declaring memory that should be shared with
a hypervisor in a confidential guest. If the architecture doesn't
implement it, it is just ioremap().
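
Usage is the same as ioremap(); only drivers that intentionally expose
an MMIO region to the hypervisor switch over (sketch, hypothetical
device):

        void __iomem *regs;

        regs = ioremap_shared(pci_resource_start(pdev, 0),
                              pci_resource_len(pdev, 0));
        if (!regs)
                return -ENOMEM;

On architectures that do not provide their own implementation this is
exactly ioremap().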

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/alpha/include/asm/io.h | 1 +
arch/mips/include/asm/io.h | 1 +
arch/parisc/include/asm/io.h | 1 +
arch/sparc/include/asm/io_64.h | 1 +
include/asm-generic/io.h | 4 ++++
5 files changed, 8 insertions(+)

diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h
index 0fab5ac90775..701b44909b94 100644
--- a/arch/alpha/include/asm/io.h
+++ b/arch/alpha/include/asm/io.h
@@ -283,6 +283,7 @@ static inline void __iomem *ioremap(unsigned long port, unsigned long size)
}

#define ioremap_wc ioremap
+#define ioremap_shared ioremap
#define ioremap_uc ioremap

static inline void iounmap(volatile void __iomem *addr)
diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h
index 6f5c86d2bab4..3713ff624632 100644
--- a/arch/mips/include/asm/io.h
+++ b/arch/mips/include/asm/io.h
@@ -179,6 +179,7 @@ void iounmap(const volatile void __iomem *addr);
#define ioremap(offset, size) \
ioremap_prot((offset), (size), _CACHE_UNCACHED)
#define ioremap_uc ioremap
+#define ioremap_shared ioremap

/*
* ioremap_cache - map bus memory into CPU space
diff --git a/arch/parisc/include/asm/io.h b/arch/parisc/include/asm/io.h
index 0b5259102319..73064e152df7 100644
--- a/arch/parisc/include/asm/io.h
+++ b/arch/parisc/include/asm/io.h
@@ -129,6 +129,7 @@ static inline void gsc_writeq(unsigned long long val, unsigned long addr)
*/
void __iomem *ioremap(unsigned long offset, unsigned long size);
#define ioremap_wc ioremap
+#define ioremap_shared ioremap
#define ioremap_uc ioremap

extern void iounmap(const volatile void __iomem *addr);
diff --git a/arch/sparc/include/asm/io_64.h b/arch/sparc/include/asm/io_64.h
index 5ffa820dcd4d..18cc656eb712 100644
--- a/arch/sparc/include/asm/io_64.h
+++ b/arch/sparc/include/asm/io_64.h
@@ -409,6 +409,7 @@ static inline void __iomem *ioremap(unsigned long offset, unsigned long size)
#define ioremap_uc(X,Y) ioremap((X),(Y))
#define ioremap_wc(X,Y) ioremap((X),(Y))
#define ioremap_wt(X,Y) ioremap((X),(Y))
+#define ioremap_shared(X, Y) ioremap((X), (Y))
static inline void __iomem *ioremap_np(unsigned long offset, unsigned long size)
{
return NULL;
diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index e93375c710b9..bfcaee1691c8 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -982,6 +982,10 @@ static inline void __iomem *ioremap(phys_addr_t addr, size_t size)
#define ioremap_wt ioremap
#endif

+#ifndef ioremap_shared
+#define ioremap_shared ioremap
+#endif
+
/*
* ioremap_uc is special in that we do require an explicit architecture
* implementation. In general you do not want to use this function in a
--
2.25.1

Subject: [PATCH v4 14/15] x86/tdx: Implement ioremap_shared for x86

From: Andi Kleen <[email protected]>

Implement ioremap_shared() for x86. In TDX most memory is encrypted,
but some memory that is used to communicate with the host must
be declared shared with special page table attributes and a
special hypercall. Previously all ioremap()ed memory was declared
shared, but this leads to various BIOS tables and other private
state being shared, which is a security risk.

This patch replaces the unconditional ioremap sharing with an explicit
ioremap_shared() that enables sharing.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/include/asm/io.h | 3 +++
arch/x86/mm/ioremap.c | 41 ++++++++++++++++++++++++++++++---------
2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 4c9e06a81ebe..51c2c45456bf 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -379,6 +379,9 @@ extern void __iomem *ioremap_wc(resource_size_t offset, unsigned long size);
extern void __iomem *ioremap_wt(resource_size_t offset, unsigned long size);
#define ioremap_wt ioremap_wt

+extern void __iomem *ioremap_shared(resource_size_t offset, unsigned long size);
+#define ioremap_shared ioremap_shared
+
extern bool is_early_ioremap_ptep(pte_t *ptep);

#define IO_SPACE_LIMIT 0xffff
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 69a60f240124..74260aaa494b 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -178,7 +178,8 @@ static void __ioremap_check_mem(resource_size_t addr, unsigned long size,
*/
static void __iomem *
__ioremap_caller(resource_size_t phys_addr, unsigned long size,
- enum page_cache_mode pcm, void *caller, bool encrypted)
+ enum page_cache_mode pcm, void *caller, bool encrypted,
+ bool shared)
{
unsigned long offset, vaddr;
resource_size_t last_addr;
@@ -248,7 +249,7 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
- else if (prot_guest_has(PATTR_GUEST_SHARED_MAPPING_INIT))
+ else if (shared)
prot = pgprot_protected_guest(prot);

switch (pcm) {
@@ -340,7 +341,8 @@ void __iomem *ioremap(resource_size_t phys_addr, unsigned long size)
enum page_cache_mode pcm = _PAGE_CACHE_MODE_UC_MINUS;

return __ioremap_caller(phys_addr, size, pcm,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false,
+ false);
}
EXPORT_SYMBOL(ioremap);

@@ -373,7 +375,8 @@ void __iomem *ioremap_uc(resource_size_t phys_addr, unsigned long size)
enum page_cache_mode pcm = _PAGE_CACHE_MODE_UC;

return __ioremap_caller(phys_addr, size, pcm,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false,
+ false);
}
EXPORT_SYMBOL_GPL(ioremap_uc);

@@ -390,10 +393,29 @@ EXPORT_SYMBOL_GPL(ioremap_uc);
void __iomem *ioremap_wc(resource_size_t phys_addr, unsigned long size)
{
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WC,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false,
+ false);
}
EXPORT_SYMBOL(ioremap_wc);

+/**
+ * ioremap_shared - map memory into CPU space shared with host
+ * @phys_addr: bus address of the memory
+ * @size: size of the resource to map
+ *
+ * This version of ioremap ensures that the memory is marked shared
+ * with the host. This is useful for confidential guests.
+ *
+ * Must be freed with iounmap.
+ */
+void __iomem *ioremap_shared(resource_size_t phys_addr, unsigned long size)
+{
+ return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WC,
+ __builtin_return_address(0), false,
+ prot_guest_has(PATTR_GUEST_SHARED_MAPPING_INIT));
+}
+EXPORT_SYMBOL(ioremap_shared);
+
/**
* ioremap_wt - map memory into CPU space write through
* @phys_addr: bus address of the memory
@@ -407,21 +429,22 @@ EXPORT_SYMBOL(ioremap_wc);
void __iomem *ioremap_wt(resource_size_t phys_addr, unsigned long size)
{
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WT,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false,
+ false);
}
EXPORT_SYMBOL(ioremap_wt);

void __iomem *ioremap_encrypted(resource_size_t phys_addr, unsigned long size)
{
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB,
- __builtin_return_address(0), true);
+ __builtin_return_address(0), true, false);
}
EXPORT_SYMBOL(ioremap_encrypted);

void __iomem *ioremap_cache(resource_size_t phys_addr, unsigned long size)
{
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB,
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false, false);
}
EXPORT_SYMBOL(ioremap_cache);

@@ -430,7 +453,7 @@ void __iomem *ioremap_prot(resource_size_t phys_addr, unsigned long size,
{
return __ioremap_caller(phys_addr, size,
pgprot2cachemode(__pgprot(prot_val)),
- __builtin_return_address(0), false);
+ __builtin_return_address(0), false, false);
}
EXPORT_SYMBOL(ioremap_prot);

--
2.25.1

Subject: [PATCH v4 13/15] virtio: Use shared mappings for virtio PCI devices

From: Andi Kleen <[email protected]>

In a TDX guest the PCI device mappings of virtio must be shared
with the host, so use explicit shared mappings.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
drivers/virtio/virtio_pci_modern_dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_pci_modern_dev.c b/drivers/virtio/virtio_pci_modern_dev.c
index e11ed748e661..6ec4c52258ce 100644
--- a/drivers/virtio/virtio_pci_modern_dev.c
+++ b/drivers/virtio/virtio_pci_modern_dev.c
@@ -83,7 +83,7 @@ vp_modern_map_capability(struct virtio_pci_modern_device *mdev, int off,
return NULL;
}

- p = pci_iomap_range(dev, bar, offset, length);
+ p = pci_iomap_shared_range(dev, bar, offset, length);
if (!p)
dev_err(&dev->dev,
"virtio_pci: unable to map virtio %u@%u on bar %i\n",
--
2.25.1

Subject: [PATCH v4 15/15] x86/tdx: Add cmdline option to force use of ioremap_shared

Add a command line option to force all the enabled drivers to use
shared memory mappings. This is useful when enabling new drivers
in a protected guest without first making all the required changes
for them to use shared mappings.

Note that this might also allow other, not explicitly enabled, drivers
to interact with the host, which could cause additional security risks.
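
For example, when bringing up an unmodified driver in a TDX guest for
testing, append the option to the kernel command line (illustrative):

        ioremap_force_shared

Every ioremap() in the guest then behaves like ioremap_shared().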

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
.../admin-guide/kernel-parameters.rst | 1 +
.../admin-guide/kernel-parameters.txt | 12 ++++++++++++
arch/x86/include/asm/io.h | 2 ++
arch/x86/mm/ioremap.c | 19 ++++++++++++++++++-
4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index 01ba293a2d70..bdf3896a100c 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -147,6 +147,7 @@ parameter is applicable::
PCI PCI bus support is enabled.
PCIE PCI Express support is enabled.
PCMCIA The PCMCIA subsystem is enabled.
+ PG Protected guest is enabled.
PNP Plug & Play support is enabled.
PPC PowerPC architecture is enabled.
PPT Parallel port support is enabled.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdb22006f713..ba390be62f89 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2062,6 +2062,18 @@
1 - Bypass the IOMMU for DMA.
unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.

+ ioremap_force_shared= [X86_64, PG]
+ Force the kernel to use shared memory mappings which do
+ not use ioremap_shared/pci_iomap_shared to opt-in to shared
+ mappings with the host. This feature is mainly used by
+ a protected guest when enabling new drivers without
+ proper shared memory related changes. Please note that
+ this option might also allow other non explicitly enabled
+ drivers to interact with the host in protected guest,
+ which could cause other security risks. This option will
+ also cause BIOS data structures to be shared with the host,
+ which might open security holes.
+
io7= [HW] IO7 for Marvel-based Alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 51c2c45456bf..744f72835a30 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -413,6 +413,8 @@ extern bool arch_memremap_can_ram_remap(resource_size_t offset,
extern bool phys_mem_access_encrypted(unsigned long phys_addr,
unsigned long size);

+extern bool ioremap_force_shared;
+
/**
* iosubmit_cmds512 - copy data to single MMIO location, in 512-bit units
* @dst: destination, in MMIO space (must be 512-bit aligned)
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 74260aaa494b..7576e886fad8 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -28,6 +28,7 @@
#include <asm/memtype.h>
#include <asm/setup.h>
#include <asm/tdx.h>
+#include <asm/cmdline.h>

#include "physaddr.h"

@@ -162,6 +163,17 @@ static void __ioremap_check_mem(resource_size_t addr, unsigned long size,
__ioremap_check_other(addr, desc);
}

+/*
+ * Normally only drivers that are hardened for use in confidential guests
+ * force shared mappings. But if device filtering is disabled other
+ * devices can be loaded, and these need shared mappings too. This
+ * variable is set to true if these filters are disabled.
+ *
+ * Note this has some side effects, e.g. various BIOS tables
+ * get shared too which is risky.
+ */
+bool ioremap_force_shared;
+
/*
* Remap an arbitrary physical address space into the kernel virtual
* address space. It transparently creates kernel huge I/O mapping when
@@ -249,7 +261,7 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
prot = PAGE_KERNEL_IO;
if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
prot = pgprot_encrypted(prot);
- else if (shared)
+ else if (shared || ioremap_force_shared)
prot = pgprot_protected_guest(prot);

switch (pcm) {
@@ -847,6 +859,11 @@ void __init early_ioremap_init(void)
WARN_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
#endif

+ /* Parse cmdline params for ioremap_force_shared */
+ if (cmdline_find_option_bool(boot_command_line,
+ "ioremap_force_shared"))
+ ioremap_force_shared = 1;
+
early_ioremap_setup();

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
--
2.25.1

Subject: [PATCH v4 02/15] x86/tdx: Exclude Shared bit from physical_mask

From: "Kirill A. Shutemov" <[email protected]>

Just like MKTME, TDX reassigns bits of the physical address for
metadata. MKTME used several bits for an encryption KeyID. TDX
uses a single bit in guests to communicate whether a physical page
should be protected by TDX as private memory (bit set to 0) or
unprotected and shared with the VMM (bit set to 1).

Add a helper, tdg_shared_mask(), to generate the mask. The processor
enumerates its physical address width to include the shared bit, which
means it gets included in __PHYSICAL_MASK by default.

Remove the shared mask from 'physical_mask' since any bits in
tdg_shared_mask() are not used for physical addresses in page table
entries.

Also, note that the shared mapping configuration cannot be combined for
AMD SME and Intel TDX guest platforms in a common function. SME has
to do it very early in __startup_64() as it sets the bit on all
memory, except what is used for communication. TDX can postpone it,
as it doesn't need any shared mappings in very early boot.
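
For example, with a guest GPA width of 52 bits (sketch of the
arithmetic):

        tdg_shared_mask() == 1ULL << (52 - 1)   /* bit 51 */

        physical_mask &= ~tdg_shared_mask();    /* bit 51 is no longer
                                                   treated as part of
                                                   the physical address */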

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v3:
* None

Changes since v1:
* Fixed format issues in commit log.

arch/x86/Kconfig | 1 +
arch/x86/include/asm/tdx.h | 4 ++++
arch/x86/kernel/tdx.c | 9 +++++++++
3 files changed, 14 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d66a8a2f3c97..8eada36694b6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -872,6 +872,7 @@ config INTEL_TDX_GUEST
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
select ARCH_HAS_PROTECTED_GUEST
+ select X86_MEM_ENCRYPT_COMMON
help
Provide support for running in a trusted domain on Intel processors
equipped with Trusted Domain eXtensions. TDX is a new Intel
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 72154d3f63c2..1e2a1c6a1898 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -77,6 +77,8 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,

bool tdg_early_handle_ve(struct pt_regs *regs);

+extern phys_addr_t tdg_shared_mask(void);
+
/*
* To support I/O port access in decompressor or early kernel init
* code, since #VE exception handler cannot be used, use paravirt
@@ -145,6 +147,8 @@ static inline bool tdx_prot_guest_has(unsigned long flag) { return false; }

static inline bool tdg_early_handle_ve(struct pt_regs *regs) { return false; }

+static inline phys_addr_t tdg_shared_mask(void) { return 0; }
+
#endif /* CONFIG_INTEL_TDX_GUEST */

#ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 0c24439774b4..d316fe33f52f 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -75,6 +75,12 @@ bool tdx_prot_guest_has(unsigned long flag)
}
EXPORT_SYMBOL_GPL(tdx_prot_guest_has);

+/* The highest bit of a guest physical address is the "sharing" bit */
+phys_addr_t tdg_shared_mask(void)
+{
+ return 1ULL << (td_info.gpa_width - 1);
+}
+
static void tdg_get_info(void)
{
u64 ret;
@@ -86,6 +92,9 @@ static void tdg_get_info(void)

td_info.gpa_width = out.rcx & GENMASK(5, 0);
td_info.attributes = out.rdx;
+
+ /* Exclude Shared bit from the __PHYSICAL_MASK */
+ physical_mask &= ~tdg_shared_mask();
}

static __cpuidle void tdg_halt(void)
--
2.25.1

Subject: [PATCH v4 12/15] pci: Mark MSI data shared

From: Andi Kleen <[email protected]>

In a TDX guest the MSI area must be shared with the host, so use
a shared mapping.

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
drivers/pci/msi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 9232255c8515..c6cf1b756d74 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -682,7 +682,7 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
table_offset &= PCI_MSIX_TABLE_OFFSET;
phys_addr = pci_resource_start(dev, bir) + table_offset;

- return ioremap(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
+ return ioremap_shared(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
}

static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
--
2.25.1

Subject: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

From: Andi Kleen <[email protected]>

Add a new variant of pci_iomap for mapping all PCI resources
of a device as shared memory with a hypervisor in a confidential
guest.
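
For example, a hardened driver maps BAR 0 shared with the host (sketch,
hypothetical driver):

        void __iomem *base = pci_iomap_shared(pdev, 0, 0);

        if (!base)
                return -ENOMEM;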

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
include/asm-generic/pci_iomap.h | 6 +++++
lib/pci_iomap.c | 46 +++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)

diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
index d4f16dcc2ed7..0178ddd7ad88 100644
--- a/include/asm-generic/pci_iomap.h
+++ b/include/asm-generic/pci_iomap.h
@@ -18,6 +18,12 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
unsigned long offset,
unsigned long maxlen);
+extern void __iomem *pci_iomap_shared(struct pci_dev *dev, int bar,
+ unsigned long max);
+extern void __iomem *pci_iomap_shared_range(struct pci_dev *dev, int bar,
+ unsigned long offset,
+ unsigned long maxlen);
+
/* Create a virtual mapping cookie for a port on a given PCI device.
* Do not call this directly, it exists to make it easier for architectures
* to override */
diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
index 6251c3f651c6..b04e8689eab3 100644
--- a/lib/pci_iomap.c
+++ b/lib/pci_iomap.c
@@ -25,6 +25,11 @@ static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size)
return ioremap_wc(addr, size);
}

+static void __iomem *map_ioremap_shared(phys_addr_t addr, size_t size)
+{
+ return ioremap_shared(addr, size);
+}
+
static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
int bar,
unsigned long offset,
@@ -101,6 +106,47 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
}
EXPORT_SYMBOL_GPL(pci_iomap_wc_range);

+/**
+ * pci_iomap_shared_range - create a virtual shared mapping cookie for a
+ * PCI BAR
+ * @dev: PCI device that owns the BAR
+ * @bar: BAR number
+ * @offset: map memory at the given offset in BAR
+ * @maxlen: max length of the memory to map
+ *
+ * Remap a pci device's resources shared in a confidential guest.
+ * For more details see pci_iomap_range's documentation.
+ *
+ * @maxlen specifies the maximum length to map. To get access to
+ * the complete BAR from offset to the end, pass %0 here.
+ */
+void __iomem *pci_iomap_shared_range(struct pci_dev *dev, int bar,
+ unsigned long offset, unsigned long maxlen)
+{
+ return pci_iomap_range_map(dev, bar, offset, maxlen,
+ map_ioremap_shared);
+}
+EXPORT_SYMBOL_GPL(pci_iomap_shared_range);
+
+/**
+ * pci_iomap_shared - create a virtual shared mapping cookie for a PCI BAR
+ * @dev: PCI device that owns the BAR
+ * @bar: BAR number
+ * @maxlen: length of the memory to map
+ *
+ * See pci_iomap for details. This function creates a shared mapping
+ * with the host for confidential hosts.
+ *
+ * @maxlen specifies the maximum length to map. To get access to the
+ * complete BAR without checking for its length first, pass %0 here.
+ */
+void __iomem *pci_iomap_shared(struct pci_dev *dev, int bar,
+ unsigned long maxlen)
+{
+ return pci_iomap_shared_range(dev, bar, 0, maxlen);
+}
+EXPORT_SYMBOL_GPL(pci_iomap_shared);
+
/**
* pci_iomap - create a virtual mapping cookie for a PCI BAR
* @dev: PCI device that owns the BAR
--
2.25.1

2021-08-12 20:42:26

by Bjorn Helgaas

Subject: Re: [PATCH v4 10/15] asm/io.h: Add ioremap_shared fallback

On Wed, Aug 04, 2021 at 05:52:13PM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: Andi Kleen <[email protected]>
>
> This function is for declaring memory that should be shared with
> a hypervisor in a confidential guest. If the architecture doesn't
> implement it it's just ioremap.

I would assume ioremap_shared() would "map" something, not "declare"
it.

> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> arch/alpha/include/asm/io.h | 1 +
> arch/mips/include/asm/io.h | 1 +
> arch/parisc/include/asm/io.h | 1 +
> arch/sparc/include/asm/io_64.h | 1 +
> include/asm-generic/io.h | 4 ++++
> 5 files changed, 8 insertions(+)
>
> diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h
> index 0fab5ac90775..701b44909b94 100644
> --- a/arch/alpha/include/asm/io.h
> +++ b/arch/alpha/include/asm/io.h
> @@ -283,6 +283,7 @@ static inline void __iomem *ioremap(unsigned long port, unsigned long size)
> }
>
> #define ioremap_wc ioremap
> +#define ioremap_shared ioremap
> #define ioremap_uc ioremap
>
> static inline void iounmap(volatile void __iomem *addr)
> diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h
> index 6f5c86d2bab4..3713ff624632 100644
> --- a/arch/mips/include/asm/io.h
> +++ b/arch/mips/include/asm/io.h
> @@ -179,6 +179,7 @@ void iounmap(const volatile void __iomem *addr);
> #define ioremap(offset, size) \
> ioremap_prot((offset), (size), _CACHE_UNCACHED)
> #define ioremap_uc ioremap
> +#define ioremap_shared ioremap
>
> /*
> * ioremap_cache - map bus memory into CPU space
> diff --git a/arch/parisc/include/asm/io.h b/arch/parisc/include/asm/io.h
> index 0b5259102319..73064e152df7 100644
> --- a/arch/parisc/include/asm/io.h
> +++ b/arch/parisc/include/asm/io.h
> @@ -129,6 +129,7 @@ static inline void gsc_writeq(unsigned long long val, unsigned long addr)
> */
> void __iomem *ioremap(unsigned long offset, unsigned long size);
> #define ioremap_wc ioremap
> +#define ioremap_shared ioremap
> #define ioremap_uc ioremap
>
> extern void iounmap(const volatile void __iomem *addr);
> diff --git a/arch/sparc/include/asm/io_64.h b/arch/sparc/include/asm/io_64.h
> index 5ffa820dcd4d..18cc656eb712 100644
> --- a/arch/sparc/include/asm/io_64.h
> +++ b/arch/sparc/include/asm/io_64.h
> @@ -409,6 +409,7 @@ static inline void __iomem *ioremap(unsigned long offset, unsigned long size)
> #define ioremap_uc(X,Y) ioremap((X),(Y))
> #define ioremap_wc(X,Y) ioremap((X),(Y))
> #define ioremap_wt(X,Y) ioremap((X),(Y))
> +#define ioremap_shared(X, Y) ioremap((X), (Y))
> static inline void __iomem *ioremap_np(unsigned long offset, unsigned long size)
> {
> return NULL;
> diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
> index e93375c710b9..bfcaee1691c8 100644
> --- a/include/asm-generic/io.h
> +++ b/include/asm-generic/io.h
> @@ -982,6 +982,10 @@ static inline void __iomem *ioremap(phys_addr_t addr, size_t size)
> #define ioremap_wt ioremap
> #endif
>
> +#ifndef ioremap_shared
> +#define ioremap_shared ioremap
> +#endif

"ioremap_shared" is a very generic term for a pretty specific thing:
"memory shared with a hypervisor in a confidential guest".

Maybe deserves a comment with at least a hint here. "Hypervisors in a
confidential guest" isn't the first thing that comes to mind when I
read "shared".

> /*
> * ioremap_uc is special in that we do require an explicit architecture
> * implementation. In general you do not want to use this function in a
> --
> 2.25.1
>

2021-08-12 21:42:40

by Bjorn Helgaas

Subject: Re: [PATCH v4 09/15] pci: Consolidate pci_iomap* and pci_iomap*wc

Is there a branch with all of this applied? I was going to apply this
to help take a look at it, but it doesn't apply to v5.14-rc1. I know
you listed some prereqs in the cover letter, but it's a fair amount of
work to sort all that out.

On Wed, Aug 04, 2021 at 05:52:12PM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: Andi Kleen <[email protected]>

If I were applying these, I would silently update the subject lines to
match previous commits. Since these will probably be merged via a
different tree, you can update if there's a v5:

PCI: Consolidate pci_iomap_range(), pci_iomap_wc_range()

Also applies to 11/15 and 12/15.

> pci_iomap* and pci_iomap*wc are currently duplicated code, except
> that the _wc variant does not support IO ports. Replace them
> with a common helper and a callback for the mapping. I used
> wrappers for the maps because some architectures implement ioremap
> and friends with macros.

Maybe spell some of this out:

pci_iomap_range() and pci_iomap_wc_range() are currently duplicated
code, ... Implement them using a common helper,
pci_iomap_range_map(), ...

Using "pci_iomap*" obscures the name and doesn't save any space.

Why is it safe to make pci_iomap_wc_range() support IO ports when it
didn't before? That might be desirable, but I think it *is* a
functional change here.

IIUC, pci_iomap_wc_range() on an IO port range previously returned
NULL, and after this patch it will work the same as pci_iomap_range(),
i.e., it will return the result of __pci_ioport_map().

> This will allow to add more variants without excessive code
> duplications. This patch should have no behavior change.
>
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> lib/pci_iomap.c | 81 +++++++++++++++++++++++++++----------------------
> 1 file changed, 44 insertions(+), 37 deletions(-)
>
> diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
> index 2d3eb1cb73b8..6251c3f651c6 100644
> --- a/lib/pci_iomap.c
> +++ b/lib/pci_iomap.c
> @@ -10,6 +10,46 @@
> #include <linux/export.h>
>
> #ifdef CONFIG_PCI
> +
> +/*
> + * Callback wrappers because some architectures define ioremap et.al.
> + * as macros.
> + */
> +static void __iomem *map_ioremap(phys_addr_t addr, size_t size)
> +{
> + return ioremap(addr, size);
> +}
> +
> +static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size)
> +{
> + return ioremap_wc(addr, size);
> +}
> +
> +static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
> + int bar,
> + unsigned long offset,
> + unsigned long maxlen,
> + void __iomem *(*mapm)(phys_addr_t,
> + size_t))
> +{
> + resource_size_t start = pci_resource_start(dev, bar);
> + resource_size_t len = pci_resource_len(dev, bar);
> + unsigned long flags = pci_resource_flags(dev, bar);
> +
> + if (len <= offset || !start)
> + return NULL;
> + len -= offset;
> + start += offset;
> + if (maxlen && len > maxlen)
> + len = maxlen;
> + if (flags & IORESOURCE_IO)
> + return __pci_ioport_map(dev, start, len);
> + if (flags & IORESOURCE_MEM)
> + return mapm(start, len);
> + /* What? */
> + return NULL;
> +}
> +
> /**
> * pci_iomap_range - create a virtual mapping cookie for a PCI BAR
> * @dev: PCI device that owns the BAR
> @@ -30,22 +70,8 @@ void __iomem *pci_iomap_range(struct pci_dev *dev,
> unsigned long offset,
> unsigned long maxlen)
> {
> - resource_size_t start = pci_resource_start(dev, bar);
> - resource_size_t len = pci_resource_len(dev, bar);
> - unsigned long flags = pci_resource_flags(dev, bar);
> -
> - if (len <= offset || !start)
> - return NULL;
> - len -= offset;
> - start += offset;
> - if (maxlen && len > maxlen)
> - len = maxlen;
> - if (flags & IORESOURCE_IO)
> - return __pci_ioport_map(dev, start, len);
> - if (flags & IORESOURCE_MEM)
> - return ioremap(start, len);
> - /* What? */
> - return NULL;
> + return pci_iomap_range_map(dev, bar, offset, maxlen,
> + map_ioremap);
> }
> EXPORT_SYMBOL(pci_iomap_range);
>
> @@ -70,27 +96,8 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
> unsigned long offset,
> unsigned long maxlen)
> {
> - resource_size_t start = pci_resource_start(dev, bar);
> - resource_size_t len = pci_resource_len(dev, bar);
> - unsigned long flags = pci_resource_flags(dev, bar);
> -
> -
> - if (flags & IORESOURCE_IO)
> - return NULL;
> -
> - if (len <= offset || !start)
> - return NULL;
> -
> - len -= offset;
> - start += offset;
> - if (maxlen && len > maxlen)
> - len = maxlen;
> -
> - if (flags & IORESOURCE_MEM)
> - return ioremap_wc(start, len);
> -
> - /* What? */
> - return NULL;
> + return pci_iomap_range_map(dev, bar, offset, maxlen,
> + map_ioremap_wc);
> }
> EXPORT_SYMBOL_GPL(pci_iomap_wc_range);
>
> --
> 2.25.1
>

2021-08-12 23:39:54

by Andi Kleen

Subject: Re: [PATCH v4 09/15] pci: Consolidate pci_iomap* and pci_iomap*wc


> Why is it safe to make pci_iomap_wc_range() support IO ports when it
> didn't before? That might be desirable, but I think it *is* a
> functional change here.

None of the callers use it to map IO ports, so it will be a no-op for
them. But you're right, it's a (minor) functional change.

-Andi




Subject: Re: [PATCH v4 09/15] pci: Consolidate pci_iomap* and pci_iomap*wc



On 8/12/21 12:43 PM, Bjorn Helgaas wrote:
> Is there a branch with all of this applied? I was going to apply this

It is maintained in the following tree.

https://github.com/intel/tdx/commit/93fd5b655172ba9e3350487995102a8b2c41de27

> to help take a look at it, but it doesn't apply to v5.14-rc1. I know

This patch can be applied independently. I have just applied it on top
of v5.14-rc5, and it seems to apply cleanly. Can you try -rc5?

> you listed some prereqs in the cover letter, but it's a fair amount of
> work to sort all that out.
>
> On Wed, Aug 04, 2021 at 05:52:12PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> From: Andi Kleen <[email protected]>
>
> If I were applying these, I would silently update the subject lines to
> match previous commits. Since these will probably be merged via a
> different tree, you can update if there's a v5:
>
> PCI: Consolidate pci_iomap_range(), pci_iomap_wc_range()

Yes. I will fix this in next version.

>
> Also applies to 11/15 and 12/15.

Will do the same.

>
>> pci_iomap* and pci_iomap*wc are currently duplicated code, except
>> that the _wc variant does not support IO ports. Replace them
>> with a common helper and a callback for the mapping. I used
>> wrappers for the maps because some architectures implement ioremap
>> and friends with macros.
>
> Maybe spell some of this out:
>
> pci_iomap_range() and pci_iomap_wc_range() are currently duplicated
> code, ... Implement them using a common helper,
> pci_iomap_range_map(), ...
>
> Using "pci_iomap*" obscures the name and doesn't save any space.
>
> Why is it safe to make pci_iomap_wc_range() support IO ports when it
> didn't before? That might be desirable, but I think it *is* a
> functional change here.

Agreed. The commit log needs to be updated. I will include these details
in the next submission.

>
> IIUC, pci_iomap_wc_range() on an IO port range previously returned
> NULL, and after this patch it will work the same as pci_iomap_range(),
> i.e., it will return the result of __pci_ioport_map().
>
>> This will allow to add more variants without excessive code
>> duplications. This patch should have no behavior change.
>>
>> Signed-off-by: Andi Kleen <[email protected]>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
>> ---
>> lib/pci_iomap.c | 81 +++++++++++++++++++++++++++----------------------
>> 1 file changed, 44 insertions(+), 37 deletions(-)
>>


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-08-13 08:07:39

by Christoph Hellwig

Subject: Re: [PATCH v4 10/15] asm/io.h: Add ioremap_shared fallback

_shared is just a horrible name for these. Please find a more specific
name, and document them instead of just adding to the macro forest.

2021-08-13 08:11:57

by Christoph Hellwig

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Wed, Aug 04, 2021 at 05:52:14PM -0700, Kuppuswamy Sathyanarayanan wrote:
> +extern void __iomem *pci_iomap_shared(struct pci_dev *dev, int bar,
> + unsigned long max);
> +extern void __iomem *pci_iomap_shared_range(struct pci_dev *dev, int bar,
> + unsigned long offset,
> + unsigned long maxlen);

No need for externs here.

> +/**
> + * pci_iomap_shared_range - create a virtual shared mapping cookie for a
> + * PCI BAR

This reads like complete garbage from a Markov chain.

2021-08-13 09:54:24

by Christoph Hellwig

Subject: Re: [PATCH v4 12/15] pci: Mark MSI data shared

On Wed, Aug 04, 2021 at 05:52:15PM -0700, Kuppuswamy Sathyanarayanan wrote:
>
> - return ioremap(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
> + return ioremap_shared(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);

Please add a comment here. I also find the number of ioremap_* variants
rather frustrating. Maybe it is time for an ioremap_flags which
replaces the too-low-level pgprot_t of the ioremap_prot provided by some
architectures with a more high-level flags argument that could also
provide an "accessible to the hypervisor" flag.
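
A rough sketch of what that ioremap_flags() idea could look like (the name,
the flag bits, and the signature below are all hypothetical; nothing like
this exists in the tree today):

/* Hypothetical high-level flags replacing the per-variant ioremap_*() helpers. */
#define IOREMAP_WC		(1U << 0)	/* write-combining mapping */
#define IOREMAP_HOST_SHARED	(1U << 1)	/* accessible to the hypervisor */

void __iomem *ioremap_flags(phys_addr_t phys_addr, size_t size,
			    unsigned int flags);

/*
 * The MSI-X table mapping from patch 12/15 would then read, for example:
 *
 *	return ioremap_flags(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE,
 *			     IOREMAP_HOST_SHARED);
 */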

2021-08-23 23:57:59

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Wed, Aug 04, 2021 at 05:52:14PM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: Andi Kleen <[email protected]>
>
> Add a new variant of pci_iomap for mapping all PCI resources
> of a devices as shared memory with a hypervisor in a confidential
> guest.
>
> Signed-off-by: Andi Kleen <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>

I'm a bit puzzled by this part. So why should the guest *not* map
pci memory as shared? And if the answer is never (as it seems to be)
then why not just make regular pci_iomap DTRT?

Thanks!

> ---
> include/asm-generic/pci_iomap.h | 6 +++++
> lib/pci_iomap.c | 46 +++++++++++++++++++++++++++++++++
> 2 files changed, 52 insertions(+)
>
> diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
> index d4f16dcc2ed7..0178ddd7ad88 100644
> --- a/include/asm-generic/pci_iomap.h
> +++ b/include/asm-generic/pci_iomap.h
> @@ -18,6 +18,12 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
> extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
> unsigned long offset,
> unsigned long maxlen);
> +extern void __iomem *pci_iomap_shared(struct pci_dev *dev, int bar,
> + unsigned long max);
> +extern void __iomem *pci_iomap_shared_range(struct pci_dev *dev, int bar,
> + unsigned long offset,
> + unsigned long maxlen);
> +
> /* Create a virtual mapping cookie for a port on a given PCI device.
> * Do not call this directly, it exists to make it easier for architectures
> * to override */
> diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
> index 6251c3f651c6..b04e8689eab3 100644
> --- a/lib/pci_iomap.c
> +++ b/lib/pci_iomap.c
> @@ -25,6 +25,11 @@ static void __iomem *map_ioremap_wc(phys_addr_t addr, size_t size)
> return ioremap_wc(addr, size);
> }
>
> +static void __iomem *map_ioremap_shared(phys_addr_t addr, size_t size)
> +{
> + return ioremap_shared(addr, size);
> +}
> +
> static void __iomem *pci_iomap_range_map(struct pci_dev *dev,
> int bar,
> unsigned long offset,
> @@ -101,6 +106,47 @@ void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
> }
> EXPORT_SYMBOL_GPL(pci_iomap_wc_range);
>
> +/**
> + * pci_iomap_shared_range - create a virtual shared mapping cookie for a
> + * PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @offset: map memory at the given offset in BAR
> + * @maxlen: max length of the memory to map
> + *
> + * Remap a pci device's resources shared in a confidential guest.
> + * For more details see pci_iomap_range's documentation.
> + *
> + * @maxlen specifies the maximum length to map. To get access to
> + * the complete BAR from offset to the end, pass %0 here.
> + */
> +void __iomem *pci_iomap_shared_range(struct pci_dev *dev, int bar,
> + unsigned long offset, unsigned long maxlen)
> +{
> + return pci_iomap_range_map(dev, bar, offset, maxlen,
> + map_ioremap_shared);
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_shared_range);
> +
> +/**
> + * pci_iomap_shared - create a virtual shared mapping cookie for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @maxlen: length of the memory to map
> + *
> + * See pci_iomap for details. This function creates a shared mapping
> + * with the host for confidential hosts.
> + *
> + * @maxlen specifies the maximum length to map. To get access to the
> + * complete BAR without checking for its length first, pass %0 here.
> + */
> +void __iomem *pci_iomap_shared(struct pci_dev *dev, int bar,
> + unsigned long maxlen)
> +{
> + return pci_iomap_shared_range(dev, bar, 0, maxlen);
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_shared);
> +
> /**
> * pci_iomap - create a virtual mapping cookie for a PCI BAR
> * @dev: PCI device that owns the BAR
> --
> 2.25.1

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}



On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
>> Add a new variant of pci_iomap for mapping all PCI resources
>> of a devices as shared memory with a hypervisor in a confidential
>> guest.
>>
>> Signed-off-by: Andi Kleen<[email protected]>
>> Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
> I'm a bit puzzled by this part. So why should the guest*not* map
> pci memory as shared? And if the answer is never (as it seems to be)
> then why not just make regular pci_iomap DTRT?

It is in the context of a confidential guest (where the VMM is untrusted), so
we don't want to make all PCI resources shared. It should be allowed
only for hardened drivers/devices.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-08-24 01:07:41

by Dan Williams

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Mon, Aug 23, 2021 at 5:31 PM Kuppuswamy, Sathyanarayanan
<[email protected]> wrote:
>
>
>
> On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
> >> Add a new variant of pci_iomap for mapping all PCI resources
> >> of a devices as shared memory with a hypervisor in a confidential
> >> guest.
> >>
> >> Signed-off-by: Andi Kleen<[email protected]>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
> > I'm a bit puzzled by this part. So why should the guest*not* map
> > pci memory as shared? And if the answer is never (as it seems to be)
> > then why not just make regular pci_iomap DTRT?
>
> It is in the context of confidential guest (where VMM is un-trusted). So
> we don't want to make all PCI resource as shared. It should be allowed
> only for hardened drivers/devices.

That's confusing, isn't device authorization what keeps unaudited
drivers from loading against untrusted devices? I'm feeling like
Michael that this should be a detail that drivers need not care about
explicitly, in which case it does not need to be exported because the
detail can be buried in lower levels.

Note, I specifically said "unaudited", not "hardened" because as Greg
mentioned the kernel must trust drivers; it's devices that may not be
trusted.

2021-08-24 02:15:33

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


On 8/23/2021 6:04 PM, Dan Williams wrote:
> On Mon, Aug 23, 2021 at 5:31 PM Kuppuswamy, Sathyanarayanan
> <[email protected]> wrote:
>>
>>
>> On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
>>>> Add a new variant of pci_iomap for mapping all PCI resources
>>>> of a devices as shared memory with a hypervisor in a confidential
>>>> guest.
>>>>
>>>> Signed-off-by: Andi Kleen<[email protected]>
>>>> Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
>>> I'm a bit puzzled by this part. So why should the guest*not* map
>>> pci memory as shared? And if the answer is never (as it seems to be)
>>> then why not just make regular pci_iomap DTRT?
>> It is in the context of confidential guest (where VMM is un-trusted). So
>> we don't want to make all PCI resource as shared. It should be allowed
>> only for hardened drivers/devices.
> That's confusing, isn't device authorization what keeps unaudited
> drivers from loading against untrusted devices? I'm feeling like
> Michael that this should be a detail that drivers need not care about
> explicitly, in which case it does not need to be exported because the
> detail can be buried in lower levels.

We originally made it the default (similar to AMD), but during code audit
we found a lot of drivers that do ioremap early, outside the probe
function. Since it would be difficult to change them all, we made it
opt-in, which ensures that only drivers that have been enabled can talk
with the host at all and can't be attacked. That made the problem of
hardening all these drivers a lot more practical.

Currently we only really need virtio and MSI-X shared, so for changing
two places in the tree you avoid a lot of headache elsewhere.

Note there is still a command line option to override if you want to
allow and load other drivers.

-Andi
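
For concreteness, this is roughly how a driver on the allow list would ask
for a host-shared MMIO mapping (illustrative usage only; this is not the
actual virtio patch from the series, and the driver/BAR details are made up):

#include <linux/pci.h>

static int hardened_drv_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	void __iomem *regs;

	/* Map all of BAR 0 shared with the hypervisor (maxlen 0 == whole BAR). */
	regs = pci_iomap_shared(pdev, 0, 0);
	if (!regs)
		return -ENOMEM;

	/* ... normal readl()/writel() accesses through "regs" ... */
	return 0;
}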

2021-08-24 07:15:04

by Christoph Hellwig

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Mon, Aug 23, 2021 at 05:30:54PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
> > > Add a new variant of pci_iomap for mapping all PCI resources
> > > of a devices as shared memory with a hypervisor in a confidential
> > > guest.
> > >
> > > Signed-off-by: Andi Kleen<[email protected]>
> > > Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
> > I'm a bit puzzled by this part. So why should the guest*not* map
> > pci memory as shared? And if the answer is never (as it seems to be)
> > then why not just make regular pci_iomap DTRT?
>
> It is in the context of confidential guest (where VMM is un-trusted). So
> we don't want to make all PCI resource as shared. It should be allowed
> only for hardened drivers/devices.

Well, assuming the host can do any damage when mapped shared that also
means not mapping it shared will completely break the drivers.

2021-08-24 09:14:33

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Mon, Aug 23, 2021 at 05:30:54PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
> > > Add a new variant of pci_iomap for mapping all PCI resources
> > > of a devices as shared memory with a hypervisor in a confidential
> > > guest.
> > >
> > > Signed-off-by: Andi Kleen<[email protected]>
> > > Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
> > I'm a bit puzzled by this part. So why should the guest*not* map
> > pci memory as shared? And if the answer is never (as it seems to be)
> > then why not just make regular pci_iomap DTRT?
>
> It is in the context of confidential guest (where VMM is un-trusted). So
> we don't want to make all PCI resource as shared. It should be allowed
> only for hardened drivers/devices.

I can't say this answers the question at all. PCI devices are part of
the VMM and so un-trusted. In particular PCI devices do not have
the key to decrypt memory. Therefore as far as I can see PCI resources
should not be encrypted. I conclude they all should be marked
shared.

If I'm wrong can you please give an example of a PCI resource
that is encrypted?

--
MST

2021-08-24 09:48:10

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Mon, Aug 23, 2021 at 07:14:18PM -0700, Andi Kleen wrote:
>
> On 8/23/2021 6:04 PM, Dan Williams wrote:
> > On Mon, Aug 23, 2021 at 5:31 PM Kuppuswamy, Sathyanarayanan
> > <[email protected]> wrote:
> > >
> > >
> > > On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
> > > > > Add a new variant of pci_iomap for mapping all PCI resources
> > > > > of a devices as shared memory with a hypervisor in a confidential
> > > > > guest.
> > > > >
> > > > > Signed-off-by: Andi Kleen<[email protected]>
> > > > > Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
> > > > I'm a bit puzzled by this part. So why should the guest*not* map
> > > > pci memory as shared? And if the answer is never (as it seems to be)
> > > > then why not just make regular pci_iomap DTRT?
> > > It is in the context of confidential guest (where VMM is un-trusted). So
> > > we don't want to make all PCI resource as shared. It should be allowed
> > > only for hardened drivers/devices.
> > That's confusing, isn't device authorization what keeps unaudited
> > drivers from loading against untrusted devices? I'm feeling like
> > Michael that this should be a detail that drivers need not care about
> > explicitly, in which case it does not need to be exported because the
> > detail can be buried in lower levels.
>
> We originally made it default (similar to AMD), but it during code audit we
> found a lot of drivers who do ioremap early outside the probe function.
> Since it would be difficult to change them all we made it opt-in, which
> ensures that only drivers that have been enabled can talk with the host at
> all and can't be attacked. That made the problem of hardening all these
> drivers a lot more practical.
>
> Currently we only really need virtio and MSI-X shared, so for changing two
> places in the tree you avoid a lot of headache elsewhere.
>
> Note there is still a command line option to override if you want to allow
> and load other drivers.
>
> -Andi

I see. Hmm. It's a bit of a random thing to do it at the map time
though. E.g. DMA is all handled transparently behind the DMA API.
Hardening is much more than just replacing map with map_shared
and I suspect what you will end up with is basically
vendors replacing map with map shared to make things work
for their users and washing their hands.

I would say an explicit flag in the driver that says "hardened"
and refusing to init a non hardened one would be better.

--
MST

2021-08-24 17:11:17

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


On 8/24/2021 12:07 AM, Christoph Hellwig wrote:
> On Mon, Aug 23, 2021 at 05:30:54PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>>
>> On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
>>>> Add a new variant of pci_iomap for mapping all PCI resources
>>>> of a devices as shared memory with a hypervisor in a confidential
>>>> guest.
>>>>
>>>> Signed-off-by: Andi Kleen<[email protected]>
>>>> Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
>>> I'm a bit puzzled by this part. So why should the guest*not* map
>>> pci memory as shared? And if the answer is never (as it seems to be)
>>> then why not just make regular pci_iomap DTRT?
>> It is in the context of confidential guest (where VMM is un-trusted). So
>> we don't want to make all PCI resource as shared. It should be allowed
>> only for hardened drivers/devices.
> Well, assuming the host can do any damage when mapped shared that also
> means not mapping it shared will completely break the drivers.

There are several cases:

- We have driver filtering active to protect you against attacks from
the host against unhardened drivers.

In this case the drivers not working is the intended behavior.

- There is a command allow list override for some new driver, but the
driver is hardened and shared

The other drivers will still not work, but that's also the intended behavior

- Driver filtering is disabled or the allow list override is used to
enable some non hardened/enabled driver

There is a command line option to override the ioremap sharing default,
it will allow all drivers to do ioremap. We would really prefer to make
it more finegrained, but it's not possible in this case. Other drivers
are likely attackable.

- Driver filtering is disabled (allowing attacks on the drivers) and the
command line option for forced sharing is set.

All drivers initialize and can talk to the host through MMIO. Lots of
unhardened drivers are likely attackable.

-Andi

2021-08-24 17:40:27

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


> I see. Hmm. It's a bit of a random thing to do it at the map time
> though. E.g. DMA is all handled transparently behind the DMA API.
> Hardening is much more than just replacing map with map_shared
> and I suspect what you will end up with is basically
> vendors replacing map with map shared to make things work
> for their users and washing their hands.

That concept exists too. There is a separate allow list for the drivers.
So just adding shared to a driver is not enough, until it's also added
to the allowlist.

Users can of course choose to disable the allowlist, but they need to
understand the security implications.


>
> I would say an explicit flag in the driver that says "hardened"
> and refusing to init a non hardened one would be better.


We have that too (that's the device filtering)

But the problem is that device filtering just stops the probe functions,
not the initcalls, and a lot of legacy drivers do MMIO interactions before
going into probe. In some cases it's unavoidable because the device
doesn't have a separate enumeration mechanism; it needs some kind of
probing to even check for its existence. And since we don't want to
change all of them it's far safer to make the ioremap opt-in.


-Andi

2021-08-24 18:57:21

by Bjorn Helgaas

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

[+cc Rajat; I still don't know what "shared memory with a hypervisor
in a confidential guest" means, but now we're talking about hardened
drivers and allow lists, which Rajat is interested in]

On Tue, Aug 24, 2021 at 10:20:44AM -0700, Andi Kleen wrote:
>
> > I see. Hmm. It's a bit of a random thing to do it at the map time
> > though. E.g. DMA is all handled transparently behind the DMA API.
> > Hardening is much more than just replacing map with map_shared
> > and I suspect what you will end up with is basically
> > vendors replacing map with map shared to make things work
> > for their users and washing their hands.
>
> That concept exists too. There is a separate allow list for the drivers. So
> just adding shared to a driver is not enough, until it's also added to the
> allowlist
>
> Users can of course chose to disable the allowlist, but they need to
> understand the security implications.
>
> > I would say an explicit flag in the driver that says "hardened"
> > and refusing to init a non hardened one would be better.
>
> We have that too (that's the device filtering)
>
> But the problem is that device filtering just stops the probe functions, not
> the initcalls, and lot of legacy drivers do MMIO interactions before going
> into probe. In some cases it's unavoidable because of the device doesn't
> have a separate enumeration mechanism it needs some kind of probing to even
> check for its existence And since we don't want to change all of them it's
> far safer to make the ioremap opt-in.
>
>
> -Andi
>

2021-08-24 20:16:50

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


On 8/24/2021 11:55 AM, Bjorn Helgaas wrote:
> [+cc Rajat; I still don't know what "shared memory with a hypervisor
> in a confidential guest" means,

A confidential guest is a guest which uses memory encryption to isolate
itself from the host. It doesn't trust the host. But it still needs to
communicate with the host for IO, so it has some special memory areas
that are explicitly marked shared. These are used to do IO with the
host. All their usage needs to be carefully hardened to avoid any
security attacks on the guest, that's why we want to limit this
interaction only to a small set of hardened drivers. For MMIO, the set
is currently only virtio and MSI-X.

-Andi


2021-08-24 20:34:49

by Bjorn Helgaas

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Tue, Aug 24, 2021 at 01:14:02PM -0700, Andi Kleen wrote:
>
> On 8/24/2021 11:55 AM, Bjorn Helgaas wrote:
> > [+cc Rajat; I still don't know what "shared memory with a hypervisor
> > in a confidential guest" means,
>
> A confidential guest is a guest which uses memory encryption to isolate
> itself from the host. It doesn't trust the host. But it still needs to
> communicate with the host for IO, so it has some special memory areas that
> are explicitly marked shared. These are used to do IO with the host. All
> their usage needs to be carefully hardened to avoid any security attacks on
> the guest, that's why we want to limit this interaction only to a small set
> of hardened drivers. For MMIO, the set is currently only virtio and MSI-X.

Good material for the commit log next time around. Thanks!

Bjorn

2021-08-24 20:51:22

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


On 8/24/2021 1:31 PM, Bjorn Helgaas wrote:
> On Tue, Aug 24, 2021 at 01:14:02PM -0700, Andi Kleen wrote:
>> On 8/24/2021 11:55 AM, Bjorn Helgaas wrote:
>>> [+cc Rajat; I still don't know what "shared memory with a hypervisor
>>> in a confidential guest" means,
>> A confidential guest is a guest which uses memory encryption to isolate
>> itself from the host. It doesn't trust the host. But it still needs to
>> communicate with the host for IO, so it has some special memory areas that
>> are explicitly marked shared. These are used to do IO with the host. All
>> their usage needs to be carefully hardened to avoid any security attacks on
>> the guest, that's why we want to limit this interaction only to a small set
>> of hardened drivers. For MMIO, the set is currently only virtio and MSI-X.
> Good material for the commit log next time around. Thanks!

This is all in the patch intro too, which should make it into the merge
commits.

I don't think we can reexplain the basic concepts for every individual
patch in a large patch kit.


-Andi

2021-08-24 21:07:30

by Dan Williams

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Tue, Aug 24, 2021 at 1:50 PM Andi Kleen <[email protected]> wrote:
>
>
> On 8/24/2021 1:31 PM, Bjorn Helgaas wrote:
> > On Tue, Aug 24, 2021 at 01:14:02PM -0700, Andi Kleen wrote:
> >> On 8/24/2021 11:55 AM, Bjorn Helgaas wrote:
> >>> [+cc Rajat; I still don't know what "shared memory with a hypervisor
> >>> in a confidential guest" means,
> >> A confidential guest is a guest which uses memory encryption to isolate
> >> itself from the host. It doesn't trust the host. But it still needs to
> >> communicate with the host for IO, so it has some special memory areas that
> >> are explicitly marked shared. These are used to do IO with the host. All
> >> their usage needs to be carefully hardened to avoid any security attacks on
> >> the guest, that's why we want to limit this interaction only to a small set
> >> of hardened drivers. For MMIO, the set is currently only virtio and MSI-X.
> > Good material for the commit log next time around. Thanks!
>
> This is all in the patch intro too, which should make it into the merge
> commits.
>
> I don't think we can reexplain the basic concepts for every individual
> patch in a large patch kit.

Maybe not the whole cover letter, but how about just a line in this
one that says "Recall that 'shared' in this context refers to memory
that lacks confidentiality and integrity protection from the VMM so
that it can communicate with the VM." Although I think
ioremap_noprotect() might be clearer than shared for the protected
guest use case?

2021-08-24 21:57:13

by Rajat Jain

[permalink] [raw]
Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

Thanks a lot Bjorn for adding me!


On Tue, Aug 24, 2021 at 11:55 AM Bjorn Helgaas <[email protected]> wrote:
>
> [+cc Rajat; I still don't know what "shared memory with a hypervisor
> in a confidential guest" means, but now we're talking about hardened
> drivers and allow lists, which Rajat is interested in]
>
> On Tue, Aug 24, 2021 at 10:20:44AM -0700, Andi Kleen wrote:
> >
> > > I see. Hmm. It's a bit of a random thing to do it at the map time
> > > though. E.g. DMA is all handled transparently behind the DMA API.
> > > Hardening is much more than just replacing map with map_shared
> > > and I suspect what you will end up with is basically
> > > vendors replacing map with map shared to make things work
> > > for their users and washing their hands.
> >
> > That concept exists too. There is a separate allow list for the drivers. So
> > just adding shared to a driver is not enough, until it's also added to the
> > allowlist
> >
> > Users can of course chose to disable the allowlist, but they need to
> > understand the security implications.

This is great. I'd be interested in looking at this allowlist
mechanism. Is this something in-kernel or in userspace? Is this
available upstream or are you maintaining this allowlist elsewhere?
(Background: https://lore.kernel.org/linux-pci/CACK8Z6E8pjVeC934oFgr=VB3pULx_GyT2NkzAogdRQJ9TKSX9A@mail.gmail.com/)

Short Summary: we also have our security team that audits drivers, and
we'd like to enable only audited drivers for the untrusted devices.
Currently, we're carrying this allowlist mechanism on our own since
the idea was Nack'ed by upstream. So if there is something available,
we'd like to use it too.

Thanks,

Rajat


> >
> > > I would say an explicit flag in the driver that says "hardened"
> > > and refusing to init a non hardened one would be better.
> >
> > We have that too (that's the device filtering)
> >
> > But the problem is that device filtering just stops the probe functions, not
> > the initcalls, and lot of legacy drivers do MMIO interactions before going
> > into probe. In some cases it's unavoidable because of the device doesn't
> > have a separate enumeration mechanism it needs some kind of probing to even
> > check for its existence And since we don't want to change all of them it's
> > far safer to make the ioremap opt-in.
> >
> >
> > -Andi
> >

2021-08-24 21:59:04

by Rajat Jain

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Mon, Aug 23, 2021 at 6:06 PM Dan Williams <[email protected]> wrote:
>
> On Mon, Aug 23, 2021 at 5:31 PM Kuppuswamy, Sathyanarayanan
> <[email protected]> wrote:
> >
> >
> >
> > On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
> > >> Add a new variant of pci_iomap for mapping all PCI resources
> > >> of a devices as shared memory with a hypervisor in a confidential
> > >> guest.
> > >>
> > >> Signed-off-by: Andi Kleen<[email protected]>
> > >> Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
> > > I'm a bit puzzled by this part. So why should the guest*not* map
> > > pci memory as shared? And if the answer is never (as it seems to be)
> > > then why not just make regular pci_iomap DTRT?
> >
> > It is in the context of confidential guest (where VMM is un-trusted). So
> > we don't want to make all PCI resource as shared. It should be allowed
> > only for hardened drivers/devices.
>
> That's confusing, isn't device authorization what keeps unaudited
> drivers from loading against untrusted devices? I'm feeling like
> Michael that this should be a detail that drivers need not care about
> explicitly, in which case it does not need to be exported because the
> detail can be buried in lower levels.
>
> Note, I specifically said "unaudited", not "hardened" because as Greg
> mentioned the kernel must trust drivers, its devices that may not be
> trusted.

Can you please point me to the thread where this discussion with Greg
is ongoing?

Thanks,

Rajat

2021-08-24 22:01:52

by Dan Williams

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Tue, Aug 24, 2021 at 2:57 PM Rajat Jain <[email protected]> wrote:
>
> On Mon, Aug 23, 2021 at 6:06 PM Dan Williams <[email protected]> wrote:
> >
> > On Mon, Aug 23, 2021 at 5:31 PM Kuppuswamy, Sathyanarayanan
> > <[email protected]> wrote:
> > >
> > >
> > >
> > > On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
> > > >> Add a new variant of pci_iomap for mapping all PCI resources
> > > >> of a devices as shared memory with a hypervisor in a confidential
> > > >> guest.
> > > >>
> > > >> Signed-off-by: Andi Kleen<[email protected]>
> > > >> Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
> > > > I'm a bit puzzled by this part. So why should the guest*not* map
> > > > pci memory as shared? And if the answer is never (as it seems to be)
> > > > then why not just make regular pci_iomap DTRT?
> > >
> > > It is in the context of confidential guest (where VMM is un-trusted). So
> > > we don't want to make all PCI resource as shared. It should be allowed
> > > only for hardened drivers/devices.
> >
> > That's confusing, isn't device authorization what keeps unaudited
> > drivers from loading against untrusted devices? I'm feeling like
> > Michael that this should be a detail that drivers need not care about
> > explicitly, in which case it does not need to be exported because the
> > detail can be buried in lower levels.
> >
> > Note, I specifically said "unaudited", not "hardened" because as Greg
> > mentioned the kernel must trust drivers, its devices that may not be
> > trusted.
>
> Can you please point me to the thread where this discussion with Greg
> is ongoing?

It slowed down to implement the "move to the 'authorized' device
model" recommendation. LWN has a good writeup (as usual) and a link to
the thread:

https://lwn.net/Articles/865918/

2021-08-25 18:31:47

by Bjorn Helgaas

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Tue, Aug 24, 2021 at 01:50:00PM -0700, Andi Kleen wrote:
>
> On 8/24/2021 1:31 PM, Bjorn Helgaas wrote:
> > On Tue, Aug 24, 2021 at 01:14:02PM -0700, Andi Kleen wrote:
> > > On 8/24/2021 11:55 AM, Bjorn Helgaas wrote:
> > > > [+cc Rajat; I still don't know what "shared memory with a hypervisor
> > > > in a confidential guest" means,
> > > A confidential guest is a guest which uses memory encryption to isolate
> > > itself from the host. It doesn't trust the host. But it still needs to
> > > communicate with the host for IO, so it has some special memory areas that
> > > are explicitly marked shared. These are used to do IO with the host. All
> > > their usage needs to be carefully hardened to avoid any security attacks on
> > > the guest, that's why we want to limit this interaction only to a small set
> > > of hardened drivers. For MMIO, the set is currently only virtio and MSI-X.
> > Good material for the commit log next time around. Thanks!
>
> This is all in the patch intro too, which should make it into the merge
> commits.

It's good if the cover letter makes it into the merge commit log.

It's probably just because my git foo is lacking, but merge commit
logs don't seem as discoverable as the actual patch commit logs. Five
years from now, if I want to learn about pci_iomap_shared() history, I
would "git log -p lib/pci_iomap.c" and search for it. But I don't
think I would see the merge commit then.

Bjorn

2021-08-29 15:29:52

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Tue, Aug 24, 2021 at 10:20:44AM -0700, Andi Kleen wrote:
>
> > I see. Hmm. It's a bit of a random thing to do it at the map time
> > though. E.g. DMA is all handled transparently behind the DMA API.
> > Hardening is much more than just replacing map with map_shared
> > and I suspect what you will end up with is basically
> > vendors replacing map with map shared to make things work
> > for their users and washing their hands.
>
> That concept exists too. There is a separate allow list for the drivers. So
> just adding shared to a driver is not enough, until it's also added to the
> allowlist
>
> Users can of course chose to disable the allowlist, but they need to
> understand the security implications.

Right. So given that, why do we need to tweak a random API like the map?
If you just make all maps be shared then the user is in control.
Seems sensible to me.

>
> >
> > I would say an explicit flag in the driver that says "hardened"
> > and refusing to init a non hardened one would be better.
>
>
> We have that too (that's the device filtering)
>
> But the problem is that device filtering just stops the probe functions, not
> the initcalls, and lot of legacy drivers do MMIO interactions before going
> into probe. In some cases it's unavoidable because of the device doesn't
> have a separate enumeration mechanism it needs some kind of probing to even
> check for its existence And since we don't want to change all of them it's
> far safer to make the ioremap opt-in.
>
>
> -Andi

Let's be frank, even without encryption disabling most drivers -
especially weird ones that poke at hardware before probe -
is far safer than keeping them, but one loses a bunch of features.
IOW all this hardening is nice, but which security/feature tradeoff
to take is a policy decision, not something the kernel should do
imho.

--
MST

2021-08-29 15:37:15

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Tue, Aug 24, 2021 at 10:04:26AM -0700, Andi Kleen wrote:
>
> On 8/24/2021 12:07 AM, Christoph Hellwig wrote:
> > On Mon, Aug 23, 2021 at 05:30:54PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> > >
> > > On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
> > > > > Add a new variant of pci_iomap for mapping all PCI resources
> > > > > of a devices as shared memory with a hypervisor in a confidential
> > > > > guest.
> > > > >
> > > > > Signed-off-by: Andi Kleen<[email protected]>
> > > > > Signed-off-by: Kuppuswamy Sathyanarayanan<[email protected]>
> > > > I'm a bit puzzled by this part. So why should the guest*not* map
> > > > pci memory as shared? And if the answer is never (as it seems to be)
> > > > then why not just make regular pci_iomap DTRT?
> > > It is in the context of confidential guest (where VMM is un-trusted). So
> > > we don't want to make all PCI resource as shared. It should be allowed
> > > only for hardened drivers/devices.
> > Well, assuming the host can do any damage when mapped shared that also
> > means not mapping it shared will completely break the drivers.
>
> There are several cases:
>
> - We have driver filtering active to protect you against attacks from the
> host against unhardened drivers.
>
> In this case the drivers not working is the intended behavior.
>
> - There is an command allow list override for some new driver, but the
> driver is hardened and shared
>
> The other drivers will still not work, but that's also the intended behavior
>
> - Driver filtering is disabled or the allow list override is used to enable
> some non hardened/enabled driver
>
> There is a command line option to override the ioremap sharing default, it
> will allow all drivers to do ioremap. We would really prefer to make it more
> finegrained, but it's not possible in this case. Other drivers are likely
> attackable.
>
> - Driver filtering is disabled (allowing attacks on the drivers) and the
> command line option for forced sharing is set.
>
> All drivers initialize and can talk to the host through MMIO. Lots of
> unhardened drivers are likely attackable.
>
> -Andi

All this makes sense, but ioremap is such a random place to declare that a
driver has been audited, and it's baked into the binary with no way for
userspace to set policy.

Again all we will end up with is gradual replacement of all ioremap
calls with ioremap_shared as people discover a given driver does not
work in a VM. How are you going to know the driver has actually been
audited? What the quality of the audit was? Did the people doing the
auditing understand what they are auditing for? No way, right?
So IMHO, let it be for now.

--
MST

2021-08-29 16:20:11

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


> Let's be frank, even without encryption disabling most drivers -
> especially weird ones that poke at hardware before probe -
> is far safer than keeping them, but one loses a bunch of features.

Usually we don't lose features at all. None of the legacy drivers are
needed on a guest (or even a modern native system). It's all just
for old hardware. Maybe in 20+ years it can all be removed, but we can't
wait that long.

> IOW all this hardening is nice but which security/feature tradeoff
> to take it a policy decision, not something kernel should do
> imho.

There's no mechanism to push this kind of policy to user space. Users
don't have control over which initcalls run. At the time they execute there
isn't even any user space yet.

Even if they could somehow control them it's very unlikely they would
understand them and make an informed decision.

Doing it at build time is not feasible either since we want to run on
standard distribution kernels.

For modules we have a policy mechanism (prevent udev probing by
preventing enumeration), and that is implemented, but only handling
modules is not enough. The compiled in drivers have to be handled too,
otherwise you have gaping holes in the protection. We don't prevent
users manually loading modules that might probe, but that is a policy
decision that users actually sensibly make in user space.

Also, is changing this single call really that bad? It's not that we're
changing anything drastic here, just giving the low level subsystem a
better hint about the intention. If you don't like the function name,
we could make it an argument instead?

-Andi



>

2021-08-29 16:46:31

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

> All this makes sense but ioremap is such a random place to declare
> driver has been audited, and it's baked into the binary with no way for
> userspace to set policy.
>
> Again all we will end up with is gradual replacement of all ioremap
> calls with ioremap_shared as people discover a given driver does not
> work in a VM.

No the device filter will prevent that. They would need to submit
patches to the central list.

Or they can override it at the command line, but there is already an
option to force all ioremaps to be shared. So if you just want some
driver to run without caring about security you can already do that
without modifying it.

Besides the shared concept only makes sense for virtual devices, so if
you don't have a device model for a device this will never happen either.

So I don't think your scenario of all ioremaps becoming shared will ever
happen.


> How are you going to know driver has actually been
> audited? what the quality of the audit was? did the people doing the
> auditing understand what they are auditing for? No way, right?

First, the primary purpose of the ioremap policy is to avoid messing with
all the legacy drivers (which are 99+% of the code base).

How to handle someone maliciously submitting a driver to the kernel is a
completely different problem. To some degree there is trust of course.
If someone says they audited a driver and a maintainer trusts their
statement, but they actually didn't, there is a problem, but it's larger
than just TDX. But in such a case the community code audit will also
focus on such areas, and people interested in confidential computing
security will also start taking a look.

And we're also working on fuzzing, so these drivers will be fuzzed at
some point, so mistakes will eventually be found.

But in any case the ioremap policy is mainly to prevent having to handle
this for all legacy drivers.

I would rather change the few drivers that are actually needed, than all
the other drivers.

That's really the fundamental problem this is addressing: how to get
reasonable security with all the legacy drivers out of the box without
touching them.


-Andi


2021-08-29 22:27:27

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Sun, Aug 29, 2021 at 09:17:53AM -0700, Andi Kleen wrote:
> Also I changing this single call really that bad? It's not that we changing
> anything drastic here, just give the low level subsystem a better hint about
> the intention. If you don't like the function name, could make it an
> argument instead?

My point however is that the API should say that the
driver has been audited, not that the mapping has been
done in some special way. For example the mapping can be
in some kind of wrapper, not directly in the driver.
However you want the driver validated, not the wrapper.

Here's an idea:



diff --git a/include/linux/audited.h b/include/linux/audited.h
new file mode 100644
index 000000000000..e23fd6ad50db
--- /dev/null
+++ b/include/linux/audited.h
@@ -0,0 +1,3 @@
+#ifndef AUDITED_MODULE
+#define AUDITED_MODULE
+#endif

Now any audited driver must do
#include <linux/audited.h>
first of all.
Implementation-wise it can do any number of things,
e.g. if you like then sure you can do:

#ifdef AUDITED_MODULE
#define pci_ioremap pci_ioremap_shared
#else
#define pci_ioremap pci_ioremap
#endif

but you can also thinkably do something like (won't work,
but just to give you the idea):

#ifdef AUDITED_MODULE
#define __init __init
#else
#define __init
#endif

or any number of hacks like this.


--
MST

2021-08-30 05:13:49

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


On 8/29/2021 3:26 PM, Michael S. Tsirkin wrote:
> On Sun, Aug 29, 2021 at 09:17:53AM -0700, Andi Kleen wrote:
>> Also I changing this single call really that bad? It's not that we changing
>> anything drastic here, just give the low level subsystem a better hint about
>> the intention. If you don't like the function name, could make it an
>> argument instead?
> My point however is that the API should say that the
> driver has been audited,

We have that status in the struct device. If you want to tie the ioremap
to that we could define an ioremap_device() with a device argument and
decide based on that.
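
(A rough sketch of that idea, purely illustrative; the dev_is_authorized()
helper below is hypothetical and stands in for whatever per-device
audited/allow-listed state ends up in struct device:)

static inline void __iomem *ioremap_device(struct device *dev,
					   phys_addr_t addr, size_t size)
{
	/* Only devices bound to allow-listed, audited drivers get a shared map. */
	if (dev_is_authorized(dev))
		return ioremap_shared(addr, size);

	return ioremap(addr, size);
}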

Or we can add _audited to the name. ioremap_shared_audited?

> not that the mapping has been
> done in some special way. For example the mapping can be
> in some kind of wrapper, not directly in the driver.
> However you want the driver validated, not the wrapper.
>
> Here's an idea:


I don't think magic differences of API behavior based on some define are
a good idea.  That's easy to miss.

That's a "COME FROM" in API design.

Also it wouldn't handle the case that a driver has both private and
shared ioremaps, e.g. for BIOS structures.

And we've been avoiding letting drivers self-declare auditing; we've
been trying to have a separate centralized list so that it's easier to
enforce and avoids any cut'n'paste mistakes.

-Andi

2021-08-30 21:04:03

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Sun, Aug 29, 2021 at 10:11:46PM -0700, Andi Kleen wrote:
>
> On 8/29/2021 3:26 PM, Michael S. Tsirkin wrote:
> > On Sun, Aug 29, 2021 at 09:17:53AM -0700, Andi Kleen wrote:
> > > Also is changing this single call really that bad? It's not that we're changing
> > > anything drastic here, just giving the low level subsystem a better hint about
> > > the intention. If you don't like the function name, we could make it an
> > > argument instead?
> > My point however is that the API should say that the
> > driver has been audited,
>
> We have that status in the struct device. If you want to tie the ioremap to
> that we could define a ioremap_device() with a device argument and decide
> based on that.

But it's not the device that is audited. And it's not the device
that might be secure or insecure. It's the driver.

> Or we can add _audited to the name. ioremap_shared_audited?

But it's not the mapping that has to be handled in some special way.
It's any data we get from the device, not all of it coming from IO, e.g.
there's DMA and interrupts that all have to be validated.
Wouldn't you say that what is really wanted is just not running
unaudited drivers in the first place?

>
> > not that the mapping has been
> > done in some special way. For example the mapping can be
> > in some kind of wrapper, not directly in the driver.
> > However you want the driver validated, not the wrapper.
> >
> > Here's an idea:
>
>
> I don't think magic differences of API behavior based on some define are a
> good idea. That's easy to miss.

Well ... my point is that actually there is no difference in API
behaviour. The map is the same map, exactly the same data goes to the
device. If anything, a non-shared map is special in that encrypted data
goes to the device.

>
> That's a "COME FROM" in API design.
>
> Also it wouldn't handle the case that a driver has both private and shared
> ioremaps, e.g. for BIOS structures.

Hmm. Interesting. It's BIOS maps that are unusual and need to be private though ...

> And we've been avoiding that drivers can self declare auditing, we've been
> trying to have a separate centralized list so that it's easier to enforce
> and avoids any cut'n'paste mistakes.
>
> -Andi

Now I'm confused. What is proposed here seems to be basically that,
drivers need to declare auditing by replacing ioremap with
ioremap_shared.

--
MST

2021-08-31 00:25:11

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


On 8/30/2021 1:59 PM, Michael S. Tsirkin wrote:
>
>> Or we can add _audited to the name. ioremap_shared_audited?
> But it's not the mapping that has to be handled in some special way.
> It's any data we get from the device, not all of it coming from IO, e.g.
> there's DMA and interrupts that all have to be validated.
> Wouldn't you say that what is really wanted is just not running
> unaudited drivers in the first place?


Yes.


>
>> And we've been avoiding that drivers can self declare auditing, we've been
>> trying to have a separate centralized list so that it's easier to enforce
>> and avoids any cut'n'paste mistakes.
>>
>> -Andi
> Now I'm confused. What is proposed here seems to be basically that,
> drivers need to declare auditing by replacing ioremap with
> ioremap_shared.

Auditing is declared on the device model level using a central allow list.

But this cannot do anything to initcalls that run before probe, that's
why an extra level of defense of ioremap opt-in is useful. But it's not
the primary mechanism to declare a driver audited, that's the allow
list. The ioremap is just another mechanism to avoid having to touch a
lot of legacy drivers.
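
To illustrate the shape of that central list (sketch only; the names and
entries below are made up for the example, not the actual filter code):

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/string.h>

/* Sketch: drivers authorized to bind in a protected guest. */
static const char * const authorized_drivers[] = {
	"virtio-pci",		/* placeholder entries */
	"virtio_console",
};

static bool protected_guest_driver_authorized(const struct device_driver *drv)
{
	unsigned int i;

	for (i = 0; i < ARRAY_SIZE(authorized_drivers); i++)
		if (!strcmp(drv->name, authorized_drivers[i]))
			return true;
	return false;
}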

If we agree on that then the original proposed semantics of
"ioremap_shared" may be acceptable?

-Andi



2021-09-10 09:55:49

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Mon, Aug 30, 2021 at 05:23:17PM -0700, Andi Kleen wrote:
>
> On 8/30/2021 1:59 PM, Michael S. Tsirkin wrote:
> >
> > > Or we can add _audited to the name. ioremap_shared_audited?
> > But it's not the mapping that has to be handled in some special way.
> > It's any data we get from the device, not all of it coming from IO, e.g.
> > there's DMA and interrupts that all have to be validated.
> > Wouldn't you say that what is really wanted is just not running
> > unaudited drivers in the first place?
>
>
> Yes.

Then ... let's do just that?

>
> >
> > > And we've been avoiding that drivers can self declare auditing, we've been
> > > trying to have a separate centralized list so that it's easier to enforce
> > > and avoids any cut'n'paste mistakes.
> > >
> > > -Andi
> > Now I'm confused. What is proposed here seems to be basically that,
> > drivers need to declare auditing by replacing ioremap with
> > ioremap_shared.
>
> Auditing is declared on the device model level using a central allow list.

Can we not have an init call allow list instead of, or in addition to, a
device allow list?

> But this cannot do anything to initcalls that run before probe,

Can't we extend module_init so init calls are validated against the
allow list?

> that's why
> an extra level of defense of ioremap opt-in is useful.

OK even assuming this, why is pci_iomap opt-in useful?
That never happens before probe - there's simply no pci_device then.

> But it's not the
> primary mechanism to declare a driver audited, that's the allow list. The
> ioremap is just another mechanism to avoid having to touch a lot of legacy
> drivers.
>
> If we agree on that then the original proposed semantics of "ioremap_shared"
> may be acceptable?
>
> -Andi
>

It looks suspiciously like drivers self-declaring auditing to me which
we both seem to agree is undesirable. What exactly is the difference?

Or are you just trying to disable anything that runs before probe?
In that case I don't see a reason to touch pci drivers though.
These should be fine with just the device model list.

--
MST

2021-09-10 16:35:50

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


>>>> And we've been avoiding that drivers can self declare auditing, we've been
>>>> trying to have a separate centralized list so that it's easier to enforce
>>>> and avoids any cut'n'paste mistakes.
>>>>
>>>> -Andi
>>> Now I'm confused. What is proposed here seems to be basically that,
>>> drivers need to declare auditing by replacing ioremap with
>>> ioremap_shared.
>> Auditing is declared on the device model level using a central allow list.
> Can we not have an init call allow list instead of, or in addition to, a
> device allow list?


That would be quite complicated and intrusive. In fact I'm not even sure
how to maintain something like this. There are a lot of needed
initcalls; they would all need to be marked. How can we distinguish
them? It would be a giant auditing project. And of course how would you
prevent it from bitrotting?


Basically it would be hundreds of changes all over the tree, just to
avoid two changes in virtio and MSI. The approach of just stopping the
initcalls from doing bad things is much less intrusive.
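
(For scale, the virtio side of such a change is roughly a one-line
substitution per mapping site, along these lines; the exact hunks in the
posted patch may differ:)

-	p = pci_iomap_range(dev, bar, offset, length);
+	p = pci_iomap_shared_range(dev, bar, offset, length);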

>
>> But this cannot do anything to initcalls that run before probe,
> Can't we extend module_init so init calls are validated against the
> allow list?

See above.


Also the problem isn't really with modules (we rely on udev not loading
them), but with built-in initcalls.


>
>> that's why
>> an extra level of defense of ioremap opt-in is useful.
> OK even assuming this, why is pci_iomap opt-in useful?
> That never happens before probe - there's simply no pci_device then.


Hmm, yes that's true. I guess we can make it default to opt-in for
pci_iomap.

It only really matters for device less ioremaps.
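
To sketch what that default could look like (illustrative only, ignoring
IO port BARs; pci_dev_authorized() is a made-up stand-in for the device
filter status, and ioremap_shared() is the shared variant proposed here):

static void __iomem *pci_iomap_default(struct pci_dev *pdev, int bar,
				       unsigned long maxlen)
{
	resource_size_t start = pci_resource_start(pdev, bar);
	resource_size_t len = pci_resource_len(pdev, bar);

	if (!start || !len)
		return NULL;
	if (maxlen && len > maxlen)
		len = maxlen;

	/* Shared (decrypted) mapping only for allow-listed devices. */
	if (pci_dev_authorized(pdev))		/* placeholder */
		return ioremap_shared(start, len);

	return ioremap(start, len);
}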

>
> It looks suspiciously like drivers self-declaring auditing to me which
> we both seem to agree is undesirable. What exactly is the difference?


Just allow-listing the ioremaps is not self-declaration because the
device will still not initialize due to the central device filter. If
you want to use it, that has to be changed.

It's just an additional safety net to contain code running before probe.


>
> Or are you just trying to disable anything that runs before probe?


Well, anything that could do dangerous host interactions (like processing
ioremap data). A lot of things are harmless and can be allowed, or are
already blocked elsewhere (e.g. we have an IO port filter). This just
handles the ioremap/MMIO case.

> In that case I don't see a reason to touch pci drivers though.
> These should be fine with just the device model list.


That won't stop initcalls.


-Andi


>

2021-09-11 23:56:36

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Fri, Sep 10, 2021 at 09:34:45AM -0700, Andi Kleen wrote:
> > > that's why
> > > an extra level of defense of ioremap opt-in is useful.
> > OK even assuming this, why is pci_iomap opt-in useful?
> > That never happens before probe - there's simply no pci_device then.
>
>
> Hmm, yes that's true. I guess we can make it default to opt-in for
> pci_iomap.
>
> It only really matters for device less ioremaps.

OK. And same thing for other things with device, such as
devm_platform_ioremap_resource.
If we agree on all that, this will basically remove virtio
changes from the picture ;)

--
MST

2021-09-13 05:57:01

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Sat, Sep 11, 2021 at 07:54:43PM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 10, 2021 at 09:34:45AM -0700, Andi Kleen wrote:
> > > > that's why
> > > > an extra level of defense of ioremap opt-in is useful.
> > > OK even assuming this, why is pci_iomap opt-in useful?
> > > That never happens before probe - there's simply no pci_device then.
> >
> >
> > Hmm, yes that's true. I guess we can make it default to opt-in for
> > pci_iomap.
> >
> > It only really matters for device less ioremaps.
>
> OK. And same thing for other things with device, such as
> devm_platform_ioremap_resource.
> If we agree on all that, this will basically remove virtio
> changes from the picture ;)


Something else that was pointed out to me:

	fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);
	if (IS_ERR(fs->window_kaddr))
		return PTR_ERR(fs->window_kaddr);


looks like if we forget to set the shared flag then it will
corrupt the DAX data?


> --
> MST
>

2021-09-25 06:36:55

by Andi Kleen

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}


>> Hmm, yes that's true. I guess we can make it default to opt-in for
>> pci_iomap.
>>
>> It only really matters for device less ioremaps.
> OK. And same thing for other things with device, such as
> devm_platform_ioremap_resource.
> If we agree on all that, this will basically remove virtio
> changes from the picture ;)

Hi we revisited this now. One problem with removing the ioremap opt-in
is that it's still possible for drivers to get at devices without going
through probe. For example they can walk the PCI device list. Some
drivers do that for various reasons. So if we remove the opt-in we would
need to audit and possibly fix all that, which would be potentially a
lot of churn. That's why I think it's better to keep the opt-in.
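
(The pattern in question is the long-standing way of iterating the PCI
device list directly, e.g. something like the following, which never goes
through probe or the device filter:)

	struct pci_dev *pdev = NULL;

	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL) {
		/* driver pokes at pdev here, possibly ioremapping its BARs */
	}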


-Andi


2021-09-27 09:09:55

by Michael S. Tsirkin

Subject: Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

On Fri, Sep 24, 2021 at 03:43:40PM -0700, Andi Kleen wrote:
>
> > > Hmm, yes that's true. I guess we can make it default to opt-in for
> > > pci_iomap.
> > >
> > > It only really matters for device less ioremaps.
> > OK. And same thing for other things with device, such as
> > devm_platform_ioremap_resource.
> > If we agree on all that, this will basically remove virtio
> > changes from the picture ;)
>
> Hi we revisited this now. One problem with removing the ioremap opt-in is
> that it's still possible for drivers to get at devices without going through
> probe. For example they can walk the PCI device list. Some drivers do that
> for various reasons. So if we remove the opt-in we would need to audit and
> possibly fix all that, which would be potentially a lot of churn. That's why
> I think it's better to keep the opt-in.
>
>
> -Andi
>

I've been thinking about why this still feels wrong to me.

Here's what I came up with: at some point someone will want one of these
modules (poking at devices in the initcall) in the encrypted
environment, and will change ioremap to ioremap_shared.
At that point the allowlist will be broken again, and
by that time it will be set in stone and too late to fix.

Isn't the problem that what is actually audited is modules,
but you are trying to add devices to the allow list?
So why not have modules/initcalls in the allowlist then?
For built-in modules, we already have initcall_blacklisted, right?
This could be an extension ... no?
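
(For reference, the existing mechanism is the initcall_blacklist= boot
parameter, which takes a comma-separated list of initcall function names
to skip; the names below are placeholders:

	initcall_blacklist=foo_driver_init,bar_bus_init

An allow list for a protected guest would essentially invert that check.)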

--
MST