2023-10-17 20:25:44

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 00/10] Handle set_memory_XXcrypted() errors

Shared pages should never be returned to the page allocator, or future usage
of the pages may allow for the contents to be exposed to the host. They may
also cause the guest to crash if the page is used in a way disallowed by HW
(e.g. for executable code or as a page table).

Normally set_memory() call failures are rare. But on TDX
set_memory_XXcrypted() involves calls to the untrusted VMM, and an attacker
could fail these calls such that:
1. set_memory_encrypted() returns an error and leaves the pages fully
shared.
2. set_memory_decrypted() returns an error, but the pages are actually
fully converted to shared.

This means that patterns like the below can cause problems:
	void *addr = alloc();
	int fail = set_memory_decrypted(addr, 1);
	if (fail)
		free_pages(addr, 0);

And:
	void *addr = alloc();
	int fail = set_memory_decrypted(addr, 1);
	if (fail) {
		set_memory_encrypted(addr, 1);
		free_pages(addr, 0);
	}

Unfortunately these patterns are all over the place. And what the
set_memory() callers should do in this situation is not clear either. They
shouldn't use the pages as shared because something clearly went wrong, but
they also need the pages fully reset to private in order to free them. But
the kernel needs the VMM's help to do this, and the VMM is already being
uncooperative around the needed operations. So this isn't guaranteed to
succeed, and the caller is stuck with unusable pages.

Looking at QEMU/KVM as an example, these VMM conversion failures either
indicate an attempt to attack the guest, or resource constraints on the
host. Preventing a DoS attack is out of scope for the CoCo threat model,
so this leaves the host resource constraint cause. When similar resource
constraints are encountered in the host, KVM punts the problem to
userspace and QEMU terminates the guest. When similar problems are
detected inside set_memory(), SEV issues a command to terminate the guest.

This all makes it appealing to simply panic (via a tdx_panic() call,
which informs the host what is happening) when observing troublesome VMM
behavior around the memory conversion. It is:
- Consistent with similar behavior on the SEV side.
- Generally more consistent with how host resource constraints are handled
(at least in QEMU/KVM).
- A more foolproof defense against the attack scenario.

Nevertheless, doing so would be an instance of the "crash the kernel for
security reasons" pattern. That is a strong argument against it, and
crashing is not strictly needed because the unusable pages could simply be
leaked (as they already are in some cases). So instead, this series does a
tree-wide search and fixes the callers to handle the error by leaking the
pages. Going forward, callers will need to handle the set_memory() errors
correctly in order to not reintroduce the issue.

I think there are some points for both sides, and we had some internal
discussion on the right way to handle it. So I've tried to characterize
both arguments. I'm interested to hear opinions on which approach is best.

I’ve marked the hyperv guest parts in this as RFC, both because I can’t
test them and I believe Linux TDs can’t run on hyperv yet due to some
missing support. I would appreciate a correction on this if it’s wrong.

Rick Edgecombe (10):
mm: Add helper for freeing decrypted memory
x86/mm/cpa: Reject incorrect encryption change requests
kvmclock: Use free_decrypted_pages()
swiotlb: Use free_decrypted_pages()
ptp: Use free_decrypted_pages()
dma: Use free_decrypted_pages()
hv: Use free_decrypted_pages()
hv: Track decrypted status in vmbus_gpadl
hv_nstvsc: Don't free decrypted memory
uio_hv_generic: Don't free decrypted memory

arch/s390/include/asm/set_memory.h | 1 +
arch/x86/kernel/kvmclock.c | 2 +-
arch/x86/mm/pat/set_memory.c | 41 +++++++++++++++++++++++++++++-
drivers/hv/channel.c | 18 ++++++++-----
drivers/hv/connection.c | 13 +++++++---
drivers/net/hyperv/netvsc.c | 7 +++--
drivers/ptp/ptp_kvm_x86.c | 2 +-
drivers/uio/uio_hv_generic.c | 12 ++++++---
include/linux/dma-map-ops.h | 3 ++-
include/linux/hyperv.h | 1 +
include/linux/set_memory.h | 13 ++++++++++
kernel/dma/contiguous.c | 2 +-
kernel/dma/swiotlb.c | 11 +++++---
13 files changed, 101 insertions(+), 25 deletions(-)

--
2.34.1


2023-10-17 20:26:30

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 01/10] mm: Add helper for freeing decrypted memory

When freeing decrypted memory to the page allocator, the memory needs to be
manually re-encrypted beforehand. If this step is skipped, then the next
user of those pages will have the contents inadvertently exposed to
the host, or the pages may cause the guest to crash if they are used in a
way disallowed by HW (e.g. for executable code or as a page table).

Unfortunately, there are many instances of patterns like:
	set_memory_encrypted(pages);
	free_pages(pages);

...or...

	if (set_memory_decrypted(addr, 1))
		free_pages(pages);

This is a problem because set_memory_encrypted() and
set_memory_decrypted() can be failed by the untrusted host in such a way
that an error is returned and the resulting memory is shared.

To aid in a tree-wide cleanup of these callers, add a
free_decrypted_pages() function that will first try to re-encrypt the pages
before returning them to the allocator. If that is not successful, have it
leak the pages and warn about this. This is preferable to returning shared
pages to the allocator or panicking.

In some cases the code paths for freeing decrypted memory handle both
encrypted and decrypted pages. In these cases, rely on set_memory() to
handle being asked to convert memory to the state it is already in.

Going forward, rely on cross-arch callers to find and use
free_decrypted_pages() instead of resorting to more heavy handed solutions
like terminating the guest when nasty VMM behavior is observed.

To make s390's arch set_memory_XXcrypted() definitions available in
linux/set_memory.h, add include for s390's asm version of set_memory.h.

Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: [email protected]
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/s390/include/asm/set_memory.h | 1 +
include/linux/set_memory.h | 13 +++++++++++++
2 files changed, 14 insertions(+)

diff --git a/arch/s390/include/asm/set_memory.h b/arch/s390/include/asm/set_memory.h
index 06fbabe2f66c..09d36ebd64b5 100644
--- a/arch/s390/include/asm/set_memory.h
+++ b/arch/s390/include/asm/set_memory.h
@@ -3,6 +3,7 @@
#define _ASMS390_SET_MEMORY_H

#include <linux/mutex.h>
+#include <linux/mem_encrypt.h>

extern struct mutex cpa_mutex;

diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 95ac8398ee72..a898b14b6b1f 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -5,6 +5,8 @@
#ifndef _LINUX_SET_MEMORY_H_
#define _LINUX_SET_MEMORY_H_

+#include <linux/gfp.h>
+
#ifdef CONFIG_ARCH_HAS_SET_MEMORY
#include <asm/set_memory.h>
#else
@@ -78,4 +80,15 @@ static inline int set_memory_decrypted(unsigned long addr, int numpages)
}
#endif /* CONFIG_ARCH_HAS_MEM_ENCRYPT */

+static inline void free_decrypted_pages(unsigned long addr, int order)
+{
+ int ret = set_memory_encrypted(addr, 1 << order);
+
+ if (ret) {
+ WARN_ONCE(1, "Failed to re-encrypt memory before freeing, leaking pages!\n");
+ return;
+ }
+ free_pages(addr, order);
+}
+
#endif /* _LINUX_SET_MEMORY_H_ */
--
2.34.1

2023-10-17 20:26:33

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 05/10] ptp: Use free_decrypted_pages()

On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

Ptp could free decrypted/shared pages if set_memory_decrypted() fails.
Use the recently added free_decrypted_pages() to avoid this.

Cc: Richard Cochran <[email protected]>
Cc: [email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
---
drivers/ptp/ptp_kvm_x86.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/ptp/ptp_kvm_x86.c b/drivers/ptp/ptp_kvm_x86.c
index 902844cc1a17..203af060013d 100644
--- a/drivers/ptp/ptp_kvm_x86.c
+++ b/drivers/ptp/ptp_kvm_x86.c
@@ -36,7 +36,7 @@ int kvm_arch_ptp_init(void)
clock_pair = page_address(p);
ret = set_memory_decrypted((unsigned long)clock_pair, 1);
if (ret) {
- __free_page(p);
+ free_decrypted_pages((unsigned long)clock_pair, 0);
clock_pair = NULL;
goto nofree;
}
--
2.34.1

2023-10-17 20:26:33

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 02/10] x86/mm/cpa: Reject incorrect encryption change requests

Kernel memory is "encrypted" by default. Some callers may "decrypt" it
in order to share it with things outside the kernel like a device or an
untrusted VMM.

There is nothing to stop set_memory_encrypted() from being passed memory
that is already "encrypted" (aka. "private" on TDX). In fact, some
callers do this because ... $REASONS. Unfortunately, part of the TDX
decrypted=>encrypted transition is truly one way*. It can't handle
being asked to encrypt an already-encrypted page.
Allow __set_memory_enc_pgtable() to detect already-encrypted memory
before it hits the TDX code.

* The one way part is "page acceptance"

[commit log written by Dave Hansen]
Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 41 +++++++++++++++++++++++++++++++++++-
1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f129835e..1238b0db3e33 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2122,6 +2122,21 @@ int set_memory_global(unsigned long addr, int numpages)
__pgprot(_PAGE_GLOBAL), 0);
}

+static bool kernel_vaddr_encryped(unsigned long addr, bool enc)
+{
+ unsigned int level;
+ pte_t *pte;
+
+ pte = lookup_address(addr, &level);
+ if (!pte)
+ return false;
+
+ if (enc)
+ return pte_val(*pte) == cc_mkenc(pte_val(*pte));
+
+ return pte_val(*pte) == cc_mkdec(pte_val(*pte));
+}
+
/*
* __set_memory_enc_pgtable() is used for the hypervisors that get
* informed about "encryption" status via page tables.
@@ -2130,7 +2145,7 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
{
pgprot_t empty = __pgprot(0);
struct cpa_data cpa;
- int ret;
+ int ret, numpages_in_state = 0;

/* Should not be working on unaligned addresses */
if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
@@ -2143,6 +2158,30 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
cpa.pgd = init_mm.pgd;

+ /*
+ * If any page is already in the right state, bail with an error
+ * because the code doesn't handled it. This is likely because
+ * something has gone wrong and isn't worth optimizing for.
+ *
+ * If all the memory pages are already in the desired state return
+ * success.
+ *
+ * kernel_vaddr_encryped() does not synchronize against huge page
+ * splits so take pgd_lock. A caller doing strange things could
+ * get a new PMD mid level PTE confused with a huge PMD entry. Just
+ * lock to tie up loose ends.
+ */
+ spin_lock(&pgd_lock);
+ for (int i = 0; i < numpages; i++) {
+ if (kernel_vaddr_encryped(addr + (PAGE_SIZE * i), enc))
+ numpages_in_state++;
+ }
+ spin_unlock(&pgd_lock);
+ if (numpages_in_state == numpages)
+ return 0;
+ else if (numpages_in_state)
+ return 1;
+
/* Must avoid aliasing mappings in the highmem code */
kmap_flush_unused();
vm_unmap_aliases();
--
2.34.1

2023-10-17 20:26:37

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 06/10] dma: Use free_decrypted_pages()

On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

DMA could free decrypted/shared pages if set_memory_decrypted() fails.
Use the recently added free_decrypted_pages() to avoid this.

Several paths also result in proper encrypted pages being freed through
the same freeing function. Rely on free_decrypted_pages() to not leak the
memory in these cases.

Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: [email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
---
include/linux/dma-map-ops.h | 3 ++-
kernel/dma/contiguous.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index f2fc203fb8a1..b0800cbbc357 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -9,6 +9,7 @@
#include <linux/dma-mapping.h>
#include <linux/pgtable.h>
#include <linux/slab.h>
+#include <linux/set_memory.h>

struct cma;

@@ -165,7 +166,7 @@ static inline struct page *dma_alloc_contiguous(struct device *dev, size_t size,
static inline void dma_free_contiguous(struct device *dev, struct page *page,
size_t size)
{
- __free_pages(page, get_order(size));
+ free_decrypted_pages((unsigned long)page_address(page), get_order(size));
}
#endif /* CONFIG_DMA_CMA*/

diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index f005c66f378c..e962f1f6434e 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -429,7 +429,7 @@ void dma_free_contiguous(struct device *dev, struct page *page, size_t size)
}

/* not in any cma, free from buddy */
- __free_pages(page, get_order(size));
+ free_decrypted_pages((unsigned long)page_address(page), get_order(size));
}

/*
--
2.34.1

2023-10-17 20:26:49

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 04/10] swiotlb: Use free_decrypted_pages()

On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

Swiotlb could free decrypted/shared pages if set_memory_decrypted() fails.
Use the recently added free_decrypted_pages() to avoid this.

In swiotlb_exit(), check for set_memory_encrypted() errors manually,
because the pages are not necessarily going to the page allocator.

Cc: Christoph Hellwig <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: [email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
---
kernel/dma/swiotlb.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 394494a6b1f3..ad06786c4f98 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -524,6 +524,7 @@ void __init swiotlb_exit(void)
unsigned long tbl_vaddr;
size_t tbl_size, slots_size;
unsigned int area_order;
+ int ret;

if (swiotlb_force_bounce)
return;
@@ -536,17 +537,19 @@ void __init swiotlb_exit(void)
tbl_size = PAGE_ALIGN(mem->end - mem->start);
slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));

- set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
+ ret = set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
if (mem->late_alloc) {
area_order = get_order(array_size(sizeof(*mem->areas),
mem->nareas));
free_pages((unsigned long)mem->areas, area_order);
- free_pages(tbl_vaddr, get_order(tbl_size));
+ if (!ret)
+ free_pages(tbl_vaddr, get_order(tbl_size));
free_pages((unsigned long)mem->slots, get_order(slots_size));
} else {
memblock_free_late(__pa(mem->areas),
array_size(sizeof(*mem->areas), mem->nareas));
- memblock_free_late(mem->start, tbl_size);
+ if (!ret)
+ memblock_free_late(mem->start, tbl_size);
memblock_free_late(__pa(mem->slots), slots_size);
}

@@ -581,7 +584,7 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes)
return page;

error:
- __free_pages(page, order);
+ free_decrypted_pages((unsigned long)vaddr, order);
return NULL;
}

--
2.34.1

2023-10-17 20:26:54

by Edgecombe, Rick P

[permalink] [raw]
Subject: [RFC 08/10] hv: Track decrypted status in vmbus_gpadl

On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

In order to make sure callers of vmbus_establish_gpadl() and
vmbus_teardown_gpadl() don't return decrypted/shared pages to
allocators, add a field in struct vmbus_gpadl to keep track of the
decryption status of the buffers. This will allow the callers to
know if they should free or leak the pages.

Only compile tested.

Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: [email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
---
drivers/hv/channel.c | 11 ++++++++---
include/linux/hyperv.h | 1 +
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 1ad8f7fabe06..0a7dcbb48140 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -479,6 +479,7 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
ret = set_memory_decrypted((unsigned long)kbuffer,
PFN_UP(size));
if (ret) {
+ gpadl->decrypted = false;
dev_warn(&channel->device_obj->device,
"Failed to set host visibility for new GPADL %d.\n",
ret);
@@ -551,6 +552,7 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
gpadl->gpadl_handle = gpadlmsg->gpadl;
gpadl->buffer = kbuffer;
gpadl->size = size;
+ gpadl->decrypted = true;


cleanup:
@@ -564,9 +566,10 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,

kfree(msginfo);

- if (ret)
- set_memory_encrypted((unsigned long)kbuffer,
- PFN_UP(size));
+ if (ret) {
+ if (set_memory_encrypted((unsigned long)kbuffer, PFN_UP(size)))
+ gpadl->decrypted = false;
+ }

return ret;
}
@@ -887,6 +890,8 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, struct vmbus_gpadl *gpad
if (ret)
pr_warn("Fail to set mem host visibility in GPADL teardown %d.\n", ret);

+ gpadl->decrypted = ret;
+
return ret;
}
EXPORT_SYMBOL_GPL(vmbus_teardown_gpadl);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 2b00faf98017..5bac136c268c 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -812,6 +812,7 @@ struct vmbus_gpadl {
u32 gpadl_handle;
u32 size;
void *buffer;
+ bool decrypted;
};

struct vmbus_channel {
--
2.34.1

2023-10-17 20:26:53

by Edgecombe, Rick P

[permalink] [raw]
Subject: [RFC 09/10] hv_nstvsc: Don't free decrypted memory

On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

hv_netvsc could free decrypted/shared pages if set_memory_decrypted()
fails. Check the decrypted field in the gpadl before freeing in order to
not leak the memory.

Only compile tested.

Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: [email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
---
drivers/net/hyperv/netvsc.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 82e9796c8f5e..70b7f91fb96b 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -154,8 +154,11 @@ static void free_netvsc_device(struct rcu_head *head)
int i;

kfree(nvdev->extension);
- vfree(nvdev->recv_buf);
- vfree(nvdev->send_buf);
+
+ if (!nvdev->recv_buf_gpadl_handle.decrypted)
+ vfree(nvdev->recv_buf);
+ if (!nvdev->send_buf_gpadl_handle.decrypted)
+ vfree(nvdev->send_buf);
bitmap_free(nvdev->send_section_map);

for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
--
2.34.1

2023-10-17 20:26:58

by Edgecombe, Rick P

[permalink] [raw]
Subject: [RFC 10/10] uio_hv_generic: Don't free decrypted memory

On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

uio_hv_generic could free decrypted/shared pages if
set_memory_decrypted() fails. Check the decrypted field in the gpadl
before freeing in order to not leak the memory.

Only compile tested.

Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: [email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
---
drivers/uio/uio_hv_generic.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/uio/uio_hv_generic.c b/drivers/uio/uio_hv_generic.c
index 20d9762331bd..6be3462b109f 100644
--- a/drivers/uio/uio_hv_generic.c
+++ b/drivers/uio/uio_hv_generic.c
@@ -181,12 +181,14 @@ hv_uio_cleanup(struct hv_device *dev, struct hv_uio_private_data *pdata)
{
if (pdata->send_gpadl.gpadl_handle) {
vmbus_teardown_gpadl(dev->channel, &pdata->send_gpadl);
- vfree(pdata->send_buf);
+ if (!pdata->send_gpadl.decrypted)
+ vfree(pdata->send_buf);
}

if (pdata->recv_gpadl.gpadl_handle) {
vmbus_teardown_gpadl(dev->channel, &pdata->recv_gpadl);
- vfree(pdata->recv_buf);
+ if (!pdata->recv_gpadl.decrypted)
+ vfree(pdata->recv_buf);
}
}

@@ -295,7 +297,8 @@ hv_uio_probe(struct hv_device *dev,
ret = vmbus_establish_gpadl(channel, pdata->recv_buf,
RECV_BUFFER_SIZE, &pdata->recv_gpadl);
if (ret) {
- vfree(pdata->recv_buf);
+ if (!pdata->recv_gpadl.decrypted)
+ vfree(pdata->recv_buf);
goto fail_close;
}

@@ -317,7 +320,8 @@ hv_uio_probe(struct hv_device *dev,
ret = vmbus_establish_gpadl(channel, pdata->send_buf,
SEND_BUFFER_SIZE, &pdata->send_gpadl);
if (ret) {
- vfree(pdata->send_buf);
+ if (!pdata->send_gpadl.decrypted)
+ vfree(pdata->send_buf);
goto fail_close;
}

--
2.34.1

2023-10-17 20:27:02

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 03/10] kvmclock: Use free_decrypted_pages()

On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

Kvmclock could free decrypted/shared pages if set_memory_decrypted() fails.
Use the recently added free_decrypted_pages() to avoid this.

Cc: Paolo Bonzini <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: [email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/kernel/kvmclock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f52149be9..587b159c4e53 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -227,7 +227,7 @@ static void __init kvmclock_init_mem(void)
r = set_memory_decrypted((unsigned long) hvclock_mem,
1UL << order);
if (r) {
- __free_pages(p, order);
+ free_decrypted_pages((unsigned long)hvclock_mem, order);
hvclock_mem = NULL;
pr_warn("kvmclock: set_memory_decrypted() failed. Disabling\n");
return;
--
2.34.1

2023-10-17 20:27:02

by Edgecombe, Rick P

[permalink] [raw]
Subject: [RFC 07/10] hv: Use free_decrypted_pages()

On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

Hyperv could free decrypted/shared pages if set_memory_decrypted() fails.
Use the recently added free_decrypted_pages() to avoid this.

Only compile tested.

Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: [email protected]
Signed-off-by: Rick Edgecombe <[email protected]>
---
drivers/hv/channel.c | 7 ++++---
drivers/hv/connection.c | 13 +++++++++----
2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 56f7e06c673e..1ad8f7fabe06 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -153,9 +153,10 @@ void vmbus_free_ring(struct vmbus_channel *channel)
hv_ringbuffer_cleanup(&channel->inbound);

if (channel->ringbuffer_page) {
- __free_pages(channel->ringbuffer_page,
- get_order(channel->ringbuffer_pagecount
- << PAGE_SHIFT));
+ int order = get_order(channel->ringbuffer_pagecount << PAGE_SHIFT);
+ unsigned long addr = (unsigned long)page_address(channel->ringbuffer_page);
+
+ free_decrypted_pages(addr, order);
channel->ringbuffer_page = NULL;
}
}
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 3cabeeabb1ca..cffad9b139d3 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -315,6 +315,7 @@ int vmbus_connect(void)

void vmbus_disconnect(void)
{
+ int ret;
/*
* First send the unload request to the host.
*/
@@ -337,11 +338,15 @@ void vmbus_disconnect(void)
vmbus_connection.int_page = NULL;
}

- set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[0], 1);
- set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[1], 1);
+ ret = set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[0], 1);
+ ret |= set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[1], 1);

- hv_free_hyperv_page(vmbus_connection.monitor_pages[0]);
- hv_free_hyperv_page(vmbus_connection.monitor_pages[1]);
+ if (!ret) {
+ hv_free_hyperv_page(vmbus_connection.monitor_pages[0]);
+ hv_free_hyperv_page(vmbus_connection.monitor_pages[1]);
+ } else {
+ WARN_ONCE(1, "Failed to re-encrypt memory before freeing, leaking pages!\n");
+ }
vmbus_connection.monitor_pages[0] = NULL;
vmbus_connection.monitor_pages[1] = NULL;
}
--
2.34.1

2023-10-18 04:50:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 04/10] swiotlb: Use free_decrypted_pages()

On Tue, Oct 17, 2023 at 01:24:59PM -0700, Rick Edgecombe wrote:
> On TDX it is possible for the untrusted host to cause
> set_memory_encrypted() or set_memory_decrypted() to fail such that an
> error is returned and the resulting memory is shared. Callers need to take
> care to handle these errors to avoid returning decrypted (shared) memory to
> the page allocator, which could lead to functional or security issues.
>
> Swiotlb could free decrypted/shared pages if set_memory_decrypted() fails.
> Use the recently added free_decrypted_pages() to avoid this.
>
> In swiotlb_exit(), check for set_memory_encrypted() errors manually,
> because the pages are not nessarily going to the page allocator.

Whatever recently introduced it didn't make it to my mailbox. Please
always CC everyone on every patch in a series, everything else is
impossible to review.

Subject: Re: [PATCH 03/10] kvmclock: Use free_decrypted_pages()



On 10/17/2023 1:24 PM, Rick Edgecombe wrote:
> On TDX it is possible for the untrusted host to cause
> set_memory_encrypted() or set_memory_decrypted() to fail such that an
> error is returned and the resulting memory is shared. Callers need to take
> care to handle these errors to avoid returning decrypted (shared) memory to
> the page allocator, which could lead to functional or security issues.
>
> Kvmclock could free decrypted/shared pages if set_memory_decrypted() fails.
> Use the recently added free_decrypted_pages() to avoid this.
>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> Cc: Vitaly Kuznetsov <[email protected]>
> Cc: [email protected]
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---

Since it a fix, do you want to add Fixes tag?

Otherwise, it looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>


> arch/x86/kernel/kvmclock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..587b159c4e53 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -227,7 +227,7 @@ static void __init kvmclock_init_mem(void)
> r = set_memory_decrypted((unsigned long) hvclock_mem,
> 1UL << order);
> if (r) {
> - __free_pages(p, order);
> + free_decrypted_pages((unsigned long)hvclock_mem, order);
> hvclock_mem = NULL;
> pr_warn("kvmclock: set_memory_decrypted() failed. Disabling\n");
> return;

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2023-10-18 06:26:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 06/10] dma: Use free_decrypted_pages()

On Tue, Oct 17, 2023 at 01:25:01PM -0700, Rick Edgecombe wrote:
> struct cma;
>
> @@ -165,7 +166,7 @@ static inline struct page *dma_alloc_contiguous(struct device *dev, size_t size,
> static inline void dma_free_contiguous(struct device *dev, struct page *page,
> size_t size)
> {
> - __free_pages(page, get_order(size));
> + free_decrypted_pages((unsigned long)page_address(page), get_order(size));

CMA can be highmem, so this won't work totally independent of what
free_decrypted_pages actually does. Also please avoid the overly long line.

2023-10-18 08:46:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 02/10] x86/mm/cpa: Reject incorrect encryption change requests


* Rick Edgecombe <[email protected]> wrote:

> Kernel memory is "encrypted" by default. Some callers may "decrypt" it
> in order to share it with things outside the kernel like a device or an
> untrusted VMM.
>
> There is nothing to stop set_memory_encrypted() from being passed memory
> that is already "encrypted" (aka. "private" on TDX). In fact, some
> callers do this because ... $REASONS. Unfortunately, part of the TDX
> decrypted=>encrypted transition is truly one way*. It can't handle
> being asked to encrypt an already encrypted page
>
> Allow __set_memory_enc_pgtable() to detect already-encrypted memory
> before it hits the TDX code.
>
> * The one way part is "page acceptance"
>
> [commit log written by Dave Hansen]
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> arch/x86/mm/pat/set_memory.c | 41 +++++++++++++++++++++++++++++++++++-
> 1 file changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index bda9f129835e..1238b0db3e33 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2122,6 +2122,21 @@ int set_memory_global(unsigned long addr, int numpages)
> __pgprot(_PAGE_GLOBAL), 0);
> }
>
> +static bool kernel_vaddr_encryped(unsigned long addr, bool enc)
> +{
> + unsigned int level;
> + pte_t *pte;
> +
> + pte = lookup_address(addr, &level);
> + if (!pte)
> + return false;
> +
> + if (enc)
> + return pte_val(*pte) == cc_mkenc(pte_val(*pte));
> +
> + return pte_val(*pte) == cc_mkdec(pte_val(*pte));
> +}
> +
> /*
> * __set_memory_enc_pgtable() is used for the hypervisors that get
> * informed about "encryption" status via page tables.
> @@ -2130,7 +2145,7 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> {
> pgprot_t empty = __pgprot(0);
> struct cpa_data cpa;
> - int ret;
> + int ret, numpages_in_state = 0;
>
> /* Should not be working on unaligned addresses */
> if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> @@ -2143,6 +2158,30 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
> cpa.pgd = init_mm.pgd;
>
> + /*
> + * If any page is already in the right state, bail with an error
> + * because the code doesn't handled it. This is likely because

Grammar mistake here.

> + * something has gone wrong and isn't worth optimizing for.
> + *
> + * If all the memory pages are already in the desired state return
> + * success.

Missing comma.

> + *
> + * kernel_vaddr_encryped() does not synchronize against huge page
> + * splits so take pgd_lock. A caller doing strange things could

Missing comma.

Thanks,

Ingo

2023-10-18 15:53:29

by Edgecombe, Rick P

Subject: Re: [PATCH 02/10] x86/mm/cpa: Reject incorrect encryption change requests

On Wed, 2023-10-18 at 10:44 +0200, Ingo Molnar wrote:
> > +       /*
> > +        * If any page is already in the right state, bail with an
> > error
> > +        * because the code doesn't handled it. This is likely
> > because
>
> Grammar mistake here.
>
> > +        * something has gone wrong and isn't worth optimizing for.
> > +        *
> > +        * If all the memory pages are already in the desired state
> > return
> > +        * success.
>
> Missing comma.
>
> > +        *
> > +        * kernel_vaddr_encryped() does not synchronize against
> > huge page
> > +        * splits so take pgd_lock. A caller doing strange things
> > could
>
> Missing comma.

Oh, yep. Thanks.

2023-10-18 15:56:28

by Edgecombe, Rick P

Subject: Re: [PATCH 04/10] swiotlb: Use free_decrypted_pages()

On Wed, 2023-10-18 at 06:43 +0200, Christoph Hellwig wrote:
> On Tue, Oct 17, 2023 at 01:24:59PM -0700, Rick Edgecombe wrote:
> > On TDX it is possible for the untrusted host to cause
> > set_memory_encrypted() or set_memory_decrypted() to fail such that
> > an
> > error is returned and the resulting memory is shared. Callers need
> > to take
> > care to handle these errors to avoid returning decrypted (shared)
> > memory to
> > the page allocator, which could lead to functional or security
> > issues.
> >
> > Swiotlb could free decrypted/shared pages if set_memory_decrypted()
> > fails.
> > Use the recently added free_decrypted_pages() to avoid this.
> >
> > In swiotlb_exit(), check for set_memory_encrypted() errors
> > manually,
> because the pages are not necessarily going to the page allocator.
>
> Whatever recently introduced it didn't make it to my mailbox.  Please
> always CC everyone on every patch in a series, everything else is
> impossible to review.

Ok. The series touches a bunch of set_memory() callers all over, so I
was trying to manage the CC list to something reasonable. I tried to
give a summary in each commit, but I guess it wasn't in depth enough.
Here is the lore link, if you haven't already found it:
https://lore.kernel.org/lkml/[email protected]/

2023-10-18 15:57:59

by Edgecombe, Rick P

Subject: Re: [PATCH 03/10] kvmclock: Use free_decrypted_pages()

On Tue, 2023-10-17 at 22:20 -0700, Kuppuswamy Sathyanarayanan wrote:
>
>
> On 10/17/2023 1:24 PM, Rick Edgecombe wrote:
> > On TDX it is possible for the untrusted host to cause
> > set_memory_encrypted() or set_memory_decrypted() to fail such that
> > an
> > error is returned and the resulting memory is shared. Callers need
> > to take
> > care to handle these errors to avoid returning decrypted (shared)
> > memory to
> > the page allocator, which could lead to functional or security
> > issues.
> >
> > Kvmclock could free decrypted/shared pages if
> > set_memory_decrypted() fails.
> > Use the recently added free_decrypted_pages() to avoid this.
> >
> > Cc: Paolo Bonzini <[email protected]>
> > Cc: Wanpeng Li <[email protected]>
> > Cc: Vitaly Kuznetsov <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Rick Edgecombe <[email protected]>
> > ---
>
> Since it is a fix, do you want to add a Fixes tag?
>
> Otherwise, it looks good to me.
>
> Reviewed-by: Kuppuswamy Sathyanarayanan
> <[email protected]>

Thanks. Yes, the thinking was to mark all these for stable, but some
patches are still RFC for this version. I'll add it for all non-RFC
ones in the next version.

2023-10-18 17:11:47

by Edgecombe, Rick P

Subject: Re: [PATCH 06/10] dma: Use free_decrypted_pages()

Link to whole series:
https://lore.kernel.org/lkml/[email protected]/

On Wed, 2023-10-18 at 08:24 +0200, Christoph Hellwig wrote:
> On Tue, Oct 17, 2023 at 01:25:01PM -0700, Rick Edgecombe wrote:
> >   struct cma;
> >  
> > @@ -165,7 +166,7 @@ static inline struct page
> > *dma_alloc_contiguous(struct device *dev, size_t size,
> >   static inline void dma_free_contiguous(struct device *dev, struct
> > page *page,
> >                 size_t size)
> >   {
> > -       __free_pages(page, get_order(size));
> > +       free_decrypted_pages((unsigned long)page_address(page),
> > get_order(size));
>
> CMA can be highmem, so this won't work totally independent of what
> free_decrypted_pages actually does.  Also please avoid the overly
> long line.

Argh, yes this is broken for highmem. Thanks for pointing it out.

For x86, we don't need to worry about doing set_memory_XXcrypted() with
highmem. Checking the Kconfig logic around the other
set_memory_XXcrypted() implementations:
s390 - Doesn't support HIGHMEM
powerpc - Doesn't support set_memory_XXcrypted() and HIGHMEM together

So that would mean set_memory_encrypted() is not needed on the HIGHMEM
configs (i.e. it's ok if there is no virtual mapping at free-time,
because it can skip the conversion work).

So free_decrypted_pages() could be changed to not disturb the HIGHMEM
configs, like this:
static inline void free_decrypted_pages(struct page *page, int order)
{
	void *addr = page_address(page);
	int ret = 0;

	if (addr)
		ret = set_memory_encrypted((unsigned long)addr, 1 << order);

	if (ret) {
		WARN_ONCE(1, "Failed...\n");
		return;
	}

	__free_pages(page, order);
}

Or we could just fix all the callers to open code the right logic. The
free_decrypted_pages() helper is not really saving code across the
series, and only serving to help callers avoid re-introducing the
issue. But I'm sort of worried it will be easy to do just that. Hmm...
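For reference, the leak-on-failure policy being discussed can be
exercised in a plain userspace mock. Every name below is a stand-in
invented for illustration (the `_stub`/`_mock` suffixes mark them); it
is not the kernel API, just the control flow:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for set_memory_encrypted(); on TDX the VMM can make this fail. */
int vmm_refuses_conversion;

int set_memory_encrypted_stub(void *addr, int numpages)
{
	(void)addr;
	(void)numpages;
	return vmm_refuses_conversion ? -EIO : 0;
}

int pages_freed;	/* pages returned to the mock "allocator" */
int pages_leaked;	/* pages deliberately leaked */

/*
 * The policy under discussion: only free if re-encryption succeeded;
 * otherwise leak, so shared memory never reaches the free lists.
 */
void free_decrypted_pages_mock(void *addr, int order)
{
	if (addr && set_memory_encrypted_stub(addr, 1 << order)) {
		fprintf(stderr, "re-encrypt failed, leaking %d page(s)\n",
			1 << order);
		pages_leaked += 1 << order;
		return;
	}
	free(addr);
	pages_freed += 1 << order;
}
```

The mock makes the two outcomes explicit: a successful conversion frees
the memory normally, while a refused conversion leaks it on purpose.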

2023-10-18 17:43:55

by Robin Murphy

Subject: Re: [PATCH 06/10] dma: Use free_decrypted_pages()

On 2023-10-17 21:25, Rick Edgecombe wrote:
> On TDX it is possible for the untrusted host to cause
> set_memory_encrypted() or set_memory_decrypted() to fail such that an
> error is returned and the resulting memory is shared. Callers need to take
> care to handle these errors to avoid returning decrypted (shared) memory to
> the page allocator, which could lead to functional or security issues.
>
> DMA could free decrypted/shared pages if set_memory_decrypted() fails.
> Use the recently added free_decrypted_pages() to avoid this.
>
> Several paths also result in proper encrypted pages being freed through
> the same freeing function. Rely on free_decrypted_pages() to not leak the
> memory in these cases.

If something's needed in the fallback path here, what about the
cma_release() paths?

Thanks,
Robin.

> Cc: Christoph Hellwig <[email protected]>
> Cc: Marek Szyprowski <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: [email protected]
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> include/linux/dma-map-ops.h | 3 ++-
> kernel/dma/contiguous.c | 2 +-
> 2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index f2fc203fb8a1..b0800cbbc357 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -9,6 +9,7 @@
> #include <linux/dma-mapping.h>
> #include <linux/pgtable.h>
> #include <linux/slab.h>
> +#include <linux/set_memory.h>
>
> struct cma;
>
> @@ -165,7 +166,7 @@ static inline struct page *dma_alloc_contiguous(struct device *dev, size_t size,
> static inline void dma_free_contiguous(struct device *dev, struct page *page,
> size_t size)
> {
> - __free_pages(page, get_order(size));
> + free_decrypted_pages((unsigned long)page_address(page), get_order(size));
> }
> #endif /* CONFIG_DMA_CMA*/
>
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> index f005c66f378c..e962f1f6434e 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -429,7 +429,7 @@ void dma_free_contiguous(struct device *dev, struct page *page, size_t size)
> }
>
> /* not in any cma, free from buddy */
> - __free_pages(page, get_order(size));
> + free_decrypted_pages((unsigned long)page_address(page), get_order(size));
> }
>
> /*

2023-10-19 17:06:03

by Michael Kelley (LINUX)

Subject: RE: [PATCH 00/10] Handle set_memory_XXcrypted() errors

From: Rick Edgecombe <[email protected]> Sent: Tuesday, October 17, 2023 1:25 PM
>
> Shared pages should never return to the page allocator, or future usage of
> the pages may allow for the contents to be exposed to the host. They may
> also cause the guest to crash if the page is used in way disallowed by HW
> (i.e. for executable code or as a page table).
>
> Normally set_memory() call failures are rare. But on TDX
> set_memory_XXcrypted() involves calls to the untrusted VMM, and an attacker
> could fail these calls such that:
> 1. set_memory_encrypted() returns an error and leaves the pages fully
> shared.
> 2. set_memory_decrypted() returns an error, but the pages are actually
> fully converted to shared.
>
> This means that patterns like the below can cause problems:
> void *addr = alloc();
> int fail = set_memory_decrypted(addr, 1);
> if (fail)
> free_pages(addr, 0);
>
> And:
> void *addr = alloc();
> int fail = set_memory_decrypted(addr, 1);
> if (fail) {
> set_memory_encrypted(addr, 1);
> free_pages(addr, 0);
> }
>
> Unfortunately these patterns are all over the place. And what the
> set_memory() callers should do in this situation is not clear either. They
> shouldn’t use them as shared because something clearly went wrong, but
> they also need to fully reset the pages to private to free them. But, the
> kernel needs the VMM's help to do this and the VMM is already being
> uncooperative around the needed operations. So this isn't guaranteed to
> succeed and the caller is kind of stuck with unusable pages.
>
> Looking at QEMU/KVM as an example, these VMM conversion failures either
> indicate an attempt to attack the guest, or resource constraints on the
> host. Preventing a DOS attack is out of scope for the coco threat model.
> So this leaves the host resource constraint cause. When similar resource
> constraints are encountered in the host, KVM punts the problem to
> userspace and QEMU terminates the guest. When similar problems are
> detected inside set_memory(), SEV issues a command to terminate the guest.
>
> This all makes it appealing to simply panic (via tdx_panic() call
> which informs the host what is happening) when observing troublesome VMM
> behavior around the memory conversion. It is:
> - Consistent with similar behavior on SEV side.
> - Generally more consistent with how host resource constraints are handled
> (at least in QEMU/KVM)
> - Would be a more foolproof defense against the attack scenario.
>
> Nevertheless, doing so would be an instance of the “crash the kernel for
> security reasons” pattern. This is a big reason, and crashing is not fully
> needed because the unusable pages could just be leaked (as they already
> are in some cases). So instead, this series does a tree-wide search and
> fixes the callers to handle the error by leaking the pages. Going forward
> callers will need to handle the set_memory() errors correctly in order to
> not reintroduce the issue.
>
> I think there are some points for both sides, and we had some internal
> discussion on the right way to handle it. So I've tried to characterize
> both arguments. I'm interested to hear opinions on which is the best.

I'm more in favor of the "simply panic" approach. What you've done
in your Patch 1 and Patch 2 is an intriguing way to try to get the memory
back into a consistent state. But I'm concerned that there are failure
modes that make it less than 100% foolproof (more on that below). If
we can't be sure that the memory is back in a consistent state, then the
original problem isn't fully solved. I'm also not sure of the value of
investing effort to ensure that some error cases are handled without
panic'ing. The upside benefit of not panic'ing seems small compared to
the downside risk of leaking guest VM data to the host.

My concern about Patches 1 and 2 is that the encryption bit in the PTE
is not a reliable indicator of the state that the host thinks the page is
in. Changing the state requires two steps (in either order): 1) updating
the guest VM PTEs, and 2) updating the host's view of the page state.
Both steps may be done on a range of pages. If #2 fails, the guest
doesn't know which pages in the batch were updated and which were
not, so the guest PTEs may not match the host state. In such a case,
set_memory_encrypted() could succeed based on checking the
PTEs when in fact the host still thinks some of the pages are shared.
Such a mismatch will produce a guest panic later on if the page is
referenced.

As you pointed out, the SEV code, and specifically the SEV-SNP code,
terminates the VM if there is a failure. That's in pvalidate_pages().
You've described your changes as being for TDX, but there's also
the Hyper-V version of handling private <-> shared transitions
which makes use of a paravisor for both SEV-SNP and TDX. The
Hyper-V versions could also fail due to resource constraints in
the paravisor, and so have the same issues as TDX, even when
running on SEV-SNP hardware.

In general, it's hard to anticipate all the failure modes that can occur
currently. Additional failure modes could be added in the future,
and taking into account malicious behavior by the host makes it
even worse. That leads me back to my conclusion that just taking
the panic is best.

>
> I’ve marked the hyperv guest parts in this as RFC, both because I can’t
> test them and I believe Linux TDs can’t run on hyperv yet due to some
> missing support. I would appreciate a correction on this if it’s wrong.

Linux TDs can run on Hyper-V today, though it may require a Hyper-V
version that isn't officially released yet. We have it working internally
at Microsoft and I think at Intel as well. There *are* still a couple of
Linux patches waiting to be accepted upstream to run without a
paravisor. In any case, your concerns about testing are valid -- it's
probably easier for one of us at Microsoft to test the Hyper-V guest
parts if we continue down the path you have proposed.

I've looked through the other patches in the series, and have a few
minor comments on the Hyper-V parts. But I'll hold those pending
an overall conclusion on whether to pursue this approach.

If you send a new version of the series, please include the
[email protected] mailing list as well on all the
patches.

Michael

>
> Rick Edgecombe (10):
> mm: Add helper for freeing decrypted memory
> x86/mm/cpa: Reject incorrect encryption change requests
> kvmclock: Use free_decrypted_pages()
> swiotlb: Use free_decrypted_pages()
> ptp: Use free_decrypted_pages()
> dma: Use free_decrypted_pages()
> hv: Use free_decrypted_pages()
> hv: Track decrypted status in vmbus_gpadl
> hv_nstvsc: Don't free decrypted memory
> uio_hv_generic: Don't free decrypted memory
>
> arch/s390/include/asm/set_memory.h | 1 +
> arch/x86/kernel/kvmclock.c | 2 +-
> arch/x86/mm/pat/set_memory.c | 41 +++++++++++++++++++++++++++++-
> drivers/hv/channel.c | 18 ++++++++-----
> drivers/hv/connection.c | 13 +++++++---
> drivers/net/hyperv/netvsc.c | 7 +++--
> drivers/ptp/ptp_kvm_x86.c | 2 +-
> drivers/uio/uio_hv_generic.c | 12 ++++++---
> include/linux/dma-map-ops.h | 3 ++-
> include/linux/hyperv.h | 1 +
> include/linux/set_memory.h | 13 ++++++++++
> kernel/dma/contiguous.c | 2 +-
> kernel/dma/swiotlb.c | 11 +++++---
> 13 files changed, 101 insertions(+), 25 deletions(-)
>
> --
> 2.34.1

2023-10-19 19:13:32

by Dave Hansen

Subject: Re: [PATCH 00/10] Handle set_memory_XXcrypted() errors

On 10/19/23 10:05, Michael Kelley (LINUX) wrote:
> I'm more in favor of the "simply panic" approach. What you've done
> in your Patch 1 and Patch 2 is an intriguing way to try to get the memory
> back into a consistent state. But I'm concerned that there are failure
> modes that make it less than 100% foolproof (more on that below). If
> we can't be sure that the memory is back in a consistent state, then the
> original problem isn't fully solved. I'm also not sure of the value of
> investing effort to ensure that some errors cases are handled without
> panic'ing. The upside benefit of not panic'ing seems small compared to
> the downside risk of leaking guest VM data to the host.

panic() should be a last resort. We *always* continue unless we know
that something is so bad that we're going to make things worse by
continuing to run.

We shouldn't panic() on the first little thing that goes wrong. If
folks want *that*, then they can set panic_on_warn.

> My concern about Patches 1 and 2 is that the encryption bit in the PTE
> is not a reliable indicator of the state that the host thinks the page is
> in. Changing the state requires two steps (in either order): 1) updating
> the guest VM PTEs, and 2) updating the host's view of the page state.
> Both steps may be done on a range of pages. If #2 fails, the guest
> doesn't know which pages in the batch were updated and which were
> not, so the guest PTEs may not match the host state. In such a case,
> set_memory_encrypted() could succeed based on checking the
> PTEs when in fact the host still thinks some of the pages are shared.
> Such a mismatch will produce a guest panic later on if the page is
> referenced.

I think that's OK. In the end, the page state is controlled by the VMM.
The guest has zero control. All it can do is make the PTEs consistent
and hold on for dear life. That's a general statement and not specific
to this problem.

In other words, it's fine for CoCo folks to be paranoid. It's fine for
them to set panic_on_{warn,oops,whatever}=1. But it's *NOT* fine to say
that every TDX guest will want to do that.

2023-10-23 16:46:42

by Edgecombe, Rick P

Subject: Re: [PATCH 06/10] dma: Use free_decrypted_pages()

On Wed, 2023-10-18 at 18:42 +0100, Robin Murphy wrote:
> On 2023-10-17 21:25, Rick Edgecombe wrote:
> > On TDX it is possible for the untrusted host to cause
> > set_memory_encrypted() or set_memory_decrypted() to fail such that
> > an
> > error is returned and the resulting memory is shared. Callers need
> > to take
> > care to handle these errors to avoid returning decrypted (shared)
> > memory to
> > the page allocator, which could lead to functional or security
> > issues.
> >
> > DMA could free decrypted/shared pages if set_memory_decrypted()
> > fails.
> > Use the recently added free_decrypted_pages() to avoid this.
> >
> > Several paths also result in proper encrypted pages being freed
> > through
> > the same freeing function. Rely on free_decrypted_pages() to not
> > leak the
> > memory in these cases.
>
> If something's needed in the fallback path here, what about the
> cma_release() paths?

You mean inside cma_release(). If so, unfortunately I think it won't
fit great because there are callers that are never dealing with shared
memory (huge tlb). The reset-to-private operation does extra work that
would be nice to avoid when possible.

The cases I thought exhibited the issue were the two calls sites of
dma_set_decrypted(). Playing around with it, I was thinking it might be
easier to just fix those to open code leaking the pages on
dma_set_decrypted() error. In which case it won't have the re-encrypt
problem.

It makes it less foolproof, but more efficient. And
free_decrypted_pages() doesn't fit great anyway, as pointed out by
Christoph.
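Sketching that open-coded call-site policy in a compilable userspace
mock (again, every name here is an invented stand-in, not the actual
dma_set_decrypted() code path):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

int decrypt_should_fail;	/* simulate the VMM failing the conversion */
int buffers_leaked;

/* Stand-in for dma_set_decrypted(); the real call can fail on TDX. */
int dma_set_decrypted_stub(void *addr, size_t size)
{
	(void)addr;
	(void)size;
	return decrypt_should_fail ? -EIO : 0;
}

/*
 * Open-coded policy at the allocation site: if the decrypt fails, the
 * page state is unknown (possibly already shared), so leak the buffer
 * rather than handing it back to the allocator.
 */
void *alloc_decrypted(size_t size)
{
	void *addr = malloc(size);

	if (!addr)
		return NULL;
	if (dma_set_decrypted_stub(addr, size)) {
		buffers_leaked++;	/* deliberately no free(addr) */
		return NULL;
	}
	return addr;
}
```

This trades the safety-net helper for an explicit leak at each call
site, avoiding the extra re-encrypt work on the failure path.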

2023-10-23 16:47:47

by Michael Kelley (LINUX)

Subject: RE: [PATCH 00/10] Handle set_memory_XXcrypted() errors

From: Dave Hansen <[email protected]> Sent: Thursday, October 19, 2023 12:13 PM
>
> On 10/19/23 10:05, Michael Kelley (LINUX) wrote:
> > I'm more in favor of the "simply panic" approach. What you've done
> > in your Patch 1 and Patch 2 is an intriguing way to try to get the memory
> > back into a consistent state. But I'm concerned that there are failure
> > modes that make it less than 100% foolproof (more on that below). If
> > we can't be sure that the memory is back in a consistent state, then the
> > original problem isn't fully solved. I'm also not sure of the value of
> > investing effort to ensure that some errors cases are handled without
> > panic'ing. The upside benefit of not panic'ing seems small compared to
> > the downside risk of leaking guest VM data to the host.
>
> panic() should be a last resort. We *always* continue unless we know
> that something is so bad that we're going to make things worse by
> continuing to run.
>
> We shouldn't panic() on the first little thing that goes wrong. If
> folks want *that*, then they can set panic_on_warn.
>
> > My concern about Patches 1 and 2 is that the encryption bit in the PTE
> > is not a reliable indicator of the state that the host thinks the page is
> > in. Changing the state requires two steps (in either order): 1) updating
> > the guest VM PTEs, and 2) updating the host's view of the page state.
> > Both steps may be done on a range of pages. If #2 fails, the guest
> > doesn't know which pages in the batch were updated and which were
> > not, so the guest PTEs may not match the host state. In such a case,
> > set_memory_encrypted() could succeed based on checking the
> > PTEs when in fact the host still thinks some of the pages are shared.
> > Such a mismatch will produce a guest panic later on if the page is
> > referenced.
>
> I think that's OK. In the end, the page state is controlled by the VMM.
> The guest has zero control. All it can do is make the PTEs consistent
> and hold on for dear life. That's a general statement and not specific
> to this problem.
>
> In other words, it's fine for CoCo folks to be paranoid. It's fine for
> them to set panic_on_{warn,oops,whatever}=1. But it's *NOT* fine to say
> that every TDX guest will want to do that.

The premise of this patch set is to not put pages on the Linux
guest free list that are shared. I agree with that premise. But
more precisely, the best we can do is not put pages on the free
list where the guest PTE indicates "shared". Even if the host is
not acting maliciously, errors can cause the guest and host to be
out-of-sync regarding a page's private/shared status. There's no
way to find out for sure if the host status is "private" before
returning such a page to the free list, though if
set_memory_encrypted() succeeds and the host is not
malicious, we should be reasonably safe.

For paranoid CoCo VM users, using panic_on_warn=1 seems
workable. However, with current code and this patch series,
it's possible to have set_memory_decrypted() return an error and
have set_memory_encrypted() fix things up as best it can
without generating any warnings. It seems like we need a
WARN or some equivalent mechanism if either of these fails,
so that CoCo VMs can panic if they don't want to run with any
inconsistencies (again, assuming the host isn't malicious).

Also, from a troubleshooting standpoint, panic_on_warn=1
will make it easier to diagnose a failure of
set_memory_encrypted()/decrypted() if it is caught
immediately, versus putting a page with an inconsistent state
on the free list and having things blow up later.

Michael

2023-10-23 16:57:50

by Dave Hansen

Subject: Re: [PATCH 00/10] Handle set_memory_XXcrypted() errors

On 10/23/23 09:47, Michael Kelley (LINUX) wrote:
> For paranoid CoCo VM users, using panic_on_warn=1 seems workable.
> However, with current code and this patch series, it's possible have
> set_memory_decrypted() return an error and have set_memory_encrypted()
> fix things up as best it can without generating any warnings. It seems
> like we need a WARN or some equivalent mechanism if either of these
> fails, so that CoCo VMs can panic if they don't want to run with any
> inconsistencies (again, assuming the host isn't malicious).

Adding a warning to the fixup path in set_memory_encrypted() would be
totally fine with me.

2023-10-23 17:01:52

by Edgecombe, Rick P

Subject: Re: [PATCH 00/10] Handle set_memory_XXcrypted() errors

On Mon, 2023-10-23 at 16:47 +0000, Michael Kelley (LINUX) wrote:
> From: Dave Hansen <[email protected]> Sent: Thursday, October 19,
> 2023 12:13 PM
> >
> > On 10/19/23 10:05, Michael Kelley (LINUX) wrote:
> > > I'm more in favor of the "simply panic" approach.   What you've
> > > done
> > > in your Patch 1 and Patch 2 is an intriguing way to try to get
> > > the memory
> > > back into a consistent state.  But I'm concerned that there are
> > > failure
> > > modes that make it less than 100% foolproof (more on that
> > > below).  If
> > > we can't be sure that the memory is back in a consistent state,
> > > then the
> > > original problem isn't fully solved.   I'm also not sure of the
> > > value of
> > > investing effort to ensure that some errors cases are handled
> > > without
> > > panic'ing.  The upside benefit of not panic'ing seems small
> > > compared to
> > > the downside risk of leaking guest VM data to the host.
> >
> > panic() should be a last resort.  We *always* continue unless we
> > know
> > that something is so bad that we're going to make things worse by
> > continuing to run.
> >
> > We shouldn't panic() on the first little thing that goes wrong.  If
> > folks want *that*, then they can set panic_on_warn.
> >
> > > My concern about Patches 1 and 2 is that the encryption bit in
> > > the PTE
> > > is not a reliable indicator of the state that the host thinks the
> > > page is
> > > in.  Changing the state requires two steps (in either order):  1)
> > > updating
> > > the guest VM PTEs, and 2) updating the host's view of the page
> > > state.
> > > Both steps may be done on a range of pages.  If #2 fails, the
> > > guest
> > > doesn't know which pages in the batch were updated and which were
> > > not, so the guest PTEs may not match the host state.  In such a
> > > case,
> > > set_memory_encrypted() could succeed based on checking the
> > > PTEs when in fact the host still thinks some of the pages are
> > > shared.
> > > Such a mismatch will produce a guest panic later on if the page
> > > is
> > > referenced.
> >
> > I think that's OK.  In the end, the page state is controlled by the
> > VMM.
> >   The guest has zero control.  All it can do is make the PTEs
> > consistent
> > and hold on for dear life.  That's a general statement and not
> > specific
> > to this problem.
> >
> > In other words, it's fine for CoCo folks to be paranoid.  It's fine
> > for
> > them to set panic_on_{warn,oops,whatever}=1.  But it's *NOT* fine
> > to say
> > that every TDX guest will want to do that.
>
> The premise of this patch set is to not put pages on the Linux
> guest free list that are shared.  I agree with that premise.  But
> more precisely, the best we can do is not put pages on the free
> list where the guest PTE indicates "shared".  Even if the host is
> not acting maliciously, errors can cause the guest and host to be
> out-of-sync regarding a page's private/shared status.  There's no
> way to find out for sure if the host status is "private" before
> returning such a page to the free list, though if
> set_memory_encrypted() succeeds and the host is not
> malicious, we should be reasonably safe.
>
> For paranoid CoCo VM users, using panic_on_warn=1 seems
> workable.  However, with current code and this patch series,
> it's possible have set_memory_decrypted() return an error and
> have set_memory_encrypted() fix things up as best it can
> without generating any warnings.  It seems like we need a
> WARN or some equivalent mechanism if either of these fails,
> so that CoCo VMs can panic if they don't want to run with any
> inconsistencies (again, assuming the host isn't malicious).

Always warning seems reasonable, given the added logic around retrying.
This is something that the GHCI spec says can happen, though. So the
warning is kind of like "the host is being legal, but a bit
unreasonable". This will also save having to add warnings in all of
the callers, which are missing in some spots currently.

>
> Also, from a troubleshooting standpoint, panic_on_warn=1
> will make it easier to diagnose a failure of
> set_memory_encrypted()/decrypted() if it is caught
> immediately, versus putting a page with an inconsistent state
> on the free list and having things blow up later.

If the guest doesn't notify the host of anything, I wonder if there
will be scenarios where the host doesn't know that there is anything
wrong in the TD. Maybe a naive question, but do you have any insight
into how errors in guest logs might make their way to the host admin in
typical usage?

2023-10-23 17:23:20

by Robin Murphy

Subject: Re: [PATCH 06/10] dma: Use free_decrypted_pages()

On 2023-10-23 17:46, Edgecombe, Rick P wrote:
> On Wed, 2023-10-18 at 18:42 +0100, Robin Murphy wrote:
>> On 2023-10-17 21:25, Rick Edgecombe wrote:
>>> On TDX it is possible for the untrusted host to cause
>>> set_memory_encrypted() or set_memory_decrypted() to fail such that
>>> an
>>> error is returned and the resulting memory is shared. Callers need
>>> to take
>>> care to handle these errors to avoid returning decrypted (shared)
>>> memory to
>>> the page allocator, which could lead to functional or security
>>> issues.
>>>
>>> DMA could free decrypted/shared pages if set_memory_decrypted()
>>> fails.
>>> Use the recently added free_decrypted_pages() to avoid this.
>>>
>>> Several paths also result in proper encrypted pages being freed
>>> through
>>> the same freeing function. Rely on free_decrypted_pages() to not
>>> leak the
>>> memory in these cases.
>>
>> If something's needed in the fallback path here, what about the
>> cma_release() paths?
>
> You mean inside cma_release(). If so, unfortunately I think it won't
> fit great because there are callers that are never dealing with shared
> memory (huge tlb). The reset-to-private operation does extra work that
> would be nice to avoid when possible.
>
> The cases I thought exhibited the issue were the two calls sites of
> dma_set_decrypted(). Playing around with it, I was thinking it might be
> easier to just fix those to open code leaking the pages on
> dma_set_decrypted() error. In which case it won't have the re-encrypt
> problem.
>
> It make's it less fool proof, but more efficient. And
> free_decrypted_pages() doesn't fit great anyway, as pointed out by
> Christoph.

My point is that in dma_direct_alloc(), we get some memory either
straight from the page allocator *or* from a CMA area, then call
set_memory_decrypted() on it. If the problem is that
set_memory_decrypted() can fail and require cleanup, then logically if
that cleanup is necessary for the dma_free_contiguous()->__free_pages()
call, then surely it must also be necessary for the
dma_free_contiguous()->cma_release()->free_contig_range()->__free_page()
calls.

Thanks,
Robin.
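One way to read that point (a compilable userspace sketch with invented
stand-in names, and ignoring that a re-encrypt is only needed when the
buffer was actually decrypted): resolve the conversion failure once,
before choosing a free path, so the CMA and buddy paths are covered
together.

```c
#include <assert.h>

int encrypt_ok = 1;	/* result of the stand-in re-encrypt call */
int freed_via_cma, freed_via_buddy, leaked;

/* Stand-in for set_memory_encrypted(); returns nonzero on failure. */
int set_memory_encrypted_stub(void)
{
	return encrypt_ok ? 0 : -1;
}

/* Returns nonzero if the pages belonged to a CMA area and were released. */
int cma_release_stub(int in_cma)
{
	if (in_cma)
		freed_via_cma++;
	return in_cma;
}

void dma_free_contiguous_sketch(int in_cma)
{
	if (set_memory_encrypted_stub()) {
		leaked++;	/* shared state unresolved: leak, don't free */
		return;
	}
	if (!cma_release_stub(in_cma))
		freed_via_buddy++;	/* not in any CMA: buddy free */
}
```

The point being that whichever free path is taken, the leak-on-failure
check has already happened, so neither path can recycle shared pages.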

2023-10-23 17:28:14

by Edgecombe, Rick P

Subject: Re: [PATCH 06/10] dma: Use free_decrypted_pages()

On Mon, 2023-10-23 at 18:22 +0100, Robin Murphy wrote:
> > >
> > > If something's needed in the fallback path here, what about the
> > > cma_release() paths?
> >
> > You mean inside cma_release(). If so, unfortunately I think it
> > won't
> > fit great because there are callers that are never dealing with
> > shared
> > memory (huge tlb). The reset-to-private operation does extra work
> > that
> > would be nice to avoid when possible.
> >
> > The cases I thought exhibited the issue were the two call sites of
> > dma_set_decrypted(). Playing around with it, I was thinking it
> > might be
> > easier to just fix those to open code leaking the pages on
> > dma_set_decrypted() error. In which case it won't have the re-
> > encrypt
> > problem.
> >
> > It makes it less foolproof, but more efficient. And
> > free_decrypted_pages() doesn't fit great anyway, as pointed out by
> > Christoph.
>
> My point is that in dma_direct_alloc(), we get some memory either
> straight from the page allocator *or* from a CMA area, then call
> set_memory_decrypted() on it. If the problem is that
> set_memory_decrypted() can fail and require cleanup, then logically
> if
> that cleanup is necessary for the dma_free_contiguous()-
> >__free_pages()
> call, then surely it must also be necessary for the
> dma_free_contiguous()->cma_release()->free_contig_range()-
> >__free_page()
> calls.

Oh, I see you are saying the patch misses that case. Yes, makes sense.

Sorry for the confusion. In trying to fix the callers, I waded through
a lot of areas that I didn't have much expertise in and probably
should have marked the whole thing RFC.

2023-10-31 10:43:34

by Petr Tesařík

Subject: Re: [PATCH 04/10] swiotlb: Use free_decrypted_pages()

On Tue, 17 Oct 2023 13:24:59 -0700
Rick Edgecombe <[email protected]> wrote:

> On TDX it is possible for the untrusted host to cause
> set_memory_encrypted() or set_memory_decrypted() to fail such that an
> error is returned and the resulting memory is shared. Callers need to take
> care to handle these errors to avoid returning decrypted (shared) memory to
> the page allocator, which could lead to functional or security issues.
>
> Swiotlb could free decrypted/shared pages if set_memory_decrypted() fails.
> Use the recently added free_decrypted_pages() to avoid this.
>
> In swiotlb_exit(), check for set_memory_encrypted() errors manually,
> because the pages are not necessarily going to the page allocator.
>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Marek Szyprowski <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: [email protected]
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> kernel/dma/swiotlb.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 394494a6b1f3..ad06786c4f98 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -524,6 +524,7 @@ void __init swiotlb_exit(void)
> unsigned long tbl_vaddr;
> size_t tbl_size, slots_size;
> unsigned int area_order;
> + int ret;
>
> if (swiotlb_force_bounce)
> return;
> @@ -536,17 +537,19 @@ void __init swiotlb_exit(void)
> tbl_size = PAGE_ALIGN(mem->end - mem->start);
> slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
>
> - set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
> + ret = set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
> if (mem->late_alloc) {
> area_order = get_order(array_size(sizeof(*mem->areas),
> mem->nareas));
> free_pages((unsigned long)mem->areas, area_order);
> - free_pages(tbl_vaddr, get_order(tbl_size));
> + if (!ret)
> + free_pages(tbl_vaddr, get_order(tbl_size));
> free_pages((unsigned long)mem->slots, get_order(slots_size));
> } else {
> memblock_free_late(__pa(mem->areas),
> array_size(sizeof(*mem->areas), mem->nareas));
> - memblock_free_late(mem->start, tbl_size);
> + if (!ret)
> + memblock_free_late(mem->start, tbl_size);
> memblock_free_late(__pa(mem->slots), slots_size);
> }
>
> @@ -581,7 +584,7 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes)
> return page;
>
> error:
> - __free_pages(page, order);
> + free_decrypted_pages((unsigned long)vaddr, order);
> return NULL;
> }

I admit I'm not familiar with the encryption/decryption API, but if a
__free_pages() is not sufficient here, then it is quite confusing.
The error label is reached only if set_memory_decrypted() returns
non-zero. My naive expectation is that the memory is *not* decrypted in
that case and does not require special treatment. Is this assumption
wrong?

OTOH I believe there is a bug in the logic. The subsequent
__free_pages() in swiotlb_alloc_tlb() would have to be changed to a
free_decrypted_pages(). However, I'm proposing a different approach to
address the latter issue here:

https://lore.kernel.org/linux-iommu/[email protected]/T/

Petr T

2023-10-31 15:55:07

by Edgecombe, Rick P

Subject: Re: [PATCH 04/10] swiotlb: Use free_decrypted_pages()

On Tue, 2023-10-31 at 11:43 +0100, Petr Tesařík wrote:
>
> I admit I'm not familiar with the encryption/decryption API, but if a
> __free_pages() is not sufficient here, then it is quite confusing.
> The error label is reached only if set_memory_decrypted() returns
> non-zero. My naive expectation is that the memory is *not* decrypted
> in
> that case and does not require special treatment. Is this assumption
> wrong?

Yea, the memory can still be decrypted, or partially decrypted. On x86,
all the set_memory() calls can fail part way through the work, and they
don't roll back the changes they had made up to that point. I was
thinking about trying to change this, but that is the current
behavior.

But in TDX's case set_memory_decrypted() can be fully successful and
just return an error. And there aren't any plans to fix the TDX case
(which has special VMM work that the kernel doesn't have control over).
So instead, the plan is to WARN about it and count on the caller to
handle the error properly:
https://lore.kernel.org/lkml/[email protected]/
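The free_decrypted_pages() helper itself isn't shown in this excerpt; a minimal userspace sketch of the idea (convert back to private first, free only on success, leak otherwise) might look like the following. The `*_stub` names, the `_sketch` suffix, and the failure flag are all illustrative assumptions, not the kernel implementation.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace stand-ins: in the kernel these would be
 * set_memory_encrypted() and free_pages(). The failure flag models the
 * untrusted VMM refusing (or botching) the conversion. */
static int encrypt_should_fail;
static int pages_freed;

static int set_memory_encrypted_stub(unsigned long addr, int numpages)
{
	return encrypt_should_fail ? -1 : 0;
}

static void free_pages_stub(unsigned long addr, unsigned int order)
{
	pages_freed++;
}

/* Sketch of the idea behind free_decrypted_pages(): convert the pages
 * back to private first, and only return them to the allocator if that
 * succeeds. On failure they may still be shared with the host, so
 * leaking them is the safe option. */
static void free_decrypted_pages_sketch(unsigned long addr, unsigned int order)
{
	if (set_memory_encrypted_stub(addr, 1 << order)) {
		fprintf(stderr, "re-encryption failed, leaking pages\n");
		return;
	}
	free_pages_stub(addr, order);
}
```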

>
> OTOH I believe there is a bug in the logic. The subsequent
> __free_pages() in swiotlb_alloc_tlb() would have to be changed to a
> free_decrypted_pages(). However, I'm proposing a different approach
> to
> address the latter issue here:
>
> https://lore.kernel.org/linux-iommu/[email protected]/T/

Oh, yes, that makes sense. I was planning to send a patch to just leak
the pages if set_memory_decrypted() fails, after my v2 linked above is
accepted. It could have a different label than the phys_limit check
error path added in your linked patch, so that case would still free
the perfectly fine encrypted pages.
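The label split described above (one path for the phys_limit-style failure, where the pages are still encrypted and safe to free, and another for the conversion failure, where they must be leaked) could look roughly like this sketch. The real swiotlb_alloc_tlb() control flow is more involved; every name here is a simplified stand-in.

```c
#include <assert.h>
#include <stddef.h>

static int freed, leaked;
static int phys_too_high;	/* models the phys_limit check failing */
static int decrypt_fails;	/* models set_memory_decrypted() failing */

static void *alloc_pages_stub(void)
{
	static int page;
	return &page;
}

static void free_pages_stub(void *page)
{
	freed++;
}

static void note_leak_stub(void *page)
{
	leaked++;	/* deliberately never freed */
}

/* Sketch of the two-label split: pages that merely failed the address
 * check are still encrypted and safe to free; pages whose decryption
 * failed may be shared, so they are leaked instead. */
static void *alloc_tlb_sketch(void)
{
	void *page = alloc_pages_stub();

	if (phys_too_high)
		goto err_free;	/* still encrypted: free normally */
	if (decrypt_fails)
		goto err_leak;	/* possibly shared: do not free */
	return page;

err_free:
	free_pages_stub(page);
	return NULL;
err_leak:
	note_leak_stub(page);
	return NULL;
}
```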

2023-10-31 17:14:07

by Petr Tesařík

Subject: Re: [PATCH 04/10] swiotlb: Use free_decrypted_pages()

On Tue, 31 Oct 2023 15:54:52 +0000
"Edgecombe, Rick P" <[email protected]> wrote:

> On Tue, 2023-10-31 at 11:43 +0100, Petr Tesařík wrote:
> >
> > I admit I'm not familiar with the encryption/decryption API, but if a
> > __free_pages() is not sufficient here, then it is quite confusing.
> > The error label is reached only if set_memory_decrypted() returns
> > non-zero. My naive expectation is that the memory is *not* decrypted
> > in
> > that case and does not require special treatment. Is this assumption
> > wrong?
>
> Yea, the memory can still be decrypted, or partially decrypted. On x86,
> all the set_memory() calls can fail part way through the work, and they
> don't roll back the changes they had made up to that point.

Thank you for the explanation. So, after set_memory_decrypted() fails,
the pages become Schroedinger-crypted, but since their true state cannot
be observed by the guest kernel, they stay as such forever.

Sweet.

>[...]
> > OTOH I believe there is a bug in the logic. The subsequent
> > __free_pages() in swiotlb_alloc_tlb() would have to be changed to a
> > free_decrypted_pages(). However, I'm proposing a different approach
> > to
> > address the latter issue here:
> >
> > https://lore.kernel.org/linux-iommu/[email protected]/T/
>
> Oh, yes, that makes sense. I was planning to send a patch to just leak
> the pages if set_memory_decrypted() fails, after my v2 linked above is
> accepted. It could have a different label than the phys_limit check
> error path added in your linked patch, so that case would still free
> the perfectly fine encrypted pages.

Hm, should I incorporate this knowledge into a v2 of my patch and
address both issues?

Petr T

2023-10-31 17:30:31

by Edgecombe, Rick P

Subject: Re: [PATCH 04/10] swiotlb: Use free_decrypted_pages()

On Tue, 2023-10-31 at 18:13 +0100, Petr Tesařík wrote:
> Thank you for the explanation. So, after set_memory_decrypted()
> fails,
> the pages become Schroedinger-crypted, but since their true state
> cannot
> be observed by the guest kernel, they stay as such forever.
>
> Sweet.
>
Yes... The untrusted host (the part of the VMM TDX is defending
against) gets to specify the return code of these operations (success
or failure). But the coco (a general term for TDX and similar
technologies from other vendors) threat model doesn't include DOS. So
the guest should trust
the return code as far as trying to not crash, but not trust it in
regards to the potential to leak data.

It's a bit to ask of the callers, but the other solution we discussed
was to panic the guest if any weirdness is observed by the VMM, in
which case the callers would never see the error. And of course
panicking the kernel is Bad. So that is how we arrived at this request
of the callers. Appreciate the effort to handle it on that side.


> Hm, should I incorporate this knowledge into a v2 of my patch and
> address both issues?

That sounds good to me! Feel free to CC me if you would like, and I can
scrutinize it for this particular issue.

2023-11-01 06:27:45

by Petr Tesařík

Subject: Re: [PATCH 04/10] swiotlb: Use free_decrypted_pages()

Hi,

On Tue, 31 Oct 2023 17:29:25 +0000
"Edgecombe, Rick P" <[email protected]> wrote:

> On Tue, 2023-10-31 at 18:13 +0100, Petr Tesařík wrote:
> > Thank you for the explanation. So, after set_memory_decrypted()
> > fails,
> > the pages become Schroedinger-crypted, but since their true state
> > cannot
> > be observed by the guest kernel, they stay as such forever.
> >
> > Sweet.
> >
> Yes... The untrusted host (the part of the VMM TDX is defending
> against) gets to specify the return code of these operations (success
> or failure). But the coco (a general term for TDX and similar
> technologies from other vendors) threat model doesn't include DOS. So
> the guest should trust
> the return code as far as trying to not crash, but not trust it in
> regards to the potential to leak data.
>
> It's a bit to ask of the callers, but the other solution we discussed
> was to panic the guest if any weirdness is observed by the VMM, in
> which case the callers would never see the error. And of course
> panicking the kernel is Bad. So that is how we arrived at this request
> of the callers. Appreciate the effort to handle it on that side.
>
>
> > Hm, should I incorporate this knowledge into a v2 of my patch and
> > address both issues?
>
> That sounds good to me! Feel free to CC me if you would like, and I can
> scrutinize it for this particular issue.

I'm sorry I missed that free_decrypted_pages() is added by the very
same series, so I cannot use it just yet. I can open-code it and let
you convert the code to the new function. You may then also want to
convert another open-coded instance further down in swiotlb_free_tlb().

In any case, there is an interdependency between the two patches, so we
should agree in which order to apply them.

Petr T

2023-11-01 14:41:10

by Edgecombe, Rick P

Subject: Re: [PATCH 04/10] swiotlb: Use free_decrypted_pages()

On Wed, 2023-11-01 at 07:27 +0100, Petr Tesařík wrote:
> I'm sorry I missed that free_decrypted_pages() is added by the very
> same series, so I cannot use it just yet. I can open-code it and let
> you convert the code to the new function. You may then also want to
> convert another open-coded instance further down in
> swiotlb_free_tlb().
>
> In any case, there is an interdependency between the two patches, so
> we
> should agree in which order to apply them.

Open coding in the callers is the current plan (see an explanation
after the "---" in the v1 of that patch[0] if you are interested).
There might not be any interdependency between the warning and
swiotlb changes, but I can double check if you CC me.


[0]
https://lore.kernel.org/lkml/[email protected]/