[v1: Initial post]
With confidential computing like TDX the guest doesn't trust the host
anymore. The host can of course still DoS the guest, but it is not allowed
to read or write any guest memory that is not explicitly shared with it.
This has implications for virtio. Traditionally virtio didn't assume that
the other side of the communication channel is malicious, and therefore
didn't do any boundary checks on the virtio ring data structures.
This patchkit does hardening for virtio. In a TDX-like model
the only host memory accesses allowed are to the virtio ring,
as well as to the (forced) swiotlb buffer.
The patchkit makes various changes to ensure there can be no
accesses outside these two areas. It is still possible for the host
to break the communication, but that should only result in an IO
error in the guest, not in memory safety violations.
virtio is quite complicated and has many modes. To simplify
the task we enforce that virtio only runs in split mode without
indirect descriptors when running as a TDX guest. We also
enforce use of the DMA API.
These code paths are then hardened against any corruption
of the ring.
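To give a concrete flavour of the hardening, here is a minimal sketch
(hypothetical helper, not code lifted from the series) of the kind of
bounds check applied whenever the guest follows a descriptor index that
the other side can influence:

	/*
	 * Sketch only: reject any split-ring descriptor index outside
	 * the ring the guest allocated itself. 'ring_size' is the
	 * number of descriptors in the ring (vring.num).
	 */
	static bool vring_desc_index_ok(unsigned int idx, unsigned int ring_size)
	{
		return idx < ring_size;
	}

A failed check stops ring processing and is reported as an IO error,
rather than letting a host-controlled index be dereferenced.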
This patchkit has components in three subsystems:
- Hardening changes to virtio, all in the generic virtio-ring
- Hardening changes to kernel/dma swiotlb to harden swiotlb against
malicious pointers. It requires an API change which needed a tree sweep.
- A single x86 patch to enable the arch_has_restricted_memory_access
for TDX
It depends on Sathya's earlier patchkit that adds the basic infrastructure
for TDX. This is only needed for the "am I running in TDX" part.
In some situations when we know swiotlb is forced and we have
to deal with untrusted hosts, it's useful to know if a mapping
was in the swiotlb or not. This allows us to abort any IO
operation that would access memory outside the swiotlb.
Otherwise it might be possible for a malicious host to inject
any guest page into a read operation. While it couldn't directly
access the results of the read() inside the guest, there
might be scenarios where data is echoed back with a write(),
and that would then leak guest memory.
Add a return value to dma_unmap_single/page. Most users
will of course ignore it. The return value is set to -EIO
if we're in forced swiotlb mode and the buffer is not inside
the swiotlb buffer. Otherwise it's always 0.
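As an illustration (hypothetical caller, not part of this patch), a
driver that wants to abort the operation when the check fails could do:

	dma_addr_t addr;
	int err;

	addr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, addr))
		return -ENOMEM;

	/* ... device DMA into the (bounced) buffer happens here ... */

	err = dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);
	if (err)
		return err;	/* -EIO: mapping was not in the forced swiotlb */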
A new callback is used to avoid changing all the IOMMU drivers.
Signed-off-by: Andi Kleen <[email protected]>
---
drivers/iommu/dma-iommu.c | 17 +++++++++++------
include/linux/dma-map-ops.h | 3 +++
include/linux/dma-mapping.h | 7 ++++---
kernel/dma/mapping.c | 6 +++++-
4 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7ef13198721b..babe46f2ae3a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -491,7 +491,8 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
}
-static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
+static int __iommu_dma_unmap_swiotlb_check(struct device *dev,
+ dma_addr_t dma_addr,
size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
@@ -500,12 +501,15 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
phys = iommu_iova_to_phys(domain, dma_addr);
if (WARN_ON(!phys))
- return;
+ return -EIO;
__iommu_dma_unmap(dev, dma_addr, size);
if (unlikely(is_swiotlb_buffer(phys, size)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
+ else if (swiotlb_force == SWIOTLB_FORCE)
+ return -EIO;
+ return 0;
}
static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
@@ -856,12 +860,13 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
return dma_handle;
}
-static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
+static int iommu_dma_unmap_page_check(struct device *dev, dma_addr_t dma_handle,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir);
- __iommu_dma_unmap_swiotlb(dev, dma_handle, size, dir, attrs);
+ return __iommu_dma_unmap_swiotlb_check(dev, dma_handle, size, dir,
+ attrs);
}
/*
@@ -946,7 +951,7 @@ static void iommu_dma_unmap_sg_swiotlb(struct device *dev, struct scatterlist *s
int i;
for_each_sg(sg, s, nents, i)
- __iommu_dma_unmap_swiotlb(dev, sg_dma_address(s),
+ __iommu_dma_unmap_swiotlb_check(dev, sg_dma_address(s),
sg_dma_len(s), dir, attrs);
}
@@ -1291,7 +1296,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.mmap = iommu_dma_mmap,
.get_sgtable = iommu_dma_get_sgtable,
.map_page = iommu_dma_map_page,
- .unmap_page = iommu_dma_unmap_page,
+ .unmap_page_check = iommu_dma_unmap_page_check,
.map_sg = iommu_dma_map_sg,
.unmap_sg = iommu_dma_unmap_sg,
.sync_single_for_cpu = iommu_dma_sync_single_for_cpu,
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d53a96a3d64..0ed0190f7949 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,9 @@ struct dma_map_ops {
u64 (*get_required_mask)(struct device *dev);
size_t (*max_mapping_size)(struct device *dev);
unsigned long (*get_merge_boundary)(struct device *dev);
+ int (*unmap_page_check)(struct device *dev, dma_addr_t dma_handle,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs);
};
#ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 37fbd12bd4ab..25b8382f8601 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -103,7 +103,7 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs);
-void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
+int dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
enum dma_data_direction dir, unsigned long attrs);
int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir, unsigned long attrs);
@@ -160,9 +160,10 @@ static inline dma_addr_t dma_map_page_attrs(struct device *dev,
{
return DMA_MAPPING_ERROR;
}
-static inline void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr,
+static inline int dma_unmap_page_attrs(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
+ return 0;
}
static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
@@ -323,7 +324,7 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
size, dir, attrs);
}
-static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
+static inline int dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
return dma_unmap_page_attrs(dev, addr, size, dir, attrs);
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9bf02c8d7d1b..dc0ce649d1f9 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -162,18 +162,22 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
}
EXPORT_SYMBOL(dma_map_page_attrs);
-void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
+int dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
enum dma_data_direction dir, unsigned long attrs)
{
const struct dma_map_ops *ops = get_dma_ops(dev);
+ int ret = 0;
BUG_ON(!valid_dma_direction(dir));
if (dma_map_direct(dev, ops) ||
arch_dma_unmap_page_direct(dev, addr + size))
dma_direct_unmap_page(dev, addr, size, dir, attrs);
+ else if (ops->unmap_page_check)
+ ret = ops->unmap_page_check(dev, addr, size, dir, attrs);
else if (ops->unmap_page)
ops->unmap_page(dev, addr, size, dir, attrs);
debug_dma_unmap_page(dev, addr, size, dir);
+ return ret;
}
EXPORT_SYMBOL(dma_unmap_page_attrs);
--
2.25.4
On 2021/6/3 8:41 AM, Andi Kleen wrote:
> [v1: Initial post]
>
> With confidential computing like TDX the guest doesn't trust the host
> anymore. The host can of course still DoS the guest, but it is not allowed
> to read or write any guest memory that is not explicitly shared with it.
>
> This has implications for virtio. Traditionally virtio didn't assume that
> the other side of the communication channel is malicious, and therefore
> didn't do any boundary checks on the virtio ring data structures.
>
> This patchkit does hardening for virtio. In a TDX-like model
> the only host memory accesses allowed are to the virtio ring,
> as well as to the (forced) swiotlb buffer.
>
> The patchkit makes various changes to ensure there can be no
> accesses outside these two areas. It is still possible for the host
> to break the communication, but that should only result in an IO
> error in the guest, not in memory safety violations.
>
> virtio is quite complicated and has many modes. To simplify
> the task we enforce that virtio only runs in split mode without
> indirect descriptors when running as a TDX guest. We also
> enforce use of the DMA API.
>
> These code paths are then hardened against any corruption
> of the ring.
>
> This patchkit has components in three subsystems:
> - Hardening changes to virtio, all in the generic virtio-ring
> - Hardening changes to kernel/dma swiotlb to harden swiotlb against
> malicious pointers. It requires an API change which needed a tree sweep.
> - A single x86 patch to enable the arch_has_restricted_memory_access
> for TDX
>
> It depends on Sathya's earlier patchkit that adds the basic infrastructure
> for TDX. This is only needed for the "am I running in TDX" part.
Note that it's probably needed by other cases as well:
1) Other encrypted VM technology
2) VDUSE[1]
3) Smart NICs
We have already had discussions and some patches have been posted[2][3][4].
I think the basic idea is similar: basically, we don't trust any
metadata provided by the device.
[2] is the series that uses the metadata stored in private memory,
which can't be accessed by swiotlb; that series aims to eliminate all
the possible attacks via virtqueue metadata.
[3] is one example of the used length validation.
[4] is the fix for the malicious config space.
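For readers not following [3], the used length validation boils down to
clamping the device-reported length to the size the driver itself
submitted; a rough sketch of the idea (hypothetical helper, not the
actual patch):

	/* Sketch only: 'len' comes from the used ring, i.e. from the
	 * device, and must not be trusted beyond what we mapped. */
	static u32 clamp_used_len(struct virtqueue *vq, u32 len, u32 buf_len)
	{
		if (unlikely(len > buf_len)) {
			dev_warn_once(&vq->vdev->dev,
				      "device reported bogus used length\n");
			len = buf_len;
		}
		return len;
	}

Here buf_len stands for the driver's own record of the buffer size.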
Thanks
[1] https://www.spinics.net/lists/netdev/msg743264.html
[2] https://www.spinics.net/lists/kvm/msg241825.html
[3] https://patches.linaro.org/patch/450733/
[4] https://lkml.org/lkml/2021/5/17/376
>
>
>
> Note that it's probably needed by other cases as well:
>
> 1) Other encrypted VM technology
> 2) VDUSE[1]
> 3) Smart NICs
Right. I don't see any reason why these shouldn't work. You may just
need to add the enable for the lockdown, but you can reuse the basic
infrastructure.
>
> We have already had discussions and some patches have been
> posted[2][3][4].
Thanks.
Yes, [2] is indeed an alternative. We considered this at some point, but
since we don't care about DoS in our case it seemed simpler to just
harden the existing code. But yes, if it's there it's useful for TDX too.
FWIW I would argue that the descriptor boundary checking should be added
in any case, security case or separated metadata or not, because it can
catch bugs and is very cheap. Checking boundaries is good practice.
[4] would be an independent issue; that's something we didn't catch.
Also the swiotlb hardening implemented in this patchkit doesn't seem to
be in any of the other patches.
So I would say my patches are mostly orthogonal to these patches below
and not conflicting, even though they address a similar problem space.
-Andi
Hi Andi,
On 2021-06-03 01:41, Andi Kleen wrote:
> In some situations when we know swiotlb is forced and we have
> to deal with untrusted hosts, it's useful to know if a mapping
> was in the swiotlb or not. This allows us to abort any IO
> operation that would access memory outside the swiotlb.
>
> Otherwise it might be possible for a malicious host to inject
> any guest page into a read operation. While it couldn't directly
> access the results of the read() inside the guest, there
> might be scenarios where data is echoed back with a write(),
> and that would then leak guest memory.
>
> Add a return value to dma_unmap_single/page. Most users
> will of course ignore it. The return value is set to -EIO
> if we're in forced swiotlb mode and the buffer is not inside
> the swiotlb buffer. Otherwise it's always 0.
I have to say my first impression of this isn't too good :(
What it looks like to me is abusing SWIOTLB's internal housekeeping to
keep track of virtio-specific state. The DMA API does not attempt to
validate calls in general since in many cases the additional overhead
would be prohibitive. It has always been callers' responsibility to keep
track of what they mapped and make sure sync/unmap calls match, and
there are many, many, subtle and not-so-subtle ways for things to go
wrong if they don't. If virtio is not doing a good enough job of that,
what's the justification for making it the DMA API's problem?
> A new callback is used to avoid changing all the IOMMU drivers.
Nit: presumably by "IOMMU drivers" you actually mean arch DMA API backends?
As an aside, we'll take a look at the rest of the series from the
perspective of our prototyping for Arm's Confidential Compute
Architecture, but I'm not sure we'll need it, since accesses beyond the
bounds of the shared SWIOTLB buffer shouldn't be an issue for us.
Furthermore, AFAICS it's still not going to help against exfiltrating
guest memory by over-unmapping the original SWIOTLB slot *without* going
past the end of the whole buffer, but I think Martin's patch *has*
addressed that already.
Robin.
> What it looks like to me is abusing SWIOTLB's internal housekeeping to
> keep track of virtio-specific state. The DMA API does not attempt to
> validate calls in general since in many cases the additional overhead
> would be prohibitive. It has always been callers' responsibility to
> keep track of what they mapped and make sure sync/unmap calls match,
> and there are many, many, subtle and not-so-subtle ways for things to
> go wrong if they don't. If virtio is not doing a good enough job of
> that, what's the justification for making it the DMA API's problem?
In this case it's not prohibitive at all. It's just adding a few error
returns, and checking the overlap (which seems to have been solved
already anyway). I would argue the error returns are good practice
anyway, so that API users can check that something bad is happening and
abort. The DMA API was never very good at proper error handling, but
there's no reason at all to continue being bad at it forever.
AFAIK the rest just works anyway, so it's not really a new problem to
be solved.
>
>> A new callback is used to avoid changing all the IOMMU drivers.
>
> Nit: presumably by "IOMMU drivers" you actually mean arch DMA API
> backends?
Yes
>
> Furthermore, AFAICS it's still not going to help against exfiltrating
> guest memory by over-unmapping the original SWIOTLB slot *without*
> going past the end of the whole buffer,
That would be just exfiltrating data that is already shared, unless I'm
misunderstanding you.
-Andi