This series allow user space to notify KVM of noncoherent DMA status so as
to let KVM honor guest memory types in specified memory slot ranges.
Motivation
===
A virtio GPU device may want to configure GPU hardware to work in
noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
This is generally for performance consideration.
In certain platform, GFX performance can improve 20+% with DMAs going to
noncoherent path.
This noncoherent DMA mode works in below sequence:
1. Host backend driver programs hardware not to snoop memory of target
DMA buffer.
2. Host backend driver indicates guest frontend driver to program guest PAT
to WC for target DMA buffer.
3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
4. Hardware does noncoherent DMA to the target buffer.
In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
as not cached. So, if KVM forces the effective memory type of this DMA
buffer to be WB, hardware DMA may read incorrect data and cause misc
failures.
Therefore we introduced a new memslot flag KVM_MEM_NON_COHERENT_DMA to
allow user space convey noncoherent DMA status in memslot granularity.
Platforms that do not always honor guest memory type can choose to honor
it in ranges of memslots with KVM_MEM_NON_COHERENT_DMA set.
Security
===
The biggest concern for KVM to honor guest's memory type is page aliasing
issue.
In Intel's platform,
- For host MMIO, KVM VMX always programs EPT memory type to UC (which will
overwrite all guest PAT types except WC), which is of no change after
this series.
- For host non-MMIO pages,
* virtio guest frontend and host backend driver should be synced to use
the same memory type to map a buffer. Otherwise, there will be
potential problem for incorrect memory data. But this will only impact
the buggy guest alone.
* for live migration, user space can skip reading/writing memory
corresponding to the memslot with flag KVM_MEM_NON_COHERENT_DMA or
do some special handling during memory read/write.
Implementation
===
Unlike previous RFC series [1] that uses a new KVM VIRTIO device to convey
noncoherent DMA status, this version chooses to introduce a new memslot
flag, similar to what's done in series from google at [2].
The difference is that [2] increases noncoherent DMA count to ask KVM VMX
to honor guest memory type for all guest memory as a whole, while this
series will only ask KVM to honor guest memory type in the specified
memslots.
The reason of not introducing a KVM cap or a memslot flag to allow users to
toggle noncoherent DMA state as a whole is mainly for the page aliasing
issue as mentioned above.
If guest memory type is only honored in limited memslots, user space can
do special handling before/after accessing to guest memory belonging to the
limited memslots.
For virtio GPUs, it usually will create memslots that are mapped into guest
device BARs.
- guest device driver will sync with host side to use the same memory type
to access that memslots.
- no other guest components will have access to the memory in the memslots
since it's mapped as device BARs.
So, by adding flag KVM_MEM_NON_COHERENT_DMA to memslots specific to virtio
GPUs and asking KVM to only honor guest memory in those memslots, page
aliasing issue can be avoided easily.
This series doesn't limit which memslots are legible to set flag
KVM_MEM_NON_COHERENT_DMA, so if the user sets this flag to memslots for
guest system RAM, page aliasing issue may be met during live migration
or other use cases when host wants to access guest memory with different
memory types due to lacking of coordination between non-enlightened guest
components and host. Just as when noncoherent DMA devices are assigned
through VFIO.
But as it will not impact other VMs, we choose to trust the user and let
the user to do mitigations when it has to set this flag to memslots for
guest system RAM.
Note:
We also noticed that there's a series [3] trying to fix a similar problem
in ARM for VFIO device passthrough.
The difference is that [3] is trying to fix the problem that guest memory
types for pass-through device MMIOs are not honored in ARM (which is not a
problem for x86 VMX), while this series is for the problem that guest
memory types are not honored in non-host-MMIO ranges for virtio GPUs in x86
VMX.
Changelog:
RFC --> v1:
- Switch to use memslot flag way to convey non-coherent DMA info
(Sean, Kevin)
- Do not honor guest MTRRs in memslot of flag KVM_MEM_NON_COHERENT_DMA
(Sean)
[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://patchwork.kernel.org/project/dri-devel/cover/[email protected]/
[3]: https://lore.kernel.org/all/[email protected]/
Yan Zhao (4):
KVM: Introduce a new memslot flag KVM_MEM_NON_COHERENT_DMA
KVM: x86: Add a new param "slot" to op get_mt_mask in kvm_x86_ops
KVM: VMX: Honor guest PATs for memslots of flag
KVM_MEM_NON_COHERENT_DMA
KVM: selftests: Set KVM_MEM_NON_COHERENT_DMA as a supported memslot
flag
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/mmu/spte.c | 3 ++-
arch/x86/kvm/vmx/vmx.c | 6 +++++-
include/uapi/linux/kvm.h | 2 ++
tools/testing/selftests/kvm/set_memory_region_test.c | 3 +++
virt/kvm/kvm_main.c | 8 ++++++--
6 files changed, 20 insertions(+), 5 deletions(-)
base-commit: 8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1
--
2.17.1
Introduce a new flag KVM_MEM_NON_COHERENT_DMA to provide user space a
channel to notify KVM that guest memory specified by the memslot may be
accessed by noncoherent DMA devices.
KVM can start honoring guest memory type for this range of guest memory in
platforms that do not always honoring guest PAT, e.g. in Intel's platform.
Previously, the only way to let KVM be aware of noncoherent DMA devices
is through KVM device for VFIO pass-through devices, in which case, KVM is
notified that all guest memory may be accessed by noncoherent DMA devices.
To avoid complication, flag KVM_MEM_NON_COHERENT_DMA is not allowed to be
dynamically modified for a memslot.
A KVM_CAP_USER_CONFIGURE_NONCOHERENT_DMA is added to let user space know if
this new memslot flag is supported in KVM.
Cc: Kevin Tian <[email protected]>
Cc: Zhenyu Wang <[email protected]>
Tested-by: Yongwei Ma <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
---
include/uapi/linux/kvm.h | 2 ++
virt/kvm/kvm_main.c | 8 ++++++--
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b1f92a0edc35..4cb615e46488 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -46,6 +46,7 @@ struct kvm_userspace_memory_region2 {
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
#define KVM_MEM_GUEST_MEMFD (1UL << 2)
+#define KVM_MEM_NON_COHERENT_DMA (1UL << 3)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
@@ -1155,6 +1156,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_MEMORY_ATTRIBUTES 233
#define KVM_CAP_GUEST_MEMFD 234
#define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_USER_CONFIGURE_NONCOHERENT_DMA 236
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acd67fb40183..6d44dcf7322d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1607,7 +1607,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
* only allows these.
*/
#define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
- (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+ (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY | KVM_MEM_NON_COHERENT_DMA)
static int check_memory_region_flags(struct kvm *kvm,
const struct kvm_userspace_memory_region2 *mem)
@@ -1625,6 +1625,8 @@ static int check_memory_region_flags(struct kvm *kvm,
valid_flags |= KVM_MEM_READONLY;
#endif
+ valid_flags |= KVM_MEM_NON_COHERENT_DMA;
+
if (mem->flags & ~valid_flags)
return -EINVAL;
@@ -2095,7 +2097,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
return -EINVAL;
if ((mem->userspace_addr != old->userspace_addr) ||
(npages != old->npages) ||
- ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
+ ((mem->flags ^ old->flags) &
+ (KVM_MEM_READONLY | KVM_MEM_NON_COHERENT_DMA)))
return -EINVAL;
if (base_gfn != old->base_gfn)
@@ -4822,6 +4825,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
case KVM_CAP_USER_MEMORY2:
case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
case KVM_CAP_JOIN_MEMORY_REGIONS_WORKS:
+ case KVM_CAP_USER_CONFIGURE_NONCOHERENT_DMA:
case KVM_CAP_INTERNAL_ERROR_DATA:
#ifdef CONFIG_HAVE_KVM_MSI
case KVM_CAP_SIGNAL_MSI:
--
2.17.1
Add param "slot" to op get_mt_mask in kvm_x86_ops.
This is a preparation patch to later honor guest PATs for certain memslots.
No functional change intended.
Suggested-by: Sean Christopherson <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Zhenyu Wang <[email protected]>
Tested-by: Yongwei Ma <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/mmu/spte.c | 3 ++-
arch/x86/kvm/vmx/vmx.c | 3 ++-
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a565a2e70f30..6be0d8ccff65 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1675,7 +1675,8 @@ struct kvm_x86_ops {
int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
- u8 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+ u8 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
+ const struct kvm_memory_slot *slot);
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..2c3ede3f27a9 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -191,7 +191,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (shadow_memtype_mask)
spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
- kvm_is_mmio_pfn(pfn));
+ kvm_is_mmio_pfn(pfn),
+ slot);
if (host_writable)
spte |= shadow_host_writable_mask;
else
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 40e3780d73ae..85a23765e506 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7576,7 +7576,8 @@ static int vmx_vm_init(struct kvm *kvm)
return 0;
}
-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
+ const struct kvm_memory_slot *slot)
{
/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
* memory aliases with conflicting memory types and sometimes MCEs.
--
2.17.1
Honor guest PATs in the range of memslots of flag KVM_MEM_NON_COHERENT_DMA
set no matter the value of noncoherent dma count.
Just honoring guest PATs (without honoring guest MTRRs) for memslots of
flag KVM_MEM_NON_COHERENT_DMA is because
- guest OS will ensure no page aliasing issue in guest side by honoring
guest MTRRs in guest page table.
Combinations like guest MTRR=WC or UC, guest PAT = WB is not allowed.
(at least in Linux, see pat_x_mtrr_type()).
- guest device driver programs device hardware according to guest PATs in
modern platforms.
Besides, we don't break down an EPT huge page if guest MTRRs in its range
are not consistent, because
- guest should have chosen correct guest PATs according to guest MTRRs.
- in normal platforms, small guest pages with different PATs must
correspond to different TLBs though they are mapped in a huge page in
EPT.
However, one condition may not be supported well by honoring guest PAT
alone -- when guest MTRR=WC, guest PAT=UC-.
By honoring guest MTRRs+PATs, the effective memory type is WC; while
by honoring guest PATs alone, the effective memory type is UC.
But it's arguable to support such a usage.
Suggested-by: Sean Christopherson <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Zhenyu Wang <[email protected]>
Tested-by: Yongwei Ma <[email protected]>
Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 85a23765e506..99f22589fa6d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7600,6 +7600,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
if (is_mmio)
return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+ if (slot->flags & KVM_MEM_NON_COHERENT_DMA)
+ return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+
if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
--
2.17.1
Update test_invalid_memory_region_flags() to treat KVM_MEM_NON_COHERENT_DMA
as a supported memslot flag.
Signed-off-by: Yan Zhao <[email protected]>
---
tools/testing/selftests/kvm/set_memory_region_test.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index 075b80dbe237..2d6f961734db 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -335,6 +335,9 @@ static void test_invalid_memory_region_flags(void)
#if defined __aarch64__ || defined __x86_64__
supported_flags |= KVM_MEM_READONLY;
+
+ if (kvm_has_cap(KVM_CAP_USER_CONFIGURE_NONCOHERENT_DMA))
+ supported_flags |= KVM_MEM_NON_COHERENT_DMA;
#endif
#ifdef __x86_64__
--
2.17.1
On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> This series allow user space to notify KVM of noncoherent DMA status so as
> to let KVM honor guest memory types in specified memory slot ranges.
>
> Motivation
> ===
> A virtio GPU device may want to configure GPU hardware to work in
> noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
Does this mean some DMA reads do not snoop the caches or does it
include DMA writes not synchronizing the caches too?
> This is generally for performance consideration.
> In certain platform, GFX performance can improve 20+% with DMAs going to
> noncoherent path.
>
> This noncoherent DMA mode works in below sequence:
> 1. Host backend driver programs hardware not to snoop memory of target
> DMA buffer.
> 2. Host backend driver indicates guest frontend driver to program guest PAT
> to WC for target DMA buffer.
> 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> 4. Hardware does noncoherent DMA to the target buffer.
>
> In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> as not cached. So, if KVM forces the effective memory type of this DMA
> buffer to be WB, hardware DMA may read incorrect data and cause misc
> failures.
I don't know all the details, but a big concern would be that the
caches remain fully coherent with the underlying memory at any point
where kvm decides to revoke the page from the VM.
If you allow an incoherence of cache != physical then it opens a
security attack where the observed content of memory can change when
it should not.
ARM64 has issues like this and due to that ARM has to have explict,
expensive, cache flushing at certain points.
Jason
On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > This series allow user space to notify KVM of noncoherent DMA status so as
> > to let KVM honor guest memory types in specified memory slot ranges.
> >
> > Motivation
> > ===
> > A virtio GPU device may want to configure GPU hardware to work in
> > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
>
> Does this mean some DMA reads do not snoop the caches or does it
> include DMA writes not synchronizing the caches too?
Both DMA reads and writes are not snooped.
The virtio host side will mmap the buffer to WC (pgprot_writecombine)
for CPU access and program the device to access the buffer in uncached
way.
Meanwhile, virtio host side will construct a memslot in KVM with the PTR
returned from the mmap, and notify virtio guest side to mmap the same buffer in
guest page table with PAT=WC, too.
>
> > This is generally for performance consideration.
> > In certain platform, GFX performance can improve 20+% with DMAs going to
> > noncoherent path.
> >
> > This noncoherent DMA mode works in below sequence:
> > 1. Host backend driver programs hardware not to snoop memory of target
> > DMA buffer.
> > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > to WC for target DMA buffer.
> > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > 4. Hardware does noncoherent DMA to the target buffer.
> >
> > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > as not cached. So, if KVM forces the effective memory type of this DMA
> > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > failures.
>
> I don't know all the details, but a big concern would be that the
> caches remain fully coherent with the underlying memory at any point
> where kvm decides to revoke the page from the VM.
Ah, you mean, for page migration, the content of the page may not be copied
correctly, right?
Currently in x86, we have 2 ways to let KVM honor guest memory types:
1. through KVM memslot flag introduced in this series, for virtio GPUs, in
memslot granularity.
2. through increasing noncoherent dma count, as what's done in VFIO, for
Intel GPU passthrough, for all guest memory.
This page migration issue should not be the case for virtio GPU, as both host
and guest are synced to use the same memory type and actually the pages
are not anonymous pages.
For GPU pass-through, though host mmaps with WB, it's still fine for guest to
use WC because page migration on pages of VMs with pass-through device is not
allowed.
But I agree, this should be a case if user space sets the memslot flag to honor
guest memory type to memslots for guest system RAM where non-enlightened guest
components may cause guest and host to access with different memory types.
Or simply when the guest is a malicious one.
> If you allow an incoherence of cache != physical then it opens a
> security attack where the observed content of memory can change when
> it should not.
In this case, will this security attack impact other guests?
>
> ARM64 has issues like this and due to that ARM has to have explict,
> expensive, cache flushing at certain points.
>
On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > to let KVM honor guest memory types in specified memory slot ranges.
> > >
> > > Motivation
> > > ===
> > > A virtio GPU device may want to configure GPU hardware to work in
> > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> >
> > Does this mean some DMA reads do not snoop the caches or does it
> > include DMA writes not synchronizing the caches too?
> Both DMA reads and writes are not snooped.
Oh that sounds really dangerous.
> > > This is generally for performance consideration.
> > > In certain platform, GFX performance can improve 20+% with DMAs going to
> > > noncoherent path.
> > >
> > > This noncoherent DMA mode works in below sequence:
> > > 1. Host backend driver programs hardware not to snoop memory of target
> > > DMA buffer.
> > > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > > to WC for target DMA buffer.
> > > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > > 4. Hardware does noncoherent DMA to the target buffer.
> > >
> > > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > > as not cached. So, if KVM forces the effective memory type of this DMA
> > > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > > failures.
> >
> > I don't know all the details, but a big concern would be that the
> > caches remain fully coherent with the underlying memory at any point
> > where kvm decides to revoke the page from the VM.
> Ah, you mean, for page migration, the content of the page may not be copied
> correctly, right?
Not just migration. Any point where KVM revokes the page from the
VM. Ie just tearing down the VM still has to make the cache coherent
with physical or there may be problems.
> Currently in x86, we have 2 ways to let KVM honor guest memory types:
> 1. through KVM memslot flag introduced in this series, for virtio GPUs, in
> memslot granularity.
> 2. through increasing noncoherent dma count, as what's done in VFIO, for
> Intel GPU passthrough, for all guest memory.
And where does all this fixup the coherency problem?
> This page migration issue should not be the case for virtio GPU, as both host
> and guest are synced to use the same memory type and actually the pages
> are not anonymous pages.
The guest isn't required to do this so it can force the cache to
become incoherent.
> > If you allow an incoherence of cache != physical then it opens a
> > security attack where the observed content of memory can change when
> > it should not.
>
> In this case, will this security attack impact other guests?
It impacts the hypervisor potentially. It depends..
Jason
On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > >
> > > > Motivation
> > > > ===
> > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > >
> > > Does this mean some DMA reads do not snoop the caches or does it
> > > include DMA writes not synchronizing the caches too?
> > Both DMA reads and writes are not snooped.
>
> Oh that sounds really dangerous.
So if this is an issue then we might already have a problem, because with
many devices it's entirely up to the device programming whether the i/o is
snooping or not. So the moment you pass such a device to a guest, whether
there's explicit support for non-coherent or not, you have a problem.
_If_ there is a fundamental problem. I'm not sure of that, because my
assumption was that at most the guest shoots itself and the data
corruption doesn't go any further the moment the hypervisor does the
dma/iommu unmapping.
Also, there's a pile of x86 devices where this very much applies, x86
being dma-coherent is not really the true ground story.
Cheers, Sima
> > > > This is generally for performance consideration.
> > > > In certain platform, GFX performance can improve 20+% with DMAs going to
> > > > noncoherent path.
> > > >
> > > > This noncoherent DMA mode works in below sequence:
> > > > 1. Host backend driver programs hardware not to snoop memory of target
> > > > DMA buffer.
> > > > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > > > to WC for target DMA buffer.
> > > > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > > > 4. Hardware does noncoherent DMA to the target buffer.
> > > >
> > > > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > > > as not cached. So, if KVM forces the effective memory type of this DMA
> > > > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > > > failures.
> > >
> > > I don't know all the details, but a big concern would be that the
> > > caches remain fully coherent with the underlying memory at any point
> > > where kvm decides to revoke the page from the VM.
> > Ah, you mean, for page migration, the content of the page may not be copied
> > correctly, right?
>
> Not just migration. Any point where KVM revokes the page from the
> VM. Ie just tearing down the VM still has to make the cache coherent
> with physical or there may be problems.
>
> > Currently in x86, we have 2 ways to let KVM honor guest memory types:
> > 1. through KVM memslot flag introduced in this series, for virtio GPUs, in
> > memslot granularity.
> > 2. through increasing noncoherent dma count, as what's done in VFIO, for
> > Intel GPU passthrough, for all guest memory.
>
> And where does all this fixup the coherency problem?
>
> > This page migration issue should not be the case for virtio GPU, as both host
> > and guest are synced to use the same memory type and actually the pages
> > are not anonymous pages.
>
> The guest isn't required to do this so it can force the cache to
> become incoherent.
>
> > > If you allow an incoherence of cache != physical then it opens a
> > > security attack where the observed content of memory can change when
> > > it should not.
> >
> > In this case, will this security attack impact other guests?
>
> It impacts the hypervisor potentially. It depends..
>
> Jason
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Mon, Jan 08, 2024 at 04:25:02PM +0100, Daniel Vetter wrote:
> On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > >
> > > > > Motivation
> > > > > ===
> > > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > >
> > > > Does this mean some DMA reads do not snoop the caches or does it
> > > > include DMA writes not synchronizing the caches too?
> > > Both DMA reads and writes are not snooped.
> >
> > Oh that sounds really dangerous.
>
> So if this is an issue then we might already have a problem, because with
> many devices it's entirely up to the device programming whether the i/o is
> snooping or not. So the moment you pass such a device to a guest, whether
> there's explicit support for non-coherent or not, you have a
> problem.
No, the iommus (except Intel and only for Intel integrated GPU, IIRC)
prohibit the use of non-coherent DMA entirely from a VM.
Eg AMD systems 100% block non-coherent DMA in VMs at the iommu level.
> _If_ there is a fundamental problem. I'm not sure of that, because my
> assumption was that at most the guest shoots itself and the data
> corruption doesn't go any further the moment the hypervisor does the
> dma/iommu unmapping.
Who fixes the cache on the unmapping? I didn't see anything..
Jason
On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > >
> > > > Motivation
> > > > ===
> > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > >
> > > Does this mean some DMA reads do not snoop the caches or does it
> > > include DMA writes not synchronizing the caches too?
> > Both DMA reads and writes are not snooped.
>
> Oh that sounds really dangerous.
>
But the IOMMU for Intel GPU does not do force-snoop, no matter KVM
honors guest memory type or not.
> > > > This is generally for performance consideration.
> > > > In certain platform, GFX performance can improve 20+% with DMAs going to
> > > > noncoherent path.
> > > >
> > > > This noncoherent DMA mode works in below sequence:
> > > > 1. Host backend driver programs hardware not to snoop memory of target
> > > > DMA buffer.
> > > > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > > > to WC for target DMA buffer.
> > > > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > > > 4. Hardware does noncoherent DMA to the target buffer.
> > > >
> > > > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > > > as not cached. So, if KVM forces the effective memory type of this DMA
> > > > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > > > failures.
> > >
> > > I don't know all the details, but a big concern would be that the
> > > caches remain fully coherent with the underlying memory at any point
> > > where kvm decides to revoke the page from the VM.
> > Ah, you mean, for page migration, the content of the page may not be copied
> > correctly, right?
>
> Not just migration. Any point where KVM revokes the page from the
> VM. Ie just tearing down the VM still has to make the cache coherent
> with physical or there may be problems.
Not sure what's the mentioned problem during KVM revoking.
In host,
- If the memory type is WB, as the case in intel GPU passthrough,
the mismatch can only happen when guest memory type is UC/WC/WT/WP, all
stronger than WB.
So, even after KVM revoking the page, the host will not get delayed
data from cache.
- If the memory type is WC, as the case in virtio GPU, after KVM revoking
the page, the page is still hold in the virtio host side.
Even though a incooperative guest can cause wrong data in the page,
the guest can achieve the purpose in a more straight-forward way, i.e.
writing a wrong data directly to the page.
So, I don't see the problem in this case too.
>
> > Currently in x86, we have 2 ways to let KVM honor guest memory types:
> > 1. through KVM memslot flag introduced in this series, for virtio GPUs, in
> > memslot granularity.
> > 2. through increasing noncoherent dma count, as what's done in VFIO, for
> > Intel GPU passthrough, for all guest memory.
>
> And where does all this fixup the coherency problem?
>
> > This page migration issue should not be the case for virtio GPU, as both host
> > and guest are synced to use the same memory type and actually the pages
> > are not anonymous pages.
>
> The guest isn't required to do this so it can force the cache to
> become incoherent.
>
> > > If you allow an incoherence of cache != physical then it opens a
> > > security attack where the observed content of memory can change when
> > > it should not.
> >
> > In this case, will this security attack impact other guests?
>
> It impacts the hypervisor potentially. It depends..
Could you elaborate more on how it will impact hypervisor?
We can try to fix it if it's really a case.
On Tue, Jan 09, 2024 at 07:36:22AM +0800, Yan Zhao wrote:
> On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > >
> > > > > Motivation
> > > > > ===
> > > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > >
> > > > Does this mean some DMA reads do not snoop the caches or does it
> > > > include DMA writes not synchronizing the caches too?
> > > Both DMA reads and writes are not snooped.
> >
> > Oh that sounds really dangerous.
> >
> But the IOMMU for Intel GPU does not do force-snoop, no matter KVM
> honors guest memory type or not.
Yes, I know. Sounds dangerous!
> > Not just migration. Any point where KVM revokes the page from the
> > VM. Ie just tearing down the VM still has to make the cache coherent
> > with physical or there may be problems.
> Not sure what's the mentioned problem during KVM revoking.
> In host,
> - If the memory type is WB, as the case in intel GPU passthrough,
> the mismatch can only happen when guest memory type is UC/WC/WT/WP, all
> stronger than WB.
> So, even after KVM revoking the page, the host will not get delayed
> data from cache.
> - If the memory type is WC, as the case in virtio GPU, after KVM revoking
> the page, the page is still hold in the virtio host side.
> Even though a incooperative guest can cause wrong data in the page,
> the guest can achieve the purpose in a more straight-forward way, i.e.
> writing a wrong data directly to the page.
> So, I don't see the problem in this case too.
You can't let cache incoherent memory leak back into the hypervisor
for other uses or who knows what can happen. In many cases something
will zero the page and you can probably reliably argue that will make
the cache coherent, but there are still all sorts of cases where pages
are write protected and then used in the hypervisor context. Eg page
out or something where the incoherence is a big problem.
eg RAID parity and mirror calculations become at-rist of
malfunction. Storage CRCs stop working reliably, etc, etc.
It is certainly a big enough problem that a generic KVM switch to
allow incoherence should be trated with alot of skepticism. You can't
argue that the only use of the generic switch will be with GPUs that
exclude all the troublesome cases!
> > > In this case, will this security attack impact other guests?
> >
> > It impacts the hypervisor potentially. It depends..
> Could you elaborate more on how it will impact hypervisor?
> We can try to fix it if it's really a case.
Well, for instance, when you install pages into the KVM the hypervisor
will have taken kernel memory, then zero'd it with cachable writes,
however the VM can read it incoherently with DMA and access the
pre-zero'd data since the zero'd writes potentially hasn't left the
cache. That is an information leakage exploit.
Who knows what else you can get up to if you are creative. The whole
security model assumes there is only one view of memory, not two.
Jason
On Mon, Jan 08, 2024 at 08:22:20PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 09, 2024 at 07:36:22AM +0800, Yan Zhao wrote:
> > On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > > > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > > >
> > > > > > Motivation
> > > > > > ===
> > > > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > > >
> > > > > Does this mean some DMA reads do not snoop the caches or does it
> > > > > include DMA writes not synchronizing the caches too?
> > > > Both DMA reads and writes are not snooped.
> > >
> > > Oh that sounds really dangerous.
> > >
> > But the IOMMU for Intel GPU does not do force-snoop, no matter KVM
> > honors guest memory type or not.
>
> Yes, I know. Sounds dangerous!
>
> > > Not just migration. Any point where KVM revokes the page from the
> > > VM. Ie just tearing down the VM still has to make the cache coherent
> > > with physical or there may be problems.
> > Not sure what's the mentioned problem during KVM revoking.
> > In host,
> > - If the memory type is WB, as the case in intel GPU passthrough,
> > the mismatch can only happen when guest memory type is UC/WC/WT/WP, all
> > stronger than WB.
> > So, even after KVM revoking the page, the host will not get delayed
> > data from cache.
> > - If the memory type is WC, as the case in virtio GPU, after KVM revoking
> > the page, the page is still hold in the virtio host side.
> > Even though a incooperative guest can cause wrong data in the page,
> > the guest can achieve the purpose in a more straight-forward way, i.e.
> > writing a wrong data directly to the page.
> > So, I don't see the problem in this case too.
>
> You can't let cache incoherent memory leak back into the hypervisor
> for other uses or who knows what can happen. In many cases something
> will zero the page and you can probably reliably argue that will make
> the cache coherent, but there are still all sorts of cases where pages
> are write protected and then used in the hypervisor context. Eg page
> out or something where the incoherence is a big problem.
>
> eg RAID parity and mirror calculations become at-rist of
> malfunction. Storage CRCs stop working reliably, etc, etc.
>
> It is certainly a big enough problem that a generic KVM switch to
> allow incoherence should be trated with alot of skepticism. You can't
> argue that the only use of the generic switch will be with GPUs that
> exclude all the troublesome cases!
>
You are right. It's more safe with only one view of memory.
But even something will zero the page, if it happens before returning
the page to host, looks the impact is constrained in VM scope? e.g.
for the write protected page, hypervisor cannot rely on the page content
is correct or expected.
For virtio GPU's use case, do you think a better way for KVM is to pull
the memory type from host page table in the specified memslot?
But for noncoherent DMA device passthrough, we can't pull host memory
type, because we rely on guest device driver to do cache flush
properly, and if the guest device driver thinks a memory is uncached
while it's effectively cached, the device cannot work properly.
> > > > In this case, will this security attack impact other guests?
> > >
> > > It impacts the hypervisor potentially. It depends..
> > Could you elaborate more on how it will impact hypervisor?
> > We can try to fix it if it's really a case.
>
> Well, for instance, when you install pages into the KVM the hypervisor
> will have taken kernel memory, then zero'd it with cachable writes,
> however the VM can read it incoherently with DMA and access the
> pre-zero'd data since the zero'd writes potentially hasn't left the
> cache. That is an information leakage exploit.
This makes sense.
How about KVM doing cache flush before installing/revoking the
page if guest memory type is honored?
> Who knows what else you can get up to if you are creative. The whole
> security model assumes there is only one view of memory, not two.
>
On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
> > Well, for instance, when you install pages into the KVM the hypervisor
> > will have taken kernel memory, then zero'd it with cachable writes,
> > however the VM can read it incoherently with DMA and access the
> > pre-zero'd data since the zero'd writes potentially hasn't left the
> > cache. That is an information leakage exploit.
>
> This makes sense.
> How about KVM doing cache flush before installing/revoking the
> page if guest memory type is honored?
I think if you are going to allow the guest to bypass the cache in any
way then KVM should fully flush the cache before allowing the guest to
access memory and it should fully flush the cache after removing
memory from the guest.
Noting that fully removing the memory now includes VFIO too, which is
going to be very hard to co-ordinate between KVM and VFIO.
ARM has the hooks for most of this in the common code already, so it
should not be outrageous to do, but slow I suspect.
Jason
> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, January 16, 2024 12:31 AM
>
> On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
>
> > > Well, for instance, when you install pages into the KVM the hypervisor
> > > will have taken kernel memory, then zero'd it with cachable writes,
> > > however the VM can read it incoherently with DMA and access the
> > > pre-zero'd data since the zero'd writes potentially hasn't left the
> > > cache. That is an information leakage exploit.
> >
> > This makes sense.
> > How about KVM doing cache flush before installing/revoking the
> > page if guest memory type is honored?
>
> I think if you are going to allow the guest to bypass the cache in any
> way then KVM should fully flush the cache before allowing the guest to
> access memory and it should fully flush the cache after removing
> memory from the guest.
For GPU passthrough can we rely on the fact that the entire guest memory
is pinned so the only occurrence of removing memory is when killing the
guest then the pages will be zero-ed by mm before next use? then we
just need to flush the cache before the 1st guest run to avoid information
leak.
yes it's a more complex issue if allowing guest to bypass cache in a
configuration mixing host mm activities on guest pages at run-time.
>
> Noting that fully removing the memory now includes VFIO too, which is
> going to be very hard to co-ordinate between KVM and VFIO.
if only talking about GPU passthrough do we still need such coordination?
>
> ARM has the hooks for most of this in the common code already, so it
> should not be outrageous to do, but slow I suspect.
>
> Jason
> From: Tian, Kevin
> Sent: Tuesday, January 16, 2024 8:46 AM
>
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Tuesday, January 16, 2024 12:31 AM
> >
> > On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
> >
> > > > Well, for instance, when you install pages into the KVM the hypervisor
> > > > will have taken kernel memory, then zero'd it with cachable writes,
> > > > however the VM can read it incoherently with DMA and access the
> > > > pre-zero'd data since the zero'd writes potentially hasn't left the
> > > > cache. That is an information leakage exploit.
> > >
> > > This makes sense.
> > > How about KVM doing cache flush before installing/revoking the
> > > page if guest memory type is honored?
> >
> > I think if you are going to allow the guest to bypass the cache in any
> > way then KVM should fully flush the cache before allowing the guest to
> > access memory and it should fully flush the cache after removing
> > memory from the guest.
>
> For GPU passthrough can we rely on the fact that the entire guest memory
> is pinned so the only occurrence of removing memory is when killing the
> guest then the pages will be zero-ed by mm before next use? then we
> just need to flush the cache before the 1st guest run to avoid information
> leak.
Just checked your past comments. If there is no guarantee that the removed
pages will be zero-ed before next use then yes cache has to be flushed
after the page is removed from the guest. :/
>
> yes it's a more complex issue if allowing guest to bypass cache in a
> configuration mixing host mm activities on guest pages at run-time.
>
> >
> > Noting that fully removing the memory now includes VFIO too, which is
> > going to be very hard to co-ordinate between KVM and VFIO.
>
Probably we could just handle cache flush in IOMMUFD or VFIO type1
map/unmap which is the gate of allowing/denying non-coherent DMAs
to specific pages.
On Tue, Jan 16, 2024 at 04:05:08AM +0000, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Tuesday, January 16, 2024 8:46 AM
> >
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Tuesday, January 16, 2024 12:31 AM
> > >
> > > On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
> > >
> > > > > Well, for instance, when you install pages into the KVM the hypervisor
> > > > > will have taken kernel memory, then zero'd it with cachable writes,
> > > > > however the VM can read it incoherently with DMA and access the
> > > > > pre-zero'd data since the zero'd writes potentially hasn't left the
> > > > > cache. That is an information leakage exploit.
> > > >
> > > > This makes sense.
> > > > How about KVM doing cache flush before installing/revoking the
> > > > page if guest memory type is honored?
> > >
> > > I think if you are going to allow the guest to bypass the cache in any
> > > way then KVM should fully flush the cache before allowing the guest to
> > > access memory and it should fully flush the cache after removing
> > > memory from the guest.
> >
> > For GPU passthrough can we rely on the fact that the entire guest memory
> > is pinned so the only occurrence of removing memory is when killing the
> > guest then the pages will be zero-ed by mm before next use? then we
> > just need to flush the cache before the 1st guest run to avoid information
> > leak.
>
> Just checked your past comments. If there is no guarantee that the removed
> pages will be zero-ed before next use then yes cache has to be flushed
> after the page is removed from the guest. :/
Next use may include things like swap to disk or live migrate the VM.
So it isn't quite so simple in the general case.
> > > Noting that fully removing the memory now includes VFIO too, which is
> > > going to be very hard to co-ordinate between KVM and VFIO.
>
> Probably we could just handle cache flush in IOMMUFD or VFIO type1
> map/unmap which is the gate of allowing/denying non-coherent DMAs
> to specific pages.
Maybe, and on live migrate dma stop..
Jason