2020-08-07 14:15:05

by Vitaly Kuznetsov

Subject: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

PCIe config space can (depending on the configuration) be quite big but
is usually sparsely populated. A guest may scan it by accessing each
device's individual page which, when the device is missing, is supposed
to have 'PCI hole' semantics: reads return 0xff and writes get
discarded. Compared to the already existing KVM_MEM_READONLY, the VMM
doesn't need to allocate real memory and fill it with 0xff.
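
For illustration only (not part of the patch), a minimal userspace
sketch of how a VMM could register such a slot; the slot id, ECAM base
and size are made-up values, and the fallback #defines just mirror the
constants introduced below:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_MEM_PCI_HOLE
#define KVM_MEM_PCI_HOLE	(1UL << 2)	/* introduced by this patch */
#endif
#ifndef KVM_CAP_PCI_HOLE_MEM
#define KVM_CAP_PCI_HOLE_MEM	187		/* introduced by this patch */
#endif

/* Register a 256 MiB 'PCI hole' slot: reads return 0xff, writes exit to userspace. */
static int set_pci_hole_slot(int vm_fd)
{
	struct kvm_userspace_memory_region region = {
		.slot            = 10,			/* arbitrary free slot id */
		.flags           = KVM_MEM_PCI_HOLE,
		.guest_phys_addr = 0xe0000000,		/* example ECAM/MMCONFIG base */
		.memory_size     = 256 << 20,
		.userspace_addr  = 0,			/* must be NULL for this flag */
	};

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PCI_HOLE_MEM) <= 0) {
		fprintf(stderr, "KVM_MEM_PCI_HOLE not supported\n");
		return -1;
	}
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}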

Suggested-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
Documentation/virt/kvm/api.rst | 18 ++++++++++-----
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/mmu/mmu.c | 5 ++++-
arch/x86/kvm/mmu/paging_tmpl.h | 3 +++
arch/x86/kvm/x86.c | 10 ++++++---
include/linux/kvm_host.h | 3 +++
include/uapi/linux/kvm.h | 2 ++
virt/kvm/kvm_main.c | 39 +++++++++++++++++++++++++++------
8 files changed, 64 insertions(+), 17 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 644e5326aa50..dc4172352635 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
/* for kvm_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+ #define KVM_MEM_PCI_HOLE (1UL << 2)

This ioctl allows the user to create, modify or delete a guest physical
memory slot. Bits 0-15 of "slot" specify the slot id and this value
@@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
be identical. This allows large pages in the guest to be backed by large
pages in the host.

-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
-writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
-use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only. In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
+KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
+- KVM_MEM_LOG_DIRTY_PAGES: log writes. Use KVM_GET_DIRTY_LOG to retrieve
+ the log.
+- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes. Only
+ available when KVM_CAP_READONLY_MEM is present.
+- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
+ KVM_EXIT_MMIO on writes. Only available when KVM_CAP_PCI_HOLE_MEM is
+ present. When setting the memory region 'userspace_addr' must be NULL.
+ This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
+ KVM_MEM_READONLY.

When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
the memory region are automatically reflected into the guest. For example, an
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 17c5a038f42d..cf80a26d74f5 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -48,6 +48,7 @@
#define __KVM_HAVE_XSAVE
#define __KVM_HAVE_XCRS
#define __KVM_HAVE_READONLY_MEM
+#define __KVM_HAVE_PCI_HOLE_MEM

/* Architectural interrupt line count. */
#define KVM_NR_INTERRUPTS 256
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fef6956393f7..4a2a7fface1e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
return PG_LEVEL_4K;

slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
- if (!slot)
+ if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))
return PG_LEVEL_4K;

max_level = min(max_level, max_huge_page_level);
@@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,

slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);

+ if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))
+ return RET_PF_EMULATE;
+
if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
&map_writable))
return RET_PF_RETRY;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5c6a895f67c3..27abd69e69f6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,

slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);

+ if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
+ return RET_PF_EMULATE;
+
if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
write_fault, &map_writable))
return RET_PF_RETRY;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dc4370394ab8..538bc58a22db 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_EXCEPTION_PAYLOAD:
case KVM_CAP_SET_GUEST_DEBUG:
case KVM_CAP_LAST_CPU:
+ case KVM_CAP_PCI_HOLE_MEM:
r = 1;
break;
case KVM_CAP_SYNC_REGS:
@@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
ugfn = slot->userspace_addr >> PAGE_SHIFT;
/*
* If the gfn and userspace address are not aligned wrt each
- * other, disable large page support for this slot.
+ * other, disable large page support for this slot. Also,
+ * disable large page support for KVM_MEM_PCI_HOLE slots.
*/
- if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
+ if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
+ (KVM_PAGES_PER_HPAGE(level) - 1))) {
unsigned long j;

for (j = 0; j < lpages; ++j)
@@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
* Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
* See comments below.
*/
- if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
+ if ((change != KVM_MR_FLAGS_ONLY) ||
+ (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
return;

/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 989afcbe642f..de1faa64a8ef 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
static inline unsigned long
__gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
{
+ /* Should never be called with a KVM_MEM_PCI_HOLE slot */
+ BUG_ON(!slot->userspace_addr);
+
return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6d86033c4fa..b56137f4f96f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -109,6 +109,7 @@ struct kvm_userspace_memory_region {
*/
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+#define KVM_MEM_PCI_HOLE (1UL << 2)

/* for KVM_IRQ_LINE */
struct kvm_irq_level {
@@ -1035,6 +1036,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_LAST_CPU 184
#define KVM_CAP_SMALLER_MAXPHYADDR 185
#define KVM_CAP_S390_DIAG318 186
+#define KVM_CAP_PCI_HOLE_MEM 187

#ifdef KVM_CAP_IRQ_ROUTING

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2c2c0254c2d8..3f69ae711021 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1107,6 +1107,10 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
valid_flags |= KVM_MEM_READONLY;
#endif

+#ifdef __KVM_HAVE_PCI_HOLE_MEM
+ valid_flags |= KVM_MEM_PCI_HOLE;
+#endif
+
if (mem->flags & ~valid_flags)
return -EINVAL;

@@ -1284,11 +1288,26 @@ int __kvm_set_memory_region(struct kvm *kvm,
return -EINVAL;
if (mem->guest_phys_addr & (PAGE_SIZE - 1))
return -EINVAL;
- /* We can read the guest memory with __xxx_user() later on. */
- if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
- !access_ok((void __user *)(unsigned long)mem->userspace_addr,
- mem->memory_size))
+
+ /*
+ * KVM_MEM_PCI_HOLE is mutually exclusive with KVM_MEM_READONLY/
+ * KVM_MEM_LOG_DIRTY_PAGES.
+ */
+ if ((mem->flags & KVM_MEM_PCI_HOLE) &&
+ (mem->flags & (KVM_MEM_READONLY | KVM_MEM_LOG_DIRTY_PAGES)))
return -EINVAL;
+
+ if (!(mem->flags & KVM_MEM_PCI_HOLE)) {
+ /* We can read the guest memory with __xxx_user() later on. */
+ if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
+ !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+ mem->memory_size))
+ return -EINVAL;
+ } else {
+ if (mem->userspace_addr)
+ return -EINVAL;
+ }
+
if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
return -EINVAL;
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -1328,7 +1347,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
} else { /* Modify an existing slot. */
if ((new.userspace_addr != old.userspace_addr) ||
(new.npages != old.npages) ||
- ((new.flags ^ old.flags) & KVM_MEM_READONLY))
+ ((new.flags ^ old.flags) & KVM_MEM_READONLY) ||
+ ((new.flags ^ old.flags) & KVM_MEM_PCI_HOLE))
return -EINVAL;

if (new.base_gfn != old.base_gfn)
@@ -1715,13 +1735,13 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)

static bool memslot_is_readonly(struct kvm_memory_slot *slot)
{
- return slot->flags & KVM_MEM_READONLY;
+ return slot->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE);
}

static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
gfn_t *nr_pages, bool write)
{
- if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
+ if (!slot || (slot->flags & (KVM_MEMSLOT_INVALID | KVM_MEM_PCI_HOLE)))
return KVM_HVA_ERR_BAD;

if (memslot_is_readonly(slot) && write)
@@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
int r;
unsigned long addr;

+ if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
+ memset(data, 0xff, len);
+ return 0;
+ }
+
addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
if (kvm_is_error_hva(addr))
return -EFAULT;
--
2.25.4


2020-08-14 02:32:43

by Sean Christopherson

Subject: Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

On Fri, Aug 07, 2020 at 04:12:31PM +0200, Vitaly Kuznetsov wrote:
> PCIe config space can (depending on the configuration) be quite big but
> is usually sparsely populated. A guest may scan it by accessing each
> device's individual page which, when the device is missing, is supposed
> to have 'PCI hole' semantics: reads return 0xff and writes get
> discarded. Compared to the already existing KVM_MEM_READONLY, the VMM
> doesn't need to allocate real memory and fill it with 0xff.
>
> Suggested-by: Michael S. Tsirkin <[email protected]>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 18 ++++++++++-----
> arch/x86/include/uapi/asm/kvm.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 5 ++++-
> arch/x86/kvm/mmu/paging_tmpl.h | 3 +++
> arch/x86/kvm/x86.c | 10 ++++++---
> include/linux/kvm_host.h | 3 +++
> include/uapi/linux/kvm.h | 2 ++
> virt/kvm/kvm_main.c | 39 +++++++++++++++++++++++++++------
> 8 files changed, 64 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 644e5326aa50..dc4172352635 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
> /* for kvm_memory_region::flags */
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> + #define KVM_MEM_PCI_HOLE (1UL << 2)
>
> This ioctl allows the user to create, modify or delete a guest physical
> memory slot. Bits 0-15 of "slot" specify the slot id and this value
> @@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> be identical. This allows large pages in the guest to be backed by large
> pages in the host.
>
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only. In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
> +KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
> +- KVM_MEM_LOG_DIRTY_PAGES: log writes. Use KVM_GET_DIRTY_LOG to retrieve
> + the log.
> +- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes. Only
> + available when KVM_CAP_READONLY_MEM is present.
> +- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
> + KVM_EXIT_MMIO on writes. Only available when KVM_CAP_PCI_HOLE_MEM is
> + present. When setting the memory region 'userspace_addr' must be NULL.
> + This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
> + KVM_MEM_READONLY.
>
> When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> the memory region are automatically reflected into the guest. For example, an
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 17c5a038f42d..cf80a26d74f5 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -48,6 +48,7 @@
> #define __KVM_HAVE_XSAVE
> #define __KVM_HAVE_XCRS
> #define __KVM_HAVE_READONLY_MEM
> +#define __KVM_HAVE_PCI_HOLE_MEM
>
> /* Architectural interrupt line count. */
> #define KVM_NR_INTERRUPTS 256
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index fef6956393f7..4a2a7fface1e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
> return PG_LEVEL_4K;
>
> slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
> - if (!slot)
> + if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))

This is unnecessary since you're setting disallow_lpage in
kvm_alloc_memslot_metadata().

> return PG_LEVEL_4K;
>
> max_level = min(max_level, max_huge_page_level);
> @@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>
> slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>
> + if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))

I'm confused. Why does this short circuit reads but not writes?

> + return RET_PF_EMULATE;
> +
> if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
> &map_writable))
> return RET_PF_RETRY;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 5c6a895f67c3..27abd69e69f6 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>
> slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
>
> + if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
> + return RET_PF_EMULATE;
> +
> if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
> write_fault, &map_writable))
> return RET_PF_RETRY;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index dc4370394ab8..538bc58a22db 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_EXCEPTION_PAYLOAD:
> case KVM_CAP_SET_GUEST_DEBUG:
> case KVM_CAP_LAST_CPU:
> + case KVM_CAP_PCI_HOLE_MEM:
> r = 1;
> break;
> case KVM_CAP_SYNC_REGS:
> @@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
> ugfn = slot->userspace_addr >> PAGE_SHIFT;
> /*
> * If the gfn and userspace address are not aligned wrt each
> - * other, disable large page support for this slot.
> + * other, disable large page support for this slot. Also,
> + * disable large page support for KVM_MEM_PCI_HOLE slots.
> */
> - if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
> + if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
> + (KVM_PAGES_PER_HPAGE(level) - 1))) {
> unsigned long j;
>
> for (j = 0; j < lpages; ++j)
> @@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> * Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
> * See comments below.
> */
> - if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
> + if ((change != KVM_MR_FLAGS_ONLY) ||
> + (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
> return;
>
> /*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 989afcbe642f..de1faa64a8ef 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> static inline unsigned long
> __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> {
> + /* Should never be called with a KVM_MEM_PCI_HOLE slot */
> + BUG_ON(!slot->userspace_addr);

So _technically_, userspace can hit this by allowing virtual address 0,
which is very much non-standard, but theoretically legal. It'd probably be
better to use a value that can't possibly be a valid userspace_addr, e.g. a
non-canonical value.

> +
> return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
> }
>

...

> @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> int r;
> unsigned long addr;
>
> + if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> + memset(data, 0xff, len);
> + return 0;
> + }

This feels wrong, shouldn't we be treating PCI_HOLE as MMIO? Given that
this is performance oriented, I would think we'd want to leverage the
GPA from the VMCS instead of doing a full translation.

That brings up a potential alternative to adding a memslot flag. What if
we instead add a KVM_MMIO_BUS device similar to coalesced MMIO? I think
it'd be about the same amount of KVM code, and it would provide userspace
with more flexibility, e.g. I assume it would allow handling even writes
wholly within the kernel for certain ranges and/or use cases, and it'd
allow stuffing a value other than 0xff (though I have no idea if there is
a use case for this).

Speaking of which, why do writes go to userspace in this series?

> +
> addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
> if (kvm_is_error_hva(addr))
> return -EFAULT;
> --
> 2.25.4
>

2020-08-14 14:53:29

by Michael S. Tsirkin

Subject: Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

On Thu, Aug 13, 2020 at 07:31:39PM -0700, Sean Christopherson wrote:
> On Fri, Aug 07, 2020 at 04:12:31PM +0200, Vitaly Kuznetsov wrote:
> > PCIe config space can (depending on the configuration) be quite big but
> > is usually sparsely populated. A guest may scan it by accessing each
> > device's individual page which, when the device is missing, is supposed
> > to have 'PCI hole' semantics: reads return 0xff and writes get
> > discarded. Compared to the already existing KVM_MEM_READONLY, the VMM
> > doesn't need to allocate real memory and fill it with 0xff.
> >
> > Suggested-by: Michael S. Tsirkin <[email protected]>
> > Signed-off-by: Vitaly Kuznetsov <[email protected]>
> > ---
> > Documentation/virt/kvm/api.rst | 18 ++++++++++-----
> > arch/x86/include/uapi/asm/kvm.h | 1 +
> > arch/x86/kvm/mmu/mmu.c | 5 ++++-
> > arch/x86/kvm/mmu/paging_tmpl.h | 3 +++
> > arch/x86/kvm/x86.c | 10 ++++++---
> > include/linux/kvm_host.h | 3 +++
> > include/uapi/linux/kvm.h | 2 ++
> > virt/kvm/kvm_main.c | 39 +++++++++++++++++++++++++++------
> > 8 files changed, 64 insertions(+), 17 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 644e5326aa50..dc4172352635 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
> > /* for kvm_memory_region::flags */
> > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > #define KVM_MEM_READONLY (1UL << 1)
> > + #define KVM_MEM_PCI_HOLE (1UL << 2)
> >
> > This ioctl allows the user to create, modify or delete a guest physical
> > memory slot. Bits 0-15 of "slot" specify the slot id and this value
> > @@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > be identical. This allows large pages in the guest to be backed by large
> > pages in the host.
> >
> > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > -to make a new slot read-only. In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
> > +KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
> > +- KVM_MEM_LOG_DIRTY_PAGES: log writes. Use KVM_GET_DIRTY_LOG to retrieve
> > + the log.
> > +- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes. Only
> > + available when KVM_CAP_READONLY_MEM is present.
> > +- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
> > + KVM_EXIT_MMIO on writes. Only available when KVM_CAP_PCI_HOLE_MEM is
> > + present. When setting the memory region 'userspace_addr' must be NULL.
> > + This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
> > + KVM_MEM_READONLY.
> >
> > When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > the memory region are automatically reflected into the guest. For example, an
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 17c5a038f42d..cf80a26d74f5 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -48,6 +48,7 @@
> > #define __KVM_HAVE_XSAVE
> > #define __KVM_HAVE_XCRS
> > #define __KVM_HAVE_READONLY_MEM
> > +#define __KVM_HAVE_PCI_HOLE_MEM
> >
> > /* Architectural interrupt line count. */
> > #define KVM_NR_INTERRUPTS 256
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index fef6956393f7..4a2a7fface1e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
> > return PG_LEVEL_4K;
> >
> > slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
> > - if (!slot)
> > + if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))
>
> This is unnecessary since you're setting disallow_lpage in
> kvm_alloc_memslot_metadata().
>
> > return PG_LEVEL_4K;
> >
> > max_level = min(max_level, max_huge_page_level);
> > @@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >
> > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >
> > + if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))
>
> I'm confused. Why does this short circuit reads but not writes?
>
> > + return RET_PF_EMULATE;
> > +
> > if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
> > &map_writable))
> > return RET_PF_RETRY;
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 5c6a895f67c3..27abd69e69f6 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
> >
> > slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
> >
> > + if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
> > + return RET_PF_EMULATE;
> > +
> > if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
> > write_fault, &map_writable))
> > return RET_PF_RETRY;
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index dc4370394ab8..538bc58a22db 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > case KVM_CAP_EXCEPTION_PAYLOAD:
> > case KVM_CAP_SET_GUEST_DEBUG:
> > case KVM_CAP_LAST_CPU:
> > + case KVM_CAP_PCI_HOLE_MEM:
> > r = 1;
> > break;
> > case KVM_CAP_SYNC_REGS:
> > @@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
> > ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > /*
> > * If the gfn and userspace address are not aligned wrt each
> > - * other, disable large page support for this slot.
> > + * other, disable large page support for this slot. Also,
> > + * disable large page support for KVM_MEM_PCI_HOLE slots.
> > */
> > - if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
> > + if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
> > + (KVM_PAGES_PER_HPAGE(level) - 1))) {
> > unsigned long j;
> >
> > for (j = 0; j < lpages; ++j)
> > @@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> > * Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
> > * See comments below.
> > */
> > - if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
> > + if ((change != KVM_MR_FLAGS_ONLY) ||
> > + (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
> > return;
> >
> > /*
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 989afcbe642f..de1faa64a8ef 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> > static inline unsigned long
> > __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> > {
> > + /* Should never be called with a KVM_MEM_PCI_HOLE slot */
> > + BUG_ON(!slot->userspace_addr);
>
> So _technically_, userspace can hit this by allowing virtual address 0,
> which is very much non-standard, but theoretically legal. It'd probably be
> better to use a value that can't possibly be a valid userspace_addr, e.g. a
> non-canonical value.
>
> > +
> > return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
> > }
> >
>
> ...
>
> > @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > int r;
> > unsigned long addr;
> >
> > + if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> > + memset(data, 0xff, len);
> > + return 0;
> > + }
>
> This feels wrong, shouldn't we be treating PCI_HOLE as MMIO? Given that
> this is performance oriented, I would think we'd want to leverage the
> GPA from the VMCS instead of doing a full translation.
>
> That brings up a potential alternative to adding a memslot flag. What if
> we instead add a KVM_MMIO_BUS device similar to coalesced MMIO? I think
> it'd be about the same amount of KVM code, and it would provide userspace
> with more flexibility, e.g. I assume it would allow handling even writes
> wholly within the kernel for certain ranges and/or use cases, and it'd
> allow stuffing a value other than 0xff (though I have no idea if there is
> a use case for this).

I still think down the road the way to go is to map
valid RO page full of 0xff to avoid exit on read.
I don't think a KVM_MMIO_BUS device will allow this, will it?


> Speaking of which, why do writes go to userspace in this series?
>
> > +
> > addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
> > if (kvm_is_error_hva(addr))
> > return -EFAULT;
> > --
> > 2.25.4
> >

2020-08-17 22:37:20

by Sean Christopherson

Subject: Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

On Fri, Aug 14, 2020 at 10:30:14AM -0400, Michael S. Tsirkin wrote:
> On Thu, Aug 13, 2020 at 07:31:39PM -0700, Sean Christopherson wrote:
> > > @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > > int r;
> > > unsigned long addr;
> > >
> > > + if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> > > + memset(data, 0xff, len);
> > > + return 0;
> > > + }
> >
> > This feels wrong, shouldn't we be treating PCI_HOLE as MMIO? Given that
> > this is performance oriented, I would think we'd want to leverage the
> > GPA from the VMCS instead of doing a full translation.
> >
> > That brings up a potential alternative to adding a memslot flag. What if
> > we instead add a KVM_MMIO_BUS device similar to coalesced MMIO? I think
> > it'd be about the same amount of KVM code, and it would provide userspace
> > with more flexibility, e.g. I assume it would allow handling even writes
> > wholly within the kernel for certain ranges and/or use cases, and it'd
> > allow stuffing a value other than 0xff (though I have no idea if there is
> > a use case for this).
>
> I still think down the road the way to go is to map
> valid RO page full of 0xff to avoid exit on read.
> I don't think a KVM_MMIO_BUS device will allow this, will it?

No, it would not, but adding KVM_MEM_PCI_HOLE doesn't get us any closer to
solving that problem either.

What if we add a flag to allow routing all GFNs in a memslot to a single
HVA? At a glance, it doesn't seem too heinous. It would have several of the
same touchpoints as this series, e.g. __kvm_set_memory_region() and
kvm_alloc_memslot_metadata().

The functional changes (for x86) would be a few lines in
__gfn_to_hva_memslot() and some new logic in kvm_handle_hva_range(). The
biggest concern is probably the fragility of such an implementation, as KVM
has a habit of open coding operations on memslots.

The new flags could then be paired with KVM_MEM_READONLY to yield the desired
behavior of reading out 0xff for an arbitrary range without requiring copious
memslots and/or host pages.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 852fc8274bdd..875243a0ab36 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1103,6 +1103,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
static inline unsigned long
__gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
{
+ if (unlikely(slot->flags & KVM_MEM_SINGLE_HVA))
+ return slot->userspace_addr;
+
return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}
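
To make that concrete, a purely hypothetical userspace sketch of pairing
such a flag (call it KVM_MEM_SINGLE_HVA; the name, bit value and
addresses below are assumptions, not an existing API) with
KVM_MEM_READONLY so that a single 0xff-filled host page backs an
arbitrarily large GPA range:

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

#define KVM_MEM_SINGLE_HVA	(1UL << 3)	/* hypothetical flag, value assumed */

/* Back a 256 MiB read-only range with a single host page full of 0xff. */
static int set_ff_backed_range(int vm_fd)
{
	void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;
	memset(page, 0xff, 4096);	/* every read through the slot sees 0xff */

	struct kvm_userspace_memory_region region = {
		.slot            = 11,			/* arbitrary free slot id */
		.flags           = KVM_MEM_READONLY | KVM_MEM_SINGLE_HVA,
		.guest_phys_addr = 0xe0000000,		/* example ECAM base */
		.memory_size     = 256 << 20,
		.userspace_addr  = (unsigned long)page,	/* the single backing HVA */
	};
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}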

2020-08-21 01:50:46

by Michael S. Tsirkin

Subject: Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

On Mon, Aug 17, 2020 at 09:32:07AM -0700, Sean Christopherson wrote:
> On Fri, Aug 14, 2020 at 10:30:14AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Aug 13, 2020 at 07:31:39PM -0700, Sean Christopherson wrote:
> > > > @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > > > int r;
> > > > unsigned long addr;
> > > >
> > > > + if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> > > > + memset(data, 0xff, len);
> > > > + return 0;
> > > > + }
> > >
> > > This feels wrong, shouldn't we be treating PCI_HOLE as MMIO? Given that
> > > this is performance oriented, I would think we'd want to leverage the
> > > GPA from the VMCS instead of doing a full translation.
> > >
> > > That brings up a potential alternative to adding a memslot flag. What if
> > > we instead add a KVM_MMIO_BUS device similar to coalesced MMIO? I think
> > > it'd be about the same amount of KVM code, and it would provide userspace
> > > with more flexibility, e.g. I assume it would allow handling even writes
> > > wholly within the kernel for certain ranges and/or use cases, and it'd
> > > allow stuffing a value other than 0xff (though I have no idea if there is
> > > a use case for this).
> >
> > I still think down the road the way to go is to map
> > valid RO page full of 0xff to avoid exit on read.
> > I don't think a KVM_MMIO_BUS device will allow this, will it?
>
> No, it would not, but adding KVM_MEM_PCI_HOLE doesn't get us any closer to
> solving that problem either.

I'm not sure why. Care to elaborate?

> What if we add a flag to allow routing all GFNs in a memslot to a single
> HVA?

An issue here would be that this breaks attempts to use a hugepage for this.


> At a glance, it doesn't seem too heinous. It would have several of the
> same touchpoints as this series, e.g. __kvm_set_memory_region() and
> kvm_alloc_memslot_metadata().
>
> The functional changes (for x86) would be a few lines in
> __gfn_to_hva_memslot() and some new logic in kvm_handle_hva_range(). The
> biggest concern is probably the fragility of such an implementation, as KVM
> has a habit of open coding operations on memslots.
>
> The new flags could then be paired with KVM_MEM_READONLY to yield the desired
> behavior of reading out 0xff for an arbitrary range without requiring copious
> memslots and/or host pages.
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 852fc8274bdd..875243a0ab36 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1103,6 +1103,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> static inline unsigned long
> __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> {
> + if (unlikely(slot->flags & KVM_MEM_SINGLE_HVA))
> + return slot->userspace_addr;
> +
> return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
> }

2020-08-22 03:24:05

by Sean Christopherson

Subject: Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

On Thu, Aug 20, 2020 at 09:46:25PM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2020 at 09:32:07AM -0700, Sean Christopherson wrote:
> > On Fri, Aug 14, 2020 at 10:30:14AM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Aug 13, 2020 at 07:31:39PM -0700, Sean Christopherson wrote:
> > > > > @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > > > > int r;
> > > > > unsigned long addr;
> > > > >
> > > > > + if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> > > > > + memset(data, 0xff, len);
> > > > > + return 0;
> > > > > + }
> > > >
> > > > This feels wrong, shouldn't we be treating PCI_HOLE as MMIO? Given that
> > > > this is performance oriented, I would think we'd want to leverage the
> > > > GPA from the VMCS instead of doing a full translation.
> > > >
> > > > That brings up a potential alternative to adding a memslot flag. What if
> > > > we instead add a KVM_MMIO_BUS device similar to coalesced MMIO? I think
> > > > it'd be about the same amount of KVM code, and it would provide userspace
> > > > with more flexibility, e.g. I assume it would allow handling even writes
> > > > wholly within the kernel for certain ranges and/or use cases, and it'd
> > > > allow stuffing a value other than 0xff (though I have no idea if there is
> > > > a use case for this).
> > >
> > > I still think down the road the way to go is to map
> > > valid RO page full of 0xff to avoid exit on read.
> > > I don't think a KVM_MMIO_BUS device will allow this, will it?
> >
> > No, it would not, but adding KVM_MEM_PCI_HOLE doesn't get us any closer to
> > solving that problem either.
>
> I'm not sure why. Care to elaborate?

The bulk of the code in this series would get thrown away if KVM_MEM_PCI_HOLE
were reworked to be backed by a physical page. If we really want a physical
page, then let's use a physical page from the get-go.

I realize I suggested the specialized MMIO idea, but that's when I thought the
primary motivation was memory, not performance.

> > What if we add a flag to allow routing all GFNs in a memslot to a single
> > HVA?
>
> An issue here would be that this breaks attempts to use a hugepage for this.

What are the performance numbers of a hugepage vs. aggressively prefetching
SPTEs? Note, the unbounded prefetching from the original RFC won't fly,
but prefetching 2MB ranges might be reasonable.

Reraising an earlier unanswered question, is enlightening the guest an
option for this use case?

2020-09-01 14:42:28

by Vitaly Kuznetsov

Subject: Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

Sean Christopherson <[email protected]> writes:

> On Fri, Aug 07, 2020 at 04:12:31PM +0200, Vitaly Kuznetsov wrote:
>> PCIe config space can (depending on the configuration) be quite big but
>> is usually sparsely populated. A guest may scan it by accessing each
>> device's individual page which, when the device is missing, is supposed
>> to have 'PCI hole' semantics: reads return 0xff and writes get
>> discarded. Compared to the already existing KVM_MEM_READONLY, the VMM
>> doesn't need to allocate real memory and fill it with 0xff.
>>
>> Suggested-by: Michael S. Tsirkin <[email protected]>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> Documentation/virt/kvm/api.rst | 18 ++++++++++-----
>> arch/x86/include/uapi/asm/kvm.h | 1 +
>> arch/x86/kvm/mmu/mmu.c | 5 ++++-
>> arch/x86/kvm/mmu/paging_tmpl.h | 3 +++
>> arch/x86/kvm/x86.c | 10 ++++++---
>> include/linux/kvm_host.h | 3 +++
>> include/uapi/linux/kvm.h | 2 ++
>> virt/kvm/kvm_main.c | 39 +++++++++++++++++++++++++++------
>> 8 files changed, 64 insertions(+), 17 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 644e5326aa50..dc4172352635 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
>> /* for kvm_memory_region::flags */
>> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
>> #define KVM_MEM_READONLY (1UL << 1)
>> + #define KVM_MEM_PCI_HOLE (1UL << 2)
>>
>> This ioctl allows the user to create, modify or delete a guest physical
>> memory slot. Bits 0-15 of "slot" specify the slot id and this value
>> @@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>> be identical. This allows large pages in the guest to be backed by large
>> pages in the host.
>>
>> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
>> -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
>> -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
>> -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
>> -to make a new slot read-only. In this case, writes to this memory will be
>> -posted to userspace as KVM_EXIT_MMIO exits.
>> +The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
>> +KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
>> +- KVM_MEM_LOG_DIRTY_PAGES: log writes. Use KVM_GET_DIRTY_LOG to retrieve
>> + the log.
>> +- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes. Only
>> + available when KVM_CAP_READONLY_MEM is present.
>> +- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
>> + KVM_EXIT_MMIO on writes. Only available when KVM_CAP_PCI_HOLE_MEM is
>> + present. When setting the memory region 'userspace_addr' must be NULL.
>> + This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
>> + KVM_MEM_READONLY.
>>
>> When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>> the memory region are automatically reflected into the guest. For example, an
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 17c5a038f42d..cf80a26d74f5 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -48,6 +48,7 @@
>> #define __KVM_HAVE_XSAVE
>> #define __KVM_HAVE_XCRS
>> #define __KVM_HAVE_READONLY_MEM
>> +#define __KVM_HAVE_PCI_HOLE_MEM
>>
>> /* Architectural interrupt line count. */
>> #define KVM_NR_INTERRUPTS 256
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index fef6956393f7..4a2a7fface1e 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
>> return PG_LEVEL_4K;
>>
>> slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
>> - if (!slot)
>> + if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))
>
> This is unnecessary since you're setting disallow_lpage in
> kvm_alloc_memslot_metadata().
>

Yea, redundant precaution, can be dropped.

>> return PG_LEVEL_4K;
>>
>> max_level = min(max_level, max_huge_page_level);
>> @@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>
>> slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>>
>> + if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))
>
> I'm confused. Why does this short circuit reads but not writes?
>

The idea was that guests shouldn't normally write to these regions and
we may want to catch them if they do. We can short circuit writes too by
simply ignoring them.

>> + return RET_PF_EMULATE;
>> +
>> if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
>> &map_writable))
>> return RET_PF_RETRY;
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 5c6a895f67c3..27abd69e69f6 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>>
>> slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
>>
>> + if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
>> + return RET_PF_EMULATE;
>> +
>> if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
>> write_fault, &map_writable))
>> return RET_PF_RETRY;
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index dc4370394ab8..538bc58a22db 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> case KVM_CAP_EXCEPTION_PAYLOAD:
>> case KVM_CAP_SET_GUEST_DEBUG:
>> case KVM_CAP_LAST_CPU:
>> + case KVM_CAP_PCI_HOLE_MEM:
>> r = 1;
>> break;
>> case KVM_CAP_SYNC_REGS:
>> @@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
>> ugfn = slot->userspace_addr >> PAGE_SHIFT;
>> /*
>> * If the gfn and userspace address are not aligned wrt each
>> - * other, disable large page support for this slot.
>> + * other, disable large page support for this slot. Also,
>> + * disable large page support for KVM_MEM_PCI_HOLE slots.
>> */
>> - if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
>> + if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
>> + (KVM_PAGES_PER_HPAGE(level) - 1))) {
>> unsigned long j;
>>
>> for (j = 0; j < lpages; ++j)
>> @@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>> * Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
>> * See comments below.
>> */
>> - if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
>> + if ((change != KVM_MR_FLAGS_ONLY) ||
>> + (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
>> return;
>>
>> /*
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 989afcbe642f..de1faa64a8ef 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
>> static inline unsigned long
>> __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
>> {
>> + /* Should never be called with a KVM_MEM_PCI_HOLE slot */
>> + BUG_ON(!slot->userspace_addr);
>
> So _technically_, userspace can hit this by allowing virtual address 0,
> which is very much non-standard, but theoretically legal. It'd probably be
> better to use a value that can't possibly be a valid userspace_addr, e.g. a
> non-canonical value.
>

I think I had a '!(slot->flags & KVM_MEM_PCI_HOLE)' check in a previous
version; we can restore it (if needed) or drop the thing completely.
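
A sketch of what restoring that check could look like, keying off the
memslot flag instead of a magic userspace_addr value (illustrative only,
not code from any posted version):

static inline unsigned long
__gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
{
	/* A KVM_MEM_PCI_HOLE slot has no backing HVA; complain but survive. */
	if (WARN_ON_ONCE(slot->flags & KVM_MEM_PCI_HOLE))
		return KVM_HVA_ERR_BAD;

	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}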

>> +
>> return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
>> }
>>
>
> ...
>
>> @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>> int r;
>> unsigned long addr;
>>
>> + if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
>> + memset(data, 0xff, len);
>> + return 0;
>> + }
>
> This feels wrong, shouldn't we be treating PCI_HOLE as MMIO? Given that
> this is performance oriented, I would think we'd want to leverage the
> GPA from the VMCS instead of doing a full translation.
>
> That brings up a potential alternative to adding a memslot flag. What if
> we instead add a KVM_MMIO_BUS device similar to coalesced MMIO? I think
> it'd be about the same amount of KVM code, and it would provide userspace
> with more flexibility, e.g. I assume it would allow handling even writes
> wholly within the kernel for certain ranges and/or use cases, and it'd
> allow stuffing a value other than 0xff (though I have no idea if there is
> a use case for this).

I was thinking about making this a bit more generic, like 'immutable'
memory with userspace-supplied values, e.g. userspace would provide a
region (e.g. a single page which all pages of the slot map to) but then
I failed to find a use-case for that. The PCI hole semantics seem to be
the only ones we actually need in real life.

Overall, making these PCI holes 'special' memory regions (slots) and
sticking to KVM_SET_USER_MEMORY_REGION feels natural to me. I also think
it would be much easier to consume from the QEMU side as we won't need to
use a 'special' API when things change (e.g. a device gets added and we
need to [un]punch a hole in the 'PCI hole' space).
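
For reference, '[un]punching a hole' stays within the existing ioctl; a
rough userspace sketch (slot ids, addresses and the 4 KiB device page
are illustrative assumptions) of carving a device's page out of an
existing KVM_MEM_PCI_HOLE slot:

#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_MEM_PCI_HOLE
#define KVM_MEM_PCI_HOLE	(1UL << 2)	/* introduced by this patch */
#endif

/* Replace one PCI-hole slot with two smaller ones around a new device's page. */
static int punch_hole_for_device(int vm_fd, __u64 hole_gpa, __u64 hole_size,
				 __u64 dev_gpa)
{
	struct kvm_userspace_memory_region del = {
		.slot = 10,				/* the old hole slot */
		.guest_phys_addr = hole_gpa,
		.memory_size = 0,			/* size 0 deletes the slot */
	};
	struct kvm_userspace_memory_region lo = {
		.slot = 10, .flags = KVM_MEM_PCI_HOLE,
		.guest_phys_addr = hole_gpa,
		.memory_size = dev_gpa - hole_gpa,
	};
	struct kvm_userspace_memory_region hi = {
		.slot = 12, .flags = KVM_MEM_PCI_HOLE,
		.guest_phys_addr = dev_gpa + 4096,
		.memory_size = hole_gpa + hole_size - (dev_gpa + 4096),
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &del))
		return -1;
	if (lo.memory_size && ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &lo))
		return -1;
	if (hi.memory_size && ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &hi))
		return -1;
	/* The device's page at dev_gpa is now left to a real memslot or MMIO. */
	return 0;
}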

>
> Speaking of which, why do writes go to userspace in this series?
>

No particular reason, actually; if there is no need to catch such
(stray?) writes, we can simply short-circuit them to a nop.

>> +
>> addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>> if (kvm_is_error_hva(addr))
>> return -EFAULT;
>> --
>> 2.25.4
>>
>

--
Vitaly

2020-09-03 09:38:48

by Michael S. Tsirkin

Subject: Re: [PATCH v2 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

On Tue, Sep 01, 2020 at 04:39:22PM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > On Fri, Aug 07, 2020 at 04:12:31PM +0200, Vitaly Kuznetsov wrote:
> >> PCIe config space can (depending on the configuration) be quite big but
> >> usually is sparsely populated. Guest may scan it by accessing individual
> >> device's page which, when device is missing, is supposed to have 'pci
> >> hole' semantics: reads return '0xff' and writes get discarded. Compared
> >> to the already existing KVM_MEM_READONLY, VMM doesn't need to allocate
> >> real memory and stuff it with '0xff'.
> >>
> >> Suggested-by: Michael S. Tsirkin <[email protected]>
> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> >> ---
> >> Documentation/virt/kvm/api.rst | 18 ++++++++++-----
> >> arch/x86/include/uapi/asm/kvm.h | 1 +
> >> arch/x86/kvm/mmu/mmu.c | 5 ++++-
> >> arch/x86/kvm/mmu/paging_tmpl.h | 3 +++
> >> arch/x86/kvm/x86.c | 10 ++++++---
> >> include/linux/kvm_host.h | 3 +++
> >> include/uapi/linux/kvm.h | 2 ++
> >> virt/kvm/kvm_main.c | 39 +++++++++++++++++++++++++++------
> >> 8 files changed, 64 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index 644e5326aa50..dc4172352635 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -1241,6 +1241,7 @@ yet and must be cleared on entry.
> >> /* for kvm_memory_region::flags */
> >> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> >> #define KVM_MEM_READONLY (1UL << 1)
> >> + #define KVM_MEM_PCI_HOLE (1UL << 2)
> >>
> >> This ioctl allows the user to create, modify or delete a guest physical
> >> memory slot. Bits 0-15 of "slot" specify the slot id and this value
> >> @@ -1268,12 +1269,17 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> >> be identical. This allows large pages in the guest to be backed by large
> >> pages in the host.
> >>
> >> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> >> -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> >> -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> >> -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> >> -to make a new slot read-only. In this case, writes to this memory will be
> >> -posted to userspace as KVM_EXIT_MMIO exits.
> >> +The flags field supports the following flags: KVM_MEM_LOG_DIRTY_PAGES,
> >> +KVM_MEM_READONLY, KVM_MEM_PCI_HOLE:
> >> +- KVM_MEM_LOG_DIRTY_PAGES: log writes. Use KVM_GET_DIRTY_LOG to retrieve
> >> + the log.
> >> +- KVM_MEM_READONLY: exit to userspace with KVM_EXIT_MMIO on writes. Only
> >> + available when KVM_CAP_READONLY_MEM is present.
> >> +- KVM_MEM_PCI_HOLE: always return 0xff on reads, exit to userspace with
> >> + KVM_EXIT_MMIO on writes. Only available when KVM_CAP_PCI_HOLE_MEM is
> >> + present. When setting the memory region 'userspace_addr' must be NULL.
> >> + This flag is mutually exclusive with KVM_MEM_LOG_DIRTY_PAGES and with
> >> + KVM_MEM_READONLY.
> >>
> >> When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> >> the memory region are automatically reflected into the guest. For example, an
> >> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> >> index 17c5a038f42d..cf80a26d74f5 100644
> >> --- a/arch/x86/include/uapi/asm/kvm.h
> >> +++ b/arch/x86/include/uapi/asm/kvm.h
> >> @@ -48,6 +48,7 @@
> >> #define __KVM_HAVE_XSAVE
> >> #define __KVM_HAVE_XCRS
> >> #define __KVM_HAVE_READONLY_MEM
> >> +#define __KVM_HAVE_PCI_HOLE_MEM
> >>
> >> /* Architectural interrupt line count. */
> >> #define KVM_NR_INTERRUPTS 256
> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> index fef6956393f7..4a2a7fface1e 100644
> >> --- a/arch/x86/kvm/mmu/mmu.c
> >> +++ b/arch/x86/kvm/mmu/mmu.c
> >> @@ -3254,7 +3254,7 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
> >> return PG_LEVEL_4K;
> >>
> >> slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, true);
> >> - if (!slot)
> >> + if (!slot || (slot->flags & KVM_MEM_PCI_HOLE))
> >
> > This is unnecessary since you're setting disallow_lpage in
> > kvm_alloc_memslot_metadata().
> >
>
> Yea, redundant precaution, can be dropped.
>
> >> return PG_LEVEL_4K;
> >>
> >> max_level = min(max_level, max_huge_page_level);
> >> @@ -4105,6 +4105,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >>
> >> slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> >>
> >> + if (!write && slot && (slot->flags & KVM_MEM_PCI_HOLE))
> >
> > I'm confused. Why does this short circuit reads but not writes?
> >
>
> The idea was that guests shouldn't normally write to these regions and
> we may want to catch them if they do. We can short circuit writes too by
> simply ignoring them.

Another point is that writes by guests might need to set bits
in the root port's error reporting capability.

> >> + return RET_PF_EMULATE;
> >> +
> >> if (try_async_pf(vcpu, slot, prefault, gfn, gpa, &pfn, write,
> >> &map_writable))
> >> return RET_PF_RETRY;
> >> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> >> index 5c6a895f67c3..27abd69e69f6 100644
> >> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> >> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> >> @@ -836,6 +836,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
> >>
> >> slot = kvm_vcpu_gfn_to_memslot(vcpu, walker.gfn);
> >>
> >> + if (!write_fault && slot && (slot->flags & KVM_MEM_PCI_HOLE))
> >> + return RET_PF_EMULATE;
> >> +
> >> if (try_async_pf(vcpu, slot, prefault, walker.gfn, addr, &pfn,
> >> write_fault, &map_writable))
> >> return RET_PF_RETRY;
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index dc4370394ab8..538bc58a22db 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -3515,6 +3515,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >> case KVM_CAP_EXCEPTION_PAYLOAD:
> >> case KVM_CAP_SET_GUEST_DEBUG:
> >> case KVM_CAP_LAST_CPU:
> >> + case KVM_CAP_PCI_HOLE_MEM:
> >> r = 1;
> >> break;
> >> case KVM_CAP_SYNC_REGS:
> >> @@ -10114,9 +10115,11 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
> >> ugfn = slot->userspace_addr >> PAGE_SHIFT;
> >> /*
> >> * If the gfn and userspace address are not aligned wrt each
> >> - * other, disable large page support for this slot.
> >> + * other, disable large page support for this slot. Also,
> >> + * disable large page support for KVM_MEM_PCI_HOLE slots.
> >> */
> >> - if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
> >> + if ((slot->flags & KVM_MEM_PCI_HOLE) || ((slot->base_gfn ^ ugfn) &
> >> + (KVM_PAGES_PER_HPAGE(level) - 1))) {
> >> unsigned long j;
> >>
> >> for (j = 0; j < lpages; ++j)
> >> @@ -10178,7 +10181,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> >> * Nothing to do for RO slots or CREATE/MOVE/DELETE of a slot.
> >> * See comments below.
> >> */
> >> - if ((change != KVM_MR_FLAGS_ONLY) || (new->flags & KVM_MEM_READONLY))
> >> + if ((change != KVM_MR_FLAGS_ONLY) ||
> >> + (new->flags & (KVM_MEM_READONLY | KVM_MEM_PCI_HOLE)))
> >> return;
> >>
> >> /*
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index 989afcbe642f..de1faa64a8ef 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -1081,6 +1081,9 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
> >> static inline unsigned long
> >> __gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> >> {
> >> + /* Should never be called with a KVM_MEM_PCI_HOLE slot */
> >> + BUG_ON(!slot->userspace_addr);
> >
> > So _technically_, userspace can hit this by allowing virtual address 0,
> > which is very much non-standard, but theoretically legal. It'd probably be
> > better to use a value that can't possibly be a valid userspace_addr, e.g. a
> > non-canonical value.
> >
>
> I think I had a '!(slot->flags & KVM_MEM_PCI_HOLE)' check in a previous
> version; we can restore it (if needed) or drop the thing completely.
>
> >> +
> >> return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
> >> }
> >>
> >
> > ...
> >
> >> @@ -2318,6 +2338,11 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> >> int r;
> >> unsigned long addr;
> >>
> >> + if (unlikely(slot && (slot->flags & KVM_MEM_PCI_HOLE))) {
> >> + memset(data, 0xff, len);
> >> + return 0;
> >> + }
> >
> > This feels wrong, shouldn't we be treating PCI_HOLE as MMIO? Given that
> > this is performance oriented, I would think we'd want to leverage the
> > GPA from the VMCS instead of doing a full translation.
> >
> > That brings up a potential alternative to adding a memslot flag. What if
> > we instead add a KVM_MMIO_BUS device similar to coalesced MMIO? I think
> > it'd be about the same amount of KVM code, and it would provide userspace
> > with more flexibility, e.g. I assume it would allow handling even writes
> > wholly within the kernel for certain ranges and/or use cases, and it'd
> > allow stuffing a value other than 0xff (though I have no idea if there is
> > a use case for this).
>
> I was thinking about making this a bit more generic, like 'immutable'
> memory with userspace-supplied values, e.g. userspace would provide a
> region (e.g. a single page which all pages of the slot map to) but then
> I failed to find a use-case for that. The PCI hole semantics seem to be
> the only ones we actually need in real life.
>
> Overall, making these PCI holes 'special' memory regions (slots) and
> sticking to KVM_SET_USER_MEMORY_REGION feels natural to me. I also think
> it would be much easier to consume from the QEMU side as we won't need to
> use a 'special' API when things change (e.g. a device gets added and we
> need to [un]punch a hole in the 'PCI hole' space).
>
> >
> > Speaking of which, why do writes go to userspace in this series?
> >
>
> No particular reason, actually; if there is no need to catch such
> (stray?) writes, we can simply short-circuit them to a nop.

With AER we might want to handle them, I think.

> >> +
> >> addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
> >> if (kvm_is_error_hva(addr))
> >> return -EFAULT;
> >> --
> >> 2.25.4
> >>
> >
>
> --
> Vitaly