The basic idea is to let userspace provide the desired maximal number of
VCPUs and allocate only the memory necessary for them.
The goal is to freeze KVM_MAX_VCPUS at its current level and only increase the
new KVM_CONFIGURABLE_MAX_VCPUS, probably directly to INT_MAX/KVM_MAX_VCPU_ID,
so we don't have to worry about it for a while.
PPC should be interested in this as they set KVM_MAX_VCPUS to NR_CPUS
and probably waste a few pages for every guest this way.
Radim Krčmář (4):
KVM: remove unused __KVM_HAVE_ARCH_VM_ALLOC
KVM: allocate kvm->vcpus separately
KVM: add KVM_CREATE_VM2 system ioctl
KVM: x86: enable configurable MAX_VCPU
Documentation/virtual/kvm/api.txt | 28 +++++++++++++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/irq_comm.c | 4 +--
include/linux/kvm_host.h | 23 +++++-------
include/uapi/linux/kvm.h | 8 +++++
virt/kvm/kvm_main.c | 76 +++++++++++++++++++++++++++++++++------
6 files changed, 114 insertions(+), 26 deletions(-)
--
2.12.0
Moving it to generic code will allow us to extend it with ease.
Signed-off-by: Radim Krčmář <[email protected]>
---
include/linux/kvm_host.h | 12 ------------
virt/kvm/kvm_main.c | 16 +++++++++++++---
2 files changed, 13 insertions(+), 15 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 397b7b5b1933..ae4e114cb7d1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -769,18 +769,6 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
void *kvm_kvzalloc(unsigned long size);
-#ifndef __KVM_HAVE_ARCH_VM_ALLOC
-static inline struct kvm *kvm_arch_alloc_vm(void)
-{
- return kzalloc(sizeof(struct kvm), GFP_KERNEL);
-}
-
-static inline void kvm_arch_free_vm(struct kvm *kvm)
-{
- kfree(kvm);
-}
-#endif
-
#ifdef __KVM_HAVE_ARCH_NONCOHERENT_DMA
void kvm_arch_register_noncoherent_dma(struct kvm *kvm);
void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 357e67cba32e..f03b093abffe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -604,10 +604,20 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
return 0;
}
+static inline struct kvm *kvm_alloc_vm(void)
+{
+ return kzalloc(sizeof(struct kvm), GFP_KERNEL);
+}
+
+static inline void kvm_free_vm(struct kvm *kvm)
+{
+ kfree(kvm);
+}
+
static struct kvm *kvm_create_vm(unsigned long type)
{
int r, i;
- struct kvm *kvm = kvm_arch_alloc_vm();
+ struct kvm *kvm = kvm_alloc_vm();
if (!kvm)
return ERR_PTR(-ENOMEM);
@@ -684,7 +694,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
kfree(kvm->buses[i]);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
kvm_free_memslots(kvm, kvm->memslots[i]);
- kvm_arch_free_vm(kvm);
+ kvm_free_vm(kvm);
mmdrop(current->mm);
return ERR_PTR(r);
}
@@ -744,7 +754,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_free_memslots(kvm, kvm->memslots[i]);
cleanup_srcu_struct(&kvm->irq_srcu);
cleanup_srcu_struct(&kvm->srcu);
- kvm_arch_free_vm(kvm);
+ kvm_free_vm(kvm);
preempt_notifier_dec();
hardware_disable_all();
mmdrop(mm);
--
2.12.0
This patch allows userspace to tell how many VCPUs it is going to use,
which can save memory when allocating the kvm->vcpus array. This will
be done with a new KVM_CREATE_VM2 IOCTL.
An alternative would be to redo kvm->vcpus as a list or protect the
array with RCU.  RCU is slower and a list is not even practical as
kvm->vcpus is used for index-based accesses.
We could have an IOCTL that is called in between KVM_CREATE_VM and the first
KVM_CREATE_VCPU and sets the size of the vcpus array, but we'd be making
one useless allocation.  Knowing the desired number of VCPUs from the
beginning seems best for now.
This patch also prepares generic code for architectures that will set
KVM_CONFIGURABLE_MAX_VCPUS to a non-zero value.
A disputable decision is that KVM_CREATE_VM2 actually works even if
KVM_CAP_CONFIGURABLE_MAX_VCPUS returns 0, but that capability is what
userspace uses to detect the ioctl.
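For illustration, a minimal userspace sketch of using the new ioctl (the
helper name and the error handling are just for the example; it assumes the
uapi additions below):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* returns a VM fd, or a negative value on error */
	int create_vm2(__u32 max_vcpus)
	{
		struct kvm_vm_config cfg = {
			.type = 0,              /* same meaning as the KVM_CREATE_VM argument */
			.max_vcpus = max_vcpus, /* 0 falls back to KVM_MAX_VCPUS */
		};
		int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);

		if (kvm_fd < 0)
			return -1;
		/* fails with E2BIG if max_vcpus exceeds KVM_CAP_CONFIGURABLE_MAX_VCPUS */
		return ioctl(kvm_fd, KVM_CREATE_VM2, &cfg);
	}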
Signed-off-by: Radim Krčmář <[email protected]>
---
Documentation/virtual/kvm/api.txt | 28 ++++++++++++++++++++++++
include/linux/kvm_host.h | 3 +++
include/uapi/linux/kvm.h | 8 +++++++
virt/kvm/kvm_main.c | 45 +++++++++++++++++++++++++++++++++------
4 files changed, 77 insertions(+), 7 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index e60be91d8036..461130adbdc7 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3259,6 +3259,34 @@ Otherwise, if the MCE is a corrected error, KVM will just
store it in the corresponding bank (provided this bank is
not holding a previously reported uncorrected error).
+
+4.107 KVM_CREATE_VM2
+
+Capability: KVM_CAP_CONFIGURABLE_MAX_VCPUS
+Architectures: all
+Type: system ioctl
+Parameters: struct kvm_vm_config
+Returns: a VM fd that can be used to control the new virtual machine,
+ -E2BIG if the value of max_vcpus is not supported
+
+This is an extension of KVM_CREATE_VM that allows the user to pass more
+information through
+
+struct kvm_vm_config {
+ __u64 type;
+ __u32 max_vcpus;
+ __u8 reserved[52];
+};
+
+type is the argument to KVM_CREATE_VM
+
+max_vcpus is the desired maximal number of VCPUs; it must not exceed the value
+returned by KVM_CAP_CONFIGURABLE_MAX_VCPUS. A value of 0 is treated as if
+userspace passed the value returned by KVM_CAP_MAX_VCPUS instead.
+
+reserved must be 0.
+
+
5. The kvm_run structure
------------------------
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6ba7bc831094..b875c0997328 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -39,6 +39,9 @@
#ifndef KVM_MAX_VCPU_ID
#define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
#endif
+#ifndef KVM_CONFIGURABLE_MAX_VCPUS
+#define KVM_CONFIGURABLE_MAX_VCPUS 0U
+#endif
/*
* The bit 16 ~ bit 31 of kvm_memory_region::flags are internally used
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6180ea50e9ef..8349c73b3517 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -83,6 +83,12 @@ struct kvm_debug_guest {
/* *** End of deprecated interfaces *** */
+/* for KVM_CREATE_VM2 */
+struct kvm_vm_config {
+ __u64 type;
+ __u32 max_vcpus;
+ __u8 reserved[52];
+};
/* for KVM_CREATE_MEMORY_REGION */
struct kvm_memory_region {
@@ -713,6 +719,7 @@ struct kvm_ppc_resize_hpt {
*/
#define KVM_GET_API_VERSION _IO(KVMIO, 0x00)
#define KVM_CREATE_VM _IO(KVMIO, 0x01) /* returns a VM fd */
+#define KVM_CREATE_VM2 _IOR(KVMIO, 0x01, struct kvm_vm_config)
#define KVM_GET_MSR_INDEX_LIST _IOWR(KVMIO, 0x02, struct kvm_msr_list)
#define KVM_S390_ENABLE_SIE _IO(KVMIO, 0x06)
@@ -892,6 +899,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_MIPS_64BIT 139
#define KVM_CAP_S390_GS 140
#define KVM_CAP_S390_AIS 141
+#define KVM_CAP_CONFIGURABLE_MAX_VCPUS 142
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f1579f118b4..9ef52fa006ec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -629,11 +629,20 @@ static inline void kvm_free_vm(struct kvm *kvm)
kfree(kvm);
}
-static struct kvm *kvm_create_vm(unsigned long type)
+static struct kvm *kvm_create_vm(struct kvm_vm_config *vm_config)
{
int r, i;
- struct kvm *kvm = kvm_alloc_vm(KVM_MAX_VCPUS);
+ struct kvm *kvm;
+ if (!KVM_CONFIGURABLE_MAX_VCPUS && vm_config->max_vcpus)
+ return ERR_PTR(-EINVAL);
+ if (vm_config->max_vcpus > KVM_CONFIGURABLE_MAX_VCPUS)
+ return ERR_PTR(-E2BIG);
+
+ if (!vm_config->max_vcpus)
+ vm_config->max_vcpus = KVM_MAX_VCPUS;
+
+ kvm = kvm_alloc_vm(vm_config->max_vcpus);
if (!kvm)
return ERR_PTR(-ENOMEM);
@@ -647,7 +656,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
refcount_set(&kvm->users_count, 1);
INIT_LIST_HEAD(&kvm->devices);
- r = kvm_arch_init_vm(kvm, type);
+ r = kvm_arch_init_vm(kvm, vm_config->type);
if (r)
goto out_err_no_disable;
@@ -2957,6 +2966,8 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
#endif
case KVM_CAP_MAX_VCPU_ID:
return KVM_MAX_VCPU_ID;
+ case KVM_CAP_CONFIGURABLE_MAX_VCPUS:
+ return KVM_CONFIGURABLE_MAX_VCPUS;
default:
break;
}
@@ -3182,13 +3193,13 @@ static struct file_operations kvm_vm_fops = {
.llseek = noop_llseek,
};
-static int kvm_dev_ioctl_create_vm(unsigned long type)
+static int kvm_dev_ioctl_create_vm(struct kvm_vm_config *vm_config)
{
int r;
struct kvm *kvm;
struct file *file;
- kvm = kvm_create_vm(type);
+ kvm = kvm_create_vm(vm_config);
if (IS_ERR(kvm))
return PTR_ERR(kvm);
#ifdef CONFIG_KVM_MMIO
@@ -3223,6 +3234,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
static long kvm_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
+ void __user *argp = (void __user *)arg;
long r = -EINVAL;
switch (ioctl) {
@@ -3231,9 +3243,28 @@ static long kvm_dev_ioctl(struct file *filp,
goto out;
r = KVM_API_VERSION;
break;
- case KVM_CREATE_VM:
- r = kvm_dev_ioctl_create_vm(arg);
+ case KVM_CREATE_VM: {
+ struct kvm_vm_config vm_config = {.type = arg};
+
+ r = kvm_dev_ioctl_create_vm(&vm_config);
break;
+ }
+ case KVM_CREATE_VM2: {
+ struct kvm_vm_config vm_config, check_reserved = {};
+
+ r = -EFAULT;
+ if (copy_from_user(&vm_config, argp, sizeof vm_config))
+ goto out;
+
+ r = -EINVAL;
+ check_reserved.type = vm_config.type;
+ check_reserved.max_vcpus = vm_config.max_vcpus;
+ if (memcmp(&vm_config, &check_reserved, sizeof check_reserved))
+ goto out;
+
+ r = kvm_dev_ioctl_create_vm(&vm_config);
+ break;
+ }
case KVM_CHECK_EXTENSION:
r = kvm_vm_ioctl_check_extension_generic(NULL, arg);
break;
--
2.12.0
The only user of KVM_MAX_VCPUS is switched to kvm->max_vcpus.
The limit could have been INT_MAX as it makes no difference, but there
is no point in making it bigger than KVM_MAX_VCPU_ID.
Signed-off-by: Radim Krčmář <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/irq_comm.c | 4 ++--
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2cc5ec7cc6f5..eeeb88eedabf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -38,6 +38,7 @@
#define KVM_MAX_VCPUS 288
#define KVM_SOFT_MAX_VCPUS 240
#define KVM_MAX_VCPU_ID 1023
+#define KVM_CONFIGURABLE_MAX_VCPUS KVM_MAX_VCPU_ID
#define KVM_USER_MEM_SLOTS 509
/* memory slots that are not exposed to userspace */
#define KVM_PRIVATE_MEM_SLOTS 3
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 4517a4c2ac3a..a7baeb44539a 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -60,7 +60,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
{
int i, r = -1;
struct kvm_vcpu *vcpu, *lowest = NULL;
- unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
+ unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(kvm->max_vcpus)];
unsigned int dest_vcpus = 0;
if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
@@ -101,7 +101,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
if (dest_vcpus != 0) {
int idx = kvm_vector_to_index(irq->vector, dest_vcpus,
- dest_vcpu_bitmap, KVM_MAX_VCPUS);
+ dest_vcpu_bitmap, kvm->max_vcpus);
lowest = kvm_get_vcpu(kvm, idx);
}
--
2.12.0
The maximal number of VCPUs is going to be high, but most VMs are still
going to use just a few. We want to save memory and there are two main
conservative possibilities:
1) turn vcpus into a pointer and allocate separately
2) turn vcpus into variable length array at the end of struct kvm
This patch does (1) as it is slightly safer; (2) would avoid one level
of indirection and is a nice follow-up (sketched below).
The vcpus array is going to be dynamic and might take several pages, which
is why it is allocated with kvm_kvzalloc().
Generic users of KVM_MAX_VCPUS are switched to kvm->max_vcpus as the
array size is going to change.
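For reference, option (2) would look roughly like this -- not part of this
patch, only a sketch of the possible follow-up:

	/* hypothetical variant (2): vcpus as a flexible array member of struct kvm */
	struct kvm {
		/* ... existing members ... */
		int max_vcpus;
		struct kvm_vcpu *vcpus[];	/* must stay the last member */
	};

	static inline struct kvm *kvm_alloc_vm(size_t max_vcpus)
	{
		struct kvm *kvm = kvm_kvzalloc(sizeof(*kvm) +
					       max_vcpus * sizeof(kvm->vcpus[0]));

		if (kvm)
			kvm->max_vcpus = max_vcpus;
		return kvm;
	}

The price is that struct kvm itself can then span several pages, so the whole
structure has to go through kvm_kvzalloc().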
Signed-off-by: Radim Krčmář <[email protected]>
---
include/linux/kvm_host.h | 8 ++++++--
virt/kvm/kvm_main.c | 23 +++++++++++++++++++----
2 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae4e114cb7d1..6ba7bc831094 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -377,16 +377,20 @@ struct kvm {
struct kvm_memslots *memslots[KVM_ADDRESS_SPACE_NUM];
struct srcu_struct srcu;
struct srcu_struct irq_srcu;
- struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+ struct kvm_vcpu **vcpus;
/*
* created_vcpus is protected by kvm->lock, and is incremented
* at the beginning of KVM_CREATE_VCPU. online_vcpus is only
* incremented after storing the kvm_vcpu pointer in vcpus,
* and is accessed atomically.
+ * max_vcpus is the size of vcpus array and can be changed only before
+ * any vcpu is created. Updates to max_vcpus are protected by
+ * kvm->lock.
*/
atomic_t online_vcpus;
int created_vcpus;
+ int max_vcpus;
int last_boosted_vcpu;
struct list_head vm_list;
struct mutex lock;
@@ -480,7 +484,7 @@ static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id)
if (id < 0)
return NULL;
- if (id < KVM_MAX_VCPUS)
+ if (id < kvm->max_vcpus)
vcpu = kvm_get_vcpu(kvm, id);
if (vcpu && vcpu->vcpu_id == id)
return vcpu;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f03b093abffe..0f1579f118b4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -604,20 +604,35 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
return 0;
}
-static inline struct kvm *kvm_alloc_vm(void)
+static inline struct kvm *kvm_alloc_vm(size_t max_vcpus)
{
- return kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ struct kvm *kvm;
+
+ kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ if (!kvm)
+ return NULL;
+
+ kvm->vcpus = kvm_kvzalloc(sizeof(*kvm->vcpus) * max_vcpus);
+ if (!kvm->vcpus) {
+ kfree(kvm);
+ return NULL;
+ }
+ kvm->max_vcpus = max_vcpus;
+
+ return kvm;
}
static inline void kvm_free_vm(struct kvm *kvm)
{
+ if (kvm)
+ kvfree(kvm->vcpus);
kfree(kvm);
}
static struct kvm *kvm_create_vm(unsigned long type)
{
int r, i;
- struct kvm *kvm = kvm_alloc_vm();
+ struct kvm *kvm = kvm_alloc_vm(KVM_MAX_VCPUS);
if (!kvm)
return ERR_PTR(-ENOMEM);
@@ -2445,7 +2460,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
return -EINVAL;
mutex_lock(&kvm->lock);
- if (kvm->created_vcpus == KVM_MAX_VCPUS) {
+ if (kvm->created_vcpus == kvm->max_vcpus) {
mutex_unlock(&kvm->lock);
return -EINVAL;
}
--
2.12.0
On 13.04.2017 22:19, Radim Krčmář wrote:
> Moving it to generic code will allow us to extend it with ease.
>
> Signed-off-by: Radim Krčmář <[email protected]>
> ---
> include/linux/kvm_host.h | 12 ------------
> virt/kvm/kvm_main.c | 16 +++++++++++++---
> 2 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 397b7b5b1933..ae4e114cb7d1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -769,18 +769,6 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
>
> void *kvm_kvzalloc(unsigned long size);
>
> -#ifndef __KVM_HAVE_ARCH_VM_ALLOC
> -static inline struct kvm *kvm_arch_alloc_vm(void)
> -{
> - return kzalloc(sizeof(struct kvm), GFP_KERNEL);
> -}
> -
> -static inline void kvm_arch_free_vm(struct kvm *kvm)
> -{
> - kfree(kvm);
> -}
> -#endif
> -
> #ifdef __KVM_HAVE_ARCH_NONCOHERENT_DMA
> void kvm_arch_register_noncoherent_dma(struct kvm *kvm);
> void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 357e67cba32e..f03b093abffe 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -604,10 +604,20 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
> return 0;
> }
>
> +static inline struct kvm *kvm_alloc_vm(void)
> +{
> + return kzalloc(sizeof(struct kvm), GFP_KERNEL);
> +}
> +
> +static inline void kvm_free_vm(struct kvm *kvm)
> +{
> + kfree(kvm);
> +}
> +
> static struct kvm *kvm_create_vm(unsigned long type)
> {
> int r, i;
> - struct kvm *kvm = kvm_arch_alloc_vm();
> + struct kvm *kvm = kvm_alloc_vm();
>
> if (!kvm)
> return ERR_PTR(-ENOMEM);
> @@ -684,7 +694,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> kfree(kvm->buses[i]);
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> kvm_free_memslots(kvm, kvm->memslots[i]);
> - kvm_arch_free_vm(kvm);
> + kvm_free_vm(kvm);
> mmdrop(current->mm);
> return ERR_PTR(r);
> }
> @@ -744,7 +754,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
> kvm_free_memslots(kvm, kvm->memslots[i]);
> cleanup_srcu_struct(&kvm->irq_srcu);
> cleanup_srcu_struct(&kvm->srcu);
> - kvm_arch_free_vm(kvm);
> + kvm_free_vm(kvm);
> preempt_notifier_dec();
> hardware_disable_all();
> mmdrop(mm);
>
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David
On 13.04.2017 22:19, Radim Krčmář wrote:
> The basic idea is to let userspace provide the desired maximal number of
> VCPUs and allocate only necessary memory for them.
>
> The goal is to freeze KVM_MAX_VCPUS at its current level and only increase the
KVM_MAX_VCPUS might still increase, e.g. if hw support for more VCPUs is
coming.
> new KVM_MAX_CONFIGURABLE_VCPUS, probably directly to INT_MAX/KVM_VCPU_ID, so we
> don't have to worry about it for a while.
>
> PPC should be interested in this as they set KVM_MAX_VCPUS to NR_CPUS
> and probably waste few pages for every guest this way.
As we just store pointers, this should be a maximum of 4 pages for ppc
(4k pages). Is this really worth yet another VM creation ioctl? Is there
not a nicer way to handle this internally?
An alternative might be to simply realloc the array when it reaches a
certain size (on VCPU creation, maybe protecting the pointer via rcu).
But not sure if something like that could work.
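Roughly something like the following, I guess -- just a sketch of the realloc
idea, where the names, the doubling policy, the locking and the ordering of
the updates are all open:

	/* hypothetical: grow kvm->vcpus on demand, called under kvm->lock */
	static int kvm_grow_vcpus(struct kvm *kvm)
	{
		int new_max = kvm->max_vcpus * 2;
		struct kvm_vcpu **old_vcpus = kvm->vcpus;
		struct kvm_vcpu **new_vcpus = kvm_kvzalloc(new_max * sizeof(*new_vcpus));

		if (!new_vcpus)
			return -ENOMEM;
		memcpy(new_vcpus, old_vcpus, kvm->max_vcpus * sizeof(*new_vcpus));
		rcu_assign_pointer(kvm->vcpus, new_vcpus);	/* kvm->vcpus would be __rcu */
		kvm->max_vcpus = new_max;
		synchronize_rcu();			/* wait out readers of the old array */
		kvfree(old_vcpus);
		return 0;
	}

Readers (kvm_get_vcpu(), kvm_for_each_vcpu()) would then have to take
rcu_read_lock() around the array access, which is the part I'm not sure
about.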
>
>
> Radim Krčmář (4):
> KVM: remove unused __KVM_HAVE_ARCH_VM_ALLOC
> KVM: allocate kvm->vcpus separately
> KVM: add KVM_CREATE_VM2 system ioctl
> KVM: x86: enable configurable MAX_VCPU
>
> Documentation/virtual/kvm/api.txt | 28 +++++++++++++++
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/irq_comm.c | 4 +--
> include/linux/kvm_host.h | 23 +++++-------
> include/uapi/linux/kvm.h | 8 +++++
> virt/kvm/kvm_main.c | 76 +++++++++++++++++++++++++++++++++------
> 6 files changed, 114 insertions(+), 26 deletions(-)
>
--
Thanks,
David
On Tue, 18 Apr 2017 13:11:55 +0200
David Hildenbrand <[email protected]> wrote:
> On 13.04.2017 22:19, Radim Krčmář wrote:
> > The basic idea is to let userspace provide the desired maximal number of
> > VCPUs and allocate only necessary memory for them.
> >
> > The goal is to freeze KVM_MAX_VCPUS at its current level and only increase the
>
> KVM_MAX_VCPUS might still increase e.g. if hw support for more VCPUs is
> comming.
>
> > new KVM_MAX_CONFIGURABLE_VCPUS, probably directly to INT_MAX/KVM_VCPU_ID, so we
> > don't have to worry about it for a while.
> >
> > PPC should be interested in this as they set KVM_MAX_VCPUS to NR_CPUS
> > and probably waste few pages for every guest this way.
>
> As we just store pointers, this should be a maximum of 4 pages for ppc
> (4k pages). Is this really worth yet another VM creation ioctl? Is there
> not a nicer way to handle this internally?
>
> An alternative might be to simply realloc the array when it reaches a
> certain size (on VCPU creation, maybe protecting the pointer via rcu).
> But not sure if something like that could work.
I like that idea better, if it does work (I think it should be doable).
If we just double the array size every time we run out of space, we
should be able to do this with few reallocations. That also has the
advantage of being transparent to user space (other than the increased
number of vcpus).
>
> >
> >
> > Radim Krčmář (4):
> > KVM: remove unused __KVM_HAVE_ARCH_VM_ALLOC
> > KVM: allocate kvm->vcpus separately
> > KVM: add KVM_CREATE_VM2 system ioctl
> > KVM: x86: enable configurable MAX_VCPU
> >
> > Documentation/virtual/kvm/api.txt | 28 +++++++++++++++
> > arch/x86/include/asm/kvm_host.h | 1 +
> > arch/x86/kvm/irq_comm.c | 4 +--
> > include/linux/kvm_host.h | 23 +++++-------
> > include/uapi/linux/kvm.h | 8 +++++
> > virt/kvm/kvm_main.c | 76 +++++++++++++++++++++++++++++++++------
> > 6 files changed, 114 insertions(+), 26 deletions(-)
> >
>
>
On 13/04/2017 22:19, Radim Krčmář wrote:
> This patch allows userspace to tell how many VCPUs it is going to use,
> which can save memory when allocating the kvm->vcpus array. This will
> be done with a new KVM_CREATE_VM2 IOCTL.
>
> An alternative would be to redo kvm->vcpus as a list or protect the
> array with RCU. RCU is slower and a list is not even practical as
> kvm->vcpus are being used for index-based accesses.
>
> We could have an IOCTL that is called in between KVM_CREATE_VM and first
> KVM_CREATE_VCPU and sets the size of the vcpus array, but we'd be making
> one useless allocation. Knowing the desired number of VCPUs from the
> beginning is seems best for now.
>
> This patch also prepares generic code for architectures that will set
> KVM_CONFIGURABLE_MAX_VCPUS to a non-zero value.
Why is KVM_MAX_VCPU_ID or KVM_MAX_VCPUS not enough?
Paolo
> A disputable decision is that KVM_CREATE_VM2 actually works even if
> KVM_CAP_CONFIGURABLE_MAX_VCPUS is 0, but uses that capability for its
> detection.
>
> Signed-off-by: Radim Krčmář <[email protected]>
> ---
> Documentation/virtual/kvm/api.txt | 28 ++++++++++++++++++++++++
> include/linux/kvm_host.h | 3 +++
> include/uapi/linux/kvm.h | 8 +++++++
> virt/kvm/kvm_main.c | 45 +++++++++++++++++++++++++++++++++------
> 4 files changed, 77 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index e60be91d8036..461130adbdc7 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3259,6 +3259,34 @@ Otherwise, if the MCE is a corrected error, KVM will just
> store it in the corresponding bank (provided this bank is
> not holding a previously reported uncorrected error).
>
> +
> +4.107 KVM_CREATE_VM2
> +
> +Capability: KVM_CAP_CONFIGURABLE_MAX_VCPUS
> +Architectures: all
> +Type: system ioctl
> +Parameters: struct kvm_vm_config
> +Returns: a VM fd that can be used to control the new virtual machine,
> + -E2BIG if the value of max_vcpus is not supported
> +
> +This is an extension of KVM_CREATE_VM that allows the user to pass more
> +information through
> +
> +struct kvm_vm_config {
> + __u64 type;
> + __u32 max_vcpus;
> + __u8 reserved[52];
> +};
> +
> +type is the argument to KVM_CREATE_VM
> +
> +max_vcpus is the desired maximal number of VCPUs, it must not exceed the value
> +returned by KVM_CAP_CONFIGURABLE_MAX_VCPUS. Value of 0 treated as if userspace
> +passed the value returned by KVM_CAP_MAX_VCPU instead.
> +
> +reserved is must be 0.
> +
> +
> 5. The kvm_run structure
> ------------------------
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 6ba7bc831094..b875c0997328 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -39,6 +39,9 @@
> #ifndef KVM_MAX_VCPU_ID
> #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> #endif
> +#ifndef KVM_CONFIGURABLE_MAX_VCPUS
> +#define KVM_CONFIGURABLE_MAX_VCPUS 0U
> +#endif
>
> /*
> * The bit 16 ~ bit 31 of kvm_memory_region::flags are internally used
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6180ea50e9ef..8349c73b3517 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -83,6 +83,12 @@ struct kvm_debug_guest {
>
> /* *** End of deprecated interfaces *** */
>
> +/* for KVM_CREATE_VM2 */
> +struct kvm_vm_config {
> + __u64 type;
> + __u32 max_vcpus;
> + __u8 reserved[52];
> +};
>
> /* for KVM_CREATE_MEMORY_REGION */
> struct kvm_memory_region {
> @@ -713,6 +719,7 @@ struct kvm_ppc_resize_hpt {
> */
> #define KVM_GET_API_VERSION _IO(KVMIO, 0x00)
> #define KVM_CREATE_VM _IO(KVMIO, 0x01) /* returns a VM fd */
> +#define KVM_CREATE_VM2 _IOR(KVMIO, 0x01, struct kvm_vm_config)
> #define KVM_GET_MSR_INDEX_LIST _IOWR(KVMIO, 0x02, struct kvm_msr_list)
>
> #define KVM_S390_ENABLE_SIE _IO(KVMIO, 0x06)
> @@ -892,6 +899,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_MIPS_64BIT 139
> #define KVM_CAP_S390_GS 140
> #define KVM_CAP_S390_AIS 141
> +#define KVM_CAP_CONFIGURABLE_MAX_VCPUS 142
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0f1579f118b4..9ef52fa006ec 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -629,11 +629,20 @@ static inline void kvm_free_vm(struct kvm *kvm)
> kfree(kvm);
> }
>
> -static struct kvm *kvm_create_vm(unsigned long type)
> +static struct kvm *kvm_create_vm(struct kvm_vm_config *vm_config)
> {
> int r, i;
> - struct kvm *kvm = kvm_alloc_vm(KVM_MAX_VCPUS);
> + struct kvm *kvm;
>
> + if (!KVM_CONFIGURABLE_MAX_VCPUS && vm_config->max_vcpus)
> + return ERR_PTR(-EINVAL);
> + if (vm_config->max_vcpus > KVM_CONFIGURABLE_MAX_VCPUS)
> + return ERR_PTR(-E2BIG);
> +
> + if (!vm_config->max_vcpus)
> + vm_config->max_vcpus = KVM_MAX_VCPUS;
> +
> + kvm = kvm_alloc_vm(vm_config->max_vcpus);
> if (!kvm)
> return ERR_PTR(-ENOMEM);
>
> @@ -647,7 +656,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> refcount_set(&kvm->users_count, 1);
> INIT_LIST_HEAD(&kvm->devices);
>
> - r = kvm_arch_init_vm(kvm, type);
> + r = kvm_arch_init_vm(kvm, vm_config->type);
> if (r)
> goto out_err_no_disable;
>
> @@ -2957,6 +2966,8 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> #endif
> case KVM_CAP_MAX_VCPU_ID:
> return KVM_MAX_VCPU_ID;
> + case KVM_CAP_CONFIGURABLE_MAX_VCPUS:
> + return KVM_CONFIGURABLE_MAX_VCPUS;
> default:
> break;
> }
> @@ -3182,13 +3193,13 @@ static struct file_operations kvm_vm_fops = {
> .llseek = noop_llseek,
> };
>
> -static int kvm_dev_ioctl_create_vm(unsigned long type)
> +static int kvm_dev_ioctl_create_vm(struct kvm_vm_config *vm_config)
> {
> int r;
> struct kvm *kvm;
> struct file *file;
>
> - kvm = kvm_create_vm(type);
> + kvm = kvm_create_vm(vm_config);
> if (IS_ERR(kvm))
> return PTR_ERR(kvm);
> #ifdef CONFIG_KVM_MMIO
> @@ -3223,6 +3234,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
> static long kvm_dev_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> + void __user *argp = (void __user *)arg;
> long r = -EINVAL;
>
> switch (ioctl) {
> @@ -3231,9 +3243,28 @@ static long kvm_dev_ioctl(struct file *filp,
> goto out;
> r = KVM_API_VERSION;
> break;
> - case KVM_CREATE_VM:
> - r = kvm_dev_ioctl_create_vm(arg);
> + case KVM_CREATE_VM: {
> + struct kvm_vm_config vm_config = {.type = arg};
> +
> + r = kvm_dev_ioctl_create_vm(&vm_config);
> break;
> + }
> + case KVM_CREATE_VM2: {
> + struct kvm_vm_config vm_config, check_reserved = {};
> +
> + r = -EFAULT;
> + if (copy_from_user(&vm_config, argp, sizeof vm_config))
> + goto out;
> +
> + r = -EINVAL;
> + check_reserved.type = vm_config.type;
> + check_reserved.max_vcpus = vm_config.max_vcpus;
> + if (memcmp(&vm_config, &check_reserved, sizeof check_reserved))
> + goto out;
> +
> + r = kvm_dev_ioctl_create_vm(&vm_config);
> + break;
> + }
> case KVM_CHECK_EXTENSION:
> r = kvm_vm_ioctl_check_extension_generic(NULL, arg);
> break;
>
On 18/04/2017 16:16, Paolo Bonzini wrote:
>> This patch allows userspace to tell how many VCPUs it is going to use,
>> which can save memory when allocating the kvm->vcpus array. This will
>> be done with a new KVM_CREATE_VM2 IOCTL.
>>
>> An alternative would be to redo kvm->vcpus as a list or protect the
>> array with RCU. RCU is slower and a list is not even practical as
>> kvm->vcpus are being used for index-based accesses.
>>
>> We could have an IOCTL that is called in between KVM_CREATE_VM and first
>> KVM_CREATE_VCPU and sets the size of the vcpus array, but we'd be making
>> one useless allocation. Knowing the desired number of VCPUs from the
>> beginning is seems best for now.
>>
>> This patch also prepares generic code for architectures that will set
>> KVM_CONFIGURABLE_MAX_VCPUS to a non-zero value.
> Why is KVM_MAX_VCPU_ID or KVM_MAX_VCPUS not enough?
Ok, for KVM_MAX_VCPUS I should have read the cover letter more carefully. :)
Paolo
On 04/13/2017 10:19 PM, Radim Krčmář wrote:
> The only user of KVM_MAX_VCPU is switched to kvm->max_vcpu.
>
> The limit could have been INT_MAX as it makes no difference, but there
> is no point in making it bigger than KVM_MAX_VCPU_ID.
>
> Signed-off-by: Radim Krčmář <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/irq_comm.c | 4 ++--
> 2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2cc5ec7cc6f5..eeeb88eedabf 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -38,6 +38,7 @@
> #define KVM_MAX_VCPUS 288
> #define KVM_SOFT_MAX_VCPUS 240
> #define KVM_MAX_VCPU_ID 1023
> +#define KVM_CONFIGURABLE_MAX_VCPUS KVM_MAX_VCPU_ID
> #define KVM_USER_MEM_SLOTS 509
> /* memory slots that are not exposed to userspace */
> #define KVM_PRIVATE_MEM_SLOTS 3
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index 4517a4c2ac3a..a7baeb44539a 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -60,7 +60,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> {
> int i, r = -1;
> struct kvm_vcpu *vcpu, *lowest = NULL;
> - unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
> + unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(kvm->max_vcpus)];
Doesn't that allow unlimited stack usage?
2017-04-18 16:30+0200, Paolo Bonzini:
> On 18/04/2017 16:16, Paolo Bonzini wrote:
>>> This patch allows userspace to tell how many VCPUs it is going to use,
>>> which can save memory when allocating the kvm->vcpus array. This will
>>> be done with a new KVM_CREATE_VM2 IOCTL.
>>>
>>> An alternative would be to redo kvm->vcpus as a list or protect the
>>> array with RCU. RCU is slower and a list is not even practical as
>>> kvm->vcpus are being used for index-based accesses.
>>>
>>> We could have an IOCTL that is called in between KVM_CREATE_VM and first
>>> KVM_CREATE_VCPU and sets the size of the vcpus array, but we'd be making
>>> one useless allocation. Knowing the desired number of VCPUs from the
>>> beginning is seems best for now.
>>>
>>> This patch also prepares generic code for architectures that will set
>>> KVM_CONFIGURABLE_MAX_VCPUS to a non-zero value.
>> Why is KVM_MAX_VCPU_ID or KVM_MAX_VCPUS not enough?
>
> Ok, for KVM_MAX_VCPUS I should have read the cover letter more carefully. :)
KVM_MAX_VCPU_ID makes sense as the upper bound, I just didn't want to
mingle the concepts, because the kvm->vcpus array is not indexed by
VCPU_ID ...
In hindsight, it would be best to change that and get rid of the search.
I'll see how that looks in v2.
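(E.g. with the array indexed directly by vcpu_id and sized by the configured
maximum, kvm_get_vcpu_by_id could shrink to something like this -- untested,
just the idea:)

	static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id)
	{
		if (id < 0 || id >= kvm->max_vcpus)
			return NULL;
		/* vcpus[] indexed by vcpu_id, so no linear search needed */
		return kvm_get_vcpu(kvm, id);
	}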
2017-04-19 10:08+0200, Christian Borntraeger:
> On 04/13/2017 10:19 PM, Radim Krčmář wrote:
>> The only user of KVM_MAX_VCPU is switched to kvm->max_vcpu.
>>
>> The limit could have been INT_MAX as it makes no difference, but there
>> is no point in making it bigger than KVM_MAX_VCPU_ID.
>>
>> Signed-off-by: Radim Krčmář <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 1 +
>> arch/x86/kvm/irq_comm.c | 4 ++--
>> 2 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 2cc5ec7cc6f5..eeeb88eedabf 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -38,6 +38,7 @@
>> #define KVM_MAX_VCPUS 288
>> #define KVM_SOFT_MAX_VCPUS 240
>> #define KVM_MAX_VCPU_ID 1023
>> +#define KVM_CONFIGURABLE_MAX_VCPUS KVM_MAX_VCPU_ID
>> #define KVM_USER_MEM_SLOTS 509
>> /* memory slots that are not exposed to userspace */
>> #define KVM_PRIVATE_MEM_SLOTS 3
>> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
>> index 4517a4c2ac3a..a7baeb44539a 100644
>> --- a/arch/x86/kvm/irq_comm.c
>> +++ b/arch/x86/kvm/irq_comm.c
>> @@ -60,7 +60,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>> {
>> int i, r = -1;
>> struct kvm_vcpu *vcpu, *lowest = NULL;
>> - unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
>> + unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(kvm->max_vcpus)];
>
> Doesnt that allow unlimited stack usage?
Right, it doesn't scale.
kvm->max_vcpus is at most KVM_CONFIGURABLE_MAX_VCPUS, and a bitmap of that
size can be bigger than the stack allows.  I'll allocate the bitmap in v2.
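(The relevant part of kvm_irq_delivery_to_apic() could then look roughly like
this -- only a sketch, with the GFP flags and error handling still to be
decided:)

	unsigned long *dest_vcpu_bitmap;

	dest_vcpu_bitmap = kcalloc(BITS_TO_LONGS(kvm->max_vcpus),
				   sizeof(*dest_vcpu_bitmap), GFP_KERNEL);
	if (!dest_vcpu_bitmap)
		return r;	/* r is still -1 at this point */
	/* ... existing lowest-priority delivery logic, unchanged ... */
	kfree(dest_vcpu_bitmap);
	return r;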
Thanks.
2017-04-18 13:11+0200, David Hildenbrand:
> On 13.04.2017 22:19, Radim Krčmář wrote:
>> The basic idea is to let userspace provide the desired maximal number of
>> VCPUs and allocate only necessary memory for them.
>>
>> The goal is to freeze KVM_MAX_VCPUS at its current level and only increase the
>
> KVM_MAX_VCPUS might still increase e.g. if hw support for more VCPUs is
> comming.
This patch wanted to make KVM_MAX_VCPUS just a compatibility option for
old userspaces, not something new ones look at, so we wouldn't have to
touch it from now on.
>> new KVM_MAX_CONFIGURABLE_VCPUS, probably directly to INT_MAX/KVM_VCPU_ID, so we
>> don't have to worry about it for a while.
>>
>> PPC should be interested in this as they set KVM_MAX_VCPUS to NR_CPUS
>> and probably waste few pages for every guest this way.
>
> As we just store pointers, this should be a maximum of 4 pages for ppc
> (4k pages). Is this really worth yet another VM creation ioctl? Is there
> not a nicer way to handle this internally?
>
> An alternative might be to simply realloc the array when it reaches a
> certain size (on VCPU creation, maybe protecting the pointer via rcu).
> But not sure if something like that could work.
Good point. I'll cover it in the next email.
Thanks.
2017-04-18 14:29+0200, Cornelia Huck:
> On Tue, 18 Apr 2017 13:11:55 +0200
> David Hildenbrand <[email protected]> wrote:
>> On 13.04.2017 22:19, Radim Krčmář wrote:
>> > new KVM_MAX_CONFIGURABLE_VCPUS, probably directly to INT_MAX/KVM_VCPU_ID, so we
>> > don't have to worry about it for a while.
>> >
>> > PPC should be interested in this as they set KVM_MAX_VCPUS to NR_CPUS
>> > and probably waste few pages for every guest this way.
>>
>> As we just store pointers, this should be a maximum of 4 pages for ppc
>> (4k pages). Is this really worth yet another VM creation ioctl? Is there
>> not a nicer way to handle this internally?
>>
>> An alternative might be to simply realloc the array when it reaches a
>> certain size (on VCPU creation, maybe protecting the pointer via rcu).
>> But not sure if something like that could work.
>
> I like that idea better, if it does work (I think it should be doable).
> If we just double the array size every time we run out of space, we
> should be able to do this with few reallocations. That has also the
> advantage of being transparent to user space (other than increased
> number of vcpus).
Yes, reallocating would work with protection against use-after-free, and
RCU fits well.  (Readers don't have any lock we could piggyback on.)
I didn't go for it because of C: the kvm_for_each_vcpu macro would be
less robust if it included the locking around its body -- nested fors are
susceptible to return/goto errors inside the loop body -- and we'd need to
obfuscate several existing users of that pattern.  And open-coding the
protection everywhere pollutes the code too, IMO.
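(To make that concrete: the open-coded variant would look like this at every
call site, and an early return inside the body silently leaks the read-side
critical section:)

	rcu_read_lock();
	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu->vcpu_id == id)
			return vcpu;	/* bug: returns with rcu_read_lock() held */
	}
	rcu_read_unlock();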
A lock-less list would solve those problems, but we access the VCPUs by
index, which makes it suboptimal in the other direction ... using the list
for kvm_for_each_vcpu and adding an RCU-protected array for kvm_get_vcpu
and kvm_get_vcpu_by_id looks like over-engineering, as we wouldn't save
memory, performance, or lines of code by doing that.
I didn't see a way to untangle kvm->vcpus that would allow a nice
runtime-dynamic variant.
We currently don't need to pass more information at VM creation time
either, so I was also thinking of hijacking the parameter to
KVM_CREATE_VM for a factor-of-2 VCPU count (20 bits would last a while),
but that is already a new interface, and a new IOCTL that does a superset
of another one seemed much better.
I agree that the idea is questionable.  I'll redo the series and bump
KVM_MAX_VCPUS unless you think that the dynamic approach could be done
nicely.  (The memory saving is a minuscule fraction of a VM's size, and if
we do big increments in KVM_MAX_VCPUS, then the motivation is gone.)
Thanks.
2017-04-24 18:22+0200, Radim Krčmář:
> 2017-04-18 16:30+0200, Paolo Bonzini:
>> On 18/04/2017 16:16, Paolo Bonzini wrote:
>>>> This patch allows userspace to tell how many VCPUs it is going to use,
>>>> which can save memory when allocating the kvm->vcpus array. This will
>>>> be done with a new KVM_CREATE_VM2 IOCTL.
>>>>
>>>> An alternative would be to redo kvm->vcpus as a list or protect the
>>>> array with RCU. RCU is slower and a list is not even practical as
>>>> kvm->vcpus are being used for index-based accesses.
>>>>
>>>> We could have an IOCTL that is called in between KVM_CREATE_VM and first
>>>> KVM_CREATE_VCPU and sets the size of the vcpus array, but we'd be making
>>>> one useless allocation. Knowing the desired number of VCPUs from the
>>>> beginning is seems best for now.
>>>>
>>>> This patch also prepares generic code for architectures that will set
>>>> KVM_CONFIGURABLE_MAX_VCPUS to a non-zero value.
>>> Why is KVM_MAX_VCPU_ID or KVM_MAX_VCPUS not enough?
>>
>> Ok, for KVM_MAX_VCPUS I should have read the cover letter more carefully. :)
>
> KVM_MAX_VCPU_ID makes sense as the upper bound, I just didn't want to
> mingle the concepts, because the kvm->vcpus array is not indexed by
> VCPU_ID ...
>
> In hindsight, it would be best to change that and get rid of the search.
> I'll see how that looks in v2.
I realized why not:
* the major user of kvm->vcpus is kvm_for_each_vcpu and it works best
  with a packed array
* at least arm KVM_IRQ_LINE uses the order in which VCPUs were created
  to communicate with userspace
Putting this work into a drawer with the "do not share a data structure
between kvm_for_each_vcpu and kvm_get_vcpu" idea. :)