This series extends vhost to support up to 509 memory regions,
and adds some vhost translate_desc() performance improvements
so it won't regress when memslots are increased to 509.
It fixes a running VM crashing during memory hotplug due
to vhost refusing to accept more than 64 memory regions.
It's only a host kernel side fix to make it work with QEMU
versions that support memory hotplug, but I'll continue
to work on a QEMU side solution to reduce the number of memory
regions to make things even better.
Performance wise, for a guest with (in my case) 3 memory regions
and netperf's UDP_RR workload, translate_desc() execution
time as a share of the total workload is:
Memory      |1G RAM|cached|non cached
regions #   |   3  |  53  |    53
------------------------------------
upstream    | 0.3% |   -  |  3.5%
------------------------------------
this series | 0.2% | 0.5% |  0.7%
where "non cached" column reflects trashing wokload
with constant cache miss. More details on timing in
respective patches.
Igor Mammedov (5):
vhost: use binary search instead of linear in find_region()
vhost: extend memory regions allocation to vmalloc
vhost: support upto 509 memory regions
vhost: add per VQ memory region caching
vhost: translate_desc: optimization for desc.len < region size
drivers/vhost/vhost.c | 95 +++++++++++++++++++++++++++++++++++++--------------
drivers/vhost/vhost.h | 1 +
2 files changed, 71 insertions(+), 25 deletions(-)
--
1.8.3.1
For default region layouts performance stays the same
as with linear search, i.e. it takes around 210ns on average for
translate_desc(), which inlines find_region().
But it scales better with a larger number of regions:
235ns binary search vs 300ns linear search with 55 memory regions,
and the values will stay about the same when the allowed number
of slots is increased to 509, as has been done in KVM.
Signed-off-by: Igor Mammedov <[email protected]>
---
drivers/vhost/vhost.c | 38 ++++++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..a22f8c3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -25,6 +25,7 @@
#include <linux/kthread.h>
#include <linux/cgroup.h>
#include <linux/module.h>
+#include <linux/sort.h>
#include "vhost.h"
@@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
}
EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
+static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
+{
+ const struct vhost_memory_region *r1 = p1, *r2 = p2;
+ if (r1->guest_phys_addr < r2->guest_phys_addr)
+ return 1;
+ if (r1->guest_phys_addr > r2->guest_phys_addr)
+ return -1;
+ return 0;
+}
+
static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
{
struct vhost_memory mem, *newmem, *oldmem;
@@ -609,9 +620,11 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
memcpy(newmem, &mem, size);
if (copy_from_user(newmem->regions, m->regions,
mem.nregions * sizeof *m->regions)) {
- kfree(newmem);
+ kvfree(newmem);
return -EFAULT;
}
+ sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
+ vhost_memory_reg_sort_cmp, NULL);
if (!memory_access_ok(d, newmem, 0)) {
kfree(newmem);
@@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
__u64 addr, __u32 len)
{
- struct vhost_memory_region *reg;
- int i;
+ const struct vhost_memory_region *reg;
+ int start = 0, end = mem->nregions;
- /* linear search is not brilliant, but we really have on the order of 6
- * regions in practice */
- for (i = 0; i < mem->nregions; ++i) {
- reg = mem->regions + i;
- if (reg->guest_phys_addr <= addr &&
- reg->guest_phys_addr + reg->memory_size - 1 >= addr)
- return reg;
+ while (start < end) {
+ int slot = start + (end - start) / 2;
+ reg = mem->regions + slot;
+ if (addr >= reg->guest_phys_addr)
+ end = slot;
+ else
+ start = slot + 1;
}
+
+ reg = mem->regions + start;
+ if (addr >= reg->guest_phys_addr &&
+ reg->guest_phys_addr + reg->memory_size > addr)
+ return reg;
return NULL;
}
--
1.8.3.1
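For reference, here is a minimal user-space sketch of the same descending
sort plus binary search. It is illustrative only and not part of the patch:
the struct is trimmed to the two fields the search uses, the region layout
is made up, and an extra bounds check is added because this standalone demo,
unlike a typical guest layout, need not have a region starting at address 0.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* trimmed-down stand-in for struct vhost_memory_region */
struct region {
	uint64_t guest_phys_addr;
	uint64_t memory_size;
};

/* sort by guest_phys_addr in descending order, as the patch does */
static int reg_sort_cmp(const void *p1, const void *p2)
{
	const struct region *r1 = p1, *r2 = p2;

	if (r1->guest_phys_addr < r2->guest_phys_addr)
		return 1;
	if (r1->guest_phys_addr > r2->guest_phys_addr)
		return -1;
	return 0;
}

/* find the leftmost (highest-base) region whose start is <= addr,
 * then verify that addr actually falls inside it */
static const struct region *find_region(const struct region *regions,
					int nregions, uint64_t addr)
{
	const struct region *reg;
	int start = 0, end = nregions;

	while (start < end) {
		int slot = start + (end - start) / 2;

		reg = regions + slot;
		if (addr >= reg->guest_phys_addr)
			end = slot;
		else
			start = slot + 1;
	}
	if (start == nregions) /* addr is below the lowest region base */
		return NULL;
	reg = regions + start;
	if (addr >= reg->guest_phys_addr &&
	    reg->guest_phys_addr + reg->memory_size > addr)
		return reg;
	return NULL;
}

int main(void)
{
	/* made-up 3-region layout loosely resembling a PC guest */
	struct region regs[] = {
		{ 0x000000000ULL, 0x0000a0000ULL },
		{ 0x000100000ULL, 0x03ff00000ULL },
		{ 0x100000000ULL, 0x040000000ULL },
	};
	int n = sizeof(regs) / sizeof(regs[0]);
	const struct region *r;

	qsort(regs, n, sizeof(regs[0]), reg_sort_cmp);
	r = find_region(regs, n, 0x12345678ULL);
	printf("0x12345678 -> region base %#llx\n",
	       r ? (unsigned long long)r->guest_phys_addr : 0ULL);
	return 0;
}

Sorting in descending order means the loop converges on the first slot whose
base does not exceed addr, which is the only region that can contain it.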
With a large number of memory regions we could end up with
high order allocations, and kmalloc could fail if the
host is under memory pressure.
Considering that the memory regions array is used on the hot path,
try harder to allocate using kmalloc, and if that fails resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing a guest crash when new memory is hotplugged
into the guest.
I'll still look at a QEMU side solution to reduce the number of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for the kernel to behave smarter and not
crash older QEMUs which could use a large number of memory
regions.
Signed-off-by: Igor Mammedov <[email protected]>
---
---
drivers/vhost/vhost.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a22f8c3..99931a0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -471,7 +471,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
fput(dev->log_file);
dev->log_file = NULL;
/* No one will access memory at this point */
- kfree(dev->memory);
+ kvfree(dev->memory);
dev->memory = NULL;
WARN_ON(!list_empty(&dev->work_list));
if (dev->worker) {
@@ -601,6 +601,18 @@ static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
return 0;
}
+static void *vhost_kvzalloc(unsigned long size)
+{
+ void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+ if (!n) {
+ n = vzalloc(size);
+ if (!n)
+ return ERR_PTR(-ENOMEM);
+ }
+ return n;
+}
+
static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
{
struct vhost_memory mem, *newmem, *oldmem;
@@ -613,7 +625,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
return -EOPNOTSUPP;
if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
return -E2BIG;
- newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+ newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
if (!newmem)
return -ENOMEM;
@@ -627,7 +639,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
vhost_memory_reg_sort_cmp, NULL);
if (!memory_access_ok(d, newmem, 0)) {
- kfree(newmem);
+ kvfree(newmem);
return -EFAULT;
}
oldmem = d->memory;
@@ -639,7 +651,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
d->vqs[i]->memory = newmem;
mutex_unlock(&d->vqs[i]->mutex);
}
- kfree(oldmem);
+ kvfree(oldmem);
return 0;
}
--
1.8.3.1
Since commit
1d4e7e3 kvm: x86: increase user memory slots to 509
it became possible to use a bigger number of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However, QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net, since the host kernel's
vhost-net module refuses to accept more than 64
memory regions.
Increasing VHOST_MEMORY_MAX_NREGIONS from 64 to 509
to match KVM_USER_MEM_SLOTS fixes the issue for vhost-net.
Signed-off-by: Igor Mammedov <[email protected]>
---
drivers/vhost/vhost.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99931a0..6a18c92 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -30,7 +30,7 @@
#include "vhost.h"
enum {
- VHOST_MEMORY_MAX_NREGIONS = 64,
+ VHOST_MEMORY_MAX_NREGIONS = 509,
VHOST_MEMORY_F_LOG = 0x1,
};
--
1.8.3.1
This brings translate_desc() cost down to around 210ns
when the accessed descriptors are from the same memory region.
Signed-off-by: Igor Mammedov <[email protected]>
---
That was the case for the netperf/iperf workloads used during testing.
---
drivers/vhost/vhost.c | 16 +++++++++++++---
drivers/vhost/vhost.h | 1 +
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6a18c92..68c1c88 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -200,6 +200,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->call = NULL;
vq->log_ctx = NULL;
vq->memory = NULL;
+ vq->cached_reg = 0;
}
static int vhost_worker(void *data)
@@ -649,6 +650,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
for (i = 0; i < d->nvqs; ++i) {
mutex_lock(&d->vqs[i]->mutex);
d->vqs[i]->memory = newmem;
+ d->vqs[i]->cached_reg = 0;
mutex_unlock(&d->vqs[i]->mutex);
}
kvfree(oldmem);
@@ -936,11 +938,17 @@ done:
EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
- __u64 addr, __u32 len)
+ __u64 addr, __u32 len,
+ int *cached_reg)
{
const struct vhost_memory_region *reg;
int start = 0, end = mem->nregions;
+ reg = mem->regions + *cached_reg;
+ if (likely(addr >= reg->guest_phys_addr &&
+ reg->guest_phys_addr + reg->memory_size > addr))
+ return reg;
+
while (start < end) {
int slot = start + (end - start) / 2;
reg = mem->regions + slot;
@@ -952,8 +960,10 @@ static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
reg = mem->regions + start;
if (addr >= reg->guest_phys_addr &&
- reg->guest_phys_addr + reg->memory_size > addr)
+ reg->guest_phys_addr + reg->memory_size > addr) {
+ *cached_reg = start;
return reg;
+ }
return NULL;
}
@@ -1107,7 +1117,7 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
ret = -ENOBUFS;
break;
}
- reg = find_region(mem, addr, len);
+ reg = find_region(mem, addr, len, &vq->cached_reg);
if (unlikely(!reg)) {
ret = -EFAULT;
break;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8c1c792..68bd00f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -106,6 +106,7 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
+ int cached_reg;
};
struct vhost_dev {
--
1.8.3.1
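In terms of the user-space sketch posted after patch 1, the cached-slot fast
path added here amounts to roughly the following (illustrative only; struct
region and find_region() are as in that sketch, and the caller initializes
*cached to 0 just like vhost_vq_reset() does for cached_reg):

/* check the last region that matched before falling back to binary search */
static const struct region *find_region_cached(const struct region *regions,
					       int nregions, uint64_t addr,
					       int *cached)
{
	const struct region *reg = regions + *cached;

	/* fast path: descriptor falls into the same region as last time */
	if (addr >= reg->guest_phys_addr &&
	    reg->guest_phys_addr + reg->memory_size > addr)
		return reg;

	/* slow path: binary search, then remember the matching slot */
	reg = find_region(regions, nregions, addr);
	if (reg)
		*cached = (int)(reg - regions);
	return reg;
}

For netperf/iperf-style traffic, where consecutive descriptors sit in the
same region, the fast path is taken almost every time, which is where the
~210ns figure above comes from.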
When translating descriptors, they are typically smaller than the
memory region that holds them and are translated into 1 iov
entry, so it's not necessary to check the remaining length
twice and calculate the used length and next address
in such cases.
So replace the remaining-length and 'size' increment branches
with a single remaining-length check, and execute the
next-iov steps only when they are needed.
It saves a tiny 2% of translate_desc() execution time.
Signed-off-by: Igor Mammedov <[email protected]>
---
PS:
I'm not sure if iov_size > 0 is always true; if it's not,
then it's better to drop this patch.
---
drivers/vhost/vhost.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 68c1c88..84c457d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1111,12 +1111,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
int ret = 0;
mem = vq->memory;
- while ((u64)len > s) {
+ do {
u64 size;
- if (unlikely(ret >= iov_size)) {
- ret = -ENOBUFS;
- break;
- }
reg = find_region(mem, addr, len, &vq->cached_reg);
if (unlikely(!reg)) {
ret = -EFAULT;
@@ -1124,13 +1120,22 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
}
_iov = iov + ret;
size = reg->memory_size - addr + reg->guest_phys_addr;
- _iov->iov_len = min((u64)len - s, size);
_iov->iov_base = (void __user *)(unsigned long)
(reg->userspace_addr + addr - reg->guest_phys_addr);
+ ++ret;
+ if (likely((u64)len - s < size)) {
+ _iov->iov_len = (u64)len - s;
+ break;
+ }
+
+ if (unlikely(ret >= iov_size)) {
+ ret = -ENOBUFS;
+ break;
+ }
+ _iov->iov_len = size;
s += size;
addr += size;
- ++ret;
- }
+ } while (1);
return ret;
}
--
1.8.3.1
On Tue, Jun 16, 2015 at 06:33:35PM +0200, Igor Mammedov wrote:
> For default region layouts performance stays the same
> as linear search i.e. it takes around 210ns average for
> translate_desc() that inlines find_region().
>
> But it scales better with larger amount of regions,
> 235ns BS vs 300ns LS with 55 memory regions
> and it will be about the same values when allowed number
> of slots is increased to 509 like it has been done in kvm.
>
> Signed-off-by: Igor Mammedov <[email protected]>
> ---
> drivers/vhost/vhost.c | 38 ++++++++++++++++++++++++++++----------
> 1 file changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2ee2826..a22f8c3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -25,6 +25,7 @@
> #include <linux/kthread.h>
> #include <linux/cgroup.h>
> #include <linux/module.h>
> +#include <linux/sort.h>
>
> #include "vhost.h"
>
> @@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> }
> EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
>
> +static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
> +{
> + const struct vhost_memory_region *r1 = p1, *r2 = p2;
> + if (r1->guest_phys_addr < r2->guest_phys_addr)
> + return 1;
> + if (r1->guest_phys_addr > r2->guest_phys_addr)
> + return -1;
> + return 0;
> +}
> +
> static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> {
> struct vhost_memory mem, *newmem, *oldmem;
> @@ -609,9 +620,11 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> memcpy(newmem, &mem, size);
> if (copy_from_user(newmem->regions, m->regions,
> mem.nregions * sizeof *m->regions)) {
> - kfree(newmem);
> + kvfree(newmem);
> return -EFAULT;
> }
What's this doing here?
> + sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
> + vhost_memory_reg_sort_cmp, NULL);
>
> if (!memory_access_ok(d, newmem, 0)) {
> kfree(newmem);
> @@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
> static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
> __u64 addr, __u32 len)
> {
> - struct vhost_memory_region *reg;
> - int i;
> + const struct vhost_memory_region *reg;
> + int start = 0, end = mem->nregions;
>
> - /* linear search is not brilliant, but we really have on the order of 6
> - * regions in practice */
> - for (i = 0; i < mem->nregions; ++i) {
> - reg = mem->regions + i;
> - if (reg->guest_phys_addr <= addr &&
> - reg->guest_phys_addr + reg->memory_size - 1 >= addr)
> - return reg;
> + while (start < end) {
> + int slot = start + (end - start) / 2;
> + reg = mem->regions + slot;
> + if (addr >= reg->guest_phys_addr)
> + end = slot;
> + else
> + start = slot + 1;
> }
> +
> + reg = mem->regions + start;
> + if (addr >= reg->guest_phys_addr &&
> + reg->guest_phys_addr + reg->memory_size > addr)
> + return reg;
> return NULL;
> }
>
> --
> 1.8.3.1
On Tue, Jun 16, 2015 at 06:33:39PM +0200, Igor Mammedov wrote:
> when translating descriptors they are typically less than
> memory region that holds them and translated into 1 iov
> enty,
entry
> so it's not nessesary to check remaining length
> twice and calculate used length and next address
> in such cases.
>
> so relace
replace
> a remaining length and 'size' increment branches
> with a single remaining length check and execute
> next iov steps only when it needed.
it is needed
>
> It saves tiny 2% of translate_desc() execution time.
a tiny
>
> Signed-off-by: Igor Mammedov <[email protected]>
> ---
> PS:
> I'm not sure if iov_size > 0 is always true, if it's not
> then better to drop this patch.
> ---
> drivers/vhost/vhost.c | 21 +++++++++++++--------
> 1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 68c1c88..84c457d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1111,12 +1111,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> int ret = 0;
>
> mem = vq->memory;
> - while ((u64)len > s) {
> + do {
> u64 size;
> - if (unlikely(ret >= iov_size)) {
> - ret = -ENOBUFS;
> - break;
> - }
> reg = find_region(mem, addr, len, &vq->cached_reg);
> if (unlikely(!reg)) {
> ret = -EFAULT;
> @@ -1124,13 +1120,22 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> }
> _iov = iov + ret;
> size = reg->memory_size - addr + reg->guest_phys_addr;
> - _iov->iov_len = min((u64)len - s, size);
> _iov->iov_base = (void __user *)(unsigned long)
> (reg->userspace_addr + addr - reg->guest_phys_addr);
> + ++ret;
> + if (likely((u64)len - s < size)) {
> + _iov->iov_len = (u64)len - s;
> + break;
> + }
> +
> + if (unlikely(ret >= iov_size)) {
> + ret = -ENOBUFS;
> + break;
> + }
> + _iov->iov_len = size;
> s += size;
> addr += size;
> - ++ret;
> - }
> + } while (1);
>
> return ret;
> }
> --
> 1.8.3.1
On Tue, 16 Jun 2015 23:07:24 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Tue, Jun 16, 2015 at 06:33:35PM +0200, Igor Mammedov wrote:
> > For default region layouts performance stays the same
> > as linear search i.e. it takes around 210ns average for
> > translate_desc() that inlines find_region().
> >
> > But it scales better with larger amount of regions,
> > 235ns BS vs 300ns LS with 55 memory regions
> > and it will be about the same values when allowed number
> > of slots is increased to 509 like it has been done in kvm.
> >
> > Signed-off-by: Igor Mammedov <[email protected]>
> > ---
> > drivers/vhost/vhost.c | 38 ++++++++++++++++++++++++++++----------
> > 1 file changed, 28 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 2ee2826..a22f8c3 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -25,6 +25,7 @@
> > #include <linux/kthread.h>
> > #include <linux/cgroup.h>
> > #include <linux/module.h>
> > +#include <linux/sort.h>
> >
> > #include "vhost.h"
> >
> > @@ -590,6 +591,16 @@ int vhost_vq_access_ok(struct vhost_virtqueue
> > *vq) }
> > EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
> >
> > +static int vhost_memory_reg_sort_cmp(const void *p1, const void
> > *p2) +{
> > + const struct vhost_memory_region *r1 = p1, *r2 = p2;
> > + if (r1->guest_phys_addr < r2->guest_phys_addr)
> > + return 1;
> > + if (r1->guest_phys_addr > r2->guest_phys_addr)
> > + return -1;
> > + return 0;
> > +}
> > +
> > static long vhost_set_memory(struct vhost_dev *d, struct
> > vhost_memory __user *m) {
> > struct vhost_memory mem, *newmem, *oldmem;
> > @@ -609,9 +620,11 @@ static long vhost_set_memory(struct vhost_dev
> > *d, struct vhost_memory __user *m) memcpy(newmem, &mem, size);
> > if (copy_from_user(newmem->regions, m->regions,
> > mem.nregions * sizeof *m->regions)) {
> > - kfree(newmem);
> > + kvfree(newmem);
> > return -EFAULT;
> > }
>
> What's this doing here?
Oops, it sneaked in from 2/5 when I was splitting patches.
I'll fix it up.
>
> > + sort(newmem->regions, newmem->nregions,
> > sizeof(*newmem->regions),
> > + vhost_memory_reg_sort_cmp, NULL);
> >
> > if (!memory_access_ok(d, newmem, 0)) {
> > kfree(newmem);
> > @@ -913,17 +926,22 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
> > static const struct vhost_memory_region *find_region(struct
> > vhost_memory *mem, __u64 addr, __u32 len)
> > {
> > - struct vhost_memory_region *reg;
> > - int i;
> > + const struct vhost_memory_region *reg;
> > + int start = 0, end = mem->nregions;
> >
> > - /* linear search is not brilliant, but we really have on
> > the order of 6
> > - * regions in practice */
> > - for (i = 0; i < mem->nregions; ++i) {
> > - reg = mem->regions + i;
> > - if (reg->guest_phys_addr <= addr &&
> > - reg->guest_phys_addr + reg->memory_size - 1 >=
> > addr)
> > - return reg;
> > + while (start < end) {
> > + int slot = start + (end - start) / 2;
> > + reg = mem->regions + slot;
> > + if (addr >= reg->guest_phys_addr)
> > + end = slot;
> > + else
> > + start = slot + 1;
> > }
> > +
> > + reg = mem->regions + start;
> > + if (addr >= reg->guest_phys_addr &&
> > + reg->guest_phys_addr + reg->memory_size > addr)
> > + return reg;
> > return NULL;
> > }
> >
> > --
> > 1.8.3.1
On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> since commit
> 1d4e7e3 kvm: x86: increase user memory slots to 509
>
> it became possible to use a bigger amount of memory
> slots, which is used by memory hotplug for
> registering hotplugged memory.
> However QEMU crashes if it's used with more than ~60
> pc-dimm devices and vhost-net since host kernel
> in module vhost-net refuses to accept more than 65
> memory regions.
>
> Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
It was 64, not 65.
> to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
>
> Signed-off-by: Igor Mammedov <[email protected]>
Still thinking about this: can you reorder this to
be the last patch in the series please?
Also - 509?
I think if we are changing this, it'd be nice to
create a way for userspace to discover the support
and the # of regions supported.
> ---
> drivers/vhost/vhost.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 99931a0..6a18c92 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -30,7 +30,7 @@
> #include "vhost.h"
>
> enum {
> - VHOST_MEMORY_MAX_NREGIONS = 64,
> + VHOST_MEMORY_MAX_NREGIONS = 509,
> VHOST_MEMORY_F_LOG = 0x1,
> };
>
> --
> 1.8.3.1
On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
> Series extends vhost to support upto 509 memory regions,
> and adds some vhost:translate_desc() performance improvemnts
> so it won't regress when memslots are increased to 509.
>
> It fixes running VM crashing during memory hotplug due
> to vhost refusing accepting more than 64 memory regions.
>
> It's only host kernel side fix to make it work with QEMU
> versions that support memory hotplug. But I'll continue
> to work on QEMU side solution to reduce amount of memory
> regions to make things even better.
I'm concerned userspace work will be harder; in particular,
performance gains will be harder to measure.
How about a flag to disable caching?
> Performance wise for guest with (in my case 3 memory regions)
> and netperf's UDP_RR workload translate_desc() execution
> time from total workload takes:
>
> Memory |1G RAM|cached|non cached
> regions # | 3 | 53 | 53
> ------------------------------------
> upstream | 0.3% | - | 3.5%
> ------------------------------------
> this series | 0.2% | 0.5% | 0.7%
>
> where "non cached" column reflects trashing wokload
> with constant cache miss. More details on timing in
> respective patches.
>
> Igor Mammedov (5):
> vhost: use binary search instead of linear in find_region()
> vhost: extend memory regions allocation to vmalloc
> vhost: support upto 509 memory regions
> vhost: add per VQ memory region caching
> vhost: translate_desc: optimization for desc.len < region size
>
> drivers/vhost/vhost.c | 95 +++++++++++++++++++++++++++++++++++++--------------
> drivers/vhost/vhost.h | 1 +
> 2 files changed, 71 insertions(+), 25 deletions(-)
>
> --
> 1.8.3.1
On Tue, 16 Jun 2015 23:14:20 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > since commit
> > 1d4e7e3 kvm: x86: increase user memory slots to 509
> >
> > it became possible to use a bigger amount of memory
> > slots, which is used by memory hotplug for
> > registering hotplugged memory.
> > However QEMU crashes if it's used with more than ~60
> > pc-dimm devices and vhost-net since host kernel
> > in module vhost-net refuses to accept more than 65
> > memory regions.
> >
> > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
>
> It was 64, not 65.
>
> > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> >
> > Signed-off-by: Igor Mammedov <[email protected]>
>
> Still thinking about this: can you reorder this to
> be the last patch in the series please?
sure
>
> Also - 509?
That's the number of userspace memory slots in KVM terms; I made it match
KVM's allotment of memory slots for the userspace side.
> I think if we are changing this, it'd be nice to
> create a way for userspace to discover the support
> and the # of regions supported.
That was my first idea before extending KVM's memslots:
teach the kernel to tell QEMU this number so that QEMU
at least would be able to check whether a new memory slot could
be added, but I was redirected to a simpler solution
of just extending the limit vs overdoing things.
Currently QEMU supports up to ~250 memslots, so 509
is about twice as high as we need, so it should work for the near
future, but eventually we might still teach the kernel and QEMU
to make things more robust.
>
>
> > ---
> > drivers/vhost/vhost.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 99931a0..6a18c92 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -30,7 +30,7 @@
> > #include "vhost.h"
> >
> > enum {
> > - VHOST_MEMORY_MAX_NREGIONS = 64,
> > + VHOST_MEMORY_MAX_NREGIONS = 509,
> > VHOST_MEMORY_F_LOG = 0x1,
> > };
> >
> > --
> > 1.8.3.1
On Tue, 16 Jun 2015 23:16:07 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
> > Series extends vhost to support upto 509 memory regions,
> > and adds some vhost:translate_desc() performance improvemnts
> > so it won't regress when memslots are increased to 509.
> >
> > It fixes running VM crashing during memory hotplug due
> > to vhost refusing accepting more than 64 memory regions.
> >
> > It's only host kernel side fix to make it work with QEMU
> > versions that support memory hotplug. But I'll continue
> > to work on QEMU side solution to reduce amount of memory
> > regions to make things even better.
>
> I'm concerned userspace work will be harder, in particular,
> performance gains will be harder to measure.
it appears so, so far.
> How about a flag to disable caching?
I've tried to measure the cost of a cache miss but without much luck;
the difference between the version with the cache and with caching removed
was within the margin of error (±10ns) (i.e. not measurable on my
5min/10*10^6 test workload).
Also I'm concerned that adding an extra fetch+branch for flag
checking will make things worse for the likely path of a cache hit,
so I'd avoid it if possible.
Or do you mean a simple global per-module flag to disable it,
wrapped in a static key so that it will be a cheap jump to skip
the cache?
> > Performance wise for guest with (in my case 3 memory regions)
> > and netperf's UDP_RR workload translate_desc() execution
> > time from total workload takes:
> >
> > Memory |1G RAM|cached|non cached
> > regions # | 3 | 53 | 53
> > ------------------------------------
> > upstream | 0.3% | - | 3.5%
> > ------------------------------------
> > this series | 0.2% | 0.5% | 0.7%
> >
> > where "non cached" column reflects trashing wokload
> > with constant cache miss. More details on timing in
> > respective patches.
> >
> > Igor Mammedov (5):
> > vhost: use binary search instead of linear in find_region()
> > vhost: extend memory regions allocation to vmalloc
> > vhost: support upto 509 memory regions
> > vhost: add per VQ memory region caching
> > vhost: translate_desc: optimization for desc.len < region size
> >
> > drivers/vhost/vhost.c | 95
> > +++++++++++++++++++++++++++++++++++++--------------
> > drivers/vhost/vhost.h | 1 + 2 files changed, 71 insertions(+), 25
> > deletions(-)
> >
> > --
> > 1.8.3.1
On Wed, Jun 17, 2015 at 12:19:15AM +0200, Igor Mammedov wrote:
> On Tue, 16 Jun 2015 23:16:07 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
> > > Series extends vhost to support upto 509 memory regions,
> > > and adds some vhost:translate_desc() performance improvemnts
> > > so it won't regress when memslots are increased to 509.
> > >
> > > It fixes running VM crashing during memory hotplug due
> > > to vhost refusing accepting more than 64 memory regions.
> > >
> > > It's only host kernel side fix to make it work with QEMU
> > > versions that support memory hotplug. But I'll continue
> > > to work on QEMU side solution to reduce amount of memory
> > > regions to make things even better.
> >
> > I'm concerned userspace work will be harder, in particular,
> > performance gains will be harder to measure.
> it appears so, so far.
>
> > How about a flag to disable caching?
> I've tried to measure cost of cache miss but without much luck,
> difference between version with cache and with caching removed
> was within margin of error (±10ns) (i.e. not mensurable on my
> 5min/10*10^6 test workload).
Confused. I thought it was very much measurable.
So why add a cache if you can't measure its effect?
> Also I'm concerned about adding extra fetch+branch for flag
> checking will make things worse for likely path of cache hit,
> so I'd avoid it if possible.
>
> Or do you mean a simple global per module flag to disable it and
> wrap thing in static key so that it will be cheap jump to skip
> cache?
Something like this, yes.
> > > Performance wise for guest with (in my case 3 memory regions)
> > > and netperf's UDP_RR workload translate_desc() execution
> > > time from total workload takes:
> > >
> > > Memory |1G RAM|cached|non cached
> > > regions # | 3 | 53 | 53
> > > ------------------------------------
> > > upstream | 0.3% | - | 3.5%
> > > ------------------------------------
> > > this series | 0.2% | 0.5% | 0.7%
> > >
> > > where "non cached" column reflects trashing wokload
> > > with constant cache miss. More details on timing in
> > > respective patches.
> > >
> > > Igor Mammedov (5):
> > > vhost: use binary search instead of linear in find_region()
> > > vhost: extend memory regions allocation to vmalloc
> > > vhost: support upto 509 memory regions
> > > vhost: add per VQ memory region caching
> > > vhost: translate_desc: optimization for desc.len < region size
> > >
> > > drivers/vhost/vhost.c | 95
> > > +++++++++++++++++++++++++++++++++++++--------------
> > > drivers/vhost/vhost.h | 1 + 2 files changed, 71 insertions(+), 25
> > > deletions(-)
> > >
> > > --
> > > 1.8.3.1
On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> On Tue, 16 Jun 2015 23:14:20 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > since commit
> > > 1d4e7e3 kvm: x86: increase user memory slots to 509
> > >
> > > it became possible to use a bigger amount of memory
> > > slots, which is used by memory hotplug for
> > > registering hotplugged memory.
> > > However QEMU crashes if it's used with more than ~60
> > > pc-dimm devices and vhost-net since host kernel
> > > in module vhost-net refuses to accept more than 65
> > > memory regions.
> > >
> > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> >
> > It was 64, not 65.
> >
> > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > >
> > > Signed-off-by: Igor Mammedov <[email protected]>
> >
> > Still thinking about this: can you reorder this to
> > be the last patch in the series please?
> sure
>
> >
> > Also - 509?
> userspace memory slots in terms of KVM, I made it match
> KVM's allotment of memory slots for userspace side.
Maybe KVM has its reasons for this #. I don't see
why we need to match this exactly.
> > I think if we are changing this, it'd be nice to
> > create a way for userspace to discover the support
> > and the # of regions supported.
> That was my first idea before extending KVM's memslots
> to teach kernel to tell qemu this number so that QEMU
> at least would be able to check if new memory slot could
> be added but I was redirected to a more simple solution
> of just extending vs everdoing things.
> Currently QEMU supports upto ~250 memslots so 509
> is about twice high we need it so it should work for near
> future
Yes but old kernels are still around. Would be nice if you
can detect them.
> but eventually we might still teach kernel and QEMU
> to make things more robust.
A new ioctl would be easy to add, I think it's a good
idea generally.
> >
> >
> > > ---
> > > drivers/vhost/vhost.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 99931a0..6a18c92 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -30,7 +30,7 @@
> > > #include "vhost.h"
> > >
> > > enum {
> > > - VHOST_MEMORY_MAX_NREGIONS = 64,
> > > + VHOST_MEMORY_MAX_NREGIONS = 509,
> > > VHOST_MEMORY_F_LOG = 0x1,
> > > };
> > >
> > > --
> > > 1.8.3.1
On Wed, 17 Jun 2015 08:34:26 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > On Tue, 16 Jun 2015 23:14:20 +0200
> > "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > since commit
> > > > 1d4e7e3 kvm: x86: increase user memory slots to 509
> > > >
> > > > it became possible to use a bigger amount of memory
> > > > slots, which is used by memory hotplug for
> > > > registering hotplugged memory.
> > > > However QEMU crashes if it's used with more than ~60
> > > > pc-dimm devices and vhost-net since host kernel
> > > > in module vhost-net refuses to accept more than 65
> > > > memory regions.
> > > >
> > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > >
> > > It was 64, not 65.
> > >
> > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > >
> > > > Signed-off-by: Igor Mammedov <[email protected]>
> > >
> > > Still thinking about this: can you reorder this to
> > > be the last patch in the series please?
> > sure
> >
> > >
> > > Also - 509?
> > userspace memory slots in terms of KVM, I made it match
> > KVM's allotment of memory slots for userspace side.
>
> Maybe KVM has its reasons for this #. I don't see
> why we need to match this exactly.
np, I can cap it at a safe 300 slots, but it's unlikely
to cut off even 1 extra hop since the actual slot count is capped by QEMU
at 256+[initial fragmented memory]
>
> > > I think if we are changing this, it'd be nice to
> > > create a way for userspace to discover the support
> > > and the # of regions supported.
> > That was my first idea before extending KVM's memslots
> > to teach kernel to tell qemu this number so that QEMU
> > at least would be able to check if new memory slot could
> > be added but I was redirected to a more simple solution
> > of just extending vs everdoing things.
> > Currently QEMU supports upto ~250 memslots so 509
> > is about twice high we need it so it should work for near
> > future
>
> Yes but old kernels are still around. Would be nice if you
> can detect them.
>
> > but eventually we might still teach kernel and QEMU
> > to make things more robust.
>
> A new ioctl would be easy to add, I think it's a good
> idea generally.
I can try to do something like this on top of this series.
>
> > >
> > >
> > > > ---
> > > > drivers/vhost/vhost.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > index 99931a0..6a18c92 100644
> > > > --- a/drivers/vhost/vhost.c
> > > > +++ b/drivers/vhost/vhost.c
> > > > @@ -30,7 +30,7 @@
> > > > #include "vhost.h"
> > > >
> > > > enum {
> > > > - VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > + VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > VHOST_MEMORY_F_LOG = 0x1,
> > > > };
> > > >
> > > > --
> > > > 1.8.3.1
On Wed, 17 Jun 2015 08:31:23 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 12:19:15AM +0200, Igor Mammedov wrote:
> > On Tue, 16 Jun 2015 23:16:07 +0200
> > "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > > On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
> > > > Series extends vhost to support upto 509 memory regions,
> > > > and adds some vhost:translate_desc() performance improvemnts
> > > > so it won't regress when memslots are increased to 509.
> > > >
> > > > It fixes running VM crashing during memory hotplug due
> > > > to vhost refusing accepting more than 64 memory regions.
> > > >
> > > > It's only host kernel side fix to make it work with QEMU
> > > > versions that support memory hotplug. But I'll continue
> > > > to work on QEMU side solution to reduce amount of memory
> > > > regions to make things even better.
> > >
> > > I'm concerned userspace work will be harder, in particular,
> > > performance gains will be harder to measure.
> > it appears so, so far.
> >
> > > How about a flag to disable caching?
> > I've tried to measure cost of cache miss but without much luck,
> > difference between version with cache and with caching removed
> > was within margin of error (±10ns) (i.e. not mensurable on my
> > 5min/10*10^6 test workload).
>
> Confused. I thought it was very much measureable.
> So why add a cache if you can't measure its effect?
I haven't been able to measure the immediate delta between function
start/end with precision better than 10ns; perhaps the method used
(SystemTap) is to blame.
But it's still possible to measure it indirectly, like the 2% in 5/5.
>
> > Also I'm concerned about adding extra fetch+branch for flag
> > checking will make things worse for likely path of cache hit,
> > so I'd avoid it if possible.
> >
> > Or do you mean a simple global per module flag to disable it and
> > wrap thing in static key so that it will be cheap jump to skip
> > cache?
>
> Something like this, yes.
ok, will do.
>
> > > > Performance wise for guest with (in my case 3 memory regions)
> > > > and netperf's UDP_RR workload translate_desc() execution
> > > > time from total workload takes:
> > > >
> > > > Memory |1G RAM|cached|non cached
> > > > regions # | 3 | 53 | 53
> > > > ------------------------------------
> > > > upstream | 0.3% | - | 3.5%
> > > > ------------------------------------
> > > > this series | 0.2% | 0.5% | 0.7%
> > > >
> > > > where "non cached" column reflects trashing wokload
> > > > with constant cache miss. More details on timing in
> > > > respective patches.
> > > >
> > > > Igor Mammedov (5):
> > > > vhost: use binary search instead of linear in find_region()
> > > > vhost: extend memory regions allocation to vmalloc
> > > > vhost: support upto 509 memory regions
> > > > vhost: add per VQ memory region caching
> > > > vhost: translate_desc: optimization for desc.len < region size
> > > >
> > > > drivers/vhost/vhost.c | 95
> > > > +++++++++++++++++++++++++++++++++++++--------------
> > > > drivers/vhost/vhost.h | 1 + 2 files changed, 71 insertions(+),
> > > > 25 deletions(-)
> > > >
> > > > --
> > > > 1.8.3.1
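A rough sketch of the per-module flag plus static key idea being discussed
could look like the fragment below. The names, the module parameter and its
placement are made up for illustration and are not part of the posted series:

#include <linux/jump_label.h>
#include <linux/module.h>

static bool translate_cache = true;
module_param(translate_cache, bool, 0444);
MODULE_PARM_DESC(translate_cache,
		 "Enable per-VQ memory region caching in translate_desc()");

static struct static_key translate_cache_key = STATIC_KEY_INIT_TRUE;

	/* e.g. from vhost_init(): turn the key off when the cache is
	 * disabled, so the check below becomes a patched-out jump */
	if (!translate_cache)
		static_key_slow_dec(&translate_cache_key);

	/* in find_region(): gate the cached-slot fast path on the key */
	if (static_key_true(&translate_cache_key)) {
		reg = mem->regions + *cached_reg;
		if (likely(addr >= reg->guest_phys_addr &&
			   reg->guest_phys_addr + reg->memory_size > addr))
			return reg;
	}

With the key enabled (the default), the extra check on the cache-hit path is
just a no-op jump, which should keep the fetch+branch concern above out of
the hot path.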
On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 08:34:26 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > "Michael S. Tsirkin" <[email protected]> wrote:
> > >
> > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > since commit
> > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > >
> > > > > it became possible to use a bigger amount of memory
> > > > > slots, which is used by memory hotplug for
> > > > > registering hotplugged memory.
> > > > > However QEMU crashes if it's used with more than ~60
> > > > > pc-dimm devices and vhost-net since host kernel
> > > > > in module vhost-net refuses to accept more than 65
> > > > > memory regions.
> > > > >
> > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > >
> > > > It was 64, not 65.
> > > >
> > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > >
> > > > > Signed-off-by: Igor Mammedov <[email protected]>
> > > >
> > > > Still thinking about this: can you reorder this to
> > > > be the last patch in the series please?
> > > sure
> > >
> > > >
> > > > Also - 509?
> > > userspace memory slots in terms of KVM, I made it match
> > > KVM's allotment of memory slots for userspace side.
> >
> > Maybe KVM has its reasons for this #. I don't see
> > why we need to match this exactly.
> np, I can cap it at safe 300 slots but it's unlikely that it
> would take cut off 1 extra hop since it's capped by QEMU
> at 256+[initial fragmented memory]
But what's the point? We allocate 32 bytes per slot.
300*32 = 9600 which is more than 8K, so we are doing
an order-3 allocation anyway.
If we could cap it at 8K (256 slots) that would make sense
since we could avoid wasting vmalloc space.
I'm still not very happy with the whole approach,
giving userspace the ability to allocate 4 whole pages
of kernel memory like this.
> >
> > > > I think if we are changing this, it'd be nice to
> > > > create a way for userspace to discover the support
> > > > and the # of regions supported.
> > > That was my first idea before extending KVM's memslots
> > > to teach kernel to tell qemu this number so that QEMU
> > > at least would be able to check if new memory slot could
> > > be added but I was redirected to a more simple solution
> > > of just extending vs everdoing things.
> > > Currently QEMU supports upto ~250 memslots so 509
> > > is about twice high we need it so it should work for near
> > > future
> >
> > Yes but old kernels are still around. Would be nice if you
> > can detect them.
> >
> > > but eventually we might still teach kernel and QEMU
> > > to make things more robust.
> >
> > A new ioctl would be easy to add, I think it's a good
> > idea generally.
> I can try to do something like this on top of this series.
>
> >
> > > >
> > > >
> > > > > ---
> > > > > drivers/vhost/vhost.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > index 99931a0..6a18c92 100644
> > > > > --- a/drivers/vhost/vhost.c
> > > > > +++ b/drivers/vhost/vhost.c
> > > > > @@ -30,7 +30,7 @@
> > > > > #include "vhost.h"
> > > > >
> > > > > enum {
> > > > > - VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > + VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > VHOST_MEMORY_F_LOG = 0x1,
> > > > > };
> > > > >
> > > > > --
> > > > > 1.8.3.1
On Wed, Jun 17, 2015 at 09:33:57AM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 08:31:23 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Wed, Jun 17, 2015 at 12:19:15AM +0200, Igor Mammedov wrote:
> > > On Tue, 16 Jun 2015 23:16:07 +0200
> > > "Michael S. Tsirkin" <[email protected]> wrote:
> > >
> > > > On Tue, Jun 16, 2015 at 06:33:34PM +0200, Igor Mammedov wrote:
> > > > > Series extends vhost to support upto 509 memory regions,
> > > > > and adds some vhost:translate_desc() performance improvemnts
> > > > > so it won't regress when memslots are increased to 509.
> > > > >
> > > > > It fixes running VM crashing during memory hotplug due
> > > > > to vhost refusing accepting more than 64 memory regions.
> > > > >
> > > > > It's only host kernel side fix to make it work with QEMU
> > > > > versions that support memory hotplug. But I'll continue
> > > > > to work on QEMU side solution to reduce amount of memory
> > > > > regions to make things even better.
> > > >
> > > > I'm concerned userspace work will be harder, in particular,
> > > > performance gains will be harder to measure.
> > > it appears so, so far.
> > >
> > > > How about a flag to disable caching?
> > > I've tried to measure cost of cache miss but without much luck,
> > > difference between version with cache and with caching removed
> > > was within margin of error (±10ns) (i.e. not mensurable on my
> > > 5min/10*10^6 test workload).
> >
> > Confused. I thought it was very much measureable.
> > So why add a cache if you can't measure its effect?
> I hasn't been able to measure immediate delta between function
> start/end with precision more than 10ns, perhaps used method
> (system tap) is to blame.
> But it's still possible to measure indirectly like 2% from 5/5.
Ah, makes sense.
> >
> > > Also I'm concerned about adding extra fetch+branch for flag
> > > checking will make things worse for likely path of cache hit,
> > > so I'd avoid it if possible.
> > >
> > > Or do you mean a simple global per module flag to disable it and
> > > wrap thing in static key so that it will be cheap jump to skip
> > > cache?
> >
> > Something like this, yes.
> ok, will do.
>
> >
> > > > > Performance wise for guest with (in my case 3 memory regions)
> > > > > and netperf's UDP_RR workload translate_desc() execution
> > > > > time from total workload takes:
> > > > >
> > > > > Memory |1G RAM|cached|non cached
> > > > > regions # | 3 | 53 | 53
> > > > > ------------------------------------
> > > > > upstream | 0.3% | - | 3.5%
> > > > > ------------------------------------
> > > > > this series | 0.2% | 0.5% | 0.7%
> > > > >
> > > > > where "non cached" column reflects trashing wokload
> > > > > with constant cache miss. More details on timing in
> > > > > respective patches.
> > > > >
> > > > > Igor Mammedov (5):
> > > > > vhost: use binary search instead of linear in find_region()
> > > > > vhost: extend memory regions allocation to vmalloc
> > > > > vhost: support upto 509 memory regions
> > > > > vhost: add per VQ memory region caching
> > > > > vhost: translate_desc: optimization for desc.len < region size
> > > > >
> > > > > drivers/vhost/vhost.c | 95
> > > > > +++++++++++++++++++++++++++++++++++++--------------
> > > > > drivers/vhost/vhost.h | 1 + 2 files changed, 71 insertions(+),
> > > > > 25 deletions(-)
> > > > >
> > > > > --
> > > > > 1.8.3.1
On 17/06/2015 08:34, Michael S. Tsirkin wrote:
>>> > >
>>> > > Also - 509?
>> > userspace memory slots in terms of KVM, I made it match
>> > KVM's allotment of memory slots for userspace side.
> Maybe KVM has its reasons for this #.
Nice power of two (512) minus the number of reserved slots required by Intel's
virtualization extensions (3: APIC access page, EPT identity page table,
VMX task state segment).
Paolo
> I don't see
> why we need to match this exactly.
>
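For reference, the x86 definitions behind that arithmetic look roughly like
this (quoted from memory, so treat the exact form as approximate):

#define KVM_USER_MEM_SLOTS 509
/* memory slots that are not exposed to userspace */
#define KVM_PRIVATE_MEM_SLOTS 3
#define KVM_MEM_SLOTS_NUM (KVM_USER_MEM_SLOTS + KVM_PRIVATE_MEM_SLOTS)

i.e. 512 slots in total, of which 3 are reserved for the APIC access page,
the EPT identity page table and the VMX TSS, leaving 509 for userspace.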
On Wed, 17 Jun 2015 09:39:06 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 08:34:26 +0200
> > "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > >
> > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > since commit
> > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > >
> > > > > > it became possible to use a bigger amount of memory
> > > > > > slots, which is used by memory hotplug for
> > > > > > registering hotplugged memory.
> > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > in module vhost-net refuses to accept more than 65
> > > > > > memory regions.
> > > > > >
> > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > >
> > > > > It was 64, not 65.
> > > > >
> > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > >
> > > > > > Signed-off-by: Igor Mammedov <[email protected]>
> > > > >
> > > > > Still thinking about this: can you reorder this to
> > > > > be the last patch in the series please?
> > > > sure
> > > >
> > > > >
> > > > > Also - 509?
> > > > userspace memory slots in terms of KVM, I made it match
> > > > KVM's allotment of memory slots for userspace side.
> > >
> > > Maybe KVM has its reasons for this #. I don't see
> > > why we need to match this exactly.
> > np, I can cap it at safe 300 slots but it's unlikely that it
> > would take cut off 1 extra hop since it's capped by QEMU
> > at 256+[initial fragmented memory]
>
> But what's the point? We allocate 32 bytes per slot.
> 300*32 = 9600 which is more than 8K, so we are doing
> an order-3 allocation anyway.
> If we could cap it at 8K (256 slots) that would make sense
> since we could avoid wasting vmalloc space.
256 is the number of hotpluggable slots and there is no way
to predict how fragmented initial memory would be
(i.e. how many slots it would take); if we guess wrong
we are back to square one with crashing userspace.
So I'd stay consistent with KVM's limit of 509 since
it's only a limit, i.e. not the actual number of allocated slots.
> I'm still not very happy with the whole approach,
> giving userspace ability allocate 4 whole pages
> of kernel memory like this.
I'm working in parallel so that userspace won't take so
many slots, but that won't prevent its current versions from
crashing due to the kernel limitation.
> > > > > I think if we are changing this, it'd be nice to
> > > > > create a way for userspace to discover the support
> > > > > and the # of regions supported.
> > > > That was my first idea before extending KVM's memslots
> > > > to teach kernel to tell qemu this number so that QEMU
> > > > at least would be able to check if new memory slot could
> > > > be added but I was redirected to a more simple solution
> > > > of just extending vs everdoing things.
> > > > Currently QEMU supports upto ~250 memslots so 509
> > > > is about twice high we need it so it should work for near
> > > > future
> > >
> > > Yes but old kernels are still around. Would be nice if you
> > > can detect them.
> > >
> > > > but eventually we might still teach kernel and QEMU
> > > > to make things more robust.
> > >
> > > A new ioctl would be easy to add, I think it's a good
> > > idea generally.
> > I can try to do something like this on top of this series.
> >
> > >
> > > > >
> > > > >
> > > > > > ---
> > > > > > drivers/vhost/vhost.c | 2 +-
> > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > index 99931a0..6a18c92 100644
> > > > > > --- a/drivers/vhost/vhost.c
> > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > @@ -30,7 +30,7 @@
> > > > > > #include "vhost.h"
> > > > > >
> > > > > > enum {
> > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > VHOST_MEMORY_F_LOG = 0x1,
> > > > > > };
> > > > > >
> > > > > > --
> > > > > > 1.8.3.1
On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 09:39:06 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > "Michael S. Tsirkin" <[email protected]> wrote:
> > >
> > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > > >
> > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > since commit
> > > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > >
> > > > > > > it became possible to use a bigger amount of memory
> > > > > > > slots, which is used by memory hotplug for
> > > > > > > registering hotplugged memory.
> > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > memory regions.
> > > > > > >
> > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > >
> > > > > > It was 64, not 65.
> > > > > >
> > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > >
> > > > > > > Signed-off-by: Igor Mammedov <[email protected]>
> > > > > >
> > > > > > Still thinking about this: can you reorder this to
> > > > > > be the last patch in the series please?
> > > > > sure
> > > > >
> > > > > >
> > > > > > Also - 509?
> > > > > userspace memory slots in terms of KVM, I made it match
> > > > > KVM's allotment of memory slots for userspace side.
> > > >
> > > > Maybe KVM has its reasons for this #. I don't see
> > > > why we need to match this exactly.
> > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > would take cut off 1 extra hop since it's capped by QEMU
> > > at 256+[initial fragmented memory]
> >
> > But what's the point? We allocate 32 bytes per slot.
> > 300*32 = 9600 which is more than 8K, so we are doing
> > an order-3 allocation anyway.
> > If we could cap it at 8K (256 slots) that would make sense
> > since we could avoid wasting vmalloc space.
> 256 is amount of hotpluggable slots and there is no way
> to predict how initial memory would be fragmented
> (i.e. amount of slots it would take), if we guess wrong
> we are back to square one with crashing userspace.
> So I'd stay consistent with KVM's limit 509 since
> it's only limit, i.e. not actual amount of allocated slots.
>
> > I'm still not very happy with the whole approach,
> > giving userspace ability allocate 4 whole pages
> > of kernel memory like this.
> I'm working in parallel so that userspace won't take so
> many slots but it won't prevent its current versions
> crashing due to kernel limitation.
Right, but at least it's not a regression. If we promise userspace
that we support a ton of regions, we can't take it back later, and I'm concerned
about the memory usage.
I think it's already safe to merge the binary lookup patches, and maybe
cache and vmalloc, so that the remaining patch will be small.
>
> > > > > > I think if we are changing this, it'd be nice to
> > > > > > create a way for userspace to discover the support
> > > > > > and the # of regions supported.
> > > > > That was my first idea before extending KVM's memslots
> > > > > to teach kernel to tell qemu this number so that QEMU
> > > > > at least would be able to check if new memory slot could
> > > > > be added but I was redirected to a more simple solution
> > > > > of just extending vs everdoing things.
> > > > > Currently QEMU supports upto ~250 memslots so 509
> > > > > is about twice high we need it so it should work for near
> > > > > future
> > > >
> > > > Yes but old kernels are still around. Would be nice if you
> > > > can detect them.
> > > >
> > > > > but eventually we might still teach kernel and QEMU
> > > > > to make things more robust.
> > > >
> > > > A new ioctl would be easy to add, I think it's a good
> > > > idea generally.
> > > I can try to do something like this on top of this series.
> > >
> > > >
> > > > > >
> > > > > >
> > > > > > > ---
> > > > > > > drivers/vhost/vhost.c | 2 +-
> > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > index 99931a0..6a18c92 100644
> > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > @@ -30,7 +30,7 @@
> > > > > > > #include "vhost.h"
> > > > > > >
> > > > > > > enum {
> > > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > > VHOST_MEMORY_F_LOG = 0x1,
> > > > > > > };
> > > > > > >
> > > > > > > --
> > > > > > > 1.8.3.1
On Wed, 17 Jun 2015 12:11:09 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 09:39:06 +0200
> > "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > >
> > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > > > >
> > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > > since commit
> > > > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > > >
> > > > > > > > it became possible to use a bigger amount of memory
> > > > > > > > slots, which is used by memory hotplug for
> > > > > > > > registering hotplugged memory.
> > > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > > memory regions.
> > > > > > > >
> > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > > >
> > > > > > > It was 64, not 65.
> > > > > > >
> > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > > >
> > > > > > > > Signed-off-by: Igor Mammedov <[email protected]>
> > > > > > >
> > > > > > > Still thinking about this: can you reorder this to
> > > > > > > be the last patch in the series please?
> > > > > > sure
> > > > > >
> > > > > > >
> > > > > > > Also - 509?
> > > > > > userspace memory slots in terms of KVM, I made it match
> > > > > > KVM's allotment of memory slots for userspace side.
> > > > >
> > > > > Maybe KVM has its reasons for this #. I don't see
> > > > > why we need to match this exactly.
> > > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > > would take cut off 1 extra hop since it's capped by QEMU
> > > > at 256+[initial fragmented memory]
> > >
> > > But what's the point? We allocate 32 bytes per slot.
> > > 300*32 = 9600 which is more than 8K, so we are doing
> > > an order-3 allocation anyway.
> > > If we could cap it at 8K (256 slots) that would make sense
> > > since we could avoid wasting vmalloc space.
> > 256 is amount of hotpluggable slots and there is no way
> > to predict how initial memory would be fragmented
> > (i.e. amount of slots it would take), if we guess wrong
> > we are back to square one with crashing userspace.
> > So I'd stay consistent with KVM's limit 509 since
> > it's only limit, i.e. not actual amount of allocated slots.
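As a quick sanity check of the numbers above, the table size can be computed
from the vhost UAPI structures with a few lines of userspace code (purely
illustrative): 64 regions come to about 2K, 256 to just over 8K, 300 to a
little over the 9600 bytes mentioned above, and 509 to roughly 16K, which is
where the 16K-per-fd figure later in the thread comes from.

/* print the vhost memory table size for a few slot counts */
#include <stdio.h>
#include <linux/vhost.h>	/* struct vhost_memory, struct vhost_memory_region */

int main(void)
{
	unsigned int slots[] = { 64, 256, 300, 509 };
	unsigned int i;

	for (i = 0; i < sizeof(slots) / sizeof(slots[0]); i++) {
		size_t sz = sizeof(struct vhost_memory) +
			    slots[i] * sizeof(struct vhost_memory_region);
		printf("%3u slots -> %zu bytes\n", slots[i], sz);
	}
	return 0;
}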
> >
> > > I'm still not very happy with the whole approach,
> > > giving userspace ability allocate 4 whole pages
> > > of kernel memory like this.
> > I'm working in parallel so that userspace won't take so
> > many slots but it won't prevent its current versions
> > crashing due to kernel limitation.
>
> Right but at least it's not a regression. If we promise userspace to
> support a ton of regions, we can't take it back later, and I'm concerned
> about the memory usage.
>
> I think it's already safe to merge the binary lookup patches, and maybe
> cache and vmalloc, so that the remaining patch will be small.
it isn't a regression: with the switch to binary search and the increase
of slots to 509, performance-wise it's more on the improvement side.
And I was thinking about memory usage as well; that's why I've dropped
the faster radix tree in favor of the more compact array. I can't do better
on the kernel side of the fix.
Yes, we will give userspace the ability to use more slots (and lock up
more memory) if it's not able to consolidate memory regions, but
that leaves the user an option to run the guest with vhost performance
instead of crashing it at runtime.
Userspace/targets that could consolidate memory regions should
do so, and I'm working on that as well, but that doesn't mean
that users shouldn't have a choice.
So far it's a kernel limitation, and this patch fixes the crashes
that users see now, with the rest of the patches keeping performance
from regressing.
>
> >
> > > > > > > I think if we are changing this, it'd be nice to
> > > > > > > create a way for userspace to discover the support
> > > > > > > and the # of regions supported.
> > > > > > That was my first idea before extending KVM's memslots
> > > > > > to teach kernel to tell qemu this number so that QEMU
> > > > > > at least would be able to check if new memory slot could
> > > > > > be added but I was redirected to a more simple solution
> > > > > > of just extending vs everdoing things.
> > > > > > Currently QEMU supports upto ~250 memslots so 509
> > > > > > is about twice high we need it so it should work for near
> > > > > > future
> > > > >
> > > > > Yes but old kernels are still around. Would be nice if you
> > > > > can detect them.
> > > > >
> > > > > > but eventually we might still teach kernel and QEMU
> > > > > > to make things more robust.
> > > > >
> > > > > A new ioctl would be easy to add, I think it's a good
> > > > > idea generally.
> > > > I can try to do something like this on top of this series.
> > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > ---
> > > > > > > > drivers/vhost/vhost.c | 2 +-
> > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > index 99931a0..6a18c92 100644
> > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > @@ -30,7 +30,7 @@
> > > > > > > > #include "vhost.h"
> > > > > > > >
> > > > > > > > enum {
> > > > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > > > VHOST_MEMORY_F_LOG = 0x1,
> > > > > > > > };
> > > > > > > >
> > > > > > > > --
> > > > > > > > 1.8.3.1
On Wed, Jun 17, 2015 at 12:37:42PM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 12:11:09 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> > > On Wed, 17 Jun 2015 09:39:06 +0200
> > > "Michael S. Tsirkin" <[email protected]> wrote:
> > >
> > > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > > >
> > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > > > > >
> > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > > > since commit
> > > > > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > > > >
> > > > > > > > > it became possible to use a bigger amount of memory
> > > > > > > > > slots, which is used by memory hotplug for
> > > > > > > > > registering hotplugged memory.
> > > > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > > > memory regions.
> > > > > > > > >
> > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > > > >
> > > > > > > > It was 64, not 65.
> > > > > > > >
> > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Igor Mammedov <[email protected]>
> > > > > > > >
> > > > > > > > Still thinking about this: can you reorder this to
> > > > > > > > be the last patch in the series please?
> > > > > > > sure
> > > > > > >
> > > > > > > >
> > > > > > > > Also - 509?
> > > > > > > userspace memory slots in terms of KVM, I made it match
> > > > > > > KVM's allotment of memory slots for userspace side.
> > > > > >
> > > > > > Maybe KVM has its reasons for this #. I don't see
> > > > > > why we need to match this exactly.
> > > > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > > > would take cut off 1 extra hop since it's capped by QEMU
> > > > > at 256+[initial fragmented memory]
> > > >
> > > > But what's the point? We allocate 32 bytes per slot.
> > > > 300*32 = 9600 which is more than 8K, so we are doing
> > > > an order-3 allocation anyway.
> > > > If we could cap it at 8K (256 slots) that would make sense
> > > > since we could avoid wasting vmalloc space.
> > > 256 is amount of hotpluggable slots and there is no way
> > > to predict how initial memory would be fragmented
> > > (i.e. amount of slots it would take), if we guess wrong
> > > we are back to square one with crashing userspace.
> > > So I'd stay consistent with KVM's limit 509 since
> > > it's only limit, i.e. not actual amount of allocated slots.
> > >
> > > > I'm still not very happy with the whole approach,
> > > > giving userspace ability allocate 4 whole pages
> > > > of kernel memory like this.
> > > I'm working in parallel so that userspace won't take so
> > > many slots but it won't prevent its current versions
> > > crashing due to kernel limitation.
> >
> > Right but at least it's not a regression. If we promise userspace to
> > support a ton of regions, we can't take it back later, and I'm concerned
> > about the memory usage.
> >
> > I think it's already safe to merge the binary lookup patches, and maybe
> > cache and vmalloc, so that the remaining patch will be small.
> it isn't regression with switching to binary search and increasing
> slots to 509 either performance wise it's more on improvment side.
> And I was thinking about memory usage as well, that's why I've dropped
> faster radix tree in favor of more compact array, can't do better
> on kernel side of fix.
>
> Yes we will give userspace to ability to use more slots/and lock up
> more memory if it's not able to consolidate memory regions but
> that leaves an option for user to run guest with vhost performance
> vs crashing it at runtime.
Crashing is entirely QEMU's own doing in not handling
the error gracefully.
>
> userspace/targets that could consolidate memory regions should
> do so and I'm working on that as well but that doesn't mean
> that users shouldn't have a choice.
It's a fairly unusual corner case; I'm not yet
convinced we need to quickly add support for it when just waiting a bit
longer will get us an equivalent (or even more efficient) fix in
userspace.
> So far it's kernel limitation and this patch fixes crashes
> that users see now, with the rest of patches enabling performance
> not to regress.
When I say regression I refer to an option to limit the array
size again after userspace started using the larger size.
> >
> > >
> > > > > > > > I think if we are changing this, it'd be nice to
> > > > > > > > create a way for userspace to discover the support
> > > > > > > > and the # of regions supported.
> > > > > > > That was my first idea before extending KVM's memslots
> > > > > > > to teach kernel to tell qemu this number so that QEMU
> > > > > > > at least would be able to check if new memory slot could
> > > > > > > be added but I was redirected to a more simple solution
> > > > > > > of just extending vs everdoing things.
> > > > > > > Currently QEMU supports upto ~250 memslots so 509
> > > > > > > is about twice high we need it so it should work for near
> > > > > > > future
> > > > > >
> > > > > > Yes but old kernels are still around. Would be nice if you
> > > > > > can detect them.
> > > > > >
> > > > > > > but eventually we might still teach kernel and QEMU
> > > > > > > to make things more robust.
> > > > > >
> > > > > > A new ioctl would be easy to add, I think it's a good
> > > > > > idea generally.
> > > > > I can try to do something like this on top of this series.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > ---
> > > > > > > > > drivers/vhost/vhost.c | 2 +-
> > > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > index 99931a0..6a18c92 100644
> > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > @@ -30,7 +30,7 @@
> > > > > > > > > #include "vhost.h"
> > > > > > > > >
> > > > > > > > > enum {
> > > > > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > > > > VHOST_MEMORY_F_LOG = 0x1,
> > > > > > > > > };
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > 1.8.3.1
On Wed, 17 Jun 2015 12:46:09 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 12:37:42PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 12:11:09 +0200
> > "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > > On Wed, Jun 17, 2015 at 10:54:21AM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 09:39:06 +0200
> > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > >
> > > > > On Wed, Jun 17, 2015 at 09:28:02AM +0200, Igor Mammedov wrote:
> > > > > > On Wed, 17 Jun 2015 08:34:26 +0200
> > > > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > > > >
> > > > > > > On Wed, Jun 17, 2015 at 12:00:56AM +0200, Igor Mammedov wrote:
> > > > > > > > On Tue, 16 Jun 2015 23:14:20 +0200
> > > > > > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > On Tue, Jun 16, 2015 at 06:33:37PM +0200, Igor Mammedov wrote:
> > > > > > > > > > since commit
> > > > > > > > > > 1d4e7e3 kvm: x86: increase user memory slots to 509
> > > > > > > > > >
> > > > > > > > > > it became possible to use a bigger amount of memory
> > > > > > > > > > slots, which is used by memory hotplug for
> > > > > > > > > > registering hotplugged memory.
> > > > > > > > > > However QEMU crashes if it's used with more than ~60
> > > > > > > > > > pc-dimm devices and vhost-net since host kernel
> > > > > > > > > > in module vhost-net refuses to accept more than 65
> > > > > > > > > > memory regions.
> > > > > > > > > >
> > > > > > > > > > Increase VHOST_MEMORY_MAX_NREGIONS from 65 to 509
> > > > > > > > >
> > > > > > > > > It was 64, not 65.
> > > > > > > > >
> > > > > > > > > > to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Igor Mammedov <[email protected]>
> > > > > > > > >
> > > > > > > > > Still thinking about this: can you reorder this to
> > > > > > > > > be the last patch in the series please?
> > > > > > > > sure
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Also - 509?
> > > > > > > > userspace memory slots in terms of KVM, I made it match
> > > > > > > > KVM's allotment of memory slots for userspace side.
> > > > > > >
> > > > > > > Maybe KVM has its reasons for this #. I don't see
> > > > > > > why we need to match this exactly.
> > > > > > np, I can cap it at safe 300 slots but it's unlikely that it
> > > > > > would take cut off 1 extra hop since it's capped by QEMU
> > > > > > at 256+[initial fragmented memory]
> > > > >
> > > > > But what's the point? We allocate 32 bytes per slot.
> > > > > 300*32 = 9600 which is more than 8K, so we are doing
> > > > > an order-3 allocation anyway.
> > > > > If we could cap it at 8K (256 slots) that would make sense
> > > > > since we could avoid wasting vmalloc space.
> > > > 256 is amount of hotpluggable slots and there is no way
> > > > to predict how initial memory would be fragmented
> > > > (i.e. amount of slots it would take), if we guess wrong
> > > > we are back to square one with crashing userspace.
> > > > So I'd stay consistent with KVM's limit 509 since
> > > > it's only limit, i.e. not actual amount of allocated slots.
> > > >
> > > > > I'm still not very happy with the whole approach,
> > > > > giving userspace ability allocate 4 whole pages
> > > > > of kernel memory like this.
> > > > I'm working in parallel so that userspace won't take so
> > > > many slots but it won't prevent its current versions
> > > > crashing due to kernel limitation.
> > >
> > > Right but at least it's not a regression. If we promise userspace to
> > > support a ton of regions, we can't take it back later, and I'm concerned
> > > about the memory usage.
> > >
> > > I think it's already safe to merge the binary lookup patches, and maybe
> > > cache and vmalloc, so that the remaining patch will be small.
> > it isn't regression with switching to binary search and increasing
> > slots to 509 either performance wise it's more on improvment side.
> > And I was thinking about memory usage as well, that's why I've dropped
> > faster radix tree in favor of more compact array, can't do better
> > on kernel side of fix.
> >
> > Yes we will give userspace to ability to use more slots/and lock up
> > more memory if it's not able to consolidate memory regions but
> > that leaves an option for user to run guest with vhost performance
> > vs crashing it at runtime.
>
> Crashing is entirely QEMU's own doing in not handling
> the error gracefully.
And that's hard to fix (i.e. handle the error gracefully) the way it's implemented now.
> >
> > userspace/targets that could consolidate memory regions should
> > do so and I'm working on that as well but that doesn't mean
> > that users shouldn't have a choice.
>
> It's a fairly unusual corner case, I'm not yet
> convinced we need to quickly add support to it when just waiting a bit
> longer will get us an equivalent (or even more efficient) fix in
> userspace.
Memory hotplug support in QEMU has been out for quite
a long time already and there are users who use it, so a fix in
a future QEMU won't make it work with their distros.
So I wouldn't say that it's a "fairly unusual corner case".
>
> > So far it's kernel limitation and this patch fixes crashes
> > that users see now, with the rest of patches enabling performance
> > not to regress.
>
> When I say regression I refer to an option to limit the array
> size again after userspace started using the larger size.
Is there a need to do so?
Userspace that cares about memory footprint won't use many slots,
keeping it low, and userspace that can't do without many slots
(or doesn't care) will have a bigger memory footprint.
>
>
> > >
> > > >
> > > > > > > > > I think if we are changing this, it'd be nice to
> > > > > > > > > create a way for userspace to discover the support
> > > > > > > > > and the # of regions supported.
> > > > > > > > That was my first idea before extending KVM's memslots
> > > > > > > > to teach kernel to tell qemu this number so that QEMU
> > > > > > > > at least would be able to check if new memory slot could
> > > > > > > > be added but I was redirected to a more simple solution
> > > > > > > > of just extending vs everdoing things.
> > > > > > > > Currently QEMU supports upto ~250 memslots so 509
> > > > > > > > is about twice high we need it so it should work for near
> > > > > > > > future
> > > > > > >
> > > > > > > Yes but old kernels are still around. Would be nice if you
> > > > > > > can detect them.
> > > > > > >
> > > > > > > > but eventually we might still teach kernel and QEMU
> > > > > > > > to make things more robust.
> > > > > > >
> > > > > > > A new ioctl would be easy to add, I think it's a good
> > > > > > > idea generally.
> > > > > > I can try to do something like this on top of this series.
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > > drivers/vhost/vhost.c | 2 +-
> > > > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > > index 99931a0..6a18c92 100644
> > > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > > @@ -30,7 +30,7 @@
> > > > > > > > > > #include "vhost.h"
> > > > > > > > > >
> > > > > > > > > > enum {
> > > > > > > > > > - VHOST_MEMORY_MAX_NREGIONS = 64,
> > > > > > > > > > + VHOST_MEMORY_MAX_NREGIONS = 509,
> > > > > > > > > > VHOST_MEMORY_F_LOG = 0x1,
> > > > > > > > > > };
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > 1.8.3.1
On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote:
> > > So far it's kernel limitation and this patch fixes crashes
> > > that users see now, with the rest of patches enabling performance
> > > not to regress.
> >
> > When I say regression I refer to an option to limit the array
> > size again after userspace started using the larger size.
> Is there a need to do so?
Considering userspace can be malicious, I guess yes.
> Userspace that cares about memory footprint won't use many slots
> keeping it low and user space that can't do without many slots
> or doesn't care will have bigger memory footprint.
We really can't trust userspace to do the right thing though.
--
MST
On Wed, 17 Jun 2015 13:51:56 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote:
> > > > So far it's kernel limitation and this patch fixes crashes
> > > > that users see now, with the rest of patches enabling performance
> > > > not to regress.
> > >
> > > When I say regression I refer to an option to limit the array
> > > size again after userspace started using the larger size.
> > Is there a need to do so?
>
> Considering userspace can be malicious, I guess yes.
I don't think it's a valid concern in this case:
setting the limit back from 509 to 64 will not help here in any way,
since userspace can still create as many vhost instances as it needs
to consume the memory it desires.
>
> > Userspace that cares about memory footprint won't use many slots
> > keeping it low and user space that can't do without many slots
> > or doesn't care will have bigger memory footprint.
>
> We really can't trust userspace to do the right thing though.
>
On Wed, Jun 17, 2015 at 02:23:39PM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 13:51:56 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Wed, Jun 17, 2015 at 01:48:03PM +0200, Igor Mammedov wrote:
> > > > > So far it's kernel limitation and this patch fixes crashes
> > > > > that users see now, with the rest of patches enabling performance
> > > > > not to regress.
> > > >
> > > > When I say regression I refer to an option to limit the array
> > > > size again after userspace started using the larger size.
> > > Is there a need to do so?
> >
> > Considering userspace can be malicious, I guess yes.
> I don't think it's a valid concern in this case,
> setting limit back from 509 to 64 will not help here in any way,
> userspace still can create as many vhost instances as it needs
> to consume memory it desires.
Not really, since the vhost char device isn't world-accessible.
It's typically opened by a privileged tool; the fd is
then passed to an unprivileged userspace, or permissions are dropped.
> >
> > > Userspace that cares about memory footprint won't use many slots
> > > keeping it low and user space that can't do without many slots
> > > or doesn't care will have bigger memory footprint.
> >
> > We really can't trust userspace to do the right thing though.
> >
On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > Considering userspace can be malicious, I guess yes.
> > I don't think it's a valid concern in this case,
> > setting limit back from 509 to 64 will not help here in any way,
> > userspace still can create as many vhost instances as it needs
> > to consume memory it desires.
>
> Not really since vhost char device isn't world-accessible.
> It's typically opened by a priveledged tool, the fd is
> then passed to an unpriveledged userspace, or permissions dropped.
Then what's the concern anyway?
Paolo
On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
>
>
> On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > Considering userspace can be malicious, I guess yes.
> > > I don't think it's a valid concern in this case,
> > > setting limit back from 509 to 64 will not help here in any way,
> > > userspace still can create as many vhost instances as it needs
> > > to consume memory it desires.
> >
> > Not really since vhost char device isn't world-accessible.
> > It's typically opened by a priveledged tool, the fd is
> > then passed to an unpriveledged userspace, or permissions dropped.
>
> Then what's the concern anyway?
>
> Paolo
Each fd now ties up 16K of kernel memory. It didn't use to, so a
privileged tool could safely give the unprivileged userspace
a ton of these fds.
--
MST
On Wed, 17 Jun 2015 16:32:02 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> >
> >
> > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > Considering userspace can be malicious, I guess yes.
> > > > I don't think it's a valid concern in this case,
> > > > setting limit back from 509 to 64 will not help here in any way,
> > > > userspace still can create as many vhost instances as it needs
> > > > to consume memory it desires.
> > >
> > > Not really since vhost char device isn't world-accessible.
> > > It's typically opened by a priveledged tool, the fd is
> > > then passed to an unpriveledged userspace, or permissions dropped.
> >
> > Then what's the concern anyway?
> >
> > Paolo
>
> Each fd now ties up 16K of kernel memory. It didn't use to, so
> priveledged tool could safely give the unpriveledged userspace
> a ton of these fds.
If the privileged tool gives out an unlimited number of fds, then it
doesn't matter whether an fd ties up 4K or 16K; the host could still be DoSed.
On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 16:32:02 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > >
> > >
> > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > Considering userspace can be malicious, I guess yes.
> > > > > I don't think it's a valid concern in this case,
> > > > > setting limit back from 509 to 64 will not help here in any way,
> > > > > userspace still can create as many vhost instances as it needs
> > > > > to consume memory it desires.
> > > >
> > > > Not really since vhost char device isn't world-accessible.
> > > > It's typically opened by a priveledged tool, the fd is
> > > > then passed to an unpriveledged userspace, or permissions dropped.
> > >
> > > Then what's the concern anyway?
> > >
> > > Paolo
> >
> > Each fd now ties up 16K of kernel memory. It didn't use to, so
> > priveledged tool could safely give the unpriveledged userspace
> > a ton of these fds.
> if privileged tool gives out unlimited amount of fds then it
> doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
>
Of course it does not give out unlimited fds; there's a way
for the sysadmin to specify the number of fds. Look at how libvirt
uses vhost, and it should become clear, I think.
--
MST
On Wed, 17 Jun 2015 17:38:40 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 16:32:02 +0200
> > "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > >
> > > >
> > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > I don't think it's a valid concern in this case,
> > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > way, userspace still can create as many vhost instances as
> > > > > > it needs to consume memory it desires.
> > > > >
> > > > > Not really since vhost char device isn't world-accessible.
> > > > > It's typically opened by a priveledged tool, the fd is
> > > > > then passed to an unpriveledged userspace, or permissions
> > > > > dropped.
> > > >
> > > > Then what's the concern anyway?
> > > >
> > > > Paolo
> > >
> > > Each fd now ties up 16K of kernel memory. It didn't use to, so
> > > priveledged tool could safely give the unpriveledged userspace
> > > a ton of these fds.
> > if privileged tool gives out unlimited amount of fds then it
> > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> >
>
> Of course it does not give out unlimited fds, there's a way
> for the sysadmin to specify the number of fds. Look at how libvirt
> uses vhost, it should become clear I think.
Then it just means that the tool has to take the new limit into account
when partitioning the host in a sensible manner.
Exposing the limit as a module parameter might help the tool
get/set it in the way it needs.
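A sketch of what such a module parameter could look like; the parameter name,
permissions and error value below are assumptions for illustration, not
merged code.

/* illustrative only: make the region limit tunable instead of a constant */
static ushort max_mem_regions __read_mostly = 64;
module_param(max_mem_regions, ushort, 0444);
MODULE_PARM_DESC(max_mem_regions,
	"Maximum number of memory regions in memory map. (default: 64)");

/* ...and vhost_set_memory() would check the parameter instead of the
 * VHOST_MEMORY_MAX_NREGIONS constant: */
	if (mem.nregions > max_mem_regions)
		return -E2BIG;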
On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 17:38:40 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > "Michael S. Tsirkin" <[email protected]> wrote:
> > >
> > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > >
> > > > >
> > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > I don't think it's a valid concern in this case,
> > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > it needs to consume memory it desires.
> > > > > >
> > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > dropped.
> > > > >
> > > > > Then what's the concern anyway?
> > > > >
> > > > > Paolo
> > > >
> > > > Each fd now ties up 16K of kernel memory. It didn't use to, so
> > > > priveledged tool could safely give the unpriveledged userspace
> > > > a ton of these fds.
> > > if privileged tool gives out unlimited amount of fds then it
> > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > >
> >
> > Of course it does not give out unlimited fds, there's a way
> > for the sysadmin to specify the number of fds. Look at how libvirt
> > uses vhost, it should become clear I think.
> then it just means that tool has to take into account a new limits
> to partition host in sensible manner.
Meanwhile old tools are vulnerable to OOM attacks.
> Exposing limit as module parameter might be of help to tool for
> getting/setting it in a way it needs.
--
MST
On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> Meanwhile old tools are vulnerable to OOM attacks.
For each vhost device there will likely be one tap interface, and I
suspect that it takes way, way more than 16KB of memory.
Paolo
On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
>
>
> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> > Meanwhile old tools are vulnerable to OOM attacks.
>
> For each vhost device there will be likely one tap interface, and I
> suspect that it takes way, way more than 16KB of memory.
>
> Paolo
That's not true. We have a vhost device per queue; all queues
are part of a single tap device.
--
MST
On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
>>> Meanwhile old tools are vulnerable to OOM attacks.
>>
>> For each vhost device there will be likely one tap interface, and I
>> suspect that it takes way, way more than 16KB of memory.
>
> That's not true. We have a vhost device per queue, all queues
> are part of a single tap device.
s/tap/VCPU/ then. A KVM VCPU also takes more than 16KB of memory.
Paolo
On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
>
>
> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> > On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> >>> Meanwhile old tools are vulnerable to OOM attacks.
> >>
> >> For each vhost device there will be likely one tap interface, and I
> >> suspect that it takes way, way more than 16KB of memory.
> >
> > That's not true. We have a vhost device per queue, all queues
> > are part of a single tap device.
>
> s/tap/VCPU/ then. A KVM VCPU also takes more than 16KB of memory.
>
> Paolo
That's up to you as a kvm maintainer :)
People are already concerned about vhost device
memory usage; I'm not happy to define our user/kernel interface
in a way that forces even more memory to be used up.
--
MST
On 17/06/2015 18:41, Michael S. Tsirkin wrote:
> On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
>>> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
>>>>> Meanwhile old tools are vulnerable to OOM attacks.
>>>>
>>>> For each vhost device there will be likely one tap interface, and I
>>>> suspect that it takes way, way more than 16KB of memory.
>>>
>>> That's not true. We have a vhost device per queue, all queues
>>> are part of a single tap device.
>>
>> s/tap/VCPU/ then. A KVM VCPU also takes more than 16KB of memory.
>
> That's up to you as a kvm maintainer :)
Not easy, when the CPU alone requires three (albeit non-consecutive)
pages for the VMCS, the APIC access page and the EPT root.
> People are already concerned about vhost device
> memory usage, I'm not happy to define our user/kernel interface
> in a way that forces even more memory to be used up.
So, the questions to ask are:
1) What is the memory usage like immediately after vhost is brought up,
apart from these 16K?
2) Is there anything in vhost that allocates a user-controllable amount
of memory?
3) What is the size of the data structures that support one virtqueue
(there are two of them)? Does it depend on the size of the virtqueues?
4) Would it make sense to share memory regions between multiple vhost
devices? Would it be hard to implement? It would also make memory
operations O(1) rather than O(#cpus).
Paolo
On Wed, 17 Jun 2015 18:30:02 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 17:38:40 +0200
> > "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > >
> > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > >
> > > > > >
> > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > setting limit back from 509 to 64 will not help here in
> > > > > > > > any way, userspace still can create as many vhost
> > > > > > > > instances as it needs to consume memory it desires.
> > > > > > >
> > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > dropped.
> > > > > >
> > > > > > Then what's the concern anyway?
> > > > > >
> > > > > > Paolo
> > > > >
> > > > > Each fd now ties up 16K of kernel memory. It didn't use to,
> > > > > so priveledged tool could safely give the unpriveledged
> > > > > userspace a ton of these fds.
> > > > if privileged tool gives out unlimited amount of fds then it
> > > > doesn't matter whether fd ties 4K or 16K, host still could be
> > > > DoSed.
> > > >
> > >
> > > Of course it does not give out unlimited fds, there's a way
> > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > uses vhost, it should become clear I think.
> > then it just means that tool has to take into account a new limits
> > to partition host in sensible manner.
>
> Meanwhile old tools are vulnerable to OOM attacks.
Let's leave the old limit as the default and allow overriding it via a module
parameter; that way old tools won't be affected and new tools
could set the limit the way they need.
That will accommodate the current slot-hungry userspace and a new one with
a contiguous HVA, and won't regress old tools.
>
> > Exposing limit as module parameter might be of help to tool for
> > getting/setting it in a way it needs.
>
On Wed, 17 Jun 2015 18:47:18 +0200
Paolo Bonzini <[email protected]> wrote:
>
>
> On 17/06/2015 18:41, Michael S. Tsirkin wrote:
> > On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> >>> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
> >>>>
> >>>>
> >>>> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> >>>>> Meanwhile old tools are vulnerable to OOM attacks.
> >>>>
> >>>> For each vhost device there will be likely one tap interface,
> >>>> and I suspect that it takes way, way more than 16KB of memory.
> >>>
> >>> That's not true. We have a vhost device per queue, all queues
> >>> are part of a single tap device.
> >>
> >> s/tap/VCPU/ then. A KVM VCPU also takes more than 16KB of memory.
> >
> > That's up to you as a kvm maintainer :)
>
> Not easy, when the CPU alone requires three (albeit non-consecutive)
> pages for the VMCS, the APIC access page and the EPT root.
>
> > People are already concerned about vhost device
> > memory usage, I'm not happy to define our user/kernel interface
> > in a way that forces even more memory to be used up.
>
> So, the questions to ask are:
>
> 1) What is the memory usage like immediately after vhost is brought
> up, apart from these 16K?
>
> 2) Is there anything in vhost that allocates a user-controllable
> amount of memory?
>
> 3) What is the size of the data structures that support one virtqueue
> (there are two of them)? Does it depend on the size of the
> virtqueues?
>
> 4) Would it make sense to share memory regions between multiple vhost
> devices? Would it be hard to implement? It would also make memory
> operations O(1) rather than O(#cpus).
>
> Paolo
In addition to that, could vhost share the memory map with KVM, i.e. use its
memslots instead of duplicating them?
On Wed, Jun 17, 2015 at 06:47:18PM +0200, Paolo Bonzini wrote:
>
>
> On 17/06/2015 18:41, Michael S. Tsirkin wrote:
> > On Wed, Jun 17, 2015 at 06:38:25PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 17/06/2015 18:34, Michael S. Tsirkin wrote:
> >>> On Wed, Jun 17, 2015 at 06:31:32PM +0200, Paolo Bonzini wrote:
> >>>>
> >>>>
> >>>> On 17/06/2015 18:30, Michael S. Tsirkin wrote:
> >>>>> Meanwhile old tools are vulnerable to OOM attacks.
> >>>>
> >>>> For each vhost device there will be likely one tap interface, and I
> >>>> suspect that it takes way, way more than 16KB of memory.
> >>>
> >>> That's not true. We have a vhost device per queue, all queues
> >>> are part of a single tap device.
> >>
> >> s/tap/VCPU/ then. A KVM VCPU also takes more than 16KB of memory.
> >
> > That's up to you as a kvm maintainer :)
>
> Not easy, when the CPU alone requires three (albeit non-consecutive)
> pages for the VMCS, the APIC access page and the EPT root.
>
> > People are already concerned about vhost device
> > memory usage, I'm not happy to define our user/kernel interface
> > in a way that forces even more memory to be used up.
>
> So, the questions to ask are:
>
> 1) What is the memory usage like immediately after vhost is brought up,
> apart from these 16K?
About 24K, but most of that is iov pool arrays, kept around as an optimization
to avoid kmalloc on the data path. Below 1K tracks persistent state.
Recently people have been complaining about these pools,
so I've been thinking about switching to a per-cpu array,
or something similar.
> 2) Is there anything in vhost that allocates a user-controllable amount
> of memory?
Definitely not in vhost-net.
> 3) What is the size of the data structures that support one virtqueue
> (there are two of them)?
Around 256 bytes.
> Does it depend on the size of the virtqueues?
No.
> 4) Would it make sense to share memory regions between multiple vhost
> devices? Would it be hard to implement?
It's not trivial. It would absolutely require userspace ABI
extensions.
> It would also make memory
> operations O(1) rather than O(#cpus).
>
> Paolo
We'd save the kmalloc/memcpy/kfree, that is true.
But we'd still need to flush all VQs, so it's still O(#cpus);
we'd just be doing less work in that O(#cpus).
--
MST
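As an aside, the "per-cpu array" idea mentioned in the first answer could look
roughly like the sketch below; the structure and helper names are invented for
illustration and this is not how vhost is implemented.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/uio.h>		/* UIO_MAXIOV, struct iovec */

/* one shared iovec scratch pool per CPU instead of a UIO_MAXIOV-sized
 * array embedded in every virtqueue */
struct vhost_iov_pool {
	struct iovec iov[UIO_MAXIOV];
};

static struct vhost_iov_pool __percpu *vhost_iov_pools;

static int __init vhost_iov_pool_init(void)
{
	vhost_iov_pools = alloc_percpu(struct vhost_iov_pool);
	return vhost_iov_pools ? 0 : -ENOMEM;
}

/* data path: borrow this CPU's scratch array; preemption stays disabled
 * until vhost_put_scratch_iov() is called */
static struct iovec *vhost_get_scratch_iov(void)
{
	return get_cpu_ptr(vhost_iov_pools)->iov;
}

static void vhost_put_scratch_iov(void)
{
	put_cpu_ptr(vhost_iov_pools);
}

Since the real vhost data path can sleep while the iovs are in use, an actual
implementation would need more care than this sketch shows.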
On Wed, 17 Jun 2015 18:30:02 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 17:38:40 +0200
> > "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > >
> > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > >
> > > > > >
> > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > > it needs to consume memory it desires.
> > > > > > >
> > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > dropped.
> > > > > >
> > > > > > Then what's the concern anyway?
> > > > > >
> > > > > > Paolo
> > > > >
> > > > > Each fd now ties up 16K of kernel memory. It didn't use to, so
> > > > > priveledged tool could safely give the unpriveledged userspace
> > > > > a ton of these fds.
> > > > if privileged tool gives out unlimited amount of fds then it
> > > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > > >
> > >
> > > Of course it does not give out unlimited fds, there's a way
> > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > uses vhost, it should become clear I think.
> > then it just means that tool has to take into account a new limits
> > to partition host in sensible manner.
>
> Meanwhile old tools are vulnerable to OOM attacks.
I've chatted with the libvirt folks; it doesn't care how much memory
vhost would consume, nor does it do any host capacity planning in that regard.
But let's assume that there are tools that do this, so
how about, instead of hardcoding the limit, making it a module parameter
with the default set to 64? That would allow users to set a higher limit
if they need it and not regress old tools. It will also give tools an
interface for reading the limit from the vhost module.
>
> > Exposing limit as module parameter might be of help to tool for
> > getting/setting it in a way it needs.
>
On Thu, Jun 18, 2015 at 11:12:24AM +0200, Igor Mammedov wrote:
> On Wed, 17 Jun 2015 18:30:02 +0200
> "Michael S. Tsirkin" <[email protected]> wrote:
>
> > On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > > On Wed, 17 Jun 2015 17:38:40 +0200
> > > "Michael S. Tsirkin" <[email protected]> wrote:
> > >
> > > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > > >
> > > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > > > it needs to consume memory it desires.
> > > > > > > >
> > > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > > dropped.
> > > > > > >
> > > > > > > Then what's the concern anyway?
> > > > > > >
> > > > > > > Paolo
> > > > > >
> > > > > > Each fd now ties up 16K of kernel memory. It didn't use to, so
> > > > > > priveledged tool could safely give the unpriveledged userspace
> > > > > > a ton of these fds.
> > > > > if privileged tool gives out unlimited amount of fds then it
> > > > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > > > >
> > > >
> > > > Of course it does not give out unlimited fds, there's a way
> > > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > > uses vhost, it should become clear I think.
> > > then it just means that tool has to take into account a new limits
> > > to partition host in sensible manner.
> >
> > Meanwhile old tools are vulnerable to OOM attacks.
> I've chatted with libvirt folks, it doesn't care about how much memory
> vhost would consume nor do any host capacity planning in that regard.
Exactly, it's up to host admin.
> But lets assume that there are tools that do this so
> how about instead of hardcoding limit make it a module parameter
> with default set to 64. That would allow users to set higher limit
> if they need it and nor regress old tools. it will also give tools
> interface for reading limit from vhost module.
And now you need to choose between security and functionality :(
> >
> > > Exposing limit as module parameter might be of help to tool for
> > > getting/setting it in a way it needs.
> >
On 18/06/2015 11:50, Michael S. Tsirkin wrote:
> > But lets assume that there are tools that do this so
> > how about instead of hardcoding limit make it a module parameter
> > with default set to 64. That would allow users to set higher limit
> > if they need it and nor regress old tools. it will also give tools
> > interface for reading limit from vhost module.
>
> And now you need to choose between security and functionality :(
Don't call "security" a 16K allocation that can fall back to vmalloc
please. That's an insult to actual security problems...
Paolo
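For reference, the fallback being referred to is the usual
try-kmalloc-then-vmalloc pattern; a minimal sketch under that assumption,
not a quote of the actual patch in this series:

#include <linux/slab.h>
#include <linux/vmalloc.h>

/* try a physically contiguous allocation first and quietly fall back to
 * vmalloc for larger region tables; free with kvfree() in either case */
static void *vhost_kvzalloc(unsigned long size)
{
	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);

	if (!n)
		n = vzalloc(size);
	return n;
}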
On Thu, 18 Jun 2015 11:50:22 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Thu, Jun 18, 2015 at 11:12:24AM +0200, Igor Mammedov wrote:
> > On Wed, 17 Jun 2015 18:30:02 +0200
> > "Michael S. Tsirkin" <[email protected]> wrote:
> >
> > > On Wed, Jun 17, 2015 at 06:09:21PM +0200, Igor Mammedov wrote:
> > > > On Wed, 17 Jun 2015 17:38:40 +0200
> > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > >
> > > > > On Wed, Jun 17, 2015 at 05:12:57PM +0200, Igor Mammedov wrote:
> > > > > > On Wed, 17 Jun 2015 16:32:02 +0200
> > > > > > "Michael S. Tsirkin" <[email protected]> wrote:
> > > > > >
> > > > > > > On Wed, Jun 17, 2015 at 03:20:44PM +0200, Paolo Bonzini wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 17/06/2015 15:13, Michael S. Tsirkin wrote:
> > > > > > > > > > > Considering userspace can be malicious, I guess yes.
> > > > > > > > > > I don't think it's a valid concern in this case,
> > > > > > > > > > setting limit back from 509 to 64 will not help here in any
> > > > > > > > > > way, userspace still can create as many vhost instances as
> > > > > > > > > > it needs to consume memory it desires.
> > > > > > > > >
> > > > > > > > > Not really since vhost char device isn't world-accessible.
> > > > > > > > > It's typically opened by a priveledged tool, the fd is
> > > > > > > > > then passed to an unpriveledged userspace, or permissions
> > > > > > > > > dropped.
> > > > > > > >
> > > > > > > > Then what's the concern anyway?
> > > > > > > >
> > > > > > > > Paolo
> > > > > > >
> > > > > > > Each fd now ties up 16K of kernel memory. It didn't use to, so
> > > > > > > priveledged tool could safely give the unpriveledged userspace
> > > > > > > a ton of these fds.
> > > > > > if privileged tool gives out unlimited amount of fds then it
> > > > > > doesn't matter whether fd ties 4K or 16K, host still could be DoSed.
> > > > > >
> > > > >
> > > > > Of course it does not give out unlimited fds, there's a way
> > > > > for the sysadmin to specify the number of fds. Look at how libvirt
> > > > > uses vhost, it should become clear I think.
> > > > then it just means that tool has to take into account a new limits
> > > > to partition host in sensible manner.
> > >
> > > Meanwhile old tools are vulnerable to OOM attacks.
> > I've chatted with libvirt folks, it doesn't care about how much memory
> > vhost would consume nor do any host capacity planning in that regard.
>
> Exactly, it's up to host admin.
>
> > But lets assume that there are tools that do this so
> > how about instead of hardcoding limit make it a module parameter
> > with default set to 64. That would allow users to set higher limit
> > if they need it and nor regress old tools. it will also give tools
> > interface for reading limit from vhost module.
>
> And now you need to choose between security and functionality :(
There is no conflict here and it's not about choosing.
If the admin has a method to estimate the guest memory footprint
in order to do capacity partitioning, then he would need to redo the
partitioning, taking the new footprint into account, when
he/she raises the limit manually.
(BTW libvirt has tried and reverted patches that were trying to
predict the required memory; the admin might be able to do it manually
better, but how to do that is another topic and it's not related
to this thread.)
Let's leave the decision up to users instead of making them live with
crashing guests.
>
> > >
> > > > Exposing limit as module parameter might be of help to tool for
> > > > getting/setting it in a way it needs.
> > >
On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> Lets leave decision upto users instead of making them live with
> crashing guests.
Come on, let's fix it in userspace.
--
MST
On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
>> Lets leave decision upto users instead of making them live with
>> crashing guests.
>
> Come on, let's fix it in userspace.
It's not trivial to fix it in userspace. Since QEMU uses RCU there
isn't a single memory map to use for a linear gpa->hva map.
I find it absurd that we're fighting over 12K of memory.
Paolo
On Thu, 18 Jun 2015 13:41:22 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> > Lets leave decision upto users instead of making them live with
> > crashing guests.
>
> Come on, let's fix it in userspace.
I'm not abandoning the userspace approach either, but it might take time
to implement in a robust manner, as it's much more complex and has many
more places to backfire than a straightforward kernel fix which will
work for both old userspace and a new one.
On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
>
>
> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> >> Lets leave decision upto users instead of making them live with
> >> crashing guests.
> >
> > Come on, let's fix it in userspace.
>
> It's not trivial to fix it in userspace. Since QEMU uses RCU there
> isn't a single memory map to use for a linear gpa->hva map.
Could you elaborate?
I'm confused by this mention of RCU.
You use RCU for accesses to the memory map, correct?
So a memory map change itself is a write-side operation; as such, all you need to
do is take some kind of lock to prevent conflicting with other memory
map updates and do an RCU sync under this lock.
> I find it absurd that we're fighting over 12K of memory.
>
> Paolo
I wouldn't worry so much if it didn't affect kernel/userspace API.
Need to be careful there.
--
MST
On 18/06/2015 15:19, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
>>> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
>>>> Lets leave decision upto users instead of making them live with
>>>> crashing guests.
>>>
>>> Come on, let's fix it in userspace.
>>
>> It's not trivial to fix it in userspace. Since QEMU uses RCU there
>> isn't a single memory map to use for a linear gpa->hva map.
>
> Could you elaborate?
>
> I'm confused by this mention of RCU.
> You use RCU for accesses to the memory map, correct?
> So memory map itself is a write side operation, as such all you need to
> do is take some kind of lock to prevent conflicting with other memory
> maps, do rcu sync under this lock.
You're right, the problem isn't directly related to RCU. RCU would be
easy to handle by using synchronize_rcu instead of call_rcu. While I
identified an RCU-related problem with Igor's patches, it's much more
entrenched.
RAM can be used by asynchronous operations while the VM runs, between
address_space_map and address_space_unmap. It is possible and common to
have a quiescent state between the map and unmap, and a memory map
change can happen in the middle of this. Normally this is not a
problem, because changes to the memory map do not make the hva go away
(memory regions are reference counted).
However, with Igor's patches a memory_region_del_subregion will cause a
mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
I guess one way to do it would be to alias the same page in two places,
one for use by vhost and one for use by everything else. However, the
kernel does not provide the means to do this kind of aliasing for
anonymous mmaps.
Paolo
On Thu, Jun 18, 2015 at 03:46:14PM +0200, Paolo Bonzini wrote:
>
>
> On 18/06/2015 15:19, Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> >>> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> >>>> Lets leave decision upto users instead of making them live with
> >>>> crashing guests.
> >>>
> >>> Come on, let's fix it in userspace.
> >>
> >> It's not trivial to fix it in userspace. Since QEMU uses RCU there
> >> isn't a single memory map to use for a linear gpa->hva map.
> >
> > Could you elaborate?
> >
> > I'm confused by this mention of RCU.
> > You use RCU for accesses to the memory map, correct?
> > So memory map itself is a write side operation, as such all you need to
> > do is take some kind of lock to prevent conflicting with other memory
> > maps, do rcu sync under this lock.
>
> You're right, the problem isn't directly related to RCU. RCU would be
> easy to handle by using synchronize_rcu instead of call_rcu. While I
> identified an RCU-related problem with Igor's patches, it's much more
> entrenched.
>
> RAM can be used by asynchronous operations while the VM runs, between
> address_space_map and address_space_unmap. It is possible and common to
> have a quiescent state between the map and unmap, and a memory map
> change can happen in the middle of this. Normally this is not a
> problem, because changes to the memory map do not make the hva go away
> (memory regions are reference counted).
Right, so you want mmap(MAP_NORESERVE) when that reference
count becomes 0.
> However, with Igor's patches a memory_region_del_subregion will cause a
> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
>
> I guess one way to do it would be to alias the same page in two places,
> one for use by vhost and one for use by everything else. However, the
> kernel does not provide the means to do this kind of aliasing for
> anonymous mmaps.
>
> Paolo
Basically pages go away on munmap, so won't simple
lock
munmap
mmap(MAP_NORESERVE)
unlock
do the trick?
--
MST
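A minimal sketch of that sequence on the QEMU side, assuming the range should
stay reserved at the same address afterwards; the lock and the exact mmap
flags here are assumptions for illustration.

#include <pthread.h>
#include <sys/mman.h>

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;

/* drop the pages backing [hva, hva + len) and re-reserve the same virtual
 * range unbacked, so a later hotplug can reuse it */
static int discard_range(void *hva, size_t len)
{
	void *p;

	pthread_mutex_lock(&map_lock);
	munmap(hva, len);
	p = mmap(hva, len, PROT_READ | PROT_WRITE,
		 MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
		 -1, 0);
	pthread_mutex_unlock(&map_lock);

	return p == MAP_FAILED ? -1 : 0;
}

(mmap with MAP_FIXED already replaces an existing mapping atomically, so the
separate munmap is kept only to mirror the steps listed above.)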
On Thu, 18 Jun 2015 16:47:33 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Thu, Jun 18, 2015 at 03:46:14PM +0200, Paolo Bonzini wrote:
> >
> >
> > On 18/06/2015 15:19, Michael S. Tsirkin wrote:
> > > On Thu, Jun 18, 2015 at 01:50:32PM +0200, Paolo Bonzini wrote:
> > >>
> > >>
> > >> On 18/06/2015 13:41, Michael S. Tsirkin wrote:
> > >>> On Thu, Jun 18, 2015 at 01:39:12PM +0200, Igor Mammedov wrote:
> > >>>> Lets leave decision upto users instead of making them live with
> > >>>> crashing guests.
> > >>>
> > >>> Come on, let's fix it in userspace.
> > >>
> > >> It's not trivial to fix it in userspace. Since QEMU uses RCU there
> > >> isn't a single memory map to use for a linear gpa->hva map.
> > >
> > > Could you elaborate?
> > >
> > > I'm confused by this mention of RCU.
> > > You use RCU for accesses to the memory map, correct?
> > > So memory map itself is a write side operation, as such all you need to
> > > do is take some kind of lock to prevent conflicting with other memory
> > > maps, do rcu sync under this lock.
> >
> > You're right, the problem isn't directly related to RCU. RCU would be
> > easy to handle by using synchronize_rcu instead of call_rcu. While I
> > identified an RCU-related problem with Igor's patches, it's much more
> > entrenched.
> >
> > RAM can be used by asynchronous operations while the VM runs, between
> > address_space_map and address_space_unmap. It is possible and common to
> > have a quiescent state between the map and unmap, and a memory map
> > change can happen in the middle of this. Normally this is not a
> > problem, because changes to the memory map do not make the hva go away
> > (memory regions are reference counted).
>
> Right, so you want mmap(MAP_NORESERVE) when that reference
> count becomes 0.
>
> > However, with Igor's patches a memory_region_del_subregion will cause a
> > mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
> >
> > I guess one way to do it would be to alias the same page in two places,
> > one for use by vhost and one for use by everything else. However, the
> > kernel does not provide the means to do this kind of aliasing for
> > anonymous mmaps.
> >
> > Paolo
>
> Basically pages go away on munmap, so won't simple
> lock
> munmap
> mmap(MAP_NORESERVE)
> unlock
> do the trick?
At what point are you suggesting to do this?
On 18/06/2015 16:47, Michael S. Tsirkin wrote:
>> However, with Igor's patches a memory_region_del_subregion will cause a
>> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
>>
>> I guess one way to do it would be to alias the same page in two places,
>> one for use by vhost and one for use by everything else. However, the
>> kernel does not provide the means to do this kind of aliasing for
>> anonymous mmaps.
>
> Basically pages go away on munmap, so won't simple
> lock
> munmap
> mmap(MAP_NORESERVE)
> unlock
> do the trick?
Not sure I follow. Here we have this:
VCPU 1                              VCPU 2                        I/O worker
----------------------------------------------------------------------------------------
take big QEMU lock
p = address_space_map(hva, len)
pass I/O request to worker thread
                                                                  read(fd, p, len)
release big QEMU lock
                                    memory_region_del_subregion
                                    mmap(MAP_NORESERVE)
                                                                  read returns EFAULT
                                                                  wake up VCPU 1
take big QEMU lock
EFAULT? What's that?
In another scenario you are less lucky: the memory accesses
between address_space_map/unmap aren't done in the kernel and
you get a plain old SIGSEGV.
This is not something that you can fix with a lock. The very
purpose of the map/unmap API is to do stuff asynchronously while
the lock is released.
Thanks,
Paolo
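For illustration, a self-contained sketch of why a lock alone cannot help
here: the worker stands in for any asynchronous user of a pointer obtained
from address_space_map, the remap is fully serialized by the lock, and the
data is still pulled out from under the worker (it reads zeroes; in the
scenarios above the symptom is EFAULT or SIGSEGV):

#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* the I/O worker: handed a pointer while the lock was held, but it
 * dereferences that pointer later, with no lock taken at all */
static void *worker(void *arg)
{
    char *p = arg;
    usleep(100 * 1000);                      /* I/O "in flight" */
    printf("worker sees 0x%02x (guest wrote 0xab)\n", (unsigned char)p[0]);
    return NULL;
}

int main(void)
{
    size_t len = 2 * 1024 * 1024;
    pthread_t t;

    char *hva = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(hva != MAP_FAILED);
    memset(hva, 0xab, len);

    pthread_mutex_lock(&big_lock);           /* take big QEMU lock      */
    pthread_create(&t, NULL, worker, hva);   /* pass request to worker  */
    pthread_mutex_unlock(&big_lock);         /* release big QEMU lock   */

    /* hot-unplug path: perfectly serialized, yet it still discards the
     * pages the worker is about to read */
    pthread_mutex_lock(&big_lock);
    mmap(hva, len, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE, -1, 0);
    pthread_mutex_unlock(&big_lock);

    pthread_join(t, NULL);                   /* prints 0x00, not 0xab */
    return 0;
}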
On Thu, Jun 18, 2015 at 06:02:46PM +0200, Paolo Bonzini wrote:
>
>
> On 18/06/2015 16:47, Michael S. Tsirkin wrote:
> >> However, with Igor's patches a memory_region_del_subregion will cause a
> >> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
> >>
> >> I guess one way to do it would be to alias the same page in two places,
> >> one for use by vhost and one for use by everything else. However, the
> >> kernel does not provide the means to do this kind of aliasing for
> >> anonymous mmaps.
> >
> > Basically pages go away on munmap, so won't simple
> > lock
> > munmap
> > mmap(MAP_NORESERVE)
> > unlock
> > do the trick?
>
> Not sure I follow. Here we have this:
>
> VCPU 1 VCPU 2 I/O worker
> ----------------------------------------------------------------------------------------
> take big QEMU lock
> p = address_space_map(hva, len)
> pass I/O request to worker thread
> read(fd, p, len)
> release big QEMU lock
>
> memory_region_del_subregion
> mmap(MAP_NORESERVE)
>
> read returns EFAULT
Why doesn't it EFAULT without mmap(MAP_NORESERVE)?
Doesn't memory_region_del_subregion free the memory?
> wake up VCPU 1
> take big QEMU lock
> EFAULT? What's that?
>
> In another scenario you are less lucky: the memory accesses
> between address_space_map/unmap aren't done in the kernel and
> you get a plain old SIGSEGV.
>
> This is not something that you can fix with a lock. The very
> purpose of the map/unmap API is to do stuff asynchronously while
> the lock is released.
>
> Thanks,
>
> Paolo
On 19/06/2015 09:56, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2015 at 06:02:46PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 18/06/2015 16:47, Michael S. Tsirkin wrote:
>>>> However, with Igor's patches a memory_region_del_subregion will cause a
>>>> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
>>>>
>>>> I guess one way to do it would be to alias the same page in two places,
>>>> one for use by vhost and one for use by everything else. However, the
>>>> kernel does not provide the means to do this kind of aliasing for
>>>> anonymous mmaps.
>>>
>>> Basically pages go away on munmap, so won't simple
>>> lock
>>> munmap
>>> mmap(MAP_NORESERVE)
>>> unlock
>>> do the trick?
>>
>> Not sure I follow. Here we have this:
>>
>> VCPU 1 VCPU 2 I/O worker
>> ----------------------------------------------------------------------------------------
>> take big QEMU lock
>> p = address_space_map(hva, len)
>> pass I/O request to worker thread
>> read(fd, p, len)
>> release big QEMU lock
>>
>> memory_region_del_subregion
>> mmap(MAP_NORESERVE)
>>
>> read returns EFAULT
>
> Why doesn't it EFAULT without mmap(MAP_NORESERVE)?
> Doesn't memory_region_del_subregion free the memory?
No, only destruction of the memory region frees it. address_space_map
takes a reference to the memory region and address_space_unmap releases it.
Paolo
>> wake up VCPU 1
>> take big QEMU lock
>> EFAULT? What's that?
>>
>> In another scenario you are less lucky: the memory accesses
>> between address_space_map/unmap aren't done in the kernel and
>> you get a plain old SIGSEGV.
>>
>> This is not something that you can fix with a lock. The very
>> purpose of the map/unmap API is to do stuff asynchronously while
>> the lock is released.
>>
>> Thanks,
>>
>> Paolo
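A self-contained sketch of the reference-counting rule described above
(hypothetical names, not QEMU's MemoryRegion code; single-threaded for
brevity):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct region {
    atomic_int refs;     /* one ref while plugged, plus one per live map */
    bool unplugged;      /* set by the hot-unplug path */
};

static void region_ref(struct region *r)
{
    atomic_fetch_add(&r->refs, 1);
}

static void region_unref(struct region *r)
{
    /* the backing is only reclaimed once the region is unplugged *and*
     * the last in-flight user has dropped its reference */
    if (atomic_fetch_sub(&r->refs, 1) == 1 && r->unplugged)
        printf("last reference after unplug: backing can go away now\n");
}

int main(void)
{
    struct region ram = { ATOMIC_VAR_INIT(1), false };  /* plugged */

    region_ref(&ram);        /* address_space_map analogue              */
    ram.unplugged = true;    /* memory_region_del_subregion +           */
    region_unref(&ram);      /* object_unparent drop the "plugged" ref  */
    /* ... asynchronous I/O may still use the mapping here ...          */
    region_unref(&ram);      /* address_space_unmap analogue: only now  */
                             /* may the memory actually be discarded    */
    return 0;
}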
On Fri, Jun 19, 2015 at 09:57:22AM +0200, Paolo Bonzini wrote:
>
>
> On 19/06/2015 09:56, Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2015 at 06:02:46PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 18/06/2015 16:47, Michael S. Tsirkin wrote:
> >>>> However, with Igor's patches a memory_region_del_subregion will cause a
> >>>> mmap(MAP_NORESERVE), which _does_ have the effect of making the hva go away.
> >>>>
> >>>> I guess one way to do it would be to alias the same page in two places,
> >>>> one for use by vhost and one for use by everything else. However, the
> >>>> kernel does not provide the means to do this kind of aliasing for
> >>>> anonymous mmaps.
> >>>
> >>> Basically pages go away on munmap, so won't simple
> >>> lock
> >>> munmap
> >>> mmap(MAP_NORESERVE)
> >>> unlock
> >>> do the trick?
> >>
> >> Not sure I follow. Here we have this:
> >>
> >> VCPU 1 VCPU 2 I/O worker
> >> ----------------------------------------------------------------------------------------
> >> take big QEMU lock
> >> p = address_space_map(hva, len)
> >> pass I/O request to worker thread
> >> read(fd, p, len)
> >> release big QEMU lock
> >>
> >> memory_region_del_subregion
> >> mmap(MAP_NORESERVE)
> >>
> >> read returns EFAULT
> >
> > Why doesn't it EFAULT without mmap(MAP_NORESERVE)?
> > Doesn't memory_region_del_subregion free the memory?
>
> No, only destruction of the memory region frees it. address_space_map
> takes a reference to the memory region and address_space_unmap releases it.
>
> Paolo
Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
after we detect refcount is 0?
> >> wake up VCPU 1
> >> take big QEMU lock
> >> EFAULT? What's that?
> >>
> >> In another scenario you are less lucky: the memory accesses
> >> between address_space_map/unmap aren't done in the kernel and
> >> you get a plain old SIGSEGV.
> >>
> >> This is not something that you can fix with a lock. The very
> >> purpose of the map/unmap API is to do stuff asynchronously while
> >> the lock is released.
> >>
> >> Thanks,
> >>
> >> Paolo
On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> > No, only destruction of the memory region frees it. address_space_map
> > takes a reference to the memory region and address_space_unmap releases it.
> >
> > Paolo
>
> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> after we detect refcount is 0?
No, because in the meanwhile another DIMM could have been hotplugged
at the same place where the old one was. This is legal:
     user                         guest                       QEMU
----------------------------------------------------------------------------------------
                                  start I/O
                                      '---------------> address_space_map
     device_del
         '-------------------> receives SCI
                                  executes _EJ0
                                      '---------------> memory_region_del_subregion
                                                           object_unparent
     device_add
         '-----------------------------------------> device_set_realized
                                                           hotplug_handler_plug
                                                             pc_machine_device_plug_cb
                                                               pc_dimm_plug
                                                                 memory_region_add_subregion
                                                           I/O finishes
                                                           address_space_unmap
Surprise removal similarly could be done in QEMU, but it will hold on to
some resources for as long as the device backends need them.
Paolo
On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
>
>
> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> > > No, only destruction of the memory region frees it. address_space_map
> > > takes a reference to the memory region and address_space_unmap releases it.
> > >
> > > Paolo
> >
> > Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> > after we detect refcount is 0?
>
> No, because in the meanwhile another DIMM could have been hotplugged
> at the same place where the old one was. This is legal:
>
> user guest QEMU
> ----------------------------------------------------------------------------------------
> start I/O
> '---------------> address_space_map
> device_del
> '-------------------> receives SCI
> executes _EJ0
> '---------------> memory_region_del_subregion
> object_unparent
So guest started DMA into memory, then ejected this memory while DMA
is in progress?
> device_add
> '-----------------------------------------> device_set_realized
> hotplug_handler_plug
> pc_machine_device_plug_cb
> pc_dimm_plug
> memory_region_add_subregion
>
> I/O finishes
> address_space_unmap
>
> Surprise removal similarly could be done in QEMU, but it will hold on to
> some resources for as long as the device backends need them.
>
> Paolo
On 19/06/2015 12:14, Michael S. Tsirkin wrote:
> On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
>>
>>
>> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
>>>> No, only destruction of the memory region frees it. address_space_map
>>>> takes a reference to the memory region and address_space_unmap releases it.
>>>>
>>>> Paolo
>>>
>>> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
>>> after we detect refcount is 0?
>>
>> No, because in the meanwhile another DIMM could have been hotplugged
>> at the same place where the old one was. This is legal:
>>
>> user guest QEMU
>> ----------------------------------------------------------------------------------------
>> start I/O
>> '---------------> address_space_map
>> device_del
>> '-------------------> receives SCI
>> executes _EJ0
>> '---------------> memory_region_del_subregion
>> object_unparent
>
> So guest started DMA into memory, then ejected this memory while DMA
> is in progress?
Yes. There is nothing that forbids doing that.
Paolo
>> device_add
>> '-----------------------------------------> device_set_realized
>> hotplug_handler_plug
>> pc_machine_device_plug_cb
>> pc_dimm_plug
>> memory_region_add_subregion
>>
>> I/O finishes
>> address_space_unmap
>>
>> Surprise removal similarly could be done in QEMU, but it will hold on to
>> some resources for as long as the device backends need them.
>>
>> Paolo
On Fri, Jun 19, 2015 at 12:44:26PM +0200, Paolo Bonzini wrote:
>
>
> On 19/06/2015 12:14, Michael S. Tsirkin wrote:
> > On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> >>>> No, only destruction of the memory region frees it. address_space_map
> >>>> takes a reference to the memory region and address_space_unmap releases it.
> >>>>
> >>>> Paolo
> >>>
> >>> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> >>> after we detect refcount is 0?
> >>
> >> No, because in the meanwhile another DIMM could have been hotplugged
> >> at the same place where the old one was. This is legal:
> >>
> >> user guest QEMU
> >> ----------------------------------------------------------------------------------------
> >> start I/O
> >> '---------------> address_space_map
> >> device_del
> >> '-------------------> receives SCI
> >> executes _EJ0
> >> '---------------> memory_region_del_subregion
> >> object_unparent
> >
> > So guest started DMA into memory, then ejected this memory while DMA
> > is in progress?
>
> Yes. There is nothing that forbids doing that.
>
> Paolo
Can we simply defer the next device_add using a hva until all IO completes?
> >> device_add
> >> '-----------------------------------------> device_set_realized
> >> hotplug_handler_plug
> >> pc_machine_device_plug_cb
> >> pc_dimm_plug
> >> memory_region_add_subregion
> >>
> >> I/O finishes
> >> address_space_unmap
> >>
> >> Surprise removal similarly could be done in QEMU, but it will hold on to
> >> some resources for as long as the device backends need them.
> >>
> >> Paolo
On 19/06/2015 15:34, Michael S. Tsirkin wrote:
> On Fri, Jun 19, 2015 at 12:44:26PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 19/06/2015 12:14, Michael S. Tsirkin wrote:
>>> On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
>>>>>> No, only destruction of the memory region frees it. address_space_map
>>>>>> takes a reference to the memory region and address_space_unmap releases it.
>>>>>>
>>>>>> Paolo
>>>>>
>>>>> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
>>>>> after we detect refcount is 0?
>>>>
>>>> No, because in the meanwhile another DIMM could have been hotplugged
>>>> at the same place where the old one was. This is legal:
>>>>
>>>> user guest QEMU
>>>> ----------------------------------------------------------------------------------------
>>>> start I/O
>>>> '---------------> address_space_map
>>>> device_del
>>>> '-------------------> receives SCI
>>>> executes _EJ0
>>>> '---------------> memory_region_del_subregion
>>>> object_unparent
>>>
>>> So guest started DMA into memory, then ejected this memory while DMA
>>> is in progress?
>>
>> Yes. There is nothing that forbids doing that.
>
> Can we simply defer the next device_add using a hva until all IO completes?
We could, but I/O is just an example. It can be I/O, a network ring,
whatever. We cannot audit all address_space_map uses.
Paolo
On Fri, Jun 19, 2015 at 05:19:44PM +0200, Paolo Bonzini wrote:
>
>
> On 19/06/2015 15:34, Michael S. Tsirkin wrote:
> > On Fri, Jun 19, 2015 at 12:44:26PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 19/06/2015 12:14, Michael S. Tsirkin wrote:
> >>> On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
> >>>>
> >>>>
> >>>> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> >>>>>> No, only destruction of the memory region frees it. address_space_map
> >>>>>> takes a reference to the memory region and address_space_unmap releases it.
> >>>>>>
> >>>>>> Paolo
> >>>>>
> >>>>> Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> >>>>> after we detect refcount is 0?
> >>>>
> >>>> No, because in the meanwhile another DIMM could have been hotplugged
> >>>> at the same place where the old one was. This is legal:
> >>>>
> >>>> user guest QEMU
> >>>> ----------------------------------------------------------------------------------------
> >>>> start I/O
> >>>> '---------------> address_space_map
> >>>> device_del
> >>>> '-------------------> receives SCI
> >>>> executes _EJ0
> >>>> '---------------> memory_region_del_subregion
> >>>> object_unparent
> >>>
> >>> So guest started DMA into memory, then ejected this memory while DMA
> >>> is in progress?
> >>
> >> Yes. There is nothing that forbids doing that.
> >
> > Can we simply defer the next device_add using a hva until all IO completes?
>
> We could, but I/O is just an example. It can be I/O, a network ring,
> whatever. We cannot audit all address_space_map uses.
>
> Paolo
No need to audit them all: defer device_add using an hva range until
address_space_unmap on hvas in that range drops the reference count to
0.
--
MST
On 19/06/2015 18:20, Michael S. Tsirkin wrote:
> > We could, but I/O is just an example. It can be I/O, a network ring,
> > whatever. We cannot audit all address_space_map uses.
> >
>
> No need to audit them all: defer device_add using an hva range until
> address_space_unmap on hvas in that range drops the reference count to
> 0.
That could be forever. You certainly don't want to lock up the monitor
forever just because a device model isn't too friendly to memory hot-unplug.
That's why you need to audit them (also, it's perfectly in the device
model's right to use address_space_unmap this way: it's the guest that's
buggy and leaves a dangling reference to a region before unplugging it).
Paolo
On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
>
>
> On 19/06/2015 18:20, Michael S. Tsirkin wrote:
> > > We could, but I/O is just an example. It can be I/O, a network ring,
> > > whatever. We cannot audit all address_space_map uses.
> > >
> >
> > No need to audit them all: defer device_add using an hva range until
> > address_space_unmap on hvas in that range drops the reference count to
> > 0.
>
> That could be forever. You certainly don't want to lock up the monitor
> forever just because a device model isn't too friendly to memory hot-unplug.
We can defer the addition, no need to lock up the monitor.
> That's why you need to audit them (also, it's perfectly in the device
> model's right to use address_space_unmap this way: it's the guest that's
> buggy and leaves a dangling reference to a region before unplugging it).
>
> Paolo
Then maybe it's not too bad that the guest will crash because the memory
was unmapped.
--
MST
On 19/06/2015 18:33, Michael S. Tsirkin wrote:
> On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 19/06/2015 18:20, Michael S. Tsirkin wrote:
>>>> We could, but I/O is just an example. It can be I/O, a network ring,
>>>> whatever. We cannot audit all address_space_map uses.
>>>>
>>>
>>> No need to audit them all: defer device_add using an hva range until
>>> address_space_unmap on hvas in that range drops the reference count to
>>> 0.
>>
>> That could be forever. You certainly don't want to lock up the monitor
>> forever just because a device model isn't too friendly to memory hot-unplug.
>
> We can defer the addition, no need to lock up the monitor.
Patches are welcome.
>> That's why you need to audit them (also, it's perfectly in the device
>> model's right to use address_space_unmap this way: it's the guest that's
>> buggy and leaves a dangling reference to a region before unplugging it).
>
> Then maybe it's not too bad that the guest will crash because the memory
> was unmapped.
That's a matter of taste. I strongly prefer using 12K extra memory per
VCPU to a guest crash.
Paolo
On Fri, Jun 19, 2015 at 10:52:47AM +0200, Paolo Bonzini wrote:
>
>
> On 19/06/2015 10:05, Michael S. Tsirkin wrote:
> > > No, only destruction of the memory region frees it. address_space_map
> > > takes a reference to the memory region and address_space_unmap releases it.
> > >
> > > Paolo
> >
> > Confused. So can we call mmap(MAP_NORESERVE) in address_space_unmap
> > after we detect refcount is 0?
>
> No, because in the meanwhile another DIMM could have been hotplugged
> at the same place where the old one was. This is legal:
>
> user guest QEMU
> ----------------------------------------------------------------------------------------
> start I/O
> '---------------> address_space_map
> device_del
> '-------------------> receives SCI
> executes _EJ0
> '---------------> memory_region_del_subregion
> object_unparent
> device_add
> '-----------------------------------------> device_set_realized
> hotplug_handler_plug
> pc_machine_device_plug_cb
> pc_dimm_plug
> memory_region_add_subregion
>
> I/O finishes
> address_space_unmap
>
> Surprise removal similarly could be done in QEMU, but it will hold on to
> some resources for as long as the device backends need them.
>
> Paolo
OK so what's the problem with checking for this condition,
after address_space_unmap detects that ref count is 0
and before calling mmap(MAP_NORESERVE)?
At that point we are off the data path so we can take locks.
--
MST
On 19/06/2015 18:45, Michael S. Tsirkin wrote:
> > user guest QEMU
> > ----------------------------------------------------------------------------------------
> > start I/O
> > '---------------> address_space_map
> > device_del
> > '-------------------> receives SCI
> > executes _EJ0
> > '---------------> memory_region_del_subregion
> > object_unparent
> > device_add
> > '-----------------------------------------> device_set_realized
> > hotplug_handler_plug
> > pc_machine_device_plug_cb
> > pc_dimm_plug
> > memory_region_add_subregion
> >
> > I/O finishes
> > address_space_unmap
> >
>
> OK so what's the problem with checking for this condition,
> after address_space_unmap detects that ref count is 0
> and before calling mmap(MAP_NORESERVE)?
Should be okay. However, someone still needs to figure out the part
where device_add is delayed until the address_space_unmap. Otherwise,
the device_add's mmap() overlays the older one and the delayed
mmap(MAP_NORESERVE) overlays the device_add's mmap().
It seems like a colossal waste of time to perfect a little-used path
that no doubt will have bugs in it, some of which could perhaps have
security consequences.
Also, QEMU is not the only vhost client, so we're just scratching the
surface here.
Paolo
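A self-contained illustration of that overlay problem (hypothetical sizes and
fill patterns; the second MAP_FIXED mapping stands in for the new DIMM that
reuses the range, and the final MAP_NORESERVE call is the deferred discard):

#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;

    /* old DIMM */
    char *hva = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(hva != MAP_FAILED);
    memset(hva, 0xaa, len);

    /* guest ejects the DIMM, but the discard is deferred because an
     * asynchronous user still holds a reference */

    /* new DIMM hotplugged; its mmap() overlays the old range */
    mmap(hva, len, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    memset(hva, 0xbb, len);                 /* new guest data */

    /* the deferred mmap(MAP_NORESERVE) finally runs when the refcount
     * hits 0 and wipes the *new* DIMM's contents */
    mmap(hva, len, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE, -1, 0);
    assert(hva[0] == 0);                    /* 0xbb is gone */
    return 0;
}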
On Fri, 19 Jun 2015 18:33:39 +0200
"Michael S. Tsirkin" <[email protected]> wrote:
> On Fri, Jun 19, 2015 at 06:26:27PM +0200, Paolo Bonzini wrote:
> >
> >
> > On 19/06/2015 18:20, Michael S. Tsirkin wrote:
> > > > We could, but I/O is just an example. It can be I/O, a network ring,
> > > > whatever. We cannot audit all address_space_map uses.
> > > >
> > >
> > > No need to audit them all: defer device_add using an hva range until
> > > address_space_unmap on hvas in that range drops the reference count to
> > > 0.
> >
> > That could be forever. You certainly don't want to lock up the monitor
> > forever just because a device model isn't too friendly to memory hot-unplug.
>
> We can defer the addition, no need to lock up the monitor.
>
> > That's why you need to audit them (also, it's perfectly in the device
> > model's right to use address_space_unmap this way: it's the guest that's
> > buggy and leaves a dangling reference to a region before unplugging it).
> >
> > Paolo
>
> Then maybe it's not too bad that the guest will crash because the memory
> was unmapped.
So far the HVA approach is unusable even if we make this assumption and let the guest crash.
virtio-net doesn't work with it anyway: translation of GPA to HVA for descriptors works as
expected (correctly), but vhost+HVA-hack-backed virtio still can't send/receive packets.
That's why I prefer to merge the kernel solution first, as a stable fix that doesn't
introduce any issues, and to work on the userspace approach on top of that.
Hopefully it can be done, but we would still need time to iron out the side
effects/issues it causes or could cause, so that the fix becomes stable enough
for production.
On 22/06/2015 09:10, Igor Mammedov wrote:
> So far the HVA approach is unusable even if we make this assumption and let the guest crash.
> virtio-net doesn't work with it anyway: translation of GPA to HVA for descriptors works as
> expected (correctly), but vhost+HVA-hack-backed virtio still can't send/receive packets.
>
> That's why I prefer to merge the kernel solution first, as a stable fix that doesn't
> introduce any issues, and to work on the userspace approach on top of that.
Also, let's do some math.
Let's assume 3 network devices per VM, one vhost device per queue, one
queue per VCPU per network device. Let's assume the host is
overcommitted 3:1.
Thus we have 3*3 = 9 times as many vhost devices as physical CPUs.
At 12K each, that's 108K per physical CPU.
From a relative point of view, and assuming 1 GB of memory per physical
CPU (a pretty low amount if you're overcommitting CPU 3:1), this is 0.01%
of the total memory.
From an absolute point of view, it takes a system with 60 physical CPUs
to reach the same memory usage as the vmlinuz binary of a typical distro
kernel (not counting the modules).
Paolo
> Hopefully it can be done, but we would still need time to iron out the side
> effects/issues it causes or could cause, so that the fix becomes stable enough
> for production.
>