2014-06-02 21:29:47

by Michael S. Tsirkin

Subject: [PULL 0/2] vhost enhancements for 3.16

Reposting with actual patches included.

The following changes since commit 96b2e73c5471542cb9c622c4360716684f8797ed:

Revert "net/mlx4_en: Use affinity hint" (2014-06-02 00:18:48 -0700)

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-next

for you to fetch changes up to 2ae76693b8bcabf370b981cd00c36cd41d33fabc:

vhost: replace rcu with mutex (2014-06-02 23:47:59 +0300)

----------------------------------------------------------------
Michael S. Tsirkin (2):
vhost-net: extend device allocation to vmalloc
vhost: replace rcu with mutex

drivers/vhost/net.c | 23 ++++++++++++++++++-----
drivers/vhost/vhost.c | 10 +++++++++-
2 files changed, 27 insertions(+), 6 deletions(-)


2014-06-02 21:29:56

by Michael S. Tsirkin

Subject: [PULL 2/2] vhost: replace rcu with mutex

All memory accesses are done under some VQ mutex.
So lock/unlock all VQs is a faster equivalent of synchronize_rcu()
for memory access changes.
Some guests cause a lot of these changes, so it's helpful
to make them faster.

Reported-by: "Gonglei (Arei)" <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/vhost.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 78987e4..1c05e60 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -593,6 +593,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
 	struct vhost_memory mem, *newmem, *oldmem;
 	unsigned long size = offsetof(struct vhost_memory, regions);
+	int i;
 
 	if (copy_from_user(&mem, m, size))
 		return -EFAULT;
@@ -619,7 +620,14 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 	oldmem = rcu_dereference_protected(d->memory,
 					   lockdep_is_held(&d->mutex));
 	rcu_assign_pointer(d->memory, newmem);
-	synchronize_rcu();
+
+	/* All memory accesses are done under some VQ mutex.
+	 * So below is a faster equivalent of synchronize_rcu()
+	 */
+	for (i = 0; i < d->nvqs; ++i) {
+		mutex_lock(&d->vqs[i]->mutex);
+		mutex_unlock(&d->vqs[i]->mutex);
+	}
 	kfree(oldmem);
 	return 0;
 }
--
MST
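
For readers following the thread, here is a minimal userspace sketch of the
pattern in this patch, assuming POSIX threads and C11 atomics stand in for the
kernel's mutexes and rcu_assign_pointer(). Every name below (struct dev,
struct vq, struct mem_table, dev_init, vq_read_nregions, set_memory) is
hypothetical and is not the vhost API. The writer publishes the new table,
then locks and unlocks each per-queue mutex once. Any reader that could still
see the old table must have been holding one of those mutexes, so once the
loop finishes the old table can be freed. The sketch assumes one writer at a
time, which the kernel code gets from holding d->mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NVQS 2  /* hypothetical number of virtqueues */

struct mem_table { int nregions; };

struct vq {
    pthread_mutex_t mutex;  /* held by a reader while it uses the table */
};

struct dev {
    struct vq vqs[NVQS];
    _Atomic(struct mem_table *) memory;  /* currently published table */
};

static void dev_init(struct dev *d, int nregions)
{
    struct mem_table *m = malloc(sizeof *m);
    assert(m != NULL);
    m->nregions = nregions;
    for (int i = 0; i < NVQS; ++i)
        pthread_mutex_init(&d->vqs[i].mutex, NULL);
    atomic_init(&d->memory, m);
}

/* Reader: loads and uses the table only while holding its own vq mutex. */
static int vq_read_nregions(struct dev *d, int i)
{
    pthread_mutex_lock(&d->vqs[i].mutex);
    struct mem_table *m = atomic_load(&d->memory);
    int n = m->nregions;
    pthread_mutex_unlock(&d->vqs[i].mutex);
    return n;
}

/* Writer: publish the new table, wait out every reader, then reclaim. */
static void set_memory(struct dev *d, int nregions)
{
    struct mem_table *newmem = malloc(sizeof *newmem);
    assert(newmem != NULL);
    newmem->nregions = nregions;

    struct mem_table *oldmem = atomic_exchange(&d->memory, newmem);

    /* A reader holding vqs[i].mutex may still be using oldmem. Taking and
     * dropping each mutex once guarantees all such readers have finished. */
    for (int i = 0; i < NVQS; ++i) {
        pthread_mutex_lock(&d->vqs[i].mutex);
        pthread_mutex_unlock(&d->vqs[i].mutex);
    }
    free(oldmem);
}
```

Readers on different queues never contend with each other here, and they
contend with the writer only for the brief lock-and-unlock at the end of
set_memory().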

2014-06-02 21:29:55

by Michael S. Tsirkin

Subject: [PULL 1/2] vhost-net: extend device allocation to vmalloc

Michael Mueller provided a patch to reduce the size of
vhost-net structure as some allocations could fail under
memory pressure/fragmentation. We are still left with
high order allocations though.

This patch is handling the problem at the core level, allowing
vhost structures to use vmalloc() if kmalloc() failed.

As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT
to the kmalloc() flags so that this fallback happens only when really needed.

People are still looking at cleaner ways to handle the problem
at the API level, probably passing in multiple iovecs.
This hack seems consistent with approaches
taken since then by drivers/vhost/scsi.c and net/core/dev.c

Based on patch by Romain Francoise.

Cc: Michael Mueller <[email protected]>
Signed-off-by: Romain Francoise <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/net.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index be414d2b..e489161 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,6 +17,7 @@
 #include <linux/workqueue.h>
 #include <linux/file.h>
 #include <linux/slab.h>
+#include <linux/vmalloc.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -699,18 +700,30 @@ static void handle_rx_net(struct vhost_work *work)
 	handle_rx(net);
 }
 
+static void vhost_net_free(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		vfree(addr);
+	else
+		kfree(addr);
+}
+
 static int vhost_net_open(struct inode *inode, struct file *f)
 {
-	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct vhost_net *n;
 	struct vhost_dev *dev;
 	struct vhost_virtqueue **vqs;
 	int i;
 
-	if (!n)
-		return -ENOMEM;
+	n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	if (!n) {
+		n = vmalloc(sizeof *n);
+		if (!n)
+			return -ENOMEM;
+	}
 	vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
 	if (!vqs) {
-		kfree(n);
+		vhost_net_free(n);
 		return -ENOMEM;
 	}
 
@@ -827,7 +840,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
 	kfree(n->dev.vqs);
-	kfree(n);
+	vhost_net_free(n);
 	return 0;
 }

--
MST
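
The shape of this patch (try the preferred allocator, fall back to a second
one, and pick the matching free routine from the address alone) can be
modelled in userspace. The sketch below is only an analogy under stated
assumptions: a hypothetical one-slot static arena plays the role of kmalloc(),
malloc() plays the role of vmalloc(), and is_arena_addr() plays the role of
is_vmalloc_addr() with the sense inverted (it identifies the preferred
allocator, not the fallback). None of these names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ARENA_SIZE 256  /* hypothetical capacity of the preferred allocator */

static _Alignas(max_align_t) unsigned char arena[ARENA_SIZE];
static bool arena_busy;  /* the arena holds at most one object at a time */

/* Preferred allocator: fails when the slot is taken or the request is too big. */
static void *arena_alloc(size_t size)
{
    if (arena_busy || size > ARENA_SIZE)
        return NULL;
    arena_busy = true;
    return arena;
}

/* Address-range test, used to decide which free routine applies. */
static bool is_arena_addr(const void *addr)
{
    uintptr_t p = (uintptr_t)addr;
    uintptr_t base = (uintptr_t)arena;
    return p >= base && p < base + ARENA_SIZE;
}

/* Try the preferred allocator first and fall back only if it fails. */
static void *obj_alloc(size_t size)
{
    assert(size > 0);
    void *p = arena_alloc(size);
    if (!p)
        p = malloc(size);
    return p;
}

/* The caller does not need to remember which allocator served the request. */
static void obj_free(void *addr)
{
    if (is_arena_addr(addr))
        arena_busy = false;
    else
        free(addr);
}
```

As in vhost_net_free(), the single free helper keeps every call site unaware
of which path the allocation took.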

2014-06-02 21:58:06

by Eric Dumazet

Subject: Re: [PULL 2/2] vhost: replace rcu with mutex

On Tue, 2014-06-03 at 00:30 +0300, Michael S. Tsirkin wrote:
> All memory accesses are done under some VQ mutex.
> So lock/unlock all VQs is a faster equivalent of synchronize_rcu()
> for memory access changes.
> Some guests cause a lot of these changes, so it's helpful
> to make them faster.
>
> Reported-by: "Gonglei (Arei)" <[email protected]>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> ---
> drivers/vhost/vhost.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 78987e4..1c05e60 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -593,6 +593,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> {
> struct vhost_memory mem, *newmem, *oldmem;
> unsigned long size = offsetof(struct vhost_memory, regions);
> + int i;
>
> if (copy_from_user(&mem, m, size))
> return -EFAULT;
> @@ -619,7 +620,14 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> oldmem = rcu_dereference_protected(d->memory,
> lockdep_is_held(&d->mutex));
> rcu_assign_pointer(d->memory, newmem);
> - synchronize_rcu();
> +
> + /* All memory accesses are done under some VQ mutex.
> + * So below is a faster equivalent of synchronize_rcu()
> + */
> + for (i = 0; i < d->nvqs; ++i) {
> + mutex_lock(&d->vqs[i]->mutex);
> + mutex_unlock(&d->vqs[i]->mutex);
> + }
> kfree(oldmem);
> return 0;
> }

This looks dubious.

What about using kfree_rcu() instead?

translate_desc() still uses rcu_read_lock(); it's not clear if the mutex
is really held.


2014-06-03 12:48:40

by Paolo Bonzini

Subject: Re: [PULL 2/2] vhost: replace rcu with mutex

Il 02/06/2014 23:58, Eric Dumazet ha scritto:
> This looks dubious
>
> What about using kfree_rcu() instead ?

It would lead to unbounded allocation from userspace.

> translate_desc() still uses rcu_read_lock(), its not clear if the mutex
> is really held.

Yes, vhost_get_vq_desc must be called with the vq mutex held.

The rcu_read_lock/unlock in translate_desc is unnecessary.

Paolo

2014-06-03 13:35:22

by Vlad Yasevich

Subject: Re: [PULL 2/2] vhost: replace rcu with mutex

On 06/03/2014 08:48 AM, Paolo Bonzini wrote:
> Il 02/06/2014 23:58, Eric Dumazet ha scritto:
>> This looks dubious
>>
>> What about using kfree_rcu() instead ?
>
> It would lead to unbound allocation from userspace.
>
>> translate_desc() still uses rcu_read_lock(), its not clear if the mutex
>> is really held.
>
> Yes, vhost_get_vq_desc must be called with the vq mutex held.
>
> The rcu_read_lock/unlock in translate_desc is unnecessary.
>

If that's true, then does dev->memory really need to be rcu protected?
It appears to always be read under mutex.

-vlad

> Paolo

2014-06-03 13:55:55

by Paolo Bonzini

Subject: Re: [PULL 2/2] vhost: replace rcu with mutex

Il 03/06/2014 15:35, Vlad Yasevich ha scritto:
> > Yes, vhost_get_vq_desc must be called with the vq mutex held.
> >
> > The rcu_read_lock/unlock in translate_desc is unnecessary.
>
> If that's true, then does dev->memory really needs to be rcu protected?
> It appears to always be read under mutex.

It's always read under one of many mutexes, yes.

However, it's still RCU-like in the sense that you separate the removal
and reclamation phases so you still need rcu_dereference/rcu_assign_pointer.

With this mechanism, readers do not contend the mutexes with the
VHOST_SET_MEMORY ioctl, except for the very short lock-and-unlock
sequence at the end of it. They also never contend the mutexes between
themselves (which would be the case if VHOST_SET_MEMORY locked all the
mutexes).

You could also wrap all virtqueue processing with a rwsem and take the
rwsem for write in VHOST_SET_MEMORY. That simplifies some things, but it
has two drawbacks:

- it unnecessarily complicates the code for all users of vhost_get_vq_desc

- suppose the reader-writer lock is fair, and VHOST_SET_MEMORY places a
writer in the queue. Then a long-running reader R1 could still block
another reader R2, because the writer would be served before R2.


The RCU-like approach avoids all this, which is important because of the
generally simpler code and because VHOST_SET_MEMORY is the only vhost
ioctl that can happen in the hot path.

Paolo

2014-06-03 13:57:47

by Eric Dumazet

Subject: Re: [PULL 2/2] vhost: replace rcu with mutex

On Tue, 2014-06-03 at 14:48 +0200, Paolo Bonzini wrote:
> Il 02/06/2014 23:58, Eric Dumazet ha scritto:
> > This looks dubious
> >
> > What about using kfree_rcu() instead ?
>
> It would lead to unbound allocation from userspace.

Look at how we did this in commit
c3059477fce2d956a0bb3e04357324780c5d8eeb

>
> > translate_desc() still uses rcu_read_lock(), its not clear if the mutex
> > is really held.
>
> Yes, vhost_get_vq_desc must be called with the vq mutex held.
>
> The rcu_read_lock/unlock in translate_desc is unnecessary.

Yep, this is what I pointed out. This is not only unnecessary, but
confusing, and might be incorrectly copy/pasted in the future.

This patch is a partial one and leaves confusion.

Some places use the proper

mp = rcu_dereference_protected(dev->memory,
lockdep_is_held(&dev->mutex));

others use the now-incorrect:

rcu_read_lock();
mp = rcu_dereference(dev->memory);
...


2014-06-03 14:20:38

by Paolo Bonzini

Subject: Re: [PULL 2/2] vhost: replace rcu with mutex

Il 03/06/2014 15:57, Eric Dumazet ha scritto:
> On Tue, 2014-06-03 at 14:48 +0200, Paolo Bonzini wrote:
>> Il 02/06/2014 23:58, Eric Dumazet ha scritto:
>>> This looks dubious
>>>
>>> What about using kfree_rcu() instead ?
>>
>> It would lead to unbound allocation from userspace.
>
> Look at how we did this in commit
> c3059477fce2d956a0bb3e04357324780c5d8eeb

That would make VHOST_SET_MEMORY as slow as before (albeit only once
every few calls).

>>> translate_desc() still uses rcu_read_lock(), its not clear if the mutex
>>> is really held.
>>
>> Yes, vhost_get_vq_desc must be called with the vq mutex held.
>>
>> The rcu_read_lock/unlock in translate_desc is unnecessary.
>
> Yep, this is what I pointed out. This is not only necessary, but
> confusing and might be incorrectly copy/pasted in the future.
>
> This patch is a partial one and leaves confusion.

I agree.

Paolo

2014-06-04 18:12:41

by Michael S. Tsirkin

Subject: Re: [PULL 2/2] vhost: replace rcu with mutex

On Mon, Jun 02, 2014 at 02:58:00PM -0700, Eric Dumazet wrote:
> On Tue, 2014-06-03 at 00:30 +0300, Michael S. Tsirkin wrote:
> > All memory accesses are done under some VQ mutex.
> > So lock/unlock all VQs is a faster equivalent of synchronize_rcu()
> > for memory access changes.
> > Some guests cause a lot of these changes, so it's helpful
> > to make them faster.
> >
> > Reported-by: "Gonglei (Arei)" <[email protected]>
> > Signed-off-by: Michael S. Tsirkin <[email protected]>
> > ---
> > drivers/vhost/vhost.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 78987e4..1c05e60 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -593,6 +593,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> > {
> > struct vhost_memory mem, *newmem, *oldmem;
> > unsigned long size = offsetof(struct vhost_memory, regions);
> > + int i;
> >
> > if (copy_from_user(&mem, m, size))
> > return -EFAULT;
> > @@ -619,7 +620,14 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> > oldmem = rcu_dereference_protected(d->memory,
> > lockdep_is_held(&d->mutex));
> > rcu_assign_pointer(d->memory, newmem);
> > - synchronize_rcu();
> > +
> > + /* All memory accesses are done under some VQ mutex.
> > + * So below is a faster equivalent of synchronize_rcu()
> > + */
> > + for (i = 0; i < d->nvqs; ++i) {
> > + mutex_lock(&d->vqs[i]->mutex);
> > + mutex_unlock(&d->vqs[i]->mutex);
> > + }
> > kfree(oldmem);
> > return 0;
> > }
>
> This looks dubious
>
> What about using kfree_rcu() instead ?

Unfortunately userspace relies on the fact that no one
uses the old mappings by the time ioctl returns.
The issue isn't freeing the memory.

> translate_desc() still uses rcu_read_lock(), its not clear if the mutex
> is really held.
>

Thanks, good point. We can drop that rcu_read_lock() now, but I think this
could be a patch on top.

--
MST

2014-06-04 19:50:51

by Michael S. Tsirkin

Subject: Re: [PULL 2/2] vhost: replace rcu with mutex

On Tue, Jun 03, 2014 at 06:57:43AM -0700, Eric Dumazet wrote:
> On Tue, 2014-06-03 at 14:48 +0200, Paolo Bonzini wrote:
> > Il 02/06/2014 23:58, Eric Dumazet ha scritto:
> > > This looks dubious
> > >
> > > What about using kfree_rcu() instead ?
> >
> > It would lead to unbound allocation from userspace.
>
> Look at how we did this in commit
> c3059477fce2d956a0bb3e04357324780c5d8eeb
>
> >
> > > translate_desc() still uses rcu_read_lock(), its not clear if the mutex
> > > is really held.
> >
> > Yes, vhost_get_vq_desc must be called with the vq mutex held.
> >
> > The rcu_read_lock/unlock in translate_desc is unnecessary.
>
> Yep, this is what I pointed out. This is not only necessary, but
> confusing and might be incorrectly copy/pasted in the future.
>
> This patch is a partial one and leaves confusion.
>
> Some places uses the proper
>
> mp = rcu_dereference_protected(dev->memory,
> lockdep_is_held(&dev->mutex));
>
> others use the now incorrect :
>
> rcu_read_lock();
> mp = rcu_dereference(dev->memory);
> ...
>

I agree, working on a cleanup patch on top now.

--
MST

2014-06-05 10:44:46

by Michael S. Tsirkin

Subject: Re: [PULL 2/2] vhost: replace rcu with mutex

On Wed, Jun 04, 2014 at 10:51:12PM +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 03, 2014 at 06:57:43AM -0700, Eric Dumazet wrote:
> > On Tue, 2014-06-03 at 14:48 +0200, Paolo Bonzini wrote:
> > > Il 02/06/2014 23:58, Eric Dumazet ha scritto:
> > > > This looks dubious
> > > >
> > > > What about using kfree_rcu() instead ?
> > >
> > > It would lead to unbound allocation from userspace.
> >
> > Look at how we did this in commit
> > c3059477fce2d956a0bb3e04357324780c5d8eeb
> >
> > >
> > > > translate_desc() still uses rcu_read_lock(), its not clear if the mutex
> > > > is really held.
> > >
> > > Yes, vhost_get_vq_desc must be called with the vq mutex held.
> > >
> > > The rcu_read_lock/unlock in translate_desc is unnecessary.
> >
> > Yep, this is what I pointed out. This is not only necessary, but
> > confusing and might be incorrectly copy/pasted in the future.
> >
> > This patch is a partial one and leaves confusion.
> >
> > Some places uses the proper
> >
> > mp = rcu_dereference_protected(dev->memory,
> > lockdep_is_held(&dev->mutex));
> >
> > others use the now incorrect :
> >
> > rcu_read_lock();
> > mp = rcu_dereference(dev->memory);
> > ...
> >
>
> I agree, working on a cleanup patch on top now.

OK I just posted two cleanups as patches on top that address this.
Eric, could you please confirm that you are fine with
cleanups being patches on top?
Bisect will be fine since this hack is ugly but technically correct.

Thanks a lot for pointing out the issues!

> --
> MST