2014-02-12 16:33:06

by Michael S. Tsirkin

[permalink] [raw]
Subject: [PATCH net 0/3] vhost fixes for 3.14, -stable

This fixes a deadlock with vhost reported in the field,
as well as a theoretical race issue found by code
review.

Patches 1+2 are needed for stable.

Thanks to Qin Chuanyu for reporting the issue!

Michael S. Tsirkin (3):
kref: add kref_sub_return
vhost: fix ref cnt checking deadlock
vhost: fix a theoretical race in device cleanup

include/linux/kref.h | 33 ++++++++++++++++++++++++++++++++-
drivers/vhost/net.c | 15 ++++++++++-----
2 files changed, 42 insertions(+), 6 deletions(-)

--
MST


2014-02-12 16:32:09

by Michael S. Tsirkin

[permalink] [raw]
Subject: [PATCH net 3/3] vhost: fix a theoretical race in device cleanup

vhost_zerocopy_callback accesses VQ right after
it drops the last ubuf reference.
In theory, this could race with device removal
which waits on the ubuf kref, and crash
on use after free.

Do all accesses within rcu read side critical
section, and all synchronize on release.

Since callbacks are always invoked from bh,
synchronize_rcu_bh seems enough and will help
release complete a bit faster.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/net.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7eaf2de..78a9d42 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -308,6 +308,8 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
struct vhost_virtqueue *vq = ubufs->vq;
int cnt;

+ rcu_read_lock_bh();
+
/* set len to mark this desc buffers done DMA */
vq->heads[ubuf->desc].len = success ?
VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
@@ -322,6 +324,8 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
*/
if (cnt <= 1 || !(cnt % 16))
vhost_poll_queue(&vq->poll);
+
+ rcu_read_unlock_bh();
}

/* Expects to be always run from workqueue - which acts as
@@ -804,6 +808,8 @@ static int vhost_net_release(struct inode *inode, struct file *f)
fput(tx_sock->file);
if (rx_sock)
fput(rx_sock->file);
+ /* Make sure no callbacks are outstanding */
+ synchronize_rcu_bh();
/* We do an extra flush before freeing memory,
* since jobs can re-queue themselves. */
vhost_net_flush(n);
--
MST

2014-02-12 16:33:10

by Michael S. Tsirkin

[permalink] [raw]
Subject: [PATCH net 2/3] vhost: fix ref cnt checking deadlock

vhost checked the counter within the refcnt before decrementing. It
really wanted to know that there aren't too many references, as a way to
batch freeing resources a bit more efficiently.

This works well but it we now access the
ref counter twice so there's a race:
all users might see a high count and decide
to defer freeing resources.
In the end no one initiates freeing resources
until the last reference is gone (which is on VM shotdown
so might happen after a looooong time).

Let's do what we should have done straight away:
add a kref API to return the kref value atomically,
and use that to avoid the deadlock.

Reported-by: Qin Chuanyu <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/net.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 831eb4f..7eaf2de 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -140,9 +140,9 @@ vhost_net_ubuf_alloc(struct vhost_virtqueue *vq, bool zcopy)
return ubufs;
}

-static void vhost_net_ubuf_put(struct vhost_net_ubuf_ref *ubufs)
+static int vhost_net_ubuf_put(struct vhost_net_ubuf_ref *ubufs)
{
- kref_put(&ubufs->kref, vhost_net_zerocopy_done_signal);
+ return kref_sub_return(&ubufs->kref, 1, vhost_net_zerocopy_done_signal);
}

static void vhost_net_ubuf_put_and_wait(struct vhost_net_ubuf_ref *ubufs)
@@ -306,22 +306,21 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
{
struct vhost_net_ubuf_ref *ubufs = ubuf->ctx;
struct vhost_virtqueue *vq = ubufs->vq;
- int cnt = atomic_read(&ubufs->kref.refcount);
+ int cnt;

/* set len to mark this desc buffers done DMA */
vq->heads[ubuf->desc].len = success ?
VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
- vhost_net_ubuf_put(ubufs);
+ cnt = vhost_net_ubuf_put(ubufs);

/*
* Trigger polling thread if guest stopped submitting new buffers:
- * in this case, the refcount after decrement will eventually reach 1
- * so here it is 2.
+ * in this case, the refcount after decrement will eventually reach 1.
* We also trigger polling periodically after each 16 packets
* (the value 16 here is more or less arbitrary, it's tuned to trigger
* less than 10% of times).
*/
- if (cnt <= 2 || !(cnt % 16))
+ if (cnt <= 1 || !(cnt % 16))
vhost_poll_queue(&vq->poll);
}

--
MST

2014-02-12 16:40:56

by Michael S. Tsirkin

[permalink] [raw]
Subject: [PATCH net 1/3] kref: add kref_sub_return

It is sometimes useful to get the value of the reference count after
decrement.
For example, vhost wants to execute some periodic cleanup operations
once number of references drops below a specific value, before it
reaches zero (for efficiency).

Add an API to do this atomically and efficiently using
atomic_sub_return.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---

Greg, could you ack this API extension please?
I think it is cleanest to merge this through -net together
with the first user.

include/linux/kref.h | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/kref.h b/include/linux/kref.h
index 484604d..cb20550 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -61,7 +61,7 @@ static inline void kref_get(struct kref *kref)
*
* Subtract @count from the refcount, and if 0, call release().
* Return 1 if the object was removed, otherwise return 0. Beware, if this
- * function returns 0, you still can not count on the kref from remaining in
+ * function returns 0, you still can not count on the kref remaining in
* memory. Only use the return value if you want to see if the kref is now
* gone, not present.
*/
@@ -78,6 +78,38 @@ static inline int kref_sub(struct kref *kref, unsigned int count,
}

/**
+ * kref_sub_return - subtract a number of refcounts for object.
+ * @kref: object.
+ * @count: Number of recounts to subtract.
+ * @release: pointer to the function that will clean up the object when the
+ * last reference to the object is released.
+ * This pointer is required, and it is not acceptable to pass kfree
+ * in as this function. If the caller does pass kfree to this
+ * function, you will be publicly mocked mercilessly by the kref
+ * maintainer, and anyone else who happens to notice it. You have
+ * been warned.
+ *
+ * Subtract @count from the refcount, and if 0, call release().
+ * Return the new refcount. Beware, if this function returns > N, you still
+ * can not count on there being at least N other references, and in
+ * particular, on the kref remaining in memory.
+ * Only use the return value if you want to see if there are at most,
+ * not at least, N other references to kref,
+ */
+static inline int kref_sub_return(struct kref *kref, unsigned int count,
+ void (*release)(struct kref *kref))
+{
+ int r;
+
+ WARN_ON(release == NULL);
+
+ r = atomic_sub_return((int) count, &kref->refcount);
+ if (!r)
+ release(kref);
+ return r;
+}
+
+/**
* kref_put - decrement refcount for object.
* @kref: object.
* @release: pointer to the function that will clean up the object when the
--
MST

2014-02-12 16:55:17

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

On Wed, Feb 12, 2014 at 06:38:21PM +0200, Michael S. Tsirkin wrote:
> It is sometimes useful to get the value of the reference count after
> decrement.
> For example, vhost wants to execute some periodic cleanup operations
> once number of references drops below a specific value, before it
> reaches zero (for efficiency).

You should never care about what the value of the kref is, if you are
using it correctly :)

So I really don't want to add this function, as I'm sure people will use
it incorrectly. You should only care if the reference drops to 0, if
not, then your usage doesn't really fit into the "kref" model, and so,
just use an atomic variable.

I really want to know why it matters for "efficiency" that you know this
number. How does that help anything, as the number could then go up
later on, and the work you did at a "lower" number is obsolete, right?

thanks,

greg k-h

2014-02-12 17:30:54

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

On Wed, Feb 12, 2014 at 08:56:30AM -0800, Greg Kroah-Hartman wrote:
> On Wed, Feb 12, 2014 at 06:38:21PM +0200, Michael S. Tsirkin wrote:
> > It is sometimes useful to get the value of the reference count after
> > decrement.
> > For example, vhost wants to execute some periodic cleanup operations
> > once number of references drops below a specific value, before it
> > reaches zero (for efficiency).
>
> You should never care about what the value of the kref is, if you are
> using it correctly :)
>
> So I really don't want to add this function, as I'm sure people will use
> it incorrectly. You should only care if the reference drops to 0, if
> not, then your usage doesn't really fit into the "kref" model, and so,
> just use an atomic variable.

This happens when you have code that keeps
reference itself implicitly or explicitly.

foo(struct kref *k, int bar) {

sub = kref_sub(k)

if (sub == 1)
FOO(k, bar) /* Here I am the only one
with a reference */

}

kref_get(k)
foo(k, bar);
....
kref_put(k)

Why not do FOO in destructor you ask?
Absolutely but this will be called much later.

Maybe you will reconsider if I document this
as the only legal use?

>
> I really want to know why it matters for "efficiency" that you know this
> number. How does that help anything, as the number could then go up
> later on, and the work you did at a "lower" number is obsolete, right?
>
> thanks,
>
> greg k-h

The issue is that if number dropped to 1, this means
we must do the cleanup work since there are
no outstanding buffers, (last user is ourselves)
if we do not cleanup,
guest will hang waiting for us.

But it never drops to 0 since we have our own reference
in the device.
If it goes up again this means we didn't have
to do cleanup, but an alternative is doing
it all the time and that is slow.

Yes I can rework vhost to open-code this kref use, it's
no big deal.
Alternatively since most of the use does match kref
model, maybe __kref_sub_return with disclaimers
that you must know what you are doing?
Please let me know.

Thanks!

--
MST

2014-02-12 18:36:42

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

On Wed, Feb 12, 2014 at 07:35:24PM +0200, Michael S. Tsirkin wrote:
> On Wed, Feb 12, 2014 at 08:56:30AM -0800, Greg Kroah-Hartman wrote:
> > On Wed, Feb 12, 2014 at 06:38:21PM +0200, Michael S. Tsirkin wrote:
> > > It is sometimes useful to get the value of the reference count after
> > > decrement.
> > > For example, vhost wants to execute some periodic cleanup operations
> > > once number of references drops below a specific value, before it
> > > reaches zero (for efficiency).
> >
> > You should never care about what the value of the kref is, if you are
> > using it correctly :)
> >
> > So I really don't want to add this function, as I'm sure people will use
> > it incorrectly. You should only care if the reference drops to 0, if
> > not, then your usage doesn't really fit into the "kref" model, and so,
> > just use an atomic variable.
>
> This happens when you have code that keeps
> reference itself implicitly or explicitly.
>
> foo(struct kref *k, int bar) {
>
> sub = kref_sub(k)
>
> if (sub == 1)
> FOO(k, bar) /* Here I am the only one
> with a reference */

Why do you care if you are the only one with a reference?

If you do, then just don't grab that reference and do the work in the
cleanup callback :)

> }
>
> kref_get(k)
> foo(k, bar);
> ....
> kref_put(k)
>
> Why not do FOO in destructor you ask?
> Absolutely but this will be called much later.
>
> Maybe you will reconsider if I document this
> as the only legal use?

No one reads documentation :(

> > I really want to know why it matters for "efficiency" that you know this
> > number. How does that help anything, as the number could then go up
> > later on, and the work you did at a "lower" number is obsolete, right?
> >
> > thanks,
> >
> > greg k-h
>
> The issue is that if number dropped to 1, this means
> we must do the cleanup work since there are
> no outstanding buffers, (last user is ourselves)
> if we do not cleanup,
> guest will hang waiting for us.

This doesn't make sense, nor does it sound like a use for a kref (or you
are using it wrong.)

> But it never drops to 0 since we have our own reference
> in the device.

Then don't do that.

> If it goes up again this means we didn't have
> to do cleanup, but an alternative is doing
> it all the time and that is slow.

Then just cleanup when it hits 0, like the rest of the world does.

> Yes I can rework vhost to open-code this kref use, it's
> no big deal.
> Alternatively since most of the use does match kref
> model, maybe __kref_sub_return with disclaimers
> that you must know what you are doing?

No, no one reads documentation, sorry. Either fix your use of kref
(i.e. don't care about the count), or do something else, as you don't
want a kref, but rather, an atomic count of what is going on and "1"
means something "special" to you.

sorry,

greg k-h

2014-02-12 18:39:20

by Anatol Pomozov

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

Hi

On Wed, Feb 12, 2014 at 9:35 AM, Michael S. Tsirkin <[email protected]> wrote:
> On Wed, Feb 12, 2014 at 08:56:30AM -0800, Greg Kroah-Hartman wrote:
>> On Wed, Feb 12, 2014 at 06:38:21PM +0200, Michael S. Tsirkin wrote:
>> > It is sometimes useful to get the value of the reference count after
>> > decrement.
>> > For example, vhost wants to execute some periodic cleanup operations
>> > once number of references drops below a specific value, before it
>> > reaches zero (for efficiency).
>>
>> You should never care about what the value of the kref is, if you are
>> using it correctly :)
>>
>> So I really don't want to add this function, as I'm sure people will use
>> it incorrectly. You should only care if the reference drops to 0, if
>> not, then your usage doesn't really fit into the "kref" model, and so,
>> just use an atomic variable.
>
> This happens when you have code that keeps
> reference itself implicitly or explicitly.
>
> foo(struct kref *k, int bar) {
>
> sub = kref_sub(k)
>
> if (sub == 1)

At this moment you cannot be sure that refcount is 1. It might be
changed between kref_sub call and this point. The refcount might be 0,
1, ... or any other value.

The only value that you can be sure is zero. Once refcount becomes
zero there is no way to increase it to positive value as there are no
alive pointers to the object.

> FOO(k, bar) /* Here I am the only one
> with a reference */
>
> }
>
> kref_get(k)
> foo(k, bar);
> ....
> kref_put(k)
>
> Why not do FOO in destructor you ask?
> Absolutely but this will be called much later.
>
> Maybe you will reconsider if I document this
> as the only legal use?
>
>>
>> I really want to know why it matters for "efficiency" that you know this
>> number. How does that help anything, as the number could then go up
>> later on, and the work you did at a "lower" number is obsolete, right?
>>
>> thanks,
>>
>> greg k-h
>
> The issue is that if number dropped to 1, this means
> we must do the cleanup work since there are
> no outstanding buffers, (last user is ourselves)
> if we do not cleanup,
> guest will hang waiting for us.
>
> But it never drops to 0 since we have our own reference
> in the device.
> If it goes up again this means we didn't have
> to do cleanup, but an alternative is doing
> it all the time and that is slow.
>
> Yes I can rework vhost to open-code this kref use, it's
> no big deal.
> Alternatively since most of the use does match kref
> model, maybe __kref_sub_return with disclaimers
> that you must know what you are doing?
> Please let me know.
>
> Thanks!
>
> --
> MST

2014-02-13 00:06:41

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

From: Greg Kroah-Hartman <[email protected]>
Date: Wed, 12 Feb 2014 08:56:30 -0800

> On Wed, Feb 12, 2014 at 06:38:21PM +0200, Michael S. Tsirkin wrote:
>> It is sometimes useful to get the value of the reference count after
>> decrement.
>> For example, vhost wants to execute some periodic cleanup operations
>> once number of references drops below a specific value, before it
>> reaches zero (for efficiency).
>
> You should never care about what the value of the kref is, if you are
> using it correctly :)

It isn't being used to determine when to destroy things.

They use it to as a heuristic of when to trigger polling.

Each ubuf attached gets a kref to the higher level virtio_net buffer
holding object, they want to trigger polling when that reference drops
to 1 or lower.

Right now they are reading the atomic refcount directly, which
I think is much worse than this helper.

2014-02-13 01:25:09

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

On Wed, 12 February 2014 19:06:37 -0500, David Miller wrote:
>
> It isn't being used to determine when to destroy things.
>
> They use it to as a heuristic of when to trigger polling.
>
> Each ubuf attached gets a kref to the higher level virtio_net buffer
> holding object, they want to trigger polling when that reference drops
> to 1 or lower.
>
> Right now they are reading the atomic refcount directly, which
> I think is much worse than this helper.

I disagree. Reading the refcount directly is leaving noone under any
illusion that this might be a good idea.

Jörn

--
Victory in war is not repetitious.
-- Sun Tzu

2014-02-13 01:36:53

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

On Wed, Feb 12, 2014 at 07:06:37PM -0500, David Miller wrote:
> From: Greg Kroah-Hartman <[email protected]>
> Date: Wed, 12 Feb 2014 08:56:30 -0800
>
> > On Wed, Feb 12, 2014 at 06:38:21PM +0200, Michael S. Tsirkin wrote:
> >> It is sometimes useful to get the value of the reference count after
> >> decrement.
> >> For example, vhost wants to execute some periodic cleanup operations
> >> once number of references drops below a specific value, before it
> >> reaches zero (for efficiency).
> >
> > You should never care about what the value of the kref is, if you are
> > using it correctly :)
>
> It isn't being used to determine when to destroy things.
>
> They use it to as a heuristic of when to trigger polling.
>
> Each ubuf attached gets a kref to the higher level virtio_net buffer
> holding object, they want to trigger polling when that reference drops
> to 1 or lower.
>
> Right now they are reading the atomic refcount directly, which
> I think is much worse than this helper.

Yes, that's horrible as well, but as was already pointed out in this
thread, you can't rely on that value to really be "1" after reading it
due to the way krefs work, what happened if someone else just grabbed
it?

If all they want is a "count" for when to start polling, then use a
separate atomic count, but don't abuse the kref interface for this, I
don't think that will work properly at all.

thanks,

greg k-h

2014-02-13 04:05:24

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

From: Greg KH <[email protected]>
Date: Wed, 12 Feb 2014 17:39:02 -0800

> Yes, that's horrible as well, but as was already pointed out in this
> thread, you can't rely on that value to really be "1" after reading it
> due to the way krefs work, what happened if someone else just grabbed
> it?
>
> If all they want is a "count" for when to start polling, then use a
> separate atomic count, but don't abuse the kref interface for this, I
> don't think that will work properly at all.

They want to know which thread of control decrements the count to "1"
as buffers are released.

That seems entirely reasonable to me.

They could add another atomic counter for this, but that's rather
silly since the kref already has an atomic they can use for this
purpose.

2014-02-13 04:10:07

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

From: David Miller <[email protected]>
Date: Wed, 12 Feb 2014 23:05:06 -0500 (EST)

> From: Greg KH <[email protected]>
> Date: Wed, 12 Feb 2014 17:39:02 -0800
>
>> Yes, that's horrible as well, but as was already pointed out in this
>> thread, you can't rely on that value to really be "1" after reading it
>> due to the way krefs work, what happened if someone else just grabbed
>> it?
>>
>> If all they want is a "count" for when to start polling, then use a
>> separate atomic count, but don't abuse the kref interface for this, I
>> don't think that will work properly at all.
>
> They want to know which thread of control decrements the count to "1"
> as buffers are released.
>
> That seems entirely reasonable to me.
>
> They could add another atomic counter for this, but that's rather
> silly since the kref already has an atomic they can use for this
> purpose.

If you still can't understand what they are trying to do, they want to
do something precisely when the number of pending buffers is dropped
to 1 or less.

They are using krefs to track how many buffers are attached at a given
moment.

The counter can re-increment after the decrement to 1 or less occurs,
they don't care.

But they want precisely the entity that drops it down to 1 or less to
perform that action.

Just reading the atomic value directly, they cannot do this.

2014-02-14 00:02:04

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

On Wed, Feb 12, 2014 at 11:09:53PM -0500, David Miller wrote:
> From: David Miller <[email protected]>
> Date: Wed, 12 Feb 2014 23:05:06 -0500 (EST)
>
> > From: Greg KH <[email protected]>
> > Date: Wed, 12 Feb 2014 17:39:02 -0800
> >
> >> Yes, that's horrible as well, but as was already pointed out in this
> >> thread, you can't rely on that value to really be "1" after reading it
> >> due to the way krefs work, what happened if someone else just grabbed
> >> it?
> >>
> >> If all they want is a "count" for when to start polling, then use a
> >> separate atomic count, but don't abuse the kref interface for this, I
> >> don't think that will work properly at all.
> >
> > They want to know which thread of control decrements the count to "1"
> > as buffers are released.
> >
> > That seems entirely reasonable to me.
> >
> > They could add another atomic counter for this, but that's rather
> > silly since the kref already has an atomic they can use for this
> > purpose.
>
> If you still can't understand what they are trying to do, they want to
> do something precisely when the number of pending buffers is dropped
> to 1 or less.
>
> They are using krefs to track how many buffers are attached at a given
> moment.
>
> The counter can re-increment after the decrement to 1 or less occurs,
> they don't care.
>
> But they want precisely the entity that drops it down to 1 or less to
> perform that action.
>
> Just reading the atomic value directly, they cannot do this.

That makes sense, but is still totally crazy for what a kref was
designed to do. I'm really wary of giving access to the count of the
kref like this as it will almost always be abused (it's been asked for
before, but not like this.)

So how about just "open coding" a kref for this structure, as it wants
something that doesn't fit into the kref model, and should be pretty
simple to do (you can ensure you get the locking right, unlike almost
all users of krefs, as Al Viro constantly points out to me...)

thanks,

greg k-h

2014-02-14 05:10:10

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net 1/3] kref: add kref_sub_return

From: Greg KH <[email protected]>
Date: Thu, 13 Feb 2014 16:03:20 -0800

> So how about just "open coding" a kref for this structure, as it wants
> something that doesn't fit into the kref model, and should be pretty
> simple to do (you can ensure you get the locking right, unlike almost
> all users of krefs, as Al Viro constantly points out to me...)

They redid their patches meanwhile to use their own atomic_t.