2010-11-01 20:18:01

by Shirley Ma

Subject: Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

On Sat, 2010-10-30 at 22:06 +0200, Michael S. Tsirkin wrote:
> On Fri, Oct 29, 2010 at 08:43:08AM -0700, Shirley Ma wrote:
> > On Fri, 2010-10-29 at 10:10 +0200, Michael S. Tsirkin wrote:
> > > Hmm. I don't yet understand. We are still doing copies into the per-vq
> > > buffer, and the data copied is really small. Is it about cache line
> > > bounces? Could you try figuring it out?
> >
> > The per-vq buffer is much less expensive than the 3 put_copy() calls. I
> > will collect profiling data to show that.
>
> What about __put_user? Maybe the access checks are the ones
> that add the cost here? I attach patches to strip access checks:
> they are not needed as we do them at setup time already, anyway.
> Can you try them out and see if performance is improved for you
> please?
> On top of this, we will need to add some scheme to accumulate signals,
> but that is a separate issue.

Yes, moving from put_user/get_user to __put_user/__get_user does improve
performance by removing the per-call access checks.

My concern here is whether checking only at setup time would be sufficient
for security. Could there be a case where the guest corrupts the ring
later? If not, that's OK.
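
For reference, the difference being discussed, as a rough standalone sketch
(the struct and helper names below are made up for illustration, not the
real vhost code): put_user() repeats the access_ok() range check on every
call, while __put_user() relies on the range having been validated once at
setup time.

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/uaccess.h>

/* Hypothetical per-queue state, for illustration only. */
struct example_vq {
	u16 __user *used_idx;	/* guest-visible used index */
};

/* Setup path (ioctl time): validate the userspace range once. */
static int example_set_used_idx(struct example_vq *vq, u16 __user *p)
{
	if (!access_ok(VERIFY_WRITE, p, sizeof(*p)))
		return -EFAULT;
	vq->used_idx = p;
	return 0;
}

/* Hot path: put_user() would redo the access_ok() check here;
 * __put_user() skips it and only does the fault-checked store,
 * trusting the validation done at setup time.
 */
static int example_publish_used_idx(struct example_vq *vq, u16 idx)
{
	return __put_user(idx, vq->used_idx);
}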

> > > > > 2. How about flushing out queued stuff before we exit
> > > > > the handle_tx loop? That would address most of
> > > > > the spec issue.
> > > >
> > > > The performance is almost the same as with the previous patch. I will
> > > > resubmit the modified one, adding vhost_add_used_and_signal_n after
> > > > the handle_tx loop to process the pending queue.
> > > >
> > > > This patch was part of the modified macvtap zero-copy work which I
> > > > haven't submitted yet. I found it helped vhost TX in general. The
> > > > pending queue will be used for DMA-done processing later, so I put it
> > > > in the vq instead of a local variable in handle_tx.
> > > >
> > > > Thanks
> > > > Shirley
> > >
> > > BTW why do we need another array? Isn't the heads field exactly what
> > > we need here?
> >
> > The heads field is only for up to 32 entries; test results show that the
> > more used-buffer adds and signals are accumulated, the better the
> > performance.
>
> I think we should separate the used update and signalling. Interrupts
> are expensive so I can believe accumulating even up to 100 of them
> helps. But used head copies are already pretty cheap. If we cut the
> overhead by x32, that should make them almost free?

I can separate the used update and signaling to see what gives the best
performance.
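
Roughly what I have in mind, as a sketch only (VHOST_EXAMPLE_BATCH and
example_get_tx_buf() are made-up names, and error handling is omitted; the
real loop is handle_tx() in drivers/vhost/net.c): publish used entries in
batches with vhost_add_used_n(), and call vhost_signal() only once at the
end of the TX loop.

#include "vhost.h"	/* drivers/vhost/vhost.h */

#define VHOST_EXAMPLE_BATCH	64	/* made-up batch size */

/* Stand-in for the real vhost_get_vq_desc() + sendmsg() work. */
static int example_get_tx_buf(struct vhost_virtqueue *vq, int *len);

static void example_handle_tx(struct vhost_dev *dev,
			      struct vhost_virtqueue *vq)
{
	struct vring_used_elem pend[VHOST_EXAMPLE_BATCH];
	unsigned int n = 0;
	int head, len;

	for (;;) {
		head = example_get_tx_buf(vq, &len);
		if (head < 0)
			break;

		pend[n].id = head;
		pend[n].len = len;
		if (++n == VHOST_EXAMPLE_BATCH) {
			/* Publish used entries now, but do not signal yet. */
			vhost_add_used_n(vq, pend, n);
			n = 0;
		}
	}

	/* Flush what is left and raise one interrupt for the whole run
	 * instead of one per buffer.
	 */
	if (n)
		vhost_add_used_n(vq, pend, n);
	vhost_signal(dev, vq);
}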

> > That was one of the reasons I didn't use heads. The other reason was
> > that I use these buffers for pending DMA-done handling in the macvtap
> > zero-copy patch. It could be up to vq->num in the worst case.
>
> We can always increase that, not an issue.

Good, I will increase heads to vq->num and use it.
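
Something along these lines, as an illustration only (the struct and helper
names are placeholders, not the actual patch): size the pending-heads
storage to the ring, since in the worst case every descriptor can be
outstanding.

#include <linux/slab.h>
#include <linux/errno.h>
#include <linux/virtio_ring.h>

/* Hypothetical pending-heads bookkeeping, sized to the ring. */
struct example_pend {
	struct vring_used_elem *heads;
	unsigned int num;
};

static int example_pend_alloc(struct example_pend *p, unsigned int vq_num)
{
	p->heads = kcalloc(vq_num, sizeof(*p->heads), GFP_KERNEL);
	if (!p->heads)
		return -ENOMEM;
	p->num = vq_num;
	return 0;
}

static void example_pend_free(struct example_pend *p)
{
	kfree(p->heads);
	p->heads = NULL;
	p->num = 0;
}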

Thanks
Shirley


2010-11-03 10:48:19

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

On Mon, Nov 01, 2010 at 01:17:53PM -0700, Shirley Ma wrote:
> On Sat, 2010-10-30 at 22:06 +0200, Michael S. Tsirkin wrote:
> > On Fri, Oct 29, 2010 at 08:43:08AM -0700, Shirley Ma wrote:
> > > On Fri, 2010-10-29 at 10:10 +0200, Michael S. Tsirkin wrote:
> > > > Hmm. I don't yet understand. We are still doing copies into the per-vq
> > > > buffer, and the data copied is really small. Is it about cache line
> > > > bounces? Could you try figuring it out?
> > >
> > > The per-vq buffer is much less expensive than the 3 put_copy() calls. I
> > > will collect profiling data to show that.
> >
> > What about __put_user? Maybe the access checks are the ones
> > that add the cost here? I attach patches to strip access checks:
> > they are not needed as we do them at setup time already, anyway.
> > Can you try them out and see if performance is improved for you
> > please?
> > On top of this, we will need to add some scheme to accumulate signals,
> > but that is a separate issue.
>
> Yes, moving from put_user/get_user to __put_user/__get_user does improve
> performance by removing the per-call access checks.

I mean in practice, you see a benefit from this patch?

> My concern here is whether checking only at setup time would be sufficient
> for security.

It better be sufficient because the checks that put_user does
are not effective when run from the kernel thread, anyway.

> Could there be a case where the guest corrupts the ring
> later? If not, that's OK.

You mean change the pointer after it's checked?
If you see such a case, please holler.

> > > > > > 2. How about flushing out queued stuff before we exit
> > > > > > the handle_tx loop? That would address most of
> > > > > > the spec issue.
> > > > >
> > > > > The performance is almost the same as with the previous patch. I will
> > > > > resubmit the modified one, adding vhost_add_used_and_signal_n after
> > > > > the handle_tx loop to process the pending queue.
> > > > >
> > > > > This patch was part of the modified macvtap zero-copy work which I
> > > > > haven't submitted yet. I found it helped vhost TX in general. The
> > > > > pending queue will be used for DMA-done processing later, so I put it
> > > > > in the vq instead of a local variable in handle_tx.
> > > > >
> > > > > Thanks
> > > > > Shirley
> > > >
> > > > BTW why do we need another array? Isn't the heads field exactly what
> > > > we need here?
> > >
> > > The heads field is only for up to 32 entries; test results show that the
> > > more used-buffer adds and signals are accumulated, the better the
> > > performance.
> >
> > I think we should separate the used update and signalling. Interrupts
> > are expensive so I can believe accumulating even up to 100 of them
> > helps. But used head copies are already pretty cheap. If we cut the
> > overhead by x32, that should make them almost free?
>
> I can separate the used update and signaling to see what gives the best
> performance.
>
> > > That was one of the reasons I didn't use heads. The other reason was
> > > that I use these buffers for pending DMA-done handling in the macvtap
> > > zero-copy patch. It could be up to vq->num in the worst case.
> >
> > We can always increase that, not an issue.
>
> Good, I will increase heads to vq->num and use it.
>
> Thanks
> Shirley

To clarify: the combination of __put_user and separate
signalling is giving the same performance benefit as your
patch?

I am mostly concerned with adding code that seems to help
speed for reasons we don't completely understand, because
then we might break the optimization easily without noticing.

--
MST

2010-11-04 05:38:52

by Shirley Ma

Subject: Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

On Wed, 2010-11-03 at 12:48 +0200, Michael S. Tsirkin wrote:
> I mean in practice, you see a benefit from this patch?

Yes, I tested it. It does improve performance.

> > My concern here is whether checking only at setup time would be
> > sufficient for security.
>
> It better be sufficient because the checks that put_user does
> are not effective when run from the kernel thread, anyway.
>
> > Could there be a case where the guest corrupts the ring
> > later? If not, that's OK.
>
> You mean change the pointer after it's checked?
> If you see such a case, please holler.

I wonder about it, but I don't have such a case in mind.

> To clarify: the combination of __put_user and separate
> signalling is giving the same performance benefit as your
> patch?

Yes, it has similar performance, but I haven't finished comparing all
message sizes yet.

> I am mostly concerned with adding code that seems to help
> speed for reasons we don't completely understand, because
> then we might break the optimization easily without noticing.

I don't think the patch I submitted would break anything. It just
reduces the cost of the 3 put_user() calls per used buffer and changes
guest signaling from one-to-one to many-to-one.
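
To spell out what the 3 put_user() calls per used buffer refer to, here is
a sketch of the general shape (the ring layout is from linux/virtio_ring.h,
but the helper itself is illustrative, not the vhost source):

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/uaccess.h>
#include <linux/virtio_ring.h>

/* One used-buffer update is roughly three userspace stores:
 * the element id, the element length, and the new used index.
 */
static int example_add_used_one(struct vring_used __user *used,
				u16 last_used_idx, u32 num,
				u32 head, u32 len)
{
	struct vring_used_elem __user *elem =
		&used->ring[last_used_idx % num];

	if (__put_user(head, &elem->id))	/* store 1: element id  */
		return -EFAULT;
	if (__put_user(len, &elem->len))	/* store 2: element len */
		return -EFAULT;
	/* store 3: expose the new used index to the guest */
	return __put_user((u16)(last_used_idx + 1), &used->idx);
}

With a pending array the id/len pairs can be written out in one batch and
used->idx updated once per batch, and the guest signal can likewise go from
one per buffer to one per batch.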

Thanks
Shirley

2010-11-04 09:31:05

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

On Wed, Nov 03, 2010 at 10:38:46PM -0700, Shirley Ma wrote:
> On Wed, 2010-11-03 at 12:48 +0200, Michael S. Tsirkin wrote:
> > I mean in practice, you see a benefit from this patch?
>
> Yes, I tested it. It does improve performance.
>
> > > My concern here is whether checking only at setup time would be
> > > sufficient for security.
> >
> > It better be sufficient because the checks that put_user does
> > are not effective when run from the kernel thread, anyway.
> >
> > > Could there be a case where the guest corrupts the ring
> > > later? If not, that's OK.
> >
> > You mean change the pointer after it's checked?
> > If you see such a case, please holler.
>
> I wonder about it, but I don't have such a case in mind.
>
> > To clarify: the combination of __put_user and separate
> > signalling is giving the same performance benefit as your
> > patch?
>
> Yes, it has similar performance, but I haven't finished comparing all
> message sizes yet.
>
> > I am mostly concerned with adding code that seems to help
> > speed for reasons we don't completely understand, because
> > then we might break the optimization easily without noticing.
>
> I don't think the patch I submitted would break anything.

No, I just meant that when a patch gives some benefit, I'd like
to understand where the benefit comes from so that I don't
break it later.

> It just
> reduces the cost of the 3 put_user() calls per used buffer and changes
> guest signaling from one-to-one to many-to-one.

One thing to note is that deferred signalling needs to be
benchmarked with old guests which don't orphan skbs on xmit
(or disable orphaning in both networking stack and virtio-net).

>
> Thanks
> Shirley

OK, so I guess I'll queue the __put_user etc. patches for net-next. On top
of this I think a patch which defers signalling would be nice to have; then
we can figure out whether a separate heads array still has benefits for the
non-zero-copy case: if yes, what they are; if no, whether it should be used
for zero copy only or for both non-zero copy and zero copy.

Makes sense?

--
MST

2010-11-04 21:37:30

by Shirley Ma

Subject: Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

On Thu, 2010-11-04 at 11:30 +0200, Michael S. Tsirkin wrote:
> One thing to note is that deferred signalling needs to be
> benchmarked with old guests which don't orphan skbs on xmit
> (or disable orphaning in both networking stack and virtio-net).

Yes, we need to run more tests.

>
> OK, so I guess I'll queue the __put_user etc. patches for net-next. On top
> of this I think a patch which defers signalling would be nice to have; then
> we can figure out whether a separate heads array still has benefits for the
> non-zero-copy case: if yes, what they are; if no, whether it should be used
> for zero copy only or for both non-zero copy and zero copy.
>
> Makes sense?

Agree.

Shirley