2013-10-30 19:06:40

by Michael S. Tsirkin

Subject: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

I noticed that srcu_read_lock/unlock both have a memory barrier,
so just by moving srcu_read_unlock earlier we can get rid of
one call to smp_mb().

Unsurprisingly, the gain is small but measurable using the unit test
microbenchmark:
before
vmcall 1407
after
vmcall 1357

Signed-off-by: Michael S. Tsirkin <[email protected]>

--

I didn't stress test this yet, sending out for early review/flames.

Paul, could you review this patch please?
Documentation/memory-barriers.txt says that unlock has a weaker
uni-directional barrier, but in practice srcu_read_unlock calls
smp_mb().
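
For reference, the unlock path in kernel/srcu.c currently looks more or
less like this (paraphrased from memory, so check against the actual
tree; the smp_mb() here is the barrier the patch relies on):

void __srcu_read_unlock(struct srcu_struct *sp, int idx)
{
        smp_mb(); /* C */  /* Avoid leaking the critical section. */
        this_cpu_dec(sp->per_cpu_ref->c[idx]);
}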

Is it OK to rely on this? If not, can I add
smp_mb__after_srcu_read_unlock (making it an empty macro for now)
so we can avoid an actual extra smp_mb()?

Thanks.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8617c9d..a48fb36 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5949,8 +5949,10 @@ restore:

/* We should set ->mode before check ->requests,
* see the comment in make_all_cpus_request.
+ *
+ * srcu_read_unlock below acts as a memory barrier.
*/
- smp_mb();
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);

local_irq_disable();

@@ -5960,12 +5962,11 @@ restore:
smp_wmb();
local_irq_enable();
preempt_enable();
+ vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
r = 1;
goto cancel_injection;
}

- srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
-
if (req_immediate_exit)
smp_send_reschedule(vcpu->cpu);


--
MST


2013-10-30 20:16:00

by Paul E. McKenney

Subject: Re: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

On Wed, Oct 30, 2013 at 09:09:29PM +0200, Michael S. Tsirkin wrote:
> I noticed that srcu_read_lock/unlock both have a memory barrier,
> so just by moving srcu_read_unlock earlier we can get rid of
> one call to smp_mb().
>
> Unsurprisingly, the gain is small but measurable using the unit test
> microbenchmark:
> before
> vmcall 1407
> after
> vmcall 1357
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
>
> --
>
> I didn't stress test this yet, sending out for early review/flames.
>
> Paul, could you review this patch please?
> Documentation/memory-barriers.txt says that unlock has a weaker
> uni-directional barrier, but in practice srcu_read_unlock calls
> smp_mb().
>
> Is it OK to rely on this? If not, can I add
> smp_mb__after_srcu_read_unlock (making it an empty macro for now)
> so we can avoid an actual extra smp_mb()?

Please use smp_mb__after_srcu_read_unlock(). After all, it was not
that long ago that srcu_read_unlock() contained no memory barriers,
and perhaps some day it won't need to once again.
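
In other words, something like this at the call site (just a sketch of
where I would expect things to end up, not a tested patch):

        vcpu->mode = IN_GUEST_MODE;

        srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);

        /* We should set ->mode before check ->requests, see the
         * comment in make_all_cpus_request. srcu_read_unlock()
         * happens to do smp_mb() today; the line below documents
         * that dependency without adding a second barrier.
         */
        smp_mb__after_srcu_read_unlock();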

Thanx, Paul

> Thanks.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8617c9d..a48fb36 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5949,8 +5949,10 @@ restore:
>
> /* We should set ->mode before check ->requests,
> * see the comment in make_all_cpus_request.
> + *
> + * srcu_read_unlock below acts as a memory barrier.
> */
> - smp_mb();
> + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>
> local_irq_disable();
>
> @@ -5960,12 +5962,11 @@ restore:
> smp_wmb();
> local_irq_enable();
> preempt_enable();
> + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> r = 1;
> goto cancel_injection;
> }
>
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> -
> if (req_immediate_exit)
> smp_send_reschedule(vcpu->cpu);
>
>
> --
> MST
>

2013-10-30 23:23:16

by Michael S. Tsirkin

Subject: Re: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

> > Paul, could you review this patch please?
> > Documentation/memory-barriers.txt says that unlock has a weaker
> > uni-directional barrier, but in practice srcu_read_unlock calls
> > smp_mb().
> >
> > Is it OK to rely on this? If not, can I add
> > smp_mb__after_srcu_read_unlock (making it an empty macro for now)
> > so we can avoid an actual extra smp_mb()?
>
> Please use smp_mb__after_srcu_read_unlock(). After all, it was not
> that long ago that srcu_read_unlock() contained no memory barriers,
> and perhaps some day it won't need to once again.
>
> Thanx, Paul
>

Thanks!
Would something like this be enough?

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index c114614..9b058ee 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -237,4 +237,18 @@ static inline void srcu_read_unlock(struct srcu_struct *sp, int idx)
__srcu_read_unlock(sp, idx);
}

+/**
+ * smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock
+ *
+ * Converts the preceding srcu_read_unlock into a two-way memory barrier.
+ *
+ * Call this after srcu_read_unlock, to guarantee that all memory operations
+ * that occur after smp_mb__after_srcu_read_unlock will appear to happen after
+ * the preceding srcu_read_unlock.
+ */
+static inline void smp_mb__after_srcu_read_unlock(void)
+{
+ /* __srcu_read_unlock has smp_mb() internally so nothing to do here. */
+}
+
#endif

2013-10-31 04:56:37

by Paul E. McKenney

Subject: Re: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

On Thu, Oct 31, 2013 at 01:26:05AM +0200, Michael S. Tsirkin wrote:
> > > Paul, could you review this patch please?
> > > Documentation/memory-barriers.txt says that unlock has a weaker
> > > uni-directional barrier, but in practice srcu_read_unlock calls
> > > smp_mb().
> > >
> > > Is it OK to rely on this? If not, can I add
> > > smp_mb__after_srcu_read_unlock (making it an empty macro for now)
> > > so we can avoid an actual extra smp_mb()?
> >
> > Please use smp_mb__after_srcu_read_unlock(). After all, it was not
> > that long ago that srcu_read_unlock() contained no memory barriers,
> > and perhaps some day it won't need to once again.
> >
> > Thanx, Paul
> >
>
> Thanks!
> Would something like this be enough?
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index c114614..9b058ee 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -237,4 +237,18 @@ static inline void srcu_read_unlock(struct srcu_struct *sp, int idx)
> __srcu_read_unlock(sp, idx);
> }
>
> +/**
> + * smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock
> + *
> + * Converts the preceding srcu_read_unlock into a two-way memory barrier.
> + *
> + * Call this after srcu_read_unlock, to guarantee that all memory operations
> + * that occur after smp_mb__after_srcu_read_unlock will appear to happen after
> + * the preceding srcu_read_unlock.
> + */
> +static inline void smp_mb__after_srcu_read_unlock(void)
> +{
> + /* __srcu_read_unlock has smp_mb() internally so nothing to do here. */
> +}
> +
> #endif

Yep, that should do it!

Thanx, Paul

2013-10-31 06:47:49

by Gleb Natapov

Subject: Re: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

On Wed, Oct 30, 2013 at 09:56:29PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2013 at 01:26:05AM +0200, Michael S. Tsirkin wrote:
> > > > Paul, could you review this patch please?
> > > > Documentation/memory-barriers.txt says that unlock has a weaker
> > > > uni-directional barrier, but in practice srcu_read_unlock calls
> > > > smp_mb().
> > > >
> > > > Is it OK to rely on this? If not, can I add
> > > > smp_mb__after_srcu_read_unlock (making it an empty macro for now)
> > > > so we can avoid an actual extra smp_mb()?
> > >
> > > Please use smp_mb__after_srcu_read_unlock(). After all, it was not
> > > that long ago that srcu_read_unlock() contained no memory barriers,
> > > and perhaps some day it won't need to once again.
> > >
> > > Thanx, Paul
> > >
> >
> > Thanks!
> > Would something like this be enough?
> >
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index c114614..9b058ee 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -237,4 +237,18 @@ static inline void srcu_read_unlock(struct srcu_struct *sp, int idx)
> > __srcu_read_unlock(sp, idx);
> > }
> >
> > +/**
> > + * smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock
> > + *
> > + * Converts the preceding srcu_read_unlock into a two-way memory barrier.
> > + *
> > + * Call this after srcu_read_unlock, to guarantee that all memory operations
> > + * that occur after smp_mb__after_srcu_read_unlock will appear to happen after
> > + * the preceding srcu_read_unlock.
> > + */
> > +static inline void smp_mb__after_srcu_read_unlock(void)
> > +{
> > + /* __srcu_read_unlock has smp_mb() internally so nothing to do here. */
> > +}
> > +
> > #endif
>
> Yep, that should do it!
>
This looks dubious to me. All other smp_mb__after_* variants are there
because some atomic operations have different memory barrier semantics on
different arches, but srcu_read_unlock() has the same semantics on all
arches, so smp_mb__after_srcu_read_unlock() becomes
smp_mb__after_a_function_that_happens_to_have_mb_now_but_may_not_have_in_the_future().
How likely is it that smp_mb() will disappear from srcu_read_unlock()
(it was added for a reason, I guess)? Maybe we should change the
documentation to say that srcu_read_unlock() is a memory barrier, which
would reflect reality.
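
(For example, paraphrasing the current headers from memory: on x86,
where clear_bit() is a locked RMW and therefore already a full barrier,

        #define smp_mb__after_clear_bit()       barrier()

while on arches where clear_bit() implies no barrier it is

        #define smp_mb__after_clear_bit()       smp_mb()

whereas srcu_read_unlock() is plain C code, identical everywhere.)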

--
Gleb.

2013-10-31 11:11:28

by Paolo Bonzini

Subject: Re: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

On 31/10/2013 07:47, Gleb Natapov wrote:
> This looks dubious to me. All other smp_mb__after_* variants are there
> because some atomic operations have different memory barrier semantics on
> different arches,

It doesn't have to be arches; unlock APIs typically have release
semantics only, but SRCU is stronger.
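
Concretely, with a release-only unlock nothing would forbid this
reordering (a sketch based on the code the patch touches):

        vcpu->mode = IN_GUEST_MODE;
        srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
        r = ACCESS_ONCE(vcpu->requests);  /* with a release-only unlock
                                           * this load could float up
                                           * above the ->mode store;
                                           * SRCU's full smp_mb() is
                                           * what rules that out */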

> but srcu_read_unlock() has the same semantics on all
> arches, so smp_mb__after_srcu_read_unlock() becomes
> smp_mb__after_a_function_that_happens_to_have_mb_now_but_may_not_have_in_the_future().
> How likely is it that smp_mb() will disappear from srcu_read_unlock()
> (it was added for a reason, I guess)? Maybe we should change the
> documentation to say that srcu_read_unlock() is a memory barrier, which
> would reflect reality.

That would be different from all other unlock APIs.

Paolo

2013-10-31 11:14:23

by Paolo Bonzini

Subject: Re: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

On 30/10/2013 20:09, Michael S. Tsirkin wrote:
> I noticed that srcu_read_lock/unlock both have a memory barrier,
> so just by moving srcu_read_unlock earlier we can get rid of
> one call to smp_mb().
>
> Unsurprisingly, the gain is small but measurable using the unit test
> microbenchmark:
> before
> vmcall 1407
> after
> vmcall 1357
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>

Across how many runs? Best, or average, or "all runs were in that
ballpark"? :) And what's the minimum/maximum before and after the patch?

As you say, the benefit is not surprising, but the experiments should be
documented properly.

Paolo

2013-10-31 11:30:30

by Michael S. Tsirkin

Subject: Re: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

On Thu, Oct 31, 2013 at 12:14:15PM +0100, Paolo Bonzini wrote:
> On 30/10/2013 20:09, Michael S. Tsirkin wrote:
> > I noticed that srcu_read_lock/unlock both have a memory barrier,
> > so just by moving srcu_read_unlock earlier we can get rid of
> > one call to smp_mb().
> >
> > Unsurprisingly, the gain is small but measurable using the unit test
> > microbenchmark:
> > before
> > vmcall 1407
> > after
> > vmcall 1357
> >
> > Signed-off-by: Michael S. Tsirkin <[email protected]>
>
> Across how many runs?


It's the experiment that the vmexit test does:
it runs for 2^30 cycles, then divides the number of cycles by the
number of iterations.
You get in the ballpark of 1300000 iterations normally.
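
Roughly, the measurement loop (paraphrasing x86/vmexit.c from
kvm-unit-tests, from memory, so treat it as a sketch):

        u64 t1, t2;
        int i, iterations = 32;

        do {
                iterations *= 2;        /* double until the run is long enough */
                t1 = rdtsc();
                for (i = 0; i < iterations; ++i)
                        vmcall();       /* one guest/host round trip */
                t2 = rdtsc();
        } while (t2 - t1 < (1ULL << 30));       /* run for >= 2^30 cycles */

        printf("vmcall %d\n", (int)((t2 - t1) / iterations));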

> Best, or average, or "all runs were in that
> ballpark"? :) And what's the minimum/maximum before and after the patch?
>
> As you say, the benefit is not surprising, but the experiments should be
> documented properly.
>
> Paolo

"All runs in that ballpark".

--
MST

2013-10-31 12:28:41

by Gleb Natapov

Subject: Re: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

On Thu, Oct 31, 2013 at 12:11:21PM +0100, Paolo Bonzini wrote:
> On 31/10/2013 07:47, Gleb Natapov wrote:
> > This looks dubious to me. All other smp_mb__after_* variants are there
> > because some atomic operations have different memory barrier semantics on
> > different arches,
>
> It doesn't have to be arches;
Of course it doesn't have to be, but right now it is :)

> unlock APIs typically have release
> semantics only, but SRCU is stronger.
>
Yes, the question is whether this is by design or an implementation
detail we should not rely on.

> > but srcu_read_unlock() has the same semantics on all
> > arches, so smp_mb__after_srcu_read_unlock() becomes
> > smp_mb__after_a_function_that_happens_to_have_mb_now_but_may_not_have_in_the_future().
> > How likely is it that smp_mb() will disappear from srcu_read_unlock()
> > (it was added for a reason, I guess)? Maybe we should change the
> > documentation to say that srcu_read_unlock() is a memory barrier, which
> > would reflect reality.
>
> That would be different from all other unlock APIs.
>
As long as it is documented... smp_mb__after_srcu_read_unlock() is just
a form of documentation anyway right now. I do not have a strong objection
to smp_mb__after_srcu_read_unlock() though; the improvement is impressive
for such a small change.

--
Gleb.

2013-10-31 13:54:26

by Michael S. Tsirkin

Subject: Re: [PATCH RFC] kvm: optimize out smp_mb using srcu_read_unlock

On Wed, Oct 30, 2013 at 09:56:29PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2013 at 01:26:05AM +0200, Michael S. Tsirkin wrote:
> > > > Paul, could you review this patch please?
> > > > Documentation/memory-barriers.txt says that unlock has a weaker
> > > > uni-directional barrier, but in practice srcu_read_unlock calls
> > > > smp_mb().
> > > >
> > > > Is it OK to rely on this? If not, can I add
> > > > smp_mb__after_srcu_read_unlock (making it an empty macro for now)
> > > > so we can avoid an actual extra smp_mb()?
> > >
> > > Please use smp_mb__after_srcu_read_unlock(). After all, it was not
> > > that long ago that srcu_read_unlock() contained no memory barriers,
> > > and perhaps some day it won't need to once again.
> > >
> > > Thanx, Paul
> > >
> >
> > Thanks!
> > Would something like this be enough?
> >
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index c114614..9b058ee 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -237,4 +237,18 @@ static inline void srcu_read_unlock(struct srcu_struct *sp, int idx)
> > __srcu_read_unlock(sp, idx);
> > }
> >
> > +/**
> > + * smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock
> > + *
> > + * Converts the preceding srcu_read_unlock into a two-way memory barrier.
> > + *
> > + * Call this after srcu_read_unlock, to guarantee that all memory operations
> > + * that occur after smp_mb__after_srcu_read_unlock will appear to happen after
> > + * the preceding srcu_read_unlock.
> > + */
> > +static inline void smp_mb__after_srcu_read_unlock(void)
> > +{
> > + /* __srcu_read_unlock has smp_mb() internally so nothing to do here. */
> > +}
> > +
> > #endif
>
> Yep, that should do it!
>
> Thanx, Paul

BTW I'm wondering about the smp_mb() within srcu_read_lock.
If we kept the index in the same memory as the buffer we
dereference, could we get rid of it and use a dependency barrier
instead? It does appear prominently in the profiles.
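
For reference, the lock side is currently more or less this (again
paraphrased from memory, so check the tree):

int __srcu_read_lock(struct srcu_struct *sp)
{
        int idx;

        idx = ACCESS_ONCE(sp->completed) & 0x1;
        preempt_disable();
        ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
        smp_mb(); /* B */  /* Avoid leaking the critical section. */
        ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
        preempt_enable();
        return idx;
}

i.e. the smp_mb() between the two increments is the one in question.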
Thoughts?


--
MST