2012-05-01 09:40:06

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> >
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> >
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
>
> get_user_pages_fast() depends on the IPI to hold off page table teardown
> while they are locklessly walked with interrupts disabled. If a vcpu
> were to be preempted while in this critical section, another vcpu
> tearing down page tables would go ahead and destroy them. When the
> preempted vcpu resumes it then touches the freed pages.
>
> We could try to teach kvm and get_user_pages_fast() about this, but this
> is intrusive. Another option is to replace the cpu_relax() loop with
> something that sleeps and is then woken up by the TLB IPI handler if needed.

I think something like

select HAVE_RCU_TABLE_FREE if PARAVIRT

or somesuch is just about all it takes.

A slightly better option would be to wrap all that tlb_*table* goo into
paravirt stuff and only do the RCU free when paravirt is indeed enabled,
but other than that you're there.

This should work because the preempted vcpu's RCU state would also be
stalled and thus prevents the actual page-table from going away.
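
For reference, the IRQs-off critical section Avi describes looks roughly
like this (a sketch of the gup_fast() pattern, not the actual
arch/x86/mm/gup.c code; gup_pgd_range() stands in for the real walker):

	/*
	 * With interrupts disabled this cpu cannot ack the TLB-flush
	 * IPI, so a remote cpu tearing down page-tables is held off
	 * until local_irq_restore() -- unless the hypervisor skips
	 * the IPI for a preempted vcpu, as above.
	 */
	local_irq_save(flags);
	nr = gup_pgd_range(start, end, write, pages);	/* lockless walk */
	local_irq_restore(flags);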


2012-05-01 10:47:43

by Avi Kivity

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On 05/01/2012 12:39 PM, Peter Zijlstra wrote:
> On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> > On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > > flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicated
> > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > paravirtualization.
> > >
> > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > avoid sending ipi to pre-empted vcpus.
> > >
> > > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> >
> > get_user_pages_fast() depends on the IPI to hold off page table teardown
> > while they are locklessly walked with interrupts disabled. If a vcpu
> > were to be preempted while in this critical section, another vcpu
> > tearing down page tables would go ahead and destroy them. When the
> > preempted vcpu resumes it then touches the freed pages.
> >
> > We could try to teach kvm and get_user_pages_fast() about this, but this
> > is intrusive. Another option is to replace the cpu_relax() loop with
> > something that sleeps and is then woken up by the TLB IPI handler if needed.
>
> I think something like
>
> select HAVE_RCU_TABLE_FREE if PARAVIRT
>
> or somesuch is just about all it takes.
>
> A slightly better option would be to wrap all that tlb_*table* goo into
> paravirt stuff and only do the RCU free when paravirt is indeed enabled,
> but other than that you're there.

I infer from this that there is a cost involved with rcu freeing. Any
idea how much?

Looks like this increases performance for the overcommitted case, and
also for the case where many vcpus are sleeping, while reducing
performance for the uncontended, high duty cycle case.

> This should work because the preempted vcpu's RCU state would also be
> stalled and thus prevents the actual page-table from going away.

It can be unstalled at any moment. But spin_lock_irq() > rcu_read_lock().

--
error compiling committee.c: too many arguments to function

2012-05-01 10:57:30

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 2012-05-01 at 13:47 +0300, Avi Kivity wrote:
> On 05/01/2012 12:39 PM, Peter Zijlstra wrote:
> > On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> > > On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > > > flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicated
> > > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > > paravirtualization.
> > > >
> > > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > > avoid sending ipi to pre-empted vcpus.
> > > >
> > > > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > >
> > > get_user_pages_fast() depends on the IPI to hold off page table teardown
> > > while they are locklessly walked with interrupts disabled. If a vcpu
> > > were to be preempted while in this critical section, another vcpu
> > > tearing down page tables would go ahead and destroy them. When the
> > > preempted vcpu resumes it then touches the freed pages.
> > >
> > > We could try to teach kvm and get_user_pages_fast() about this, but this
> > > is intrusive. Another option is to replace the cpu_relax() loop with
> > > something that sleeps and is then woken up by the TLB IPI handler if needed.
> >
> > I think something like
> >
> > select HAVE_RCU_TABLE_FREE if PARAVIRT
> >
> > or somesuch is just about all it takes.
> >
> > A slightly better option would be to wrap all that tlb_*table* goo into
> > paravirt stuff and only do the RCU free when paravirt is indeed enabled,
> > but other than that you're there.
>
> I infer from this that there is a cost involved with rcu freeing. Any
> idea how much?

No idea; so far that code has only been used on platforms that required
it, so they didn't have a choice in the matter.

> Looks like this increases performance for the overcommitted case, and
> also for the case where many vcpus are sleeping, while reducing
> performance for the uncontended, high duty cycle case.

Sounds backwards if you put it like that ;-)

> > This should work because the preempted vcpu's RCU state would also be
> > stalled and thus prevents the actual page-table from going away.
>
> It can be unstalled at any moment. But spin_lock_irq() > rcu_read_lock().

Right, but since gup_fast has IRQs disabled the RCU state machine (as
driven by the tick) won't actually do anything until it's done.

To be clear, the case was where the gup_fast() performing vcpu was
preempted in the middle of gup_fast(), on wakeup it would perform the
TLB flush on the virt-enter hook, but meanwhile a sibling vcpu might
have freed the page-tables.

By using call_rcu_sched() to free the page-tables you'd need to receive
and process at least one tick on the woken up cpu after the freeing, but
since the in-progress gup_fast() will have IRQs disabled this will be
delayed.
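
A minimal sketch of that deferral (modelled on the mmu_table_batch
machinery in mm/memory.c; the names here are illustrative, not the
thread's patch):

	struct table_batch {
		struct rcu_head rcu;
		unsigned int nr;
		void *tables[0];	/* page-table pages to free */
	};

	static void table_batch_free(struct rcu_head *head)
	{
		struct table_batch *batch =
			container_of(head, struct table_batch, rcu);
		unsigned int i;

		for (i = 0; i < batch->nr; i++)
			__tlb_remove_table(batch->tables[i]);
		free_page((unsigned long)batch);
	}

	/*
	 * The grace period only ends once every cpu has scheduled; a
	 * vcpu spinning in an IRQs-off gup_fast() cannot reach a
	 * quiescent state, so its tables stay alive until it enables
	 * interrupts again.
	 */
	call_rcu_sched(&batch->rcu, table_batch_free);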

Anyway, I don't have any idea about the costs involved with
HAVE_RCU_TABLE_FREE, but I don't think it's much.. otherwise these other
platforms (PPC,SPARC) wouldn't have used it; gup_fast() is a very
specific case, whereas mmu-gather is something affecting pretty much all
tasks.

But mostly my comment was due to you saying modifying gup_fast() would
be difficult.. I was thinking the one Kconfig line wasn't as onerous ;-)

2012-05-01 11:00:16

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 2012-05-01 at 12:57 +0200, Peter Zijlstra wrote:
> Anyway, I don't have any idea about the costs involved with
> HAVE_RCU_TABLE_FREE, but I don't think it's much.. otherwise these other
> platforms (PPC,SPARC) wouldn't have used it; gup_fast() is a very
> specific case, whereas mmu-gather is something affecting pretty much all
> tasks.

Which reminds me, I thought Xen needed this too, but a git grep on
HAVE_RCU_TABLE_FREE shows it's still only ppc and sparc.

Jeremy?

2012-05-01 12:13:33

by Avi Kivity

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On 05/01/2012 01:57 PM, Peter Zijlstra wrote:
> > Looks like this increases performance for the overcommitted case, and
> > also for the case where many vcpus are sleeping, while reducing
> > performance for the uncontended, high duty cycle case.
>
> Sounds backwards if you put it like that ;-)

Yes...

> > > This should work because the preempted vcpu's RCU state would also be
> > > stalled and thus prevents the actual page-table from going away.
> >
> > It can be unstalled at any moment. But spin_lock_irq() > rcu_read_lock().
>
> Right, but since gup_fast has IRQs disabled the RCU state machine (as
> driven by the tick) won't actually do anything until it's done.

That's what I meant, except I mistyped local_irq_disable() as
spin_lock_irq(). local_irq_save() is a stronger version of rcu_read_lock().

> To be clear, the case was where the gup_fast() performing vcpu was
> preempted in the middle of gup_fast(), on wakeup it would perform the
> TLB flush on the virt-enter hook, but meanwhile a sibling vcpu might
> have freed the page-tables.
>
> By using call_rcu_sched() to free the page-tables you'd need to receive
> and process at least one tick on the woken up cpu after the freeing, but
> since the in-progress gup_fast() will have IRQs disabled this will be
> delayed.

We're now moving the freeing of kvm shadow page tables from using rcu to
using an irq-protected scheme like gup_fast(), because of the
performance differences. We didn't track it down but I expect the cause
is less reuse of cache-hot pages.

> Anyway, I don't have any idea about the costs involved with
> HAVE_RCU_TABLE_FREE, but I don't think it's much.. otherwise these other
> platforms (PPC,SPARC) wouldn't have used it; gup_fast() is a very
> specific case, whereas mmu-gather is something affecting pretty much all
> tasks.

What's changed is not gup_fast() but the performance of munmap(),
exit(), and exec(), no?

What bounds the amount of memory waiting to be freed during an rcu grace
period?

> But mostly my comment was due to you saying modifying gup_fast() would
> be difficult.. I was thinking the one Kconfig line wasn't as onerous ;-)

Yes, you're right - the code is all there and as long as we tolerate the
use of local_irq_disable() in place of rcu_read_lock() no code changes
are needed.

--
error compiling committee.c: too many arguments to function

2012-05-01 15:00:11

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> local_irq_save() is a stronger version of rcu_read_lock()

Not so, local_irq_save() doesn't at all stop regular RCU grace periods;
things like preemptible rcu can be driven from rcu_read_unlock().

Note that this is the reason call_rcu_sched() exists and is used in this
case.

2012-05-01 15:12:14

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> We're now moving the freeing of kvm shadow page tables from using rcu to
> using an irq-protected scheme like gup_fast(), because of the
> performance differences. We didn't track it down but I expect the cause
> is less reuse of cache-hot pages.

It would be good to understand these things.. just doing things 'because'
tends to come back and bite you :-)

If it's cache behaviour you should be able to clearly see that using perf
stat.

2012-05-01 15:15:10

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
>
> What's changed is not gup_fast() but the performance of munmap(),
> exit(), and exec(), no?

If it is indeed cache related like you suggested earlier, it would be
the allocation side of things, like fork()/mmap(), that suffers since
there are fewer hot pages about, but yes, anything creating/destroying
page-tables.

> What bounds the amount of memory waiting to be freed during an rcu grace
> period?

Most RCU implementations don't have limits, so that could be quite a
lot. I think preemptible RCU has a batch limit at which point it tries
rather hard to force a grace period, but I'm not sure if even that
provides a hard limit.

Practically though, I haven't had reports of PPC/Sparc going funny
because of this.

2012-05-01 15:32:57

by Avi Kivity

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On 05/01/2012 05:59 PM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> > local_irq_save() is a stronger version of rcu_read_lock()
>
> > Not so, local_irq_save() doesn't at all stop regular RCU grace periods;
> things like preemptible rcu can be driven from rcu_read_unlock().
>
> Note that this is the reason call_rcu_sched() exists and is used in this
> case.

Ah, thanks for the correction. I'll read some more about the _sched
variants.

--
error compiling committee.c: too many arguments to function

2012-05-01 15:33:31

by Avi Kivity

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On 05/01/2012 06:11 PM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> > We're now moving the freeing of kvm shadow page tables from using rcu to
> > using an irq-protected scheme like gup_fast(), because of the
> > performance differences. We didn't track it down but I expect the cause
> > is less reuse of cache-hot pages.
>
> It would be good to understand these things.. just doing things 'because'
> tends to come back and bite you :-)

Agree. Anyway that was only part of the motivation, the other part was
establishing a bound on the number of pages under deferred freeing.

> If it's cache behaviour you should be able to clearly see that using perf
> stat.

It's hard to see it clearly because the changes are on the order of 1-2%
in a very noisy benchmark.

--
error compiling committee.c: too many arguments to function

2012-05-01 15:36:47

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 2012-05-01 at 18:31 +0300, Avi Kivity wrote:
> On 05/01/2012 05:59 PM, Peter Zijlstra wrote:
> > On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> > > local_irq_save() is a stronger version of rcu_read_lock()
> >
> > Not so, local_irq_save() doesn't at all stop regular RCU grace periods;
> > things like preemptible rcu can be driven from rcu_read_unlock().
> >
> > Note that this is the reason call_rcu_sched() exists and is used in this
> > case.
>
> Ah, thanks for the correction. I'll read some more about the _sched
> variants.

Basically:

rcu_read_{,un}lock_sched() -- aka preempt_{dis,en}able()
call_rcu_sched()
synchronize_rcu_sched() -- aka synchronize_sched();

Define an RCU variant where each cpu has to have scheduled at least once
to complete a grace period.
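
An illustrative pairing (gp, use() and free_old() are made-up names,
not code from this thread):

	rcu_read_lock_sched();			/* ~ preempt_disable() */
	p = rcu_dereference_sched(gp);
	if (p)
		use(p);
	rcu_read_unlock_sched();		/* ~ preempt_enable() */

	/*
	 * The old object outlives any such reader section, including
	 * regions that merely have preemption or IRQs disabled.
	 */
	call_rcu_sched(&old->rcu, free_old);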

2012-05-01 15:41:19

by Avi Kivity

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On 05/01/2012 06:36 PM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 18:31 +0300, Avi Kivity wrote:
> > On 05/01/2012 05:59 PM, Peter Zijlstra wrote:
> > > On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> > > > local_irq_save() is a stronger version of rcu_read_lock()
> > >
> > > Not so, local_irq_save() doesn't at all stop regular RCU grace periods;
> > > things like preemptible rcu can be driven from rcu_read_unlock().
> > >
> > > Note that this is the reason call_rcu_sched() exists and is used in this
> > > case.
> >
> > Ah, thanks for the correction. I'll read some more about the _sched
> > variants.
>
> Basically:
>
> rcu_read_{,un}lock_sched() -- aka preempt_{dis,en}able()
> call_rcu_sched()
> synchronize_rcu_sched() -- aka synchronize_sched();
>
> Define an RCU variant where each cpu has to have scheduled at least once
> to complete a grace period.

Which was the original rcu implementation, until it got replaced by
preemptible rcu, yes? I guess that was the source of my confusion.

--
error compiling committee.c: too many arguments to function

2012-05-01 15:42:28

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 2012-05-01 at 18:39 +0300, Avi Kivity wrote:

> > rcu_read_{,un}lock_sched() -- aka preempt_{dis,en}able()
> > call_rcu_sched()
> > synchronize_rcu_sched() -- aka synchronize_sched();
> >
> > Define an RCU variant where each cpu has to have scheduled at least once
> > to complete a grace period.
>
> Which was the original rcu implementation, until it got replaced by
> preemptible rcu, yes? I guess that was the source of my confusion.

Indeed. The _sched RCU variant has the 'classic' RCU semantics.

2012-05-01 16:08:13

by Avi Kivity

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On 05/01/2012 06:14 PM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 15:12 +0300, Avi Kivity wrote:
> >
> > What's changed is not gup_fast() but the performance of munmap(),
> > exit(), and exec(), no?
>
> If it is indeed cache related like you suggested earlier, it would be
> the allocation side of things, like fork()/mmap(), that suffers since
> there are fewer hot pages about, but yes, anything creating/destroying
> page-tables.

Right.

>
> > What bounds the amount of memory waiting to be freed during an rcu grace
> > period?
>
> Most RCU implementations don't have limits, so that could be quite a
> lot. I think preemptible RCU has a batch limit at which point it tries
> rather hard to force a grace period, but I'm not sure if even that
> provides a hard limit.
>
> Practically though, I haven't had reports of PPC/Sparc going funny
> because of this.

It could be considered a DoS if a user is able to free page tables
faster than rcu is able to recycle them, possibly triggering the oom
killer (should that force a grace period before firing from the hip?)

--
error compiling committee.c: too many arguments to function

2012-05-01 16:17:08

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 2012-05-01 at 18:36 +0300, Avi Kivity wrote:

> > > What bounds the amount of memory waiting to be freed during an rcu grace
> > > period?
> >
> > Most RCU implementations don't have limits, so that could be quite a
> > lot. I think preemptible RCU has a batch limit at which point it tries
> > rather hard to force a grace period, but I'm not sure if even that
> > provides a hard limit.
> >
> > Practically though, I haven't had reports of PPC/Sparc going funny
> > because of this.
>
> It could be considered a DoS if a user is able to free page tables
> faster than rcu is able to recycle them, possibly triggering the oom
> killer (should that force a grace period before firing from the hip?)

One would think that would be a good thing, yes. However, I cannot seem
to find anything like that in the current OOM killer. David, Paul, I
seem to have vague recollections of a discussion about RCU vs OOM; what
was the resolution (if anything) and would something like the below make
sense?

---
 mm/oom_kill.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 46bf2ed5..244a371 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -607,6 +607,9 @@ int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 	struct zone *zone;
 	int ret = 1;
 
+	synchronize_sched();
+	synchronize_rcu();
+
 	spin_lock(&zone_scan_lock);
 	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
 		if (zone_is_oom_locked(zone)) {

2012-05-01 16:19:24

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

And now with David actually on CC ;-)

On Tue, 2012-05-01 at 18:36 +0300, Avi Kivity wrote:

> > > What bounds the amount of memory waiting to be freed during an rcu grace
> > > period?
> >
> > Most RCU implementations don't have limits, so that could be quite a
> > lot. I think preemptible RCU has a batch limit at which point it tries
> > rather hard to force a grace period, but I'm not sure if even that
> > provides a hard limit.
> >
> > Practically though, I haven't had reports of PPC/Sparc going funny
> > because of this.
>
> It could be considered a DoS if a user is able to free page tables
> faster than rcu is able to recycle them, possibly triggering the oom
> killer (should that force a grace period before firing from the hip?)

One would think that would be a good thing, yes. However, I cannot seem
to find anything like that in the current OOM killer. David, Paul, I
seem to have vague recollections of a discussion about RCU vs OOM; what
was the resolution (if anything) and would something like the below make
sense?

---
 mm/oom_kill.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 46bf2ed5..244a371 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -607,6 +607,9 @@ int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 	struct zone *zone;
 	int ret = 1;
 
+	synchronize_sched();
+	synchronize_rcu();
+
 	spin_lock(&zone_scan_lock);
 	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
 		if (zone_is_oom_locked(zone)) {

2012-05-01 16:20:58

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 2012-05-01 at 18:18 +0200, Peter Zijlstra wrote:

> ---
> mm/oom_kill.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 46bf2ed5..244a371 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -607,6 +607,9 @@ int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  	struct zone *zone;
>  	int ret = 1;
>  
> +	synchronize_sched();
> +	synchronize_rcu();
> +
>  	spin_lock(&zone_scan_lock);
>  	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
>  		if (zone_is_oom_locked(zone)) {
>

Hmm, I guess that should be rcu_barrier_sched(); rcu_barrier(); instead
of sync.

2012-05-01 16:44:33

by Paul E. McKenney

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, May 01, 2012 at 06:16:46PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 18:36 +0300, Avi Kivity wrote:
>
> > > > What bounds the amount of memory waiting to be freed during an rcu grace
> > > > period?
> > >
> > > Most RCU implementations don't have limits, so that could be quite a
> > > lot. I think preemptible RCU has a batch limit at which point it tries
> > > rather hard to force a grace period, but I'm not sure if even that
> > > provides a hard limit.

All the TREE_RCU variants will get more aggressive about forcing grace
periods if any given CPU has more than 10,000 callbacks posted. When this
happens, the call_rcu() variants will try to push things ahead.

> > > Practically though, I haven't had reports of PPC/Sparc going funny
> > > because of this.
> >
> > It could be considered a DoS if a user is able to free page tables
> > faster than rcu is able to recycle them, possibly triggering the oom
> > killer (should that force a grace period before firing from the hip?)
>
> One would think that would be a good thing, yes. However, I cannot seem
> to find anything like that in the current OOM killer. David, Paul, I
> seem to have vague recollections of a discussion about RCU vs OOM; what
> was the resolution (if anything) and would something like the below make
> sense?
>
> ---
> mm/oom_kill.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 46bf2ed5..244a371 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -607,6 +607,9 @@ int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  	struct zone *zone;
>  	int ret = 1;
>  
> +	synchronize_sched();
> +	synchronize_rcu();

This will wait for a grace period, but not for the callbacks, which are
the things that actually free the memory. Given that, should we instead
do something like:

rcu_barrier();

Note that rcu_barrier() and rcu_barrier_sched() are one and the same
for CONFIG_PREEMPT=n kernels, and there seems to be a lot more
call_rcu() than call_rcu_sched(), so I left out the rcu_barrier_sched().
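
Schematically (illustrative, not part of the proposed patch):

	synchronize_rcu();	/* waits for pre-existing readers only;
				 * already-queued callbacks may not
				 * have run yet */
	rcu_barrier();		/* also waits for all previously queued
				 * call_rcu() callbacks to finish --
				 * the point at which the memory is
				 * actually freed */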

That said, this does have the effect of delaying the startup of the OOM
killer, and it does nothing to tell RCU that accelerating grace periods
would be a good thing. If a DoS attack is a theoretical possibility rather
than a real bug, is a pure wait on RCU the right approach?

Alternative approaches include:

1.	OOM killer calls into RCU, which arranges to become more
	aggressive about forcing grace periods. (For example, RCU
	could set a flag that caused it to act as if there were
	lots of callbacks posted.)

2.	RCU provides an API that forces grace periods, perhaps
	invoked from a separate kthread so that the OOM killer can
	proceed in parallel with RCU's grace-period forcing.

3.	Like #2, but invoke it a bit earlier than the OOM killer
	would normally start running.

Thanx, Paul

>  	spin_lock(&zone_scan_lock);
>  	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
>  		if (zone_is_oom_locked(zone)) {
>

2012-05-01 22:49:54

by Jeremy Fitzhardinge

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On 05/01/2012 03:59 AM, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 12:57 +0200, Peter Zijlstra wrote:
>> Anyway, I don't have any idea about the costs involved with
>> HAVE_RCU_TABLE_FREE, but I don't think it's much.. otherwise these other
>> platforms (PPC,SPARC) wouldn't have used it; gup_fast() is a very
>> specific case, whereas mmu-gather is something affecting pretty much all
>> tasks.
> Which reminds me, I thought Xen needed this too, but a git grep on
> HAVE_RCU_TABLE_FREE shows it's still only ppc and sparc.
>
> Jeremy?

Yeah, I was thinking that too, but I can't remember what we did to
resolve it. For pure PV guests, gupf simply isn't used, so the problem
is moot. But for dom0 or PCI-passthrough it could be.

Konrad, Stefano?

J

2012-05-02 08:51:56

by Nikunj A Dadhania

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 01 May 2012 11:39:36 +0200, Peter Zijlstra <[email protected]> wrote:
> On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> > On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > > flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicated
> > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > paravirtualization.
> > >
> > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > avoid sending ipi to pre-empted vcpus.
> > >
> > > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> >
> > get_user_pages_fast() depends on the IPI to hold off page table teardown
> > while they are locklessly walked with interrupts disabled. If a vcpu
> > were to be preempted while in this critical section, another vcpu
> > tearing down page tables would go ahead and destroy them. When the
> > preempted vcpu resumes it then touches the freed pages.
> >
> > We could try to teach kvm and get_user_pages_fast() about this, but this
> > is intrusive. Another option is to replace the cpu_relax() loop with
> > something that sleeps and is then woken up by the TLB IPI handler if needed.
>
> I think something like
>
> select HAVE_RCU_TABLE_FREE if PARAVIRT
>
> or somesuch is just about all it takes.
>
[root@krm1 linux]# grep HAVE_RCU_TABLE .config
CONFIG_HAVE_RCU_TABLE_FREE=y
[root@krm1 linux]# make -j32 -s
mm/memory.c: In function ‘tlb_remove_table_one’:
mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’

I suppose we need to have __tlb_remove_table. Trying to understand what
needs to be done there.

Regards
Nikunj

2012-05-02 10:21:24

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Wed, 2012-05-02 at 14:21 +0530, Nikunj A Dadhania wrote:
> [root@krm1 linux]# grep HAVE_RCU_TABLE .config
> CONFIG_HAVE_RCU_TABLE_FREE=y
> [root@krm1 linux]# make -j32 -s
> mm/memory.c: In function ‘tlb_remove_table_one’:
> mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’
>
> I suppose we need to have __tlb_remove_table. Trying to understand what
> needs to be done there.

Argh, I really should get back to unifying all mmu-gather
implementations :/

I think something like the below ought to sort it.

Completely untested though..

---
 arch/powerpc/include/asm/pgalloc.h  |    1 +
 arch/s390/mm/pgtable.c              |    1 +
 arch/sparc/include/asm/pgalloc_64.h |    1 +
 arch/x86/mm/pgtable.c               |    6 +++---
 include/asm-generic/tlb.h           |    9 +++++++++
 mm/memory.c                         |    7 +++++++
 6 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..c33ae79 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)
 
 	pgtable_free(table, shift);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6e765bf..7029ed7 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
 	else
 		free_pages((unsigned long) table, ALLOC_ORDER);
 }
+#define __tlb_remove_table __tlb_remove_table
 
 static void tlb_remove_table_smp_sync(void *arg)
 {
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 40b2d7a..d10913a 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
 	is_page = true;
 	pgtable_free(table, is_page);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool is_page)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..34fa168 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
 	pgtable_page_dtor(pte);
 	paravirt_release_pte(page_to_pfn(pte));
-	tlb_remove_page(tlb, pte);
+	tlb_remove_table(tlb, pte);
 }
 
 #if PAGETABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
 	paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
-	tlb_remove_page(tlb, virt_to_page(pmd));
+	tlb_remove_table(tlb, virt_to_page(pmd));
 }
 
 #if PAGETABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
 	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-	tlb_remove_page(tlb, virt_to_page(pud));
+	tlb_remove_table(tlb, virt_to_page(pud));
 }
 #endif /* PAGETABLE_LEVELS > 3 */
 #endif /* PAGETABLE_LEVELS > 2 */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f96a5b5..8ca33e9 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -19,6 +19,8 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -60,6 +62,13 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+#else
+
+static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_page(tlb, page);
+}
+
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index bf8b403..6ca6561 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
  * See the comment near struct mmu_table_batch.
  */
 
+#ifndef __tlb_remove_table
+static void __tlb_remove_table(void *table)
+{
+	free_page_and_swap_cache(table);
+}
+#endif
+
 static void tlb_remove_table_smp_sync(void *arg)
 {
 	/* Simply deliver the interrupt */

2012-05-02 13:54:07

by Nikunj A Dadhania

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Wed, 02 May 2012 12:20:40 +0200, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-05-02 at 14:21 +0530, Nikunj A Dadhania wrote:
> > [root@krm1 linux]# grep HAVE_RCU_TABLE .config
> > CONFIG_HAVE_RCU_TABLE_FREE=y
> > [root@krm1 linux]# make -j32 -s
> > mm/memory.c: In function ‘tlb_remove_table_one’:
> > mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’
> >
> > I suppose we need to have __tlb_remove_table. Trying to understand what
> > needs to be done there.
>
> Argh, I really should get back to unifying all mmu-gather
> implementations :/
>
> I think something like the below ought to sort it.
>
Thanks a lot.

> Completely untested though..
>

Tested-by: Nikunj A Dadhania <[email protected]>

Here is the comparison with the other version.

        Gang   pv_spin_flush   pv_spin_flush_rcu
1VM     1.01            0.49                0.49
2VMs    7.07            4.04                4.06
4VMs    9.07            5.27                5.19
8VMs    9.99            7.65                7.80

Will test other use cases as well and report back.

Regards
Nikunj

2012-05-03 14:02:24

by Stefano Stabellini

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Tue, 1 May 2012, Jeremy Fitzhardinge wrote:
> On 05/01/2012 03:59 AM, Peter Zijlstra wrote:
> > On Tue, 2012-05-01 at 12:57 +0200, Peter Zijlstra wrote:
> >> Anyway, I don't have any idea about the costs involved with
> >> HAVE_RCU_TABLE_FREE, but I don't think it's much.. otherwise these other
> >> platforms (PPC,SPARC) wouldn't have used it; gup_fast() is a very
> >> specific case, whereas mmu-gather is something affecting pretty much all
> >> tasks.
> > Which reminds me, I thought Xen needed this too, but a git grep on
> > HAVE_RCU_TABLE_FREE shows it's still only ppc and sparc.
> >
> > Jeremy?
>
> Yeah, I was thinking that too, but I can't remember what we did to
> resolve it. For pure PV guests, gupf simply isn't used, so the problem
> is moot. But for dom0 or PCI-passthrough it could be.

Yes, dom0 can use gupf, for example when a userspace block backend is
involved.

Reading the code it seems to me that xen_flush_tlb_others returns
immediately and successfully, no matter whether one or more vcpus are
running at the moment or not, and no matter if one or more vcpus
previously disabled interrupts.

Therefore I think that we should be using HAVE_RCU_TABLE_FREE.
I am going to submit a patch for that.

2012-05-04 04:33:22

by Nikunj A Dadhania

Subject: Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

On Wed, 02 May 2012 12:20:40 +0200, Peter Zijlstra <[email protected]> wrote:
[...]
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index f96a5b5..8ca33e9 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -19,6 +19,8 @@
> #include <asm/pgalloc.h>
> #include <asm/tlbflush.h>
>
> +static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
> +
> #ifdef CONFIG_HAVE_RCU_TABLE_FREE
> /*
> * Semi RCU freeing of the page directories.
> @@ -60,6 +62,13 @@ struct mmu_table_batch {
> extern void tlb_table_flush(struct mmu_gather *tlb);
> extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
>
> +#else
> +
> +static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> +	tlb_remove_page(tlb, page);
>
tlb_remove_page(tlb, table);
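
So the fallback would read (the hunk above with that one fix applied):

	static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
	{
		tlb_remove_page(tlb, table);
	}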