2009-03-18 19:17:20

by Jeremy Fitzhardinge

Subject: Question about x86/mm/gup.c's use of disabled interrupts

Hi Nick,

The comment in arch/x86/mm/gup.c:gup_get_pte() says:

[...] What
* we do have is the guarantee that a pte will only either go from not
* present to present, or present to not present or both -- it will not
* switch to a completely different present page without a TLB flush in
* between; something that we are blocking by holding interrupts off.


Disabling the interrupt will prevent the tlb flush IPI from coming in
and flushing this cpu's tlb, but I don't see how it will prevent some
other cpu from actually updating the pte in the pagetable, which is what
we're concerned about here. Is this the only reason to disable
interrupts? Would we need to do it for the !PAE cases?
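
(For reference, the PAE side of gup_get_pte() looks roughly like this --
paraphrased, so don't trust the details -- and it's this two-half read that
the comment is guarding:)

static inline pte_t gup_get_pte(pte_t *ptep)
{
        pte_t pte;

retry:
        /* read the two halves separately; this is the non-atomic part */
        pte.pte_low = ptep->pte_low;
        smp_rmb();
        pte.pte_high = ptep->pte_high;
        smp_rmb();
        /* if the low word changed underneath us, start again */
        if (unlikely(pte.pte_low != ptep->pte_low))
                goto retry;

        return pte;
}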

Also, assuming that disabling the interrupt is enough to get the
guarantees we need here, there's a Xen problem because we don't use IPIs
for cross-cpu tlb flushes (well, it happens within Xen). I'll have to
think a bit about how to deal with that, but I'm thinking that we could
add a per-cpu "tlb flushes blocked" flag, and maintain some kind of
per-cpu deferred tlb flush count so we can get around to doing the flush
eventually.

But I want to make sure I understand the exact algorithm here.

(I couldn't find an instance of a pte update going from present->present
anyway; the only caller of set_pte_present is set_pte_vaddr which only
operates on kernel mappings, so perhaps this is moot. Oh, look,
native_set_pte_present thinks it's only called on user mappings... In
fact set_pte_present seems to have completely lost its rationale for
existing.)

Thanks,
J


2009-03-18 21:14:14

by Avi Kivity

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Jeremy Fitzhardinge wrote:
> Hi Nick,
>
> The comment in arch/x86/mm/gup.c:gup_get_pte() says:
>
> [...] What
> * we do have is the guarantee that a pte will only either go from not
> * present to present, or present to not present or both -- it will not
> * switch to a completely different present page without a TLB flush in
> * between; something that we are blocking by holding interrupts off.
>
>
> Disabling the interrupt will prevent the tlb flush IPI from coming in
> and flushing this cpu's tlb, but I don't see how it will prevent some
> other cpu from actually updating the pte in the pagetable, which is
> what we're concerned about here.

The thread that cleared the pte holds the pte lock and is now waiting
for the IPI. The thread that wants to update the pte will wait for the
pte lock, thus also waits on the IPI and gup_fast()'s
local_irq_enable(). I think.

> Is this the only reason to disable interrupts?

Another comment says it also prevents pagetable teardown.

> Also, assuming that disabling the interrupt is enough to get the
> guarantees we need here, there's a Xen problem because we don't use
> IPIs for cross-cpu tlb flushes (well, it happens within Xen). I'll
> have to think a bit about how to deal with that, but I'm thinking that
> we could add a per-cpu "tlb flushes blocked" flag, and maintain some
> kind of per-cpu deferred tlb flush count so we can get around to doing
> the flush eventually.

I was thinking about adding a hypercall for cross-vcpu tlb flushes.
Guess I'll wait for you to clear up all the issues first.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-03-18 21:23:54

by Jeremy Fitzhardinge

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>> Disabling the interrupt will prevent the tlb flush IPI from coming in
>> and flushing this cpu's tlb, but I don't see how it will prevent some
>> other cpu from actually updating the pte in the pagetable, which is
>> what we're concerned about here.
>
> The thread that cleared the pte holds the pte lock and is now waiting
> for the IPI. The thread that wants to update the pte will wait for
> the pte lock, thus also waits on the IPI and gup_fast()'s
> local_irq_enable(). I think.

But hasn't it already done the pte update at that point?

(I think this conversation really is moot because the kernel never does
P->P pte updates any more; it's always P->N->P.)

>> Is this the only reason to disable interrupts?
>
> Another comment says it also prevents pagetable teardown.

We could take a reference to the mm to get the same effect, no?

>> Also, assuming that disabling the interrupt is enough to get the
>> guarantees we need here, there's a Xen problem because we don't use
>> IPIs for cross-cpu tlb flushes (well, it happens within Xen). I'll
>> have to think a bit about how to deal with that, but I'm thinking
>> that we could add a per-cpu "tlb flushes blocked" flag, and maintain
>> some kind of per-cpu deferred tlb flush count so we can get around to
>> doing the flush eventually.
>
> I was thinking about adding a hypercall for cross-vcpu tlb flushes.
> Guess I'll wait for you to clear up all the issues first.

Typical...

J

2009-03-18 21:40:21

by Avi Kivity

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Jeremy Fitzhardinge wrote:
>>> Disabling the interrupt will prevent the tlb flush IPI from coming
>>> in and flushing this cpu's tlb, but I don't see how it will prevent
>>> some other cpu from actually updating the pte in the pagetable,
>>> which is what we're concerned about here.
>>
>> The thread that cleared the pte holds the pte lock and is now waiting
>> for the IPI. The thread that wants to update the pte will wait for
>> the pte lock, thus also waits on the IPI and gup_fast()'s
>> local_irq_enable(). I think.
>
> But hasn't it already done the pte update at that point?
>
> (I think this conversation really is moot because the kernel never
> does P->P pte updates any more; it's always P->N->P.)

I thought you were concerned about cpu 0 doing a gup_fast(), cpu 1 doing
P->N, and cpu 2 doing N->P. In this case cpu 2 is waiting on the pte lock.

>>> Is this the only reason to disable interrupts?
>>
>> Another comment says it also prevents pagetable teardown.
>
> We could take a reference to the mm to get the same effect, no?
>

Won't stop munmap().

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-03-18 22:14:25

by Jeremy Fitzhardinge

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>>>> Disabling the interrupt will prevent the tlb flush IPI from coming
>>>> in and flushing this cpu's tlb, but I don't see how it will prevent
>>>> some other cpu from actually updating the pte in the pagetable,
>>>> which is what we're concerned about here.
>>>
>>> The thread that cleared the pte holds the pte lock and is now
>>> waiting for the IPI. The thread that wants to update the pte will
>>> wait for the pte lock, thus also waits on the IPI and gup_fast()'s
>>> local_irq_enable(). I think.
>>
>> But hasn't it already done the pte update at that point?
>>
>> (I think this conversation really is moot because the kernel never
>> does P->P pte updates any more; it's always P->N->P.)
>
> I thought you were concerned about cpu 0 doing a gup_fast(), cpu 1
> doing P->N, and cpu 2 doing N->P. In this case cpu 2 is waiting on
> the pte lock.

The issue is that if cpu 0 is doing a gup_fast() and other cpus are
doing P->P updates, then gup_fast() can potentially get a mix of old and
new pte values - where P->P is any aggregate set of unsynchronized P->N
and N->P operations on any number of other cpus. Ah, but if every P->N
is followed by a tlb flush, then disabling interrupts will hold off any
following N->P, allowing gup_fast to get a consistent pte snapshot.
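
(In other words, something like the following ordering on the updating cpu.
This is just an illustrative sketch in normal kernel context, not an actual
kernel path:)

/*
 * P -> N -> P' can only happen with a TLB flush in the middle, and with an
 * IPI-based flush that middle step cannot complete while any cpu is sitting
 * in gup_fast()'s interrupts-off window.
 */
static void change_pte_sketch(struct mm_struct *mm, unsigned long addr,
                              pte_t *ptep, pte_t newpte, spinlock_t *ptl)
{
        spin_lock(ptl);
        ptep_get_and_clear(mm, addr, ptep);     /* P -> N */
        flush_tlb_mm(mm);       /* IPI waits until gup_fast re-enables irqs */
        set_pte_at(mm, addr, ptep, newpte);     /* N -> P', only after the flush */
        spin_unlock(ptl);
}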

Hm, awkward if flush_tlb_others doesn't IPI...

> Won't stop munmap().

And I guess it does the tlb flush before freeing the pages, so disabling
the interrupt helps here too.

Simplest fix is to make gup_get_pte() a pvop, but that does seem like
putting a red flag in front of an inner-loop hotspot, or something...

The per-cpu tlb-flush exclusion flag might really be the way to go.

J

2009-03-18 22:44:26

by Avi Kivity

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Jeremy Fitzhardinge wrote:
>> I thought you were concerned about cpu 0 doing a gup_fast(), cpu 1
>> doing P->N, and cpu 2 doing N->P. In this case cpu 2 is waiting on
>> the pte lock.
>
> The issue is that if cpu 0 is doing a gup_fast() and other cpus are
> doing P->P updates, then gup_fast() can potentially get a mix of old
> and new pte values - where P->P is any aggregate set of unsynchronized
> P->N and N->P operations on any number of other cpus. Ah, but if
> every P->N is followed by a tlb flush, then disabling interrupts will
> hold off any following N->P, allowing gup_fast to get a consistent pte
> snapshot.
>

Right.

> Hm, awkward if flush_tlb_others doesn't IPI...
>

How can it avoid flushing the tlb on cpu [01]? If it's gup_fast()ing
a pte, it may as well load it into the tlb.

>
> Simplest fix is to make gup_get_pte() a pvop, but that does seem like
> putting a red flag in front of an inner-loop hotspot, or something...
>
> The per-cpu tlb-flush exclusion flag might really be the way to go.

I don't see how it will work, without changing Xen to look at the flag?

local_irq_disable() is used here to lock out a remote cpu, I don't see
why deferring the flush helps.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-03-18 22:55:41

by Jeremy Fitzhardinge

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Avi Kivity wrote:
>> Hm, awkward if flush_tlb_others doesn't IPI...
>>
>
> How can it avoid flushing the tlb on cpu [01]? If it's
> gup_fast()ing a pte, it may as well load it into the tlb.

xen_flush_tlb_others uses a hypercall rather than an IPI, so none of the
logic which depends on there being an IPI will work.

>> Simplest fix is to make gup_get_pte() a pvop, but that does seem like
>> putting a red flag in front of an inner-loop hotspot, or something...
>>
>> The per-cpu tlb-flush exclusion flag might really be the way to go.
>
> I don't see how it will work, without changing Xen to look at the flag?
>
> local_irq_disable() is used here to lock out a remote cpu, I don't see
> why deferring the flush helps.

Well, no, not deferring. Making xen_flush_tlb_others() spin waiting for
"doing_gup" to clear on the target cpu. Or add an explicit notion of a
"pte update barrier" rather than implicitly relying on the tlb IPI
(which is extremely convenient when available...).
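
Something along these lines, very roughly (made-up names, just to show the
shape of the idea):

/* set/cleared by gup_fast() inside its local_irq_disable() region */
static DEFINE_PER_CPU(int, doing_gup);

static inline void gup_fast_mark(void)
{
        __get_cpu_var(doing_gup) = 1;
        smp_mb();       /* flag must be visible before we start reading ptes */
}

static inline void gup_fast_unmark(void)
{
        smp_mb();       /* pte reads and get_page() must be done first */
        __get_cpu_var(doing_gup) = 0;
}

/* called by xen_flush_tlb_others() after issuing the flush hypercall */
static void wait_for_gup_fast(const struct cpumask *cpus)
{
        int cpu;

        for_each_cpu(cpu, cpus)
                while (per_cpu(doing_gup, cpu))
                        cpu_relax();
}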

J

2009-03-18 23:04:57

by Avi Kivity

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Jeremy Fitzhardinge wrote:
> Avi Kivity wrote:
>>> Hm, awkward if flush_tlb_others doesn't IPI...
>>>
>>
>> How can it avoid flushing the tlb on cpu [01]? If it's
>> gup_fast()ing a pte, it may as well load it into the tlb.
>
> xen_flush_tlb_others uses a hypercall rather than an IPI, so none of
> the logic which depends on there being an IPI will work.

Right, of course, that's what we were talking about. I was thinking of
optimizations to avoid IPIs if an mm never visited a cpu.

>
>>> Simplest fix is to make gup_get_pte() a pvop, but that does seem
>>> like putting a red flag in front of an inner-loop hotspot, or
>>> something...
>>>
>>> The per-cpu tlb-flush exclusion flag might really be the way to go.
>>
>> I don't see how it will work, without changing Xen to look at the flag?
>>
>> local_irq_disable() is used here to lock out a remote cpu, I don't
>> see why deferring the flush helps.
>
> Well, no, not deferring. Making xen_flush_tlb_others() spin waiting
> for "doing_gup" to clear on the target cpu. Or add an explicit notion
> of a "pte update barrier" rather than implicitly relying on the tlb
> IPI (which is extremely convenient when available...).

Pick up a percpu flag from all cpus and spin on each? Nasty.

You could use the irq enabled flag; it's available and what native spins
on (but also means I'll need to add one if I implement this).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-03-18 23:32:41

by Jeremy Fitzhardinge

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>> Avi Kivity wrote:
>>>> Hm, awkward if flush_tlb_others doesn't IPI...
>>>>
>>>
>>> How can it avoid flushing the tlb on cpu [01]? If it's
>>> gup_fast()ing a pte, it may as well load it into the tlb.
>>
>> xen_flush_tlb_others uses a hypercall rather than an IPI, so none of
>> the logic which depends on there being an IPI will work.
>
> Right, of course, that's what we were talking about. I was thinking of
> optimizations to avoid IPIs if an mm never visited a cpu.
>
>>
>>>> Simplest fix is to make gup_get_pte() a pvop, but that does seem
>>>> like putting a red flag in front of an inner-loop hotspot, or
>>>> something...
>>>>
>>>> The per-cpu tlb-flush exclusion flag might really be the way to go.
>>>
>>> I don't see how it will work, without changing Xen to look at the flag?
>>>
>>> local_irq_disable() is used here to lock out a remote cpu, I don't
>>> see why deferring the flush helps.
>>
>> Well, no, not deferring. Making xen_flush_tlb_others() spin waiting
>> for "doing_gup" to clear on the target cpu. Or add an explicit
>> notion of a "pte update barrier" rather than implicitly relying on
>> the tlb IPI (which is extremely convenient when available...).
>
> Pick up a percpu flag from all cpus and spin on each? Nasty.

Yeah, not great. Each of those flag fetches is likely to be cold, so a
bunch of cache misses. The only mitigating factor is that cross-cpu tlb
flushes are expected to be expensive, but some workloads are apparently
very sensitive to extra latency in that path. And the hypercall could
result in no Xen-level IPIs at all, so it could be very quick by
comparison to an IPI-based Linux implementation, in which case the flag
polling would be particularly harsh.

Also, the straightforward implementation of "poll until all target cpu's
flags are clear" may never make progress, so you'd have to "scan flags,
remove busy cpus from set, repeat until all cpus done".

All annoying because this race is pretty unlikely, and it seems a shame
to slow down all tlb flushes to deal with it. Some kind of global
"doing gup_fast" counter would get flush_tlb_others bypass the check, at
the cost of putting a couple of atomic ops around the outside of gup_fast.
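
(Roughly, again with made-up names, and hand-waving the memory ordering
against the flush itself:)

static atomic_t gup_fast_in_flight = ATOMIC_INIT(0);

/* bracket the body of gup_fast() with these */
static inline void gup_fast_enter(void)
{
        atomic_inc(&gup_fast_in_flight);
}

static inline void gup_fast_exit(void)
{
        atomic_dec(&gup_fast_in_flight);
}

/* in xen_flush_tlb_others(), after the flush hypercall */
static void maybe_wait_for_gup_fast(const struct cpumask *cpus)
{
        if (atomic_read(&gup_fast_in_flight))
                wait_for_gup_fast(cpus);        /* the per-cpu poll sketched earlier */
}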

> You could use the irq enabled flag; it's available and what native
> spins on (but also means I'll need to add one if I implement this).

Yes, but then we'd end up spuriously polling on cpus which happened to
disable interrupts for any reason. And if the vcpu is not running then
we could end up polling for a long time. (Same applies for things in
gup_fast, but I'm assuming that's a lot less common than disabling
interrupts in general).

J

2009-03-18 23:37:41

by Jeremy Fitzhardinge

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Shentino wrote:
> But, does a CPU running a task in userspace effectively have a read
> lock on the page tables?

No. A process has its own user pagetable which the kernel maintains on
its behalf. The kernel will briefly take locks on it while doing
modifications, mostly to deal with multithreaded usermode code running
on multiple cpus.

J

2009-03-19 01:32:29

by Nick Piggin

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Hi Jeremy,

I think you got most of your questions already hashed out, but
I could make a suggestion...

On Thursday 19 March 2009 06:17:03 Jeremy Fitzhardinge wrote:
> Hi Nick,
>
> The comment in arch/x86/mm/gup.c:gup_get_pte() says:
>
> [...] What
> * we do have is the guarantee that a pte will only either go from not
> * present to present, or present to not present or both -- it will not
> * switch to a completely different present page without a TLB flush in
> * between; something that we are blocking by holding interrupts off.
>
>
> Disabling the interrupt will prevent the tlb flush IPI from coming in
> and flushing this cpu's tlb, but I don't see how it will prevent some
> other cpu from actually updating the pte in the pagetable, which is what
> we're concerned about here.

Yes, I don't believe it is possible to have a *new* pte installed until
the flush is done.


> Is this the only reason to disable
> interrupts? Would we need to do it for the !PAE cases?

It has to pin page tables, and pin pages as well.


> Also, assuming that disabling the interrupt is enough to get the
> guarantees we need here, there's a Xen problem because we don't use IPIs
> for cross-cpu tlb flushes (well, it happens within Xen). I'll have to
> think a bit about how to deal with that, but I'm thinking that we could
> add a per-cpu "tlb flushes blocked" flag, and maintain some kind of
> per-cpu deferred tlb flush count so we can get around to doing the flush
> eventually.
>
> But I want to make sure I understand the exact algorithm here.

FWIW, powerpc actually can flush tlbs without IPIs, and it also has
a gup_fast. powerpc RCU frees its page _tables_ so we can walk them,
and then I use speculative page references in order to be able to
take a reference on the page without having it pinned.
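
Roughly, the speculative reference pattern is (paraphrased, not the exact
code):

static int gup_pte_speculative(pte_t *ptep, struct page **pagep)
{
        pte_t pte = *ptep;      /* pte read; atomicity is arch-specific */
        struct page *page;

        if (!pte_present(pte))
                return 0;

        page = pte_page(pte);
        /* may fail if the page's refcount has already dropped to zero */
        if (!page_cache_get_speculative(page))
                return 0;

        /* re-check: if the pte changed under us, drop the ref and bail */
        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                put_page(page);
                return 0;
        }

        *pagep = page;
        return 1;
}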

Turning gup_get_pte into a pvop would be a bit nasty because on !PAE
it is just a single load, and even on PAE it is pretty cheap.

2009-03-19 09:47:04

by Avi Kivity

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Jeremy Fitzhardinge wrote:
>>>
>>> Well, no, not deferring. Making xen_flush_tlb_others() spin waiting
>>> for "doing_gup" to clear on the target cpu. Or add an explicit
>>> notion of a "pte update barrier" rather than implicitly relying on
>>> the tlb IPI (which is extremely convenient when available...).
>>
>> Pick up a percpu flag from all cpus and spin on each? Nasty.
>
> Yeah, not great. Each of those flag fetches is likely to be cold, so
> a bunch of cache misses. The only mitigating factor is that cross-cpu
> tlb flushes are expected to be expensive, but some workloads are
> apparently very sensitive to extra latency in that path.

Right, and they'll do a bunch more cache misses, so in comparison it
isn't too bad.

> And the hypercall could result in no Xen-level IPIs at all, so it
> could be very quick by comparison to an IPI-based Linux
> implementation, in which case the flag polling would be particularly
> harsh.

Maybe we could bring these optimizations into Linux as well. The only
thing Xen knows that Linux doesn't is if a vcpu is not scheduled; all
other information is shared.

>
> Also, the straightforward implementation of "poll until all target
> cpu's flags are clear" may never make progress, so you'd have to "scan
> flags, remove busy cpus from set, repeat until all cpus done".
>
> All annoying because this race is pretty unlikely, and it seems a
> shame to slow down all tlb flushes to deal with it. Some kind of
> global "doing gup_fast" counter would get flush_tlb_others bypass the
> check, at the cost of putting a couple of atomic ops around the
> outside of gup_fast.

The nice thing about local_irq_disable() is that it scales so well.

>
>> You could use the irq enabled flag; it's available and what native
>> spins on (but also means I'll need to add one if I implement this).
>
> Yes, but then we'd end up spuriously polling on cpus which happened to
> disable interrupts for any reason. And if the vcpu is not running
> then we could end up polling for a long time. (Same applies for
> things in gup_fast, but I'm assuming that's a lot less common than
> disabling interrupts in general).

Right.

--
error compiling committee.c: too many arguments to function

2009-03-19 17:17:19

by Jeremy Fitzhardinge

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Avi Kivity wrote:
>> And the hypercall could result in no Xen-level IPIs at all, so it
>> could be very quick by comparison to an IPI-based Linux
>> implementation, in which case the flag polling would be particularly
>> harsh.
>
> Maybe we could bring these optimizations into Linux as well. The only
> thing Xen knows that Linux doesn't is if a vcpu is not scheduled; all
> other information is shared.

I don't think there's a guarantee that just because a vcpu isn't running
now, it won't need a tlb flush. If a pcpu runs vcpu 1 -> idle ->
vcpu 1, then there's no need for it to do a tlb flush, but the hypercall
can force a flush when it reschedules vcpu 1 (if the tlb hasn't
already been flushed by some other means).

(I'm not sure to what extent Xen implements this now, but I wouldn't
want to over-constrain it.)

>> Also, the straightforward implementation of "poll until all target
>> cpu's flags are clear" may never make progress, so you'd have to
>> "scan flags, remove busy cpus from set, repeat until all cpus done".
>>
>> All annoying because this race is pretty unlikely, and it seems a
>> shame to slow down all tlb flushes to deal with it. Some kind of
>> global "doing gup_fast" counter would get flush_tlb_others bypass the
>> check, at the cost of putting a couple of atomic ops around the
>> outside of gup_fast.
>
> The nice thing about local_irq_disable() is that it scales so well.

Right. But it effectively puts the burden on the tlb-flusher to check
the state (implicitly, by trying to send an interrupt). Putting an
explicit poll in gets the same effect, but it's pure overhead just to
deal with the gup race.

I'll put a patch together and see how it looks.

J

2009-03-19 17:32:31

by Jeremy Fitzhardinge

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Nick Piggin wrote:
>> Also, assuming that disabling the interrupt is enough to get the
>> guarantees we need here, there's a Xen problem because we don't use IPIs
>> for cross-cpu tlb flushes (well, it happens within Xen). I'll have to
>> think a bit about how to deal with that, but I'm thinking that we could
>> add a per-cpu "tlb flushes blocked" flag, and maintain some kind of
>> per-cpu deferred tlb flush count so we can get around to doing the flush
>> eventually.
>>
>> But I want to make sure I understand the exact algorithm here.
>>
>
> FWIW, powerpc actually can flush tlbs without IPIs, and it also has
> a gup_fast. powerpc RCU frees its page _tables_ so we can walk them,
> and then I use speculative page references in order to be able to
> take a reference on the page without having it pinned.
>

Ah, interesting. So disabling interrupts prevents the RCU free from
happening, and non-atomic pte fetching is a non-issue. So it doesn't
address the PAE side of the problem.

> Turning gup_get_pte into a pvop would be a bit nasty because on !PAE
> it is just a single load, and even on PAE it is pretty cheap.
>

Well, it wouldn't be too bad; for !PAE it would turn into something we
could inline, so there'd be little to no cost. For PAE it would be out
of line, but a direct function call, which would be nicely cached and
very predictable once we've gone through the loop once (and for Xen
I think I'd just make it a cmpxchg8b-based implementation, assuming that
the tlb flush hypercall would offset the cost of making gup_fast a bit
slower).
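
(Something like this for the PAE pte layout -- an untested sketch, not an
actual implementation:)

static pte_t xen_gup_get_pte(pte_t *ptep)
{
        pte_t pte;

        /*
         * cmpxchg64 with old == new == 0 either "succeeds" by rewriting
         * the zero that was already there, or fails and hands back the
         * current 64-bit pte; either way we get an atomic snapshot of
         * both halves.
         */
        pte.pte = cmpxchg64(&ptep->pte, 0ULL, 0ULL);
        return pte;
}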

But it would be better if we can address it at a higher level.

J

2009-03-19 17:34:37

by Avi Kivity

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Jeremy Fitzhardinge wrote:
> Avi Kivity wrote:
>>> And the hypercall could result in no Xen-level IPIs at all, so it
>>> could be very quick by comparison to an IPI-based Linux
>>> implementation, in which case the flag polling would be particularly
>>> harsh.
>>
>> Maybe we could bring these optimizations into Linux as well. The
>> only thing Xen knows that Linux doesn't is if a vcpu is not
>> scheduled; all other information is shared.
>
> I don't think there's a guarantee that just because a vcpu isn't
> running now, it won't need a tlb flush. If a pcpu runs vcpu 1 ->
> idle -> vcpu 1, then there's no need for it to do a tlb flush, but the
> hypercall can force a flush when it reschedules vcpu 1 (if the
> tlb hasn't already been flushed by some other means).

That's what I assumed you meant. Also, if a vcpu has a different cr3
loaded, the flush can be elided. Looks like Linux does this
(s/vcpu/process/).

> (I'm not sure to what extent Xen implements this now, but I wouldn't
> want to over-constrain it.)

Well, kvm does this.


>> The nice thing about local_irq_disable() is that it scales so well.
>
> Right. But it effectively puts the burden on the tlb-flusher to check
> the state (implicitly, by trying to send an interrupt). Putting an
> explicit poll in gets the same effect, but it's pure overhead just to
> deal with the gup race.

I guess it hopes the flushes are much rarer. Certainly for threaded
databases doing O_DIRECT stuff, I'd expect lots of gup_fasts and no tlb flushes.


--
error compiling committee.c: too many arguments to function

2009-03-20 05:07:47

by Paul E. McKenney

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

On Thu, Mar 19, 2009 at 10:31:55AM -0700, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
>>> Also, assuming that disabling the interrupt is enough to get the
>>> guarantees we need here, there's a Xen problem because we don't use IPIs
>>> for cross-cpu tlb flushes (well, it happens within Xen). I'll have to
>>> think a bit about how to deal with that, but I'm thinking that we could
>>> add a per-cpu "tlb flushes blocked" flag, and maintain some kind of
>>> per-cpu deferred tlb flush count so we can get around to doing the flush
>>> eventually.
>>>
>>> But I want to make sure I understand the exact algorithm here.
>>
>> FWIW, powerpc actually can flush tlbs without IPIs, and it also has
>> a gup_fast. powerpc RCU frees its page _tables_ so we can walk them,
>> and then I use speculative page references in order to be able to
>> take a reference on the page without having it pinned.
>
> Ah, interesting. So disabling interrupts prevents the RCU free from
> happening, and non-atomic pte fetching is a non-issue. So it doesn't
> address the PAE side of the problem.

This would be rcu_sched, correct?

Thanx, Paul

>> Turning gup_get_pte into a pvop would be a bit nasty because on !PAE
>> it is just a single load, and even on PAE it is pretty cheap.
>>
>
> Well, it wouldn't be too bad; for !PAE it would turn into something we
> could inline, so there'd be little to no cost. For PAE it would be out of
> line, but a direct function call, which would be nicely cached and very
> predictable once we've gone through the loop once (and for Xen I think
> I'd just make it a cmpxchg8b-based implementation, assuming that the tlb
> flush hypercall would offset the cost of making gup_fast a bit slower).
>
> But it would be better if we can address it at a higher level.
>
> J

2009-03-20 15:38:59

by Jeremy Fitzhardinge

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

Paul E. McKenney wrote:
>> Ah, interesting. So disabling interrupts prevents the RCU free from
>> happening, and non-atomic pte fetching is a non-issue. So it doesn't
>> address the PAE side of the problem.
>>
>
> This would be rcu_sched, correct?
>

I guess? Whatever it is that ends up calling all the rcu callbacks
after the idle. A cpu with disabled interrupts can't go through idle,
right? Or is there an explicit way to hold off rcu?

J

2009-03-20 15:57:43

by Paul E. McKenney

Subject: Re: Question about x86/mm/gup.c's use of disabled interrupts

On Fri, Mar 20, 2009 at 08:38:46AM -0700, Jeremy Fitzhardinge wrote:
> Paul E. McKenney wrote:
>>> Ah, interesting. So disabling interrupts prevents the RCU free from
>>> happening, and non-atomic pte fetching is a non-issue. So it doesn't
>>> address the PAE side of the problem.
>>
>> This would be rcu_sched, correct?
>
> I guess? Whatever it is that ends up calling all the rcu callbacks after
> the idle. A cpu with disabled interrupts can't go through idle, right? Or
> is there an explicit way to hold off rcu?

For synchronize_rcu() and call_rcu(), the only guaranteed way to hold
off RCU is rcu_read_lock() and rcu_read_unlock().

For call_rcu_bh, the only guaranteed way to hold off RCU is
rcu_read_lock_bh() and rcu_read_unlock_bh().

For synchronize_srcu(), the only guaranteed way to hold off RCU is
srcu_read_lock() and srcu_read_unlock().

For synchronize_sched() and call_rcu_sched(), anything that disables
preemption, including disabling irqs, holds off RCU.

Although disabling irqs can indeed hold off RCU in some other cases,
the only guarantee is for synchronize_sched() and call_rcu_sched().
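
For RCU-freed page tables the pairing would look roughly like this (a
hypothetical wrapper structure, just to show which flavor lines up with
the interrupts-off walker):

struct pgtable_free {
        struct rcu_head rcu;
        struct page *page;
};

static void pgtable_free_rcu(struct rcu_head *head)
{
        struct pgtable_free *pf = container_of(head, struct pgtable_free, rcu);

        __free_page(pf->page);
        kfree(pf);
}

static void pgtable_free_deferred(struct pgtable_free *pf)
{
        /*
         * Won't run until every cpu has passed through a context switch,
         * so any interrupts-off gup_fast() walker that is mid-walk now is
         * guaranteed to have finished first.
         */
        call_rcu_sched(&pf->rcu, pgtable_free_rcu);
}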

Thanx, Paul

2009-04-03 01:41:53

by Kamble, Nitin A

Subject: paravirtops kernel and HVM guests

Jeremy,
I was trying the latest bits from your paravirtops tree with
xen-unstable. Dom0 is booting fine in a Fedora 10 environment. I am seeing
a page fault handling error while trying to create an HVM guest.
Is anybody working on the HVM support on top of paravirtops kernel?

privcmd_fault: vma=ffff880068081210 7f7cb7cfe000-7f7cb7dfe000, pgoff=0,
uv=00007f7cb7cfe000, ptep=(null) 0000000000000000
Pid: 4253, comm: qemu-dm Not tainted 2.6.29-rc8-tip #6
Call Trace:
[<ffffffff814357bf>] privcmd_fault+0x59/0x67
[<ffffffff810d34ce>] __do_fault+0x50/0x349
[<ffffffff8102912d>] ? __raw_callee_save_xen_pud_val+0x11/0x1e
[<ffffffff810d5659>] handle_mm_fault+0x2e2/0x6e9
[<ffffffff8102b9c2>] ? check_events+0x12/0x20
[<ffffffff8102b9af>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff816ba9ba>] ? _spin_unlock_irqrestore+0x38/0x3d
[<ffffffff816bcee9>] do_page_fault+0x2a5/0x2bc
[<ffffffff816bad45>] page_fault+0x25/0x30

Thanks & Regards,
Nitin

2009-04-03 03:37:48

by Jeremy Fitzhardinge

Subject: Re: paravirtops kernel and HVM guests

Nitin A Kamble wrote:
> Jeremy,
> I was trying the latest bits from your paravirtops tree with
> xen-unstable. Dom0 is booting fine in a Fedora 10 environment. I am seeing
> a page fault handling error while trying to create an HVM guest.
> Is anybody working on the HVM support on top of paravirtops kernel?
>

Yes, that's how it fails at the moment; that debug message is from the
last time I tried to track down what's going wrong, but I haven't
managed to make any progress beyond that. Getting it working is a
fairly high priority; I'll probably have another look at it next week.

J