LinuxLists.cc - Re: [RFC] page-table walkers vs memory order

2012-08-04 14:38:41

Subject: Re: [RFC] page-table walkers vs memory order

On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> Since then, I think THP has made the rules more complicated; but I
> believe Andrea paid a great deal of attention to that kind of issue.

There were many issues, one unexpected was
1a5a9906d4e8d1976b701f889d8f35d54b928f25.

Keep in mind when holding only mmap_sem read mode (walk page range
speculative mode) or gup_fast, the result is always undefined and
racey if on the other CPU you have a munmap or mremap or any other pmd
manging concurrently messing with the mapping you're walking, all we
have to do is not to crash, it doesn't matter what happens.

The fact you need a barrier() or ACCESS_ONCE to avoid a lockup in a
while (rcu_dereference()), is no good reason to worry about all
possible purely theoretical gcc issues.

One important thing that wasn't mentioned so far in this thread is
also that we entirely relay on gcc for all pagetable and device driver
writes (to do 1 movq instead of 8 movb), see native_set_pmd and writel.

We must separate all different cases to avoid huge confusion:

1) tmp=*ptr, while(tmp) -> possible, needs barrier or better
ACCESS_ONCE if possible

2) orig_pmd = *pmdp before do_wp_huge_page in memory.c -> possible, needs
barrier() and it should be possible to convert to ACCESS_ONCE

3) native_set_pmd and friends -> possible but not worth fixing, tried
to fix a decade ago for a peace of mind and I was suggested to
desist and it didn't bite us yet

4) writel -> possible but same as 3

5) compiler behaving like alpha -> impossible (I may be wrong but I
believe so after thinking more on it)

6) I was told a decade ago by Honza to never touch any ram that can
change under the compiler unless it's declared volatile (could
crash over switch/case statement implemented with a table if the
switch/case value is re-read by the compiler). -> depends, we
don't always obey to this rule, clearly gup_fast currently disobeys
and even the generic pmd_read_atomic still disobeys (MADV_DONTNEED
can zero the pmd). If there's no "switch/case" I'm not aware of
other troubles.

7) barrier in pmd_none_or_trans_huge_or_clear_bad -> possible, same
issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f

Note: here I'm ignoring CPU reordering, this is only about the compiler.

5 is impossible because:

a) the compiler can't read a guessed address or it can crash the
kernel

b) the compiler has no memory to store a "guessed" valid address when
the function return and the stack is unwind

For the compiler to behave like alpha, the compiler should read the
pteval before the pmdp, that it can't do, because it has no address to
guess from and it would Oops if it really tries to guess it!

So far it was said "compiler can guess the address" but there was no
valid explanation of how it could do it, and I don't see it, so please
explain if I'm wrong about the a, b above.

Furthermore the ACCESS_ONCE that Peter's patch added to gup_fast
pud/pgd can't prevent the compiler to read a guessed pmdp address as a
volatile variable, before reading the pmdp pointer and compare it with
the guessed address! So if it's 5 you worry about, when adding
ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it. You should
have added a barrier() instead.

> I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> gup_fast() breaks as many rules as it can, and in particular may
> be racing with the freeing of page tables; but I'm not so sure
> about the pagewalk mods - we could say "cannot do any harm",
> but I don't like adding lines on that basis.

I agree to add ACCESS_ONCE but because it's case 2, 7 above and it
could race with free_pgtables of pgd/pud/pmd or MADV_DONTNEED with
pmd.

The other part of the patch in pagewalk.c is superflous and should be
dropped: pud/pgd can't change in walk_page_table, it's required to
hold the mmap_sem at least in read mode (it's not disabling irqs).

The pmd side instead can change but only with THP enabled, and only
because MADV_DONTNEED (never because of free_pgtables) but it's
already fully handled through pmd_none_or_trans_huge_or_clear_bad. The
->pmd_entry callers are required to call pmd_trans_unstable() before
proceeding as the pmd may have been zeroed out by the time we get
there. The solution is zero barrier()/ACCESS_ONCE impact for THP=n. If
there are smp_read_barrier_depends already in alpha pte methods we're
fine.

Sorry for the long email but without a list that separates all
possible cases above, I don't think we can understand what we're
fixing in that patch and why the gup.c part is good.

Peter, I suggest to resend the fix with a more detailed explanataion
of the 2, 7 kind of race for gup.c only and drop the pagewalk.c.

Thanks,
Andrea

2012-08-04 22:04:25

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [RFC] page-table walkers vs memory order

On Sat, Aug 04, 2012 at 04:37:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > Since then, I think THP has made the rules more complicated; but I
> > believe Andrea paid a great deal of attention to that kind of issue.
>
> There were many issues, one unexpected was
> 1a5a9906d4e8d1976b701f889d8f35d54b928f25.

[ . . . ]

> 5) compiler behaving like alpha -> impossible (I may be wrong but I
> believe so after thinking more on it)
>
> 6) I was told a decade ago by Honza to never touch any ram that can
> change under the compiler unless it's declared volatile (could
> crash over switch/case statement implemented with a table if the
> switch/case value is re-read by the compiler). -> depends, we
> don't always obey to this rule, clearly gup_fast currently disobeys
> and even the generic pmd_read_atomic still disobeys (MADV_DONTNEED
> can zero the pmd). If there's no "switch/case" I'm not aware of
> other troubles.
>
> 7) barrier in pmd_none_or_trans_huge_or_clear_bad -> possible, same
> issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f
>
> Note: here I'm ignoring CPU reordering, this is only about the compiler.
>
> 5 is impossible because:
>
> a) the compiler can't read a guessed address or it can crash the
> kernel
>
> b) the compiler has no memory to store a "guessed" valid address when
> the function return and the stack is unwind
>
> For the compiler to behave like alpha, the compiler should read the
> pteval before the pmdp, that it can't do, because it has no address to
> guess from and it would Oops if it really tries to guess it!
>
> So far it was said "compiler can guess the address" but there was no
> valid explanation of how it could do it, and I don't see it, so please
> explain if I'm wrong about the a, b above.

OK, I'll bite. ;-)

The most sane way for this to happen is with feedback-driven techniques
involving profiling, similar to what is done for basic-block reordering
or branch prediction. The idea is that you compile the kernel in an
as-yet (and thankfully) mythical pointer-profiling mode, which records
the values of pointer loads and also measures the pointer-load latency.
If a situation is found where a given pointer almost always has the
same value but has high load latency (for example, is almost always a
high-latency cache miss), this fact is recorded and fed back into a
subsequent kernel build. This subsequent kernel build might choose to
speculate the value of the pointer concurrently with the pointer load.

And of course, when interpreting the phrase "most sane way" at the
beginning of the prior paragraph, it would probably be wise to keep
in mind who wrote it. And that "most sane way" might have little or
no resemblance to anything that typical kernel hackers would consider
anywhere near sanity. ;-)

> Furthermore the ACCESS_ONCE that Peter's patch added to gup_fast
> pud/pgd can't prevent the compiler to read a guessed pmdp address as a
> volatile variable, before reading the pmdp pointer and compare it with
> the guessed address! So if it's 5 you worry about, when adding
> ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it. You should
> have added a barrier() instead.

Most compiler writers I have discussed this with agreed that a volatile
cast would suppress value speculation. The "volatile" keyword is not
all that well specified in the C and C++ standards, but as "nix" said
at http://lwn.net/Articles/509731/:

volatile's meaning as 'minimize optimizations applied to things
manipulating anything of volatile type, do not duplicate, elide,
move, fold, spindle or mutilate' is of long standing.

That said, value speculation as a compiler optimization makes me a bit
nervous, so my current feeling is that is should be suppressed entirely.

Hey, you asked, even if only implicitly! ;-)

Thanx, Paul

> > I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> > gup_fast() breaks as many rules as it can, and in particular may
> > be racing with the freeing of page tables; but I'm not so sure
> > about the pagewalk mods - we could say "cannot do any harm",
> > but I don't like adding lines on that basis.
>
> I agree to add ACCESS_ONCE but because it's case 2, 7 above and it
> could race with free_pgtables of pgd/pud/pmd or MADV_DONTNEED with
> pmd.
>
> The other part of the patch in pagewalk.c is superflous and should be
> dropped: pud/pgd can't change in walk_page_table, it's required to
> hold the mmap_sem at least in read mode (it's not disabling irqs).
>
> The pmd side instead can change but only with THP enabled, and only
> because MADV_DONTNEED (never because of free_pgtables) but it's
> already fully handled through pmd_none_or_trans_huge_or_clear_bad. The
> ->pmd_entry callers are required to call pmd_trans_unstable() before
> proceeding as the pmd may have been zeroed out by the time we get
> there. The solution is zero barrier()/ACCESS_ONCE impact for THP=n. If
> there are smp_read_barrier_depends already in alpha pte methods we're
> fine.
>
> Sorry for the long email but without a list that separates all
> possible cases above, I don't think we can understand what we're
> fixing in that patch and why the gup.c part is good.
>
> Peter, I suggest to resend the fix with a more detailed explanataion
> of the 2, 7 kind of race for gup.c only and drop the pagewalk.c.
>
> Thanks,
> Andrea
>

2012-08-04 22:47:42

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [RFC] page-table walkers vs memory order

On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> OK, I'll bite. ;-)

:))

> The most sane way for this to happen is with feedback-driven techniques
> involving profiling, similar to what is done for basic-block reordering
> or branch prediction. The idea is that you compile the kernel in an
> as-yet (and thankfully) mythical pointer-profiling mode, which records
> the values of pointer loads and also measures the pointer-load latency.
> If a situation is found where a given pointer almost always has the
> same value but has high load latency (for example, is almost always a
> high-latency cache miss), this fact is recorded and fed back into a
> subsequent kernel build. This subsequent kernel build might choose to
> speculate the value of the pointer concurrently with the pointer load.
>
> And of course, when interpreting the phrase "most sane way" at the
> beginning of the prior paragraph, it would probably be wise to keep
> in mind who wrote it. And that "most sane way" might have little or
> no resemblance to anything that typical kernel hackers would consider
> anywhere near sanity. ;-)

I see. The above scenario is sure fair enough assumption. We're
clearly stretching the constraints to see what is theoretically
possible and this is a very clear explanation of how gcc could have an
hardcoded "guessed" address in the .text.

Next step to clearify now, is how gcc can safely dereference such a
"guessed" address without the kernel knowing about it.

If gcc would really dereference a guessed address coming from a
profiling run without kernel being aware of it, it would eventually
crash the kernel with an oops. gcc cannot know what another CPU will
do with the kernel pagetables. It'd be perfectly legitimate to
temporarily move the data at the "guessed address" to another page and
to update the pointer through stop_cpu during some weird "cpu
offlining scenario" or anything you can imagine. I mean gcc must
behave in all cases so it's not allowed to deference the guessed
address at any given time.

The only way gcc could do the alpha thing and dereference the guessed
address before the real pointer, is with cooperation with the kernel.
The kernel should provide gcc "safe ranges" that won't crash the
kernel, and/or gcc could provide a .fixup section similar to the
current .fixup and the kernel should look it up during the page fault
handler in case the kernel is ok with temporarily getting faults in
that range. And in turn it can't happen unless we explicitly decide to
allow gcc to do it.

> > Furthermore the ACCESS_ONCE that Peter's patch added to gup_fast
> > pud/pgd can't prevent the compiler to read a guessed pmdp address as a
> > volatile variable, before reading the pmdp pointer and compare it with
> > the guessed address! So if it's 5 you worry about, when adding
> > ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it. You should
> > have added a barrier() instead.
>
> Most compiler writers I have discussed this with agreed that a volatile
> cast would suppress value speculation. The "volatile" keyword is not
> all that well specified in the C and C++ standards, but as "nix" said
> at http://lwn.net/Articles/509731/:
>
> volatile's meaning as 'minimize optimizations applied to things
> manipulating anything of volatile type, do not duplicate, elide,
> move, fold, spindle or mutilate' is of long standing.

Ok, so if the above optimization would be possible, volatile would
stop it too, thanks for the quote and the explanation.

On a side note I believe there's a few barrier()s that may be worth
converting to ACCESS_ONCE, that would take care of case 6) too in
addition to avoid clobbering more CPU registers than strictly
necessary. Not very important but a possible microoptimization.

> That said, value speculation as a compiler optimization makes me a bit
> nervous, so my current feeling is that is should be suppressed entirely.
>
> Hey, you asked, even if only implicitly! ;-)

You're reading my mind! :)

2012-08-04 22:59:29

by Dr. David Alan Gilbert

[permalink] [raw]

Subject: Re: [RFC] page-table walkers vs memory order

* Andrea Arcangeli ([email protected]) wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite. ;-)
>
> :))
>
> > The most sane way for this to happen is with feedback-driven techniques
> > involving profiling, similar to what is done for basic-block reordering
> > or branch prediction. The idea is that you compile the kernel in an
> > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > the values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always a
> > high-latency cache miss), this fact is recorded and fed back into a
> > subsequent kernel build. This subsequent kernel build might choose to
> > speculate the value of the pointer concurrently with the pointer load.
> >
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it. And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would consider
> > anywhere near sanity. ;-)
>
> I see. The above scenario is sure fair enough assumption. We're
> clearly stretching the constraints to see what is theoretically
> possible and this is a very clear explanation of how gcc could have an
> hardcoded "guessed" address in the .text.
>
> Next step to clearify now, is how gcc can safely dereference such a
> "guessed" address without the kernel knowing about it.
>
> If gcc would really dereference a guessed address coming from a
> profiling run without kernel being aware of it, it would eventually
> crash the kernel with an oops. gcc cannot know what another CPU will
> do with the kernel pagetables. It'd be perfectly legitimate to
> temporarily move the data at the "guessed address" to another page and
> to update the pointer through stop_cpu during some weird "cpu
> offlining scenario" or anything you can imagine. I mean gcc must
> behave in all cases so it's not allowed to deference the guessed
> address at any given time.

A compiler could decide to dereference it using a non-faulting load,
do the calculations or whatever on the returned value of the non-faulting
load, and then check whether the load actually faulted, and whether the
address matched the prediction before it did a store based on it's
guess.

Dave
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ gro.gilbert @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/

2012-08-04 23:06:22

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [RFC] page-table walkers vs memory order

On Sun, Aug 05, 2012 at 12:47:05AM +0200, Andrea Arcangeli wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite. ;-)
>
> :))
>
> > The most sane way for this to happen is with feedback-driven techniques
> > involving profiling, similar to what is done for basic-block reordering
> > or branch prediction. The idea is that you compile the kernel in an
> > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > the values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always a
> > high-latency cache miss), this fact is recorded and fed back into a
> > subsequent kernel build. This subsequent kernel build might choose to
> > speculate the value of the pointer concurrently with the pointer load.
> >
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it. And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would consider
> > anywhere near sanity. ;-)
>
> I see. The above scenario is sure fair enough assumption. We're
> clearly stretching the constraints to see what is theoretically
> possible and this is a very clear explanation of how gcc could have an
> hardcoded "guessed" address in the .text.
>
> Next step to clearify now, is how gcc can safely dereference such a
> "guessed" address without the kernel knowing about it.
>
> If gcc would really dereference a guessed address coming from a
> profiling run without kernel being aware of it, it would eventually
> crash the kernel with an oops. gcc cannot know what another CPU will
> do with the kernel pagetables. It'd be perfectly legitimate to
> temporarily move the data at the "guessed address" to another page and
> to update the pointer through stop_cpu during some weird "cpu
> offlining scenario" or anything you can imagine. I mean gcc must
> behave in all cases so it's not allowed to deference the guessed
> address at any given time.
>
> The only way gcc could do the alpha thing and dereference the guessed
> address before the real pointer, is with cooperation with the kernel.
> The kernel should provide gcc "safe ranges" that won't crash the
> kernel, and/or gcc could provide a .fixup section similar to the
> current .fixup and the kernel should look it up during the page fault
> handler in case the kernel is ok with temporarily getting faults in
> that range. And in turn it can't happen unless we explicitly decide to
> allow gcc to do it.

And these are indeed some good reasons why I am not a fan of pointer-value
speculation. ;-)

> > > Furthermore the ACCESS_ONCE that Peter's patch added to gup_fast
> > > pud/pgd can't prevent the compiler to read a guessed pmdp address as a
> > > volatile variable, before reading the pmdp pointer and compare it with
> > > the guessed address! So if it's 5 you worry about, when adding
> > > ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it. You should
> > > have added a barrier() instead.
> >
> > Most compiler writers I have discussed this with agreed that a volatile
> > cast would suppress value speculation. The "volatile" keyword is not
> > all that well specified in the C and C++ standards, but as "nix" said
> > at http://lwn.net/Articles/509731/:
> >
> > volatile's meaning as 'minimize optimizations applied to things
> > manipulating anything of volatile type, do not duplicate, elide,
> > move, fold, spindle or mutilate' is of long standing.
>
> Ok, so if the above optimization would be possible, volatile would
> stop it too, thanks for the quote and the explanation.
>
> On a side note I believe there's a few barrier()s that may be worth
> converting to ACCESS_ONCE, that would take care of case 6) too in
> addition to avoid clobbering more CPU registers than strictly
> necessary. Not very important but a possible microoptimization.

Agreed on both points.

> > That said, value speculation as a compiler optimization makes me a bit
> > nervous, so my current feeling is that is should be suppressed entirely.
> >
> > Hey, you asked, even if only implicitly! ;-)
>
> You're reading my mind! :)

Or succesfully carrying out value speculation on it. ;-)

Thanx, Paul

2012-08-04 23:12:09

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [RFC] page-table walkers vs memory order

On Sat, Aug 04, 2012 at 11:59:10PM +0100, Dr. David Alan Gilbert wrote:
> * Andrea Arcangeli ([email protected]) wrote:
> > On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > > OK, I'll bite. ;-)
> >
> > :))
> >
> > > The most sane way for this to happen is with feedback-driven techniques
> > > involving profiling, similar to what is done for basic-block reordering
> > > or branch prediction. The idea is that you compile the kernel in an
> > > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > > the values of pointer loads and also measures the pointer-load latency.
> > > If a situation is found where a given pointer almost always has the
> > > same value but has high load latency (for example, is almost always a
> > > high-latency cache miss), this fact is recorded and fed back into a
> > > subsequent kernel build. This subsequent kernel build might choose to
> > > speculate the value of the pointer concurrently with the pointer load.
> > >
> > > And of course, when interpreting the phrase "most sane way" at the
> > > beginning of the prior paragraph, it would probably be wise to keep
> > > in mind who wrote it. And that "most sane way" might have little or
> > > no resemblance to anything that typical kernel hackers would consider
> > > anywhere near sanity. ;-)
> >
> > I see. The above scenario is sure fair enough assumption. We're
> > clearly stretching the constraints to see what is theoretically
> > possible and this is a very clear explanation of how gcc could have an
> > hardcoded "guessed" address in the .text.
> >
> > Next step to clearify now, is how gcc can safely dereference such a
> > "guessed" address without the kernel knowing about it.
> >
> > If gcc would really dereference a guessed address coming from a
> > profiling run without kernel being aware of it, it would eventually
> > crash the kernel with an oops. gcc cannot know what another CPU will
> > do with the kernel pagetables. It'd be perfectly legitimate to
> > temporarily move the data at the "guessed address" to another page and
> > to update the pointer through stop_cpu during some weird "cpu
> > offlining scenario" or anything you can imagine. I mean gcc must
> > behave in all cases so it's not allowed to deference the guessed
> > address at any given time.
>
> A compiler could decide to dereference it using a non-faulting load,
> do the calculations or whatever on the returned value of the non-faulting
> load, and then check whether the load actually faulted, and whether the
> address matched the prediction before it did a store based on it's
> guess.

Or the compiler could record a recovery address in a per-thread variable
before doing the speculative reference. The page-fault handler could
consult the per-thread variable and take appropriate action.

But both this approach and your approach are vulnerable to things like
having the speculation area mapped to (say) MMIO space. Not good!

So I am with Andrea on this one -- there would need to be some handshake
between kernel and compiler to avoid messing with possibly-unsafe
mappings. And I am still not much in favor of value speculation. ;-)

Thanx, Paul

2012-08-05 00:11:07

by Dr. David Alan Gilbert

[permalink] [raw]

Subject: Re: [RFC] page-table walkers vs memory order

* Paul E. McKenney ([email protected]) wrote:
> On Sat, Aug 04, 2012 at 11:59:10PM +0100, Dr. David Alan Gilbert wrote:

<snip>

> > A compiler could decide to dereference it using a non-faulting load,
> > do the calculations or whatever on the returned value of the non-faulting
> > load, and then check whether the load actually faulted, and whether the
> > address matched the prediction before it did a store based on it's
> > guess.
>
> Or the compiler could record a recovery address in a per-thread variable
> before doing the speculative reference. The page-fault handler could
> consult the per-thread variable and take appropriate action.

The difference is that I'd expect a compiler writer to think that
they've got a free hand in terms of instruction usage that the OS/library
doesn't see - if it's in the instruction manual and it's marked as user
space and non-faulting I'd say it's fair game; once they know that they're
going to take a fault or mark pages specially then they already know they're
going to have to cooperate with the OS, or worry about what other
normal library calls are going to do.
(A bit of googling seems to suggest IA64 and SPARC have played around
with non-faulting load optimisations, but I can't tell how much.)

> But both this approach and your approach are vulnerable to things like
> having the speculation area mapped to (say) MMIO space. Not good!

Not good for someone doing MMIO, but from an evil-compiler point
of view, they might well assume that a pointer is to memory
unless someone has made an effort to tell them otherwise (not that
there is a good standard to do that).

> So I am with Andrea on this one -- there would need to be some handshake
> between kernel and compiler to avoid messing with possibly-unsafe
> mappings. And I am still not much in favor of value speculation. ;-)

Dave
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ gro.gilbert @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/