I was discussing an obscure bug in the COW handling with Martin.
After some thought on the issue, the only thing that I found potentially
buggy in that path is that we're not writing the 8 bytes atomically
before flushing the tlb, which isn't safe. I'm afraid that without
set_pte_atomic the other cpu, running another userspace thread, could
touch the memory with an empty tlb entry and load the half-written pte,
which may contain the new high bits of the pfn but the old low bits of
the pfn plus the old ro protection. That could potentially give a task
access to a random physical page (in readonly mode, so no kernel crash
or userspace malfunction, just the security compromise). I don't expect
anybody to be able to exploit it though.
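To make the torn value concrete, here is an illustrative sketch with
made-up numbers (the pfn values are invented, only the mechanism is
real):

/*
 * PAE pte being overwritten while established (illustrative numbers):
 *
 *   old pte: pte_high = 0x00000001, pte_low = 0x2d000025  (pfn 0x12d000, ro)
 *   new pte: pte_high = 0x00000003, pte_low = 0x4b000025  (pfn 0x34b000, ro)
 *
 * Between the two halves of set_pte(), a racing thread on another cpu
 * can load:
 *
 *   pte_high = 0x00000003, pte_low = 0x2d000025  ->  pfn 0x32d000
 *
 * i.e. a readonly mapping of a physical page the task never owned.
 */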
Worth noting is that we're buggy in all set_pte implementations, since
all archs would need to also implement set_pte in assembler to make
sure the C compiler doesn't write it byte-by-byte, which would break
SMP for the other thread. On ppc64, where a problem triggered (possibly
unrelated to this), the pte is an unsigned long and it's being updated
by set_pte with this:
*ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS
(note that pte_clear would be fine to stay in C: pte_clear is guaranteed
to run on non-present ptes, so we don't race with other threads; it's
only set_pte whose last write to the pte should always be a single
assembler opcode)
We don't need an SMP lock, we only need to write 4 or 8 bytes at once (a
plain movl in x86 would do the trick). That's all we need. (and in
theory only for SMP, but UP will not get any slowdown since no lock on
the bus will be necessary, we're the only writer thanks to the
page_table_lock, but there are other readers running in userspace in
SMP, hence the need for atomicity)
pte_clear would not be safe to call inside ptep_establish for the same
reason (pte_clear is not atomic and it doesn't need to be atomic unlike
set_pte), while something like this should be fine:
ptep_get_and_clear
set_pte
flush_tlb
but it isn't worth it since all other archs supporting SMP will have
to change set_pte to an assembly version anyway, so for them
set_pte_atomic will simply be defined to set_pte.
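i.e. per arch something like this one-liner (illustrative, not the
actual patch):

/* archs whose pte is a native long: set_pte's single aligned store
 * is already atomic, so set_pte_atomic can simply alias it */
#define set_pte_atomic(ptep, pteval)	set_pte(ptep, pteval)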
The x86 set_pte itself should be changed to:
static inline void set_pte(pte_t *ptep, pte_t pte)
{
	ptep->pte_high = pte.pte_high;
	smp_wmb();
	/* must be done in a single not locked movl */
	ptep->pte_low = pte.pte_low;
}
and for x86 the set_pte_atomic will remain the same as today, so the
below patch seems the right long term fix (even if it breaks all archs
but x86).
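(For reference, "the same as today" on i386 PAE means the whole 8 bytes
go out in one shot via cmpxchg8b; quoted from memory from
asm-i386/pgtable-3level.h, so treat the exact spelling as approximate:

#define set_pte_atomic(pteptr,pteval) \
		set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
)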
Comments?
Index: linux-2.5/include/asm-generic/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-generic/pgtable.h,v
retrieving revision 1.8
diff -u -p -r1.8 pgtable.h
--- linux-2.5/include/asm-generic/pgtable.h 29 Jul 2004 06:01:30 -0000 1.8
+++ linux-2.5/include/asm-generic/pgtable.h 25 Sep 2004 15:16:50 -0000
@@ -15,7 +15,7 @@
*/
#define ptep_establish(__vma, __address, __ptep, __entry) \
do { \
- set_pte(__ptep, __entry); \
+ set_pte_atomic(__ptep, __entry); \
flush_tlb_page(__vma, __address); \
} while (0)
#endif
On Sun, 2004-09-26 at 01:54, Andrea Arcangeli wrote:
> Worth noting is that we're buggy in all set_pte implementations, since
> all archs would need to also implement set_pte in assembler to make
> sure the C compiler doesn't write it byte-by-byte, which would break
> SMP for the other thread. On ppc64, where a problem triggered (possibly
> unrelated to this), the pte is an unsigned long and it's being updated
> by set_pte with this:
Byte by byte ? Ahem ... That would be a really broken C compiler ;) I
don't see how it could be broken on archs where the PTE size is a single
long; ppc64, for example, is not broken. ppc32 is already atomic for
different reasons
> *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS
This is not broken, how can somebody else race on modifying this
PTE since we hold the page table lock and the PTE was previously
cleared & flushed from the hash table ? The last store updating
the PTE before we leave that code path will be a single 8-byte
store (this is a 64-bit architecture !)
> (note that pte_clear would be fine to stay in C: pte_clear is guaranteed
> to run on non-present ptes,
That isn't the case of the pte_clear call issued by set_pte itself on
ppc64. I haven't looked at other cases in the generic code, but I
suppose they indeed use get_and_clear instead.
> so we don't race with other threads; it's
> only set_pte whose last write to the pte should always be a single
> assembler opcode)
Why ? I mean, why _always_ ? The above is perfectly correct on ppc64
> We don't need an SMP lock, we only need to write 4 or 8 bytes at once (a
> plain movl in x86 would do the trick). That's all we need.
No, we need the page table lock on ppc64 because we must make sure the
PTE has been cleared & flushed from the hash table, before we set it to
the new value and all of that without another thread trying to modify
it. The only way we could get rid of the locks around set_pte or other
calls modifying the PTE validity/address on ppc64 would be to use the
PAGE_BUSY bit, which we defined as a per-PTE lock, in such a way that
it's taken around the whole flush+write operation (that is, holding it
across the hash flush).
> (and in
> theory only for SMP, but UP will not get any slowdown since no lock on
> the bus will be necessary, we're the only writer thanks to the
> page_table_lock, but there are other readers running in userspace in
> SMP, hence the need for atomicity)
>
> pte_clear would not be safe to call inside ptep_establish for the same
> reason (pte_clear is not atomic and it doesn't need to be atomic unlike
> set_pte), while something like this should be fine:
>
> ptep_get_and_clear
> set_pte
> flush_tlb
>
> but it isn't worth it since all other archs supporting SMP will have
> to change set_pte to an assembly version anyway, so for them
> set_pte_atomic will simply be defined to set_pte.
I don't understand your point... PTEs are usually the native long size
of the arch and usually set_pte is a single aligned store, which means
it's pretty much always "atomic"...
> The x86 set_pte itself should be changed to:
>
> static inline void set_pte(pte_t *ptep, pte_t pte)
> {
> 	ptep->pte_high = pte.pte_high;
> 	smp_wmb();
> 	/* must be done in a single not locked movl */
> 	ptep->pte_low = pte.pte_low;
> }
>
> and for x86 the set_pte_atomic will remain the same as today, so the
> below patch seems the right long term fix (even if it breaks all archs
> but x86).
>
> Comments?
If I understand your explanation, all you need is make sure that x86
set_pte sets the HW present bit last when writing the 2 halves, no ?
Ben.
On Sat, 25 Sep 2004, Andrea Arcangeli wrote:
> set_pte), while something like this should be fine:
>
> ptep_get_and_clear
> set_pte
> flush_tlb
Almost. Think of software TLB refills, especially HPTE.
The order needs to be:
ptep_get_and_clear
flush_tlb
set_pte
Any page faults happening "in the middle" will end up as
virtual no-ops once they grab the page_table_lock.
Luckily the PPC64 code in 2.6 seems to have a fix for the
race already. Martin's race is a 2.4 thing only and needs
a 2-line change to establish_pte(), to make the code do what
I described above.
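Concretely, something like this (a sketch of the reordered 2.4
establish_pte, written from memory, not the actual 2.4 source; the
helper names follow 2.4's mm/memory.c):

static inline void establish_pte(struct vm_area_struct *vma,
				 unsigned long address, pte_t *ptep,
				 pte_t entry)
{
	ptep_get_and_clear(ptep);	/* pte is now none */
	flush_tlb_page(vma, address);	/* no stale HW/SW tlb entry left */
	set_pte(ptep, entry);		/* faults in between see pte_none */
	update_mmu_cache(vma, address, entry);
}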
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Sun, Sep 26, 2004 at 09:33:27AM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2004-09-26 at 01:54, Andrea Arcangeli wrote:
>
> > Worth noting is that we're buggy in all set_pte implementations, since
> > all archs would need to also implement set_pte in assembler to make
> > sure the C compiler doesn't write it byte-by-byte, which would break
> > SMP for the other thread. On ppc64, where a problem triggered (possibly
> > unrelated to this), the pte is an unsigned long and it's being updated
> > by set_pte with this:
>
> Byte by byte ? Ahem ... That would be a really broken C compiler ;) I
that would be an overoptimized or underoptimized C compiler, surely not
a really broken one. The C compiler is perfectly allowed to do that,
check the specs or ask your C compiler friends to get confirmation.
anyways on x86 the bug is real in practice, regardless of the C
compiler, heck we even put a smp_wmb() in between the two writes. The
fact that all other archs are buggy in theory too is just a corollary. I
thought it worth fixing the theoretical bug in all other archs too,
instead of continuing to play russian roulette.
> don't see how it could be broken on archs where the PTE size is a single
> long; ppc64, for example, is not broken. ppc32 is already atomic for
> different reasons
of course in practice it's expected to work correctly for all archs,
except x86 where there is a smp_wmb() in between, and even if x86 were
using an unsigned long, the C compiler would still not be guaranteed to
write it atomically.
> > *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS
>
> This is not broken, how can somebody else race on modifying this
It is broken as far as the C language is concerned. You're just hoping to
have an efficient compiler under you, and you're hoping to have an
architecture where doing a single write is more efficient.
This happens to work in practice, but it's still wrong in theory; check
the C language spec. What you are assuming in the above code are semantics
the C language _cannot_ provide you.
the C language has no knowledge of the "somebody else", the C language
thinks it's single threaded and it can do whatever it wants as long as it
writes that data to that place eventually.
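As a standalone illustration of the point (a minimal sketch; the
function names are invented, and volatile only discourages splitting
in practice, it's no formal atomicity guarantee either):

/* A conforming compiler may split this plain store into several
 * smaller stores (or even a memset), since C99 assumes no concurrent
 * reader exists. */
void set_word(unsigned long *p, unsigned long v)
{
	*p = v;			/* may legally be emitted as byte stores */
}

/* A volatile-qualified access forbids the compiler from merging or
 * splitting it in practice, which is why atomic_t's counter is
 * declared volatile; only inline asm pins down the exact opcode. */
void set_word_volatile(unsigned long *p, unsigned long v)
{
	*(volatile unsigned long *)p = v;
}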
> That isn't the case of the pte_clear call issued by set_pte itself on
> ppc64. I haven't looked at other cases in the generic code, but I
> suppose they indeed use get_and_clear instead.
generic code should really use get_and_clear for that. pte_clear in
common code can only be called on non-present ptes. Again, ppc64 would
be ok in practice (still not in theory), but x86 would break even in
practice (not only in theory) if you use pte_clear on a present pte.
But even ppc64 is wrong as far as C is concerned, your = 0 can be
implemented as 8 one-byte writes, and the C compiler would be right and
you would be wrong. You never know if they could ever choose to do a
memset instead of an atomic write, or to use whatever new assembler
instruction future implementations of the cpu provide.
I perfectly know it works fine in practice and that the only definitive
bug is in the x86 arch, but this is a theoretical bug for _all_ archs.
> > so we don't race with other threads; it's
> > only set_pte whose last write to the pte should always be a single
> > assembler opcode)
>
> Why ? I mean, why _always_ ? The above is perfectly correct on ppc64
it's not correct even on ppc64.
>
> > We don't need an SMP lock, we only need to write 4 or 8 bytes at once (a
> > plain movl in x86 would do the trick). That's all we need.
>
> No, we need the page table lock on ppc64 because we must make sure the
you misunderstood, obviously everybody is required to hold the
page_table_lock while writing to any pagetable.
What I meant with lock above, was the "lock prefix to the movl"
instruction, not the lock as in page_table_lock.
The write to the pte doesn't need to be executed with an atomic opcode,
a movl would work, it doesn't need to be a "lock movl", because thanks
to the page_table_lock, there's only one writer, and all readers are in
userspace racing with us (hence the need for an assembler write, the
only thing that can provide an atomicity guarantee, C can't, check the
language specifications). It works by luck; now if it crashed you could
still blame the compiler for being suboptimal, but you definitely
cannot blame the compiler for being wrong. The only wrong thing is your
implementation of set_pte that you keep advocating, not the compiler.
> I don't understand your point... PTEs are usually the native long size
> of the arch and usually set_pte is a single aligned store, which means
> it's pretty much always "atomic"...
same question all over again; I already explained many times above why C
cannot provide any atomic operation. And you even seem to partially agree
that it's buggy when you say "pretty much always" instead of a plain
"always" ;).
> If I understand your explanation, all you need is make sure that x86
> set_pte sets the HW present bit last when writing the 2 halves, no ?
x86 already does that. But that's not enough. It must be a
set_pte_atomic, writing all 8 bytes at once: the pte is already
established, so the first write of the first half will make any racing
thread load into the tlb a mapping to a wrong random page in the system.
This is a security compromise (note: seccomp not involved as usual,
since I obviously disallow clone on bytecode running under seccomp mode).
If every other kernel guy agrees with you to keep depending on semantics
the C language cannot provide us, I can live with that (in that case I'm
only going to fix the respective x86 and x86-64 set_pte in asm and I will
save this email in case a MM corruption bug triggers in the future due to
subtle compiler behaviour ;).
returning to the practical bug (ignoring the theoretical bug) we will
have to at least move the modified ptep_establish as I posted (with
set_pte_atomic instead of set_pte) from asm-generic to asm-i386. That
will fix the security issue I found on x86 PAE with >4G of ram.
On Sun, 26 Sep 2004, Andrea Arcangeli wrote:
> But even ppc64 is wrong as far as C is concerned,
Looks fine to me. From include/asm-ppc64/pgtable.h
static inline void set_pte(pte_t *ptep, pte_t pte)
{
if (pte_present(*ptep)) {
pte_clear(ptep);
flush_tlb_pending();
}
*ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
}
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Sat, Sep 25, 2004 at 07:44:05PM -0400, Rik van Riel wrote:
> On Sat, 25 Sep 2004, Andrea Arcangeli wrote:
>
> > set_pte), while something like this should be fine:
> >
> > ptep_get_and_clear
> > set_pte
> > flush_tlb
>
> Almost. Think of software TLB refills, especially HPTE.
> The order needs to be:
>
> ptep_get_and_clear
> flush_tlb
> set_pte
Interesting point. I sure agree it's saner to have the flush_tlb in
between ptep_get_and_clear and set_pte; I said the other version just
because I was thinking of hardware TLBs, and it shouldn't make any
difference on a hardware TLB anyway, should it?
> Any page faults happening "in the middle" will end up as
> virtual no-ops once they grab the page_table_lock.
I'm not very fond of software TLBs and their internal locking, but
exactly because of what you said ("they grab the page_table_lock."), how
can the software TLB ever care about the flush_tlb in between
ptep_get_and_clear and set_pte?
ptep_establish is obviously always called with the page_table_lock held.
Nobody is allowed to call ptep_establish without it. So a larger code
sequence of my version expands to:
spin_lock(&page_table_lock)
ptep_establish() {
ptep_get_and_clear
set_pte
flush_tlb
}
spin_unlock(&page_table_lock)
How can a software TLB care about a tlb flush in between two pieces of
code that are anyway under the page_table_lock?
On Sun, 26 Sep 2004, Andrea Arcangeli wrote:
> I'm not very fond of software TLBs and their internal locking, but
> exactly because of what you said ("they grab the page_table_lock."), how
> can the software TLB ever care about the flush_tlb in between
> ptep_get_and_clear and set_pte?
It only grabs the page_table_lock if it finds pte_none(pte),
otherwise it'll run without any of the generic VM locks.
> How can a software TLB care about a tlb flush in between two pieces of
> code that are anyway under the page_table_lock?
Thing is, it's not under page_table_lock ;)
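The shape of what that means, as a sketch (the helper names here are
invented for illustration; the real ppc64 hash-refill path is assembly,
not C):

typedef unsigned long pte_t;

/* hypothetical helpers, for illustration only */
extern int pte_present(pte_t pte);
extern void load_tlb_entry(pte_t pte);
extern void do_page_fault_slowpath(void);	/* takes page_table_lock */

void software_tlb_refill(pte_t *ptep)
{
	pte_t pte = *ptep;		/* lockless read, no VM locks held */

	if (pte_present(pte))
		load_tlb_entry(pte);	/* refill without page_table_lock */
	else
		do_page_fault_slowpath(); /* race becomes a virtual no-op */
}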
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Sun, 2004-09-26 at 10:20, Andrea Arcangeli wrote:
> that would be an overoptimized or underoptimized C compiler, surely not
> a really broken one. The C compiler is perfectly allowed to do that,
> check the specs or ask your C compiler friends to get confirmation.
>
> anyways on x86 the bug is real in practice, regardless of the C
> compiler, heck we even put a smp_wmb() in between the two writes. The
> fact that all other archs are buggy in theory too is just a corollary. I
> thought it worth fixing the theoretical bug in all other archs too,
> instead of continuing to play russian roulette.
How so ? A bunch of archs have the pte being a simple long; on these
set_pte is perfectly atomic as it is... I'd say in this regard that
x86 is the exception ;)
> > don't see how it could be broken on archs where the PTE size is a single
> > long; ppc64, for example, is not broken. ppc32 is already atomic for
> > different reasons
>
> of course in practice it's expected to work correctly for all archs,
> except x86 where there is a smp_wmb() in between, and even if x86 were
> using an unsigned long, the C compiler would still not be guaranteed to
> write it atomically.
Ugh ? Writing an unsigned long with a single instruction is always
atomic, whatever the C compiler does. If the compiler turns a single
long aligned write into something else, then it's very broken imho
> > > *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS
> >
> > This is not broken, how can somebody else race on modifying this
>
> It is broken as far as the C language is concerned. You're just hoping to
> have an efficient compiler under you, and you're hoping to have an
> architecture where doing a single write is more efficient.
I'm not convinced :)
> This happens to work in practice, but it's still wrong in theory; check
> the C language spec. What you are assuming in the above code are semantics
> the C language _cannot_ provide you.
Oh yes, I suppose we could imagine a compiler writing the above
bit-by-bit, but let's stay serious on that one, there are a bunch of
other places in the kernel that would be broken if the compiler started
breaking up an int or long write in pieces :)
> the C language has no knowledge of the "somebody else", the C language
> thinks it's single threaded and it can do whatever it wants as long as it
> writes that data to that place eventually.
And ? The above example is clearly a read/modify/write case; all the
compiler might ever do is move the last write around, but it can't
move it before the if () since there's a clear data dependency and
the flush done in there has an implicit barrier, so all that can happen
is for the write to the PTE to happen a bit later ... I don't see where
the issue is here.
> > That isn't the case of the pte_clear call issued by set_pte itself on
> > ppc64. I haven't looked at other cases in the generic code, but I
> > suppose they indeed use get_and_clear instead.
>
> generic code should really use get_and_clear for that. pte_clear in
> common code can only be called on non-present ptes. Again, ppc64 would
> be ok in practice (still not in theory), but x86 would break even in
> practice (not only in theory) if you use pte_clear on a present pte.
>
> But even ppc64 is wrong as far as C is concerned, your = 0 can be
> implemented as 8 one-byte writes, and the C compiler would be right and
> you would be wrong.
And I would change to another compiler. Any compiler that messes with a
write of an aligned native-sized type like that is good for the
trashcan.
> You never know if they could ever choose to do a memset
> instead of an atomic write, or to use whatever new assembler
> instruction future implementations of the cpu provide.
>
> I perfectly know it works fine in practice and that the only definitive
> bug is in the x86 arch, but this is a theoretical bug for _all_ archs.
>
> > > so we don't race with other threads; it's
> > > only set_pte whose last write to the pte should always be a single
> > > assembler opcode)
> >
> > Why ? I mean, why _always_ ? The above is perfectly correct on ppc64
>
> it's not correct even on ppc64.
Find me a single case of a compiler not generating that correctly then.
Doing an atomic instruction there would have a cost I don't want to pay.
> > > We don't need an SMP lock, we only need to write 4 or 8 bytes at once (a
> > > plain movl in x86 would do the trick). That's all we need.
> >
> > No, we need the page table lock on ppc64 because we must make sure the
>
> you misunderstood, obviously everybody is required to hold the
> page_table_lock while writing to any pagetable.
Except in the fault path when setting accessed or dirty on a rw page...
But then, I refer you to the patches that have been floating around for
implementing page table lock-less do_page_fault(), this is what I was
talking about.
> What I meant with lock above, was the "lock prefix to the movl"
> instruction, not the lock as in page_table_lock.
I understood what you were saying
> The write to the pte doesn't need to be executed with an atomic opcode,
> a movl would work, it doesn't need to be a "lock movl", because thanks
> to the page_table_lock, there's only one writer, and all readers are in
> userspace racing with us (hence the need for an assembler write, the
> only thing that can provide an atomicity guarantee, C can't, check the
> language specifications). It works by luck; now if it crashed you could
> still blame the compiler for being suboptimal, but you definitely
> cannot blame the compiler for being wrong. The only wrong thing is your
> implementation of set_pte that you keep advocating, not the compiler.
Again, find me a single case where the compiler will generate anything
but an "std" instruction for the above on ppc64 and you'll get a free
case of champagne :)
> > I don't understand your point... PTEs are usually the native long size
> > of the arch and usually set_pte is a single aligned store, which means
> > it's pretty much always "atomic"...
>
> same question all over again; I already explained many times above why C
> cannot provide any atomic operation. And you even seem to partially agree
> that it's buggy when you say "pretty much always" instead of a plain
> "always" ;).
>
> > If I understand your explanation, all you need is make sure that x86
> > set_pte sets the HW present bit last when writing the 2 halves, no ?
>
> x86 already does that. But that's not enough. It must be a
> set_pte_atomic, writing all 8 bytes at once: the pte is already
> established, so the first write of the first half will make any racing
> thread load into the tlb a mapping to a wrong random page in the system.
Unless you do like ppc64 and clear it first :) But I agree that if your
architecture lets you do 64-bit atomic writes on the 32-bit arch, then
it's definitely the better solution. But then, you don't need a special
set_pte_atomic, just make the normal set_pte do that and be done with
it... Or is there a perf. issue there ?
> This is a security compromise (note: seccomp not involved as usual,
> since I obviously disallow clone on bytecode running under seccomp mode).
>
> If every other kernel guy agrees with you to keep depending on semantics
> the C language cannot provide us, I can live with that (in that case I'm
> only going to fix the respective x86 and x86-64 set_pte in asm and I will
> save this email in case a MM corruption bug triggers in the future due to
> subtle compiler behaviour ;).
>
> returning to the practical bug (ignoring the theoretical bug) we will
> have to at least move the modified ptep_establish as I posted (with
> set_pte_atomic instead of set_pte) from asm-generic to asm-i386. That
> will fix the security issue I found on x86 PAE with >4G of ram.
--
Benjamin Herrenschmidt <[email protected]>
On Sat, Sep 25, 2004 at 08:31:13PM -0400, Rik van Riel wrote:
> On Sun, 26 Sep 2004, Andrea Arcangeli wrote:
>
> > But even ppc64 is wrong as far as C is concerned,
>
> Looks fine to me. From include/asm-ppc64/pgtable.h
>
> static inline void set_pte(pte_t *ptep, pte_t pte)
> {
> if (pte_present(*ptep)) {
> pte_clear(ptep);
> flush_tlb_pending();
> }
> *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
> }
As far as the C language is concerned that *ptep = something can be
implemented with 8 writes of 1 byte each (or alternatively with an
assembler instruction that may make the written data visible not
atomically to other cpus, despite being written with a single opcode,
similarly to what happens if you use incl without the lock prefix). I'm
not saying such instruction exists in ppc64, but the compiler is
definitely allowed to break the above. You can blame the compiler for
being inefficient, but you can't blame the compiler for the security
hazard it would generate. Only the kernel would be to blame if for
whatever reason a gcc version would be underoptimized.
I perfectly know in practice the above is "almost guaranteed" to work,
it's just the "almost" that's not good enough for me ;)
anyways this is just a corollary to the true x86 bug, where a smp_wmb()
sits in between the two separate writes, which makes the x86 set_pte
obviously not atomic even in practice (not just in theory like on
ppc64 and all other archs). I thought it was better to fix the
theoretical bugs too; they are true bugs as far as the C language is
concerned, even if they're not triggering with the current gcc
implementation of the language (or with any other decent compiler ;).
On Sun, 2004-09-26 at 10:46, Andrea Arcangeli wrote:
> As far as the C language is concerned that *ptep = something can be
> implemented with 8 writes of 1 byte each (or alternatively with an
> assembler instruction that may make the written data visible not
> atomically to other cpus, despite being written with a single opcode,
> similarly to what happens if you use incl without the lock prefix). I'm
> not saying such instruction exists in ppc64, but the compiler is
> definitely allowed to break the above. You can blame the compiler for
> being inefficient, but you can't blame the compiler for the security
> hazard it would generate. Only the kernel would be to blame if for
> whatever reason a gcc version would be underoptimized.
BTW, for your reading pleasure :)
#define atomic_set(v,i) (((v)->counter) = (i))
(asm-i386/atomic.h)
And that's really far from being the only 2 cases where the kernel _relies_
on a write of a simple type like int or long to an aligned location being
atomic. Almost all drivers manipulating DMA descriptors do that, jiffies
is a good example too afaik, and more and more and more ... so if the
compiler is breaking that up, I think the set_pte race is the least of
our problems :)
Ben.
On Sun, Sep 26, 2004 at 10:44:48AM +1000, Benjamin Herrenschmidt wrote:
> How so ? A bunch of archs have the pte being a simple long; on these
> set_pte is perfectly atomic as it is... I'd say in this regard that
> x86 is the exception ;)
this is a C language issue, not a gcc issue. I know gcc does things
optimally and it will always write it atomically.
But the word "atomic" doesn't exist in C, so you cannot write C code and
expect anything to be written atomically.
The single fact that you agree set_pte has to be done with a perfectly
SMP-atomic operation means you also agree it cannot be fully
implemented in C, which is the only thing I'm advocating.
> Ugh ? Writing an unsigned long with a single instruction is always
> atomic, whatever the C compiler does. If the compiler turns a single
> long aligned write into something else, then it's very broken imho
It doesn't need to be. It's always atomic with your current cpu
revision; future cpus may introduce non-SMP-atomic writes and gcc would
be perfectly allowed to use those new instructions by default (though
I guess in the kernel other things would break doing that, as this may
not be the only place where we do this, but it would be our problem, not
a gcc issue).
> I'm not convinced :)
I know it perfectly works in practice today, and you most certainly
don't risk anything with current gcc (I doubt anybody except a gcc
hacker keeping most of gcc in mind at once could give you a true
guarantee of correctness though, I sure can't).
> > This happens to work in practice, but it's still wrong in theory; check
> > the C language spec. What you are assuming in the above code are semantics
> > the C language _cannot_ provide you.
>
> Oh yes, I suppose we could imagine a compiler writing the above
> bit-by-bit, but let's stay serious on that one, there are a bunch of
> other places in the kernel that would be broken if the compiler started
> breaking up an int or long write in pieces :)
I hope not ;). Perhaps you mean readl/writel? At least those are
volatile, which means gcc will not be allowed to mess with those as much
as it can with a plain unsigned long pointer (i.e. the pte).
There's also another kind of bug that we often tolerate, which is to
sometimes touch memory that can change under us without marking the
variable as volatile. That can confuse gcc in some cases; Linus
preferred to allow it by reasoning about when gcc could ever get
confused. So for example if you use non-volatile variables that can
change under you in SMP for a switch statement, your kernel can crash
freely, and it's up to you not to do that and to remember gcc can
generate kernel-crashing code if you do (switch can be implemented with
a hash or similar techniques that may generate overflows and jump into a
random location in memory if the variable can change under gcc). If you
only test them with an if you're probably safe instead (and that's what
most kernel code does, to avoid taking spinlocks in the fast paths).
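To make the switch hazard concrete, here is a sketch (illustrative
only; whether a given gcc emits a jump table here depends on version
and flags, and the function names are invented):

extern int a(void), b(void), c(void);

int handle(unsigned int *state)
{
	/* 'state' is shared and not volatile: the compiler may read it
	 * once for the range check and again for the jump-table index;
	 * if it changes in between, we jump through a stale index. */
	switch (*state) {
	case 0: return a();
	case 1: return b();
	case 2: return c();
	default: return -1;
	}
}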
This case with set_pte is similar, and I'm ready to lose on this one
too, no problem with me, but at least I tried to do the right thing as
far as I can tell.
so I'm really fine with sticking to bugs that are real, like the x86 one
I just found.
> Find me a single case of a compiler not generating that correctly then.
I really hope there is none... but I don't read ppc64 asm anyway, so
you'd have to look into it yourself if you want to be sure.
> Doing an atomic instruction there would have a cost I don't want to pay.
Not really, it would definitely be zero-cost at runtime; you
misunderstood if you think the asm generated wouldn't be the same. The
asm produced would be exactly the same as today, the hardware couldn't
notice the difference, only the C programmer could. I'm not talking
about locked instructions. I'm talking about a single atomic _assembly_
instruction instead of a C line of code (which we would then depend on
the compiler to generate as atomic).
> > you misunderstood, obviously everybody is required to hold the
> > page_table_lock while writing to any pagetable.
>
> Except in the fault path when setting accessed or dirty on a rw page...
maybe I'm biased because I'm reading x86-64 code, but where? the
software mkdirty and mkyoung seem to all be inside the page_table_lock.
> But then, I refer you to the patches that have been floating around for
> implementing page table lock-less do_page_fault(), this is what I was
> talking about.
Not sure how I could tell you were talking about those floating-around
patches. I still don't see any connection between those patches and the
above discussion.
> Again, find me a single case where the compiler will generate anything
> but an "std" instruction for the above on ppc64 and you'll get a free
> case of champagne :)
If anything I can check x86-64, which has the same issue, not ppc64.
If you prefer to ignore those theoretical smp races, then I will save
this email and I'll forward it to you when it triggers in production
because gcc did something strange, and then you will send me the free
case of champagne :). I'm also waiting for the other bug (the lack of
volatile on variables where we access memory that can change under us)
to trigger anywhere in the kernel; only after it does will I have a good
argument to convince people not to depend on subtle behaviour of gcc,
and to write plain C instead and to leave the atomic guarantees to
asm statements that the C compiler isn't allowed to mess up.
Oh maybe it already triggers on Martin's machine... ;), this is another
reason why I would like to see this can of worms closed, so I don't have
to worry every time gcc does something silly that could never be
caught by the gcc regression test suite, since gcc would still be C
compliant despite the apparently silly thing (silly from the point of
view of a kernel developer at least, not necessarily silly from the
point of view of a gcc developer).
I recommend talking to the IBM compiler guys too, not just to me; I'm
probably not the right person to advocate not depending on subtleties
of the current gcc implementation. Last time it wasn't me who asked to
change the kernel to stop touching memory that can change under gcc,
it was Honza, a gcc developer, recommending me not to depend on that (but
we gave up eventually, like I will very easily give up on this too).
> Unless you do like ppc64 and clear it first :) But I agree that if your
> architecture lets you do 64-bit atomic writes on the 32-bit arch, then
> it's definitely the better solution. But then, you don't need a special
> set_pte_atomic, just make the normal set_pte do that and be done with
> it... Or is there a perf. issue there ?
there is a perf issue: cmpxchg8b is a lot more costly than two movl and
a smp_wmb in between. We only need atomic writes (not locked writes) in
all set_pte, except ptep_establish, which is the only one overwriting a
pte that is already present.
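For reference, the cmpxchg8b path being compared against looks roughly
like this (sketched from memory from asm-i386/system.h, so treat the
exact constraints as approximate):

static inline void __set_64bit(unsigned long long *ptr,
			       unsigned int low, unsigned int high)
{
	/* spin a locked cmpxchg8b until the full 8-byte store succeeds;
	 * this bus-locked loop is what makes it costlier than two movl */
	__asm__ __volatile__(
		"\n1:\t"
		"movl (%0), %%eax\n\t"
		"movl 4(%0), %%edx\n\t"
		"lock cmpxchg8b (%0)\n\t"
		"jnz 1b"
		: /* no outputs */
		: "D" (ptr), "b" (low), "c" (high)
		: "ax", "dx", "memory");
}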
so let's drop the C-language-compliance corollary and stick to the
userspace-crashing x86 bug, which is a tangible practical bug. This
fixes an smp race generating userspace mm corruption + a potential
security compromise sniffing random memory with threads on x86 with PAE
with > 4G of ram.
Signed-Off-By: Andrea Arcangeli <[email protected]>
Index: linux-2.5/include/asm-i386/pgtable-3level.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-i386/pgtable-3level.h,v
retrieving revision 1.20
diff -u -p -r1.20 pgtable-3level.h
--- linux-2.5/include/asm-i386/pgtable-3level.h 24 Aug 2004 18:28:07 -0000 1.20
+++ linux-2.5/include/asm-i386/pgtable-3level.h 26 Sep 2004 01:13:10 -0000
@@ -88,6 +88,13 @@ static inline pte_t ptep_get_and_clear(p
return res;
}
+#define __HAVE_ARCH_PTEP_ESTABLISH
+#define ptep_establish(__vma, __address, __ptep, __entry) \
+do { \
+ set_pte_atomic(__ptep, __entry); \
+ flush_tlb_page(__vma, __address); \
+} while (0)
+
static inline int pte_same(pte_t a, pte_t b)
{
return a.pte_low == b.pte_low && a.pte_high == b.pte_high;
On Sun, Sep 26, 2004 at 10:59:43AM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2004-09-26 at 10:46, Andrea Arcangeli wrote:
>
> > As far as the C language is concerned that *ptep = something can be
> > implemented with 8 writes of 1 byte each (or alternatively with an
> > assembler instruction that may make the written data visible not
> > atomically to other cpus, despite being written with a single opcode,
> > similarly to what happens if you use incl without the lock prefix). I'm
> > not saying such instruction exists in ppc64, but the compiler is
> > definitely allowed to break the above. You can blame the compiler for
> > being inefficient, but you can't blame the compiler for the security
> > hazard it would generate. Only the kernel would be to blame if for
> > whatever reason a gcc version would be underoptimized.
>
> BTW, for your reading pleasure :)
>
> #define atomic_set(v,i) (((v)->counter) = (i))
>
> (asm-i386/atomic.h)
and then check this:
typedef struct { volatile int counter; } atomic_t;
^^^^^^^^
if the pte was at least volatile it would be a lot safer. C knows it
must not mess with volatile.
> And that's really far from being the only 2 cases where the kernel _relies_
> on a write of a simple type like int or long to an aligned location being
> atomic. Almost all drivers manipulating DMA descriptors do that, jiffies
> is a good example too afaik, and more and more and more ... so if the
jiffies and writel/readl are all volatile, incidentally.
> compiler is breaking that up, I think the set_pte race is the least of
> our problems :)
the only one not volatile, or can you find more?
On Sun, 2004-09-26 at 11:32, Andrea Arcangeli wrote:
> maybe I'm biased because I'm reading x86-64 code, but where? the
> software mkdirty and mkyoung seem to all be inside the page_table_lock.
ppc and ppc64, which treat their hash table as a kind of big tlb cache,
and embedded ppc's with software-loaded TLBs, all have the TLB or hash
refill mechanism "mimic" a HW TLB load; that is, it is assembly code that
will set the DIRTY or ACCESSED bits without taking the page table lock
> Not sure how I could tell you were talking about those floating-around
> patches. I still don't see any connection between those patches and the
> above discussion.
Oh, I side-tracked a bit on the need to make the PTE update & hash flush
atomic on ppc64 using the per-PTE lock _PAGE_BUSY bit we have there if
we ever implement that lockless do_page_fault(), but that was a side
discussion, sorry for the confusion.
> > Again, find me a single case where the compiler will generate anything
> > but an "std" instruction for the above on ppc64 and you'll get a free
> > case of champagne :)
>
> If anything I can check x86-64, which has the same issue, not ppc64.
>
> If you prefer to ignore those theoretical smp races, then I will save
> this email and I'll forward it to you when it triggers in production
> because gcc did something strange, and then you will send me the free
> case of champagne :)
We have a deal :)
> I'm also waiting for the other bug (the lack of volatile on variables
> where we access memory that can change under us) to trigger anywhere in
> the kernel; only after it does will I have a good
> argument to convince people not to depend on subtle behaviour of gcc,
> and to write plain C instead and to leave the atomic guarantees to
> asm statements that the C compiler isn't allowed to mess up.
>
> Oh maybe it already triggers on Martin's machine... ;), this is another
> reason why I would like to see this can of worms closed, so I don't have
> to worry every time gcc does something silly that could never be
> caught by the gcc regression test suite, since gcc would still be C
> compliant despite the apparently silly thing (silly from the point of
> view of a kernel developer at least, not necessarily silly from the
> point of view of a gcc developer).
Oh, I agree the lack of volatile on a switch/case may be an issue, I've
seen really esoteric ways of generating switch/case...
> there is a perf issue: cmpxchg8b is a lot more costly than two movl and
> a smp_wmb in between. We only need atomic writes (not locked writes) in
> all set_pte, except ptep_establish, which is the only one overwriting a
> pte that is already present.
Right, in your hypothetical scenario, I'd just have to make sure an std
instruction is generated on ppc64
Ben.
On Sun, 2004-09-26 at 11:36, Andrea Arcangeli wrote:
> the only one not volatile, or can you find more?
I'm not sure all DMA descriptor buffers are marked volatile, but yes,
I tend to agree with you that it may be a good idea
Ben.
>> anyways on x86 the bug is real in practice, regardless of the C
>> compiler, heck we even put a smp_wmb() in between the two writes. The
>> fact that all other archs are buggy in theory too is just a corollary. I
>> thought it worth fixing the theoretical bug in all other archs too,
>> instead of continuing to play russian roulette.
>
> How so ? A bunch of archs have the pte being a simple long; on these
> set_pte is perfectly atomic as it is... I'd say in this regard that
> x86 is the exception ;)
Wouldn't it make sense to call set_pte_atomic, and just have that resolve
to set_pte on 90% of arches? (I'm ignoring the weirdo compiler issue here,
this is just for arches with pte > long).
M.
On Sun, Sep 26, 2004 at 03:29:48PM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2004-09-26 at 11:32, Andrea Arcangeli wrote:
>
> > maybe I'm biased because I'm reading x86-64 code, but where? the
> > software mkdirty and mkyoung seem to all be inside the page_table_lock.
>
> ppc and ppc64, which treat their hash table as a kind of big tlb cache,
> and embedded ppc's with software-loaded TLBs, all have the TLB or hash
> refill mechanism "mimic" a HW TLB load; that is, it is assembly code that
> will set the DIRTY or ACCESSED bits without taking the page table lock
ok, I thought you were talking about common code setting the dirty and
accessed bits. The x86 architecture sets the dirty and accessed bits in
hardware as well, without the page table lock.
> Oh, I side-tracked a bit on the need to make the PTE update & hash flush
> atomic on ppc64 using the per-PTE lock _PAGE_BUSY bit we have there if
> we ever implement that lockless do_page_fault(), but that was a side
agreed.
> discussion, sorry for the confusion.
No problem.
> Right, in your hypothetical scenario, I'd just have to make sure an std
> instruction is generated on ppc64
exactly.
On Sun, Sep 26, 2004 at 07:41:52AM -0700, Martin J. Bligh wrote:
> Wouldn't it make sense to call set_pte_atomic, and just have that resolve
> to set_pte on 90% of arches? (I'm ignoring the weirdo compiler issue here,
> this is just for arches with pte > long).
I also like this more, since it better defines that set_pte really
cannot be called on an established pte in common code [i.e. the bug
triggering on x86] (and asm-generic is common code).
Benjamin Herrenschmidt writes:
> On Sun, 2004-09-26 at 10:46, Andrea Arcangeli wrote:
>
> > As far as the C language is concerned that *ptep = something can be
> > implemented with 8 writes of 1 byte each (or alternatively with an
> > assembler instruction that may make the written data visible not
> > atomically to other cpus, despite being written with a single opcode,
> > similarly to what happens if you use incl without the lock prefix). I'm
> > not saying such instruction exists in ppc64, but the compiler is
> > definitely allowed to break the above. You can blame the compiler for
> > being inefficient, but you can't blame the compiler for the security
> > hazard it would generate. Only the kernel would be to blame if for
> > whatever reason a gcc version would be underoptimized.
>
> BTW, for your reading pleasure :)
>
> #define atomic_set(v,i) (((v)->counter) = (i))
FWIW, we also rely on several other things that are not guaranteed by
the C standard, for instance that integer arithmetic is 2's
complement, that bytes are individually addressable, and that pointers
are represented by an address that is no bigger than a long.
Paul.
On Sun, Sep 26, 2004 at 10:36:40PM +0200, Andrea Arcangeli wrote:
> On Mon, Sep 27, 2004 at 06:30:25AM +1000, Paul Mackerras wrote:
> > FWIW, we also rely on several other things that are not guaranteed by
> > the C standard, for instance that integer arithmetic is 2's
> > complement, that bytes are individually addressable, and that pointers
> > are represented by an address that is no bigger than a long.
>
> I wouldn't compare these with atomic writes on non-volatile variables. I
> mean, sizeof(char *) being different from sizeof(long) is something I'm
> very comfortable will not break anytime soon. If you want to add to the
> list, even the gcc inline assembly itself isn't part of the language...
> (in fact that was the major trouble for icc to add it AFAIK)
Just to make an example of why you definitely cannot compare this with
the assumption of sizeof(char *) == sizeof(long) and 2's complement
arithmetic: gcc is allowed to implement something like this:

	*64bit_ptr = 32bit_integer_var << 32;

as two instructions, an xorl and a movl, which is likely to be faster
than a shiftleft + movq. Or maybe it's not faster on x86-64 and gcc may
prefer to use shiftleft + movq, but you get the idea of what I'm talking
about; maybe for other archs the performance point is different, etc...
(by instinct I believe it'd be faster even on x86, especially on the
nocona based on the p4 core)
with the ptes on x86 the shiftleft of 12 probably avoids screwups, but
that's just because we're lucky, and I don't want to think what else gcc
could optimize by evaluating constants...
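Spelled out as compilable code (a sketch; the cast makes the shift
well-defined, and whether gcc actually picks xorl + movl depends on
version and flags):

/* On x86-64 the compiler may legally emit this 64-bit store as two
 * 32-bit stores (e.g. xorl for the low word, movl for the high word),
 * since it knows the low half is zero; a racing reader could then
 * observe the two halves at different times. */
void store_shifted(unsigned long long *p, unsigned int v)
{
	*p = (unsigned long long)v << 32;
}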
Hi!
> > > But even ppc64 is wrong as far as C is concerned,
> >
> > Looks fine to me. From include/asm-ppc64/pgtable.h
> >
> > static inline void set_pte(pte_t *ptep, pte_t pte)
> > {
> > if (pte_present(*ptep)) {
> > pte_clear(ptep);
> > flush_tlb_pending();
> > }
> > *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
> > }
>
> As far as the C language is concerned that *ptep = something can be
> implemented with 8 writes of 1 byte each (or alternatively with an
> assembler instruction that may make the written data visible not
I'd say that it is okay to do this in arch-dependent
code. The architecture controls the list of compilers allowed.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!