From: Rik van Riel <[email protected]>
If ptep_set_access_flags() is invoked to upgrade access permissions
on a PTE, there is no security or data integrity reason to do a
remote TLB flush.
Lazily letting another CPU incur a spurious page fault occasionally
is (much!) cheaper than aggressively flushing everybody else's TLB.
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
Index: tip/arch/x86/mm/pgtable.c
===================================================================
--- tip.orig/arch/x86/mm/pgtable.c
+++ tip/arch/x86/mm/pgtable.c
@@ -306,11 +306,26 @@ int ptep_set_access_flags(struct vm_area
pte_t entry, int dirty)
{
int changed = !pte_same(*ptep, entry);
+ /*
+ * If the page used to be inaccessible (_PAGE_PROTNONE), or
+ * this call upgrades the access permissions on the same page,
+ * it is safe to skip the remote TLB flush.
+ */
+ bool flush_remote = false;
+ if (!pte_accessible(*ptep))
+ flush_remote = false;
+ else if (pte_pfn(*ptep) != pte_pfn(entry) ||
+ (pte_write(*ptep) && !pte_write(entry)) ||
+ (pte_exec(*ptep) && !pte_exec(entry)))
+ flush_remote = true;
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ if (flush_remote)
+ flush_tlb_page(vma, address);
+ else
+ __flush_tlb_one(address);
}
return changed;
On Thu, Oct 25, 2012 at 5:16 AM, Peter Zijlstra <[email protected]> wrote:
> From: Rik van Riel <[email protected]>
>
> @@ -306,11 +306,26 @@ int ptep_set_access_flags(struct vm_area
> pte_t entry, int dirty)
> {
> int changed = !pte_same(*ptep, entry);
> + /*
> + * If the page used to be inaccessible (_PAGE_PROTNONE), or
> + * this call upgrades the access permissions on the same page,
> + * it is safe to skip the remote TLB flush.
> + */
> + bool flush_remote = false;
> + if (!pte_accessible(*ptep))
> + flush_remote = false;
> + else if (pte_pfn(*ptep) != pte_pfn(entry) ||
> + (pte_write(*ptep) && !pte_write(entry)) ||
> + (pte_exec(*ptep) && !pte_exec(entry)))
> + flush_remote = true;
>
> if (changed && dirty) {
Did anybody ever actually look at this sh*t-for-brains patch?
Yeah, I'm grumpy. But I'm wasting time looking at patches that have
new code in them that is stupid and retarded.
This is the VM, guys, we don't add stupid and retarded code.
LOOK at the code, for chrissake. Just look at it. And if you don't see
why the above is stupid and retarded, you damn well shouldn't be
touching VM code.
Linus
On 10/25/2012 04:17 PM, Linus Torvalds wrote:
> On Thu, Oct 25, 2012 at 5:16 AM, Peter Zijlstra <[email protected]> wrote:
>> From: Rik van Riel <[email protected]>
>>
>> @@ -306,11 +306,26 @@ int ptep_set_access_flags(struct vm_area
>> pte_t entry, int dirty)
>> {
>> int changed = !pte_same(*ptep, entry);
>> + /*
>> + * If the page used to be inaccessible (_PAGE_PROTNONE), or
>> + * this call upgrades the access permissions on the same page,
>> + * it is safe to skip the remote TLB flush.
>> + */
>> + bool flush_remote = false;
>> + if (!pte_accessible(*ptep))
>> + flush_remote = false;
>> + else if (pte_pfn(*ptep) != pte_pfn(entry) ||
>> + (pte_write(*ptep) && !pte_write(entry)) ||
>> + (pte_exec(*ptep) && !pte_exec(entry)))
>> + flush_remote = true;
>>
>> if (changed && dirty) {
>
> Did anybody ever actually look at this sh*t-for-brains patch?
>
> Yeah, I'm grumpy. But I'm wasting time looking at patches that have
> new code in them that is stupid and retarded.
>
> This is the VM, guys, we don't add stupid and retarded code.
>
> LOOK at the code, for chrissake. Just look at it. And if you don't see
> why the above is stupid and retarded, you damn well shouldn't be
> touching VM code.
I agree it is pretty ugly. However, the above patch
did get rid of a gigantic performance regression with
Peter's code.
Doing unnecessary remote TLB flushes was costing about
90% performance with specjbb on a 4 node system.
However, if we can guarantee that ptep_set_access_flags
is only ever called for pte permission _upgrades_, we
can simply get rid of the remote TLB flush on x86, and
skip the paranoia tests we are doing above.
Do we have that kind of guarantee?
On Thu, Oct 25, 2012 at 7:30 PM, Rik van Riel <[email protected]> wrote:
>>
>> LOOK at the code, for chrissake. Just look at it. And if you don't see
>> why the above is stupid and retarded, you damn well shouldn't be
>> touching VM code.
>
> I agree it is pretty ugly. However, the above patch
> did get rid of a gigantic performance regression with
> Peter's code.
Rik, *LOOK* at the code like I asked you to, instead of making excuses for it.
I'm not necessarily arguing with what the code tries to do. I'm
arguing with the fact that the code is pure and utter *garbage*.
It has two major (and I mean *MAJOR*) problems, both of which
individually should make you ashamed for ever posting that piece of
shit:
The obvious-without-even-understanding-semantics problem:
- it's humongously stupidly written. It calculates that
'flush_remote' flag WHETHER IT GETS USED OR NOT.
Christ. I can kind of expect stuff like that in driver code etc,
but in VM routines?
Yes, the compiler may be smart enough to actually fix up the
idiocy. That doesn't make it less stupid.
The more-subtle-but-fundamental-problem:
- regardless of how stupidly written it is on a very superficial
level, it's even more stupid in a much more fundamental way.
That whole routine is explicitly written to be opportunistic. It is
*documented* to only set the access flags, so comparing anything else
is stupid, wouldn't you say?
Documented where? It's actually explicitly documented in the
pgtable-generic.c file which has the generic implementation of that
thing. But it's implicitly documented both in the name of the function
(do take another look) *and* in the actual implementation of the
function.
Look at the code: it doesn't even always update the page tables AT ALL
(and no, the return value does *not* reflect whether it updated it or
not!)
Also, notice how we update the pte entry with a simple
*ptep = entry;
statement, not with the usual expensive page table updates? The only
thing that makes this safe is that we *only* do it with the exact same
page frame number (anything else would be disastrously buggy on 32-bit
PAE, for example). And we only ever do it with the dirty bit always
set, because otherwise we might be silently dropping a concurrent
hardware update of the dirty bit of the previous pte value on another
CPU.
The latter requirement is why the x86 code does
if (changed && dirty) {
while the generic code checks just "If (changed)" (and then uses the
much more expensive set_pte_at() that has the proper dirty-bit
guarantees, and generates atomic accesses, not to mention various
virtualization crap).
In other words, everything that was added by that patch is PURE AND
UTTER SHIT. And THAT is what I'm objecting to.
Guess what? If you want to optimize the function to not do remote TLB
flushes, then just do that! None of the garbage. Just change the
flush_tlb_page(vma, address);
line to
__flush_tlb_one(address);
and it should damn well work. Because everything I see about
"flush_remote" looks just wrong, wrong, wrong.
And if there really is some reason for that whole flush_remote
braindamage, then we have much bigger problems, namely the fact that
we've broken the documented semantics of that function, and we're
doing various other things that are completely and utterly invalid
unless the above semantics hold.
So that patch should be burned, and possibly used as an example of
horribly crappy code for later generations. At no point should it be
applied.
Linus
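[For reference, the generic fallback Linus is contrasting with here is
mm/pgtable-generic.c, also quoted as a diff further down this thread. A
sketch of how it looked at the time: it updates the PTE whenever
*anything* changed, and always does a full, possibly remote,
flush_tlb_page().]

int ptep_set_access_flags(struct vm_area_struct *vma,
			  unsigned long address, pte_t *ptep,
			  pte_t entry, int dirty)
{
	int changed = !pte_same(*ptep, entry);
	if (changed) {
		/* atomic/paravirt-safe update, unlike the x86 "*ptep = entry" */
		set_pte_at(vma->vm_mm, address, ptep, entry);
		/* this flush is the (possibly remote) cost under discussion */
		flush_tlb_page(vma, address);
	}
	return changed;
}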
On 10/25/2012 10:56 PM, Linus Torvalds wrote:
> Guess what? If you want to optimize the function to not do remote TLB
> flushes, then just do that! None of the garbage. Just change the
>
> flush_tlb_page(vma, address);
>
> line to
>
> __flush_tlb_one(address);
That may not even be needed. Apparently Intel chips
automatically flush an entry from the TLB when it
causes a page fault. I assume AMD chips do the same,
because flush_tlb_fix_spurious_fault evaluates to
nothing on x86.
> and it should damn well work. Because everything I see about
> "flush_remote" looks just wrong, wrong, wrong.
Are there architectures where we do need to flush
remote TLBs on upgrading the permissions on a PTE?
Because that is what the implementation in
pgtable-generic.c seems to be doing as well...
> And if there really is some reason for that whole flush_remote
> braindamage, then we have much bigger problems, namely the fact that
> we've broken the documented semantics of that function, and we're
> doing various other things that are completely and utterly invalid
> unless the above semantics hold.
Want to just remove the TLB flush entirely and see
if anything breaks in 3.8-rc1?
From reading the code again, it looks like things
should indeed work ok.
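[For context on the flush_tlb_fix_spurious_fault() point above: it is the
hook a fault handler can use to shoot down a stale local TLB entry.
Roughly how the two definitions looked around this time (a paraphrased
sketch, not verbatim kernel text):]

/* generic fallback (include/asm-generic/pgtable.h): flush the page */
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)

/* x86 override (arch/x86/include/asm/pgtable.h): a no-op, because the
 * CPU already dropped the stale TLB entry when it raised the fault */
#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)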
On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <[email protected]> wrote:
>
> That may not even be needed. Apparently Intel chips
> automatically flush an entry from the TLB when it
> causes a page fault. I assume AMD chips do the same,
> because flush_tlb_fix_spurious_fault evaluates to
> nothing on x86.
Yes. It's not architected as far as I know, though. But I agree, it's
possible - even likely - we could avoid TLB flushing entirely on x86.
If you want to try it, I would seriously suggest you do it as a
separate commit though, just in case.
> Are there architectures where we do need to flush
> remote TLBs on upgrading the permissions on a PTE?
I *suspect* that whole TLB flush just magically became an SMP one
without anybody ever really thinking about it.
So it's quite possible we could do this to the pgtable-generic.c code
too. However, we don't actually have any generic way to do a local
single-address flush (the __flush_tlb_one() thing is
architecture-specific, although it exists on a few architectures).
We'd need to add a local_flush_tlb_page(vma, address) function.
Alternatively, we could decide to use the "tlb_fix_spurious_fault()"
thing in there. Possibly just do it unconditionally in the caller - or
even just specify that the fault handler has to do it. And stop
returning a value at all from ptep_set_access_flags() (I *think*
that's the only thing the return value gets used for - flushing the
TLB on the local cpu for the cpu's that want it).
> Want to just remove the TLB flush entirely and see
> if anything breaks in 3.8-rc1?
>
> From reading the code again, it looks like things
> should indeed work ok.
I would be open to it, but just in case it causes bisectable problems
I'd really want to see it in two patches ("make it always do the local
flush" followed by "remove even the local flush"), and then it would
pinpoint any need.
Linus
* Linus Torvalds <[email protected]> wrote:
> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <[email protected]> wrote:
> >
> > That may not even be needed. Apparently Intel chips
> > automatically flush an entry from the TLB when it causes a
> > page fault. I assume AMD chips do the same, because
> > flush_tlb_fix_spurious_fault evaluates to nothing on x86.
>
> Yes. It's not architected as far as I know, though. But I
> agree, it's possible - even likely - we could avoid TLB
> flushing entirely on x86.
>
> If you want to try it, I would seriously suggest you do it as
> a separate commit though, just in case.
Ok, will do it like that. The INVLPG overhead is a small effect,
nevertheless it's worth trying.
What *has* shown up in my profiles though, and which drove some
of these changes is that for heavily threaded VM-intense
workloads such as a single SPECjbb JVM instance running on all
CPUs and all nodes, TLB flushes with any sort of serialization
aspect are absolutely deadly.
So just to be *able* to verify the performance benefit and
impact of some of the later NUMA-directed changes, we had to
eliminate a number of scalability bottlenecks and put these
optimization patches in front of the main changes.
That is why you have to go 20+ patches into the queue to see the
real point :-/
> > Are there architectures where we do need to flush remote
> > TLBs on upgrading the permissions on a PTE?
>
> I *suspect* that whole TLB flush just magically became an SMP
> one without anybody ever really thinking about it.
Yeah, and I think part of the problem is that it's also not a
particularly straightforward performance bottleneck to analyze:
SMP TLB flushing does not mainly show up as visible high overhead
in profiles, it mostly shows up as extra idle time.
If the nature of the workload is that it has extra available
parallelism that can fill in the idle time, it will mask much of
the effect and there's only a slight shift in the profile.
It needs a borderline loaded system and sleep profiling to
pinpoint these sources of overhead.
[...]
> > From reading the code again, it looks like things should
> > indeed work ok.
>
> I would be open to it, but just in case it causes bisectable
> problems I'd really want to see it in two patches ("make it
> always do the local flush" followed by "remove even the local
> flush"), and then it would pinpoint any need.
Yeah, 100% agreed.
Thanks,
Ingo
On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <[email protected]> wrote:
>>
>> That may not even be needed. Apparently Intel chips
>> automatically flush an entry from the TLB when it
>> causes a page fault. I assume AMD chips do the same,
>> because flush_tlb_fix_spurious_fault evaluates to
>> nothing on x86.
>
> Yes. It's not architected as far as I know, though. But I agree, it's
> possible - even likely - we could avoid TLB flushing entirely on x86.
Actually, it is architected on x86. This was first described in the
intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
Invalidation", last paragraph of section 5.1. Nowadays, the same
contents are buried somewhere in Volume 3 of the architecture manual
(in my copy: 4.10.4.1 Operations that Invalidate TLBs and
Paging-Structure Caches)
> If you want to try it, I would seriously suggest you do it as a
> separate commit though, just in case.
>
>> Are there architectures where we do need to flush
>> remote TLBs on upgrading the permissions on a PTE?
>
> I *suspect* that whole TLB flush just magically became an SMP one
> without anybody ever really thinking about it.
I would be very worried about assuming every non-x86 arch has similar
TLB semantics. However, if their fault handlers always invalidate TLB
for pages that get spurious faults, then skipping the remote
invalidation would be fine. (I believe this is what
tlb_fix_spurious_fault() is for ?)
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
Michel Lespinasse <[email protected]> writes:
> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <[email protected]> wrote:
>>>
>>> That may not even be needed. Apparently Intel chips
>>> automatically flush an entry from the TLB when it
>>> causes a page fault. I assume AMD chips do the same,
>>> because flush_tlb_fix_spurious_fault evaluates to
>>> nothing on x86.
>>
>> Yes. It's not architected as far as I know, though. But I agree, it's
>> possible - even likely - we could avoid TLB flushing entirely on x86.
>
> Actually, it is architected on x86. This was first described in the
> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
> Invalidation", last paragraph of section 5.1. Nowadays, the same
> contents are buried somewhere in Volume 3 of the architecture manual
> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
> Paging-Structure Caches)
This unfortunately would only work for processes with no threads
because it only works on the current logical CPU.
-Andi
--
[email protected] -- Speaking for myself only
On 10/26/2012 08:48 AM, Andi Kleen wrote:
> Michel Lespinasse <[email protected]> writes:
>
>> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <[email protected]> wrote:
>>>>
>>>> That may not even be needed. Apparently Intel chips
>>>> automatically flush an entry from the TLB when it
>>>> causes a page fault. I assume AMD chips do the same,
>>>> because flush_tlb_fix_spurious_fault evaluates to
>>>> nothing on x86.
>>>
>>> Yes. It's not architected as far as I know, though. But I agree, it's
>>> possible - even likely - we could avoid TLB flushing entirely on x86.
>>
>> Actually, it is architected on x86. This was first described in the
>> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
>> Invalidation", last paragraph of section 5.1. Nowadays, the same
>> contents are buried somewhere in Volume 3 of the architecture manual
>> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
>> Paging-Structure Caches)
>
> This unfortunately would only work for processes with no threads
> because it only works on the current logical CPU.
That is fine.
Potentially triggering a spurious page fault on
another CPU is bound to be better than always
doing a synchronous remote TLB flush, waiting
for who knows how many CPUs to acknowledge the
IPI...
On Fri, Oct 26, 2012 at 5:48 AM, Andi Kleen <[email protected]> wrote:
> Michel Lespinasse <[email protected]> writes:
>
>> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <[email protected]> wrote:
>>>>
>>>> That may not even be needed. Apparently Intel chips
>>>> automatically flush an entry from the TLB when it
>>>> causes a page fault. I assume AMD chips do the same,
>>>> because flush_tlb_fix_spurious_fault evaluates to
>>>> nothing on x86.
>>>
>>> Yes. It's not architected as far as I know, though. But I agree, it's
>>> possible - even likely - we could avoid TLB flushing entirely on x86.
>>
>> Actually, it is architected on x86. This was first described in the
>> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
>> Invalidation", last paragraph of section 5.1. Nowadays, the same
>> contents are buried somewhere in Volume 3 of the architecture manual
>> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
>> Paging-Structure Caches)
>
> This unfortunately would only work for processes with no threads
> because it only works on the current logical CPU.
No, the point is, if we are only *increasing* permissions on a page,
we can skip the remote TLB invalidations. Later on each remote CPU
might possibly get a spurious fault on that page, but that spurious
fault will resynchronize their TLBs for that page, so that the
instruction retry after the fault won't fault again.
It is often cheaper to let remote CPUs get an occasional spurious
fault than to synchronize with them on every permission change.
Of course, none of the above applies if we are *reducing* permissions
on a page (we really can't skip TLB invalidations there)
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Rik van Riel <[email protected]> wrote:
> On 10/26/2012 08:48 AM, Andi Kleen wrote:
> >Michel Lespinasse <[email protected]> writes:
> >
> >>On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
> >><[email protected]> wrote:
> >>>On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <[email protected]> wrote:
> >>>>
> >>>>That may not even be needed. Apparently Intel chips
> >>>>automatically flush an entry from the TLB when it
> >>>>causes a page fault. I assume AMD chips do the same,
> >>>>because flush_tlb_fix_spurious_fault evaluates to
> >>>>nothing on x86.
> >>>
> >>>Yes. It's not architected as far as I know, though. But I agree, it's
> >>>possible - even likely - we could avoid TLB flushing entirely on x86.
> >>
> >>Actually, it is architected on x86. This was first described in the
> >>intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
> >>Invalidation", last paragraph of section 5.1. Nowadays, the same
> >>contents are buried somewhere in Volume 3 of the architecture manual
> >>(in my copy: 4.10.4.1 Operations that Invalidate TLBs and
> >>Paging-Structure Caches)
> >
> > This unfortunately would only work for processes with no
> > threads because it only works on the current logical CPU.
>
> That is fine.
>
> Potentially triggering a spurious page fault on
> another CPU is bound to be better than always
> doing a synchronous remote TLB flush, waiting
> for who knows how many CPUs to acknowledge the
> IPI...
The other killer is the fundamental IPI delay - which makes it
'invisible' to regular profiling and makes it hard to analyze.
So yes, even the local flush is a win, a major one - and the
flush-less one is likely a win too, because INVLPG has some
TLB-cache-walking costs.
Rik, mind sending an updated patch that addresses Linus's
concerns, or should I code it up if you are busy?
We can also certainly try the second patch, but I'd do it at the
end of the series, to put some tree distance between the two
patches, to not concentrate regression risks too tightly in the
Git space, to help out with hard to bisect problems...
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
> [...]
>
> Rik, mind sending an updated patch that addresses Linus's
> concerns, or should I code it up if you are busy?
>
> We can also certainly try the second patch, but I'd do it at
> the end of the series, to put some tree distance between the
> two patches, to not concentrate regression risks too tightly
> in the Git space, to help out with hard to bisect problems...
I'd also like to have the second patch separately because I'd
like to measure spurious fault frequency before and after the
change, with a reference workload.
Just a single page fault, even if it's a minor one, might make a
micro-optimization a net loss. INVLPG might be the cheaper
option on average - it needs to be measured. (I'll do that, just
please keep it separate from the main TLB-flush optimization.)
Thanks,
Ingo
On Fri, Oct 26, 2012 at 5:34 AM, Michel Lespinasse <[email protected]> wrote:
> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds <[email protected]> wrote:
>>
>> Yes. It's not architected as far as I know, though. But I agree, it's
>> possible - even likely - we could avoid TLB flushing entirely on x86.
>
> Actually, it is architected on x86. This was first described in the
> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
> Invalidation", last paragraph of section 5.1. Nowadays, the same
> contents are buried somewhere in Volume 3 of the architecture manual
> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
> Paging-Structure Caches)
Good. I should have known it must be architected, because we've gone
back-and-forth on this in the kernel historically. We used to have
some TLB invalidates in the faulting path because I wasn't sure
whether they were needed or not, but we clearly don't have them any
more (and I suspect coverage was always spotty).
And Intel (and AMD) have been very good at documenting as architected
these kinds of details that people end up relying on even if they
weren't necessarily originally explicitly documented.
>> I *suspect* that whole TLB flush just magically became an SMP one
>> without anybody ever really thinking about it.
>
> I would be very worried about assuming every non-x86 arch has similar
> TLB semantics. However, if their fault handlers always invalidate TLB
> for pages that get spurious faults, then skipping the remote
> invalidation would be fine. (I believe this is what
> tlb_fix_spurious_fault() is for ?)
Yes. Of course, there may be some case where we unintentionally don't
necessarily flush a faulting address (on some architecture that needs
it), and then removing the cross-cpu invalidate could expose that
pre-existing bug-let, and cause an infinite loop of page faults due to
a TLB entry that never gets invalidated even if the page tables are
actually up-to-date.
So changing the mm/pgtable-generic.c function sounds like the right
thing to do, but would be a bit more scary.
Changing the x86 version sounds safe, *especially* since you point out
that the "fault-causes-tlb-invalidate" is architected behavior.
So I'd almost be willing to drop the invalidate in just one single
commit, because it really should be safe. The only thing it does is
guarantee that the accessed bit gets updated, and the accessed bit
just isn't that important. If we never flush the TLB on another CPU
that continues to use a TLB entry where the accessed bit is set (even
if it's cleared in the in-memory page tables), the worst that can
happen is that the accessed bit doesn't ever get set even if that CPU
constantly uses the page.
And nobody will *ever* care. The A bit is purely a heuristic for the
page LRU thing, we don't care about irrelevant special cases that
won't even affect correctness (much less performance - if that thing
is really hot and stays in the TLB, if we evict it, it will
immediately get reloaded anyway).
And doing a TLB invalidate even locally is worthless: sure, setting
the dirty bit and not invalidating the TLB can cause a local micro-tlb
fault (not a software-visible one, just microarchitectural pipeline
restart with TLB reload) on the next write access (because the TLB
would still contain D=0), so *even* if the CPU didn't
invalidate-on-fault, there's no reason we should invalidate in
software on x86.
Again, this can be different on non-x86 architectures with software
dirty bits, where a stale TLB entry that never gets flushed could
cause infinite TLB faults that never make progress, but that's really
a TLB _walker_ issue, not a generic VM issue.
Linus
On 10/26/2012 01:01 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 5:34 AM, Michel Lespinasse <[email protected]> wrote:
>> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds <[email protected]> wrote:
>>>
>>> Yes. It's not architected as far as I know, though. But I agree, it's
>>> possible - even likely - we could avoid TLB flushing entirely on x86.
>>
>> Actually, it is architected on x86. This was first described in the
>> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
>> Invalidation", last paragraph of section 5.1. Nowadays, the same
>> contents are buried somewhere in Volume 3 of the architecture manual
>> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
>> Paging-Structure Caches)
>
> Good. I should have known it must be architected, because we've gone
> back-and-forth on this in the kernel historically. We used to have
> some TLB invalidates in the faulting path because I wasn't sure
> whether they were needed or not, but we clearly don't have them any
> more (and I suspect coverage was always spotty).
>
> And Intel (and AMD) have been very good at documenting as architected
> these kinds of details that people end up relying on even if they
> weren't necessarily originally explicitly documented.
>
>>> I *suspect* that whole TLB flush just magically became an SMP one
>>> without anybody ever really thinking about it.
>>
>> I would be very worried about assuming every non-x86 arch has similar
>> TLB semantics. However, if their fault handlers always invalidate TLB
>> for pages that get spurious faults, then skipping the remote
>> invalidation would be fine. (I believe this is what
>> tlb_fix_spurious_fault() is for ?)
>
> Yes. Of course, there may be some case where we unintentionally don't
> necessarily flush a faulting address (on some architecture that needs
> it), and then removing the cross-cpu invalidate could expose that
> pre-existing bug-let, and cause an infinite loop of page faults due to
> a TLB entry that never gets invalidated even if the page tables are
> actually up-to-date.
>
> So changing the mm/pgtable-generic.c function sounds like the right
> thing to do, but would be a bit more scary.
>
> Changing the x86 version sounds safe, *especially* since you point out
> that the "fault-causes-tlb-invalidate" is architected behavior.
>
> So I'd almost be willing to drop the invalidate in just one single
> commit, because it really should be safe. The only thing it does is
> guarantee that the accessed bit gets updated, and the accessed bit
> just isn't that important. If we never flush the TLB on another CPU
> that continues to use a TLB entry where the accessed bit is set (even
> if it's cleared in the in-memory page tables), the worst that can
> happen is that the accessed bit doesn't ever get set even if that CPU
> constantly uses the page.
I suspect it would be safe to simply call tlb_fix_spurious_fault()
both on x86 and in the generic version.
If tlb_fix_spurious_fault is broken on some architecture, they
would already be running into issues like "write page fault
loops until the next context switch" :)
> Again, this can be different on non-x86 architectures with software
> dirty bits, where a stale TLB entry that never gets flushed could
> cause infinite TLB faults that never make progress, but that's really
> a TLB _walker_ issue, not a generic VM issue.
Would tlb_fix_spurious_fault take care of that on those
architectures?
On Fri, Oct 26, 2012 at 10:54 AM, Rik van Riel <[email protected]> wrote:
>
> Would tlb_fix_spurious_fault take care of that on those
> architectures?
.. assuming that they implement it as a real TLB flush, yes.
But maybe the architecture never noticed that it happened to depend on
the fact that we do a cross-CPU invalidate? So a missing
tlb_fix_spurious_fault() implementation could cause a short loop of
repeated page faults, until the IPI happens. And it would be so
incredibly rare that nobody would ever have noticed.
And if that could have happened, then with the cross-cpu invalidate
removed, the "incredibly rare short-lived constant page fault retry"
could turn into "incredibly rare lockup due to infinite page fault
retry due to TLB entry that never turns dirty despite it being marked
dirty by SW in the in-memory page tables".
Very unlikely, I agree. And this is only relevant for the non-x86
case, so changing the x86-specific optimized version is an independent
issue.
Linus
On 10/26/2012 02:02 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 10:54 AM, Rik van Riel <[email protected]> wrote:
>>
>> Would tlb_fix_spurious_fault take care of that on those
>> architectures?
>
> .. assuming that they implement it as a real TLB flush, yes.
>
> But maybe the architecture never noticed that it happened to depend on
> the fact that we do a cross-CPU invalidate? So a missing
> tlb_fix_spurious_fault() implementation could cause a short loop of
> repeated page faults, until the IPI happens. And it would be so
> incredibly rare that nobody would ever have noticed.
>
> And if that could have happened, then with the cross-cpu invalidate
> removed, the "incredibly rare short-lived constant page fault retry"
> could turn into "incredibly rare lockup due to infinite page fault
> retry due to TLB entry that never turns dirty despite it being marked
> dirty by SW in the in-memory page tables".
I suspect the next context switch would flush out the TLB,
making it a slowdown, not a lockup.
Still a good reason to make such a change in its own commit,
so it can be bisected and tracked down.
The commit message could tell architecture maintainers what
to do if this particular commit got them into trouble:
implement a proper local TLB flush in tlb_fix_spurious_fault.
I'll send this in as a separate patch.
On Fri, Oct 26, 2012 at 11:14 AM, Rik van Riel <[email protected]> wrote:
>
> I suspect the next context switch would flush out the TLB,
> making it a slowdown, not a lockup.
Common case, yes. But the page fault might happen in kernel space (due
to a "put_user()" call, say), and with CONFIG_PREEMPT=n.
Sure, put_user() is always done in a context where blocking (and
scheduling) is legal, but that doesn't necessarily equate scheduling
actually happening. If we're returning to kernel space and don't have
any IO, it might never happen.
Anyway, I suspect such behavior is almost impossible to trigger.
Which would just make it rather hard to find.
Linus
The function ptep_set_access_flags is only ever used to upgrade
access permissions to a page. That means the only negative side
effect of not flushing remote TLBs is that other CPUs may incur
spurious page faults, if they happen to access the same address,
and still have a PTE with the old permissions cached in their
TLB.
Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.
This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.
In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault to actually flush the TLB entry.
Signed-off-by: Rik van Riel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..0361369 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
int changed = !pte_same(*ptep, entry);
if (changed) {
set_pte_at(vma->vm_mm, address, ptep, entry);
- flush_tlb_page(vma, address);
+ flush_tlb_fix_spurious_fault(vma, address);
}
return changed;
}
Here are the TLB patches as requested:
---8<---
The function ptep_set_access_flags() is only ever invoked to upgrade
access permissions on a PTE. That makes it safe to skip flushing the
TLBs on remote TLBs. The worst that can happen is a spurious page
fault on other CPUs, which would flush that TLB entry.
Lazily letting another CPU incur a spurious page fault occasionally
is (much!) cheaper than aggressively flushing everybody else's TLB.
Signed-off-by: Rik van Riel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..b3b852c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -310,7 +310,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ __flush_tlb_one(address);
}
return changed;
Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.
Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this. However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert.
Signed-off-by: Rik van Riel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index b3b852c..15e5953 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -310,7 +310,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- __flush_tlb_one(address);
}
return changed;
On Fri, Oct 26, 2012 at 11:46 AM, Rik van Riel <[email protected]> wrote:
>
> The function ptep_set_access_flags is only ever used to upgrade
> access permissions to a page.
NOTE: It's *not* "access permissions". It's "access flags".
Big difference. This is not about permissions at all.
The access flags are the Accessed and Dirty bits. And the dirty bit is
never *cleared* by this function, it's only ever potentially set.
That, together with the fact that the accessed flag is "best effort"
rather than exact, is what makes this function so special to begin
with.
Linus
On Fri, Oct 26, 2012 at 11:44 AM, Rik van Riel <[email protected]> wrote:
>
> The function ptep_set_access_flags() is only ever invoked to upgrade
> access permissions on a PTE
Same deal. Please don't call these "access permissions". That confuses
the issue.
Linus
On Fri, Oct 26, 2012 at 11:48 AM, Linus Torvalds
<[email protected]> wrote:
>
> NOTE: It's *not* "access permissions". It's "access flags".
>
> Big difference. This is not about permissions at all.
Anyway, modulo the misleading commit messages, ACK from me on the series.
Linus
On 10/26/2012 02:48 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 11:46 AM, Rik van Riel <[email protected]> wrote:
>>
>> The function ptep_set_access_flags is only ever used to upgrade
>> access permissions to a page.
>
> NOTE: It's *not* "access permissions". It's "access flags".
>
> Big difference. This is not about permissions at all.
It looks like do_wp_page also sets the write bit in the pte
"entry" before passing it to ptep_set_access_flags, making
that the place where the write bit is set in the pte.
Is this a bug in do_wp_page?
Am I reading things wrong?
reuse:
	flush_cache_page(vma, address, pte_pfn(orig_pte));
	entry = pte_mkyoung(orig_pte);
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	if (ptep_set_access_flags(vma, address, page_table,
				  entry, 1))
		update_mmu_cache(vma, address, page_table);
On 10/26/2012 02:49 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 11:44 AM, Rik van Riel <[email protected]> wrote:
>>
>> The function ptep_set_access_flags() is only ever invoked to upgrade
>> access permissions on a PTE
>
> Same deal. Please don't call these "access permissions". That confuses
> the issue.
I can change the text of the changelog, however it looks
like do_wp_page does actually use ptep_set_access_flags
to set the write bit in the pte...
I guess both need to be reflected in the changelog text
somehow?
On Fri, Oct 26, 2012 at 11:57 AM, Rik van Riel <[email protected]> wrote:
>
> It looks like do_wp_page also sets the write bit in the pte
> "entry" before passing it to ptep_set_access_flags, making
> that the place where the write bit is set in the pte.
>
> Is this a bug in do_wp_page?
Hmm. Right you are. It's indeed worth noting that this function can
change access permissions in that particular way (ie enabling writes).
So yeah, good catch. And it's ok to add the writeable bits like this
(and it can't race with hardware like the dirty bit can, since
hardware never sets writability).
In fact, it should probably be documented in the source code
somewhere. In particular, there's a very subtle requirement that you
can only set the writable bit if the dirty bit is also set at the same
time, for example.
Linus
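[A sketch of the kind of comment being asked for here, distilled from
this thread rather than taken from the kernel tree:]

/*
 * ptep_set_access_flags() only ever *sets* bits in the PTE: the
 * accessed and dirty flags, plus (via do_wp_page()) the write bit.
 * The PFN must never change here (the non-atomic "*ptep = entry"
 * store would be unsafe on 32-bit PAE otherwise), and the write bit
 * may only be set together with the dirty bit, or that same store
 * could silently drop a concurrent hardware dirty-bit update on
 * another CPU.
 */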
On Fri, Oct 26, 2012 at 12:16 PM, Rik van Riel <[email protected]> wrote:
>
> I can change the text of the changelog, however it looks
> like do_wp_page does actually use ptep_set_access_flags
> to set the write bit in the pte...
>
> I guess both need to be reflected in the changelog text
> somehow?
Yeah, and by now, after all this discussion, I suspect it should be
committed with a comment too. Commit messages are good and all, but
unless chasing a particular bug they introduced, we shouldn't expect
people to read them for background information.
Linus
On 10/26/2012 03:18 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 12:16 PM, Rik van Riel <[email protected]> wrote:
>>
>> I can change the text of the changelog, however it looks
>> like do_wp_page does actually use ptep_set_access_flags
>> to set the write bit in the pte...
>>
>> I guess both need to be reflected in the changelog text
>> somehow?
>
> Yeah, and by now, after all this discussion, I suspect it should be
> committed with a comment too. Commit messages are good and all, but
> unless chasing a particular bug they introduced, we shouldn't expect
> people to read them for background information.
Will do :)
The function ptep_set_access_flags is only ever used to upgrade
access permissions to a page. That means the only negative side
effect of not flushing remote TLBs is that other CPUs may incur
spurious page faults, if they happen to access the same address,
and still have a PTE with the old permissions cached in their
TLB.
Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.
This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.
In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault to actually flush the TLB entry.
Signed-off-by: Rik van Riel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write
+ * permission. Furthermore, we know it always gets set to a "more
* permissive" setting, which allows most architectures to optimize
* this. We return whether the PTE actually changed, which in turn
* instructs the caller to do things like update__mmu_cache. This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
int changed = !pte_same(*ptep, entry);
if (changed) {
set_pte_at(vma->vm_mm, address, ptep, entry);
- flush_tlb_page(vma, address);
+ flush_tlb_fix_spurious_fault(vma, address);
}
return changed;
}
On Fri, 26 Oct 2012 14:45:02 -0400
Rik van Riel <[email protected]> wrote:
> Intel has an architectural guarantee that the TLB entry causing
> a page fault gets invalidated automatically. This means
> we should be able to drop the local TLB invalidation.
>
> Because of the way other areas of the page fault code work,
> chances are good that all x86 CPUs do this. However, if
> someone somewhere has an x86 CPU that does not invalidate
> the TLB entry causing a page fault, this one-liner should
> be easy to revert.
This does not strike me as a good standard of validation for such a change
At the very least we should have an ACK from AMD and from VIA, and
preferably ping RDC and some of the other embedded folks. Given an AMD
and VIA ACK I'd be fine. I doubt anyone knows any more what Cyrix CPUs
did or cared about and I imagine H Peter or Linus can answer for
Transmeta ;-)
Alan
On 10/26/2012 05:12 PM, Alan Cox wrote:
> On Fri, 26 Oct 2012 14:45:02 -0400
> Rik van Riel <[email protected]> wrote:
>
>> Intel has an architectural guarantee that the TLB entry causing
>> a page fault gets invalidated automatically. This means
>> we should be able to drop the local TLB invalidation.
>>
>> Because of the way other areas of the page fault code work,
>> chances are good that all x86 CPUs do this. However, if
>> someone somewhere has an x86 CPU that does not invalidate
>> the TLB entry causing a page fault, this one-liner should
>> be easy to revert.
>
> This does not strike me as a good standard of validation for such a change
>
> At the very least we should have an ACK from AMD and from VIA, and
> preferably ping RDC and some of the other embedded folks. Given an AMD
> and VIA ACK I'd be fine. I doubt anyone knows any more what Cyrix CPUs
> did or cared about and I imagine H Peter or Linus can answer for
> Transmeta ;-)
Fair enough.
If it turns out any of those CPUs need an explicit
flush, then we can also adjust flush_tlb_fix_spurious_fault
to actually do a local flush on x86 (or at least on those
CPUs).
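[A minimal sketch of that fallback, assuming __flush_tlb_one() stays the
local single-page flush primitive; hypothetical, never merged:]

/* If some x86 model turned out not to invalidate the faulting TLB
 * entry itself, the x86 no-op override could become a local INVLPG: */
#define flush_tlb_fix_spurious_fault(vma, address) __flush_tlb_one(address)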
* Rik van Riel <[email protected]> wrote:
> On 10/26/2012 05:12 PM, Alan Cox wrote:
> >On Fri, 26 Oct 2012 14:45:02 -0400
> >Rik van Riel <[email protected]> wrote:
> >
> >>Intel has an architectural guarantee that the TLB entry causing
> >>a page fault gets invalidated automatically. This means
> >>we should be able to drop the local TLB invalidation.
> >>
> >>Because of the way other areas of the page fault code work,
> >>chances are good that all x86 CPUs do this. However, if
> >>someone somewhere has an x86 CPU that does not invalidate
> >>the TLB entry causing a page fault, this one-liner should
> >>be easy to revert.
> >
> >This does not strike me as a good standard of validation for such a change
> >
> >At the very least we should have an ACK from AMD and from VIA, and
> >preferably ping RDC and some of the other embedded folks. Given an AMD
> >and VIA ACK I'd be fine. I doubt anyone knows any more what Cyrix CPUs
> >did or cared about and I imagine H Peter or Linus can answer for
> >Transmeta ;-)
>
> Fair enough.
>
> If it turns out any of those CPUs need an explicit flush, then
> we can also adjust flush_tlb_fix_spurious_fault to actually do
> a local flush on x86 (or at least on those CPUs).
Yes. And even if we have 'confirmation' from documentation and
elsewhere, testing has to be done to see actual real behavior of
CPUs, so this is going to be a separate, bisectable commit put
under surveillance ;-)
Thanks,
Ingo
On 10/26/2012 05:12 PM, Alan Cox wrote:
> On Fri, 26 Oct 2012 14:45:02 -0400
> Rik van Riel <[email protected]> wrote:
>
>> Intel has an architectural guarantee that the TLB entry causing
>> a page fault gets invalidated automatically. This means
>> we should be able to drop the local TLB invalidation.
>>
>> Because of the way other areas of the page fault code work,
>> chances are good that all x86 CPUs do this. However, if
>> someone somewhere has an x86 CPU that does not invalidate
>> the TLB entry causing a page fault, this one-liner should
>> be easy to revert.
>
> This does not strike me as a good standard of validation for such a change
>
> At the very least we should have an ACK from AMD and from VIA, and
> preferably ping RDC and some of the other embedded folks. Given an AMD
> and VIA ACK I'd be fine. I doubt anyone knows any more what Cyrix CPUs
> did or cared about and I imagine H Peter or Linus can answer for
> Transmeta ;-)
Florian, would you happen to know who at RDC could be contacted
to verify whether a TLB entry causing a page fault gets
invalidated automatically, upon entering the page fault path?
Borislav, would you happen to know whether AMD (and VIA) CPUs
automatically invalidate TLB entries that cause page faults?
If you do not know, would you happen to know who to ask? :)
If these CPUs do not invalidate a TLB entry causing a page
fault (a write fault on a read-only PTE), then we may have to
change the kernel so flush_tlb_fix_spurious_fault does
something on the CPU models in question...
On 10/26/2012 03:18 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 12:16 PM, Rik van Riel <[email protected]> wrote:
>>
>> I can change the text of the changelog, however it looks
>> like do_wp_page does actually use ptep_set_access_flags
>> to set the write bit in the pte...
>>
>> I guess both need to be reflected in the changelog text
>> somehow?
>
> Yeah, and by now, after all this discussion, I suspect it should be
> committed with a comment too. Commit messages are good and all, but
> unless chasing a particular bug they introduced, we shouldn't expect
> people to read them for background information.
Now that we have the TLB things taken care of, and
comments to patches 10/31 and 26/31 have been addressed,
is there anything else that needs to be done before
these NUMA patches can be merged?
Anyone, this is a good time to speak up. We have some
time to address whatever concern you may have.
(besides waiting for the next merge window)
On Sat, Oct 27, 2012 at 09:40:41AM -0400, Rik van Riel wrote:
> Borislav, would you happen to know whether AMD (and VIA) CPUs
> automatically invalidate TLB entries that cause page faults? If you do
> not know, would you happen to know who to ask? :)
Short answer: yes.
Long answer (from APM v2, section 5.5.2):
Use of Cached Entries When Reporting a Page Fault Exception.
On current AMD64 processors, when any type of page fault exception is
encountered by the MMU, any cached upper-level entries that lead to the
faulting entry are flushed (along with the TLB entry, if already cached)
and the table walk is repeated to confirm the page fault using the
table entries in memory. This is done because a table entry is allowed
to be upgraded (by marking it as present, or by removing its write,
execute or supervisor restrictions) without explicitly maintaining TLB
coherency. Such an upgrade will be found when the table is re-walked,
which resolves the fault. If the fault is confirmed on the re-walk
however, a page fault exception is reported, and upper level entries
that may have been cached on the re-walk are flushed.
HTH.
--
Regards/Gruss,
Boris.
On Mon, Oct 29, 2012 at 9:57 AM, Borislav Petkov <[email protected]> wrote:
>
> On current AMD64 processors,
Can you verify that this is true for older cpu's too (ie the old
pre-64-bit ones, say K6 and original Athlon)?
> This is done because a table entry is allowed
> to be upgraded (by marking it as present
Well, that was traditionally solved by not caching not-present entries
at all. Which can be a problem for some things (prefetch of NULL etc),
so caching and then re-checking on faults is potentially the correct
thing, but I'm just mentioning it because it might not be much of an
argument for older microarchitectures..
>, or by removing its write,
> execute or supervisor restrictions) without explicitly maintaining TLB
> coherency. Such an upgrade will be found when the table is re-walked,
> which resolves the fault.
.. but this is obviously what we're interested in. And since AMD has
documented it (as well as Intel), I have this strong suspicion that
operating systems have traditionally relied on this behavior.
I don't remember the test coverage details from my Transmeta days, and
while I certainly saw the page table walker, it wasn't my code.
My gut feel is that this is likely something x86 just always does
(because it's the right thing to do to keep things simple for
software), but getting explicit confirmation about older AMD cpu's
would definitely be good.
Linus
On Mon, Oct 29, 2012 at 10:06:15AM -0700, Linus Torvalds wrote:
> On Mon, Oct 29, 2012 at 9:57 AM, Borislav Petkov <[email protected]> wrote:
> >
> > On current AMD64 processors,
>
> Can you verify that this is true for older cpu's too (ie the old
> pre-64-bit ones, say K6 and original Athlon)?
Albeit with a slight delay, the answer is yes: all AMD cpus
automatically invalidate cached TLB entries (and intermediate walk
results, for that matter) on a #PF.
I don't know, however, whether it would be prudent to have some sort of
a cheap assertion in the code (cheaper than INVLPG %ADDR, although on
older cpus we do MOV CR3) just in case. This should be enabled only with
DEBUG_VM on, of course...
HTH.
--
Regards/Gruss,
Boris.
On Sat, Nov 17, 2012 at 6:50 AM, Borislav Petkov <[email protected]> wrote:
>
> Albeit with a slight delay, the answer is yes: all AMD cpus
> automatically invalidate cached TLB entries (and intermediate walk
> results, for that matter) on a #PF.
Thanks. I suspect it ends up being basically architectural, and that
Windows (and quite possibly Linux versions too) have depended on the
behavior.
> I don't know, however, whether it would be prudent to have some sort of
> a cheap assertion in the code (cheaper than INVLPG %ADDR, although on
> older cpus we do MOV CR3) just in case. This should be enabled only with
> DEBUG_VM on, of course...
I wonder how we could actually test for it. We'd have to have some
per-cpu page-fault address check (along with a generation count on the
mm or similar). I doubt we'd figure out anything that works reliably
and efficiently and would actually show any problems (plus we would
have no way to ever know we even got the code right, since presumably
we'd never find hardware that actually shows the behavior we'd be
looking for..)
Linus
On Sat, Nov 17, 2012 at 06:56:10AM -0800, Linus Torvalds wrote:
> I wonder how we could actually test for it. We'd have to have some
> per-cpu page-fault address check (along with a generation count on the
> mm or similar). I doubt we'd figure out anything that works reliably
> and efficiently and would actually show any problems (plus we would
> have no way to ever know we even got the code right, since presumably
> we'd never find hardware that actually shows the behavior we'd be
> looking for..)
Hmm, touching some wrong page through the stale TLB entry could be a
pretty nasty issue to debug. But you're probably right: how does one
test cheaply whether a PTE just got kicked out of the TLB? Maybe mark it
not-present, but this would force a rewalk in the case when it is shared,
which is a penalty we don't want to pay.
Oh well...
--
Regards/Gruss,
Boris.
On 11/17/2012 09:56 AM, Linus Torvalds wrote:
> On Sat, Nov 17, 2012 at 6:50 AM, Borislav Petkov <[email protected]> wrote:
>> I don't know, however, whether it would be prudent to have some sort of
>> a cheap assertion in the code (cheaper than INVLPG %ADDR, although on
>> older cpus we do MOV CR3) just in case. This should be enabled only with
>> DEBUG_VM on, of course...
>
> I wonder how we could actually test for it. We'd have to have some
> per-cpu page-fault address check (along with a generation count on the
> mm or similar). I doubt we'd figure out anything that works reliably
> and efficiently and would actually show any problems
Would it be enough to simply print out a warning if we fault
on the same address twice (or three times) in a row, and then
flush the local TLB?
I realize this would not just trigger on CPUs that fail to
invalidate TLB entries that cause faults, but also on kernel
paths that cause a page fault to be re-taken...
... but then again, don't we want to find those paths and
fix them, anyway? :)
--
All rights reversed
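[Purely for illustration, a DEBUG_VM-style check along the lines Rik
describes might look like the sketch below. The helper name and its
placement in the x86 fault handler are hypothetical; nothing like this
was merged.]

#include <linux/percpu.h>
#include <linux/kernel.h>
#include <asm/tlbflush.h>

static DEFINE_PER_CPU(unsigned long, last_fault_addr);
static DEFINE_PER_CPU(unsigned int, last_fault_count);

/* Hypothetical: called from the x86 page fault handler with the
 * faulting address, before the fault is resolved. */
static void check_repeated_fault(unsigned long address)
{
	if (this_cpu_read(last_fault_addr) == address &&
	    this_cpu_inc_return(last_fault_count) >= 3) {
		/* The same address faulted repeatedly on this CPU: either a
		 * stale TLB entry the CPU failed to invalidate, or a kernel
		 * path that legitimately retakes the fault.  Warn and flush
		 * the local entry so forward progress is guaranteed. */
		WARN_ONCE(1, "repeated page fault at %lx, flushing local TLB\n",
			  address);
		__flush_tlb_one(address);
		this_cpu_write(last_fault_count, 0);
	} else if (this_cpu_read(last_fault_addr) != address) {
		this_cpu_write(last_fault_addr, address);
		this_cpu_write(last_fault_count, 1);
	}
}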
On Sat, Nov 17, 2012 at 7:24 AM, Rik van Riel <[email protected]> wrote:
> On 11/17/2012 09:56 AM, Linus Torvalds wrote:
>>
>> On Sat, Nov 17, 2012 at 6:50 AM, Borislav Petkov <[email protected]> wrote:
>>>
>>> I don't know, however, whether it would be prudent to have some sort of
>>> a cheap assertion in the code (cheaper than INVLPG %ADDR, although on
>>> older cpus we do MOV CR3) just in case. This should be enabled only with
>>> DEBUG_VM on, of course...
>>
>>
>> I wonder how we could actually test for it. We'd have to have some
>> per-cpu page-fault address check (along with a generation count on the
>> mm or similar). I doubt we'd figure out anything that works reliably
>> and efficiently and would actually show any problems
>
> Would it be enough to simply print out a warning if we fault
> on the same address twice (or three times) in a row, and then
> flush the local TLB?
>
> I realize this would not just trigger on CPUs that fail to
> invalidate TLB entries that cause faults, but also on kernel
> paths that cause a page fault to be re-taken...
I'm actually curious if the architecture docs/software developer
manuals for IA-32 mandate any TLB invalidations on a #PF.
Is there any official vendor documentation on the subject?
And perhaps equally valid, should we trust it if it exists?
> ... but then again, don't we want to find those paths and
> fix them, anyway? :)
>
> --
> All rights reversed
On Sat, Nov 17, 2012 at 1:53 PM, Shentino <[email protected]> wrote:
> I'm actually curious if the architecture docs/software developer
> manuals for IA-32 mandate any TLB invalidations on a #PF
>
> Is there any official vendor documentation on the subject?
Yes. Quoting a prior email:
Actually, it is architected on x86. This was first described in the
intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
Invalidation", last paragraph of section 5.1. Nowadays, the same
contents are buried somewhere in Volume 3 of the architecture manual
(in my copy: 4.10.4.1 Operations that Invalidate TLBs and
Paging-Structure Caches)
> And perhaps equally valid, should we trust it if it exists?
I know that Intel has been very careful in documenting the architected
TLB behaviors and did it with the understanding that people should be
able to depend on what's being written up there.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.