The following:
@@ -50,7 +50,6 @@
/* Don't look at this pte if it's been accessed recently. */
if (ptep_test_and_clear_young(page_table)) {
- flush_tlb_page(vma, address);
mark_page_accessed(page);
return 0;
}
is completely bogus. Without the tlb flush, the system may never update
the accessed bit on a page that is heavily being used.
-ben
--
Fish.
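For context, the check in question sits in the pte-scanning path of mm/vmscan.c
(try_to_swap_out() in the 2.4 tree). Roughly, and only as a sketch - the exact
signature and the elided swap-out path vary between 2.4 releases:

/* Sketch of the surrounding 2.4-era logic, not a verbatim copy. */
static int try_to_swap_out(struct mm_struct *mm, struct vm_area_struct *vma,
			   unsigned long address, pte_t *page_table,
			   struct page *page, zone_t *classzone)
{
	/* Don't look at this pte if it's been accessed recently. */
	if (ptep_test_and_clear_young(page_table)) {
		flush_tlb_page(vma, address);	/* the flush being argued over */
		mark_page_accessed(page);
		return 0;
	}

	/* ... otherwise unmap the pte and try to push the page to swap ... */
	return 0;
}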
On Mon, Oct 29, 2001 at 03:14:22PM -0800, David S. Miller wrote:
> From: Benjamin LaHaise <[email protected]>
> Date: Mon, 29 Oct 2001 18:08:37 -0500
>
> is completely bogus. Without the tlb flush, the system may never update
> the accessed bit on a page that is heavily being used.
>
> It's intentional Ben, think about the high cost of the SMP invalidate
> when kswapd is just scanning page tables.
I think it's far more expensive to pull a page back in from disk.
-ben
From: Benjamin LaHaise <[email protected]>
Date: Mon, 29 Oct 2001 18:08:37 -0500
is completely bogus. Without the tlb flush, the system may never update
the accessed bit on a page that is heavily being used.
It's intentional Ben, think about the high cost of the SMP invalidate
when kswapd is just scanning page tables.
Franks a lot,
David S. Miller
[email protected]
On Mon, 29 Oct 2001, David S. Miller wrote:
> Date: Mon, 29 Oct 2001 18:08:37 -0500
>
> is completely bogus. Without the tlb flush, the system may never update
> the accessed bit on a page that is heavily being used.
>
> It's intentional Ben, think about the high cost of the SMP invalidate
> when kswapd is just scanning page tables.
Indeed. I thought it shouldn't matter, but apparently it does...
Does it make the accessed bit less reliable? Sure it does. But basically,
either the page is accessed SO much that it stays in the TLB all the time
(which is basically not really possible if you page heavily, I suspect),
or it will age out of the TLB on its own at which point we get the
accessed bit back.
In the worst case it does generate more noise in the reference bit logic,
but ..
Linus
On Mon, 29 Oct 2001, Benjamin LaHaise wrote:
>
> I think it's far more expensive to pull a page back in from disk.
Note that _if_ the page is so often accessed that it stays in the TLB, it
won't ever hit the disk. We _will_ do a TLB invalidate when we unmap it,
and it will get re-mapped long before it's written out.
So it's more likely to result in extra soft-faults, but if the system is
paging stuff out we'll almost certainly be doing enough context switches
that it doesn't matter.
And avoiding the extra TLB flush (for _every_ page scanned) is noticeable
according to Andrea.
Linus
On Mon, Oct 29, 2001 at 03:22:28PM -0800, Linus Torvalds wrote:
> Does it make the accessed bit less reliable? Sure it does. But basically,
> either the page is accessed SO much that it stays in the TLB all the time
> (which is basically not really possible if you page heavily, I suspect),
> or it will age out of the TLB on its own at which point we get the
> accessed bit back.
Think of CPUs with tagged tlbs and lots of entries. Or even a system that
only runs 1 threaded app. Easily triggerable. If people want to optimise
it, great. But go for correctness first, please...
-ben
On Mon, Oct 29, 2001 at 03:25:36PM -0800, Linus Torvalds wrote:
> And avoiding the extra TLB flush (for _every_ page scanned) is noticeable
> according to Andrea.
Sure. But do it correctly and perform a tlb flush higher up in the page
table walking code. Just deleting it entirely is bogus. Introducing
hard to track down bugs is just stupid.
-ben
From: Benjamin LaHaise <[email protected]>
Date: Mon, 29 Oct 2001 18:33:16 -0500
Sure. But do it correctly and perform a tlb flush higher up in the page
table walking code. Just deleting it entirely is bogus. Introducing
hard to track down bugs is just stupid.
It isn't a bug, the referenced bit is a heuristic. The referenced bit
being wrong cannot result in corrupted user memory.
Franks a lot,
David S. Miller
[email protected]
On Mon, Oct 29, 2001 at 03:36:03PM -0800, David S. Miller wrote:
> It isn't a bug, the referenced bit is a heuristic. The referenced bit
> being wrong cannot result in corrupted user memory.
We might as well choose what pages to swap out according to a random number
generator then.
-ben
--
Fish.
On Mon, 29 Oct 2001, Benjamin LaHaise wrote:
> On Mon, Oct 29, 2001 at 03:36:03PM -0800, David S. Miller wrote:
> > It isn't a bug, the referenced bit is a heuristic. The referenced bit
> > being wrong cannot result in corrupted user memory.
>
> We might as well choose what pages to swap out according to a random number
> generator then.
Show me a number. Andrea showed his performance. And I claim that our
"random number generator" is really close to reality.
Linus
On Mon, 29 Oct 2001, Benjamin LaHaise wrote:
>
> Think of CPUs with tagged tlbs and lots of entries. Or even a system that
> only runs 1 threaded app. Easily triggerable. If people want to optimise
> it, great. But go for correctness first, please...
"easily triggerable"?
I doubt you'll find _any_ system where you can trigger it. Think about it:
in order to get the kind of VM pressure that causes page-outs, your TLB
pressure will be a lot higher than _any_ CPU I have ever heard about.
256 TLB entries is considered "a lot". Tagged or not, if the VM pressure
is so big that we need to swap, those 256 entries are a grain of sand in
the Sahara.
Linus
On Mon, Oct 29, 2001 at 03:41:58PM -0800, Linus Torvalds wrote:
> Show me a number. Andrea showed his performance. And I claim that our
> "random number generator" is really close to reality.
Please enlighten me why I should go off to do a couple of days' work to
come up with a comparison against a correct version that does the tlb flush
higher up in the page table walker? Sorry, but for now I'll stick with
the -ac vm that isn't changing things randomly every release.
-ben
--
Fish.
Numbers talk, bullshit walks.
Franks a lot,
David S. Miller
[email protected]
Benjamin LaHaise writes:
> The following:
>
> @@ -50,7 +50,6 @@
>
> /* Don't look at this pte if it's been accessed recently. */
> if (ptep_test_and_clear_young(page_table)) {
> - flush_tlb_page(vma, address);
> mark_page_accessed(page);
> return 0;
> }
>
> is completely bogus. Without the tlb flush, the system may never update
> the accessed bit on a page that is heavily being used.
On PPC, the page wouldn't even need to be being heavily used. Most
PPCs have an MMU hash table that we use as a level-2 cache for the
TLB. With this change, we won't see the accessed bit being set again
for any page unless there is so much memory in use that we start
evicting PTEs from the hash table, and that is very rare in practice.
So I'm with Ben on this one.
Paul.
On Mon, Oct 29, 2001 at 03:50:56PM -0800, David S. Miller wrote:
>
> Numbers talk, bullshit walks.
There is a correct way to do this optimization. If you're enough of an asshole
to not care about doing it that way, great!
-ben
From: Benjamin LaHaise <[email protected]>
Date: Mon, 29 Oct 2001 18:51:58 -0500
On Mon, Oct 29, 2001 at 03:50:56PM -0800, David S. Miller wrote:
>
> Numbers talk, bullshit walks.
There is a correct way to do this optimization. If you're enough
of an asshole to not care about doing it that way, great!
Doing range flushes is not the answer. It is going to be about
the same cost as doing per-page flushes.
The cost is doing the cross calls at all, not the granularity in which
we do them.
You're refusing to do any work to prove whether your case matters
at all in real life, and you're calling other people assholes for
asking that you do so.
Franks a lot,
David S. Miller
[email protected]
On Mon, 29 Oct 2001, Benjamin LaHaise wrote:
> Think of CPUs with tagged tlbs and lots of entries. Or even a system that
> only runs 1 threaded app. Easily triggerable. If people want to optimise
> it, great. But go for correctness first, please...
But was the original flush_tlb_page() fully correct? In i386 it's
if (vma->vm_mm == current->active_mm)
__flush_tlb_one(addr);
without reference to whether vma->vm_mm is active on another cpu.
Hugh
On Mon, 29 Oct 2001, David S. Miller wrote:
>
> Doing range flushes is not the answer. It is going to be about
> the same cost as doing per-page flushes.
No, doing a range flush might be fine - we'd just do it _once_ per
invocation of swap_out(), and that would probably be fine.
The problem with the flush at the low level is that it's done once for
every page in the whole VM space, which is easily millions of times.
Cutting it down to once every MM would definitely be worth it.
It won't be "exact" either, but it would mean that at least the lifetime
of an optimistic TLB entry is bounded.
Linus
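A minimal sketch of what that per-mm variant could look like, assuming the
2.4 swap_out_mm() structure (helper names and arguments are approximate):

/* Sketch only: clear young bits per pte, but flush once per mm. */
static int swap_out_mm(struct mm_struct *mm, int count, zone_t *classzone)
{
	struct vm_area_struct *vma;
	unsigned long address = mm->swap_address;

	for (vma = find_vma(mm, address); vma && count > 0; vma = vma->vm_next) {
		/* walks the page tables and clears accessed bits, but no
		 * longer calls flush_tlb_page() for every pte it touches */
		count = swap_out_vma(mm, vma, address, count, classzone);
		address = vma->vm_end;
	}

	/* one invalidate per mm bounds how long a stale "young" TLB entry
	 * can survive; it doesn't even have to be synchronous */
	flush_tlb_mm(mm);
	return count;
}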
On Mon, Oct 29, 2001 at 03:55:59PM -0800, David S. Miller wrote:
> Doing range flushes is not the answer. It is going to be about
> the same cost as doing per-page flushes.
>
> The cost is doing the cross calls at all, not the granularity in which
> we do them.
>
> You're refusing to do any work to prove whether your case matters
> at all in real life, and you're calling other people assholes for
> asking that you do so.
See Paul's message. ia64 does the same thing with hardware walked hashed
page tables. Now, do you want to pay for the 2 days of time you want me
to commit to investigating something which is obvious to me? I don't think
so.
-ben
From: Benjamin LaHaise <[email protected]>
Date: Mon, 29 Oct 2001 18:57:44 -0500
See Paul's message. ia64 does the same thing with hardware walked hashed
page tables. Now, do you want to pay for the 2 days of time you want me
to commit to investigating something which is obvious to me? I don't think
so.
Why would it take you two days to put together a test case for
something so "trivial"? Don't be ridiculous.
I would be more than happy to pay for the 2 days it's going to take
for you to possibly admit "yeah, maybe it doesn't matter in real life,
sorry". :-)
Franks a lot,
David S. Miller
[email protected]
On Mon, Oct 29, 2001 at 04:01:23PM -0800, David S. Miller wrote:
> Why would it take you two days to put together a test case for
> something so "trivial"? Don't be rediculious.
Well, maybe you like under-testing things, but personally I don't. What
do you want, a microoptimization? Get real, you don't accept
microoptimizations to networking code without numbers from the bigger
picture. So that means running real tests on real applications like
Oracle or RHDB. I think you'll find that the cross cpu case will be
quite happy with a range flush. Heck, it's the kind of IPI that can
even legally be asynchronous. Why don't you go do something useful
like implement that instead of making me spend my time to correct other
people's obvious mistakes?
-ben
--
Fish.
On Tue, 30 Oct 2001, Hugh Dickins wrote:
>
> But was the original flush_tlb_page() fully correct? In i386 it's
> if (vma->vm_mm == current->active_mm)
> __flush_tlb_one(addr);
> without reference to whether vma->vm_mm is active on another cpu.
No, you're looking at the UP inline version - the SMP version is
elsewhere.
Linus
From: Hugh Dickins <[email protected]>
Date: Tue, 30 Oct 2001 00:02:53 +0000 (GMT)
But was the original flush_tlb_page() fully correct? In i386 it's
if (vma->vm_mm == current->active_mm)
__flush_tlb_one(addr);
without reference to whether vma->vm_mm is active on another cpu.
You're looking at the non-SMP version :-)
Franks a lot,
David S. Miller
[email protected]
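For reference, the SMP flush_tlb_page() lives in arch/i386/kernel/smp.c and
does consult the other cpus via a cross-call; roughly, paraphrased rather
than copied verbatim from the 2.4 source:

void flush_tlb_page(struct vm_area_struct *vma, unsigned long va)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long cpu_mask = mm->cpu_vm_mask & ~(1UL << smp_processor_id());

	if (current->active_mm == mm) {
		if (current->mm)
			__flush_tlb_one(va);
		else
			leave_mm(smp_processor_id());
	}

	if (cpu_mask)
		/* the cross-call IPI: this is the expensive part */
		flush_tlb_others(cpu_mask, mm, va);
}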
On Mon, 29 Oct 2001, David S. Miller wrote:
> Numbers talk, bullshit walks.
Lies, damn lies and statistics.
Your numbers are nice for one thing and break for
something else, don't you learn anything from our
experiments in the past ?
cheers,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
From: Rik van Riel <[email protected]>
Date: Mon, 29 Oct 2001 23:31:31 -0200 (BRST)
Your numbers are nice for one thing and break for
something else, don't you learn anything from our
experiments in the past ?
I'm asking him to show the case that "breaks for something
else".
Franks a lot,
David S. Miller
[email protected]
On Mon, 29 Oct 2001, David S. Miller wrote:
>
> I'm asking him to show the case that "breaks for something
> else".
Guys, guys, calm down.
I removed the tlb invalidate that ended up being called millions of times,
but I don't really have anything fundamental against either invalidating
each mm as it comes up in swap_out_mm(), or maybe just doing a full TLB
invalidate for each swap_out(). As Ben points out, the invalidate doesn't
even have to be synchronous - the only thing we really care about is that
there is some upper bound for how long we can cache TLB entries with the
wrong accessed bit.
One reasonable (?), yet rare, upper bound might be something like
"swap_out() wrapped around the MM list".
This is particularly true since we won't actually _care_ about the
accessed bit until the second time around when it is clear, so the
"wrapped around the VM list" thing is (a) often enough to matter and (b)
obviously seldom enough that it shouldn't be a performance issue even if
it implies a cross-call to everybody.
The difference in call frequency would, on large machines, probably be on
the order of several magnitudes, which will certainly cut the overhead
down to the noise while satisfying people who have architectures that can
cache things for a long time.
Agreed?
(Yeah, maybe you think that's _too_ long. Civil arguments, please).
Linus
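As a sketch of the even coarser variant described above - one global
invalidate per wrap of the mm list - the loop body and the helper for
picking the next mm below are purely illustrative:

/* Illustrative only: flush everything once per pass over the mm list. */
static void swap_out(unsigned int priority, zone_t *classzone)
{
	int counter = mmlist_nr;

	while (counter-- > 0) {
		struct mm_struct *mm = swap_out_pick_mm();   /* hypothetical helper */

		if (mm)
			swap_out_mm(mm, SWAP_CLUSTER_MAX, classzone);
	}

	/* wrapped around the MM list: one (possibly asynchronous) global
	 * invalidate bounds the lifetime of stale accessed bits */
	flush_tlb_all();
}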
From: Linus Torvalds <[email protected]>
Date: Mon, 29 Oct 2001 17:42:07 -0800 (PST)
Agreed?
I wouldn't be against a per-mm flush.
Franks a lot,
David S. Miller
[email protected]
On Mon, Oct 29, 2001 at 05:34:00PM -0800, David S. Miller wrote:
> I'm asking him to show the case that "breaks for something
> else".
Why don't you try lifting your fingers to defend your position? We know
that:
1. the change introduces stale data into the tlb with no limits on the
duration of said stale data
2. some architectures have large tlbs.
For the sake of (2), hashed tlbs count as excessively large tlbs. See the
UML, ia64 and powerpcs for this kind of setup.
Now, if you want to live in microbenchmark land, then, yes, the "optimized"
but invalid behaviour will *always* win. Back in the real world, sure, I can
construct a microbenchmark that will sit on 2 pages of memory spinning until
the pages get swapped out. What does that buy us? Nothing other than
placating you.
I've already pointed out that the optimization can be done in a valid way. If
you can't go through the effort of completing the above thought experiment on
your own, then I really have no reason to care about your opinion any further.
You've already stated you don't care about mine, and I've stated that I don't
feel like going to the effort to placate you. Where does that leave us?
Nowhere, I guess.
-ben
On Mon, Oct 29, 2001 at 05:42:07PM -0800, Linus Torvalds wrote:
>
> On Mon, 29 Oct 2001, David S. Miller wrote:
> >
> > I'm asking him to show the case that "breaks for something
> > else".
>
> Guys, guys, calm down.
I'm trying to be. Dave just rubs me the wrong way with his inability to
think about the problem for a minute instead of insisting on proof that
incorrect behaviour can happen. Damnit, it's supposed to be a kernel on
its way towards being more stable, not less. This is especially insulting
when you can go through the whole thing as a thought experiment in less
than 5 minutes and realise what could happen.
Dave: please read the above paragraph again. Now, can you see why I'm
arguing for doing the optimization in the *correct* way first? The
microbenchmark will always "prove" that avoiding the tlb flush is a win.
The test to prove that the failure case can happen is non-trivial, and
proving the real world win/loss is lengthy task involving lots of
benchmarks and effort. Please just sit down and think for 5 minutes and
acknowledge that it is a possibility!
> The difference in call frequency would, on large machines, probably be on
> the order of several magnitudes, which will certainly cut the overhead
> down to the noise while satisfying people who have architectures that can
> cache things for a long time.
>
> Agreed?
Which is exactly what was in the back of my mind in the first place. I
didn't write the patch as the distraction of going off on yet another
tangent when I'm this > < close to being done with the battle I'm currently
in just doesn't make sense. I'm sorry, but I'm not the best person for
doing a brain dump when in the middle of something, plus I assume that
people can think about the how and why of things for themselves. You'll
note that I never denied that the microoptimization in question is a win;
I fully well expect it to be. However, from the point of view of stability
we *want* to be conservative and correct. If Al had to demonstrate with
'sploits that a possible vfs race could occur every time he found one,
wouldn't he be wasting time that could be better spent finding and fixing
other problems??? Dave, please accept that other people's opinions
occasionally hold value and reconsider reacting negatively without thinking
first.
-ben
--
Fish.
> Numbers talk, bullshit walks.
So you are walking somewhere today?
Dave, I can produce equivalently valid microbenchmarks showing Linux works
much better with the scheduler disabled. They are worth about as much as
your benchmarks for that optimisation, and they likewise ignore a slightly
important object known as "the big picture".
On Mon, Oct 29, 2001 at 09:25:46PM -0500, Benjamin LaHaise wrote:
> I fully well expect it to be. However, from the point of view of stability
> we *want* to be conservative and correct. If Al had to demonstrate with
Dave just told you what this change has to do with stability, not sure
why you keep reiterating about stability and correctness.
But of course going from page flush to the mm flush is fine from my part
too. As Linus noted a few days ago during swapout we're going to block
and reschedule all the time, so the range flush is going to be a noop in
real life (the whole thing is a heuristic), which is why it wasn't
implemented right away. But I agree it shouldn't hurt either, and it looks
nicer.
Andrea
On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
> On Mon, Oct 29, 2001 at 09:25:46PM -0500, Benjamin LaHaise wrote:
> > I fully well expect it to be. However, from the point of view of stability
> > we *want* to be conservative and correct. If Al had to demonstrate with
>
> Dave just told you what this change has to do with stability, not sure
> why you keep reiterating about stability and correctness.
>
> But of course going from page flush to the mm flush is fine from my part
> too. As Linus noted a few days ago during swapout we're going to block
> and reschedule all the time, so the range flush is going to be a noop in
Only on architectures where the TLB (or equivalent) is
small and only capable of holding entries for one address
space at a time.
It's simply not true on eg PPC.
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
On Tue, Oct 30, 2001 at 01:34:50PM -0200, Rik van Riel wrote:
> On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
> > On Mon, Oct 29, 2001 at 09:25:46PM -0500, Benjamin LaHaise wrote:
> > > I fully well expect it to be. However, from the point of view of stability
> > > we *want* to be conservative and correct. If Al had to demonstrate with
> >
> > Dave just told you what this change has to do with stability, not sure
> > why you keep reiterating about stability and correctness.
> >
> > But of course going from page flush to the mm flush is fine from my part
> > too. As Linus noted a few days ago during swapout we're going to block
> > and reschedule all the time, so the range flush is going to be a noop in
>
> Only on architectures where the TLB (or equivalent) is
> small and only capable of holding entries for one address
> space at a time.
>
> It's simply not true on eg PPC.
I was thinking of alpha of course, but alpha doesn't provide a hardware
accessed bit in the first place :).
My view was too x86-64/x86/alpha centric (incidentally the only archs I
test personally :). I really didn't consider the case of a tagged tlb +
accessed bit both provided by the cpu while making that change, so we
hurt ia64 and ppc, but it will be trivial to fix now that Ben promptly
noticed the problem. Thanks Ben, good catch, that's really appreciated!
Anyways, this has _nothing_ to do with stability, unlike what the above
subliminal messages are implying; see above, it can only potentially be
less responsive under very heavy swap on ia64 and ppc (dunno sparc64?),
period. Mentioning real life workloads like Oracle and RHDB in relation
to the tlb flush for the accessed bit is further subliminal bullshit:
Oracle definitely isn't supposed to swap heavily during the benchmarks,
and I'm sure it's not the case for mainline postgres either (dunno about
RHDB).
Andrea
>> But of course going from page flush to the mm flush is fine from my part
>> too. As Linus noted a few days ago during swapout we're going to block
>> and reschedule all the time, so the range flush is going to be a noop in
>
> Only on architectures where the TLB (or equivalent) is
> small and only capable of holding entries for one address
> space at a time.
>
> It's simply not true on eg PPC.
#ifdef ?
Bye.
On Tue, Oct 30, 2001 at 04:51:19PM +0100, Andrea Arcangeli wrote:
> Anyways, this has _nothing_ to do with stability, unlike what the above
> subliminal messages are implying; see above, it can only potentially be
> less responsive under very heavy swap on ia64 and ppc (dunno sparc64?),
> period. Mentioning real life workloads like Oracle and RHDB in relation
> to the tlb flush for the accessed bit is further subliminal bullshit:
> Oracle definitely isn't supposed to swap heavily during the benchmarks,
> and I'm sure it's not the case for mainline postgres either (dunno about
> RHDB).
It has nothing to do with subliminal effects, but rather what kind of
effect this microoptimization is going to have in the Big Picture. What
I'm contending is that in the Real World the correct version of the
optimization will have no significant performance difference compared
to the incorrect version that you and davem are so gleefully
advocating. This means not running through "bullshit" benchmarks that
test one and only one thing, but running apps that actually put memory
pressure on the system (Oracle does so quite nicely using a filesystem
without O_DIRECT) which in turn causes page scanning (aka the clearing
of the referenced bit which is *THE* code that is being contested) but
should not cause swap out. To me, this is just part of the methodology
of being thorough in testing the effects of changes to the VM subsystem.
Frankly, I'm suitably unimpressed with the thoroughness of consideration
and testing to the changes currently being pushed into the base tree.
Again, this is why I'm not bothering to run base kernels.
-ben
--
Fish.
On Tue, 30 Oct 2001, Rik van Riel wrote:
>
> Only on architectures where the TLB (or equivalent) is
> small and only capable of holding entries for one address
> space at a time.
>
> It's simply not true on eg PPC.
Now, it's not true on _all_ PPC's.
The sane PPC setups actually have a regular soft-filled TLB, and last I
saw that actually performed _better_ than the stupid architected hash-
chains. And for the broken OS's (ie AIX) that wants the hash-chains, you
can always make the soft-fill TLB do the stupid thing..
(Yeah, yeah, I'm sure you can find code where the hash-chains are faster,
especially big Fortran programs that have basically no tear-down and
build-up overhead. Which was why those things were designed that way, of
course. But it _looks_ like at least parts of IBM may finally be wising up
to the fact that hashed TLB's are a stupid idea).
Linus
On Tue, Oct 30, 2001 at 08:38:31AM -0800, Linus Torvalds wrote:
> Now, it's not true on _all_ PPC's.
>
> The sane PPC setups actually have a regular soft-filled TLB, and last I
> saw that actually performed _better_ than the stupid architected hash-
> chains. And for the broken OS's (ie AIX) that wants the hash-chains, you
> can always make the soft-fill TLB do the stupid thing..
User Mode Linux has effectively an infinitely sized TLB.
-ben
On Tue, Oct 30, 2001 at 05:13:53PM +0100, Giuliano Pochini wrote:
>
> >> But of course going from page flush to the mm flush is fine from my part
> >> too. As Linus noted a few days ago during swapout we're going to block
> >> and reschedule all the time, so the range flush is going to be a noop in
> >
> > Only on architectures where the TLB (or equivalent) is
> > small and only capable of holding entries for one address
> > space at a time.
> >
> > It's simply not true on eg PPC.
>
> #ifdef ?
yes, but not for ppc, for alpha and all other archs without accessed bit
provided in hardware (and cached in the tlb). The flush_mm proposed by
Ben looks fine for x86 too; it's a waste only for archs without an
accessed bit.
I think an #ifndef HAVE_NO_ACCESS_BIT_IN_TLB or something like that would
do, then define that in asm-alpha/ and the other archs without an accessed
bit.
OTOH, it probably doesn't make much difference, so maybe it isn't worth
the effort.
Andrea
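Roughly, the idea would look like this around the contested check; note that
HAVE_NO_ACCESS_BIT_IN_TLB is only the name proposed above, nothing in the
tree defines it yet:

#ifndef HAVE_NO_ACCESS_BIT_IN_TLB	/* proposed per-arch define, e.g. asm-alpha */
	/* Don't look at this pte if it's been accessed recently. */
	if (ptep_test_and_clear_young(page_table)) {
		mark_page_accessed(page);
		return 0;
	}
#endif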
On Tue, Oct 30, 2001 at 11:34:10AM -0500, Benjamin LaHaise wrote:
> I'm contending is that the Real World difference between the correct
> version of the optimization will have no significant performance effects
> compared to the incorrect version that you and davem are so gleefully
> advocating. This means not running through "bullshit" benchmarks that
> test one and only one thing, but running apps that actually put memory
> pressure on the system (Oracle does so quite nicely using a filesystem
> without O_DIRECT) which in turn causes page scanning (aka the clearing
> of the referenced bit which is *THE* code that is being contested) but
> should not cause swap out. To me, this is just part of the methodology
> of being thorough in testing the effects of changes to the VM subsystem.
There should be no relevant pagetable scanning during those tests, and
the few bits that can get unmapped have lots of time to get mapped in
again from the cache with a minor fault, IMHO there's no way such tlb
flush removal can make a difference in a DBMS workload on a sanely set up
machine; I'm amazed you think it can make a difference.
Andrea
On Tue, Oct 30, 2001 at 08:38:31AM -0800, Linus Torvalds wrote:
> > It's simply not true on eg PPC.
>
> Now, it's not true on _all_ PPC's.
>
> The sane PPC setups actually have a regular soft-filled TLB, and last I
> saw that actually performed _better_ than the stupid architected hash-
> chains. And for the broken OS's (ie AIX) that wants the hash-chains, you
> can always make the soft-fill TLB do the stupid thing..
You can't turn off hardware hash-chains on anything past 603, sadly enough.
So all Macs, many embedded boards, ...
On Tue, 30 Oct 2001, Andrea Arcangeli wrote:
> again from the cache with a minor fault, IMHO there's no way such tlb
> flush removal can make a difference in a DBMS workload on a sanely
> setup machine, I'm amazed you think it can make a difference.
You seem to be hammering on the DBMS scenario because Ben
chose a bad example, without taking a step back and realising
that his point is valid, though mostly with other examples.
Linus came up with a more efficient way to make sure we
get to see (and clear) the accessed bits, lets use that one
and get on with life.
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
On Tue, Oct 30, 2001 at 09:57:57AM -0700, Victor Yodaiken wrote:
> On Tue, Oct 30, 2001 at 08:38:31AM -0800, Linus Torvalds wrote:
> > > It's simply not true on eg PPC.
> >
> > Now, it's not true on _all_ PPC's.
> >
> > The sane PPC setups actually have a regular soft-filled TLB, and last I
> > saw that actually performed _better_ than the stupid architected hash-
> > chains. And for the broken OS's (ie AIX) that wants the hash-chains, you
> > can always make the soft-fill TLB do the stupid thing..
>
> You can't turn off hardware hash-chains on anything past 603, sadly enough.
> So all Macs, many embedded boards, ...
Actually, the 7450 (V'ger) from Mot supports software tablewalks. (This is
what is in the > 800mhz dual mac G4's).
The issue then becomes supporting both hardware and software TLB walks on
SMP systems, since the 7450 is the first SMP capable PPC that can do
software tlbwalks.
--
Troy Benjegerdes | master of mispeeling | 'da hozer' | [email protected]
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's
why I draw cartoons. It's my life." -- Charles Shulz
On Tue, 30 Oct 2001, Victor Yodaiken wrote:
>
> You can't turn off hardware hash-chains on anything past 603, sadly enough.
Gods, I hope they have reconsidered that in their 64-bit chips. The 32-bit
hash chains may be ugly, but the architected 32/64-bit MMU stuff is just
so incredibly baroque that it makes any other MMU look positively
beautiful ("Segments? Segments shmegments. Big deal").
I still have the occasional nightmares about the IBM block diagrams
"explaining" the PowerPC MMU in their technical documentation.
There's probably a perfectly valid explanation for them, though (*).
Linus
(*) Probably along the lines of the designers being so high on LSD that
they thought it was a really cool idea. That would certainly explain it in
a very logical fashion.
>Now, it's not true on _all_ PPC's.
>
>The sane PPC setups actually have a regular soft-filled TLB, and last I
>saw that actually performed _better_ than the stupid architected hash-
>chains. And for the broken OS's (ie AIX) that wants the hash-chains, you
>can always make the soft-fill TLB do the stupid thing..
>
>(Yeah, yeah, I'm sure you can find code where the hash-chains are faster,
>especially big Fortran programs that have basically no tear-down and
>build-up overhead. Which was why those things were designed that way, of
>course. But it _looks_ like at least parts of IBM may finally be wising up
>to the fact that hashed TLB's are a stupid idea).
Well, I lack experience to state which scheme is better, but there is
definitely overhead introduced by our hash management in linux on ppc,
since we have to evict pages from the hash as soon as we test&clear
PAGE_ACCESSED or PAGE_DIRTY.
That means we keep flushing pages out of the hash table, which seems
to defeat the purpose of that big hash table supposed to hold the
PTEs for everybody out there more/less permanently.
However, I was thinking about a possible solution, please tell me
if I'm completely off here or if it makes sense.
Since we can't (AFAIK) have linux use larger PTEs (in which case we
could store a pointer to the hash PTE in the linux PTE), we could
instead lay out an array of pointers (one for each page) that would
hold these.
That way, we have a fast way to grab the real accessed and dirty
bits, which means we no longer need to flush hash pages when getting
those bits and have a more realistic PAGE_ACCESSED (currently, any page
in the hash has PAGE_ACCESSED).
Comments ?
Ben.
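A very rough sketch of that idea, assuming a parallel table with one
hash-PTE pointer per linux pte; every name below is made up for
illustration, none of it exists in the tree:

/* Hypothetical shadow table so the hardware R (referenced) bit can be
 * read without evicting the entry from the hash table. */
typedef struct {
	unsigned long v;	/* word 0 of a 32-bit HPTE */
	unsigned long rpn;	/* word 1: RPN plus the R and C bits */
} hpte_t;

struct hpte_shadow {
	hpte_t *slot[PTRS_PER_PTE];	/* NULL if the pte isn't hashed in */
};

static inline int shadow_pte_young(struct hpte_shadow *shadow, unsigned int idx)
{
	hpte_t *hp = shadow->slot[idx];

	return hp ? !!(hp->rpn & 0x100) : 0;	/* 0x100 is the R bit */
}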
>> #ifdef ?
>
> yes, but not for ppc, for alpha and all other archs without accessed bit
> provided in hardware (and cached in the tlb).
PPCs have those bits. I'm not sure if they are cached in the tlb.
Bye.
On Tue, Oct 30, 2001 at 06:23:26PM +0100, Giuliano Pochini wrote:
>
> >> #ifdef ?
> >
> > yes, but not for ppc, for alpha and all other archs without accessed bit
> > provided in hardware (and cached in the tlb).
>
> PPCs have those bits. I'm not sure if they are cached in the tlb.
Yes, this is why ppc won't need to define HAVE_NO_ACCESS_BIT_IN_TLB,
only alpha will. But again I'm not sure if it's worth it given we just
reduced the flush frequency by an order of magnitude.
Andrea
>Since we can't (AFAIK) have linux use larger PTEs (in which case we
>could store a pointer to the hash PTE in the linux PTE), We could
>instead layout an array of pointer (one for each page) that would
>hold these.
>
>That way, we have a fast way to grab the real accessed and dirty
>bits, which means we no longer need to flush hash pages when getting
>those bits and have a more realistic PAGE_ACCESSED (currently, any page
>in the hash has PAGE_ACCESSED).
Obviously, an array of pointers would be horrible bloat, maybe the
layout of a parallel page table (ARM does this I think) would make
more sense.
I don't know if it would be really an improvement however.
ben.
On Tue, 30 Oct 2001, Benjamin Herrenschmidt wrote:
>
> Well, I lack experience to state which scheme is better, but there is
> definitely overhead introduced by our hash management in linux on ppc,
> since we have to evict pages from the hash as soon as we test&clear
> PAGE_ACCESSED or PAGE_DIRTY.
> That means we keep flushing pages out of the hash table, which seems
> to defeat the purpose of that big hash table supposed to hold the
> PTEs for everybody out there more/less permanently.
Exactly. The problem with the hash chains is that they were designed for
_large_ jobs. For physicists that run a few copies of the same _huge_ load
over and over, where the jobs take minutes, hours or even days to
complete.
These jobs also have a very noticeable memory footprint (and thus TLB
component), and benchmarks like this sell systems.
And hash chains work _wonderfully_ for them - they are basically an
almost infinitely sized TLB, and the fact that cache locality is crap for
them doesn't matter because you use a LOT of entries and you have a LOT of
cache.
However, most people don't use their machines that way. Unix is pretty
much built up around running small, quick processes, and in _most_ normal
usage patterns (sorry, physicists ;), the cost of tearing down and
building up mappings is quite noticeable - often more so than the TLB
misses themselves.
In fact, I bet that for many apps, the number of TLB misses and the number
of faults to populate the VM space are not different by more than an order
of magnitude or so. They come, they eat, they go.
And in that kind of scenario, where mappings don't have lifetimes on the
order of minutes and hours at a time, hash chains suck. They make for
horrible cache behaviour, and building them up and tearing them down more
than makes up for any dubious win they had at TLB miss time.
And I stress _dubious_. A tree-based TLB lookup has a lot better cache
behaviour, and you can do a tree-based lookup in hardware quite
efficiently.
Yeah, if you didn't guess already, I despise hash chains. I'll take a
bigger on-chip secondary TLB and proper address space identifiers any day
over stupid hash chains. I think the Athlon does the former, but obviously
you can't do the latter with Intel-compatibility.
> Since we can't (AFAIK) have linux use larger PTEs (in which case we
> could store a pointer to the hash PTE in the linux PTE), we could
> instead lay out an array of pointers (one for each page) that would
> hold these.
You can make Linux use any size PTE you wish - the Linux VM (including the
page tables) is entirely software-driven (with the current limitation of
having three levels of lookup - we'll probably change that when 42 bits
of virtual user space gets tight and there are enough machines out there
to matter, which is about another five years, I suspect).
So storing a pointer to the hardware hash table if you want to is
certainly possible. It has memory overhead, though, and decreased cache
compactness etc, so...
Linus
On Tue, Oct 30, 2001 at 09:17:31AM -0800, Linus Torvalds wrote:
> I still have the occasional nightmares about the IBM block diagrams
> "explaining" the PowerPC MMU in their technical documentation.
>
> There's probably a perfectly valid explanation for them, though (*).
>
> Linus
>
> (*) Probably along the lines of the designers being so high on LSD that
> they thought it was a really cool idea. That would certainly explain it in
> a very logical fashion.
All the studies I saw were back from the days when
cache-speed/expensive-memory-speed was close to 1. In this case, the
effect of randomizing memory fetches is no big deal. The rest
of standard PPC mmu architecture is pretty nice, but, if the Alpha
architects could decide to use the PC cmos clock as their only
programmable timer, and the Itanium guys could decide to put in a single
shift-mask path, why shouldn't the IBM designers get to destroy cache
by wasting a bunch of CPU area logic?
Actually, they swung the pendulum the other way for the 64-bit chips. The
VSID's (MM contexts) are indirectly accessed via a hash-table (with an on
chip TLB-style cache called a SLB).
The speedup from using the software table-walk actually came from emulating
x86 instead of using the native hash tables. Pretty slick that emulating a
30-year old MMU and improves performance on the PowerPC, eh?
There was an April Fools Microprocessor Report describing a processor that
had gone 64-bit and had a "twisted gothic nightmare of twisted logic" based
MMU that involved XOR-ing addresses with random numbers. They were
unwittingly predicting the future of the PPC MMU.
The nightmares and shakes have never ended for me, either. Sorry about
that, man.
} Gods, I hope they have reconsidered that in their 64-bit chips. The 32-bit
} hash chains may be ugly, but the architected 32/64-bit MMU stuff is just
} so incredibly baroque that it makes any other MMU look positively
} beautiful ("Segments? Segments shmegments. Big deal").
}
} I still have the occasional nightmares about the IBM block diagrams
} "explaining" the PowerPC MMU in their technical documentation.
}
} There's probably a perfectly valid explanation for them, though (*).
}
} Linus
}
} (*) Probably along the lines of the designers being so high on LSD that
} they thought it was a really cool idea. That would certainly explain it in
} a very logical fashion.
}
Victor Yodaiken writes:
> You can't turn off hardware hash-chains on anything past 603, sadly enough.
> So all Macs, many embedded boards, ...
All the 4xx, 8xx, and 82xx embedded PPC cpus use software TLB
loading. The 7450 has a mode bit to say whether to use a hash table
or software TLB loading. And the new "Book E" specification also
mandates software-loaded TLBs.
That still leaves almost all of the current Macs and RS/6000s using a
hash table, of course. It does sound like providing at least the
option to use software TLB loading is becoming common in new PPC
designs.
(And BTW they are not hash *chains*, there is no chaining involved.
There is a primary bucket and a secondary bucket for any given
address, each of which can hold 8 ptes.)
Paul.
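For the record, the primary/secondary bucket (PTEG) selection on the classic
32-bit PPC MMU is roughly the following. This is a simplified sketch that
assumes a naturally aligned hash table, not the full HTABORG/HTABMASK
address formation:

/* Simplified sketch of the 32-bit PPC hashed page table bucket selection. */
unsigned long primary_pteg(unsigned long vsid, unsigned long ea,
			   unsigned long htab_base, unsigned long htab_mask)
{
	unsigned long hash = (vsid & 0x7ffff) ^ ((ea >> 12) & 0xffff);

	return htab_base + ((hash & htab_mask) << 6);	/* 64-byte PTEG, 8 PTEs */
}

unsigned long secondary_pteg(unsigned long vsid, unsigned long ea,
			     unsigned long htab_base, unsigned long htab_mask)
{
	/* the secondary hash is the one's complement of the primary hash */
	unsigned long hash = ~((vsid & 0x7ffff) ^ ((ea >> 12) & 0xffff));

	return htab_base + ((hash & htab_mask) << 6);
}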
On Wed, Oct 31, 2001 at 08:39:48AM +1100, Paul Mackerras wrote:
> Victor Yodaiken writes:
>
> > You can't turn off hardware hash-chains on anything past 603, sadly enough.
> > So all Macs, many embedded boards, ...
>
> All the 4xx, 8xx, and 82xx embedded PPC cpus use software TLB
> loading. The 7450 has a mode bit to say whether to use a hash table
> or software TLB loading. And the new "Book E" specification also
> mandates software-loaded TLBs.
>
> That still leaves almost all of the current Macs and RS/6000s using a
> hash table, of course. It does sound like providing at least the
> option to use software TLB loading is becoming common in new PPC
> designs.
But, as I understand it, not in PPC64 or in the mainline PC series.
What's really cool about the hashtable design is that each bucket will
have at most one useful piece of information but still fill an entire
cache line.
> (And BTW they are not hash *chains*, there is no chaining involved.
> There is a primary bucket and a secondary bucket for any given
> address, each of which can hold 8 ptes.)
Which are almost certain to be unrelated, so two consecutive TLB misses
will almost certainly require two fetches from main memory.
And, reusing a context requires a scrub of the hash table...
Once you start down this slope, all sorts of other disadvantages appear
for free.
>
> Paul.
Andrea Arcangeli writes:
> On Tue, Oct 30, 2001 at 06:23:26PM +0100, Giuliano Pochini wrote:
> >
> > >> #ifdef ?
> > >
> > > yes, but not for ppc, for alpha and all other archs without accessed bit
> > > provided in hardware (and cached in the tlb).
> >
> > PPCs have those bits. I'm not sure if they are cached in the tlb.
>
> Yes, this is why ppc won't need to define HAVE_NO_ACCESS_BIT_IN_TLB,
> only alpha will. But again I'm not sure if it's worth it given we just
> reduced the flush frequency by an order of magnitude.
I don't think it has to do with whether the accessed bit is maintained
by hardware or by software. On PPC, we actually maintain the accessed
and dirty bits in software - we don't use the hardware-maintained
R (referenced) and C (changed) bits, because they are in the HPTEs
(the PTEs in the hash table), and it is hard to get back from a linux
pte value to the corresponding HPTE. The performance impact of
maintaining the accessed and dirty bits in software is negligible
according to the benchmarks I have done, and it is much much simpler
than trying to use the hardware R and C bits.
Paul.
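In other words, what the software-maintained bits amount to is roughly the
following: the first touch after the bit is cleared takes a minor fault (or
a hash miss), and the fault path records the bit in the linux pte before the
translation is reinserted. A sketch only, with the hook into the
arch-specific hash/TLB reload elided:

/* Illustrative sketch of software-maintained accessed/dirty bits. */
static int soft_reference_fault(pte_t *ptep, int write_access)
{
	pte_t pte = *ptep;

	if (!pte_present(pte))
		return 0;		/* a real fault, handled elsewhere */

	/* first touch since the bit was cleared: record it in the linux pte */
	pte = pte_mkyoung(pte);
	if (write_access)
		pte = pte_mkdirty(pte);
	set_pte(ptep, pte);

	/* arch code then reinserts the translation into the hash table or
	 * TLB; until the bit is cleared again no further faults are taken */
	return 1;
}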
Hello Paul,
On Wed, Oct 31, 2001 at 08:52:47AM +1100, Paul Mackerras wrote:
> pte value to the corresponding HPTE. The performance impact of
> maintaining the accessed and dirty bits in software is negligible
The accessed bit is not worth maintaining in software. In fact, if it
is true that ppc doesn't have a hardware maintained accessed bit (like
alpha), this means ppc _doesn't_ want the tlb flush, just like alpha,
unlike what was said previously today in other emails. The whole
ptep_test_and_clear_young section + tlb flush should be #ifndeffed out
for those archs. The waste is caring about this non-existent accessed
bit in the first place for those archs. Those archs entirely depend on
the minor faults to activate the cache, and so we can only unmap the
inactive cache unconditionally on those archs. In fact on those archs
we should probably deactivate all the cache, not only the inactive one,
so that it has a better chance of keeping the working set in the active
cache.
With regard to the dirty bit: while fixing the x86 smp races, Ben did
benchmarks that showed the performance hit isn't negligible if it is
maintained in software; you know it generates double the page faults
with the "right" benchmark, but maybe this is the only sane approach for
ppc.
Andrea