2002-03-16 19:58:01

by Victor Yodaiken

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sat, Mar 16, 2002 at 08:32:26PM +0100, Andi Kleen wrote:
> x86-64 aka AMD Hammer does hardware (or more likely microcode) search of
> page tables.
> It has a 4 level page table with 4K pages. Generic Linux MM code only sees
> the first slot in the 4th level, limiting user space to 512GB with 3 levels.

What about 2M pages?

> Direct mappings and kernel mappings are handled specially by architecture
> specific code outside that first slot.
>
> The CPU itself has I/D TLBs split into L1 and L2.

There was something in some AMD doc about preventing tlbflush on process
switch - through a context-like mechanism perhaps? Any idea?


>
> -Andi

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com


2002-03-16 20:16:01

by Linus Torvalds

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile


On Sat, 16 Mar 2002 [email protected] wrote:
>
> What about 2M pages?

Not useful for generic loads right now, and the latencies for clearing or
copying them etc (ie single page faults - nopage or COW) are still
big enough that it would likely be a performance problem at that level.
And while doing IO in 2MB chunks sounds like fun, since most files are
still just a few kB, your page cache memory overhead would be prohibitive
(ie even if you had 64GB of memory, you might want to cache more than a
few thousand files at the same time).

So then you'd need to do page caching at a finer granularity than you do
mmap, which implies absolutely horrible things from a coherency standpoint
(mmap/read/write are supposed to be coherent in a nice UNIX - even if
there are some non-nice unixes still around).

We may get there some day, but right now 2M pages are not usable for user
access.

64kB would be fine, though.

Oh, and in the specific case of hammer, one of the main advantages of the
thing is of course running old binaries unchanged. And old binaries
certainly do mmap's at smaller granularity than 2M (and have to, because a
3G user address space won't fit all that many 2M chunks).

Give up on large pages - it's just not happening. Even when a 64kB page
would make sense from a technology standpoint these days, backwards
compatibility makes people stay at 4kB.

Instead of large pages, you should be asking for larger and wider TLB's
(for example, nothing says that a TLB entry has to be a single page:
people already do the kind of "super-entries", where one TLB entry
actually contains data for 4 or 8 aligned pages, so you get the _effect_
of a 32kB page that really is 8 consecutive 4kB pages).

Such a "wide" TLB entry has all the advantages of small pages (no
memory fragmentation, backwards compatibility etc), while still being able
to load 64kB worth of translations in one go.

(One of the advantages of a page table tree over a hashed setup becomes
clear in this kind of situation: you cannot usefully load multiple entries
from the same cacheline into one TLB entry in a hashed table, while in a
tree it's truly trivial)

Linus

2002-03-16 20:23:01

by Andi Kleen

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sat, Mar 16, 2002 at 12:14:06PM -0800, Linus Torvalds wrote:
> Oh, and in the specific case of hammer, one of the main advantages of the
> thing is of course running old binaries unchanged. And old binaries
> certainly do mmap's at smaller granularity than 2M (and have to, because a
> 3G user address space won't fit all that many 2M chunks).

The idea was to only map selected mappings using large pages, e.g. shared
memory mappings to help all the databases or use a special mmap flag
for the Beowulf people.

> Give up on large pages - it's just not happening. Even when a 64kB page
> would make sense from a technology standpoint these days, backwards
> compatibility makes people stay at 4kB.

Yes the 4KB page has to be kept at least for now.

-Andi

2002-03-16 20:36:53

by Richard Gooch

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Linus Torvalds writes:
> Instead of large pages, you should be asking for larger and wider TLB's
> (for example, nothing says that a TLB entry has to be a single page:
> people already do the kind of "super-entries", where one TLB entry
> actually contains data for 4 or 8 aligned pages, so you get the _effect_
> of a 32kB page that really is 8 consecutive 4kB pages).
>
> Such a "wide" TLB entry has all the advantages of small pages (no
> memory fragmentation, backwards compatibility etc), while still
> being able to load 64kB worth of translations in one go.

These are contiguous physical pages, or just logical (virtual) pages?

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2002-03-16 20:40:53

by Linus Torvalds

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile


On Sat, 16 Mar 2002, Richard Gooch wrote:
>
> These are contiguous physical pages, or just logical (virtual) pages?

Contiguous virtual pages, but discontiguous physical pages.

The advantage being that you only need one set of virtual tags per "wide"
entry, and you just fill the whole wide entry directly from the cacheline
(ie the TLB entry is not really 32 bits any more, it's a full cacheline).

The _real_ advantage being that it should be totally invisible to
software. I think Intel does something like this, but the point is, I
don't even have to know, and it still works.

Linus

2002-03-16 20:51:51

by Richard Gooch

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Linus Torvalds writes:
>
> On Sat, 16 Mar 2002, Richard Gooch wrote:
> >
> > These are contiguous physical pages, or just logical (virtual) pages?
>
> Contiguous virtual pages, but discontiguous physical pages.

That's what I was hoping. Having both contiguous would be of some
benefit, but of course at the cost of having to unfragment physical
pages. Even if Andi can cleanly do that with rmap, it's still going to
cost (page copies, Dcache footprint, locking and more). I like the
"wide" TLB approach much more.

> The advantage being that you only need one set of virtual tags per
> "wide" entry, and you just fill the whole wide entry directly from
> the cacheline (ie the TLB entry is not really 32 bits any more, it's
> a full cacheline).
>
> The _real_ advantage being that it should be totally invisible to
> software. I think Intel does something like this, but the point is,
> I don't even have to know, and it still works.

Completely behind the kernel's back? Even so, is there some hint we
can give to the CPU to help? Or perhaps a hint an application can give
to the kernel to specify better alignment of mappings? The latter
would require a way for the kernel to find out the preferred alignment
from the CPU. Is this information available?

Anyone know if AMD does this as well?

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2002-03-17 13:23:58

by Rik van Riel

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sat, 16 Mar 2002, Linus Torvalds wrote:
> On Sat, 16 Mar 2002 [email protected] wrote:
> >
> > What about 2M pages?
>
> Not useful for generic loads right now, and the latencies for clearing or
> copying them etc (ie single page faults - nopage or COW) are still
> big enough that it would likely be a performance problem at that level.
> And while doing IO in 2MB chunks sounds like fun, since most files are
> still just a few kB,

In other words, large pages should be a "special hack" for
special applications, like Oracle and maybe some scientific
calculations ?

Grabbing some bitflags in generic datastructures shouldn't
be an issue since free bits are available.

regards,

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-17 18:18:58

by Linus Torvalds

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

In article <[email protected]>,
Rik van Riel <[email protected]> wrote:
>
>In other words, large pages should be a "special hack" for
>special applications, like Oracle and maybe some scientific
>calculations ?

Yes, I think so.

That said, a 64kB page would be useful for generic use.

>Grabbing some bitflags in generic datastructures shouldn't
>be an issue since free bits are available.

I had large-page-support working in the VM a long time ago, back when I
did the original VM portability rewrite. I actually exposed the kernel
large pages to the VM, and it worked fine - I didn't even need a new
bit, since the code just used the "large page" bit in the page table
directly.

But it wasn't ever exposed to user space, and in the end I just made the
kernel mapping just not visible to the VM and simplified the x86
pmd_xxx() macros. The approach definitely worked, though.

Linus

2002-03-17 22:57:31

by Davide Libenzi

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sun, 17 Mar 2002, Linus Torvalds wrote:

> In article <[email protected]>,
> Rik van Riel <[email protected]> wrote:
> >
> >In other words, large pages should be a "special hack" for
> >special applications, like Oracle and maybe some scientific
> >calculations ?
>
> Yes, I think so.
>
> That said, a 64kB page would be useful for generic use.
>
> >Grabbing some bitflags in generic datastructures shouldn't
> >be an issue since free bits are available.
>
> I had large-page-support working in the VM a long time ago, back when I
> did the original VM portability rewrite. I actually exposed the kernel
> large pages to the VM, and it worked fine - I didn't even need a new
> bit, since the code just used the "large page" bit in the page table
> directly.
>
> But it wasn't ever exposed to user space, and in the end I just made the
> kernel mapping just not visible to the VM and simplified the x86
> pmd_xxx() macros. The approach definitely worked, though.

Couldn't we choose the page size depending on the map size ?
If we start mixing page sizes, what about kernel code that assumes PAGE_SIZE ?



- Davide


2002-03-19 04:33:04

by Rusty Russell

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sat, 16 Mar 2002 21:22:29 +0100
Andi Kleen <[email protected]> wrote:

> On Sat, Mar 16, 2002 at 12:14:06PM -0800, Linus Torvalds wrote:
> > Give up on large pages - it's just not happening. Even when a 64kB page
> > would make sense from a technology standpoint these days, backwards
> > compatibility makes people stay at 4kB.
>
> Yes the 4KB page has to be kept at least for now.

We have sysconf(_SC_PAGESIZE). I say, introduce an experimental CONFIG for
64k pagesize in 2.5, so we can start to weed out the problem apps NOW.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-24 21:13:34

by Rogier Wolff

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Linus Torvalds wrote:
> We may get there some day, but right now 2M pages are not usable for user
> access.
>
> 64kB would be fine, though.

[...]

> Give up on large pages - it's just not happening. Even when a 64kB page
> would make sense from a technology standpoint these days, backwards
> compatibility makes people stay at 4kB.

I would think that "large page support" that the processors give you
is indeed unusable, but what do you think about "software large(r)
pages"?

What I mean is that instead of doing the 4k that the ia32 hardware
gives us, we pretend that pages are (e.g.) 8k. Thus we always load a
pair of page table entries. mem_map ends up having 8k granularity, IO
is done on page-sized (i.e. 8k in this case) chunks etc etc. (%)

So we have a "PAGE_SIZE" define all around the kernel. Keep that the
same (for compatibility), but make a "REAL_PAGE_SIZE" that governs the
loop that actually sets the page table (or tlb) entries.... Note that
a first implementation may actually effectively reduce the size of the
TLB on machines with a software loaded TLB....

Why would I want this? Well, suppose I have a machine that unavoidably
has to swap on some of its workload. In practice you will almost
double the disk throughput by increasing the page size by a factor of
two. If the hit rate on the "extra page" that you swap in by
pretending pages are 8k and not 4k is over a couple of percents (*),
then this is advantageous: a seek plus transfer of 4k costs say 10ms +
0.16ms while a seek plus transfer of 8k costs 10ms + 0.33ms (#). Thus
the "penalty" of the extra 4k transfer is very, very small.

Now, for all the reasons you mention, keeping 4k as the "default" on
Intel is good. But a config option:

                 on       on
                 intel    alpha
   page size:    4k
                 8k       8k
                 16k      16k
                 32k      32k
                 64k      64k
                 128k     128k

would also be good for some people.

Roger.

(*) Actually it has to be a "couple of percents" better than the 4k
page that we had to evict to be able to accommodate the current extra
4k...

(#) A 10k RPM disk rotates in 6ms, so average rotational latency is
about 3ms, and with an average seek time of 7ms, that comes to 10ms.

(%) And we get ext2 8k block size support on ia32!

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.

2002-03-24 21:38:06

by Andrew Morton

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Rogier Wolff wrote:
>
> ...
> So we have a "PAGE_SIZE" define all around the kernel. Keep that the
> same (for compatibility), but make a "REAL_PAGE_SIZE" that governs the
> loop that actually sets the page table (or tlb) entries.... Note that
> a first implementation may actually effectivly reduce the size of the
> TLB on machines with a software loaded TLB....
>
> Why would I want this? Well, suppose I have a machine that unavoidably
> has to swap on some of its workload. In practise you will almost
> double the disk troughput by increasing the page size by a factor of
> two.

swapin and swapout already perform multipage clustering - you'd get
the same benefits from increasing SWAP_CLUSTER_MAX and page_cluster.

Which is a three-line patch.

Frankly, all the discussion I've seen about altering page sizes
threatens to add considerable complexity for very dubious gains.
The only place where I've seen a solid justification is for
scientific applications which have a huge working set, and need
large pages to save on TLB thrashing.

For everything else, I believe we can get the efficiencies
which we need by writing efficient code; no need to go playing
with page sizes.

If someone can point at a real-world workload and say "we suck",
and we can't fix that suckage without altering the page size then
would that person please come forth.

-

2002-03-24 22:55:11

by Nick Craig-Wood

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sun, Mar 24, 2002 at 01:35:57PM -0800, Andrew Morton wrote:
> Frankly, all the discussion I've seen about altering page sizes
> threatens to add considerable complexity for very dubious gains.
> The only place where I've seen a solid justification is for
> scientific applications which have a huge working set, and need
> large pages to save on TLB thrashing.

A widely used example is mprime - the mersenne prime finding program (
http://www.mersenne.org/ ). This typically uses 8 or more MBytes of
RAM which it completely thrashes.

The program is written in very efficient assembler code and has been
designed not to thrash the TLB as much as possible, but with a working
set of > 8 MBs (which is iterated through many times a second at
maximum memory bandwidth) large pages would make a real improvement to
it. Since each run takes weeks any improvement would be eagerly
snatched at by the 1000s of people running this program ;-)

If there was some hack where 4MB pages could be allocated for
applications like this then I'd be very happy!

--
Nick Craig-Wood
[email protected]

2002-03-24 23:42:12

by Andi Kleen

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

> If there was some hack where 4MB pages could be allocated for
> applications like this then I'd be very happy!

You could always run it as a kernel module.
Just need to add schedule points or use a preemptive kernel.
When you allocate data using get_free_pages() it'll return a pointer
in the 2 or 4MB mapped direct mapping of the kernel. It'll only work
when your memory is not fragmented.

-Andi

2002-03-25 06:41:49

by Martin J. Bligh

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

> Frankly, all the discussion I've seen about altering page sizes
> threatens to add considerable complexity for very dubious gains.

If we don't mix page sizes, but just increase the default from
4k, does this still add a lot of complexity in your eyes? I can't
see why it would ... ?

> If someone can point at a real-world workload and say "we suck",
> and we can't fix that suckage without altering the page size then
> would that person please come forth.

I believe one of the traditional problems stated for this case is
the amount of virtual address space taken up by all the struct pages
for a machine with large amounts of memory (32-64Gb). At the moment,
the obvious choice of architecture is still 32 bit, but maybe AMD
Hammer will fix this ... Unless someone has a plan to move all those
up into highmem as well ....

M.

2002-03-18 00:53:48

by Rik van Riel

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sun, 17 Mar 2002, Davide Libenzi wrote:
> On Sun, 17 Mar 2002, Linus Torvalds wrote:
>
> > In article <[email protected]>,
> > Rik van Riel <[email protected]> wrote:
> > >
> > >In other words, large pages should be a "special hack" for
> > >special applications, like Oracle and maybe some scientific
> > >calculations ?
> >
> > Yes, I think so.

> Couldn't we choose the page size depending on the map size ?

For on-disk files I guess this is better as an mmap flag,
but for shared memory segments we could try to do this
automagically.

> If we start mixing page sizes, what about kernel code that assumes
> PAGE_SIZE ?

We fix it.

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-18 01:09:06

by Davide Libenzi

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sun, 17 Mar 2002, Rik van Riel wrote:

> On Sun, 17 Mar 2002, Davide Libenzi wrote:
> > On Sun, 17 Mar 2002, Linus Torvalds wrote:
> >
> > > In article <[email protected]>,
> > > Rik van Riel <[email protected]> wrote:
> > > >
> > > >In other words, large pages should be a "special hack" for
> > > >special applications, like Oracle and maybe some scientific
> > > >calculations ?
> > >
> > > Yes, I think so.
>
> > Couldn't we choose the page size depending on the map size ?
>
> For on-disk files I guess this is better as an mmap flag,
> but for shared memory segments we could try to do this
> automagically.

What's the reason that would make it more convenient for us, upon receiving a
request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?




- Davide


2002-03-18 01:32:40

by Linus Torvalds

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile



On Sun, 17 Mar 2002, Davide Libenzi wrote:
>
> What's the reason that would make it more convenient for us, upon receiving a
> request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?

Ehh.. Let me count the ways:
- reliable allocation of 4MB of contiguous data
- graceful fallback when you need to start paging
- sane coherency with somebody who mapped the same file/segment in a much
smaller chunk

Guys, 4MB pages are always going to be a special case. There's no sane
way to make them automatic, for the simple reason that they are USELESS
for "normal" work, and they have tons of problems that are quite
fundamental and just aren't going away and cannot be worked around.

The only sane way to use 4MB segments is:

- the application does a special system call (or special flag to mmap)
saying that it wants a big page and doesn't care about coherency with
anybody else that didn't set the flag (and realize that that probably
includes things like read/write)

- the machine has enough memory that the user can be allowed
to _lock_ the area down, so that you don't have to worry about
swapping out that thing in 4M pieces. (This of course implies that
per-user memory counters have to work too, or we have to limit it by
default with a rlimit or something to zero).

In short, very much a special case.

(There are two reasons you don't want to handle paging on 4M chunks: (a)
they may be wonderful for IO throughput, but they are horrible for latency
for other people and (b) you now have basically just a few bits of usage
information for 4M worth of memory, as opposed to a finer granularity view
of which parts are actually _used_).

Once you can count on having memory sizes in the hundreds of Gigs, and
disk throughput speeds in the hundreds of megs a second, and there are
enough of these machines to _matter_ (and reliably 64-bit address spaces
so that virtual fragmentation doesn't matter), we might make 4MB the
regular mapping entity.

That's probably at least a decade away.

Linus

2002-03-18 01:40:09

by Mike Fedyk

[permalink] [raw]
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sun, Mar 17, 2002 at 05:13:16PM -0800, Davide Libenzi wrote:
> On Sun, 17 Mar 2002, Rik van Riel wrote:
>
> > On Sun, 17 Mar 2002, Davide Libenzi wrote:
> > > On Sun, 17 Mar 2002, Linus Torvalds wrote:
> > >
> > > > In article <[email protected]>,
> > > > Rik van Riel <[email protected]> wrote:
> > > > >
> > > > >In other words, large pages should be a "special hack" for
> > > > >special applications, like Oracle and maybe some scientific
> > > > >calculations ?
> > > >
> > > > Yes, I think so.
> >
> > > Couldn't we choose the page size depending on the map size ?
> >
> > For on-disk files I guess this is better as an mmap flag,
> > but for shared memory segments we could try to do this
> > automagically.
>
> What's the reason that would make it more convenient for us, upon receiving a
> request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?

... the VM chooses to unmap an mmapped page, and it picks a 4MB page; later it
needs just a few bytes from that unmapped page and has to fault in 4MB instead
of 4KB (worst case) to map that page again...

2002-03-18 01:43:59

by Davide Libenzi

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sun, 17 Mar 2002, Mike Fedyk wrote:

> On Sun, Mar 17, 2002 at 05:13:16PM -0800, Davide Libenzi wrote:

> > What's the reason that would make it more convenient for us, upon receiving a
> > request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?
>
> ... the VM chooses to unmap an mmapped page, and it picks a 4MB page; later it
> needs just a few bytes from that unmapped page and has to fault in 4MB instead
> of 4KB (worst case) to map that page again...

The big-page property should be vma related and should be obviously
handled correctly ...



- Davide


2002-03-18 01:52:41

by Davide Libenzi

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Sun, 17 Mar 2002, Linus Torvalds wrote:

> On Sun, 17 Mar 2002, Davide Libenzi wrote:
> >
> > What's the reason that would make it more convenient for us, upon receiving a
> > request to map a NNN MB file, to map it using 4Kb pages instead of 4MB ones ?
>
> Ehh.. Let me count the ways:
> - reliable allocation of 4MB of contiguous data
> - graceful fallback when you need to start paging
> - sane coherency with somebody who mapped the same file/segment in a much
> smaller chunk
>
> Guys, 4MB pages are always going to be a special case. There's no sane
> way to make them automatic, for the simple reason that they are USELESS
> for "normal" work, and they have tons of problems that are quite
> fundamental and just aren't going away and cannot be worked around.
>
> The only sane way to use 4MB segments is:
>
> - the application does a special system call (or special flag to mmap)
> saying that it wants a big page and doesn't care about coherency with
> anybody else that didn't set the flag (and realize that that probably
> includes things like read/write)
>
> - the machine has enough memory that the user can be allowed
> to _lock_ the area down, so that you don't have to worry about
> swapping out that thing in 4M pieces. (This of course implies that
> per-user memory counters have to work too, or we have to limit it by
> default with a rlimit or something to zero).
>
> In short, very much a special case.
>
> (There are two reasons you don't want to handle paging on 4M chunks: (a)
> they may be wonderful for IO throughput, but they are horrible for latency
> for other people and (b) you now have basically just a few bits of usage
> information for 4M worth of memory, as opposed to a finer granularity view
> of which parts are actually _used_).
>
> Once you can count on having memory sizes in the hundreds of Gigs, and
> disk throughput speeds in the hundreds of megs a second, and there are
> enough of these machines to _matter_ (and reliably 64-bit address spaces
> so that virtual fragmentation doesn't matter), we might make 4MB the
> regular mapping entity.
>
> That's probably at least a decade away.

Plenty of reasons, thanks Linus :) ... even if workstations with Gigs of RAM
and no swap are not so uncommon nowadays. Anyway I agree about the
flag-driven activation ...




- Davide