2008-03-21 06:22:51

by Christoph Lameter

[permalink] [raw]
Subject: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

This allows fallback for order 1 stack allocations. In the fallback
scenario the stacks will be virtually mapped.

Signed-off-by: Christoph Lameter <[email protected]>
---
include/asm-ia64/thread_info.h | 5 +++--
include/asm-x86/thread_info_32.h | 6 +++---
include/asm-x86/thread_info_64.h | 4 ++--
3 files changed, 8 insertions(+), 7 deletions(-)

Index: linux-2.6.25-rc5-mm1/include/asm-ia64/thread_info.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/asm-ia64/thread_info.h 2008-03-20 20:03:47.165885870 -0700
+++ linux-2.6.25-rc5-mm1/include/asm-ia64/thread_info.h 2008-03-20 20:04:51.302135777 -0700
@@ -82,8 +82,9 @@ struct thread_info {
#define end_of_stack(p) (unsigned long *)((void *)(p) + IA64_RBS_OFFSET)

#define __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
-#define alloc_task_struct() ((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER))
-#define free_task_struct(tsk) free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER)
+#define alloc_task_struct() ((struct task_struct *)__alloc_vcompound( \
+ GFP_KERNEL, KERNEL_STACK_SIZE_ORDER))
+#define free_task_struct(tsk) __free_vcompound(tsk)

#define tsk_set_notify_resume(tsk) \
set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME)
Index: linux-2.6.25-rc5-mm1/include/asm-x86/thread_info_32.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/asm-x86/thread_info_32.h 2008-03-20 20:03:47.173885951 -0700
+++ linux-2.6.25-rc5-mm1/include/asm-x86/thread_info_32.h 2008-03-20 20:04:51.306136067 -0700
@@ -96,13 +96,13 @@ static inline struct thread_info *curren
/* thread information allocation */
#ifdef CONFIG_DEBUG_STACK_USAGE
#define alloc_thread_info(tsk) ((struct thread_info *) \
- __get_free_pages(GFP_KERNEL| __GFP_ZERO, get_order(THREAD_SIZE)))
+ __alloc_vcompound(GFP_KERNEL| __GFP_ZERO, get_order(THREAD_SIZE)))
#else
#define alloc_thread_info(tsk) ((struct thread_info *) \
- __get_free_pages(GFP_KERNEL, get_order(THREAD_SIZE)))
+ __alloc_vcompound(GFP_KERNEL, get_order(THREAD_SIZE)))
#endif

-#define free_thread_info(info) free_pages((unsigned long)(info), get_order(THREAD_SIZE))
+#define free_thread_info(info) __free_vcompound(info)

#else /* !__ASSEMBLY__ */

Index: linux-2.6.25-rc5-mm1/include/asm-x86/thread_info_64.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/asm-x86/thread_info_64.h 2008-03-20 20:03:47.189886138 -0700
+++ linux-2.6.25-rc5-mm1/include/asm-x86/thread_info_64.h 2008-03-20 20:04:51.306136067 -0700
@@ -83,9 +83,9 @@ static inline struct thread_info *stack_
#endif

#define alloc_thread_info(tsk) \
- ((struct thread_info *) __get_free_pages(THREAD_FLAGS, THREAD_ORDER))
+ ((struct thread_info *) __alloc_vcompound(THREAD_FLAGS, THREAD_ORDER))

-#define free_thread_info(ti) free_pages((unsigned long) (ti), THREAD_ORDER)
+#define free_thread_info(ti) __free_vcompound(ti)

#else /* !__ASSEMBLY__ */


--
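
For context, the __alloc_vcompound()/__free_vcompound() helpers used above are introduced elsewhere in this series and are not shown here. A minimal sketch of the fallback idea the changelog describes -- try a physically contiguous higher-order allocation first, then fall back to a virtually mapped area of the same size -- might look roughly like this (names and details are illustrative, not the actual patch code):

/*
 * Illustrative sketch only.  Try the normal page allocator first; if the
 * higher-order allocation fails, fall back to a virtually contiguous
 * (vmalloc) mapping of equal size.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static void *vcompound_alloc_sketch(gfp_t gfp, int order)
{
	void *addr;

	/* Fast path: physically contiguous pages. */
	addr = (void *)__get_free_pages(gfp | __GFP_NOWARN, order);
	if (addr)
		return addr;

	/* Fallback: virtually mapped memory of the same size. */
	return __vmalloc(PAGE_SIZE << order, gfp, PAGE_KERNEL);
}

static void vcompound_free_sketch(void *addr, int order)
{
	if (is_vmalloc_addr(addr))
		vfree(addr);
	else
		free_pages((unsigned long)addr, order);
}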


2008-03-21 07:25:03

by David Miller

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

From: Christoph Lameter <[email protected]>
Date: Thu, 20 Mar 2008 23:17:14 -0700

> This allows fallback for order 1 stack allocations. In the fallback
> scenario the stacks will be virtually mapped.
>
> Signed-off-by: Christoph Lameter <[email protected]>

I would be very careful with this especially on IA64.

If the TLB miss or other low-level trap handler depends upon being
able to dereference thread info, task struct, or kernel stack stuff
without causing a fault outside of the linear PAGE_OFFSET area, this
patch will cause problems.

It will be difficult to debug the kinds of crashes this will cause
too.

2008-03-21 08:40:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86


* David Miller <[email protected]> wrote:

> From: Christoph Lameter <[email protected]>
> Date: Thu, 20 Mar 2008 23:17:14 -0700
>
> > This allows fallback for order 1 stack allocations. In the fallback
> > scenario the stacks will be virtually mapped.
> >
> > Signed-off-by: Christoph Lameter <[email protected]>
>
> I would be very careful with this especially on IA64.
>
> If the TLB miss or other low-level trap handler depends upon being
> able to dereference thread info, task struct, or kernel stack stuff
> without causing a fault outside of the linear PAGE_OFFSET area, this
> patch will cause problems.
>
> It will be difficult to debug the kinds of crashes this will cause
> too. [...]

another thing is that this patchset includes KERNEL_STACK_SIZE_ORDER
which has been NACK-ed before on x86 by several people and i'm nacking
this "configurable stack size" aspect of it again.

although it's not being spelled out in the changelog, i believe the
fundamental problem comes from a cpumask_t taking 512 bytes with
nr_cpus=4096, and if a few of them are on the kernel stack it can be a
problem. The correct answer is to not put them on the stack and we've
been taking patches to that end. Every other object allocator in the
kernel is able to not put stuff on the kernel stack. We _dont_ want
higher-order kernel stacks and we dont want to make a special exception
for cpumask_t either.

i believe time might be better spent increasing PAGE_SIZE on these
ridiculously large systems and making that work well with our binary
formats - instead of complicating our kernel VM with virtually mapped
buffers. That will also solve the kernel stack problem, in a very
natural way.

Ingo
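
To illustrate the cpumask_t point Ingo makes above: with NR_CPUS=4096 a cpumask_t is 512 bytes, so declaring one as a local variable eats a large slice of an 8K stack, while allocating it keeps the stack footprint small. A rough sketch using the 2.6.25-era cpumask API (the function itself is made up for the example):

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/slab.h>

static int online_and_possible_weight(void)
{
	cpumask_t *tmp;		/* allocated instead of "cpumask_t tmp;" on the stack */
	int n;

	tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);	/* 512 bytes with NR_CPUS=4096 */
	if (!tmp)
		return -ENOMEM;

	cpus_and(*tmp, cpu_online_map, cpu_possible_map);
	n = cpus_weight(*tmp);
	kfree(tmp);
	return n;
}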

2008-03-21 17:35:34

by Christoph Lameter

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Fri, 21 Mar 2008, Ingo Molnar wrote:

> another thing is that this patchset includes KERNEL_STACK_SIZE_ORDER
> which has been NACK-ed before on x86 by several people and i'm nacking
> this "configurable stack size" aspect of it again.

Huh? Nothing of that nature is in this patchset.

2008-03-21 17:42:05

by Christoph Lameter

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Fri, 21 Mar 2008, David Miller wrote:

> I would be very careful with this especially on IA64.
>
> If the TLB miss or other low-level trap handler depends upon being
> able to dereference thread info, task struct, or kernel stack stuff
> without causing a fault outside of the linear PAGE_OFFSET area, this
> patch will cause problems.

Hmmm. Does not sound good for arches that cannot handle TLB misses in
hardware. I wonder how arch specific this is? Last time around I was told
that some arches already virtually map their stacks.

2008-03-21 19:03:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86


* Christoph Lameter <[email protected]> wrote:

> On Fri, 21 Mar 2008, Ingo Molnar wrote:
>
> > another thing is that this patchset includes KERNEL_STACK_SIZE_ORDER
> > which has been NACK-ed before on x86 by several people and i'm
> > nacking this "configurable stack size" aspect of it again.
>
> Huh? Nothing of that nature is in this patchset.

your patch indeed does not introduce it here, but
KERNEL_STACK_SIZE_ORDER shows up in the x86 portion of your patch and
you refer to multi-order stack allocations in your 0/14 mail :-)

> -#define alloc_task_struct() ((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER))
> -#define free_task_struct(tsk) free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER)
> +#define alloc_task_struct() ((struct task_struct *)__alloc_vcompound( \
> + GFP_KERNEL, KERNEL_STACK_SIZE_ORDER))

Ingo

2008-03-21 19:06:34

by Christoph Lameter

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Fri, 21 Mar 2008, Ingo Molnar wrote:

> your patch indeed does not introduce it here, but
> KERNEL_STACK_SIZE_ORDER shows up in the x86 portion of your patch and
> you refer to multi-order stack allocations in your 0/14 mail :-)

Ahh. I see. Remnants from V2 in IA64 code. That portion has to be removed
because of the software TLB issues on IA64 as pointed out by Dave.

2008-03-21 21:57:13

by David Miller

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

From: Christoph Lameter <[email protected]>
Date: Fri, 21 Mar 2008 10:40:18 -0700 (PDT)

> On Fri, 21 Mar 2008, David Miller wrote:
>
> > I would be very careful with this especially on IA64.
> >
> > If the TLB miss or other low-level trap handler depends upon being
> > able to dereference thread info, task struct, or kernel stack stuff
> > without causing a fault outside of the linear PAGE_OFFSET area, this
> > patch will cause problems.
>
> Hmmm. Does not sound good for arches that cannot handle TLB misses in
> hardware. I wonder how arch specific this is? Last time around I was told
> that some arches already virtually map their stacks.

I'm not saying there is a problem, I'm saying "tread lightly"
because there might be one.

The thing to do is to first validate the way that IA64
handles recursive TLB misses occurring during an initial
TLB miss, and if there are any limitations therein.

That's the kind of thing I'm talking about.

2008-03-21 22:31:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

Christoph Lameter <[email protected]> writes:

> This allows fallback for order 1 stack allocations. In the fallback
> scenario the stacks will be virtually mapped.

The traditional reason this was discouraged (people seem to reinvent
variants of this patch all the time) was that there used
to be drivers that did __pa() (or equivalent) on stack addresses
and that doesn't work with vmalloc pages.

I don't know if such drivers still exist, but such a change
is certainly not a no-brainer.

-Andi
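
The kind of driver pattern Andi describes could look roughly like this (all names here are hypothetical and made up for illustration; this is not code from any particular driver). With linearly mapped stacks, __pa() on the on-stack buffer happens to produce a valid physical address; with a vmalloc'ed stack it silently produces the wrong one:

#include <linux/io.h>
#include <linux/string.h>
#include <linux/types.h>
#include <asm/page.h>

struct fake_ctrl {			/* hypothetical device state */
	void __iomem *regs;
};
#define FAKE_CMD_ADDR_REG	0x10	/* hypothetical register offset */

static int send_inquiry(struct fake_ctrl *ctrl)
{
	u8 cmd[16];			/* command block on the kernel stack */

	memset(cmd, 0, sizeof(cmd));
	cmd[0] = 0x12;			/* e.g. a SCSI INQUIRY opcode */

	/*
	 * Broken with a virtually mapped stack: __pa() assumes a linearly
	 * mapped address, so the device would DMA from the wrong place.
	 * (Low 32 bits only, to keep the example short.)
	 */
	writel((u32)__pa(cmd), ctrl->regs + FAKE_CMD_ADDR_REG);
	return 0;
}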

2008-03-24 18:29:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Fri, 21 Mar 2008, David Miller wrote:

> The thing to do is to first validate the way that IA64
> handles recursive TLB misses occurring during an initial
> TLB miss, and if there are any limitations therein.

I am familiar with that area and I am reasonably sure that this
is an issue on IA64 under some conditions (the processor decides to spill
some registers either onto the stack or into the register backing store
during TLB processing). Recursion (in the kernel context) still expects
the stack and register backing store to be available. Cc'ing linux-ia64 for
any thoughts to the contrary.

The move to 64k page size on IA64 is another way that this issue can be
addressed, though. So I think it's best to drop the IA64 portion.

2008-03-24 19:55:27

by Christoph Lameter

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Fri, 21 Mar 2008, Andi Kleen wrote:

> The traditional reason this was discouraged (people seem to reinvent
> variants of this patch all the time) was that there used
> to be drivers that did __pa() (or equivalent) on stack addresses
> and that doesn't work with vmalloc pages.
>
> I don't know if such drivers still exist, but such a change
> is certainly not a no-brainer

I thought that had been cleaned up because some arches already have
virtually mapped stacks? This could be debugged by testing with
CONFIG_VFALLBACK_ALWAYS set, which results in a stack that is always
vmalloc'ed, so such a driver should fail.

2008-03-24 20:37:31

by David Miller

[permalink] [raw]
Subject: larger default page sizes...

From: Christoph Lameter <[email protected]>
Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)

> The move to 64k page size on IA64 is another way that this issue can
> be addressed though.

This is such a huge mistake I wish platforms such as powerpc and IA64
would not make such decisions so lightly.

The memory wastage is just ridiculous.

I already see several distributions moving to 64K pages for powerpc,
so I want to nip this in the bud before this monkey-see-monkey-do
thing gets any more out of hand.

2008-03-24 21:07:08

by Christoph Lameter

[permalink] [raw]
Subject: Re: larger default page sizes...

On Mon, 24 Mar 2008, David Miller wrote:

> From: Christoph Lameter <[email protected]>
> Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
>
> > The move to 64k page size on IA64 is another way that this issue can
> > be addressed though.
>
> This is such a huge mistake I wish platforms such as powerpc and IA64
> would not make such decisions so lightly.

It's certainly not a light decision if your customer tells you that the box
is almost unusable with a 16k page size. For our new 2k- and 4k-processor
systems this seems to be a requirement. Customers are starting to hack SLES10
to run with 64k pages....

> The memory wastage is just ridiculous.

Well, yes, if you were to use such a box for kernel compiles and small files
then it's a bad move. However, if you have to process terabytes of data
then this significantly reduces the VM and I/O overhead.

> I already see several distributions moving to 64K pages for powerpc,
> so I want to nip this in the bud before this monkey-see-monkey-do
> thing gets any more out of hand.

powerpc also runs HPC codes. They certainly see the same results that we
see.

2008-03-24 21:22:19

by Luck, Tony

[permalink] [raw]
Subject: RE: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

> I am familiar with that area and I am reasonably sure that this
> is an issue on IA64 under some conditions (the processor decides to spill
> some registers either onto the stack or into the register backing store
> during TLB processing). Recursion (in the kernel context) still expects
> the stack and register backing store to be available. Cc'ing linux-ia64 for
> any thoughts to the contrary.

Christoph is correct ... IA64 pins the TLB entry for the kernel stack
(which covers both the normal C stack and the register backing store)
so that it won't have to deal with a TLB miss on the stack while handling
another TLB miss.

-Tony

2008-03-24 21:25:42

by Luck, Tony

[permalink] [raw]
Subject: RE: larger default page sizes...

> The memory wastage is just ridiculous.

In an ideal world we'd have variable sized pages ... but
since most architectures have no h/w support for these
it may be a long time before that comes to Linux.

In a fixed page size world the right page size to use
depends on the workload and the capacity of the system.

When memory capacity is measured in hundreds of GB, then
a larger page size doesn't look so ridiculous.

-Tony

2008-03-24 21:44:11

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: Christoph Lameter <[email protected]>
Date: Mon, 24 Mar 2008 14:05:02 -0700 (PDT)

> On Mon, 24 Mar 2008, David Miller wrote:
>
> > From: Christoph Lameter <[email protected]>
> > Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
> >
> > > The move to 64k page size on IA64 is another way that this issue can
> > > be addressed though.
> >
> > This is such a huge mistake I wish platforms such as powerpc and IA64
> > would not make such decisions so lightly.
>
> It's certainly not a light decision if your customer tells you that the box
> is almost unusable with a 16k page size. For our new 2k- and 4k-processor
> systems this seems to be a requirement. Customers are starting to hack SLES10
> to run with 64k pages....

We should fix the underlying problems.

I'm hitting issues on 128 cpu Niagara2 boxes, and it's all fundamental
stuff like contention on the per-zone page allocator locks.

Which is very fixable, without going to larger pages.

> powerpc also runs HPC codes. They certainly see the same results
> that we see.

There are ways to get large pages into the process address space for
compute bound tasks, without suffering the well known negative side
effects of using larger pages for everything.

2008-03-24 21:46:39

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: "Luck, Tony" <[email protected]>
Date: Mon, 24 Mar 2008 14:25:11 -0700

> When memory capacity is measured in hundreds of GB, then
> a larger page size doesn't look so ridiculous.

We have hugepages and such for a reason. And this can be
made more dynamic and flexible, as needed.

Increasing the page size is a "stick your head in the sand"
type of solution in my book.

Especially when you can make the hugepage facility stronger
and thus get what you want without the memory wastage side
effects.

2008-03-25 03:30:39

by Paul Mackerras

[permalink] [raw]
Subject: Re: larger default page sizes...

David Miller writes:

> From: Christoph Lameter <[email protected]>
> Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
>
> > The move to 64k page size on IA64 is another way that this issue can
> > be addressed though.
>
> This is such a huge mistake I wish platforms such as powerpc and IA64
> would not make such decisions so lightly.

The performance advantage of using hardware 64k pages is pretty
compelling, on a wide range of programs, and particularly on HPC apps.

> The memory wastage is just ridiculous.

Depends on the distribution of file sizes you have.

> I already see several distributions moving to 64K pages for powerpc,
> so I want to nip this in the bud before this monkey-see-monkey-do
> thing gets any more out of hand.

I just tried a kernel compile on a 4.2GHz POWER6 partition with 4
threads (2 cores) and 2GB of RAM, with two kernels. One was
configured with 4kB pages and the other with 64kB pages, but they
were otherwise identically configured. Here are the times for the
same kernel compile (total time across all threads, for a fairly
full-featured config):

4kB pages: 444.051s user + 34.406s system time
64kB pages: 419.963s user + 16.869s system time

That's nearly 10% faster with 64kB pages -- on a kernel compile.

Yes, the fragmentation in the page cache can be a pain in some
circumstances, but on the whole I think the performance advantage is
worth that pain, particularly for the sort of applications that people
will tend to be running on RHEL on Power boxes.

Regards,
Paul.

2008-03-25 04:15:43

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: Paul Mackerras <[email protected]>
Date: Tue, 25 Mar 2008 14:29:55 +1100

> The performance advantage of using hardware 64k pages is pretty
> compelling, on a wide range of programs, and particularly on HPC apps.

Please read the rest of my responses in this thread, you
can have your HPC cake and eat it too.

2008-03-25 07:48:14

by Andi Kleen

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Mon, Mar 24, 2008 at 12:53:19PM -0700, Christoph Lameter wrote:
> On Fri, 21 Mar 2008, Andi Kleen wrote:
>
> > The traditional reason this was discouraged (people seem to reinvent
> > variants of this patch all the time) was that there used
> > to be drivers that did __pa() (or equivalent) on stack addresses
> > and that doesn't work with vmalloc pages.
> >
> > I don't know if such drivers still exist, but such a change
> > is certainly not a no-brainer
>
> I thought that had been cleaned up because some arches already have

Someone posted a patch recently that showed that the cdrom layer
does it. Might be more. It is hard to audit a few million lines
of driver code.

> virtually mapped stacks? This could be debugged by testing with
> CONFIG_VFALLBACK_ALWAYS set. Which results in a stack that is always
> vmalloc'ed and thus the driver should fail.

It might be a subtle failure.

Maybe sparse could be taught to check for this if it happens
in a single function? (cc'ing Al who might have some thoughts
on this). Of course if it happens spread out over multiple
functions sparse wouldn't help either.

-Andi

2008-03-25 12:06:10

by Andi Kleen

[permalink] [raw]
Subject: Re: larger default page sizes...

Paul Mackerras <[email protected]> writes:
>
> 4kB pages: 444.051s user + 34.406s system time
> 64kB pages: 419.963s user + 16.869s system time
>
> That's nearly 10% faster with 64kB pages -- on a kernel compile.

Do you have some idea where the improvement mainly comes from?
Is it TLB misses or reduced kernel overhead? OK, I assume both
play together but which part of the equation is more important?

-Andi

2008-03-25 17:44:23

by Christoph Lameter

[permalink] [raw]
Subject: RE: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Mon, 24 Mar 2008, Luck, Tony wrote:

> > I am familiar with that area and I am reasonably sure that this
> > is an issue on IA64 under some conditions (the processor decides to spill
> > some registers either onto the stack or into the register backing store
> > during TLB processing). Recursion (in the kernel context) still expects
> > the stack and register backing store to be available. Cc'ing linux-ia64 for
> > any thoughts to the contrary.
>
> Christoph is correct ... IA64 pins the TLB entry for the kernel stack
> (which covers both the normal C stack and the register backing store)
> so that it won't have to deal with a TLB miss on the stack while handling
> another TLB miss.

I thought the only pinned TLB entry was for the per-cpu area? How does it
pin the TLB? Is the expectation that a single TLB entry covers the complete
stack area? Is that a feature of fault handling?

2008-03-25 17:50:30

by Christoph Lameter

[permalink] [raw]
Subject: Re: larger default page sizes...

On Mon, 24 Mar 2008, David Miller wrote:

> We should fix the underlying problems.
>
> I'm hitting issues on 128 cpu Niagara2 boxes, and it's all fundamental
> stuff like contention on the per-zone page allocator locks.
>
> Which is very fixable, without going to larger pages.

No, it's not fixable. You are applying linear optimizations to a slowdown that
grows exponentially. Going just one order up in page size reduces the
necessary locking and handling in the kernel by 50%.

> > powerpc also runs HPC codes. They certainly see the same results
> > that we see.
>
> There are ways to get large pages into the process address space for
> compute bound tasks, without suffering the well known negative side
> effects of using larger pages for everything.

These hacks have limitations. F.e. they do not deal with I/O and
require application changes.

2008-03-25 17:57:13

by Christoph Lameter

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Tue, 25 Mar 2008, Andi Kleen wrote:

> Maybe sparse could be taught to check for this if it happens
> in a single function? (cc'ing Al who might have some thoughts
> on this). Of course if it happens spread out over multiple
> functions sparse wouldn't help either.

We could add debugging code to virt_to_page (or __pa) to catch these uses.
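
One possible shape for such a check, sketched here purely for illustration (this is not an existing kernel macro or config option): warn the first time __pa() is handed an address in the vmalloc range.

#include <linux/kernel.h>
#include <linux/mm.h>

/* Illustrative debug wrapper; a real version would live behind a CONFIG option. */
#define __pa_checked(addr) ({						\
	void *__p = (void *)(addr);					\
	WARN_ON_ONCE(is_vmalloc_addr(__p));	/* __pa() on vmalloc memory */	\
	__pa(__p);							\
})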

2008-03-25 18:04:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Tue, Mar 25, 2008 at 10:55:06AM -0700, Christoph Lameter wrote:
> On Tue, 25 Mar 2008, Andi Kleen wrote:
>
> > Maybe sparse could be taught to check for this if it happens
> > in a single function? (cc'ing Al who might have some thoughts
> > on this). Of course if it happens spread out over multiple
> > functions sparse wouldn't help either.
>
> We could add debugging code to virt_to_page (or __pa) to catch these uses.

Hard to test all cases. Static checking would be better.

Or just not do it? I didn't think order 1 failures were that big a problem.


-Andi

2008-03-25 18:27:57

by Dave Hansen

[permalink] [raw]
Subject: Re: larger default page sizes...

On Tue, 2008-03-25 at 14:29 +1100, Paul Mackerras wrote:
> 4kB pages: 444.051s user + 34.406s system time
> 64kB pages: 419.963s user + 16.869s system time
>
> That's nearly 10% faster with 64kB pages -- on a kernel compile.

Can you do the same thing with the 4k MMU pages and 64k PAGE_SIZE?
Wouldn't that easily break out whether the advantage is from the TLB or
from less kernel overhead?

-- Dave

2008-03-25 19:15:37

by Luck, Tony

[permalink] [raw]
Subject: RE: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

> I thought the only pinned TLB entry was for the per cpu area? How does it
> pin the TLB? The expectation is that a single TLB covers the complete
> stack area? Is that a feature of fault handling?

Pinning TLB entries on ia64 is done using TR registers with the "itr"
instruction. Currently we have the following pinned mappings:

itr[0] : maps kernel code. 64MB page at virtual 0xA000000100000000
dtr[0] : maps kernel data. 64MB page at virtual 0xA000000100000000

itr[1] : maps PAL code as required by architecture

dtr[1] : maps an area of region 7 that spans the kernel stack.
         Page size is the kernel granule size (default 16M).
         This mapping needs to be reset on a context switch
         where we move to a stack in a different granule.

We used to use dtr[2] to map the 64K per-cpu area at 0xFFFFFFFFFFFF0000
but Ken Chen found that performance was better to use a dynamically
inserted DTC entry from the Alt-TLB miss handler which allows this
entry in the TLB to be available for generic use (on most processor
models).

-Tony
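
To make the last point concrete, here is a conceptual sketch of the context-switch check Tony describes (the names and the constant below are illustrative, not the actual arch/ia64 code): the pinned dtr mapping only has to be replaced when the incoming task's stack lives in a different granule than the outgoing one.

#define KERNEL_GRANULE_SHIFT	24	/* 16MB default granule; illustrative value */

static inline int stack_needs_repin(unsigned long prev_stack,
				    unsigned long next_stack)
{
	/* Same granule => the pinned dtr entry already covers the new stack. */
	return (prev_stack >> KERNEL_GRANULE_SHIFT) !=
	       (next_stack >> KERNEL_GRANULE_SHIFT);
}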

2008-03-25 19:28:16

by Christoph Lameter

[permalink] [raw]
Subject: RE: [11/14] vcompound: Fallbacks for order 1 stack allocations on IA64 and x86

On Tue, 25 Mar 2008, Luck, Tony wrote:

> dtr[1] : maps an area of region 7 that spans kernel stack
> page size is kernel granule size (default 16M).
> This mapping needs to be reset on a context switch
> where we move to a stack in a different granule.

Interesting.... Never realized we were doing these tricks with DTR.

2008-03-25 21:13:59

by Paul Mackerras

[permalink] [raw]
Subject: Re: larger default page sizes...

David Miller writes:

> From: Paul Mackerras <[email protected]>
> Date: Tue, 25 Mar 2008 14:29:55 +1100
>
> > The performance advantage of using hardware 64k pages is pretty
> > compelling, on a wide range of programs, and particularly on HPC apps.
>
> Please read the rest of my responses in this thread, you
> can have your HPC cake and eat it too.

It's not just HPC, as I pointed out, it's pretty much everything,
including kernel compiles. And "use hugepages" is a pretty inadequate
answer given the restrictions of hugepages and the difficulty of using
them. How do I get gcc to use hugepages, for instance? Using 64k
pages gives us a performance boost for almost everything without the
user having to do anything.

If the hugepage stuff was in a state where it enabled large pages to
be used for mapping an existing program, where possible, without any
changes to the executable, then I would agree with you. But it isn't,
it's a long way from that, and (as I understand it) Linus has in the
past opposed the suggestion that we should move in that direction.

Paul.

2008-03-25 21:35:53

by Paul Mackerras

[permalink] [raw]
Subject: Re: larger default page sizes...

Andi Kleen writes:

> Paul Mackerras <[email protected]> writes:
> >
> > 4kB pages: 444.051s user + 34.406s system time
> > 64kB pages: 419.963s user + 16.869s system time
> >
> > That's nearly 10% faster with 64kB pages -- on a kernel compile.
>
> Do you have some idea where the improvement mainly comes from?
> Is it TLB misses or reduced kernel overhead? OK, I assume both
> play together but which part of the equation is more important?

I think that to a first approximation, the improvement in user time
(24 seconds) is due to the increased TLB reach and reduced TLB misses,
and the improvement in system time (18 seconds) is due to the reduced
number of page faults and reductions in other kernel overheads.

As Dave Hansen points out, I can separate the two effects by having
the kernel use 64k pages at the VM level but 4k pages in the hardware
page table, which is easy since we have support for 64k base page size
on machines that don't have hardware 64k page support. I'll do that
today.

Paul.

2008-03-25 23:22:54

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: Christoph Lameter <[email protected]>
Date: Tue, 25 Mar 2008 10:48:19 -0700 (PDT)

> On Mon, 24 Mar 2008, David Miller wrote:
>
> > There are ways to get large pages into the process address space for
> > compute bound tasks, without suffering the well known negative side
> > effects of using larger pages for everything.
>
> These hacks have limitations. F.e. they do not deal with I/O and
> require application changes.

Transparent automatic hugepages are definitely doable, I don't know
why you think this requires application changes.

People want these larger pages for HPC apps.

2008-03-25 23:32:51

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: Paul Mackerras <[email protected]>
Date: Tue, 25 Mar 2008 22:50:00 +1100

> How do I get gcc to use hugepages, for instance?

Implementing transparent automatic usage of hugepages has been
discussed many times, it's definitely doable and other OSs have
implemented this for years.

This is what I was implying.

2008-03-25 23:47:31

by J.C. Pizarro

[permalink] [raw]
Subject: Re: larger default page sizes...

On Tue, 25 Mar 2008 16:22:44 -0700 (PDT), David Miller wrote:
> > On Mon, 24 Mar 2008, David Miller wrote:
> >
> > > There are ways to get large pages into the process address space for
> > > compute bound tasks, without suffering the well known negative side
> > > effects of using larger pages for everything.
> >
> > These hacks have limitations. F.e. they do not deal with I/O and
> > require application changes.
>
> Transparent automatic hugepages are definitely doable, I don't know
> why you think this requires application changes.
>
> People want these larger pages for HPC apps.

But there is a general problem with larger pages on systems that
don't support them natively (in hardware), depending on how the
kernel's memory manager implements them:

"Doubling the soft page size implies
halving the TLB soft-entries on the old hardware."

"4x soft page size => 1/4 TLB soft-entries, ... and so on."

Assuming one soft double-sized page is backed by 2 real-sized pages,
replacing one soft double-sized page implies replacing the
2 TLB entries containing the 2 real-sized pages.

The TLB is very small; it has only around 24 entries in
some processors!

Assuming a soft 64 KiB page built from real 4 KiB pages => 1/16 TLB soft-entries.
If the TLB has 24 entries, then 24/16 = 1.5 soft-entries,
so the TLB will hold only 1 soft-entry for soft 64 KiB pages! Weird!

The usual soft sizes are 8 KiB or 16 KiB for non-native processors, not more.
So a TLB of 24 entries of real 4 KiB pages will hold 12 or 6
soft-entries respectively.

2008-03-25 23:49:44

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: Peter Chubb <[email protected]>
Date: Wed, 26 Mar 2008 10:41:32 +1100

> It's actually harder than it looks. Ian Wienand just finished his
> Master's project in this area, so we have *lots* of data. The main
> issue is that, at least on Itanium, you have to turn off the hardware
> page table walker for hugepages if you want to mix superpages and
> standard pages in the same region. (The long format VHPT isn't the
> panacea we'd like it to be because the hash function it uses depends
> on the page size). This means that although you have fewer TLB misses
> with larger pages, the cost of those TLB misses is three to four times
> higher than with the standard pages.

If the hugepage is more than 3 to 4 times larger than the base
page size, which it almost certainly is, it's still an enormous
win.

> Other architectures (where the page size isn't tied into the hash
> function, so the hardware walker can be used for superpages) will have
> different tradeoffs.

Right, admittedly this is just a (one of many) strange IA64 quirk.

2008-03-25 23:56:14

by Luck, Tony

[permalink] [raw]
Subject: RE: larger default page sizes...

> > How do I get gcc to use hugepages, for instance?
>
> Implementing transparent automatic usage of hugepages has been
> discussed many times, it's definitely doable and other OSs have
> implemented this for years.
>
> This is what I was implying.

"large" pages, or "super" pages perhaps ... but Linux "huge" pages
seem pretty hard to adapt for generic use by applications. They
are generally somewhere between a bit too big (2MB on x86) and
way too big (64MB, 256MB, 1GB or 4GB on ia64) for general use.

Right now they also suffer from making the sysadmin pick at
boot time how much memory to allocate as huge pages (while it
is possible to break huge pages into normal pages, going in
the reverse direction requires a memory defragmenter that
doesn't exist).

Making an application use huge pages as heap may be simple
(just link with a different library to provide a different
version of malloc()) ... code, stack, mmap'd files are all
a lot harder to do transparently.

-Tony

2008-03-26 00:17:11

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: "Luck, Tony" <[email protected]>
Date: Tue, 25 Mar 2008 16:49:23 -0700

> Making an application use huge pages as heap may be simple
> (just link with a different library to provide a different
> version of malloc()) ... code, stack, mmap'd files are all
> a lot harder to do transparently.

The kernel should be able to do this transparently, at the
very least for the anonymous page case. It should also
be able to handle just fine chips that provide multiple
page size support, as many do.

2008-03-26 00:26:34

by Peter Chubb

[permalink] [raw]
Subject: Re: larger default page sizes...

>>>>> "David" == David Miller <[email protected]> writes:

David> From: Peter Chubb <[email protected]> Date: Wed, 26 Mar
David> 2008 10:41:32 +1100

>> It's actually harder than it looks. Ian Wienand just finished his
>> Master's project in this area, so we have *lots* of data. The main
>> issue is that, at least on Itanium, you have to turn off the
>> hardware page table walker for hugepages if you want to mix
>> superpages and standard pages in the same region. (The long format
>> VHPT isn't the panacea we'd like it to be because the hash function
>> it uses depends on the page size). This means that although you
>> have fewer TLB misses with larger pages, the cost of those TLB
>> misses is three to four times higher than with the standard pages.

David> If the hugepage is more than 3 to 4 times larger than the base
David> page size, which it almost certainly is, it's still an enormous
David> win.

That depends on the access pattern. We measured a small win for some
workloads, and a small loss for others, using 4k base pages, and
allowing up to 4G superpages (the actual sizes used depended on the
size of the objects being allocated, and the amount of contiguous
memory available).

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia

2008-03-26 00:30:23

by Peter Chubb

[permalink] [raw]
Subject: Re: larger default page sizes...

>>>>> "David" == David Miller <[email protected]> writes:

David> From: Christoph Lameter <[email protected]> Date: Tue, 25 Mar
David> 2008 10:48:19 -0700 (PDT)

>> On Mon, 24 Mar 2008, David Miller wrote:
>>
>> > There are ways to get large pages into the process address space
>> > for compute bound tasks, without suffering the well known
>> > negative side effects of using larger pages for everything.
>>
>> These hacks have limitations. F.e. they do not deal with I/O and
>> require application changes.

David> Transparent automatic hugepages are definitely doable, I don't
David> know why you think this requires application changes.

It's actually harder than it looks. Ian Wienand just finished his
Master's project in this area, so we have *lots* of data. The main
issue is that, at least on Itanium, you have to turn off the hardware
page table walker for hugepages if you want to mix superpages and
standard pages in the same region. (The long format VHPT isn't the
panacea we'd like it to be because the hash function it uses depends
on the page size). This means that although you have fewer TLB misses
with larger pages, the cost of those TLB misses is three to four times
higher than with the standard pages. In addition, setting up a large
page takes more effort... and it turns out there are few applications
where the cost is amortised enough, so on SpecCPU, for example, some
tests improved performance slightly and some got slightly worse.

What we saw was essentially that we could almost eliminate DTLB misses,
other than the first, for a huge page. For most applications, though,
the extra cost of that first miss, plus the cost of setting up the
huge page, was greater than the few hundred DTLB misses we avoided.

I'm expecting Ian to publish the full results soon.

Other architectures (where the page size isn't tied into the hash
function, so the hardware walker can be used for superpages) will have
different tradeoffs.

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia

2008-03-26 00:31:59

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: Peter Chubb <[email protected]>
Date: Wed, 26 Mar 2008 11:25:58 +1100

> That depends on the access pattern.

Absolutely.

FWIW, I bet it helps enormously for gcc which, even for
small compiles, swims around chaotically in an 8MB pool
of GC'd memory.

2008-03-26 00:34:38

by David Mosberger-Tang

[permalink] [raw]
Subject: Re: larger default page sizes...

On Tue, Mar 25, 2008 at 5:41 PM, Peter Chubb <[email protected]> wrote:
> The main issue is that, at least on Itanium, you have to turn off the hardware
> page table walker for hugepages if you want to mix superpages and
> standard pages in the same region. (The long format VHPT isn't the
> panacea we'd like it to be because the hash function it uses depends
> on the page size).

Why not just repeat the PTEs for super-pages? That won't work for
huge pages, but for superpages that are a reasonable multiple (e.g.,
16 times) of the base-page size, it should work nicely.

--david
--
Mosberger Consulting LLC, http://www.mosberger-consulting.com/
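
A conceptual sketch of the repeated-PTE scheme being discussed (all names and bit positions below are made up for illustration): every base-page PTE inside the superpage maps its own 4k piece but carries the same size hint, so whichever leaf entry a miss lands on is enough to install one large TLB entry.

#define BASE_PAGE_SHIFT		12		/* 4 KiB base pages */
#define SUPER_PAGE_SHIFT	16		/* 64 KiB superpage */
#define PTES_PER_SUPERPAGE	(1UL << (SUPER_PAGE_SHIFT - BASE_PAGE_SHIFT))
#define PTE_SZ_64K		(1UL << 55)	/* hypothetical size-hint bit */

static void fill_superpage_ptes(unsigned long *pte, unsigned long paddr,
				unsigned long prot)
{
	unsigned long i;

	/* 16 entries, one per 4 KiB piece, all tagged with the 64 KiB size hint. */
	for (i = 0; i < PTES_PER_SUPERPAGE; i++)
		pte[i] = (paddr + (i << BASE_PAGE_SHIFT)) | prot | PTE_SZ_64K;
}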

2008-03-26 00:39:41

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: "David Mosberger-Tang" <[email protected]>
Date: Tue, 25 Mar 2008 18:34:13 -0600

> Why not just repeat the PTEs for super-pages?

This is basically how we implement hugepages in the page
tables on sparc64.

2008-03-26 00:58:23

by Peter Chubb

[permalink] [raw]
Subject: Re: larger default page sizes...

>>>>> "David" == David Mosberger-Tang <[email protected]> writes:

David> On Tue, Mar 25, 2008 at 5:41 PM, Peter Chubb
David> <[email protected]> wrote:
>> The main issue is that, at least on Itanium, you have to turn off
>> the hardware page table walker for hugepages if you want to mix
>> superpages and standard pages in the same region. (The long format
>> VHPT isn't the panacea we'd like it to be because the hash function
>> it uses depends on the page size).

David> Why not just repeat the PTEs for super-pages? That won't work
David> for huge pages, but for superpages that are a reasonable
David> multiple (e.g., 16-times) the base-page size, it should work
David> nicely.

You end up having to repeat PTEs to fit into Linux's page table
structure *anyway* (unless we can change Linux's page table). But
there's no place in the short format hardware-walked page table (that
reuses the leaf entries in Linux's table) for a page size. And if you
use some of the holes in the format, the hardware walker doesn't
understand it --- so you have to turn off the hardware walker for
*any* regions where there might be a superpage.

If you use the long format VHPT, you have a choice: load the
hash table with just the translation that caused the miss, load all
possible hash entries that could have caused the miss for the page, or
preload the hash table when the page is instantiated, with all
possible entries that could hash to the huge page. I don't remember
the details, but I seem to remember all these being bad choices for
one reason or other ... Ian, can you elaborate?

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia

2008-03-26 04:16:44

by John Marvin

[permalink] [raw]
Subject: Re: larger default page sizes...

Peter Chubb wrote:

>
> You end up having to repeat PTEs to fit into Linux's page table
> structure *anyway* (unless we can change Linux's page table). But
> there's no place in the short format hardware-walked page table (that
> reuses the leaf entries in Linux's table) for a page size. And if you
> use some of the holes in the format, the hardware walker doesn't
> understand it --- so you have to turn off the hardware walker for
> *any* regions where there might be a superpage.

No, you can set an illegal memory attribute in the pte for any superpage entry,
and leave the hardware walker enabled for the base page size. The software TLB
miss handler can then install the superpage TLB entry. I posted a working
prototype of Shimizu superpages on ia64 using short format VHPTs to the
linux-kernel list a while back.

>
> If you use the long format VHPT, you have a choice: load the
> hash table with just the translation that caused the miss, load all
> possible hash entries that could have caused the miss for the page, or
> preload the hash table when the page is instantiated, with all
> possible entries that could hash to the huge page. I don't remember
> the details, but I seem to remember all these being bad choices for
> one reason or other ... Ian, can you elaborate?

When I was doing measurements of long format vs. short format, the two main
problems with long format (and why I eventually chose to stick with short
format) were:

1) There was no easy way of determining what size the long format vhpt cache
should be automatically, and changing it dynamically would be too painful.
Different workloads performed better with different size vhpt caches.

2) Regardless of the size, the vhpt cache is duplicated information. Using long
format vhpt's significantly increased the number of cache misses for some
workloads. Theoretically there should have been some cases where the long format
solution would have performed better than the short format solution, but I was
never able to create such a case. In many cases the performance difference
between the long format solution and the short format solution was essentially
the same. In other cases the short format vhpt solution outperformed the long
format solution, and in those cases there was a significant difference in cache
misses that I believe explained the performance difference.

John

2008-03-26 04:36:35

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: John Marvin <[email protected]>
Date: Tue, 25 Mar 2008 22:16:00 -0600

> 1) There was no easy way of determining what size the long format vhpt cache
> should be automatically, and changing it dynamically would be too painful.
> Different workloads performed better with different size vhpt caches.

This is exactly what sparc64 does, btw: dynamic TLB miss hash table
sizing based upon task RSS.

2008-03-26 05:25:32

by Paul Mackerras

[permalink] [raw]
Subject: Re: larger default page sizes...

Andi Kleen writes:

> Paul Mackerras <[email protected]> writes:
> >
> > 4kB pages: 444.051s user + 34.406s system time
> > 64kB pages: 419.963s user + 16.869s system time
> >
> > That's nearly 10% faster with 64kB pages -- on a kernel compile.
>
> Do you have some idea where the improvement mainly comes from?
> > Is it TLB misses or reduced kernel overhead? OK, I assume both
> play together but which part of the equation is more important?

With the kernel configured for a 64k page size, but using 4k pages in
the hardware page table, I get:

64k/4k: 441.723s user + 27.258s system time

So the improvement in the user time is almost all due to the reduced
TLB misses (as one would expect). For the system time, using 64k
pages in the VM reduces it by about 21%, and using 64k hardware pages
reduces it by another 30%. So the reduction in kernel overhead is
significant but not as large as the impact of reducing TLB misses.

Paul.

2008-03-26 15:55:15

by Nish Aravamudan

[permalink] [raw]
Subject: Re: larger default page sizes...

On 3/25/08, Luck, Tony <[email protected]> wrote:
> > > How do I get gcc to use hugepages, for instance?
> >
> > Implementing transparent automatic usage of hugepages has been
> > discussed many times, it's definitely doable and other OSs have
> > implemented this for years.
> >
> > This is what I was implying.
>
>
> "large" pages, or "super" pages perhaps ... but Linux "huge" pages
> seem pretty hard to adapt for generic use by applications. They
> are generally somewhere between a bit too big (2MB on x86) and
> way too big (64MB, 256MB, 1GB or 4GB on ia64) for general use.
>
> Right now they also suffer from making the sysadmin pick at
> boot time how much memory to allocate as huge pages (while it
> is possible to break huge pages into normal pages, going in
> the reverse direction requires a memory defragmenter that
> doesn't exist).

That's not entirely true. We have a dynamic pool now, thanks to Adam
Litke [added to Cc], which can be treated as a high watermark for the
hugetlb pool (and the static pool value serves as a low watermark).
Unless by hugepages you mean something other than what I think (but
referring to a 2M size on x86 implies you are not). And with the
antifragmentation improvements, hugepage pool changes at run-time are
more likely to succeed [added Mel to Cc].

> Making an application use huge pages as heap may be simple
> (just link with a different library to provide with a different
> version of malloc()) ... code, stack, mmap'd files are all
> a lot harder to do transparently.

I feel like I should promote libhugetlbfs here. We're trying to make
things easier for applications to use. You can back the heap by
hugepages via LD_PRELOAD. But even that isn't always simple (what
happens when something is already allocated on the heap? We've
seen that happen even in our constructor in the library, for instance).
We're working on hugepage stack support. Text/BSS/Data segment
remapping exists now, too, but does require relinking to be more
successful. We have a mode that allows libhugetlbfs to try to fit the
segments into hugepages, or even just those parts that might fit --
but we have limitations on power and IA64, for instance, where
hugepages are restricted in their placement (either depending on the
process' existing mappings or generally). libhugetlbfs has, at least,
been tested a bit on IA64 to validate the heap backing (IIRC) and the
various kernel tests. We also have basic sparc support -- however, I
don't have any boxes handy to test on (working on getting them added
to our testing grid and then will revisit them), and then one box I
used before gave me semi-spurious soft-lockups (old bug, unclear if it
is software or just buggy hardware).

In any case, my point is people are trying to work on this from
various angles. Both making hugepages more available at run-time (in a
dynamic fashion, based upon need) and making them easier to use for
applications. Is it easy? Not necessarily. Is it guaranteed to work? I
like to think we make a best effort. But as others have pointed out,
it doesn't seem like we're going to get mainline transparent hugepage
support anytime soon.

Thanks,
Nish

2008-03-26 15:58:13

by H. Peter Anvin

[permalink] [raw]
Subject: Re: larger default page sizes...

J.C. Pizarro wrote:
>
> But there is a general problem with larger pages on systems that
> don't support them natively (in hardware), depending on how the
> kernel's memory manager implements them:
>
> "Doubling the soft page size implies
> halving the TLB soft-entries on the old hardware."
>
> "4x soft page size => 1/4 TLB soft-entries, ... and so on."
>
> Assuming one soft double-sized page is backed by 2 real-sized pages,
> replacing one soft double-sized page implies replacing the
> 2 TLB entries containing the 2 real-sized pages.
>
> The TLB is very small; it has only around 24 entries in
> some processors!
>

That's not a problem, actually, since the TLB entries can get shuffled
like any other (for software TLBs it's a little different, but it can be
dealt with there too.)

The *real* problem is ABI breakage.

-hpa

2008-03-26 15:59:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: larger default page sizes...



On Wed, 26 Mar 2008, Paul Mackerras wrote:
>
> So the improvement in the user time is almost all due to the reduced
> TLB misses (as one would expect). For the system time, using 64k
> pages in the VM reduces it by about 21%, and using 64k hardware pages
> reduces it by another 30%. So the reduction in kernel overhead is
> significant but not as large as the impact of reducing TLB misses.

I realize that getting the POWER people to accept that they have been
total morons when it comes to VM for the last three decades is hard, but
somebody in the POWER hardware design camp should (a) be told and (b) be
really ashamed of themselves.

Is this a POWER6 or what? Because 21% overhead from TLB handling on
something like gcc shows that some piece of hardware is absolute crap.

May I suggest people inside IBM try to fix this some day, and in the
meantime people outside should probably continue to buy Intel/AMD CPU's
until the others can get their act together.

Linus

2008-03-26 17:06:18

by Luck, Tony

[permalink] [raw]
Subject: RE: larger default page sizes...

> That's not entirely true. We have a dynamic pool now, thanks to Adam
> Litke [added to Cc], which can be treated as a high watermark for the
> hugetlb pool (and the static pool value serves as a low watermark).
> Unless by hugepages you mean something other than what I think (but
> referring to a 2M size on x86 implies you are not). And with the
> antifragmentation improvements, hugepage pool changes at run-time are
> more likely to succeed [added Mel to Cc].

Things are better than I thought ... though the phrase "more likely
to succeed" doesn't fill me with confidence. Instead I imagine a
system where an occasional spike in memory load causes some memory
fragmentation that can't be handled, and so from that point many of
the applications that relied on huge pages take a 10% performance
hit. This results in sysadmins scheduling regular reboots to unjam
things. [Reminds me of the instructions that came with my first
flatbed scanner that recommended rebooting the system before and
after each use :-( ]

> I feel like I should promote libhugetlbfs here.

This is also better than I thought ... sounds like some really
good things have already happened here.

-Tony

2008-03-26 17:58:32

by Christoph Lameter

[permalink] [raw]
Subject: Re: larger default page sizes...

On Wed, 26 Mar 2008, Paul Mackerras wrote:

> So the improvement in the user time is almost all due to the reduced
> TLB misses (as one would expect). For the system time, using 64k
> pages in the VM reduces it by about 21%, and using 64k hardware pages
> reduces it by another 30%. So the reduction in kernel overhead is
> significant but not as large as the impact of reducing TLB misses.

One should emphasize that this test was a kernel compile which is not
a load that gains much from larger pages. 4k pages are mostly okay for
loads that use large amounts of small files.

2008-03-26 18:54:46

by Mel Gorman

[permalink] [raw]
Subject: Re: larger default page sizes...

On (26/03/08 10:05), Luck, Tony didst pronounce:
> > That's not entirely true. We have a dynamic pool now, thanks to Adam
> > Litke [added to Cc], which can be treated as a high watermark for the
> > hugetlb pool (and the static pool value serves as a low watermark).
> > Unless by hugepages you mean something other than what I think (but
> > referring to a 2M size on x86 implies you are not). And with the
> > antifragmentation improvements, hugepage pool changes at run-time are
> > more likely to succeed [added Mel to Cc].
>
> Things are better than I thought ... though the phrase "more likely
> to succeed" doesn't fill me with confidence.

It's a lot more likely to succeed since 2.6.24 than it was in the past. On
workloads where it is mainly user data that is occupying memory, the chances
are even better. If min_free_kbytes is hugepage_size*num_online_nodes(),
it becomes harder again to fragment memory.

> Instead I imagine a
> system where an occasional spike in memory load causes some memory
> fragmentation that can't be handled, and so from that point many of
> the applications that relied on huge pages take a 10% performance
> hit.

If it was found to be a problem and normal anti-frag is not coping with hugepage
pool resizes, then specify movablecore=MAX_POSSIBLE_POOL_SIZE_YOU_WOULD_NEED
on the command-line and the hugepage pool will be able to expand to that
size independent of the workload. This would avoid the need to schedule regular
reboots.

> This results in sysadmins scheduling regular reboots to unjam
> things. [Reminds me of the instructions that came with my first
> flatbed scanner that recommended rebooting the system before and
> after each use :-( ]
>
> > I feel like I should promote libhugetlbfs here.
>
> This is also better than I thought ... sounds like some really
> good things have already happened here.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-26 23:21:53

by David Miller

[permalink] [raw]
Subject: Re: larger default page sizes...

From: Christoph Lameter <[email protected]>
Date: Wed, 26 Mar 2008 10:56:17 -0700 (PDT)

> One should emphasize that this test was a kernel compile which is not
> a load that gains much from larger pages.

Actually, ever since gcc went to a garbage collecting allocator, I've
found it to be a TLB thrasher.

It will repeatedly and randomly walk over a GC pool of at least 8MB in
size, which to fit fully in the TLB with 4K pages requires a TLB with
2048 entries, assuming gcc touches no other data, which is of course a
false assumption.

For some compiles this GC pool is more than 100MB in size.

GCC does not fit into any modern TLB using its base page size.

2008-03-27 01:09:01

by Paul Mackerras

[permalink] [raw]
Subject: Re: larger default page sizes...

Linus Torvalds writes:

> On Wed, 26 Mar 2008, Paul Mackerras wrote:
> >
> > So the improvement in the user time is almost all due to the reduced
> > TLB misses (as one would expect). For the system time, using 64k
> > pages in the VM reduces it by about 21%, and using 64k hardware pages
> > reduces it by another 30%. So the reduction in kernel overhead is
> > significant but not as large as the impact of reducing TLB misses.
>
> I realize that getting the POWER people to accept that they have been
> total morons when it comes to VM for the last three decades is hard, but
> somebody in the POWER hardware design camp should (a) be told and (b) be
> really ashamed of themselves.
>
> Is this a POWER6 or what? Becasue 21% overhead from TLB handling on
> something like gcc shows that some piece of hardware is absolute crap.

You have misunderstood the 21% number. That number has *nothing* to
do with hardware TLB miss handling, and everything to do with how long
the generic Linux virtual memory code spends doing its thing (page
faults, setting up and tearing down Linux page tables, etc.). It
doesn't even have anything to do with the hash table (hardware page
table), because both cases are using 4k hardware pages. Thus in both
cases the TLB misses and hash-table misses would have been the same.

The *only* difference between the cases is the page size that the
generic Linux virtual memory code is using. With the 64k page size
our architecture-independent kernel code runs 21% faster.

Thus the 21% is not about the TLB or any hardware thing at all, it's
about the larger per-byte overhead of our kernel code when using the
smaller page size.

The thing you were ranting about -- hardware TLB handling overhead --
comes in at 5%, comparing 4k hardware pages to 64k hardware pages (444
seconds vs. 420 seconds user time for the kernel compile). And yes,
it's a POWER6.

Paul.

2008-03-27 03:00:39

by Paul Mackerras

[permalink] [raw]
Subject: Re: larger default page sizes...

Christoph Lameter writes:

> One should emphasize that this test was a kernel compile which is not
> a load that gains much from larger pages. 4k pages are mostly okay for
> loads that use large amounts of small files.

It's also worth emphasizing that 1.5% of the total time, or 21% of the
system time, is pure software overhead in the Linux kernel that has
nothing to do with the TLB or with gcc's memory access patterns.

That's the cost of handling memory in small (i.e. 4kB) chunks inside
the generic Linux VM code, rather than bigger chunks.

Paul.