2007-11-15 21:57:28

by Jeremy Fitzhardinge

Subject: Why preallocate pmd in x86 32-bit PAE?

I'm looking at unifying asm-x86/pgalloc*.h, and so I'm trying to make
things as similar as possible between 32 and 64-bit.

One difference is that 64-bit incrementally allocates all levels of the
pagetable, whereas 32-bit PAE preallocates the 4 pmds when it allocates
the pgd. What's the rationale for this? What pitfalls would there be
in making them incrementally allocated?

Preallocation makes sense from the perspective that they will all be
allocated almost immediately in a typical process. But it is a somewhat
arbitrary difference from 64-bit, and since 64-bit can't reasonably
preallocate any pagetable levels, it seems sensible to change 32-bit to
match.

Thanks,
J


2007-11-15 22:13:45

by Linus Torvalds

Subject: Re: Why preallocate pmd in x86 32-bit PAE?



On Thu, 15 Nov 2007, Jeremy Fitzhardinge wrote:
>
> One difference is that 64-bit incrementally allocates all levels of the
> pagetable, whereas 32-bit PAE preallocates the 4 pmds when it allocates
> the pgd. What's the rationale for this? What pitfalls would there be
> in making them incrementally allocated?

IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
have to be present.

What earlier CPU's did was to basically load all four values into the CPU
when you loaded %cr3. There was no "three-level page table walker" at all:
it was still a two-level page table walker, there were just four magic
internal page tables that were indexed off the two high bits.

Linus
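
To make the "indexed off the two high bits" point concrete, here is a
minimal sketch of how a 32-bit PAE linear address breaks down with 4 KiB
pages. The struct and function names are made up for illustration and are
not kernel identifiers; only the bit positions come from the PAE format.

```c
#include <stdint.h>

/* Indices a 32-bit PAE linear address selects at each level (4 KiB pages). */
struct pae_indices {
    unsigned pdpt;    /* bits 31:30 -> one of the 4 top-level entries */
    unsigned pd;      /* bits 29:21 -> one of 512 page-directory entries */
    unsigned pt;      /* bits 20:12 -> one of 512 page-table entries */
    unsigned offset;  /* bits 11:0  -> byte within the 4 KiB page */
};

/* Hypothetical helper: split an address into its per-level indices. */
static struct pae_indices pae_split(uint32_t addr)
{
    struct pae_indices ix;

    ix.pdpt   = (addr >> 30) & 0x3;
    ix.pd     = (addr >> 21) & 0x1ff;
    ix.pt     = (addr >> 12) & 0x1ff;
    ix.offset = addr & 0xfff;
    return ix;
}
```

With the usual 3:1 split, any address at or above 0xC0000000 lands in
top-level entry 3, which is why one of the four entries always belongs to
the kernel.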

2007-11-15 22:44:37

by H. Peter Anvin

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

Linus Torvalds wrote:
>
> IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
> have to be present.
>

This is true, although you could point a PGD to an all-zero page if you
really wanted to. You have to re-load CR3 after modifying the top-level
entries.

> What earlier CPU's did was to basically load all four values into the CPU
> when you loaded %cr3. There was no "three-level page table walker" at all:
> it was still a two-level page table walker, there were just four magic
> internal page tables that were indexed off the two high bits.

They still are. Loading CR3 in PAE really loads four registers from
memory. x86-64 is different, of course.

-hpa
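
The consequence of "loads four registers from memory" can be modelled in
a few lines of user-space C. This is only a toy: `toy_cpu`,
`toy_load_cr3` and `toy_top_level_present` are invented names, and the
model captures nothing beyond the snapshot-on-load behaviour described
above.

```c
#include <stdint.h>
#include <string.h>

#define PDPT_ENTRIES  4
#define PDPTE_PRESENT 0x1ull

/* Toy CPU: holds private copies of the four top-level entries. */
struct toy_cpu {
    uint64_t pdpte[PDPT_ENTRIES];
};

/* Model of loading %cr3: snapshot all four entries from memory. */
static void toy_load_cr3(struct toy_cpu *cpu, const uint64_t *pdpt_in_memory)
{
    memcpy(cpu->pdpte, pdpt_in_memory, sizeof(cpu->pdpte));
}

/* The toy walker consults the cached copy, never the table in memory. */
static int toy_top_level_present(const struct toy_cpu *cpu, uint32_t addr)
{
    return (cpu->pdpte[addr >> 30] & PDPTE_PRESENT) != 0;
}
```

In this model, setting the present bit in the in-memory table after
`toy_load_cr3()` has run changes nothing for the walker until
`toy_load_cr3()` is called again, which mirrors the need to reload CR3
after modifying a top-level entry.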

2007-11-16 00:41:19

by William Lee Irwin III

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

Linus Torvalds wrote:
>> IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
>> have to be present.

On Thu, Nov 15, 2007 at 02:42:46PM -0800, H. Peter Anvin wrote:
> This is true, although you could point a PGD to an all-zero page if you
> really wanted to. You have to re-load CR3 after modifying the top-level
> entries.

There may be bigger fish to fry in terms of per-process overhead, if
you're trying to cut that down. The trouble with trying to address
some of those is that there is mutual antagonism between compactness
and expansibility in the process address space layout, so you'll end
up instantiating a lot more than you want barring some sort of provision
for a compact address space layout. Pagetable sharing is a far more
powerful resource scalability method, though it also needs cooperation
in user address space layout to reap its gains.

There are other overheads, of course, though they're more typically
per-something besides processes.


-- wli

2007-11-16 00:43:25

by H. Peter Anvin

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

William Lee Irwin III wrote:
>
> There may be bigger fish to fry in terms of per-process overhead, if
> you're trying to cut that down. The trouble with trying to address
> some of those is that there is mutual antagonism between compactness
> and expansibility in the process address space layout, so you'll end
> up instantiating a lot more than you want barring some sort of provision
> for a compact address space layout. Pagetable sharing is a far more
> powerful resource scalability method, though it also needs cooperation
> in user address space layout to reap its gains.
>
> There are other overheads, of course, though they're more typically
> per-something besides processes.
>

I think Jeremy's question was due to trying to reduce the 32/64-bit
differences. Performance-wise, it might add a small amount to user
setup time (a typical 32-bit process will need all four, for the main
binary, libraries, stack and kernel, respectively) but it is probably
not significant (although I'd like to see numbers just in case).

-hpa

2007-11-16 11:28:16

by Andi Kleen

Subject: Re: Why preallocate pmd in x86 32-bit PAE?


> I think Jeremy's question was due to trying to reduce the 32/64-bit
> differences. Performance-wise, it might add a small amount to user
> setup time (a typical 32-bit process will need all four, for the main
> binary, libraries, stack and kernel, respectively)

With the new top down mmap layout and standard 3:1 split it should typically
only need two.

-Andi
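
The "four versus two" counts can be checked with a little arithmetic,
since each top-level entry covers 1 GiB. The sketch below counts how many
distinct entries a set of addresses touches. The helper names are
hypothetical, and the sample addresses in the note afterwards are only
illustrative of typical i386 layouts, not taken from this thread.

```c
#include <stdint.h>

/* Each of the four PAE top-level entries covers 1 GiB of address space. */
static unsigned pgd_slot(uint32_t addr)
{
    return addr >> 30;
}

/* Count the distinct top-level slots touched by a set of addresses. */
static unsigned slots_used(const uint32_t *addrs, unsigned n)
{
    unsigned mask = 0, count = 0;
    unsigned i;

    for (i = 0; i < n; i++)
        mask |= 1u << pgd_slot(addrs[i]);   /* one bit per slot */
    for (; mask; mask >>= 1)
        count += mask & 1;
    return count;
}
```

Assuming a binary near 0x08048000, a top-down mmap area near 0xb7f00000
and a stack near 0xbffff000, user space touches slots 0 and 2 only.
Moving the libraries to a bottom-up mmap base of 0x40000000 adds slot 1,
giving three user slots plus the kernel's.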

2007-11-16 15:46:36

by H. Peter Anvin

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

Andi Kleen wrote:
>> I think Jeremy's question was due to trying to reduce the 32/64-bit
>> differences. Performance-wise, it might add a small amount to user
>> setup time (a typical 32-bit process will need all four, for the main
>> binary, libraries, stack and kernel, respectively)
>
> With the new top down mmap layout and standard 3:1 split it should typically
> only need two.
>

Well, three with the kernel.

-hpa

2007-11-16 15:53:36

by Andi Kleen

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

On Friday 16 November 2007 16:45:16 H. Peter Anvin wrote:
> Andi Kleen wrote:
> >> I think Jeremy's question was due to trying to reduce the 32/64-bit
> >> differences. Performance-wise, it might add a small amount to user
> >> setup time (a typical 32-bit process will need all four, for the main
> >> binary, libraries, stack and kernel, respectively)
> >
> > With the new top down mmap layout and standard 3:1 split it should typically
> > only need two.
> >
>
> Well, three with the kernel.

I didn't count kernel because it is always fixed anyways and about zero
overhead for the normal setup case.

-Andi

2007-11-16 16:11:28

by H. Peter Anvin

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

Andi Kleen wrote:
> On Friday 16 November 2007 16:45:16 H. Peter Anvin wrote:
>> Andi Kleen wrote:
>>>> I think Jeremy's question was due to trying to reduce the 32/64-bit
>>>> differences. Performance-wise, it might add a small amount to user
>>>> setup time (a typical 32-bit process will need all four, for the main
>>>> binary, libraries, stack and kernel, respectively)
>>> With the new top down mmap layout and standard 3:1 split it should typically
>>> only need two.
>>>
>> Well, three with the kernel.
>
> I didn't count kernel because it is always fixed anyways and about zero
> overhead for the normal setup case.
>

Of course, but it was in the original list so...

-hpa

2007-11-16 17:12:47

by Jeremy Fitzhardinge

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

Linus Torvalds wrote:
> On Thu, 15 Nov 2007, Jeremy Fitzhardinge wrote:
>
>> One difference is that 64-bit incrementally allocates all levels of the
>> pagetable, whereas 32-bit PAE preallocates the 4 pmds when it allocates
>> the pgd. What's the rationale for this? What pitfalls would there be
>> in making them incrementally allocated?
>>
>
> IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
> have to be present.
>

Hm, do you recall what processors that might affect? As far as I know,
current processors will ignore non-present top-level entries. Anyway,
we can mark them not present but still point them at empty_zero_page, so
testing the present bit will still be sufficient to tell if we need to
allocate a new pmd, but if the hardware decides to follow the page
reference there's no harm done. (Hm, unless the hardware decides it wants
to set A or D bits in empty_zero_page for some reason...)

> What earlier CPU's did was to basically load all four values into the CPU
> when you loaded %cr3. There was no "three-level page table walker" at all:
> it was still a two-level page table walker, there were just four magic
> internal page tables that were indexed off the two high bits.
>

That just means we need to reload cr3 after populating the pgd with a
new pmd, right?

J

2007-11-16 17:36:17

by Linus Torvalds

Subject: Re: Why preallocate pmd in x86 32-bit PAE?



On Fri, 16 Nov 2007, Jeremy Fitzhardinge wrote:
> >
> > IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
> > have to be present.
>
> Hm, do you recall what processors that might affect? As far as I know,
> current processors will ignore non-present top-level entries.

Are you sure?

Anyway, this is not worth making a distinction for. Just pre-allocate all
of them. There really are just 4 PGD entries, and it really *is* different
from having a full three-level page table, and of the four PGD entries:

- one is used for the kernel mapping (assuming the regular 1:3 layout)
- AT LEAST two are required by user space anyway

so pre-allocating is never going to waste more than one page.

And you may feel that pre-allocating is a special case, but it's an
*easier* special case than the one that you are apparently thinking about
(which is to special-case according to CPU version).

So don't do it. Just preallocate for the magic 4-entry PGD. You can make
the special case just be something like

    /* Preallocate for small PGD's */
    #if PTRS_PER_PGD == 4
    for (i = 0; i < USER_PTRS_PER_PGD; i++) {
        /* pmd_alloc() is shorthand here for "allocate one pmd page" */
        pmd_t *pmd = pmd_alloc();
        set_pgd(pgd+i, __pgd(_PAGE_PRESENT | __pa(pmd)));
    }
    #endif

or similar.

There is absolutely *zero* reason not to do this, and there is also zero
reason to make this be a "32-bit vs 64-bit" issue. The code can be there
in both, and the #if could even be all in C code, i.e. there may be reasons
to prefer writing it as

/* The old-style PAE PGD needs to be preallocated */
if (USER_PTRS_PER_PGD <= 4) {
...
}

and the compiler should even compile it away entirely for all practical
cases even without using the preprocessor.

> Anyway, we can mark them not present but still point them at
> empty_zero_page, so testing the present bit will still be sufficient to
> tell if we need to allocate a new pmd, but if the hardware decides to
> follow the page reference there's no harm done. (Hm, unless the hardware
> decides it wants to set A or D bits in empty_zero_page for some reason...)

x86 page table walking never sets A/D bits on non-present entries.

That said, there's still a huge difference.

For "real" page table walking, you can always just insert entries without
flushing the cache if those entries weren't there before (because the TLB
is supposed to not cache negative entries).

Again, because of the way the magic 4-entry PGD works, that isn't true for
it. It caches the entries regardless, so if you change it from non-present
to present, you have to flush the TLB (well, "reload %cr3", which is the
same thing in practice, although it's for a different *reason*).

> That just means we need to reload cr3 after populating the pgd with a
> new pmd, right?

BUT ONLY FOR THIS CASE!

And if you preallocate it, you make *that* special case go away.

So you're going to have special cases regardless. Do the simple and
really straightforward one, please! Nothing subtle.

Linus
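
As a rough illustration of the preallocation idea, here is a user-space
model that allocates the user pmds up front when the top level is small.
It is a sketch only: the `toy_` types and functions are invented for this
example, `calloc()` stands in for page allocation, and the constants
assume PAE with a 3:1 split.

```c
#include <stdint.h>
#include <stdlib.h>

#define PTRS_PER_PGD      4     /* PAE: four top-level entries */
#define USER_PTRS_PER_PGD 3     /* 3:1 split: entries 0..2 are user space */
#define PTRS_PER_PMD      512
#define ENTRY_PRESENT     0x1ull

typedef struct { uint64_t entry[PTRS_PER_PMD]; } toy_pmd_t;

typedef struct {
    uint64_t   val[PTRS_PER_PGD];  /* stand-in for the hardware entry bits */
    toy_pmd_t *pmd[PTRS_PER_PGD];  /* host pointer, kept only for the model */
} toy_pgd_t;

/* Free the pgd together with any pmds hanging off it. */
static void toy_pgd_free(toy_pgd_t *pgd)
{
    int i;

    if (!pgd)
        return;
    for (i = 0; i < PTRS_PER_PGD; i++)
        free(pgd->pmd[i]);          /* free(NULL) is a no-op */
    free(pgd);
}

/* Allocate a pgd; for a small top level, populate the user pmds now. */
static toy_pgd_t *toy_pgd_alloc(void)
{
    toy_pgd_t *pgd = calloc(1, sizeof(*pgd));
    int i;

    if (!pgd)
        return NULL;
    if (USER_PTRS_PER_PGD <= 4) {   /* constant, so the branch folds away */
        for (i = 0; i < USER_PTRS_PER_PGD; i++) {
            pgd->pmd[i] = calloc(1, sizeof(toy_pmd_t));
            if (!pgd->pmd[i]) {
                toy_pgd_free(pgd);
                return NULL;
            }
            pgd->val[i] = ENTRY_PRESENT;
        }
    }
    return pgd;
}
```

Because every user entry is present from the start, nothing in this model
ever flips a top-level entry from non-present to present on a live
pagetable, so the reload-on-populate case never arises.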

2007-11-16 17:48:03

by H. Peter Anvin

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

Jeremy Fitzhardinge wrote:
>
> Hm, do you recall what processors that might affect? As far as I know,
> current processors will ignore non-present top-level entries. Anyway,
> we can mark them not present but still point them at empty_zero_page, so
> testing the present bit will still be sufficient to tell if we need to
> allocate a new pmd, but if the hardware decides to follow the page
> reference there's no harm done. (Hm, unless the hardware decides it wants
> to set A or D bits in empty_zero_page for some reason...)
>

PDPTR is documented to have P bits but none of the other control bits,
unlike other levels of the hierarchy.

The hardware never sets A or D bits on non-present pages, since all the
bits except P are reserved for the operating systems (and, besides, they
can never be accessed or dirty.)

>> What earlier CPU's did was to basically load all four values into the CPU
>> when you loaded %cr3. There was no "three-level page table walker" at all:
>> it was still a two-level page table walker, there were just four magic
>> internal page tables that were indexed off the two high bits.
>
> That just means we need to reload cr3 after populating the pgd with a
> new pmd, right?

Yes. And as Linus said, it would be a new special case.

-hpa

2007-11-16 18:31:00

by Jeremy Fitzhardinge

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

Linus Torvalds wrote:
> On Fri, 16 Nov 2007, Jeremy Fitzhardinge wrote:
>
>>> IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
>>> have to be present.
>>>
>> Hm, do you recall what processors that might affect? As far as I know,
>> current processors will ignore non-present top-level entries.
>>
>
> Are you sure?
>

3.8.5 in vol 3a "Page-Directory and Page-Table Entries With Extended
Addressing Enabled":

The present flag (bit 0) in the page-directory-pointer-table entries
can be set to 0 or 1. If the present flag is clear, the remaining
bits in the page-directory-pointer-table entry are available to the
operating system. If the present flag is set, the fields of the
page-directory-pointer-table entry are defined in Figures 3-20 for
4-KByte pages and Figures 3-21 for 2-MByte pages.

So I would assume this works on all current CPUs, but I can imagine that
some older/off-brand processors might get it wrong.

> Anyway, this is not worth making a distinction for. Just pre-allocate all
> of them. There really are just 4 PGD entries, and it really *is* different
> from having a full three-level page table, and of the four PGD entries:
>
> - one is used for the kernel mapping (assuming the regular 1:3 layout)
> - AT LEAST two are required by user space anyway
>
> so pre-allocating is never going to waste more than one page.
>

Yeah, I'm not so concerned about memory saving; I don't think there
would be any in practice.

> And you may feel that pre-allocating is a special case, but it's an
> *easier* special case than the one that you are apparently thinking about
> (which is to special-case according to CPU version).
>

I'm hoping to avoid special-casing anything, if I can help it, aside
from the normal 32/64-bit 2/3/4-level parameterising of the various
pagetable accessors.

> So don't do it. Just preallocate for the magic 4-entry PGD. You can make
> the special case just be something like
>
>     /* Preallocate for small PGD's */
>     #if PTRS_PER_PGD == 4
>     for (i = 0; i < USER_PTRS_PER_PGD; i++) {
>         pmd_t *pmd = pmd_alloc();
>         set_pgd(pgd+i, __pgd(_PAGE_PRESENT | __pa(pmd)));
>     }
>     #endif
>
> or similar.
>
> There is absolutely *zero* reason not to do this, and there is also zero
> reason to make this be a "32-bit vs 64-bit" issue. The code can be there
> in both, and the #if could even be all in C code, i.e. there may be reasons
> to prefer writing it as
>
> /* The old-style PAE PGD needs to be preallocated */
> if (USER_PTRS_PER_PGD <= 4) {
> ...
> }
>
> and the compiler should even compile it away entirely for all practical
> cases even without using the preprocessor.
>

Perhaps. And there's the corresponding difference between 32 and 64 bit
on freeing a pagetable; 32-bit assumes the pgd destructor will free the
pmd, whereas 64-bit does it separately. Even in the current 32-bit
code, there's separate handling for PAE and non-PAE. I think it can all
be collapsed down in a reasonable way.

>> That just means we need to reload cr3 after populating the pgd with a
>> new pmd, right?
>>
>
> BUT ONLY FOR THIS CASE!
>
> And if you preallocate it, you make *that* special case go away.
>

Yes, that is a bit awkward; it means that 32-bit PAE would need a
separate pgd_populate. But that seems like a smaller change than 1)
making 32-bit PAE pgd_alloc preallocate the pmd, 2) making pmd_free a
no-op on 32-bit PAE, and 3) making pgd_free free the preallocated pmd.
Perhaps 2 & 3 aren't necessary and can be the same as 64-bit.

I'll need to look into it more carefully.

> So you're going to have special cases regardless. Do the simple and
> really straightforward one, please! Nothing subtle.
>

Yep, absolutely.

J

2007-11-16 19:15:09

by Jeremy Fitzhardinge

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

Linus Torvalds wrote:
> So don't do it. Just preallocate for the magic 4-entry PGD. You can make
> the special case just be something like
>

Yes, OK, it makes sense. Conceptually they would be dynamically
allocated and freed, but they'd just happen to start allocated, to avoid
the tlb flush of populating the pgd of an active pagetable. If you
happened to do a 1G munmap, it may end up freeing and reallocating them,
but that's going to be very rare. Either way, the other special cases
are avoided (though pgd_populate would still need to be correct, on the
off chance it gets invoked).

J

2007-11-16 19:26:06

by Linus Torvalds

Subject: Re: Why preallocate pmd in x86 32-bit PAE?



On Fri, 16 Nov 2007, Jeremy Fitzhardinge wrote:
>
> If you happened to do a 1G munmap, it may end up freeing and
> reallocating them, but that's going to be very rare.

I don't think we ever free the pmd's now, do we?

(Except for the *final* free, of course, when we release the whole VM).

Linus

2007-11-16 19:44:16

by Jeremy Fitzhardinge

Subject: Re: Why preallocate pmd in x86 32-bit PAE?

Linus Torvalds wrote:
> On Fri, 16 Nov 2007, Jeremy Fitzhardinge wrote:
>
>> If you happened to do a 1G munmap, it may end up freeing and
>> reallocating them, but that's going to be very rare.
>>
>
> I don't think we ever free the pmd's now, do we?
>
> (Except for the *final* free, of course, when we release the whole VM).

Not for 32-bit at the moment, but it does in principle. munmap ends up
calling free_pgtables, and so ends up calling pmd_free_range. That will
do a pud_clear to detach the pmd from the pagetable and call
__pmd_free_tlb, which ends up doing tlb_remove_page ->
free_page_and_swap_cache. 32-bit knobbles all this at the moment, but
it looks to me like it wouldn't be hard to make this work if the code is
all common with 64-bit.

J
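
The "1G munmap" condition comes down to alignment: a pmd can only become
freeable if the unmapped range spans the whole 1 GiB region that its
top-level entry covers. A simplified sketch of that check follows. The
helper name is hypothetical, and the kernel's real logic also takes the
neighbouring mappings into account, which this model ignores.

```c
#include <stdint.h>

#define PGDIR_SHIFT 30                      /* PAE: 1 GiB per top-level entry */
#define PGDIR_SIZE  (1ull << PGDIR_SHIFT)

/* Does the half-open range [start, end) cover top-level slot `slot` fully? */
static int range_covers_slot(uint64_t start, uint64_t end, unsigned slot)
{
    uint64_t lo = (uint64_t)slot << PGDIR_SHIFT;

    return start <= lo && end >= lo + PGDIR_SIZE;
}
```

Unmapping 0x40000000 to 0x80000000 covers slot 1 entirely, whereas a
range starting one page later at 0x40001000 does not, so in this model
only the first could detach the pmd from the pgd.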