2004-10-12 13:59:24

by Andi Kleen

Subject: 4level page tables for Linux


I released a 4level page table patch for 2.6.9rc4. It is available
from ftp://ftp.suse.com/pub/people/ak/4level/4level-2.6.9rc4-1.gz

It changes the Linux MM, which currently only supports 3 levels of
page tables, to support a fourth level (PML4). This is needed to
exceed 512GB of virtual address space per process on x86-64.
People have been running into the 512GB limit when mmapping large files.

The patch extends all the code in the VM that walks the 3level
hierarchy to handle the fourth level too.

The patch changes x86-64 to now support 47bits (128TB) of virtual
address space per process.
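
A back-of-the-envelope check of these figures, as a sketch rather than anything taken from the patch: assuming 4 KiB pages (12 offset bits) and 9 index bits per level, as on x86-64, three levels translate 39 bits (512GB) and four levels translate 48 bits, of which the user half is 47 bits (128TB). The helper names va_bits() and va_bytes() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 layout with 4 KiB pages: 12 offset bits, 9 index bits per level. */
enum { PAGE_SHIFT = 12, BITS_PER_LEVEL = 9 };

/* Number of virtual address bits translated by a page table with `levels` levels. */
static inline unsigned va_bits(unsigned levels)
{
    assert(levels >= 1 && levels <= 5);
    return PAGE_SHIFT + levels * BITS_PER_LEVEL;
}

/* Number of bytes addressable with `bits` address bits (bits must be < 64). */
static inline uint64_t va_bytes(unsigned bits)
{
    assert(bits < 64);
    return (uint64_t)1 << bits;
}
```

With these helpers, va_bytes(va_bits(3)) is 512GB, and va_bytes(va_bits(4) - 1) is the 128TB user half of the 48-bit space.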

The changes are fortunately quite localized. Only mm/* had to change
(mostly in straightforward ways); drivers etc. were already well
isolated from the page tables by the get_user_pages() function.
The only exception was DRM, which needed a small patch.

Excluding the i386/x86-64 architecture specific parts the patch looks like:

drivers/char/drm/drm_memory.h        |   3
fs/exec.c                            |   6
include/asm-generic/nopml4-page.h    |  11
include/asm-generic/nopml4-pgalloc.h |  21 +
include/asm-generic/nopml4-pgtable.h |  39 ++
include/asm-generic/pgtable.h        |   2
include/asm-generic/tlb.h            |   6
include/linux/init_task.h            |   2
include/linux/mm.h                   |  10
include/linux/sched.h                |   2
kernel/fork.c                        |   6
mm/fremap.c                          |  18 -
mm/memory.c                          | 584 ++++++++++++++++++++++++-----------
mm/mempolicy.c                       |  18 -
mm/mmap.c                            |  36 +-
mm/mprotect.c                        |  44 ++
mm/mremap.c                          |  21 +
mm/msync.c                           |  43 ++
mm/rmap.c                            |  21 +
mm/swapfile.c                        |  61 ++-
mm/vmalloc.c                         |  85 ++++-

For architectures with fewer levels of page tables the code should be mostly a
no-op.
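
To illustrate why an unused top level can cost next to nothing, here is a toy sketch of a "folded" level. It is not the contents of the nopml4-* headers; the types, sizes, and toy_* names are hypothetical. The folded pml4 has exactly one entry, which simply contains the pgd table, so the outer loop runs once and the compiler can drop it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy sizes: the folded top level has a single entry. */
enum { PTRS_PER_PML4 = 1, PTRS_PER_PGD = 4 };

typedef struct { unsigned long val; } pgd_t;          /* val == 0 means "not present" */
typedef struct { pgd_t pgd[PTRS_PER_PGD]; } pml4_t;   /* folded level: just wraps the pgd table */

struct toy_mm { pml4_t top[PTRS_PER_PML4]; };

/* With one entry the index is always 0, so this is the address of the mm's table itself. */
static inline pml4_t *toy_pml4_offset(struct toy_mm *mm, size_t idx)
{
    assert(idx < PTRS_PER_PML4);
    return &mm->top[idx];
}

/* A walker written for four levels; the outer loop degenerates to a single pass. */
static inline size_t toy_count_present(struct toy_mm *mm)
{
    size_t n = 0;
    for (size_t i = 0; i < PTRS_PER_PML4; i++) {
        pml4_t *pml4 = toy_pml4_offset(mm, i);
        for (size_t j = 0; j < PTRS_PER_PGD; j++)
            n += pml4->pgd[j].val != 0;
    }
    return n;
}
```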

The patch currently only supports x86-64 and i386. Other architectures
will need some changes to compile. The changes to convert an architecture
over are quite straightforward; please see the changes to i386 in
the patch. I renamed all functions that need changing, so code
that has not been converted will not compile.

I will need some help to do this conversion because I can only test
i386 and x86-64 easily. I can convert architectures over, but someone
will need to test the result.

Plan is to merge it into -mm* ASAP, when the major architectures
have been ported over.

-Andi


2004-10-12 18:48:47

by Dave Hansen

Subject: Re: 4level page tables for Linux

@@ -110,13 +115,18 @@ int install_file_pte(struct mm_struct *m
unsigned long addr, unsigned long pgoff, pgprot_t prot)
{
...
+ pml4 = pml4_offset(mm, addr);
+
+ spin_lock(&mm->page_table_lock);
+ pgd = pgd_alloc(mm, pml4, addr);
+ if (!pgd)
+ goto err_unlock;

Locking isn't needed for access to the pml4? This is a wee bit
different from pgd's and I didn't see any documentation about it
anywhere. Could be confusing.

+++ linux-2.6.9rc4-4level/mm/memory.c
...
+#undef inline
+#define inline
+unsigned long caddr;

Is this just for debugging?

+static inline void free_one_pml4(struct mmu_gather *tlb, pml4_t *pml4,
+ unsigned long addr, unsigned long end)
+{
...
+ do {
+ caddr = addr;
+ free_one_pgd(tlb, pgd);
+ free++;
+ addr = (addr + PGDIR_SIZE) & PGDIR_MASK;
+ pgd++;
+ } while (addr && addr < end);

If someone attempts to clear an address which is in the top PGDIR_SIZE
bytes of memory, this will overflow. Is that an issue?

There also seems to be quite a bit of churn in the copy_*_range()
functions that isn't completely related to the pml4 changes. Should
that get broken out?

-- Dave

2004-10-12 19:09:27

by Andi Kleen

Subject: Re: 4level page tables for Linux

On Tue, Oct 12, 2004 at 11:48:22AM -0700, Dave Hansen wrote:
> @@ -110,13 +115,18 @@ int install_file_pte(struct mm_struct *m
> unsigned long addr, unsigned long pgoff, pgprot_t prot)
> {
> ...
> + pml4 = pml4_offset(mm, addr);
> +
> + spin_lock(&mm->page_table_lock);
> + pgd = pgd_alloc(mm, pml4, addr);
> + if (!pgd)
> + goto err_unlock;
>
> Locking isn't needed for access to the pml4? This is a wee bit
> different from pgd's and I didn't see any documentation about it
> anywhere. Could be confusing.

No, the lock is still needed. Thanks for catching this, that was indeed
wrong.

>
> +++ linux-2.6.9rc4-4level/mm/memory.c
> ...
> +#undef inline
> +#define inline
> +unsigned long caddr;
>
> Is this just for debugging?

Yes, that's a leftover I forgot to remove. Will go in the next
version.

(btw there is another leftover in there, I will remove it too)

>
> +static inline void free_one_pml4(struct mmu_gather *tlb, pml4_t *pml4,
> + unsigned long addr, unsigned long end)
> +{
> ...
> + do {
> + caddr = addr;
> + free_one_pgd(tlb, pgd);
> + free++;
> + addr = (addr + PGDIR_SIZE) & PGDIR_MASK;
> + pgd++;
> + } while (addr && addr < end);
>
> If someone attempts to clear an address which is in the top PGDIR_SIZE
> bytes of memory, this will overflow. Is that an issue?

That is what the addr && is for. Yes, overflows happen and afaik they
are all handled.
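
A small user-space sketch of that loop condition, using a toy PGDIR_SIZE of 2^60 so that 16 slots cover a 64-bit address space (the TOY_* names and count_slots() are illustrative, not from the patch). When the range reaches the top of the address space, the increment wraps to 0; without the `addr &&` test the comparison `0 < end` would be true and the loop would start over from the bottom.

```c
#include <assert.h>
#include <stdint.h>

/* Toy geometry: each top-level slot covers 2^60 bytes of a 64-bit address space. */
#define TOY_PGDIR_SIZE ((uint64_t)1 << 60)
#define TOY_PGDIR_MASK (~(TOY_PGDIR_SIZE - 1))

/* Count how many slots the range [addr, end) touches, using the same
 * loop shape as the quoted code. */
static inline unsigned count_slots(uint64_t addr, uint64_t end)
{
    unsigned n = 0;
    do {
        n++;
        /* Advance to the start of the next slot; wraps to 0 past the last slot. */
        addr = (addr + TOY_PGDIR_SIZE) & TOY_PGDIR_MASK;
    } while (addr && addr < end);   /* `addr &&` stops the loop after a wrap */
    return n;
}
```

Starting in the second-to-last slot and running to the top of the address space visits two slots and then stops on the wrap to 0.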

>
> There also seems to be quite a bit of churn in the copy_*_range()
> functions that isn't completely related to the pml4 changes. Should
> that get broken out?

It's related. The old function just wasn't scalable to 4 level
page tables at all (I really tried, but it was too ugly), so I rewrote it
completely. Splitting it would be difficult because the 3level version
would already be very different from the 4level version.

Thanks for the review.

-Andi

2004-10-12 19:13:06

by Andi Kleen

Subject: Re: 4level page tables for Linux II

On Tue, Oct 12, 2004 at 09:03:46PM +0200, Andi Kleen wrote:
> On Tue, Oct 12, 2004 at 11:48:22AM -0700, Dave Hansen wrote:
> > @@ -110,13 +115,18 @@ int install_file_pte(struct mm_struct *m
> > unsigned long addr, unsigned long pgoff, pgprot_t prot)
> > {
> > ...
> > + pml4 = pml4_offset(mm, addr);
> > +
> > + spin_lock(&mm->page_table_lock);
> > + pgd = pgd_alloc(mm, pml4, addr);
> > + if (!pgd)
> > + goto err_unlock;
> >
> > Locking isn't needed for access to the pml4? This is a wee bit
> > different from pgd's and I didn't see any documentation about it
> > anywhere. Could be confusing.
>
> No, the lock is still needed. Thanks for catching this, that was indeed
> wrong.

Actually, on second thought, the code was OK. The reason is
that the highest page table level never goes away while the process
exists, so holding a pointer into it is always valid.

Only dereferencing it needs a lock, but pml4_offset doesn't dereference
anything yet.

The same used to hold for pgds, but the 4level page tables change that.
However there was at least one bug in the patchkit in this area
which I now fixed.
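
The rule described above can be sketched in user space as follows. This is a toy model, not kernel code: struct toy_mm, the toy_* functions, and the atomic_flag spinlock standing in for mm->page_table_lock are all assumptions made for illustration. Computing the slot pointer is pure address arithmetic on a table that lives as long as the mm, so it happens before the lock is taken; the entry itself is only read or written under the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { TOP_ENTRIES = 8 };

struct toy_mm {
    atomic_flag lock;                 /* stand-in for mm->page_table_lock */
    unsigned long top[TOP_ENTRIES];   /* top-level table, never freed while the mm exists */
};

static inline void toy_lock(struct toy_mm *mm)
{
    while (atomic_flag_test_and_set_explicit(&mm->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static inline void toy_unlock(struct toy_mm *mm)
{
    atomic_flag_clear_explicit(&mm->lock, memory_order_release);
}

/* Pointer arithmetic only: no table entry is read, so no lock is needed. */
static inline unsigned long *toy_top_offset(struct toy_mm *mm, size_t idx)
{
    assert(idx < TOP_ENTRIES);
    return &mm->top[idx];
}

/* Install `val` if the slot is empty and return what the slot holds afterwards. */
static inline unsigned long toy_install(struct toy_mm *mm, size_t idx, unsigned long val)
{
    unsigned long *slot = toy_top_offset(mm, idx);   /* computed without the lock */
    unsigned long cur;

    toy_lock(mm);
    if (*slot == 0)          /* the entry is only dereferenced under the lock */
        *slot = val;
    cur = *slot;
    toy_unlock(mm);
    return cur;
}
```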

-Andi

2004-10-13 18:42:13

by Andrea Arcangeli

Subject: Re: 4level page tables for Linux

On Tue, Oct 12, 2004 at 11:48:22AM -0700, Dave Hansen wrote:
> @@ -110,13 +115,18 @@ int install_file_pte(struct mm_struct *m
> unsigned long addr, unsigned long pgoff, pgprot_t prot)
> {
> ...
> + pml4 = pml4_offset(mm, addr);
> +
> + spin_lock(&mm->page_table_lock);
> + pgd = pgd_alloc(mm, pml4, addr);
> + if (!pgd)
> + goto err_unlock;
>
> Locking isn't needed for access to the pml4? This is a wee bit
> different from pgd's and I didn't see any documentation about it

btw, locking isn't needed even for the pgd.

fremap.c is the only one that gets it right:

pgd = pgd_offset(mm, addr);
spin_lock(&mm->page_table_lock);
pmd = pmd_alloc(mm, pgd, addr);

the rest is just overkill but it doesn't hurt in practice.

after you add the 4level, locking will become necessary for the pgd, but
it's still not needed for the pml4.

I'm not very excited about changing the naming of the pgd/pmd/pte, so
I'd like to keep it as it is now.

perhaps we could consider pgd4 instead of pml4. What does "pml" stand
for?

2004-10-13 19:48:35

by Andi Kleen

Subject: Re: 4level page tables for Linux

On Wed, 13 Oct 2004 20:41:53 +0200
Andrea Arcangeli wrote:


>
> after you add the 4level, locking will become necessary for the pgd, but
> it's still not needed for the pml4.

Yes, agreed. I did an audit of the generic code and it seems to be ok
regarding the pgd use.


> perhaps we could consider pgd4 instead of pml4. What does "pml" stand
> for?

page mapping level 4 (?) just guessing here.

PML4 is the name AMD and Intel use in their documentation. I don't see
a particular reason to be different from them.

-Andi

2004-10-13 20:03:49

by Andrea Arcangeli

Subject: Re: 4level page tables for Linux

On Wed, Oct 13, 2004 at 09:35:58PM +0200, Andi Kleen wrote:
> page mapping level 4 (?) just guessing here.

makes sense.

> PML4 is the name AMD and Intel use in their documentation. I don't see
> a particular reason to be different from them.

just because we never say 'page mapping level 4', we think 'page table
level 4' or 'page directory level 4'.

pte4 doesn't sound nice since we don't use pte3/2/1, but pgd4 could make
some more sense than pml4.

But I'm fine if you prefer to stick with pml4. Those are names we tend
to memorize anyways, I don't actually know what pmd means exactly either
(page middle directory)?

2004-10-13 23:28:53

by Albert Cahalan

Subject: Re: 4level page tables for Linux

> after you add the 4level, locking will become
> necessary for the pgd, but it's still not needed
> for the pml4.
>
> I'm not very excited about changing the naming
> of the pgd/pmd/pte, so I'd like to keep it as it is now.
>
> perhaps we could consider pgd4 instead of pml4.
> What does "pml" stand for?

The "pmd" one is certainly nonsense now.
It means "page middle directory".

Numbers for all of them would be easy to deal with.
Like this: pd1, pd2, pd3, pd4...

I'd number going toward the page, because that's
the order in which these things get walked.


2004-10-13 23:50:36

by Andrea Arcangeli

Subject: Re: 4level page tables for Linux

On Wed, Oct 13, 2004 at 07:22:15PM -0400, Albert Cahalan wrote:
> I'd number going toward the page, because that's
> the order in which these things get walked.

I'd call the pml level 1 too, but in the specs it is level 4. So sticking
to the specs' numbering is going to generate less confusion. Otherwise when
we speak with somebody with hardware knowledge we say level 4 and he
understands the specs' level 1. I recall it already happened to me once ;).

2004-10-14 01:22:06

by Albert Cahalan

Subject: Re: 4level page tables for Linux

On Wed, 2004-10-13 at 19:51, Andrea Arcangeli wrote:
> On Wed, Oct 13, 2004 at 07:22:15PM -0400, Albert Cahalan wrote:
> > I'd number going toward the page, because that's
> > the order in which these things get walked.
>
> I'd call the pml level 1 too, but in the specs it is level 4. So sticking
> to the specs' numbering is going to generate less confusion. Otherwise when
> we speak with somebody with hardware knowledge we say level 4 and he
> understands the specs' level 1. I recall it already happened to me once ;).

While x86_64 will be by far the most popular arch,
this is a matter for the generic code.

Perhaps "4" should be avoided: 0,1,2,3


2004-10-14 09:25:41

by George Spelvin

Subject: Re: 4level page tables for Linux

> Numbers for all of them would be easy to deal with.
> Like this: pd1, pd2, pd3, pd4...
>
> I'd number going toward the page, because that's
> the order in which these things get walked.

On the other hand, these extensions tend to get made to the top,
and it's confusing if, in a 2-level system, only pd3 and pd4 are used.

Perhaps a little-endian scheme (pd1 = pte, pd2=pmd, pd4=pgd) would
be better after all.

2004-10-14 11:15:39

by Robin Holt

Subject: Re: 4level page tables for Linux

On Thu, Oct 14, 2004 at 09:25:40AM -0000, [email protected] wrote:
> > Numbers for all of them would be easy to deal with.
> > Like this: pd1, pd2, pd3, pd4...
> >
> > I'd number going toward the page, because that's
> > the order in which these things get walked.
>
> On the other hand, these extensions tend to get made to the top,
> and it's confusing if, in a 2-level system, only pd3 and pd4 are used.
>
> Perhaps a little-endian scheme (pd1 = pte, pd2=pmd, pd4=pgd) would
> be better after all.

I think the pd4=pgd, etc makes more sense as well. The names assigned
now go from smallest scope (pte=pd1) to largest scope (pgd=pd3). Feels
consistent.

Robin

2004-10-17 02:57:27

by H. Peter Anvin

Subject: Re: 4level page tables for Linux

Followup to: <[email protected]>
By author: [email protected]
In newsgroup: linux.dev.kernel
>
> > Numbers for all of them would be easy to deal with.
> > Like this: pd1, pd2, pd3, pd4...
> >
> > I'd number going toward the page, because that's
> > the order in which these things get walked.
>
> On the other hand, these extensions tend to get made to the top,
> and it's confusing if, in a 2-level system, only pd3 and pd4 are used.
>
> Perhaps a little-endian scheme (pd1 = pte, pd2=pmd, pd4=pgd) would
> be better after all.

I believe so, for the same reason that little-endian actually makes
more sense for numbers in the long run. It's one of those things
where the "first perception" doesn't match what makes sense in the
long run.

-hpa

2004-10-18 17:02:34

by Christoph Lameter

Subject: Re: 4level page tables for Linux

On Wed, 13 Oct 2004, Andrea Arcangeli wrote:

> On Wed, Oct 13, 2004 at 09:35:58PM +0200, Andi Kleen wrote:
> > page mapping level 4 (?) just guessing here.
>
> makes sense.
>
> > PML4 is the name AMD and Intel use in their documentation. I don't see
> > a particular reason to be different from them.
>
> just because we never say 'page mapping level 4', we think 'page table
> level 4' or 'page directory level 4'.

Would it not be best to give up hardcoding these page mapping levels into
the kernel? Linux should support N levels. pml4,pgd,pmd,pte needs to
disappear and be replaced by

pte_path[N]

We are duplicating code for pgd, pmd, pte and now pml again and again. The
code could be much simpler if this would be generalized. Various
architectures would support different levels without some strange
feature like f.e. pmd's being "optimized away".

Certainly the way that pml4 is proposed to be done is less invasive but we
are creating something more and more difficult to maintain.

2004-10-18 17:29:16

by Andi Kleen

Subject: Re: 4level page tables for Linux

On Mon, Oct 18, 2004 at 10:02:20AM -0700, Christoph Lameter wrote:
> On Wed, 13 Oct 2004, Andrea Arcangeli wrote:
>
> > On Wed, Oct 13, 2004 at 09:35:58PM +0200, Andi Kleen wrote:
> > > page mapping level 4 (?) just guessing here.
> >
> > makes sense.
> >
> > > PML4 is the name AMD and Intel use in their documentation. I don't see
> > > a particular reason to be different from them.
> >
> > just because we never say 'page mapping level 4', we think 'page table
> > level 4' or 'page directory level 4'.
>
> Would it not be best to give up hardcoding these page mapping levels into
> the kernel? Linux should support N levels. pml4,pgd,pmd,pte needs to

It already does. Currently it supports 2-3 levels, with my patch
it supports 2-4 levels.

> disappear and be replaced by
>
> pte_path[N]
>
> We are duplicating code for pgd, pmd, pte and now pml again and again. The
> code could be much simpler if this would be generalized. Various

For most people it is already generalized (get_user_pages).
The only exception is the core VM and the low level architecture code.
The latter will always need to deal with the details.


> architectures would support different levels without some strange
> feature like f.e. pmd's being "optimized away".

Nobody came up with a nice automatic iterator so far.

If you look at the different functions in mm/* that handle all levels,
they all do slightly different things, so it's not that easy to
generalize. Also there are not that many, perhaps seven in mm/* plus
another in the arch code.

> Certainly the way that pml4 is proposed to be done is less invasive but we
> are creating something more and more difficult to maintain.

I don't see us switching to more levels any time soon ...

Also I don't think it's as bad as you're claiming. It's a clear
abstraction which has served us well so far.

-Andi

2004-10-18 17:37:53

by Andrea Arcangeli

Subject: Re: 4level page tables for Linux

On Mon, Oct 18, 2004 at 10:02:20AM -0700, Christoph Lameter wrote:
> Would it not be best to give up hardcoding these page mapping levels into
> the kernel? Linux should support N levels. pml4,pgd,pmd,pte needs to
> disappear and be replaced by
>
> pte_path[N]

Those aren't the same thing. They may have different
formats and different sizes, plus as Ingo pointed out we use type checking
even when they're the same size. Plus they're already an array. It's not
that simple to remove those duplicate loops, and pte[N] wouldn't mean
the level but the entry offset in the pagetable. Perhaps it's possible
to remove the loops, but you'd need at least more runtime branches in
each loop to work out which methods to execute
depending on the level you're running on. I certainly don't like the
loops myself, so I see your point ;).
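
As a rough illustration of the trade-off discussed here, the sketch below drives the index computation for every level from a descriptor table instead of one hand-written step per level. Everything in it is hypothetical (struct level_desc, toy_levels, toy_pte_path()); the shifts assume the x86-64 4-level layout with 4 KiB pages. It only covers the part that is uniform across levels. Entry formats, allocation, and type checking differ per level, and handling those inside such a loop would need per-level branches or callbacks, which is the cost mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One descriptor per level, ordered from the top of the tree down. */
struct level_desc {
    const char *name;   /* for diagnostics only */
    unsigned shift;     /* lowest address bit indexed by this level */
    unsigned bits;      /* number of index bits at this level */
};

/* Assumed layout: x86-64, 4 KiB pages, 9 index bits per level. */
static const struct level_desc toy_levels[] = {
    { "pml4", 39, 9 },
    { "pgd",  30, 9 },
    { "pmd",  21, 9 },
    { "pte",  12, 9 },
};

enum { TOY_NLEVELS = sizeof toy_levels / sizeof toy_levels[0] };

/* Fill path[i] with the table index used at level i for virtual address va. */
static inline void toy_pte_path(uint64_t va, unsigned path[TOY_NLEVELS])
{
    for (size_t i = 0; i < TOY_NLEVELS; i++) {
        const struct level_desc *d = &toy_levels[i];
        path[i] = (unsigned)((va >> d->shift) & ((1u << d->bits) - 1));
    }
}
```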