2009-03-11 13:53:27

by Martin Schwidefsky

[permalink] [raw]
Subject: [PATCH] fix/improve generic page table walker

From: Martin Schwidefsky <[email protected]>

On s390 the /proc/pid/pagemap interface is currently broken. This is
caused by the unconditional loop over all pgd/pud entries as specified
by the address range passed to walk_page_range. The tricky bit here
is that the pgd++ in the outer loop may only be done if the page table
really has 4 levels. For the pud++ in the second loop the page table needs
to have at least 3 levels. With the dynamic page tables on s390 we can have
page tables with 2, 3 or 4 levels. Which means that the pgd and/or the
pud pointer can get out-of-bounds causing all kinds of mayhem.

The proposed solution is to fast-forward over the hole between the start
address and the first vma and the hole between the last vma and the end
address. The pgd/pud/pmd/pte loops are used only for the address range
between the first and last vma. This guarantees that the page table
pointers stay in range for s390. For the other architectures this is
a small optimization.

As the page walker now accesses the vma list the mmap_sem is required.
All callers of the walk_page_range function needs to acquire the semaphore.

Cc: Matt Mackall <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
---

fs/proc/task_mmu.c | 2 ++
mm/pagewalk.c | 28 ++++++++++++++++++++++++++--
2 files changed, 28 insertions(+), 2 deletions(-)

diff -urpN linux-2.6/fs/proc/task_mmu.c linux-2.6-patched/fs/proc/task_mmu.c
--- linux-2.6/fs/proc/task_mmu.c 2009-03-11 13:38:53.000000000 +0100
+++ linux-2.6-patched/fs/proc/task_mmu.c 2009-03-11 13:39:45.000000000 +0100
@@ -716,7 +716,9 @@ static ssize_t pagemap_read(struct file
* user buffer is tracked in "pm", and the walk
* will stop when we hit the end of the buffer.
*/
+ down_read(&mm->mmap_sem);
ret = walk_page_range(start_vaddr, end_vaddr, &pagemap_walk);
+ up_read(&mm->mmap_sem);
if (ret == PM_END_OF_BUFFER)
ret = 0;
/* don't need mmap_sem for these, but this looks cleaner */
diff -urpN linux-2.6/mm/pagewalk.c linux-2.6-patched/mm/pagewalk.c
--- linux-2.6/mm/pagewalk.c 2008-12-25 00:26:37.000000000 +0100
+++ linux-2.6-patched/mm/pagewalk.c 2009-03-11 13:39:45.000000000 +0100
@@ -104,6 +104,8 @@ static int walk_pud_range(pgd_t *pgd, un
int walk_page_range(unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
+ struct vm_area_struct *vma, *prev;
+ unsigned long stop;
pgd_t *pgd;
unsigned long next;
int err = 0;
@@ -114,9 +116,28 @@ int walk_page_range(unsigned long addr,
if (!walk->mm)
return -EINVAL;

+ /* Find first valid address contained in a vma. */
+ vma = find_vma(walk->mm, addr);
+ if (!vma)
+ /* One big hole. */
+ return walk->pte_hole(addr, end, walk);
+ if (addr < vma->vm_start) {
+ /* Skip over all ptes in the area before the first vma. */
+ err = walk->pte_hole(addr, vma->vm_start, walk);
+ if (err)
+ return err;
+ addr = vma->vm_start;
+ }
+
+ /* Find last valid address contained in a vma. */
+ stop = end;
+ vma = find_vma_prev(walk->mm, end, &prev);
+ if (!vma)
+ stop = prev->vm_end;
+
pgd = pgd_offset(walk->mm, addr);
do {
- next = pgd_addr_end(addr, end);
+ next = pgd_addr_end(addr, stop);
if (pgd_none_or_clear_bad(pgd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
@@ -131,7 +152,10 @@ int walk_page_range(unsigned long addr,
err = walk_pud_range(pgd, addr, next, walk);
if (err)
break;
- } while (pgd++, addr = next, addr != end);
+ } while (pgd++, addr = next, addr != stop);

+ if (stop < end)
+ /* Skip over all ptes in the area after the last vma. */
+ err = walk->pte_hole(stop, end, walk);
return err;
}


2009-03-11 17:27:58

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

On Wed, 2009-03-11 at 14:49 +0100, Martin Schwidefsky wrote:
> From: Martin Schwidefsky <[email protected]>
>
> On s390 the /proc/pid/pagemap interface is currently broken. This is
> caused by the unconditional loop over all pgd/pud entries as specified
> by the address range passed to walk_page_range. The tricky bit here
> is that the pgd++ in the outer loop may only be done if the page table
> really has 4 levels. For the pud++ in the second loop the page table needs
> to have at least 3 levels. With the dynamic page tables on s390 we can have
> page tables with 2, 3 or 4 levels. Which means that the pgd and/or the
> pud pointer can get out-of-bounds causing all kinds of mayhem.

Not sure why this should be a problem without delving into the S390
code. After all, x86 has 2, 3, or 4 levels as well (at compile time) in
a way that's transparent to the walker.

> The proposed solution is to fast-forward over the hole between the start
> address and the first vma and the hole between the last vma and the end
> address. The pgd/pud/pmd/pte loops are used only for the address range
> between the first and last vma. This guarantees that the page table
> pointers stay in range for s390. For the other architectures this is
> a small optimization.

I've gone to lengths to keep VMAs out of the equation, so I can't say
I'm excited about this solution.

--
http://selenic.com : development and support for Mercurial and Linux

2009-03-12 08:37:08

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

On Wed, 11 Mar 2009 12:24:23 -0500
Matt Mackall <[email protected]> wrote:

> On Wed, 2009-03-11 at 14:49 +0100, Martin Schwidefsky wrote:
> > From: Martin Schwidefsky <[email protected]>
> >
> > On s390 the /proc/pid/pagemap interface is currently broken. This is
> > caused by the unconditional loop over all pgd/pud entries as specified
> > by the address range passed to walk_page_range. The tricky bit here
> > is that the pgd++ in the outer loop may only be done if the page table
> > really has 4 levels. For the pud++ in the second loop the page table needs
> > to have at least 3 levels. With the dynamic page tables on s390 we can have
> > page tables with 2, 3 or 4 levels. Which means that the pgd and/or the
> > pud pointer can get out-of-bounds causing all kinds of mayhem.
>
> Not sure why this should be a problem without delving into the S390
> code. After all, x86 has 2, 3, or 4 levels as well (at compile time) in
> a way that's transparent to the walker.

Its hard to understand without looking at the s390 details. The main
difference between x86 and s390 in that respect is that on s390 the
number of page table levels is determined at runtime on a per process
basis. A compat process uses 2 levels, a 64 bit process starts with 3
levels and can "upgrade" to 4 levels if something gets mapped above
4TB. Which means that a *pgd can point to a region-second (2**53 bytes),
a region-third (2**42 bytes) or a segment table (2**31 bytes), a *pud
can point to a region-third or a segment table. The page table
primitives know about this semantic, in particular pud_offset and
pmd_offset check the type of the page table pointed to by *pgd and *pud
and do nothing with the pointer if it is a lower level page table.
The only operation I can not "patch" is the pgd++/pud++ operation.
The current implementation requires that the address bits of the
non-existent higher order page tables in the page table walkers are
zero. This is where the vmas come into play. If there is a vma then is
it guaranteed that all the levels to cover the addresses in the vma are
allocated.

> > The proposed solution is to fast-forward over the hole between the start
> > address and the first vma and the hole between the last vma and the end
> > address. The pgd/pud/pmd/pte loops are used only for the address range
> > between the first and last vma. This guarantees that the page table
> > pointers stay in range for s390. For the other architectures this is
> > a small optimization.
>
> I've gone to lengths to keep VMAs out of the equation, so I can't say
> I'm excited about this solution.

The minimum fix is to add the mmap_sem. If a vma is unmapped while you
walk the page tables, they can get freed. You do have a dependency on
the vma list. All the other page table walkers in mm/ start with the
vma, then do the four loops. It would be consistent if the generic page
table walker would do the same.

Having thought about the problem again, I think I found a way how to
deal with the problem in the s390 page table primitives. The fix is not
exactly nice but it will work. With it s390 will be able to walk
addresses outside of the vma address range.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-12 10:22:40

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

On Thu, 12 Mar 2009 09:33:35 +0100
Martin Schwidefsky <[email protected]> wrote:

> > I've gone to lengths to keep VMAs out of the equation, so I can't say
> > I'm excited about this solution.
>
> The minimum fix is to add the mmap_sem. If a vma is unmapped while you
> walk the page tables, they can get freed. You do have a dependency on
> the vma list. All the other page table walkers in mm/ start with the
> vma, then do the four loops. It would be consistent if the generic page
> table walker would do the same.
>
> Having thought about the problem again, I think I found a way how to
> deal with the problem in the s390 page table primitives. The fix is not
> exactly nice but it will work. With it s390 will be able to walk
> addresses outside of the vma address range.

Ok, the patch below fixes the problem without vma operations in the
generic page table walker. We still need the mmap_sem part though.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

---
Subject: [PATCH] s390: make page table walking more robust

From: Martin Schwidefsky <[email protected]>

Make page table walking on s390 more robust. Currently the four loops
over the pgd/pud/pmd/pte tables may only be done if the address range
of the walk is below the end address of the last vma of the address
space. The reason is the dynamic page table code on s390. A *pgd can
point to a region-second, a region-third or a segment table and a *pud
can point to a region-third or a segment table. The page table primitives
can determine the level of a pgd/pud table by looking at two bits in
any of the entries of the table. The pgd_present primitive always returns
1 if the *pgd does not point to a region-second table, pud_present always
returs 1 if the *pud does not point to a region-third table. pud_offset
and pmd_offset check the type of the table pointed to by *pgd and *pud
and either just cast the pointer to the type of the next lower level
table, or if the level of the table is correct they read the entry from
the table. This all only works if the address bits for the potentially
missing higher page tables are zero. As long as the address of the walk
stays smaller than the end address of the last vma this works.

The generic page table walker ignores the list of vmas and can be used to
walk page table ranges behind the end address of the last vma. If the
process is using a reduced page table nasty things happen.

In case of a reduced page table pgd_present and/or pud_present should
return true only of the address bits in the pgd/pud pointer of the missing
page table is zero. The effect of this changes is that the loop over the
pgd/pud table is done on the lower level. For each of the entries but
the first pgd_present returns false and the outer loops just continue.

Signed-off-by: Martin Schwidefsky <[email protected]>
---

arch/s390/include/asm/pgtable.h | 4 ++--
fs/proc/task_mmu.c | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)

diff -urpN linux-2.6/arch/s390/include/asm/pgtable.h linux-2.6-patched/arch/s390/include/asm/pgtable.h
--- linux-2.6/arch/s390/include/asm/pgtable.h 2009-03-12 10:34:01.000000000 +0100
+++ linux-2.6-patched/arch/s390/include/asm/pgtable.h 2009-03-12 10:34:43.000000000 +0100
@@ -451,7 +451,7 @@ static inline int pud_bad(pud_t pud) {
static inline int pgd_present(pgd_t pgd)
{
if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) < _REGION_ENTRY_TYPE_R2)
- return 1;
+ return (pgd_val(pgd) & PGDIR_MASK) == 0;
return (pgd_val(pgd) & _REGION_ENTRY_ORIGIN) != 0UL;
}

@@ -478,7 +478,7 @@ static inline int pgd_bad(pgd_t pgd)
static inline int pud_present(pud_t pud)
{
if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) < _REGION_ENTRY_TYPE_R3)
- return 1;
+ return (pud_val(pud) & PUD_MASK) == 0;
return (pud_val(pud) & _REGION_ENTRY_ORIGIN) != 0UL;
}

2009-03-12 11:28:17

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

On Thu, 12 Mar 2009 11:19:16 +0100
Martin Schwidefsky <[email protected]> wrote:

> On Thu, 12 Mar 2009 09:33:35 +0100
> Martin Schwidefsky <[email protected]> wrote:
>
> > > I've gone to lengths to keep VMAs out of the equation, so I can't say
> > > I'm excited about this solution.
> >
> > The minimum fix is to add the mmap_sem. If a vma is unmapped while you
> > walk the page tables, they can get freed. You do have a dependency on
> > the vma list. All the other page table walkers in mm/ start with the
> > vma, then do the four loops. It would be consistent if the generic page
> > table walker would do the same.
> >
> > Having thought about the problem again, I think I found a way how to
> > deal with the problem in the s390 page table primitives. The fix is not
> > exactly nice but it will work. With it s390 will be able to walk
> > addresses outside of the vma address range.
>
> Ok, the patch below fixes the problem without vma operations in the
> generic page table walker. We still need the mmap_sem part though.

Hmm, thinko on my part. If would need the address of the pgd entry to do
what I'm trying to achieve but I only have the pgd entry itself. Back
to the vma operation in walk_page_range I'm afraid.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-12 14:12:53

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

[Nick and Hugh, maybe you can shed some light on this for me]

On Thu, 2009-03-12 at 09:33 +0100, Martin Schwidefsky wrote:
> On Wed, 11 Mar 2009 12:24:23 -0500
> Matt Mackall <[email protected]> wrote:
>
> > On Wed, 2009-03-11 at 14:49 +0100, Martin Schwidefsky wrote:
> > > From: Martin Schwidefsky <[email protected]>
> > >
> > > On s390 the /proc/pid/pagemap interface is currently broken. This is
> > > caused by the unconditional loop over all pgd/pud entries as specified
> > > by the address range passed to walk_page_range. The tricky bit here
> > > is that the pgd++ in the outer loop may only be done if the page table
> > > really has 4 levels. For the pud++ in the second loop the page table needs
> > > to have at least 3 levels. With the dynamic page tables on s390 we can have
> > > page tables with 2, 3 or 4 levels. Which means that the pgd and/or the
> > > pud pointer can get out-of-bounds causing all kinds of mayhem.
> >
> > Not sure why this should be a problem without delving into the S390
> > code. After all, x86 has 2, 3, or 4 levels as well (at compile time) in
> > a way that's transparent to the walker.
>
> Its hard to understand without looking at the s390 details. The main
> difference between x86 and s390 in that respect is that on s390 the
> number of page table levels is determined at runtime on a per process
> basis. A compat process uses 2 levels, a 64 bit process starts with 3
> levels and can "upgrade" to 4 levels if something gets mapped above
> 4TB. Which means that a *pgd can point to a region-second (2**53 bytes),
> a region-third (2**42 bytes) or a segment table (2**31 bytes), a *pud
> can point to a region-third or a segment table. The page table
> primitives know about this semantic, in particular pud_offset and
> pmd_offset check the type of the page table pointed to by *pgd and *pud
> and do nothing with the pointer if it is a lower level page table.
> The only operation I can not "patch" is the pgd++/pud++ operation.

So in short, sometimes a pgd_t isn't really a pgd_t at all. It's another
object with different semantics that generic code can trip over.

Can I get you to explain why this is necessary or even preferable to
doing it the generic way where pgd_t has a fixed software meaning
regardless of how many hardware levels are in play?

--
http://selenic.com : development and support for Mercurial and Linux

2009-03-12 14:46:26

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

On Thu, 12 Mar 2009 09:10:14 -0500
Matt Mackall <[email protected]> wrote:

> [Nick and Hugh, maybe you can shed some light on this for me]
>
> On Thu, 2009-03-12 at 09:33 +0100, Martin Schwidefsky wrote:
> > On Wed, 11 Mar 2009 12:24:23 -0500
> > Matt Mackall <[email protected]> wrote:
> >
> > > On Wed, 2009-03-11 at 14:49 +0100, Martin Schwidefsky wrote:
> > > > From: Martin Schwidefsky <[email protected]>
> > > >
> > > > On s390 the /proc/pid/pagemap interface is currently broken. This is
> > > > caused by the unconditional loop over all pgd/pud entries as specified
> > > > by the address range passed to walk_page_range. The tricky bit here
> > > > is that the pgd++ in the outer loop may only be done if the page table
> > > > really has 4 levels. For the pud++ in the second loop the page table needs
> > > > to have at least 3 levels. With the dynamic page tables on s390 we can have
> > > > page tables with 2, 3 or 4 levels. Which means that the pgd and/or the
> > > > pud pointer can get out-of-bounds causing all kinds of mayhem.
> > >
> > > Not sure why this should be a problem without delving into the S390
> > > code. After all, x86 has 2, 3, or 4 levels as well (at compile time) in
> > > a way that's transparent to the walker.
> >
> > Its hard to understand without looking at the s390 details. The main
> > difference between x86 and s390 in that respect is that on s390 the
> > number of page table levels is determined at runtime on a per process
> > basis. A compat process uses 2 levels, a 64 bit process starts with 3
> > levels and can "upgrade" to 4 levels if something gets mapped above
> > 4TB. Which means that a *pgd can point to a region-second (2**53 bytes),
> > a region-third (2**42 bytes) or a segment table (2**31 bytes), a *pud
> > can point to a region-third or a segment table. The page table
> > primitives know about this semantic, in particular pud_offset and
> > pmd_offset check the type of the page table pointed to by *pgd and *pud
> > and do nothing with the pointer if it is a lower level page table.
> > The only operation I can not "patch" is the pgd++/pud++ operation.
>
> So in short, sometimes a pgd_t isn't really a pgd_t at all. It's another
> object with different semantics that generic code can trip over.

Then what exactly is a pgd_t? For me it is the top level page table
which can have very different meaning for the various architectures.

> Can I get you to explain why this is necessary or even preferable to
> doing it the generic way where pgd_t has a fixed software meaning
> regardless of how many hardware levels are in play?

Well, the hardware can do up to 5 levels of page tables for the full
64 bit address space. With the introduction of pud's we wanted to
extend our address space from 3 levels / 42 bits to 4 levels / 53 bits.
But this comes at a cost: additional page table levels cost memory and
performance. In particular for the compat processes which can only
address a maximum of 2 GB it is a waste to allocate 4 levels. With the
dynamic page tables we allocate as much as required by each process.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-12 16:01:06

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

On Thu, 2009-03-12 at 15:42 +0100, Martin Schwidefsky wrote:
> On Thu, 12 Mar 2009 09:10:14 -0500
> Matt Mackall <[email protected]> wrote:
>
> > [Nick and Hugh, maybe you can shed some light on this for me]
> >
> > On Thu, 2009-03-12 at 09:33 +0100, Martin Schwidefsky wrote:
> > > On Wed, 11 Mar 2009 12:24:23 -0500
> > > Matt Mackall <[email protected]> wrote:
> > >
> > > > On Wed, 2009-03-11 at 14:49 +0100, Martin Schwidefsky wrote:
> > > > > From: Martin Schwidefsky <[email protected]>
> > > > >
> > > > > On s390 the /proc/pid/pagemap interface is currently broken. This is
> > > > > caused by the unconditional loop over all pgd/pud entries as specified
> > > > > by the address range passed to walk_page_range. The tricky bit here
> > > > > is that the pgd++ in the outer loop may only be done if the page table
> > > > > really has 4 levels. For the pud++ in the second loop the page table needs
> > > > > to have at least 3 levels. With the dynamic page tables on s390 we can have
> > > > > page tables with 2, 3 or 4 levels. Which means that the pgd and/or the
> > > > > pud pointer can get out-of-bounds causing all kinds of mayhem.
> > > >
> > > > Not sure why this should be a problem without delving into the S390
> > > > code. After all, x86 has 2, 3, or 4 levels as well (at compile time) in
> > > > a way that's transparent to the walker.
> > >
> > > Its hard to understand without looking at the s390 details. The main
> > > difference between x86 and s390 in that respect is that on s390 the
> > > number of page table levels is determined at runtime on a per process
> > > basis. A compat process uses 2 levels, a 64 bit process starts with 3
> > > levels and can "upgrade" to 4 levels if something gets mapped above
> > > 4TB. Which means that a *pgd can point to a region-second (2**53 bytes),
> > > a region-third (2**42 bytes) or a segment table (2**31 bytes), a *pud
> > > can point to a region-third or a segment table. The page table
> > > primitives know about this semantic, in particular pud_offset and
> > > pmd_offset check the type of the page table pointed to by *pgd and *pud
> > > and do nothing with the pointer if it is a lower level page table.
> > > The only operation I can not "patch" is the pgd++/pud++ operation.
> >
> > So in short, sometimes a pgd_t isn't really a pgd_t at all. It's another
> > object with different semantics that generic code can trip over.
>
> Then what exactly is a pgd_t? For me it is the top level page table
> which can have very different meaning for the various architectures.

The important thing is that it's always 3 levels removed from the
bottom, whether or not those 3 levels actually have hardware
manifestations. From your description, it sounds like that's not how
things work in S390 land.

> > Can I get you to explain why this is necessary or even preferable to
> > doing it the generic way where pgd_t has a fixed software meaning
> > regardless of how many hardware levels are in play?
>
> Well, the hardware can do up to 5 levels of page tables for the full
> 64 bit address space. With the introduction of pud's we wanted to
> extend our address space from 3 levels / 42 bits to 4 levels / 53 bits.
> But this comes at a cost: additional page table levels cost memory and
> performance. In particular for the compat processes which can only
> address a maximum of 2 GB it is a waste to allocate 4 levels. With the
> dynamic page tables we allocate as much as required by each process.

X86 uses 1-entry tables at higher levels to maintain consistency with
fairly minimal overhead. In some of the sillier addressing modes, we may
even use a 4-entry table in some places. I think table size is fixed at
compile time, but I don't think that's essential. Very little code in
the x86 architecture has any notion of how many hardware levels actually
exist.

--
http://selenic.com : development and support for Mercurial and Linux

2009-03-16 12:32:32

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

On Thu, 12 Mar 2009 10:58:14 -0500
Matt Mackall <[email protected]> wrote:

> On Thu, 2009-03-12 at 15:42 +0100, Martin Schwidefsky wrote:
> > Then what exactly is a pgd_t? For me it is the top level page table
> > which can have very different meaning for the various architectures.
>
> The important thing is that it's always 3 levels removed from the
> bottom, whether or not those 3 levels actually have hardware
> manifestations. From your description, it sounds like that's not how
> things work in S390 land.

With the page table folding "3 levels removed from the bottom" doesn't
tell me much since there is no real representation in hardware AND in
memory for the missing page table levels. So the only valid meaning of
a pgd_t is that you have to use pud_offset, pmd_offset and pte_offset
to get to a pte. If I do the page table folding at runtime or at
compile time is a minor detail.

> > Well, the hardware can do up to 5 levels of page tables for the full
> > 64 bit address space. With the introduction of pud's we wanted to
> > extend our address space from 3 levels / 42 bits to 4 levels / 53 bits.
> > But this comes at a cost: additional page table levels cost memory and
> > performance. In particular for the compat processes which can only
> > address a maximum of 2 GB it is a waste to allocate 4 levels. With the
> > dynamic page tables we allocate as much as required by each process.
>
> X86 uses 1-entry tables at higher levels to maintain consistency with
> fairly minimal overhead. In some of the sillier addressing modes, we may
> even use a 4-entry table in some places. I think table size is fixed at
> compile time, but I don't think that's essential. Very little code in
> the x86 architecture has any notion of how many hardware levels actually
> exist.

Indeed very little code needs to know how many page table levels
exist. The page table folding works as long as the access to a
particular page is done with the sequence

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
pte = pte_offset(pmd, address);

The indivitual pointers pgd/pud/pmd/pte can be incremented as long as
they stay in the valid address range, e.g. pmd_addr_end checks for the
next pmd segment boundary and the end address of the walk.

If the page table folding is static or dynamic is irrelevant. The only
thing we are arguing is what makes a valid end address for a walk. It
has to be smaller than TASK_SIZE. With the current definitions the s390
code has the additional assumption that the address has to be smaller
than the highest vma as well. The patch below changes TASK_SIZE to
reflect the size of the address space in use by the process. Then the
generic page table walker works fine. What doesn't work anymore is the
automatic upgrade from 3 to 4 levels via mmap. I'm still thinking about
a clever solution, the best I have so far is a patch that introduces
TASK_SIZE_MAX which reflects the maximum possible size as opposed to
TASK_SIZE that gives you the current size. The code in do_mmap_pgoff
then uses TASK_SIZE_MAX instead of TASK_SIZE.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

---
Subject: [PATCH] make page table walking more robust

From: Martin Schwidefsky <[email protected]>

Make page table walking on s390 more robust. The current code requires
that the pgd/pud/pmd/pte loop is only done for address ranges that are
below the end address of the last vma of the address space. But this
is not always true, e.g. the generic page table walker does not
guarantee this. Change TASK_SIZE/TASK_SIZE_OF to reflect the current
size of the address space. This makes the generic page table walker
happy but it breaks the upgrade of a 3 level page table to a 4 level
page table. To make the upgrade work again another fix is required.

Signed-off-by: Martin Schwidefsky <[email protected]>
---

arch/s390/include/asm/processor.h | 5 ++---
arch/s390/mm/mmap.c | 4 ++--
arch/s390/mm/pgtable.c | 2 ++
3 files changed, 6 insertions(+), 5 deletions(-)

diff -urpN linux-2.6/arch/s390/include/asm/processor.h
linux-2.6-patched/arch/s390/include/asm/processor.h ---
linux-2.6/arch/s390/include/asm/processor.h 2009-03-16
12:24:26.000000000 +0100 +++
linux-2.6-patched/arch/s390/include/asm/processor.h 2009-03-16
12:24:28.000000000 +0100 @@ -47,7 +47,7 @@ extern void
print_cpu_info(void); extern int get_cpu_capability(unsigned int *); /*
- * User space process size: 2GB for 31 bit, 4TB for 64 bit.
+ * User space process size: 2GB for 31 bit, 4TB or 8PT for 64 bit.
*/
#ifndef __s390x__

@@ -56,8 +56,7 @@ extern int get_cpu_capability(unsigned i

#else /* __s390x__ */

-#define TASK_SIZE_OF(tsk)
(test_tsk_thread_flag(tsk,TIF_31BIT) ? \
- (1UL << 31) : (1UL << 53))
+#define TASK_SIZE_OF(tsk) ((tsk)->mm->context.asce_limit)
#define TASK_UNMAPPED_BASE (test_thread_flag(TIF_31BIT) ? \
(1UL << 30) : (1UL << 41))
#define TASK_SIZE TASK_SIZE_OF(current)
diff -urpN linux-2.6/arch/s390/mm/mmap.c
linux-2.6-patched/arch/s390/mm/mmap.c ---
linux-2.6/arch/s390/mm/mmap.c 2008-12-25 00:26:37.000000000
+0100 +++ linux-2.6-patched/arch/s390/mm/mmap.c 2009-03-16
12:24:28.000000000 +0100 @@ -35,7 +35,7 @@
* Leave an at least ~128 MB hole.
*/
#define MIN_GAP (128*1024*1024)
-#define MAX_GAP (TASK_SIZE/6*5)
+#define MAX_GAP (STACK_TOP/6*5)

static inline unsigned long mmap_base(void)
{
@@ -46,7 +46,7 @@ static inline unsigned long mmap_base(vo
else if (gap > MAX_GAP)
gap = MAX_GAP;

- return TASK_SIZE - (gap & PAGE_MASK);
+ return STACK_TOP - (gap & PAGE_MASK);
}

static inline int mmap_is_legacy(void)
diff -urpN linux-2.6/arch/s390/mm/pgtable.c
linux-2.6-patched/arch/s390/mm/pgtable.c ---
linux-2.6/arch/s390/mm/pgtable.c 2009-03-16 12:24:09.000000000
+0100 +++ linux-2.6-patched/arch/s390/mm/pgtable.c 2009-03-16
12:24:28.000000000 +0100 @@ -117,6 +117,7 @@ repeat:
crst_table_init(table, entry); pgd_populate(mm, (pgd_t *) table, (pud_t
*) pgd); mm->pgd = (pgd_t *) table;
+ mm->task_size = mm->context.asce_limit;
table = NULL;
}
spin_unlock(&mm->page_table_lock);
@@ -154,6 +155,7 @@ void crst_table_downgrade(struct mm_stru
BUG();
}
mm->pgd = (pgd_t *) (pgd_val(*pgd) &
_REGION_ENTRY_ORIGIN);
+ mm->task_size = mm->context.asce_limit;
crst_table_free(mm, (unsigned long *) pgd);
}
update_mm(mm, current);

2009-03-16 12:37:15

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

On Mon, Mar 16, 2009 at 01:27:17PM +0100, Martin Schwidefsky wrote:
> On Thu, 12 Mar 2009 10:58:14 -0500
> Matt Mackall <[email protected]> wrote:
>
> > On Thu, 2009-03-12 at 15:42 +0100, Martin Schwidefsky wrote:
> > > Then what exactly is a pgd_t? For me it is the top level page table
> > > which can have very different meaning for the various architectures.
> >
> > The important thing is that it's always 3 levels removed from the
> > bottom, whether or not those 3 levels actually have hardware
> > manifestations. From your description, it sounds like that's not how
> > things work in S390 land.
>
> With the page table folding "3 levels removed from the bottom" doesn't
> tell me much since there is no real representation in hardware AND in
> memory for the missing page table levels. So the only valid meaning of
> a pgd_t is that you have to use pud_offset, pmd_offset and pte_offset
> to get to a pte. If I do the page table folding at runtime or at
> compile time is a minor detail.

I don't know if it would be helpful to you, but I solve a similar
kind of problem in the lockless radix tree by encoding node height
in the node itself. Maybe you could use some bits in the page table
pointers or even in the struct pages for this.


>
> > > Well, the hardware can do up to 5 levels of page tables for the full
> > > 64 bit address space. With the introduction of pud's we wanted to
> > > extend our address space from 3 levels / 42 bits to 4 levels / 53 bits.
> > > But this comes at a cost: additional page table levels cost memory and
> > > performance. In particular for the compat processes which can only
> > > address a maximum of 2 GB it is a waste to allocate 4 levels. With the
> > > dynamic page tables we allocate as much as required by each process.
> >
> > X86 uses 1-entry tables at higher levels to maintain consistency with
> > fairly minimal overhead. In some of the sillier addressing modes, we may
> > even use a 4-entry table in some places. I think table size is fixed at
> > compile time, but I don't think that's essential. Very little code in
> > the x86 architecture has any notion of how many hardware levels actually
> > exist.
>
> Indeed very little code needs to know how many page table levels
> exist. The page table folding works as long as the access to a
> particular page is done with the sequence
>
> pgd = pgd_offset(mm, address);
> pud = pud_offset(pgd, address);
> pmd = pmd_offset(pud, address);
> pte = pte_offset(pmd, address);
>
> The indivitual pointers pgd/pud/pmd/pte can be incremented as long as
> they stay in the valid address range, e.g. pmd_addr_end checks for the
> next pmd segment boundary and the end address of the walk.
>
> If the page table folding is static or dynamic is irrelevant. The only
> thing we are arguing is what makes a valid end address for a walk. It
> has to be smaller than TASK_SIZE. With the current definitions the s390
> code has the additional assumption that the address has to be smaller
> than the highest vma as well. The patch below changes TASK_SIZE to
> reflect the size of the address space in use by the process. Then the
> generic page table walker works fine. What doesn't work anymore is the
> automatic upgrade from 3 to 4 levels via mmap. I'm still thinking about
> a clever solution, the best I have so far is a patch that introduces
> TASK_SIZE_MAX which reflects the maximum possible size as opposed to
> TASK_SIZE that gives you the current size. The code in do_mmap_pgoff
> then uses TASK_SIZE_MAX instead of TASK_SIZE.
>
> --
> blue skies,
> Martin.
>
> "Reality continues to ruin my life." - Calvin.
>
> ---
> Subject: [PATCH] make page table walking more robust
>
> From: Martin Schwidefsky <[email protected]>
>
> Make page table walking on s390 more robust. The current code requires
> that the pgd/pud/pmd/pte loop is only done for address ranges that are
> below the end address of the last vma of the address space. But this
> is not always true, e.g. the generic page table walker does not
> guarantee this. Change TASK_SIZE/TASK_SIZE_OF to reflect the current
> size of the address space. This makes the generic page table walker
> happy but it breaks the upgrade of a 3 level page table to a 4 level
> page table. To make the upgrade work again another fix is required.
>
> Signed-off-by: Martin Schwidefsky <[email protected]>
> ---
>
> arch/s390/include/asm/processor.h | 5 ++---
> arch/s390/mm/mmap.c | 4 ++--
> arch/s390/mm/pgtable.c | 2 ++
> 3 files changed, 6 insertions(+), 5 deletions(-)
>
> diff -urpN linux-2.6/arch/s390/include/asm/processor.h
> linux-2.6-patched/arch/s390/include/asm/processor.h ---
> linux-2.6/arch/s390/include/asm/processor.h 2009-03-16
> 12:24:26.000000000 +0100 +++
> linux-2.6-patched/arch/s390/include/asm/processor.h 2009-03-16
> 12:24:28.000000000 +0100 @@ -47,7 +47,7 @@ extern void
> print_cpu_info(void); extern int get_cpu_capability(unsigned int *); /*
> - * User space process size: 2GB for 31 bit, 4TB for 64 bit.
> + * User space process size: 2GB for 31 bit, 4TB or 8PT for 64 bit.
> */
> #ifndef __s390x__
>
> @@ -56,8 +56,7 @@ extern int get_cpu_capability(unsigned i
>
> #else /* __s390x__ */
>
> -#define TASK_SIZE_OF(tsk)
> (test_tsk_thread_flag(tsk,TIF_31BIT) ? \
> - (1UL << 31) : (1UL << 53))
> +#define TASK_SIZE_OF(tsk) ((tsk)->mm->context.asce_limit)
> #define TASK_UNMAPPED_BASE (test_thread_flag(TIF_31BIT) ? \
> (1UL << 30) : (1UL << 41))
> #define TASK_SIZE TASK_SIZE_OF(current)
> diff -urpN linux-2.6/arch/s390/mm/mmap.c
> linux-2.6-patched/arch/s390/mm/mmap.c ---
> linux-2.6/arch/s390/mm/mmap.c 2008-12-25 00:26:37.000000000
> +0100 +++ linux-2.6-patched/arch/s390/mm/mmap.c 2009-03-16
> 12:24:28.000000000 +0100 @@ -35,7 +35,7 @@
> * Leave an at least ~128 MB hole.
> */
> #define MIN_GAP (128*1024*1024)
> -#define MAX_GAP (TASK_SIZE/6*5)
> +#define MAX_GAP (STACK_TOP/6*5)
>
> static inline unsigned long mmap_base(void)
> {
> @@ -46,7 +46,7 @@ static inline unsigned long mmap_base(vo
> else if (gap > MAX_GAP)
> gap = MAX_GAP;
>
> - return TASK_SIZE - (gap & PAGE_MASK);
> + return STACK_TOP - (gap & PAGE_MASK);
> }
>
> static inline int mmap_is_legacy(void)
> diff -urpN linux-2.6/arch/s390/mm/pgtable.c
> linux-2.6-patched/arch/s390/mm/pgtable.c ---
> linux-2.6/arch/s390/mm/pgtable.c 2009-03-16 12:24:09.000000000
> +0100 +++ linux-2.6-patched/arch/s390/mm/pgtable.c 2009-03-16
> 12:24:28.000000000 +0100 @@ -117,6 +117,7 @@ repeat:
> crst_table_init(table, entry); pgd_populate(mm, (pgd_t *) table, (pud_t
> *) pgd); mm->pgd = (pgd_t *) table;
> + mm->task_size = mm->context.asce_limit;
> table = NULL;
> }
> spin_unlock(&mm->page_table_lock);
> @@ -154,6 +155,7 @@ void crst_table_downgrade(struct mm_stru
> BUG();
> }
> mm->pgd = (pgd_t *) (pgd_val(*pgd) &
> _REGION_ENTRY_ORIGIN);
> + mm->task_size = mm->context.asce_limit;
> crst_table_free(mm, (unsigned long *) pgd);
> }
> update_mm(mm, current);

2009-03-16 13:02:07

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH] fix/improve generic page table walker

On Mon, 16 Mar 2009 13:36:54 +0100
Nick Piggin <[email protected]> wrote:

> > With the page table folding "3 levels removed from the bottom" doesn't
> > tell me much since there is no real representation in hardware AND in
> > memory for the missing page table levels. So the only valid meaning of
> > a pgd_t is that you have to use pud_offset, pmd_offset and pte_offset
> > to get to a pte. If I do the page table folding at runtime or at
> > compile time is a minor detail.
>
> I don't know if it would be helpful to you, but I solve a similar
> kind of problem in the lockless radix tree by encoding node height
> in the node itself. Maybe you could use some bits in the page table
> pointers or even in the struct pages for this.

That is what I already do: there are two bits in the region and segment
table entries that tell me at what level I am (well actually it is the
hardware definition that requires me to do that and I just make use of
it). The page table primitives (pxd_present, pxd_offset, etc) look at
these bits and then do the right thing.
What is killing me is the pgd++/pud++ operation. If there is only a 2
or 3 level page table the pointer increase may not happen. This is done
by a correct end address for the walk.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.