2004-04-13 23:19:29

by Chen, Kenneth W

Subject: hugetlb demand paging patch part [0/3]

In addition to the hugetlb commit handling that we've been working on
off the list, Ray Bryant of SGI and I are also working on demand paging
for hugetlb pages. Here is our final version, which has been heavily
tested on ia64 and x86. I've broken the patch into 3 pieces so it's
easier to read/review, etc.

1. hugetlb_fix_pte.patch - with demand paging, we cannot unconditionally
assume a valid pmd/pte. Fix it up in the arch specific huge_pte_offset()
and have all callers check the return value.

2. hugetlb_demand_generic.patch - this handles the bulk of hugetlb demand
paging for the generic portion of the kernel. I've put the hugetlb fault
handler in mm/hugetlbpage.c since the fault handler is *exactly* the
same for all arches, but that requires opening up the huge_pte_alloc() and
set_huge_pte() functions in each arch. If people object to where it
lives, it would take me less than a minute to delete the common
code and replicate it in each of the 5 arches that support hugetlb.
Just let me know if that's the case.

3. hugetlb_demand_arch.patch - this adds additional arch specific fixes
for x86 and ia64 when generic demand paging is turned on. The bulk
of the patch cleans up functions that are no longer needed.
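
To illustrate the point in item 1, here is a minimal userspace sketch of a
caller that tolerates a missing entry. It is not code from the patches: the
struct toy_mm type, the slot array, the 16MB huge page size and the
count_present() helper are all assumptions made up for illustration, and this
huge_pte_offset() only borrows the name of the real arch function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
typedef uint64_t pte_t;

#define HPAGE_SHIFT 24                  /* assumed 16MB huge pages */
#define HPAGE_SIZE  (1UL << HPAGE_SHIFT)
#define NR_SLOTS    8

/* Toy "page table": one pointer per huge page, NULL if never allocated. */
struct toy_mm {
	pte_t *slot[NR_SLOTS];
};

/* Returns NULL when no table entry exists for addr yet. */
static pte_t *huge_pte_offset(struct toy_mm *mm, unsigned long addr)
{
	unsigned long idx = addr >> HPAGE_SHIFT;

	if (idx >= NR_SLOTS)
		return NULL;
	return mm->slot[idx];
}

/* Count populated huge pages in [start, end), skipping holes. */
static unsigned long count_present(struct toy_mm *mm,
				   unsigned long start, unsigned long end)
{
	unsigned long addr, present = 0;

	assert(start <= end);
	for (addr = start; addr < end; addr += HPAGE_SIZE) {
		pte_t *pte = huge_pte_offset(mm, addr);

		/* With demand paging a hole is normal, not a bug. */
		if (!pte || *pte == 0)
			continue;
		present++;
	}
	return present;
}
```

Before demand paging, a walker like this could dereference the result
directly because every page in the range was prefaulted; the NULL check is
the part that becomes mandatory.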

Some caveats: I don't have sh and sparc64 hardware to test on, but the hugetlb
code in these two arches looks like a near-identical twin of the x86 code, so
I'm pretty sure it will work right out of the box. I've monkeyed around with the
ppc64 code and after a while I realized it should be left for the experts.
I'm sure there are plenty of ppc64 developers out there who can get it done
in no time.

Patches are relative to linux-2.6.5-mm4 and apply on top of the hugetlb
overcommit handling patch posted by Andy Whitcroft.

Andrew, would you please review and consider for -mm? Thanks.

- Ken



2004-04-14 09:05:38

by Arjan van de Ven

Subject: Re: hugetlb demand paging patch part [0/3]

On Wed, 2004-04-14 at 01:17, Chen, Kenneth W wrote:
> In addition to the hugetlb commit handling that we've been working on
> off the list, Ray Bryant of SGI and I are also working on demand paging
> for hugetlb page. Here are our final version that has been heavily
> tested on ia64 and x86. I've broken the patch into 3 pieces so it's
> easier to read/review, etc.

Ok I think it's time to say "HO STOP" here.

If you're going to make the kernel deal with different, concurrent page
sizes then please do it for real. Or alternatively leave hugetlb to be
the kludge/hack it is right now. Anything in between is the road to
madness...



2004-04-14 10:04:28

by Andy Whitcroft

Subject: Re: hugetlb demand paging patch part [0/3]

--On 14 April 2004 11:04 +0200 Arjan van de Ven <[email protected]> wrote:

> On Wed, 2004-04-14 at 01:17, Chen, Kenneth W wrote:
>> In addition to the hugetlb commit handling that we've been working on
>> off the list, Ray Bryant of SGI and I are also working on demand paging
>> for hugetlb page. Here are our final version that has been heavily
>> tested on ia64 and x86. I've broken the patch into 3 pieces so it's
>> easier to read/review, etc.
>
> Ok I think it's time to say "HO STOP" here.

I would say yes for 2.7. Then things like actual swapping of large pages
and the like could be done properly.

> If you're going to make the kernel deal with different, concurrent page
> sizes then please do it for real. Or alternatively leave hugetlb to be
> the kludge/hack it is right now. Anything inbetween is the road to
> madness...

The original request was not to generify for 2.6. Thus all of this work
has been to fix or improve the usability of the kludge without removing its
cancerous-sore-like nature. I think that requires a radical rethink and is
not 2.6 material.

-apw

2004-04-14 15:27:34

by Martin J. Bligh

Subject: Re: [Lse-tech] Re: hugetlb demand paging patch part [0/3]

>> In addition to the hugetlb commit handling that we've been working on
>> off the list, Ray Bryant of SGI and I are also working on demand paging
>> for hugetlb page. Here are our final version that has been heavily
>> tested on ia64 and x86. I've broken the patch into 3 pieces so it's
>> easier to read/review, etc.
>
> Ok I think it's time to say "HO STOP" here.
>
> If you're going to make the kernel deal with different, concurrent page
> sizes then please do it for real. Or alternatively leave hugetlb to be
> the kludge/hack it is right now. Anything inbetween is the road to
> madness...

I'd prefer to see it walk step by step to "doing it for real" than have
a huge cataclysmic patch that breaks everything ....

M.

2004-04-14 15:30:57

by Martin J. Bligh

Subject: Re: [Lse-tech] Re: hugetlb demand paging patch part [0/3]

>>> In addition to the hugetlb commit handling that we've been working on
>>> off the list, Ray Bryant of SGI and I are also working on demand paging
>>> for hugetlb page. Here are our final version that has been heavily
>>> tested on ia64 and x86. I've broken the patch into 3 pieces so it's
>>> easier to read/review, etc.
>>
>> Ok I think it's time to say "HO STOP" here.
>>
>> If you're going to make the kernel deal with different, concurrent page
>> sizes then please do it for real. Or alternatively leave hugetlb to be
>> the kludge/hack it is right now. Anything inbetween is the road to
>> madness...
>
> I'd prefer to see it walk step by step to "doing it for real" than have
> a huge cataclysmic patch that breaks everything ....

Hmm - maybe that could be misinterpreted ;-) I meant that the patches
discussed here are the steps (i.e. good ;-)), not the cataclysmic event.

M.

2004-04-15 01:26:01

by David Gibson

Subject: Re: hugetlb demand paging patch part [0/3]

On Wed, Apr 14, 2004 at 11:04:02AM +0200, Arjan van de Ven wrote:
> On Wed, 2004-04-14 at 01:17, Chen, Kenneth W wrote:
> > In addition to the hugetlb commit handling that we've been working on
> > off the list, Ray Bryant of SGI and I are also working on demand paging
> > for hugetlb page. Here are our final version that has been heavily
> > tested on ia64 and x86. I've broken the patch into 3 pieces so it's
> > easier to read/review, etc.
>
> Ok I think it's time to say "HO STOP" here.
>
> If you're going to make the kernel deal with different, concurrent page
> sizes then please do it for real. Or alternatively leave hugetlb to be
> the kludge/hack it is right now. Anything inbetween is the road to
> madness...

Well, bear in mind that in a number of ways these patches actually
simplify the hugetlb code, although I think most of that is not
inherently related to making the hugepage allocation on-demand rather
than prefaulted. Nonetheless, doing the demand allocation is actually
really easy. Even if you add COW as well, which these patches don't,
it doesn't actually make the hack any worse than it was already, but
it does make it more useful.

--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson



2004-04-15 06:47:15

by David Gibson

Subject: Re: hugetlb demand paging patch part [0/3]

On Tue, Apr 13, 2004 at 04:17:04PM -0700, Chen, Kenneth W wrote:
> In addition to the hugetlb commit handling that we've been working on
> off the list, Ray Bryant of SGI and I are also working on demand paging
> for hugetlb page. Here are our final version that has been heavily
> tested on ia64 and x86. I've broken the patch into 3 pieces so it's
> easier to read/review, etc.
>
> 1. hugetlb_fix_pte.patch - with demand paging, we can not unconditionally
> assume valid pmd/pte. Fix it up in arch specific huge_pte_offset()
> and have all caller check the return value.
>
> 2. hugetlb_demand_generic.patch - this handles bulk of hugetlb demand
> paging for generic portion of the kernel. I've put hugetlb fault
> handler in mm/hugetlbpage.c since the fault handler is *exactly* the
> same for all arch, but that requires opening up huge_pte_alloc() and
> set_huge_pte() functions in each arch. If people object where it
> should live. It takes me less than a minute to delete the common
> code and replicate it in each of the 5 arch that supports hugetlb.
> Just let me know if that's the case.
>
> 3. hugetlb_demand_arch.patch - this adds additional arch specific fixes
> for x86 and ia64 when generic demand paging is turned on. Also bulk
> of the patch is to clean up with functions that no longer needed.
>
> Some caveats: I don't have sh and sparc64 hardware to test. But hugetlb
> code in these two arch looked like a triplet twin of x86 code. So I'm
> pretty sure it will work right out of box. I've monkeyed around with
> ppc64 code and after a while I realized it should be left for the experts.
> I'm sure there are plenty ppc64 developers out there that can get it done
> in no time.

To the extent that I understand your patches, it shouldn't be that
hard to adapt for ppc64, with one caveat: on ppc64, unlike the other
hugepage archs, the format of hugepage PTEs is not identical to the
format of normal PTEs. So to do this for ppc64, the generic parts of
your code will need to use a hugepte_t instead of pte_t - it can be
typedeffed to pte_t on archs other than ppc64. Likewise there will
need to be hugepte_none() and so forth macros.
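
As a rough sketch of what that could look like (not taken from any patch):
the TOY_ARCH_HAS_HUGEPTE switch, the struct wrappers and hugepte_clear() are
assumptions made up for illustration, and only the hugepte_t and
hugepte_none() names come from the paragraph above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the normal PTE type; the real one is arch specific. */
typedef struct { uint64_t val; } pte_t;

#ifdef TOY_ARCH_HAS_HUGEPTE
/* An arch with its own huge PTE format supplies its own type. */
typedef struct { uint32_t val; } hugepte_t;
#else
/* Everyone else just aliases the normal PTE type. */
typedef pte_t hugepte_t;
#endif

/* Generic code only touches huge PTEs through helpers like these. */
static inline bool hugepte_none(hugepte_t e)
{
	return e.val == 0;
}

static inline void hugepte_clear(hugepte_t *e)
{
	e->val = 0;
}
```

On archs where the formats already match, the typedef costs nothing and the
helpers reduce to the existing PTE checks.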

However, there seem to be some changes in your patches that aren't
directly related to doing demand-paging (though they might be a good
thing for other reasons). Further comments in reply to the actual
patches.

--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson

2004-04-15 17:09:13

by Chen, Kenneth W

Subject: RE: hugetlb demand paging patch part [0/3]

>>>> David Gibson wrote on Wednesday, April 14, 2004 11:43 PM
> >
> > Some caveats: I don't have sh and sparc64 hardware to test. But hugetlb
> > code in these two arch looked like a triplet twin of x86 code. So I'm
> > pretty sure it will work right out of box. I've monkeyed around with
> > ppc64 code and after a while I realized it should be left for the experts.
> > I'm sure there are plenty ppc64 developers out there that can get it done
> > in no time.
>
> To the extent that I understand your patches, it shouldn't be that
> hard to adapt for ppc64, with one caveat: on ppc64, unlike the other
> hugepage archs, the format of hugepage PTEs is not identical to the
> format of normal PTEs. So to do this for ppc64, the generic parts of
> your code will need to use a hugepte_t instead of pte_t - it can be
> typedeffed to pte_t on archs other than ppc64. Likewise there will
> need to be hugepte_none() and so forth macros.

I think it would be cleaner if ppc64 changed its format instead of changing
4 other arches to accommodate ppc64. By the way, why do you need a special
hugepte_t typedef? The pte for a huge page isn't anything special on any of
the other arches.

- Ken



2004-04-15 18:49:08

by Chen, Kenneth W

Subject: RE: hugetlb demand paging patch part [0/3]

>>>> Chen, Kenneth W wrote on Thursday, April 15, 2004 10:08 AM
> >>>> David Gibson wrote on Wednesday, April 14, 2004 11:43 PM
> > >
> > > Some caveats: I don't have sh and sparc64 hardware to test. But hugetlb
> > > code in these two arch looked like a triplet twin of x86 code. So I'm
> > > pretty sure it will work right out of box. I've monkeyed around with
> > > ppc64 code and after a while I realized it should be left for the experts.
> > > I'm sure there are plenty ppc64 developers out there that can get it done
> > > in no time.
> >
> > To the extent that I understand your patches, it shouldn't be that
> > hard to adapt for ppc64, with one caveat: on ppc64, unlike the other
> > hugepage archs, the format of hugepage PTEs is not identical to the
> > format of normal PTEs. So to do this for ppc64, the generic parts of
> > your code will need to use a hugepte_t instead of pte_t - it can be
> > typedeffed to pte_t on archs other than ppc64. Likewise there will
> > need to be hugepte_none() and so forth macros.
>
> I think it would be cleaner if ppc64 change its format instead of changing
> 4 other arch to accommodate ppc64. By the way, why do you need to special
> typedef hugepte_t? pte for huge page aren't anything special on all other
> arches.

Again, I'm not an expert on ppc64, but this looks suspicious to me:

arch/ppc64/mm/hugetlbpage.c
/* HugePTE layout:
 *
 * 31 30 ... 15 14 13 12 10 9  8  7   6    5    4    3    2    1    0
 * PFN>>12..... -  -  -  -  -  -  HASH_IX....   2ND  HASH RW   -    HG=1
 */

#define HUGEPTE_SHIFT 15

17 bits for the pfn? It looks to me like huge pages on ppc64 will break on
a system with more than 2 terabytes of physical memory.
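
A back-of-envelope check of the 2 terabyte figure, as a minimal sketch. It
assumes a 32-bit entry and 4KB base pages, so that the stored PFN>>12 counts
16MB units; the HUGEPTE_BITS, PAGE_SHIFT and PFN_DROP constants and the helper
name are made up for illustration, and only HUGEPTE_SHIFT comes from the
source quoted above.

```c
#include <stdint.h>

#define HUGEPTE_BITS   32   /* assumed entry width (bits 31..0 in the layout) */
#define HUGEPTE_SHIFT  15   /* from arch/ppc64/mm/hugetlbpage.c as quoted */
#define PAGE_SHIFT     12   /* assumed 4KB base pages */
#define PFN_DROP       12   /* the layout stores PFN >> 12 */

/* Largest physical address range the PFN field can describe, in bytes. */
static uint64_t hugepte_max_phys_bytes(void)
{
	unsigned int field_bits = HUGEPTE_BITS - HUGEPTE_SHIFT; /* 17 bits */
	unsigned int unit_shift = PAGE_SHIFT + PFN_DROP;        /* 2^24 = 16MB */

	return (uint64_t)1 << (field_bits + unit_shift);        /* 2^41 bytes */
}
```

Under those assumptions 17 bits of 16MB units come to 2^41 bytes, which is
where the 2 terabyte limit comes from.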


2004-04-16 02:37:00

by David Gibson

Subject: Re: hugetlb demand paging patch part [0/3]

On Thu, Apr 15, 2004 at 11:42:55AM -0700, Chen, Kenneth W wrote:
> >>>> Chen, Kenneth W wrote on Thursday, April 15, 2004 10:08 AM
> > >>>> David Gibson wrote on Wednesday, April 14, 2004 11:43 PM
> > > >
> > > > Some caveats: I don't have sh and sparc64 hardware to test. But hugetlb
> > > > code in these two arch looked like a triplet twin of x86 code. So I'm
> > > > pretty sure it will work right out of box. I've monkeyed around with
> > > > ppc64 code and after a while I realized it should be left for the experts.
> > > > I'm sure there are plenty ppc64 developers out there that can get it done
> > > > in no time.
> > >
> > > To the extent that I understand your patches, it shouldn't be that
> > > hard to adapt for ppc64, with one caveat: on ppc64, unlike the other
> > > hugepage archs, the format of hugepage PTEs is not identical to the
> > > format of normal PTEs. So to do this for ppc64, the generic parts of
> > > your code will need to use a hugepte_t instead of pte_t - it can be
> > > typedeffed to pte_t on archs other than ppc64. Likewise there will
> > > need to be hugepte_none() and so forth macros.
> >
> > I think it would be cleaner if ppc64 change its format instead of changing
> > 4 other arch to accommodate ppc64. By the way, why do you need to special
> > typedef hugepte_t? pte for huge page aren't anything special on all other
> > arches.
>
> Again, I'm not an expert on ppc64, but this look suspicious to me:
>
> arch/ppc64/mm/hugetlbpage.c
> /* HugePTE layout:
> *
> * 31 30 ... 15 14 13 12 10 9 8 7 6 5 4 3 2 1 0
> * PFN>>12..... - - - - - - HASH_IX.... 2ND HASH RW - HG=1
> */
>
> #define HUGEPTE_SHIFT 15
>
> 17 bits for pfn? It looks to me that huge page on ppc64 will break on
> system with more than 2 Terabyte of physical memory.

Systems with >2T of physical memory are (as yet) unsupported on
ppc64. The same limitation applies to normal pages.

Yes, that limit will need to be increased in the near future, and when
it does we will need to change this layout (as well as the overall
structure of the page tables, which is where the limit comes from for
normal pages). We can get one extra bit trivially, since bit 1 is as
yet unused. Beyond that we will probably have to make use of the fact
that there are 8 PMD slots for each huge page, and split the PFN
across those slots. That still won't let us use the same format as
normal PTE entries, because we'll need the huge bit set in each PMD
entry.

--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson

2004-04-16 02:39:00

by David Gibson

Subject: Re: hugetlb demand paging patch part [0/3]

On Thu, Apr 15, 2004 at 10:08:22AM -0700, Chen, Kenneth W wrote:
> >>>> David Gibson wrote on Wednesday, April 14, 2004 11:43 PM
> > >
> > > Some caveats: I don't have sh and sparc64 hardware to test. But hugetlb
> > > code in these two arch looked like a triplet twin of x86 code. So I'm
> > > pretty sure it will work right out of box. I've monkeyed around with
> > > ppc64 code and after a while I realized it should be left for the experts.
> > > I'm sure there are plenty ppc64 developers out there that can get it done
> > > in no time.
> >
> > To the extent that I understand your patches, it shouldn't be that
> > hard to adapt for ppc64, with one caveat: on ppc64, unlike the other
> > hugepage archs, the format of hugepage PTEs is not identical to the
> > format of normal PTEs. So to do this for ppc64, the generic parts of
> > your code will need to use a hugepte_t instead of pte_t - it can be
> > typedeffed to pte_t on archs other than ppc64. Likewise there will
> > need to be hugepte_none() and so forth macros.
>
> I think it would be cleaner if ppc64 change its format instead of changing
> 4 other arch to accommodate ppc64. By the way, why do you need to special
> typedef hugepte_t? pte for huge page aren't anything special on all other
> arches.

The hugepte entries go in the same slots as pmd entries, which means
they must be compatible with the layout of pmd entries. That's not
compatible with making them identical to normal PTE entries. For one
thing, normal PTE entries are 64 bits wide, whereas PMD entries are
only 32 bits wide.

--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson

2004-04-16 03:04:32

by Chen, Kenneth W

Subject: RE: hugetlb demand paging patch part [0/3]

David Gibson wrote on Thursday, April 15, 2004 6:31 PM
> On Thu, Apr 15, 2004 at 10:08:22AM -0700, Chen, Kenneth W wrote:
> > >>>> David Gibson wrote on Wednesday, April 14, 2004 11:43 PM
> > > >
> > > > Some caveats: I don't have sh and sparc64 hardware to test. But hugetlb
> > > > code in these two arch looked like a triplet twin of x86 code. So I'm
> > > > pretty sure it will work right out of box. I've monkeyed around with
> > > > ppc64 code and after a while I realized it should be left for the experts.
> > > > I'm sure there are plenty ppc64 developers out there that can get it done
> > > > in no time.
> > >
> > > To the extent that I understand your patches, it shouldn't be that
> > > hard to adapt for ppc64, with one caveat: on ppc64, unlike the other
> > > hugepage archs, the format of hugepage PTEs is not identical to the
> > > format of normal PTEs. So to do this for ppc64, the generic parts of
> > > your code will need to use a hugepte_t instead of pte_t - it can be
> > > typedeffed to pte_t on archs other than ppc64. Likewise there will
> > > need to be hugepte_none() and so forth macros.
> >
> > I think it would be cleaner if ppc64 change its format instead of changing
> > 4 other arch to accommodate ppc64. By the way, why do you need to special
> > typedef hugepte_t? pte for huge page aren't anything special on all other
> > arches.
>
> The hugepte entries go in the same slots as pmd entries, which means
> they must be compatible with the layout of pmd entries. That's not
> compatible with making them identical to normal PTE entries. For one
> thing, normal PTE entries are 64 bits wide, whereas PMD entries are
> only 32 bits wide.


It smells like handle_hugetlb_mm_fault() needs to be replicated in each arch
(or at least replicated in ppc64).

- Ken


2004-04-16 03:36:12

by David Gibson

Subject: Re: hugetlb demand paging patch part [0/3]

On Thu, Apr 15, 2004 at 08:01:55PM -0700, Chen, Kenneth W wrote:
> David Gibson wrote on Thursday, April 15, 2004 6:31 PM
> > On Thu, Apr 15, 2004 at 10:08:22AM -0700, Chen, Kenneth W wrote:
> > > >>>> David Gibson wrote on Wednesday, April 14, 2004 11:43 PM
> > > > >
> > > > > Some caveats: I don't have sh and sparc64 hardware to test. But hugetlb
> > > > > code in these two arch looked like a triplet twin of x86 code. So I'm
> > > > > pretty sure it will work right out of box. I've monkeyed around with
> > > > > ppc64 code and after a while I realized it should be left for the experts.
> > > > > I'm sure there are plenty ppc64 developers out there that can get it done
> > > > > in no time.
> > > >
> > > > To the extent that I understand your patches, it shouldn't be that
> > > > hard to adapt for ppc64, with one caveat: on ppc64, unlike the other
> > > > hugepage archs, the format of hugepage PTEs is not identical to the
> > > > format of normal PTEs. So to do this for ppc64, the generic parts of
> > > > your code will need to use a hugepte_t instead of pte_t - it can be
> > > > typedeffed to pte_t on archs other than ppc64. Likewise there will
> > > > need to be hugepte_none() and so forth macros.
> > >
> > > I think it would be cleaner if ppc64 change its format instead of changing
> > > 4 other arch to accommodate ppc64. By the way, why do you need to special
> > > typedef hugepte_t? pte for huge page aren't anything special on all other
> > > arches.
> >
> > The hugepte entries go in the same slots as pmd entries, which means
> > they must be compatible with the layout of pmd entries. That's not
> > compatible with making them identical to normal PTE entries. For one
> > thing, normal PTE entries are 64 bits wide, whereas PMD entries are
> > only 32 bits wide.
>
> It smells like handle_hugetlb_mm_fault() need to be replicated in each arch
> (or at least replicated in ppc64).

No, that shouldn't be necessary. With a per-arch huge_pte_offset()
(returning a hugepte_t *) and similar arch functions for establishing
and tearing down huge ptes handle_hugetlb_mm_fault() should be able to
be generic. More significantly, it should still be possible to make
it generic when adding copy-on-write, which makes it less trivial.

To unify even the non-ppc64 archs we already have to allow for the
hugepage pagetables to have different structure across archs - on i386
and ppc64 the hugePTEs lie in PMD slots, on sparc64 and sh they lie in
(normal) PTE slots and on IA64 they lie in the PTE slots of a special
set of pagetables. Given that, it seems conceptually logical to me
that we also not assume the hugepage PTEs have the same layout as
normal PTEs. It doesn't make the handle_mm_fault function the least bit more
complicated.
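
To make that concrete, here is a toy userspace sketch of the split, not the
code from the patches. The struct toy_mm type, the TOY_* constants, the entry
encoding and the function signatures are all assumptions made up for
illustration; only the huge_pte_alloc(), set_huge_pte(), hugepte_none() and
hugepte_t names come from this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arch-defined huge PTE type (32 bits wide here). */
typedef struct { uint32_t val; } hugepte_t;

#define TOY_HPAGE_SHIFT 24
#define TOY_NR_SLOTS    4
#define TOY_FAULT_OK    0
#define TOY_FAULT_OOM   (-1)

struct toy_mm {
	hugepte_t slot[TOY_NR_SLOTS];
	uint32_t next_frame;	/* trivial stand-in for a huge page allocator */
};

/* --- arch hooks: only these know where entries live and how they look --- */
static hugepte_t *huge_pte_alloc(struct toy_mm *mm, unsigned long addr)
{
	unsigned long idx = addr >> TOY_HPAGE_SHIFT;

	return idx < TOY_NR_SLOTS ? &mm->slot[idx] : NULL;
}

static int hugepte_none(hugepte_t e)
{
	return e.val == 0;
}

static void set_huge_pte(hugepte_t *p, uint32_t frame)
{
	/* Toy encoding: frame number above bit 15, low bit marks "huge". */
	p->val = (frame << 15) | 1u;
}

/* --- generic part: no knowledge of the entry format --- */
static int toy_hugetlb_fault(struct toy_mm *mm, unsigned long addr)
{
	hugepte_t *p = huge_pte_alloc(mm, addr);

	if (!p)
		return TOY_FAULT_OOM;
	if (hugepte_none(*p))		/* first touch: populate on demand */
		set_huge_pte(p, mm->next_frame++);
	return TOY_FAULT_OK;		/* already present: nothing to do */
}
```

The generic function never inspects the bits of an entry itself, so swapping
in a different hugepte_t and different hooks leaves it unchanged.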

--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson

2004-04-16 03:44:10

by Chen, Kenneth W

Subject: RE: hugetlb demand paging patch part [0/3]

David Gibson wrote on Thursday, April 15, 2004 8:33 PM
> To unify even the non-ppc64 archs we already have to allow for the
> hugepage pagetables to have different structure across archs - on i386
> and ppc64 the hugePTEs lie in PMD slots, on sparc64 and sh they lie in
> (normal) PTE slots and on IA64 they lie in the PTE slots of a special
> set of pagetables. Given that, it seems conceptually logical to me
> that we also not assume the hugepage PTEs have the same layout as
> normal PTEs. It makes the handle_mm_fault function not the least more
> complicated.

Correction: huge page pte on ia64 has the same format as a normal page pte.


2004-04-16 04:04:32

by David Gibson

Subject: Re: hugetlb demand paging patch part [0/3]

On Thu, Apr 15, 2004 at 08:43:08PM -0700, Chen, Kenneth W wrote:
> David Gibson wrote on Thursday, April 15, 2004 8:33 PM
> > To unify even the non-ppc64 archs we already have to allow for the
> > hugepage pagetables to have different structure across archs - on i386
> > and ppc64 the hugePTEs lie in PMD slots, on sparc64 and sh they lie in
> > (normal) PTE slots and on IA64 they lie in the PTE slots of a special
> > set of pagetables. Given that, it seems conceptually logical to me
> > that we also not assume the hugepage PTEs have the same layout as
> > normal PTEs. It makes the handle_mm_fault function not the least more
> > complicated.
>
> Correction: huge page pte on ia64 has the same format as a normal page pte.

Yes, the PTEs have the same format, but the page table layout is not
identical (it's almost the same, but the address is shifted right
before doing the lookup).

--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson