2006-05-10 18:56:46

by Adam Litke

Subject: [RFC] Hugetlb demotion for x86

The following patch enables demotion of MAP_PRIVATE hugetlb memory to
normal anonymous memory on the i386 architecture. Below is a short
description of the problem from a previous posting.

> Thanks to the latest hugetlb accounting patches, we now have reliable
> shared mappings. Private mappings are much more difficult because
> there is no way to know up-front how many huge pages will be required
> (we may have forking combined with unknown copy-on-write activity).
> So private mappings currently get full overcommit semantics and when a
> fault cannot be handled, the apps get SIGBUS.
>
> The problem: Random SIGBUS crashes for applications using large pages
> are not acceptable. We need a way to handle the fault without giving
> up and killing the process.

The mechanics of this approach are straightforward. When failing to
allocate a huge page, the HPAGE_SIZE area is munmap'ed and mmap'ed as
anonymous memory. If a hugepte existed (happens on a COW fault) a
series of write-protected normal ptes that point to the sub-pages of the
original huge page are installed. The normal fault handling path then
completes the fault via do_anonymous_page() (or do_wp_page() for COW
faults).

At this point I am looking for comments on the approach. It has been
working reliably on my system so far but I have probably missed
something. Also, enabling other architectures will involve more work
(such as demotion regions that span multiple pages and vmas), which I am
now looking at.

Signed-off-by: Adam Litke <[email protected]>
* Patch not ready for merging

include/linux/mman.h | 22 ++++++++++++
mm/hugetlb.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++-
mm/rmap.c | 3 +
3 files changed, 112 insertions(+), 1 deletion(-)
diff -upN reference/include/linux/mman.h current/include/linux/mman.h
--- reference/include/linux/mman.h
+++ current/include/linux/mman.h
@@ -53,6 +53,17 @@ calc_vm_prot_bits(unsigned long prot)
}

/*
+ * Combine the vm_flags protection bits into mmap "prot" argument
+ */
+static inline unsigned long
+calc_mmap_prot_bits(unsigned long vm_flags)
+{
+ return _calc_vm_trans(vm_flags, VM_READ, PROT_READ ) |
+ _calc_vm_trans(vm_flags, VM_WRITE, PROT_WRITE) |
+ _calc_vm_trans(vm_flags, VM_EXEC, PROT_EXEC );
+}
+
+/*
* Combine the mmap "flags" argument into "vm_flags" used internally.
*/
static inline unsigned long
@@ -64,4 +75,15 @@ calc_vm_flag_bits(unsigned long flags)
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
}

+/*
+ * Convert vm_flags into the mmap "flags" argument
+ */
+static inline unsigned long
+calc_mmap_flag_bits(unsigned long vm_flags)
+{
+ return _calc_vm_trans(vm_flags, VM_GROWSDOWN, MAP_GROWSDOWN ) |
+ _calc_vm_trans(vm_flags, VM_DENYWRITE, MAP_DENYWRITE ) |
+ _calc_vm_trans(vm_flags, VM_EXECUTABLE, MAP_EXECUTABLE) |
+ _calc_vm_trans(vm_flags, VM_LOCKED, MAP_LOCKED );
+}
#endif /* _LINUX_MMAN_H */
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/mman.h>

#include <asm/page.h>
#include <asm/pgtable.h>
@@ -503,6 +504,86 @@ void unmap_hugepage_range(struct vm_area
flush_tlb_range(vma, start, end);
}

+/*
+ * For copy-on-write triggered demotions, we have an instantiated page and
+ * huge pte. Since we need the data in the existing huge page, install
+ * normal, write-protected ptes that point to the sub-pages of this huge page.
+ * We can then let do_wp_page() lazily copy the data for us. Just take a
+ * reference on the huge page for each pte we install.
+ */
+static inline int
+install_demotion_ptes(struct mm_struct *mm, struct page *page,
+ pgprot_t prot, unsigned long address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t entry, *ptep;
+ int pfn, i;
+
+ pgd = pgd_offset(mm, address);
+ pud = pud_alloc(mm, pgd, address);
+ if (!pud)
+ return -ENOMEM;
+ pmd = pmd_alloc(mm, pud, address);
+ if (!pmd)
+ return -ENOMEM;
+
+ pfn = page_to_pfn(page);
+ for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+ entry = pte_wrprotect(pfn_pte(pfn + i, prot));
+ ptep = pte_alloc_map(mm, pmd, address);
+ set_pte_at(mm, address, ptep, entry);
+ pte_unmap(ptep);
+ get_page(page);
+ address += PAGE_SIZE;
+ }
+
+ return 0;
+}
+
+static int hugetlb_demote_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ unsigned long start, prot, flags;
+ pgprot_t pgprot;
+ pte_t *ptep;
+ struct page *page = NULL;
+ int ret, established = 0;
+
+ /* Only private VMAs can be demoted */
+ if (vma->vm_flags & VM_MAYSHARE)
+ return VM_FAULT_OOM;
+
+ start = address & HPAGE_MASK;
+ pgprot = vma->vm_page_prot;
+ prot = calc_mmap_prot_bits(vma->vm_flags);
+ flags = calc_mmap_flag_bits(vma->vm_flags);
+ flags |= MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED;
+
+ ptep = huge_pte_offset(mm, start);
+ if (ptep && !pte_none(*ptep)) {
+ established = 1;
+ page = pte_page(*ptep);
+ get_page(page);
+ }
+
+ do_munmap(mm, start, HPAGE_SIZE);
+ start = do_mmap_pgoff(0, start, HPAGE_SIZE, prot, flags, 0);
+ if (start < 0) {
+ return VM_FAULT_OOM;
+ }
+
+ if (established) {
+ ret = install_demotion_ptes(mm, page, pgprot, start);
+ put_page(page);
+ if (ret)
+ return VM_FAULT_OOM;
+ }
+
+ return VM_FAULT_MINOR;
+}
+
static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pte_t pte)
{
@@ -643,6 +724,8 @@ int hugetlb_fault(struct mm_struct *mm,
entry = *ptep;
if (pte_none(entry)) {
ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
+ if (ret == VM_FAULT_OOM)
+ ret = hugetlb_demote_page(mm, vma, address);
mutex_unlock(&hugetlb_instantiation_mutex);
return ret;
}
@@ -652,8 +735,11 @@ int hugetlb_fault(struct mm_struct *mm,
spin_lock(&mm->page_table_lock);
/* Check for a racing update before calling hugetlb_cow */
if (likely(pte_same(entry, *ptep)))
- if (write_access && !pte_write(entry))
+ if (write_access && !pte_write(entry)) {
ret = hugetlb_cow(mm, vma, address, ptep, entry);
+ if (ret == VM_FAULT_OOM)
+ ret = hugetlb_demote_page(mm, vma, address);
+ }
spin_unlock(&mm->page_table_lock);
mutex_unlock(&hugetlb_instantiation_mutex);

diff -upN reference/mm/rmap.c current/mm/rmap.c
--- reference/mm/rmap.c
+++ current/mm/rmap.c
@@ -548,6 +548,9 @@ void page_add_file_rmap(struct page *pag
*/
void page_remove_rmap(struct page *page)
{
+ if (unlikely(PageCompound(page)))
+ return;
+
if (atomic_add_negative(-1, &page->_mapcount)) {
#ifdef CONFIG_DEBUG_VM
if (unlikely(page_mapcount(page) < 0)) {

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


2006-05-10 19:48:00

by Dave Hansen

Subject: Re: [RFC] Hugetlb demotion for x86

On Wed, 2006-05-10 at 13:56 -0500, Adam Litke wrote:
> +install_demotion_ptes(struct mm_struct *mm, struct page *page,
> + pgprot_t prot, unsigned long address)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t entry, *ptep;
> + int pfn, i;
> +
> + pgd = pgd_offset(mm, address);
> + pud = pud_alloc(mm, pgd, address);
> + if (!pud)
> + return -ENOMEM;
> + pmd = pmd_alloc(mm, pud, address);
> + if (!pmd)
> + return -ENOMEM;

That looks to be a pretty direct copy of what is already in
__handle_mm_fault(). Can they be consolidated?

-- Dave

2006-05-10 19:51:14

by Dave Hansen

Subject: Re: [RFC] Hugetlb demotion for x86

On Wed, 2006-05-10 at 13:56 -0500, Adam Litke wrote:
>
> + do_munmap(mm, start, HPAGE_SIZE);
> + start = do_mmap_pgoff(0, start, HPAGE_SIZE, prot, flags, 0);
> + if (start < 0) {
> + return VM_FAULT_OOM;
> + }


Hmm.. These are being done in this path, right?

do_page_fault()
handle_mm_fault()
__handle_mm_fault()
hugetlb_fault()
hugetlb_demote_page()

I believe do_munmap() requires a write on mmap_sem(), but
do_page_fault() only takes a read.

-- Dave

2006-05-10 20:05:22

by Christoph Hellwig

Subject: Re: [RFC] Hugetlb demotion for x86

On Wed, May 10, 2006 at 01:56:40PM -0500, Adam Litke wrote:
> The following patch enables demotion of MAP_PRIVATE hugetlb memory to
> normal anonymous memory on the i386 architecture.

This is an awfully bad idea. Applications should do smart fallback
instead. For the same reason we, for example, fail O_DIRECT requests
we cannot fulfill instead of doing the half-buffered I/O braindamage
Solaris does.

2006-05-10 20:32:43

by Adam Litke

Subject: Re: [RFC] Hugetlb demotion for x86

On Wed, 2006-05-10 at 21:05 +0100, Christoph Hellwig wrote:
> On Wed, May 10, 2006 at 01:56:40PM -0500, Adam Litke wrote:
> > The following patch enables demotion of MAP_PRIVATE hugetlb memory to
> > normal anonymous memory on the i386 architecture.
>
> This is an awfully bad idea. Applications should do smart fallback
> instead. For the same reason we, for example, fail O_DIRECT requests
> we cannot fulfill instead of doing the half-buffered I/O braindamage
> Solaris does.

By smart fallback do you mean we should convert the hugetlb fault code
back to using VM_FAULT_SIGBUS and writing userspace sighandlers to do
the same thing I am, but in userspace? FWIW I did implement that in
libhugetlbfs to try it out, but that seems much dirtier to me than
handling faults in the kernel.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2006-05-10 20:49:31

by Christoph Hellwig

Subject: Re: [RFC] Hugetlb demotion for x86

On Wed, May 10, 2006 at 03:32:36PM -0500, Adam Litke wrote:
> By smart fallback do you mean we should convert the hugetlb fault code
> back to using VM_FAULT_SIGBUS and writing userspace sighandlers to do
> the same thing I am, but in userspace? FWIW I did implement that in
> libhugetlbfs to try it out, but that seems much dirtier to me than
> handling faults in the kernel.

Umm, why do these faults happen at all? When all the hugetlb code went
in, we allocated at mmap time. Later it was converted to demand faulting,
but under the premise that we keep the strict overcommit accounting. When
did that part go away as well? With strict overcommit handling for huge
pages, no fault should happen when the pool is exhausted.

2006-05-10 21:38:51

by Andi Kleen

Subject: Re: [RFC] Hugetlb demotion for x86

Adam Litke <[email protected]> writes:

> The following patch enables demotion of MAP_PRIVATE hugetlb memory to
> normal anonymous memory on the i386 architecture. Below is a short
> description of the problem from a previous posting.


I'm not sure it's a good idea really. I think people have the reasonable
expectation that if you use hugetlb you really get huge pages.

If you really implement it, you should at least printk it clearly. But
it's probably better not to implement it.

If there were generic transparent hugepage support in the VM, the case
would probably be different. Then it would make sense. But not as part
of hugetlbfs.

-Andi

2006-05-10 21:45:44

by Adam Litke

Subject: Re: [RFC] Hugetlb demotion for x86

On Wed, 2006-05-10 at 21:49 +0100, Christoph Hellwig wrote:
> On Wed, May 10, 2006 at 03:32:36PM -0500, Adam Litke wrote:
> > By smart fallback do you mean we should convert the hugetlb fault code
> > back to using VM_FAULT_SIGBUS and writing userspace sighandlers to do
> > the same thing I am, but in userspace? FWIW I did implement that in
> > libhugetlbfs to try it out, but that seems much dirtier to me than
> > handling faults in the kernel.
>
> Umm, why do these faults happen at all? When all the hugetlb code went
> in, we allocated at mmap time. Later it was converted to demand faulting,
> but under the premise that we keep the strict overcommit accounting. When
> did that part go away as well? With strict overcommit handling for huge
> pages, no fault should happen when the pool is exhausted.

Strict overcommit is there for shared mappings. When private mapping
support was added, people agreed that full overcommit should apply to
private mappings for the same reasons normal page overcommit is desired.
For one: an application using lots of private huge pages should not be
prohibited from forking if it's likely to just exec a small helper
program.

"These faults" are happening in two cases when MAP_PRIVATE huge pages
are being used:
1) Fault on an uninstantiated huge page: This can happen when numerous
users of huge pages in the system are competing for a finite number of
huge pages. Even if the process checks for free huge pages before
mmaping the area, another process is free to "steal" those pages out
from under the careful process.

2) COW fault on an instantiated huge page: Happens in child processes
who inherit a private hugetlb region and write to it.

Both of these cases are non-deterministic and should be handled in some
way. Just killing the process doesn't seem like a permanent solution to
me.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2006-05-10 23:42:06

by Christoph Lameter

Subject: Re: [RFC] Hugetlb demotion for x86

Seems that the patch is not modifying x86-specific code but generic code.

An app should be getting an out of memory error and not a SIGBUS when
running out of memory.

I thought we fixed the SIGBUS problems and were now reporting out of
memory? If there still is an issue then we better fix out of memory
handling. Provide a way for the app to trap OOM conditions?

2006-05-11 15:15:53

by Hugh Dickins

Subject: Re: [RFC] Hugetlb demotion for x86

On Wed, 10 May 2006, Adam Litke wrote:
>
> Strict overcommit is there for shared mappings. When private mapping

I presume that by "strict overcommit" you mean "strict no overcommit".

> support was added, people agreed that full overcommit should apply to
> private mappings for the same reasons normal page overcommit is desired.

I'm not sure how wide that agreement was. But what I wanted to say is...

> For one: an application using lots of private huge pages should not be
> prohibited from forking if it's likely to just exec a small helper
> program.

This is an excellent use for madvise(start, length, MADV_DONTFORK).
Though it was added mainly for RDMA issues, it's a great way for a
program with a huge commitment to exclude areas of its address space
from the fork, so making that fork much more likely to succeed.

Hugh

2006-05-11 15:21:44

by Alan

Subject: Re: [RFC] Hugetlb demotion for x86

On Iau, 2006-05-11 at 16:15 +0100, Hugh Dickins wrote:
> > For one: an application using lots of private huge pages should not be
> > prohibited from forking if it's likely to just exec a small helper
> > program.
>
> This is an excellent use for madvise(start, length, MADV_DONTFORK).
> Though it was added mainly for RDMA issues, it's a great way for a
> program with a huge commitment to exclude areas of its address space
> from the fork, so making that fork much more likely to succeed.

Or fork using vfork() in that case, which has even more wins and is a
more efficient, if more hair-raising, way of doing it.

2006-05-11 15:59:57

by Adam Litke

Subject: Re: [RFC] Hugetlb demotion for x86

On Thu, 2006-05-11 at 16:15 +0100, Hugh Dickins wrote:
> On Wed, 10 May 2006, Adam Litke wrote:
> >
> > Strict overcommit is there for shared mappings. When private mapping
>
> I presume that by "strict overcommit" you mean "strict no overcommit".
>
> > support was added, people agreed that full overcommit should apply to
> > private mappings for the same reasons normal page overcommit is desired.
>
> I'm not sure how wide that agreement was. But what I wanted to say is...
>
> > For one: an application using lots of private huge pages should not be
> > prohibited from forking if it's likely to just exec a small helper
> > program.
>
> This is an excellent use for madvise(start, length, MADV_DONTFORK).
> Though it was added mainly for RDMA issues, it's a great way for a
> program with a huge commitment to exclude areas of its address space
> from the fork, so making that fork much more likely to succeed.

I guess it's time for me to take a step back and explain why I am doing
this. libhugetlbfs (announced here recently) has the ability to remap
an executable's ELF segments into huge pages. So madvise(MADV_DONTFORK)
would be pretty bad ;)

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2006-05-11 16:11:06

by Adam Litke

Subject: Re: [RFC] Hugetlb demotion for x86

On Wed, 2006-05-10 at 16:42 -0700, Christoph Lameter wrote:
> Seems that the patch is not modifying x86-specific code but generic code.

Right. It's definitely broken in that regard. I sent it out in this
condition so the patch was small, easy to review, and my approach would
be easy to see.

> An app should be getting an out of memory error and not a SIGBUS when
> running out of memory.
>
> I thought we fixed the SIGBUS problems and were now reporting out of
> memory? If there still is an issue then we better fix out of memory
> handling. Provide a way for the app to trap OOM conditions?

Yes, the SIGBUS issues are "fixed". Now the application is killed
directly via VM_FAULT_OOM so it is not possible to handle the fault from
userspace. For my libhugetlbfs-based fallback approach, I needed to
patch the kernel so that SIGBUS was delivered to the process like in the
days of old.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2006-05-15 14:21:39

by Dave Hansen

Subject: Re: [RFC] Hugetlb demotion for x86

On Thu, 2006-05-11 at 11:10 -0500, Adam Litke wrote:
> Yes, the SIGBUS issues are "fixed". Now the application is killed
> directly via VM_FAULT_OOM so it is not possible to handle the fault from
> userspace. For my libhugetlbfs-based fallback approach, I needed to
> patch the kernel so that SIGBUS was delivered to the process like in the
> days of old.

Maybe this could be off-by-default behavior that can be enabled with a
special mmap flag or madvise, or something similar. It seems that apps
don't want to get SIGBUS for low memory. But, if they have _asked_ for
it, perhaps they'd be a bit more willing.

(BTW, I fixed the bogus linux-mm cc, finally ;)

-- Dave