2010-04-16 15:45:29

by r6144

[permalink] [raw]
Subject: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

Hello all,

I'm having an annoying kernel bug regarding huge pages in Fedora 12:

https://bugzilla.redhat.com/show_bug.cgi?id=552257

Basically I want to use huge pages in a multithreaded number crunching
program, which happens to use process-shared semaphores (because fftw
does it). The futex for the semaphore ends up lying on a huge page, and
I then get an endless loop in get_futex_key(), apparently because the
anonymous huge page containing the futex does not have a page->mapping.
A test case is provided in the above link.

I reported the bug to Fedora bugzilla months ago, but haven't received
any feedback yet. The Fedora kernel is based on 2.6.32.11, and a
cursory glance at the 2.6.34-rc3 source does not yield any relevant
change.

So, could anyone tell me if the current mainline kernel might act better
in this respect, before I get around to compiling it?

Thank you very much.

Please CC me as I'm not subscribed.

r6144


2010-04-16 20:27:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Fri, 2010-04-16 at 23:45 +0800, r6144 wrote:
> Hello all,
>
> I'm having an annoying kernel bug regarding huge pages in Fedora 12:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=552257
>
> Basically I want to use huge pages in a multithreaded number crunching
> program, which happens to use process-shared semaphores (because fftw
> does it). The futex for the semaphore ends up lying on a huge page, and
> I then get an endless loop in get_futex_key(), apparently because the
> anonymous huge page containing the futex does not have a page->mapping.
> A test case is provided in the above link.
>
> I reported the bug to Fedora bugzilla months ago, but haven't received
> any feedback yet.

No, it works much better if you simply mail LKML and CC people who work
on the code in question ;-)

> The Fedora kernel is based on 2.6.32.11, and a
> cursory glance at the 2.6.34-rc3 source does not yield any relevant
> change.
>
> So, could anyone tell me if the current mainline kernel might act better
> in this respect, before I get around to compiling it?

Right, so I had a quick chat with Mel, and it appears MAP_PRIVATE
hugetlb pages don't have their page->mapping set.

I guess something like the below might work, but I'd really rather not
add hugetlb knowledge to futex.c. Does anybody else have a better idea?
Maybe create something similar to an anon_vma for hugetlb pages?

---
kernel/futex.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e7a35f1..b0f1b2d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -252,7 +252,7 @@ again:

page = compound_head(page);
lock_page(page);
- if (!page->mapping) {
+ if (!page->mapping && !PageHuge(page)) {
unlock_page(page);
put_page(page);
goto again;
@@ -265,7 +265,7 @@ again:
* it's a read-only handle, it's expected that futexes attach to
* the object not the particular process.
*/
- if (PageAnon(page)) {
+ if (PageAnon(page) || (PageHuge(page) && !page->mapping)) {
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;

2010-04-19 11:43:21

by Mel Gorman

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Fri, Apr 16, 2010 at 10:27:48PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 23:45 +0800, r6144 wrote:
> > Hello all,
> >
> > I'm having an annoying kernel bug regarding huge pages in Fedora 12:
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=552257
> >
> > Basically I want to use huge pages in a multithreaded number crunching
> > program, which happens to use process-shared semaphores (because fftw
> > does it). The futex for the semaphore ends up lying on a huge page, and
> > I then get an endless loop in get_futex_key(), apparently because the
> > anonymous huge page containing the futex does not have a page->mapping.
> > A test case is provided in the above link.
> >
> > I reported the bug to Fedora bugzilla months ago, but haven't received
> > any feedback yet.
>
> No, it works much better if you simply mail LKML and CC people who work
> on the code in question ;-)
>
> > The Fedora kernel is based on 2.6.32.11, and a
> > cursory glance at the 2.6.34-rc3 source does not yield any relevant
> > change.
> >
> > So, could anyone tell me if the current mainline kernel might act better
> > in this respect, before I get around to compiling it?
>
> Right, so I had a quick chat with Mel, and it appears MAP_PRIVATE
> hugetlb pages don't have their page->mapping set.
>
> I guess something like the below might work, but I'd really rather not
> add hugetlb knowledge to futex.c. Does anybody else have a better idea?
> Maybe create something similar to an anon_vma for hugetlb pages?
>

anon_vma for hugetlb pages sounds overkill, what would it gain? In this
context, futex only appears to distinguish between whether the
references are private or shared.

Looking at the hugetlbfs code, I can't see a place where it actually cares
about the mapping as such. It's used to find shared pages in the page cache
(but not in the LRU) that are backed by the hugetlbfs file. For hugetlbfs
though, the mapping is mostly kept in page->private for reservation accounting
purposes.

I can't think of other parts of the VM that touch the mapping if the
page is managed by hugetlbfs so the following patch should also work but
without futex having hugetlbfs-awareness. What do you think? Maybe for
safety, it would be better to make the mapping some obvious poison bytes
or'd with PAGE_MAPPING_ANON so an oops will be more obvious?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6034dc9..57a5faa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -546,6 +546,7 @@ static void free_huge_page(struct page *page)

mapping = (struct address_space *) page_private(page);
set_page_private(page, 0);
+ page->mapping = NULL;
BUG_ON(page_count(page));
INIT_LIST_HEAD(&page->lru);

@@ -2447,8 +2448,10 @@ retry:
spin_lock(&inode->i_lock);
inode->i_blocks += blocks_per_huge_page(h);
spin_unlock(&inode->i_lock);
- } else
+ } else {
lock_page(page);
+ page->mapping = (struct address_space *)PAGE_MAPPING_ANON;
+ }
}

/*

2010-04-19 11:52:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Mon, 2010-04-19 at 12:43 +0100, Mel Gorman wrote:

> > Right, so I had a quick chat with Mel, and it appears MAP_PRIVATE
> > hugetlb pages don't have their page->mapping set.
> >
> > I guess something like the below might work, but I'd really rather not
> > add hugetlb knowledge to futex.c. Does anybody else have a better idea?
> > Maybe create something similar to an anon_vma for hugetlb pages?
> >
>
> anon_vma for hugetlb pages sounds overkill, what would it gain? In this
> context, futex only appears to distinguish between whether the
> references are private or shared.
>
> Looking at the hugetlbfs code, I can't see a place where it actually cares
> about the mapping as such. It's used to find shared pages in the page cache
> (but not in the LRU) that are backed by the hugetlbfs file. For hugetlbfs
> though, the mapping is mostly kept in page->private for reservation accounting
> purposes.
>
> I can't think of other parts of the VM that touch the mapping if the
> page is managed by hugetlbfs so the following patch should also work but
> without futex having hugetlbfs-awareness. What do you think? Maybe for
> safety, it would be better to make the mapping some obvious poison bytes
> or'd with PAGE_MAPPING_ANON so an oops will be more obvious?

Yes, this seems perfectly adequate to me, that poison idea might be
worthwhile too :-)

Acked-by: Peter Zijlstra <[email protected]>

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6034dc9..57a5faa 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -546,6 +546,7 @@ static void free_huge_page(struct page *page)
>
> mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> + page->mapping = NULL;
> BUG_ON(page_count(page));
> INIT_LIST_HEAD(&page->lru);
>
> @@ -2447,8 +2448,10 @@ retry:
> spin_lock(&inode->i_lock);
> inode->i_blocks += blocks_per_huge_page(h);
> spin_unlock(&inode->i_lock);
> - } else
> + } else {
> lock_page(page);
> + page->mapping = (struct address_space *)PAGE_MAPPING_ANON;
> + }
> }
>
> /*


2010-04-19 15:33:10

by Mel Gorman

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Mon, Apr 19, 2010 at 01:52:36PM +0200, Peter Zijlstra wrote:
> On Mon, 2010-04-19 at 12:43 +0100, Mel Gorman wrote:
>
> > > Right, so I had a quick chat with Mel, and it appears MAP_PRIVATE
> > > hugetlb pages don't have their page->mapping set.
> > >
> > > I guess something like the below might work, but I'd really rather not
> > > add hugetlb knowledge to futex.c. Does anybody else have a better idea?
> > > Maybe create something similar to an anon_vma for hugetlb pages?
> > >
> >
> > anon_vma for hugetlb pages sounds overkill, what would it gain? In this
> > context, futex only appears to distinguish between whether the
> > references are private or shared.
> >
> > Looking at the hugetlbfs code, I can't see a place where it actually cares
> > about the mapping as such. It's used to find shared pages in the page cache
> > (but not in the LRU) that are backed by the hugetlbfs file. For hugetlbfs
> > though, the mapping is mostly kept in page->private for reservation accounting
> > purposes.
> >
> > I can't think of other parts of the VM that touch the mapping if the
> > page is managed by hugetlbfs so the following patch should also work but
> > without futex having hugetlbfs-awareness. What do you think? Maybe for
> > safety, it would be better to make the mapping some obvious poison bytes
> > or'd with PAGE_MAPPING_ANON so an oops will be more obvious?
>
> Yes, this seems perfectly adequate to me, that poison idea might be
> worthwhile too :-)
>
> Acked-by: Peter Zijlstra <[email protected]>
>

Are you ok with an Ack to this slightly-difference patch too? If so,
I'll forward it on to Andrew.

==== CUT HERE ====

Fix infinite loop in get_futex_key when backed by huge pages

If a futex key happens to be located within a huge page mapped MAP_PRIVATE,
get_futex_key() can go into an infinite loop waiting for a page->mapping
that will never exist. This was reported and documented in an external
bugzilla at

https://bugzilla.redhat.com/show_bug.cgi?id=552257

This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON
mapped MAP_PRIVATE. This is enough for futex to continue but because
of PAGE_MAPPING_ANON, the poisoned value is not dereferenced or used by
futex. No other part of the VM should be dereferencing the page->mapping of
a hugetlbfs page as its page cache is not on the LRU.

This patch fixes the problem with the test case described in the bugzilla.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/poison.h | 10 ++++++++++
mm/hugetlb.c | 6 +++++-
2 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81..0f7b5ac 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -48,6 +48,16 @@
#define POISON_FREE 0x6b /* for use-after-free poisoning */
#define POISON_END 0xa5 /* end-byte of poisoning */

+/********** mm/hugetlb.c **********/
+/*
+ * Private mappings of hugetlb pages use this poisoned value for
+ * page->mapping. The core VM should not be doing anything with this mapping
+ * but futex requires the existance of some page->mapping value even if it
+ * is unused. If the core VM does deference the mapping, it'll look like a
+ * suspiciously high null-pointer offset starting from 0x2e5
+ */
+#define HUGETLB_PRIVATE_MAPPING (0x2e4 | PAGE_MAPPING_ANON)
+
/********** arch/$ARCH/mm/init.c **********/
#define POISON_FREE_INITMEM 0xcc

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6034dc9..487e3c2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -546,6 +546,7 @@ static void free_huge_page(struct page *page)

mapping = (struct address_space *) page_private(page);
set_page_private(page, 0);
+ page->mapping = NULL;
BUG_ON(page_count(page));
INIT_LIST_HEAD(&page->lru);

@@ -2447,8 +2448,11 @@ retry:
spin_lock(&inode->i_lock);
inode->i_blocks += blocks_per_huge_page(h);
spin_unlock(&inode->i_lock);
- } else
+ } else {
lock_page(page);
+ page->mapping = (struct address_space *)
+ HUGETLB_PRIVATE_MAPPING;
+ }
}

/*

2010-04-19 15:50:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Mon, Apr 19, 2010 at 04:32:45PM +0100, Mel Gorman wrote:
> +#define HUGETLB_PRIVATE_MAPPING (0x2e4 | PAGE_MAPPING_ANON)

Cute indeed.

BTW, just in case I tested this on transparent hugepage and it works
fine (it uses no cpu and can be killed with C^c). I had to hack it
like below to allocate the semaphore on hugepages without
khugepaged. I verified 1 hugepage is allocated (thanks to memory
compaction there's an huge excess of hugepages compared to what my
regular apps can eat ;). Furthermore I found gcc bypasses malloc so a
small patch to gcc should move it all into hugepages. Then maybe we
can build the kernel faster and definitely translate.o will build 8%
faster with the default khugepaged scan_sleep_millisecs settings
(waiting to be confirmed but I exclude bad surprises, whatever runs
fast with khugepaged has will run even faster without it if something,
or equal in the worst case).

Thanks,
Andrea

--- process-shared-sem-hugepage.c.orig 2010-04-19 17:43:47.278964888 +0200
+++ process-shared-sem-hugepage.c 2010-04-19 17:44:01.100032774 +0200
@@ -30,6 +30,7 @@ int main(void)

g_thread_init(NULL);
workers = g_new(GThread *, NWORKER); work_sem = g_new(sem_t, 1);
+ posix_memalign(&work_sem, 2*1024*1024, 2*1024*1024);
result = sem_init(work_sem, TRUE, 0); g_assert(result == 0);

for (i = 0; i < NWORKER; i++) {

2010-04-19 16:05:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Mon, 2010-04-19 at 16:32 +0100, Mel Gorman wrote:
> Fix infinite loop in get_futex_key when backed by huge pages
>
> If a futex key happens to be located within a huge page mapped MAP_PRIVATE,
> get_futex_key() can go into an infinite loop waiting for a page->mapping
> that will never exist. This was reported and documented in an external
> bugzilla at
>
> https://bugzilla.redhat.com/show_bug.cgi?id=552257
>
> This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON
> mapped MAP_PRIVATE. This is enough for futex to continue but because
> of PAGE_MAPPING_ANON, the poisoned value is not dereferenced or used by
> futex. No other part of the VM should be dereferencing the page->mapping of
> a hugetlbfs page as its page cache is not on the LRU.
>
> This patch fixes the problem with the test case described in the bugzilla.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/poison.h | 10 ++++++++++
> mm/hugetlb.c | 6 +++++-
> 2 files changed, 15 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/poison.h b/include/linux/poison.h
> index 2110a81..0f7b5ac 100644
> --- a/include/linux/poison.h
> +++ b/include/linux/poison.h
> @@ -48,6 +48,16 @@
> #define POISON_FREE 0x6b /* for use-after-free poisoning */
> #define POISON_END 0xa5 /* end-byte of poisoning */
>
> +/********** mm/hugetlb.c **********/
> +/*
> + * Private mappings of hugetlb pages use this poisoned value for
> + * page->mapping. The core VM should not be doing anything with this mapping
> + * but futex requires the existance of some page->mapping value even if it
> + * is unused. If the core VM does deference the mapping, it'll look like a
> + * suspiciously high null-pointer offset starting from 0x2e5
> + */
> +#define HUGETLB_PRIVATE_MAPPING (0x2e4 | PAGE_MAPPING_ANON)

Wouldn't a longer poison be more recognisable? Also, shouldn't this use
POISON_POINTER_DELTA?

Something like:

#define HUGETBL_POISON ((void *) 0x00300300 + POISON_POINTER_DELTA)

0x2e5 isn't that high, I've had actual derefs in that range.

> +
> /********** arch/$ARCH/mm/init.c **********/
> #define POISON_FREE_INITMEM 0xcc
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6034dc9..487e3c2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -546,6 +546,7 @@ static void free_huge_page(struct page *page)
>
> mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> + page->mapping = NULL;
> BUG_ON(page_count(page));
> INIT_LIST_HEAD(&page->lru);
>
> @@ -2447,8 +2448,11 @@ retry:
> spin_lock(&inode->i_lock);
> inode->i_blocks += blocks_per_huge_page(h);
> spin_unlock(&inode->i_lock);
> - } else
> + } else {
> lock_page(page);
> + page->mapping = (struct address_space *)
> + HUGETLB_PRIVATE_MAPPING;
> + }
> }
>
> /*
>
>

2010-04-19 16:12:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Mon, Apr 19, 2010 at 05:45:05PM +0200, Peter Zijlstra wrote:
> Wouldn't a longer poison be more recognisable? Also, shouldn't this use
> POISON_POINTER_DELTA?
>
> Something like:
>
> #define HUGETBL_POISON ((void *) 0x00300300 + POISON_POINTER_DELTA)
>
> 0x2e5 isn't that high, I've had actual derefs in that range.

The default at kernel config time sets only 4k as unmapped (I think
it's a very bad default for 64bit archs), so above 4k userland can map
it and you can have actual derefs with 0x00300300 but not with Mel's
preferred <0x1000 address. So the address must be <0x1000.

2010-04-19 16:18:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Mon, 2010-04-19 at 18:11 +0200, Andrea Arcangeli wrote:
> On Mon, Apr 19, 2010 at 05:45:05PM +0200, Peter Zijlstra wrote:
> > Wouldn't a longer poison be more recognisable? Also, shouldn't this use
> > POISON_POINTER_DELTA?
> >
> > Something like:
> >
> > #define HUGETBL_POISON ((void *) 0x00300300 + POISON_POINTER_DELTA)
> >
> > 0x2e5 isn't that high, I've had actual derefs in that range.
>
> The default at kernel config time sets only 4k as unmapped (I think
> it's a very bad default for 64bit archs), so above 4k userland can map
> it and you can have actual derefs with 0x00300300 but not with Mel's
> preferred <0x1000 address. So the address must be <0x1000.

Well, most poison values have that problem and still we have them. Also
on 64bit machines you can use POISON_POINTER_DELTA to map it outside the
virtual address range.

2010-04-19 16:35:12

by Mel Gorman

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Mon, Apr 19, 2010 at 05:45:05PM +0200, Peter Zijlstra wrote:
> On Mon, 2010-04-19 at 16:32 +0100, Mel Gorman wrote:
> > Fix infinite loop in get_futex_key when backed by huge pages
> >
> > If a futex key happens to be located within a huge page mapped MAP_PRIVATE,
> > get_futex_key() can go into an infinite loop waiting for a page->mapping
> > that will never exist. This was reported and documented in an external
> > bugzilla at
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=552257
> >
> > This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON
> > mapped MAP_PRIVATE. This is enough for futex to continue but because
> > of PAGE_MAPPING_ANON, the poisoned value is not dereferenced or used by
> > futex. No other part of the VM should be dereferencing the page->mapping of
> > a hugetlbfs page as its page cache is not on the LRU.
> >
> > This patch fixes the problem with the test case described in the bugzilla.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > include/linux/poison.h | 10 ++++++++++
> > mm/hugetlb.c | 6 +++++-
> > 2 files changed, 15 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/linux/poison.h b/include/linux/poison.h
> > index 2110a81..0f7b5ac 100644
> > --- a/include/linux/poison.h
> > +++ b/include/linux/poison.h
> > @@ -48,6 +48,16 @@
> > #define POISON_FREE 0x6b /* for use-after-free poisoning */
> > #define POISON_END 0xa5 /* end-byte of poisoning */
> >
> > +/********** mm/hugetlb.c **********/
> > +/*
> > + * Private mappings of hugetlb pages use this poisoned value for
> > + * page->mapping. The core VM should not be doing anything with this mapping
> > + * but futex requires the existance of some page->mapping value even if it
> > + * is unused. If the core VM does deference the mapping, it'll look like a
> > + * suspiciously high null-pointer offset starting from 0x2e5
> > + */
> > +#define HUGETLB_PRIVATE_MAPPING (0x2e4 | PAGE_MAPPING_ANON)
>
> Wouldn't a longer poison be more recognisable? Also, shouldn't this use
> POISON_POINTER_DELTA?
>

I was looking for an address < 0x1000 because it would only be valid in very
rare cases. I wasn't so sure about any other pointer value and only x86-64
appears to define POISON_POINTER_DELTA.

> Something like:
>
> #define HUGETBL_POISON ((void *) 0x00300300 + POISON_POINTER_DELTA)
>
> 0x2e5 isn't that high, I've had actual derefs in that range.
>

So have I, but it couldn't be too near the page boundary either and pretty
much any address can be valid. Still, architectures aren't stopped from
defining the delta and it is something we appear to rely on for the list
poisoning. I'd prefer a value below 0x1000 but matching list poisoning should
work for the most part.

How about?

==== CUT HERE ====
Fix infinite loop in get_futex_key when backed by huge pages

If a futex key happens to be located within a huge page mapped MAP_PRIVATE,
get_futex_key() can go into an infinite loop waiting for a page->mapping that
will never exist. This was reported and documented in an external bugzilla at

https://bugzilla.redhat.com/show_bug.cgi?id=552257

This patch makes page->mapping a poisoned value that includes
PAGE_MAPPING_ANON mapped MAP_PRIVATE. This is enough for futex to continue
but because of PAGE_MAPPING_ANON, the poisoned value is not dereferenced
or used by futex. No other part of the VM should be dereferencing the
page->mapping of a hugetlbfs page as its page cache is not on the LRU.

This patch fixes the problem with the test case described in the bugzilla.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/poison.h | 9 +++++++++
mm/hugetlb.c | 5 ++++-
2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81..bab71f3 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -48,6 +48,15 @@
#define POISON_FREE 0x6b /* for use-after-free poisoning */
#define POISON_END 0xa5 /* end-byte of poisoning */

+/********** mm/hugetlb.c **********/
+/*
+ * Private mappings of hugetlb pages use this poisoned value for
+ * page->mapping. The core VM should not be doing anything with this mapping
+ * but futex requires the existance of some page->mapping value even though it
+ * is unused if PAGE_MAPPING_ANON is set.
+ */
+#define HUGETLB_POISON ((void *)(0x00300300 + POISON_POINTER_DELTA + PAGE_MAPPING_ANON))
+
/********** arch/$ARCH/mm/init.c **********/
#define POISON_FREE_INITMEM 0xcc

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6034dc9..ffbdfc8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -546,6 +546,7 @@ static void free_huge_page(struct page *page)

mapping = (struct address_space *) page_private(page);
set_page_private(page, 0);
+ page->mapping = NULL;
BUG_ON(page_count(page));
INIT_LIST_HEAD(&page->lru);

@@ -2447,8 +2448,10 @@ retry:
spin_lock(&inode->i_lock);
inode->i_blocks += blocks_per_huge_page(h);
spin_unlock(&inode->i_lock);
- } else
+ } else {
lock_page(page);
+ page->mapping = HUGETLB_POISON;
+ }
}

/*

2010-04-19 16:46:51

by Darren Hart

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

Mel Gorman wrote:
> On Mon, Apr 19, 2010 at 01:52:36PM +0200, Peter Zijlstra wrote:
>> On Mon, 2010-04-19 at 12:43 +0100, Mel Gorman wrote:
>>
>>>> Right, so I had a quick chat with Mel, and it appears MAP_PRIVATE
>>>> hugetlb pages don't have their page->mapping set.
>>>>
>>>> I guess something like the below might work, but I'd really rather not
>>>> add hugetlb knowledge to futex.c. Does anybody else have a better idea?
>>>> Maybe create something similar to an anon_vma for hugetlb pages?
>>>>
>>> anon_vma for hugetlb pages sounds overkill, what would it gain? In this
>>> context, futex only appears to distinguish between whether the
>>> references are private or shared.
>>>
>>> Looking at the hugetlbfs code, I can't see a place where it actually cares
>>> about the mapping as such. It's used to find shared pages in the page cache
>>> (but not in the LRU) that are backed by the hugetlbfs file. For hugetlbfs
>>> though, the mapping is mostly kept in page->private for reservation accounting
>>> purposes.
>>>
>>> I can't think of other parts of the VM that touch the mapping if the
>>> page is managed by hugetlbfs so the following patch should also work but
>>> without futex having hugetlbfs-awareness. What do you think? Maybe for
>>> safety, it would be better to make the mapping some obvious poison bytes
>>> or'd with PAGE_MAPPING_ANON so an oops will be more obvious?
>> Yes, this seems perfectly adequate to me, that poison idea might be
>> worthwhile too :-)
>>
>> Acked-by: Peter Zijlstra <[email protected]>
>>
>
> Are you ok with an Ack to this slightly-difference patch too? If so,
> I'll forward it on to Andrew.

Thanks Mel, I'm also keen to keep hugetlb awareness out of futex.c, it's
crazy enough in there already!

Acked-by: Darren Hart <[email protected]>

r6144, would you consider taking a look at the futextest test suite and
letting me know where you think it might need improvement to catch what
you ran into in your tests. I'm thinking some options to futex_wait and
possibly other tests to tell it where to map the futex pages.

http://git.kernel.org/?p=linux/kernel/git/dvhart/futextest.git;a=summary

Thanks,

Darren Hart


>
> ==== CUT HERE ====
>
> Fix infinite loop in get_futex_key when backed by huge pages
>
> If a futex key happens to be located within a huge page mapped MAP_PRIVATE,
> get_futex_key() can go into an infinite loop waiting for a page->mapping
> that will never exist. This was reported and documented in an external
> bugzilla at
>
> https://bugzilla.redhat.com/show_bug.cgi?id=552257
>
> This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON
> mapped MAP_PRIVATE. This is enough for futex to continue but because
> of PAGE_MAPPING_ANON, the poisoned value is not dereferenced or used by
> futex. No other part of the VM should be dereferencing the page->mapping of
> a hugetlbfs page as its page cache is not on the LRU.
>
> This patch fixes the problem with the test case described in the bugzilla.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/poison.h | 10 ++++++++++
> mm/hugetlb.c | 6 +++++-
> 2 files changed, 15 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/poison.h b/include/linux/poison.h
> index 2110a81..0f7b5ac 100644
> --- a/include/linux/poison.h
> +++ b/include/linux/poison.h
> @@ -48,6 +48,16 @@
> #define POISON_FREE 0x6b /* for use-after-free poisoning */
> #define POISON_END 0xa5 /* end-byte of poisoning */
>
> +/********** mm/hugetlb.c **********/
> +/*
> + * Private mappings of hugetlb pages use this poisoned value for
> + * page->mapping. The core VM should not be doing anything with this mapping
> + * but futex requires the existance of some page->mapping value even if it
> + * is unused. If the core VM does deference the mapping, it'll look like a
> + * suspiciously high null-pointer offset starting from 0x2e5
> + */
> +#define HUGETLB_PRIVATE_MAPPING (0x2e4 | PAGE_MAPPING_ANON)
> +
> /********** arch/$ARCH/mm/init.c **********/
> #define POISON_FREE_INITMEM 0xcc
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6034dc9..487e3c2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -546,6 +546,7 @@ static void free_huge_page(struct page *page)
>
> mapping = (struct address_space *) page_private(page);
> set_page_private(page, 0);
> + page->mapping = NULL;
> BUG_ON(page_count(page));
> INIT_LIST_HEAD(&page->lru);
>
> @@ -2447,8 +2448,11 @@ retry:
> spin_lock(&inode->i_lock);
> inode->i_blocks += blocks_per_huge_page(h);
> spin_unlock(&inode->i_lock);
> - } else
> + } else {
> lock_page(page);
> + page->mapping = (struct address_space *)
> + HUGETLB_PRIVATE_MAPPING;
> + }
> }
>
> /*


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-19 17:17:34

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Process-shared futexes on hugepages puts the kernel in an infinite loop in 2.6.32.11; is this fixed now?

On Mon, Apr 19, 2010 at 06:18:48PM +0200, Peter Zijlstra wrote:
> Well, most poison values have that problem and still we have them. Also

That would better be fixed too to stay <4096 for higher chance of
bug-detection, it doesn't make this case correct ;).

> on 64bit machines you can use POISON_POINTER_DELTA to map it outside the
> virtual address range.

We've thousands of magic values there, I don't see much benefit from
POISON_POINTER_DELTA other than being able to call it
0xdeadbeef+POISON_POINTER_DELTA ;). We always look the assembly to
find the actual real raw pointer value (without the field offset) so I
think using a range between 0xaaa and 0xbbb for the error pointers, is
functional enough, but it's up to you as long as it is a address range
that can't be used by userland it's surely ok ;).