2004-04-23 08:22:46

by David Gibson

[permalink] [raw]
Subject: put_page() tries to handle hugepages but fails

Andrew, please apply.

The code of put_page() is misleading, in that it appears to have code
handling PageCompound pages (i.e. hugepages). However it won't
actually handle them correctly - __page_cache_release() will not work
properly on a compound. Instead, hugepages should be and are released
with huge_page_release() from mm/hugetlb.c. This patch removes the
broken PageCompound path from put_page(), replacing it with a
BUG_ON(). This also removes the initialization of page[1].mapping
from compoound pages, which was only ever used in this broken code
path.

Index: working-2.6/mm/swap.c
===================================================================
--- working-2.6.orig/mm/swap.c 2004-04-14 12:22:49.000000000 +1000
+++ working-2.6/mm/swap.c 2004-04-23 18:13:53.932263936 +1000
@@ -38,17 +38,7 @@

void put_page(struct page *page)
{
- if (unlikely(PageCompound(page))) {
- page = (struct page *)page->private;
- if (put_page_testzero(page)) {
- if (page[1].mapping) { /* destructor? */
- (*(void (*)(struct page *))page[1].mapping)(page);
- } else {
- __page_cache_release(page);
- }
- }
- return;
- }
+ BUG_ON(PageCompound(page));
if (!PageReserved(page) && put_page_testzero(page))
__page_cache_release(page);
}
Index: working-2.6/mm/page_alloc.c
===================================================================
--- working-2.6.orig/mm/page_alloc.c 2004-04-13 11:42:42.000000000 +1000
+++ working-2.6/mm/page_alloc.c 2004-04-23 18:18:06.799295248 +1000
@@ -118,7 +118,6 @@
int i;
int nr_pages = 1 << order;

- page[1].mapping = 0;
page[1].index = order;
for (i = 0; i < nr_pages; i++) {
struct page *p = page + i;




--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson


2004-04-23 08:36:03

by Andrew Morton

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

David Gibson <[email protected]> wrote:
>
> Andrew, please apply.
>
> The code of put_page() is misleading, in that it appears to have code
> handling PageCompound pages (i.e. hugepages). However it won't
> actually handle them correctly - __page_cache_release() will not work
> properly on a compound. Instead, hugepages should be and are released
> with huge_page_release() from mm/hugetlb.c. This patch removes the
> broken PageCompound path from put_page(), replacing it with a
> BUG_ON(). This also removes the initialization of page[1].mapping
> from compoound pages, which was only ever used in this broken code
> path.

We could certainly remove the test for a null destructor in there and
require that compound pages have a destructor installed.

But the main reason why that code is in there is for transparently handling
direct-io into hugepage regions. That code does perform put_page against
4k pageframes within the huge page and it does follow the pointer to the
head page.

With your patch applied get_user_pages() and bio_release_pages() will
manipulate the refcounts of the inner 4k pages rather than the head pages
and things will explode.

We could change follow_hugetlb_page() to always take a ref against the head
page and we could teach bio_release_pages() to perform appropriate pfn
masking to locate the head page, and perform similar tricks for
futexes-in-large-pages. But with the code as-is the refcounting works
transparently.


If it's "broken" I wanna know why.

2004-04-23 10:28:44

by William Lee Irwin III

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

On Fri, Apr 23, 2004 at 01:34:37AM -0700, Andrew Morton wrote:
> We could certainly remove the test for a null destructor in there and
> require that compound pages have a destructor installed.
> But the main reason why that code is in there is for transparently handling
> direct-io into hugepage regions. That code does perform put_page against
> 4k pageframes within the huge page and it does follow the pointer to the
> head page.
> With your patch applied get_user_pages() and bio_release_pages() will
> manipulate the refcounts of the inner 4k pages rather than the head pages
> and things will explode.
> We could change follow_hugetlb_page() to always take a ref against the head
> page and we could teach bio_release_pages() to perform appropriate pfn
> masking to locate the head page, and perform similar tricks for
> futexes-in-large-pages. But with the code as-is the refcounting works
> transparently.
> If it's "broken" I wanna know why.

The destructor is never invoked from that path, but it's not the path that
should free it anyway. To me it appears the call to __page_cache_release()
on the head of the hugepage should just be removed in favor of doing
nothing; at best it can only race against concurrent put_page(), see
page_count(page) vanish, and accidentally call free_hot_page() against
the head of the hugepage. As hugepages are never on the LRU, the
remainder of __page_cache_release() should be a nop for them.

Untested patch below.


-- wli


Index: wli-2.6.6-rc2-mm1/mm/swap.c
===================================================================
--- wli-2.6.6-rc2-mm1.orig/mm/swap.c 2004-04-21 05:19:58.000000000 -0700
+++ wli-2.6.6-rc2-mm1/mm/swap.c 2004-04-23 03:21:22.000000000 -0700
@@ -41,15 +41,12 @@
if (unlikely(PageCompound(page))) {
page = (struct page *)page->private;
if (put_page_testzero(page)) {
- if (page[1].mapping) { /* destructor? */
- (*(void (*)(struct page *))page[1].mapping)(page);
- } else {
- __page_cache_release(page);
- }
+ void (*destructor)(struct page *);
+ destructor = (void (*)(struct page *))page[1].mapping;
+ BUG_ON(!destructor);
+ (*destructor)(page);
}
- return;
- }
- if (!PageReserved(page) && put_page_testzero(page))
+ } else if (!PageReserved(page) && put_page_testzero(page))
__page_cache_release(page);
}
EXPORT_SYMBOL(put_page);

2004-04-23 10:35:44

by Andrew Morton

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

William Lee Irwin III <[email protected]> wrote:
>
> --- wli-2.6.6-rc2-mm1.orig/mm/swap.c 2004-04-21 05:19:58.000000000 -0700
> +++ wli-2.6.6-rc2-mm1/mm/swap.c 2004-04-23 03:21:22.000000000 -0700
> @@ -41,15 +41,12 @@
> if (unlikely(PageCompound(page))) {
> page = (struct page *)page->private;
> if (put_page_testzero(page)) {
> - if (page[1].mapping) { /* destructor? */
> - (*(void (*)(struct page *))page[1].mapping)(page);
> - } else {
> - __page_cache_release(page);
> - }
> + void (*destructor)(struct page *);
> + destructor = (void (*)(struct page *))page[1].mapping;
> + BUG_ON(!destructor);
> + (*destructor)(page);
> }
> - return;
> - }
> - if (!PageReserved(page) && put_page_testzero(page))
> + } else if (!PageReserved(page) && put_page_testzero(page))
> __page_cache_release(page);

Sure.

This of course duplicates huge_page_release(), which can be killed off.

2004-04-23 10:47:56

by William Lee Irwin III

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

On Fri, Apr 23, 2004 at 03:35:22AM -0700, Andrew Morton wrote:
> Sure.
> This of course duplicates huge_page_release(), which can be killed off.

Ah, but mm/hugetlb.c is putting the destructor in head->lru.prev not
head[1].mapping; fix below along with nuking huge_page_release().


-- wli


Index: wli-2.6.6-rc2-mm1/include/linux/hugetlb.h
===================================================================
--- wli-2.6.6-rc2-mm1.orig/include/linux/hugetlb.h 2004-04-21 05:20:33.000000000 -0700
+++ wli-2.6.6-rc2-mm1/include/linux/hugetlb.h 2004-04-23 03:40:24.000000000 -0700
@@ -18,7 +18,6 @@
void zap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
-void huge_page_release(struct page *);
int hugetlb_report_meminfo(char *);
int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
@@ -70,7 +69,6 @@
#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
#define zap_hugepage_range(vma, start, len) BUG()
#define unmap_hugepage_range(vma, start, end) BUG()
-#define huge_page_release(page) BUG()
#define is_hugepage_mem_enough(size) 0
#define hugetlb_report_meminfo(buf) 0
#define mark_mm_hugetlb(mm, vma) do { } while (0)
Index: wli-2.6.6-rc2-mm1/mm/swap.c
===================================================================
--- wli-2.6.6-rc2-mm1.orig/mm/swap.c 2004-04-21 05:19:58.000000000 -0700
+++ wli-2.6.6-rc2-mm1/mm/swap.c 2004-04-23 03:21:22.000000000 -0700
@@ -41,15 +41,12 @@
if (unlikely(PageCompound(page))) {
page = (struct page *)page->private;
if (put_page_testzero(page)) {
- if (page[1].mapping) { /* destructor? */
- (*(void (*)(struct page *))page[1].mapping)(page);
- } else {
- __page_cache_release(page);
- }
+ void (*destructor)(struct page *);
+ destructor = (void (*)(struct page *))page[1].mapping;
+ BUG_ON(!destructor);
+ (*destructor)(page);
}
- return;
- }
- if (!PageReserved(page) && put_page_testzero(page))
+ } else if (!PageReserved(page) && put_page_testzero(page))
__page_cache_release(page);
}
EXPORT_SYMBOL(put_page);
Index: wli-2.6.6-rc2-mm1/mm/hugetlb.c
===================================================================
--- wli-2.6.6-rc2-mm1.orig/mm/hugetlb.c 2004-04-21 05:19:58.000000000 -0700
+++ wli-2.6.6-rc2-mm1/mm/hugetlb.c 2004-04-23 03:45:41.000000000 -0700
@@ -78,20 +78,12 @@
free_huge_pages--;
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
- page->lru.prev = (void *)free_huge_page;
+ page[1].mapping = (void *)free_huge_page;
for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
clear_highpage(&page[i]);
return page;
}

-void huge_page_release(struct page *page)
-{
- if (!put_page_testzero(page))
- return;
-
- free_huge_page(page);
-}
-
static int __init hugetlb_init(void)
{
unsigned long i;
Index: wli-2.6.6-rc2-mm1/arch/i386/mm/hugetlbpage.c
===================================================================
--- wli-2.6.6-rc2-mm1.orig/arch/i386/mm/hugetlbpage.c 2004-04-21 05:20:29.000000000 -0700
+++ wli-2.6.6-rc2-mm1/arch/i386/mm/hugetlbpage.c 2004-04-23 03:40:04.000000000 -0700
@@ -220,7 +220,7 @@
if (pte_none(pte))
continue;
page = pte_page(pte);
- huge_page_release(page);
+ put_page(page);
}
mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
Index: wli-2.6.6-rc2-mm1/arch/ia64/mm/hugetlbpage.c
===================================================================
--- wli-2.6.6-rc2-mm1.orig/arch/ia64/mm/hugetlbpage.c 2004-04-21 05:19:41.000000000 -0700
+++ wli-2.6.6-rc2-mm1/arch/ia64/mm/hugetlbpage.c 2004-04-23 03:40:07.000000000 -0700
@@ -249,7 +249,7 @@
if (pte_none(*pte))
continue;
page = pte_page(*pte);
- huge_page_release(page);
+ put_page(page);
pte_clear(pte);
}
mm->rss -= (end - start) >> PAGE_SHIFT;
Index: wli-2.6.6-rc2-mm1/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- wli-2.6.6-rc2-mm1.orig/arch/ppc64/mm/hugetlbpage.c 2004-04-21 05:20:29.000000000 -0700
+++ wli-2.6.6-rc2-mm1/arch/ppc64/mm/hugetlbpage.c 2004-04-23 03:40:10.000000000 -0700
@@ -394,7 +394,7 @@
flush_hash_hugepage(mm->context, addr,
pte, local);

- huge_page_release(page);
+ put_page(page);
}

mm->rss -= (end - start) >> PAGE_SHIFT;
Index: wli-2.6.6-rc2-mm1/arch/sh/mm/hugetlbpage.c
===================================================================
--- wli-2.6.6-rc2-mm1.orig/arch/sh/mm/hugetlbpage.c 2004-04-21 05:19:43.000000000 -0700
+++ wli-2.6.6-rc2-mm1/arch/sh/mm/hugetlbpage.c 2004-04-23 03:40:14.000000000 -0700
@@ -200,7 +200,7 @@
if (pte_none(*pte))
continue;
page = pte_page(*pte);
- huge_page_release(page);
+ put_page(page);
for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
pte_clear(pte);
pte++;
Index: wli-2.6.6-rc2-mm1/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- wli-2.6.6-rc2-mm1.orig/arch/sparc64/mm/hugetlbpage.c 2004-04-21 05:19:43.000000000 -0700
+++ wli-2.6.6-rc2-mm1/arch/sparc64/mm/hugetlbpage.c 2004-04-23 03:40:17.000000000 -0700
@@ -198,7 +198,7 @@
if (pte_none(*pte))
continue;
page = pte_page(*pte);
- huge_page_release(page);
+ put_page(page);
for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
pte_clear(pte);
pte++;

2004-04-23 11:28:22

by William Lee Irwin III

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

On Fri, Apr 23, 2004 at 03:35:22AM -0700, Andrew Morton wrote:
>> Sure.
>> This of course duplicates huge_page_release(), which can be killed off.

On Fri, Apr 23, 2004 at 03:47:44AM -0700, William Lee Irwin III wrote:
> Ah, but mm/hugetlb.c is putting the destructor in head->lru.prev not
> head[1].mapping; fix below along with nuking huge_page_release().

I just noticed a general problem where ->mapping is often referred to
naked and so either needs to be filled in for the base pages of the
superpage or the examiners need to be referred to the head of the superpage.
Likewise with ->index, which is apparently now used to hold the order.
In general, if they're superpage properties, either the examiners must
be referred to the head of the superpage, or (wastefully) the values must
be replicated across the base pages. The current scheme to prevent radix
tree deadlocks and other nastinesses with multiple insertions (e.g.
inserting an entry into the radix tree for each base page, which I've
seen naive abuses of shit themselves several times before) reduces the
resolution of the pagecache indices, which is problematic in various
ways with respect to finding the right base page without updating
excessive numbers of callers. The reason why this is less of a problem
than it sounds like is that the base page lookups usually go through
pagetables, which effectively act either as radix trees with an
insertion for each base page (normal cpus) or radix trees with low-
resolution indices with procedural API's to recover offsets into the
superpage and the proper base page (i386, ppc64).

I'll be off the net all weekend, so hopefully the interested parties can
use this info to resolve the remainder of the hugetlb put_page() issues.
The work here is just cleaning up self-inconsistencies so it's not hard.


-- wli

2004-04-27 04:40:34

by David Gibson

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

On Fri, Apr 23, 2004 at 01:34:37AM -0700, Andrew Morton wrote:
> David Gibson <[email protected]> wrote:
> >
> > Andrew, please apply.
> >
> > The code of put_page() is misleading, in that it appears to have code
> > handling PageCompound pages (i.e. hugepages). However it won't
> > actually handle them correctly - __page_cache_release() will not work
> > properly on a compound. Instead, hugepages should be and are released
> > with huge_page_release() from mm/hugetlb.c. This patch removes the
> > broken PageCompound path from put_page(), replacing it with a
> > BUG_ON(). This also removes the initialization of page[1].mapping
> > from compoound pages, which was only ever used in this broken code
> > path.
>
> We could certainly remove the test for a null destructor in there and
> require that compound pages have a destructor installed.

We could, but there seems little point, given that we have to test for
PageCompound anyway, and the destructor is always the same for
hugepages.

> But the main reason why that code is in there is for transparently handling
> direct-io into hugepage regions. That code does perform put_page against
> 4k pageframes within the huge page and it does follow the pointer to the
> head page.

Ah. This I did not know.

> With your patch applied get_user_pages() and bio_release_pages() will
> manipulate the refcounts of the inner 4k pages rather than the head pages
> and things will explode.

Actually, no - bio_release_pages() will instantly BUG() with my patch.

> We could change follow_hugetlb_page() to always take a ref against the head
> page and we could teach bio_release_pages() to perform appropriate pfn
> masking to locate the head page, and perform similar tricks for
> futexes-in-large-pages. But with the code as-is the refcounting works
> transparently.
>
> If it's "broken" I wanna know why.

It's broken if put_page() ever actually reduces a hugepage's count to
zero, because it will call __page_cache_release() which will do the
wrong thing on hugepages. We could bypass that by requiring that a
hugepage always has a destructor (and actually setting it in the
correct place, which hugetlb.c currently doesn't as wli pointed out).
But since we only have one destructor, why bother.

Patch below makes put_page() work correctly on hugepages, and uses it
(via page_cache_release()) in place of huge_page_release(). It also
abolishes the destructor pointer - free_huge_page() is always used.
Please apply.

Index: working-2.6/mm/swap.c
===================================================================
--- working-2.6.orig/mm/swap.c 2004-04-14 12:22:49.000000000 +1000
+++ working-2.6/mm/swap.c 2004-04-27 14:19:10.184343216 +1000
@@ -40,13 +40,8 @@
{
if (unlikely(PageCompound(page))) {
page = (struct page *)page->private;
- if (put_page_testzero(page)) {
- if (page[1].mapping) { /* destructor? */
- (*(void (*)(struct page *))page[1].mapping)(page);
- } else {
- __page_cache_release(page);
- }
- }
+ if (put_page_testzero(page))
+ free_huge_page(page);
return;
}
if (!PageReserved(page) && put_page_testzero(page))
Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c 2004-04-14 12:22:49.000000000 +1000
+++ working-2.6/mm/hugetlb.c 2004-04-27 14:24:14.155325888 +1000
@@ -78,20 +78,11 @@
free_huge_pages--;
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
- page->lru.prev = (void *)free_huge_page;
for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
clear_highpage(&page[i]);
return page;
}

-void huge_page_release(struct page *page)
-{
- if (!put_page_testzero(page))
- return;
-
- free_huge_page(page);
-}
-
static int __init hugetlb_init(void)
{
unsigned long i;
Index: working-2.6/arch/i386/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/i386/mm/hugetlbpage.c 2004-04-22 09:09:11.000000000 +1000
+++ working-2.6/arch/i386/mm/hugetlbpage.c 2004-04-27 14:21:30.874387904 +1000
@@ -221,7 +221,7 @@
if (pte_none(pte))
continue;
page = pte_page(pte);
- huge_page_release(page);
+ page_cache_release(page);
}
mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
Index: working-2.6/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c 2004-04-20 10:50:06.000000000 +1000
+++ working-2.6/arch/ppc64/mm/hugetlbpage.c 2004-04-27 14:21:18.225313880 +1000
@@ -395,7 +395,7 @@
flush_hash_hugepage(mm->context, addr,
pte, local);

- huge_page_release(page);
+ page_cache_release(page);
}

mm->rss -= (end - start) >> PAGE_SHIFT;
Index: working-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/sh/mm/hugetlbpage.c 2004-04-20 10:50:06.000000000 +1000
+++ working-2.6/arch/sh/mm/hugetlbpage.c 2004-04-27 14:22:19.234357656 +1000
@@ -200,7 +200,7 @@
if (pte_none(*pte))
continue;
page = pte_page(*pte);
- huge_page_release(page);
+ page_cache_release(page);
for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
pte_clear(pte);
pte++;
Index: working-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/sparc64/mm/hugetlbpage.c 2004-04-20 10:50:06.000000000 +1000
+++ working-2.6/arch/sparc64/mm/hugetlbpage.c 2004-04-27 14:22:34.274333944 +1000
@@ -198,7 +198,7 @@
if (pte_none(*pte))
continue;
page = pte_page(*pte);
- huge_page_release(page);
+ page_cache_release(page);
for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
pte_clear(pte);
pte++;
Index: working-2.6/fs/hugetlbfs/inode.c
===================================================================
--- working-2.6.orig/fs/hugetlbfs/inode.c 2004-04-13 11:42:38.000000000 +1000
+++ working-2.6/fs/hugetlbfs/inode.c 2004-04-27 14:23:17.963348384 +1000
@@ -142,7 +142,7 @@
int i;

for (i = 0; i < pagevec_count(pvec); ++i)
- huge_page_release(pvec->pages[i]);
+ page_cache_release(pvec->pages[i]);

pagevec_reinit(pvec);
}
@@ -152,7 +152,7 @@
clear_page_dirty(page);
ClearPageUptodate(page);
remove_from_page_cache(page);
- huge_page_release(page);
+ page_cache_release(page);
}

void truncate_hugepages(struct address_space *mapping, loff_t lstart)
Index: working-2.6/include/linux/hugetlb.h
===================================================================
--- working-2.6.orig/include/linux/hugetlb.h 2004-04-20 10:50:08.000000000 +1000
+++ working-2.6/include/linux/hugetlb.h 2004-04-27 14:23:59.443298984 +1000
@@ -16,7 +16,6 @@
void zap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
-void huge_page_release(struct page *);
int hugetlb_report_meminfo(char *);
int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
@@ -68,7 +67,6 @@
#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
#define zap_hugepage_range(vma, start, len) BUG()
#define unmap_hugepage_range(vma, start, end) BUG()
-#define huge_page_release(page) BUG()
#define is_hugepage_mem_enough(size) 0
#define hugetlb_report_meminfo(buf) 0
#define mark_mm_hugetlb(mm, vma) do { } while (0)


--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson

2004-04-27 04:45:41

by David Gibson

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

On Tue, Apr 27, 2004 at 02:36:52PM +1000, David Gibson wrote:
> On Fri, Apr 23, 2004 at 01:34:37AM -0700, Andrew Morton wrote:
> > David Gibson <[email protected]> wrote:
> > >
> > > Andrew, please apply.
> > >
> > > The code of put_page() is misleading, in that it appears to have code
> > > handling PageCompound pages (i.e. hugepages). However it won't
> > > actually handle them correctly - __page_cache_release() will not work
> > > properly on a compound. Instead, hugepages should be and are released
> > > with huge_page_release() from mm/hugetlb.c. This patch removes the
> > > broken PageCompound path from put_page(), replacing it with a
> > > BUG_ON(). This also removes the initialization of page[1].mapping
> > > from compoound pages, which was only ever used in this broken code
> > > path.
> >
> > We could certainly remove the test for a null destructor in there and
> > require that compound pages have a destructor installed.
>
> We could, but there seems little point, given that we have to test for
> PageCompound anyway, and the destructor is always the same for
> hugepages.
>
> > But the main reason why that code is in there is for transparently handling
> > direct-io into hugepage regions. That code does perform put_page against
> > 4k pageframes within the huge page and it does follow the pointer to the
> > head page.
>
> Ah. This I did not know.
>
> > With your patch applied get_user_pages() and bio_release_pages() will
> > manipulate the refcounts of the inner 4k pages rather than the head pages
> > and things will explode.
>
> Actually, no - bio_release_pages() will instantly BUG() with my patch.
>
> > We could change follow_hugetlb_page() to always take a ref against the head
> > page and we could teach bio_release_pages() to perform appropriate pfn
> > masking to locate the head page, and perform similar tricks for
> > futexes-in-large-pages. But with the code as-is the refcounting works
> > transparently.
> >
> > If it's "broken" I wanna know why.
>
> It's broken if put_page() ever actually reduces a hugepage's count to
> zero, because it will call __page_cache_release() which will do the
> wrong thing on hugepages. We could bypass that by requiring that a
> hugepage always has a destructor (and actually setting it in the
> correct place, which hugetlb.c currently doesn't as wli pointed out).
> But since we only have one destructor, why bother.
>
> Patch below makes put_page() work correctly on hugepages, and uses it
> (via page_cache_release()) in place of huge_page_release(). It also
> abolishes the destructor pointer - free_huge_page() is always used.
> Please apply.

Damn. Supplied patch generated a warning. Fixed version below:

Index: working-2.6/mm/swap.c
===================================================================
--- working-2.6.orig/mm/swap.c 2004-04-14 12:22:49.000000000 +1000
+++ working-2.6/mm/swap.c 2004-04-27 14:42:40.257309016 +1000
@@ -30,6 +30,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/init.h>
+#include <linux/hugetlb.h>

/* How many pages do we try to swap or page in/out together? */
int page_cluster;
@@ -40,13 +41,8 @@
{
if (unlikely(PageCompound(page))) {
page = (struct page *)page->private;
- if (put_page_testzero(page)) {
- if (page[1].mapping) { /* destructor? */
- (*(void (*)(struct page *))page[1].mapping)(page);
- } else {
- __page_cache_release(page);
- }
- }
+ if (put_page_testzero(page))
+ free_huge_page(page);
return;
}
if (!PageReserved(page) && put_page_testzero(page))
Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c 2004-04-14 12:22:49.000000000 +1000
+++ working-2.6/mm/hugetlb.c 2004-04-27 14:24:14.155325888 +1000
@@ -78,20 +78,11 @@
free_huge_pages--;
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
- page->lru.prev = (void *)free_huge_page;
for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
clear_highpage(&page[i]);
return page;
}

-void huge_page_release(struct page *page)
-{
- if (!put_page_testzero(page))
- return;
-
- free_huge_page(page);
-}
-
static int __init hugetlb_init(void)
{
unsigned long i;
Index: working-2.6/arch/i386/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/i386/mm/hugetlbpage.c 2004-04-22 09:09:11.000000000 +1000
+++ working-2.6/arch/i386/mm/hugetlbpage.c 2004-04-27 14:21:30.874387904 +1000
@@ -221,7 +221,7 @@
if (pte_none(pte))
continue;
page = pte_page(pte);
- huge_page_release(page);
+ page_cache_release(page);
}
mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
Index: working-2.6/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c 2004-04-20 10:50:06.000000000 +1000
+++ working-2.6/arch/ppc64/mm/hugetlbpage.c 2004-04-27 14:21:18.225313880 +1000
@@ -395,7 +395,7 @@
flush_hash_hugepage(mm->context, addr,
pte, local);

- huge_page_release(page);
+ page_cache_release(page);
}

mm->rss -= (end - start) >> PAGE_SHIFT;
Index: working-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/sh/mm/hugetlbpage.c 2004-04-20 10:50:06.000000000 +1000
+++ working-2.6/arch/sh/mm/hugetlbpage.c 2004-04-27 14:22:19.234357656 +1000
@@ -200,7 +200,7 @@
if (pte_none(*pte))
continue;
page = pte_page(*pte);
- huge_page_release(page);
+ page_cache_release(page);
for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
pte_clear(pte);
pte++;
Index: working-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/sparc64/mm/hugetlbpage.c 2004-04-20 10:50:06.000000000 +1000
+++ working-2.6/arch/sparc64/mm/hugetlbpage.c 2004-04-27 14:22:34.274333944 +1000
@@ -198,7 +198,7 @@
if (pte_none(*pte))
continue;
page = pte_page(*pte);
- huge_page_release(page);
+ page_cache_release(page);
for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
pte_clear(pte);
pte++;
Index: working-2.6/fs/hugetlbfs/inode.c
===================================================================
--- working-2.6.orig/fs/hugetlbfs/inode.c 2004-04-13 11:42:38.000000000 +1000
+++ working-2.6/fs/hugetlbfs/inode.c 2004-04-27 14:23:17.963348384 +1000
@@ -142,7 +142,7 @@
int i;

for (i = 0; i < pagevec_count(pvec); ++i)
- huge_page_release(pvec->pages[i]);
+ page_cache_release(pvec->pages[i]);

pagevec_reinit(pvec);
}
@@ -152,7 +152,7 @@
clear_page_dirty(page);
ClearPageUptodate(page);
remove_from_page_cache(page);
- huge_page_release(page);
+ page_cache_release(page);
}

void truncate_hugepages(struct address_space *mapping, loff_t lstart)
Index: working-2.6/include/linux/hugetlb.h
===================================================================
--- working-2.6.orig/include/linux/hugetlb.h 2004-04-20 10:50:08.000000000 +1000
+++ working-2.6/include/linux/hugetlb.h 2004-04-27 14:23:59.443298984 +1000
@@ -16,7 +16,6 @@
void zap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
-void huge_page_release(struct page *);
int hugetlb_report_meminfo(char *);
int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
@@ -68,7 +67,6 @@
#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
#define zap_hugepage_range(vma, start, len) BUG()
#define unmap_hugepage_range(vma, start, end) BUG()
-#define huge_page_release(page) BUG()
#define is_hugepage_mem_enough(size) 0
#define hugetlb_report_meminfo(buf) 0
#define mark_mm_hugetlb(mm, vma) do { } while (0)


--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson

2004-04-27 04:48:45

by Andrew Morton

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

David Gibson <[email protected]> wrote:
>
> + if (put_page_testzero(page))
> + free_huge_page(page);

Well yes, but this is assuming that compound pages are always hugetlb pages.

It's true at present, but it doesn't have to always be true. The cost of
the destructor is zilch, so why not?

Please review the changes which went into 2.6.6-rc2-mm2.

2004-04-27 05:08:36

by David Gibson

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

On Mon, Apr 26, 2004 at 09:47:57PM -0700, Andrew Morton wrote:
> David Gibson <[email protected]> wrote:
> >
> > + if (put_page_testzero(page))
> > + free_huge_page(page);
>
> Well yes, but this is assuming that compound pages are always hugetlb pages.
>
> It's true at present, but it doesn't have to always be true. The cost of
> the destructor is zilch, so why not?

True enough, I guess I was just following the "don't build it till you
need it" philosophy.

> Please review the changes which went into 2.6.6-rc2-mm2.

Sorry, should have done that earlier. Looks reasonable with two small
exceptions: 1) put_page() can still theoretically call
__page_cache_release() which is wrong (and makes the code misleading)
- patch below replaces this with a BUG() if there is no destructor. 2)
what about wli's concern that mapping may be accessed without first
checking for a PageCompound?

Index: working-2.6/mm/swap.c
===================================================================
--- working-2.6.orig/mm/swap.c 2004-04-14 12:22:49.000000000 +1000
+++ working-2.6/mm/swap.c 2004-04-27 15:02:30.046342392 +1000
@@ -41,11 +41,9 @@
if (unlikely(PageCompound(page))) {
page = (struct page *)page->private;
if (put_page_testzero(page)) {
- if (page[1].mapping) { /* destructor? */
- (*(void (*)(struct page *))page[1].mapping)(page);
- } else {
- __page_cache_release(page);
- }
+ BUG_ON(! page[1].mapping);
+
+ (*(void (*)(struct page *))page[1].mapping)(page);
}
return;
}


--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson

2004-04-27 05:15:49

by Andrew Morton

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

David Gibson <[email protected]> wrote:
>
> 1) put_page() can still theoretically call
> __page_cache_release() which is wrong (and makes the code misleading)

Don't think so?

void put_page(struct page *page)
{
if (unlikely(PageCompound(page))) {
page = (struct page *)page->private;
if (put_page_testzero(page)) {
void (*dtor)(struct page *page);

dtor = (void (*)(struct page *))page[1].mapping;
(*dtor)(page);
}
return;
}
if (!PageReserved(page) && put_page_testzero(page))
__page_cache_release(page);
}

> - patch below replaces this with a BUG() if there is no destructor. 2)
> what about wli's concern that mapping may be accessed without first
> checking for a PageCompound?

That shouldn't be happening (should it?). If someone had such a page
they'd need to lock it to play with ->mapping and I cannot think of any
code which does all that and yet could stumble across a compound page?

2004-04-27 05:20:15

by David Gibson

[permalink] [raw]
Subject: Re: put_page() tries to handle hugepages but fails

On Mon, Apr 26, 2004 at 10:15:19PM -0700, Andrew Morton wrote:
> David Gibson <[email protected]> wrote:
> >
> > 1) put_page() can still theoretically call
> > __page_cache_release() which is wrong (and makes the code misleading)
>
> Don't think so?
>
> void put_page(struct page *page)
> {
> if (unlikely(PageCompound(page))) {
> page = (struct page *)page->private;
> if (put_page_testzero(page)) {
> void (*dtor)(struct page *page);
>
> dtor = (void (*)(struct page *))page[1].mapping;
> (*dtor)(page);
> }
> return;
> }
> if (!PageReserved(page) && put_page_testzero(page))
> __page_cache_release(page);
> }

Dang. Missed that patch in the series. My mistake.

--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson