It seems commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes a
hugepage regression. A git bisect points the finger at that commit
for causing an oops in the 'alloc-instantiate-race' test from the
libhugetlbfs testsuite.
Still looking to determine the reason it breaks things.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Tue, 10 Oct 2006 18:47:48 +1000
David Gibson <[email protected]> wrote:
> It seems commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes a
> hugepage regression. A git bisect points the finger at that commit
> for causing an oops in the 'alloc-instantiate-race' test from the
> libhugetlbfs testsuite.
>
> Still looking to determine the reason it breaks things.
>
It's assuming that unmap_hugepage_range() is always freeing these pages.
If the page is shared by another mapping, bad things will happen: the
threads fight over page->lru.
Doing
+	if (page_count(page) == 1)
		list_add(&page->lru, &page_list);
might help. But then we miss the tlb flush in rare racy conditions.
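For reference, the gather that commit added to unmap_hugepage_range() is
roughly the following (paraphrased from memory, not the exact source). The
list head is on the caller's stack, but the list itself is threaded through
each page's page->lru, which is shared by every mapping of that page:

	void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
				  unsigned long end)
	{
		struct mm_struct *mm = vma->vm_mm;
		unsigned long address;
		pte_t *ptep;
		pte_t pte;
		struct page *page, *tmp;
		LIST_HEAD(page_list);

		spin_lock(&mm->page_table_lock);
		for (address = start; address < end; address += HPAGE_SIZE) {
			ptep = huge_pte_offset(mm, address);
			if (!ptep)
				continue;
			pte = huge_ptep_get_and_clear(mm, address, ptep);
			if (pte_none(pte))
				continue;
			page = pte_page(pte);
			if (pte_dirty(pte))
				set_page_dirty(page);
			/*
			 * page->lru lives in the shared struct page: two threads
			 * unmapping different mappings of the same hugepage both
			 * link it in here and corrupt each other's list.
			 */
			list_add(&page->lru, &page_list);
		}
		spin_unlock(&mm->page_table_lock);
		flush_tlb_range(vma, start, end);	/* one lazy flush for the range */
		list_for_each_entry_safe(page, tmp, &page_list, lru) {
			list_del(&page->lru);
			put_page(page);
		}
	}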
On Tue, Oct 10, 2006 at 02:04:26AM -0700, Andrew Morton wrote:
> On Tue, 10 Oct 2006 18:47:48 +1000
> David Gibson <[email protected]> wrote:
>
> > It seems commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes a
> > hugepage regression. A git bisect points the finger at that commit
> > for causing an oops in the 'alloc-instantiate-race' test from the
> > libhugetlbfs testsuite.
> >
> > Still looking to determine the reason it breaks things.
> >
>
> It's assuming that unmap_hugepage_range() is always freeing these pages.
> If the page is shared by another mapping, bad things will happen: the
> threads fight over page->lru.
>
> Doing
>
> + if (page_count(page) == 1)
> list_add(&page->lru, &page_list);
>
> might help. But then we miss the tlb flush in rare racy conditions.
Well, there'd need to be an else doing a put_page(), too.
Looks like the fundamental problem is that a list is not a suitable
data structure for gathering here, since it's not truly local. We
should probably change it to a small array, like in the normal tlb
gather structure. If we run out of space we can force the tlb flush
and keep going.
Or we could just give up on lazy tlb flush for hugepages.
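Roughly what I have in mind -- only a sketch, untested, with a made-up array
size; the point is that the gather lives entirely on this thread's stack
rather than in the shared struct page:

	void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
				  unsigned long end)
	{
		struct mm_struct *mm = vma->vm_mm;
		struct page *pages[16];		/* size made up; tune as needed */
		unsigned long address;
		pte_t *ptep;
		pte_t pte;
		int nr = 0;

		spin_lock(&mm->page_table_lock);
		for (address = start; address < end; address += HPAGE_SIZE) {
			ptep = huge_pte_offset(mm, address);
			if (!ptep)
				continue;
			pte = huge_ptep_get_and_clear(mm, address, ptep);
			if (pte_none(pte))
				continue;
			pages[nr] = pte_page(pte);
			if (pte_dirty(pte))
				set_page_dirty(pages[nr]);
			if (++nr == ARRAY_SIZE(pages)) {
				/* gather full: force the TLB flush and keep going */
				flush_tlb_range(vma, start, end);
				while (nr)
					put_page(pages[--nr]);
			}
		}
		spin_unlock(&mm->page_table_lock);
		flush_tlb_range(vma, start, end);
		while (nr)
			put_page(pages[--nr]);
	}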
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
David Gibson wrote on Tuesday, October 10, 2006 2:16 AM
> > > It seems commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes a
> > > hugepage regression. A git bisect points the finger at that commit
> > > for causing an oops in the 'alloc-instantiate-race' test from the
> > > libhugetlbfs testsuite.
> > >
> > > Still looking to determine the reason it breaks things.
> > >
> >
> > It's assuming that unmap_hugepage_range() is always freeing these pages.
> > If the page is shared by another mapping, bad things will happen: the
> > threads fight over page->lru.
> >
> > Doing
> >
> > + if (page_count(page) == 1)
> > list_add(&page->lru, &page_list);
> >
> > might help. But then we miss the tlb flush in rare racy conditions.
>
> Well, there'd need to be an else doing a put_page(), too.
>
> Looks like the fundamental problem is that a list is not a suitable
> data structure for gathering here, since it's not truly local. We
> should probably change it to a small array, like in the normal tlb
> gather structure. If we run out of space we can force the tlb flush
> and keep going.
With the pending shared page table patch for hugetlb currently sitting in -mm,
we serialize all hugetlb unmaps with a per-file i_mmap_lock. This
race could well be solved by that pending patch?
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc1/2.6.19-rc1-mm1/broken-out/shared-page-table-for-hugetlb-page-v4.patch
- Ken
On Tue, 10 Oct 2006 10:35:50 -0700
"Chen, Kenneth W" <[email protected]> wrote:
> David Gibson wrote on Tuesday, October 10, 2006 2:16 AM
> > > > It seems commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes a
> > > > hugepage regression. A git bisect points the finger at that commit
> > > > for causing an oops in the 'alloc-instantiate-race' test from the
> > > > libhugetlbfs testsuite.
> > > >
> > > > Still looking to determine the reason it breaks things.
> > > >
> > >
> > > It's assuming that unmap_hugepage_range() is always freeing these pages.
> > > If the page is shared by another mapping, bad things will happen: the
> > > threads fight over page->lru.
> > >
> > > Doing
> > >
> > > + if (page_count(page) == 1)
> > > list_add(&page->lru, &page_list);
> > >
> > > might help. But then we miss the tlb flush in rare racy conditions.
> >
> > Well, there'd need to be an else doing a put_page(), too.
> >
> > Looks like the fundamental problem is that a list is not a suitable
> > data structure for gathering here, since it's not truly local. We
> > should probably change it to a small array, like in the normal tlb
> > gather structure. If we run out of space we can force the tlb flush
> > and keep going.
>
>
> With the pending shared page table patch for hugetlb currently sitting in -mm,
> we serialize all hugetlb unmaps with a per-file i_mmap_lock. This
> race could well be solved by that pending patch?
>
> http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc1/2.6.19-rc1-mm1/broken-out/shared-page-table-for-hugetlb-page-v4.patch
>
We need something for 2.6.19 though. As David indicates, not using
page->lru should fix it (pagevec_add, pagevec_release would suit).
Or just a separate TLB invalidation per page. Is that likely to be
particularly expensive? It's the first one which hurts, isn't it?
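i.e. roughly this (untested sketch, loop paraphrased from the current code);
the pagevec is on the unmapping thread's stack, so nothing shared is touched
for the gathering itself:

	void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
				  unsigned long end)
	{
		struct mm_struct *mm = vma->vm_mm;
		unsigned long address;
		pte_t *ptep;
		pte_t pte;
		struct page *page;
		struct pagevec pvec;		/* needs <linux/pagevec.h> */

		pagevec_init(&pvec, 0);
		spin_lock(&mm->page_table_lock);
		for (address = start; address < end; address += HPAGE_SIZE) {
			ptep = huge_pte_offset(mm, address);
			if (!ptep)
				continue;
			pte = huge_ptep_get_and_clear(mm, address, ptep);
			if (pte_none(pte))
				continue;
			page = pte_page(pte);
			if (pte_dirty(pte))
				set_page_dirty(page);
			/* pagevec_add returns 0 once the pagevec is full */
			if (!pagevec_add(&pvec, page)) {
				flush_tlb_range(vma, start, end);
				pagevec_release(&pvec);	/* put_page each, reinit */
			}
		}
		spin_unlock(&mm->page_table_lock);
		flush_tlb_range(vma, start, end);
		pagevec_release(&pvec);
	}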
On Tue, 10 Oct 2006, Chen, Kenneth W wrote:
>
> With the pending shared page table patch for hugetlb currently sitting in -mm,
> we serialize all hugetlb unmaps with a per-file i_mmap_lock. This
> race could well be solved by that pending patch?
>
> http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc1/2.6.19-rc1-mm1/broken-out/shared-page-table-for-hugetlb-page-v4.patch
Hey, nice try, Ken! But I don't think we can let you sneak shared
pagetables into 2.6.19 that way ;)
Sorry for not noticing this bug in your original TLB flush fix,
which had looked good to me.
Yes, I'd expect your i_mmap_lock to solve the problem: and since
you're headed in that direction anyway, it makes most sense to use
that solution rather than get into defining arrays, or sacrificing
the lazy flush, or risking page_count races.
So please extract the __unmap_hugepage_range mods from your shared
pagetable patch, and use that to fix the bug. But again, I protest
the "if (vma->vm_file)" in your unmap_hugepage_range - how would a
hugepage area ever have NULL vma->vm_file?
Hugh
Hugh Dickins wrote on Tuesday, October 10, 2006 12:18 PM
> On Tue, 10 Oct 2006, Chen, Kenneth W wrote:
> >
> > With the pending shared page table patch for hugetlb currently sitting in -mm,
> > we serialize all hugetlb unmaps with a per-file i_mmap_lock. This
> > race could well be solved by that pending patch?
> >
> > http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc1/2.6.19-rc1-mm1/broken-out/shared-page-table-for-hugetlb-page-v4.patch
>
> Hey, nice try, Ken! But I don't think we can let you sneak shared
> pagetables into 2.6.19 that way ;)
It wasn't my intention to sneak in shared page tables, though it does
sort of look that way.
> Yes, I'd expect your i_mmap_lock to solve the problem: and since
> you're headed in that direction anyway, it makes most sense to use
> that solution rather than get into defining arrays, or sacrificing
> the lazy flush, or risking page_count races.
>
> So please extract the __unmap_hugepage_range mods from your shared
> pagetable patch, and use that to fix the bug.
OK, I was about to do so too.
> But again, I protest
> the "if (vma->vm_file)" in your unmap_hugepage_range - how would a
> hugepage area ever have NULL vma->vm_file?
It's coming from do_mmap_pgoff(): file->f_op->mmap can fail with an error
code (e.g. not enough hugetlb pages), and in the error recovery path it
nulls out vma->vm_file first, before calling down to unmap_region(). I
asked that question before: can we reverse that order (call unmap_region()
and then null out vma->vm_file and fput)?
unmap_and_free_vma:
	if (correct_wcount)
		atomic_inc(&inode->i_writecount);
	vma->vm_file = NULL;
	fput(file);

	/* Undo any partial mapping done by a device driver. */
	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
On Tue, 10 Oct 2006, Chen, Kenneth W wrote:
> Hugh Dickins wrote on Tuesday, October 10, 2006 12:18 PM
>
> > But again, I protest
> > the "if (vma->vm_file)" in your unmap_hugepage_range - how would a
> > hugepage area ever have NULL vma->vm_file?
>
> It's coming from do_mmap_pgoff(): file->f_op->mmap can fail with an error
> code (e.g. not enough hugetlb pages), and in the error recovery path it
> nulls out vma->vm_file first, before calling down to unmap_region().
I stand corrected: thanks.
> I asked that question before:
So you did, on Oct 2nd: sorry, that got lost amidst the overload in
my mailbox, I've just forwarded it to myself again, to check later
on what else I may have missed. (I am aware that I've still not
scrutinized the patches you sent out a day or two later, now in -mm.)
> can we reverse that order (call unmap_region
> and then null out vma->vm_file and fput)?
I'm pretty sure we cannot: the ordering is quite intentional, in that if
a driver ->mmap failed, then it'd be wrong to call down to the driver in
the unmap_region (if a driver is nicely behaved, that unmap_region
shouldn't be necessary; but some do rely on us clearing ptes there).
Okay, last refuge of all who've made a fool of themselves:
may I ask you to add a comment in your unmap_hugepage_range,
pointing to how the do_mmap_pgoff error case NULLifies vm_file?
(Or else change hugetlbfs_file_mmap to set VM_HUGETLB only once it's
succeeded: but that smacks of me refusing to accept I was wrong.)
Hugh
>
> unmap_and_free_vma:
> 	if (correct_wcount)
> 		atomic_inc(&inode->i_writecount);
> 	vma->vm_file = NULL;
> 	fput(file);
>
> 	/* Undo any partial mapping done by a device driver. */
> 	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
Hugh Dickins wrote on Tuesday, October 10, 2006 1:10 PM
> > can we reverse that order (call unmap_region
> > and then null out vma->vm_file and fput)?
>
> I'm pretty sure we cannot: the ordering is quite intentional, in that if
> a driver ->mmap failed, then it'd be wrong to call down to the driver in
> the unmap_region (if a driver is nicely behaved, that unmap_region
> shouldn't be necessary; but some do rely on us clearing ptes there).
Not even something like the following? I believe you that nullifying
vma->vm_file cannot be done after unmap_region(); I just want to make
sure we are talking about the same thing. It looks OK to me to defer the
fput in the do_mmap_pgoff cleanup path.
--- ./mm/mmap.c.orig 2006-10-10 15:58:17.000000000 -0700
+++ ./mm/mmap.c 2006-10-10 15:59:02.000000000 -0700
@@ -1159,11 +1159,12 @@ out:
 unmap_and_free_vma:
 	if (correct_wcount)
 		atomic_inc(&inode->i_writecount);
-	vma->vm_file = NULL;
-	fput(file);
 
 	/* Undo any partial mapping done by a device driver. */
 	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+
+	vma->vm_file = NULL;
+	fput(file);
 	charged = 0;
 free_vma:
 	kmem_cache_free(vm_area_cachep, vma);
Hugh Dickins wrote on Tuesday, October 10, 2006 12:18 PM
> Yes, I'd expect your i_mmap_lock to solve the problem: and since
> you're headed in that direction anyway, it makes most sense to use
> that solution rather than get into defining arrays, or sacrificing
> the lazy flush, or risking page_count races.
>
> So please extract the __unmap_hugepage_range mods from your shared
> pagetable patch, and use that to fix the bug.
OK, here is a bug fix patch fixing earlier "bug fix" patch :-(
[patch] hugetlb: fix linked list corruption in unmap_hugepage_range
commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes the kernel to oops with
the libhugetlbfs test suite. The problem is that hugetlb pages can be shared
by multiple mappings. Multiple threads can fight over page->lru in the unmap
path and bad things happen. We now serialize __unmap_hugepage_range to avoid
concurrent linked list manipulation. Such serialization is also needed for
the shared page table page on hugetlb areas. This patch fixes the bug and
also serves as a prepatch for shared page tables.
Signed-off-by: Ken Chen <[email protected]>
--- ./fs/hugetlbfs/inode.c.orig 2006-10-05 10:25:20.000000000 -0700
+++ ./fs/hugetlbfs/inode.c 2006-10-10 14:27:48.000000000 -0700
@@ -293,7 +293,7 @@ hugetlb_vmtruncate_list(struct prio_tree
 		if (h_vm_pgoff >= h_pgoff)
 			v_offset = 0;
 
-		unmap_hugepage_range(vma,
+		__unmap_hugepage_range(vma,
 				vma->vm_start + v_offset, vma->vm_end);
 	}
 }
--- ./mm/hugetlb.c.orig 2006-10-05 10:25:21.000000000 -0700
+++ ./mm/hugetlb.c 2006-10-10 14:27:48.000000000 -0700
@@ -356,8 +356,8 @@ nomem:
 	return -ENOMEM;
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			  unsigned long end)
+void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -398,6 +398,24 @@ void unmap_hugepage_range(struct vm_area
 	}
 }
 
+void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
+			  unsigned long end)
+{
+	/*
+	 * It is undesirable to test vma->vm_file as it should be non-null
+	 * for valid hugetlb area. However, vm_file will be NULL in the error
+	 * cleanup path of do_mmap_pgoff. When hugetlbfs ->mmap method fails,
+	 * do_mmap_pgoff() nullifies vma->vm_file before calling this function
+	 * to clean up. Since no pte has actually been setup, it is safe to
+	 * do nothing in this case.
+	 */
+	if (vma->vm_file) {
+		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+		__unmap_hugepage_range(vma, start, end);
+		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	}
+}
+
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		       unsigned long address, pte_t *ptep, pte_t pte)
 {
--- ./include/linux/hugetlb.h.orig 2006-10-05 10:25:21.000000000 -0700
+++ ./include/linux/hugetlb.h 2006-10-10 13:08:48.000000000 -0700
@@ -17,6 +17,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
 void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
+void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
 int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
 int hugetlb_report_meminfo(char *);
 int hugetlb_report_node_meminfo(int, char *);
On Tue, Oct 10, 2006 at 04:34:40PM -0700, Chen, Kenneth W wrote:
> Hugh Dickins wrote on Tuesday, October 10, 2006 12:18 PM
> > Yes, I'd expect your i_mmap_lock to solve the problem: and since
> > you're headed in that direction anyway, it makes most sense to use
> > that solution rather than get into defining arrays, or sacrificing
> > the lazy flush, or risking page_count races.
> >
> > So please extract the __unmap_hugepage_range mods from your shared
> > pagetable patch, and use that to fix the bug.
>
>
> OK, here is a bug fix patch fixing earlier "bug fix" patch :-(
>
>
> [patch] hugetlb: fix linked list corruption in unmap_hugepage_range
>
> commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes the kernel to oops with
> the libhugetlbfs test suite. The problem is that hugetlb pages can be shared
> by multiple mappings. Multiple threads can fight over page->lru in the unmap
> path and bad things happen. We now serialize __unmap_hugepage_range to avoid
> concurrent linked list manipulation. Such serialization is also needed for
> the shared page table page on hugetlb areas. This patch fixes the bug and
> also serves as a prepatch for shared page tables.
Can I suggest that you put a big comment on the linked list
declaration itself saying that you're relying on serialization here?
Otherwise I'm worried someone will try to de-serialize it again, and
break it without realizing, given the number of people who failed to
spot the problem with the patch the first time around.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
David Gibson wrote on Tuesday, October 10, 2006 6:18 PM
> Can I suggest that you put a big comment on the linked list
> declaration itself saying that you're relying on serialization here?
> Otherwise I'm worried someone will try to de-serialize it again, and
> break it without realizing, given the number of people who failed to
> spot the problem with the patch the first time around.
I'm not very good at writing comments; how about the following?
Signed-off-by: Ken Chen <[email protected]>
--- linus-2.6/mm/hugetlb.c.orig 2006-10-10 19:32:36.000000000 -0700
+++ linus-2.6/mm/hugetlb.c 2006-10-10 19:41:18.000000000 -0700
@@ -365,6 +365,11 @@ void __unmap_hugepage_range(struct vm_ar
 	pte_t pte;
 	struct page *page;
 	struct page *tmp;
+	/*
+	 * A page gathering list, protected by per file i_mmap_lock. The
+	 * lock is used to avoid list corruption from multiple unmapping
+	 * of the same page since we are using page->lru.
+	 */
 	LIST_HEAD(page_list);
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
On Tue, 10 Oct 2006, Chen, Kenneth W wrote:
> Hugh Dickins wrote on Tuesday, October 10, 2006 1:10 PM
> > > can we reverse that order (call unmap_region
> > > and then null out vma->vm_file and fput)?
> >
> > I'm pretty sure we cannot: the ordering is quite intentional, in that if
> > a driver ->mmap failed, then it'd be wrong to call down to the driver in
> > the unmap_region (if a driver is nicely behaved, that unmap_region
> > shouldn't be necessary; but some do rely on us clearing ptes there).
Looking at it again, my explanation seems wrong: I can't see any danger
of calling down to the _driver_ there (there's no remove_vma): rather,
it's __remove_shared_vm_struct we're avoiding by setting vm_file NULL.
>
> Not even something like the following? I believe you that nullifying
> vma->vm_file cannot be done after unmap_region(),
Yet in your patch below, you do nullify vm_file _after_ unmap_region.
> I just want to make sure we are talking about the same thing.
So I'm not sure if we are!
> It looks OK to me to defer the fput in the do_mmap_pgoff cleanup path.
Yes, it would be quite okay to defer the fput until after the
unmap_region; but there's no point in making that change, is there?
Hugh (sorry, I was off sick for a couple of days)
>
>
> --- ./mm/mmap.c.orig 2006-10-10 15:58:17.000000000 -0700
> +++ ./mm/mmap.c 2006-10-10 15:59:02.000000000 -0700
> @@ -1159,11 +1159,12 @@ out:
> unmap_and_free_vma:
> 	if (correct_wcount)
> 		atomic_inc(&inode->i_writecount);
> -	vma->vm_file = NULL;
> -	fput(file);
>
> 	/* Undo any partial mapping done by a device driver. */
> 	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
> +
> +	vma->vm_file = NULL;
> +	fput(file);
> 	charged = 0;
> free_vma:
> 	kmem_cache_free(vm_area_cachep, vma);
>