LinuxLists.cc - Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

2012-08-01 08:20:45

Subject: Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

On Tue 31-07-12 22:45:43, Larry Woodman wrote:
> On 07/31/2012 04:06 PM, Michal Hocko wrote:
> >On Tue 31-07-12 13:49:21, Larry Woodman wrote:
> >>On 07/31/2012 08:46 AM, Mel Gorman wrote:
> >>>Fundamentally I think the problem is that we are not correctly detecting
> >>>that page table sharing took place during huge_pte_alloc(). This patch is
> >>>longer and makes an API change but if I'm right, it addresses the underlying
> >>>problem. The first VM_MAYSHARE patch is still necessary but would you mind
> >>>testing this on top please?
> >>Hi Mel, yes this does work just fine. It ran for hours without a panic so
> >>I'll Ack this one if you send it to the list.
> >Hi Larry, thanks for testing! I have a different patch which tries to
> >address this very same issue. I am not saying it is better or that it
> >should be merged instead of Mel's one but I would be really happy if you
> >could give it a try. We can discuss (dis)advantages of both approaches
> >later.
> >
> >Thanks!
>
> Hi Michal, the system hung when I tested this patch on top of the
> latest 3.5 kernel. I wont have AltSysrq access to the system until
> tomorrow AM.

Please hold on. The patch is crap. I forgot about
if (!vma_shareable(vma, addr))
return;

case so somebody got an uninitialized pmd. The patch bellow handles
that.

Sorry about that and thanks to Mel to notice this.
---
>From 14d1cfcb7e19369653e2367c38204b1012a398f9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 31 Jul 2012 15:00:26 +0200
Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd

Each page mapped in a processes address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straight-forward but hugetlbfs page table sharing is different.
The page table pages at the PMD level are reference counted while
the mapcount remains the same. If this accounting is wrong, it causes
bugs like this one reported by Larry Woodman

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: 0000 [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[<ffffffff8112cfed>] [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:ffff880428973b88 EFLAGS: 00010002
[ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
[ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
[ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
[ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
[ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
[ 1106.269780] FS: 0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
[ 1106.278794] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
[ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690] ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
[ 1106.328958] ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
[ 1106.337234] ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
[ 1106.345513] Call Trace:
[ 1106.348235] [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
[ 1106.355128] [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
[ 1106.361826] [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615] [<ffffffff811ab1af>] evict+0x9f/0x1b0
[ 1106.373951] [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
[ 1106.379773] [<ffffffff811ab4de>] iput+0x3e/0x50
[ 1106.384922] [<ffffffff811a8e18>] d_kill+0xf8/0x110
[ 1106.390356] [<ffffffff811a8f12>] dput+0xe2/0x1b0
[ 1106.395595] [<ffffffff81193612>] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is if
the PMD page is the same, they must be shared. This assumes that the
sharing is between the parent and child. However, if the sharing is with
a different process entirely then this check fails as in this diagram.

parent
|
------------>pmd
src_pte----------> data page
^
other--------->pmd--------------------|
^
child-----------|
dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This is
possible due to the following style of race.

PROC A PROC B
copy_hugetlb_page_range copy_hugetlb_page_range
src_pte == huge_pte_offset src_pte == huge_pte_offset
!src_pte so no sharing !src_pte so no sharing

(time passes)

hugetlb_fault hugetlb_fault
huge_pte_alloc huge_pte_alloc
huge_pmd_share huge_pmd_share
LOCK(i_mmap_mutex)
find nothing, no sharing
UNLOCK(i_mmap_mutex)
LOCK(i_mmap_mutex)
find nothing, no sharing
UNLOCK(i_mmap_mutex)
pmd_alloc pmd_alloc
LOCK(instantiation_mutex)
fault
UNLOCK(instantiation_mutex)
LOCK(instantiation_mutex)
fault
UNLOCK(instantiation_mutex)

These two processes are not poing to the same data page but are not sharing
page tables because the opportunity was missed. When either process later
forks, the src_pte == dst pte is potentially insufficient. As the check
falls through, the wrong PTE information is copied in (harmless but wrong)
and the mapcount is bumped for a page mapped by a shared page table leading
to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same
critical section as pmd. This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now.

Changelog and race identified by Mel Gorman
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Larry Woodman <[email protected]>
---
arch/x86/mm/hugetlbpage.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..cc385c4 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -58,7 +58,7 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
/*
* search for a shareable pmd page for hugetlb.
*/
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static pte_t* huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
struct vm_area_struct *vma = find_vma(mm, addr);
struct address_space *mapping = vma->vm_file->f_mapping;
@@ -68,9 +68,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
struct vm_area_struct *svma;
unsigned long saddr;
pte_t *spte = NULL;
+ pte_t *pte;

if (!vma_shareable(vma, addr))
- return;
+ return pmd_alloc(mm, pud, addr);

mutex_lock(&mapping->i_mmap_mutex);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
@@ -96,8 +97,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
else
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
+ pte = pmd_alloc(mm, pud, addr);
out:
mutex_unlock(&mapping->i_mmap_mutex);
+ return pte;
}

/*
@@ -142,8 +145,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
} else {
BUG_ON(sz != PMD_SIZE);
if (pud_none(*pud))
- huge_pmd_share(mm, addr, pud);
- pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ pte = huge_pmd_share(mm, addr, pud);
+ else
+ pte = (pte_t *) pmd_alloc(mm, pud, addr);
}
}
BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
--
1.7.10.4

--
Michal Hocko
SUSE Labs

2012-08-01 12:32:16

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

On Wed 01-08-12 10:20:36, Michal Hocko wrote:
> On Tue 31-07-12 22:45:43, Larry Woodman wrote:
> > On 07/31/2012 04:06 PM, Michal Hocko wrote:
> > >On Tue 31-07-12 13:49:21, Larry Woodman wrote:
> > >>On 07/31/2012 08:46 AM, Mel Gorman wrote:
> > >>>Fundamentally I think the problem is that we are not correctly detecting
> > >>>that page table sharing took place during huge_pte_alloc(). This patch is
> > >>>longer and makes an API change but if I'm right, it addresses the underlying
> > >>>problem. The first VM_MAYSHARE patch is still necessary but would you mind
> > >>>testing this on top please?
> > >>Hi Mel, yes this does work just fine. It ran for hours without a panic so
> > >>I'll Ack this one if you send it to the list.
> > >Hi Larry, thanks for testing! I have a different patch which tries to
> > >address this very same issue. I am not saying it is better or that it
> > >should be merged instead of Mel's one but I would be really happy if you
> > >could give it a try. We can discuss (dis)advantages of both approaches
> > >later.
> > >
> > >Thanks!
> >
> > Hi Michal, the system hung when I tested this patch on top of the
> > latest 3.5 kernel. I wont have AltSysrq access to the system until
> > tomorrow AM.
>
> Please hold on. The patch is crap. I forgot about
> if (!vma_shareable(vma, addr))
> return;
>
> case so somebody got an uninitialized pmd. The patch bellow handles
> that.
>

I am really lame :/. The previous patch is wrong as well for goto out
branch. The updated patch as follows:
---
>From 886b79204491b500437e156aa8eb35e776a4bb07 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 31 Jul 2012 15:00:26 +0200
Subject: [PATCH] mm: hugetlbfs: Correctly populate shared pmd

Each page mapped in a processes address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straight-forward but hugetlbfs page table sharing is different.
The page table pages at the PMD level are reference counted while
the mapcount remains the same. If this accounting is wrong, it causes
bugs like this one reported by Larry Woodman

[ 1106.156569] ------------[ cut here ]------------
[ 1106.161731] kernel BUG at mm/filemap.c:135!
[ 1106.166395] invalid opcode: 0000 [#1] SMP
[ 1106.170975] CPU 22
[ 1106.173115] Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
[ 1106.201770]
[ 1106.203426] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
[ 1106.213822] RIP: 0010:[<ffffffff8112cfed>] [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
[ 1106.224117] RSP: 0018:ffff880428973b88 EFLAGS: 00010002
[ 1106.230032] RAX: 0000000000000001 RBX: ffffea0006b80000 RCX: 00000000ffffffb0
[ 1106.237979] RDX: 0000000000016df1 RSI: 0000000000000009 RDI: ffff88043ffd9e00
[ 1106.245927] RBP: ffff880428973b98 R08: 0000000000000050 R09: 0000000000000003
[ 1106.253876] R10: 000000000000000d R11: 0000000000000000 R12: ffff880428708150
[ 1106.261826] R13: ffff880428708150 R14: 0000000000000000 R15: ffffea0006b80000
[ 1106.269780] FS: 0000000000000000(0000) GS:ffff88042fd60000(0000) knlGS:0000000000000000
[ 1106.278794] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1106.285193] CR2: 0000003a1d38c4a8 CR3: 000000000187d000 CR4: 00000000000406e0
[ 1106.293149] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1106.301097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1106.309046] Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
[ 1106.318447] Stack:
[ 1106.320690] ffffea0006b80000 0000000000000000 ffff880428973bc8 ffffffff8112d040
[ 1106.328958] ffff880428973bc8 00000000000002ab 00000000000002a0 ffff880428973c18
[ 1106.337234] ffff880428973cc8 ffffffff8125b405 ffff880400000001 0000000000000000
[ 1106.345513] Call Trace:
[ 1106.348235] [<ffffffff8112d040>] delete_from_page_cache+0x40/0x80
[ 1106.355128] [<ffffffff8125b405>] truncate_hugepages+0x115/0x1f0
[ 1106.361826] [<ffffffff8125b4f8>] hugetlbfs_evict_inode+0x18/0x30
[ 1106.368615] [<ffffffff811ab1af>] evict+0x9f/0x1b0
[ 1106.373951] [<ffffffff811ab3a3>] iput_final+0xe3/0x1e0
[ 1106.379773] [<ffffffff811ab4de>] iput+0x3e/0x50
[ 1106.384922] [<ffffffff811a8e18>] d_kill+0xf8/0x110
[ 1106.390356] [<ffffffff811a8f12>] dput+0xe2/0x1b0
[ 1106.395595] [<ffffffff81193612>] __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is if
the PMD page is the same, they must be shared. This assumes that the
sharing is between the parent and child. However, if the sharing is with
a different process entirely then this check fails as in this diagram.

parent
|
------------>pmd
src_pte----------> data page
^
other--------->pmd--------------------|
^
child-----------|
dst_pte

For this situation to occur, it must be possible for Parent and Other
to have faulted and failed to share page tables with each other. This is
possible due to the following style of race.

PROC A PROC B
copy_hugetlb_page_range copy_hugetlb_page_range
src_pte == huge_pte_offset src_pte == huge_pte_offset
!src_pte so no sharing !src_pte so no sharing

(time passes)

hugetlb_fault hugetlb_fault
huge_pte_alloc huge_pte_alloc
huge_pmd_share huge_pmd_share
LOCK(i_mmap_mutex)
find nothing, no sharing
UNLOCK(i_mmap_mutex)
LOCK(i_mmap_mutex)
find nothing, no sharing
UNLOCK(i_mmap_mutex)
pmd_alloc pmd_alloc
LOCK(instantiation_mutex)
fault
UNLOCK(instantiation_mutex)
LOCK(instantiation_mutex)
fault
UNLOCK(instantiation_mutex)

These two processes are not poing to the same data page but are not sharing
page tables because the opportunity was missed. When either process later
forks, the src_pte == dst pte is potentially insufficient. As the check
falls through, the wrong PTE information is copied in (harmless but wrong)
and the mapcount is bumped for a page mapped by a shared page table leading
to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same
critical section as pmd. This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now.

Changelog and race identified by Mel Gorman
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Larry Woodman <[email protected]>
---
arch/x86/mm/hugetlbpage.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f6679a7..40b2500 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
/*
* search for a shareable pmd page for hugetlb.
*/
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static pte_t*
+huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
struct vm_area_struct *vma = find_vma(mm, addr);
struct address_space *mapping = vma->vm_file->f_mapping;
@@ -68,9 +69,10 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
struct vm_area_struct *svma;
unsigned long saddr;
pte_t *spte = NULL;
+ pte_t *pte;

if (!vma_shareable(vma, addr))
- return;
+ return (pte_t *)pmd_alloc(mm, pud, addr);

mutex_lock(&mapping->i_mmap_mutex);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
@@ -97,7 +99,9 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
out:
+ pte = (pte_t *)pmd_alloc(mm, pud, addr);
mutex_unlock(&mapping->i_mmap_mutex);
+ return pte;
}

/*
@@ -142,8 +146,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
} else {
BUG_ON(sz != PMD_SIZE);
if (pud_none(*pud))
- huge_pmd_share(mm, addr, pud);
- pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ pte = huge_pmd_share(mm, addr, pud);
+ else
+ pte = (pte_t *) pmd_alloc(mm, pud, addr);
}
}
BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
--
1.7.10.4

--
Michal Hocko
SUSE Labs

2012-08-01 15:06:47

by Larry Woodman

[permalink] [raw]

Subject: Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

On 08/01/2012 08:32 AM, Michal Hocko wrote:
>
> I am really lame :/. The previous patch is wrong as well for goto out
> branch. The updated patch as follows:
This patch worked fine Michal! You and Mel can duke it out over who's
is best. :)

Larry

2012-08-02 07:19:38

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

Hi Larry,

On Wed 01-08-12 11:06:33, Larry Woodman wrote:
> On 08/01/2012 08:32 AM, Michal Hocko wrote:
> >
> >I am really lame :/. The previous patch is wrong as well for goto out
> >branch. The updated patch as follows:
> This patch worked fine Michal!

Thanks for the good news!

> You and Mel can duke it out over who's is best. :)

The answer is clear here ;) Mel did the hard work of identifying the
culprit so kudos go to him.
I just tried to solve the issue more inside x86 arch code. The pmd
allocation outside of sharing code seemed strange to me for quite some
time I just underestimated its consequences completely.

Both approaches have some pros. Mel's patch is more resistant to other
not-yet-discovered races and it also makes the arch independent code
more robust because relying on the pmd trick is not ideal.
On the other hand, mine is more coupled with the sharing code so it
makes the code easier to follow and also makes the sharing more
effective because racing processes see pmd populated when checking for
shareable mappings.

So I am more inclined to mine but I don't want to push it because both
are good and make sense. What other people think?

>
> Larry
>

--
Michal Hocko
SUSE Labs

2012-08-02 07:38:03

by Mel Gorman

[permalink] [raw]

Subject: Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> Hi Larry,
>
> On Wed 01-08-12 11:06:33, Larry Woodman wrote:
> > On 08/01/2012 08:32 AM, Michal Hocko wrote:
> > >
> > >I am really lame :/. The previous patch is wrong as well for goto out
> > >branch. The updated patch as follows:
> > This patch worked fine Michal!
>
> Thanks for the good news!
>
> > You and Mel can duke it out over who's is best. :)
>
> The answer is clear here ;) Mel did the hard work of identifying the
> culprit so kudos go to him.

I'm happy once it's fixed!

> I just tried to solve the issue more inside x86 arch code. The pmd
> allocation outside of sharing code seemed strange to me for quite some
> time I just underestimated its consequences completely.
>
> Both approaches have some pros. Mel's patch is more resistant to other
> not-yet-discovered races and it also makes the arch independent code
> more robust because relying on the pmd trick is not ideal.

If there is another race then it is best to hear about it, understand
it and fix the underlying problem. More importantly, your patch ensures
that two processes faulting at the same time will share page tables with
each other. My patch only noted that this missed opportunity could cause
problems with fork.

> On the other hand, mine is more coupled with the sharing code so it
> makes the code easier to follow and also makes the sharing more
> effective because racing processes see pmd populated when checking for
> shareable mappings.
>

It could do with a small comment above huge_pmd_share() explaining that
calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
parallel faults missing a sharing opportunity with each other but it's
not mandatory.

> So I am more inclined to mine but I don't want to push it because both
> are good and make sense. What other people think?
>

I vote yours

Reviewed-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2012-08-02 12:37:04

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
[...]
> > On the other hand, mine is more coupled with the sharing code so it
> > makes the code easier to follow and also makes the sharing more
> > effective because racing processes see pmd populated when checking for
> > shareable mappings.
> >
>
> It could do with a small comment above huge_pmd_share() explaining that
> calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
> parallel faults missing a sharing opportunity with each other but it's
> not mandatory.

Sure, that's a good idea. What about the following:

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 40b2500..51839d1 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
}

/*
- * search for a shareable pmd page for hugetlb.
+ * search for a shareable pmd page for hugetlb. In any case calls
+ * pmd_alloc and returns the corresponding pte. While this not necessary
+ * for the !shared pmd case because we can allocate the pmd later as
+ * well it makes the code much cleaner. pmd allocation is essential for
+ * the shared case though because pud has to be populated inside the
+ * same i_mmap_mutex section otherwise racing tasks could either miss
+ * the sharing (see huge_pte_offset) or selected a bad pmd for sharing.
*/
static pte_t*
huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)

>
> > So I am more inclined to mine but I don't want to push it because both
> > are good and make sense. What other people think?
> >
>
> I vote yours
>
> Reviewed-by: Mel Gorman <[email protected]>

Thanks!

--
Michal Hocko
SUSE Labs

2012-08-02 13:33:18

by Mel Gorman

[permalink] [raw]

Subject: Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

On Thu, Aug 02, 2012 at 02:36:58PM +0200, Michal Hocko wrote:
> On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> > On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> [...]
> > > On the other hand, mine is more coupled with the sharing code so it
> > > makes the code easier to follow and also makes the sharing more
> > > effective because racing processes see pmd populated when checking for
> > > shareable mappings.
> > >
> >
> > It could do with a small comment above huge_pmd_share() explaining that
> > calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
> > parallel faults missing a sharing opportunity with each other but it's
> > not mandatory.
>
> Sure, that's a good idea. What about the following:
>
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 40b2500..51839d1 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> }
>
> /*
> - * search for a shareable pmd page for hugetlb.
> + * search for a shareable pmd page for hugetlb. In any case calls
> + * pmd_alloc and returns the corresponding pte. While this not necessary
> + * for the !shared pmd case because we can allocate the pmd later as
> + * well it makes the code much cleaner. pmd allocation is essential for
> + * the shared case though because pud has to be populated inside the
> + * same i_mmap_mutex section otherwise racing tasks could either miss
> + * the sharing (see huge_pte_offset) or selected a bad pmd for sharing.
> */
> static pte_t*
> huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>

Looks reasonable to me.

--
Mel Gorman
SUSE Labs

2012-08-02 13:53:54

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

On Thu 02-08-12 14:33:10, Mel Gorman wrote:
> On Thu, Aug 02, 2012 at 02:36:58PM +0200, Michal Hocko wrote:
> > On Thu 02-08-12 08:37:57, Mel Gorman wrote:
> > > On Thu, Aug 02, 2012 at 09:19:34AM +0200, Michal Hocko wrote:
> > [...]
> > > > On the other hand, mine is more coupled with the sharing code so it
> > > > makes the code easier to follow and also makes the sharing more
> > > > effective because racing processes see pmd populated when checking for
> > > > shareable mappings.
> > > >
> > >
> > > It could do with a small comment above huge_pmd_share() explaining that
> > > calling pmd_alloc() under the i_mmap_mutex is necessary to prevent two
> > > parallel faults missing a sharing opportunity with each other but it's
> > > not mandatory.
> >
> > Sure, that's a good idea. What about the following:
> >
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index 40b2500..51839d1 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -56,7 +56,13 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> > }
> >
> > /*
> > - * search for a shareable pmd page for hugetlb.
> > + * search for a shareable pmd page for hugetlb. In any case calls
> > + * pmd_alloc and returns the corresponding pte. While this not necessary
> > + * for the !shared pmd case because we can allocate the pmd later as
> > + * well it makes the code much cleaner. pmd allocation is essential for
> > + * the shared case though because pud has to be populated inside the
> > + * same i_mmap_mutex section otherwise racing tasks could either miss
> > + * the sharing (see huge_pte_offset) or selected a bad pmd for sharing.
> > */
> > static pte_t*
> > huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >
>
> Looks reasonable to me.

OK, added to the patch. I will send it to Andrew now.

Thanks a lot!
--
Michal Hocko
SUSE Labs