2014-04-15 21:49:27

by Kirill A. Shutemov

Subject: [PATCH] thp: close race between split and zap huge pages

Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
the same root cause. Let's look at them one by one.

The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
From my testing I see that page_mapcount() is higher than mapcount here.

I think it happens due to a race between zap_huge_pmd() and
page_check_address_pmd(): page_check_address_pmd() misses a PMD
which is under zap:

CPU0                                        CPU1
                                            zap_huge_pmd()
                                              pmdp_get_and_clear()
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
          /*
           * We check if PMD present without taking ptl: no
           * serialization against zap_huge_pmd(). We miss this PMD,
           * it's not accounted to 'mapcount' in __split_huge_page().
           */
          pmd_present(pmd) == 0

  BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

                                              page_remove_rmap(page)
                                                atomic_add_negative(-1, &page->_mapcount)

The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

This happens in a similar way:

CPU0                                        CPU1
                              zap_huge_pmd()
                                pmdp_get_and_clear()
                                page_remove_rmap(page)
                                  atomic_add_negative(-1, &page->_mapcount)
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
          pmd_present(pmd) == 0	/* The same comment as above */
  /*
   * No crash this time since we already decremented page->_mapcount in
   * zap_huge_pmd().
   */
  BUG_ON(mapcount != page_mapcount(page))

  /*
   * We split the compound page here into small pages without
   * serialization against zap_huge_pmd()
   */
  __split_huge_page_refcount()
    VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

So my understanding is that the problem is the pmd_present() check in
mm_find_pmd() done without taking the page table lock.

The bug was introduced by my commit 117b0791ac42. Sorry for
that. :(

Let's open code mm_find_pmd() in page_check_address_pmd() and do the
check under the page table lock.

Note that __page_check_address() does the same for PTE entries
if sync != 0.

I've stress tested the split and zap code paths for 36+ hours by now and
don't see crashes with the patch applied. Before, it took <20 min to
trigger the first bug and a few hours for the second one (if we ignore
the first).

[1] https://lkml.kernel.org/g/<[email protected]>
[2] https://lkml.kernel.org/g/<[email protected]>

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Cc: <[email protected]> #3.13+
---
mm/huge_memory.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5025709bb3b5..d02a83852ee9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
 			      enum page_check_address_pmd_flag flag,
 			      spinlock_t **ptl)
 {
+	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 
 	if (address & ~HPAGE_PMD_MASK)
 		return NULL;
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd)
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
 		return NULL;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, address);
+
 	*ptl = pmd_lock(mm, pmd);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		goto unlock;
 	if (pmd_page(*pmd) != page)
 		goto unlock;
--
1.9.1


2014-04-15 23:52:32

by Bob Liu

Subject: Re: [PATCH] thp: close race between split and zap huge pages

On Wed, Apr 16, 2014 at 5:48 AM, Kirill A. Shutemov
<[email protected]> wrote:
> Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
> the same root cause. Let's look at them one by one.
>
> The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
> It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
> From my testing I see that page_mapcount() is higher than mapcount here.
>
> I think it happens due to a race between zap_huge_pmd() and
> page_check_address_pmd(): page_check_address_pmd() misses a PMD
> which is under zap:
>

Nice catch!

> CPU0                                        CPU1
>                                             zap_huge_pmd()
>                                               pmdp_get_and_clear()
> __split_huge_page()
>   anon_vma_interval_tree_foreach()
>     __split_huge_page_splitting()
>       page_check_address_pmd()
>         mm_find_pmd()
>           /*
>            * We check if PMD present without taking ptl: no
>            * serialization against zap_huge_pmd(). We miss this PMD,
>            * it's not accounted to 'mapcount' in __split_huge_page().
>            */
>           pmd_present(pmd) == 0
>
>   BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!
>
>                                               page_remove_rmap(page)
>                                                 atomic_add_negative(-1, &page->_mapcount)
>
> The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
> It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().
>
> This happens in a similar way:
>
> CPU0                                        CPU1
>                               zap_huge_pmd()
>                                 pmdp_get_and_clear()
>                                 page_remove_rmap(page)
>                                   atomic_add_negative(-1, &page->_mapcount)
> __split_huge_page()
>   anon_vma_interval_tree_foreach()
>     __split_huge_page_splitting()
>       page_check_address_pmd()
>         mm_find_pmd()
>           pmd_present(pmd) == 0	/* The same comment as above */
>   /*
>    * No crash this time since we already decremented page->_mapcount in
>    * zap_huge_pmd().
>    */
>   BUG_ON(mapcount != page_mapcount(page))
>
>   /*
>    * We split the compound page here into small pages without
>    * serialization against zap_huge_pmd()
>    */
>   __split_huge_page_refcount()
>     VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!
>
> So my understanding is that the problem is the pmd_present() check in
> mm_find_pmd() done without taking the page table lock.
>
> The bug was introduced by my commit 117b0791ac42. Sorry for
> that. :(
>
> Let's open code mm_find_pmd() in page_check_address_pmd() and do the
> check under the page table lock.
>
> Note that __page_check_address() does the same for PTE entries
> if sync != 0.
>
> I've stress tested the split and zap code paths for 36+ hours by now and
> don't see crashes with the patch applied. Before, it took <20 min to
> trigger the first bug and a few hours for the second one (if we ignore
> the first).
>
> [1] https://lkml.kernel.org/g/<[email protected]>
> [2] https://lkml.kernel.org/g/<[email protected]>
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reported-by: Sasha Levin <[email protected]>
> Cc: <[email protected]> #3.13+
> ---
> mm/huge_memory.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5025709bb3b5..d02a83852ee9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
>  			      enum page_check_address_pmd_flag flag,
>  			      spinlock_t **ptl)
>  {
> +	pgd_t *pgd;
> +	pud_t *pud;
>  	pmd_t *pmd;
> 
>  	if (address & ~HPAGE_PMD_MASK)
>  		return NULL;
> 
> -	pmd = mm_find_pmd(mm, address);
> -	if (!pmd)
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
>  		return NULL;
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		return NULL;
> +	pmd = pmd_offset(pud, address);
> +
>  	*ptl = pmd_lock(mm, pmd);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		goto unlock;

But I didn't get the idea why pmd_none() was removed?

--
Regards,
--Bob

2014-04-16 08:44:35

by Kirill A. Shutemov

Subject: Re: [PATCH] thp: close race between split and zap huge pages

On Wed, Apr 16, 2014 at 07:52:29AM +0800, Bob Liu wrote:
> >  	*ptl = pmd_lock(mm, pmd);
> > -	if (pmd_none(*pmd))
> > +	if (!pmd_present(*pmd))
> >  		goto unlock;
>
> But I didn't get the idea why pmd_none() was removed?

!pmd_present(*pmd) is a weaker check than pmd_none(*pmd). I mean if
pmd_none(*pmd) is true then pmd_present(*pmd) is always false.
Correct me if I'm wrong.

--
Kirill A. Shutemov

2014-04-17 00:28:22

by Bob Liu

Subject: Re: [PATCH] thp: close race between split and zap huge pages

On Wed, Apr 16, 2014 at 4:42 PM, Kirill A. Shutemov
<[email protected]> wrote:
> On Wed, Apr 16, 2014 at 07:52:29AM +0800, Bob Liu wrote:
>> >  	*ptl = pmd_lock(mm, pmd);
>> > -	if (pmd_none(*pmd))
>> > +	if (!pmd_present(*pmd))
>> >  		goto unlock;
>>
>> But I didn't get the idea why pmd_none() was removed?
>
> !pmd_present(*pmd) is a weaker check than pmd_none(*pmd). I mean if
> pmd_none(*pmd) is true then pmd_present(*pmd) is always false.

Oh, yes. That's right.

BTW, it looks like this bug has the same root cause:
https://lkml.org/lkml/2014/4/16/403

--
Regards,
--Bob

2014-04-17 20:16:18

by Andrea Arcangeli

Subject: Re: [PATCH] thp: close race between split and zap huge pages

Hi everyone,

On Wed, Apr 16, 2014 at 12:48:56AM +0300, Kirill A. Shutemov wrote:
> -	pmd = mm_find_pmd(mm, address);
> -	if (!pmd)
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
>  		return NULL;
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		return NULL;
> +	pmd = pmd_offset(pud, address);

This fix looks good to me; it was another potential source of
trouble making the BUG_ON flaky. But the rmap_walk out-of-order
problem still exists too, I think. Possibly the testcase doesn't
exercise that.

> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		goto unlock;

pmd_present is a bit slower, but functionally it's equivalent; the
pmd_present check is just more pedantic (kind of defining the
invariants for how a mapped pmd should look).

If we add native THP swapout later, !pmd_present would be more
correct for the VM calls to page_check_address_pmd, but something
would need changing anyway if split_huge_page is the callee, as I don't
think we can skip the conversion from trans huge swap entry to linear
swap entries and the pmd2pte conversion.

The main reason that most places that could run into a trans huge pmd
would use pmd_none and never pmd_present is that originally
pmd_present wouldn't check _PAGE_PSE, and _PAGE_PRESENT can
temporarily be cleared with pmdp_invalidate on trans huge pmds. Now
pmd_present is safe too, so there's no problem in using it on trans
huge pmds.

So either pmd_none or !pmd_present is fine; the functional fix is the
part above.

Thanks!
Andrea