2015-04-01 06:38:44

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page

"Kirill A. Shutemov" <[email protected]> writes:

> Current split_huge_page() combines two operations: splitting PMDs into
> tables of PTEs and splitting underlying compound page. This patch
> changes split_huge_pmd() implementation to split the given PMD without
> splitting other PMDs this page mapped with or underlying compound page.
>
> In order to do this we have to get rid of tail page refcounting, which
> uses _mapcount of tail pages. Tail page refcounting is needed to be able
> to split THP page at any point: we always know which of tail pages is
> pinned (i.e. by get_user_pages()) and can distribute page count
> correctly.
>
> We can avoid this by allowing split_huge_page() to fail if the compound
> page is pinned. This patch removes all infrastructure for tail page
> refcounting and make split_huge_page() to always return -EBUSY. All
> split_huge_page() users already know how to handle its fail. Proper
> implementation will be added later.
>
> Without tail page refcounting, implementation of split_huge_pmd() is
> pretty straight-forward.
>

With this we now have pte mapping part of a compound page(). Now the
gneric gup implementation does

gup_pte_range()
ptem = ptep = pte_offset_map(&pmd, addr);
do {

....
...
if (!page_cache_get_speculative(page))
goto pte_unmap;
.....
}

That page_cache_get_speculative will fail in our case because it does
if (unlikely(!get_page_unless_zero(page))) on a tail page. ??

-aneesh


2015-04-01 13:18:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page

On Wed, Apr 01, 2015 at 12:08:35PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <[email protected]> writes:
>
> > Current split_huge_page() combines two operations: splitting PMDs into
> > tables of PTEs and splitting underlying compound page. This patch
> > changes split_huge_pmd() implementation to split the given PMD without
> > splitting other PMDs this page mapped with or underlying compound page.
> >
> > In order to do this we have to get rid of tail page refcounting, which
> > uses _mapcount of tail pages. Tail page refcounting is needed to be able
> > to split THP page at any point: we always know which of tail pages is
> > pinned (i.e. by get_user_pages()) and can distribute page count
> > correctly.
> >
> > We can avoid this by allowing split_huge_page() to fail if the compound
> > page is pinned. This patch removes all infrastructure for tail page
> > refcounting and make split_huge_page() to always return -EBUSY. All
> > split_huge_page() users already know how to handle its fail. Proper
> > implementation will be added later.
> >
> > Without tail page refcounting, implementation of split_huge_pmd() is
> > pretty straight-forward.
> >
>
> With this we now have pte mapping part of a compound page(). Now the
> gneric gup implementation does
>
> gup_pte_range()
> ptem = ptep = pte_offset_map(&pmd, addr);
> do {
>
> ....
> ...
> if (!page_cache_get_speculative(page))
> goto pte_unmap;
> .....
> }
>
> That page_cache_get_speculative will fail in our case because it does
> if (unlikely(!get_page_unless_zero(page))) on a tail page. ??

Nice catch, thanks.

But the problem is not exclusive to my patchset. In current kernel some
drivers (sound, for instance) map compound pages with PTEs.

We can try to get page_cache_get_speculative() work on PTE-mapped tail
pages. Untested patch is below.

BTW, it would be nice to rename the function since it's now used not only
for page cache.

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7c3790764795..b2b6e886ed9b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -142,8 +142,10 @@ void release_pages(struct page **pages, int nr, bool cold);
*/
static inline int page_cache_get_speculative(struct page *page)
{
+ struct page *head_page;
VM_BUG_ON(in_interrupt());
-
+retry:
+ head_page = compound_head_fast(page);
#ifdef CONFIG_TINY_RCU
# ifdef CONFIG_PREEMPT_COUNT
VM_BUG_ON(!in_atomic());
@@ -157,11 +159,11 @@ static inline int page_cache_get_speculative(struct page *page)
* disabling preempt, and hence no need for the "speculative get" that
* SMP requires.
*/
- VM_BUG_ON_PAGE(page_count(page) == 0, page);
- atomic_inc(&page->_count);
+ VM_BUG_ON_PAGE(page_count(head_page) == 0, head_page);
+ atomic_inc(&head_page->_count);

#else
- if (unlikely(!get_page_unless_zero(page))) {
+ if (unlikely(!get_page_unless_zero(head_page))) {
/*
* Either the page has been freed, or will be freed.
* In either case, retry here and the caller should
@@ -170,7 +172,21 @@ static inline int page_cache_get_speculative(struct page *page)
return 0;
}
#endif
- VM_BUG_ON_PAGE(PageTail(page), page);
+ /* compound_head_fast() seen PageTail(page) == true */
+ if (unlikely(head_page != page)) {
+ /*
+ * compound_head_fast() could fetch dangling page->first_page
+ * pointer to an old compound page, so recheck that it is still
+ * a tail page before returning.
+ */
+ smp_rmb();
+ if (unlikely(!PageTail(page))) {
+ put_page(head_page);
+ goto retry;
+ }
+ /* Do not touch THP tail pages! */
+ VM_BUG_ON_PAGE(compound_tail_refcounted(head_page), head_page);
+ }

return 1;
}
--
Kirill A. Shutemov

2015-04-01 23:13:59

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page

On Wed, 1 Apr 2015, Kirill A. Shutemov wrote:
> On Wed, Apr 01, 2015 at 12:08:35PM +0530, Aneesh Kumar K.V wrote:
> >
> > With this we now have pte mapping part of a compound page(). Now the
> > gneric gup implementation does
> >
> > gup_pte_range()
> > ptem = ptep = pte_offset_map(&pmd, addr);
> > do {
> >
> > ....
> > ...
> > if (!page_cache_get_speculative(page))
> > goto pte_unmap;
> > .....
> > }
> >
> > That page_cache_get_speculative will fail in our case because it does
> > if (unlikely(!get_page_unless_zero(page))) on a tail page. ??
>
> Nice catch, thanks.

Indeed; but it's not the generic gup implementation,
it's just the generic fast gup implementation.

>
> But the problem is not exclusive to my patchset. In current kernel some
> drivers (sound, for instance) map compound pages with PTEs.

Nobody has cared whether fast gup works on those, just so long as
slow gup works on those without VM_IO | VM_PFNMAP. Whereas people did
care that fast gup work on THPs, so gave them more complicated handling.

>
> We can try to get page_cache_get_speculative() work on PTE-mapped tail
> pages. Untested patch is below.

I didn't check through; but we'll agree that it's sad to see the
complexity you've managed to reduce elsewhere now popping up again
in other places.

Hugh

2015-04-02 15:40:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page

On Wed, Apr 01, 2015 at 12:08:35PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <[email protected]> writes:
>
> > Current split_huge_page() combines two operations: splitting PMDs into
> > tables of PTEs and splitting underlying compound page. This patch
> > changes split_huge_pmd() implementation to split the given PMD without
> > splitting other PMDs this page mapped with or underlying compound page.
> >
> > In order to do this we have to get rid of tail page refcounting, which
> > uses _mapcount of tail pages. Tail page refcounting is needed to be able
> > to split THP page at any point: we always know which of tail pages is
> > pinned (i.e. by get_user_pages()) and can distribute page count
> > correctly.
> >
> > We can avoid this by allowing split_huge_page() to fail if the compound
> > page is pinned. This patch removes all infrastructure for tail page
> > refcounting and make split_huge_page() to always return -EBUSY. All
> > split_huge_page() users already know how to handle its fail. Proper
> > implementation will be added later.
> >
> > Without tail page refcounting, implementation of split_huge_pmd() is
> > pretty straight-forward.
> >
>
> With this we now have pte mapping part of a compound page(). Now the
> gneric gup implementation does
>
> gup_pte_range()
> ptem = ptep = pte_offset_map(&pmd, addr);
> do {
>
> ....
> ...
> if (!page_cache_get_speculative(page))
> goto pte_unmap;
> .....
> }
>
> That page_cache_get_speculative will fail in our case because it does
> if (unlikely(!get_page_unless_zero(page))) on a tail page. ??

IIUC, something as simple as patch below should work fine with migration
entries.

The reason I'm talking about migration enties is that with new refcounting
split_huge_page() breaks this generic fast GUP invariant:

* *) THP splits will broadcast an IPI, this can be achieved by overriding
* pmdp_splitting_flush.

We don't necessary trigger IPI during split. The page can be mapped only
with ptes by split time. That's fine for migration entries since we
re-check pte value after taking the pin. But it seems we don't have
anything in place for compound_lock case.

Hm. If I will not find any way to get it work with compound_lock, I would
need to implement new split_huge_page() on migration entries without
intermediate step with compound_lock.

Any comments?

diff --git a/mm/gup.c b/mm/gup.c
index d58af0785d24..b45edb8e6455 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1047,7 +1047,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
* for an example see gup_get_pte in arch/x86/mm/gup.c
*/
pte_t pte = READ_ONCE(*ptep);
- struct page *page;
+ struct page *head, *page;

/*
* Similar to the PMD case below, NUMA hinting must take slow
@@ -1059,15 +1059,17 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,

VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
+ head = compound_head(page);

- if (!page_cache_get_speculative(page))
+ if (!page_cache_get_speculative(head))
goto pte_unmap;

if (unlikely(pte_val(pte) != pte_val(*ptep))) {
- put_page(page);
+ put_page(head);
goto pte_unmap;
}

+ VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
(*nr)++;

--
Kirill A. Shutemov