Hello everybody,
We've discussed a few times that it would be nice to allow huge pages to be
mapped with 4k pages too. Here's my first attempt to actually implement
this. It's an early prototype and not stabilized yet, but I want to share it
to discuss any potential show stoppers early.
The main reason why we can't map THP with 4k pages is how refcounting on THP
is designed. It is built around two requirements:
 - splitting a huge page should never fail;
 - we can't change the interface of get_user_pages().
To be able to split a huge page at any point we have to track which tail
pages are pinned. This leads to tricky and expensive get_page() on tail pages
and also occupies tail_page->_mapcount.
Most split_huge_page*() users want the PMD to be split into a table of PTEs
and don't care whether the compound page is going to be split or not.
The plan is:
 - allow split_huge_page() to fail if the page is pinned. It's trivial to
   split a non-pinned page and it doesn't require tail page refcounting, so
   tail_page->_mapcount is free to be reused.
 - introduce a new routine -- split_huge_pmd() -- to split a PMD into a table
   of PTEs. It splits only one PMD, not touching other PMDs the page is
   mapped with or the underlying compound page. Unlike the new
   split_huge_page(), split_huge_pmd() never fails (see the sketch below).
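To illustrate, a page table walker that cannot handle huge PMDs would use the
new helper roughly like this (a sketch only, mirroring the mremap example in
the documentation patch below):

        pmd = pmd_offset(pud, addr);
        split_huge_pmd(vma, pmd, addr); /* splits only this PMD; never fails */
        if (pmd_none_or_clear_bad(pmd))
                return NULL;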
Fortunately, we have only a few places where split_huge_page() is needed:
swap out, memory failure, migration and KSM. All of them can handle
split_huge_page() failure.
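The caller-side pattern looks roughly like this (a sketch only; the helper
name is made up and not part of the patchset):

        /*
         * Callers back off when split_huge_page() fails because the
         * compound page is pinned (e.g. by get_user_pages()).
         */
        static bool hypothetical_try_split(struct page *page)
        {
                if (PageTransHuge(page) && split_huge_page(page))
                        return false;   /* pinned: leave the huge page alone */
                return true;            /* now an ordinary set of base pages */
        }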
In the new scheme tail_page->_mapcount is used to account how many times
the tail page is mapped. head_page->_mapcount is used for both the PMD
mapping of the whole huge page and the PTE mapping of the first 4k page of
the compound page. It seems to work fine, except that we don't have a cheap
way to check whether the page is mapped with PMDs or not.
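In other words, the accounting is roughly the following (illustration only;
the helper below is made up and not part of the patchset):

        /*
         * Sketch of the new scheme: summing per-subpage mapcounts gives
         * PTE mappings of each 4k subpage plus PMD mappings of the whole
         * page, but PMD mappings and PTE mappings of the first subpage
         * are both folded into head->_mapcount, so they can't be told
         * apart cheaply.
         */
        static int hypothetical_total_mapcount(struct page *head)
        {
                int i, total = 0;

                for (i = 0; i < HPAGE_PMD_NR; i++)
                        total += page_mapcount(head + i);
                return total;
        }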
Introducing split_huge_pmd() effectively allows THP to be mapped with 4k
pages. It can break some kernel expectations. E.g. a VMA can now start and
end in the middle of a compound page. IIUC, it will break compaction and
probably something else (any hints?).
Also, munmap() on part of a huge page will not split and free the unmapped
part immediately. We need to be careful here to keep the memory footprint
under control.
As a side effect we don't need to mark PMDs splitting since we have
split_huge_pmd(). get_page()/put_page() on a tail page of THP is cheaper (and
cleaner) now.
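For reference, get_page() after the series boils down to this (taken from the
include/linux/mm.h hunk in the patch below):

        static inline void get_page(struct page *page)
        {
                struct page *page_head = compound_head(page);
                VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
                atomic_inc(&page_head->_count);
        }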
I will continue with stabilizing this. The patchset is also available in
git[1].
Any comments?
[1] git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v1
Kirill A. Shutemov (10):
mm, thp: drop FOLL_SPLIT
mm: change PageAnon() to work on tail pages
thp: rename split_huge_page_pmd() to split_huge_pmd()
thp: PMD splitting without splitting compound page
mm, vmstats: new THP splitting event
thp: implement new split_huge_page()
mm, thp: remove infrastructure for handling splitting PMDs
x86, thp: remove infrastructure for handling splitting PMDs
futex, thp: remove special case for THP in get_futex_key
thp: update documentation
Documentation/vm/transhuge.txt | 95 ++++----
arch/mips/mm/gup.c | 4 -
arch/powerpc/mm/hugetlbpage.c | 12 -
arch/powerpc/mm/subpage-prot.c | 2 +-
arch/s390/mm/gup.c | 13 +-
arch/s390/mm/pgtable.c | 17 +-
arch/sparc/mm/gup.c | 14 +-
arch/x86/include/asm/pgtable.h | 9 -
arch/x86/include/asm/pgtable_types.h | 2 -
arch/x86/kernel/vm86_32.c | 6 +-
arch/x86/mm/gup.c | 17 +-
arch/x86/mm/pgtable.c | 14 --
fs/proc/task_mmu.c | 9 +-
include/asm-generic/pgtable.h | 5 -
include/linux/huge_mm.h | 48 +---
include/linux/hugetlb_inline.h | 9 +-
include/linux/mm.h | 66 +-----
include/linux/vm_event_item.h | 4 +-
kernel/futex.c | 62 ++----
mm/gup.c | 18 +-
mm/huge_memory.c | 412 ++++++++++-------------------------
mm/internal.h | 31 +--
mm/memcontrol.c | 16 +-
mm/memory.c | 20 +-
mm/migrate.c | 7 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
mm/pgtable-generic.c | 14 --
mm/rmap.c | 4 +-
mm/swap.c | 285 +++++++-----------------
mm/vmstat.c | 4 +-
32 files changed, 328 insertions(+), 897 deletions(-)
--
2.0.0.rc4
With the new THP refcounting, we don't need tricks to stabilize the huge
page. If we've got a reference to a tail page, it can't be split under us.
This patch effectively reverts commit a5b338f2b0b1.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
kernel/futex.c | 62 ++++++++++++----------------------------------------------
1 file changed, 13 insertions(+), 49 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index b632b5f3f094..d13ad5e80ab9 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -391,7 +391,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
- struct page *page, *page_head;
+ struct page *page;
int err, ro = 0;
/*
@@ -434,46 +434,10 @@ again:
else
err = 0;
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- page_head = page;
- if (unlikely(PageTail(page))) {
- put_page(page);
- /* serialize against __split_huge_page_splitting() */
- local_irq_disable();
- if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
- page_head = compound_head(page);
- /*
- * page_head is valid pointer but we must pin
- * it before taking the PG_lock and/or
- * PG_compound_lock. The moment we re-enable
- * irqs __split_huge_page_splitting() can
- * return and the head page can be freed from
- * under us. We can't take the PG_lock and/or
- * PG_compound_lock on a page that could be
- * freed from under us.
- */
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
- local_irq_enable();
- } else {
- local_irq_enable();
- goto again;
- }
- }
-#else
- page_head = compound_head(page);
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
-#endif
-
- lock_page(page_head);
-
+ page = compound_head(page);
+ lock_page(page);
/*
- * If page_head->mapping is NULL, then it cannot be a PageAnon
+ * If page->mapping is NULL, then it cannot be a PageAnon
* page; but it might be the ZERO_PAGE or in the gate area or
* in a special mapping (all cases which we are happy to fail);
* or it may have been a good file page when get_user_pages_fast
@@ -485,12 +449,12 @@ again:
*
* The case we do have to guard against is when memory pressure made
* shmem_writepage move it from filecache to swapcache beneath us:
- * an unlikely race, but we do need to retry for page_head->mapping.
+ * an unlikely race, but we do need to retry for page->mapping.
*/
- if (!page_head->mapping) {
- int shmem_swizzled = PageSwapCache(page_head);
- unlock_page(page_head);
- put_page(page_head);
+ if (!page->mapping) {
+ int shmem_swizzled = PageSwapCache(page);
+ unlock_page(page);
+ put_page(page);
if (shmem_swizzled)
goto again;
return -EFAULT;
@@ -503,7 +467,7 @@ again:
* it's a read-only handle, it's expected that futexes attach to
* the object not the particular process.
*/
- if (PageAnon(page_head)) {
+ if (PageAnon(page)) {
/*
* A RO anonymous page will never change and thus doesn't make
* sense for futex operations.
@@ -518,15 +482,15 @@ again:
key->private.address = address;
} else {
key->both.offset |= FUT_OFF_INODE; /* inode-based key */
- key->shared.inode = page_head->mapping->host;
+ key->shared.inode = page->mapping->host;
key->shared.pgoff = basepage_index(page);
}
get_futex_key_refs(key); /* implies MB (B) */
out:
- unlock_page(page_head);
- put_page(page_head);
+ unlock_page(page);
+ put_page(page);
return err;
}
--
2.0.0.rc4
Current split_huge_page() combines two operations: splitting PMDs into
tables of PTEs and splitting the underlying compound page. This patch
changes the split_huge_pmd() implementation to split the given PMD without
splitting other PMDs this page is mapped with or the underlying compound
page.
In order to do this we have to get rid of tail page refcounting, which
uses the _mapcount of tail pages. Tail page refcounting is needed to be able
to split a THP page at any point: we always know which of the tail pages is
pinned (i.e. by get_user_pages()) and can distribute the page count
correctly.
We can avoid this by allowing split_huge_page() to fail if the compound
page is pinned. This patch removes all infrastructure for tail page
refcounting and makes split_huge_page() always return -EBUSY. All
split_huge_page() users already know how to handle its failure. A proper
implementation will be added later.
Without tail page refcounting, the implementation of split_huge_pmd() is
pretty straightforward.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/mips/mm/gup.c | 4 -
arch/powerpc/mm/hugetlbpage.c | 12 --
arch/s390/mm/gup.c | 13 +-
arch/sparc/mm/gup.c | 14 +-
arch/x86/mm/gup.c | 4 -
include/linux/huge_mm.h | 7 +-
include/linux/mm.h | 62 +------
mm/huge_memory.c | 366 ++++--------------------------------------
mm/internal.h | 31 +---
mm/swap.c | 245 +---------------------------
10 files changed, 49 insertions(+), 709 deletions(-)
diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 06ce17c2a905..8e56e7a2558b 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
@@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end,
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 7e70ae968e5f..e4ba17694b6b 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1022,7 +1022,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
{
unsigned long mask;
unsigned long pte_end;
- struct page *head, *page, *tail;
pte_t pte;
int refs;
@@ -1053,7 +1052,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
head = pte_page(pte);
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
@@ -1075,15 +1073,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}
- /*
- * Any tail page need their mapcount reference taken before we
- * return.
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 639fce464008..e4c5ca753abe 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -52,7 +52,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
unsigned long mask, result;
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;
result = write ? 0 : _SEGMENT_ENTRY_PROTECT;
@@ -64,7 +64,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
@@ -85,16 +84,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
return 0;
}
- /*
- * Any tail page need their mapcount reference taken before we
- * return.
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 1aed0432c64b..04bc1aa350fa 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
put_page(head);
return 0;
}
- if (head != page)
- get_huge_page_tail(page);
pages[*nr] = page;
(*nr)++;
@@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages,
int *nr)
{
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;
if (!(pmd_val(pmd) & _PAGE_VALID))
@@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
@@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
return 0;
}
- /* Any tail page need their mapcount reference taken before we
- * return.
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef662d..754bca23ec1b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -137,8 +137,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
@@ -214,8 +212,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e68dfb888e59..5e9d26cd98b7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -102,14 +102,13 @@ static inline int split_huge_page(struct page *page)
{
return split_huge_page_to_list(page, NULL);
}
-extern void __split_huge_page_pmd(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd);
+extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address);
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
- __split_huge_page_pmd(__vma, __address, \
- ____pmd); \
+ __split_huge_pmd(__vma, __pmd, __address); \
} while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a60e2db5f9f9..8885a7102aba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -413,25 +413,10 @@ static inline void compound_unlock_irqrestore(struct page *page,
#endif
}
-static inline struct page *compound_head_by_tail(struct page *tail)
-{
- struct page *head = tail->first_page;
-
- /*
- * page->first_page may be a dangling pointer to an old
- * compound page, so recheck that it is still a tail
- * page before returning.
- */
- smp_rmb();
- if (likely(PageTail(tail)))
- return head;
- return tail;
-}
-
static inline struct page *compound_head(struct page *page)
{
if (unlikely(PageTail(page)))
- return compound_head_by_tail(page);
+ return page->first_page;
return page;
}
@@ -464,50 +449,11 @@ static inline int PageHeadHuge(struct page *page_head)
}
#endif /* CONFIG_HUGETLB_PAGE */
-static inline bool __compound_tail_refcounted(struct page *page)
-{
- return !PageSlab(page) && !PageHeadHuge(page);
-}
-
-/*
- * This takes a head page as parameter and tells if the
- * tail page reference counting can be skipped.
- *
- * For this to be safe, PageSlab and PageHeadHuge must remain true on
- * any given page where they return true here, until all tail pins
- * have been released.
- */
-static inline bool compound_tail_refcounted(struct page *page)
-{
- VM_BUG_ON_PAGE(!PageHead(page), page);
- return __compound_tail_refcounted(page);
-}
-
-static inline void get_huge_page_tail(struct page *page)
-{
- /*
- * __split_huge_page_refcount() cannot run from under us.
- */
- VM_BUG_ON_PAGE(!PageTail(page), page);
- VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
- if (compound_tail_refcounted(page->first_page))
- atomic_inc(&page->_mapcount);
-}
-
-extern bool __get_page_tail(struct page *page);
-
static inline void get_page(struct page *page)
{
- if (unlikely(PageTail(page)))
- if (likely(__get_page_tail(page)))
- return;
- /*
- * Getting a normal page or the head of a compound page
- * requires to already have an elevated page->_count.
- */
- VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
- atomic_inc(&page->_count);
+ struct page *page_head = compound_head(page);
+ VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
+ atomic_inc(&page_head->_count);
}
static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e809ef4519f2..752c850f6941 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1576,323 +1576,9 @@ unlock:
return NULL;
}
-static int __split_huge_page_splitting(struct page *page,
- struct vm_area_struct *vma,
- unsigned long address)
-{
- struct mm_struct *mm = vma->vm_mm;
- spinlock_t *ptl;
- pmd_t *pmd;
- int ret = 0;
- /* For mmu_notifiers */
- const unsigned long mmun_start = address;
- const unsigned long mmun_end = address + HPAGE_PMD_SIZE;
-
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
- if (pmd) {
- /*
- * We can't temporarily set the pmd to null in order
- * to split it, the pmd must remain marked huge at all
- * times or the VM won't take the pmd_trans_huge paths
- * and it won't wait on the anon_vma->root->rwsem to
- * serialize against split_huge_page*.
- */
- pmdp_splitting_flush(vma, address, pmd);
- ret = 1;
- spin_unlock(ptl);
- }
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
- return ret;
-}
-
-static void __split_huge_page_refcount(struct page *page,
- struct list_head *list)
-{
- int i;
- struct zone *zone = page_zone(page);
- struct lruvec *lruvec;
- int tail_count = 0;
-
- /* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irq(&zone->lru_lock);
- lruvec = mem_cgroup_page_lruvec(page, zone);
-
- compound_lock(page);
- /* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(page);
-
- for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
- struct page *page_tail = page + i;
-
- /* tail_page->_mapcount cannot change */
- BUG_ON(page_mapcount(page_tail) < 0);
- tail_count += page_mapcount(page_tail);
- /* check for overflow */
- BUG_ON(tail_count < 0);
- BUG_ON(atomic_read(&page_tail->_count) != 0);
- /*
- * tail_page->_count is zero and not changing from
- * under us. But get_page_unless_zero() may be running
- * from under us on the tail_page. If we used
- * atomic_set() below instead of atomic_add(), we
- * would then run atomic_set() concurrently with
- * get_page_unless_zero(), and atomic_set() is
- * implemented in C not using locked ops. spin_unlock
- * on x86 sometime uses locked ops because of PPro
- * errata 66, 92, so unless somebody can guarantee
- * atomic_set() here would be safe on all archs (and
- * not only on x86), it's safer to use atomic_add().
- */
- atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
- &page_tail->_count);
-
- /* after clearing PageTail the gup refcount can be released */
- smp_mb();
-
- /*
- * retain hwpoison flag of the poisoned tail page:
- * fix for the unsuitable process killed on Guest Machine(KVM)
- * by the memory-failure.
- */
- page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
- page_tail->flags |= (page->flags &
- ((1L << PG_referenced) |
- (1L << PG_swapbacked) |
- (1L << PG_mlocked) |
- (1L << PG_uptodate) |
- (1L << PG_active) |
- (1L << PG_unevictable)));
- page_tail->flags |= (1L << PG_dirty);
-
- /* clear PageTail before overwriting first_page */
- smp_wmb();
-
- /*
- * __split_huge_page_splitting() already set the
- * splitting bit in all pmd that could map this
- * hugepage, that will ensure no CPU can alter the
- * mapcount on the head page. The mapcount is only
- * accounted in the head page and it has to be
- * transferred to all tail pages in the below code. So
- * for this code to be safe, the split the mapcount
- * can't change. But that doesn't mean userland can't
- * keep changing and reading the page contents while
- * we transfer the mapcount, so the pmd splitting
- * status is achieved setting a reserved bit in the
- * pmd, not by clearing the present bit.
- */
- page_tail->_mapcount = page->_mapcount;
-
- BUG_ON(page_tail->mapping);
- page_tail->mapping = page->mapping;
-
- page_tail->index = page->index + i;
- page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
-
- BUG_ON(!PageAnon(page_tail));
- BUG_ON(!PageUptodate(page_tail));
- BUG_ON(!PageDirty(page_tail));
- BUG_ON(!PageSwapBacked(page_tail));
-
- lru_add_page_tail(page, page_tail, lruvec, list);
- }
- atomic_sub(tail_count, &page->_count);
- BUG_ON(atomic_read(&page->_count) <= 0);
-
- __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
-
- ClearPageCompound(page);
- compound_unlock(page);
- spin_unlock_irq(&zone->lru_lock);
-
- for (i = 1; i < HPAGE_PMD_NR; i++) {
- struct page *page_tail = page + i;
- BUG_ON(page_count(page_tail) <= 0);
- /*
- * Tail pages may be freed if there wasn't any mapping
- * like if add_to_swap() is running on a lru page that
- * had its mapping zapped. And freeing these pages
- * requires taking the lru_lock so we do the put_page
- * of the tail pages after the split is complete.
- */
- put_page(page_tail);
- }
-
- /*
- * Only the head page (now become a regular page) is required
- * to be pinned by the caller.
- */
- BUG_ON(page_count(page) <= 0);
-}
-
-static int __split_huge_page_map(struct page *page,
- struct vm_area_struct *vma,
- unsigned long address)
-{
- struct mm_struct *mm = vma->vm_mm;
- spinlock_t *ptl;
- pmd_t *pmd, _pmd;
- int ret = 0, i;
- pgtable_t pgtable;
- unsigned long haddr;
-
- pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
- if (pmd) {
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
-
- haddr = address;
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- BUG_ON(PageCompound(page+i));
- entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (!pmd_write(*pmd))
- entry = pte_wrprotect(entry);
- else
- BUG_ON(page_mapcount(page) != 1);
- if (!pmd_young(*pmd))
- entry = pte_mkold(entry);
- if (pmd_numa(*pmd))
- entry = pte_mknuma(entry);
- pte = pte_offset_map(&_pmd, haddr);
- BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
- }
-
- smp_wmb(); /* make pte visible before pmd */
- /*
- * Up to this point the pmd is present and huge and
- * userland has the whole access to the hugepage
- * during the split (which happens in place). If we
- * overwrite the pmd with the not-huge version
- * pointing to the pte here (which of course we could
- * if all CPUs were bug free), userland could trigger
- * a small page size TLB miss on the small sized TLB
- * while the hugepage TLB entry is still established
- * in the huge TLB. Some CPU doesn't like that. See
- * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
- * Erratum 383 on page 93. Intel should be safe but is
- * also warns that it's only safe if the permission
- * and cache attributes of the two entries loaded in
- * the two TLB is identical (which should be the case
- * here). But it is generally safer to never allow
- * small and huge TLB entries for the same virtual
- * address to be loaded simultaneously. So instead of
- * doing "pmd_populate(); flush_tlb_range();" we first
- * mark the current pmd notpresent (atomically because
- * here the pmd_trans_huge and pmd_trans_splitting
- * must remain set at all times on the pmd until the
- * split is complete for this pmd), then we flush the
- * SMP TLB and finally we write the non-huge version
- * of the pmd entry with pmd_populate.
- */
- pmdp_invalidate(vma, address, pmd);
- pmd_populate(mm, pmd, pgtable);
- ret = 1;
- spin_unlock(ptl);
- }
-
- return ret;
-}
-
-/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
- struct anon_vma *anon_vma,
- struct list_head *list)
-{
- int mapcount, mapcount2;
- pgoff_t pgoff = page_pgoff(page);
- struct anon_vma_chain *avc;
-
- BUG_ON(!PageHead(page));
- BUG_ON(PageTail(page));
-
- mapcount = 0;
- anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
- struct vm_area_struct *vma = avc->vma;
- unsigned long addr = vma_address(page, vma);
- BUG_ON(is_vma_temporary_stack(vma));
- mapcount += __split_huge_page_splitting(page, vma, addr);
- }
- /*
- * It is critical that new vmas are added to the tail of the
- * anon_vma list. This guarantes that if copy_huge_pmd() runs
- * and establishes a child pmd before
- * __split_huge_page_splitting() freezes the parent pmd (so if
- * we fail to prevent copy_huge_pmd() from running until the
- * whole __split_huge_page() is complete), we will still see
- * the newly established pmd of the child later during the
- * walk, to be able to set it as pmd_trans_splitting too.
- */
- if (mapcount != page_mapcount(page)) {
- pr_err("mapcount %d page_mapcount %d\n",
- mapcount, page_mapcount(page));
- BUG();
- }
-
- __split_huge_page_refcount(page, list);
-
- mapcount2 = 0;
- anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
- struct vm_area_struct *vma = avc->vma;
- unsigned long addr = vma_address(page, vma);
- BUG_ON(is_vma_temporary_stack(vma));
- mapcount2 += __split_huge_page_map(page, vma, addr);
- }
- if (mapcount != mapcount2) {
- pr_err("mapcount %d mapcount2 %d page_mapcount %d\n",
- mapcount, mapcount2, page_mapcount(page));
- BUG();
- }
-}
-
-/*
- * Split a hugepage into normal pages. This doesn't change the position of head
- * page. If @list is null, tail pages will be added to LRU list, otherwise, to
- * @list. Both head page and tail pages will inherit mapping, flags, and so on
- * from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return 1.
- */
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
- struct anon_vma *anon_vma;
- int ret = 1;
-
- BUG_ON(is_huge_zero_page(page));
- BUG_ON(!PageAnon(page));
-
- /*
- * The caller does not necessarily hold an mmap_sem that would prevent
- * the anon_vma disappearing so we first we take a reference to it
- * and then lock the anon_vma for write. This is similar to
- * page_lock_anon_vma_read except the write lock is taken to serialise
- * against parallel split or collapse operations.
- */
- anon_vma = page_get_anon_vma(page);
- if (!anon_vma)
- goto out;
- anon_vma_lock_write(anon_vma);
-
- ret = 0;
- if (!PageCompound(page))
- goto out_unlock;
-
- BUG_ON(!PageSwapBacked(page));
- __split_huge_page(page, anon_vma, list);
- count_vm_event(THP_SPLIT);
-
- BUG_ON(PageCompound(page));
-out_unlock:
- anon_vma_unlock_write(anon_vma);
- put_anon_vma(anon_vma);
-out:
- return ret;
+ return -EBUSY;
}
#define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
@@ -2786,8 +2472,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
put_huge_zero_page();
}
-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd)
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address)
{
spinlock_t *ptl;
struct page *page;
@@ -2795,12 +2481,14 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
unsigned long haddr = address & HPAGE_PMD_MASK;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int i;
BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
-again:
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_trans_huge(*pmd))) {
@@ -2814,23 +2502,37 @@ again:
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
return;
}
+
page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!page_count(page), page);
- get_page(page);
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ atomic_add(HPAGE_PMD_NR - 1, &page->_count);
- split_huge_page(page);
+ pmdp_clear_flush(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
- put_page(page);
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pmd_populate(mm, &_pmd, pgtable);
- /*
- * We don't always have down_write of mmap_sem here: a racing
- * do_huge_pmd_wp_page() might have copied-on-write to another
- * huge page before our split_huge_page() got the anon_vma lock.
- */
- if (unlikely(pmd_trans_huge(*pmd)))
- goto again;
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t entry, *pte;
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!pmd_write(*pmd))
+ entry = pte_wrprotect(entry);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+ if (pmd_numa(*pmd))
+ entry = pte_mknuma(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ atomic_inc(&page[i]._mapcount);
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ atomic_dec(&page->_mapcount);
+ spin_unlock(ptl);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
}
static void split_huge_page_address(struct vm_area_struct *vma,
@@ -2847,7 +2549,7 @@ static void split_huge_page_address(struct vm_area_struct *vma,
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- __split_huge_page_pmd(vma, address, pmd);
+ __split_huge_pmd(vma, pmd, address);
}
void __vma_adjust_trans_huge(struct vm_area_struct *vma,
diff --git a/mm/internal.h b/mm/internal.h
index 7f22a11fcc66..7e1539729e33 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -47,26 +47,6 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
}
-static inline void __get_page_tail_foll(struct page *page,
- bool get_page_head)
-{
- /*
- * If we're getting a tail page, the elevated page->_count is
- * required only in the head page and we will elevate the head
- * page->_count and tail page->_mapcount.
- *
- * We elevate page_tail->_mapcount for tail pages to force
- * page_tail->_count to be zero at all times to avoid getting
- * false positives from get_page_unless_zero() with
- * speculative page access (like in
- * page_cache_get_speculative()) on tail pages.
- */
- VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
- if (get_page_head)
- atomic_inc(&page->first_page->_count);
- get_huge_page_tail(page);
-}
-
/*
* This is meant to be called as the FOLL_GET operation of
* follow_page() and it must be called while holding the proper PT
@@ -74,14 +54,9 @@ static inline void __get_page_tail_foll(struct page *page,
*/
static inline void get_page_foll(struct page *page)
{
- if (unlikely(PageTail(page)))
- /*
- * This is safe only because
- * __split_huge_page_refcount() can't run under
- * get_page_foll() because we hold the proper PT lock.
- */
- __get_page_tail_foll(page, true);
- else {
+ if (unlikely(PageTail(page))) {
+ atomic_inc(&page->first_page->_count);
+ } else {
/*
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e3472248b..5faf87c3809b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,185 +79,12 @@ static void __put_compound_page(struct page *page)
(*dtor)(page);
}
-/**
- * Two special cases here: we could avoid taking compound_lock_irqsave
- * and could skip the tail refcounting(in _mapcount).
- *
- * 1. Hugetlbfs page:
- *
- * PageHeadHuge will remain true until the compound page
- * is released and enters the buddy allocator, and it could
- * not be split by __split_huge_page_refcount().
- *
- * So if we see PageHeadHuge set, and we have the tail page pin,
- * then we could safely put head page.
- *
- * 2. Slab THP page:
- *
- * PG_slab is cleared before the slab frees the head page, and
- * tail pin cannot be the last reference left on the head page,
- * because the slab code is free to reuse the compound page
- * after a kfree/kmem_cache_free without having to check if
- * there's any tail pin left. In turn all tail pinsmust be always
- * released while the head is still pinned by the slab code
- * and so we know PG_slab will be still set too.
- *
- * So if we see PageSlab set, and we have the tail page pin,
- * then we could safely put head page.
- */
-static __always_inline
-void put_unrefcounted_compound_page(struct page *page_head, struct page *page)
-{
- /*
- * If @page is a THP tail, we must read the tail page
- * flags after the head page flags. The
- * __split_huge_page_refcount side enforces write memory barriers
- * between clearing PageTail and before the head page
- * can be freed and reallocated.
- */
- smp_rmb();
- if (likely(PageTail(page))) {
- /*
- * __split_huge_page_refcount cannot race
- * here, see the comment above this function.
- */
- VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
- VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
- if (put_page_testzero(page_head)) {
- /*
- * If this is the tail of a slab THP page,
- * the tail pin must not be the last reference
- * held on the page, because the PG_slab cannot
- * be cleared before all tail pins (which skips
- * the _mapcount tail refcounting) have been
- * released.
- *
- * If this is the tail of a hugetlbfs page,
- * the tail pin may be the last reference on
- * the page instead, because PageHeadHuge will
- * not go away until the compound page enters
- * the buddy allocator.
- */
- VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
- __put_compound_page(page_head);
- }
- } else
- /*
- * __split_huge_page_refcount run before us,
- * @page was a THP tail. The split @page_head
- * has been freed and reallocated as slab or
- * hugetlbfs page of smaller order (only
- * possible if reallocated as slab on x86).
- */
- if (put_page_testzero(page))
- __put_single_page(page);
-}
-
-static __always_inline
-void put_refcounted_compound_page(struct page *page_head, struct page *page)
-{
- if (likely(page != page_head && get_page_unless_zero(page_head))) {
- unsigned long flags;
-
- /*
- * @page_head wasn't a dangling pointer but it may not
- * be a head page anymore by the time we obtain the
- * lock. That is ok as long as it can't be freed from
- * under us.
- */
- flags = compound_lock_irqsave(page_head);
- if (unlikely(!PageTail(page))) {
- /* __split_huge_page_refcount run before us */
- compound_unlock_irqrestore(page_head, flags);
- if (put_page_testzero(page_head)) {
- /*
- * The @page_head may have been freed
- * and reallocated as a compound page
- * of smaller order and then freed
- * again. All we know is that it
- * cannot have become: a THP page, a
- * compound page of higher order, a
- * tail page. That is because we
- * still hold the refcount of the
- * split THP tail and page_head was
- * the THP head before the split.
- */
- if (PageHead(page_head))
- __put_compound_page(page_head);
- else
- __put_single_page(page_head);
- }
-out_put_single:
- if (put_page_testzero(page))
- __put_single_page(page);
- return;
- }
- VM_BUG_ON_PAGE(page_head != page->first_page, page);
- /*
- * We can release the refcount taken by
- * get_page_unless_zero() now that
- * __split_huge_page_refcount() is blocked on the
- * compound_lock.
- */
- if (put_page_testzero(page_head))
- VM_BUG_ON_PAGE(1, page_head);
- /* __split_huge_page_refcount will wait now */
- VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
- atomic_dec(&page->_mapcount);
- VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
- VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
- compound_unlock_irqrestore(page_head, flags);
-
- if (put_page_testzero(page_head)) {
- if (PageHead(page_head))
- __put_compound_page(page_head);
- else
- __put_single_page(page_head);
- }
- } else {
- /* @page_head is a dangling pointer */
- VM_BUG_ON_PAGE(PageTail(page), page);
- goto out_put_single;
- }
-}
-
static void put_compound_page(struct page *page)
{
- struct page *page_head;
-
- /*
- * We see the PageCompound set and PageTail not set, so @page maybe:
- * 1. hugetlbfs head page, or
- * 2. THP head page.
- */
- if (likely(!PageTail(page))) {
- if (put_page_testzero(page)) {
- /*
- * By the time all refcounts have been released
- * split_huge_page cannot run anymore from under us.
- */
- if (PageHead(page))
- __put_compound_page(page);
- else
- __put_single_page(page);
- }
- return;
- }
+ struct page *page_head = compound_head(page);
- /*
- * We see the PageCompound set and PageTail set, so @page maybe:
- * 1. a tail hugetlbfs page, or
- * 2. a tail THP page, or
- * 3. a split THP page.
- *
- * Case 3 is possible, as we may race with
- * __split_huge_page_refcount tearing down a THP page.
- */
- page_head = compound_head_by_tail(page);
- if (!__compound_tail_refcounted(page_head))
- put_unrefcounted_compound_page(page_head, page);
- else
- put_refcounted_compound_page(page_head, page);
+ if (put_page_testzero(page_head))
+ __put_compound_page(page_head);
}
void put_page(struct page *page)
@@ -269,72 +96,6 @@ void put_page(struct page *page)
}
EXPORT_SYMBOL(put_page);
-/*
- * This function is exported but must not be called by anything other
- * than get_page(). It implements the slow path of get_page().
- */
-bool __get_page_tail(struct page *page)
-{
- /*
- * This takes care of get_page() if run on a tail page
- * returned by one of the get_user_pages/follow_page variants.
- * get_user_pages/follow_page itself doesn't need the compound
- * lock because it runs __get_page_tail_foll() under the
- * proper PT lock that already serializes against
- * split_huge_page().
- */
- unsigned long flags;
- bool got;
- struct page *page_head = compound_head(page);
-
- /* Ref to put_compound_page() comment. */
- if (!__compound_tail_refcounted(page_head)) {
- smp_rmb();
- if (likely(PageTail(page))) {
- /*
- * This is a hugetlbfs page or a slab
- * page. __split_huge_page_refcount
- * cannot race here.
- */
- VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
- __get_page_tail_foll(page, true);
- return true;
- } else {
- /*
- * __split_huge_page_refcount run
- * before us, "page" was a THP
- * tail. The split page_head has been
- * freed and reallocated as slab or
- * hugetlbfs page of smaller order
- * (only possible if reallocated as
- * slab on x86).
- */
- return false;
- }
- }
-
- got = false;
- if (likely(page != page_head && get_page_unless_zero(page_head))) {
- /*
- * page_head wasn't a dangling pointer but it
- * may not be a head page anymore by the time
- * we obtain the lock. That is ok as long as it
- * can't be freed from under us.
- */
- flags = compound_lock_irqsave(page_head);
- /* here __split_huge_page_refcount won't run anymore */
- if (likely(PageTail(page))) {
- __get_page_tail_foll(page, false);
- got = true;
- }
- compound_unlock_irqrestore(page_head, flags);
- if (unlikely(!got))
- put_page(page_head);
- }
- return got;
-}
-EXPORT_SYMBOL(__get_page_tail);
-
/**
* put_pages_list() - release a list of pages
* @pages: list of pages threaded on page->lru
--
2.0.0.rc4
With the new refcounting we don't need to mark PMDs splitting. Let's drop the
code to handle this.
Arch-specific code will be removed separately.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 9 +++--
include/asm-generic/pgtable.h | 5 ---
include/linux/huge_mm.h | 33 -----------------
mm/gup.c | 14 +++-----
mm/huge_memory.c | 83 +++++++++----------------------------------
mm/memcontrol.c | 16 +++------
mm/memory.c | 18 ++--------
mm/pgtable-generic.c | 14 --------
mm/rmap.c | 4 +--
9 files changed, 33 insertions(+), 163 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 25e5a1e044f2..ba99643add30 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -504,7 +504,8 @@ static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
struct mem_size_stats *mss = walk->private;
spinlock_t *ptl;
- if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) {
+ ptl = pmd_lock(walk->vma->vm_mm, pmd);
+ if (likely(pmd_trans_huge(*pmd))) {
smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk);
spin_unlock(ptl);
mss->anonymous_thp += HPAGE_PMD_SIZE;
@@ -993,7 +994,8 @@ static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
if (!vma)
return err;
- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (likely(pmd_trans_huge(*pmd))) {
int pmd_flags2;
if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -1285,7 +1287,8 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr,
struct vm_area_struct *vma = walk->vma;
spinlock_t *ptl;
- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (likely(pmd_trans_huge(*pmd))) {
pte_t huge_pte = *(pte_t *)pmd;
struct page *page;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 53b2acc38213..204fa5db3068 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -167,11 +167,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp);
-#endif
-
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5e9d26cd98b7..cdb88f93f1fd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -46,15 +46,9 @@ enum transparent_hugepage_flag {
#endif
};
-enum page_check_address_pmd_flag {
- PAGE_CHECK_ADDRESS_PMD_FLAG,
- PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
- PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
-};
extern pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag,
spinlock_t **ptl);
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
@@ -110,14 +104,6 @@ extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (unlikely(pmd_trans_huge(*____pmd))) \
__split_huge_pmd(__vma, __pmd, __address); \
} while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
- do { \
- pmd_t *____pmd = (__pmd); \
- anon_vma_lock_write(__anon_vma); \
- anon_vma_unlock_write(__anon_vma); \
- BUG_ON(pmd_trans_splitting(*____pmd) || \
- pmd_trans_huge(*____pmd)); \
- } while (0)
#if HPAGE_PMD_ORDER >= MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -127,18 +113,6 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
long adjust_next);
-extern int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
- spinlock_t **ptl);
-/* mmap_sem must be held on entry */
-static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
- spinlock_t **ptl)
-{
- VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
- if (pmd_trans_huge(*pmd))
- return __pmd_trans_huge_lock(pmd, vma, ptl);
- else
- return 0;
-}
static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
@@ -177,8 +151,6 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define wait_split_huge_page(__anon_vma, __pmd) \
- do { } while (0)
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
@@ -193,11 +165,6 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next)
{
}
-static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
- spinlock_t **ptl)
-{
- return 0;
-}
static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
diff --git a/mm/gup.c b/mm/gup.c
index ac01800abce6..1c0b777144a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -194,16 +194,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (pmd_trans_huge(*pmd)) {
ptl = pmd_lock(mm, pmd);
if (likely(pmd_trans_huge(*pmd))) {
- if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(ptl);
- wait_split_huge_page(vma->anon_vma, pmd);
- } else {
- page = follow_trans_huge_pmd(vma, address,
- pmd, flags);
- spin_unlock(ptl);
- *page_mask = HPAGE_PMD_NR - 1;
- return page;
- }
+ page = follow_trans_huge_pmd(vma, address, pmd, flags);
+ spin_unlock(ptl);
+ *page_mask = HPAGE_PMD_NR - 1;
+ return page;
} else
spin_unlock(ptl);
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 89c6f098f91f..31a7904994cc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -889,15 +889,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
- if (unlikely(pmd_trans_splitting(pmd))) {
- /* split huge page running from under us */
- spin_unlock(src_ptl);
- spin_unlock(dst_ptl);
- pte_free(dst_mm, pgtable);
-
- wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
- goto out;
- }
src_page = pmd_page(pmd);
VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
get_page(src_page);
@@ -1346,7 +1337,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spinlock_t *ptl;
int ret = 0;
- if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (likely(pmd_trans_huge(*pmd))) {
struct page *page;
pgtable_t pgtable;
pmd_t orig_pmd;
@@ -1386,16 +1378,16 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
spinlock_t *ptl;
int ret = 0;
- if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (pmd_trans_huge(*pmd)) {
/*
* All logical pages in the range are present
* if backed by a huge page.
*/
- spin_unlock(ptl);
memset(vec, 1, (end - addr) >> PAGE_SHIFT);
ret = 1;
}
-
+ spin_unlock(ptl);
return ret;
}
@@ -1405,7 +1397,6 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
pmd_t *old_pmd, pmd_t *new_pmd)
{
spinlock_t *old_ptl, *new_ptl;
- int ret = 0;
pmd_t pmd;
struct mm_struct *mm = vma->vm_mm;
@@ -1414,7 +1405,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
(new_addr & ~HPAGE_PMD_MASK) ||
old_end - old_addr < HPAGE_PMD_SIZE ||
(new_vma->vm_flags & VM_NOHUGEPAGE))
- goto out;
+ return 0;
/*
* The destination pmd shouldn't be established, free_pgtables()
@@ -1422,15 +1413,15 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
*/
if (WARN_ON(!pmd_none(*new_pmd))) {
VM_BUG_ON(pmd_trans_huge(*new_pmd));
- goto out;
+ return 0;
}
/*
* We don't have to worry about the ordering of src and dst
* ptlocks because exclusive mmap_sem prevents deadlock.
*/
- ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
- if (ret == 1) {
+ old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+ if (likely(pmd_trans_huge(*old_pmd))) {
new_ptl = pmd_lockptr(mm, new_pmd);
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -1445,10 +1436,9 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
- spin_unlock(old_ptl);
}
-out:
- return ret;
+ spin_unlock(old_ptl);
+ return 1;
}
/*
@@ -1464,7 +1454,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
spinlock_t *ptl;
int ret = 0;
- if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (likely(pmd_trans_huge(*pmd))) {
pmd_t entry;
ret = 1;
if (!prot_numa) {
@@ -1490,39 +1481,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
ret = HPAGE_PMD_NR;
}
}
- spin_unlock(ptl);
}
-
+ spin_unlock(ptl);
return ret;
}
/*
- * Returns 1 if a given pmd maps a stable (not under splitting) thp.
- * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
- *
- * Note that if it returns 1, this routine returns without unlocking page
- * table locks. So callers must unlock them.
- */
-int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
- spinlock_t **ptl)
-{
- *ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(pmd_trans_huge(*pmd))) {
- if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(*ptl);
- wait_split_huge_page(vma->anon_vma, pmd);
- return -1;
- } else {
- /* Thp mapped by 'pmd' is stable, so we can
- * handle it as it is. */
- return 1;
- }
- }
- spin_unlock(*ptl);
- return 0;
-}
-
-/*
* This function returns whether a given @page is mapped onto the @address
* in the virtual space of @mm.
*
@@ -1533,7 +1497,6 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag,
spinlock_t **ptl)
{
pgd_t *pgd;
@@ -1556,21 +1519,8 @@ pmd_t *page_check_address_pmd(struct page *page,
goto unlock;
if (pmd_page(*pmd) != page)
goto unlock;
- /*
- * split_vma() may create temporary aliased mappings. There is
- * no risk as long as all huge pmd are found and have their
- * splitting bit set before __split_huge_page_refcount
- * runs. Finding the same huge pmd more than once during the
- * same rmap walk is not a problem.
- */
- if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
- pmd_trans_splitting(*pmd))
- goto unlock;
- if (pmd_trans_huge(*pmd)) {
- VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
- !pmd_trans_splitting(*pmd));
+ if (pmd_trans_huge(*pmd))
return pmd;
- }
unlock:
spin_unlock(*ptl);
return NULL;
@@ -1750,8 +1700,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spinlock_t *ptl;
pmd_t *pmd;
- pmd = page_check_address_pmd(page, vma->vm_mm, addr,
- PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ pmd = page_check_address_pmd(page, vma->vm_mm, addr, &ptl);
if (pmd)
__split_huge_pmd(vma, pmd, addr);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0ab520c4d630..5fceff94f9b6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6733,7 +6733,8 @@ static int mem_cgroup_count_precharge_pmd(pmd_t *pmd,
struct vm_area_struct *vma = walk->vma;
spinlock_t *ptl;
- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (likely(pmd_trans_huge(*pmd))) {
if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
mc.precharge += HPAGE_PMD_NR;
spin_unlock(ptl);
@@ -6902,17 +6903,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
struct page *page;
struct page_cgroup *pc;
- /*
- * We don't take compound_lock() here but no race with splitting thp
- * happens because:
- * - if pmd_trans_huge_lock() returns 1, the relevant thp is not
- * under splitting, which means there's no concurrent thp split,
- * - if another thread runs into split_huge_page() just after we
- * entered this if-block, the thread must wait for page table lock
- * to be unlocked in __split_huge_page_splitting(), where the main
- * part of thp split is not executed yet.
- */
- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (likely(pmd_trans_huge(*pmd))) {
if (mc.precharge < HPAGE_PMD_NR) {
spin_unlock(ptl);
return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 805ff8d76e17..6af9f92e1936 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -563,7 +563,6 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
{
spinlock_t *ptl;
pgtable_t new = pte_alloc_one(mm, address);
- int wait_split_huge_page;
if (!new)
return -ENOMEM;
@@ -583,18 +582,14 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
ptl = pmd_lock(mm, pmd);
- wait_split_huge_page = 0;
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
atomic_long_inc(&mm->nr_ptes);
pmd_populate(mm, pmd, new);
new = NULL;
- } else if (unlikely(pmd_trans_splitting(*pmd)))
- wait_split_huge_page = 1;
+ }
spin_unlock(ptl);
if (new)
pte_free(mm, new);
- if (wait_split_huge_page)
- wait_split_huge_page(vma->anon_vma, pmd);
return 0;
}
@@ -610,8 +605,7 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
pmd_populate_kernel(&init_mm, pmd, new);
new = NULL;
- } else
- VM_BUG_ON(pmd_trans_splitting(*pmd));
+ }
spin_unlock(&init_mm.page_table_lock);
if (new)
pte_free_kernel(&init_mm, new);
@@ -3270,14 +3264,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (pmd_trans_huge(orig_pmd)) {
unsigned int dirty = flags & FAULT_FLAG_WRITE;
- /*
- * If the pmd is splitting, return and retry the
- * the fault. Alternative: wait until the split
- * is done, and goto retry.
- */
- if (pmd_trans_splitting(orig_pmd))
- return 0;
-
if (pmd_numa(orig_pmd))
return do_huge_pmd_numa_page(mm, vma, address,
orig_pmd, pmd);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a8b919925934..414f36c6e8f9 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -133,20 +133,6 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmdp)
-{
- pmd_t pmd = pmd_mksplitting(*pmdp);
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- set_pmd_at(vma->vm_mm, address, pmdp, pmd);
- /* tlb flush only to serialize against gup-fast */
- flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#endif
-
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
diff --git a/mm/rmap.c b/mm/rmap.c
index c3b0b397f2c2..cc820bd509e2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -682,8 +682,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
* rmap might return false positives; we must filter
* these out using page_check_address_pmd().
*/
- pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ pmd = page_check_address_pmd(page, mm, address, &ptl);
if (!pmd)
return SWAP_AGAIN;
@@ -693,7 +692,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
return SWAP_FAIL; /* To break the loop */
}
- /* go ahead even if the pmd is pmd_trans_splitting() */
if (pmdp_clear_flush_young_notify(vma, address, pmd))
referenced++;
spin_unlock(ptl);
--
2.0.0.rc4
The patch updates Documentation/vm/transhuge.txt to reflect changes in
THP design.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 84 +++++++++++++++++++-----------------------
1 file changed, 38 insertions(+), 46 deletions(-)
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index df1794a9071f..33465e7b0d9b 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -200,9 +200,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
the allocation.
-thp_split is incremented every time a huge page is split into base
+thp_split_page is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.
+ This action implies splitting all PMDs the page is mapped with.
+
+thp_split_page_failed is incremented if the kernel fails to split a huge
+ page. This can happen if the page was pinned by somebody.
+
+thp_split_pmd is incremented every time a PMD is split into a table of PTEs.
+ This can happen, for instance, when an application calls mprotect() or
+ munmap() on part of a huge page. It doesn't split the huge page, only the
+ page table entry.
thp_zero_page_alloc is incremented every time a huge zero page is
successfully allocated. It includes allocations which where
@@ -280,9 +289,9 @@ unaffected. libhugetlbfs will also work fine as usual.
== Graceful fallback ==
Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
+split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
-by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one liner change, you can avoid to write
hundred if not thousand of lines of complex code to make your code
@@ -291,7 +300,8 @@ hugepage aware.
If you're not walking pagetables but you run into a physical hugepage
but you can't handle it natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
-it tries to swapout the hugepage for example.
+it tries to swapout the hugepage for example. split_huge_page can fail
+if the page is pinned and you must handle this correctly.
Example to make mremap.c transparent hugepage aware with a one liner
change:
@@ -303,14 +313,14 @@ diff --git a/mm/mremap.c b/mm/mremap.c
return NULL;
pmd = pmd_offset(pud, addr);
-+ split_huge_page_pmd(vma, addr, pmd);
++ split_huge_pmd(vma, pmd, addr);
if (pmd_none_or_clear_bad(pmd))
return NULL;
== Locking in hugepage aware code ==
We want as much code as possible hugepage aware, as calling
-split_huge_page() or split_huge_page_pmd() has a cost.
+split_huge_page() or split_huge_pmd() has a cost.
To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
@@ -319,47 +329,29 @@ created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_sem in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fallback in the old code
paths. If instead pmd_trans_huge returns true, you have to take the
-mm->page_table_lock and re-run pmd_trans_huge. Taking the
-page_table_lock will prevent the huge pmd to be converted into a
-regular pmd from under you (split_huge_page can run in parallel to the
+page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
+page table lock will prevent the huge pmd to be converted into a
+regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page_table_lock and fallback to the old code as
-before. Otherwise you should run pmd_trans_splitting on the pmd. In
-case pmd_trans_splitting returns true, it means split_huge_page is
-already in the middle of splitting the page. So if pmd_trans_splitting
-returns true it's enough to drop the page_table_lock and call
-wait_split_huge_page and then fallback the old code paths. You are
-guaranteed by the time wait_split_huge_page returns, the pmd isn't
-huge anymore. If pmd_trans_splitting returns false, you can proceed to
-process the huge pmd and the hugepage natively. Once finished you can
-drop the page_table_lock.
-
-== compound_lock, get_user_pages and put_page ==
+before. Otherwise you can proceed to process the huge pmd and the
+hugepage natively. Once finished you can drop the page_table_lock.
+
+== Refcounts and transparent huge pages ==
+As with other compound page types we do all refcounting for THP on the
+head page, but unlike other compound pages THP supports splitting.
split_huge_page internally has to distribute the refcounts in the head
-page to the tail pages before clearing all PG_head/tail bits from the
-page structures. It can do that easily for refcounts taken by huge pmd
-mappings. But the GUI API as created by hugetlbfs (that returns head
-and tail pages if running get_user_pages on an address backed by any
-hugepage), requires the refcount to be accounted on the tail pages and
-not only in the head pages, if we want to be able to run
-split_huge_page while there are gup pins established on any tail
-page. Failure to be able to run split_huge_page if there's any gup pin
-on any tail page, would mean having to split all hugepages upfront in
-get_user_pages which is unacceptable as too many gup users are
-performance critical and they must work natively on hugepages like
-they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be split so there wouldn't be requirement of
-accounting the pins on the tail pages for hugetlbfs). If we wouldn't
-account the gup refcounts on the tail pages during gup, we won't know
-anymore which tail page is pinned by gup and which is not while we run
-split_huge_page. But we still have to add the gup pin to the head page
-too, to know when we can free the compound page in case it's never
-split during its lifetime. That requires changing not just
-get_page, but put_page as well so that when put_page runs on a tail
-page (and only on a tail page) it will find its respective head page,
-and then it will decrease the head page refcount in addition to the
-tail page refcount. To obtain a head page reliably and to decrease its
-refcount without race conditions, put_page has to serialize against
-__split_huge_page_refcount using a special per-page lock called
-compound_lock.
+page to the tail pages before clearing all PG_head/tail bits from the page
+structures. It can be done easily for refcounts taken by page table
+entries. But we don't have enough information on how to distribute any
+additional pins (e.g. from get_user_pages). split_huge_page fails any
+request to split a pinned huge page: it expects the page count to be equal
+to the sum of the mapcounts of all sub-pages plus one (the split_huge_page
+caller must have a reference on the head page).
+
+split_huge_page uses the per-page compound_lock to protect page->_count
+from being updated by get_page()/put_page() on tail pages.
+
+Note that split_huge_pmd doesn't have any limitation on refcounting: a PMD
+can be split at any point and it never fails.
--
2.0.0.rc4
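For illustration, a minimal sketch of the locking scheme the updated document
describes: check pmd_trans_huge() locklessly, take the page table lock with
pmd_lock(), and re-check before handling the huge pmd natively. The walker
shape and the do_huge_pmd()/do_pte_range() helpers are made up for this
example; only pmd_trans_huge(), pmd_lock() and the fallback behaviour come
from the documentation above.

static void walk_one_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			 unsigned long addr)
{
	spinlock_t *ptl;

	if (pmd_trans_huge(*pmd)) {
		ptl = pmd_lock(vma->vm_mm, pmd);
		if (likely(pmd_trans_huge(*pmd))) {
			/* still huge: handle the whole PMD range natively */
			do_huge_pmd(vma, pmd, addr);
			spin_unlock(ptl);
			return;
		}
		/* split_huge_pmd() ran in parallel: fall back to PTEs */
		spin_unlock(ptl);
	}
	do_pte_range(vma, pmd, addr);	/* regular 4k path */
}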
The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. It reflects the fact that we
can now split a PMD without splitting the compound page and that
split_huge_page() can fail.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/vm_event_item.h | 4 +++-
mm/huge_memory.c | 2 ++
mm/vmstat.c | 4 +++-
3 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ced92345c963..b44dffa769b9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -68,7 +68,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
- THP_SPLIT,
+ THP_SPLIT_PAGE,
+ THP_SPLIT_PAGE_FAILED,
+ THP_SPLIT_PMD,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 752c850f6941..fec89aedcedd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1578,6 +1578,7 @@ unlock:
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
+ count_vm_event(THP_SPLIT_PAGE_FAILED);
return -EBUSY;
}
@@ -2496,6 +2497,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
return;
}
+ count_vm_event(THP_SPLIT_PMD);
if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
spin_unlock(ptl);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b37bd49bfd55..6a155f0476e8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -855,7 +855,9 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
- "thp_split",
+ "thp_split_page",
+ "thp_split_page_failed",
+ "thp_split_pmd",
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
#endif
--
2.0.0.rc4
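As a quick way to observe the new counters, here is a small userspace helper
(not part of the patchset) that prints the thp_split_* lines from
/proc/vmstat; only the counter names come from the patch above.

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* thp_split_page, thp_split_page_failed, thp_split_pmd */
		if (!strncmp(line, "thp_split_", 10))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}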
With the new refcounting we don't need to mark PMDs splitting. Let's
drop the code that handles this.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable.h | 9 ---------
arch/x86/include/asm/pgtable_types.h | 2 --
arch/x86/mm/gup.c | 13 +------------
arch/x86/mm/pgtable.c | 14 --------------
4 files changed, 1 insertion(+), 37 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0ec056012618..1c60bfca6b65 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -158,11 +158,6 @@ static inline int pmd_large(pmd_t pte)
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
- return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
static inline int pmd_trans_huge(pmd_t pmd)
{
return pmd_val(pmd) & _PAGE_PSE;
@@ -799,10 +794,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp);
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmdp);
-
#define __HAVE_ARCH_PMD_WRITE
static inline int pmd_write(pmd_t pmd)
{
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f216963760e5..7d8066d1d9c0 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,7 +22,6 @@
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
#define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
-#define _PAGE_BIT_SPLITTING _PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
#define _PAGE_BIT_IOMAP _PAGE_BIT_SOFTW2 /* flag used to indicate IO mapping */
#define _PAGE_BIT_HIDDEN _PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
@@ -57,7 +56,6 @@
#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
#define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
-#define _PAGE_SPLITTING (_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
#define __HAVE_ARCH_PTE_SPECIAL
#ifdef CONFIG_KMEMCHECK
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 754bca23ec1b..b65b3fc4494a 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -157,18 +157,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = *pmdp;
next = pmd_addr_end(addr, end);
- /*
- * The pmd_trans_splitting() check below explains why
- * pmdp_splitting_flush has to flush the tlb, to stop
- * this gup-fast code from running while we set the
- * splitting bit in the pmd. Returning zero will take
- * the slow path that will call wait_split_huge_page()
- * if the pmd is still in splitting state. gup-fast
- * can't because it has irq disabled and
- * wait_split_huge_page() would never return as the
- * tlb flush IPI wouldn't run.
- */
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd))
return 0;
if (unlikely(pmd_large(pmd))) {
/*
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 6fb6927f9e76..336847f5719e 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -429,20 +429,6 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
return young;
}
-
-void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp)
-{
- int set;
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
- (unsigned long *)pmdp);
- if (set) {
- pmd_update(vma->vm_mm, address, pmdp);
- /* need tlb flush only to serialize against gup-fast */
- flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
- }
-}
#endif
/**
--
2.0.0.rc4
The new split_huge_page() can fail if the compound page is pinned: we
expect only the caller to hold a reference to the head page. If the page
is pinned, split_huge_page() returns -EBUSY and the caller must handle
this correctly.
We don't need to mark PMDs splitting since we can now split one PMD at a
time with split_huge_pmd().
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/hugetlb_inline.h | 9 +-
include/linux/mm.h | 22 +++--
mm/huge_memory.c | 191 ++++++++++++++++++++++++++++++++++++++++-
mm/swap.c | 126 ++++++++++++++++++++++++++-
4 files changed, 329 insertions(+), 19 deletions(-)
diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 4d60c82e9fda..1477dc1b3685 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -11,8 +11,9 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
}
int PageHuge(struct page *page);
+int PageHeadHuge(struct page *page_head);
-#else
+#else /* CONFIG_HUGETLB_PAGE */
static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
{
@@ -24,6 +25,10 @@ static inline int PageHuge(struct page *page)
return 0;
}
-#endif
+static inline int PageHeadHuge(struct page *page_head)
+{
+ return 0;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8885a7102aba..126112d46d85 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -440,20 +440,18 @@ static inline int page_count(struct page *page)
return atomic_read(&compound_head(page)->_count);
}
-#ifdef CONFIG_HUGETLB_PAGE
-extern int PageHeadHuge(struct page *page_head);
-#else /* CONFIG_HUGETLB_PAGE */
-static inline int PageHeadHuge(struct page *page_head)
-{
- return 0;
-}
-#endif /* CONFIG_HUGETLB_PAGE */
-
+void __get_page_tail(struct page *page);
static inline void get_page(struct page *page)
{
- struct page *page_head = compound_head(page);
- VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
- atomic_inc(&page_head->_count);
+ if (unlikely(PageTail(page)))
+ return __get_page_tail(page);
+
+ /*
+ * Getting a normal page or the head of a compound page
+ * requires to already have an elevated page->_count.
+ */
+ VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ atomic_inc(&page->_count);
}
static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fec89aedcedd..89c6f098f91f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1576,10 +1576,197 @@ unlock:
return NULL;
}
+static int __split_huge_page_refcount(struct page *page,
+ struct list_head *list)
+{
+ int i;
+ struct zone *zone = page_zone(page);
+ struct lruvec *lruvec;
+ int tail_count;
+
+ /* prevent PageLRU to go away from under us, and freeze lru stats */
+ spin_lock_irq(&zone->lru_lock);
+ lruvec = mem_cgroup_page_lruvec(page, zone);
+
+ compound_lock(page);
+
+ /*
+ * We cannot split pinned THP page: we expect page count to be equal
+ * to sum of mapcount of all sub-pages plus one (split_huge_page()
+ * caller must take reference for head page).
+ *
+	 * Compound lock only prevents page->_count from being updated by
+	 * get_page() or put_page() on tail page. It means page_count()
+ * can change under us from head page after the check, but it's okay:
+	 * all new references will stay on head page after split.
+ */
+ tail_count = 0;
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ tail_count += page_mapcount(page + i);
+ if (tail_count != page_count(page) - 1) {
+ BUG_ON(tail_count > page_count(page) - 1);
+ compound_unlock(page);
+ spin_unlock_irq(&zone->lru_lock);
+ return -EBUSY;
+ }
+
+ /* complete memcg works before add pages to LRU */
+ mem_cgroup_split_huge_fixup(page);
+
+ tail_count = 0;
+ for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
+ struct page *page_tail = page + i;
+
+ /* tail_page->_mapcount cannot change */
+ BUG_ON(page_mapcount(page_tail) < 0);
+ tail_count += page_mapcount(page_tail);
+ /* check for overflow */
+ BUG_ON(tail_count < 0);
+ BUG_ON(atomic_read(&page_tail->_count) != 0);
+ /*
+ * tail_page->_count is zero and not changing from
+ * under us. But get_page_unless_zero() may be running
+ * from under us on the tail_page. If we used
+ * atomic_set() below instead of atomic_add(), we
+ * would then run atomic_set() concurrently with
+ * get_page_unless_zero(), and atomic_set() is
+ * implemented in C not using locked ops. spin_unlock
+ * on x86 sometime uses locked ops because of PPro
+ * errata 66, 92, so unless somebody can guarantee
+ * atomic_set() here would be safe on all archs (and
+ * not only on x86), it's safer to use atomic_add().
+ */
+ atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);
+
+ /* after clearing PageTail the gup refcount can be released */
+ smp_mb();
+
+ /*
+ * retain hwpoison flag of the poisoned tail page:
+ * fix for the unsuitable process killed on Guest Machine(KVM)
+ * by the memory-failure.
+ */
+ page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
+ page_tail->flags |= (page->flags &
+ ((1L << PG_referenced) |
+ (1L << PG_swapbacked) |
+ (1L << PG_mlocked) |
+ (1L << PG_uptodate) |
+ (1L << PG_active) |
+ (1L << PG_unevictable)));
+ page_tail->flags |= (1L << PG_dirty);
+
+ /* clear PageTail before overwriting first_page */
+ smp_wmb();
+
+ BUG_ON(page_tail->mapping);
+ page_tail->mapping = page->mapping;
+
+ page_tail->index = page->index + i;
+ page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
+
+ BUG_ON(!PageAnon(page_tail));
+ BUG_ON(!PageUptodate(page_tail));
+ BUG_ON(!PageDirty(page_tail));
+ BUG_ON(!PageSwapBacked(page_tail));
+
+ lru_add_page_tail(page, page_tail, lruvec, list);
+ }
+ atomic_sub(tail_count, &page->_count);
+ BUG_ON(atomic_read(&page->_count) <= 0);
+
+ __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+
+ ClearPageCompound(page);
+ compound_unlock(page);
+ spin_unlock_irq(&zone->lru_lock);
+
+ for (i = 1; i < HPAGE_PMD_NR; i++) {
+ struct page *page_tail = page + i;
+ BUG_ON(page_count(page_tail) <= 0);
+ /*
+ * Tail pages may be freed if there wasn't any mapping
+ * like if add_to_swap() is running on a lru page that
+ * had its mapping zapped. And freeing these pages
+ * requires taking the lru_lock so we do the put_page
+ * of the tail pages after the split is complete.
+ */
+ put_page(page_tail);
+ }
+
+ /*
+ * Only the head page (now become a regular page) is required
+ * to be pinned by the caller.
+ */
+ BUG_ON(page_count(page) <= 0);
+ return 0;
+}
+
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
- count_vm_event(THP_SPLIT_PAGE_FAILED);
- return -EBUSY;
+ struct anon_vma *anon_vma;
+ struct anon_vma_chain *avc;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ int i, tail_count;
+ int ret = -EBUSY;
+
+ BUG_ON(is_huge_zero_page(page));
+ BUG_ON(!PageAnon(page));
+
+ /*
+ * The caller does not necessarily hold an mmap_sem that would prevent
+	 * the anon_vma disappearing, so we first take a reference to it
+ * and then lock the anon_vma for write. This is similar to
+ * page_lock_anon_vma_read except the write lock is taken to serialise
+ * against parallel split or collapse operations.
+ */
+ anon_vma = page_get_anon_vma(page);
+ if (!anon_vma)
+ goto out;
+ anon_vma_lock_write(anon_vma);
+
+ if (!PageCompound(page)) {
+ ret = 0;
+ goto out_unlock;
+ }
+
+ BUG_ON(!PageSwapBacked(page));
+
+ /*
+ * Racy check if __split_huge_page_refcount() can be successful, before
+ * splitting PMDs.
+ */
+ tail_count = 0;
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ tail_count += page_mapcount(page + i);
+ if (tail_count != page_count(page) - 1) {
+ BUG_ON(tail_count > page_count(page) - 1);
+		goto out_unlock;
+ }
+
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+ struct vm_area_struct *vma = avc->vma;
+ unsigned long addr = vma_address(page, vma);
+ spinlock_t *ptl;
+ pmd_t *pmd;
+
+ pmd = page_check_address_pmd(page, vma->vm_mm, addr,
+ PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ if (pmd)
+ __split_huge_pmd(vma, pmd, addr);
+ }
+
+ ret = __split_huge_page_refcount(page, list);
+
+out_unlock:
+ anon_vma_unlock_write(anon_vma);
+ put_anon_vma(anon_vma);
+out:
+ if (ret)
+ count_vm_event(THP_SPLIT_PAGE_FAILED);
+ else
+ count_vm_event(THP_SPLIT_PAGE);
+ return ret;
}
#define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
diff --git a/mm/swap.c b/mm/swap.c
index 5faf87c3809b..0201c2704616 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,12 +79,86 @@ static void __put_compound_page(struct page *page)
(*dtor)(page);
}
+static inline bool compound_lock_needed(struct page *page)
+{
+ return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ !PageSlab(page) && !PageHeadHuge(page);
+}
+
static void put_compound_page(struct page *page)
{
- struct page *page_head = compound_head(page);
+ struct page *page_head;
+ unsigned long flags;
+
+ if (likely(!PageTail(page))) {
+ if (put_page_testzero(page)) {
+ /*
+ * By the time all refcounts have been released
+ * split_huge_page cannot run anymore from under us.
+ */
+ if (PageHead(page))
+ __put_compound_page(page);
+ else
+ __put_single_page(page);
+ }
+ return;
+ }
+
+ /* __split_huge_page_refcount can run under us */
+ page_head = compound_head(page);
+
+ if (!compound_lock_needed(page_head)) {
+ /*
+ * If "page" is a THP tail, we must read the tail page flags
+ * after the head page flags. The split_huge_page side enforces
+ * write memory barriers between clearing PageTail and before
+ * the head page can be freed and reallocated.
+ */
+ smp_rmb();
+ if (likely(PageTail(page))) {
+ /* __split_huge_page_refcount cannot race here. */
+ VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+ VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
+ if (put_page_testzero(page_head)) {
+ /*
+ * If this is the tail of a slab compound page,
+ * the tail pin must not be the last reference
+ * held on the page, because the PG_slab cannot
+ * be cleared before all tail pins (which skips
+ * the _mapcount tail refcounting) have been
+ * released. For hugetlbfs the tail pin may be
+ * the last reference on the page instead,
+ * because PageHeadHuge will not go away until
+ * the compound page enters the buddy
+ * allocator.
+ */
+ VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
+ __put_compound_page(page_head);
+ }
+ } else if (put_page_testzero(page))
+ __put_single_page(page);
+ return;
+ }
- if (put_page_testzero(page_head))
- __put_compound_page(page_head);
+ flags = compound_lock_irqsave(page_head);
+ /* here __split_huge_page_refcount won't run anymore */
+ if (likely(page != page_head && PageTail(page))) {
+ bool free;
+
+ free = put_page_testzero(page_head);
+ compound_unlock_irqrestore(page_head, flags);
+ if (free) {
+ if (PageHead(page_head))
+ __put_compound_page(page_head);
+ else
+ __put_single_page(page_head);
+ }
+ } else {
+ compound_unlock_irqrestore(page_head, flags);
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (put_page_testzero(page))
+ __put_single_page(page);
+ }
}
void put_page(struct page *page)
@@ -96,6 +170,52 @@ void put_page(struct page *page)
}
EXPORT_SYMBOL(put_page);
+/*
+ * This function is exported but must not be called by anything other
+ * than get_page(). It implements the slow path of get_page().
+ */
+void __get_page_tail(struct page *page)
+{
+ struct page *page_head = compound_head(page);
+ unsigned long flags;
+
+ if (!compound_lock_needed(page_head)) {
+ smp_rmb();
+ if (likely(PageTail(page))) {
+ /*
+ * This is a hugetlbfs page or a slab page.
+ * __split_huge_page_refcount cannot race here.
+ */
+ VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+ VM_BUG_ON(page_head != page->first_page);
+ VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
+ page);
+ atomic_inc(&page_head->_count);
+ } else {
+ /*
+ * __split_huge_page_refcount run before us, "page" was
+ * a thp tail. the split page_head has been freed and
+ * reallocated as slab or hugetlbfs page of smaller
+ * order (only possible if reallocated as slab on x86).
+ */
+ VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ atomic_inc(&page->_count);
+ }
+ return;
+ }
+
+ flags = compound_lock_irqsave(page_head);
+ /* here __split_huge_page_refcount won't run anymore */
+ if (unlikely(page == page_head || !PageTail(page) ||
+ !get_page_unless_zero(page_head))) {
+ /* page is not part of THP page anymore */
+ VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ atomic_inc(&page->_count);
+ }
+ compound_unlock_irqrestore(page_head, flags);
+}
+EXPORT_SYMBOL(__get_page_tail);
+
/**
* put_pages_list() - release a list of pages
* @pages: list of pages threaded on page->lru
--
2.0.0.rc4
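To make the new success condition concrete, a worked example (the numbers are
illustrative, the rule is the one from the patch above): suppose a THP is
PMD-mapped by two processes and one of them additionally has a PTE mapping of
a sub-page other than the first one. The head's _mapcount is then 2, that
tail's _mapcount is 1, so the sum of mapcounts is 3 and split_huge_page()
succeeds only if page_count() is exactly 3 + 1 = 4, the extra reference being
the one the caller holds on the head page. Any additional pin, for instance
from get_user_pages(), makes page_count() larger than that and the split
fails with -EBUSY.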
FOLL_SPLIT is used in only two places: migration and s390.
Let's replace it with an explicit split and remove FOLL_SPLIT.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 11 -----------
arch/s390/mm/pgtable.c | 17 +++++++++++------
include/linux/mm.h | 1 -
mm/gup.c | 4 ----
mm/migrate.c | 7 ++++++-
5 files changed, 17 insertions(+), 23 deletions(-)
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 6b31cfbe2a9a..df1794a9071f 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -263,17 +263,6 @@ same constrains that applies to hugetlbfs too, so any driver capable
of handling GUP on hugetlbfs will also work fine on transparent
hugepage backed mappings.
-In case you can't handle compound pages if they're returned by
-follow_page, the FOLL_SPLIT bit can be specified as parameter to
-follow_page, so that it will split the hugepages before returning
-them. Migration for example passes FOLL_SPLIT as parameter to
-follow_page because it's not hugepage aware and in fact it can't work
-at all on hugetlbfs (but it instead works fine on transparent
-hugepages thanks to FOLL_SPLIT). migration simply can't deal with
-hugepages being returned (as it's not only checking the pfn of the
-page and pinning it during the copy but it pretends to migrate the
-memory in regular page sizes and with regular pte/pmd mappings).
-
== Optimizing the applications ==
To be guaranteed that the kernel will map a 2M page immediately in any
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 37b8241ec784..a5643b9c0d03 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1248,20 +1248,25 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline void thp_split_vma(struct vm_area_struct *vma)
+static int thp_split_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
- unsigned long addr;
-
- for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE)
- follow_page(vma, addr, FOLL_SPLIT);
+ struct vm_area_struct *vma = walk->vma;
+ split_huge_page_pmd(vma, addr, pmd);
+ return 0;
}
static inline void thp_split_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
+ struct mm_walk thp_split_walk = {
+ .mm = mm,
+ .pmd_entry = thp_split_pmd,
+
+ };
for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
- thp_split_vma(vma);
+ walk_page_vma(vma, &thp_split_walk);
vma->vm_flags &= ~VM_HUGEPAGE;
vma->vm_flags |= VM_NOHUGEPAGE;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5ac1cea7750b..9f4960bf505b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1993,7 +1993,6 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
* and return without waiting upon it */
#define FOLL_MLOCK 0x40 /* mark page as mlocked */
-#define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
diff --git a/mm/gup.c b/mm/gup.c
index cc5a9e7adea7..ac01800abce6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -192,10 +192,6 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
return no_page_table(vma, flags);
if (pmd_trans_huge(*pmd)) {
- if (flags & FOLL_SPLIT) {
- split_huge_page_pmd(vma, address, pmd);
- return follow_page_pte(vma, address, pmd, flags);
- }
ptl = pmd_lock(mm, pmd);
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 63f0cd559999..82c0ba922481 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1243,7 +1243,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
goto set_status;
- page = follow_page(vma, pp->addr, FOLL_GET|FOLL_SPLIT);
+ page = follow_page(vma, pp->addr, FOLL_GET);
err = PTR_ERR(page);
if (IS_ERR(page))
@@ -1253,6 +1253,11 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
if (!page)
goto set_status;
+ if (PageTransHuge(page) && split_huge_page(page)) {
+ err = -EBUSY;
+ goto set_status;
+ }
+
/* Use PageReserved to check for zero page */
if (PageReserved(page))
goto put_and_set;
--
2.0.0.rc4
We are going to decouple splitting a THP PMD from splitting the
underlying compound page.
This patch renames the split_huge_page_pmd*() functions to
split_huge_pmd*() to reflect the fact that splitting a PMD doesn't imply
splitting the compound page.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/powerpc/mm/subpage-prot.c | 2 +-
arch/s390/mm/pgtable.c | 2 +-
arch/x86/kernel/vm86_32.c | 6 +++++-
include/linux/huge_mm.h | 8 ++------
mm/huge_memory.c | 32 +++++++++++---------------------
mm/memory.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
9 files changed, 24 insertions(+), 34 deletions(-)
diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
index fa9fb5b4c66c..d5543514c1df 100644
--- a/arch/powerpc/mm/subpage-prot.c
+++ b/arch/powerpc/mm/subpage-prot.c
@@ -135,7 +135,7 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->vma;
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
return 0;
}
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index a5643b9c0d03..48b972792c81 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1252,7 +1252,7 @@ static int thp_split_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->vma;
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
return 0;
}
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index e8edcf52e069..883160599965 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,11 @@ static void mark_screen_rdonly(struct mm_struct *mm)
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
- split_huge_page_pmd_mm(mm, 0xA0000, pmd);
+
+ if (pmd_trans_huge(*pmd)) {
+ struct vm_area_struct *vma = find_vma(mm, 0xA0000);
+ split_huge_pmd(vma, pmd, 0xA0000);
+ }
if (pmd_none_or_clear_bad(pmd))
goto out;
pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b826239bdce0..e68dfb888e59 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -104,7 +104,7 @@ static inline int split_huge_page(struct page *page)
}
extern void __split_huge_page_pmd(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd);
-#define split_huge_page_pmd(__vma, __address, __pmd) \
+#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
@@ -119,8 +119,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
-extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd);
#if HPAGE_PMD_ORDER >= MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -180,11 +178,9 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define split_huge_page_pmd(__vma, __address, __pmd) \
- do { } while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
-#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
+#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c5ff461e0253..e809ef4519f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1086,13 +1086,13 @@ alloc:
if (unlikely(!new_page)) {
if (!page) {
- split_huge_page_pmd(vma, address, pmd);
+ split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
} else {
ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
pmd, orig_pmd, page, haddr);
if (ret & VM_FAULT_OOM) {
- split_huge_page(page);
+ split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
}
put_page(page);
@@ -1104,10 +1104,10 @@ alloc:
if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL))) {
put_page(new_page);
if (page) {
- split_huge_page(page);
+ split_huge_pmd(vma, pmd, address);
put_page(page);
} else
- split_huge_page_pmd(vma, address, pmd);
+ split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -2833,31 +2833,21 @@ again:
goto again;
}
-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd)
-{
- struct vm_area_struct *vma;
-
- vma = find_vma(mm, address);
- BUG_ON(vma == NULL);
- split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
+static void split_huge_page_address(struct vm_area_struct *vma,
unsigned long address)
{
pmd_t *pmd;
VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
- pmd = mm_find_pmd(mm, address);
- if (!pmd)
+ pmd = mm_find_pmd(vma->vm_mm, address);
+ if (!pmd || !pmd_trans_huge(*pmd))
return;
/*
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- split_huge_page_pmd_mm(mm, address, pmd);
+ __split_huge_page_pmd(vma, address, pmd);
}
void __vma_adjust_trans_huge(struct vm_area_struct *vma,
@@ -2873,7 +2863,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
if (start & ~HPAGE_PMD_MASK &&
(start & HPAGE_PMD_MASK) >= vma->vm_start &&
(start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
- split_huge_page_address(vma->vm_mm, start);
+ split_huge_page_address(vma, start);
/*
* If the new end address isn't hpage aligned and it could
@@ -2883,7 +2873,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
if (end & ~HPAGE_PMD_MASK &&
(end & HPAGE_PMD_MASK) >= vma->vm_start &&
(end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
- split_huge_page_address(vma->vm_mm, end);
+ split_huge_page_address(vma, end);
/*
* If we're also updating the vma->vm_next->vm_start, if the new
@@ -2897,6 +2887,6 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
if (nstart & ~HPAGE_PMD_MASK &&
(nstart & HPAGE_PMD_MASK) >= next->vm_start &&
(nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
- split_huge_page_address(next->vm_mm, nstart);
+ split_huge_page_address(next, nstart);
}
}
diff --git a/mm/memory.c b/mm/memory.c
index 532dde2b7c14..805ff8d76e17 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1239,7 +1239,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
BUG();
}
#endif
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557941f8..775a66a598dc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -162,7 +162,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, prot_numa);
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180e9f21..d2d1047e5dc3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -209,7 +209,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
need_flush = true;
continue;
} else if (!err) {
- split_huge_page_pmd(vma, old_addr, old_pmd);
+ split_huge_pmd(vma, old_pmd, old_addr);
}
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index b2a075ffb96e..0da721a5c6e5 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -83,7 +83,7 @@ again:
if (walk->pte_entry) {
if (walk->vma) {
- split_huge_page_pmd(walk->vma, addr, pmd);
+ split_huge_pmd(walk->vma, pmd, addr);
if (pmd_trans_unstable(pmd))
goto again;
}
--
2.0.0.rc4
Currently PageAnon() always returns false for tail pages. We need to
look at the head page for the correct answer. Let's change the function
to give the right result.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9f4960bf505b..a60e2db5f9f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -983,6 +983,7 @@ struct address_space *page_file_mapping(struct page *page)
static inline int PageAnon(struct page *page)
{
+ page = compound_head(page);
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
}
--
2.0.0.rc4
On 06/09/2014 06:04 PM, Kirill A. Shutemov wrote:
> Hello everybody,
>
> We've discussed few times that is would be nice to allow huge pages to be
> mapped with 4k pages too. Here's my first attempt to actually implement
> this. It's early prototype and not stabilized yet, but I want to share it
> to discuss any potential show stoppers early.
>
> The main reason why we can't map THP with 4k is how refcounting on THP
> designed. It built around two requirements:
>
> - split of huge page should never fail;
> - we can't change interface of get_user_page();
>
> To be able to split huge page at any point we have to track which tail
> page was pinned. It leads to tricky and expensive get_page() on tail pages
> and also occupy tail_page->_mapcount.
>
> Most split_huge_page*() users want PMD to be split into table of PTEs and
> don't care whether compound page is going to be split or not.
>
> The plan is:
>
> - allow split_huge_page() to fail if the page is pinned. It's trivial to
> split non-pinned page and it doesn't require tail page refcounting, so
> tail_page->_mapcount is free to be reused.
>
> - introduce new routine -- split_huge_pmd() -- to split PMD into table of
> PTEs. It splits only one PMD, not touching other PMDs the page is
> mapped with or underlying compound page. Unlike new split_huge_page(),
> split_huge_pmd() never fails.
>
> Fortunately, we have only few places where split_huge_page() is needed:
> swap out, memory failure, migration, KSM. And all of them can handle
> split_huge_page() fail.
>
> In new scheme we use tail_page->_mapcount is used to account how many time
> the tail page is mapped. head_page->_mapcount is used for both PMD mapping
> of whole huge page and PTE mapping of the firt 4k page of the compound
> page. It seems work fine, except the fact that we don't have a cheap way
> to check whether the page mapped with PMDs or not.
>
> Introducing split_huge_pmd() effectively allows THP to be mapped with 4k.
> It can break some kernel expectations. I.e. VMA now can start and end in
> middle of compound page. IIUC, it will break compactation and probably
> something else (any hints?).
I don't think compaction cares at all about VMA's. Unless the underlying
page migration does. What will break is munlock due to
VM_BUG_ON(PageTail(page)) in the PageTransHuge() check.
> Also munmap() on part of huge page will not split and free unmapped part
> immediately. We need to be careful here to keep memory footprint under
> control.
So who will take care of it, if it's not done immediately?
> As side effect we don't need to mark PMD splitting since we have
> split_huge_pmd(). get_page()/put_page() on tail of THP is cheaper (and
> cleaner) now.
But per patch 2, PageAnon() is more expensive. Also there are no side
effects to this change?
> I will continue with stabilizing this. The patchset also available on
> git[1].
>
> Any commemnt?
>
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v1
>
On Tue, Jun 10, 2014 at 10:10:56AM +0200, Vlastimil Babka wrote:
> On 06/09/2014 06:04 PM, Kirill A. Shutemov wrote:
> >Hello everybody,
> >
> >We've discussed few times that is would be nice to allow huge pages to be
> >mapped with 4k pages too. Here's my first attempt to actually implement
> >this. It's early prototype and not stabilized yet, but I want to share it
> >to discuss any potential show stoppers early.
> >
> >The main reason why we can't map THP with 4k is how refcounting on THP
> >designed. It built around two requirements:
> >
> > - split of huge page should never fail;
> > - we can't change interface of get_user_page();
> >
> >To be able to split huge page at any point we have to track which tail
> >page was pinned. It leads to tricky and expensive get_page() on tail pages
> >and also occupy tail_page->_mapcount.
> >
> >Most split_huge_page*() users want PMD to be split into table of PTEs and
> >don't care whether compound page is going to be split or not.
> >
> >The plan is:
> >
> > - allow split_huge_page() to fail if the page is pinned. It's trivial to
> > split non-pinned page and it doesn't require tail page refcounting, so
> > tail_page->_mapcount is free to be reused.
> >
> > - introduce new routine -- split_huge_pmd() -- to split PMD into table of
> > PTEs. It splits only one PMD, not touching other PMDs the page is
> > mapped with or underlying compound page. Unlike new split_huge_page(),
> > split_huge_pmd() never fails.
> >
> >Fortunately, we have only few places where split_huge_page() is needed:
> >swap out, memory failure, migration, KSM. And all of them can handle
> >split_huge_page() fail.
> >
> >In new scheme we use tail_page->_mapcount is used to account how many time
> >the tail page is mapped. head_page->_mapcount is used for both PMD mapping
> >of whole huge page and PTE mapping of the firt 4k page of the compound
> >page. It seems work fine, except the fact that we don't have a cheap way
> >to check whether the page mapped with PMDs or not.
> >
> >Introducing split_huge_pmd() effectively allows THP to be mapped with 4k.
> >It can break some kernel expectations. I.e. VMA now can start and end in
> >middle of compound page. IIUC, it will break compactation and probably
> >something else (any hints?).
>
> I don't think compaction cares at all about VMA's. Unless the underlying
> page migration does. What will break is munlock due to
> VM_BUG_ON(PageTail(page)) in the PageTransHuge() check.
We have PageTransCompound() if the caller doesn't care which part of the
THP the page is.
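For illustration only, the kind of check meant here (the helper name is made
up; PageTransHuge() contains a VM_BUG_ON(PageTail(page)), while
PageTransCompound() is true for both head and tail pages of a compound page):

static inline bool page_is_part_of_thp_mapping(struct page *page)
{
	/* safe on tail pages: no assertion that this is a head page */
	return PageTransCompound(page);
}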
> >Also munmap() on part of huge page will not split and free unmapped part
> >immediately. We need to be careful here to keep memory footprint under
> >control.
>
> So who will take care of it, if it's not done immediately?
I mean the whole compound page will not be freed until the last part page
is unmapped. It can lead to excessive memory overhead for some workloads.
We can try to be smarter and call split_huge_page() instead of
split_huge_pmd() if we see that the huge page is not mapped as 2M or
something. But we don't have a cheap way to check this...
> >As side effect we don't need to mark PMD splitting since we have
> >split_huge_pmd(). get_page()/put_page() on tail of THP is cheaper (and
> >cleaner) now.
>
> But per patch 2, PageAnon() is more expensive.
I don't think it's significant: for a non-compound page it's probably
near-free (page->flags is most likely hot). For a compound page it costs an
additional cacheline. Not a big deal from my POV.
For get_page()/put_page() on a tail of THP we saved one atomic operation and
a few checks. This is important because refcounting on tail pages is going to
be more common, since they can be mapped individually now. Actually, I'm
not sure if these operations are cheap enough: we still use compound_lock
there to serialize against splitting.
> Also there are no side effects to this change?
Of course there are :) That's only what came to mind. The patchset is very
early, I don't have the whole picture yet. I expect more PageCompound() and
compound_head() will be needed ;)
--
Kirill A. Shutemov
Hello,
On Tue, Jun 10, 2014 at 04:52:46PM +0300, Kirill A. Shutemov wrote:
> I mean the whole compound page will not be freed until the last part page
> is unmapped. It can lead to excessive memory overhead for some workloads.
That is why a refcounting design like this wouldn't have been feasible
so far, as it's not entirely "transparent" anymore and we couldn't
risk breaking apps... I mean the worst case this could lead to the
anonymous real RSS of the app to be 512 times bigger than virtual size
allocated by the app in vmas. Sounds very unlikely but still not safe
to deploy such a thing on random apps, without quite some testing of a
variety of apps.
If I understand correctly, the memory footprint problem doesn't exist
with swapping because swapping calls split_huge_page_to_list() and
that works as long as there are no gup() pins? (-EBUSY is returned if
there's a pin on any head or tail page)
So if there are transient gup() pins swapping will try again to
split_huge_page later (non transient gup pins should use mmu notifiers
and not hold any pin in the first place).
Even for swapping it increases the "pinned" region by a worst case of
512 times, but if there's lots of direct-io in flight the virtual
addresses pinned are usually contiguous and if there's a THP the
physical side is also contiguous for 512 4k pages so it probably
doesn't reduce the ability to swap in any significant way even if
there's direct-io in flight.
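A minimal sketch of the pattern described here, assuming a reclaim-style
caller (the function name and surrounding flow are made up; only
split_huge_page_to_list() and its -EBUSY behaviour come from the patchset):

static int prepare_thp_for_swap(struct page *page, struct list_head *list)
{
	if (PageTransHuge(page) && split_huge_page_to_list(page, list)) {
		/* pinned (e.g. gup): skip this page, retry on a later pass */
		return -EBUSY;
	}
	/* the page is now a regular 4k page; continue with the swap path */
	return 0;
}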
> We can try to be smarter and call split_huge_page() instead of
> split_huge_pmd() if we see that the huge page is not mapped as 2M or
> something. But we don't have a cheap way to check this...
I wonder, why don't you still do the split_huge_page immediately
inside split_huge_pmd, it would fail with -EBUSY if it's pinned.
If split_huge_page fails because of transient gup pins, then you can
defer the split_huge_page to khugepaged if it notices we're wasting
memory during its scan, clearly it shall be speculative without
freezing the refcounts with compound_lock, and only call the
split_huge_page if it then notices it can free memory.
> be more common, since they can be mapped individually now. Actually, I'm
> not sure if these operations are cheap enough: we still use compound_lock
> there to serialize against splitting.
This is the main cons in my view, simplifying the get_page/put_page
refcounting would be nice, but you're still taking the tail pins on
the tail pages and so in my view it doesn't move the needle in terms
of get_page/put_page, it's a bit faster but it still has all tail page
pins accounting and it's not just a head page accounting like it was
before THP was introduced and it needed to deal with gup on tail pages
under splitting.
This patch allows split_huge_page to fail (currently it cannot
fail), split_huge_pmd still cannot fail, so it'd be nice if we could
remove all tail pins too if split_huge_page could fail.
Can't you just account all tail pins in the head like it was before
with only hugetlbfs and return -EBUSY if the head_page->count doesn't
match mapcount or something like that? What exactly the tail pins do
in this model other than to allow you to return -EBUSY?
The major reason we have to do the special tail pin refcounting with
gup is that split_huge_page cannot fail now, so at any given time we
must know which tail page was pinned, if we can fail split_huge_page
there's no point to know anymore which exact tail page holds the gup
pin, and we should only be able to say "yes we can" or "no we cannot
split_huge_page" and just for that the tailpage refcounting doesn't
look so critical to keep.
There would still be the risk of wasting memory with gup pins vs
munmap (I don't see a way to fix it if split_huge_page can fail with
-EBUSY) but khugepaged can fixup that later and deal with the transient
gup pins.
Thanks,
Andrea
On Tue, Jun 10, 2014 at 04:29:09PM +0200, Andrea Arcangeli wrote:
> Hello,
>
> On Tue, Jun 10, 2014 at 04:52:46PM +0300, Kirill A. Shutemov wrote:
> > I mean the whole compound page will not be freed until the last part page
> > is unmapped. It can lead to excessive memory overhead for some workloads.
>
> That is why a refcounting design like this wouldn't have been feasible
> so far, as it's not entirely "transparent" anymore and we couldn't
> risk breaking apps... I mean the worst case this could lead to the
> anonymous real RSS of the app to be 512 times bigger than virtual size
> allocated by the app in vmas. Sounds very unlikely but still not safe
> to deploy such a thing on random apps, without quite some testing of a
> variety of apps.
Agreed. It need to be handled one way or another before moving forward.
> If I understand correctly, the memory footprint problem doesn't exist
> with swapping because swapping calls split_huge_page_to_list() and
> that works as long as there are no gup() pins? (-EBUSY is returned if
> there's a pin on any head or tail page)
Correct.
> So if there are transient gup() pins swapping will try again to
> split_huge_page later (non transient gup pins should use mmu notifiers
> and not hold any pin in the first place).
>
> Even for swapping it increases the "pinned" region by a worst case of
> 512 times, but if there's lots of direct-io in flight the virtual
> addresses pinned are usually contiguous and if there's a THP the
> physical side is also contiguous for 512 4k pages so it probably
> doesn't reduce the ability to swap in any significant way even if
> there's direct-io in flight.
>
> > We can try to be smarter and call split_huge_page() instead of
> > split_huge_pmd() if we see that the huge page is not mapped as 2M or
> > something. But we don't have a cheap way to check this...
>
> I wonder, why don't you still do the split_huge_page immediately
> inside split_huge_pmd, it would fail with -EBUSY if it's pinned.
I want to keep split_huge_pmd() process-local: for shared pages it's not
needed to split the page in all processes if one of them calls mprotect(),
munmap(), etc.
We probably could split the page if we can see that it is mapped only once
with a PMD, which is the most common case.
> If split_huge_page fails because of transient gup pins, then you can
> defer the split_huge_page to khugepaged if it notices we're wasting
> memory during its scan, clearly it shall be speculative without
> freezing the refcounts with compound_lock, and only call the
> split_huge_page if it then notices it can free memory.
>
> > be more common, since they can be mapped individually now. Actually, I'm
> > not sure if these operations are cheap enough: we still use compound_lock
> > there to serialize against splitting.
>
> This is the main cons in my view, simplifying the get_page/put_page
> refcounting would be nice, but you're still taking the tail pins on
> the tail pages and so in my view it doesn't move the needle in terms
> of get_page/put_page, it's a bit faster but it still has all tail page
> pins accounting and it's not just a head page accounting like it was
> before THP was introduced and it needed to deal with gup on tail pages
> under splitting.
>
> This patch allows split_huge_page to fail (currently it cannot
> fail), split_huge_pmd still cannot fail, so it'd be nice if we could
> remove all tail pins too if split_huge_page could fail.
>
> Can't you just account all tail pins in the head like it was before
> with only hugetlbfs and return -EBUSY if the head_page->count doesn't
> match mapcount or something like that? What exactly the tail pins do
> in this model other than to allow you to return -EBUSY?
That's exactly what I do. The compound_lock is needed to protect
head_page->_count from being updated by get_page()/put_page() on a tail
page while we're splitting. Otherwise we will not be able to distribute
pins correctly.
I had a silly idea: can we use the most significant bit of head_page->_count
as the bit for the compound lock? This would allow (I think) using one
atomic_cmpxchg() to update the counter from get_page()/put_page() with
respect to the compound lock instead of lock;update;unlock. Something like
lockref, but for a bit spinlock.
Does it sound too broken?
Anyway, I first want to check how high get_page()/put_page() shows up in
the profile.
--
Kirill A. Shutemov
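A very rough sketch of the idea above, purely hypothetical and not part of
the patchset: reserve one bit of head_page->_count as the compound-lock bit
and let the tail-page get_page() path take a reference with a single
atomic_cmpxchg() as long as that bit is clear, falling back to the locked
slow path otherwise.

#define COMPOUND_LOCK_BIT	(1 << 30)	/* hypothetical placement */

static bool get_head_ref_fast(struct page *head)
{
	int old = atomic_read(&head->_count);

	while (old > 0 && !(old & COMPOUND_LOCK_BIT)) {
		int seen = atomic_cmpxchg(&head->_count, old, old + 1);
		if (seen == old)
			return true;	/* got the reference locklessly */
		old = seen;
	}
	return false;	/* lock bit set (or count zero): take the slow path */
}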
On Mon, 9 Jun 2014, Kirill A. Shutemov wrote:
> To be able to split huge page at any point we have to track which tail
> page was pinned. It leads to tricky and expensive get_page() on tail pages
> and also occupy tail_page->_mapcount.
Maybe we should give up the requirement to be able to split a huge page at
any point? This got us into the mess AFAICT. Instead we could use the
locking mechanisms that we have to stop all access to the page and then do
the conversion? Page migration can do that so it should be fine with
refcounting for huge pages exclusively in the head page exactly like a
regular page.
The problem is then dealing with the locations where we now do rely on
the ability to split at "any point" (notion is weird in itself and
suggests issues with synchronization). Use the standard locking schemes
for pages instead?
I thought the idea was that we would modify the relevant code and
that at some point this requirement could go away?
Huge pages (and other larger order pages) will become increasingly
difficult to handle if relevant page state has to be maintained in tail
pages and if it differs significantly from regular pages.
On Tue, Jun 10, 2014 at 03:25:42PM -0500, Christoph Lameter wrote:
> On Mon, 9 Jun 2014, Kirill A. Shutemov wrote:
>
> > To be able to split huge page at any point we have to track which tail
> > page was pinned. It leads to tricky and expensive get_page() on tail pages
> > and also occupy tail_page->_mapcount.
>
> Maybe we should give up the requirement to be able to split a huge page at
> any point?
Yes, that's what the patchset does: we don't allow splitting the page if
any sub-page is pinned.
> This got us into the mess AFAICT. Instead we could use the locking
> mechanisms that we have to stop all access to the page and then do the
> conversion?
I ended up with compound_lock to freeze the page count. Not sure if it's
the best option we have.
> Page migration can do that so it should be fine with refcounting for
> huge pages exclusively in the head page exactly like a regular page.
We've discussed "split via migration" with Dave. I need to look more at
how migration works.
> The problem is then dealing with the locations where we now do rely on
> the ability to split at "any point" (notion is weird in itself and
> suggests issues with synchronization).
As I said, we have only 4 places where we need to split the page (not only
PMD): swap out, memory failure, KSM, migration. All of them can tolerate
split failure.
> Use the standard locking schemes for pages instead?
Could you elaborate here?
> I thought the idea was that we would modify the relevant code and
> that at some point this requirement could go away?
>
> Huge pages (and other larger order pages) will become increasingly
> difficult to handle if relevant page state has to be maintained in tail
> pages and if it differs significantly from regular pages.
Agreed. The patchset drops tail page refcounting.
--
Kirill A. Shutemov
On Tue, 10 Jun 2014, Kirill A. Shutemov wrote:
> Could you elaborate here?
The page migration scheme works by locking and also putting in a fake pte
to ensure that any accesses cause a page fault which will then block.
In the THP case we would need a fake pmd.
That effectively forces all accesses to the page to stop. Then
you do the page migration (and you could do the splitting etc.) and then
replace the fake pmd/pte with real ones.
See the page migration code.
> Agreed. The patchset drops tail page refcounting.
Great. A step in the right direction.
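A rough illustration of the mechanism Christoph refers to (not code from this
thread; it assumes the existing migration-entry helpers from
include/linux/swapops.h and that the caller holds the pte lock): the real pte
is replaced by a migration entry, so any fault on the address blocks in
migration_entry_wait() until a real pte is put back.

static void freeze_one_pte(struct vm_area_struct *vma, unsigned long addr,
			   pte_t *ptep, struct page *page)
{
	pte_t old = ptep_clear_flush(vma, addr, ptep);
	swp_entry_t entry = make_migration_entry(page, pte_write(old));

	/* faults on addr now block in migration_entry_wait() */
	set_pte_at(vma->vm_mm, addr, ptep, swp_entry_to_pte(entry));
}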
On Tue, Jun 10, 2014 at 03:25:42PM -0500, Christoph Lameter wrote:
> I thought the idea was that we would modify the relevant code and
> that at some point this requirement could go away?
There were places that weren't aware and split unnecessarily, to
avoid having to make all places aware immediately and keep the initial
patchset small; all the ones in relevant fast paths are gone by now,
but the requirement doesn't go away if munmap partially unmaps a page.
If munmap or mremap splits the THP in the middle, the pmd has to be
split reliably and it cannot fail or the syscall cannot return...
And Kirill's patchset still provides a reliable split of the pmd of
course. It only relaxes the actual page struct split, but without
actually removing the tail page refcounting.
There are clear downsides in adding a failure -EBUSY case to
split_huge_page related to potentially increased memory usage that from
the user perspective will look like a memory leak (like real anon RSS
exceeding the virtual size up to 512 times in the worst case, at least
until khugepaged can fix it up and release RAM with an async
split_huge_page), but the current get_page/put_page improvement
doesn't look significant enough.
This is why I think we should check if we can go the extra mile and
get rid of the tail page refcounting as a whole if possible, if that
is achieved this failure case added to split_huge_page will look a
better tradeoff than it looks now. Currently I'm not impressed by the
simplification of get_page/put_page considering the downsides this
brings to memory utilization and potentially having to defer the page
split to khugepaged.
> Huge pages (and other larger order pages) will become increasingly
> difficult to handle if relevant page state has to be maintained in tail
> pages and if it differs significantly from regular pages.
Over the last couple of years there was no increase in difficulty
though; the only relevant change that happened was to move the tail
page refcounting from ->count to ->mapcount (both otherwise unused on
tail pages) because ->count could confuse the speculative pagecache
lookups on tail pages, but that was a straightforward change, the
difficulty stayed the same no matter if the tail pin was in count or
mapcount.
While I don't see an actual increase in difficulty anywhere in this
area, simplification and performance improvement is always welcome :).
Last but not least, while I don't see a showstopper for non-weird,
non-malicious apps, we should take into consideration the malicious case
too and the trouble that this would cause to containers (or rlimits)
if apps can lock in 512 times more physical RAM than they're supposed
to, if this allows bypassing all kernel accounting so easily. Then
again it depends on whether people think containers should be usable to
protect against untrusted apps too or not (I don't, I prefer docker
on top of KVM especially on public clouds, but others do).
On Tue, Jun 10, 2014 at 11:46:40PM +0300, Kirill A. Shutemov wrote:
> Agreed. The patchset drops tail page refcounting.
Very possibly I misread something or a later patch fixes this up (I only
did a basic code review), but from the new code of split_huge_page it
looks like it returns -EBUSY after checking the individual tail page
refcounts, so it's not clear how that qualifies as "dropped".
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ tail_count += page_mapcount(page + i);
+ if (tail_count != page_count(page) - 1) {
+ BUG_ON(tail_count > page_count(page) - 1);
+ compound_unlock(page);
+ spin_unlock_irq(&zone->lru_lock);
+ return -EBUSY;
On Wed, Jun 11, 2014 at 12:04:51AM +0200, Andrea Arcangeli wrote:
> On Tue, Jun 10, 2014 at 11:46:40PM +0300, Kirill A. Shutemov wrote:
> > Agreed. The patchset drops tail page refcounting.
>
> Very possibly I misread something or a later patch fixes this up (I only
> did a basic code review), but from the new code of split_huge_page it
> looks like it returns -EBUSY after checking the individual tail page
> refcounts, so it's not clear how that qualifies as "dropped".
page_mapcount() here is really mapcount: how many times the page is
mapped, not pins on tail pages as we have it now.
>
> + for (i = 0; i < HPAGE_PMD_NR; i++)
> + tail_count += page_mapcount(page + i);
> + if (tail_count != page_count(page) - 1) {
> + BUG_ON(tail_count > page_count(page) - 1);
> + compound_unlock(page);
> + spin_unlock_irq(&zone->lru_lock);
> + return -EBUSY;
--
Kirill A. Shutemov
On Wed, Jun 11, 2014 at 01:14:31AM +0300, Kirill A. Shutemov wrote:
> On Wed, Jun 11, 2014 at 12:04:51AM +0200, Andrea Arcangeli wrote:
> > On Tue, Jun 10, 2014 at 11:46:40PM +0300, Kirill A. Shutemov wrote:
> > > Agreed. The patchset drops tail page refcounting.
> >
> > Very possibly I misread something or a later patch fixes this up (I only
> > did a basic code review), but from the new code of split_huge_page it
> > looks like it returns -EBUSY after checking the individual tail page
> > refcounts, so it's not clear how that qualifies as "dropped".
>
> page_mapcount() here is really mapcount: how many times the page is
> mapped, not pins on tail pages as we have it now.
Ok, then I'd suggest renaming the variable from tail_count to
tail_mapcount to make it more self-explanatory... of course it is then
compared to the head page count, which means the tail pins have to be in
the head already, but calling it tail_mapcount would be clearer for anyone
used to the current semantics of mapcount on tail pages. I was confused
myself about what the benefits were... if it didn't drop the tail page
refcounting.
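Something like the below, which is the same hunk as quoted below with only
the name changed and the intent spelled out in comments (my reading is
that the one extra reference is the caller's own pin; Kirill can correct
me):

        int i, tail_mapcount = 0;

        /* Sum how many times each 4k subpage is mapped. */
        for (i = 0; i < HPAGE_PMD_NR; i++)
                tail_mapcount += page_mapcount(page + i);
        /*
         * All pins live in the head page's ->count; the caller of
         * split_huge_page() is expected to hold exactly one reference
         * beyond the mappings.  Any extra pin (gup etc.) makes the
         * counts mismatch and the split refuses to proceed.
         */
        if (tail_mapcount != page_count(page) - 1) {
                BUG_ON(tail_mapcount > page_count(page) - 1);
                compound_unlock(page);
                spin_unlock_irq(&zone->lru_lock);
                return -EBUSY;
        }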
The other suggestions, i.e. doing split_huge_page inside split_huge_pmd
(not required to succeed) and fixing it up later in khugepaged so the
memory leak is not permanent, plus the accounting issue this creates with
malicious apps, sound like the two things left to address to make this
design change an interesting tradeoff.
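In hand-waved pseudo-kernel-C the first one could look something like
this; the signature and the __split_huge_pmd_locked() /
khugepaged_defer_split() helpers are made up, and the mmap_sem/ptl details
are ignored:

void split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
                    unsigned long address)
{
        struct page *page = pmd_page(*pmd);

        /* Try the full compound-page split first... */
        if (!split_huge_page(page))
                return; /* assuming it also converts the pmd mapping */

        /* ...page is pinned: split only this pmd into a pte table. */
        __split_huge_pmd_locked(vma, pmd, address);

        /*
         * Ask khugepaged to retry split_huge_page() later, so the
         * unmapped but unfreed subpages don't hang around forever.
         */
        khugepaged_defer_split(page);
}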
> >
> > + for (i = 0; i < HPAGE_PMD_NR; i++)
> > + tail_count += page_mapcount(page + i);
> > + if (tail_count != page_count(page) - 1) {
> > + BUG_ON(tail_count > page_count(page) - 1);
> > + compound_unlock(page);
> > + spin_unlock_irq(&zone->lru_lock);
> > + return -EBUSY;