2015-02-12 16:18:57

by Kirill A. Shutemov

Subject: [PATCHv3 00/24] THP refcounting redesign

Hello everybody,

Here's the first non-RFC version of the patchset on THP refcounting
redesign. I've cleaned up my todo list and I consider it feature-complete.

The goal of the patchset is to make refcounting on THP pages cheaper, with
simpler semantics, and to allow the same THP compound page to be mapped with
both PMD and PTEs. This is required to get a reasonable THP-pagecache
implementation.

With the new refcounting design it's much easier to protect against
split_huge_page(): holding a simple reference on the page is enough. It makes
the gup_fast() implementation simpler and removes the special case in the
futex code for handling tail THP pages.

It should improve THP utilization across the system, since splitting a THP in
one process doesn't necessarily lead to splitting the page in all other
processes that have the page mapped.

= Design overview =

The main reason why we can't map a THP with 4k pages is how refcounting on THP
is designed. It is built around two requirements:

- split of a huge page should never fail;
- we can't change the interface of get_user_page();

To be able to split a huge page at any point, we have to track which tail
page was pinned. This leads to tricky and expensive get_page() on tail pages
and also occupies tail_page->_mapcount.

Most split_huge_page*() users want the PMD to be split into a table of PTEs
and don't care whether the compound page itself is split or not.

The plan is:

- allow split_huge_page() to fail if the page is pinned. It's trivial to
split a non-pinned page and it doesn't require tail page refcounting, so
tail_page->_mapcount is free to be reused.

- introduce a new routine -- split_huge_pmd() -- to split a PMD into a table
of PTEs. It splits only one PMD, without touching other PMDs the page is
mapped with or the underlying compound page. Unlike the new
split_huge_page(), split_huge_pmd() never fails.

Fortunately, there are only a few places where split_huge_page() is needed:
swap out, memory failure, migration, KSM. And all of them can handle
split_huge_page() failure.
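
For illustration, the expected caller pattern under the new scheme -- it
mirrors what patch 8 does for FOLL_SPLIT; the surrounding error handling here
is only a sketch -- looks roughly like this:

	int ret;

	get_page(page);
	lock_page(page);	/* the new split_huge_page() wants the page locked */
	ret = split_huge_page(page);	/* fails (e.g. -EBUSY) if the page has extra pins */
	unlock_page(page);
	put_page(page);
	if (ret) {
		/* the page stays huge: skip it and retry later, or fall back */
	}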

In the new scheme, page->_mapcount is used to account how many times the page
is mapped with PTEs. We have a separate compound_mapcount() to count mappings
with a PMD. page_mapcount() returns the sum of PTE and PMD mappings of the
page.
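
In code terms, patch 6 implements this roughly as follows (simplified, debug
checks dropped):

	/* compound_mapcount lives in the first tail page */
	static inline atomic_t *compound_mapcount_ptr(struct page *page)
	{
		return &page[1].compound_mapcount;
	}

	static inline int compound_mapcount(struct page *page)
	{
		if (!PageCompound(page))
			return 0;
		return atomic_read(compound_mapcount_ptr(compound_head(page))) + 1;
	}

	/* PTE mappings of this subpage plus PMD mappings of the compound page */
	static inline int page_mapcount(struct page *page)
	{
		return atomic_read(&page->_mapcount) + compound_mapcount(page) + 1;
	}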

Introducing split_huge_pmd() effectively allows THP to be mapped with 4k
pages. It may be a surprise to some code to see a PTE which points to a tail
page, or a VMA start/end in the middle of a compound page.

munmap() of part of a THP will split the PMD, but doesn't split the huge page
itself. In order to keep memory consumption under control we put the partially
unmapped huge page on a per-zone list, which is drained on the first
shrink_zone() call. This way we also avoid an unnecessary split_huge_page() on
exit(2) if a THP belongs to more than one VMA.
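
The helper behind this is deferred_split_huge_page() (patch 21). A minimal
sketch of the idea -- the list, lock and helper names below are illustrative,
not necessarily what the patch uses:

	/* called when the last PMD mapping of a still PTE-mapped THP goes away */
	void deferred_split_huge_page(struct page *page)
	{
		struct zone *zone = page_zone(page);
		unsigned long flags;

		spin_lock_irqsave(&zone->split_queue_lock, flags);
		/* page_deferred_list() is assumed to return a list_head in a tail page */
		if (list_empty(page_deferred_list(page)))
			list_add_tail(page_deferred_list(page), &zone->split_queue);
		spin_unlock_irqrestore(&zone->split_queue_lock, flags);
	}

	/* shrink_zone() later drains zone->split_queue, calling split_huge_page()
	 * on each entry; pages that are still pinned simply stay huge. */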

= Patches overview =

Patch 1:
Move split_huge_page code around. Preparation for future changes.

Patches 2-3:
Make PageAnon() and PG_locked-related helpers look at the head
page if a tail page is passed. This is required since pte_page()
can now point to a tail page. It's likely that we need to change
other pageflags-related helpers too, but I haven't stepped on any
others yet.

Patch 4:
With PTE-mapped THP, rmap cannot rely on a PageTransHuge() check
to decide whether to map a small page or a THP. We need to get
the info from the caller.

Patch 5:
We need to look at all subpages of a compound page to calculate
correct PSS, because they can have different mapcounts.

Patch 6:
Store the mapcount for compound pages separately: in the first
tail page's ->mapping.

Patch 7:
Adjust conditions when we can re-use the page on write-protection
fault.

Patch 8:
FOLL_SPLIT should be handled at the PTE level too.

Patch 9:
Split all pages in mlocked VMAs. We will need to look at this
again later.

Patch 10:
Make khugepaged aware of PTE-mapped huge pages.

Patch 11:
Rename split_huge_page_pmd() to split_huge_pmd() to reflect that
the page is not going to be split, only the PMD.

Patch 12:
Temporarily make split_huge_page() return -EBUSY on all split
requests. This allows us to drop tail-page refcounting and change
the implementation of split_huge_pmd() to split the PMD into a
table of PTEs without splitting the compound page.

Patch 13:
New THP_SPLIT_* vmstats.

Patch 14:
Implement the new split_huge_page(), which fails if the page is
pinned. For now, we rely on compound_lock() to make page counts
stable.

Patches 15-16:
Drop the infrastructure for handling PMD splitting. We don't use
it anymore in split_huge_page(). For now we only remove it from
generic code and x86. I'll clean up other architectures later.

Patch 17:
Remove the ugly special case for a futex that happens to be in a
tail THP page. With the new refcounting it's much easier to
protect against split.

Patches 18-20:
Replace compound_lock with migration entries as the mechanism to
freeze page counts in split_huge_page(). We don't need
compound_lock anymore. This makes get_page()/put_page() on tail
pages faster.

Patch 21:
Handle partial unmap of a THP. We put the partially unmapped huge
page on a per-zone list, which is drained on the first
shrink_zone() call. This way we also avoid an unnecessary
split_huge_page() on exit(2) if a THP belongs to more than one
VMA.

Patch 22:
Make memcg aware of the new refcounting. Validation needed.

Patch 23:
Fix split_huge_page() never succeeding inside the KSM machinery.

Patch 24:
Documentation update.

I have focused on stability so far and don't have performance numbers yet.
I've run the mm tests from LTP: all pass. Trinity doesn't crash the patched
kernel for me.

I would appreciate any feedback on the patchset, in the form of code review
or testing.

The patchset is also available in git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v3

Comments?

Kirill A. Shutemov (24):
thp: cluster split_huge_page* code together
mm: change PageAnon() and page_anon_vma() to work on tail pages
mm: avoid PG_locked on tail pages
rmap: add argument to charge compound page
mm, proc: adjust PSS calculation
mm: store mapcount for compound page separately
mm, thp: adjust conditions when we can reuse the page on WP fault
mm: adjust FOLL_SPLIT for new refcounting
thp, mlock: do not allow huge pages in mlocked area
khugepaged: ignore pmd tables with THP mapped with ptes
thp: rename split_huge_page_pmd() to split_huge_pmd()
thp: PMD splitting without splitting compound page
mm, vmstats: new THP splitting event
thp: implement new split_huge_page()
mm, thp: remove infrastructure for handling splitting PMDs
x86, thp: remove infrastructure for handling splitting PMDs
futex, thp: remove special case for THP in get_futex_key
thp, mm: split_huge_page(): caller need to lock page
thp, mm: use migration entries to freeze page counts on split
mm, thp: remove compound_lock
thp: introduce deferred_split_huge_page()
memcg: adjust to support new THP refcounting
ksm: split huge pages on follow_page()
thp: update documentation

Documentation/vm/transhuge.txt | 100 ++--
arch/mips/mm/gup.c | 4 -
arch/powerpc/mm/hugetlbpage.c | 13 +-
arch/powerpc/mm/subpage-prot.c | 2 +-
arch/s390/mm/gup.c | 13 +-
arch/sparc/mm/gup.c | 14 +-
arch/x86/include/asm/pgtable.h | 9 -
arch/x86/include/asm/pgtable_types.h | 2 -
arch/x86/kernel/vm86_32.c | 6 +-
arch/x86/mm/gup.c | 17 +-
arch/x86/mm/pgtable.c | 14 -
fs/proc/task_mmu.c | 51 ++-
include/asm-generic/pgtable.h | 5 -
include/linux/huge_mm.h | 40 +-
include/linux/hugetlb_inline.h | 9 +-
include/linux/memcontrol.h | 16 +-
include/linux/migrate.h | 3 +
include/linux/mm.h | 125 +----
include/linux/mm_types.h | 20 +-
include/linux/mmzone.h | 5 +
include/linux/page-flags.h | 15 +-
include/linux/pagemap.h | 14 +-
include/linux/rmap.h | 19 +-
include/linux/swap.h | 3 +-
include/linux/vm_event_item.h | 4 +-
kernel/events/uprobes.c | 11 +-
kernel/futex.c | 61 +--
mm/debug.c | 8 +-
mm/filemap.c | 9 +-
mm/gup.c | 89 ++--
mm/huge_memory.c | 857 ++++++++++++++++++-----------------
mm/hugetlb.c | 8 +-
mm/internal.h | 57 +--
mm/ksm.c | 60 +--
mm/madvise.c | 2 +-
mm/memcontrol.c | 76 +---
mm/memory-failure.c | 12 +-
mm/memory.c | 64 +--
mm/mempolicy.c | 2 +-
mm/migrate.c | 20 +-
mm/mincore.c | 2 +-
mm/mlock.c | 3 +
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/page_alloc.c | 13 +-
mm/pagewalk.c | 2 +-
mm/pgtable-generic.c | 14 -
mm/rmap.c | 121 +++--
mm/shmem.c | 21 +-
mm/slub.c | 2 +
mm/swap.c | 260 ++---------
mm/swapfile.c | 16 +-
mm/vmscan.c | 3 +
mm/vmstat.c | 4 +-
54 files changed, 975 insertions(+), 1349 deletions(-)

--
2.1.4


2015-02-12 16:19:01

by Kirill A. Shutemov

Subject: [PATCHv3 01/24] thp: cluster split_huge_page* code together

Rearrange code in mm/huge_memory.c to make future changes somewhat
easier.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 224 +++++++++++++++++++++++++++----------------------------
1 file changed, 112 insertions(+), 112 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e08e37ad050e..5f4c97e1a6da 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1622,6 +1622,118 @@ int pmd_freeable(pmd_t pmd)
return !pmd_dirty(pmd);
}

+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+ unsigned long haddr, pmd_t *pmd)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int i;
+
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ put_huge_zero_page();
+}
+
+void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ struct page *page;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long mmun_start; /* For mmu_notifiers */
+ unsigned long mmun_end; /* For mmu_notifiers */
+
+ BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
+
+ mmun_start = haddr;
+ mmun_end = haddr + HPAGE_PMD_SIZE;
+again:
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ ptl = pmd_lock(mm, pmd);
+ if (unlikely(!pmd_trans_huge(*pmd))) {
+ spin_unlock(ptl);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ return;
+ }
+ if (is_huge_zero_pmd(*pmd)) {
+ __split_huge_zero_page_pmd(vma, haddr, pmd);
+ spin_unlock(ptl);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ return;
+ }
+ page = pmd_page(*pmd);
+ VM_BUG_ON_PAGE(!page_count(page), page);
+ get_page(page);
+ spin_unlock(ptl);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+ split_huge_page(page);
+
+ put_page(page);
+
+ /*
+ * We don't always have down_write of mmap_sem here: a racing
+ * do_huge_pmd_wp_page() might have copied-on-write to another
+ * huge page before our split_huge_page() got the anon_vma lock.
+ */
+ if (unlikely(pmd_trans_huge(*pmd)))
+ goto again;
+}
+
+void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, address);
+ BUG_ON(vma == NULL);
+ split_huge_page_pmd(vma, address, pmd);
+}
+
+static void split_huge_page_address(struct mm_struct *mm,
+ unsigned long address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ return;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return;
+ /*
+ * Caller holds the mmap_sem write mode, so a huge pmd cannot
+ * materialize from under us.
+ */
+ split_huge_page_pmd_mm(mm, address, pmd);
+}
+
static int __split_huge_page_splitting(struct page *page,
struct vm_area_struct *vma,
unsigned long address)
@@ -2858,118 +2970,6 @@ static int khugepaged(void *none)
return 0;
}

-static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
- unsigned long haddr, pmd_t *pmd)
-{
- struct mm_struct *mm = vma->vm_mm;
- pgtable_t pgtable;
- pmd_t _pmd;
- int i;
-
- pmdp_clear_flush_notify(vma, haddr, pmd);
- /* leave pmd empty until pte is filled */
-
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
-
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
- entry = pte_mkspecial(entry);
- pte = pte_offset_map(&_pmd, haddr);
- VM_BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
- }
- smp_wmb(); /* make pte visible before pmd */
- pmd_populate(mm, pmd, pgtable);
- put_huge_zero_page();
-}
-
-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd)
-{
- spinlock_t *ptl;
- struct page *page;
- struct mm_struct *mm = vma->vm_mm;
- unsigned long haddr = address & HPAGE_PMD_MASK;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
-
- BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
-
- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
-again:
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_trans_huge(*pmd))) {
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
- return;
- }
- if (is_huge_zero_pmd(*pmd)) {
- __split_huge_zero_page_pmd(vma, haddr, pmd);
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
- return;
- }
- page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!page_count(page), page);
- get_page(page);
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
- split_huge_page(page);
-
- put_page(page);
-
- /*
- * We don't always have down_write of mmap_sem here: a racing
- * do_huge_pmd_wp_page() might have copied-on-write to another
- * huge page before our split_huge_page() got the anon_vma lock.
- */
- if (unlikely(pmd_trans_huge(*pmd)))
- goto again;
-}
-
-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd)
-{
- struct vm_area_struct *vma;
-
- vma = find_vma(mm, address);
- BUG_ON(vma == NULL);
- split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
- unsigned long address)
-{
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
-
- VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
-
- pgd = pgd_offset(mm, address);
- if (!pgd_present(*pgd))
- return;
-
- pud = pud_offset(pgd, address);
- if (!pud_present(*pud))
- return;
-
- pmd = pmd_offset(pud, address);
- if (!pmd_present(*pmd))
- return;
- /*
- * Caller holds the mmap_sem write mode, so a huge pmd cannot
- * materialize from under us.
- */
- split_huge_page_pmd_mm(mm, address, pmd);
-}
-
void __vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
--
2.1.4

2015-02-12 16:18:58

by Kirill A. Shutemov

Subject: [PATCHv3 02/24] mm: change PageAnon() and page_anon_vma() to work on tail pages

Currently PageAnon() and page_anon_vma() always return false/NULL for tail
pages. We need to look at the head page for the correct answer.

Let's change the functions to give the correct result for tail pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 1 +
include/linux/rmap.h | 1 +
2 files changed, 2 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a93928b90f..9071066b7c2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1047,6 +1047,7 @@ struct address_space *page_file_mapping(struct page *page)

static inline int PageAnon(struct page *page)
{
+ page = compound_head(page);
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
}

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9c5ff69fa0cd..c4088feac1fc 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -108,6 +108,7 @@ static inline void put_anon_vma(struct anon_vma *anon_vma)

static inline struct anon_vma *page_anon_vma(struct page *page)
{
+ page = compound_head(page);
if (((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) !=
PAGE_MAPPING_ANON)
return NULL;
--
2.1.4

2015-02-12 16:21:43

by Kirill A. Shutemov

Subject: [PATCHv3 03/24] mm: avoid PG_locked on tail pages

With the new refcounting, pte entries can point to tail pages. It doesn't
make much sense to mark a tail page locked -- we need to protect the whole
compound page.

This patch adjusts the helpers related to PG_locked to operate on the head
page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/page-flags.h | 3 ++-
include/linux/pagemap.h | 5 +++++
mm/filemap.c | 1 +
mm/slub.c | 2 ++
4 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5ed7bdaf22d5..d471370f27e8 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -207,7 +207,8 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }

struct page; /* forward declaration */

-TESTPAGEFLAG(Locked, locked)
+#define PageLocked(page) test_bit(PG_locked, &compound_head(page)->flags)
+
PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
__SETPAGEFLAG(Referenced, referenced)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b3736f7065c..ad6da4e49555 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -428,16 +428,19 @@ extern void unlock_page(struct page *page);

static inline void __set_page_locked(struct page *page)
{
+ VM_BUG_ON_PAGE(PageTail(page), page);
__set_bit(PG_locked, &page->flags);
}

static inline void __clear_page_locked(struct page *page)
{
+ VM_BUG_ON_PAGE(PageTail(page), page);
__clear_bit(PG_locked, &page->flags);
}

static inline int trylock_page(struct page *page)
{
+ page = compound_head(page);
return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
}

@@ -490,6 +493,7 @@ extern int wait_on_page_bit_killable_timeout(struct page *page,

static inline int wait_on_page_locked_killable(struct page *page)
{
+ page = compound_head(page);
if (PageLocked(page))
return wait_on_page_bit_killable(page, PG_locked);
return 0;
@@ -510,6 +514,7 @@ static inline void wake_up_page(struct page *page, int bit)
*/
static inline void wait_on_page_locked(struct page *page)
{
+ page = compound_head(page);
if (PageLocked(page))
wait_on_page_bit(page, PG_locked);
}
diff --git a/mm/filemap.c b/mm/filemap.c
index ad7242043bdb..b02c3f7cbe64 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -744,6 +744,7 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
*/
void unlock_page(struct page *page)
{
+ page = compound_head(page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
clear_bit_unlock(PG_locked, &page->flags);
smp_mb__after_atomic();
diff --git a/mm/slub.c b/mm/slub.c
index 0909e13cf708..16ba8c9665e2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -338,11 +338,13 @@ static inline int oo_objects(struct kmem_cache_order_objects x)
*/
static __always_inline void slab_lock(struct page *page)
{
+ VM_BUG_ON_PAGE(PageTail(page), page);
bit_spin_lock(PG_locked, &page->flags);
}

static __always_inline void slab_unlock(struct page *page)
{
+ VM_BUG_ON_PAGE(PageTail(page), page);
__bit_spin_unlock(PG_locked, &page->flags);
}

--
2.1.4

2015-02-12 16:20:12

by Kirill A. Shutemov

Subject: [PATCHv3 04/24] rmap: add argument to charge compound page

We're going to allow mapping of individual 4k pages of a THP compound
page. This means we cannot rely on a PageTransHuge() check to decide whether
to map a small page or the THP.

The patch adds a new argument to the rmap functions to indicate whether we
want to map the whole compound page or only a small page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/rmap.h | 14 +++++++++++---
kernel/events/uprobes.c | 4 ++--
mm/huge_memory.c | 16 ++++++++--------
mm/hugetlb.c | 4 ++--
mm/ksm.c | 4 ++--
mm/memory.c | 14 +++++++-------
mm/migrate.c | 8 ++++----
mm/rmap.c | 43 +++++++++++++++++++++++++++----------------
mm/swapfile.c | 4 ++--
9 files changed, 65 insertions(+), 46 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c4088feac1fc..3bf73620b672 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -168,16 +168,24 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,

struct anon_vma *page_get_anon_vma(struct page *page);

+/* flags for do_page_add_anon_rmap() */
+enum {
+ RMAP_EXCLUSIVE = 1,
+ RMAP_COMPOUND = 2,
+};
+
/*
* rmap interfaces called when adding or removing pte of page
*/
void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, bool);

void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f26a22d..5523daf59953 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
goto unlock;

get_page(kpage);
- page_add_new_anon_rmap(kpage, vma, addr);
+ page_add_new_anon_rmap(kpage, vma, addr, false);
mem_cgroup_commit_charge(kpage, memcg, false);
lru_cache_add_active_or_unevictable(kpage, vma);

@@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
ptep_clear_flush_notify(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));

- page_remove_rmap(page);
+ page_remove_rmap(page, false);
if (!page_mapped(page))
try_to_free_swap(page);
pte_unmap_unlock(ptep, ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5f4c97e1a6da..36637a80669e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -743,7 +743,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pmd_t entry;
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- page_add_new_anon_rmap(page, vma, haddr);
+ page_add_new_anon_rmap(page, vma, haddr, true);
mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -1034,7 +1034,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- page_add_new_anon_rmap(pages[i], vma, haddr);
+ page_add_new_anon_rmap(pages[i], vma, haddr, false);
mem_cgroup_commit_charge(pages[i], memcg, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
pte = pte_offset_map(&_pmd, haddr);
@@ -1046,7 +1046,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,

smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
- page_remove_rmap(page);
+ page_remove_rmap(page, true);
spin_unlock(ptl);

mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
@@ -1168,7 +1168,7 @@ alloc:
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_clear_flush_notify(vma, haddr, pmd);
- page_add_new_anon_rmap(new_page, vma, haddr);
+ page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
set_pmd_at(mm, haddr, pmd, entry);
@@ -1178,7 +1178,7 @@ alloc:
put_huge_zero_page();
} else {
VM_BUG_ON_PAGE(!PageHead(page), page);
- page_remove_rmap(page);
+ page_remove_rmap(page, true);
put_page(page);
}
ret |= VM_FAULT_WRITE;
@@ -1431,7 +1431,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
put_huge_zero_page();
} else {
page = pmd_page(orig_pmd);
- page_remove_rmap(page);
+ page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -2368,7 +2368,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
* superfluous.
*/
pte_clear(vma->vm_mm, address, _pte);
- page_remove_rmap(src_page);
+ page_remove_rmap(src_page, false);
spin_unlock(ptl);
free_page_and_swap_cache(src_page);
}
@@ -2658,7 +2658,7 @@ static void collapse_huge_page(struct mm_struct *mm,

spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
- page_add_new_anon_rmap(new_page, vma, address);
+ page_add_new_anon_rmap(new_page, vma, address, true);
mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0a9ac6c26832..ebb7329301c4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2688,7 +2688,7 @@ again:
if (huge_pte_dirty(pte))
set_page_dirty(page);

- page_remove_rmap(page);
+ page_remove_rmap(page, true);
force_flush = !__tlb_remove_page(tlb, page);
if (force_flush) {
address += sz;
@@ -2908,7 +2908,7 @@ retry_avoidcopy:
mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
set_huge_pte_at(mm, address, ptep,
make_huge_pte(vma, new_page, 1));
- page_remove_rmap(old_page);
+ page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, address);
/* Make the old page be freed below */
new_page = old_page;
diff --git a/mm/ksm.c b/mm/ksm.c
index 4162dce2eb44..92182eeba87d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -957,13 +957,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
}

get_page(kpage);
- page_add_anon_rmap(kpage, vma, addr);
+ page_add_anon_rmap(kpage, vma, addr, false);

flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush_notify(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));

- page_remove_rmap(page);
+ page_remove_rmap(page, false);
if (!page_mapped(page))
try_to_free_swap(page);
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 8ae52c918415..5529627d2cd6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1125,7 +1125,7 @@ again:
mark_page_accessed(page);
rss[MM_FILEPAGES]--;
}
- page_remove_rmap(page);
+ page_remove_rmap(page, false);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
if (unlikely(!__tlb_remove_page(tlb, page))) {
@@ -2189,7 +2189,7 @@ gotten:
* thread doing COW.
*/
ptep_clear_flush_notify(vma, address, page_table);
- page_add_new_anon_rmap(new_page, vma, address);
+ page_add_new_anon_rmap(new_page, vma, address, false);
mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
@@ -2222,7 +2222,7 @@ gotten:
* mapcount is visible. So transitively, TLBs to
* old page will be flushed before it can be reused.
*/
- page_remove_rmap(old_page);
+ page_remove_rmap(old_page, false);
}

/* Free the old page.. */
@@ -2465,7 +2465,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
- exclusive = 1;
+ exclusive = RMAP_EXCLUSIVE;
}
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(orig_pte))
@@ -2475,7 +2475,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
do_page_add_anon_rmap(page, vma, address, exclusive);
mem_cgroup_commit_charge(page, memcg, true);
} else { /* ksm created a completely new copy */
- page_add_new_anon_rmap(page, vma, address);
+ page_add_new_anon_rmap(page, vma, address, false);
mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
}
@@ -2613,7 +2613,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto release;

inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address);
+ page_add_new_anon_rmap(page, vma, address, false);
mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
@@ -2701,7 +2701,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (anon) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address);
+ page_add_new_anon_rmap(page, vma, address, false);
} else {
inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
page_add_file_rmap(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 85e042686031..0d2b3110277a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -166,7 +166,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
else
page_dup_rmap(new);
} else if (PageAnon(new))
- page_add_anon_rmap(new, vma, addr);
+ page_add_anon_rmap(new, vma, addr, false);
else
page_add_file_rmap(new);

@@ -1803,7 +1803,7 @@ fail_putback:
* guarantee the copy is visible before the pagetable update.
*/
flush_cache_range(vma, mmun_start, mmun_end);
- page_add_anon_rmap(new_page, vma, mmun_start);
+ page_add_anon_rmap(new_page, vma, mmun_start, true);
pmdp_clear_flush_notify(vma, mmun_start, pmd);
set_pmd_at(mm, mmun_start, pmd, entry);
flush_tlb_range(vma, mmun_start, mmun_end);
@@ -1814,13 +1814,13 @@ fail_putback:
flush_tlb_range(vma, mmun_start, mmun_end);
mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
update_mmu_cache_pmd(vma, address, &entry);
- page_remove_rmap(new_page);
+ page_remove_rmap(new_page, true);
goto fail_putback;
}

mem_cgroup_migrate(page, new_page, false);

- page_remove_rmap(page);
+ page_remove_rmap(page, true);

spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
diff --git a/mm/rmap.c b/mm/rmap.c
index 47b3ba87c2dd..f67e83be75e4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1041,9 +1041,9 @@ static void __page_check_anon_rmap(struct page *page,
* (but PageKsm is never downgraded to PageAnon).
*/
void page_add_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address)
+ struct vm_area_struct *vma, unsigned long address, bool compound)
{
- do_page_add_anon_rmap(page, vma, address, 0);
+ do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
}

/*
@@ -1052,21 +1052,24 @@ void page_add_anon_rmap(struct page *page,
* Everybody else should continue to use page_add_anon_rmap above.
*/
void do_page_add_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address, int exclusive)
+ struct vm_area_struct *vma, unsigned long address, int flags)
{
int first = atomic_inc_and_test(&page->_mapcount);
if (first) {
+ bool compound = flags & RMAP_COMPOUND;
+ int nr = compound ? hpage_nr_pages(page) : 1;
/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption
* disabled.
*/
- if (PageTransHuge(page))
+ if (compound) {
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__inc_zone_page_state(page,
NR_ANON_TRANSPARENT_HUGEPAGES);
- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
- hpage_nr_pages(page));
+ }
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
}
if (unlikely(PageKsm(page)))
return;
@@ -1074,7 +1077,8 @@ void do_page_add_anon_rmap(struct page *page,
VM_BUG_ON_PAGE(!PageLocked(page), page);
/* address might be in next vma when migration races vma_adjust */
if (first)
- __page_set_anon_rmap(page, vma, address, exclusive);
+ __page_set_anon_rmap(page, vma, address,
+ flags & RMAP_EXCLUSIVE);
else
__page_check_anon_rmap(page, vma, address);
}
@@ -1090,15 +1094,18 @@ void do_page_add_anon_rmap(struct page *page,
* Page does not have to be locked.
*/
void page_add_new_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address)
+ struct vm_area_struct *vma, unsigned long address, bool compound)
{
+ int nr = compound ? hpage_nr_pages(page) : 1;
+
VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
SetPageSwapBacked(page);
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
- if (PageTransHuge(page))
+ if (compound) {
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
- hpage_nr_pages(page));
+ }
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
__page_set_anon_rmap(page, vma, address, 1);
}

@@ -1154,9 +1161,12 @@ out:
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, bool compound)
{
+ int nr = compound ? hpage_nr_pages(page) : 1;
+
if (!PageAnon(page)) {
+ VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
page_remove_file_rmap(page);
return;
}
@@ -1174,11 +1184,12 @@ void page_remove_rmap(struct page *page)
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- if (PageTransHuge(page))
+ if (compound) {
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ }

- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
- -hpage_nr_pages(page));
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);

if (unlikely(PageMlocked(page)))
clear_page_mlock(page);
@@ -1320,7 +1331,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
dec_mm_counter(mm, MM_FILEPAGES);

discard:
- page_remove_rmap(page);
+ page_remove_rmap(page, false);
page_cache_release(page);

out_unmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63f55ccb9b26..200298895cee 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1121,10 +1121,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
if (page == swapcache) {
- page_add_anon_rmap(page, vma, addr);
+ page_add_anon_rmap(page, vma, addr, false);
mem_cgroup_commit_charge(page, memcg, true);
} else { /* ksm created a completely new copy */
- page_add_new_anon_rmap(page, vma, addr);
+ page_add_new_anon_rmap(page, vma, addr, false);
mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
}
--
2.1.4

2015-02-12 16:19:07

by Kirill A. Shutemov

Subject: [PATCHv3 05/24] mm, proc: adjust PSS calculation

With the new refcounting, the subpages of a compound page do not necessarily
have the same mapcount. We need to take into account the mapcount of every
sub-page.
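
As an illustration with made-up numbers: suppose a 2M THP (512 4k subpages)
is PMD-mapped in process A while process B PTE-maps 8 of its subpages. For
process A the per-subpage loop below then accounts:

	  8 subpages with page_mapcount() == 2  ->   8 * PAGE_SIZE / 2 =   16384
	504 subpages with page_mapcount() == 1  -> 504 * PAGE_SIZE     = 2064384
	                                                          PSS  = 2080768

rather than applying a single mapcount to the whole 2M range as the old code
did.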

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 98826d08a11b..8a0a78174cc6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -449,9 +449,10 @@ struct mem_size_stats {
};

static void smaps_account(struct mem_size_stats *mss, struct page *page,
- unsigned long size, bool young, bool dirty)
+ bool compound, bool young, bool dirty)
{
- int mapcount;
+ int i, nr = compound ? hpage_nr_pages(page) : 1;
+ unsigned long size = 1UL << nr;

if (PageAnon(page))
mss->anonymous += size;
@@ -460,23 +461,23 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
/* Accumulate the size in pages that have been accessed. */
if (young || PageReferenced(page))
mss->referenced += size;
- mapcount = page_mapcount(page);
- if (mapcount >= 2) {
- u64 pss_delta;

- if (dirty || PageDirty(page))
- mss->shared_dirty += size;
- else
- mss->shared_clean += size;
- pss_delta = (u64)size << PSS_SHIFT;
- do_div(pss_delta, mapcount);
- mss->pss += pss_delta;
- } else {
- if (dirty || PageDirty(page))
- mss->private_dirty += size;
- else
- mss->private_clean += size;
- mss->pss += (u64)size << PSS_SHIFT;
+ for (i = 0; i < nr; i++) {
+ int mapcount = page_mapcount(page + i);
+
+ if (mapcount >= 2) {
+ if (dirty || PageDirty(page + i))
+ mss->shared_dirty += PAGE_SIZE;
+ else
+ mss->shared_clean += PAGE_SIZE;
+ mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
+ } else {
+ if (dirty || PageDirty(page + i))
+ mss->private_dirty += PAGE_SIZE;
+ else
+ mss->private_clean += PAGE_SIZE;
+ mss->pss += PAGE_SIZE << PSS_SHIFT;
+ }
}
}

@@ -500,7 +501,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,

if (!page)
return;
- smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
+
+ smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -516,8 +518,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
if (IS_ERR_OR_NULL(page))
return;
mss->anonymous_thp += HPAGE_PMD_SIZE;
- smaps_account(mss, page, HPAGE_PMD_SIZE,
- pmd_young(*pmd), pmd_dirty(*pmd));
+ smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
}
#else
static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
--
2.1.4

2015-02-12 16:19:04

by Kirill A. Shutemov

Subject: [PATCHv3 06/24] mm: store mapcount for compound page separately

We're going to allow mapping of individual 4k pages of a THP compound page
and we need a cheap way to find out how many times the compound page is
mapped with a PMD -- compound_mapcount() does this.

We use the same approach as with the compound page destructor and compound
order: use space in the first tail page, ->mapping this time.

page_mapcount() counts both PTE and PMD mappings of the page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 16 ++++++++++++++--
include/linux/mm_types.h | 1 +
include/linux/rmap.h | 4 ++--
mm/debug.c | 5 ++++-
mm/huge_memory.c | 23 ++++++++++++++---------
mm/hugetlb.c | 4 ++--
mm/memory.c | 4 ++--
mm/migrate.c | 2 +-
mm/page_alloc.c | 7 ++++++-
mm/rmap.c | 47 +++++++++++++++++++++++++++++++++++++++--------
10 files changed, 85 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9071066b7c2e..624cbeb58048 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -472,6 +472,18 @@ static inline struct page *compound_head_fast(struct page *page)
return page->first_page;
return page;
}
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+ return &page[1].compound_mapcount;
+}
+
+static inline int compound_mapcount(struct page *page)
+{
+ if (!PageCompound(page))
+ return 0;
+ page = compound_head(page);
+ return atomic_read(compound_mapcount_ptr(page)) + 1;
+}

/*
* The atomic page->_mapcount, starts from -1: so that transitions
@@ -486,7 +498,7 @@ static inline void page_mapcount_reset(struct page *page)
static inline int page_mapcount(struct page *page)
{
VM_BUG_ON_PAGE(PageSlab(page), page);
- return atomic_read(&page->_mapcount) + 1;
+ return atomic_read(&page->_mapcount) + compound_mapcount(page) + 1;
}

static inline int page_count(struct page *page)
@@ -1081,7 +1093,7 @@ static inline pgoff_t page_file_index(struct page *page)
*/
static inline int page_mapped(struct page *page)
{
- return atomic_read(&(page)->_mapcount) >= 0;
+ return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
}

/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03aab8dc..2d19a4b6f6a6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -56,6 +56,7 @@ struct page {
* see PAGE_MAPPING_ANON below.
*/
void *s_mem; /* slab first object */
+ atomic_t compound_mapcount; /* first tail page */
};

/* Second double word */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 3bf73620b672..046e3bc810e6 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -192,9 +192,9 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long);

-static inline void page_dup_rmap(struct page *page)
+static inline void page_dup_rmap(struct page *page, bool compound)
{
- atomic_inc(&page->_mapcount);
+ atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
}

/*
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..13d2b8146ef9 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -83,9 +83,12 @@ static void dump_flags(unsigned long flags,
void dump_page_badflags(struct page *page, const char *reason,
unsigned long badflags)
{
- pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
+ pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
page, atomic_read(&page->_count), page_mapcount(page),
page->mapping, page->index);
+ if (PageCompound(page))
+ pr_cont(" compound_mapcount: %d", compound_mapcount(page));
+ pr_cont("\n");
BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS);
dump_flags(page->flags, pageflag_names, ARRAY_SIZE(pageflag_names));
if (reason)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 36637a80669e..17be7a978f17 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -890,7 +890,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
src_page = pmd_page(pmd);
VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
get_page(src_page);
- page_dup_rmap(src_page);
+ page_dup_rmap(src_page, true);
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);

pmdp_set_wrprotect(src_mm, addr, src_pmd);
@@ -1787,8 +1787,8 @@ static void __split_huge_page_refcount(struct page *page,
struct page *page_tail = page + i;

/* tail_page->_mapcount cannot change */
- BUG_ON(page_mapcount(page_tail) < 0);
- tail_count += page_mapcount(page_tail);
+ BUG_ON(atomic_read(&page_tail->_mapcount) + 1 < 0);
+ tail_count += atomic_read(&page_tail->_mapcount) + 1;
/* check for overflow */
BUG_ON(tail_count < 0);
BUG_ON(atomic_read(&page_tail->_count) != 0);
@@ -1805,8 +1805,7 @@ static void __split_huge_page_refcount(struct page *page,
* atomic_set() here would be safe on all archs (and
* not only on x86), it's safer to use atomic_add().
*/
- atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
- &page_tail->_count);
+ atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);

/* after clearing PageTail the gup refcount can be released */
smp_mb__after_atomic();
@@ -1843,15 +1842,18 @@ static void __split_huge_page_refcount(struct page *page,
* status is achieved setting a reserved bit in the
* pmd, not by clearing the present bit.
*/
- page_tail->_mapcount = page->_mapcount;
+ atomic_set(&page_tail->_mapcount, compound_mapcount(page) - 1);

- BUG_ON(page_tail->mapping);
- page_tail->mapping = page->mapping;
+ /* ->mapping in first tail page is compound_mapcount */
+ if (i != 1) {
+ BUG_ON(page_tail->mapping);
+ page_tail->mapping = page->mapping;
+ BUG_ON(!PageAnon(page_tail));
+ }

page_tail->index = page->index + i;
page_cpupid_xchg_last(page_tail, page_cpupid_last(page));

- BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
BUG_ON(!PageDirty(page_tail));
BUG_ON(!PageSwapBacked(page_tail));
@@ -1861,6 +1863,9 @@ static void __split_huge_page_refcount(struct page *page,
atomic_sub(tail_count, &page->_count);
BUG_ON(atomic_read(&page->_count) <= 0);

+ page->_mapcount = *compound_mapcount_ptr(page);
+ page[1].mapping = page->mapping;
+
__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);

ClearPageCompound(page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ebb7329301c4..2aa2a850d002 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2606,7 +2606,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
get_page(ptepage);
- page_dup_rmap(ptepage);
+ page_dup_rmap(ptepage, true);
set_huge_pte_at(dst, addr, dst_pte, entry);
}
spin_unlock(src_ptl);
@@ -3065,7 +3065,7 @@ retry:
ClearPagePrivate(page);
hugepage_add_new_anon_rmap(page, vma, address);
} else
- page_dup_rmap(page);
+ page_dup_rmap(page, true);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);
diff --git a/mm/memory.c b/mm/memory.c
index 5529627d2cd6..343f800dff25 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -873,7 +873,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
page = vm_normal_page(vma, addr, pte);
if (page) {
get_page(page);
- page_dup_rmap(page);
+ page_dup_rmap(page, false);
if (PageAnon(page))
rss[MM_ANONPAGES]++;
else
@@ -2972,7 +2972,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* pinned by vma->vm_file's reference. We rely on unlock_page()'s
* release semantics to prevent the compiler from undoing this copying.
*/
- mapping = fault_page->mapping;
+ mapping = compound_head(fault_page)->mapping;
unlock_page(fault_page);
if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
/*
diff --git a/mm/migrate.c b/mm/migrate.c
index 0d2b3110277a..01449826b914 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -164,7 +164,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
if (PageAnon(new))
hugepage_add_anon_rmap(new, vma, addr);
else
- page_dup_rmap(new);
+ page_dup_rmap(new, false);
} else if (PageAnon(new))
page_add_anon_rmap(new, vma, addr, false);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 31bc2e8b5d99..b0ef1f6d2fb0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -369,6 +369,7 @@ void prep_compound_page(struct page *page, unsigned long order)

set_compound_page_dtor(page, free_compound_page);
set_compound_order(page, order);
+ atomic_set(compound_mapcount_ptr(page), -1);
__SetPageHead(page);
for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
@@ -658,7 +659,9 @@ static inline int free_pages_check(struct page *page)

if (unlikely(page_mapcount(page)))
bad_reason = "nonzero mapcount";
- if (unlikely(page->mapping != NULL))
+ if (unlikely(compound_mapcount(page)))
+ bad_reason = "nonzero compound_mapcount";
+ if (unlikely(page->mapping != NULL) && !PageTail(page))
bad_reason = "non-NULL mapping";
if (unlikely(atomic_read(&page->_count) != 0))
bad_reason = "nonzero _count";
@@ -800,6 +803,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
}
if (bad)
return false;
+ if (order)
+ page[1].mapping = NULL;

reset_page_owner(page, order);

diff --git a/mm/rmap.c b/mm/rmap.c
index f67e83be75e4..333938475831 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1025,7 +1025,7 @@ static void __page_check_anon_rmap(struct page *page,
* over the call to page_add_new_anon_rmap.
*/
BUG_ON(page_anon_vma(page)->root != vma->anon_vma->root);
- BUG_ON(page->index != linear_page_index(vma, address));
+ BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address));
#endif
}

@@ -1054,9 +1054,26 @@ void page_add_anon_rmap(struct page *page,
void do_page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address, int flags)
{
- int first = atomic_inc_and_test(&page->_mapcount);
+ bool compound = flags & RMAP_COMPOUND;
+ bool first;
+
+ if (PageTransCompound(page)) {
+ VM_BUG_ON_PAGE(!PageLocked(compound_head(page)), page);
+ if (compound) {
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+ first = atomic_inc_and_test(compound_mapcount_ptr(page));
+ } else {
+ /* Anon THP always mapped first with PMD */
+ first = 0;
+ VM_BUG_ON_PAGE(!page_mapcount(page), page);
+ atomic_inc(&page->_mapcount);
+ }
+ } else {
+ VM_BUG_ON_PAGE(compound, page);
+ first = atomic_inc_and_test(&page->_mapcount);
+ }
+
if (first) {
- bool compound = flags & RMAP_COMPOUND;
int nr = compound ? hpage_nr_pages(page) : 1;
/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
@@ -1074,7 +1091,8 @@ void do_page_add_anon_rmap(struct page *page,
if (unlikely(PageKsm(page)))
return;

- VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(!PageLocked(compound_head(page)), page);
+
/* address might be in next vma when migration races vma_adjust */
if (first)
__page_set_anon_rmap(page, vma, address,
@@ -1100,10 +1118,16 @@ void page_add_new_anon_rmap(struct page *page,

VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
SetPageSwapBacked(page);
- atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
if (compound) {
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+ /* increment count (starts at -1) */
+ atomic_set(compound_mapcount_ptr(page), 0);
__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ } else {
+ /* Anon THP always mapped first with PMD */
+ VM_BUG_ON_PAGE(PageTransCompound(page), page);
+ /* increment count (starts at -1) */
+ atomic_set(&page->_mapcount, 0);
}
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
__page_set_anon_rmap(page, vma, address, 1);
@@ -1172,7 +1196,9 @@ void page_remove_rmap(struct page *page, bool compound)
}

/* page still mapped by someone else? */
- if (!atomic_add_negative(-1, &page->_mapcount))
+ if (!atomic_add_negative(-1, compound ?
+ compound_mapcount_ptr(page) :
+ &page->_mapcount))
return;

/* Hugepages are not counted in NR_ANON_PAGES for now. */
@@ -1185,8 +1211,13 @@ void page_remove_rmap(struct page *page, bool compound)
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
if (compound) {
+ int i;
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ /* The page can be mapped with ptes */
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ if (page_mapcount(page + i))
+ nr--;
}

__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
@@ -1630,7 +1661,7 @@ void hugepage_add_anon_rmap(struct page *page,
BUG_ON(!PageLocked(page));
BUG_ON(!anon_vma);
/* address might be in next vma when migration races vma_adjust */
- first = atomic_inc_and_test(&page->_mapcount);
+ first = atomic_inc_and_test(compound_mapcount_ptr(page));
if (first)
__hugepage_set_anon_rmap(page, vma, address, 0);
}
@@ -1639,7 +1670,7 @@ void hugepage_add_new_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
- atomic_set(&page->_mapcount, 0);
+ atomic_set(compound_mapcount_ptr(page), 0);
__hugepage_set_anon_rmap(page, vma, address, 1);
}
#endif /* CONFIG_HUGETLB_PAGE */
--
2.1.4

2015-02-12 16:19:40

by Kirill A. Shutemov

Subject: [PATCHv3 07/24] mm, thp: adjust conditions when we can reuse the page on WP fault

With the new refcounting we will be able to map the same compound page with
both PTEs and PMDs. This requires adjusting the conditions under which we can
reuse the page on a write-protection fault.

For a PTE fault we can't reuse the page if it's part of a huge page.

For a PMD we can only reuse the page if nobody else maps the huge page or any
part of it. We could do that by checking page_mapcount() on each sub-page,
but it's expensive.

The cheaper way is to check that page_count() is equal to 1: every mapcount
takes a page reference, so this way we can guarantee that the PMD is the only
mapping.

This approach can give a false negative if somebody has pinned the page, but
that doesn't affect correctness.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/swap.h | 3 ++-
mm/huge_memory.c | 12 +++++++++++-
mm/rmap.c | 2 +-
mm/swapfile.c | 3 +++
4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7067eca501e2..f0e4868f63b1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -523,7 +523,8 @@ static inline int page_swapcount(struct page *page)
return 0;
}

-#define reuse_swap_page(page) (page_mapcount(page) == 1)
+#define reuse_swap_page(page) \
+ (!PageTransCompound(page) && page_mapcount(page) == 1)

static inline int try_to_free_swap(struct page *page)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 17be7a978f17..156f34b9e334 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1092,7 +1092,17 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

page = pmd_page(orig_pmd);
VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
- if (page_mapcount(page) == 1) {
+ /*
+ * We can only reuse the page if nobody else maps the huge page or it's
+ * part. We can do it by checking page_mapcount() on each sub-page, but
+ * it's expensive.
+ * The cheaper way is to check page_count() to be equal 1: every
+ * mapcount takes page reference reference, so this way we can
+ * guarantee, that the PMD is the only mapping.
+ * This can give false negative if somebody pinned the page, but that's
+ * fine.
+ */
+ if (page_mapcount(page) == 1 && page_count(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
diff --git a/mm/rmap.c b/mm/rmap.c
index 333938475831..db8b99e48966 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1215,7 +1215,7 @@ void page_remove_rmap(struct page *page, bool compound)
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
/* The page can be mapped with ptes */
- for (i = 0; i < HPAGE_PMD_NR; i++)
+ for (i = 0; i < hpage_nr_pages(page); i++)
if (page_mapcount(page + i))
nr--;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 200298895cee..99f97c31ede5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -887,6 +887,9 @@ int reuse_swap_page(struct page *page)
VM_BUG_ON_PAGE(!PageLocked(page), page);
if (unlikely(PageKsm(page)))
return 0;
+ /* The page is part of THP and cannot be reused */
+ if (PageTransCompound(page))
+ return 0;
count = page_mapcount(page);
if (count <= 1 && PageSwapCache(page)) {
count += page_swapcount(page);
--
2.1.4

2015-02-12 16:25:04

by Kirill A. Shutemov

Subject: [PATCHv3 08/24] mm: adjust FOLL_SPLIT for new refcounting

We prepare the kernel to allow transhuge pages to be mapped with ptes too.
We need to handle FOLL_SPLIT in follow_page_pte().

Also, we use split_huge_page() directly instead of split_huge_page_pmd(),
since split_huge_page_pmd() is going away.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/gup.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 45 insertions(+), 20 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a6e24e246f86..022d7a91de03 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -79,6 +79,20 @@ retry:
page = pte_page(pte);
}

+ if (flags & FOLL_SPLIT && PageTransCompound(page)) {
+ int ret;
+ page = compound_head(page);
+ get_page(page);
+ pte_unmap_unlock(ptep, ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (ret)
+ return ERR_PTR(ret);
+ goto retry;
+ }
+
if (flags & FOLL_GET)
get_page_foll(page);
if (flags & FOLL_TOUCH) {
@@ -186,27 +200,38 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
}
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
return no_page_table(vma, flags);
- if (pmd_trans_huge(*pmd)) {
- if (flags & FOLL_SPLIT) {
- split_huge_page_pmd(vma, address, pmd);
- return follow_page_pte(vma, address, pmd, flags);
- }
- ptl = pmd_lock(mm, pmd);
- if (likely(pmd_trans_huge(*pmd))) {
- if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(ptl);
- wait_split_huge_page(vma->anon_vma, pmd);
- } else {
- page = follow_trans_huge_pmd(vma, address,
- pmd, flags);
- spin_unlock(ptl);
- *page_mask = HPAGE_PMD_NR - 1;
- return page;
- }
- } else
- spin_unlock(ptl);
+ if (likely(!pmd_trans_huge(*pmd)))
+ return follow_page_pte(vma, address, pmd, flags);
+
+ ptl = pmd_lock(mm, pmd);
+ if (unlikely(!pmd_trans_huge(*pmd))) {
+ spin_unlock(ptl);
+ return follow_page_pte(vma, address, pmd, flags);
}
- return follow_page_pte(vma, address, pmd, flags);
+
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(ptl);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ return follow_page_pte(vma, address, pmd, flags);
+ }
+
+ if (flags & FOLL_SPLIT) {
+ int ret;
+ page = pmd_page(*pmd);
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ return ret ? ERR_PTR(ret) :
+ follow_page_pte(vma, address, pmd, flags);
+ }
+
+ page = follow_trans_huge_pmd(vma, address, pmd, flags);
+ spin_unlock(ptl);
+ *page_mask = HPAGE_PMD_NR - 1;
+ return page;
}

static int get_gate_page(struct mm_struct *mm, unsigned long address,
--
2.1.4

2015-02-12 16:21:25

by Kirill A. Shutemov

Subject: [PATCHv3 09/24] thp, mlock: do not allow huge pages in mlocked area

With the new refcounting a THP can belong to several VMAs. This makes it
tricky to track THP pages when they are partially mlocked: it can lead to
leaking mlocked pages into non-VM_LOCKED VMAs and other problems.

With this patch we split all pages on mlock and avoid faulting in or
collapsing new THP in VM_LOCKED VMAs.

I've tried an alternative approach: do not mark THP pages mlocked and keep
them on the normal LRUs. That way vmscan could try to split huge pages under
memory pressure and free up the subpages which don't belong to VM_LOCKED
VMAs. But it is a user-visible change: it screws up the Mlocked accounting
reported in meminfo, so I had to leave this approach aside.

We can bring something better later, but this should be good enough for
now.
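
For reference, a trivial userspace sketch (not from the patch) of the
situation this change sidesteps: a single 2M THP whose range is only
partially mlocked. With the patch, the mlock() below is expected to split
the huge page (via FOLL_SPLIT) and later faults in the VM_LOCKED vma fall
back to small pages:

#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one PMD-sized region */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;		/* first touch may fault in a THP */
	mlock(p, 4096);		/* but only one 4k subpage is mlocked */
	return 0;
}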

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 5 ++++-
mm/mlock.c | 3 +++
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 156f34b9e334..284d1f13247a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -787,6 +787,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,

if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
return VM_FAULT_FALLBACK;
+ if (vma->vm_flags & VM_LOCKED)
+ return VM_FAULT_FALLBACK;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
@@ -2553,7 +2555,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
(vma->vm_flags & VM_NOHUGEPAGE))
return false;
-
+ if (vma->vm_flags & VM_LOCKED)
+ return false;
if (!vma->anon_vma || vma->vm_ops)
return false;
if (is_vma_temporary_stack(vma))
diff --git a/mm/mlock.c b/mm/mlock.c
index 73cf0987088c..40c6ab590cde 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -238,6 +238,9 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma,
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);

gup_flags = FOLL_TOUCH | FOLL_MLOCK;
+ if (vma->vm_flags & VM_LOCKED)
+ gup_flags |= FOLL_SPLIT;
+
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
--
2.1.4

2015-02-12 16:21:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 10/24] khugepaged: ignore pmd tables with THP mapped with ptes

Prepare khugepaged to see compound pages mapped with PTEs. For now we
won't collapse a PMD table that contains such PTEs.

khugepaged is subject to future rework wrt the new refcounting.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 284d1f13247a..9d18e9bafb26 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2735,6 +2735,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
page = vm_normal_page(vma, _address, pteval);
if (unlikely(!page))
goto out_unmap;
+
+ /* TODO: teach khugepaged to collapse THP mapped with pte */
+ if (PageCompound(page))
+ goto out_unmap;
+
/*
* Record which node the original page is from and save this
* information to khugepaged_node_load[].
@@ -2745,7 +2750,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
if (khugepaged_scan_abort(node))
goto out_unmap;
khugepaged_node_load[node]++;
- VM_BUG_ON_PAGE(PageCompound(page), page);
if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
goto out_unmap;
/*
--
2.1.4

2015-02-12 16:19:25

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 11/24] thp: rename split_huge_page_pmd() to split_huge_pmd()

We are going to decouple splitting a THP PMD from splitting the underlying
compound page.

This patch renames the split_huge_page_pmd*() functions to split_huge_pmd*()
to reflect the fact that they only split the PMD, not the page.
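
The conversion pattern applied throughout the diff below, shown on a
hypothetical caller that already has the vma at hand:

	/* before: two helpers, (vma/mm, address, pmd) argument order */
	split_huge_page_pmd(vma, addr, pmd);
	split_huge_page_pmd_mm(mm, addr, pmd);	/* looked up the vma itself */

	/* after: a single helper, (vma, pmd, address) argument order */
	split_huge_pmd(vma, pmd, addr);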

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/powerpc/mm/subpage-prot.c | 2 +-
arch/x86/kernel/vm86_32.c | 6 +++++-
include/linux/huge_mm.h | 8 ++------
mm/huge_memory.c | 33 +++++++++++++--------------------
mm/madvise.c | 2 +-
mm/memory.c | 2 +-
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
10 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
index fa9fb5b4c66c..d5543514c1df 100644
--- a/arch/powerpc/mm/subpage-prot.c
+++ b/arch/powerpc/mm/subpage-prot.c
@@ -135,7 +135,7 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->vma;
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
return 0;
}

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index e8edcf52e069..883160599965 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,11 @@ static void mark_screen_rdonly(struct mm_struct *mm)
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
- split_huge_page_pmd_mm(mm, 0xA0000, pmd);
+
+ if (pmd_trans_huge(*pmd)) {
+ struct vm_area_struct *vma = find_vma(mm, 0xA0000);
+ split_huge_pmd(vma, pmd, 0xA0000);
+ }
if (pmd_none_or_clear_bad(pmd))
goto out;
pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 44a840a53974..34bbf769d52e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -104,7 +104,7 @@ static inline int split_huge_page(struct page *page)
}
extern void __split_huge_page_pmd(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd);
-#define split_huge_page_pmd(__vma, __address, __pmd) \
+#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
@@ -119,8 +119,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
-extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd);
#if HPAGE_PMD_ORDER >= MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -187,11 +185,9 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define split_huge_page_pmd(__vma, __address, __pmd) \
- do { } while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
-#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
+#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9d18e9bafb26..d447afc039f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1127,13 +1127,13 @@ alloc:

if (unlikely(!new_page)) {
if (!page) {
- split_huge_page_pmd(vma, address, pmd);
+ split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
} else {
ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
pmd, orig_pmd, page, haddr);
if (ret & VM_FAULT_OOM) {
- split_huge_page(page);
+ split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
}
put_user_huge_page(page);
@@ -1146,10 +1146,10 @@ alloc:
GFP_TRANSHUGE, &memcg))) {
put_page(new_page);
if (page) {
- split_huge_page(page);
+ split_huge_pmd(vma, pmd, address);
put_user_huge_page(page);
} else
- split_huge_page_pmd(vma, address, pmd);
+ split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -1709,17 +1709,7 @@ again:
goto again;
}

-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd)
-{
- struct vm_area_struct *vma;
-
- vma = find_vma(mm, address);
- BUG_ON(vma == NULL);
- split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
+static void split_huge_pmd_address(struct vm_area_struct *vma,
unsigned long address)
{
pgd_t *pgd;
@@ -1728,7 +1718,7 @@ static void split_huge_page_address(struct mm_struct *mm,

VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));

- pgd = pgd_offset(mm, address);
+ pgd = pgd_offset(vma->vm_mm, address);
if (!pgd_present(*pgd))
return;

@@ -1739,11 +1729,14 @@ static void split_huge_page_address(struct mm_struct *mm,
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return;
+
+ if (!pmd_trans_huge(*pmd))
+ return;
/*
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- split_huge_page_pmd_mm(mm, address, pmd);
+ __split_huge_page_pmd(vma, address, pmd);
}

static int __split_huge_page_splitting(struct page *page,
@@ -3005,7 +2998,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
if (start & ~HPAGE_PMD_MASK &&
(start & HPAGE_PMD_MASK) >= vma->vm_start &&
(start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
- split_huge_page_address(vma->vm_mm, start);
+ split_huge_pmd_address(vma, start);

/*
* If the new end address isn't hpage aligned and it could
@@ -3015,7 +3008,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
if (end & ~HPAGE_PMD_MASK &&
(end & HPAGE_PMD_MASK) >= vma->vm_start &&
(end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
- split_huge_page_address(vma->vm_mm, end);
+ split_huge_pmd_address(vma, end);

/*
* If we're also updating the vma->vm_next->vm_start, if the new
@@ -3029,6 +3022,6 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
if (nstart & ~HPAGE_PMD_MASK &&
(nstart & HPAGE_PMD_MASK) >= next->vm_start &&
(nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
- split_huge_page_address(next->vm_mm, nstart);
+ split_huge_pmd_address(next, nstart);
}
}
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8921c2..c0bbe52eb3bf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -279,7 +279,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
diff --git a/mm/memory.c b/mm/memory.c
index 343f800dff25..6f030a3a7636 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1204,7 +1204,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
BUG();
}
#endif
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4721046a134a..eeec3dd199f5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -493,7 +493,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
pte_t *pte;
spinlock_t *ptl;

- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
if (pmd_trans_unstable(pmd))
return 0;

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 44727811bf4c..81632238f857 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -155,7 +155,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,

if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, prot_numa);
diff --git a/mm/mremap.c b/mm/mremap.c
index 57dadc025c64..95ac595de7da 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -208,7 +208,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
need_flush = true;
continue;
} else if (!err) {
- split_huge_page_pmd(vma, old_addr, old_pmd);
+ split_huge_pmd(vma, old_pmd, old_addr);
}
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 75c1f2878519..5b23a69439c8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
if (!walk->pte_entry)
continue;

- split_huge_page_pmd_mm(walk->mm, addr, pmd);
+ split_huge_pmd(walk->vma, pmd, addr);
if (pmd_trans_unstable(pmd))
goto again;
err = walk_pte_range(pmd, addr, next, walk);
--
2.1.4

2015-02-12 16:24:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 12/24] thp: PMD splitting without splitting compound page

The current split_huge_page() combines two operations: splitting PMDs into
tables of PTEs and splitting the underlying compound page. This patch
changes the split_huge_pmd() implementation to split the given PMD without
splitting other PMDs this page is mapped with or the underlying compound
page.

In order to do this we have to get rid of tail page refcounting, which
uses the _mapcount of tail pages. Tail page refcounting is needed to be able
to split a THP page at any point: we always know which of the tail pages is
pinned (i.e. by get_user_pages()) and can distribute the page count
correctly.

We can avoid this by allowing split_huge_page() to fail if the compound
page is pinned. This patch removes all infrastructure for tail page
refcounting and makes split_huge_page() always return -EBUSY. All
split_huge_page() users already know how to handle its failure. A proper
implementation will be added later.

Without tail page refcounting, the implementation of split_huge_pmd() is
pretty straightforward.

Memory cgroup is not yet ready for the new refcounting. Let's disable it at
the Kconfig level.
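
An illustrative summary of the caller-visible semantics after this patch
(not a hunk from the diff below):

	/*
	 * Splitting one PMD mapping: never fails, affects only this PMD and
	 * leaves the compound page and any other mappings of it intact.
	 */
	split_huge_pmd(vma, pmd, address);

	/*
	 * Splitting the compound page itself can now fail; with this patch
	 * it always does, until the real implementation lands later in the
	 * series.
	 */
	if (split_huge_page(page)) {
		/* -EBUSY: keep the huge page and fall back, as the existing
		 * callers (swap-out, migration, ...) already do. */
	}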

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/mips/mm/gup.c | 4 -
arch/powerpc/mm/hugetlbpage.c | 13 +-
arch/s390/mm/gup.c | 13 +-
arch/sparc/mm/gup.c | 14 +--
arch/x86/mm/gup.c | 4 -
include/linux/huge_mm.h | 7 +-
include/linux/mm.h | 68 +----------
include/linux/mm_types.h | 19 +--
mm/Kconfig | 2 +-
mm/gup.c | 31 +----
mm/huge_memory.c | 275 ++++++++++--------------------------------
mm/internal.h | 31 +----
mm/swap.c | 245 +------------------------------------
13 files changed, 88 insertions(+), 638 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 349995d19c7f..36a35115dc2e 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
@@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end,
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 7e408bfc7948..9a7f513d0068 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1037,7 +1037,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
{
unsigned long mask;
unsigned long pte_end;
- struct page *head, *page, *tail;
+ struct page *head, *page;
pte_t pte;
int refs;

@@ -1060,7 +1060,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
head = pte_page(pte);

page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
@@ -1082,15 +1081,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}

- /*
- * Any tail page need their mapcount reference taken before we
- * return.
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 5c586c78ca8d..dab30527ad41 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -52,7 +52,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
unsigned long mask, result;
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;

result = write ? 0 : _SEGMENT_ENTRY_PROTECT;
@@ -64,7 +64,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
@@ -85,16 +84,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
return 0;
}

- /*
- * Any tail page need their mapcount reference taken before we
- * return.
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 2e5c4fc2daa9..9091c5daa2e1 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
put_page(head);
return 0;
}
- if (head != page)
- get_huge_page_tail(page);

pages[*nr] = page;
(*nr)++;
@@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages,
int *nr)
{
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;

if (!(pmd_val(pmd) & _PAGE_VALID))
@@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
@@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
return 0;
}

- /* Any tail page need their mapcount reference taken before we
- * return.
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 81bf3d2af3eb..62a887a3cf50 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -137,8 +137,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
@@ -214,8 +212,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 34bbf769d52e..3c5fe722cc14 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -102,14 +102,13 @@ static inline int split_huge_page(struct page *page)
{
return split_huge_page_to_list(page, NULL);
}
-extern void __split_huge_page_pmd(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd);
+extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address);
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
- __split_huge_page_pmd(__vma, __address, \
- ____pmd); \
+ __split_huge_pmd(__vma, __pmd, __address); \
} while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 624cbeb58048..43468ebefaff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -433,31 +433,10 @@ static inline void compound_unlock_irqrestore(struct page *page,
#endif
}

-static inline struct page *compound_head_by_tail(struct page *tail)
-{
- struct page *head = tail->first_page;
-
- /*
- * page->first_page may be a dangling pointer to an old
- * compound page, so recheck that it is still a tail
- * page before returning.
- */
- smp_rmb();
- if (likely(PageTail(tail)))
- return head;
- return tail;
-}
-
-/*
- * Since either compound page could be dismantled asynchronously in THP
- * or we access asynchronously arbitrary positioned struct page, there
- * would be tail flag race. To handle this race, we should call
- * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
- */
static inline struct page *compound_head(struct page *page)
{
if (unlikely(PageTail(page)))
- return compound_head_by_tail(page);
+ return page->first_page;
return page;
}

@@ -515,50 +494,11 @@ static inline int PageHeadHuge(struct page *page_head)
}
#endif /* CONFIG_HUGETLB_PAGE */

-static inline bool __compound_tail_refcounted(struct page *page)
-{
- return !PageSlab(page) && !PageHeadHuge(page);
-}
-
-/*
- * This takes a head page as parameter and tells if the
- * tail page reference counting can be skipped.
- *
- * For this to be safe, PageSlab and PageHeadHuge must remain true on
- * any given page where they return true here, until all tail pins
- * have been released.
- */
-static inline bool compound_tail_refcounted(struct page *page)
-{
- VM_BUG_ON_PAGE(!PageHead(page), page);
- return __compound_tail_refcounted(page);
-}
-
-static inline void get_huge_page_tail(struct page *page)
-{
- /*
- * __split_huge_page_refcount() cannot run from under us.
- */
- VM_BUG_ON_PAGE(!PageTail(page), page);
- VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
- if (compound_tail_refcounted(page->first_page))
- atomic_inc(&page->_mapcount);
-}
-
-extern bool __get_page_tail(struct page *page);
-
static inline void get_page(struct page *page)
{
- if (unlikely(PageTail(page)))
- if (likely(__get_page_tail(page)))
- return;
- /*
- * Getting a normal page or the head of a compound page
- * requires to already have an elevated page->_count.
- */
- VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
- atomic_inc(&page->_count);
+ struct page *page_head = compound_head(page);
+ VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
+ atomic_inc(&page_head->_count);
}

static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2d19a4b6f6a6..1087672a04d5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -93,20 +93,9 @@ struct page {

union {
/*
- * Count of ptes mapped in
- * mms, to show when page is
- * mapped & limit reverse map
- * searches.
- *
- * Used also for tail pages
- * refcounting instead of
- * _count. Tail pages cannot
- * be mapped and keeping the
- * tail page _count zero at
- * all times guarantees
- * get_page_unless_zero() will
- * never succeed on tail
- * pages.
+ * Count of ptes mapped in mms, to show
+ * when page is mapped & limit reverse
+ * map searches.
*/
atomic_t _mapcount;

@@ -117,7 +106,7 @@ struct page {
};
int units; /* SLOB */
};
- atomic_t _count; /* Usage count, see below. */
+ atomic_t _count; /* Usage count, see below. */
};
unsigned int active; /* SLAB */
};
diff --git a/mm/Kconfig b/mm/Kconfig
index a03131b6ba8e..9ce853d2af5d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -409,7 +409,7 @@ config NOMMU_INITIAL_TRIM_EXCESS

config TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
- depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
+ depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !MEMCG
select COMPACTION
help
Transparent Hugepages allows the kernel to use huge pages and
diff --git a/mm/gup.c b/mm/gup.c
index 022d7a91de03..0c8d076d7744 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -980,7 +980,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;

if (write && !pmd_write(orig))
@@ -989,7 +989,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
refs = 0;
head = pmd_page(orig);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
@@ -1010,24 +1009,13 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;
}

- /*
- * Any tail pages need their mapcount reference taken before we
- * return. (This allows the THP code to bump their ref count when
- * they are split into base pages).
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;

if (write && !pud_write(orig))
@@ -1036,7 +1024,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
refs = 0;
head = pud_page(orig);
page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
@@ -1057,12 +1044,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return 0;
}

- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

@@ -1071,7 +1052,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
struct page **pages, int *nr)
{
int refs;
- struct page *head, *page, *tail;
+ struct page *head, *page;

if (write && !pgd_write(orig))
return 0;
@@ -1100,12 +1081,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
return 0;
}

- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d447afc039f9..46c3cd26f837 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -932,37 +932,6 @@ unlock:
spin_unlock(ptl);
}

-/*
- * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages
- * during copy_user_huge_page()'s copy_page_rep(): in the case when
- * the source page gets split and a tail freed before copy completes.
- * Called under pmd_lock of checked pmd, so safe from splitting itself.
- */
-static void get_user_huge_page(struct page *page)
-{
- if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
- struct page *endpage = page + HPAGE_PMD_NR;
-
- atomic_add(HPAGE_PMD_NR, &page->_count);
- while (++page < endpage)
- get_huge_page_tail(page);
- } else {
- get_page(page);
- }
-}
-
-static void put_user_huge_page(struct page *page)
-{
- if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
- struct page *endpage = page + HPAGE_PMD_NR;
-
- while (page < endpage)
- put_page(page++);
- } else {
- put_page(page);
- }
-}
-
static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
@@ -1113,7 +1082,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
ret |= VM_FAULT_WRITE;
goto out_unlock;
}
- get_user_huge_page(page);
+ get_page(page);
spin_unlock(ptl);
alloc:
if (transparent_hugepage_enabled(vma) &&
@@ -1136,7 +1105,7 @@ alloc:
split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
}
- put_user_huge_page(page);
+ put_page(page);
}
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -1147,7 +1116,7 @@ alloc:
put_page(new_page);
if (page) {
split_huge_pmd(vma, pmd, address);
- put_user_huge_page(page);
+ put_page(page);
} else
split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
@@ -1169,7 +1138,7 @@ alloc:

spin_lock(ptl);
if (page)
- put_user_huge_page(page);
+ put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
spin_unlock(ptl);
mem_cgroup_cancel_charge(new_page, memcg);
@@ -1662,51 +1631,73 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
put_huge_zero_page();
}

-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd)
+
+static void __split_huge_pmd_locked(struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long address)
{
- spinlock_t *ptl;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
- unsigned long haddr = address & HPAGE_PMD_MASK;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int i;

BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);

+ if (is_huge_zero_pmd(*pmd))
+ return __split_huge_zero_page_pmd(vma, haddr, pmd);
+
+ page = pmd_page(*pmd);
+ VM_BUG_ON_PAGE(!page_count(page), page);
+ atomic_add(HPAGE_PMD_NR - 1, &page->_count);
+
+ /* leave pmd empty until pte is filled */
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t entry, *pte;
+ /*
+ * Note that NUMA hinting access restrictions are not
+ * transferred to avoid any possibility of altering
+ * permissions across VMAs.
+ */
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!pmd_write(*pmd))
+ entry = pte_wrprotect(entry);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ atomic_inc(&page[i]._mapcount);
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ atomic_dec(compound_mapcount_ptr(page));
+}
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address)
+{
+ spinlock_t *ptl;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long mmun_start; /* For mmu_notifiers */
+ unsigned long mmun_end; /* For mmu_notifiers */
+
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
-again:
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_trans_huge(*pmd))) {
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
- return;
- }
- if (is_huge_zero_pmd(*pmd)) {
- __split_huge_zero_page_pmd(vma, haddr, pmd);
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
- return;
- }
- page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!page_count(page), page);
- get_page(page);
+ if (likely(pmd_trans_huge(*pmd)))
+ __split_huge_pmd_locked(vma, pmd, address);
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
- split_huge_page(page);
-
- put_page(page);
-
- /*
- * We don't always have down_write of mmap_sem here: a racing
- * do_huge_pmd_wp_page() might have copied-on-write to another
- * huge page before our split_huge_page() got the anon_vma lock.
- */
- if (unlikely(pmd_trans_huge(*pmd)))
- goto again;
}

static void split_huge_pmd_address(struct vm_area_struct *vma,
@@ -1736,42 +1727,10 @@ static void split_huge_pmd_address(struct vm_area_struct *vma,
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- __split_huge_page_pmd(vma, address, pmd);
-}
-
-static int __split_huge_page_splitting(struct page *page,
- struct vm_area_struct *vma,
- unsigned long address)
-{
- struct mm_struct *mm = vma->vm_mm;
- spinlock_t *ptl;
- pmd_t *pmd;
- int ret = 0;
- /* For mmu_notifiers */
- const unsigned long mmun_start = address;
- const unsigned long mmun_end = address + HPAGE_PMD_SIZE;
-
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
- if (pmd) {
- /*
- * We can't temporarily set the pmd to null in order
- * to split it, the pmd must remain marked huge at all
- * times or the VM won't take the pmd_trans_huge paths
- * and it won't wait on the anon_vma->root->rwsem to
- * serialize against split_huge_page*.
- */
- pmdp_splitting_flush(vma, address, pmd);
-
- ret = 1;
- spin_unlock(ptl);
- }
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
- return ret;
+ __split_huge_pmd(vma, pmd, address);
}

+#if 0
static void __split_huge_page_refcount(struct page *page,
struct list_head *list)
{
@@ -1897,82 +1856,6 @@ static void __split_huge_page_refcount(struct page *page,
BUG_ON(page_count(page) <= 0);
}

-static int __split_huge_page_map(struct page *page,
- struct vm_area_struct *vma,
- unsigned long address)
-{
- struct mm_struct *mm = vma->vm_mm;
- spinlock_t *ptl;
- pmd_t *pmd, _pmd;
- int ret = 0, i;
- pgtable_t pgtable;
- unsigned long haddr;
-
- pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
- if (pmd) {
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
- if (pmd_write(*pmd))
- BUG_ON(page_mapcount(page) != 1);
-
- haddr = address;
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- BUG_ON(PageCompound(page+i));
- /*
- * Note that NUMA hinting access restrictions are not
- * transferred to avoid any possibility of altering
- * permissions across VMAs.
- */
- entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (!pmd_write(*pmd))
- entry = pte_wrprotect(entry);
- if (!pmd_young(*pmd))
- entry = pte_mkold(entry);
- pte = pte_offset_map(&_pmd, haddr);
- BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
- }
-
- smp_wmb(); /* make pte visible before pmd */
- /*
- * Up to this point the pmd is present and huge and
- * userland has the whole access to the hugepage
- * during the split (which happens in place). If we
- * overwrite the pmd with the not-huge version
- * pointing to the pte here (which of course we could
- * if all CPUs were bug free), userland could trigger
- * a small page size TLB miss on the small sized TLB
- * while the hugepage TLB entry is still established
- * in the huge TLB. Some CPU doesn't like that. See
- * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
- * Erratum 383 on page 93. Intel should be safe but is
- * also warns that it's only safe if the permission
- * and cache attributes of the two entries loaded in
- * the two TLB is identical (which should be the case
- * here). But it is generally safer to never allow
- * small and huge TLB entries for the same virtual
- * address to be loaded simultaneously. So instead of
- * doing "pmd_populate(); flush_tlb_range();" we first
- * mark the current pmd notpresent (atomically because
- * here the pmd_trans_huge and pmd_trans_splitting
- * must remain set at all times on the pmd until the
- * split is complete for this pmd), then we flush the
- * SMP TLB and finally we write the non-huge version
- * of the pmd entry with pmd_populate.
- */
- pmdp_invalidate(vma, address, pmd);
- pmd_populate(mm, pmd, pgtable);
- ret = 1;
- spin_unlock(ptl);
- }
-
- return ret;
-}
-
/* must be called with anon_vma->root->rwsem held */
static void __split_huge_page(struct page *page,
struct anon_vma *anon_vma,
@@ -2023,48 +1906,18 @@ static void __split_huge_page(struct page *page,
BUG();
}
}
+#endif

/*
* Split a hugepage into normal pages. This doesn't change the position of head
* page. If @list is null, tail pages will be added to LRU list, otherwise, to
* @list. Both head page and tail pages will inherit mapping, flags, and so on
* from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return 1.
+ * Return 0 if the hugepage is split successfully otherwise return -errno.
*/
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
- struct anon_vma *anon_vma;
- int ret = 1;
-
- BUG_ON(is_huge_zero_page(page));
- BUG_ON(!PageAnon(page));
-
- /*
- * The caller does not necessarily hold an mmap_sem that would prevent
- * the anon_vma disappearing so we first we take a reference to it
- * and then lock the anon_vma for write. This is similar to
- * page_lock_anon_vma_read except the write lock is taken to serialise
- * against parallel split or collapse operations.
- */
- anon_vma = page_get_anon_vma(page);
- if (!anon_vma)
- goto out;
- anon_vma_lock_write(anon_vma);
-
- ret = 0;
- if (!PageCompound(page))
- goto out_unlock;
-
- BUG_ON(!PageSwapBacked(page));
- __split_huge_page(page, anon_vma, list);
- count_vm_event(THP_SPLIT);
-
- BUG_ON(PageCompound(page));
-out_unlock:
- anon_vma_unlock_write(anon_vma);
- put_anon_vma(anon_vma);
-out:
- return ret;
+ return -EBUSY;
}

#define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
diff --git a/mm/internal.h b/mm/internal.h
index c4d6c9b43491..ed57cc24802b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -47,26 +47,6 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
}

-static inline void __get_page_tail_foll(struct page *page,
- bool get_page_head)
-{
- /*
- * If we're getting a tail page, the elevated page->_count is
- * required only in the head page and we will elevate the head
- * page->_count and tail page->_mapcount.
- *
- * We elevate page_tail->_mapcount for tail pages to force
- * page_tail->_count to be zero at all times to avoid getting
- * false positives from get_page_unless_zero() with
- * speculative page access (like in
- * page_cache_get_speculative()) on tail pages.
- */
- VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
- if (get_page_head)
- atomic_inc(&page->first_page->_count);
- get_huge_page_tail(page);
-}
-
/*
* This is meant to be called as the FOLL_GET operation of
* follow_page() and it must be called while holding the proper PT
@@ -74,14 +54,9 @@ static inline void __get_page_tail_foll(struct page *page,
*/
static inline void get_page_foll(struct page *page)
{
- if (unlikely(PageTail(page)))
- /*
- * This is safe only because
- * __split_huge_page_refcount() can't run under
- * get_page_foll() because we hold the proper PT lock.
- */
- __get_page_tail_foll(page, true);
- else {
+ if (unlikely(PageTail(page))) {
+ atomic_inc(&page->first_page->_count);
+ } else {
/*
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
diff --git a/mm/swap.c b/mm/swap.c
index cd3a5e64cea9..2e647d4dc6bb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -80,185 +80,12 @@ static void __put_compound_page(struct page *page)
(*dtor)(page);
}

-/**
- * Two special cases here: we could avoid taking compound_lock_irqsave
- * and could skip the tail refcounting(in _mapcount).
- *
- * 1. Hugetlbfs page:
- *
- * PageHeadHuge will remain true until the compound page
- * is released and enters the buddy allocator, and it could
- * not be split by __split_huge_page_refcount().
- *
- * So if we see PageHeadHuge set, and we have the tail page pin,
- * then we could safely put head page.
- *
- * 2. Slab THP page:
- *
- * PG_slab is cleared before the slab frees the head page, and
- * tail pin cannot be the last reference left on the head page,
- * because the slab code is free to reuse the compound page
- * after a kfree/kmem_cache_free without having to check if
- * there's any tail pin left. In turn all tail pinsmust be always
- * released while the head is still pinned by the slab code
- * and so we know PG_slab will be still set too.
- *
- * So if we see PageSlab set, and we have the tail page pin,
- * then we could safely put head page.
- */
-static __always_inline
-void put_unrefcounted_compound_page(struct page *page_head, struct page *page)
-{
- /*
- * If @page is a THP tail, we must read the tail page
- * flags after the head page flags. The
- * __split_huge_page_refcount side enforces write memory barriers
- * between clearing PageTail and before the head page
- * can be freed and reallocated.
- */
- smp_rmb();
- if (likely(PageTail(page))) {
- /*
- * __split_huge_page_refcount cannot race
- * here, see the comment above this function.
- */
- VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
- VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
- if (put_page_testzero(page_head)) {
- /*
- * If this is the tail of a slab THP page,
- * the tail pin must not be the last reference
- * held on the page, because the PG_slab cannot
- * be cleared before all tail pins (which skips
- * the _mapcount tail refcounting) have been
- * released.
- *
- * If this is the tail of a hugetlbfs page,
- * the tail pin may be the last reference on
- * the page instead, because PageHeadHuge will
- * not go away until the compound page enters
- * the buddy allocator.
- */
- VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
- __put_compound_page(page_head);
- }
- } else
- /*
- * __split_huge_page_refcount run before us,
- * @page was a THP tail. The split @page_head
- * has been freed and reallocated as slab or
- * hugetlbfs page of smaller order (only
- * possible if reallocated as slab on x86).
- */
- if (put_page_testzero(page))
- __put_single_page(page);
-}
-
-static __always_inline
-void put_refcounted_compound_page(struct page *page_head, struct page *page)
-{
- if (likely(page != page_head && get_page_unless_zero(page_head))) {
- unsigned long flags;
-
- /*
- * @page_head wasn't a dangling pointer but it may not
- * be a head page anymore by the time we obtain the
- * lock. That is ok as long as it can't be freed from
- * under us.
- */
- flags = compound_lock_irqsave(page_head);
- if (unlikely(!PageTail(page))) {
- /* __split_huge_page_refcount run before us */
- compound_unlock_irqrestore(page_head, flags);
- if (put_page_testzero(page_head)) {
- /*
- * The @page_head may have been freed
- * and reallocated as a compound page
- * of smaller order and then freed
- * again. All we know is that it
- * cannot have become: a THP page, a
- * compound page of higher order, a
- * tail page. That is because we
- * still hold the refcount of the
- * split THP tail and page_head was
- * the THP head before the split.
- */
- if (PageHead(page_head))
- __put_compound_page(page_head);
- else
- __put_single_page(page_head);
- }
-out_put_single:
- if (put_page_testzero(page))
- __put_single_page(page);
- return;
- }
- VM_BUG_ON_PAGE(page_head != page->first_page, page);
- /*
- * We can release the refcount taken by
- * get_page_unless_zero() now that
- * __split_huge_page_refcount() is blocked on the
- * compound_lock.
- */
- if (put_page_testzero(page_head))
- VM_BUG_ON_PAGE(1, page_head);
- /* __split_huge_page_refcount will wait now */
- VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
- atomic_dec(&page->_mapcount);
- VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
- VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
- compound_unlock_irqrestore(page_head, flags);
-
- if (put_page_testzero(page_head)) {
- if (PageHead(page_head))
- __put_compound_page(page_head);
- else
- __put_single_page(page_head);
- }
- } else {
- /* @page_head is a dangling pointer */
- VM_BUG_ON_PAGE(PageTail(page), page);
- goto out_put_single;
- }
-}
-
static void put_compound_page(struct page *page)
{
- struct page *page_head;
-
- /*
- * We see the PageCompound set and PageTail not set, so @page maybe:
- * 1. hugetlbfs head page, or
- * 2. THP head page.
- */
- if (likely(!PageTail(page))) {
- if (put_page_testzero(page)) {
- /*
- * By the time all refcounts have been released
- * split_huge_page cannot run anymore from under us.
- */
- if (PageHead(page))
- __put_compound_page(page);
- else
- __put_single_page(page);
- }
- return;
- }
+ struct page *page_head = compound_head(page);

- /*
- * We see the PageCompound set and PageTail set, so @page maybe:
- * 1. a tail hugetlbfs page, or
- * 2. a tail THP page, or
- * 3. a split THP page.
- *
- * Case 3 is possible, as we may race with
- * __split_huge_page_refcount tearing down a THP page.
- */
- page_head = compound_head_by_tail(page);
- if (!__compound_tail_refcounted(page_head))
- put_unrefcounted_compound_page(page_head, page);
- else
- put_refcounted_compound_page(page_head, page);
+ if (put_page_testzero(page_head))
+ __put_compound_page(page_head);
}

void put_page(struct page *page)
@@ -270,72 +97,6 @@ void put_page(struct page *page)
}
EXPORT_SYMBOL(put_page);

-/*
- * This function is exported but must not be called by anything other
- * than get_page(). It implements the slow path of get_page().
- */
-bool __get_page_tail(struct page *page)
-{
- /*
- * This takes care of get_page() if run on a tail page
- * returned by one of the get_user_pages/follow_page variants.
- * get_user_pages/follow_page itself doesn't need the compound
- * lock because it runs __get_page_tail_foll() under the
- * proper PT lock that already serializes against
- * split_huge_page().
- */
- unsigned long flags;
- bool got;
- struct page *page_head = compound_head(page);
-
- /* Ref to put_compound_page() comment. */
- if (!__compound_tail_refcounted(page_head)) {
- smp_rmb();
- if (likely(PageTail(page))) {
- /*
- * This is a hugetlbfs page or a slab
- * page. __split_huge_page_refcount
- * cannot race here.
- */
- VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
- __get_page_tail_foll(page, true);
- return true;
- } else {
- /*
- * __split_huge_page_refcount run
- * before us, "page" was a THP
- * tail. The split page_head has been
- * freed and reallocated as slab or
- * hugetlbfs page of smaller order
- * (only possible if reallocated as
- * slab on x86).
- */
- return false;
- }
- }
-
- got = false;
- if (likely(page != page_head && get_page_unless_zero(page_head))) {
- /*
- * page_head wasn't a dangling pointer but it
- * may not be a head page anymore by the time
- * we obtain the lock. That is ok as long as it
- * can't be freed from under us.
- */
- flags = compound_lock_irqsave(page_head);
- /* here __split_huge_page_refcount won't run anymore */
- if (likely(PageTail(page))) {
- __get_page_tail_foll(page, false);
- got = true;
- }
- compound_unlock_irqrestore(page_head, flags);
- if (unlikely(!got))
- put_page(page_head);
- }
- return got;
-}
-EXPORT_SYMBOL(__get_page_tail);
-
/**
* put_pages_list() - release a list of pages
* @pages: list of pages threaded on page->lru
--
2.1.4

2015-02-12 16:24:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 13/24] mm, vmstats: new THP splitting event

The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. This reflects the fact that we can
now split a PMD without splitting the compound page, and that
split_huge_page() can fail.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/vm_event_item.h | 4 +++-
mm/huge_memory.c | 3 +++
mm/vmstat.c | 4 +++-
3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef88b827..3261bfe2156a 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -69,7 +69,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
- THP_SPLIT,
+ THP_SPLIT_PAGE,
+ THP_SPLIT_PAGE_FAILED,
+ THP_SPLIT_PMD,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 46c3cd26f837..b5c1976e2a65 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1644,6 +1644,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma,

BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);

+ count_vm_event(THP_SPLIT_PMD);
+
if (is_huge_zero_pmd(*pmd))
return __split_huge_zero_page_pmd(vma, haddr, pmd);

@@ -1917,6 +1919,7 @@ static void __split_huge_page(struct page *page,
*/
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
+ count_vm_event(THP_SPLIT_PAGE_FAILED);
return -EBUSY;
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fd0886a389f..e1c87425fe11 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -821,7 +821,9 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
- "thp_split",
+ "thp_split_page",
+ "thp_split_page_failed",
+ "thp_split_pmd",
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
#endif
--
2.1.4

2015-02-12 16:19:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 14/24] thp: implement new split_huge_page()

The new split_huge_page() can fail if the compound page is pinned: we expect
only the caller to have a single reference to the head page. If the page is
pinned, split_huge_page() returns -EBUSY and the caller must handle this
correctly.

We don't need to mark PMDs as splitting anymore, since we can now split one
PMD at a time with split_huge_pmd().
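
An illustrative restatement (hypothetical variables "head" and "expected")
of the invariant the split now checks under compound_lock before committing:

	/*
	 * With only the caller's reference on the head page, an unpinned THP
	 * satisfies page_count(head) == 1 + sum of subpage mapcounts. Any
	 * extra pin, e.g. from get_user_pages(), breaks the equality and the
	 * split is refused with -EBUSY.
	 */
	int i, expected = 1;

	for (i = 0; i < HPAGE_PMD_NR; i++)
		expected += page_mapcount(head + i);
	if (page_count(head) != expected)
		return -EBUSY;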

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/hugetlb_inline.h | 9 +-
include/linux/mm.h | 22 +++--
mm/huge_memory.c | 183 +++++++++++++++++++++++------------------
mm/swap.c | 126 +++++++++++++++++++++++++++-
4 files changed, 244 insertions(+), 96 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 2bb681fbeb35..c5cd37479731 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -10,6 +10,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
return !!(vma->vm_flags & VM_HUGETLB);
}

+int PageHeadHuge(struct page *page_head);
+
#else

static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -17,6 +19,11 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
return 0;
}

-#endif
+static inline int PageHeadHuge(struct page *page_head)
+{
+ return 0;
+}
+
+#endif /* CONFIG_HUGETLB_PAGE */

#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43468ebefaff..5b7498631322 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -485,20 +485,18 @@ static inline int page_count(struct page *page)
return atomic_read(&compound_head(page)->_count);
}

-#ifdef CONFIG_HUGETLB_PAGE
-extern int PageHeadHuge(struct page *page_head);
-#else /* CONFIG_HUGETLB_PAGE */
-static inline int PageHeadHuge(struct page *page_head)
-{
- return 0;
-}
-#endif /* CONFIG_HUGETLB_PAGE */
-
+void __get_page_tail(struct page *page);
static inline void get_page(struct page *page)
{
- struct page *page_head = compound_head(page);
- VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
- atomic_inc(&page_head->_count);
+ if (unlikely(PageTail(page)))
+ return __get_page_tail(page);
+
+ /*
+ * Getting a normal page or the head of a compound page
+ * requires to already have an elevated page->_count.
+ */
+ VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ atomic_inc(&page->_count);
}

static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b5c1976e2a65..5d1d80d7d3b8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1732,31 +1732,52 @@ static void split_huge_pmd_address(struct vm_area_struct *vma,
__split_huge_pmd(vma, pmd, address);
}

-#if 0
-static void __split_huge_page_refcount(struct page *page,
+static int __split_huge_page_refcount(struct page *page,
struct list_head *list)
{
int i;
struct zone *zone = page_zone(page);
struct lruvec *lruvec;
- int tail_count = 0;
+ int tail_mapcount = 0;

/* prevent PageLRU to go away from under us, and freeze lru stats */
spin_lock_irq(&zone->lru_lock);
lruvec = mem_cgroup_page_lruvec(page, zone);

compound_lock(page);
+
+ /*
+ * We cannot split pinned THP page: we expect page count to be equal
+ * to sum of mapcount of all sub-pages plus one (split_huge_page()
+ * caller must take reference for head page).
+ *
+ * Compound lock only prevents page->_count to be updated from
+ * get_page() or put_page() on tail page. It means page_count()
+ * can change under us from head page after the check, but it's okay:
+ * all new references will stay on head page after split.
+ */
+ tail_mapcount = 0;
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ tail_mapcount += page_mapcount(page + i);
+ if (tail_mapcount != page_count(page) - 1) {
+ BUG_ON(tail_mapcount > page_count(page) - 1);
+ compound_unlock(page);
+ spin_unlock_irq(&zone->lru_lock);
+ return -EBUSY;
+ }
+
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(page);

+ tail_mapcount = 0;
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
struct page *page_tail = page + i;

/* tail_page->_mapcount cannot change */
- BUG_ON(atomic_read(&page_tail->_mapcount) + 1 < 0);
- tail_count += atomic_read(&page_tail->_mapcount) + 1;
+ BUG_ON(page_mapcount(page_tail) < 0);
+ tail_mapcount += page_mapcount(page_tail);
/* check for overflow */
- BUG_ON(tail_count < 0);
+ BUG_ON(tail_mapcount < 0);
BUG_ON(atomic_read(&page_tail->_count) != 0);
/*
* tail_page->_count is zero and not changing from
@@ -1794,28 +1815,9 @@ static void __split_huge_page_refcount(struct page *page,
/* clear PageTail before overwriting first_page */
smp_wmb();

- /*
- * __split_huge_page_splitting() already set the
- * splitting bit in all pmd that could map this
- * hugepage, that will ensure no CPU can alter the
- * mapcount on the head page. The mapcount is only
- * accounted in the head page and it has to be
- * transferred to all tail pages in the below code. So
- * for this code to be safe, the split the mapcount
- * can't change. But that doesn't mean userland can't
- * keep changing and reading the page contents while
- * we transfer the mapcount, so the pmd splitting
- * status is achieved setting a reserved bit in the
- * pmd, not by clearing the present bit.
- */
- atomic_set(&page_tail->_mapcount, compound_mapcount(page) - 1);
-
/* ->mapping in first tail page is compound_mapcount */
- if (i != 1) {
- BUG_ON(page_tail->mapping);
- page_tail->mapping = page->mapping;
- BUG_ON(!PageAnon(page_tail));
- }
+ BUG_ON(i != 1 && page_tail->mapping);
+ page_tail->mapping = page->mapping;

page_tail->index = page->index + i;
page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
@@ -1826,12 +1828,9 @@ static void __split_huge_page_refcount(struct page *page,

lru_add_page_tail(page, page_tail, lruvec, list);
}
- atomic_sub(tail_count, &page->_count);
+ atomic_sub(tail_mapcount, &page->_count);
BUG_ON(atomic_read(&page->_count) <= 0);

- page->_mapcount = *compound_mapcount_ptr(page);
- page[1].mapping = page->mapping;
-
__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);

ClearPageCompound(page);
@@ -1856,71 +1855,95 @@ static void __split_huge_page_refcount(struct page *page,
* to be pinned by the caller.
*/
BUG_ON(page_count(page) <= 0);
+ return 0;
}

-/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
- struct anon_vma *anon_vma,
- struct list_head *list)
+/*
+ * Split a hugepage into normal pages. This doesn't change the position of head
+ * page. If @list is null, tail pages will be added to LRU list, otherwise, to
+ * @list. Both head page and tail pages will inherit mapping, flags, and so on
+ * from the hugepage.
+ * Return 0 if the hugepage is split successfully otherwise return -errno.
+ */
+int split_huge_page_to_list(struct page *page, struct list_head *list)
{
- int mapcount, mapcount2;
- pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct anon_vma *anon_vma;
struct anon_vma_chain *avc;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ int i, tail_mapcount;
+ int ret = -EBUSY;

- BUG_ON(!PageHead(page));
- BUG_ON(PageTail(page));
+ BUG_ON(is_huge_zero_page(page));
+ BUG_ON(!PageAnon(page));

- mapcount = 0;
- anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
- struct vm_area_struct *vma = avc->vma;
- unsigned long addr = vma_address(page, vma);
- BUG_ON(is_vma_temporary_stack(vma));
- mapcount += __split_huge_page_splitting(page, vma, addr);
- }
/*
- * It is critical that new vmas are added to the tail of the
- * anon_vma list. This guarantes that if copy_huge_pmd() runs
- * and establishes a child pmd before
- * __split_huge_page_splitting() freezes the parent pmd (so if
- * we fail to prevent copy_huge_pmd() from running until the
- * whole __split_huge_page() is complete), we will still see
- * the newly established pmd of the child later during the
- * walk, to be able to set it as pmd_trans_splitting too.
+ * The caller does not necessarily hold an mmap_sem that would prevent
+ * the anon_vma disappearing so we first we take a reference to it
+ * and then lock the anon_vma for write. This is similar to
+ * page_lock_anon_vma_read except the write lock is taken to serialise
+ * against parallel split or collapse operations.
*/
- if (mapcount != page_mapcount(page)) {
- pr_err("mapcount %d page_mapcount %d\n",
- mapcount, page_mapcount(page));
- BUG();
+ anon_vma = page_get_anon_vma(page);
+ if (!anon_vma)
+ goto out;
+ anon_vma_lock_write(anon_vma);
+
+ if (!PageCompound(page)) {
+ ret = 0;
+ goto out_unlock;
}

- __split_huge_page_refcount(page, list);
+ BUG_ON(!PageSwapBacked(page));
+
+ /*
+ * Racy check if __split_huge_page_refcount() can be successful, before
+ * splitting PMDs.
+ */
+ tail_mapcount = compound_mapcount(page);
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ tail_mapcount += atomic_read(&page[i]._mapcount) + 1;
+ if (tail_mapcount != page_count(page) - 1) {
+ VM_BUG_ON_PAGE(tail_mapcount > page_count(page) - 1, page);
+ ret = -EBUSY;
+ goto out_unlock;
+ }

- mapcount2 = 0;
anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
struct vm_area_struct *vma = avc->vma;
unsigned long addr = vma_address(page, vma);
- BUG_ON(is_vma_temporary_stack(vma));
- mapcount2 += __split_huge_page_map(page, vma, addr);
- }
- if (mapcount != mapcount2) {
- pr_err("mapcount %d mapcount2 %d page_mapcount %d\n",
- mapcount, mapcount2, page_mapcount(page));
- BUG();
+ spinlock_t *ptl;
+ pmd_t *pmd;
+ unsigned long haddr = addr & HPAGE_PMD_MASK;
+ unsigned long mmun_start; /* For mmu_notifiers */
+ unsigned long mmun_end; /* For mmu_notifiers */
+
+ mmun_start = haddr;
+ mmun_end = haddr + HPAGE_PMD_SIZE;
+ mmu_notifier_invalidate_range_start(vma->vm_mm,
+ mmun_start, mmun_end);
+ pmd = page_check_address_pmd(page, vma->vm_mm, addr,
+ PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ if (pmd) {
+ __split_huge_pmd_locked(vma, pmd, addr);
+ spin_unlock(ptl);
+ }
+ mmu_notifier_invalidate_range_end(vma->vm_mm,
+ mmun_start, mmun_end);
}
-}
-#endif

-/*
- * Split a hugepage into normal pages. This doesn't change the position of head
- * page. If @list is null, tail pages will be added to LRU list, otherwise, to
- * @list. Both head page and tail pages will inherit mapping, flags, and so on
- * from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return -errno.
- */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
-{
- count_vm_event(THP_SPLIT_PAGE_FAILED);
- return -EBUSY;
+ BUG_ON(compound_mapcount(page));
+ ret = __split_huge_page_refcount(page, list);
+ BUG_ON(!ret && PageCompound(page));
+
+out_unlock:
+ anon_vma_unlock_write(anon_vma);
+ put_anon_vma(anon_vma);
+out:
+ if (ret)
+ count_vm_event(THP_SPLIT_PAGE_FAILED);
+ else
+ count_vm_event(THP_SPLIT_PAGE);
+ return ret;
}

#define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
diff --git a/mm/swap.c b/mm/swap.c
index 2e647d4dc6bb..7b4fbb26cc2c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -80,12 +80,86 @@ static void __put_compound_page(struct page *page)
(*dtor)(page);
}

+static inline bool compound_lock_needed(struct page *page)
+{
+ return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ !PageSlab(page) && !PageHeadHuge(page);
+}
+
static void put_compound_page(struct page *page)
{
- struct page *page_head = compound_head(page);
+ struct page *page_head;
+ unsigned long flags;
+
+ if (likely(!PageTail(page))) {
+ if (put_page_testzero(page)) {
+ /*
+ * By the time all refcounts have been released
+ * split_huge_page cannot run anymore from under us.
+ */
+ if (PageHead(page))
+ __put_compound_page(page);
+ else
+ __put_single_page(page);
+ }
+ return;
+ }
+
+ /* __split_huge_page_refcount can run under us */
+ page_head = compound_head(page);
+
+ if (!compound_lock_needed(page_head)) {
+ /*
+ * If "page" is a THP tail, we must read the tail page flags
+ * after the head page flags. The split_huge_page side enforces
+ * write memory barriers between clearing PageTail and before
+ * the head page can be freed and reallocated.
+ */
+ smp_rmb();
+ if (likely(PageTail(page))) {
+ /* __split_huge_page_refcount cannot race here. */
+ VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+ VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
+ if (put_page_testzero(page_head)) {
+ /*
+ * If this is the tail of a slab compound page,
+ * the tail pin must not be the last reference
+ * held on the page, because the PG_slab cannot
+ * be cleared before all tail pins (which skips
+ * the _mapcount tail refcounting) have been
+ * released. For hugetlbfs the tail pin may be
+ * the last reference on the page instead,
+ * because PageHeadHuge will not go away until
+ * the compound page enters the buddy
+ * allocator.
+ */
+ VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
+ __put_compound_page(page_head);
+ }
+ } else if (put_page_testzero(page))
+ __put_single_page(page);
+ return;
+ }

- if (put_page_testzero(page_head))
- __put_compound_page(page_head);
+ flags = compound_lock_irqsave(page_head);
+ /* here __split_huge_page_refcount won't run anymore */
+ if (likely(page != page_head && PageTail(page))) {
+ bool free;
+
+ free = put_page_testzero(page_head);
+ compound_unlock_irqrestore(page_head, flags);
+ if (free) {
+ if (PageHead(page_head))
+ __put_compound_page(page_head);
+ else
+ __put_single_page(page_head);
+ }
+ } else {
+ compound_unlock_irqrestore(page_head, flags);
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (put_page_testzero(page))
+ __put_single_page(page);
+ }
}

void put_page(struct page *page)
@@ -97,6 +171,52 @@ void put_page(struct page *page)
}
EXPORT_SYMBOL(put_page);

+/*
+ * This function is exported but must not be called by anything other
+ * than get_page(). It implements the slow path of get_page().
+ */
+void __get_page_tail(struct page *page)
+{
+ struct page *page_head = compound_head(page);
+ unsigned long flags;
+
+ if (!compound_lock_needed(page_head)) {
+ smp_rmb();
+ if (likely(PageTail(page))) {
+ /*
+ * This is a hugetlbfs page or a slab page.
+ * __split_huge_page_refcount cannot race here.
+ */
+ VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+ VM_BUG_ON(page_head != page->first_page);
+ VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
+ page);
+ atomic_inc(&page_head->_count);
+ } else {
+ /*
+ * __split_huge_page_refcount run before us, "page" was
+ * a thp tail. the split page_head has been freed and
+ * reallocated as slab or hugetlbfs page of smaller
+ * order (only possible if reallocated as slab on x86).
+ */
+ VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ atomic_inc(&page->_count);
+ }
+ return;
+ }
+
+ flags = compound_lock_irqsave(page_head);
+ /* here __split_huge_page_refcount won't run anymore */
+ if (unlikely(page == page_head || !PageTail(page) ||
+ !get_page_unless_zero(page_head))) {
+ /* page is not part of THP page anymore */
+ VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+ atomic_inc(&page->_count);
+ }
+ compound_unlock_irqrestore(page_head, flags);
+}
+EXPORT_SYMBOL(__get_page_tail);
+
/**
* put_pages_list() - release a list of pages
* @pages: list of pages threaded on page->lru
--
2.1.4

2015-02-12 16:20:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 15/24] mm, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we don't need to mark PMDs as splitting. Let's drop
the code that handles this.

Arch-specific code will be removed separately.
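
For reference, a minimal sketch of the caller pattern after this change
(example_pmd_walk() is a made-up caller; the real conversions are in the
hunks below):

static int example_pmd_walk(pmd_t *pmd, struct vm_area_struct *vma)
{
        spinlock_t *ptl;

        /*
         * The helper is now a plain boolean: no -1 "splitting" return
         * and no wait_split_huge_page() fallback to handle.
         */
        if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
                /* THP mapped by 'pmd' is stable while ptl is held */
                spin_unlock(ptl);
                return 0;
        }

        /* not a huge PMD: fall through to the PTE path */
        return 0;
}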

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 8 +++---
include/asm-generic/pgtable.h | 5 ----
include/linux/huge_mm.h | 16 ------------
mm/gup.c | 7 ------
mm/huge_memory.c | 57 +++++++++----------------------------------
mm/memcontrol.c | 14 ++---------
mm/memory.c | 18 ++------------
mm/mincore.c | 2 +-
mm/pgtable-generic.c | 14 -----------
mm/rmap.c | 4 +--
10 files changed, 21 insertions(+), 124 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a0a78174cc6..090008d88e0f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -534,7 +534,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
smaps_pmd_entry(pmd, addr, walk);
spin_unlock(ptl);
return 0;
@@ -799,7 +799,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
struct page *page;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
clear_soft_dirty_pmd(vma, addr, pmd);
goto out;
@@ -1112,7 +1112,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *pte, *orig_pte;
int err = 0;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
int pmd_flags2;

if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -1418,7 +1418,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
pte_t *orig_pte;
pte_t *pte;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
pte_t huge_pte = *(pte_t *)pmd;
struct page *page;

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 4d46085c1b90..34e99a82a336 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -178,11 +178,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif

-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp);
-#endif
-
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3c5fe722cc14..7a0c477a2b38 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -49,15 +49,9 @@ enum transparent_hugepage_flag {
#endif
};

-enum page_check_address_pmd_flag {
- PAGE_CHECK_ADDRESS_PMD_FLAG,
- PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
- PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
-};
extern pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag,
spinlock_t **ptl);
extern int pmd_freeable(pmd_t pmd);

@@ -110,14 +104,6 @@ extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (unlikely(pmd_trans_huge(*____pmd))) \
__split_huge_pmd(__vma, __pmd, __address); \
} while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
- do { \
- pmd_t *____pmd = (__pmd); \
- anon_vma_lock_write(__anon_vma); \
- anon_vma_unlock_write(__anon_vma); \
- BUG_ON(pmd_trans_splitting(*____pmd) || \
- pmd_trans_huge(*____pmd)); \
- } while (0)
#if HPAGE_PMD_ORDER >= MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -184,8 +170,6 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define wait_split_huge_page(__anon_vma, __pmd) \
- do { } while (0)
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/mm/gup.c b/mm/gup.c
index 0c8d076d7744..22585ef667d9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -208,13 +208,6 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags);
}
-
- if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(ptl);
- wait_split_huge_page(vma->anon_vma, pmd);
- return follow_page_pte(vma, address, pmd, flags);
- }
-
if (flags & FOLL_SPLIT) {
int ret;
page = pmd_page(*pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d1d80d7d3b8..fa79d3b89825 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -880,15 +880,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}

- if (unlikely(pmd_trans_splitting(pmd))) {
- /* split huge page running from under us */
- spin_unlock(src_ptl);
- spin_unlock(dst_ptl);
- pte_free(dst_mm, pgtable);
-
- wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
- goto out;
- }
src_page = pmd_page(pmd);
VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
get_page(src_page);
@@ -1392,7 +1383,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
struct page *page;
pgtable_t pgtable;
pmd_t orig_pmd;
@@ -1432,7 +1423,6 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
pmd_t *old_pmd, pmd_t *new_pmd)
{
spinlock_t *old_ptl, *new_ptl;
- int ret = 0;
pmd_t pmd;

struct mm_struct *mm = vma->vm_mm;
@@ -1441,7 +1431,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
(new_addr & ~HPAGE_PMD_MASK) ||
old_end - old_addr < HPAGE_PMD_SIZE ||
(new_vma->vm_flags & VM_NOHUGEPAGE))
- goto out;
+ return 0;

/*
* The destination pmd shouldn't be established, free_pgtables()
@@ -1449,15 +1439,14 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
*/
if (WARN_ON(!pmd_none(*new_pmd))) {
VM_BUG_ON(pmd_trans_huge(*new_pmd));
- goto out;
+ return 0;
}

/*
* We don't have to worry about the ordering of src and dst
* ptlocks because exclusive mmap_sem prevents deadlock.
*/
- ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
- if (ret == 1) {
+ if (__pmd_trans_huge_lock(old_pmd, vma, &old_ptl)) {
new_ptl = pmd_lockptr(mm, new_pmd);
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -1473,9 +1462,9 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
+ return 1;
}
-out:
- return ret;
+ return 0;
}

/*
@@ -1491,7 +1480,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
pmd_t entry;

/*
@@ -1529,17 +1518,8 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
spinlock_t **ptl)
{
*ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(pmd_trans_huge(*pmd))) {
- if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(*ptl);
- wait_split_huge_page(vma->anon_vma, pmd);
- return -1;
- } else {
- /* Thp mapped by 'pmd' is stable, so we can
- * handle it as it is. */
- return 1;
- }
- }
+ if (likely(pmd_trans_huge(*pmd)))
+ return 1;
spin_unlock(*ptl);
return 0;
}
@@ -1555,7 +1535,6 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag,
spinlock_t **ptl)
{
pgd_t *pgd;
@@ -1578,21 +1557,8 @@ pmd_t *page_check_address_pmd(struct page *page,
goto unlock;
if (pmd_page(*pmd) != page)
goto unlock;
- /*
- * split_vma() may create temporary aliased mappings. There is
- * no risk as long as all huge pmd are found and have their
- * splitting bit set before __split_huge_page_refcount
- * runs. Finding the same huge pmd more than once during the
- * same rmap walk is not a problem.
- */
- if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
- pmd_trans_splitting(*pmd))
- goto unlock;
- if (pmd_trans_huge(*pmd)) {
- VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
- !pmd_trans_splitting(*pmd));
+ if (pmd_trans_huge(*pmd))
return pmd;
- }
unlock:
spin_unlock(*ptl);
return NULL;
@@ -1921,8 +1887,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
mmun_end = haddr + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(vma->vm_mm,
mmun_start, mmun_end);
- pmd = page_check_address_pmd(page, vma->vm_mm, addr,
- PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ pmd = page_check_address_pmd(page, vma->vm_mm, addr, &ptl);
if (pmd) {
__split_huge_pmd_locked(vma, pmd, addr);
spin_unlock(ptl);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d18d3a6e7337..1c6786c457bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4897,7 +4897,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
mc.precharge += HPAGE_PMD_NR;
spin_unlock(ptl);
@@ -5065,17 +5065,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
union mc_target target;
struct page *page;

- /*
- * We don't take compound_lock() here but no race with splitting thp
- * happens because:
- * - if pmd_trans_huge_lock() returns 1, the relevant thp is not
- * under splitting, which means there's no concurrent thp split,
- * - if another thread runs into split_huge_page() just after we
- * entered this if-block, the thread must wait for page table lock
- * to be unlocked in __split_huge_page_splitting(), where the main
- * part of thp split is not executed yet.
- */
- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
if (mc.precharge < HPAGE_PMD_NR) {
spin_unlock(ptl);
return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 6f030a3a7636..c1878afb6466 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -565,7 +565,6 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
{
spinlock_t *ptl;
pgtable_t new = pte_alloc_one(mm, address);
- int wait_split_huge_page;
if (!new)
return -ENOMEM;

@@ -585,18 +584,14 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */

ptl = pmd_lock(mm, pmd);
- wait_split_huge_page = 0;
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
atomic_long_inc(&mm->nr_ptes);
pmd_populate(mm, pmd, new);
new = NULL;
- } else if (unlikely(pmd_trans_splitting(*pmd)))
- wait_split_huge_page = 1;
+ }
spin_unlock(ptl);
if (new)
pte_free(mm, new);
- if (wait_split_huge_page)
- wait_split_huge_page(vma->anon_vma, pmd);
return 0;
}

@@ -612,8 +607,7 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
pmd_populate_kernel(&init_mm, pmd, new);
new = NULL;
- } else
- VM_BUG_ON(pmd_trans_splitting(*pmd));
+ }
spin_unlock(&init_mm.page_table_lock);
if (new)
pte_free_kernel(&init_mm, new);
@@ -3222,14 +3216,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (pmd_trans_huge(orig_pmd)) {
unsigned int dirty = flags & FAULT_FLAG_WRITE;

- /*
- * If the pmd is splitting, return and retry the
- * the fault. Alternative: wait until the split
- * is done, and goto retry.
- */
- if (pmd_trans_splitting(orig_pmd))
- return 0;
-
if (pmd_protnone(orig_pmd))
return do_huge_pmd_numa_page(mm, vma, address,
orig_pmd, pmd);
diff --git a/mm/mincore.c b/mm/mincore.c
index be25efde64a4..feb867f5fdf4 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -117,7 +117,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
unsigned char *vec = walk->private;
int nr = (end - addr) >> PAGE_SHIFT;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
memset(vec, 1, nr);
spin_unlock(ptl);
goto out;
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c25f94b33811..2fe699cedd4d 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -133,20 +133,6 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif

-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmdp)
-{
- pmd_t pmd = pmd_mksplitting(*pmdp);
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- set_pmd_at(vma->vm_mm, address, pmdp, pmd);
- /* tlb flush only to serialize against gup-fast */
- flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#endif
-
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
diff --git a/mm/rmap.c b/mm/rmap.c
index db8b99e48966..eb2f4a0d3961 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -730,8 +730,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
* rmap might return false positives; we must filter
* these out using page_check_address_pmd().
*/
- pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ pmd = page_check_address_pmd(page, mm, address, &ptl);
if (!pmd)
return SWAP_AGAIN;

@@ -741,7 +740,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
return SWAP_FAIL; /* To break the loop */
}

- /* go ahead even if the pmd is pmd_trans_splitting() */
if (pmdp_clear_flush_young_notify(vma, address, pmd))
referenced++;

--
2.1.4

2015-02-12 16:19:46

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 16/24] x86, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we don't need to mark PMDs as splitting. Let's drop
the code that handles this.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable.h | 9 ---------
arch/x86/include/asm/pgtable_types.h | 2 --
arch/x86/mm/gup.c | 13 +------------
arch/x86/mm/pgtable.c | 14 --------------
4 files changed, 1 insertion(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 9d0ade00923e..c6243e9f1666 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -158,11 +158,6 @@ static inline int pmd_large(pmd_t pte)
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
- return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
static inline int pmd_trans_huge(pmd_t pmd)
{
return pmd_val(pmd) & _PAGE_PSE;
@@ -792,10 +787,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp);


-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmdp);
-
#define __HAVE_ARCH_PMD_WRITE
static inline int pmd_write(pmd_t pmd)
{
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 8c7c10802e9c..706f2f06d5b0 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,7 +22,6 @@
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
#define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
-#define _PAGE_BIT_SPLITTING _PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
#define _PAGE_BIT_HIDDEN _PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
@@ -46,7 +45,6 @@
#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
#define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
-#define _PAGE_SPLITTING (_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
#define __HAVE_ARCH_PTE_SPECIAL

#ifdef CONFIG_KMEMCHECK
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 62a887a3cf50..49bbbc57603b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -157,18 +157,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = *pmdp;

next = pmd_addr_end(addr, end);
- /*
- * The pmd_trans_splitting() check below explains why
- * pmdp_splitting_flush has to flush the tlb, to stop
- * this gup-fast code from running while we set the
- * splitting bit in the pmd. Returning zero will take
- * the slow path that will call wait_split_huge_page()
- * if the pmd is still in splitting state. gup-fast
- * can't because it has irq disabled and
- * wait_split_huge_page() would never return as the
- * tlb flush IPI wouldn't run.
- */
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd))
return 0;
if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
/*
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7b22adaad4f1..a4c75bdee21d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -433,20 +433,6 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,

return young;
}
-
-void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp)
-{
- int set;
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
- (unsigned long *)pmdp);
- if (set) {
- pmd_update(vma->vm_mm, address, pmdp);
- /* need tlb flush only to serialize against gup-fast */
- flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
- }
-}
#endif

/**
--
2.1.4

2015-02-12 16:24:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 17/24] futex, thp: remove special case for THP in get_futex_key

With the new THP refcounting, we don't need tricks to stabilize the huge
page: if we hold a reference to a tail page, the compound page can't be
split under us.

This patch effectively reverts a5b338f2b0b1.
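
For reference, the resulting shape of the key lookup (a condensed sketch of
the code below, with error handling and the shmem retry loop omitted):

        /*
         * A single gup reference keeps the compound page from being split,
         * so no head-page re-pinning and no irq games are needed.
         */
        if (get_user_pages_fast(address, 1, 1, &page) != 1)
                return -EFAULT;

        lock_page(page);
        if (PageAnon(page)) {
                key->both.offset |= FUT_OFF_MMSHARED;
                key->private.mm = mm;
                key->private.address = address;
        } else {
                key->both.offset |= FUT_OFF_INODE;
                key->shared.inode = page->mapping->host;
                key->shared.pgoff = basepage_index(page);
        }
        unlock_page(page);
        put_page(page);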

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
kernel/futex.c | 61 ++++++++++++----------------------------------------------
1 file changed, 12 insertions(+), 49 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2a5e3830e953..1809371ebef8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -399,7 +399,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
- struct page *page, *page_head;
+ struct page *page;
int err, ro = 0;

/*
@@ -442,46 +442,9 @@ again:
else
err = 0;

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- page_head = page;
- if (unlikely(PageTail(page))) {
- put_page(page);
- /* serialize against __split_huge_page_splitting() */
- local_irq_disable();
- if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
- page_head = compound_head(page);
- /*
- * page_head is valid pointer but we must pin
- * it before taking the PG_lock and/or
- * PG_compound_lock. The moment we re-enable
- * irqs __split_huge_page_splitting() can
- * return and the head page can be freed from
- * under us. We can't take the PG_lock and/or
- * PG_compound_lock on a page that could be
- * freed from under us.
- */
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
- local_irq_enable();
- } else {
- local_irq_enable();
- goto again;
- }
- }
-#else
- page_head = compound_head(page);
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
-#endif
-
- lock_page(page_head);
-
+ lock_page(page);
/*
- * If page_head->mapping is NULL, then it cannot be a PageAnon
+ * If page->mapping is NULL, then it cannot be a PageAnon
* page; but it might be the ZERO_PAGE or in the gate area or
* in a special mapping (all cases which we are happy to fail);
* or it may have been a good file page when get_user_pages_fast
@@ -493,12 +456,12 @@ again:
*
* The case we do have to guard against is when memory pressure made
* shmem_writepage move it from filecache to swapcache beneath us:
- * an unlikely race, but we do need to retry for page_head->mapping.
+ * an unlikely race, but we do need to retry for page->mapping.
*/
- if (!page_head->mapping) {
- int shmem_swizzled = PageSwapCache(page_head);
- unlock_page(page_head);
- put_page(page_head);
+ if (!page->mapping) {
+ int shmem_swizzled = PageSwapCache(page);
+ unlock_page(page);
+ put_page(page);
if (shmem_swizzled)
goto again;
return -EFAULT;
@@ -511,7 +474,7 @@ again:
* it's a read-only handle, it's expected that futexes attach to
* the object not the particular process.
*/
- if (PageAnon(page_head)) {
+ if (PageAnon(page)) {
/*
* A RO anonymous page will never change and thus doesn't make
* sense for futex operations.
@@ -526,15 +489,15 @@ again:
key->private.address = address;
} else {
key->both.offset |= FUT_OFF_INODE; /* inode-based key */
- key->shared.inode = page_head->mapping->host;
+ key->shared.inode = page->mapping->host;
key->shared.pgoff = basepage_index(page);
}

get_futex_key_refs(key); /* implies MB (B) */

out:
- unlock_page(page_head);
- put_page(page_head);
+ unlock_page(page);
+ put_page(page);
return err;
}

--
2.1.4

2015-02-12 16:23:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 18/24] thp, mm: split_huge_page(): caller need to lock page

We're going to use migration entries instead of compound_lock() to
stabilize page refcounts. Setting up and removing migration entries
requires the page to be locked.

Some split_huge_page() callers already have the page locked. Let's
require every caller to lock the page before calling split_huge_page().
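
The resulting caller pattern is the same in the ksm, memory-failure and
migration hunks below; roughly (split_huge_page_locked() is a hypothetical
wrapper, shown only to illustrate the rule):

static int split_huge_page_locked(struct page *head)
{
        int ret;

        lock_page(head);                /* now required by split_huge_page() */
        ret = split_huge_page(head);    /* fails with -EBUSY if the page is pinned */
        unlock_page(head);

        return ret;
}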

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 1 +
mm/ksm.c | 6 ++++--
mm/memory-failure.c | 12 +++++++++---
mm/migrate.c | 8 ++++++--
4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fa79d3b89825..bb9be39de242 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1841,6 +1841,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)

BUG_ON(is_huge_zero_page(page));
BUG_ON(!PageAnon(page));
+ BUG_ON(!PageLocked(page));

/*
* The caller does not necessarily hold an mmap_sem that would prevent
diff --git a/mm/ksm.c b/mm/ksm.c
index 92182eeba87d..a8a88b0f6f62 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -987,9 +987,11 @@ static int page_trans_compound_anon_split(struct page *page)
* Recheck we got the reference while the head
* was still anonymous.
*/
- if (PageAnon(transhuge_head))
+ if (PageAnon(transhuge_head)) {
+ lock_page(transhuge_head);
ret = split_huge_page(transhuge_head);
- else
+ unlock_page(transhuge_head);
+ } else
/*
* Retry later if split_huge_page run
* from under us.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1a735fad2a13..006a891c9222 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -950,7 +950,10 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
* enough * to be safe.
*/
if (!PageHuge(hpage) && PageAnon(hpage)) {
- if (unlikely(split_huge_page(hpage))) {
+ lock_page(hpage);
+ ret = split_huge_page(hpage);
+ unlock_page(hpage);
+ if (unlikely(ret)) {
/*
* FIXME: if splitting THP is failed, it is
* better to stop the following operation rather
@@ -1696,10 +1699,13 @@ int soft_offline_page(struct page *page, int flags)
return -EBUSY;
}
if (!PageHuge(page) && PageTransHuge(hpage)) {
- if (PageAnon(hpage) && unlikely(split_huge_page(hpage))) {
+ lock_page(page);
+ ret = split_huge_page(hpage);
+ unlock_page(page);
+ if (unlikely(ret)) {
pr_info("soft offline: %#lx: failed to split THP\n",
pfn);
- return -EBUSY;
+ return ret;
}
}

diff --git a/mm/migrate.c b/mm/migrate.c
index 01449826b914..91a67029bb18 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -920,9 +920,13 @@ static int unmap_and_move(new_page_t get_new_page, free_page_t put_new_page,
goto out;
}

- if (unlikely(PageTransHuge(page)))
- if (unlikely(split_huge_page(page)))
+ if (unlikely(PageTransHuge(page))) {
+ lock_page(page);
+ rc = split_huge_page(page);
+ unlock_page(page);
+ if (rc)
goto out;
+ }

rc = __unmap_and_move(page, newpage, force, mode);

--
2.1.4

2015-02-12 16:20:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 19/24] thp, mm: use migration entries to freeze page counts on split

Currently, we rely on compound_lock() to keep page counts stable while
splitting distributes the refcounts. To make that work we also take the
lock in get_page() and put_page(), which are hot paths.

This patch reworks the splitting code to set up migration entries that
stabilize page count/mapcount before the refcounts are distributed. As a
result we no longer need the compound lock in get_page()/put_page().
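
In outline, the split path below becomes (a simplified sketch of
__split_huge_page_refcount() after this patch, not the exact code):

        freeze_page(anon_vma, page);    /* replace the PMD/PTEs mapping the
                                         * THP with migration entries; the
                                         * page lock keeps them in place */

        /* recheck page_count() against the summed mapcounts while the page
         * is frozen; back off with -EBUSY if the page is pinned */

        /* distribute _count and _mapcount to the tail pages and
         * ClearPageCompound() the head */

        unfreeze_page(anon_vma, page);  /* remove_migration_pte() restores
                                         * working PTEs on the small pages */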

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/migrate.h | 3 +
include/linux/mm.h | 1 +
include/linux/pagemap.h | 9 ++-
mm/huge_memory.c | 184 ++++++++++++++++++++++++++++++++++--------------
mm/internal.h | 26 +++++--
mm/migrate.c | 2 +-
mm/rmap.c | 21 ------
7 files changed, 168 insertions(+), 78 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 78baed5f2952..6b02c11a3c40 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -43,6 +43,9 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode,
int extra_count);
+extern int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+ unsigned long addr, void *old);
+
#else

static inline void putback_movable_pages(struct list_head *l) {}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5b7498631322..655d2bfabdd9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -981,6 +981,7 @@ extern struct address_space *page_mapping(struct page *page);
/* Neutral page->mapping pointer to address_space or anon_vma or other */
static inline void *page_rmapping(struct page *page)
{
+ page = compound_head(page);
return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
}

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ad6da4e49555..faef48e04fc4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -387,10 +387,17 @@ static inline struct page *read_mapping_page(struct address_space *mapping,
*/
static inline pgoff_t page_to_pgoff(struct page *page)
{
+ pgoff_t pgoff;
+
if (unlikely(PageHeadHuge(page)))
return page->index << compound_order(page);
- else
+
+ if (likely(!PageTransTail(page)))
return page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+ pgoff = page->first_page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ pgoff += page - page->first_page;
+ return pgoff;
}

/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bb9be39de242..7157975eeb1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
#include <linux/pagemap.h>
#include <linux/migrate.h>
#include <linux/hashtable.h>
+#include <linux/swapops.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -1599,7 +1600,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,


static void __split_huge_pmd_locked(struct vm_area_struct *vma,
- pmd_t *pmd, unsigned long address)
+ pmd_t *pmd, unsigned long address, int freeze)
{
unsigned long haddr = address & HPAGE_PMD_MASK;
struct page *page;
@@ -1632,12 +1633,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma,
* transferred to avoid any possibility of altering
* permissions across VMAs.
*/
- entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (!pmd_write(*pmd))
- entry = pte_wrprotect(entry);
- if (!pmd_young(*pmd))
- entry = pte_mkold(entry);
+ if (freeze) {
+ swp_entry_t swp_entry;
+ swp_entry = make_migration_entry(page + i,
+ pmd_write(*pmd));
+ entry = swp_entry_to_pte(swp_entry);
+ } else {
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!pmd_write(*pmd))
+ entry = pte_wrprotect(entry);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+ }
pte = pte_offset_map(&_pmd, haddr);
BUG_ON(!pte_none(*pte));
atomic_inc(&page[i]._mapcount);
@@ -1663,7 +1671,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
ptl = pmd_lock(mm, pmd);
if (likely(pmd_trans_huge(*pmd)))
- __split_huge_pmd_locked(vma, pmd, address);
+ __split_huge_pmd_locked(vma, pmd, address, 0);
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
}
@@ -1698,20 +1706,119 @@ static void split_huge_pmd_address(struct vm_area_struct *vma,
__split_huge_pmd(vma, pmd, address);
}

-static int __split_huge_page_refcount(struct page *page,
- struct list_head *list)
+static void freeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+ struct anon_vma_chain *avc;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ unsigned long addr, haddr;
+ unsigned long mmun_start, mmun_end;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *start_pte, *pte;
+ spinlock_t *ptl;
+
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+ vma = avc->vma;
+ mm = vma->vm_mm;
+ haddr = addr = __vma_address(page, vma) & HPAGE_PMD_MASK;
+ mmun_start = haddr;
+ mmun_end = haddr + HPAGE_PMD_SIZE;
+ mmu_notifier_invalidate_range_start(vma->vm_mm,
+ mmun_start, mmun_end);
+
+ pgd = pgd_offset(vma->vm_mm, addr);
+ if (!pgd_present(*pgd))
+ goto next;
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto next;
+ pmd = pmd_offset(pud, addr);
+
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (!pmd_present(*pmd)) {
+ spin_unlock(ptl);
+ goto next;
+ }
+ if (pmd_trans_huge(*pmd)) {
+ if (page == pmd_page(*pmd))
+ __split_huge_pmd_locked(vma, pmd, addr, 1);
+ spin_unlock(ptl);
+ goto next;
+ }
+ spin_unlock(ptl);
+
+ start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ pte = start_pte;
+ do {
+ pte_t entry, swp_pte;
+ swp_entry_t swp_entry;
+
+ if (!pte_present(*pte))
+ continue;
+ if (page_to_pfn(page) != pte_pfn(*pte))
+ continue;
+ flush_cache_page(vma, addr, page_to_pfn(page));
+ entry = ptep_clear_flush(vma, addr, pte);
+ swp_entry = make_migration_entry(page,
+ pte_write(entry));
+ swp_pte = swp_entry_to_pte(swp_entry);
+ if (pte_soft_dirty(entry))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ set_pte_at(vma->vm_mm, addr, pte, swp_pte);
+ } while (pte++, addr += PAGE_SIZE, page++, addr != mmun_end);
+ pte_unmap_unlock(start_pte, ptl);
+next:
+ mmu_notifier_invalidate_range_end(vma->vm_mm,
+ mmun_start, mmun_end);
+ }
+}
+
+static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+ struct anon_vma_chain *avc;
+ pgoff_t pgoff = page_to_pgoff(page);
+ unsigned long addr;
+ int i;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, pgoff++, page++) {
+ if (!page_mapcount(page))
+ continue;
+
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+ pgoff, pgoff) {
+ addr = vma_address(page, avc->vma);
+
+ remove_migration_pte(page, avc->vma, addr, page);
+
+ /*
+ * remove_migration_pte() adds page to rmap, but we
+ * didn't remove it on freeze_page().
+ * Let's fix it up here.
+ */
+ page_remove_rmap(page, false);
+ put_page(page);
+ }
+ }
+}
+
+static int __split_huge_page_refcount(struct anon_vma *anon_vma,
+ struct page *page, struct list_head *list)
{
int i;
struct zone *zone = page_zone(page);
struct lruvec *lruvec;
int tail_mapcount = 0;

+ freeze_page(anon_vma, page);
+ BUG_ON(compound_mapcount(page));
+
/* prevent PageLRU to go away from under us, and freeze lru stats */
spin_lock_irq(&zone->lru_lock);
lruvec = mem_cgroup_page_lruvec(page, zone);

- compound_lock(page);
-
/*
* We cannot split pinned THP page: we expect page count to be equal
* to sum of mapcount of all sub-pages plus one (split_huge_page()
@@ -1727,8 +1834,8 @@ static int __split_huge_page_refcount(struct page *page,
tail_mapcount += page_mapcount(page + i);
if (tail_mapcount != page_count(page) - 1) {
BUG_ON(tail_mapcount > page_count(page) - 1);
- compound_unlock(page);
spin_unlock_irq(&zone->lru_lock);
+ unfreeze_page(anon_vma, page);
return -EBUSY;
}

@@ -1775,6 +1882,7 @@ static int __split_huge_page_refcount(struct page *page,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
+ (1L << PG_locked) |
(1L << PG_unevictable)));
page_tail->flags |= (1L << PG_dirty);

@@ -1800,12 +1908,14 @@ static int __split_huge_page_refcount(struct page *page,
__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);

ClearPageCompound(page);
- compound_unlock(page);
spin_unlock_irq(&zone->lru_lock);

+ unfreeze_page(anon_vma, page);
+
for (i = 1; i < HPAGE_PMD_NR; i++) {
struct page *page_tail = page + i;
BUG_ON(page_count(page_tail) <= 0);
+ unlock_page(page_tail);
/*
* Tail pages may be freed if there wasn't any mapping
* like if add_to_swap() is running on a lru page that
@@ -1834,14 +1944,13 @@ static int __split_huge_page_refcount(struct page *page,
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct anon_vma *anon_vma;
- struct anon_vma_chain *avc;
- pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
int i, tail_mapcount;
- int ret = -EBUSY;
+ int ret = 0;

BUG_ON(is_huge_zero_page(page));
BUG_ON(!PageAnon(page));
BUG_ON(!PageLocked(page));
+ BUG_ON(PageTail(page));

/*
* The caller does not necessarily hold an mmap_sem that would prevent
@@ -1852,15 +1961,12 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
*/
anon_vma = page_get_anon_vma(page);
if (!anon_vma)
- goto out;
+ return -EBUSY;
anon_vma_lock_write(anon_vma);

- if (!PageCompound(page)) {
- ret = 0;
- goto out_unlock;
- }
-
BUG_ON(!PageSwapBacked(page));
+ if (!PageCompound(page))
+ goto out;

/*
* Racy check if __split_huge_page_refcount() can be successful, before
@@ -1872,39 +1978,15 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
if (tail_mapcount != page_count(page) - 1) {
VM_BUG_ON_PAGE(tail_mapcount > page_count(page) - 1, page);
ret = -EBUSY;
- goto out_unlock;
- }
-
- anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
- struct vm_area_struct *vma = avc->vma;
- unsigned long addr = vma_address(page, vma);
- spinlock_t *ptl;
- pmd_t *pmd;
- unsigned long haddr = addr & HPAGE_PMD_MASK;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
-
- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(vma->vm_mm,
- mmun_start, mmun_end);
- pmd = page_check_address_pmd(page, vma->vm_mm, addr, &ptl);
- if (pmd) {
- __split_huge_pmd_locked(vma, pmd, addr);
- spin_unlock(ptl);
- }
- mmu_notifier_invalidate_range_end(vma->vm_mm,
- mmun_start, mmun_end);
+ goto out;
}

- BUG_ON(compound_mapcount(page));
- ret = __split_huge_page_refcount(page, list);
+ ret = __split_huge_page_refcount(anon_vma, page, list);
BUG_ON(!ret && PageCompound(page));
-
-out_unlock:
+out:
anon_vma_unlock_write(anon_vma);
put_anon_vma(anon_vma);
-out:
+
if (ret)
count_vm_event(THP_SPLIT_PAGE_FAILED);
else
diff --git a/mm/internal.h b/mm/internal.h
index ed57cc24802b..dc2db7a45acd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -13,6 +13,7 @@

#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
@@ -261,10 +262,27 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)

extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-extern unsigned long vma_address(struct page *page,
- struct vm_area_struct *vma);
-#endif
+/*
+ * At what user virtual address is page expected in @vma?
+ */
+static inline unsigned long
+__vma_address(struct page *page, struct vm_area_struct *vma)
+{
+ pgoff_t pgoff = page_to_pgoff(page);
+ return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+}
+
+static inline unsigned long
+vma_address(struct page *page, struct vm_area_struct *vma)
+{
+ unsigned long address = __vma_address(page, vma);
+
+ /* page should be within @vma mapping range */
+ VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+
+ return address;
+}
+
#else /* !CONFIG_MMU */
static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page) { }
diff --git a/mm/migrate.c b/mm/migrate.c
index 91a67029bb18..b163e46201ee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -102,7 +102,7 @@ void putback_movable_pages(struct list_head *l)
/*
* Restore a potential migration pte to a working pte entry
*/
-static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
unsigned long addr, void *old)
{
struct mm_struct *mm = vma->vm_mm;
diff --git a/mm/rmap.c b/mm/rmap.c
index eb2f4a0d3961..2dc26770d1d3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -554,27 +554,6 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
}

/*
- * At what user virtual address is page expected in @vma?
- */
-static inline unsigned long
-__vma_address(struct page *page, struct vm_area_struct *vma)
-{
- pgoff_t pgoff = page_to_pgoff(page);
- return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-}
-
-inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
-{
- unsigned long address = __vma_address(page, vma);
-
- /* page should be within @vma mapping range */
- VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-
- return address;
-}
-
-/*
* At what user virtual address is page expected in vma?
* Caller should check the page is actually part of the vma.
*/
--
2.1.4

2015-02-12 16:20:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 20/24] mm, thp: remove compound_lock

We don't need the compound lock anymore: split_huge_page() no longer
relies on it to stabilize page counts.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 35 ------------
include/linux/page-flags.h | 12 +---
mm/debug.c | 3 -
mm/swap.c | 135 +++++++++++++++------------------------------
4 files changed, 46 insertions(+), 139 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 655d2bfabdd9..44e1d7f48158 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -398,41 +398,6 @@ static inline int is_vmalloc_or_module_addr(const void *x)

extern void kvfree(const void *addr);

-static inline void compound_lock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- VM_BUG_ON_PAGE(PageSlab(page), page);
- bit_spin_lock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline void compound_unlock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- VM_BUG_ON_PAGE(PageSlab(page), page);
- bit_spin_unlock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline unsigned long compound_lock_irqsave(struct page *page)
-{
- unsigned long uninitialized_var(flags);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- local_irq_save(flags);
- compound_lock(page);
-#endif
- return flags;
-}
-
-static inline void compound_unlock_irqrestore(struct page *page,
- unsigned long flags)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- compound_unlock(page);
- local_irq_restore(flags);
-#endif
-}
-
static inline struct page *compound_head(struct page *page)
{
if (unlikely(PageTail(page)))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d471370f27e8..32e893f2fd4d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -106,9 +106,6 @@ enum pageflags {
#ifdef CONFIG_MEMORY_FAILURE
PG_hwpoison, /* hardware poisoned page. Don't touch */
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- PG_compound_lock,
-#endif
__NR_PAGEFLAGS,

/* Filesystems */
@@ -516,12 +513,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
#define __PG_MLOCKED 0
#endif

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define __PG_COMPOUND_LOCK (1 << PG_compound_lock)
-#else
-#define __PG_COMPOUND_LOCK 0
-#endif
-
/*
* Flags checked when a page is freed. Pages being freed should not have
* these flags set. It they are, there is a problem.
@@ -531,8 +522,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
1 << PG_private | 1 << PG_private_2 | \
1 << PG_writeback | 1 << PG_reserved | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
- 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
- __PG_COMPOUND_LOCK)
+ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)

/*
* Flags checked when a page is prepped for return by the page allocator.
diff --git a/mm/debug.c b/mm/debug.c
index 13d2b8146ef9..4a82f639b964 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -45,9 +45,6 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_MEMORY_FAILURE
{1UL << PG_hwpoison, "hwpoison" },
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- {1UL << PG_compound_lock, "compound_lock" },
-#endif
};

static void dump_flags(unsigned long flags,
diff --git a/mm/swap.c b/mm/swap.c
index 7b4fbb26cc2c..6c9e764f95d7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -80,16 +80,9 @@ static void __put_compound_page(struct page *page)
(*dtor)(page);
}

-static inline bool compound_lock_needed(struct page *page)
-{
- return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
- !PageSlab(page) && !PageHeadHuge(page);
-}
-
static void put_compound_page(struct page *page)
{
struct page *page_head;
- unsigned long flags;

if (likely(!PageTail(page))) {
if (put_page_testzero(page)) {
@@ -108,58 +101,33 @@ static void put_compound_page(struct page *page)
/* __split_huge_page_refcount can run under us */
page_head = compound_head(page);

- if (!compound_lock_needed(page_head)) {
- /*
- * If "page" is a THP tail, we must read the tail page flags
- * after the head page flags. The split_huge_page side enforces
- * write memory barriers between clearing PageTail and before
- * the head page can be freed and reallocated.
- */
- smp_rmb();
- if (likely(PageTail(page))) {
- /* __split_huge_page_refcount cannot race here. */
- VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
- VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
- if (put_page_testzero(page_head)) {
- /*
- * If this is the tail of a slab compound page,
- * the tail pin must not be the last reference
- * held on the page, because the PG_slab cannot
- * be cleared before all tail pins (which skips
- * the _mapcount tail refcounting) have been
- * released. For hugetlbfs the tail pin may be
- * the last reference on the page instead,
- * because PageHeadHuge will not go away until
- * the compound page enters the buddy
- * allocator.
- */
- VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
- __put_compound_page(page_head);
- }
- } else if (put_page_testzero(page))
- __put_single_page(page);
- return;
- }
-
- flags = compound_lock_irqsave(page_head);
- /* here __split_huge_page_refcount won't run anymore */
- if (likely(page != page_head && PageTail(page))) {
- bool free;
-
- free = put_page_testzero(page_head);
- compound_unlock_irqrestore(page_head, flags);
- if (free) {
- if (PageHead(page_head))
- __put_compound_page(page_head);
- else
- __put_single_page(page_head);
+ /*
+ * If "page" is a THP tail, we must read the tail page flags after the
+ * head page flags. The split_huge_page side enforces write memory
+ * barriers between clearing PageTail and before the head page can be
+ * freed and reallocated.
+ */
+ smp_rmb();
+ if (likely(PageTail(page))) {
+ /* __split_huge_page_refcount cannot race here. */
+ VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+ if (put_page_testzero(page_head)) {
+ /*
+ * If this is the tail of a slab compound page, the
+ * tail pin must not be the last reference held on the
+ * page, because the PG_slab cannot be cleared before
+ * all tail pins (which skips the _mapcount tail
+ * refcounting) have been released. For hugetlbfs the
+ * tail pin may be the last reference on the page
+ * instead, because PageHeadHuge will not go away until
+ * the compound page enters the buddy allocator.
+ */
+ VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
+ __put_compound_page(page_head);
}
- } else {
- compound_unlock_irqrestore(page_head, flags);
- VM_BUG_ON_PAGE(PageTail(page), page);
- if (put_page_testzero(page))
- __put_single_page(page);
- }
+ } else if (put_page_testzero(page))
+ __put_single_page(page);
+ return;
}

void put_page(struct page *page)
@@ -178,42 +146,29 @@ EXPORT_SYMBOL(put_page);
void __get_page_tail(struct page *page)
{
struct page *page_head = compound_head(page);
- unsigned long flags;

- if (!compound_lock_needed(page_head)) {
- smp_rmb();
- if (likely(PageTail(page))) {
- /*
- * This is a hugetlbfs page or a slab page.
- * __split_huge_page_refcount cannot race here.
- */
- VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
- VM_BUG_ON(page_head != page->first_page);
- VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
- page);
- atomic_inc(&page_head->_count);
- } else {
- /*
- * __split_huge_page_refcount run before us, "page" was
- * a thp tail. the split page_head has been freed and
- * reallocated as slab or hugetlbfs page of smaller
- * order (only possible if reallocated as slab on x86).
- */
- VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
- atomic_inc(&page->_count);
- }
- return;
- }
-
- flags = compound_lock_irqsave(page_head);
- /* here __split_huge_page_refcount won't run anymore */
- if (unlikely(page == page_head || !PageTail(page) ||
- !get_page_unless_zero(page_head))) {
- /* page is not part of THP page anymore */
+ smp_rmb();
+ if (likely(PageTail(page))) {
+ /*
+ * This is a hugetlbfs page or a slab page.
+ * __split_huge_page_refcount cannot race here.
+ */
+ VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+ VM_BUG_ON(page_head != page->first_page);
+ VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
+ page);
+ atomic_inc(&page_head->_count);
+ } else {
+ /*
+ * __split_huge_page_refcount run before us, "page" was
+ * a thp tail. the split page_head has been freed and
+ * reallocated as slab or hugetlbfs page of smaller
+ * order (only possible if reallocated as slab on x86).
+ */
VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
atomic_inc(&page->_count);
}
- compound_unlock_irqrestore(page_head, flags);
+ return;
}
EXPORT_SYMBOL(__get_page_tail);

--
2.1.4

2015-02-12 16:23:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 21/24] thp: introduce deferred_split_huge_page()

Currently we don't split a huge page on partial unmap. That's not ideal:
it can lead to memory overhead.

Fortunately, we can detect partial unmap in page_remove_rmap(), but we
cannot call split_huge_page() from there due to the locking context.

It's also counterproductive to split directly from the munmap() codepath:
in many cases we get there from exit(2), and splitting the huge page just
to free it up as small pages is not what we really want.

The patch introduces deferred_split_huge_page(), which puts the huge page
on a queue for splitting. The split itself happens under memory pressure.
The page is dropped from the list on freeing, via the compound page
destructor.
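
The intended flow, using the names introduced below (a simplified sketch;
"partially_unmapped" stands in for the mapcount check in page_remove_rmap()):

        /* rmap side: on partial unmap, queue the head page instead of
         * splitting right away (we can't take the needed locks here) */
        if (partially_unmapped)
                deferred_split_huge_page(compound_head(page));

        /* reclaim side: shrink_zone() drains the per-zone queue; each page
         * is locked, split_huge_page() is called, and pages freed in the
         * meantime drop off the list via free_transhuge_page() */
        drain_split_queue(zone);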

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 9 +++++
include/linux/mm.h | 2 ++
include/linux/mmzone.h | 5 +++
mm/huge_memory.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
mm/page_alloc.c | 6 +++-
mm/rmap.c | 10 +++++-
mm/vmscan.c | 3 ++
7 files changed, 120 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7a0c477a2b38..0aaebd81beb6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,6 +96,13 @@ static inline int split_huge_page(struct page *page)
{
return split_huge_page_to_list(page, NULL);
}
+void deferred_split_huge_page(struct page *page);
+void __drain_split_queue(struct zone *zone);
+static inline void drain_split_queue(struct zone *zone)
+{
+ if (!list_empty(&zone->split_queue))
+ __drain_split_queue(zone);
+}
extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address);
#define split_huge_pmd(__vma, __pmd, __address) \
@@ -170,6 +177,8 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
+static inline void deferred_split_huge_page(struct page *page) {}
+static inline void drain_split_queue(struct zone *zone) {}
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 44e1d7f48158..f6ec7ed26168 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -568,6 +568,8 @@ static inline void set_compound_order(struct page *page, unsigned long order)
page[1].compound_order = order;
}

+void free_compound_page(struct page *page);
+
#ifdef CONFIG_MMU
/*
* Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f279d9c158cd..4f1afa447e2d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -525,6 +525,11 @@ struct zone {
bool compact_blockskip_flush;
#endif

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ unsigned long split_queue_length;
+ struct list_head split_queue;
+#endif
+
ZONE_PADDING(_pad3_)
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7157975eeb1a..f42bd96e69a6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -69,6 +69,8 @@ static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
static int khugepaged(void *none);
static int khugepaged_slab_init(void);

+static void free_transhuge_page(struct page *page);
+
#define MM_SLOTS_HASH_BITS 10
static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);

@@ -825,6 +827,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
+ INIT_LIST_HEAD(&page[2].lru);
+ set_compound_page_dtor(page, free_transhuge_page);
if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
@@ -1086,7 +1090,10 @@ alloc:
} else
new_page = NULL;

- if (unlikely(!new_page)) {
+ if (likely(new_page)) {
+ INIT_LIST_HEAD(&new_page[2].lru);
+ set_compound_page_dtor(new_page, free_transhuge_page);
+ } else {
if (!page) {
split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
@@ -1839,6 +1846,10 @@ static int __split_huge_page_refcount(struct anon_vma *anon_vma,
return -EBUSY;
}

+ spin_lock(&zone->lock);
+ list_del(&page[2].lru);
+ spin_unlock(&zone->lock);
+
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(page);

@@ -1994,6 +2005,71 @@ out:
return ret;
}

+static void free_transhuge_page(struct page *page)
+{
+ if (!list_empty(&page[2].lru)) {
+ struct zone *zone = page_zone(page);
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ list_del(&page[2].lru);
+ memset(&page[2].lru, 0, sizeof(page[2].lru));
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+ free_compound_page(page);
+}
+
+void deferred_split_huge_page(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+ unsigned long flags;
+
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+
+ /* we use page->lru in second tail page: assuming THP order >= 2 */
+ BUILD_BUG_ON(HPAGE_PMD_ORDER < 2);
+
+ if (!list_empty(&page[2].lru))
+ return;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ if (list_empty(&page[2].lru))
+ list_add_tail(&page[2].lru, &zone->split_queue);
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+void __drain_split_queue(struct zone *zone)
+{
+ unsigned long flags;
+ LIST_HEAD(list);
+ struct page *page, *next;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ list_splice_init(&zone->split_queue, &list);
+ /*
+ * take reference for all pages under zone->lock to avoid race
+ * with free_transhuge_page().
+ */
+ list_for_each_entry_safe(page, next, &list, lru)
+ get_page(compound_head(page));
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ list_for_each_entry_safe(page, next, &list, lru) {
+ page = compound_head(page);
+ lock_page(page);
+ /* split_huge_page() removes page from list on success */
+ split_huge_page(compound_head(page));
+ unlock_page(page);
+ put_page(page);
+ }
+
+ if (!list_empty(&list)) {
+ spin_lock_irqsave(&zone->lock, flags);
+ list_splice_tail(&list, &zone->split_queue);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+}
+
#define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)

int hugepage_madvise(struct vm_area_struct *vma,
@@ -2413,6 +2489,8 @@ static struct page
return NULL;
}

+ INIT_LIST_HEAD(&(*hpage)[2].lru);
+ set_compound_page_dtor(*hpage, free_transhuge_page);
count_vm_event(THP_COLLAPSE_ALLOC);
return *hpage;
}
@@ -2424,8 +2502,14 @@ static int khugepaged_find_target_node(void)

static inline struct page *alloc_hugepage(int defrag)
{
- return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
- HPAGE_PMD_ORDER);
+ struct page *page;
+
+ page = alloc_pages(alloc_hugepage_gfpmask(defrag, 0), HPAGE_PMD_ORDER);
+ if (page) {
+ INIT_LIST_HEAD(&page[2].lru);
+ set_compound_page_dtor(page, free_transhuge_page);
+ }
+ return page;
}

static struct page *khugepaged_alloc_hugepage(bool *wait)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b0ef1f6d2fb0..9010b60009f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -357,7 +357,7 @@ out:
* This usage means that zero-order pages may not be compound.
*/

-static void free_compound_page(struct page *page)
+void free_compound_page(struct page *page)
{
__free_pages_ok(page, compound_order(page));
}
@@ -4920,6 +4920,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone->zone_pgdat = pgdat;
zone_pcp_init(zone);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ INIT_LIST_HEAD(&zone->split_queue);
+#endif
+
/* For bootup, initialized properly in watermark setup */
mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);

diff --git a/mm/rmap.c b/mm/rmap.c
index 2dc26770d1d3..6795babf5739 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1165,6 +1165,7 @@ out:
void page_remove_rmap(struct page *page, bool compound)
{
int nr = compound ? hpage_nr_pages(page) : 1;
+ bool partial_thp_unmap;

if (!PageAnon(page)) {
VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
@@ -1195,13 +1196,20 @@ void page_remove_rmap(struct page *page, bool compound)
for (i = 0; i < hpage_nr_pages(page); i++)
if (page_mapcount(page + i))
nr--;
- }
+ partial_thp_unmap = nr != hpage_nr_pages(page);
+ } else if (PageTransCompound(page)) {
+ partial_thp_unmap = !compound_mapcount(page);
+ } else
+ partial_thp_unmap = false;

__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);

if (unlikely(PageMlocked(page)))
clear_page_mlock(page);

+ if (partial_thp_unmap)
+ deferred_split_huge_page(compound_head(page));
+
/*
* It would be tidy to reset the PageAnon mapping here,
* but that might overwrite a racing page_add_anon_rmap
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 671e47edb584..741a215e3d73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2376,6 +2376,9 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
unsigned long zone_lru_pages = 0;
struct mem_cgroup *memcg;

+ /* XXX: accounting for shrinking progress ? */
+ drain_split_queue(zone);
+
nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;

--
2.1.4

2015-02-12 16:19:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 22/24] memcg: adjust to support new THP refcounting

With the new refcounting we cannot rely on a PageTransHuge() check to
decide whether to charge the size of a huge page to the cgroup. We need
information from the caller to know whether the page was mapped with a
PMD or a PTE.

We uncharge when the last reference on the page is gone. At that point,
if we see PageTransHuge() it means we need to uncharge the whole huge
page.

The tricky part is partial unmap. We don't handle this situation
specially, meaning we don't uncharge part of the huge page until the
last user is gone or split_huge_page() is triggered. If the cgroup comes
under memory pressure, the partially unmapped page will be split through
shrink_zone(). This should be good enough.

I did a quick sanity check; more testing is required.
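
For illustration, a minimal sketch of the updated charge sequence for an
anonymous THP fault (it mirrors the __do_huge_pmd_anonymous_page() hunk
below; error unwinding is trimmed for brevity):

	struct mem_cgroup *memcg;

	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg, true))
		return VM_FAULT_OOM;
	/* ... allocate the pgtable and take the pmd lock ... */
	page_add_new_anon_rmap(page, vma, haddr, true);
	mem_cgroup_commit_charge(page, memcg, false, true);	/* compound */
	/* on any failure before the commit: */
	/* mem_cgroup_cancel_charge(page, memcg, true); */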

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/memcontrol.h | 16 +++++++-----
kernel/events/uprobes.c | 7 +++---
mm/Kconfig | 2 +-
mm/filemap.c | 8 +++---
mm/huge_memory.c | 29 +++++++++++-----------
mm/memcontrol.c | 62 +++++++++++++++++-----------------------------
mm/memory.c | 26 +++++++++----------
mm/shmem.c | 21 +++++++++-------
mm/swapfile.c | 9 ++++---
9 files changed, 87 insertions(+), 93 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..6a70e6c4bece 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -74,10 +74,12 @@ void mem_cgroup_events(struct mem_cgroup *memcg,
bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);

int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp);
+ gfp_t gfp_mask, struct mem_cgroup **memcgp,
+ bool compound);
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
- bool lrucare);
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
+ bool lrucare, bool compound);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+ bool compound);
void mem_cgroup_uncharge(struct page *page);
void mem_cgroup_uncharge_list(struct list_head *page_list);

@@ -209,7 +211,8 @@ static inline bool mem_cgroup_low(struct mem_cgroup *root,

static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask,
- struct mem_cgroup **memcgp)
+ struct mem_cgroup **memcgp,
+ bool compound)
{
*memcgp = NULL;
return 0;
@@ -217,12 +220,13 @@ static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,

static inline void mem_cgroup_commit_charge(struct page *page,
struct mem_cgroup *memcg,
- bool lrucare)
+ bool lrucare, bool compound)
{
}

static inline void mem_cgroup_cancel_charge(struct page *page,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg,
+ bool compound)
{
}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 5523daf59953..04e26bdf0717 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -169,7 +169,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
const unsigned long mmun_end = addr + PAGE_SIZE;
struct mem_cgroup *memcg;

- err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
+ err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg,
+ false);
if (err)
return err;

@@ -184,7 +185,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

get_page(kpage);
page_add_new_anon_rmap(kpage, vma, addr, false);
- mem_cgroup_commit_charge(kpage, memcg, false);
+ mem_cgroup_commit_charge(kpage, memcg, false, false);
lru_cache_add_active_or_unevictable(kpage, vma);

if (!PageAnon(page)) {
@@ -207,7 +208,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

err = 0;
unlock:
- mem_cgroup_cancel_charge(kpage, memcg);
+ mem_cgroup_cancel_charge(kpage, memcg, false);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
unlock_page(page);
return err;
diff --git a/mm/Kconfig b/mm/Kconfig
index 9ce853d2af5d..a03131b6ba8e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -409,7 +409,7 @@ config NOMMU_INITIAL_TRIM_EXCESS

config TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
- depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !MEMCG
+ depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
select COMPACTION
help
Transparent Hugepages allows the kernel to use huge pages and
diff --git a/mm/filemap.c b/mm/filemap.c
index b02c3f7cbe64..e1114314d9be 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -556,7 +556,7 @@ static int __add_to_page_cache_locked(struct page *page,

if (!huge) {
error = mem_cgroup_try_charge(page, current->mm,
- gfp_mask, &memcg);
+ gfp_mask, &memcg, false);
if (error)
return error;
}
@@ -564,7 +564,7 @@ static int __add_to_page_cache_locked(struct page *page,
error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error) {
if (!huge)
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
return error;
}

@@ -580,7 +580,7 @@ static int __add_to_page_cache_locked(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
if (!huge)
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
trace_mm_filemap_add_to_page_cache(page);
return 0;
err_insert:
@@ -588,7 +588,7 @@ err_insert:
/* Leave page->index set: truncation relies upon it */
spin_unlock_irq(&mapping->tree_lock);
if (!huge)
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
page_cache_release(page);
return error;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f42bd96e69a6..2667938a3d2c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -719,12 +719,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
+ if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg, true))
return VM_FAULT_OOM;

pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable)) {
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, true);
return VM_FAULT_OOM;
}

@@ -739,7 +739,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_none(*pmd))) {
spin_unlock(ptl);
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
pte_free(mm, pgtable);
} else {
@@ -747,7 +747,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr, true);
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, true);
lru_cache_add_active_or_unevictable(page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
@@ -957,13 +957,14 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
vma, address, page_to_nid(page));
if (unlikely(!pages[i] ||
mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
- &memcg))) {
+ &memcg, false))) {
if (pages[i])
put_page(pages[i]);
while (--i >= 0) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- mem_cgroup_cancel_charge(pages[i], memcg);
+ mem_cgroup_cancel_charge(pages[i], memcg,
+ false);
put_page(pages[i]);
}
kfree(pages);
@@ -1002,7 +1003,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vma, haddr, false);
- mem_cgroup_commit_charge(pages[i], memcg, false);
+ mem_cgroup_commit_charge(pages[i], memcg, false, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte_none(*pte));
@@ -1030,7 +1031,7 @@ out_free_pages:
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- mem_cgroup_cancel_charge(pages[i], memcg);
+ mem_cgroup_cancel_charge(pages[i], memcg, false);
put_page(pages[i]);
}
kfree(pages);
@@ -1111,7 +1112,7 @@ alloc:
}

if (unlikely(mem_cgroup_try_charge(new_page, mm,
- GFP_TRANSHUGE, &memcg))) {
+ GFP_TRANSHUGE, &memcg, true))) {
put_page(new_page);
if (page) {
split_huge_pmd(vma, pmd, address);
@@ -1140,7 +1141,7 @@ alloc:
put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
spin_unlock(ptl);
- mem_cgroup_cancel_charge(new_page, memcg);
+ mem_cgroup_cancel_charge(new_page, memcg, true);
put_page(new_page);
goto out_mn;
} else {
@@ -1149,7 +1150,7 @@ alloc:
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_clear_flush_notify(vma, haddr, pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
- mem_cgroup_commit_charge(new_page, memcg, false);
+ mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_active_or_unevictable(new_page, vma);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, pmd);
@@ -2594,7 +2595,7 @@ static void collapse_huge_page(struct mm_struct *mm,
return;

if (unlikely(mem_cgroup_try_charge(new_page, mm,
- GFP_TRANSHUGE, &memcg)))
+ GFP_TRANSHUGE, &memcg, true)))
return;

/*
@@ -2681,7 +2682,7 @@ static void collapse_huge_page(struct mm_struct *mm,
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
page_add_new_anon_rmap(new_page, vma, address, true);
- mem_cgroup_commit_charge(new_page, memcg, false);
+ mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_active_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
@@ -2696,7 +2697,7 @@ out_up_write:
return;

out:
- mem_cgroup_cancel_charge(new_page, memcg);
+ mem_cgroup_cancel_charge(new_page, memcg, true);
goto out_up_write;
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c6786c457bf..33a190ecc7f9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -820,7 +820,7 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
struct page *page,
- int nr_pages)
+ bool compound, int nr_pages)
{
/*
* Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
@@ -833,9 +833,11 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
nr_pages);

- if (PageTransHuge(page))
+ if (compound) {
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
nr_pages);
+ }

/* pagein of a big page is an event. So, ignore page size */
if (nr_pages > 0)
@@ -2794,30 +2796,24 @@ void mem_cgroup_split_huge_fixup(struct page *head)
* from old cgroup.
*/
static int mem_cgroup_move_account(struct page *page,
- unsigned int nr_pages,
+ bool compound,
struct mem_cgroup *from,
struct mem_cgroup *to)
{
unsigned long flags;
+ unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
int ret;

VM_BUG_ON(from == to);
VM_BUG_ON_PAGE(PageLRU(page), page);
- /*
- * The page is isolated from LRU. So, collapse function
- * will not handle this page. But page splitting can happen.
- * Do this check under compound_page_lock(). The caller should
- * hold it.
- */
- ret = -EBUSY;
- if (nr_pages > 1 && !PageTransHuge(page))
- goto out;
+ VM_BUG_ON(compound && !PageTransHuge(page));

/*
* Prevent mem_cgroup_migrate() from looking at page->mem_cgroup
* of its source page while we change it: page migration takes
* both pages off the LRU, but page cache replacement doesn't.
*/
+ ret = -EBUSY;
if (!trylock_page(page))
goto out;

@@ -2854,9 +2850,9 @@ static int mem_cgroup_move_account(struct page *page,
ret = 0;

local_irq_disable();
- mem_cgroup_charge_statistics(to, page, nr_pages);
+ mem_cgroup_charge_statistics(to, page, compound, nr_pages);
memcg_check_events(to, page);
- mem_cgroup_charge_statistics(from, page, -nr_pages);
+ mem_cgroup_charge_statistics(from, page, compound, -nr_pages);
memcg_check_events(from, page);
local_irq_enable();
out_unlock:
@@ -5074,7 +5070,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
if (target_type == MC_TARGET_PAGE) {
page = target.page;
if (!isolate_lru_page(page)) {
- if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
+ if (!mem_cgroup_move_account(page, true,
mc.from, mc.to)) {
mc.precharge -= HPAGE_PMD_NR;
mc.moved_charge += HPAGE_PMD_NR;
@@ -5103,7 +5099,8 @@ retry:
page = target.page;
if (isolate_lru_page(page))
goto put;
- if (!mem_cgroup_move_account(page, 1, mc.from, mc.to)) {
+ if (!mem_cgroup_move_account(page, false,
+ mc.from, mc.to)) {
mc.precharge--;
/* we uncharge from mc.from later. */
mc.moved_charge++;
@@ -5449,10 +5446,11 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
* with mem_cgroup_cancel_charge() in case page instantiation fails.
*/
int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp)
+ gfp_t gfp_mask, struct mem_cgroup **memcgp,
+ bool compound)
{
struct mem_cgroup *memcg = NULL;
- unsigned int nr_pages = 1;
+ unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
int ret = 0;

if (mem_cgroup_disabled())
@@ -5470,11 +5468,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
goto out;
}

- if (PageTransHuge(page)) {
- nr_pages <<= compound_order(page);
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- }
-
if (do_swap_account && PageSwapCache(page))
memcg = try_get_mem_cgroup_from_page(page);
if (!memcg)
@@ -5510,9 +5503,9 @@ out:
* Use mem_cgroup_cancel_charge() to cancel the transaction instead.
*/
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
- bool lrucare)
+ bool lrucare, bool compound)
{
- unsigned int nr_pages = 1;
+ unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;

VM_BUG_ON_PAGE(!page->mapping, page);
VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
@@ -5529,13 +5522,8 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,

commit_charge(page, memcg, lrucare);

- if (PageTransHuge(page)) {
- nr_pages <<= compound_order(page);
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- }
-
local_irq_disable();
- mem_cgroup_charge_statistics(memcg, page, nr_pages);
+ mem_cgroup_charge_statistics(memcg, page, compound, nr_pages);
memcg_check_events(memcg, page);
local_irq_enable();

@@ -5557,9 +5545,10 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
*
* Cancel a charge transaction started by mem_cgroup_try_charge().
*/
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+ bool compound)
{
- unsigned int nr_pages = 1;
+ unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;

if (mem_cgroup_disabled())
return;
@@ -5571,11 +5560,6 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
if (!memcg)
return;

- if (PageTransHuge(page)) {
- nr_pages <<= compound_order(page);
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- }
-
cancel_charge(memcg, nr_pages);
}

@@ -5829,7 +5813,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
/* XXX: caller holds IRQ-safe mapping->tree_lock */
VM_BUG_ON(!irqs_disabled());

- mem_cgroup_charge_statistics(memcg, page, -1);
+ mem_cgroup_charge_statistics(memcg, page, false, -1);
memcg_check_events(memcg, page);
}

diff --git a/mm/memory.c b/mm/memory.c
index c1878afb6466..f81bcd539ca0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2154,7 +2154,7 @@ gotten:
}
__SetPageUptodate(new_page);

- if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
+ if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
goto oom_free_new;

mmun_start = address & PAGE_MASK;
@@ -2184,7 +2184,7 @@ gotten:
*/
ptep_clear_flush_notify(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address, false);
- mem_cgroup_commit_charge(new_page, memcg, false);
+ mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
* We call the notify macro here because, when using secondary
@@ -2223,7 +2223,7 @@ gotten:
new_page = old_page;
ret |= VM_FAULT_WRITE;
} else
- mem_cgroup_cancel_charge(new_page, memcg);
+ mem_cgroup_cancel_charge(new_page, memcg, false);

if (new_page)
page_cache_release(new_page);
@@ -2425,7 +2425,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_page;
}

- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
+ if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
ret = VM_FAULT_OOM;
goto out_page;
}
@@ -2467,10 +2467,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
set_pte_at(mm, address, page_table, pte);
if (page == swapcache) {
do_page_add_anon_rmap(page, vma, address, exclusive);
- mem_cgroup_commit_charge(page, memcg, true);
+ mem_cgroup_commit_charge(page, memcg, true, false);
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, address, false);
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
}

@@ -2505,7 +2505,7 @@ unlock:
out:
return ret;
out_nomap:
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
pte_unmap_unlock(page_table, ptl);
out_page:
unlock_page(page);
@@ -2595,7 +2595,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
__SetPageUptodate(page);

- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
+ if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
goto oom_free_page;

entry = mk_pte(page, vma->vm_page_prot);
@@ -2608,7 +2608,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,

inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address, false);
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
set_pte_at(mm, address, page_table, entry);
@@ -2619,7 +2619,7 @@ unlock:
pte_unmap_unlock(page_table, ptl);
return 0;
release:
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
page_cache_release(page);
goto unlock;
oom_free_page:
@@ -2870,7 +2870,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (!new_page)
return VM_FAULT_OOM;

- if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
+ if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
page_cache_release(new_page);
return VM_FAULT_OOM;
}
@@ -2899,7 +2899,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto uncharge_out;
}
do_set_pte(vma, address, new_page, pte, true, true);
- mem_cgroup_commit_charge(new_page, memcg, false);
+ mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
pte_unmap_unlock(pte, ptl);
if (fault_page) {
@@ -2914,7 +2914,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
return ret;
uncharge_out:
- mem_cgroup_cancel_charge(new_page, memcg);
+ mem_cgroup_cancel_charge(new_page, memcg, false);
page_cache_release(new_page);
return ret;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index a63031fa3e0c..7d2e808739ce 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -706,7 +706,8 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
* the shmem_swaplist_mutex which might hold up shmem_writepage().
* Charged back to the user (not to caller) when swap account is used.
*/
- error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg);
+ error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg,
+ false);
if (error)
goto out;
/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -729,9 +730,9 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
if (error) {
if (error != -ENOMEM)
error = 0;
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
} else
- mem_cgroup_commit_charge(page, memcg, true);
+ mem_cgroup_commit_charge(page, memcg, true, false);
out:
unlock_page(page);
page_cache_release(page);
@@ -1114,7 +1115,8 @@ repeat:
goto failed;
}

- error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+ error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+ false);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
swp_to_radix_entry(swap));
@@ -1131,14 +1133,14 @@ repeat:
* "repeat": reading a hole and writing should succeed.
*/
if (error) {
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
delete_from_swap_cache(page);
}
}
if (error)
goto failed;

- mem_cgroup_commit_charge(page, memcg, true);
+ mem_cgroup_commit_charge(page, memcg, true, false);

spin_lock(&info->lock);
info->swapped--;
@@ -1177,7 +1179,8 @@ repeat:
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

- error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+ error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+ false);
if (error)
goto decused;
error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
@@ -1187,10 +1190,10 @@ repeat:
radix_tree_preload_end();
}
if (error) {
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
goto decused;
}
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_anon(page);

spin_lock(&info->lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 99f97c31ede5..298efd73dc60 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1106,14 +1106,15 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
if (unlikely(!page))
return -ENOMEM;

- if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+ {
ret = -ENOMEM;
goto out_nolock;
}

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
ret = 0;
goto out;
}
@@ -1125,10 +1126,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
if (page == swapcache) {
page_add_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, true);
+ mem_cgroup_commit_charge(page, memcg, true, false);
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
}
swap_free(entry);
--
2.1.4

2015-02-12 16:19:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 23/24] ksm: split huge pages on follow_page()

Let's split THP with FOLL_SPLIT. Attempting to split them later would
always fail because we take references on tail pages.
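
A condensed sketch of the pattern the KSM scanners switch to (see the
get_mergeable_page() hunk below; error handling is simplified):

	page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
	if (IS_ERR_OR_NULL(page))
		return NULL;		/* nothing usable at this address */
	if (PageAnon(page)) {
		/* any THP was split by FOLL_SPLIT before we got here,
		 * which is what the BUG_ON(PageTransCompound(page)) in
		 * try_to_merge_one_page() now relies on */
		flush_anon_page(vma, page, addr);
		flush_dcache_page(page);
		return page;
	}
	put_page(page);
	return NULL;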

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/ksm.c | 58 ++++++++--------------------------------------------------
1 file changed, 8 insertions(+), 50 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index a8a88b0f6f62..8d977f074a74 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -441,20 +441,6 @@ static void break_cow(struct rmap_item *rmap_item)
up_read(&mm->mmap_sem);
}

-static struct page *page_trans_compound_anon(struct page *page)
-{
- if (PageTransCompound(page)) {
- struct page *head = compound_head(page);
- /*
- * head may actually be splitted and freed from under
- * us but it's ok here.
- */
- if (PageAnon(head))
- return head;
- }
- return NULL;
-}
-
static struct page *get_mergeable_page(struct rmap_item *rmap_item)
{
struct mm_struct *mm = rmap_item->mm;
@@ -467,10 +453,10 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
if (!vma)
goto out;

- page = follow_page(vma, addr, FOLL_GET);
+ page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
if (IS_ERR_OR_NULL(page))
goto out;
- if (PageAnon(page) || page_trans_compound_anon(page)) {
+ if (PageAnon(page)) {
flush_anon_page(vma, page, addr);
flush_dcache_page(page);
} else {
@@ -976,35 +962,6 @@ out:
return err;
}

-static int page_trans_compound_anon_split(struct page *page)
-{
- int ret = 0;
- struct page *transhuge_head = page_trans_compound_anon(page);
- if (transhuge_head) {
- /* Get the reference on the head to split it. */
- if (get_page_unless_zero(transhuge_head)) {
- /*
- * Recheck we got the reference while the head
- * was still anonymous.
- */
- if (PageAnon(transhuge_head)) {
- lock_page(transhuge_head);
- ret = split_huge_page(transhuge_head);
- unlock_page(transhuge_head);
- } else
- /*
- * Retry later if split_huge_page run
- * from under us.
- */
- ret = 1;
- put_page(transhuge_head);
- } else
- /* Retry later if split_huge_page run from under us. */
- ret = 1;
- }
- return ret;
-}
-
/*
* try_to_merge_one_page - take two pages and merge them into one
* @vma: the vma that holds the pte pointing to page
@@ -1025,9 +982,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,

if (!(vma->vm_flags & VM_MERGEABLE))
goto out;
- if (PageTransCompound(page) && page_trans_compound_anon_split(page))
- goto out;
+
+ /* huge pages must be split by this time */
BUG_ON(PageTransCompound(page));
+
if (!PageAnon(page))
goto out;

@@ -1616,14 +1574,14 @@ next_mm:
while (ksm_scan.address < vma->vm_end) {
if (ksm_test_exit(mm))
break;
- *page = follow_page(vma, ksm_scan.address, FOLL_GET);
+ *page = follow_page(vma, ksm_scan.address,
+ FOLL_GET | FOLL_SPLIT);
if (IS_ERR_OR_NULL(*page)) {
ksm_scan.address += PAGE_SIZE;
cond_resched();
continue;
}
- if (PageAnon(*page) ||
- page_trans_compound_anon(*page)) {
+ if (PageAnon(*page)) {
flush_anon_page(vma, *page, ksm_scan.address);
flush_dcache_page(*page);
rmap_item = get_next_rmap_item(slot,
--
2.1.4

2015-02-12 16:23:03

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 24/24] thp: update documentation

The patch updates Documentation/vm/transhuge.txt to reflect changes in
THP design.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 100 +++++++++++++++++++----------------------
1 file changed, 45 insertions(+), 55 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 6b31cfbe2a9a..a12171e850d4 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -35,10 +35,10 @@ miss is going to run faster.

== Design ==

-- "graceful fallback": mm components which don't have transparent
- hugepage knowledge fall back to breaking a transparent hugepage and
- working on the regular pages and their respective regular pmd/pte
- mappings
+- "graceful fallback": mm components which don't have transparent hugepage
+ knowledge fall back to breaking a huge pmd mapping into a table of ptes
+ and, if necessary, splitting the transparent hugepage. Therefore these
+ components can continue working on regular pages or regular pte mappings.

- if a hugepage allocation fails because of memory fragmentation,
regular pages should be gracefully allocated instead and mixed in
@@ -200,9 +200,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
the allocation.

-thp_split is incremented every time a huge page is split into base
+thp_split_page is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.
+ This action implies splitting all PMDs the page is mapped with.
+
+thp_split_page_failed is incremented if the kernel fails to split a huge
+ page. This can happen if the page was pinned by somebody.
+
+thp_split_pmd is incremented every time a PMD is split into a table of PTEs.
+ This can happen, for instance, when an application calls mprotect() or
+ munmap() on part of a huge page. It doesn't split the huge page, only
+ the page table entry.

thp_zero_page_alloc is incremented every time a huge zero page is
successfully allocated. It includes allocations which where
@@ -253,10 +262,8 @@ is complete, so they won't ever notice the fact the page is huge. But
if any driver is going to mangle over the page structure of the tail
page (like for checking page->mapping or other bits that are relevant
for the head page and not the tail page), it should be updated to jump
-to check head page instead (while serializing properly against
-split_huge_page() to avoid the head and tail pages to disappear from
-under it, see the futex code to see an example of that, hugetlbfs also
-needed special handling in futex code for similar reasons).
+to check the head page instead. Taking a reference on any head/tail page
+prevents the page from being split by anyone.

NOTE: these aren't new constraints to the GUP API, and they match the
same constrains that applies to hugetlbfs too, so any driver capable
@@ -291,9 +298,9 @@ unaffected. libhugetlbfs will also work fine as usual.
== Graceful fallback ==

Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
+split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
-by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one liner change, you can avoid to write
hundred if not thousand of lines of complex code to make your code
@@ -302,7 +309,8 @@ hugepage aware.
If you're not walking pagetables but you run into a physical hugepage
but you can't handle it natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
-it tries to swapout the hugepage for example.
+it tries to swap out the hugepage, for example. split_huge_page() can fail
+if the page is pinned, and you must handle this correctly.

Example to make mremap.c transparent hugepage aware with a one liner
change:
@@ -314,14 +322,14 @@ diff --git a/mm/mremap.c b/mm/mremap.c
return NULL;

pmd = pmd_offset(pud, addr);
-+ split_huge_page_pmd(vma, addr, pmd);
++ split_huge_pmd(vma, pmd, addr);
if (pmd_none_or_clear_bad(pmd))
return NULL;

== Locking in hugepage aware code ==

We want as much code as possible hugepage aware, as calling
-split_huge_page() or split_huge_page_pmd() has a cost.
+split_huge_page() or split_huge_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
@@ -330,47 +338,29 @@ created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_sem in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fallback in the old code
paths. If instead pmd_trans_huge returns true, you have to take the
-mm->page_table_lock and re-run pmd_trans_huge. Taking the
-page_table_lock will prevent the huge pmd to be converted into a
-regular pmd from under you (split_huge_page can run in parallel to the
+page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
+page table lock will prevent the huge pmd from being converted into a
+regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
-should just drop the page_table_lock and fallback to the old code as
-before. Otherwise you should run pmd_trans_splitting on the pmd. In
-case pmd_trans_splitting returns true, it means split_huge_page is
-already in the middle of splitting the page. So if pmd_trans_splitting
-returns true it's enough to drop the page_table_lock and call
-wait_split_huge_page and then fallback the old code paths. You are
-guaranteed by the time wait_split_huge_page returns, the pmd isn't
-huge anymore. If pmd_trans_splitting returns false, you can proceed to
-process the huge pmd and the hugepage natively. Once finished you can
-drop the page_table_lock.
-
-== compound_lock, get_user_pages and put_page ==
+should just drop the page table lock and fall back to the old code as
+before. Otherwise you can proceed to process the huge pmd and the
+hugepage natively. Once finished you can drop the page table lock.
+
+== Refcounts and transparent huge pages ==

+As with other compound page types we do all refcounting for THP on the
+head page, but unlike other compound pages THP supports splitting.
split_huge_page internally has to distribute the refcounts in the head
-page to the tail pages before clearing all PG_head/tail bits from the
-page structures. It can do that easily for refcounts taken by huge pmd
-mappings. But the GUI API as created by hugetlbfs (that returns head
-and tail pages if running get_user_pages on an address backed by any
-hugepage), requires the refcount to be accounted on the tail pages and
-not only in the head pages, if we want to be able to run
-split_huge_page while there are gup pins established on any tail
-page. Failure to be able to run split_huge_page if there's any gup pin
-on any tail page, would mean having to split all hugepages upfront in
-get_user_pages which is unacceptable as too many gup users are
-performance critical and they must work natively on hugepages like
-they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be split so there wouldn't be requirement of
-accounting the pins on the tail pages for hugetlbfs). If we wouldn't
-account the gup refcounts on the tail pages during gup, we won't know
-anymore which tail page is pinned by gup and which is not while we run
-split_huge_page. But we still have to add the gup pin to the head page
-too, to know when we can free the compound page in case it's never
-split during its lifetime. That requires changing not just
-get_page, but put_page as well so that when put_page runs on a tail
-page (and only on a tail page) it will find its respective head page,
-and then it will decrease the head page refcount in addition to the
-tail page refcount. To obtain a head page reliably and to decrease its
-refcount without race conditions, put_page has to serialize against
-__split_huge_page_refcount using a special per-page lock called
-compound_lock.
+page to the tail pages before clearing all PG_head/tail bits from the page
+structures. It can be done easily for refcounts taken by page table
+entries. But we don't have enough information on how to distribute any
+additional pins (i.e. from get_user_pages). split_huge_page fails any
+requests to split a pinned huge page: it expects the page count to be
+equal to the sum of the mapcounts of all sub-pages plus one (the
+split_huge_page caller must hold a reference on the head page).
+
+split_huge_page uses migration entries to stabilize page->_count and
+page->_mapcount.
+
+Note that split_huge_pmd() doesn't have any limitation on refcounting:
+the pmd can be split at any point and the operation never fails.
--
2.1.4
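
As a companion to the rewritten locking section above, here is a minimal
sketch of the page-table-walk pattern it describes (handle_huge_pmd() and
handle_ptes() are placeholder names, not kernel functions):

	pmd_t *pmd = pmd_offset(pud, addr);
	spinlock_t *ptl;

	if (pmd_trans_huge(*pmd)) {
		ptl = pmd_lock(mm, pmd);
		if (pmd_trans_huge(*pmd)) {
			/* still huge: process the huge pmd natively */
			handle_huge_pmd(mm, pmd, addr);
			spin_unlock(ptl);
			return;
		}
		/* lost the race with split_huge_pmd(): use the pte path */
		spin_unlock(ptl);
	}
	handle_ptes(mm, pmd, addr);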

2015-02-12 17:09:26

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCHv3 14/24] thp: implement new split_huge_page()

On 02/12/2015 11:18 AM, Kirill A. Shutemov wrote:
> +void __get_page_tail(struct page *page);
> static inline void get_page(struct page *page)
> {
> - struct page *page_head = compound_head(page);
> - VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
> - atomic_inc(&page_head->_count);
> + if (unlikely(PageTail(page)))
> + return __get_page_tail(page);
> +
> + /*
> + * Getting a normal page or the head of a compound page
> + * requires to already have an elevated page->_count.
> + */
> + VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);

This BUG_ON seems to get hit:

[ 612.180784] page:ffffea00004cb180 count:0 mapcount:0 mapping: (null) index:0x2
[ 612.188538] flags: 0x1fffff80000000()
[ 612.190452] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0)
[ 612.195857] ------------[ cut here ]------------
[ 612.196636] kernel BUG at include/linux/mm.h:463!
[ 612.196636] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 612.196636] Dumping ftrace buffer:
[ 612.196636] (ftrace buffer empty)
[ 612.196636] Modules linked in:
[ 612.196636] CPU: 21 PID: 16300 Comm: trinity-c99 Not tainted 3.19.0-next-20150212-sasha-00072-gdc1aa32 #1913
[ 612.196636] task: ffff880012dbb000 ti: ffff880012df8000 task.ti: ffff880012df8000
[ 612.196636] RIP: copy_page_range (include/linux/mm.h:463 mm/memory.c:921 mm/memory.c:971 mm/memory.c:993 mm/memory.c:1055)
[ 612.196636] RSP: 0018:ffff880012dffad0 EFLAGS: 00010286
[ 612.196636] RAX: dffffc0000000000 RBX: 00000000132c6100 RCX: 0000000000000000
[ 612.196636] RDX: 1ffffd4000099637 RSI: 0000000000000000 RDI: ffffea00004cb1b8
[ 612.196636] RBP: ffff880012dffc60 R08: 0000000000000001 R09: 0000000000000000
[ 612.196636] R10: ffffffffa5875ce8 R11: 0000000000000001 R12: ffff880012df6630
[ 612.196636] R13: ffff880711fe6630 R14: 00007f33954c6000 R15: 0000000000000010
[ 612.196636] FS: 00007f33993b0700(0000) GS:ffff880712800000(0000) knlGS:0000000000000000
[ 612.196636] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 612.196636] CR2: 00007f33993b06c8 CR3: 000000002ab33000 CR4: 00000000000007a0
[ 612.196636] DR0: ffffffff80000fff DR1: 0000000000000000 DR2: 0000000000000000
[ 612.196636] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000b1060a
[ 612.196636] Stack:
[ 612.196636] ffffffffa1937460 0000000000000002 ffff880012dffb30 ffffffff944141f6
[ 612.196636] ffff880012df8010 0000000000000020 ffff880012dffbf0 0000000000000000
[ 612.196636] 0000000008100073 1ffff100025bff7a ffff880012df1e50 1ffff100025bf002
[ 612.196636] Call Trace:
[ 612.196636] ? __lock_is_held (kernel/locking/lockdep.c:3518)
[ 612.196636] ? apply_to_page_range (mm/memory.c:1002)
[ 612.196636] ? __vma_link_rb (mm/mmap.c:633)
[ 612.196636] ? anon_vma_fork (mm/rmap.c:351)
[ 612.196636] copy_process (kernel/fork.c:470 kernel/fork.c:869 kernel/fork.c:923 kernel/fork.c:1395)
[ 612.196636] ? __cleanup_sighand (kernel/fork.c:1196)
[ 612.196636] do_fork (kernel/fork.c:1659)
[ 612.196636] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[ 612.196636] ? fork_idle (kernel/fork.c:1636)
[ 612.196636] ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1598)
[ 612.196636] SyS_clone (kernel/fork.c:1748)
[ 612.196636] stub_clone (arch/x86/kernel/entry_64.S:517)
[ 612.196636] ? tracesys_phase2 (arch/x86/kernel/entry_64.S:422)
[ 612.196636] Code: ff df 48 89 f9 48 c1 e9 03 80 3c 11 00 0f 85 4c 04 00 00 48 8b 48 30 e9 fe f9 ff ff 48 c7 c6 40 34 f4 9e 48 89 c7 e8 0e ca fe ff <0f> 0b 0f 0b 48 89 c7 e8 12 2a ff ff e9 df fb ff ff 0f 0b 0f 0b
All code
========
0: ff df lcallq *<internal disassembler error>
2: 48 89 f9 mov %rdi,%rcx
5: 48 c1 e9 03 shr $0x3,%rcx
9: 80 3c 11 00 cmpb $0x0,(%rcx,%rdx,1)
d: 0f 85 4c 04 00 00 jne 0x45f
13: 48 8b 48 30 mov 0x30(%rax),%rcx
17: e9 fe f9 ff ff jmpq 0xfffffffffffffa1a
1c: 48 c7 c6 40 34 f4 9e mov $0xffffffff9ef43440,%rsi
23: 48 89 c7 mov %rax,%rdi
26: e8 0e ca fe ff callq 0xfffffffffffeca39
2b:* 0f 0b ud2 <-- trapping instruction
2d: 0f 0b ud2
2f: 48 89 c7 mov %rax,%rdi
32: e8 12 2a ff ff callq 0xffffffffffff2a49
37: e9 df fb ff ff jmpq 0xfffffffffffffc1b
3c: 0f 0b ud2
3e: 0f 0b ud2
...

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 0f 0b ud2
4: 48 89 c7 mov %rax,%rdi
7: e8 12 2a ff ff callq 0xffffffffffff2a1e
c: e9 df fb ff ff jmpq 0xfffffffffffffbf0
11: 0f 0b ud2
13: 0f 0b ud2
...
[ 612.196636] RIP copy_page_range (include/linux/mm.h:463 mm/memory.c:921 mm/memory.c:971 mm/memory.c:993 mm/memory.c:1055)
[ 612.196636] RSP <ffff880012dffad0>


Thanks,
Sasha

2015-02-12 19:25:38

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCHv3 14/24] thp: implement new split_huge_page()

On 02/12/2015 12:07 PM, Sasha Levin wrote:
> On 02/12/2015 11:18 AM, Kirill A. Shutemov wrote:
>> > +void __get_page_tail(struct page *page);
>> > static inline void get_page(struct page *page)
>> > {
>> > - struct page *page_head = compound_head(page);
>> > - VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
>> > - atomic_inc(&page_head->_count);
>> > + if (unlikely(PageTail(page)))
>> > + return __get_page_tail(page);
>> > +
>> > + /*
>> > + * Getting a normal page or the head of a compound page
>> > + * requires to already have an elevated page->_count.
>> > + */
>> > + VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
> This BUG_ON seems to get hit:

Plus a few more different traces:

[ 908.874364] BUG: Bad page map in process trinity-c55 pte:1ad673100 pmd:1721f3067
[ 908.877609] page:ffffea0006b59cc0 count:0 mapcount:-1 mapping: (null) index:0x2
[ 908.880244] flags: 0x12fffff80000000()
[ 908.881503] page dumped because: bad pte
[ 908.883124] addr:00007f0b86e73000 vm_flags:08100073 anon_vma:ffff88016f2b6438 mapping: (null) index:7f0b86e73
[ 908.887086] CPU: 55 PID: 15463 Comm: trinity-c55 Not tainted 3.19.0-next-20150212-sasha-00072-gdc1aa32 #1913
[ 908.889486] ffff88016f2c4ca0 000000003dbb1858 ffff88001688f738 ffffffffa7b863a0
[ 908.891869] 1ffff1002de58994 0000000000000000 ffff88001688f7a8 ffffffff9d6edf6c
[ 908.894464] 0000000000000000 ffffea0006b59cc0 00000001ad673100 0000000000000000
[ 908.896629] Call Trace:
[ 908.897351] dump_stack (lib/dump_stack.c:52)
[ 908.898848] print_bad_pte (mm/memory.c:694)
[ 908.900229] unmap_single_vma (mm/memory.c:1124 mm/memory.c:1215 mm/memory.c:1236 mm/memory.c:1260 mm/memory.c:1305)
[ 908.901701] ? vm_normal_page (mm/memory.c:1270)
[ 908.904309] ? pagevec_lru_move_fn (include/linux/pagevec.h:44 mm/swap.c:272)
[ 908.907091] ? lru_cache_add_file (mm/swap.c:861)
[ 908.910132] unmap_vmas (mm/memory.c:1334 (discriminator 3))
[ 908.912016] exit_mmap (mm/mmap.c:2841)
[ 908.913800] ? __debug_object_init (lib/debugobjects.c:667)
[ 908.915679] ? SyS_remap_file_pages (mm/mmap.c:2811)
[ 908.917569] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2464 mm/huge_memory.c:2245)
[ 908.919733] mmput (kernel/fork.c:681 kernel/fork.c:664)
[ 908.921609] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733)
[ 908.924012] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[ 908.926954] ? mm_update_next_owner (kernel/exit.c:654)
[ 908.929141] ? up_read (./arch/x86/include/asm/rwsem.h:156 kernel/locking/rwsem.c:101)
[ 908.931523] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:861)
[ 908.933657] get_signal (kernel/signal.c:2358)
[ 908.935700] ? trace_hardirqs_off (kernel/locking/lockdep.c:2647)
[ 908.938485] do_signal (arch/x86/kernel/signal.c:703)
[ 908.940637] ? setup_sigcontext (arch/x86/kernel/signal.c:700)
[ 908.943275] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:144 (discriminator 2))
[ 908.946258] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[ 908.951986] do_notify_resume (arch/x86/kernel/signal.c:748)
[ 908.955202] int_signal (arch/x86/kernel/entry_64.S:480)
[ 908.957110] Disabling lock debugging due to kernel taint
[ 909.052751] page:ffffea0006b59cc0 count:0 mapcount:-1 mapping: (null) index:0x2
[ 909.055737] flags: 0x12fffff80000000()
[ 909.057355] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[ 909.060767] ------------[ cut here ]------------
[ 909.061682] kernel BUG at include/linux/mm.h:340!
[ 909.061682] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 909.061682] Dumping ftrace buffer:
[ 909.061682] (ftrace buffer empty)
[ 909.061682] Modules linked in:
[ 909.061682] CPU: 55 PID: 15463 Comm: trinity-c55 Tainted: G B 3.19.0-next-20150212-sasha-00072-gdc1aa32 #1913
[ 909.061682] task: ffff88001accb000 ti: ffff880016888000 task.ti: ffff880016888000
[ 909.061682] RIP: release_pages (include/linux/mm.h:340 mm/swap.c:766)
[ 909.061682] RSP: 0000:ffff88001688f638 EFLAGS: 00010296
[ 909.061682] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 909.061682] RDX: 1ffffd4000d6b39f RSI: 0000000000000000 RDI: ffffea0006b59cf8
[ 909.061682] RBP: ffff88001688f708 R08: 0000000000000001 R09: 0000000000000000
[ 909.061682] R10: ffffffffae875ce8 R11: 3d2029746e756f63 R12: ffff88001ade0be8
[ 909.061682] R13: 00000000000001fe R14: dffffc0000000000 R15: ffffea0006b59cc0
[ 909.061682] FS: 0000000000000000(0000) GS:ffff881165600000(0000) knlGS:0000000000000000
[ 909.061682] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 909.061682] CR2: 00007f0b89ff08c1 CR3: 000000002a82c000 CR4: 00000000000007a0
[ 909.061682] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 909.061682] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 909.061682] Stack:
[ 909.061682] ffff88016f2c4ca0 1ffff10002d11ecf 0000000100000000 ffff880047ac0340
[ 909.061682] 0000000000000286[ 909.108909] pps pps0: PPS event at 1.227081767
[ 909.108918] pps pps0: capture assert seq #848

[ 909.061682] ffff881434f85000 ffff88001ade1000 ffff88140000001f
[ 909.061682] 0000000041b58ab3 ffffffffaa4a9f39 ffffffff9d68d0f0 0000000000000037
[ 909.061682] Call Trace:
[ 909.061682] ? put_pages_list (mm/swap.c:736)
[ 909.061682] ? get_parent_ip (kernel/sched/core.c:2581)
[ 909.061682] free_pages_and_swap_cache (mm/swap_state.c:267)
[ 909.061682] tlb_flush_mmu_free (mm/memory.c:255 (discriminator 4))
[ 909.061682] unmap_single_vma (mm/memory.c:1172 mm/memory.c:1215 mm/memory.c:1236 mm/memory.c:1260 mm/memory.c:1305)
[ 909.061682] ? vm_normal_page (mm/memory.c:1270)
[ 909.061682] ? pagevec_lru_move_fn (include/linux/pagevec.h:44 mm/swap.c:272)
[ 909.061682] ? lru_cache_add_file (mm/swap.c:861)
[ 909.061682] unmap_vmas (mm/memory.c:1334 (discriminator 3))
[ 909.061682] exit_mmap (mm/mmap.c:2841)
[ 909.061682] ? __debug_object_init (lib/debugobjects.c:667)
[ 909.061682] ? SyS_remap_file_pages (mm/mmap.c:2811)
[ 909.061682] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2464 mm/huge_memory.c:2245)
[ 909.061682] mmput (kernel/fork.c:681 kernel/fork.c:664)
[ 909.061682] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733)
[ 909.061682] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[ 909.061682] ? mm_update_next_owner (kernel/exit.c:654)
[ 909.061682] ? up_read (./arch/x86/include/asm/rwsem.h:156 kernel/locking/rwsem.c:101)
[ 909.061682] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:861)
[ 909.061682] get_signal (kernel/signal.c:2358)
[ 909.061682] ? trace_hardirqs_off (kernel/locking/lockdep.c:2647)
[ 909.061682] do_signal (arch/x86/kernel/signal.c:703)
[ 909.061682] ? setup_sigcontext (arch/x86/kernel/signal.c:700)
[ 909.061682] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:144 (discriminator 2))
[ 909.061682] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[ 909.061682] do_notify_resume (arch/x86/kernel/signal.c:748)
[ 909.061682] int_signal (arch/x86/kernel/entry_64.S:480)
[ 909.061682] Code: 18 e3 56 0a 4c 89 ff 31 db e8 1e [ 909.179340] BUG: Bad page map in process trinity-c65 pte:1ad673960 pmd:1ab55d067

Code starting with the faulting instruction
===========================================
[ 909.179350] page:ffffea0006b59cc0 count:0 mapcount:-2 mapping: (null) index:0x2
[ 909.179357] flags: 0x12fffff80000014(referenced|dirty)
[ 909.179370] page dumped because: bad pte
[ 909.179376] addr:0000000001105000 vm_flags:08100073 anon_vma:ffff8801ab54b378 mapping: (null) index:1105
[ 909.179387] CPU: 7 PID: 15373 Comm: trinity-c65 Tainted: G B 3.19.0-next-20150212-sasha-00072-gdc1aa32 #1913
[ 909.179399] ffff8801ab54daa0 00000000faf56450 ffff8801ab557738 ffffffffa7b863a0
[ 909.179411] 1ffff100356a9b54 0000000000000000 ffff8801ab5577a8 ffffffff9d6edf6c
[ 909.179425] 0000000000000000 ffffea0006b59cc0 00000001ad673960 0000000000000000
[ 909.179446] Call Trace:
[ 909.179451] dump_stack (lib/dump_stack.c:52)
[ 909.179475] print_bad_pte (mm/memory.c:694)
[ 909.179492] unmap_single_vma (mm/memory.c:1124 mm/memory.c:1215 mm/memory.c:1236 mm/memory.c:1260 mm/memory.c:1305)
[ 909.179516] ? vm_normal_page (mm/memory.c:1270)
[ 909.179532] ? cmpxchg_double_slab.isra.27 (mm/slub.c:429)
[ 909.179686] unmap_vmas (mm/memory.c:1334 (discriminator 3))
[ 909.179695] exit_mmap (mm/mmap.c:2841)
[ 909.179705] ? __debug_object_init (lib/debugobjects.c:667)
[ 909.179720] ? SyS_remap_file_pages (mm/mmap.c:2811)
[ 909.179838] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2464 mm/huge_memory.c:2245)
[ 909.179853] mmput (kernel/fork.c:681 kernel/fork.c:664)
[ 909.179863] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733)
[ 909.179982] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[ 909.179997] ? mm_update_next_owner (kernel/exit.c:654)
[ 909.186881] ? up_read (./arch/x86/include/asm/rwsem.h:156 kernel/locking/rwsem.c:101)
[ 909.186891] ? task_numa_work (kernel/sched/fair.c:2217)
[ 909.186905] ? get_signal (kernel/signal.c:2207)
[ 909.186919] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:861)
[ 909.186928] get_signal (kernel/signal.c:2358)
[ 909.186936] ? trace_hardirqs_off (kernel/locking/lockdep.c:2647)
[ 909.186945] do_signal (arch/x86/kernel/signal.c:703)
[ 909.186964] ? setup_sigcontext (arch/x86/kernel/signal.c:700)
[ 909.186982] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:154 kernel/locking/spinlock.c:183)
[ 909.187006] ? context_tracking_user_exit (include/linux/vtime.h:89 include/linux/jump_label.h:114 include/trace/events/context_tracking.h:47 kernel/context_tracking.c:140)
[ 909.187570] ? rcu_eqs_exit (kernel/rcu/tree.c:743)
[ 909.187584] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2566)
[ 909.187593] do_notify_resume (arch/x86/kernel/signal.c:748)
[ 909.187601] retint_signal (arch/x86/kernel/entry_64.S:895)
[ 909.236015] page:ffffea0006b59cc0 count:0 mapcount:-2 mapping: (null) index:0x2
[ 909.236028] flags: 0x12fffff80000014(referenced|dirty)
[ 909.236048] page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 24) - 1))

[ 909.061682] f7 ff ff e9 d5 fc ff ff 66 0f 1f 84 00 00 00 00 00 48 c7 c6 a0 1b f3 a7 4c 89 ff e8 61 bb 05 00 <0f> 0b 0f 1f 80 00 00 00 00 48 c7 c6 e0 1a f3 a7 4c 89 ff e8 49
[ 909.061682] RIP release_pages (include/linux/mm.h:340 mm/swap.c:766)
[ 909.061682] RSP <ffff88001688f638>


Thanks,
Sasha

2015-02-12 19:56:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCHv3 03/24] mm: avoid PG_locked on tail pages

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/12/2015 11:18 AM, Kirill A. Shutemov wrote:
> With new refcounting pte entries can point to tail pages. It's
> doesn't make much sense to mark tail page locked -- we need to
> protect whole compound page.
>
> This patch adjust helpers related to PG_locked to operate on head
> page.
>
> Signed-off-by: Kirill A. Shutemov
> <[email protected]> --- include/linux/page-flags.h |
> 3 ++- include/linux/pagemap.h | 5 +++++ mm/filemap.c
> | 1 + mm/slub.c | 2 ++ 4 files changed, 10
> insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/page-flags.h
> b/include/linux/page-flags.h index 5ed7bdaf22d5..d471370f27e8
> 100644 --- a/include/linux/page-flags.h +++
> b/include/linux/page-flags.h @@ -207,7 +207,8 @@ static inline int
> __TestClearPage##uname(struct page *page) { return 0; }
>
> struct page; /* forward declaration */
>
> -TESTPAGEFLAG(Locked, locked) +#define PageLocked(page)
> test_bit(PG_locked, &compound_head(page)->flags) + PAGEFLAG(Error,
> error) TESTCLEARFLAG(Error, error) PAGEFLAG(Referenced, referenced)
> TESTCLEARFLAG(Referenced, referenced) __SETPAGEFLAG(Referenced,
> referenced) diff --git a/include/linux/pagemap.h
> b/include/linux/pagemap.h index 4b3736f7065c..ad6da4e49555 100644
> --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@
> -428,16 +428,19 @@ extern void unlock_page(struct page *page);
>
> static inline void __set_page_locked(struct page *page) { +
> VM_BUG_ON_PAGE(PageTail(page), page); __set_bit(PG_locked,
> &page->flags); }
>
> static inline void __clear_page_locked(struct page *page) { +
> VM_BUG_ON_PAGE(PageTail(page), page); __clear_bit(PG_locked,
> &page->flags); }
>
> static inline int trylock_page(struct page *page) { + page =
> compound_head(page); return
> (likely(!test_and_set_bit_lock(PG_locked, &page->flags))); }
>
> @@ -490,6 +493,7 @@ extern int
> wait_on_page_bit_killable_timeout(struct page *page,
>
> static inline int wait_on_page_locked_killable(struct page *page)
> { + page = compound_head(page); if (PageLocked(page)) return
> wait_on_page_bit_killable(page, PG_locked); return 0; @@ -510,6
> +514,7 @@ static inline void wake_up_page(struct page *page, int
> bit) */ static inline void wait_on_page_locked(struct page *page)
> { + page = compound_head(page); if (PageLocked(page))
> wait_on_page_bit(page, PG_locked); }

These are all atomic operations.

This may be a stupid question with the answer lurking somewhere
in the other patches, but how do you ensure you operate on the
right page lock during a THP collapse or split?



--
All rights reversed

2015-02-12 20:11:17

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCHv3 03/24] mm: avoid PG_locked on tail pages


On 02/12/2015 02:55 PM, Rik van Riel wrote:
> On 02/12/2015 11:18 AM, Kirill A. Shutemov wrote:

>> @@ -490,6 +493,7 @@ extern int wait_on_page_bit_killable_timeout(struct page *page,
>>
>>  static inline int wait_on_page_locked_killable(struct page *page)
>>  {
>> +	page = compound_head(page);
>>  	if (PageLocked(page))
>>  		return wait_on_page_bit_killable(page, PG_locked);
>>  	return 0;
>> @@ -510,6 +514,7 @@ static inline void wake_up_page(struct page *page, int bit)
>>   */
>>  static inline void wait_on_page_locked(struct page *page)
>>  {
>> +	page = compound_head(page);
>>  	if (PageLocked(page))
>>  		wait_on_page_bit(page, PG_locked);
>>  }
>
> These are all atomic operations.
>
> This may be a stupid question with the answer lurking somewhere in
> the other patches, but how do you ensure you operate on the right
> page lock during a THP collapse or split?

Kirill answered that question on IRC.

The VM takes a refcount on a page before attempting to take a page
lock, which prevents the THP code from doing anything with the
page. In other words, while we have a refcount on the page, we
will dereference the same page lock.
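
A minimal sketch of the pattern described above -- not code from the
patchset, and the helper name is made up -- just to make the ordering
explicit:

static void lock_pinned_page(struct page *page)
{
	/*
	 * The pin taken by get_page() keeps the compound page from being
	 * split or freed, so compound_head() keeps resolving to the same
	 * head page and therefore to the same PG_locked bit.
	 */
	get_page(page);
	lock_page(compound_head(page));

	/* ... work on the (possibly huge) page under its lock ... */

	unlock_page(compound_head(page));
	put_page(page);
}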

--
All rights reversed

2015-02-12 21:18:07

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCHv3 04/24] rmap: add argument to charge compound page


On 02/12/2015 11:18 AM, Kirill A. Shutemov wrote:

> +++ b/include/linux/rmap.h
> @@ -168,16 +168,24 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
>
>  struct anon_vma *page_get_anon_vma(struct page *page);
>
> +/* flags for do_page_add_anon_rmap() */
> +enum {
> +	RMAP_EXCLUSIVE = 1,
> +	RMAP_COMPOUND = 2,
> +};

Always a good idea to name things. However, "exclusive" is
not that clear to me. Given that the argument is supposed
to indicate whether we map a single or a compound page,
maybe the names in the enum could just be SINGLE and COMPOUND?

Naming the enum should make it clear enough what it does:

enum rmap_page {
	SINGLE = 0,
	COMPOUND
};

> +++ b/kernel/events/uprobes.c
> @@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  		goto unlock;
>
>  	get_page(kpage);
> -	page_add_new_anon_rmap(kpage, vma, addr);
> +	page_add_new_anon_rmap(kpage, vma, addr, false);
>  	mem_cgroup_commit_charge(kpage, memcg, false);
>  	lru_cache_add_active_or_unevictable(kpage, vma);

Would it make sense to use the name in the argument to that function,
too?

I often find it a lot easier to see what things do if they use symbolic
names, rather than by trying to remember what each boolean argument to
a function does.

--
All rights reversed

2015-02-16 15:39:58

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3 04/24] rmap: add argument to charge compound page

On Thu, Feb 12, 2015 at 04:10:21PM -0500, Rik van Riel wrote:
>
> On 02/12/2015 11:18 AM, Kirill A. Shutemov wrote:
>
> > +++ b/include/linux/rmap.h
> > @@ -168,16 +168,24 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
> >
> >  struct anon_vma *page_get_anon_vma(struct page *page);
> >
> > +/* flags for do_page_add_anon_rmap() */
> > +enum {
> > +	RMAP_EXCLUSIVE = 1,
> > +	RMAP_COMPOUND = 2,
> > +};
>
> Always a good idea to name things. However, "exclusive" is
> not that clear to me. Given that the argument is supposed
> to indicate whether we map a single or a compound page,
> maybe the names in the enum could just be SINGLE and COMPOUND?
>
> Naming the enum should make it clear enough what it does:
>
> enum rmap_page {
> SINGLE = 0,
> COMPOUND
> }

Okay, this is probably confusing: do_page_add_anon_rmap() already had one
of its arguments called 'exclusive'. It indicates whether the page is
exclusively owned by the current process. I also needed to indicate
whether the page should be handled as a compound page or not. I've reused
the same argument and converted it to bit-flags: bit 0 is exclusive,
bit 1 is compound.
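
Roughly, the callers end up combining the two bits like this (the wrapper
below is only an illustration, not a helper from the patchset):

static void add_anon_rmap(struct page *page, struct vm_area_struct *vma,
		unsigned long address, bool exclusive, bool compound)
{
	/* bit 0 = exclusive, bit 1 = compound, as described above */
	int flags = (exclusive ? RMAP_EXCLUSIVE : 0) |
		    (compound ? RMAP_COMPOUND : 0);

	do_page_add_anon_rmap(page, vma, address, flags);
}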

>
> > +++ b/kernel/events/uprobes.c
> > @@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  		goto unlock;
> >
> >  	get_page(kpage);
> > -	page_add_new_anon_rmap(kpage, vma, addr);
> > +	page_add_new_anon_rmap(kpage, vma, addr, false);
> >  	mem_cgroup_commit_charge(kpage, memcg, false);
> >  	lru_cache_add_active_or_unevictable(kpage, vma);
>
> Would it make sense to use the name in the argument to that function,
> too?
>
> I often find it a lot easier to see what things do if they use symbolic
> names, rather than by trying to remember what each boolean argument to
> a function does.

I can convert these compound booleans to enums if you want. I'm personally
not sure that it will bring much value.

--
Kirill A. Shutemov

2015-02-16 15:58:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3 14/24] thp: implement new split_huge_page()

On Thu, Feb 12, 2015 at 02:24:40PM -0500, Sasha Levin wrote:
> On 02/12/2015 12:07 PM, Sasha Levin wrote:
> > On 02/12/2015 11:18 AM, Kirill A. Shutemov wrote:
> >> > +void __get_page_tail(struct page *page);
> >> > static inline void get_page(struct page *page)
> >> > {
> >> > - struct page *page_head = compound_head(page);
> >> > - VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
> >> > - atomic_inc(&page_head->_count);
> >> > + if (unlikely(PageTail(page)))
> >> > + return __get_page_tail(page);
> >> > +
> >> > + /*
> >> > + * Getting a normal page or the head of a compound page
> >> > + * requires to already have an elevated page->_count.
> >> > + */
> >> > + VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
> > This BUG_ON seems to get hit:
>
> Plus a few more different traces:

Sasha, could you check if the patch below makes things any better?

diff --git a/mm/gup.c b/mm/gup.c
index 22585ef667d9..10d98d39bc03 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -211,12 +211,19 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		if (flags & FOLL_SPLIT) {
 			int ret;
 			page = pmd_page(*pmd);
-			get_page(page);
-			spin_unlock(ptl);
-			lock_page(page);
-			ret = split_huge_page(page);
-			unlock_page(page);
-			put_page(page);
+			if (is_huge_zero_page(page)) {
+				spin_unlock(ptl);
+				ret = 0;
+				split_huge_pmd(vma, pmd, address);
+			} else {
+				get_page(page);
+				spin_unlock(ptl);
+				lock_page(page);
+				ret = split_huge_page(page);
+				unlock_page(page);
+				put_page(page);
+			}
+
 			return ret ? ERR_PTR(ret) :
 				follow_page_pte(vma, address, pmd, flags);
 		}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2667938a3d2c..4d69baa41a6c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1821,7 +1821,7 @@ static int __split_huge_page_refcount(struct anon_vma *anon_vma,
 	int tail_mapcount = 0;

 	freeze_page(anon_vma, page);
-	BUG_ON(compound_mapcount(page));
+	VM_BUG_ON_PAGE(compound_mapcount(page), page);

 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
diff --git a/mm/memory.c b/mm/memory.c
index f81bcd539ca0..5153fd0d8e5c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2231,7 +2231,7 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
 		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	if (old_page) {
+	if (old_page && !PageTransCompound(old_page)) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
 		 * keep the mlocked page.
diff --git a/mm/mlock.c b/mm/mlock.c
index 40c6ab590cde..6afef15f80ab 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -502,39 +502,26 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
 				&page_mask);

-		if (page && !IS_ERR(page)) {
-			if (PageTransHuge(page)) {
-				lock_page(page);
-				/*
-				 * Any THP page found by follow_page_mask() may
-				 * have gotten split before reaching
-				 * munlock_vma_page(), so we need to recompute
-				 * the page_mask here.
-				 */
-				page_mask = munlock_vma_page(page);
-				unlock_page(page);
-				put_page(page); /* follow_page_mask() */
-			} else {
-				/*
-				 * Non-huge pages are handled in batches via
-				 * pagevec. The pin from follow_page_mask()
-				 * prevents them from collapsing by THP.
-				 */
-				pagevec_add(&pvec, page);
-				zone = page_zone(page);
-				zoneid = page_zone_id(page);
+		if (page && !IS_ERR(page) && !PageTransCompound(page)) {
+			/*
+			 * Non-huge pages are handled in batches via
+			 * pagevec. The pin from follow_page_mask()
+			 * prevents them from collapsing by THP.
+			 */
+			pagevec_add(&pvec, page);
+			zone = page_zone(page);
+			zoneid = page_zone_id(page);

-				/*
-				 * Try to fill the rest of pagevec using fast
-				 * pte walk. This will also update start to
-				 * the next page to process. Then munlock the
-				 * pagevec.
-				 */
-				start = __munlock_pagevec_fill(&pvec, vma,
-						zoneid, start, end);
-				__munlock_pagevec(&pvec, zone);
-				goto next;
-			}
+			/*
+			 * Try to fill the rest of pagevec using fast
+			 * pte walk. This will also update start to
+			 * the next page to process. Then munlock the
+			 * pagevec.
+			 */
+			start = __munlock_pagevec_fill(&pvec, vma,
+					zoneid, start, end);
+			__munlock_pagevec(&pvec, zone);
+			goto next;
 		}
 		/* It's a bug to munlock in the middle of a THP page */
 		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
--
Kirill A. Shutemov

2015-02-20 17:31:58

by Jerome Marchand

[permalink] [raw]
Subject: Re: [PATCHv3 05/24] mm, proc: adjust PSS calculation

On 02/12/2015 05:18 PM, Kirill A. Shutemov wrote:
> With the new refcounting, all subpages of a compound page do not
> necessarily have the same mapcount. We need to take into account the
> mapcount of every sub-page.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
> 1 file changed, 22 insertions(+), 21 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 98826d08a11b..8a0a78174cc6 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -449,9 +449,10 @@ struct mem_size_stats {
> };
>
> static void smaps_account(struct mem_size_stats *mss, struct page *page,
> - unsigned long size, bool young, bool dirty)
> + bool compound, bool young, bool dirty)
> {
> - int mapcount;
> + int i, nr = compound ? hpage_nr_pages(page) : 1;
> + unsigned long size = 1UL << nr;

Shouldn't that be:
unsigned long size = nr << PAGE_SHIFT;

>
> if (PageAnon(page))
> mss->anonymous += size;
> @@ -460,23 +461,23 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
> /* Accumulate the size in pages that have been accessed. */
> if (young || PageReferenced(page))
> mss->referenced += size;
> - mapcount = page_mapcount(page);
> - if (mapcount >= 2) {
> - u64 pss_delta;
>
> - if (dirty || PageDirty(page))
> - mss->shared_dirty += size;
> - else
> - mss->shared_clean += size;
> - pss_delta = (u64)size << PSS_SHIFT;
> - do_div(pss_delta, mapcount);
> - mss->pss += pss_delta;
> - } else {
> - if (dirty || PageDirty(page))
> - mss->private_dirty += size;
> - else
> - mss->private_clean += size;
> - mss->pss += (u64)size << PSS_SHIFT;
> + for (i = 0; i < nr; i++) {
> + int mapcount = page_mapcount(page + i);
> +
> + if (mapcount >= 2) {
> + if (dirty || PageDirty(page + i))
> + mss->shared_dirty += PAGE_SIZE;
> + else
> + mss->shared_clean += PAGE_SIZE;
> + mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
> + } else {
> + if (dirty || PageDirty(page + i))
> + mss->private_dirty += PAGE_SIZE;
> + else
> + mss->private_clean += PAGE_SIZE;
> + mss->pss += PAGE_SIZE << PSS_SHIFT;
> + }
> }
> }
>
> @@ -500,7 +501,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>
> if (!page)
> return;
> - smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
> +
> + smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
> }
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -516,8 +518,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> if (IS_ERR_OR_NULL(page))
> return;
> mss->anonymous_thp += HPAGE_PMD_SIZE;
> - smaps_account(mss, page, HPAGE_PMD_SIZE,
> - pmd_young(*pmd), pmd_dirty(*pmd));
> + smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
> }
> #else
> static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>




2015-02-20 17:40:32

by Jerome Marchand

[permalink] [raw]
Subject: Re: [PATCHv3 04/24] rmap: add argument to charge compound page

On 02/16/2015 04:20 PM, Kirill A. Shutemov wrote:
> On Thu, Feb 12, 2015 at 04:10:21PM -0500, Rik van Riel wrote:
>>
>> On 02/12/2015 11:18 AM, Kirill A. Shutemov wrote:
>>
>>> +++ b/include/linux/rmap.h
>>> @@ -168,16 +168,24 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
>>>
>>>  struct anon_vma *page_get_anon_vma(struct page *page);
>>>
>>> +/* flags for do_page_add_anon_rmap() */
>>> +enum {
>>> +	RMAP_EXCLUSIVE = 1,
>>> +	RMAP_COMPOUND = 2,
>>> +};
>>
>> Always a good idea to name things. However, "exclusive" is
>> not that clear to me. Given that the argument is supposed
>> to indicate whether we map a single or a compound page,
>> maybe the names in the enum could just be SINGLE and COMPOUND?
>>
>> Naming the enum should make it clear enough what it does:
>>
>> enum rmap_page {
>> SINGLE = 0,
>> COMPOUND
>> }
>
> Okay, this is probably confusing: do_page_add_anon_rmap() already had one
> of its arguments called 'exclusive'. It indicates whether the page is
> exclusively owned by the current process. I also needed to indicate
> whether the page should be handled as a compound page or not. I've reused
> the same argument and converted it to bit-flags: bit 0 is exclusive,
> bit 1 is compound.

AFAICT, this is not a common use of an enum, and that is probably the
reason why Rik was confused (I know I find it confusing). Bit-flag members
are usually defined by macros.
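
The more conventional spelling Jerome is alluding to would be plain
defines -- illustration only, not what the patch does:

#define RMAP_EXCLUSIVE	0x01	/* page is exclusively owned by the process */
#define RMAP_COMPOUND	0x02	/* operate on the whole compound page */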

Jerome
>
>>
>>> +++ b/kernel/events/uprobes.c
>>> @@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>>>  		goto unlock;
>>>
>>>  	get_page(kpage);
>>> -	page_add_new_anon_rmap(kpage, vma, addr);
>>> +	page_add_new_anon_rmap(kpage, vma, addr, false);
>>>  	mem_cgroup_commit_charge(kpage, memcg, false);
>>>  	lru_cache_add_active_or_unevictable(kpage, vma);
>>
>> Would it make sense to use the name in the argument to that function,
>> too?
>>
>> I often find it a lot easier to see what things do if they use symbolic
>> names, rather than by trying to remember what each boolean argument to
>> a function does.
>
> I can convert these compound booleans to enums if you want. I'm personally
> not sure that it will bring much value.
>




2015-02-23 13:53:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3 05/24] mm, proc: adjust PSS calculation

On Fri, Feb 20, 2015 at 06:31:15PM +0100, Jerome Marchand wrote:
> On 02/12/2015 05:18 PM, Kirill A. Shutemov wrote:
> > With the new refcounting, all subpages of a compound page do not
> > necessarily have the same mapcount. We need to take into account the
> > mapcount of every sub-page.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
> > 1 file changed, 22 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 98826d08a11b..8a0a78174cc6 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -449,9 +449,10 @@ struct mem_size_stats {
> > };
> >
> > static void smaps_account(struct mem_size_stats *mss, struct page *page,
> > - unsigned long size, bool young, bool dirty)
> > + bool compound, bool young, bool dirty)
> > {
> > - int mapcount;
> > + int i, nr = compound ? hpage_nr_pages(page) : 1;
> > + unsigned long size = 1UL << nr;
>
> Shouldn't that be:
> unsigned long size = nr << PAGE_SHIFT;

Yes, thank you.
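
For the record, a worked example of the difference, assuming 4k base
pages (PAGE_SHIFT == 12) and a 2M THP (nr == 512):

	/* 1UL << nr        -> 1UL << 512: undefined shift, not a size at all */
	/* nr << PAGE_SHIFT -> 512 << 12 == 2097152 == HPAGE_PMD_SIZE         */
	/*                      1 << 12 ==    4096 == PAGE_SIZE for nr == 1   */
	unsigned long size = nr << PAGE_SHIFT;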

--
Kirill A. Shutemov

2015-02-23 16:21:38

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCHv3 04/24] rmap: add argument to charge compound page

On 02/12/2015 05:18 PM, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of a THP compound
> page. It means we cannot rely on a PageTransHuge() check to decide
> whether to map a small page or the whole THP.
>
> The patch adds a new argument to the rmap functions to indicate whether
> we want to map the whole compound page or only the small page.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/rmap.h | 14 +++++++++++---
> kernel/events/uprobes.c | 4 ++--
> mm/huge_memory.c | 16 ++++++++--------
> mm/hugetlb.c | 4 ++--
> mm/ksm.c | 4 ++--
> mm/memory.c | 14 +++++++-------
> mm/migrate.c | 8 ++++----
> mm/rmap.c | 43 +++++++++++++++++++++++++++----------------
> mm/swapfile.c | 4 ++--
> 9 files changed, 65 insertions(+), 46 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index c4088feac1fc..3bf73620b672 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -168,16 +168,24 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
>
> struct anon_vma *page_get_anon_vma(struct page *page);
>
> +/* flags for do_page_add_anon_rmap() */
> +enum {
> + RMAP_EXCLUSIVE = 1,
> + RMAP_COMPOUND = 2,
> +};
> +
> /*
> * rmap interfaces called when adding or removing pte of page
> */
> void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
> -void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
> +void page_add_anon_rmap(struct page *, struct vm_area_struct *,
> + unsigned long, bool);
> void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
> unsigned long, int);
> -void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
> +void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
> + unsigned long, bool);
> void page_add_file_rmap(struct page *);
> -void page_remove_rmap(struct page *);
> +void page_remove_rmap(struct page *, bool);
>
> void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
> unsigned long);
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index cb346f26a22d..5523daf59953 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> goto unlock;
>
> get_page(kpage);
> - page_add_new_anon_rmap(kpage, vma, addr);
> + page_add_new_anon_rmap(kpage, vma, addr, false);
> mem_cgroup_commit_charge(kpage, memcg, false);
> lru_cache_add_active_or_unevictable(kpage, vma);
>
> @@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> ptep_clear_flush_notify(vma, addr, ptep);
> set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
>
> - page_remove_rmap(page);
> + page_remove_rmap(page, false);
> if (!page_mapped(page))
> try_to_free_swap(page);
> pte_unmap_unlock(ptep, ptl);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5f4c97e1a6da..36637a80669e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -743,7 +743,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> pmd_t entry;
> entry = mk_huge_pmd(page, vma->vm_page_prot);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> - page_add_new_anon_rmap(page, vma, haddr);
> + page_add_new_anon_rmap(page, vma, haddr, true);
> mem_cgroup_commit_charge(page, memcg, false);
> lru_cache_add_active_or_unevictable(page, vma);
> pgtable_trans_huge_deposit(mm, pmd, pgtable);
> @@ -1034,7 +1034,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> memcg = (void *)page_private(pages[i]);
> set_page_private(pages[i], 0);
> - page_add_new_anon_rmap(pages[i], vma, haddr);
> + page_add_new_anon_rmap(pages[i], vma, haddr, false);
> mem_cgroup_commit_charge(pages[i], memcg, false);
> lru_cache_add_active_or_unevictable(pages[i], vma);
> pte = pte_offset_map(&_pmd, haddr);
> @@ -1046,7 +1046,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>
> smp_wmb(); /* make pte visible before pmd */
> pmd_populate(mm, pmd, pgtable);
> - page_remove_rmap(page);
> + page_remove_rmap(page, true);
> spin_unlock(ptl);
>
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> @@ -1168,7 +1168,7 @@ alloc:
> entry = mk_huge_pmd(new_page, vma->vm_page_prot);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> pmdp_clear_flush_notify(vma, haddr, pmd);
> - page_add_new_anon_rmap(new_page, vma, haddr);
> + page_add_new_anon_rmap(new_page, vma, haddr, true);
> mem_cgroup_commit_charge(new_page, memcg, false);
> lru_cache_add_active_or_unevictable(new_page, vma);
> set_pmd_at(mm, haddr, pmd, entry);
> @@ -1178,7 +1178,7 @@ alloc:
> put_huge_zero_page();
> } else {
> VM_BUG_ON_PAGE(!PageHead(page), page);
> - page_remove_rmap(page);
> + page_remove_rmap(page, true);
> put_page(page);
> }
> ret |= VM_FAULT_WRITE;
> @@ -1431,7 +1431,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> put_huge_zero_page();
> } else {
> page = pmd_page(orig_pmd);
> - page_remove_rmap(page);
> + page_remove_rmap(page, true);
> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> VM_BUG_ON_PAGE(!PageHead(page), page);
> @@ -2368,7 +2368,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> * superfluous.
> */
> pte_clear(vma->vm_mm, address, _pte);
> - page_remove_rmap(src_page);
> + page_remove_rmap(src_page, false);
> spin_unlock(ptl);
> free_page_and_swap_cache(src_page);
> }
> @@ -2658,7 +2658,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>
> spin_lock(pmd_ptl);
> BUG_ON(!pmd_none(*pmd));
> - page_add_new_anon_rmap(new_page, vma, address);
> + page_add_new_anon_rmap(new_page, vma, address, true);
> mem_cgroup_commit_charge(new_page, memcg, false);
> lru_cache_add_active_or_unevictable(new_page, vma);
> pgtable_trans_huge_deposit(mm, pmd, pgtable);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0a9ac6c26832..ebb7329301c4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2688,7 +2688,7 @@ again:
> if (huge_pte_dirty(pte))
> set_page_dirty(page);
>
> - page_remove_rmap(page);
> + page_remove_rmap(page, true);
> force_flush = !__tlb_remove_page(tlb, page);
> if (force_flush) {
> address += sz;
> @@ -2908,7 +2908,7 @@ retry_avoidcopy:
> mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
> set_huge_pte_at(mm, address, ptep,
> make_huge_pte(vma, new_page, 1));
> - page_remove_rmap(old_page);
> + page_remove_rmap(old_page, true);
> hugepage_add_new_anon_rmap(new_page, vma, address);
> /* Make the old page be freed below */
> new_page = old_page;
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 4162dce2eb44..92182eeba87d 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -957,13 +957,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> }
>
> get_page(kpage);
> - page_add_anon_rmap(kpage, vma, addr);
> + page_add_anon_rmap(kpage, vma, addr, false);
>
> flush_cache_page(vma, addr, pte_pfn(*ptep));
> ptep_clear_flush_notify(vma, addr, ptep);
> set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
>
> - page_remove_rmap(page);
> + page_remove_rmap(page, false);
> if (!page_mapped(page))
> try_to_free_swap(page);
> put_page(page);
> diff --git a/mm/memory.c b/mm/memory.c
> index 8ae52c918415..5529627d2cd6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1125,7 +1125,7 @@ again:
> mark_page_accessed(page);
> rss[MM_FILEPAGES]--;
> }
> - page_remove_rmap(page);
> + page_remove_rmap(page, false);
> if (unlikely(page_mapcount(page) < 0))
> print_bad_pte(vma, addr, ptent, page);
> if (unlikely(!__tlb_remove_page(tlb, page))) {
> @@ -2189,7 +2189,7 @@ gotten:
> * thread doing COW.
> */
> ptep_clear_flush_notify(vma, address, page_table);
> - page_add_new_anon_rmap(new_page, vma, address);
> + page_add_new_anon_rmap(new_page, vma, address, false);
> mem_cgroup_commit_charge(new_page, memcg, false);
> lru_cache_add_active_or_unevictable(new_page, vma);
> /*
> @@ -2222,7 +2222,7 @@ gotten:
> * mapcount is visible. So transitively, TLBs to
> * old page will be flushed before it can be reused.
> */
> - page_remove_rmap(old_page);
> + page_remove_rmap(old_page, false);
> }
>
> /* Free the old page.. */
> @@ -2465,7 +2465,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> flags &= ~FAULT_FLAG_WRITE;
> ret |= VM_FAULT_WRITE;
> - exclusive = 1;
> + exclusive = RMAP_EXCLUSIVE;
> }
> flush_icache_page(vma, page);
> if (pte_swp_soft_dirty(orig_pte))
> @@ -2475,7 +2475,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> do_page_add_anon_rmap(page, vma, address, exclusive);
> mem_cgroup_commit_charge(page, memcg, true);
> } else { /* ksm created a completely new copy */
> - page_add_new_anon_rmap(page, vma, address);
> + page_add_new_anon_rmap(page, vma, address, false);
> mem_cgroup_commit_charge(page, memcg, false);
> lru_cache_add_active_or_unevictable(page, vma);
> }
> @@ -2613,7 +2613,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> goto release;
>
> inc_mm_counter_fast(mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, address);
> + page_add_new_anon_rmap(page, vma, address, false);
> mem_cgroup_commit_charge(page, memcg, false);
> lru_cache_add_active_or_unevictable(page, vma);
> setpte:
> @@ -2701,7 +2701,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> if (anon) {
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, address);
> + page_add_new_anon_rmap(page, vma, address, false);
> } else {
> inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
> page_add_file_rmap(page);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 85e042686031..0d2b3110277a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -166,7 +166,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> else
> page_dup_rmap(new);
> } else if (PageAnon(new))
> - page_add_anon_rmap(new, vma, addr);
> + page_add_anon_rmap(new, vma, addr, false);
> else
> page_add_file_rmap(new);
>
> @@ -1803,7 +1803,7 @@ fail_putback:
> * guarantee the copy is visible before the pagetable update.
> */
> flush_cache_range(vma, mmun_start, mmun_end);
> - page_add_anon_rmap(new_page, vma, mmun_start);
> + page_add_anon_rmap(new_page, vma, mmun_start, true);
> pmdp_clear_flush_notify(vma, mmun_start, pmd);
> set_pmd_at(mm, mmun_start, pmd, entry);
> flush_tlb_range(vma, mmun_start, mmun_end);
> @@ -1814,13 +1814,13 @@ fail_putback:
> flush_tlb_range(vma, mmun_start, mmun_end);
> mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
> update_mmu_cache_pmd(vma, address, &entry);
> - page_remove_rmap(new_page);
> + page_remove_rmap(new_page, true);
> goto fail_putback;
> }
>
> mem_cgroup_migrate(page, new_page, false);
>
> - page_remove_rmap(page);
> + page_remove_rmap(page, true);
>
> spin_unlock(ptl);
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 47b3ba87c2dd..f67e83be75e4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1041,9 +1041,9 @@ static void __page_check_anon_rmap(struct page *page,
> * (but PageKsm is never downgraded to PageAnon).
> */
> void page_add_anon_rmap(struct page *page,
> - struct vm_area_struct *vma, unsigned long address)
> + struct vm_area_struct *vma, unsigned long address, bool compound)
> {
> - do_page_add_anon_rmap(page, vma, address, 0);
> + do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
> }
>
> /*
> @@ -1052,21 +1052,24 @@ void page_add_anon_rmap(struct page *page,
> * Everybody else should continue to use page_add_anon_rmap above.
> */
> void do_page_add_anon_rmap(struct page *page,
> - struct vm_area_struct *vma, unsigned long address, int exclusive)
> + struct vm_area_struct *vma, unsigned long address, int flags)
> {
> int first = atomic_inc_and_test(&page->_mapcount);
> if (first) {
> + bool compound = flags & RMAP_COMPOUND;
> + int nr = compound ? hpage_nr_pages(page) : 1;

hpage_nr_pages(page) is:

static inline int hpage_nr_pages(struct page *page)
{
	if (unlikely(PageTransHuge(page)))
		return HPAGE_PMD_NR;
	return 1;
}

and later...

> /*
> * We use the irq-unsafe __{inc|mod}_zone_page_stat because
> * these counters are not modified in interrupt context, and
> * pte lock(a spinlock) is held, which implies preemption
> * disabled.
> */
> - if (PageTransHuge(page))
> + if (compound) {
> + VM_BUG_ON_PAGE(!PageTransHuge(page), page);

this means that we could assume that
(compound == true) => (PageTransHuge(page) == true)

and simplify above to:

int nr = compound ? HPAGE_PMD_NR : 1;

Right?
Same thing seems to hold for the two other variants below.

> __inc_zone_page_state(page,
> NR_ANON_TRANSPARENT_HUGEPAGES);
> - __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
> - hpage_nr_pages(page));
> + }
> + __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
> }
> if (unlikely(PageKsm(page)))
> return;
> @@ -1074,7 +1077,8 @@ void do_page_add_anon_rmap(struct page *page,
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> /* address might be in next vma when migration races vma_adjust */
> if (first)
> - __page_set_anon_rmap(page, vma, address, exclusive);
> + __page_set_anon_rmap(page, vma, address,
> + flags & RMAP_EXCLUSIVE);
> else
> __page_check_anon_rmap(page, vma, address);
> }
> @@ -1090,15 +1094,18 @@ void do_page_add_anon_rmap(struct page *page,
> * Page does not have to be locked.
> */
> void page_add_new_anon_rmap(struct page *page,
> - struct vm_area_struct *vma, unsigned long address)
> + struct vm_area_struct *vma, unsigned long address, bool compound)
> {
> + int nr = compound ? hpage_nr_pages(page) : 1;
> +
> VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> SetPageSwapBacked(page);
> atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
> - if (PageTransHuge(page))
> + if (compound) {
> + VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> - __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
> - hpage_nr_pages(page));
> + }
> + __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
> __page_set_anon_rmap(page, vma, address, 1);
> }
>
> @@ -1154,9 +1161,12 @@ out:
> *
> * The caller needs to hold the pte lock.
> */
> -void page_remove_rmap(struct page *page)
> +void page_remove_rmap(struct page *page, bool compound)
> {
> + int nr = compound ? hpage_nr_pages(page) : 1;
> +
> if (!PageAnon(page)) {
> + VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
> page_remove_file_rmap(page);
> return;
> }
> @@ -1174,11 +1184,12 @@ void page_remove_rmap(struct page *page)
> * these counters are not modified in interrupt context, and
> * pte lock(a spinlock) is held, which implies preemption disabled.
> */
> - if (PageTransHuge(page))
> + if (compound) {
> + VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> + }
>
> - __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
> - -hpage_nr_pages(page));
> + __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
>
> if (unlikely(PageMlocked(page)))
> clear_page_mlock(page);
> @@ -1320,7 +1331,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> dec_mm_counter(mm, MM_FILEPAGES);
>
> discard:
> - page_remove_rmap(page);
> + page_remove_rmap(page, false);
> page_cache_release(page);
>
> out_unmap:
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 63f55ccb9b26..200298895cee 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1121,10 +1121,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> set_pte_at(vma->vm_mm, addr, pte,
> pte_mkold(mk_pte(page, vma->vm_page_prot)));
> if (page == swapcache) {
> - page_add_anon_rmap(page, vma, addr);
> + page_add_anon_rmap(page, vma, addr, false);
> mem_cgroup_commit_charge(page, memcg, true);
> } else { /* ksm created a completely new copy */
> - page_add_new_anon_rmap(page, vma, addr);
> + page_add_new_anon_rmap(page, vma, addr, false);
> mem_cgroup_commit_charge(page, memcg, false);
> lru_cache_add_active_or_unevictable(page, vma);
> }
>