2015-07-20 14:33:38

by Kirill A. Shutemov

Subject: [PATCHv9 00/36] THP refcounting redesign

Hello everybody,

The THP refcounting patchset has been rebased onto mmotm-2015-07-15-16-46,
with a few minor fixes/cleanups on top.

The goal of the patchset is to make refcounting on THP pages cheaper, with
simpler semantics, and to allow the same THP compound page to be mapped with
both PMDs and PTEs. This is required to get a reasonable THP-pagecache
implementation.

With the new refcounting design it's much easier to protect against
split_huge_page(): a simple reference on a page will do the trick.
It makes the gup_fast() implementation simpler and doesn't require a
special case in the futex code to handle tail THP pages.

It should improve THP utilization across the system, since splitting a THP in
one process doesn't necessarily lead to splitting the page in all other
processes that have the page mapped.

The patchset drastically lowers the complexity of the get_page()/put_page()
codepaths. I encourage people to look at this code before and after to
justify the time budget for reviewing this patchset.

= Changelog =

v9:
- rebased to mmotm-2015-07-15-16-46: fix conflicts with DAX and pagemap
patchsets;
- simplify PG_double_map handling in __split_huge_pmd_locked();
- fix arm64 typo (Suzuki K. Poulose);
- fix build on !THP;
- checkpatch fixes;
- Tested-by/Reviewed-by Aneesh Kumar K.V;

v8:
- rebased to since-4.1;
- fix mmap10 from LTP: make check and clear PG_double_map atomic.

v7:
- avoid situation during split_huge_pmd() where we can temporarily drop
page_mapcount() to zero. It can lead to races e.g. with unmap code;
- update documentation;
- fix NR_ANON_PAGES accounting in page_remove_rmap();
- fix page_mapped();
- optimize page_mapped() and page_mapcount();
- fix PSS calculation for non-shared pages;

v6:
- rebase to since-4.0;
- optimize mapcount handling: significantly reduce overhead for most
common cases;
- split pages on migrate_pages();
- remove infrastructure for handling splitting PMDs on all architectures;
- fix page_mapcount() for hugetlb pages;

v5:
- Tested-by: Sasha Levin!™
- re-split patchset in hope to improve readability;
- rebased on top of page flags and ->mapping sanitizing patchset;
- uncharge compound_mapcount rather than mapcount for hugetlb pages
during removing from rmap;
- differentiate page_mapped() from page_mapcount() for compound pages;
- rework deferred_split_huge_page() to use shrinker interface;
- fix race in page_remove_rmap();
- get rid of __get_page_tail();
- few random bug fixes;

v4:
- fix sizes reported in smaps;
- defines instead of enum for RMAP_{EXCLUSIVE,COMPOUND};
- skip THP pages on munlock_vma_pages_range(): they are never mlocked;
- properly handle huge zero page on FOLL_SPLIT;
- fix lock_page() slow path on tail pages;
- account page_get_anon_vma() fail to THP_SPLIT_PAGE_FAILED;
- fix split_huge_page() on huge page with unmapped head page;
- fix transferring 'write' and 'young' from pmd to ptes on split_huge_pmd;
- call page_remove_rmap() in unfreeze_page under ptl.

= Design overview =

The main reason why we can't map THP with 4k pages is how refcounting on THP
is designed. It is built around two requirements:

- split of huge page should never fail;
- we can't change interface of get_user_page();

To be able to split a huge page at any point, we have to track which tail
page was pinned. This leads to a tricky and expensive get_page() on tail pages
and also occupies tail_page->_mapcount.

Most split_huge_page*() users want the PMD to be split into a table of PTEs
and don't care whether the compound page is going to be split or not.

The plan is:

- allow split_huge_page() to fail if the page is pinned. It's trivial to
split non-pinned page and it doesn't require tail page refcounting, so
tail_page->_mapcount is free to be reused.

- introduce new routine -- split_huge_pmd() -- to split PMD into table of
PTEs. It splits only one PMD, not touching other PMDs the page is
mapped with or underlying compound page. Unlike new split_huge_page(),
split_huge_pmd() never fails.

Fortunately, there are only a few places where split_huge_page() is needed:
swap out, memory failure, migration, KSM. All of them can handle
split_huge_page() failure.
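
The difference between the two operations in the plan can be sketched as a
user-space toy model. This is not kernel code: struct toy_thp and the toy_*()
helpers are hypothetical names made up for illustration, and the -EBUSY return
value is an assumption of the sketch. Only the rule "splitting the page may
fail when pinned, splitting a PMD never fails" comes from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one THP and the ways it is currently mapped. */
struct toy_thp {
	int pmd_mappings;  /* mappings that still use a PMD */
	int pte_tables;    /* mappings already converted to a table of PTEs */
	int pins;          /* extra references held by somebody (e.g. GUP) */
	bool is_compound;  /* false once the page itself has been split */
};

/* Split a single PMD mapping into PTEs. Never fails, page stays compound. */
static void toy_split_huge_pmd(struct toy_thp *thp)
{
	if (thp->pmd_mappings == 0)
		return; /* nothing mapped with a PMD, nothing to do */
	thp->pmd_mappings--;
	thp->pte_tables++;
}

/* Split the compound page itself. Allowed to fail if the page is pinned. */
static int toy_split_huge_page(struct toy_thp *thp)
{
	if (thp->pins > 0)
		return -EBUSY; /* assumed error code for this sketch */

	/* Every remaining PMD mapping has to become a PTE table first. */
	while (thp->pmd_mappings > 0)
		toy_split_huge_pmd(thp);

	thp->is_compound = false;
	return 0;
}
```

A caller in this model checks the return value of toy_split_huge_page() and
simply carries on with the unsplit page on failure, which is what the
remaining split_huge_page() users are expected to tolerate.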

In the new scheme, page->_mapcount is used to account how many times
the page is mapped with PTEs. We have a separate compound_mapcount() to
count mappings with PMD. page_mapcount() returns the sum of PTE and PMD
mappings of the page.
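
The accounting rule above can be modelled in user space. The sketch below is
a toy, not the kernel implementation: struct toy_thp, the toy_*() helpers and
the 512-subpage constant are assumptions for illustration, and the counters
are zero-based for readability (the kernel's ->_mapcount starts at -1).

```c
#include <assert.h>

#define TOY_NR_SUBPAGES 512 /* assumed: 2M huge page made of 4k pages */

/* Hypothetical user-space model of one THP compound page. */
struct toy_thp {
	int pte_mapcount[TOY_NR_SUBPAGES]; /* per-subpage PTE mappings */
	int compound_mapcount;             /* PMD mappings of the whole page */
};

/* Map the whole page with a PMD. */
static void toy_map_pmd(struct toy_thp *thp)
{
	thp->compound_mapcount++;
}

/* Map one subpage with a PTE. */
static void toy_map_pte(struct toy_thp *thp, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_SUBPAGES);
	thp->pte_mapcount[idx]++;
}

/* Total mapcount of a subpage: its PTE mappings plus all PMD mappings. */
static int toy_page_mapcount(const struct toy_thp *thp, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_SUBPAGES);
	return thp->pte_mapcount[idx] + thp->compound_mapcount;
}
```

With one PMD mapping and two PTE mappings of subpage 3, the model reports a
mapcount of 3 for that subpage and 1 for every other subpage.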

Introducing split_huge_pmd() effectively allows THP to be mapped with 4k pages.
It may be a surprise to some code to see a PTE which points to a tail page,
or a VMA start/end in the middle of a compound page.

munmap() of part of a THP will split the PMD, but doesn't split the huge page.
In order to keep memory consumption under control, we put partially unmapped
huge pages on a list. The pages will be split by a shrinker if memory pressure
comes. This way we also avoid unnecessary split_huge_page() on exit(2) if
a THP belongs to more than one VMA.
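
The deferred-split idea can be sketched with a small user-space queue. All
names here (struct toy_thp, struct toy_split_queue, the toy_*() helpers, the
fixed capacity) are hypothetical, and the sketch ignores the case where a
split fails. It only illustrates the flow described above: queue on partial
unmap, drop from the queue when the page is freed anyway, split on memory
pressure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 16 /* arbitrary capacity for the toy queue */

struct toy_thp {
	bool queued; /* sitting on the deferred split list */
	bool split;  /* has been split into small pages */
};

struct toy_split_queue {
	struct toy_thp *pages[TOY_QUEUE_MAX];
	size_t len;
};

/* Partial unmap: remember the page instead of splitting it right away. */
static bool toy_deferred_split(struct toy_split_queue *q, struct toy_thp *thp)
{
	if (thp->queued)
		return true; /* already on the list */
	if (q->len == TOY_QUEUE_MAX)
		return false;
	q->pages[q->len++] = thp;
	thp->queued = true;
	return true;
}

/* Page freed before any memory pressure (e.g. on exit): no split needed. */
static void toy_free_thp(struct toy_split_queue *q, struct toy_thp *thp)
{
	for (size_t i = 0; i < q->len; i++) {
		if (q->pages[i] == thp) {
			q->pages[i] = q->pages[--q->len];
			break;
		}
	}
	thp->queued = false;
}

/* Shrinker under memory pressure: split everything that is queued. */
static size_t toy_shrink(struct toy_split_queue *q)
{
	size_t nr = q->len;

	for (size_t i = 0; i < nr; i++) {
		q->pages[i]->split = true;
		q->pages[i]->queued = false;
	}
	q->len = 0;
	return nr;
}
```

In this model a page that is queued and then freed never gets split, which is
the saving on exit(2) that the paragraph above refers to.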

= Refcounts and transparent huge pages =

- get_page() and put_page() work *only* on head page's ->_count.
We don't touch tail pages at all for these operations. We stopped
touching ->_mapcount in tail pages to store their pins.

- ->_count in tail pages is always zero: get_page_unless_zero() never
succeeds on tail pages. Nothing changed in this respect.

- map/unmap of a page with a PTE entry increments/decrements ->_mapcount
on the relevant sub-page of the compound page.

- map/unmap of the whole compound page is accounted in compound_mapcount
(stored in the first tail page).

- PageDoubleMap() indicates that ->_mapcount in all subpages is offset
up by one. This additional reference is required to get race-free
detection of unmap of subpages when we have them mapped with both PMDs
and PTEs.

This is an optimization required to lower the overhead of per-subpage
mapcount tracking. The alternative is to alter ->_mapcount in all
subpages on each map/unmap of the whole compound page.

We set PG_double_map when a PMD of the page is split for the first
time, but the page still has a PMD mapping. The additional references go
away with the last compound_mapcount.
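
The PG_double_map offset can be illustrated with a user-space toy model. It
is a sketch under stated assumptions, not the kernel code: struct toy_thp and
the toy_*() helpers are hypothetical, counters are zero-based, and no locking
or atomics are modelled. It follows the rules listed above: the flag is set
on the first PMD split while another PMD mapping remains, every subpage
counter is then offset up by one, and the offset goes away with the last PMD
mapping.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_SUBPAGES 512 /* assumed: 2M huge page made of 4k pages */

struct toy_thp {
	int raw_mapcount[TOY_NR_SUBPAGES]; /* per-subpage, may include offset */
	int compound_mapcount;             /* number of PMD mappings */
	bool double_map;                   /* models PG_double_map */
};

/* Add delta to the raw counter of every subpage. */
static void toy_adjust_all(struct toy_thp *thp, int delta)
{
	for (int i = 0; i < TOY_NR_SUBPAGES; i++)
		thp->raw_mapcount[i] += delta;
}

static void toy_map_pmd(struct toy_thp *thp)
{
	thp->compound_mapcount++;
}

/* Drop one PMD mapping; the last one takes the extra references with it. */
static void toy_unmap_pmd(struct toy_thp *thp)
{
	assert(thp->compound_mapcount > 0);
	if (--thp->compound_mapcount == 0 && thp->double_map) {
		thp->double_map = false;
		toy_adjust_all(thp, -1);
	}
}

/* Turn one PMD mapping into PTE mappings of every subpage. */
static void toy_split_pmd(struct toy_thp *thp)
{
	assert(thp->compound_mapcount > 0);
	toy_adjust_all(thp, 1); /* the new PTE mappings */
	if (thp->compound_mapcount > 1 && !thp->double_map) {
		/* First split while other PMD mappings remain. */
		thp->double_map = true;
		toy_adjust_all(thp, 1);
	}
	toy_unmap_pmd(thp);
}

static void toy_unmap_pte(struct toy_thp *thp, int idx)
{
	assert(thp->raw_mapcount[idx] > 0);
	thp->raw_mapcount[idx]--;
}

/* PTE plus PMD mappings of a subpage, with the offset compensated. */
static int toy_page_mapcount(const struct toy_thp *thp, int idx)
{
	return thp->raw_mapcount[idx] + thp->compound_mapcount -
	       (thp->double_map ? 1 : 0);
}
```

Note how, while the flag is set, a subpage's raw counter cannot drop to zero
just because its PTE went away: the extra reference keeps it elevated until
the last PMD mapping is gone.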

= Benchmarks =

Kernel build benchmark:

                 baseline                    v6

Amean  user-2      447.76 (   0.00%)     451.24 (  -0.78%)
Amean  user-4      314.94 (   0.00%)     310.10 (   1.54%)
Amean  user-8      388.91 (   0.00%)     388.95 (  -0.01%)
Amean  user-16     518.68 (   0.00%)     518.60 (   0.02%)
Amean  user-24     533.58 (   0.00%)     535.35 (  -0.33%)
Amean  syst-2       77.52 (   0.00%)      70.16 (   9.49%)
Amean  syst-4       51.21 (   0.00%)      44.55 (  13.00%)
Amean  syst-8       42.12 (   0.00%)      42.59 (  -1.12%)
Amean  syst-16      50.29 (   0.00%)      50.14 (   0.30%)
Amean  syst-24      49.36 (   0.00%)      48.54 (   1.65%)
Amean  elsp-2      242.64 (   0.00%)     244.46 (  -0.75%)
Amean  elsp-4       93.78 (   0.00%)      92.89 (   0.95%)
Amean  elsp-8       61.51 (   0.00%)      61.92 (  -0.66%)
Amean  elsp-16      53.95 (   0.00%)      53.80 (   0.29%)
Amean  elsp-24      52.75 (   0.00%)      53.14 (  -0.74%)
Stddev user-2       15.49 (   0.00%)      13.75 (  11.24%)
Stddev user-4        7.85 (   0.00%)       4.42 (  43.68%)
Stddev user-8        1.29 (   0.00%)       2.77 (-114.28%)
Stddev user-16       2.56 (   0.00%)       1.54 (  39.89%)
Stddev user-24       1.75 (   0.00%)       1.06 (  39.75%)
Stddev syst-2        3.02 (   0.00%)       2.00 (  33.86%)
Stddev syst-4        1.23 (   0.00%)       0.91 (  26.65%)
Stddev syst-8        0.41 (   0.00%)       0.30 (  28.32%)
Stddev syst-16       0.51 (   0.00%)       0.71 ( -38.07%)
Stddev syst-24       0.92 (   0.00%)       0.70 (  23.86%)
Stddev elsp-2        8.70 (   0.00%)       7.99 (   8.13%)
Stddev elsp-4        1.74 (   0.00%)       0.59 (  66.08%)
Stddev elsp-8        0.40 (   0.00%)       0.30 (  25.11%)
Stddev elsp-16       0.45 (   0.00%)       0.37 (  17.73%)
Stddev elsp-24       0.57 (   0.00%)       0.38 (  33.64%)

Changes are mostly not significant. The only noticeable part is the reduction
of system time for -j2 and -j4: 9.49% and 13.00%.
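
The percentages in parentheses read as relative gains against the baseline,
where a positive number means less time. That reading is an assumption
inferred from the numbers in the table, and toy_gain_lower_is_better() is a
hypothetical helper, not part of any benchmark tool. Because the table shows
rounded means, recomputed values can differ in the last digit.

```c
#include <assert.h>

/* Relative gain in percent for a "lower is better" metric such as time. */
static double toy_gain_lower_is_better(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (baseline - patched) / baseline * 100.0;
}
```

For syst-2 this gives (77.52 - 70.16) / 77.52 * 100, roughly 9.49, and for
syst-4 (51.21 - 44.55) / 51.21 * 100, roughly 13.0.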

specjvm
                              base                    v6
Ops compiler                569.52 (   0.00%)     618.74 (   8.64%)
Ops compress                456.32 (   0.00%)     469.20 (   2.82%)
Ops crypto                  424.34 (   0.00%)     413.64 (  -2.52%)
Ops derby                   535.15 (   0.00%)     536.96 (   0.34%)
Ops mpegaudio               291.03 (   0.00%)     286.35 (  -1.61%)
Ops scimark.large            75.91 (   0.00%)      77.21 (   1.71%)
Ops scimark.small           529.19 (   0.00%)     527.07 (  -0.40%)
Ops serial                  316.13 (   0.00%)     316.40 (   0.09%)
Ops sunflow                 154.90 (   0.00%)     154.85 (  -0.03%)
Ops xml                     612.94 (   0.00%)     575.20 (  -6.16%)
Ops compiler.compiler       770.11 (   0.00%)     878.22 (  14.04%)
Ops compiler.sunflow        421.17 (   0.00%)     435.92 (   3.50%)
Ops compress                456.32 (   0.00%)     469.20 (   2.82%)
Ops crypto.aes              153.17 (   0.00%)     151.30 (  -1.22%)
Ops crypto.rsa              607.43 (   0.00%)     564.65 (  -7.04%)
Ops crypto.signverify       821.23 (   0.00%)     828.40 (   0.87%)
Ops derby                   535.15 (   0.00%)     536.96 (   0.34%)
Ops mpegaudio               291.03 (   0.00%)     286.35 (  -1.61%)
Ops scimark.fft.large        69.59 (   0.00%)      69.81 (   0.32%)
Ops scimark.lu.large         20.31 (   0.00%)      20.32 (   0.05%)
Ops scimark.sor.large       114.57 (   0.00%)     113.99 (  -0.51%)
Ops scimark.sparse.large     55.56 (   0.00%)      61.71 (  11.07%)
Ops scimark.monte_carlo     280.13 (   0.00%)     275.09 (  -1.80%)
Ops scimark.fft.small       815.19 (   0.00%)     819.55 (   0.53%)
Ops scimark.lu.small       1072.62 (   0.00%)    1081.47 (   0.83%)
Ops scimark.sor.small       674.47 (   0.00%)     674.24 (  -0.03%)
Ops scimark.sparse.small    251.20 (   0.00%)     247.43 (  -1.50%)
Ops serial                  316.13 (   0.00%)     316.40 (   0.09%)
Ops sunflow                 154.90 (   0.00%)     154.85 (  -0.03%)
Ops xml.transform           538.16 (   0.00%)     519.64 (  -3.44%)
Ops xml.validation          698.10 (   0.00%)     636.70 (  -8.80%)

Results are mixed.

= Patches overview =

Patch 1:
We need to look at all subpages of the compound page to calculate
the correct PSS, because they can have different mapcounts.

Patch 2:
With PTE-mapped THP, rmap cannot rely on a PageTransHuge() check to
decide whether to map a small page or a THP. We need to get the info
from the caller.

Patch 3:
Make memcg aware of the new refcounting. Validation needed.

Patch 4:
Adjust conditions when we can re-use the page on write-protection
fault.

Patch 5:
FOLL_SPLIT should be handled on PTE level too.

Patch 6:
Make the generic fast GUP implementation aware of PTE-mapped huge
pages.

Patch 7:
Split all pages in mlocked VMA. That should be good enough for
now.

Patch 8:
Make khugepaged aware of PTE-mapped huge pages.

Patch 9:
Rename split_huge_page_pmd() to split_huge_pmd() to reflect that
page is not going to be split, only PMD.

Patch 10:
New THP_SPLIT_* vmstats.

Patch 11:
Up to this point we tried to keep the patchset bisectable, but the next
patches are going to change how the core of THP refcounting works.
It is easier to review the change if we disable THP temporarily
and bring it back once everything is ready.

Patch 12:
Remove all split_huge_page()-related code. It also removes the need for
tail page refcounting.

Patch 13:
Drop tail page refcounting. Diffstat is nice! :)

Patch 14:
Remove the ugly special case for a futex that happens to be in a tail THP
page. With the new refcounting it is much easier to protect against split.

Patch 15:
Simplify KSM code which handles THP.

Patch 16:
No need for compound_lock anymore.

Patches 17-25:
Drop infrastructure for handling PMD splitting. We don't use it
anymore in split_huge_page().

Patch 26:
Store mapcount for compound pages separately: in the first tail
page ->mapping.

Patch 27:
Let's define page_mapped() to be true for compound pages if any
sub-page of the compound page is mapped (with PMD or PTE).

Patch 28:
Make numabalancing aware of PTE-mapped THP.

Patch 29:
Implement new split_huge_pmd().

Patches 30-32:
Implement new split_huge_page().

Patch 33:
Split pages instead of PMDs on migrate_pages.

Patch 34:
Handle partial unmap of THP. We put partially unmapped huge pages on a
list. Pages from the list will be split via a shrinker if memory pressure
comes. This way we also avoid unnecessary split_huge_page() on
exit(2) if a THP belongs to more than one VMA.

Patch 35:
Everything is in place. Re-enable THP.

Patch 36:
Documentation update.

The patchset is also available on git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v9

Please review and consider applying.

Kirill A. Shutemov (36):
mm, proc: adjust PSS calculation
rmap: add argument to charge compound page
memcg: adjust to support new THP refcounting
mm, thp: adjust conditions when we can reuse the page on WP fault
mm: adjust FOLL_SPLIT for new refcounting
mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton
thp, mlock: do not allow huge pages in mlocked area
khugepaged: ignore pmd tables with THP mapped with ptes
thp: rename split_huge_page_pmd() to split_huge_pmd()
mm, vmstats: new THP splitting event
mm: temporally mark THP broken
thp: drop all split_huge_page()-related code
mm: drop tail page refcounting
futex, thp: remove special case for THP in get_futex_key
ksm: prepare to new THP semantics
mm, thp: remove compound_lock
arm64, thp: remove infrastructure for handling splitting PMDs
arm, thp: remove infrastructure for handling splitting PMDs
mips, thp: remove infrastructure for handling splitting PMDs
powerpc, thp: remove infrastructure for handling splitting PMDs
s390, thp: remove infrastructure for handling splitting PMDs
sparc, thp: remove infrastructure for handling splitting PMDs
tile, thp: remove infrastructure for handling splitting PMDs
x86, thp: remove infrastructure for handling splitting PMDs
mm, thp: remove infrastructure for handling splitting PMDs
mm: rework mapcount accounting to enable 4k mapping of THPs
mm: differentiate page_mapped() from page_mapcount() for compound
pages
mm, numa: skip PTE-mapped THP on numa fault
thp: implement split_huge_pmd()
thp: add option to setup migration entiries during PMD split
thp, mm: split_huge_page(): caller need to lock page
thp: reintroduce split_huge_page()
migrate_pages: try to split pages on qeueuing
thp: introduce deferred_split_huge_page()
mm: re-enable THP
thp: update documentation

Documentation/vm/transhuge.txt | 151 +++--
arch/arc/mm/cache_arc700.c | 4 +-
arch/arm/include/asm/pgtable-3level.h | 10 -
arch/arm/lib/uaccess_with_memcpy.c | 5 +-
arch/arm/mm/flush.c | 17 +-
arch/arm64/include/asm/pgtable.h | 8 -
arch/arm64/mm/flush.c | 16 -
arch/mips/include/asm/pgtable-bits.h | 6 +-
arch/mips/include/asm/pgtable.h | 18 -
arch/mips/mm/c-r4k.c | 3 +-
arch/mips/mm/cache.c | 2 +-
arch/mips/mm/gup.c | 17 +-
arch/mips/mm/init.c | 6 +-
arch/mips/mm/pgtable-64.c | 14 -
arch/mips/mm/tlbex.c | 1 -
arch/powerpc/include/asm/pgtable-ppc64.h | 25 +-
arch/powerpc/mm/hugepage-hash64.c | 3 -
arch/powerpc/mm/hugetlbpage.c | 17 +-
arch/powerpc/mm/pgtable_64.c | 49 --
arch/powerpc/mm/subpage-prot.c | 2 +-
arch/s390/include/asm/pgtable.h | 15 +-
arch/s390/mm/gup.c | 24 +-
arch/s390/mm/pgtable.c | 16 -
arch/sh/mm/cache-sh4.c | 2 +-
arch/sh/mm/cache.c | 8 +-
arch/sparc/include/asm/pgtable_64.h | 16 -
arch/sparc/mm/fault_64.c | 3 -
arch/sparc/mm/gup.c | 16 +-
arch/tile/include/asm/pgtable.h | 10 -
arch/x86/include/asm/pgtable.h | 9 -
arch/x86/include/asm/pgtable_types.h | 2 -
arch/x86/kernel/vm86_32.c | 6 +-
arch/x86/mm/gup.c | 17 +-
arch/x86/mm/pgtable.c | 14 -
arch/xtensa/mm/tlb.c | 2 +-
fs/proc/page.c | 4 +-
fs/proc/task_mmu.c | 55 +-
include/asm-generic/pgtable.h | 9 -
include/linux/huge_mm.h | 55 +-
include/linux/memcontrol.h | 16 +-
include/linux/mm.h | 113 ++--
include/linux/mm_types.h | 18 +-
include/linux/page-flags.h | 49 +-
include/linux/pagemap.h | 13 +-
include/linux/rmap.h | 16 +-
include/linux/swap.h | 3 +-
include/linux/vm_event_item.h | 4 +-
kernel/events/uprobes.c | 11 +-
kernel/futex.c | 61 +-
mm/debug.c | 8 +-
mm/filemap.c | 10 +-
mm/gup.c | 111 ++-
mm/huge_memory.c | 1089 +++++++++++++++++-------------
mm/hugetlb.c | 10 +-
mm/internal.h | 70 +-
mm/ksm.c | 61 +-
mm/madvise.c | 2 +-
mm/memcontrol.c | 84 +--
mm/memory-failure.c | 10 +-
mm/memory.c | 73 +-
mm/mempolicy.c | 37 +-
mm/migrate.c | 19 +-
mm/mincore.c | 2 +-
mm/mlock.c | 51 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 15 +-
mm/page_alloc.c | 16 +-
mm/pagewalk.c | 2 +-
mm/pgtable-generic.c | 14 -
mm/rmap.c | 163 +++--
mm/shmem.c | 21 +-
mm/swap.c | 273 +-------
mm/swapfile.c | 16 +-
mm/userfaultfd.c | 8 +-
mm/vmstat.c | 4 +-
75 files changed, 1338 insertions(+), 1794 deletions(-)

--
2.1.4


2015-07-20 14:21:33

by Kirill A. Shutemov

Subject: [PATCHv9 01/36] mm, proc: adjust PSS calculation

With the new refcounting, subpages of a compound page do not necessarily
have the same mapcount. We need to take into account the mapcount of every
sub-page.
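
The per-subpage PSS arithmetic can be tried out in user space with the sketch
below. It is an illustration only: toy_pss() is a hypothetical helper, and
the 4096-byte page size and the fixed-point shift of 12 are assumptions of
the sketch. Each subpage contributes its size divided by its own mapcount.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL /* assumed base page size */
#define TOY_PSS_SHIFT 12      /* assumed fixed-point shift for PSS */

/*
 * Sum the proportional share of nr subpages, in fixed point.
 * mapcount[i] is the number of mappings of subpage i (at least 1).
 */
static uint64_t toy_pss(const int *mapcount, int nr)
{
	uint64_t pss = 0;

	for (int i = 0; i < nr; i++) {
		assert(mapcount[i] >= 1);
		if (mapcount[i] >= 2)
			/* shared subpage: charge only our share of it */
			pss += (TOY_PAGE_SIZE << TOY_PSS_SHIFT) / mapcount[i];
		else
			/* private subpage: charge all of it */
			pss += TOY_PAGE_SIZE << TOY_PSS_SHIFT;
	}
	return pss;
}
```

Three subpages with mapcounts 1, 2 and 4 add up to 4096 + 2048 + 1024 = 7168
bytes of PSS once the result is shifted back down by TOY_PSS_SHIFT.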

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
fs/proc/task_mmu.c | 47 +++++++++++++++++++++++++++++++----------------
1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 181ca3d56c72..37a3cf92f7c6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -450,9 +450,10 @@ struct mem_size_stats {
};

static void smaps_account(struct mem_size_stats *mss, struct page *page,
- unsigned long size, bool young, bool dirty)
+ bool compound, bool young, bool dirty)
{
- int mapcount;
+ int i, nr = compound ? HPAGE_PMD_NR : 1;
+ unsigned long size = nr * PAGE_SIZE;

if (PageAnon(page))
mss->anonymous += size;
@@ -461,23 +462,37 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
/* Accumulate the size in pages that have been accessed. */
if (young || PageReferenced(page))
mss->referenced += size;
- mapcount = page_mapcount(page);
- if (mapcount >= 2) {
- u64 pss_delta;

- if (dirty || PageDirty(page))
- mss->shared_dirty += size;
- else
- mss->shared_clean += size;
- pss_delta = (u64)size << PSS_SHIFT;
- do_div(pss_delta, mapcount);
- mss->pss += pss_delta;
- } else {
+ /*
+ * page_count(page) == 1 guarantees the page is mapped exactly once.
+ * If any subpage of the compound page mapped with PTE it would elevate
+ * page_count().
+ */
+ if (page_count(page) == 1) {
if (dirty || PageDirty(page))
mss->private_dirty += size;
else
mss->private_clean += size;
mss->pss += (u64)size << PSS_SHIFT;
+ return;
+ }
+
+ for (i = 0; i < nr; i++, page++) {
+ int mapcount = page_mapcount(page);
+
+ if (mapcount >= 2) {
+ if (dirty || PageDirty(page))
+ mss->shared_dirty += PAGE_SIZE;
+ else
+ mss->shared_clean += PAGE_SIZE;
+ mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
+ } else {
+ if (dirty || PageDirty(page))
+ mss->private_dirty += PAGE_SIZE;
+ else
+ mss->private_clean += PAGE_SIZE;
+ mss->pss += PAGE_SIZE << PSS_SHIFT;
+ }
}
}

@@ -512,7 +527,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,

if (!page)
return;
- smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
+
+ smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -528,8 +544,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
if (IS_ERR_OR_NULL(page))
return;
mss->anonymous_thp += HPAGE_PMD_SIZE;
- smaps_account(mss, page, HPAGE_PMD_SIZE,
- pmd_young(*pmd), pmd_dirty(*pmd));
+ smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
}
#else
static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
--
2.1.4

2015-07-20 14:21:28

by Kirill A. Shutemov

Subject: [PATCHv9 02/36] rmap: add argument to charge compound page

We're going to allow mapping of individual 4k pages of a THP compound
page. It means we cannot rely on a PageTransHuge() check to decide whether
to map/unmap a small page or a THP.

The patch adds a new argument to rmap functions to indicate whether we want
to operate on the whole compound page or only the small page.
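
How the new argument turns into flags and into the number of pages to account
can be sketched in user space. The two RMAP_* values are the ones defined by
this patch; toy_rmap_flags(), toy_nr_pages() and the 512-subpage constant are
hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define RMAP_EXCLUSIVE   0x01 /* value as defined by this patch */
#define RMAP_COMPOUND    0x02 /* value as defined by this patch */
#define TOY_HPAGE_PMD_NR 512  /* assumed: 2M huge page made of 4k pages */

/* Translate the bool argument into the flag word passed further down. */
static int toy_rmap_flags(bool compound, bool exclusive)
{
	int flags = 0;

	if (compound)
		flags |= RMAP_COMPOUND;
	if (exclusive)
		flags |= RMAP_EXCLUSIVE;
	return flags;
}

/* Number of small pages to account for a mapping with these flags. */
static int toy_nr_pages(int flags)
{
	return (flags & RMAP_COMPOUND) ? TOY_HPAGE_PMD_NR : 1;
}
```

The caller, not a page flag test, now decides whether a mapping is accounted
as one small page or as the whole compound page.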

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/rmap.h | 12 +++++++++---
kernel/events/uprobes.c | 4 ++--
mm/huge_memory.c | 16 ++++++++--------
mm/hugetlb.c | 4 ++--
mm/ksm.c | 4 ++--
mm/memory.c | 14 +++++++-------
mm/migrate.c | 8 ++++----
mm/rmap.c | 48 +++++++++++++++++++++++++++++++-----------------
mm/swapfile.c | 4 ++--
mm/userfaultfd.c | 2 +-
10 files changed, 68 insertions(+), 48 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0860336c6c40..082928aba785 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -162,16 +162,22 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,

struct anon_vma *page_get_anon_vma(struct page *page);

+/* bitflags for do_page_add_anon_rmap() */
+#define RMAP_EXCLUSIVE 0x01
+#define RMAP_COMPOUND 0x02
+
/*
* rmap interfaces called when adding or removing pte of page
*/
void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, bool);

void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f26a22d..5523daf59953 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
goto unlock;

get_page(kpage);
- page_add_new_anon_rmap(kpage, vma, addr);
+ page_add_new_anon_rmap(kpage, vma, addr, false);
mem_cgroup_commit_charge(kpage, memcg, false);
lru_cache_add_active_or_unevictable(kpage, vma);

@@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
ptep_clear_flush_notify(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));

- page_remove_rmap(page);
+ page_remove_rmap(page, false);
if (!page_mapped(page))
try_to_free_swap(page);
pte_unmap_unlock(ptep, ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f9a334a6c66..4e8e3c267cdf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -769,7 +769,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,

entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- page_add_new_anon_rmap(page, vma, haddr);
+ page_add_new_anon_rmap(page, vma, haddr, true);
mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -1111,7 +1111,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- page_add_new_anon_rmap(pages[i], vma, haddr);
+ page_add_new_anon_rmap(pages[i], vma, haddr, false);
mem_cgroup_commit_charge(pages[i], memcg, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
pte = pte_offset_map(&_pmd, haddr);
@@ -1123,7 +1123,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,

smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
- page_remove_rmap(page);
+ page_remove_rmap(page, true);
spin_unlock(ptl);

mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
@@ -1243,7 +1243,7 @@ alloc:
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_huge_clear_flush_notify(vma, haddr, pmd);
- page_add_new_anon_rmap(new_page, vma, haddr);
+ page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
set_pmd_at(mm, haddr, pmd, entry);
@@ -1253,7 +1253,7 @@ alloc:
put_huge_zero_page();
} else {
VM_BUG_ON_PAGE(!PageHead(page), page);
- page_remove_rmap(page);
+ page_remove_rmap(page, true);
put_page(page);
}
ret |= VM_FAULT_WRITE;
@@ -1516,7 +1516,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
put_huge_zero_page();
} else {
struct page *page = pmd_page(orig_pmd);
- page_remove_rmap(page);
+ page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -2363,7 +2363,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
* superfluous.
*/
pte_clear(vma->vm_mm, address, _pte);
- page_remove_rmap(src_page);
+ page_remove_rmap(src_page, false);
spin_unlock(ptl);
free_page_and_swap_cache(src_page);
}
@@ -2658,7 +2658,7 @@ static void collapse_huge_page(struct mm_struct *mm,

spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
- page_add_new_anon_rmap(new_page, vma, address);
+ page_add_new_anon_rmap(new_page, vma, address, true);
mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51ae41d0fbc0..b98f5e944849 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2877,7 +2877,7 @@ again:
if (huge_pte_dirty(pte))
set_page_dirty(page);

- page_remove_rmap(page);
+ page_remove_rmap(page, true);
force_flush = !__tlb_remove_page(tlb, page);
if (force_flush) {
address += sz;
@@ -3098,7 +3098,7 @@ retry_avoidcopy:
mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
set_huge_pte_at(mm, address, ptep,
make_huge_pte(vma, new_page, 1));
- page_remove_rmap(old_page);
+ page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, address);
/* Make the old page be freed below */
new_page = old_page;
diff --git a/mm/ksm.c b/mm/ksm.c
index bc7be0ee2080..fe09f3ddc912 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -957,13 +957,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
}

get_page(kpage);
- page_add_anon_rmap(kpage, vma, addr);
+ page_add_anon_rmap(kpage, vma, addr, false);

flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush_notify(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));

- page_remove_rmap(page);
+ page_remove_rmap(page, false);
if (!page_mapped(page))
try_to_free_swap(page);
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index c70252ca9ef9..ae98aba42697 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1125,7 +1125,7 @@ again:
mark_page_accessed(page);
rss[MM_FILEPAGES]--;
}
- page_remove_rmap(page);
+ page_remove_rmap(page, false);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
if (unlikely(!__tlb_remove_page(tlb, page))) {
@@ -2113,7 +2113,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* thread doing COW.
*/
ptep_clear_flush_notify(vma, address, page_table);
- page_add_new_anon_rmap(new_page, vma, address);
+ page_add_new_anon_rmap(new_page, vma, address, false);
mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
@@ -2146,7 +2146,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* mapcount is visible. So transitively, TLBs to
* old page will be flushed before it can be reused.
*/
- page_remove_rmap(old_page);
+ page_remove_rmap(old_page, false);
}

/* Free the old page.. */
@@ -2562,7 +2562,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
- exclusive = 1;
+ exclusive = RMAP_EXCLUSIVE;
}
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(orig_pte))
@@ -2572,7 +2572,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
do_page_add_anon_rmap(page, vma, address, exclusive);
mem_cgroup_commit_charge(page, memcg, true);
} else { /* ksm created a completely new copy */
- page_add_new_anon_rmap(page, vma, address);
+ page_add_new_anon_rmap(page, vma, address, false);
mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
}
@@ -2730,7 +2730,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address);
+ page_add_new_anon_rmap(page, vma, address, false);
mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
@@ -2818,7 +2818,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (anon) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address);
+ page_add_new_anon_rmap(page, vma, address, false);
} else {
inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
page_add_file_rmap(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index d3529d620a5b..4870a1daa8ae 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -166,7 +166,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
else
page_dup_rmap(new);
} else if (PageAnon(new))
- page_add_anon_rmap(new, vma, addr);
+ page_add_anon_rmap(new, vma, addr, false);
else
page_add_file_rmap(new);

@@ -1792,7 +1792,7 @@ fail_putback:
* guarantee the copy is visible before the pagetable update.
*/
flush_cache_range(vma, mmun_start, mmun_end);
- page_add_anon_rmap(new_page, vma, mmun_start);
+ page_add_anon_rmap(new_page, vma, mmun_start, true);
pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
set_pmd_at(mm, mmun_start, pmd, entry);
flush_tlb_range(vma, mmun_start, mmun_end);
@@ -1803,13 +1803,13 @@ fail_putback:
flush_tlb_range(vma, mmun_start, mmun_end);
mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
update_mmu_cache_pmd(vma, address, &entry);
- page_remove_rmap(new_page);
+ page_remove_rmap(new_page, true);
goto fail_putback;
}

mem_cgroup_migrate(page, new_page, false);

- page_remove_rmap(page);
+ page_remove_rmap(page, true);

spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
diff --git a/mm/rmap.c b/mm/rmap.c
index 30812e9042ae..5ee08e082e51 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1150,6 +1150,7 @@ static void __page_check_anon_rmap(struct page *page,
* @page: the page to add the mapping to
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
+ * @compound: charge the page as compound or small page
*
* The caller needs to hold the pte lock, and the page must be locked in
* the anon_vma case: to serialize mapping,index checking after setting,
@@ -1157,9 +1158,9 @@ static void __page_check_anon_rmap(struct page *page,
* (but PageKsm is never downgraded to PageAnon).
*/
void page_add_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address)
+ struct vm_area_struct *vma, unsigned long address, bool compound)
{
- do_page_add_anon_rmap(page, vma, address, 0);
+ do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
}

/*
@@ -1168,21 +1169,24 @@ void page_add_anon_rmap(struct page *page,
* Everybody else should continue to use page_add_anon_rmap above.
*/
void do_page_add_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address, int exclusive)
+ struct vm_area_struct *vma, unsigned long address, int flags)
{
int first = atomic_inc_and_test(&page->_mapcount);
if (first) {
+ bool compound = flags & RMAP_COMPOUND;
+ int nr = compound ? hpage_nr_pages(page) : 1;
/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption
* disabled.
*/
- if (PageTransHuge(page))
+ if (compound) {
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__inc_zone_page_state(page,
NR_ANON_TRANSPARENT_HUGEPAGES);
- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
- hpage_nr_pages(page));
+ }
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
}
if (unlikely(PageKsm(page)))
return;
@@ -1190,7 +1194,8 @@ void do_page_add_anon_rmap(struct page *page,
VM_BUG_ON_PAGE(!PageLocked(page), page);
/* address might be in next vma when migration races vma_adjust */
if (first)
- __page_set_anon_rmap(page, vma, address, exclusive);
+ __page_set_anon_rmap(page, vma, address,
+ flags & RMAP_EXCLUSIVE);
else
__page_check_anon_rmap(page, vma, address);
}
@@ -1200,21 +1205,25 @@ void do_page_add_anon_rmap(struct page *page,
* @page: the page to add the mapping to
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
+ * @compound: charge the page as compound or small page
*
* Same as page_add_anon_rmap but must only be called on *new* pages.
* This means the inc-and-test can be bypassed.
* Page does not have to be locked.
*/
void page_add_new_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address)
+ struct vm_area_struct *vma, unsigned long address, bool compound)
{
+ int nr = compound ? hpage_nr_pages(page) : 1;
+
VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
SetPageSwapBacked(page);
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
- if (PageTransHuge(page))
+ if (compound) {
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
- hpage_nr_pages(page));
+ }
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
__page_set_anon_rmap(page, vma, address, 1);
}

@@ -1266,13 +1275,17 @@ out:

/**
* page_remove_rmap - take down pte mapping from a page
- * @page: page to remove mapping from
+ * @page: page to remove mapping from
+ * @compound: uncharge the page as compound or small page
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, bool compound)
{
+ int nr = compound ? hpage_nr_pages(page) : 1;
+
if (!PageAnon(page)) {
+ VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
page_remove_file_rmap(page);
return;
}
@@ -1290,11 +1303,12 @@ void page_remove_rmap(struct page *page)
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- if (PageTransHuge(page))
+ if (compound) {
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ }

- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
- -hpage_nr_pages(page));
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);

if (unlikely(PageMlocked(page)))
clear_page_mlock(page);
@@ -1449,7 +1463,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
dec_mm_counter(mm, MM_FILEPAGES);

discard:
- page_remove_rmap(page);
+ page_remove_rmap(page, false);
page_cache_release(page);

out_unmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 738c2662209e..7ba293b112e3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1163,10 +1163,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
if (page == swapcache) {
- page_add_anon_rmap(page, vma, addr);
+ page_add_anon_rmap(page, vma, addr, false);
mem_cgroup_commit_charge(page, memcg, true);
} else { /* ksm created a completely new copy */
- page_add_new_anon_rmap(page, vma, addr);
+ page_add_new_anon_rmap(page, vma, addr, false);
mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
}
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 77fee9325a57..ae21a1f309c2 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -76,7 +76,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
goto out_release_uncharge_unlock;

inc_mm_counter(dst_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, dst_vma, dst_addr);
+ page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, dst_vma);

--
2.1.4

2015-07-20 14:21:23

by Kirill A. Shutemov

Subject: [PATCHv9 03/36] memcg: adjust to support new THP refcounting

As with rmap, with the new refcounting we cannot rely on PageTransHuge()
to check whether we need to charge the size of a huge page to the cgroup.
We need to get information from the caller to know whether the page was
mapped with a PMD or a PTE.

We uncharge when the last reference on the page is gone. At that point,
if we see PageTransHuge(), it means we need to uncharge the whole huge
page.

The tricky part is partial unmap -- when we try to unmap part of a huge
page. We don't do any special handling of this situation, meaning we
don't uncharge part of the huge page unless the last user is gone or
split_huge_page() is triggered. If cgroup memory pressure happens, the
partially unmapped page will be split through the shrinker. This should
be good enough.
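
The sizing rule described above can be illustrated with a small userspace
sketch. It is not kernel code: struct toy_page, TOY_HPAGE_NR and
toy_charge_nr_pages() are hypothetical names invented for this example,
and 512 is only an assumed stand-in for HPAGE_PMD_NR. The point is that
the charge size follows the caller-supplied "compound" flag rather than a
PageTransHuge()-style test on the page.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed stand-in for HPAGE_PMD_NR; the real value is arch-dependent. */
#define TOY_HPAGE_NR 512

/* Toy page: only records whether it is the head of a huge page. */
struct toy_page {
	bool trans_huge;
};

/*
 * Number of base pages to charge. The decision comes from what the
 * caller knows about the mapping (PMD vs. PTE), not from the page.
 */
unsigned int toy_charge_nr_pages(const struct toy_page *page, bool compound)
{
	/* Loosely mirrors VM_BUG_ON_PAGE(): a compound charge needs a huge page. */
	assert(!compound || page->trans_huge);
	return compound ? TOY_HPAGE_NR : 1;
}
```

In this model a huge page mapped with a PTE is charged as a single small
page, which is the case a check on the page alone could not distinguish.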

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/memcontrol.h | 16 +++++++-----
kernel/events/uprobes.c | 7 +++---
mm/filemap.c | 8 +++---
mm/huge_memory.c | 33 ++++++++++++------------
mm/memcontrol.c | 62 +++++++++++++++++-----------------------------
mm/memory.c | 28 ++++++++++-----------
mm/shmem.c | 21 +++++++++-------
mm/swapfile.c | 9 ++++---
mm/userfaultfd.c | 6 ++---
9 files changed, 92 insertions(+), 98 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d92b80b63c5c..c69c4ef67cf8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -291,10 +291,12 @@ static inline void mem_cgroup_events(struct mem_cgroup *memcg,
bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);

int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp);
+ gfp_t gfp_mask, struct mem_cgroup **memcgp,
+ bool compound);
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
- bool lrucare);
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
+ bool lrucare, bool compound);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+ bool compound);
void mem_cgroup_uncharge(struct page *page);
void mem_cgroup_uncharge_list(struct list_head *page_list);

@@ -512,7 +514,8 @@ static inline bool mem_cgroup_low(struct mem_cgroup *root,

static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask,
- struct mem_cgroup **memcgp)
+ struct mem_cgroup **memcgp,
+ bool compound)
{
*memcgp = NULL;
return 0;
@@ -520,12 +523,13 @@ static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,

static inline void mem_cgroup_commit_charge(struct page *page,
struct mem_cgroup *memcg,
- bool lrucare)
+ bool lrucare, bool compound)
{
}

static inline void mem_cgroup_cancel_charge(struct page *page,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg,
+ bool compound)
{
}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 5523daf59953..04e26bdf0717 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -169,7 +169,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
const unsigned long mmun_end = addr + PAGE_SIZE;
struct mem_cgroup *memcg;

- err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
+ err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg,
+ false);
if (err)
return err;

@@ -184,7 +185,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

get_page(kpage);
page_add_new_anon_rmap(kpage, vma, addr, false);
- mem_cgroup_commit_charge(kpage, memcg, false);
+ mem_cgroup_commit_charge(kpage, memcg, false, false);
lru_cache_add_active_or_unevictable(kpage, vma);

if (!PageAnon(page)) {
@@ -207,7 +208,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

err = 0;
unlock:
- mem_cgroup_cancel_charge(kpage, memcg);
+ mem_cgroup_cancel_charge(kpage, memcg, false);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
unlock_page(page);
return err;
diff --git a/mm/filemap.c b/mm/filemap.c
index 23780d86eac0..91d07a576166 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -577,7 +577,7 @@ static int __add_to_page_cache_locked(struct page *page,

if (!huge) {
error = mem_cgroup_try_charge(page, current->mm,
- gfp_mask, &memcg);
+ gfp_mask, &memcg, false);
if (error)
return error;
}
@@ -585,7 +585,7 @@ static int __add_to_page_cache_locked(struct page *page,
error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error) {
if (!huge)
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
return error;
}

@@ -604,7 +604,7 @@ static int __add_to_page_cache_locked(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
if (!huge)
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
trace_mm_filemap_add_to_page_cache(page);
return 0;
err_insert:
@@ -612,7 +612,7 @@ err_insert:
/* Leave page->index set: truncation relies upon it */
spin_unlock_irq(&mapping->tree_lock);
if (!huge)
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
page_cache_release(page);
return error;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4e8e3c267cdf..6fd6df16e55d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -723,7 +723,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_try_charge(page, mm, gfp, &memcg)) {
+ if (mem_cgroup_try_charge(page, mm, gfp, &memcg, true)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
@@ -731,7 +731,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,

pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable)) {
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
return VM_FAULT_OOM;
}
@@ -747,7 +747,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_none(*pmd))) {
spin_unlock(ptl);
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
pte_free(mm, pgtable);
} else {
@@ -758,7 +758,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
int ret;

spin_unlock(ptl);
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
pte_free(mm, pgtable);
ret = handle_userfault(vma, address, flags,
@@ -770,7 +770,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr, true);
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, true);
lru_cache_add_active_or_unevictable(page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
@@ -1067,13 +1067,14 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
vma, address, page_to_nid(page));
if (unlikely(!pages[i] ||
mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
- &memcg))) {
+ &memcg, false))) {
if (pages[i])
put_page(pages[i]);
while (--i >= 0) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- mem_cgroup_cancel_charge(pages[i], memcg);
+ mem_cgroup_cancel_charge(pages[i], memcg,
+ false);
put_page(pages[i]);
}
kfree(pages);
@@ -1112,7 +1113,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vma, haddr, false);
- mem_cgroup_commit_charge(pages[i], memcg, false);
+ mem_cgroup_commit_charge(pages[i], memcg, false, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte_none(*pte));
@@ -1140,7 +1141,7 @@ out_free_pages:
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- mem_cgroup_cancel_charge(pages[i], memcg);
+ mem_cgroup_cancel_charge(pages[i], memcg, false);
put_page(pages[i]);
}
kfree(pages);
@@ -1206,7 +1207,8 @@ alloc:
goto out;
}

- if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp, &memcg))) {
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp,
+ &memcg, true))) {
put_page(new_page);
if (page) {
split_huge_page(page);
@@ -1235,7 +1237,7 @@ alloc:
put_user_huge_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
spin_unlock(ptl);
- mem_cgroup_cancel_charge(new_page, memcg);
+ mem_cgroup_cancel_charge(new_page, memcg, true);
put_page(new_page);
goto out_mn;
} else {
@@ -1244,7 +1246,7 @@ alloc:
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_huge_clear_flush_notify(vma, haddr, pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
- mem_cgroup_commit_charge(new_page, memcg, false);
+ mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_active_or_unevictable(new_page, vma);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, pmd);
@@ -2571,8 +2573,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (!new_page)
return;

- if (unlikely(mem_cgroup_try_charge(new_page, mm,
- gfp, &memcg)))
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true)))
return;

/*
@@ -2659,7 +2660,7 @@ static void collapse_huge_page(struct mm_struct *mm,
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
page_add_new_anon_rmap(new_page, vma, address, true);
- mem_cgroup_commit_charge(new_page, memcg, false);
+ mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_active_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
@@ -2674,7 +2675,7 @@ out_up_write:
return;

out:
- mem_cgroup_cancel_charge(new_page, memcg);
+ mem_cgroup_cancel_charge(new_page, memcg, true);
goto out_up_write;
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3033e6c42229..09c4454891ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -658,7 +658,7 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
struct page *page,
- int nr_pages)
+ bool compound, int nr_pages)
{
/*
* Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
@@ -671,9 +671,11 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
nr_pages);

- if (PageTransHuge(page))
+ if (compound) {
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
nr_pages);
+ }

/* pagein of a big page is an event. So, ignore page size */
if (nr_pages > 0)
@@ -4557,31 +4559,25 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
* from old cgroup.
*/
static int mem_cgroup_move_account(struct page *page,
- unsigned int nr_pages,
+ bool compound,
struct mem_cgroup *from,
struct mem_cgroup *to)
{
unsigned long flags;
+ unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
int ret;
bool anon;

VM_BUG_ON(from == to);
VM_BUG_ON_PAGE(PageLRU(page), page);
- /*
- * The page is isolated from LRU. So, collapse function
- * will not handle this page. But page splitting can happen.
- * Do this check under compound_page_lock(). The caller should
- * hold it.
- */
- ret = -EBUSY;
- if (nr_pages > 1 && !PageTransHuge(page))
- goto out;
+ VM_BUG_ON(compound && !PageTransHuge(page));

/*
* Prevent mem_cgroup_migrate() from looking at page->mem_cgroup
* of its source page while we change it: page migration takes
* both pages off the LRU, but page cache replacement doesn't.
*/
+ ret = -EBUSY;
if (!trylock_page(page))
goto out;

@@ -4636,9 +4632,9 @@ static int mem_cgroup_move_account(struct page *page,
ret = 0;

local_irq_disable();
- mem_cgroup_charge_statistics(to, page, nr_pages);
+ mem_cgroup_charge_statistics(to, page, compound, nr_pages);
memcg_check_events(to, page);
- mem_cgroup_charge_statistics(from, page, -nr_pages);
+ mem_cgroup_charge_statistics(from, page, compound, -nr_pages);
memcg_check_events(from, page);
local_irq_enable();
out_unlock:
@@ -4918,7 +4914,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
if (target_type == MC_TARGET_PAGE) {
page = target.page;
if (!isolate_lru_page(page)) {
- if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
+ if (!mem_cgroup_move_account(page, true,
mc.from, mc.to)) {
mc.precharge -= HPAGE_PMD_NR;
mc.moved_charge += HPAGE_PMD_NR;
@@ -4947,7 +4943,8 @@ retry:
page = target.page;
if (isolate_lru_page(page))
goto put;
- if (!mem_cgroup_move_account(page, 1, mc.from, mc.to)) {
+ if (!mem_cgroup_move_account(page, false,
+ mc.from, mc.to)) {
mc.precharge--;
/* we uncharge from mc.from later. */
mc.moved_charge++;
@@ -5284,10 +5281,11 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
* with mem_cgroup_cancel_charge() in case page instantiation fails.
*/
int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp)
+ gfp_t gfp_mask, struct mem_cgroup **memcgp,
+ bool compound)
{
struct mem_cgroup *memcg = NULL;
- unsigned int nr_pages = 1;
+ unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
int ret = 0;

if (mem_cgroup_disabled())
@@ -5305,11 +5303,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
goto out;
}

- if (PageTransHuge(page)) {
- nr_pages <<= compound_order(page);
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- }
-
if (do_swap_account && PageSwapCache(page))
memcg = try_get_mem_cgroup_from_page(page);
if (!memcg)
@@ -5345,9 +5338,9 @@ out:
* Use mem_cgroup_cancel_charge() to cancel the transaction instead.
*/
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
- bool lrucare)
+ bool lrucare, bool compound)
{
- unsigned int nr_pages = 1;
+ unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;

VM_BUG_ON_PAGE(!page->mapping, page);
VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
@@ -5364,13 +5357,8 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,

commit_charge(page, memcg, lrucare);

- if (PageTransHuge(page)) {
- nr_pages <<= compound_order(page);
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- }
-
local_irq_disable();
- mem_cgroup_charge_statistics(memcg, page, nr_pages);
+ mem_cgroup_charge_statistics(memcg, page, compound, nr_pages);
memcg_check_events(memcg, page);
local_irq_enable();

@@ -5392,9 +5380,10 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
*
* Cancel a charge transaction started by mem_cgroup_try_charge().
*/
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+ bool compound)
{
- unsigned int nr_pages = 1;
+ unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;

if (mem_cgroup_disabled())
return;
@@ -5406,11 +5395,6 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
if (!memcg)
return;

- if (PageTransHuge(page)) {
- nr_pages <<= compound_order(page);
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- }
-
cancel_charge(memcg, nr_pages);
}

@@ -5668,7 +5652,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
* only synchronisation we have for udpating the per-CPU variables.
*/
VM_BUG_ON(!irqs_disabled());
- mem_cgroup_charge_statistics(memcg, page, -1);
+ mem_cgroup_charge_statistics(memcg, page, false, -1);
memcg_check_events(memcg, page);
}

diff --git a/mm/memory.c b/mm/memory.c
index ae98aba42697..1149f788603d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2083,7 +2083,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
cow_user_page(new_page, old_page, address, vma);
}

- if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
+ if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
goto oom_free_new;

__SetPageUptodate(new_page);
@@ -2114,7 +2114,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
*/
ptep_clear_flush_notify(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address, false);
- mem_cgroup_commit_charge(new_page, memcg, false);
+ mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
* We call the notify macro here because, when using secondary
@@ -2153,7 +2153,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
new_page = old_page;
page_copied = 1;
} else {
- mem_cgroup_cancel_charge(new_page, memcg);
+ mem_cgroup_cancel_charge(new_page, memcg, false);
}

if (new_page)
@@ -2528,7 +2528,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_page;
}

- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
+ if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
ret = VM_FAULT_OOM;
goto out_page;
}
@@ -2570,10 +2570,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
set_pte_at(mm, address, page_table, pte);
if (page == swapcache) {
do_page_add_anon_rmap(page, vma, address, exclusive);
- mem_cgroup_commit_charge(page, memcg, true);
+ mem_cgroup_commit_charge(page, memcg, true, false);
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, address, false);
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
}

@@ -2608,7 +2608,7 @@ unlock:
out:
return ret;
out_nomap:
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
pte_unmap_unlock(page_table, ptl);
out_page:
unlock_page(page);
@@ -2702,7 +2702,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (!page)
goto oom;

- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
+ if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
goto oom_free_page;

/*
@@ -2723,7 +2723,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(page_table, ptl);
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
page_cache_release(page);
return handle_userfault(vma, address, flags,
VM_UFFD_MISSING);
@@ -2731,7 +2731,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,

inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address, false);
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
set_pte_at(mm, address, page_table, entry);
@@ -2742,7 +2742,7 @@ unlock:
pte_unmap_unlock(page_table, ptl);
return 0;
release:
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
page_cache_release(page);
goto unlock;
oom_free_page:
@@ -2993,7 +2993,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (!new_page)
return VM_FAULT_OOM;

- if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
+ if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
page_cache_release(new_page);
return VM_FAULT_OOM;
}
@@ -3022,7 +3022,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto uncharge_out;
}
do_set_pte(vma, address, new_page, pte, true, true);
- mem_cgroup_commit_charge(new_page, memcg, false);
+ mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
pte_unmap_unlock(pte, ptl);
if (fault_page) {
@@ -3037,7 +3037,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
return ret;
uncharge_out:
- mem_cgroup_cancel_charge(new_page, memcg);
+ mem_cgroup_cancel_charge(new_page, memcg, false);
page_cache_release(new_page);
return ret;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 45326f97a68f..06baa2316b54 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -706,7 +706,8 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
* the shmem_swaplist_mutex which might hold up shmem_writepage().
* Charged back to the user (not to caller) when swap account is used.
*/
- error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg);
+ error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg,
+ false);
if (error)
goto out;
/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -729,9 +730,9 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
if (error) {
if (error != -ENOMEM)
error = 0;
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
} else
- mem_cgroup_commit_charge(page, memcg, true);
+ mem_cgroup_commit_charge(page, memcg, true, false);
out:
unlock_page(page);
page_cache_release(page);
@@ -1114,7 +1115,8 @@ repeat:
goto failed;
}

- error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+ error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+ false);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
swp_to_radix_entry(swap));
@@ -1131,14 +1133,14 @@ repeat:
* "repeat": reading a hole and writing should succeed.
*/
if (error) {
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
delete_from_swap_cache(page);
}
}
if (error)
goto failed;

- mem_cgroup_commit_charge(page, memcg, true);
+ mem_cgroup_commit_charge(page, memcg, true, false);

spin_lock(&info->lock);
info->swapped--;
@@ -1177,7 +1179,8 @@ repeat:
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

- error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+ error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+ false);
if (error)
goto decused;
error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
@@ -1187,10 +1190,10 @@ repeat:
radix_tree_preload_end();
}
if (error) {
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
goto decused;
}
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_anon(page);

spin_lock(&info->lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7ba293b112e3..866c982cea9b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1145,14 +1145,15 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
if (unlikely(!page))
return -ENOMEM;

- if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
ret = -ENOMEM;
goto out_nolock;
}

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
ret = 0;
goto out;
}
@@ -1164,10 +1165,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
if (page == swapcache) {
page_add_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, true);
+ mem_cgroup_commit_charge(page, memcg, true, false);
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
}
swap_free(entry);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index ae21a1f309c2..806b0c758c5b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -63,7 +63,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
__SetPageUptodate(page);

ret = -ENOMEM;
- if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
+ if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
goto out_release;

_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
@@ -77,7 +77,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,

inc_mm_counter(dst_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
- mem_cgroup_commit_charge(page, memcg, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, dst_vma);

set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
@@ -91,7 +91,7 @@ out:
return ret;
out_release_uncharge_unlock:
pte_unmap_unlock(dst_pte, ptl);
- mem_cgroup_cancel_charge(page, memcg);
+ mem_cgroup_cancel_charge(page, memcg, false);
out_release:
page_cache_release(page);
goto out;
--
2.1.4

2015-07-20 14:33:37

by Kirill A. Shutemov

Subject: [PATCHv9 04/36] mm, thp: adjust conditions when we can reuse the page on WP fault

With the new refcounting we will be able to map the same compound page
with PTEs and PMDs. This requires adjusting the conditions under which we
can reuse the page on a write-protection fault.

For a PTE fault we can't reuse the page if it's part of a huge page.

For a PMD fault we can only reuse the page if nobody else maps the huge
page or any part of it. We could check page_mapcount() on each sub-page,
but that is expensive.

The cheaper way is to check that page_count() is equal to 1: every
mapcount takes a page reference, so this way we can guarantee that the
PMD is the only mapping.

This approach can give a false negative if somebody pinned the page, but
that doesn't affect correctness.
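
The two reuse rules above can be modelled in a few lines of userspace C.
This is only a sketch under simplifying assumptions: struct toy_thp and
both helpers are hypothetical, and the counters are plain integers rather
than the kernel's atomic page_count()/page_mapcount().

```c
#include <stdbool.h>

/* Toy model of the state the write-protect fault path looks at. */
struct toy_thp {
	int count;           /* all references; each mapping holds one */
	int mapcount;        /* mappings of the page being faulted on */
	bool trans_compound; /* page is part of a huge page */
};

/* PTE fault: never reuse in place if the page belongs to a huge page. */
bool toy_can_reuse_pte(const struct toy_thp *page)
{
	return !page->trans_compound && page->mapcount == 1;
}

/*
 * PMD fault: reuse only if this PMD is the sole mapping. Because every
 * mapping takes a reference, count == 1 rules out other mappings of any
 * sub-page without walking them all.
 */
bool toy_can_reuse_pmd(const struct toy_thp *page)
{
	return page->mapcount == 1 && page->count == 1;
}
```

A pinned page has count > 1 even with a single mapping, so the PMD helper
declines to reuse it and the fault falls back to copying; that is the
harmless false negative mentioned above.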

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/swap.h | 3 ++-
mm/huge_memory.c | 12 +++++++++++-
mm/swapfile.c | 3 +++
3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ec62fcbfcad3..24ad6c3b0853 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -537,7 +537,8 @@ static inline int swp_swapcount(swp_entry_t entry)
return 0;
}

-#define reuse_swap_page(page) (page_mapcount(page) == 1)
+#define reuse_swap_page(page) \
+ (!PageTransCompound(page) && page_mapcount(page) == 1)

static inline int try_to_free_swap(struct page *page)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6fd6df16e55d..8d5a8881c60e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1171,7 +1171,17 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

page = pmd_page(orig_pmd);
VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
- if (page_mapcount(page) == 1) {
+ /*
+ * We can only reuse the page if nobody else maps the huge page or any
+ * part of it. We could check page_mapcount() on each sub-page, but
+ * that is expensive.
+ * The cheaper way is to check that page_count() is equal to 1: every
+ * mapcount takes a page reference, so this way we can guarantee that
+ * the PMD is the only mapping.
+ * This can give a false negative if somebody pinned the page, but
+ * that's fine.
+ */
+ if (page_mapcount(page) == 1 && page_count(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 866c982cea9b..89ea6af27dc0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -929,6 +929,9 @@ int reuse_swap_page(struct page *page)
VM_BUG_ON_PAGE(!PageLocked(page), page);
if (unlikely(PageKsm(page)))
return 0;
+ /* The page is part of THP and cannot be reused */
+ if (PageTransCompound(page))
+ return 0;
count = page_mapcount(page);
if (count <= 1 && PageSwapCache(page)) {
count += page_swapcount(page);
--
2.1.4

2015-07-20 14:32:00

by Kirill A. Shutemov

Subject: [PATCHv9 05/36] mm: adjust FOLL_SPLIT for new refcounting

We need to prepare the kernel to allow transhuge pages to be mapped with
PTEs too. This means we need to handle FOLL_SPLIT in follow_page_pte().

Also, we now use split_huge_page() directly instead of
split_huge_page_pmd(). split_huge_page_pmd() will go away.
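
The control flow this adds to the PTE path (pin, split, unpin, retry) can
be sketched in userspace as follows. All names here (struct toy_gup_page,
toy_split(), toy_follow_pte(), TOY_FOLL_SPLIT) are hypothetical, locking
is reduced to comments, and errors are reported through an out-parameter
rather than ERR_PTR(), so this is an illustration of the shape of the
logic and not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_FOLL_SPLIT 0x1u

struct toy_gup_page {
	bool compound;   /* still part of a huge page */
	int refcount;
	int split_error; /* 0, or a negative errno-style value to simulate failure */
};

/* Stand-in for split_huge_page(): on success the page stops being compound. */
int toy_split(struct toy_gup_page *page)
{
	if (page->split_error)
		return page->split_error;
	page->compound = false;
	return 0;
}

/* Returns the page, or NULL with *err set if the split failed. */
struct toy_gup_page *toy_follow_pte(struct toy_gup_page *page,
				    unsigned int flags, int *err)
{
	*err = 0;
retry:
	if ((flags & TOY_FOLL_SPLIT) && page->compound) {
		int ret;

		page->refcount++;	/* pin the page across the split */
		/* real code: drop the PTE lock, then take the page lock */
		ret = toy_split(page);
		/* real code: release the page lock */
		page->refcount--;	/* drop the pin */
		if (ret) {
			*err = ret;
			return NULL;
		}
		goto retry;		/* redo the lookup on the split page */
	}
	return page;
}
```

After a successful split the retry sees a non-compound page and returns
it; a failed split leaves the page compound and propagates the error.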

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
mm/gup.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index fca048cde973..55c87b16289b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -114,6 +114,19 @@ retry:
}
}

+ if (flags & FOLL_SPLIT && PageTransCompound(page)) {
+ int ret;
+ get_page(page);
+ pte_unmap_unlock(ptep, ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (ret)
+ return ERR_PTR(ret);
+ goto retry;
+ }
+
if (flags & FOLL_GET)
get_page_foll(page);
if (flags & FOLL_TOUCH) {
@@ -218,27 +231,45 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
}
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
return no_page_table(vma, flags);
- if (pmd_trans_huge(*pmd)) {
- if (flags & FOLL_SPLIT) {
+ if (likely(!pmd_trans_huge(*pmd)))
+ return follow_page_pte(vma, address, pmd, flags);
+
+ ptl = pmd_lock(mm, pmd);
+ if (unlikely(!pmd_trans_huge(*pmd))) {
+ spin_unlock(ptl);
+ return follow_page_pte(vma, address, pmd, flags);
+ }
+
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(ptl);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ return follow_page_pte(vma, address, pmd, flags);
+ }
+
+ if (flags & FOLL_SPLIT) {
+ int ret;
+ page = pmd_page(*pmd);
+ if (is_huge_zero_page(page)) {
+ spin_unlock(ptl);
+ ret = 0;
split_huge_page_pmd(vma, address, pmd);
- return follow_page_pte(vma, address, pmd, flags);
- }
- ptl = pmd_lock(mm, pmd);
- if (likely(pmd_trans_huge(*pmd))) {
- if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(ptl);
- wait_split_huge_page(vma->anon_vma, pmd);
- } else {
- page = follow_trans_huge_pmd(vma, address,
- pmd, flags);
- spin_unlock(ptl);
- *page_mask = HPAGE_PMD_NR - 1;
- return page;
- }
- } else
+ } else {
+ get_page(page);
spin_unlock(ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ }
+
+ return ret ? ERR_PTR(ret) :
+ follow_page_pte(vma, address, pmd, flags);
}
- return follow_page_pte(vma, address, pmd, flags);
+
+ page = follow_trans_huge_pmd(vma, address, pmd, flags);
+ spin_unlock(ptl);
+ *page_mask = HPAGE_PMD_NR - 1;
+ return page;
}

static int get_gate_page(struct mm_struct *mm, unsigned long address,
--
2.1.4

2015-07-20 14:33:34

by Kirill A. Shutemov

Subject: [PATCHv9 06/36] mm: handle PTE-mapped tail pages in generic fast gup implementation

With the new refcounting we are going to see THP tail pages mapped with
PTEs. The generic fast GUP relies on page_cache_get_speculative() to
obtain a reference on the page. page_cache_get_speculative() always fails
on tail pages, because ->_count on tail pages is always zero.

Let's handle tail pages in gup_pte_range() by taking the reference on the
head page instead.

The new split_huge_page() will rely on migration entries to freeze the
page's counts. Rechecking the PTE value after
page_cache_get_speculative() on the head page should be enough to
serialize against split.
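
For illustration only, here is a minimal userspace sketch of the scheme
described above. It is not the kernel code: struct toy_page and all toy_*()
helpers are hypothetical stand-ins, and C11 atomics model ->_count. It shows
why a speculative get on a tail page fails, and how pinning the head and then
rechecking the sampled entry works.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; the field names are illustrative only. */
struct toy_page {
	atomic_int count;	/* models ->_count; always 0 on tail pages */
	struct toy_page *head;	/* NULL for a head or non-compound page */
};

static struct toy_page *toy_compound_head(struct toy_page *page)
{
	return page->head ? page->head : page;
}

/* Models page_cache_get_speculative(): take a reference unless count is 0. */
static bool toy_get_speculative(struct toy_page *page)
{
	int old = atomic_load(&page->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&page->count, &old, old + 1))
			return true;
	}
	return false;
}

static void toy_put(struct toy_page *page)
{
	int old = atomic_fetch_sub(&page->count, 1);

	assert(old > 0);
	(void)old;
}

/*
 * Models the per-PTE step: sample the slot, pin the head page, then
 * recheck that the slot still holds the sampled value. Returns the
 * (possibly tail) page on success, NULL if the fast path must give up.
 */
static struct toy_page *toy_gup_one(_Atomic(struct toy_page *) *slot)
{
	struct toy_page *page = atomic_load(slot);
	struct toy_page *head;

	if (!page)
		return NULL;

	head = toy_compound_head(page);
	if (!toy_get_speculative(head))
		return NULL;

	if (atomic_load(slot) != page) {
		/* The entry changed under us: drop the pin and bail out. */
		toy_put(head);
		return NULL;
	}
	return page;
}
```

The reference ends up on the head page even though the caller gets back the
tail page, which matches the put_page(head) in the error path of the patch.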

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
mm/gup.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 55c87b16289b..6b9f578cff2e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1089,7 +1089,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
* for an example see gup_get_pte in arch/x86/mm/gup.c
*/
pte_t pte = READ_ONCE(*ptep);
- struct page *page;
+ struct page *head, *page;

/*
* Similar to the PMD case below, NUMA hinting must take slow
@@ -1101,15 +1101,17 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,

VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
+ head = compound_head(page);

- if (!page_cache_get_speculative(page))
+ if (!page_cache_get_speculative(head))
goto pte_unmap;

if (unlikely(pte_val(pte) != pte_val(*ptep))) {
- put_page(page);
+ put_page(head);
goto pte_unmap;
}

+ VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
(*nr)++;

--
2.1.4

2015-07-20 14:24:00

by Kirill A. Shutemov

Subject: [PATCHv9 07/36] thp, mlock: do not allow huge pages in mlocked area

With the new refcounting a THP can belong to several VMAs. This makes it
tricky to track THP pages when they are only partially mlocked. It can
lead to leaking mlocked pages to non-VM_LOCKED vmas and other problems.

With this patch we split all huge pages on mlock and avoid faulting in or
collapsing new THPs in VM_LOCKED vmas.

I've tried an alternative approach: do not mark THP pages mlocked and keep
them on normal LRUs. This way vmscan could try to split huge pages under
memory pressure and free up subpages which don't belong to VM_LOCKED
vmas. But this is a user-visible change: we screw up the Mlocked
accounting reported in meminfo, so I had to leave this approach aside.

We can bring in something better later, but this should be good enough
for now.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
---
mm/gup.c | 2 ++
mm/huge_memory.c | 5 ++++-
mm/memory.c | 3 ++-
mm/mlock.c | 51 +++++++++++++++++++--------------------------------
4 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6b9f578cff2e..b8bba5589be6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -920,6 +920,8 @@ long populate_vma_page_range(struct vm_area_struct *vma,
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);

gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+ if (vma->vm_flags & VM_LOCKED)
+ gup_flags |= FOLL_SPLIT;
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8d5a8881c60e..eebb518a7267 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -814,6 +814,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,

if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
return VM_FAULT_FALLBACK;
+ if (vma->vm_flags & VM_LOCKED)
+ return VM_FAULT_FALLBACK;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
@@ -2545,7 +2547,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
(vma->vm_flags & VM_NOHUGEPAGE))
return false;
-
+ if (vma->vm_flags & VM_LOCKED)
+ return false;
if (!vma->anon_vma || vma->vm_ops)
return false;
if (is_vma_temporary_stack(vma))
diff --git a/mm/memory.c b/mm/memory.c
index 1149f788603d..720b3bebf1f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2161,7 +2161,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,

pte_unmap_unlock(page_table, ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
- if (old_page) {
+ /* THP pages are never mlocked */
+ if (old_page && !PageTransCompound(old_page)) {
/*
* Don't let another task, with possibly unlocked vma,
* keep the mlocked page.
diff --git a/mm/mlock.c b/mm/mlock.c
index df91dadf6c7a..3c2e1290edfc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -443,39 +443,26 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
&page_mask);

- if (page && !IS_ERR(page)) {
- if (PageTransHuge(page)) {
- lock_page(page);
- /*
- * Any THP page found by follow_page_mask() may
- * have gotten split before reaching
- * munlock_vma_page(), so we need to recompute
- * the page_mask here.
- */
- page_mask = munlock_vma_page(page);
- unlock_page(page);
- put_page(page); /* follow_page_mask() */
- } else {
- /*
- * Non-huge pages are handled in batches via
- * pagevec. The pin from follow_page_mask()
- * prevents them from collapsing by THP.
- */
- pagevec_add(&pvec, page);
- zone = page_zone(page);
- zoneid = page_zone_id(page);
+ if (page && !IS_ERR(page) && !PageTransCompound(page)) {
+ /*
+ * Non-huge pages are handled in batches via
+ * pagevec. The pin from follow_page_mask()
+ * prevents them from collapsing by THP.
+ */
+ pagevec_add(&pvec, page);
+ zone = page_zone(page);
+ zoneid = page_zone_id(page);

- /*
- * Try to fill the rest of pagevec using fast
- * pte walk. This will also update start to
- * the next page to process. Then munlock the
- * pagevec.
- */
- start = __munlock_pagevec_fill(&pvec, vma,
- zoneid, start, end);
- __munlock_pagevec(&pvec, zone);
- goto next;
- }
+ /*
+ * Try to fill the rest of pagevec using fast
+ * pte walk. This will also update start to
+ * the next page to process. Then munlock the
+ * pagevec.
+ */
+ start = __munlock_pagevec_fill(&pvec, vma,
+ zoneid, start, end);
+ __munlock_pagevec(&pvec, zone);
+ goto next;
}
/* It's a bug to munlock in the middle of a THP page */
VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
--
2.1.4

2015-07-20 14:33:10

by Kirill A. Shutemov

Subject: [PATCHv9 08/36] khugepaged: ignore pmd tables with THP mapped with ptes

Prepare khugepaged to see compound pages mapped with ptes. For now we
won't collapse a pmd table that contains such ptes.

khugepaged is subject to future rework wrt the new refcounting.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
mm/huge_memory.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eebb518a7267..bef7b3595fbe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2732,6 +2732,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
page = vm_normal_page(vma, _address, pteval);
if (unlikely(!page))
goto out_unmap;
+
+ /* TODO: teach khugepaged to collapse THP mapped with pte */
+ if (PageCompound(page))
+ goto out_unmap;
+
/*
* Record which node the original page is from and save this
* information to khugepaged_node_load[].
@@ -2742,7 +2747,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
if (khugepaged_scan_abort(node))
goto out_unmap;
khugepaged_node_load[node]++;
- VM_BUG_ON_PAGE(PageCompound(page), page);
if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
goto out_unmap;
/*
--
2.1.4

2015-07-20 14:32:52

by Kirill A. Shutemov

Subject: [PATCHv9 09/36] thp: rename split_huge_page_pmd() to split_huge_pmd()

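We are going to decouple splitting the THP PMD from splitting the
underlying compound page.

This patch renames the split_huge_page_pmd*() functions to
split_huge_pmd*() to reflect the fact that they split only the PMD and do
not imply splitting the page. Note that the argument order changes too:
split_huge_pmd() takes (vma, pmd, address), while split_huge_page_pmd()
took (vma, address, pmd).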

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
arch/powerpc/mm/subpage-prot.c | 2 +-
arch/x86/kernel/vm86_32.c | 6 +++++-
include/linux/huge_mm.h | 8 ++------
mm/gup.c | 2 +-
mm/huge_memory.c | 32 +++++++++++---------------------
mm/madvise.c | 2 +-
mm/memory.c | 2 +-
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
11 files changed, 26 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
index fa9fb5b4c66c..d5543514c1df 100644
--- a/arch/powerpc/mm/subpage-prot.c
+++ b/arch/powerpc/mm/subpage-prot.c
@@ -135,7 +135,7 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->vma;
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
return 0;
}

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index fc9db6ef2a95..bf85db746b2c 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,11 @@ static void mark_screen_rdonly(struct mm_struct *mm)
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
- split_huge_page_pmd_mm(mm, 0xA0000, pmd);
+
+ if (pmd_trans_huge(*pmd)) {
+ struct vm_area_struct *vma = find_vma(mm, 0xA0000);
+ split_huge_pmd(vma, pmd, 0xA0000);
+ }
if (pmd_none_or_clear_bad(pmd))
goto out;
pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index cff4e4bc7fab..f825694a23ee 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -106,7 +106,7 @@ static inline int split_huge_page(struct page *page)
}
extern void __split_huge_page_pmd(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd);
-#define split_huge_page_pmd(__vma, __address, __pmd) \
+#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
@@ -121,8 +121,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
-extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd);
#if HPAGE_PMD_ORDER >= MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -190,11 +188,9 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define split_huge_page_pmd(__vma, __address, __pmd) \
- do { } while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
-#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
+#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
diff --git a/mm/gup.c b/mm/gup.c
index b8bba5589be6..b2f9790e5a98 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -252,7 +252,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (is_huge_zero_page(page)) {
spin_unlock(ptl);
ret = 0;
- split_huge_page_pmd(vma, address, pmd);
+ split_huge_pmd(vma, pmd, address);
} else {
get_page(page);
spin_unlock(ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bef7b3595fbe..c93828c16716 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1204,13 +1204,13 @@ alloc:

if (unlikely(!new_page)) {
if (!page) {
- split_huge_page_pmd(vma, address, pmd);
+ split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
} else {
ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
pmd, orig_pmd, page, haddr);
if (ret & VM_FAULT_OOM) {
- split_huge_page(page);
+ split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
}
put_user_huge_page(page);
@@ -1223,10 +1223,10 @@ alloc:
&memcg, true))) {
put_page(new_page);
if (page) {
- split_huge_page(page);
+ split_huge_pmd(vma, pmd, address);
put_user_huge_page(page);
} else
- split_huge_page_pmd(vma, address, pmd);
+ split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -3062,17 +3062,7 @@ again:
goto again;
}

-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd)
-{
- struct vm_area_struct *vma;
-
- vma = find_vma(mm, address);
- BUG_ON(vma == NULL);
- split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
+static void split_huge_pmd_address(struct vm_area_struct *vma,
unsigned long address)
{
pgd_t *pgd;
@@ -3081,7 +3071,7 @@ static void split_huge_page_address(struct mm_struct *mm,

VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));

- pgd = pgd_offset(mm, address);
+ pgd = pgd_offset(vma->vm_mm, address);
if (!pgd_present(*pgd))
return;

@@ -3090,13 +3080,13 @@ static void split_huge_page_address(struct mm_struct *mm,
return;

pmd = pmd_offset(pud, address);
- if (!pmd_present(*pmd))
+ if (!pmd_present(*pmd) || !pmd_trans_huge(*pmd))
return;
/*
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- split_huge_page_pmd_mm(mm, address, pmd);
+ __split_huge_page_pmd(vma, address, pmd);
}

void vma_adjust_trans_huge(struct vm_area_struct *vma,
@@ -3112,7 +3102,7 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
if (start & ~HPAGE_PMD_MASK &&
(start & HPAGE_PMD_MASK) >= vma->vm_start &&
(start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
- split_huge_page_address(vma->vm_mm, start);
+ split_huge_pmd_address(vma, start);

/*
* If the new end address isn't hpage aligned and it could
@@ -3122,7 +3112,7 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
if (end & ~HPAGE_PMD_MASK &&
(end & HPAGE_PMD_MASK) >= vma->vm_start &&
(end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
- split_huge_page_address(vma->vm_mm, end);
+ split_huge_pmd_address(vma, end);

/*
* If we're also updating the vma->vm_next->vm_start, if the new
@@ -3136,6 +3126,6 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
if (nstart & ~HPAGE_PMD_MASK &&
(nstart & HPAGE_PMD_MASK) >= next->vm_start &&
(nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
- split_huge_page_address(next->vm_mm, nstart);
+ split_huge_pmd_address(next, nstart);
}
}
diff --git a/mm/madvise.c b/mm/madvise.c
index 67d5fe74ffdf..0588facc4381 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -283,7 +283,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
diff --git a/mm/memory.c b/mm/memory.c
index 720b3bebf1f9..e830477d40b7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1204,7 +1204,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
BUG();
}
#endif
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a7f1e0d1d6b8..b6122c0f613d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -493,7 +493,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
pte_t *pte;
spinlock_t *ptl;

- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
if (pmd_trans_unstable(pmd))
return 0;

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ef5be8eaab00..9c1445dc8a4c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -160,7 +160,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,

if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma, addr, pmd);
+ split_huge_pmd(vma, pmd, addr);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, prot_numa);
diff --git a/mm/mremap.c b/mm/mremap.c
index 5a71cce8c6ea..9cf393ac6e43 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -209,7 +209,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
need_flush = true;
continue;
} else if (!err) {
- split_huge_page_pmd(vma, old_addr, old_pmd);
+ split_huge_pmd(vma, old_pmd, old_addr);
}
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 29f2f8b853ae..207244489a68 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
if (!walk->pte_entry)
continue;

- split_huge_page_pmd_mm(walk->mm, addr, pmd);
+ split_huge_pmd(walk->vma, pmd, addr);
if (pmd_trans_unstable(pmd))
goto again;
err = walk_pte_range(pmd, addr, next, walk);
--
2.1.4

2015-07-20 14:31:31

by Kirill A. Shutemov

Subject: [PATCHv9 10/36] mm, vmstats: new THP splitting event

The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. It reflects the fact that we
are going to be able to split a PMD without splitting the compound page
and that split_huge_page() can fail.
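
To make the intent of the three counters concrete, here is a small userspace
sketch. Everything in it is hypothetical (the toy_* names, the extra_pins
parameter, the -EBUSY return value); where the real kernel bumps each counter
is decided by later patches in the series, not by this one.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy mirror of the three new vm events; names are illustrative only. */
enum toy_thp_event {
	TOY_THP_SPLIT_PAGE,		/* compound page was split */
	TOY_THP_SPLIT_PAGE_FAILED,	/* page split was attempted but failed */
	TOY_THP_SPLIT_PMD,		/* only the PMD mapping was split */
	TOY_NR_THP_EVENTS,
};

static unsigned long toy_events[TOY_NR_THP_EVENTS];

static void toy_count_event(enum toy_thp_event ev)
{
	assert(ev < TOY_NR_THP_EVENTS);
	toy_events[ev]++;
}

/* Splitting a PMD leaves the compound page intact. */
static void toy_split_pmd(void)
{
	toy_count_event(TOY_THP_SPLIT_PMD);
}

/*
 * Splitting the page itself can fail. Assumption for this sketch: it
 * fails with -EBUSY when somebody else holds an extra pin on the page.
 */
static int toy_split_page(bool extra_pins)
{
	if (extra_pins) {
		toy_count_event(TOY_THP_SPLIT_PAGE_FAILED);
		return -EBUSY;
	}
	toy_count_event(TOY_THP_SPLIT_PAGE);
	return 0;
}
```

With a single THP_SPLIT counter these three outcomes could not be told apart
in /proc/vmstat.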

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/vm_event_item.h | 4 +++-
mm/huge_memory.c | 2 +-
mm/vmstat.c | 4 +++-
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef88b827..3261bfe2156a 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -69,7 +69,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
- THP_SPLIT,
+ THP_SPLIT_PAGE,
+ THP_SPLIT_PAGE_FAILED,
+ THP_SPLIT_PMD,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c93828c16716..330d978c4ffc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2038,7 +2038,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)

BUG_ON(!PageSwapBacked(page));
__split_huge_page(page, anon_vma, list);
- count_vm_event(THP_SPLIT);
+ count_vm_event(THP_SPLIT_PAGE);

BUG_ON(PageCompound(page));
out_unlock:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fd0886a389f..e1c87425fe11 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -821,7 +821,9 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
- "thp_split",
+ "thp_split_page",
+ "thp_split_page_failed",
+ "thp_split_pmd",
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
#endif
--
2.1.4

2015-07-20 14:32:37

by Kirill A. Shutemov

Subject: [PATCHv9 11/36] mm: temporarily mark THP broken

Up to this point we tried to keep the patchset bisectable, but the next
patches are going to change how the core of THP refcounting works.

It would be beneficial to split the change into several patches and make
it more reviewable. Unfortunately, I don't see how we can achieve that
while keeping THP working.

Let's hide THP under CONFIG_BROKEN for now and bring it back when the new
refcounting is established.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
mm/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..c973f416cbe5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -410,7 +410,7 @@ config NOMMU_INITIAL_TRIM_EXCESS

config TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
- depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
+ depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && BROKEN
select COMPACTION
help
Transparent Hugepages allows the kernel to use huge pages and
--
2.1.4

2015-07-20 14:31:08

by Kirill A. Shutemov

Subject: [PATCHv9 12/36] thp: drop all split_huge_page()-related code

We will re-introduce a new version, built on the new refcounting, later in
the patchset.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/huge_mm.h | 28 +---
mm/huge_memory.c | 400 +-----------------------------------------------
2 files changed, 7 insertions(+), 421 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f825694a23ee..933c63cc5ed7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -99,28 +99,12 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
#endif /* CONFIG_DEBUG_VM */

extern unsigned long transparent_hugepage_flags;
-extern int split_huge_page_to_list(struct page *page, struct list_head *list);
-static inline int split_huge_page(struct page *page)
-{
- return split_huge_page_to_list(page, NULL);
-}
-extern void __split_huge_page_pmd(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd);
-#define split_huge_pmd(__vma, __pmd, __address) \
- do { \
- pmd_t *____pmd = (__pmd); \
- if (unlikely(pmd_trans_huge(*____pmd))) \
- __split_huge_page_pmd(__vma, __address, \
- ____pmd); \
- } while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
- do { \
- pmd_t *____pmd = (__pmd); \
- anon_vma_lock_write(__anon_vma); \
- anon_vma_unlock_write(__anon_vma); \
- BUG_ON(pmd_trans_splitting(*____pmd) || \
- pmd_trans_huge(*____pmd)); \
- } while (0)
+
+#define split_huge_page_to_list(page, list) BUILD_BUG()
+#define split_huge_page(page) BUILD_BUG()
+#define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
+
+#define wait_split_huge_page(__anon_vma, __pmd) BUILD_BUG()
#if HPAGE_PMD_ORDER >= MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 330d978c4ffc..25ad3af53f15 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1725,329 +1725,6 @@ int pmd_freeable(pmd_t pmd)
return !pmd_dirty(pmd);
}

-static int __split_huge_page_splitting(struct page *page,
- struct vm_area_struct *vma,
- unsigned long address)
-{
- struct mm_struct *mm = vma->vm_mm;
- spinlock_t *ptl;
- pmd_t *pmd;
- int ret = 0;
- /* For mmu_notifiers */
- const unsigned long mmun_start = address;
- const unsigned long mmun_end = address + HPAGE_PMD_SIZE;
-
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
- if (pmd) {
- /*
- * We can't temporarily set the pmd to null in order
- * to split it, the pmd must remain marked huge at all
- * times or the VM won't take the pmd_trans_huge paths
- * and it won't wait on the anon_vma->root->rwsem to
- * serialize against split_huge_page*.
- */
- pmdp_splitting_flush(vma, address, pmd);
-
- ret = 1;
- spin_unlock(ptl);
- }
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
- return ret;
-}
-
-static void __split_huge_page_refcount(struct page *page,
- struct list_head *list)
-{
- int i;
- struct zone *zone = page_zone(page);
- struct lruvec *lruvec;
- int tail_count = 0;
-
- /* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irq(&zone->lru_lock);
- lruvec = mem_cgroup_page_lruvec(page, zone);
-
- compound_lock(page);
- /* complete memcg works before add pages to LRU */
- mem_cgroup_split_huge_fixup(page);
-
- for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
- struct page *page_tail = page + i;
-
- /* tail_page->_mapcount cannot change */
- BUG_ON(page_mapcount(page_tail) < 0);
- tail_count += page_mapcount(page_tail);
- /* check for overflow */
- BUG_ON(tail_count < 0);
- BUG_ON(atomic_read(&page_tail->_count) != 0);
- /*
- * tail_page->_count is zero and not changing from
- * under us. But get_page_unless_zero() may be running
- * from under us on the tail_page. If we used
- * atomic_set() below instead of atomic_add(), we
- * would then run atomic_set() concurrently with
- * get_page_unless_zero(), and atomic_set() is
- * implemented in C not using locked ops. spin_unlock
- * on x86 sometime uses locked ops because of PPro
- * errata 66, 92, so unless somebody can guarantee
- * atomic_set() here would be safe on all archs (and
- * not only on x86), it's safer to use atomic_add().
- */
- atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
- &page_tail->_count);
-
- /* after clearing PageTail the gup refcount can be released */
- smp_mb__after_atomic();
-
- /*
- * retain hwpoison flag of the poisoned tail page:
- * fix for the unsuitable process killed on Guest Machine(KVM)
- * by the memory-failure.
- */
- page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
- page_tail->flags |= (page->flags &
- ((1L << PG_referenced) |
- (1L << PG_swapbacked) |
- (1L << PG_mlocked) |
- (1L << PG_uptodate) |
- (1L << PG_active) |
- (1L << PG_unevictable)));
- page_tail->flags |= (1L << PG_dirty);
-
- /* clear PageTail before overwriting first_page */
- smp_wmb();
-
- /*
- * __split_huge_page_splitting() already set the
- * splitting bit in all pmd that could map this
- * hugepage, that will ensure no CPU can alter the
- * mapcount on the head page. The mapcount is only
- * accounted in the head page and it has to be
- * transferred to all tail pages in the below code. So
- * for this code to be safe, the split the mapcount
- * can't change. But that doesn't mean userland can't
- * keep changing and reading the page contents while
- * we transfer the mapcount, so the pmd splitting
- * status is achieved setting a reserved bit in the
- * pmd, not by clearing the present bit.
- */
- page_tail->_mapcount = page->_mapcount;
-
- BUG_ON(page_tail->mapping != TAIL_MAPPING);
- page_tail->mapping = page->mapping;
-
- page_tail->index = page->index + i;
- page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
-
- BUG_ON(!PageAnon(page_tail));
- BUG_ON(!PageUptodate(page_tail));
- BUG_ON(!PageDirty(page_tail));
- BUG_ON(!PageSwapBacked(page_tail));
-
- lru_add_page_tail(page, page_tail, lruvec, list);
- }
- atomic_sub(tail_count, &page->_count);
- BUG_ON(atomic_read(&page->_count) <= 0);
-
- __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
-
- ClearPageCompound(page);
- compound_unlock(page);
- spin_unlock_irq(&zone->lru_lock);
-
- for (i = 1; i < HPAGE_PMD_NR; i++) {
- struct page *page_tail = page + i;
- BUG_ON(page_count(page_tail) <= 0);
- /*
- * Tail pages may be freed if there wasn't any mapping
- * like if add_to_swap() is running on a lru page that
- * had its mapping zapped. And freeing these pages
- * requires taking the lru_lock so we do the put_page
- * of the tail pages after the split is complete.
- */
- put_page(page_tail);
- }
-
- /*
- * Only the head page (now become a regular page) is required
- * to be pinned by the caller.
- */
- BUG_ON(page_count(page) <= 0);
-}
-
-static int __split_huge_page_map(struct page *page,
- struct vm_area_struct *vma,
- unsigned long address)
-{
- struct mm_struct *mm = vma->vm_mm;
- spinlock_t *ptl;
- pmd_t *pmd, _pmd;
- int ret = 0, i;
- pgtable_t pgtable;
- unsigned long haddr;
-
- pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
- if (pmd) {
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
- if (pmd_write(*pmd))
- BUG_ON(page_mapcount(page) != 1);
-
- haddr = address;
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- BUG_ON(PageCompound(page+i));
- /*
- * Note that NUMA hinting access restrictions are not
- * transferred to avoid any possibility of altering
- * permissions across VMAs.
- */
- entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (!pmd_write(*pmd))
- entry = pte_wrprotect(entry);
- if (!pmd_young(*pmd))
- entry = pte_mkold(entry);
- pte = pte_offset_map(&_pmd, haddr);
- BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
- }
-
- smp_wmb(); /* make pte visible before pmd */
- /*
- * Up to this point the pmd is present and huge and
- * userland has the whole access to the hugepage
- * during the split (which happens in place). If we
- * overwrite the pmd with the not-huge version
- * pointing to the pte here (which of course we could
- * if all CPUs were bug free), userland could trigger
- * a small page size TLB miss on the small sized TLB
- * while the hugepage TLB entry is still established
- * in the huge TLB. Some CPU doesn't like that. See
- * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
- * Erratum 383 on page 93. Intel should be safe but is
- * also warns that it's only safe if the permission
- * and cache attributes of the two entries loaded in
- * the two TLB is identical (which should be the case
- * here). But it is generally safer to never allow
- * small and huge TLB entries for the same virtual
- * address to be loaded simultaneously. So instead of
- * doing "pmd_populate(); flush_tlb_range();" we first
- * mark the current pmd notpresent (atomically because
- * here the pmd_trans_huge and pmd_trans_splitting
- * must remain set at all times on the pmd until the
- * split is complete for this pmd), then we flush the
- * SMP TLB and finally we write the non-huge version
- * of the pmd entry with pmd_populate.
- */
- pmdp_invalidate(vma, address, pmd);
- pmd_populate(mm, pmd, pgtable);
- ret = 1;
- spin_unlock(ptl);
- }
-
- return ret;
-}
-
-/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
- struct anon_vma *anon_vma,
- struct list_head *list)
-{
- int mapcount, mapcount2;
- pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
- struct anon_vma_chain *avc;
-
- BUG_ON(!PageHead(page));
- BUG_ON(PageTail(page));
-
- mapcount = 0;
- anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
- struct vm_area_struct *vma = avc->vma;
- unsigned long addr = vma_address(page, vma);
- BUG_ON(is_vma_temporary_stack(vma));
- mapcount += __split_huge_page_splitting(page, vma, addr);
- }
- /*
- * It is critical that new vmas are added to the tail of the
- * anon_vma list. This guarantes that if copy_huge_pmd() runs
- * and establishes a child pmd before
- * __split_huge_page_splitting() freezes the parent pmd (so if
- * we fail to prevent copy_huge_pmd() from running until the
- * whole __split_huge_page() is complete), we will still see
- * the newly established pmd of the child later during the
- * walk, to be able to set it as pmd_trans_splitting too.
- */
- if (mapcount != page_mapcount(page)) {
- pr_err("mapcount %d page_mapcount %d\n",
- mapcount, page_mapcount(page));
- BUG();
- }
-
- __split_huge_page_refcount(page, list);
-
- mapcount2 = 0;
- anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
- struct vm_area_struct *vma = avc->vma;
- unsigned long addr = vma_address(page, vma);
- BUG_ON(is_vma_temporary_stack(vma));
- mapcount2 += __split_huge_page_map(page, vma, addr);
- }
- if (mapcount != mapcount2) {
- pr_err("mapcount %d mapcount2 %d page_mapcount %d\n",
- mapcount, mapcount2, page_mapcount(page));
- BUG();
- }
-}
-
-/*
- * Split a hugepage into normal pages. This doesn't change the position of head
- * page. If @list is null, tail pages will be added to LRU list, otherwise, to
- * @list. Both head page and tail pages will inherit mapping, flags, and so on
- * from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return 1.
- */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
-{
- struct anon_vma *anon_vma;
- int ret = 1;
-
- BUG_ON(is_huge_zero_page(page));
- BUG_ON(!PageAnon(page));
-
- /*
- * The caller does not necessarily hold an mmap_sem that would prevent
- * the anon_vma disappearing so we first we take a reference to it
- * and then lock the anon_vma for write. This is similar to
- * page_lock_anon_vma_read except the write lock is taken to serialise
- * against parallel split or collapse operations.
- */
- anon_vma = page_get_anon_vma(page);
- if (!anon_vma)
- goto out;
- anon_vma_lock_write(anon_vma);
-
- ret = 0;
- if (!PageCompound(page))
- goto out_unlock;
-
- BUG_ON(!PageSwapBacked(page));
- __split_huge_page(page, anon_vma, list);
- count_vm_event(THP_SPLIT_PAGE);
-
- BUG_ON(PageCompound(page));
-out_unlock:
- anon_vma_unlock_write(anon_vma);
- put_anon_vma(anon_vma);
-out:
- return ret;
-}
-
#define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)

int hugepage_madvise(struct vm_area_struct *vma,
@@ -2987,81 +2664,6 @@ static int khugepaged(void *none)
return 0;
}

-static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
- unsigned long haddr, pmd_t *pmd)
-{
- struct mm_struct *mm = vma->vm_mm;
- pgtable_t pgtable;
- pmd_t _pmd;
- int i;
-
- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
- /* leave pmd empty until pte is filled */
-
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
-
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
- entry = pte_mkspecial(entry);
- pte = pte_offset_map(&_pmd, haddr);
- VM_BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
- }
- smp_wmb(); /* make pte visible before pmd */
- pmd_populate(mm, pmd, pgtable);
- put_huge_zero_page();
-}
-
-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd)
-{
- spinlock_t *ptl;
- struct page *page = NULL;
- struct mm_struct *mm = vma->vm_mm;
- unsigned long haddr = address & HPAGE_PMD_MASK;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
-
- BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
-
- mmun_start = haddr;
- mmun_end = haddr + HPAGE_PMD_SIZE;
-again:
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_trans_huge(*pmd)))
- goto unlock;
- if (vma_is_dax(vma)) {
- pmdp_huge_clear_flush(vma, haddr, pmd);
- } else if (is_huge_zero_pmd(*pmd)) {
- __split_huge_zero_page_pmd(vma, haddr, pmd);
- } else {
- page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!page_count(page), page);
- get_page(page);
- }
- unlock:
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
- if (!page)
- return;
-
- split_huge_page(page);
- put_page(page);
-
- /*
- * We don't always have down_write of mmap_sem here: a racing
- * do_huge_pmd_wp_page() might have copied-on-write to another
- * huge page before our split_huge_page() got the anon_vma lock.
- */
- if (unlikely(pmd_trans_huge(*pmd)))
- goto again;
-}
-
static void split_huge_pmd_address(struct vm_area_struct *vma,
unsigned long address)
{
@@ -3086,7 +2688,7 @@ static void split_huge_pmd_address(struct vm_area_struct *vma,
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- __split_huge_page_pmd(vma, address, pmd);
+ split_huge_pmd(vma, pmd, address);
}

void vma_adjust_trans_huge(struct vm_area_struct *vma,
--
2.1.4

2015-07-20 14:31:32

by Kirill A. Shutemov

Subject: [PATCHv9 13/36] mm: drop tail page refcounting

Tail page refcounting is utterly complicated and painful to support.

It uses ->_mapcount on tail pages to store how many times the page is
pinned. get_page() bumps ->_mapcount on the tail page in addition to
->_count on the head. split_huge_page() needs this information to
distribute pins from the head of the compound page to the tails during
the split.

We will need ->_mapcount to account for PTE mappings of subpages of the
compound page. We eliminate the need for the current meaning of
->_mapcount in tail pages by forbidding the split entirely if the page
is pinned.

The only user of tail page refcounting is THP, which is marked BROKEN for
now.

Let's drop all this mess. It makes get_page() and put_page() much
simpler.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
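For readers who want the new rule in isolation: below is a minimal
userspace sketch of what get_page()/put_page() reduce to after this
patch, namely "resolve to the head, then touch only the head's count".
It is a toy model, not kernel code. struct toy_page and the toy_*
helpers are hypothetical names invented for illustration; the real
functions operate on struct page with atomic counters.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; field names are illustrative only. */
struct toy_page {
	int count;              /* pins; meaningful on the head page only */
	struct toy_page *head;  /* NULL for a head or non-compound page */
	int freed;              /* set once the last pin is dropped */
};

/* Model of compound_head(): a tail resolves to its head. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	return page->head ? page->head : page;
}

/* Model of get_page(): a pin on any subpage lands on the head. */
static void toy_get_page(struct toy_page *page)
{
	page = toy_compound_head(page);
	assert(page->count > 0);  /* caller must already hold a reference */
	page->count++;
}

/* Model of put_page(): drop the pin from the head, free on zero. */
static void toy_put_page(struct toy_page *page)
{
	page = toy_compound_head(page);
	assert(page->count > 0);
	if (--page->count == 0)
		page->freed = 1;  /* stands in for __put_page() */
}
```

Pinning a tail in this model never changes the tail's own fields, which
is the property that frees ->_mapcount on tail pages for other use.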
arch/mips/mm/gup.c | 4 -
arch/powerpc/mm/hugetlbpage.c | 13 +-
arch/s390/mm/gup.c | 13 +-
arch/sparc/mm/gup.c | 14 +--
arch/x86/mm/gup.c | 4 -
include/linux/mm.h | 47 ++------
include/linux/mm_types.h | 17 +--
mm/gup.c | 34 +-----
mm/huge_memory.c | 41 +------
mm/hugetlb.c | 2 +-
mm/internal.h | 44 -------
mm/swap.c | 273 +++---------------------------------------
12 files changed, 40 insertions(+), 466 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 349995d19c7f..36a35115dc2e 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
@@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end,
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 38bd5d998c81..f119edaa6961 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1049,7 +1049,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
{
unsigned long mask;
unsigned long pte_end;
- struct page *head, *page, *tail;
+ struct page *head, *page;
pte_t pte;
int refs;

@@ -1072,7 +1072,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
head = pte_page(pte);

page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
@@ -1094,15 +1093,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}

- /*
- * Any tail page need their mapcount reference taken before we
- * return.
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 1eb41bb3010c..f8112899f6fe 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -52,7 +52,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
unsigned long mask, result;
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;

result = write ? 0 : _SEGMENT_ENTRY_PROTECT;
@@ -64,7 +64,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
@@ -85,16 +84,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
return 0;
}

- /*
- * Any tail page need their mapcount reference taken before we
- * return.
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 2e5c4fc2daa9..9091c5daa2e1 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
put_page(head);
return 0;
}
- if (head != page)
- get_huge_page_tail(page);

pages[*nr] = page;
(*nr)++;
@@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages,
int *nr)
{
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;

if (!(pmd_val(pmd) & _PAGE_VALID))
@@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
@@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
return 0;
}

- /* Any tail page need their mapcount reference taken before we
- * return.
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 81bf3d2af3eb..62a887a3cf50 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -137,8 +137,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
@@ -214,8 +212,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
- if (PageTail(page))
- get_huge_page_tail(page);
(*nr)++;
page++;
refs++;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c3a2b37365f6..5e8e708c5c8c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -452,44 +452,9 @@ static inline int page_count(struct page *page)
return atomic_read(&compound_head(page)->_count);
}

-static inline bool __compound_tail_refcounted(struct page *page)
-{
- return PageAnon(page) && !PageSlab(page) && !PageHeadHuge(page);
-}
-
-/*
- * This takes a head page as parameter and tells if the
- * tail page reference counting can be skipped.
- *
- * For this to be safe, PageSlab and PageHeadHuge must remain true on
- * any given page where they return true here, until all tail pins
- * have been released.
- */
-static inline bool compound_tail_refcounted(struct page *page)
-{
- VM_BUG_ON_PAGE(!PageHead(page), page);
- return __compound_tail_refcounted(page);
-}
-
-static inline void get_huge_page_tail(struct page *page)
-{
- /*
- * __split_huge_page_refcount() cannot run from under us.
- */
- VM_BUG_ON_PAGE(!PageTail(page), page);
- VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
- if (compound_tail_refcounted(page->first_page))
- atomic_inc(&page->_mapcount);
-}
-
-extern bool __get_page_tail(struct page *page);
-
static inline void get_page(struct page *page)
{
- if (unlikely(PageTail(page)))
- if (likely(__get_page_tail(page)))
- return;
+ page = compound_head(page);
/*
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
@@ -520,7 +485,15 @@ static inline void init_page_count(struct page *page)
atomic_set(&page->_count, 1);
}

-void put_page(struct page *page);
+void __put_page(struct page *page);
+
+static inline void put_page(struct page *page)
+{
+ page = compound_head(page);
+ if (put_page_testzero(page))
+ __put_page(page);
+}
+
void put_pages_list(struct list_head *pages);

void split_page(struct page *page, unsigned int order);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1fb4e46a1736..b762eef188c3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -92,20 +92,9 @@ struct page {

union {
/*
- * Count of ptes mapped in
- * mms, to show when page is
- * mapped & limit reverse map
- * searches.
- *
- * Used also for tail pages
- * refcounting instead of
- * _count. Tail pages cannot
- * be mapped and keeping the
- * tail page _count zero at
- * all times guarantees
- * get_page_unless_zero() will
- * never succeed on tail
- * pages.
+ * Count of ptes mapped in mms, to show
+ * when page is mapped & limit reverse
+ * map searches.
*/
atomic_t _mapcount;

diff --git a/mm/gup.c b/mm/gup.c
index b2f9790e5a98..c38adf61200a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -128,7 +128,7 @@ retry:
}

if (flags & FOLL_GET)
- get_page_foll(page);
+ get_page(page);
if (flags & FOLL_TOUCH) {
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
@@ -1146,7 +1146,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;

if (write && !pmd_write(orig))
@@ -1155,7 +1155,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
refs = 0;
head = pmd_page(orig);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
@@ -1176,24 +1175,13 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;
}

- /*
- * Any tail pages need their mapcount reference taken before we
- * return. (This allows the THP code to bump their ref count when
- * they are split into base pages).
- */
- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
- struct page *head, *page, *tail;
+ struct page *head, *page;
int refs;

if (write && !pud_write(orig))
@@ -1202,7 +1190,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
refs = 0;
head = pud_page(orig);
page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
@@ -1223,12 +1210,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return 0;
}

- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

@@ -1237,7 +1218,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
struct page **pages, int *nr)
{
int refs;
- struct page *head, *page, *tail;
+ struct page *head, *page;

if (write && !pgd_write(orig))
return 0;
@@ -1245,7 +1226,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
refs = 0;
head = pgd_page(orig);
page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
- tail = page;
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
@@ -1266,12 +1246,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
return 0;
}

- while (refs--) {
- if (PageTail(tail))
- get_huge_page_tail(tail);
- tail++;
- }
-
return 1;
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 25ad3af53f15..442ba9a28c75 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1009,37 +1009,6 @@ unlock:
spin_unlock(ptl);
}

-/*
- * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages
- * during copy_user_huge_page()'s copy_page_rep(): in the case when
- * the source page gets split and a tail freed before copy completes.
- * Called under pmd_lock of checked pmd, so safe from splitting itself.
- */
-static void get_user_huge_page(struct page *page)
-{
- if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
- struct page *endpage = page + HPAGE_PMD_NR;
-
- atomic_add(HPAGE_PMD_NR, &page->_count);
- while (++page < endpage)
- get_huge_page_tail(page);
- } else {
- get_page(page);
- }
-}
-
-static void put_user_huge_page(struct page *page)
-{
- if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
- struct page *endpage = page + HPAGE_PMD_NR;
-
- while (page < endpage)
- put_page(page++);
- } else {
- put_page(page);
- }
-}
-
static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
@@ -1192,7 +1161,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
ret |= VM_FAULT_WRITE;
goto out_unlock;
}
- get_user_huge_page(page);
+ get_page(page);
spin_unlock(ptl);
alloc:
if (transparent_hugepage_enabled(vma) &&
@@ -1213,7 +1182,7 @@ alloc:
split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
}
- put_user_huge_page(page);
+ put_page(page);
}
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -1224,7 +1193,7 @@ alloc:
put_page(new_page);
if (page) {
split_huge_pmd(vma, pmd, address);
- put_user_huge_page(page);
+ put_page(page);
} else
split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
@@ -1246,7 +1215,7 @@ alloc:

spin_lock(ptl);
if (page)
- put_user_huge_page(page);
+ put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
spin_unlock(ptl);
mem_cgroup_cancel_charge(new_page, memcg, true);
@@ -1331,7 +1300,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page), page);
if (flags & FOLL_GET)
- get_page_foll(page);
+ get_page(page);

out:
return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b98f5e944849..399ea11a8813 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3533,7 +3533,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
same_page:
if (pages) {
pages[i] = mem_map_offset(page, pfn_offset);
- get_page_foll(pages[i]);
+ get_page(pages[i]);
}

if (vmas)
diff --git a/mm/internal.h b/mm/internal.h
index 1195dd2d6a2b..c3384ad89f62 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -47,50 +47,6 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
}

-static inline void __get_page_tail_foll(struct page *page,
- bool get_page_head)
-{
- /*
- * If we're getting a tail page, the elevated page->_count is
- * required only in the head page and we will elevate the head
- * page->_count and tail page->_mapcount.
- *
- * We elevate page_tail->_mapcount for tail pages to force
- * page_tail->_count to be zero at all times to avoid getting
- * false positives from get_page_unless_zero() with
- * speculative page access (like in
- * page_cache_get_speculative()) on tail pages.
- */
- VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
- if (get_page_head)
- atomic_inc(&page->first_page->_count);
- get_huge_page_tail(page);
-}
-
-/*
- * This is meant to be called as the FOLL_GET operation of
- * follow_page() and it must be called while holding the proper PT
- * lock while the pte (or pmd_trans_huge) is still mapping the page.
- */
-static inline void get_page_foll(struct page *page)
-{
- if (unlikely(PageTail(page)))
- /*
- * This is safe only because
- * __split_huge_page_refcount() can't run under
- * get_page_foll() because we hold the proper PT lock.
- */
- __get_page_tail_foll(page, true);
- else {
- /*
- * Getting a normal page or the head of a compound page
- * requires to already have an elevated page->_count.
- */
- VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
- atomic_inc(&page->_count);
- }
-}
-
extern unsigned long highest_memmap_pfn;

/*
diff --git a/mm/swap.c b/mm/swap.c
index d398860badd1..560eca9727ef 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -89,260 +89,14 @@ static void __put_compound_page(struct page *page)
(*dtor)(page);
}

-/**
- * Two special cases here: we could avoid taking compound_lock_irqsave
- * and could skip the tail refcounting(in _mapcount).
- *
- * 1. Hugetlbfs page:
- *
- * PageHeadHuge will remain true until the compound page
- * is released and enters the buddy allocator, and it could
- * not be split by __split_huge_page_refcount().
- *
- * So if we see PageHeadHuge set, and we have the tail page pin,
- * then we could safely put head page.
- *
- * 2. Slab THP page:
- *
- * PG_slab is cleared before the slab frees the head page, and
- * tail pin cannot be the last reference left on the head page,
- * because the slab code is free to reuse the compound page
- * after a kfree/kmem_cache_free without having to check if
- * there's any tail pin left. In turn all tail pinsmust be always
- * released while the head is still pinned by the slab code
- * and so we know PG_slab will be still set too.
- *
- * So if we see PageSlab set, and we have the tail page pin,
- * then we could safely put head page.
- */
-static __always_inline
-void put_unrefcounted_compound_page(struct page *page_head, struct page *page)
-{
- /*
- * If @page is a THP tail, we must read the tail page
- * flags after the head page flags. The
- * __split_huge_page_refcount side enforces write memory barriers
- * between clearing PageTail and before the head page
- * can be freed and reallocated.
- */
- smp_rmb();
- if (likely(PageTail(page))) {
- /*
- * __split_huge_page_refcount cannot race
- * here, see the comment above this function.
- */
- VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
- if (put_page_testzero(page_head)) {
- /*
- * If this is the tail of a slab THP page,
- * the tail pin must not be the last reference
- * held on the page, because the PG_slab cannot
- * be cleared before all tail pins (which skips
- * the _mapcount tail refcounting) have been
- * released.
- *
- * If this is the tail of a hugetlbfs page,
- * the tail pin may be the last reference on
- * the page instead, because PageHeadHuge will
- * not go away until the compound page enters
- * the buddy allocator.
- */
- VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
- __put_compound_page(page_head);
- }
- } else
- /*
- * __split_huge_page_refcount run before us,
- * @page was a THP tail. The split @page_head
- * has been freed and reallocated as slab or
- * hugetlbfs page of smaller order (only
- * possible if reallocated as slab on x86).
- */
- if (put_page_testzero(page))
- __put_single_page(page);
-}
-
-static __always_inline
-void put_refcounted_compound_page(struct page *page_head, struct page *page)
-{
- if (likely(page != page_head && get_page_unless_zero(page_head))) {
- unsigned long flags;
-
- /*
- * @page_head wasn't a dangling pointer but it may not
- * be a head page anymore by the time we obtain the
- * lock. That is ok as long as it can't be freed from
- * under us.
- */
- flags = compound_lock_irqsave(page_head);
- if (unlikely(!PageTail(page))) {
- /* __split_huge_page_refcount run before us */
- compound_unlock_irqrestore(page_head, flags);
- if (put_page_testzero(page_head)) {
- /*
- * The @page_head may have been freed
- * and reallocated as a compound page
- * of smaller order and then freed
- * again. All we know is that it
- * cannot have become: a THP page, a
- * compound page of higher order, a
- * tail page. That is because we
- * still hold the refcount of the
- * split THP tail and page_head was
- * the THP head before the split.
- */
- if (PageHead(page_head))
- __put_compound_page(page_head);
- else
- __put_single_page(page_head);
- }
-out_put_single:
- if (put_page_testzero(page))
- __put_single_page(page);
- return;
- }
- VM_BUG_ON_PAGE(page_head != page->first_page, page);
- /*
- * We can release the refcount taken by
- * get_page_unless_zero() now that
- * __split_huge_page_refcount() is blocked on the
- * compound_lock.
- */
- if (put_page_testzero(page_head))
- VM_BUG_ON_PAGE(1, page_head);
- /* __split_huge_page_refcount will wait now */
- VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
- atomic_dec(&page->_mapcount);
- VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
- VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
- compound_unlock_irqrestore(page_head, flags);
-
- if (put_page_testzero(page_head)) {
- if (PageHead(page_head))
- __put_compound_page(page_head);
- else
- __put_single_page(page_head);
- }
- } else {
- /* @page_head is a dangling pointer */
- VM_BUG_ON_PAGE(PageTail(page), page);
- goto out_put_single;
- }
-}
-
-static void put_compound_page(struct page *page)
-{
- struct page *page_head;
-
- /*
- * We see the PageCompound set and PageTail not set, so @page maybe:
- * 1. hugetlbfs head page, or
- * 2. THP head page.
- */
- if (likely(!PageTail(page))) {
- if (put_page_testzero(page)) {
- /*
- * By the time all refcounts have been released
- * split_huge_page cannot run anymore from under us.
- */
- if (PageHead(page))
- __put_compound_page(page);
- else
- __put_single_page(page);
- }
- return;
- }
-
- /*
- * We see the PageCompound set and PageTail set, so @page maybe:
- * 1. a tail hugetlbfs page, or
- * 2. a tail THP page, or
- * 3. a split THP page.
- *
- * Case 3 is possible, as we may race with
- * __split_huge_page_refcount tearing down a THP page.
- */
- page_head = compound_head_by_tail(page);
- if (!__compound_tail_refcounted(page_head))
- put_unrefcounted_compound_page(page_head, page);
- else
- put_refcounted_compound_page(page_head, page);
-}
-
-void put_page(struct page *page)
+void __put_page(struct page *page)
{
if (unlikely(PageCompound(page)))
- put_compound_page(page);
- else if (put_page_testzero(page))
+ __put_compound_page(page);
+ else
__put_single_page(page);
}
-EXPORT_SYMBOL(put_page);
-
-/*
- * This function is exported but must not be called by anything other
- * than get_page(). It implements the slow path of get_page().
- */
-bool __get_page_tail(struct page *page)
-{
- /*
- * This takes care of get_page() if run on a tail page
- * returned by one of the get_user_pages/follow_page variants.
- * get_user_pages/follow_page itself doesn't need the compound
- * lock because it runs __get_page_tail_foll() under the
- * proper PT lock that already serializes against
- * split_huge_page().
- */
- unsigned long flags;
- bool got;
- struct page *page_head = compound_head(page);
-
- /* Ref to put_compound_page() comment. */
- if (!__compound_tail_refcounted(page_head)) {
- smp_rmb();
- if (likely(PageTail(page))) {
- /*
- * This is a hugetlbfs page or a slab
- * page. __split_huge_page_refcount
- * cannot race here.
- */
- VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
- __get_page_tail_foll(page, true);
- return true;
- } else {
- /*
- * __split_huge_page_refcount run
- * before us, "page" was a THP
- * tail. The split page_head has been
- * freed and reallocated as slab or
- * hugetlbfs page of smaller order
- * (only possible if reallocated as
- * slab on x86).
- */
- return false;
- }
- }
-
- got = false;
- if (likely(page != page_head && get_page_unless_zero(page_head))) {
- /*
- * page_head wasn't a dangling pointer but it
- * may not be a head page anymore by the time
- * we obtain the lock. That is ok as long as it
- * can't be freed from under us.
- */
- flags = compound_lock_irqsave(page_head);
- /* here __split_huge_page_refcount won't run anymore */
- if (likely(PageTail(page))) {
- __get_page_tail_foll(page, false);
- got = true;
- }
- compound_unlock_irqrestore(page_head, flags);
- if (unlikely(!got))
- put_page(page_head);
- }
- return got;
-}
-EXPORT_SYMBOL(__get_page_tail);
+EXPORT_SYMBOL(__put_page);

/**
* put_pages_list() - release a list of pages
@@ -960,15 +714,6 @@ void release_pages(struct page **pages, int nr, bool cold)
for (i = 0; i < nr; i++) {
struct page *page = pages[i];

- if (unlikely(PageCompound(page))) {
- if (zone) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- zone = NULL;
- }
- put_compound_page(page);
- continue;
- }
-
/*
* Make sure the IRQ-safe lock-holding time does not get
* excessive with a continuous string of pages from the
@@ -979,9 +724,19 @@ void release_pages(struct page **pages, int nr, bool cold)
zone = NULL;
}

+ page = compound_head(page);
if (!put_page_testzero(page))
continue;

+ if (PageCompound(page)) {
+ if (zone) {
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+ zone = NULL;
+ }
+ __put_compound_page(page);
+ continue;
+ }
+
if (PageLRU(page)) {
struct zone *pagezone = page_zone(page);

--
2.1.4

2015-07-20 14:21:37

by Kirill A. Shutemov

Subject: [PATCHv9 14/36] futex, thp: remove special case for THP in get_futex_key

With the new THP refcounting, we don't need tricks to stabilize a huge
page. If we hold a reference to a tail page, the page can't be split
under us.

This patch effectively reverts a5b338f2b0b1.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
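To illustrate why a plain reference is enough here: the sketch below
models a split that is refused while anyone other than the caller holds
a pin. It is a simplified userspace model under stated assumptions, not
the code from this series. struct toy_thp and toy_split_huge_page() are
hypothetical names, and the accounting rule (every mapping holds one
reference, the caller holds one more) is an assumption made for the
example; the real check may differ in detail.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of a huge page; field names are illustrative only. */
struct toy_thp {
	int count;     /* references held on the head page */
	int mapcount;  /* mappings, each assumed to hold one reference */
	int is_huge;   /* 1 until the page has been split */
};

/*
 * Model of a split attempt. The caller is assumed to hold exactly one
 * pin of its own. Any reference beyond the mappings and that one pin
 * means somebody else has pinned the page, so the split is refused.
 */
static int toy_split_huge_page(struct toy_thp *thp)
{
	if (!thp->is_huge)
		return 0;  /* already split, nothing to do */
	if (thp->count != thp->mapcount + 1)
		return -EBUSY;  /* extra pin: leave the page intact */
	thp->is_huge = 0;
	return 0;
}
```

In this model a futex-style user that has pinned a tail page simply
shows up as an extra reference, and the splitter backs off with -EBUSY
instead of the pin holder having to stabilize the page itself.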
kernel/futex.c | 61 ++++++++++++----------------------------------------------
1 file changed, 12 insertions(+), 49 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2579e407ff67..2c7cec27058b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -399,7 +399,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
- struct page *page, *page_head;
+ struct page *page;
int err, ro = 0;

/*
@@ -442,46 +442,9 @@ again:
else
err = 0;

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- page_head = page;
- if (unlikely(PageTail(page))) {
- put_page(page);
- /* serialize against __split_huge_page_splitting() */
- local_irq_disable();
- if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
- page_head = compound_head(page);
- /*
- * page_head is valid pointer but we must pin
- * it before taking the PG_lock and/or
- * PG_compound_lock. The moment we re-enable
- * irqs __split_huge_page_splitting() can
- * return and the head page can be freed from
- * under us. We can't take the PG_lock and/or
- * PG_compound_lock on a page that could be
- * freed from under us.
- */
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
- local_irq_enable();
- } else {
- local_irq_enable();
- goto again;
- }
- }
-#else
- page_head = compound_head(page);
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
-#endif
-
- lock_page(page_head);
-
+ lock_page(page);
/*
- * If page_head->mapping is NULL, then it cannot be a PageAnon
+ * If page->mapping is NULL, then it cannot be a PageAnon
* page; but it might be the ZERO_PAGE or in the gate area or
* in a special mapping (all cases which we are happy to fail);
* or it may have been a good file page when get_user_pages_fast
@@ -493,12 +456,12 @@ again:
*
* The case we do have to guard against is when memory pressure made
* shmem_writepage move it from filecache to swapcache beneath us:
- * an unlikely race, but we do need to retry for page_head->mapping.
+ * an unlikely race, but we do need to retry for page->mapping.
*/
- if (!page_head->mapping) {
- int shmem_swizzled = PageSwapCache(page_head);
- unlock_page(page_head);
- put_page(page_head);
+ if (!page->mapping) {
+ int shmem_swizzled = PageSwapCache(page);
+ unlock_page(page);
+ put_page(page);
if (shmem_swizzled)
goto again;
return -EFAULT;
@@ -511,7 +474,7 @@ again:
* it's a read-only handle, it's expected that futexes attach to
* the object not the particular process.
*/
- if (PageAnon(page_head)) {
+ if (PageAnon(page)) {
/*
* A RO anonymous page will never change and thus doesn't make
* sense for futex operations.
@@ -526,15 +489,15 @@ again:
key->private.address = address;
} else {
key->both.offset |= FUT_OFF_INODE; /* inode-based key */
- key->shared.inode = page_head->mapping->host;
+ key->shared.inode = page->mapping->host;
key->shared.pgoff = basepage_index(page);
}

get_futex_key_refs(key); /* implies MB (B) */

out:
- unlock_page(page_head);
- put_page(page_head);
+ unlock_page(page);
+ put_page(page);
return err;
}

--
2.1.4

2015-07-20 14:29:57

by Kirill A. Shutemov

Subject: [PATCHv9 15/36] ksm: prepare to new THP semantics

We don't need special code to stabilize a THP. If you hold a reference to
any subpage of a THP, it will not be split under you.

The new split_huge_page() also accepts tail pages: there is no need for
special code to get a reference to the head page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
mm/ksm.c | 57 ++++++++++-----------------------------------------------
1 file changed, 10 insertions(+), 47 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index fe09f3ddc912..fb333d8188fc 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -441,20 +441,6 @@ static void break_cow(struct rmap_item *rmap_item)
up_read(&mm->mmap_sem);
}

-static struct page *page_trans_compound_anon(struct page *page)
-{
- if (PageTransCompound(page)) {
- struct page *head = compound_head(page);
- /*
- * head may actually be splitted and freed from under
- * us but it's ok here.
- */
- if (PageAnon(head))
- return head;
- }
- return NULL;
-}
-
static struct page *get_mergeable_page(struct rmap_item *rmap_item)
{
struct mm_struct *mm = rmap_item->mm;
@@ -470,7 +456,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
page = follow_page(vma, addr, FOLL_GET);
if (IS_ERR_OR_NULL(page))
goto out;
- if (PageAnon(page) || page_trans_compound_anon(page)) {
+ if (PageAnon(page)) {
flush_anon_page(vma, page, addr);
flush_dcache_page(page);
} else {
@@ -976,33 +962,6 @@ out:
return err;
}

-static int page_trans_compound_anon_split(struct page *page)
-{
- int ret = 0;
- struct page *transhuge_head = page_trans_compound_anon(page);
- if (transhuge_head) {
- /* Get the reference on the head to split it. */
- if (get_page_unless_zero(transhuge_head)) {
- /*
- * Recheck we got the reference while the head
- * was still anonymous.
- */
- if (PageAnon(transhuge_head))
- ret = split_huge_page(transhuge_head);
- else
- /*
- * Retry later if split_huge_page run
- * from under us.
- */
- ret = 1;
- put_page(transhuge_head);
- } else
- /* Retry later if split_huge_page run from under us. */
- ret = 1;
- }
- return ret;
-}
-
/*
* try_to_merge_one_page - take two pages and merge them into one
* @vma: the vma that holds the pte pointing to page
@@ -1023,9 +982,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,

if (!(vma->vm_flags & VM_MERGEABLE))
goto out;
- if (PageTransCompound(page) && page_trans_compound_anon_split(page))
- goto out;
- BUG_ON(PageTransCompound(page));
if (!PageAnon(page))
goto out;

@@ -1038,6 +994,13 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
*/
if (!trylock_page(page))
goto out;
+
+ if (PageTransCompound(page)) {
+ err = split_huge_page(page);
+ if (err)
+ goto out_unlock;
+ }
+
/*
* If this anonymous page is mapped only here, its pte may need
* to be write-protected. If it's mapped elsewhere, all of its
@@ -1068,6 +1031,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
}
}

+out_unlock:
unlock_page(page);
out:
return err;
@@ -1620,8 +1584,7 @@ next_mm:
cond_resched();
continue;
}
- if (PageAnon(*page) ||
- page_trans_compound_anon(*page)) {
+ if (PageAnon(*page)) {
flush_anon_page(vma, *page, ksm_scan.address);
flush_dcache_page(*page);
rmap_item = get_next_rmap_item(slot,
--
2.1.4
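
The KSM change above relies on two properties named in its description: a
reference on any subpage pins the whole THP, and split_huge_page() can be
called by the holder of such a reference. The following is a minimal
userspace sketch of that control flow, not kernel code. All toy_* names are
hypothetical, and the rule that a split fails with -EBUSY when anyone other
than the caller holds an extra pin is an assumption about the new
split_huge_page() that is not shown in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a compound page; NOT the kernel's struct page. */
struct toy_thp {
	int refcount;   /* all references, including one per mapping */
	int mapcount;   /* number of page-table mappings */
	bool locked;    /* models PG_locked */
	bool compound;  /* true until the page has been split */
};

/* Models get_page()/put_page() on any subpage: the pin covers the whole THP. */
static void toy_get_page(struct toy_thp *p) { p->refcount++; }
static void toy_put_page(struct toy_thp *p) { p->refcount--; }

static bool toy_trylock_page(struct toy_thp *p)
{
	if (p->locked)
		return false;
	p->locked = true;
	return true;
}

static void toy_unlock_page(struct toy_thp *p) { p->locked = false; }

/*
 * Assumed rule: the caller holds the page lock and exactly one pin.
 * Any pin beyond that (refcount != mapcount + 1) makes the split fail.
 */
static int toy_split_huge_page(struct toy_thp *p)
{
	if (p->refcount != p->mapcount + 1)
		return -EBUSY;
	p->compound = false;
	return 0;
}

/* Mirrors the new error path in try_to_merge_one_page(); caller holds one pin. */
static int toy_try_to_merge(struct toy_thp *p)
{
	int err = -EFAULT;

	if (!toy_trylock_page(p))
		goto out;
	if (p->compound) {
		err = toy_split_huge_page(p);
		if (err)
			goto out_unlock;
	}
	err = 0; /* the real function would continue with the merge here */
out_unlock:
	toy_unlock_page(p);
out:
	return err;
}
```

In this model a caller that holds the only extra pin can split and proceed,
while a second pin held elsewhere makes the attempt fail and leaves the page
compound and unlocked, which is the case the new out_unlock label handles.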

2015-07-20 14:26:49

by Kirill A. Shutemov

Subject: [PATCHv9 16/36] mm, thp: remove compound_lock

We are going to use migration entries to stabilize page counts. It means
we don't need compound_lock() for that.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/mm.h | 35 -----------------------------------
include/linux/page-flags.h | 12 +-----------
mm/debug.c | 3 ---
mm/memcontrol.c | 11 +++--------
4 files changed, 4 insertions(+), 57 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e8e708c5c8c..b6fb5293259a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -396,41 +396,6 @@ static inline int is_vmalloc_or_module_addr(const void *x)

extern void kvfree(const void *addr);

-static inline void compound_lock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- VM_BUG_ON_PAGE(PageSlab(page), page);
- bit_spin_lock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline void compound_unlock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- VM_BUG_ON_PAGE(PageSlab(page), page);
- bit_spin_unlock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline unsigned long compound_lock_irqsave(struct page *page)
-{
- unsigned long uninitialized_var(flags);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- local_irq_save(flags);
- compound_lock(page);
-#endif
- return flags;
-}
-
-static inline void compound_unlock_irqrestore(struct page *page,
- unsigned long flags)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- compound_unlock(page);
- local_irq_restore(flags);
-#endif
-}
-
/*
* The atomic page->_mapcount, starts from -1: so that transitions
* both from it and to it can be tracked, using atomic_inc_and_test
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..f10f9c0030dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -106,9 +106,6 @@ enum pageflags {
#ifdef CONFIG_MEMORY_FAILURE
PG_hwpoison, /* hardware poisoned page. Don't touch */
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- PG_compound_lock,
-#endif
__NR_PAGEFLAGS,

/* Filesystems */
@@ -683,12 +680,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
#define __PG_MLOCKED 0
#endif

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define __PG_COMPOUND_LOCK (1 << PG_compound_lock)
-#else
-#define __PG_COMPOUND_LOCK 0
-#endif
-
/*
* Flags checked when a page is freed. Pages being freed should not have
* these flags set. It they are, there is a problem.
@@ -698,8 +689,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
1 << PG_private | 1 << PG_private_2 | \
1 << PG_writeback | 1 << PG_reserved | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
- 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
- __PG_COMPOUND_LOCK)
+ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)

/*
* Flags checked when a page is prepped for return by the page allocator.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..9dfcd77e7354 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -45,9 +45,6 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_MEMORY_FAILURE
{1UL << PG_hwpoison, "hwpoison" },
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- {1UL << PG_compound_lock, "compound_lock" },
-#endif
};

static void dump_flags(unsigned long flags,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09c4454891ac..21378c828f34 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2469,9 +2469,7 @@ struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr)

/*
* Because tail pages are not marked as "used", set it. We're under
- * zone->lru_lock, 'splitting on pmd' and compound_lock.
- * charge/uncharge will be never happen and move_account() is done under
- * compound_lock(), so we don't have to take care of races.
+ * zone->lru_lock and migration entries setup in all page mappings.
*/
void mem_cgroup_split_huge_fixup(struct page *head)
{
@@ -4551,9 +4549,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
* @from: mem_cgroup which the page is moved from.
* @to: mem_cgroup which the page is moved to. @from != @to.
*
- * The caller must confirm following.
- * - page is not on LRU (isolate_page() is useful.)
- * - compound_lock is held when nr_pages > 1
+ * The caller must make sure the page is not on LRU (isolate_page() is useful.)
*
* This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
* from old cgroup.
@@ -4896,8 +4892,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
struct page *page;

/*
- * We don't take compound_lock() here but no race with splitting thp
- * happens because:
+ * No race with splitting thp happens because:
* - if pmd_trans_huge_lock() returns 1, the relevant thp is not
* under splitting, which means there's no concurrent thp split,
* - if another thread runs into split_huge_page() just after we
--
2.1.4
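
For readers unfamiliar with what is being removed: the deleted
compound_lock()/compound_unlock() helpers wrapped bit_spin_lock() on the
PG_compound_lock bit of page->flags. The sketch below shows the general idea
of a bit spinlock in userspace C11. It is not the kernel implementation; the
toy_* names and the bit number are hypothetical, and the kernel versions
additionally deal with preemption, which this sketch ignores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_PG_LOCK_BIT 3UL /* hypothetical bit number, not PG_compound_lock */

/* Toy page with only a flags word; NOT the kernel's struct page. */
struct toy_page {
	atomic_ulong flags;
};

/* Try once to set the lock bit; succeeds only if it was previously clear. */
static bool toy_bit_spin_trylock(struct toy_page *p)
{
	unsigned long mask = 1UL << TOY_PG_LOCK_BIT;
	unsigned long old = atomic_fetch_or_explicit(&p->flags, mask,
						     memory_order_acquire);
	return !(old & mask);
}

/* Spin until the lock bit has been acquired. */
static void toy_bit_spin_lock(struct toy_page *p)
{
	while (!toy_bit_spin_trylock(p))
		; /* busy-wait */
}

/* Clear the lock bit, leaving every other flag untouched. */
static void toy_bit_spin_unlock(struct toy_page *p)
{
	atomic_fetch_and_explicit(&p->flags, ~(1UL << TOY_PG_LOCK_BIT),
				  memory_order_release);
}

static bool toy_bit_is_set(struct toy_page *p)
{
	unsigned long v = atomic_load_explicit(&p->flags, memory_order_relaxed);
	return (v & (1UL << TOY_PG_LOCK_BIT)) != 0;
}
```

The lock lives in a single bit of the flags word, which is why dropping
compound_lock also frees one page flag, as the page-flags.h hunk shows.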

2015-07-20 14:29:53

by Kirill A. Shutemov

Subject: [PATCHv9 17/36] arm64, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we don't need to mark PMDs as splitting. Let's
drop the code that handles this.

pmdp_splitting_flush() is not needed either: when splitting a PMD we will
do pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do the IPI
as needed for fast_gup.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 8 --------
arch/arm64/mm/flush.c | 16 ----------------
2 files changed, 24 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bd5db28324ba..26c7dea80062 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -274,20 +274,12 @@ static inline pgprot_t mk_sect_prot(pgprot_t prot)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_trans_huge(pmd) (pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
-#define pmd_trans_splitting(pmd) pte_special(pmd_pte(pmd))
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-struct vm_area_struct;
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmdp);
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
-#define pmd_mksplitting(pmd) pte_pmd(pte_mkspecial(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index b6f14e8d2121..0d64089d28b5 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -104,19 +104,3 @@ EXPORT_SYMBOL(flush_dcache_page);
*/
EXPORT_SYMBOL(flush_cache_all);
EXPORT_SYMBOL(flush_icache_range);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmdp)
-{
- pmd_t pmd = pmd_mksplitting(*pmdp);
-
- VM_BUG_ON(address & ~PMD_MASK);
- set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-
- /* dummy IPI to serialise against fast_gup */
- kick_all_cpus_sync();
-}
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
2.1.4
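
The description above says that a PMD split becomes pmdp_clear_flush()
followed by set_pte_at(), with no intermediate "splitting" state. The
following toy model sketches why a lockless walker then needs no splitting
check: it sees a huge entry, an empty entry, or a PTE table. This is a
userspace illustration under simplifying assumptions; the toy_* names and
the struct layout are hypothetical, there is no TLB or IPI here, and the
real split code lives in a later patch of the series that is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTRS_PER_PTE 512

/* Toy PMD entry; NOT the kernel's pmd_t. */
struct toy_pmd {
	bool present;
	bool huge;
	uint64_t pfn;   /* first page frame when huge */
	uint64_t *ptes; /* PTE table when present and not huge */
};

/* Models pmdp_clear_flush(): clear the entry and return its old value. */
static struct toy_pmd toy_pmdp_clear_flush(struct toy_pmd *pmdp)
{
	struct toy_pmd old = *pmdp;

	pmdp->present = false;
	pmdp->huge = false;
	pmdp->pfn = 0;
	pmdp->ptes = NULL;
	return old;
}

/* Split a huge PMD into a caller-provided table of small-page entries. */
static void toy_split_pmd(struct toy_pmd *pmdp, uint64_t *table)
{
	struct toy_pmd old = toy_pmdp_clear_flush(pmdp);
	size_t i;

	for (i = 0; i < TOY_PTRS_PER_PTE; i++)
		table[i] = old.pfn + i; /* models set_pte_at() */

	pmdp->ptes = table;
	pmdp->present = true;
}

/* What a gup-fast-like walker does: bail out only on an empty PMD. */
static bool toy_lookup(const struct toy_pmd *pmdp, unsigned int idx,
		       uint64_t *pfn)
{
	if (!pmdp->present)
		return false; /* pmd_none(): caller falls back to the slow path */
	*pfn = pmdp->huge ? pmdp->pfn + idx : pmdp->ptes[idx];
	return true;
}
```

A lookup before and after toy_split_pmd() resolves the same index to the
same frame, and a lookup on a cleared entry simply fails, which corresponds
to the remaining pmd_none() check in the gup code.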

2015-07-20 14:29:55

by Kirill A. Shutemov

Subject: [PATCHv9 18/36] arm, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we don't need to mark PMDs as splitting. Let's
drop the code that handles this.

pmdp_splitting_flush() is not needed either: when splitting a PMD we will
do pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do the IPI
as needed for fast_gup.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/arm/include/asm/pgtable-3level.h | 10 ----------
arch/arm/lib/uaccess_with_memcpy.c | 5 ++---
arch/arm/mm/flush.c | 15 ---------------
3 files changed, 2 insertions(+), 28 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 6d6012a320b2..d42f81f13618 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -88,7 +88,6 @@

#define L_PMD_SECT_VALID (_AT(pmdval_t, 1) << 0)
#define L_PMD_SECT_DIRTY (_AT(pmdval_t, 1) << 55)
-#define L_PMD_SECT_SPLITTING (_AT(pmdval_t, 1) << 56)
#define L_PMD_SECT_NONE (_AT(pmdval_t, 1) << 57)
#define L_PMD_SECT_RDONLY (_AT(pteval_t, 1) << 58)

@@ -232,21 +231,12 @@ static inline pte_t pte_mkspecial(pte_t pte)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_trans_huge(pmd) (pmd_val(pmd) && !pmd_table(pmd))
-#define pmd_trans_splitting(pmd) (pmd_isset((pmd), L_PMD_SECT_SPLITTING))
-
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmdp);
-#endif
-#endif

#define PMD_BIT_FUNC(fn,op) \
static inline pmd_t pmd_##fn(pmd_t pmd) { pmd_val(pmd) op; return pmd; }

PMD_BIT_FUNC(wrprotect, |= L_PMD_SECT_RDONLY);
PMD_BIT_FUNC(mkold, &= ~PMD_SECT_AF);
-PMD_BIT_FUNC(mksplitting, |= L_PMD_SECT_SPLITTING);
PMD_BIT_FUNC(mkwrite, &= ~L_PMD_SECT_RDONLY);
PMD_BIT_FUNC(mkdirty, |= L_PMD_SECT_DIRTY);
PMD_BIT_FUNC(mkclean, &= ~L_PMD_SECT_DIRTY);
diff --git a/arch/arm/lib/uaccess_with_memcpy.c b/arch/arm/lib/uaccess_with_memcpy.c
index 3e58d710013c..11af98f46f05 100644
--- a/arch/arm/lib/uaccess_with_memcpy.c
+++ b/arch/arm/lib/uaccess_with_memcpy.c
@@ -52,14 +52,13 @@ pin_page_for_write(const void __user *_addr, pte_t **ptep, spinlock_t **ptlp)
*
* Lock the page table for the destination and check
* to see that it's still huge and whether or not we will
- * need to fault on write, or if we have a splitting THP.
+ * need to fault on write.
*/
if (unlikely(pmd_thp_or_huge(*pmd))) {
ptl = &current->mm->page_table_lock;
spin_lock(ptl);
if (unlikely(!pmd_thp_or_huge(*pmd)
- || pmd_hugewillfault(*pmd)
- || pmd_trans_splitting(*pmd))) {
+ || pmd_hugewillfault(*pmd))) {
spin_unlock(ptl);
return 0;
}
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 34b66af516ea..77f229302032 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -400,18 +400,3 @@ void __flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned l
*/
__cpuc_flush_dcache_area(page_address(page), PAGE_SIZE);
}
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmdp)
-{
- pmd_t pmd = pmd_mksplitting(*pmdp);
- VM_BUG_ON(address & ~PMD_MASK);
- set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-
- /* dummy IPI to serialise against fast_gup */
- kick_all_cpus_sync();
-}
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
2.1.4

2015-07-20 14:22:03

by Kirill A. Shutemov

Subject: [PATCHv9 19/36] mips, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we don't need to mark PMDs as splitting. Let's
drop the code that handles this.

pmdp_splitting_flush() is not needed either: when splitting a PMD we will
do pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do the IPI
as needed for fast_gup.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/mips/include/asm/pgtable-bits.h | 6 ++----
arch/mips/include/asm/pgtable.h | 18 ------------------
arch/mips/mm/gup.c | 13 +------------
arch/mips/mm/pgtable-64.c | 14 --------------
arch/mips/mm/tlbex.c | 1 -
5 files changed, 3 insertions(+), 49 deletions(-)

diff --git a/arch/mips/include/asm/pgtable-bits.h b/arch/mips/include/asm/pgtable-bits.h
index c28a8499aec7..9bf8b2b87364 100644
--- a/arch/mips/include/asm/pgtable-bits.h
+++ b/arch/mips/include/asm/pgtable-bits.h
@@ -131,14 +131,12 @@
/* Huge TLB page */
#define _PAGE_HUGE_SHIFT (_PAGE_MODIFIED_SHIFT + 1)
#define _PAGE_HUGE (1 << _PAGE_HUGE_SHIFT)
-#define _PAGE_SPLITTING_SHIFT (_PAGE_HUGE_SHIFT + 1)
-#define _PAGE_SPLITTING (1 << _PAGE_SPLITTING_SHIFT)

/* Only R2 or newer cores have the XI bit */
#if defined(CONFIG_CPU_MIPSR2) || defined(CONFIG_CPU_MIPSR6)
-#define _PAGE_NO_EXEC_SHIFT (_PAGE_SPLITTING_SHIFT + 1)
+#define _PAGE_NO_EXEC_SHIFT (_PAGE_HUGE_SHIFT + 1)
#else
-#define _PAGE_GLOBAL_SHIFT (_PAGE_SPLITTING_SHIFT + 1)
+#define _PAGE_GLOBAL_SHIFT (_PAGE_HUGE_SHIFT + 1)
#define _PAGE_GLOBAL (1 << _PAGE_GLOBAL_SHIFT)
#endif /* CONFIG_CPU_MIPSR2 || CONFIG_CPU_MIPSR6 */

diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 9d8106758142..556c02a0706a 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -449,27 +449,9 @@ static inline pmd_t pmd_mkhuge(pmd_t pmd)
return pmd;
}

-static inline int pmd_trans_splitting(pmd_t pmd)
-{
- return !!(pmd_val(pmd) & _PAGE_SPLITTING);
-}
-
-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
- pmd_val(pmd) |= _PAGE_SPLITTING;
-
- return pmd;
-}
-
extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd);

-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-/* Extern to avoid header file madness */
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmdp);
-
#define __HAVE_ARCH_PMD_WRITE
static inline int pmd_write(pmd_t pmd)
{
diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 36a35115dc2e..1afd87c999b0 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -107,18 +107,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = *pmdp;

next = pmd_addr_end(addr, end);
- /*
- * The pmd_trans_splitting() check below explains why
- * pmdp_splitting_flush has to flush the tlb, to stop
- * this gup-fast code from running while we set the
- * splitting bit in the pmd. Returning zero will take
- * the slow path that will call wait_split_huge_page()
- * if the pmd is still in splitting state. gup-fast
- * can't because it has irq disabled and
- * wait_split_huge_page() would never return as the
- * tlb flush IPI wouldn't run.
- */
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd))
return 0;
if (unlikely(pmd_huge(pmd))) {
if (!gup_huge_pmd(pmd, addr, next, write, pages,nr))
diff --git a/arch/mips/mm/pgtable-64.c b/arch/mips/mm/pgtable-64.c
index e8adc0069d66..ce4473e7c0d2 100644
--- a/arch/mips/mm/pgtable-64.c
+++ b/arch/mips/mm/pgtable-64.c
@@ -62,20 +62,6 @@ void pmd_init(unsigned long addr, unsigned long pagetable)
}
#endif

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
-void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmdp)
-{
- if (!pmd_trans_splitting(*pmdp)) {
- pmd_t pmd = pmd_mksplitting(*pmdp);
- set_pmd_at(vma->vm_mm, address, pmdp, pmd);
- }
-}
-
-#endif
-
pmd_t mk_pmd(struct page *page, pgprot_t prot)
{
pmd_t pmd;
diff --git a/arch/mips/mm/tlbex.c b/arch/mips/mm/tlbex.c
index 97c87027c17f..02b53ce7fc2e 100644
--- a/arch/mips/mm/tlbex.c
+++ b/arch/mips/mm/tlbex.c
@@ -240,7 +240,6 @@ static void output_pgtable_bits_defines(void)
pr_define("_PAGE_MODIFIED_SHIFT %d\n", _PAGE_MODIFIED_SHIFT);
#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
pr_define("_PAGE_HUGE_SHIFT %d\n", _PAGE_HUGE_SHIFT);
- pr_define("_PAGE_SPLITTING_SHIFT %d\n", _PAGE_SPLITTING_SHIFT);
#endif
#ifdef CONFIG_CPU_MIPSR2
if (cpu_has_rixi) {
--
2.1.4
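
One side effect of the pgtable-bits.h hunk above is that the bit following
_PAGE_HUGE moves down by one position once _PAGE_SPLITTING is gone. The
sketch below spells out that arithmetic for the R2/R6 branch. The TOY_*
names are hypothetical and the value chosen for the "modified" shift is an
arbitrary assumption; only the relative offsets mirror the patch.

```c
#include <assert.h>

/* Hypothetical base value; the real one depends on the MIPS configuration. */
#define TOY_PAGE_MODIFIED_SHIFT 6

/* Layout before the patch: a splitting bit sits between HUGE and NO_EXEC. */
#define TOY_OLD_HUGE_SHIFT      (TOY_PAGE_MODIFIED_SHIFT + 1)
#define TOY_OLD_SPLITTING_SHIFT (TOY_OLD_HUGE_SHIFT + 1)
#define TOY_OLD_NO_EXEC_SHIFT   (TOY_OLD_SPLITTING_SHIFT + 1)

/* Layout after the patch: NO_EXEC follows HUGE directly. */
#define TOY_NEW_HUGE_SHIFT      (TOY_PAGE_MODIFIED_SHIFT + 1)
#define TOY_NEW_NO_EXEC_SHIFT   (TOY_NEW_HUGE_SHIFT + 1)

/* The huge bit stays put; the next bit takes over the freed position. */
static int toy_bits_freed(void)
{
	return TOY_OLD_NO_EXEC_SHIFT - TOY_NEW_NO_EXEC_SHIFT;
}
```

With these definitions toy_bits_freed() evaluates to 1, matching the single
software bit that the splitting flag used to occupy.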

2015-07-20 14:22:18

by Kirill A. Shutemov

Subject: [PATCHv9 20/36] powerpc, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we don't need to mark PMDs as splitting. Let's
drop the code that handles this.

pmdp_splitting_flush() is not needed either: when splitting a PMD we will
do pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do the IPI
as needed for fast_gup.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 25 +---------------
arch/powerpc/mm/hugepage-hash64.c | 3 --
arch/powerpc/mm/hugetlbpage.c | 4 ---
arch/powerpc/mm/pgtable_64.c | 49 --------------------------------
4 files changed, 1 insertion(+), 80 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 58c61500993f..d5d854bd47c1 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -358,11 +358,6 @@ void pgtable_cache_init(void);
#endif /* __ASSEMBLY__ */

/*
- * THP pages can't be special. So use the _PAGE_SPECIAL
- */
-#define _PAGE_SPLITTING _PAGE_SPECIAL
-
-/*
* We need to differentiate between explicit huge page and THP huge
* page, since THP huge page also need to track real subpage details
*/
@@ -372,8 +367,7 @@ void pgtable_cache_init(void);
* set of bits not changed in pmd_modify.
*/
#define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | \
- _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_SPLITTING | \
- _PAGE_THP_HUGE)
+ _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE)

#ifndef __ASSEMBLY__
/*
@@ -455,13 +449,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
return (pmd_val(pmd) & 0x3) && (pmd_val(pmd) & _PAGE_THP_HUGE);
}

-static inline int pmd_trans_splitting(pmd_t pmd)
-{
- if (pmd_trans_huge(pmd))
- return pmd_val(pmd) & _PAGE_SPLITTING;
- return 0;
-}
-
extern int has_transparent_hugepage(void);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -514,12 +501,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
return pmd;
}

-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
- pmd_val(pmd) |= _PAGE_SPLITTING;
- return pmd;
-}
-
#define __HAVE_ARCH_PMD_SAME
static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
{
@@ -570,10 +551,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW, 0);
}

-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp);
-
extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp);
#define pmdp_collapse_flush pmdp_collapse_flush
diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c
index 43dafb9d6a46..adc3860ce9e7 100644
--- a/arch/powerpc/mm/hugepage-hash64.c
+++ b/arch/powerpc/mm/hugepage-hash64.c
@@ -39,9 +39,6 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
/* If PMD busy, retry the access */
if (unlikely(old_pmd & _PAGE_BUSY))
return 0;
- /* If PMD is trans splitting retry the access */
- if (unlikely(old_pmd & _PAGE_SPLITTING))
- return 0;
/* If PMD permissions don't match, take page fault */
if (unlikely(access & ~old_pmd))
return 1;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index f119edaa6961..dfab6627ca63 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1015,10 +1015,6 @@ pte_t *__find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
/*
* A hugepage collapse is captured by pmd_none, because
* it mark the pmd none and do a hpte invalidate.
- *
- * We don't worry about pmd_trans_splitting here, The
- * caller if it needs to handle the splitting case
- * should check for that.
*/
if (pmd_none(pmd))
return NULL;
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 876232d64126..ac8f12d3cfce 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -614,55 +614,6 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
}

/*
- * We mark the pmd splitting and invalidate all the hpte
- * entries for this hugepage.
- */
-void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp)
-{
- unsigned long old, tmp;
-
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
-#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pmd_trans_huge(*pmdp));
- assert_spin_locked(&vma->vm_mm->page_table_lock);
-#endif
-
-#ifdef PTE_ATOMIC_UPDATES
-
- __asm__ __volatile__(
- "1: ldarx %0,0,%3\n\
- andi. %1,%0,%6\n\
- bne- 1b \n\
- ori %1,%0,%4 \n\
- stdcx. %1,0,%3 \n\
- bne- 1b"
- : "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
- : "r" (pmdp), "i" (_PAGE_SPLITTING), "m" (*pmdp), "i" (_PAGE_BUSY)
- : "cc" );
-#else
- old = pmd_val(*pmdp);
- *pmdp = __pmd(old | _PAGE_SPLITTING);
-#endif
- /*
- * If we didn't had the splitting flag set, go and flush the
- * HPTE entries.
- */
- trace_hugepage_splitting(address, old);
- if (!(old & _PAGE_SPLITTING)) {
- /* We need to flush the hpte */
- if (old & _PAGE_HASHPTE)
- hpte_do_hugepage_flush(vma->vm_mm, address, pmdp, old);
- }
- /*
- * This ensures that generic code that rely on IRQ disabling
- * to prevent a parallel THP split work as expected.
- */
- kick_all_cpus_sync();
-}
-
-/*
* We want to put the pgtable in pmd and use pgtable for tracking
* the base page size hptes
*/
--
2.1.4

2015-07-20 14:28:46

by Kirill A. Shutemov

Subject: [PATCHv9 21/36] s390, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we don't need to mark PMDs as splitting. Let's
drop the code that handles this.

pmdp_splitting_flush() is not needed either: when splitting a PMD we will
do pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do the IPI
as needed for fast_gup.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/s390/include/asm/pgtable.h | 15 +--------------
arch/s390/mm/gup.c | 11 +----------
arch/s390/mm/pgtable.c | 16 ----------------
3 files changed, 2 insertions(+), 40 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 536e857ac0f9..e75dc65ed4a0 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -280,7 +280,6 @@ static inline int is_module_addr(void *addr)

#define _SEGMENT_ENTRY_DIRTY 0x2000 /* SW segment dirty bit */
#define _SEGMENT_ENTRY_YOUNG 0x1000 /* SW segment young bit */
-#define _SEGMENT_ENTRY_SPLIT 0x0800 /* THP splitting bit */
#define _SEGMENT_ENTRY_LARGE 0x0400 /* STE-format control, large page */
#define _SEGMENT_ENTRY_READ 0x0002 /* SW segment read bit */
#define _SEGMENT_ENTRY_WRITE 0x0001 /* SW segment write bit */
@@ -306,8 +305,6 @@ static inline int is_module_addr(void *addr)
* SW-bits: y young, d dirty, r read, w write
*/

-#define _SEGMENT_ENTRY_SPLIT_BIT 11 /* THP splitting bit number */
-
/* Page status table bits for virtualization */
#define PGSTE_ACC_BITS 0xf000000000000000UL
#define PGSTE_FP_BIT 0x0800000000000000UL
@@ -511,10 +508,6 @@ static inline int pmd_bad(pmd_t pmd)
return (pmd_val(pmd) & ~_SEGMENT_ENTRY_BITS) != 0;
}

-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmdp);
-
#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
extern int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
@@ -1358,7 +1351,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
if (pmd_large(pmd)) {
pmd_val(pmd) &= _SEGMENT_ENTRY_ORIGIN_LARGE |
_SEGMENT_ENTRY_DIRTY | _SEGMENT_ENTRY_YOUNG |
- _SEGMENT_ENTRY_LARGE | _SEGMENT_ENTRY_SPLIT;
+ _SEGMENT_ENTRY_LARGE;
pmd_val(pmd) |= massage_pgprot_pmd(newprot);
if (!(pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY))
pmd_val(pmd) |= _SEGMENT_ENTRY_PROTECT;
@@ -1466,12 +1459,6 @@ extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
#define __HAVE_ARCH_PGTABLE_WITHDRAW
extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);

-static inline int pmd_trans_splitting(pmd_t pmd)
-{
- return (pmd_val(pmd) & _SEGMENT_ENTRY_LARGE) &&
- (pmd_val(pmd) & _SEGMENT_ENTRY_SPLIT);
-}
-
static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t entry)
{
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index f8112899f6fe..0be19bae998e 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -102,16 +102,7 @@ static inline int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
pmd = *pmdp;
barrier();
next = pmd_addr_end(addr, end);
- /*
- * The pmd_trans_splitting() check below explains why
- * pmdp_splitting_flush() has to serialize with
- * smp_call_function() against our disabled IRQs, to stop
- * this gup-fast code from running while we set the
- * splitting bit in the pmd. Returning zero will take
- * the slow path that will call wait_split_huge_page()
- * if the pmd is still in splitting state.
- */
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd))
return 0;
if (unlikely(pmd_large(pmd))) {
if (!gup_huge_pmd(pmdp, pmd, addr, next,
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index b33f66110ca9..4a2134ffc74f 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1355,22 +1355,6 @@ int pmdp_set_access_flags(struct vm_area_struct *vma,
return 1;
}

-static void pmdp_splitting_flush_sync(void *arg)
-{
- /* Simply deliver the interrupt */
-}
-
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmdp)
-{
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- if (!test_and_set_bit(_SEGMENT_ENTRY_SPLIT_BIT,
- (unsigned long *) pmdp)) {
- /* need to serialize against gup-fast (IRQ disabled) */
- smp_call_function(pmdp_splitting_flush_sync, NULL, 1);
- }
-}
-
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable)
{
--
2.1.4

2015-07-20 14:28:48

by Kirill A. Shutemov

Subject: [PATCHv9 22/36] sparc, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we don't need to mark PMDs as splitting. Let's
drop the code that handles this.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 16 ----------------
arch/sparc/mm/fault_64.c | 3 ---
arch/sparc/mm/gup.c | 2 +-
3 files changed, 1 insertion(+), 20 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 5833dc5ee7d7..7a38d6a576c5 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -681,13 +681,6 @@ static inline unsigned long pmd_trans_huge(pmd_t pmd)
return pte_val(pte) & _PAGE_PMD_HUGE;
}

-static inline unsigned long pmd_trans_splitting(pmd_t pmd)
-{
- pte_t pte = __pte(pmd_val(pmd));
-
- return pmd_trans_huge(pmd) && pte_special(pte);
-}
-
#define has_transparent_hugepage() 1

static inline pmd_t pmd_mkold(pmd_t pmd)
@@ -744,15 +737,6 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
return __pmd(pte_val(pte));
}

-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
- pte_t pte = __pte(pmd_val(pmd));
-
- pte = pte_mkspecial(pte);
-
- return __pmd(pte_val(pte));
-}
-
static inline pgprot_t pmd_pgprot(pmd_t entry)
{
unsigned long val = pmd_val(entry);
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 479823249429..a504a4d85ebd 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -113,9 +113,6 @@ static unsigned int get_user_insn(unsigned long tpc)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (pmd_trans_huge(*pmdp)) {
- if (pmd_trans_splitting(*pmdp))
- goto out_irq_enable;
-
pa = pmd_pfn(*pmdp) << PAGE_SHIFT;
pa += tpc & ~HPAGE_MASK;

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 9091c5daa2e1..eb3d8e8ebc6b 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -114,7 +114,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = *pmdp;

next = pmd_addr_end(addr, end);
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd))
return 0;
if (unlikely(pmd_large(pmd))) {
if (!gup_huge_pmd(pmdp, pmd, addr, next,
--
2.1.4

2015-07-20 14:29:38

by Kirill A. Shutemov

Subject: [PATCHv9 23/36] tile, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we no longer need to mark PMDs as splitting.
Let's drop the code that handles this.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/tile/include/asm/pgtable.h | 10 ----------
1 file changed, 10 deletions(-)

diff --git a/arch/tile/include/asm/pgtable.h b/arch/tile/include/asm/pgtable.h
index 2b05ccbebed9..96cecf55522e 100644
--- a/arch/tile/include/asm/pgtable.h
+++ b/arch/tile/include/asm/pgtable.h
@@ -489,16 +489,6 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define has_transparent_hugepage() 1
#define pmd_trans_huge pmd_huge_page
-
-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
- return pte_pmd(hv_pte_set_client2(pmd_pte(pmd)));
-}
-
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
- return hv_pte_get_client2(pmd_pte(pmd));
-}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
--
2.1.4

2015-07-20 14:28:50

by Kirill A. Shutemov

Subject: [PATCHv9 24/36] x86, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we no longer need to mark PMDs as splitting.
Let's drop the code that handles this.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
---
arch/x86/include/asm/pgtable.h | 9 ---------
arch/x86/include/asm/pgtable_types.h | 2 --
arch/x86/mm/gup.c | 13 +------------
arch/x86/mm/pgtable.c | 14 --------------
4 files changed, 1 insertion(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 4383012950b0..37c280e0827a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -158,11 +158,6 @@ static inline int pmd_large(pmd_t pte)
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
- return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
static inline int pmd_trans_huge(pmd_t pmd)
{
return pmd_val(pmd) & _PAGE_PSE;
@@ -794,10 +789,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp);


-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmdp);
-
#define __HAVE_ARCH_PMD_WRITE
static inline int pmd_write(pmd_t pmd)
{
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 78f0c8cbe316..45f7cff1baac 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,7 +22,6 @@
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
#define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
-#define _PAGE_BIT_SPLITTING _PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
#define _PAGE_BIT_HIDDEN _PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
@@ -46,7 +45,6 @@
#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
#define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
-#define _PAGE_SPLITTING (_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
#define __HAVE_ARCH_PTE_SPECIAL

#ifdef CONFIG_KMEMCHECK
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 62a887a3cf50..49bbbc57603b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -157,18 +157,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = *pmdp;

next = pmd_addr_end(addr, end);
- /*
- * The pmd_trans_splitting() check below explains why
- * pmdp_splitting_flush has to flush the tlb, to stop
- * this gup-fast code from running while we set the
- * splitting bit in the pmd. Returning zero will take
- * the slow path that will call wait_split_huge_page()
- * if the pmd is still in splitting state. gup-fast
- * can't because it has irq disabled and
- * wait_split_huge_page() would never return as the
- * tlb flush IPI wouldn't run.
- */
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd))
return 0;
if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
/*
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 0b97d2c75df3..808770dcae1c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -509,20 +509,6 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,

return young;
}
-
-void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp)
-{
- int set;
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
- (unsigned long *)pmdp);
- if (set) {
- pmd_update(vma->vm_mm, address, pmdp);
- /* need tlb flush only to serialize against gup-fast */
- flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
- }
-}
#endif

/**
--
2.1.4

2015-07-20 14:27:41

by Kirill A. Shutemov

Subject: [PATCHv9 25/36] mm, thp: remove infrastructure for handling splitting PMDs

With the new refcounting we no longer need to mark PMDs as splitting.
Let's drop the code that handles this.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
fs/proc/task_mmu.c | 8 +++---
include/asm-generic/pgtable.h | 9 -------
include/linux/huge_mm.h | 21 +++++----------
mm/gup.c | 12 +--------
mm/huge_memory.c | 60 ++++++++++---------------------------------
mm/memcontrol.c | 13 ++--------
mm/memory.c | 18 ++-----------
mm/mincore.c | 2 +-
mm/mremap.c | 15 +++++------
mm/pgtable-generic.c | 14 ----------
mm/rmap.c | 4 +--
11 files changed, 37 insertions(+), 139 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 37a3cf92f7c6..818708dd6f9a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -560,7 +560,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
smaps_pmd_entry(pmd, addr, walk);
spin_unlock(ptl);
return 0;
@@ -813,7 +813,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
struct page *page;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
clear_soft_dirty_pmd(vma, addr, pmd);
goto out;
@@ -1085,7 +1085,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
int err = 0;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (pmd_trans_huge_lock(pmdp, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmdp, vma, &ptl)) {
u64 flags = 0, frame = 0;
pmd_t pmd = *pmdp;

@@ -1417,7 +1417,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
pte_t *orig_pte;
pte_t *pte;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
pte_t huge_pte = *(pte_t *)pmd;
struct page *page;

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 85e2434aeec5..4b9c27aac60c 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -184,11 +184,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif

-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp);
-#endif
-
#ifndef pmdp_collapse_flush
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
@@ -582,10 +577,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
{
return 0;
}
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
- return 0;
-}
#ifndef __HAVE_ARCH_PMD_WRITE
static inline int pmd_write(pmd_t pmd)
{
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 933c63cc5ed7..460441349a52 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -28,7 +28,7 @@ extern int zap_huge_pmd(struct mmu_gather *tlb,
extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
unsigned char *vec);
-extern int move_huge_pmd(struct vm_area_struct *vma,
+extern bool move_huge_pmd(struct vm_area_struct *vma,
struct vm_area_struct *new_vma,
unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
@@ -51,15 +51,9 @@ enum transparent_hugepage_flag {
#endif
};

-enum page_check_address_pmd_flag {
- PAGE_CHECK_ADDRESS_PMD_FLAG,
- PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
- PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
-};
extern pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag,
spinlock_t **ptl);
extern int pmd_freeable(pmd_t pmd);

@@ -104,7 +98,6 @@ extern unsigned long transparent_hugepage_flags;
#define split_huge_page(page) BUILD_BUG()
#define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()

-#define wait_split_huge_page(__anon_vma, __pmd) BUILD_BUG()
#if HPAGE_PMD_ORDER >= MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -114,17 +107,17 @@ extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
long adjust_next);
-extern int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+extern bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
spinlock_t **ptl);
/* mmap_sem must be held on entry */
-static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+static inline bool pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
spinlock_t **ptl)
{
VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
if (pmd_trans_huge(*pmd))
return __pmd_trans_huge_lock(pmd, vma, ptl);
else
- return 0;
+ return false;
}
static inline int hpage_nr_pages(struct page *page)
{
@@ -172,8 +165,6 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define wait_split_huge_page(__anon_vma, __pmd) \
- do { } while (0)
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
@@ -188,10 +179,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next)
{
}
-static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+static inline bool pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
spinlock_t **ptl)
{
- return 0;
+ return false;
}

static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/gup.c b/mm/gup.c
index c38adf61200a..c58d2fafb4b0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -239,13 +239,6 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags);
}
-
- if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(ptl);
- wait_split_huge_page(vma->anon_vma, pmd);
- return follow_page_pte(vma, address, pmd, flags);
- }
-
if (flags & FOLL_SPLIT) {
int ret;
page = pmd_page(*pmd);
@@ -1061,9 +1054,6 @@ struct page *get_dump_page(unsigned long addr)
* *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
* pages containing page tables.
*
- * *) THP splits will broadcast an IPI, this can be achieved by overriding
- * pmdp_splitting_flush.
- *
* *) ptes can be read atomically by the architecture.
*
* *) access_ok is sufficient to validate userspace address ranges.
@@ -1260,7 +1250,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = READ_ONCE(*pmdp);

next = pmd_addr_end(addr, end);
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd))
return 0;

if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 442ba9a28c75..c825397aafce 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -957,15 +957,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}

- if (unlikely(pmd_trans_splitting(pmd))) {
- /* split huge page running from under us */
- spin_unlock(src_ptl);
- spin_unlock(dst_ptl);
- pte_free(dst_mm, pgtable);
-
- wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
- goto out;
- }
src_page = pmd_page(pmd);
VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
get_page(src_page);
@@ -1471,7 +1462,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
pgtable_t pgtable;
pmd_t orig_pmd;
/*
@@ -1514,13 +1505,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return ret;
}

-int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
+bool move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd)
{
spinlock_t *old_ptl, *new_ptl;
- int ret = 0;
pmd_t pmd;

struct mm_struct *mm = vma->vm_mm;
@@ -1529,7 +1519,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
(new_addr & ~HPAGE_PMD_MASK) ||
old_end - old_addr < HPAGE_PMD_SIZE ||
(new_vma->vm_flags & VM_NOHUGEPAGE))
- goto out;
+ return false;

/*
* The destination pmd shouldn't be established, free_pgtables()
@@ -1537,15 +1527,14 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
*/
if (WARN_ON(!pmd_none(*new_pmd))) {
VM_BUG_ON(pmd_trans_huge(*new_pmd));
- goto out;
+ return false;
}

/*
* We don't have to worry about the ordering of src and dst
* ptlocks because exclusive mmap_sem prevents deadlock.
*/
- ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
- if (ret == 1) {
+ if (__pmd_trans_huge_lock(old_pmd, vma, &old_ptl)) {
new_ptl = pmd_lockptr(mm, new_pmd);
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -1561,9 +1550,9 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
+ return true;
}
-out:
- return ret;
+ return false;
}

/*
@@ -1579,7 +1568,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
pmd_t entry;
bool preserve_write = prot_numa && pmd_write(*pmd);
ret = 1;
@@ -1616,23 +1605,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
* Note that if it returns 1, this routine returns without unlocking page
* table locks. So callers must unlock them.
*/
-int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
spinlock_t **ptl)
{
*ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(pmd_trans_huge(*pmd))) {
- if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(*ptl);
- wait_split_huge_page(vma->anon_vma, pmd);
- return -1;
- } else {
- /* Thp mapped by 'pmd' is stable, so we can
- * handle it as it is. */
- return 1;
- }
- }
+ if (likely(pmd_trans_huge(*pmd)))
+ return true;
spin_unlock(*ptl);
- return 0;
+ return false;
}

/*
@@ -1646,7 +1626,6 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag,
spinlock_t **ptl)
{
pgd_t *pgd;
@@ -1669,21 +1648,8 @@ pmd_t *page_check_address_pmd(struct page *page,
goto unlock;
if (pmd_page(*pmd) != page)
goto unlock;
- /*
- * split_vma() may create temporary aliased mappings. There is
- * no risk as long as all huge pmd are found and have their
- * splitting bit set before __split_huge_page_refcount
- * runs. Finding the same huge pmd more than once during the
- * same rmap walk is not a problem.
- */
- if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
- pmd_trans_splitting(*pmd))
- goto unlock;
- if (pmd_trans_huge(*pmd)) {
- VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
- !pmd_trans_splitting(*pmd));
+ if (pmd_trans_huge(*pmd))
return pmd;
- }
unlock:
spin_unlock(*ptl);
return NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21378c828f34..886c3a677175 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4720,7 +4720,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
mc.precharge += HPAGE_PMD_NR;
spin_unlock(ptl);
@@ -4891,16 +4891,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
union mc_target target;
struct page *page;

- /*
- * No race with splitting thp happens because:
- * - if pmd_trans_huge_lock() returns 1, the relevant thp is not
- * under splitting, which means there's no concurrent thp split,
- * - if another thread runs into split_huge_page() just after we
- * entered this if-block, the thread must wait for page table lock
- * to be unlocked in __split_huge_page_splitting(), where the main
- * part of thp split is not executed yet.
- */
- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
if (mc.precharge < HPAGE_PMD_NR) {
spin_unlock(ptl);
return 0;
diff --git a/mm/memory.c b/mm/memory.c
index e830477d40b7..7a782e48477c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -566,7 +566,6 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
{
spinlock_t *ptl;
pgtable_t new = pte_alloc_one(mm, address);
- int wait_split_huge_page;
if (!new)
return -ENOMEM;

@@ -586,18 +585,14 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */

ptl = pmd_lock(mm, pmd);
- wait_split_huge_page = 0;
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
atomic_long_inc(&mm->nr_ptes);
pmd_populate(mm, pmd, new);
new = NULL;
- } else if (unlikely(pmd_trans_splitting(*pmd)))
- wait_split_huge_page = 1;
+ }
spin_unlock(ptl);
if (new)
pte_free(mm, new);
- if (wait_split_huge_page)
- wait_split_huge_page(vma->anon_vma, pmd);
return 0;
}

@@ -613,8 +608,7 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
pmd_populate_kernel(&init_mm, pmd, new);
new = NULL;
- } else
- VM_BUG_ON(pmd_trans_splitting(*pmd));
+ }
spin_unlock(&init_mm.page_table_lock);
if (new)
pte_free_kernel(&init_mm, new);
@@ -3367,14 +3361,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (pmd_trans_huge(orig_pmd)) {
unsigned int dirty = flags & FAULT_FLAG_WRITE;

- /*
- * If the pmd is splitting, return and retry the
- * the fault. Alternative: wait until the split
- * is done, and goto retry.
- */
- if (pmd_trans_splitting(orig_pmd))
- return 0;
-
if (pmd_protnone(orig_pmd))
return do_huge_pmd_numa_page(mm, vma, address,
orig_pmd, pmd);
diff --git a/mm/mincore.c b/mm/mincore.c
index be25efde64a4..feb867f5fdf4 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -117,7 +117,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
unsigned char *vec = walk->private;
int nr = (end - addr) >> PAGE_SHIFT;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
memset(vec, 1, nr);
spin_unlock(ptl);
goto out;
diff --git a/mm/mremap.c b/mm/mremap.c
index 9cf393ac6e43..0dbae2e91e19 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -192,25 +192,24 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (!new_pmd)
break;
if (pmd_trans_huge(*old_pmd)) {
- int err = 0;
if (extent == HPAGE_PMD_SIZE) {
+ bool moved;
VM_BUG_ON_VMA(vma->vm_file || !vma->anon_vma,
vma);
/* See comment in move_ptes() */
if (need_rmap_locks)
anon_vma_lock_write(vma->anon_vma);
- err = move_huge_pmd(vma, new_vma, old_addr,
+ moved = move_huge_pmd(vma, new_vma, old_addr,
new_addr, old_end,
old_pmd, new_pmd);
if (need_rmap_locks)
anon_vma_unlock_write(vma->anon_vma);
+ if (moved) {
+ need_flush = true;
+ continue;
+ }
}
- if (err > 0) {
- need_flush = true;
- continue;
- } else if (!err) {
- split_huge_pmd(vma, old_pmd, old_addr);
- }
+ split_huge_pmd(vma, old_pmd, old_addr);
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 6b674e00153c..89b150f8c920 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -134,20 +134,6 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif

-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmdp)
-{
- pmd_t pmd = pmd_mksplitting(*pmdp);
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- set_pmd_at(vma->vm_mm, address, pmdp, pmd);
- /* tlb flush only to serialize against gup-fast */
- flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#endif
-
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
diff --git a/mm/rmap.c b/mm/rmap.c
index 5ee08e082e51..bc1db51958da 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -841,8 +841,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
* rmap might return false positives; we must filter
* these out using page_check_address_pmd().
*/
- pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ pmd = page_check_address_pmd(page, mm, address, &ptl);
if (!pmd)
return SWAP_AGAIN;

@@ -852,7 +851,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
return SWAP_FAIL; /* To break the loop */
}

- /* go ahead even if the pmd is pmd_trans_splitting() */
if (pmdp_clear_flush_young_notify(vma, address, pmd))
referenced++;

--
2.1.4

2015-07-20 14:25:08

by Kirill A. Shutemov

Subject: [PATCHv9 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs

We're going to allow mapping of individual 4k pages of a THP compound
page. This means we need to track mapcount on a per-small-page basis.

The straightforward approach is to use ->_mapcount in all subpages to
track how many times each subpage is mapped with PMDs or PTEs combined.
But this is rather expensive: mapping or unmapping a THP page with a PMD
would require HPAGE_PMD_NR atomic operations instead of the single one we
have now.

The idea is to store separately how many times the page was mapped as a
whole -- compound_mapcount. This frees up ->_mapcount in subpages to
track PTE mapcount.

We use the same approach as with the compound page destructor and
compound order to store compound_mapcount: use space in the first tail
page, ->mapping this time.

Any time we map/unmap a whole compound page (THP or hugetlb) -- we
increment/decrement compound_mapcount. When we map part of a compound
page with a PTE we operate on ->_mapcount of the subpage.

page_mapcount() counts both: PTE and PMD mappings of the page.

Basically, we have the mapcount for a subpage spread over two counters.
This makes it tricky to detect when the last mapcount for a page goes
away.

We introduce PageDoubleMap() for this. When we split a THP PMD for the
first time and there are other PMD mappings left, we offset ->_mapcount
in all subpages up by one and set PG_double_map on the compound page.
These additional references go away with the last compound_mapcount.

This approach provides a way to detect when the last mapcount goes away
on a per-small-page basis without introducing new overhead for the most
common cases.
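
The accounting described above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions, not code from the
patch: struct thp_model, the thp_*() helpers and the toy NR_SUBPAGES value
are hypothetical names made up for illustration, plain ints stand in for
atomic_t, and the order of steps in thp_split_pmd() is an assumption based
on the description in this message.

```c
#include <stdbool.h>

#define NR_SUBPAGES 4	/* toy value; stands in for HPAGE_PMD_NR */

/* Hypothetical model of one THP; not the kernel's struct page. */
struct thp_model {
	int subpage_mapcount[NR_SUBPAGES];	/* models ->_mapcount, -1 == unmapped */
	int compound_mapcount;			/* models compound_mapcount, -1 == unmapped */
	bool double_map;			/* models PG_double_map */
};

static void thp_init(struct thp_model *thp)
{
	for (int i = 0; i < NR_SUBPAGES; i++)
		thp->subpage_mapcount[i] = -1;
	thp->compound_mapcount = -1;
	thp->double_map = false;
}

/* Map the whole page with a PMD: one counter update, not NR_SUBPAGES. */
static void thp_map_pmd(struct thp_model *thp)
{
	thp->compound_mapcount++;
}

/*
 * Drop one PMD mapping. When the last one goes away, the extra
 * per-subpage reference taken for double_map goes away with it.
 */
static void thp_unmap_pmd(struct thp_model *thp)
{
	if (--thp->compound_mapcount < 0 && thp->double_map) {
		thp->double_map = false;
		for (int i = 0; i < NR_SUBPAGES; i++)
			thp->subpage_mapcount[i]--;
	}
}

/* Replace one PMD mapping with NR_SUBPAGES PTE mappings. */
static void thp_split_pmd(struct thp_model *thp)
{
	/* Each subpage gains a PTE mapping. */
	for (int i = 0; i < NR_SUBPAGES; i++)
		thp->subpage_mapcount[i]++;

	/* First split while other PMD mappings remain: offset up by one. */
	if (thp->compound_mapcount + 1 > 1 && !thp->double_map) {
		thp->double_map = true;
		for (int i = 0; i < NR_SUBPAGES; i++)
			thp->subpage_mapcount[i]++;
	}

	/* The PMD mapping that was split is gone. */
	thp_unmap_pmd(thp);
}

/* Unmap a single PTE mapping of subpage idx. */
static void thp_unmap_pte(struct thp_model *thp, int idx)
{
	thp->subpage_mapcount[idx]--;
}

/* Mirrors the arithmetic of page_mapcount() in this patch. */
static int thp_page_mapcount(const struct thp_model *thp, int idx)
{
	int ret = thp->subpage_mapcount[idx] + 1;

	ret += thp->compound_mapcount + 1;
	if (thp->double_map)
		ret--;
	return ret;
}
```

In this model, with two PMD mappings and one of them split, every subpage
still reports a mapcount of 2. Once one PTE is unmapped and the remaining
PMD mapping is dropped, the raw counter of that subpage is back at -1,
which is how the last mapping of a small page can be detected without
touching every subpage on the common PMD-only map/unmap paths.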

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/mm.h | 26 +++++++++++-
include/linux/mm_types.h | 1 +
include/linux/page-flags.h | 37 +++++++++++++++++
include/linux/rmap.h | 4 +-
mm/debug.c | 5 ++-
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 4 +-
mm/memory.c | 2 +-
mm/migrate.c | 2 +-
mm/page_alloc.c | 14 +++++--
mm/rmap.c | 99 +++++++++++++++++++++++++++++++++++-----------
11 files changed, 161 insertions(+), 35 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b6fb5293259a..0370329aff96 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -396,6 +396,19 @@ static inline int is_vmalloc_or_module_addr(const void *x)

extern void kvfree(const void *addr);

+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+ return &page[1].compound_mapcount;
+}
+
+static inline int compound_mapcount(struct page *page)
+{
+ if (!PageCompound(page))
+ return 0;
+ page = compound_head(page);
+ return atomic_read(compound_mapcount_ptr(page)) + 1;
+}
+
/*
* The atomic page->_mapcount, starts from -1: so that transitions
* both from it and to it can be tracked, using atomic_inc_and_test
@@ -408,8 +421,17 @@ static inline void page_mapcount_reset(struct page *page)

static inline int page_mapcount(struct page *page)
{
+ int ret;
VM_BUG_ON_PAGE(PageSlab(page), page);
- return atomic_read(&page->_mapcount) + 1;
+
+ ret = atomic_read(&page->_mapcount) + 1;
+ if (PageCompound(page)) {
+ page = compound_head(page);
+ ret += atomic_read(compound_mapcount_ptr(page)) + 1;
+ if (PageDoubleMap(page))
+ ret--;
+ }
+ return ret;
}

static inline int page_count(struct page *page)
@@ -891,7 +913,7 @@ static inline pgoff_t page_file_index(struct page *page)
*/
static inline int page_mapped(struct page *page)
{
- return atomic_read(&(page)->_mapcount) >= 0;
+ return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
}

/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b762eef188c3..991f1394d425 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -56,6 +56,7 @@ struct page {
* see PAGE_MAPPING_ANON below.
*/
void *s_mem; /* slab first object */
+ atomic_t compound_mapcount; /* first tail page */
};

/* Second double word */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f10f9c0030dd..d22adfd0a4c4 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -127,6 +127,9 @@ enum pageflags {

/* SLOB */
PG_slob_free = PG_private,
+
+ /* Compound pages. Stored in first tail page's flags */
+ PG_double_map = PG_private_2,
};

#ifndef __GENERATING_BOUNDS_H
@@ -593,10 +596,44 @@ static inline int PageTransTail(struct page *page)
return PageTail(page);
}

+/*
+ * PageDoubleMap indicates that the compound page is mapped with PTEs as well
+ * as PMDs.
+ *
+ * This is required for optimization of rmap operations for THP: we can postpone
+ * per small page mapcount accounting (and its overhead from atomic operations)
+ * until the first PMD split.
+ *
+ * For the page PageDoubleMap means ->_mapcount in all sub-pages is offset up
+ * by one. This reference will go away with last compound_mapcount.
+ *
+ * See also __split_huge_pmd_locked() and page_remove_anon_compound_rmap().
+ */
+static inline int PageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ return test_bit(PG_double_map, &page[1].flags);
+}
+
+static inline int TestSetPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ return test_and_set_bit(PG_double_map, &page[1].flags);
+}
+
+static inline int TestClearPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ return test_and_clear_bit(PG_double_map, &page[1].flags);
+}
+
#else
TESTPAGEFLAG_FALSE(TransHuge)
TESTPAGEFLAG_FALSE(TransCompound)
TESTPAGEFLAG_FALSE(TransTail)
+TESTPAGEFLAG_FALSE(DoubleMap)
+ TESTSETFLAG_FALSE(DoubleMap)
+ TESTCLEARFLAG_FALSE(DoubleMap)
#endif

/*
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 082928aba785..6b6233fafb53 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -184,9 +184,9 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long);

-static inline void page_dup_rmap(struct page *page)
+static inline void page_dup_rmap(struct page *page, bool compound)
{
- atomic_inc(&page->_mapcount);
+ atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
}

/*
diff --git a/mm/debug.c b/mm/debug.c
index 9dfcd77e7354..4a82f639b964 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -80,9 +80,12 @@ static void dump_flags(unsigned long flags,
void dump_page_badflags(struct page *page, const char *reason,
unsigned long badflags)
{
- pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
+ pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
page, atomic_read(&page->_count), page_mapcount(page),
page->mapping, page->index);
+ if (PageCompound(page))
+ pr_cont(" compound_mapcount: %d", compound_mapcount(page));
+ pr_cont("\n");
BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS);
dump_flags(page->flags, pageflag_names, ARRAY_SIZE(pageflag_names));
if (reason)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c825397aafce..52a20e92d51a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -960,7 +960,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
src_page = pmd_page(pmd);
VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
get_page(src_page);
- page_dup_rmap(src_page);
+ page_dup_rmap(src_page, true);
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);

pmdp_set_wrprotect(src_mm, addr, src_pmd);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 399ea11a8813..05b2f53be237 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2795,7 +2795,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
get_page(ptepage);
- page_dup_rmap(ptepage);
+ page_dup_rmap(ptepage, true);
set_huge_pte_at(dst, addr, dst_pte, entry);
}
spin_unlock(src_ptl);
@@ -3256,7 +3256,7 @@ retry:
ClearPagePrivate(page);
hugepage_add_new_anon_rmap(page, vma, address);
} else
- page_dup_rmap(page);
+ page_dup_rmap(page, true);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);
diff --git a/mm/memory.c b/mm/memory.c
index 7a782e48477c..074edab89b52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -867,7 +867,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
page = vm_normal_page(vma, addr, pte);
if (page) {
get_page(page);
- page_dup_rmap(page);
+ page_dup_rmap(page, false);
if (PageAnon(page))
rss[MM_ANONPAGES]++;
else
diff --git a/mm/migrate.c b/mm/migrate.c
index 4870a1daa8ae..67970faf544d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -164,7 +164,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
if (PageAnon(new))
hugepage_add_anon_rmap(new, vma, addr);
else
- page_dup_rmap(new);
+ page_dup_rmap(new, false);
} else if (PageAnon(new))
page_add_anon_rmap(new, vma, addr, false);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 10ac1c9cac3c..cfd3a34e41f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -450,6 +450,7 @@ void prep_compound_page(struct page *page, unsigned long order)
smp_wmb();
__SetPageTail(p);
}
+ atomic_set(compound_mapcount_ptr(page), -1);
}

#ifdef CONFIG_DEBUG_PAGEALLOC
@@ -714,7 +715,7 @@ static inline int free_pages_check(struct page *page)
const char *bad_reason = NULL;
unsigned long bad_flags = 0;

- if (unlikely(page_mapcount(page)))
+ if (unlikely(atomic_read(&page->_mapcount) != -1))
bad_reason = "nonzero mapcount";
if (unlikely(page->mapping != NULL))
bad_reason = "non-NULL mapping";
@@ -823,7 +824,14 @@ static void free_one_page(struct zone *zone,

static int free_tail_pages_check(struct page *head_page, struct page *page)
{
- if (page->mapping != TAIL_MAPPING) {
+ /* mapping in first tail page is used for compound_mapcount() */
+ if (page - head_page == 1) {
+ if (unlikely(compound_mapcount(page))) {
+ bad_page(page, "nonzero compound_mapcount", 0);
+ page->mapping = NULL;
+ return 1;
+ }
+ } else if (page->mapping != TAIL_MAPPING) {
bad_page(page, "corrupted mapping in tail page", 0);
page->mapping = NULL;
return 1;
@@ -1286,7 +1294,7 @@ static inline int check_new_page(struct page *page)
const char *bad_reason = NULL;
unsigned long bad_flags = 0;

- if (unlikely(page_mapcount(page)))
+ if (unlikely(atomic_read(&page->_mapcount) != -1))
bad_reason = "nonzero mapcount";
if (unlikely(page->mapping != NULL))
bad_reason = "non-NULL mapping";
diff --git a/mm/rmap.c b/mm/rmap.c
index bc1db51958da..ed89c6256579 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1139,7 +1139,7 @@ static void __page_check_anon_rmap(struct page *page,
* over the call to page_add_new_anon_rmap.
*/
BUG_ON(page_anon_vma(page)->root != vma->anon_vma->root);
- BUG_ON(page->index != linear_page_index(vma, address));
+ BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address));
#endif
}

@@ -1169,9 +1169,29 @@ void page_add_anon_rmap(struct page *page,
void do_page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address, int flags)
{
- int first = atomic_inc_and_test(&page->_mapcount);
+ bool compound = flags & RMAP_COMPOUND;
+ bool first;
+
+ if (PageTransCompound(page)) {
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ if (compound) {
+ atomic_t *mapcount;
+
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+ mapcount = compound_mapcount_ptr(page);
+ first = atomic_inc_and_test(mapcount);
+ } else {
+ /* Anon THP always mapped first with PMD */
+ first = 0;
+ VM_BUG_ON_PAGE(!page_mapcount(page), page);
+ atomic_inc(&page->_mapcount);
+ }
+ } else {
+ VM_BUG_ON_PAGE(compound, page);
+ first = atomic_inc_and_test(&page->_mapcount);
+ }
+
if (first) {
- bool compound = flags & RMAP_COMPOUND;
int nr = compound ? hpage_nr_pages(page) : 1;
/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
@@ -1190,6 +1210,7 @@ void do_page_add_anon_rmap(struct page *page,
return;

VM_BUG_ON_PAGE(!PageLocked(page), page);
+
/* address might be in next vma when migration races vma_adjust */
if (first)
__page_set_anon_rmap(page, vma, address,
@@ -1216,10 +1237,16 @@ void page_add_new_anon_rmap(struct page *page,

VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
SetPageSwapBacked(page);
- atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
if (compound) {
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+ /* increment count (starts at -1) */
+ atomic_set(compound_mapcount_ptr(page), 0);
__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ } else {
+ /* Anon THP always mapped first with PMD */
+ VM_BUG_ON_PAGE(PageTransCompound(page), page);
+ /* increment count (starts at -1) */
+ atomic_set(&page->_mapcount, 0);
}
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
__page_set_anon_rmap(page, vma, address, 1);
@@ -1249,12 +1276,15 @@ static void page_remove_file_rmap(struct page *page)

memcg = mem_cgroup_begin_page_stat(page);

- /* page still mapped by someone else? */
- if (!atomic_add_negative(-1, &page->_mapcount))
+ /* Hugepages are not counted in NR_FILE_MAPPED for now. */
+ if (unlikely(PageHuge(page))) {
+ /* hugetlb pages are always mapped with pmds */
+ atomic_dec(compound_mapcount_ptr(page));
goto out;
+ }

- /* Hugepages are not counted in NR_FILE_MAPPED for now. */
- if (unlikely(PageHuge(page)))
+ /* page still mapped by someone else? */
+ if (!atomic_add_negative(-1, &page->_mapcount))
goto out;

/*
@@ -1271,6 +1301,39 @@ out:
mem_cgroup_end_page_stat(memcg);
}

+static void page_remove_anon_compound_rmap(struct page *page)
+{
+ int i, nr;
+
+ if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+ return;
+
+ /* Hugepages are not counted in NR_ANON_PAGES for now. */
+ if (unlikely(PageHuge(page)))
+ return;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return;
+
+ __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+
+ if (TestClearPageDoubleMap(page)) {
+ /*
+ * Subpages can be mapped with PTEs too. Check how many of
+ * them are still mapped.
+ */
+ for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ if (atomic_add_negative(-1, &page[i]._mapcount))
+ nr++;
+ }
+ } else {
+ nr = HPAGE_PMD_NR;
+ }
+
+ if (nr)
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+}
+
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
@@ -1280,33 +1343,25 @@ out:
*/
void page_remove_rmap(struct page *page, bool compound)
{
- int nr = compound ? hpage_nr_pages(page) : 1;
-
if (!PageAnon(page)) {
VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
page_remove_file_rmap(page);
return;
}

+ if (compound)
+ return page_remove_anon_compound_rmap(page);
+
/* page still mapped by someone else? */
if (!atomic_add_negative(-1, &page->_mapcount))
return;

- /* Hugepages are not counted in NR_ANON_PAGES for now. */
- if (unlikely(PageHuge(page)))
- return;
-
/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- if (compound) {
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
- }
-
- __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+ __dec_zone_page_state(page, NR_ANON_PAGES);

if (unlikely(PageMlocked(page)))
clear_page_mlock(page);
@@ -1760,7 +1815,7 @@ void hugepage_add_anon_rmap(struct page *page,
BUG_ON(!PageLocked(page));
BUG_ON(!anon_vma);
/* address might be in next vma when migration races vma_adjust */
- first = atomic_inc_and_test(&page->_mapcount);
+ first = atomic_inc_and_test(compound_mapcount_ptr(page));
if (first)
__hugepage_set_anon_rmap(page, vma, address, 0);
}
@@ -1769,7 +1824,7 @@ void hugepage_add_new_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
- atomic_set(&page->_mapcount, 0);
+ atomic_set(compound_mapcount_ptr(page), 0);
__hugepage_set_anon_rmap(page, vma, address, 1);
}
#endif /* CONFIG_HUGETLB_PAGE */
--
2.1.4
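
The rmap hunks above rely on the convention that a mapcount is stored biased by -1: the first mapping is detected when an increment reaches zero, and the last unmapping when a decrement goes negative. A minimal userspace sketch of that convention follows, using C11 atomics. The `model_` helpers are hypothetical stand-ins named after atomic_inc_and_test() and atomic_add_negative(); they are not the kernel implementations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for atomic_inc_and_test(): increment, true if the result is 0. */
bool model_inc_and_test(atomic_int *v)
{
	assert(v);
	return atomic_fetch_add(v, 1) + 1 == 0;
}

/* Stand-in for atomic_add_negative(): add, true if the result is negative. */
bool model_add_negative(int delta, atomic_int *v)
{
	assert(v);
	return atomic_fetch_add(v, delta) + delta < 0;
}

/* Number of mappings represented by a stored (biased) mapcount. */
int model_mapcount(atomic_int *v)
{
	assert(v);
	return atomic_load(v) + 1;
}
```

With a counter initialised to -1, the first model_inc_and_test() call returns true ("first mapping"), and model_add_negative(-1, ...) returns true only when the last mapping goes away.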

2015-07-20 14:22:20

by Kirill A. Shutemov

Subject: [PATCHv9 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages

Let's define page_mapped() to be true for a compound page if any
sub-page of the compound page is mapped (with PMD or PTE).

On the other hand, page_mapcount() returns the mapcount for this
particular small page.

This will make cases like page_get_anon_vma() behave correctly once we
allow huge pages to be mapped with PTEs.

Most users outside core-mm should use page_mapcount() instead of
page_mapped().
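
To illustrate the distinction, here is a small userspace model. It is only a sketch: the struct, the `model_` names and MODEL_NR_SUBPAGES are hypothetical, the counts are stored unbiased for readability, and it assumes a PMD mapping counts once toward every subpage.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_SUBPAGES 4	/* hypothetical; stands in for HPAGE_PMD_NR */

struct model_thp {
	int pmd_maps;				/* mappings of the whole page via PMD */
	int pte_maps[MODEL_NR_SUBPAGES];	/* per-subpage mappings via PTE */
};

/* Mapcount of one particular small page. */
int model_page_mapcount(const struct model_thp *thp, int idx)
{
	assert(idx >= 0 && idx < MODEL_NR_SUBPAGES);
	return thp->pmd_maps + thp->pte_maps[idx];
}

/* True if any part of the compound page is mapped, by PMD or by PTE. */
bool model_page_mapped(const struct model_thp *thp)
{
	int i;

	if (thp->pmd_maps > 0)
		return true;
	for (i = 0; i < MODEL_NR_SUBPAGES; i++) {
		if (thp->pte_maps[i] > 0)
			return true;
	}
	return false;
}
```

With only subpage 2 mapped by a PTE, model_page_mapped() is true for the whole page while model_page_mapcount() of subpage 0 is still zero, which is the case the cache-flushing callers in this patch care about.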

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
arch/arc/mm/cache_arc700.c | 4 ++--
arch/arm/mm/flush.c | 2 +-
arch/mips/mm/c-r4k.c | 3 ++-
arch/mips/mm/cache.c | 2 +-
arch/mips/mm/init.c | 6 +++---
arch/sh/mm/cache-sh4.c | 2 +-
arch/sh/mm/cache.c | 8 ++++----
arch/xtensa/mm/tlb.c | 2 +-
fs/proc/page.c | 4 ++--
include/linux/mm.h | 15 +++++++++++++--
mm/filemap.c | 2 +-
11 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/arch/arc/mm/cache_arc700.c b/arch/arc/mm/cache_arc700.c
index 12b2100db073..43a42ddaef23 100644
--- a/arch/arc/mm/cache_arc700.c
+++ b/arch/arc/mm/cache_arc700.c
@@ -490,7 +490,7 @@ void flush_dcache_page(struct page *page)
*/
if (!mapping_mapped(mapping)) {
clear_bit(PG_dc_clean, &page->flags);
- } else if (page_mapped(page)) {
+ } else if (page_mapcount(page)) {

/* kernel reading from page with U-mapping */
void *paddr = page_address(page);
@@ -675,7 +675,7 @@ void copy_user_highpage(struct page *to, struct page *from,
* Note that while @u_vaddr refers to DST page's userspace vaddr, it is
* equally valid for SRC page as well
*/
- if (page_mapped(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
+ if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
__flush_dcache_page(kfrom, u_vaddr);
clean_src_k_mappings = 1;
}
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 77f229302032..4da544aa25ef 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -315,7 +315,7 @@ void flush_dcache_page(struct page *page)
mapping = page_mapping(page);

if (!cache_ops_need_broadcast() &&
- mapping && !page_mapped(page))
+ mapping && !page_mapcount(page))
clear_bit(PG_dcache_clean, &page->flags);
else {
__flush_dcache_page(mapping, page);
diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
index 2e03ab173591..35994acf25b5 100644
--- a/arch/mips/mm/c-r4k.c
+++ b/arch/mips/mm/c-r4k.c
@@ -579,7 +579,8 @@ static inline void local_r4k_flush_cache_page(void *args)
* another ASID than the current one.
*/
map_coherent = (cpu_has_dc_aliases &&
- page_mapped(page) && !Page_dcache_dirty(page));
+ page_mapcount(page) &&
+ !Page_dcache_dirty(page));
if (map_coherent)
vaddr = kmap_coherent(page, addr);
else
diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
index 77d96db8253c..d054da62c2bb 100644
--- a/arch/mips/mm/cache.c
+++ b/arch/mips/mm/cache.c
@@ -106,7 +106,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
unsigned long addr = (unsigned long) page_address(page);

if (pages_do_alias(addr, vmaddr)) {
- if (page_mapped(page) && !Page_dcache_dirty(page)) {
+ if (page_mapcount(page) && !Page_dcache_dirty(page)) {
void *kaddr;

kaddr = kmap_coherent(page, vmaddr);
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index faa5c9822ecc..6095f15db03d 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -161,7 +161,7 @@ void copy_user_highpage(struct page *to, struct page *from,

vto = kmap_atomic(to);
if (cpu_has_dc_aliases &&
- page_mapped(from) && !Page_dcache_dirty(from)) {
+ page_mapcount(from) && !Page_dcache_dirty(from)) {
vfrom = kmap_coherent(from, vaddr);
copy_page(vto, vfrom);
kunmap_coherent();
@@ -183,7 +183,7 @@ void copy_to_user_page(struct vm_area_struct *vma,
unsigned long len)
{
if (cpu_has_dc_aliases &&
- page_mapped(page) && !Page_dcache_dirty(page)) {
+ page_mapcount(page) && !Page_dcache_dirty(page)) {
void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
memcpy(vto, src, len);
kunmap_coherent();
@@ -201,7 +201,7 @@ void copy_from_user_page(struct vm_area_struct *vma,
unsigned long len)
{
if (cpu_has_dc_aliases &&
- page_mapped(page) && !Page_dcache_dirty(page)) {
+ page_mapcount(page) && !Page_dcache_dirty(page)) {
void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
memcpy(dst, vfrom, len);
kunmap_coherent();
diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
index 51d8f7f31d1d..58aaa4f33b81 100644
--- a/arch/sh/mm/cache-sh4.c
+++ b/arch/sh/mm/cache-sh4.c
@@ -241,7 +241,7 @@ static void sh4_flush_cache_page(void *args)
*/
map_coherent = (current_cpu_data.dcache.n_aliases &&
test_bit(PG_dcache_clean, &page->flags) &&
- page_mapped(page));
+ page_mapcount(page));
if (map_coherent)
vaddr = kmap_coherent(page, address);
else
diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
index f770e3992620..e58cfbf45150 100644
--- a/arch/sh/mm/cache.c
+++ b/arch/sh/mm/cache.c
@@ -59,7 +59,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long vaddr, void *dst, const void *src,
unsigned long len)
{
- if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+ if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
test_bit(PG_dcache_clean, &page->flags)) {
void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
memcpy(vto, src, len);
@@ -78,7 +78,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long vaddr, void *dst, const void *src,
unsigned long len)
{
- if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+ if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
test_bit(PG_dcache_clean, &page->flags)) {
void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
memcpy(dst, vfrom, len);
@@ -97,7 +97,7 @@ void copy_user_highpage(struct page *to, struct page *from,

vto = kmap_atomic(to);

- if (boot_cpu_data.dcache.n_aliases && page_mapped(from) &&
+ if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) &&
test_bit(PG_dcache_clean, &from->flags)) {
vfrom = kmap_coherent(from, vaddr);
copy_page(vto, vfrom);
@@ -153,7 +153,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
unsigned long addr = (unsigned long) page_address(page);

if (pages_do_alias(addr, vmaddr)) {
- if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+ if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
test_bit(PG_dcache_clean, &page->flags)) {
void *kaddr;

diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 5ece856c5725..35c822286bbe 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -245,7 +245,7 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb)
page_mapcount(p));
if (!page_count(p))
rc |= TLB_INSANE;
- else if (page_mapped(p))
+ else if (page_mapcount(p))
rc |= TLB_SUSPICIOUS;
} else {
rc |= TLB_INSANE;
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..e99c059339f6 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -97,9 +97,9 @@ u64 stable_page_flags(struct page *page)
* pseudo flags for the well known (anonymous) memory mapped pages
*
* Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
- * simple test in page_mapped() is not enough.
+ * simple test in page_mapcount() is not enough.
*/
- if (!PageSlab(page) && page_mapped(page))
+ if (!PageSlab(page) && page_mapcount(page))
u |= 1 << KPF_MMAP;
if (PageAnon(page))
u |= 1 << KPF_ANON;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0370329aff96..b41b1d8b9072 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -910,10 +910,21 @@ static inline pgoff_t page_file_index(struct page *page)

/*
* Return true if this page is mapped into pagetables.
+ * For compound page it returns true if any subpage of compound page is mapped.
*/
-static inline int page_mapped(struct page *page)
+static inline bool page_mapped(struct page *page)
{
- return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
+ int i;
+ if (likely(!PageCompound(page)))
+ return atomic_read(&page->_mapcount) >= 0;
+ page = compound_head(page);
+ if (atomic_read(compound_mapcount_ptr(page)) >= 0)
+ return true;
+ for (i = 0; i < hpage_nr_pages(page); i++) {
+ if (atomic_read(&page[i]._mapcount) >= 0)
+ return true;
+ }
+ return false;
}

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 91d07a576166..5317fd73e5f5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -204,7 +204,7 @@ void __delete_from_page_cache(struct page *page, void *shadow,
__dec_zone_page_state(page, NR_FILE_PAGES);
if (PageSwapBacked(page))
__dec_zone_page_state(page, NR_SHMEM);
- BUG_ON(page_mapped(page));
+ VM_BUG_ON_PAGE(page_mapped(page), page);

/*
* At this point page must be either written or cleaned by truncate.
--
2.1.4

2015-07-20 14:21:40

by Kirill A. Shutemov

Subject: [PATCHv9 28/36] mm, numa: skip PTE-mapped THP on numa fault

We're going to have THPs mapped with PTEs. That would confuse NUMA
balancing, so let's skip them for now.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
mm/memory.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 074edab89b52..52f6fa02c099 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3186,6 +3186,12 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

+ /* TODO: handle PTE-mapped THP */
+ if (PageCompound(page)) {
+ pte_unmap_unlock(ptep, ptl);
+ return 0;
+ }
+
/*
* Avoid grouping on RO pages in general. RO pages shouldn't hurt as
* much anyway since they can be in shared cache state. This misses
--
2.1.4

2015-07-20 14:27:37

by Kirill A. Shutemov

Subject: [PATCHv9 29/36] thp: implement split_huge_pmd()

The original split_huge_page() combined two operations: splitting PMDs into
tables of PTEs and splitting the underlying compound page. This patch
implements split_huge_pmd(), which splits the given PMD without splitting
the other PMDs that map this page or the underlying compound page.

Without tail page refcounting, the implementation of split_huge_pmd() is
pretty straightforward.
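
The mapcount bookkeeping done when one PMD is split can be sketched in userspace as follows. This only models the arithmetic in __split_huge_pmd_locked(): there are no page tables and no locking, and the struct, the `model_` names and MODEL_NR are hypothetical. Counts are stored biased by -1, as in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR 8	/* hypothetical; stands in for HPAGE_PMD_NR */

struct model_thp {
	int compound_mapcount;	/* stored biased: -1 means no PMD mapping */
	int mapcount[MODEL_NR];	/* per-subpage, stored biased the same way */
	bool double_map;	/* models PG_double_map */
};

void model_init(struct model_thp *thp, int pmd_mappings)
{
	int i;

	thp->compound_mapcount = pmd_mappings - 1;
	for (i = 0; i < MODEL_NR; i++)
		thp->mapcount[i] = -1;
	thp->double_map = false;
}

/* Bookkeeping for turning one PMD mapping into MODEL_NR PTE mappings. */
void model_split_pmd(struct model_thp *thp)
{
	int i;

	assert(thp->compound_mapcount >= 0);	/* must be PMD-mapped */

	/* Each new PTE takes a mapcount reference on its subpage. */
	for (i = 0; i < MODEL_NR; i++)
		thp->mapcount[i]++;

	/* Other PMD mappings remain: mark the page as doubly mapped. */
	if (thp->compound_mapcount + 1 > 1 && !thp->double_map) {
		thp->double_map = true;
		for (i = 0; i < MODEL_NR; i++)
			thp->mapcount[i]++;
	}

	/* Drop this PMD mapping; undo the extra reference if it was the last. */
	thp->compound_mapcount--;
	if (thp->compound_mapcount < 0 && thp->double_map) {
		thp->double_map = false;
		for (i = 0; i < MODEL_NR; i++)
			thp->mapcount[i]--;
	}
}

/* Total number of mappings: PMD mappings plus PTE mappings. */
int model_total_mapcount(const struct model_thp *thp)
{
	int i, ret = thp->compound_mapcount + 1;

	for (i = 0; i < MODEL_NR; i++)
		ret += thp->mapcount[i] + 1;
	if (thp->double_map)
		ret -= MODEL_NR;
	return ret;
}
```

Starting from two PMD mappings, the first split leaves one PMD mapping plus MODEL_NR PTE mappings with double_map set; the second split clears double_map and leaves 2 * MODEL_NR PTE mappings.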

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/huge_mm.h | 11 ++++-
mm/huge_memory.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 460441349a52..940112755591 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,7 +96,16 @@ extern unsigned long transparent_hugepage_flags;

#define split_huge_page_to_list(page, list) BUILD_BUG()
#define split_huge_page(page) BUILD_BUG()
-#define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address);
+
+#define split_huge_pmd(__vma, __pmd, __address) \
+ do { \
+ pmd_t *____pmd = (__pmd); \
+ if (pmd_trans_huge(*____pmd)) \
+ __split_huge_pmd(__vma, __pmd, __address); \
+ } while (0)

#if HPAGE_PMD_ORDER >= MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 52a20e92d51a..1f7a7288ffa3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2599,6 +2599,128 @@ static int khugepaged(void *none)
return 0;
}

+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+ unsigned long haddr, pmd_t *pmd)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int i;
+
+ /* leave pmd empty until pte is filled */
+ pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ put_huge_zero_page();
+}
+
+static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long haddr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page;
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ bool young, write;
+ int i;
+
+ VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
+ VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
+ VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
+ VM_BUG_ON(!pmd_trans_huge(*pmd));
+
+ count_vm_event(THP_SPLIT_PMD);
+
+ if (vma_is_dax(vma)) {
+ pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ return;
+ } else if (is_huge_zero_pmd(*pmd)) {
+ return __split_huge_zero_page_pmd(vma, haddr, pmd);
+ }
+
+ page = pmd_page(*pmd);
+ VM_BUG_ON_PAGE(!page_count(page), page);
+ atomic_add(HPAGE_PMD_NR - 1, &page->_count);
+ write = pmd_write(*pmd);
+ young = pmd_young(*pmd);
+
+ /* leave pmd empty until pte is filled */
+ pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t entry, *pte;
+ /*
+ * Note that NUMA hinting access restrictions are not
+ * transferred to avoid any possibility of altering
+ * permissions across VMAs.
+ */
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!write)
+ entry = pte_wrprotect(entry);
+ if (!young)
+ entry = pte_mkold(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ atomic_inc(&page[i]._mapcount);
+ pte_unmap(pte);
+ }
+
+ /*
+ * Set PG_double_map before dropping compound_mapcount to avoid
+ * false-negative page_mapped().
+ */
+ if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) {
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ atomic_inc(&page[i]._mapcount);
+ }
+
+ if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+ /* Last compound_mapcount is gone. */
+ __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ if (TestClearPageDoubleMap(page)) {
+ /* No need in mapcount reference anymore */
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ atomic_dec(&page[i]._mapcount);
+ }
+ }
+
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+}
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address)
+{
+ spinlock_t *ptl;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+
+ mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
+ ptl = pmd_lock(mm, pmd);
+ if (likely(pmd_trans_huge(*pmd)))
+ __split_huge_pmd_locked(vma, pmd, haddr);
+ spin_unlock(ptl);
+ mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
+}
+
static void split_huge_pmd_address(struct vm_area_struct *vma,
unsigned long address)
{
--
2.1.4

2015-07-20 14:27:39

by Kirill A. Shutemov

Subject: [PATCHv9 30/36] thp: add option to setup migration entries during PMD split

We are going to use migration PTE entries to stabilize page counts.
If the page is mapped with PMDs, we need to split the PMD and set up
migration entries. It's reasonable to combine these operations to avoid
double-scanning over the page table.
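
As a toy illustration of why a single pass is enough, the sketch below fills a PTE table once and picks the entry type per slot from a freeze flag. Everything here is hypothetical (the enum, the struct and the `model_` function); it does not use real page table or swap entry encodings.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR 4	/* hypothetical; stands in for HPAGE_PMD_NR */

enum model_pte_kind { MODEL_PTE_NONE, MODEL_PTE_PRESENT, MODEL_PTE_MIGRATION };

struct model_pte {
	enum model_pte_kind kind;
	int subpage;	/* index of the subpage the entry refers to */
	bool write;	/* write permission carried by the entry */
};

/*
 * Fill the PTE table for a PMD being split. With freeze set, the same
 * pass installs migration entries instead of present PTEs, so no second
 * walk over the table is needed.
 */
void model_split_pmd_entries(struct model_pte table[MODEL_NR],
			     bool pmd_write, bool freeze)
{
	int i;

	for (i = 0; i < MODEL_NR; i++) {
		assert(table[i].kind == MODEL_PTE_NONE);
		table[i].kind = freeze ? MODEL_PTE_MIGRATION : MODEL_PTE_PRESENT;
		table[i].subpage = i;
		table[i].write = pmd_write;	/* write permission is preserved */
	}
}
```

In both modes the write bit of the original PMD is carried over, so it can be restored when the migration entries are later removed.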

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
mm/huge_memory.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1f7a7288ffa3..103fa12cf3a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -25,6 +25,7 @@
#include <linux/migrate.h>
#include <linux/hashtable.h>
#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -2628,7 +2629,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
}

static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long haddr)
+ unsigned long haddr, bool freeze)
{
struct mm_struct *mm = vma->vm_mm;
struct page *page;
@@ -2670,12 +2671,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* transferred to avoid any possibility of altering
* permissions across VMAs.
*/
- entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (!write)
- entry = pte_wrprotect(entry);
- if (!young)
- entry = pte_mkold(entry);
+ if (freeze) {
+ swp_entry_t swp_entry;
+ swp_entry = make_migration_entry(page + i, write);
+ entry = swp_entry_to_pte(swp_entry);
+ } else {
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!write)
+ entry = pte_wrprotect(entry);
+ if (!young)
+ entry = pte_mkold(entry);
+ }
pte = pte_offset_map(&_pmd, haddr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, haddr, pte, entry);
@@ -2716,7 +2723,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
ptl = pmd_lock(mm, pmd);
if (likely(pmd_trans_huge(*pmd)))
- __split_huge_pmd_locked(vma, pmd, haddr);
+ __split_huge_pmd_locked(vma, pmd, haddr, false);
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
}
--
2.1.4

2015-07-20 14:27:43

by Kirill A. Shutemov

Subject: [PATCHv9 31/36] thp, mm: split_huge_page(): caller need to lock page

We're going to use migration entries instead of compound_lock() to
stabilize page refcounts. Setting up and removing migration entries
requires the page to be locked.

Some of the split_huge_page() callers already have the page locked. Let's
require everybody to lock the page before calling split_huge_page().

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
mm/memory-failure.c | 10 ++++++++--
mm/migrate.c | 8 ++++++--
2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ef33ccf37224..f32a607d1aa3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1143,15 +1143,18 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
put_page(hpage);
return -EBUSY;
}
+ lock_page(hpage);
if (unlikely(split_huge_page(hpage))) {
pr_err("MCE: %#lx: thp split failed\n", pfn);
if (TestClearPageHWPoison(p))
atomic_long_sub(nr_pages, &num_poisoned_pages);
+ unlock_page(hpage);
put_page(p);
if (p != hpage)
put_page(hpage);
return -EBUSY;
}
+ unlock_page(hpage);
VM_BUG_ON_PAGE(!page_count(p), p);
hpage = compound_head(p);
}
@@ -1714,10 +1717,13 @@ int soft_offline_page(struct page *page, int flags)
return -EBUSY;
}
if (!PageHuge(page) && PageTransHuge(hpage)) {
- if (PageAnon(hpage) && unlikely(split_huge_page(hpage))) {
+ lock_page(page);
+ ret = split_huge_page(hpage);
+ unlock_page(page);
+ if (unlikely(ret)) {
pr_info("soft offline: %#lx: failed to split THP\n",
pfn);
- return -EBUSY;
+ return ret;
}
}

diff --git a/mm/migrate.c b/mm/migrate.c
index 67970faf544d..a9dbfd356e9d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -933,9 +933,13 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
goto out;
}

- if (unlikely(PageTransHuge(page)))
- if (unlikely(split_huge_page(page)))
+ if (unlikely(PageTransHuge(page))) {
+ lock_page(page);
+ rc = split_huge_page(page);
+ unlock_page(page);
+ if (rc)
goto out;
+ }

rc = __unmap_and_move(page, newpage, force, mode);

--
2.1.4

2015-07-20 14:25:02

by Kirill A. Shutemov

Subject: [PATCHv9 32/36] thp: reintroduce split_huge_page()

This patch adds an implementation of split_huge_page() for the new
refcounting.

Unlike the previous implementation, the new split_huge_page() can fail if
somebody holds a GUP pin on the page. It also means that a pin on the page
prevents it from being split under you. That makes the situation in
many places much cleaner.

The basic scheme of split_huge_page():

- Check that page_count() is equal to the sum of mapcounts of all
subpages plus one (the caller's pin). Fail with -EBUSY otherwise. This
way we can avoid useless PMD splits.

- Freeze the page counters by splitting all PMDs and setting up
migration PTEs.

- Re-check the sum of mapcounts against page_count(). The page's counts
are stable now. Return -EBUSY if the page is pinned.

- Split the compound page.

- Unfreeze the page by removing the migration entries.
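
The scheme above can be sketched as a userspace model. This is an illustration of the control flow only, assuming the check is "page_count() equals the sum of mapcounts plus the caller's pin"; the struct and the `model_` functions are hypothetical, and freezing is reduced to a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct model_thp {
	int total_mapcount;	/* sum of mapcounts over all subpages */
	int page_count;		/* references, including the caller's pin */
	bool frozen;		/* mappings replaced by migration entries */
	bool split;		/* compound page has been split */
};

/* True if the only reference beyond the mappings is the caller's pin. */
bool model_can_split(const struct model_thp *thp)
{
	return thp->page_count == thp->total_mapcount + 1;
}

int model_split_huge_page(struct model_thp *thp)
{
	int ret = 0;

	/* Cheap early check: avoid useless PMD splits on a pinned page. */
	if (!model_can_split(thp))
		return -EBUSY;

	/* Freeze: the counts are treated as stable from here on. */
	thp->frozen = true;

	/* Re-check against the now stable counts, then split. */
	if (model_can_split(thp))
		thp->split = true;
	else
		ret = -EBUSY;

	/* Unfreeze: remove the migration entries again. */
	thp->frozen = false;
	return ret;
}
```

A page with three mappings and a page count of four (mappings plus the caller's pin) splits successfully; with one extra pin the model returns -EBUSY and leaves the page unsplit.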

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/huge_mm.h | 7 +-
include/linux/pagemap.h | 13 +-
mm/huge_memory.c | 318 ++++++++++++++++++++++++++++++++++++++++++++++++
mm/internal.h | 26 +++-
mm/rmap.c | 21 ----
5 files changed, 357 insertions(+), 28 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 940112755591..aa6c753f267e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,8 +94,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);

extern unsigned long transparent_hugepage_flags;

-#define split_huge_page_to_list(page, list) BUILD_BUG()
-#define split_huge_page(page) BUILD_BUG()
+int split_huge_page_to_list(struct page *page, struct list_head *list);
+static inline int split_huge_page(struct page *page)
+{
+ return split_huge_page_to_list(page, NULL);
+}

void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4779de5ed499..3f0da5cd8cca 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -387,10 +387,21 @@ static inline struct page *read_mapping_page(struct address_space *mapping,
*/
static inline pgoff_t page_to_pgoff(struct page *page)
{
+ pgoff_t pgoff;
+
if (unlikely(PageHeadHuge(page)))
return page->index << compound_order(page);
- else
+
+ if (likely(!PageTransTail(page)))
return page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+ /*
+ * We don't initialize ->index for tail pages: calculate based on
+ * head page
+ */
+ pgoff = page->first_page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ pgoff += page - page->first_page;
+ return pgoff;
}

/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 103fa12cf3a4..c8e497c05a9a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2795,3 +2795,321 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
split_huge_pmd_address(next, nstart);
}
}
+
+static void freeze_page_vma(struct vm_area_struct *vma, struct page *page,
+ unsigned long address)
+{
+ spinlock_t *ptl;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ int i;
+
+ pgd = pgd_offset(vma->vm_mm, address);
+ if (!pgd_present(*pgd))
+ return;
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return;
+ pmd = pmd_offset(pud, address);
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (!pmd_present(*pmd)) {
+ spin_unlock(ptl);
+ return;
+ }
+ if (pmd_trans_huge(*pmd)) {
+ if (page == pmd_page(*pmd))
+ __split_huge_pmd_locked(vma, pmd, address, true);
+ spin_unlock(ptl);
+ return;
+ }
+ spin_unlock(ptl);
+
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
+ for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE, page++) {
+ pte_t entry, swp_pte;
+ swp_entry_t swp_entry;
+
+ if (!pte_present(pte[i]))
+ continue;
+ if (page_to_pfn(page) != pte_pfn(pte[i]))
+ continue;
+ flush_cache_page(vma, address, page_to_pfn(page));
+ entry = ptep_clear_flush(vma, address, pte + i);
+ swp_entry = make_migration_entry(page, pte_write(entry));
+ swp_pte = swp_entry_to_pte(swp_entry);
+ if (pte_soft_dirty(entry))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ set_pte_at(vma->vm_mm, address, pte + i, swp_pte);
+ }
+ pte_unmap_unlock(pte, ptl);
+}
+
+static void freeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+ struct anon_vma_chain *avc;
+ pgoff_t pgoff = page_to_pgoff(page);
+
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff,
+ pgoff + HPAGE_PMD_NR - 1) {
+ unsigned long haddr;
+
+ haddr = __vma_address(page, avc->vma) & HPAGE_PMD_MASK;
+ mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
+ haddr, haddr + HPAGE_PMD_SIZE);
+ freeze_page_vma(avc->vma, page, haddr);
+ mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
+ haddr, haddr + HPAGE_PMD_SIZE);
+ }
+}
+
+static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
+ unsigned long address)
+{
+ spinlock_t *ptl;
+ pmd_t *pmd;
+ pte_t *pte, entry;
+ swp_entry_t swp_entry;
+ int i;
+
+ pmd = mm_find_pmd(vma->vm_mm, address);
+ if (!pmd)
+ return;
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
+ for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE, page++) {
+ if (!page_mapped(page))
+ continue;
+ if (!is_swap_pte(pte[i]))
+ continue;
+
+ swp_entry = pte_to_swp_entry(pte[i]);
+ if (!is_migration_entry(swp_entry))
+ continue;
+ if (migration_entry_to_page(swp_entry) != page)
+ continue;
+
+ entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
+ if (is_write_migration_entry(swp_entry))
+ entry = maybe_mkwrite(entry, vma);
+
+ flush_dcache_page(page);
+ set_pte_at(vma->vm_mm, address, pte + i, entry);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, pte + i);
+ }
+ pte_unmap_unlock(pte, ptl);
+}
+
+static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+ struct anon_vma_chain *avc;
+ pgoff_t pgoff = page_to_pgoff(page);
+
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+ pgoff, pgoff + HPAGE_PMD_NR - 1) {
+ unsigned long address = __vma_address(page, avc->vma);
+
+ mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
+ address, address + HPAGE_PMD_SIZE);
+ unfreeze_page_vma(avc->vma, page, address);
+ mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
+ address, address + HPAGE_PMD_SIZE);
+ }
+}
+
+static int total_mapcount(struct page *page)
+{
+ int i, ret;
+
+ ret = compound_mapcount(page);
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ ret += atomic_read(&page[i]._mapcount) + 1;
+
+ if (PageDoubleMap(page))
+ ret -= HPAGE_PMD_NR;
+
+ return ret;
+}
+
+static int __split_huge_page_tail(struct page *head, int tail,
+ struct lruvec *lruvec, struct list_head *list)
+{
+ int mapcount;
+ struct page *page_tail = head + tail;
+
+ mapcount = atomic_read(&page_tail->_mapcount) + 1;
+ VM_BUG_ON_PAGE(atomic_read(&page_tail->_count) != 0, page_tail);
+
+ /*
+ * tail_page->_count is zero and not changing from under us. But
+ * get_page_unless_zero() may be running from under us on the
+ * tail_page. If we used atomic_set() below instead of atomic_add(), we
+ * would then run atomic_set() concurrently with
+ * get_page_unless_zero(), and atomic_set() is implemented in C not
+ * using locked ops. spin_unlock on x86 sometime uses locked ops
+ * because of PPro errata 66, 92, so unless somebody can guarantee
+ * atomic_set() here would be safe on all archs (and not only on x86),
+ * it's safer to use atomic_add().
+ */
+ atomic_add(mapcount + 1, &page_tail->_count);
+
+ /* after clearing PageTail the gup refcount can be released */
+ smp_mb__after_atomic();
+
+ /*
+ * retain hwpoison flag of the poisoned tail page:
+ * fix for the unsuitable process killed on Guest Machine(KVM)
+ * by the memory-failure.
+ */
+ page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
+ page_tail->flags |= (head->flags &
+ ((1L << PG_referenced) |
+ (1L << PG_swapbacked) |
+ (1L << PG_mlocked) |
+ (1L << PG_uptodate) |
+ (1L << PG_active) |
+ (1L << PG_locked) |
+ (1L << PG_unevictable)));
+ page_tail->flags |= (1L << PG_dirty);
+
+ /* clear PageTail before overwriting first_page */
+ smp_wmb();
+
+ /* ->mapping in first tail page is compound_mapcount */
+ VM_BUG_ON_PAGE(tail != 1 && page_tail->mapping != TAIL_MAPPING,
+ page_tail);
+ page_tail->mapping = head->mapping;
+
+ page_tail->index = head->index + tail;
+ page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
+ lru_add_page_tail(head, page_tail, lruvec, list);
+
+ return mapcount;
+}
+
+static void __split_huge_page(struct page *page, struct list_head *list)
+{
+ struct page *head = compound_head(page);
+ struct zone *zone = page_zone(head);
+ struct lruvec *lruvec;
+ int i, tail_mapcount;
+
+ /* prevent PageLRU to go away from under us, and freeze lru stats */
+ spin_lock_irq(&zone->lru_lock);
+ lruvec = mem_cgroup_page_lruvec(head, zone);
+
+ /* complete memcg works before add pages to LRU */
+ mem_cgroup_split_huge_fixup(head);
+
+ tail_mapcount = 0;
+ for (i = HPAGE_PMD_NR - 1; i >= 1; i--)
+ tail_mapcount += __split_huge_page_tail(head, i, lruvec, list);
+ atomic_sub(tail_mapcount, &head->_count);
+
+ ClearPageCompound(head);
+ spin_unlock_irq(&zone->lru_lock);
+
+ unfreeze_page(page_anon_vma(head), head);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ struct page *subpage = head + i;
+ if (subpage == page)
+ continue;
+ unlock_page(subpage);
+
+ /*
+ * Subpages may be freed if there wasn't any mapping
+ * like if add_to_swap() is running on a lru page that
+ * had its mapping zapped. And freeing these pages
+ * requires taking the lru_lock so we do the put_page
+ * of the tail pages after the split is complete.
+ */
+ put_page(subpage);
+ }
+}
+
+/*
+ * This function splits huge page into normal pages. @page can point to any
+ * subpage of huge page to split. Split doesn't change the position of @page.
+ *
+ * The caller must hold the only pin on @page, otherwise split fails with -EBUSY.
+ * The huge page must be locked.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * Both head page and tail pages will inherit mapping, flags, and so on from
+ * the hugepage.
+ *
+ * GUP pin and PG_locked are transferred to @page. The remaining subpages can
+ * be freed if they are not mapped.
+ *
+ * Returns 0 if the hugepage is split successfully.
+ * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
+ * us.
+ */
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+ struct page *head = compound_head(page);
+ struct anon_vma *anon_vma;
+ int count, mapcount, ret;
+
+ VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+ /*
+ * The caller does not necessarily hold an mmap_sem that would prevent
+ * the anon_vma disappearing so we first take a reference to it
+ * and then lock the anon_vma for write. This is similar to
+ * page_lock_anon_vma_read except the write lock is taken to serialise
+ * against parallel split or collapse operations.
+ */
+ anon_vma = page_get_anon_vma(head);
+ if (!anon_vma) {
+ ret = -EBUSY;
+ goto out;
+ }
+ anon_vma_lock_write(anon_vma);
+
+ /*
+ * Racy check if we can split the page, before freeze_page() will
+ * split PMDs
+ */
+ if (total_mapcount(head) != page_count(head) - 1) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ freeze_page(anon_vma, head);
+ VM_BUG_ON_PAGE(compound_mapcount(head), head);
+
+ count = page_count(head);
+ mapcount = total_mapcount(head);
+ if (mapcount == count - 1) {
+ __split_huge_page(page, list);
+ ret = 0;
+ } else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount > count - 1) {
+ pr_alert("total_mapcount: %u, page_count(): %u\n",
+ mapcount, count);
+ if (PageTail(page))
+ dump_page(head, NULL);
+ dump_page(page, "total_mapcount(head) > page_count(head) - 1");
+ BUG();
+ } else {
+ unfreeze_page(anon_vma, head);
+ ret = -EBUSY;
+ }
+
+out_unlock:
+ anon_vma_unlock_write(anon_vma);
+ put_anon_vma(anon_vma);
+out:
+ count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
+ return ret;
+}
diff --git a/mm/internal.h b/mm/internal.h
index c3384ad89f62..fa02f21c3a0c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -13,6 +13,7 @@

#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
@@ -245,10 +246,27 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)

extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-extern unsigned long vma_address(struct page *page,
- struct vm_area_struct *vma);
-#endif
+/*
+ * At what user virtual address is page expected in @vma?
+ */
+static inline unsigned long
+__vma_address(struct page *page, struct vm_area_struct *vma)
+{
+ pgoff_t pgoff = page_to_pgoff(page);
+ return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+}
+
+static inline unsigned long
+vma_address(struct page *page, struct vm_area_struct *vma)
+{
+ unsigned long address = __vma_address(page, vma);
+
+ /* page should be within @vma mapping range */
+ VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+
+ return address;
+}
+
#else /* !CONFIG_MMU */
static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page) { }
diff --git a/mm/rmap.c b/mm/rmap.c
index ed89c6256579..e2b23a3e5fc5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -564,27 +564,6 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
anon_vma_unlock_read(anon_vma);
}

-/*
- * At what user virtual address is page expected in @vma?
- */
-static inline unsigned long
-__vma_address(struct page *page, struct vm_area_struct *vma)
-{
- pgoff_t pgoff = page_to_pgoff(page);
- return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-}
-
-inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
-{
- unsigned long address = __vma_address(page, vma);
-
- /* page should be within @vma mapping range */
- VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-
- return address;
-}
-
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
static void percpu_flush_tlb_batch_pages(void *data)
{
--
2.1.4
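
The split precondition in the patch above compares total_mapcount(head) against page_count(head) - 1. Here is a minimal userspace sketch of that arithmetic, not the kernel code itself. The struct thp_model type and the model_* helpers are hypothetical names made up for illustration, and NR_SUBPAGES assumes HPAGE_PMD_NR is 512 (x86-64 with 4k base pages).

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SUBPAGES 512 /* assumption: HPAGE_PMD_NR on x86-64 with 4k pages */

/* Hypothetical stand-in for the page fields that total_mapcount() reads. */
struct thp_model {
    int compound_mapcount;     /* number of PMD mappings of the whole page */
    int mapcount[NR_SUBPAGES]; /* raw per-subpage ->_mapcount, starts at -1 */
    bool double_map;           /* models PG_double_map */
    int page_count;            /* models the head page's ->_count */
};

static void model_init(struct thp_model *p)
{
    p->compound_mapcount = 0;
    p->double_map = false;
    p->page_count = 0;
    for (int i = 0; i < NR_SUBPAGES; i++)
        p->mapcount[i] = -1; /* -1 means "not mapped by any PTE" */
}

/* Same arithmetic as total_mapcount() in the patch. */
static int model_total_mapcount(const struct thp_model *p)
{
    int ret = p->compound_mapcount;

    for (int i = 0; i < NR_SUBPAGES; i++)
        ret += p->mapcount[i] + 1;

    /* With PG_double_map set, every subpage carries one extra count. */
    if (p->double_map)
        ret -= NR_SUBPAGES;

    return ret;
}

/* The split may proceed only if the caller's pin is the sole extra reference. */
static bool model_can_split(const struct thp_model *p)
{
    return model_total_mapcount(p) == p->page_count - 1;
}
```

With one PMD mapping and only the caller's pin (page_count of 2) the check passes. Any additional reference, such as a GUP pin, makes it fail, which corresponds to the -EBUSY path in split_huge_page_to_list().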

2015-07-20 14:26:47

by Kirill A. Shutemov

Subject: [PATCHv9 33/36] migrate_pages: try to split pages on queuing

We are not able to migrate THPs, so splitting only the PMD on migration is
not enough: we need to split the compound page under it too.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
mm/mempolicy.c | 37 +++++++++++++++++++++++++++++++++----
1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b6122c0f613d..f815d7dfd4ad 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -489,14 +489,31 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- int nid;
+ int nid, ret;
pte_t *pte;
spinlock_t *ptl;

- split_huge_pmd(vma, pmd, addr);
- if (pmd_trans_unstable(pmd))
- return 0;
+ if (pmd_trans_huge(*pmd)) {
+ ptl = pmd_lock(walk->mm, pmd);
+ if (pmd_trans_huge(*pmd)) {
+ page = pmd_page(*pmd);
+ if (is_huge_zero_page(page)) {
+ spin_unlock(ptl);
+ split_huge_pmd(vma, pmd, addr);
+ } else {
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (ret)
+ return 0;
+ }
+ }
+ }

+retry:
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
@@ -513,6 +530,18 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
nid = page_to_nid(page);
if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
continue;
+ if (PageTail(page) && PageAnon(page)) {
+ get_page(page);
+ pte_unmap_unlock(pte - 1, ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ /* Failed to split -- skip. */
+ if (ret)
+ continue;
+ goto retry;
+ }

if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
migrate_page_add(page, qp->pagelist, flags);
--
2.1.4

2015-07-20 14:24:57

by Kirill A. Shutemov

Subject: [PATCHv9 34/36] thp: introduce deferred_split_huge_page()

Currently we don't split a huge page on partial unmap. That is not ideal:
it can lead to memory overhead.

Fortunately, we can detect partial unmap in page_remove_rmap(). But we
cannot call split_huge_page() from there due to the locking context.

It's also counterproductive to split directly from the munmap() codepath:
in many cases we will hit this from exit(2), and splitting the huge page
just to free it up in small pages is not what we really want.

The patch introduces deferred_split_huge_page(), which puts the huge page
into a queue for splitting. The splitting itself will happen when we get
memory pressure, via the shrinker interface. The page will be dropped from
the list on freeing, through the compound page destructor.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/huge_mm.h | 4 ++
include/linux/mm.h | 2 +
mm/huge_memory.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++--
mm/migrate.c | 1 +
mm/page_alloc.c | 2 +-
mm/rmap.c | 7 ++-
6 files changed, 138 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index aa6c753f267e..ce5b9756570d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,11 +94,14 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);

extern unsigned long transparent_hugepage_flags;

+extern void prep_transhuge_page(struct page *page);
+
int split_huge_page_to_list(struct page *page, struct list_head *list);
static inline int split_huge_page(struct page *page)
{
return split_huge_page_to_list(page, NULL);
}
+void deferred_split_huge_page(struct page *page);

void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address);
@@ -177,6 +180,7 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
+static inline void deferred_split_huge_page(struct page *page) {}
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b41b1d8b9072..b3c19bdc62f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -515,6 +515,8 @@ static inline void set_compound_order(struct page *page, unsigned long order)
page[1].compound_order = order;
}

+void free_compound_page(struct page *page);
+
#ifdef CONFIG_MMU
/*
* Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c8e497c05a9a..d32277463932 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -72,6 +72,8 @@ static int khugepaged(void *none);
static int khugepaged_slab_init(void);
static void khugepaged_slab_exit(void);

+static void free_transhuge_page(struct page *page);
+
#define MM_SLOTS_HASH_BITS 10
static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);

@@ -106,6 +108,10 @@ static struct khugepaged_scan khugepaged_scan = {
.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
};

+static DEFINE_SPINLOCK(split_queue_lock);
+static LIST_HEAD(split_queue);
+static unsigned long split_queue_len;
+static struct shrinker deferred_split_shrinker;

static void set_recommended_min_free_kbytes(void)
{
@@ -638,6 +644,9 @@ static int __init hugepage_init(void)
err = register_shrinker(&huge_zero_page_shrinker);
if (err)
goto err_hzp_shrinker;
+ err = register_shrinker(&deferred_split_shrinker);
+ if (err)
+ goto err_split_shrinker;

/*
* By default disable transparent hugepages on smaller systems,
@@ -655,6 +664,8 @@ static int __init hugepage_init(void)

return 0;
err_khugepaged:
+ unregister_shrinker(&deferred_split_shrinker);
+err_split_shrinker:
unregister_shrinker(&huge_zero_page_shrinker);
err_hzp_shrinker:
khugepaged_slab_exit();
@@ -711,6 +722,19 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
return entry;
}

+void prep_transhuge_page(struct page *page)
+{
+ /* we use page->lru in second tail page: assuming THP order >= 2 */
+ BUILD_BUG_ON(HPAGE_PMD_ORDER < 2);
+
+ /*
+ * ->lru in the first tail page is occupied by destructor
+ * and order of the compound page
+ */
+ INIT_LIST_HEAD(&page[2].lru);
+ set_compound_page_dtor(page, free_transhuge_page);
+}
+
static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
@@ -867,6 +891,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
+ prep_transhuge_page(page);
return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp,
flags);
}
@@ -1163,7 +1188,9 @@ alloc:
} else
new_page = NULL;

- if (unlikely(!new_page)) {
+ if (likely(new_page)) {
+ prep_transhuge_page(new_page);
+ } else {
if (!page) {
split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
@@ -2097,6 +2124,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
return NULL;
}

+ prep_transhuge_page(*hpage);
count_vm_event(THP_COLLAPSE_ALLOC);
return *hpage;
}
@@ -2108,8 +2136,12 @@ static int khugepaged_find_target_node(void)

static inline struct page *alloc_hugepage(int defrag)
{
- return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
- HPAGE_PMD_ORDER);
+ struct page *page;
+
+ page = alloc_pages(alloc_hugepage_gfpmask(defrag, 0), HPAGE_PMD_ORDER);
+ if (page)
+ prep_transhuge_page(page);
+ return page;
}

static struct page *khugepaged_alloc_hugepage(bool *wait)
@@ -3002,6 +3034,13 @@ static void __split_huge_page(struct page *page, struct list_head *list)
spin_lock_irq(&zone->lru_lock);
lruvec = mem_cgroup_page_lruvec(head, zone);

+ spin_lock(&split_queue_lock);
+ if (!list_empty(&head[2].lru)) {
+ split_queue_len--;
+ list_del(&head[2].lru);
+ }
+ spin_unlock(&split_queue_lock);
+
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);

@@ -3113,3 +3152,85 @@ out:
count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
return ret;
}
+
+static void free_transhuge_page(struct page *page)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&split_queue_lock, flags);
+ if (!list_empty(&page[2].lru)) {
+ split_queue_len--;
+ list_del(&page[2].lru);
+ }
+ spin_unlock_irqrestore(&split_queue_lock, flags);
+ free_compound_page(page);
+}
+
+void deferred_split_huge_page(struct page *page)
+{
+ unsigned long flags;
+
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+
+ spin_lock_irqsave(&split_queue_lock, flags);
+ if (list_empty(&page[2].lru)) {
+ list_add_tail(&page[2].lru, &split_queue);
+ split_queue_len++;
+ }
+ spin_unlock_irqrestore(&split_queue_lock, flags);
+}
+
+static unsigned long deferred_split_count(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ /*
+ * Splitting a page from split_queue will free up at least one page,
+ * at most HPAGE_PMD_NR - 1. We don't track exact number.
+ * Let's use HPAGE_PMD_NR / 2 as ballpark.
+ */
+ return ACCESS_ONCE(split_queue_len) * HPAGE_PMD_NR / 2;
+}
+
+static unsigned long deferred_split_scan(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ unsigned long flags;
+ LIST_HEAD(list);
+ struct page *page, *next;
+ int split = 0;
+
+ spin_lock_irqsave(&split_queue_lock, flags);
+ list_splice_init(&split_queue, &list);
+
+ /* Take pin on all head pages to avoid freeing them under us */
+ list_for_each_entry_safe(page, next, &list, lru) {
+ page = compound_head(page);
+ /* race with put_compound_page() */
+ if (!get_page_unless_zero(page)) {
+ list_del_init(&page[2].lru);
+ split_queue_len--;
+ }
+ }
+ spin_unlock_irqrestore(&split_queue_lock, flags);
+
+ list_for_each_entry_safe(page, next, &list, lru) {
+ lock_page(page);
+ /* split_huge_page() removes page from list on success */
+ if (!split_huge_page(page))
+ split++;
+ unlock_page(page);
+ put_page(page);
+ }
+
+ spin_lock_irqsave(&split_queue_lock, flags);
+ list_splice_tail(&list, &split_queue);
+ spin_unlock_irqrestore(&split_queue_lock, flags);
+
+ return split * HPAGE_PMD_NR / 2;
+}
+
+static struct shrinker deferred_split_shrinker = {
+ .count_objects = deferred_split_count,
+ .scan_objects = deferred_split_scan,
+ .seeks = DEFAULT_SEEKS,
+};
diff --git a/mm/migrate.c b/mm/migrate.c
index a9dbfd356e9d..91bb504a8cc8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1736,6 +1736,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
HPAGE_PMD_ORDER);
if (!new_page)
goto out_fail;
+ prep_transhuge_page(new_page);

isolated = numamigrate_isolate_page(pgdat, page);
if (!isolated) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cfd3a34e41f1..c78cc0ff99bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -428,7 +428,7 @@ out:
* This usage means that zero-order pages may not be compound.
*/

-static void free_compound_page(struct page *page)
+void free_compound_page(struct page *page)
{
__free_pages_ok(page, compound_order(page));
}
diff --git a/mm/rmap.c b/mm/rmap.c
index e2b23a3e5fc5..df8b6ded9b80 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1309,8 +1309,10 @@ static void page_remove_anon_compound_rmap(struct page *page)
nr = HPAGE_PMD_NR;
}

- if (nr)
+ if (nr) {
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+ deferred_split_huge_page(page);
+ }
}

/**
@@ -1345,6 +1347,9 @@ void page_remove_rmap(struct page *page, bool compound)
if (unlikely(PageMlocked(page)))
clear_page_mlock(page);

+ if (PageTransCompound(page))
+ deferred_split_huge_page(compound_head(page));
+
/*
* It would be tidy to reset the PageAnon mapping here,
* but that might overwrite a racing page_add_anon_rmap
--
2.1.4
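
The queueing logic in the patch above can be sketched in userspace as follows. This is a simplified model, not the kernel implementation: the list helpers, struct split_queue_model, struct thp_model and all model_* functions are hypothetical names, locking (split_queue_lock in the patch) is left out, and NR_SUBPAGES assumes HPAGE_PMD_NR is 512. The self-linked node check mirrors how the patch uses list_empty() on page[2].lru to tell whether a page is already queued.

```c
#include <assert.h>

#define NR_SUBPAGES 512 /* assumption: HPAGE_PMD_NR on x86-64 with 4k pages */

/* Minimal circular doubly linked list, in the spirit of the kernel's list_head. */
struct list_node {
    struct list_node *prev, *next;
};

static void node_init(struct list_node *n) { n->prev = n->next = n; }
static int node_unlinked(const struct list_node *n) { return n->next == n; }

static void node_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void node_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
}

struct split_queue_model {
    struct list_node head;
    unsigned long len;
};

struct thp_model {
    struct list_node deferred; /* plays the role of page[2].lru */
};

static void model_queue_init(struct split_queue_model *q)
{
    node_init(&q->head);
    q->len = 0;
}

/* Counterpart of prep_transhuge_page(): start out "not queued". */
static void model_prep(struct thp_model *p) { node_init(&p->deferred); }

/* Queue the page once; repeated calls for the same page are no-ops. */
static void model_deferred_split(struct split_queue_model *q, struct thp_model *p)
{
    if (node_unlinked(&p->deferred)) {
        node_add_tail(&p->deferred, &q->head);
        q->len++;
    }
}

/* Drop the page from the queue, as done on free or on a successful split. */
static void model_dequeue(struct split_queue_model *q, struct thp_model *p)
{
    if (!node_unlinked(&p->deferred)) {
        q->len--;
        node_del_init(&p->deferred);
    }
}

/* Ballpark of reclaimable base pages, as in deferred_split_count(). */
static unsigned long model_split_count(const struct split_queue_model *q)
{
    return q->len * NR_SUBPAGES / 2;
}
```

Because partial unmaps of the same huge page can be reported many times, the "already queued" check is what keeps the queue length equal to the number of distinct pages rather than the number of unmap events.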

2015-07-20 14:26:50

by Kirill A. Shutemov

Subject: [PATCHv9 35/36] mm: re-enable THP

All parts of THP with the new refcounting are now in place, so we can allow
THP to be enabled again.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
---
mm/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index c973f416cbe5..e79de2bd12cd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -410,7 +410,7 @@ config NOMMU_INITIAL_TRIM_EXCESS

config TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
- depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && BROKEN
+ depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
select COMPACTION
help
Transparent Hugepages allows the kernel to use huge pages and
--
2.1.4

2015-07-20 14:24:56

by Kirill A. Shutemov

Subject: [PATCHv9 36/36] thp: update documentation

The patch updates Documentation/vm/transhuge.txt to reflect changes in
THP design.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 151 ++++++++++++++++++++++++++---------------
1 file changed, 96 insertions(+), 55 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 8143b9e8373d..b0cc5f8f161f 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -35,10 +35,10 @@ miss is going to run faster.

== Design ==

-- "graceful fallback": mm components which don't have transparent
- hugepage knowledge fall back to breaking a transparent hugepage and
- working on the regular pages and their respective regular pmd/pte
- mappings
+- "graceful fallback": mm components which don't have transparent hugepage
+ knowledge fall back to breaking the huge pmd mapping into a table of ptes
+ and, if necessary, splitting the transparent hugepage. These components
+ can then continue working on the regular pages or regular pte mappings.

- if a hugepage allocation fails because of memory fragmentation,
regular pages should be gracefully allocated instead and mixed in
@@ -211,9 +211,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
the allocation.

-thp_split is incremented every time a huge page is split into base
+thp_split_page is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.
+ This action implies splitting all PMDs the page is mapped with.
+
+thp_split_page_failed is incremented if the kernel fails to split a huge
+ page. This can happen if the page was pinned by somebody.
+
+thp_split_pmd is incremented every time a PMD is split into a table of PTEs.
+ This can happen, for instance, when an application calls mprotect() or
+ munmap() on part of a huge page. It doesn't split the huge page, only
+ the page table entry.

thp_zero_page_alloc is incremented every time a huge zero page is
successfully allocated. It includes allocations which where
@@ -264,10 +273,8 @@ is complete, so they won't ever notice the fact the page is huge. But
if any driver is going to mangle over the page structure of the tail
page (like for checking page->mapping or other bits that are relevant
for the head page and not the tail page), it should be updated to jump
-to check head page instead (while serializing properly against
-split_huge_page() to avoid the head and tail pages to disappear from
-under it, see the futex code to see an example of that, hugetlbfs also
-needed special handling in futex code for similar reasons).
+to check head page instead. Taking a reference on any head/tail page
+prevents the page from being split by anyone.

NOTE: these aren't new constraints to the GUP API, and they match the
same constrains that applies to hugetlbfs too, so any driver capable
@@ -302,9 +309,9 @@ unaffected. libhugetlbfs will also work fine as usual.
== Graceful fallback ==

Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
+split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
-by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one liner change, you can avoid to write
hundred if not thousand of lines of complex code to make your code
@@ -313,7 +320,8 @@ hugepage aware.
If you're not walking pagetables but you run into a physical hugepage
but you can't handle it natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
-it tries to swapout the hugepage for example.
+it tries to swapout the hugepage for example. split_huge_page() can fail
+if the page is pinned and you must handle this correctly.

Example to make mremap.c transparent hugepage aware with a one liner
change:
@@ -325,14 +333,14 @@ diff --git a/mm/mremap.c b/mm/mremap.c
return NULL;

pmd = pmd_offset(pud, addr);
-+ split_huge_page_pmd(vma, addr, pmd);
++ split_huge_pmd(vma, pmd, addr);
if (pmd_none_or_clear_bad(pmd))
return NULL;

== Locking in hugepage aware code ==

We want as much code as possible hugepage aware, as calling
-split_huge_page() or split_huge_page_pmd() has a cost.
+split_huge_page() or split_huge_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
@@ -341,47 +349,80 @@ created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_sem in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fallback in the old code
paths. If instead pmd_trans_huge returns true, you have to take the
-mm->page_table_lock and re-run pmd_trans_huge. Taking the
-page_table_lock will prevent the huge pmd to be converted into a
-regular pmd from under you (split_huge_page can run in parallel to the
+page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
+page table lock will prevent the huge pmd to be converted into a
+regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
-should just drop the page_table_lock and fallback to the old code as
-before. Otherwise you should run pmd_trans_splitting on the pmd. In
-case pmd_trans_splitting returns true, it means split_huge_page is
-already in the middle of splitting the page. So if pmd_trans_splitting
-returns true it's enough to drop the page_table_lock and call
-wait_split_huge_page and then fallback the old code paths. You are
-guaranteed by the time wait_split_huge_page returns, the pmd isn't
-huge anymore. If pmd_trans_splitting returns false, you can proceed to
-process the huge pmd and the hugepage natively. Once finished you can
-drop the page_table_lock.
-
-== compound_lock, get_user_pages and put_page ==
+should just drop the page table lock and fallback to the old code as
+before. Otherwise you can proceed to process the huge pmd and the
+hugepage natively. Once finished you can drop the page table lock.
+
+== Refcounts and transparent huge pages ==
+
+Refcounting on THP is mostly consistent with refcounting on other compound
+pages:
+
+ - get_page()/put_page() and GUP operate on the head page's ->_count.
+
+ - ->_count in tail pages is always zero: get_page_unless_zero() never
+ succeeds on tail pages.
+
+ - map/unmap of the pages with a PTE entry increments/decrements ->_mapcount
+ on the relevant sub-page of the compound page.
+
+ - map/unmap of the whole compound page is accounted in compound_mapcount
+ (stored in the first tail page).
+
+PageDoubleMap() indicates that ->_mapcount in all subpages is offset up by one.
+This additional reference is required to get race-free detection of unmap of
+subpages when we have them mapped with both PMDs and PTEs.
+
+This is an optimization required to lower the overhead of per-subpage
+mapcount tracking. The alternative is to alter ->_mapcount in all subpages
+on each map/unmap of the whole compound page.
+
+We set PG_double_map when a PMD of the page is split for the first time
+while the page still has a PMD mapping. The additional references go away
+with the last compound_mapcount.

split_huge_page internally has to distribute the refcounts in the head
-page to the tail pages before clearing all PG_head/tail bits from the
-page structures. It can do that easily for refcounts taken by huge pmd
-mappings. But the GUI API as created by hugetlbfs (that returns head
-and tail pages if running get_user_pages on an address backed by any
-hugepage), requires the refcount to be accounted on the tail pages and
-not only in the head pages, if we want to be able to run
-split_huge_page while there are gup pins established on any tail
-page. Failure to be able to run split_huge_page if there's any gup pin
-on any tail page, would mean having to split all hugepages upfront in
-get_user_pages which is unacceptable as too many gup users are
-performance critical and they must work natively on hugepages like
-they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be split so there wouldn't be requirement of
-accounting the pins on the tail pages for hugetlbfs). If we wouldn't
-account the gup refcounts on the tail pages during gup, we won't know
-anymore which tail page is pinned by gup and which is not while we run
-split_huge_page. But we still have to add the gup pin to the head page
-too, to know when we can free the compound page in case it's never
-split during its lifetime. That requires changing not just
-get_page, but put_page as well so that when put_page runs on a tail
-page (and only on a tail page) it will find its respective head page,
-and then it will decrease the head page refcount in addition to the
-tail page refcount. To obtain a head page reliably and to decrease its
-refcount without race conditions, put_page has to serialize against
-__split_huge_page_refcount using a special per-page lock called
-compound_lock.
+page to the tail pages before clearing all PG_head/tail bits from the page
+structures. It can be done easily for refcounts taken by page table
+entries. But we don't have enough information on how to distribute any
+additional pins (e.g. from get_user_pages). split_huge_page() fails any
+request to split a pinned huge page: it expects the page count to be equal
+to the sum of mapcounts of all sub-pages plus one (the split_huge_page
+caller must hold a reference on the head page).
+
+split_huge_page uses migration entries to stabilize page->_count and
+page->_mapcount.
+
+We are safe against physical memory scanners too: the only legitimate way
+a scanner can get a reference to a page is get_page_unless_zero().
+
+All tail pages have zero ->_count until atomic_add(). This prevents a scanner
+from getting a reference to a tail page up to that point. After the
+atomic_add() we don't care about the ->_count value. We already know how
+many references we should uncharge from the head page.
+
+For the head page get_page_unless_zero() will succeed and we don't mind. It's
+clear where the reference should go after split: it stays on the head page.
+
+Note that split_huge_pmd() doesn't have any limitation on refcounting:
+pmd can be split at any point and never fails.
+
+== Partial unmap and deferred_split_huge_page() ==
+
+Unmapping part of THP (with munmap() or other way) is not going to free
+memory immediately. Instead, we detect that a subpage of THP is not in use
+in page_remove_rmap() and queue the THP for splitting if memory pressure
+comes. Splitting will free up unused subpages.
+
+Splitting the page right away is not an option due to the locking context in
+the place where we can detect partial unmap. It also might be
+counterproductive since in many cases partial unmap happens during
+exit(2) if a THP crosses a VMA boundary.
+
+Function deferred_split_huge_page() is used to queue page for splitting.
+The splitting itself will happen when we get memory pressure via shrinker
+interface.
--
2.1.4
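
The scanner argument in the documentation above relies on get_page_unless_zero()
refusing to take a reference on a page whose count is zero. Below is a minimal
userspace sketch of that idea using C11 atomics. The struct page_model type and
the model_get_page_unless_zero() name are hypothetical stand-ins, not the
kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count. */
struct page_model {
	atomic_int count;
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns true when the reference was taken.
 */
static bool model_get_page_unless_zero(struct page_model *page)
{
	int old = atomic_load(&page->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&page->count, &old, old + 1))
			return true;
	}
	return false;
}
```

In this model a tail page with a zero count can never be pinned by a scanner,
while a head page with a non-zero count can, which matches the two cases the
text distinguishes.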

2015-07-31 14:45:33

by Jerome Marchand

Subject: Re: [PATCHv9 02/36] rmap: add argument to charge compound page

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of THP compound
> page. It means we cannot rely on PageTransHuge() check to decide if
> map/unmap small page or THP.
>
> The patch adds new argument to rmap functions to indicate whether we want
> to operate on whole compound page or only the small page.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> include/linux/rmap.h | 12 +++++++++---
> kernel/events/uprobes.c | 4 ++--
> mm/huge_memory.c | 16 ++++++++--------
> mm/hugetlb.c | 4 ++--
> mm/ksm.c | 4 ++--
> mm/memory.c | 14 +++++++-------
> mm/migrate.c | 8 ++++----
> mm/rmap.c | 48 +++++++++++++++++++++++++++++++-----------------
> mm/swapfile.c | 4 ++--
> mm/userfaultfd.c | 2 +-
> 10 files changed, 68 insertions(+), 48 deletions(-)




2015-07-31 14:46:49

by Jerome Marchand

Subject: Re: [PATCHv9 03/36] memcg: adjust to support new THP refcounting

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> As with rmap, with new refcounting we cannot rely on PageTransHuge() to
> check if we need to charge size of huge page from the cgroup. We need to
> get information from caller to know whether it was mapped with PMD or
> PTE.
>
> We do uncharge when last reference on the page is gone. At that point if
> we see PageTransHuge() it means we need to uncharge whole huge page.
>
> The tricky part is partial unmap -- when we try to unmap part of huge
> page. We don't do any special handling of this situation, meaning we don't
> uncharge the part of huge page unless last user is gone or
> split_huge_page() is triggered. If cgroup memory pressure happens, the
> partially unmapped page will be split through the shrinker. This
> should be good enough.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> include/linux/memcontrol.h | 16 +++++++-----
> kernel/events/uprobes.c | 7 +++---
> mm/filemap.c | 8 +++---
> mm/huge_memory.c | 33 ++++++++++++------------
> mm/memcontrol.c | 62 +++++++++++++++++-----------------------------
> mm/memory.c | 28 ++++++++++-----------
> mm/shmem.c | 21 +++++++++-------
> mm/swapfile.c | 9 ++++---
> mm/userfaultfd.c | 6 ++---
> 9 files changed, 92 insertions(+), 98 deletions(-)




2015-07-31 14:47:49

by Jerome Marchand

Subject: Re: [PATCHv9 05/36] mm: adjust FOLL_SPLIT for new refcounting

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> We need to prepare kernel to allow transhuge pages to be mapped with
> ptes too. We need to handle FOLL_SPLIT in follow_page_pte().
>
> Also we use split_huge_page() directly instead of split_huge_page_pmd().
> split_huge_page_pmd() will be gone.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> mm/gup.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 49 insertions(+), 18 deletions(-)
>




2015-07-31 14:50:28

by Jerome Marchand

Subject: Re: [PATCHv9 11/36] mm: temporally mark THP broken

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> Up to this point we tried to keep the patchset bisectable, but the next
> patches are going to change how the core of THP refcounting works.
>
> It would be beneficial to split the change into several patches and make
> it more reviewable. Unfortunately, I don't see how we can achieve that
> while keeping THP working.
>
> Let's hide THP under CONFIG_BROKEN for now and bring it back when the new
> refcounting gets established.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> mm/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e79de2bd12cd..c973f416cbe5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -410,7 +410,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
>
> config TRANSPARENT_HUGEPAGE
> bool "Transparent Hugepage Support"
> - depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
> + depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && BROKEN
> select COMPACTION
> help
> Transparent Hugepages allows the kernel to use huge pages and
>




2015-07-31 14:51:04

by Jerome Marchand

Subject: Re: [PATCHv9 12/36] thp: drop all split_huge_page()-related code

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> We will re-introduce new version with new refcounting later in patchset.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> include/linux/huge_mm.h | 28 +---
> mm/huge_memory.c | 400 +-----------------------------------------------
> 2 files changed, 7 insertions(+), 421 deletions(-)
>





2015-07-31 14:51:52

by Jerome Marchand

Subject: Re: [PATCHv9 13/36] mm: drop tail page refcounting

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> Tail page refcounting is utterly complicated and painful to support.
>
> It uses ->_mapcount on tail pages to store how many times this page is
> pinned. get_page() bumps ->_mapcount on tail page in addition to
> ->_count on head. This information is required by split_huge_page() to
> be able to distribute pins from head of compound page to tails during
> the split.
>
> We will need ->_mapcount to account PTE mappings of subpages of the
> compound page. We eliminate need in current meaning of ->_mapcount in
> tail pages by forbidding split entirely if the page is pinned.
>
> The only user of tail page refcounting is THP which is marked BROKEN for
> now.
>
> Let's drop all this mess. It makes get_page() and put_page() much
> simpler.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> arch/mips/mm/gup.c | 4 -
> arch/powerpc/mm/hugetlbpage.c | 13 +-
> arch/s390/mm/gup.c | 13 +-
> arch/sparc/mm/gup.c | 14 +--
> arch/x86/mm/gup.c | 4 -
> include/linux/mm.h | 47 ++------
> include/linux/mm_types.h | 17 +--
> mm/gup.c | 34 +-----
> mm/huge_memory.c | 41 +------
> mm/hugetlb.c | 2 +-
> mm/internal.h | 44 -------
> mm/swap.c | 273 +++---------------------------------------
> 12 files changed, 40 insertions(+), 466 deletions(-)
>




2015-07-31 14:52:26

by Jerome Marchand

Subject: Re: [PATCHv9 14/36] futex, thp: remove special case for THP in get_futex_key

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> With new THP refcounting, we don't need tricks to stabilize huge page.
> If we've got reference to tail page, it can't split under us.
>
> This patch effectively reverts a5b338f2b0b1.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> kernel/futex.c | 61 ++++++++++++----------------------------------------------
> 1 file changed, 12 insertions(+), 49 deletions(-)





2015-07-31 14:53:10

by Jerome Marchand

Subject: Re: [PATCHv9 15/36] ksm: prepare to new THP semantics

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> We don't need special code to stabilize THP. If you've got reference to
> any subpage of THP it will not be split under you.
>
> New split_huge_page() also accepts tail pages: no need in special code
> to get reference to head page.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> mm/ksm.c | 57 ++++++++++-----------------------------------------------
> 1 file changed, 10 insertions(+), 47 deletions(-)
>



2015-07-31 14:53:31

by Jerome Marchand

Subject: Re: [PATCHv9 16/36] mm, thp: remove compound_lock

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> We are going to use migration entries to stabilize page counts. It means
> we don't need compound_lock() for that.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> include/linux/mm.h | 35 -----------------------------------
> include/linux/page-flags.h | 12 +-----------
> mm/debug.c | 3 ---
> mm/memcontrol.c | 11 +++--------
> 4 files changed, 4 insertions(+), 57 deletions(-)
>




2015-07-31 14:54:09

by Jerome Marchand

Subject: Re: [PATCHv9 24/36] x86, thp: remove infrastructure for handling splitting PMDs

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> With new refcounting we don't need to mark PMDs splitting. Let's drop
> code to handle this.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> arch/x86/include/asm/pgtable.h | 9 ---------
> arch/x86/include/asm/pgtable_types.h | 2 --
> arch/x86/mm/gup.c | 13 +------------
> arch/x86/mm/pgtable.c | 14 --------------
> 4 files changed, 1 insertion(+), 37 deletions(-)
>





2015-07-31 15:01:13

by Jerome Marchand

Subject: Re: [PATCHv9 25/36] mm, thp: remove infrastructure for handling splitting PMDs

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> With new refcounting we don't need to mark PMDs splitting. Let's drop code
> to handle this.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> ---
> fs/proc/task_mmu.c | 8 +++---
> include/asm-generic/pgtable.h | 9 -------
> include/linux/huge_mm.h | 21 +++++----------
> mm/gup.c | 12 +--------
> mm/huge_memory.c | 60 ++++++++++---------------------------------
> mm/memcontrol.c | 13 ++--------
> mm/memory.c | 18 ++-----------
> mm/mincore.c | 2 +-
> mm/mremap.c | 15 +++++------
> mm/pgtable-generic.c | 14 ----------
> mm/rmap.c | 4 +--
> 11 files changed, 37 insertions(+), 139 deletions(-)
>

snip

> @@ -1616,23 +1605,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> * Note that if it returns 1, this routine returns without unlocking page
> * table locks. So callers must unlock them.
> */

The comment above should be updated. It otherwise looks good.

Acked-by: Jerome Marchand <[email protected]>

> -int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
> +bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
> spinlock_t **ptl)
> {
> *ptl = pmd_lock(vma->vm_mm, pmd);
> - if (likely(pmd_trans_huge(*pmd))) {
> - if (unlikely(pmd_trans_splitting(*pmd))) {
> - spin_unlock(*ptl);
> - wait_split_huge_page(vma->anon_vma, pmd);
> - return -1;
> - } else {
> - /* Thp mapped by 'pmd' is stable, so we can
> - * handle it as it is. */
> - return 1;
> - }
> - }
> + if (likely(pmd_trans_huge(*pmd)))
> + return true;
> spin_unlock(*ptl);
> - return 0;
> + return false;
> }
>
> /*




2015-07-31 15:04:26

by Jerome Marchand

Subject: Re: [PATCHv9 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs

On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of THP compound.
> It means we need to track mapcount on per small page basis.
>
> Straight-forward approach is to use ->_mapcount in all subpages to track
> how many times this subpage is mapped with PMDs or PTEs combined. But
> this is rather expensive: mapping or unmapping of a THP page with PMD
> would require HPAGE_PMD_NR atomic operations instead of the single one we
> have now.
>
> The idea is to store separately how many times the page was mapped as
> whole -- compound_mapcount. This frees up ->_mapcount in subpages to
> track PTE mapcount.
>
> We use the same approach as with compound page destructor and compound
> order to store compound_mapcount: use space in first tail page,
> ->mapping this time.
>
> Any time we map/unmap whole compound page (THP or hugetlb) -- we
> increment/decrement compound_mapcount. When we map part of compound page
> with PTE we operate on ->_mapcount of the subpage.
>
> page_mapcount() counts both: PTE and PMD mappings of the page.
>
> Basically, we have mapcount for a subpage spread over two counters.
> It makes it tricky to detect when last mapcount for a page goes away.
>
> We introduced PageDoubleMap() for this. When we split THP PMD for the
> first time and there's other PMD mapping left we offset up ->_mapcount
> in all subpages by one and set PG_double_map on the compound page.
> These additional references go away with last compound_mapcount.

So this stays even if all PTE mappings go away and the page is again mapped
only with PMD. I'm not sure how often that happens and if it's an issue
worth caring about.

Acked-by: Jerome Marchand <[email protected]>

>
> This approach provides a way to detect when last mapcount goes away on
> per small page basis without introducing new overhead for most common
> cases.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> ---
> include/linux/mm.h | 26 +++++++++++-
> include/linux/mm_types.h | 1 +
> include/linux/page-flags.h | 37 +++++++++++++++++
> include/linux/rmap.h | 4 +-
> mm/debug.c | 5 ++-
> mm/huge_memory.c | 2 +-
> mm/hugetlb.c | 4 +-
> mm/memory.c | 2 +-
> mm/migrate.c | 2 +-
> mm/page_alloc.c | 14 +++++--
> mm/rmap.c | 99 +++++++++++++++++++++++++++++++++++-----------
> 11 files changed, 161 insertions(+), 35 deletions(-)
>




2015-07-31 15:05:04

by Jerome Marchand

Subject: Re: [PATCHv9 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> Let's define page_mapped() to be true for compound pages if any
> sub-page of the compound page is mapped (with PMD or PTE).
>
> On the other hand, page_mapcount() returns the mapcount for this
> particular small page.
>
> This will make cases like page_get_anon_vma() behave correctly once we
> allow huge pages to be mapped with PTE.
>
> Most users outside core-mm should use page_mapcount() instead of
> page_mapped().
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> arch/arc/mm/cache_arc700.c | 4 ++--
> arch/arm/mm/flush.c | 2 +-
> arch/mips/mm/c-r4k.c | 3 ++-
> arch/mips/mm/cache.c | 2 +-
> arch/mips/mm/init.c | 6 +++---
> arch/sh/mm/cache-sh4.c | 2 +-
> arch/sh/mm/cache.c | 8 ++++----
> arch/xtensa/mm/tlb.c | 2 +-
> fs/proc/page.c | 4 ++--
> include/linux/mm.h | 15 +++++++++++++--
> mm/filemap.c | 2 +-
> 11 files changed, 31 insertions(+), 19 deletions(-)
>
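
The distinction drawn in the quoted commit message can be illustrated with a
small userspace model: page_mapped() answers "is any part of the compound page
mapped?", while page_mapcount() answers "how many times is this one small page
mapped?". The struct thp_model type, the model_* names and NR_SUBPAGES are
hypothetical; the PG_double_map offset is left out to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SUBPAGES 4	/* hypothetical; stands in for HPAGE_PMD_NR */

struct thp_model {
	int compound_mapcount;          /* PMD mappings of the whole page */
	int pte_mapcount[NR_SUBPAGES];  /* PTE mappings of each subpage */
};

/* Mapcount of one small page: its PTE mappings plus all PMD mappings. */
static int model_page_mapcount(const struct thp_model *thp, int i)
{
	return thp->pte_mapcount[i] + thp->compound_mapcount;
}

/* True if any subpage is mapped, by PMD or by PTE. */
static bool model_page_mapped(const struct thp_model *thp)
{
	if (thp->compound_mapcount > 0)
		return true;
	for (int i = 0; i < NR_SUBPAGES; i++)
		if (thp->pte_mapcount[i] > 0)
			return true;
	return false;
}
```

With only subpage 2 mapped by a PTE, model_page_mapped() is true for the
compound page while model_page_mapcount() is zero for every other subpage.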




2015-07-31 15:06:09

by Jerome Marchand

Subject: Re: [PATCHv9 28/36] mm, numa: skip PTE-mapped THP on numa fault

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> We're going to have THP mapped with PTEs. It will confuse numabalancing.
> Let's skip them for now.

Fair enough.

Acked-by: Jerome Marchand <[email protected]>

>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> ---
> mm/memory.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 074edab89b52..52f6fa02c099 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3186,6 +3186,12 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> return 0;
> }
>
> + /* TODO: handle PTE-mapped THP */
> + if (PageCompound(page)) {
> + pte_unmap_unlock(ptep, ptl);
> + return 0;
> + }
> +
> /*
> * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
> * much anyway since they can be in shared cache state. This misses
>




2015-07-31 15:07:04

by Jerome Marchand

Subject: Re: [PATCHv9 29/36] thp: implement split_huge_pmd()

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> Original split_huge_page() combined two operations: splitting PMDs into
> tables of PTEs and splitting underlying compound page. This patch
> implements split_huge_pmd() which splits the given PMD without splitting
> other PMDs this page is mapped with or the underlying compound page.
>
> Without tail page refcounting, implementation of split_huge_pmd() is
> pretty straight-forward.

While it's significantly simpler than it used to be, straight-forward is
still not the adjective which comes to my mind.

>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> include/linux/huge_mm.h | 11 ++++-
> mm/huge_memory.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 132 insertions(+), 1 deletion(-)
>





2015-07-31 15:09:44

by Jerome Marchand

Subject: Re: [PATCHv9 30/36] thp: add option to setup migration entiries during PMD split

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> We are going to use migration PTE entires to stabilize page counts.
> If the page is mapped with PMDs we need to split the PMD and setup
> migration enties. It's reasonable to combine these operations to avoid
> double-scanning over the page table.

Entries? Three different typos for three occurrences of the same word.
You don't like it, do you?

>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> mm/huge_memory.c | 23 +++++++++++++++--------
> 1 file changed, 15 insertions(+), 8 deletions(-)
>




2015-07-31 15:12:56

by Jerome Marchand

Subject: Re: [PATCHv9 31/36] thp, mm: split_huge_page(): caller need to lock page

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> We're going to use migration entries instead of compound_lock() to
> stabilize page refcounts. Setup and remove migration entries require
> page to be locked.
>
> Some of split_huge_page() callers already have the page locked. Let's
> require everybody to lock the page before calling split_huge_page().
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> mm/memory-failure.c | 10 ++++++++--
> mm/migrate.c | 8 ++++++--
> 2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ef33ccf37224..f32a607d1aa3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1143,15 +1143,18 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
> put_page(hpage);
> return -EBUSY;
> }
> + lock_page(hpage);
> if (unlikely(split_huge_page(hpage))) {
> pr_err("MCE: %#lx: thp split failed\n", pfn);
> if (TestClearPageHWPoison(p))
> atomic_long_sub(nr_pages, &num_poisoned_pages);
> + unlock_page(hpage);
> put_page(p);
> if (p != hpage)
> put_page(hpage);
> return -EBUSY;
> }
> + unlock_page(hpage);
> VM_BUG_ON_PAGE(!page_count(p), p);
> hpage = compound_head(p);
> }
> @@ -1714,10 +1717,13 @@ int soft_offline_page(struct page *page, int flags)
> return -EBUSY;
> }
> if (!PageHuge(page) && PageTransHuge(hpage)) {
> - if (PageAnon(hpage) && unlikely(split_huge_page(hpage))) {
> + lock_page(page);
> + ret = split_huge_page(hpage);
> + unlock_page(page);
> + if (unlikely(ret)) {
> pr_info("soft offline: %#lx: failed to split THP\n",
> pfn);
> - return -EBUSY;
> + return ret;
> }
> }
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 67970faf544d..a9dbfd356e9d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -933,9 +933,13 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> goto out;
> }
>
> - if (unlikely(PageTransHuge(page)))
> - if (unlikely(split_huge_page(page)))
> + if (unlikely(PageTransHuge(page))) {
> + lock_page(page);
> + rc = split_huge_page(page);
> + unlock_page(page);
> + if (rc)
> goto out;
> + }
>
> rc = __unmap_and_move(page, newpage, force, mode);
>
>




2015-07-31 15:13:45

by Jerome Marchand

Subject: Re: [PATCHv9 32/36] thp: reintroduce split_huge_page()

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> This patch adds implementation of split_huge_page() for new
> refcountings.
>
> Unlike previous implementation, new split_huge_page() can fail if
> somebody holds GUP pin on the page. It also means that pin on page
> would prevent it from being split under you. It makes situation in
> many places much cleaner.
>
> The basic scheme of split_huge_page():
>
> - Check that page_count() is equal to the sum of mapcounts of all
> subpages plus one (caller pin). Bail out with -EBUSY otherwise. This
> way we can avoid useless PMD-splits.
>
> - Freeze the page counters by splitting all PMD and setup migration
> PTEs.
>
> - Re-check sum of mapcounts against page_count(). Page's counts are
> stable now. -EBUSY if page is pinned.
>
> - Split compound page.
>
> - Unfreeze the page by removing migration entries.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> include/linux/huge_mm.h | 7 +-
> include/linux/pagemap.h | 13 +-
> mm/huge_memory.c | 318 ++++++++++++++++++++++++++++++++++++++++++++++++
> mm/internal.h | 26 +++-
> mm/rmap.c | 21 ----
> 5 files changed, 357 insertions(+), 28 deletions(-)
>
>
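
The five-step scheme listed in the quoted commit message can be sketched as a
small single-threaded model. Everything here (struct thp_model, the model_*
and helper names, NR_SUBPAGES) is a hypothetical illustration; in particular,
"freezing" is reduced to a flag, whereas the real code splits PMDs and installs
migration entries.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_SUBPAGES 4	/* hypothetical; stands in for HPAGE_PMD_NR */

struct thp_model {
	int count;                  /* references on the compound page */
	int mapcount[NR_SUBPAGES];  /* mappings of each subpage */
	bool frozen;                /* models installed migration entries */
	bool split;                 /* compound page has been split */
};

static int total_mapcount(const struct thp_model *thp)
{
	int sum = 0;

	for (int i = 0; i < NR_SUBPAGES; i++)
		sum += thp->mapcount[i];
	return sum;
}

/* True when the only extra reference is the caller's own pin. */
static bool only_caller_pin(const struct thp_model *thp)
{
	return thp->count == total_mapcount(thp) + 1;
}

static int model_split_huge_page(struct thp_model *thp)
{
	if (!only_caller_pin(thp))
		return -EBUSY;      /* early check, before any PMD split */

	thp->frozen = true;         /* counts are stable from here on */

	if (!only_caller_pin(thp)) {
		thp->frozen = false;
		return -EBUSY;      /* pinned after all: undo the freeze */
	}

	thp->split = true;          /* split the compound page */
	thp->frozen = false;        /* remove migration entries */
	return 0;
}
```

In this model the second check can never fail because nothing runs
concurrently; it is kept only to mirror the order of the steps, where the
re-check matters because a pin may appear between the first check and the
freeze.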




2015-07-31 15:14:45

by Jerome Marchand

Subject: Re: [PATCHv9 33/36] migrate_pages: try to split pages on qeueuing

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> We are not able to migrate THPs. It means it's not enough to split only
> PMD on migration -- we need to split compound page under it too.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> mm/mempolicy.c | 37 +++++++++++++++++++++++++++++++++----
> 1 file changed, 33 insertions(+), 4 deletions(-)
>





2015-07-31 15:15:06

by Jerome Marchand

Subject: Re: [PATCHv9 34/36] thp: introduce deferred_split_huge_page()

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> Currently we don't split huge page on partial unmap. It's not an ideal
> situation. It can lead to memory overhead.
>
> Fortunately, we can detect partial unmap on page_remove_rmap(). But we
> cannot call split_huge_page() from there due to locking context.
>
> It's also counterproductive to do directly from munmap() codepath: in
> many cases we will hit this from exit(2) and splitting the huge page
> just to free it up in small pages is not what we really want.
>
> The patch introduces deferred_split_huge_page() which puts the huge page
> into a queue for splitting. The splitting itself will happen when we get
> memory pressure via shrinker interface. The page will be dropped from
> the list on freeing through compound page destructor.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> include/linux/huge_mm.h | 4 ++
> include/linux/mm.h | 2 +
> mm/huge_memory.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++--
> mm/migrate.c | 1 +
> mm/page_alloc.c | 2 +-
> mm/rmap.c | 7 ++-
> 6 files changed, 138 insertions(+), 5 deletions(-)
>
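
The queue-then-split-under-pressure behaviour described in the quoted commit
message can be sketched in userspace as follows. The struct names, the
model_deferred_split() and model_shrink() functions, and the singly linked
list are hypothetical simplifications; locking and the destructor path that
drops a freed page from the list are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct thp_model {
	bool queued;             /* already on the deferred split queue */
	bool split;              /* has been split into small pages */
	struct thp_model *next;
};

struct split_queue {
	struct thp_model *head;
	size_t len;
};

/* Queue a partially unmapped page; a page is queued at most once. */
static void model_deferred_split(struct split_queue *q, struct thp_model *thp)
{
	if (thp->queued)
		return;
	thp->queued = true;
	thp->next = q->head;
	q->head = thp;
	q->len++;
}

/* Shrinker-style scan: split up to nr_to_scan queued pages. */
static size_t model_shrink(struct split_queue *q, size_t nr_to_scan)
{
	size_t done = 0;

	while (q->head && done < nr_to_scan) {
		struct thp_model *thp = q->head;

		q->head = thp->next;
		q->len--;
		thp->next = NULL;
		thp->queued = false;
		thp->split = true;  /* models a successful split */
		done++;
	}
	return done;
}
```

Nothing is split at queueing time; work happens only when the shrinker-style
scan runs, which is the point of deferring the split.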




2015-07-31 15:15:36

by Jerome Marchand

Subject: Re: [PATCHv9 35/36] mm: re-enable THP

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> All parts of THP with new refcounting are now in place. We can now allow
> THP to be enabled.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Sasha Levin <[email protected]>
> Tested-by: Aneesh Kumar K.V <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> mm/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c973f416cbe5..e79de2bd12cd 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -410,7 +410,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
>
> config TRANSPARENT_HUGEPAGE
> bool "Transparent Hugepage Support"
> - depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && BROKEN
> + depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
> select COMPACTION
> help
> Transparent Hugepages allows the kernel to use huge pages and
>




2015-07-31 15:16:38

by Jerome Marchand

Subject: Re: [PATCHv9 36/36] thp: update documentation

On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> The patch updates Documentation/vm/transhuge.txt to reflect changes in
> THP design.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Acked-by: Jerome Marchand <[email protected]>

> ---
> Documentation/vm/transhuge.txt | 151 ++++++++++++++++++++++++++---------------
> 1 file changed, 96 insertions(+), 55 deletions(-)
>




2015-08-03 10:41:15

by Kirill A. Shutemov

Subject: Re: [PATCHv9 25/36] mm, thp: remove infrastructure for handling splitting PMDs

On Fri, Jul 31, 2015 at 05:01:06PM +0200, Jerome Marchand wrote:
> On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> > @@ -1616,23 +1605,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > * Note that if it returns 1, this routine returns without unlocking page
> > * table locks. So callers must unlock them.
> > */
>
> The comment above should be updated.

Like this?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d32277463932..78a6c7cdf8f7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1627,11 +1627,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
}

/*
- * Returns 1 if a given pmd maps a stable (not under splitting) thp.
- * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
+ * Returns true if a given pmd maps a thp, false otherwise.
*
- * Note that if it returns 1, this routine returns without unlocking page
- * table locks. So callers must unlock them.
+ * Note that if it returns true, this routine returns without unlocking page
+ * table lock. So callers must unlock it.
*/
bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
spinlock_t **ptl)
--
Kirill A. Shutemov
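
The calling convention stated in the updated comment (the lock stays held only
when the function returns true) can be illustrated with a userspace model.
struct pmd_model, the model_* functions and count_huge() are hypothetical; a
plain flag stands in for the page table spinlock.

```c
#include <assert.h>
#include <stdbool.h>

struct pmd_model {
	bool trans_huge;  /* does this pmd map a THP? */
	bool locked;      /* stands in for the page table lock */
};

/* Returns true with the lock held, or false with the lock released. */
static bool model_pmd_trans_huge_lock(struct pmd_model *pmd)
{
	assert(!pmd->locked);
	pmd->locked = true;
	if (pmd->trans_huge)
		return true;
	pmd->locked = false;
	return false;
}

static void model_unlock(struct pmd_model *pmd)
{
	assert(pmd->locked);
	pmd->locked = false;
}

/* Caller pattern: unlock only on the 'true' path. */
static int count_huge(struct pmd_model *pmd)
{
	if (model_pmd_trans_huge_lock(pmd)) {
		model_unlock(pmd);
		return 1;
	}
	return 0;
}
```

With the -1 "under splitting" case gone, callers have only these two paths to
handle.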

2015-08-03 10:43:32

by Kirill A. Shutemov

Subject: Re: [PATCHv9 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs

On Fri, Jul 31, 2015 at 05:04:18PM +0200, Jerome Marchand wrote:
> On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> > We're going to allow mapping of individual 4k pages of THP compound.
> > It means we need to track mapcount on per small page basis.
> >
> > Straight-forward approach is to use ->_mapcount in all subpages to track
> > how many times this subpage is mapped with PMDs or PTEs combined. But
> > this is rather expensive: mapping or unmapping of a THP page with PMD
> > would require HPAGE_PMD_NR atomic operations instead of the single one we
> > have now.
> >
> > The idea is to store separately how many times the page was mapped as
> > whole -- compound_mapcount. This frees up ->_mapcount in subpages to
> > track PTE mapcount.
> >
> > We use the same approach as with compound page destructor and compound
> > order to store compound_mapcount: use space in first tail page,
> > ->mapping this time.
> >
> > Any time we map/unmap whole compound page (THP or hugetlb) -- we
> > increment/decrement compound_mapcount. When we map part of compound page
> > with PTE we operate on ->_mapcount of the subpage.
> >
> > page_mapcount() counts both: PTE and PMD mappings of the page.
> >
> > Basically, we have mapcount for a subpage spread over two counters.
> > It makes it tricky to detect when last mapcount for a page goes away.
> >
> > We introduced PageDoubleMap() for this. When we split THP PMD for the
> > first time and there's other PMD mapping left we offset up ->_mapcount
> > in all subpages by one and set PG_double_map on the compound page.
> > These additional references go away with last compound_mapcount.
>
> So this stays even if all PTE mappings go away and the page is again mapped
> only with PMD. I'm not sure how often that happens and if it's an issue
> worth caring about.

We don't have a cheap way to detect this situation and it shouldn't
happen often enough to care.

--
Kirill A. Shutemov
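
The two-counter accounting discussed in this exchange can be modelled in
userspace. The sketch below is an illustration under stated assumptions, not
the kernel code: struct thp_model, its helpers and NR_SUBPAGES are
hypothetical, and counters start at 0 rather than using the kernel's -1 bias.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SUBPAGES 4	/* hypothetical; stands in for HPAGE_PMD_NR */

struct thp_model {
	int compound_mapcount;          /* PMD mappings of the whole page */
	int sub_mapcount[NR_SUBPAGES];  /* PTE mappings, plus the offset */
	bool double_map;                /* models PG_double_map */
};

static void add_to_all(struct thp_model *thp, int delta)
{
	for (int i = 0; i < NR_SUBPAGES; i++)
		thp->sub_mapcount[i] += delta;
}

static void map_pmd(struct thp_model *thp)
{
	thp->compound_mapcount++;
}

/* Drop one PMD mapping; the last one also drops the extra offset. */
static void unmap_pmd(struct thp_model *thp)
{
	assert(thp->compound_mapcount > 0);
	if (--thp->compound_mapcount == 0 && thp->double_map) {
		thp->double_map = false;
		add_to_all(thp, -1);
	}
}

/* Turn one PMD mapping into NR_SUBPAGES PTE mappings. */
static void split_pmd(struct thp_model *thp)
{
	assert(thp->compound_mapcount > 0);
	add_to_all(thp, 1);             /* the new PTE mappings */
	if (thp->compound_mapcount > 1 && !thp->double_map) {
		thp->double_map = true;
		add_to_all(thp, 1);     /* the one-time offset */
	}
	unmap_pmd(thp);
}

static void unmap_pte(struct thp_model *thp, int i)
{
	/* The offset itself is not a PTE mapping and cannot be unmapped. */
	assert(thp->sub_mapcount[i] > (thp->double_map ? 1 : 0));
	thp->sub_mapcount[i]--;
}

/* Counts both PTE and PMD mappings of subpage i. */
static int page_mapcount(const struct thp_model *thp, int i)
{
	int ret = thp->sub_mapcount[i] + thp->compound_mapcount;

	if (thp->double_map)
		ret--;
	return ret;
}
```

In this model the double_map flag, once set, is cleared only in unmap_pmd()
when the last PMD mapping goes away, so it persists even after every PTE
mapping has been removed. That is the behaviour Jerome asks about above.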

2015-08-03 10:44:51

by Kirill A. Shutemov

Subject: Re: [PATCHv9 29/36] thp: implement split_huge_pmd()

On Fri, Jul 31, 2015 at 05:06:55PM +0200, Jerome Marchand wrote:
> On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> > Original split_huge_page() combined two operations: splitting PMDs into
> > tables of PTEs and splitting underlying compound page. This patch
> > implements split_huge_pmd() which splits the given PMD without splitting
> > other PMDs this page is mapped with or the underlying compound page.
> >
> > Without tail page refcounting, implementation of split_huge_pmd() is
> > pretty straight-forward.
>
> While it's significantly simpler than it used to be, straight-forward is
> still not the adjective which comes to my mind.

The commit message was written for an older revision :-P

--
Kirill A. Shutemov

2015-08-03 10:53:31

by Kirill A. Shutemov

Subject: Re: [PATCHv9 30/36] thp: add option to setup migration entiries during PMD split

On Fri, Jul 31, 2015 at 05:09:38PM +0200, Jerome Marchand wrote:
> On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> > We are going to use migration PTE entires to stabilize page counts.
> > If the page is mapped with PMDs we need to split the PMD and setup
> > migration enties. It's reasonable to combine these operations to avoid
> > double-scanning over the page table.
>
> Entries? Three different typos for three occurrences of the same word.
> You don't like it, do you?

Urgh..

From 6c5b35ffcc425bcfc91b56d1ee404ab83cc667cf Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Fri, 10 Apr 2015 12:39:14 +0300
Subject: [PATCH] thp: add option to setup migration entries during PMD split

We are going to use migration PTE entries to stabilize page counts.
If the page is mapped with PMDs we need to split the PMD and setup
migration entries. It's reasonable to combine these operations to avoid
double-scanning over the page table.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
---
mm/huge_memory.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1e0e02786241..0d817863a739 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -25,6 +25,7 @@
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>

 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -2627,7 +2628,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 }

 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr)
+		unsigned long haddr, bool freeze)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
@@ -2669,12 +2670,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * transferred to avoid any possibility of altering
 		 * permissions across VMAs.
 		 */
-		entry = mk_pte(page + i, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (!write)
-			entry = pte_wrprotect(entry);
-		if (!young)
-			entry = pte_mkold(entry);
+		if (freeze) {
+			swp_entry_t swp_entry;
+			swp_entry = make_migration_entry(page + i, write);
+			entry = swp_entry_to_pte(swp_entry);
+		} else {
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!write)
+				entry = pte_wrprotect(entry);
+			if (!young)
+				entry = pte_mkold(entry);
+		}
 		pte = pte_offset_map(&_pmd, haddr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
@@ -2715,7 +2722,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
 	ptl = pmd_lock(mm, pmd);
 	if (likely(pmd_trans_huge(*pmd)))
-		__split_huge_pmd_locked(vma, pmd, haddr);
+		__split_huge_pmd_locked(vma, pmd, haddr, false);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
 }
--
Kirill A. Shutemov
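
The freeze option in the patch above decides, per subpage and within the same loop, whether the split installs a present PTE or a migration entry. The following userspace sketch models only that choice. Everything in it is an assumption made for illustration: struct toy_pte, toy_make_entry() and toy_split_pmd_entries() are hypothetical names, real PTEs are architecture-specific bit patterns, and TOY_NR_PTES merely stands in for HPAGE_PMD_NR.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_PTES 512	/* assumed stand-in for HPAGE_PMD_NR */

enum toy_pte_kind { TOY_PTE_NONE, TOY_PTE_PRESENT, TOY_PTE_MIGRATION };

/* Hypothetical decoded PTE; not the kernel's pte_t. */
struct toy_pte {
	enum toy_pte_kind kind;
	unsigned long pfn;	/* which subpage the entry refers to */
	bool write;
	bool young;		/* only meaningful for present entries */
};

/* Build the entry for one subpage, mirroring the if (freeze) branch. */
static struct toy_pte toy_make_entry(unsigned long pfn, bool write, bool young,
				     bool freeze)
{
	struct toy_pte entry = { .pfn = pfn, .write = write };

	if (freeze) {
		/* Non-present migration entry: keeps the page and write bit. */
		entry.kind = TOY_PTE_MIGRATION;
	} else {
		/* Ordinary mapping: carries over write and young. */
		entry.kind = TOY_PTE_PRESENT;
		entry.young = young;
	}
	return entry;
}

/* Fill the PTE table for a split PMD in a single pass over the table. */
static void toy_split_pmd_entries(struct toy_pte table[TOY_NR_PTES],
				  unsigned long head_pfn, bool write,
				  bool young, bool freeze)
{
	for (int i = 0; i < TOY_NR_PTES; i++) {
		/* Plays the role of BUG_ON(!pte_none(*pte)) in the patch. */
		assert(table[i].kind == TOY_PTE_NONE);
		table[i] = toy_make_entry(head_pfn + i, write, young, freeze);
	}
}
```

Doing both steps in one loop is what the commit message means by avoiding a double scan: without the flag, a caller would first split the PMD into present PTEs and then walk the same table again to replace each of them with a migration entry.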

2015-08-03 11:41:00

by Jerome Marchand

Subject: Re: [PATCHv9 25/36] mm, thp: remove infrastructure for handling splitting PMDs

On 08/03/2015 12:41 PM, Kirill A. Shutemov wrote:
> On Fri, Jul 31, 2015 at 05:01:06PM +0200, Jerome Marchand wrote:
>> On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
>>> @@ -1616,23 +1605,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> * Note that if it returns 1, this routine returns without unlocking page
>>> * table locks. So callers must unlock them.
>>> */
>>
>> The comment above should be updated.
>
> Like this?

Yes. Thanks.

>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d32277463932..78a6c7cdf8f7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1627,11 +1627,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> }
>
> /*
> - * Returns 1 if a given pmd maps a stable (not under splitting) thp.
> - * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
> + * Returns true if a given pmd maps a thp, false otherwise.
> *
> - * Note that if it returns 1, this routine returns without unlocking page
> - * table locks. So callers must unlock them.
> + * Note that if it returns true, this routine returns without unlocking page
> + * table lock. So callers must unlock it.
> */
> bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
> spinlock_t **ptl)
>




2015-08-03 11:41:30

by Jerome Marchand

Subject: Re: [PATCHv9 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs

On 08/03/2015 12:43 PM, Kirill A. Shutemov wrote:
> On Fri, Jul 31, 2015 at 05:04:18PM +0200, Jerome Marchand wrote:
>> On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
>>> We're going to allow mapping of individual 4k pages of a THP compound page.
>>> It means we need to track mapcount on a per-small-page basis.
>>>
>>> The straightforward approach is to use ->_mapcount in all subpages to track
>>> how many times each subpage is mapped with PMDs and PTEs combined. But
>>> this is rather expensive: mapping or unmapping a THP page with a PMD
>>> would require HPAGE_PMD_NR atomic operations instead of the single one we
>>> have now.
>>>
>>> The idea is to store separately how many times the page was mapped as
>>> a whole -- compound_mapcount. This frees up ->_mapcount in subpages to
>>> track PTE mapcount.
>>>
>>> We use the same approach as with the compound page destructor and compound
>>> order to store compound_mapcount: use space in the first tail page,
>>> ->mapping this time.
>>>
>>> Any time we map/unmap a whole compound page (THP or hugetlb), we
>>> increment/decrement compound_mapcount. When we map part of a compound page
>>> with PTEs, we operate on ->_mapcount of the subpage.
>>>
>>> page_mapcount() counts both PTE and PMD mappings of the page.
>>>
>>> Basically, we have the mapcount for a subpage spread over two counters.
>>> This makes it tricky to detect when the last mapcount for a page goes away.
>>>
>>> We introduced PageDoubleMap() for this. When we split a THP PMD for the
>>> first time and there are other PMD mappings left, we offset ->_mapcount
>>> in all subpages up by one and set PG_double_map on the compound page.
>>> These additional references go away with the last compound_mapcount.
>>
>> So this stays even if all PTE mappings go away and the page is again mapped
>> only with PMDs. I'm not sure how often that happens and whether it's an
>> issue worth caring about.
>
> We don't have a cheap way to detect this situation and it shouldn't
> happen often enough to care.
>

I thought so.

