2016-04-16 00:33:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

This is probably the last update before the mm summit. Main forcus is on
khugepaged stability.

khugepaged is in more reasonable shape now. I missed quite a few corner
cases on first try. I run this version via LTP, trinity and syzkaller
without crashes so far.

The patchset is on top of v4.6-rc3 plus Hugh's "easy preliminaries to
THPagecache" and Ebru's khugepaged swapin patches form -mm tree.

Git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v7

== Changelog ==

v7:
- khugepaged updates:
+ fix page leak/page cache corruption on collapse fail;
+ filter out VMAs not suitable for huge pages due misaligned vm_pgoff;
+ fix build without CONFIG_SHMEM;
+ drop few over-protective checks;
- fix bogus VM_BUG_ON() in __delete_from_page_cache();

v6:
- experimental collapse support;
- fix swapout mapped huge pages;
- fix page leak in faularound code;
- fix exessive huge page allocation with huge=within_size;
- rename VM_NO_THP to VM_NO_KHUGEPAGED;
- fix condition in hugepage_madvise();
- accounting reworked again;

v5:
- add FileHugeMapped to /proc/PID/smaps;
- make FileHugeMapped in meminfo aligned with other fields;
- Documentation/vm/transhuge.txt updated;

v4:
- first four patch were applied to -mm tree;
- drop pages beyond i_size on split_huge_pages;
- few small random bugfixes;

v3:
- huge= mountoption now can have values always, within_size, advice and
never;
- sysctl handle is replaced with sysfs knob;
- MADV_HUGEPAGE/MADV_NOHUGEPAGE is now respected on page allocation via
page fault;
- mlock() handling had been fixed;
- bunch of smaller bugfixes and cleanups.

== Design overview ==

Huge pages are allocated by shmem when it's allowed (by mount option) and
there's no entries for the range in radix-tree. Huge page is represented by
HPAGE_PMD_NR entries in radix-tree.

MM core maps a page with PMD if ->fault() returns huge page and the VMA is
suitable for huge pages (size, alignment). There's no need into two
requests to file system: filesystem returns huge page if it can,
graceful fallback to small pages otherwise.

As with DAX, split_huge_pmd() is implemented by unmapping the PMD: we can
re-fault the page with PTEs later.

Basic scheme for split_huge_page() is the same as for anon-THP.
Few differences:

- File pages are on radix-tree, so we have head->_count offset by
HPAGE_PMD_NR. The count got distributed to small pages during split.

- mapping->tree_lock prevents non-lockless access to pages under split
over radix-tree;

- Lockless access is prevented by setting the head->_count to 0 during
split, so get_page_unless_zero() would fail;

- After split, some pages can be beyond i_size. We drop them from
radix-tree.

- We don't setup migration entries. Just unmap pages. It helps
handling cases when i_size is in the middle of the page: no need
handle unmap pages beyond i_size manually.

COW mapping handled on PTE-level. It's not clear how beneficial would be
allocation of huge pages on COW faults. And it would require some code to
make them work.

I think at some point we can consider teaching khugepaged to collapse
pages in COW mappings, but allocating huge on fault is probably overkill.

As with anon THP, we mlock file huge page only if it mapped with PMD.
PTE-mapped THPs are never mlocked. This way we can avoid all sorts of
scenarios when we can leak mlocked page.

As with anon THP, we split huge page on swap out.

Truncate and punch hole that only cover part of THP range is implemented
by zero out this part of THP.

This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
As we don't really create hole in this case, lseek(SEEK_HOLE) may have
inconsistent results depending what pages happened to be allocated.
I don't think this will be a problem.

== Patchset overview ==

[01/29]
Update documentation on THP vs. mlock. I've posted it separately
before. It can go in.

[02-04/29]
Rework fault path and rmap to handle file pmd. Unlike DAX with
vm_ops->pmd_fault, we don't need to ask filesystem twice -- first
for huge page and then for small. If ->fault happened to return
huge page and VMA is suitable for mapping it as huge, we would
do so.
[05/29]
Add support for huge file pages in rmap;

[06-15/29]
Various preparation of THP core for file pages.

[16-20/29]
Various preparation of MM core for file pages.

[21-24/29]
And finally, bring huge pages into tmpfs/shmem.

[25/29]
Wire up madvise() existing hints for file THP.
We can implement fadvise() later.

[26/29]
Documentation update.

[27-29/29]
Extend khugepaged to support shmem/tmpfs.
Hugh Dickins (1):
shmem: get_unmapped_area align huge page

Kirill A. Shutemov (28):
thp, mlock: update unevictable-lru.txt
mm: do not pass mm_struct into handle_mm_fault
mm: introduce fault_env
mm: postpone page table allocation until we have page to map
rmap: support file thp
mm: introduce do_set_pmd()
thp, vmstats: add counters for huge file pages
thp: support file pages in zap_huge_pmd()
thp: handle file pages in split_huge_pmd()
thp: handle file COW faults
thp: skip file huge pmd on copy_huge_pmd()
thp: prepare change_huge_pmd() for file thp
thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
thp: file pages support for split_huge_page()
thp, mlock: do not mlock PTE-mapped file huge pages
vmscan: split file huge pages before paging them out
page-flags: relax policy for PG_mappedtodisk and PG_reclaim
radix-tree: implement radix_tree_maybe_preload_order()
filemap: prepare find and delete operations for huge pages
truncate: handle file thp
mm, rmap: account shmem thp pages
shmem: prepare huge= mount option and sysfs knob
shmem: add huge pages support
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
thp: update Documentation/vm/transhuge.txt
thp: extract khugepaged from mm/huge_memory.c
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
khugepaged: add support of collapse for tmpfs/shmem pages

Documentation/filesystems/Locking | 10 +-
Documentation/vm/transhuge.txt | 130 ++-
Documentation/vm/unevictable-lru.txt | 21 +
arch/alpha/mm/fault.c | 2 +-
arch/arc/mm/fault.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/avr32/mm/fault.c | 2 +-
arch/cris/mm/fault.c | 2 +-
arch/frv/mm/fault.c | 2 +-
arch/hexagon/mm/vm_fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m32r/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/metag/mm/fault.c | 2 +-
arch/microblaze/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/mn10300/mm/fault.c | 2 +-
arch/nios2/mm/fault.c | 2 +-
arch/openrisc/mm/fault.c | 2 +-
arch/parisc/mm/fault.c | 2 +-
arch/powerpc/mm/copro_fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 +-
arch/s390/mm/fault.c | 2 +-
arch/score/mm/fault.c | 2 +-
arch/sh/mm/fault.c | 2 +-
arch/sparc/mm/fault_32.c | 4 +-
arch/sparc/mm/fault_64.c | 2 +-
arch/tile/mm/fault.c | 2 +-
arch/um/kernel/trap.c | 2 +-
arch/unicore32/mm/fault.c | 2 +-
arch/x86/mm/fault.c | 2 +-
arch/xtensa/mm/fault.c | 2 +-
drivers/base/node.c | 13 +-
drivers/char/mem.c | 24 +
drivers/iommu/amd_iommu_v2.c | 3 +-
drivers/iommu/intel-svm.c | 2 +-
fs/proc/meminfo.c | 7 +-
fs/proc/task_mmu.c | 10 +-
fs/userfaultfd.c | 22 +-
include/linux/huge_mm.h | 36 +-
include/linux/khugepaged.h | 6 +
include/linux/mm.h | 51 +-
include/linux/mmzone.h | 4 +-
include/linux/page-flags.h | 19 +-
include/linux/radix-tree.h | 1 +
include/linux/rmap.h | 2 +-
include/linux/shmem_fs.h | 29 +-
include/linux/userfaultfd_k.h | 8 +-
include/linux/vm_event_item.h | 7 +
include/trace/events/huge_memory.h | 3 +-
ipc/shm.c | 6 +-
lib/radix-tree.c | 68 +-
mm/Makefile | 2 +-
mm/filemap.c | 226 ++--
mm/gup.c | 7 +-
mm/huge_memory.c | 2028 ++++++----------------------------
mm/internal.h | 4 +-
mm/khugepaged.c | 1772 +++++++++++++++++++++++++++++
mm/ksm.c | 5 +-
mm/memory.c | 859 +++++++-------
mm/mempolicy.c | 4 +-
mm/migrate.c | 5 +-
mm/mmap.c | 26 +-
mm/nommu.c | 3 +-
mm/page-writeback.c | 1 +
mm/page_alloc.c | 21 +
mm/rmap.c | 78 +-
mm/shmem.c | 689 ++++++++++--
mm/swap.c | 2 +
mm/truncate.c | 22 +-
mm/util.c | 6 +
mm/vmscan.c | 6 +
mm/vmstat.c | 4 +
74 files changed, 3919 insertions(+), 2395 deletions(-)
create mode 100644 mm/khugepaged.c

--
2.8.0.rc3


2016-04-16 00:24:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 02/29] mm: do not pass mm_struct into handle_mm_fault

We always have vma->vm_mm around.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/alpha/mm/fault.c | 2 +-
arch/arc/mm/fault.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/avr32/mm/fault.c | 2 +-
arch/cris/mm/fault.c | 2 +-
arch/frv/mm/fault.c | 2 +-
arch/hexagon/mm/vm_fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m32r/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/metag/mm/fault.c | 2 +-
arch/microblaze/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/mn10300/mm/fault.c | 2 +-
arch/nios2/mm/fault.c | 2 +-
arch/openrisc/mm/fault.c | 2 +-
arch/parisc/mm/fault.c | 2 +-
arch/powerpc/mm/copro_fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 +-
arch/s390/mm/fault.c | 2 +-
arch/score/mm/fault.c | 2 +-
arch/sh/mm/fault.c | 2 +-
arch/sparc/mm/fault_32.c | 4 ++--
arch/sparc/mm/fault_64.c | 2 +-
arch/tile/mm/fault.c | 2 +-
arch/um/kernel/trap.c | 2 +-
arch/unicore32/mm/fault.c | 2 +-
arch/x86/mm/fault.c | 2 +-
arch/xtensa/mm/fault.c | 2 +-
drivers/iommu/amd_iommu_v2.c | 3 +--
drivers/iommu/intel-svm.c | 2 +-
include/linux/mm.h | 9 ++++-----
mm/gup.c | 5 ++---
mm/ksm.c | 5 ++---
mm/memory.c | 13 +++++++------
36 files changed, 48 insertions(+), 51 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 4a905bd667e2..83e9eee57a55 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -147,7 +147,7 @@ retry:
/* If for any reason at all we couldn't handle the fault,
make sure we exit gracefully rather than endlessly redo
the fault. */
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index af63f4a13e60..e94e5aa33985 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -137,7 +137,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

/* If Pagefault was interrupted by SIGKILL, exit page fault "early" */
if (unlikely(fatal_signal_pending(current))) {
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index ad5841856007..3a2e678b8d30 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -243,7 +243,7 @@ good_area:
goto out;
}

- return handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+ return handle_mm_fault(vma, addr & PAGE_MASK, flags);

check_stack:
/* Don't allow expansion below FIRST_USER_ADDRESS */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 95df28bc875f..4dbb076b9e1f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -183,7 +183,7 @@ good_area:
goto out;
}

- return handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
+ return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags);

check_stack:
if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index c03533937a9f..a4b7edac8f10 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -134,7 +134,7 @@ good_area:
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 3066d40a6db1..112ef26c7f2e 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -168,7 +168,7 @@ retry:
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index 61d99767fe16..614a46c413d2 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -164,7 +164,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, ear0, flags);
+ fault = handle_mm_fault(vma, ear0, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 8704c9320032..bd7c251e2bce 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -101,7 +101,7 @@ good_area:
break;
}

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 70b40d1205a6..fa6ad95e992e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -159,7 +159,7 @@ retry:
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 8f9875b7933d..a3785d3644c2 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -196,7 +196,7 @@ good_area:
*/
addr = (address & PAGE_MASK);
set_thread_fault_code(error_code);
- fault = handle_mm_fault(mm, vma, addr, flags);
+ fault = handle_mm_fault(vma, addr, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 6a94cdd0c830..bd66a0b20c6b 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -136,7 +136,7 @@ good_area:
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
pr_debug("handle_mm_fault returns %d\n", fault);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index f57edca63609..372783a67dda 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -133,7 +133,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return 0;
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 177dfc003643..abb678ccde6f 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -216,7 +216,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 4b88fa031891..9560ad731120 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -153,7 +153,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 4a1d181ed32f..f23781d6bbb3 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -254,7 +254,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index b51878b0c6b8..affc4eb3f89e 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -131,7 +131,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 230ac20ae794..e94cd225e816 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -163,7 +163,7 @@ good_area:
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 16dbe81c97c9..163af2c31d76 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -239,7 +239,7 @@ good_area:
* fault.
*/

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 6527882ce05e..bb0354222b11 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -75,7 +75,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
}

ret = 0;
- *flt = handle_mm_fault(mm, vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
+ *flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
if (unlikely(*flt & VM_FAULT_ERROR)) {
if (*flt & VM_FAULT_OOM) {
ret = -ENOMEM;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a67c6d781c52..a4db22f65021 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -429,7 +429,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
if (fault & VM_FAULT_SIGSEGV)
goto bad_area;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index cce577feab1e..a9524531320e 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -455,7 +455,7 @@ retry:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
/* No reason to continue if interrupted by SIGKILL. */
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
fault = VM_FAULT_SIGNAL;
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 37a6c2e0e969..995b71e4db4b 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -111,7 +111,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 79d8276377d1..9bf876780cef 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -487,7 +487,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
if (mm_fault_error(regs, error_code, address, fault))
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index b6c559cbd64d..4714061d6cd3 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -241,7 +241,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
@@ -411,7 +411,7 @@ good_area:
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
}
- switch (handle_mm_fault(mm, vma, address, flags)) {
+ switch (handle_mm_fault(vma, address, flags)) {
case VM_FAULT_SIGBUS:
case VM_FAULT_OOM:
goto do_sigbus;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cb841a33da59..6c43b924a7a2 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -436,7 +436,7 @@ good_area:
goto bad_area;
}

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
goto exit_exception;
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 26734214818c..beba986589e5 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -434,7 +434,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return 0;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 98783dd0fa2e..ad8f206ab5e8 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -73,7 +73,7 @@ good_area:
do {
int fault;

- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
goto out_nosemaphore;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 2ec3d3adcefc..6c7f70bcaae3 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -194,7 +194,7 @@ good_area:
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the fault.
*/
- fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+ fault = handle_mm_fault(vma, addr & PAGE_MASK, flags);
return fault;

check_stack:
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5ce1ed02f7e8..2915b5b7bd7e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1348,7 +1348,7 @@ good_area:
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);
major |= fault & VM_FAULT_MAJOR;

/*
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 7f4a1fdb1502..2725e08ef353 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -110,7 +110,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, flags);
+ fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 56999d2fac07..fbdaf81ae925 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -538,8 +538,7 @@ static void do_fault(struct work_struct *work)
if (access_error(vma, fault))
goto out;

- ret = handle_mm_fault(mm, vma, address, flags);
-
+ ret = handle_mm_fault(vma, address, flags);
out:
up_read(&mm->mmap_sem);

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index d9939fa9b588..8ebb3530afa7 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -583,7 +583,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
if (access_error(vma, req))
goto invalid;

- ret = handle_mm_fault(svm->mm, vma, address,
+ ret = handle_mm_fault(vma, address,
req->wr_req ? FAULT_FLAG_WRITE : 0);
if (ret & VM_FAULT_ERROR)
goto invalid;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffcff53e3b2b..8fc68ed3c862 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1214,15 +1214,14 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
int invalidate_inode_page(struct page *page);

#ifdef CONFIG_MMU
-extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags);
+extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags);
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
#else
-static inline int handle_mm_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- unsigned int flags)
+static inline int handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
/* should never happen if there's no MMU */
BUG();
diff --git a/mm/gup.c b/mm/gup.c
index fb87aea9edc8..39f751e1cb33 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -351,7 +351,6 @@ unmap:
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
unsigned long address, unsigned int *flags, int *nonblocking)
{
- struct mm_struct *mm = vma->vm_mm;
unsigned int fault_flags = 0;
int ret;

@@ -376,7 +375,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_TRIED;
}

- ret = handle_mm_fault(mm, vma, address, fault_flags);
+ ret = handle_mm_fault(vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
return -ENOMEM;
@@ -691,7 +690,7 @@ retry:
if (!vma_permits_fault(vma, fault_flags))
return -EFAULT;

- ret = handle_mm_fault(mm, vma, address, fault_flags);
+ ret = handle_mm_fault(vma, address, fault_flags);
major |= ret & VM_FAULT_MAJOR;
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
diff --git a/mm/ksm.c b/mm/ksm.c
index b99e828172f6..c967d1b8df97 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -376,9 +376,8 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
if (IS_ERR_OR_NULL(page))
break;
if (PageKsm(page))
- ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE |
- FAULT_FLAG_REMOTE);
+ ret = handle_mm_fault(vma, addr,
+ FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 13dcc53f8fba..0002898341a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3363,9 +3363,10 @@ unlock:
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
{
+ struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -3452,15 +3453,15 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
{
int ret;

__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGFAULT);

/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);
@@ -3472,7 +3473,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();

- ret = __handle_mm_fault(mm, vma, address, flags);
+ ret = __handle_mm_fault(vma, address, flags);

if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
--
2.8.0.rc3

2016-04-16 00:24:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 01/29] thp, mlock: update unevictable-lru.txt

Add description of THP handling into unevictable-lru.txt.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/unevictable-lru.txt | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index fa3b527086fa..0026a8d33fc0 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -461,6 +461,27 @@ unevictable LRU is enabled, the work of compaction is mostly handled by
the page migration code and the same work flow as described in MIGRATING
MLOCKED PAGES will apply.

+MLOCKING TRANSPARENT HUGE PAGES
+-------------------------------
+
+A transparent huge page is represented by a single entry on an LRU list.
+Therefore, we can only make unevictable an entire compound page, not
+individual subpages.
+
+If a user tries to mlock() part of a huge page, we want the rest of the
+page to be reclaimable.
+
+We cannot just split the page on partial mlock() as split_huge_page() can
+fail and new intermittent failure mode for the syscall is undesirable.
+
+We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
+PMD on border of VM_LOCKED VMA will be split into PTE table.
+
+This way the huge page is accessible for vmscan. Under memory pressure the
+page will be split, subpages which belong to VM_LOCKED VMAs will be moved
+to unevictable LRU and the rest can be reclaimed.
+
+See also comment in follow_trans_huge_pmd().

mmap(MAP_LOCKED) SYSTEM CALL HANDLING
-------------------------------------
--
2.8.0.rc3

2016-04-16 00:24:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 18/29] radix-tree: implement radix_tree_maybe_preload_order()

The new helper is similar to radix_tree_maybe_preload(), but tries to
preload number of nodes required to insert (1 << order) continuous
naturally-aligned elements.

This is required to push huge pages into pagecache.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/radix-tree.h | 1 +
lib/radix-tree.c | 68 ++++++++++++++++++++++++++++++++++++++++------
2 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 51a97ac8bfbf..b6e73ea314d9 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -296,6 +296,7 @@ unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
unsigned long first_index, unsigned int max_items);
int radix_tree_preload(gfp_t gfp_mask);
int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 1624c4117961..448abbf04e87 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -42,6 +42,9 @@
*/
static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1] __read_mostly;

+/* Number of nodes in fully populated tree of given height */
+static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
+
/*
* Radix tree node cache.
*/
@@ -296,7 +299,7 @@ radix_tree_node_free(struct radix_tree_node *node)
* To make use of this facility, the radix tree must be initialised without
* __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
*/
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload(gfp_t gfp_mask, int nr)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
@@ -304,14 +307,14 @@ static int __radix_tree_preload(gfp_t gfp_mask)

preempt_disable();
rtp = this_cpu_ptr(&radix_tree_preloads);
- while (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+ while (rtp->nr < nr) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = this_cpu_ptr(&radix_tree_preloads);
- if (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+ if (rtp->nr < nr) {
node->private_data = rtp->nodes;
rtp->nodes = node;
rtp->nr++;
@@ -337,7 +340,7 @@ int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
}
EXPORT_SYMBOL(radix_tree_preload);

@@ -349,7 +352,7 @@ EXPORT_SYMBOL(radix_tree_preload);
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
if (gfpflags_allow_blocking(gfp_mask))
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
return 0;
@@ -357,6 +360,51 @@ int radix_tree_maybe_preload(gfp_t gfp_mask)
EXPORT_SYMBOL(radix_tree_maybe_preload);

/*
+ * The same as function above, but preload number of nodes required to insert
+ * (1 << order) continuous naturally-aligned elements.
+ */
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
+{
+ unsigned long nr_subtrees;
+ int nr_nodes, subtree_height;
+
+ /* Preloading doesn't help anything with this gfp mask, skip it */
+ if (!gfpflags_allow_blocking(gfp_mask)) {
+ preempt_disable();
+ return 0;
+ }
+
+ /*
+ * Calculate number and height of fully populated subtrees it takes to
+ * store (1 << order) elements.
+ */
+ nr_subtrees = 1 << order;
+ for (subtree_height = 0; nr_subtrees > RADIX_TREE_MAP_SIZE;
+ subtree_height++)
+ nr_subtrees >>= RADIX_TREE_MAP_SHIFT;
+
+ /*
+ * The worst case is zero height tree with a single item at index 0 and
+ * then inserting items starting at ULONG_MAX - (1 << order).
+ *
+ * This requires RADIX_TREE_MAX_PATH nodes to build branch from root to
+ * 0-index item.
+ */
+ nr_nodes = RADIX_TREE_MAX_PATH;
+
+ /* Plus branch to fully populated subtrees. */
+ nr_nodes += RADIX_TREE_MAX_PATH - subtree_height;
+
+ /* Root node is shared. */
+ nr_nodes--;
+
+ /* Plus nodes required to build subtrees. */
+ nr_nodes += nr_subtrees * height_to_maxnodes[subtree_height];
+
+ return __radix_tree_preload(gfp_mask, nr_nodes);
+}
+
+/*
* Return the maximum key which can be store into a
* radix tree with height HEIGHT.
*/
@@ -1576,12 +1624,16 @@ static __init unsigned long __maxindex(unsigned int height)
return ~0UL >> shift;
}

-static __init void radix_tree_init_maxindex(void)
+static __init void radix_tree_init_arrays(void)
{
- unsigned int i;
+ unsigned int i, j;

for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
height_to_maxindex[i] = __maxindex(i);
+ for (i = 0; i < ARRAY_SIZE(height_to_maxnodes); i++) {
+ for (j = i; j > 0; j--)
+ height_to_maxnodes[i] += height_to_maxindex[j - 1] + 1;
+ }
}

static int radix_tree_callback(struct notifier_block *nfb,
@@ -1611,6 +1663,6 @@ void __init radix_tree_init(void)
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
radix_tree_node_ctor);
- radix_tree_init_maxindex();
+ radix_tree_init_arrays();
hotcpu_notifier(radix_tree_callback, 0);
}
--
2.8.0.rc3

2016-04-16 00:24:25

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 25/29] shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings

Let's wire up existing madvise() hugepage hints for file mappings.

MADV_HUGEPAGE advise shmem to allocate huge page on page fault in the
VMA. It only has effect if the filesystem is mounted with huge=advise or
huge=within_size.

MADV_NOHUGEPAGE prevents hugepage from being allocated on page fault in
the VMA. It doesn't prevent a huge page from being allocated by other
means, i.e. page fault into different mapping or write(2) into file.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 19 +++++--------------
mm/shmem.c | 20 +++++++++++++++++---
2 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8fd0321073fa..b0acbeb18645 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1837,7 +1837,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
return NULL;
}

-#define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
+#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)

int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
@@ -1853,11 +1853,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
if (mm_has_pgste(vma->vm_mm))
return 0;
#endif
- /*
- * Be somewhat over-protective like KSM for now!
- */
- if (*vm_flags & VM_NO_THP)
- return -EINVAL;
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
@@ -1865,15 +1860,11 @@ int hugepage_madvise(struct vm_area_struct *vma,
* register it here without waiting a page fault that
* may not happen any time soon.
*/
- if (unlikely(khugepaged_enter_vma_merge(vma, *vm_flags)))
+ if (!(*vm_flags & VM_NO_KHUGEPAGED) &&
+ khugepaged_enter_vma_merge(vma, *vm_flags))
return -ENOMEM;
break;
case MADV_NOHUGEPAGE:
- /*
- * Be somewhat over-protective like KSM for now!
- */
- if (*vm_flags & VM_NO_THP)
- return -EINVAL;
*vm_flags &= ~VM_HUGEPAGE;
*vm_flags |= VM_NOHUGEPAGE;
/*
@@ -1984,7 +1975,7 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
if (vma->vm_ops)
/* khugepaged not yet working on file or special mappings */
return 0;
- VM_BUG_ON_VMA(vm_flags & VM_NO_THP, vma);
+ VM_BUG_ON_VMA(vm_flags & VM_NO_KHUGEPAGED, vma);
hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
hend = vma->vm_end & HPAGE_PMD_MASK;
if (hstart < hend)
@@ -2373,7 +2364,7 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
return false;
if (is_vma_temporary_stack(vma))
return false;
- VM_BUG_ON_VMA(vma->vm_flags & VM_NO_THP, vma);
+ VM_BUG_ON_VMA(vma->vm_flags & VM_NO_KHUGEPAGED, vma);
return true;
}

diff --git a/mm/shmem.c b/mm/shmem.c
index 26236e8564f7..0ebf2e3a2239 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -101,6 +101,8 @@ struct shmem_falloc {
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_NOHUGE, /* like SGP_CACHE, but no huge pages */
+ SGP_HUGE, /* like SGP_CACHE, huge pages preferred */
SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
};
@@ -1364,6 +1366,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
+ enum sgp_type sgp_huge = sgp;
pgoff_t hindex = index;
int error;
int once = 0;
@@ -1371,6 +1374,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,

if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
return -EFBIG;
+ if (sgp == SGP_NOHUGE || sgp == SGP_HUGE)
+ sgp = SGP_CACHE;
repeat:
swap.val = 0;
page = find_lock_entry(mapping, index);
@@ -1489,7 +1494,7 @@ repeat:
/* shmem_symlink() */
if (mapping->a_ops != &shmem_aops)
goto alloc_nohuge;
- if (shmem_huge == SHMEM_HUGE_DENY)
+ if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
goto alloc_nohuge;
if (shmem_huge == SHMEM_HUGE_FORCE)
goto alloc_huge;
@@ -1506,7 +1511,9 @@ repeat:
goto alloc_huge;
/* fallthrough */
case SHMEM_HUGE_ADVISE:
- /* TODO: wire up fadvise()/madvise() */
+ if (sgp_huge == SGP_HUGE)
+ goto alloc_huge;
+ /* TODO: implement fadvise() hints */
goto alloc_nohuge;
}

@@ -1635,6 +1642,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct inode *inode = file_inode(vma->vm_file);
gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
+ enum sgp_type sgp;
int error;
int ret = VM_FAULT_LOCKED;

@@ -1696,7 +1704,13 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
spin_unlock(&inode->i_lock);
}

- error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
+ sgp = SGP_CACHE;
+ if (vma->vm_flags & VM_HUGEPAGE)
+ sgp = SGP_HUGE;
+ else if (vma->vm_flags & VM_NOHUGEPAGE)
+ sgp = SGP_NOHUGE;
+
+ error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
gfp, vma->vm_mm, &ret);
if (error)
return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
--
2.8.0.rc3

2016-04-16 00:24:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 26/29] thp: update Documentation/vm/transhuge.txt

Add info about tmpfs/shmem with huge pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 130 +++++++++++++++++++++++++++++------------
1 file changed, 93 insertions(+), 37 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index d9cb65cf5cfd..96a49f123cac 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -9,8 +9,8 @@ using huge pages for the backing of virtual memory with huge pages
that supports the automatic promotion and demotion of page sizes and
without the shortcomings of hugetlbfs.

-Currently it only works for anonymous memory mappings but in the
-future it can expand over the pagecache layer starting with tmpfs.
+Currently it only works for anonymous memory mappings and tmpfs/shmem.
+But in the future it can expand to other filesystems.

The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it's not
@@ -48,7 +48,7 @@ miss is going to run faster.
- if some task quits and more hugepages become available (either
immediately in the buddy or through the VM), guest physical memory
backed by regular pages should be relocated on hugepages
- automatically (with khugepaged)
+ automatically (with khugepaged, limited to anonymous huge pages for now)

- it doesn't require memory reservation and in turn it uses hugepages
whenever possible (the only possible reservation here is kernelcore=
@@ -57,10 +57,6 @@ miss is going to run faster.
feature that applies to all dynamic high order allocations in the
kernel)

-- this initial support only offers the feature in the anonymous memory
- regions but it'd be ideal to move it to tmpfs and the pagecache
- later
-
Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable
@@ -94,21 +90,21 @@ madvise(MADV_HUGEPAGE) on their critical mmapped regions.

== sysfs ==

-Transparent Hugepage Support can be entirely disabled (mostly for
-debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
-avoid the risk of consuming more memory resources) or enabled system
-wide. This can be achieved with one of:
+Transparent Hugepage Support for anonymous memory can be entirely disabled
+(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
+regions (to avoid the risk of consuming more memory resources) or enabled
+system wide. This can be achieved with one of:

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit defrag efforts in the VM to generate
-hugepages in case they're not immediately free to madvise regions or
-to never try to defrag memory and simply fallback to regular pages
-unless hugepages are immediately available. Clearly if we spend CPU
-time to defrag memory, we would expect to gain even more by the fact
-we use hugepages later instead of regular pages. This isn't always
+anonymous hugepages in case they're not immediately free to madvise
+regions or to never try to defrag memory and simply fallback to regular
+pages unless hugepages are immediately available. Clearly if we spend CPU
+time to defrag memory, we would expect to gain even more by the fact we
+use hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

@@ -133,9 +129,9 @@ that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

"never" should be self-explanatory.

-By default kernel tries to use huge zero page on read page fault.
-It's possible to disable huge zero page by writing 0 or enable it
-back by writing 1:
+By default kernel tries to use huge zero page on read page fault to
+anonymous mapping. It's possible to disable huge zero page by writing 0
+or enable it back by writing 1:

echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
@@ -204,21 +200,67 @@ Support by passing the parameter "transparent_hugepage=always" or
"transparent_hugepage=madvise" or "transparent_hugepage=never"
(without "") to the kernel command line.

+== Hugepages in tmpfs/shmem ==
+
+You can control hugepage allocation policy in tmpfs with mount option
+"huge=". It can have following values:
+
+ - "always":
+ Attempt to allocate huge pages every time we need a new page;
+
+ - "never":
+ Do not allocate huge pages;
+
+ - "within_size":
+ Only allocate huge page if it will be fully within i_size.
+ Also respect fadvise()/madvise() hints;
+
+ - "advise:
+ Only allocate huge pages if requested with fadvise()/madvise();
+
+The default policy is "never".
+
+"mount -o remount,huge= /mountpoint" works fine after mount: remounting
+huge=never will not attempt to break up huge pages at all, just stop more
+from being allocated.
+
+There's also sysfs knob to control hugepage allocation policy for internal
+shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
+is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
+MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+In addition to policies listed above, shmem_enabled allows two further
+values:
+
+ - "deny":
+ For use in emergencies, to force the huge option off from
+ all mounts;
+ - "force":
+ Force the huge option on for all - very useful for testing;
+
== Need of application restart ==

-The transparent_hugepage/enabled values only affect future
-behavior. So to make them effective you need to restart any
-application that could have been using hugepages. This also applies to
-the regions registered in khugepaged.
+The transparent_hugepage/enabled values and tmpfs mount option only affect
+future behavior. So to make them effective you need to restart any
+application that could have been using hugepages. This also applies to the
+regions registered in khugepaged.

== Monitoring usage ==

-The number of transparent huge pages currently used by the system is
-available by reading the AnonHugePages field in /proc/meminfo. To
-identify what applications are using transparent huge pages, it is
-necessary to read /proc/PID/smaps and count the AnonHugePages fields
-for each mapping. Note that reading the smaps file is expensive and
-reading it frequently will incur overhead.
+The number of anonymous transparent huge pages currently used by the
+system is available by reading the AnonHugePages field in /proc/meminfo.
+To identify what applications are using anonymous transparent huge pages,
+it is necessary to read /proc/PID/smaps and count the AnonHugePages fields
+for each mapping.
+
+The number of file transparent huge pages mapped to userspace is available
+by reading the FileHugeMapped field in /proc/meminfo. To identify what
+applications are mapping file transparent huge pages, it is necessary
+to read /proc/PID/smaps and count the FileHugeMapped fields for each
+mapping.
+
+Note that reading the smaps file is expensive and reading it
+frequently will incur overhead.

There are a number of counters in /proc/vmstat that may be used to
monitor how successfully the system is providing huge pages for use.
@@ -238,6 +280,12 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
the allocation.

+thp_file_alloc is incremented every time a file huge page is successfully
+i allocated.
+
+thp_file_mapped is incremented every time a file huge page is mapped into
+ user address space.
+
thp_split_page is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.
@@ -403,19 +451,27 @@ pages:
on relevant sub-page of the compound page.

- map/unmap of the whole compound page accounted in compound_mapcount
- (stored in first tail page).
+ (stored in first tail page). For file huge pages, we also increment
+ ->_mapcount of all sub-pages in order to have race-free detection of
+ last unmap of subpages.

-PageDoubleMap() indicates that ->_mapcount in all subpages is offset up by one.
-This additional reference is required to get race-free detection of unmap of
-subpages when we have them mapped with both PMDs and PTEs.
+PageDoubleMap() indicates that the page is *possibly* mapped with PTEs.
+
+For anonymous pages PageDoubleMap() also indicates ->_mapcount in all
+subpages is offset up by one. This additional reference is required to
+get race-free detection of unmap of subpages when we have them mapped with
+both PMDs and PTEs.

This is optimization required to lower overhead of per-subpage mapcount
tracking. The alternative is alter ->_mapcount in all subpages on each
map/unmap of the whole compound page.

-We set PG_double_map when a PMD of the page got split for the first time,
-but still have PMD mapping. The addtional references go away with last
-compound_mapcount.
+For anonymous pages, we set PG_double_map when a PMD of the page got split
+for the first time, but still have PMD mapping. The additional references
+go away with last compound_mapcount.
+
+File pages get PG_double_map set on first map of the page with PTE and
+goes away when the page gets evicted from page cache.

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
@@ -427,7 +483,7 @@ sum of mapcount of all sub-pages plus one (split_huge_page caller must
have reference for head page).

split_huge_page uses migration entries to stabilize page->_count and
-page->_mapcount.
+page->_mapcount of anonymous pages. File pages just got unmapped.

We safe against physical memory scanners too: the only legitimate way
scanner can get reference to a page is get_page_unless_zero().
--
2.8.0.rc3

2016-04-16 00:24:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 05/29] rmap: support file thp

Naive approach: on mapping/unmapping the page as compound we update
->_mapcount on each 4k page. That's not efficient, but it's not obvious
how we can optimize this. We can look into optimization later.

PG_double_map optimization doesn't work for file pages since lifecycle
of file pages is different comparing to anon pages: file page can be
mapped again at any time.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/rmap.h | 2 +-
mm/huge_memory.c | 10 +++++++---
mm/memory.c | 4 ++--
mm/migrate.c | 2 +-
mm/rmap.c | 48 +++++++++++++++++++++++++++++++++++-------------
mm/util.c | 6 ++++++
6 files changed, 52 insertions(+), 20 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 49eb4f8ebac9..5704f101b52e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -165,7 +165,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, int);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
-void page_add_file_rmap(struct page *);
+void page_add_file_rmap(struct page *, bool);
void page_remove_rmap(struct page *, bool);

void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4014b484737..aab10c81de12 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3262,18 +3262,22 @@ static void __split_huge_page(struct page *page, struct list_head *list)

int total_mapcount(struct page *page)
{
- int i, ret;
+ int i, compound, ret;

VM_BUG_ON_PAGE(PageTail(page), page);

if (likely(!PageCompound(page)))
return atomic_read(&page->_mapcount) + 1;

- ret = compound_mapcount(page);
+ compound = compound_mapcount(page);
if (PageHuge(page))
- return ret;
+ return compound;
+ ret = compound;
for (i = 0; i < HPAGE_PMD_NR; i++)
ret += atomic_read(&page[i]._mapcount) + 1;
+ /* File pages has compound_mapcount included in _mapcount */
+ if (!PageAnon(page))
+ return ret - compound * HPAGE_PMD_NR;
if (PageDoubleMap(page))
ret -= HPAGE_PMD_NR;
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index c31c52507956..49c55446576a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1438,7 +1438,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
/* Ok, finally just insert the thing.. */
get_page(page);
inc_mm_counter_fast(mm, mm_counter_file(page));
- page_add_file_rmap(page);
+ page_add_file_rmap(page, false);
set_pte_at(mm, addr, pte, mk_pte(page, prot));

retval = 0;
@@ -2901,7 +2901,7 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page);
+ page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);

diff --git a/mm/migrate.c b/mm/migrate.c
index 3eafd17fb398..b8f8363df8da 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -170,7 +170,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
} else if (PageAnon(new))
page_add_anon_rmap(new, vma, addr, false);
else
- page_add_file_rmap(new);
+ page_add_file_rmap(new, false);

if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);
diff --git a/mm/rmap.c b/mm/rmap.c
index 7f8652720a25..76d8c9269ac6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1271,18 +1271,34 @@ void page_add_new_anon_rmap(struct page *page,
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page)
+void page_add_file_rmap(struct page *page, bool compound)
{
+ int i, nr = 1;
+
+ VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
lock_page_memcg(page);
- if (atomic_inc_and_test(&page->_mapcount)) {
- __inc_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+ if (compound && PageTransHuge(page)) {
+ for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ if (atomic_inc_and_test(&page[i]._mapcount))
+ nr++;
+ }
+ if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
+ goto out;
+ } else {
+ if (!atomic_inc_and_test(&page->_mapcount))
+ goto out;
}
+ __mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, nr);
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+out:
unlock_page_memcg(page);
}

-static void page_remove_file_rmap(struct page *page)
+static void page_remove_file_rmap(struct page *page, bool compound)
{
+ int i, nr = 1;
+
+ VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
lock_page_memcg(page);

/* Hugepages are not counted in NR_FILE_MAPPED for now. */
@@ -1293,15 +1309,24 @@ static void page_remove_file_rmap(struct page *page)
}

/* page still mapped by someone else? */
- if (!atomic_add_negative(-1, &page->_mapcount))
- goto out;
+ if (compound && PageTransHuge(page)) {
+ for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ if (atomic_add_negative(-1, &page[i]._mapcount))
+ nr++;
+ }
+ if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+ goto out;
+ } else {
+ if (!atomic_add_negative(-1, &page->_mapcount))
+ goto out;
+ }

/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- __dec_zone_page_state(page, NR_FILE_MAPPED);
+ __mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, -nr);
mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);

if (unlikely(PageMlocked(page)))
@@ -1357,11 +1382,8 @@ static void page_remove_anon_compound_rmap(struct page *page)
*/
void page_remove_rmap(struct page *page, bool compound)
{
- if (!PageAnon(page)) {
- VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
- page_remove_file_rmap(page);
- return;
- }
+ if (!PageAnon(page))
+ return page_remove_file_rmap(page, compound);

if (compound)
return page_remove_anon_compound_rmap(page);
diff --git a/mm/util.c b/mm/util.c
index 6cc81e7b8705..b7ac1d708cb0 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -386,6 +386,12 @@ int __page_mapcount(struct page *page)
int ret;

ret = atomic_read(&page->_mapcount) + 1;
+ /*
+ * For file THP page->_mapcount contains total number of mapping
+ * of the page: no need to look into compound_mapcount.
+ */
+ if (!PageAnon(page) && !PageHuge(page))
+ return ret;
page = compound_head(page);
ret += atomic_read(compound_mapcount_ptr(page)) + 1;
if (PageDoubleMap(page))
--
2.8.0.rc3

2016-04-16 00:24:44

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 12/29] thp: prepare change_huge_pmd() for file thp

change_huge_pmd() has assert which is not relvant for file page.
For shared mapping it's perfectly fine to have page table entry
writable, without explicit mkwrite.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3a3f9ff45e3f..5a71e0ffcf19 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1795,7 +1795,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
entry = pmd_mkwrite(entry);
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(!preserve_write && pmd_write(entry));
+ BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
+ pmd_write(entry));
}
spin_unlock(ptl);
}
--
2.8.0.rc3

2016-04-16 00:24:59

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 22/29] shmem: prepare huge= mount option and sysfs knob

This patch adds new mount option "huge=". It can have following values:

- "always":
Attempt to allocate huge pages every time we need a new page;

- "never":
Do not allocate huge pages;

- "within_size":
Only allocate huge page if it will be fully within i_size.
Also respect fadvise()/madvise() hints;

- "advise:
Only allocate huge pages if requested with fadvise()/madvise();

Default is "never" for now.

"mount -o remount,huge= /mountpoint" works fine after mount: remounting
huge=never will not attempt to break up huge pages at all, just stop
more from being allocated.

No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE,
which is the appropriate option to protect those who don't want
the new bloat, and with which we shall share some pmd code.

Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
explicit).

Allow enabling THP only if the machine has_transparent_hugepage().

But what about Shmem with no user-visible mount? SysV SHM, memfds,
shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers'
DRM objects, Ashmem. Though unlikely to suit all usages, provide
sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled to
experiment with huge on those.

And allow shmem_enabled two further values:

- "deny":
For use in emergencies, to force the huge option off from
all mounts;
- "force":
Force the huge option on for all - very useful for testing;

Based on patch by Hugh Dickins.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 2 +
include/linux/shmem_fs.h | 3 +-
mm/huge_memory.c | 3 +
mm/shmem.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 168 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 95f83770443c..278ade85e224 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -41,6 +41,8 @@ enum transparent_hugepage_flag {
#endif
};

+extern struct kobj_attribute shmem_enabled_attr;
+
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 4d4780c00d34..466f18c73a49 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -28,9 +28,10 @@ struct shmem_sb_info {
unsigned long max_inodes; /* How many inodes are allowed */
unsigned long free_inodes; /* How many are left for allocation */
spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
+ umode_t mode; /* Mount mode for root directory */
+ unsigned char huge; /* Whether to try for hugepages */
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
- umode_t mode; /* Mount mode for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
};

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 61f1651bf02f..8fd0321073fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -442,6 +442,9 @@ static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
&use_zero_page_attr.attr,
+#ifdef CONFIG_SHMEM
+ &shmem_enabled_attr.attr,
+#endif
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
#endif
diff --git a/mm/shmem.c b/mm/shmem.c
index 5f3b9af0d495..3a6349e9c186 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -289,6 +289,87 @@ static bool shmem_confirm_swap(struct address_space *mapping,
}

/*
+ * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
+ *
+ * SHMEM_HUGE_NEVER:
+ * disables huge pages for the mount;
+ * SHMEM_HUGE_ALWAYS:
+ * enables huge pages for the mount;
+ * SHMEM_HUGE_WITHIN_SIZE:
+ * only allocate huge pages if the page will be fully within i_size,
+ * also respect fadvise()/madvise() hints;
+ * SHMEM_HUGE_ADVISE:
+ * only allocate huge pages if requested with fadvise()/madvise();
+ */
+
+#define SHMEM_HUGE_NEVER 0
+#define SHMEM_HUGE_ALWAYS 1
+#define SHMEM_HUGE_WITHIN_SIZE 2
+#define SHMEM_HUGE_ADVISE 3
+
+/*
+ * Special values.
+ * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
+ *
+ * SHMEM_HUGE_DENY:
+ * disables huge on shm_mnt and all mounts, for emergency use;
+ * SHMEM_HUGE_FORCE:
+ * enables huge on shm_mnt and all mounts, w/o needing option, for testing;
+ *
+ */
+#define SHMEM_HUGE_DENY (-1)
+#define SHMEM_HUGE_FORCE (-2)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* ifdef here to avoid bloating shmem.o when not necessary */
+
+int shmem_huge __read_mostly;
+
+static int shmem_parse_huge(const char *str)
+{
+ if (!strcmp(str, "never"))
+ return SHMEM_HUGE_NEVER;
+ if (!strcmp(str, "always"))
+ return SHMEM_HUGE_ALWAYS;
+ if (!strcmp(str, "within_size"))
+ return SHMEM_HUGE_WITHIN_SIZE;
+ if (!strcmp(str, "advise"))
+ return SHMEM_HUGE_ADVISE;
+ if (!strcmp(str, "deny"))
+ return SHMEM_HUGE_DENY;
+ if (!strcmp(str, "force"))
+ return SHMEM_HUGE_FORCE;
+ return -EINVAL;
+}
+
+static const char *shmem_format_huge(int huge)
+{
+ switch (huge) {
+ case SHMEM_HUGE_NEVER:
+ return "never";
+ case SHMEM_HUGE_ALWAYS:
+ return "always";
+ case SHMEM_HUGE_WITHIN_SIZE:
+ return "within_size";
+ case SHMEM_HUGE_ADVISE:
+ return "advise";
+ case SHMEM_HUGE_DENY:
+ return "deny";
+ case SHMEM_HUGE_FORCE:
+ return "force";
+ default:
+ VM_BUG_ON(1);
+ return "bad_val";
+ }
+}
+
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+
+#define shmem_huge SHMEM_HUGE_DENY
+
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+/*
* Like add_to_page_cache_locked, but error if expected item has gone.
*/
static int shmem_add_to_page_cache(struct page *page,
@@ -2857,11 +2938,24 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
sbinfo->gid = make_kgid(current_user_ns(), gid);
if (!gid_valid(sbinfo->gid))
goto bad_val;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ } else if (!strcmp(this_char, "huge")) {
+ int huge;
+ huge = shmem_parse_huge(value);
+ if (huge < 0)
+ goto bad_val;
+ if (!has_transparent_hugepage() &&
+ huge != SHMEM_HUGE_NEVER)
+ goto bad_val;
+ sbinfo->huge = huge;
+#endif
+#ifdef CONFIG_NUMA
} else if (!strcmp(this_char,"mpol")) {
mpol_put(mpol);
mpol = NULL;
if (mpol_parse_str(value, &mpol))
goto bad_val;
+#endif
} else {
pr_err("tmpfs: Bad mount option %s\n", this_char);
goto error;
@@ -2907,6 +3001,7 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
goto out;

error = 0;
+ sbinfo->huge = config.huge;
sbinfo->max_blocks = config.max_blocks;
sbinfo->max_inodes = config.max_inodes;
sbinfo->free_inodes = config.max_inodes - inodes;
@@ -2940,6 +3035,11 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
seq_printf(seq, ",gid=%u",
from_kgid_munged(&init_user_ns, sbinfo->gid));
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
+ if (sbinfo->huge)
+ seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
+#endif
shmem_show_mpol(seq, sbinfo->mpol);
return 0;
}
@@ -3278,6 +3378,13 @@ int __init shmem_init(void)
pr_err("Could not kern_mount tmpfs\n");
goto out1;
}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY)
+ SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
+ else
+ shmem_huge = 0; /* just in case it was patched */
+#endif
return 0;

out1:
@@ -3289,6 +3396,60 @@ out3:
return error;
}

+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS)
+static ssize_t shmem_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int values[] = {
+ SHMEM_HUGE_ALWAYS,
+ SHMEM_HUGE_WITHIN_SIZE,
+ SHMEM_HUGE_ADVISE,
+ SHMEM_HUGE_NEVER,
+ SHMEM_HUGE_DENY,
+ SHMEM_HUGE_FORCE,
+ };
+ int i, count;
+
+ for (i = 0, count = 0; i < ARRAY_SIZE(values); i++) {
+ const char *fmt = shmem_huge == values[i] ? "[%s] " : "%s ";
+
+ count += sprintf(buf + count, fmt,
+ shmem_format_huge(values[i]));
+ }
+ buf[count - 1] = '\n';
+ return count;
+}
+
+static ssize_t shmem_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ char tmp[16];
+ int huge;
+
+ if (count + 1 > sizeof(tmp))
+ return -EINVAL;
+ memcpy(tmp, buf, count);
+ tmp[count] = '\0';
+ if (count && tmp[count - 1] == '\n')
+ tmp[count - 1] = '\0';
+
+ huge = shmem_parse_huge(tmp);
+ if (huge == -EINVAL)
+ return -EINVAL;
+ if (!has_transparent_hugepage() &&
+ huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
+ return -EINVAL;
+
+ shmem_huge = huge;
+ if (shmem_huge < SHMEM_HUGE_DENY)
+ SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
+ return count;
+}
+
+struct kobj_attribute shmem_enabled_attr =
+ __ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
+
#else /* !CONFIG_SHMEM */

/*
--
2.8.0.rc3

2016-04-16 00:25:28

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 17/29] page-flags: relax policy for PG_mappedtodisk and PG_reclaim

These flags are in use for file THP.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/page-flags.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 517707ae8cd1..9d518876dd6a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -292,11 +292,11 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
*/
TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
-PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_COMPOUND)
+PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)

/* PG_readahead is only used for reads; PG_reclaim is only for writes */
-PAGEFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
- TESTCLEARFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
+ TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)

--
2.8.0.rc3

2016-04-16 00:25:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 19/29] filemap: prepare find and delete operations for huge pages

For now, we would have HPAGE_PMD_NR entries in radix tree for every huge
page. That's suboptimal and it will be changed to use Matthew's
multi-order entries later.

'add' operation is not changed, because we don't need it to implement
hugetmpfs: shmem uses its own implementation.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 187 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 134 insertions(+), 53 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 7e982835d4ec..bf29ab4f87dc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,43 +110,18 @@
* ->tasklist_lock (memory_failure, collect_procs_ao)
*/

-static void page_cache_tree_delete(struct address_space *mapping,
- struct page *page, void *shadow)
+static void __page_cache_tree_delete(struct address_space *mapping,
+ struct radix_tree_node *node, void **slot, unsigned long index,
+ void *shadow)
{
- struct radix_tree_node *node;
- unsigned long index;
- unsigned int offset;
unsigned int tag;
- void **slot;
-
- VM_BUG_ON(!PageLocked(page));

- __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
-
- if (shadow) {
- mapping->nrexceptional++;
- /*
- * Make sure the nrexceptional update is committed before
- * the nrpages update so that final truncate racing
- * with reclaim does not see both counters 0 at the
- * same time and miss a shadow entry.
- */
- smp_wmb();
- }
- mapping->nrpages--;
-
- if (!node) {
- /* Clear direct pointer tags in root node */
- mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
- radix_tree_replace_slot(slot, shadow);
- return;
- }
+ VM_BUG_ON(node == NULL);
+ VM_BUG_ON(*slot == NULL);

/* Clear tree tags for the removed page */
- index = page->index;
- offset = index & RADIX_TREE_MAP_MASK;
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
- if (test_bit(offset, node->tags[tag]))
+ if (test_bit(index & RADIX_TREE_MAP_MASK, node->tags[tag]))
radix_tree_tag_clear(&mapping->page_tree, index, tag);
}

@@ -173,6 +148,54 @@ static void page_cache_tree_delete(struct address_space *mapping,
}
}

+static void page_cache_tree_delete(struct address_space *mapping,
+ struct page *page, void *shadow)
+{
+ struct radix_tree_node *node;
+ unsigned long index;
+ void **slot;
+ int i, nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+ if (shadow) {
+ mapping->nrexceptional += nr;
+ /*
+ * Make sure the nrexceptional update is committed before
+ * the nrpages update so that final truncate racing
+ * with reclaim does not see both counters 0 at the
+ * same time and miss a shadow entry.
+ */
+ smp_wmb();
+ }
+ mapping->nrpages -= nr;
+
+ if (!node) {
+ /* Clear direct pointer tags in root node */
+ mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+ VM_BUG_ON(nr != 1);
+ radix_tree_replace_slot(slot, shadow);
+ return;
+ }
+
+ index = page->index;
+ VM_BUG_ON_PAGE(index & (nr - 1), page);
+ for (i = 0; i < nr; i++) {
+ /* Cross node border */
+ if (i && ((index + i) & RADIX_TREE_MAP_MASK) == 0) {
+ __radix_tree_lookup(&mapping->page_tree,
+ page->index + i, &node, &slot);
+ }
+
+ __page_cache_tree_delete(mapping, node,
+ slot + (i & RADIX_TREE_MAP_MASK), index + i,
+ shadow);
+ }
+}
+
/*
* Delete a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
@@ -181,6 +204,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
void __delete_from_page_cache(struct page *page, void *shadow)
{
struct address_space *mapping = page->mapping;
+ int nr = hpage_nr_pages(page);

trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -193,6 +217,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
else
cleancache_invalidate_page(mapping, page);

+ VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(page_mapped(page), page);
if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(page_mapped(page))) {
int mapcount;
@@ -224,9 +249,9 @@ void __delete_from_page_cache(struct page *page, void *shadow)

/* hugetlb pages do not participate in page cache accounting. */
if (!PageHuge(page))
- __dec_zone_page_state(page, NR_FILE_PAGES);
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);

/*
* At this point page must be either written or cleaned by truncate.
@@ -250,9 +275,8 @@ void __delete_from_page_cache(struct page *page, void *shadow)
*/
void delete_from_page_cache(struct page *page)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page_mapping(page);
unsigned long flags;
-
void (*freepage)(struct page *);

BUG_ON(!PageLocked(page));
@@ -265,7 +289,13 @@ void delete_from_page_cache(struct page *page)

if (freepage)
freepage(page);
- put_page(page);
+
+ if (PageTransHuge(page) && !PageHuge(page)) {
+ page_ref_sub(page, HPAGE_PMD_NR);
+ VM_BUG_ON_PAGE(page_count(page) <= 0, page);
+ } else {
+ put_page(page);
+ }
}
EXPORT_SYMBOL(delete_from_page_cache);

@@ -1054,7 +1084,7 @@ EXPORT_SYMBOL(page_cache_prev_hole);
struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
{
void **pagep;
- struct page *page;
+ struct page *head, *page;

rcu_read_lock();
repeat:
@@ -1074,9 +1104,17 @@ repeat:
*/
goto out;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;

+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
+ goto repeat;
+ }
+
/*
* Has the page moved?
* This is part of the lockless pagecache protocol. See
@@ -1119,12 +1157,12 @@ repeat:
if (page && !radix_tree_exception(page)) {
lock_page(page);
/* Has the page been truncated? */
- if (unlikely(page->mapping != mapping)) {
+ if (unlikely(page_mapping(page) != mapping)) {
unlock_page(page);
put_page(page);
goto repeat;
}
- VM_BUG_ON_PAGE(page->index != offset, page);
+ VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
}
return page;
}
@@ -1256,7 +1294,7 @@ unsigned find_get_entries(struct address_space *mapping,

rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1273,8 +1311,16 @@ repeat:
*/
goto export;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
goto repeat;
+ }

/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1319,7 +1365,7 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,

rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1338,9 +1384,16 @@ repeat:
continue;
}

- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;

+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
+ goto repeat;
+ }
+
/* Has the page moved? */
if (unlikely(page != *slot)) {
put_page(page);
@@ -1380,7 +1433,7 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,

rcu_read_lock();
radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
/* The hole, there no reason to continue */
@@ -1400,8 +1453,14 @@ repeat:
break;
}

- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
+ goto repeat;
+ }

/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1414,7 +1473,7 @@ repeat:
* otherwise we can get both false positives and false
* negatives, which is just confusing to the caller.
*/
- if (page->mapping == NULL || page->index != iter.index) {
+ if (page->mapping == NULL || page_to_pgoff(page) != iter.index) {
put_page(page);
break;
}
@@ -1452,7 +1511,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
rcu_read_lock();
radix_tree_for_each_tagged(slot, &mapping->page_tree,
&iter, *index, tag) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1477,8 +1536,15 @@ repeat:
continue;
}

- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
goto repeat;
+ }

/* Has the page moved? */
if (unlikely(page != *slot)) {
@@ -1526,7 +1592,7 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
rcu_read_lock();
radix_tree_for_each_tagged(slot, &mapping->page_tree,
&iter, start, tag) {
- struct page *page;
+ struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -1544,9 +1610,17 @@ repeat:
*/
goto export;
}
- if (!page_cache_get_speculative(page))
+
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
goto repeat;

+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
+ goto repeat;
+ }
+
/* Has the page moved? */
if (unlikely(page != *slot)) {
put_page(page);
@@ -2140,7 +2214,7 @@ void filemap_map_pages(struct fault_env *fe,
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
loff_t size;
- struct page *page;
+ struct page *head, *page;

rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
@@ -2159,8 +2233,15 @@ repeat:
goto next;
}

- if (!page_cache_get_speculative(page))
+ head = compound_head(page);
+ if (!page_cache_get_speculative(head))
+ goto repeat;
+
+ /* The page was split under us? */
+ if (compound_head(page) != head) {
+ put_page(page);
goto repeat;
+ }

/* Has the page moved? */
if (unlikely(page != *slot)) {
--
2.8.0.rc3

2016-04-16 00:26:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 28/29] khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()

Both variants of khugepaged_alloc_page() do up_read(&mm->mmap_sem)
first: no point keep it inside the function.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/khugepaged.c | 25 ++++++++++---------------
1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ef229cdfe5b7..c4693fd12f76 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -736,19 +736,10 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
}

static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- unsigned long address, int node)
+khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
{
VM_BUG_ON_PAGE(*hpage, *hpage);

- /*
- * Before allocating the hugepage, release the mmap_sem read lock.
- * The allocation can take potentially a long time if it involves
- * sync compaction, and we do not need to hold the mmap_sem during
- * that. We will recheck the vma after taking it again in write mode.
- */
- up_read(&mm->mmap_sem);
-
*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
@@ -809,10 +800,8 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
}

static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- unsigned long address, int node)
+khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
{
- up_read(&mm->mmap_sem);
VM_BUG_ON(!*hpage);

return *hpage;
@@ -896,8 +885,14 @@ static void collapse_huge_page(struct mm_struct *mm,
/* Only allocate from the target node */
gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE;

- /* release the mmap_sem read lock. */
- new_page = khugepaged_alloc_page(hpage, gfp, mm, address, node);
+ /*
+ * Before allocating the hugepage, release the mmap_sem read lock.
+ * The allocation can take potentially a long time if it involves
+ * sync compaction, and we do not need to hold the mmap_sem during
+ * that. We will recheck the vma after taking it again in write mode.
+ */
+ up_read(&mm->mmap_sem);
+ new_page = khugepaged_alloc_page(hpage, gfp, node);
if (!new_page) {
result = SCAN_ALLOC_HUGE_PAGE_FAIL;
goto out_nolock;
--
2.8.0.rc3

2016-04-16 00:26:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 20/29] truncate: handle file thp

For shmem/tmpfs we only need to tweak truncate_inode_page() and
invalidate_mapping_pages().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/truncate.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index b00272810871..4f931ca9333b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -157,10 +157,14 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)

int truncate_inode_page(struct address_space *mapping, struct page *page)
{
+ loff_t holelen;
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ holelen = PageTransHuge(page) ? HPAGE_PMD_SIZE : PAGE_SIZE;
if (page_mapped(page)) {
unmap_mapping_range(mapping,
(loff_t)page->index << PAGE_SHIFT,
- PAGE_SIZE, 0);
+ holelen, 0);
}
return truncate_complete_page(mapping, page);
}
@@ -489,7 +493,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,

if (!trylock_page(page))
continue;
- WARN_ON(page->index != index);
+
+ WARN_ON(page_to_pgoff(page) != index);
+
+ /* Middle of THP: skip */
+ if (PageTransTail(page)) {
+ unlock_page(page);
+ continue;
+ } else if (PageTransHuge(page)) {
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ /* 'end' is in the middle of THP */
+ if (index == round_down(end, HPAGE_PMD_NR))
+ continue;
+ }
+
ret = invalidate_inode_page(page);
unlock_page(page);
/*
--
2.8.0.rc3

2016-04-16 00:26:43

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 15/29] thp, mlock: do not mlock PTE-mapped file huge pages

As with anon THP, we only mlock file huge pages if we can prove that the
page is not mapped with PTE. This way we can avoid mlock leak into
non-mlocked vma on split.

We rely on PageDoubleMap() under lock_page() to check if the the page
may be PTE mapped. PG_double_map is set by page_add_file_rmap() when the
page mapped with PTEs.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/page-flags.h | 13 ++++++++++++-
mm/huge_memory.c | 27 ++++++++++++++++++++-------
mm/mmap.c | 6 ++++++
mm/page_alloc.c | 2 ++
mm/rmap.c | 16 ++++++++++++++--
5 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f4ed4f1b0c77..517707ae8cd1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -544,6 +544,17 @@ static inline int PageDoubleMap(struct page *page)
return PageHead(page) && test_bit(PG_double_map, &page[1].flags);
}

+static inline void SetPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ set_bit(PG_double_map, &page[1].flags);
+}
+
+static inline void ClearPageDoubleMap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ clear_bit(PG_double_map, &page[1].flags);
+}
static inline int TestSetPageDoubleMap(struct page *page)
{
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -560,7 +571,7 @@ static inline int TestClearPageDoubleMap(struct page *page)
TESTPAGEFLAG_FALSE(TransHuge)
TESTPAGEFLAG_FALSE(TransCompound)
TESTPAGEFLAG_FALSE(TransTail)
-TESTPAGEFLAG_FALSE(DoubleMap)
+PAGEFLAG_FALSE(DoubleMap)
TESTSETFLAG_FALSE(DoubleMap)
TESTCLEARFLAG_FALSE(DoubleMap)
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2e44741abeb6..7287b45968ce 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1439,6 +1439,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
* We don't mlock() pte-mapped THPs. This way we can avoid
* leaking mlocked pages into non-VM_LOCKED VMAs.
*
+ * For anon THP:
+ *
* In most cases the pmd is the only mapping of the page as we
* break COW for the mlock() -- see gup_flags |= FOLL_WRITE for
* writable private mappings in populate_vma_page_range().
@@ -1446,15 +1448,26 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
* The only scenario when we have the page shared here is if we
* mlocking read-only mapping shared over fork(). We skip
* mlocking such pages.
+ *
+ * For file THP:
+ *
+ * We can expect PageDoubleMap() to be stable under page lock:
+ * for file pages we set it in page_add_file_rmap(), which
+ * requires page to be locked.
*/
- if (compound_mapcount(page) == 1 && !PageDoubleMap(page) &&
- page->mapping && trylock_page(page)) {
- lru_add_drain();
- if (page->mapping)
- mlock_vma_page(page);
- unlock_page(page);
- }
+
+ if (PageAnon(page) && compound_mapcount(page) != 1)
+ goto skip_mlock;
+ if (PageDoubleMap(page) || !page->mapping)
+ goto skip_mlock;
+ if (!trylock_page(page))
+ goto skip_mlock;
+ lru_add_drain();
+ if (page->mapping && !PageDoubleMap(page))
+ mlock_vma_page(page);
+ unlock_page(page);
}
+skip_mlock:
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page), page);
if (flags & FOLL_GET)
diff --git a/mm/mmap.c b/mm/mmap.c
index df3976af9302..0a68cf0b37e7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2584,6 +2584,12 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
/* drop PG_Mlocked flag for over-mapped range */
for (tmp = vma; tmp->vm_start >= start + size;
tmp = tmp->vm_next) {
+ /*
+ * Split pmd and munlock page on the border
+ * of the range.
+ */
+ vma_adjust_trans_huge(tmp, start, start + size, 0);
+
munlock_vma_pages_range(tmp,
max(tmp->vm_start, start),
min(tmp->vm_end, start + size));
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59de90d5d3a3..4243b400b331 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1036,6 +1036,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)

if (PageAnon(page))
page->mapping = NULL;
+ if (compound)
+ ClearPageDoubleMap(page);
bad += free_pages_check(page);
for (i = 1; i < (1 << order); i++) {
if (compound)
diff --git a/mm/rmap.c b/mm/rmap.c
index 76d8c9269ac6..f67e765869f3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1285,6 +1285,12 @@ void page_add_file_rmap(struct page *page, bool compound)
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
} else {
+ if (PageTransCompound(page)) {
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ SetPageDoubleMap(compound_head(page));
+ if (PageMlocked(page))
+ clear_page_mlock(compound_head(page));
+ }
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
}
@@ -1458,8 +1464,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
- /* Holding pte lock, we do *not* need mmap_sem here */
- mlock_vma_page(page);
+ /* PTE-mapped THP are never mlocked */
+ if (!PageTransCompound(page)) {
+ /*
+ * Holding pte lock, we do *not* need
+ * mmap_sem here
+ */
+ mlock_vma_page(page);
+ }
ret = SWAP_MLOCK;
goto out_unmap;
}
--
2.8.0.rc3

2016-04-16 00:24:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 09/29] thp: handle file pages in split_huge_pmd()

Splitting THP PMD is simple: just unmap it as in DAX case.
Unlike DAX, we also remove the page from rmap and drop reference.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 29ba922a9c26..67830a6ed8b0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2940,10 +2940,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

count_vm_event(THP_SPLIT_PMD);

- if (vma_is_dax(vma)) {
- pmd_t _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ if (!vma_is_anonymous(vma)) {
+ _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
if (is_huge_zero_pmd(_pmd))
put_huge_zero_page();
+ if (vma_is_dax(vma))
+ return;
+ page = pmd_page(_pmd);
+ page_remove_rmap(page, true);
+ put_page(page);
+ add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
return;
} else if (is_huge_zero_pmd(*pmd)) {
return __split_huge_zero_page_pmd(vma, haddr, pmd);
--
2.8.0.rc3

2016-04-16 00:27:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 21/29] mm, rmap: account shmem thp pages

Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
smaps. It indicates how many times we allocate and map shmem THP.

NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/base/node.c | 13 +++++++++----
fs/proc/meminfo.c | 7 +++++--
fs/proc/task_mmu.c | 10 +++++++++-
include/linux/mmzone.h | 4 +++-
mm/huge_memory.c | 4 +++-
mm/page_alloc.c | 19 +++++++++++++++++++
mm/rmap.c | 14 ++++++++------
mm/vmstat.c | 2 ++
8 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..51c7db2c4ee2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -113,6 +113,8 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d ShmemHugePages: %8lu kB\n"
+ "Node %d ShmemPmdMapped: %8lu kB\n"
#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -131,10 +133,13 @@ static ssize_t node_read_meminfo(struct device *dev,
node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
- , nid,
- K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR));
+ nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+ nid, K(node_page_state(nid, NR_ANON_THPS) *
+ HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_SHMEM_THPS) *
+ HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_SHMEM_PMDMAPPED) *
+ HPAGE_PMD_NR));
#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
#endif
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 83720460c5bc..cf301a9ef512 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -105,6 +105,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "ShmemHugePages: %8lu kB\n"
+ "ShmemPmdMapped: %8lu kB\n"
#endif
#ifdef CONFIG_CMA
"CmaTotal: %8lu kB\n"
@@ -162,8 +164,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
, atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- , K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR)
+ , K(global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
#endif
#ifdef CONFIG_CMA
, K(totalcma_pages)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 229cb546bee0..b51651d9d8ae 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -448,6 +448,7 @@ struct mem_size_stats {
unsigned long referenced;
unsigned long anonymous;
unsigned long anonymous_thp;
+ unsigned long shmem_thp;
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
@@ -576,7 +577,12 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
if (IS_ERR_OR_NULL(page))
return;
- mss->anonymous_thp += HPAGE_PMD_SIZE;
+ if (PageAnon(page))
+ mss->anonymous_thp += HPAGE_PMD_SIZE;
+ else if (PageSwapBacked(page))
+ mss->shmem_thp += HPAGE_PMD_SIZE;
+ else
+ VM_BUG_ON_PAGE(1, page);
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
}
#else
@@ -770,6 +776,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
"Referenced: %8lu kB\n"
"Anonymous: %8lu kB\n"
"AnonHugePages: %8lu kB\n"
+ "ShmemPmdMapped: %8lu kB\n"
"Shared_Hugetlb: %8lu kB\n"
"Private_Hugetlb: %7lu kB\n"
"Swap: %8lu kB\n"
@@ -787,6 +794,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
mss.referenced >> 10,
mss.anonymous >> 10,
mss.anonymous_thp >> 10,
+ mss.shmem_thp >> 10,
mss.shared_hugetlb >> 10,
mss.private_hugetlb >> 10,
mss.swap >> 10,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c60df9257cc7..13be8a13bbc1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,7 +158,9 @@ enum zone_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
- NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_ANON_THPS,
+ NR_SHMEM_THPS,
+ NR_SHMEM_PMDMAPPED,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7287b45968ce..61f1651bf02f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3018,7 +3018,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
/* Last compound_mapcount is gone. */
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);
if (TestClearPageDoubleMap(page)) {
/* No need in mapcount reference anymore */
for (i = 0; i < HPAGE_PMD_NR; i++)
@@ -3437,6 +3437,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
pgdata->split_queue_len--;
list_del(page_deferred_list(head));
}
+ if (mapping)
+ __dec_zone_page_state(page, NR_SHMEM_THPS);
spin_unlock(&pgdata->split_queue_lock);
__split_huge_page(page, list, flags);
ret = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4243b400b331..106c38f59e48 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3888,6 +3888,9 @@ void show_free_areas(unsigned int filter)
" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ " anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
+#endif
" free:%lu free_pcp:%lu free_cma:%lu\n",
global_page_state(NR_ACTIVE_ANON),
global_page_state(NR_INACTIVE_ANON),
@@ -3905,6 +3908,11 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_SHMEM),
global_page_state(NR_PAGETABLE),
global_page_state(NR_BOUNCE),
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR,
+ global_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR,
+ global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR,
+#endif
global_page_state(NR_FREE_PAGES),
free_pcp,
global_page_state(NR_FREE_CMA_PAGES));
@@ -3939,6 +3947,11 @@ void show_free_areas(unsigned int filter)
" writeback:%lukB"
" mapped:%lukB"
" shmem:%lukB"
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ " shmem_thp: %lukB"
+ " shmem_pmdmapped: %lukB"
+ " anon_thp: %lukB"
+#endif
" slab_reclaimable:%lukB"
" slab_unreclaimable:%lukB"
" kernel_stack:%lukB"
@@ -3971,6 +3984,12 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_WRITEBACK)),
K(zone_page_state(zone, NR_FILE_MAPPED)),
K(zone_page_state(zone, NR_SHMEM)),
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ K(zone_page_state(zone, NR_SHMEM_THPS) * HPAGE_PMD_NR),
+ K(zone_page_state(zone, NR_SHMEM_PMDMAPPED)
+ * HPAGE_PMD_NR),
+ K(zone_page_state(zone, NR_ANON_THPS) * HPAGE_PMD_NR),
+#endif
K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
zone_page_state(zone, NR_KERNEL_STACK) *
diff --git a/mm/rmap.c b/mm/rmap.c
index f67e765869f3..ba0539074ada 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1213,10 +1213,8 @@ void do_page_add_anon_rmap(struct page *page,
* pte lock(a spinlock) is held, which implies preemption
* disabled.
*/
- if (compound) {
- __inc_zone_page_state(page,
- NR_ANON_TRANSPARENT_HUGEPAGES);
- }
+ if (compound)
+ __inc_zone_page_state(page, NR_ANON_THPS);
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
}
if (unlikely(PageKsm(page)))
@@ -1254,7 +1252,7 @@ void page_add_new_anon_rmap(struct page *page,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
/* increment count (starts at -1) */
atomic_set(compound_mapcount_ptr(page), 0);
- __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __inc_zone_page_state(page, NR_ANON_THPS);
} else {
/* Anon THP always mapped first with PMD */
VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1284,6 +1282,8 @@ void page_add_file_rmap(struct page *page, bool compound)
}
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
+ VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ __inc_zone_page_state(page, NR_SHMEM_PMDMAPPED);
} else {
if (PageTransCompound(page)) {
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1322,6 +1322,8 @@ static void page_remove_file_rmap(struct page *page, bool compound)
}
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
goto out;
+ VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ __dec_zone_page_state(page, NR_SHMEM_PMDMAPPED);
} else {
if (!atomic_add_negative(-1, &page->_mapcount))
goto out;
@@ -1355,7 +1357,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return;

- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_THPS);

if (TestClearPageDoubleMap(page)) {
/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 904ac95fbf00..4d7921f560b5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -762,6 +762,8 @@ const char * const vmstat_text[] = {
"workingset_activate",
"workingset_nodereclaim",
"nr_anon_transparent_hugepages",
+ "nr_shmem_hugepages",
+ "nr_shmem_pmdmapped",
"nr_free_cma",

/* enum writeback_stat_item counters */
--
2.8.0.rc3

2016-04-16 00:27:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 06/29] mm: introduce do_set_pmd()

With postponed page table allocation we have chance to setup huge pages.
do_set_pte() calls do_set_pmd() if following criteria met:

- page is compound;
- pmd entry in pmd_none();
- vma has suitable size and alignment;

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 8 ------
mm/memory.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++-
mm/migrate.c | 3 +--
4 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 05efca1619b6..95f83770443c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -142,6 +142,8 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)

struct page *get_huge_zero_page(void);

+#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aab10c81de12..cece53a94192 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -793,14 +793,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}

-static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
-{
- pmd_t entry;
- entry = mk_pmd(page, prot);
- entry = pmd_mkhuge(entry);
- return entry;
-}
-
static inline struct list_head *page_deferred_list(struct page *page)
{
/*
diff --git a/mm/memory.c b/mm/memory.c
index 49c55446576a..ca45e9b19ad9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2859,6 +2859,66 @@ map_pte:
return 0;
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)
+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+ unsigned long haddr)
+{
+ if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+ return false;
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ return false;
+ return true;
+}
+
+static int do_set_pmd(struct fault_env *fe, struct page *page)
+{
+ struct vm_area_struct *vma = fe->vma;
+ bool write = fe->flags & FAULT_FLAG_WRITE;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
+ pmd_t entry;
+ int i, ret;
+
+ if (!transhuge_vma_suitable(vma, haddr))
+ return VM_FAULT_FALLBACK;
+
+ ret = VM_FAULT_FALLBACK;
+ page = compound_head(page);
+
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd)))
+ goto out;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ flush_icache_page(vma, page + i);
+
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ if (write)
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+
+ add_mm_counter(vma->vm_mm, MM_FILEPAGES, HPAGE_PMD_NR);
+ page_add_file_rmap(page, true);
+
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+
+ update_mmu_cache_pmd(vma, haddr, fe->pmd);
+
+ /* fault is handled */
+ ret = 0;
+out:
+ spin_unlock(fe->ptl);
+ return ret;
+}
+#else
+static int do_set_pmd(struct fault_env *fe, struct page *page)
+{
+ BUILD_BUG();
+ return 0;
+}
+#endif
+
/**
* alloc_set_pte - setup new PTE entry for given page and add reverse page
* mapping. If needed, the fucntion allocates page table or use pre-allocated.
@@ -2878,9 +2938,19 @@ int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
struct vm_area_struct *vma = fe->vma;
bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;
+ int ret;
+
+ if (pmd_none(*fe->pmd) && PageTransCompound(page)) {
+ /* THP on COW? */
+ VM_BUG_ON_PAGE(memcg, page);
+
+ ret = do_set_pmd(fe, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }

if (!fe->pte) {
- int ret = pte_alloc_one_map(fe);
+ ret = pte_alloc_one_map(fe);
if (ret)
return ret;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index b8f8363df8da..f3f65f00448f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1820,8 +1820,7 @@ fail_putback:
}

orig_entry = *pmd;
- entry = mk_pmd(new_page, vma->vm_page_prot);
- entry = pmd_mkhuge(entry);
+ entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

/*
--
2.8.0.rc3

2016-04-16 00:28:02

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 13/29] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem

vma_addjust_trans_huge() splits pmd if it's crossing VMA boundary.
During split we munlock the huge page which requires rmap walk.
rmap wants to take the lock on its own.

Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/mmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index bd2e1a533bc1..df3976af9302 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -678,6 +678,8 @@ again: remove_next = 1 + (end > next->vm_end);
}
}

+ vma_adjust_trans_huge(vma, start, end, adjust_next);
+
if (file) {
mapping = file->f_mapping;
root = &mapping->i_mmap;
@@ -698,8 +700,6 @@ again: remove_next = 1 + (end > next->vm_end);
}
}

- vma_adjust_trans_huge(vma, start, end, adjust_next);
-
anon_vma = vma->anon_vma;
if (!anon_vma && adjust_next)
anon_vma = next->anon_vma;
--
2.8.0.rc3

2016-04-16 00:28:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 08/29] thp: support file pages in zap_huge_pmd()

split_huge_pmd() for file mappings (and DAX too) is implemented by just
clearing pmd entry as we can re-fill this area from page cache on pte
level later.

This means we don't need deposit page tables when file THP is mapped.
Therefore we shouldn't try to withdraw a page table on zap_huge_pmd()
file THP PMD.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cece53a94192..29ba922a9c26 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1696,10 +1696,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page = pmd_page(orig_pmd);
page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON_PAGE(!PageHead(page), page);
- pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
- atomic_long_dec(&tlb->mm->nr_ptes);
+ if (PageAnon(page)) {
+ pgtable_t pgtable;
+ pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+ pte_free(tlb->mm, pgtable);
+ atomic_long_dec(&tlb->mm->nr_ptes);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ } else {
+ add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+ }
spin_unlock(ptl);
tlb_remove_page(tlb, page);
}
--
2.8.0.rc3

2016-04-16 00:28:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 11/29] thp: skip file huge pmd on copy_huge_pmd()

copy_page_range() has a check for "Don't copy ptes where a page fault
will fill them correctly." It works on VMA level. We still copy all page
table entries from private mappings, even if they map page cache.

We can simplify copy_huge_pmd() a bit by skipping file PMDs.

We don't map file private pages with PMDs, so they only can map page
cache. It's safe to skip them as they can be re-faulted later.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 34 ++++++++++++++++------------------
1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 67830a6ed8b0..3a3f9ff45e3f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1094,14 +1094,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
struct page *src_page;
pmd_t pmd;
pgtable_t pgtable = NULL;
- int ret;
+ int ret = -ENOMEM;

- if (!vma_is_dax(vma)) {
- ret = -ENOMEM;
- pgtable = pte_alloc_one(dst_mm, addr);
- if (unlikely(!pgtable))
- goto out;
- }
+ /* Skip if can be re-fill on fault */
+ if (!vma_is_anonymous(vma))
+ return 0;
+
+ pgtable = pte_alloc_one(dst_mm, addr);
+ if (unlikely(!pgtable))
+ goto out;

dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1109,7 +1110,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,

ret = -EAGAIN;
pmd = *src_pmd;
- if (unlikely(!pmd_trans_huge(pmd) && !pmd_devmap(pmd))) {
+ if (unlikely(!pmd_trans_huge(pmd))) {
pte_free(dst_mm, pgtable);
goto out_unlock;
}
@@ -1132,16 +1133,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}

- if (!vma_is_dax(vma)) {
- /* thp accounting separate from pmd_devmap accounting */
- src_page = pmd_page(pmd);
- VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
- get_page(src_page);
- page_dup_rmap(src_page, true);
- add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
- atomic_long_inc(&dst_mm->nr_ptes);
- pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- }
+ src_page = pmd_page(pmd);
+ VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+ get_page(src_page);
+ page_dup_rmap(src_page, true);
+ add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ atomic_long_inc(&dst_mm->nr_ptes);
+ pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);

pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
--
2.8.0.rc3

2016-04-16 00:29:03

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 24/29] shmem: add huge pages support

Here's basic implementation of huge pages support for shmem/tmpfs.

It's all pretty streight-forward:

- shmem_getpage() allcoates huge page if it can and try to inserd into
radix tree with shmem_add_to_page_cache();

- shmem_add_to_page_cache() puts the page onto radix-tree if there's
space for it;

- shmem_undo_range() removes huge pages, if it fully within range.
Partial truncate of huge pages zero out this part of THP.

This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
behaviour. As we don't really create hole in this case,
lseek(SEEK_HOLE) may have inconsistent results depending what
pages happened to be allocated.

- no need to change shmem_fault: core-mm will map an compound page as
huge if VMA is suitable;

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 2 +
mm/filemap.c | 7 +-
mm/memory.c | 2 +-
mm/mempolicy.c | 2 +-
mm/page-writeback.c | 1 +
mm/shmem.c | 335 ++++++++++++++++++++++++++++++++++++++----------
mm/swap.c | 2 +
7 files changed, 281 insertions(+), 70 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 278ade85e224..0315e8d6db57 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -155,6 +155,8 @@ struct page *get_huge_zero_page(void);

#define transparent_hugepage_enabled(__vma) 0

+static inline void prep_transhuge_page(struct page *page) {}
+
#define transparent_hugepage_flags 0UL
static inline int
split_huge_page_to_list(struct page *page, struct list_head *list)
diff --git a/mm/filemap.c b/mm/filemap.c
index bf29ab4f87dc..39f4a18b870b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -250,8 +250,13 @@ void __delete_from_page_cache(struct page *page, void *shadow)
/* hugetlb pages do not participate in page cache accounting. */
if (!PageHuge(page))
__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
- if (PageSwapBacked(page))
+ if (PageSwapBacked(page)) {
__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
+ if (PageTransHuge(page))
+ __dec_zone_page_state(page, NR_SHMEM_THPS);
+ } else {
+ VM_BUG_ON_PAGE(PageTransHuge(page) && !PageHuge(page), page);
+ }

/*
* At this point page must be either written or cleaned by truncate.
diff --git a/mm/memory.c b/mm/memory.c
index cba29c033702..8d8b1357e1d3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1091,7 +1091,7 @@ again:
* unmap shared but keep private pages.
*/
if (details->check_mapping &&
- details->check_mapping != page->mapping)
+ details->check_mapping != page_rmapping(page))
continue;
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 148143974c5b..90c74344957e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -534,7 +534,7 @@ retry:
nid = page_to_nid(page);
if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
continue;
- if (PageTransCompound(page) && PageAnon(page)) {
+ if (PageTransCompound(page)) {
get_page(page);
pte_unmap_unlock(pte, ptl);
lock_page(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 999792d35ccc..8aded71b15a2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2554,6 +2554,7 @@ int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);

+ page = compound_head(page);
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 929fbf2ac43f..26236e8564f7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -173,10 +173,13 @@ static inline int shmem_reacct_size(unsigned long flags,
* shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
* so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
*/
-static inline int shmem_acct_block(unsigned long flags)
+static inline int shmem_acct_block(unsigned long flags, long pages)
{
- return (flags & VM_NORESERVE) ?
- security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_SIZE)) : 0;
+ if (!(flags & VM_NORESERVE))
+ return 0;
+
+ return security_vm_enough_memory_mm(current->mm,
+ pages * VM_ACCT(PAGE_SIZE));
}

static inline void shmem_unacct_blocks(unsigned long flags, long pages)
@@ -376,30 +379,57 @@ static int shmem_add_to_page_cache(struct page *page,
struct address_space *mapping,
pgoff_t index, void *expected)
{
- int error;
+ int error, nr = hpage_nr_pages(page);

+ VM_BUG_ON_PAGE(PageTail(page), page);
+ VM_BUG_ON_PAGE(index != round_down(index, nr), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ VM_BUG_ON(expected && PageTransHuge(page));

- get_page(page);
+ page_ref_add(page, nr);
page->mapping = mapping;
page->index = index;

spin_lock_irq(&mapping->tree_lock);
- if (!expected)
+ if (PageTransHuge(page)) {
+ void __rcu **results;
+ pgoff_t idx;
+ int i;
+
+ error = 0;
+ if (radix_tree_gang_lookup_slot(&mapping->page_tree,
+ &results, &idx, index, 1) &&
+ idx < index + HPAGE_PMD_NR) {
+ error = -EEXIST;
+ }
+
+ if (!error) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ error = radix_tree_insert(&mapping->page_tree,
+ index + i, page + i);
+ VM_BUG_ON(error);
+ }
+ count_vm_event(THP_FILE_ALLOC);
+ }
+ } else if (!expected) {
error = radix_tree_insert(&mapping->page_tree, index, page);
- else
+ } else {
error = shmem_radix_tree_replace(mapping, index, expected,
page);
+ }
+
if (!error) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- __inc_zone_page_state(page, NR_SHMEM);
+ mapping->nrpages += nr;
+ if (PageTransHuge(page))
+ __inc_zone_page_state(page, NR_SHMEM_THPS);
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, nr);
spin_unlock_irq(&mapping->tree_lock);
} else {
page->mapping = NULL;
spin_unlock_irq(&mapping->tree_lock);
- put_page(page);
+ page_ref_sub(page, nr);
}
return error;
}
@@ -412,6 +442,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
struct address_space *mapping = page->mapping;
int error;

+ VM_BUG_ON_PAGE(PageCompound(page), page);
+
spin_lock_irq(&mapping->tree_lock);
error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
page->mapping = NULL;
@@ -591,10 +623,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
continue;
}

+ VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page);
+
if (!trylock_page(page))
continue;
+
+ if (PageTransTail(page)) {
+ /* Middle of THP: zero out the page */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ } else if (PageTransHuge(page)) {
+ if (index == round_down(end, HPAGE_PMD_NR)) {
+ /*
+ * Range ends in the middle of THP:
+ * zero out the page
+ */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ }
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ }
+
if (!unfalloc || !PageUptodate(page)) {
- if (page->mapping == mapping) {
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (page_mapping(page) == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
truncate_inode_page(mapping, page);
}
@@ -670,8 +725,36 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}

lock_page(page);
+
+ if (PageTransTail(page)) {
+ /* Middle of THP: zero out the page */
+ clear_highpage(page);
+ unlock_page(page);
+ /*
+ * Partial thp truncate due 'start' in middle
+ * of THP: don't need to look on these pages
+ * again on !pvec.nr restart.
+ */
+ if (index != round_down(end, HPAGE_PMD_NR))
+ start++;
+ continue;
+ } else if (PageTransHuge(page)) {
+ if (index == round_down(end, HPAGE_PMD_NR)) {
+ /*
+ * Range ends in the middle of THP:
+ * zero out the page
+ */
+ clear_highpage(page);
+ unlock_page(page);
+ continue;
+ }
+ index += HPAGE_PMD_NR - 1;
+ i += HPAGE_PMD_NR - 1;
+ }
+
if (!unfalloc || !PageUptodate(page)) {
- if (page->mapping == mapping) {
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (page_mapping(page) == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
truncate_inode_page(mapping, page);
} else {
@@ -929,6 +1012,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
swp_entry_t swap;
pgoff_t index;

+ VM_BUG_ON_PAGE(PageCompound(page), page);
BUG_ON(!PageLocked(page));
mapping = page->mapping;
index = page->index;
@@ -1065,24 +1149,63 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
#define vm_policy vm_private_data
#endif

+static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
+ struct shmem_inode_info *info, pgoff_t index)
+{
+ /* Create a pseudo vma that just contains the policy */
+ vma->vm_start = 0;
+ /* Bias interleave by inode number to distribute better across nodes */
+ vma->vm_pgoff = index + info->vfs_inode.i_ino;
+ vma->vm_ops = NULL;
+ vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+}
+
+static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
+{
+ /* Drop reference taken by mpol_shared_policy_lookup() */
+ mpol_cond_put(vma->vm_policy);
+}
+
static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
struct vm_area_struct pvma;
struct page *page;

- /* Create a pseudo vma that just contains the policy */
- pvma.vm_start = 0;
- /* Bias interleave by inode number to distribute better across nodes */
- pvma.vm_pgoff = index + info->vfs_inode.i_ino;
- pvma.vm_ops = NULL;
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
-
+ shmem_pseudo_vma_init(&pvma, info, index);
page = swapin_readahead(swap, gfp, &pvma, 0);
+ shmem_pseudo_vma_destroy(&pvma);

- /* Drop reference taken by mpol_shared_policy_lookup() */
- mpol_cond_put(pvma.vm_policy);
+ return page;
+}
+
+static struct page *shmem_alloc_hugepage(gfp_t gfp,
+ struct shmem_inode_info *info, pgoff_t index)
+{
+ struct vm_area_struct pvma;
+ struct inode *inode = &info->vfs_inode;
+ struct address_space *mapping = inode->i_mapping;
+ pgoff_t idx, hindex = round_down(index, HPAGE_PMD_NR);
+ void __rcu **results;
+ struct page *page;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return NULL;
+
+ rcu_read_lock();
+ if (radix_tree_gang_lookup_slot(&mapping->page_tree, &results, &idx,
+ hindex, 1) && idx < hindex + HPAGE_PMD_NR) {
+ rcu_read_unlock();
+ return NULL;
+ }
+ rcu_read_unlock();

+ shmem_pseudo_vma_init(&pvma, info, hindex);
+ page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
+ HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
+ shmem_pseudo_vma_destroy(&pvma);
+ if (page)
+ prep_transhuge_page(page);
return page;
}

@@ -1092,23 +1215,51 @@ static struct page *shmem_alloc_page(gfp_t gfp,
struct vm_area_struct pvma;
struct page *page;

- /* Create a pseudo vma that just contains the policy */
- pvma.vm_start = 0;
- /* Bias interleave by inode number to distribute better across nodes */
- pvma.vm_pgoff = index + info->vfs_inode.i_ino;
- pvma.vm_ops = NULL;
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+ shmem_pseudo_vma_init(&pvma, info, index);
+ page = alloc_page_vma(gfp, &pvma, 0);
+ shmem_pseudo_vma_destroy(&pvma);
+
+ return page;
+}
+
+static struct page *shmem_alloc_and_acct_page(gfp_t gfp,
+ struct shmem_inode_info *info, struct shmem_sb_info *sbinfo,
+ pgoff_t index, bool huge)
+{
+ struct page *page;
+ int nr;
+ int err = -ENOSPC;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ huge = false;
+ nr = huge ? HPAGE_PMD_NR : 1;
+
+ if (shmem_acct_block(info->flags, nr))
+ goto failed;
+ if (sbinfo->max_blocks) {
+ if (percpu_counter_compare(&sbinfo->used_blocks,
+ sbinfo->max_blocks + nr) > 0)
+ goto unacct;
+ percpu_counter_add(&sbinfo->used_blocks, nr);
+ }

- page = alloc_pages_vma(gfp, 0, &pvma, 0, numa_node_id(), false);
+ if (huge)
+ page = shmem_alloc_hugepage(gfp, info, index);
+ else
+ page = shmem_alloc_page(gfp, info, index);
if (page) {
__SetPageLocked(page);
__SetPageSwapBacked(page);
+ return page;
}

- /* Drop reference taken by mpol_shared_policy_lookup() */
- mpol_cond_put(pvma.vm_policy);
-
- return page;
+ err = -ENOMEM;
+ if (sbinfo->max_blocks)
+ percpu_counter_add(&sbinfo->used_blocks, -nr);
+unacct:
+ shmem_unacct_blocks(info->flags, nr);
+failed:
+ return ERR_PTR(err);
}

/*
@@ -1213,6 +1364,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
+ pgoff_t hindex = index;
int error;
int once = 0;
int alloced = 0;
@@ -1334,47 +1486,74 @@ repeat:
swap_free(swap);

} else {
- if (shmem_acct_block(info->flags)) {
- error = -ENOSPC;
- goto failed;
- }
- if (sbinfo->max_blocks) {
- if (percpu_counter_compare(&sbinfo->used_blocks,
- sbinfo->max_blocks) >= 0) {
- error = -ENOSPC;
- goto unacct;
- }
- percpu_counter_inc(&sbinfo->used_blocks);
+ /* shmem_symlink() */
+ if (mapping->a_ops != &shmem_aops)
+ goto alloc_nohuge;
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ goto alloc_nohuge;
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ goto alloc_huge;
+ switch (sbinfo->huge) {
+ loff_t i_size;
+ pgoff_t off;
+ case SHMEM_HUGE_NEVER:
+ goto alloc_nohuge;
+ case SHMEM_HUGE_WITHIN_SIZE:
+ off = round_up(index, HPAGE_PMD_NR);
+ i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ if (i_size >= HPAGE_PMD_SIZE &&
+ i_size >> PAGE_SHIFT >= off)
+ goto alloc_huge;
+ /* fallthrough */
+ case SHMEM_HUGE_ADVISE:
+ /* TODO: wire up fadvise()/madvise() */
+ goto alloc_nohuge;
}

- page = shmem_alloc_page(gfp, info, index);
- if (!page) {
- error = -ENOMEM;
- goto decused;
+alloc_huge:
+ page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+ index, true);
+ if (IS_ERR(page)) {
+alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+ index, false);
+ }
+ if (IS_ERR(page)) {
+ error = PTR_ERR(page);
+ page = NULL;
+ goto failed;
}
+
+ if (PageTransHuge(page))
+ hindex = round_down(index, HPAGE_PMD_NR);
+ else
+ hindex = index;
+
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

error = mem_cgroup_try_charge(page, charge_mm, gfp, &memcg,
- false);
+ PageTransHuge(page));
if (error)
- goto decused;
- error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
+ goto unacct;
+ error = radix_tree_maybe_preload_order(gfp & GFP_RECLAIM_MASK,
+ compound_order(page));
if (!error) {
- error = shmem_add_to_page_cache(page, mapping, index,
+ error = shmem_add_to_page_cache(page, mapping, hindex,
NULL);
radix_tree_preload_end();
}
if (error) {
- mem_cgroup_cancel_charge(page, memcg, false);
- goto decused;
+ mem_cgroup_cancel_charge(page, memcg,
+ PageTransHuge(page));
+ goto unacct;
}
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false,
+ PageTransHuge(page));
lru_cache_add_anon(page);

spin_lock(&info->lock);
- info->alloced++;
- inode->i_blocks += BLOCKS_PER_PAGE;
+ info->alloced += 1 << compound_order(page);
+ inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
shmem_recalc_inode(inode);
spin_unlock(&info->lock);
alloced = true;
@@ -1390,10 +1569,15 @@ clear:
* but SGP_FALLOC on a page fallocated earlier must initialize
* it now, lest undo on failure cancel our earlier guarantee.
*/
- if (sgp != SGP_WRITE) {
- clear_highpage(page);
- flush_dcache_page(page);
- SetPageUptodate(page);
+ if (sgp != SGP_WRITE && !PageUptodate(page)) {
+ struct page *head = compound_head(page);
+ int i;
+
+ for (i = 0; i < (1 << compound_order(head)); i++) {
+ clear_highpage(head + i);
+ flush_dcache_page(head + i);
+ }
+ SetPageUptodate(head);
}
}

@@ -1410,17 +1594,23 @@ clear:
error = -EINVAL;
goto unlock;
}
- *pagep = page;
+ *pagep = page + index - hindex;
return 0;

/*
* Error recovery.
*/
-decused:
- if (sbinfo->max_blocks)
- percpu_counter_add(&sbinfo->used_blocks, -1);
unacct:
- shmem_unacct_blocks(info->flags, 1);
+ if (sbinfo->max_blocks)
+ percpu_counter_add(&sbinfo->used_blocks,
+ 1 << compound_order(page));
+ shmem_unacct_blocks(info->flags, 1 << compound_order(page));
+
+ if (PageTransHuge(page)) {
+ unlock_page(page);
+ put_page(page);
+ goto alloc_nohuge;
+ }
failed:
if (swap.val && !shmem_confirm_swap(mapping, index, swap))
error = -EEXIST;
@@ -1758,12 +1948,23 @@ shmem_write_end(struct file *file, struct address_space *mapping,
i_size_write(inode, pos + copied);

if (!PageUptodate(page)) {
+ struct page *head = compound_head(page);
+ if (PageTransCompound(page)) {
+ int i;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ if (head + i == page)
+ continue;
+ clear_highpage(head + i);
+ flush_dcache_page(head + i);
+ }
+ }
if (copied < PAGE_SIZE) {
unsigned from = pos & (PAGE_SIZE - 1);
zero_user_segments(page, 0, from,
from + copied, PAGE_SIZE);
}
- SetPageUptodate(page);
+ SetPageUptodate(head);
}
set_page_dirty(page);
unlock_page(page);
diff --git a/mm/swap.c b/mm/swap.c
index a0bc206b4ac6..9d777cb75b87 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -291,6 +291,7 @@ static bool need_activate_page_drain(int cpu)

void activate_page(struct page *page)
{
+ page = compound_head(page);
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);

@@ -315,6 +316,7 @@ void activate_page(struct page *page)
{
struct zone *zone = page_zone(page);

+ page = compound_head(page);
spin_lock_irq(&zone->lru_lock);
__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
spin_unlock_irq(&zone->lru_lock);
--
2.8.0.rc3

2016-04-16 00:29:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 27/29] thp: extract khugepaged from mm/huge_memory.c

khugepaged implementation grew to the point when it deserve separate
file in source.

Let's move it to mm/khugepaged.c.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 10 +
include/linux/khugepaged.h | 6 +
mm/Makefile | 2 +-
mm/huge_memory.c | 1428 +-------------------------------------------
mm/khugepaged.c | 1428 ++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 1453 insertions(+), 1421 deletions(-)
create mode 100644 mm/khugepaged.c

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0315e8d6db57..bff34b8e2950 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -41,6 +41,16 @@ enum transparent_hugepage_flag {
#endif
};

+struct kobject;
+struct kobj_attribute;
+
+extern ssize_t single_hugepage_flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum transparent_hugepage_flag flag);
+extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum transparent_hugepage_flag flag);
extern struct kobj_attribute shmem_enabled_attr;

#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eeb307985715..f7b4b7dc97bc 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,12 @@
#include <linux/sched.h> /* MMF_VM_HUGEPAGE */

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct mutex khugepaged_mutex;
+extern struct attribute_group khugepaged_attr_group;
+
+extern int khugepaged_init(void);
+extern void khugepaged_destroy(void);
+extern int start_stop_khugepaged(void);
extern int __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
diff --git a/mm/Makefile b/mm/Makefile
index deb467edca2d..298f34cc4071 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -74,7 +74,7 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_MEMTEST) += memtest.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
obj-$(CONFIG_MEMCG_SWAP) += swap_cgroup.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b0acbeb18645..183021c44ce0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,7 +18,6 @@
#include <linux/mm_inline.h>
#include <linux/swapops.h>
#include <linux/dax.h>
-#include <linux/kthread.h>
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/pfn_t.h>
@@ -37,35 +36,6 @@
#include <asm/pgalloc.h>
#include "internal.h"

-enum scan_result {
- SCAN_FAIL,
- SCAN_SUCCEED,
- SCAN_PMD_NULL,
- SCAN_EXCEED_NONE_PTE,
- SCAN_PTE_NON_PRESENT,
- SCAN_PAGE_RO,
- SCAN_NO_REFERENCED_PAGE,
- SCAN_PAGE_NULL,
- SCAN_SCAN_ABORT,
- SCAN_PAGE_COUNT,
- SCAN_PAGE_LRU,
- SCAN_PAGE_LOCK,
- SCAN_PAGE_ANON,
- SCAN_PAGE_COMPOUND,
- SCAN_ANY_PROCESS,
- SCAN_VMA_NULL,
- SCAN_VMA_CHECK,
- SCAN_ADDRESS_RANGE,
- SCAN_SWAP_CACHE_PAGE,
- SCAN_DEL_PAGE_LRU,
- SCAN_ALLOC_HUGE_PAGE_FAIL,
- SCAN_CGROUP_CHARGE_FAIL,
- SCAN_EXCEED_SWAP_PTE
-};
-
-#define CREATE_TRACE_POINTS
-#include <trace/events/huge_memory.h>
-
/*
* By default transparent hugepage support is disabled in order that avoid
* to risk increase the memory footprint of applications without a guaranteed
@@ -85,127 +55,8 @@ unsigned long transparent_hugepage_flags __read_mostly =
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);

-/* default scan 8*512 pte (or vmas) every 30 second */
-static unsigned int khugepaged_pages_to_scan __read_mostly;
-static unsigned int khugepaged_pages_collapsed;
-static unsigned int khugepaged_full_scans;
-static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
-/* during fragmentation poll the hugepage allocator once every minute */
-static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
-static struct task_struct *khugepaged_thread __read_mostly;
-static DEFINE_MUTEX(khugepaged_mutex);
-static DEFINE_SPINLOCK(khugepaged_mm_lock);
-static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
-/*
- * default collapse hugepages if there is at least one pte mapped like
- * it would have happened if the vma was large enough during page
- * fault.
- */
-static unsigned int khugepaged_max_ptes_none __read_mostly;
-static unsigned int khugepaged_max_ptes_swap __read_mostly = HPAGE_PMD_NR/8;
-
-static int khugepaged(void *none);
-static int khugepaged_slab_init(void);
-static void khugepaged_slab_exit(void);
-
-#define MM_SLOTS_HASH_BITS 10
-static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
-
-static struct kmem_cache *mm_slot_cache __read_mostly;
-
-/**
- * struct mm_slot - hash lookup from mm to mm_slot
- * @hash: hash collision list
- * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
- * @mm: the mm that this information is valid for
- */
-struct mm_slot {
- struct hlist_node hash;
- struct list_head mm_node;
- struct mm_struct *mm;
-};
-
-/**
- * struct khugepaged_scan - cursor for scanning
- * @mm_head: the head of the mm list to scan
- * @mm_slot: the current mm_slot we are scanning
- * @address: the next address inside that to be scanned
- *
- * There is only the one khugepaged_scan instance of this cursor structure.
- */
-struct khugepaged_scan {
- struct list_head mm_head;
- struct mm_slot *mm_slot;
- unsigned long address;
-};
-static struct khugepaged_scan khugepaged_scan = {
- .mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
-};
-
static struct shrinker deferred_split_shrinker;

-static void set_recommended_min_free_kbytes(void)
-{
- struct zone *zone;
- int nr_zones = 0;
- unsigned long recommended_min;
-
- for_each_populated_zone(zone)
- nr_zones++;
-
- /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
- recommended_min = pageblock_nr_pages * nr_zones * 2;
-
- /*
- * Make sure that on average at least two pageblocks are almost free
- * of another type, one for a migratetype to fall back to and a
- * second to avoid subsequent fallbacks of other types There are 3
- * MIGRATE_TYPES we care about.
- */
- recommended_min += pageblock_nr_pages * nr_zones *
- MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
-
- /* don't ever allow to reserve more than 5% of the lowmem */
- recommended_min = min(recommended_min,
- (unsigned long) nr_free_buffer_pages() / 20);
- recommended_min <<= (PAGE_SHIFT-10);
-
- if (recommended_min > min_free_kbytes) {
- if (user_min_free_kbytes >= 0)
- pr_info("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
- min_free_kbytes, recommended_min);
-
- min_free_kbytes = recommended_min;
- }
- setup_per_zone_wmarks();
-}
-
-static int start_stop_khugepaged(void)
-{
- int err = 0;
- if (khugepaged_enabled()) {
- if (!khugepaged_thread)
- khugepaged_thread = kthread_run(khugepaged, NULL,
- "khugepaged");
- if (IS_ERR(khugepaged_thread)) {
- pr_err("khugepaged: kthread_run(khugepaged) failed\n");
- err = PTR_ERR(khugepaged_thread);
- khugepaged_thread = NULL;
- goto fail;
- }
-
- if (!list_empty(&khugepaged_scan.mm_head))
- wake_up_interruptible(&khugepaged_wait);
-
- set_recommended_min_free_kbytes();
- } else if (khugepaged_thread) {
- kthread_stop(khugepaged_thread);
- khugepaged_thread = NULL;
- }
-fail:
- return err;
-}
-
static atomic_t huge_zero_refcount;
struct page *huge_zero_page __read_mostly;

@@ -346,7 +197,7 @@ static ssize_t enabled_store(struct kobject *kobj,
static struct kobj_attribute enabled_attr =
__ATTR(enabled, 0644, enabled_show, enabled_store);

-static ssize_t single_flag_show(struct kobject *kobj,
+ssize_t single_hugepage_flag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf,
enum transparent_hugepage_flag flag)
{
@@ -354,7 +205,7 @@ static ssize_t single_flag_show(struct kobject *kobj,
!!test_bit(flag, &transparent_hugepage_flags));
}

-static ssize_t single_flag_store(struct kobject *kobj,
+ssize_t single_hugepage_flag_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count,
enum transparent_hugepage_flag flag)
@@ -409,13 +260,13 @@ static struct kobj_attribute defrag_attr =
static ssize_t use_zero_page_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
- return single_flag_show(kobj, attr, buf,
+ return single_hugepage_flag_show(kobj, attr, buf,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
}
static ssize_t use_zero_page_store(struct kobject *kobj,
struct kobj_attribute *attr, const char *buf, size_t count)
{
- return single_flag_store(kobj, attr, buf, count,
+ return single_hugepage_flag_store(kobj, attr, buf, count,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
}
static struct kobj_attribute use_zero_page_attr =
@@ -424,14 +275,14 @@ static struct kobj_attribute use_zero_page_attr =
static ssize_t debug_cow_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
- return single_flag_show(kobj, attr, buf,
+ return single_hugepage_flag_show(kobj, attr, buf,
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
}
static ssize_t debug_cow_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count)
{
- return single_flag_store(kobj, attr, buf, count,
+ return single_hugepage_flag_store(kobj, attr, buf, count,
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
}
static struct kobj_attribute debug_cow_attr =
@@ -455,197 +306,6 @@ static struct attribute_group hugepage_attr_group = {
.attrs = hugepage_attr,
};

-static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
-}
-
-static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- unsigned long msecs;
- int err;
-
- err = kstrtoul(buf, 10, &msecs);
- if (err || msecs > UINT_MAX)
- return -EINVAL;
-
- khugepaged_scan_sleep_millisecs = msecs;
- wake_up_interruptible(&khugepaged_wait);
-
- return count;
-}
-static struct kobj_attribute scan_sleep_millisecs_attr =
- __ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
- scan_sleep_millisecs_store);
-
-static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs);
-}
-
-static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- unsigned long msecs;
- int err;
-
- err = kstrtoul(buf, 10, &msecs);
- if (err || msecs > UINT_MAX)
- return -EINVAL;
-
- khugepaged_alloc_sleep_millisecs = msecs;
- wake_up_interruptible(&khugepaged_wait);
-
- return count;
-}
-static struct kobj_attribute alloc_sleep_millisecs_attr =
- __ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
- alloc_sleep_millisecs_store);
-
-static ssize_t pages_to_scan_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
-}
-static ssize_t pages_to_scan_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- int err;
- unsigned long pages;
-
- err = kstrtoul(buf, 10, &pages);
- if (err || !pages || pages > UINT_MAX)
- return -EINVAL;
-
- khugepaged_pages_to_scan = pages;
-
- return count;
-}
-static struct kobj_attribute pages_to_scan_attr =
- __ATTR(pages_to_scan, 0644, pages_to_scan_show,
- pages_to_scan_store);
-
-static ssize_t pages_collapsed_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_pages_collapsed);
-}
-static struct kobj_attribute pages_collapsed_attr =
- __ATTR_RO(pages_collapsed);
-
-static ssize_t full_scans_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_full_scans);
-}
-static struct kobj_attribute full_scans_attr =
- __ATTR_RO(full_scans);
-
-static ssize_t khugepaged_defrag_show(struct kobject *kobj,
- struct kobj_attribute *attr, char *buf)
-{
- return single_flag_show(kobj, attr, buf,
- TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
-}
-static ssize_t khugepaged_defrag_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- return single_flag_store(kobj, attr, buf, count,
- TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
-}
-static struct kobj_attribute khugepaged_defrag_attr =
- __ATTR(defrag, 0644, khugepaged_defrag_show,
- khugepaged_defrag_store);
-
-/*
- * max_ptes_none controls if khugepaged should collapse hugepages over
- * any unmapped ptes in turn potentially increasing the memory
- * footprint of the vmas. When max_ptes_none is 0 khugepaged will not
- * reduce the available free memory in the system as it
- * runs. Increasing max_ptes_none will instead potentially reduce the
- * free memory in the system during the khugepaged scan.
- */
-static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_max_ptes_none);
-}
-static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- int err;
- unsigned long max_ptes_none;
-
- err = kstrtoul(buf, 10, &max_ptes_none);
- if (err || max_ptes_none > HPAGE_PMD_NR-1)
- return -EINVAL;
-
- khugepaged_max_ptes_none = max_ptes_none;
-
- return count;
-}
-static struct kobj_attribute khugepaged_max_ptes_none_attr =
- __ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show,
- khugepaged_max_ptes_none_store);
-
-static ssize_t khugepaged_max_ptes_swap_show(struct kobject *kobj,
- struct kobj_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%u\n", khugepaged_max_ptes_swap);
-}
-
-static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- int err;
- unsigned long max_ptes_swap;
-
- err = kstrtoul(buf, 10, &max_ptes_swap);
- if (err || max_ptes_swap > HPAGE_PMD_NR-1)
- return -EINVAL;
-
- khugepaged_max_ptes_swap = max_ptes_swap;
-
- return count;
-}
-
-static struct kobj_attribute khugepaged_max_ptes_swap_attr =
- __ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show,
- khugepaged_max_ptes_swap_store);
-
-static struct attribute *khugepaged_attr[] = {
- &khugepaged_defrag_attr.attr,
- &khugepaged_max_ptes_none_attr.attr,
- &pages_to_scan_attr.attr,
- &pages_collapsed_attr.attr,
- &full_scans_attr.attr,
- &scan_sleep_millisecs_attr.attr,
- &alloc_sleep_millisecs_attr.attr,
- &khugepaged_max_ptes_swap_attr.attr,
- NULL,
-};
-
-static struct attribute_group khugepaged_attr_group = {
- .attrs = khugepaged_attr,
- .name = "khugepaged",
-};
-
static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
{
int err;
@@ -704,8 +364,6 @@ static int __init hugepage_init(void)
return -EINVAL;
}

- khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
- khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
/*
* hugepages can't be allocated by the buddy allocator
*/
@@ -720,7 +378,7 @@ static int __init hugepage_init(void)
if (err)
goto err_sysfs;

- err = khugepaged_slab_init();
+ err = khugepaged_init();
if (err)
goto err_slab;

@@ -751,7 +409,7 @@ err_khugepaged:
err_split_shrinker:
unregister_shrinker(&huge_zero_page_shrinker);
err_hzp_shrinker:
- khugepaged_slab_exit();
+ khugepaged_destroy();
err_slab:
hugepage_exit_sysfs(hugepage_kobj);
err_sysfs:
@@ -906,12 +564,6 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
return GFP_TRANSHUGE | reclaim_flags;
}

-/* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
-static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
-{
- return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
-}
-
/* Caller must hold page table lock. */
static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
@@ -1837,1070 +1489,6 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
return NULL;
}

-#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
-
-int hugepage_madvise(struct vm_area_struct *vma,
- unsigned long *vm_flags, int advice)
-{
- switch (advice) {
- case MADV_HUGEPAGE:
-#ifdef CONFIG_S390
- /*
- * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
- * can't handle this properly after s390_enable_sie, so we simply
- * ignore the madvise to prevent qemu from causing a SIGSEGV.
- */
- if (mm_has_pgste(vma->vm_mm))
- return 0;
-#endif
- *vm_flags &= ~VM_NOHUGEPAGE;
- *vm_flags |= VM_HUGEPAGE;
- /*
- * If the vma become good for khugepaged to scan,
- * register it here without waiting a page fault that
- * may not happen any time soon.
- */
- if (!(*vm_flags & VM_NO_KHUGEPAGED) &&
- khugepaged_enter_vma_merge(vma, *vm_flags))
- return -ENOMEM;
- break;
- case MADV_NOHUGEPAGE:
- *vm_flags &= ~VM_HUGEPAGE;
- *vm_flags |= VM_NOHUGEPAGE;
- /*
- * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
- * this vma even if we leave the mm registered in khugepaged if
- * it got registered before VM_NOHUGEPAGE was set.
- */
- break;
- }
-
- return 0;
-}
-
-static int __init khugepaged_slab_init(void)
-{
- mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
- sizeof(struct mm_slot),
- __alignof__(struct mm_slot), 0, NULL);
- if (!mm_slot_cache)
- return -ENOMEM;
-
- return 0;
-}
-
-static void __init khugepaged_slab_exit(void)
-{
- kmem_cache_destroy(mm_slot_cache);
-}
-
-static inline struct mm_slot *alloc_mm_slot(void)
-{
- if (!mm_slot_cache) /* initialization failed */
- return NULL;
- return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
-}
-
-static inline void free_mm_slot(struct mm_slot *mm_slot)
-{
- kmem_cache_free(mm_slot_cache, mm_slot);
-}
-
-static struct mm_slot *get_mm_slot(struct mm_struct *mm)
-{
- struct mm_slot *mm_slot;
-
- hash_for_each_possible(mm_slots_hash, mm_slot, hash, (unsigned long)mm)
- if (mm == mm_slot->mm)
- return mm_slot;
-
- return NULL;
-}
-
-static void insert_to_mm_slots_hash(struct mm_struct *mm,
- struct mm_slot *mm_slot)
-{
- mm_slot->mm = mm;
- hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
-}
-
-static inline int khugepaged_test_exit(struct mm_struct *mm)
-{
- return atomic_read(&mm->mm_users) == 0;
-}
-
-int __khugepaged_enter(struct mm_struct *mm)
-{
- struct mm_slot *mm_slot;
- int wakeup;
-
- mm_slot = alloc_mm_slot();
- if (!mm_slot)
- return -ENOMEM;
-
- /* __khugepaged_exit() must not run from under us */
- VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
- if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
- free_mm_slot(mm_slot);
- return 0;
- }
-
- spin_lock(&khugepaged_mm_lock);
- insert_to_mm_slots_hash(mm, mm_slot);
- /*
- * Insert just behind the scanning cursor, to let the area settle
- * down a little.
- */
- wakeup = list_empty(&khugepaged_scan.mm_head);
- list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
- spin_unlock(&khugepaged_mm_lock);
-
- atomic_inc(&mm->mm_count);
- if (wakeup)
- wake_up_interruptible(&khugepaged_wait);
-
- return 0;
-}
-
-int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
- unsigned long vm_flags)
-{
- unsigned long hstart, hend;
- if (!vma->anon_vma)
- /*
- * Not yet faulted in so we will register later in the
- * page fault if needed.
- */
- return 0;
- if (vma->vm_ops)
- /* khugepaged not yet working on file or special mappings */
- return 0;
- VM_BUG_ON_VMA(vm_flags & VM_NO_KHUGEPAGED, vma);
- hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
- hend = vma->vm_end & HPAGE_PMD_MASK;
- if (hstart < hend)
- return khugepaged_enter(vma, vm_flags);
- return 0;
-}
-
-void __khugepaged_exit(struct mm_struct *mm)
-{
- struct mm_slot *mm_slot;
- int free = 0;
-
- spin_lock(&khugepaged_mm_lock);
- mm_slot = get_mm_slot(mm);
- if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
- hash_del(&mm_slot->hash);
- list_del(&mm_slot->mm_node);
- free = 1;
- }
- spin_unlock(&khugepaged_mm_lock);
-
- if (free) {
- clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
- free_mm_slot(mm_slot);
- mmdrop(mm);
- } else if (mm_slot) {
- /*
- * This is required to serialize against
- * khugepaged_test_exit() (which is guaranteed to run
- * under mmap sem read mode). Stop here (after we
- * return all pagetables will be destroyed) until
- * khugepaged has finished working on the pagetables
- * under the mmap_sem.
- */
- down_write(&mm->mmap_sem);
- up_write(&mm->mmap_sem);
- }
-}
-
-static void release_pte_page(struct page *page)
-{
- /* 0 stands for page_is_file_cache(page) == false */
- dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
- unlock_page(page);
- putback_lru_page(page);
-}
-
-static void release_pte_pages(pte_t *pte, pte_t *_pte)
-{
- while (--_pte >= pte) {
- pte_t pteval = *_pte;
- if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
- release_pte_page(pte_page(pteval));
- }
-}
-
-static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
- unsigned long address,
- pte_t *pte)
-{
- struct page *page = NULL;
- pte_t *_pte;
- int none_or_zero = 0, result = 0;
- bool referenced = false, writable = false;
-
- for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
- _pte++, address += PAGE_SIZE) {
- pte_t pteval = *_pte;
- if (pte_none(pteval) || (pte_present(pteval) &&
- is_zero_pfn(pte_pfn(pteval)))) {
- if (!userfaultfd_armed(vma) &&
- ++none_or_zero <= khugepaged_max_ptes_none) {
- continue;
- } else {
- result = SCAN_EXCEED_NONE_PTE;
- goto out;
- }
- }
- if (!pte_present(pteval)) {
- result = SCAN_PTE_NON_PRESENT;
- goto out;
- }
- page = vm_normal_page(vma, address, pteval);
- if (unlikely(!page)) {
- result = SCAN_PAGE_NULL;
- goto out;
- }
-
- VM_BUG_ON_PAGE(PageCompound(page), page);
- VM_BUG_ON_PAGE(!PageAnon(page), page);
- VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
-
- /*
- * We can do it before isolate_lru_page because the
- * page can't be freed from under us. NOTE: PG_lock
- * is needed to serialize against split_huge_page
- * when invoked from the VM.
- */
- if (!trylock_page(page)) {
- result = SCAN_PAGE_LOCK;
- goto out;
- }
-
- /*
- * cannot use mapcount: can't collapse if there's a gup pin.
- * The page must only be referenced by the scanned process
- * and page swap cache.
- */
- if (page_count(page) != 1 + !!PageSwapCache(page)) {
- unlock_page(page);
- result = SCAN_PAGE_COUNT;
- goto out;
- }
- if (pte_write(pteval)) {
- writable = true;
- } else {
- if (PageSwapCache(page) && !reuse_swap_page(page)) {
- unlock_page(page);
- result = SCAN_SWAP_CACHE_PAGE;
- goto out;
- }
- /*
- * Page is not in the swap cache. It can be collapsed
- * into a THP.
- */
- }
-
- /*
- * Isolate the page to avoid collapsing an hugepage
- * currently in use by the VM.
- */
- if (isolate_lru_page(page)) {
- unlock_page(page);
- result = SCAN_DEL_PAGE_LRU;
- goto out;
- }
- /* 0 stands for page_is_file_cache(page) == false */
- inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
- VM_BUG_ON_PAGE(!PageLocked(page), page);
- VM_BUG_ON_PAGE(PageLRU(page), page);
-
- /* If there is no mapped pte young don't collapse the page */
- if (pte_young(pteval) ||
- page_is_young(page) || PageReferenced(page) ||
- mmu_notifier_test_young(vma->vm_mm, address))
- referenced = true;
- }
- if (likely(writable)) {
- if (likely(referenced)) {
- result = SCAN_SUCCEED;
- trace_mm_collapse_huge_page_isolate(page, none_or_zero,
- referenced, writable, result);
- return 1;
- }
- } else {
- result = SCAN_PAGE_RO;
- }
-
-out:
- release_pte_pages(pte, _pte);
- trace_mm_collapse_huge_page_isolate(page, none_or_zero,
- referenced, writable, result);
- return 0;
-}
-
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
- struct vm_area_struct *vma,
- unsigned long address,
- spinlock_t *ptl)
-{
- pte_t *_pte;
- for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
- pte_t pteval = *_pte;
- struct page *src_page;
-
- if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
- clear_user_highpage(page, address);
- add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
- if (is_zero_pfn(pte_pfn(pteval))) {
- /*
- * ptl mostly unnecessary.
- */
- spin_lock(ptl);
- /*
- * paravirt calls inside pte_clear here are
- * superfluous.
- */
- pte_clear(vma->vm_mm, address, _pte);
- spin_unlock(ptl);
- }
- } else {
- src_page = pte_page(pteval);
- copy_user_highpage(page, src_page, address, vma);
- VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
- release_pte_page(src_page);
- /*
- * ptl mostly unnecessary, but preempt has to
- * be disabled to update the per-cpu stats
- * inside page_remove_rmap().
- */
- spin_lock(ptl);
- /*
- * paravirt calls inside pte_clear here are
- * superfluous.
- */
- pte_clear(vma->vm_mm, address, _pte);
- page_remove_rmap(src_page, false);
- spin_unlock(ptl);
- free_page_and_swap_cache(src_page);
- }
-
- address += PAGE_SIZE;
- page++;
- }
-}
-
-static void khugepaged_alloc_sleep(void)
-{
- DEFINE_WAIT(wait);
-
- add_wait_queue(&khugepaged_wait, &wait);
- freezable_schedule_timeout_interruptible(
- msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
- remove_wait_queue(&khugepaged_wait, &wait);
-}
-
-static int khugepaged_node_load[MAX_NUMNODES];
-
-static bool khugepaged_scan_abort(int nid)
-{
- int i;
-
- /*
- * If zone_reclaim_mode is disabled, then no extra effort is made to
- * allocate memory locally.
- */
- if (!zone_reclaim_mode)
- return false;
-
- /* If there is a count for this node already, it must be acceptable */
- if (khugepaged_node_load[nid])
- return false;
-
- for (i = 0; i < MAX_NUMNODES; i++) {
- if (!khugepaged_node_load[i])
- continue;
- if (node_distance(nid, i) > RECLAIM_DISTANCE)
- return true;
- }
- return false;
-}
-
-#ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
-{
- static int last_khugepaged_target_node = NUMA_NO_NODE;
- int nid, target_node = 0, max_value = 0;
-
- /* find first node with max normal pages hit */
- for (nid = 0; nid < MAX_NUMNODES; nid++)
- if (khugepaged_node_load[nid] > max_value) {
- max_value = khugepaged_node_load[nid];
- target_node = nid;
- }
-
- /* do some balance if several nodes have the same hit record */
- if (target_node <= last_khugepaged_target_node)
- for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
- nid++)
- if (max_value == khugepaged_node_load[nid]) {
- target_node = nid;
- break;
- }
-
- last_khugepaged_target_node = target_node;
- return target_node;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
- if (IS_ERR(*hpage)) {
- if (!*wait)
- return false;
-
- *wait = false;
- *hpage = NULL;
- khugepaged_alloc_sleep();
- } else if (*hpage) {
- put_page(*hpage);
- *hpage = NULL;
- }
-
- return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- unsigned long address, int node)
-{
- VM_BUG_ON_PAGE(*hpage, *hpage);
-
- /*
- * Before allocating the hugepage, release the mmap_sem read lock.
- * The allocation can take potentially a long time if it involves
- * sync compaction, and we do not need to hold the mmap_sem during
- * that. We will recheck the vma after taking it again in write mode.
- */
- up_read(&mm->mmap_sem);
-
- *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
- if (unlikely(!*hpage)) {
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
- *hpage = ERR_PTR(-ENOMEM);
- return NULL;
- }
-
- prep_transhuge_page(*hpage);
- count_vm_event(THP_COLLAPSE_ALLOC);
- return *hpage;
-}
-#else
-static int khugepaged_find_target_node(void)
-{
- return 0;
-}
-
-static inline struct page *alloc_khugepaged_hugepage(void)
-{
- struct page *page;
-
- page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
- HPAGE_PMD_ORDER);
- if (page)
- prep_transhuge_page(page);
- return page;
-}
-
-static struct page *khugepaged_alloc_hugepage(bool *wait)
-{
- struct page *hpage;
-
- do {
- hpage = alloc_khugepaged_hugepage();
- if (!hpage) {
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
- if (!*wait)
- return NULL;
-
- *wait = false;
- khugepaged_alloc_sleep();
- } else
- count_vm_event(THP_COLLAPSE_ALLOC);
- } while (unlikely(!hpage) && likely(khugepaged_enabled()));
-
- return hpage;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
- if (!*hpage)
- *hpage = khugepaged_alloc_hugepage(wait);
-
- if (unlikely(!*hpage))
- return false;
-
- return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
- unsigned long address, int node)
-{
- up_read(&mm->mmap_sem);
- VM_BUG_ON(!*hpage);
-
- return *hpage;
-}
-#endif
-
-static bool hugepage_vma_check(struct vm_area_struct *vma)
-{
- if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
- (vma->vm_flags & VM_NOHUGEPAGE))
- return false;
- if (!vma->anon_vma || vma->vm_ops)
- return false;
- if (is_vma_temporary_stack(vma))
- return false;
- VM_BUG_ON_VMA(vma->vm_flags & VM_NO_KHUGEPAGED, vma);
- return true;
-}
-
-/*
- * Bring missing pages in from swap, to complete THP collapse.
- * Only done if khugepaged_scan_pmd believes it is worthwhile.
- *
- * Called and returns without pte mapped or spinlocks held,
- * but with mmap_sem held to protect against vma changes.
- */
-
-static void __collapse_huge_page_swapin(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd)
-{
- pte_t pteval;
- int swapped_in = 0, ret = 0;
- struct fault_env fe = {
- .vma = vma,
- .address = address,
- .flags = FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
- .pmd = pmd,
- };
-
- fe.pte = pte_offset_map(pmd, address);
- for (; fe.address < address + HPAGE_PMD_NR*PAGE_SIZE;
- fe.pte++, fe.address += PAGE_SIZE) {
- pteval = *fe.pte;
- if (!is_swap_pte(pteval))
- continue;
- swapped_in++;
- ret = do_swap_page(&fe, pteval);
- if (ret & VM_FAULT_ERROR) {
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
- return;
- }
- /* pte is unmapped now, we need to map it */
- fe.pte = pte_offset_map(pmd, fe.address);
- }
- fe.pte--;
- pte_unmap(fe.pte);
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
-}
-
-static void collapse_huge_page(struct mm_struct *mm,
- unsigned long address,
- struct page **hpage,
- struct vm_area_struct *vma,
- int node)
-{
- pmd_t *pmd, _pmd;
- pte_t *pte;
- pgtable_t pgtable;
- struct page *new_page;
- spinlock_t *pmd_ptl, *pte_ptl;
- int isolated = 0, result = 0;
- unsigned long hstart, hend;
- struct mem_cgroup *memcg;
- unsigned long mmun_start; /* For mmu_notifiers */
- unsigned long mmun_end; /* For mmu_notifiers */
- gfp_t gfp;
-
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
- /* Only allocate from the target node */
- gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE;
-
- /* release the mmap_sem read lock. */
- new_page = khugepaged_alloc_page(hpage, gfp, mm, address, node);
- if (!new_page) {
- result = SCAN_ALLOC_HUGE_PAGE_FAIL;
- goto out_nolock;
- }
-
- if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
- result = SCAN_CGROUP_CHARGE_FAIL;
- goto out_nolock;
- }
-
- /*
- * Prevent all access to pagetables with the exception of
- * gup_fast later hanlded by the ptep_clear_flush and the VM
- * handled by the anon_vma lock + PG_lock.
- */
- down_write(&mm->mmap_sem);
- if (unlikely(khugepaged_test_exit(mm))) {
- result = SCAN_ANY_PROCESS;
- goto out;
- }
-
- vma = find_vma(mm, address);
- if (!vma) {
- result = SCAN_VMA_NULL;
- goto out;
- }
- hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
- hend = vma->vm_end & HPAGE_PMD_MASK;
- if (address < hstart || address + HPAGE_PMD_SIZE > hend) {
- result = SCAN_ADDRESS_RANGE;
- goto out;
- }
- if (!hugepage_vma_check(vma)) {
- result = SCAN_VMA_CHECK;
- goto out;
- }
- pmd = mm_find_pmd(mm, address);
- if (!pmd) {
- result = SCAN_PMD_NULL;
- goto out;
- }
-
- __collapse_huge_page_swapin(mm, vma, address, pmd);
-
- anon_vma_lock_write(vma->anon_vma);
-
- pte = pte_offset_map(pmd, address);
- pte_ptl = pte_lockptr(mm, pmd);
-
- mmun_start = address;
- mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
- /*
- * After this gup_fast can't run anymore. This also removes
- * any huge TLB entry from the CPU so we won't allow
- * huge and small TLB entries for the same virtual address
- * to avoid the risk of CPU bugs in that area.
- */
- _pmd = pmdp_collapse_flush(vma, address, pmd);
- spin_unlock(pmd_ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
- spin_lock(pte_ptl);
- isolated = __collapse_huge_page_isolate(vma, address, pte);
- spin_unlock(pte_ptl);
-
- if (unlikely(!isolated)) {
- pte_unmap(pte);
- spin_lock(pmd_ptl);
- BUG_ON(!pmd_none(*pmd));
- /*
- * We can only use set_pmd_at when establishing
- * hugepmds and never for establishing regular pmds that
- * points to regular pagetables. Use pmd_populate for that
- */
- pmd_populate(mm, pmd, pmd_pgtable(_pmd));
- spin_unlock(pmd_ptl);
- anon_vma_unlock_write(vma->anon_vma);
- result = SCAN_FAIL;
- goto out;
- }
-
- /*
- * All pages are isolated and locked so anon_vma rmap
- * can't run anymore.
- */
- anon_vma_unlock_write(vma->anon_vma);
-
- __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
- pte_unmap(pte);
- __SetPageUptodate(new_page);
- pgtable = pmd_pgtable(_pmd);
-
- _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
- _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
- /*
- * spin_lock() below is not the equivalent of smp_wmb(), so
- * this is needed to avoid the copy_huge_page writes to become
- * visible after the set_pmd_at() write.
- */
- smp_wmb();
-
- spin_lock(pmd_ptl);
- BUG_ON(!pmd_none(*pmd));
- page_add_new_anon_rmap(new_page, vma, address, true);
- mem_cgroup_commit_charge(new_page, memcg, false, true);
- lru_cache_add_active_or_unevictable(new_page, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, address, pmd, _pmd);
- update_mmu_cache_pmd(vma, address, pmd);
- spin_unlock(pmd_ptl);
-
- *hpage = NULL;
-
- khugepaged_pages_collapsed++;
- result = SCAN_SUCCEED;
-out_up_write:
- up_write(&mm->mmap_sem);
-out_nolock:
- trace_mm_collapse_huge_page(mm, isolated, result);
- return;
-out:
- mem_cgroup_cancel_charge(new_page, memcg, true);
- goto out_up_write;
-}
-
-static int khugepaged_scan_pmd(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- struct page **hpage)
-{
- pmd_t *pmd;
- pte_t *pte, *_pte;
- int ret = 0, none_or_zero = 0, result = 0;
- struct page *page = NULL;
- unsigned long _address;
- spinlock_t *ptl;
- int node = NUMA_NO_NODE, unmapped = 0;
- bool writable = false, referenced = false;
-
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
- pmd = mm_find_pmd(mm, address);
- if (!pmd) {
- result = SCAN_PMD_NULL;
- goto out;
- }
-
- memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
- _pte++, _address += PAGE_SIZE) {
- pte_t pteval = *_pte;
- if (is_swap_pte(pteval)) {
- if (++unmapped <= khugepaged_max_ptes_swap) {
- continue;
- } else {
- result = SCAN_EXCEED_SWAP_PTE;
- goto out_unmap;
- }
- }
- if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
- if (!userfaultfd_armed(vma) &&
- ++none_or_zero <= khugepaged_max_ptes_none) {
- continue;
- } else {
- result = SCAN_EXCEED_NONE_PTE;
- goto out_unmap;
- }
- }
- if (!pte_present(pteval)) {
- result = SCAN_PTE_NON_PRESENT;
- goto out_unmap;
- }
- if (pte_write(pteval))
- writable = true;
-
- page = vm_normal_page(vma, _address, pteval);
- if (unlikely(!page)) {
- result = SCAN_PAGE_NULL;
- goto out_unmap;
- }
-
- /* TODO: teach khugepaged to collapse THP mapped with pte */
- if (PageCompound(page)) {
- result = SCAN_PAGE_COMPOUND;
- goto out_unmap;
- }
-
- /*
- * Record which node the original page is from and save this
- * information to khugepaged_node_load[].
- * Khupaged will allocate hugepage from the node has the max
- * hit record.
- */
- node = page_to_nid(page);
- if (khugepaged_scan_abort(node)) {
- result = SCAN_SCAN_ABORT;
- goto out_unmap;
- }
- khugepaged_node_load[node]++;
- if (!PageLRU(page)) {
- result = SCAN_PAGE_LRU;
- goto out_unmap;
- }
- if (PageLocked(page)) {
- result = SCAN_PAGE_LOCK;
- goto out_unmap;
- }
- if (!PageAnon(page)) {
- result = SCAN_PAGE_ANON;
- goto out_unmap;
- }
-
- /*
- * cannot use mapcount: can't collapse if there's a gup pin.
- * The page must only be referenced by the scanned process
- * and page swap cache.
- */
- if (page_count(page) != 1 + !!PageSwapCache(page)) {
- result = SCAN_PAGE_COUNT;
- goto out_unmap;
- }
- if (pte_young(pteval) ||
- page_is_young(page) || PageReferenced(page) ||
- mmu_notifier_test_young(vma->vm_mm, address))
- referenced = true;
- }
- if (writable) {
- if (referenced) {
- result = SCAN_SUCCEED;
- ret = 1;
- } else {
- result = SCAN_NO_REFERENCED_PAGE;
- }
- } else {
- result = SCAN_PAGE_RO;
- }
-out_unmap:
- pte_unmap_unlock(pte, ptl);
- if (ret) {
- node = khugepaged_find_target_node();
- /* collapse_huge_page will return with the mmap_sem released */
- collapse_huge_page(mm, address, hpage, vma, node);
- }
-out:
- trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
- none_or_zero, result, unmapped);
- return ret;
-}
-
-static void collect_mm_slot(struct mm_slot *mm_slot)
-{
- struct mm_struct *mm = mm_slot->mm;
-
- VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
-
- if (khugepaged_test_exit(mm)) {
- /* free mm_slot */
- hash_del(&mm_slot->hash);
- list_del(&mm_slot->mm_node);
-
- /*
- * Not strictly needed because the mm exited already.
- *
- * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
- */
-
- /* khugepaged_mm_lock actually not necessary for the below */
- free_mm_slot(mm_slot);
- mmdrop(mm);
- }
-}
-
-static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
- struct page **hpage)
- __releases(&khugepaged_mm_lock)
- __acquires(&khugepaged_mm_lock)
-{
- struct mm_slot *mm_slot;
- struct mm_struct *mm;
- struct vm_area_struct *vma;
- int progress = 0;
-
- VM_BUG_ON(!pages);
- VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
-
- if (khugepaged_scan.mm_slot)
- mm_slot = khugepaged_scan.mm_slot;
- else {
- mm_slot = list_entry(khugepaged_scan.mm_head.next,
- struct mm_slot, mm_node);
- khugepaged_scan.address = 0;
- khugepaged_scan.mm_slot = mm_slot;
- }
- spin_unlock(&khugepaged_mm_lock);
-
- mm = mm_slot->mm;
- down_read(&mm->mmap_sem);
- if (unlikely(khugepaged_test_exit(mm)))
- vma = NULL;
- else
- vma = find_vma(mm, khugepaged_scan.address);
-
- progress++;
- for (; vma; vma = vma->vm_next) {
- unsigned long hstart, hend;
-
- cond_resched();
- if (unlikely(khugepaged_test_exit(mm))) {
- progress++;
- break;
- }
- if (!hugepage_vma_check(vma)) {
-skip:
- progress++;
- continue;
- }
- hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
- hend = vma->vm_end & HPAGE_PMD_MASK;
- if (hstart >= hend)
- goto skip;
- if (khugepaged_scan.address > hend)
- goto skip;
- if (khugepaged_scan.address < hstart)
- khugepaged_scan.address = hstart;
- VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
-
- while (khugepaged_scan.address < hend) {
- int ret;
- cond_resched();
- if (unlikely(khugepaged_test_exit(mm)))
- goto breakouterloop;
-
- VM_BUG_ON(khugepaged_scan.address < hstart ||
- khugepaged_scan.address + HPAGE_PMD_SIZE >
- hend);
- ret = khugepaged_scan_pmd(mm, vma,
- khugepaged_scan.address,
- hpage);
- /* move to next address */
- khugepaged_scan.address += HPAGE_PMD_SIZE;
- progress += HPAGE_PMD_NR;
- if (ret)
- /* we released mmap_sem so break loop */
- goto breakouterloop_mmap_sem;
- if (progress >= pages)
- goto breakouterloop;
- }
- }
-breakouterloop:
- up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
-breakouterloop_mmap_sem:
-
- spin_lock(&khugepaged_mm_lock);
- VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
- /*
- * Release the current mm_slot if this mm is about to die, or
- * if we scanned all vmas of this mm.
- */
- if (khugepaged_test_exit(mm) || !vma) {
- /*
- * Make sure that if mm_users is reaching zero while
- * khugepaged runs here, khugepaged_exit will find
- * mm_slot not pointing to the exiting mm.
- */
- if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
- khugepaged_scan.mm_slot = list_entry(
- mm_slot->mm_node.next,
- struct mm_slot, mm_node);
- khugepaged_scan.address = 0;
- } else {
- khugepaged_scan.mm_slot = NULL;
- khugepaged_full_scans++;
- }
-
- collect_mm_slot(mm_slot);
- }
-
- return progress;
-}
-
-static int khugepaged_has_work(void)
-{
- return !list_empty(&khugepaged_scan.mm_head) &&
- khugepaged_enabled();
-}
-
-static int khugepaged_wait_event(void)
-{
- return !list_empty(&khugepaged_scan.mm_head) ||
- kthread_should_stop();
-}
-
-static void khugepaged_do_scan(void)
-{
- struct page *hpage = NULL;
- unsigned int progress = 0, pass_through_head = 0;
- unsigned int pages = khugepaged_pages_to_scan;
- bool wait = true;
-
- barrier(); /* write khugepaged_pages_to_scan to local stack */
-
- while (progress < pages) {
- if (!khugepaged_prealloc_page(&hpage, &wait))
- break;
-
- cond_resched();
-
- if (unlikely(kthread_should_stop() || try_to_freeze()))
- break;
-
- spin_lock(&khugepaged_mm_lock);
- if (!khugepaged_scan.mm_slot)
- pass_through_head++;
- if (khugepaged_has_work() &&
- pass_through_head < 2)
- progress += khugepaged_scan_mm_slot(pages - progress,
- &hpage);
- else
- progress = pages;
- spin_unlock(&khugepaged_mm_lock);
- }
-
- if (!IS_ERR_OR_NULL(hpage))
- put_page(hpage);
-}
-
-static void khugepaged_wait_work(void)
-{
- if (khugepaged_has_work()) {
- if (!khugepaged_scan_sleep_millisecs)
- return;
-
- wait_event_freezable_timeout(khugepaged_wait,
- kthread_should_stop(),
- msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
- return;
- }
-
- if (khugepaged_enabled())
- wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
-}
-
-static int khugepaged(void *none)
-{
- struct mm_slot *mm_slot;
-
- set_freezable();
- set_user_nice(current, MAX_NICE);
-
- while (!kthread_should_stop()) {
- khugepaged_do_scan();
- khugepaged_wait_work();
- }
-
- spin_lock(&khugepaged_mm_lock);
- mm_slot = khugepaged_scan.mm_slot;
- khugepaged_scan.mm_slot = NULL;
- if (mm_slot)
- collect_mm_slot(mm_slot);
- spin_unlock(&khugepaged_mm_lock);
- return 0;
-}
-
static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd)
{
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
new file mode 100644
index 000000000000..ef229cdfe5b7
--- /dev/null
+++ b/mm/khugepaged.c
@@ -0,0 +1,1428 @@
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/khugepaged.h>
+#include <linux/freezer.h>
+#include <linux/mman.h>
+#include <linux/hashtable.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/page_idle.h>
+#include <linux/swapops.h>
+
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+enum scan_result {
+ SCAN_FAIL,
+ SCAN_SUCCEED,
+ SCAN_PMD_NULL,
+ SCAN_EXCEED_NONE_PTE,
+ SCAN_PTE_NON_PRESENT,
+ SCAN_PAGE_RO,
+ SCAN_NO_REFERENCED_PAGE,
+ SCAN_PAGE_NULL,
+ SCAN_SCAN_ABORT,
+ SCAN_PAGE_COUNT,
+ SCAN_PAGE_LRU,
+ SCAN_PAGE_LOCK,
+ SCAN_PAGE_ANON,
+ SCAN_PAGE_COMPOUND,
+ SCAN_ANY_PROCESS,
+ SCAN_VMA_NULL,
+ SCAN_VMA_CHECK,
+ SCAN_ADDRESS_RANGE,
+ SCAN_SWAP_CACHE_PAGE,
+ SCAN_DEL_PAGE_LRU,
+ SCAN_ALLOC_HUGE_PAGE_FAIL,
+ SCAN_CGROUP_CHARGE_FAIL,
+ SCAN_EXCEED_SWAP_PTE
+};
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/huge_memory.h>
+
+/* default scan 8*512 pte (or vmas) every 30 second */
+static unsigned int khugepaged_pages_to_scan __read_mostly;
+static unsigned int khugepaged_pages_collapsed;
+static unsigned int khugepaged_full_scans;
+static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
+/* during fragmentation poll the hugepage allocator once every minute */
+static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
+static struct task_struct *khugepaged_thread __read_mostly;
+DEFINE_MUTEX(khugepaged_mutex);
+static DEFINE_SPINLOCK(khugepaged_mm_lock);
+static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
+/*
+ * default collapse hugepages if there is at least one pte mapped like
+ * it would have happened if the vma was large enough during page
+ * fault.
+ */
+static unsigned int khugepaged_max_ptes_none __read_mostly;
+static unsigned int khugepaged_max_ptes_swap __read_mostly;
+
+#define MM_SLOTS_HASH_BITS 10
+static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
+
+static struct kmem_cache *mm_slot_cache __read_mostly;
+
+/**
+ * struct mm_slot - hash lookup from mm to mm_slot
+ * @hash: hash collision list
+ * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
+ * @mm: the mm that this information is valid for
+ */
+struct mm_slot {
+ struct hlist_node hash;
+ struct list_head mm_node;
+ struct mm_struct *mm;
+};
+
+/**
+ * struct khugepaged_scan - cursor for scanning
+ * @mm_head: the head of the mm list to scan
+ * @mm_slot: the current mm_slot we are scanning
+ * @address: the next address inside that to be scanned
+ *
+ * There is only the one khugepaged_scan instance of this cursor structure.
+ */
+struct khugepaged_scan {
+ struct list_head mm_head;
+ struct mm_slot *mm_slot;
+ unsigned long address;
+};
+static struct khugepaged_scan khugepaged_scan = {
+ .mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
+};
+
+static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
+}
+
+static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long msecs;
+ int err;
+
+ err = kstrtoul(buf, 10, &msecs);
+ if (err || msecs > UINT_MAX)
+ return -EINVAL;
+
+ khugepaged_scan_sleep_millisecs = msecs;
+ wake_up_interruptible(&khugepaged_wait);
+
+ return count;
+}
+static struct kobj_attribute scan_sleep_millisecs_attr =
+ __ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
+ scan_sleep_millisecs_store);
+
+static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs);
+}
+
+static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long msecs;
+ int err;
+
+ err = kstrtoul(buf, 10, &msecs);
+ if (err || msecs > UINT_MAX)
+ return -EINVAL;
+
+ khugepaged_alloc_sleep_millisecs = msecs;
+ wake_up_interruptible(&khugepaged_wait);
+
+ return count;
+}
+static struct kobj_attribute alloc_sleep_millisecs_attr =
+ __ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
+ alloc_sleep_millisecs_store);
+
+static ssize_t pages_to_scan_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
+}
+static ssize_t pages_to_scan_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long pages;
+
+ err = kstrtoul(buf, 10, &pages);
+ if (err || !pages || pages > UINT_MAX)
+ return -EINVAL;
+
+ khugepaged_pages_to_scan = pages;
+
+ return count;
+}
+static struct kobj_attribute pages_to_scan_attr =
+ __ATTR(pages_to_scan, 0644, pages_to_scan_show,
+ pages_to_scan_store);
+
+static ssize_t pages_collapsed_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_pages_collapsed);
+}
+static struct kobj_attribute pages_collapsed_attr =
+ __ATTR_RO(pages_collapsed);
+
+static ssize_t full_scans_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_full_scans);
+}
+static struct kobj_attribute full_scans_attr =
+ __ATTR_RO(full_scans);
+
+static ssize_t khugepaged_defrag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return single_hugepage_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static ssize_t khugepaged_defrag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ return single_hugepage_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static struct kobj_attribute khugepaged_defrag_attr =
+ __ATTR(defrag, 0644, khugepaged_defrag_show,
+ khugepaged_defrag_store);
+
+/*
+ * max_ptes_none controls if khugepaged should collapse hugepages over
+ * any unmapped ptes in turn potentially increasing the memory
+ * footprint of the vmas. When max_ptes_none is 0 khugepaged will not
+ * reduce the available free memory in the system as it
+ * runs. Increasing max_ptes_none will instead potentially reduce the
+ * free memory in the system during the khugepaged scan.
+ */
+static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_max_ptes_none);
+}
+static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long max_ptes_none;
+
+ err = kstrtoul(buf, 10, &max_ptes_none);
+ if (err || max_ptes_none > HPAGE_PMD_NR-1)
+ return -EINVAL;
+
+ khugepaged_max_ptes_none = max_ptes_none;
+
+ return count;
+}
+static struct kobj_attribute khugepaged_max_ptes_none_attr =
+ __ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show,
+ khugepaged_max_ptes_none_store);
+
+static ssize_t khugepaged_max_ptes_swap_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_max_ptes_swap);
+}
+
+static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long max_ptes_swap;
+
+ err = kstrtoul(buf, 10, &max_ptes_swap);
+ if (err || max_ptes_swap > HPAGE_PMD_NR-1)
+ return -EINVAL;
+
+ khugepaged_max_ptes_swap = max_ptes_swap;
+
+ return count;
+}
+
+static struct kobj_attribute khugepaged_max_ptes_swap_attr =
+ __ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show,
+ khugepaged_max_ptes_swap_store);
+
+static struct attribute *khugepaged_attr[] = {
+ &khugepaged_defrag_attr.attr,
+ &khugepaged_max_ptes_none_attr.attr,
+ &pages_to_scan_attr.attr,
+ &pages_collapsed_attr.attr,
+ &full_scans_attr.attr,
+ &scan_sleep_millisecs_attr.attr,
+ &alloc_sleep_millisecs_attr.attr,
+ &khugepaged_max_ptes_swap_attr.attr,
+ NULL,
+};
+
+struct attribute_group khugepaged_attr_group = {
+ .attrs = khugepaged_attr,
+ .name = "khugepaged",
+};
+
+#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
+
+int hugepage_madvise(struct vm_area_struct *vma,
+ unsigned long *vm_flags, int advice)
+{
+ switch (advice) {
+ case MADV_HUGEPAGE:
+#ifdef CONFIG_S390
+ /*
+ * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
+ * can't handle this properly after s390_enable_sie, so we simply
+ * ignore the madvise to prevent qemu from causing a SIGSEGV.
+ */
+ if (mm_has_pgste(vma->vm_mm))
+ return 0;
+#endif
+ *vm_flags &= ~VM_NOHUGEPAGE;
+ *vm_flags |= VM_HUGEPAGE;
+ /*
+ * If the vma become good for khugepaged to scan,
+ * register it here without waiting a page fault that
+ * may not happen any time soon.
+ */
+ if (!(*vm_flags & VM_NO_KHUGEPAGED) &&
+ khugepaged_enter_vma_merge(vma, *vm_flags))
+ return -ENOMEM;
+ break;
+ case MADV_NOHUGEPAGE:
+ *vm_flags &= ~VM_HUGEPAGE;
+ *vm_flags |= VM_NOHUGEPAGE;
+ /*
+ * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
+ * this vma even if we leave the mm registered in khugepaged if
+ * it got registered before VM_NOHUGEPAGE was set.
+ */
+ break;
+ }
+
+ return 0;
+}
+
+int __init khugepaged_init(void)
+{
+ mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
+ sizeof(struct mm_slot),
+ __alignof__(struct mm_slot), 0, NULL);
+ if (!mm_slot_cache)
+ return -ENOMEM;
+
+ khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
+ khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
+ khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
+ return 0;
+}
+
+void __init khugepaged_destroy(void)
+{
+ kmem_cache_destroy(mm_slot_cache);
+}
+
+static inline struct mm_slot *alloc_mm_slot(void)
+{
+ if (!mm_slot_cache) /* initialization failed */
+ return NULL;
+ return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot(struct mm_slot *mm_slot)
+{
+ kmem_cache_free(mm_slot_cache, mm_slot);
+}
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+ struct mm_slot *mm_slot;
+
+ hash_for_each_possible(mm_slots_hash, mm_slot, hash, (unsigned long)mm)
+ if (mm == mm_slot->mm)
+ return mm_slot;
+
+ return NULL;
+}
+
+static void insert_to_mm_slots_hash(struct mm_struct *mm,
+ struct mm_slot *mm_slot)
+{
+ mm_slot->mm = mm;
+ hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
+}
+
+static inline int khugepaged_test_exit(struct mm_struct *mm)
+{
+ return atomic_read(&mm->mm_users) == 0;
+}
+
+int __khugepaged_enter(struct mm_struct *mm)
+{
+ struct mm_slot *mm_slot;
+ int wakeup;
+
+ mm_slot = alloc_mm_slot();
+ if (!mm_slot)
+ return -ENOMEM;
+
+ /* __khugepaged_exit() must not run from under us */
+ VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+ if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
+ free_mm_slot(mm_slot);
+ return 0;
+ }
+
+ spin_lock(&khugepaged_mm_lock);
+ insert_to_mm_slots_hash(mm, mm_slot);
+ /*
+ * Insert just behind the scanning cursor, to let the area settle
+ * down a little.
+ */
+ wakeup = list_empty(&khugepaged_scan.mm_head);
+ list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
+ spin_unlock(&khugepaged_mm_lock);
+
+ atomic_inc(&mm->mm_count);
+ if (wakeup)
+ wake_up_interruptible(&khugepaged_wait);
+
+ return 0;
+}
+
+int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+ unsigned long vm_flags)
+{
+ unsigned long hstart, hend;
+ if (!vma->anon_vma)
+ /*
+ * Not yet faulted in so we will register later in the
+ * page fault if needed.
+ */
+ return 0;
+ if (vma->vm_ops)
+ /* khugepaged not yet working on file or special mappings */
+ return 0;
+ VM_BUG_ON_VMA(vm_flags & VM_NO_KHUGEPAGED, vma);
+ hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+ hend = vma->vm_end & HPAGE_PMD_MASK;
+ if (hstart < hend)
+ return khugepaged_enter(vma, vm_flags);
+ return 0;
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+ struct mm_slot *mm_slot;
+ int free = 0;
+
+ spin_lock(&khugepaged_mm_lock);
+ mm_slot = get_mm_slot(mm);
+ if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+ hash_del(&mm_slot->hash);
+ list_del(&mm_slot->mm_node);
+ free = 1;
+ }
+ spin_unlock(&khugepaged_mm_lock);
+
+ if (free) {
+ clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+ free_mm_slot(mm_slot);
+ mmdrop(mm);
+ } else if (mm_slot) {
+ /*
+ * This is required to serialize against
+ * khugepaged_test_exit() (which is guaranteed to run
+ * under mmap sem read mode). Stop here (after we
+ * return all pagetables will be destroyed) until
+ * khugepaged has finished working on the pagetables
+ * under the mmap_sem.
+ */
+ down_write(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
+ }
+}
+
+static void release_pte_page(struct page *page)
+{
+ /* 0 stands for page_is_file_cache(page) == false */
+ dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ unlock_page(page);
+ putback_lru_page(page);
+}
+
+static void release_pte_pages(pte_t *pte, pte_t *_pte)
+{
+ while (--_pte >= pte) {
+ pte_t pteval = *_pte;
+ if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
+ release_pte_page(pte_page(pteval));
+ }
+}
+
+static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
+ unsigned long address,
+ pte_t *pte)
+{
+ struct page *page = NULL;
+ pte_t *_pte;
+ int none_or_zero = 0, result = 0;
+ bool referenced = false, writable = false;
+
+ for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
+ _pte++, address += PAGE_SIZE) {
+ pte_t pteval = *_pte;
+ if (pte_none(pteval) || (pte_present(pteval) &&
+ is_zero_pfn(pte_pfn(pteval)))) {
+ if (!userfaultfd_armed(vma) &&
+ ++none_or_zero <= khugepaged_max_ptes_none) {
+ continue;
+ } else {
+ result = SCAN_EXCEED_NONE_PTE;
+ goto out;
+ }
+ }
+ if (!pte_present(pteval)) {
+ result = SCAN_PTE_NON_PRESENT;
+ goto out;
+ }
+ page = vm_normal_page(vma, address, pteval);
+ if (unlikely(!page)) {
+ result = SCAN_PAGE_NULL;
+ goto out;
+ }
+
+ VM_BUG_ON_PAGE(PageCompound(page), page);
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+ VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+
+ /*
+ * We can do it before isolate_lru_page because the
+ * page can't be freed from under us. NOTE: PG_lock
+ * is needed to serialize against split_huge_page
+ * when invoked from the VM.
+ */
+ if (!trylock_page(page)) {
+ result = SCAN_PAGE_LOCK;
+ goto out;
+ }
+
+ /*
+ * cannot use mapcount: can't collapse if there's a gup pin.
+ * The page must only be referenced by the scanned process
+ * and page swap cache.
+ */
+ if (page_count(page) != 1 + !!PageSwapCache(page)) {
+ unlock_page(page);
+ result = SCAN_PAGE_COUNT;
+ goto out;
+ }
+ if (pte_write(pteval)) {
+ writable = true;
+ } else {
+ if (PageSwapCache(page) && !reuse_swap_page(page)) {
+ unlock_page(page);
+ result = SCAN_SWAP_CACHE_PAGE;
+ goto out;
+ }
+ /*
+ * Page is not in the swap cache. It can be collapsed
+ * into a THP.
+ */
+ }
+
+ /*
+ * Isolate the page to avoid collapsing an hugepage
+ * currently in use by the VM.
+ */
+ if (isolate_lru_page(page)) {
+ unlock_page(page);
+ result = SCAN_DEL_PAGE_LRU;
+ goto out;
+ }
+ /* 0 stands for page_is_file_cache(page) == false */
+ inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+
+ /* If there is no mapped pte young don't collapse the page */
+ if (pte_young(pteval) ||
+ page_is_young(page) || PageReferenced(page) ||
+ mmu_notifier_test_young(vma->vm_mm, address))
+ referenced = true;
+ }
+ if (likely(writable)) {
+ if (likely(referenced)) {
+ result = SCAN_SUCCEED;
+ trace_mm_collapse_huge_page_isolate(page, none_or_zero,
+ referenced, writable, result);
+ return 1;
+ }
+ } else {
+ result = SCAN_PAGE_RO;
+ }
+
+out:
+ release_pte_pages(pte, _pte);
+ trace_mm_collapse_huge_page_isolate(page, none_or_zero,
+ referenced, writable, result);
+ return 0;
+}
+
+static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ spinlock_t *ptl)
+{
+ pte_t *_pte;
+ for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+ pte_t pteval = *_pte;
+ struct page *src_page;
+
+ if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+ clear_user_highpage(page, address);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+ if (is_zero_pfn(pte_pfn(pteval))) {
+ /*
+ * ptl mostly unnecessary.
+ */
+ spin_lock(ptl);
+ /*
+ * paravirt calls inside pte_clear here are
+ * superfluous.
+ */
+ pte_clear(vma->vm_mm, address, _pte);
+ spin_unlock(ptl);
+ }
+ } else {
+ src_page = pte_page(pteval);
+ copy_user_highpage(page, src_page, address, vma);
+ VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
+ release_pte_page(src_page);
+ /*
+ * ptl mostly unnecessary, but preempt has to
+ * be disabled to update the per-cpu stats
+ * inside page_remove_rmap().
+ */
+ spin_lock(ptl);
+ /*
+ * paravirt calls inside pte_clear here are
+ * superfluous.
+ */
+ pte_clear(vma->vm_mm, address, _pte);
+ page_remove_rmap(src_page, false);
+ spin_unlock(ptl);
+ free_page_and_swap_cache(src_page);
+ }
+
+ address += PAGE_SIZE;
+ page++;
+ }
+}
+
+static void khugepaged_alloc_sleep(void)
+{
+ DEFINE_WAIT(wait);
+
+ add_wait_queue(&khugepaged_wait, &wait);
+ freezable_schedule_timeout_interruptible(
+ msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
+ remove_wait_queue(&khugepaged_wait, &wait);
+}
+
+static int khugepaged_node_load[MAX_NUMNODES];
+
+static bool khugepaged_scan_abort(int nid)
+{
+ int i;
+
+ /*
+ * If zone_reclaim_mode is disabled, then no extra effort is made to
+ * allocate memory locally.
+ */
+ if (!zone_reclaim_mode)
+ return false;
+
+ /* If there is a count for this node already, it must be acceptable */
+ if (khugepaged_node_load[nid])
+ return false;
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (!khugepaged_node_load[i])
+ continue;
+ if (node_distance(nid, i) > RECLAIM_DISTANCE)
+ return true;
+ }
+ return false;
+}
+
+/* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
+static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
+{
+ return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
+}
+
+#ifdef CONFIG_NUMA
+static int khugepaged_find_target_node(void)
+{
+ static int last_khugepaged_target_node = NUMA_NO_NODE;
+ int nid, target_node = 0, max_value = 0;
+
+ /* find first node with max normal pages hit */
+ for (nid = 0; nid < MAX_NUMNODES; nid++)
+ if (khugepaged_node_load[nid] > max_value) {
+ max_value = khugepaged_node_load[nid];
+ target_node = nid;
+ }
+
+ /* do some balance if several nodes have the same hit record */
+ if (target_node <= last_khugepaged_target_node)
+ for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
+ nid++)
+ if (max_value == khugepaged_node_load[nid]) {
+ target_node = nid;
+ break;
+ }
+
+ last_khugepaged_target_node = target_node;
+ return target_node;
+}
+
+static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+{
+ if (IS_ERR(*hpage)) {
+ if (!*wait)
+ return false;
+
+ *wait = false;
+ *hpage = NULL;
+ khugepaged_alloc_sleep();
+ } else if (*hpage) {
+ put_page(*hpage);
+ *hpage = NULL;
+ }
+
+ return true;
+}
+
+static struct page *
+khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
+ unsigned long address, int node)
+{
+ VM_BUG_ON_PAGE(*hpage, *hpage);
+
+ /*
+ * Before allocating the hugepage, release the mmap_sem read lock.
+ * The allocation can take potentially a long time if it involves
+ * sync compaction, and we do not need to hold the mmap_sem during
+ * that. We will recheck the vma after taking it again in write mode.
+ */
+ up_read(&mm->mmap_sem);
+
+ *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+ if (unlikely(!*hpage)) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ *hpage = ERR_PTR(-ENOMEM);
+ return NULL;
+ }
+
+ prep_transhuge_page(*hpage);
+ count_vm_event(THP_COLLAPSE_ALLOC);
+ return *hpage;
+}
+#else
+static int khugepaged_find_target_node(void)
+{
+ return 0;
+}
+
+static inline struct page *alloc_khugepaged_hugepage(void)
+{
+ struct page *page;
+
+ page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
+ HPAGE_PMD_ORDER);
+ if (page)
+ prep_transhuge_page(page);
+ return page;
+}
+
+static struct page *khugepaged_alloc_hugepage(bool *wait)
+{
+ struct page *hpage;
+
+ do {
+ hpage = alloc_khugepaged_hugepage();
+ if (!hpage) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ if (!*wait)
+ return NULL;
+
+ *wait = false;
+ khugepaged_alloc_sleep();
+ } else
+ count_vm_event(THP_COLLAPSE_ALLOC);
+ } while (unlikely(!hpage) && likely(khugepaged_enabled()));
+
+ return hpage;
+}
+
+static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+{
+ if (!*hpage)
+ *hpage = khugepaged_alloc_hugepage(wait);
+
+ if (unlikely(!*hpage))
+ return false;
+
+ return true;
+}
+
+static struct page *
+khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
+ unsigned long address, int node)
+{
+ up_read(&mm->mmap_sem);
+ VM_BUG_ON(!*hpage);
+
+ return *hpage;
+}
+#endif
+
+static bool hugepage_vma_check(struct vm_area_struct *vma)
+{
+ if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
+ (vma->vm_flags & VM_NOHUGEPAGE))
+ return false;
+ if (!vma->anon_vma || vma->vm_ops)
+ return false;
+ if (is_vma_temporary_stack(vma))
+ return false;
+ VM_BUG_ON_VMA(vma->vm_flags & VM_NO_KHUGEPAGED, vma);
+ return true;
+}
+
+/*
+ * Bring missing pages in from swap, to complete THP collapse.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ *
+ * Called and returns without pte mapped or spinlocks held,
+ * but with mmap_sem held to protect against vma changes.
+ */
+
+static void __collapse_huge_page_swapin(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd)
+{
+ pte_t pteval;
+ int swapped_in = 0, ret = 0;
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
+ .pmd = pmd,
+ };
+
+ fe.pte = pte_offset_map(pmd, address);
+ for (; fe.address < address + HPAGE_PMD_NR*PAGE_SIZE;
+ fe.pte++, fe.address += PAGE_SIZE) {
+ pteval = *fe.pte;
+ if (!is_swap_pte(pteval))
+ continue;
+ swapped_in++;
+ ret = do_swap_page(&fe, pteval);
+ if (ret & VM_FAULT_ERROR) {
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
+ return;
+ }
+ /* pte is unmapped now, we need to map it */
+ fe.pte = pte_offset_map(pmd, fe.address);
+ }
+ fe.pte--;
+ pte_unmap(fe.pte);
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
+}
+
+static void collapse_huge_page(struct mm_struct *mm,
+ unsigned long address,
+ struct page **hpage,
+ struct vm_area_struct *vma,
+ int node)
+{
+ pmd_t *pmd, _pmd;
+ pte_t *pte;
+ pgtable_t pgtable;
+ struct page *new_page;
+ spinlock_t *pmd_ptl, *pte_ptl;
+ int isolated = 0, result = 0;
+ unsigned long hstart, hend;
+ struct mem_cgroup *memcg;
+ unsigned long mmun_start; /* For mmu_notifiers */
+ unsigned long mmun_end; /* For mmu_notifiers */
+ gfp_t gfp;
+
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+ /* Only allocate from the target node */
+ gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE;
+
+ /* release the mmap_sem read lock. */
+ new_page = khugepaged_alloc_page(hpage, gfp, mm, address, node);
+ if (!new_page) {
+ result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+ goto out_nolock;
+ }
+
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
+ result = SCAN_CGROUP_CHARGE_FAIL;
+ goto out_nolock;
+ }
+
+ /*
+ * Prevent all access to pagetables with the exception of
+ * gup_fast later hanlded by the ptep_clear_flush and the VM
+ * handled by the anon_vma lock + PG_lock.
+ */
+ down_write(&mm->mmap_sem);
+ if (unlikely(khugepaged_test_exit(mm))) {
+ result = SCAN_ANY_PROCESS;
+ goto out;
+ }
+
+ vma = find_vma(mm, address);
+ if (!vma) {
+ result = SCAN_VMA_NULL;
+ goto out;
+ }
+ hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+ hend = vma->vm_end & HPAGE_PMD_MASK;
+ if (address < hstart || address + HPAGE_PMD_SIZE > hend) {
+ result = SCAN_ADDRESS_RANGE;
+ goto out;
+ }
+ if (!hugepage_vma_check(vma)) {
+ result = SCAN_VMA_CHECK;
+ goto out;
+ }
+ pmd = mm_find_pmd(mm, address);
+ if (!pmd) {
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
+
+ __collapse_huge_page_swapin(mm, vma, address, pmd);
+
+ anon_vma_lock_write(vma->anon_vma);
+
+ pte = pte_offset_map(pmd, address);
+ pte_ptl = pte_lockptr(mm, pmd);
+
+ mmun_start = address;
+ mmun_end = address + HPAGE_PMD_SIZE;
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+ /*
+ * After this gup_fast can't run anymore. This also removes
+ * any huge TLB entry from the CPU so we won't allow
+ * huge and small TLB entries for the same virtual address
+ * to avoid the risk of CPU bugs in that area.
+ */
+ _pmd = pmdp_collapse_flush(vma, address, pmd);
+ spin_unlock(pmd_ptl);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+ spin_lock(pte_ptl);
+ isolated = __collapse_huge_page_isolate(vma, address, pte);
+ spin_unlock(pte_ptl);
+
+ if (unlikely(!isolated)) {
+ pte_unmap(pte);
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ /*
+ * We can only use set_pmd_at when establishing
+ * hugepmds and never for establishing regular pmds that
+ * points to regular pagetables. Use pmd_populate for that
+ */
+ pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+ spin_unlock(pmd_ptl);
+ anon_vma_unlock_write(vma->anon_vma);
+ result = SCAN_FAIL;
+ goto out;
+ }
+
+ /*
+ * All pages are isolated and locked so anon_vma rmap
+ * can't run anymore.
+ */
+ anon_vma_unlock_write(vma->anon_vma);
+
+ __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
+ pte_unmap(pte);
+ __SetPageUptodate(new_page);
+ pgtable = pmd_pgtable(_pmd);
+
+ _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+ /*
+ * spin_lock() below is not the equivalent of smp_wmb(), so
+ * this is needed to avoid the copy_huge_page writes to become
+ * visible after the set_pmd_at() write.
+ */
+ smp_wmb();
+
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ page_add_new_anon_rmap(new_page, vma, address, true);
+ mem_cgroup_commit_charge(new_page, memcg, false, true);
+ lru_cache_add_active_or_unevictable(new_page, vma);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ set_pmd_at(mm, address, pmd, _pmd);
+ update_mmu_cache_pmd(vma, address, pmd);
+ spin_unlock(pmd_ptl);
+
+ *hpage = NULL;
+
+ khugepaged_pages_collapsed++;
+ result = SCAN_SUCCEED;
+out_up_write:
+ up_write(&mm->mmap_sem);
+out_nolock:
+ trace_mm_collapse_huge_page(mm, isolated, result);
+ return;
+out:
+ mem_cgroup_cancel_charge(new_page, memcg, true);
+ goto out_up_write;
+}
+
+static int khugepaged_scan_pmd(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ struct page **hpage)
+{
+ pmd_t *pmd;
+ pte_t *pte, *_pte;
+ int ret = 0, none_or_zero = 0, result = 0;
+ struct page *page = NULL;
+ unsigned long _address;
+ spinlock_t *ptl;
+ int node = NUMA_NO_NODE, unmapped = 0;
+ bool writable = false, referenced = false;
+
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+ pmd = mm_find_pmd(mm, address);
+ if (!pmd) {
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
+
+ memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
+ _pte++, _address += PAGE_SIZE) {
+ pte_t pteval = *_pte;
+ if (is_swap_pte(pteval)) {
+ if (++unmapped <= khugepaged_max_ptes_swap) {
+ continue;
+ } else {
+ result = SCAN_EXCEED_SWAP_PTE;
+ goto out_unmap;
+ }
+ }
+ if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+ if (!userfaultfd_armed(vma) &&
+ ++none_or_zero <= khugepaged_max_ptes_none) {
+ continue;
+ } else {
+ result = SCAN_EXCEED_NONE_PTE;
+ goto out_unmap;
+ }
+ }
+ if (!pte_present(pteval)) {
+ result = SCAN_PTE_NON_PRESENT;
+ goto out_unmap;
+ }
+ if (pte_write(pteval))
+ writable = true;
+
+ page = vm_normal_page(vma, _address, pteval);
+ if (unlikely(!page)) {
+ result = SCAN_PAGE_NULL;
+ goto out_unmap;
+ }
+
+ /* TODO: teach khugepaged to collapse THP mapped with pte */
+ if (PageCompound(page)) {
+ result = SCAN_PAGE_COMPOUND;
+ goto out_unmap;
+ }
+
+ /*
+ * Record which node the original page is from and save this
+ * information to khugepaged_node_load[].
+ * Khupaged will allocate hugepage from the node has the max
+ * hit record.
+ */
+ node = page_to_nid(page);
+ if (khugepaged_scan_abort(node)) {
+ result = SCAN_SCAN_ABORT;
+ goto out_unmap;
+ }
+ khugepaged_node_load[node]++;
+ if (!PageLRU(page)) {
+ result = SCAN_PAGE_LRU;
+ goto out_unmap;
+ }
+ if (PageLocked(page)) {
+ result = SCAN_PAGE_LOCK;
+ goto out_unmap;
+ }
+ if (!PageAnon(page)) {
+ result = SCAN_PAGE_ANON;
+ goto out_unmap;
+ }
+
+ /*
+ * cannot use mapcount: can't collapse if there's a gup pin.
+ * The page must only be referenced by the scanned process
+ * and page swap cache.
+ */
+ if (page_count(page) != 1 + !!PageSwapCache(page)) {
+ result = SCAN_PAGE_COUNT;
+ goto out_unmap;
+ }
+ if (pte_young(pteval) ||
+ page_is_young(page) || PageReferenced(page) ||
+ mmu_notifier_test_young(vma->vm_mm, address))
+ referenced = true;
+ }
+ if (writable) {
+ if (referenced) {
+ result = SCAN_SUCCEED;
+ ret = 1;
+ } else {
+ result = SCAN_NO_REFERENCED_PAGE;
+ }
+ } else {
+ result = SCAN_PAGE_RO;
+ }
+out_unmap:
+ pte_unmap_unlock(pte, ptl);
+ if (ret) {
+ node = khugepaged_find_target_node();
+ /* collapse_huge_page will return with the mmap_sem released */
+ collapse_huge_page(mm, address, hpage, vma, node);
+ }
+out:
+ trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
+ none_or_zero, result, unmapped);
+ return ret;
+}
+
+static void collect_mm_slot(struct mm_slot *mm_slot)
+{
+ struct mm_struct *mm = mm_slot->mm;
+
+ VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+
+ if (khugepaged_test_exit(mm)) {
+ /* free mm_slot */
+ hash_del(&mm_slot->hash);
+ list_del(&mm_slot->mm_node);
+
+ /*
+ * Not strictly needed because the mm exited already.
+ *
+ * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+ */
+
+ /* khugepaged_mm_lock actually not necessary for the below */
+ free_mm_slot(mm_slot);
+ mmdrop(mm);
+ }
+}
+
+static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
+ struct page **hpage)
+ __releases(&khugepaged_mm_lock)
+ __acquires(&khugepaged_mm_lock)
+{
+ struct mm_slot *mm_slot;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ int progress = 0;
+
+ VM_BUG_ON(!pages);
+ VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+
+ if (khugepaged_scan.mm_slot)
+ mm_slot = khugepaged_scan.mm_slot;
+ else {
+ mm_slot = list_entry(khugepaged_scan.mm_head.next,
+ struct mm_slot, mm_node);
+ khugepaged_scan.address = 0;
+ khugepaged_scan.mm_slot = mm_slot;
+ }
+ spin_unlock(&khugepaged_mm_lock);
+
+ mm = mm_slot->mm;
+ down_read(&mm->mmap_sem);
+ if (unlikely(khugepaged_test_exit(mm)))
+ vma = NULL;
+ else
+ vma = find_vma(mm, khugepaged_scan.address);
+
+ progress++;
+ for (; vma; vma = vma->vm_next) {
+ unsigned long hstart, hend;
+
+ cond_resched();
+ if (unlikely(khugepaged_test_exit(mm))) {
+ progress++;
+ break;
+ }
+ if (!hugepage_vma_check(vma)) {
+skip:
+ progress++;
+ continue;
+ }
+ hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+ hend = vma->vm_end & HPAGE_PMD_MASK;
+ if (hstart >= hend)
+ goto skip;
+ if (khugepaged_scan.address > hend)
+ goto skip;
+ if (khugepaged_scan.address < hstart)
+ khugepaged_scan.address = hstart;
+ VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+
+ while (khugepaged_scan.address < hend) {
+ int ret;
+ cond_resched();
+ if (unlikely(khugepaged_test_exit(mm)))
+ goto breakouterloop;
+
+ VM_BUG_ON(khugepaged_scan.address < hstart ||
+ khugepaged_scan.address + HPAGE_PMD_SIZE >
+ hend);
+ ret = khugepaged_scan_pmd(mm, vma,
+ khugepaged_scan.address,
+ hpage);
+ /* move to next address */
+ khugepaged_scan.address += HPAGE_PMD_SIZE;
+ progress += HPAGE_PMD_NR;
+ if (ret)
+ /* we released mmap_sem so break loop */
+ goto breakouterloop_mmap_sem;
+ if (progress >= pages)
+ goto breakouterloop;
+ }
+ }
+breakouterloop:
+ up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+breakouterloop_mmap_sem:
+
+ spin_lock(&khugepaged_mm_lock);
+ VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
+ /*
+ * Release the current mm_slot if this mm is about to die, or
+ * if we scanned all vmas of this mm.
+ */
+ if (khugepaged_test_exit(mm) || !vma) {
+ /*
+ * Make sure that if mm_users is reaching zero while
+ * khugepaged runs here, khugepaged_exit will find
+ * mm_slot not pointing to the exiting mm.
+ */
+ if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
+ khugepaged_scan.mm_slot = list_entry(
+ mm_slot->mm_node.next,
+ struct mm_slot, mm_node);
+ khugepaged_scan.address = 0;
+ } else {
+ khugepaged_scan.mm_slot = NULL;
+ khugepaged_full_scans++;
+ }
+
+ collect_mm_slot(mm_slot);
+ }
+
+ return progress;
+}
+
+static int khugepaged_has_work(void)
+{
+ return !list_empty(&khugepaged_scan.mm_head) &&
+ khugepaged_enabled();
+}
+
+static int khugepaged_wait_event(void)
+{
+ return !list_empty(&khugepaged_scan.mm_head) ||
+ kthread_should_stop();
+}
+
+static void khugepaged_do_scan(void)
+{
+ struct page *hpage = NULL;
+ unsigned int progress = 0, pass_through_head = 0;
+ unsigned int pages = khugepaged_pages_to_scan;
+ bool wait = true;
+
+ barrier(); /* write khugepaged_pages_to_scan to local stack */
+
+ while (progress < pages) {
+ if (!khugepaged_prealloc_page(&hpage, &wait))
+ break;
+
+ cond_resched();
+
+ if (unlikely(kthread_should_stop() || try_to_freeze()))
+ break;
+
+ spin_lock(&khugepaged_mm_lock);
+ if (!khugepaged_scan.mm_slot)
+ pass_through_head++;
+ if (khugepaged_has_work() &&
+ pass_through_head < 2)
+ progress += khugepaged_scan_mm_slot(pages - progress,
+ &hpage);
+ else
+ progress = pages;
+ spin_unlock(&khugepaged_mm_lock);
+ }
+
+ if (!IS_ERR_OR_NULL(hpage))
+ put_page(hpage);
+}
+
+static void khugepaged_wait_work(void)
+{
+ if (khugepaged_has_work()) {
+ if (!khugepaged_scan_sleep_millisecs)
+ return;
+
+ wait_event_freezable_timeout(khugepaged_wait,
+ kthread_should_stop(),
+ msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
+ return;
+ }
+
+ if (khugepaged_enabled())
+ wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
+}
+
+static int khugepaged(void *none)
+{
+ struct mm_slot *mm_slot;
+
+ set_freezable();
+ set_user_nice(current, MAX_NICE);
+
+ while (!kthread_should_stop()) {
+ khugepaged_do_scan();
+ khugepaged_wait_work();
+ }
+
+ spin_lock(&khugepaged_mm_lock);
+ mm_slot = khugepaged_scan.mm_slot;
+ khugepaged_scan.mm_slot = NULL;
+ if (mm_slot)
+ collect_mm_slot(mm_slot);
+ spin_unlock(&khugepaged_mm_lock);
+ return 0;
+}
+
+static void set_recommended_min_free_kbytes(void)
+{
+ struct zone *zone;
+ int nr_zones = 0;
+ unsigned long recommended_min;
+
+ for_each_populated_zone(zone)
+ nr_zones++;
+
+ /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
+ recommended_min = pageblock_nr_pages * nr_zones * 2;
+
+ /*
+ * Make sure that on average at least two pageblocks are almost free
+ * of another type, one for a migratetype to fall back to and a
+ * second to avoid subsequent fallbacks of other types There are 3
+ * MIGRATE_TYPES we care about.
+ */
+ recommended_min += pageblock_nr_pages * nr_zones *
+ MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
+
+ /* don't ever allow to reserve more than 5% of the lowmem */
+ recommended_min = min(recommended_min,
+ (unsigned long) nr_free_buffer_pages() / 20);
+ recommended_min <<= (PAGE_SHIFT-10);
+
+ if (recommended_min > min_free_kbytes) {
+ if (user_min_free_kbytes >= 0)
+ pr_info("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
+ min_free_kbytes, recommended_min);
+
+ min_free_kbytes = recommended_min;
+ }
+ setup_per_zone_wmarks();
+}
+
+int start_stop_khugepaged(void)
+{
+ int err = 0;
+ if (khugepaged_enabled()) {
+ if (!khugepaged_thread)
+ khugepaged_thread = kthread_run(khugepaged, NULL,
+ "khugepaged");
+ if (IS_ERR(khugepaged_thread)) {
+ pr_err("khugepaged: kthread_run(khugepaged) failed\n");
+ err = PTR_ERR(khugepaged_thread);
+ khugepaged_thread = NULL;
+ goto fail;
+ }
+
+ if (!list_empty(&khugepaged_scan.mm_head))
+ wake_up_interruptible(&khugepaged_wait);
+
+ set_recommended_min_free_kbytes();
+ } else if (khugepaged_thread) {
+ kthread_stop(khugepaged_thread);
+ khugepaged_thread = NULL;
+ }
+fail:
+ return err;
+}
--
2.8.0.rc3

2016-04-16 00:29:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 23/29] shmem: get_unmapped_area align huge page

From: Hugh Dickins <[email protected]>

Provide a shmem_get_unmapped_area method in file_operations, called
at mmap time to decide the mapping address. It could be conditional
on CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by
making it unconditional.

shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout). Lots of conditions, and in most cases
it just goes with the address that chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, go back
to ask for a larger arena, within which to align the mapping suitably.

There have to be some direct calls to shmem_get_unmapped_area(),
not via the file_operations: because of the way shmem_zero_setup()
is called to create a shmem object late in the mmap sequence, when
MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero. Though
this only matters when /proc/sys/vm/shmem_huge has been set.

Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/char/mem.c | 24 ++++++++++++
include/linux/shmem_fs.h | 2 +
ipc/shm.c | 6 ++-
mm/mmap.c | 16 +++++++-
mm/shmem.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 142 insertions(+), 4 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 71025c2f6bbb..9656f1095c19 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -22,6 +22,7 @@
#include <linux/device.h>
#include <linux/highmem.h>
#include <linux/backing-dev.h>
+#include <linux/shmem_fs.h>
#include <linux/splice.h>
#include <linux/pfn.h>
#include <linux/export.h>
@@ -661,6 +662,28 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma)
return 0;
}

+static unsigned long get_unmapped_area_zero(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+#ifdef CONFIG_MMU
+ if (flags & MAP_SHARED) {
+ /*
+ * mmap_zero() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge;
+ * and pass NULL for file as in mmap.c's get_unmapped_area(),
+ * so as not to confuse shmem with our handle on "/dev/zero".
+ */
+ return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
+ }
+
+ /* Otherwise flags & MAP_PRIVATE: with no shmem object beneath it */
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+#else
+ return -ENOSYS;
+#endif
+}
+
static ssize_t write_full(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -768,6 +791,7 @@ static const struct file_operations zero_fops = {
.read_iter = read_iter_zero,
.write_iter = write_iter_zero,
.mmap = mmap_zero,
+ .get_unmapped_area = get_unmapped_area_zero,
#ifndef CONFIG_MMU
.mmap_capabilities = zero_mmap_capabilities,
#endif
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 466f18c73a49..ff2de4bab61f 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -50,6 +50,8 @@ extern struct file *shmem_file_setup(const char *name,
extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
unsigned long flags);
extern int shmem_zero_setup(struct vm_area_struct *);
+extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
extern bool shmem_mapping(struct address_space *mapping);
extern void shmem_unlock_mapping(struct address_space *mapping);
diff --git a/ipc/shm.c b/ipc/shm.c
index 331fc1b0b3c7..b1ed484eeda1 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -476,13 +476,15 @@ static const struct file_operations shm_file_operations = {
.mmap = shm_mmap,
.fsync = shm_fsync,
.release = shm_release,
-#ifndef CONFIG_MMU
.get_unmapped_area = shm_get_unmapped_area,
-#endif
.llseek = noop_llseek,
.fallocate = shm_fallocate,
};

+/*
+ * shm_file_operations_huge is now identical to shm_file_operations,
+ * but we keep it distinct for the sake of is_file_shm_hugepages().
+ */
static const struct file_operations shm_file_operations_huge = {
.mmap = shm_mmap,
.fsync = shm_fsync,
diff --git a/mm/mmap.c b/mm/mmap.c
index 0a68cf0b37e7..cc1fa2b0d14f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -25,6 +25,7 @@
#include <linux/personality.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
#include <linux/profile.h>
#include <linux/export.h>
#include <linux/mount.h>
@@ -1900,8 +1901,19 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
return -ENOMEM;

get_area = current->mm->get_unmapped_area;
- if (file && file->f_op->get_unmapped_area)
- get_area = file->f_op->get_unmapped_area;
+ if (file) {
+ if (file->f_op->get_unmapped_area)
+ get_area = file->f_op->get_unmapped_area;
+ } else if (flags & MAP_SHARED) {
+ /*
+ * mmap_region() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge.
+ * do_mmap_pgoff() will clear pgoff, so match alignment.
+ */
+ pgoff = 0;
+ get_area = shmem_get_unmapped_area;
+ }
+
addr = get_area(file, addr, len, pgoff, flags);
if (IS_ERR_VALUE(addr))
return addr;
diff --git a/mm/shmem.c b/mm/shmem.c
index 3a6349e9c186..929fbf2ac43f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1513,6 +1513,94 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return ret;
}

+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long uaddr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ unsigned long (*get_area)(struct file *,
+ unsigned long, unsigned long, unsigned long, unsigned long);
+ unsigned long addr;
+ unsigned long offset;
+ unsigned long inflated_len;
+ unsigned long inflated_addr;
+ unsigned long inflated_offset;
+
+ if (len > TASK_SIZE)
+ return -ENOMEM;
+
+ get_area = current->mm->get_unmapped_area;
+ addr = get_area(file, uaddr, len, pgoff, flags);
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return addr;
+ if (IS_ERR_VALUE(addr))
+ return addr;
+ if (addr & ~PAGE_MASK)
+ return addr;
+ if (addr > TASK_SIZE - len)
+ return addr;
+
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return addr;
+ if (len < HPAGE_PMD_SIZE)
+ return addr;
+ if (flags & MAP_FIXED)
+ return addr;
+ /*
+ * Our priority is to support MAP_SHARED mapped hugely;
+ * and support MAP_PRIVATE mapped hugely too, until it is COWed.
+ * But if caller specified an address hint, respect that as before.
+ */
+ if (uaddr)
+ return addr;
+
+ if (shmem_huge != SHMEM_HUGE_FORCE) {
+ struct super_block *sb;
+
+ if (file) {
+ VM_BUG_ON(file->f_op != &shmem_file_operations);
+ sb = file_inode(file)->i_sb;
+ } else {
+ /*
+ * Called directly from mm/mmap.c, or drivers/char/mem.c
+ * for "/dev/zero", to create a shared anonymous object.
+ */
+ if (IS_ERR(shm_mnt))
+ return addr;
+ sb = shm_mnt->mnt_sb;
+ }
+ if (SHMEM_SB(sb)->huge != SHMEM_HUGE_NEVER)
+ return addr;
+ }
+
+ offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
+ if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+ return addr;
+ if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+ return addr;
+
+ inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+ if (inflated_len > TASK_SIZE)
+ return addr;
+ if (inflated_len < len)
+ return addr;
+
+ inflated_addr = get_area(NULL, 0, inflated_len, 0, flags);
+ if (IS_ERR_VALUE(inflated_addr))
+ return addr;
+ if (inflated_addr & ~PAGE_MASK)
+ return addr;
+
+ inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+ inflated_addr += offset - inflated_offset;
+ if (inflated_offset > offset)
+ inflated_addr += HPAGE_PMD_SIZE;
+
+ if (inflated_addr > TASK_SIZE - len)
+ return addr;
+ return inflated_addr;
+}
+
#ifdef CONFIG_NUMA
static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
{
@@ -3257,6 +3345,7 @@ static const struct address_space_operations shmem_aops = {

static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
+ .get_unmapped_area = shmem_get_unmapped_area,
#ifdef CONFIG_TMPFS
.llseek = shmem_file_llseek,
.read_iter = shmem_file_read_iter,
@@ -3492,6 +3581,15 @@ void shmem_unlock_mapping(struct address_space *mapping)
{
}

+#ifdef CONFIG_MMU
+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+#endif
+
void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
truncate_inode_pages_range(inode->i_mapping, lstart, lend);
--
2.8.0.rc3

2016-04-16 00:30:22

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 29/29] khugepaged: add support of collapse for tmpfs/shmem pages

This patch extends khugepaged to support collapse of tmpfs/shmem pages.
We share fair amount of infrastructure with anon-THP collapse.

Few design points:

- First we are looking for VMA which can be suitable for mapping huge
page;

- If the VMA maps shmem file, the rest scan/collapse operations
operates on page cache, not on page tables as in anon VMA case.

- khugepaged_scan_shmem() finds a range which is suitable for huge
page. The scan is lockless and shouldn't disturb system too much.

- once the candidate for collapse is found, collapse_shmem() attempts
to create a huge page:

+ scan over radix tree, making the range point to new huge page;

+ new huge page is not-uptodate, locked and freezed (refcount
is 0), so nobody can touch them until we say so.

+ we swap in pages during the scan. khugepaged_scan_shmem()
filters out ranges with more than khugepaged_max_ptes_swap
swapped out pages. It's HPAGE_PMD_NR/8 by default.

+ old pages are isolated, unmapped and put to local list in case
to be restored back if collapse failed.

- if collapse succeed, we retract pte page tables from VMAs where huge
pages mapping is possible. The huge page will be mapped as PMD on
next minor fault into the range.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/shmem_fs.h | 24 +++
include/trace/events/huge_memory.h | 3 +-
mm/khugepaged.c | 359 ++++++++++++++++++++++++++++++++++++-
mm/shmem.c | 83 +++++++--
4 files changed, 452 insertions(+), 17 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ff2de4bab61f..7ecb7f54f64d 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -54,6 +54,7 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
extern bool shmem_mapping(struct address_space *mapping);
+extern bool shmem_huge_enabled(struct vm_area_struct *vma);
extern void shmem_unlock_mapping(struct address_space *mapping);
extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
@@ -64,6 +65,19 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end);

+/* Flag allocation requirements to shmem_getpage */
+enum sgp_type {
+ SGP_READ, /* don't exceed i_size, don't allocate page */
+ SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_NOHUGE, /* like SGP_CACHE, but no huge pages */
+ SGP_HUGE, /* like SGP_CACHE, huge pages preferred */
+ SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
+ SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
+};
+
+extern int shmem_getpage(struct inode *inode, pgoff_t index,
+ struct page **pagep, enum sgp_type sgp);
+
static inline struct page *shmem_read_mapping_page(
struct address_space *mapping, pgoff_t index)
{
@@ -71,6 +85,16 @@ static inline struct page *shmem_read_mapping_page(
mapping_gfp_mask(mapping));
}

+static inline bool shmem_file(struct file *file)
+{
+ if (!file || !file->f_mapping)
+ return false;
+ return shmem_mapping(file->f_mapping);
+}
+
+extern bool shmem_charge(struct inode *inode, long pages);
+extern void shmem_uncharge(struct inode *inode, long pages);
+
#ifdef CONFIG_TMPFS

extern int shmem_add_seals(struct file *file, unsigned int seals);
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index bda21183eb05..830d47d5ca41 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -29,7 +29,8 @@
EM( SCAN_DEL_PAGE_LRU, "could_not_delete_page_from_lru")\
EM( SCAN_ALLOC_HUGE_PAGE_FAIL, "alloc_huge_page_failed") \
EM( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed") \
- EMe( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte")
+ EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
+ EMe(SCAN_TRUNCATED, "truncated") \

#undef EM
#undef EMe
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c4693fd12f76..dbea77a1356a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -14,6 +14,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/page_idle.h>
#include <linux/swapops.h>
+#include <linux/shmem_fs.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -42,7 +43,8 @@ enum scan_result {
SCAN_DEL_PAGE_LRU,
SCAN_ALLOC_HUGE_PAGE_FAIL,
SCAN_CGROUP_CHARGE_FAIL,
- SCAN_EXCEED_SWAP_PTE
+ SCAN_EXCEED_SWAP_PTE,
+ SCAN_TRUNCATED,
};

#define CREATE_TRACE_POINTS
@@ -292,7 +294,7 @@ struct attribute_group khugepaged_attr_group = {
.name = "khugepaged",
};

-#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
+#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)

int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
@@ -813,6 +815,12 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
(vma->vm_flags & VM_NOHUGEPAGE))
return false;
+ if (shmem_file(vma->vm_file)) {
+ if (((vma->vm_start >> PAGE_SHIFT) & (HPAGE_PMD_NR - 1)) !=
+ (vma->vm_pgoff & (HPAGE_PMD_NR - 1)))
+ return false;
+ return true;
+ }
if (!vma->anon_vma || vma->vm_ops)
return false;
if (is_vma_temporary_stack(vma))
@@ -1146,6 +1154,334 @@ out:
return ret;
}

+#ifdef CONFIG_SHMEM
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
+{
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ pmd_t *pmd, _pmd;
+
+ i_mmap_lock_write(mapping);
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ /* probably overkill */
+ if (vma->anon_vma)
+ continue;
+ addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ if (addr & ~HPAGE_PMD_MASK)
+ continue;
+ if (vma->vm_end < addr + HPAGE_PMD_SIZE)
+ continue;
+ pmd = mm_find_pmd(vma->vm_mm, addr);
+ if (!pmd)
+ continue;
+ /* We need exclusive mmap_sem to retract page table */
+ if (down_write_trylock(&vma->vm_mm->mmap_sem)) {
+ spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
+ /* assume page table is clear */
+ _pmd = pmdp_collapse_flush(vma, addr, pmd);
+ spin_unlock(ptl);
+ up_write(&vma->vm_mm->mmap_sem);
+ atomic_long_dec(&vma->vm_mm->nr_ptes);
+ pte_free(vma->vm_mm, pmd_pgtable(_pmd));
+ }
+ }
+ i_mmap_unlock_write(mapping);
+}
+
+static void collapse_shmem(struct mm_struct *mm,
+ struct address_space *mapping, pgoff_t start,
+ struct page **hpage, int node)
+{
+ gfp_t gfp;
+ struct page *page, *new_page, *tmp;
+ struct mem_cgroup *memcg;
+ pgoff_t index, end = start + HPAGE_PMD_NR;
+ LIST_HEAD(pagelist);
+ struct radix_tree_iter iter;
+ void **slot;
+ int nr = 0, result = SCAN_SUCCEED;
+
+ VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
+
+ /* Only allocate from the target node */
+ gfp = alloc_hugepage_khugepaged_gfpmask() |
+ __GFP_OTHER_NODE | __GFP_THISNODE;
+
+ new_page = khugepaged_alloc_page(hpage, gfp, node);
+ if (!new_page) {
+ result = SCAN_ALLOC_HUGE_PAGE_FAIL;
+ goto out;
+ }
+
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
+ result = SCAN_CGROUP_CHARGE_FAIL;
+ goto out;
+ }
+
+ new_page->index = start;
+ new_page->mapping = mapping;
+ __SetPageSwapBacked(new_page);
+ __SetPageLocked(new_page);
+
+ BUG_ON(!page_ref_freeze(new_page, 1));
+
+ index = start;
+ spin_lock_irq(&mapping->tree_lock);
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ int n = min(iter.index, end) - index;
+
+ if (n && !shmem_charge(mapping->host, n)) {
+ result = SCAN_FAIL;
+ break;
+ }
+ for (; index < min(iter.index, end); index++) {
+ radix_tree_insert(&mapping->page_tree, index,
+ new_page + (index % HPAGE_PMD_NR));
+ }
+ nr += n;
+
+ if (index >= end)
+ break;
+ page = radix_tree_deref_slot_protected(slot,
+ &mapping->tree_lock);
+ if (radix_tree_exceptional_entry(page)) {
+ spin_unlock_irq(&mapping->tree_lock);
+ if (shmem_getpage(mapping->host, index, &page,
+ SGP_NOHUGE)) {
+ result = SCAN_FAIL;
+ goto tree_unlocked;
+ }
+ spin_lock_irq(&mapping->tree_lock);
+ } else if (trylock_page(page)) {
+ get_page(page);
+ } else {
+ result = SCAN_PAGE_LOCK;
+ break;
+ }
+
+ VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+ if (page_mapping(page) != mapping) {
+ result = SCAN_TRUNCATED;
+ goto out_unlock;
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+
+ if (isolate_lru_page(page)) {
+ result = SCAN_DEL_PAGE_LRU;
+ goto out_isolate_failed;
+ }
+
+ if (page_mapped(page))
+ unmap_mapping_range(mapping, index << PAGE_SHIFT,
+ PAGE_SIZE, 0);
+
+ spin_lock_irq(&mapping->tree_lock);
+
+ VM_BUG_ON_PAGE(page_mapped(page), page);
+
+ if (!page_ref_freeze(page, 3)) {
+ result = SCAN_PAGE_COUNT;
+ goto out_lru;
+ }
+
+ list_add_tail(&page->lru, &pagelist);
+ radix_tree_replace_slot(slot,
+ new_page + (index % HPAGE_PMD_NR));
+
+ index++;
+ continue;
+out_lru:
+ spin_unlock_irq(&mapping->tree_lock);
+ putback_lru_page(page);
+out_isolate_failed:
+ unlock_page(page);
+ put_page(page);
+ goto tree_unlocked;
+out_unlock:
+ unlock_page(page);
+ put_page(page);
+ break;
+ }
+
+ if (result == SCAN_SUCCEED && index < end) {
+ int n = end - index;
+
+ if (!shmem_charge(mapping->host, n)) {
+ result = SCAN_FAIL;
+ goto tree_locked;
+ }
+
+ for (; index < n; index++) {
+ radix_tree_insert(&mapping->page_tree, index,
+ new_page + (index % HPAGE_PMD_NR));
+ }
+ nr += n;
+ }
+
+tree_locked:
+ spin_unlock_irq(&mapping->tree_lock);
+tree_unlocked:
+
+ if (result == SCAN_SUCCEED) {
+ unsigned long flags;
+ struct zone *zone = page_zone(new_page);
+
+ list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+ copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
+ page);
+ list_del(&page->lru);
+ unlock_page(page);
+ page_ref_unfreeze(page, 1);
+ page->mapping = NULL;
+ ClearPageActive(page);
+ ClearPageUnevictable(page);
+ put_page(page);
+ }
+
+ local_irq_save(flags);
+ __inc_zone_page_state(new_page, NR_SHMEM_THPS);
+ if (nr) {
+ __mod_zone_page_state(zone, NR_FILE_PAGES, nr);
+ __mod_zone_page_state(zone, NR_SHMEM, nr);
+ }
+ local_irq_restore(flags);
+
+ retract_page_tables(mapping, start);
+
+ page_ref_unfreeze(new_page, HPAGE_PMD_NR);
+ SetPageUptodate(new_page);
+ mem_cgroup_commit_charge(new_page, memcg, false, true);
+ lru_cache_add_anon(new_page);
+ unlock_page(new_page);
+
+ *hpage = NULL;
+ } else {
+ shmem_uncharge(mapping->host, nr);
+ spin_lock_irq(&mapping->tree_lock);
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+ start) {
+ VM_BUG_ON(iter.index >= end);
+ page = list_first_entry_or_null(&pagelist,
+ struct page, lru);
+ if (!page || iter.index < page->index) {
+ if (!nr)
+ break;
+ radix_tree_replace_slot(slot, NULL);
+ nr--;
+ continue;
+ }
+
+ VM_BUG_ON_PAGE(page->index != iter.index, page);
+
+ list_del(&page->lru);
+ page_ref_unfreeze(page, 2);
+ radix_tree_replace_slot(slot, page);
+ spin_unlock_irq(&mapping->tree_lock);
+ putback_lru_page(page);
+ unlock_page(page);
+ spin_lock_irq(&mapping->tree_lock);
+ }
+ VM_BUG_ON(nr);
+ spin_unlock_irq(&mapping->tree_lock);
+ page_ref_unfreeze(new_page, 1);
+ mem_cgroup_cancel_charge(new_page, memcg, true);
+ unlock_page(new_page);
+ new_page->mapping = NULL;
+ }
+out:
+ VM_BUG_ON(!list_empty(&pagelist));
+ /* TODO: tracepoints */
+}
+
+static void khugepaged_scan_shmem(struct mm_struct *mm,
+ struct address_space *mapping,
+ pgoff_t start, struct page **hpage)
+{
+ struct page *page = NULL;
+ struct radix_tree_iter iter;
+ void **slot;
+ pgoff_t pgoff = start;
+ int present, swap;
+ int node = NUMA_NO_NODE;
+ int result = SCAN_SUCCEED;
+
+restart:
+ present = 0;
+ swap = 0;
+ memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+
+ rcu_read_lock();
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, pgoff) {
+ if (iter.index > start + HPAGE_PMD_NR)
+ break;
+
+ page = radix_tree_deref_slot(slot);
+ if (radix_tree_deref_retry(page))
+ goto restart;
+
+ if (radix_tree_exception(page)) {
+ if (++swap > khugepaged_max_ptes_swap) {
+ result = SCAN_EXCEED_SWAP_PTE;
+ break;
+ }
+ continue;
+ }
+
+ if (PageCompound(page)) {
+ result = SCAN_PAGE_COMPOUND;
+ break;
+ }
+
+ node = page_to_nid(page);
+ if (khugepaged_scan_abort(node)) {
+ result = SCAN_SCAN_ABORT;
+ break;
+ }
+ khugepaged_node_load[node]++;
+
+ if (!PageLRU(page)) {
+ result = SCAN_PAGE_LRU;
+ break;
+ }
+
+ if (page_count(page) != 1 + page_mapcount(page)) {
+ result = SCAN_PAGE_COUNT;
+ break;
+ }
+
+ /* XXX: need PageReferenced()/page_is_yound() checks ? */
+
+ present++;
+
+ if (need_resched()) {
+ cond_resched_rcu();
+ pgoff = iter.index + 1;
+ goto restart;
+ }
+ }
+ rcu_read_unlock();
+
+ if (result == SCAN_SUCCEED) {
+ if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+ result = SCAN_EXCEED_NONE_PTE;
+ } else {
+ node = khugepaged_find_target_node();
+ collapse_shmem(mm, mapping, start, hpage, node);
+ }
+ }
+
+ /* TODO: tracepoints */
+}
+#else
+static void khugepaged_scan_shmem(struct mm_struct *mm,
+ struct address_space *mapping,
+ pgoff_t start, struct page **hpage)
+{
+ BUILD_BUG();
+}
+#endif
+
static void collect_mm_slot(struct mm_slot *mm_slot)
{
struct mm_struct *mm = mm_slot->mm;
@@ -1222,6 +1558,8 @@ skip:
if (khugepaged_scan.address < hstart)
khugepaged_scan.address = hstart;
VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+ if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma))
+ goto skip;

while (khugepaged_scan.address < hend) {
int ret;
@@ -1232,9 +1570,20 @@ skip:
VM_BUG_ON(khugepaged_scan.address < hstart ||
khugepaged_scan.address + HPAGE_PMD_SIZE >
hend);
- ret = khugepaged_scan_pmd(mm, vma,
- khugepaged_scan.address,
- hpage);
+ if (shmem_file(vma->vm_file)) {
+ struct file *file = get_file(vma->vm_file);
+ pgoff_t pgoff = linear_page_index(vma,
+ khugepaged_scan.address);
+ up_read(&mm->mmap_sem);
+ ret = 1;
+ khugepaged_scan_shmem(mm, file->f_mapping,
+ pgoff, hpage);
+ fput(file);
+ } else {
+ ret = khugepaged_scan_pmd(mm, vma,
+ khugepaged_scan.address,
+ hpage);
+ }
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
progress += HPAGE_PMD_NR;
diff --git a/mm/shmem.c b/mm/shmem.c
index 0ebf2e3a2239..2f378a928213 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -32,6 +32,7 @@
#include <linux/export.h>
#include <linux/swap.h>
#include <linux/uio.h>
+#include <linux/khugepaged.h>

static struct vfsmount *shm_mnt;

@@ -97,16 +98,6 @@ struct shmem_falloc {
pgoff_t nr_unswapped; /* how often writepage refused to swap out */
};

-/* Flag allocation requirements to shmem_getpage */
-enum sgp_type {
- SGP_READ, /* don't exceed i_size, don't allocate page */
- SGP_CACHE, /* don't exceed i_size, may allocate page */
- SGP_NOHUGE, /* like SGP_CACHE, but no huge pages */
- SGP_HUGE, /* like SGP_CACHE, huge pages preferred */
- SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
- SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
-};
-
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
@@ -126,7 +117,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp,
gfp_t gfp, struct mm_struct *fault_mm, int *fault_type);

-static inline int shmem_getpage(struct inode *inode, pgoff_t index,
+int shmem_getpage(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp)
{
return shmem_getpage_gfp(inode, index, pagep, sgp,
@@ -190,6 +181,33 @@ static inline void shmem_unacct_blocks(unsigned long flags, long pages)
vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE));
}

+bool shmem_charge(struct inode *inode, long pages)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+
+ if (shmem_acct_block(info->flags, pages))
+ return false;
+ if (!sbinfo->max_blocks)
+ return true;
+ if (percpu_counter_compare(&sbinfo->used_blocks,
+ sbinfo->max_blocks + pages) > 0) {
+ shmem_unacct_blocks(info->flags, pages);
+ return false;
+ }
+ percpu_counter_add(&sbinfo->used_blocks, pages);
+ return true;
+}
+
+void shmem_uncharge(struct inode *inode, long pages)
+{
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+
+ percpu_counter_sub(&sbinfo->used_blocks, pages);
+ shmem_unacct_blocks(info->flags, pages);
+}
+
static const struct super_operations shmem_ops;
static const struct address_space_operations shmem_aops;
static const struct file_operations shmem_file_operations;
@@ -1852,6 +1870,11 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
{
file_accessed(file);
vma->vm_ops = &shmem_vm_ops;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ ((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
+ (vma->vm_end & HPAGE_PMD_MASK)) {
+ khugepaged_enter(vma, vma->vm_flags);
+ }
return 0;
}

@@ -3752,6 +3775,37 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,

struct kobj_attribute shmem_enabled_attr =
__ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+
+bool shmem_huge_enabled(struct vm_area_struct *vma)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+ loff_t i_size;
+ pgoff_t off;
+
+ if (shmem_huge == SHMEM_HUGE_FORCE)
+ return true;
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return false;
+ switch (sbinfo->huge) {
+ case SHMEM_HUGE_NEVER:
+ return false;
+ case SHMEM_HUGE_ALWAYS:
+ return true;
+ case SHMEM_HUGE_WITHIN_SIZE:
+ off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
+ i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ if (i_size >= HPAGE_PMD_SIZE &&
+ i_size >> PAGE_SHIFT >= off)
+ return true;
+ case SHMEM_HUGE_ADVISE:
+ /* TODO: implement fadvise() hints */
+ return (vma->vm_flags & VM_HUGEPAGE);
+ default:
+ VM_BUG_ON(1);
+ return false;
+ }
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */

#else /* !CONFIG_SHMEM */
@@ -3931,6 +3985,13 @@ int shmem_zero_setup(struct vm_area_struct *vma)
fput(vma->vm_file);
vma->vm_file = file;
vma->vm_ops = &shmem_vm_ops;
+
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ ((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) <
+ (vma->vm_end & HPAGE_PMD_MASK)) {
+ khugepaged_enter(vma, vma->vm_flags);
+ }
+
return 0;
}

--
2.8.0.rc3

2016-04-16 00:30:56

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 16/29] vmscan: split file huge pages before paging them out

This is preparation of vmscan for file huge pages. We cannot write out
huge pages, so we need to split them on the way out.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/vmscan.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e9fe17c96ef8..df56f6a2dbfe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1055,8 +1055,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,

/* Adding to swap updated mapping */
mapping = page_mapping(page);
+ } else if (unlikely(PageTransHuge(page))) {
+ /* Split file THP */
+ if (split_huge_page_to_list(page, page_list))
+ goto keep_locked;
}

+ VM_BUG_ON_PAGE(PageTransHuge(page), page);
+
/*
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
--
2.8.0.rc3

2016-04-16 00:24:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 10/29] thp: handle file COW faults

File COW for THP is handled on pte level: just split the pmd.

It's not clear how benefitial would be allocation of huge pages on COW
faults. And it would require some code to make them work.

I think at some point we can consider teaching khugepaged to collapse
pages in COW mappings, but allocating huge on fault is probably
overkill.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 23de0567db18..cba29c033702 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3388,6 +3388,11 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
if (fe->vma->vm_ops->pmd_fault)
return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
fe->flags);
+
+ /* COW handled on pte level: split pmd */
+ VM_BUG_ON_VMA(fe->vma->vm_flags & VM_SHARED, fe->vma);
+ split_huge_pmd(fe->vma, fe->pmd, fe->address);
+
return VM_FAULT_FALLBACK;
}

--
2.8.0.rc3

2016-04-16 00:32:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 04/29] mm: postpone page table allocation until we have page to map

The idea (and most of code) is borrowed again: from Hugh's patchset on
huge tmpfs[1].

Instead of allocation pte page table upfront, we postpone this until we
have page to map in hands. This approach opens possibility to map the
page as huge if filesystem supports this.

Comparing to Hugh's patch I've pushed page table allocation a bit
further: into do_set_pte(). This way we can postpone allocation even in
faultaround case without moving do_fault_around() after __do_fault().

do_set_pte() got renamed to alloc_set_pte() as it can allocate page
table if required.

[1] http://lkml.kernel.org/r/[email protected]

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 10 +-
mm/filemap.c | 16 +--
mm/memory.c | 297 +++++++++++++++++++++++++++++++----------------------
3 files changed, 196 insertions(+), 127 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 696ec39b616b..afe79bb75c47 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -320,6 +320,13 @@ struct fault_env {
* Protects pte page table if 'pte'
* is not NULL, otherwise pmd.
*/
+ pgtable_t prealloc_pte; /* Pre-allocated pte page table.
+ * vm_ops->map_pages() calls
+ * alloc_set_pte() from atomic context.
+ * do_fault_around() pre-allocates
+ * page table to avoid allocation from
+ * atomic context.
+ */
};

/*
@@ -601,7 +608,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-void do_set_pte(struct fault_env *fe, struct page *page);
+int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+ struct page *page);
#endif

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index e32e5d70fc0c..7e982835d4ec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2147,11 +2147,6 @@ void filemap_map_pages(struct fault_env *fe,
start_pgoff) {
if (iter.index > end_pgoff)
break;
- fe->pte += iter.index - last_pgoff;
- fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
- last_pgoff = iter.index;
- if (!pte_none(*fe->pte))
- goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2189,7 +2184,13 @@ repeat:

if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
- do_set_pte(fe, page);
+
+ fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+ if (fe->pte)
+ fe->pte += iter.index - last_pgoff;
+ last_pgoff = iter.index;
+ if (alloc_set_pte(fe, NULL, page))
+ goto unlock;
unlock_page(page);
goto next;
unlock:
@@ -2197,6 +2198,9 @@ unlock:
skip:
put_page(page);
next:
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*fe->pmd))
+ break;
if (iter.index == end_pgoff)
break;
}
diff --git a/mm/memory.c b/mm/memory.c
index 6c37f84212ee..c31c52507956 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2679,8 +2679,6 @@ static int do_anonymous_page(struct fault_env *fe)
struct page *page;
pte_t entry;

- pte_unmap(fe->pte);
-
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
@@ -2689,6 +2687,23 @@ static int do_anonymous_page(struct fault_env *fe)
if (check_stack_guard_page(vma, fe->address) < 0)
return VM_FAULT_SIGSEGV;

+ /*
+ * Use pte_alloc() instead of pte_alloc_map(). We can't run
+ * pte_offset_map() on pmds where a huge pmd might be created
+ * from a different thread.
+ *
+ * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
+ * parallel threads are excluded by other means.
+ *
+ * Here we only have down_read(mmap_sem).
+ */
+ if (pte_alloc(vma->vm_mm, fe->pmd, fe->address))
+ return VM_FAULT_OOM;
+
+ /* See the comment in pte_alloc_one_map() */
+ if (unlikely(pmd_trans_unstable(fe->pmd)))
+ return 0;
+
/* Use the zero-page for reads */
if (!(fe->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
@@ -2804,23 +2819,76 @@ static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
return ret;
}

+static int pte_alloc_one_map(struct fault_env *fe)
+{
+ struct vm_area_struct *vma = fe->vma;
+
+ if (!pmd_none(*fe->pmd))
+ goto map_pte;
+ if (fe->prealloc_pte) {
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd))) {
+ spin_unlock(fe->ptl);
+ goto map_pte;
+ }
+
+ atomic_long_inc(&vma->vm_mm->nr_ptes);
+ pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
+ spin_unlock(fe->ptl);
+ fe->prealloc_pte = 0;
+ } else if (unlikely(pte_alloc(vma->vm_mm, fe->pmd, fe->address))) {
+ return VM_FAULT_OOM;
+ }
+map_pte:
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
+ * didn't become pmd_trans_huge under us and then back to pmd_none, as
+ * a result of MADV_DONTNEED running immediately after a huge pmd fault
+ * in a different thread of this mm, in turn leading to a misleading
+ * pmd_trans_huge() retval. All we have to ensure is that it is a
+ * regular pmd that we can walk with pte_offset_map() and we can do that
+ * through an atomic read in C, which is what pmd_trans_unstable()
+ * provides.
+ */
+ if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+ return VM_FAULT_NOPAGE;
+
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ return 0;
+}
+
/**
- * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
+ * alloc_set_pte - setup new PTE entry for given page and add reverse page
+ * mapping. If needed, the fucntion allocates page table or use pre-allocated.
*
* @fe: fault environment
+ * @memcg: memcg to charge page (only for private mappings)
* @page: page to map
*
- * Caller must hold page table lock relevant for @fe->pte.
+ * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
*
* Target users are page handler itself and implementations of
* vm_ops->map_pages.
*/
-void do_set_pte(struct fault_env *fe, struct page *page)
+int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+ struct page *page)
{
struct vm_area_struct *vma = fe->vma;
bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;

+ if (!fe->pte) {
+ int ret = pte_alloc_one_map(fe);
+ if (ret)
+ return ret;
+ }
+
+ /* Re-check under ptl */
+ if (unlikely(!pte_none(*fe->pte)))
+ return VM_FAULT_NOPAGE;
+
flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
if (write)
@@ -2829,6 +2897,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, fe->address, false);
+ mem_cgroup_commit_charge(page, memcg, false, false);
+ lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page);
@@ -2837,6 +2907,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)

/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, fe->address, fe->pte);
+
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -2903,19 +2975,17 @@ late_initcall(fault_around_debugfs);
* fault_around_pages() value (and therefore to page order). This way it's
* easier to guarantee that we don't cross page table boundaries.
*/
-static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
+static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
{
- unsigned long address = fe->address, start_addr, nr_pages, mask;
- pte_t *pte = fe->pte;
+ unsigned long address = fe->address, nr_pages, mask;
pgoff_t end_pgoff;
- int off;
+ int off, ret = 0;

nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;

- start_addr = max(fe->address & mask, fe->vma->vm_start);
- off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
- fe->pte -= off;
+ fe->address = max(address & mask, fe->vma->vm_start);
+ off = ((address - fe->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
start_pgoff -= off;

/*
@@ -2923,30 +2993,45 @@ static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
* or fault_around_pages() from start_pgoff, depending what is nearest.
*/
end_pgoff = start_pgoff -
- ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ ((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
start_pgoff + nr_pages - 1);

- /* Check if it makes any sense to call ->map_pages */
- fe->address = start_addr;
- while (!pte_none(*fe->pte)) {
- if (++start_pgoff > end_pgoff)
- goto out;
- fe->address += PAGE_SIZE;
- if (fe->address >= fe->vma->vm_end)
- goto out;
- fe->pte++;
+ if (pmd_none(*fe->pmd)) {
+ fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
+ smp_wmb(); /* See comment in __pte_alloc() */
}

fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+
+ /* preallocated pagetable is unused: free it */
+ if (fe->prealloc_pte) {
+ pte_free(fe->vma->vm_mm, fe->prealloc_pte);
+ fe->prealloc_pte = 0;
+ }
+ /* Huge page is mapped? Page fault is solved */
+ if (pmd_trans_huge(*fe->pmd)) {
+ ret = VM_FAULT_NOPAGE;
+ goto out;
+ }
+
+ /* ->map_pages() haven't done anything useful. Cold page cache? */
+ if (!fe->pte)
+ goto out;
+
+ /* check if the page fault is solved */
+ fe->pte -= (fe->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
+ if (!pte_none(*fe->pte))
+ ret = VM_FAULT_NOPAGE;
+ pte_unmap_unlock(fe->pte, fe->ptl);
out:
- /* restore fault_env */
- fe->pte = pte;
fe->address = address;
+ fe->pte = NULL;
+ return ret;
}

-static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
@@ -2958,33 +3043,25 @@ static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
* something).
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- do_fault_around(fe, pgoff);
- if (!pte_same(*fe->pte, orig_pte))
- goto unlock_out;
- pte_unmap_unlock(fe->pte, fe->ptl);
+ ret = do_fault_around(fe, pgoff);
+ if (ret)
+ return ret;
}

ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, NULL, fault_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
- unlock_page(fault_page);
- put_page(fault_page);
- return ret;
- }
- do_set_pte(fe, fault_page);
unlock_page(fault_page);
-unlock_out:
- pte_unmap_unlock(fe->pte, fe->ptl);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ put_page(fault_page);
return ret;
}

-static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
@@ -3012,26 +3089,9 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
copy_user_highpage(new_page, fault_page, fe->address, vma);
__SetPageUptodate(new_page);

- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, memcg, new_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
- if (fault_page) {
- unlock_page(fault_page);
- put_page(fault_page);
- } else {
- /*
- * The fault handler has no page to lock, so it holds
- * i_mmap_lock for read to protect against truncate.
- */
- i_mmap_unlock_read(vma->vm_file->f_mapping);
- }
- goto uncharge_out;
- }
- do_set_pte(fe, new_page);
- mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
- pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
put_page(fault_page);
@@ -3042,6 +3102,8 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
*/
i_mmap_unlock_read(vma->vm_file->f_mapping);
}
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ goto uncharge_out;
return ret;
uncharge_out:
mem_cgroup_cancel_charge(new_page, memcg, false);
@@ -3049,7 +3111,7 @@ uncharge_out:
return ret;
}

-static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
@@ -3075,16 +3137,15 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
}
}

- fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
- &fe->ptl);
- if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ ret |= alloc_set_pte(fe, NULL, fault_page);
+ if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+ VM_FAULT_RETRY))) {
unlock_page(fault_page);
put_page(fault_page);
return ret;
}
- do_set_pte(fe, fault_page);
- pte_unmap_unlock(fe->pte, fe->ptl);

if (set_page_dirty(fault_page))
dirtied = 1;
@@ -3116,20 +3177,19 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int do_fault(struct fault_env *fe, pte_t orig_pte)
+static int do_fault(struct fault_env *fe)
{
struct vm_area_struct *vma = fe->vma;
pgoff_t pgoff = linear_page_index(vma, fe->address);

- pte_unmap(fe->pte);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
if (!(fe->flags & FAULT_FLAG_WRITE))
- return do_read_fault(fe, pgoff, orig_pte);
+ return do_read_fault(fe, pgoff);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(fe, pgoff, orig_pte);
- return do_shared_fault(fe, pgoff, orig_pte);
+ return do_cow_fault(fe, pgoff);
+ return do_shared_fault(fe, pgoff);
}

static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3269,37 +3329,62 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with pte unmapped and unlocked.
+ * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
+ * concurrent faults).
*
- * The mmap_sem may have been released depending on flags and our
- * return value. See filemap_fault() and __lock_page_or_retry().
+ * The mmap_sem may have been released depending on flags and our return value.
+ * See filemap_fault() and __lock_page_or_retry().
*/
static int handle_pte_fault(struct fault_env *fe)
{
pte_t entry;

- /*
- * some architectures can have larger ptes than wordsize,
- * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
- * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
- * The code below just needs a consistent view for the ifs and
- * we later double check anyway with the ptl lock held. So here
- * a barrier will do.
- */
- entry = *fe->pte;
- barrier();
- if (!pte_present(entry)) {
+ if (unlikely(pmd_none(*fe->pmd))) {
+ /*
+ * Leave __pte_alloc() until later: because vm_ops->fault may
+ * want to allocate huge page, and if we expose page table
+ * for an instant, it will be difficult to retract from
+ * concurrent faults and from rmap lookups.
+ */
+ } else {
+ /* See comment in pte_alloc_one_map() */
+ if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+ return 0;
+ /*
+ * A regular pmd is established and it can't morph into a huge
+ * pmd from under us anymore at this point because we hold the
+ * mmap_sem read mode and khugepaged takes it in write mode.
+ * So now it's safe to run pte_offset_map().
+ */
+ fe->pte = pte_offset_map(fe->pmd, fe->address);
+
+ entry = *fe->pte;
+
+ /*
+ * some architectures can have larger ptes than wordsize,
+ * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
+ * CONFIG_32BIT=y, so READ_ONCE or ACCESS_ONCE cannot guarantee
+ * atomic accesses. The code below just needs a consistent
+ * view for the ifs and we later double check anyway with the
+ * ptl lock held. So here a barrier will do.
+ */
+ barrier();
if (pte_none(entry)) {
- if (vma_is_anonymous(fe->vma))
- return do_anonymous_page(fe);
- else
- return do_fault(fe, entry);
+ pte_unmap(fe->pte);
+ fe->pte = NULL;
}
- return do_swap_page(fe, entry);
}

+ if (!fe->pte) {
+ if (vma_is_anonymous(fe->vma))
+ return do_anonymous_page(fe);
+ else
+ return do_fault(fe);
+ }
+
+ if (!pte_present(entry))
+ return do_swap_page(fe, entry);
+
if (pte_protnone(entry))
return do_numa_page(fe, entry);

@@ -3381,34 +3466,6 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
}
}

- /*
- * Use pte_alloc() instead of pte_alloc_map, because we can't
- * run pte_offset_map on the pmd, if an huge pmd could
- * materialize from under us from a different thread.
- */
- if (unlikely(pte_alloc(fe.vma->vm_mm, fe.pmd, fe.address)))
- return VM_FAULT_OOM;
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
- * didn't become pmd_trans_huge under us and then back to pmd_none, as
- * a result of MADV_DONTNEED running immediately after a huge pmd fault
- * in a different thread of this mm, in turn leading to a misleading
- * pmd_trans_huge() retval. All we have to ensure is that it is a
- * regular pmd that we can walk with pte_offset_map() and we can do that
- * through an atomic read in C, which is what pmd_trans_unstable()
- * provides.
- */
- if (unlikely(pmd_trans_unstable(fe.pmd) || pmd_devmap(*fe.pmd)))
- return 0;
- /*
- * A regular pmd is established and it can't morph into a huge pmd
- * from under us anymore at this point because we hold the mmap_sem
- * read mode and khugepaged takes it in write mode. So now it's
- * safe to run pte_offset_map().
- */
- fe.pte = pte_offset_map(fe.pmd, fe.address);
-
return handle_pte_fault(&fe);
}

--
2.8.0.rc3

2016-04-16 00:32:16

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 07/29] thp, vmstats: add counters for huge file pages

THP_FILE_ALLOC: how many times huge page was allocated and put page
cache.

THP_FILE_MAPPED: how many times file huge page was mapped.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/vm_event_item.h | 7 +++++++
mm/memory.c | 1 +
mm/vmstat.c | 2 ++
3 files changed, 10 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ec084321fe09..42604173f122 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -70,6 +70,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+ THP_FILE_ALLOC,
+ THP_FILE_MAPPED,
THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED,
THP_DEFERRED_SPLIT_PAGE,
@@ -100,4 +102,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NR_VM_EVENT_ITEMS
};

+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_FILE_MAPPED ({ BUILD_BUG(); 0; })
+#endif
+
#endif /* VM_EVENT_ITEM_H_INCLUDED */
diff --git a/mm/memory.c b/mm/memory.c
index ca45e9b19ad9..23de0567db18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2907,6 +2907,7 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)

/* fault is handled */
ret = 0;
+ count_vm_event(THP_FILE_MAPPED);
out:
spin_unlock(fe->ptl);
return ret;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 38ad49532b6f..904ac95fbf00 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -846,6 +846,8 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+ "thp_file_alloc",
+ "thp_file_mapped",
"thp_split_page",
"thp_split_page_failed",
"thp_deferred_split_page",
--
2.8.0.rc3

2016-04-16 00:32:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 14/29] thp: file pages support for split_huge_page()

Basic scheme is the same as for anon THP.

Main differences:

- File pages are on radix-tree, so we have head->_count offset by
HPAGE_PMD_NR. The count got distributed to small pages during split.

- mapping->tree_lock prevents non-lockless access to pages under split
over radix-tree;

- Lockless access is prevented by setting the head->_count to 0 during
split;

- After split, some pages can be beyond i_size. We drop them from
radix-tree.

- We don't setup migration entries. Just unmap pages. It helps
handling cases when i_size is in the middle of the page: no need
handle unmap pages beyond i_size manually.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/gup.c | 2 +
mm/huge_memory.c | 160 +++++++++++++++++++++++++++++++++++++++----------------
mm/mempolicy.c | 2 +
3 files changed, 119 insertions(+), 45 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 39f751e1cb33..52f091e28f76 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -287,6 +287,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
ret = split_huge_page(page);
unlock_page(page);
put_page(page);
+ if (pmd_none(*pmd))
+ return no_page_table(vma, flags);
}

return ret ? ERR_PTR(ret) :
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5a71e0ffcf19..2e44741abeb6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -31,6 +31,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/page_idle.h>
#include <linux/swapops.h>
+#include <linux/shmem_fs.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -3143,12 +3144,15 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,

static void freeze_page(struct page *page)
{
- enum ttu_flags ttu_flags = TTU_MIGRATION | TTU_IGNORE_MLOCK |
- TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED;
+ enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
+ TTU_RMAP_LOCKED;
int i, ret;

VM_BUG_ON_PAGE(!PageHead(page), page);

+ if (PageAnon(page))
+ ttu_flags |= TTU_MIGRATION;
+
/* We only need TTU_SPLIT_HUGE_PMD once */
ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
for (i = 1; !ret && i < HPAGE_PMD_NR; i++) {
@@ -3158,7 +3162,7 @@ static void freeze_page(struct page *page)

ret = try_to_unmap(page + i, ttu_flags);
}
- VM_BUG_ON(ret);
+ VM_BUG_ON_PAGE(ret, page + i - 1);
}

static void unfreeze_page(struct page *page)
@@ -3180,15 +3184,20 @@ static void __split_huge_page_tail(struct page *head, int tail,
/*
* tail_page->_count is zero and not changing from under us. But
* get_page_unless_zero() may be running from under us on the
- * tail_page. If we used atomic_set() below instead of atomic_inc(), we
- * would then run atomic_set() concurrently with
+ * tail_page. If we used atomic_set() below instead of atomic_inc() or
+ * atomic_add(), we would then run atomic_set() concurrently with
* get_page_unless_zero(), and atomic_set() is implemented in C not
* using locked ops. spin_unlock on x86 sometime uses locked ops
* because of PPro errata 66, 92, so unless somebody can guarantee
* atomic_set() here would be safe on all archs (and not only on x86),
- * it's safer to use atomic_inc().
+ * it's safer to use atomic_inc()/atomic_add().
*/
- page_ref_inc(page_tail);
+ if (PageAnon(head)) {
+ page_ref_inc(page_tail);
+ } else {
+ /* Additional pin to radix tree */
+ page_ref_add(page_tail, 2);
+ }

page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
page_tail->flags |= (head->flags &
@@ -3224,25 +3233,44 @@ static void __split_huge_page_tail(struct page *head, int tail,
lru_add_page_tail(head, page_tail, lruvec, list);
}

-static void __split_huge_page(struct page *page, struct list_head *list)
+static void __split_huge_page(struct page *page, struct list_head *list,
+ unsigned long flags)
{
struct page *head = compound_head(page);
struct zone *zone = page_zone(head);
struct lruvec *lruvec;
+ pgoff_t end = -1;
int i;

- /* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irq(&zone->lru_lock);
lruvec = mem_cgroup_page_lruvec(head, zone);

/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);

- for (i = HPAGE_PMD_NR - 1; i >= 1; i--)
+ if (!PageAnon(page))
+ end = DIV_ROUND_UP(i_size_read(head->mapping->host), PAGE_SIZE);
+
+ for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
+ /* Some pages can be beyond i_size: drop them from page cache */
+ if (head[i].index >= end) {
+ __ClearPageDirty(head + i);
+ __delete_from_page_cache(head + i, NULL);
+ put_page(head + i);
+ }
+ }

ClearPageCompound(head);
- spin_unlock_irq(&zone->lru_lock);
+ /* See comment in __split_huge_page_tail() */
+ if (PageAnon(head)) {
+ page_ref_inc(head);
+ } else {
+ /* Additional pin to radix tree */
+ page_ref_add(head, 2);
+ spin_unlock(&head->mapping->tree_lock);
+ }
+
+ spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);

unfreeze_page(head);

@@ -3309,36 +3337,54 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
- struct anon_vma *anon_vma;
- int count, mapcount, ret;
+ struct anon_vma *anon_vma = NULL;
+ struct address_space *mapping = NULL;
+ int count, mapcount, extra_pins, ret;
bool mlocked;
unsigned long flags;

VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
- VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
VM_BUG_ON_PAGE(!PageCompound(page), page);

- /*
- * The caller does not necessarily hold an mmap_sem that would prevent
- * the anon_vma disappearing so we first we take a reference to it
- * and then lock the anon_vma for write. This is similar to
- * page_lock_anon_vma_read except the write lock is taken to serialise
- * against parallel split or collapse operations.
- */
- anon_vma = page_get_anon_vma(head);
- if (!anon_vma) {
- ret = -EBUSY;
- goto out;
+ if (PageAnon(head)) {
+ /*
+ * The caller does not necessarily hold an mmap_sem that would
+ * prevent the anon_vma disappearing so we first we take a
+ * reference to it and then lock the anon_vma for write. This
+ * is similar to page_lock_anon_vma_read except the write lock
+ * is taken to serialise against parallel split or collapse
+ * operations.
+ */
+ anon_vma = page_get_anon_vma(head);
+ if (!anon_vma) {
+ ret = -EBUSY;
+ goto out;
+ }
+ extra_pins = 0;
+ mapping = NULL;
+ anon_vma_lock_write(anon_vma);
+ } else {
+ mapping = head->mapping;
+
+ /* Truncated ? */
+ if (!mapping) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ /* Addidional pins from radix tree */
+ extra_pins = HPAGE_PMD_NR;
+ anon_vma = NULL;
+ i_mmap_lock_read(mapping);
}
- anon_vma_lock_write(anon_vma);

/*
* Racy check if we can split the page, before freeze_page() will
* split PMDs
*/
- if (total_mapcount(head) != page_count(head) - 1) {
+ if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
ret = -EBUSY;
goto out_unlock;
}
@@ -3351,35 +3397,60 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
if (mlocked)
lru_add_drain();

+ /* prevent PageLRU to go away from under us, and freeze lru stats */
+ spin_lock_irqsave(&page_zone(head)->lru_lock, flags);
+
+ if (mapping) {
+ void **pslot;
+
+ spin_lock(&mapping->tree_lock);
+ pslot = radix_tree_lookup_slot(&mapping->page_tree,
+ page_index(head));
+ /*
+ * Check if the head page is present in radix tree.
+ * We assume all tail are present too, if head is there.
+ */
+ if (radix_tree_deref_slot_protected(pslot,
+ &mapping->tree_lock) != head)
+ goto fail;
+ }
+
/* Prevent deferred_split_scan() touching ->_count */
- spin_lock_irqsave(&pgdata->split_queue_lock, flags);
+ spin_lock(&pgdata->split_queue_lock);
count = page_count(head);
mapcount = total_mapcount(head);
- if (!mapcount && count == 1) {
+ if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
if (!list_empty(page_deferred_list(head))) {
pgdata->split_queue_len--;
list_del(page_deferred_list(head));
}
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
- __split_huge_page(page, list);
+ spin_unlock(&pgdata->split_queue_lock);
+ __split_huge_page(page, list, flags);
ret = 0;
- } else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
- pr_alert("total_mapcount: %u, page_count(): %u\n",
- mapcount, count);
- if (PageTail(page))
- dump_page(head, NULL);
- dump_page(page, "total_mapcount(head) > 0");
- BUG();
} else {
- spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
+ if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
+ pr_alert("total_mapcount: %u, page_count(): %u\n",
+ mapcount, count);
+ if (PageTail(page))
+ dump_page(head, NULL);
+ dump_page(page, "total_mapcount(head) > 0");
+ BUG();
+ }
+ spin_unlock(&pgdata->split_queue_lock);
+fail: if (mapping)
+ spin_unlock(&mapping->tree_lock);
+ spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
unfreeze_page(head);
ret = -EBUSY;
}

out_unlock:
- anon_vma_unlock_write(anon_vma);
- put_anon_vma(anon_vma);
+ if (anon_vma) {
+ anon_vma_unlock_write(anon_vma);
+ put_anon_vma(anon_vma);
+ }
+ if (mapping)
+ i_mmap_unlock_read(mapping);
out:
count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
return ret;
@@ -3502,8 +3573,7 @@ static int split_huge_pages_set(void *data, u64 val)
if (zone != page_zone(page))
goto next;

- if (!PageHead(page) || !PageAnon(page) ||
- PageHuge(page))
+ if (!PageHead(page) || PageHuge(page) || !PageLRU(page))
goto next;

total++;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cc01bc950a..148143974c5b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -515,6 +515,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
}
}

+ if (pmd_trans_unstable(pmd))
+ return 0;
retry:
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
--
2.8.0.rc3

2016-04-16 00:33:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv7 03/29] mm: introduce fault_env

The idea borrowed from Peter's patch from patchset on speculative page
faults[1]:

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context
without endless function signature changes.

The changes are mostly mechanical with exception of faultaround code:
filemap_map_pages() got reworked a bit.

This patch is preparation for the next one.

[1] http://lkml.kernel.org/r/[email protected]

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
Documentation/filesystems/Locking | 10 +-
fs/userfaultfd.c | 22 +-
include/linux/huge_mm.h | 20 +-
include/linux/mm.h | 34 ++-
include/linux/userfaultfd_k.h | 8 +-
mm/filemap.c | 28 +-
mm/huge_memory.c | 280 +++++++++---------
mm/internal.h | 4 +-
mm/memory.c | 579 ++++++++++++++++++--------------------
mm/nommu.c | 3 +-
10 files changed, 473 insertions(+), 515 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 619af9bfdcb3..7f77837a7cf1 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -544,13 +544,13 @@ subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.

->map_pages() is called when VM asks to map easy accessible pages.
-Filesystem should find and map pages associated with offsets from "pgoff"
-till "max_pgoff". ->map_pages() is called with page table locked and must
+Filesystem should find and map pages associated with offsets from "start_pgoff"
+till "end_pgoff". ->map_pages() is called with page table locked and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
-page table entry. Pointer to entry associated with offset "pgoff" is
-passed in "pte" field in vm_fault structure. Pointers to entries for other
-offsets should be calculated relative to "pte".
+page table entry. Pointer to entry associated with the page is passed in
+"pte" field in fault_env structure. Pointers to entries for other offsets
+should be calculated relative to "pte".

->page_mkwrite() is called when a previously read-only pte is
about to become writeable. The filesystem again must ensure that there are
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 66cdb44616d5..588e9da27ab4 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -257,10 +257,9 @@ out:
* fatal_signal_pending()s, and the mmap_sem must be released before
* returning it.
*/
-int handle_userfault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, unsigned long reason)
+int handle_userfault(struct fault_env *fe, unsigned long reason)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct mm_struct *mm = fe->vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
int ret;
@@ -269,7 +268,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
BUG_ON(!rwsem_is_locked(&mm->mmap_sem));

ret = VM_FAULT_SIGBUS;
- ctx = vma->vm_userfaultfd_ctx.ctx;
+ ctx = fe->vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
goto out;

@@ -302,17 +301,17 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* without first stopping userland access to the memory. For
* VM_UFFD_MISSING userfaults this is enough for now.
*/
- if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+ if (unlikely(!(fe->flags & FAULT_FLAG_ALLOW_RETRY))) {
/*
* Validate the invariant that nowait must allow retry
* to be sure not to return SIGBUS erroneously on
* nowait invocations.
*/
- BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+ BUG_ON(fe->flags & FAULT_FLAG_RETRY_NOWAIT);
#ifdef CONFIG_DEBUG_VM
if (printk_ratelimit()) {
printk(KERN_WARNING
- "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+ "FAULT_FLAG_ALLOW_RETRY missing %x\n", fe->flags);
dump_stack();
}
#endif
@@ -324,7 +323,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* and wait.
*/
ret = VM_FAULT_RETRY;
- if (flags & FAULT_FLAG_RETRY_NOWAIT)
+ if (fe->flags & FAULT_FLAG_RETRY_NOWAIT)
goto out;

/* take the reference before dropping the mmap_sem */
@@ -332,10 +331,11 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,

init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
- uwq.msg = userfault_msg(address, flags, reason);
+ uwq.msg = userfault_msg(fe->address, fe->flags, reason);
uwq.ctx = ctx;

- return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+ return_to_userland =
+ (fe->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);

spin_lock(&ctx->fault_pending_wqh.lock);
@@ -353,7 +353,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
TASK_KILLABLE);
spin_unlock(&ctx->fault_pending_wqh.lock);

- must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
+ must_wait = userfaultfd_must_wait(ctx, fe->address, fe->flags, reason);
up_read(&mm->mmap_sem);

if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d0367c523a0d..05efca1619b6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,20 +1,12 @@
#ifndef _LINUX_HUGE_MM_H
#define _LINUX_HUGE_MM_H

-extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- unsigned int flags);
+extern int do_huge_pmd_anonymous_page(struct fault_env *fe);
extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma);
-extern void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd, int dirty);
-extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pmd_t orig_pmd);
+extern void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd);
+extern int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd);
extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmd,
@@ -134,8 +126,7 @@ static inline int hpage_nr_pages(struct page *page)
return 1;
}

-extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd);

extern struct page *huge_zero_page;

@@ -195,8 +186,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
return NULL;
}

-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+static inline int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd)
{
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8fc68ed3c862..696ec39b616b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -299,10 +299,27 @@ struct vm_fault {
* is set (which is also implied by
* VM_FAULT_ERROR).
*/
- /* for ->map_pages() only */
- pgoff_t max_pgoff; /* map pages for offset from pgoff till
- * max_pgoff inclusive */
- pte_t *pte; /* pte entry associated with ->pgoff */
+};
+
+/*
+ * Page fault context: passes though page fault handler instead of endless list
+ * of function arguments.
+ */
+struct fault_env {
+ struct vm_area_struct *vma; /* Target VMA */
+ unsigned long address; /* Faulting virtual address */
+ unsigned int flags; /* FAULT_FLAG_xxx flags */
+ pmd_t *pmd; /* Pointer to pmd entry matching
+ * the 'address'
+ */
+ pte_t *pte; /* Pointer to pte entry matching
+ * the 'address'. NULL if the page
+ * table hasn't been allocated.
+ */
+ spinlock_t *ptl; /* Page table lock.
+ * Protects pte page table if 'pte'
+ * is not NULL, otherwise pmd.
+ */
};

/*
@@ -317,7 +334,8 @@ struct vm_operations_struct {
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
pmd_t *, unsigned int flags);
- void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ void (*map_pages)(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -583,8 +601,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon);
+void do_set_pte(struct fault_env *fe, struct page *page);
#endif

/*
@@ -2119,7 +2136,8 @@ extern void truncate_inode_pages_final(struct address_space *);

/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
-extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff);
extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* mm/page-writeback.c */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480ad41b7..dd66a952e8cd 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -27,8 +27,7 @@
#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)

-extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, unsigned long reason);
+extern int handle_userfault(struct fault_env *fe, unsigned long reason);

extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len);
@@ -56,10 +55,7 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
#else /* CONFIG_USERFAULTFD */

/* mm helpers */
-static inline int handle_userfault(struct vm_area_struct *vma,
- unsigned long address,
- unsigned int flags,
- unsigned long reason)
+static inline int handle_userfault(struct fault_env *fe, unsigned long reason)
{
return VM_FAULT_SIGBUS;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index f2479af09da9..e32e5d70fc0c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2131,22 +2131,27 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
{
struct radix_tree_iter iter;
void **slot;
- struct file *file = vma->vm_file;
+ struct file *file = fe->vma->vm_file;
struct address_space *mapping = file->f_mapping;
+ pgoff_t last_pgoff = start_pgoff;
loff_t size;
struct page *page;
- unsigned long address = (unsigned long) vmf->virtual_address;
- unsigned long addr;
- pte_t *pte;

rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
- if (iter.index > vmf->max_pgoff)
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+ start_pgoff) {
+ if (iter.index > end_pgoff)
break;
+ fe->pte += iter.index - last_pgoff;
+ fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+ last_pgoff = iter.index;
+ if (!pte_none(*fe->pte))
+ goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2182,14 +2187,9 @@ repeat:
if (page->index >= size >> PAGE_SHIFT)
goto unlock;

- pte = vmf->pte + page->index - vmf->pgoff;
- if (!pte_none(*pte))
- goto unlock;
-
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
- addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
- do_set_pte(vma, addr, page, pte, false, false);
+ do_set_pte(fe, page);
unlock_page(page);
goto next;
unlock:
@@ -2197,7 +2197,7 @@ unlock:
skip:
put_page(page);
next:
- if (iter.index == vmf->max_pgoff)
+ if (iter.index == end_pgoff)
break;
}
rcu_read_unlock();
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5211c59e5f26..a4014b484737 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -821,26 +821,23 @@ void prep_transhuge_page(struct page *page)
set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
}

-static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- struct page *page, gfp_t gfp,
- unsigned int flags)
+static int __do_huge_pmd_anonymous_page(struct fault_env *fe, struct page *page,
+ gfp_t gfp)
{
+ struct vm_area_struct *vma = fe->vma;
struct mem_cgroup *memcg;
pgtable_t pgtable;
- spinlock_t *ptl;
- unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_try_charge(page, mm, gfp, &memcg, true)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}

- pgtable = pte_alloc_one(mm, haddr);
+ pgtable = pte_alloc_one(vma->vm_mm, haddr);
if (unlikely(!pgtable)) {
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
@@ -855,12 +852,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
*/
__SetPageUptodate(page);

- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_none(*pmd))) {
- spin_unlock(ptl);
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_none(*fe->pmd))) {
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
} else {
pmd_t entry;

@@ -868,12 +865,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
if (userfaultfd_missing(vma)) {
int ret;

- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(page, memcg, true);
put_page(page);
- pte_free(mm, pgtable);
- ret = handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ pte_free(vma->vm_mm, pgtable);
+ ret = handle_userfault(fe, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
return ret;
}
@@ -883,11 +879,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
page_add_new_anon_rmap(page, vma, haddr, true);
mem_cgroup_commit_charge(page, memcg, false, true);
lru_cache_add_active_or_unevictable(page, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, haddr, pmd, entry);
- add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
- atomic_long_inc(&mm->nr_ptes);
- spin_unlock(ptl);
+ pgtable_trans_huge_deposit(vma->vm_mm, fe->pmd, pgtable);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ atomic_long_inc(&vma->vm_mm->nr_ptes);
+ spin_unlock(fe->ptl);
count_vm_event(THP_FAULT_ALLOC);
}

@@ -937,13 +933,12 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
return true;
}

-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- unsigned int flags)
+int do_huge_pmd_anonymous_page(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
gfp_t gfp;
struct page *page;
- unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;

if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
return VM_FAULT_FALLBACK;
@@ -951,42 +946,40 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm) &&
+ if (!(fe->flags & FAULT_FLAG_WRITE) &&
+ !mm_forbids_zeropage(vma->vm_mm) &&
transparent_hugepage_use_zero_page()) {
- spinlock_t *ptl;
pgtable_t pgtable;
struct page *zero_page;
bool set;
int ret;
- pgtable = pte_alloc_one(mm, haddr);
+ pgtable = pte_alloc_one(vma->vm_mm, haddr);
if (unlikely(!pgtable))
return VM_FAULT_OOM;
zero_page = get_huge_zero_page();
if (unlikely(!zero_page)) {
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- ptl = pmd_lock(mm, pmd);
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
ret = 0;
set = false;
- if (pmd_none(*pmd)) {
+ if (pmd_none(*fe->pmd)) {
if (userfaultfd_missing(vma)) {
- spin_unlock(ptl);
- ret = handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ spin_unlock(fe->ptl);
+ ret = handle_userfault(fe, VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
} else {
- set_huge_zero_page(pgtable, mm, vma,
- haddr, pmd,
- zero_page);
- spin_unlock(ptl);
+ set_huge_zero_page(pgtable, vma->vm_mm, vma,
+ haddr, fe->pmd, zero_page);
+ spin_unlock(fe->ptl);
set = true;
}
} else
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
if (!set) {
- pte_free(mm, pgtable);
+ pte_free(vma->vm_mm, pgtable);
put_huge_zero_page();
}
return ret;
@@ -998,8 +991,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
}
prep_transhuge_page(page);
- return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp,
- flags);
+ return __do_huge_pmd_anonymous_page(fe, page, gfp);
}

static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1171,38 +1163,31 @@ out:
return ret;
}

-void huge_pmd_set_accessed(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmd, pmd_t orig_pmd,
- int dirty)
+void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd)
{
- spinlock_t *ptl;
pmd_t entry;
unsigned long haddr;

- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ fe->ptl = pmd_lock(fe->vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto unlock;

entry = pmd_mkyoung(orig_pmd);
- haddr = address & HPAGE_PMD_MASK;
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
- update_mmu_cache_pmd(vma, address, pmd);
+ haddr = fe->address & HPAGE_PMD_MASK;
+ if (pmdp_set_access_flags(fe->vma, haddr, fe->pmd, entry,
+ fe->flags & FAULT_FLAG_WRITE))
+ update_mmu_cache_pmd(fe->vma, fe->address, fe->pmd);

unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
}

-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long address,
- pmd_t *pmd, pmd_t orig_pmd,
- struct page *page,
- unsigned long haddr)
+static int do_huge_pmd_wp_page_fallback(struct fault_env *fe, pmd_t orig_pmd,
+ struct page *page)
{
+ struct vm_area_struct *vma = fe->vma;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
struct mem_cgroup *memcg;
- spinlock_t *ptl;
pgtable_t pgtable;
pmd_t _pmd;
int ret = 0, i;
@@ -1219,11 +1204,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,

for (i = 0; i < HPAGE_PMD_NR; i++) {
pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE |
- __GFP_OTHER_NODE,
- vma, address, page_to_nid(page));
+ __GFP_OTHER_NODE, vma,
+ fe->address, page_to_nid(page));
if (unlikely(!pages[i] ||
- mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
- &memcg, false))) {
+ mem_cgroup_try_charge(pages[i], vma->vm_mm,
+ GFP_KERNEL, &memcg, false))) {
if (pages[i])
put_page(pages[i]);
while (--i >= 0) {
@@ -1249,41 +1234,41 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);

- ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto out_free_pages;
VM_BUG_ON_PAGE(!PageHead(page), page);

- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
/* leave pmd empty until pte is filled */

- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
+ pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, fe->pmd);
+ pmd_populate(vma->vm_mm, &_pmd, pgtable);

for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
+ pte_t entry;
entry = mk_pte(pages[i], vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- page_add_new_anon_rmap(pages[i], vma, haddr, false);
+ page_add_new_anon_rmap(pages[i], fe->vma, haddr, false);
mem_cgroup_commit_charge(pages[i], memcg, false, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
- pte = pte_offset_map(&_pmd, haddr);
- VM_BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
+ fe->pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*fe->pte));
+ set_pte_at(vma->vm_mm, haddr, fe->pte, entry);
+ pte_unmap(fe->pte);
}
kfree(pages);

smp_wmb(); /* make pte visible before pmd */
- pmd_populate(mm, pmd, pgtable);
+ pmd_populate(vma->vm_mm, fe->pmd, pgtable);
page_remove_rmap(page, true);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);

- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);

ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1292,8 +1277,8 @@ out:
return ret;

out_free_pages:
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ spin_unlock(fe->ptl);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1304,25 +1289,23 @@ out_free_pages:
goto out;
}

-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
{
- spinlock_t *ptl;
- int ret = 0;
+ struct vm_area_struct *vma = fe->vma;
struct page *page = NULL, *new_page;
struct mem_cgroup *memcg;
- unsigned long haddr;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
gfp_t huge_gfp; /* for allocation and charge */
+ int ret = 0;

- ptl = pmd_lockptr(mm, pmd);
+ fe->ptl = pmd_lockptr(vma->vm_mm, fe->pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
- haddr = address & HPAGE_PMD_MASK;
if (is_huge_zero_pmd(orig_pmd))
goto alloc;
- spin_lock(ptl);
- if (unlikely(!pmd_same(*pmd, orig_pmd)))
+ spin_lock(fe->ptl);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto out_unlock;

page = pmd_page(orig_pmd);
@@ -1341,13 +1324,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
- update_mmu_cache_pmd(vma, address, pmd);
+ if (pmdp_set_access_flags(vma, haddr, fe->pmd, entry, 1))
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
ret |= VM_FAULT_WRITE;
goto out_unlock;
}
get_page(page);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow()) {
@@ -1360,13 +1343,12 @@ alloc:
prep_transhuge_page(new_page);
} else {
if (!page) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
ret |= VM_FAULT_FALLBACK;
} else {
- ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
+ ret = do_huge_pmd_wp_page_fallback(fe, orig_pmd, page);
if (ret & VM_FAULT_OOM) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
ret |= VM_FAULT_FALLBACK;
}
put_page(page);
@@ -1375,14 +1357,12 @@ alloc:
goto out;
}

- if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp, &memcg,
- true))) {
+ if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
+ huge_gfp, &memcg, true))) {
put_page(new_page);
- if (page) {
- split_huge_pmd(vma, pmd, address);
+ split_huge_pmd(vma, fe->pmd, fe->address);
+ if (page)
put_page(page);
- } else
- split_huge_pmd(vma, pmd, address);
ret |= VM_FAULT_FALLBACK;
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -1398,13 +1378,13 @@ alloc:

mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);

- spin_lock(ptl);
+ spin_lock(fe->ptl);
if (page)
put_page(page);
- if (unlikely(!pmd_same(*pmd, orig_pmd))) {
- spin_unlock(ptl);
+ if (unlikely(!pmd_same(*fe->pmd, orig_pmd))) {
+ spin_unlock(fe->ptl);
mem_cgroup_cancel_charge(new_page, memcg, true);
put_page(new_page);
goto out_mn;
@@ -1412,14 +1392,14 @@ alloc:
pmd_t entry;
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_active_or_unevictable(new_page, vma);
- set_pmd_at(mm, haddr, pmd, entry);
- update_mmu_cache_pmd(vma, address, pmd);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
if (!page) {
- add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
put_huge_zero_page();
} else {
VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -1428,13 +1408,13 @@ alloc:
}
ret |= VM_FAULT_WRITE;
}
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
out:
return ret;
out_unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
return ret;
}

@@ -1494,13 +1474,12 @@ out:
}

/* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t pmd)
{
- spinlock_t *ptl;
+ struct vm_area_struct *vma = fe->vma;
struct anon_vma *anon_vma = NULL;
struct page *page;
- unsigned long haddr = addr & HPAGE_PMD_MASK;
+ unsigned long haddr = fe->address & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
int target_nid, last_cpupid = -1;
bool page_locked;
@@ -1511,8 +1490,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* A PROT_NONE fault should not end up here */
BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));

- ptl = pmd_lock(mm, pmdp);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+ if (unlikely(!pmd_same(pmd, *fe->pmd)))
goto out_unlock;

/*
@@ -1520,9 +1499,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* without disrupting NUMA hinting information. Do not relock and
* check_same as the page may no longer be mapped.
*/
- if (unlikely(pmd_trans_migrating(*pmdp))) {
- page = pmd_page(*pmdp);
- spin_unlock(ptl);
+ if (unlikely(pmd_trans_migrating(*fe->pmd))) {
+ page = pmd_page(*fe->pmd);
+ spin_unlock(fe->ptl);
wait_on_page_locked(page);
goto out;
}
@@ -1555,7 +1534,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

/* Migration could have started since the pmd_trans_migrating check */
if (!page_locked) {
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
wait_on_page_locked(page);
page_nid = -1;
goto out;
@@ -1566,12 +1545,12 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* to serialises splits
*/
get_page(page);
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);
anon_vma = page_lock_anon_vma_read(page);

/* Confirm the PMD did not change while page_table_lock was released */
- spin_lock(ptl);
- if (unlikely(!pmd_same(pmd, *pmdp))) {
+ spin_lock(fe->ptl);
+ if (unlikely(!pmd_same(pmd, *fe->pmd))) {
unlock_page(page);
put_page(page);
page_nid = -1;
@@ -1589,9 +1568,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Migrate the THP to the requested node, returns with page unlocked
* and access rights restored.
*/
- spin_unlock(ptl);
- migrated = migrate_misplaced_transhuge_page(mm, vma,
- pmdp, pmd, addr, page, target_nid);
+ spin_unlock(fe->ptl);
+ migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
+ fe->pmd, pmd, fe->address, page, target_nid);
if (migrated) {
flags |= TNF_MIGRATED;
page_nid = target_nid;
@@ -1606,18 +1585,18 @@ clear_pmdnuma:
pmd = pmd_mkyoung(pmd);
if (was_writable)
pmd = pmd_mkwrite(pmd);
- set_pmd_at(mm, haddr, pmdp, pmd);
- update_mmu_cache_pmd(vma, addr, pmdp);
+ set_pmd_at(vma->vm_mm, haddr, fe->pmd, pmd);
+ update_mmu_cache_pmd(vma, fe->address, fe->pmd);
unlock_page(page);
out_unlock:
- spin_unlock(ptl);
+ spin_unlock(fe->ptl);

out:
if (anon_vma)
page_unlock_anon_vma_read(anon_vma);

if (page_nid != -1)
- task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
+ task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, fe->flags);

return 0;
}
@@ -2396,29 +2375,32 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd)
{
- unsigned long _address;
- pte_t *pte, pteval;
+ pte_t pteval;
int swapped_in = 0, ret = 0;
-
- pte = pte_offset_map(pmd, address);
- for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
- pte++, _address += PAGE_SIZE) {
- pteval = *pte;
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
+ .pmd = pmd,
+ };
+
+ fe.pte = pte_offset_map(pmd, address);
+ for (; fe.address < address + HPAGE_PMD_NR*PAGE_SIZE;
+ fe.pte++, fe.address += PAGE_SIZE) {
+ pteval = *fe.pte;
if (!is_swap_pte(pteval))
continue;
swapped_in++;
- ret = do_swap_page(mm, vma, _address, pte, pmd,
- FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
- pteval);
+ ret = do_swap_page(&fe, pteval);
if (ret & VM_FAULT_ERROR) {
trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
return;
}
/* pte is unmapped now, we need to map it */
- pte = pte_offset_map(pmd, _address);
+ fe.pte = pte_offset_map(pmd, fe.address);
}
- pte--;
- pte_unmap(pte);
+ fe.pte--;
+ pte_unmap(fe.pte);
trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
}

diff --git a/mm/internal.h b/mm/internal.h
index 6e9222a41c4a..1458bd904daf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -35,9 +35,7 @@
/* Do not use these with a slab allocator */
#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)

-extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte);
+int do_swap_page(struct fault_env *fe, pte_t orig_pte);

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/memory.c b/mm/memory.c
index 0002898341a3..6c37f84212ee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2013,13 +2013,11 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
* case, all we need to do here is to mark the page as writable and update
* any related book-keeping.
*/
-static inline int wp_page_reuse(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
- struct page *page, int page_mkwrite,
- int dirty_shared)
- __releases(ptl)
+static inline int wp_page_reuse(struct fault_env *fe, pte_t orig_pte,
+ struct page *page, int page_mkwrite, int dirty_shared)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
pte_t entry;
/*
* Clear the pages cpupid information as the existing
@@ -2029,12 +2027,12 @@ static inline int wp_page_reuse(struct mm_struct *mm,
if (page)
page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);

- flush_cache_page(vma, address, pte_pfn(orig_pte));
+ flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (ptep_set_access_flags(vma, address, page_table, entry, 1))
- update_mmu_cache(vma, address, page_table);
- pte_unmap_unlock(page_table, ptl);
+ if (ptep_set_access_flags(vma, fe->address, fe->pte, entry, 1))
+ update_mmu_cache(vma, fe->address, fe->pte);
+ pte_unmap_unlock(fe->pte, fe->ptl);

if (dirty_shared) {
struct address_space *mapping;
@@ -2080,30 +2078,31 @@ static inline int wp_page_reuse(struct mm_struct *mm,
* held to the old page, as well as updating the rmap.
* - In any case, unlock the PTL and drop the reference we took to the old page.
*/
-static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- pte_t orig_pte, struct page *old_page)
+static int wp_page_copy(struct fault_env *fe, pte_t orig_pte,
+ struct page *old_page)
{
+ struct vm_area_struct *vma = fe->vma;
+ struct mm_struct *mm = vma->vm_mm;
struct page *new_page = NULL;
- spinlock_t *ptl = NULL;
pte_t entry;
int page_copied = 0;
- const unsigned long mmun_start = address & PAGE_MASK; /* For mmu_notifiers */
- const unsigned long mmun_end = mmun_start + PAGE_SIZE; /* For mmu_notifiers */
+ const unsigned long mmun_start = fe->address & PAGE_MASK;
+ const unsigned long mmun_end = mmun_start + PAGE_SIZE;
struct mem_cgroup *memcg;

if (unlikely(anon_vma_prepare(vma)))
goto oom;

if (is_zero_pfn(pte_pfn(orig_pte))) {
- new_page = alloc_zeroed_user_highpage_movable(vma, address);
+ new_page = alloc_zeroed_user_highpage_movable(vma, fe->address);
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
+ fe->address);
if (!new_page)
goto oom;
- cow_user_page(new_page, old_page, address, vma);
+ cow_user_page(new_page, old_page, fe->address, vma);
}

if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
@@ -2116,8 +2115,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Re-check the pte - we dropped the lock
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (likely(pte_same(*page_table, orig_pte))) {
+ fe->pte = pte_offset_map_lock(mm, fe->pmd, fe->address, &fe->ptl);
+ if (likely(pte_same(*fe->pte, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
dec_mm_counter_fast(mm,
@@ -2127,7 +2126,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
- flush_cache_page(vma, address, pte_pfn(orig_pte));
+ flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
/*
@@ -2136,8 +2135,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush_notify(vma, address, page_table);
- page_add_new_anon_rmap(new_page, vma, address, false);
+ ptep_clear_flush_notify(vma, fe->address, fe->pte);
+ page_add_new_anon_rmap(new_page, vma, fe->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
@@ -2145,8 +2144,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
* mmu page tables (such as kvm shadow page tables), we want the
* new page to be mapped directly into the secondary page table.
*/
- set_pte_at_notify(mm, address, page_table, entry);
- update_mmu_cache(vma, address, page_table);
+ set_pte_at_notify(mm, fe->address, fe->pte, entry);
+ update_mmu_cache(vma, fe->address, fe->pte);
if (old_page) {
/*
* Only after switching the pte to the new page may
@@ -2183,7 +2182,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
if (new_page)
put_page(new_page);

- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
if (old_page) {
/*
@@ -2211,44 +2210,43 @@ oom:
* Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
* mapping
*/
-static int wp_pfn_shared(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
- pmd_t *pmd)
+static int wp_pfn_shared(struct fault_env *fe, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
+
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
struct vm_fault vmf = {
.page = NULL,
- .pgoff = linear_page_index(vma, address),
- .virtual_address = (void __user *)(address & PAGE_MASK),
+ .pgoff = linear_page_index(vma, fe->address),
+ .virtual_address =
+ (void __user *)(fe->address & PAGE_MASK),
.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
};
int ret;

- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
if (ret & VM_FAULT_ERROR)
return ret;
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
/*
* We might have raced with another page fault while we
* released the pte_offset_map_lock.
*/
- if (!pte_same(*page_table, orig_pte)) {
- pte_unmap_unlock(page_table, ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}
}
- return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
- NULL, 0, 0);
+ return wp_page_reuse(fe, orig_pte, NULL, 0, 0);
}

-static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table,
- pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
- struct page *old_page)
- __releases(ptl)
+static int wp_page_shared(struct fault_env *fe, pte_t orig_pte,
+ struct page *old_page)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
int page_mkwrite = 0;

get_page(old_page);
@@ -2256,8 +2254,8 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
int tmp;

- pte_unmap_unlock(page_table, ptl);
- tmp = do_page_mkwrite(vma, old_page, address);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ tmp = do_page_mkwrite(vma, old_page, fe->address);
if (unlikely(!tmp || (tmp &
(VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
put_page(old_page);
@@ -2269,19 +2267,18 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* they did, we just return, as we can count on the
* MMU to tell us if they didn't also make it writable.
*/
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
- if (!pte_same(*page_table, orig_pte)) {
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
unlock_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
put_page(old_page);
return 0;
}
page_mkwrite = 1;
}

- return wp_page_reuse(mm, vma, address, page_table, ptl,
- orig_pte, old_page, page_mkwrite, 1);
+ return wp_page_reuse(fe, orig_pte, old_page, page_mkwrite, 1);
}

/*
@@ -2302,14 +2299,13 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* but allow concurrent faults), with pte both mapped and locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- spinlock_t *ptl, pte_t orig_pte)
- __releases(ptl)
+static int do_wp_page(struct fault_env *fe, pte_t orig_pte)
+ __releases(fe->ptl)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *old_page;

- old_page = vm_normal_page(vma, address, orig_pte);
+ old_page = vm_normal_page(vma, fe->address, orig_pte);
if (!old_page) {
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -2320,12 +2316,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))
- return wp_pfn_shared(mm, vma, address, page_table, ptl,
- orig_pte, pmd);
+ return wp_pfn_shared(fe, orig_pte);

- pte_unmap_unlock(page_table, ptl);
- return wp_page_copy(mm, vma, address, page_table, pmd,
- orig_pte, old_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return wp_page_copy(fe, orig_pte, old_page);
}

/*
@@ -2335,13 +2329,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (PageAnon(old_page) && !PageKsm(old_page)) {
if (!trylock_page(old_page)) {
get_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
lock_page(old_page);
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
- if (!pte_same(*page_table, orig_pte)) {
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+ fe->address, &fe->ptl);
+ if (!pte_same(*fe->pte, orig_pte)) {
unlock_page(old_page);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
put_page(old_page);
return 0;
}
@@ -2353,16 +2347,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
* the rmap code will not search our parent or siblings.
* Protected against the rmap code by the page lock.
*/
- page_move_anon_rmap(old_page, vma, address);
+ page_move_anon_rmap(old_page, vma, fe->address);
unlock_page(old_page);
- return wp_page_reuse(mm, vma, address, page_table, ptl,
- orig_pte, old_page, 0, 0);
+ return wp_page_reuse(fe, orig_pte, old_page, 0, 0);
}
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
- return wp_page_shared(mm, vma, address, page_table, pmd,
- ptl, orig_pte, old_page);
+ return wp_page_shared(fe, orig_pte, old_page);
}

/*
@@ -2370,9 +2362,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
get_page(old_page);

- pte_unmap_unlock(page_table, ptl);
- return wp_page_copy(mm, vma, address, page_table, pmd,
- orig_pte, old_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return wp_page_copy(fe, orig_pte, old_page);
}

static void unmap_mapping_range_vma(struct vm_area_struct *vma,
@@ -2462,11 +2453,9 @@ EXPORT_SYMBOL(unmap_mapping_range);
* We return with the mmap_sem locked or unlocked in the same cases
* as does filemap_fault().
*/
-int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+int do_swap_page(struct fault_env *fe, pte_t orig_pte)
{
- spinlock_t *ptl;
+ struct vm_area_struct *vma = fe->vma;
struct page *page, *swapcache;
struct mem_cgroup *memcg;
swp_entry_t entry;
@@ -2475,17 +2464,17 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
int exclusive = 0;
int ret = 0;

- if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
+ if (!pte_unmap_same(vma->vm_mm, fe->pmd, fe->pte, orig_pte))
goto out;

entry = pte_to_swp_entry(orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
- migration_entry_wait(mm, pmd, address);
+ migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
} else {
- print_bad_pte(vma, address, orig_pte, NULL);
+ print_bad_pte(vma, fe->address, orig_pte, NULL);
ret = VM_FAULT_SIGBUS;
}
goto out;
@@ -2494,14 +2483,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = lookup_swap_cache(entry);
if (!page) {
page = swapin_readahead(entry,
- GFP_HIGHUSER_MOVABLE, vma, address);
+ GFP_HIGHUSER_MOVABLE, vma, fe->address);
if (!page) {
/*
* Back out if somebody else faulted in this pte
* while we released the pte lock.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (likely(pte_same(*page_table, orig_pte)))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+ fe->address, &fe->ptl);
+ if (likely(pte_same(*fe->pte, orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
goto unlock;
@@ -2510,7 +2500,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
- mem_cgroup_count_vm_event(mm, PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
} else if (PageHWPoison(page)) {
/*
* hwpoisoned dirty swapcache pages are kept for killing
@@ -2523,7 +2513,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

swapcache = page;
- locked = lock_page_or_retry(page, mm, flags);
+ locked = lock_page_or_retry(page, vma->vm_mm, fe->flags);

delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
if (!locked) {
@@ -2540,14 +2530,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
goto out_page;

- page = ksm_might_need_to_copy(page, vma, address);
+ page = ksm_might_need_to_copy(page, vma, fe->address);
if (unlikely(!page)) {
ret = VM_FAULT_OOM;
page = swapcache;
goto out_page;
}

- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
ret = VM_FAULT_OOM;
goto out_page;
}
@@ -2555,8 +2546,9 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Back out if somebody else already faulted in this pte.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*page_table, orig_pte)))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte)))
goto out_nomap;

if (unlikely(!PageUptodate(page))) {
@@ -2574,24 +2566,24 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
* must be called after the swap_free(), or it will never succeed.
*/

- inc_mm_counter_fast(mm, MM_ANONPAGES);
- dec_mm_counter_fast(mm, MM_SWAPENTS);
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
- if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+ if ((fe->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
- flags &= ~FAULT_FLAG_WRITE;
+ fe->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = RMAP_EXCLUSIVE;
}
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(orig_pte))
pte = pte_mksoft_dirty(pte);
- set_pte_at(mm, address, page_table, pte);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
if (page == swapcache) {
- do_page_add_anon_rmap(page, vma, address, exclusive);
+ do_page_add_anon_rmap(page, vma, fe->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
} else { /* ksm created a completely new copy */
- page_add_new_anon_rmap(page, vma, address, false);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
}
@@ -2614,22 +2606,22 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(swapcache);
}

- if (flags & FAULT_FLAG_WRITE) {
- ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+ if (fe->flags & FAULT_FLAG_WRITE) {
+ ret |= do_wp_page(fe, pte);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
goto out;
}

/* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, page_table);
+ update_mmu_cache(vma, fe->address, fe->pte);
unlock:
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
out:
return ret;
out_nomap:
mem_cgroup_cancel_charge(page, memcg, false);
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
out_page:
unlock_page(page);
out_release:
@@ -2680,37 +2672,36 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags)
+static int do_anonymous_page(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
struct mem_cgroup *memcg;
struct page *page;
- spinlock_t *ptl;
pte_t entry;

- pte_unmap(page_table);
+ pte_unmap(fe->pte);

/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;

/* Check if we need to add a guard page to the stack */
- if (check_stack_guard_page(vma, address) < 0)
+ if (check_stack_guard_page(vma, fe->address) < 0)
return VM_FAULT_SIGSEGV;

/* Use the zero-page for reads */
- if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
- entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
+ if (!(fe->flags & FAULT_FLAG_WRITE) &&
+ !mm_forbids_zeropage(vma->vm_mm)) {
+ entry = pte_mkspecial(pfn_pte(my_zero_pfn(fe->address),
vma->vm_page_prot));
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_none(*page_table))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_none(*fe->pte))
goto unlock;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(page_table, ptl);
- return handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ pte_unmap_unlock(fe->pte, fe->ptl);
+ return handle_userfault(fe, VM_UFFD_MISSING);
}
goto setpte;
}
@@ -2718,11 +2709,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_zeroed_user_highpage_movable(vma, address);
+ page = alloc_zeroed_user_highpage_movable(vma, fe->address);
if (!page)
goto oom;

- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
goto oom_free_page;

/*
@@ -2736,30 +2727,30 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));

- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_none(*page_table))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (!pte_none(*fe->pte))
goto release;

/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
- return handle_userfault(vma, address, flags,
- VM_UFFD_MISSING);
+ return handle_userfault(fe, VM_UFFD_MISSING);
}

- inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address, false);
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
- set_pte_at(mm, address, page_table, entry);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);

/* No need to invalidate - it was non-present before */
- update_mmu_cache(vma, address, page_table);
+ update_mmu_cache(vma, fe->address, fe->pte);
unlock:
- pte_unmap_unlock(page_table, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
release:
mem_cgroup_cancel_charge(page, memcg, false);
@@ -2776,16 +2767,16 @@ oom:
* released depending on flags and vma->vm_ops->fault() return value.
* See filemap_fault() and __lock_page_retry().
*/
-static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pgoff_t pgoff, unsigned int flags,
- struct page *cow_page, struct page **page)
+static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
+ struct page *cow_page, struct page **page)
{
+ struct vm_area_struct *vma = fe->vma;
struct vm_fault vmf;
int ret;

- vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ vmf.virtual_address = (void __user *)(fe->address & PAGE_MASK);
vmf.pgoff = pgoff;
- vmf.flags = flags;
+ vmf.flags = fe->flags;
vmf.page = NULL;
vmf.gfp_mask = __get_fault_gfp_mask(vma);
vmf.cow_page = cow_page;
@@ -2816,38 +2807,36 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
/**
* do_set_pte - setup new PTE entry for given page and add reverse page mapping.
*
- * @vma: virtual memory area
- * @address: user virtual address
+ * @fe: fault environment
* @page: page to map
- * @pte: pointer to target page table entry
- * @write: true, if new entry is writable
- * @anon: true, if it's anonymous page
*
- * Caller must hold page table lock relevant for @pte.
+ * Caller must hold page table lock relevant for @fe->pte.
*
* Target users are page handler itself and implementations of
* vm_ops->map_pages.
*/
-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
- struct page *page, pte_t *pte, bool write, bool anon)
+void do_set_pte(struct fault_env *fe, struct page *page)
{
+ struct vm_area_struct *vma = fe->vma;
+ bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (anon) {
+ /* copy-on-write page */
+ if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address, false);
+ page_add_new_anon_rmap(page, vma, fe->address, false);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page);
}
- set_pte_at(vma->vm_mm, address, pte, entry);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);

/* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, address, pte);
+ update_mmu_cache(vma, fe->address, fe->pte);
}

static unsigned long fault_around_bytes __read_mostly =
@@ -2914,57 +2903,53 @@ late_initcall(fault_around_debugfs);
* fault_around_pages() value (and therefore to page order). This way it's
* easier to guarantee that we don't cross page table boundaries.
*/
-static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pgoff_t pgoff, unsigned int flags)
+static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
{
- unsigned long start_addr, nr_pages, mask;
- pgoff_t max_pgoff;
- struct vm_fault vmf;
+ unsigned long address = fe->address, start_addr, nr_pages, mask;
+ pte_t *pte = fe->pte;
+ pgoff_t end_pgoff;
int off;

nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;

- start_addr = max(address & mask, vma->vm_start);
- off = ((address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
- pte -= off;
- pgoff -= off;
+ start_addr = max(fe->address & mask, fe->vma->vm_start);
+ off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+ fe->pte -= off;
+ start_pgoff -= off;

/*
- * max_pgoff is either end of page table or end of vma
- * or fault_around_pages() from pgoff, depending what is nearest.
+ * end_pgoff is either end of page table or end of vma
+ * or fault_around_pages() from start_pgoff, depending what is nearest.
*/
- max_pgoff = pgoff - ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ end_pgoff = start_pgoff -
+ ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
- max_pgoff = min3(max_pgoff, vma_pages(vma) + vma->vm_pgoff - 1,
- pgoff + nr_pages - 1);
+ end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
+ start_pgoff + nr_pages - 1);

/* Check if it makes any sense to call ->map_pages */
- while (!pte_none(*pte)) {
- if (++pgoff > max_pgoff)
- return;
- start_addr += PAGE_SIZE;
- if (start_addr >= vma->vm_end)
- return;
- pte++;
+ fe->address = start_addr;
+ while (!pte_none(*fe->pte)) {
+ if (++start_pgoff > end_pgoff)
+ goto out;
+ fe->address += PAGE_SIZE;
+ if (fe->address >= fe->vma->vm_end)
+ goto out;
+ fe->pte++;
}

- vmf.virtual_address = (void __user *) start_addr;
- vmf.pte = pte;
- vmf.pgoff = pgoff;
- vmf.max_pgoff = max_pgoff;
- vmf.flags = flags;
- vmf.gfp_mask = __get_fault_gfp_mask(vma);
- vma->vm_ops->map_pages(vma, &vmf);
+ fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+out:
+ /* restore fault_env */
+ fe->pte = pte;
+ fe->address = address;
}

-static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
- spinlock_t *ptl;
- pte_t *pte;
int ret = 0;

/*
@@ -2973,64 +2958,64 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* something).
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- do_fault_around(vma, address, pte, pgoff, flags);
- if (!pte_same(*pte, orig_pte))
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ do_fault_around(fe, pgoff);
+ if (!pte_same(*fe->pte, orig_pte))
goto unlock_out;
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
}

- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
unlock_page(fault_page);
put_page(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, false, false);
+ do_set_pte(fe, fault_page);
unlock_page(fault_page);
unlock_out:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return ret;
}

-static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
struct mem_cgroup *memcg;
- spinlock_t *ptl;
- pte_t *pte;
int ret;

if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;

- new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, fe->address);
if (!new_page)
return VM_FAULT_OOM;

- if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
+ if (mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
+ &memcg, false)) {
put_page(new_page);
return VM_FAULT_OOM;
}

- ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
+ ret = __do_fault(fe, pgoff, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

if (fault_page)
- copy_user_highpage(new_page, fault_page, address, vma);
+ copy_user_highpage(new_page, fault_page, fe->address, vma);
__SetPageUptodate(new_page);

- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
put_page(fault_page);
@@ -3043,10 +3028,10 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
goto uncharge_out;
}
- do_set_pte(vma, address, new_page, pte, true, true);
+ do_set_pte(fe, new_page);
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (fault_page) {
unlock_page(fault_page);
put_page(fault_page);
@@ -3064,18 +3049,15 @@ uncharge_out:
return ret;
}

-static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
struct address_space *mapping;
- spinlock_t *ptl;
- pte_t *pte;
int dirtied = 0;
int ret, tmp;

- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(fe, pgoff, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

@@ -3085,7 +3067,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (vma->vm_ops->page_mkwrite) {
unlock_page(fault_page);
- tmp = do_page_mkwrite(vma, fault_page, address);
+ tmp = do_page_mkwrite(vma, fault_page, fe->address);
if (unlikely(!tmp ||
(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
put_page(fault_page);
@@ -3093,15 +3075,16 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
}

- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
unlock_page(fault_page);
put_page(fault_page);
return ret;
}
- do_set_pte(vma, address, fault_page, pte, true, false);
- pte_unmap_unlock(pte, ptl);
+ do_set_pte(fe, fault_page);
+ pte_unmap_unlock(fe->pte, fe->ptl);

if (set_page_dirty(fault_page))
dirtied = 1;
@@ -3133,23 +3116,20 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+static int do_fault(struct fault_env *fe, pte_t orig_pte)
{
- pgoff_t pgoff = linear_page_index(vma, address);
+ struct vm_area_struct *vma = fe->vma;
+ pgoff_t pgoff = linear_page_index(vma, fe->address);

- pte_unmap(page_table);
+ pte_unmap(fe->pte);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
- if (!(flags & FAULT_FLAG_WRITE))
- return do_read_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
+ if (!(fe->flags & FAULT_FLAG_WRITE))
+ return do_read_fault(fe, pgoff, orig_pte);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
- return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+ return do_cow_fault(fe, pgoff, orig_pte);
+ return do_shared_fault(fe, pgoff, orig_pte);
}

static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3167,11 +3147,10 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
return mpol_misplaced(page, vma, addr);
}

-static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static int do_numa_page(struct fault_env *fe, pte_t pte)
{
+ struct vm_area_struct *vma = fe->vma;
struct page *page = NULL;
- spinlock_t *ptl;
int page_nid = -1;
int last_cpupid;
int target_nid;
@@ -3191,10 +3170,10 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* page table entry is not accessible, so there would be no
* concurrent hardware modifications to the PTE.
*/
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*ptep, pte))) {
- pte_unmap_unlock(ptep, ptl);
+ fe->ptl = pte_lockptr(vma->vm_mm, fe->pmd);
+ spin_lock(fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, pte))) {
+ pte_unmap_unlock(fe->pte, fe->ptl);
goto out;
}

@@ -3203,18 +3182,18 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
- set_pte_at(mm, addr, ptep, pte);
- update_mmu_cache(vma, addr, ptep);
+ set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
+ update_mmu_cache(vma, fe->address, fe->pte);

- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_page(vma, fe->address, pte);
if (!page) {
- pte_unmap_unlock(ptep, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}

/* TODO: handle PTE-mapped THP */
if (PageCompound(page)) {
- pte_unmap_unlock(ptep, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}

@@ -3238,8 +3217,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, page_nid, &flags);
- pte_unmap_unlock(ptep, ptl);
+ target_nid = numa_migrate_prep(page, vma, fe->address, page_nid,
+ &flags);
+ pte_unmap_unlock(fe->pte, fe->ptl);
if (target_nid == -1) {
put_page(page);
goto out;
@@ -3259,24 +3239,24 @@ out:
return 0;
}

-static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags)
+static int create_huge_pmd(struct fault_env *fe)
{
+ struct vm_area_struct *vma = fe->vma;
if (vma_is_anonymous(vma))
- return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
+ return do_huge_pmd_anonymous_page(fe);
if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ return vma->vm_ops->pmd_fault(vma, fe->address, fe->pmd,
+ fe->flags);
return VM_FAULT_FALLBACK;
}

-static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, pmd_t orig_pmd,
- unsigned int flags)
+static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
{
- if (vma_is_anonymous(vma))
- return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
- if (vma->vm_ops->pmd_fault)
- return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ if (vma_is_anonymous(fe->vma))
+ return do_huge_pmd_wp_page(fe, orig_pmd);
+ if (fe->vma->vm_ops->pmd_fault)
+ return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
+ fe->flags);
return VM_FAULT_FALLBACK;
}

@@ -3296,12 +3276,9 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags)
+static int handle_pte_fault(struct fault_env *fe)
{
pte_t entry;
- spinlock_t *ptl;

/*
* some architectures can have larger ptes than wordsize,
@@ -3311,37 +3288,34 @@ static int handle_pte_fault(struct mm_struct *mm,
* we later double check anyway with the ptl lock held. So here
* a barrier will do.
*/
- entry = *pte;
+ entry = *fe->pte;
barrier();
if (!pte_present(entry)) {
if (pte_none(entry)) {
- if (vma_is_anonymous(vma))
- return do_anonymous_page(mm, vma, address,
- pte, pmd, flags);
+ if (vma_is_anonymous(fe->vma))
+ return do_anonymous_page(fe);
else
- return do_fault(mm, vma, address, pte, pmd,
- flags, entry);
+ return do_fault(fe, entry);
}
- return do_swap_page(mm, vma, address,
- pte, pmd, flags, entry);
+ return do_swap_page(fe, entry);
}

if (pte_protnone(entry))
- return do_numa_page(mm, vma, address, entry, pte, pmd);
+ return do_numa_page(fe, entry);

- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*pte, entry)))
+ fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
+ spin_lock(fe->ptl);
+ if (unlikely(!pte_same(*fe->pte, entry)))
goto unlock;
- if (flags & FAULT_FLAG_WRITE) {
+ if (fe->flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
- return do_wp_page(mm, vma, address,
- pte, pmd, ptl, entry);
+ return do_wp_page(fe, entry);
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
- if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
- update_mmu_cache(vma, address, pte);
+ if (ptep_set_access_flags(fe->vma, fe->address, fe->pte, entry,
+ fe->flags & FAULT_FLAG_WRITE)) {
+ update_mmu_cache(fe->vma, fe->address, fe->pte);
} else {
/*
* This is needed only for protection faults but the arch code
@@ -3349,11 +3323,11 @@ static int handle_pte_fault(struct mm_struct *mm,
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
- if (flags & FAULT_FLAG_WRITE)
- flush_tlb_fix_spurious_fault(vma, address);
+ if (fe->flags & FAULT_FLAG_WRITE)
+ flush_tlb_fix_spurious_fault(fe->vma, fe->address);
}
unlock:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(fe->pte, fe->ptl);
return 0;
}

@@ -3366,51 +3340,42 @@ unlock:
static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
{
+ struct fault_env fe = {
+ .vma = vma,
+ .address = address,
+ .flags = flags,
+ };
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
- flags & FAULT_FLAG_INSTRUCTION,
- flags & FAULT_FLAG_REMOTE))
- return VM_FAULT_SIGSEGV;
-
- if (unlikely(is_vm_hugetlb_page(vma)))
- return hugetlb_fault(mm, vma, address, flags);

pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
if (!pud)
return VM_FAULT_OOM;
- pmd = pmd_alloc(mm, pud, address);
- if (!pmd)
+ fe.pmd = pmd_alloc(mm, pud, address);
+ if (!fe.pmd)
return VM_FAULT_OOM;
- if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
- int ret = create_huge_pmd(mm, vma, address, pmd, flags);
+ if (pmd_none(*fe.pmd) && transparent_hugepage_enabled(vma)) {
+ int ret = create_huge_pmd(&fe);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- pmd_t orig_pmd = *pmd;
+ pmd_t orig_pmd = *fe.pmd;
int ret;

barrier();
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
- unsigned int dirty = flags & FAULT_FLAG_WRITE;
-
if (pmd_protnone(orig_pmd))
- return do_huge_pmd_numa_page(mm, vma, address,
- orig_pmd, pmd);
+ return do_huge_pmd_numa_page(&fe, orig_pmd);

- if (dirty && !pmd_write(orig_pmd)) {
- ret = wp_huge_pmd(mm, vma, address, pmd,
- orig_pmd, flags);
+ if ((fe.flags & FAULT_FLAG_WRITE) &&
+ !pmd_write(orig_pmd)) {
+ ret = wp_huge_pmd(&fe, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- huge_pmd_set_accessed(mm, vma, address, pmd,
- orig_pmd, dirty);
+ huge_pmd_set_accessed(&fe, orig_pmd);
return 0;
}
}
@@ -3421,7 +3386,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* run pte_offset_map on the pmd, if an huge pmd could
* materialize from under us from a different thread.
*/
- if (unlikely(pte_alloc(mm, pmd, address)))
+ if (unlikely(pte_alloc(fe.vma->vm_mm, fe.pmd, fe.address)))
return VM_FAULT_OOM;
/*
* If a huge pmd materialized under us just retry later. Use
@@ -3434,7 +3399,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* through an atomic read in C, which is what pmd_trans_unstable()
* provides.
*/
- if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd)))
+ if (unlikely(pmd_trans_unstable(fe.pmd) || pmd_devmap(*fe.pmd)))
return 0;
/*
* A regular pmd is established and it can't morph into a huge pmd
@@ -3442,9 +3407,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
* read mode and khugepaged takes it in write mode. So now it's
* safe to run pte_offset_map().
*/
- pte = pte_offset_map(pmd, address);
+ fe.pte = pte_offset_map(fe.pmd, fe.address);

- return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+ return handle_pte_fault(&fe);
}

/*
@@ -3473,7 +3438,15 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();

- ret = __handle_mm_fault(vma, address, flags);
+ if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+ flags & FAULT_FLAG_INSTRUCTION,
+ flags & FAULT_FLAG_REMOTE))
+ return VM_FAULT_SIGSEGV;
+
+ if (unlikely(is_vm_hugetlb_page(vma)))
+ ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
+ else
+ ret = __handle_mm_fault(vma, address, flags);

if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
diff --git a/mm/nommu.c b/mm/nommu.c
index 102e257cc6c3..146c2139f506 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1811,7 +1811,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
{
BUG();
}
--
2.8.0.rc3

2016-04-18 22:55:50

by Shi, Yang

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

Hi Kirill,

Finally, I got some time to look into and try yours and Hugh's patches,
got two problems.

1. A quick boot up test on my ARM64 machine with your v7 tree shows some
unexpected error:

systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:16863: No space left on device
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:16865: No space left on device
Starting DNS forwarder and DHCP server.systemd-journald[285]:
Failed to save stream data /run/systemd/journal/streams/8:16867: No
space left on device
..
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:16869: No space left on device
Starting Postfix Mail Transport Agent...
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:16871: No space left on device
Starting Berkeley Internet Name Domain (DNS)...
Starting Wait for Network to be Configured...
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:2422: No space left on device
[ OK ] Started /etc/rc.local Compatibility.
[FAILED] Failed to start DNS forwarder and DHCP server.
See 'systemctl status dnsmasq.service' for details.
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:2425: No space left on device
[ OK ] Started Serial Getty on ttyS1.
[ OK ] Started Serial Getty on ttyS0.
[ OK ] Started Getty on tty1.
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:2433: No space left on device
[FAILED] Failed to start Berkeley Internet Name Domain (DNS).
See 'systemctl status named.service' for details.


The /run dir is mounted as tmpfs.

x86 boot doesn't get such error. And, Hugh's patches don't have such
problem.

2. I ran my THP test (generated a program with 4MB text section) on both
x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got
the program execution time reduced by ~12% on x86-64, it looks very
impressive.

But, on ARM64, there is just ~3% change, and sometimes huge tmpfs may
show even worse data than non-hugepage.

Both yours and Hugh's patches has the same behavior.

Any idea?

Thanks,
Yang


On 4/15/2016 5:23 PM, Kirill A. Shutemov wrote:
> This is probably the last update before the mm summit. Main forcus is on
> khugepaged stability.
>
> khugepaged is in more reasonable shape now. I missed quite a few corner
> cases on first try. I run this version via LTP, trinity and syzkaller
> without crashes so far.
>
> The patchset is on top of v4.6-rc3 plus Hugh's "easy preliminaries to
> THPagecache" and Ebru's khugepaged swapin patches form -mm tree.
>
> Git tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v7
>
> == Changelog ==
>
> v7:
> - khugepaged updates:
> + fix page leak/page cache corruption on collapse fail;
> + filter out VMAs not suitable for huge pages due misaligned vm_pgoff;
> + fix build without CONFIG_SHMEM;
> + drop few over-protective checks;
> - fix bogus VM_BUG_ON() in __delete_from_page_cache();
>
> v6:
> - experimental collapse support;
> - fix swapout mapped huge pages;
> - fix page leak in faularound code;
> - fix exessive huge page allocation with huge=within_size;
> - rename VM_NO_THP to VM_NO_KHUGEPAGED;
> - fix condition in hugepage_madvise();
> - accounting reworked again;
>
> v5:
> - add FileHugeMapped to /proc/PID/smaps;
> - make FileHugeMapped in meminfo aligned with other fields;
> - Documentation/vm/transhuge.txt updated;
>
> v4:
> - first four patch were applied to -mm tree;
> - drop pages beyond i_size on split_huge_pages;
> - few small random bugfixes;
>
> v3:
> - huge= mountoption now can have values always, within_size, advice and
> never;
> - sysctl handle is replaced with sysfs knob;
> - MADV_HUGEPAGE/MADV_NOHUGEPAGE is now respected on page allocation via
> page fault;
> - mlock() handling had been fixed;
> - bunch of smaller bugfixes and cleanups.
>
> == Design overview ==
>
> Huge pages are allocated by shmem when it's allowed (by mount option) and
> there's no entries for the range in radix-tree. Huge page is represented by
> HPAGE_PMD_NR entries in radix-tree.
>
> MM core maps a page with PMD if ->fault() returns huge page and the VMA is
> suitable for huge pages (size, alignment). There's no need into two
> requests to file system: filesystem returns huge page if it can,
> graceful fallback to small pages otherwise.
>
> As with DAX, split_huge_pmd() is implemented by unmapping the PMD: we can
> re-fault the page with PTEs later.
>
> Basic scheme for split_huge_page() is the same as for anon-THP.
> Few differences:
>
> - File pages are on radix-tree, so we have head->_count offset by
> HPAGE_PMD_NR. The count got distributed to small pages during split.
>
> - mapping->tree_lock prevents non-lockless access to pages under split
> over radix-tree;
>
> - Lockless access is prevented by setting the head->_count to 0 during
> split, so get_page_unless_zero() would fail;
>
> - After split, some pages can be beyond i_size. We drop them from
> radix-tree.
>
> - We don't setup migration entries. Just unmap pages. It helps
> handling cases when i_size is in the middle of the page: no need
> handle unmap pages beyond i_size manually.
>
> COW mapping handled on PTE-level. It's not clear how beneficial would be
> allocation of huge pages on COW faults. And it would require some code to
> make them work.
>
> I think at some point we can consider teaching khugepaged to collapse
> pages in COW mappings, but allocating huge on fault is probably overkill.
>
> As with anon THP, we mlock file huge page only if it mapped with PMD.
> PTE-mapped THPs are never mlocked. This way we can avoid all sorts of
> scenarios when we can leak mlocked page.
>
> As with anon THP, we split huge page on swap out.
>
> Truncate and punch hole that only cover part of THP range is implemented
> by zero out this part of THP.
>
> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> inconsistent results depending what pages happened to be allocated.
> I don't think this will be a problem.
>
> == Patchset overview ==
>
> [01/29]
> Update documentation on THP vs. mlock. I've posted it separately
> before. It can go in.
>
> [02-04/29]
> Rework fault path and rmap to handle file pmd. Unlike DAX with
> vm_ops->pmd_fault, we don't need to ask filesystem twice -- first
> for huge page and then for small. If ->fault happened to return
> huge page and VMA is suitable for mapping it as huge, we would
> do so.
> [05/29]
> Add support for huge file pages in rmap;
>
> [06-15/29]
> Various preparation of THP core for file pages.
>
> [16-20/29]
> Various preparation of MM core for file pages.
>
> [21-24/29]
> And finally, bring huge pages into tmpfs/shmem.
>
> [25/29]
> Wire up madvise() existing hints for file THP.
> We can implement fadvise() later.
>
> [26/29]
> Documentation update.
>
> [27-29/29]
> Extend khugepaged to support shmem/tmpfs.
> Hugh Dickins (1):
> shmem: get_unmapped_area align huge page
>
> Kirill A. Shutemov (28):
> thp, mlock: update unevictable-lru.txt
> mm: do not pass mm_struct into handle_mm_fault
> mm: introduce fault_env
> mm: postpone page table allocation until we have page to map
> rmap: support file thp
> mm: introduce do_set_pmd()
> thp, vmstats: add counters for huge file pages
> thp: support file pages in zap_huge_pmd()
> thp: handle file pages in split_huge_pmd()
> thp: handle file COW faults
> thp: skip file huge pmd on copy_huge_pmd()
> thp: prepare change_huge_pmd() for file thp
> thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
> thp: file pages support for split_huge_page()
> thp, mlock: do not mlock PTE-mapped file huge pages
> vmscan: split file huge pages before paging them out
> page-flags: relax policy for PG_mappedtodisk and PG_reclaim
> radix-tree: implement radix_tree_maybe_preload_order()
> filemap: prepare find and delete operations for huge pages
> truncate: handle file thp
> mm, rmap: account shmem thp pages
> shmem: prepare huge= mount option and sysfs knob
> shmem: add huge pages support
> shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
> thp: update Documentation/vm/transhuge.txt
> thp: extract khugepaged from mm/huge_memory.c
> khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
> khugepaged: add support of collapse for tmpfs/shmem pages
>
> Documentation/filesystems/Locking | 10 +-
> Documentation/vm/transhuge.txt | 130 ++-
> Documentation/vm/unevictable-lru.txt | 21 +
> arch/alpha/mm/fault.c | 2 +-
> arch/arc/mm/fault.c | 2 +-
> arch/arm/mm/fault.c | 2 +-
> arch/arm64/mm/fault.c | 2 +-
> arch/avr32/mm/fault.c | 2 +-
> arch/cris/mm/fault.c | 2 +-
> arch/frv/mm/fault.c | 2 +-
> arch/hexagon/mm/vm_fault.c | 2 +-
> arch/ia64/mm/fault.c | 2 +-
> arch/m32r/mm/fault.c | 2 +-
> arch/m68k/mm/fault.c | 2 +-
> arch/metag/mm/fault.c | 2 +-
> arch/microblaze/mm/fault.c | 2 +-
> arch/mips/mm/fault.c | 2 +-
> arch/mn10300/mm/fault.c | 2 +-
> arch/nios2/mm/fault.c | 2 +-
> arch/openrisc/mm/fault.c | 2 +-
> arch/parisc/mm/fault.c | 2 +-
> arch/powerpc/mm/copro_fault.c | 2 +-
> arch/powerpc/mm/fault.c | 2 +-
> arch/s390/mm/fault.c | 2 +-
> arch/score/mm/fault.c | 2 +-
> arch/sh/mm/fault.c | 2 +-
> arch/sparc/mm/fault_32.c | 4 +-
> arch/sparc/mm/fault_64.c | 2 +-
> arch/tile/mm/fault.c | 2 +-
> arch/um/kernel/trap.c | 2 +-
> arch/unicore32/mm/fault.c | 2 +-
> arch/x86/mm/fault.c | 2 +-
> arch/xtensa/mm/fault.c | 2 +-
> drivers/base/node.c | 13 +-
> drivers/char/mem.c | 24 +
> drivers/iommu/amd_iommu_v2.c | 3 +-
> drivers/iommu/intel-svm.c | 2 +-
> fs/proc/meminfo.c | 7 +-
> fs/proc/task_mmu.c | 10 +-
> fs/userfaultfd.c | 22 +-
> include/linux/huge_mm.h | 36 +-
> include/linux/khugepaged.h | 6 +
> include/linux/mm.h | 51 +-
> include/linux/mmzone.h | 4 +-
> include/linux/page-flags.h | 19 +-
> include/linux/radix-tree.h | 1 +
> include/linux/rmap.h | 2 +-
> include/linux/shmem_fs.h | 29 +-
> include/linux/userfaultfd_k.h | 8 +-
> include/linux/vm_event_item.h | 7 +
> include/trace/events/huge_memory.h | 3 +-
> ipc/shm.c | 6 +-
> lib/radix-tree.c | 68 +-
> mm/Makefile | 2 +-
> mm/filemap.c | 226 ++--
> mm/gup.c | 7 +-
> mm/huge_memory.c | 2028 ++++++----------------------------
> mm/internal.h | 4 +-
> mm/khugepaged.c | 1772 +++++++++++++++++++++++++++++
> mm/ksm.c | 5 +-
> mm/memory.c | 859 +++++++-------
> mm/mempolicy.c | 4 +-
> mm/migrate.c | 5 +-
> mm/mmap.c | 26 +-
> mm/nommu.c | 3 +-
> mm/page-writeback.c | 1 +
> mm/page_alloc.c | 21 +
> mm/rmap.c | 78 +-
> mm/shmem.c | 689 ++++++++++--
> mm/swap.c | 2 +
> mm/truncate.c | 22 +-
> mm/util.c | 6 +
> mm/vmscan.c | 6 +
> mm/vmstat.c | 4 +
> 74 files changed, 3919 insertions(+), 2395 deletions(-)
> create mode 100644 mm/khugepaged.c
>

2016-04-19 14:33:28

by Jerome Marchand

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

On 04/19/2016 12:55 AM, Shi, Yang wrote:
> 2. I ran my THP test (generated a program with 4MB text section) on both
> x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got
> the program execution time reduced by ~12% on x86-64, it looks very
> impressive.
>
> But, on ARM64, there is just ~3% change, and sometimes huge tmpfs may
> show even worse data than non-hugepage.
>
> Both yours and Hugh's patches has the same behavior.
>
> Any idea?

Just a shot in the dark, but what page size do you use? If you use 4k
pages, then hugepage size should be the same as on x86 and a similar
behavior could be expected. Otherwise, hugepages would be too big to be
taken advantage of by your test program.

Jerome


Attachments:
signature.asc (473.00 B)
OpenPGP digital signature

2016-04-19 16:11:59

by Shi, Yang

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

On 4/19/2016 7:33 AM, Jerome Marchand wrote:
> On 04/19/2016 12:55 AM, Shi, Yang wrote:
>> 2. I ran my THP test (generated a program with 4MB text section) on both
>> x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got
>> the program execution time reduced by ~12% on x86-64, it looks very
>> impressive.
>>
>> But, on ARM64, there is just ~3% change, and sometimes huge tmpfs may
>> show even worse data than non-hugepage.
>>
>> Both yours and Hugh's patches has the same behavior.
>>
>> Any idea?
>
> Just a shot in the dark, but what page size do you use? If you use 4k
> pages, then hugepage size should be the same as on x86 and a similar

I do use 4K pages for both x86-64 and ARM64 in my testing.

Thanks,
Yang

> behavior could be expected. Otherwise, hugepages would be too big to be
> taken advantage of by your test program.
>
> Jerome
>

2016-04-19 16:50:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

Hello,

On Mon, Apr 18, 2016 at 03:55:44PM -0700, Shi, Yang wrote:
> Hi Kirill,
>
> Finally, I got some time to look into and try yours and Hugh's patches,
> got two problems.

One thing that come to mind to test is this: qemu with -machine
accel=kvm -mem-path=/dev/shm/,share=on .

The THP Compound approach in tmpfs may just happen to work already
with KVM (or at worst it'd require minor adjustments) because it uses
the exact same model KVM is already aware about from THP in anonymous
memory, example from arch/x86/kvm/mmu.c:

static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
gfn_t *gfnp, kvm_pfn_t *pfnp,
int *levelp)
{
kvm_pfn_t pfn = *pfnp;
gfn_t gfn = *gfnp;
int level = *levelp;

/*
* Check if it's a transparent hugepage. If this would be an
* hugetlbfs page, level wouldn't be set to
* PT_PAGE_TABLE_LEVEL and there would be no adjustment done
* here.
*/
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
level == PT_PAGE_TABLE_LEVEL &&
PageTransCompound(pfn_to_page(pfn)) &&
!mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {

Not using two different models between THP in tmpfs and THP in anon is
essential not just to significantly reduce the size of the kernel
code, but also because THP knowledge can't be self contained in the
mm/shmem.c file. Having to support two different models would
complicate things for secondary MMU drivers (i.e. mmu notifer users)
like KVM who also need to create huge mapping in the shadow pagetable
layer in arch/x86/kvm if the primary MMU allows for it.

> x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got
> the program execution time reduced by ~12% on x86-64, it looks very
> impressive.

Agreed, both patchset are impressive works and achieving amazing
results!

My view is that in terms of long-lived computation from userland point
of view, both models are malleable enough and could achieve everything
we need in the end, but as far as the overall kernel efficiency is
concerned the compound model will always retain a slight advantage in
performance by leveraging a native THP compound refcounting that
requires just one atomic_inc/dec per THP mapcount instead of 512 of
them. Other advantages of the compound model is that it's half in code
size despite already including khugepaged (i.e. the same
split_huge_page works for both tmpfs and anon) and like said above it
won't introduce much complications for drivers like KVM as the model
didn't change.

Thanks,
Andrea

2016-04-19 17:07:33

by Andres Lagar-Cavilla

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

Andrea, we provide the, ahem, adjustments to
transparent_hugepage_adjust. Rest assured we aggressively use mmu
notifiers with no further changes required.

As in: zero changes have been required in the lifetime (years) of
kvm+huge tmpfs at Google, other than mod'ing
transparent_hugepage_adjust.

As noted by Paolo, the additions to transparent_hugepage_adjust could
be lifted outside of kvm (into shmem.c? maybe) for any consumer of
huge tmpfs with mmu notifiers.

Andres

On Tue, Apr 19, 2016 at 9:50 AM, Andrea Arcangeli <[email protected]> wrote:
> Hello,
>
> On Mon, Apr 18, 2016 at 03:55:44PM -0700, Shi, Yang wrote:
>> Hi Kirill,
>>
>> Finally, I got some time to look into and try yours and Hugh's patches,
>> got two problems.
>
> One thing that come to mind to test is this: qemu with -machine
> accel=kvm -mem-path=/dev/shm/,share=on .
>
> The THP Compound approach in tmpfs may just happen to work already
> with KVM (or at worst it'd require minor adjustments) because it uses
> the exact same model KVM is already aware about from THP in anonymous
> memory, example from arch/x86/kvm/mmu.c:
>
> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> gfn_t *gfnp, kvm_pfn_t *pfnp,
> int *levelp)
> {
> kvm_pfn_t pfn = *pfnp;
> gfn_t gfn = *gfnp;
> int level = *levelp;
>
> /*
> * Check if it's a transparent hugepage. If this would be an
> * hugetlbfs page, level wouldn't be set to
> * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
> * here.
> */
> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> level == PT_PAGE_TABLE_LEVEL &&
> PageTransCompound(pfn_to_page(pfn)) &&
> !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>
> Not using two different models between THP in tmpfs and THP in anon is
> essential not just to significantly reduce the size of the kernel
> code, but also because THP knowledge can't be self contained in the
> mm/shmem.c file. Having to support two different models would
> complicate things for secondary MMU drivers (i.e. mmu notifer users)
> like KVM who also need to create huge mapping in the shadow pagetable
> layer in arch/x86/kvm if the primary MMU allows for it.
>
>> x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got
>> the program execution time reduced by ~12% on x86-64, it looks very
>> impressive.
>
> Agreed, both patchset are impressive works and achieving amazing
> results!
>
> My view is that in terms of long-lived computation from userland point
> of view, both models are malleable enough and could achieve everything
> we need in the end, but as far as the overall kernel efficiency is
> concerned the compound model will always retain a slight advantage in
> performance by leveraging a native THP compound refcounting that
> requires just one atomic_inc/dec per THP mapcount instead of 512 of
> them. Other advantages of the compound model is that it's half in code
> size despite already including khugepaged (i.e. the same
> split_huge_page works for both tmpfs and anon) and like said above it
> won't introduce much complications for drivers like KVM as the model
> didn't change.
>
> Thanks,
> Andrea



--
Andres Lagar-Cavilla | Google Kernel Team | [email protected]

2016-04-19 23:48:51

by Shi, Yang

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

On 4/19/2016 9:50 AM, Andrea Arcangeli wrote:
> Hello,
>
> On Mon, Apr 18, 2016 at 03:55:44PM -0700, Shi, Yang wrote:
>> Hi Kirill,
>>
>> Finally, I got some time to look into and try yours and Hugh's patches,
>> got two problems.
>
> One thing that come to mind to test is this: qemu with -machine
> accel=kvm -mem-path=/dev/shm/,share=on .

Thanks for the suggestion, I will definitely have a try with KVM.

It would be better if Kirill and Hugh could share what benchmark they
ran and how much they got improved since my test case is very simple and
may just cover a small part of it.

Yang

>
> The THP Compound approach in tmpfs may just happen to work already
> with KVM (or at worst it'd require minor adjustments) because it uses
> the exact same model KVM is already aware about from THP in anonymous
> memory, example from arch/x86/kvm/mmu.c:
>
> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> gfn_t *gfnp, kvm_pfn_t *pfnp,
> int *levelp)
> {
> kvm_pfn_t pfn = *pfnp;
> gfn_t gfn = *gfnp;
> int level = *levelp;
>
> /*
> * Check if it's a transparent hugepage. If this would be an
> * hugetlbfs page, level wouldn't be set to
> * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
> * here.
> */
> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> level == PT_PAGE_TABLE_LEVEL &&
> PageTransCompound(pfn_to_page(pfn)) &&
> !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>
> Not using two different models between THP in tmpfs and THP in anon is
> essential not just to significantly reduce the size of the kernel
> code, but also because THP knowledge can't be self contained in the
> mm/shmem.c file. Having to support two different models would
> complicate things for secondary MMU drivers (i.e. mmu notifer users)
> like KVM who also need to create huge mapping in the shadow pagetable
> layer in arch/x86/kvm if the primary MMU allows for it.
>
>> x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got
>> the program execution time reduced by ~12% on x86-64, it looks very
>> impressive.
>
> Agreed, both patchset are impressive works and achieving amazing
> results!
>
> My view is that in terms of long-lived computation from userland point
> of view, both models are malleable enough and could achieve everything
> we need in the end, but as far as the overall kernel efficiency is
> concerned the compound model will always retain a slight advantage in
> performance by leveraging a native THP compound refcounting that
> requires just one atomic_inc/dec per THP mapcount instead of 512 of
> them. Other advantages of the compound model is that it's half in code
> size despite already including khugepaged (i.e. the same
> split_huge_page works for both tmpfs and anon) and like said above it
> won't introduce much complications for drivers like KVM as the model
> didn't change.
>
> Thanks,
> Andrea
>

2016-04-20 08:31:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

On Mon, 18 Apr 2016, Shi, Yang wrote:

> Hi Kirill,
>
> Finally, I got some time to look into and try yours and Hugh's patches, got

Thank you.

> two problems.
>
> 1. A quick boot up test on my ARM64 machine with your v7 tree shows some
> unexpected error:
>
> systemd-journald[285]: Failed to save stream data
> /run/systemd/journal/streams/8:16863: No space left on device
> systemd-journald[285]: Failed to save stream data
> /run/systemd/journal/streams/8:16865: No space left on device
> Starting DNS forwarder and DHCP server.systemd-journald[285]: Failed
> to save stream data /run/systemd/journal/streams/8:16867: No space left on
> device
> ..
> systemd-journald[285]: Failed to save stream data
> /run/systemd/journal/streams/8:16869: No space left on device
> Starting Postfix Mail Transport Agent...
> systemd-journald[285]: Failed to save stream data
> /run/systemd/journal/streams/8:16871: No space left on device
> Starting Berkeley Internet Name Domain (DNS)...
> Starting Wait for Network to be Configured...
> systemd-journald[285]: Failed to save stream data
> /run/systemd/journal/streams/8:2422: No space left on device
> [ OK ] Started /etc/rc.local Compatibility.
> [FAILED] Failed to start DNS forwarder and DHCP server.
> See 'systemctl status dnsmasq.service' for details.
> systemd-journald[285]: Failed to save stream data
> /run/systemd/journal/streams/8:2425: No space left on device
> [ OK ] Started Serial Getty on ttyS1.
> [ OK ] Started Serial Getty on ttyS0.
> [ OK ] Started Getty on tty1.
> systemd-journald[285]: Failed to save stream data
> /run/systemd/journal/streams/8:2433: No space left on device
> [FAILED] Failed to start Berkeley Internet Name Domain (DNS).
> See 'systemctl status named.service' for details.

Expected behaviour: that is a significant limitation of Kirill's current
implementation. We have agreed at LSF/MM that he will fix that before
his patchset goes further. (And different changes needed in my patchset.)

>
>
> The /run dir is mounted as tmpfs.
>
> x86 boot doesn't get such error. And, Hugh's patches don't have such problem.
>
> 2. I ran my THP test (generated a program with 4MB text section) on both
> x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got the
> program execution time reduced by ~12% on x86-64, it looks very impressive.

12% sounds about right for x86. Some loads have been seen to benefit 17%.

>
> But, on ARM64, there is just ~3% change, and sometimes huge tmpfs may show
> even worse data than non-hugepage.
>
> Both yours and Hugh's patches has the same behavior.
>
> Any idea?

... and in a later posting..,

>
> It would be better if Kirill and Hugh could share what benchmark they ran and
> how much they got improved since my test case is very simple and may just
> cover a small part of it.

Sorry, I've not run any benchmark myself (prefer to let others get more
objective results), nor run on arm64. I have no idea what to expect on
arm64 - you need to ask the arm64 guys what hugepage advantage they see
with anon THP or hugetlbfs (and probably need to tell them what machine
you're running on): then expect a similar advantage from either Kirill's
or my huge tmpfs patchset.

Hugh

2016-04-24 05:46:50

by Wincy Van

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

On Wed, Apr 20, 2016 at 1:07 AM, Andres Lagar-Cavilla
<[email protected]> wrote:
> Andrea, we provide the, ahem, adjustments to
> transparent_hugepage_adjust. Rest assured we aggressively use mmu
> notifiers with no further changes required.
>
> As in: zero changes have been required in the lifetime (years) of
> kvm+huge tmpfs at Google, other than mod'ing
> transparent_hugepage_adjust.

We are using kvm + tmpfs to do qemu live upgrading, how does google
use this memory model ?
I think our pupose to use tmpfs may be the same.

And huge tmpfs is a really good improvement for that.

>
> As noted by Paolo, the additions to transparent_hugepage_adjust could
> be lifted outside of kvm (into shmem.c? maybe) for any consumer of
> huge tmpfs with mmu notifiers.
>

Thanks,
Wincy

2016-04-25 13:30:05

by Andres Lagar-Cavilla

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

On Sat, Apr 23, 2016 at 10:46 PM, Wincy Van <[email protected]> wrote:
> On Wed, Apr 20, 2016 at 1:07 AM, Andres Lagar-Cavilla
> <[email protected]> wrote:
>> Andrea, we provide the, ahem, adjustments to
>> transparent_hugepage_adjust. Rest assured we aggressively use mmu
>> notifiers with no further changes required.
>>
>> As in: zero changes have been required in the lifetime (years) of
>> kvm+huge tmpfs at Google, other than mod'ing
>> transparent_hugepage_adjust.
>
> We are using kvm + tmpfs to do qemu live upgrading, how does google
> use this memory model ?
> I think our pupose to use tmpfs may be the same.

Nothing our of the ordinary. Guest memory is an mmap of a tmpfs fd.
Huge tmpfs gives us naturally a great guest performance boost.
MAP_SHARED, and having guest memory persist any one given process, are
what drives us to use tmpfs.

Andres
>
> And huge tmpfs is a really good improvement for that.
>
>>
>> As noted by Paolo, the additions to transparent_hugepage_adjust could
>> be lifted outside of kvm (into shmem.c? maybe) for any consumer of
>> huge tmpfs with mmu notifiers.
>>
>
> Thanks,
> Wincy



--
Andres Lagar-Cavilla | Google Kernel Team | [email protected]

2016-04-26 14:03:25

by Wincy Van

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

On Mon, Apr 25, 2016 at 9:30 PM, Andres Lagar-Cavilla
<[email protected]> wrote:
>>
>> We are using kvm + tmpfs to do qemu live upgrading, how does google
>> use this memory model ?
>> I think our pupose to use tmpfs may be the same.
>
> Nothing our of the ordinary. Guest memory is an mmap of a tmpfs fd.
> Huge tmpfs gives us naturally a great guest performance boost.
> MAP_SHARED, and having guest memory persist any one given process, are
> what drives us to use tmpfs.
>

OK. We are also using mmap.

Besides google's kvm userspace(as I know it is not qemu), google have another
userspace tool need to access guest memory, so that google use tmpfs?

If so, what function does that another userspace do?

Thanks,
Wincy

2016-04-27 15:49:03

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

Hello Andres,

On Tue, Apr 19, 2016 at 10:07:29AM -0700, Andres Lagar-Cavilla wrote:
> Andrea, we provide the, ahem, adjustments to
> transparent_hugepage_adjust. Rest assured we aggressively use mmu
> notifiers with no further changes required.

Did you notice I just fixed a THP related bug in the very function I
quoted that broke with the THP refcounting in v4.5? That very function
had a major bug after the THP refcounting that was corrupting memory
even with regular anonymous memory THP backing.

https://marc.info/?l=linux-mm&m=146175869123580&w=2

> As in: zero changes have been required in the lifetime (years) of
> kvm+huge tmpfs at Google, other than mod'ing
> transparent_hugepage_adjust.

Zero changes required until the THP model gets improved over time to
accomodate for huge-DAX, tmpfs, ext4 and everything else. THP
refcounting in v4.5 also didn't change this function and that thing
broke off silently.

When I found the bug I realized I just quoted the very buggy function
earlier in this thread, as an example of why we don't want more
complexity in the kernel... and that just reinforced my not wanting
more complexity and wanting just 1 single model for all THP in the
kernel, hence this email.

> As noted by Paolo, the additions to transparent_hugepage_adjust could
> be lifted outside of kvm (into shmem.c? maybe) for any consumer of
> huge tmpfs with mmu notifiers.

That function is not duplicated across the kernel, moving it to common
code can be helpful if others have the same needs to save some .text
but where such function goes and if it's duplicated or not, changes
nothing in terms of overall kernel complexity to maintain with two
completely different THP models in different memory management parts.

Dismissing the complexity of supporting and maintaining two completely
different models of transparent hugepages that provides different
kernel APIs to deal with them for secondary MMU drivers as a
triviality, is proven wrong by what just happened to this function in
v4.5 I think.

The model for THP should be just one, either PageTeam and disaband
works for all THP including ext4 and anonymous memory, or
PageTransCompoundMap and the same split_huge_pmd/split_huge_page
functions already works for tmpfs THP and anon THP exactly in the same
way.

We already have to support a slight different model for hugetlbfs, and
thankfully it's not really different, it's just a "subset" of the THP
model, and it's simpler, so it's not complicating anything (nor
get_user_pages, nor KVM, nor the transparent_hugepage_adjust
function).

In fact the PageTransCompoundMap already works for both hugetlbfs and
THP transparently, the page->_mapcount < 1 check is full bypass for
hugetlbfs exactly because the model is actually the same as THP but a
"subset". hugetlbfs uses compound pages too of course to be much
faster than it ever would with the Team page model.

We don't want another model that just increases the complexity of the
kernel for no good and this was agreed at the MM summit too.

In fact I'd go as far as saying Team Pages must work for hugetlbfs too
and not only anonymous memory, for them to be considered as an
attractive option for tmpfs.

You should drop compound pages from hugetlbfs too, if you intend to
pursue the Team pages direction for the upstream kernel.

Kirill great work for v4.5 in addition of simplifying
get_page/put_page (with a mico-performance improvement for
get_user_pages_fast for tail pages) was a dependency in order to allow
compound page to enter the tmpfs and ext4 land. Clearly it introduced
complexity elsewhere (i.e. split_huge_page now can fail) but the model
is now more generic and powerful in allowing both pmd_trans_huge and
ptes to map compound pages natively, which is needed for tmpfs.

Now that such work is done and upstream I don't see why we want to go
in a direction that isn't justified anymore. Team pages made sense to
reduce the time to market in not having to do the THP refcounting work
Kirill just did to achieve compound THP in tmpfs. They're a fine
not-upstream patchset and they would be suitable to ship in a
distribution kernel or in your proprietary
behind-the-firewall-source-not-released usage. For upstream we should
focus on the long term design not on short term production matters.

Furthermore even for production Team pages also still miss khugepaged,
so there's no point to keep going in that direction when the patchset
is double the size of compound THP pages in tmpfs, and the compound
THP patchset already inlcudes khugepaged in half the size.

I already mentioned why I think Team Pages can't work nearly as
efficiently as compound pages for Anonymous memory in the previous
email and the same issue applies to hugetlbfs too.

Furthermore even for small files the current team pages model of
allocating a contiguous hugepage and then mapping only 4k of the 2M
contiguous chunk of ram allocated is counter productive and is fine
for qemu production usage on tmpfs but not ok for generic production
usage. Team pages as currently implemented will trigger memory
pressure 512 times faster than Kirill's tmpfs version if dealing with
4k files only, running specfs or something. After memory pressure
triggers, the not mapped part of the team page is freed right away so
all work done at the allocation stage is just triggering memory
pressure 512 times faster and then the VM has to do even more useless
work to undo the initial contiguous allocation.

Kirill's compound THP in tmpfs by default is a full bypass for <2MB
i_size, so it'll perform exactly the same for small files and it won't
risk to trigger memory pressure nor require undoing the work done to
allocate the hugepages in the first place. This is fundamental for
XGD_RUNTIME_DIR even on the desktop, not just specfs. If the file
grows over time and the allocation is long lived, Kirill's khugepaged
will collapse THP compound pages asynchronously.

If you don't believe that allocating 2MB for small files and then
freeing the memory when eventually memory pressure trigger 512 times
faster, and missing khugepaged are a showstopper, just check the
discussion on linux-mm where they're proposing to disable direct
compaction and relay only on khugepaged and kcompactd for THP in
anonymous memory, because direct compaction is hurting short lived
allocations on large systems that may require lots of defrag to get
the hugepage (it's not THP itself the problem, THP native compound
faults are a speedup for short lived allocation too and they only get
allocated if the vma is large enough, and in Kirill's THP in tmpfs
version, when the i_size is large enough).

Again, if you only focus on qemu and long lived allocation, both
works great and are amazing work.

However for the long term design we need a single THP design, and
hugetlbfs has to be a subset of it. The design used for THP in
anonymous memory is the one that provides the lowest probability that
no matter the load (short lived, long lived, anything) the risk that
THP is a slowdown is the minimum possible and this shall not
change. Furthermore the compound design is a tremendous speedup also
for short allocations as we don't fault it 4k at time like team pages
would do if the i_size is truncated right away at >=2MB (and the vma
is large enough and properly file-offset hugepage aligned).

For long lived allocations and qemu usage both will work the same, and
you can't notice all the downsides of team pages if you only focus on
THP craving workloads like KVM.

Thanks,
Andrea